Biological data generation has been growing exponentially in recent decades, and this has only accelerated with advances in next-generation sequencing technologies and other high-content “omics” platforms. Data generation, it would seem, is not the hard part—making sense of all the data is. As a way of coping, we humans have devised databases to keep everything structured and more easily accessible. And once you set out to create a database, that’s when the real fun starts!
How do you organize and standardize data derived from multiple sources? How do you know that all these data are reliable and of sufficient quality? And how will the end-user navigate and use this information? The short answer to these questions is biocuration and databases.
If you kept the wealth of biological data in a basement, then biocurators would be the professional organizers who came in and prevented it from becoming hopelessly cluttered!
In this article, you’ll discover more about biocuration, why it’s so important, and what it takes to become a biocurator. It turns out that biocuration as a career choice is easier than you might think!
What is Biocuration?
Biocuration can be defined as the process of organizing and standardizing data, annotating them with an accepted ontology, and attributing them to their source. You can think of biocuration as a set of practices that promote good data stewardship and adhere to the FAIR principles, whereby data are structured so that they are
- Findable,
- Accessible,
- Interoperable, and
- Reusable.
Why is Biocuration Important?
It goes without saying that science is collaborative. With large datasets being generated by multiple researchers worldwide, it becomes challenging to compare data and establish equivalence. In other words, how can you make diverse large datasets combinable? Researchers who use public datasets created by others must be able to combine them to generate scientifically valid observations and conclusions. Making datasets combinable is the value that biocuration creates.
Without biocuration, it would be almost impossible to utilize “omics”-scale information effectively.
And there is lots of interest in biocuration. So much so that a professional society known as the International Society for Biocuration (ISB) was established to promote data stewardship of scientific data. In addition, the European life sciences infrastructure for biological information (ELIXIR) was established as sort of a clearinghouse for European bioinformatics resources that are important for biocuration.
Extracting information from the literature and tagging it with controlled vocabularies and ontologies takes a lot of time and effort, and curated information derived from literature becomes the gold standard for the data. 
Imagine this scenario: you are looking at the proteome of your novel cell line. Your publication will probably describe many different proteins and classes of proteins. Tagging each enzyme with its Enzyme Commission (EC) number would connect your paper to other databases. But those tags must conform to a shared standard and be universally recognizable.
That is where gold-standard and common ontologies come in—and someone has to curate all of them. Yes, you guessed it—biocurators. As such, the importance of biocuration will persist for some time yet.
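To make the idea of conforming tags a little more concrete, here is a minimal Python sketch that checks whether a tag has the basic shape of an EC number before it goes into a manuscript or a database record. The function name `is_valid_ec` is made up for illustration. The check assumes four dot-separated fields, a top-level class from 1 to 7, and `-` for trailing fields that have not been assigned. It does not confirm that the number actually exists in any enzyme database.

```python
import re

# One or more ASCII digits; used to test each field of the tag.
_DIGITS = re.compile(r"[0-9]+")


def is_valid_ec(tag: str) -> bool:
    """Return True if `tag` has the basic EC number shape, e.g. 1.1.1.1 or 3.4.-.-"""
    fields = tag.strip().split(".")
    if len(fields) != 4:
        return False

    # Assumption: top-level enzyme classes run from 1 to 7.
    first = fields[0]
    if not _DIGITS.fullmatch(first) or not 1 <= int(first) <= 7:
        return False

    # Once a field is left unassigned ("-"), every later field must be "-" too.
    seen_dash = False
    for field in fields[1:]:
        if field == "-":
            seen_dash = True
        elif _DIGITS.fullmatch(field) and not seen_dash:
            continue
        else:
            return False
    return True


if __name__ == "__main__":
    for candidate in ["1.1.1.1", "3.4.-.-", "1.-.1.1", "EC 1.1.1.1"]:
        print(candidate, is_valid_ec(candidate))
```

A format check like this only catches malformed tags. Deciding whether a well-formed tag is the right one for a given protein is still the curator's job.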
Types of Biocuration
Databases span every facet of biology, from nucleic acids and proteins to metabolic pathways. You name it, there is a database for it. Don’t believe me? Try a web search for any research topic that comes to mind. You can bet money you will find a database to house the ever-expanding data generated in almost any field. And every single one of those databases needs curation. Even PubMed needs curation!
Biocurators spend a lot of time creating shared vocabularies for the entries within a given database. These are known as ontologies—and biocurators utilize biological ontologies a lot. The Open Biological and Biomedical Ontology (OBO) Foundry is an organization that sets community-developed ontologies (i.e., vocabulary) that are scientifically accurate and logical.
The Gene Ontology Consortium is another organization that establishes a computational model of how genes function in cells. There are two parts here:
- the Gene Ontology (i.e., the vocabulary); and
- the annotations, which are derived from evidence in the literature.
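As a rough illustration of how those two parts fit together, an annotation can be thought of as a small record that links a gene product to an ontology term, along with the type of evidence and the paper it came from. The sketch below is a simplified assumption, not the consortium's actual file format, which carries more fields. The class name `Annotation`, the gene name, and the reference are placeholders, and only a subset of evidence codes is listed.

```python
import re
from dataclasses import dataclass

# GO identifiers take the form "GO:" followed by seven digits.
GO_ID = re.compile(r"GO:[0-9]{7}")

# A subset of GO evidence codes, enough for illustration.
EVIDENCE_CODES = {
    "EXP", "IDA", "IPI", "IMP", "IGI", "IEP",
    "ISS", "TAS", "NAS", "IC", "ND", "IEA",
}


@dataclass(frozen=True)
class Annotation:
    """Hypothetical, simplified record tying a gene product to a GO term."""

    gene_product: str   # what is being annotated
    go_id: str          # the vocabulary term (the ontology part)
    evidence_code: str  # how the claim is supported
    reference: str      # where the evidence was published

    def __post_init__(self) -> None:
        # Reject records whose term or evidence code is not in the expected form.
        if not GO_ID.fullmatch(self.go_id):
            raise ValueError(f"Malformed GO identifier: {self.go_id!r}")
        if self.evidence_code not in EVIDENCE_CODES:
            raise ValueError(f"Unknown evidence code: {self.evidence_code!r}")


# GO:0006915 is "apoptotic process"; the gene and PMID are placeholders.
example = Annotation(
    gene_product="GENE1",
    go_id="GO:0006915",
    evidence_code="IDA",
    reference="PMID:0000000",
)
```

Keeping the term, the evidence, and the source together in one record is what lets a reader trace any annotation back to the literature.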
Automation of the curation process is critical. Given the sheer volume of information being generated for entry into biological databases, there is no hope of curating these manually! Automation is where bioinformatics and coding skills become important for the biocurator.
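Here is a small Python sketch of what one automated curation step might look like: mapping free-text labels to controlled vocabulary terms and setting aside anything unrecognized for a human curator. Everything in it is assumed for illustration, including the `SYNONYMS` table, the `EX:` term identifiers, and the function names. A real pipeline would load its vocabulary from a published ontology file.

```python
from typing import Dict, List, Tuple

# Hypothetical synonym table: free-text label -> made-up term identifier.
SYNONYMS: Dict[str, str] = {
    "apoptosis": "EX:0001",
    "apoptotic process": "EX:0001",
    "translation": "EX:0002",
    "protein biosynthesis": "EX:0002",
}


def normalize(label: str) -> str:
    """Lowercase the label and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def curate(labels: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Map labels to term IDs; return the mapping and the labels needing review."""
    mapped: Dict[str, str] = {}
    needs_review: List[str] = []
    for label in labels:
        term_id = SYNONYMS.get(normalize(label))
        if term_id is None:
            # No match: leave it for a human curator rather than guessing.
            needs_review.append(label)
        else:
            mapped[label] = term_id
    return mapped, needs_review


if __name__ == "__main__":
    mapped, needs_review = curate(["Apoptosis", "Protein  biosynthesis", "cell shrinkage"])
    print(mapped)
    print(needs_review)
```

The design choice worth noting is the review list: the script handles the routine matches and hands the ambiguous cases back to a person.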
Examples of Biocurated Databases
Table 1 provides a handy list of some of the common resources. You can find many more in the annual database issue of Nucleic Acids Research, which typically comes out at the beginning of January each year. The journal Database publishes articles on new databases and curation tools too.
Table 1. Freely available bioinformatics resources
| Resource | Description |
|---|---|
| | Curated database of protein, gene, and chemical interactions |
| | Registry of bioinformatics tools and biomedical databases hosted by ELIXIR |
| | Database of drugs, targets, pathways, and pharmacogenomics. Free to search, with paid premium options available |
| | Curated database for G protein-coupled receptors, including sequence alignments, structure, and receptor mutations. Functionality to generate phylogenetic trees |
| HUGO Gene Nomenclature Committee database | Curated repository of HUGO-approved gene names. Includes protein-coding genes, RNA-coding genes, pseudogenes, and other genomic features. Ensures that each gene has only one name and symbol |
| National Center for Biotechnology Information (NCBI) | Collection of databases and resources for biomedical research. Includes GenBank, PubMed, and BLAST |
| Online Mendelian Inheritance in Man | Compendium of human genes and genetic phenotypes. Contains known Mendelian disorders for 16,000 genes |
| | Database of post-translational modifications (phosphorylation, acetylation, etc.). Searchable by protein and by substrate (i.e., site) |
| | Database of protein domains, families, and functional sites |
| | Protein sequences and functional information |
What Makes a Good Biocurator?
You may be wondering at this point what kind of people are needed to maintain all these different biological databases. The short answer is anyone. And if the thought of organizing and cataloging all the items in your closet thrills you, then you might be the right kind of fit!
You don’t have to become a professional biocurator either. Most biocurators start out as bench scientists who have moved into biocuration after developing subject matter expertise in a given field of study. Others are researchers who had to learn how to curate the data they were submitting to a database and eventually found themselves doing biocuration full time. 
Regardless of how you get into biocuration, there are a few characteristics that successful biocurators have in common:
- Organized. Organization and structure are at the heart of biocuration. If you enjoy bringing order to chaos, you might be cut out to be a biocurator!
- Collaborative. Biological databases are made freely available to whoever wants to use them, and data structures and tools are often open source. Indeed, the purpose of biological databases is to foster scientific discovery and collaboration between researchers worldwide.
- Interested in and skilled with bioinformatics tools. Aside from being able to use tools with a web front end, are you comfortable using scripts from the command line? While this might be intimidating at first, you will have no trouble developing this skill if you are interested in bioinformatics.
Biocuration as a Career Choice: How Can I Become a Biocurator?
Now that you know what makes a good biocurator, are you ready to dive in?
If so, then your first stop should be the ISB’s website. There you can find job postings, generic job descriptions, and links to training materials. Browsing these will give you an idea of what skills and experience biocuration employers are looking for.
You can also attend workshops and conferences to meet others who have become biocurators and network with potential employers. Many of these are also advertised on the ISB’s website.
Consider your own research, too—you may already be a biocurator! If you routinely submit nucleotide or protein sequences to a database, you might already be curating your own data prior to submission. You can expand on that, or as listed above, there are numerous databases cataloging all types of biological information.
For instance, let’s say you are doing ground-breaking research in metabolomics using a novel cell line. You wouldn’t keep that all to yourself, would you? Of course not! You would upload your mass spectrometry and NMR data to the Metabolomics Database. And you would probably use that database regularly. So, in addition to sharing your ground-breaking research, you are also running your side hustle as a part-time biocurator.
And perhaps the easiest way to become a biocurator is to start supporting voluntary curation efforts. The ISB posts numerous volunteer opportunities. For instance, the Clinical Genome Resource (ClinGen) seeks volunteers to help define the clinical relevance of different genetic variants.
Finally, it would not hurt to develop some good bioinformatics skills. Big datasets require automation and scripting techniques. Having some training in bioinformatics would be an asset when applying for biocuration positions.
What Training Do I Need to Become a Biocurator?
If you’re now convinced that you’d like to become a biocurator, the next step is getting the proper training. If you want to stay within your current area of research, you are already off to a good start. For example, suppose you conduct clinical-grade NGS testing. In this case, you probably already possess subject matter expertise in clinical genomics and have a good foundation for any role in biocurating clinical genomics data.
We touched on volunteering opportunities earlier. Through these, you can work alongside more experienced curators and learn a great deal from them. An added benefit of this approach is that the training is free!
For a more formal approach to your training, a postgraduate certificate in biocuration is offered by the University of Cambridge.
If you want additional formal training, one idea is to browse biocuration job ads to see which skills employers are seeking, then take stand-alone classes in those skills. For example, a few Data Wrangler positions posted on the ISB website list shell scripting, Python, SQL, and UNIX as desired skills. Training and coursework in any of these would strengthen your application.
Have any pointers on getting into this field? Don’t be shy—please enter them in the comments below!
- Tang, YA. et al. (2019). Ten quick tips for biocuration. PLOS Computational Biology. 15:e1006906.
- Wilkinson, MD. et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data. 3:160018.
- Howe, D. et al. (2008). Big data: The future of biocuration. Nature. 455:47–50.
- Burge, S. et al. (2012). Biocurators and biocuration: surveying the 21st century challenges. Database. 2012, bar059. https://doi.org/10.1093/database/bar059
- Curate Now. International Society for Biocuration, accessed 23 Oct 2021
- Volunteer to Curate. ClinGen, accessed 23 Oct 2021
- Job Openings. International Society for Biocuration, accessed 23 Oct 2021