CpGDB: A Comprehensive Database of Chloroplast Genomes

The Chloroplast Genome Database (CpGDB) is a user-friendly, web-based, freely available and dynamic relational database that provides a platform for researchers to search and download complete chloroplast genome sequences, individual gene sequences and feature records of plant species belonging to the same or different families of spermatophytes. At present, the database holds genome sequences, individual gene sequences and feature records of chloroplast genomes of 3823 plant species belonging to 1527 genera from 256 families, and it will be updated regularly as new sequences become available at NCBI. Extensive data mining of feature records from GenBank files, uniform nomenclature for the majority of genes and enriched intron/exon feature records make CpGDB a valuable resource for studies in chloroplast genomics while complementing existing chloroplast databases.


Background:
Chloroplasts are the photosynthetic intracellular organelles of plants and are considered Earth's main solar energy converters. Besides photosynthesis, chloroplasts are also involved in the synthesis of amino acids, vitamins, lipids, pigments and precursors of plant hormones. They contain their own distinct genome, which is known to be derived from a cyanobacterial ancestor. Among the three genomes of a plant cell, the chloroplast genome is the most gene-dense, containing more than 100 genes in a genome of 120 to 210 kb [1]. The chloroplast genome generally has a highly conserved organization in terms of size, structure and gene content [2]. One of its outstanding features is a pair of inverted repeats (IRs) that separates two single-copy DNA regions, a large single-copy (LSC) region and a small single-copy (SSC) region, on a single circular DNA molecule [3]. The Chloroplast Genome Database (CpGDB) is an attempt to organize and integrate the enormous amount of data related to the chloroplast genomes of spermatophytes. It has two levels of information: NCBI-based sequence data and curated annotations. Its main focus is to make available complete chloroplast genome sequences and their different features (e.g. gene, CDS, rRNA, tRNA, intron and exon) for different types of analysis. It is a dynamic relational database available at http://www.gndu.ac.in/CpGDB and supports queries aimed specifically at comparative genome analysis across different species, genera and families. It allows searching by family, genus and gene name, and provides for downloading complete chloroplast genome sequences or individual gene sequences for selected plant species belonging to the same or different families. Additionally, it provides unified annotations with respect to gene names. This database should be a valuable platform for researchers in the field of chloroplast genomics.

©Biomedical Informatics (2020)

Methodology: Data mining and organization
CpGDB was designed using MS-Visual Studio 2015 as the front end and SQL Server 2012 as the back end (Figure 1). A search query was passed as a web request to the NCBI nucleotide database to download sequences in both FASTA and GenBank formats, which were stored in a local repository to provide a faster and more efficient sequence retrieval system. The GenBank files were parsed using a script written in ASP.NET, and the database tables Family_Master, Plant_Master and Plant_Sequences were updated for each accession. To avoid duplicate chloroplast genome sequences where more than one accession was available for a plant species, only the most recently modified version was retained. Accessions from hybrid plant species or with a missing species name were also excluded. Hence, out of a total of 3881 accessions available up to 30 November 2019, 3823 were considered for further analysis (Supplementary Table S1; see Excel file). These accessions belong to 1527 different genera from 256 plant families. Each plant species was assigned a unique four-letter code to make data retrieval, analysis and comparison simpler and more convenient. An algorithm was developed to extract and export the feature records of plant species from the GenBank files. Data such as gene name, geneID, location, product and note information were extracted and stored separately for each plant species. Executing this algorithm compiled 1,056,377 feature records belonging to 3823 plant species. Any missing information was indicated with a '-' value in the record. A total of 11,830 feature records across 1160 plant species that lacked gene names were updated based on the product information given in the feature description (Supplementary Table S3; see Excel file).
Similarly, 558 records spanning 115 plant species that lacked both gene name and product information were updated using the note description of the corresponding record (Supplementary Table S4; see Excel file). Finally, 12,089 records belonging to 1171 plant species that lacked gene name, product and note description were updated from the corresponding tRNA/rRNA/CDS feature records (Supplementary Table S5; see Excel file). To enrich information about insufficiently annotated split genes, 189,002 missing exon/intron features belonging to 3802 plant species were determined using the tRNA/CDS features of the corresponding genes and added to the database (Supplementary Table S6; see Excel file). After all these manual curations, 1,245,379 feature records are available in the database. Figure 2 shows the complete workflow, along with the data, for the creation of CpGDB.
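
The curation steps described above can be illustrated with a minimal Python sketch. This is not the authors' ASP.NET implementation. The record layout (dicts with the keys 'type', 'gene', 'product', 'note' and 'location'), the small PRODUCT_TO_GENE lookup and the function names resolve_gene_name and infer_introns are all assumptions made for illustration. The sketch shows a gene name being filled from the product, then the note, then a tRNA/rRNA/CDS feature at the same location, and introns being inferred as the gaps between the exon segments of a split gene.

```python
MISSING = "-"  # the paper marks absent values with '-'

# Hypothetical product-to-gene lookup; a real curation table would be far larger.
PRODUCT_TO_GENE = {
    "photosystem ii protein d1": "psbA",
    "ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit": "rbcL",
}


def resolve_gene_name(record, siblings):
    """Fill a missing gene name from product, then note, then sibling features.

    `record` and each item of `siblings` are dicts with the assumed keys
    'type', 'gene', 'product', 'note' and 'location'.
    """
    if record["gene"] != MISSING:
        return record["gene"]
    # Steps 1 and 2: look up the product text, then the note text.
    for field in ("product", "note"):
        name = PRODUCT_TO_GENE.get(record[field].strip().lower())
        if name:
            return name
    # Step 3: borrow the name from a tRNA/rRNA/CDS feature at the same location.
    for other in siblings:
        if (other["type"] in ("tRNA", "rRNA", "CDS")
                and other["location"] == record["location"]
                and other["gene"] != MISSING):
            return other["gene"]
    return MISSING


def infer_introns(exons):
    """Return intron (start, end) pairs as the gaps between exon segments.

    `exons` holds 1-based inclusive (start, end) pairs taken from one
    tRNA/CDS location; adjacent segments with no gap yield no intron.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


if __name__ == "__main__":
    gene_feature = {"type": "gene", "gene": MISSING, "product": MISSING,
                    "note": MISSING, "location": "1..1062"}
    cds_feature = {"type": "CDS", "gene": "psbA",
                   "product": "photosystem II protein D1",
                   "note": MISSING, "location": "1..1062"}
    print(resolve_gene_name(gene_feature, [gene_feature, cds_feature]))
    print(infer_introns([(100, 200), (300, 400)]))
```

The gap-based intron rule assumes that all segments of a gene lie on one strand and do not span the origin of the circular genome; trans-spliced genes such as rps12 would need separate handling.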

Database Interface and Utility:
The database can be accessed easily through the Chloroplast Genome Information Retrieval System (CGIRS) link available on the top menu bar of the CpGDB home page (Figure 3).
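
To illustrate the kind of family- and genus-based retrieval that such an interface supports, the following minimal Python sketch builds a toy relational store and queries it. Only the table names Family_Master, Plant_Master and Plant_Sequences come from the Methodology; every column name, the four-letter codes, the placeholder accessions and sequences, and the use of an in-memory SQLite database in place of SQL Server are assumptions made for illustration. It is not the actual CpGDB schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with an assumed three-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Family_Master (
            family_id   INTEGER PRIMARY KEY,
            family_name TEXT NOT NULL
        );
        CREATE TABLE Plant_Master (
            plant_code TEXT PRIMARY KEY,  -- four-letter species code
            species    TEXT NOT NULL,
            genus      TEXT NOT NULL,
            family_id  INTEGER REFERENCES Family_Master(family_id)
        );
        CREATE TABLE Plant_Sequences (
            accession  TEXT PRIMARY KEY,
            plant_code TEXT REFERENCES Plant_Master(plant_code),
            sequence   TEXT NOT NULL
        );
    """)
    # Placeholder rows; the codes, accessions and sequences are not real data.
    conn.executemany("INSERT INTO Family_Master VALUES (?, ?)",
                     [(1, "Brassicaceae"), (2, "Solanaceae")])
    conn.executemany("INSERT INTO Plant_Master VALUES (?, ?, ?, ?)",
                     [("ARTH", "Arabidopsis thaliana", "Arabidopsis", 1),
                      ("NITA", "Nicotiana tabacum", "Nicotiana", 2)])
    conn.executemany("INSERT INTO Plant_Sequences VALUES (?, ?, ?)",
                     [("ACC0001", "ARTH", "ATGACT"),
                      ("ACC0002", "NITA", "ATGGCA")])
    return conn


def search_sequences(conn, family=None, genus=None):
    """Return (plant_code, species, accession) rows, optionally filtered."""
    sql = """
        SELECT p.plant_code, p.species, s.accession
        FROM Plant_Sequences AS s
        JOIN Plant_Master  AS p ON p.plant_code = s.plant_code
        JOIN Family_Master AS f ON f.family_id = p.family_id
        WHERE (? IS NULL OR f.family_name = ?)
          AND (? IS NULL OR p.genus = ?)
        ORDER BY p.species
    """
    # Parameters are bound rather than interpolated, which avoids SQL injection.
    return conn.execute(sql, (family, family, genus, genus)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(search_sequences(conn, family="Solanaceae"))
    print(search_sequences(conn, genus="Arabidopsis"))
```

A gene-name search would additionally need a table of feature records, which is not named in the text and is therefore left out of this sketch.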

Conclusion:
CpGDB provides curated information on complete chloroplast genome sequences available up to 30 November 2019. The database will be updated regularly as new sequences become available. Its strength lies in its user-friendliness, its ability to retrieve specific data and its uniform nomenclature for the majority of genes. CpGDB will serve as a valuable resource for chloroplast genomic studies.

Supplementary Data:
Supplementary files are available for download from the journal website and the CpGDB website.