Environmental Microbiology (EM) | Microbial Ecology and Diversity
Microbiol. Biotechnol. Lett. 2024; 52(2): 163-178
https://doi.org/10.48022/mbl.2306.06005
Neeti Pandey1*, Raman Rajagopal2, and Shubham Dhara3
1Kalindi College, Department of Zoology, University of Delhi, New Delhi, India
2Gut Biology Lab, Room No 117, Department of Zoology, University of Delhi-07, India
3SGTB Khalsa College, University of Delhi-07, India
Correspondence to :
Neeti Pandey, netipandey.87@gmail.com
Pan-genome analysis is used to interpret genome heterogeneity and diversification of bacterial species. Here, we present pan-genome analysis of 22 strains of Enterococcus mundtii. The GenBank file of E. mundtii strains that have been isolated from different sources i.e., human fecal matter, soil, leaf, dairy products, and insects was downloaded from National Center for Biotechnology Information (NCBI) database and analyzed using BPGA-1.3.0 (Bacterial Pan Genome Analysis) pipeline. Out of a total, 4503 gene families, 1843 belongs to the core genes whereas 1,762 gene families represent the accessory genes and 898 gene families depict the unique genes among all the selected genomes. Majority of the core genes belongs to the categories of Metabolism (37.83%) and Information storage & processing (29.84%) whereas unique genes belongs to the category of Information storage & processing (48.08%). Further, accessory genes are almost equally present in both functional categories i.e. Information storage & processing and Metabolism (34.34% and 32.27% respectively). Further, subset analysis on the basis of the origin of isolates exhibits presence and absence of exclusive gene families. The observation suggests that even closely related strains of a species show extensive disparity in genome owing to their ability to adapt to a specific environment.
Keywords: Enterococcus mundtii, Pan genome, core genome, accessory genome
The members of the genus
With the advancement of available bioinformatics resources, comparative studies of enormous sequenced data have become relatively simple and rapid. This led to a lot of studies related to finding of habitat-specific variation in the microbiome composition. In the present study, we report the pan-genome analysis of 22
A total of 48 whole genome sequences (assembled at different levels i.e. contig/scaffold/complete genome) of
In order to identify niche specific genomic features among the
Protein function was predicted by using the functional analysis module of Pan Genome against COG database of prokaryotic proteins. This module is also used to determine metabolic pathways against KEGG database.
In order to assess the genetic diversity among the genomes derived from
Overall nucleotide compositions (A%, U%, G%, and C%) and GC content of
Pairwise Ka/Ks ratios of the two bacterial strains namely
Initially, all the 52 genomes of
The 22 high quality genome assemblies of
In order to generate pan-genome plot, we have plotted the total number of gene families obtained (4,503) against the number of genomes selected for study. In a similar way, core genome plot was obtained by plotting the shared/core gene families (1,843) against the number of selected genomes. To avoid any biasness in the chronological addition of new genomes, random permutations (50 by default) in the sequence of addition of genomes were carried out while generating pan-core genome plots. A median was calculated on the size of pan-core genome after each step and the median values were plotted using power equation [f(x)=a.x^b}] and exponential equation [f1(x)=c.e^(d.x)] to generate pancore plot respectively. An exponent ‘b’ of < 0 implies that the pangenome is “closed,” and its size has reached a constant value as additional genomes are added. However, ‘b’ value between 0 to 1 implies that the pangenome is still “open”. Thus, the pan-genome profile of
Further, we carried out the subset analysis by distributing the
Table 1 . Group specific Pan-Core genome statistics of
Group | No. of genomes per group | Total no. of protein conding genes | Orthologous clusters (Pan genome size) | b-parameter | Pangenome status | Core Clusters (% of Pan-genome) | Core-genome status (possibility of further contraction) | Group specific genes | Exclusively absent clusters (Group specific) |
---|---|---|---|---|---|---|---|---|---|
Group-1-Human fecal matter | 12 | 35024 | 3520 | 0.065 | Open (almost closd) | 2011 (57.13% of pangenome) | Yes | 314 | 68 |
Group-2-Insect samples | 03 | 8400 | 3242 | 0.161 | Open | 2229 (68.75%) | Yes | 349 | 381 |
Group-3-Non-dairy samples (soil/leaf) | 03 | 9062 | 3302 | 0.167 | Open | 2346(71.04%) | Yes | 153 | 131 |
Representative protein coding sequences of the core (1843), accessory (1,762) and unique (898) gene families were picked to establish their COG/KEGG identities. Figure depicts the distribution of the major COG categories in the above mentioned three categories of gene families (Fig. 2). Majority of the core genes belongs to the categories of
Similarly, KEGG pathway analysis depicted the distribution of the core, accessory and unique genes into six major sub-categories namely Cellular Processes, Environmental Information, Processing, Genetic Information Processing, Human Diseases, Metabolism and Organismal Systems (Fig. 4). KEGG pathway analysis revealed that a majority of accessory (70.76%) and core genes (64.43%) are involved in the
Further, group specific COG analysis revealed that core genes cluster of all the three groups majorly belong to the functional category of
The average amino acid identity (AAI) refers to an index of pairwise genomic relatedness and is calculated on the basis of conserved protein coding genes between a pair of genomes using the BLAST algorithm. AAI shows better resolution in revealing taxonomic structure beyond the species rank as compared to average nucleotide identity (ANI), which is considered as a standard criterion in species delineation [34]. In the current study, AAI analysis was done using AAI calculator which estimates the average amino acid identity using both best hits (one-way AAI) and reciprocal best hits (two-way AAI) among multiple genomic datasets of proteins. The emergence of whole-genome AAI has assisted greatly on evaluating species boundaries by calculating genetic relatedness between two genomes, where strains from the same microbial species share ≥ 95per cent identity [34]. Our results are in agreement with the abovementioned AAI cut off value (Fig. 8). Further, a neighbor joining phylogenomic tree was constructed based on Gene presence/absence matrix of all CDSs of 22
Subset analysis revealed the presence of habitat specific gene families i.e. they are exclusively present in the genomes isolated from a specific source. As shown in Table 1, there are 314, 349 and 153 gene cluster which have members exclusively present in
COG distribution pattern of the group specific genes are depicted in heatmap (Fig. 10). COG category Replication, recombination and repair (L) is hugely represented in all the three groups. Apart from this, Group- 1 specific gene families majorly belong to Transcription (K) (8.21%) and Carbohydrate transport and metabolism (G) (9.65%) categories whereas genes under categories like Transcription (K) (9.57%), Defense mechanisms (V) (11.76%), Cell wall/membrane/envelope biogenesis (M) (11.69%) are significantly represented in group-2. Specific gene families of group 3 are frequently present in Categories like Transcription (K) (14.505%) and Cell cycle control, cell division, chromosome partitioning (D) (7.88%) (Fig. 8).
Nucleotide composition and Codon usage bias analysis was carried out for the CDS region of 22
RSCU analysis is extensively used to standardize the analysis of codon usage bias. Here, 6 codons were overrepresented, and 17 codons were underrepresented, indicating significant codon usage bias. Generally, it has been observed that codons which are used less by the host are selected in the process of evolution in order to avoid competition with the host cell during gene translation. In RSCU analysis, it has been observed that majority of the abundantly used codon were ending in either ‘A’ or ‘T’ (Supplementary Table S6). Further, as stated above the ‘AT’ content was higher than the ‘GC’ content and similarly there was a preference for ‘AT’ than ‘GC’ at third codon position. This indicates a positive correlation between nucleotide and codon composition. Thus, mutation pressure is found to be an important force affecting the codon usage bias.
Ka/Ks ratio which is also known as ω ratio is used to approximate the balance between positive, purifying or neutral selection acting on a set of homologous proteincoding genes. The selection pressure ratio (ω) is calculated by dividing non-synonymous substitution rate (Ka) to the synonymous substitutions rate (Ks). The Ka/Ks ratio is used to infer the direction and magnitude of natural selection acting on protein coding genes. A Ka/Ks ratio < 1 implies purifying selection; Ka/Ks > 1 indicates positive or Darwinian selection (driving change); and Ka/Ks values close to 1 shows neutral or relaxed selection (reference kimura.ivanova). All the 2196 protein coding sequences shared between two strains of
Table 2 . Protein ids and function of genes under positive selection pressure.
Protein ids of | Potein ids of | Ka/Ks | Protein Description |
---|---|---|---|
lcl|NZ_CP018061.1_cds_WP_071866700.1_260 | lcl|NZ_CP022340.1_cds_WP_010736301.1_489 | 2.2961 | Hypothetical Protein |
lcl|NZ_CP018061.1_cds_WP_071866694.1_227 | lcl|NZ_CP022340.1_cds_WP_096081363.1_456 | 1.8764 | Protein=vWA domain-containing protein |
lcl|NZ_CP018061.1_cds_WP_071866167.1_3031 | lcl|NZ_CP022340.1_cds_WP_010736522.1_279 | 1.7728 | Hypothetical protein |
lcl|NZ_CP018061.1_cds_WP_071867297.1_2008 | lcl|NZ_CP022340.1_cds_WP_096081658.1_1973 | 1.6428 | TetR/AcrR family transcriptional regulator |
lcl|NZ_CP018061.1_cds_WP_179948050.1_885 | lcl|NZ_CP022340.1_cds_WP_023519435.1_950 | 1.4363 | Hypothetical Protein |
lcl|NZ_CP018061.1_cds_WP_023519140.1_293 | lcl|NZ_CP022340.1_cds_WP_096081382.1_520 | 1.4221 | Galactose mutarotase |
lcl|NZ_CP018061.1_cds_WP_071866695.1_248 | lcl|NZ_CP022340.1_cds_WP_010736313.1_477 | 1.2478 | Protein=tagatose-bisphosphate aldolase |
lcl|NZ_CP018061.1_cds_WP_071866535.1_2601 | lcl|NZ_CP022340.1_cds_WP_034691573.1_2620 | 1.2418 | Hypothetical Protein |
lcl|NZ_CP018061.1_cds_WP_071867143.1_1048 | lcl|NZ_CP022340.1_cds_WP_010735694.1_1082 | 1.2373 | Hypothetical Protein |
lcl|NZ_CP018061.1_cds_WP_071867840.1_2312 | lcl|NZ_CP022340.1_cds_WP_096081784.1_2342 | 1.2123 | DUF624 domain-containing protein |
lcl|NZ_CP018061.1_cds_WP_233433731.1_1971 | lcl|NZ_CP022340.1_cds_WP_010734841.1_1937 | 1.156 | Hypothetical Protein |
lcl|NZ_CP018061.1_cds_WP_071866706.1_278 | lcl|NZ_CP022340.1_cds_WP_096081377.1_505 | 1.1501 | RsiV family protein |
lcl|NZ_CP018061.1_cds_WP_019722494.1_449 | lcl|NZ_CP022340.1_cds_WP_096081408.1_635 | 1.1211 | Hypothetical Protein |
lcl|NZ_CP018061.1_cds_WP_071868037.1_882 | lcl|NZ_CP022340.1_cds_WP_010735831.1_947 | 1.0752 | Discoidin domain-containing protein |
lcl|NZ_CP018061.1_cds_WP_071866347.1_938 | lcl|NZ_CP022340.1_cds_WP_010735791.1_986 | 1.0623 | Lysozyme family protein |
lcl|NZ_CP018061.1_cds_WP_071868082.1_2032 | lcl|NZ_CP022340.1_cds_WP_010734783.1_1996 | 1.0247 | Type II toxin-antitoxin system RelE/ ParE family toxin |
lcl|NZ_CP018061.1_cds_WP_071866602.1_213 | lcl|NZ_CP022340.1_cds_WP_010736349.1_442 | 1.014 | Zinc ABC transporter substrate-binding protein AdcA |
lcl|NZ_CP018061.1_cds_WP_071866711.1_304 | lcl|NZ_CP022340.1_cds_WP_010736250.1_531 | 1.0103 | S-(hydroxymethyl) glutathione dehydrogenase/class III alcohol dehydrogenase |
lcl|NZ_CP018061.1_cds_EM4838_RS11985_2351 | lcl|NZ_CP022340.1_cds_- CO646_RS12160_2380 | 1.0005 | HAD hydrolase family protein |
In this study, we have presented the pan-genome analysis of different strains of
Further, one of the major objectives of the current study was to explore whether there are any specific gene families that are exclusively present or absent from the genomes derived from a particular habitat as they might have some function in adaptation of the microorganism to a particular habitat. The result indicates both presence as well as absence of exclusive gene families from the genomes of a specific habitat. Microorganisms often exhibit alteration in the genomic composition according to their specific environment [40]. Therefore, it can be concluded that ‘gain of function’ in addition to ‘loss of function’ have contributed significantly towards the adaptation of the strains of
Codon usage as well as synomymous/non-synonymous substitutions plays significant role in genome evolution. Synonymous (Ks) and Non-synonyous substitution rates and Ka/Ks ratio helps to determine the sequence divergence as well as positive, purifying or neutral selection in protein coding genes. In maority of the protein coding genes except rapidly evolving genes, Synonymous substitution (Ks) occurs more regularly when compared to non-synonymous substitutions (Ka) [47]. Positive selection of a protein coding gene by a species is indicative of adaptive evolution. It implies that the gene is beneficial for host’s exsistence, to adapt to a particular niche or environmental conditions.
Thus, to conclude, in this study, we have presented the pan-genome analysis of different
BPGA: Bacterial Pan-genome analysis Pipeline
AAI: Average Amino acid Identity
RSCU: Relative Synonymous Codon Usage
PAIs: Pathogenicity islands
We thank Dr. Vinod Kumar Gupta (Mayo clinic, Rochester) for his valuable time, suggestions and guidance that improved the manuscript.
The authors have no financial conflicts of interest to declare.