
Microbiology and Biotechnology Letters

Research Article


Environmental Microbiology (EM)  |  Microbial Ecology and Diversity

Microbiol. Biotechnol. Lett. 2024; 52(2): 163-178

Received: June 15, 2023; Revised: September 20, 2023; Accepted: October 16, 2023

Pan-Genome Analysis Reveals Origin Specific Genome Expansion in Enterococcus mundtii Strains

Neeti Pandey1*, Raman Rajagopal2, and Shubham Dhara3

1Kalindi College, Department of Zoology, University of Delhi, New Delhi, India
2Gut Biology Lab, Room No 117, Department of Zoology, University of Delhi-07, India
3SGTB Khalsa College, University of Delhi-07, India

Correspondence to: Neeti Pandey

Pan-genome analysis is used to interpret genome heterogeneity and diversification of bacterial species. Here, we present a pan-genome analysis of 22 strains of Enterococcus mundtii. The GenBank files of E. mundtii strains isolated from different sources, i.e., human fecal matter, soil, leaf, dairy products, and insects, were downloaded from the National Center for Biotechnology Information (NCBI) database and analyzed using the BPGA-1.3.0 (Bacterial Pan Genome Analysis) pipeline. Out of a total of 4,503 gene families, 1,843 belong to the core genome, whereas 1,762 gene families represent accessory genes and 898 gene families represent unique genes among the selected genomes. The majority of the core genes belong to the categories Metabolism (37.83%) and Information storage & processing (29.84%), whereas unique genes mostly belong to the category Information storage & processing (48.08%). Accessory genes are almost equally distributed between the two functional categories, i.e., Information storage & processing and Metabolism (34.34% and 32.27%, respectively). Further, subset analysis based on the origin of the isolates reveals the presence and absence of exclusive gene families. These observations suggest that even closely related strains of a species show extensive disparity in genome content owing to their adaptation to specific environments.

Keywords: Enterococcus mundtii, Pan genome, core genome, accessory genome

The members of the genus Enterococcus (family Enterococcaceae) are Gram-positive cocci occurring singly, in pairs, in groups, or as short chains [1]. The genus Enterococcus was first recognized in 1903 [2]; however, it was not widely accepted as a separate genus at that time, and Enterococcus species, owing to their ability to form short chains while growing, were reclassified as Streptococcus faecalis in 1906 [3]. Later, major differences were observed between S. faecalis and S. faecium and the other members of the genus Streptococcus, and both species were subsequently allocated to the genus Enterococcus [4]. With time, more species such as Enterococcus gallinarum, Enterococcus avium, Enterococcus durans, Enterococcus casseliflavus, and Enterococcus malodoratus were added to this newly accepted genus [5], and to date 62 different bacterial species have been classified under Enterococcus. Enterococcus species are ubiquitous bacteria and have been isolated from diverse environments such as the gastrointestinal tracts of humans and animals, soil, water, leaves, and faecal matter [6-9]. They are technologically relevant: they are key players in the dairy industry [10] and in meat production and storage [10], they produce bacteriocins that prevent food spoilage [11], and they can help improve general immunity [12]. However, certain species of Enterococcus have serious clinical implications, such as endocarditis, urinary infection, and infections of the central nervous system [13, 14].

Enterococcus mundtii is a Gram-positive, non-motile, yellow-pigmented, non-spore-forming, facultatively anaerobic coccobacillus. Taxonomically, it has been assigned to the Enterococcus faecium group based on homology in the 16S rDNA sequence [15]. It has a low GC content, ranging between 38 and 39%, and it lacks enzymes such as catalase and cytochrome-C oxidase. It has been shown to be associated with diverse niches such as soil, leaves, dairy products, human and ovine fecal matter, cow’s teats, and milker’s hands [11, 16-18]. E. mundtii strains can play an essential role in dairy processes such as fermentation to produce lactic acid, and a few E. mundtii strains are bacteriocinogenic and highly active against bacteria such as Clostridium, Klebsiella, Lactobacillus, Pseudomonas, and Acinetobacter [19-21]. Further, they can be used to prevent mastitis in cows [18].

With the advancement of available bioinformatics resources, comparative studies of enormous volumes of sequence data have become relatively simple and rapid. This has led to many studies aimed at finding habitat-specific variation in microbiome composition. In the present study, we report the pan-genome analysis of 22 E. mundtii strains isolated from diverse niches (data obtained from NCBI). The rationale for selecting E. mundtii lies in its being a ubiquitous bacterium that thrives under different environmental conditions and produces powerful bacteriocins which inhibit the proliferation of pathogenic bacteria. Enterococcus spp. play a significant role in the food industry due to their ability to produce powerful enterocins which can inhibit the growth of food-degrading bacteria [22, 23]. However, the use of a few Enterococcus species, such as Enterococcus faecalis and Enterococcus faecium, is highly questionable due to their tendency to infect human beings [24]. An earlier study emphasized the significance of using biologically safe, lactic acid-producing, non-pathogenic bacteria such as E. mundtii in the food industry for commercial purposes [25]. However, a comparative genome analysis has not been done for this species to date. Therefore, we aim to gain insight into the niche-driven changes in the genomic content of E. mundtii genomes.

Genome sequences retrieval and selection for Pangenome analysis

A total of 48 whole genome sequences of E. mundtii (assembled at different levels, i.e., contig, scaffold, or complete genome) were available in the NCBI database as of March 2022. We calculated the percentage completeness and contamination of each assembly using CheckM v1.1.6 [26] and assessed the quality of each genome using QUAST version 5.2 [27]. Only twenty-two high-quality, near-complete genome assemblies of E. mundtii with information on the source of isolation were selected for pan-genome analysis; assemblies exhibiting ≥ 99.19% completeness and ≤ 0.87% contamination were considered high quality. The full GenBank files of all 22 genomes were downloaded from the NCBI database and served as input for the pan-genome analysis. Among the 22 genome assemblies, 12 were isolated from human fecal material, 3 from insect samples, 3 from environmental samples (soil/leaf), and 4 from varied samples (ovine feces, artisanal cheese, cow’s teats, and dairy cows).
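The selection rule described above can be sketched as a simple filter. This is a minimal illustration, assuming the CheckM results have already been parsed into records; the `Assembly` record, its field names, and the demo accessions are hypothetical, and only the two thresholds come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds quoted in the text for a "high quality" assembly (percent).
MIN_COMPLETENESS = 99.19
MAX_CONTAMINATION = 0.87


@dataclass
class Assembly:
    """Hypothetical record holding one assembly's quality metrics."""
    accession: str
    completeness: float      # CheckM completeness, percent
    contamination: float     # CheckM contamination, percent
    source: Optional[str]    # isolation source, None if not recorded


def is_selected(assembly: Assembly) -> bool:
    """True if the assembly passes both thresholds and has a known source."""
    return (assembly.completeness >= MIN_COMPLETENESS
            and assembly.contamination <= MAX_CONTAMINATION
            and bool(assembly.source))


def select_assemblies(assemblies: List[Assembly]) -> List[Assembly]:
    """Keep only assemblies that satisfy the selection rule."""
    return [a for a in assemblies if is_selected(a)]


if __name__ == "__main__":
    demo = [
        Assembly("demo_A", 99.5, 0.3, "soil"),
        Assembly("demo_B", 98.0, 0.3, "human feces"),
        Assembly("demo_C", 99.9, 0.1, None),
    ]
    print([a.accession for a in select_assemblies(demo)])
```

Requiring a recorded isolation source in the same predicate mirrors the text, which selects only assemblies that can later be assigned to an origin group.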

Protein clustering and Pan-Core Genome analysis using BPGA-1.3.0 pipeline

To identify niche-specific genomic features among the E. mundtii strains, the computational pipeline BPGA-1.3.0 (Bacterial Pan Genome Analysis) was used [28]. The full GenBank files of the 22 assemblies downloaded from NCBI served as input for the pipeline. BPGA-1.3.0 processed all 22 GenBank files for orthologous cluster analysis and generated an input file containing the annotated genes. This file was used to cluster the annotated genes into gene families with USEARCH using default parameters, i.e., 50% sequence similarity as the cutoff. BPGA-1.3.0 was used to extrapolate the pan- and core-genome models of E. mundtii with default parameters. The size of the pan-genome was estimated with the power equation [f(x) = a·x^b] using a built-in program of the BPGA-1.3.0 pipeline, where f(x) indicates the total number of non-orthologous gene families within the pan-genome, x denotes the number of strains considered for pan-genome analysis, and a and b are fitting parameters. Additionally, the exponential equation [f1(x) = c·e^(d·x)] was used to model the core genome (number of shared gene families) from 50 random permutations, where f1(x) denotes the total number of core gene families and c and d are fitting parameters. The frequency distribution of gene families among the 22 selected genomes was also determined by pan-genome profile analysis. Further, core, accessory, and unique gene families were analyzed using the pan-genome sequence extraction module. Exclusive gene families, i.e., genes uniquely present in or absent from different E. mundtii strains, were also analyzed.
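The two model equations can be illustrated with a small fitting sketch. The text does not describe how BPGA fits its curves, so the code below substitutes ordinary least squares on log-transformed data, which may give slightly different parameters than a nonlinear fit. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def linear_fit(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_power_law(n_genomes: Sequence[float],
                  pan_sizes: Sequence[float]) -> Tuple[float, float]:
    """Estimate (a, b) in f(x) = a * x**b by regressing log f(x) on log x."""
    intercept, b = linear_fit([math.log(x) for x in n_genomes],
                              [math.log(y) for y in pan_sizes])
    return math.exp(intercept), b


def fit_exponential(n_genomes: Sequence[float],
                    core_sizes: Sequence[float]) -> Tuple[float, float]:
    """Estimate (c, d) in f1(x) = c * exp(d * x) by regressing log f1(x) on x."""
    intercept, d = linear_fit(list(n_genomes),
                              [math.log(y) for y in core_sizes])
    return math.exp(intercept), d


if __name__ == "__main__":
    # Synthetic curves only: the values of a, c and d are invented for the demo.
    xs = list(range(1, 23))
    print(fit_power_law(xs, [2700.0 * x ** 0.141 for x in xs]))
    print(fit_exponential(xs, [2500.0 * math.exp(-0.01 * x) for x in xs]))
```

On noise-free synthetic data the fit recovers the generating parameters; on real median curves the estimates depend on the fitting method used.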

COG/KEGG analysis using functional analysis module of BPGA-1.3.0

Protein function was predicted using the functional analysis module of BPGA-1.3.0 against the COG database of prokaryotic proteins. This module was also used to assign genes to metabolic pathways against the KEGG database.

Phylogenomic analysis

To assess the genetic diversity among the E. mundtii genomes, we performed an average amino acid identity (AAI) analysis using the AAI calculator [29]. This tool calculates pairwise AAI values for the dataset. Further, the pan-genome information was converted into a binary matrix on the basis of presence (coded as 1) and absence (coded as 0) of genes using the BPGA-1.3.0 pipeline. The pan-matrix was then used to construct a Neighbor Joining tree using the web-based server Morpheus [30].
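The presence/absence coding described above can be sketched as follows. The input layout (a mapping from gene family to the set of genomes containing it) and the function names are assumptions made for illustration. The Hamming-style distance is one plausible input for a distance-based tree and is not necessarily the metric used by Morpheus.

```python
from typing import Dict, List, Sequence, Set, Tuple


def build_pan_matrix(family_members: Dict[str, Set[str]],
                     genomes: Sequence[str]) -> Tuple[List[str], Dict[str, List[int]]]:
    """Return (ordered family ids, genome -> row of 1/0 presence codes)."""
    families = sorted(family_members)
    matrix = {
        genome: [1 if genome in family_members[fam] else 0 for fam in families]
        for genome in genomes
    }
    return families, matrix


def presence_distance(row_a: Sequence[int], row_b: Sequence[int]) -> float:
    """Fraction of gene families whose presence state differs between two genomes."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must cover the same gene families")
    if not row_a:
        return 0.0
    return sum(a != b for a, b in zip(row_a, row_b)) / len(row_a)


if __name__ == "__main__":
    toy = {"fam1": {"A", "B", "C"}, "fam2": {"A"}, "fam3": {"B", "C"}}
    columns, rows = build_pan_matrix(toy, ["A", "B", "C"])
    print(columns, rows)
    print(presence_distance(rows["A"], rows["B"]))
```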

Nucleotide composition and Relative Synonymous Codon Usage (RSCU) analysis

Overall nucleotide composition (A%, T%, G%, and C%) and GC content of the E. mundtii strains were analyzed using MEGA7 [31]. RSCU (Relative Synonymous Codon Usage) values were calculated for the coding regions of the E. mundtii genomes using the cusp program of the EMBOSS suite [32].
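The composition measures named above can be computed with a few lines of code. This is a minimal sketch and not the MEGA7 procedure. The function names are hypothetical, and ambiguous bases are simply ignored in the overall composition.

```python
from typing import Dict, Tuple


def nucleotide_composition(seq: str) -> Dict[str, float]:
    """Percentage of A, C, G and T among the unambiguous bases of a sequence."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A/C/G/T bases")
    total = len(counted)
    return {base: 100.0 * counted.count(base) / total for base in "ACGT"}


def gc_by_codon_position(cds: str) -> Tuple[float, float, float]:
    """GC percentage at the first, second and third codon positions of a CDS.

    Trailing bases that do not complete a codon are dropped.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    result = []
    for position in range(3):
        bases = cds[position:usable:3]
        gc = sum(1 for base in bases if base in "GC")
        result.append(100.0 * gc / len(bases) if bases else 0.0)
    return result[0], result[1], result[2]


if __name__ == "__main__":
    print(nucleotide_composition("AATG"))
    print(gc_by_codon_position("ATGGCC"))
```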

Calculation of Ka/Ks ratio

Pairwise Ka/Ks ratios between two strains, E. mundtii DSM 4838 (isolated from soil) and E. mundtii EMB156 (isolated from insect gut), were calculated from the concatenated single-CDS alignments with KaKs Calculator 2.0 [33].

Quality assessment of genome assemblies

Initially, all 52 genomes of E. mundtii available at NCBI were downloaded for pan-genome analysis. However, quality assessment using QUAST version 5.2 and CheckM v1.1.6 showed that only 22 genomes were of high quality and could be used for downstream analysis. The dataset used in the current study therefore contains 22 annotated high-quality genome assemblies of different strains isolated from human faeces, insect samples, ovine feces, soil, leaf, artisanal cheese, dairy cows, and cow’s teats (Supplementary Table S1). CheckM was used to estimate the completeness and contamination of the draft genome assemblies. All the selected assemblies exhibit ≥ 99.19% completeness and ≤ 0.87% contamination (Supplementary Table S2). Analysis with QUAST version 5.2 further confirmed that the draft assemblies used for comparative genome analysis in this study were of very high quality. As seen in Supplementary Table S2, the total genome size varies from 2.83 Mb to 3.50 Mb among the strains. Interestingly, the genomes sequenced from human feces exhibit wide variation in assembly size, ranging from 2.83 Mb to 3.45 Mb. The G + C content varies slightly, from 38.1% to 38.5%, among the 22 selected genome assemblies.

Orthologous gene families - classification into the core, flexible, and singleton genes

The 22 high-quality genome assemblies of E. mundtii were used as input data for the BPGA-1.3.0 (Bacterial Pan Genome Analysis) pipeline. Initial genome-wise analysis by BPGA-1.3.0 revealed that the number of protein coding sequences varies from 2,282 to 2,892 per genome. A total of 60,476 protein coding sequences were grouped into 4,503 gene families by USEARCH. All the gene families were further categorized as (1) ‘core genes’, which are present in all 22 selected genomes, and (2) ‘flexible genes’, which are present in some genomes but absent in others. The flexible genes are further classified as (a) ‘unique genes’ or ‘singletons’, which are specific to a single genome, and (b) ‘accessory genes’, which exist in two or more genomes but not in all. Among the 4,503 gene families, 1,843 (40.93% of the pan-genome) are core gene families present in all 22 selected E. mundtii genomes. The number of protein coding genes per E. mundtii genome is estimated to be 2,749 ± 158, so roughly 67.04% of the genes in a genome correspond to conserved core genes. A further 1,762 gene families (39.12% of the pan-genome) represent accessory genes, whereas only 898 gene families (19.94% of the pan-genome) are unique genes among the selected genomes. The genome-wise distribution of the core, accessory, unique, and exclusively absent genes in all the E. mundtii strains is given in Supplementary Table S3.
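The three-way classification defined above depends only on how many genomes contain a member of each gene family. A minimal sketch follows, assuming the clustering result is available as a mapping from family id to the set of genomes that contain it; that layout and the function name are hypothetical.

```python
from typing import Dict, List, Set


def classify_families(family_members: Dict[str, Set[str]],
                      n_genomes: int) -> Dict[str, List[str]]:
    """Split gene families into core, accessory and unique classes.

    core      : present in all genomes
    unique    : present in exactly one genome (singleton)
    accessory : present in two or more genomes, but not in all
    """
    classes: Dict[str, List[str]] = {"core": [], "accessory": [], "unique": []}
    for family in sorted(family_members):
        n_present = len(family_members[family])
        if n_present == 0:
            raise ValueError(f"family {family!r} has no members")
        if n_present == n_genomes:
            classes["core"].append(family)
        elif n_present == 1:
            classes["unique"].append(family)
        else:
            classes["accessory"].append(family)
    return classes


if __name__ == "__main__":
    toy = {
        "fam1": {"g1", "g2", "g3"},
        "fam2": {"g1", "g2"},
        "fam3": {"g3"},
    }
    print(classify_families(toy, n_genomes=3))
```

The core test is checked first so that, in the degenerate case of a single genome, every family is reported as core and none as unique.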

Pan-core genome plots

To generate the pan-genome plot, we plotted the total number of gene families obtained (4,503) against the number of genomes selected for the study. Similarly, the core-genome plot was obtained by plotting the shared (core) gene families (1,843) against the number of selected genomes. To avoid any bias from the order in which genomes are added, random permutations (50 by default) of the order of genome addition were carried out while generating the pan- and core-genome plots. The median size of the pan and core genome was calculated after each step, and the median values were fitted with the power equation [f(x) = a·x^b] and the exponential equation [f1(x) = c·e^(d·x)] to generate the pan- and core-genome curves, respectively. An exponent ‘b’ of < 0 implies that the pan-genome is “closed”, i.e., its size reaches a constant value as additional genomes are added, whereas a ‘b’ value between 0 and 1 implies that the pan-genome is still “open”. Thus, the pan-genome of E. mundtii is still open, with a ‘b’ value of 0.141 (Fig. 1A). The distribution of gene families and of new genes within the pan-genome of E. mundtii is also shown (Fig. 1B and Fig. 1C). The core-genome plot shows a contraction in the number of shared gene families with each subsequent addition of a new genome. As Fig. 1A indicates, the core-genome curve of E. mundtii has yet to reach a plateau, and further reduction in the shared gene families is expected with the addition of new genomes.
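The permutation step described above can be sketched as follows. This is an illustration of the general idea, not BPGA's implementation. The input layout (genome name mapped to the set of gene family ids it contains), the function name, and the fixed random seed are assumptions.

```python
import random
from statistics import median
from typing import Dict, Hashable, List, Set, Tuple


def pan_core_curves(genome_families: Dict[str, Set[Hashable]],
                    n_permutations: int = 50,
                    seed: int = 0) -> Tuple[List[float], List[float]]:
    """Median pan- and core-genome sizes after adding 1..N genomes.

    Genomes are added in a random order for each permutation. At every step
    the pan-genome is the union and the core genome the intersection of the
    families seen so far.
    """
    rng = random.Random(seed)
    names = sorted(genome_families)
    pan_sizes: List[List[int]] = [[] for _ in names]
    core_sizes: List[List[int]] = [[] for _ in names]
    for _ in range(n_permutations):
        order = names[:]
        rng.shuffle(order)
        pan: Set[Hashable] = set()
        core: Set[Hashable] = set()
        for step, name in enumerate(order):
            families = genome_families[name]
            pan |= families
            core = set(families) if step == 0 else core & families
            pan_sizes[step].append(len(pan))
            core_sizes[step].append(len(core))
    return ([median(sizes) for sizes in pan_sizes],
            [median(sizes) for sizes in core_sizes])


if __name__ == "__main__":
    toy = {"g1": {1, 2, 3}, "g2": {1, 2, 4}, "g3": {1, 5}}
    print(pan_core_curves(toy))
```

The two returned lists are the median curves that would then be fitted with the power and exponential equations.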

Figure 1. Pan-core genome analysis. Panel (A) shows the pan- and core-genome curves: the upper (orange) curve represents the pan-genome, whereas the lower (blue) curve represents the core genome. Panel (B) shows the frequency distribution of gene families within genomes. Panel (C) shows the number of new genes added to each genome.

Further, we carried out a subset analysis by distributing the E. mundtii genome assemblies among three groups based on their source of isolation. Group-1 includes 12 genome assemblies isolated from human fecal material, Group-2 includes 3 genome assemblies isolated from insect samples, and Group-3 includes 3 genome assemblies isolated from environmental samples, i.e., soil (2 genome assemblies) and leaf (1 genome assembly) (Supplementary Table S1). All three groups were used for subset analysis in BPGA-1.3.0. Niche-specific pan- and core-genomes were derived with 50 random permutations as explained above. Subset analysis revealed that the pan-genome of every group is still open, although the pan-genome of Group-1, with a ‘b’ value of 0.065, is almost closed (Table 1). Each group had a different proportion of core gene families, even though all the subsets showed the same contraction in the number of shared gene families with each added genome that we observed in the comparative analysis of all 22 strains together. The pan- and core-genome statistics of all the groups are given in Table 1.

Table 1. Group-specific pan-core genome statistics of E. mundtii strains.

| Group | No. of genomes per group | Total no. of protein coding genes | Orthologous clusters (pan-genome size) | b-parameter | Pan-genome status | Core clusters (% of pan-genome) | Core-genome status (possibility of further contraction) | Group-specific genes | Exclusively absent clusters (group specific) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Group-1: Human fecal matter | 12 | 35024 | 3520 | 0.065 | Open (almost closed) | 2011 (57.13%) | Yes | 314 | 68 |
| Group-2: Insect samples | 03 | 8400 | 3242 | 0.161 | Open | 2229 (68.75%) | Yes | 349 | 381 |
| Group-3: Non-dairy samples (soil/leaf) | 03 | 9062 | 3302 | 0.167 | Open | 2346 (71.04%) | Yes | 153 | 131 |
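The rule stated earlier for reading the exponent ‘b’ can be applied to the values reported in the text and in Table 1. The sketch below encodes only the two cases the text defines (b < 0 closed, b between 0 and 1 open). The function name and the label returned for any other value are assumptions.

```python
def pangenome_status(b: float) -> str:
    """Interpret the power-law exponent b as described in the text."""
    if b < 0:
        return "closed"
    if 0 < b < 1:
        return "open"
    # The text does not say how to read b == 0 or b >= 1.
    return "not covered by the stated rule"


# b-parameters reported for the full dataset and for the three subsets.
REPORTED_B = {
    "All 22 genomes": 0.141,
    "Group-1": 0.065,
    "Group-2": 0.161,
    "Group-3": 0.167,
}

if __name__ == "__main__":
    for label, b in REPORTED_B.items():
        print(label, b, pangenome_status(b))
```

The qualitative label "almost closed" used for Group-1 reflects how near its exponent is to zero and is not a separate category in this rule.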

COG/KEGG analysis

Representative protein coding sequences of the core (1,843), accessory (1,762), and unique (898) gene families were selected to establish their COG/KEGG identities. Fig. 2 depicts the distribution of the major COG categories in these three categories of gene families. The majority of the core genes belong to the categories Metabolism (37.83%) and Information storage & processing (29.84%), whereas unique genes mostly belong to the category Information storage & processing (48.08%). Accessory genes are almost equally distributed between Information storage & processing and Metabolism (34.34% and 32.27%, respectively) (Fig. 2A). Detailed examination revealed that 25.6% of unique genes are involved in Replication, recombination and repair [L], compared to 11.08% of accessory and 5.38% of core genes. Additionally, a larger share of unique genes (11.41%) than of core genes (4.75%) is involved in Cell wall/membrane/envelope biogenesis [M]. In contrast, a larger share of core genes (8.52%) belongs to the functional category Translation, ribosomal structure and biogenesis [J], compared to accessory (1.34%) and unique genes (0.69%). Both core (11.23%) and accessory genes (14.25%) are more involved in Carbohydrate transport and metabolism [G] than unique genes (4.15%). However, there are functional categories, such as Coenzyme transport and metabolism [H] and Secondary metabolites biosynthesis, transport, and catabolism [Q], in which unique genes are entirely absent (Fig. 2B).

Figure 2. Distribution of general & functional COG categories. Panel (A) depicts the relative abundance and distribution of general COG categories among core, accessory and unique clusters of E. mundtii genomes. Panel (B) depicts the relative abundance and distribution of functional COG categories among core (green), accessory (red) and unique (blue) clusters of E. mundtii genomes.

Similarly, KEGG pathway analysis showed the distribution of the core, accessory, and unique genes into six major categories, namely Cellular Processes, Environmental Information Processing, Genetic Information Processing, Human Diseases, Metabolism, and Organismal Systems (Fig. 3). A majority of accessory (70.76%) and core genes (64.43%) are involved in Metabolism-related pathways, compared to 51.02% of unique genes (Fig. 3A). In contrast, a much higher percentage of unique genes (16.33%) than of core (4.26%) and accessory genes (3.80%) is involved in Human disease-related pathways (Fig. 3A). A detailed breakdown of the functional sub-categories under the major categories is shown in Fig. 3B.

Figure 3. Distribution of general & functional KEGG categories. Panel (A) depicts the relative abundance and distribution of general KEGG categories among core (green), accessory (red) and unique (blue) clusters of E. mundtii genomes. Panel (B) depicts the relative abundance and distribution of functional KEGG categories among core (green), accessory (red) and unique (blue) clusters of E. mundtii genomes.
Figure 4. COG general category frequency heatmap. The figure depicts the relative abundance and distribution of general COG categories among core, accessory and unique clusters of three specific groups (subsets) of E. mundtii genomes.

Further, group-specific COG analysis revealed that the core gene clusters of all three groups mainly belong to the functional categories Metabolism and Information storage & processing, in that order. In contrast, the unique gene clusters of all three groups are primarily involved in Information storage & processing. A class-wise description of the COG functional categories is given in the heatmaps (Fig. 4 and Fig. 5). KEGG pathway analysis revealed that the core, unique, and accessory clusters of all the groups are predominantly involved in Metabolism-related pathways, except the unique gene cluster of Group-3, which is equally involved (33.33% each) in Metabolism and Human disease-related pathways. A detailed description of all classes within each major category for all the groups is given in the heatmaps (Figs. 6 and 7).

Figure 5. COG functional category frequency heatmap. The figure depicts the relative abundance and distribution of functional COG categories among core, accessory and unique clusters of three specific groups (subsets) of E. mundtii genomes.
Figure 6. KEGG general category frequency heatmap. The figure depicts the relative abundance and distribution of general KEGG categories among core, accessory and unique clusters of three specific groups (subsets) of E. mundtii genomes.
Figure 7. KEGG functional category frequency heatmap. The figure depicts the relative abundance and distribution of functional KEGG categories among core, accessory and unique clusters of three specific groups (subsets) of E. mundtii genomes.

Phylogenomic analysis

The average amino acid identity (AAI) is an index of pairwise genomic relatedness calculated from the conserved protein coding genes shared by a pair of genomes using the BLAST algorithm. AAI resolves taxonomic structure beyond the species rank better than average nucleotide identity (ANI), which is considered a standard criterion in species delineation [34]. In the current study, AAI analysis was done using the AAI calculator, which estimates the average amino acid identity using both best hits (one-way AAI) and reciprocal best hits (two-way AAI) among multiple genomic datasets of proteins. Whole-genome AAI has greatly assisted the evaluation of species boundaries by quantifying the genetic relatedness between two genomes, where strains from the same microbial species share ≥ 95% identity [34]. Our results are in agreement with this AAI cutoff value (Fig. 8). Further, a neighbor-joining phylogenomic tree was constructed based on the gene presence/absence matrix of all CDSs of the 22 E. mundtii genomes (Fig. 9). E. mundtii strains exhibit a niche-specific distribution of genes: the pan-matrix tree revealed clusters of 10 genomes derived from human faecal matter, 2 genomes derived from insect samples, and 2 genomes derived from environmental samples.

Figure 8. Heatmap of AAI values. The heatmap shows average amino acid identity (AAI) values for all 22 E. mundtii genomes.
Figure 9. Phylogenomic tree based on the binary pan-matrix. The figure depicts the gene presence/absence matrix of all CDSs of the 22 E. mundtii genomes. Red depicts presence while blue shows absence of a gene. Strain names are shown on the right side.

Group specific cluster analysis

Subset analysis revealed the presence of habitat-specific gene families, i.e., families exclusively present in the genomes isolated from a specific source. As shown in Table 1, there are 314, 349, and 153 gene clusters whose members are exclusively present in E. mundtii genomes derived from human faecal matter, insect samples, and environmental samples, respectively. Surprisingly, the highest number of niche-specific gene families is present in Group-2 (insect samples), even though only three of the 22 genomes in our study belong to that group. A list of the genes specifically present in genomes derived from each group is provided in Supplementary Table S4. Further, around 580 gene clusters are exclusively absent from E. mundtii genomes derived from a specific source. As shown in Table 1, 68, 381, and 131 gene clusters are exclusively absent from Group-1, Group-2, and Group-3 genomes, respectively. A list of the genes absent from genomes derived from each group is provided in Supplementary Table S5.
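The two kinds of group-level gene families discussed above can be expressed as set comparisons. The exact criteria used by the pipeline are not spelled out in the text, so the sketch below rests on two assumptions: a family is "group-specific" when all of its members lie inside the group, and "exclusively absent" when it is present in every genome outside the group and in none inside it. All names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def group_specific_families(family_members: Dict[str, Set[str]],
                            group: Iterable[str]) -> List[str]:
    """Families found only in genomes of the given group (assumed definition)."""
    group_set = set(group)
    return sorted(family for family, members in family_members.items()
                  if members and members <= group_set)


def exclusively_absent_families(family_members: Dict[str, Set[str]],
                                group: Iterable[str],
                                all_genomes: Iterable[str]) -> List[str]:
    """Families present in every genome outside the group and in none inside
    it (assumed definition)."""
    outside = set(all_genomes) - set(group)
    if not outside:
        return []
    return sorted(family for family, members in family_members.items()
                  if members == outside)


if __name__ == "__main__":
    genomes = ["h1", "h2", "i1", "e1"]
    toy = {
        "famA": {"h1"},
        "famB": {"h1", "h2"},
        "famC": {"i1", "e1"},
        "famD": {"h1", "h2", "i1", "e1"},
    }
    print(group_specific_families(toy, ["h1", "h2"]))
    print(exclusively_absent_families(toy, ["h1", "h2"], genomes))
```

A looser definition of group-specific, for example requiring presence in every genome of the group, would give smaller counts, so the criterion matters when comparing numbers across studies.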

The COG distribution pattern of the group-specific genes is depicted in a heatmap (Fig. 10). The COG category Replication, recombination and repair (L) is heavily represented in all three groups. Apart from this, Group-1-specific gene families mainly belong to the categories Transcription (K) (8.21%) and Carbohydrate transport and metabolism (G) (9.65%), whereas the categories Transcription (K) (9.57%), Defense mechanisms (V) (11.76%), and Cell wall/membrane/envelope biogenesis (M) (11.69%) are strongly represented in Group-2. Group-3-specific gene families are frequently found in categories such as Transcription (K) (14.505%) and Cell cycle control, cell division, chromosome partitioning (D) (7.88%) (Fig. 10).

Figure 10. COG functional category frequency heatmap. The heatmap depicts the COG distribution patterns and relative abundance of the origin-specific orthologous gene families.

Nucleotide composition and Relative Synonymous Codon Usage (RSCU) analysis

Nucleotide composition and codon usage bias analyses were carried out for the CDS regions of the 22 E. mundtii strains. The nucleotide ‘A’ was the most abundant, with a mean value of 33.01%, followed by T (27.76%), G (21.54%), and C (17.69%). The mean AT content was calculated to be 60.05%, which was higher than the GC content (39.94%). Further analysis revealed that the GC content at the third codon position (34.11%) was considerably lower than at the first position (49.02%) and slightly lower than at the second position (34.58%). The Relative Synonymous Codon Usage (RSCU) value is the ratio of the observed frequency of a particular synonymous codon to its expected frequency in the absence of codon usage bias. RSCU is considered an important measure of codon usage bias [35, 36]. An RSCU value of 1.0 signifies no codon usage bias, i.e., equal use of the synonymous codons for that amino acid. An RSCU value greater than 1.0 indicates positive codon usage bias, whereas a value less than 1.0 indicates negative codon usage bias. Moreover, codons with RSCU values higher than 1.6 or lower than 0.6 are considered overrepresented or underrepresented, respectively [37, 38]. In the present study, RSCU values were calculated for the 61 codons used in the coding region of the core genome to determine the codon usage bias (Supplementary Table S6). Six codons, namely TTA (2.238), TTG (1.686), CCA (1.828), CAA (1.618), AAA (1.602), and CGT (2.142), were overrepresented, i.e., RSCU value > 1.6 (RSCU values are given in parentheses). Seventeen codons, namely CTC (0.27), CTG (0.42), ATA (0.297), TCC (0.45), CCC (0.276), ACC (0.524), GCC (0.548), TAC (0.538), CAC (0.498), CAG (0.382), AAC (0.562), AAG (0.398), GAC (0.484), GAG (0.424), TGC (0.484), CGG (0.438), and AGG (0.252), were underrepresented, i.e., RSCU value < 0.6.
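The RSCU definition and the 1.6/0.6 thresholds given above can be illustrated with a short calculation. This is a sketch of the general formula and not the EMBOSS cusp implementation. The function names are hypothetical, and the standard codon assignments are assumed for the 61 sense codons.

```python
from collections import Counter
from typing import Dict, Mapping

# Standard codon assignments, built from the usual compact TCAG ordering.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in _BASES for b in _BASES for c in _BASES), _AMINO))


def count_codons(cds: str) -> Counter:
    """Count in-frame codons of a coding sequence (incomplete tail dropped)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def rscu(codon_counts: Mapping[str, int]) -> Dict[str, float]:
    """RSCU = observed count / (amino-acid total / number of synonymous codons)."""
    synonymous: Dict[str, list] = {}
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid != "*":                       # skip stop codons
            synonymous.setdefault(amino_acid, []).append(codon)
    values: Dict[str, float] = {}
    for codons in synonymous.values():
        total = sum(codon_counts.get(c, 0) for c in codons)
        if total == 0:
            continue                                # amino acid not observed
        expected = total / len(codons)
        for codon in codons:
            values[codon] = codon_counts.get(codon, 0) / expected
    return values


def usage_class(value: float) -> str:
    """Apply the thresholds quoted in the text."""
    if value > 1.6:
        return "overrepresented"
    if value < 0.6:
        return "underrepresented"
    return "within range"


if __name__ == "__main__":
    # Invented counts chosen so that the two lysine codons sum to 2,000.
    demo = rscu({"AAA": 1602, "AAG": 398})
    print({codon: (round(v, 3), usage_class(v)) for codon, v in demo.items()})
```

Because the RSCU values of the synonymous codons of one amino acid sum to the number of such codons, the two values of a two-codon amino acid always add up to 2.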

RSCU analysis is extensively used to standardize the analysis of codon usage bias. Here, 6 codons were overrepresented and 17 codons were underrepresented, indicating significant codon usage bias. Generally, it has been observed that codons which are used less by the host are selected in the course of evolution in order to avoid competition with the host cell during gene translation. In the RSCU analysis, the majority of the abundantly used codons ended in either ‘A’ or ‘T’ (Supplementary Table S6). Further, as stated above, the AT content was higher than the GC content, and similarly there was a preference for A/T over G/C at the third codon position. This indicates a positive correlation between nucleotide composition and codon composition. Thus, mutation pressure appears to be an important force affecting the codon usage bias.

Ka/Ks- positive gene selective pressure analysis

The Ka/Ks ratio, also known as the ω ratio, is used to approximate the balance between positive, purifying, and neutral selection acting on a set of homologous protein-coding genes. The selection pressure ratio (ω) is calculated by dividing the non-synonymous substitution rate (Ka) by the synonymous substitution rate (Ks), and it is used to infer the direction and magnitude of natural selection acting on protein coding genes. A Ka/Ks ratio < 1 implies purifying selection; Ka/Ks > 1 indicates positive or Darwinian selection (driving change); and Ka/Ks values close to 1 indicate neutral or relaxed selection (reference kimura.ivanova). All 2,196 protein coding sequences shared between the two strains, E. mundtii DSM 4838 and E. mundtii EMB156, were subjected to Ka/Ks analysis in order to identify genes under positive selection pressure. Nineteen genes were found to be under positive selection (Table 2). The genes undergoing positive selection play a significant role in the metabolic and physiological activities of the host. These genes help the bacterium to colonize effectively or adapt better to a particular environment.

Table 2. Protein IDs and functions of genes under positive selection pressure.

| Protein IDs of Enterococcus mundtii DSM4838 | Protein IDs of Enterococcus mundtii EMB156 | Ka/Ks | Protein description |
| --- | --- | --- | --- |
| lcl\|NZ_CP018061.1_cds_WP_071866700.1_260 | lcl\|NZ_CP022340.1_cds_WP_010736301.1_489 | 2.2961 | Hypothetical protein |
| lcl\|NZ_CP018061.1_cds_WP_071866694.1_227 | lcl\|NZ_CP022340.1_cds_WP_096081363.1_456 | 1.8764 | vWA domain-containing protein |
| lcl\|NZ_CP018061.1_cds_WP_071866167.1_3031 | lcl\|NZ_CP022340.1_cds_WP_010736522.1_279 | 1.7728 | Hypothetical protein |
| lcl\|NZ_CP018061.1_cds_WP_071867297.1_2008 | lcl\|NZ_CP022340.1_cds_WP_096081658.1_1973 | 1.6428 | TetR/AcrR family transcriptional regulator |
| lcl\|NZ_CP018061.1_cds_WP_179948050.1_885 | lcl\|NZ_CP022340.1_cds_WP_023519435.1_950 | 1.4363 | Hypothetical protein |
| lcl\|NZ_CP018061.1_cds_WP_023519140.1_293 | lcl\|NZ_CP022340.1_cds_WP_096081382.1_520 | 1.4221 | Galactose mutarotase |
| lcl\|NZ_CP018061.1_cds_WP_071866695.1_248 | lcl\|NZ_CP022340.1_cds_WP_010736313.1_477 | 1.2478 | Tagatose-bisphosphate aldolase |
| lcl\|NZ_CP018061.1_cds_WP_071866535.1_2601 | lcl\|NZ_CP022340.1_cds_WP_034691573.1_2620 | 1.2418 | Hypothetical protein |
| lcl\|NZ_CP018061.1_cds_WP_071867143.1_1048 | lcl\|NZ_CP022340.1_cds_WP_010735694.1_1082 | 1.2373 | Hypothetical protein |
| lcl\|NZ_CP018061.1_cds_WP_071867840.1_2312 | lcl\|NZ_CP022340.1_cds_WP_096081784.1_2342 | 1.2123 | DUF624 domain-containing protein |
| lcl\|NZ_CP018061.1_cds_WP_233433731.1_1971 | lcl\|NZ_CP022340.1_cds_WP_010734841.1_1937 | 1.156 | Hypothetical protein |
| lcl\|NZ_CP018061.1_cds_WP_071866706.1_278 | lcl\|NZ_CP022340.1_cds_WP_096081377.1_505 | 1.1501 | RsiV family protein |
| lcl\|NZ_CP018061.1_cds_WP_019722494.1_449 | lcl\|NZ_CP022340.1_cds_WP_096081408.1_635 | 1.1211 | Hypothetical protein |
| lcl\|NZ_CP018061.1_cds_WP_071868037.1_882 | lcl\|NZ_CP022340.1_cds_WP_010735831.1_947 | 1.0752 | Discoidin domain-containing protein |
| lcl\|NZ_CP018061.1_cds_WP_071866347.1_938 | lcl\|NZ_CP022340.1_cds_WP_010735791.1_986 | 1.0623 | Lysozyme family protein |
| lcl\|NZ_CP018061.1_cds_WP_071868082.1_2032 | lcl\|NZ_CP022340.1_cds_WP_010734783.1_1996 | 1.0247 | Type II toxin-antitoxin system RelE/ParE family toxin |
| lcl\|NZ_CP018061.1_cds_WP_071866602.1_213 | lcl\|NZ_CP022340.1_cds_WP_010736349.1_442 | 1.014 | Zinc ABC transporter substrate-binding protein AdcA |
| lcl\|NZ_CP018061.1_cds_WP_071866711.1_304 | lcl\|NZ_CP022340.1_cds_WP_010736250.1_531 | 1.0103 | S-(hydroxymethyl)glutathione dehydrogenase/class III alcohol dehydrogenase |
| lcl\|NZ_CP018061.1_cds_EM4838_RS11985_2351 | lcl\|NZ_CP022340.1_cds_CO646_RS12160_2380 | 1.0005 | HAD hydrolase family protein |

In this study, we have presented the pan-genome analysis of different strains of E. mundtii isolated from diverse habitats. The pan-genome plot clearly indicates that the size of the pan-genome increases with the inclusion of each new genome, and the plot has yet to reach a plateau even though the pan-genome size of the dataset has already exceeded 1.6 times the size of an average E. mundtii genome. On average, 67% of the protein-coding genes of any E. mundtii genome correspond to conserved core genes, while the remaining 33% belong to the accessory and unique gene categories. This exemplifies the need to consider a large number of genomes from diverse sources to get a better idea of the conserved and dispensable genes. However, the result might vary from one study to another depending upon how the studies were carried out: the origin of the genomes, the software used, and how the extrapolation was done. For instance, a closed pan-genome has been interpreted for Clostridium difficile with only 26 genomes taken into consideration [39]. Hence, it becomes difficult to comment upon the nature of the pan-genome, i.e., whether it is open or closed, for a particular bacterium. Further, we believe that the pan-genome of bacteria always remains open owing to the constantly changing environment, the loss or gain of genes to adapt to a particular environment, or horizontal gene transfer. It is interesting to observe that, even though all the genomes belong to the same species, the presence of a sizeable percentage of flexible genes (33%) implies that each E. mundtii genome has undergone changes such as loss or gain of genes in order to adapt to its host environment. The distribution of the functional categories (COG/KEGG) clearly indicates that core genes are involved in basic and essential functions of the microorganism, whereas flexible genes (unique as well as accessory genes) are the ones that would help the microorganism adapt to a specific habitat.
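The split into core, accessory, and unique gene families discussed above can be illustrated with a minimal Python sketch. It assumes the commonly used definitions (core: present in every genome; unique: present in exactly one genome; accessory: everything in between). The function name `partition_gene_families` and the toy presence/absence data are hypothetical and are not taken from the BPGA pipeline.

```python
from typing import Dict, Set


def partition_gene_families(
    presence: Dict[str, Set[str]], n_genomes: int
) -> Dict[str, Set[str]]:
    """Split gene families into core, accessory, and unique sets.

    `presence` maps a gene-family name to the set of genomes carrying it.
    Assumed definitions: core = present in all genomes, unique = present
    in exactly one genome, accessory = present in two or more but not all.
    """
    parts: Dict[str, Set[str]] = {"core": set(), "accessory": set(), "unique": set()}
    for family, genomes in presence.items():
        count = len(genomes)
        if count == n_genomes:
            parts["core"].add(family)
        elif count == 1:
            parts["unique"].add(family)
        elif count > 1:
            parts["accessory"].add(family)
    return parts


# Hypothetical toy data: three genomes (g1..g3) and three gene families.
toy_presence = {
    "fam_A": {"g1", "g2", "g3"},  # in every genome
    "fam_B": {"g1", "g2"},        # in some genomes
    "fam_C": {"g3"},              # in a single genome
}
parts = partition_gene_families(toy_presence, n_genomes=3)
for name in ("core", "accessory", "unique"):
    print(name, sorted(parts[name]))
```

Under these definitions every gene family falls into exactly one class, which is consistent with the counts reported in the abstract: 1843 core, 1762 accessory, and 898 unique gene families add up to the 4503 families of the pan-genome.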

Further, one of the major objectives of the current study was to explore whether any gene families are exclusively present in, or absent from, the genomes derived from a particular habitat, as they might have some function in the adaptation of the microorganism to that habitat. The result indicates both the presence and the absence of exclusive gene families in the genomes of a specific habitat. Microorganisms often exhibit alterations in genomic composition according to their specific environment [40]. Therefore, it can be concluded that ‘gain of function’ in addition to ‘loss of function’ has contributed significantly towards the adaptation of E. mundtii strains to a particular habitat. Further, the trend in the distribution of functional COG categories varies significantly among habitat-specific gene families, which clearly indicates habitat-driven changes in the genomic composition of the microorganism. Microorganisms are known to exhibit habitat-specific divergence or convergence in genomic composition. The genomic content, as well as the genes, of bacterial genomes obtained from different habitats might vary due to several factors such as environmental stress, horizontal gene transfer (HGT), genetic recombination, etc. [40-46].
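One way to make the notion of habitat-exclusive gene families concrete is the following Python sketch. It assumes a strict reading: a family is "exclusively present" if every genome of the habitat carries it and no other genome does, and "exclusively absent" if every other genome carries it and no genome of the habitat does. The exact criteria applied by the subset analysis in this study may differ, and the function name, habitat labels, and toy data are hypothetical.

```python
from typing import Dict, Set, Tuple


def habitat_exclusive_families(
    presence: Dict[str, Set[str]],
    habitat_of: Dict[str, str],
    habitat: str,
) -> Tuple[Set[str], Set[str]]:
    """Return (exclusively_present, exclusively_absent) gene families.

    `presence` maps gene family -> genomes carrying it;
    `habitat_of` maps genome -> habitat label.
    """
    inside = {g for g, h in habitat_of.items() if h == habitat}
    outside = set(habitat_of) - inside
    if not inside or not outside:
        raise ValueError("need at least one genome inside and outside the habitat")

    exclusively_present: Set[str] = set()
    exclusively_absent: Set[str] = set()
    for family, genomes in presence.items():
        if inside <= genomes and genomes.isdisjoint(outside):
            # carried by all genomes of the habitat and by no other genome
            exclusively_present.add(family)
        elif outside <= genomes and genomes.isdisjoint(inside):
            # carried by all other genomes and by none from the habitat
            exclusively_absent.add(family)
    return exclusively_present, exclusively_absent


# Hypothetical toy data: four genomes from three habitats.
toy_habitat = {"g1": "insect", "g2": "insect", "g3": "dairy", "g4": "soil"}
toy_presence = {
    "fam_A": {"g1", "g2", "g3", "g4"},  # core family
    "fam_B": {"g1", "g2"},              # only in insect-derived genomes
    "fam_C": {"g3", "g4"},              # missing only from insect-derived genomes
    "fam_D": {"g1", "g3"},              # mixed distribution
}
gained, lost = habitat_exclusive_families(toy_presence, toy_habitat, "insect")
print("exclusively present:", sorted(gained))
print("exclusively absent:", sorted(lost))
```

In this toy example, `fam_B` would be a candidate "gain" and `fam_C` a candidate "loss" for the insect habitat, while the core and mixed families are ignored.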

Codon usage as well as synonymous/non-synonymous substitutions play a significant role in genome evolution. The synonymous (Ks) and non-synonymous (Ka) substitution rates and the Ka/Ks ratio help to determine sequence divergence as well as positive, purifying, or neutral selection in protein-coding genes. In the majority of protein-coding genes, except rapidly evolving ones, synonymous substitutions (Ks) occur more frequently than non-synonymous substitutions (Ka) [47]. Positive selection on a protein-coding gene is indicative of adaptive evolution. It implies that the gene is beneficial for the host’s existence and helps it adapt to a particular niche or set of environmental conditions.
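The conventional reading of the Ka/Ks ratio can be sketched in a few lines of Python: a ratio above 1 is taken as a sign of positive selection, a ratio below 1 as purifying selection, and a ratio of about 1 as neutral evolution. The function name and the optional `tolerance` band around 1 are assumptions added for illustration. The first two ratios in the usage loop are taken from the Ka/Ks table above, and the third is a hypothetical value.

```python
def classify_selection(ka_ks: float, tolerance: float = 0.0) -> str:
    """Classify the selection regime implied by a Ka/Ks ratio.

    Ka/Ks > 1 -> positive selection, Ka/Ks < 1 -> purifying selection,
    Ka/Ks ~ 1 -> neutral. `tolerance` (an illustrative assumption) widens
    the band around 1 that is still treated as neutral.
    """
    if ka_ks > 1.0 + tolerance:
        return "positive"
    if ka_ks < 1.0 - tolerance:
        return "purifying"
    return "neutral"


# 2.2961 and 1.0005 appear in the table above; 0.25 is a hypothetical value.
for ratio in (2.2961, 1.0005, 0.25):
    print(ratio, classify_selection(ratio))
```

A ratio only marginally above 1, such as 1.0005, is a borderline case: with `tolerance=0.0` it is labelled positive, whereas a small non-zero tolerance would label it neutral.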

To conclude, in this study we have presented the pan-genome analysis of different E. mundtii strains isolated from diverse niches. This has led to a better understanding of this versatile genus. The study sheds light on the fact that even closely related strains of a species exhibit wide variation in their genomes owing to their ability to adapt to a particular environment. Further, the open pan-genome of E. mundtii suggests the need to sequence additional genomes to obtain a more comprehensive genome pool.

BPGA: Bacterial Pan-genome analysis Pipeline

AAI: Average Amino acid Identity

RSCU: Relative Synonymous Codon Usage

PAIs: Pathogenicity islands

We thank Dr. Vinod Kumar Gupta (Mayo Clinic, Rochester) for his valuable time, suggestions, and guidance that improved the manuscript.

  1. Holzapfel WH, Wood BJB. 2014. Lactic acid bacteria: Biodiversity and taxonomy. Wiley-Blackwell.
  2. Thiercelin ME, Jouhaud L. 1903. Reproduction de l'entérocoque; taches centrales; granulations péripheriques et microblastes. Comptes Rendus des Seances de la Societe de Biologie et de ses Filiales 55: 686-698.
  3. Andrewes FW, Horder TJ. 1906. A study of the Streptococci pathogenic for man. Lancet 168: 708-713.
  4. Schleifer KH, Kilpper-Balz R. 1984. Transfer of Streptococcus faecalis and Streptococcus faecium to the genus Enterococcus nom. rev. as Enterococcus faecalis comb. nov. and Enterococcus faecium comb. nov. Int. J. Syst. Bacteriol. 34: 31-34.
  5. Collins MD, Jones D, Farrow JAE, Kilpper-Balz R, Schleifer KH. 1984. Enterococcus avium nom. rev., comb. nov.; E. casseliflavus nom. rev., comb. nov.; E. durans nom. rev., comb. nov.; E. gallinarum comb. nov.; and E. malodoratus sp. nov. Int. J. Syst. Bacteriol. 34: 220-223.
  6. Naser SM, Vancanneyt M, De Graef E, Devriese LA, Snauwaert C, Lefebvre K, et al. 2005. Enterococcus canintestini sp. nov., from faecal samples of healthy dogs. Int. J. Syst. Evol. Microbiol. 55: 2177-2182.
  7. de Vaux A, Laguerre G, Divies C, Prevost H. 1998. Enterococcus asini sp. nov. isolated from the caecum of donkeys (Equus asinus). Int. J. Syst. Bacteriol. 48(Pt.2): 383-387.
  8. Lebreton F, Willems RJL, Gilmore MS. 2014. Enterococcus Diversity, Origins in Nature, and Gut Colonization. In: Gilmore, M. S., Clewell, D. B., Ike, Y., Shankar, N. (Eds.), Enterococci: From Commensals to Leading Causes of Drug Resistant Infection. Massachusetts Eye and Ear Infirmary, Boston, pp. 5-63.
  9. Sistek V, Maheux AF, Boissinot M, Bernard KA, Cantin P, Cleenwerck I, et al. 2012. Enterococcus ureasiticus sp. nov. and Enterococcus quebecensis sp. nov., isolated from water. Int. J. Syst. Evol. Microbiol. 62: 1314-1320.
  10. Foulquie Moreno MR, Sarantinopoulos P, Tsakalidou E, De Vuyst L. 2006. The role and application of Enterococci in food and health. Int. J. Food Microbiol. 106: 1-24.
  11. Giraffa G. 2003. Functionality of Enterococci in dairy products. Int. J. Food Microbiol. 88: 215-222.
  12. Franz CM, Huch M, Abriouel H, Holzapfel W, Galvez A. 2011. Enterococci as probiotics and their implications in food safety. Int. J. Food Microbiol. 151: 125-140.
  13. Moellering RC Jr. 1992. Emergence of Enterococcus as a significant pathogen. Clin. Infect. Dis. 14: 1173-1176.
  14. O'Driscoll T, Crank CW. 2015. Vancomycin-resistant Enterococcal infections: epidemiology, clinical manifestations, and optimal management. Infect. Drug Resist. 8: 217-230.
  15. Klein G. 2003. Taxonomy, ecology and antibiotic resistance of enterococci from food and the gastrointestinal tract. Int. J. Food Microbiol. 88: 123-131.
  16. Collins MD, Farrow JA, Jones D. 1986. Enterococcus mundtii sp. nov. Int. J. Systemat. Evol. Microbiol. 36: 8-12.
  17. Giraffa G, Carminati D, Neviani E. 1997. Enterococci isolated from dairy products: a review of risks and potential technological use. J. Food Protect. 60: 732-738.
  18. Espeche MC, Otero MC, Sesma F, Nader-Macias ME. 2009. Screening of surface properties and antagonistic substances production by lactic acid bacteria isolated from the mammary gland of healthy and mastitic cows. Vet. Microbiol. 135: 346-357.
  19. De Kwaadsteniet M, Todorov SD, Knoetze H, Dicks LM. 2005. Characterization of a 3944 Da bacteriocin, produced by Enterococcus mundtii ST15, with activity against Gram-positive and Gram-negative bacteria. Int. J. Food Microbiol. 105: 433-444.
  20. Ferreira AE, Canal N, Morales D, Fuentefria DB, Corção G. 2007. Characterization of enterocins produced by Enterococcus mundtii isolated from humans feces. Brazil. Arch. Biol. Technol. 50: 249-258.
  21. Settanni L, Valmorri S, Suzzi G, Corsetti A. 2008. The role of environmental factors and medium composition on bacteriocin-like inhibitory substances (BLIS) production by Enterococcus mundtii strains. Food Microbiol. 25: 722-728.
  22. Giraffa G. 1995. Enterococcal bacteriocins: their potential use as anti-Listeria factors in dairy technology. Food Microbiol. 12: 551-556.
  23. Franz CMAP, Holzapfel WH, Stiles ME. 1999. Enterococci at the crossroads of food safety? Intern. J. Food Microbiol. 47: 1-24.
  24. De Vuyst L, Foulquié Moreno M, Revets H. 2002. Screening for enterocins and detection of hemolysin and vancomycin resistance in enterococci of different origins. Intern. J. Food Microbiol. 2635: 1-20.
  25. Ferreira AE, Canal N, Morales D, Bopp D, Corcao G. 2007. Characterization of enterocins produced by Enterococcus mundtii isolated from humans feces. Braz. Arch. Biol. Technol. 50: 249-258.
  26. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2014. Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25: 1043-1055.
  27. Gurevich A, Saveliev V, Vyahhi N, Tesler G. 2013. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29: 1072-1075.
  28. Chaudhari N, Gupta V, Dutta C. 2016. BPGA-1.3.0- an ultra-fast pan-genome analysis pipeline. Sci. Rep. 6: 24373.
  29. Rodriguez-R LM, Konstantinidis KT. 2016. The enveomics collection: a toolbox for specialized analyses of microbial genomes and metagenomes. PeerJ. 4: e1900v1.
  30. Kumar S, Stecher G, Tamura K. 2016. MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets. Mol. Biol. Evol. 33: 1870-1874.
  31. Rice P, Longden I, Bleasby A. 2000. EMBOSS: The european molecular biology open software suite. Trends Gen. 16: 276-277.
  32. Wang D, Zhang Y, Zhang Z, et al. 2010. KaKs_Calculator 2.0: a toolkit incorporating gamma-series methods and sliding window strategies. Genom. Proteom. Bioinform. 8: 77-80.
  33. Thompson CC, Chimetto L, Edwards RA, Swings J, Stackebrandt E, Thompson FL. 2013. Microbial genomic taxonomy. BMC Genomics. doi: 10.1186/1471-2164-14-913.
  34. Singh NK, Tyagi A. 2017. A detailed analysis of codon usage patterns and influencing factors in Zika virus. Arch. Virol. 162: 1963-1973.
  35. Chen Y, Shi Y, Deng H, Gu T, Xu J, Ou J, et al. 2014. Characterization of the porcine epidemic diarrhea virus codon usage bias. Infect. Genet. Evol. 28: 95-100.
  36. Butt AM, Nasrullah I, Qamar R, Tong Y. 2016. Evolution of codon usage in Zika virus genomes is host and vector specific. Emerg. Microbes Infect. 5: e107.
  37. Singh RK, Pandey SP. 2017. Phylogenetic and evolutionary analysis of plant ARGONAUTES. Methods Mol. Biol. 1640: 267-294.
  38. Scaria J, Ponnala L, Janvilisri T, Yan W, Mueller LA, Chang YF. 2010. Analysis of ultra low genome conservation in Clostridium difficile. PLoS One 5: e15147.
  39. Dutta C, Paul S. 2012. Microbial lifestyle and genome signatures. Curr. Genomics 13: 153-162.
  40. Ochman H, Moran NA. 2001. Genes lost and genes found: evolution of bacterial pathogenesis and symbiosis. Science 292: 1096-1099.
  41. Toft C, Andersson SG. 2010. Evolutionary microbial genomics: insights into bacterial host adaptation. Nat. Rev. Genet. 11: 465-475.
  42. Dobrindt U, Hochhut B, Hentschel U, Hacker J. 2004. Genomic islands in pathogenic and environmental microorganisms. Nat. Rev. Microbiol. 2: 414-424.
  43. Ochman H, Lawrence JG, Groisman EA. 2000. Lateral gene transfer and the nature of bacterial innovation. Nature 405: 299-304.
  44. Didelot X, Maiden MC. 2010. Impact of recombination on bacterial evolution. Trends Microbiol. 18: 315-322.
  45. Lefebure T, Stanhope MJ. 2009. Pervasive, genome-wide positive selection leading to functional divergence in the bacterial genus Campylobacter. Genome Res. 19: 1224-1232.
  46. Liu LX, Li R, Worth JRP, Li X, Li P, Cameron KM, et al. 2017. The complete chloroplast genome of Chinese bayberry (Morella rubra, myricaceae): Implications for understanding the evolution of fagales. Front. Plant Sci. 8. doi: 10.3389/fpls.2017.00968.