Indel mechanisms and dynamics, and their influences on genome size in Gossypium

Current Project: Genome evolution in diploid and polyploid cotton

Previous work on genome size evolution in Gossypium

Comparison of the homoeologous CesA region sequences from G. hirsutum

Comparison of the AdhA region from G. hirsutum (A T and D T), G. raimondii (D), and G. arboreum (A)

Detailed analysis of the indels occurring in the CesA and AdhA regions of A T, D T, D, and A genomes for all branches

Global genomic architecture in diploid Gossypium

 

Genome evolution in diploid and polyploid cotton

(from our grant project summary) Given the prevalence of polyploidy in plants and its significance to plant evolution, developing an enhanced understanding of genomic interactions and processes of genomic change in the polyploid nucleus has fundamental importance to many areas of biology and crop productivity. Here we seek to partially redress this gap in our knowledge, proposing studies that will address the tempo (e.g., stochastically regular vs. episodic bursts), directionality (both phylogenetically and with respect to genomic contraction or expansion), and absolute scale of genome size change in diploid and polyploid plants. The cotton genus (Gossypium) is particularly well suited to addressing these questions, with direct relevance to the productivity and quality of one of the world’s leading crops. We will compare divergent diploid (A-genome and D-genome) and allopolyploid (AD-genome) members of the cotton genus (Gossypium), noting that these diploids have genomes that vary twofold in size. Our analyses will be phylogenetically informed through the inclusion of the outgroup species Gossypioides kirkii, an approach we argue is essential for maximizing inference of scope and scale of genome evolution, and for evaluating the relative roles of the various responsible genetic mechanisms. With the D genome well-characterized and entered into the sequencing queue, a genome-wide genetically anchored scaffold of A-genome BACs is the next logical step toward the long-term goal of characterizing genomic diversity in the genus, thereby enabling a host of future studies. 
We will build the conceptual framework needed to understand the architecture of the highly repetitive A-genome, an essential prelude to unraveling the unique features that distinguish cultivated allopolyploid cottons from their diploid progenitors, by identification and sequencing of a set of approximately 100 BACs, systematically sampling diverse genomic contexts to gain insight into the structure and evolution of DNA elements that account for the twofold difference in the size of the A- and D-genomes since their divergence from a common ancestor, and to clarify the types and levels of intergenomic evolution that have occurred since the joining of two widely divergent diploid genomes in a common polyploid nucleus. These data will be supplemented by comparative FISH and phylogenetic analyses that, in conjunction with the BAC sequence data, will provide unparalleled insight into the evolutionary processes responsible for genome evolution during diploid divergence and following polyploid formation.

Current progress: Thus far, all twenty G. raimondii BACs have been sequenced, and the remaining eighty (G. arboreum [20], G. hirsutum [20 A T, 20 D T], G. kirkii [20]) are in various stages of sequencing. Once sequencing is complete, comparative analysis of these twenty loci can begin.

Loci selected from chromosomes 2 and 8 in G. raimondii.

Comparison of the homoeologous CesA region sequences from G. hirsutum

Published in Genome Research (Grover et al., 2004), the 105 kb region surrounding cellulose synthase (CesA1; pictured below) from the two genomes that comprise G. hirsutum (A T and D T) was our first look into genome size evolution in Gossypium. This region was fairly gene rich, with 14 shared genes predicted along its length, an average gene density slightly lower than that of Arabidopsis but similar to that of rice. Also predicted were four retrotransposons and two DNA transposons, with only one retrotransposon and one DNA transposon shared between the two genomes.

Our first, and perhaps most striking, observation from this region was the extraordinary conservation of intergenic space, both in sequence and in length. The conservation demonstrated by this region contrasted with the dogma established by prior microcolinearity studies, all of which displayed little to no conservation of intergenic space. While it was tempting to attribute this lack of divergence to the relative youth of the genus, reports from the grasses indicated that 11 million years is sufficient to remove homology outside of genes, and in some cases only 0.5 to 1 million years is required.

Given the considerable conservation of intergenic space, we suspected that the mechanisms that generated the two-fold genome size difference between the A and D genomes were not operating differentially in this region, which was confirmed by subsequent analyses. Overall, this region did not lead us any closer to uncovering the mechanisms operating to affect genome size in Gossypium; however, it did highlight a property of genome size evolution in Gossypium. This region demonstrated that genome size evolution in Gossypium must be the result of heterogeneously operating mechanisms that serve to expand or contract certain regions of the genomes, while others remain relatively unscathed.

http://www.eeob.iastate.edu/faculty/WendelJ/images/CesA.jpg

The blue boxes on the diagram indicate predicted genes, while the green indicates shared intergenic space. The grey boxes indicate intergenic space that is unique to that genome. The retrotransposons are noted individually (rTE), and triangles denote predicted LTRs. DNA transposable elements are listed individually (POGO and Mutator), as is a cpDNA insertion of ycf2 origin. The middle panel indicates a continuous window of sequence identity between the two BACs, scaled from 50% to 100%.
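The identity track in the middle panel can be approximated with a simple sliding-window calculation. The following is a minimal sketch, not the method used for the published figure: the function names, window size, step size, and the treatment of gaps as mismatches are all assumptions made for illustration.

```python
def windowed_identity(seq_a, seq_b, window=100, step=50):
    """Percent identity of two aligned, equal-length sequences in sliding windows.

    Columns where either sequence has a gap ('-') are counted as mismatches;
    this is an assumption of the sketch, not necessarily the published method.
    Returns a list of (window_start, percent_identity) tuples.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(seq_a) - window + 1, step):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        results.append((start, 100.0 * matches / window))
    return results


def clip_to_plot_range(identity, floor=50.0, ceiling=100.0):
    """Clamp an identity value to the 50-100% range shown on the diagram."""
    return max(floor, min(ceiling, identity))


# Toy aligned sequences (hypothetical), two non-overlapping windows of 4 columns.
toy = windowed_identity("ACGTACGT", "ACGTACGA", window=4, step=4)
```

On real BAC alignments the window would be far larger than in this toy call; the choice of window and step controls how smooth the identity track appears.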

Comparison of the AdhA region from G. hirsutum (A T and D T ), G. raimondii (D), and G. arboreum (A)

As the CesA1 region raised as many questions as it answered, we sequenced a second region, surrounding the gene encoding alcohol dehydrogenase A (AdhA; pictured below and published in Grover et al., The Plant Journal, 2007), from the two genomes of the tetraploid, as before, and also from the model diploid progenitors, whose resources had become available.

In this comparison, ~100 kb of shared sequence was obtained from the A and A T genomes and ~50 kb from the D and D T genomes, sizes that reflect the overall differences in genome size. The gene density of the region was about one-half to one-third of that observed in the CesA1 region. The major difference between the AdhA region and the CesA1 region was the accumulation of transposable elements in the A and A T genomes, particularly gypsy elements (red).

This region was congruent with what we would expect based upon genome size, considering that transposable element accumulation is generally a primary contributor to genome size evolution. Here we see nearly five times the number and length of TEs in the A genomes as in the D genomes (~25-32 kb in A and A T versus ~5-7 kb in D and D T). There was evidence for one event of intra-strand homologous recombination in the gypsy-rich region of the A T genome, which corresponds with the slightly less than additive size of the tetraploid. Increased illegitimate recombination, which was observed in this region for the polyploid genomes relative to the diploid genomes, may further contribute to the “genomic down-sizing” experienced by the polyploid. All four genomes were evaluated for evidence of a bias in small indels, using those indels that could be polarized (i.e., those occurring after diploid-polyploid divergence). This region did display a biased accumulation, such that the smaller genomes averaged more frequent and longer deletions than the larger genomes. Further exaggerating this bias was the tendency for the A-genome diploid to acquire longer insertions more frequently as well.

From this analysis, our supposition that genome size evolution was heterogeneous was further confirmed, even occurring among what could be considered gene islands. The primary contributor to genome size change in the region was, inarguably, expansion in the A and A T genomes via transposable element proliferation, namely of gypsy elements. Realistically, however, this mechanism accounted for only about half of the observed difference in the region (about 25 of the 50 kb difference). The rest of this genome size difference was likely due to a variety of contributors, some of which were as yet unknown. The analysis did indicate that a bias in small indels and increased illegitimate recombination in the polyploid may contribute to genome size differences in this genus and warrant further investigation.
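The "nearly five times" and "about half" figures can be checked with a short back-of-the-envelope calculation. This sketch uses only the approximate values quoted above; taking the midpoint of each quoted TE range is an assumption of the sketch, and the variable names are illustrative.

```python
# Approximate values quoted in the text, in kb.
region_kb = {"A": 100.0, "D": 50.0}
te_kb_range = {"A": (25.0, 32.0), "D": (5.0, 7.0)}


def midpoint(value_range):
    """Midpoint of a (low, high) range; an assumption used to get one number."""
    low, high = value_range
    return (low + high) / 2.0


te_a = midpoint(te_kb_range["A"])   # 28.5 kb
te_d = midpoint(te_kb_range["D"])   # 6.0 kb

size_difference = region_kb["A"] - region_kb["D"]   # 50 kb
te_difference = te_a - te_d                         # 22.5 kb

te_fold = te_a / te_d                       # how many times more TE sequence in A
te_share = te_difference / size_difference  # fraction of the size gap explained by TEs

print(f"TE fold difference (A vs D): {te_fold:.2f}")
print(f"Share of size difference explained by TEs: {te_share:.0%}")
```

With these midpoints the fold difference is 4.75 and the TE share is 45%, consistent with the rounded statements in the text.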


Multiple alignment of orthologous AdhA BACs from four different genomes (A, D, A T and D T; the latter two are co-resident in the nucleus of polyploid cottons). Numbered blue boxes are predicted genes; copia elements are in orange, gypsy elements in red and LINE elements in pink. Identifiable long terminal repeats (LTRs) are depicted by triangles. Continuous windows of sequence identity are shown between each pair of BACs, with that in the middle illustrating sequence identity between the two BAC pairs (A and A T versus D and D T); all are scaled from 50 to 100%. Grey diamonds on the identity plots denote the location of large (>400 bp), unpolarized indels between the diploid progenitor and respective polyploid genome. The scale bar at the bottom indicates increments of 10 kb.

Detailed analysis of the indels occurring in the CesA and AdhA regions of A T , D T , D, and A genomes for all branches

A reanalysis of both regions, with the addition of the newly available outgroup resources, provided the ability to address genome size evolution on the ancestral branches (i.e. before diploid-polyploid divergence), as well as on the tips (A, D, A T, and D T alone). By addressing the evolution on the ancestral branches separately from the tips, we were able to calculate the rates at which these genomes have expanded or contracted over time, as well as due to specific mechanisms. Using those calculated rates, we began to map genome growth and reduction onto the phylogeny of Gossypium to shed light on the history of genome size change in Gossypium.
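The role of the outgroup in polarizing indels can be sketched as a simple parsimony rule. This is an illustration of the general logic only, not the analysis pipeline used in the study; the function name and the presence/absence encoding are assumptions.

```python
def polarize_indel(present_in_taxon, present_in_sister, present_in_outgroup):
    """Classify a length difference on the focal taxon's branch.

    Each argument is True if the sequence segment is present in that genome,
    False if absent; the outgroup may be None when it cannot be aligned.
    Returns 'insertion', 'deletion', 'none' (no change on the focal branch),
    or 'unpolarized' (the direction cannot be inferred).
    """
    if present_in_taxon == present_in_sister:
        return "none"          # the two sister genomes agree at this site
    if present_in_outgroup is None:
        return "unpolarized"   # no outgroup state to compare against
    if present_in_taxon == present_in_outgroup:
        return "none"          # focal taxon is ancestral; change is on the sister branch
    # Focal taxon differs from both sister and outgroup: the change is its own.
    return "insertion" if present_in_taxon else "deletion"


# Hypothetical example: segment present only in the focal genome.
example = polarize_indel(True, False, False)
```

Applying the same rule at successively deeper nodes is what allows events to be placed on ancestral branches rather than only on the tips.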

Mechanisms affecting genome size in Gossypium: From the regions analyzed, we see that transposable elements have affected genome size to varying degrees along four of the six branches analyzed (pictured below), while the removal of transposable elements via intra-strand homologous recombination was detected only once. Illegitimate recombination was present on all six branches, and was biased either toward gain ( ↑ ) or loss ( ↓ ) in each genome. In addition, for all six of the branches analyzed, some insertions and deletions could not be assigned to a category, leaving them in a lumped section (“unknown”) that could also be biased toward gain or loss in each genome. (Note: this “unknown” category could represent some of the other mechanisms noted that have lost their hallmarks due to subsequent evolution and are now unrecognizable, as well as mechanisms that do not leave hallmarks that we are aware of.) In examining these mechanisms with respect to relative impact along each branch, we saw that there was no single dominant mechanism of change for the genus. Some branches had clear “winners” that accounted for a vast majority of the change (e.g., the A/A T ancestral branch), while on other branches the change was more evenly spread among mechanisms (e.g., the D/D T ancestral branch); however, all mechanisms of genome size change currently implicated in other systems have operated in shaping the genomes of Gossypium, although to varying degrees.

http://www.eeob.iastate.edu/faculty/WendelJ/images/Mechanisms.jpg

Mechanisms affecting genome size during the evolution of Gossypium species are listed along each branch. The percentages after the mechanism names indicate the proportion of genomic turnover attributed to that mechanism for that branch.

Rates of genome size evolution in Gossypium: From our earlier data we realized that genome size evolution in Gossypium was subject to regional phenomena; data concerning the rates of evolution for each region and genome allowed us to better quantify this observation. As shown in the figure below, not only do many of the rates differ by an order of magnitude between regions within a genome, but for half of the branches the regions also differ in direction (expansion or contraction), and that direction was not consistent across genomes (i.e., neither the AdhA nor the CesA region consistently expanded or contracted across all branches). This indicates that not only are there regional differences within a genome, allowing some regions to expand while others contract, but also that those differences or biases need not remain static over evolutionary time (i.e., a region that is expanding along one branch may end up contracting on a later branch).

The trend in genome size for Gossypium is typically toward growth, with the notable exception of the A T genome. The diploid branches all experienced growth, as did the D T genome, although notably slower. Also notable were the rate changes between the ancestral and extant branches; in both cases, the rate of change on the ancestral branch is faster than the rate of change in the tips of the tree. The most significant amount of genome size change took place during the ancestral genome evolution, which was not unexpected due to the relative amount of time spent on that branch.

Further information concerning the analysis and the conclusions drawn from these data (including rates partitioned by region, genome, and mechanism) can be found in Grover, MBE, 2008; the main conclusions are as follows. For Gossypium, the diploid genomes appear to have achieved their difference in size primarily through growth. All diploid branches experienced growth, while the polyploid (as a whole) experienced contraction, which is consistent with its less than additive genome size. The rate of growth itself (in nt per year) slowed (or reversed) in all cases after diploid-polyploid divergence. This may be due in part to the episodic nature of TEs, whose proliferation in the A/A T ancestor accounts for most of the genome size differences between the four extant genomes; however, there is no mechanism in Gossypium that consistently operates to effect the most change. Intra-strand homologous recombination, for example, had the greatest effect on the A T genome and was responsible, in large part, for the contraction of the polyploid sequence, although increased illegitimate recombination in the A T and D T genomes (relative to the ancestor) contributed as well.


The rates of genome loss and gain as inferred by the combined indel data for the CesA and AdhA regions. The evolutionary relationship and times of divergence between the model diploid progenitors (G. arboreum (A) and G. raimondii (D)), and the true parents to the polyploid, as well as their subsequent reunion in the polyploid (AD) are shown. Branch lengths reflect time, and branch thickness indicates change in genome size (filled denotes sequence gain; open indicates sequence loss). Gossypium diverged from the outgroup (Gossypioides kirkii, 1C = 590 Mb) approximately 10–15 mya, and A-genome and D-genome cottons diverged from each other approximately 6.8 mya. The genome groups evolved independently for 5.2 and 4.2 my, respectively, before the model diploid progenitors diverged from the actual (and extinct) parents of the polyploid 1.6 and 2.6 mya for the A and D genomes, respectively. Approximately 1.3 mya, the A and D genomes were reunited in a polyploid nucleus, whose genome size is slightly less than the sum of the 2 model parents. Overall rates of genome size change are represented by the first line in the green boxes, whereas the individual regional rates are listed independently underneath. Rates of deletion (d), non-TE insertions (i), and TE insertions (TE) are also listed in the gray boxes.
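The branch durations in the caption follow directly from the quoted divergence dates, and a rate in nt per year is simply net change divided by branch duration. The sketch below illustrates that arithmetic; the function names are illustrative, and the net-change value in the usage line is a purely hypothetical input, not a result from the study.

```python
# Dates (million years ago) quoted in the figure caption.
AD_SPLIT_MYA = 6.8
PROGENITOR_SPLIT_MYA = {"A": 1.6, "D": 2.6}
POLYPLOID_FORMATION_MYA = 1.3


def ancestral_branch_my(genome):
    """Time (my) from the A/D split to divergence of the model progenitor."""
    return AD_SPLIT_MYA - PROGENITOR_SPLIT_MYA[genome]


def rate_nt_per_year(net_change_nt, duration_my):
    """Net size change per year; negative values indicate contraction."""
    return net_change_nt / (duration_my * 1_000_000)


for genome in ("A", "D"):
    print(genome, "ancestral branch:", round(ancestral_branch_my(genome), 1), "my")

# Hypothetical input: a net gain of 5,200 nt along the A ancestral branch.
hypothetical_rate = rate_nt_per_year(5200, ancestral_branch_my("A"))
```

The computed ancestral branches are 5.2 my (A) and 4.2 my (D), matching the caption. Because the ancestral branches are several times longer than the tips, the same rate would produce a much larger absolute change there.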

Global genomic architecture in diploid Gossypium

To discover the genomic components responsible for genome size variation in Gossypium, we generated sequence data from whole genome shotgun (WGS) libraries for three representative diploid members that range about 3-fold in DNA content and one outgroup species, Gossypioides kirkii (Hawkins et al., Genome Research, 2006). Approximately 0.2% of the haploid genome from each species was sequenced, resulting in a total of almost 12 Mb of sequence information.

| Taxon (genome group) | Genome size (Mb) | Clones in library | Successfully sequenced | Average read (bp) | % genome sequenced | Mb sequenced |
|---|---|---|---|---|---|---|
| Gossypioides kirkii (outgroup) | 588 | 1920 | 1535 | 753 | 0.20 | 1.15 |
| Gossypium raimondii (D genome) | 880 | 3072 | 2815 | 770 | 0.25 | 2.17 |
| Gossypium herbaceum (A genome) | 1667 | 6048 | 4994 | 704 | 0.21 | 3.52 |
| Gossypium exiguum (K genome) | 2460 | 10368 | 6980 | 704 | 0.20 | 4.91 |
| Total | | | | | | 11.75 |
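The last two columns of the table can be derived from the others: Mb sequenced is the number of successful reads times the average read length, and percent coverage is that amount divided by genome size. The following sketch reproduces the arithmetic from the tabulated values; the dictionary layout and function names are illustrative.

```python
# (genome size in Mb, successfully sequenced clones, average read length in bp)
libraries = {
    "Gossypioides kirkii": (588, 1535, 753),
    "Gossypium raimondii": (880, 2815, 770),
    "Gossypium herbaceum": (1667, 4994, 704),
    "Gossypium exiguum": (2460, 6980, 704),
}


def sequenced_mb(reads, avg_read_bp):
    """Total sequence obtained, in Mb."""
    return reads * avg_read_bp / 1_000_000


def percent_of_genome(mb_sequenced, genome_mb):
    """Sequenced amount as a percentage of the haploid genome."""
    return 100.0 * mb_sequenced / genome_mb


total_mb = 0.0
for taxon, (genome_mb, reads, avg_bp) in libraries.items():
    mb = sequenced_mb(reads, avg_bp)
    total_mb += mb
    print(f"{taxon}: {mb:.2f} Mb, {percent_of_genome(mb, genome_mb):.2f}% of genome")

print(f"Total: {total_mb:.2f} Mb")
```

The computed values agree with the table to within rounding, and the total comes to about 11.75 Mb.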

 

Sequences were queried against GenBank using BLASTX and against each other using BLASTN, and repetitive sequences were subsequently classified into gypsy-like, copia-like, LINE-like, Mutator-like, hAT-like, En/Spm-like, tandem repeats, and unknown repetitive classes. Copy numbers for each class were estimated using a novel modeling approach.

Congruent with results from plant taxa studied to date, we found that the majority of the Gossypium genome consists of dispersed repetitive sequences. Copy number and density estimates including all dispersed repeats indicate that a minimum of 40-65% of each genome is composed of transposable elements. In agreement with results from other well-studied taxa, the majority of this repetitive fraction consists of Class I retrotransposons, particularly gypsy-like sequences. Class II DNA transposons comprised only a minor fraction of the Gossypium genomes (~2%) (Fig. 1). Additionally, there was no significant variation in copy number among tandem repeats.

FIGURE 1

http://www.eeob.iastate.edu/faculty/WendelJ/images/TE-copy-number-chart.jpg

  

A key conclusion of this analysis is that different types of repetitive sequences have accumulated at different rates in different plant lineages. Excellent examples are provided by the gypsy-like sequences designated “Gorge” (for Gossypium retrotransposon gypsy-like element). Phylogenetic analysis of 373 gypsy-like reverse transcriptase sequences assembled from the four WGS libraries revealed three distinct classes, designated Gorge1, Gorge2, and Gorge3 (Fig. 2). Gorge1 is similar to the Arabidopsis thaliana gypsy sequence Athila, Gorge2 is similar to maize Cinful, and Gorge3 is similar to dea1 from Ananas comosus and del1-46 from Lilium henryi. Copy number calculations for the three types of sequences revealed relatively stable copy numbers for Gorge1 and Gorge2 across all four species, but a profound increase in copy number of Gorge3 in the larger-genome species, suggesting that differential, lineage-specific amplification of transposable elements occurs not only among different repetitive families, but also among different clades of elements within each family of retrotransposons.

FIGURE 2

http://www.eeob.iastate.edu/faculty/WendelJ/images/Gorge-tree_high.jpg