First fungal genome sequence from Africa: A preliminary analysis

Authors: Brenda D. Wingfield1, Emma T. Steenkamp2, Quentin C. Santana1, Martin P.A. Coetzee1, Stefan Bam1, Irene Barnes1, Chrizelle W. Beukes2, Wai Yin Chan2, Lieschen de Vos1, Gerda Fourie2, Melanie Friend1, Thomas R. Gordon3, Darryl A. Herron2, Carson Holt4, Ian Korf5, Marija Kvas2, Simon H. Martin1, X. Osmond Mlonyeni1, Kershney Naidoo1, Mmatshepho M. Phasha2, Alisa Postma1, Oleg Reva6, Heidi Roos1, Melissa Simpson1, Stephanie Slinski3, Bernard Slippers1, Rene Sutherland2, Nicolaas A. van der Merwe1, Magriet A. van der Nest1, Stephanus N. Venter2, Pieter M. Wilken1, Mark Yandell4, Renate Zipfel1, Mike J. Wingfield1


Introduction

The target genome
The Ascomycete fungus Fusarium circinatum is the causal agent of pitch canker, a serious disease that affects numerous Pinus species worldwide.1 The term 'pitch canker' refers to the large resinous cankers that develop on roots, trunks, branches and reproductive organs of established or mature Pinus hosts (Figure 1). On seedlings, the pathogen mainly causes root and collar rot, which are also the symptoms that were observed in South Africa when this pathogen was first detected in 1990.2,3 In contrast to the situation in other parts of the world, F. circinatum remained a nursery pathogen after this first outbreak, and it was only in 2007 that it emerged as a major pathogen in plantations of susceptible Pinus species.4 Apart from the losses associated with the plantation outbreaks of pitch canker, F. circinatum-related mortality during plantation establishment has been estimated to exceed R10 million annually.5 The pitch canker fungus thus represents a serious threat to the future of the pine forestry industry in this country.
Relatively little is known regarding the genetics of F. circinatum, with the bulk of knowledge at this level relating to its phylogeny and diagnostics,6,7 as well as to its population biology.8,9 Previous studies have, for example, shown that F. circinatum is a heterothallic fungus capable of both sexual and asexual reproduction. Unlike the case for many other Ascomycete pathogens, both sexual and asexual reproduction of F. circinatum have been shown in regions of the world where the fungus has been introduced relatively recently.10,11,12 Furthermore, studies have also shown that the fungus probably originated in Mexico or Central America and that it has been accidentally introduced into pine-growing regions around the world.13 In all cases, however, these previous DNA-based studies have utilised information from either housekeeping loci or microsatellite-rich regions, which in most cases represent small or limited portions of the pathogen's genome.
Whole-genome analysis procedures such as genetic linkage mapping and genome sequence comparisons have increased our understanding of the genetic basis of various biological phenomena in fungi. Well-known examples include the development of spores in Pleurotus pulmonarius 14 and the development of ectomycorrhizal symbiosis in Laccaria bicolor.15 Such whole-genome approaches have also shed light on the evolution of fungal pathogenicity,16,17 which has been particularly true for Fusarium species such as Fusarium oxysporum, Fusarium verticillioides and Fusarium graminearum.18,19 The fact that the genomic data for these Fusarium species are in the public domain, and that a framework map is available for F. circinatum,20 therefore presents ideal opportunities to understand the genetic basis for pathogenicity in the pitch canker fungus.
The aim of this study was to sequence, assemble and annotate the genome of F. circinatum. In addition, we present a preliminary analysis of putative gene clusters that are unique to F. circinatum and we compare three loci of this genome with the genomes of three close relatives: F. oxysporum, F. verticillioides and F. graminearum (Figure 2). From a South African perspective, this study will have a significant impact, not only because the pitch canker pathogen is the first eukaryotic organism for which the entire genome has been sequenced in Africa, but also because the project strongly promotes human capacity development in the field of genome sequencing on the African continent. Furthermore, data emerging from this sequence will promote many studies concerning the pathogen and potentially lead to innovative approaches to reduce the losses that the pathogen is causing in South Africa and elsewhere in the world.

The sequence: Genome sequencing, assembly and integrity
In this study we specifically targeted an F. circinatum isolate (FSP34) for which a genetic linkage map based on amplified fragment length polymorphisms is available from a previous study.20 The availability of this framework map would thus provide some higher-level structure for the final genome assembly. High-quality DNA was isolated 23 and then [...] 25 In order to confirm the integrity of the assembled F. circinatum genome, we interrogated the assembly for the presence and order of the open reading frames (ORFs) known to be encoded at the mating type (MAT) locus of this fungus.
From previous research it is known that the mating type of F. circinatum isolate FSP34 is MAT-1.20,26 Within the F. circinatum assembly we thus expected to find three MAT-1 ORFs (MAT 1.1.1, MAT 1.1.2 and MAT 1.1.3) and the entire region to be flanked by genes encoding a cytoskeleton assembly control protein (SLA1) and a DNA lyase (APN1).25,27 Local Basic Local Alignment Search Tool (BLAST) analysis of the assembly indicated that a single contig (Contig00012) contained MAT-1 sequences. Examination of this 25 000 bp contig confirmed the presence of the genes, in both the same orientation and order as those found in other Fusarium species (Figure 3). This process of verification was repeated on two additional contigs (data not shown) containing genes of interest, and it likewise confirmed the accuracy of the assembly.
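The kind of locus check described above can be illustrated with a minimal sketch. It assumes that BLAST hits for the genes of interest have already been reduced to start and end coordinates on a single contig; the `Hit` record, the `flanked_by` helper and all coordinates are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    gene: str   # name of the query gene
    start: int  # leftmost coordinate of the hit on the contig
    end: int    # rightmost coordinate of the hit on the contig


def flanked_by(hits, flank_a, flank_b, inner):
    """Return True if every gene in `inner` lies between the two flanking genes."""
    ordered = [h.gene for h in sorted(hits, key=lambda h: h.start)]
    if flank_a not in ordered or flank_b not in ordered:
        return False
    i, j = sorted((ordered.index(flank_a), ordered.index(flank_b)))
    return set(inner) <= set(ordered[i + 1:j])


# Hypothetical coordinates, for illustration only.
hits = [
    Hit("SLA1", 2000, 5500),
    Hit("MAT 1.1.3", 6100, 6900),
    Hit("MAT 1.1.2", 7300, 8600),
    Hit("MAT 1.1.1", 9000, 10100),
    Hit("APN1", 10800, 12400),
]
mat_genes = ["MAT 1.1.1", "MAT 1.1.2", "MAT 1.1.3"]
locus_ok = flanked_by(hits, "SLA1", "APN1", mat_genes)
print(locus_ok)
```

The sketch only tests that the MAT ORFs sit between the two flanking genes; comparing gene orientation would additionally require strand information for each hit.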
The completeness of the F. circinatum genome sequence was determined by subjecting the sequence to the CEGMA (Core Eukaryotic Genes Mapping Approach) pipeline.28 A defined set of conserved protein families known to occur in all eukaryotes was used for the analysis.28 This procedure also allows for the production of an initial set of reliable gene annotations in a eukaryotic genome, even in draft form. The analysis revealed that the F. circinatum genome sequence assembly included the large majority (95%) of the genes common to other eukaryotes. On this measure, the assembled F. circinatum genome was thus at least 95% complete. Future studies will seek to verify whether the missing genes are indeed not encoded by the pitch canker fungus.
Fusarium circinatum is a haploid fungus and the isolate sequenced was established from a single spore. Therefore, as opposed to diploid or polyploid organisms, only a single allele would be found at any particular locus in the genome. This simplifies the genome assembly process for haploid species, which generally require less sequence coverage to produce an accurate assembly. Based on the estimated size of the F. circinatum genome and the amount of sequence information generated, an 11X sequence coverage was obtained. We were, therefore, confident that the genome of F. circinatum had been sequenced close to completeness and that the accuracy and integrity of the assembly were as good as could reasonably be expected.
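The completeness estimate amounts to asking what fraction of a reference set of core families is detected in the assembly. The following is a minimal sketch of that arithmetic only; it does not call CEGMA, and the family identifiers and counts are hypothetical placeholders chosen to reproduce a 95% result.

```python
def completeness(core_families, detected):
    """Return (percentage of core families detected, sorted list of missing families)."""
    core = set(core_families)
    if not core:
        raise ValueError("the core family set must not be empty")
    found = core & set(detected)
    missing = sorted(core - found)
    return 100.0 * len(found) / len(core), missing


# Hypothetical set of 20 core families, 19 of which are detected.
core = [f"CEG{i:03d}" for i in range(1, 21)]
detected = core[:19]
pct, missing = completeness(core, detected)
print(f"{pct:.1f}% complete; missing: {missing}")
```

A list of the missing families, as returned here, is the natural starting point for checking whether those genes are truly absent or were simply not assembled.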
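The coverage calculation referred to above can be sketched as follows. Fold coverage is the total number of sequenced bases divided by the estimated genome size. As an added illustration that is not part of the original analysis, the idealised Lander-Waterman (Poisson) model gives the expected fraction of the genome left unsequenced as e^(-c) at coverage c. The genome size and base count below are hypothetical values chosen to give 11X.

```python
import math


def fold_coverage(total_bases, genome_size):
    """Average number of times each base of the genome is sequenced."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def expected_uncovered_fraction(coverage):
    """Poisson (Lander-Waterman) estimate of the fraction of bases with zero reads."""
    return math.exp(-coverage)


# Hypothetical inputs, for illustration only.
genome_size = 44_000_000   # assumed genome size in bp
total_bases = 484_000_000  # assumed total sequenced bases
c = fold_coverage(total_bases, genome_size)
print(f"coverage: {c:.1f}X")
print(f"expected uncovered fraction: {expected_uncovered_fraction(c):.2e}")
```

The Poisson model assumes that reads are sampled uniformly at random, so it says nothing about regions that are sequenced but cannot be assembled, such as repeats.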

Gene annotation and curation
Although computer annotation of genomes has progressed substantially in the last decade, the robustness of genome annotations is still dependent on 'gene calling' programs, each of which has inherent strengths and weaknesses. Most are also designed for animals or plants with genome and gene architectures that are significantly different from those of fungi. In this study, the MAKER annotation pipeline 29 was used because it is designed particularly to deal with eukaryotic genomes smaller than 100 Mb. For ab initio ORF predictions, MAKER utilised the programs Genemark ES,30 Augustus 31 and SNAP.32 To streamline the ORF prediction process, the MAKER pipeline also used genome data available for F. verticillioides, F. oxysporum and F. graminearum.
In addition, some expressed sequence tag (EST) sequence data were included (data not shown) to refine the identification of intron-exon boundaries. After several rounds of annotation to train MAKER, thereby improving its gene calling, approximately 15 000 ORFs were identified in the F. circinatum assembly (Table 2).
Whilst computer annotation programs have become substantially more sophisticated, final annotations typically need to be done manually, which currently presents the most substantial obstacle for all genome projects.33,34 In this study, we used the program Apollo 35 to manually annotate and curate the F. circinatum genome. Apollo can directly utilise the sequence output from MAKER and this program also has the advantage of being relatively user friendly for biologists not familiar with computer programming. In addition to utilising manual curation for the F. circinatum annotation, we followed the novel strategy of engaging students as annotators in the process. This approach was adopted because curating a simple eukaryotic genome requires little more than a basic degree in the biological sciences with some molecular biology focus. By following this approach we were able to achieve our second aim of promoting human capacity in the field of genome annotation in South Africa. A team of 20 graduate student volunteers was identified for this study. The students then attended a 2-day training course in which the theoretical background of gene and genome structure was reinforced and the basic concepts and requirements of the annotation process were learned.
All the annotators were supplied with a number of contigs to curate and a support programme was implemented to assist those annotators who encountered problems. In most cases the learning curve for members of the annotation team was considerable, but tackling the annotation process in this way clearly highlighted the value of genome sequences to a biological sciences programme. The project not only fostered an appreciation of the methodologies and approaches associated with genome sequencing projects, but also provided a large number of graduate students with the opportunity to become experienced in the process of genome sequence annotation.
During the curation, each predicted ORF was compared with the predicted genes from the genomes of F. verticillioides, F. oxysporum and F. graminearum. It was immediately obvious that about 70% of the F. circinatum ORFs were most similar to those of F. verticillioides, which is consistent with the fact that these two fungi are more closely related to one another than to the other two species (Figure 2). In many cases, when an F. circinatum ORF was not most similar to one in F. verticillioides, the dissimilarity was found to be the result of differences in intron prediction between the two genomes. Although the ORFs in F. verticillioides have been annotated using FGENESH,36 which also utilises a hidden Markov model-based algorithm to find genes, the genome of this fungus has not been subject to much manual annotation. Also, the numbers of predicted ORFs in F. circinatum and F. verticillioides differed considerably. Compared to F. circinatum, which has about 15 000 ORFs, F. verticillioides contains only about 13 500 ORFs. Of the ORFs apparently missing in F. verticillioides, a significant proportion had, in fact, not been annotated, despite the availability of EST evidence in many cases. This absence suggests that the annotation of F. verticillioides as presented on the Broad Institute website 25 requires additional analyses, which would probably increase the number of predicted ORFs in this genome by as much as 5%.
Note (Figure 3): As F. graminearum is a homothallic species, this locus contains both the mating type loci and thus, in addition to the genes MAT 1-1-1, MAT 1-1-2 and MAT 1-1-3, the MAT 2 genes are also present (MAT 1.2. [...]
Inspection of the annotated output for the F. circinatum assembly revealed further discrepancies amongst the results of the different prediction programs employed by MAKER. For example, Genemark predicted 15 713 ORFs, whilst Augustus predicted 14 210 ORFs. By manually curating the annotation, it was thus possible to evaluate the various ORF prediction outputs of the pipeline in terms of intron-exon boundaries and EST evidence for F. circinatum and the other Fusarium species. After the manual curation, the F. circinatum assembly contained 15 049 predicted ORFs, with an accuracy of at least 90% for the combined gene prediction of these two programs.
From the curation it was also observed that the contigs most often terminated in intergenic regions. Although this could be ascribed to the reduced ability of the gene prediction programs to find ORFs in the absence of 3' or 5' gene signatures, the CEGMA output indicated that more than 95% of the core eukaryotic genes were present in the F. circinatum assembly. A more likely explanation is that the assembly program Newbler was not able to assemble across DNA repeat regions, which are most often found in the intergenic regions.
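The species-level comparison made during curation, in which each F. circinatum ORF is assigned to the Fusarium species holding its most similar predicted gene, can be sketched as a best-hit tally. This is an illustrative sketch rather than the procedure used in the study: the input tuples of (query ORF, subject species, bit score) and all values are hypothetical.

```python
from collections import Counter


def best_hit_species(hits):
    """Count, per species, the query ORFs whose highest-scoring hit is in that species."""
    best = {}
    for query, species, bitscore in hits:
        if query not in best or bitscore > best[query][1]:
            best[query] = (species, bitscore)
    return Counter(species for species, _ in best.values())


# Hypothetical hits, for illustration only.
hits = [
    ("orf1", "F. verticillioides", 900.0),
    ("orf1", "F. oxysporum", 850.0),
    ("orf2", "F. oxysporum", 400.0),
    ("orf2", "F. verticillioides", 390.0),
    ("orf3", "F. verticillioides", 700.0),
    ("orf3", "F. graminearum", 300.0),
]
counts = best_hit_species(hits)
total = sum(counts.values())
for species, n in counts.most_common():
    print(f"{species}: {n}/{total} ORFs ({100 * n / total:.0f}%)")
```

Ties are resolved in favour of the first hit encountered; a real analysis would need an explicit rule for near-identical scores, since differences in intron prediction alone can shift which species scores best.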

Analysis of unique gene clusters
Reciprocal BLAST analyses were used to compare the predicted ORFs in the F. circinatum genome to those of the other Fusarium species. Within the resulting set of 2599 ORFs unique to F. circinatum (i.e. present in F. circinatum and absent from one or more of the other three Fusarium genomes) we identified 1031 ORFs that occurred next to each other in clusters of 4 or more. The BLAST function of the cDNA Annotation System (dCAS) v1.4.3 was then used to compare our 'unique' set of 1031 ORFs to the Pfam database (http://pfam.sanger.ac.uk/). dCAS uses the BLAST executable and BLAST databases (of which Pfam is one) to find regions of local similarity between sequences in these databases and the user's target sequence. Within the list of protein families identified amongst our 'unique' ORFs (Online Supplementary Tables 1 and 2), those with possible carbohydrate-active enzyme (CAZy) properties were identified using the CAZy database (http://www.cazy.org).
The KEGG BRITE database (http://www.genome.jp/kegg/brite.html) classifications were used to group these families into classes, although a significant proportion of the ORFs could not be placed into any class using the KEGG database (Online Supplementary Table 2).
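The two selection steps described in this section, identifying ORFs that are absent from at least one comparison genome and then keeping runs of four or more adjacent unique ORFs, can be sketched as follows. The function names, the input layout (ORF identifiers in genomic order along one contig, plus the set of ORFs with a reciprocal best hit in each comparison species) and the toy data are assumptions made for illustration.

```python
from itertools import groupby


def unique_orfs(all_orfs, orthologs_by_species):
    """ORFs lacking a reciprocal best hit in one or more of the comparison genomes."""
    return {
        orf for orf in all_orfs
        if any(orf not in found for found in orthologs_by_species.values())
    }


def unique_clusters(ordered_orfs, unique, min_size=4):
    """Runs of at least `min_size` adjacent unique ORFs along one contig."""
    clusters = []
    for is_unique, run in groupby(ordered_orfs, key=lambda orf: orf in unique):
        run = list(run)
        if is_unique and len(run) >= min_size:
            clusters.append(run)
    return clusters


# Hypothetical contig with ten ORFs in genomic order.
contig = [f"g{i}" for i in range(1, 11)]
orthologs = {
    "F. verticillioides": {"g1", "g6", "g7", "g10"},
    "F. oxysporum": {"g1", "g2", "g6", "g7", "g10"},
    "F. graminearum": {"g1", "g6", "g7", "g8", "g10"},
}
unique = unique_orfs(contig, orthologs)
clusters = unique_clusters(contig, unique)
print(clusters)
```

In this toy example g8 and g9 are unique but form a run of only two, so only the run from g2 to g5 qualifies as a cluster.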

Comparison of two mycotoxin gene clusters
Fusarium species are widely known for the range of secondary metabolites or mycotoxins that they produce.37,38 Amongst the species for which genome sequence information is available, F. verticillioides and F. graminearum are highly toxigenic, with each capable of producing a range of mycotoxins.37,39 F. verticillioides is particularly known for producing high levels of fumonisins and fusaric acid, and F. graminearum for producing high levels of trichothecenes and zearalenone.37,39 In contrast, F. oxysporum and F. circinatum are not considered to be highly toxigenic, although some strains of F. circinatum produce beauvericin and some F. oxysporum strains produce trichothecenes and other compounds.37,38,39 The genes encoding the structural and regulatory elements involved in the biosynthesis of these toxic metabolites are usually clustered within the genomes of these species.38 In this study, we compared the genomic structure and organisation of the fumonisin and fusarin C gene clusters amongst the four Fusarium species.
The fumonisin gene cluster has been well characterised in F. verticillioides.38 By making use of this information, we were able to compare the organisation and composition of this cluster in the genomes of F. verticillioides, F. oxysporum, F. circinatum and F. graminearum (Figure 4). Our results confirm previous reports that these genes are absent in both F. circinatum and F. graminearum, neither of which has ever been shown to produce fumonisins.37,38,39,40 This gene cluster is also missing from the genome of the isolate of F. oxysporum for which the genome is available, but this locus is known to be present in another isolate (strain FRC O-1890) of the same species.41 It is interesting that one of the genes (ORF 20) flanking this locus is in a different orientation in F. oxysporum and F. circinatum and entirely missing from the F. graminearum genome. In addition, in F. circinatum ORF 20 is placed between the genes Znf1 and Zdb1, whilst these two genes are alongside each other in F. verticillioides, F. oxysporum and F. graminearum. These rearrangements and deletions thus suggest that recombination, especially in the regions flanking the cluster, determines whether fumonisin will be produced. Such recombination events could potentially facilitate horizontal transfer of this cluster amongst unrelated strains, which has been suggested to explain the patchy distribution of fumonisin production amongst lineages of Fusarium.40
A number of Fusarium species have been shown to produce the mycotoxin fusarin C, although its role in disease has not been established.37,38 Comparison of the fusarin C gene cluster in the four Fusarium species revealed its presence in F. circinatum, as well as in F. verticillioides and F. graminearum (Figure 5). Within the cluster, the gene order in F. circinatum is different from that found in F. verticillioides and F. graminearum, where the gene order is similar. In all instances the regions flanking this cluster were also unique. The fact that the locus seems to occur in a different position in the genomes of F. verticillioides, F. graminearum and F. circinatum thus suggests that this gene cluster has been translocated more than once during the evolution of this fungal genus. In the case of F. circinatum, this translocation has been accompanied by a change in the gene order.
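Locus comparisons of this kind reduce to three questions: which genes are missing, which are in the opposite orientation, and whether the shared genes occur in the same order. The sketch below assumes each locus is supplied as a list of (gene, strand) pairs read in the same overall direction; the `compare_loci` helper and the strand assignments are hypothetical, and only the relative positions of Znf1, Zdb1 and ORF 20 follow the description in the text.

```python
def compare_loci(reference, other):
    """Compare two loci given as ordered lists of (gene, strand) pairs."""
    ref_strand = dict(reference)
    other_strand = dict(other)
    shared = [g for g, _ in reference if g in other_strand]
    missing = [g for g, _ in reference if g not in other_strand]
    flipped = [g for g in shared if ref_strand[g] != other_strand[g]]
    other_order = [g for g, _ in other if g in ref_strand]
    return {
        "missing": missing,
        "flipped": flipped,
        "same_order": shared == other_order,
    }


# Gene order follows the text; strands are hypothetical placeholders.
f_verticillioides = [("Znf1", "+"), ("Zdb1", "+"), ("ORF 20", "+")]
f_circinatum = [("Znf1", "+"), ("ORF 20", "-"), ("Zdb1", "+")]
f_graminearum = [("Znf1", "+"), ("Zdb1", "+")]

print(compare_loci(f_verticillioides, f_circinatum))
print(compare_loci(f_verticillioides, f_graminearum))
```

Because strand labels depend on how each contig happens to be oriented in an assembly, a real comparison would first need to place both loci in a common orientation.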

Discussion
This investigation had two core areas of focus. Scientifically, the aim was to sequence and annotate the genome of a fungal pathogen that is highly relevant in South Africa. The results of our preliminary comparisons represent a solid foundation of data that will significantly promote research aimed at a better understanding and management of the impact of pitch canker of pines in South Africa and elsewhere in the world.
The second and equally important focus area was strongly educational, as our intention was to involve a relatively large number of students and researchers, for the first time, in a genome sequencing project. The underlying aim here was to promote an interest in this field of growing importance and to build human capacity that will contribute to similar projects in South Africa in the future.
The genome of an isolate of the pitch canker pathogen, F. circinatum, was sequenced and manually annotated. A more detailed analysis of the genome will be published during the coming year. The sequence data are available at GenBank (BioProject: PRJNA41113; ID: 41113; Locus Tag Prefix: FCIRG) and on the FABI website (http://www.fabinet.up.ac.za/genomes). The F. circinatum genome, whilst being in the expected size range for a Fusarium genome, has almost 1000 more protein coding genes than its closest relative, F. verticillioides. F. oxysporum has 3000 more protein coding genes than F. verticillioides and many of these have been proposed to originate from the acquisition of lineage-specific genomic regions.19 Understanding the origin of the additional 1000 genes in F. circinatum will add to this intriguing hypothesis.
Fusarium species are well known for their production of mycotoxins. Whilst the study of mycotoxins is of particular importance for species that contaminate food and feed stocks, these toxins have also been shown to be important in plant pathogenesis.38 It is perhaps most noteworthy that, in the case of both the fumonisin and fusarin C gene clusters, significant deletions or insertions and rearrangements have occurred. There is also increasing evidence of horizontal gene transfer in fungi 19 and it is likely that there would be particularly strong selective pressure in terms of acquiring the ability to produce mycotoxins. Further analysis of the F. circinatum genome will focus on this possibility.
Analysis of the unique gene clusters identified in F. circinatum, whilst interesting, did not reveal any significant surprises. The genome of this fungus contains a large number of proteases or peptidases, lyases and transferases that deserve further investigation. For example, these proteins could be involved in the synthesis of secondary metabolites, which in turn have the potential to be involved in pathogenesis. Amongst the four Fusarium genomes compared, F. circinatum is unique in being a gymnosperm and tree pathogen. The other three Fusarium species are typically pathogens of angiosperm monocotyledonous crops and we thus expect that F. circinatum could have a unique set of genes involved in pathogenesis.
A number of viral proteins were also found in the gene clusters, suggesting the presence of viral genomes within the F. circinatum genome. We did not filter for transposable elements and some of the genes identified as viral proteins could in fact represent elements associated with transposons. There is also precedent for the presence of both retroviruses and transposons in fungal genomes.42
The availability of the F. circinatum genome will strongly promote various projects currently being undertaken on the pathogen. For example, variation in pathogenicity was observed in the interspecies cross between F. circinatum and Fusarium subglutinans, which was used by De Vos et al.20 to produce a genetic linkage map. The linkage map and the genome sequence will together be used to study some genome regions that are potentially associated with pathogenicity. These studies will include further comparative analyses of the genome for mycotoxin biosynthetic gene clusters and other secondary metabolic gene clusters. In addition, quantitative trait loci have been identified and linked to the growth of F. circinatum (De Vos, unpublished data). With the availability of the F. circinatum genome sequence, the genetic components of these loci can now be investigated, and this will add to knowledge regarding mycelial growth in fungi.
Genome sequencing is rapidly growing in importance and there is little question that this field will impact increasingly on most aspects of biology. This growth is a logical continuation of the situation that existed little more than two decades ago when the ability to sequence relatively small numbers of genes began to influence the field. The number of complete genome sequences has increased from one to close to 1000 in just 15 years.43 Apart from next-generation sequencing, which has substantially influenced this growth, new technological developments will ensure that this process continues, ultimately resulting in broad applications relating to genome sequences. This investigation, promoting the sequencing and annotation of a relatively small, but very significant genome, will undoubtedly come to represent a milestone in genome sequencing in South Africa.
An important element of genome sequencing is that the bulk of the work lies in annotating and interpreting the data. The entry point to this process is the laboratory component, the physical sequencing process. In South African terms, this is still relatively expensive, even for genomes as small as that of F. circinatum. However, the costs are dropping rapidly and suites of genomes, or the genomes of numerous strains of single species, are now being sequenced in consortium projects interested in comparative genomics.
Whilst the wet laboratory work might be somewhat beyond the budgets of many South African laboratories, there are substantial opportunities for local scientists and students with knowledge of genome annotation to utilise genome data that is already freely available. In the future, there will be even larger volumes of genome data available for study. Some of the scientists involved in the investigation presented here will be well positioned to capitalise on this and related opportunities.

FIGURE 1: Symptoms of pitch canker caused by Fusarium circinatum on Pinus radiata near Cape Town, South Africa: (a) dying trees in a plantation, (b) a dead branch in the crown of an infected tree, (c) stem canker resulting from infection of a pruning wound and (d) resin (pitch) impregnated wood.

FIGURE 3: Diagrammatic comparison of the mating type (MAT) locus in Fusarium circinatum, F. verticillioides, F. oxysporum and F. graminearum, based on the available genome sequences.

FIGURE 5: Diagrammatic comparison of the fusarin locus in Fusarium circinatum, F. verticillioides, F. oxysporum and F. graminearum, based on the available genome sequences.

TABLE 1: Metrics for the assembly of the Fusarium circinatum genome. The metrics represent some average and range statistics with regard to the genome coverage and assembly, but for contigs larger than 500 bp only.

FIGURE 2: Evolutionary relationships amongst the Fusarium species for which whole-genome sequence information is available. Relationships are based on those presented by Chaverri et al.21 and Geiser et al.22

TABLE 2: Genome statistics for four species of Fusarium.
†, Data from Ma et al.
A comparative analysis of the F. circinatum genome with the other Fusarium genomes currently available has enabled us to determine that this fungus does not contain the fumonisin gene cluster, although the genes for the synthesis of fusarin C are present. Whether or not the pitch canker pathogen is capable of producing this compound remains to be determined. Also, the role of fusarin C has not been established in human or animal disease, nor has its role in plant pathogenesis been determined.

FIGURE 4: Diagrammatic comparison of the FUM locus in Fusarium circinatum, F. verticillioides, F. oxysporum and F. graminearum, based on the available genome sequences. Note: Gene sizes do not correspond to actual nucleotide length. FSG, Fusarium graminearum [Gibberella zeae] open reading frames.