This Is Auburn: Electronic Theses and Dissertations

Assembly of 500,000 Inter-Specific Catfish Expressed Sequence Tags and Large Scale Gene-Associated Marker Development for Genome Selection Studies

Date

2009-08-05

Author

Wang, Shaolin

Type of Degree

dissertation

Department

Fisheries and Allied Aquacultures

Abstract

Expressed Sequence Tag (EST) sequencing is one of the most efficient means for gene discovery and gene expression profiling. With a good resource of ESTs, a large number of molecular markers can be identified, and issues related to alternative splicing and differential polyadenylation can be addressed at the genome-wide scale. Through the Community Sequencing Program, a catfish EST sequencing project was selected by the DOE’s Joint Genome Institute (JGI). In this project, a total of 12 cDNA libraries were constructed, including eight from channel catfish (Ictalurus punctatus) and four from blue catfish (I. furcatus). A total of 600,000 sequencing attempts were made, generating 438,321 quality ESTs. Together with the ESTs previously deposited in GenBank, this project brings the total number of catfish ESTs to nearly 500,000. The JGI EST sequencing had an overall sequencing success rate of 73% with an average length of 576 bp. All the ESTs were assembled using CAP3, resulting in 111,578 unique sequences, including 45,306 contigs and 66,272 singletons. Of these unique sequences, over 35% had significant similarities to known genes by BLASTX searches, which allowed the identification of 14,776 unique genes in the catfish. A total of 1,350 and 849 full-length cDNAs have been identified from channel catfish and blue catfish, respectively. The ESTs are an enormous resource for SNP identification. The quality assessment parameters for EST-derived SNPs were established based on a pilot study with 384 SNPs. In order to select reliable SNPs, contigs containing four or more ESTs should be used and the minor allele sequence should be represented at least twice. Genotyping primers should be designed from a single exon, completely avoiding introns. Application of such quality assessment measures, along with large resources of ESTs, should provide an effective means for SNP identification in species where genome sequence resources are lacking.
Over 300,000 putative SNPs have been identified, of which over 48,000 are high-quality SNPs, defined as those found in contigs of at least four sequences with the minor allele present at least twice in the contig. The EST resource should also be valuable for the identification of microsatellites and for comparative genome analysis. This large-scale EST sequencing project should allow the identification of the majority of the catfish transcriptome. The parallel analysis of ESTs from the two closely related ictalurid catfishes should also provide a powerful means for the evaluation of ancient and recent gene duplications, and for the development of high-density microarrays in catfish. The inter- and intra-specific SNPs identified from the assembly of the full catfish EST dataset will greatly benefit catfish introgression breeding and selection as well as whole-genome association studies. All ESTs have been deposited in GenBank.
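The two SNP quality criteria stated in the abstract (a contig of at least four ESTs, and a minor allele observed at least twice) can be illustrated with a minimal sketch. This is not the dissertation's actual pipeline: the function name `is_high_quality_snp`, its inputs, and the choice to ignore gap and ambiguous calls are assumptions made only for illustration.

```python
from collections import Counter

# Thresholds taken from the abstract.
MIN_CONTIG_ESTS = 4          # contig must contain four or more ESTs
MIN_MINOR_ALLELE_READS = 2   # minor allele must be represented at least twice

# Assumption: only unambiguous base calls count toward allele support.
VALID_BASES = {"A", "C", "G", "T"}


def is_high_quality_snp(contig_est_count, base_calls):
    """Return True if a putative SNP passes both quality criteria.

    contig_est_count: number of ESTs assembled into the contig.
    base_calls: base observed at the SNP position in each EST covering it
                (hypothetical input; gaps and ambiguous calls are skipped).
    """
    if contig_est_count < MIN_CONTIG_ESTS:
        return False

    counts = Counter(
        base.upper() for base in base_calls if base.upper() in VALID_BASES
    )
    if len(counts) < 2:
        # Only one allele observed, so the site is not polymorphic.
        return False

    # The second most frequent base is treated as the minor allele.
    minor_allele_count = counts.most_common(2)[1][1]
    return minor_allele_count >= MIN_MINOR_ALLELE_READS
```

For example, a site in a five-EST contig with calls `A, A, A, G, G` passes, while the same contig with calls `A, A, A, A, G` does not, because the minor allele appears only once and cannot be distinguished from a sequencing error.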