Estimating Cell-Type Profiles and Cell-Type Proportions in Heterogeneous Gene Expression Data
Date
2012-05-16
Type of Degree
thesis
Department
Mathematics and Statistics
Abstract
Understanding the mechanisms underlying natural variation in gene expression is an important problem in medical and evolutionary genetics. Many studies aim to compare either (i) cell-type expression profiles across individuals for the same cell-type, or (ii) cell-type expression profiles within an individual for different cell-types (NIH 2012). Naturally, accurate estimates of these expression profiles are of great importance. However, heterogeneity of cell-types in gene expression data can result in inaccurate estimates of such cell-type expression profiles (Leek and Storey 2007). The standard statistical method for analyzing gene expression data is a simple linear regression model, with the assumption that the presence of minor alleles in the genotype has an additive effect on gene expression levels (Veyrieras 2008). This method assumes that the observed gene expression data has a homogeneous composition of a single cell-type. However, there are many scenarios where it may be more appropriate to assume that observed gene expression data is composed of two cell-types; for example, a brain tissue sample would presumably contain a heterogeneous mixture of neuron and glial cell-types (GeneNetwork 2012). Previous studies have developed methodologies for estimating cell-type expression profiles given prior information regarding individual cell-type proportions or, conversely, for estimating cell-type proportions given prior knowledge of cell-type expression profiles. This thesis derives a computational method for estimating both the cell-type expression profiles and the individual cell-type proportions for a two cell-type model, without any prior information. The parameter estimation techniques are based on an alternating-regression least-squares process. This methodology is applied to both simulated data and a real dataset, and the results are examined.
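The abstract does not spell out the estimation procedure, so the following is only a minimal sketch of what a generic alternating least-squares scheme for a two cell-type mixture could look like, not the thesis's actual algorithm. It assumes the simplified model Y[i, g] = p[i] * a[g] + (1 - p[i]) * b[g] + noise, where p[i] is the proportion of the first cell-type in individual i and a[g], b[g] are the two cell-type expression levels for gene g. The function name `fit_two_cell_type_als`, its parameters, and the random starting values are all hypothetical, and the genotype effect mentioned in the abstract is deliberately left out.

```python
import numpy as np


def fit_two_cell_type_als(Y, n_iter=200, tol=1e-10, seed=0):
    """Alternating least squares for an assumed two-component mixture.

    Y is an (n_samples, n_genes) expression matrix. Returns the estimated
    proportions p (n_samples,) and profiles (2, n_genes), where row 0 is
    the first cell-type and row 1 the second.
    """
    rng = np.random.default_rng(seed)
    n_samples, _ = Y.shape
    # Arbitrary starting proportions (an assumption of this sketch).
    p = rng.uniform(0.2, 0.8, size=n_samples)
    prev_rss = np.inf
    for _ in range(n_iter):
        # Step 1: hold proportions fixed and regress every gene on [p, 1 - p].
        X = np.column_stack([p, 1.0 - p])
        profiles, *_ = np.linalg.lstsq(X, Y, rcond=None)
        a, b = profiles
        # Step 2: hold profiles fixed; each p[i] has a closed-form
        # least-squares solution, clipped so it stays a valid proportion.
        d = a - b
        denom = d @ d
        if denom == 0.0:
            break  # identical profiles carry no information about p
        p = np.clip((Y - b) @ d / denom, 0.0, 1.0)
        # Stop once the residual sum of squares no longer improves.
        rss = np.sum((Y - np.outer(p, a) - np.outer(1.0 - p, b)) ** 2)
        if prev_rss - rss < tol:
            break
        prev_rss = rss
    return p, profiles
```

Each step minimizes the residual sum of squares over one block of parameters while the other is held fixed, so the fit cannot get worse from one iteration to the next. Note that without further constraints the decomposition is not unique: swapping the two cell-type labels, or rescaling and shifting the proportions while adjusting the profiles to compensate, can give an equally good fit, so estimates from a sketch like this are best compared with the truth up to such transformations.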