Document Type

Thesis - Open Access

Award Date

2021

Degree Name

Master of Science (MS)

Department / School

Mathematics and Statistics

First Advisor

Xijin Ge

Keywords

differential gene expression analysis, differential gene expression methods, false discovery rate, negative binomial distribution, RNA-seq, Tukey test

Abstract

RNA-sequencing (RNA-seq) has rapidly become a standard tool in genome-wide transcriptomic studies. It provides a way to understand the RNA environment of cells in different physiological or pathological states and to determine how cells respond to these changes. RNA-seq provides quantitative information about the abundance of the different RNA species present in a given sample. If the difference in read counts or expression level observed between two experimental conditions is statistically significant, the gene is declared differentially expressed. Many methods for detecting differentially expressed genes (DEGs) from RNA-seq data have been developed, including methods based on negative binomial models (edgeR, DESeq and baySeq), non-parametric approaches (NOISeq and SAMseq), transformations of gene-level read counts for linear modeling with Limma, and transcript-based detection methods that also enable gene-level differential expression reports (Cuffdiff 2, EBSeq and TSPM). Several recent studies have compared software packages for detecting differential expression. Some of these packages can detect DEGs by comparing a single sample with a control. It is necessary to compare these methods in order to find the most efficient and accurate one. S. R. Zaim, C. Kenost, J. Berghout, et al. proposed an “all-against-one” framework and compared it with eight single-subject methods (NOISeq, DEGseq, edgeR, mixture model, DESeq, DESeq2, iDEG, and ensemble) for identifying DEGs from single-subject RNA-seq data. They reported that the methods performed differently under different conditions, and that it remained difficult for a single method to achieve both high precision and high recall. Differential expression analysis requires a comparison of gene expression values between samples. However, replicates are sometimes hard to obtain, for example when only a single sample can be obtained from a cancer patient.
Hence it is necessary to study methods for detecting DEGs without replicates. We focused on comparing the log fold change, edgeR, NOISeq, iDEG and ACDtool methods. The log fold change method directly yields the magnitude of the change in expression, which is an advantage in research concerned with the absolute size of a differential expression. However, choosing an appropriate threshold is more difficult. The edgeR method uses empirical Bayes estimation and exact tests based on the negative binomial distribution to identify differentially expressed genes. It moderates the degree of overdispersion across genes and uses an exact test, similar to Fisher's exact test but adapted to overdispersed data, to assess the differential expression of each gene. The NOISeq method provides various diagnostic plots to identify sources of bias in RNA-seq data and to apply an appropriate normalization procedure in each case. It is more effective in avoiding false positives, at the cost of some sensitivity. The iDEG method is designed for assessing single-subject differential gene expression. Its algorithm models read counts with a re-parameterized negative binomial distribution and applies a variance-stabilizing transformation to each gene in order to identify the DEG set. The ACDtool is a fully revamped version of the Audic-Claverie (AC) test, adapted to the diverse and much larger data sets produced by contemporary omics techniques. Under the null hypothesis that the tag counts are generated from Poisson distributions with equal means (or means proportional to the respective sample sizes), this approach returns the probability that the compared samples contain the same proportion of the event.
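As an illustration only, and not the code of the thesis or of ACDtool, the two simplest statistics described above can be sketched in Python. The function names, the pseudocount of 1, and the choice of reporting the smaller one-sided tail are assumptions made for this sketch. The probability formula is the standard Audic-Claverie conditional distribution p(y | x) for y reads in the second library given x reads in the first, with library sizes n1 and n2.

```python
import math


def log2_fold_change(count_control, count_sample, pseudocount=1.0):
    """log2 ratio of sample to control counts.

    The pseudocount is an assumption made here to avoid log(0).
    """
    return math.log2((count_sample + pseudocount) / (count_control + pseudocount))


def audic_claverie_pmf(y, x, n1, n2):
    """P(y | x): probability of y reads in sample 2 given x reads in sample 1.

    n1 and n2 are the library sizes; computed in log space for stability.
    """
    q = n2 / n1
    log_p = (
        y * math.log(q)
        + math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
        - (x + y + 1) * math.log1p(q)
    )
    return math.exp(log_p)


def audic_claverie_pvalue(x, y, n1, n2):
    """One-sided tail probability of a count at least as extreme as y.

    Taking the smaller of the two tails is a simplification chosen here.
    """
    lower = sum(audic_claverie_pmf(k, x, n1, n2) for k in range(y + 1))
    upper = 1.0 - (lower - audic_claverie_pmf(y, x, n1, n2))
    return max(0.0, min(lower, upper, 1.0))


if __name__ == "__main__":
    # Hypothetical counts for one gene in two libraries of equal size.
    print(log2_fold_change(10, 100))
    print(audic_claverie_pvalue(10, 100, 1_000_000, 1_000_000))
```

A gene would then be called differentially expressed when the absolute log fold change exceeds a chosen threshold, or when the tail probability falls below a chosen significance level; both cut-offs are left to the analyst.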
Using the data set from the SEQC project, together with the gene expression levels of the samples measured by RT-PCR, we compared several methods for detecting differentially expressed genes in single samples by their performance on receiver operating characteristic curves. We evaluated them in three ways: 1) against the differentially expressed genes obtained by applying Limma to the genes with RT-PCR data; 2) against the differentially expressed genes obtained by applying DESeq2 to all genes; and 3) with an experimental method to compare the false positive rates. We conclude that the iDEG method gives the lowest false positive rate, at the expense of sensitivity. Although the edgeR and simple fold change methods give higher false positive rates than the iDEG method, they achieve the best trade-off and hence are the most reliable and efficient of the methods we studied for single-sample RNA-seq data.
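The kind of receiver operating characteristic comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code of the thesis: each gene is assumed to have a score from a detection method (higher meaning stronger evidence of differential expression) and a binary reference label, and the function names and toy values are hypothetical.

```python
def roc_curve_points(scores, labels):
    """Return (fpr, tpr) lists from sweeping a threshold from high to low scores.

    labels are 1 for reference DEGs and 0 otherwise; tied scores are
    processed together so that they contribute a single point.
    """
    n_pos = sum(1 for lab in labels if lab)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
        i = j
    return fpr, tpr


def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for k in range(1, len(fpr)):
        area += (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
    return area


if __name__ == "__main__":
    # Toy example: four genes, two of them reference DEGs.
    toy_scores = [0.9, 0.8, 0.7, 0.6]
    toy_labels = [1, 0, 1, 0]
    fpr, tpr = roc_curve_points(toy_scores, toy_labels)
    print(area_under_curve(fpr, tpr))
```

A method with a low false positive rate but reduced sensitivity would produce a curve that rises steeply at first and then flattens, which is the trade-off the comparison is meant to expose.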

Library of Congress Subject Headings

Gene expression -- Data processing.
RNA -- Analysis.
Nucleotide sequence.

Number of Pages

82

Publisher

South Dakota State University

Rights Statement

In Copyright