Document Type

Thesis - Open Access

Award Date

2019

Degree Name

Master of Science (MS)

Department / School

Mathematics and Statistics

First Advisor

Xijin Ge

Keywords

Biological sequences data analysis, Deep learning, EMR data analysis, Feature selection, Machine learning

Abstract

Modern sequencing technology has revolutionized genomic research and triggered explosive growth in DNA, RNA, and protein sequence data. Inferring structure and function from biological sequences is a fundamentally important task in genomics and proteomics. With the development of statistical and machine learning methods, an integrated, user-friendly tool containing state-of-the-art data mining methods is needed. Here, we propose SeqFea-Learn, a comprehensive Python pipeline that integrates multiple steps for sequence analysis: feature extraction, dimensionality reduction, feature selection, and predictive model construction based on machine learning and deep learning approaches. We used enhancer, RNA N6-methyladenosine site, and protein-protein interaction datasets to validate the tool. The results show that the tool can effectively perform biological sequence analysis and classification tasks. This dissertation also applies machine learning algorithms to electronic medical record (EMR) data analysis. Chronic kidney disease (CKD) is prevalent across the world and is well defined by the estimated glomerular filtration rate (eGFR). The progression of kidney disease can be predicted if future eGFR can be accurately estimated using predictive analytics. Thus, I present a prediction model of eGFR built using Random Forest regression. The dataset includes demographic, clinical, and laboratory information from a regional primary health care clinic. The final model included eGFR, age, gender, body mass index (BMI), obesity, hypertension, and diabetes, and achieved a mean coefficient of determination of 0.95. The estimated eGFRs were used to classify patients into CKD stages with high macro-averaged and micro-averaged metrics.

Library of Congress Subject Headings

Machine learning.
Data mining.
Medical records.
Bioinformatics.
Computer algorithms.

Format

application/pdf

Number of Pages

Publisher

South Dakota State University

Recommended Citation

Gu, Shaopeng, "Applying Machine Learning Algorithms for the Analysis of Biological Sequences and Medical Records" (2019). Electronic Theses and Dissertations. 3666.
https://openprairie.sdstate.edu/etd/3666

Included in

Bioinformatics Commons, Statistics and Probability Commons

Open PRAIRIE: Open Public Research Access Institutional Repository and Information Exchange

Electronic Theses and Dissertations

Applying Machine Learning Algorithms for the Analysis of Biological Sequences and Medical Records
