
Schedule
2018
Sunday, February 11th
12:00 PM

Check-in/Registration

South Dakota State University

McCrory Gardens Foyer

12:00 PM - 5:00 PM

1:00 PM

Workshop 1: Data Science in the Cloud with Microsoft Azure ML

Ryan Swanstrom, Unify Consulting

McCrory Gardens Great Hall

1:00 PM - 5:00 PM

Azure ML is a web-based tool for machine learning. It has a simple drag and drop interface, yet it is still powerful enough to integrate with R and/or Python. This course will teach you how to create experiments with Azure ML. This course is laboratory based and will include time to build machine learning experiments. You will leave the course with everything you created.

1:00 PM

Workshop 2: Python for Data Science

David Zeng, South Dakota State University

McCrory Gardens Meeting Room

1:00 PM - 5:00 PM

Outline of the Workshop
1. Setup and Python/Jupyter Notebook Basics
2. 10 minutes to Pandas/DataFrame (actually about 30 mins)

5:00 PM

Banquet

South Dakota State University

McCrory Gardens Great Hall

5:00 PM - 6:00 PM

6:00 PM

Social Time

South Dakota State University

McCrory Gardens Great Hall

6:00 PM - 6:30 PM

6:30 PM

Dinner

South Dakota State University

McCrory Gardens Great Hall

6:30 PM - 8:00 PM

7:15 PM

Welcome

South Dakota State University

McCrory Gardens Great Hall

7:15 PM - 7:30 PM

7:30 PM

Keynote: Data and Analytics - What you don't know can HELP you

Steve Cross, Great West Casualty

McCrory Gardens Great Hall

7:30 PM - 8:30 PM

Monday, February 12th
7:30 AM

Check-in / Luggage Check

South Dakota State University

University Student Union: Volstorff Lounge

7:30 AM - 12:00 PM

8:00 AM

Breakfast

South Dakota State University

University Student Union: Volstorff B

8:00 AM - 9:00 AM

9:00 AM

Opening Session: Welcome and Introduction

Barry Dunn, South Dakota State University

University Student Union: Volstorff B

9:00 AM - 9:10 AM

9:10 AM

Keynote: Combining Machine Learning with Domain Expertise for FICO® Score Development

Gerald Fahner, FICO

University Student Union: Volstorff B

9:10 AM - 10:00 AM

Machine learning models, often regarded as black boxes, are attractive due to their high degree of automation and predictive power, whereas highly scrutinized credit scoring operations must employ transparent models. We investigate the potential of modern ML techniques, and of blending ML and Scorecard technology, to ascertain maximum predictive power of the FICO® Score subject to regulatory and explainability requirements. Our first case study benchmarks the US FICO® Score against modern ML approaches and discusses explainability challenges with “unfettered” ML models such as Gradient Boosted Decision Trees (GBDT). Our second case study concerns a recent development of a new FICO® Score outside the US where we combined the raw predictive power of ML with the advantages of Scorecard technology (ability to impute domain knowledge, ease of explanation) in a “best of both worlds” approach. This session shares experiences and methodologies that can be interesting for anyone who seeks to effectively leverage ML and domain knowledge to develop highly predictive yet still explainable models.

10:00 AM

Panel Discussion: Making Analytics Great Again

Mark Gorman, The Gorman Group Consultancy

University Student Union: Volstorff B

10:00 AM - 10:50 AM

10:50 AM

Networking break / Exhibitors

South Dakota State University

University Student Union: Volstorff A

10:50 AM - 11:00 AM

11:00 AM

Session 1: Tools - Setting up Python and R in Classroom

Anton Bezuglov, Buena Vista University

University Student Union: Clark Room 262 B

11:00 AM - 12:00 PM

Python and R are perhaps the two most popular free analytics platforms for Data Science. Unfortunately, for novice users such as students, installing the proper components and configuring the tools is usually a problem. It is even more difficult to ensure that everyone in the classroom works in the same environment, uses the same package versions, etc. Here, we present a classroom setup where the students access their Python and R notebooks over HTTP via a JupyterHub server. In this setup, the students have full permissions to their home directories as well as read/execute access to shared notebook and data directories. The instructors can use the latter two directories to share in-class work, lectures, and data with all the students. The JupyterHub server runs on a virtual Linux machine to facilitate resource management and backups. This approach has been tested on smaller classes of up to 15 simultaneous users (both students and faculty). For larger classes, JupyterHub can be deployed on multiple nodes using Docker Swarm. Overall, this setup is an excellent platform where the users can focus on their work, whether learning or research, without wasting time on package configuration, backups, and resource management.

11:00 AM

Session 1: Tools - Use Case of Amazon Web Services SageMaker in Digital Analytics

Ally Pelletier, Star Tribune

University Student Union: Clark Room 262 B

11:00 AM - 12:00 PM

Amazon Web Services (AWS) provides cloud computing resources to companies around the world. These services include data storage, computing, and analytics. In November 2017, AWS released SageMaker. SageMaker is an end-to-end service which allows developers and data scientists to explore data, train models, and deploy models in production all within an easy to use platform. SageMaker comes with built-in common algorithms but also allows data scientists to bring their own algorithms in many languages including R and Python. During this presentation we will walk through a use case of SageMaker in the digital analytics industry. The use case will include all steps of the modeling process including data preparation, training, and deployment.

11:00 AM

Session 2: Healthcare - Collaboration and Innovation: Pushing Forward Data-Driven Population Health

Emily Griese, Sanford Health

University Student Union: Pheasant Room 253 A/B

11:00 AM - 12:00 PM

As healthcare continues to transition to value-based care, understanding how to effectively leverage data for population health is essential. Sanford Health recognizes this vital step and is leveraging models of innovation and collaboration to ensure success in a shifting healthcare climate. This talk will address current efforts including the Sanford Data Collaborative as well as introduce advanced analytics teamwork occurring across our footprint.

11:00 AM

Session 3: Methods - Table 1: SAS and R Tools to Build Summary Statistics Tables

Paul Thompson, Sanford Research

University Student Union: Dakota Room 250 A/C

11:00 AM - 12:00 PM

In presenting clinical data, Table 1 is used to present demographic, clinical, and pre-treatment data for trials. This includes continuous variables (age, BP, BMI, presented as mean and standard deviation or median and IQR) and discrete variables (gender, race, cancer stage, presented as proportions and counts). While the actual statistical/data analytic techniques involved are trivial, constructing such tables takes a lot of time as results must be assembled into standard tables. Automating this is a huge time saver. SAS and R methods for building this table in a consistent manner are presented.
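As a rough illustration of the kind of automation this abstract describes (not the presenter's SAS or R tools), the Python sketch below assembles Table 1-style rows from a handful of made-up patient records. The records, the field names, and the `table_one` helper are all hypothetical.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical patient records, invented purely for illustration.
patients = [
    {"age": 54, "bmi": 27.1, "gender": "F"},
    {"age": 61, "bmi": 31.4, "gender": "M"},
    {"age": 47, "bmi": 24.8, "gender": "F"},
    {"age": 70, "bmi": 29.0, "gender": "M"},
]

def table_one(records, continuous, discrete):
    """Return (label, summary) rows: mean (SD) for continuous variables,
    count (percent) for each level of a discrete variable."""
    rows = []
    for var in continuous:
        values = [r[var] for r in records]
        rows.append((var, f"{mean(values):.1f} ({stdev(values):.1f})"))
    for var in discrete:
        counts = Counter(r[var] for r in records)
        for level in sorted(counts):
            n = counts[level]
            rows.append((f"{var}={level}", f"{n} ({100 * n / len(records):.0f}%)"))
    return rows

rows = table_one(patients, continuous=["age", "bmi"], discrete=["gender"])
for label, summary in rows:
    print(f"{label:<10} {summary}")
```

The same function can be pointed at any list of records, which is where the time saving comes from: the summaries are computed and laid out the same way every time.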

11:00 AM

Session 3: Methods - Using Atypicality to Identify Outliers

Austin O'Brien, Dakota State University

University Student Union: Dakota Room 250 A/C

11:00 AM - 12:00 PM

This presentation will outline the development and use of a probabilistic measure for outlier detection, referred to as atypicality. Given a set of objects, we can create a corresponding set of similarity scores between them. Assuming the set of scores has a normal distribution, we can estimate the score distribution’s parameters. We compute atypicality by comparing the likelihood of an object given these estimated parameters to the likelihood of bootstrapped samples. The atypicality measure is then used as a p-value in a hypothesis test, where the null hypothesis states that the object in question is similar to the remaining objects; the alternative hypothesis is that the object is an outlier. This can be used in a variety of applications, especially where we have multiple objects in multi-dimensional space.
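One possible reading of the procedure in this abstract can be sketched in Python as follows. This is an assumption-laden illustration, not the presenter's method: the function name, the similarity scores, and the use of simulated draws from the fitted normal in place of bootstrapped samples are all hypothetical choices.

```python
import random
from statistics import NormalDist, mean, stdev

def atypicality(score, reference_scores, n_draws=2000, seed=0):
    """Monte Carlo p-value: the share of draws from the fitted normal whose
    density is no higher than the density at `score`."""
    # Estimate the score distribution's parameters from the other objects.
    fitted = NormalDist(mean(reference_scores), stdev(reference_scores))
    rng = random.Random(seed)
    observed = fitted.pdf(score)
    # Assumption: simulated draws stand in for the bootstrapped samples.
    draws = (rng.gauss(fitted.mean, fitted.stdev) for _ in range(n_draws))
    return sum(fitted.pdf(x) <= observed for x in draws) / n_draws

# Hypothetical similarity scores for the remaining objects.
reference = [0.82, 0.79, 0.85, 0.80, 0.83, 0.81, 0.84, 0.78]
typical = atypicality(0.81, reference)  # close to the bulk of the scores
outlier = atypicality(0.40, reference)  # far from the bulk of the scores
print(typical, outlier)
```

A small value leads to rejecting the null hypothesis that the object is similar to the others, so the second score would be flagged as an outlier while the first would not.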

11:00 AM

Session 4: Financial/Methods - Enhancing Collection Effectiveness through Net Lift Analysis

Matt Nissen, Capital Services

University Student Union: Pasque Room 255

11:00 AM - 12:00 PM

Do you want to collect more dollars while spending less? Net lift analysis can help target customers with the highest propensity to be positively influenced by a collections treatment while balancing the costs of collecting. A collections effectiveness case study was conducted that outlines the basics of net lift analysis. Strategies using net lift analysis resulted in an increase in dollars collected and a decrease in collections costs. Key takeaways are the importance of testing, using the correct metrics, and evolving as you learn.
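The basic arithmetic behind net lift can be sketched in Python as below. The segment names, counts, average payment, and contact cost are invented for illustration and do not come from the case study; `net_lift` and `net_value_per_account` are hypothetical helper names.

```python
def net_lift(treated_paid, treated_total, control_paid, control_total):
    """Difference in payment rate between treated and untreated accounts."""
    return treated_paid / treated_total - control_paid / control_total

def net_value_per_account(lift, avg_payment, cost_per_contact):
    """Expected incremental dollars per treated account, net of contact cost."""
    return lift * avg_payment - cost_per_contact

# Hypothetical segments: (treated_paid, treated_total, control_paid, control_total)
segments = {
    "early_delinquency": (300, 1000, 280, 1000),
    "late_delinquency": (220, 1000, 100, 1000),
}

lifts = {name: net_lift(*counts) for name, counts in segments.items()}
values = {name: net_value_per_account(lift, avg_payment=500.0, cost_per_contact=4.0)
          for name, lift in lifts.items()}
best = max(values, key=values.get)
print(lifts, values, best)
```

In this made-up example the early segment pays at a higher rate overall, but most of those customers would have paid anyway; the treatment changes behavior far more in the late segment, which is the distinction net lift is meant to expose.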

11:00 AM

Session 4: Financial/Methods - Estimating the Quantile Elasticity of Intertemporal Substitution with Instrumental Variables Quantile Regression

Lance Cundy, University of Iowa

University Student Union: Pasque Room 255

11:00 AM - 12:00 PM

This paper estimates the quantile elasticity of intertemporal substitution (QEIS) of consumption using instrumental variables quantile regression. The elasticity of intertemporal substitution represents the willingness of a consumer to substitute future consumption for present consumption. In this paper, agents have a quantile utility preference instead of standard expected utility. This allows for the capture of heterogeneity along the conditional distribution of agents. The QEIS considers structural breaks in the data and is estimated for each regime using linearized Epstein-Zin preferences and by the use of fixed effects, instrumental variables, and quantile regression. The estimator is a feasible estimator based on smoothed sample moments. In order to estimate the model, the Nielsen Consumer Panel dataset is used. This dataset is built from transactional data that follows households in the United States and their grocery purchases from 2004 through 2014. Because of the transactional nature of the dataset, there is little measurement error in consumption, and aggregation bias can be minimized. To estimate the model, consumption is aggregated weekly, and consumption growth is measured over a four-week time period in order to match four-week Treasury bills. Results give evidence of heterogeneity of the QEIS along the quantiles of the conditional distribution.

12:00 PM

Lunch and Poster Session

South Dakota State University

University Student Union: Volstorff A and B

12:00 PM - 1:00 PM

1:00 PM

Keynote: Navigating Data Integration in Healthcare: What, Why, How, and Who?

Benson Hsu, South Dakota State University

University Student Union: Volstorff B

1:00 PM - 2:00 PM

As healthcare transitions from a fee-for-service model towards a value-based paradigm, healthcare organizations from payers to providers are exploring new data sets to augment existing traditional health (or more specifically, illness) data sets. In this discussion, we will explore exactly what types of data can serve to improve understanding of our communities, why these data sets are essential in this transition of our approach to healthcare payment and delivery, how these data sets can be integrated as well as the barriers to integration, and who ultimately “owns” this data as various (and at times, conflicting) constituents strive to serve the population. Is there genuine data democracy?

2:00 PM

Session 5: Tools - JMP and the Predictive Modeling Workflow

Kevin Potcner, JMP

University Student Union: Clark Room 262 B

2:00 PM - 3:00 PM

A typical real-world predictive modeling workflow includes (sometimes iteratively) data cleaning and exploration, model fitting, model validation, model comparison, final model selection and deployment of the final predictive model. In this session we illustrate the predictive modeling workflow by analyzing a real dataset. After data preparation and initial exploration, we will create a number of predictive models such as multiple linear regression, regression tree, partition based methods, Neural Net, among others. We will “publish” models to the Formula Depot, and explore and select the best model(s) using the Prediction Profiler and JMP’s Model Comparison tool.

2:00 PM

Session 6: Healthcare - Analyzing Ligand-binding Proteins Using Their Structural Information

Galkande Premarathna, Minnesota State University, Mankato

University Student Union: Pheasant Room 253 A/B

2:00 PM - 3:00 PM

It is known that a protein’s biological function is in some way related to its physical structure. Many researchers have studied this relationship both for the entire backbone structures of proteins as well as their binding sites, which are where binding activity occurs. However, despite this research, it remains an open challenge to predict a protein’s function from its structure. There are many useful applications of protein function prediction, such as effective drug discovery with fewer side effects, development of structure-based drug designs, disease diagnosis, and many more. This presentation will discuss how this ligand-binding protein prediction problem is approached by taking a higher-level object-oriented approach, named Covariances of Distances to Principal Axis (CDPA), which summarizes the description of the binding site so that less information is lost than with most other approaches. A model-based method is then considered, in which a nonparametric model is built from the features of the binding sites for a given ligand group for understanding and classification purposes. The results obtained using the model-based approach are then compared to the alignment-based method used by Ellingson and Zhang (2012) and Hoffmann et al. (2010).

2:00 PM

Session 6: Healthcare - Mining Users Feedback: Discovering the Gaps in Mobile Patient Portals

Cherie Noteboom, Dakota State University

University Student Union: Pheasant Room 253 A/B

2:00 PM - 3:00 PM

Patient portals are positioned as a central component of patient engagement through the potential to change the physician-patient relationship and enable chronic disease self-management. In this article, we extend the existing literature by discovering design gaps for patient portals from a systematic analysis of negative user feedback from the actual use of patient portals. Specifically, we adopt a topic modeling approach, the LDA algorithm, to discover design gaps from low-rated online user reviews of a common mobile patient portal, Epic’s MyChart. To validate the extracted gaps, we compared the results of the LDA analysis with those of human analysis. Overall, the results revealed opportunities to improve collaboration and to enhance the design of portals intended for patient-centered care.

2:00 PM

Session 7: Methods - Modeling Vegetation Growth on Termite Mounds

Matthew Biesecker, South Dakota State University

University Student Union: Dakota Room 250 A/C

2:00 PM - 3:00 PM

Coupled systems of nonlinear reaction-diffusion PDEs have been used to model pattern formation since the 1950s. In particular, Turing proved that minor perturbations to initial conditions can result in exotic pattern formations. More recently, systems of PDEs have been used to model plant/groundwater interactions. In this talk, we will discuss recent work on mathematical models of the growth of vegetation on or around termite mounds in the African Savannah.

2:00 PM

Session 7: Methods - Shiny App as a Solution for Streamlining Complex

Xijin Ge, South Dakota State University

University Student Union: Dakota Room 250 A/C

2:00 PM - 3:00 PM

Rapid innovation in biotechnology, especially DNA sequencing, holds great promise for revolutionizing medicine. The main bottleneck is how to analyze and interpret the massive amount of data effectively. Many stand-alone software packages exist, mostly as R packages. We developed iDEP (Integrated Differential Expression and Pathway analysis), a large Shiny application that integrates hundreds of R packages and a large gene annotation database. Available at http://ge-lab.org/idep/, iDEP streamlines complex bioinformatics pipelines through a friendly web interface that can turn data into biological insights within minutes instead of months.

2:00 PM

Session 7: Methods - Why Data Science is Difficult

Ryan Swanstrom, Unify Consulting

University Student Union: Dakota Room 250 A/C

2:00 PM - 3:00 PM

It should come as no surprise that many data science and analytics projects fail. There are a whole number of reasons, and this talk will cover some of them. We will walk through the journey of planning a pizza party, studying for a test, and a few other fun stories. All of it will relate back to challenges with data science in the real world.

2:00 PM

Session 8: Financial - Using Attribution Modeling to Find Profitable Lending Customers

Nirav Bhagat, Metafunding

University Student Union: Pasque Room 255

2:00 PM - 3:00 PM

A career in marketing data science will entail answering questions about how best to allocate spend within a marketing portfolio. As marketing budgets continue to shift towards online channels and away from print, the availability and quantity of campaign data make the data scientist’s job more challenging than ever. Within financial services (and credit card issuers), a strategy that optimizes the mix of paid direct mail, paid online, organic search and on-site optimization has historically led to more efficient marketing costs. That strategy can be derived through attribution modeling, a close cousin to marketing mix models. In this presentation, Nirav will present examples of the conversion path of customers over paid, earned and owned marketing channels. He’ll introduce the data engineering and analytics tools used to give credit to each channel for conversions. And, he’ll introduce a few areas where science and tools are nascent.

3:00 PM

Networking break / Exhibitors

South Dakota State University

University Student Union: Volstorff B

3:00 PM - 3:30 PM

3:30 PM

Session 10: Healthcare - AI in Healthcare: Automated Chest X-ray Screening

KC Santosh, University of South Dakota

University Student Union: Pheasant Room 253 A/B

3:30 PM - 5:00 PM

Unstructured data such as an image is worth a thousand words. Image analysis has several different applications; healthcare, for instance. Fundamental image processing mechanics let us focus on how we can actually represent visual images to be processed in machine learning algorithms. More specifically, the talk aims to show how data scientists work, with an emphasis on image processing and pattern recognition. In this context, we will present an automatic chest X-ray (CXR) screening system to detect pulmonary abnormalities in nonhospital settings. In particular, the primary motivator of the project is the need for screening HIV+ populations in resource-constrained regions for evidence of Tuberculosis (TB). The system analyzes the thoracic edge map and shapes, as well as the symmetry that exists between the lung sections of posteroanterior CXRs. For classification, we have used several different classifiers, such as support vector machine, Bayesian network, multilayer perceptron neural networks, random forest and convolutional neural network. Using CXR benchmark collections made available by the National Institutes of Health (NIH) and National Institute of Tuberculosis and Respiratory Diseases, India, the proposed method outperforms the previously reported state-of-the-art methods by more than 5% in terms of accuracy and 3% in terms of area under the ROC curve (AUC). On the whole, the talk will consider state-of-the-art works in image analysis, pattern recognition and machine learning under the framework of healthcare and/or medical imaging. Drawing on these topics, we will summarize how AI and machine learning have helped healthcare advance.

3:30 PM

Session 10: Healthcare - Deep Neural Networks to Predict Self-perception of Cardiovascular Disease: A Technical Demo

David Zeng, Dakota State University

University Student Union: Pheasant Room 253 A/B

3:30 PM - 5:00 PM

I train a multiple-hidden-layer deep neural network that predicts self-perception of heart health (including Cardiovascular Disease) with a large data set of 1729 features and about 30,000 samples. The dataset is based on the CDC Demographics, Dietary, Examination, Laboratory, and Questionnaire datasets collected from 1999 to 2016. Substantial data cleaning and pre-processing are done with the Python pandas library. The objective of this research is three-fold:

  • Better understanding of how well DNN would improve the accuracy of prediction on perception of cardiovascular disease;
  • Framework of developing more sophisticated DNN models to predict medical outcomes;
  • Foundation for learning multi-dimensional/distributed representation of healthcare concepts that are both interpretable and scalable.

The presentation focuses on the technical (training a deep neural network with latest developments in the field of deep learning) aspects of the research.

3:30 PM

Session 10: Healthcare - No Show Predictive Model: A Bayesian Approach

Robert Menzie, Sanford Health

University Student Union: Pheasant Room 253 A/B

3:30 PM - 5:00 PM

Patients not showing up to their appointments is a detriment to both the patient and the health care system. As health care systems transition from fee-for-service programs to value-based programs, clinical visits (especially primary care) will become the gateway to improving overall patient health outcomes. In order to ensure patients are receiving the appropriate treatment and maintaining a healthy lifestyle, they must be completing their scheduled visits. The main goal of the model is to predict patient no-show probabilities with the intent of taking the model one step further by linking it to actionable data points and decisions. The model employs a logistic regression and Bayesian update approach. The regression is composed of patient demographic, behavioral and diagnosis characteristics, as well as visit logistics. The logistic regression creates a prior probability based on requisite factors. Then, because missing appointments is highly behavioral, a Bayesian update is applied to the prior probability to obtain a final, posterior probability. The Bayesian application to this model significantly contributes to the patient’s probability and details the importance behind patient-level interventions. The output of the model has a high level of accuracy that allows clinics not only to see which patients have a high risk of not showing up, but also the factors that physicians may be able to remedy down the road. The model was built using a standard 10-fold cross-validation. The test set was then run through the model and used to determine the weighting for the Bayesian update. Lastly the data was validated using the remaining 10%, which resulted in an AUC of .927. Combining the accuracy of this model with the prescriptive ability of the factors can allow for a significant reduction of no-shows, not only by enhancing appointment logistics (calls, overbooking, etc.) but also by improving patients’ lifestyle.
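The two-stage idea in this abstract, a logistic regression prior followed by a Bayesian update, can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the presenter's model: the coefficients, the feature values, and the two evidence likelihoods are invented, and the abstract does not say what evidence or weighting the actual update uses.

```python
import math

def logistic_prior(intercept, coefs, features):
    """Prior no-show probability from a logistic regression."""
    z = intercept + sum(c * x for c, x in zip(coefs, features))
    return 1 / (1 + math.exp(-z))

def bayes_update(prior, p_evidence_given_noshow, p_evidence_given_show):
    """Posterior P(no-show | evidence) via Bayes' rule."""
    numerator = p_evidence_given_noshow * prior
    return numerator / (numerator + p_evidence_given_show * (1 - prior))

# Hypothetical coefficients and features, chosen only for illustration.
prior = logistic_prior(-2.0, [0.8, 0.5], [1, 0])

# Hypothetical evidence: the patient missed their previous appointment, which
# is assumed to be three times as likely among no-shows (0.6 versus 0.2).
posterior = bayes_update(prior, 0.6, 0.2)
print(round(prior, 3), round(posterior, 3))
```

With these made-up numbers the behavioral evidence roughly doubles the estimated no-show probability, which shows how a patient-level history can move the regression-based estimate.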

3:30 PM

Session 11: Making the Analytics Journey Without Getting Lost in the Cloud

Jason Rogowski, Polaris Industries

University Student Union: Dakota Room 250 A/C

3:30 PM - 5:00 PM

The Polaris data science team was founded with a simple objective: be predictive. Driving business value with initial quick wins was relatively easy, but scaling our ability to drive change was limited by the small size of the team. Furthermore, many of the business units needed fundamental reporting automation more than a neural network. We will discuss how we pivoted our strategy and are taking a more holistic self-service enablement approach alongside predictive algorithm development.

3:30 PM

Session 11: Methods - Towards a Distant Supervision Paradigm for Clinical Information Extraction: Creating Large Training Datasets for Machine Learning

Yanshan Wang, Mayo Clinic
Elizabeth J. Atkinson, Mayo Clinic
Shreyasee Amin, Mayo Clinic
Hongfang Liu, Mayo Clinic

University Student Union: Dakota Room 250 A/C

3:30 PM - 5:00 PM

Background: In the era of big data, a large number of clinical narratives exist in electronic health records. Automatic extraction of key variables from clinical narratives has facilitated many aspects of healthcare and biomedical research. Conventional approaches are based on rule-based natural language processing (NLP) techniques that rely on expert knowledge and exhaustive human effort in designing rules. Recently, machine learning has seen a big performance gain compared to conventional NLP approaches. Despite these impressive improvements, large manually labeled training data are the crucial building blocks of conventional machine learning methods and key enablers of recent deep learning methods. However, large training data are not always readily available and are usually expensive to obtain from human annotators. This problem becomes more significant for use cases in the clinical domain because the Health Insurance Portability and Accountability Act (HIPAA) rules out methods such as crowdsourcing and because annotators must be medical experts. Method: In this paper, we propose a distant supervision paradigm for clinical information extraction. In this paradigm, rule-based NLP algorithms are used to generate large training data with labels automatically. Machine learning models are subsequently trained on these distant labels with word embedding features. Results: We study the effectiveness of the proposed framework on two clinical information extraction tasks: the i2b2 smoking status extraction shared task and a fracture extraction task at our institution. We tested three prevalent machine learning models, namely, Convolutional Neural Networks, Support Vector Machine, and Random Forest. Conclusion: The experimental results show that the proposed distant supervision paradigm is effective for the machine learning models to learn rules towards the gold standard from distant labels. Moreover, the machine learning models trained on the distant labels generated by a rule-based NLP algorithm could perform better than the NLP algorithm given sufficient data. Additionally, we showed that CNN was more sensitive to the data size than the conventional machine learning models and that all the tested machine learning methods were viable options for the distant supervision paradigm.
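The distant supervision loop described in this abstract can be sketched in Python as follows. This is a toy illustration, not the authors' system: the regular-expression rule, the four clinical notes, and all function names are hypothetical, and a small bag-of-words naive Bayes classifier is swapped in for the word-embedding features and the CNN, SVM, and Random Forest models that the paper actually evaluates.

```python
import math
import re
from collections import Counter

def rule_label(note):
    """Hypothetical rule-based labeler: 1 = smoker, 0 = non-smoker."""
    text = note.lower()
    if re.search(r"\b(denies|never|no)\b.*\bsmok", text):
        return 0
    return 1 if re.search(r"\bsmok", text) else 0

def tokens(note):
    return re.findall(r"[a-z]+", note.lower())

def train_naive_bayes(notes, labels):
    """Collect per-class word counts for a multinomial naive Bayes model."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for note, label in zip(notes, labels):
        word_counts[label].update(tokens(note))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab

def predict(model, note):
    """Return the class with the higher add-one-smoothed log score."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in (0, 1):
        n_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total)
        for w in tokens(note):
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up unlabeled notes; the rule supplies the "distant" labels.
notes = [
    "Patient smokes one pack per day",
    "Patient denies smoking",
    "Current smoker with chronic cough",
    "Never smoked, no cough",
]
distant_labels = [rule_label(n) for n in notes]
model = train_naive_bayes(notes, distant_labels)
print(distant_labels, predict(model, "Pack per day for years"))
```

In this toy run the rule assigns 0 to "Pack per day for years" because the note never mentions smoking, while the trained model assigns 1 from the context words it saw alongside rule-labeled smokers. Four notes prove nothing, but the mechanism is the one the abstract credits for models outperforming the rules that labeled their training data.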

3:30 PM

Session 11: Methods - Selecting Categorical and Quantitative Variables in Linear Regression Analysis

Jixiang Wu, South Dakota State University

University Student Union: Dakota Room 250 A/C

3:30 PM - 5:00 PM

Variable selection is an important means to construct a model that predicts a target/response variable with a set of predictor variables. The predictor variables could include quantitative, binary, and/or categorical variables; however, commonly used variable selection methods such as forward selection, backward selection, and stepwise selection are more focused on quantitative variables. It will be a helpful addition to multiple linear regression if categorical variables can be integrated with the commonly used variable selection methods. We propose a generalized variable selection method that can be used to select both categorical and quantitative variables simultaneously. The detailed results will be presented at the symposium.

3:30 PM

Session 12: Applications - An In-Class Geospatial Data Analytics Project Inspired by a Comedian

Russ Goodman, Central College

University Student Union: Pasque Room 255

3:30 PM - 5:00 PM

This talk will share the details of an intriguing and appealing in-class project for students in an introductory Data Analytics class for advanced mathematics majors. The project originated with a joke from a popular comedian about whether “La Quinta” is Spanish for “Next to Denny’s” and developed into an investigation of that quip. In this project, students learn to acquire the appropriate geospatial data, learn some new skills in Excel, RStudio or other data analysis software, experience quite a bit of problem-solving, and work hard to communicate their results.

3:30 PM

Session 12: Applications - Predicting the Origins of Artwork Found in Rural Churches

Nathan Axvig, Concordia College - Moorhead

University Student Union: Pasque Room 255

3:30 PM - 5:00 PM

A few years ago, I was contacted by Rodney Oppegard, a church historian. He had spent many years collecting information on ecclesiastical furnishings and artwork found in the Lutheran churches of rural North Dakota, and while his data set was extensive it was by no means complete. Some artwork was unsigned or the signature obscured, other pieces had been transferred to different churches, and in some cases the church itself had been destroyed by fire years before, leaving only incomplete records and fading memories as clues to the original church’s configuration. Mr. Oppegard wanted to know whether there was a mathematical way to use existing data to “fill in the holes” of his data set. In this talk, I will outline how geospatial and rudimentary archival data were used to construct and evaluate models for determining which of several popular artists was responsible for a particular church’s altar painting.

3:30 PM

Session 12: Applications - Prediction on Oscar Winners Based on Twitter Sentiment Analysis Using R

Sayeed Sal, Minot State University
Israt Jahan, Minot State University

University Student Union: Pasque Room 255

3:30 PM - 5:00 PM

In the new era of development, social media is not only gaining popularity but is also useful for revealing hidden information. Most people use social media to connect with friends and family, to express their emotions, to give feedback, and to raise concerns as quickly as possible. We can reveal important information from people’s responses on social media. OSCAR nominations were announced for the year 2017 and all the nominees are very active on Twitter. All the tweets regarding their nominations are publicly available. In this paper, we analyze the public tweets from Twitter and predict who will be the OSCAR winner in 2017. More specifically, we analyzed all the tweets of the OSCAR nominees in the category of “Actor in leading role” since the day of the OSCAR nomination announcement. After analyzing the data, we predicted who will be the winner based on Twitter sentiment analysis using R programming.
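The general shape of a sentiment-based prediction like this one can be sketched in Python as below. This is only an illustration of lexicon-based scoring, not the authors' R analysis: the word lists, the tweets, and the placeholder nominee names are all made up.

```python
import re

# Tiny illustrative lexicon; real analyses use far larger word lists.
POSITIVE = {"great", "brilliant", "love", "deserves", "stunning"}
NEGATIVE = {"boring", "overrated", "bad", "weak", "hate"}

def sentiment(text):
    """Count of positive words minus count of negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Made-up tweets keyed by placeholder nominee names.
tweets = {
    "nominee_a": ["Brilliant performance, he deserves it", "A bit overrated"],
    "nominee_b": ["Boring and weak", "Love the film, great acting"],
}

scores = {name: sum(sentiment(t) for t in texts) for name, texts in tweets.items()}
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

The nominee with the highest aggregate score is taken as the predicted winner; a real study would also need to handle negation, sarcasm, and differences in tweet volume between nominees.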

3:30 PM

Session 9: Tools - Building R Package

Yuhlong Lio, University of South Dakota

University Student Union: Clark Room 262 B

3:30 PM - 5:00 PM

In this talk, a set of procedures for building an R package will be discussed and some recently built R packages will be introduced.

3:30 PM

Session 9: Tools - Sentiment Analysis of Donald Trump’s Tweets

Andre de Waal, SAS

University Student Union: Clark Room 262 B

3:30 PM - 5:00 PM

Social media generates huge amounts of data every day and most of the data is unstructured. This is an untapped resource that may provide significant benefits to companies able to exploit this data. SAS Visual Analytics is a big data tool that facilitates the visualization of large data sets. In this talk we demonstrate how insight can be derived from the analysis of Donald Trump’s tweets. First, a word cloud is built and then a sentiment analysis is done on all of his tweets. Tweets are grouped into topics and the sentiment surrounding each topic is analyzed. This leads to the discovery of novel and interesting insights.

5:00 PM

Closing Session

Thomas Brandenburger, South Dakota State University

University Student Union: Volstorff B

5:00 PM - 5:30 PM