Open PRAIRIE: Open Public Research Access Institutional Repository and Information Exchange - SDSU Data Science Symposium
 

Variable Selection Techniques for Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

Presentation Type

Event

Abstract

Mixtures of von Mises-Fisher distributions have been shown to be an effective model for clustering data on the unit hypersphere, but variable selection for these models remains an important and challenging problem. In this paper, we derive two variants of the Expectation-Maximization (EM) framework, each of which identifies a specific type of irrelevant clustering variable in these models. The first type consists of noise variables, which are not useful for separating any pair of clusters. The second type consists of redundant variables, which may be useful for separating pairs of clusters but do not enable any separation beyond that already provided by some other variable. Removing these irrelevant variables is shown to improve cluster quality on both simulated and benchmark text datasets.
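For readers unfamiliar with the underlying model, the following is a minimal sketch of ordinary EM for a von Mises-Fisher mixture. It is not the variable-selection variants derived in the paper, which the abstract does not specify. It assumes a single concentration parameter kappa that is fixed and shared by all clusters, so the vMF normalizing constant cancels in the E-step and no Bessel functions are needed. The function names, the default kappa, and the farthest-point initialization are all hypothetical choices made for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a nonzero vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [c / norm for c in v]


def fit_vmf_mixture(points, n_clusters, kappa=20.0, n_iter=25):
    """EM for a vMF mixture with a fixed, shared concentration kappa (assumption).

    points: unit-length vectors. Returns (weights, means, responsibilities).
    """
    n, dim = len(points), len(points[0])
    # Deterministic farthest-point initialization (an illustrative choice).
    means = [list(points[0])]
    while len(means) < n_clusters:
        means.append(list(min(points, key=lambda p: max(dot(p, m) for m in means))))
    weights = [1.0 / n_clusters] * n_clusters
    resp = []
    for _ in range(n_iter):
        # E-step: r_ik is proportional to w_k * exp(kappa * mu_k . x_i).
        # The vMF normalizer depends only on kappa and dim, so it cancels here.
        resp = []
        for x in points:
            logits = [math.log(w) + kappa * dot(m, x) for w, m in zip(weights, means)]
            top = max(logits)
            unnorm = [math.exp(l - top) for l in logits]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: mixing weights are mean responsibilities; each mean direction
        # is the normalized responsibility-weighted sum of the points.
        for k in range(n_clusters):
            mass = sum(r[k] for r in resp)
            summed = [sum(r[k] * x[j] for r, x in zip(resp, points)) for j in range(dim)]
            if mass < 1e-12 or dot(summed, summed) == 0.0:
                continue  # leave an empty component unchanged
            weights[k] = mass / n
            means[k] = normalize(summed)
    return weights, means, resp
```

Calling `fit_vmf_mixture(points, 2)` on a list of unit vectors returns the mixing weights, the unit-length mean directions, and one row of cluster responsibilities per point. In a text-clustering setting the points would typically be length-normalized document vectors, and each coordinate would be a candidate variable for the kind of selection the abstract describes.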

Start Date

2-12-2018 12:00 PM


Location

University Student Union: Volstorff A