In supervised learning, we train a model using techniques (e.g., linear regression) that allow us to assess the quality of the results obtained. In unsupervised learning, there is no labeled training set for the model. Therefore, there is no objective way (at least compared to supervised learning) to assess the results. Still, unsupervised learning methods are very useful during exploratory data analysis. If you want to learn a little more about supervised and unsupervised learning, check out this blog from Machine Learning Mastery. Let's get on to our example.
We have been given noise dosimetry results in a spreadsheet (see preview below). Each row represents a worker and each column represents a full-shift sample.
The first error we see is that results were not imported correctly and should be multiplied by 100 to get % Dose. The second issue is that there are three manufacturing plants at our work location. However, there is no indication as to which plant each observation is associated with.
We have two tasks: convert the results to % Dose, and separate the workers into three groups, one for each plant.
To accomplish our tasks, we will use the R language and the RStudio software. Our first step is to import the necessary libraries. dplyr is an R package that is very handy for tidying data.
Next, we load the data from our .csv file into a data frame called dosimetry.twa and inspect it by calling head(). Compared against the .csv file, the data appears to have loaded properly.
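The original dosimetry.csv is not reproduced in this post, so the sketch below writes a small made-up file to a temporary location and reads that instead. The values, the column names (Sample1 to Sample3), and the file path are assumptions for illustration only; base R's read.csv() is used here, which may differ from the import call in the original code.

```r
# Hypothetical stand-in for dosimetry.csv: 6 workers (rows) by 3 full-shift
# samples (columns), stored as fractions rather than % Dose.
example_data <- data.frame(
  Sample1 = c(0.12, 0.15, 0.48, 0.52, 0.91, 0.88),
  Sample2 = c(0.10, 0.18, 0.45, 0.55, 0.95, 0.85),
  Sample3 = c(0.14, 0.11, 0.50, 0.49, 0.90, 0.93)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(example_data, csv_path, row.names = FALSE)

# Load the file into a data frame and look at the first few rows.
dosimetry.twa <- read.csv(csv_path)
head(dosimetry.twa)
```

With your own file, replace csv_path with the path to dosimetry.csv.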
To convert the data to % Dose, we multiply the entire data frame contents by 100.
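As a minimal sketch of that conversion, assuming every column in the data frame is numeric (the values below are made up):

```r
# Hypothetical fractional results for 3 workers and 2 samples.
dosimetry.twa <- data.frame(
  Sample1 = c(0.12, 0.48, 0.91),
  Sample2 = c(0.10, 0.45, 0.95)
)

# Arithmetic on an all-numeric data frame is applied to every cell,
# so a single multiplication converts all results to % Dose.
dosimetry.twa <- dosimetry.twa * 100
head(dosimetry.twa)
```

If your file also has a non-numeric column (a worker ID, for instance), multiply only the numeric columns.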
We inspect the data again and it appears satisfactory.
Feature scaling is a useful step before applying ML methods. Here we normalize the data so that all values fall between 0 and 1 (min-max normalization). To do that, we define a function called feat_scal_normalize - just a name I made up, you can call yours anything you want.
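One way such a function could look is sketched below. This is the standard min-max formula, (x - min) / (max - min), and the body is an assumption about what the original function does, not a copy of it.

```r
# Min-max normalization: rescale a numeric vector to the range [0, 1].
# The smallest value maps to 0 and the largest to 1.
feat_scal_normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Quick check on a small vector: returns 0, 0.5 and 1.
feat_scal_normalize(c(10, 20, 30))
```

Note that a column where every value is identical would divide by zero and return NaN, so it is worth checking for constant columns first.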
We then apply the function we just created to all of the columns in the data set.
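A sketch of that step using base R's lapply(), which runs the function on each column in turn. The data values and the name dosimetry.norm are hypothetical, and the original code may well use a dplyr verb here instead.

```r
# Same min-max function as described above (assumed implementation).
feat_scal_normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Hypothetical % Dose results for 3 workers and 2 samples.
dosimetry.twa <- data.frame(
  Sample1 = c(12, 48, 91),
  Sample2 = c(10, 45, 95)
)

# lapply() returns a list with one normalized vector per column;
# as.data.frame() turns that list back into a data frame.
dosimetry.norm <- as.data.frame(lapply(dosimetry.twa, feat_scal_normalize))
summary(dosimetry.norm)
```

Each column is scaled independently, so every column ends up with a minimum of 0 and a maximum of 1.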
As always, we inspect the data to see the results. Looks good.
Next, we apply the k-means algorithm, specifying three clusters. That is, we want the algorithm to separate all the data into 3 segments, one per plant. We'll set the seed so that the code output is reproducible.
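The clustering step can be sketched with kmeans() from base R's stats package. The data below is simulated (three made-up groups of 10 workers), and the seed values and nstart = 20 are assumptions chosen for illustration, not settings taken from the original code.

```r
feat_scal_normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Simulated % Dose results: 3 hypothetical groups of 10 workers,
# 4 full-shift samples each, centered on different exposure levels.
set.seed(42)
make_group <- function(center) {
  matrix(rnorm(10 * 4, mean = center, sd = 5), ncol = 4)
}
dosimetry.twa <- as.data.frame(
  rbind(make_group(20), make_group(50), make_group(90))
)
dosimetry.norm <- as.data.frame(lapply(dosimetry.twa, feat_scal_normalize))

# k-means starts from random centers, so fix the seed for reproducibility.
# nstart = 20 tries 20 random starts and keeps the best solution.
set.seed(123)
plant_cluster <- kmeans(dosimetry.norm, centers = 3, nstart = 20)

# One cluster label (1, 2 or 3) per worker.
plant_cluster$cluster
```

The cluster numbers are arbitrary labels; the algorithm has no way of knowing which label corresponds to which plant.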
How many workers were assigned to each cluster? We inspect plant_cluster$size to find out.
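A small self-contained sketch of reading the cluster sizes, again on made-up normalized data. Attaching the labels to the data frame as a cluster column is an extra step added here for illustration.

```r
# Hypothetical normalized results for 6 workers and 2 samples.
dosimetry.norm <- data.frame(
  Sample1 = c(0.00, 0.05, 0.45, 0.50, 0.95, 1.00),
  Sample2 = c(0.02, 0.00, 0.48, 0.52, 1.00, 0.97)
)

set.seed(123)
plant_cluster <- kmeans(dosimetry.norm, centers = 3, nstart = 20)

# Number of workers assigned to each of the three clusters.
plant_cluster$size

# Attach the label to each worker so rows can be reviewed by cluster;
# table() gives the same counts as plant_cluster$size.
dosimetry.norm$cluster <- plant_cluster$cluster
table(dosimetry.norm$cluster)
```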
Is the algorithm accurate? We don't know, because we're using unsupervised learning and have no true plant labels to compare against. However, if you are an IH and you're given sketchy data, applying k-means unsupervised learning can assist you with segmenting the data. Want to know more? K-means is described on page 517 of the free text An Introduction to Statistical Learning.
If you are interested in the original dosimetry.csv file or the R code, send an email to firstname.lastname@example.org or leave a comment in our LinkedIn or Facebook pages.