AIHA-RMS - Clustering of Noise Dosimetry


In supervised learning, we train a model (e.g., linear regression) on data that include a known outcome, so we can assess the quality of the results by comparing predictions against that outcome. In unsupervised learning, there is no labeled training set for the model. Therefore, there is no objective way (at least compared to supervised learning) to assess results. Still, unsupervised learning methods are very useful during exploratory data analysis. If you want to learn a little more about supervised and unsupervised learning, check out this blog from Machine Learning Mastery. Let's get on to our example.

We have been given noise dosimetry results in a spreadsheet (see preview below). Each row represents a worker and each column represents a full-shift sample.

The first error we see is that the results were not imported correctly: each value should be multiplied by 100 to get % Dose. The second issue is that there are three manufacturing plants at our work location, but there is no indication as to which plant each observation is associated with.

We have two tasks:

Fix the dosimetry results so they display as % Dose.
Apply unsupervised learning to assist with clustering the results into three groups (corresponding to the three plants).

To accomplish our tasks we will use the R language and the RStudio software. Our first step is to import the necessary libraries. dplyr is an R package that is well suited to tidying and transforming data.

Next, we load the data from our .csv file into a data frame called dosimetry.twa and inspect it by calling head(). Comparing the output against the .csv file, the data appear to have loaded properly.
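A minimal sketch of the load-and-inspect step. The real dosimetry.csv is not reproduced here, so the block first writes a small made-up file to a temporary path; the column names and values are assumptions for illustration only, not the post's data.

```r
# Hypothetical stand-in for dosimetry.csv: 6 workers (rows) x 3 full-shift
# samples (columns). Values are made up and not yet expressed as % Dose.
csv_path <- tempfile(fileext = ".csv")
example <- data.frame(
  sample_1 = c(0.12, 0.15, 0.48, 0.52, 0.91, 0.88),
  sample_2 = c(0.10, 0.18, 0.45, 0.55, 0.95, 0.85),
  sample_3 = c(0.14, 0.11, 0.50, 0.49, 0.89, 0.93)
)
write.csv(example, csv_path, row.names = FALSE)

# Load the file into a data frame and look at the first rows
dosimetry.twa <- read.csv(csv_path)
head(dosimetry.twa)
```

With the real file, csv_path would simply point to wherever dosimetry.csv is saved.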

To convert the data to % Dose, we multiply the entire data frame contents by 100.
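As a sketch of the conversion, assuming every column of the data frame is numeric (the values below are made up for illustration), arithmetic on a data frame is applied element by element:

```r
# Hypothetical values as imported, before conversion to % Dose
dosimetry.twa <- data.frame(
  sample_1 = c(0.12, 0.48, 0.91),
  sample_2 = c(0.10, 0.45, 0.95)
)

# Multiply every cell by 100 to express the results as % Dose
dosimetry.twa <- dosimetry.twa * 100
head(dosimetry.twa)
```

This only works when all columns are numeric; an ID or name column would have to be excluded first.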

We inspect the data again and it appears satisfactory.

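Feature scaling is a useful step prior to applying ML methods, particularly distance-based ones such as k-means, where a column with a larger range would otherwise dominate the result. We normalize the data so that the values in each column fall between 0 and 1 (min-max scaling). To do that, we define a function called feat_scal_normalize; that is just a name I made up, and you can call yours anything you want.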
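One common way to write such a function is min-max scaling, sketched below. The body is an assumption about what feat_scal_normalize does, based on the description above, not the post's actual code.

```r
# Min-max normalization: rescale a numeric vector so that its smallest
# value maps to 0 and its largest value maps to 1.
# Note: a constant vector (max == min) would give NaN (0 / 0).
feat_scal_normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Made-up % Dose values for illustration
scaled <- feat_scal_normalize(c(12, 48, 91))
range(scaled)
```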

We then apply the function we just created to all of the columns in the data set.
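A sketch of applying the function column by column, using base R so that it runs without extra packages. The data values and the name dosimetry.norm are made up for illustration, and the function body is an assumed min-max implementation.

```r
# Assumed min-max implementation of the normalizing function
feat_scal_normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Hypothetical % Dose results: 4 workers x 2 full-shift samples
dosimetry.twa <- data.frame(
  sample_1 = c(12, 48, 91, 52),
  sample_2 = c(10, 45, 95, 55)
)

# lapply() runs the function on each column; as.data.frame() rebuilds the table
dosimetry.norm <- as.data.frame(lapply(dosimetry.twa, feat_scal_normalize))
summary(dosimetry.norm)

# With dplyr loaded, an equivalent would be:
# dosimetry.norm <- dosimetry.twa %>% mutate(across(everything(), feat_scal_normalize))
```

Each column is scaled against its own minimum and maximum, so every column ends up spanning 0 to 1.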

As always, we inspect the data to see the results. Looks good.

Next, we apply the k-means algorithm, specifying three clusters. That is, we want the algorithm to separate all the workers into three segments, one per plant. Because k-means starts from a random initialization, we set the seed so the output is reproducible.
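A minimal sketch of this step with the kmeans() function from base R's stats package. The normalized data are made up, the name dosimetry.norm is hypothetical, and the seed value and nstart = 20 are assumptions rather than the post's settings.

```r
# Hypothetical normalized results: 6 workers x 2 samples, already scaled to 0-1
dosimetry.norm <- data.frame(
  sample_1 = c(0.00, 0.04, 0.46, 0.51, 1.00, 0.96),
  sample_2 = c(0.00, 0.09, 0.41, 0.53, 1.00, 0.88)
)

set.seed(123)  # arbitrary seed so the random starts are reproducible

# centers = 3 asks for three clusters (one per plant);
# nstart = 20 tries 20 random starts and keeps the best solution
plant_cluster <- kmeans(dosimetry.norm, centers = 3, nstart = 20)

# Cluster label assigned to each worker (row)
plant_cluster$cluster
```

The cluster numbers themselves are arbitrary labels; which number corresponds to which plant still has to be worked out from site knowledge.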

How many workers were assigned to each cluster? We inspect plant_cluster$size to find out.
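A sketch of checking the cluster sizes and attaching the labels back to the workers. It refits k-means on made-up data so that it runs on its own; the data, the seed, nstart, and the column name plant are all assumptions for illustration.

```r
# Hypothetical normalized results: 6 workers x 2 samples
dosimetry.norm <- data.frame(
  sample_1 = c(0.00, 0.04, 0.46, 0.51, 1.00, 0.96),
  sample_2 = c(0.00, 0.09, 0.41, 0.53, 1.00, 0.88)
)

set.seed(123)
plant_cluster <- kmeans(dosimetry.norm, centers = 3, nstart = 20)

# Number of workers in each of the three clusters
plant_cluster$size

# Attach each worker's cluster label as a new column for later review
dosimetry.norm$plant <- plant_cluster$cluster
head(dosimetry.norm)
```

The sizes always add up to the number of rows, since k-means assigns every worker to exactly one cluster.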

Is the algorithm accurate? We don't know, because with unsupervised learning there are no known plant labels to compare the clusters against. However, if you are an IH and you're given sketchy data, applying k-means unsupervised learning can assist you with segmenting the data. Want to know more? K-means is described on page 517 of the free text An Introduction to Statistical Learning.

If you are interested in the original dosimetry.csv file or the R code, send an email to webmaster@aiha-rms.org or leave a comment on our LinkedIn or Facebook pages.

Clustering of Noise Dosimetry Results using Unsupervised Learning (R)