At times, you may extract data that is messy. That is, it may contain stray marks, odd formatting, inconsistent lower/upper case text, and other unlikeable characteristics. When we notice that the inconsistencies are consistent throughout the data we receive, we can clean the incoming data using a variety of means. In this example, we'll use regular expressions in Python to clean an OSHA spreadsheet and make it more readable (for humans) and more digestible (for computers).
I visited the OSHA Enforcement Data page and extracted a large file containing citations that was over 500,000 rows! In the screenshot below, you can see that the tenth column of the .csv file contains the initial penalty in dollars. However, the fourth column, which contains the standard reference, isn't well structured for filtering and analysis - it's a concatenation of the federal OSHA and State standards. We're interested in Federal OSHA citations. How can we rearrange the fourth column so that we can extract the part (19xx) and section (19xxxxxx)?
We'll use the Python language clean this up. Note that we're using a Python package called Pandas, which is ubiquitous in analytics and data science, and the re package, which allows us to apply regular expressions. The details of Pandas and re aren't relevant here, and we're using it only for convenience. We would normally spend time inspecting the data. For brevity, we'll skip ahead to the illustration.
First, we import the appropriate packages and the .csv file containing the OSHA data.
Next, we define a regular expression or regex called ‘standards_section’ that asks our program to find anything that starts with '19' and is immediately followed by digits. Think of regex as a more powerful version of the Find feature in your commonly used software.
Apply the regex to the 'standard' column and store the result in a new column called 'nineteens'.
Looking at the first five rows of the OSHA table directly below, check out how the new 'nineteens' column at the far right compares to the 'standard' column. All of the rows that didn't match the regex are labeled as NaN in the 'nineteens' column. NaN means Not a Number.
We now use a slice to separate the part and section. The first statement slices the first four characters (19xx) in 'nineteens' and the second statement slices the rest of the the characters (19xxxxxx) of 'nineteens'.
After six simple lines of code, the result is one column that can be sorted by the federal part and another column that can be sorted by the federal section (see below). We can continue with this approach to separate the paragraphs, subsections, etc.. In the future, you can run this program (with a few clicks) and the results will appear in less than a second.
If you are interested in the original OSHA .csv file or the Python code, send an email to firstname.lastname@example.org or leave a comment in our LinkedIn or Facebook pages.