Automating the Industry Code Encoding Process with the National Institute for Occupational Safety and Health

--

You can read a joint blog post by NIOSH and T4SG on NIOSH’s page here.

US Health & Human Services Building

The Story

The National Institute for Occupational Safety and Health (NIOSH) team studies how injury and illness relate to the workplace, and one key input is qualitative work-history information attached to hospital records. Because it is nearly impossible to draw demographic insights directly from free-text data, this qualitative information about the workplace must first be encoded into discrete industry codes.

Example of what a row in the database might look like

Previously, this encoding was done manually, row by row. The NIOSH team correctly identified this labor-intensive labeling process as ripe for replacement by a machine learning algorithm, which would free the team to work on other projects! Thus, the Harvard Computer Society Tech for Social Good (T4SG) team was tasked with creating an “auto-encoder” and documenting the training process. This auto-encoder would use machine learning on previously encoded datasets to label new ones. Moreover, even an auto-encoder that accurately labels only part of a dataset would be massively valuable in reducing the time spent labeling, since manual labeling would only be needed for the portions of the dataset that the algorithm was unable to label.

The Team

Typically, T4SG teams are composed of 3–4 software engineers with some introductory web-development experience, a project manager, and a senior software engineer. Since our teams are entirely student-based, the skill sets of our software engineers, senior software engineers, and project managers differ from those of people in similarly named roles in industry.

Our team specifically had to overcome a knowledge gap before we could even deliver on our project. The team was made up primarily of Harvard freshmen and sophomores with no industry experience and little to no machine learning experience. As a result, our development process was designed both to give team members the experience necessary to deliver a successful project and to actually deliver a quality auto-encoder. Building the project timeline was a balancing act between time spent on enrichment and time spent on the final goal of the project itself.

The Process

Our process for developing the algorithm can be roughly broken down into three phases: Data Exploration, Sub-Problem Model Construction, and Model Construction.

Data Exploration

We began developing the algorithm by performing a general data exploration to better understand how the data was structured and what patterns it might contain.

In this particular case, our data exploration also served to verify whether creating a machine learning auto-encoder was feasible given the data we would be training on. A few findings could have indicated that the project was infeasible or would face limitations.

One main problem we worried about early on was insufficient information: auto-labeling codes for which there are few examples in the training dataset. To offer an analogy, training a machine learning algorithm to perform in the real world is much like doing practice problems to prepare for a test. If the practice problems closely resemble the test questions, the test will be easy; if they do not, the test will be quite difficult. In the same vein, if our machine learning algorithm does not get enough “practice problems” for a particular code, it will not be able to label that code properly in the future. In the training dataset, we actually found that two codes were entirely missing and others had very few examples, which represents a structural limitation of any algorithm trained on it.

An example of the desired case: training data that looks similar to the test data!
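As a sketch, this kind of label-frequency check takes only a few lines of pandas. The column name `industry_code` and the toy data below are illustrative stand-ins, not the actual NIOSH dataset:

```python
import pandas as pd

# Tiny synthetic stand-in for the real training data; the column
# name "industry_code" is hypothetical.
df = pd.DataFrame({
    "industry_code": ["23", "23", "23", "44", "62", "62"],
    "employer": ["Acme Builders", "BuildCo", "Roofers Inc",
                 "Corner Store", "City Clinic", "Care Center"],
})

counts = df["industry_code"].value_counts()
rare = counts[counts < 3]  # codes with too few "practice problems"
print(counts.to_dict())    # {'23': 3, '62': 2, '44': 1}
print(list(rare.index))    # ['62', '44']
```

On the real data, the same check would also reveal codes with zero rows, which no amount of model tuning can fix.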

Sub-Problem Model Construction

After exploring the data to build intuition about which machine learning models would and wouldn’t work, we began by tackling “sub-problems” of the main encoding problem. We defined sub-problems as easier, more limited versions of the auto-encoding problem we were trying to solve. The reasoning behind this was two-fold. First, tackling an easier sub-problem could surface some of the issues we might encounter when attempting the main problem, and debugging those issues in a more controlled environment would give us domain knowledge for the larger, more complex problem. Second, because our team included some novice machine learning engineers, the sub-problems doubled as learning experiences.

One example of a sub-problem we attempted to solve was the binary decision problem of whether or not a row represented the “construction” industry. Some background: in our exploration of the data, we noticed that rows labeled as “construction” industries were very well represented, making up almost 10% of the data. We therefore thought it would be a useful sub-task to determine whether or not a given row was a construction row. This is a sub-problem in that it only tackles labeling a row as class A or not class A, as opposed to the full problem of labeling a row as one of potentially 257 industry codes.

Example of a binary decision problem on food that determines whether something is or isn’t a hot dog. This is taken from the popular HBO comedy “Silicon Valley,” which parodied the power of machine learning.
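To illustrate how such a binary classifier might be built (this is a hedged sketch, not the team’s actual code), a common baseline is a TF-IDF text representation fed into logistic regression. The toy job descriptions and labels below are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy rows: 1 = construction, 0 = not construction.
texts = [
    "general contractor building homes",
    "roofing and siding installation",
    "concrete foundation work",
    "hospital nursing staff",
    "retail grocery cashier",
    "restaurant kitchen cook",
]
is_construction = [1, 1, 1, 0, 0, 0]

# TF-IDF turns free text into weighted word counts; logistic
# regression then learns a linear decision boundary over them.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, is_construction)

print(clf.predict(["drywall installation contractor"]))
```

With only six training rows this is purely illustrative, but the same pipeline shape scales to the real class-A-or-not problem.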

Model Construction

Once we had familiarized ourselves with the domain by completing our data exploration and tackling different sub-problems, our team felt ready to take on the main problem. While the data exploration was mostly useful for revealing the structural limitations of any model we might construct, the knowledge gained from building models for the different sub-problems proved very valuable. In fact, our final model was a natural extension of one of the sub-problem models we had spent time optimizing earlier in the semester.

In addition to constructing and training the model, we also spent time creating reproducible code so that as new training data becomes available in the future, it can be used to re-train the model. This will be necessary because the training dataset that achieves a high degree of accuracy today may not be representative of the data that exists tomorrow. This is another instance of the “practice questions and test questions” problem mentioned in the Data Exploration section above.
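A minimal sketch of what such a reproducible retraining step could look like, assuming a scikit-learn-style pipeline; the column names, pipeline choice, and file path here are illustrative, not the team’s actual code:

```python
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def retrain(df: pd.DataFrame, model_path: str = "encoder.joblib"):
    """Re-fit the classifier on the latest labeled rows and save it
    so future predictions reflect tomorrow's data, not just today's."""
    model = make_pipeline(TfidfVectorizer(),
                          LogisticRegression(max_iter=1000))
    model.fit(df["description"], df["industry_code"])
    joblib.dump(model, model_path)  # persist for later prediction runs
    return model
```

Wrapping the whole fit-and-save step in one function means re-training on a fresh labeled export is a single call rather than a manual notebook session.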

The Impact

The final version of the auto-encoder was a model that classifies data points into one of 257 industry codes based on information including the employer name and a qualitative description of the industry. The model also outputs its estimated confidence in each prediction in the form of a probability. We additionally delivered a wrapper codebase that lets the model make a prediction only when it is sufficiently certain that the prediction is correct. The exact threshold for “sufficiently certain” is configurable, to allow for higher or lower model confidence.
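One way such a confidence-threshold wrapper might look, sketched against a scikit-learn-style model; the function name, toy data, and default threshold are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def predict_if_confident(model, texts, threshold=0.8):
    """Return a label only when the model's top class probability
    clears the configurable threshold; None means 'label this row
    manually instead'."""
    probs = model.predict_proba(texts)  # shape (n_rows, n_classes)
    confidence = probs.max(axis=1)
    labels = model.classes_[probs.argmax(axis=1)]
    return [lab if conf >= threshold else None
            for lab, conf in zip(labels, confidence)]

# Toy model standing in for the real 257-class encoder.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(["roofing work", "concrete pouring",
           "grocery cashier", "retail clerk"],
          ["23", "23", "44", "44"])

# With an impossible threshold, every row falls back to manual labeling.
print(predict_if_confident(model, ["roofing contractor"], threshold=1.01))
```

Raising the threshold trades coverage for precision: fewer rows get auto-labeled, but the ones that do are labeled with higher confidence.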

Our original goal heading into the project was to reduce the number of rows that had to be manually labeled by 20–30%. When attempting to label every single row as one of 257 classes, we achieved close to 60% accuracy. While this number is impressive, it is unlikely that the NIOSH team will use the model in that way, as a high degree of precision (the items we do label are labeled correctly) is more important than high accuracy (we label a large number of items correctly even though some are labeled incorrectly). We expect that by tuning the “sufficient certainty” threshold to their needs, the NIOSH team will be able to significantly reduce the number of items that have to be manually labeled. Moreover, since our process for generating the model is well documented, we expect that they will be able to build upon our work where necessary.

--

Harvard Computer Society Tech for Social Good

HCS Tech for Social Good is the hub of social impact tech for Harvard undergrads. See more at socialgood.hcs.harvard.edu