Cleaning Software Installation Data to Mitigate Security Risks for the USDA


The Client: United States Department of Agriculture, Office of the CIO

The United States Department of Agriculture (USDA) is the $141B federal agency that provides leadership on food, agriculture, natural resources, rural development, nutrition, and related issues. It is one of the foremost agencies responsible for public welfare in the US; one of its largest programs, SNAP, provides food-purchasing assistance to 40 million people in need in the US each month.

Tech for Social Good (T4SG) worked specifically with the USDA Client Experience Center, the team that recently centralized IT services for the organization’s more than 100,000 USDA employees and contractors. As part of this process, the USDA tracks the software applications installed across these workstations in order to reduce the variety of software in use, which makes it easier to negotiate contracts with technology vendors and to ensure that all applications meet security standards. For example, a single machine may have multiple applications that serve the same purpose (e.g., Zoom and Microsoft Teams) or applications at different security levels (e.g., a Zoom license that is a few years old and poses a security risk). Out-of-date and unnecessary applications should be uninstalled to keep the USDA system secure, and the importance of doing so can’t be overstated, as the recent SolarWinds attack demonstrated.

The Problem: Complex Data

To track software installations, the USDA uses a program that parses the installation logs on each computer (7M lines in total). However, installation logs often include extraneous information. For example, a single application could have multiple installations logged for it that ought to be grouped together (a driver, an updater, a licenser, etc.), or could even have duplicate installations that ought to be flagged. This presents a significant challenge for the USDA team in producing accurate counts for their applications, and an opportunity for our Tech for Social Good team to step in.

The Solution: Demystifying 7M Lines of Application Data

To begin, we needed access to the data, which required approval from the USDA team to transfer hundreds of megabytes of files to us, with sensitive data removed or encrypted beforehand. We then loaded the data into a Jupyter notebook with Pandas and started by developing dashboards to get a high-level understanding of what we’d be dealing with.
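As a rough illustration, the initial load and exploration looked something like the sketch below. The file name and column names (application, publisher, version, workstation) are hypothetical placeholders; the actual export format of the USDA logs differed in its details.

```python
import pandas as pd

# Load the exported installation logs (~7M rows). The file and column
# names here are illustrative placeholders, not the actual USDA schema.
logs = pd.read_csv("installation_logs.csv")

# High-level overview: size, column types, and missing values
print(logs.shape)
print(logs.dtypes)
print(logs.isna().sum())

# Most common applications across all workstations
print(logs["application"].value_counts().head(20))
```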

Removing Duplicates

Almost immediately, we found a number of insights about the most common applications in the dataset, including tens of thousands of duplicate entries when comparing publishers, applications, and version numbers across logs. With so many issues surfaced so quickly, we were motivated by the promise of how much data analysis could improve the quality of this data.

As a result, we developed a dedicated dashboard that included, for each application, the total number of entries, the number of duplicate installations, and the number of unique installations. This showed the USDA the highest-value applications to investigate first in order to reduce the number of inaccurate installation records.
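In Pandas terms, that dashboard can be built with a single grouped aggregation. This is a minimal sketch under the same hypothetical schema as above, where a “duplicate” is any repeated (application, publisher, version, workstation) row; the exact definition we used may differ.

```python
# Columns that must all match for an entry to count as a duplicate
# (an assumed definition for this sketch).
key = ["application", "publisher", "version", "workstation"]

dashboard = (
    logs.assign(is_dup=logs.duplicated(subset=key, keep="first"))
        .groupby("application")
        .agg(total_entries=("is_dup", "size"),
             duplicate_installations=("is_dup", "sum"))
        .reset_index()
        .assign(unique_installations=lambda d:
                d["total_entries"] - d["duplicate_installations"])
        .sort_values("unique_installations", ascending=False)
)
print(dashboard.head(20))
```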

Grouping Installation Logs of the Same Application

We used that dashboard to identify groups of installation logs that belonged to the same application. To do so, we first identified commonalities between such installations through manual data exploration. We found four primary commonalities:

  • Similar application names (typically, installation logs included the name of the application followed by a differentiator, e.g., Microsoft Edge Updater alongside Microsoft Edge)
  • The same publisher and version number
  • A similar number of unique installations (logically, if one application install produces multiple installation logs, each of those logs should appear roughly the same number of times)
  • Presence on the same workstations (for the same reason, the logs belonging to one application should appear together on the same machines)

As such, we developed an algorithm that walked through the previously developed dashboard in the following steps (a code sketch follows the list):

  1. Sort the dashboard by number of unique installations and parse through it row by row, beginning with the application with the most unique installations
  2. For each application that hasn’t been grouped yet, identify the applications whose number of unique installations is within 10% of the current application’s (called the similar count applications)
  3. For each similar count application, clean the application name by removing common words that could create non-useful similarities between application names (e.g., tool, module, update) as well as punctuation (e.g., periods, hyphens) that could create non-useful differences between them
  4. To determine whether the similar count application should actually be grouped with the current application, require all of the following checks to pass: [1] the versions and publishers are the same; [2] there is at least one common word between the cleaned current application name and the cleaned similar count application name; [3] comparing the two cleaned names with FuzzyWuzzy’s partial ratio yields a similarity score of at least 50; [4] there is at least 70% overlap between the workstations the current application is on and the workstations the similar count application is on
  5. If the similar count application passes all of these checks, group it with the current application and mark it as “grouped” so that it can be skipped later
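Below is a minimal Python sketch of this procedure. The stopword list, the exact workstation-overlap measure, and helper names like clean_name are our illustrative assumptions rather than the production code; FuzzyWuzzy’s partial_ratio is the real library call named in step 4.

```python
import string
from fuzzywuzzy import fuzz  # pip install fuzzywuzzy python-Levenshtein

# Words that create non-useful similarities between application names.
# This stopword list is illustrative, not the exact one we used.
STOPWORDS = {"tool", "module", "update", "updater", "installer", "driver"}

def clean_name(name):
    """Step 3: lowercase, strip punctuation, drop non-distinctive words."""
    name = name.lower().translate(str.maketrans("", "", string.punctuation))
    return {w for w in name.split() if w and w not in STOPWORDS}

def workstation_overlap(a, b):
    """One plausible reading of the 70% workstation check: the share of
    the smaller workstation set that also appears in the larger one."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def should_group(cur, cand):
    """Step 4: all four checks must pass. cur/cand are dicts with keys
    name, publisher, version, workstations (a set), unique_installations."""
    cur_w, cand_w = clean_name(cur["name"]), clean_name(cand["name"])
    return (
        cur["publisher"] == cand["publisher"]                       # check 1
        and cur["version"] == cand["version"]
        and bool(cur_w & cand_w)                                    # check 2
        and fuzz.partial_ratio(" ".join(sorted(cur_w)),
                               " ".join(sorted(cand_w))) >= 50      # check 3
        and workstation_overlap(cur["workstations"],
                                cand["workstations"]) >= 0.70       # check 4
    )

def group_applications(apps):
    """Steps 1, 2, and 5: apps is the dashboard as a list of dicts,
    pre-sorted by unique_installations in descending order."""
    grouped, groups = set(), []
    for i, cur in enumerate(apps):
        if cur["name"] in grouped:
            continue
        members = [cur["name"]]
        for cand in apps[i + 1:]:
            if cand["name"] in grouped:
                continue
            # Step 2: unique-installation count within 10% of the current one
            diff = abs(cand["unique_installations"] - cur["unique_installations"])
            if diff > 0.10 * cur["unique_installations"]:
                continue
            if should_group(cur, cand):        # step 4
                members.append(cand["name"])   # step 5
                grouped.add(cand["name"])
        groups.append(members)
    return groups
```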

Identifying Installation Log Errors and Inconsistencies

Lastly, throughout the process, we noticed a number of errors and inconsistencies in the publisher column of the installation logs, and we exported examples into a dashboard for the client. For example, “Brother Industries” was occasionally logged as “Brother Insutries” and “Google LLC” was occasionally logged as “Google, Inc.”
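One straightforward way to surface these (a sketch of the idea, not necessarily our exact method) is to fuzzy-compare the distinct publisher strings against each other and flag near matches; the threshold of 90 below is an illustrative assumption.

```python
from itertools import combinations
from fuzzywuzzy import fuzz

# Pairwise comparison of distinct publisher names, reusing the `logs`
# DataFrame from the earlier sketches. Feasible because the number of
# distinct publishers is far smaller than the number of log lines.
publishers = logs["publisher"].dropna().unique()
suspects = [
    (a, b, fuzz.ratio(a, b))
    for a, b in combinations(publishers, 2)
    if 90 <= fuzz.ratio(a, b) < 100  # similar but not identical
]
for a, b, score in sorted(suspects, key=lambda t: -t[2]):
    print(f"{score:3d}  {a!r}  ~  {b!r}")
```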

In case you’re interested in the full technical implementation of any of the aforementioned work, our public GitHub repository is here.

The Impact

After developing this algorithm, we handed it off to the USDA team so that they could carry out the non-technical, programmatic side of the project: turning these data insights into software and security improvements within the USDA. In total, hundreds of thousands of duplicate installations were identified and more than one million application entries were grouped. It’s an incredible example of how technology can empower governments to be more effective, efficient, and impactful.

Student Reflections on Using Data in Government

As a student group, T4SG’s core mission is to “empower student leaders to leverage technology to tackle the world’s big problems.” Some reflections from the students who made this project possible are collected below.

April: It was enlightening to work on a real problem that I wouldn’t have even known existed if it weren’t for this opportunity to assist the USDA. My biggest takeaway from tackling this problem is that the most effective solution ideas came from our initial data explorations and intuition. We considered using more complex ML algorithms and libraries (and perhaps there exist complicated solutions that would outperform ours!) but in the end, our common-sense code has made a big difference.

Angelika: The development process highlighted the necessity of good communication and teamwork. As with most data quality and interpretation challenges, the real impact is not limited to algorithms and code; it also extends to finding ways to scale the solutions so they provide high-quality insights applicable to both current and future data. The essence of this is a good development flow, constant communication with the client, and a clear structure of solutions, which together helped us bring our ideas together and achieve the results of this challenge.

Eric: Working in civic tech provides a unique opportunity to make an outsized impact! This project alone will impact the hundreds of thousands of USDA employees and the general strategy around application rationalization of the entire department. It’s been incredible to be able to work on something of such magnitude.

Kevin: Data in the real world is messy, imperfect, and nuanced, which makes finding a way to sort through it and create data-driven insights all the more impactful. Navigating between the needs of the USDA and the institutional limitations of data collection and processing helped realize this project, which will undoubtedly be a foundation for future development.

Jeremy: Working through such a large dataset made clear the importance of an iterative development process. The many discoveries we made as we ran our different algorithms created something of a moving goalpost, and communication with the USDA was key to ensuring that we could maximize our impact.

Tech for Social Good Team

April Chen is Director of Growth of Tech for Social Good and pursuing a joint concentration in Computer Science and Government. She’s fascinated by technology and privacy policy, machine learning fairness, and education technology.

Angelika Antsmae is a Software Engineer for Tech for Social Good and she is concentrating in Computer Science. She is passionate about developing data-driven solutions to leverage the resources of technology.

Eric Li is the Director of Tech for Social Good and a CS concentrator with a minor in Economics. He’s particularly interested in how technology can be used to create significant and scalable impact in the sectors of sustainability and international development.

Kevin Tan is a UX Designer at Tech for Social Good and is studying Chemical & Physical Biology and Computer Science. He is excited to explore how thoughtful design, technology and analytics, and education can empower us to do better and create more inclusive infrastructure.

Jeremy Lee is a Software Engineer for Tech for Social Good and is studying Statistics and Computer Science. He is interested in how data science and technology can be used to optimize solutions for social inequities and sustainable development.


Harvard Computer Society Tech for Social Good

HCS Tech for Social Good is the hub of social impact tech for Harvard undergrads. See more at socialgood.hcs.harvard.edu