Further Cleaning Software Installation Data to Mitigate Security Risks for the USDA

This is a summer continuation of our spring partnership with the USDA.


The Client

The U.S. Department of Agriculture (USDA) is the department within the US federal government that governs most matters pertaining to food and agriculture, natural resources, nutrition, and similar issues. To execute its programs, which include the Supplemental Nutrition Assistance Program (SNAP), USDA workers use a wide variety of software. To support them, the Client Experience Center (CEC) manages IT services and tracks applications for over 100,000 contractors and employees. Harvard Computer Society Tech for Social Good (T4SG) worked with the CEC to better manage, understand, and analyze the technologies these contractors and employees use.

The Problem

Across the large number of workstations tracked by the USDA, many software programs perform nearly identical tasks and are therefore redundant. The current list of applications also contains other entries that are not essential. These redundancies create barriers when negotiating contracts with technology vendors and when scanning workstations to ensure their security. One of our objectives was therefore to help the CEC identify programs that performed the same job or were otherwise inessential, and to consolidate this information in a dashboard. The CEC also sought a set of automated, repeatable steps to make the daily process of feeding data into these dashboards more efficient.

The Solution

To rationalize and condense the USDA’s software data, we undertook three main tasks.

First, we filtered out utility applications that did not serve any business function. These included, but were not limited to, drivers, servers, updates, and plug-ins. We began with a keyword search within application names, flagging words such as “compiler,” “installer,” and “tool.” We then filtered out applications from publishers that primarily produce utilities, such as Intel, AMD, and Dell. Together, these two passes marked over 25% of the applications as utilities.
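The two filtering passes described above can be sketched roughly as follows. This is a minimal illustration, not the script we delivered: the function name `is_utility`, the whole-word matching rule, and the exact keyword and publisher sets are assumptions made for the example (the real lists were longer than the three examples of each quoted above).

```python
import re

# Example keywords and publishers taken from the text above; the full
# lists used in the project were longer (assumption: whole-word matching).
UTILITY_KEYWORDS = {"compiler", "installer", "tool"}
UTILITY_PUBLISHERS = {"intel", "amd", "dell"}


def _tokens(text):
    """Lowercase a string and split it into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_utility(name, publisher):
    """Flag an application as a utility by name keyword or by publisher."""
    if _tokens(name) & UTILITY_KEYWORDS:
        return True
    return bool(_tokens(publisher) & UTILITY_PUBLISHERS)


# Hypothetical records, for illustration only.
apps = [
    ("Snipping Tool", "Microsoft Corporation"),
    ("Chipset Software", "Intel Corporation"),
    ("Adobe Acrobat", "Adobe"),
]
business_apps = [(n, p) for n, p in apps if not is_utility(n, p)]
```

Matching on whole words keeps a keyword like "tool" from flagging unrelated names that merely contain those letters, at the cost of missing variants such as "tools" unless they are listed explicitly.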

The second task we tackled was normalizing application names. Many names included version numbers, “64-bit” and “32-bit” tags, and various other extraneous words. This cluttered the dataset, because a single application could appear under many different names across its installations. After normalizing all names, the number of unique application names decreased by nearly 40%.
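To make the normalization step concrete, here is a small sketch of the idea using regular expressions. The patterns and the function name `normalize_name` are assumptions for illustration; the rules we actually applied covered more cases. In this sketch only dotted version numbers are removed, so a bare number that may be part of a product name is left alone.

```python
import re

# Bitness tags such as "64-bit", "(32-bit)", "x64", "(x86)".
BITNESS = re.compile(
    r"\(?\b(?:32|64)[- ]?bit\b\)?|\(?\bx(?:64|86)\b\)?", re.IGNORECASE
)
# Dotted versions such as "3.9.1", "v91.0.4472", "version 2.1".
VERSION = re.compile(r"\b(?:version\s+|v)?\d+(?:\.\d+)+\b", re.IGNORECASE)


def normalize_name(raw):
    """Strip bitness tags and dotted version numbers from an app name."""
    name = BITNESS.sub(" ", raw)
    name = VERSION.sub(" ", name)
    name = re.sub(r"\s+", " ", name)  # collapse leftover whitespace
    return name.strip(" -_,")


# Hypothetical installation records, for illustration only.
raw_names = [
    "Python 3.9.1 (64-bit)",
    "Python 3.8.10 (32-bit)",
    "7-Zip 19.00 (x64)",
    "Notepad++ 7.9.5 32-bit",
]
unique_names = {normalize_name(n) for n in raw_names}
```

Counting the distinct values before and after normalization is one simple way to measure how much a rule set reduces the number of unique names.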

The final task we undertook was streamlining all of the data processes we built. We integrated our algorithms with the ones T4SG built in the spring to create a single runnable script that would create a dashboard, bundle applications that shared a license, remove utilities, and normalize application names.
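One way to structure a single runnable script of this kind is as an ordered list of steps, each taking and returning the list of records. The sketch below only shows that shape: the step functions, the record fields (`name`, `is_utility`), and `run_pipeline` are hypothetical stand-ins, not the actual code.

```python
def drop_utilities(records):
    """Remove records already flagged as utilities (hypothetical field)."""
    return [r for r in records if not r.get("is_utility", False)]


def tidy_names(records):
    """Placeholder for name normalization; here it only trims whitespace."""
    return [{**r, "name": r["name"].strip()} for r in records]


def run_pipeline(records, steps):
    """Apply each step in order, feeding its output to the next step."""
    for step in steps:
        records = step(records)
    return records


# Hypothetical input, for illustration only.
records = [
    {"name": " Word ", "is_utility": False},
    {"name": "Graphics Driver", "is_utility": True},
]
cleaned = run_pipeline(records, [drop_utilities, tidy_names])
```

Keeping each step as an independent function makes it straightforward to add, reorder, or rerun steps as the daily data changes.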

Our final outcome was a significant consolidation of the USDA’s software installation data. We removed duplicate entries and resolved inconsistencies in application naming, leading to an improvement in the USDA’s efficiency for sorting out licensing deals and security checks.

Reflections

Jeremy: Through this project, I learned a lot about working with large datasets and about staying adaptable, shifting our goals whenever new discoveries about the data required us to change focus.

Justin: Working with the USDA was a thought-provoking, positive experience that allowed me to continue developing my data science skills by working with NumPy and large, often messy datasets.

The Team

Jeremy and Justin

Jeremy: Jeremy is a rising sophomore at Harvard prospectively studying Statistics and Computer Science. In his free time, he loves to play tennis and volleyball.

Justin: Justin is a rising sophomore at Harvard prospectively concentrating in Computer Science. When he’s not coding, he enjoys photography and playing Ultimate Frisbee.

Harvard Computer Society Tech for Social Good

HCS Tech for Social Good is the hub of social impact tech for Harvard undergrads. See more at socialgood.hcs.harvard.edu