HashSets.com complements NSRL

This website was designed to complement the file hash sets released by the National Software Reference Library (NSRL), a project of the National Institute of Standards and Technology (NIST) within the US Department of Commerce (www.nsrl.nist.gov). The NSRL maintains the largest known collection of hash values (more than 374 million files analyzed as of 2022), all of which are free to the public.

During November 2003, while reviewing the NSRL Dataset releases, we observed that the MD5 and SHA-1 hash values were the direct result of very advanced custom scripting aimed at software product media (floppy, CD and DVD). This scripting included processes to parse out and hash the files found within a software product’s compressed files (cab files, zip files, etc.).
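As a rough illustration of what hashing the files inside a compressed file involves, the sketch below computes MD5 and SHA-1 values for every member of a zip archive using only the Python standard library. It is not the NSRL's actual scripting, which is not described here; the function names hash_bytes and hash_zip_members are hypothetical, and cab files would need a separate extraction tool.

```python
import hashlib
import io
import zipfile


def hash_bytes(data: bytes) -> dict:
    """Return the MD5 and SHA-1 hex digests of a byte string."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }


def hash_zip_members(archive) -> dict:
    """Hash every regular file stored inside a zip archive.

    `archive` may be a filesystem path or a binary file-like object.
    Returns a mapping of member name -> {"md5": ..., "sha1": ...}.
    """
    results = {}
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries have no content to hash
            results[info.filename] = hash_bytes(zf.read(info))
    return results


if __name__ == "__main__":
    # Build a tiny in-memory archive so the example needs no real media.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("readme.txt", b"hello")
    buffer.seek(0)
    for name, digests in hash_zip_members(buffer).items():
        print(name, digests["md5"], digests["sha1"])
```

Each member is read fully into memory here, which is acceptable for a sketch but would need chunked reads for very large files.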

While performing some of our own validation testing of the NSRL Datasets, we discovered that many more hash values, ones not present in the NSRL Datasets, could be derived from the actual installation of computer software, operating systems, etc. Unfortunately, the NSRL Datasets were not derived from a product’s ‘installation process’.

To illustrate the difference, during our early testing we installed a typical Microsoft Windows operating system (Vista Home Basic) onto two dissimilar IBM-compatible computers and then performed a file hash analysis across both systems to see how many files the most recent release of an NSRL hash set could detect.

Of an average of 36,002 files installed on each Intel-compatible computer system, the NSRL hash sets detected 8,324 files from within their own hash library. That is a detection rate of 23% for files that are known to be installed from a sample Microsoft Windows operating system CD/DVD and are therefore considered trustworthy, known and non-threatening during any typical computer forensic examination.
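The 23% figure follows directly from the two counts reported above. The short sketch below shows the arithmetic, along with a hypothetical helper, detection_rate, that performs the same calculation for any list of file hashes checked against a known hash set. The helper is an illustration only and is not part of any NSRL tooling.

```python
def detection_rate(file_hashes, known_hashes):
    """Return (matched_count, fraction) for files whose hash is in known_hashes.

    `file_hashes` holds one hash per file found on the examined system;
    `known_hashes` is the reference hash set (for example, an NSRL release).
    """
    hashes = list(file_hashes)
    known = set(known_hashes)
    matched = sum(1 for h in hashes if h in known)
    fraction = matched / len(hashes) if hashes else 0.0
    return matched, fraction


# Figures reported in the text: 8,324 of an average 36,002 installed files.
print(f"{8324 / 36002:.1%}")  # 23.1%
```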

Using our own method of installing an operating system and then gathering the hash values common to both computers, we were able to detect 99.98% of the files that were known to be installed from a Microsoft Windows operating system and were therefore also considered trustworthy, known and non-threatening. Specifically, 35,456 files were detected on each test computer.
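Gathering the hash values common to two installations amounts to a set intersection. The sketch below is a minimal illustration under stated assumptions, not our production tooling: it uses SHA-1 only, the function names sha1_of_file, hash_tree and common_hashes are hypothetical, and it assumes each installed system is reachable as a directory tree (for example, a mounted disk image).

```python
import hashlib
import os


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root):
    """Return the set of unique SHA-1 values for all files under `root`."""
    hashes = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                hashes.add(sha1_of_file(path))
            except OSError:
                continue  # skip files that cannot be read
    return hashes


def common_hashes(root_a, root_b):
    """Return the hash values present in both directory trees."""
    return hash_tree(root_a) & hash_tree(root_b)
```

Files that differ between the two machines, such as logs or hardware-specific configuration, drop out of the intersection, leaving only hash values that appear on both systems.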

Based on the larger number of hash values discovered, we decided that the added time and effort of installing an operating system, hashing its files and then gathering all unique hash values into one hash set would produce a result just as valuable as the NSRL datasets, and one that would complement any current NSRL datasets during computer forensic examinations.

It is important to understand that this analysis does NOT suggest in any form or manner that computer forensic and computer security examiners should consider discontinuing the use of NSRL datasets. On the contrary, the NSRL datasets are EXTREMELY significant to the computer forensic and computer security communities, as they provide the largest known repository of hash values (more than 98,000,000 unique values as of 2020), free of charge, for many current and legacy software and operating system programs.

To summarize, our goal with this website is to recommend that, when performing computer forensics or computer security investigations, every analyst, examiner and professional should seek out and consider additional hash values that could help account for files that would otherwise remain ‘unidentified’ throughout an examination. This is especially true if the computer forensic or security analysis must be large scale, timely and thorough.