Real Data Corpus
The Real Data Corpus (RDC) is a collection of raw data extracted from devices that were purchased on the secondary market around the world. The devices were discarded by their original users, and the data is compiled from multiple device types, including hard drives, cell phones, USB memory sticks, and other data-carrying devices. The purchase of this data allows users to run scenarios that closely mimic the real world.
The Real Data Corpus is a one-of-a-kind scientific resource for:
- Developing and validating forensic and data recovery tools.
- Training students in forensics and data recovery.
- Developing and validating document translation software.
- Exploring and characterizing real-world computing practices, configuration choices, and option settings.
- Studying the storage allocation strategies of file systems under real-world conditions.
The following countries are represented:
|Country Code|Country Name|Size|Bytes|
|AE|United Arab Emirates|8.1T|8901391414209|
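The Size and Bytes columns in the table above appear to be related through binary (1024-based) units rather than decimal ones. The following minimal sketch checks that reading against the AE row. The function names are hypothetical and used only for illustration, and the binary-unit interpretation is an assumption inferred from the numbers, not something the table states.

```python
def to_binary_terabytes(num_bytes: int) -> float:
    """Convert a byte count to 1024-based terabytes (TiB)."""
    return num_bytes / 1024**4


def to_decimal_terabytes(num_bytes: int) -> float:
    """Convert a byte count to 1000-based terabytes (TB)."""
    return num_bytes / 1000**4


# Byte count taken from the AE row of the table.
ae_bytes = 8901391414209

binary_label = f"{to_binary_terabytes(ae_bytes):.1f}T"
decimal_label = f"{to_decimal_terabytes(ae_bytes):.1f}T"

print(binary_label)   # matches the Size column if binary units are used
print(decimal_label)  # what the Size column would read with decimal units
```

Under this assumption the binary conversion reproduces the listed 8.1T, whereas a decimal conversion would give 8.9T.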
As of February 13, 2017, the Non-US Person's Corpus consists of the following:
- xxxx hard drive images ranging in size from xxxMB to xxxGB.
- xxx flash memory images (USB, Sony Memory Stick, SD and other), ranging from 128MB to 4GB.
- xx CD-ROMs.
For a total of xxTB of data (uncompressed).
Access and Availability
If you would like us to run your model, please contact us to check our availability. While we do not guarantee that everyone will get access, we are working on a standard process to efficiently provide research results to those who request them. Your source code and build instructions will be required.
IRB Required for Research
The National Research Act (NRA) of 1974 and the Common Rule govern all federally funded research in the United States that is performed with human beings as experimental subjects. Because portions of the Real Data Corpus were funded by the US Government, this legal framework must be followed in research involving the Real Data Corpus. The Common Rule creates a four-part test that determines whether or not a proposed activity must be reviewed by an Institutional Review Board (IRB). Specifically, IRB approval is required if:
- The activity constitutes scientific “research,” a term that the Common Rule broadly defines as “a systematic investigation, including research development, testing and evaluation, designed to develop or contribute to generalizable knowledge.”
- The research is federally funded.
- The research involves human subjects, which the Common Rule defines as “a living individual about whom an investigator (whether professional or student) conducting research obtains (1) data through intervention or interaction with the individual, or (2) identifiable private information.”
- The research is not “exempt” under the regulations. The Common Rule exempts research involving “existing data, documents, [and] records…” provided that the data set is either “publicly available” or that the subjects “cannot be identified, directly or through identifiers linked to the subjects” (§46.101(b)(4)).
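The four-part test above is a conjunction: IRB approval is required only when all four conditions hold. A minimal sketch of that logic follows. The class and field names are hypothetical and introduced only for illustration; deciding whether each condition actually applies to a given project remains a matter for the institution and its IRB, not for code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedActivity:
    """Hypothetical record of the four conditions in the test."""
    is_research: bool              # systematic investigation toward generalizable knowledge
    is_federally_funded: bool      # supported by US federal funding
    involves_human_subjects: bool  # obtains data about living individuals
    is_exempt: bool                # qualifies for an exemption under the regulations


def irb_approval_required(activity: ProposedActivity) -> bool:
    """Return True only when all four parts of the test are satisfied."""
    return (
        activity.is_research
        and activity.is_federally_funded
        and activity.involves_human_subjects
        and not activity.is_exempt
    )


# Illustrative values only: a non-exempt, federally funded study of human-subject data.
example_study = ProposedActivity(
    is_research=True,
    is_federally_funded=True,
    involves_human_subjects=True,
    is_exempt=False,
)
print(irb_approval_required(example_study))
```

Changing any single field so that its condition fails (for example, setting `is_exempt=True`) makes the function return `False`, which mirrors how each part of the test is individually necessary.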
Research involving the Real Data Corpus is not exempt under the Common Rule because the RDC is not publicly available and, in many cases, it is possible to identify individuals whose data are in the collection. Furthermore, the majority of the subjects included in the Real Data Corpus have not consented to the use of their data for research. Several mitigating factors allow the use of this data: the data was lawfully obtained; research involving it is “minimal risk” (provided that the data is properly protected and personally identifiable information inside the RDC is kept confidential); there is substantial public benefit in using the RDC for research into computer forensics and computer security; and there is no practical alternative to using this data.
Even if research involving the RDC were exempt, most US universities do not allow experimenters to make their own determination of exemption. Instead, these institutions require that the experimenter submit an application for exempt research to the IRB. To date, no IRB has blocked the approval of research that involves the RDC.
In order to submit an application to an IRB, all experimenters who will make use of the human subject data must take the appropriate human subject training prescribed by their institution. Most institutions prohibit students from filing applications directly and instead require that an application be filed by a researcher or professor who can be considered a “principal investigator” for external funding.
As a result, any proposed use of the RDC in research requires that an IRB application be filed with the host institution and with the Naval Postgraduate School (NPS). Copies of the application and of the approvals from both the host institution and NPS must be provided before access is granted. The application must clearly state:
- The proposed research that is to be done.
- Why it is necessary to use the RDC; why simulated or realistic data cannot be used as an alternative.
- What measures will be used to protect the data in the RDC.
- What measures will be used to prevent the publication of personally identifiable information in any research products.
Please provide us with your IRB application prior to submitting it to your IRB! We can review the application and let you know if it is consistent with the IRB approval that we have already obtained, or if we will need to apply for additional IRB approval. Sample applications are available upon request.