Data Sets

Please Note: This site is currently under construction.

Real Data Corpus

The Real Data Corpus (RDC) is a collection of raw data extracted from devices purchased on the secondary market around the world. The devices, discarded by their original users, include hard drives, cell phones, USB memory sticks, and other data-carrying devices. Because the data comes from real use, it allows users to run scenarios that closely mimic the real world.

Potential Uses

The Real Data Corpus is a one-of-a-kind scientific resource for:

  • Developing and validating forensic and data recovery tools.
  • Training students in forensics and data recovery.
  • Developing and validating document translation software.
  • Exploring and characterizing real-world computing practices, configuration choices, and option settings.
  • Studying the storage allocation strategies of file systems under real-world conditions.

Current Contents

The following countries are represented:

Data By Country
Country Code  Country Name          Size  Bytes
AE            United Arab Emirates  8.1T  8901391414209
AT            Austria               1.6T  1679356972941
BA            Bosnia                1.1G  1136964581
BD            Bangladesh            1.7T  1776359989910
BS            Bahamas               1.4T  1521077149007
CA            Canada                964G  1034792301812
CH            Switzerland           1.7G  1727335287
CN            China                 562G  602571221673
CZ            Czech Republic        1.4T  1530418038383
DE            Germany               637G  683552267154
EG            Egypt                 54G   57710207790
GH            Ghana                 661G  708863071235
GR            Greece                6.1G  6535007337
HK            Hong Kong             146G  156677356784
HU            Hungary               511G  548098050616
IL            Israel                7.6T  8246885325799
IN            India                 12T   12388611337970
JP            Japan                 30G   31760609183
MA            Malaysia              1.9T  2043907349900
PA            Panama                205G  219454743117
PK            Pakistan              3.5T  3784807700719
PS            Palestine             1.1T  1174391288173
RS            Serbia                819G  879290972122
SG            Singapore             6.1T  6663236540700
TH            Thailand              7.0T  7681882815641
TR            Turkey                485G  520583275107
UK            United Kingdom        858G  920751501426
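
The Size and Bytes columns above appear to be related by rounding up in binary units (1K = 1024 bytes), the convention used by tools such as `du -h`. The page does not state this, so treat it as an inference from the listed figures. Under that assumption, a minimal Python sketch (the helper name `human_size` is hypothetical) that derives a Size value from a Bytes value could look like this:

```python
# Sketch only: assumes the Size column is the Bytes column rounded UP
# in powers of 1024, with one decimal place below 10 and none above.
UNITS = ["B", "K", "M", "G", "T", "P"]


def human_size(num_bytes: int) -> str:
    """Format a byte count in the rounded-up binary style described above."""
    divisor = 1
    unit = 0
    # Pick the largest unit that keeps the value at or above 1.
    while num_bytes >= divisor * 1024 and unit < len(UNITS) - 1:
        divisor *= 1024
        unit += 1
    if unit == 0:
        return f"{num_bytes}B"
    # Ceiling division in integer arithmetic avoids float rounding surprises.
    tenths = -(-num_bytes * 10 // divisor)
    if tenths < 100:
        # Values below 10 keep one decimal place, e.g. 8.1T.
        return f"{tenths // 10}.{tenths % 10}{UNITS[unit]}"
    # Values of 10 or more are rounded up to a whole number, e.g. 964G.
    # Edge case not handled here: a value that rounds up to 1024 is not
    # promoted to the next unit.
    whole = -(-num_bytes // divisor)
    return f"{whole}{UNITS[unit]}"


if __name__ == "__main__":
    for code, size in [("AE", 8901391414209), ("CA", 1034792301812), ("IN", 12388611337970)]:
        print(code, human_size(size))
```

For example, `human_size(12388611337970)` returns `12T`, matching the India row; rounding to nearest would give 11T, which is why rounding up is assumed here.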

As of February 13, 2017, the Non-US Person's Corpus consists of the following:

  • xxxx hard drive images ranging in size from xxxMB to xxxGB.
  • xxx flash memory images (USB, Sony Memory Stick, SD and other), ranging from 128MB to 4GB.
  • xx CDROMs

For a total of xxTB of data (uncompressed).

Access and Availability

If you would like us to run your model, please contact us to check our availability. We cannot guarantee that everyone will get access, but we are working on a standard process to efficiently provide research results to those who request them. Your source code and build instructions will be required.

IRB Required for Research

The National Research Act[2] (NRA) of 1974 and the Common Rule[3] govern all federally funded research in the United States that is performed with human beings as experimental subjects. Because portions of the Real Data Corpus were funded by the US Government, this legal framework must be followed in research involving the Real Data Corpus. The Common Rule creates a four-part test that determines whether a proposed activity must be reviewed by an Institutional Review Board (IRB). Specifically, IRB approval is required if:

  1. The activity constitutes scientific “research,” a term that the Common Rule broadly defines as “a systematic investigation, including research development, testing and evaluation, designed to develop or contribute to generalizable knowledge.”[4]
  2. The research is federally funded.[5]
  3. The research involves human subjects, which the Common Rule defines as “a living individual about whom an investigator (whether professional or student) conducting research obtains (1) data through intervention or interaction with the individual, or (2) identifiable private information.”[6]
  4. The research is not “exempt” under the regulations.[7] The Common Rule exempts research involving “existing data, documents, [and] records…” provided that the data set is either “publicly available” or that the subjects “cannot be identified, directly or through identifiers linked to the subjects” (§46.101(b)(4)).

Research involving the Real Data Corpus is not exempt under the Common Rule because the RDC is not publicly available and in many cases it is possible to identify individuals whose data are in the collection. Furthermore, the majority of the subjects included in the Real Data Corpus have not provided consent to have their data used for research. Several mitigating factors allow the use of this data: the data was lawfully obtained; research involving this data is “minimal risk” (provided that the data is properly protected and personally identifiable information inside the RDC is kept confidential); there is substantial public benefit in using the RDC for research into computer forensics and computer security; and there is no practical alternative to using this data.

Even if research involving the RDC were exempt, most US universities do not allow experimenters to make their own determination of exemption. Instead, these institutions require that the experimenter submit an application for exempt research to the IRB. To date no IRB has blocked the approval of research that involves the RDC.

To submit an application to an IRB, all experimenters who will make use of the human subject data must take the appropriate human subject training prescribed by their institution. Most institutions prohibit students from filing applications directly, and instead require that an application be filed by a researcher or professor who can be considered a “principal investigator” for external funding.

As a result, any proposed use of the RDC in research requires that an IRB application be filed with the host institution and with the Naval Postgraduate School (NPS). A copy of both the application and the approval from both the host institution and NPS must be provided prior to access being granted. The application must clearly state:

  • The proposed research that is to be done.
  • Why it is necessary to use the RDC; why simulated or realistic data cannot be used as an alternative.
  • What measures will be used to protect the data in the RDC.
  • What measures will be used to prevent the publication of personally identifiable information in any research products.

Please provide us with your IRB application prior to submitting it to your IRB! We can review the application and let you know whether it is consistent with the IRB approval that we have already obtained, or whether we will need to apply for additional IRB approval. Sample applications are available upon request.

Contact Information