EDRM Data Set Project

Data Set



The EDRM Data Set Project provides industry-standard, reference data sets of electronically stored information (ESI) and software files that can be used to test various aspects of e-discovery software and services, through three initiatives:

  • EDRM ESI Reference Data Sets
  • EDRM Software Reference Data Set
  • EDRM Probabilistic Hash Data Set

EDRM ESI Reference Data Sets

This initiative collects, evaluates, and publishes ESI data sets for use in testing e-discovery software and services. Three data sets are currently offered, and more are under evaluation:

EDRM Enron PST Data Set: Enron e-mail messages and attachments organized in 32 zipped files, each less than 700 MB in size, containing 168 .pst files.

The data in the EDRM Enron PST Data Set files is sourced from the FERC Enron Investigation release made available by Lockheed Martin Corporation, and has been reconstituted as PST files with attachments by ZL Technologies for the EDRM Data Set Project. It is our understanding that Lockheed Martin has not placed any restrictions on any of the Enron material that it has released to the public.

PLEASE NOTE: These files may contain viruses, as can be the case with any set of files collected during discovery.

EDRM File Formats Data Set: 381 files covering 200 file formats.

EDRM Internationalization Data Set: A snapshot of selected Ubuntu localization mailing list archives covering 23 languages in 724 MB of email.

EDRM Software Reference Data Set

With the EDRM Software Reference Data Set initiative, EDRM seeks to augment the NIST Reference Data Set hashes used in e-discovery with additional hashes of known software files that can be further culled for review purposes.

While the NIST list focuses on a selection of software applications, and only as the software exists on installation media (e.g., DVDs and CDs), this initiative will provide hashes for software after it has been extracted from compressed media containers and installed on a system. It will also cover software not currently handled by NIST, e.g., software that is downloaded from the Internet rather than received on DVD or CD media.

This initiative will modernize and enhance the list of hashes available for culling software files to reduce e-discovery costs.
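As an illustration of hash-based culling, the following minimal Python sketch separates files whose MD5 hash appears in a list of known software hashes from files that remain for review. The function names (`file_md5`, `cull_known_files`) and the idea of passing the known hashes as a plain collection of hex strings are assumptions made for this example; they do not describe the format of the NIST list or of any EDRM deliverable.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the lowercase hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cull_known_files(paths, known_hashes):
    """Split paths into (kept, culled) using a set of known-software hashes.

    known_hashes is a hypothetical input: any iterable of hex MD5 strings.
    """
    known = {h.strip().lower() for h in known_hashes}
    kept, culled = [], []
    for path in paths:
        if file_md5(path) in known:
            culled.append(path)   # matches a known software file
        else:
            kept.append(path)     # unknown file, stays in the review set
    return kept, culled
```

Files in `culled` can be set aside before review; everything in `kept` still needs attention.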

EDRM Probabilistic Hash Data Set

To further improve the culling process, the Probabilistic Hash Data Set initiative seeks to collect as many anonymous hashes as possible of files encountered in real world e-discovery.

The frequency of the appearance of hashes can then be used to determine the likelihood that a particular file could be classified as probably not relevant. This initiative seeks to significantly improve the performance of automated culling of non-ESI files for e-discovery, resulting in both more reliable results and lower cost.
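One way to picture the frequency idea is the sketch below. It assumes that hashes are gathered per collection and that a hash seen in a large share of collections is a candidate for "probably not relevant". The measure (fraction of collections containing the hash), the 0.5 threshold, and the function names are all assumptions chosen for illustration; the initiative does not specify how frequency is to be turned into a likelihood.

```python
from collections import Counter


def hash_frequencies(collections):
    """Map each hash to the fraction of collections in which it appears.

    collections: iterable of iterables of hash strings, one per collection.
    """
    counts = Counter()
    total = 0
    for hashes in collections:
        total += 1
        counts.update(set(hashes))  # count a hash once per collection
    if total == 0:
        return {}
    return {h: n / total for h, n in counts.items()}


def probably_not_relevant(file_hash, frequencies, threshold=0.5):
    """Flag a hash whose observed frequency meets the (assumed) threshold."""
    return frequencies.get(file_hash, 0.0) >= threshold
```

Hashes that have never been seen get a frequency of zero and are never flagged, so unfamiliar files stay in the review set by default.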




Questions

  1. How do I get the EDRM Data Set files?
  2. How much do I have to pay to get the EDRM Data Sets? How much do I have to pay to use them?
  3. Where does the data in the EDRM Enron PST Data Set come from? Do you have the rights to redistribute it?
  4. Does the EDRM PST Data Set contain viruses?
  5. Is there a set of MD5 checksums for the EDRM PST Data Set files?
  6. Why does the EDRM Enron PST Data Set contain duplicates?
  7. How do the EDRM Enron PST Data Set and the CMU Data Set differ?
  8. How do the EDRM Enron PST Data Set and the Berkeley ANLP Categorization differ?

Answers

  1. How do I get the EDRM Data Set files?

    EDRM Data Sets can be downloaded from the EDRM Data Set page. Go to edrm.net/21, select the "Downloads" tab, select the desired data set, and follow any additional instructions.

  2. How much do I have to pay to get the EDRM Data Sets? How much do I have to pay to use them?

    We do not charge for access to the EDRM Data Sets, nor do we charge for use of the data sets.

    We have made this content available under a Creative Commons Attribution 3.0 United States License. To provide the attribution required under that license, when sharing or remixing the content please cite "EDRM (edrm.net)".

  3. Where does the data in the EDRM Enron PST Data Set come from? Do you have the rights to redistribute it?

    The data in the EDRM Enron PST Data Set files is sourced from the FERC Enron Investigation release made available by Lockheed Martin Corporation, and has been reconstituted as PST files with attachments by ZL Technologies for the EDRM Data Set Project. It is our understanding that Lockheed Martin has not placed any restrictions on any of the Enron material that it has released to the public.

  4. Does the EDRM PST Data Set contain viruses?

    We have been told that some of the files in the EDRM PST Data Set contain viruses. We view the task of addressing possible viruses as a responsibility that rests with the entity processing or otherwise working with the files, as is the case in real-world e-discovery undertakings.

  5. Is there a set of MD5 checksums for the EDRM PST Data Set files?

    Yes. A text file containing MD5 hash values is available as EDRM-Enron-PST-MD5.txt.

  6. Why does the EDRM Enron PST Data Set contain duplicates?

    We are attempting to match, as best we can, a real-world e-discovery situation. With the Enron data, multiple collections were made of many custodians’ email over a period of several months. Because the same email often was collected several times, the set contains duplicates. In the PST set, each collection can sometimes be seen as a top-level folder in the PST file. De-duplication and near de-duplication can be used to identify and handle these duplicates. Alternatively, it may be beneficial to separate each of those top-level folders into its own PST file, which would at least limit the duplicates to within a collection.

  7. How do the EDRM Enron PST Data Set and the CMU Data Set differ?

    The email in the PST files and the email in the CMU corpus overlap, but not completely. We have been talking about creating a mapping between the EDRM PST email and the CMU corpus but have not completed that project yet.

  8. How do the EDRM Enron PST Data Set and the Berkeley ANLP Categorization differ?

    Our participants have looked at this set and mapped it to some other Enron email data sets, but have not mapped it to the EDRM data set yet. We agree it would be useful to incorporate this into a data set offering for use as a training set and for other purposes.
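Answers 5 and 6 both come down to hashing files, so a short Python sketch may help. It checks a downloaded file against a published MD5 value and groups exact duplicates by digest. The layout of EDRM-Enron-PST-MD5.txt is not described here, so the sketch takes the expected digest as a plain hex string rather than parsing that file. The duplicate grouping assumes items have already been exported to individual files; reading PST files and near de-duplication are not shown. All function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Return the lowercase hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, expected_hex):
    """Compare a file's MD5 with an expected hex digest (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()


def group_exact_duplicates(paths):
    """Return lists of paths whose contents are byte-for-byte identical."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of_file(path)].append(path)
    # Keep only digests shared by more than one file.
    return [group for group in groups.values() if len(group) > 1]
```

Exact hashing only catches byte-identical copies. Messages that differ in headers or formatting but carry the same content would need a near de-duplication approach.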


Downloads for the EDRM Data Set project:



Posted here are the EDRM Data Set downloads. They include:

  • EDRM Enron PST Data Set: Enron e-mail messages and attachments organized in 32 zipped files, each less than 700 MB in size, containing 168 .pst files.
    PLEASE NOTE: These files may contain viruses, as can be the case with any set of files collected during discovery.
  • EDRM File Formats Data Set: 381 files covering 200 file formats.
  • EDRM Internationalization Data Set: A snapshot of selected Ubuntu localization mailing list archives covering 23 languages in 724 MB of email.

The files are most easily downloaded from the EDRM Enron Data Set Files page.

The total size of the compressed files is approximately 19 GB. The total size of the uncompressed files is approximately 43 GB.

  • xls EDRM Data Set File Sizes (version 1.0)
    (posted Feb 01, 2010; xls; 114 KB)
    Excel spreadsheet with file sizes per custodian (Enron) and language (Internationalization)
  • pdf EDRM Data Set Overview (version 1.0)
    (posted Feb 01, 2010; pdf; 89.8 KB)
    Overview of the EDRM Data Set Project, prepared for Feb. 1, 2010 EDRM lunch-and-learn session
  • zip EDRM File Formats Data Set (version 1.0)
    (posted Jan 08, 2010; zip; 17.56 MB)
    This data set includes 381 files covering over 200 file formats. The current file formats are covered in the Excel spreadsheet: EDRM_Data-Set_File-Formats_1-0_Manifest.xls
  • zip EDRM Internationalization Data Set (version 1.0)
    (posted Jan 08, 2010; zip; 176.49 MB)
    The data set currently consists of a snapshot of selected Ubuntu localization mailing list archives covering 23 languages in 724 MB of email.
  • ppt EDRM LegalTech 2009 Luncheon Presentation
    (posted Feb 09, 2009; ppt; 4.94 MB)
    Presentation from the EDRM luncheon at LegalTech New York, Feb. 3, 2009
  • pptx EDRM LegalTech 2009 Luncheon Presentation
    (posted Feb 09, 2009; pptx; 1.82 MB)
    Presentation from the EDRM luncheon at LegalTech New York, Feb. 3, 2009
  • pdf EDRM LegalTech 2009 Luncheon Presentation
    (posted Feb 09, 2009; pdf; 1.42 MB)
    Presentation from the EDRM luncheon at LegalTech New York, Feb. 3, 2009
  • zip EDRM-Enron-PST-001.zip
    (posted Nov 19, 2009; zip; )
    596,574 KB. EDRM-Enron-PST-001.zip: File 1 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-002.zip
    (posted Nov 19, 2009; zip; )
    337,221 KB. EDRM-Enron-PST-002.zip: File 2 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-003.zip
    (posted Nov 19, 2009; zip; )
    425,152 KB. EDRM-Enron-PST-003.zip: File 3 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-004.zip
    (posted Nov 19, 2009; zip; )
    597,678 KB. EDRM-Enron-PST-004.zip: File 4 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-005.zip
    (posted Nov 19, 2009; zip; )
    616,356 KB. EDRM-Enron-PST-005.zip: File 5 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-006.zip
    (posted Nov 19, 2009; zip; )
    669,398 KB. EDRM-Enron-PST-006.zip: File 6 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-007.zip
    (posted Nov 19, 2009; zip; )
    596,069 KB. EDRM-Enron-PST-007.zip: File 7 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-008.zip
    (posted Nov 19, 2009; zip; )
    609,898 KB. EDRM-Enron-PST-008.zip: File 8 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-009.zip
    (posted Nov 19, 2009; zip; )
    511,159 KB. EDRM-Enron-PST-009.zip: File 9 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-010.zip
    (posted Nov 19, 2009; zip; )
    611,992 KB. EDRM-Enron-PST-010.zip: File 10 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-011.zip
    (posted Nov 19, 2009; zip; )
    675,515 KB. EDRM-Enron-PST-011.zip: File 11 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-012.zip
    (posted Nov 19, 2009; zip; )
    657,253 KB. EDRM-Enron-PST-012.zip: File 12 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-013.zip
    (posted Nov 19, 2009; zip; )
    632,682 KB. EDRM-Enron-PST-013.zip: File 13 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-014.zip
    (posted Nov 19, 2009; zip; )
    634,667 KB. EDRM-Enron-PST-014.zip: File 14 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-015.zip
    (posted Nov 19, 2009; zip; )
    592,252 KB. EDRM-Enron-PST-015.zip: File 15 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-016.zip
    (posted Nov 19, 2009; zip; )
    580,085 KB. EDRM-Enron-PST-016.zip: File 16 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-017.zip
    (posted Nov 19, 2009; zip; )
    270,488 KB. EDRM-Enron-PST-017.zip: File 17 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-018.zip
    (posted Nov 19, 2009; zip; )
    691,155 KB. EDRM-Enron-PST-018.zip: File 18 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-019.zip
    (posted Nov 19, 2009; zip; )
    607,032 KB. EDRM-Enron-PST-019.zip: File 19 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-020.zip
    (posted Nov 19, 2009; zip; )
    667,569 KB. EDRM-Enron-PST-020.zip: File 20 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-021.zip
    (posted Nov 19, 2009; zip; )
    646,863 KB. EDRM-Enron-PST-021.zip: File 21 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-022.zip
    (posted Nov 19, 2009; zip; )
    528,298 KB. EDRM-Enron-PST-022.zip: File 22 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-023.zip
    (posted Nov 19, 2009; zip; )
    547,570 KB. EDRM-Enron-PST-023.zip: File 23 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-024.zip
    (posted Nov 19, 2009; zip; )
    466,738 KB. EDRM-Enron-PST-024.zip: File 24 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-025.zip
    (posted Nov 19, 2009; zip; )
    689,658 KB. EDRM-Enron-PST-025.zip: File 25 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-026.zip
    (posted Nov 19, 2009; zip; )
    460,209 KB. EDRM-Enron-PST-026.zip: File 26 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-027.zip
    (posted Nov 19, 2009; zip; )
    545,710 KB. EDRM-Enron-PST-027.zip: File 27 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-028.zip
    (posted Nov 19, 2009; zip; )
    678,708 KB. EDRM-Enron-PST-028.zip: File 28 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-029.zip
    (posted Nov 19, 2009; zip; )
    561,109 KB. EDRM-Enron-PST-029.zip: File 29 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-030.zip
    (posted Nov 19, 2009; zip; )
    408,594 KB. EDRM-Enron-PST-030.zip: File 30 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-031.zip
    (posted Nov 19, 2009; zip; )
    661,782 KB. EDRM-Enron-PST-031.zip: File 31 of 32 zipped .pst files.
  • zip EDRM-Enron-PST-032.zip
    (posted Nov 19, 2009; zip; )
    126,386 KB. EDRM-Enron-PST-032.zip: File 32 of 32 zipped .pst files.
  • xls EDRM-Enron-PST-Listing.xls
    (posted Nov 19, 2009; xls; )
    43.5 KB. Spreadsheet file listing zipped EDRM Enron PST files and the .pst files contained in the zipped files
  • txt EDRM-Enron-PST-MD5.txt
    (posted Dec 02, 2009; txt; 9.1 KB)
    MD5 hash values for the 168 EDRM Enron PST files.

  • Full Set of Materials from EDRM Lunch-and-Learn Session at LegalTech - March 1, 2010

    Here are links to all the materials used and distributed at the EDRM lunch-and-learn session at New York LegalTech

  • 2010-2011 EDRM Kickoff Meeting Update - February 26, 2010

    Whether you are a current EDRM participant, an EDRM alumnus, or have never participated in EDRM, we invite you to the kickoff meeting for the 2010-2011 EDRM year - our sixth year!

  • Electronic Discovery Reference Model (EDRM) Announces Public Comment Period for All E-Discovery Projects - February 23, 2010

    ST. PAUL, Minn. – February 23, 2010 – The Electronic Discovery Reference Model (EDRM) project teams today announced the beginning of the public comment period for new work product drafts posted on the EDRM website.

  • More Overview Materials for EDRM Lunch-and-Learn Session - February 1, 2010

    Additional overview materials for the Feb. 1, 2010 EDRM lunch-and-learn session are now available for download: EDRM Data Set Overview; EDRM Data Set File Sizes Spreadsheet

  • New EDRM Internationalization Data Set Now Available - January 9, 2010

    A third EDRM data set is now available, the EDRM Internationalization Data Set.

  • New EDRM File Formats Data Set Now Available - January 8, 2010

    A new EDRM data set is now available. The EDRM File Formats Data Set includes 381 files covering over 200 file formats. The current file formats are covered in the Excel spreadsheet: EDRM_Data-Set_File-Formats_1-0_Manifest.xls

  • Expanded EDRM Enron Data Set Download Capacity - December 6, 2009

    The EDRM Enron Data Set files should be much easier to download now. They all are loaded on Amazon Web Services, and as a result the time to download them should be much shorter.

  • EDRM Enron Data Set Hash Values Now Available - December 2, 2009

    We have posted a list of MD5 hash values for the 168 .pst files that comprise the EDRM Enron Data Set. The list is available at the Data Set page. Go to http://edrm.net/activities/projects/data-set, select the "Enron Downloads" tab, and then select the EDRM-Enron-PST-MD5.txt file.

  • EDRM Data Set Project: EDRM Enron PST Files Now Available - November 19, 2009

    The EDRM Enron PST files are now available on the EDRM website. They are posted as 32 zipped files, each less than 700 MB in size. Also posted is a spreadsheet listing the zipped files and the 168 .pst files contained in the zipped files.

  • EDRM at LegalTech - November 10, 2009

    EDRM will be at LegalTech NY, with a Lunch & Learn session on Monday, February 1, 2010, in the Hilton's Petit Trianon room. Find out what we have been doing with EDRM, where we hope to go with the projects, and how you can get involved.

  • EDRM Approaches Mid-Year Meeting With New Website and Significant Project Advancements - October 17, 2009

    The Electronic Discovery Reference Model (EDRM) project today announced that it is now easier for users to find the valuable research and standards created by the leading e-discovery industry group via a completely re-designed website, www.edrm.net. In addition, the EDRM leaders, Tom Gelbmann and George Socha, have provided updates to all of the working projects in advance of the mid-year meeting, which is being held from Oct. 20-21, 2009, in St. Paul, Minn.

  • What’s New at EDRM - September 11, 2009

    Great strides have been made over the past two years to flesh out and further define the EDRM model. Over the next few months, you will see two years' worth of active collaboration, refinement, and modeling added to the EDRM website. As part of that process, we are rebuilding the site to give users an updated interface with easier access to the valuable research and content. With that in mind, we wanted to provide a quick update on what’s happening in each of the projects.

  • EDRM LegalTech Luncheon Presentation Now Available - February 9, 2009

    The presentation from the EDRM Luncheon at LegalTech New York is now available for download. On Feb. 3, 2009, EDRM leaders gave presentations on their working group activities and progress in 2008 as well as activities of focus in 2009. They covered all six EDRM projects: Evergreen, XML, Metrics, Search, Model Code of Conduct, and Data Set.

  • EDRM Luncheon at LegalTech New York Conference to Address Key E-Discovery Issues - January 29, 2009

    St. Paul, MN (January 28, 2009) – The Electronic Discovery Reference Model (“EDRM”), an industry group created to develop and establish practical guidelines and standards for electronic discovery, today announces: WHAT: An EDRM “Lunch and Learn” featuring George Socha, Tom Gelbmann and other industry leaders for a presentation on the key trends impacting e-discovery professionals, including advancing standards across all phases of the process (Evergreen), ethical challenges (Model Code of Conduct), search practices and technology advancements (Search), data challenges (Data Set) and XML integration between various e-discovery tools (XML). Attendees will also have an opportunity to learn more about involvement in this industry-leading organization.
