Duplicate Identification Project Overview
During discovery, disclosure or an investigation, it is often useful to identify duplicate emails in data exchanged between parties. Among other benefits, this lets legal teams rapidly triage emails they have already reviewed when the same emails also appear in data received from others.
While current approaches effectively identify email duplicates within native datasets processed by a single vendor, they do not enable duplicate identification across emails processed by multiple vendor platforms. Vendors use similar methods to detect email duplicates, but there are nuanced differences in their proprietary algorithms.
Currently the only means of cross platform email duplicate identification is to reprocess the data on a single vendor platform, often at significant time and cost. The EDRM Duplicate Identification project set out to develop a solution to cross platform email duplicate identification.
Our solution is simple but effective: hash the value of an email's Message ID metadata field. We have named the resulting value the EDRM Message Identification Hash (“MIH”). This approach need not replace current vendor email deduplication methods; it adds the ability to identify email duplicates across platforms. We expect cross platform duplicate identification using the MIH to be applied to email data sets that have already been deduplicated using a vendor’s standard deduplication process. We envision the EDRM MIH as an additional field, generated as part of the processing functionality in each vendor’s platform, that recipients use to identify further duplicates in collections with established EDRM MIH values. The process can serve various purposes, including grouping email duplicates for a more efficient review or deduplicating a cross platform email dataset.
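As a rough illustration of the idea, the sketch below reads the Message-ID header from a raw email and hashes it. The hash algorithm (MD5 here) and the normalization (trimming whitespace and one pair of enclosing angle brackets) are assumptions made for this example only, and the function names are hypothetical; the EDRM MIH Specification is the authority on the exact steps.

```python
import hashlib
from email import message_from_string


def normalize_message_id(raw_message_id):
    """Trim whitespace and one pair of enclosing angle brackets.

    Assumption for this sketch only; the Specification defines the
    normalization an implementation must actually apply.
    """
    value = raw_message_id.strip()
    if value.startswith("<") and value.endswith(">"):
        value = value[1:-1]
    return value


def message_identification_hash(raw_message_id, algorithm="md5"):
    """Return a hex digest of the normalized Message-ID.

    The default algorithm is an assumption, not taken from the text above.
    """
    normalized = normalize_message_id(raw_message_id)
    return hashlib.new(algorithm, normalized.encode("utf-8")).hexdigest()


def mih_from_raw_email(raw_email):
    """Extract the Message-ID header from raw email text and hash it.

    Returns None when the email has no Message-ID header.
    """
    message = message_from_string(raw_email)
    message_id = message["Message-ID"]  # header lookup is case-insensitive
    if message_id is None:
        return None
    return message_identification_hash(str(message_id))
```

Because only the Message-ID is hashed, two copies of the same email that were processed by different platforms yield the same value even if their bodies or other headers were rendered differently, provided both platforms extract the same Message-ID.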
The EDRM Email Duplicate Identification Specification Committee (“Committee”) has developed the EDRM Email Duplicate Identification Toolkit (“Toolkit”) to facilitate cross platform identification of duplicate email messages. The Committee anticipates that the use of EDRM MIH will lead to significant time and cost benefits. The Toolkit is designed for a range of stakeholders:
- Parties who are encouraged to exchange EDRM MIH values as additional fields in their produced metadata to facilitate efficient identification of duplicate emails across data received from other parties irrespective of the vendor platform used to produce it.
- Vendors who are encouraged to add the calculation of EDRM MIH values to their production toolsets and methodologies. They do not need to replace current code or change the way they deduplicate email messages internally within their platforms.
- Regulators & Courts who are encouraged to include the EDRM MIH in exchange protocols and practice notes.
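To make the exchange scenario concrete, here is a minimal sketch of how a recipient might match documents across productions once each one carries an MIH field. The record layout (dictionaries with `DocID` and `MIH` keys), the party labels, and the function name are hypothetical; real load files will use whatever field names the parties agree on.

```python
from collections import defaultdict


def find_cross_platform_duplicates(productions):
    """Group document IDs that share an MIH value.

    `productions` maps a party label to a list of metadata records.
    Each record is assumed (for this sketch) to be a dict with
    "DocID" and "MIH" keys. Records without an MIH are skipped.
    Returns only the MIH values seen on more than one document.
    """
    groups = defaultdict(list)
    for party, records in productions.items():
        for record in records:
            mih = record.get("MIH")
            if not mih:
                continue
            # Hex digests may differ in letter case between platforms
            # (an assumption), so compare them case-insensitively.
            groups[mih.strip().lower()].append((party, record["DocID"]))
    return {mih: docs for mih, docs in groups.items() if len(docs) > 1}


if __name__ == "__main__":
    # Placeholder MIH values for illustration, not real hashes.
    productions = {
        "PartyA": [
            {"DocID": "A-0001", "MIH": "AAAA1111"},
            {"DocID": "A-0002", "MIH": "bbbb2222"},
        ],
        "PartyB": [
            {"DocID": "B-0101", "MIH": "aaaa1111"},
            {"DocID": "B-0102", "MIH": ""},
        ],
    }
    for mih, docs in find_cross_platform_duplicates(productions).items():
        print(mih, docs)
```

The lookup needs nothing but the exchanged field, which is why neither party has to reprocess the other's data on its own platform.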
The Toolkit includes:
- EDRM Message Identification Hash (MIH) Specification (v1.0) is a succinct technical specification with advisory notes, written for software developers at vendors who are implementing the MIH in their platforms. It defines a process to identify duplicate emails across disparate formats and forms of data employed in electronic discovery and disclosure.
- EDRM Email Duplicate Identification Guidelines (v1.0) is a non-technical reference for those who need to understand why and how to use the MIH. It outlines the objectives, methodology, potential use cases, advantages, and usage considerations of the Specification. These Guidelines are intended for anyone who wants to use the MIH for cross platform duplicate identification, including parties and counsel, vendors and service providers, and regulators and courts.
Download “DupeID Specification & Guidelines: EDRM Cross Platform Email Duplicate Identification” EDRM-Cross-Platform-Email-Duplicate-Identification-v3.pdf – 290.21 KB
Other components of the Toolkit are:
Whitepaper is a practical, non-technical introduction to the use of the EDRM MIH and is a useful tool for lawyers who need to quickly understand the benefits of using the MIH.
Download “DupeID Whitepaper by Craig Ball” EDRM-Cross-Platform-Email-Duplicate-Identification-Whitepaper.pdf – 154.87 KB
Infographic is a simple one-page explanation of the solution.
Download “DupeID Infographic: EDRM Email Duplicate Identification” EDRM-Email-Duplicate-Identification-Infographic-v3.pdf – 690.69 KB
Data and utilities to support use of the EDRM MIH Specification
Tools and data are Copyright EDRM 2023, licensed under Creative Commons 4.0 International with attribution to https://edrm.net. Use after your own diligence.
Test Data Set – a corpus of emails for testing and verifying product vendors' implementations of the EDRM MIH. The emails are all publicly available or were created by the project team and can be shared. There are 2 files:
1. All 70 emails in the Sample Data Set, FINAL EDRM MIH Example Data 20240123.zip, have been sourced from publicly available data as follows:
- Jeb Bush Emails Sample 1.zip: a small subset of Jeb Bush emails available publicly at https://fcir.org/2014/12/29/search-jeb-bush-email/
- Sample Emails 2.zip: a small set of emails gathered from the EDRM DupeID Committee where content can be shared publicly.
- Sample Emails 3.mbox: an mbox which is part of the EDRM Micro Dataset https://edrm.net/resources/data-sets/.
Download “FINAL EDRM MIH Example Data” FINAL-EDRM-MIH-Example-Data-20240123.zip – 2.90 MB
2. Small Dataset MIH Calculator – an Excel-based tool to generate EDRM MIH values for small sets of Message-IDs. This Excel spreadsheet, FINAL EDRM MIH Sample Email Index, is to be used to verify Message-ID extraction and MIH calculations for all emails in the Example Data Set.
Download “FINAL EDRM MIH Sample Email Index” EDRM-Excel-Webservice-MIH-Calculator-Template-20230328-1.xlsx – 9.37 MB
[In development, coming soon]: Open-source code with GUI frontend to extract MSGID from emails, calculate EDRM MIH values from Outlook PST containers and output an email identifier, MSGID & MIH in a CSV file.
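The planned utility has not been published, so the following is only a sketch of the kind of output it describes: one CSV row per email with an identifier, the Message-ID, and the MIH. It reads an mbox file (such as the Sample Emails 3.mbox mentioned above) instead of an Outlook PST, because the Python standard library has no PST reader. The column names, the function names, and the choice of MD5 with angle-bracket trimming are all assumptions; check the Specification before relying on the values.

```python
import csv
import hashlib
import mailbox


def compute_mih(message_id):
    """Hash a Message-ID after trimming whitespace and angle brackets.

    MD5 and this normalization are assumptions made for this sketch.
    """
    value = message_id.strip()
    if value.startswith("<") and value.endswith(">"):
        value = value[1:-1]
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def write_mih_index(mbox_path, csv_path):
    """Write one CSV row per message in the mbox; return the row count.

    Messages without a Message-ID header get empty MSGID and MIH cells
    so they remain visible in the index.
    """
    box = mailbox.mbox(mbox_path, create=False)
    count = 0
    try:
        with open(csv_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            # Hypothetical column names for illustration.
            writer.writerow(["EmailIdentifier", "MSGID", "MIH"])
            for key, message in box.iteritems():
                message_id = message.get("Message-ID")
                if message_id is None:
                    writer.writerow([key, "", ""])
                else:
                    text = str(message_id).strip()
                    writer.writerow([key, text, compute_mih(text)])
                count += 1
    finally:
        box.close()
    return count
```

A call such as `write_mih_index("Sample Emails 3.mbox", "mih_index.csv")` would produce a file that can be compared against an index of expected values when verifying an implementation.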
The project team (organizations noted for identification purposes only) includes:
- Murali Baddula, Chief Digital Officer at Law In Order (Sydney, Australia)
- Craig Ball, Attorney, Certified Computer Forensic Examiner and Adjunct Professor, University of Texas School of Law (Austin, Texas USA)
- Klaus-Peter Finke-Härkönen, Director Strategic Initiatives at Cyber Diligence LLC (Syosset, NY USA)
- Ian Folkman, Vice President – Forensic Technology at Deloitte Tohmatsu Financial Advisory (Japan)
- Scott Foster, Senior Managing Director at FTI Consulting – Technology Lead (Australia)
- Matthew Golab, Director Legal Informatics and R+D at Gilbert + Tobin (Sydney, Australia)
- Phil Haselden, Technical Fellow at EDT (Australia)
- Greg Houston, Workflow Enablement at Relativity (Chicago, Illinois USA)
- Dr Paul Hunter, Chief Data Scientist at EDT (Australia)
- Dinesh Karamchandani, Senior Data Scientist at Reveal-Brainspace (Chicago, Illinois USA)
- Lisa Kozaris, Chief Innovation & Legal Solutions Officer at Allens (Melbourne, Australia)
- Enzo Lisciotto, Director of Customer Success at Reveal-Brainspace & Co-Membership Director of ACEDS ANZ (Sydney, Australia)
- James MacGregor, Partner at FORCYD (London, UK)
- Karan Mehta, Head of Legal Technology at Allens (Sydney, Australia)
- Rachi Messing, Co-Founder at Altorney (Israel)
- Elizabeth Miller, Senior Director at Law In Order (Australia)
- Beth Patterson, Director at ESPconnect & Adjunct Professor at University of Technology Sydney Law School (Australia)
- Alexander Poelma, Product Manager Nuix Workstation & Engine at Nuix (California, USA)
- Jo Sherman, CEO & Founder at EDT (Australia/USA)
- Paul Sirkis, Division Director, Regional Head of Litigation, Americas at Macquarie Group (New York, New York USA)
- George Socha, Senior Vice President of Brand Awareness at Reveal-Brainspace (Chicago, Illinois USA)
- Stephen Stewart, CTO at Nuix (Philadelphia, PA USA)
- Gavin Wingfield, Director Applied Legal Technology at King & Wood Mallesons
- Emma Young, Head of eDiscovery Delivery at Sky Discovery (Brisbane, Australia)