DupeID

Duplicate Identification Project Overview

During discovery, disclosure or an investigation, it is often useful to identify duplicate emails in data exchanged between parties. This can deliver many benefits, including the ability for legal teams to rapidly triage emails already reviewed that also reside in data received from others.

While current approaches effectively identify email duplicates within native datasets processed by a single vendor, they do not enable duplicate identification across emails processed by multiple vendor platforms. Vendors use similar methods to detect email duplicates, but there are nuanced differences in their proprietary algorithms.

Currently, the only means of cross platform email duplicate identification is to reprocess the data using a single vendor platform, which often takes significant time and cost. The EDRM Duplicate Identification project set out to develop a solution to cross platform email duplicate identification.

Our solution is simple but effective: it uses the hash value of an email's Message ID metadata field, a value we have named the EDRM Message Identification Hash (“MIH”). This new approach need not replace current vendor email deduplication methods, but it will enable cross platform email duplicate identification. We expect that cross platform duplicate identification using the MIH will be applied to email data sets that have already been deduplicated using a vendor’s standard deduplication process. We envision the EDRM MIH as an additional field that is generated as part of the processing functionality in each vendor’s platform and used by recipients to further identify duplicates in collections with established EDRM MIH values. This email duplicate identification process can be used for various purposes, including grouping email duplicates to enable a more efficient review process or deduplicating a cross platform email dataset.
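
The idea can be illustrated with a minimal Python sketch. It is not the Specification's algorithm: the choice of MD5, the trimming of whitespace and angle brackets, and the names `mih` and `group_duplicates` are assumptions made for illustration only. The MIH Specification remains the authority on how the value must be computed.

```python
import hashlib
from collections import defaultdict


def mih(message_id: str) -> str:
    """Hash a Message-ID value (hypothetical helper, not the official algorithm)."""
    # Assumption: trim whitespace and surrounding angle brackets before hashing.
    normalized = message_id.strip().strip("<>").strip()
    # Assumption: MD5 is used here purely as an example hash function.
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def group_duplicates(records):
    """Group document identifiers that share the same hash value.

    `records` is an iterable of (document_id, message_id) pairs, possibly
    drawn from productions processed on different vendor platforms.
    """
    groups = defaultdict(list)
    for doc_id, message_id in records:
        groups[mih(message_id)].append(doc_id)
    # Keep only hash values that occur more than once.
    return {h: ids for h, ids in groups.items() if len(ids) > 1}


# Made-up records from two hypothetical productions.
production_a = [("A-0001", "<123.abc@mail.example.com>"),
                ("A-0002", "<456.def@mail.example.com>")]
production_b = [("B-0107", "123.abc@mail.example.com"),
                ("B-0108", "<789.ghi@mail.example.com>")]

duplicates = group_duplicates(production_a + production_b)
for hash_value, doc_ids in duplicates.items():
    print(hash_value, doc_ids)
```

In this sketch, A-0001 and B-0107 fall into the same group because their Message-ID values match once the assumed normalization is applied, even though the two records come from different productions.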

The EDRM Email Duplicate Identification Specification Committee (“Committee”) has developed the EDRM Email Duplicate Identification Toolkit (“Toolkit”) to facilitate cross platform identification of duplicate email messages. The Committee anticipates that the use of EDRM MIH will lead to significant time and cost benefits. The Toolkit is designed for a range of stakeholders:

  • Parties who are encouraged to exchange EDRM MIH values as additional fields in their produced metadata to facilitate efficient identification of duplicate emails across data received from other parties irrespective of the vendor platform used to produce it.
  • Vendors who are encouraged to add the calculation of EDRM MIH values to their production toolsets and methodologies. They do not need to replace current code or change the way they deduplicate email messages internally within their platforms.
  • Regulators & Courts who are encouraged to include the EDRM MIH in exchange protocols and practice notes.

The Toolkit includes:

  • EDRM Message Identification Hash (MIH) Specification (v1.0) is a succinct technical specification with advisory notes, written for software developers at vendors who are implementing the MIH in their platforms. It defines a process to identify duplicate emails across disparate formats and forms of data employed in electronic discovery and disclosure.
  • EDRM Email Duplicate Identification Guidelines (v1.0) is a non-technical reference for those who need to understand why and how to use the MIH. It outlines the objectives, methodology, potential use cases, advantages, and usage considerations of the Specification. These Guidelines are intended for anyone who wants to use the MIH for cross platform duplicate identification, including parties and counsel, vendors and service providers, and regulators and courts.

Other components of the Toolkit are:

  • Whitepaper is a practical, non-technical introduction to the use of the EDRM MIH and is a useful tool for lawyers who need to quickly understand the benefits of using the MIH.
  • Infographic is a simple one-page explanation of the solution.
  • Data and utilities to support use of the EDRM MIH Specification
    • Test Data Set – a corpus of emails enabling testing and verification of EDRM MIH implementations by vendors. 
    • Small Dataset MIH Calculator – an Excel-based tool to generate EDRM MIH values for small sets of Message-IDs.
    • Open-source code with GUI frontend to extract MSGID from emails, calculate EDRM MIH values from Outlook PST containers and output an email identifier, MSGID & MIH in a CSV file. 
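
The output described for the open-source utility (an email identifier, MSGID and MIH in a CSV file) can be approximated with the sketch below. It is not the Toolkit's code: it reads raw RFC 5322 message text instead of Outlook PST containers, and the MD5 hash, the angle-bracket trimming, the column names and the function names are all assumptions for illustration.

```python
import csv
import hashlib
import io
from email import message_from_string


def extract_message_id(raw_email: str) -> str:
    """Return the Message-ID header of a raw email, or '' if it is absent."""
    msg = message_from_string(raw_email)
    return (msg.get("Message-ID") or "").strip()


def mih(message_id: str) -> str:
    # Assumptions: angle brackets are trimmed and MD5 is the hash function.
    normalized = message_id.strip().strip("<>").strip()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def write_mih_csv(emails, out) -> None:
    """Write one row per email: identifier, Message-ID and hash value.

    `emails` is an iterable of (email_id, raw_email_text) pairs.
    The column names are hypothetical.
    """
    writer = csv.writer(out)
    writer.writerow(["EmailID", "MSGID", "MIH"])
    for email_id, raw in emails:
        msgid = extract_message_id(raw)
        # Leave the hash empty when the email carries no Message-ID.
        writer.writerow([email_id, msgid, mih(msgid) if msgid else ""])


# Made-up sample messages.
sample = [
    ("DOC-1", "Message-ID: <123.abc@mail.example.com>\nSubject: Hello\n\nBody"),
    ("DOC-2", "Subject: No identifier\n\nBody"),
]

buffer = io.StringIO()
write_mih_csv(sample, buffer)
print(buffer.getvalue())
```

Emails without a Message-ID header get an empty MIH column in this sketch; how such messages should be treated in practice is a matter for the Specification and Guidelines.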

All components of the Toolkit will be made accessible from the EDRM website.

Download “DupeID Infographic: EDRM Email Duplicate Identification” EDRM-Email-Duplicate-Identification-Infographic-v3.pdf – 691 KB

Download “DupeID Specification & Guidelines: EDRM Cross Platform Email Duplicate Identification” EDRM-Cross-Platform-Email-Duplicate-Identification-v3.pdf – 290 KB

Download “DupeID Whitepaper by Craig Ball” EDRM-Cross-Platform-Email-Duplicate-Identification-Whitepaper.pdf – 155 KB


Public comments will be accepted until xxxx.

Contributors

The project team (organizations noted for identification purposes only) includes:

  • Murali Baddula, Chief Digital Officer at Law In Order (Sydney, Australia)
  • Craig Ball, Attorney, Certified Computer Forensic Examiner and Adjunct Professor, University of Texas School of Law (Austin, Texas USA)
  • Ian Folkman, Vice President – Forensic Technology at Deloitte Tohmatsu Financial Advisory (Japan)
  • Scott Foster, Senior Managing Director at FTI Consulting – Technology Lead (Australia)
  • Matthew Golab, Director Legal Informatics and R+D at Gilbert + Tobin (Sydney, Australia)
  • Phil Haselden, Technical Fellow at EDT (Australia)
  • Greg Houston, Workflow Enablement at Relativity (Chicago, Illinois USA)
  • Dr Paul Hunter, Chief Data Scientist at EDT (Australia)
  • Dinesh Karamchandani, Senior Data Scientist at Reveal-Brainspace (Chicago, Illinois USA)
  • Lisa Kozaris, Chief Innovation & Legal Solutions Officer at Allens (Melbourne, Australia) 
  • Enzo Lisciotto, Director of Customer Success at Reveal-Brainspace & Co-Membership Director of ACEDS ANZ (Sydney, Australia)
  • James MacGregor, Partner at FORCYD (London, UK)
  • Karan Mehta, Head of Legal Technology at Allens (Sydney, Australia)
  • Rachi Messing, Co-Founder at Altorney (Israel)
  • Elizabeth Miller, Senior Director at Law In Order (Australia)
  • Beth Patterson, Director at ESPconnect & Adjunct Professor at University of Technology Sydney Law School (Australia)
  • Alexander Poelma, Product Manager Nuix Workstation & Engine at Nuix (California, USA)
  • Jo Sherman, CEO & Founder at EDT (Australia/USA)
  • Paul Sirkis, Division Director, Regional Head of Litigation, Americas at Macquarie Group (New York, New York USA)
  • George Socha, Senior Vice President of Brand Awareness at Reveal-Brainspace (Chicago, Illinois USA)
  • Stephen Stewart, CTO at Nuix (Philadelphia, PA USA)
  • Gavin Wingfield, Director Applied Legal Technology at King & Wood Mallesons
  • Emma Young, Head of eDiscovery Delivery at Sky Discovery (Brisbane, Australia)
