Document deduplication


During the ingest process, DISCO deduplicates documents strategically based on the document’s family structure. Duplicates are handled as follows:

  1. Emails with attachments: if identical child documents have two or more unique parents, DISCO ingests each attachment as a unique document. Previously, email attachments that had different parents were deduplicated; now the duplicate attachments appear in DISCO as separate documents. For example, suppose the same attachment (a contract) was emailed to a lawyer and then emailed to a vendor for signature. In DISCO, you will see one copy (contract A) attached to the email sent to the lawyer, which you may tag as “Attorney-client”, and a separate copy (contract B) attached to the email sent to the vendor, which you may tag as “responsive”.
  2. Duplicate instances of all other documents: if there are identical documents in the database that do not have unique parents, these documents are deduplicated, per historical DISCO processing procedures. For example, two custodians have the same Word file on their computers. Both identical “.doc” files are collected and ingested into DISCO, where they are deduplicated.
  3. Limiting documents with 1,000+ instances: if an ingest contains 1,000 or more identical documents, DISCO cleans up the duplicates at the very end of ingest so that they do not clutter the database, leaving only one copy of the document. The copy retained is the first instance ingested. The ingest report includes a message when the Doc Count Limit script is used.
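
The three rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not DISCO's actual implementation: the names `Instance`, `StoredDoc`, `ingest`, and `DOC_COUNT_LIMIT` are hypothetical, and the sketch assumes that identical content is detected by a content hash, that attachments are told apart by a parent identifier, and that copies removed by the 1,000+ cleanup are simply dropped.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

DOC_COUNT_LIMIT = 1000  # threshold from rule 3; the constant name is hypothetical


@dataclass
class Instance:
    """One collected copy of a file (hypothetical structure)."""
    content_hash: str                # assumption: identical documents share a hash
    custodian: str
    file_path: str
    parent_id: Optional[str] = None  # None for files that are not attachments


@dataclass
class StoredDoc:
    """One document kept in the database after deduplication."""
    content_hash: str
    parent_id: Optional[str]
    custodians: list = field(default_factory=list)
    file_paths: list = field(default_factory=list)


def ingest(instances, limit=DOC_COUNT_LIMIT):
    stored = {}                # (hash, parent) -> StoredDoc, in ingest order
    counts = defaultdict(int)  # hash -> number of identical instances seen
    for inst in instances:
        counts[inst.content_hash] += 1
        # Rules 1 and 2: copies merge only when they share content AND parent,
        # so the same attachment under two different emails stays separate.
        key = (inst.content_hash, inst.parent_id)
        doc = stored.setdefault(key, StoredDoc(inst.content_hash, inst.parent_id))
        if inst.custodian not in doc.custodians:
            doc.custodians.append(inst.custodian)
        doc.file_paths.append(inst.file_path)

    # Rule 3: at the very end of ingest, content that reached the limit is
    # reduced to the first copy ingested.
    result, over_limit_seen = [], set()
    for doc in stored.values():
        if counts[doc.content_hash] >= limit:
            if doc.content_hash in over_limit_seen:
                continue
            over_limit_seen.add(doc.content_hash)
        result.append(doc)
    return result
```

Under these assumptions, two custodians holding the same loose file yield one stored document listing both custodians, while the same contract attached to two different emails yields two stored documents.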


After deduplication, a single document exists in DISCO, and its metadata shows that it has two or more instances. For example, the document in Figure 1 has several custodians. File paths also indicate deduplicated documents.


Figure 1: document metadata (2017)

Navigate to the metadata pane in the lower-left corner of the document viewer to view custodians (Figure 1; DISCO 2017).


When you create a production, you can split deduplicated documents out again (into separately Bates-stamped images/files) in several ways. From a new production, navigate to the Advanced Options > Deduplication level select bar.


These options range from producing the most documents (No deduplication, meaning the duplicate documents are produced exactly in the way and number they were ingested into DISCO) to producing the fewest (One copy for the entire production, meaning only one copy of each set of duplicated documents is produced, no matter how many duplicates there were).

The default option, One copy per custodian and per parent, is the most common choice. It guarantees that at least one copy of a duplicated document is produced for each custodian, and that one copy of each email attachment is produced for each email the duplicate was attached to.

  1. No deduplication
  2. One copy per custodian and per parent (default)
  3. One copy per parent across custodians
  4. One copy per custodian
  5. One copy for the entire production
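
One way to picture the five levels is as progressively coarser grouping keys, with one copy produced per distinct key. The Python sketch below is an interpretation of the option names, not DISCO's production logic: `DocInstance`, `LEVEL_KEYS`, `select_for_production`, and the level labels are all hypothetical, and it assumes duplicates are identified by a shared content hash.

```python
from typing import NamedTuple, Optional


class DocInstance(NamedTuple):
    """One ingested copy of a document (hypothetical structure)."""
    content_hash: str         # assumption: duplicates share a content hash
    custodian: str
    parent_id: Optional[str]  # None for documents that are not attachments


# Each level maps to the key that defines "the same copy" for production.
# None means every ingested instance is produced.
LEVEL_KEYS = {
    "none": None,
    "custodian_and_parent": lambda d: (d.content_hash, d.custodian, d.parent_id),
    "parent": lambda d: (d.content_hash, d.parent_id),
    "custodian": lambda d: (d.content_hash, d.custodian),
    "production": lambda d: (d.content_hash,),
}


def select_for_production(instances, level="custodian_and_parent"):
    """Return the instances that would be produced at the given level."""
    key_fn = LEVEL_KEYS[level]
    if key_fn is None:
        return list(instances)
    seen, kept = set(), []
    for inst in instances:
        key = key_fn(inst)
        if key not in seen:  # keep the first instance for each distinct key
            seen.add(key)
            kept.append(inst)
    return kept


# The same attachment, collected four times.
copies = [
    DocInstance("h1", "alice", "email-1"),
    DocInstance("h1", "bob", "email-1"),
    DocInstance("h1", "alice", "email-2"),
    DocInstance("h1", "alice", "email-1"),
]
for level in LEVEL_KEYS:
    print(level, len(select_for_production(copies, level)))
```

For these four copies, the number of documents selected never increases from the first level to the last, which matches the most-to-fewest ordering of the list above.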

If you have questions, DISCO is here to support you. Email us at or call 877-941-0583.
