During the ingest process, DISCO strategically deduplicates documents based on the document’s family structure. Duplicates are handled as follows:
- Emails with attachments – If identical child documents have two or more unique parents, DISCO ingests each attachment as a unique document. Previously, email attachments that had different parents were deduplicated; now the duplicate attachments appear in DISCO as separate documents. For example, suppose the same contract is attached to one email sent to a lawyer and to another email sent to a vendor for signature. In DISCO, you will see one copy of the contract attached to the lawyer email, which you may tag Attorney-client, and a separate copy attached to the vendor email, which you may tag Responsive.
- Duplicate instances of all other documents – If there are identical documents in the database that do not have unique parents, these documents are deduplicated, per historical DISCO processing procedures. For example, if two custodians have the same Word file on their computers, both identical .doc files are collected and ingested into DISCO, where they are deduplicated.
- Limiting documents with 200+ instances – If there are 200 or more identical documents, DISCO removes the duplicates at the end of ingest and leaves only one copy of the document in the database. The copy that remains is the first instance ingested. The ingest report contains a message whenever the Doc Count Limit script is used.
If a document is deduplicated, one document will exist in DISCO, and its metadata will show that it has two or more unique instances. For example, the metadata for a deduplicated document may list several custodians. File paths will also show the deduplicated instances.
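The ingest rules above can be illustrated with a minimal Python sketch. This is not DISCO's implementation: the names `Doc`, `StoredDoc`, `ingest`, and `DOC_COUNT_LIMIT` are hypothetical, identical documents are assumed to be detected by a content hash, and a `parent_hash` of `None` is assumed to mean the document is a loose file rather than an attachment. How the 200+ limit interacts with custodian metadata is also an assumption here.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional

DOC_COUNT_LIMIT = 200  # threshold from the article; the constant name is hypothetical


@dataclass
class Doc:
    doc_id: str
    content_hash: str                  # assumed stand-in for "identical document"
    custodian: str
    parent_hash: Optional[str] = None  # hash of the parent email; None for loose files


@dataclass
class StoredDoc:
    doc_id: str
    content_hash: str
    parent_hash: Optional[str]
    custodians: List[str] = field(default_factory=list)


def ingest(docs):
    """Return the documents that remain after deduplication, in ingest order."""
    by_hash = defaultdict(list)
    for d in docs:
        by_hash[d.content_hash].append(d)

    stored = []
    for content_hash, group in by_hash.items():
        if len(group) >= DOC_COUNT_LIMIT:
            # 200+ identical documents: keep only the first instance ingested.
            # Keeping only that instance's custodian is an assumption.
            first = group[0]
            stored.append(
                StoredDoc(first.doc_id, content_hash, first.parent_hash, [first.custodian])
            )
            continue

        # Identical documents with different parents stay separate;
        # identical documents sharing a parent (or with no parent) are merged.
        by_parent = {}
        for d in group:
            kept = by_parent.get(d.parent_hash)
            if kept is None:
                kept = StoredDoc(d.doc_id, content_hash, d.parent_hash)
                by_parent[d.parent_hash] = kept
            if d.custodian not in kept.custodians:
                kept.custodians.append(d.custodian)
        stored.extend(by_parent.values())
    return stored


example = ingest([
    Doc("1", "contract", "alice", parent_hash="email-to-lawyer"),
    Doc("2", "contract", "alice", parent_hash="email-to-vendor"),
    Doc("3", "memo", "bob"),
    Doc("4", "memo", "carol"),
])
for doc in example:
    print(doc.doc_id, doc.parent_hash, doc.custodians)
```

In this sketch the contract is kept twice, once per parent email, while the memo is kept once with both custodians recorded against it.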
Upon production, you have the option to produce (or not produce) duplicates in the following ways:
- Global deduplication by family (Default) – Produces each duplicated family in the production one time.
- Custodian-level deduplication by family – Produces a separate copy for each custodian associated with a duplicate family.
- Full reduplication – Produces documents as they were ingested into DISCO, prior to deduplication.
The deduplication level option is available on the New Production screen, under Advanced options.
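The three production options can be compared with a short Python sketch. It is illustrative only: `DedupLevel` and `production_copies` are hypothetical names, each ingested instance is modeled as a `(family_hash, custodian)` pair, and which custodian is attributed to the single copy under global deduplication is an assumption (the first one ingested).

```python
from enum import Enum


class DedupLevel(Enum):
    GLOBAL = "global"        # one copy per duplicated family
    CUSTODIAN = "custodian"  # one copy per custodian per family
    FULL = "full"            # every instance as originally ingested


def production_copies(instances, level):
    """instances: list of (family_hash, custodian) pairs, one per ingested instance."""
    if level is DedupLevel.FULL:
        return list(instances)

    seen = set()
    copies = []
    for family_hash, custodian in instances:
        # Global deduplication keys on the family alone;
        # custodian-level deduplication keys on family plus custodian.
        key = family_hash if level is DedupLevel.GLOBAL else (family_hash, custodian)
        if key not in seen:
            seen.add(key)
            copies.append((family_hash, custodian))
    return copies


instances = [
    ("family-1", "alice"),
    ("family-1", "alice"),
    ("family-1", "bob"),
    ("family-2", "bob"),
]
for level in DedupLevel:
    print(level.value, len(production_copies(instances, level)))
```

For these four instances, the sketch yields two copies under global deduplication, three under custodian-level deduplication, and four under full reduplication.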
This feature is available in databases created on or after August 8, 2015. If a database was created before this date, deduplication follows the second rule above (Duplicate instances of all other documents), and all identical documents are deduplicated, including email attachments with different parents.
If you have questions, contact us.