Ingest
During ingest, DISCO evaluates every new file and deduplicates documents based on each document's content and extracted metadata.
All DISCO sites by default deduplicate native data globally across custodians and data sources. This results in a single instance of each family for review.
How DISCO identifies duplicates
To identify duplicate native documents, DISCO calculates a SHA-1 hash value for every file based on the file’s content, extracted metadata, and family relationships.
To identify duplicate emails, DISCO calculates a hash from the email's content rather than its bytes, so that the same email is deduplicated even when it is stored in different file types. The hash uses field values, extracted metadata, and family relationships, specifically the email's Sent Time, Sender Address, To/Cc/Bcc contents, Subject, Message Body, and Attachments. Calendar items are hashed the same way, which reduces recurring meetings to a single instance when deduplicated.
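As an illustration of hashing content rather than bytes, here is a minimal sketch in Python. It is not DISCO's implementation: the function name, the normalization steps (lowercasing addresses, sorting recipients, collapsing whitespace), the field separator, and the use of SHA-1 for emails are all assumptions made for the example.

```python
import hashlib


def email_dedupe_hash(sent_time, sender, to, cc, bcc, subject, body, attachment_hashes):
    """Hash normalized email fields instead of the raw file bytes (hypothetical helper)."""

    def norm_addresses(addresses):
        # Assumed normalization: trim, lowercase, and sort so recipient order does not matter.
        return ";".join(sorted(a.strip().lower() for a in addresses))

    parts = [
        sent_time.strip(),
        sender.strip().lower(),
        norm_addresses(to),
        norm_addresses(cc),
        norm_addresses(bcc),
        subject.strip(),
        " ".join(body.split()),        # collapse whitespace differences between formats
        ";".join(attachment_hashes),   # attachments contribute through their own hashes
    ]
    # Join with a separator that is unlikely to appear in the fields themselves.
    payload = "\x1f".join(parts).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()
```

Because only the extracted fields feed the hash, two copies of the same message saved in different container formats would produce the same value under these assumptions, while a change to any listed field would produce a different one.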
Duplicates found during the ingest process are handled as follows:
- Duplicate instances of documents – If there are identical documents in the database that do not have unique parents, these documents are deduplicated. For example, suppose two custodians have the same Word file on their computers. During collection, both identical .doc files are ingested into DISCO and deduplicated based on their dedupe hash. The custodian information and any other differing metadata are retained on the review document and available within DISCO.
- Duplicate documents with attachments – If there are identical child documents that have two or more unique parents, DISCO ingests each attachment as a unique document.
Previously in DISCO, email attachments that had different parents were deduplicated. Now you will see the duplicate email attachments in DISCO as separate documents. For example, suppose the same attachment (a contract) is emailed to a lawyer and then emailed to a vendor for signature. In DISCO, you will see contract A sent to the lawyer and may tag it Attorney-client. Separately, you will see contract B sent to the vendor and may tag it Responsive.
- Documents with 200+ duplicate instances – During ingest, if there are more than 200 duplicate instances of a document, DISCO rolls instances 201 and beyond over into a new review document.
Historically, when there were more than 200 duplicate instances of a document, DISCO removed the duplicates and left only one copy in the database: the first instance ingested. When this happened, the ingest report contained a message noting the exception.
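The handling rules above can be read as a grouping problem. The Python sketch below is illustrative only and is not DISCO's implementation: the record fields (`hash`, `parent`, `custodian`), the function name, and the choice to start another review document after every 200 instances are assumptions.

```python
from collections import defaultdict

MAX_INSTANCES = 200  # threshold taken from the text above


def group_instances(instances):
    """Group ingested instances into review documents (hypothetical helper).

    Each instance is a dict with:
      'hash'      - the dedupe hash of the file
      'parent'    - the dedupe hash of its parent, or None for a top-level file
      'custodian' - the custodian the instance was collected from
    """
    buckets = defaultdict(list)
    for inst in instances:
        # Keying on the parent as well keeps identical attachments of
        # different parents as separate documents.
        buckets[(inst["hash"], inst["parent"])].append(inst)

    documents = []
    for group in buckets.values():
        # Assumption: every 200 instances roll over into a new review document.
        for start in range(0, len(group), MAX_INSTANCES):
            documents.append(group[start:start + MAX_INSTANCES])
    return documents
```

With this model, the same Word file collected from two custodians becomes one review document with two instances, while the same contract attached to two different emails becomes two review documents.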
Learn more about the difference between Documents vs. Instances.
Review
If a document is deduplicated, one document exists in DISCO, and its metadata shows that it has two or more unique instances. For example, a deduplicated document may list several custodians. The document's file paths also reflect the deduplicated instances.
Production
When you create a production, you can choose how duplicates are produced:
- Global deduplication by family (Default) – Produces each duplicated family in the production one time.
- Custodian-level deduplication by family – Produces a separate copy for each custodian associated with a duplicate family.
- Full reduplication – Produces documents as they were ingested into DISCO, prior to deduplication.
The deduplication level option is available on the New Production screen, under Other options.
More information about deduplication during production can be found in Production deduplication.
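To make the three production options concrete, here is a small Python sketch of which instances of one duplicated family would be produced at each level. It is only an illustration: the level names (`"global"`, `"custodian"`, `"full"`), the function name, and the record shape are hypothetical and do not correspond to DISCO settings or APIs.

```python
def production_copies(instances, level="global"):
    """Return the instances to produce for one duplicated family (hypothetical helper).

    'instances' is a list of dicts, each with a 'custodian' key,
    in the order the instances were ingested.
    """
    if level == "global":
        # One copy of the family for the whole production.
        return instances[:1]
    if level == "custodian":
        # One copy per custodian associated with the family.
        seen, copies = set(), []
        for inst in instances:
            if inst["custodian"] not in seen:
                seen.add(inst["custodian"])
                copies.append(inst)
        return copies
    if level == "full":
        # Every instance, as ingested before deduplication.
        return list(instances)
    raise ValueError("unknown deduplication level: %r" % level)
```

For a family with three instances held by two custodians, this sketch yields one copy at the global level, two at the custodian level, and three with full reduplication.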
This feature is available in databases created on or after August 8, 2015. If a database was created prior to this date, deduplication will follow #2 above and all identical documents will be deduplicated.
If you have questions, contact us.