During the ingest process, DISCO strategically deduplicates documents based on the document’s family structure. Duplicates are handled as follows:
- Emails with an attachment: If identical child documents have two or more unique parents, DISCO ingests each attachment as a unique document. Previously, identical email attachments that had different parents were deduplicated. Now they appear in DISCO as separate documents. For example, suppose the same attachment (a contract) is emailed to a lawyer and then emailed separately to a vendor for signature. In DISCO, you will see contract A, sent to the lawyer, and may tag it as Attorney-client. Separately, you will see contract B, sent to the vendor, and may tag it as Responsive.
This feature is available in databases created on or after August 8, 2015. If the database was created before this date, deduplication follows the second rule in this list (Duplicate instances of all other documents), and all identical documents are deduplicated.
- Duplicate instances of all other documents: If there are identical documents in the database that do not have unique parents, these documents are deduplicated, per historical DISCO processing procedures. For example, two different custodians have the same Word file on their computers. During collection, both identical DOC files are ingested into DISCO and are deduplicated.
- Limiting documents with 1,000+ instances: At the end of ingest, if a document has 1,000 or more identical instances, DISCO removes the duplicates and leaves only one copy of the document in the database. That copy is the first instance ingested. The ingest report indicates whether the Doc Count Limit script was used.
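
The three rules above can be sketched in a few lines of Python. This is only an illustration of the logic as described here, not DISCO's actual implementation. The `Doc` class, the `deduplicate` function, the use of a content hash to decide that two documents are identical, and the choice to count the 1,000+ limit among documents that survive the first pass are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Dates and limits taken from the rules above; the names are hypothetical.
FAMILY_AWARE_CUTOFF = date(2015, 8, 8)
DOC_COUNT_LIMIT = 1000


@dataclass(frozen=True)
class Doc:
    doc_id: str
    content_hash: str                 # assumption: identical documents share a hash
    parent_id: Optional[str] = None   # None for a standalone (non-attachment) file


def deduplicate(docs: List[Doc], db_created: date,
                limit: int = DOC_COUNT_LIMIT) -> List[Doc]:
    """Return the documents kept after deduplication, in ingest order."""
    family_aware = db_created >= FAMILY_AWARE_CUTOFF

    # Pass 1: family-aware deduplication.
    seen = set()
    kept = []
    for doc in docs:  # docs are assumed to be listed in ingest order
        if family_aware and doc.parent_id is not None:
            # Rule 1: identical attachments with different parents stay separate.
            key = (doc.content_hash, doc.parent_id)
        else:
            # Rule 2 (and older databases): identical documents collapse to one.
            key = (doc.content_hash, None)
        if key not in seen:
            seen.add(key)
            kept.append(doc)

    # Pass 2: rule 3, collapse any document that still has `limit` or more
    # identical instances down to the first instance ingested.
    counts = Counter(d.content_hash for d in kept)
    over_limit = {h for h, n in counts.items() if n >= limit}
    emitted = set()
    result = []
    for doc in kept:
        if doc.content_hash in over_limit:
            if doc.content_hash in emitted:
                continue
            emitted.add(doc.content_hash)
        result.append(doc)
    return result
```

With the contract example, two `Doc` entries that share a `content_hash` but have different `parent_id` values are both kept when `db_created` is on or after August 8, 2015, and only the first is kept for an older database.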