During ingest, DISCO deduplicates documents based on each document’s family structure. Duplicates are handled as follows:
- Emails with attachments: if identical child documents have two or more unique parents, DISCO ingests each attachment as a unique document. Previously, email attachments that had different parents were deduplicated. Now the duplicate email attachments appear in DISCO as separate documents. For example, suppose an email with a contract attached is sent to a lawyer, and the same contract is then sent to a vendor for signature. In DISCO, you will see contract A as sent to the lawyer and may tag it as “Attorney-client”. Separately, you will see contract B as sent to the vendor and may tag it as “responsive”.
Note: this feature is only available for databases created on or after August 8th, 2015. If the database was created before this date, deduplication follows the rule for all other documents described in the next item, and all identical documents are deduplicated.
- Duplicate instances of all other documents: if there are identical documents in the database that do not have unique parents, these documents are deduplicated, per historical DISCO processing procedures. For example, two different custodians have the same Word file on their computers. During collection, both identical “.doc” files are ingested into DISCO and deduplicated.
- Limiting documents with 1,000+ instances: if there are 1,000 or more identical documents during ingest, DISCO cleans up the duplicates at the very end of ingest and leaves only one copy of the document in the database. The remaining copy is the first instance ingested. The ingest report includes a message when the Doc Count Limit script is used.
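
The three rules above can be sketched in a few lines of Python. This is an illustration only, not DISCO's actual implementation: the `Doc` class, the `parent_hash` field, the `deduplicate` function, and the `family_aware` and `instance_limit` parameters are all hypothetical names. The sketch assumes that "identical" means equal content hashes, that two parents are "unique" when their own hashes differ, and that the 1,000+ limit counts every ingested instance of a given content hash.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Doc:
    doc_id: str
    content_hash: str                  # assumption: identical documents share a hash
    parent_hash: Optional[str] = None  # hash of the parent email; None for loose files


def deduplicate(docs: List[Doc], family_aware: bool = True,
                instance_limit: int = 1000) -> List[Doc]:
    """Return the documents that survive deduplication, in ingest order."""
    seen = set()
    kept = []
    for doc in docs:
        # Family-aware mode keeps identical attachments whose parents differ.
        # Legacy mode (databases created before the cutoff) ignores the parent.
        key = (doc.content_hash, doc.parent_hash) if family_aware else doc.content_hash
        if key not in seen:
            seen.add(key)
            kept.append(doc)

    # End-of-ingest pass: content with instance_limit or more instances
    # collapses to the first instance ingested.
    counts = Counter(doc.content_hash for doc in docs)
    over_limit = {h for h, n in counts.items() if n >= instance_limit}
    collapsed = set()
    result = []
    for doc in kept:
        if doc.content_hash in over_limit:
            if doc.content_hash in collapsed:
                continue
            collapsed.add(doc.content_hash)
        result.append(doc)
    return result


docs = [
    Doc("contract-A", "h-contract", parent_hash="h-email-to-lawyer"),
    Doc("contract-B", "h-contract", parent_hash="h-email-to-vendor"),
    Doc("memo-1", "h-memo"),  # same Word file from custodian 1
    Doc("memo-2", "h-memo"),  # same Word file from custodian 2
]
print([d.doc_id for d in deduplicate(docs)])
# → ['contract-A', 'contract-B', 'memo-1']
print([d.doc_id for d in deduplicate(docs, family_aware=False)])
# → ['contract-A', 'memo-1']
```

With `family_aware=True`, both copies of the contract survive because their parent emails differ, while the two loose Word files collapse to one. With `family_aware=False`, the sketch mirrors the behavior described for older databases, where all identical documents are deduplicated.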