Deduplication

Definition [of deduplication]

Duplicates

Totally distinct URLs - but same content

'Near'-duplicates [almost identical]

Duplicates: mirroring

Why worry about exact duplicates?

Why worry about near-exact duplicates?

Solving the issue of duplicates/near-duplicates

Cryptographic hash function: webpage -> number

https://bytes.usc.edu/~saty/tools/xem/run.html?x=MD5
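
To make the 'webpage -> number' idea concrete, here is a minimal Python sketch (the URLs and page strings below are made up, purely for illustration): hashing each page's content with MD5 and comparing digests catches exact duplicates served from totally distinct URLs.

    import hashlib

    def page_fingerprint(content: str) -> str:
        # Cryptographic hash: any change in the input produces a completely
        # different digest, so equal digests mean (with overwhelming
        # probability) byte-identical content.
        return hashlib.md5(content.encode("utf-8")).hexdigest()

    # Hypothetical pages: two distinct URLs serving identical content.
    pages = {
        "http://example.com/a":        "Hello, web!",
        "http://example.org/mirror/a": "Hello, web!",
        "http://example.com/b":        "Hello, web?",
    }

    seen = {}  # digest -> first URL observed with that content
    for url, content in pages.items():
        digest = page_fingerprint(content)
        if digest in seen:
            print(f"{url} is an exact duplicate of {seen[digest]}")
        else:
            seen[digest] = url

Note that this only works for exact duplicates: the one-character difference in the third page yields a completely different digest, which is exactly why we need a separate technique for near-duplicates.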

Identifying identicals, near-identicals

Identifying (near) duplicates - overall idea

Distance and set measures (to compute similarity)

'Jaccard similarity/index'

FYI, if you are interested: p. 475 of our text describes a probabilistic approach that reduces the # of shingle comparisons we need to make.
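
To make the Jaccard idea concrete: the Jaccard index of two sets A and B is |A ∩ B| / |A ∪ B|, a value in [0, 1]. Below is a minimal Python sketch (the choice of k = 3 word shingles and the two sample sentences are arbitrary, for illustration only) that shingles two documents and compares the resulting sets.

    def shingles(text: str, k: int = 3) -> set:
        # k-word shingles: every run of k consecutive words in the document.
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a: set, b: set) -> float:
        # Jaccard index: |A intersect B| / |A union B|.
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

    d1 = "the quick brown fox jumps over the lazy dog"
    d2 = "the quick brown fox leaps over the lazy dog"
    print(jaccard(shingles(d1), shingles(d2)))  # 0.4 - one changed word
                                                # breaks every shingle
                                                # that contains it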

SimHash LSH [Locality-Sensitive Hashing]

SimHash (aka Charikar's similarity hash) is essentially a dimension-reduction technique - it maps a set of weighted features (the contents of a document) to a low-dimensional fingerprint, e.g., a 64-bit word.

And, documents that are nearly identical have fingerprints that differ in only a small # of bits. In other words, similar inputs lead to similar outputs (hash values), hence 'Sim'Hash; other hashing techniques, e.g., MD5, do not have this property (even a tiny change in the input leads to a huge change in the output). This similarity-preserving property is what makes SimHash an excellent tool for detecting near-duplicate documents.

Here is the SimHash paper.
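
Here is a rough Python sketch of the SimHash construction described above. Two simplifying assumptions, for illustration only: word frequency serves as the feature weight, and a truncated MD5 serves as the per-feature hash (a production system could make other choices).

    import hashlib
    from collections import Counter

    def simhash(text: str, b: int = 64) -> int:
        # Map weighted features (here: words, weighted by frequency)
        # to a b-bit fingerprint.
        v = [0] * b
        for word, weight in Counter(text.lower().split()).items():
            # A conventional b-bit hash of the feature (truncated MD5 here).
            h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & ((1 << b) - 1)
            for i in range(b):
                # Each set bit votes +weight; each unset bit votes -weight.
                v[i] += weight if (h >> i) & 1 else -weight
        # Fingerprint bit i records the sign of the accumulated vote.
        return sum(1 << i for i in range(b) if v[i] > 0)

Because nearly identical documents share almost all of their weighted features, they cast almost the same votes, so their fingerprints agree in almost all bit positions.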

Example:

Q: which bit-pattern pairs are almost similar (i.e., differ in only a few bit positions)?
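
'Almost similar' here means a small Hamming distance: the # of bit positions in which two fingerprints differ. A quick Python sketch, reusing the simhash() function from the sketch above (the sample sentences are made up, and the exact bit counts will vary with the hash used):

    def hamming(f1: int, f2: int) -> int:
        # Hamming distance: # of differing bit positions.
        return bin(f1 ^ f2).count("1")

    a = simhash("the quick brown fox jumps over the lazy dog")
    b = simhash("the quick brown fox leaps over the lazy dog")
    c = simhash("a completely different sentence about something else")

    print(hamming(a, b))  # expected: small - near-duplicate documents
    print(hamming(a, c))  # expected: much larger - unrelated documents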

So we can 'rotate, sort, check adjacent' B times (e.g., 64 times, depending on how many bits the fingerprint has) to discover (almost) all the near-duplicates.

In other words: sort the fingerprints and compare each one with its neighbors; then rotate every fingerprint left by one bit and repeat. After B rotations, any two fingerprints that differ in only a few bits will have been adjacent (or nearly adjacent) in at least one of the sorted orders.
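
Here is an illustrative Python sketch of the rotate/sort/check-adjacent loop (assuming 64-bit fingerprints and a Hamming-distance threshold of k = 3, both arbitrary choices for demonstration; the adjacency check is a heuristic, which is why it finds 'almost' all near-duplicate pairs rather than guaranteeing every one):

    B = 64                 # fingerprint width in bits
    MASK = (1 << B) - 1

    def rotate_left(x: int) -> int:
        # Circular left rotation of a B-bit integer by one bit.
        return ((x << 1) | (x >> (B - 1))) & MASK

    def hamming(f1: int, f2: int) -> int:
        return bin(f1 ^ f2).count("1")

    def near_duplicate_pairs(fingerprints, k: int = 3):
        # Find (almost) all pairs within Hamming distance k without
        # comparing every pair: rotate all fingerprints, sort, compare
        # adjacent entries, and repeat B times.
        pairs = set()
        rotated = [(fp, fp) for fp in fingerprints]  # (rotated value, original)
        for _ in range(B):
            rotated = [(rotate_left(r), orig) for r, orig in rotated]
            rotated.sort()
            for (_, o1), (_, o2) in zip(rotated, rotated[1:]):
                if o1 != o2 and hamming(o1, o2) <= k:
                    pairs.add((min(o1, o2), max(o1, o2)))
        return pairs

Sorting makes fingerprints with matching high-order bits land next to each other, and each rotation brings a different group of bits into the high-order positions - so B sorted passes over N fingerprints replace the N^2/2 pairwise comparisons a brute-force check would need.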

Here is a nice page on SimHash, and this is a Google paper that discusses using SimHash at scale.