Deduplication

Definition [of deduplication]

Duplicates

Totally distinct URLs - but same content

'Near'-duplicates [almost identical]

Duplicates: mirroring

Why worry about exact duplicates?

Why worry about near-exact duplicates?

Solving the issue of duplicates/near-duplicates

Cryptographic hash function: webpage -> number

https://bytes.usc.edu/~saty/tools/xem/run.html?x=MD5
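
To make the 'webpage -> number' idea concrete, here is a minimal Python sketch (the URLs and page strings below are made up, purely for illustration): hashing each page's content with MD5 and comparing digests catches exact duplicates served from totally distinct URLs.

    import hashlib

    def page_fingerprint(content: str) -> str:
        # Cryptographic hash: any change in the input produces a completely
        # different digest, so equal digests mean (with overwhelming
        # probability) byte-identical content.
        return hashlib.md5(content.encode("utf-8")).hexdigest()

    # Hypothetical pages: two distinct URLs serving identical content.
    pages = {
        "http://example.com/a":        "Hello, web!",
        "http://example.org/mirror/a": "Hello, web!",
        "http://example.com/b":        "Hello, web?",
    }

    seen = {}  # digest -> first URL observed with that content
    for url, content in pages.items():
        digest = page_fingerprint(content)
        if digest in seen:
            print(f"{url} is an exact duplicate of {seen[digest]}")
        else:
            seen[digest] = url

Note that this only works for exact duplicates: the one-character difference in the third page yields a completely different digest, which is exactly why we need a separate technique for near-duplicates.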

Identifying identicals, near-identicals

Identifying (near) duplicates - overall idea

Distance and set measures (to compute similarity)

'Jaccard similarity/index'

FYI, if you are interested: p. 475 of our text describes a probabilistic approach that reduces the # of shingle comparisons we need to make.
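
To make the Jaccard idea concrete: the Jaccard index of two sets A and B is |A ∩ B| / |A ∪ B|, a value in [0, 1]. Below is a minimal Python sketch (the choice of k = 3 word shingles and the two sample sentences are arbitrary, for illustration only) that shingles two documents and compares the resulting sets.

    def shingles(text: str, k: int = 3) -> set:
        # k-word shingles: every run of k consecutive words in the document.
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a: set, b: set) -> float:
        # Jaccard index: |A intersect B| / |A union B|.
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

    d1 = "the quick brown fox jumps over the lazy dog"
    d2 = "the quick brown fox leaps over the lazy dog"
    print(jaccard(shingles(d1), shingles(d2)))  # 0.4 - one changed word
                                                # breaks every shingle
                                                # that contains it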

SimHash LSH [Locality-Sensitive Hashing]

SimHash (aka Charikar's similarity hash) is essentially a dimension-reduction technique - it maps a set of weighted features (the contents of a document) to a low-dimensional fingerprint, e.g., a 64-bit word.

And, documents that are nearly identical have fingerprints that differ in only a small # of bits. In other words, similar inputs lead to similar outputs (hash values), hence 'Sim'Hash; other hashing techniques, e.g., MD5, do not have this property (even a tiny change in the input leads to a huge change in the output). This similarity-preserving property is what makes SimHash an excellent tool for detecting near-duplicate documents.

Here is the SimHash paper.
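
Here is a rough Python sketch of the SimHash construction described above. Two simplifying assumptions, for illustration only: word frequency serves as the feature weight, and a truncated MD5 serves as the per-feature hash (a production system could make other choices).

    import hashlib
    from collections import Counter

    def simhash(text: str, b: int = 64) -> int:
        # Map weighted features (here: words, weighted by frequency)
        # to a b-bit fingerprint.
        v = [0] * b
        for word, weight in Counter(text.lower().split()).items():
            # A conventional b-bit hash of the feature (truncated MD5 here).
            h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & ((1 << b) - 1)
            for i in range(b):
                # Each set bit votes +weight; each unset bit votes -weight.
                v[i] += weight if (h >> i) & 1 else -weight
        # Fingerprint bit i records the sign of the accumulated vote.
        return sum(1 << i for i in range(b) if v[i] > 0)

Because nearly identical documents share almost all of their weighted features, they cast almost the same votes, so their fingerprints agree in almost all bit positions.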

Example:

Q: which bit-pattern pairs are almost similar (i.e., differ in only a few bit positions)?
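
'Almost similar' here means a small Hamming distance: the # of bit positions in which two fingerprints differ. A quick Python sketch, reusing the simhash() function from the sketch above (the sample sentences are made up, and the exact bit counts will vary with the hash used):

    def hamming(f1: int, f2: int) -> int:
        # Hamming distance: # of differing bit positions.
        return bin(f1 ^ f2).count("1")

    a = simhash("the quick brown fox jumps over the lazy dog")
    b = simhash("the quick brown fox leaps over the lazy dog")
    c = simhash("a completely different sentence about something else")

    print(hamming(a, b))  # expected: small - near-duplicate documents
    print(hamming(a, c))  # expected: much larger - unrelated documents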

So we can 'rotate, sort, check adjacent' B times (e.g., 64 times, depending on how many bits the fingerprint has) to discover (almost) all the near-duplicates.

In other words: sort the fingerprints and compare each one with its neighbors; then rotate every fingerprint left by one bit and repeat. After B rotations, any two fingerprints that differ in only a few bits will have been adjacent (or nearly adjacent) in at least one of the sorted orders.
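
Here is an illustrative Python sketch of the rotate/sort/check-adjacent loop (assuming 64-bit fingerprints and a Hamming-distance threshold of k = 3, both arbitrary choices for demonstration; the adjacency check is a heuristic, which is why it finds 'almost' all near-duplicate pairs rather than guaranteeing every one):

    B = 64                 # fingerprint width in bits
    MASK = (1 << B) - 1

    def rotate_left(x: int) -> int:
        # Circular left rotation of a B-bit integer by one bit.
        return ((x << 1) | (x >> (B - 1))) & MASK

    def hamming(f1: int, f2: int) -> int:
        return bin(f1 ^ f2).count("1")

    def near_duplicate_pairs(fingerprints, k: int = 3):
        # Find (almost) all pairs within Hamming distance k without
        # comparing every pair: rotate all fingerprints, sort, compare
        # adjacent entries, and repeat B times.
        pairs = set()
        rotated = [(fp, fp) for fp in fingerprints]  # (rotated value, original)
        for _ in range(B):
            rotated = [(rotate_left(r), orig) for r, orig in rotated]
            rotated.sort()
            for (_, o1), (_, o2) in zip(rotated, rotated[1:]):
                if o1 != o2 and hamming(o1, o2) <= k:
                    pairs.add((min(o1, o2), max(o1, o2)))
        return pairs

Sorting makes fingerprints with matching high-order bits land next to each other, and each rotation brings a different group of bits into the high-order positions - so B sorted passes over N fingerprints replace the N^2/2 pairwise comparisons a brute-force check would need.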

Here is a nice page on SimHash, and this is a Google paper that discusses using SimHash at scale.