May 19, 2009

You have a billion URLs, each pointing to a very large page. How do you detect the duplicate documents?

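Use a cryptographic hash function such as SHA-1 or MD5: hash the content of each page and treat pages with the same digest as duplicates. These functions are expensive to compute, but they are very good hash functions, so two different pages are extremely unlikely to produce the same digest by accident.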
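A minimal sketch of the hashing idea, assuming the page bodies are already fetched and available as bytes. The names `content_digest` and `find_duplicates` are made up for illustration, and the in-memory dictionary stands in for whatever distributed store a billion pages would really need.

```python
import hashlib
from collections import defaultdict


def content_digest(page_bytes: bytes) -> str:
    """Return the SHA-1 hex digest of a page body (40 hex characters)."""
    return hashlib.sha1(page_bytes).hexdigest()


def find_duplicates(pages):
    """Group URLs whose page bodies hash to the same digest.

    `pages` is an iterable of (url, body_bytes) pairs. Returns a list of
    URL groups, keeping only groups with more than one URL.
    """
    groups = defaultdict(list)
    for url, body in pages:
        groups[content_digest(body)].append(url)
    return [urls for urls in groups.values() if len(urls) > 1]


if __name__ == "__main__":
    sample = [
        ("http://a.example/1", b"hello world"),
        ("http://b.example/2", b"something else"),
        ("http://c.example/3", b"hello world"),
    ]
    for group in find_duplicates(sample):
        print(group)
```

Only the fixed-size digest has to be kept per page, not the page itself, which is what makes comparing a billion huge documents feasible. This catches exact byte-for-byte duplicates only; pages that differ by a single character get unrelated digests.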
