Jiaqin's Technical Notes: You have a billion URLs, where each has a huge page. How do you detect the duplicate documents?

May 19, 2009

You have a billion URLs, each pointing to a huge page. How do you detect the duplicate documents?

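Suggestion: hash each page's content with a cryptographic hash function such as SHA-1 or MD5, and treat pages with the same digest as duplicates. These functions are more expensive to compute than a simple checksum, but they are very good hash functions: two different pages are extremely unlikely to produce the same digest by accident, so comparing short fixed-size digests can stand in for comparing the huge pages themselves.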
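A minimal sketch of the idea in Python, using the standard hashlib module. The function names, the example URLs, and the in-memory dictionary are assumptions made for illustration; at a billion URLs the digest table would likely need to be partitioned across machines or kept on disk (SHA-1 digests alone are 20 bytes each, about 20 GB for a billion pages).

```python
import hashlib
from collections import defaultdict


def content_digest(chunks):
    """Return the SHA-1 hex digest of a page supplied as an iterable of byte chunks.

    Feeding the page in chunks means a huge page never has to sit in memory at once.
    """
    h = hashlib.sha1()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()


def find_duplicates(pages):
    """Group URLs by content digest.

    `pages` is an iterable of (url, chunks) pairs (a hypothetical input shape).
    Returns a list of URL groups whose pages hashed to the same digest.
    """
    by_digest = defaultdict(list)
    for url, chunks in pages:
        by_digest[content_digest(chunks)].append(url)
    return [urls for urls in by_digest.values() if len(urls) > 1]


# Made-up example data: the first two pages have identical bytes,
# even though they arrive split into different chunks.
pages = [
    ("http://a.example/1", [b"hello ", b"world"]),
    ("http://b.example/2", [b"hello world"]),
    ("http://c.example/3", [b"something else"]),
]
duplicate_groups = find_duplicates(pages)
```

This only detects byte-for-byte identical pages; pages that differ by a timestamp or an ad would hash differently and need a near-duplicate technique instead.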
