You have a billion URLs, each pointing to a huge page. How do you detect the duplicate documents?
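Suggest a cryptographic hash function such as SHA-1 or MD5: hash each page's content and compare the digests instead of the pages themselves. These functions are more expensive to compute than simple hashes, but they are very good hash functions, so two different pages are extremely unlikely to produce the same digest.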
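The idea can be sketched in a few lines of Python. This is only a minimal illustration, assuming the pages are already fetched and small enough to group in memory on one machine; the names `content_digest` and `find_duplicates` are made up for this example, and SHA-1 is used simply because it is one of the two functions suggested above.

```python
import hashlib
from collections import defaultdict


def content_digest(content: bytes) -> str:
    """Return the SHA-1 digest of a page's raw bytes as a hex string."""
    return hashlib.sha1(content).hexdigest()


def find_duplicates(pages):
    """Group URLs whose page content hashes to the same digest.

    `pages` is an iterable of (url, content_bytes) pairs (an assumed
    input shape). Returns a list of URL groups, one per digest that
    was seen more than once.
    """
    groups = defaultdict(list)
    for url, content in pages:
        # Only the fixed-size digest is kept as the key, not the huge page.
        groups[content_digest(content)].append(url)
    return [urls for urls in groups.values() if len(urls) > 1]


if __name__ == "__main__":
    sample = [
        ("http://example.com/a", b"same page body"),
        ("http://example.com/b", b"a different page"),
        ("http://example.com/c", b"same page body"),
    ]
    print(find_duplicates(sample))
```

A SHA-1 digest is 20 bytes, so a billion of them take roughly 20 GB before any overhead. At that scale the digest table would probably have to be split across machines, for example by digest prefix, rather than held in one dictionary. Note also that this only catches byte-identical pages; pages that differ by a timestamp or an ad would hash differently.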
DFS the life without backtracking