You have a very large document, for an instance the document containing 1 million words. You are given a huge file, which contains list of 1 million words (includes multi-word strings). Give an algorithm/datastructure which returns the positions of the words/multiwords occurring in document which exist in the list given.
Example:
given document (which can be upto 1 million words):
I am xyz who did my bachelors from Aab Bbc Ccd Dde university. I like solving puzzles like Albert Einstein used to.
given list:
Albert Einstein
Abdul Kalam
Aab Bbc Ccd Dde
Massachusetts Institute of Technology
xyz
so no...upto million words
Result:
Word position:
3 (xyz)
9-12 (Aab Bbc Ccd Dde)
19-20 (Albert Einstein)
The basic idea is suffix tree, if the memory is not big enough, build the suffix array which is sorted, then sort the word list if it is unsorted, do the merge to two lists to find matching words
O(nlongn)
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment