Jiaqin's Technical Notes: You have a very large document, for an instance the document containing 1 million words. You are given a huge file, which contains list of 1 million w

May 25, 2009

You have a very large document, for an instance the document containing 1 million words. You are given a huge file, which contains list of 1 million w

You have a very large document, for an instance the document containing 1 million words. You are given a huge file, which contains list of 1 million words (includes multi-word strings). Give an algorithm/datastructure which returns the positions of the words/multiwords occurring in document which exist in the list given.

Example:
given document (which can be upto 1 million words):

I am xyz who did my bachelors from Aab Bbc Ccd Dde university. I like solving puzzles like Albert Einstein used to.

given list:

Albert Einstein
Abdul Kalam
Aab Bbc Ccd Dde
Massachusetts Institute of Technology
xyz
so no...upto million words

Result:
Word position:

3 (xyz)
9-12 (Aab Bbc Ccd Dde)
19-20 (Albert Einstein)
The basic idea is suffix tree, if the memory is not big enough, build the suffix array which is sorted, then sort the word list if it is unsorted, do the merge to two lists to find matching words
O(nlongn)

Jiaqin's Technical Notes

May 25, 2009

You have a very large document, for an instance the document containing 1 million words. You are given a huge file, which contains list of 1 million w

No comments:

Post a Comment

Brief Description

Labels