A simple search system for information retrieval, consisting of an inverted index builder and a boolean query processor.
This program uses the Reuters-21578 dataset. Place the dataset in the reuters21578 directory before building the inverted index.
Also, do not forget to add your stopwords to the stopwords.txt file.
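As a rough illustration of what building an inverted index with stopword removal involves, here is a minimal sketch. It is not the code in src/: the function names `load_stopwords` and `build_inverted_index`, the one-word-per-line stopword layout, the regex tokenizer, and the in-memory `docs` mapping are all assumptions made for this example.

```python
import re
from collections import defaultdict


def load_stopwords(path):
    """Read stopwords into a set, assuming one word per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def build_inverted_index(docs, stopwords):
    """Map each non-stopword token to the sorted IDs of documents containing it.

    `docs` is a hypothetical {doc_id: text} mapping; the real program
    reads its documents from the .sgm files instead.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Lowercase and split on anything that is not a letter or digit.
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            if token not in stopwords:
                postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Toy documents, not taken from the Reuters collection.
docs = {
    1: "Oil prices rise",
    2: "The price of vegetable oil",
    3: "Agriculture and oil",
}
index = build_inverted_index(docs, stopwords={"the", "of", "and"})
print(index["oil"])
```

Keeping each posting list sorted by document ID makes later merging of lists straightforward.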
The file tree must be like this:

    ├── dictionary.pkl (not necessary)
    ├── main.py
    ├── README.md
    ├── reuters21578
    │   ├── lewis.dtd
    │   ├── README.txt
    │   ├── reut2-000.sgm
    │   ├── reut2-001.sgm
    │   ├── ...
    │   └── reut2-021.sgm
    ├── src
    │   ├── base.py
    │   ├── inverted_intex.py
    │   ├── query_processor.py
    │   └── sgm_preprocessor.py
    └── stopwords.txt

Run the program with the `python main.py` command. The program reads a query, prints the result, and repeats until `q` is entered.
4 different query types are implemented:
- Conjunction: w1 AND w2 AND w3 ... AND wn
  example: oil AND agriculture AND vegetable
  result: [3950, 5655, 7625, 8003, 9550, 9756, 10720, 14509, 15341, 18403, 20232]
- Disjunction: w1 OR w2 OR w3 ... OR wn
  example: hate OR love OR cry
  result: [1895, 3148, 6338, 7366, 8827, 10890, 17099, 17903, 19559]
- Conjunction and Negation: w1 AND w2 ... AND wn NOT wn+1 NOT wn+2 ... NOT wn+m
  example: oil AND agriculture AND vegetable NOT price
  result: [3950, 5655, 7625, 8003, 9550, 9756, 10720, 14509, 15341, 20232]
- Disjunction and Negation: w1 OR w2 ... OR wn NOT wn+1 NOT wn+2 ... NOT wn+m
  example: hate OR love OR cry NOT money NOT price
  result: [1895, 3148, 6338, 7366, 8827, 10890, 17099]
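
The four query types above can be evaluated with plain set operations over posting lists. The following is a minimal sketch, not the actual code in query_processor.py: the `evaluate` function, its simple whitespace parser, and the toy `index` are assumptions made for illustration. It assumes a query uses either AND or OR (never both), optionally followed by NOT terms.

```python
def evaluate(query, index):
    """Evaluate a flat query such as 'a AND b NOT c' or 'a OR b NOT c'.

    `index` maps a term to a list of document IDs (hypothetical layout).
    """
    positive, negative = [], []
    operator = "AND"      # used as-is for single-term queries
    target = positive     # list that the next term is appended to
    for token in query.split():
        if token in ("AND", "OR"):
            operator = token
            target = positive
        elif token == "NOT":
            target = negative
        else:
            target.append(token.lower())

    sets = [set(index.get(term, [])) for term in positive]
    if not sets:
        return []
    # AND intersects the posting lists, OR unions them.
    if operator == "AND":
        result = set.intersection(*sets)
    else:
        result = set.union(*sets)
    # Each NOT term removes its documents from the result.
    for term in negative:
        result -= set(index.get(term, []))
    return sorted(result)


# Toy index, not taken from the Reuters collection.
index = {"oil": [1, 2, 3], "vegetable": [2, 3], "price": [2], "love": [4], "hate": [5]}

print(evaluate("oil AND vegetable", index))             # [2, 3]
print(evaluate("oil AND vegetable NOT price", index))   # [3]
print(evaluate("love OR hate", index))                  # [4, 5]
print(evaluate("love OR hate OR oil NOT price", index)) # [1, 3, 4, 5]
```

Terms missing from the index are treated as having an empty posting list, so a conjunction containing an unknown word returns no documents.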