Design and Architecture¶
Tech-Stack¶
- Django
- MySQL
Algorithms for searching¶
2 algorithms are used:
- Partial_token_ratio (for SQLite)
- Soundex (for MySQL)
While searching, Network calls are reduced by using Debouncing.
Keyword Generation Process¶
First the pdf file is passed through Tesseract OCR to generate text.
Then this text is word tokenized and stopwords are removed.
Then the bag of words is transferred to a dictionary where such bags are indexed by their paper ids.
Then TF-IDF algorithm is run to find keywords. (RAKE provides worse results)
This must run locally as:
- This is resource-heavy.
- The threshold of keeping the keywords must be determined by human beings. This cannot be automated.
Retraining of model must be done once per semester.