An NLP-Based Framework for Content Recommendation
In this era of Big Data, where content generation knows no end, an efficient content recommendation framework is indispensable. In this article we propose an extensible and scalable architecture for an NLP-based content recommendation engine. The framework is generic: it works with any content type that either is text itself or carries a textual description.
The framework is built with:
- Python for code base
- ElasticSearch for storage and search
- Word2vec word embeddings for context-based similarity
Architecture of the framework
The framework follows the adapter design pattern, which connects all content sources and the recommendation engine to an ElasticSearch service. Every content source is a client of ElasticSearch and interacts with it through a dedicated adapter, and every adapter follows the behaviour defined by an abstract base class.
Figure: Architecture of the framework
The framework can be extended with additional content sources. Adding a new data source only requires a new class to retrieve the data and a corresponding adapter to communicate with ElasticSearch.
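To make the adapter idea concrete, here is a minimal sketch of what such an abstract base class and one adapter might look like. All names (`ContentAdapter`, `StaticBlogAdapter`, `fetch`, `to_document`, `push`) are hypothetical and not taken from the project's code base, and a plain Python list stands in for the ElasticSearch index so the example runs without a cluster.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List


class ContentAdapter(ABC):
    """Hypothetical base class: the behaviour every adapter must follow."""

    source_name: str = "unknown"

    @abstractmethod
    def fetch(self) -> Iterable[Dict]:
        """Return raw items from the content source (API call or scrape)."""

    @abstractmethod
    def to_document(self, raw: Dict) -> Dict:
        """Convert one raw item into the shared document format."""

    def push(self, store: List[Dict]) -> int:
        """Convert every fetched item and append it to the store.

        `store` is a plain list standing in for the ElasticSearch index.
        Returns the number of documents added.
        """
        count = 0
        for raw in self.fetch():
            store.append(self.to_document(raw))
            count += 1
        return count


class StaticBlogAdapter(ContentAdapter):
    """Hypothetical adapter that serves a hard-coded post instead of calling an API."""

    source_name = "static-blog"

    def fetch(self) -> Iterable[Dict]:
        return [
            {
                "headline": "Intro to word embeddings",
                "summary": "How Word2vec represents words as vectors",
                "labels": ["nlp", "word2vec"],
            }
        ]

    def to_document(self, raw: Dict) -> Dict:
        # Map source-specific keys onto the shared searchable fields.
        return {
            "title": raw["headline"],
            "description": raw["summary"],
            "tags": raw["labels"],
            "source": self.source_name,
        }


index: List[Dict] = []
added = StaticBlogAdapter().push(index)
print(added, index[0]["title"])
```

With this shape, supporting another source would mean writing one more subclass; the storage and recommendation side would not need to change.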
The application currently has four content sources (WordPress, Medium, YouTube and Unsplash), covering text, image and video content. Recommendation is based on the textual description that accompanies each item from every content source. Additional content sources can be added by following the guidelines laid out by the architecture of the framework.
Data is collected from the content sources either by web scraping or through API calls and is fed into the ElasticSearch storage. The recommendation engine reads from this storage and retrieves results based on the keywords given by the user. The output is then sorted by relevance and returned as the response to the user's request.
We define a mapping (document schema) for ElasticSearch with eight fields for storing and retrieving content. Of these, the searchable fields used for recommendation (title, description and tags) are analysed with the snowball analyser, which provides tokenization, word stemming, stop-word filtering and case normalization. These NLP techniques make search efficient across families of derivationally related words with similar meanings, such as democracy, democratic and democratization. Filtering out stop words also helps avoid redundant storage.
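A mapping along these lines could be sketched as a Python dictionary. Only `title`, `description` and `tags` are named in the article; the other five field names and their types are placeholders chosen to reach eight fields. The sketch also assumes an ElasticSearch version that still ships the built-in `snowball` analyser and accepts the typeless `mappings.properties` layout.

```python
import json

# The three searchable fields named in the article.
ANALYSED_FIELDS = ("title", "description", "tags")


def build_mapping() -> dict:
    """Build an index body with eight fields, three of them snowball-analysed."""
    properties = {
        name: {"type": "text", "analyzer": "snowball"}
        for name in ANALYSED_FIELDS
    }
    # Hypothetical remaining fields: the article does not name them.
    properties.update(
        {
            "url": {"type": "keyword"},
            "source": {"type": "keyword"},
            "content_type": {"type": "keyword"},
            "author": {"type": "keyword"},
            "published_at": {"type": "date"},
        }
    )
    return {"mappings": {"properties": properties}}


mapping = build_mapping()
print(json.dumps(mapping, indent=2))
```

The resulting dictionary is the kind of body one would pass when creating the index. The non-searchable fields are typed as `keyword` or `date` here so that they are stored as exact values and not run through the analyser.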
When a user requests recommendations for a set of keywords, the recommendation algorithm searches the entire ElasticSearch storage for those keywords and retrieves the relevant information for the matching articles. The retrieved articles are then sorted by their relevance to the user's keywords, calculated using word embeddings, and returned to the user as the response.
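The relevance-sorting step can be illustrated with a small, self-contained sketch. It is not the project's actual algorithm: it assumes relevance is the cosine similarity between the averaged keyword vector and the averaged vector of each article title, and it uses a tiny hand-made vector table (`TOY_VECTORS`, hypothetical values) in place of a trained Word2vec model.

```python
import math
from typing import Dict, List, Optional, Sequence

# Hand-made 3-dimensional vectors standing in for trained Word2vec embeddings.
TOY_VECTORS: Dict[str, List[float]] = {
    "python": [0.9, 0.1, 0.0],
    "programming": [0.85, 0.15, 0.05],
    "code": [0.8, 0.2, 0.1],
    "mountain": [0.0, 0.9, 0.3],
    "landscape": [0.05, 0.85, 0.35],
    "photo": [0.1, 0.8, 0.4],
}


def mean_vector(words: Sequence[str]) -> Optional[List[float]]:
    """Average the vectors of all known words; None if no word is known."""
    vectors = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not vectors:
        return None
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors, 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(keywords: Sequence[str], articles: List[Dict]) -> List[Dict]:
    """Sort articles by similarity of their title to the keywords, best first."""
    query = mean_vector([k.lower() for k in keywords])

    def score(article: Dict) -> float:
        doc = mean_vector(article["title"].lower().split())
        if query is None or doc is None:
            return 0.0
        return cosine(query, doc)

    return sorted(articles, key=score, reverse=True)


articles = [
    {"title": "Mountain landscape photo"},
    {"title": "Python programming code"},
]
ranked = rank(["python"], articles)
print([a["title"] for a in ranked])
```

In a real deployment the articles passed to `rank` would be the hits returned by the ElasticSearch query, and the vector lookup would come from a trained embedding model.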
Link to Git repository: