Arxiv Scout
jonathanlimsc | 2023-01-15
Image generated via Stable Diffusion 2
For this project, I wanted to use text embeddings as the basis for searching recent ArXiv papers. It was inspired by Karpathy’s Arxiv Sanity, which uses an SVM and TF-IDF approach.
Logic flow
On every user query, the client on this site sends a POST /api/query request to a Flask microservice that I deployed on Render, a PaaS similar to Heroku. I chose Render for its free compute tier, which includes 512 MB of RAM, and for how easy it makes deploying Flask applications. As with Heroku, usage is fuss-free: I simply had to connect my GitHub account, and Render now redeploys automatically on every change.
The Flask microservice scrapes the latest 1,000 ArXiv papers and generates text embeddings from their titles and summaries. An embedding is also generated for the user query, and the cosine similarity is computed between it and each of the 1,000 paper embeddings. The 20 most similar papers are returned in the response.
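To make the ranking step concrete, here is a minimal sketch of cosine-similarity search over a list of embeddings. It is an illustration under assumptions, not the service's actual code: the function names `cosine_similarity` and `top_k_similar` are hypothetical, embeddings are plain lists of floats, and only the standard library is used (a real service would more likely use a vectorized library).

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    query_embedding: Sequence[float],
    paper_embeddings: Sequence[Sequence[float]],
    k: int = 20,
) -> List[Tuple[int, float]]:
    """Return (paper index, score) pairs for the k most similar papers."""
    scores = [
        (index, cosine_similarity(query_embedding, embedding))
        for index, embedding in enumerate(paper_embeddings)
    ]
    # Highest similarity first; Python's sort is stable, so ties keep input order.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings" standing in for real model output.
    papers = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
    for index, score in top_k_similar([1.0, 0.0], papers, k=2):
        print(index, round(score, 3))
```

With 1,000 papers this is a brute-force scan on every query, which is cheap enough at that scale; the default `k=20` mirrors the 20 results described above.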
Currently, the Cohere Embed API is the main embedding model option, since it is readily available, free and easy to use for development, and removes the need for model training and deployment. Cohere therefore serves as a good baseline and benchmark for any future model that I incorporate or train.
Future work
- Scrape a training dataset (100-500K documents) from ArXiv to train my own embedding models and compare them against the Cohere API:
  - N-gram Bag-of-Words
  - TF-IDF
  - FastText
  - Word2Vec / Doc2Vec
- Add OpenAI embedding API
- Index ArXiv papers and their corresponding embeddings over a longer period (e.g. 1-5 years) using OpenSearch
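
As a rough starting point for the TF-IDF baseline listed above, the weighting can be sketched in a few lines of standard-library Python. This is a sketch under assumptions rather than a planned implementation: `tokenize` and `tfidf_vectors` are hypothetical names, tokenization is a naive lowercase whitespace split, and the IDF term is the plain log(N / df) variant (other variants smooth this term).

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Naive tokenizer for this sketch: lowercase, split on whitespace."""
    return text.lower().split()


def tfidf_vectors(documents: List[str]) -> List[Dict[str, float]]:
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {
                # Term frequency times inverse document frequency.
                term: (count / total) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return vectors


if __name__ == "__main__":
    docs = ["graph neural networks", "neural machine translation", "graph theory"]
    for vector in tfidf_vectors(docs):
        print(vector)
```

Terms that appear in every document get a weight of zero under this variant, so only terms that help tell documents apart contribute to the comparison against the Cohere embeddings.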