Retrieval-Augmented Generation for PDF

A full-stack application that uses OpenAI's API to answer questions based on the Startup Playbook by Sam Altman or the Settlers of Catan Rulebook.

Setup

Backend

Install backend dependencies.

cd backend
pip install -r requirements.txt
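
The actual requirements.txt isn't reproduced here; given the stack described under Technologies Used (Flask, LangChain, Chroma, OpenAI), it likely includes at least packages along these lines (an assumption, not the real file):

flask
langchain
chromadb
openai
python-dotenv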

Create a .env file for your OpenAI API key.

touch .env

In the .env file, paste in your key like so:

OPENAI_API_KEY=<your-api-key-here>
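
The backend presumably loads this key before calling OpenAI. A minimal sketch of that pattern with python-dotenv (an assumption about how the scripts read the key, not a copy of the repo's code):

import os
from dotenv import load_dotenv

load_dotenv()  # reads OPENAI_API_KEY from the .env file in the working directory
openai_api_key = os.environ["OPENAI_API_KEY"]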

Create the Chroma DB.

python3 create_database.py
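
This step parses the PDF, splits it into chunks, embeds the chunks with OpenAI, and persists them to a local Chroma collection. Below is a minimal sketch of that flow, assuming the classic LangChain loaders and Chroma integration; the file path, chunk sizes, and persist directory are illustrative, not the repo's actual values:

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Load the PDF and split it into overlapping chunks (path and sizes are illustrative).
documents = PyPDFLoader("data/startup_playbook.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(documents)

# Embed the chunks with OpenAI and persist them to a local Chroma DB.
Chroma.from_documents(chunks, OpenAIEmbeddings(), persist_directory="chroma")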

Start the query server.

python3 query_data.py
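
At a high level, the query server retrieves the chunks most similar to the question from Chroma and asks an OpenAI chat model to answer from that context. The sketch below is a rough approximation using Flask and the classic LangChain API; the route name, payload shape, and model settings are assumptions, not the repo's actual code:

from flask import Flask, jsonify, request
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

app = Flask(__name__)
db = Chroma(persist_directory="chroma", embedding_function=OpenAIEmbeddings())
llm = ChatOpenAI()

@app.route("/query", methods=["POST"])  # route and payload shape are assumptions
def query():
    question = request.json["question"]
    # Retrieve the chunks most similar to the question.
    docs = db.similarity_search(question, k=3)
    context = "\n\n".join(doc.page_content for doc in docs)
    # Ask the model to answer using only the retrieved context.
    answer = llm.predict(f"Answer the question using only this context:\n\n{context}\n\nQuestion: {question}")
    return jsonify({"answer": answer, "sources": [doc.metadata for doc in docs]})

if __name__ == "__main__":
    app.run(port=5000)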

To answer questions about a different PDF, add the PDF to backend/data/ or replace the PDF stored there, then rebuild the Chroma DB.

Frontend

From the repository root, install frontend dependencies.

cd frontend
npm install

Start the React app.

npm start

Technologies Used

  • LangChain and Chroma are used to parse, store, and query data from the PDF, with embeddings from OpenAI.
  • The backend is served with Flask.
  • The frontend is built with React, with styling in Material UI.

Assumptions and Limitations

  • Since the goal is to "ask questions to a document," one major assumption I made is that queries are independent questions that do not build off of one another. This is also less resource intensive, since only a single answer needs to be stored at a time for a given user. To reflect this in the user flow, the chat interface does not store previous answers but does keep previous questions for quick access.

  • Another assumption I made was that this is a proof of concept of a PDF querier and would not be deployed at scale without changes such as caching answers or optimizing how requests from multiple users are handled. Following Paul Graham's advice, this is "doing something that doesn't scale" to test usage.

  • I deployed the application on Render, but found that on the free tier the backend needs at least 50 seconds to wake up from inactivity, leading to high-latency responses and even timeouts. Since local deployment only needs around 2.5 s to generate responses, I've published only this repository.

  • If I had more time, I'd scope out the next feature depending on the users: including a page and line number in the cited source, allowing quicker search across multiple PDFs, or allowing upload of a PDF. This would depend on the use case; e.g., a researcher would likely prioritize seeing page/line numbers to quickly filter through a source, whereas a student writing an essay may prefer to upload PDFs in the application to synthesize information.

Demo Video

https://youtu.be/51Bej32HirE
