A lightweight FastAPI server that hosts an open-source Large Language Model (LLM) as an API. This project makes it easy for friends to use a powerful language model through simple API calls.
- Simple `/generate` endpoint that takes a prompt and returns the model's response (see the sketch after this list)
- Uses TinyLlama (truly free and open-source, no authentication required)
- Can be hosted completely offline after initial download
- Optimized for deployment on free hosting platforms
- No authentication required for easy access
- Lightweight implementation for efficiency
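A server like this can fit in a single file. The sketch below shows one way the `/generate` endpoint might be wired up with FastAPI and the Hugging Face `transformers` pipeline; names such as `GenerateRequest` are illustrative and may differ from the actual `main.py`:

```python
import os

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

# Model and port are configurable via environment variables (see Configuration below).
MODEL_NAME = os.getenv("MODEL_NAME", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")

app = FastAPI()
generator = pipeline("text-generation", model=MODEL_NAME)

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate(req: GenerateRequest):
    # return_full_text=False strips the prompt from the pipeline output,
    # so only the model's continuation is returned.
    output = generator(req.prompt, max_new_tokens=128, return_full_text=False)
    return {"response": output[0]["generated_text"]}

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=int(os.getenv("PORT", "8000")))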
To get started:

- Clone the repository
- Install dependencies (see the requirements sketch below):

```bash
pip install -r requirements.txt
```
- Run the server:
```bash
python main.py
```
- Access the API at http://localhost:8000
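The dependency list lives in `requirements.txt`. The exact pins may differ, but a FastAPI + transformers server of this kind typically needs something along these lines:

```text
fastapi
uvicorn
transformers
torch
```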
On the first run, the model is downloaded to your Hugging Face cache (usually in ~/.cache/huggingface). To run completely offline after that:
- Make sure you've run the server at least once to download the model
- Set the environment variable to use local files:
```bash
export TRANSFORMERS_OFFLINE=1
```
- Run the server as usual:
```bash
python main.py
```
This prevents the server from contacting Hugging Face and makes it use only locally cached files.
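Alternatively, you can pre-fetch the model into the same cache without starting the server, using the `huggingface_hub` library. A small sketch (the model ID matches the default above):

```python
from huggingface_hub import snapshot_download

# Downloads all files for the default model into the local Hugging Face
# cache (~/.cache/huggingface), where TRANSFORMERS_OFFLINE=1 will find them.
snapshot_download("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
```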
Build and run the Docker container:
```bash
docker build -t free-llm-api .
docker run -p 8000:8000 free-llm-api
```

For offline Docker usage, you can mount the Hugging Face cache directory into the container:
```bash
docker run -p 8000:8000 -v ~/.cache/huggingface:/root/.cache/huggingface -e TRANSFORMERS_OFFLINE=1 free-llm-api
```
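For reference, a minimal Dockerfile along the following lines would support the commands above; this is a sketch, and the repository's actual Dockerfile may differ:

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer across rebuilds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000
CMD ["python", "main.py"]
```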
To deploy on Hugging Face Spaces:

- Create a new Space on Hugging Face
- Choose Dockerfile as the Space type
- Push this code to the Space repository
To deploy on Railway:

- Create a new project on Railway
- Connect this GitHub repository
- Railway will automatically build and deploy the application
To deploy on Render:

- Create a new Web Service on Render
- Connect your repository
- Use "Docker" as the runtime
Send a POST request to /generate with a JSON body containing your prompt:
```bash
curl -X 'POST' \
  'http://localhost:8000/generate' \
  -H 'Content-Type: application/json' \
  -d '{
  "prompt": "Tell me a joke about AI."
}'
```

Example response:
```json
{
  "response": "Why did the AI break up with its partner? It needed more data!"
}
```
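The same request from Python, using the `requests` library (a minimal sketch):

```python
import requests

# POST a prompt to the running server and print the model's reply.
resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Tell me a joke about AI."},
    timeout=120,  # generation can be slow on CPU
)
resp.raise_for_status()
print(resp.json()["response"])
```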
You can configure the model by setting the following environment variables:

- `MODEL_NAME`: The Hugging Face model ID to use (default: `TinyLlama/TinyLlama-1.1B-Chat-v1.0`)
- `PORT`: The port to run the server on (default: `8000`)
- `TRANSFORMERS_OFFLINE=1`: Run in offline mode (uses only locally cached models)
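For example, to serve one of the smaller models listed below on a different port (assuming `main.py` reads these variables as described):

```bash
MODEL_NAME="facebook/opt-125m" PORT=8080 python main.py
```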
The default model is TinyLlama, which is a small but capable open-source model that's truly free (no authentication required). Other free options include:
- "google/flan-t5-small" (Very lightweight T5 model)
- "facebook/opt-125m" (Small OPT model from Meta)
For more capable models (some may require authentication):
- "facebook/opt-1.3b" (Larger OPT model)
- "EleutherAI/pythia-1.4b" (Open source model from EleutherAI)