
Free LLM API

A lightweight FastAPI server that hosts an open-source Large Language Model (LLM) behind a simple API. This project makes it easy for friends to use a capable language model through simple API calls.

Features

  • Simple /generate endpoint that takes a prompt and returns the model's response (see the sketch after this list)
  • Uses TinyLlama (truly free and open source; no Hugging Face token needed to download it)
  • Can be hosted completely offline after the initial download
  • Optimized for deployment on free hosting platforms
  • No API authentication, so clients can call the endpoint without keys
  • Lightweight implementation for efficiency
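
main.py in the repo is the source of truth, but as a rough sketch, a server with these features might look like the following. The loading code and generation parameters here are illustrative assumptions, not the repo's exact implementation:

import os

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

# Model is configurable via MODEL_NAME (see Configuration below)
MODEL_NAME = os.getenv("MODEL_NAME", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")

app = FastAPI()
# Load once at startup so each request only pays for generation
generator = pipeline("text-generation", model=MODEL_NAME)

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate(req: GenerateRequest):
    # max_new_tokens is an illustrative choice, not the repo's setting
    output = generator(req.prompt, max_new_tokens=128)
    return {"response": output[0]["generated_text"]}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=int(os.getenv("PORT", "8000")))

Loading the model once at import time keeps per-request latency down, which matters on the free hosting tiers this project targets.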

Getting Started

Local Development

  1. Clone the repository
  2. Install dependencies:
    pip install -r requirements.txt
    
  3. Run the server:
    python main.py
    
  4. Access the API at http://localhost:8000
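
Step 2 installs from the repo's requirements.txt, whose exact pins are authoritative; for this stack it would typically contain something like:

fastapi
uvicorn
transformers
torch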

Running Offline

After the first run, the model will be downloaded to your Hugging Face cache (usually in ~/.cache/huggingface). To run completely offline:

  1. Make sure you've run the server at least once to download the model
  2. Set the environment variable to use local files:
    export TRANSFORMERS_OFFLINE=1
    
  3. Run the server as usual:
    python main.py
    

This prevents the server from contacting Hugging Face and ensures it uses only locally cached files.
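
If you want the same guarantee from inside Python (for example, as a startup check), the transformers loaders also accept local_files_only=True, which raises an error instead of downloading when the model is not in the cache:

import os

# Same effect as the export above, but set before transformers reads it
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# local_files_only=True fails fast if the model is not already cached
tokenizer = AutoTokenizer.from_pretrained(model_name, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(model_name, local_files_only=True)
print("Model loaded entirely from the local cache")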

Using Docker

Build and run the Docker container:

docker build -t free-llm-api .
docker run -p 8000:8000 free-llm-api

For offline Docker usage, you can mount the Hugging Face cache directory into the container:

docker run -p 8000:8000 -v ~/.cache/huggingface:/root/.cache/huggingface -e TRANSFORMERS_OFFLINE=1 free-llm-api
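
The repository's own Dockerfile is the reference; a minimal Dockerfile for an app like this might look roughly as follows (an illustrative sketch, not necessarily the repo's exact file):

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["python", "main.py"]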

Deploying to Cloud Platforms

Hugging Face Spaces

  1. Create a new Space on Hugging Face
  2. Choose Docker as the Space SDK (the Space builds from this repo's Dockerfile)
  3. Push this code to the Space repository

Railway

  1. Create a new project on Railway
  2. Connect this GitHub repository
  3. Railway will automatically build and deploy the application

Render

  1. Create a new Web Service on Render
  2. Connect your repository
  3. Use "Docker" as the runtime

API Usage

Generate Endpoint

Send a POST request to /generate with a JSON body containing your prompt:

curl -X 'POST' \
  'http://localhost:8000/generate' \
  -H 'Content-Type: application/json' \
  -d '{
  "prompt": "Tell me a joke about AI."
}'

Example response:

{
  "response": "Why did the AI break up with its partner? It needed more data!"
}
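
The same request from Python, using the requests library (a client-side dependency, not needed by the server):

import requests

# Call the running server; adjust the host if deployed remotely
resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Tell me a joke about AI."},
)
resp.raise_for_status()
print(resp.json()["response"])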

Configuration

You can configure the model by setting the following environment variables:

  • MODEL_NAME: The Hugging Face model ID to use (default: "TinyLlama/TinyLlama-1.1B-Chat-v1.0")
  • PORT: The port to run the server on (default: 8000)
  • TRANSFORMERS_OFFLINE: Set to 1 to run in offline mode (read by the transformers library; uses only locally cached models)
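
For example, assuming main.py reads both variables at startup (as in the sketch under Features), you could serve a smaller model on a different port:

export MODEL_NAME="facebook/opt-125m"
export PORT=9000
python main.py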

Notes on Model Selection

The default model is TinyLlama, which is a small but capable open-source model that's truly free (no authentication required). Other free options include:

  • "google/flan-t5-small" (Very lightweight T5 model)
  • "facebook/opt-125m" (Small OPT model from Meta)

For more capable models (some may require authentication):

  • "facebook/opt-1.3b" (Larger OPT model)
  • "EleutherAI/pythia-1.4b" (Open source model from EleutherAI)
