Distributed Neural Network System

A distributed neural network project that supports task scheduling, load balancing, fault tolerance, and secure communication across a network of nodes. This system is designed for high availability and scalability, with asynchronous operations, health monitoring, and leader election to ensure robustness.

Table of Contents

  • Features
  • Task Recovery Consistency
  • Architecture
  • Getting Started
  • Usage
  • API Endpoints
  • Deployment
  • Development
  • Monitoring

Features

  • Task Scheduling: Efficient task assignment using a load balancer and asynchronous execution.
  • Fault Tolerance: Built-in backup and recovery mechanisms for task persistence.
  • Secure Communication: JWT-based authentication, role-based access control, and AES-GCM message encryption (see the token sketch after this list).
  • Dynamic Node Discovery: Uses a distributed hash table (DHT) for node management.
  • Leader Election: Ensures high availability with automatic leader selection.
  • Monitoring and Metrics: Supports Prometheus metrics and detailed logging for monitoring.
  • Configurable: Easily configurable using environment variables and configuration files.
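
The Secure Communication feature is built around JWT-based authentication. Below is a minimal token sketch assuming the jsonwebtoken and serde crates; the Claims fields, role names, and reading the secret from JWT_SECRET_KEY are illustrative assumptions rather than the project's actual auth code.

    use jsonwebtoken::{decode, encode, Algorithm, DecodingKey, EncodingKey, Header, Validation};
    use serde::{Deserialize, Serialize};

    // Illustrative claims; the project's real claim fields may differ.
    #[derive(Debug, Serialize, Deserialize)]
    struct Claims {
        sub: String,  // node identity
        role: String, // e.g. "principal", "an", or "ki"
        exp: usize,   // expiry as a Unix timestamp (checked during validation)
    }

    fn main() -> Result<(), jsonwebtoken::errors::Error> {
        let secret = std::env::var("JWT_SECRET_KEY").expect("JWT_SECRET_KEY must be set");

        let claims = Claims {
            sub: "ki-node-1".into(),
            role: "ki".into(),
            exp: 2_000_000_000, // far-future expiry, for the example only
        };

        // Issue a token signed with the shared secret.
        let token = encode(
            &Header::default(),
            &claims,
            &EncodingKey::from_secret(secret.as_bytes()),
        )?;

        // Verify the token and recover the claims (role checks would go here).
        let data = decode::<Claims>(
            &token,
            &DecodingKey::from_secret(secret.as_bytes()),
            &Validation::new(Algorithm::HS256),
        )?;
        println!("authenticated {} with role {}", data.claims.sub, data.claims.role);
        Ok(())
    }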

Task Recovery Consistency

Tasks are first inserted into an in-memory map and then written to the database. If the database write fails, the in-memory insert is rolled back and the error is surfaced to the caller. Clients should only rely on a task being present after handling the Result from the add operation.
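
A minimal sketch of this insert-then-persist pattern is shown below, assuming a HashMap-backed in-memory store and tokio-postgres; the add_task name, map types, and tasks table are illustrative, not the project's actual API.

    use std::collections::HashMap;
    use tokio_postgres::Client;

    // Sketch only: insert into the in-memory map first, then persist to the
    // database, rolling the map entry back if the write fails.
    async fn add_task(
        tasks: &mut HashMap<i64, String>,
        db: &Client,
        id: i64,
        data: String,
    ) -> Result<(), tokio_postgres::Error> {
        tasks.insert(id, data.clone());

        if let Err(e) = db
            .execute("INSERT INTO tasks (id, data) VALUES ($1, $2)", &[&id, &data])
            .await
        {
            // Roll back the in-memory insert so the map and database stay consistent.
            tasks.remove(&id);
            return Err(e);
        }
        Ok(())
    }

Callers observe the task only after this Result is Ok, matching the guarantee described above.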

Architecture

The system is composed of three main types of nodes:

  1. Principal Node: Acts as the coordinator, responsible for role assignments and maintaining global state consistency.
  2. An Nodes: Handle task distribution to Ki nodes and manage communication with the principal.
  3. Ki Nodes: Execute tasks assigned by An nodes and report results back.

Inter-node communication is facilitated via RabbitMQ, and tasks are scheduled using a load balancer to optimize resource utilization.
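
As an illustration of the RabbitMQ transport, the sketch below publishes one task message with the lapin AMQP crate (plus tokio); the AMQP_ADDR variable, the task_queue name, and the plain-text payload are assumptions, and the project's real exchanges and message schema may differ.

    use lapin::{
        options::{BasicPublishOptions, QueueDeclareOptions},
        types::FieldTable,
        BasicProperties, Connection, ConnectionProperties,
    };

    #[tokio::main]
    async fn main() -> Result<(), lapin::Error> {
        // Connect to the broker (defaults to a local RabbitMQ instance).
        let addr = std::env::var("AMQP_ADDR")
            .unwrap_or_else(|_| "amqp://127.0.0.1:5672/%2f".into());
        let conn = Connection::connect(&addr, ConnectionProperties::default()).await?;
        let channel = conn.create_channel().await?;

        // Declare an illustrative queue and publish a task payload to it.
        channel
            .queue_declare("task_queue", QueueDeclareOptions::default(), FieldTable::default())
            .await?;
        let _confirm = channel
            .basic_publish(
                "",
                "task_queue",
                BasicPublishOptions::default(),
                b"sample task data",
                BasicProperties::default(),
            )
            .await?;
        Ok(())
    }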

Getting Started

Prerequisites

  • Rust toolchain (installed via rustup)
  • RabbitMQ
  • CockroachDB

Installation

  1. Clone the repository:

    git clone https://github.com/seanwevans/an-ki.git
    cd an-ki
  2. Install Rust toolchain: Make sure you have Rust installed. If not, install it using rustup:

    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

    After installation, ensure Rust is accessible:

    rustc --version
  3. Set up RabbitMQ: If RabbitMQ is not already installed, follow these steps:

    • On macOS (using Homebrew):
      brew update
      brew install rabbitmq
    • On Ubuntu/Debian:
      sudo apt-get update
      sudo apt-get install rabbitmq-server
    • On Windows: Follow the official installation guide.

    Start the RabbitMQ server:

    rabbitmq-server

  4. Set up CockroachDB: If CockroachDB is not already installed, follow these steps:

    • On macOS (using Homebrew):
      brew install cockroachdb/tap/cockroach
    • On Ubuntu/Debian:
      sudo apt-get install -y cockroachdb
    • On Windows: Follow the official installation guide.

    Start a single-node CockroachDB cluster (for local development):

    cockroach start-single-node --insecure --listen-addr=localhost

    Set the connection string for the application. The project uses tokio-postgres to connect to CockroachDB, so provide a standard PostgreSQL URL (a connectivity check sketch follows these installation steps):

    export DATABASE_URL="postgresql://root@localhost:26257/defaultdb?sslmode=disable"
  5. Configure the project: Create a config/ directory and add configuration files. Start by copying the provided example configuration:

    mkdir -p config
    cp config/default.example config/default.toml

    Edit config/default.toml to set values for amqp_addr, jwt_secret_key, database_url, and optionally otlp_endpoint to point to your OpenTelemetry collector (default is http://localhost:4317). You can also supply the JWT secret via the JWT_SECRET_KEY environment variable and the OTLP endpoint via OTLP_ENDPOINT if preferred.

  6. Build the project: Compile it in release mode to confirm there are no build errors.

    cargo build --release
  7. Run the tests: Run the test suite to verify that everything is set up correctly.

    cargo test

Now you're ready to start running the nodes as described in the Usage section.
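
Optionally, before starting any nodes you can confirm that the DATABASE_URL exported in step 4 is reachable. The following standalone check uses tokio-postgres and is a verification sketch only, not part of the project's code.

    use tokio_postgres::NoTls;

    #[tokio::main]
    async fn main() -> Result<(), tokio_postgres::Error> {
        // Read the same DATABASE_URL the application uses.
        let url = std::env::var("DATABASE_URL").expect("DATABASE_URL must be set");
        let (client, connection) = tokio_postgres::connect(&url, NoTls).await?;

        // The connection object drives the socket; spawn it onto the runtime.
        tokio::spawn(async move {
            if let Err(e) = connection.await {
                eprintln!("connection error: {e}");
            }
        });

        let row = client.query_one("SELECT version()", &[]).await?;
        println!("connected to {}", row.get::<_, String>(0));
        Ok(())
    }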

Usage

Running the Nodes

To run the different nodes, follow the steps below. Each type of node should be run in a separate terminal session to simulate a distributed system.

Principal Node: Run the principal node, which coordinates the overall system and assigns roles.

cargo run -- principal

An Node: Run one or more An nodes to handle task distribution and communication.

cargo run -- an

Ki Node: Run one or more Ki nodes to execute tasks assigned by An nodes and report results back.

cargo run -- ki

Make sure all nodes are running simultaneously to ensure proper communication and task assignment.

API Endpoints

The system provides a REST API for managing tasks, served by the Warp web framework. The available endpoints are listed below:

GET /tasks/{task_id}: Retrieve a specific task by providing its ID.

curl http://localhost:3030/tasks/{task_id}

POST /tasks: Add a new task to the system. Provide the task data in the request body.

curl -X POST http://localhost:3030/tasks -H "Content-Type: application/json" -d '{"task_data": "sample task data"}'

DELETE /tasks/{task_id}: Delete a specific task by providing the task ID.

curl -X DELETE http://localhost:3030/tasks/{task_id}

These endpoints allow external interaction with the distributed network, such as adding new tasks or querying the current tasks.
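
The same endpoints can also be called programmatically. The sketch below assumes the reqwest (with its json feature) and serde_json crates and reuses the request body from the curl example above; it is illustrative rather than part of the project.

    use serde_json::json;

    #[tokio::main]
    async fn main() -> Result<(), reqwest::Error> {
        let client = reqwest::Client::new();

        // Add a new task (mirrors the POST /tasks curl example above).
        let created = client
            .post("http://localhost:3030/tasks")
            .json(&json!({ "task_data": "sample task data" }))
            .send()
            .await?
            .text()
            .await?;
        println!("created: {created}");

        // Retrieve a task by ID.
        let task_id = "example-task-id"; // placeholder: substitute a real task ID
        let task = client
            .get(format!("http://localhost:3030/tasks/{task_id}"))
            .send()
            .await?
            .text()
            .await?;
        println!("fetched: {task}");
        Ok(())
    }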

Deployment

Docker Images

Separate Dockerfiles are provided for each node type. Build the images using:

docker build -f Dockerfile.principal -t an-ki:principal .
docker build -f Dockerfile.an -t an-ki:an .
docker build -f Dockerfile.ki -t an-ki:ki .

Each image runs the appropriate node (principal, an, or ki) when started.

Helm Chart

A Helm chart in helm/an-ki simplifies Kubernetes deployment. Replica counts and core settings are configurable through values.yaml, with environment-specific overrides available in files like values-dev.yaml and values-prod.yaml.

Deploy to a cluster with:

helm install an-ki ./helm/an-ki -f helm/an-ki/values.yaml -f helm/an-ki/values-dev.yaml

For production environments, substitute values-prod.yaml (or your own overrides file) for values-dev.yaml.

Service Discovery

Within Kubernetes, each node is exposed via a Service, enabling discovery through Kubernetes DNS (e.g., principal.default.svc.cluster.local). Outside Kubernetes, nodes rely on the built-in DHT for dynamic discovery or can be configured with explicit addresses using the configuration system.
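
As a quick illustration of Kubernetes DNS discovery, the snippet below resolves the principal Service name using only the standard library; the port is an assumption, and the lookup only succeeds where cluster DNS is reachable.

    use std::net::ToSocketAddrs;

    fn main() -> std::io::Result<()> {
        // Resolve the principal Service via Kubernetes DNS. The port (5672 for
        // AMQP here) is illustrative; use whatever port the Service exposes.
        let addrs: Vec<_> = "principal.default.svc.cluster.local:5672"
            .to_socket_addrs()?
            .collect();
        for addr in &addrs {
            println!("principal resolves to {addr}");
        }
        Ok(())
    }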

Development

Pre-commit Hooks

This project uses pre-commit to automatically run cargo fmt and cargo clippy on commits. Install pre-commit and set up the hooks:

pip install pre-commit
pre-commit install

Run all checks manually with:

pre-commit run --all-files

Monitoring

OpenTelemetry Collector

An OpenTelemetry Collector can aggregate metrics and traces from regional nodes. A sample configuration is available at config/otel-collector-config.yaml. Run the collector with Docker:

docker run --rm -p 4317:4317 -p 9464:9464 \
  -v $(pwd)/config/otel-collector-config.yaml:/etc/otel-collector-config.yaml \
  otel/opentelemetry-collector:latest \
  --config /etc/otel-collector-config.yaml

Nodes export traces to the collector via OTLP on port 4317 (configurable via the otlp_endpoint setting or OTLP_ENDPOINT environment variable) and expose Prometheus metrics on port 9090. The collector scrapes these metrics and re-exports them on 9464 for federation.

Grafana Dashboard

Import the sample dashboard found at config/grafana/node_overview.json into Grafana. It visualizes:

  • Node Status: current health of each node.
  • Consensus State: leader or follower role.
  • Task Throughput: rate of tasks processed per minute.

Configure Grafana to use the collector's Prometheus exporter (http://localhost:9464) as a data source.
