A distributed neural network project that supports task scheduling, load balancing, fault tolerance, and secure communication across a network of nodes. This system is designed for high availability and scalability, with asynchronous operations, health monitoring, and leader election to ensure robustness.
- Features
- Architecture
- Getting Started
- Usage
- Deployment
- Modules Overview
- Development
- Contributing
- License
- Task Scheduling: Efficient task assignment using a load balancer and asynchronous execution.
- Fault Tolerance: Built-in backup and recovery mechanisms for task persistence.
- Secure Communication: JWT-based authentication, role-based access control, and AES-GCM message encryption.
- Dynamic Node Discovery: Uses a distributed hash table (DHT) for node management.
- Leader Election: Ensures high availability with automatic leader selection.
- Monitoring and Metrics: Supports Prometheus metrics and detailed logging for monitoring.
- Configurable: Easily configurable using environment variables and configuration files.
Tasks are first inserted into an in-memory map and then written to the database.
If the database write fails, the in-memory insert is rolled back and the error is
surfaced to the caller. Clients should only rely on a task being present after
handling the Result from the add operation.
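A minimal sketch of that flow, with hypothetical type, field, and table names (the project's actual store may differ):

```rust
// Hypothetical sketch of the add-task flow: optimistic in-memory insert,
// database write, and rollback of the in-memory entry if the write fails.
use std::collections::HashMap;

pub struct Task {
    pub id: String,
    pub data: String,
}

pub struct TaskStore {
    tasks: HashMap<String, Task>,
}

impl TaskStore {
    pub async fn add_task(
        &mut self,
        task: Task,
        db: &tokio_postgres::Client,
    ) -> Result<(), tokio_postgres::Error> {
        let id = task.id.clone();
        let data = task.data.clone();

        // 1. Insert into the in-memory map first.
        self.tasks.insert(id.clone(), task);

        // 2. Persist to the database; on failure, roll back the in-memory
        //    insert and surface the error to the caller.
        if let Err(e) = db
            .execute("INSERT INTO tasks (id, data) VALUES ($1, $2)", &[&id, &data])
            .await
        {
            self.tasks.remove(&id);
            return Err(e);
        }
        Ok(())
    }
}
```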
The system is composed of three main types of nodes:
- Principal Node: Acts as the coordinator, responsible for role assignments and maintaining global state consistency.
- An Nodes: Handle task distribution to Ki nodes and manage communication with the principal.
- Ki Nodes: Execute tasks assigned by An nodes and report results back.
Inter-node communication is facilitated via RabbitMQ, and tasks are scheduled using a load balancer to optimize resource utilization.
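As an illustration of this message flow, the sketch below publishes a task message to RabbitMQ. It assumes the lapin AMQP client and a hypothetical `tasks` queue; the project's actual queue names, exchanges, and message format may differ.

```rust
// Sketch: publish a task message to RabbitMQ (assumes the lapin crate).
use lapin::{options::*, types::FieldTable, BasicProperties, Connection, ConnectionProperties};

#[tokio::main]
async fn main() -> Result<(), lapin::Error> {
    let addr = std::env::var("AMQP_ADDR")
        .unwrap_or_else(|_| "amqp://127.0.0.1:5672/%2f".into());
    let conn = Connection::connect(&addr, ConnectionProperties::default()).await?;
    let channel = conn.create_channel().await?;

    // Declare the (hypothetical) queue so the publish has somewhere to land.
    channel
        .queue_declare("tasks", QueueDeclareOptions::default(), FieldTable::default())
        .await?;

    // Publish a JSON payload to the default exchange, routed by queue name.
    let _confirm = channel
        .basic_publish(
            "",
            "tasks",
            BasicPublishOptions::default(),
            b"{\"task_data\": \"sample task data\"}",
            BasicProperties::default(),
        )
        .await?
        .await?;

    Ok(())
}
```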
- Rust (latest stable version): Install from rustup.rs
- RabbitMQ: Install via the official RabbitMQ installation guide
- Prometheus: For metrics collection (optional)
- CockroachDB: Install via the official CockroachDB installation guide for distributed database management.
Clone the repository:
git clone https://github.com/your-username/distributed-neural-network.git
cd distributed-neural-network
Install Rust toolchain: Make sure you have Rust installed. If not, install it using rustup:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
After installation, ensure Rust is accessible:
rustc --version
Set up RabbitMQ: If RabbitMQ is not already installed, follow these steps:
- On macOS (using Homebrew):
brew update
brew install rabbitmq
- On Ubuntu/Debian:
sudo apt-get update
sudo apt-get install rabbitmq-server
- On Windows: Follow the official installation guide.
Start the RabbitMQ server:
rabbitmq-server
Set up CockroachDB: If CockroachDB is not already installed, follow these steps:
- On macOS (using Homebrew):
brew install cockroachdb/tap/cockroach
- On Ubuntu/Debian:
sudo apt-get install -y cockroachdb
- On Windows: Follow the official installation guide.
Start a single-node CockroachDB cluster (for local development):
cockroach start-single-node --insecure --listen-addr=localhost
Set the connection string for the application. The project uses tokio-postgres to connect to CockroachDB, so provide a standard PostgreSQL URL:
export DATABASE_URL="postgresql://root@localhost:26257/defaultdb?sslmode=disable"
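To verify the database setup, a standalone connectivity check using tokio-postgres and the DATABASE_URL above might look like this (a sketch for local verification, not part of the project's code):

```rust
// Sketch: open a connection to CockroachDB via tokio-postgres and run a query.
use tokio_postgres::NoTls;

#[tokio::main]
async fn main() -> Result<(), tokio_postgres::Error> {
    let url = std::env::var("DATABASE_URL")
        .unwrap_or_else(|_| "postgresql://root@localhost:26257/defaultdb?sslmode=disable".into());

    // connect() returns the client plus a connection future that must be driven.
    let (client, connection) = tokio_postgres::connect(&url, NoTls).await?;
    tokio::spawn(async move {
        if let Err(e) = connection.await {
            eprintln!("connection error: {e}");
        }
    });

    let row = client.query_one("SELECT version()", &[]).await?;
    let version: String = row.get(0);
    println!("connected to: {version}");
    Ok(())
}
```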
Configure the project: Create a config/ directory and add configuration files. Start by copying the provided example configuration:
mkdir -p config
cp config/default.example config/default.toml
Edit config/default.toml to set values for amqp_addr, jwt_secret_key, database_url, and optionally otlp_endpoint to point to your OpenTelemetry collector (the default is http://localhost:4317). You can also supply the JWT secret via the JWT_SECRET_KEY environment variable and the OTLP endpoint via OTLP_ENDPOINT if preferred.
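As an illustration of this file-plus-environment layering, the sketch below loads config/default.toml and applies environment overrides using the `config` crate; whether the project itself uses this crate is an assumption, and its actual loading code may differ.

```rust
// Sketch: layer config/default.toml under environment-variable overrides
// such as JWT_SECRET_KEY and OTLP_ENDPOINT (assumes the `config` crate).
use config::{Config, Environment, File};

fn main() -> Result<(), config::ConfigError> {
    let settings = Config::builder()
        .add_source(File::with_name("config/default")) // reads config/default.toml
        .add_source(Environment::default())            // e.g. JWT_SECRET_KEY, OTLP_ENDPOINT
        .build()?;

    let amqp_addr: String = settings.get_string("amqp_addr")?;
    let otlp_endpoint = settings
        .get_string("otlp_endpoint")
        .unwrap_or_else(|_| "http://localhost:4317".to_string());

    println!("AMQP at {amqp_addr}, OTLP at {otlp_endpoint}");
    Ok(())
}
```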
Build the project: Compile the project to ensure there are no issues.
cargo build --release
Run tests: It's recommended to run the tests to verify that everything is set up correctly.
cargo test
Now you're ready to start running the nodes as described in the Usage section.
To run the different nodes, follow the steps below. Each type of node should be run in a separate terminal session to simulate a distributed system.
Principal Node: Run the principal node, which coordinates the overall system and assigns roles.
cargo run -- principal
An Node: Run one or more An nodes to handle task distribution and communication.
cargo run -- an
Ki Node: Run one or more Ki nodes to execute tasks assigned by An nodes and report results back.
cargo run -- ki
Make sure all nodes are running simultaneously to ensure proper communication and task assignment.
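The node role is selected by the first positional argument, so the binary's dispatch can be pictured roughly as follows (a sketch with the start-up logic reduced to placeholders):

```rust
// Sketch: dispatch on the role argument passed after `cargo run --`.
fn main() {
    let role = std::env::args().nth(1).unwrap_or_default();
    match role.as_str() {
        "principal" => println!("starting principal node (coordinator)"),
        "an" => println!("starting An node (task distribution)"),
        "ki" => println!("starting Ki node (task execution)"),
        other => eprintln!("unknown role '{other}', expected principal | an | ki"),
    }
}
```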
The system provides a REST API for managing tasks, available via the Warp web server. Below are the available endpoints:
GET /tasks/{task_id}: Retrieve a specific task by providing its ID.
curl http://localhost:3030/tasks/{task_id}
POST /tasks: Add a new task to the system. Provide the task data in the request body.
curl -X POST http://localhost:3030/tasks -H "Content-Type: application/json" -d '{"task_data": "sample task data"}'
DELETE /tasks/{task_id}: Delete a specific task by providing the task ID.
curl -X DELETE http://localhost:3030/tasks/{task_id}
These endpoints allow external interaction with the distributed network, such as adding new tasks or querying the current tasks.
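For reference, a minimal Warp service exposing these three routes could look like the sketch below; the handlers are stubs, and the project's real filters, shared state, and error handling will differ.

```rust
// Sketch: the three task routes served by Warp on port 3030.
use warp::Filter;

#[tokio::main]
async fn main() {
    // GET /tasks/{task_id}
    let get_task = warp::path!("tasks" / String)
        .and(warp::get())
        .map(|task_id: String| format!("task {task_id}"));

    // POST /tasks with a JSON body
    let add_task = warp::path("tasks")
        .and(warp::path::end())
        .and(warp::post())
        .and(warp::body::json())
        .map(|body: serde_json::Value| warp::reply::json(&body));

    // DELETE /tasks/{task_id}
    let delete_task = warp::path!("tasks" / String)
        .and(warp::delete())
        .map(|task_id: String| format!("deleted {task_id}"));

    let routes = get_task.or(add_task).or(delete_task);
    warp::serve(routes).run(([127, 0, 0, 1], 3030)).await;
}
```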
Separate Dockerfiles are provided for each node type. Build the images using:
docker build -f Dockerfile.principal -t an-ki:principal .
docker build -f Dockerfile.an -t an-ki:an .
docker build -f Dockerfile.ki -t an-ki:ki .
Each image runs the appropriate node (principal, an, or ki) when started.
A Helm chart in helm/an-ki simplifies Kubernetes deployment. Replica counts and core
settings are configurable through values.yaml, with environment-specific overrides
available in files like values-dev.yaml and values-prod.yaml.
Deploy to a cluster with:
helm install an-ki ./helm/an-ki -f helm/an-ki/values.yaml -f helm/an-ki/values-dev.yaml
For production environments, replace values-dev.yaml with values-prod.yaml or your own overrides file.
Within Kubernetes, each node is exposed via a Service, enabling discovery through
Kubernetes DNS (e.g., principal.default.svc.cluster.local). Outside Kubernetes,
nodes rely on the built-in DHT for dynamic discovery or can be configured with
explicit addresses using the configuration system.
This project uses pre-commit to automatically run cargo fmt and cargo clippy on commits. Install pre-commit and set up the hooks:
pip install pre-commit
pre-commit install
Run all checks manually with:
pre-commit run --all-files
An OpenTelemetry Collector can aggregate metrics and traces from regional nodes. A sample configuration is available at
config/otel-collector-config.yaml. Run the collector with Docker:
docker run --rm -p 4317:4317 -p 9464:9464 \
-v $(pwd)/config/otel-collector-config.yaml:/etc/otel-collector-config.yaml \
otel/opentelemetry-collector:latest \
--config /etc/otel-collector-config.yaml
Nodes export traces to the collector via OTLP on port 4317 (configurable via the
otlp_endpoint setting or OTLP_ENDPOINT environment variable) and expose Prometheus metrics on port 9090.
The collector scrapes these metrics and re-exports them on 9464 for federation.
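As an illustration of the metrics side, the sketch below registers a counter and renders it in the Prometheus text format that a /metrics endpoint on port 9090 would serve; it assumes the `prometheus` crate, and the metric name is hypothetical.

```rust
// Sketch: register a counter and render it in Prometheus text format.
use prometheus::{Encoder, IntCounter, Registry, TextEncoder};

fn main() -> Result<(), prometheus::Error> {
    let registry = Registry::new();
    let tasks_processed =
        IntCounter::new("tasks_processed_total", "Tasks processed by this node")?;
    registry.register(Box::new(tasks_processed.clone()))?;

    tasks_processed.inc();

    // This text output is what a /metrics endpoint would serve to the collector.
    let mut buf = Vec::new();
    TextEncoder::new().encode(&registry.gather(), &mut buf)?;
    println!("{}", String::from_utf8_lossy(&buf));
    Ok(())
}
```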
Import the sample dashboard found at config/grafana/node_overview.json into Grafana. It visualizes:
- Node Status: current health of each node.
- Consensus State: leader or follower role.
- Task Throughput: rate of tasks processed per minute.
Configure Grafana to use the collector's Prometheus exporter (http://localhost:9464) as a data source.