Go-Vault

A containerized Go application that continuously exports PostgreSQL data to Parquet files with Hive-style partitioning, automatic resume from the last exported timestamp, and health and metrics endpoints for monitoring.

🚀 Features

  • Continuous Export: Automated exports on configurable schedules (minutes to days)
  • Smart Resume: Automatically detects and resumes from the last exported timestamp
  • Hive Partitioning: Industry-standard organization: table=<name>/year=YYYY/month=MM/day=DD/
  • High Performance:
    • Batch processing with configurable batch sizes
    • Snappy-compressed Parquet output
    • Concurrent table exports
    • Memory-efficient streaming
  • Data Integrity:
    • Atomic file writes prevent corruption
    • Timestamp-based filenames for precise tracking
    • Graceful error handling with continuation
  • Operational Excellence:
    • Health check endpoints for container orchestration
    • Comprehensive metrics API
    • Structured logging with operation tracking
    • Graceful shutdown with timeout
  • Production Ready:
    • Docker containerized with multi-stage builds
    • Non-root user execution
    • Resource limits support
    • Volume mounting for data persistence
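
The atomic file writes listed under Data Integrity are commonly implemented by writing to a temporary file and renaming it into place. The following is a minimal sketch of that pattern, not the project's actual code; `writeFileAtomic` and the `.tmp-*` naming are hypothetical, and the payload is placeholder bytes rather than real Parquet data.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic (hypothetical helper) writes data to a temporary file in
// the target directory and then renames it to its final name, so a reader
// never sees a half-written file under that name.
func writeFileAtomic(path string, data []byte) error {
	dir := filepath.Dir(path)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	// Create the temp file in the same directory so the rename stays on one filesystem.
	tmp, err := os.CreateTemp(dir, ".tmp-*")
	if err != nil {
		return err
	}
	// Remove the temp file on failure; after a successful rename this is a no-op.
	defer os.Remove(tmp.Name())

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	// Flush to disk before the file becomes visible under its final name.
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

func main() {
	dir, err := os.MkdirTemp("", "go-vault-demo")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(dir)

	target := filepath.Join(dir, "table=users", "1735689600000_1735693200000.parquet")
	if err := writeFileAtomic(target, []byte("placeholder, not real Parquet")); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("wrote", filepath.Base(target))
}
```

Keeping the temporary file in the destination directory matters because a rename is only atomic within a single filesystem.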

🏃 Quick Start

Using Docker Compose (Recommended)

  1. Clone the repository:
git clone https://github.com/hra42/go-vault.git
cd go-vault
  2. Configure your PostgreSQL connection in docker-compose.yml:
environment:
  - POSTGRES_HOST=your-postgres-host
  - POSTGRES_DATABASE=your-database
  - POSTGRES_USERNAME=your-username
  - POSTGRES_PASSWORD=your-password
  3. Start the service:
docker-compose up -d
  4. Verify it's running:
# Check health
curl http://localhost:8080/health

# View metrics
curl http://localhost:8080/metrics

Using Docker Directly

# Build the image
docker build -t go-vault .

# Run the container
docker run -d \
  --name go-vault \
  -e POSTGRES_HOST=postgres.example.com \
  -e POSTGRES_PORT=5432 \
  -e POSTGRES_DATABASE=metrics \
  -e POSTGRES_USERNAME=postgres \
  -e POSTGRES_PASSWORD=your-secure-password \
  -e EXPORT_INTERVAL=1h \
  -v "$(pwd)/parquet-data:/data" \
  -p 8080:8080 \
  --restart unless-stopped \
  go-vault

🏗️ Architecture

Component Overview

┌─────────────────┐     ┌──────────────┐     ┌─────────────────┐
│   PostgreSQL    │────▶│   Exporter   │────▶│  Parquet Files  │
│    Database     │     │   Service    │     │  (Hive Format)  │
└─────────────────┘     └──────────────┘     └─────────────────┘
                               │
                               ▼
                       ┌────────────────┐
                       │ Health/Metrics │
                       │   Endpoints    │
                       └────────────────┘

Key Components

  1. Table Discovery: Automatically finds tables with timestamp columns
  2. Resume Logic: Scans existing Parquet files to determine starting point
  3. Batch Processor: Fetches data in configurable batches for memory efficiency
  4. Parquet Writer: Converts PostgreSQL data to compressed Parquet format
  5. Scheduler: Cron-based scheduling system for regular exports
  6. Health Server: Provides monitoring endpoints for operational visibility
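
The resume logic in step 2 can be pictured as a directory walk that reads the end timestamp out of each Parquet filename. This is a sketch under the assumption that the largest `<end_timestamp>` in a table's directory marks the resume point; `endMillis` and `lastExportedMillis` are hypothetical names, not functions confirmed to exist in the project.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"path/filepath"
	"strconv"
	"strings"
)

// endMillis extracts the end timestamp from a "<start>_<end>.parquet" filename.
// It reports false for names that do not follow that convention.
func endMillis(name string) (int64, bool) {
	base, ok := strings.CutSuffix(name, ".parquet")
	if !ok {
		return 0, false
	}
	_, end, ok := strings.Cut(base, "_")
	if !ok {
		return 0, false
	}
	ms, err := strconv.ParseInt(end, 10, 64)
	if err != nil {
		return 0, false
	}
	return ms, true
}

// lastExportedMillis walks one table directory and returns the largest end
// timestamp found. It returns 0 when the directory is missing or holds no
// matching files, which a caller could treat as "export from the beginning".
func lastExportedMillis(tableDir string) (int64, error) {
	var latest int64
	err := filepath.WalkDir(tableDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		if ms, ok := endMillis(d.Name()); ok && ms > latest {
			latest = ms
		}
		return nil
	})
	if errors.Is(err, fs.ErrNotExist) {
		return 0, nil
	}
	return latest, err
}

func main() {
	ms, ok := endMillis("1735689600000_1735693200000.parquet")
	fmt.Println(ms, ok)

	latest, err := lastExportedMillis("/data/table=users")
	fmt.Println(latest, err)
}
```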

⚙️ Configuration

All configuration is done through environment variables:

| Variable | Description | Default | Example |
| --- | --- | --- | --- |
| POSTGRES_HOST | PostgreSQL hostname | localhost | db.example.com |
| POSTGRES_PORT | PostgreSQL port | 5432 | 5432 |
| POSTGRES_DATABASE | Target database name | metrics | production_metrics |
| POSTGRES_USERNAME | Database username | postgres | readonly_user |
| POSTGRES_PASSWORD | Database password | (required) | secure-password |
| DATA_PATH | Directory for Parquet files | /data | /mnt/parquet |
| EXPORT_INTERVAL | Export frequency | 1h | 30m, 2h, 24h |
| HEALTH_PORT | Port for health/metrics API | 8080 | 9090 |

Export Interval Examples

  • 5m - Every 5 minutes
  • 30m - Every 30 minutes
  • 1h - Every hour (default)
  • 6h - Every 6 hours
  • 24h - Daily
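
The interval values above match the syntax accepted by Go's `time.ParseDuration`, whose largest unit is the hour, which would explain why a daily export is written as `24h` rather than `1d`. The sketch below assumes the service reads `EXPORT_INTERVAL` this way; `envOr` and `parseInterval` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// envOr returns the value of an environment variable, or fallback when the
// variable is unset or empty.
func envOr(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok && v != "" {
		return v
	}
	return fallback
}

// parseInterval converts a string such as "5m" or "24h" into a duration and
// rejects values that are malformed or not positive.
func parseInterval(s string) (time.Duration, error) {
	d, err := time.ParseDuration(s)
	if err != nil {
		return 0, fmt.Errorf("invalid EXPORT_INTERVAL %q: %w", s, err)
	}
	if d <= 0 {
		return 0, fmt.Errorf("EXPORT_INTERVAL must be positive, got %s", d)
	}
	return d, nil
}

func main() {
	// Fall back to the documented default of 1h.
	interval, err := parseInterval(envOr("EXPORT_INTERVAL", "1h"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("exporting every", interval)
}
```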

📁 Data Organization

The service uses Hive-style partitioning for optimal query performance:

/data/
├── table=users/
│   ├── year=2025/
│   │   ├── month=01/
│   │   │   ├── day=01/
│   │   │   │   ├── 1735689600000_1735693200000.parquet
│   │   │   │   └── 1735693200000_1735696800000.parquet
│   │   │   └── day=02/
│   │   │       └── 1735776000000_1735779600000.parquet
│   │   └── month=02/
│   │       └── day=01/
│   │           └── 1738368000000_1738371600000.parquet
│   └── year=2024/
│       └── month=12/
│           └── day=31/
│               └── 1735603200000_1735606800000.parquet
└── table=orders/
    └── year=2025/
        └── month=01/
            └── day=01/
                └── 1735689600000_1735696800000.parquet

File Naming Convention

Files are named with Unix timestamps (milliseconds):

  • Format: <start_timestamp>_<end_timestamp>.parquet
  • Example: 1735689600000_1735693200000.parquet
  • This represents data from 2025-01-01 00:00:00 to 2025-01-01 01:00:00 UTC
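
To make the layout and naming convention concrete, here is a small sketch that derives the full file path from a table name and a start/end pair. It assumes the partition is chosen from the start timestamp interpreted in UTC, which the example above suggests but does not state outright; `partitionPath` is a hypothetical function name.

```go
package main

import (
	"fmt"
	"path/filepath"
	"time"
)

// partitionPath builds the Hive-style path for one export file.
// Assumption: the year/month/day partition comes from the start timestamp in UTC.
func partitionPath(root, table string, startMs, endMs int64) string {
	start := time.UnixMilli(startMs).UTC()
	return filepath.Join(root,
		"table="+table,
		fmt.Sprintf("year=%04d", start.Year()),
		fmt.Sprintf("month=%02d", int(start.Month())),
		fmt.Sprintf("day=%02d", start.Day()),
		fmt.Sprintf("%d_%d.parquet", startMs, endMs),
	)
}

func main() {
	// 1735689600000 ms is 2025-01-01 00:00:00 UTC; the end is one hour later.
	fmt.Println(partitionPath("/data", "users", 1735689600000, 1735693200000))
}
```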

📊 Monitoring

Health Check Endpoint

GET http://localhost:8080/health

Response:

{
  "status": "healthy"
}

Status codes:

  • 200 OK - Service is healthy
  • 503 Service Unavailable - Service is unhealthy

Metrics Endpoint

GET http://localhost:8080/metrics

Response:

{
  "rows_processed": {
    "users": 1543210,
    "orders": 892341,
    "products": 45678
  },
  "files_created": 156,
  "error_count": 2,
  "last_export_time": "2025-01-22T10:30:00Z",
  "avg_export_duration": "45s",
  "last_errors": [
    "connection timeout to PostgreSQL",
    "disk space insufficient"
  ]
}
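
A client that consumes this response might decode it as follows. The struct mirrors only the fields shown in the sample above; the Go type and function names (`Metrics`, `parseMetrics`, `totalRows`) are hypothetical, and `avg_export_duration` is kept as a string because the sample shows it as `"45s"`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Metrics mirrors the JSON fields shown in the sample /metrics response.
type Metrics struct {
	RowsProcessed     map[string]int64 `json:"rows_processed"`
	FilesCreated      int64            `json:"files_created"`
	ErrorCount        int64            `json:"error_count"`
	LastExportTime    time.Time        `json:"last_export_time"`
	AvgExportDuration string           `json:"avg_export_duration"`
	LastErrors        []string         `json:"last_errors"`
}

// parseMetrics decodes a /metrics response body.
func parseMetrics(body []byte) (Metrics, error) {
	var m Metrics
	err := json.Unmarshal(body, &m)
	return m, err
}

// totalRows sums the per-table row counts.
func totalRows(m Metrics) int64 {
	var total int64
	for _, n := range m.RowsProcessed {
		total += n
	}
	return total
}

func main() {
	sample := []byte(`{
	  "rows_processed": {"users": 1543210, "orders": 892341, "products": 45678},
	  "files_created": 156,
	  "error_count": 2,
	  "last_export_time": "2025-01-22T10:30:00Z",
	  "avg_export_duration": "45s",
	  "last_errors": ["connection timeout to PostgreSQL", "disk space insufficient"]
	}`)

	m, err := parseMetrics(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("total rows:", totalRows(m))
	fmt.Println("recent errors:", len(m.LastErrors))
}
```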

Ready Endpoint

GET http://localhost:8080/ready

Always returns 200 OK when the service is running.

🔌 API Reference

Health Endpoints

| Endpoint | Method | Description | Response Codes |
| --- | --- | --- | --- |
| /health | GET | Service health status | 200, 503 |
| /ready | GET | Service readiness | 200 |
| /metrics | GET | Export metrics and statistics | 200 |
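
For readers who want to see how the `/health` and `/ready` status codes could be produced, here is a minimal `net/http` sketch. It is not the project's handler code: the `state` type, `newMux`, and `probe` are hypothetical, and the `"unhealthy"` body is an assumption, since the documentation only shows the healthy response.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// state holds a health flag that an exporter loop could flip on failure.
type state struct {
	healthy atomic.Bool
}

// newMux wires up /health (200 or 503) and /ready (always 200).
func newMux(s *state) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		code, status := http.StatusOK, "healthy"
		if !s.healthy.Load() {
			// The "unhealthy" body text is an assumption for illustration.
			code, status = http.StatusServiceUnavailable, "unhealthy"
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		json.NewEncoder(w).Encode(map[string]string{"status": status})
	})
	mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe sends an in-memory GET request to the handler and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	s := &state{}
	mux := newMux(s)

	fmt.Println("/health before first success:", probe(mux, "/health"))
	s.healthy.Store(true)
	fmt.Println("/health after success:", probe(mux, "/health"))
	fmt.Println("/ready:", probe(mux, "/ready"))
}
```

In a real service the mux would be passed to `http.ListenAndServe` on the port given by `HEALTH_PORT` instead of being probed in memory.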

🛠️ Development

Prerequisites

  • Go 1.24 or higher
  • Docker and Docker Compose (optional)
  • PostgreSQL instance with data

Local Development Setup

  1. Clone the repository:
git clone https://github.com/hra42/go-vault.git
cd go-vault
  2. Install dependencies:
go mod download
  3. Set environment variables:
export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432
export POSTGRES_DATABASE=testdb
export POSTGRES_USERNAME=postgres
export POSTGRES_PASSWORD=password
export DATA_PATH=./data
export EXPORT_INTERVAL=5m
  4. Run the application:
go run cmd/exporter/main.go

Building from Source

# Build for current platform
go build -o exporter cmd/exporter/main.go

# Build for Linux (for Docker)
GOOS=linux GOARCH=amd64 go build -o exporter cmd/exporter/main.go

Running Tests

# Run all tests
go test ./...

# Run with coverage
go test -cover ./...

# Run specific package tests
go test ./internal/parquet

🚀 Production Deployment

Docker Deployment

  1. Build the production image:
docker build -t go-vault:latest .
  2. Create a dedicated network:
docker network create parquet-export
  3. Run with production settings:
docker run -d \
  --name go-vault \
  --network parquet-export \
  --restart unless-stopped \
  --memory="2g" \
  --cpus="2.0" \
  -e POSTGRES_HOST=prod-db.internal \
  -e POSTGRES_DATABASE=production \
  -e POSTGRES_USERNAME=readonly_user \
  -e POSTGRES_PASSWORD="${DB_PASSWORD}" \
  -e EXPORT_INTERVAL=30m \
  -v /mnt/parquet-storage:/data \
  -p 8080:8080 \
  go-vault:latest

Kubernetes Deployment

Create a ConfigMap for configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: go-vault-config
data:
  POSTGRES_HOST: "postgres-service.default.svc.cluster.local"
  POSTGRES_PORT: "5432"
  POSTGRES_DATABASE: "metrics"
  EXPORT_INTERVAL: "1h"

Deploy the service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: go-vault
spec:
  replicas: 1
  selector:
    matchLabels:
      app: go-vault
  template:
    metadata:
      labels:
        app: go-vault
    spec:
      containers:
      - name: exporter
        image: go-vault:latest
        envFrom:
        - configMapRef:
            name: go-vault-config
        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: password
        volumeMounts:
        - name: parquet-data
          mountPath: /data
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
      volumes:
      - name: parquet-data
        persistentVolumeClaim:
          claimName: parquet-pvc

Performance Tuning

  1. Batch Size: Adjust batch size based on available memory and row size
  2. Export Interval: Balance between data freshness and system load
  3. Connection Pool: Configure PostgreSQL connection pool settings
  4. Resource Limits: Set appropriate CPU and memory limits

Security Considerations

  1. Database Credentials: Use secrets management (Kubernetes Secrets, HashiCorp Vault, etc.)
  2. Network Security: Restrict database access to the exporter service
  3. File Permissions: Ensure proper permissions on the data volume
  4. Non-root User: Container runs as non-root user by default

🔧 Troubleshooting

Common Issues

Service Won't Start

Check logs:

docker logs go-vault

Common causes:

  • Incorrect PostgreSQL credentials
  • Database unreachable
  • Insufficient permissions

No Data Being Exported

  1. Check if tables have timestamp columns:
SELECT table_name, column_name
FROM information_schema.columns
WHERE data_type LIKE 'timestamp%';
  2. Verify health endpoint:
curl http://localhost:8080/health
  3. Check metrics for errors:
curl http://localhost:8080/metrics | jq .last_errors

Disk Space Issues

Monitor disk usage:

df -h /path/to/parquet-data

Consider:

  • Implementing data retention policies
  • Compressing older partitions
  • Moving data to object storage

Debug Mode

Enable verbose logging:

docker run -e LOG_LEVEL=debug ...

🀝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Workflow

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📝 License

This project is licensed under The Unlicense - see the LICENSE file for details.
