A high-performance containerized Go application that continuously exports PostgreSQL data to Parquet files with Hive-style partitioning, smart resume capabilities, and enterprise-grade monitoring.
- Continuous Export: Automated exports on configurable schedules (minutes to days)
- Smart Resume: Automatically detects and resumes from the last exported timestamp
- Hive Partitioning: Industry-standard organization: `table=<name>/year=YYYY/month=MM/day=DD/`
- High Performance:
- Batch processing with configurable batch sizes
- Optimal Parquet compression (Snappy)
- Concurrent table exports
- Memory-efficient streaming
- Data Integrity:
- Atomic file writes prevent corruption
- Timestamp-based filenames for precise tracking
- Graceful error handling with continuation
- Operational Excellence:
- Health check endpoints for container orchestration
- Comprehensive metrics API
- Structured logging with operation tracking
- Graceful shutdown with timeout
- Production Ready:
- Docker containerized with multi-stage builds
- Non-root user execution
- Resource limits support
- Volume mounting for data persistence
- Quick Start
- Architecture
- Configuration
- Data Organization
- Monitoring
- API Reference
- Development
- Production Deployment
- Troubleshooting
- Contributing
- Clone the repository:

  ```bash
  git clone https://github.com/hra42/go-vault.git
  cd go-vault
  ```

- Configure your PostgreSQL connection in `docker-compose.yml`:

  ```yaml
  environment:
    - POSTGRES_HOST=your-postgres-host
    - POSTGRES_DATABASE=your-database
    - POSTGRES_USERNAME=your-username
    - POSTGRES_PASSWORD=your-password
  ```

- Start the service:

  ```bash
  docker-compose up -d
  ```

- Verify it's running:
  ```bash
  # Check health
  curl http://localhost:8080/health

  # View metrics
  curl http://localhost:8080/metrics
  ```

Alternatively, build and run the container manually:

```bash
# Build the image
docker build -t go-vault .

# Run the container
docker run -d \
  --name go-vault \
  -e POSTGRES_HOST=postgres.example.com \
  -e POSTGRES_PORT=5432 \
  -e POSTGRES_DATABASE=metrics \
  -e POSTGRES_USERNAME=postgres \
  -e POSTGRES_PASSWORD=your-secure-password \
  -e EXPORT_INTERVAL=1h \
  -v $(pwd)/parquet-data:/data \
  -p 8080:8080 \
  --restart unless-stopped \
  go-vault
```

```
┌───────────────┐     ┌──────────────┐     ┌───────────────┐
│  PostgreSQL   │────▶│   Exporter   │────▶│ Parquet Files │
│   Database    │     │   Service    │     │ (Hive Format) │
└───────────────┘     └──────────────┘     └───────────────┘
                             │
                             ▼
                      ┌──────────────┐
                      │Health/Metrics│
                      │  Endpoints   │
                      └──────────────┘
```
- Table Discovery: Automatically finds tables with timestamp columns
- Resume Logic: Scans existing Parquet files to determine starting point
- Batch Processor: Fetches data in configurable batches for memory efficiency
- Parquet Writer: Converts PostgreSQL data to compressed Parquet format
- Scheduler: Cron-based scheduling system for regular exports
- Health Server: Provides monitoring endpoints for operational visibility
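The resume logic can be pictured as scanning the existing `<start>_<end>.parquet` filenames and taking the largest end timestamp as the next starting point. A sketch of that idea — `lastExportedMillis` is a hypothetical helper, not the project's actual function:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// lastExportedMillis scans Parquet filenames of the form
// <start>_<end>.parquet and returns the largest end timestamp,
// i.e. the point from which the next export should resume.
func lastExportedMillis(files []string) int64 {
	var last int64
	for _, f := range files {
		name := strings.TrimSuffix(f, ".parquet")
		parts := strings.Split(name, "_")
		if len(parts) != 2 {
			continue // skip files that don't match the naming scheme
		}
		end, err := strconv.ParseInt(parts[1], 10, 64)
		if err != nil {
			continue
		}
		if end > last {
			last = end
		}
	}
	return last
}

func main() {
	files := []string{
		"1735689600000_1735693200000.parquet",
		"1735693200000_1735696800000.parquet",
	}
	fmt.Println(lastExportedMillis(files)) // → 1735696800000
}
```

Because the resume point is derived from the files themselves, no separate state store is required.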
All configuration is done through environment variables:
| Variable | Description | Default | Example |
|---|---|---|---|
| `POSTGRES_HOST` | PostgreSQL hostname | `localhost` | `db.example.com` |
| `POSTGRES_PORT` | PostgreSQL port | `5432` | `5432` |
| `POSTGRES_DATABASE` | Target database name | `metrics` | `production_metrics` |
| `POSTGRES_USERNAME` | Database username | `postgres` | `readonly_user` |
| `POSTGRES_PASSWORD` | Database password | (required) | `secure-password` |
| `DATA_PATH` | Directory for Parquet files | `/data` | `/mnt/parquet` |
| `EXPORT_INTERVAL` | Export frequency | `1h` | `30m`, `2h`, `24h` |
| `HEALTH_PORT` | Port for health/metrics API | `8080` | `9090` |
- `5m` - Every 5 minutes
- `30m` - Every 30 minutes
- `1h` - Every hour (default)
- `6h` - Every 6 hours
- `24h` - Daily
The service uses Hive-style partitioning for optimal query performance:
```
/data/
├── table=users/
│   ├── year=2025/
│   │   ├── month=01/
│   │   │   ├── day=01/
│   │   │   │   ├── 1735689600000_1735693200000.parquet
│   │   │   │   └── 1735693200000_1735696800000.parquet
│   │   │   └── day=02/
│   │   │       └── 1735776000000_1735779600000.parquet
│   │   └── month=02/
│   │       └── day=01/
│   │           └── 1738454400000_1738458000000.parquet
│   └── year=2024/
│       └── month=12/
│           └── day=31/
│               └── 1735603200000_1735606800000.parquet
└── table=orders/
    └── year=2025/
        └── month=01/
            └── day=01/
                └── 1735689600000_1735696800000.parquet
```
Files are named with Unix timestamps (milliseconds):

- Format: `<start_timestamp>_<end_timestamp>.parquet`
- Example: `1735689600000_1735693200000.parquet`
- This represents data from 2025-01-01 00:00:00 to 2025-01-01 01:00:00 UTC
```
GET http://localhost:8080/health
```

Response:

```json
{
  "status": "healthy"
}
```

Status codes:

- `200 OK` - Service is healthy
- `503 Service Unavailable` - Service is unhealthy
```
GET http://localhost:8080/metrics
```

Response:

```json
{
  "rows_processed": {
    "users": 1543210,
    "orders": 892341,
    "products": 45678
  },
  "files_created": 156,
  "error_count": 2,
  "last_export_time": "2025-01-22T10:30:00Z",
  "avg_export_duration": "45s",
  "last_errors": [
    "connection timeout to PostgreSQL",
    "disk space insufficient"
  ]
}
```

```
GET http://localhost:8080/ready
```

Always returns `200 OK` when the service is running.
| Endpoint | Method | Description | Response Codes |
|---|---|---|---|
| `/health` | GET | Service health status | 200, 503 |
| `/ready` | GET | Service readiness | 200 |
| `/metrics` | GET | Export metrics and statistics | 200 |
- Go 1.24 or higher
- Docker and Docker Compose (optional)
- PostgreSQL instance with data
- Clone the repository:

  ```bash
  git clone https://github.com/hra42/go-vault.git
  cd go-vault
  ```

- Install dependencies:

  ```bash
  go mod download
  ```

- Set environment variables:

  ```bash
  export POSTGRES_HOST=localhost
  export POSTGRES_PORT=5432
  export POSTGRES_DATABASE=testdb
  export POSTGRES_USERNAME=postgres
  export POSTGRES_PASSWORD=password
  export DATA_PATH=./data
  export EXPORT_INTERVAL=5m
  ```

- Run the application:

  ```bash
  go run cmd/exporter/main.go
  ```

```bash
# Build for current platform
go build -o exporter cmd/exporter/main.go

# Build for Linux (for Docker)
GOOS=linux GOARCH=amd64 go build -o exporter cmd/exporter/main.go
```

```bash
# Run all tests
go test ./...

# Run with coverage
go test -cover ./...

# Run specific package tests
go test ./internal/parquet
```

- Build the production image:
  ```bash
  docker build -t go-vault:latest .
  ```

- Create a dedicated network:

  ```bash
  docker network create parquet-export
  ```

- Run with production settings:

  ```bash
  docker run -d \
    --name go-vault \
    --network parquet-export \
    --restart unless-stopped \
    --memory="2g" \
    --cpus="2.0" \
    -e POSTGRES_HOST=prod-db.internal \
    -e POSTGRES_DATABASE=production \
    -e POSTGRES_USERNAME=readonly_user \
    -e POSTGRES_PASSWORD=${DB_PASSWORD} \
    -e EXPORT_INTERVAL=30m \
    -v /mnt/parquet-storage:/data \
    -p 8080:8080 \
    go-vault:latest
  ```

Create a ConfigMap for configuration:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: go-vault-config
data:
  POSTGRES_HOST: "postgres-service.default.svc.cluster.local"
  POSTGRES_PORT: "5432"
  POSTGRES_DATABASE: "metrics"
  EXPORT_INTERVAL: "1h"
```

Deploy the service:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: go-vault
spec:
  replicas: 1
  selector:
    matchLabels:
      app: go-vault
  template:
    metadata:
      labels:
        app: go-vault
    spec:
      containers:
      - name: exporter
        image: go-vault:latest
        envFrom:
        - configMapRef:
            name: go-vault-config
        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: password
        volumeMounts:
        - name: parquet-data
          mountPath: /data
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
      volumes:
      - name: parquet-data
        persistentVolumeClaim:
          claimName: parquet-pvc
```

- Batch Size: Adjust batch size based on available memory and row size
- Export Interval: Balance between data freshness and system load
- Connection Pool: Configure PostgreSQL connection pool settings
- Resource Limits: Set appropriate CPU and memory limits
- Database Credentials: Use secrets management (Kubernetes Secrets, HashiCorp Vault, etc.)
- Network Security: Restrict database access to the exporter service
- File Permissions: Ensure proper permissions on the data volume
- Non-root User: Container runs as non-root user by default
Check logs:

```bash
docker logs go-vault
```

Common causes:

- Incorrect PostgreSQL credentials
- Database unreachable
- Insufficient permissions

To diagnose further:

- Check if tables have timestamp columns:

  ```sql
  SELECT table_name, column_name
  FROM information_schema.columns
  WHERE data_type LIKE 'timestamp%';
  ```

- Verify health endpoint:

  ```bash
  curl http://localhost:8080/health
  ```

- Check metrics for errors:

  ```bash
  curl http://localhost:8080/metrics | jq .last_errors
  ```

Monitor disk usage:

```bash
df -h /path/to/parquet-data
```

Consider:

- Implementing data retention policies
- Compressing older partitions
- Moving data to object storage

Enable verbose logging:

```bash
docker run -e LOG_LEVEL=debug ...
```

We welcome contributions! Please see our Contributing Guidelines for details.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under The Unlicense - see the LICENSE file for details.
- Apache Arrow for the excellent Parquet library
- lib/pq for PostgreSQL driver
- robfig/cron for scheduling