# GPU Energy Agent
A lightweight Python daemon that streams GPU telemetry from your machines to AluminatiAi. Runs with <0.1% CPU overhead and batches uploads every 60 seconds.
## Overview

The agent uses NVIDIA's NVML library (via `pynvml`) to sample each GPU every 5 seconds. Metrics are buffered locally and flushed to the AluminatiAi ingest API every 60 seconds. If an upload fails, metrics are persisted to a write-ahead log (`data/wal/metrics.wal`) and automatically replayed on the next successful connection.
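The sample → buffer → flush cycle described above can be sketched in a few lines. This is an illustration of the loop structure, not the agent's actual source; `read_gpu_sample` and `upload` are hypothetical stand-ins for the real NVML read and the ingest API call:

```python
import time

def run_agent(read_gpu_sample, upload, sample_interval=5.0,
              upload_interval=60.0, duration=None):
    """Sample every sample_interval seconds; flush the buffer every
    upload_interval seconds. Returns metrics still buffered at shutdown."""
    buffer = []
    start = last_upload = time.monotonic()
    while duration is None or time.monotonic() - start < duration:
        buffer.append(read_gpu_sample())      # one NVML read per tick
        now = time.monotonic()
        if now - last_upload >= upload_interval:
            if upload(list(buffer)):          # upload() returns True on success
                buffer.clear()                # drop data only once uploaded
            last_upload = now
        time.sleep(sample_interval)
    return buffer
```

The key property is that the buffer is cleared only after a successful upload, which is what makes the WAL handoff (described below) lossless.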
| Feature | Description |
|---|---|
| Power draw | Instantaneous watts per GPU |
| Energy delta | Joules consumed since the last sample |
| GPU utilization | Compute + memory bandwidth % |
| Temperature | Junction temperature in °C |
| Memory usage | Used / total VRAM in MB |
| Clock speeds | SM and memory clocks in MHz |
| Job attribution | `job_id`, `team_id`, `model_tag` via scheduler |
| WAL durability | Local buffer survives network outages |
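The energy delta is power integrated over time: watts × seconds = joules. A minimal sketch of that accumulation, using the trapezoidal rule between consecutive samples (illustrative only; recent NVIDIA architectures also expose a cumulative energy counter directly through NVML):

```python
def energy_deltas(power_samples_w, interval_s):
    """Convert instantaneous power readings (watts) taken interval_s
    seconds apart into per-interval energy deltas (joules),
    averaging adjacent samples (trapezoidal rule)."""
    deltas = []
    for prev, cur in zip(power_samples_w, power_samples_w[1:]):
        deltas.append((prev + cur) / 2 * interval_s)
    return deltas

# 3 samples at 5 s spacing: 100 W, 300 W, 200 W
print(energy_deltas([100.0, 300.0, 200.0], 5.0))  # [1000.0, 1250.0]
```

Shorter sampling intervals make this approximation tighter, which is why `--interval 1` is suggested below for high-accuracy energy profiling.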
## Installation

Requires a working NVIDIA driver. Run `nvidia-smi` before you start to confirm the driver is installed and your GPUs are visible.
### Option A — pip (recommended)
Install the agent directly from PyPI. No cloning required.
```bash
pip install aluminatiai

# Verify GPU detection
aluminatiai --help
```

### Option B — install.sh (systemd auto-setup)
Installs as a systemd service, creates a dedicated system user, and writes your API key to `/etc/aluminatai/agent.env`.
```bash
pip install aluminatiai
curl -sSL https://aluminatiai.com/install.sh | bash
```

Before running the installer, check `nvidia-smi` — if it fails, install drivers first (`sudo apt install nvidia-driver-535` on Ubuntu).

Dependencies:
| Package | Version | Purpose |
|---|---|---|
| `pynvml` | 11.5.0 | NVML bindings for GPU metrics |
| `requests` | 2.32.3 | HTTP upload to the ingest API |
| `python-dotenv` | 1.0.1 | Load env vars from a `.env` file |
| `rich` | 13.9.4 | Live terminal table (optional) |
## CLI flags

Run the `aluminatiai` command after install. Only `ALUMINATAI_API_KEY` is required.
```bash
# Minimal — reads API key from env
ALUMINATAI_API_KEY=alum_xxx aluminatiai
```

| Flag | Default | Description |
|---|---|---|
| `--interval`, `-i` | 5.0 | Sampling interval in seconds. Lower = more accurate energy totals, slightly higher CPU overhead. Min: 0.1 s. |
| `--output`, `-o` | none | Write metrics to a CSV file in addition to uploading. Useful for local analysis or validation. |
| `--duration`, `-d` | ∞ | Stop after this many seconds. Omit to run indefinitely. |
| `--quiet`, `-q` | off | Suppress all console output. Recommended for systemd / background use. |
```bash
# 1-second sampling for high-accuracy energy profiling
aluminatiai --interval 1

# Run for 10 minutes, save CSV, no output
aluminatiai --duration 600 --output data/run.csv --quiet
```

## Environment variables
Set these in your shell, in a `.env` file in the repo root, or in your systemd environment file. Only `ALUMINATAI_API_KEY` is required.
| Variable | Default | Description |
|---|---|---|
| `ALUMINATAI_API_KEY` | — | **Required.** Your API key from Dashboard → Setup. Starts with `alum_`. |
| `ALUMINATAI_API_ENDPOINT` | `https://aluminatiai.com/v1/metrics/ingest` | Override the ingest endpoint. Useful for self-hosted deployments. |
| `SAMPLE_INTERVAL` | 5.0 | Seconds between NVML reads. Overridden by the `--interval` flag. |
| `UPLOAD_INTERVAL` | 60 | Seconds between batch uploads. |
| `UPLOAD_BATCH_SIZE` | 100 | Max metrics per upload request. |
| `UPLOAD_MAX_RETRIES` | 5 | Retry attempts before writing to the WAL. |
| `SCHEDULER_POLL_INTERVAL` | 30 | How often (seconds) to query the scheduler for job attribution. |
| `METRICS_PORT` | 9100 | Prometheus metrics server port. Set to 0 to disable. |
| `LOG_LEVEL` | INFO | `DEBUG`, `INFO`, `WARNING`, `ERROR`. |
| `DATA_DIR` | `./data` | Directory for WAL and local metric backups. |
| `ALUMINATAI_CA_BUNDLE` | system CA | Path to a custom CA PEM for corporate proxy environments. |
```bash
# .env file in repo root
ALUMINATAI_API_KEY=alum_YOUR_KEY_HERE
UPLOAD_INTERVAL=60
SAMPLE_INTERVAL=5.0
LOG_LEVEL=INFO
```

## Running persistently
For production monitoring, run the agent as a background service so it survives SSH disconnects and reboots.
### Option A — install.sh (easiest)
The installer handles everything: systemd unit, system user, environment file, and auto-start on boot.
```bash
pip install aluminatiai
curl -sSL https://aluminatiai.com/install.sh | bash

# Prompts for your API key, then:
#   sudo systemctl status aluminatai-agent
#   sudo journalctl -u aluminatai-agent -f
```

### Option B — tmux (quick test)
```bash
tmux new-session -d -s aluminatai \
  "ALUMINATAI_API_KEY=alum_YOUR_KEY_HERE aluminatiai --quiet"

# Reattach to check logs
tmux attach -t aluminatai
```

### Option C — systemd (manual setup)
Create `/etc/systemd/system/aluminatai-agent.service`:
```ini
[Unit]
Description=AluminatiAi GPU Energy Agent
After=network.target

[Service]
Type=simple
User=aluminatai
WorkingDirectory=/opt/aluminatai-agent
EnvironmentFile=/etc/aluminatai/agent.env
ExecStart=/usr/local/bin/aluminatiai
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```

```bash
# Write your API key to the env file
sudo mkdir -p /etc/aluminatai
echo "ALUMINATAI_API_KEY=alum_YOUR_KEY_HERE" | sudo tee /etc/aluminatai/agent.env
sudo chmod 600 /etc/aluminatai/agent.env

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable aluminatai-agent
sudo systemctl start aluminatai-agent

# Check status
sudo systemctl status aluminatai-agent
sudo journalctl -u aluminatai-agent -f
```

### Option D — Windows (PowerShell)
```powershell
# Install then run
pip install aluminatiai
$env:ALUMINATAI_API_KEY = "alum_YOUR_KEY_HERE"
aluminatiai --quiet

# To run as a background job (Task Scheduler):
#   Trigger: At startup
#   Program: aluminatiai
#   Arguments: --quiet
#   Set ALUMINATAI_API_KEY in the task's environment variables
```

## Reliability & WAL
The v0.2 agent is built for unattended production use. It handles network outages, API rate limits, and process restarts without losing metric data.
### Write-ahead log (WAL)

When an upload fails, metrics are written to `data/wal/metrics.wal` as newline-delimited JSON. On the next startup the agent replays the WAL before resuming normal collection. The WAL is cleared only after all entries have been successfully uploaded.
| Setting | Default | Effect |
|---|---|---|
| `WAL_MAX_AGE_HOURS` | 24 | Entries older than this are discarded on replay. |
| `WAL_MAX_MB` | 512 | If the WAL exceeds this size, the oldest half is dropped. |
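The append-and-replay cycle can be pictured with a few lines of Python. This is a simplified sketch of the mechanism, not the agent's implementation; `wal_append` and `wal_replay` are hypothetical names:

```python
import json
import os

WAL_PATH = "data/wal/metrics.wal"

def wal_append(metrics, path=WAL_PATH):
    """Persist failed-upload metrics as newline-delimited JSON."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a") as f:
        for m in metrics:
            f.write(json.dumps(m) + "\n")

def wal_replay(upload, path=WAL_PATH):
    """Replay buffered metrics on startup; clear the WAL only after
    every entry has been uploaded successfully."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        entries = [json.loads(line) for line in f if line.strip()]
    if entries and upload(entries):   # upload() returns True on success
        os.remove(path)               # safe: everything is on the server
        return len(entries)
    return 0
```

Because the file is only removed after a successful upload, a crash mid-replay leaves the entries in place for the next attempt.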
### Exponential backoff

Failed uploads are retried with exponential backoff: 1s → 2s → 4s → 8s → 16s, capped at 60s, with ±20% jitter. After `UPLOAD_MAX_RETRIES` (default 5) the batch is written to the WAL and the agent continues collecting new metrics.
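The delay schedule above works out to a one-liner. A sketch of the arithmetic, using the same 60 s cap and ±20% jitter (the function name is illustrative):

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0, jitter=0.2):
    """Delay in seconds before retry `attempt` (0-based):
    base * 2^attempt, capped at `cap`, with +/-20% jitter applied
    multiplicatively so retries from many agents don't synchronize."""
    delay = min(base * (2 ** attempt), cap)
    return delay * random.uniform(1 - jitter, 1 + jitter)

for attempt in range(7):
    nominal = min(1.0 * 2 ** attempt, 60.0)
    print(f"attempt {attempt}: ~{nominal:.0f}s nominal")
```

Jitter matters on multi-node clusters: without it, a fleet of agents that lost connectivity together would all retry at the same instant.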
| HTTP status | Action |
|---|---|
| 200 | Success — clear batch from buffer. |
| 429 | Respect Retry-After header, then retry with backoff. |
| 401 / 403 | Permanent auth failure — write to WAL immediately, log error. |
| 5xx | Transient error — retry with backoff up to max retries. |
| Timeout / ConnectionError | Retry with backoff. |
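The status table above maps naturally onto a small dispatch function. An illustrative sketch (not the agent's code; names are hypothetical):

```python
def classify_upload_result(status):
    """Map an HTTP status code to the agent's next action,
    following the retry table above."""
    if status == 200:
        return "clear_batch"            # success
    if status == 429:
        return "retry_after_header"     # honor Retry-After, then back off
    if status in (401, 403):
        return "write_wal"              # permanent auth failure
    if 500 <= status < 600:
        return "retry_backoff"          # transient server error
    # Timeouts / connection errors (no status) would also land here;
    # unlisted statuses are treated as transient in this sketch.
    return "retry_backoff"
```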
### Prometheus metrics

The agent exposes an HTTP metrics endpoint at `localhost:9100/metrics` (Prometheus format). Useful for integrating agent health into your existing monitoring stack. Set `METRICS_PORT=0` to disable.
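To scrape it from an existing Prometheus server, a minimal scrape job might look like this (illustrative; adjust the job name and target host to your environment, and change the port if you moved `METRICS_PORT` off the default):

```yaml
scrape_configs:
  - job_name: aluminatai-agent
    static_configs:
      - targets: ["localhost:9100"]   # METRICS_PORT default
```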
```bash
# Check agent health metrics
curl localhost:9100/metrics
```

## Scheduler integration
The agent can enrich metrics with job attribution by polling your workload scheduler. When a job is detected on a GPU, the agent tags every metric from that GPU with job_id, team_id, model_tag, and scheduler_source.
| Scheduler | scheduler_source | How it works |
|---|---|---|
| Kubernetes | kubernetes | Polls local kubelet API to map GPU device IDs to pod names and labels. |
| Slurm | slurm | Reads SLURM_JOB_ID and squeue output to identify running jobs. |
| Run:ai | runai | Queries Run:ai scheduler API for active workloads. |
| Manual | manual | Set JOB_ID, TEAM_ID, MODEL_TAG env vars before running the agent. |
```bash
# Manual attribution — wrap your training script
JOB_ID=llama3-finetune-v2 \
TEAM_ID=ml-infra \
MODEL_TAG=llama3-70b \
aluminatiai
```

## Troubleshooting
### "No NVIDIA GPUs found" or `nvidia-smi` fails
NVIDIA driver not installed or not working.
```bash
# Ubuntu — install driver
sudo apt install nvidia-driver-535
sudo reboot

# Verify after reboot
nvidia-smi
```

### "Failed to initialize NVML"
Permission issue or driver version mismatch.
```bash
# Check your groups
groups

# Add yourself to the video group
sudo usermod -a -G video $USER

# Log out and back in, then retry
```

### "Module 'pynvml' not found"
Dependencies not installed in the active Python environment.
```bash
pip install aluminatiai

# If using a venv, activate it first:
source venv/bin/activate && pip install aluminatiai
```

### Metrics not appearing in dashboard
The agent batches uploads every 60 seconds. Wait one minute then refresh. Check for upload errors in agent output or journal logs.
```bash
# Run with verbose logging
LOG_LEVEL=DEBUG aluminatiai

# Systemd logs
sudo journalctl -u aluminatai-agent -f

# Check WAL for buffered metrics
ls -lh data/wal/
```

### Windows: GPU not detected
Make sure you are running in native PowerShell or Command Prompt — not WSL. NVML is not accessible from inside WSL.
```powershell
# In PowerShell (not WSL):
pip install aluminatiai
$env:ALUMINATAI_API_KEY = "alum_YOUR_KEY_HERE"
aluminatiai
```