Agent Reference

GPU Energy Agent

A lightweight Python daemon that streams GPU telemetry from your machines to AluminatiAi. Runs with <0.1% CPU overhead and batches uploads every 60 seconds.

v0.2.0 · Python 3.9+ · Linux · Windows · NVIDIA GPU required

Overview

The agent uses NVIDIA's NVML library (via pynvml) to sample each GPU every 5 seconds. Metrics are buffered locally and flushed to the AluminatiAi ingest API every 60 seconds. If an upload fails, metrics are persisted to a write-ahead log (data/wal/metrics.wal) and automatically replayed on the next successful connection.
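The sample-then-flush cadence described above can be sketched as a small buffer that hands back a batch once the upload interval has elapsed. This is a minimal illustration with a hypothetical `MetricBuffer` class, not the agent's actual internals:

```python
import time

class MetricBuffer:
    """Buffers samples locally and releases a batch every `upload_interval` seconds."""

    def __init__(self, upload_interval=60.0, clock=time.monotonic):
        self.upload_interval = upload_interval
        self.clock = clock          # injectable clock, handy for testing
        self.buffer = []
        self.last_flush = clock()

    def add(self, sample):
        """Add one sample; return a batch to upload if it's time to flush, else None."""
        self.buffer.append(sample)
        if self.clock() - self.last_flush >= self.upload_interval:
            batch, self.buffer = self.buffer, []
            self.last_flush = self.clock()
            return batch
        return None
```

In the real agent the returned batch would go to the ingest API, with failures diverted to the WAL as described below.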

Collected metrics and features:

- Power draw: instantaneous watts per GPU
- Energy delta: joules consumed since last sample
- GPU utilization: compute + memory bandwidth %
- Temperature: junction temperature in °C
- Memory usage: used / total VRAM in MB
- Clock speeds: SM and memory clocks in MHz
- Job attribution: job_id, team_id, model_tag via scheduler
- WAL durability: local buffer survives network outages
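The energy delta follows directly from power and the sampling interval (1 W = 1 J/s). A simple sketch, assuming constant power over the interval and hypothetical helper names, not necessarily the agent's exact method:

```python
def energy_delta_joules(power_watts: float, interval_s: float) -> float:
    """Joules consumed since the last sample: watts * seconds (1 W = 1 J/s)."""
    return power_watts * interval_s

def joules_to_kwh(joules: float) -> float:
    """Convert joules to kilowatt-hours (1 kWh = 3.6e6 J)."""
    return joules / 3.6e6
```

For example, a GPU drawing 300 W sampled every 5 s accrues 1500 J per interval; shorter intervals reduce the error from the constant-power assumption, which is why lower `--interval` values give more accurate totals.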

Installation

A working NVIDIA driver is required. Run nvidia-smi before you start to confirm the driver is installed.

Option A — pip (recommended)

Install the agent directly from PyPI. No cloning required.

pip install aluminatiai

# Verify the CLI is installed (use nvidia-smi to confirm GPU detection)
aluminatiai --help

Option B — install.sh (systemd auto-setup)

Installs as a systemd service, creates a dedicated system user, and writes your API key to /etc/aluminatai/agent.env.

pip install aluminatiai
curl -sSL https://aluminatiai.com/install.sh | bash

NVIDIA driver 450.80.02 or newer is required. Run nvidia-smi; if it fails, install drivers first (sudo apt install nvidia-driver-535 on Ubuntu).

Dependencies:

| Package | Version | Purpose |
|---|---|---|
| pynvml | 11.5.0 | NVML bindings for GPU metrics |
| requests | 2.32.3 | HTTP upload to ingest API |
| python-dotenv | 1.0.1 | Load env vars from .env file |
| rich | 13.9.4 | Live terminal table (optional) |

CLI flags

Run with the aluminatiai command after install. Only ALUMINATAI_API_KEY is required.

# Minimal — reads API key from env
ALUMINATAI_API_KEY=alum_xxx aluminatiai

| Flag | Default | Description |
|---|---|---|
| --interval, -i | 5.0 | Sampling interval in seconds. Lower = more accurate energy totals, slightly higher CPU overhead. Min: 0.1s. |
| --output, -o | none | Write metrics to a CSV file in addition to uploading. Useful for local analysis or validation. |
| --duration, -d | none | Stop after this many seconds. Omit to run indefinitely. |
| --quiet, -q | off | Suppress all console output. Recommended for systemd / background use. |

# 1-second sampling for high-accuracy energy profiling
aluminatiai --interval 1

# Run for 10 minutes, save CSV, no output
aluminatiai --duration 600 --output data/run.csv --quiet

Environment variables

Set these in your shell, a .env file in the repo root, or your systemd environment file. Only ALUMINATAI_API_KEY is required.

| Variable | Default | Description |
|---|---|---|
| ALUMINATAI_API_KEY | (none) | Required. Your API key from Dashboard → Setup. Starts with alum_. |
| ALUMINATAI_API_ENDPOINT | https://aluminatiai.com/v1/metrics/ingest | Override the ingest endpoint. Useful for self-hosted deployments. |
| SAMPLE_INTERVAL | 5.0 | Seconds between NVML reads. Overridden by the --interval flag. |
| UPLOAD_INTERVAL | 60 | Seconds between batch uploads. |
| UPLOAD_BATCH_SIZE | 100 | Max metrics per upload request. |
| UPLOAD_MAX_RETRIES | 5 | Retry attempts before writing to the WAL. |
| SCHEDULER_POLL_INTERVAL | 30 | How often (seconds) to query the scheduler for job attribution. |
| METRICS_PORT | 9100 | Prometheus metrics server port. Set to 0 to disable. |
| LOG_LEVEL | INFO | DEBUG, INFO, WARNING, or ERROR. |
| DATA_DIR | ./data | Directory for the WAL and local metric backups. |
| ALUMINATAI_CA_BUNDLE | system CA | Path to a custom CA PEM for corporate proxy environments. |

# .env file in repo root
ALUMINATAI_API_KEY=alum_YOUR_KEY_HERE
UPLOAD_INTERVAL=60
SAMPLE_INTERVAL=5.0
LOG_LEVEL=INFO
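The precedence rule above (CLI flag wins over env var, env var over the built-in default) can be sketched with a small resolver. A hypothetical helper for illustration, not the agent's actual config code:

```python
import os

def resolve_sample_interval(cli_value=None, env=os.environ, default=5.0):
    """Resolve the sampling interval: --interval flag > SAMPLE_INTERVAL env var > default."""
    if cli_value is not None:
        return float(cli_value)
    if "SAMPLE_INTERVAL" in env:
        return float(env["SAMPLE_INTERVAL"])
    return default
```

Passing `env` explicitly keeps the function testable; in practice python-dotenv loads the .env file into `os.environ` before resolution.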

Running persistently

For production monitoring, run the agent as a background service so it survives SSH disconnects and reboots.

Option A — install.sh (easiest)

The installer handles everything: systemd unit, system user, environment file, and auto-start on boot.

pip install aluminatiai
curl -sSL https://aluminatiai.com/install.sh | bash
# Prompts for your API key, then:
# sudo systemctl status aluminatai-agent
# sudo journalctl -u aluminatai-agent -f

Option B — tmux (quick test)

tmux new-session -d -s aluminatai \
  "ALUMINATAI_API_KEY=alum_YOUR_KEY_HERE aluminatiai --quiet"

# Reattach to check logs
tmux attach -t aluminatai

Option C — systemd (manual setup)

Create /etc/systemd/system/aluminatai-agent.service:

[Unit]
Description=AluminatiAi GPU Energy Agent
After=network.target

[Service]
Type=simple
User=aluminatai
WorkingDirectory=/opt/aluminatai-agent
EnvironmentFile=/etc/aluminatai/agent.env
ExecStart=/usr/local/bin/aluminatiai
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target

# Write your API key to the env file
sudo mkdir -p /etc/aluminatai
echo "ALUMINATAI_API_KEY=alum_YOUR_KEY_HERE" | sudo tee /etc/aluminatai/agent.env
sudo chmod 600 /etc/aluminatai/agent.env

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable aluminatai-agent
sudo systemctl start aluminatai-agent

# Check status
sudo systemctl status aluminatai-agent
sudo journalctl -u aluminatai-agent -f

Option D — Windows (PowerShell)

Important: run in native PowerShell or Command Prompt, NOT WSL. WSL cannot access NVIDIA NVML directly.

# Install then run
pip install aluminatiai
$env:ALUMINATAI_API_KEY = "alum_YOUR_KEY_HERE"
aluminatiai --quiet

# To run as a background job (Task Scheduler):
# Trigger: At startup
# Program: aluminatiai
# Arguments: --quiet
# Set ALUMINATAI_API_KEY in the task's environment variables

Reliability & WAL

The v0.2 agent is built for unattended production use. It handles network outages, API rate limits, and process restarts without losing metric data.

Write-ahead log (WAL)

When an upload fails, metrics are written to data/wal/metrics.wal as newline-delimited JSON. On the next startup the agent replays the WAL before resuming normal collection. The WAL is cleared only after all entries have been successfully uploaded.
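Newline-delimited JSON makes the WAL append-only and trivially replayable. A minimal sketch of the append/replay pattern, using hypothetical `wal_append` / `wal_replay` names rather than the agent's actual functions:

```python
import json
import os

def wal_append(path, metrics):
    """Append failed-upload metrics to the WAL, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        for m in metrics:
            f.write(json.dumps(m) + "\n")

def wal_replay(path):
    """Read back all buffered entries; the caller clears the file only after
    every entry has been uploaded successfully."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because each entry is a complete line, a crash mid-write corrupts at most the final line, which the replay loop can skip.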

| Setting | Default | Effect |
|---|---|---|
| WAL_MAX_AGE_HOURS | 24 | Entries older than this are discarded on replay. |
| WAL_MAX_MB | 512 | If the WAL exceeds this size, the oldest half is dropped. |

Exponential backoff

Failed uploads are retried with exponential backoff: 1s → 2s → 4s → 8s → 16s, capped at 60s, with ±20% jitter. After UPLOAD_MAX_RETRIES (default 5) the batch is written to the WAL and the agent continues collecting new metrics.

| HTTP status | Action |
|---|---|
| 200 | Success: clear the batch from the buffer. |
| 429 | Respect the Retry-After header, then retry with backoff. |
| 401 / 403 | Permanent auth failure: write to the WAL immediately and log an error. |
| 5xx | Transient error: retry with backoff up to max retries. |
| Timeout / ConnectionError | Retry with backoff. |

Prometheus metrics

The agent exposes an HTTP metrics endpoint on localhost:9100/metrics (Prometheus format). Useful for integrating agent health into your existing monitoring stack. Set METRICS_PORT=0 to disable.

# Check agent health metrics
curl localhost:9100/metrics

Scheduler integration

The agent can enrich metrics with job attribution by polling your workload scheduler. When a job is detected on a GPU, the agent tags every metric from that GPU with job_id, team_id, model_tag, and scheduler_source.

| Scheduler | scheduler_source | How it works |
|---|---|---|
| Kubernetes | kubernetes | Polls the local kubelet API to map GPU device IDs to pod names and labels. |
| Slurm | slurm | Reads SLURM_JOB_ID and squeue output to identify running jobs. |
| Run:ai | runai | Queries the Run:ai scheduler API for active workloads. |
| Manual | manual | Set JOB_ID, TEAM_ID, MODEL_TAG env vars before running the agent. |
# Manual attribution — wrap your training script
JOB_ID=llama3-finetune-v2 \
TEAM_ID=ml-infra \
MODEL_TAG=llama3-70b \
aluminatiai
Scheduler integration is optional. Without it, metrics are attributed to your account but not to specific jobs. Energy manifests will still be generated when jobs complete.
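For the manual scheduler source, building the attribution tags amounts to reading three env vars. A sketch with a hypothetical `attribution_tags` helper, not the agent's actual code:

```python
import os

def attribution_tags(env=os.environ):
    """Build job-attribution tags for the manual scheduler source.
    Missing env vars are simply omitted from the tags."""
    tags = {"scheduler_source": "manual"}
    for key, tag in (("JOB_ID", "job_id"),
                     ("TEAM_ID", "team_id"),
                     ("MODEL_TAG", "model_tag")):
        if key in env:
            tags[tag] = env[key]
    return tags
```

Every metric sampled while these vars are set would carry the resulting tags, matching the wrapped-training-script example above.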

Troubleshooting

"No NVIDIA GPUs found" or nvidia-smi fails

NVIDIA driver not installed or not working.

# Ubuntu — install driver
sudo apt install nvidia-driver-535
sudo reboot

# Verify after reboot
nvidia-smi

"Failed to initialize NVML"

Permission issue or driver version mismatch.

# Check your groups
groups

# Add yourself to the video group
sudo usermod -a -G video $USER
# Log out and back in, then retry

"Module 'pynvml' not found"

Dependencies not installed in the active Python environment.

pip install aluminatiai

# If using a venv, activate it first:
source venv/bin/activate && pip install aluminatiai

Metrics not appearing in dashboard

The agent batches uploads every 60 seconds. Wait one minute then refresh. Check for upload errors in agent output or journal logs.

# Run with verbose logging
LOG_LEVEL=DEBUG aluminatiai

# Systemd logs
sudo journalctl -u aluminatai-agent -f

# Check WAL for buffered metrics
ls -lh data/wal/

Windows: GPU not detected

Make sure you are running in native PowerShell or Command Prompt — not WSL. NVML is not accessible from inside WSL.

# In PowerShell (not WSL):
pip install aluminatiai
$env:ALUMINATAI_API_KEY = "alum_YOUR_KEY_HERE"
aluminatiai
AluminatiAi Agent · v0.2.0