# GPU Energy Agent
A lightweight Python daemon that streams GPU telemetry from your machines to AluminatiAi. Runs with <0.1% CPU overhead and batches uploads every 60 seconds.
## Overview

The agent uses NVIDIA's NVML library (via `pynvml`) to sample each GPU every 5 seconds. Metrics are buffered locally and flushed to the AluminatiAi ingest API every 60 seconds. If an upload fails, metrics are persisted to a write-ahead log (`data/wal/metrics.wal`) and automatically replayed on the next successful connection.
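The sample → buffer → flush cycle described above can be sketched in a few lines. This is an illustration of the loop structure, not the agent's actual source; `read_gpu_sample` and `upload` are hypothetical stand-ins for the real NVML read and the ingest API call:

```python
import time

def run_agent(read_gpu_sample, upload, sample_interval=5.0,
              upload_interval=60.0, duration=None):
    """Sample every sample_interval seconds; flush the buffer every
    upload_interval seconds. Returns metrics still buffered at shutdown."""
    buffer = []
    start = last_upload = time.monotonic()
    while duration is None or time.monotonic() - start < duration:
        buffer.append(read_gpu_sample())      # one NVML read per tick
        now = time.monotonic()
        if now - last_upload >= upload_interval:
            if upload(list(buffer)):          # upload() returns True on success
                buffer.clear()                # drop data only once uploaded
            last_upload = now
        time.sleep(sample_interval)
    return buffer
```

The key property is that the buffer is cleared only after a successful upload, which is what makes the WAL handoff (described below) lossless.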
| Feature | Description |
|---|---|
| Power draw | Instantaneous watts per GPU |
| Energy delta | Joules consumed since the last sample |
| GPU utilization | Compute + memory bandwidth % |
| Temperature | Junction temperature in °C |
| Memory usage | Used / total VRAM in MB |
| Clock speeds | SM and memory clocks in MHz |
| Job attribution | `job_id`, `team_id`, `model_tag` via scheduler |
| WAL durability | Local buffer survives network outages |
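The energy delta is power integrated over time: watts × seconds = joules. A minimal sketch of that accumulation, using the trapezoidal rule between consecutive samples (illustrative only; recent NVIDIA architectures also expose a cumulative energy counter directly through NVML):

```python
def energy_deltas(power_samples_w, interval_s):
    """Convert instantaneous power readings (watts) taken interval_s
    seconds apart into per-interval energy deltas (joules),
    averaging adjacent samples (trapezoidal rule)."""
    deltas = []
    for prev, cur in zip(power_samples_w, power_samples_w[1:]):
        deltas.append((prev + cur) / 2 * interval_s)
    return deltas

# 3 samples at 5 s spacing: 100 W, 300 W, 200 W
print(energy_deltas([100.0, 300.0, 200.0], 5.0))  # [1000.0, 1250.0]
```

Shorter sampling intervals make this approximation tighter, which is why `--interval 1` is suggested below for high-accuracy energy profiling.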
## Installation

Requires a working NVIDIA driver. Run `nvidia-smi` before you start to confirm the driver is installed and your GPUs are visible.
### Option A — pip (recommended)
Install the agent directly from PyPI. No cloning required.
```bash
pip install aluminatiai

# Verify GPU detection
aluminatiai --help
```

### Option B — install.sh (systemd auto-setup)
Installs as a systemd service, creates a dedicated system user, and writes your API key to `/etc/aluminatai/agent.env`.
```bash
pip install aluminatiai
curl -sSL https://aluminatiai.com/install.sh | bash
```

Before running the installer, check `nvidia-smi` — if it fails, install drivers first (`sudo apt install nvidia-driver-535` on Ubuntu).

Dependencies:
| Package | Version | Purpose |
|---|---|---|
| `pynvml` | 11.5.0 | NVML bindings for GPU metrics |
| `requests` | 2.32.3 | HTTP upload to the ingest API |
| `python-dotenv` | 1.0.1 | Load env vars from a `.env` file |
| `rich` | 13.9.4 | Live terminal table (optional) |
## CLI flags

Run the `aluminatiai` command after install. Only `ALUMINATAI_API_KEY` is required.
```bash
# Minimal — reads API key from env
ALUMINATAI_API_KEY=alum_xxx aluminatiai
```

| Flag | Default | Description |
|---|---|---|
| `--interval`, `-i` | 5.0 | Sampling interval in seconds. Lower = more accurate energy totals, slightly higher CPU overhead. Min: 0.1 s. |
| `--output`, `-o` | none | Write metrics to a CSV file in addition to uploading. Useful for local analysis or validation. |
| `--duration`, `-d` | ∞ | Stop after this many seconds. Omit to run indefinitely. |
| `--quiet`, `-q` | off | Suppress all console output. Recommended for systemd / background use. |
```bash
# 1-second sampling for high-accuracy energy profiling
aluminatiai --interval 1

# Run for 10 minutes, save CSV, no output
aluminatiai --duration 600 --output data/run.csv --quiet
```

## Environment variables
Set these in your shell, in a `.env` file in the repo root, or in your systemd environment file. Only `ALUMINATAI_API_KEY` is required.
| Variable | Default | Description |
|---|---|---|
| `ALUMINATAI_API_KEY` | — | **Required.** Your API key from Dashboard → Setup. Starts with `alum_`. |
| `ALUMINATAI_API_ENDPOINT` | `https://aluminatiai.com/v1/metrics/ingest` | Override the ingest endpoint. Useful for self-hosted deployments. |
| `SAMPLE_INTERVAL` | 5.0 | Seconds between NVML reads. Overridden by the `--interval` flag. |
| `UPLOAD_INTERVAL` | 60 | Seconds between batch uploads. |
| `UPLOAD_BATCH_SIZE` | 100 | Max metrics per upload request. |
| `UPLOAD_MAX_RETRIES` | 5 | Retry attempts before writing to the WAL. |
| `SCHEDULER_POLL_INTERVAL` | 30 | How often (seconds) to query the scheduler for job attribution. |
| `METRICS_PORT` | 9100 | Prometheus metrics server port. Set to 0 to disable. |
| `LOG_LEVEL` | INFO | `DEBUG`, `INFO`, `WARNING`, `ERROR`. |
| `DATA_DIR` | `./data` | Directory for WAL and local metric backups. |
| `ALUMINATAI_CA_BUNDLE` | system CA | Path to a custom CA PEM for corporate proxy environments. |
```bash
# .env file in repo root
ALUMINATAI_API_KEY=alum_YOUR_KEY_HERE
UPLOAD_INTERVAL=60
SAMPLE_INTERVAL=5.0
LOG_LEVEL=INFO
```

## Running persistently
For production monitoring, run the agent as a background service so it survives SSH disconnects and reboots.
### Option A — install.sh (easiest)
The installer handles everything: systemd unit, system user, environment file, and auto-start on boot.
```bash
pip install aluminatiai
curl -sSL https://aluminatiai.com/install.sh | bash

# Prompts for your API key, then:
#   sudo systemctl status aluminatai-agent
#   sudo journalctl -u aluminatai-agent -f
```

### Option B — tmux (quick test)
```bash
tmux new-session -d -s aluminatai \
  "ALUMINATAI_API_KEY=alum_YOUR_KEY_HERE aluminatiai --quiet"

# Reattach to check logs
tmux attach -t aluminatai
```

### Option C — systemd (manual setup)
Create `/etc/systemd/system/aluminatai-agent.service`:
```ini
[Unit]
Description=AluminatiAi GPU Energy Agent
After=network.target

[Service]
Type=simple
User=aluminatai
WorkingDirectory=/opt/aluminatai-agent
EnvironmentFile=/etc/aluminatai/agent.env
ExecStart=/usr/local/bin/aluminatiai
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```

```bash
# Write your API key to the env file
sudo mkdir -p /etc/aluminatai
echo "ALUMINATAI_API_KEY=alum_YOUR_KEY_HERE" | sudo tee /etc/aluminatai/agent.env
sudo chmod 600 /etc/aluminatai/agent.env

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable aluminatai-agent
sudo systemctl start aluminatai-agent

# Check status
sudo systemctl status aluminatai-agent
sudo journalctl -u aluminatai-agent -f
```

### Option D — Windows (PowerShell)
```powershell
# Install then run
pip install aluminatiai
$env:ALUMINATAI_API_KEY = "alum_YOUR_KEY_HERE"
aluminatiai --quiet

# To run as a background job (Task Scheduler):
#   Trigger: At startup
#   Program: aluminatiai
#   Arguments: --quiet
#   Set ALUMINATAI_API_KEY in the task's environment variables
```

## Reliability & WAL
The v0.2 agent is built for unattended production use. It handles network outages, API rate limits, and process restarts without losing metric data.
### Write-ahead log (WAL)

When an upload fails, metrics are written to `data/wal/metrics.wal` as newline-delimited JSON. On the next startup the agent replays the WAL before resuming normal collection. The WAL is cleared only after all entries have been successfully uploaded.
| Setting | Default | Effect |
|---|---|---|
| `WAL_MAX_AGE_HOURS` | 24 | Entries older than this are discarded on replay. |
| `WAL_MAX_MB` | 512 | If the WAL exceeds this size, the oldest half is dropped. |
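The append-and-replay cycle can be pictured with a few lines of Python. This is a simplified sketch of the mechanism, not the agent's implementation; `wal_append` and `wal_replay` are hypothetical names:

```python
import json
import os

WAL_PATH = "data/wal/metrics.wal"

def wal_append(metrics, path=WAL_PATH):
    """Persist failed-upload metrics as newline-delimited JSON."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a") as f:
        for m in metrics:
            f.write(json.dumps(m) + "\n")

def wal_replay(upload, path=WAL_PATH):
    """Replay buffered metrics on startup; clear the WAL only after
    every entry has been uploaded successfully."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        entries = [json.loads(line) for line in f if line.strip()]
    if entries and upload(entries):   # upload() returns True on success
        os.remove(path)               # safe: everything is on the server
        return len(entries)
    return 0
```

Because the file is only removed after a successful upload, a crash mid-replay leaves the entries in place for the next attempt.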
### Exponential backoff

Failed uploads are retried with exponential backoff: 1s → 2s → 4s → 8s → 16s, capped at 60s, with ±20% jitter. After `UPLOAD_MAX_RETRIES` (default 5) the batch is written to the WAL and the agent continues collecting new metrics.
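The delay schedule above works out to a one-liner. A sketch of the arithmetic, using the same 60 s cap and ±20% jitter (the function name is illustrative):

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0, jitter=0.2):
    """Delay in seconds before retry `attempt` (0-based):
    base * 2^attempt, capped at `cap`, with +/-20% jitter applied
    multiplicatively so retries from many agents don't synchronize."""
    delay = min(base * (2 ** attempt), cap)
    return delay * random.uniform(1 - jitter, 1 + jitter)

for attempt in range(7):
    nominal = min(1.0 * 2 ** attempt, 60.0)
    print(f"attempt {attempt}: ~{nominal:.0f}s nominal")
```

Jitter matters on multi-node clusters: without it, a fleet of agents that lost connectivity together would all retry at the same instant.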
| HTTP status | Action |
|---|---|
| 200 | Success — clear batch from buffer. |
| 429 | Respect Retry-After header, then retry with backoff. |
| 401 / 403 | Permanent auth failure — write to WAL immediately, log error. |
| 5xx | Transient error — retry with backoff up to max retries. |
| Timeout / ConnectionError | Retry with backoff. |
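The status table above maps naturally onto a small dispatch function. An illustrative sketch (not the agent's code; names are hypothetical):

```python
def classify_upload_result(status):
    """Map an HTTP status code to the agent's next action,
    following the retry table above."""
    if status == 200:
        return "clear_batch"            # success
    if status == 429:
        return "retry_after_header"     # honor Retry-After, then back off
    if status in (401, 403):
        return "write_wal"              # permanent auth failure
    if 500 <= status < 600:
        return "retry_backoff"          # transient server error
    # Timeouts / connection errors (no status) would also land here;
    # unlisted statuses are treated as transient in this sketch.
    return "retry_backoff"
```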
### Prometheus metrics

The agent exposes an HTTP metrics endpoint at `localhost:9100/metrics` (Prometheus format). Useful for integrating agent health into your existing monitoring stack. Set `METRICS_PORT=0` to disable.
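To scrape it from an existing Prometheus server, a minimal scrape job might look like this (illustrative; adjust the job name and target host to your environment, and change the port if you moved `METRICS_PORT` off the default):

```yaml
scrape_configs:
  - job_name: aluminatai-agent
    static_configs:
      - targets: ["localhost:9100"]   # METRICS_PORT default
```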
```bash
# Check agent health metrics
curl localhost:9100/metrics
```

## Scheduler integration
The agent can enrich metrics with job attribution by polling your workload scheduler. When a job is detected on a GPU, the agent tags every metric from that GPU with job_id, team_id, model_tag, and scheduler_source.
| Scheduler | scheduler_source | How it works |
|---|---|---|
| Kubernetes | kubernetes | Polls local kubelet API to map GPU device IDs to pod names and labels. |
| Slurm | slurm | Reads SLURM_JOB_ID and squeue output to identify running jobs. |
| Run:ai | runai | Queries Run:ai scheduler API for active workloads. |
| Manual | manual | Set JOB_ID, TEAM_ID, MODEL_TAG env vars before running the agent. |
```bash
# Manual attribution — wrap your training script
JOB_ID=llama3-finetune-v2 \
TEAM_ID=ml-infra \
MODEL_TAG=llama3-70b \
aluminatiai
```

## Troubleshooting
### "No NVIDIA GPUs found" or `nvidia-smi` fails
NVIDIA driver not installed or not working.
```bash
# Ubuntu — install driver
sudo apt install nvidia-driver-535
sudo reboot

# Verify after reboot
nvidia-smi
```

### "Failed to initialize NVML"
Permission issue or driver version mismatch.
```bash
# Check your groups
groups

# Add yourself to the video group
sudo usermod -a -G video $USER

# Log out and back in, then retry
```

### "Module 'pynvml' not found"
Dependencies not installed in the active Python environment.
```bash
pip install aluminatiai

# If using a venv, activate it first:
source venv/bin/activate && pip install aluminatiai
```

### Metrics not appearing in dashboard
The agent batches uploads every 60 seconds. Wait one minute then refresh. Check for upload errors in agent output or journal logs.
```bash
# Run with verbose logging
LOG_LEVEL=DEBUG aluminatiai

# Systemd logs
sudo journalctl -u aluminatai-agent -f

# Check WAL for buffered metrics
ls -lh data/wal/
```

### Windows: GPU not detected
Make sure you are running in native PowerShell or Command Prompt — not WSL. NVML is not accessible from inside WSL.
```powershell
# In PowerShell (not WSL):
pip install aluminatiai
$env:ALUMINATAI_API_KEY = "alum_YOUR_KEY_HERE"
aluminatiai
```