Cubby operator guide
QUICKSTART.md gets you to a signed change against a lab device. This document is for the operator driving the system on something more than a laptop lab — a dev-net a vendor lent you, a pre-production cell, or a small pilot of your own network.
Read this before you invite other humans to drive the system.
Environment-variable matrix
Every var is read in exactly one place (packages/common/runtime_config.py), cached as a singleton, and is discoverable via cubby config show. The ones that materially change behaviour:
| Variable | Default | What changes when you set it |
|---|---|---|
| `NETOPS_ENV` | `development` | `production` flips the plugin registry to strict mode: simulated adapters are rejected at register time, so the default `build_demo_harness` will refuse to boot. You must wire real adapters yourself before setting this. |
| `NETOPS_API_AUTH_MODE` | `dev` | `dev` prints a random token on start-up. `hmac` validates bearer tokens signed with `NETOPS_API_HMAC_SECRET`. `oidc` validates JWTs against the IdP. `NETOPS_ENV=production` combined with `dev` is refused at boot. |
| `NETOPS_API_HMAC_SECRET` | (empty) | HMAC signing secret for API bearer tokens. 32+ bytes. Required when `NETOPS_API_AUTH_MODE=hmac`. |
| `NETOPS_OIDC_ISSUER` / `NETOPS_OIDC_AUDIENCE` / `NETOPS_OIDC_JWKS_URL` | (empty) | OIDC validator config. The JWKS URL must be HTTPS; non-HTTPS URLs are refused at refresh time. JWKS is cached 15 minutes by default. |
| `ANTHROPIC_API_KEY` | (empty) | Selects `ClaudeAgentRuntime` over any OpenAI option. Preferred when multiple are set. |
| `NETOPS_ANTHROPIC_MODEL` | `claude-opus-4-7` | Override the Claude model id. |
| `OPENAI_API_KEY` | (empty) | Selects `OpenAIAgentRuntime` with a static API key. Honours `OPENAI_BASE_URL` for Azure / vLLM / Ollama / etc. |
| `NETOPS_CODEX_CREDENTIAL_PATH` | (empty) | Path to a Codex CLI `auth.json`. Bills against a ChatGPT subscription via OAuth refresh; requires `NETOPS_CODEX_TOKEN_URL` for the refresh endpoint. |
| `NETOPS_EVIDENCE_HMAC_SECRET` | (empty) | Production evidence-signing key. When unset, a deterministic dev key is written under `var/keys/`. Set `NETOPS_EVIDENCE_REQUIRE_CONFIGURED_KEY=1` to refuse the dev fallback. |
| `NETOPS_APPROVAL_HMAC_SECRET` | (empty) | Approval-signing key (distinct from evidence). Same dev-key fallback policy as above. |
| `NETOPS_EVIDENCE_LEGACY_KEY_IDS` | (empty) | Comma-separated list of `key_id`s the verifier tolerates without cryptographic check. Use only for unrecoverable key-loss scenarios. |
| `NETOPS_API_MAX_BODY_BYTES` | `65536` | HTTP request body cap in bytes. Rejects both `Content-Length` and chunked requests that exceed the cap. |
| `NETOPS_WIKI_ROOT` | `<repo>/docs` | Root of the hand-curated knowledge base the agents read. |
| `NETOPS_CAB_ACKNOWLEDGE_SHARED_SECRET` | (empty) | Set to `1` to acknowledge a shared-secret CAB quorum and either (a) stop the stderr boot banner in non-production envs or (b) allow `NETOPS_ENV=production` boot despite the shared-secret limitation. Without this, production boot with a multi-member CAB + single HMAC signer fails fast. |
cubby config show renders this matrix against the current process environment so you can see what's resolved vs what's falling back to defaults.
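As a rough illustration of how the rules in the matrix interact, the sketch below runs a preflight check over an environment mapping. It is not Cubby's boot code: `check_boot_env` is a hypothetical helper, the requirement that all three OIDC variables are set is an assumption, and the guide states that the HTTPS check on the JWKS URL actually happens at refresh time rather than at boot.

```python
from collections.abc import Mapping
from urllib.parse import urlparse


def check_boot_env(env: Mapping[str, str]) -> list[str]:
    """Return a list of problems; an empty list means no rule from the matrix is violated."""
    problems: list[str] = []
    netops_env = env.get("NETOPS_ENV", "development")
    auth_mode = env.get("NETOPS_API_AUTH_MODE", "dev")

    # production + dev auth is refused at boot.
    if netops_env == "production" and auth_mode == "dev":
        problems.append("NETOPS_ENV=production with NETOPS_API_AUTH_MODE=dev is refused at boot")

    # hmac mode needs a signing secret of 32+ bytes.
    if auth_mode == "hmac":
        secret = env.get("NETOPS_API_HMAC_SECRET", "")
        if len(secret.encode("utf-8")) < 32:
            problems.append("NETOPS_API_HMAC_SECRET must be at least 32 bytes in hmac mode")

    # oidc mode: assumed here that issuer, audience and JWKS URL are all required.
    if auth_mode == "oidc":
        for name in ("NETOPS_OIDC_ISSUER", "NETOPS_OIDC_AUDIENCE", "NETOPS_OIDC_JWKS_URL"):
            if not env.get(name):
                problems.append(f"{name} is required in oidc mode")
        jwks_url = env.get("NETOPS_OIDC_JWKS_URL", "")
        if jwks_url and urlparse(jwks_url).scheme != "https":
            problems.append("NETOPS_OIDC_JWKS_URL must be HTTPS")

    return problems
```

Calling `check_boot_env(os.environ)` in a deployment script before starting the service would surface the same combinations early; `cubby config show` remains the authoritative view of what was resolved.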
Demo vs production posture
Two failure modes the platform enforces at boot when NETOPS_ENV=production:
- No simulated adapters. The plugin registry refuses to register any plugin with `simulated=True`, so `build_demo_harness()` fails fast with `SimulationLeakError` on the first simulated device adapter. You must wire real vendor adapters (`plugins/device/*/real_adapter.py`) and/or custom adapters before the harness will construct.
- No dev auth. `NETOPS_API_AUTH_MODE=dev` is refused; you must set `hmac` or `oidc` and supply the matching secret/issuer config.
Both are intentional: it's much safer for the system to refuse to start than to silently boot a prod-tagged deployment on demo adapters or a printed dev token.
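The strict-mode behaviour can be pictured with a small stand-in registry. Everything below is a simplified sketch: `Plugin` and `PluginRegistry` are hypothetical shapes, and `SimulationLeakError` is redefined locally only so the example runs on its own; the real classes live in Cubby.

```python
from dataclasses import dataclass, field


class SimulationLeakError(RuntimeError):
    """Local stand-in for the error named in this guide."""


@dataclass
class Plugin:
    name: str
    simulated: bool = False


@dataclass
class PluginRegistry:
    # strict=True models what NETOPS_ENV=production switches on.
    strict: bool = False
    plugins: dict[str, Plugin] = field(default_factory=dict)

    def register(self, plugin: Plugin) -> None:
        # Reject at register time, so a half-simulated harness never finishes constructing.
        if self.strict and plugin.simulated:
            raise SimulationLeakError(
                f"simulated plugin {plugin.name!r} rejected in strict mode"
            )
        self.plugins[plugin.name] = plugin
```

Rejecting at registration rather than at first use is what makes the failure visible at boot instead of in the middle of a change.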
Wiring real device adapters
Real adapters exist today for:
- Cisco IOS-XE (`plugins/device/cisco_iosxe/real_adapter.py`)
- Cisco NX-OS (`plugins/device/cisco_nxos/real_adapter.py`)
- Arista EOS (`plugins/device/arista_eos/real_adapter.py`)
- Junos (`plugins/device/junos/real_adapter.py`)
- PAN-OS (`plugins/device/panos/real_adapter.py`)
- Fortinet (`plugins/device/fortinet/real_adapter.py`)
- Nokia SR Linux (`plugins/device/nokia_srl/real_adapter.py`)
To use them in production, construct a harness that registers the real classes in place of the simulated defaults. The quickest path is a thin wrapper on build_demo_harness(..., allow_simulated=False) that replaces the registry's device_adapters dict. A reference implementation lives at tests/devicelab/harness.py:build_lab_harness — it routes every call through real adapters via LabDeviceRouter.
If your vendor isn't in the list above, you can either:
- Build a plugin that inherits from `VendorRealAdapterBase` (`plugins/device/_common/real_adapter_base.py`) and implement `_build_change_commands`, `precheck`, `execute`, `verify`;
- Or use the generic `ssh_exec` transport (`packages/transport/ssh_exec.py`) with a per-vendor command-wrapper and let Cubby drive it as a CLI over SSH.
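To give a feel for the first option, here is a sketch of a custom adapter. The four method names come from this guide, but their signatures, the local `VendorRealAdapterBase` stand-in, the `change_vlan` helper, and the CLI syntax of the imaginary vendor are all assumptions; check the real base class before copying anything.

```python
from abc import ABC, abstractmethod
from typing import Callable


class VendorRealAdapterBase(ABC):
    """Hypothetical stand-in for plugins/device/_common/real_adapter_base.py."""

    @abstractmethod
    def _build_change_commands(self, port: str, vlan: int) -> list[str]: ...

    @abstractmethod
    def precheck(self, port: str) -> bool: ...

    @abstractmethod
    def execute(self, commands: list[str]) -> list[str]: ...

    @abstractmethod
    def verify(self, port: str, vlan: int) -> bool: ...


class ExampleVendorAdapter(VendorRealAdapterBase):
    def __init__(self, send: Callable[[str], str]) -> None:
        # send(line) -> device output; in practice this would wrap an SSH session.
        self._send = send

    def _build_change_commands(self, port: str, vlan: int) -> list[str]:
        # Invented CLI syntax for an imaginary vendor.
        return [f"interface {port}", f"switchport access vlan {vlan}"]

    def precheck(self, port: str) -> bool:
        return "unknown port" not in self._send(f"show port {port}")

    def execute(self, commands: list[str]) -> list[str]:
        return [self._send(command) for command in commands]

    def verify(self, port: str, vlan: int) -> bool:
        return self._send(f"show port {port}").strip().endswith(f"vlan {vlan}")

    def change_vlan(self, port: str, vlan: int) -> bool:
        """Illustrative driver: precheck, apply, then verify."""
        if not self.precheck(port):
            return False
        self.execute(self._build_change_commands(port, vlan))
        return self.verify(port, vlan)
```

Injecting `send` keeps the adapter testable against a fake device before it ever touches real hardware.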
CAB signing — from shared-secret to per-approver
The default bootstrap pairs a multi-member CAB (alice, bob, carol, …) with a single HMAC approval-signing key. That configuration works, but at boot the system logs a loud warning because anyone holding the HMAC secret can mint approvals under any approver name — quorum separation is nominal, not cryptographic.
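The weakness is easy to demonstrate. In the sketch below, the payload layout (`approver|plan_hash`) and both function names are assumptions made for illustration, not Cubby's wire format; the point is only that one shared key verifies approvals under every name.

```python
import hashlib
import hmac


def sign_approval(secret: bytes, approver: str, plan_hash: str) -> str:
    """HMAC-SHA256 over an assumed 'approver|plan_hash' payload."""
    message = f"{approver}|{plan_hash}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_approval(secret: bytes, approver: str, plan_hash: str, signature: str) -> bool:
    expected = sign_approval(secret, approver, plan_hash)
    return hmac.compare_digest(expected.encode(), signature.encode())


# One holder of the shared secret can produce a "quorum" alone.
shared = b"s" * 32
as_alice = sign_approval(shared, "alice", "plan-abc")
as_bob = sign_approval(shared, "bob", "plan-abc")
quorum_met = all([
    verify_approval(shared, "alice", "plan-abc", as_alice),
    verify_approval(shared, "bob", "plan-abc", as_bob),
])
```

Here `quorum_met` is true even though a single party produced both signatures, which is why the approver name carries no cryptographic weight under a shared key.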
To upgrade to real multi-party authorization:
- Generate per-approver Ed25519 keypairs. Each approver holds their private key on a YubiKey or equivalent.
- Build a `SignerKeyring` at bootstrap that loads every approver's public key under their `key_id` (the key_id is what ends up in `SignedApproval.signer_key_id`). Ed25519 signers implement the same `EvidenceSigner` interface as the HMAC signer, so the rest of the CAB code is unchanged.
- Configure `ApproverGroup.members` with the approver identities (`"alice"`, `"bob"`, …). The verifier checks both that the `signer_key_id` resolves to a known signer AND that the `approver` name is a member of the group.
- Remove or demote the shared HMAC signer. Keep it only as a legacy-verifier entry via `NETOPS_EVIDENCE_LEGACY_KEY_IDS` if you have historical bundles signed with it.
Until you do this, assume the CAB is "one person with the secret can do anything" and size your deployment's operator trust accordingly.
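The two verifier checks described in the upgrade steps (known signer, group membership) might look roughly like this. The sketch is deliberately crypto-agnostic: the keyring maps a `key_id` to any callable that validates a signature, so an Ed25519 public-key check can be plugged in. The `SignedApproval` fields beyond `approver` and `signer_key_id`, the `check_approval` helper, and the "approver not in group" reason string are assumptions.

```python
from dataclasses import dataclass
from typing import Callable

# (message, signature) -> True when the signature is valid for that key.
Verifier = Callable[[bytes, bytes], bool]


@dataclass(frozen=True)
class SignedApproval:
    approver: str
    signer_key_id: str
    plan_hash: str
    signature: bytes


def check_approval(
    approval: SignedApproval,
    keyring: dict[str, Verifier],
    members: set[str],
) -> list[str]:
    """Return rejection reasons; an empty list means the approval is accepted."""
    reasons: list[str] = []

    # Check 1: the key id must resolve to a known signer, and the signature must verify.
    verifier = keyring.get(approval.signer_key_id)
    if verifier is None:
        reasons.append("failed signer verification")
    elif not verifier(approval.plan_hash.encode(), approval.signature):
        reasons.append("signature invalid")

    # Check 2: the claimed approver must be a member of the approver group.
    if approval.approver not in members:
        reasons.append("approver not in group")

    return reasons
```

With per-approver keys, each keyring entry would wrap one approver's public key, so holding one private key no longer lets anyone sign under another approver's `key_id`.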
API auth — dev → HMAC → OIDC
Three modes, increasing production-readiness:
- `dev`: A single token is generated (or read from `NETOPS_API_DEV_TOKEN`) and all holders get `network-operator` + `auditor` roles. Local work only. Refused when `NETOPS_ENV=production`.
- `hmac`: Tokens are HMAC-SHA256 over `"<subject>|<roles>|<expiry>"`. Issue with `HmacTokenValidator.issue()`; the validator checks HMAC + expiry. Subject and roles are whatever you encoded; the system trusts them because the signature proves the issuer authorised them.
- `oidc`: Tokens are JWTs validated against a configured issuer + audience + JWKS URL. Roles come from a configurable claim. JWKS fetch is HTTPS only and cached 15 minutes.
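A minimal sketch of the `hmac` scheme follows, to make the trust model concrete. Only the signed payload shape (`<subject>|<roles>|<expiry>`) comes from this guide; the comma-separated role encoding, the hex MAC appended after a fourth `|`, and the function names are assumptions and will not match the output of `HmacTokenValidator.issue()`.

```python
import hashlib
import hmac
import time
from typing import Optional


def issue_token(secret: bytes, subject: str, roles: list[str],
                ttl_seconds: int, now: Optional[float] = None) -> str:
    """Build '<subject>|<roles>|<expiry>|<mac>' (assumed wire layout)."""
    expiry = int((time.time() if now is None else now) + ttl_seconds)
    payload = f"{subject}|{','.join(roles)}|{expiry}"
    mac = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{mac}"


def validate_token(secret: bytes, token: str,
                   now: Optional[float] = None) -> Optional[tuple[str, list[str]]]:
    """Return (subject, roles) when the MAC and expiry check out, else None."""
    try:
        subject, roles, expiry, mac = token.split("|")
        expires_at = int(expiry)
    except ValueError:
        return None  # wrong number of fields or non-numeric expiry

    payload = f"{subject}|{roles}|{expiry}"
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes avoids timing leaks.
    if not hmac.compare_digest(expected.encode(), mac.encode()):
        return None
    if (time.time() if now is None else now) >= expires_at:
        return None
    return subject, (roles.split(",") if roles else [])
```

Because the roles sit inside the signed payload, editing them invalidates the MAC; the flip side is that anyone holding the secret can issue a token with any roles, so the secret deserves the same custody as the approval keys.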
Role names the routes check today:
- `network-operator`: can call mutating routes (`/access-port/change-vlan`, `/runbooks/evaluate`, `/events/webhook`, …)
- `auditor`: read-only token; sees `/knowledge/similar` and authenticated `/readyz?detail=1` but is refused from mutating routes with 403
- Plugin-specific roles (`lead:security`, `cab:carol`, …) are CAB member identities, not API role gates
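The role gate amounts to a set intersection per route. The table below is a hypothetical sketch: the route paths and the 403 come from this guide, but `ROUTE_ROLES`, `authorize`, the 404 for unknown paths, and the assumption that `network-operator` may also read `/knowledge/similar` are illustrative only.

```python
# Roles allowed per route (assumed mapping, not Cubby's actual routing table).
ROUTE_ROLES: dict[str, set[str]] = {
    "/access-port/change-vlan": {"network-operator"},
    "/runbooks/evaluate": {"network-operator"},
    "/events/webhook": {"network-operator"},
    "/knowledge/similar": {"network-operator", "auditor"},
}


def authorize(path: str, token_roles: list[str]) -> int:
    """Return the HTTP status a role check would produce for this path."""
    allowed = ROUTE_ROLES.get(path)
    if allowed is None:
        return 404
    # Any overlap between the token's roles and the route's allowed roles is enough.
    return 200 if allowed & set(token_roles) else 403
```

Note that a CAB identity such as `cab:carol` grants nothing here: `authorize("/runbooks/evaluate", ["cab:carol"])` yields 403, matching the point that plugin-specific roles are not API gates.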
Secrets custody — what's dev-generated and what must be rotated
Everything under `var/keys/` is dev-generated and persisted on disk between runs. On a first prod deployment, rotate all of them:
| File | Role | Rotation path |
|---|---|---|
| `var/keys/dev_evidence_hmac.key` | Signs evidence bundles | Set `NETOPS_EVIDENCE_HMAC_SECRET` (inline) or `NETOPS_EVIDENCE_HMAC_KEY_PATH` (file). Set `NETOPS_EVIDENCE_REQUIRE_CONFIGURED_KEY=1` to refuse fallback. |
| `var/keys/dev_approval_hmac.key` | Signs CAB approvals | Same mechanism as evidence, with `NETOPS_APPROVAL_*` env vars. Ideally replaced with per-approver Ed25519 keys (see above). |
| `var/evidence/chain.tip` | Prev-hash pointer for the evidence chain | Not a secret; safe to check in, but do not delete after a prod deployment starts. Deleting breaks the chain; use `NETOPS_EVIDENCE_CHAIN_RESET_BUNDLE_IDS` only for known planned resets. |
The operator should also rotate:
- `NETOPS_API_HMAC_SECRET` (or OIDC config)
- `ANTHROPIC_API_KEY` / `OPENAI_API_KEY`: treat as secrets; pass via secret store, not `.env` files
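Why deleting `var/evidence/chain.tip` breaks the chain is easiest to see with a toy hash chain. The bundle layout, the genesis value, and the helper names below are assumptions for illustration; Cubby's real bundles are signed and carry more fields.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for an empty chain


def bundle_hash(bundle: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the bundle."""
    return hashlib.sha256(json.dumps(bundle, sort_keys=True).encode()).hexdigest()


def append_bundle(chain: list[dict], tip: str, payload: dict) -> str:
    """Append a bundle that points at the current tip; return the new tip."""
    bundle = {"prev_hash": tip, "payload": payload}
    chain.append(bundle)
    return bundle_hash(bundle)


def verify_chain(chain: list[dict], tip: str) -> bool:
    """Every bundle must point at its predecessor, and the last hash must equal the tip."""
    expected_prev = GENESIS
    for bundle in chain:
        if bundle["prev_hash"] != expected_prev:
            return False
        expected_prev = bundle_hash(bundle)
    return expected_prev == tip
```

If the tip file is lost and the next bundle is written against `GENESIS` instead of the previous bundle's hash, `verify_chain` fails at that bundle, which is the situation a planned reset has to declare explicitly.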
Test-user readiness checklist
Before letting another human operator drive the system against anything other than a lab they own:
- [ ] `NETOPS_ENV` unset or `development`, OR you've wired real adapters AND removed every simulated adapter.
- [ ] API auth is `hmac` or `oidc`. Dev auth is off.
- [ ] Evidence + approval HMAC secrets are set via env, not falling back to dev keys.
- [ ] CAB signer is per-approver Ed25519 (or you've told the operator "one secret = full approval authority").
- [ ] The operator has a bearer token scoped to the role they need; no shared `network-operator`+`auditor` token in a chat channel.
- [ ] `var/evidence/chain.tip` is on durable storage (not a container `/tmp`).
- [ ] A monitoring endpoint is polling `/livez` and `/readyz` so a broken bootstrap is visible.
- [ ] The operator has read `QUICKSTART.md` end-to-end and run `cubby smoke` against their own harness.
Where to go if something's wrong
- Something broke on a change Cubby executed: read `docs/ROLLBACK.md`. It covers self-rollback, stuck workflows, false-success cases, and evidence-chain recovery.
- Workflow failures: check `var/evidence/` for the bundle of the failing run; every stage is signed and captures the snapshot at that point.
- Agent failures: set `NETOPS_LOG_LEVEL=DEBUG` and inspect `SafetyGate` verdicts + `AgentContext.metadata`. Injection hits are logged at WARNING.
- CAB failures: the reasons array surfaces `plan hash mismatch` / `signature invalid` / `failed signer verification` (generic; detail is in the server log).
- Lab-only issues: see `tests/devicelab/README.md`; most SR Linux / EOS boot-timing issues are covered there.