CODI LLM Module

CODI augments deterministic rules with an optional local LLM that ranks candidates, generates rationales, and provides operator-friendly insights. This document walks through every stage of the LLM lifecycle: data collection, training, adapter packaging, runtime integration, and evaluation.

1. Design Principles

2. Component Overview

| Layer | Module | Description |
| --- | --- | --- |
| Data collection | `data/collect_github.py`, `data/label_smells.py`, `data/standardize.py`, `data/synth_pairs_from_rules.py`, `data/split_dataset.py` | Builds curated datasets of Dockerfiles and instruction pairs. |
| Training | `training/qwen15b_lora/` | QLoRA configuration, scripts, and Colab notebook. |
| Runtime | `core/llm.py`, `docker/runtime_complete.py`, `docker/scripts/verify_adapter.py` | Embeds the llama.cpp server and wires the CLI/API to local endpoints. |
| Evaluation | `eval/` package, `llm_metrics.json`, dashboard summaries | Measures adapter quality and logs telemetry. |

3. Data Pipeline

3.1 Collection (data/collect_github.py)

3.2 CMD Script Extraction (data/extract_cmd_scripts.py)

3.3 Labeling (data/label_smells.py)

3.4 Standardisation (data/standardize.py)

3.5 Pair Generation (data/synth_pairs_from_rules.py)

3.6 Splitting (data/split_dataset.py)

3.7 R2 Sync Utilities

3.8 Make Targets

```bash
make data-collect
make data-extract
make data-label
make data-prepare       # standardize -> pairs -> splits
make data-prepare-full  # full reprocess (no incremental caching)
make data-clean
```

4. Training Pipeline

4.1 Configuration

4.2 Hardware Requirements

4.3 Training Script

```bash
python training/qwen15b_lora/train.py \
  --config training/qwen15b_lora/config.yaml \
  --dataset data/splits/train.jsonl \
  --validation data/splits/val.jsonl
```

Features:

- Loads the base model (Qwen/Qwen2.5-Coder-1.5B-Instruct).
- Applies QLoRA via bitsandbytes.
- Supports --resume-from checkpoints.
- Logs to TensorBoard (training/qwen15b_lora/logs).
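The authoritative hyperparameters live in training/qwen15b_lora/config.yaml; purely as an illustration of the QLoRA setup the script applies, a minimal sketch using transformers, bitsandbytes, and peft might look like the following. The rank, alpha, dropout, and target modules shown here are placeholder assumptions, not the project's actual values.

```python
# Minimal QLoRA sketch: 4-bit base model + LoRA adapter (placeholder hyperparameters).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

BASE_MODEL = "Qwen/Qwen2.5-Coder-1.5B-Instruct"

# 4-bit NF4 quantisation via bitsandbytes (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",  # requires accelerate
)

# LoRA adapter on the attention projections; r/alpha/dropout are illustrative only.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity-check that only adapter weights train
```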

4.4 Colab Workflow

4.5 Outputs

5. Adapter Packaging

  1. Copy training outputs into repo under models/adapters/<adapter-id>/.
  2. Populate metadata.json with the required fields (a checksum helper is sketched after this list):

     ```json
     {
       "adapter_version": "qwen15b-lora-v0.1",
       "created_at": "2025-11-25T04:12:00Z",
       "base_model": "qwen2.5-coder-1.5b",
       "dataset": "data/splits/train.jsonl",
       "checksums": { "adapter_model.safetensors": "sha256:..." }
     }
     ```
  3. Run python docker/scripts/verify_adapter.py models/adapters/qwen15b-lora-v0.1 to ensure structure and checksums are valid.
  4. Document compatibility in patterns/rules.yml llm_assist section.
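The checksum entries from step 2 can be generated with a small helper. This is a minimal sketch that assumes the adapter directory ships .safetensors weight files alongside an existing metadata.json; docker/scripts/verify_adapter.py remains the authoritative validator.

```python
# Sketch: populate the "checksums" field of metadata.json for an adapter directory.
# The directory layout is an assumption; verify_adapter.py is the real validator.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file and return its digest in the "sha256:<hex>" form used above."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"

adapter_dir = Path("models/adapters/qwen15b-lora-v0.1")
meta_path = adapter_dir / "metadata.json"
metadata = json.loads(meta_path.read_text())

# Record a checksum for every weight file shipped with the adapter.
metadata["checksums"] = {
    p.name: sha256_of(p) for p in sorted(adapter_dir.glob("*.safetensors"))
}
meta_path.write_text(json.dumps(metadata, indent=2) + "\n")
```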

6. Runtime Integration

6.1 Module core/llm.py

6.2 Complete Container

6.3 CLI/API Hooks

6.4 Environment Variables

| Variable | Description |
| --- | --- |
| `LLM_ENABLED` | Enables or disables LLM features. |
| `LLM_ENDPOINT` | URL of the llama.cpp server (default `http://127.0.0.1:8081`). |
| `CODE_MODEL` | Base model ID recorded in telemetry. |
| `ADAPTER_PATH` | Filesystem path to the adapter directory. |
| `MODEL_MOUNT_PATH` | Base mount path for weights and adapters. |
| `LLAMA_CPP_THREADS` | Number of CPU threads for inference. |
| `LLM_TIMEOUT_SECONDS` | Optional override for HTTP timeouts. |
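As a rough illustration of how a client might consume these variables, the sketch below reads the enable flag, endpoint, and timeout and probes the local server before issuing requests. The /health route is an assumption about the llama.cpp server, the 30-second fallback timeout is a placeholder, and the real wiring lives in core/llm.py.

```python
# Sketch: consume the LLM environment variables and probe the local server.
# The /health route and the 30 s default timeout are assumptions for illustration.
import os
import urllib.request

LLM_ENABLED = os.environ.get("LLM_ENABLED", "false").lower() == "true"
LLM_ENDPOINT = os.environ.get("LLM_ENDPOINT", "http://127.0.0.1:8081")
TIMEOUT = float(os.environ.get("LLM_TIMEOUT_SECONDS", "30"))

def llm_available() -> bool:
    """Return True when LLM features are enabled and the server answers its probe."""
    if not LLM_ENABLED:
        return False
    try:
        with urllib.request.urlopen(f"{LLM_ENDPOINT}/health", timeout=TIMEOUT) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, timeouts, and refused connections
        return False

if __name__ == "__main__":
    print("LLM available:", llm_available())
```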

6.5 Health Monitoring

7. Evaluation & Promotion

7.1 Evaluation Suite (eval/)

7.2 Promotion Criteria

7.3 Telemetry

8. Troubleshooting

| Issue | Resolution |
| --- | --- |
| Adapter missing or unreadable | Verify the mount path, permissions, and `metadata.json`. |
| LLM requests timing out | Increase `LLM_TIMEOUT_SECONDS` and ensure the llama.cpp process has enough CPU threads. |
| Air-gap violation logs | Add internal hosts to `AIRGAP_ALLOWLIST`, or disable the check temporarily for testing. |
| Ranking outputs unexpected text | Validate the `LLMRankingService` response schema; adapters must not emit raw Dockerfile content. |
| Need a deterministic stub | Set `LLM_ENABLED=false` or run the CLI with `--skip-llm`. |

9. Extending the LLM Module

  1. Collect more data with make data-collect and the related data targets.
  2. Augment the rules to generate synthetic training pairs reflecting new optimisations.
  3. Train a new adapter with the updated dataset and config.
  4. Package the adapter into models/adapters/<id>/ with metadata and checksums.
  5. Validate it with verify_adapter.py and run the evaluation suite.
  6. Promote it by updating patterns/rules.yml and referencing the evaluation results.