=============== Audit Logger =============== Overview ======== WeightsLab includes a comprehensive audit logging system that tracks ALL user interactions from the UI through gRPC. This enables: - **Compliance tracking**: Document what actions were performed and by whom - **Debugging**: Understand the sequence of operations that led to a particular state - **Historical analysis**: Review the complete experiment history including before/after values - **Error investigation**: Identify when and why operations failed The audit logging system automatically logs all gRPC user interactions with detailed before/after values, immediately writing synchronous records to both JSON and CSV formats. Key Features ============ - **Automatic logging**: All gRPC handlers are automatically instrumented with audit logging - **Detailed tracking**: Before/after values show exactly what changed in each operation - **Dual format output**: Both JSON (for parsing) and CSV (for spreadsheet analysis) - **Thread-safe**: Concurrent operations are safely logged without data loss - **Immediate writes**: Events are written to disk immediately after logging (no data loss on process crash) - **Reverse chronological**: Newest events appear first in JSON for easy review - **ISO 8601 timestamps**: Microsecond precision for accurate sequencing Logged Actions ============== The audit logger tracks the following user actions across all gRPC handlers: **Model & Training Control** - ``hp_change``: Hyperparameter modifications (learning rate, batch size, etc.) - ``pause``: Training paused (from ExperimentCommand) - ``resume``: Training resumed (from ExperimentCommand) - ``mode_switch``: Mode changes (train/audit/evaluation) **Data Operations** - ``tag_add``: Add tags to samples (from EditDataSample) - ``tag_remove``: Remove tags from samples (from EditDataSample) - ``sample_discard``: Mark samples as discarded (from EditDataSample) - ``sample_restore``: Restore discarded samples (from EditDataSample) - ``query_execute``: Execute data queries (filters, analysis) from ApplyDataQuery **Checkpoint & Evaluation** - ``checkpoint_restore``: Restore model from checkpoint (from RestoreCheckpoint) - ``evaluation_start``: Begin evaluation on a dataset split (from TriggerEvaluation) **Annotations** - ``note_write``: Write or clear notes on plot points (from ExperimentCommand) See :doc:`grpc_functions` for details on all RPC methods. Details Captured ================ Each log entry includes: - **timestamp**: ISO 8601 format with microseconds (UTC) - **action_type**: Type of action performed - **status**: "success" or "failed" - **details**: Dictionary containing: - Before/after values for changes - Affected item count - Sample IDs for data operations - Configuration details - Any other context relevant to the action - **error**: Error message if status == "failed" File Locations ============== Audit logs are automatically stored in the experiment's ``root_log_dir`` directory: - **JSON format**: ``{root_log_dir}/audit_log.json`` - **CSV format**: ``{root_log_dir}/audit_log.csv`` Both files are created automatically on first use and appended to with each operation. JSON Format =========== The JSON file contains an array of event objects with full details: .. code-block:: json [ { "timestamp": "2026-05-27T14:30:00.123456Z", "action_type": "hp_change", "status": "success", "details": { "changed_params": { "learning_rate": 0.001, "batch_size": 32 } }, "error": null }, { "timestamp": "2026-05-27T14:30:05.456789Z", "action_type": "tag_add", "status": "success", "details": { "tag_name": "defect", "samples_affected": 5, "sample_ids": ["s1", "s2", "s3", "s4", "s5"], "origins": ["train", "train", "val", "val", "test"] }, "error": null }, { "timestamp": "2026-05-27T14:30:10.789012Z", "action_type": "query_execute", "status": "failed", "details": { "query_type": "natural_language", "query_text": "invalid syntax here" }, "error": "Invalid query syntax: unexpected token" } ] **Advantages:** - Complete structured data with nested details - Easy to parse with standard JSON tools - Preserves all context about each operation - Suitable for programmatic analysis CSV Format ========== The CSV file provides a flattened view suitable for spreadsheet analysis: .. code-block:: text timestamp,action_type,status,details,error 2026-05-27T14:30:00.123456Z,hp_change,success,"{""changed_params"": {""learning_rate"": 0.001, ""batch_size"": 32}}", 2026-05-27T14:30:05.456789Z,tag_add,success,"{""tag_name"": ""defect"", ""samples_affected"": 5, ""sample_ids"": [""s1"", ""s2""]}", 2026-05-27T14:30:10.789012Z,query_execute,failed,"{""query_type"": ""natural_language""}","Invalid query syntax: unexpected token" **Advantages:** - Open in Excel, Google Sheets, or any spreadsheet application - Details field contains escaped JSON for full context - Easy to filter and sort operations - Familiar format for non-technical users Configuration ============== Audit logging is automatically enabled when: 1. A checkpoint manager is initialized with a ``root_log_dir`` 2. The gRPC server starts 3. A user interaction triggers a gRPC handler Output Format Selection ----------------------- Control which format audit logs are written to using the ``AUDIT_LOG_FORMAT`` environment variable: .. code-block:: bash # JSON format (default) - full structured data with nested details export AUDIT_LOG_FORMAT=json # CSV format - flattened view for spreadsheet analysis export AUDIT_LOG_FORMAT=csv # Disable audit logging completely export AUDIT_LOG_FORMAT=none **Default Behavior:** - If not specified: ``AUDIT_LOG_FORMAT`` defaults to ``json`` - Only one format file is created per experiment (not both) - File is created in ``root_log_dir`` as either: - ``audit_log.json`` (for json format) - ``audit_log.csv`` (for csv format) - When set to ``none``, no audit logs are created or maintained **Valid Options:** - ``json`` - Full structured JSON with nested details (default) - ``csv`` - Flattened CSV view for spreadsheet analysis - ``none`` - Disable audit logging entirely (no files created) **Precedence:** 1. Explicit format parameter in code (highest priority) 2. Environment variable ``AUDIT_LOG_FORMAT`` 3. Default: ``json`` (lowest priority) **Use Cases for ``AUDIT_LOG_FORMAT=none``:** - Reduce disk I/O overhead in high-performance scenarios - Disable audit history for development/debugging sessions - Focus on other logging without audit pollution Directory Configuration ----------------------- The ``root_log_dir`` is typically determined by: - The ``checkpoint_manager`` configuration - Or set via environment variables/hyperparameters - Default: ``root_experiment`` directory Example: Using Audit Logs ========================== **Python API** Access audit logs programmatically after an experiment: .. code-block:: python import json from pathlib import Path # Load audit log audit_path = Path("root_log_dir") / "audit_log.json" with open(audit_path, 'r') as f: events = json.load(f) # Find all hyperparameter changes hp_changes = [e for e in events if e['action_type'] == 'hp_change'] for event in hp_changes: print(f"At {event['timestamp']}: {event['details']}") # Find failures failures = [e for e in events if e['status'] == 'failed'] for event in failures: print(f"FAILED {event['action_type']}: {event['error']}") # Get summary from weightslab.backend.audit_logger import AuditLogger logger = AuditLogger("root_log_dir") summary = logger.get_log_summary() print(f"Total events: {summary['total_events']}") print(f"By action type: {summary['by_action_type']}") print(f"By status: {summary['by_status']}") **Spreadsheet Analysis** (when using CSV format) 1. Open ``audit_log.csv`` in Excel or Google Sheets (requires ``AUDIT_LOG_FORMAT=csv``) 2. Use filters to find specific action types (Data → Filter) 3. Sort by timestamp to review operation sequence 4. Parse the details column as JSON for full context **Command Line** .. code-block:: bash # Count operations by type jq '.[] | .action_type' audit_log.json | sort | uniq -c # Find all failures jq '.[] | select(.status == "failed")' audit_log.json # Extract hyperparameter changes jq '.[] | select(.action_type == "hp_change") | .details' audit_log.json Real-World Scenarios ==================== **Scenario 1: Debugging Model Degradation** You notice your model accuracy dropped. Use the audit log to: 1. Find all ``hp_change`` events to see parameter adjustments 2. Identify when the degradation started by looking at timestamps 3. Cross-reference with evaluation metrics to find the problematic change 4. Review ``checkpoint_restore`` events to understand rollback attempts **Scenario 2: Data Quality Audit** You need to document data preparation for compliance: 1. Extract all ``tag_add``, ``tag_remove``, ``sample_discard`` events 2. Create a summary report showing what was excluded and why 3. Generate timestamps showing exactly when operations occurred 4. Export CSV to stakeholders for review **Scenario 3: Reproducing Experiments** You need to reproduce a previous experiment exactly: 1. Extract all ``hp_change`` events in chronological order 2. Note the final hyperparameter values 3. Review ``query_execute`` to understand data preparation (tags, filtering) 4. Reproduce using the same sequence of operations **Scenario 4: Investigating Failures** A model checkpoint restore failed: 1. Search for ``checkpoint_restore`` with ``status == "failed"`` 2. Review the error message in the ``error`` field 3. Check preceding ``pause`` operations 4. Verify checkpoint ID in the details Testing ======= The audit logger includes comprehensive unit tests covering: - Event creation and serialization - JSON and CSV file writing - Thread-safe concurrent logging - Error handling and edge cases - Complex nested data structures - Real-world usage scenarios Run tests with: .. code-block:: bash pytest weightslab/tests/backend/test_audit_logger.py -v **Test Coverage:** - 26 unit tests - Success and failure scenarios - Concurrent logging with 10+ threads - Special characters and Unicode handling - Edge cases (empty details, missing files, etc.) Troubleshooting =============== **Audit logs not being created** 1. Verify ``root_log_dir`` is set and writable 2. Check that checkpoint manager is initialized 3. Ensure gRPC handlers are being called 4. Check application logs for initialization errors **CSV details field is invalid JSON** This shouldn't happen, but if it does: 1. Check for special characters or newlines in details 2. Verify Python version supports JSON serialization 3. Report as a bug with the problematic event **Concurrent logging causes file conflicts** The audit logger uses file locking to prevent conflicts. If you see errors: 1. Check filesystem supports file locking (network drives may not) 2. Verify file permissions on ``root_log_dir`` 3. Check available disk space **Timestamps are not in chronological order** This can happen with: 1. System clock adjustments during experiment 2. High-frequency operations on slow filesystems 3. Microsecond precision limits of the system Solution: Sort by timestamp when analyzing. API Reference ============= .. code-block:: python from weightslab.backend.audit_logger import AuditLogger, AuditEvent # Create a logger logger = AuditLogger(root_log_dir="/path/to/logs", experiment_name="my_experiment") # Log a successful operation logger.log_event( action_type="hp_change", status="success", details={"learning_rate": 0.001, "batch_size": 32} ) # Log a failed operation logger.log_event( action_type="checkpoint_restore", status="failed", details={"checkpoint_id": "ckpt_001"}, error="Checkpoint file not found" ) # Get summary statistics summary = logger.get_log_summary() # Returns: { # 'total_events': 42, # 'by_action_type': {'hp_change': 5, 'tag_add': 3, ...}, # 'by_status': {'success': 40, 'failed': 2} # } **Parameters:** - ``action_type`` (str): Type of action (e.g., "hp_change", "tag_add") - ``status`` (str): "success" or "failed" - ``details`` (dict, optional): Operation context with before/after values - ``error`` (str, optional): Error message if status == "failed" **Output:** - ``audit_log.json``: Full event details in JSON format - ``audit_log.csv``: Flattened events for spreadsheet analysis - Thread-safe, append-only, no data loss on crash Best Practices ============== 1. **Regular Backups**: Regularly backup your ``root_log_dir`` for long-running experiments 2. **Analysis Scripts**: Create scripts to analyze audit logs for your specific workflows: .. code-block:: python def analyze_experiment(root_log_dir): import json path = Path(root_log_dir) / "audit_log.json" with open(path) as f: events = json.load(f) # Your analysis here return insights 3. **Integration**: Integrate audit logs with your experiment tracking system (MLflow, Weights & Biases, etc.) 4. **Compliance**: Use audit logs as evidence for compliance audits and regulatory requirements 5. **Documentation**: Include audit log summaries in experiment reports and publications See Also ======== - :doc:`grpc_functions`: All gRPC RPC handlers and their behavior - :doc:`/weights_studio`: Using the UI to trigger logged actions - :doc:`/configuration`: gRPC configuration options