gRPC Functions

Overview

WeightsLab uses gRPC (gRPC Remote Procedure Call) for communication between the backend training service and the frontend UI (Weights Studio). gRPC provides a high-performance, language-neutral remote procedure call framework built on HTTP/2, enabling real-time bidirectional communication.

Why gRPC?

  • Performance: Binary protocol with HTTP/2 multiplexing

  • Real-time: Server can push updates to clients

  • Language-neutral: Generated code available in multiple languages

  • Typed: Protocol buffer definitions ensure type safety

  • Efficient: Low overhead, suitable for frequent polling from the UI

Architecture

Weights Studio (Frontend)        WeightsLab Backend
==================              =================
- gRPC Client                    - gRPC Server
- UI triggers actions            - Experiment Service
- Receives updates               - Data Service
- Polls for metrics              - Model Service
                |
                +------ HTTP/2 gRPC Channel ------+
                                |
                ExperimentServiceServicer
                (routes to specific handlers)

The backend runs a gRPC server that listens on a configurable port (default: 50051) and exposes a single service, ExperimentService, with multiple RPC methods.

Connection

Server Configuration:

  • Host: Configurable via GRPC_BACKEND_HOST (default: “0.0.0.0”)

  • Port: Configurable via GRPC_BACKEND_PORT (default: 50051)

  • TLS: Optional mTLS support via GRPC_TLS_ENABLED

  • Auth: Optional bearer token auth via GRPC_AUTH_TOKEN or GRPC_AUTH_TOKENS

  • Max message size: 256 MB (configurable via GRPC_MAX_MESSAGE_BYTES)
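As a minimal sketch of how these settings could be collected on the backend, the snippet below reads the documented environment variables into one object. The variable names and defaults come from the list above. The `GrpcConfig` class, the `load_grpc_config` helper, the accepted truthy strings for GRPC_TLS_ENABLED, the comma-separated format for GRPC_AUTH_TOKENS, and the reading of 256 MB as 256 * 1024 * 1024 bytes are assumptions made for illustration, not WeightsLab's actual parsing logic.

```python
import os
from dataclasses import dataclass, field
from typing import List, Mapping, Optional

DEFAULT_MAX_MESSAGE_BYTES = 256 * 1024 * 1024  # assumes "256 MB" means MiB


@dataclass
class GrpcConfig:
    """Hypothetical container for the documented server settings."""
    host: str = "0.0.0.0"
    port: int = 50051
    tls_enabled: bool = False
    auth_tokens: List[str] = field(default_factory=list)
    max_message_bytes: int = DEFAULT_MAX_MESSAGE_BYTES


def load_grpc_config(env: Optional[Mapping[str, str]] = None) -> GrpcConfig:
    """Build a GrpcConfig from environment variables, falling back to defaults."""
    env = os.environ if env is None else env

    # Assumption: a single token and a comma-separated token list are merged.
    tokens: List[str] = []
    if env.get("GRPC_AUTH_TOKEN"):
        tokens.append(env["GRPC_AUTH_TOKEN"])
    tokens += [t.strip() for t in env.get("GRPC_AUTH_TOKENS", "").split(",") if t.strip()]

    # Assumption: these strings count as "enabled".
    tls = env.get("GRPC_TLS_ENABLED", "").strip().lower() in {"1", "true", "yes"}

    return GrpcConfig(
        host=env.get("GRPC_BACKEND_HOST", "0.0.0.0"),
        port=int(env.get("GRPC_BACKEND_PORT", "50051")),
        tls_enabled=tls,
        auth_tokens=tokens,
        max_message_bytes=int(env.get("GRPC_MAX_MESSAGE_BYTES", DEFAULT_MAX_MESSAGE_BYTES)),
    )
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise without touching the real environment.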

Example:

# Backend starts gRPC server
from weightslab.trainer.trainer_services import grpc_serve

grpc_serve(
    n_workers_grpc=8,
    grpc_host="0.0.0.0",
    grpc_port=50051
)

Frontend connects:

// Weights Studio connects to the gRPC server.
// Assuming a grpc-web generated client, which takes the server (or proxy) URL directly.
const client = new ExperimentServiceClient("http://localhost:50051");


RPC Methods

The ExperimentService exposes the following RPC methods:

Training & Hyperparameter Control

  1. ExperimentCommand

    Execute training-related commands: pause/resume, hyperparameter changes, mode switches.

    Request:

    message ExperimentCommandRequest {
        oneof command {
            HyperParameterChange hyper_parameter_change = 1;
            PlotNoteOperation plot_note_operation = 2;
            LoadCheckpointOperation load_checkpoint_operation = 3;
        }
        bool get_hyper_parameters = 4;
        bool get_interactive_layers = 5;
        bool get_data_records = 6;
        string get_single_layer_info_id = 7;
    }
    

    Response:

    message CommandResponse {
        bool success = 1;
        string message = 2;
        repeated HyperParameterDesc hyper_parameters_descs = 3;
        repeated LayerRepresentation layer_representations = 4;
        SampleStatistics sample_statistics = 5;
    }
    

    Behavior:

    • Pause/Resume: Controls trainer.pause() / trainer.resume() (sent as a HyperParameterChange that sets is_training)

    • HP Changes: Updates hyperparameters and pauses training

    • Mode Switch: Switches between train/audit/evaluation modes

    • Plot Notes: Add/edit notes on metric points

    • Checkpoint Load: Restore model from previous checkpoint

    Audit Logged: Yes - hp_change, pause, resume, mode_switch

Logger & Metrics

  1. GetLatestLoggerData

    Retrieve training metrics and signals logged during training.

    Request:

    message GetLatestLoggerDataRequest {
        bool request_full_history = 1;
        int32 max_points = 2;
        bool break_by_slices = 3;
        repeated string tags = 4;
        string graph_name = 5;
    }
    

    Response:

    message GetLatestLoggerDataResponse {
        repeated LoggerDataPoint points = 1;
    }
    
    message LoggerDataPoint {
        string metric_name = 1;
        int32 model_age = 2;
        float metric_value = 3;
        string experiment_hash = 4;
        int32 timestamp = 5;
        string sample_id = 6;
        bool is_evaluation_marker = 7;
        string split_name = 8;
        repeated string evaluation_tags = 9;
        string point_note = 10;
        bool audit_mode = 11;
    }
    

    Parameters:

    • request_full_history (bool): Return all history or just new data since last poll

    • max_points (int): Maximum points per signal (for downsampling)

    • break_by_slices (bool): Filter by tags and return per-sample metrics

    • tags (list): Tags to filter samples when break_by_slices=True

    • graph_name (str): Specific graph/metric to retrieve

    Behavior:

    • Called frequently by UI (every 1-2 seconds) to update metric displays

    • Returns metrics from signal_logger

    • Handles downsampling for large datasets (>1000 points)

    • Per-sample data available when tagged samples tracked

    • Enforces concurrency limit (max 3 concurrent calls)

    Audit Logged: No (read-only operation)
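The effect of max_points can be illustrated with a small downsampling sketch. The backend's actual downsampling strategy is not described here, so the evenly spaced index selection below is an assumption, and `downsample` is a hypothetical helper name.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def downsample(points: Sequence[T], max_points: int) -> List[T]:
    """Keep at most max_points evenly spaced items, always including first and last."""
    if max_points <= 0 or len(points) <= max_points:
        # Nothing to do: no limit requested or the series is already small enough.
        return list(points)
    if max_points == 1:
        return [points[-1]]
    last = len(points) - 1
    # Integer arithmetic gives strictly increasing indices from 0 to last.
    indices = [(i * last) // (max_points - 1) for i in range(max_points)]
    return [points[i] for i in indices]
```

Keeping the final point matters for a live chart, since the most recent value is usually the one the user is watching.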

Checkpoint Management

  1. RestoreCheckpoint

    Restore model weights and training state from a previous checkpoint.

    Request:

    message RestoreCheckpointRequest {
        string experiment_hash = 1;  // Can include @@weights_step=N for weights-only restore
    }
    

    Response:

    message RestoreCheckpointResponse {
        bool success = 1;
        string message = 2;
    }
    

    Behavior:

    • Pauses training before restoration

    • Loads model weights, optimizer state, data state

    • Supports full restore or weights-only restore

    • Weights-only restore specified via experiment_hash@@weights_step=5000

    • Returns to checkpoint step (model_age resets)

    • Synchronizes all components (model, optimizer, data)

    Audit Logged: Yes - checkpoint_restore
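To make the experiment_hash@@weights_step=N convention concrete, here is a minimal sketch of how such a value could be split into its two parts. The `parse_restore_target` helper is hypothetical; the backend's real parsing may differ (for example in how it treats malformed suffixes).

```python
from typing import Optional, Tuple

WEIGHTS_STEP_SUFFIX = "@@weights_step="


def parse_restore_target(experiment_hash: str) -> Tuple[str, Optional[int]]:
    """Return (hash, weights_step); weights_step is None for a full restore."""
    base, separator, step = experiment_hash.partition(WEIGHTS_STEP_SUFFIX)
    if not separator:
        # No suffix present: restore weights, optimizer state, and data state.
        return experiment_hash, None
    # int() raises ValueError if the step is missing or not a number.
    return base, int(step)
```

A request for `abc123@@weights_step=5000` would then be treated as a weights-only restore of step 5000 for experiment `abc123`.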

Evaluation

  1. TriggerEvaluation

    Start an evaluation pass on a dataset split.

    Request:

    message TriggerEvaluationRequest {
        string split_name = 1;  // "val", "test", etc.
        repeated string tags = 2;
        bool use_full_set = 3;
    }
    

    Response:

    message TriggerEvaluationResponse {
        bool success = 1;
        string message = 2;
    }
    

    Parameters:

    • split_name (str): Dataset split to evaluate (“val”, “test”, etc.)

    • tags (list): Optional tags to filter samples for evaluation

    • use_full_set (bool): Evaluate full split or just tagged samples

    Behavior:

    • Queues evaluation request in eval_controller

    • Evaluation runs asynchronously in training thread

    • Can only have one active evaluation at a time

    • Pauses training during evaluation by default

    • Results available via GetLatestLoggerData with is_evaluation_marker=True

    Audit Logged: Yes - evaluation_start

  2. GetEvaluationStatus

    Poll status of current/pending evaluation.

    Request:

    message GetEvaluationStatusRequest {}
    

    Response:

    message GetEvaluationStatusResponse {
        string status = 1;  // "idle", "pending", "running", "completed"
        int32 current = 2;  // Progress: current sample
        int32 total = 3;    // Progress: total samples
        string message = 4;
        string error = 5;
        string split_name = 6;
    }
    

    Behavior:

    • Non-blocking status check

    • Used by UI to show progress bar

    • Includes error messages if evaluation failed

    Audit Logged: No

  3. CancelEvaluation

    Cancel pending or running evaluation.

    Request:

    message CancelEvaluationRequest {
        string reason = 1;
    }
    

    Response:

    message CancelEvaluationResponse {
        bool success = 1;
        string message = 2;
    }
    

    Behavior:

    • Stops evaluation immediately

    • Returns control to training or idle state

    • No audit log (just a cancellation, not a user action)

    Audit Logged: No
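The trigger-then-poll flow for evaluations can be sketched as a small client-side loop. This is an illustration only: `wait_for_evaluation` is a hypothetical helper, `get_status` stands in for a call to GetEvaluationStatus, and treating "completed" as the only terminal state (plus a non-empty error field as failure) is an assumption. The 300-second default mirrors the 5-minute client-side timeout suggested for long operations.

```python
import time
from types import SimpleNamespace
from typing import Any, Callable


def wait_for_evaluation(
    get_status: Callable[[], Any],
    poll_interval: float = 1.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll get_status() until the evaluation completes, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status.error:
            raise RuntimeError(f"evaluation failed: {status.error}")
        if status.status == "completed":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("evaluation did not complete in time")
        sleep(poll_interval)


# Usage with a fake status source standing in for the real RPC.
_fake_statuses = iter([
    SimpleNamespace(status="pending", current=0, total=10, error=""),
    SimpleNamespace(status="running", current=5, total=10, error=""),
    SimpleNamespace(status="completed", current=10, total=10, error=""),
])
final = wait_for_evaluation(lambda: next(_fake_statuses), sleep=lambda _s: None)
```

Injecting `sleep` keeps the loop testable without real delays; a UI would typically update its progress bar from `current` and `total` on each iteration.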

Data Operations

  1. GetDataSamples

    Retrieve sample batch from the dataset with metadata and optional image thumbnails.

    Request:

    message GetDataSamplesRequest {
        int32 start_index = 1;
        int32 records_cnt = 2;
        bool include_raw_data = 3;
        int32 resize_width = 4;
        int32 resize_height = 5;
    }
    

    Response:

    message DataSamplesResponse {
        bool success = 1;
        string message = 2;
        repeated DataRecord data_records = 3;
    }
    
    message DataRecord {
        string sample_id = 1;
        string origin = 2;  // "train", "val", "test"
        map<string, string> metadata = 3;
        bytes raw_data = 4;  // Image bytes (optional)
    }
    

    Parameters:

    • start_index (int): Starting row index in dataset

    • records_cnt (int): Number of samples to retrieve

    • include_raw_data (bool): Include image bytes (for display)

    • resize_width (int): Resize image to width (optional)

    • resize_height (int): Resize image to height (optional)

    Behavior:

    • Called when user scrolls data grid

    • Lazily loads samples on demand (pagination)

    • Caches thumbnails for fast preview requests

    • Parallel batch processing using thread pool (8 workers)

    • Respects current filters/query view

    • Returns metadata columns for sorting/filtering

    Audit Logged: No (read-only operation)
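The pagination behavior can be sketched from the client's side as a generator of (start_index, records_cnt) pairs, one per GetDataSamples request. The `page_requests` helper is hypothetical, and the default page size of 32 simply follows the 32-64 sample batch size recommended later in this document.

```python
from typing import Iterator, Tuple


def page_requests(total_rows: int, page_size: int = 32) -> Iterator[Tuple[int, int]]:
    """Yield (start_index, records_cnt) pairs that cover total_rows rows."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for start in range(0, total_rows, page_size):
        # The last page may be shorter than page_size.
        yield start, min(page_size, total_rows - start)
```

For a 100-row view and a page size of 32, this produces four requests, the last one asking for the remaining 4 rows.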

  2. ApplyDataQuery

    Execute a filter, sort, or analysis operation on the dataset.

    Request:

    message DataQueryRequest {
        string query = 1;
        bool is_natural_language = 2;
    }
    

    Response:

    message DataQueryResponse {
        bool success = 1;
        string message = 2;
        int32 number_of_all_samples = 3;
        int32 number_of_samples_in_the_loop = 4;
        int32 number_of_discarded_samples = 5;
        repeated string unique_tags = 6;
        string agent_intent_type = 7;
        string analysis_result = 8;
    }
    

    Query Types:

    • Direct filters: “quality > 0.7 and confidence < 0.9”

    • Pandas operations: “@"""df[df['quality'] > 0.5]"""”

    • Natural language: “show me low quality samples” (uses AI agent)

    • Special commands: “@reset” (clear filters), “@overview” (summary)

    Behavior:

    • Modifies in-memory view of dataframe

    • Direct queries bypass agent for speed

    • Natural language queries use LLM agent

    • Returns updated sample counts

    • Sets is_filtered=True when filters applied

    • Can take several seconds for complex queries

    Audit Logged: Yes - query_execute
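The query types listed above can be told apart with a few string checks. The sketch below is a guess at such a routing step, not the backend's actual logic: `classify_query` and its return labels are hypothetical, and the rule that any other query starting with "@" is a pandas operation is an assumption based only on the examples shown.

```python
SPECIAL_COMMANDS = {"@reset", "@overview"}


def classify_query(query: str, is_natural_language: bool = False) -> str:
    """Return one of 'special', 'pandas', 'natural_language', or 'direct'."""
    text = query.strip()
    if text in SPECIAL_COMMANDS:
        return "special"
    if text.startswith("@"):
        # Assumption: remaining "@"-prefixed queries are pandas expressions.
        return "pandas"
    if is_natural_language:
        # These would be handed to the LLM agent.
        return "natural_language"
    # Plain filter expressions bypass the agent.
    return "direct"
```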

  3. EditDataSample

    Modify sample metadata: add/remove tags, discard/restore samples.

    Request:

    message DataEditRequest {
        string stat_name = 1;  // "tags:tagname", "discarded", etc.
        repeated string samples_ids = 2;
        repeated string sample_origins = 3;
        string string_value = 4;  // For tag operations
        bool bool_value = 5;      // For discard/restore
        string type = 6;          // EditType: ADD, REMOVE, OVERRIDE
    }
    

    Response:

    message DataEditsResponse {
        bool success = 1;
        string message = 2;
    }
    

    Operations:

    • Tag add: stat_name="tags", type=EDIT_ADD, string_value="tag_name"

    • Tag remove: stat_name="tags", type=EDIT_REMOVE, string_value="tag_name"

    • Discard: stat_name="discarded", bool_value=True

    • Restore: stat_name="discarded", bool_value=False

    • Copy metadata: stat_name="__copy_metadata__", string_value="source_column"

    • Delete metadata: stat_name="__delete_metadata__", string_value="column_name"

    Behavior:

    • Pauses training before modifications

    • Batch updates for performance (multiple samples at once)

    • Updates both in-memory dataframe and persistent storage (H5)

    • Flushes to disk immediately for persistence

    • Triggers internal refresh to reflect changes

    • Tags stored as separate boolean columns per tag

    Audit Logged: Yes - tag_add, tag_remove, sample_discard, sample_restore
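The operations above map onto DataEditRequest fields in a regular way, which the following sketch makes explicit. `DataEdit` is a plain-Python stand-in that mirrors the message fields (it is not the generated protobuf class), and `tag_add` / `discard` are hypothetical helpers. Treating samples_ids and sample_origins as parallel lists of equal length is an assumption.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DataEdit:
    """Stand-in mirroring the DataEditRequest fields."""
    stat_name: str
    samples_ids: List[str]
    sample_origins: List[str]
    string_value: str = ""
    bool_value: bool = False
    type: str = ""


def _check(sample_ids: Sequence[str], origins: Sequence[str]) -> None:
    # Assumption: each sample id is paired with the origin at the same position.
    if len(sample_ids) != len(origins):
        raise ValueError("sample_ids and origins must have the same length")


def tag_add(sample_ids: Sequence[str], origins: Sequence[str], tag: str) -> DataEdit:
    """Build one batched edit that adds `tag` to every listed sample."""
    _check(sample_ids, origins)
    return DataEdit("tags", list(sample_ids), list(origins), string_value=tag, type="EDIT_ADD")


def discard(sample_ids: Sequence[str], origins: Sequence[str], discarded: bool = True) -> DataEdit:
    """Build one batched edit that discards (or restores) every listed sample."""
    _check(sample_ids, origins)
    return DataEdit("discarded", list(sample_ids), list(origins), bool_value=discarded)
```

Building a single edit for many samples matches the batching behavior described above, since each call pauses training and flushes to disk.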

Data Splits

  1. GetDataSplits

    Get list of available dataset splits (train, val, test, etc.).

    Request:

    message GetDataSplitsRequest {}
    

    Response:

    message DataSplitsResponse {
        bool success = 1;
        repeated string split_names = 2;
    }
    

    Behavior:

    • Returns splits from dataframe “origin” column

    • Called once on UI initialization

    • Determines available evaluation targets

    Audit Logged: No

Model Inspection

  1. GetWeights

    Retrieve model layer weights for inspection.

    Request:

    message GetWeightsRequest {
        string layer_id = 1;
    }
    

    Response:

    message WeightsResponse {
        bytes weights_data = 1;
        string format = 2;
    }
    

    Behavior:

    • Returns weights as NumPy array (serialized)

    • Used by visualization features

    • Only available for supported frameworks (PyTorch, TensorFlow)

    Audit Logged: No

  2. GetActivations

    Retrieve layer activations for a specific sample.

    Request:

    message GetActivationsRequest {
        string layer_id = 1;
        string sample_id = 2;
    }
    

    Response:

    message ActivationsResponse {
        bytes activation_data = 1;
        string shape = 2;
    }
    

    Behavior:

    • Forward pass through network with sample

    • Returns activation maps

    • Used for activation visualization

    Audit Logged: No

  3. GetSamples

    High-level sample retrieval (images, segmentation masks, etc.).

    Request:

    message GetSamplesRequest {
        repeated string sample_ids = 1;
        int32 resize_width = 2;
        int32 resize_height = 3;
    }
    

    Response:

    message SamplesResponse {
        repeated Sample samples = 1;
    }
    
    message Sample {
        string sample_id = 1;
        bytes image = 2;
        bytes segmentation_mask = 3;
        bytes reconstruction = 4;
    }
    

    Behavior:

    • Returns specific samples with their data

    • Used for detailed sample view

    • Supports multiple output modalities

    Audit Logged: No

Common Patterns

Polling Pattern (UI to Backend)

The UI frequently polls for updates:

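// Poll metrics every 1 second
setInterval(() => {
    // Build a typed request so it matches the generated message classes
    const request = new GetLatestLoggerDataRequest();
    request.setRequestFullHistory(false);

    // Assuming a grpc-web client: methods take (request, metadata, callback)
    client.getLatestLoggerData(request, {}, (err, response) => {
        if (err) {
            console.error("GetLatestLoggerData failed:", err.message);
            return;
        }
        updateMetricDisplay(response.getPointsList());
    });
}, 1000);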

User Action Pattern (UI Trigger)

// User clicks "Resume" button
const request = new ExperimentCommandRequest();
request.setHyperParameterChange(
    new HyperParameterChange()
        .setHyperParameters(
            new HyperParameters()
                .setIsTraining(true)
        )
);

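// Assuming a grpc-web client: methods take (request, metadata, callback)
client.experimentCommand(request, {}, (err, response) => {
    if (err) {
        showNotification(`Resume failed: ${err.message}`);
        return;
    }
    if (response.getSuccess()) {
        showNotification("Training resumed");
    }
});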

Concurrency Patterns

  • GetLatestLoggerData: Limited to 3 concurrent calls (semaphore)

  • GetDataSamples: Parallel processing with 8 worker threads

  • EditDataSample: Serialized per lock to prevent conflicts

  • ApplyDataQuery: Single operation at a time per lock
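A semaphore-based limit like the one described for GetLatestLoggerData can be sketched with the standard library. This is an illustration of the technique, not the backend's code: `ConcurrencyLimiter` and `TooManyCalls` are hypothetical names, and rejecting the extra caller immediately (rather than making it wait) is an assumption.

```python
import threading
from contextlib import contextmanager
from typing import Iterator


class TooManyCalls(RuntimeError):
    """Raised when no slot is free; a server could map this to RESOURCE_EXHAUSTED."""


class ConcurrencyLimiter:
    def __init__(self, limit: int = 3) -> None:
        self._semaphore = threading.BoundedSemaphore(limit)

    @contextmanager
    def slot(self) -> Iterator[None]:
        # Non-blocking acquire: fail fast instead of queueing the caller.
        if not self._semaphore.acquire(blocking=False):
            raise TooManyCalls("concurrent call limit reached")
        try:
            yield
        finally:
            self._semaphore.release()
```

A handler would wrap its body in `with limiter.slot():` so the slot is released even when the handler raises.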

Error Handling

gRPC Error Codes

  • OK (0): Success

  • INVALID_ARGUMENT (3): Invalid request parameters

  • NOT_FOUND (5): Resource not found (checkpoint, layer, etc.)

  • ALREADY_EXISTS (6): Resource already exists

  • RESOURCE_EXHAUSTED (8): Resource limit (concurrent calls, memory)

  • ABORTED (10): Operation aborted (e.g., lock timeout)

  • INTERNAL (13): Internal server error
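On the client side, the numeric codes above can drive a simple retry decision. The sketch below mirrors the listed codes in a local enum so it runs without the grpc package. Which codes are worth retrying is an assumption (here: the two that describe transient contention), and `should_retry` is a hypothetical helper.

```python
from enum import IntEnum


class StatusCode(IntEnum):
    """Local mirror of the gRPC status codes listed above."""
    OK = 0
    INVALID_ARGUMENT = 3
    NOT_FOUND = 5
    ALREADY_EXISTS = 6
    RESOURCE_EXHAUSTED = 8
    ABORTED = 10
    INTERNAL = 13


# Assumption: concurrency limits and lock timeouts are transient, so retry them.
RETRYABLE = {StatusCode.RESOURCE_EXHAUSTED, StatusCode.ABORTED}


def should_retry(code: int) -> bool:
    """Return True if a call that failed with `code` may succeed on retry."""
    try:
        return StatusCode(code) in RETRYABLE
    except ValueError:
        # Codes outside the documented list are treated as non-retryable.
        return False
```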

Response Pattern

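Most responses include a success flag and a message (some read-only responses, such as GetLatestLoggerDataResponse and WeightsResponse, do not):

message Response {
    bool success = 1;
    string message = 2;
}

  • Check success flag before processing

  • Read message for error details or operation summary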

Example Error Handling:

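import logging

import grpc

logger = logging.getLogger(__name__)

# `client` is assumed to be the generated ExperimentService stub; generated
# Python stubs expose each RPC under its proto name (ExperimentCommand).
try:
    response = client.ExperimentCommand(request)
    if not response.success:
        logger.error(f"Command failed: {response.message}")
    else:
        logger.info(f"Command succeeded: {response.message}")
except grpc.RpcError as e:
    logger.error(f"gRPC error: {e.code()} - {e.details()}")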

Performance Considerations

Concurrency Limits

  • GetLatestLoggerData: 3 concurrent (buffer overflow protection)

  • EditDataSample: 1 concurrent (serialized for data consistency)

  • GetDataSamples: 8 parallel workers (configurable)

Timeouts

  • Default: 120 seconds per RPC call

  • Long operations (queries, evaluations): May take minutes

  • Recommend client-side timeout of 5 minutes for long operations

Message Size

  • Maximum message: 256 MB

  • Typical metrics response: <1 MB

  • Large image batches: 10-50 MB

Optimization Tips

  1. Use request_full_history=False for GetLatestLoggerData (incremental updates)

  2. Batch data edits (multiple samples in one EditDataSample call)

  3. Limit GetDataSamples batch size to 32-64 samples

  4. Cache metric history client-side instead of re-requesting

  5. Use tags to reduce query results instead of filtering client-side
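Tips 1 and 4 work together: the client requests only incremental data and merges it into a local history. A minimal sketch of such a cache follows. `MetricHistoryCache` is a hypothetical class, points are reduced to (metric_name, model_age, metric_value) tuples for brevity, and keying by model_age with last-write-wins is an assumption about how duplicates should be handled.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Point = Tuple[str, int, float]  # (metric_name, model_age, metric_value)


class MetricHistoryCache:
    """Client-side store that accumulates incremental logger data."""

    def __init__(self) -> None:
        self._series: Dict[str, Dict[int, float]] = defaultdict(dict)

    def merge(self, points: Iterable[Point]) -> None:
        """Add new points; a repeated model_age overwrites the older value."""
        for metric_name, model_age, value in points:
            self._series[metric_name][model_age] = value

    def series(self, metric_name: str) -> List[Tuple[int, float]]:
        """Return the cached history for one metric, ordered by model_age."""
        return sorted(self._series.get(metric_name, {}).items())
```

Each poll result is passed to `merge`, and the chart reads from `series`, so the full history never has to be requested again after the initial load.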

Debugging

Enable Verbose gRPC Logging:

export GRPC_VERBOSITY=debug
export GRPC_TRACE=all

Monitor gRPC Performance:

from weightslab.watchdog import RpcWatchdogState

# Watchdog monitors RPC latency and throughput
# Logs slow calls (>2s) with full context

Common Issues

  • Connection refused: Check GRPC_BACKEND_HOST and GRPC_BACKEND_PORT

  • Timeout: Backend might be processing heavy operations (eval, query)

  • Channel closed: Backend crashed or restarted

  • Lock timeout: Training lock held too long (exceeded 3 minutes)

See Also