gRPC Functions¶
Overview¶
WeightsLab uses gRPC (gRPC Remote Procedure Call) for communication between the backend training service and the frontend UI (Weights Studio). gRPC provides a high-performance, language-neutral remote procedure call framework built on HTTP/2, enabling real-time bidirectional communication.
Why gRPC? - Performance: Binary protocol with HTTP/2 multiplexing - Real-time: Server can push updates to clients - Language-neutral: Generated code available in multiple languages - Typed: Protocol buffer definitions ensure type safety - Efficient: Low overhead, suitable for frequent polling from UI
Architecture¶
Weights Studio (Frontend) WeightsLab Backend
================== =================
- gRPC Client - gRPC Server
- UI triggers actions - Experiment Service
- Receives updates - Data Service
- Polls for metrics - Model Service
|
+------ HTTP/2 gRPC Channel ------+
|
ExperimentServiceServicer
(routes to specific handlers)
The backend runs a gRPC server listening on a configurable port (default: 50051) and exposes a single service: ExperimentService with multiple RPC methods.
Connection¶
Server Configuration:
Host: Configurable via
GRPC_BACKEND_HOST(default: “0.0.0.0”)Port: Configurable via
GRPC_BACKEND_PORT(default: 50051)TLS: Optional mTLS support via
GRPC_TLS_ENABLEDAuth: Optional bearer token auth via
GRPC_AUTH_TOKENorGRPC_AUTH_TOKENSMax message size: 256 MB (configurable via
GRPC_MAX_MESSAGE_BYTES)
Example:
# Backend starts gRPC server
from weightslab.trainer.trainer_services import grpc_serve
grpc_serve(
n_workers_grpc=8,
grpc_host="0.0.0.0",
grpc_port=50051
)
Frontend connects:
// Weights Studio connects to gRPC server
const channel = grpc.web.grpc.createChannel("http://localhost:50051");
const client = new ExperimentServiceClient(channel);
RPC Methods¶
The ExperimentService exposes the following RPC methods:
Training & Hyperparameter Control
ExperimentCommand
Execute training-related commands: pause/resume, hyperparameter changes, mode switches.
Request:
message ExperimentCommandRequest { oneof command { HyperParameterChange hyper_parameter_change = 1; PlotNoteOperation plot_note_operation = 2; LoadCheckpointOperation load_checkpoint_operation = 3; } bool get_hyper_parameters = 4; bool get_interactive_layers = 5; bool get_data_records = 6; string get_single_layer_info_id = 7; }
Response:
message CommandResponse { bool success = 1; string message = 2; repeated HyperParameterDesc hyper_parameters_descs = 3; repeated LayerRepresentation layer_representations = 4; SampleStatistics sample_statistics = 5; }
Behavior:
Pause/Resume: Controls trainer.pause() / trainer.resume()
HP Changes: Updates hyperparameters and pauses training
Mode Switch: Switches between train/audit/evaluation modes
Plot Notes: Add/edit notes on metric points
Checkpoint Load: Restore model from previous checkpoint
Audit Logged: Yes - hp_change, pause, resume, mode_switch
Logger & Metrics
GetLatestLoggerData
Retrieve training metrics and signals logged during training.
Request:
message GetLatestLoggerDataRequest { bool request_full_history = 1; int32 max_points = 2; bool break_by_slices = 3; repeated string tags = 4; string graph_name = 5; }
Response:
message GetLatestLoggerDataResponse { repeated LoggerDataPoint points = 1; } message LoggerDataPoint { string metric_name = 1; int32 model_age = 2; float metric_value = 3; string experiment_hash = 4; int32 timestamp = 5; string sample_id = 6; bool is_evaluation_marker = 7; string split_name = 8; repeated string evaluation_tags = 9; string point_note = 10; bool audit_mode = 11; }
Parameters:
request_full_history(bool): Return all history or just new data since last pollmax_points(int): Maximum points per signal (for downsampling)break_by_slices(bool): Filter by tags and return per-sample metricstags(list): Tags to filter samples when break_by_slices=Truegraph_name(str): Specific graph/metric to retrieve
Behavior:
Called frequently by UI (every 1-2 seconds) to update metric displays
Returns metrics from signal_logger
Handles downsampling for large datasets (>1000 points)
Per-sample data available when tagged samples tracked
Enforces concurrency limit (max 3 concurrent calls)
Audit Logged: No (read-only operation)
Checkpoint Management
RestoreCheckpoint
Restore model weights and training state from a previous checkpoint.
Request:
message RestoreCheckpointRequest { string experiment_hash = 1; // Can include @@weights_step=N for weights-only restore }
Response:
message RestoreCheckpointResponse { bool success = 1; string message = 2; }
Behavior:
Pauses training before restoration
Loads model weights, optimizer state, data state
Supports full restore or weights-only restore
Weights-only restore specified via
experiment_hash@@weights_step=5000Returns to checkpoint step (model_age resets)
Synchronizes all components (model, optimizer, data)
Audit Logged: Yes - checkpoint_restore
Evaluation
TriggerEvaluation
Start an evaluation pass on a dataset split.
Request:
message TriggerEvaluationRequest { string split_name = 1; // "val", "test", etc. repeated string tags = 2; bool use_full_set = 3; }
Response:
message TriggerEvaluationResponse { bool success = 1; string message = 2; }
Parameters:
split_name(str): Dataset split to evaluate (“val”, “test”, etc.)tags(list): Optional tags to filter samples for evaluationuse_full_set(bool): Evaluate full split or just tagged samples
Behavior:
Queues evaluation request in eval_controller
Evaluation runs asynchronously in training thread
Can only have one active evaluation at a time
Pauses training during evaluation by default
Results available via GetLatestLoggerData with is_evaluation_marker=True
Audit Logged: Yes - evaluation_start
GetEvaluationStatus
Poll status of current/pending evaluation.
Request:
message GetEvaluationStatusRequest {}
Response:
message GetEvaluationStatusResponse { string status = 1; // "idle", "pending", "running", "completed" int32 current = 2; // Progress: current sample int32 total = 3; // Progress: total samples string message = 4; string error = 5; string split_name = 6; }
Behavior:
Non-blocking status check
Used by UI to show progress bar
Includes error messages if evaluation failed
Audit Logged: No
CancelEvaluation
Cancel pending or running evaluation.
Request:
message CancelEvaluationRequest { string reason = 1; }
Response:
message CancelEvaluationResponse { bool success = 1; string message = 2; }
Behavior:
Stops evaluation immediately
Returns control to training or idle state
No audit log (just a cancellation, not a user action)
Audit Logged: No
Data Operations
GetDataSamples
Retrieve sample batch from the dataset with metadata and optional image thumbnails.
Request:
message GetDataSamplesRequest { int32 start_index = 1; int32 records_cnt = 2; bool include_raw_data = 3; int32 resize_width = 4; int32 resize_height = 5; }
Response:
message DataSamplesResponse { bool success = 1; string message = 2; repeated DataRecord data_records = 3; } message DataRecord { string sample_id = 1; string origin = 2; // "train", "val", "test" map<string, string> metadata = 3; bytes raw_data = 4; // Image bytes (optional) }
Parameters:
start_index(int): Starting row index in datasetrecords_cnt(int): Number of samples to retrieveinclude_raw_data(bool): Include image bytes (for display)resize_width(int): Resize image to width (optional)resize_height(int): Resize image to height (optional)
Behavior:
Called when user scrolls data grid
Lazily loads samples on demand (pagination)
Caches thumbnails for fast preview requests
Parallel batch processing using thread pool (8 workers)
Respects current filters/query view
Returns metadata columns for sorting/filtering
Audit Logged: No (read-only operation)
ApplyDataQuery
Execute a filter, sort, or analysis operation on the dataset.
Request:
message DataQueryRequest { string query = 1; bool is_natural_language = 2; }
Response:
message DataQueryResponse { bool success = 1; string message = 2; int32 number_of_all_samples = 3; int32 number_of_samples_in_the_loop = 4; int32 number_of_discarded_samples = 5; repeated string unique_tags = 6; string agent_intent_type = 7; string analysis_result = 8; }
Query Types:
Direct filters: “quality > 0.7 and confidence < 0.9”
Pandas operations: “@"""df[df[‘quality’] > 0.5]"""”
Natural language: “show me low quality samples” (uses AI agent)
Special commands: “@reset” (clear filters), “@overview” (summary)
Behavior:
Modifies in-memory view of dataframe
Direct queries bypass agent for speed
Natural language queries use LLM agent
Returns updated sample counts
Sets is_filtered=True when filters applied
Can take several seconds for complex queries
Audit Logged: Yes - query_execute
EditDataSample
Modify sample metadata: add/remove tags, discard/restore samples.
Request:
message DataEditRequest { string stat_name = 1; // "tags:tagname", "discarded", etc. repeated string samples_ids = 2; repeated string sample_origins = 3; string string_value = 4; // For tag operations bool bool_value = 5; // For discard/restore string type = 6; // EditType: ADD, REMOVE, OVERRIDE }
Response:
message DataEditsResponse { bool success = 1; string message = 2; }
Operations:
Tag add: stat_name=”tags”, type=EDIT_ADD, string_value=”tag_name”
Tag remove: stat_name=”tags”, type=EDIT_REMOVE, string_value=”tag_name”
Discard: stat_name=”discarded”, bool_value=True
Restore: stat_name=”discarded”, bool_value=False
Copy metadata: stat_name=”__copy_metadata__”, string_value=”source_column”
Delete metadata: stat_name=”__delete_metadata__”, string_value=”column_name”
Behavior:
Pauses training before modifications
Batch updates for performance (multiple samples at once)
Updates both in-memory dataframe and persistent storage (H5)
Flushes to disk immediately for persistence
Triggers internal refresh to reflect changes
Tags stored as separate boolean columns per tag
Audit Logged: Yes - tag_add, tag_remove, sample_discard, sample_restore
Data Splits
GetDataSplits
Get list of available dataset splits (train, val, test, etc.).
Request:
message GetDataSplitsRequest {}
Response:
message DataSplitsResponse { bool success = 1; repeated string split_names = 2; }
Behavior:
Returns splits from dataframe “origin” column
Called once on UI initialization
Determines available evaluation targets
Audit Logged: No
Model Inspection
GetWeights
Retrieve model layer weights for inspection.
Request:
message GetWeightsRequest { string layer_id = 1; }
Response:
message WeightsResponse { bytes weights_data = 1; string format = 2; }
Behavior:
Returns weights as NumPy array (serialized)
Used by visualization features
Only available for supported frameworks (PyTorch, TensorFlow)
Audit Logged: No
GetActivations
Retrieve layer activations for a specific sample.
Request:
message GetActivationsRequest { string layer_id = 1; string sample_id = 2; }
Response:
message ActivationsResponse { bytes activation_data = 1; string shape = 2; }
Behavior:
Forward pass through network with sample
Returns activation maps
Used for activation visualization
Audit Logged: No
GetSamples
High-level sample retrieval (images, segmentation masks, etc.).
Request:
message GetSamplesRequest { repeated string sample_ids = 1; int32 resize_width = 2; int32 resize_height = 3; }
Response:
message SamplesResponse { repeated Sample samples = 1; } message Sample { string sample_id = 1; bytes image = 2; bytes segmentation_mask = 3; bytes reconstruction = 4; }
Behavior:
Returns specific samples with their data
Used for detailed sample view
Supports multiple output modalities
Audit Logged: No
Common Patterns¶
Polling Pattern (UI to Backend)
The UI frequently polls for updates:
// Poll metrics every 1 second
setInterval(() => {
client.getLatestLoggerData(
{ request_full_history: false },
(err, response) => {
if (!err) {
updateMetricDisplay(response.getPointsList());
}
}
);
}, 1000);
User Action Pattern (UI Trigger)
// User clicks "Resume" button
const request = new ExperimentCommandRequest();
request.setHyperParameterChange(
new HyperParameterChange()
.setHyperParameters(
new HyperParameters()
.setIsTraining(true)
)
);
client.experimentCommand(request, (err, response) => {
if (response.getSuccess()) {
showNotification("Training resumed");
}
});
Concurrency Patterns
GetLatestLoggerData: Limited to 3 concurrent calls (semaphore)
GetDataSamples: Parallel processing with 8 worker threads
EditDataSample: Serialized per lock to prevent conflicts
ApplyDataQuery: Single operation at a time per lock
Error Handling¶
gRPC Error Codes
OK (0): SuccessINVALID_ARGUMENT (3): Invalid request parametersNOT_FOUND (5): Resource not found (checkpoint, layer, etc.)ALREADY_EXISTS (6): Resource already existsABORTED (10): Operation aborted (e.g., lock timeout)RESOURCE_EXHAUSTED (8): Resource limit (concurrent calls, memory)INTERNAL (13): Internal server error
Response Pattern
All responses include:
message Response {
bool success = 1;
string message = 2;
}
Check
successflag before processingRead
messagefor error details or operation summary
Example Error Handling:
try:
response = client.experiment_command(request)
if not response.success:
logger.error(f"Command failed: {response.message}")
else:
logger.info(f"Command succeeded: {response.message}")
except grpc.RpcError as e:
logger.error(f"gRPC error: {e.code()} - {e.details()}")
Performance Considerations¶
Concurrency Limits
GetLatestLoggerData: 3 concurrent (buffer overflow protection)
EditDataSample: 1 concurrent (serialized for data consistency)
GetDataSamples: 8 parallel workers (configurable)
Timeouts
Default: 120 seconds per RPC call
Long operations (queries, evaluations): May take minutes
Recommend client-side timeout of 5 minutes for long operations
Message Size
Maximum message: 256 MB
Typical metrics response: <1 MB
Large image batches: 10-50 MB
Optimization Tips
Use
request_full_history=Falsefor GetLatestLoggerData (incremental updates)Batch data edits (multiple samples in one EditDataSample call)
Limit GetDataSamples batch size to 32-64 samples
Cache metric history client-side instead of re-requesting
Use tags to reduce query results instead of filtering client-side
Debugging¶
Enable Verbose gRPC Logging:
export GRPC_VERBOSITY=debug
export GRPC_TRACE=all
Monitor gRPC Performance:
from weightslab.watchdog import RpcWatchdogState
# Watchdog monitors RPC latency and throughput
# Logs slow calls (>2s) with full context
Common Issues
Connection refused: Check GRPC_BACKEND_HOST and GRPC_BACKEND_PORT
Timeout: Backend might be processing heavy operations (eval, query)
Channel closed: Backend crashed or restarted
Lock timeout: Training lock held too long (exceeded 3 minutes)
See Also¶
Audit Logger: Audit logging for all gRPC operations
Weights Studio Guide: gRPC client implementation (UI)
Configuration: gRPC configuration options