TLDR: A theoretical framework for understanding and comparing machine learning model serving systems in cloud environments, focusing on SageMaker, Vertex AI, and Azure ML. Just as the original "Next 700" paper sought to distill the essence of programming languages, we extract the core concepts underlying ML model deployment systems.
1. Introduction
Today's ML engineers must choose between various serving systems, each with its own abstractions, terminology, and trade-offs. These platforms differ in their approaches to fundamental concepts such as:
- Model containerization and packaging
- Scaling and resource allocation
- Version management and deployment strategies
- Monitoring and observability
- Resource optimization and cost management
2. A Calculus for ML Model Serving
Core Concepts
ModelArtifact ::= (code, weights, metadata)
Container ::= (ModelArtifact, runtime, deps)
Endpoint ::= (Container, scaling_config, routing)
Version ::= (Endpoint, traffic_weight)
Operations
package : ModelArtifact → Container
deploy : Container → Endpoint
scale : Endpoint × Config → Endpoint
route : Version × Version × Weight → Version
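To make the calculus concrete, here is a minimal sketch of how it could be written down as plain Python dataclasses. Every type and function body below is illustrative and belongs to no vendor SDK; package and deploy take extra arguments (runtime, deps, scaling_config, routing) because in practice the platform supplies those details.

from dataclasses import dataclass

# Illustrative encoding of the core grammar; names are ours, not any vendor SDK's.
@dataclass
class ModelArtifact:
    code: str       # path/URI to model code
    weights: str    # path/URI to serialized weights
    metadata: dict  # framework, versions, etc.

@dataclass
class Container:
    artifact: ModelArtifact
    runtime: dict   # base image, execution role, accelerator, ...
    deps: dict      # environment variables, extra packages

@dataclass
class Endpoint:
    container: Container
    scaling_config: dict  # instance type/count, replica bounds
    routing: dict         # variant names and traffic weights

@dataclass
class Version:
    endpoint: Endpoint
    traffic_weight: float  # fraction of traffic served by this version

# The four operations become ordinary functions over these types.
def package(artifact: ModelArtifact, runtime: dict, deps: dict) -> Container:
    return Container(artifact, runtime, deps)

def deploy(container: Container, scaling_config: dict, routing: dict) -> Endpoint:
    return Endpoint(container, scaling_config, routing)

def scale(endpoint: Endpoint, config: dict) -> Endpoint:
    return Endpoint(endpoint.container, {**endpoint.scaling_config, **config}, endpoint.routing)

def route(current: Version, candidate: Version, weight: float) -> Version:
    # Shift `weight` of the traffic onto the candidate version.
    return Version(candidate.endpoint, weight)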
3. Platform Analysis
3.1 Amazon SageMaker
SageMaker's approach closely mirrors our theoretical model, with explicit container building and endpoint management. Key mappings include:
- Model artifacts are packaged into container images stored in Amazon ECR
- Endpoints provide real-time inference with automatic scaling
- Production variants enable traffic splitting
Basic Model Deployment
Theoretical Representation:
# SageMaker strict implementation of core grammar
ModelArtifact ::= (
code = "s3://bucket/model.tar.gz", # Model code and artifacts
weights = "s3://bucket/weights", # Model weights
metadata = { # Essential metadata only
"framework": str, # e.g., "huggingface"
"version": str, # e.g., "4.37"
"py_version": str # e.g., "py310"
}
)
Container ::= (
ModelArtifact,
runtime = {
"image": str, # ECR image URI
"execution_role": str # IAM role
},
deps = {
"environment": dict, # Environment variables
"entry_point": str # Inference script
}
)
Endpoint ::= (
Container,
scaling_config = {
"instance_count": int,
"instance_type": str
},
routing = {
"variants": list[str], # Production variant names
"weights": list[float] # Traffic weights
}
)
Version ::= (
Endpoint,
traffic_weight = float # Simple weight for this version
)
# Core operations
package : ModelArtifact → Container # Create SageMaker model
deploy : Container → Endpoint # Deploy to endpoint
scale : Endpoint × Config → Endpoint # Update instance count/type
route : Version × Version × Weight → Version # Update traffic split
Implementation:
from sagemaker.huggingface import (
    HuggingFaceModel,
    HuggingFacePredictor,
    get_huggingface_llm_image_uri,
)

# Define the model image (Hugging Face TGI serving container)
image_uri = get_huggingface_llm_image_uri(
    "huggingface",
    version="1.4.2"
)

# Create the model (packaging step)
huggingface_model = HuggingFaceModel(
    env=env,              # Environment variables for the container (defined elsewhere)
    role=SAGEMAKER_ROLE,  # IAM execution role ARN (defined elsewhere)
    transformers_version="4.37",
    pytorch_version="2.1",
    py_version="py310",
    image_uri=image_uri
)

# Deploy the model (endpoint creation)
predictor = huggingface_model.deploy(
    initial_instance_count=deployment.instance_count,  # user-supplied config (defined elsewhere)
    instance_type=deployment.instance_type,
    endpoint_name=endpoint_name,
)

# Inference invocation (re-attaches to the endpoint by name)
predictor = HuggingFacePredictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session
)
response = predictor.predict(payload)  # payload: dict in the model's expected input format
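The snippet above exercises only package and deploy. The scale and route operations map onto a single SageMaker API call that adjusts an existing endpoint's production variants; a hedged sketch using the boto3 SageMaker client, assuming two production variants named "VariantA" and "VariantB" already exist (variant names are illustrative):

import boto3

sm_client = boto3.client("sagemaker")

# scale : Endpoint × Config → Endpoint
# route : Version × Version × Weight → Version
# Both are realized by adjusting the weights and instance counts of the
# endpoint's production variants in place.
sm_client.update_endpoint_weights_and_capacities(
    EndpointName=endpoint_name,
    DesiredWeightsAndCapacities=[
        {"VariantName": "VariantA", "DesiredWeight": 0.8, "DesiredInstanceCount": 2},
        {"VariantName": "VariantB", "DesiredWeight": 0.2, "DesiredInstanceCount": 1},
    ],
)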
3.2 Azure ML SDK
Azure ML implements a workspace-centric approach with managed online endpoints, emphasizing environment management and model registry integration. Key mappings include:
- Managed deployments handle container creation implicitly
- Scaling is defined through deployment configurations
- Blue-green deployments manage version transitions
Theoretical Representation:
# Azure ML implementation of our core grammar
ModelArtifact ::= (
code = "model/path", # Local or registry path
weights = "weights/path",
metadata = {
"name": str, # e.g., "hf-model"
"type": AssetType, # e.g., CUSTOM_MODEL
"description": str,
"registry": optional[str] # e.g., "HuggingFace"
}
)
Container ::= (
ModelArtifact,
runtime = {
"image": str, # e.g., "mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04"
"conda_file": {
"channels": list[str],
"dependencies": list[str]
}
},
deps = {
"environment_variables": dict[str, str],
"pip_packages": list[str]
}
)
Endpoint ::= (
Container,
scaling_config = {
"instance_type": str, # e.g., "Standard_DS3_v2"
"instance_count": int,
"min_replicas": int,
"max_replicas": int
},
routing = {
"deployment_name": str,
"traffic_percentage": int
}
)
Version ::= (
Endpoint,
traffic_weight = {
"blue_green_config": {
"active": str, # blue or green
"percentage": int,
"evaluation_rules": dict
}
}
)
# Core operations
package(ModelArtifact) → Container # Creates Azure container environment
deploy(Container) → Endpoint # Deploys to Azure managed endpoint
scale(Endpoint × Config) → Endpoint # Updates endpoint scaling
route(Version × Weight) → Version # Updates traffic routing
Implementation:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import (
Environment,
Model,
ManagedOnlineEndpoint,
ManagedOnlineDeployment
)
from azure.ai.ml.constants import AssetTypes
import time  # used below for unique endpoint names
# Initialize workspace client
credential = DefaultAzureCredential()
ml_client = MLClient(
credential=credential,
subscription_id=subscription_id,
resource_group_name=resource_group,
workspace_name=workspace_name
)
# Define environment with dependencies
environment = Environment(
name="bert-env",
image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
conda_file={
"channels": ["conda-forge", "pytorch"],
"dependencies": [
"python=3.11",
"pip",
"pytorch",
"transformers",
"numpy"
]
}
)
# Register model from registry
model = Model(
path=f"hf://{model_id}",
type=AssetTypes.CUSTOM_MODEL,
name="hf-model",
description="HuggingFace model from Model Hub"
)
# Create and configure endpoint
endpoint_name = f"hf-ep-{int(time.time())}"
ml_client.begin_create_or_update(
ManagedOnlineEndpoint(name=endpoint_name)
).wait()
# Deploy model (references the Model object defined above)
deployment = ml_client.online_deployments.begin_create_or_update(
    ManagedOnlineDeployment(
        name="demo",
        endpoint_name=endpoint_name,
        model=model,
        environment=environment,
        instance_type="Standard_DS3_v2",
        instance_count=1,
    )
).result()
# Update traffic rules
endpoint = ml_client.online_endpoints.get(endpoint_name)
endpoint.traffic = {"demo": 100}
ml_client.begin_create_or_update(endpoint).result()
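The route operation corresponds directly to the endpoint's traffic map. A minimal sketch of a blue-green shift, assuming a second deployment named "green" has already been created alongside "demo" (deployment names are illustrative):

# Shift 10% of traffic onto the newer deployment
endpoint = ml_client.online_endpoints.get(endpoint_name)
endpoint.traffic = {"demo": 90, "green": 10}
ml_client.begin_create_or_update(endpoint).result()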
3.3 Google Cloud Vertex AI
Vertex AI takes a streamlined approach to model deployment, with tight integration into Google Cloud's container infrastructure and an emphasis on GPU acceleration.
Theoretical Representation:
# Vertex AI implementation of our core grammar
ModelArtifact ::= (
code = "gs://model/path", # GCS path
weights = "gs://weights/path",
metadata = {
"model_id": str, # e.g., "hf-bert-base"
"framework": str, # e.g., "huggingface"
"generation_config": dict
}
)
Container ::= (
ModelArtifact,
runtime = {
"image_uri": str, # e.g., "us-docker.pkg.dev/vertex-ai/prediction/..."
"accelerator": str # e.g., "NVIDIA_TESLA_A100"
},
deps = {
"env_vars": {
"MODEL_ID": str,
"MAX_INPUT_LENGTH": str,
"MAX_TOTAL_TOKENS": str,
"NUM_SHARD": str
}
}
)
Endpoint ::= (
Container,
scaling_config = {
"machine_type": str, # e.g., "a2-highgpu-4g"
"min_replica_count": int,
"max_replica_count": int,
"accelerator_count": int
},
routing = {
"traffic_split": dict[str, int],
"prediction_config": dict
}
)
Version ::= (
Endpoint,
traffic_weight = {
"split_name": str,
"percentage": int,
"monitoring_config": dict
}
)
# Core operations
package(ModelArtifact) → Container # Creates Vertex AI container
deploy(Container) → Endpoint # Deploys to Vertex endpoint
scale(Endpoint × Config) → Endpoint # Updates endpoint scaling
route(Version × Weight) → Version # Updates traffic routing
Implementation:
from google.cloud import aiplatform
def deploy_hf_model(
    project_id: str,
    location: str,
    model_id: str,
    machine_type: str = "a2-highgpu-4g",
):
    aiplatform.init(project=project_id, location=location)

    env_vars = {
        "MODEL_ID": model_id,
        "MAX_INPUT_LENGTH": "512",
        "MAX_TOTAL_TOKENS": "1024",
        "MAX_BATCH_PREFILL_TOKENS": "2048",
        "NUM_SHARD": "1"
    }

    # Upload model with container configuration
    model = aiplatform.Model.upload(
        display_name=f"hf-{model_id.replace('/', '-')}",
        serving_container_image_uri=(
            "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/"
            "huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310"
        ),
        serving_container_environment_variables=env_vars
    )

    # Deploy model with compute configuration
    endpoint = model.deploy(
        machine_type=machine_type,
        min_replica_count=1,
        max_replica_count=1,
        accelerator_type="NVIDIA_TESLA_A100",
        accelerator_count=1,
        sync=True
    )
    return endpoint
def create_completion(
    endpoint,
    prompt: str,
    max_tokens: int = 100,
    temperature: float = 0.7
):
    # The TGI serving container expects a list of instances, each with
    # "inputs" and generation "parameters".
    response = endpoint.predict(
        instances=[{
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": max_tokens,
                "temperature": temperature,
                "top_p": 0.95,
                "top_k": 40,
            },
        }]
    )
    return response
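On Vertex AI, route is expressed as a traffic split across models deployed to the same endpoint. A hedged sketch, assuming a second aiplatform.Model object model_v2 has already been uploaded:

# Deploy a second version to the existing endpoint and give it 10% of traffic;
# the remaining 90% stays on the previously deployed model.
model_v2.deploy(
    endpoint=endpoint,
    machine_type="a2-highgpu-4g",
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=1,
    traffic_percentage=10,
    sync=True,
)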
4. Hypothetical Frameworks
4.1 ServerlessML
ServerlessML takes a radical approach by completely eliminating the concept of endpoints and containers, instead treating models as pure functions:
Theoretical Representation:
ModelArtifact ::= (code, weights, metadata, scaling_rules)
Function ::= (ModelArtifact, memory_size, timeout)
Invocation ::= (Function, cold_start_policy)

# Key innovation: No explicit container or endpoint
deploy : ModelArtifact → Function
invoke : Function → Response
scale : automatic, based on concurrent invocations
Implementation:
from serverlessml import MLFunction
model = MLFunction(
model_path="model.pkl",
framework="pytorch",
memory_size="2GB",
scaling_rules={
"cold_start_policy": "eager_loading",
"max_concurrent": 1000,
"idle_timeout": "10m"
}
)
# Deployment is implicit - function is ready to serve
function_url = model.deploy()
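Because deployment yields nothing more than a URL, the invoke operation reduces to an ordinary HTTP call; a sketch with an illustrative payload:

import requests

# Hypothetical invocation of the deployed function
response = requests.post(function_url, json={"inputs": [0.1, 0.2, 0.3]})
print(response.json())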
Pros:
- Zero infrastructure management - models are treated as pure functions
- True pay-per-invocation pricing with no idle costs
- Automatic scaling from zero to thousands of concurrent requests
Cons:
- Cold starts can impact latency-sensitive applications
- Limited control over underlying infrastructure
- May be more expensive for constant high-throughput workloads
4.2 StatefulML
StatefulML introduces a novel approach by making model state and caching first-class concepts:
Theoretical Representation:
ModelArtifact ::= (code, weights, metadata)
ModelState ::= (cache, warm_weights, dynamic_config)
Container ::= (ModelArtifact, ModelState, runtime)
StateManager ::= (Container, caching_policy, update_strategy)

# Key innovation: Explicit state management
deploy : (ModelArtifact, StateManager) → Container
update_state : (Container, ModelState) → Container
cache_forward : (Container, Request) → Response
Implementation:
from statefulml import MLContainer, StateManager
state_manager = StateManager(
caching_policy={
"strategy": "predictive_cache",
"cache_size": "4GB",
"eviction_policy": "feature_based_lru"
},
update_strategy={
"type": "incremental",
"frequency": "5m",
"warm_up": True
}
)
model = MLContainer(
model_path="model.pkl",
framework="tensorflow",
state_manager=state_manager,
dynamic_config={
"feature_importance_tracking": True,
"automatic_cache_tuning": True
}
)
endpoint = model.deploy()
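The cache_forward operation would be invisible to callers; a sketch of a request that may be answered from the predictive cache (cache_stats() is illustrative, like the rest of this hypothetical API):

# A hypothetical request; on a cache hit the forward pass is skipped entirely
response = endpoint.predict({"features": [0.1, 0.2, 0.3]})
print(response, endpoint.cache_stats())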
Pros:
- Intelligent caching reduces latency for common patterns
- State persistence improves warm start performance
- Dynamic optimization based on actual usage patterns
Cons:
- More complex deployment and management
- Higher memory requirements for state maintenance
- Potential consistency issues with distributed state
5. Future Directions
Extending the calculus points to several constructs that future serving systems could make first-class:
DynamicResource ::= (
GranularAllocation,
ElasticScaling,
CostAwareScheduling
)
SharedModelConfig ::= (
CrossEndpointSharing,
DynamicModelLoading,
ResourcePooling
)
EnhancedMonitoring ::= (
PredictiveAlerts,
AutomaticDiagnosis,
AdaptiveOptimization
)
6. Conclusion
This framework provides a common vocabulary for understanding and comparing ML serving systems. While current platforms differ significantly, many of those differences reflect platform-specific constraints rather than fundamental requirements of the domain. Future systems can draw on this analysis to provide more consistent and powerful abstractions for ML deployment.