Deploying a Multi-Model Endpoint to Amazon SageMaker Real-Time Inference
Deploying machine learning (ML) models in production often involves balancing performance, cost, and manageability. One powerful technique available in Amazon SageMaker is the ability to deploy multiple models to a single real-time inference endpoint. This capability reduces cost, improves infrastructure utilization, and simplifies the architecture for use cases involving many similar models, such as regional predictive models.
In this tutorial, we will walk through how to:
- Set up a SageMaker Studio environment.
- Copy trained model artifacts from a public S3 bucket.
- Create and configure a multi-model endpoint.
- Invoke the endpoint for real-time predictions.
- Clean up all associated resources to prevent unwanted charges.
Overview of Inference Options in SageMaker
Amazon SageMaker offers a spectrum of inference options:
- Real-Time Inference: For workloads needing low latency (milliseconds).
- Serverless Inference: For intermittent workloads where endpoints scale automatically.
- Asynchronous Inference: For processing large payloads or long-running inference tasks.
- Batch Transform: For large datasets processed in bulk.
In this post, we focus on Real-Time Inference using Multi-Model Endpoints (MME).
Use Case: House Price Prediction
In this hands-on example, you will deploy multiple pre-trained XGBoost models that predict house prices based on features such as:
- Number of bedrooms
- Square footage
- Number of bathrooms
Each model is specialized for a different city (Chicago, Houston, New York, Los Angeles).
Step 1: Set up SageMaker Studio
If you don't already have a SageMaker Studio domain:
- Use the provided AWS CloudFormation template to create a SageMaker Studio domain and IAM roles.
- Make sure your region is set to US East (N. Virginia).
- Create the stack and wait ~10 minutes until its status is CREATE_COMPLETE.
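Instead of refreshing the CloudFormation console, you can block on the stack programmatically with boto3's built-in waiter. A minimal sketch; the stack name `sagemaker-studio-stack` is a placeholder for whatever name you gave your stack:

```python
def wait_for_stack(cfn_client, stack_name):
    """Block until the CloudFormation stack finishes creating, then return its status."""
    waiter = cfn_client.get_waiter("stack_create_complete")
    waiter.wait(StackName=stack_name)  # polls periodically until CREATE_COMPLETE or failure
    stacks = cfn_client.describe_stacks(StackName=stack_name)["Stacks"]
    return stacks[0]["StackStatus"]

# Usage (requires AWS credentials; stack name is a placeholder):
# import boto3
# print(wait_for_stack(boto3.client("cloudformation"), "sagemaker-studio-stack"))
```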
Step 2: Configure Notebook Environment
- Open SageMaker Studio → Notebook.
- Create a new Notebook.
- Set up AWS SDK clients and S3 parameters:
import boto3, sagemaker
from sagemaker.image_uris import retrieve
sagemaker_session = sagemaker.Session()
default_bucket = sagemaker_session.default_bucket()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()
read_bucket = "sagemaker-sample-files"
read_prefix = "models/house_price_prediction"
model_prefix = "models/xgb-hpp"
location = ['Chicago_IL', 'Houston_TX', 'NewYork_NY', 'LosAngeles_CA']
test_data = [1997, 2527, 6, 2.5, 0.57, 1]
- Copy public model artifacts into your bucket:
s3 = boto3.resource("s3")
bucket = s3.Bucket(default_bucket)
for loc in location:
    copy_source = {'Bucket': read_bucket, 'Key': f"{read_prefix}/{loc}.tar.gz"}
    bucket.copy(copy_source, f"{model_prefix}/{loc}.tar.gz")
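Before moving on, it is worth confirming that all four archives actually landed in your bucket. A minimal sketch, assuming the `default_bucket`, `model_prefix`, and `location` variables defined above (and fewer than 1,000 keys under the prefix, so a single unpaginated listing suffices):

```python
def missing_artifacts(s3_client, bucket, prefix, locations):
    """Return the model archives that are NOT present under s3://bucket/prefix/."""
    resp = s3_client.list_objects_v2(Bucket=bucket, Prefix=f"{prefix}/")
    found = {obj["Key"] for obj in resp.get("Contents", [])}
    return [loc for loc in locations if f"{prefix}/{loc}.tar.gz" not in found]

# Usage (requires AWS credentials):
# assert missing_artifacts(boto3.client("s3"), default_bucket, model_prefix, location) == []
```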
Step 3: Deploy the Multi-Model Endpoint
Create SageMaker Model
training_image = retrieve(framework="xgboost", region=region, version="1.3-1")  # pin a concrete version; "latest" is not accepted by image_uris.retrieve
model_name = "housing-prices-prediction-mme-xgb"
model_artifacts = f"s3://{default_bucket}/{model_prefix}/"
primary_container = {
"Image": training_image,
"ModelDataUrl": model_artifacts,
"Mode": "MultiModel"
}
model = boto3.client("sagemaker").create_model(
ModelName=model_name,
PrimaryContainer=primary_container,
ExecutionRoleArn=role
)
Create Endpoint Configuration
endpoint_config_name = f"{model_name}-endpoint-config"
boto3.client("sagemaker").create_endpoint_config(
EndpointConfigName=endpoint_config_name,
ProductionVariants=[{
"VariantName": "AllTraffic",
"ModelName": model_name,
"InstanceType": "ml.m5.large",
"InitialInstanceCount": 1
}]
)
Create Endpoint
endpoint_name = f"{model_name}-endpoint"
boto3.client("sagemaker").create_endpoint(
EndpointName=endpoint_name,
EndpointConfigName=endpoint_config_name
)
Step 4: Run Inference
Once the endpoint status is InService, run predictions:
sm_runtime = boto3.client("sagemaker-runtime")
payload = ' '.join([str(elem) for elem in test_data])
for loc in location:
    response = sm_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        TargetModel=f"{loc}.tar.gz",
        Body=payload
    )
    result = response['Body'].read().decode()
    print(f"Prediction for {loc}: ${result}")
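The response body arrives as a raw byte stream, so wrapping a single invocation in a small function makes it easier to reuse and test. A sketch, assuming (as in this tutorial) that each model returns one numeric value per request:

```python
def predict_city(runtime_client, endpoint_name, city, features):
    """Invoke one city's model on the multi-model endpoint; return the prediction as a float."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        TargetModel=f"{city}.tar.gz",  # routes the request to one model artifact
        Body=' '.join(str(f) for f in features),
    )
    return float(response["Body"].read().decode().strip())

# Usage (requires a live endpoint):
# predict_city(boto3.client("sagemaker-runtime"), endpoint_name, "Chicago_IL", test_data)
```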
Complete Code:
import boto3
import sagemaker
import time
from sagemaker.image_uris import retrieve
from time import gmtime, strftime
s3 = boto3.resource("s3")  # resource API needed for Bucket.copy below
sagemaker_session = sagemaker.Session()
default_bucket = sagemaker_session.default_bucket()
write_prefix = "housing-prices-prediction-mme-demo"
region = sagemaker_session.boto_region_name
s3_client = boto3.client("s3", region_name=region)
sm_client = boto3.client("sagemaker", region_name=region)
sm_runtime_client = boto3.client("sagemaker-runtime")
role = sagemaker.get_execution_role()
# S3 locations used for parameterizing the notebook run
read_bucket = "sagemaker-sample-files"
read_prefix = "models/house_price_prediction"
model_prefix = "models/xgb-hpp"
# S3 location of trained model artifact
model_artifacts = f"s3://{default_bucket}/{model_prefix}/"
# Location
location = ['Chicago_IL', 'Houston_TX', 'NewYork_NY', 'LosAngeles_CA']
test_data = [1997, 2527, 6, 2.5, 0.57, 1]
bucket = s3.Bucket(default_bucket)
for loc in location:
    copy_source = {'Bucket': read_bucket, 'Key': f"{read_prefix}/{loc}.tar.gz"}
    bucket.copy(copy_source, f"{model_prefix}/{loc}.tar.gz")
# Retrieve the SageMaker managed XGBoost image
training_image = retrieve(framework="xgboost", region=region, version="1.3-1")
# Specify a unique model name that does not already exist
model_name = "housing-prices-prediction-mme-xgb"
primary_container = {
"Image": training_image,
"ModelDataUrl": model_artifacts,
"Mode": "MultiModel"
}
model_matches = sm_client.list_models(NameContains=model_name)["Models"]
if not model_matches:
    model = sm_client.create_model(
        ModelName=model_name,
        PrimaryContainer=primary_container,
        ExecutionRoleArn=role
    )
else:
    print(f"Model with name {model_name} already exists! Change model name to create new")
# Endpoint Config name
endpoint_config_name = f"{model_name}-endpoint-config"
# Create the endpoint config if one with the same name does not exist
endpoint_config_matches = sm_client.list_endpoint_configs(NameContains=endpoint_config_name)["EndpointConfigs"]
if not endpoint_config_matches:
    endpoint_config_response = sm_client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ProductionVariants=[
            {
                "InstanceType": "ml.m5.xlarge",
                "InitialInstanceCount": 1,
                "InitialVariantWeight": 1,
                "ModelName": model_name,
                "VariantName": "AllTraffic",
            }
        ],
    )
else:
    print(f"Endpoint config with name {endpoint_config_name} already exists! Change endpoint config name to create new")
# Endpoint name
endpoint_name = f"{model_name}-endpoint"
endpoint_matches = sm_client.list_endpoints(NameContains=endpoint_name)["Endpoints"]
if not endpoint_matches:
    endpoint_response = sm_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name
    )
else:
    print(f"Endpoint with name {endpoint_name} already exists! Change endpoint name to create new")
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
while status == "Creating":
    print(f"Endpoint Status: {status}...")
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
print(f"Endpoint Status: {status}")
Step 5: Clean Up Resources
boto3.client("sagemaker").delete_model(ModelName=model_name)
boto3.client("sagemaker").delete_endpoint_config(EndpointConfigName=endpoint_config_name)
boto3.client("sagemaker").delete_endpoint(EndpointName=endpoint_name)
Also delete CloudFormation stack that will delete objects from your S3 bucket and delete the resources created.
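For repeated experiments, the teardown can be wrapped in one idempotent helper. A sketch: errors (e.g. a resource that was already deleted) are logged and skipped so the function can be re-run safely:

```python
def cleanup(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, its config, and the model, skipping already-deleted resources."""
    deleted = []
    for action, kwargs in [
        (sm_client.delete_endpoint, {"EndpointName": endpoint_name}),
        (sm_client.delete_endpoint_config, {"EndpointConfigName": endpoint_config_name}),
        (sm_client.delete_model, {"ModelName": model_name}),
    ]:
        try:
            action(**kwargs)
            deleted.append(kwargs)
        except Exception as exc:  # e.g. botocore ClientError when the resource is gone
            print(f"Skipping {kwargs}: {exc}")
    return deleted

# Usage (requires AWS credentials):
# cleanup(boto3.client("sagemaker"), endpoint_name, endpoint_config_name, model_name)
```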
Conclusion
You've successfully deployed a set of XGBoost models to a multi-model endpoint using Amazon SageMaker. This pattern is ideal for scalable deployments of similar models—saving cost and simplifying management.
For advanced production setups, consider adding model versioning, endpoint autoscaling, or using SageMaker Pipelines to automate the lifecycle.
Ready to take it further? Try deploying a multi-model endpoint for NLP tasks, or integrate with a RESTful API gateway for external inference calls.
🔚 Call to Action
Choosing the right platform depends on your organization's needs. For more insights, subscribe to our newsletter for cloud computing tips and the latest technology trends, or follow our video series on cloud comparisons.
Interested in getting your organization set up on the cloud? Contact us and we'll be glad to help you start your cloud journey.
💬 Comment below:
Which tool is your favorite? What do you want us to review next?