Key MLOps processes (part 4): Serving and monitoring machine learning models

In this article, we describe the block of the scheme devoted to serving and monitoring machine learning models.

You can find the whole scheme describing the key MLOps processes here. Its main parts are horizontal blocks, each of which describes a procedural aspect of MLOps and is designed to solve specific tasks that keep the company’s ML services running without interruption.

ML models in production need to generate predictions. However, a trained machine learning model is just a file, and a file cannot generate predictions on its own. A standard solution found online is for the team to use FastAPI and write a Python wrapper around the model to “retrieve predictions.”
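
Below is a minimal sketch of such a wrapper, assuming a scikit-learn model serialized to model.pkl; the file name, request schema, and endpoint are illustrative only.

```python
# A minimal FastAPI wrapper around a trained model (sketch).
# Assumes a scikit-learn model pickled to model.pkl; names are illustrative.
import pickle

import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the trained model once, when the service starts
with open("model.pkl", "rb") as f:
    model = pickle.load(f)


class PredictRequest(BaseModel):
    features: list[float]


@app.post("/predict")
def predict(request: PredictRequest):
    # scikit-learn expects a 2D array: one row per prediction
    X = np.array([request.features])
    prediction = model.predict(X)
    return {"prediction": prediction.tolist()}
```

Run it with uvicorn, and the model becomes an HTTP endpoint; everything else in the list below still has to be built around it.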

[Figure: serving and monitoring machine learning models within the MLOps scheme]

If we look at this in more detail, quite a few things have to happen from the moment the team receives the ML model file. The team has to:

  • Write all the code to set up a RESTful service,
  • Implement all the necessary wrapper code around it,
  • Collect everything in a Docker image,
  • Eventually, spin up a container from this image somewhere,
  • Scale it in some way,
  • Organize metrics collection,
  • Configure alerts,
  • Set up rules for rolling out new model versions,
  • and much more.

Doing this for every model, and then maintaining that code base going forward, is a laborious task. To make it easier, dedicated serving tools have emerged, introducing three new entities into the system:

  • Inference Instance/Service,
  • Inference Server,
  • Serving Engine.

An Inference Instance or Inference Service is a specific ML model prepared to receive requests and generate predictive responses. In essence, such an entity can be represented by a container in a Kubernetes cluster, packaged with everything the model needs to run.

An Inference Server creates Inference Instances/Services. There are many implementations of Inference Servers, each of which works with specific ML frameworks, turning trained models into services that are ready to process input requests and generate predictions.

A Serving Engine performs the main management functions. It determines which Inference Server will be used, how many copies of the received Inference Instance need to be launched, and how to scale them.

In the scheme discussed here, model serving is not detailed down to the component level, but there are two analogous elements:

  • The CI/CD component, which handles the deployment of models ready for production (it can be considered a version of a Serving Engine), and
  • Model Serving, which organizes the generation of predictions for ML models within the available infrastructure, for both streaming and batch scenarios (it can be considered a version of an Inference Server).

[Figure: the CI/CD and Model Serving components in the MLOps scheme]

As an example of a complete serving stack, one can refer to the Seldon stack:

  • Seldon Core is a Serving Engine,
  • Seldon MLServer is an Inference Server, which exposes the model via REST or gRPC,
  • An MLServer Custom Runtime is an Inference Instance: a wrapper around an arbitrary ML model, an instance of which is launched to generate predictions (a minimal sketch of such a runtime follows this list).
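
The sketch below follows MLServer’s documented custom-runtime interface (an MLModel subclass with load and predict); the runtime class, the joblib artifact, and the output names are assumptions for illustration, not Seldon’s own example.

```python
# A sketch of an MLServer custom runtime (an Inference Instance wrapper).
# Assumes MLServer is installed and a joblib-serialized model; names are illustrative.
import joblib

from mlserver import MLModel
from mlserver.codecs import NumpyCodec
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput


class CustomModelRuntime(MLModel):
    async def load(self) -> bool:
        # Load the model artifact once, when the Inference Instance starts
        self._model = joblib.load("model.joblib")
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the V2-protocol input into a NumPy array
        model_input = self.decode(payload.inputs[0], default_codec=NumpyCodec)
        prediction = self._model.predict(model_input)
        # Wrap the result back into a V2-protocol response
        return InferenceResponse(
            model_name=self.name,
            outputs=[
                ResponseOutput(
                    name="predictions",
                    shape=list(prediction.shape),
                    datatype="FP32",
                    data=prediction.tolist(),
                )
            ],
        )
```

The Serving Engine (here, Seldon Core) then only needs to be told which runtime to launch and with how many replicas.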

There is even a standardized protocol for implementing serving, support for which is de facto mandatory in all similar tools. It is called the V2 Inference Protocol, and it was developed by several major market players: KServe, Seldon, and NVIDIA Triton.
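
Because the request and response format is standardized, calling any V2-compliant Inference Service looks roughly the same. The sketch below assumes a model named “my-model” served on localhost:8080; the host, port, model name, and input shape are placeholders.

```python
# A sketch of a V2 Inference Protocol call; host, port, and model name are placeholders.
import requests

payload = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [[5.1, 3.5, 1.4, 0.2]],
        }
    ]
}

# The /v2/models/{name}/infer endpoint is defined by the V2 Inference Protocol
response = requests.post(
    "http://localhost:8080/v2/models/my-model/infer",
    json=payload,
    timeout=10,
)
print(response.json())  # {"model_name": "my-model", "outputs": [...]}
```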

Serving vs deploy

In various sources, one can come across mentions of “Serving and Deploy” tools. However, it is important to understand the difference in their purpose. The distinction is debatable, but in this article we define it as follows:

Serving is about creating a model API and getting predictions from it, i.e., ultimately obtaining a single service instance with a model inside.

Deploy is about distributing the service instance in the required quantity to process incoming requests (think of a ReplicaSet behind a Kubernetes Deployment).
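
As a rough illustration of this distinction, the sketch below uses the official Kubernetes Python client to scale an already-served model; the Deployment and namespace names are placeholders.

```python
# A sketch of "deploy" as replica management for an already-served model.
# Assumes the kubernetes Python client and a Deployment named "inference-instance"
# in the "serving" namespace; both names are placeholders.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
apps = client.AppsV1Api()

# Scale the Inference Instance to three replicas to absorb incoming traffic
apps.patch_namespaced_deployment_scale(
    name="inference-instance",
    namespace="serving",
    body={"spec": {"replicas": 3}},
)
```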

There are many strategies for deploying models, but none of them are ML-specific. Incidentally, the paid version of Seldon supports several of these strategies, so you can choose this stack and let it handle the rollouts for you.

It is essential not to forget that model performance metrics must be tracked; otherwise, it will not be possible to solve emerging problems in time. How exactly to track them is a big topic in itself: Arize AI has built an entire business around it, but a Grafana plus VictoriaMetrics setup remains a perfectly viable option.
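
As one simple example of such tracking, the sketch below exposes basic serving metrics in Prometheus format, which Grafana (with VictoriaMetrics or Prometheus as the backend) can then visualize; the metric names, label, and port are illustrative.

```python
# A sketch of exposing model serving metrics for scraping by Prometheus/vmagent.
# Metric names, labels, and the port are illustrative.
import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTION_LATENCY = Histogram(
    "model_prediction_latency_seconds",
    "Time spent generating a single prediction",
)
PREDICTION_COUNT = Counter(
    "model_predictions_total",
    "Total number of predictions served",
    ["model_version"],
)


def predict_with_metrics(model, features, model_version="v1"):
    # Time every prediction and count it per model version
    start = time.perf_counter()
    prediction = model.predict(features)
    PREDICTION_LATENCY.observe(time.perf_counter() - start)
    PREDICTION_COUNT.labels(model_version=model_version).inc()
    return prediction


if __name__ == "__main__":
    # Expose /metrics on port 8000 for the monitoring agent to scrape
    start_http_server(8000)
```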

💡 You might also be interested in our article ‘Key MLOps processes (part 3): Automated machine learning workflow’ → https://optscale.ai/key-mlops-processes-part-3-automated-machine-learning-workflow.

✔️ OptScale, a FinOps & MLOps open source platform, which helps companies optimize cloud costs and bring more cloud usage transparency, is fully available under Apache 2.0 on GitHub → https://github.com/hystax/optscale.
