Adolf Hohl
Efficient deployment and inference of GPU-accelerated LLMs
#1 · about 2 minutes
The evolution of generative AI from experimentation to production
Generative AI has rapidly moved from experimentation with models like Llama and Mistral to production-ready applications in 2024.
#2 · about 3 minutes
Comparing managed AI services with the DIY approach
Managed services offer ease of use but limited control, while a do-it-yourself approach provides full control but introduces significant complexity.
#3 · about 4 minutes
Introducing NVIDIA NIM for simplified LLM deployment
NVIDIA Inference Microservices (NIM) provide a containerized, OpenAI-compatible solution for deploying models anywhere with enterprise support.
#4 · about 2 minutes
Boosting inference throughput with lower precision quantization
Using lower precision formats like FP8 dramatically increases model inference throughput, providing more performance for the same hardware investment.
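A back-of-the-envelope illustration of why this helps (a sketch, not a benchmark from the talk): LLM inference is usually bound by how fast weights stream from GPU memory, so halving the bytes per parameter roughly halves the traffic per generated token and frees memory for larger batches.

```python
# Rough weight-memory arithmetic for FP16 vs. FP8.
# Illustrative only; real gains depend on which layers are quantized
# and on FP8 kernel support in the serving stack (e.g. TensorRT-LLM).
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}

def weight_footprint_gb(num_params: float, precision: str) -> float:
    """Approximate memory needed just for the weights, in GB."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

params = 70e9  # hypothetical 70B-parameter model
for precision in ("fp16", "fp8"):
    print(f"{precision}: ~{weight_footprint_gb(params, precision):.0f} GB")
# fp16: ~140 GB vs. fp8: ~70 GB -- the freed memory can hold a larger
# KV cache or batch, which is where the throughput gain comes from.
```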
#5 · about 2 minutes
Overview of the NVIDIA AI Enterprise software platform
The NVIDIA AI Enterprise platform is a cloud-native software stack that abstracts away low-level layers such as CUDA to streamline AI pipeline development.
#6 · about 2 minutes
A look inside the NIM container architecture
NIM containers bundle optimized inference tools like TensorRT-LLM and Triton Inference Server to accelerate models on specific GPU hardware.
#7 · about 3 minutes
How to run and interact with a NIM container
A NIM container can be launched with a simple Docker command, automatically discovering hardware and exposing OpenAI-compatible API endpoints for interaction.
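A minimal sketch of that interaction, assuming a NIM container is already serving on localhost port 8000 (the model ID below is illustrative; the actual name depends on the container you pulled). Because the endpoints are OpenAI-compatible, the standard `openai` Python client works once its base URL points at the container:

```python
# Talk to a locally running NIM container through its
# OpenAI-compatible API (requires `pip install openai`).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local NIM endpoint
    api_key="not-needed-locally",         # placeholder; a local NIM may not check it
)

response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",  # illustrative model ID
    messages=[{"role": "user", "content": "What does NIM stand for?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```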
#8 · about 2 minutes
Efficiently serving custom models with LoRA adapters
NIM enables serving multiple customized LoRA adapters on a single base model simultaneously, saving memory while providing distinct model endpoints.
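Since every adapter is exposed as its own model name behind the shared endpoint, switching customizations is just a change to the `model` field of the request. A hedged sketch (the adapter names are invented for illustration):

```python
# Two hypothetical LoRA adapters served on one shared base model;
# only the `model` field differs between the requests.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

for adapter in ("llama3-8b-sql-lora", "llama3-8b-support-lora"):  # invented names
    reply = client.chat.completions.create(
        model=adapter,  # routed to the base model plus this adapter's weights
        messages=[{"role": "user", "content": "Introduce yourself briefly."}],
        max_tokens=48,
    )
    print(f"{adapter}: {reply.choices[0].message.content}")
```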
#9 · about 3 minutes
How NIM automatically handles hardware and model optimization
NIM simplifies deployment by automatically selecting the best pre-compiled model based on the detected GPU architecture and user preference for latency or throughput.
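Conceptually this amounts to a lookup keyed on the detected GPU architecture and the user's optimization goal. The sketch below only illustrates that idea; the profile names and fields are invented, and this is not NIM's actual implementation:

```python
# Conceptual illustration of profile selection (invented data;
# NIM's real profiles and matching logic are more involved).
from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    gpu_arch: str  # e.g. "hopper", "ampere"
    goal: str      # "latency" or "throughput"
    engine: str    # which pre-compiled engine to load

PROFILES = (
    Profile("hopper", "latency", "trtllm-fp8-latency"),
    Profile("hopper", "throughput", "trtllm-fp8-throughput"),
    Profile("ampere", "throughput", "trtllm-fp16-throughput"),
)

def select_engine(detected_arch: str, goal: str) -> str:
    """Return the best matching pre-compiled engine, or a generic
    fallback when no build exists for the detected GPU."""
    for p in PROFILES:
        if p.gpu_arch == detected_arch and p.goal == goal:
            return p.engine
    return "generic-fallback-engine"

print(select_engine("hopper", "throughput"))  # -> trtllm-fp8-throughput
```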
Matching moments
02:16 · Deploying models with TensorRT and Triton Inference Server (from: Trends, Challenges and Best Practices for AI at the Edge)
02:14 · Deploying enterprise AI applications with NVIDIA NIM (from: WWC24 - Ankit Patel - Unlocking the Future Breakthrough Application Performance and Capabilities with NVIDIA)
03:08 · Deploying and scaling models with NVIDIA NIM on Kubernetes (from: LLMOps-driven fine-tuning, evaluation, and inference with NVIDIA NIM & NeMo Microservices)
01:37 · Introduction to large-scale AI infrastructure challenges (from: Your Next AI Needs 10,000 GPUs. Now What?)
01:32 · The technical challenges of running LLMs in browsers (from: From ML to LLM: On-device AI in the Browser)
05:47 · Using NVIDIA NIMs and blueprints to deploy models (from: Your Next AI Needs 10,000 GPUs. Now What?)
03:08 · Deploying custom inference workloads with NVIDIA NIMs (from: From foundation model to hosted AI solution in minutes)
12:42 · Running large language models locally with Web LLM (from: Generative AI power on the web: making web apps smarter with WebGPU and WebNN)
Related Videos
Your Next AI Needs 10,000 GPUs. Now What? · Anshul Jindal & Martin Piercy
LLMOps-driven fine-tuning, evaluation, and inference with NVIDIA NIM & NeMo Microservices · Anshul Jindal
Self-Hosted LLMs: From Zero to Inference · Roberto Carratalá & Cedric Clyburn
WWC24 - Ankit Patel - Unlocking the Future Breakthrough Application Performance and Capabilities with NVIDIA · Ankit Patel
Unveiling the Magic: Scaling Large Language Models to Serve Millions · Patrick Koss
Unlocking the Power of AI: Accessible Language Model Tuning for All · Cedric Clyburn & Legare Kerrison
Exploring LLMs across clouds · Tomislav Tipurić
Adding knowledge to open-source LLMs · Sergio Perez & Harshita Seth
From learning to earning
Jobs that call for the skills explored in this talk.
Peter Park System GmbH · München, Germany · Intermediate / Senior · Bash, Linux, Python
Peter Park System GmbH · München, Germany · Senior · Python, Docker, Node.js, JavaScript