vLLM

Operate large language models with high performance on your own infrastructure

vLLM is the open-source inference engine for productive LLM serving: high throughput, efficient GPU utilization and an OpenAI-compatible API – GDPR-compliant and under your control, built for you by specialists.

How we work with you

You don’t have to set up vLLM on your own. We accompany you step by step – and stay by your side afterwards.

Step 1

Analysis & concept

We look at your use cases, models and existing GPU hardware and plan together which setup really fits. We know the pitfalls from practical experience - so you avoid oversized hardware or a setup that collapses under load.
"
Step 2

Structure & integration

We set up vLLM to fit perfectly: Model selection, quantization, GPU allocation and the OpenAI-compatible API - neatly integrated into your systems, if desired via Docker and Kubernetes. A well thought-out design saves you expensive conversions later on and scales with your needs.
"
Step 3

Commissioning & Serving

Your LLM endpoint goes live and serves many simultaneous requests efficiently - thanks to PagedAttention and Continuous Batching at maximum GPU utilization. This avoids expensive idle times and latency peaks that affect productive applications.
"
Step 4

Support & Operation

On request, we can take over ongoing operations completely (outsourcing) or support your team with support and training. Updates, scaling and GPU monitoring take up a lot of time internally - we keep your LLM serving stable so that you can concentrate on your core business.

vLLM Features

Operate large language models on your own GPU infrastructure in a high-performance and GDPR-compliant manner

Shaping IT together

We help you to strategically plan, technically implement and sustainably operate modern AI and inference solutions. We combine consulting, implementation and support to create a tailor-made service that is geared to your requirements. Our aim is to make high-performance LLM deployments transparent, stable and efficient to use.

Managed AI Models

Smart AI via an API – without compromising on data protection

With our Managed AI Models, you can use powerful open source models directly via a simple API – ready to use in just a few minutes. Your data is transmitted exclusively in encrypted form, not stored and not used for training, hosted in our ISO-certified data centers in Germany. You only pay for what you use: token-based, with full cost control. Get in touch with our team if you need a customized solution.

Know-how

More know-how about Ansible

Questions & Answers

The most frequently asked questions about vLLM

What is vLLM simply explained?

2
3
vLLM is an open-source inference engine with which you can provide large language models on your own GPU servers. It ensures that many requests are processed efficiently at the same time and makes the model available via an OpenAI-compatible interface. This allows you to operate AI models with high performance in your own environment.

Is vLLM open source?

2
3
Yes. vLLM is open source under the Apache 2.0 license and is developed by a large community of research institutions and companies. The complete source code is publicly available on GitHub, and you can use vLLM freely.

What does vLLM cost?

2
3
The software itself is free of charge. Costs are incurred for the GPU hardware or cloud instances as well as for our services relating to implementation, operation and training. On request, we operate vLLM managed via NWS. Just ask us about the possible costs.

Is vLLM GDPR-compliant?

2
3
vLLM is pure software and does not store any data itself. If you run it on your own hardware or in a European data center, your data never leaves your environment. This makes vLLM particularly suitable for data protection-sensitive use cases - a clear advantage over external AI APIs.

What hardware do I need for vLLM?

2
3
This depends on the model size. Smaller models already run on a single modern GPU, while large models require several GPUs with distributed inference. We analyze your needs and recommend the right hardware - or you can use GPU instances via NWS.

How does vLLM differ from Ollama?

2
3
Ollama is optimized for local development on a single computer. vLLM is aimed at productive operation with many parallel users on GPU servers. Those who put LLMs into production generally rely on vLLM - both complement each other in the typical development stack.

Which models does vLLM support?

2
3
vLLM supports over 200 model architectures from Hugging Face, including popular open source models such as Llama, Mistral, Qwen and many more. Both instruction-tuned and basic models can be operated, as well as your own fine tunes.

We look forward to your message






    captcha

    We look forward to your message






      captcha