An introduction to embedding an LLM into your application (2024)

Hands on: Large language models (LLMs) are generally associated with chatbots such as ChatGPT, Copilot, and Gemini, but they're by no means limited to Q&A-style interactions. Increasingly, LLMs are being integrated into everything from IDEs to office productivity suites.

Besides content generation, these models can, given the right training, prompts, and guardrails, be used to gauge the sentiment of writing, identify topics in documents, or clean up data sources. As it turns out, baking LLMs into your application code to add this kind of language-based analysis isn't all that difficult, thanks to highly extensible inferencing engines such as Llama.cpp and vLLM. These engines take care of loading and parsing a model and performing inference with it.

In this hands-on, aimed at intermediate-level-or-higher developers, we'll be taking a look at a relatively new LLM engine written in Rust called Mistral.rs.

This open source project boasts support for a growing number of popular models, and not just those from Mistral, the startup that seemingly inspired the project's name. Plus, Mistral.rs can be tied into your code using Python, Rust, or OpenAI-compatible APIs, making it relatively easy to slot into new or existing projects.

But, before we jump into how to get Mistral.rs up and running, or the various ways it can be used to build generative AI models into your code, we need to discuss hardware and software requirements.

Hardware and software support

With the right flags, Mistral.rs works with Nvidia CUDA or Apple Metal GPUs, or can run directly on your CPU, although performance will be much slower in CPU mode. At the time of writing, the platform doesn't support AMD or Intel GPUs just yet.

In this guide, we're going to be looking at deploying Mistral.rs on an Ubuntu 22.04 system. The engine does support macOS, but, for the sake of simplicity, we're going to be sticking with Linux for this one.

We recommend a GPU with a minimum of 8GB of VRAM, or at least 16GB of system memory if running on your CPU; your mileage may vary depending on the model.

Nvidia users will also want to make sure they've got the latest proprietary drivers and CUDA binaries installed before proceeding. You can find more information on setting that up here.

Grabbing our dependencies

Installing Mistral.rs is fairly straightforward, and varies slightly depending on your specific use case. Before getting started, let's get the dependencies out of the way.

According to the Mistral.rs README, the only packages we need are libssl-dev and pkg-config. However, we found a few extra packages were necessary to complete the installation. Assuming you're running Ubuntu 22.04 like we are, you can install them by executing:

sudo apt install curl wget python3 python3-pip git build-essential libssl-dev pkg-config

Once those are out of the way, we can install and activate Rust by running the Rustup script.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"

Yes, this involves downloading and executing a script right away; if you prefer to inspect the script before it runs, the code for it is here.

By default, Mistral.rs uses Hugging Face to fetch models on our behalf. Because many of these models require you to be logged in before you can download them, we'll need to install huggingface_hub and authenticate by running:

pip install --upgrade huggingface_hub
huggingface-cli login

You'll be prompted to enter your Hugging Face access token, which you can create by visiting huggingface.co/settings/tokens.
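If you'd rather keep the token out of an interactive prompt, the same huggingface_hub package can authenticate from Python instead. This is a minimal sketch, assuming you've stashed your token in an environment variable we've arbitrarily called HF_TOKEN:

import os

from huggingface_hub import login

# Authenticate with Hugging Face using a token stored in an environment
# variable instead of pasting it into the CLI prompt.
login(token=os.environ["HF_TOKEN"])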

Installing Mistral.rs

With our dependencies installed, we can move on to deploying Mistral.rs itself. To start, we'll use git to pull down the latest release of Mistral.rs from GitHub and navigate to our working directory:

git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs

Here's where things get a little tricky, depending on how your system is configured and what kind of accelerator you're using. In this case, we'll be looking at CPU-based (slow) and CUDA-based (fast) inferencing in Mistral.rs.

For CPU-based inferencing, we can simply execute:

cargo build --release

Meanwhile, those with Nvidia-based systems will want to run:

cargo build --release --features cuda

This bit could take a few minutes to complete, so you may want to grab a cup of tea or coffee while you wait. After the executable is finished compiling, we can copy it to our working directory:

cp ./target/release/mistralrs-server ./mistralrs_server

Testing out Mistral.rs

With Mistral.rs installed, we can check that it actually works by running a test model, such as Mistral-7B-Instruct, in interactive mode. Assuming you've got a GPU with around 20GB or more of VRAM, you can just run:

./mistralrs_server -i plain -m mistralai/Mistral-7B-Instruct-v0.3 -a mistral

However, the odds are your GPU doesn't have the memory necessary to run the model at the 16-bit precision it was designed around. At that precision, you need 2GB of memory for every billion parameters, so a 7-billion-parameter model needs roughly 14GB for its weights alone, plus additional capacity for the key-value cache. And even if you have enough system memory to deploy it on your CPU, you can expect performance to be quite poor, as memory bandwidth will quickly become a bottleneck.
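To put rough numbers on that, here's a quick back-of-the-envelope calculation in Python. It counts only the weights and ignores the key-value cache and runtime overhead, so treat the results as a lower bound:

def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    # One parameter takes bits_per_param / 8 bytes, so a model with
    # params_billions * 1e9 parameters needs this many gigabytes for its weights.
    return params_billions * (bits_per_param / 8)

print(weight_memory_gb(7, 16))  # ~14 GB at 16-bit precision
print(weight_memory_gb(7, 4))   # ~3.5 GB once quantized down to 4 bits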

Instead, we want to use quantization to shrink the model to a more reasonable size. In Mistral.rs there are two ways to go about this. The first is to simply use in-situ quantization, which will download the full-sized model and then quantize it down to the desired size. In this case, we'll be quantizing the model down from 16 bits to 4 bits. We can do this by adding --isq Q4_0 to the previous command like so:

./mistralrs_server -i --isq Q4_0 plain -m mistralai/Mistral-7B-Instruct-v0.3 -a mistral

Note: If Mistral.rs crashes before finishing, you probably don't have enough system memory and may need to add a swapfile — we added a 24GB one — to complete the process. You can temporarily add and enable a swapfile — just remember to delete it after you reboot — by running:

sudo fallocate -l 24G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

Once the model has been quantized, you should be greeted with a chat-style interface where you can start querying the model. You should also notice that the model is using considerably less memory — around 5.9GB in our testing — and performance should be much better.

However, if you'd prefer not to quantize the model on the fly, Mistral.rs also supports pre-quantized GGUF and GGML files, for example these ones from Tom "TheBloke" Jobbins on Hugging Face.

The process is fairly similar, but this time we'll need to specify that we're running a GGUF model and set the ID and filename of the LLM we want. In this case, we'll download TheBloke's 4-bit quantized version of Mistral-7B-Instruct.

./mistralrs_server -i gguf --quantized-model-id TheBloke/Mistral-7B-Instruct-v0.2-GGUF --quantized-filename mistral-7b-instruct-v0.2.Q4_0.gguf

Putting the LLM to work

Running an interactive chatbot in a terminal is cool and all, but it isn't all that useful for building AI-enabled apps. Instead, Mistral.rs can be integrated into your code using Rust or Python APIs or via an OpenAI API-compatible HTTP server.

To start, we'll look at tying into the HTTP server, since it's arguably the easiest to use. In this example, we'll be using the same 4-bit quantized Mistral-7B model as our last example. Note that instead of starting Mistral.rs in interactive mode, we've replaced the -i flag with -p and provided the port we want the server to listen on.

./mistralrs_server -p 8342 gguf --quantized-model-id TheBloke/Mistral-7B-Instruct-v0.2-GGUF --quantized-filename mistral-7b-instruct-v0.2.Q4_0.gguf

Once the server is up and running, we can access it programmatically in a couple of different ways. The first would be to use curl to pass the instructions we want to give to the model. Here, we're posing the question: "In machine learning, what is a transformer?"

curl http://localhost:8342/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{"model": "Mistral-7B-Instruct-v0.2-GGUF","prompt": "In machine learning, what is a transformer?"}'

After a few seconds, the model should spit out a neat block of text formatted in JSON.
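If you want to handle that JSON from Python without pulling in the OpenAI client just yet, the requests library is enough. A minimal sketch, assuming the response follows the usual OpenAI completions layout with the generated text under choices[0].text (the same field the OpenAI client example below relies on):

import requests

# Post the same completion request the curl example sends, then pull the
# generated text out of the JSON response.
resp = requests.post(
    "http://localhost:8342/v1/completions",
    headers={"Authorization": "Bearer EMPTY"},
    json={
        "model": "Mistral-7B-Instruct-v0.2-GGUF",
        "prompt": "In machine learning, what is a transformer?",
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])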

We can also interact with the server using the OpenAI Python library, though you'll probably need to install it using pip first:

pip install openai

You can then call the Mistral.rs server using a template, such as this one written for completion tasks.

import openai

query = "In machine learning, what is a transformer?" # The prompt we want to pass to the LLM

client = openai.OpenAI(
    base_url="http://localhost:8342/v1", # The address of your Mistral.rs server
    api_key="EMPTY",
)

completion = client.completions.create(
    model="",
    prompt=query,
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)

print(completion.choices[0].text)
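The same client can be pointed at the chat-style endpoint too. We haven't exhaustively tested the options here, but since the server advertises OpenAI compatibility, a request along these lines should work; the message format is the standard OpenAI chat schema rather than anything Mistral.rs-specific:

import openai

# Point the OpenAI client at the local Mistral.rs server, as before.
client = openai.OpenAI(base_url="http://localhost:8342/v1", api_key="EMPTY")

# Chat-style request: the prompt goes in as a list of role-tagged messages
# rather than a raw completion string.
chat = client.chat.completions.create(
    model="",
    messages=[{"role": "user", "content": "In machine learning, what is a transformer?"}],
    max_tokens=256,
    temperature=0,
)

print(chat.choices[0].message.content)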

You can find more examples showing how to work with the HTTP server over in the Mistral.rs GitHub repo here.


Embedding Mistral.rs deeper into your projects

While convenient, the HTTP server isn't the only way to integrate Mistral.rs into our projects. You can achieve similar results using Rust or Python APIs.

Here's a basic example from the Mistral.rs repo showing how to use the project as a Rust crate – what the Rust world calls a library – to pass a query to Mistral-7B-Instruct and generate a response. Note: We found we had to make a few tweaks to the original example code to get it to run.

use std::sync::Arc;
use std::convert::TryInto;
use tokio::sync::mpsc::channel;

use mistralrs::{
    Constraint, Device, DeviceMapMetadata, GGUFLoaderBuilder, GGUFSpecificConfig, MistralRs,
    MistralRsBuilder, ModelDType, NormalRequest, Request, RequestMessage, Response,
    SamplingParams, SchedulerMethod, TokenSource,
};

fn setup() -> anyhow::Result<Arc<MistralRs>> {
    // Select a Mistral model
    // We do not use any files from HF servers here, and instead load the
    // chat template from the specified file, and the tokenizer and model from a
    // local GGUF file at the path `.`
    let loader = GGUFLoaderBuilder::new(
        GGUFSpecificConfig { repeat_last_n: 64 },
        Some("mistral.json".to_string()),
        None,
        ".".to_string(),
        "mistral-7b-instruct-v0.2.Q4_K_M.gguf".to_string(),
    )
    .build();
    // Load, into a Pipeline
    let pipeline = loader.load_model_from_hf(
        None,
        TokenSource::CacheToken,
        &ModelDType::Auto,
        &Device::cuda_if_available(0)?,
        false,
        DeviceMapMetadata::dummy(),
        None,
    )?;
    // Create the MistralRs, which is a runner
    Ok(MistralRsBuilder::new(pipeline, SchedulerMethod::Fixed(5.try_into().unwrap())).build())
}

fn main() -> anyhow::Result<()> {
    let mistralrs = setup()?;
    let (tx, mut rx) = channel(10_000);
    let request = Request::Normal(NormalRequest {
        messages: RequestMessage::Completion {
            text: "In machine learning, what is a transformer ".to_string(),
            echo_prompt: false,
            best_of: 1,
        },
        sampling_params: SamplingParams::default(),
        response: tx,
        return_logprobs: false,
        is_streaming: false,
        id: 0,
        constraint: Constraint::None,
        suffix: None,
        adapters: None,
    });
    mistralrs.get_sender().blocking_send(request)?;
    let response = rx.blocking_recv().unwrap();
    match response {
        Response::CompletionDone(c) => println!("Text: {}", c.choices[0].text),
        _ => unreachable!(),
    }
    Ok(())
}

If you want to test this out for yourself, start by stepping up out of the current directory, creating a folder for a new Rust project, and entering that directory. We could use cargo new to create the project, which is recommended, but this time we'll do it by hand so you can see the steps.

cd ..
mkdir test_app
cd test_app

Once there, you'll want to copy the mistral.json template from ../mistral.rs/chat_templates/ and download the mistral-7b-instruct-v0.2.Q4_K_M.gguf model file from Hugging Face.
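If you'd prefer to script that download rather than click through the Hugging Face website, the huggingface_hub package installed earlier can fetch the file directly. A small sketch, assuming you want the GGUF dropped into the current project folder:

from huggingface_hub import hf_hub_download

# Download TheBloke's 4-bit GGUF build of Mistral-7B-Instruct into this directory.
hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    local_dir=".",
)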

Next, we'll create a Cargo.toml file with the dependencies we need to build the app. This file tells the Rust toolchain details about your project. Inside this .toml file, paste the following:

[package]
name = "test_app"
version = "0.1.0"
edition = "2018"

[dependencies]
tokio = "1"
anyhow = "1"
mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", tag = "v0.1.18", features = ["cuda"] }

[[bin]]
name = "main"
path = "test_app.rs"

Note: You'll want to remove the , features = ["cuda"] part if you aren't using GPU acceleration.

Finally, paste the contents of the demo app above into a file called test_app.rs.

With these four files (test_app.rs, Cargo.toml, mistral-7b-instruct-v0.2.Q4_K_M.gguf, and mistral.json) in the same folder, we can test whether it works by running:

cargo run

After about a minute, you should see the answer to our query appear on screen.

Obviously, this is an incredibly rudimentary example, but it illustrates how Mistral.rs can be used to integrate LLMs into your Rust apps, by incorporating the crate and using its library interface.

If you're interested in using Mistral.rs in your Python or Rust projects, we highly recommend checking out its documentation for more information and examples.

We hope to bring you more stories on utilizing LLMs soon, so be sure to let us know what we should explore next in the comments. ®

Editor's Note: Nvidia provided The Register with an RTX 6000 Ada Generation graphics card to support this story and others like it. Nvidia had no input as to the contents of this article.

