An introduction to embedding an LLM into your application (2024)

Hands on: Large language models (LLMs) are generally associated with chatbots such as ChatGPT, Copilot, and Gemini, but they're by no means limited to Q&A-style interactions. Increasingly, LLMs are being integrated into everything from IDEs to office productivity suites.

Besides content generation, these models can, given the right training, prompts, and guardrails, be used to gauge the sentiment of writing, identify topics in documents, or clean up data sources. As it turns out, baking LLMs into your application code to add this kind of language-based analysis isn't all that difficult, thanks to highly extensible inferencing engines such as Llama.cpp and vLLM. These engines take care of loading and parsing a model and performing inference with it.

In this hands-on, aimed at intermediate-level-or-higher developers, we'll be taking a look at a relatively new LLM engine written in Rust called Mistral.rs.

This open source project boasts support for a growing number of popular models, and not just those from Mistral, the startup that seemingly inspired the project's name. Plus, Mistral.rs can be tied into your code using Python, Rust, or OpenAI-compatible APIs, making it relatively easy to slot into new or existing projects.

But, before we jump into how to get Mistral.rs up and running, or the various ways it can be used to build generative AI models into your code, we need to discuss hardware and software requirements.

Hardware and software support

With the right flags, Mistral.rs works with Nvidia CUDA or Apple Metal GPUs, or can run directly on your CPU, although performance will be much slower in CPU mode. At the time of writing, the platform doesn't support AMD or Intel GPUs just yet.

In this guide, we're going to be looking at deploying Mistral.rs on an Ubuntu 22.04 system. The engine does support macOS, but, for the sake of simplicity, we're going to be sticking with Linux for this one.

We recommend a GPU with a minimum of 8GB of VRAM, or at least 16GB of system memory if running on your CPU; your mileage may vary depending on the model.

Nvidia users will also want to make sure they've got the latest proprietary drivers and CUDA binaries installed before proceeding. You can find more information on setting that up here.

Grabbing our dependencies

Installing Mistral.rs is fairly straightforward, and varies slightly depending on your specific use case. Before getting started, let's get the dependencies out of the way.

According to the Mistral.rs README, the only packages we need are libssl-dev and pkg-config. However, we found a few extra packages were necessary to complete the installation. Assuming you're running Ubuntu 22.04 like we are, you can install them by executing:

sudo apt install curl wget python3 python3-pip git build-essential libssl-dev pkg-config

Once those are out of the way, we can install and activate Rust by running the Rustup script.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"

Yes, this involves downloading and executing a script right away; if you prefer to inspect the script before it runs, the code for it is here.

By default, Mistral.rs uses Hugging Face to fetch models on our behalf. Because many of these models require you to be logged in before you can download them, we'll need to install huggingface_hub and authenticate by running:

pip install --upgrade huggingface_hub
huggingface-cli login

You'll be prompted to enter your Hugging Face access token, which you can create by visiting huggingface.co/settings/tokens.
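If you'd rather keep the token out of an interactive prompt, the same huggingface_hub package can authenticate from Python instead. This is a minimal sketch, assuming you've stashed your token in an environment variable we've arbitrarily called HF_TOKEN:

import os

from huggingface_hub import login

# Authenticate with Hugging Face using a token stored in an environment
# variable instead of pasting it into the CLI prompt.
login(token=os.environ["HF_TOKEN"])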

Installing Mistral.rs

With our dependencies installed, we can move on to deploying Mistral.rs itself. To start, we'll use git to pull down the latest release of Mistral.rs from GitHub and navigate to our working directory:

git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs

Here's where things get a little tricky, depending on how your system is configured and what kind of accelerator you're using. In this case, we'll be looking at CPU-based (slow) and CUDA-based (fast) inferencing in Mistral.rs.

For CPU-based inferencing, we can simply execute:

cargo build --release

Meanwhile, those with Nvidia-based systems will want to run:

cargo build --release --features cuda

This bit could take a few minutes to complete, so you may want to grab a cup of tea or coffee while you wait. After the executable is finished compiling, we can copy it to our working directory:

cp ./target/release/mistralrs-server ./mistralrs_server

Testing out Mistral.rs

With Mistral.rs installed, we can check that it actually works by running a test model, such as Mistral-7B-Instruct, in interactive mode. Assuming you've got a GPU with around 20GB or more of VRAM, you can just run:

./mistralrs_server -i plain -m mistralai/Mistral-7B-Instruct-v0.3 -a mistral

However, the odds are your GPU doesn't have the memory necessary to run the model at the 16-bit precision it was designed around. At that precision, you need 2GB of memory for every billion parameters, so a 7-billion-parameter model needs roughly 14GB for its weights alone, plus additional capacity for the key-value cache. And even if you have enough system memory to deploy it on your CPU, you can expect performance to be quite poor, as memory bandwidth will quickly become a bottleneck.
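To put rough numbers on that, here's a quick back-of-the-envelope calculation in Python. It counts only the weights and ignores the key-value cache and runtime overhead, so treat the results as a lower bound:

def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    # One parameter takes bits_per_param / 8 bytes, so a model with
    # params_billions * 1e9 parameters needs this many gigabytes for its weights.
    return params_billions * (bits_per_param / 8)

print(weight_memory_gb(7, 16))  # ~14 GB at 16-bit precision
print(weight_memory_gb(7, 4))   # ~3.5 GB once quantized down to 4 bits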

Instead, we want to use quantization to shrink the model to a more reasonable size. In Mistral.rs there are two ways to go about this. The first is to simply use in-situ quantization, which will download the full-sized model and then quantize it down to the desired size. In this case, we'll be quantizing the model down from 16 bits to 4 bits. We can do this by adding --isq Q4_0 to the previous command like so:

./mistralrs_server -i --isq Q4_0 plain -m mistralai/Mistral-7B-Instruct-v0.3 -a mistral

Note: If Mistral.rs crashes before finishing, you probably don't have enough system memory and may need to add a swapfile — we added a 24GB one — to complete the process. You can temporarily add and enable a swapfile — just remember to delete it after you reboot — by running:

sudo fallocate -l 24G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

Once the model has been quantized, you should be greeted with a chat-style interface where you can start querying the model. You should also notice that the model is using considerably less memory — around 5.9GB in our testing — and performance should be much better.

However, if you'd prefer not to quantize the model on the fly, Mistral.rs also supports pre-quantized GGUF and GGML files, for example these ones from Tom "TheBloke" Jobbins on Hugging Face.

The process is fairly similar, but this time we'll need to specify that we're running a GGUF model and set the ID and filename of the LLM we want. In this case, we'll download TheBloke's 4-bit quantized version of Mistral-7B-Instruct.

./mistralrs_server -i gguf --quantized-model-id TheBloke/Mistral-7B-Instruct-v0.2-GGUF --quantized-filename mistral-7b-instruct-v0.2.Q4_0.gguf

Putting the LLM to work

Running an interactive chatbot in a terminal is cool and all, but it isn't all that useful for building AI-enabled apps. Instead, Mistral.rs can be integrated into your code using Rust or Python APIs or via an OpenAI API-compatible HTTP server.

To start, we'll look at tying into the HTTP server, since it's arguably the easiest to use. In this example, we'll be using the same 4-bit quantized Mistral-7B model as our last example. Note that instead of starting Mistral.rs in interactive mode, we've replaced the -i flag with -p and provided the port we want the server to listen on.

./mistralrs_server -p 8342 gguf --quantized-model-id TheBloke/Mistral-7B-Instruct-v0.2-GGUF --quantized-filename mistral-7b-instruct-v0.2.Q4_0.gguf

Once the server is up and running, we can access it programmatically in a couple of different ways. The first would be to use curl to pass the instructions we want to give to the model. Here, we're posing the question: "In machine learning, what is a transformer?"

curl http://localhost:8342/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{"model": "Mistral-7B-Instruct-v0.2-GGUF","prompt": "In machine learning, what is a transformer?"}'

After a few seconds, the model should spit out a neat block of text formatted in JSON.
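If you want to handle that JSON from Python without pulling in the OpenAI client just yet, the requests library is enough. A minimal sketch, assuming the response follows the usual OpenAI completions layout with the generated text under choices[0].text (the same field the OpenAI client example below relies on):

import requests

# Post the same completion request the curl example sends, then pull the
# generated text out of the JSON response.
resp = requests.post(
    "http://localhost:8342/v1/completions",
    headers={"Authorization": "Bearer EMPTY"},
    json={
        "model": "Mistral-7B-Instruct-v0.2-GGUF",
        "prompt": "In machine learning, what is a transformer?",
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])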

We can also interact with the server using the OpenAI Python library, though you'll probably need to install it using pip first:

pip install openai

You can then call the Mistral.rs server using a template, such as this one written for completion tasks.

import openai

query = "In machine learning, what is a transformer?" # The prompt we want to pass to the LLM

client = openai.OpenAI(
    base_url="http://localhost:8342/v1", # The address of your Mistral.rs server
    api_key="EMPTY",
)

completion = client.completions.create(
    model="",
    prompt=query,
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)

print(completion.choices[0].text)
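The same client can be pointed at the chat-style endpoint too. We haven't exhaustively tested the options here, but since the server advertises OpenAI compatibility, a request along these lines should work; the message format is the standard OpenAI chat schema rather than anything Mistral.rs-specific:

import openai

# Point the OpenAI client at the local Mistral.rs server, as before.
client = openai.OpenAI(base_url="http://localhost:8342/v1", api_key="EMPTY")

# Chat-style request: the prompt goes in as a list of role-tagged messages
# rather than a raw completion string.
chat = client.chat.completions.create(
    model="",
    messages=[{"role": "user", "content": "In machine learning, what is a transformer?"}],
    max_tokens=256,
    temperature=0,
)

print(chat.choices[0].message.content)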

You can find more examples showing how to work with the HTTP server over in the Mistral.rs GitHub repo here.


Embedding Mistral.rs deeper into your projects

While convenient, the HTTP server isn't the only way to integrate Mistral.rs into our projects. You can achieve similar results using Rust or Python APIs.

Here's a basic example from the Mistral.rs repo showing how to use the project as a Rust crate – what the Rust world calls a library – to pass a query to Mistral-7B-Instruct and generate a response. Note: We found we had to make a few tweaks to the original example code to get it to run.

use std::sync::Arc;
use std::convert::TryInto;
use tokio::sync::mpsc::channel;

use mistralrs::{
    Constraint, Device, DeviceMapMetadata, GGUFLoaderBuilder, GGUFSpecificConfig, MistralRs,
    MistralRsBuilder, ModelDType, NormalRequest, Request, RequestMessage, Response,
    SamplingParams, SchedulerMethod, TokenSource,
};

fn setup() -> anyhow::Result<Arc<MistralRs>> {
    // Select a Mistral model
    // We do not use any files from HF servers here, and instead load the
    // chat template from the specified file, and the tokenizer and model from a
    // local GGUF file at the path `.`
    let loader = GGUFLoaderBuilder::new(
        GGUFSpecificConfig { repeat_last_n: 64 },
        Some("mistral.json".to_string()),
        None,
        ".".to_string(),
        "mistral-7b-instruct-v0.2.Q4_K_M.gguf".to_string(),
    )
    .build();
    // Load, into a Pipeline
    let pipeline = loader.load_model_from_hf(
        None,
        TokenSource::CacheToken,
        &ModelDType::Auto,
        &Device::cuda_if_available(0)?,
        false,
        DeviceMapMetadata::dummy(),
        None,
    )?;
    // Create the MistralRs, which is a runner
    Ok(MistralRsBuilder::new(pipeline, SchedulerMethod::Fixed(5.try_into().unwrap())).build())
}

fn main() -> anyhow::Result<()> {
    let mistralrs = setup()?;
    let (tx, mut rx) = channel(10_000);
    let request = Request::Normal(NormalRequest {
        messages: RequestMessage::Completion {
            text: "In machine learning, what is a transformer ".to_string(),
            echo_prompt: false,
            best_of: 1,
        },
        sampling_params: SamplingParams::default(),
        response: tx,
        return_logprobs: false,
        is_streaming: false,
        id: 0,
        constraint: Constraint::None,
        suffix: None,
        adapters: None,
    });
    mistralrs.get_sender().blocking_send(request)?;
    let response = rx.blocking_recv().unwrap();
    match response {
        Response::CompletionDone(c) => println!("Text: {}", c.choices[0].text),
        _ => unreachable!(),
    }
    Ok(())
}

If you want to test this out for yourself, start by stepping up out of the current directory, creating a folder for a new Rust project, and entering that directory. We could use cargo new to create the project, which is recommended, but this time we'll do it by hand so you can see the steps.

cd ..
mkdir test_app
cd test_app

Once there, you'll want to copy the mistral.json template from ../mistral.rs/chat_templates/ and download the mistral-7b-instruct-v0.2.Q4_K_M.gguf model file from Hugging Face.
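If you'd prefer to script that download rather than click through the Hugging Face website, the huggingface_hub package installed earlier can fetch the file directly. A small sketch, assuming you want the GGUF dropped into the current project folder:

from huggingface_hub import hf_hub_download

# Download TheBloke's 4-bit GGUF build of Mistral-7B-Instruct into this directory.
hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    local_dir=".",
)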

Next, we'll create a Cargo.toml file with the dependencies we need to build the app. This file tells the Rust toolchain details about your project. Inside this .toml file, paste the following:

[package]
name = "test_app"
version = "0.1.0"
edition = "2018"

[dependencies]
tokio = "1"
anyhow = "1"
mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", tag = "v0.1.18", features = ["cuda"] }

[[bin]]
name = "main"
path = "test_app.rs"

Note: You'll want to remove the , features = ["cuda"] part if you aren't using GPU acceleration.

Finally, paste the contents of the demo app above into a file called test_app.rs.

With these four files (test_app.rs, Cargo.toml, mistral-7b-instruct-v0.2.Q4_K_M.gguf, and mistral.json) in the same folder, we can test whether it works by running:

cargo run

After about a minute, you should see the answer to our query appear on screen.

Obviously, this is an incredibly rudimentary example, but it illustrates how Mistral.rs can be used to integrate LLMs into your Rust apps, by incorporating the crate and using its library interface.

If you're interested in using Mistral.rs in your Python or Rust projects, we highly recommend checking out its documentation for more information and examples.

We hope to bring you more stories on utilizing LLMs soon, so be sure to let us know what we should explore next in the comments. ®

Editor's Note: Nvidia provided The Register with an RTX 6000 Ada Generation graphics card to support this story and others like it. Nvidia had no input as to the contents of this article.

