Laptop Support
Our approach
There are already many ways to run SLMs on laptop and desktop computers, with varying levels of support for different operating systems, chip architectures, hardware backends, and more.
To support this large (and growing) ecosystem, we are actively collaborating with, supporting, and contributing to the open source community to provide fast, efficient inference on as many systems as possible, especially for Liquid Foundation Models (LFMs).
Specifically, for most users and environments, we recommend and support llama.cpp as the primary engine for running LFMs and other LEAP models on Windows, Mac, and Linux machines.
We will continue to evaluate different inference engines on a regular basis and update our recommendation accordingly, potentially on a per-platform basis.
Running on laptops/desktops
There are two main ways to run models via llama.cpp:
- Run as an OpenAI-compatible HTTP server
- Via language-specific library bindings
OpenAI-compatible server
If running an HTTP server locally on your hardware is sufficient for your use case, using llama.cpp directly via the command line is the fastest and easiest way to get started. You can use our tool below to help determine which llama.cpp binary is best for your environment, or browse the full list of pre-compiled binaries here.
Once you have the appropriate binary, you can use it to run any GGUF file downloaded from the LEAP model library (or elsewhere). For more detailed instructions, see the llama.cpp quickstart instructions.
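As a quick illustration, here is a minimal sketch of how a client might query a locally running llama-server instance through its OpenAI-compatible API, using the OpenAI Python client. The port (8080 is llama-server's default), the model file name, and the prompt are assumptions for the example; adjust them to match how you launched the server.

```python
# Minimal sketch: querying a local llama-server instance via its
# OpenAI-compatible endpoint. Assumes the server was started with something
# like `llama-server -m LFM2-1.2B-Q4_0.gguf --port 8080` (model file and
# port are illustrative, not prescribed).
from openai import OpenAI

# llama-server ignores the API key, but the client library requires one.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="LFM2-1.2B-Q4_0.gguf",  # informational; the server uses its loaded model
    messages=[
        {"role": "user", "content": "Give me a one-sentence summary of llama.cpp."}
    ],
    max_tokens=128,
)

print(response.choices[0].message.content)
```

Because the server speaks the OpenAI API, any existing OpenAI-compatible client or framework can be pointed at it simply by changing the base URL.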
If you are considering investing in hardware, here are some profiling results from a variety of machines and inference backends. As it currently stands, AMD Ryzen™ machines generally offer best-in-class performance with relatively standard llama.cpp configuration settings, and with custom configurations this advantage tends to increase.
| Device | Prefill speed (tok/s) | Decode speed (tok/s) |
|---|---|---|
| AMD Ryzen™ AI Max+ 395 | 5476 | 143 |
| AMD Ryzen™ AI 9 HX 370 | 2680 | 113 |
| Apple Mac Mini (M4) | 1427 | 122 |
| Qualcomm Snapdragon™ X1E-78-100 | 978 | 125 |
| Intel Core™ Ultra 9 185H | 1310 | 58 |
| Intel Core™ Ultra 7 258V | 1104 | 78 |
Note: for fair comparison, we conducted these benchmarks on the same model (LFM2-1.2B-Q4_0.gguf). For each hardware device, we also tested across all publicly available llama.cpp binaries, with different thread counts (4, 8, 12) for CPU runners, and took the best performing numbers for prefill and decode independently.
Language bindings
If you are interested in a more embedded inference experience, there are a variety of open-source language bindings that you can use within your particular project. These bindings generally behave like any other importable library within your application source code.
We have provided some very simple example apps that demonstrate how to use two of the more popular language bindings:
The full list of bindings can be found here.
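For instance, here is a minimal sketch of running a local chat completion through the llama-cpp-python binding. The model path, context size, and thread count are assumptions for illustration; any GGUF file downloaded from the LEAP model library should work the same way.

```python
# Minimal sketch using the llama-cpp-python binding
# (install with `pip install llama-cpp-python`).
# The model path, context window, and thread count below are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="./LFM2-1.2B-Q4_0.gguf",  # any GGUF file from the LEAP model library
    n_ctx=4096,    # context window size
    n_threads=8,   # CPU threads; tune for your machine
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is a language binding?"}],
    max_tokens=128,
)

print(result["choices"][0]["message"]["content"])
```

Because the model runs in-process, there is no separate server to manage, and you can control loading, context size, and threading directly from your application code.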
While language bindings can often provide better performance and more control, support and maturity can be inconsistent across different languages and libraries, especially for newer model architectures. Check how recently the publicly released version of each binding was updated when deciding which to use.