
Laptop Support

Our approach

There are already many ways to run SLMs on laptop and desktop computers, with varying levels of support for different operating systems, chip architectures, hardware backends, and more.

To support this large (and growing) ecosystem, we are actively collaborating with, supporting, and contributing to the open source community to provide efficient and fast inference performance on as many systems as possible – especially for Liquid Foundation Models (LFMs).

Specifically, for most users and environments, we recommend and support llama.cpp as the primary engine for running LFMs and other LEAP models on Windows, Mac, and Linux machines.

We will continue to evaluate different inference engines and update our recommendation regularly, potentially on a per-platform basis.

Running on laptops/desktops

There are two main ways to run models via llama.cpp:

  • Run as an OpenAI-compatible HTTP server
  • Via language-specific library bindings

OpenAI-compatible server

If running an HTTP server locally on your hardware is sufficient for your use case, using llama.cpp directly via the command line is the fastest and easiest way to get started. You can use our tool below to help determine which llama.cpp binary is best for your environment, or browse the full list of pre-compiled binaries here.

Your best pre-compiled binary: -

Once you have your appropriate binary, you can use it to run any GGUF file downloaded from the LEAP model library (or elsewhere). For more detailed instructions, see the llama.cpp quickstart instructions.
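As a quick illustration, once llama-server is running on a downloaded GGUF (for example, launched with something like `llama-server -m LFM2-1.2B-Q4_0.gguf --port 8080`), any OpenAI-compatible client can talk to it. The sketch below uses the openai Python package against a local endpoint; the model name, port, and prompt are illustrative assumptions, not a prescribed setup.

```python
# Minimal sketch: query a locally running llama-server via its
# OpenAI-compatible API. Assumes the server was started with something like:
#   llama-server -m LFM2-1.2B-Q4_0.gguf --port 8080
# (model file, port, and prompt below are placeholders)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # llama-server's OpenAI-compatible endpoint
    api_key="sk-no-key-required",         # a local llama-server does not check the key
)

response = client.chat.completions.create(
    model="LFM2-1.2B-Q4_0",  # the server serves whichever model it was launched with
    messages=[{"role": "user", "content": "Summarize what an SLM is in one sentence."}],
    max_tokens=128,
)

print(response.choices[0].message.content)
```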

If you are considering investing in hardware, here are some profiling results from a variety of machines and inference backends. As it currently stands, AMD Ryzen™ machines generally have best-in-class performance with relatively standard llama.cpp configuration settings – and with custom configurations, this advantage tends to increase.

| Device | Prefill speed (tok/s) | Decode speed (tok/s) |
| --- | --- | --- |
| AMD Ryzen™ AI Max+ 395 | 5476 | 143 |
| AMD Ryzen™ AI 9 HX 370 | 2680 | 113 |
| Apple Mac Mini (M4) | 1427 | 122 |
| Qualcomm Snapdragon™ X1E-78-100 | 978 | 125 |
| Intel Core™ Ultra 9 185H | 1310 | 58 |
| Intel Core™ Ultra 7 258V | 1104 | 78 |

Note: for fair comparison, we conducted these benchmarks on the same model (LFM2-1.2B-Q4_0.gguf). For each hardware device, we also tested across all publicly available llama.cpp binaries, with different thread counts (4, 8, 12) for CPU runners, and took the best performing numbers for prefill and decode independently.
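If you want to run a similar sweep on your own machine, the sketch below shows one way to script it. It assumes the llama-bench tool from a llama.cpp build is on your PATH and that the GGUF path is valid; exact flag names can differ between llama.cpp releases.

```python
# Rough sketch of a per-thread-count benchmark sweep using llama-bench.
# The model path is a placeholder, and flag names (-m, -t, -p, -n) reflect
# current llama.cpp releases but may change over time.
import subprocess

MODEL = "LFM2-1.2B-Q4_0.gguf"  # any GGUF from the LEAP model library

for threads in (4, 8, 12):
    print(f"--- threads: {threads} ---")
    subprocess.run(
        [
            "llama-bench",
            "-m", MODEL,
            "-t", str(threads),
            "-p", "512",   # prompt (prefill) length in tokens
            "-n", "128",   # generation (decode) length in tokens
        ],
        check=True,
    )
```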

Language bindings

If you are interested in a more embedded inference experience, there are a variety of open-source language bindings that you can use within your particular project. These bindings generally act as any other importable library within your application source code.
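For instance, with the llama-cpp-python binding (one popular option among several; other languages have equivalent libraries), loading and prompting a GGUF looks roughly like the sketch below. The model path, context size, and thread count are placeholder values.

```python
# Rough sketch using the llama-cpp-python binding (pip install llama-cpp-python).
# Model path, context size, and thread count are placeholders to adjust for your setup.
from llama_cpp import Llama

llm = Llama(
    model_path="./LFM2-1.2B-Q4_0.gguf",  # any GGUF from the LEAP model library
    n_ctx=4096,       # context window
    n_threads=8,      # CPU threads; tune to your machine
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me one tip for on-device inference."}],
    max_tokens=128,
)

print(result["choices"][0]["message"]["content"])
```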

We have provided some very simple example apps that demonstrate how to use two of the more popular language bindings:

The full list of bindings can be found here.

While language bindings can often provide better performance and more control, support and maturity vary across languages and libraries, especially for newer model architectures. Check how recently each binding's publicly released version was updated when deciding which to use.
