Laptop Support
Our approach
There are already many ways to run SLMs on laptop and desktop computers, with varying levels of support for different operating systems, chip architectures, hardware backends, and more.
To support this large (and growing) ecosystem, we are actively collaborating with, supporting, and contributing to the open source community to provide fast, efficient inference on as many systems as possible, especially for Liquid Foundation Models (LFMs).
Specifically, for most users and environments, we recommend and support llama.cpp as the primary engine for running LFMs and other LEAP models on Windows, Mac, and Linux machines.
We will continue to evaluate different inference engines on a regular basis and update our recommendation accordingly, potentially on a per-platform basis.
Running on laptops/desktops
There are two main ways to run models via llama.cpp:
- Run as an OpenAI-compatible HTTP server
- Via language-specific library bindings
OpenAI-compatible server
If running an HTTP server locally on your hardware is sufficient for your use case, using llama.cpp directly via the command line is the fastest and easiest way to get started. You can use our tool below to help determine which llama.cpp binary is best for your environment, or browse the full list of pre-compiled binaries here.
Once you have the appropriate binary, you can use it to run any GGUF file downloaded from the LEAP model library (or elsewhere). For more detailed instructions, see the llama.cpp quickstart instructions.
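As a quick illustration, here is a minimal sketch of how a client might query a locally running llama-server instance through its OpenAI-compatible API, using the OpenAI Python client. The port (8080 is llama-server's default), the model file name, and the prompt are assumptions for the example; adjust them to match how you launched the server.

```python
# Minimal sketch: querying a local llama-server instance via its
# OpenAI-compatible endpoint. Assumes the server was started with something
# like `llama-server -m LFM2-1.2B-Q4_0.gguf --port 8080` (model file and
# port are illustrative, not prescribed).
from openai import OpenAI

# llama-server ignores the API key, but the client library requires one.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="LFM2-1.2B-Q4_0.gguf",  # informational; the server uses its loaded model
    messages=[
        {"role": "user", "content": "Give me a one-sentence summary of llama.cpp."}
    ],
    max_tokens=128,
)

print(response.choices[0].message.content)
```

Because the server speaks the OpenAI API, any existing OpenAI-compatible client or framework can be pointed at it simply by changing the base URL.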
If you are considering investing in hardware, here are some profiling results from a variety of machines and inference backends. As it currently stands, AMD Ryzen™ machines generally offer best-in-class performance with relatively standard llama.cpp configuration settings, and with custom configurations this advantage tends to increase.
| Device | Prefill speed (tok/s) | Decode speed (tok/s) |
|---|---|---|
| AMD Ryzen™ AI Max+ 395 | 5476 | 143 |
| AMD Ryzen™ AI 9 HX 370 | 2680 | 113 |
| Apple Mac Mini (M4) | 1427 | 122 |
| Qualcomm Snapdragon™ X1E-78-100 | 978 | 125 |
| Intel Core™ Ultra 9 185H | 1310 | 58 |
| Intel Core™ Ultra 7 258V | 1104 | 78 |
Note: for fair comparison, we conducted these benchmarks on the same model (LFM2-1.2B-Q4_0.gguf). For each hardware device, we also tested across all publicly available llama.cpp binaries, with different thread counts (4, 8, 12) for CPU runners, and took the best performing numbers for prefill and decode independently.
Language bindings
If you are interested in a more embedded inference experience, there are a variety of open-source language bindings that you can use within your particular project. These bindings generally behave like any other importable library within your application source code.
We have provided some very simple example apps that demonstrate how to use two of the more popular language bindings:
The full list of bindings can be found here.
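For instance, here is a minimal sketch of running a local chat completion through the llama-cpp-python binding. The model path, context size, and thread count are assumptions for illustration; any GGUF file downloaded from the LEAP model library should work the same way.

```python
# Minimal sketch using the llama-cpp-python binding
# (install with `pip install llama-cpp-python`).
# The model path, context window, and thread count below are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="./LFM2-1.2B-Q4_0.gguf",  # any GGUF file from the LEAP model library
    n_ctx=4096,    # context window size
    n_threads=8,   # CPU threads; tune for your machine
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is a language binding?"}],
    max_tokens=128,
)

print(result["choices"][0]["message"]["content"])
```

Because the model runs in-process, there is no separate server to manage, and you can control loading, context size, and threading directly from your application code.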
While language bindings can often provide better performance and more control, support and maturity can be inconsistent across different languages and libraries, especially for newer model architectures. Check how recently the publicly released version of each binding was updated when deciding which to use.