
iOS API Spec

Latest version: v0.6.0.

Leap

Leap is the static entry point for loading on-device models.

public struct Leap {
    public static func load(
        url: URL,
        options: LiquidInferenceEngineOptions? = nil
    ) async throws -> ModelRunner
}

load(url:options:)

  • Loads a local model file (either a .bundle package or a .gguf checkpoint) and returns a ModelRunner instance.
  • Throws LeapError.modelLoadingFailure if the file cannot be loaded.
  • When an mmproj-*.gguf companion sits alongside the model, the loader forwards it to the backend so multimodal tokens are available (for both bundle and GGUF flows).
  • Must be called from an async context (for example inside an async function or a Task). Keep the returned ModelRunner alive while you need to interact with the model.

// ExecuTorch backend via .bundle
let bundleURL = Bundle.main.url(forResource: "qwen3-0_6b", withExtension: "bundle")!
let runner = try await Leap.load(url: bundleURL)

// llama.cpp backend via .gguf
let ggufURL = Bundle.main.url(forResource: "qwen3-0_6b", withExtension: "gguf")!
let ggufRunner = try await Leap.load(url: ggufURL)

LiquidInferenceEngineOptions

Pass a LiquidInferenceEngineOptions value when you need to override the default runtime configuration.

public struct LiquidInferenceEngineOptions {
    public var bundlePath: String
    public let cacheOptions: LiquidCacheOptions?
    public let cpuThreads: UInt32?
    public let contextSize: UInt32?
    public let nGpuLayers: UInt32?
    public let mmProjPath: String?
}

  • bundlePath: Path to the model file on disk. When you call Leap.load(url:), this is filled automatically.
  • cacheOptions: Configure persistence of KV-cache data between generations.
  • cpuThreads: Number of CPU threads for token generation.
  • contextSize: Override the default maximum context length for the model.
  • nGpuLayers: Number of layers to offload to GPU (for macOS/macCatalyst targets with Metal support).
  • mmProjPath: Path to an auxiliary multimodal projection model (required for some vision-capable models). Leave nil to enable automatic detection.

Backend selection is automatic: .bundle files run on the ExecuTorch backend, while .gguf checkpoints use the embedded llama.cpp backend. Bundled models reference their projection data in metadata; GGUF checkpoints look for a sibling mmproj-*.gguf unless you specify the path through LiquidInferenceEngineOptions. Make sure the projection is available if you want vision-capable checkpoints to consume image content.
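
If automatic detection does not fit your project layout, you can point to the projection explicitly. A minimal sketch, assuming an initializer that accepts bundlePath: and mmProjPath: labels matching the properties above; the file names are hypothetical:

// Hypothetical vision model plus its projection file, shipped in the app bundle.
let visionURL = Bundle.main.url(forResource: "lfm2-vl", withExtension: "gguf")!
let projURL = Bundle.main.url(forResource: "mmproj-lfm2-vl", withExtension: "gguf")!

let visionOptions = LiquidInferenceEngineOptions(
    bundlePath: visionURL.path,
    mmProjPath: projURL.path // bypass sibling detection and use this projection explicitly
)
let visionRunner = try await Leap.load(url: visionURL, options: visionOptions)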

Example overriding the number of CPU threads and context size:

let options = LiquidInferenceEngineOptions(
    bundlePath: bundleURL.path,
    cpuThreads: 6,
    contextSize: 8192
)
let runner = try await Leap.load(url: bundleURL, options: options)

ModelRunner

A ModelRunner represents a loaded model instance. The SDK returns concrete ModelRunner implementations, but your code only needs the protocol surface:

public protocol ModelRunner {
    func createConversation(systemPrompt: String?) -> Conversation
    func createConversationFromHistory(history: [ChatMessage]) -> Conversation
    func generateResponse(
        conversation: Conversation,
        generationOptions: GenerationOptions?,
        onResponseCallback: @escaping (MessageResponse) -> Void,
        onErrorCallback: ((LeapError) -> Void)?
    ) -> GenerationHandler
    func unload() async
    var modelId: String { get }
}

Lifecycle

  • Create conversations using createConversation(systemPrompt:) or createConversationFromHistory(history:).
  • Hold a strong reference to the ModelRunner for as long as you need to perform generations.
  • Call unload() when you are done to release native resources (optional, happens automatically on deinit).
  • Access modelId to identify the loaded model (for analytics, debugging, or UI labels).
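
A minimal lifecycle sketch tying these points together (the ChatSession wrapper and the didFinishSession() hook are illustrative app code, not part of the SDK):

final class ChatSession {
    private let runner: ModelRunner
    let conversation: Conversation

    init(runner: ModelRunner) {
        // Keep a strong reference to the runner for the lifetime of the session.
        self.runner = runner
        self.conversation = runner.createConversation(systemPrompt: "You are a helpful assistant.")
        print("Loaded model:", runner.modelId) // e.g. for analytics or a UI label
    }

    func didFinishSession() async {
        // Optional: release native resources now instead of waiting for deinit.
        await runner.unload()
    }
}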

Low-level generation API

generateResponse(...) drives generation with callbacks and returns a GenerationHandler you can store to cancel the run. Most apps call the higher-level streaming helpers on Conversation, but you can invoke this method directly when you need fine-grained control (for example, integrating with custom async primitives).

let handler = runner.generateResponse(
    conversation: conversation,
    generationOptions: options,
    onResponseCallback: { message in
        // Handle MessageResponse values here
    },
    onErrorCallback: { error in
        // Handle LeapError
    }
)

// Stop generation early if needed
handler.stop()

GenerationHandler

public protocol GenerationHandler: Sendable { func stop() }

The handler returned by ModelRunner.generateResponse or Conversation.generateResponse(..., onResponse:) lets you cancel generation without tearing down the conversation.

Conversation

Conversation tracks chat state and provides streaming helpers built on top of the model runner.

public class Conversation {
    public let modelRunner: ModelRunner
    public private(set) var history: [ChatMessage]
    public private(set) var functions: [LeapFunction]
    public private(set) var isGenerating: Bool

    public init(modelRunner: ModelRunner, history: [ChatMessage])

    public func registerFunction(_ function: LeapFunction)

    public func exportToJSON() throws -> [[String: Any]]

    public func generateResponse(
        userTextMessage: String,
        generationOptions: GenerationOptions? = nil
    ) -> AsyncThrowingStream<MessageResponse, Error>

    public func generateResponse(
        message: ChatMessage,
        generationOptions: GenerationOptions? = nil
    ) -> AsyncThrowingStream<MessageResponse, Error>

    @discardableResult
    public func generateResponse(
        message: ChatMessage,
        generationOptions: GenerationOptions? = nil,
        onResponse: @escaping (MessageResponse) -> Void
    ) -> GenerationHandler?
}

Properties

  • history: Copy of the accumulated chat messages. The SDK appends the assistant reply when a generation finishes successfully.
  • functions: Functions registered via registerFunction(_:) for function calling.
  • isGenerating: Boolean flag indicating whether a generation is currently running. Attempts to start a new generation while this is true immediately finish with an empty stream (or nil handler for the callback variant).

Streaming Convenience

The most common pattern is to use the async-stream helpers:

let user = ChatMessage(role: .user, content: [.text("Hello! What can you do?")])

Task {
    do {
        for try await response in conversation.generateResponse(
            message: user,
            generationOptions: GenerationOptions(temperature: 0.7)
        ) {
            switch response {
            case .chunk(let delta):
                print(delta, terminator: "")
            case .reasoningChunk(let thought):
                print("Reasoning:", thought)
            case .functionCall(let calls):
                handleFunctionCalls(calls)
            case .complete(let fullText, let info):
                print("\nComplete:", fullText)
                if let stats = info.stats {
                    print("Prompt tokens: \(stats.promptTokens), completions: \(stats.completionTokens)")
                }
            }
        }
    } catch {
        print("Generation failed: \(error)")
    }
}

Cancelling the task that iterates the stream stops generation and cleans up native resources.

Callback Convenience

Use generateResponse(message:onResponse:) when you prefer callbacks or need to integrate with imperative UI components:

let handler = conversation.generateResponse(message: user) { response in
    updateUI(with: response)
}

// Later
handler?.stop()

If a generation is already running, the method returns nil and emits a .complete message with finishReason == .stop via the callback.

The callback overload does not surface generation errors. Use the async-stream helper or call ModelRunner.generateResponse with onErrorCallback when you need error handling.

Export Chat History

exportToJSON() serializes the conversation history into a [[String: Any]] payload that mirrors OpenAI’s chat-completions format. This is useful for persistence, analytics, or debugging tools.
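
For example, you can write the exported payload to disk and later rebuild the conversation. The file location is arbitrary, and the round trip assumes the exported dictionaries are accepted by ChatMessage(from:):

let exported = try conversation.exportToJSON()
let data = try JSONSerialization.data(withJSONObject: exported, options: [.prettyPrinted])
let historyURL = FileManager.default.temporaryDirectory.appendingPathComponent("chat-history.json")
try data.write(to: historyURL)

// Later: rebuild the messages and resume on the same (or a newly loaded) runner.
let loaded = try JSONSerialization.jsonObject(with: Data(contentsOf: historyURL))
let messages = try (loaded as? [[String: Any]] ?? []).map { try ChatMessage(from: $0) }
let resumed = runner.createConversationFromHistory(history: messages)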

MessageResponse

public enum MessageResponse {
    case chunk(String)
    case reasoningChunk(String)
    case complete(String, GenerationCompleteInfo)
    case functionCall([LeapFunctionCall])
}

  • chunk: Partial assistant text emitted during streaming.
  • reasoningChunk: Model reasoning tokens wrapped between <think> / </think> (only for models that expose reasoning traces).
  • functionCall: One or more function/tool invocations requested by the model. See the Function Calling guide.
  • complete: Signals the end of generation. The first argument is the full assistant reply assembled by the SDK. GenerationCompleteInfo contains the finishReason (.stop or .exceed_context) and optional GenerationStats (prompt/completion token counts and throughput).

public struct GenerationCompleteInfo {
    public let finishReason: GenerationFinishReason
    public let stats: GenerationStats?
}

public struct GenerationStats {
    public var promptTokens: UInt64
    public var completionTokens: UInt64
    public var totalTokens: UInt64
    public var tokenPerSecond: Float
}

Errors surfaced during streaming are delivered through the thrown error of AsyncThrowingStream, or via the onErrorCallback closure when using the lower-level API.

Chat Messages

Roles

public enum ChatMessageRole: String {
    case user
    case system
    case assistant
    case tool
}

Include .tool messages when you append function-call results back into the conversation.

Message Structure

public struct ChatMessage {
    public var role: ChatMessageRole
    public var content: [ChatMessageContent]
    public var reasoningContent: String?
    public var functionCalls: [LeapFunctionCall]?

    public init(
        role: ChatMessageRole,
        content: [ChatMessageContent],
        reasoningContent: String? = nil,
        functionCalls: [LeapFunctionCall]? = nil
    )

    public init(from json: [String: Any]) throws
}

  • content: Ordered fragments of the message. The SDK currently supports .text and .image parts.
  • reasoningContent: Optional text produced inside <think> tags by eligible models.
  • functionCalls: Attach the calls returned by MessageResponse.functionCall when you include tool execution results in the history.
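
As an illustration, after executing a requested tool you can append its output as a .tool message with the original calls attached. The JSON payload and the calls variable are placeholders, and passing the tool message through generateResponse(message:) is assumed here; the Function Calling guide covers the full round trip:

// `calls` is the [LeapFunctionCall] array received via MessageResponse.functionCall.
let toolResult = ChatMessage(
    role: .tool,
    content: [.text(#"{"temperature_c": 21, "condition": "sunny"}"#)], // output produced by your own tool code
    functionCalls: calls
)

// Continue the conversation with the tool result included.
for try await response in conversation.generateResponse(message: toolResult) {
    process(response)
}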

Message Content

public enum ChatMessageContent {
    case text(String)
    case image(Data)

    public init(from json: [String: Any]) throws
}

Provide JPEG-encoded bytes for .image. Helper initializers (e.g. ChatMessageContent.fromUIImage) are available for UIKit/AppKit targets. Images are only consumed when the loaded model exposes multimodal weights and an mmproj-*.gguf companion was detected alongside the main GGUF.
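
A short sketch of a multimodal user message built from a UIImage on a UIKit target (the image value and JPEG quality are placeholders):

let jpegData = image.jpegData(compressionQuality: 0.8)! // `image` is your UIImage
let visionMessage = ChatMessage(
    role: .user,
    content: [
        .image(jpegData),
        .text("What is shown in this photo?")
    ]
)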

GenerationOptions

Tune generation behaviour with GenerationOptions.

public struct GenerationOptions {
    public var temperature: Float?
    public var topP: Float?
    public var minP: Float?
    public var repetitionPenalty: Float?
    public var jsonSchemaConstraint: String?
    public var functionCallParser: LeapFunctionCallParserProtocol?

    public init(
        temperature: Float? = nil,
        topP: Float? = nil,
        minP: Float? = nil,
        repetitionPenalty: Float? = nil,
        jsonSchemaConstraint: String? = nil,
        functionCallParser: LeapFunctionCallParserProtocol? = LFMFunctionCallParser()
    )
}

  • Leave a field as nil to fall back to the defaults packaged with the model bundle.
  • functionCallParser controls how tool-call tokens are parsed. Supply your own parser conforming to LeapFunctionCallParserProtocol to support custom formats, or set it to nil to receive raw tool-call text in MessageResponse.chunk.
  • jsonSchemaConstraint activates constrained generation. Use setResponseFormat(type:) to populate it from a type annotated with the @Generatable macro.

var options = GenerationOptions(temperature: 0.6, topP: 0.9)
try options.setResponseFormat(type: CityFact.self)

for try await response in conversation.generateResponse(
    message: user,
    generationOptions: options
) {
    // Handle structured output
}
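
If you prefer not to use the macro, you can also assign jsonSchemaConstraint directly. A plain JSON Schema object is assumed here; the Constrained Generation guide documents the exact dialect the engine expects:

var rawOptions = GenerationOptions(temperature: 0.2)
rawOptions.jsonSchemaConstraint = """
{
  "type": "object",
  "properties": {
    "city": { "type": "string" },
    "fact": { "type": "string" }
  },
  "required": ["city", "fact"]
}
"""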

Function Calling Types

public struct LeapFunction {
    public let name: String
    public let description: String
    public let parameters: [LeapFunctionParameter]
}

public struct LeapFunctionParameter {
    public let name: String
    public let type: LeapFunctionParameterType
    public let description: String
    public let optional: Bool
}

public indirect enum LeapFunctionParameterType: Codable, Equatable {
    case string(StringType)
    case number(NumberType)
    case integer(IntegerType)
    case boolean(BooleanType)
    case array(ArrayType)
    case object(ObjectType)
    case null(NullType)
}

The parameter type wrappers (StringType, NumberType, etc.) let you attach descriptions and enumerations so that JSON schemas accurately describe your tools. See the Function Calling guide for in-depth usage patterns.
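
As a hedged sketch only (this spec lists the stored properties but not the public initializers, so memberwise-style labels and a default StringType() are assumed; see the Function Calling guide for the exact constructors), a weather tool might be declared like this:

let weatherFunction = LeapFunction(
    name: "get_weather",
    description: "Look up the current weather for a city",
    parameters: [
        LeapFunctionParameter(
            name: "city",
            type: .string(StringType()),          // assumed default-configured string schema
            description: "City name, e.g. Kyoto",
            optional: false
        )
    ]
)
conversation.registerFunction(weatherFunction)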

Errors

Errors are surfaced as LeapError values. The most common cases are:

  • LeapError.modelLoadingFailure: Problems reading or validating the model bundle.
  • LeapError.generationFailure: Unexpected native inference errors.
  • LeapError.promptExceedContextLengthFailure: Prompt length exceeded the configured context size.
  • LeapError.serializationFailure: JSON encoding/decoding problems when working with chat history or function calls.

Handle thrown errors with do / catch when using async streams, or use the onErrorCallback in the lower-level API.
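
For example, when consuming the async stream (showAlert and the messages are placeholders; switching over the cases assumes LeapError is an enum, and additional cases may exist):

do {
    for try await response in conversation.generateResponse(message: userMessage) {
        process(response)
    }
} catch let error as LeapError {
    switch error {
    case .promptExceedContextLengthFailure:
        showAlert("The prompt does not fit the configured context size.")
    case .modelLoadingFailure, .generationFailure, .serializationFailure:
        showAlert("Generation failed: \(error)")
    default:
        showAlert("Unexpected error: \(error)")
    }
} catch {
    showAlert("Unexpected error: \(error)")
}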

Putting it together

let runner = try await Leap.load(url: bundleURL)
let conversation = runner.createConversation(systemPrompt: "You are a travel assistant.")
conversation.registerFunction(weatherFunction)

var options = GenerationOptions(temperature: 0.8)
try options.setResponseFormat(type: TripRecommendation.self)

let userMessage = ChatMessage(
    role: .user,
    content: [.text("Plan a 3-day trip to Kyoto with food highlights")]
)

for try await response in conversation.generateResponse(
    message: userMessage,
    generationOptions: options
) {
    process(response)
}

Refer to the Quick Start for end-to-end project setup, Function Calling for tool invocation, and Constrained Generation for structured outputs.
