iOS API Spec
Latest version: v0.6.0.
Leap
Leap is the static entry point for loading on-device models.
public struct Leap {
public static func load(
url: URL,
options: LiquidInferenceEngineOptions? = nil
) async throws -> ModelRunner
}
load(url:options:)
- Loads a local model file (either a .bundle package or a .gguf checkpoint) and returns a ModelRunner instance.
- Throws LeapError.modelLoadingFailure if the file cannot be loaded.
- When an mmproj-*.gguf companion sits alongside the model, the loader forwards it to the backend so multimodal tokens are available (for both bundle and GGUF flows).
- Must be called from an async context (for example inside an async function or a Task). Keep the returned ModelRunner alive while you need to interact with the model.
// ExecuTorch backend via .bundle
let bundleURL = Bundle.main.url(forResource: "qwen3-0_6b", withExtension: "bundle")!
let runner = try await Leap.load(url: bundleURL)
// llama.cpp backend via .gguf
let ggufURL = Bundle.main.url(forResource: "qwen3-0_6b", withExtension: "gguf")!
let ggufRunner = try await Leap.load(url: ggufURL)
LiquidInferenceEngineOptions
Pass a LiquidInferenceEngineOptions value when you need to override the default runtime configuration.
public struct LiquidInferenceEngineOptions {
public var bundlePath: String
public let cacheOptions: LiquidCacheOptions?
public let cpuThreads: UInt32?
public let contextSize: UInt32?
public let nGpuLayers: UInt32?
public let mmProjPath: String?
}
- bundlePath: Path to the model file on disk. When you call Leap.load(url:), this is filled automatically.
- cacheOptions: Configure persistence of KV-cache data between generations.
- cpuThreads: Number of CPU threads for token generation.
- contextSize: Override the default maximum context length for the model.
- nGpuLayers: Number of layers to offload to GPU (for macOS/macCatalyst targets with Metal support).
- mmProjPath: Path to an auxiliary multimodal projection model (required for some vision-capable models). Leave nil to enable automatic detection.
Backend selection is automatic: .bundle files run on the ExecuTorch backend, while .gguf
checkpoints use the embedded llama.cpp backend. Bundled models reference their projection data in
metadata; GGUF checkpoints look for a sibling mmproj-*.gguf unless you specify the path through
LiquidInferenceEngineOptions. Make sure the projection is available if you want vision-capable
checkpoints to consume image content.
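If automatic detection does not suit your file layout, you can point the loader at the projection explicitly. The snippet below is only a sketch: it assumes the LiquidInferenceEngineOptions initializer exposes an mmProjPath parameter, and the file names are illustrative.
// Sketch: explicitly pass the multimodal projection for a vision-capable GGUF checkpoint.
let visionModelURL = Bundle.main.url(forResource: "vision-model", withExtension: "gguf")!
let mmprojURL = Bundle.main.url(forResource: "mmproj-vision-model", withExtension: "gguf")!
let visionOptions = LiquidInferenceEngineOptions(
    bundlePath: visionModelURL.path,
    mmProjPath: mmprojURL.path
)
let visionRunner = try await Leap.load(url: visionModelURL, options: visionOptions)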
Example overriding the number of CPU threads and context size:
let options = LiquidInferenceEngineOptions(
bundlePath: bundleURL.path,
cpuThreads: 6,
contextSize: 8192
)
let runner = try await Leap.load(url: bundleURL, options: options)
ModelRunner
A ModelRunner represents a loaded model instance. The SDK returns concrete ModelRunner implementations, but your code only needs the protocol surface:
public protocol ModelRunner {
func createConversation(systemPrompt: String?) -> Conversation
func createConversationFromHistory(history: [ChatMessage]) -> Conversation
func generateResponse(
conversation: Conversation,
generationOptions: GenerationOptions?,
onResponseCallback: @escaping (MessageResponse) -> Void,
onErrorCallback: ((LeapError) -> Void)?
) -> GenerationHandler
func unload() async
var modelId: String { get }
}
Lifecycle
- Create conversations using createConversation(systemPrompt:) or createConversationFromHistory(history:).
- Hold a strong reference to the ModelRunner for as long as you need to perform generations.
- Call unload() when you are done to release native resources (optional, happens automatically on deinit).
- Access modelId to identify the loaded model (for analytics, debugging, or UI labels).
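Putting these lifecycle steps together, a minimal sketch (savedMessages is a hypothetical [ChatMessage] array restored from your own storage, and the snippet runs inside an async throwing context):
// Resume an earlier session, run one generation, then release native resources.
let conversation = runner.createConversationFromHistory(history: savedMessages)
for try await response in conversation.generateResponse(
    message: ChatMessage(role: .user, content: [.text("Continue where we left off.")])
) {
    print(response)
}
print("Finished with model:", runner.modelId)
await runner.unload()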
Low-level generation API
generateResponse(...) drives generation with callbacks and returns a GenerationHandler you can store to cancel the run. Most apps call the higher-level streaming helpers on Conversation, but you can invoke this method directly when you need fine-grained control (for example, integrating with custom async primitives).
let handler = runner.generateResponse(
conversation: conversation,
generationOptions: options,
onResponseCallback: { message in
// Handle MessageResponse values here
},
onErrorCallback: { error in
// Handle LeapError
}
)
// Stop generation early if needed
handler.stop()
GenerationHandler
public protocol GenerationHandler: Sendable {
func stop()
}
The handler returned by ModelRunner.generateResponse or Conversation.generateResponse(..., onResponse:) lets you cancel generation without tearing down the conversation.
Conversation
Conversation tracks chat state and provides streaming helpers built on top of the model runner.
public class Conversation {
public let modelRunner: ModelRunner
public private(set) var history: [ChatMessage]
public private(set) var functions: [LeapFunction]
public private(set) var isGenerating: Bool
public init(modelRunner: ModelRunner, history: [ChatMessage])
public func registerFunction(_ function: LeapFunction)
public func exportToJSON() throws -> [[String: Any]]
public func generateResponse(
userTextMessage: String,
generationOptions: GenerationOptions? = nil
) -> AsyncThrowingStream<MessageResponse, Error>
public func generateResponse(
message: ChatMessage,
generationOptions: GenerationOptions? = nil
) -> AsyncThrowingStream<MessageResponse, Error>
@discardableResult
public func generateResponse(
message: ChatMessage,
generationOptions: GenerationOptions? = nil,
onResponse: @escaping (MessageResponse) -> Void
) -> GenerationHandler?
}
Properties
- history: Copy of the accumulated chat messages. The SDK appends the assistant reply when a generation finishes successfully.
- functions: Functions registered via registerFunction(_:) for function calling.
- isGenerating: Boolean flag indicating whether a generation is currently running. Attempts to start a new generation while this is true immediately finish with an empty stream (or a nil handler for the callback variant).
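For example, a sketch that respects the flag before starting another request (sendMessage(_:) and updateUI(with:) are hypothetical app helpers; the latter mirrors the callback example below):
// Skip new requests while the previous generation is still running.
func sendMessage(_ text: String) {
    guard !conversation.isGenerating else { return }
    conversation.generateResponse(
        message: ChatMessage(role: .user, content: [.text(text)])
    ) { response in
        updateUI(with: response)
    }
}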
Streaming Convenience
The most common pattern is to use the async-stream helpers:
let user = ChatMessage(role: .user, content: [.text("Hello! What can you do?")])
Task {
do {
for try await response in conversation.generateResponse(
message: user,
generationOptions: GenerationOptions(temperature: 0.7)
) {
switch response {
case .chunk(let delta):
print(delta, terminator: "")
case .reasoningChunk(let thought):
print("Reasoning:", thought)
case .functionCall(let calls):
handleFunctionCalls(calls)
case .complete(let fullText, let info):
print("\nComplete:", fullText)
if let stats = info.stats {
print("Prompt tokens: \(stats.promptTokens), completions: \(stats.completionTokens)")
}
}
}
} catch {
print("Generation failed: \(error)")
}
}
Cancelling the task that iterates the stream stops generation and cleans up native resources.
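For example (a sketch; keep the Task handle wherever your UI can reach it):
// Store the task so a "Stop" action elsewhere can cancel it.
let generationTask = Task {
    for try await response in conversation.generateResponse(message: user) {
        print(response)
    }
}
// Later, e.g. when the user taps Stop:
generationTask.cancel()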
Callback Convenience
Use generateResponse(message:onResponse:) when you prefer callbacks or need to integrate with imperative UI components:
let handler = conversation.generateResponse(message: user) { response in
updateUI(with: response)
}
// Later
handler?.stop()
If a generation is already running, the method returns nil and emits a .complete message with finishReason == .stop via the callback.
The callback overload does not surface generation errors. Use the async-stream helper or call
ModelRunner.generateResponse with onErrorCallback when you need error handling.
Export Chat History
exportToJSON() serializes the conversation history into a [[String: Any]] payload that mirrors OpenAI’s chat-completions format. This is useful for persistence, analytics, or debugging tools.
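For example, the exported payload can be written to disk with Foundation's JSON APIs (a sketch; the file location is illustrative):
import Foundation

// Serialize the history and persist it for a later session.
let payload = try conversation.exportToJSON()
let data = try JSONSerialization.data(withJSONObject: payload, options: [.prettyPrinted])
let fileURL = FileManager.default.temporaryDirectory.appendingPathComponent("chat-history.json")
try data.write(to: fileURL)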
MessageResponse
public enum MessageResponse {
case chunk(String)
case reasoningChunk(String)
case complete(String, GenerationCompleteInfo)
case functionCall([LeapFunctionCall])
}
- chunk: Partial assistant text emitted during streaming.
- reasoningChunk: Model reasoning tokens wrapped between <think> / </think> (only for models that expose reasoning traces).
- functionCall: One or more function/tool invocations requested by the model. See the Function Calling guide.
- complete: Signals the end of generation. The first argument is the full assistant reply assembled by the SDK. GenerationCompleteInfo contains the finishReason (.stop or .exceed_context) and optional GenerationStats (prompt/completion token counts and throughput).
public struct GenerationCompleteInfo {
public let finishReason: GenerationFinishReason
public let stats: GenerationStats?
}
public struct GenerationStats {
public var promptTokens: UInt64
public var completionTokens: UInt64
public var totalTokens: UInt64
public var tokenPerSecond: Float
}
Errors surfaced during streaming are delivered through the thrown error of AsyncThrowingStream, or via the onErrorCallback closure when using the lower-level API.
Chat Messages
Roles
public enum ChatMessageRole: String {
case user
case system
case assistant
case tool
}
Include .tool messages when you append function-call results back into the conversation.
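For example, after executing a requested tool you might send its output back as a .tool message (a sketch; the JSON result string is illustrative, and process(_:) is a placeholder handler as in the final example below). See the Function Calling guide for the full round trip.
// Feed the tool's result back so the model can use it in its next turn.
let toolResult = ChatMessage(
    role: .tool,
    content: [.text("{\"temperature_c\": 21, \"condition\": \"sunny\"}")]
)
for try await response in conversation.generateResponse(message: toolResult) {
    process(response)
}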
Message Structure
public struct ChatMessage {
public var role: ChatMessageRole
public var content: [ChatMessageContent]
public var reasoningContent: String?
public var functionCalls: [LeapFunctionCall]?
public init(
role: ChatMessageRole,
content: [ChatMessageContent],
reasoningContent: String? = nil,
functionCalls: [LeapFunctionCall]? = nil
)
public init(from json: [String: Any]) throws
}
- content: Ordered fragments of the message. The SDK currently supports .text and .image parts.
- reasoningContent: Optional text produced inside <think> tags by eligible models.
- functionCalls: Attach the calls returned by MessageResponse.functionCall when you include tool execution results in the history.
Message Content
public enum ChatMessageContent {
case text(String)
case image(Data)
public init(from json: [String: Any]) throws
}
Provide JPEG-encoded bytes for .image. Helper initializers (e.g. ChatMessageContent.fromUIImage) are available for UIKit/AppKit targets. Images are only consumed when the loaded model exposes multimodal weights and an mmproj-*.gguf companion was detected alongside the main GGUF.
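For example, a sketch of a mixed text-and-image user message (assumes a vision-capable model with its projection loaded; photo is a hypothetical UIImage you already hold):
import UIKit

// Pair a question with JPEG-encoded image bytes in a single user message.
let jpeg = photo.jpegData(compressionQuality: 0.85)!
let visionMessage = ChatMessage(
    role: .user,
    content: [
        .text("What landmark is shown in this photo?"),
        .image(jpeg)
    ]
)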
GenerationOptions
Tune generation behaviour with GenerationOptions.
public struct GenerationOptions {
public var temperature: Float?
public var topP: Float?
public var minP: Float?
public var repetitionPenalty: Float?
public var jsonSchemaConstraint: String?
public var functionCallParser: LeapFunctionCallParserProtocol?
public init(
temperature: Float? = nil,
topP: Float? = nil,
minP: Float? = nil,
repetitionPenalty: Float? = nil,
jsonSchemaConstraint: String? = nil,
functionCallParser: LeapFunctionCallParserProtocol? = LFMFunctionCallParser()
)
}
- Leave a field as nil to fall back to the defaults packaged with the model bundle.
- functionCallParser controls how tool-call tokens are parsed. Supply your own parser conforming to LeapFunctionCallParserProtocol to support custom formats, or set it to nil to receive raw tool-call text in MessageResponse.chunk.
- jsonSchemaConstraint activates constrained generation. Use setResponseFormat(type:) to populate it from a type annotated with the @Generatable macro.
var options = GenerationOptions(temperature: 0.6, topP: 0.9)
try options.setResponseFormat(type: CityFact.self)
for try await response in conversation.generateResponse(
message: user,
generationOptions: options
) {
// Handle structured output
}
Function Calling Types
public struct LeapFunction {
public let name: String
public let description: String
public let parameters: [LeapFunctionParameter]
}
public struct LeapFunctionParameter {
public let name: String
public let type: LeapFunctionParameterType
public let description: String
public let optional: Bool
}
public indirect enum LeapFunctionParameterType: Codable, Equatable {
case string(StringType)
case number(NumberType)
case integer(IntegerType)
case boolean(BooleanType)
case array(ArrayType)
case object(ObjectType)
case null(NullType)
}
The parameter type wrappers (StringType, NumberType, etc.) let you attach descriptions and enumerations so that JSON schemas accurately describe your tools. See the Function Calling guide for in-depth usage patterns.
Errors
Errors are surfaced as LeapError values. The most common cases are:
- LeapError.modelLoadingFailure: Problems reading or validating the model bundle.
- LeapError.generationFailure: Unexpected native inference errors.
- LeapError.promptExceedContextLengthFailure: Prompt length exceeded the configured context size.
- LeapError.serializationFailure: JSON encoding/decoding problems when working with chat history or function calls.
Handle thrown errors with do / catch when using async streams, or use the onErrorCallback in the lower-level API.
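A sketch of catching a thrown LeapError around the streaming API (the recovery actions are illustrative):
do {
    for try await response in conversation.generateResponse(message: userMessage) {
        process(response)
    }
} catch let error as LeapError {
    switch error {
    case .promptExceedContextLengthFailure:
        // Trim older history or reload the model with a larger contextSize, then retry.
        break
    case .generationFailure:
        // Surface the failure to the user and decide whether to retry.
        break
    default:
        print("Leap error: \(error)")
    }
} catch {
    print("Unexpected error: \(error)")
}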
Putting it together
let runner = try await Leap.load(url: bundleURL)
let conversation = runner.createConversation(systemPrompt: "You are a travel assistant.")
conversation.registerFunction(weatherFunction)
var options = GenerationOptions(temperature: 0.8)
try options.setResponseFormat(type: TripRecommendation.self)
let userMessage = ChatMessage(
role: .user,
content: [.text("Plan a 3-day trip to Kyoto with food highlights")]
)
for try await response in conversation.generateResponse(
message: userMessage,
generationOptions: options
) {
process(response)
}
Refer to the Quick Start for end-to-end project setup, Function Calling for tool invocation, and Constrained Generation for structured outputs.