LLamaSharp is a cross-platform .NET library that provides managed bindings to llama.cpp, enabling efficient local execution of large language models (LLMs) in .NET applications. It supports both CPU and GPU acceleration, runs models in GGUF format, and provides high-level APIs for text generation, embeddings, and multimodal inference (LLama/LLamaSharp.csproj:19-23).
This page provides an architectural overview of the LLamaSharp ecosystem. For installation instructions, see Installation and Setup. For getting started with code examples, see Quick Start Guide. For details on specific components, refer to the architecture sections (Core Architecture, Executors and Inference, Sampling and Token Selection, Advanced Features).
LLamaSharp's primary functions include:
- Memory-safe native interop with llama.cpp, built on SafeHandle patterns (docs/Architecture.md:7)
- Execution patterns exposed through the ILLamaExecutor abstraction (docs/Architecture.md:10)

The library targets netstandard2.0 and net8.0, enabling compatibility across .NET Framework, .NET Core, and modern .NET applications (LLama/LLamaSharp.csproj:1-3).
Sources: LLama/LLamaSharp.csproj:1-33, README.md:14-23, docs/Architecture.md:3-16
LLamaSharp uses a modular distribution strategy with separate packages for core functionality, framework integrations, and hardware-specific backends.
Diagram: LLamaSharp Package Distribution Model
| Package | Purpose | Target Framework | Dependencies |
|---|---|---|---|
| LLamaSharp | Core library with inference APIs | netstandard2.0, net8.0 | Microsoft.Extensions.AI.Abstractions |
| LLamaSharp.semantic-kernel | Semantic Kernel integration | netstandard2.0, net8.0 | Microsoft.SemanticKernel.Abstractions |
| LLamaSharp.kernel-memory | Kernel Memory integration (RAG) | net8.0 | Microsoft.KernelMemory.Abstractions |
| LLamaSharp.Backend.Cpu | CPU binaries (+ Metal for macOS) | Runtime | Native .dll/.so/.dylib |
| LLamaSharp.Backend.Cuda11 | CUDA 11 GPU acceleration | Runtime | Native .dll/.so |
| LLamaSharp.Backend.Cuda12 | CUDA 12 GPU acceleration | Runtime | Native .dll/.so |
| LLamaSharp.Backend.Vulkan | Vulkan GPU acceleration | Runtime | Native .dll/.so |
Users install the LLamaSharp core package plus at least one backend package matching their hardware (README.md:89-108). The backend packages contain pre-compiled llama.cpp binaries. During build, the binaries are downloaded and extracted to the runtimes/ directory via MSBuild targets (LLama/LLamaSharp.csproj:71-81). For more details, see Package Architecture.
Sources: README.md:89-108, LLama/LLamaSharp.csproj:50-100, LLama.SemanticKernel/LLamaSharp.SemanticKernel.csproj:1-51, LLama.KernelMemory/LLamaSharp.KernelMemory.csproj:1-37
LLamaSharp implements a layered architecture that progressively abstracts from native C++ code to high-level .NET APIs.
Diagram: LLamaSharp Layered Architecture
| Layer | Key Types | Responsibility |
|---|---|---|
| Application Layer | User code, ChatSession | High-level conversational API and session state management (docs/Architecture.md:11) |
| Executor Layer | ILLamaExecutor, InteractiveExecutor, StatelessExecutor | Abstraction of execution patterns (chat vs. instruction vs. stateless) (docs/Architecture.md:10) |
| Core Abstraction Layer | LLamaWeights, LLamaContext | Managed wrappers for model weights and inference context state (docs/Architecture.md:8-9) |
| Configuration Layer | ModelParams, InferenceParams | Configuration objects for loading models and controlling inference |
| Native Interop Layer | SafeLlamaModelHandle, NativeApi, NativeLibraryConfig | Memory-safe P/Invoke and resource management (docs/Architecture.md:7) |
| Native Library Layer | llama.cpp | The underlying C++ inference engine (README.md:14) |
Sources: docs/Architecture.md:3-16, README.md:14-23, LLama/LLamaSharp.csproj:1-33
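
The Native Interop Layer's memory-safe resource management rests on the standard .NET SafeHandle pattern. The following is a minimal, generic sketch of that pattern, not LLamaSharp's actual code: `UnmanagedBufferHandle` is a hypothetical type invented for illustration, and `Marshal.AllocHGlobal`/`Marshal.FreeHGlobal` stand in for the native create and free calls that a type such as SafeLlamaModelHandle would wrap.

```csharp
using System;
using System.Runtime.InteropServices;

// Usage: the handle owns the native allocation and frees it on Dispose.
var buffer = new UnmanagedBufferHandle(64);
Console.WriteLine($"invalid: {buffer.IsInvalid}, closed: {buffer.IsClosed}");
buffer.Dispose();
Console.WriteLine($"closed after Dispose: {buffer.IsClosed}");

// Hypothetical handle type for illustration only (not part of LLamaSharp).
sealed class UnmanagedBufferHandle : SafeHandle
{
    public UnmanagedBufferHandle(int byteCount)
        : base(IntPtr.Zero, ownsHandle: true)
    {
        // Stand-in for a native "create" function returning a raw pointer.
        SetHandle(Marshal.AllocHGlobal(byteCount));
    }

    // A zero pointer means the native resource was never acquired.
    public override bool IsInvalid => handle == IntPtr.Zero;

    // Called by the runtime at most once, even if Dispose is never called.
    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(handle);
        return true;
    }
}
```

Wrapping the raw pointer this way lets the runtime guarantee the release call runs once, which is the property a managed wrapper around a native model pointer needs.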
LLamaWeights is the primary holder of model weights (docs/Architecture.md:8). It encapsulates the native llama_model* pointer via a SafeLlamaModelHandle. Multiple LLamaContext instances can share a single LLamaWeights, which saves memory when several tasks run on the same model (docs/Architecture.md:9).
LLamaContext manages the state for a specific inference session, including the KV cache (docs/Architecture.md:9). It uses LLamaWeights and calls into the native library to perform tokenization and forward passes.
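
The sharing relationship described above can be sketched as follows. This is an illustrative example rather than a definitive recipe: the model path is a placeholder, the parameter values are arbitrary, and it assumes the LLamaSharp package plus a backend package are installed and that the API matches the one shown in the project README (`ModelParams`, `LLamaWeights.LoadFromFile`, `CreateContext`).

```csharp
using System;
using LLama;
using LLama.Common;

// Assumption: placeholder path to a local GGUF model file.
string modelPath = "path/to/model.gguf";

var parameters = new ModelParams(modelPath)
{
    ContextSize = 1024,  // context window, in tokens, for each LLamaContext
    GpuLayerCount = 0    // CPU only; raise to offload layers with a GPU backend
};

// Load the weights once...
using var weights = LLamaWeights.LoadFromFile(parameters);

// ...then create independent contexts (each with its own KV cache) on top of them.
using var contextA = weights.CreateContext(parameters);
using var contextB = weights.CreateContext(parameters);

// Tokenization goes through the context into the native library.
var tokens = contextA.Tokenize("Hello, world!");
Console.WriteLine($"Token count: {tokens.Length}, context size: {contextA.ContextSize}");
```

Disposing a context releases only its own state; the weights stay loaded until the LLamaWeights instance itself is disposed.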
The library provides several executors that define how the model is run (docs/Architecture.md:10):
- InteractiveExecutor: Designed for multi-turn chat interactions where the context is preserved and shifted.
- InstructExecutor: Optimized for instruction-following tasks.
- StatelessExecutor: Used for one-shot inference where context is not preserved between calls.
- BatchedExecutor: Advanced executor for managing multiple concurrent conversation sequences.

For more information on starting your first inference, see Quick Start Guide.
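
As a rough sketch of how two of these executors differ in use, the example below runs a one-shot prompt through StatelessExecutor and a chat turn through InteractiveExecutor wrapped in a ChatSession. It follows the usage pattern in the project README; the model path, prompts, and parameter values are placeholders, and exact signatures may vary between LLamaSharp versions.

```csharp
using System;
using System.Collections.Generic;
using LLama;
using LLama.Common;

// Assumption: placeholder path to a local GGUF model file.
string modelPath = "path/to/model.gguf";

var parameters = new ModelParams(modelPath) { ContextSize = 1024 };
using var weights = LLamaWeights.LoadFromFile(parameters);

var inferenceParams = new InferenceParams
{
    MaxTokens = 128,                            // cap on generated tokens
    AntiPrompts = new List<string> { "User:" }  // stop when the model starts a user turn
};

// One-shot inference: no state is kept between calls.
var stateless = new StatelessExecutor(weights, parameters);
await foreach (var piece in stateless.InferAsync("Q: What is GGUF?\nA:", inferenceParams))
    Console.Write(piece);
Console.WriteLine();

// Multi-turn chat: the executor keeps its context across turns.
using var context = weights.CreateContext(parameters);
var interactive = new InteractiveExecutor(context);

var history = new ChatHistory();
history.AddMessage(AuthorRole.System, "You are a concise assistant.");
var session = new ChatSession(interactive, history);

await foreach (var piece in session.ChatAsync(
    new ChatHistory.Message(AuthorRole.User, "Hello!"), inferenceParams))
    Console.Write(piece);
```

Both executors stream text as an async sequence, so the caller decides whether to print pieces as they arrive or collect them into a single string.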
Sources: docs/Architecture.md:3-16, README.md:20-22
LLamaSharp integrates with major .NET AI ecosystems to simplify RAG and Agent development:
- LLamaSharp.semantic-kernel: Semantic Kernel integration (LLama.SemanticKernel/LLamaSharp.SemanticKernel.csproj:22-24)
- LLamaSharp.kernel-memory: Kernel Memory integration for RAG (LLama.KernelMemory/LLamaSharp.KernelMemory.csproj:16-18)

Sources: README.md:61-82, docs/index.md:18-27
LLamaSharp version 0.27.0 is pinned to llama.cpp commit 3f7c29d318e317b63f54c558bc69803963d7d88c (LLama/LLamaSharp.csproj:10-25). Native library loading is managed by NativeLibraryConfig and NativeApi, which ensure the correct backend is resolved at runtime (docs/FAQ.md:8).
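
A minimal sketch of configuring native loading is shown below. It is modeled on the pattern used in the LLamaSharp example project and assumes that `NativeLibraryConfig.All`, `WithLogCallback`, and `NativeApi.llama_empty_call` are available in the installed version; these entry points have changed across releases, so treat the names as assumptions to verify against your package version.

```csharp
using System;
using LLama.Native;

// Configuration must happen before any other LLamaSharp call,
// because the native library is loaded on first use.
NativeLibraryConfig
    .All
    .WithLogCallback((level, message) =>
        Console.Write($"[llama {level}] {message}"));

// Forces the native library to load now, so a missing or mismatched
// backend fails here rather than in the middle of model loading.
NativeApi.llama_empty_call();

Console.WriteLine("Native backend loaded.");
```

Triggering the load early is a design convenience: backend resolution problems surface at startup with the log callback already attached.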