Overview

LLamaSharp is a cross-platform .NET library that provides managed bindings to llama.cpp, enabling efficient local execution of large language models (LLMs) in .NET applications. It supports both CPU and GPU acceleration, runs models in GGUF format, and provides high-level APIs for text generation, embeddings, and multimodal inference LLama/LLamaSharp.csproj19-23.

This page provides an architectural overview of the LLamaSharp ecosystem. For installation instructions, see Installation and Setup. To get started with code examples, see Quick Start Guide. For details on specific components, refer to the architecture sections (Core Architecture, Executors and Inference, Sampling and Token Selection, Advanced Features).

Purpose and Capabilities

LLamaSharp serves three primary functions:

Native Library Integration: Wraps llama.cpp's C/C++ APIs with safe, idiomatic .NET interfaces using P/Invoke and SafeHandle patterns docs/Architecture.md7
High-Level Execution APIs: Provides multiple execution patterns (interactive chat, instruction-following, stateless inference, batched conversations) through the ILLamaExecutor abstraction docs/Architecture.md10
Framework Integration: Bridges LLamaSharp to Microsoft AI frameworks (Semantic Kernel, Kernel Memory) and third-party ecosystems README.md61-70

The library targets netstandard2.0 and net8.0, so it can be consumed from .NET Framework, .NET Core, and modern .NET applications LLama/LLamaSharp.csproj1-3.

Sources: LLama/LLamaSharp.csproj1-33 README.md14-23 docs/Architecture.md3-16

Package Ecosystem

LLamaSharp uses a modular distribution strategy with separate packages for core functionality, framework integrations, and hardware-specific backends.

Package Structure

Diagram: LLamaSharp Package Distribution Model

Package	Purpose	Target Framework	Dependencies
`LLamaSharp`	Core library with inference APIs	`netstandard2.0`, `net8.0`	`Microsoft.Extensions.AI.Abstractions`
`LLamaSharp.semantic-kernel`	Semantic Kernel integration	`netstandard2.0`, `net8.0`	`Microsoft.SemanticKernel.Abstractions`
`LLamaSharp.kernel-memory`	Kernel Memory integration (RAG)	`net8.0`	`Microsoft.KernelMemory.Abstractions`
`LLamaSharp.Backend.Cpu`	CPU binaries (+ Metal for macOS)	Runtime	Native `.dll`/`.so`/`.dylib`
`LLamaSharp.Backend.Cuda11`	CUDA 11 GPU acceleration	Runtime	Native `.dll`/`.so`
`LLamaSharp.Backend.Cuda12`	CUDA 12 GPU acceleration	Runtime	Native `.dll`/`.so`
`LLamaSharp.Backend.Vulkan`	Vulkan GPU acceleration	Runtime	Native `.dll`/`.so`

Users install the LLamaSharp core package plus one or more backend packages matching their hardware README.md89-108. The backend packages contain pre-compiled llama.cpp binaries. During build, binaries are downloaded and extracted to the runtimes/ directory via MSBuild targets LLama/LLamaSharp.csproj71-81. For more details, see Package Architecture.

Sources: README.md89-108 LLama/LLamaSharp.csproj50-100 LLama.SemanticKernel/LLamaSharp.SemanticKernel.csproj1-51 LLama.KernelMemory/LLamaSharp.KernelMemory.csproj1-37

High-Level Architecture

LLamaSharp implements a layered architecture that progressively abstracts from native C++ code to high-level .NET APIs.

Diagram: LLamaSharp Layered Architecture

Layer Descriptions

Layer	Key Types	Responsibility
Application Layer	User code, `ChatSession`	High-level conversational API and session state management docs/Architecture.md11
Executor Layer	`ILLamaExecutor`, `InteractiveExecutor`, `StatelessExecutor`	Abstraction of execution patterns (chat vs. instruction vs. stateless) docs/Architecture.md10
Core Abstraction Layer	`LLamaWeights`, `LLamaContext`	Managed wrappers for model weights and inference context state docs/Architecture.md8-9
Configuration Layer	`ModelParams`, `InferenceParams`	Configuration objects for loading models and controlling inference.
Native Interop Layer	`SafeLlamaModelHandle`, `NativeApi`, `NativeLibraryConfig`	Memory-safe P/Invoke and resource management docs/Architecture.md7
Native Library Layer	`llama.cpp`	The underlying C++ inference engine README.md14
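
As a rough illustration of how these layers stack, the sketch below walks from configuration objects down to a chat session. It assumes the LLamaSharp package and a backend package are installed and that the API matches the usage shown in the project's README; the model path, prompt text, and parameter values are placeholders chosen for illustration, not recommendations.

```csharp
using System;
using System.Collections.Generic;
using LLama;
using LLama.Common;

// Hypothetical model path: pass a real GGUF file as the first argument.
string modelPath = args.Length > 0 ? args[0] : "model.gguf";

// Configuration layer: how to load the model and size the context.
var parameters = new ModelParams(modelPath)
{
    ContextSize = 1024,   // placeholder value
    GpuLayerCount = 0     // 0 keeps all layers on the CPU
};

// Core abstraction layer: weights are loaded once, a context holds session state.
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);

// Executor layer: an interactive executor drives multi-turn generation.
var executor = new InteractiveExecutor(context);

// Application layer: ChatSession tracks the conversation history.
var history = new ChatHistory();
history.AddMessage(AuthorRole.System, "You are a concise assistant.");
var session = new ChatSession(executor, history);

var inferenceParams = new InferenceParams
{
    MaxTokens = 128,                             // cap on generated tokens
    AntiPrompts = new List<string> { "User:" }   // stop when the model starts a new user turn
};

// Stream the reply piece by piece as it is generated.
var userMessage = new ChatHistory.Message(AuthorRole.User, "What is GGUF?");
await foreach (var piece in session.ChatAsync(userMessage, inferenceParams))
{
    Console.Write(piece);
}
```

The native interop and native library layers do not appear explicitly here: they are reached indirectly through LLamaWeights and LLamaContext.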

Sources: docs/Architecture.md3-16 README.md14-23 LLama/LLamaSharp.csproj1-33

Core Components

Model Loading: LLamaWeights

LLamaWeights is the primary holder of model weights docs/Architecture.md8. It encapsulates the native llama_model* pointer via a SafeLlamaModelHandle. Multiple LLamaContext instances can share a single LLamaWeights, so the weights are loaded into memory once even when several tasks run on the same model docs/Architecture.md9.
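
A minimal sketch of that sharing pattern follows, assuming the `LLamaWeights.LoadFromFile` and `CreateContext` calls behave as in the project's README. The model path and context size are hypothetical placeholders.

```csharp
using System;
using LLama;
using LLama.Common;

// Hypothetical model path: pass a real GGUF file as the first argument.
string modelPath = args.Length > 0 ? args[0] : "model.gguf";
var parameters = new ModelParams(modelPath) { ContextSize = 512 };

// Load the weights a single time.
using var weights = LLamaWeights.LoadFromFile(parameters);

// Create two independent contexts on top of the same weights.
// Each context owns its own inference state; the weights are not duplicated.
using var contextA = weights.CreateContext(parameters);
using var contextB = weights.CreateContext(parameters);

Console.WriteLine($"Context A size: {contextA.ContextSize}");
Console.WriteLine($"Context B size: {contextB.ContextSize}");
```

Disposing the contexts before the weights, as the `using` declarations do here in reverse declaration order, keeps native resources released in a safe sequence.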

Inference Sessions: LLamaContext

LLamaContext manages the state for a specific inference session, including the KV cache docs/Architecture.md9. It is created from a LLamaWeights instance and calls into the native library to perform tokenization and forward passes.
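
To make the tokenization role concrete, here is a small sketch that tokenizes a string through a context. It assumes `LLamaContext.Tokenize` accepts a string and returns an array of tokens, as in recent LLamaSharp releases; the model path and sample text are placeholders.

```csharp
using System;
using LLama;
using LLama.Common;

// Hypothetical model path: pass a real GGUF file as the first argument.
string modelPath = args.Length > 0 ? args[0] : "model.gguf";
var parameters = new ModelParams(modelPath) { ContextSize = 512 };

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);

// Convert text into the model's token IDs using the loaded vocabulary.
var tokens = context.Tokenize("Hello from LLamaSharp");

// The token count is what consumes space in the context window.
Console.WriteLine($"Token count: {tokens.Length} of {context.ContextSize} available");
```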

Executors: Execution Patterns

The library provides several executors defining how to run the model docs/Architecture.md10:

InteractiveExecutor: Designed for multi-turn chat interactions, where the context is preserved across turns and shifted once it fills up.
InstructExecutor: Optimized for instruction-following tasks.
StatelessExecutor: Used for one-shot inference where context is not preserved between calls.
BatchedExecutor: Advanced executor for managing multiple concurrent conversation sequences.
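
For contrast with the chat-oriented executors, the sketch below shows one-shot inference with StatelessExecutor. It assumes the constructor takes the weights and the model parameters and that `InferAsync` streams text pieces, as in the project's examples; the model path, prompt, and token limit are placeholders.

```csharp
using System;
using LLama;
using LLama.Common;

// Hypothetical model path: pass a real GGUF file as the first argument.
string modelPath = args.Length > 0 ? args[0] : "model.gguf";
var parameters = new ModelParams(modelPath) { ContextSize = 512 };

using var weights = LLamaWeights.LoadFromFile(parameters);

// The stateless executor is built from the weights directly:
// no context state is carried over from one call to the next.
var executor = new StatelessExecutor(weights, parameters);

var inferenceParams = new InferenceParams { MaxTokens = 64 };

// Each InferAsync call starts from a clean state and streams the completion.
await foreach (var piece in executor.InferAsync("Q: What is a KV cache?\nA:", inferenceParams))
{
    Console.Write(piece);
}
```

Because nothing is preserved between calls, this pattern suits independent requests such as classification or single-question prompts, while InteractiveExecutor fits conversations that build on earlier turns.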

For more information on starting your first inference, see Quick Start Guide.

Sources: docs/Architecture.md3-16 README.md20-22

Framework Integrations

LLamaSharp integrates with major .NET AI ecosystems to simplify RAG and agent development:

Semantic Kernel: Integration via LLamaSharp.semantic-kernel LLama.SemanticKernel/LLamaSharp.SemanticKernel.csproj22-24
Kernel Memory: RAG support via LLamaSharp.kernel-memory LLama.KernelMemory/LLamaSharp.KernelMemory.csproj16-18
Third-Party: Support for BotSharp, LangChain, and MaIN.NET README.md61-70

Sources: README.md61-82 docs/index.md18-27

Version Compatibility

LLamaSharp version 0.27.0 is pinned to llama.cpp commit 3f7c29d318e317b63f54c558bc69803963d7d88c LLama/LLamaSharp.csproj10-25. Native library loading is managed by NativeLibraryConfig and NativeApi, which ensure the correct backend is resolved at runtime docs/FAQ.md8.
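
Backend resolution can be influenced before the native library loads. The sketch below is an assumption-laden illustration: it presumes that `NativeLibraryConfig.All` exposes `WithCuda`, `WithAutoFallback`, and `WithLogCallback`, and that `NativeApi.llama_empty_call` is available, as in recent LLamaSharp releases. Older versions may expose a different entry point, so check the API of the installed version.

```csharp
using System;
using LLama.Native;

// Configuration must happen before any other LLamaSharp call,
// because the native library is loaded on first use.
NativeLibraryConfig.All
    .WithCuda()          // prefer a CUDA backend if one is installed
    .WithAutoFallback()  // otherwise fall back to another available backend
    .WithLogCallback((level, message) => Console.Write($"{level}: {message}"));

// Trigger loading now so that a missing or mismatched backend fails early
// rather than in the middle of the first inference call.
NativeApi.llama_empty_call();

Console.WriteLine("Native library loaded.");
```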

Sources: LLama/LLamaSharp.csproj10-26 docs/FAQ.md1-82
