Smolagents: Building Flexible AI Agents with Secure, Sandboxed Code Execution

Smolagents is an open‑source Python library from Hugging Face that lets developers create AI agents capable of running code actions in a modular, scaffolded environment. Its standout feature is the built‑in support for multiple sandboxed executors—E2B, Modal, Docker, and WebAssembly—enabling safe execution of untrusted code while keeping the core agent logic simple and extensible.

Architecture and Core Components

The library’s core is organized around four interchangeable layers—agents, models, tools, and executors—each residing in its own module and communicating through well‑defined interfaces. This separation lets developers swap, for example, a LocalPythonExecutor for a sandboxed backend such as E2B, Modal, Docker, or WebAssembly without touching agent logic. The codebase totals 26 435 lines of Python, type‑annotated throughout, and enforces code quality with Ruff linting and formatting in the CI pipeline. Error handling relies on a custom hierarchy (AgentError, AgentParsingError, AgentExecutionError, …) that surfaces distinct failure modes for clearer debugging. Tests are spread across 18 files, exercising the agent loop, tool invocation, and each executor variant, giving the project a solid foundation for refactoring.
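To illustrate why a layered exception hierarchy makes failures easier to handle, here is a minimal sketch. The classes are local stand-ins that mirror the names mentioned above; they are not imported from smolagents, and the `run_step` and `classify` helpers, along with the toy "action" format, are hypothetical.

```python
class AgentError(Exception):
    """Stand-in base class for every agent failure."""


class AgentParsingError(AgentError):
    """The model output could not be parsed into an action."""


class AgentExecutionError(AgentError):
    """The parsed action failed while running."""


def run_step(model_output: str) -> str:
    """Hypothetical step: parse an 'action: a + b' line, then evaluate it."""
    if not model_output.startswith("action: "):
        raise AgentParsingError(f"no action found in {model_output!r}")
    expression = model_output.removeprefix("action: ")
    try:
        # Toy evaluator restricted to adding two integers.
        left, right = expression.split("+")
        return str(int(left) + int(right))
    except ValueError as exc:
        raise AgentExecutionError(f"could not run {expression!r}") from exc


def classify(model_output: str) -> str:
    """Map each failure mode to a distinct recovery strategy."""
    try:
        return "ok: " + run_step(model_output)
    except AgentParsingError:
        return "retry: ask the model to reformat its answer"
    except AgentExecutionError:
        return "retry: feed the error back to the model"
    except AgentError:
        return "abort: unknown agent failure"
```

Catching the specific subclasses before the base class lets a caller choose a different recovery path for each failure mode while still having a catch-all for anything else the agent raises.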

While the architecture is clean and well-tested, it has evident gaps in production-grade observability: logging relies on Rich's console formatting, which is human-readable but not machine-parseable; there are no built-in health-check endpoints or readiness probes for container orchestration; and distributed tracing via OpenTelemetry or a similar framework is absent. Adding a JSON-logging toggle, exposing /health and /ready HTTP endpoints, and instrumenting key spans with OpenTelemetry would follow the project's own recommendations and move the observability score from its current 50/100 toward a production-ready level.
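A JSON-logging toggle of the kind suggested here could look roughly like the following sketch. It uses only the standard library and does not touch smolagents' own logger; the `AGENT_LOG_FORMAT` environment variable, the `JsonFormatter` class, and the `build_logger` helper are assumptions made for illustration.

```python
import json
import logging
import os
from typing import Optional


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str = "agent", json_logs: Optional[bool] = None) -> logging.Logger:
    """Return a logger that emits JSON or plain console lines.

    When `json_logs` is None, the (hypothetical) AGENT_LOG_FORMAT
    environment variable decides; any value other than "json" keeps
    the human-readable format.
    """
    if json_logs is None:
        json_logs = os.environ.get("AGENT_LOG_FORMAT", "console") == "json"
    logger = logging.getLogger(name)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler()
    if json_logs:
        handler.setFormatter(JsonFormatter())
    else:
        handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

In local development the plain formatter (or Rich) stays in place, while a staging or production deployment would set the variable to `json` so that each line can be ingested by a log pipeline.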

Security Model and Sandboxed Execution

Smolagents ships with a pluggable execution subsystem that separates the agent's decision-making logic from the environment in which code actions run. The base Executor interface is implemented by several concrete runners, the simplest being LocalPythonExecutor, which evaluates Python snippets inside the host process. The project's documentation explicitly flags this executor as not a security boundary: it does not isolate the code from the host, so untrusted code can affect the host system. This is listed among the critical findings in the project's KPIs.

For production‑grade workloads the library offers four sandboxed alternatives that each enforce isolation through different mechanisms:

- E2BExecutor – runs code in the E2B secure sandbox, a short-lived VM with resource limits.
- ModalExecutor – delegates execution to Modal's serverless containers, providing network-free sandboxes.
- DockerExecutor – builds and runs a Docker image with a --read-only filesystem and user-namespace remapping.
- WasmExecutor – runs Python on Pyodide, a CPython build compiled to WebAssembly, inside a sandboxed JavaScript runtime.

These options are selectable at agent construction time.
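A minimal sketch of that selection step follows. The `executor_type` keyword and its string values are assumed from recent smolagents releases and should be checked against the installed version; the `AGENT_SANDBOX` setting and the `choose_executor_type` helper are hypothetical. The smolagents call itself is left as a comment so that the block runs without the library or any API credentials.

```python
# Backend names assumed to be accepted by smolagents' `executor_type`
# argument; verify them against the version you have installed.
SANDBOXED_BACKENDS = {"e2b", "modal", "docker", "wasm"}
ALL_BACKENDS = SANDBOXED_BACKENDS | {"local"}


def choose_executor_type(settings: dict, allow_local: bool = False) -> str:
    """Pick a backend from a (hypothetical) AGENT_SANDBOX setting.

    Defaults to Docker and refuses the non-isolated local executor
    unless the caller opts in explicitly.
    """
    backend = settings.get("AGENT_SANDBOX", "docker").lower()
    if backend not in ALL_BACKENDS:
        raise ValueError(f"unknown executor backend: {backend!r}")
    if backend == "local" and not allow_local:
        raise ValueError("local execution is not a security boundary")
    return backend


executor_type = choose_executor_type({"AGENT_SANDBOX": "e2b"})

# With smolagents installed and a model configured, the value would be
# passed at construction time, roughly like this:
#
#   from smolagents import CodeAgent, InferenceClientModel
#   agent = CodeAgent(tools=[], model=InferenceClientModel(),
#                     executor_type=executor_type)
```

Making the sandboxed backends the default, and requiring an explicit opt-in for local execution, keeps an accidental misconfiguration from silently running untrusted code on the host.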

The codebase spans 181 files, enforces Ruff linting in CI, and ships with a test suite of 18 files that validates each executor's behavior. Documentation is available in English, Spanish, Hindi, Korean, and Chinese, helping teams understand the security guarantees of each sandbox. While the current release lacks structured logging and tracing, the modular executor design makes it straightforward to wrap any sandbox with additional observability layers before deploying to production.

Extensibility: Tools, Models, and Integrations

Smolagents’ core design separates agents, models, tools, and executors, which already makes it straightforward to plug in new components. The library ships with a LocalPythonExecutor that the documentation explicitly labels as “not a security boundary,” while offering four sandboxed alternatives—E2B, Modal, Docker, and WebAssembly—for running untrusted code. These executors are invoked through a uniform interface, so swapping in a custom runtime (e.g., a Kubernetes‑based sandbox) requires only implementing the same Executor abstract base class.
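The idea of a custom runtime behind a uniform interface can be sketched as follows. The `Executor` class here is a local stand-in with a single hypothetical `run` method; the real smolagents base class and its method signatures may differ. The subprocess backend is only a placeholder for a genuine sandbox such as a Kubernetes job.

```python
import subprocess
import sys
from abc import ABC, abstractmethod


class Executor(ABC):
    """Stand-in interface; the real smolagents base class may differ."""

    @abstractmethod
    def run(self, code: str) -> str:
        """Execute `code` and return whatever it printed."""


class SubprocessExecutor(Executor):
    """Toy backend: runs code in a child interpreter with a timeout.

    A separate process is NOT a sandbox. A real backend would hand the
    code to a container, VM, or cluster job instead.
    """

    def __init__(self, timeout_seconds: float = 30.0) -> None:
        self.timeout_seconds = timeout_seconds

    def run(self, code: str) -> str:
        completed = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=self.timeout_seconds,
        )
        if completed.returncode != 0:
            raise RuntimeError(completed.stderr.strip())
        return completed.stdout.strip()


def run_action(executor: Executor, code: str) -> str:
    """Agent-side code depends only on the interface, not the backend."""
    return executor.run(code)
```

Because `run_action` accepts any `Executor`, replacing `SubprocessExecutor` with a stronger backend changes one constructor call and nothing in the calling code.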

Extending the model side is similarly modular: agents accept any object that conforms to the Model interface, and the codebase already includes adapters for Hugging Face Hub, OpenAI, Anthropic, and LiteLLM, plus a Gradio-based UI. Because type hints are used throughout, adding a new provider mostly involves subclassing the base Model class and filling in the generate method; the existing Ruff lint step in CI then checks the new code against the project's conventions.
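As a rough illustration of the provider pattern, the sketch below adapts an arbitrary completion function to a model interface. `ChatMessage` and `Model` are simplified local stand-ins, not the smolagents classes, and the `generate` signature shown is an assumption; `CallableModel` and the prompt layout are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ChatMessage:
    """Simplified stand-in for a chat message."""
    role: str
    content: str


class Model:
    """Stand-in base class; the real interface may take more arguments."""

    def generate(self, messages: List[ChatMessage]) -> ChatMessage:
        raise NotImplementedError


class CallableModel(Model):
    """Hypothetical adapter wrapping any `prompt -> text` function."""

    def __init__(self, complete: Callable[[str], str]) -> None:
        self._complete = complete

    def generate(self, messages: List[ChatMessage]) -> ChatMessage:
        # Flatten the conversation into one prompt string for the provider.
        prompt = "\n".join(f"{m.role}: {m.content}" for m in messages)
        return ChatMessage(role="assistant", content=self._complete(prompt))
```

A deterministic function such as `str.upper` can stand in for a real provider in unit tests, which keeps the agent loop testable without network access.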

Extensibility falls short for production use in observability and safety hooks. The project's own readiness scores show observability at 50/100 and security at 65/100, with critical findings noting the absence of structured JSON logging, health-check endpoints, and distributed tracing. The recommendations advocate adding a JSON logger alongside the existing Rich console output, exposing a /healthz probe for container orchestration, and instrumenting calls with OpenTelemetry to capture spans across tool execution, model inference, and agent loops. Implementing these hooks would turn the current extensible architecture into a production-ready platform without altering its core separation of concerns.

Production Readiness: Observability, Testing, and Deployment Guidance

Smolagents already provides a solid testing foundation: the repository contains 18 test files that cover the agent, model, tool, and executor modules, and the codebase enforces Ruff linting in CI via GitHub Actions. However, the observability score in the project’s readiness breakdown sits at 50 / 100, reflecting missing pieces that are essential for production‑grade operation.

The library's default logging relies on Rich console formatting, which is human-friendly but not machine-parseable; there is no built-in option to emit structured JSON logs that could be ingested by ELK, Splunk, or similar pipelines. Likewise, no health-check endpoints or readiness probes are exposed, so orchestrators like Kubernetes cannot automatically verify that a containerized agent is responsive. Distributed tracing and metrics are also absent: there is no OpenTelemetry instrumentation and no Prometheus metrics endpoint for tracking token usage, latency, or error rates across service boundaries.

To reach observability maturity, you can add a thin logging wrapper that switches between Rich for local development and JSON for staging/production, expose a /healthz route returning HTTP 200 when the executor and model clients are ready, and instrument key functions with OpenTelemetry spans that propagate trace IDs through the agent loop. Pair these changes with dependency‑scanning (Dependabot/Renovate) and a coverage gate of ≥ 70 % to tighten the testing loop, and you’ll close the gaps highlighted in the project’s own recommendations while preserving its clean, modular architecture.
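The /healthz route described above could be sketched with the standard library as follows. Nothing here comes from smolagents: the check names, the `evaluate_checks` and `make_health_handler` helpers, and port 8080 are assumptions, and the readiness checks themselves are supplied by the caller.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

Checks = Dict[str, Callable[[], bool]]


def evaluate_checks(checks: Checks) -> Tuple[int, bytes]:
    """Run every readiness check; any failure or exception yields 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, json.dumps(results).encode("utf-8")


def make_health_handler(checks: Checks):
    """Build a request handler class bound to the given checks."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path != "/healthz":
                self.send_error(404)
                return
            status, body = evaluate_checks(checks)
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args) -> None:
            pass  # keep probe traffic out of the application logs

    return HealthHandler


def serve_health(checks: Checks, port: int = 8080) -> None:
    """Blocking helper; run it in a background thread next to the agent."""
    HTTPServer(("0.0.0.0", port), make_health_handler(checks)).serve_forever()
```

A deployment might pass checks such as `{"executor": executor_is_ready, "model": model_is_ready}`, where both callables are hypothetical functions that return True once the corresponding client has been initialized.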
