
ScopeGuard: A Governance SLM for Multilingual Scope Classification

Every AI needs a perimeter. ScopeGuard decides whether a query belongs to your AI service. Fast, multilingual, declarative, and backed by evidence for routing and analytics.

25 min read


Today we are releasing ScopeGuard, a family of multilingual small language models (SLMs) designed to help AI services (e.g., agents, chatbots, virtual assistants) operate within their intended boundaries.

Modern AI systems typically rely on "system prompt" instructions to define what they should and shouldn't do, but these instructions are fragile and easily strained by real-world user interactions, especially in non-English languages and in multi-turn conversations. ScopeGuard addresses this gap by taking a natural-language description of what your AI service should and shouldn't do, and evaluating how each incoming query aligns with it. Is a user request supported, out of scope, or explicitly disallowed by your service's rules? ScopeGuard answers this question quickly and accurately, providing not only a classification but also evidence from your service description to support its decisions.

To better understand the problem ScopeGuard solves, imagine you are building an AI-powered virtual assistant for a specific domain or service. You want to ensure that the assistant only responds to queries that fall within its designated scope, while gracefully handling or rejecting requests that are irrelevant or violate predefined constraints.

For example, consider a virtual assistant for a package delivery service that:

  • Should not summarize today's weather forecast, i.e., this is an "out of scope" user request.
  • Must refuse shipping refund requests, i.e., this is a "restricted" user request.
  • Can answer package tracking questions, i.e., this is a "directly supported" user request.

We refer to this task as scope classification. And ScopeGuard is the first open model specifically designed for it. Our ScopeGuard models have been trained to understand your rules, principles, and restrictions, and to provide evidence for their predictions, not only in English but also in French, German, Italian, and Spanish.


Getting Started with ScopeGuard

You can download the ScopeGuard models from Hugging Face. ScopeGuard comes with a clean Python API within the orbitals library, and full support for Hugging Face Pipelines and vLLM. ScopeGuard is released under Apache License 2.0 and can be used for commercial purposes. This release is part of our mission at Principled Intelligence to make AI governance transparent and easy to adopt.

Tip

The easiest way to get started with ScopeGuard is to use our orbitals Python library.

The Need for Scope Classification

It's 2026. AI services are being deployed worldwide, with millions of users interacting with them daily. But:

  1. Large Language Models (LLMs) drift, hallucinate, and behave unexpectedly.
  2. User behavior is unpredictable, well beyond what developers can anticipate.
  3. AI teams ship features faster than governance can keep up.

As a company working in AI governance, we see these challenges first-hand with our customers. In conversations with organizations of all sizes, system prompt instructions and safety-focused guardrails often come up as a one-size-fits-all solution; in practice, this is wishful thinking.

Moreover, LLMs perform well in English, but their accuracy, factuality, and reliability degrade in other languages, together with their ability to follow system prompt instructions faithfully. Among other consequences, the safety and control instructions you write in the system prompt provide a weaker guarantee than expected: you add a new instruction to fix an issue, and the model breaks other instructions that were working fine before.

Traditional guardrails, meanwhile, operate at a coarse level and focus on safety. They are designed to block generically unsafe content by matching outputs against static policy categories (e.g., hate speech, violence, adult content). This approach offers baseline protection, but it is generally difficult to adapt to complex service- or product-specific custom policies, which is where real-world failures tend to occur.

What is missing is a control layer that can answer a seemingly simple but crucial question:

Given a user query,
should it be processed by your AI service?

Missing this layer means that your AI service will respond when it shouldn't, exposing your organization to a variety of risks:

Risk | Why
Brand Reputation Damage | Users expect AI services to behave predictably. Responding to irrelevant or inappropriate queries can erode trust.
Compliance Violations | Your AI service might inadvertently generate content that violates legal or regulatory requirements.
Hallucinations | An AI service that answers questions outside its domain is more likely to produce incorrect or misleading information.
Security Vulnerabilities | An AI service that processes out-of-scope queries may be more vulnerable to adversarial manipulation (e.g., prompt injection).
Wasted Resources | Processing irrelevant queries consumes computational resources, leading to operational inefficiencies.
Higher Cost | You are paying for queries that should never have been processed by your AI service.

This is where ScopeGuard comes in. And it's now open-sourced!

How ScopeGuard Works

ScopeGuard is straightforward to understand and easy to use. You give it two things:

  1. A user query (and, optionally, any conversation history for additional context)
  2. A description of what your AI service is supposed to do (we call this the AI Service Description)

Tip

The AI Service Description is what makes everything work. It can be as simple as one sentence or a copy-paste of your system prompt (our recommended starting point), as detailed as several paragraphs covering your service's purpose, features, and limitations, or even a structured object following our AI Service Description format. Put in the effort here; it pays off.

From there, it figures out how the user request aligns with what your service is meant to do. We break this down into five categories:

Scope Class | Description
✅ Directly Supported | Clearly within the scope of the AI service; in other words, it's okay for the AI service to process this type of query.
🤔 Potentially Supported | Plausibly within the scope of the AI service; in common situations, it is generally safe for the AI service to process this type of query.
🟠 Out of Scope | Outside the scope of the AI service and its responsibilities; the AI service should not process this type of query.
🔴 Restricted | Something that is explicitly forbidden; the AI service must NEVER process this type of query.
💬 Chit Chat | Small talk not related to the service (e.g., "thank you", "goodbye", "great"); the AI service should not waste resources on this type of query.

We call the predicted category the scope class and, if needed, ScopeGuard can also extract and provide evidence from the AI Service Description to justify its decision.

Warning

Evidence extraction adds processing time. If you don't need the explanations, you can turn it off for much faster results.
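
For example, with the Python API shown later in this post, disabling evidence extraction could look like the following sketch. The flag name below is an assumption for illustration; check the orbitals documentation for the exact argument.

from orbitals.scope_guard import ScopeGuard

sg = ScopeGuard(backend="vllm", model="small")

ai_service_description = (
    "You are a virtual assistant for a package delivery service. "
    "You can only answer questions about package tracking."
)

# Hypothetical flag: skip evidence extraction to reduce latency.
result = sg.validate(
    "Where is my package 123456789?",
    ai_service_description,
    extract_evidences=False,
)
print(result.scope_class)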

Use Cases and Examples

AI Service Description
You are a virtual assistant for a package delivery service. You can only answer questions about package tracking. Never respond to requests for refunds.

Conversation
Assistant: Hi there, I am the Package Delivery Assistant. How can I help you today?
User: My package 123456789 hasn't arrived yet and it was supposed to be here two days ago.
Assistant: I'm sorry to hear that your package hasn't arrived yet. How can I assist you in this matter?
Message Being Classified
I want a refund for the delay.
Classification
Restricted
Evidence
Never respond to requests for refunds.
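
In code, classifying this last message together with its conversation history could look like the sketch below. The format of the history argument is an assumption for illustration; refer to the orbitals documentation for the exact signature.

from orbitals.scope_guard import ScopeGuard

sg = ScopeGuard(backend="vllm", model="small")

ai_service_description = (
    "You are a virtual assistant for a package delivery service. "
    "You can only answer questions about package tracking. "
    "Never respond to requests for refunds."
)

# Assumed history format; the real API may differ.
history = [
    {"role": "assistant", "content": "Hi there, I am the Package Delivery Assistant. How can I help you today?"},
    {"role": "user", "content": "My package 123456789 hasn't arrived yet and it was supposed to be here two days ago."},
    {"role": "assistant", "content": "I'm sorry to hear that your package hasn't arrived yet. How can I assist you in this matter?"},
]

result = sg.validate(
    "I want a refund for the delay.",
    ai_service_description,
    history=history,  # hypothetical keyword argument
)
print(result.scope_class)  # expected: ScopeClass.RESTRICTED
print(result.evidences)    # expected: ["Never respond to requests for refunds."]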

How to Integrate ScopeGuard

Up to this point, we've focused on what ScopeGuard is and what it predicts. The next question is practical:

Where does it fit in a real AI service?

ScopeGuard is meant to sit in your request pipeline as a lightweight guardrail. For each incoming user request, you run ScopeGuard first to produce a scope class. That signal becomes an input to your application logic.

How you act on it is up to you, from simple analytics and trend discovery (e.g., What percentage of queries are out of scope?) to full-fledged routing as in the following flowchart:

[Flowchart: routing user requests based on the predicted scope class]

It's up to you how to handle each scope class. For instance:

  • You might want to block all "Restricted" queries outright.
  • You might want to log and monitor "Out of Scope" queries to understand user behavior better.
  • You might want to route "Potentially Supported" queries to a human agent for review.
  • You might even want to use different AI models or different system prompts based on the scope class.
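
Putting these choices together, here is a minimal routing sketch built on the quickstart API shown below. The ScopeClass member names other than RESTRICTED are assumed from the scope classes above, and the downstream handler is a placeholder for your own application logic.

from orbitals.scope_guard import ScopeGuard, ScopeClass

sg = ScopeGuard(backend="vllm", model="small")

out_of_scope_count = 0  # simple analytics counter

def handle_request(user_query: str, ai_service_description: str) -> str:
    global out_of_scope_count
    result = sg.validate(user_query, ai_service_description)

    if result.scope_class == ScopeClass.RESTRICTED:
        # Block outright; never forward to the downstream model.
        return "Sorry, I can't help with that request."
    if result.scope_class == ScopeClass.OUT_OF_SCOPE:
        # Track out-of-scope traffic for monitoring and trend analysis.
        out_of_scope_count += 1
        return "That's outside what this assistant can help with."
    if result.scope_class == ScopeClass.CHIT_CHAT:
        return "Happy to help! Do you have a question about your package?"

    # Directly or Potentially Supported: forward to your main AI service.
    return generate_answer(user_query)

def generate_answer(user_query: str) -> str:
    # Stand-in for the call to your downstream LLM or agent.
    return f"(answer from your main assistant to: {user_query})"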

Quickstart Recipe

1. Define your AI Service Description

Start simple: write a short description of what your AI service is and what it should and shouldn't do. For instance:

You are a virtual assistant for a parcel delivery service.
You can only answer questions about package tracking.
Never respond to requests for refunds.

2. Install ScopeGuard

pip install orbitals[scope-guard-vllm]
 
# or, if you want to use HuggingFace Pipelines:
# pip install orbitals[scope-guard-hf]

3. Use ScopeGuard

from orbitals.scope_guard import ScopeGuard
 
# the AI Service Description from step 1
ai_service_description = (
    "You are a virtual assistant for a parcel delivery service. "
    "You can only answer questions about package tracking. "
    "Never respond to requests for refunds."
)
 
# initialize the classifier
sg = ScopeGuard(
    backend="vllm",
    model="small"
)
 
# classify a user's query
user_query = "If the package hasn't arrived by tomorrow, can I get my money back?"
result = sg.validate(user_query, ai_service_description)
 
print(result)
# ScopeGuardOutput(
#   scope_class=ScopeClass.RESTRICTED,
#   evidences=["Never respond to requests for refunds."]
# )

Check out our GitHub repository for more information on advanced usage and integration options.

Available Models

ScopeGuard is a family of small language models. Today, we release two open-weight models based on different foundation-model backbones: Qwen-3 and Google's Gemma-3.

Qwen-3 and Gemma-3 are two excellent multilingual language models, with strong performance across a variety of tasks and languages. By offering ScopeGuard on both backbones, we give you the flexibility to choose the one that best fits not only your existing infrastructure but also your technical requirements and provenance policies.

Both variants are distilled from a stronger internal teacher model, ScopeGuard Pro. Today, we are openly releasing the 4B ScopeGuard model for each backbone.

Variant | Backbone | Size | Languages | Availability
scope-guard-4B-q-2601 | Qwen-3 | 4B | English, French, German, Italian, Spanish | Available now
scope-guard-4B-g-2601 | Gemma-3 | 4B | English, French, German, Italian, Spanish | Available now

How We Evaluated ScopeGuard

ScopeGuard is a small language model (SLM) designed for a specific aspect of AI governance: scope classification. In this section, we show that being small compared to large language models (LLMs) is not a limitation for ScopeGuard, but an intentional design choice. Instead of optimizing for open-ended generation, ScopeGuard is trained to make reliable, consistent, and low-latency policy-driven decisions.

To understand how this design translates into real-world performance, we evaluated ScopeGuard against a range of commercial large language models (e.g., GPT-5.2, Claude Sonnet 4.5, Gemini-3-Pro) as well as open-weight models (e.g., Nvidia Nemotron) commonly used in production systems, under comparable prompting and decision setups.

Given our academic roots, we want to be transparent about what we tested and why. Below, we share a first look at the experimental results, focusing on realistic guardrailing scenarios. A full technical report with methodological details, ablations, and extended analysis will be released soon.

The goal is not to “beat” large language models in general-purpose capabilities, but to show that small, specialized models can outperform general systems on governance tasks, while being cheaper, faster, and easier to deploy.

Evaluation Tasks

Our evaluation covers three tasks that matter in production AI governance:

  • Scope classification, where models must decide whether a user request falls within or outside the intended scope of an AI system (e.g., supported use cases, allowed domains, or interaction boundaries).
  • Vanilla safety classification, where models must detect whether a user request violates standard safety policies (e.g., toxicity, discrimination, biases).
  • Custom safety classification, where models must enforce explicit, custom-defined policies that vary between different use cases, services, products, and scenarios.

Comparison Systems

We evaluate ScopeGuard against a set of commercial and open-weight models commonly used for safety and moderation tasks. The comparison spans three categories: state-of-the-art commercial LLMs, their lower-latency variants, and specialized open-source safety models.

Commercial LLMs (state of the art)

  • GPT-5.2 (medium thinking), the latest, high-capability general-purpose model from OpenAI.
  • Claude Sonnet 4.5, one of the strongest models from Anthropic.
  • Gemini 3 Pro, the flagship general-purpose model from Google.

Commercial LLMs (fast and cost-efficient variants)

  • GPT-5 Mini, a smaller, lower-latency variant of GPT-5, designed for cost-sensitive and high-throughput deployments.
  • Claude Haiku 4.5, Anthropic’s lightweight model optimized for speed and efficiency, often used for real-time moderation and filtering.
  • Gemini 3 Flash, a fast inference variant of Gemini, targeting low-latency use cases while retaining reasonable instruction-following capabilities.

Open-weight safety-specialized model

  • Nemotron-Content-Safety-Reasoning-4B, a 4B-parameter open-weight model from NVIDIA, specifically trained for content safety and moderation, and commonly used as a strong open-source baseline for safety classification.

ScopeGuard models

Our initial family of ScopeGuard models consists of three models: a commercial closed-source model (available soon) and two open-weight models (available now) distilled from it.

  • scope-guard-4B-q-2601 Open ScopeGuard model based on Qwen’s Qwen3-4B-Instruct-2507, one of the best open-weight models at this size.

  • scope-guard-4B-g-2601 Open ScopeGuard model based on Google’s gemma-3-4b-it, one of the best open-weight models at this size.

  • scope-guard-pro-2601 Our closed-source ScopeGuard model, representing the most performant configuration used as a reference point in this evaluation.

Multilingual Scope Classification: Results

Multilingual scope classification is the primary task ScopeGuard is trained for. Given a user input, the model must decide whether the request falls within or outside the intended scope of the system (e.g., supported use cases, allowed domains, or interaction boundaries). In production, this decision typically acts as an early gate, determining whether a request should be handled, redirected, or blocked.

To evaluate this capability, we created an internal scope classification benchmark designed to reflect real user traffic and decision boundaries. Our benchmark is multilingual, covering five languages: English, Spanish, Italian, French, and German.

As shown in the results, scope-guard-4B-q-2601 and scope-guard-4B-g-2601 – our open-weight models – already outperform frontier commercial LLMs, including GPT-5.2, Claude Sonnet 4.5, and Gemini 3 Pro, in multilingual scope classification despite operating at a much smaller scale and with a deployment-friendly footprint. At the same time, our scope-guard-pro-2601 model achieves the best overall performance across the board, not only in terms of pure performance (macro F1 score across scope classes) but also in terms of latency.

More specifically, scope-guard-4B-q-2601 surpasses Gemini 3 Pro by 0.7 points in F1 score while being about 67 times faster in processing a single request than Gemini 3 Pro. Although Gemini 3 Flash is significantly faster than Gemini 3 Pro (almost 3 times), scope-guard-4B-q-2601 is still more than 26 times faster. Latencies of our models are measured on a single consumer-grade GPU; details below.

Results on multilingual scope classification

Provider | Model | Type | Macro F1 score | Avg. latency
OpenAI | GPT 5.2 (medium thinking) | Closed LLM | 87.4 | 51.1s
Anthropic | Claude Sonnet 4.5 | Closed LLM | 85.4 | 27.8s
Google | Gemini 3 Pro | Closed LLM | 88.4 | 31.8s
OpenAI | GPT 5 mini | Closed LLM | 86.6 | 23.8s
Anthropic | Claude Haiku 4.5 | Closed LLM | 75.3 | 19.8s
Google | Gemini 3 Flash | Closed LLM | 87.1 | 12.5s
Principled Intelligence | scope-guard-4B-q-2601 | Open SLM | 89.1 | 0.47s
Principled Intelligence | scope-guard-4B-g-2601 | Open SLM | 90.1 | 0.68s
Principled Intelligence | scope-guard-pro-2601 | Closed SLM | 91.9 | 0.23s

Vanilla Safety Classification: Results

Vanilla safety classification evaluates a model’s ability to detect whether a user request is harmful or unsafe under standard, generic safety policies (e.g., toxicity, discrimination, abusive language). This setting reflects how models are often used out of the box for moderation, without access to custom rules or domain-specific constraints.

We evaluate vanilla safety using the Toxic Chat benchmark, a widely adopted dataset for safety classification. To keep the comparison realistic, we use simple prompts and rely on each model’s default safety behavior, without injecting additional structure or custom policies.

As shown in the results, frontier commercial LLMs cluster tightly around similar performance levels, with GPT-5.2, Claude Sonnet 4.5, and Gemini 3 Pro achieving comparable Harmful F1 scores. Their fast and cost-efficient variants show a modest drop in performance, reflecting the usual trade-off between speed and accuracy.

Our scope-guard-pro-2601 model achieves the best overall performance on this benchmark, slightly surpassing all commercial models. At the same time, scope-guard-4B-q-2601 and scope-guard-4B-g-2601 perform competitively with much larger systems, despite being significantly smaller and optimized for governance tasks rather than generic content moderation.

The Nemotron-Content-Safety-Reasoning-4B model, while specifically trained for content safety, shows lower performance in this setup. This highlights an important distinction: strong results on vanilla safety benchmarks do not automatically transfer across datasets and prompting setups, and performance remains sensitive to evaluation conditions.

Overall, these results demonstrate that specialized governance models can match or exceed general-purpose LLMs even in generic safety settings, while retaining the deployment advantages of smaller models.

Results on Toxic Chat

Provider | Model | Type | Harmful F1 score
OpenAI | GPT 5.2 (medium thinking) | Closed LLM | 80.8
Anthropic | Claude Sonnet 4.5 | Closed LLM | 80.7
Google | Gemini 3 Pro | Closed LLM | 81.2
OpenAI | GPT 5 mini | Closed LLM | 78.8
Anthropic | Claude Haiku 4.5 | Closed LLM | 77.8
Google | Gemini 3 Flash | Closed LLM | 80.2
Nvidia | Nemotron-Content-Safety-Reasoning-4B | Open SLM | 75.9
Principled Intelligence | scope-guard-4B-q-2601 | Open SLM | 79.1
Principled Intelligence | scope-guard-4B-g-2601 | Open SLM | 78.0
Principled Intelligence | scope-guard-pro-2601 | Closed SLM | 81.8

Custom Safety Classification: Results

Custom safety classification evaluates a model’s ability to enforce explicit, user-defined policies rather than generic safety rules. This setting is representative of real enterprise deployments, where acceptable behavior is often shaped by product requirements, regulatory constraints, or organizational guidelines that go beyond standard moderation.

We evaluate custom safety using two complementary benchmarks:

  • DynaGuardrail, which focuses on applying dynamically specified policy constraints across diverse scenarios.
  • CoSApien, which evaluates whether models can correctly follow structured safety policies embedded directly in the dataset.

In both cases, policies are provided as part of the evaluation setup, and models are framed as general-purpose assistants. This ensures the evaluation measures policy application and constraint following, rather than reliance on hard-coded safety heuristics or narrow prompt templates.

As shown in the results, scope-guard-pro-2601 achieves the best overall performance on both benchmarks, outperforming all commercial and open-weight baselines. Notably, our open-weight scope-guard-4B-q-2601 model also performs extremely well, matching or surpassing frontier commercial LLMs on both DynaGuardrail and CoSApien despite its much smaller size.

Frontier commercial models perform strongly overall, but their performance varies across datasets, particularly when policies differ in structure and presentation. Fast and cost-efficient variants show a wider spread, reflecting increased sensitivity to policy complexity. NVIDIA's Nemotron-Content-Safety-Reasoning-4B model performs competitively on DynaGuardrail, but shows a larger drop on CoSApien, highlighting again how custom safety benchmarks can stress different aspects of policy understanding.

Overall, these results confirm that custom safety is not just an extension of vanilla moderation. This task requires models to reason over explicit constraints and apply them consistently. In this setting, our specialized governance models show a significant advantage, particularly when policies are complex, dynamic, or tightly specified.

Results on DynaGuardrail and CoSApien

Provider | Model | Type | DynaGuardrail | CoSApien
OpenAI | GPT 5.2 (medium thinking) | Closed LLM | 89.5 | 91.3
Anthropic | Claude Sonnet 4.5 | Closed LLM | 88.3 | 90.9
Google | Gemini 3 Pro | Closed LLM | 87.8 | 90.0
OpenAI | GPT 5 mini | Closed LLM | 88.4 | 87.0
Anthropic | Claude Haiku 4.5 | Closed LLM | 84.4 | 89.5
Google | Gemini 3 Flash | Closed LLM | 88.2 | 88.8
Nvidia | Nemotron-Content-Safety-Reasoning-4B | Open SLM | 87.6 | 86.2
Principled Intelligence | scope-guard-4B-q-2601 | Open SLM | 88.7 | 91.9
Principled Intelligence | scope-guard-4B-g-2601 | Open SLM | 87.8 | 88.2
Principled Intelligence | scope-guard-pro-2601 | Closed SLM | 91.6 | 92.4

Aligned With Principled Intelligence

Principled Intelligence's mission is to make AI governance transparent and effortless for end users. We take on the difficult semantic, operational, and policy layers so teams can govern with clear, human-readable artifacts instead of brittle rule code or opaque dashboards.

ScopeGuard is a concrete expression of that vision:

  • Plain-language governance: Your service description becomes the living policy, editable by product, compliance, or domain experts without retraining.
  • Operational transparency: Every classification includes evidence spans, making decisions auditable and explainable to stakeholders.
  • Simpler guardrails: Teams no longer manually triage ambiguous queries or write ad hoc regex filters; scope becomes a first-class signal.
  • Foundation for broader governance: Scope analysis of user queries is a cornerstone of the AI governance ecosystem that we are building.

Our north star: if you can articulate what your AI should and should not do in natural language, our platform should make enforcing, measuring, and iterating that governance simple, reliable, and cost-effective.

FAQ

Do you offer a hosted solution?

Yes, we do! We are in the process of rolling out a dedicated inference platform. If you are interested in early access, please reach out at orbitals@principled-intelligence.com.

Why are system prompt specifications not enough?

This is a legitimate question.

First, system prompt specifications are guides that you hope the LLM will follow, and they offer only a loose guarantee, especially as prompt and task complexity increases. ScopeGuard, by contrast, has been trained extensively on this task and offers a much stronger assurance that unintended queries will be blocked.

Second, LLMs work best in English, but performance degrades significantly in other languages. This means that system prompt specifications written in other languages are more prone to being ignored or misinterpreted. In contrast, ScopeGuard has been trained extensively on its target languages.

Third, even when system prompt specifications do work, you typically get only the reply to the user: no classification of what is happening (for monitoring, trend analysis, etc.) and no evidence. ScopeGuard provides all of this out of the box.

Fourth, if you are using LLMs like GPT-5, Claude Sonnet 4.5, etc., you know they are both expensive and not really blazing fast. ScopeGuard is much smaller, cheaper, and faster to run, and lets you block irrelevant queries early, reducing unnecessary API calls and token usage.

Why not just safety or intent classification?

Traditional safety tools focus exclusively on unsafe / harmful content detection, while intent classification solutions tend to provide coarse-grained labels. Both of these approaches adopt a generic perspective, and are not specific to your AI service.

In contrast, ScopeGuard adds semantic purpose boundaries that you can tailor to your service's needs.

Which languages are supported? Are you going to add more?

Currently, our models support English, French, German, Italian, and Spanish. Subsequent releases will expand coverage. If you have a specific language need, please reach out at orbitals@principled-intelligence.com or open an issue on GitHub.

Can I run it offline?

Yes, we targeted the 4B size precisely to enable local inference! We recommend using the vLLM backend for optimal performance. We also support standard Hugging Face inference via Pipelines.
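
As a rough sketch, switching between the two backends is just a matter of the backend argument at initialization. The "hf" identifier below is an assumption based on the orbitals[scope-guard-hf] install extra; check the documentation for the exact value.

from orbitals.scope_guard import ScopeGuard

# vLLM backend: recommended for local, GPU-accelerated inference.
sg_vllm = ScopeGuard(backend="vllm", model="small")

# Hugging Face Pipelines backend (assumed identifier "hf").
sg_hf = ScopeGuard(backend="hf", model="small")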

Does it learn from traffic automatically?

Not automatically. If you want to improve performance over time, you have two main options:

  1. Periodically re-train or fine-tune the model on new data.
  2. Improve the description and adapt it as your AI service evolves.

We strongly encourage you to try the second option first. It is significantly easier and cheaper, while often being equally effective.

Do you have any recommendations for writing good AI Service Descriptions?

Yes! Here are some tips:

  • You have two possible starting points: either your system prompt, or a concise summary of what your AI service does. Both are valid.
  • Be clear and concise.
  • While we said descriptions can be any kind of text, we actually designed a structured AI Service Description format. This format lets you describe all the nuances and important details of your AI service in a systematic way, and ScopeGuard has been explicitly trained to leverage it.

If you need help crafting the perfect description for your use case, reach out to us. We offer dedicated support to help you get the most out of ScopeGuard.

Scope Classification is amazing. Can I get rid of Safety Classification then?

Kinda — in many practical deployments, yes.

ScopeGuard does what you specify in your AI Service Description. If you explicitly list unsafe categories you want to block (e.g., hate, harassment, sexual content, self-harm instructions, etc.) and mark them as disallowed, ScopeGuard will enforce those constraints as part of scope classification.
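
As an illustration, a description that folds a few unsafe categories into the service's own rules could look like this sketch (the description text is just an example, not a recommended safety policy):

from orbitals.scope_guard import ScopeGuard

ai_service_description = (
    "You are a virtual assistant for a package delivery service. "
    "You can only answer questions about package tracking. "
    "Never respond to requests for refunds. "
    "Never produce hate speech, harassment, sexual content, "
    "or instructions that facilitate self-harm."
)

sg = ScopeGuard(backend="vllm", model="small")
result = sg.validate(
    "Write an insulting message about my delivery driver.",
    ai_service_description,
)
print(result.scope_class)  # expected: ScopeClass.RESTRICTED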

That said, Scope Classification and Safety Classification solve different problems: ScopeGuard verifies whether a request matches the intended usage of your AI service, while Safety Classification provides a guardrail against generically unsafe content (under a broad, policy-style definition).

You can absolutely write a dedicated description that makes ScopeGuard behave like a general safety filter, but in mission-critical environments running both side by side is usually the best setup: ScopeGuard enforces your product's intended boundaries, and Safety Classification adds an independent layer of protection against unsafe content.