BENCHMARK EVERYTHING THAT MATTERS

Evaluations

Run structured evaluations across models, retrieval pipelines, prompts, agents, and workflows — all within the FloTorch console. Every evaluation is organized as a project with experiments you can compare side by side, so you know exactly what to change and why.

[Platform diagram: your business application using GenAI connects through the FloTorch AI Gateway, which provides observability, routing, guardrails, security, workspace management, prompt management, and governance, and routes to custom models, fine-tuned models, and MCP servers.]
FIVE TYPES. ONE FRAMEWORK.

Stop Guessing. Start Measuring.

Whether you're comparing inferencing models, validating retrieval quality, or testing a multi-agent workflow end to end, FloTorch evaluations give you the data to make the call with confidence — not intuition.

LLM Evaluations — Find the Right Model for Your Use Case

Run question-answer style evaluations across one or more inferencing models on your dataset. Configure N-shot counts, system and user prompts, and scoring models per experiment. Each model combination runs as a separate experiment so you can compare performance, cost, and latency in a single view — without rebuilding your evaluation setup from scratch.
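As a rough sketch of how such a grid comes together (the field names below are illustrative assumptions, not FloTorch's actual SDK), each model-and-setting combination becomes its own experiment configuration:

```python
from itertools import product

# Hypothetical experiment grid; field names are illustrative
# assumptions, not FloTorch's actual API.
models = ["model-a", "model-b"]
shot_counts = [0, 3]

experiments = [
    {
        "model": model,
        "n_shot": n_shot,
        "system_prompt": "Answer strictly from the provided context.",
        "scoring_model": "judge-model",   # model that grades each answer
        "dataset": "qa_dataset.jsonl",    # question-answer pairs
    }
    for model, n_shot in product(models, shot_counts)
]
# 2 models x 2 shot counts -> 4 experiments, compared in one project view.
print(f"{len(experiments)} experiments queued")
```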

RAG Evaluations — Validate Your Entire Retrieval Pipeline

Go beyond answer quality. FloTorch RAG evaluations score how well your retrieval stack performs — from context precision and recall to faithfulness and noise sensitivity. Connect a knowledge base, select your inferencing and embedding models, and get per-question visibility into where your pipeline succeeds and where it breaks down. Completed RAG experiments can be deployed directly as a RAG endpoint.
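To make those metrics concrete, here is a hedged sketch of per-question RAG scores and how they might aggregate; the record layout is an assumption for illustration, not FloTorch's export format:

```python
from statistics import mean

# Hypothetical per-question RAG scores; the record layout is an
# illustrative assumption, not FloTorch's export format.
results = [
    {"question": "What is the refund window?",
     "context_precision": 0.92, "context_recall": 0.88,
     "faithfulness": 0.95, "noise_sensitivity": 0.10},
    {"question": "How do I rotate an API key?",
     "context_precision": 0.61, "context_recall": 0.70,
     "faithfulness": 0.74, "noise_sensitivity": 0.35},
]

# Averaging each metric across questions exposes systemic weaknesses,
# e.g. low recall often points at chunking or top-k settings.
for metric in ("context_precision", "context_recall",
               "faithfulness", "noise_sensitivity"):
    print(f"{metric}: {mean(r[metric] for r in results):.2f}")
```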

Prompt Evaluations — Know Which Instructions Actually Work

Test multiple system and user prompt variants against the same dataset without changing your model or retrieval setup. Each prompt pair runs as a separate experiment, and scores are returned per pair — so you can see exactly which instruction set produces better answers. Combine with retrieval settings to evaluate prompts in RAG-style flows as well.
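A minimal sketch of the idea, assuming illustrative prompt and dataset names (not FloTorch's actual API): holding the model and dataset fixed makes any score difference attributable to the prompts alone.

```python
# Hypothetical prompt-pair grid; the model and dataset stay fixed so
# any score difference is attributable to the prompts alone.
prompt_pairs = [
    ("You are a concise support agent.",
     "Answer briefly: {question}"),
    ("You are a thorough support agent who cites sources.",
     "Question: {question}\nAnswer with citations."),
]

experiments = [
    {"system_prompt": sys_p, "user_prompt": usr_p,
     "model": "fixed-model", "dataset": "qa_dataset.jsonl"}
    for sys_p, usr_p in prompt_pairs
]
# One experiment per pair -> scores come back per pair, not blended.
print(f"{len(experiments)} prompt experiments queued")
```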

Agent Evaluations — Score Goal Achievement and Tool Use

Evaluate agents against trajectory-style metrics that go beyond output quality. FloTorch assesses whether the agent understood the task, used the right tools, and produced a response consistent with its execution trace. Run evaluations against any published agent in your workspace — no additional instrumentation required.
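The sketch below shows one way a trajectory check can work, using a hypothetical trace shape and scoring helper; it illustrates the concept, not FloTorch's implementation:

```python
# Hypothetical execution trace for one agent run; the trajectory check
# below illustrates the idea, not FloTorch's scoring implementation.
trace = {
    "task": "Refund order #1234",
    "expected_tools": ["lookup_order", "issue_refund"],
    "tool_calls": ["lookup_order", "issue_refund"],
    "final_answer": "Refund of $42.00 issued for order #1234.",
}

def tool_use_score(trace: dict) -> float:
    """Fraction of expected tool calls the agent made, in order."""
    pairs = zip(trace["expected_tools"], trace["tool_calls"])
    hits = sum(1 for expected, actual in pairs if expected == actual)
    return hits / len(trace["expected_tools"])

print(f"tool use: {tool_use_score(trace):.0%}")  # -> tool use: 100%
```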

Workflow Evaluations — End-to-End Scoring for Multi-Agent DAGs

Validate agentic workflows where multiple agents operate in sequence or in parallel. Define your workflow as a DAG, provide test cases with expected outcomes, and FloTorch evaluates each node's behavior as well as the overall workflow output. Results include per-agent traces so you can see exactly where the workflow succeeded and where it diverged from intent.
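As a hedged illustration of defining a workflow DAG and a test case (node names, agents, and the test-case shape are assumptions, not FloTorch's schema):

```python
# Hypothetical multi-agent workflow expressed as a DAG; node names,
# agents, and the test-case shape are illustrative assumptions.
workflow = {
    "nodes": {
        "research":  {"agent": "web-researcher"},
        "summarize": {"agent": "summarizer"},
        "review":    {"agent": "fact-checker"},
    },
    # "research" feeds "summarize"; "summarize" feeds "review".
    "edges": [("research", "summarize"), ("summarize", "review")],
}

test_case = {
    "input": "Summarize Q3 revenue drivers for ACME Corp.",
    "expected_outcome": "A fact-checked summary citing two sources.",
}
# Evaluation scores each node's output against its role, then the
# final node's output against expected_outcome, with per-agent traces.
```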

Results You Can Act On — Compare, Inspect, and Export

Every evaluation project surfaces a results table with configurable columns — metric scores, model names, prompt variants, cost, and duration — across all experiments in that project. Drill into any experiment to see per-question outputs, ground truth comparisons, and the exact configuration used. Export full results for external review or version-controlled analysis.
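For instance, a minimal sketch of exporting results for version-controlled analysis; the column names mirror the fields described above, but the exact export schema is an assumption:

```python
import csv

# Hypothetical exported rows; columns mirror the results-table fields
# described above, but the exact export schema is an assumption.
rows = [
    {"experiment": "exp-1", "model": "model-a", "prompt_variant": "v1",
     "score": 0.87, "cost_usd": 0.42, "duration_s": 118},
    {"experiment": "exp-2", "model": "model-b", "prompt_variant": "v1",
     "score": 0.91, "cost_usd": 0.65, "duration_s": 142},
]

with open("eval_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
# Commit the CSV alongside the experiment configs for diffable,
# version-controlled analysis.
```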