BENCHMARK EVERYTHING THAT MATTERS

Evaluations

Run structured evaluations across models, retrieval pipelines, prompts, agents, and workflows — all within the FloTorch console. Every evaluation is organized as a project with experiments you can compare side by side, so you know exactly what to change and why.

[Platform diagram: your business application using GenAI connects through the FloTorch AI Gateway, which provides observability, routing, guardrails, security, workspace management, prompt management, and governance, and routes to custom models, fine-tuned models, and MCP servers.]
FIVE TYPES. ONE FRAMEWORK.

Stop Guessing. Start Measuring.

Whether you're comparing inferencing models, validating retrieval quality, or testing a multi-agent workflow end to end, FloTorch evaluations give you the data to make the call with confidence — not intuition.

LLM Evaluations — Find the Right Model for Your Use Case

Run question-answer style evaluations across one or more inferencing models on your dataset. Configure N-shot counts, system and user prompts, and scoring models per experiment. Each model combination runs as a separate experiment so you can compare performance, cost, and latency in a single view — without rebuilding your evaluation setup from scratch.
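As a rough sketch of how such a grid comes together (the field names below are illustrative assumptions, not FloTorch's actual SDK), each model-and-setting combination becomes its own experiment configuration:

```python
from itertools import product

# Hypothetical experiment grid; field names are illustrative
# assumptions, not FloTorch's actual API.
models = ["model-a", "model-b"]
shot_counts = [0, 3]

experiments = [
    {
        "model": model,
        "n_shot": n_shot,
        "system_prompt": "Answer strictly from the provided context.",
        "scoring_model": "judge-model",   # model that grades each answer
        "dataset": "qa_dataset.jsonl",    # question-answer pairs
    }
    for model, n_shot in product(models, shot_counts)
]
# 2 models x 2 shot counts -> 4 experiments, compared in one project view.
print(f"{len(experiments)} experiments queued")
```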

RAG Evaluations — Validate Your Entire Retrieval Pipeline

Go beyond answer quality. FloTorch RAG evaluations score how well your retrieval stack performs — from context precision and recall to faithfulness and noise sensitivity. Connect a knowledge base, select your inferencing and embedding models, and get per-question visibility into where your pipeline succeeds and where it breaks down. Completed RAG experiments can be deployed directly as a RAG endpoint.
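To make those metrics concrete, here is a hedged sketch of per-question RAG scores and how they might aggregate; the record layout is an assumption for illustration, not FloTorch's export format:

```python
from statistics import mean

# Hypothetical per-question RAG scores; the record layout is an
# illustrative assumption, not FloTorch's export format.
results = [
    {"question": "What is the refund window?",
     "context_precision": 0.92, "context_recall": 0.88,
     "faithfulness": 0.95, "noise_sensitivity": 0.10},
    {"question": "How do I rotate an API key?",
     "context_precision": 0.61, "context_recall": 0.70,
     "faithfulness": 0.74, "noise_sensitivity": 0.35},
]

# Averaging each metric across questions exposes systemic weaknesses,
# e.g. low recall often points at chunking or top-k settings.
for metric in ("context_precision", "context_recall",
               "faithfulness", "noise_sensitivity"):
    print(f"{metric}: {mean(r[metric] for r in results):.2f}")
```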

Prompt Evaluations — Know Which Instructions Actually Work

Test multiple system and user prompt variants against the same dataset without changing your model or retrieval setup. Each prompt pair runs as a separate experiment, and scores are returned per pair — so you can see exactly which instruction set produces better answers. Combine with retrieval settings to evaluate prompts in RAG-style flows as well.
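A minimal sketch of the idea, assuming illustrative prompt and dataset names (not FloTorch's actual API): holding the model and dataset fixed makes any score difference attributable to the prompts alone.

```python
# Hypothetical prompt-pair grid; the model and dataset stay fixed so
# any score difference is attributable to the prompts alone.
prompt_pairs = [
    ("You are a concise support agent.",
     "Answer briefly: {question}"),
    ("You are a thorough support agent who cites sources.",
     "Question: {question}\nAnswer with citations."),
]

experiments = [
    {"system_prompt": sys_p, "user_prompt": usr_p,
     "model": "fixed-model", "dataset": "qa_dataset.jsonl"}
    for sys_p, usr_p in prompt_pairs
]
# One experiment per pair -> scores come back per pair, not blended.
print(f"{len(experiments)} prompt experiments queued")
```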

Agent Evaluations — Score Goal Achievement and Tool Use

Evaluate agents against trajectory-style metrics that go beyond output quality. FloTorch assesses whether the agent understood the task, used the right tools, and produced a response consistent with its execution trace. Run evaluations against any published agent in your workspace — no additional instrumentation required.
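The sketch below shows one way a trajectory check can work, using a hypothetical trace shape and scoring helper; it illustrates the concept, not FloTorch's implementation:

```python
# Hypothetical execution trace for one agent run; the trajectory check
# below illustrates the idea, not FloTorch's scoring implementation.
trace = {
    "task": "Refund order #1234",
    "expected_tools": ["lookup_order", "issue_refund"],
    "tool_calls": ["lookup_order", "issue_refund"],
    "final_answer": "Refund of $42.00 issued for order #1234.",
}

def tool_use_score(trace: dict) -> float:
    """Fraction of expected tool calls the agent made, in order."""
    pairs = zip(trace["expected_tools"], trace["tool_calls"])
    hits = sum(1 for expected, actual in pairs if expected == actual)
    return hits / len(trace["expected_tools"])

print(f"tool use: {tool_use_score(trace):.0%}")  # -> tool use: 100%
```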

Workflow Evaluations — End-to-End Scoring for Multi-Agent DAGs

Validate agentic workflows where multiple agents operate in sequence or in parallel. Define your workflow as a DAG, provide test cases with expected outcomes, and FloTorch evaluates each node's behavior as well as the overall workflow output. Results include per-agent traces so you can see exactly where the workflow succeeded and where it diverged from intent.
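As a hedged illustration of defining a workflow DAG and a test case (node names, agents, and the test-case shape are assumptions, not FloTorch's schema):

```python
# Hypothetical multi-agent workflow expressed as a DAG; node names,
# agents, and the test-case shape are illustrative assumptions.
workflow = {
    "nodes": {
        "research":  {"agent": "web-researcher"},
        "summarize": {"agent": "summarizer"},
        "review":    {"agent": "fact-checker"},
    },
    # "research" feeds "summarize"; "summarize" feeds "review".
    "edges": [("research", "summarize"), ("summarize", "review")],
}

test_case = {
    "input": "Summarize Q3 revenue drivers for ACME Corp.",
    "expected_outcome": "A fact-checked summary citing two sources.",
}
# Evaluation scores each node's output against its role, then the
# final node's output against expected_outcome, with per-agent traces.
```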

Results You Can Act On — Compare, Inspect, and Export

Every evaluation project surfaces a results table with configurable columns — metric scores, model names, prompt variants, cost, and duration — across all experiments in that project. Drill into any experiment to see per-question outputs, ground truth comparisons, and the exact configuration used. Export full results for external review or version-controlled analysis.
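For instance, a minimal sketch of exporting results for version-controlled analysis; the column names mirror the fields described above, but the exact export schema is an assumption:

```python
import csv

# Hypothetical exported rows; columns mirror the results-table fields
# described above, but the exact export schema is an assumption.
rows = [
    {"experiment": "exp-1", "model": "model-a", "prompt_variant": "v1",
     "score": 0.87, "cost_usd": 0.42, "duration_s": 118},
    {"experiment": "exp-2", "model": "model-b", "prompt_variant": "v1",
     "score": 0.91, "cost_usd": 0.65, "duration_s": 142},
]

with open("eval_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
# Commit the CSV alongside the experiment configs for diffable,
# version-controlled analysis.
```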