BENCHMARK EVERYTHING THAT MATTERS
Evaluations
Run structured evaluations across models, retrieval pipelines, prompts, agents, and workflows — all within the FloTorch console. Every evaluation is organized as a project with experiments you can compare side by side, so you know exactly what to change and why.

[Architecture diagram: Your Business Application using GenAI — Observability, Routing, Guardrails, Security, Workspace Management, Prompt Management, Governance, AI Gateway, Custom Models, Fine-Tuned Models, MCP Servers]

FIVE TYPES. ONE FRAMEWORK.
Stop Guessing. Start Measuring.
Whether you're comparing inferencing models, validating retrieval quality, or testing a multi-agent workflow end to end, FloTorch evaluations give you the data to make the call with confidence — not intuition.
LLM Evaluations — Find the Right Model for Your Use Case
Run question-answer style evaluations across one or more inferencing models on your dataset. Configure N-shot counts, system and user prompts, and scoring models per experiment. Each model combination runs as a separate experiment so you can compare performance, cost, and latency in a single view — without rebuilding your evaluation setup from scratch.
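As a rough illustration of how "each model combination runs as a separate experiment" works, the setup can be thought of as a cross product of models and N-shot settings. This is a conceptual sketch, not the FloTorch API; the model names and prompt are placeholders.

```python
from itertools import product

# Hypothetical inputs — model names and shot counts are illustrative.
models = ["model-a", "model-b"]
n_shots = [0, 3]

# Each (model, n_shot) pair becomes its own experiment, so performance,
# cost, and latency can be compared side by side without rebuilding the setup.
experiments = [
    {"model": m, "n_shot": n, "system_prompt": "You are a helpful assistant."}
    for m, n in product(models, n_shots)
]

print(len(experiments))  # 4 — one experiment per model × shot-count pair
```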
RAG Evaluations — Validate Your Entire Retrieval Pipeline
Go beyond answer quality. FloTorch RAG evaluations score how well your retrieval stack performs — from context precision and recall to faithfulness and noise sensitivity. Connect a knowledge base, select your inferencing and embedding models, and get per-question visibility into where your pipeline succeeds and where it breaks down. Completed RAG experiments can be deployed directly as a RAG endpoint.
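To give a feel for what context precision and recall measure, here is a simplified sketch using set-membership definitions. These are assumed textbook-style formulas for illustration, not FloTorch's internal scoring implementation.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved, relevant):
    """Fraction of relevant chunks that were retrieved."""
    if not relevant:
        return 0.0
    return sum(1 for c in relevant if c in retrieved) / len(relevant)

# Hypothetical per-question result: 4 chunks retrieved, 3 known to be relevant.
retrieved = ["chunk-1", "chunk-2", "chunk-3", "chunk-4"]
relevant = {"chunk-1", "chunk-3", "chunk-5"}

print(context_precision(retrieved, relevant))  # 0.5 — 2 of 4 retrieved chunks are relevant
print(round(context_recall(retrieved, relevant), 3))  # 0.667 — 2 of 3 relevant chunks retrieved
```

Per-question scores like these are what make it possible to see exactly where a retrieval pipeline breaks down rather than averaging the problem away.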
Prompt Evaluations — Know Which Instructions Actually Work
Test multiple system and user prompt variants against the same dataset without changing your model or retrieval setup. Each prompt pair runs as a separate experiment, and scores are returned per pair — so you can see exactly which instruction set produces better answers. Combine with retrieval settings to evaluate prompts in RAG-style flows as well.
Agent Evaluations — Score Goal Achievement and Tool Use
Evaluate agents against trajectory-style metrics that go beyond output quality. FloTorch assesses whether the agent understood the task, used the right tools, and produced a response consistent with its execution trace. Run evaluations against any published agent in your workspace — no additional instrumentation required.
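One way to picture a trajectory-style metric is checking whether the expected tool calls appear, in order, in the agent's execution trace. The scorer below is a deliberately simplified sketch, not FloTorch's agent-evaluation logic.

```python
def tool_use_score(expected_tools, actual_trace):
    """Share of expected tool calls found in order in the agent's trace.
    Simplified in-order subsequence match — illustrative only."""
    if not expected_tools:
        return 1.0
    trace_iter = iter(actual_trace)
    matched = sum(1 for tool in expected_tools if tool in trace_iter)
    return matched / len(expected_tools)

# Hypothetical trace: the agent searched, ran a calculation, then responded.
trace = ["search", "calculator", "respond"]
print(tool_use_score(["search", "calculator"], trace))  # 1.0 — both tools used, in order
print(tool_use_score(["calculator", "search"], trace))  # 0.5 — expected order violated
```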
Workflow Evaluations — End-to-End Scoring for Multi-Agent DAGs
Validate agentic workflows where multiple agents operate in sequence or in parallel. Define your workflow as a DAG, provide test cases with expected outcomes, and FloTorch evaluates each node's behavior as well as the overall workflow output. Results include per-agent traces so you can see exactly where the workflow succeeded and where it diverged from intent.
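Conceptually, evaluating a DAG-shaped workflow means scoring each node in dependency order and keeping a per-node trace. The sketch below shows that shape using Python's standard-library topological sorter; the node names and placeholder scoring are hypothetical, not FloTorch's workflow format.

```python
from graphlib import TopologicalSorter

# Hypothetical two-branch workflow: a researcher and a summarizer
# both feed a reviewer node.
dag = {"reviewer": {"researcher", "summarizer"}}

def evaluate_node(name, upstream_names):
    # Placeholder per-node check; a real evaluation would score the node's
    # output against the test case's expected outcome.
    return {"node": name, "inputs": sorted(upstream_names), "passed": True}

outputs, trace = {}, []
for node in TopologicalSorter(dag).static_order():
    result = evaluate_node(node, outputs.keys())
    trace.append(result)
    outputs[node] = result

# The per-node trace shows where the workflow diverged from intent;
# the reviewer is always evaluated last, after both of its dependencies.
print([r["node"] for r in trace])
```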
Results You Can Act On — Compare, Inspect, and Export
Every evaluation project surfaces a results table with configurable columns — metric scores, model names, prompt variants, cost, and duration — across all experiments in that project. Drill into any experiment to see per-question outputs, ground truth comparisons, and the exact configuration used. Export full results for external review or version-controlled analysis.

