Introduction
Every enterprise GenAI team eventually hits the same wall: models are proliferating, evaluation is manual and inconsistent, and the infrastructure holding everything together wasn't built for production scale.
This release is FloTorch's answer to all three.
Version 1.13 ships Smart Routing, the Floeval SDK, a comprehensive Evaluations framework, and Gateway Hono — along with enhanced dataset tooling and ready-to-run workflow Blueprints. Taken together, these features close the loop between experimentation and production: run any model, evaluate every layer of your stack, and route requests intelligently — without rewriting your workflow each time.
Here's what's new and what it means for teams building on FloTorch.
Smart Routing: The Right Model for Every Query, Automatically
Picking the right model for each request has always been a manual tax on engineering teams. Use a frontier model for everything and you overspend. Use smaller models to cut costs and quality degrades on complex queries. The tradeoff is real — and until now, managing it required constant intervention.
Smart Routing removes that burden.
FloTorch now automatically evaluates the complexity of each incoming prompt and selects the most suitable model from your configured pool — routing straightforward queries to lighter, faster, lower-cost models and escalating complex ones to higher-capability options. No manual rules. No static configuration. No code changes to your existing workflows.
The practical upside is meaningful. Teams running high-volume workloads can expect significant cost reduction on queries that don't need a frontier model, without sacrificing output quality on the ones that do. And because routing is automatic, it scales with your usage — the optimization happens continuously, not just when someone remembers to tune it.
For teams already using FloTorch's Unified Gateway, Smart Routing integrates directly into the same endpoint layer. Switching it on doesn't mean switching anything else off.
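FloTorch's actual routing logic is internal to the platform, but the core idea can be illustrated with a toy sketch. Everything below is hypothetical: the scoring heuristic, thresholds, and model names are invented for illustration and are not FloTorch's implementation.

```python
# Hypothetical illustration of complexity-based routing -- NOT FloTorch's
# actual implementation. The heuristic, thresholds, and model names are invented.

def complexity_score(prompt: str) -> float:
    """Crude proxy for query complexity: length plus reasoning keywords."""
    keywords = ("explain", "compare", "analyze", "step by step", "prove")
    score = min(len(prompt.split()) / 200, 1.0)            # longer -> harder
    score += 0.3 * sum(k in prompt.lower() for k in keywords)
    return min(score, 1.0)

def route(prompt: str, pool: list[tuple[float, str]]) -> str:
    """Pick the cheapest model whose capability threshold covers the score.
    `pool` is sorted ascending by (threshold, model name)."""
    score = complexity_score(prompt)
    for threshold, model in pool:
        if score <= threshold:
            return model
    return pool[-1][1]  # fall back to the most capable model

pool = [(0.2, "small-fast-model"), (0.6, "mid-tier-model"), (1.0, "frontier-model")]
print(route("What time is it in Tokyo?", pool))                         # light query
print(route("Compare and analyze these architectures step by step", pool))  # complex query
```

The value of the real feature is that this decision happens at the gateway layer, continuously and per request, rather than in hand-tuned heuristics like the sketch above.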
Floeval SDK + Evaluations: A Unified Framework for Everything in Your Stack
Most evaluation setups have a blind spot. Teams can benchmark model responses in isolation, but the moment you introduce retrieval, agents, or multi-step workflows, the evaluation falls apart — different tools, inconsistent metrics, no clear way to compare across iterations.
This release addresses that comprehensively.
The Floeval SDK is a unified, multi-backend evaluation framework that covers LLMs, RAG pipelines, prompts, agents, and multi-agent workflows in a single library. It supports RAGAS, DeepEval, and custom metrics, runs via CLI or direct SDK integration, and connects natively to FloTorch's gateway for end-to-end traceability. Evaluation notebooks are available to get started quickly.
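The Floeval SDK's exact interface isn't reproduced here, but "custom metrics" of the kind it supports generally reduce to a scoring function over (question, answer, reference) rows, averaged across a dataset. A framework-agnostic sketch (function and field names are illustrative, not Floeval's API):

```python
# Framework-agnostic sketch of a custom evaluation metric -- the Floeval SDK's
# real interface may differ; names and field layout here are illustrative only.

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, a common answer-quality metric for QA/RAG evals."""
    pred, ref = set(prediction.lower().split()), set(reference.lower().split())
    common = len(pred & ref)
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate(dataset: list[dict]) -> float:
    """Average the metric over rows of {question, answer, reference}."""
    scores = [token_f1(row["answer"], row["reference"]) for row in dataset]
    return sum(scores) / len(scores)

rows = [
    {"question": "Capital of France?", "answer": "Paris", "reference": "Paris"},
    {"question": "2+2?", "answer": "five", "reference": "four"},
]
print(evaluate(rows))  # 0.5: exact match on the first row, miss on the second
```

Libraries like RAGAS and DeepEval package metrics of this shape (plus LLM-judged ones); a unified SDK lets the same dataset run through all of them with consistent reporting.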
Explore more about evaluations on GitHub: [Link]
The Evaluations module within FloTorch extends this to the platform level — providing outcome-based evaluation across all components using structured datasets, with consistent metric definitions and side-by-side comparison of models, prompts, and workflows across iterations.
Together, they mean one thing practically: teams can now measure quality consistently across every layer of their GenAI stack, not just at the model output level.
This matters most for teams in production or approaching it. When a RAG pipeline regresses, you need to know whether the problem is in the retrieval step, the embedding model, the prompt, or the LLM — not just that quality dropped. The Floeval SDK gives you that visibility.
FloTorch's existing benchmarking work — including the Amazon Nova vs GPT-4o study and the FinanceBench report — was built on this evaluation methodology. The SDK makes that same rigor available to every team building on the platform.
A More Powerful Gateway for Scale
We've significantly enhanced FloTorch's Gateway to handle higher request volumes and support enterprise-scale AI product development. The updated architecture supports true streaming through LLM endpoints — delivering faster, more reliable performance as your workloads grow.
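What streaming means for a consuming application: instead of blocking until the full completion arrives, tokens are rendered incrementally as chunks. The Gateway's actual client interface isn't shown here; the toy sketch below just simulates the incremental-assembly pattern a streaming consumer follows.

```python
# Toy illustration of consuming a streamed completion -- the Gateway's real
# client API may differ; this only demonstrates incremental chunk assembly.
from typing import Iterator

def fake_stream() -> Iterator[str]:
    """Stand-in for a streamed model response, one chunk at a time."""
    yield from ["Stream", "ing ", "reduces ", "time-", "to-", "first-", "token."]

parts = []
for chunk in fake_stream():
    parts.append(chunk)   # in a real UI, render each chunk as it arrives
full_response = "".join(parts)
print(full_response)
```

The practical win is time-to-first-token: users see output begin almost immediately, even when the full generation takes seconds.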
Blueprints: Pre-Built Workflows That Skip the Starting-From-Scratch Problem
Agentic workflows are only valuable when they're actually running. The gap between "we know what we want to automate" and "it's working in production" is where most teams lose time.
Blueprints close that gap.
This release ships four pre-built, ready-to-run workflow templates:
- Google Calendar Blueprint — Automates event scheduling, availability checks, and meeting creation including Google Meet link generation
- Presales Research Blueprint — Generates structured client research briefs before sales conversations, pulling relevant context automatically
- Release Notes Generator — Produces structured release documentation directly from public repository commits and pull requests
- Competitor Battle Cards — Delivers quick competitor comparisons covering strengths, gaps, and positioning insights on demand
Each Blueprint is a fully wired workflow, not a template that requires configuration from scratch. Teams can run them as-is or use them as the starting point for customized versions. For teams new to agentic workflows, Blueprints also function as working reference architectures — examples of how to structure multi-step automation in FloTorch.
Dataset Enhancements: Better Inputs, Better Evaluations
Evaluation quality is only as good as the data behind it. This release upgrades dataset management with a wizard-based UI that simplifies creation from multiple sources: PDFs (synthetic generation), model traces, Hugging Face datasets, and ground truth files.
Critically, datasets now support Q&A pairs with context — enabling richer, more accurate evaluations that reflect real-world usage patterns rather than isolated query-response pairs. Auto-capture enables continuous dataset generation from live traffic, so your evaluation sets stay current as your application evolves.
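FloTorch's actual dataset schema isn't documented here, but conceptually a Q&A pair with context carries the retrieved passages alongside the question and ground truth, so evaluators can score retrieval and generation separately. A hypothetical record shape (field names are illustrative, not FloTorch's schema):

```python
# Hypothetical shape of a Q&A-with-context dataset record -- field names are
# illustrative only, not FloTorch's actual schema.
import json

record = {
    "question": "What does Smart Routing do?",
    "context": [
        "Smart Routing evaluates prompt complexity and selects a model.",
        "Simple queries go to lighter models; complex ones escalate.",
    ],
    "ground_truth": "It routes each query to the most suitable model "
                    "in the configured pool based on complexity.",
    "metadata": {"source": "release-notes", "split": "eval"},
}

# Records like this serialize cleanly to JSONL for evaluation runs.
line = json.dumps(record)
restored = json.loads(line)
print(restored["question"])
```

Including `context` is what lets an evaluator ask both "was the right passage retrieved?" and "was the answer faithful to it?" — the separation the Floeval SDK's RAG metrics depend on.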
Dataset creation that previously required manual setup can now be completed in minutes.
What's Coming Next
Our roadmap includes AI-powered Blueprint generation, enhanced delete functionality, new Evaluation Blueprints, and improved log grouping for observability. UI stability and performance improvements will continue rolling through subsequent releases.
Get Started
FloTorch v1.13 is available now on AWS Marketplace and GitHub.
Full documentation, SDK guides, and evaluation notebooks are available at https://docs.flotorch.cloud/evaluations/overview/
Questions or feedback: support@flotorch.ai
