FloTorch Research

Science Behind Production-Grade AI Systems

We don't just build tooling — we run the experiments, publish the findings, and engineer the stack that makes GenAI reliable at scale.

Performance Benchmarking

Independent Evaluation You Can Trust

FloTorch runs rigorous, reproducible benchmarks across LLMs, retrieval stacks, and cloud providers — no vendor bias, full methodology transparency. Every number traces to a real production workload.

65% Cost Reduction

Amazon Nova Pro delivers 65% lower cost than GPT-4o at comparable accuracy — verified across the CRAG dataset with 2,000+ queries.

+40 NDCG@1 Gain

Canonical text normalization alone recovered 40+ NDCG@1 points for MedEmbed — a critical finding for medical and multilingual retrieval stacks.

LLM-as-a-Judge

When ground truth doesn't exist — medical domains, proprietary data — we generate unbiased pseudo-labels via cross-model consensus scoring for fair evaluation.

22% Faster Latency

Nova Pro is 22% faster than GPT-4o on average — measured end-to-end including retrieval and inference across live workloads.

Cross-Cloud Comparison

AWS inference runs 40% faster than Azure at 2–3× lower cost for embedding-heavy RAG workloads — backed by measured latency and per-query cost data.

FinanceBench Evaluation

Semantic chunking with metadata filtering achieved 60% accuracy vs 25% for fixed chunking — with o1-preview and Nova Premier benchmarked across latency and cost dimensions.

Open Source Tools

Evaluate with FloTest — Free & Open Source

FloTest and our growing suite of open-source tools bring research-grade evaluation to every engineering team. No proprietary stack, no vendor lock-in.

FloTest Framework

End-to-end test harness for GenAI pipelines. Run regression suites across model versions, prompts, and retrieval configs without boilerplate.
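
FloTest's real interface may differ, but the boilerplate it targets looks like this hand-rolled sweep over model versions, prompts, and retrieval configs (all names and the scoring stub below are hypothetical):

```python
# Hand-rolled regression sweep: the boilerplate a harness like FloTest
# abstracts away. Model names, prompt IDs, and scoring are placeholders.
from itertools import product

MODELS = ["model-v1", "model-v2"]         # model versions under test
PROMPTS = ["prompt_a", "prompt_b"]        # prompt template variants
TOP_KS = [3, 5]                           # retrieval configs to sweep

def run_pipeline(model, prompt_id, top_k, query):
    """Placeholder for your end-to-end RAG call."""
    return f"stub answer from {model}/{prompt_id}/k={top_k}"

def accuracy(config, cases):
    hits = sum(run_pipeline(*config, q) == expected for q, expected in cases)
    return hits / len(cases)

cases = [("What is our refund window?", "30 days")]
for config in product(MODELS, PROMPTS, TOP_KS):
    print(config, accuracy(config, cases))  # diff runs to catch regressions
```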

Normalization Library

Unicode standardization that recovers 25–40 NDCG@1 points lost to encoding artifacts — proven critical for medical and multilingual workloads.
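
The core of this normalization can be sketched with Python's standard library; the actual FloTorch library presumably layers domain-specific rules on top:

```python
import unicodedata

def normalize_text(text: str) -> str:
    """Canonicalize Unicode so visually identical strings compare equal.

    NFKC folds compatibility characters (ligatures, full-width forms,
    superscripts) into canonical equivalents, the class of encoding
    artifact that silently degrades embedding similarity.
    """
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u00ad", "")      # strip soft hyphens from PDFs
    return " ".join(text.split())          # collapse whitespace runs

# "ﬁbrillation" with a U+FB01 ligature vs. the plain-ASCII spelling:
assert normalize_text("ﬁbrillation") == normalize_text("fibrillation")
```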

LLM-as-a-Judge Harness

Structured evaluation for zero-ground-truth domains. Cross-model consensus scoring generates unbiased pseudo-labels at scale.
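
A minimal sketch of the consensus pattern, assuming each judge is a callable that prompts a different model and returns a 1-5 score:

```python
# Consensus pseudo-labeling sketch. Real judges would wrap API calls to
# distinct providers; stubs stand in for them here.
from statistics import mean

def consensus_label(question, answer, judges, max_spread=1):
    scores = [judge(question, answer) for judge in judges]
    if max(scores) - min(scores) > max_spread:
        return None          # judges disagree: drop rather than mislabel
    return mean(scores)      # agreement: averaged score becomes the label

# Toy usage with stub judges standing in for real model calls:
stub_judges = [lambda q, a: 4, lambda q, a: 5, lambda q, a: 4]
print(consensus_label("Q", "A", stub_judges))   # 4.33... -> kept as a label
```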

Benchmark Report Generator

Automated PDF and markdown outputs from benchmark runs. Reproducible, shareable, and CI-friendly by design.

Inference Optimization

Optimize Every Layer of the Inference Stack

From prompt caching to quantized model endpoints — FloTorch research identifies exactly where latency and cost can be cut, with measured results at each layer.

Batching & Throughput Tuning

Continuous batching strategies measured across vLLM, TGI, and Amazon Bedrock batch APIs — optimal config mapped per workload type.
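
For reference, a throughput measurement under vLLM's continuous batching might look like the sketch below; the model name and prompt set are placeholders for your own workload:

```python
# Submit the whole prompt set at once and let the vLLM engine interleave
# requests; continuous batching happens inside generate().
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=128)
prompts = [f"Summarize ticket #{i} in one sentence." for i in range(256)]

start = time.perf_counter()
outputs = llm.generate(prompts, params)      # continuously batched internally
elapsed = time.perf_counter() - start

tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{tokens / elapsed:.0f} output tokens/s over {len(prompts)} requests")
```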

Complexity-Based Routing

Simple queries are routed to lightweight models while complex reasoning is reserved for frontier models: a validated cost reduction of up to 20% with no accuracy drop.
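
A minimal routing sketch, with a made-up length-and-keyword heuristic standing in for a real complexity classifier:

```python
# Complexity-based routing sketch; the model names and the heuristic are
# illustrative, not FloTorch's production router.
CHEAP_MODEL = "lightweight-model"      # hypothetical inexpensive endpoint
FRONTIER_MODEL = "frontier-model"      # hypothetical frontier endpoint

REASONING_MARKERS = ("why", "compare", "derive", "step by step", "prove")

def route(query: str) -> str:
    """Short lookup-style queries go cheap; long or reasoning-heavy
    queries stay on the frontier model."""
    looks_complex = (len(query.split()) > 40
                     or any(m in query.lower() for m in REASONING_MARKERS))
    return FRONTIER_MODEL if looks_complex else CHEAP_MODEL

print(route("What is the capital of France?"))            # lightweight-model
print(route("Compare these two contracts step by step"))  # frontier-model
```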

Retrieval Stack Benchmarking

Seven embedding models benchmarked across 2,000 queries. Titan leads on semantic ranking stability; Azure T3-Large is best for knowledge-heavy recall. Documented and reproducible.

Semantic Cache Research

Properly tuned prompt-response caching reduces token spend by 40–60% on repetitive enterprise workflows — benchmarked across real workload distributions.
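
The underlying mechanism can be sketched in a few lines: embed each query, and serve a cached response whenever a new query lands within a similarity threshold of one already answered. The embed callable and the 0.92 threshold below are illustrative and need tuning against your workload:

```python
# Semantic cache sketch: cosine similarity over query embeddings decides
# whether to serve a stored response or fall through to the model.
import numpy as np

class SemanticCache:
    def __init__(self, embed, threshold=0.92):
        self.embed = embed              # callable: str -> 1-D numpy vector
        self.threshold = threshold
        self.keys, self.values = [], []

    def get(self, query):
        if not self.keys:
            return None
        q = self.embed(query)
        mat = np.stack(self.keys)
        sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q))
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, query, response):
        self.keys.append(self.embed(query))
        self.values.append(response)
```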

Cross-Cloud Latency Profiling

AWS inference runs 40% faster than Azure at 2–3× lower cost for RAG workloads. Our profiling guides cloud architecture decisions with real data, not vendor claims.

Cost–Accuracy Pareto Analysis

We map the full cost-accuracy frontier for your task — so you can make rational model selection decisions instead of defaulting to the most expensive option.
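
Extracting the frontier itself is straightforward once benchmark data exists; the sketch below uses made-up numbers, not our results:

```python
# Cost-accuracy Pareto frontier sketch: a config stays on the frontier if
# no other config is both cheaper and at least as accurate.
def pareto_frontier(configs):
    """configs: list of (name, cost_per_1k_queries, accuracy)."""
    frontier = []
    for name, cost, acc in configs:
        dominated = any(c2 <= cost and a2 >= acc and (c2, a2) != (cost, acc)
                        for _, c2, a2 in configs)
        if not dominated:
            frontier.append((name, cost, acc))
    return sorted(frontier, key=lambda t: t[1])

results = [("model-a", 1.20, 0.81), ("model-b", 4.00, 0.83),
           ("model-c", 0.60, 0.74), ("model-d", 3.90, 0.79)]
print(pareto_frontier(results))  # model-d dropped: costlier, less accurate
```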

Expert Services

Forward Deployment & Performance Engineering

Our embedded engineers build alongside your team, diagnose production failures, and solve complex GenAI problems nobody else can, cutting resolution time from hours to minutes.

Root Cause Analysis

We trace accuracy regressions, latency spikes, and hallucination patterns back to specific retrieval gaps or prompt failures — with evidence, not guesses.

RAG Architecture Optimization

Embedding selection, chunking strategy, reranker tuning, and metadata filtering — we run the experiments your team doesn't have time for, with benchmarks to back every call.

Agentic Workflow Engineering

Multi-agent pipelines with deterministic fallbacks, cost guardrails, and evaluation checkpoints — built into the orchestration layer from day one, not retrofitted.
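
One way to sketch a cost guardrail with a deterministic fallback at the orchestration layer (the agent callable and budget figure are placeholders):

```python
# Every agent step carries a budget and a deterministic fallback, so a
# runaway or failing agent degrades gracefully instead of failing the run.
class GuardedStep:
    def __init__(self, call_agent, budget_usd=0.50):
        self.call_agent = call_agent    # callable: task -> (result, cost_usd)
        self.remaining = budget_usd

    def run(self, task, fallback):
        """Try the agent; on error or an exhausted budget, take the
        deterministic fallback instead of failing the whole pipeline."""
        if self.remaining <= 0:
            return fallback(task)
        try:
            result, cost = self.call_agent(task)
        except Exception:
            return fallback(task)
        self.remaining -= cost
        return result
```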

Evaluation Framework Design

Domain-specific eval suites for zero-ground-truth environments. LLM-as-a-Judge pipelines, normalization strategies, and scoring rubrics built for your data.

Cross-Cloud AI Strategy

AWS vs Azure vs GCP — answered with data, not vendor preference. Our profiling reveals the real cost and latency delta for your specific workload configuration.

Embedded FDE Engagement

1–4 week embedded engagements — benchmark audits, architecture reviews, production incident support, and eval framework design delivered with your team, not just handed over.

Model Research

Finetuning & Quantization of Open-Weight Models

We adapt frontier open-weight models to domain-specific tasks and compress them for production-grade inference — without the accuracy tradeoffs that make teams nervous.

Domain Dataset Curation

Privacy-compliant data prep, deduplication, and instruction-pair formatting for healthcare, legal, and financial verticals — the unglamorous work that makes finetuning actually work.
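
A minimal sketch of the dedup-and-format step, assuming an Alpaca-style instruction/input/output schema; real curation adds privacy filtering and near-duplicate detection on top:

```python
# Exact deduplication plus instruction-pair formatting for a finetuning
# corpus. Field names follow the common Alpaca-style layout; yours may differ.
import hashlib, json

def dedup(records):
    """Drop records whose normalized (instruction, output) pair repeats."""
    seen, unique = set(), []
    for r in records:
        key = hashlib.sha256(
            (r["instruction"].strip().lower() + "\x00" +
             r["output"].strip().lower()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def to_jsonl(records, path):
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps({"instruction": r["instruction"],
                                "input": r.get("input", ""),
                                "output": r["output"]}) + "\n")
```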

Evaluation vs. Base Model

Every finetuned model is benchmarked against its base across accuracy, hallucination rate, and latency — not just reported as "improved" but proven with domain-specific evals.

Production Deployment

Deploy via FloTorch Unified Gateway with full observability — token use, latency distributions, cost tracking, and drift detection built in from launch.

LoRA / QLoRA Finetuning

Parameter-efficient domain adaptation with LoRA adapters. QLoRA extends this to 4-bit quantized base models, delivering full adaptation at a fraction of the compute cost.
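
A minimal QLoRA setup with the Hugging Face transformers and peft libraries might look like this; the model name, target modules, and hyperparameters are illustrative starting points, not our tuned values:

```python
# QLoRA sketch: 4-bit quantized base model plus trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",              # placeholder base model
    quantization_config=bnb,
    device_map="auto",
)

lora = LoraConfig(
    r=16,                                   # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attach to attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()          # small fraction of total params
```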

Quantization for Deployment

GGUF, AWQ, and GPTQ formats validated for production. We find the quantization level where accuracy retention meets inference cost targets — not just the smallest model.

Hyperparameter Optimization

Automated search across learning rate, batch size, LoRA rank, and quantization config, run via FloTorch LLMOps to surface the best-performing configuration fast.
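
FloTorch LLMOps drives this search internally; the underlying pattern, sketched here with Optuna, is a study over the same knobs (the objective is a stub standing in for a finetune-and-evaluate run, and the ranges are illustrative):

```python
import optuna

def run_finetune_and_eval(lr, batch_size, lora_rank, quant):
    """Placeholder: finetune with this config and return eval accuracy."""
    return 0.0  # replace with a real training + evaluation run

def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])
    lora_rank = trial.suggest_categorical("lora_rank", [8, 16, 32, 64])
    quant = trial.suggest_categorical("quant_config", ["nf4", "int8", "none"])
    return run_finetune_and_eval(lr, batch_size, lora_rank, quant)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```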