Benchmarks

Evaluate AI models on real-world business tasks to select the best models for your specific use cases.

Overview

Benchmarks provides a framework for evaluating and comparing AI models based on real-world business performance. It enables you to:

  • Assess model performance on enterprise-specific tasks
  • Compare models across accuracy, cost, latency, and throughput (see the sketch after this list)
  • Create standardized evaluation suites for business functions
  • Maintain leaderboards for different business capabilities
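
Comparing models across those dimensions typically comes down to a weighted composite score. The result shape and scoring functions below are illustrative assumptions rather than the benchmarks.do API; they are a minimal sketch of how a weighted comparison could pick the "best" model:

// Illustrative sketch only: ModelResult, Weights, and the functions below
// are assumptions, not part of benchmarks.do.
interface ModelResult {
  model: string
  accuracy: number      // 0..1, higher is better
  costScore: number     // 0..1 normalized, higher means cheaper
  latencyScore: number  // 0..1 normalized, higher means faster
}

interface Weights {
  accuracyWeight: number
  costWeight: number
  latencyWeight: number
}

// Weighted sum of normalized scores for one model.
const compositeScore = (r: ModelResult, w: Weights): number =>
  r.accuracy * w.accuracyWeight +
  r.costScore * w.costWeight +
  r.latencyScore * w.latencyWeight

// The model with the highest composite score wins for the given weighting.
const pickBestModel = (results: ModelResult[], w: Weights): ModelResult =>
  results.reduce((best, r) =>
    compositeScore(r, w) > compositeScore(best, w) ? r : best
  )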

Features

  • Business Function Evals: Specialized tests for Sales, Marketing, Support, Coding, etc.
  • Multi-dimensional Scoring: Evaluate models across accuracy, cost, latency, and throughput
  • Custom Test Suites: Create evaluations tailored to your business
  • Automated Testing: Run benchmarks automatically when new models are released
  • Comparative Analysis: Track model improvements over time (see the sketch after this list)
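
As a sketch of how automated testing and comparative analysis could fit together, a scheduled job might re-run a benchmark whenever a new model version ships and compare its score against a stored baseline. The scores field on the results object and the baseline file below are illustrative assumptions, not part of the benchmarks.do API:

import { runBenchmark } from 'benchmarks.do'
import { readFile, writeFile } from 'node:fs/promises'

// Hypothetical: assumes results.scores maps model id -> composite score (0..1)
// and that baselines live in a local JSON file keyed by model id.
async function checkForRegression(benchmark: any, model: string) {
  const results = await runBenchmark({ benchmark, models: [model] })
  const current = results.scores[model]

  const baseline = JSON.parse(await readFile('baseline.json', 'utf8'))
  const previous = baseline[model] ?? 0

  if (current < previous) {
    console.warn(`${model} regressed: ${current} vs baseline ${previous}`)
  } else {
    // A new best score becomes the baseline for the next release.
    baseline[model] = current
    await writeFile('baseline.json', JSON.stringify(baseline, null, 2))
  }
}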

Usage

import { defineBenchmark, runBenchmark } from 'benchmarks.do'

// Define a customer support benchmark
const customerSupportBenchmark = defineBenchmark({
  name: 'customer_support_quality',
  description: 'Evaluates AI models on customer support tasks',

  // Define test categories
  categories: [
    {
      name: 'inquiry_classification',
      description: 'Classify customer inquiries by type',
      weight: 0.2,
    },
    {
      name: 'response_generation',
      description: 'Generate helpful responses to inquiries',
      weight: 0.5,
    },
    {
      name: 'escalation_detection',
      description: 'Identify when to escalate to human agents',
      weight: 0.3,
    },
  ],

  // Define evaluation metrics
  metrics: [
    { name: 'accuracy', weight: 0.7 },
    { name: 'latency', weight: 0.15 },
    { name: 'cost', weight: 0.15 },
  ],
})

// Run benchmark with multiple models
const results = await runBenchmark({
  benchmark: customerSupportBenchmark,
  models: ['openai/gpt-4.5', 'anthropic/claude-3-opus', 'google/gemini-pro'],
})

// Get the best model for your use case
const bestModel = results.getBestModel({
  accuracyWeight: 0.8,
  costWeight: 0.1,
  latencyWeight: 0.1,
})
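
In this example, the category weights (0.2 + 0.5 + 0.3) and the metric weights (0.7 + 0.15 + 0.15) each sum to 1.0, so per-category and per-metric scores can combine into a single weighted composite per model. getBestModel then re-weights accuracy, cost, and latency at selection time, letting the same results favor accuracy for one use case and cost or latency for another without re-running the benchmark.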