Raleon's DTC AI Benchmarks: Choosing the Smartest Brain for your Brand
LLMs have revolutionized how we craft copy, plan campaigns, and build brand strategy. AI makes you faster, more efficient, and raises the quality of your work. And the models are only getting better.
But there's a hidden factor you might be overlooking: the huge variation in performance between different LLMs, even when using the exact same prompt.
Choosing the wrong model for tasks like campaign planning, copywriting, or subject line creation can quietly drain 30-50% of your potential results, impacting open rates, engagement, and sales.
The Raleon platform is agentic – meaning multiple AIs operate in concert, each responsible for a specific task in the retention workflow. We need the right model for the right job, every time.
In this post, we're going to pull back the curtain and show you exactly how we evaluate and identify the right model for your high-stakes marketing tasks.
You'll see our actual results evaluating all the latest state-of-the-art models for the most complex agent on our platform, the AI Strategist. You'll get our evaluation process, results, and a simplified, DIY evaluation method you can try today using just ChatGPT (no code required).
The bottom line: small, structured model evaluations are easy to do, can be done in less than 15 minutes, and can yield massive lifts in performance. Try our templates yourself, or let Raleon automatically manage the selection for you, ensuring you're always using the best model and driving maximum results for your brand or clients.
Why Model Choice Is a Hidden Growth Lever
Most marketers assume any LLM will do. But in reality, different models produce vastly different outputs even from the exact same prompt. For email-marketing tasks like crafting subject lines or sequencing promotional sends, choosing the optimal model can boost open rates, clicks, and sales significantly.
Think of it like running an A/B test: picking the best-performing variant squeezes out that critical extra 10–30% of revenue. Ignoring model selection means leaving real money on the table.
This process of evaluating model performance over time is called an "eval," and it's a table-stakes requirement for using AI at the highest level (in this interview you can watch OpenAI's CPO Kevin Weil speak to how critical evals are for using LLMs across all use cases).
Raleon's AI Strategist Agent - Evaluating Our Most Sophisticated AI
At Raleon, one of our key tasks is turning brand goals, historical performance, AI-powered segmentation, and predictive analytics into detailed campaign plans that convert.
We purpose-built our AI Strategist agent to solve this challenge. It demands precise reasoning and brand-specific context, far more than a simple copywriting task. Sequencing sends, balancing promo intensity, and maintaining a consistent brand voice across segments make this the ideal scenario to illustrate why rigorous LLM evaluations matter.
Evaluation 101: How to Know Which Model Wins
The basic idea of an eval is to have a consistent test for each task you want an LLM to perform. The consistency creates apples-to-apples comparisons as models and prompts change over time. By testing each LLM on the specific task, you can determine which model is best for every job across your process.
Again, this is not all that different from the A/B tests you've been running for years. Control your variables, measure performance, and a simple, quick test can drive major improvements. Even without technical expertise, any marketer can perform a basic evaluation loop within 15 minutes.
Change one variable at a time (the model, the prompt, or other inputs) and judge the outputs as impartially as possible. Highest score wins.
Here's a quick primer:
Task: The specific job the LLM must do (e.g., "Create a subject line").
Rubric: A checklist or scorecard to grade results objectively.
Score: A numeric rating summarizing quality of performance for each criterion.
Judge: You, another human, or a dedicated AI that evaluates the results and completes the scoring and rubric.
Variance: The difference in results across models.
Under the Hood: Our "Pro-Grade" Eval Stack
Raleon uses a multi-layer evaluation approach with:
Multi-pass sampling for consistent scoring.
Ensemble judges (multiple evaluators) to reduce bias.
Statistical checks to minimize variance.
Our eval system is deployed with code and updated with every model or prompt change. This ensures that the platform produces the highest quality plans, copy, and email designs – always at the cutting edge of what's possible. Don't worry if this sounds difficult. Below we'll share a simple approach you can use without all the code – all with just the ChatGPT app and a spreadsheet.
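If you're curious what that looks like in practice, here's a minimal sketch of multi-pass, ensemble scoring in Python. The judge names, score scale, and variance threshold are illustrative placeholders, not our production stack.

```python
import random
import statistics

# Stand-in for a real LLM call; in practice this hits your judge model's API.
def ask_judge(judge_model: str, plan_text: str) -> float:
    """Return a 0-100 quality score for plan_text (random stub for illustration)."""
    return random.uniform(70, 90)

JUDGES = ["judge-model-a", "judge-model-b", "judge-model-c"]  # ensemble of evaluators
PASSES = 3  # multi-pass sampling: score the same output several times

def score_output(plan_text: str) -> dict:
    scores = [
        ask_judge(judge, plan_text)
        for judge in JUDGES      # multiple judges reduce single-judge bias
        for _ in range(PASSES)   # multiple passes smooth out run-to-run randomness
    ]
    spread = statistics.stdev(scores)
    return {
        "score": statistics.mean(scores),
        "stdev": spread,
        # simple statistical check: a wide spread means the result isn't stable yet
        "needs_more_passes": spread > 10,
    }
```

Swap the stub for real API calls and you have the skeleton of the system described above.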
What we test
For every test, we give each model the same task and the same full context: performance data, brand details, and segments. In other words, the exact situation, context, and prompt that our customers use on the Raleon platform every single day.
Each model is asked to deliver an actionable campaign plan including 4–16 specific email ideas, scheduled send dates, segments, and rationale explaining why each email will work.
We then use a system of AI judges to evaluate each response across our scoring rubric, a weighted-average scorecard custom-designed for retention marketing. You can be your own judge, evaluate as a team (everyone votes, for example), or use an LLM to act as your judge. We prefer the AI judge because it removes our own biases and provides a more consistent score.

Models in the ring and a three-run sanity check
For our May 2025 benchmarks, we tested Gemini 2.5 Pro, GPT-4.1, Grok-3-Mini, Mistral-Medium, Qwen-3-235B, Claude 3.7 and 3.5 Sonnet, and six additional competitors.
To eliminate random "good luck" results, each model faces the same prompt three separate times. We average their results to clearly spot consistent high-performers.
We evaluate each model using these simple, marketer-friendly criteria:
Does it follow directions? (Right number of emails, correct timeframe.)
Does it fit the brief and strategy framework? (Clear goals, fresh ideas, sufficient details, follows best practices.)
Is it realistic? (Avoids overwhelming subscribers.)
A dedicated "grader" AI scores each model's output from 0–4 in each area.
We weight the scores in our rubric, evaluating for correctly following instructions, creativity, and strategic quality.
This means the winning model isn't just compliant. It's genuinely valuable, producing outputs that match professional agency standards.
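To make the weighting concrete, here's a minimal sketch of how 0–4 grades roll up into a percentage. The criteria names and weights are illustrative, not our actual rubric.

```python
# Illustrative rubric: criteria, weights, and grades are examples, not Raleon's production rubric.
RUBRIC_WEIGHTS = {
    "follows_directions": 0.4,       # right number of emails, correct timeframe
    "fits_brief_and_strategy": 0.4,  # clear goals, fresh ideas, best practices
    "realistic_cadence": 0.2,        # avoids overwhelming subscribers
}

def weighted_score(grades: dict[str, int]) -> float:
    """Convert 0-4 grades per criterion into a weighted percentage score."""
    total = sum(RUBRIC_WEIGHTS[c] * (grades[c] / 4) for c in RUBRIC_WEIGHTS)
    return round(total * 100, 1)

# Example: a plan graded 4, 3, 3 lands at 85%, about where the top of the May leaderboard sits.
print(weighted_score({"follows_directions": 4, "fits_brief_and_strategy": 3, "realistic_cadence": 3}))
```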
Results You Can Steal (May 2025 DTC AI Benchmarks)
Below you can see our actual results in our AI Strategist evals for May 2025. It’s important to remember that this eval is scoring each model for its performance at the Strategist level. We run separate evals for other agent tasks on the platform, like copywriting.

Top performers
Claude 3.7 Sonnet narrowly edges out its peers, with reasoning-focused models dominating the leaderboard.
| Rank | Model | Average Score |
|---|---|---|
| 1 | Claude 3.7 Sonnet | 84% |
| 2 | Gemini 2.5 Pro Preview | 83% |
| 3 | Grok 3 Mini Beta | 83% |
| 4 | GPT-4.1 | 83% |
Claude 3.7 is the current top performer, narrowly edging out Gemini 2.5 Pro, Grok-3-mini-beta, and OpenAI's GPT-4.1.
Claude 3.7 and Gemini 2.5 Pro have been leading our benchmarks for several weeks now. It's noteworthy that both are state-of-the-art reasoning models; in fact, reasoning models dominate the top of the leaderboard.
Grok-3-mini-beta delivers great results for smaller monthly plans (≤10 emails).
Qwen 3 235B and GPT-4o-latest remain strong second-tier options.
Strugglers
Claude 3.5 Sonnet is the most notable laggard, and a few models underperform when pushed to larger email volumes.
| Rank | Model | Average Score |
|---|---|---|
| 15 | Gemini 2.0 Flash Lite 001 | 69% |
| 14 | Claude 3.5 Sonnet | 72% |
| 13 | GPT-4o Mini | 72% |
Claude 3.5 Sonnet consistently underperforms across evaluation dimensions.
Several models lose points due to mismatched email counts in larger plans. In Raleon, we use software engineering techniques to overcome this limitation and produce higher email counts more consistently. If you're using ChatGPT on your own, limit your plan creation to fewer than 15 campaigns at a time to ensure the best results.
Practical takeaways
The winner is … Claude 3.7.
For monthly plans ≤10 emails, Grok-3-mini provides high quality at an efficient cost.
Gemini 2.5 Pro excels in detailed and strict-format plans (10–16 emails).
Re-evaluate models whenever your campaign strategy changes; rankings can shift.
Reality check: Requesting more than 16 emails per month? Consider breaking up the planning into multiple runs. For larger volume months, we recommend focusing each run on a particular goal – one set focused on your seasonal promotion, one focused on key segments, one focused on product launches, etc. You can split this up in a number of ways that help the model produce the absolute best quality results every time.
Prompt Pack: DIY Evals You Can Run in Any Chat Window
Writing your own evals doesn't require becoming a professional software engineer. For many tasks you complete day-to-day, a simple LLM-based process is all you'll need to confirm that your prompts and results keep getting better with every tweak.
The Eval Formula
Aman Khan, an AI Product Manager at Arize AI, wrote an excellent breakdown for using evals. Here's the framework he recommends, which we've adapted for you in the process and prompts below.
Each great LLM eval contains four distinct parts, which we'll assemble into a small example after the list:
Setting the role. You need to provide the judge-LLM a role (e.g. 'you are examining written text') so that the system is primed for the task.
Providing the context. This is the data you will actually be sending to the LLM to grade.
Providing the goal. Clearly articulating what you want the judge-LLM to measure isn't just a step in the process; it's the difference between a mediocre AI and one that consistently delights users.
Defining the terminology and label. Toxicity, for example, can mean different things in different contexts. You want to be specific here so the judge-LLM is 'grounded' in the terminology you care about.
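Here's how those four parts can snap together into a single judge prompt, as a minimal sketch. All the wording is illustrative, so adapt the role, goal, and labels to your own task.

```python
# Minimal sketch: assembling a judge prompt from the four parts above.
# Every string here is illustrative; swap in your own task, data, and scoring labels.

ROLE = "You are an experienced retention-marketing strategist reviewing an email campaign plan."

GOAL = (
    "Grade how well the plan follows the brief: correct email count, correct timeframe, "
    "clear goals, and a realistic send cadence."
)

TERMINOLOGY = (
    "Score each area from 0 (missing) to 4 (excellent) and reply only with: "
    "follows_directions: <0-4>, fits_brief: <0-4>, realistic: <0-4>."
)

def build_judge_prompt(plan_text: str) -> str:
    """Combine role, context, goal, and terminology into one judge prompt."""
    context = f"Campaign plan to review:\n{plan_text}"
    return "\n\n".join([ROLE, context, GOAL, TERMINOLOGY])
```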
How to use this:
We've provided three simple example Generation Prompts below, each of which plans and generates an email-based campaign for a brand, plus a Scoring Prompt you can steal for your own evals.
To run your eval, copy one of the Generation Prompts (or even better, use your own prompt!) into three chatbots. Once the LLMs have finished responding, paste each response into a fresh chat with the Scoring Prompt.
We strongly recommend using a reasoning model like o3 in ChatGPT or Gemini 2.5 Pro to be your LLM judge. This class of model will produce the best, most consistent scores.
You can run the eval once for a quick check, but it is best practice (and our recommendation) to run the scoring prompt a few times and use the average of the results. LLMs are, at their core, probability machines; averaging will give you a much more reliable understanding of how the prompts score in real-world use.
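If you'd rather script this step than paste into a chat window, here's a hedged sketch using the OpenAI Python SDK. The model name is a placeholder, and it assumes your scoring prompt asks the judge to reply with a single number.

```python
import statistics
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from your environment

def run_judge(scoring_prompt: str, candidate_output: str, runs: int = 3) -> float:
    """Score one model's output several times and return the average."""
    scores = []
    for _ in range(runs):
        response = client.chat.completions.create(
            model="your-judge-model",  # placeholder: a reasoning model you trust as judge
            messages=[
                {"role": "system", "content": scoring_prompt},
                {"role": "user", "content": candidate_output},
            ],
        )
        # assumes the scoring prompt tells the judge to reply with just a number, e.g. "17"
        scores.append(float(response.choices[0].message.content.strip()))
    return statistics.mean(scores)
```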
Generation Prompts (pick one or, even better, use your own prompts!)
Here are three simplified prompts you can use to get started. You should adapt them to match your exact task, include full context just like you would in a real-world case, and ensure they're as close to real examples as possible. The prompts we use in Raleon are significantly more advanced and in depth than these examples — these are simple examples designed to help you get started fast.
1. 10-Email Education Calendar
2. 4-Email Hard-Promo Sprint
3. 8-Email Mixed Content
Note: Use the identical prompt text for fair comparisons.
Scoring Prompt
Quick Comparison Cheat-Sheet
Use a spreadsheet or simple table to log your results. This example is simple and effective. Save these results for reference in the future as you update and evolve your prompts and LLM usage.
| Model | Prompt # | Score Total | Notes |
|---|---|---|---|
| GPT-4o | #1 | 17/20 | Strong angles, perfect cadence |
| Gemini | #1 | 18/20 | Slightly fresher hooks |
| Claude | #1 | 14/20 | Light on goal linkage |
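If you'd rather keep that log in a file than a spreadsheet, here's a small sketch that appends each result to a CSV. The file name and example row are just placeholders.

```python
# Append eval results to a CSV so you can track scores as prompts and models evolve.
# Column names mirror the cheat-sheet above; the example row uses the values from the table.
import csv
from pathlib import Path

LOG_FILE = Path("eval_log.csv")

def log_result(model: str, prompt_id: str, score_total: str, notes: str) -> None:
    is_new = not LOG_FILE.exists()
    with LOG_FILE.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["Model", "Prompt #", "Score Total", "Notes"])
        writer.writerow([model, prompt_id, score_total, notes])

log_result("GPT-4o", "#1", "17/20", "Strong angles, perfect cadence")
```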
What "Good" Looks Like
19–20 out of 20: Exceptional. Ready for direct use.
16–18 out of 20: Strong results. You're good to go, but keep iterating!
15 or below out of 20: Adjust your prompt or test another model.
Quick-Start DIY Recipe
Run your own mini-evaluation right now; a short code sketch of the full loop follows these steps:
Choose one prompt from above or one that you use regularly in your day-to-day.
Run the prompt through three or more LLMs.
Quickly score results using clarity, relevance, and tone as your rubric. Run the prompt 3 times for each model and take the average score for each. Do it yourself or use a reasoning model to judge for you.
Pick the best average performer.
This recipe takes less than 30 minutes to complete and can instantly improve your LLM performance by leaps and bounds. This is the best ROI task you'll do all day.
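Here's the whole recipe as a short sketch. The model names and the two helper functions (generate_plan and judge_plan) are placeholders; wire them up to whichever LLMs and judge you use.

```python
import statistics

MODELS = ["model-a", "model-b", "model-c"]  # the three or more LLMs you want to compare
RUNS = 3

def generate_plan(model: str, prompt: str) -> str:
    """Placeholder: send the prompt to the named model and return its campaign plan."""
    raise NotImplementedError("wire this up to your LLM provider")

def judge_plan(plan: str) -> float:
    """Placeholder: score the plan 0-20 on clarity, relevance, and tone."""
    raise NotImplementedError("wire this up to your judge model")

def run_mini_eval(prompt: str) -> str:
    averages = {}
    for model in MODELS:
        scores = [judge_plan(generate_plan(model, prompt)) for _ in range(RUNS)]
        averages[model] = statistics.mean(scores)
        print(f"{model}: {averages[model]:.1f}/20 across {RUNS} runs")
    return max(averages, key=averages.get)  # the best average performer wins
```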
Key Takeaways
The most popular model isn't always the best. Use an eval to confirm.
Even small evaluations can yield major performance gains. Expect to see different models perform best for different tasks.
Models evolve quickly, so re-test periodically to maintain gains. Keep your eval process consistent so you are always comparing apples-to-apples.
Next Step: Let Raleon Handle the Heavy Lifting
Constantly evaluating LLMs at scale is a burden most busy marketers can't afford. Raleon guarantees you're always using the best model for each and every task, automatically and painlessly. Stop babysitting evaluations and get back to your actual job with Raleon.
Book a 15-min demo to see campaign planning on autopilot. You can save hours every week and drive significant incremental revenue today. Let us show you how.
Ready for your AI Retention Team? 👇