DTC AI Benchmark - The Best LLMs for DTC Tasks (July Update)

Two months ago, a score of 82% would have landed you in 4th place on our AI leaderboard. Today, it barely gets you into the top half. But there's one very interesting outlier here that sets the stage for where AI is headed next.

Even as model performance improves rapidly, choosing the right model for the right task still makes a big difference. In fact, we still see the exact same prompt produce very different results across providers.

What Models Made the Biggest Improvements?

Every single returning model improved except one. The performance floor jumped 12 points, and the "middle of the pack" now sits where the winners were just two months ago. This isn't normal market evolution; it's a coordinated leap forward across the entire ecosystem.

Biggest movers:

  • o3 debuts at 89%, immediately claiming bronze

  • Grok-4 hits 87%, vaulting over every Gemini model

  • DeepSeek goes from one underperformer to three competitive models (82-87%)

  • Codex-mini jumps 9 points to tie for first place at 90%

Models for Specific DTC Tasks Are Becoming More Important

With five providers now clustered in the 85-90% range, we're entering an era where niche strengths and marginal gains are going to matter more than raw capability (looking at you, 4o).

The biggest implication is that different models are pulling ahead on different tasks, and even on different types of tasks.

For instance, we've been observing that:

| Task | Details |
| --- | --- |
| Copywriting | Claude 3.7 is still the best, outperforming Claude 4 |
| Strategy, Analysis, and Creativity | Claude Sonnet 4 is great at strategy below a certain complexity, but falls off above a certain complexity (which is why it's not rated higher) |
| Strategy, Analysis, and Creativity | o3 is great at strategy above a certain complexity, but falls off below a certain complexity. Terrible at copywriting and creativity. |
| Image Generation | 4o's image generation has gotten much worse, and it's beginning to be lapped by Gemini Imagen. |

This means that now more than ever, choosing the right model for the right kind of task actually has a big impact. It's part of why we've built that into Raleon.
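To make the idea of picking a model per task concrete, here is a minimal sketch of a routing table based only on the observations above. The task labels, model labels, and fallback are assumptions made for illustration; they are not real API model identifiers, and this is not how Raleon implements routing.

```python
# Hypothetical task-to-model routing table. The labels below are
# illustrative placeholders, not real provider model IDs.
TASK_ROUTES = {
    "copywriting": "claude-3.7",
    "strategy_simple": "claude-sonnet-4",
    "strategy_complex": "o3",
    "image_generation": "gemini-imagen",
}

# Assumed fallback for tasks that are not in the table.
DEFAULT_MODEL = "claude-sonnet-4"


def pick_model(task: str, complexity: str = "simple") -> str:
    """Return the model label to use for a given task.

    Strategy work is split by complexity, mirroring the observation
    that Claude Sonnet 4 and o3 are strong at opposite ends.
    """
    if task == "strategy":
        task = f"strategy_{complexity}"
    return TASK_ROUTES.get(task, DEFAULT_MODEL)
```

Calling `pick_model("strategy", "complex")` returns the o3 label, while an unrecognized task falls back to the default. Keeping the mapping in one place also means a model change is a one-line edit rather than a change to every flow.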

Watch your N8N flows

One downside of agent builders like N8N is that they're brittle. As models change, your flows have to be adjusted along with them. That matters now because any model scoring below 75% is no longer competitive, and vendors face immediate pressure to retire or retrain underperforming models.

A New Trend: The Power of "Context Engineering" for DTC AI

Think of context engineering as giving an LLM the right amount of information and the right tools to accomplish its task. While many people still focus on prompt engineering, we're seeing a shift toward treating context engineering as the more critical factor.

The real story in our recent benchmarks isn't that Claude maintains its 90% dominance or that Grok-4 leaped ahead of most major models on launch day (though that's impressive). The breakthrough is Codex-mini's 9-point performance jump.

Why Codex-mini's jump matters:

  1. Larger context window - More information available for processing

  2. Task-specific tuning - Optimized specifically for coding tasks

This demonstrates two crucial trends: specialized models are beginning to outperform generalist ones in specific domains, and context quality and quantity significantly impact model performance.
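As a rough illustration of "the right amount of information," the sketch below selects the most relevant pieces of brand context that fit inside a size budget before anything is sent to a model. The function name, the keyword-overlap scoring, and the character budget are all assumptions made for this example; a production system would likely use a better relevance measure and count tokens rather than characters.

```python
import re


def build_context(snippets, query_terms, max_chars=2000):
    """Pick the most relevant snippets that fit within a size budget.

    Relevance here is a simple keyword overlap count, which is an
    assumption for illustration only.
    """
    terms = {t.lower() for t in query_terms}

    def relevance(snippet):
        words = set(re.findall(r"[a-z0-9]+", snippet.lower()))
        return len(words & terms)

    # Most relevant first; sorted() is stable, so ties keep input order.
    ranked = sorted(snippets, key=relevance, reverse=True)

    chosen, used = [], 0
    for snippet in ranked:
        if relevance(snippet) == 0:
            break  # everything after this has no overlap either
        if used + len(snippet) > max_chars:
            continue  # skip snippets that would exceed the budget
        chosen.append(snippet)
        used += len(snippet)
    return "\n".join(chosen)
```

The point of the budget is that irrelevant or oversized material is left out on purpose: more context is only helpful when it is the right context.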

This principle of context engineering is central to how Raleon's AI Retention Agent achieves strong results so quickly. It's also why our Context Memory Engine matters: it lets the agent constantly learn and update how it understands your brand.

The Bottom Line

AI continues to advance rapidly. The amount of change we've seen in two months, including the addition of quite a few models that weren't even on the board two months ago, makes it difficult to keep up. It also speaks to the power of platforms that help you go end-to-end.

For our complete methodology and evaluation framework, check out our original benchmark post. It includes our step-by-step evaluation process, DIY templates, and everything you need to run your own model comparisons.

If you're wondering how any of this might impact AI Email Marketing in 2025, we've got a full guide laid out for you. If you're wondering how to use any of this new technology, take a look at Raleon. We save agencies and brands 10+ hours every week, and deliver incremental revenue for them.

Frequently Asked Questions

Q: Is AI good for writing emails?

A: Yes, but the model matters. Our benchmarks show Claude 3.7 still leads in copywriting tasks, even outperforming Claude 4. The key is choosing the right model for email writing specifically, rather than using a general-purpose AI.

Q: Can ChatGPT do data analysis?

A: ChatGPT can handle basic analysis, but getting the right context ends up being the biggest challenge. Loading in too much data is often more harmful than helpful. We also consistently see it fail when you need a "chain" of derived data, for instance creating a new KPI and then using that KPI in another metric.
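For a sense of what a "chain" of derived data looks like, here is a small worked example in which one KPI feeds into another. The order data and metric names are made up for illustration and are not taken from our benchmark.

```python
# Hypothetical sample data, invented for this example.
orders = [
    {"customer": "a", "total": 40.0},
    {"customer": "a", "total": 60.0},
    {"customer": "b", "total": 50.0},
]


def average_order_value(orders):
    """First-level KPI: total revenue divided by number of orders."""
    return sum(o["total"] for o in orders) / len(orders)


def orders_per_customer(orders):
    """First-level KPI: number of orders divided by unique customers."""
    customers = {o["customer"] for o in orders}
    return len(orders) / len(customers)


def revenue_per_customer(orders):
    """Second-level metric derived from the two KPIs above."""
    return average_order_value(orders) * orders_per_customer(orders)
```

With this sample data, average order value is 50.0, orders per customer is 1.5, and the derived revenue per customer is 75.0. Computing the intermediate KPIs in code and handing the model the results is one way to avoid asking it to carry the chain itself.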

Q: Can you make ads with AI?

A: Yes and no. Static image ads made with AI have definitely improved, although getting products to show up consistently in the generated image is still often a challenge, and ChatGPT constantly throws up content restrictions. For video ads, Veo 3 is making parts of the process easier, but it's still a good ways off.

Q: Can you use AI for email marketing?

A: Yes, and it's becoming essential. You can now use AI to help with every aspect of your DTC email marketing, from strategy, analysis, and segmentation to the email generation itself. The key is using platforms like Raleon that understand your brand and audience, which is why platforms that integrate multiple AI models and maintain context memory are becoming so valuable.

Nathan Snell

Cofounder

Copyright © 2024 Raleon. All Rights Reserved.
