At Origin Spice, we’re not just shipping monthly boxes of exotic spices—we’re using AI to create personalized flavor experiences for each of our customers. While everyone in the AI space is obsessing over the perfect prompt, we’ve discovered that our secret ingredient isn’t what we tell our models but how we evaluate them.
From Spices to Data: Why We’re Building Robust AI Evaluations
Today’s food industry is rapidly evolving, with AI technologies analyzing health data, dietary preferences, and even genetic profiles to deliver personalized nutrition recommendations. At Origin Spice, the real magic in our AI spice profiler isn’t just the prompts we feed it; it’s how we evaluate and refine its recommendations.
As the article “The Rising Tide” puts it: “evals are the only way you can break down each step in the system and measure specifically what impact an individual change might have on a product.”
Why Evals Matter for Our Spice Business
When we first built our AI flavor profiling system, we thought our biggest challenge would be sourcing exotic spices or perfecting our subscription model. But we quickly learned that the most crucial piece was ensuring our AI actually delivered on our core promise: recommending spices that delighted customers based on their unique preferences.
An ingredient’s flavor compounds (its chemical profile) largely determine how it can be combined with other culinary elements. With AI and machine learning now being used to discover and develop new flavors, we saw the potential to dramatically accelerate our business in a market expected to reach $12.8 billion by 2023.
Without proper evaluations, here’s what happened:
- Customers received inappropriate spice recommendations (like intense heat for those who preferred mild flavors)
- Our culinary experts couldn’t explain why the AI made certain choices
- We had no systematic way to improve the system’s accuracy over time
The Three Types of Evals We Use
We’ve implemented three types of evaluations:
- Human Evals: Every Origin Spice subscription box includes a simple feedback card where customers rate how well each spice matched their preferences. This direct feedback is gold.
- Code-Based Evals: Our development team runs automated tests to ensure the flavor recommendation engine performs consistently across different user profiles; a simplified sketch of one such test follows this list.
- LLM-Based Evals: We’ve built an “AI food critic” that evaluates recommendations before they’re finalized, checking for cultural authenticity, flavor compatibility, and alignment with customer preferences.
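To make the code-based layer concrete, here is a minimal sketch of the kind of deterministic check such a test might run. The profile fields, the 0–10 heat scale, and the function names are illustrative assumptions, not our production schema:

```python
# Minimal sketch of a code-based eval: deterministic checks that a recommended
# blend never exceeds the customer's stated heat tolerance or includes spices
# they've rejected before. Field names and the 0-10 heat scale are illustrative.
from dataclasses import dataclass


@dataclass
class CustomerProfile:
    name: str
    heat_tolerance: int          # 0 (mild only) .. 10 (loves extreme heat)
    disliked_spices: set[str]


@dataclass
class Recommendation:
    blend_name: str
    spices: list[str]
    heat_level: int              # same 0-10 scale as heat_tolerance


def eval_heat_alignment(profile: CustomerProfile, rec: Recommendation) -> bool:
    """Fail the eval if the blend is hotter than the customer can handle."""
    return rec.heat_level <= profile.heat_tolerance


def eval_no_disliked_spices(profile: CustomerProfile, rec: Recommendation) -> bool:
    """Fail the eval if the blend contains anything the customer has rejected before."""
    return not profile.disliked_spices.intersection(rec.spices)


if __name__ == "__main__":
    mild_customer = CustomerProfile("mild-palate tester", heat_tolerance=3,
                                    disliked_spices={"ghost pepper"})
    rec = Recommendation("Harissa Starter", ["cumin", "coriander", "chili"], heat_level=7)

    # A recommendation like this should be flagged before it ever ships.
    assert not eval_heat_alignment(mild_customer, rec), "expected the heat check to fail"
    assert eval_no_disliked_spices(mild_customer, rec)
    print("Code-based eval caught the mismatch: blend too hot for this profile.")
```

Checks like these are cheap enough to rerun on every prompt or model change, which is what makes them such a useful first line of defense.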
Our Eval Formula
We follow a four-part formula for creating effective evaluations (a sketch of how the parts come together into an eval prompt appears at the end of this section):
1. Setting the Role
We clearly define who’s doing the evaluation. Sometimes it’s our culinary experts, sometimes it’s our customers, and sometimes it’s our AI systems. Each evaluator needs clear instructions about their role.
2. Providing the Context
We make sure evaluators have all the necessary information, including:
- The customer’s flavor preference profile
- The specific spices being evaluated
- Any dietary restrictions or cultural preferences
- Regional cooking styles that the customer enjoys
3. Providing the Goal
Each evaluation has a clear purpose: to assess whether our recommendations will create flavor profiles that appeal to the target audience. This approach is similar to how sensory scientists in food and beverage manufacturing leverage data analytics to analyze consumer feedback, ingredient interactions, and market trends.
4. Defining Terminology and Labels
We’ve created a standardized “spice language” that everyone uses—from our AI to our customers to our culinary team—to ensure we’re all talking about the same flavor attributes.
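Putting the four parts together, here is a minimal sketch of how an eval prompt for an LLM grader might be assembled. The wording, the label set, and the build_eval_prompt helper are illustrative assumptions rather than our exact template:

```python
# A sketch of assembling an eval prompt from the four parts: role, context,
# goal, and defined labels. The exact wording and label set are illustrative.

EVAL_LABELS = {
    "great_match": "The blend closely fits the customer's stated preferences.",
    "acceptable": "The blend is plausible but misses at least one stated preference.",
    "poor_match": "The blend conflicts with a stated preference or restriction.",
}


def build_eval_prompt(profile_summary: str, recommended_blend: str,
                      dietary_notes: str, favorite_cuisines: str) -> str:
    label_definitions = "\n".join(f"- {label}: {meaning}"
                                  for label, meaning in EVAL_LABELS.items())
    return (
        # 1. Setting the role
        "You are an experienced spice sommelier reviewing a recommendation.\n\n"
        # 2. Providing the context
        f"Customer flavor profile: {profile_summary}\n"
        f"Dietary restrictions and cultural preferences: {dietary_notes}\n"
        f"Regional cooking styles the customer enjoys: {favorite_cuisines}\n"
        f"Recommended blend: {recommended_blend}\n\n"
        # 3. Providing the goal
        "Decide whether this recommendation will delight this specific customer.\n\n"
        # 4. Defining terminology and labels
        f"Respond with exactly one label:\n{label_definitions}\n"
    )


if __name__ == "__main__":
    print(build_eval_prompt(
        profile_summary="prefers warm, earthy flavors; low heat tolerance",
        recommended_blend="Ras el hanout (cumin, coriander, cinnamon, ginger)",
        dietary_notes="no salt blends",
        favorite_cuisines="North African, Levantine",
    ))
```

Keeping the labels defined inside the prompt itself is what lets our culinary team, our customers, and our AI grader all score against the same “spice language.”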
Our Complete Workflow for Writing Effective Evals
Here’s the complete workflow we follow when writing evals (a simplified code sketch follows the list):
- Define Evaluation Metrics: Before writing a single eval, determine what success looks like. For Origin Spice, we measure metrics like preference alignment, discovery satisfaction, and recipe success rate.
- Create Test Cases: Develop a diverse set of user profiles and expected outcomes to test against. Our test cases include everyone from culinary novices to experienced chefs, covering preferences across dozens of global cuisines.
- Implement Tiered Evaluation: Start with automated evals for quick feedback, then progress to expert reviews, and finally to real user testing. This creates layers of quality control.
- Compare Against Baselines: Always maintain a control group—for us, that’s recommendations made by our top human chefs without AI assistance.
- Iterate Based on Results: Adjust your AI systems based on evaluation feedback, then rerun the same evaluations to measure improvement.
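To ground the workflow, here is a simplified sketch showing fixed test cases, one metric (preference alignment), and a comparison against a human-curated baseline. The test cases, the scoring rule, and the recommender stand-ins are hypothetical placeholders for whatever your own system exposes:

```python
# A simplified sketch of the eval workflow: fixed test cases, one metric
# (preference alignment), and a comparison against a human-chef baseline.
# The test cases, scoring rule, and recommender stand-ins are hypothetical.
from statistics import mean

# Step 2: a small, diverse set of test cases with expected outcomes.
TEST_CASES = [
    {"profile": "novice cook, mild palate", "expected_spices": {"paprika", "oregano"}},
    {"profile": "experienced chef, loves North African food", "expected_spices": {"cumin", "coriander"}},
    {"profile": "heat seeker, Thai and Sichuan fan", "expected_spices": {"bird's eye chili", "sichuan peppercorn"}},
]


def preference_alignment(recommended: set[str], expected: set[str]) -> float:
    """Step 1 metric: fraction of the expected spices the recommendation covered."""
    return len(recommended & expected) / len(expected)


def run_eval(recommend) -> float:
    """Run every test case through a recommender and average the metric."""
    return mean(preference_alignment(recommend(case["profile"]), case["expected_spices"])
                for case in TEST_CASES)


if __name__ == "__main__":
    # Step 4: stand-ins for the AI system and the human-chef baseline.
    def ai_recommend(profile: str) -> set[str]:
        return {"cumin", "coriander", "paprika"}      # placeholder for the model's output

    def chef_recommend(profile: str) -> set[str]:
        return {"paprika", "oregano", "cumin"}        # placeholder for the human baseline

    ai_score, baseline_score = run_eval(ai_recommend), run_eval(chef_recommend)
    print(f"AI preference alignment: {ai_score:.2f} vs. chef baseline: {baseline_score:.2f}")
    # Step 5: change the system, rerun the same eval, and compare the scores again.
```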
Through this process, we’re effectively transforming spice curation from an art into a data-driven science while still honoring the culinary traditions behind each flavor profile.
Practical Example: The “North African Flavor Profile” Eval
Let me walk you through one of our most successful evals:
When our AI recommends North African spice blends, our evaluation system runs three checks, sketched in code after this list:
- Checks for Authenticity: Does the blend contain traditional North African spices like cumin, coriander, and cinnamon in appropriate ratios?
- Verifies Cultural Context: Does our recommendation include educational content about the blend’s origin and uses?
- Tests for Preference Alignment: Based on the customer’s previous ratings, is this blend likely to appeal to their palate?
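Here is roughly what those three checks could look like in code. The canonical spice set, the ratio thresholds, and the preference-alignment proxy are simplified assumptions for illustration:

```python
# A simplified version of the three North African blend checks. The canonical
# spice set, ratio bounds, and preference scoring are illustrative assumptions.

TRADITIONAL_SPICES = {"cumin", "coriander", "cinnamon", "ginger", "turmeric", "paprika"}


def check_authenticity(blend: dict[str, float]) -> bool:
    """Traditional spices should dominate the blend, and cumin shouldn't be a trace note."""
    traditional_share = sum(w for s, w in blend.items() if s in TRADITIONAL_SPICES)
    return traditional_share >= 0.7 and blend.get("cumin", 0.0) >= 0.1


def check_cultural_context(recommendation: dict) -> bool:
    """Every shipped blend needs an origin story and suggested uses."""
    return bool(recommendation.get("origin_note")) and bool(recommendation.get("usage_ideas"))


def check_preference_alignment(blend: dict[str, float], past_ratings: dict[str, float]) -> bool:
    """Rough proxy: the customer's average rating of this blend's spices is positive."""
    rated = [past_ratings[s] for s in blend if s in past_ratings]
    return bool(rated) and sum(rated) / len(rated) >= 3.5   # 1-5 rating scale


if __name__ == "__main__":
    rec = {
        "blend": {"cumin": 0.3, "coriander": 0.25, "cinnamon": 0.15, "paprika": 0.2, "salt": 0.1},
        "origin_note": "Inspired by Moroccan ras el hanout.",
        "usage_ideas": ["tagines", "roasted vegetables"],
    }
    ratings = {"cumin": 4.5, "cinnamon": 4.0, "paprika": 3.0}

    print("authenticity:", check_authenticity(rec["blend"]))
    print("cultural context:", check_cultural_context(rec))
    print("preference alignment:", check_preference_alignment(rec["blend"], ratings))
```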
This approach aligns with current research showing that AI can enhance food flavor analysis through high-throughput data and screening technologies, offering a path to better product formulations and more scientifically grounded personalization.
Key Takeaways for Building Your Own Evals
If you’re building AI-powered products, here are the steps to create effective evaluations:
- Start with clear objectives: Define what your AI system should accomplish and how you’ll know if it succeeds.
- Design multi-layered evaluations: Create evaluations that test different aspects of your system separately (component evaluations) and as a whole (end-to-end evaluations); a brief sketch follows this list.
- Incorporate diverse perspectives: Include evaluations from different stakeholders, such as developers, domain experts, and end users.
- Measure what matters: Focus on outcomes that impact user satisfaction, not just technical metrics.
- Create a continuous feedback loop: Make evaluation an ongoing process, not a one-time activity.
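To illustrate the multi-layered point above, here is a small sketch that separates a component eval (did one pipeline step surface the relevant spices?) from an end-to-end eval (did the shipped box satisfy the customer?). The function names and pass thresholds are assumptions, not a prescribed standard:

```python
# Sketch of layering evals: a component-level check on one pipeline step and an
# end-to-end check on the final output. Names and thresholds are illustrative.

def eval_component_retrieval(retrieved_spices: set[str], relevant_spices: set[str]) -> float:
    """Component eval: how much of the relevant inventory did the retrieval step surface?"""
    return len(retrieved_spices & relevant_spices) / len(relevant_spices)


def eval_end_to_end(final_box: list[str], customer_feedback_scores: list[int]) -> bool:
    """End-to-end eval: the shipped box is judged only on outcomes the customer sees."""
    average_rating = sum(customer_feedback_scores) / len(customer_feedback_scores)
    return len(final_box) > 0 and average_rating >= 4.0      # 1-5 feedback-card scale


if __name__ == "__main__":
    recall = eval_component_retrieval({"cumin", "sumac"}, {"cumin", "coriander", "sumac"})
    shipped_ok = eval_end_to_end(["ras el hanout", "za'atar"], [5, 4, 4])
    print(f"retrieval recall: {recall:.2f}, end-to-end pass: {shipped_ok}")
```

Component evals tell you which step to fix; end-to-end evals tell you whether the fix actually mattered to the user.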
Remember that neural networks applied to flavor aren’t replacing human creativity; they’re amplifying it by combining data intelligence with culinary artistry to unlock new possibilities for flavor innovation.
The Future of AI Flavor Evals
As we continue to develop our technology, we’re moving toward what Analytical Flavor Systems has pioneered: treating flavor as a predictive measurement, where AI interprets what people report tasting and builds a systematic model of human sensory perception.
Our next frontier is hyper-personalization—not just recommending spices customers will like, but predicting flavor combinations they haven’t even imagined yet. We’re also exploring how our evaluations can help us detect and account for seasonal variations in spice quality, ensuring consistent experiences year-round.
Conclusion
Building an AI system without robust evaluations is like cooking without tasting your food. At Origin Spice, our evaluation framework has transformed our business by ensuring our AI-powered recommendations genuinely deliver on our promise of curating global flavors for modern kitchens.
The real power of AI in our business isn’t just in the technology itself—it’s in our ability to systematically measure, understand, and improve its performance through thoughtful evaluations. By sharing our approach, we hope to inspire other AI product teams to build more rigorous evaluation processes, whether they’re in the food industry or beyond.
Now, I’d love to hear from you—what evaluation approaches have you found effective in your AI systems? What metrics matter most for your products?

