the preliminary response from an LLM doesn’t swimsuit you? You rerun it, proper? Now, when you have been to automate that…
success = false
whereas not success:
response = immediate.invoke()
success = consider(response)
Alright, one thing like that. Individuals have achieved it for code, and the identical applies to non-code if the consider() operate is appropriate. These days, you should utilize LLMs for content material era and analysis. Nevertheless, a easy whereas loop that waits for one of the best random outcome will not be at all times ok. Typically, it’s good to modify the immediate. Experiment and blend issues up, and maintain monitor of what works and what doesn’t. Comply with alongside totally different ideation paths to maintain your choices open…
On this article, we’ll talk about how OpenEvolve [1], an open-source implementation of Google’s AlphaEvolve paper [2], can be utilized for content material creation. Within the background, it applies this “experiment and blend, observe totally different paths” strategy to optimize the LLM prompts.
The AlphaEvolve paper utilized an evolutionary system to the code era with LLMs. Learn extra concerning the thrilling, brand-new outcomes of this paper in my article, Google’s AlphaEvolve: Getting Began with Evolutionary Coding Brokers. In essence, in a survival of the fittest scheme, applications are combined and improved upon. The authors counsel that these evolutionary coding brokers can obtain analysis breakthroughs and current a number of outcomes.
As a result of sheer variety of issues that content material might be, I feel there could also be potential for high-value content material creation aside from code that makes use of such a long-running, steady evolution course of. On this article, we discover the way to apply the identical know-how to a non-code use case the place LLMs, fairly than algorithms, decide the outcomes of the LLM-generated resolution. We additionally dicuss the way to study the outcomes.
Conditions
First, let’s put together a fast, fundamental setup.
LLM server
With the intention to use OpenEvolve, you’ll need entry to an LLM server with OpenAI-compatible API endpoints. You’ll be able to register with Cerebras (they’ve a free tier), OpenAI, Google Gemini, or an identical service. Alternatively, you probably have a succesful GPU, you possibly can arrange your individual server, for instance with ollama. You’ll need to select a minimum of two totally different LLM fashions, a weak one (e.g., 4bn parameters) and a robust one (e.g., 17bn parameters).
Python envionment & git
I presume that you’re operating a Linux system with a ready Python setting, in which you’ll create digital environments and set up packages from the Python Package deal index.
OpenEvolve setup
Set up OpenEvolve, then put together your individual undertaking & immediate folders:
git clone https://github.com/codelion/openevolve.git
cd openevolve
python3 -m venv .venv
supply .venv/bin/activate
pip set up -e .
mkdir -p examples/my_project/prompts
Somewhat warning: OpenEvolve is at the moment a analysis undertaking. Its code base continues to be creating rapidly. Due to this fact, it’s a good suggestion to observe all updates intently.
Configuration
Create the file examples/my_project/config.yaml:
checkpoint_interval: 1
# LLM configuration
llm:
fashions:
- identify: "llama3.1-8b"
weight: 0.8
temperature: 1.5
- identify: "llama-4-scout-17b-16e-instruct"
weight: 0.2
temperature: 0.9
evaluator_models:
- identify: "llama-4-scout-17b-16e-instruct"
weight: 1.0
temperature: 0.9
api_base: "https://api.cerebras.ai/v1/" # The bottom URL of your LLM server API
# Immediate configuration
immediate:
template_dir: "examples/my_project/prompts"
num_top_programs: 0
num_diverse_programs: 0
# Database configuration
database:
num_islands: 3
# Evaluator configuration
evaluator:
timeout: 60
cascade_evaluation: false
use_llm_feedback: true
llm_feedback_weight: 1.0 # (Non-LLM metrics are weighed with an element of 1)
diff_based_evolution: true
allow_full_rewrites: false
To get a common concept of what you might be configuring right here, think about how new options are generated and evaluated in OpenEvolve. Options include their respective textual content content material and are saved in a database alongside their analysis metrics and “aspect channel” textual outcomes (e.g., errors throughout execution or textual enchancment ideas). The database additionally shops a listing of elite applications and applications that carry out notably effectively on totally different metrics (MAP-Elites) to have the ability to present inspirations for brand spanking new options. An LLM generates these new, mutated options primarily based on a single father or mother. Programmatic and/or LLM evaluators then decide the brand new resolution earlier than feeding it again into the database.

The configuration choices embrace:
- llm: fashions, evaluator_models
For era and analysis, you possibly can configure any variety of fashions.
The thought behind utilizing a number of fashions is to make use of a quick (weak) mannequin that rapidly explores many alternative choices and a slower (stronger) mannequin that provides high quality. For era, the load parameter controls the chance that every mannequin will probably be chosen in an iteration — it is just one mannequin at a time, not a number of. For analysis, all fashions will probably be executed every time, and their output metrics are weighed with the desired parameter.
The temperature setting affect how random these fashions behave. A price of 1.5 could be very excessive, and 0.9 continues to be a excessive temperature worth. For the inventive use case, I feel these are good. For enterprise content material or code, use decrease values. The OpenEvolve default setting is 0.7. - immediate: template_dir
The template_dir possibility specifies the listing that accommodates the immediate templates which might be used to overwrite the defaults. See beneath for extra data on the folder’s contents. - database: num_top_programs, num_diverse_programs
The prompts for producing new options can embrace inspirations from different applications within the database. With a price of 0, I turned this operate off, as a result of I discovered that the inspirations — which don’t embrace the content material itself, fairly simply metrics and alter abstract — weren’t too helpful for inventive content material evolution. - database: num_islands controls what number of separate sub-populations are maintained within the database. The extra islands you employ, the extra diverging resolution paths will outcome, whereas inside the similar island you’ll observe fewer substantial variations. For inventive use circumstances, you probably have sufficient time and assets to run many iterations, it might be useful to extend the variety of islands.
- evaluator: llm_feedback_weight
The mixed metrics generated by the analysis LLMs are multiplied with this parameter. Along with the algorithmically generated metrics, the numeric common is then used to search out one of the best program. Say the generated metrics have been
size: 1.0
llm_correctness: 0.5
llm_style: 0.7
with an llm_feedback_weight of 1.0, the general rating could be (1.0+0.5*1.0+0.7*1.0)/3 - diff_base_evolution / allow_full_rewrites:
Two totally different immediate approaches for the generator LLM are supported. Within the diff mode, the LLM makes use of a search-and-replace response format to interchange particular components within the present resolution. Within the full_rewrite mode, the LLM merely outputs a full rewrite. The latter mode is much less demanding for much less succesful LLMs, however it’s also much less appropriate for lengthy content material. High quality can also be higher with diff mode, primarily based on my assessments.
For extra choices, check with configs/default_config.yaml.
Prompts
OpenEvolve’s default prompts are written for code evolution. Due to this fact, its prompts usually are not appropriate for non-code era by default. Happily, we will overwrite them. The default prompts are encoded within the file openevolve/immediate/templates.py.
Create the next recordsdata and adapt the prompts to match your use case. Let’s strive a easy instance for creating poems.
Preliminary placeholder content material: examples/my_project/initial_content.txt
No preliminary poem, invent your individual.
The preliminary immediate represents the “first era” father or mother. It impacts its offspring, the second-generation options.
For the preliminary content material, you can present an present model or an empty placeholder textual content. You possibly can additionally present particular directions, similar to “Ensure it mentions cats,” to information the preliminary era in a desired course. For those who want extra common context for all generations, embrace it within the system immediate.
The system immediate: examples/my_project/prompts/system_message.txt
You're a Shakespeare degree poem author, turning content material into lovely poetry and enhancing it additional and additional.
The system immediate simply units the final context in your generator mannequin so it is aware of what your use case is all about. On this instance, we aren’t creating code, we’re writing poems.
Consumer immediate for content material era: examples/my_project/prompts/diff_user.txt
# Present Answer Data
- Present efficiency metrics: {metrics}
- Areas recognized for enchancment: {improvement_areas}
{artifacts}
# Evolution Historical past
{evolution_history}
# Present Answer
```
{current_program}
```
# Job
Recommend enhancements to the reply that can result in higher efficiency on the desired metrics.
You MUST use the precise SEARCH/REPLACE diff format proven beneath to point modifications:
>>>>>> REPLACE
Instance of legitimate diff format:
>>>>>> REPLACE
You'll be able to counsel a number of modifications. Every SEARCH part should precisely match textual content within the present resolution. If the answer is a clean placeholder, be sure that to reply with precisely one diff substitute -- trying to find the present placeholder string, changing it together with your preliminary resolution.
The content material era consumer immediate could be very common. It accommodates a number of placeholders, that will probably be changed with the content material from the answer database, together with the analysis outcomes of the father or mother program. This immediate illustrates how the evolution course of influences the era of recent options.
Consumer immediate for content material era with out the diff methodology: examples/my_project/prompts/full_rewrite.txt
# Present Answer Data
- Present metrics: {metrics}
- Areas recognized for enchancment: {improvement_areas}
{artifacts}
# Evolution Historical past
{evolution_history}
# Present Answer
```
{current_program}
```
# Job
Rewrite the reply to enhance its efficiency on the desired metrics.
Present the entire new reply. Don't add reasoning, changelog or feedback after the reply!
# Your rewritten reply right here
Immediate fragment for the evolution historical past: examples/my_project/prompts/evolution_history.txt
## Earlier Makes an attempt
{previous_attempts}
## Prime Performing Answer
{top_programs}
Immediate fragment for the highest applications: examples/my_project/prompts/top_programs.txt
### Answer {program_number} (Rating: {rating})
```
{program_snippet}
```
Key options: {key_features}
System immediate for the evaluator: examples/my_project/prompts/evaluator_system_message.txt
You're a Shakespeare degree poem author and are being requested to overview another person's work.
This method immediate for the evaluator fashions is actually the identical because the system immediate for the generator LLM.
Consumer immediate for the evaluator: examples/my_project/prompts/analysis.txt
Consider the next poem:
1. Magnificence: Is it lovely?
2. Inspiring: Is its message impressed and significant?
3. Emotion: Does the poem set off an emotional response?
4. Creativity: Is it inventive?
5. Syntax: Is its syntax good? Is it solely a poem or does it additionally comprise non-poem content material (if sure, fee as 0)? Are its traces overly lengthy (if sure, fee low)?
6. General rating: Give an general ranking. If Poem, Syntax or Size analysis was not okay, give a foul general suggestions.
For every metric, present a rating between 0.0 and 1.0, the place 1.0 is finest.
Reply to judge:
```
{current_program}
```
Return your analysis as a JSON object with the next format:
{{
"magnificence": score1,
"inspiring": score2,
"emotion": score3,
"creativity": score4,
"syntax": score5,
"overall_score": score6,
"improvement_suggestion": "..",
}}
Even for invalid enter, return nothing however the JSON object.
That is the place the magic occurs. On this immediate, you need to consider metrics that characterize what you might be optimizing. What determines whether or not the content material is sweet or dangerous? Correctness? Humor? Writing talent? Determine what’s necessary to you, and encode it properly. This may increasingly take some experimentation earlier than you see the evolution converge the way in which you supposed. Mess around as you observe the evolution of your content material (extra on that beneath).
Watch out — each metric is rated equally. They’re multiplied by the llm_feedback_weight consider your config.yaml. Additionally it is a good suggestion to maintain an overall_score metric that gives a abstract of the massive image analysis. You’ll be able to then kind the generated options by it.
The improvement_suggestion is a textual suggestion from the evaluator LLM. It is going to be saved together with the metrics within the database and offered to the generator LLM when this resolution is used as a father or mother, as a part of the {artifacts} placeholder you noticed above. (Be aware: As of this writing, textual LLM suggestions continues to be a pull request underneath overview within the OpenEvolve codebase, remember to use a model that helps it.)
The evaluator program
OpenEvolve was designed for code era with algorithmic evaluators. Though it’s tough to jot down an algorithm that judges the great thing about a poem, we can design a helpful algorithmic analysis operate additionally for our content material era use case. As an example, we will outline a metric that targets a specific variety of traces or phrases.
Create a file examples/my_project/analysis.txt:
from openevolve.evaluation_result import EvaluationResult
def linear_feedback(precise, goal):
deviation = abs(precise - goal) / goal
return 1 - min(1.0, deviation)
def evaluate_stage1(file_path):
# Learn in file_path
with open(file_path, 'r') as file:
content material = file.learn()
# Depend traces and phrases
traces = content material.splitlines()
num_lines = len(traces)
num_words = sum(len(line.cut up()) for line in traces)
# Goal size
line_target = 5
word_target = line_target*7
# Linear suggestions between 0 (worst) and 1 (finest)
line_rating = linear_feedback(num_lines, line_target)
word_rating = linear_feedback(num_words, word_target)
combined_rating = (line_rating + word_rating) / 2
# Create textual suggestions
length_comment_parts = []
# Line rely suggestions
line_ratio = num_lines / line_target
if line_ratio > 1.2:
length_comment_parts.append("Scale back the variety of traces.")
elif line_ratio 1.2:
length_comment_parts.append("Scale back the variety of phrases per line.")
elif words_per_line_ratio
This code has two facets:
First, it creates a metric worth that permits us to quantify the standard of the response size. If the response is just too brief or too lengthy, the rating is decrease. If the response is excellent, the rating reaches 1.
Second, this code prepares textual suggestions that the LLM can intuitively perceive, so it is aware of what to alter with out getting lured right into a predetermined concept of what to do when the size will not be good. For instance, it gained’t mistakenly assume: “I would like to jot down extra.. and extra..”.
Knowledge overview: Evolution at play
Run the evolution course of:
supply .venv/bin/activate
export OPENAI_API_KEY="sk-.."
python3 openevolve-run.py
examples/my_project/initial_program.py
examples/my_project/evaluator.py
--config examples/my_project/config.yaml
--iterations 9
It’s best to start with only some iterations and analyze the outcomes intently to make sure every little thing is functioning correctly. To take action, begin the visualization internet server and observe in actual time:
python3 scripts/visualizer.py
Or, you probably have a selected previous checkpoint that you simply want to analyze, open it with:
python3 scripts/visualizer.py --path examples/content_writing/openevolve_output/checkpoints/checkpoint_2
When rerunning your assessments after making enhancements, remember to transfer the present checkpoint folders out of the way in which earlier than beginning over:
mkdir -p examples/my_project/archive
mv examples/my_project/openevolve_output/ examples/my_project/archive/

Within the visualization entrance finish, click on the nodes to see the related present resolution textual content, in addition to all of their metrics, prompts and LLM responses. You can even simply click on via youngsters within the sidebar. Use the yellow locator button when you get misplaced within the graph and might’t see a node. By observing the prompts, you possibly can hint how the analysis response from a father or mother impacts the era consumer immediate of the kid. (Be aware: As of this writing, immediate & response logging continues to be a pull request underneath overview within the OpenEvolve codebase, remember to use a model that helps it.)
If you’re concerned with evaluating all options by a selected metric, choose it from the highest bar:

- The node colours characterize the islands, by which evolution takes place largely individually (when you run it lengthy sufficient!) and in several instructions. Often, relying on the migration parameters within the configuration, people from one island might be copied over into one other.
- The dimensions of every node signifies its efficiency on the at the moment chosen metric.
- The perimeters within the visualization present which father or mother was modified to supply the kid. This clearly has the strongest affect on the descendant.
In reality, the AlphaEvolve algorithm incorporates learnings from a number of earlier applications in its prompting (configurable top-n applications). The era immediate is augmented with a abstract of earlier modifications and their affect on the ensuing metrics. This “immediate crossover” will not be visualized. Additionally not visualized are the relations of “clones”: When an answer migrates to a different island, it’s copied with all of its knowledge, together with its ID. The copy reveals up as an unlinked ingredient within the graph.
In any case, one of the best resolution will probably be saved to examples/my_project/openevolve_output/finest/best_program.txt:
In silken moonlight, the place night time’s veil is lifted,
A constellation of goals is gently shifted,
The guts, a canvas, painted with vibrant hues,
A symphony of emotions, in tender Muse.
Can I…
- ..use my very own begin immediate?
Sure! Simply put the answer you have already got in your initial_content.txt. - ..not create my very own begin immediate?
Sure! Simply put a placeholder like “No preliminary poem, invent your individual. Ensure it mentions cats.” in your initial_content.txt. - ..not write any code?
Sure! For those who don’t need an algorithmic evaluator, put a stub in your evaluator.py like this:
def evaluate_stage1(file_path):
return {}
def consider(file_path):
return evaluate_stage1(file_path)
- …use an area or non-OpenAI LLM?
Sure, so long as it’s suitable with the OpenAI API! In your config.yaml, change the llm: api_base: to a price like ”http://localhost:11434/v1/” for a default ollama configuration. On the command-line, set your API key earlier than calling the Python program:
export OPENAI_API_KEY="ollama"
Ultimate thought
This text described an experiment with using LLM suggestions within the context of evolutionary algorithms. I needed to allow and discover this use case, as a result of the AlphaEvolve paper itself hinted at it — and talked about that they hadn’t optimized for that but. That is solely the start. The suitable use circumstances the place this comparatively excessive effort for content material era is price it and extra experiments nonetheless have to observe. Hopefully, all of it will turn into simpler to make use of sooner or later.
Actual-life outcomes: In observe I discover that enhancements throughout all metrics are observable as much as a sure level. Nevertheless, it’s tough to acquire good numeric metrics from an LLM as a result of their scores usually are not fine-grained and due to this fact rapidly plateau. Higher prompts, particularly for the evaluator, might presumably enhance upon this. Both approach, the mix of algorithmic and LLM analysis with a strong evolutionary algorithm and plenty of configuration choices makes the general strategy very efficient.
To generate extra thrilling LLM metrics that justify the long-running evolution, multi-stage LLM evaluator pipelines might be integrated. These pipelines might summarize content material and make sure the presence of sure information, amongst different issues. By calling these pipelines from the evaluator.py file, that is potential proper now inside OpenEvolve.
With information bases and instruments, the capabilities of such evolutionary techniques that incorporate LLM suggestions might be prolonged additional. An thrilling addition for OpenEvolve might be the assist for MCP servers sooner or later, however once more, within the evaluator.py file you can already make use of those to generate suggestions.
This entire strategy is also utilized with multi-modal LLMs or a separate backend LLM, that generates the precise content material in a distinct modality, and is prompted by the evolutionary system. Current MCP servers might generate photos, audio and extra. So long as we’ve got an LLM appropriate for evaluating the outcome, we will then refine the immediate to generate new, improved offspring.
In abstract, there are lots of extra experiments inside this thrilling framework ready to be achieved. I sit up for your responses and am wanting to see the result of this. Thanks for studying!
References
- Asankhaya Sharma, OpenEvolve: Open-source implementation of AlphaEvolve (2025), Github
- Novikov et al., AlphaEvolve: A Gemini-Powered Coding Agent for Designing Superior Algorithms (2025), Google DeepMind