Figure 1. Copilot Arena is a VSCode extension that collects human preferences of code directly from developers.
As model capabilities improve, large language models (LLMs) are increasingly integrated into user environments and workflows. In particular, software developers code with LLM-powered tools in integrated development environments such as VS Code, IntelliJ, or Eclipse. While these tools are increasingly used in practice, current LLM evaluations struggle to capture how users interact with these tools in real environments: they are often limited to short user studies, only consider simple programming tasks rather than real-world systems, or rely on web-based platforms removed from development environments.
To address these limitations, we introduce Copilot Arena, an app designed to evaluate LLMs in real-world settings by collecting preferences directly in a developer's actual workflow. Copilot Arena is a Visual Studio Code extension that provides developers with code completions, similar to the kind of assistance provided by GitHub Copilot. To date, over 11,000 users have downloaded Copilot Arena; the tool has served over 100K completions and amassed over 25,000 code completion battles. The battles feed a live leaderboard on the LMArena website. Since its launch, Copilot Arena has also been used to evaluate two new code completion models prior to their release: a new Codestral model from Mistral AI and Mercury Coder from InceptionAI.
In this blog post, we discuss how we designed and deployed Copilot Arena. We also highlight how Copilot Arena provides new insights into developer code preferences.
Copilot Arena System Design
To collect user preferences, Copilot Arena presents a novel interface that shows users paired code completions from two different LLMs, which are chosen by a sampling strategy that mitigates latency while preserving coverage across model comparisons. Additionally, we devise a prompting scheme that enables a diverse set of models to perform code completions with high fidelity. Figure 1 overviews this workflow. We review each component below:
User interface: Copilot Arena allows users to select between pairs of code completions from different LLMs. User selections allow us to better understand developer preferences between LLMs. To avoid interrupting user workflows, voting is designed to be seamless: users accept code completions with keyboard shortcuts.
Sampling model pairs: We explore a sampling strategy to minimize the experienced latency. Since our interface shows two code completions together, the slowest completion determines the latency. We model each model's latency as a log-normal distribution and tune a temperature parameter to interpolate between a latency-optimized distribution and a uniform distribution, observing a 33% decrease in median experienced latency (from 1.61 to 1.07 seconds) compared to uniform sampling.
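To make this concrete, the sketch below shows one way such latency-aware sampling could work. The function names, the softmax-style interpolation, and the example latencies are illustrative assumptions rather than Copilot Arena's exact implementation, and the sketch samples models from a marginal distribution instead of optimizing directly over pairs.

```python
import numpy as np

# Illustrative latency logs (seconds) per model; in practice these would come
# from observed completion times.
latency_logs = {
    "model_a": [0.8, 1.1, 0.9, 1.3],
    "model_b": [2.0, 2.4, 1.8, 2.2],
    "model_c": [1.2, 1.0, 1.5, 1.1],
}

def lognormal_median(samples):
    """Median of a log-normal fit to the samples: exp(mean of log-latencies)."""
    return float(np.exp(np.mean(np.log(samples))))

def sampling_weights(latency_logs, temperature):
    """Interpolate between latency-optimized and uniform model weights.

    Low temperature: strongly favor models with low median latency.
    High temperature: weights approach a uniform distribution.
    """
    models = list(latency_logs)
    medians = np.array([lognormal_median(latency_logs[m]) for m in models])
    scores = -np.log(medians) / temperature  # softmax over negative log-latency
    weights = np.exp(scores - scores.max())
    return dict(zip(models, weights / weights.sum()))

def sample_pair(latency_logs, temperature, rng=None):
    """Draw two distinct models according to the tempered weights."""
    rng = rng or np.random.default_rng()
    weights = sampling_weights(latency_logs, temperature)
    models = list(weights)
    probs = np.array([weights[m] for m in models])
    return tuple(rng.choice(models, size=2, replace=False, p=probs))

print(sample_pair(latency_logs, temperature=0.5))  # e.g., ('model_a', 'model_c')
```

Lower temperatures concentrate probability mass on fast models, while higher temperatures approach uniform sampling, trading experienced latency for coverage across model comparisons.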

Prompting for code completions: During development, models need to "fill in the middle" (FiM), where code must be generated based on both the current prefix and suffix. While some models, such as DeepSeek and Codestral, are designed to fill in the middle, many chat models are not and require additional prompting. To accomplish this, we allow the model to generate code snippets, which is a more natural format, and then post-process them into a FiM completion. Our approach is as follows: in addition to the same prompt templates above, the models are given instructions to begin by re-outputting a portion of the prefix and to similarly end with a portion of the suffix. We then match portions of the output code against the input and delete the repeated code. This simple prompting trick allows chat models to perform code completions with high success (Figure 2).
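The post-processing step can be sketched as follows: trim any re-outputted prefix from the start of the snippet and any re-outputted suffix from its end, leaving only the middle. The helper names below are hypothetical and the matching is deliberately simple; Copilot Arena's actual matching logic may differ.

```python
def trim_prefix_overlap(completion: str, prefix: str) -> str:
    """Drop the longest leading chunk of `completion` that re-outputs the end of `prefix`."""
    for k in range(min(len(completion), len(prefix)), 0, -1):
        if prefix.endswith(completion[:k]):
            return completion[k:]
    return completion

def trim_suffix_overlap(completion: str, suffix: str) -> str:
    """Drop the longest trailing chunk of `completion` that re-outputs the start of `suffix`."""
    for k in range(min(len(completion), len(suffix)), 0, -1):
        if suffix.startswith(completion[-k:]):
            return completion[:-k]
    return completion

def to_fim_completion(raw_output: str, prefix: str, suffix: str) -> str:
    """Turn a chat model's code snippet into a fill-in-the-middle completion."""
    snippet = raw_output
    if snippet.startswith("```"):  # strip a markdown code fence, if present
        snippet = snippet.split("\n", 1)[-1].rsplit("```", 1)[0]
    return trim_suffix_overlap(trim_prefix_overlap(snippet, prefix), suffix)

prefix = "def add(a, b):\n    "
suffix = "\n\nprint(add(1, 2))"
raw = "def add(a, b):\n    return a + b\n\nprint(add(1, 2))"
print(repr(to_fim_completion(raw, prefix, suffix)))  # 'return a + b'
```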
Deployment

We deploy Copilot Arena as a free extension available in the VSCode extension store. During deployment, we log user judgments and latency for model responses, along with the user's input and completion. Given the sensitive nature of programming, users can restrict our access to their data. Depending on privacy settings, we also collect the user's code context and model responses.
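As a rough illustration, a logged battle might be shaped like the record below. The field names and the split between always-collected and privacy-gated fields are assumptions for illustration, not Copilot Arena's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CompletionBattleRecord:
    # Always logged: the judgment and the latency of each model's response.
    user_id: str            # anonymized identifier (hypothetical field)
    model_a: str
    model_b: str
    winner: str             # "model_a" or "model_b"
    latency_a_s: float
    latency_b_s: float
    # Logged only when the user's privacy settings permit sharing code data.
    code_context: Optional[str] = None
    completion_a: Optional[str] = None
    completion_b: Optional[str] = None
```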
As is standard in other work on pairwise preference evaluation (e.g., Chatbot Arena), we apply a Bradley-Terry (BT) model to estimate the relative strengths of each model. We bootstrap the battles in the BT calculation to construct a 95% confidence interval for the rankings, which are used to create a leaderboard that ranks all models, where each model's rank is determined by which other models' lower bounds fall below its upper bound. We host a live leaderboard of model rankings at lmarena.ai (Figure 3).
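A minimal sketch of this pipeline, fitting BT strengths with an effectively unregularized logistic regression, bootstrapping the battles for 95% intervals, and ranking by comparing interval bounds, might look like the following; details such as rating scales and tie handling are omitted, and this is not the exact production code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_bt_strengths(battles, models):
    """Fit Bradley-Terry log-strengths via logistic regression.

    `battles` is a list of (model_a, model_b, a_won) tuples, where a_won is
    True if model_a's completion was preferred over model_b's.
    """
    idx = {m: i for i, m in enumerate(models)}
    X = np.zeros((len(battles), len(models)))
    y = np.zeros(len(battles))
    for row, (model_a, model_b, a_won) in enumerate(battles):
        X[row, idx[model_a]], X[row, idx[model_b]] = 1.0, -1.0
        y[row] = 1.0 if a_won else 0.0
    clf = LogisticRegression(fit_intercept=False, C=1e6).fit(X, y)
    return dict(zip(models, clf.coef_[0]))

def bootstrap_intervals(battles, models, n_rounds=100, seed=0):
    """Resample battles with replacement to get 95% CIs on each model's strength."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_rounds):
        picks = rng.integers(0, len(battles), size=len(battles))
        samples.append(fit_bt_strengths([battles[i] for i in picks], models))
    intervals = {}
    for m in models:
        strengths = [s[m] for s in samples]
        intervals[m] = (np.percentile(strengths, 2.5), np.percentile(strengths, 97.5))
    return intervals

def rank_models(intervals):
    """A model is ranked below only those models whose lower bound exceeds its
    upper bound; models with overlapping intervals share a rank."""
    return {m: 1 + sum(lo_o > hi_m for o, (lo_o, _) in intervals.items() if o != m)
            for m, (_, hi_m) in intervals.items()}
```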
Findings

Comparison to prior datasets
We compare our leaderboard to existing evaluations, which include both live preference leaderboards with human feedback and static benchmarks (Figure 4). The static benchmarks we compare against are LiveBench, BigCodeBench, and LiveCodeBench, which evaluate models' code generation abilities on a variety of Python tasks and continue to be maintained with new model releases. We also compare to Chatbot Arena and its coding-specific subset, which are human preferences of chat responses collected through a web platform.
We find a low correlation (r ≤ 0.1) with most static benchmarks, but a relatively higher correlation (Spearman's rank correlation r = 0.62) with Chatbot Arena (coding) and a similar correlation (r = 0.48) with Chatbot Arena (general). The stronger correlation with human preference evaluations than with static benchmarks likely indicates that human feedback captures distinct aspects of model performance that static benchmarks fail to measure. We find that smaller models tend to overperform (e.g., GPT-4o mini and Qwen-2.5-Coder 32B), particularly on static benchmarks. We attribute these differences to the unique distribution of data and tasks that Copilot Arena evaluates over, which we explore in more detail next.
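For reference, a rank correlation of this kind can be computed over the models shared by two leaderboards; the scores below are made up purely for illustration.

```python
from scipy.stats import spearmanr

# Made-up scores for four models on two leaderboards (not real results).
copilot_arena = {"m1": 1230, "m2": 1190, "m3": 1105, "m4": 1040}
other_eval    = {"m1": 71.2, "m2": 74.5, "m3": 60.3, "m4": 58.9}

models = sorted(copilot_arena)
rho, _ = spearmanr([copilot_arena[m] for m in models],
                   [other_eval[m] for m in models])
print(f"Spearman's rank correlation: {rho:.2f}")
```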

Compared to prior approaches, evaluating models in real user workflows leads to a diverse data distribution in terms of programming and natural languages, tasks, and code structures (Figure 5):
- Programming and natural language: While the plurality of Copilot Arena users write in English (36%) and Python (49%), we also identify 24 different natural languages and 103 programming languages, which is comparable to Chatbot Arena (general) and to benchmarks focused on multilingual generation. In contrast, static benchmarks tend to focus on questions written solely in Python and English.
- Downstream tasks: Existing benchmarks tend to source problems from coding competitions, handwritten programming challenges, or a curated set of GitHub repositories. In contrast, Copilot Arena users work on a diverse set of realistic tasks, including but not limited to frontend components, backend logic, and ML pipelines.
- Code structures and context lengths: Most coding benchmarks follow specific structures, which means that most benchmarks have relatively short context lengths. Similarly, Chatbot Arena focuses on natural language input collected from chat conversations, with many prompts not including any code context (e.g., 40% of Chatbot Arena's coding tasks contain code context and only 2.6% focus on infilling). Unlike any existing evaluation, Copilot Arena is structurally diverse with significantly longer inputs.
Insights into user preferences
- Downstream tasks significantly affect win rate, while programming languages have little effect: Changing the task type significantly impacts relative model performance, which may indicate that certain models are overexposed to competition-style algorithmic coding problems. On the other hand, the effect of the programming language on win rates was remarkably small, meaning that models that perform well on Python will likely perform well on another language. We hypothesize that this is due to the inherent similarities between programming languages, where learning one improves performance in another, aligning with trends reported in prior work.
- Smaller models may overfit to data similar to static benchmarks, while the performance of larger models is mixed: Existing benchmarks (e.g., those in Figure 4) primarily evaluate models on Python algorithmic problems with short context. However, we find that Qwen-2.5 Coder performs noticeably worse on frontend/backend tasks, longer contexts, and non-Python settings. We observe similar trends for the two other small models (Gemini Flash and GPT-4o mini). We hypothesize that overexposure may be particularly problematic for smaller models. On the other hand, performance among larger models is mixed.
Conclusion
While Copilot Arena represents a step in the right direction for LLM evaluation, providing more grounded and realistic evaluations, there is still significant work to be done to fully represent all developer workflows, for example, extending Copilot Arena to account for interface differences from production tools like GitHub Copilot and tackling privacy considerations that limit data sharing. Despite these constraints, our platform shows that evaluating coding LLMs in realistic environments yields rankings significantly different from static benchmarks or chat-based evaluations, and highlights the importance of testing AI assistants with real users on real tasks. We have open-sourced Copilot Arena to encourage the open-source community to contribute more nuanced feedback mechanisms, code trajectory metrics, and additional interaction modes.
If you find this blog post useful for your work, please consider citing it.
@misc{chi2025copilotarenaplatformcode,
title={Copilot Arena: A Platform for Code LLM Evaluation in the Wild},
author={Wayne Chi and Valerie Chen and Anastasios Nikolas Angelopoulos and Wei-Lin Chiang and Aditya Mittal and Naman Jain and Tianjun Zhang and Ion Stoica and Chris Donahue and Ameet Talwalkar},
year={2025},
eprint={2502.09328},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2502.09328},
}