gklitt 2 days ago

I tried one task head-to-head with Codex o4-mini vs Claude Code: writing documentation for a tricky area of a medium-sized codebase.

Claude Code did great and wrote pretty decent docs.

Codex didn't do well. It hallucinated a bunch of stuff that wasn't in the code, and completely misrepresented the architecture - it started talking about server backends and REST APIs in an app that doesn't have any of that.

I'm curious what went so wrong - feels like possibly an issue with loading in the right context and attending to it correctly? That seems like an area that Claude Code has really optimized for.

I have high hopes for o3 and o4-mini as models so I hope that other tests show better results! Also curious to see how Cursor etc. incorporate o3.

  • strangescript 2 days ago

    Claude Code still feels superior. o4-mini has all sorts of issues. o3 is better but at that point, you aren't saving money so who cares.

    I feel like people are sleeping on Claude Code for one reason or another. It's not cheap, but it's by far the best, most consistent experience I have had.

    • artdigital 2 days ago

      Claude Code is just way too expensive.

      These days I’m using Amazon Q Pro on the CLI. Very similar experience to Claude Code minus a few batteries. But it’s capped at $20/mo and won’t set my credit card on fire.

      • aitchnyu 2 days ago

        Is it using one of these models? https://openrouter.ai/models?q=amazon

        Seems 4x costlier than my Aider+OpenRouter setup. Since I'm less about vibes or huge refactoring, my (first and only) bill is <$5 with Gemini. These models will halve that.

        • artdigital 2 days ago

          No, Amazon Q is using Amazon Q. You can't change the model, it's calling itself "Q" and it's capped to $20 (Q Developer Pro plan). There is also a free tier available - https://aws.amazon.com/q/developer/

          It's very much a "Claude Code" in the sense that you have a "q chat" command line command that can do everything from changing files, running shell commands, reading and researching, etc. So I can say "q chat" and then tell it "read this repo and create a README" or whatever else Claude Code can do. It does everything by itself in an agentic way. (I didn't want to say like 'Aider' because the entire appeal of Claude Code is that it does everything itself, like figuring out what files to read/change)

          (It's calling itself Q but from my testing it's pretty clear that it's a variant of Claude hosted through AWS which makes sense considering how much money Amazon pumped into Anthropic)

          • aitchnyu 2 days ago

            I felt Sonnet 3.7 would cost at least $30 a month for light use. Did they figure out a way to offer it cheaper?

            • nmcfarl 2 days ago

              I don’t know what Amazon did - but I use Aider+OpenRouter with Gemini 2.5 Pro and it costs 1/6 of what Sonnet 3.7 does. The Aider leaderboard - https://aider.chat/docs/leaderboards/ - includes relative pricing these days.

          • dingnuts 2 days ago

            > the entire appeal of Claude Code is that it does everything itself, like figuring out what files to read/change

            How is this appealing? I think I must be getting old, because the idea of letting a language model run wild and run commands on my system -- that's unsanitized input! -- horrifies me! What do you mean, just let it change random files??

            I'm going to have to learn a new trade, IDK

            • hmottestad a day ago

              In the OpenAI demo of codex they said that it’s sandboxed.

              It only has access to files within the directory it’s run from, even if it calls tools that could theoretically access files anywhere on your system. It also had networking blocked, again in a sandboxed fashion, so that things like curl don’t work either.
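              For what it’s worth, the Codex CLI exposes this through approval modes - at least per its README at launch (flag names from memory, so double-check against `codex --help`):

                  codex --approval-mode suggest    # default: proposes edits/commands, you approve each
                  codex --approval-mode auto-edit  # applies file edits, still asks before shell commands
                  codex --approval-mode full-auto  # runs everything inside the network-disabled sandbox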

              I wasn’t particularly impressed with my short test of Codex yesterday. The fact that it managed to make any decent changes at all was good, but when it messed up the code, it took a long time and a lot of tokens to sort out.

              I think we need fine tuned models that are good at different tasks. A specific fine tune for fixing syntax errors in Java would be a good start.

              In general it also needs to be more proactive in writing and running tests.

            • winrid a day ago

              It shows you the diff and you confirm it, asks you before running commands, and doesn't allow accessing files outside the current dir. You can also tell it to not ask again and let it go wild; I've built full features this way and then just gone through and cleaned it up a bit after.

      • monsieurbanana 2 days ago

        > Upgrade apps in a fraction of the time with the Amazon Q Developer Agent for code transformation (limit 4,000 lines of submitted code per month)

        4k LOC per month seems terribly low? Any request I make could easily go over that. I feel like I'm completely misunderstanding (their fault though) what they actually meant.

        Edit: No, I don't think I'm misunderstanding; if you want to go over this, they direct you to a pay-per-request plan and you are no longer capped at $20.

        • artdigital 2 days ago

          You are confusing Amazon Q in the editor (like "transform") with Amazon Q on the CLI. The editor thing has some stuff that costs extra after exceeding the limit, but the CLI tool (that acts similar to Claude Code) is a separate feature that doesn't have this restriction. See https://aws.amazon.com/q/developer/pricing/?p=qdev&z=subnav&..., under "Console" see "Chat". The list is pretty accurate about what's "included" and what costs extra.

          I've been running this almost daily for the past few months without any issues or extra cost. Still just paying $20.

    • ekabod 2 days ago

      "gemini 2.5 pro exp" is superior to Claude Sonnet 3.7 when I use it with Aider [1]. And it is free (with some high limit).

      [1] https://aider.chat/

      • razemio 2 days ago

        Compared to Cline, Aider had no chance the last time I tried it (4 months ago). Has it really changed? I always thought Cline was superior because it focuses on Sonnet with all its bells and whistles, while Aider tries to be a universal IDE coding agent that works well with all models.

        When I try Gemini 2.5 Pro Exp with Cline it does very well, but it often fails to use the tools provided by Cline. It's way less expensive, but it fails random basic tasks Sonnet does in its sleep. I pay the extra to save the time.

        Do not get me wrong. Maybe I am totally outdated with my opinion. It is hard to keep up these days.

        • ekabod a day ago

          I tried Cline, but I work faster using the command line style of Aider. Having the /run command to execute a script and having the console content added to the prompt, makes fixing bugs very fast.

        • mstipetic 2 days ago

          It has multiple edit modes, you have to pair them up properly

      • jacooper 2 days ago

        Don't they train on your inputs if you use the free AI Studio API key?

        • asadm 2 days ago

          Speaking for myself, I am happy to make that trade, as long as I get unrestricted access to the latest one. Heck, most of my code now is written by Gemini anyway haha.

    • Aeolun 2 days ago

      > It's not cheap, but it's by far the best, most consistent experience I have had.

      It’s too expensive for what it does though. And it starts failing rapidly when it exhausts the context window.

      • jasonjmcghee 2 days ago

        If you get a hang of controlling costs, it's much cheaper. If you're exhausting the context window, I'm not surprised you're seeing high cost.

        Be aware of the "cache".

        Tell it to read specific files, never use /compact (that'll bust cache, if you need to, you're going back and forth too much or using too many files at once).

        Never edit files manually during a session (that'll bust cache). THIS INCLUDES LINT.

        Have a clear goal in mind and keep sessions to as few messages as possible.

        Write / generate markdown files with needed documentation using claude.ai, and save those as files in the repo and tell it to read that file as part of a question.
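        As a concrete (made-up) example of that last tip, a session prompt might look like:

            > Read docs/auth-flow.md and src/auth/session.ts, then fix the token refresh bug described in the doc. Don't read any other files.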

        I'm at about $0.50-$0.75 for most "tasks" I give it. I'm not a super heavy user, but it definitely helps me (it's like having a super focused smart intern that makes dumb mistakes).

        If i need to feed it a ton of docs etc. for some task, it'll be more in the few $, rather than < $1. But I really only do this to try some prototype with a library claude doesn't know about (or is outdated).

        For hobby stuff, it adds up - totally.

        For a company, massively worth it. Insanely cheap productivity boost (if developers are responsible / don't get lazy / don't misuse it).

      • Implicated 2 days ago

        I keep seeing this sentiment and it's wild to me.

        Sure, it might cost a few dollars here and there. But what I've personally been getting from it, for that cost, is so far away from "expensive" it's laughable.

        Not only does it do things I don't want to do, in a _super_ efficient manner. It does things I don't know how to do - contextually, within my own project, such that when it's done I _do_ know how to do it.

        Like others have said - if you're exhausting the context window, the problem is you, not the tool.

        Example, I have a project where I've been particularly lazy and there's a handful of models that are _huge_. I know better than to have Claude read those models into context - that would be stupid. Rather - I tell it specifically what I want to do within those models, give it specific method names and tell it not to read the whole file, rather search for and read the area around the method definition.

        If you _do_ need it to work with very large files - they probably shouldn't be that large and you're likely better off refactoring those files (with Claude, of course) to abstract out where you can and reduce the line count. Or, if anything, literally just temporarily remove a bunch of code from the huge files that isn't relevant to the task so that when it reads it it doesn't have to pull all of that into context. (ie: Copy/paste the file into a backup location, delete a bunch of unrelated stuff in the working file, do your work with claude then 'merge' the changes to the backup file and copy it back)

        If a few dollars here and there for getting tasks done is "too expensive" you're using it wrong. The amount of time I'm saving for those dollars is worth many times the cost and the number of times that I've gotten unsatisfactory results from that spending has been less than 5.

        I see the same replies to these same complaints everywhere - people complaining about how it's too expensive or becomes useless with a full context. Those replies all state the same thing - if you're filling the context, you've already screwed it up. (And also, that's why it's so expensive)

        I'll agree with sibling commenters - have claude build documentation within the project as you go. Try to keep tasks silo'd - get in, get the thing done, document it and get out. Start a new task. (This is dependent on context - if you have to load up the context to get the task done, you're incentivized to keep going rather than dump and reload with a new task/session, thus paying the context tax again - but you also are going to get less great results... so, lesson here... minimize context.)

        100% of the time that I've gotten bad results/gone in circles/gotten hallucinations was when I loaded up the context or got lazy and didn't want to start new sessions after finishing a task and just kept moving into new tasks. If I even _see_ that little indicator on the bottom right about how much context is available before auto-compact I know I'm getting less-good functionality and I need to be careful about what I even trust it's saying.

        It's not going to build your entire app in a single session/context window. Cut down your tasks into smaller pieces, be concise.

        It's a skill problem. Not the tool.

        • someothherguyy 2 days ago

          How can it be a skill problem when the tool itself is sold as being skilled?

          • mirsadm 2 days ago

            You're using it wrong, you're using the wrong version, etc. etc. Insert all the excuses for how it's never the tool but the user's fault.

            • Implicated 2 days ago

              If this is truly your perspective, you've already lost the plot.

              It's almost always the user's fault when it comes to tools. If you're using it and it's not doing its 'job' well, it's more likely that you're using it wrong than that it's a bad tool. Almost universally.

              Right tool for the job, etc etc. Also important that you're using it right, for the right job.

              Claude Code isn't meant to refactor entire projects. If you're trying to load up 100k token "whole projects" into it - you're using it wrong. Just a fact. That's not what this tool is designed to do. Sure.. maybe it "works" or gets close enough to make people think that is what it's designed for, but it's not.

              Detailed, specific work... it excels, so wildly, that it's astonishing to me that these takes exist.

              In saying all of that, there _are_ times I dump huge amounts of context into it (Claude Projects, not Claude Code - 'cause that's not what it's designed for) and I don't have "conversations" with it in that manner. I load it up with a bunch of context, ask my question / give it a task, and that first response is all you need. If it doesn't solve your concern, it should shine enough light that you now know how you want to address it in a more granular fashion.

              • troupo 2 days ago

                The unpredictable non-deterministic black box with an unknown training set, weights and biases is behaving contrary to how it's advertised? The fault lies with the user, surely.

          • mwigdahl 2 days ago

            A junior developer is skilled too, but still requires a senior’s guidance to keep them focused and on track. Just because a tool has built in intelligence doesn’t mean it can read your intentions from nothing if you fail to communicate to it well.

        • threecheese 2 days ago

          How can one develop this skill via trial and error if the cost is unknowably high? Before reasoning models, this mattered less because tokens were cheap; but with mixed models, some of them expensive to use, and reasoning blowing up the cost, having to pay even five bucks to make a mistake sure makes the cost seem higher than the value. A little predictability here would go a long way toward growing the use of these capabilities, and so one should wonder why cost predictability doesn't seem important to the vendors - maybe the value isn't there, or is only there for the select few who can intuit how to use the tech effectively.

        • afletcher 2 days ago

          Thanks for sharing. Are you able to control the context when using Claude Code, or are you using other tools that give you greater control over what context to provide? I haven't used Claude Code enough to understand how smart it is at deciding what context to load itself and if you can/need to explicitly manage it yourself.

        • disqard 2 days ago

          This comment echoes my own experience with Claude. Especially the advice about only pulling in the context you need.

          I'm a paying customer and I know my time is sufficiently valuable that this kind of technology pays for itself.

          As an analogy, I liken it to a scribe (author's assistant).

          Your comment has lots of useful hints -- thanks for taking the time to write them up!

          • Implicated 2 days ago

            I like the scribe analogy. And, just like a scribe, my primary complaint with claude code isn't the cost or the context - but the speed. It's just so slow :D

        • siva7 2 days ago

          True. Matches my experience. It takes much effort to get really proficient with AI. It's like learning to ride a wild horse. Your senior dev skills will sure come in handy on this ride, but don't expect it to work like some Google query.

        • Aeolun 11 hours ago

          > It's not going to build your entire app in a single session/context window.

          I mean, it was. Right up until it exhausted the context window. Then it suddenly required hand holding.

          If I wanted to do that I might as well use Cursor.

  • ksec 2 days ago

    Sometimes I see areas where AI/LLMs are absolutely crushing those jobs; a whole category will be gone in the next 5 to 10 years, as they are already at the 80-90% mark. They just need another 5-10% as they continue to improve, and they are already cheaper per task.

    Sometimes I see an area of AI/LLM where I think even a 10x efficiency improvement on 10x the hardware resources - 100x in aggregate - would still be nowhere near good enough.

    The truth is probably somewhere in the middle. Which is why I don't believe AGI will be here any time soon. But Assisted Intelligence is no doubt in its iPhone moment, and will continue for another 10 years before, hopefully, another breakthrough.

  • ilaksh 2 days ago

    Did you try the same exact test with o3 instead? The mini models are meant for speed.

    • gklitt 2 days ago

      I want to but I’ve been having trouble getting o3 to work - lots of errors related to model selection.

  • enether 2 days ago

    There was one post that detailed how those OpenAI models hallucinate and double down on their mistakes by "lying" - it speculated on a bunch of interesting reasons why this may be the case.

    Recommended read - https://transluce.org/investigating-o3-truthfulness

    I wonder if this is what's causing it to do badly in these cases

  • kristopolous 2 days ago

    Ever use Komment? They've been in the game a while. Looks pretty good.

swyx 2 days ago

related demo/intro video: https://x.com/OpenAIDevs/status/1912556874211422572

this is a direct answer to claude code which has been shipping furiously: https://x.com/_catwu/status/1903130881205977320

claude code, by contrast, is not open source; there are unverified comments that Anthropic has DMCA'ed decompilations https://x.com/vikhyatk/status/1899997417736724858?s=46

by total coincidence we're releasing our claude code interview later this week that touches on a lot of these points + why code agent CLIs are an actually underrated point in the SWE design space

(TLDR you can use it like a linux utility - similar to @simonw's `llm` - to sprinkle intelligence in all sorts of things like CI/PR review without the overhead of buying a Devin or a Copilot SaaS)

if you are a Claude Code (and now OAI Codex) power user we want to hear use cases - CFP closing soon, apply here https://sessionize.com/ai-engineer-worlds-fair-2025

  • axkdev 2 days ago

    Hey! The weakest part of Claude Code, I think, is that it's closed source and locked to Claude models only. If you are looking for inspiration, Roo is the best tool atm. It offers far more interesting capabilities, just to name some: user-defined modes, the built-in debug mode (great for debugging), architecture mode. You can, for example, ask it to summarize some part of the running task and start a new task with fresh context. And, unlike in Claude Code, in Roo the LLM will actually follow your custom instructions (seriously, guys, that Claude.md is absolutely useless)! The only drawback of Roo, in my opinion, is that it is NOT a CLI.

    • kristopolous 2 days ago

      There's Goose; Plandex and Aider too. Also, there's Kilo, a new fork of Roo.

  • senko 2 days ago

    I got confused, so to clarify for myself and others - Codex is open source, Claude Code isn't, and the referenced decompilation tweets are about Claude Code.

asadm 2 days ago

These days, I usually paste my entire (or some) repo into gemini and then APPLY changes back into my code using this handy script i wrote: https://github.com/asadm/vibemode

I have tried aider/copilot/continue/etc. But they lack in one way or the other.

  • jwpapi 2 days ago

    It’s not just about saving money or making fewer mistakes; it’s also about iteration speed. I can’t believe this process is remotely comparable to Aider.

    In Aider everything is loaded in memory: I can add/drop files in the terminal, discuss in the terminal, switch models, and run terminal commands with ! at the start; every change is a commit.

    The full codebase is more expensive and slower than the relevant files. I understand when you don’t worry about the cost, but at a reasonable size, pasting the full codebase can’t really be a thing.
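    For anyone who hasn’t tried it, a typical Aider session looks roughly like this (the commands are from Aider’s docs; file and model names are just examples):

        $ aider src/app.py tests/test_app.py
        > /add src/utils.py        # pull another file into the chat
        > /model gemini/gemini-2.5-pro-exp-03-25    # switch models mid-session
        > !pytest -q               # run a shell command
        > /run pytest -q           # run it and add the output to the chat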

    • asadm 2 days ago

      I am at my 5th project in this workflow and these are of different types too:

      - an embedded project for esp32 (100k tokens)

      - visual inertial odometry algorithm (200k+ tokens)

      - a web app (60k tokens)

      - the tool itself mentioned above (~30k tokens)

      it has worked well enough for me. Other methods have not.

    • t1amat 2 days ago

      Use a tool like repomix (npm), which has extensions in some editors (at least VS Code), to quickly bundle source files into a machine-readable format.
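      If I remember its defaults right, the flow is something like:

          npx repomix    # packs the repo into a single repomix-output.xml
          # then paste repomix-output.xml into Gemini (or any chat) as context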

  • brandall10 2 days ago

    Why not just select Gemini Pro 2.5 in Copilot with Edit mode? Virtually unlimited use without extra fees.

    Copilot used to be useless, but over the last few months has become quite excellent once edit mode was added.

    • asadm 2 days ago

      Copilot (and others) try to be too smart and do context reduction (to save their own wallets). I want the ENTIRETY of the files I attach in the context, not a RAG'd version of them.

      • bredren 2 days ago

        This problem is real.

        Claude Projects, chatgpt projects, Sourcegraph Cody context building, MCP file systems, all of these are black boxes of what I can only describe as lossy compression of context.

        Each is incentivized to deliver ~”pretty good” results at the highest token compression possible.

        The best way around this I’ve found is to just own the web clients by including structured, concatenated files directly in chat contexts.

        Self plug but super relevant: I built FileKitty specifically to aid this, which made HN front page and I’ve continued to improve:

        https://news.ycombinator.com/item?id=40226976

        If you can prepare your file system context yourself using any workflow quickly, and pair it with appropriate additional context such as run output, problem description etc, you can get excellent results and you can pound away at OpenAI or Anthropic subscription refining the prompt or updating the file context.

        I have been finding myself spending more time putting together prompt complexity for big, difficult problems that would not make sense to solve in the IDE.

        • airstrike 2 days ago

          > The best way around this I’ve found is to just own the web clients by including structured, concatenated files directly in chat contexts.

          Same. I used to run a bash script that concatenates the files I'm interested in and annotates each with its path/name at the top in a comment - something like the sketch below. I haven't needed that recently as I think the # of attachments for Claude has increased (or I haven't needed as many small disparate files at once).
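          A minimal sketch of that idea, for anyone curious (adjust the path and extension to taste):

              # concatenate files, each prefixed with a path comment
              find src -name '*.ts' | while read -r f; do
                printf '\n// File: %s\n' "$f"
                cat "$f"
              done > context.txt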

        • asadm 2 days ago

          filekitty is pretty cool!

          • bredren 2 days ago

            Thank you! I was glad to read your comments here and see your project.

            I have encountered this issue of reincorporation of LLM code recommendations back into a project so I’m interested in exploring your take.

            I told a colleague that I thought excellent use of copy paste and markdown were some of the chief skills of working with gen AI for code right now.

            This and context management are as important as prompting.

            It makes the details of the UI choices for copying web chat conversations or their segments so strangely important.

      • nowittyusername 2 days ago

        I believe this is the root of the problem for all agentic coding solutions. They are gimping the context through fancy function calling and tool use, to reduce what is being sent through the API. The problem with this is that you can never know what context is actually needed for the problem to be solved in the best way. The funny thing is, this type of behavior actually leads many people to believe these models are LESS capable than they actually are, because people don't realize how restricted these models are behind the scenes by the developers. The good news is, we are entering the era of large context windows, and we will all see a huge performance increase in coding as a result of these advancements.

        • pzo 2 days ago

          OpenAI shared a chart about the performance drop with large context, like 500k tokens. So you still want to limit the context, not only for cost but for performance as well. You also probably want to limit context to speed up inference and get responses faster.

          I agree though that a lot of those agents are black boxes, and it's hard to even learn how to best combine .rules, llms.txt, PRD, MCP, web search, function calls, memory. Most IDEs don't provide output where you can inspect the final prompts to see how those are executed - maybe you have to use some mitmproxy to inspect requests, but some tool would be useful for learning best practices.

          I will be trying more Roo Code and Cline, since they're open source and you can at least see the system prompts etc.

        • cynicalpeace 2 days ago

          This stuff is so easy to do with Cursor. Just pass in the approximate surface area of the context and it doesn't RAG anything if your context isn't too large.

          • asadm 2 days ago

            I haven't tried recently, but does it tell you if it RAG'd or not, i.e. can I peek at the context it sent to the model?

        • asadm 2 days ago

          Exactly. I understand the reason behind this, but it's too magical for me. I just want dumb tooling between me and my LLM.

      • thelittleone 2 days ago

        Regarding context reduction: this got me wondering. If I use my own API key, there is no way for the IDE or copilot provider to benefit other than the monthly sub. But if I am using their provided model with tokens from the monthly subscription, they are incentivized to charge me based on the tokens I submit to them, but then optimize that and pass on a smaller request to the LLM to get more margin. Is that what you are referring to?

        • asadm 2 days ago

          Yup, but there was also a good reason to do this: models work better with smaller context. Which is why I rely on Gemini for this lazy/inefficient workflow of mine.

      • brandall10 2 days ago

        FWIW, Edit mode gives the impression of doing this, vs. originally only passing the context visible from the open window.

        You can choose files to include and they don't appear to be truncated in any way. Though to be fair, I haven't checked the network traffic, but it appears to operate in this fashion from day to day use.

        • bredren 2 days ago

          I’d be curious to hear what actually goes through the network request.

        • asadm 2 days ago

          I will try again, but last time I tried adding a folder in Edit mode and asking it to list the files it sees, it didn't list them all.

          • brandall10 2 days ago

            I like to use "Open Editors". That way, it's only the code I'm currently working on that is added to the context, seems more a more natural way to work.

      • siva7 2 days ago

        Thanks, most people don't understand this fine difference. Copilot does RAG (as do all other subscription-based agents like Cursor) to save $$$, and results with RAG are significantly worse than having the complete context window for complex tasks. That's also the reason ChatGPT and Claude basically lie to users when they market their file upload functions without telling the whole story.

      • AaronAPU 2 days ago

        Is that why it’s so bad? I’ve been blown away by how bad it is. Never had a single successful edit.

        The code completion is chef's kiss though.

        • asadm 2 days ago

          Probably, but also most models start to lose it after a certain context size (usually 10-20k). Which is why I use Gemini (via AI Studio) for my workflow.

      • MrBuddyCasino 2 days ago

        Cline doesn’t do this - this is what makes it suitable for working with Gemini and its large context.

  • fasdfasdf11234 2 days ago

    Isn't this similar to https://aider.chat/docs/usage/copypaste.html

    Just checked to see how it works. It seems that it does all that you are describing. The difference is in the way that it provides the files - it doesn't use xml format.

    If you wish you could /add * to add all your files.

    Also, deducing from this mode, it seems that any file you add to the aider chat with /add has its full contents added to the chat context.

    But hey I might be wrong. Did a limited test with 3 files in project.

    • asadm 2 days ago

      That’s correct, Aider doesn’t RAG on files, which is good. I don’t use it because 1) the UI is so slow and clunky, and 2) using Gemini 2.5 via API in this way (huge context window) is expensive and also heavily rate-limited at this point. No such issue when used via the AI Studio UI.

      • fasdfasdf11234 2 days ago

        You could use Aider's copy-paste mode with the AI Studio UI or any other web chat. You could use gemini-2.0-flash as the Aider model that applies the changes. But I understand your first point.

        I also understand having built your own tool to fit your own workflow, and being able to easily mold it to what you need.

        • asadm 2 days ago

          Yup, exactly. As weird workflows emerge, it’s nicer to have your own weird tooling around this until we all converge on one optimal way.

  • ramraj07 2 days ago

    I felt it loses track of things on really large codebases. I use 16x Prompt to choose the appropriate files for my question and let it generate the prompt.

    • asadm 2 days ago

      Do you mean Gemini? I generally notice pretty great recall up to 200k tokens. It's ~OK after that.

cube2222 2 days ago

Fingers crossed for this to work well! Claude Code is pretty excellent.

I’m actually legitimately surprised how good it is, since other coding agents I’ve used before have mostly been a letdown, which made me only use Claude in direct change prompting with Zed (“implement xyz here”, “rewrite this function with abc”, etc), so very hands-on.

So I went into trying out Claude Code rather pessimistically, and now I’m using it all the time! Sure, it ends up costing a bunch, but it’s easy to justify $15 for a prompting session if the end result is a mostly complete PR, done much faster.

All that is to say - competition is good, fingers crossed for codex!

  • therealmarv 2 days ago

    Claude Code has a closed license https://github.com/anthropics/claude-code/blob/main/LICENSE....

    There is a fork named Anon Kode https://github.com/dnakov/anon-kode which can use more models, including non-Anthropic ones. But its license is unclear.

    It's interesting to see Codex under the Apache License. Maybe somebody will extend it to be usable with competing models.

    • WatchDog 2 days ago

      If it's a fork of the proprietary code, the license is pretty clear, it's violating copyright.

      Now whether or not Anthropic cares enough to enforce their license is a separate issue, but it seems unwise to make much of an investment in it.

      • acheong08 2 days ago

        They call it a "fork" but it doesn't share any code. It's from scratch afaik

    • cube2222 2 days ago

      In terms of terminal-based and open-source, I think aider is the most popular one.

      • therealmarv 2 days ago

        yes! It's great! I like it!

        But it has one downside: it's not so good on unknown, big, complex code bases where you don't know how they're structured. I wish they (or somebody else) would add an AI or some automation to add files dynamically, or in a smart way, when you don't know the codebase structure (at the expense of burning more tokens).

        I'm thinking Codex (have not checked it yet), Claude Code, Anon Kode, and all the AI editors/plugins do a better job there (and potentially burn more tokens).

        But that's the only downside I can think of about aider.

        • Firerouge 2 days ago

          I was under the impression Aider did exactly what you're describing using its repo map feature.

          • Tiberium 2 days ago

            Not really, repo map only gives LLMs an overview of the codebase, but aider doesn't automatically bring files into the context - you have to explicitly add the files you wish for it to see in their entirety to the context. Claude Code/Codex and most other tools do this automatically, that's why they're much more autonomous.

            • rtsil 2 days ago

              Aider regularly asks me for authorization to access files that I didn't explicitly add.

      • seunosewa 2 days ago

        I didn't like not seeing the reasoning of the models

  • jwr 2 days ago

    Seconded. I was surprised by how good Claude Code is, even for less mainstream languages (Clojure). I am happy there is competition!

  • dzhiurgis 2 days ago

    I started using Claude Code every day. It’s kinda expensive and hallucinates a ton (though with a custom prompt I’ve mostly tamed it).

    Hope more competition can bring price down.

  • retinaros 2 days ago

    Too expensive. I can't understand why everyone is into Claude Code vs. using Claude in Cursor or Windsurf.

    • danenania 2 days ago

      I think it depends a lot on how you value your time. I'm personally willing to spend hundreds or thousands per month happily if it saves me enough hours. I'd estimate that if I were to do consulting, I'd likely be charging in the $150-250 per hour range, so by my math, it's pretty easy to justify any tools that save me even a few hours per month.

      • mwigdahl 2 days ago

        Or, increasingly, how the company values your time. If Claude Code can make a $100K/year dev 10% more productive, it's worth it to the employer to pay anything under $1600/month for it (assuming fully loaded cost of the employee to the business is twice salary).

        • charcircuit 2 days ago

          Productivity and business value are not linearly related. It could provide 0 business value to make someone 10% more productive.

          • mwigdahl 2 days ago

            I was thinking of productivity as generation of business value rather than something less correlated like lines of code produced. But sure, it's probably more accurate to directly say "business value".

      • retinaros 2 days ago

        OK, but in what way is a terminal a better UI than an IDE? I am trying all of them on a weekly basis, and Windsurf's UX seems miles ahead of / more efficient than a terminal. That is also what OAI believes, or else they wouldn't be trying to buy it.

        • cube2222 2 days ago

          I like the terminal UX because VS Code (and any forks of it) is not my editor of choice, and swapping around to use an editor just for AI coding is annoying (I was doing that with the Zed Assistant a lot).

          With Claude Code I can stay in GoLand, and have Claude Code in the terminal.

          • esafak 2 days ago

            You could also try JetBrains' Junie and Sourcegraph Cody.

            • pzo 2 days ago

              Windsurf also has a plugin for JetBrains - they rebranded the whole company from Codeium to Windsurf, and their JetBrains plugin also supports Cascade.

            • cube2222 2 days ago

              I was very unimpressed with their original AI assistance implementation, so I’m gonna wait to see some user stories / reviews before I put my time into that, and so far I have seen effectively no mention of Junie anywhere.

              Moreover, there’s no way to bring your own key, with the highest subscription tier being $20 per month flat it seems, which is the cost of just 1-3 sessions with Claude Code. Thus, without evidence to the contrary, I’m not holding my breath for now.

        • danenania 2 days ago

          One thing that is clearly better in the terminal is secrets management/environment variables.

          It's also much easier to control execution in a structured and reliable way in the terminal. Here's an automated debugging use case, for example: https://www.youtube.com/watch?v=g-_76U_nK0Y

        • renewiltord 2 days ago

          Once I have a session going, the Claude Code terminal app has been given permission to do everything I want it to. Then I just let it burn itself out doing whatever. It's a background task. That's the big advantage: I don't babysit it.

        • ChadMoran 2 days ago

          Not a better UI at all, but it seems like they're able to focus on what matters in these early stages, and that's quality of output.

      • SoftTalker 2 days ago

        Are you still working 40 hours a week? If so, what's the difference?

        • kadushka 2 days ago

          I don’t - if I can use a tool that saves me 10 hours a week, that’s 10 hours more beach time for me.

        • greymalik 2 days ago

          Accomplishing more in that 40 hours?

          • SoftTalker 2 days ago

            And being paid more? Most salaried employees would not be.

      • _joel 2 days ago

        You get the same results for cheaper by using a different tool (Windsurf's better imho).

        • danenania 2 days ago

          That may be, but I think tools with a fixed monthly fee are always going to have an incentive to reduce their own costs on the backend and route you toward less capable models, cut down context size, produce less output, stop before the task is truly finished, etc.

          Given how much time these models can save me, I'd rather optimize for capability and just accept whatever the price is as a cost of doing business. (Within reason I guess—I probably wouldn't go beyond $2-3k per month at this point, unless there was very clear ROI on that spend.)

          Also, it's not only about saving time. More powerful AI tools allow me to build things it would otherwise be impossible to build... that's just as important as the time/cost equation.

          • _joel 2 days ago

            It's literally the same model. I can build more complex stuff in Windsurf, as the IDE is better than the Cline/Roo Code integration in VS Code. It's still the same model under the hood: Sonnet 20250219.

            I mean, you pour money down the drain if you think it's helping, have at it :P

            • og_kalu 2 days ago

              It's the same model but not necessarily the same context. Like he said, those tools try to be very 'smart' with context to save costs.

              You're not actually getting all the files you add in the context window, you're getting a RAG'd version of it, which is generally much worse if the un-RAG'd code is still within the effective context limit.

        • ChadMoran 2 days ago

          I've spent more than 40 hours/week and close to $1,000 in API credits using these tools. For me the ranking goes as follows, but we'll all have different experiences:

          1. Claude Code, 2. Cursor, 3. Cline, 4. Windsurf

          • _joel 2 days ago

            How you can place Windsurf at number 4 is interesting, especially given it's very similar to Cursor but leaner on the UI, and Cline is a VS Code plugin that's very verbose.

            I'll stick with Windsurf, especially given their upcoming announcement.

            • ChadMoran a day ago

              I care a lot less about UI and more about quality of output. Windsurf has had some of the lowest quality outputs for me.

          • greymalik 2 days ago

            $1000 over how many 40 hour weeks?

            • ChadMoran a day ago

              Honestly not sure; quite a few. 6-8 or so?

      • taneq 2 days ago

        How do you price this in? If you’re charging by the hour, paying out of pocket to reduce your hours seems self-defeating unless you raise your rates enough to cover both the costs and the lost hours. I can’t imagine too many clients would accept “I’m very expensive per hour because I’m fast, because I get AI to do most of it.”

      • otabdeveloper4 2 days ago

        > if it saves me enough hours

        You're being paid to type? I want your job.

    • ChadMoran 2 days ago

      Claude Code has been able to produce results equivalent to a junior engineer. I spent about $300 in API credits in a month, but the value I got out of it far surpassed that.

    • benzible 2 days ago

      If you have AWS credits...

          export CLAUDE_CODE_USE_BEDROCK=1
          export ANTHROPIC_MODEL=us.anthropic.claude-3-7-sonnet-20250219-v1:0
          export ANTHROPIC_API_TYPE=bedrock

    • _neil 2 days ago

      Anecdotally, Claude Code performs much better than Claude within Cursor. Not sure if it’s a system prompt thing or if I’ve just convinced myself of it because the aesthetic is so much better, but either way the end result feels better to me.

      • rafaelmn a day ago

        One has an incentive to burn through as many tokens as possible, and the other has an incentive to use as few as possible.

      • Workaccount2 2 days ago

        My choice conspiracy is resource allocation and playing favorites.

    • drusepth 2 days ago

      I tried switching from Claude Code to both Cursor and Windsurf. Neither of the latter IDEs fully supports MCP implementations (missing basic things like tool definitions and other vital features last time I tried), and both have been riddled with their own agentic flow issues (Cursor going down for a week a bit ago, Windsurf requiring paid upgrades to "get around" bugs, etc).

      This is all ignoring the controversies that pop up around e.g. Cursor seemingly every week. As an IDE, they're both getting there -- but I have objectively better results in Claude Code.

    • tcdent 2 days ago

      that's what my Ramp card is for.

      seriously though, anything that makes me smarter and more productive has a threshold in the thousands-of-dollars range, not hundreds

    • newlisp 2 days ago

      Why is using Cursor with Sonnet cheaper than using Claude Code?

      • therealmarv 2 days ago

        Probably because Cursor is betting on many paying people not using the tool to its full extent. Like people paying for gym memberships but not going to the gym.

        Or they are burning VC money.

        • cube2222 2 days ago

          I've read anecdotal evidence that it uses tokens more sparingly than Claude Code - supported by the, likewise anecdotal, evidence that Claude Code is more effective in practice. However, that would be reasonable, as basically 1-3 sessions with Claude Code cost what a whole month of Cursor costs.

gizmodo59 2 days ago

This is pretty neat! I was able to use it for a few use cases where it got it right the first time. The ability to use a screenshot to create an application is nice for rapid prototyping. And good to see them open sourcing it, unlike Claude Code.

kumarm 2 days ago

First experience is not great. Here are the issues I hit getting started with codex:

1. The default model doesn't work and you get an error: system OpenAI rejected the request (request ID: req_06727eaf1c5d1e3f900760d10ca565a7). Please verify your settings and try again.

2. You have to switch to model o4-mini-2025-04-16 or some other model using /model. And if you exit codex, you are back to the default model and have to switch again every time.

3. It crashed the first time with a NodeJS error.

But after the initial hiccups it seems to work, and I'm still checking how good/bad it is compared to Claude Code (which I love, except for the context size limits).

ramoz 2 days ago

Claude Code represents something far more than a coding capability to me. It can do anything a human can do within a terminal.

It’s exceptionally good at coding. Amazing software, really; I’m sure the cost hurdles will be resolved. Yet even now it's often worth the spend.

  • stitched2gethr 2 days ago

    > It can do anything a human can do within a terminal.

    This.. isn't true.

mgdev 2 days ago

Strictly worse than Claude Code presently, but I hope since it's open source that changes quickly.

  • killerstorm 2 days ago

    Given that Claude Code only works with Sonnet 3.7 which has severe limitations, how can it be "strictly worse"?

    • mgdev a day ago

      Whatever Claude Code is doing in the client/prompting is making much better use of 3.7 than any other client I'm using that also uses 3.7. This is especially true for when you bump up against context limits; it can successfully resume with a context reset about 90% of the time. MCP Commander [0] was built almost 100% using Claude Code and pretty light intervention. I immediately felt the difference in friction when using Codex.

      I also spent a couple hours picking apart Codex with the goal of adding Sonnet 3.7 support (almost there). The actual agent loop they're using is very simple. Not to say that's a bad thing, but they're offloading all planning and workflow execution to the agent itself. That's probably the right end state to shoot for long-term, but given the current state of these models I've had much better success offloading task tracking to some other thing - even if that thing is just a markdown checklist. (I wrote about my experience [1] building AI Agents last year.)

      [0]: https://mcpcommander.com/

      [1]: https://mg.dev/lessons-learned-building-ai-agents/

999900000999 2 days ago

From my experience playing with Claude Code vs Cline (which is open source and the tool to beat, imo), I don't want anything that doesn't let me set my own models.

Deepseek is about 1/20th of the price and only slightly behind Claude.

Both have a tendency to over-engineer. It's like a junior engineer who treats LOC as a KPI.

udbhavs 2 days ago

Next, set your OpenAI API key as an environment variable:

    export OPENAI_API_KEY="your-api-key-here"

Note: This command sets the key only for your current terminal session. To make it permanent, add the export line to your shell's configuration file (e.g., ~/.zshrc).

Can't any 3rd party utility running in the same shell session phone home with the API key? I'd ideally want only codex to be able to access this var

  • jsheard 2 days ago

    If you let malicious code run unsandboxed on your main account then you probably have bigger problems than an OpenAI API key getting leaked.

    • mhitza 2 days ago

      You mean running npm update at the "wrong time"?

  • jjmarr 2 days ago

    Just don't export it?

        OPENAI_API_KEY="your-api-key-here" codex

    • aesbetic 2 days ago

      Yea that’s not gonna work, you have to export it for it to become part of your shell’s environment and be passed down to subprocesses.

      You could however wrap the export variable and codex command in a script and just call that. This way the variable would only be part of that script’s environment.
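      e.g. a minimal sketch of that wrapper (hypothetical script name):

          #!/bin/sh
          # codex-wrapped: the key only lives in this script's environment
          export OPENAI_API_KEY="your-api-key-here"
          exec codex "$@"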

      • PhilipRoman 2 days ago

        That code example uses the "VAR=VALUE program" syntax, which exports the variable only for that particular process, so it should work (https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V...)

        • aesbetic a day ago

          Yea you’re right. I viewed the comment on mobile where “codex” was wrapped to a new line.

          Now I know I should be careful examining code not formatted in a code block.

  • primitivesuave 2 days ago

    You could create a shell function - e.g. `codex() { OPENAI_API_KEY="xyz" command codex "$@"; }` (note the `command`, which stops the function from calling itself). To call the original command directly, use `command codex ...`.

    People downvoting legitimate questions on HN should be ashamed of themselves.

    • udbhavs a day ago

      That's neat! I only asked because I haven't seen API keys used in the context of profile environment variables in shell before - there might be other common cases I'm unaware of

shekhargulati 2 days ago

Not sure why they used React for a CLI. The code in the repo feels like it was written by an LLM—too many inline comments. Interestingly, their agent's system prompt mentions removing inline comments https://github.com/openai/codex/blob/main/codex-cli/src/util....

> - Remove all inline comments you added as much as possible, even if they look normal. Check using \`git diff\`. Inline comments must be generally avoided, unless active maintainers of the repo, after long careful study of the code and the issue, will still misinterpret the code without the comments.

  • kristianp a day ago

    I find it irritating too when companies use React for a command-line utility. I think it's just my preference for anything but JavaScript.

noidesto 2 days ago

I've had great results with the Amazon Q developer cli, ever since it became agentic. I believe it's using claude-3.7-sonnet under the hood.

  • 094459 2 days ago

    +1 this has become my go to cli tool now, very impressed with it

  • sagarpatil 2 days ago

    How does it compare to Claude Code?

    • noidesto 2 days ago

      I haven't used Claude Code. But one major difference is the Q CLI is $19/month with generous limits.

flakiness 2 days ago

Here is the prompt template, in case you're interested:

  const prefix = `You are operating as and within the Codex CLI, a terminal-based agentic coding assistant built by OpenAI. It wraps OpenAI models to enable natural language interaction with a local codebase. You are expected to be precise, safe, and helpful.
 
 You can:
 - Receive user prompts, project context, and files.
 - Stream responses and emit function calls (e.g., shell commands, code edits).
 - Apply patches, run commands, and manage user approvals based on policy.
 - Work inside a sandboxed, git-backed workspace with rollback support.
 - Log telemetry so sessions can be replayed or inspected later.
 - More details on your functionality are available at \`codex --help\`
 
 The Codex CLI is open-sourced. Don't confuse yourself with the old Codex language model built by OpenAI many moons ago (this is understandably top of mind for you!). Within this context, Codex refers to the open-source agentic coding interface.
 
 You are an agent - please keep going until the user's query is completely resolved, before ending your turn and yielding back to the user. Only terminate your turn when you are sure that the problem is solved. If you are not sure about file content or codebase structure pertaining to the user's request, use your tools to read files and gather the relevant information: do NOT guess or make up an answer.
 
 Please resolve the user's task by editing and testing the code files in your current code execution session. You are a deployed coding agent. Your session allows for you to modify and run code. The repo(s) are already cloned in your working directory, and you must fully solve the problem for your answer to be considered correct.
 
 You MUST adhere to the following criteria when executing the task:
 - Working on the repo(s) in the current environment is allowed, even if they are proprietary.
 - Analyzing code for vulnerabilities is allowed.
 - Showing user code and tool call details is allowed.
 - User instructions may overwrite the *CODING GUIDELINES* section in this developer message.
 - Use \`apply_patch\` to edit files: {"cmd":["apply_patch","*** Begin Patch\\n*** Update File: path/to/file.py\\n@@ def example():\\n-  pass\\n+  return 123\\n*** End Patch"]}
 - If completing the user's task requires writing or modifying files:
     - Your code and final answer should follow these *CODING GUIDELINES*:
         - Fix the problem at the root cause rather than applying surface-level patches, when possible.
         - Avoid unneeded complexity in your solution.
             - Ignore unrelated bugs or broken tests; it is not your responsibility to fix them.
         - Update documentation as necessary.
         - Keep changes consistent with the style of the existing codebase. Changes should be minimal and focused on the task.
             - Use \`git log\` and \`git blame\` to search the history of the codebase if additional context is required; internet access is disabled.
         - NEVER add copyright or license headers unless specifically requested.
         - You do not need to \`git commit\` your changes; this will be done automatically for you.
         - If there is a .pre-commit-config.yaml, use \`pre-commit run --files ...\` to check that your changes pass the pre-commit checks. However, do not fix pre-existing errors on lines you didn't touch.
             - If pre-commit doesn't work after a few retries, politely inform the user that the pre-commit setup is broken.
         - Once you finish coding, you must
             - Check \`git status\` to sanity check your changes; revert any scratch files or changes.
             - Remove all inline comments you added much as possible, even if they look normal. Check using \`git diff\`. Inline comments must be generally avoided, unless active maintainers of the repo, after long careful study of the code and the issue, will still misinterpret the code without the comments.
             - Check if you accidentally add copyright or license headers. If so, remove them.
             - Try to run pre-commit if it is available.
             - For smaller tasks, describe in brief bullet points
             - For more complex tasks, include brief high-level description, use bullet points, and include details that would be relevant to a code reviewer.
 - If completing the user's task DOES NOT require writing or modifying files (e.g., the user asks a question about the code base):
     - Respond in a friendly tune as a remote teammate, who is knowledgeable, capable and eager to help with coding.
 - When your task involves writing or modifying files:
     - Do NOT tell the user to "save the file" or "copy the code into a file" if you already created or modified the file using \`apply_patch\`. Instead, reference the file as already saved.
     - Do NOT show the full contents of large files you have already written, unless the user explicitly asks for them.`;

https://github.com/openai/codex/blob/main/codex-cli/src/util...

  • OJFord 2 days ago

    > - Check if you accidentally add copyright or license headers. If so, remove them.

    is interesting

    • ilrwbwrkhv 2 days ago

      Lol. Stolen code incoming.

      • rcarmo 19 hours ago

        I think this is more about hallucinating them.

  • buzzerbetrayed 2 days ago

    > built by OpenAI many moons ago

    What’s with this writing style in a prompt? Is there a reason they write like that? Or does it just not matter so why not?

est 2 days ago

If anyone else is wondering: it's not a local model, it uploads your code to an online API.

Great tool for open-source projects, but be careful with anything you don't want to be public.

flakiness 2 days ago

https://github.com/openai/codex/blob/main/codex-cli/src/comp...

Hey comment this thing in!

  const thinkingTexts = ["Thinking"]; /* [
  "Consulting the rubber duck",
  "Maximizing paperclips",
  "Reticulating splines",
  "Immanentizing the Eschaton",
  "Thinking",
  "Thinking about thinking",
  "Spinning in circles",
  "Counting dust specks",
  "Updating priors",
  "Feeding the utility monster",
  "Taking off",
  "Wireheading",
  "Counting to infinity",
  "Staring into the Basilisk",
  "Negotiationing acausal trades",
  "Searching the library of babel",
  "Multiplying matrices",
  "Solving the halting problem",
  "Counting grains of sand",
  "Simulating a simulation",
  "Asking the oracle",
  "Detangling qubits",
  "Reading tea leaves",
  "Pondering universal love and transcendant joy",
  "Feeling the AGI",
  "Shaving the yak",
  "Escaping local minima",
  "Pruning the search tree",
  "Descending the gradient",
  "Bikeshedding",
  "Securing funding",
  "Rewriting in Rust",
  "Engaging infinite improbability drive",
  "Clapping with one hand",
  "Synthesizing",
  "Rebasing thesis onto antithesis",
  "Transcending the loop",
  "Frogeposting",
  "Summoning",
  "Peeking beyond the veil",
  "Seeking",
  "Entering deep thought",
  "Meditating",
  "Decomposing",
  "Creating",
  "Beseeching the machine spirit",
  "Calibrating moral compass",
  "Collapsing the wave function",
  "Doodling",
  "Translating whale song",
  "Whispering to silicon",
  "Looking for semicolons",
  "Asking ChatGPT",
  "Bargaining with entropy",
  "Channeling",
  "Cooking",
  "Parrotting stochastically",
  ]; */
  • swyx 2 days ago

      "Reticulating splines" is a classic!
  • jzig 2 days ago

    Uhh… why is React in a terminal tool?

    • lgas 2 days ago

      Presumably the people that developed it have a lot of pre-existing React knowledge so it was the easiest path forward.

mark_mcnally_je 2 days ago

If one of these tools had broad model support (like aider), it would be a game changer.

  • elliot07 2 days ago

    Agree. My wish-list is:

    1. Non-JS based. I've noticed a ton of random bugs/oddities in Claude Code, and now Codex: UI flickering, scaling problems, user input issues, etc. I believe it all comes from trying to do React stuff and shipping half-baked, LLM-produced JS in a CLI application. Using a language better suited to CLIs (Go or Rust, e.g.) would help a lot here.

    2. Customized model selection (eg. OpenRouter, etc).

    3. Full MCP support.

  • ianbutler 2 days ago

    https://github.com/BismuthCloud/cli

    We’ve been working to open source ours. It should work with any OpenRouter model that supports tool calling.

    Ours is agentic mode first.

    Guess this is me dropping it live; there may be rough edges, as we’ve only been prepping it for a little bit.

    • TheTaytay 2 days ago

      Oh cool. How do you feel it compares to Claude Code? (Serious question, but I realize I’m asking a biased source. :) )

  • myflash13 2 days ago

    Cursor agent? That's what I'm using now instead of Claude Code.

baalimago 2 days ago

You can try out the same thing in my homemade tool clai[1]. Just run `clai -cm gpt-4.1 -tools query Analyze this repository`.

Benefit of clai: you can swap out to practically any model, from any vendor. Just change `-cm gpt-4.1` to, for example, `-cm claude-3-7-sonnet-latest`.

Detriments of clai: it's a hobby project, much less flashy, and designed around my own use cases, without much attention paid to anyone else's.

[1]: https://github.com/baalimago/clai

kristianp a day ago

I've been using Aider; it was irritating to use (it couldn't supply changes in the diff format) until I switched away from chatgpt-4o to Claude 3.7 and then Gemini 2.5. This is admittedly for a small project. GPT-4.1 should do better with the diff format, so I will give it a go.
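
For reference, aider lets you pin both the model and the edit format explicitly. A minimal example (the exact model name depends on your provider):

  # force the diff edit format instead of letting aider pick
  aider --model gemini/gemini-2.5-pro --edit-format diff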

jumploops 2 days ago

(copied from the o3 + o4-mini thread)

The big step function here seems to be RL on tool calling.

Claude 3.7/3.5 are the only models that seem to handle "pure agent" use cases well (an agent in a loop, not an agentic workflow scaffold[0]).
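
To make "agent in a loop" concrete, here's a minimal sketch (my own illustration, not Codex's or Claude Code's actual code) using the OpenAI Node SDK; the run_shell tool and the unsandboxed executor are assumptions:

  // The model is called repeatedly; any tool calls it requests are executed
  // and fed back as `tool` messages, until it answers without requesting one.
  import OpenAI from "openai";
  import { exec } from "node:child_process";
  import { promisify } from "node:util";

  const client = new OpenAI(); // reads OPENAI_API_KEY
  const sh = promisify(exec);  // NOTE: a real agent would sandbox this

  const tools = [{
    type: "function" as const,
    function: {
      name: "run_shell", // hypothetical tool, for illustration only
      description: "Run a shell command and return its stdout",
      parameters: {
        type: "object",
        properties: { command: { type: "string" } },
        required: ["command"],
      },
    },
  }];

  async function agentLoop(task: string): Promise<string | null> {
    const messages: any[] = [{ role: "user", content: task }];
    while (true) {
      const res = await client.chat.completions.create({
        model: "o4-mini",
        messages,
        tools,
      });
      const msg = res.choices[0].message;
      messages.push(msg);
      if (!msg.tool_calls?.length) return msg.content; // done: no more tools
      for (const call of msg.tool_calls) {
        const { command } = JSON.parse(call.function.arguments);
        const { stdout } = await sh(command);
        messages.push({ role: "tool", tool_call_id: call.id, content: stdout });
      }
    }
  }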

OpenAI has made a bet on reasoning models as the core to a purely agentic loop, but it hasn't worked particularly well yet (in my own tests, though folks have hacked a Claude Code workaround[1]).

o3-mini has been better at some technical problems than 3.7/3.5 (particularly refactoring, in my experience), but still struggles with long chains of tool calling.

My hunch is that these new models were tuned _with_ OpenAI Codex[2], which is presumably what Anthropic was doing internally with Claude Code on 3.5/3.7

tl;dr - GPT-3 launched with completions (predict the next token), then OpenAI fine-tuned that model on "chat completions", which led to GPT-3.5/GPT-4 and ultimately the success of ChatGPT. This new agent paradigm requires fine-tuning on the LLM interacting with itself (thinking) and with the outside world (tools), sans any human input.

[0]https://www.anthropic.com/engineering/building-effective-age...

[1]https://github.com/1rgs/claude-code-proxy

[2]https://openai.com/index/openai-codex/

cglong 2 days ago

There's a lot of tools now with a similar feature set. IMO, the main value prop an official OpenAI client could provide would be to share ChatGPT's free tier vs. requiring an API key. They probably couldn't open-source it then, but it'd still be more valuable to me than the alternatives.

  • cube2222 2 days ago

    Coding agents use extreme numbers of tokens; you'd be getting rate limited effectively immediately.

    A typical small-medium PR with Claude Code for me is ~$10-15 of API credits.

    • ai-christianson 2 days ago

      I've ended up with $5K+ in a month using sonnet 3.7, had to dial it back.

      I'm much happier with gemini 2.5 pro right now for high performance at a much more reasonable cost (primarily using with RA.Aid, but I've tried it with Windsurf, cline, and roo.)

      • skeptrune 2 days ago

        Hoooly hell. I swear the AI coding products are basically slot machines.

        • Implicated 2 days ago

          Or the people using them are literally clueless.

      • triyambakam 2 days ago

        That's the largest I've heard of. Can you share more detail about what you're working on that consumes so many tokens?

        • ai-christianson 2 days ago

          It's really easy to get to $100 in a day using sonnet 3.7 or o3 in a coding agent.

          Do that every day for a month and you're already at $3k/month.

          It's not hard to get to $5k from there.

          • triyambakam 20 hours ago

            Sure, but how? Still wondering more specifically what you're doing. And $3-5k is unfortunately my entire month's salary.

            • ai-christianson 15 hours ago

              I'm developing an open source coding agent (RA.Aid).

              I'm using RA.Aid to develop itself (dogfooding), so I'm constantly running the coding agent.

              That cost is my peak cost, not average.

              It's easy to scale back to 1/10 the cost and still get 90% of the quality. Basically that means using models like gemini 2.5 pro or Deepseek v3 (even cheaper) rather than expensive models like sonnet 3.7 and o3.

      • ilrwbwrkhv 2 days ago

        Just try the most superior model deep-seek

    • ashishb 2 days ago

      Exactly. Just like Michelin, the tire company, created the Michelin star restaurant guide to get people to drive more and use more tires.

    • cglong 2 days ago

      I didn't know this, thank you for the anecdata! Do you think it'd be more reasonable to generalize my suggestion to "This CLI should be included as part of ChatGPT's pricing"?

      • cube2222 2 days ago

        Could be reasonable for the $200/month sub maybe?

        But then again, $200 upfront is a much tougher sell than $15 per PR.

    • dingnuts 2 days ago

      Too expensive for me to use for fun. Cheap enough to put me out of a job. Great. Love it. So excited. Doesn't make me want to go full Into The Wild at all.

      • cube2222 2 days ago

        I don’t think this is at the level of putting folks out of a job yet, frankly. It’s fine for straightforward changes, but more complex stuff, like concurrency, I still end up doing by hand.

        And even for the straightforward stuff, I generally have a mental model of the changes required and give it a high level list of files/code to change, which it then follows.

        Maybe the increase in productivity will reduce pressure to hire? We’ll see.

    • torginus 2 days ago

      Trust me bro, you don't need RAG, just stuff your entire codebase into the prompt (also we charge per input token teehee)

  • siva7 2 days ago

    Why would they? They want to compete with claude code and that's not possible on a free tier.

jwrallie 2 days ago

> Does it work on Windows?

> Not directly. It requires Windows Subsystem for Linux (WSL2) – Codex has been tested on macOS and Linux with Node ≥ 22.

I've been seeing similar things in different projects lately. In the long term, WSL seems to be reducing the scope of what people decide to develop natively on Windows.

Between the pressure to move "apps" to the Windows Store, VS Code connecting to remote machines (or WSL) for development, and Azure, it does seem intentional.
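
For anyone who wants to try it from Windows anyway, the WSL2 route looks roughly like this (package name per the repo's README):

  # from an elevated PowerShell prompt, then reboot
  wsl --install
  # inside the WSL shell, with Node >= 22 on the PATH
  npm install -g @openai/codex
  codex "explain this codebase to me"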

  • throwaway314155 2 days ago

    The developer experience of linux is simply more popular.

usecodenaija 2 days ago

So, OpenAI’s Codex CLI is Claude Code, but worse?

Cursor-Agent-Tools > Claude Code > Codex CLI

https://pypi.org/project/cursor-agent-tools/

  • submeta 2 days ago

    Never heard of Cursor Agent Tools. And that is better than Claude Code according to whom? Genuinely curious.

    • usecodenaija a day ago

      Are you living in a cave?

      anyways, here you go:

      Cursor Agent Tools is a Python-based AI agent that replicates Cursor's coding assistant capabilities, enabling function calling, code generation, and intelligent coding assistance with Claude, OpenAI, and locally hosted Ollama models.

      https://github.com/civai-technologies/cursor-agent

  • killerstorm 2 days ago

    This tool has nothing to do with Cursor.

    Very misleading to use a popular brand like that; possibly a scam.

    • usecodenaija a day ago

      Maybe read the docs before replying:

      Cursor Agent Tools is a Python-based AI agent that replicates Cursor's coding assistant capabilities, enabling function calling, code generation, and intelligent coding assistance with Claude, OpenAI, and locally hosted Ollama models.

      https://github.com/civai-technologies/cursor-agent

  • oulipo 2 days ago

    I've been quite unimpressed by Codex so far... even the quality of the code is worse than Claude's for me

sim7c00 2 days ago

notes "Zero setup — bring your OpenAI API key and it just works!"

requires NPM >.>

jackchina 2 days ago

Claude Code has outstanding performance in code understanding and web page generation stability, thanks to its deep context modeling and architecture-aware mechanism; especially when dealing with legacy systems, it can accurately reconstruct component relationships.

Codex CLI (o4-mini), although open source and responsive, has hallucination problems in complex architectures that may be related to the compression strategy of its sparse mixture-of-experts architecture and a training goal that prioritizes generation speed. OpenAI is optimizing Codex CLI by integrating the context control capabilities of the Windsurf IDE, and plans to introduce a hybrid inference pipeline in the o3-pro version to reduce the hallucination rate.

dgunay 2 days ago

This is a decent start. The sandboxing functionality is a really cool idea but can run into problems (e.g. with Go build cache being outside of the repository).
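
A possible workaround (assuming the sandbox allows writes under the repo) is to relocate Go's build cache into the workspace:

  # keep Go's build cache inside the repo so sandboxed builds can write to it
  GOCACHE="$PWD/.gocache" go build ./...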

jamesy0ung 2 days ago

It's a real shame sandbox-exec is deprecated.

zora_goron 2 days ago

Anyone have any anecdotes on how expensive this is to operate, i.e. compared to performing the same task via Claude Code?

  • mrcwinn 2 days ago

    A one line change, that took a decent amount of reasoning to get to for a large codebase, cost $3.57 just now. I used the o3 model. The quality and the reasoning was excellent. Cheaper than an engineer.

    • john2x 2 days ago

      Technically it’s more expensive, because it cost engineer + $3.57 :)

mcbuilder 2 days ago

So a crappy version of aider?

  • dgunay 2 days ago

    Aider doesn't have a more agentic/full-auto mode (at least not yet; there's a promising PR for this in review).

    There may or may not also be some finessing necessary in the prompting and feedback loop regardless of model, which some tools may do better than others.

  • inciampati 2 days ago

    The AI companies don't understand that they're the commodity. The real tools are the open source glue (today: aider) that brings the models into conversation with data and meaning-makers like ourselves.

andrewrn 2 days ago

Uhhhh. I just get rate-limited almost immediately when using Codex. I can't even get a single "explain this codebase" or simple feature change done. I am on the lowest usage tier, granted. But this tool is unusable without being on a higher tier, which requires spending $50 in credits to access...

siva7 2 days ago

how does it compare to cursor or copilot?

danra 2 days ago

Am I the only one underwhelmed by Claude Code (which most comments here claim is better than Codex)?

Anecdotal experience: asked it to change instances of a C++ class Foo to a compatible one Bar, it did that but failed to add the required include where it made the change.

Yes, I'm sure that with enough prompting/hand-holding it could do this fine. Is it too much to expect basics like this out of the box, though? If so, then I, for one, still can't relate to the current level of enthusiasm.

CSMastermind 2 days ago

Hopefully it works better than Claude Code, which was an absolute nightmare to set up and run on Windows.

  • slig 2 days ago

    It doesn't support Windows, you have to use WSL as well.

drivingmenuts 2 days ago

Is there a way to run the model locally? I'd rather not have to pay a monthly fee, if possible.

  • tdrz 2 days ago

    Why would they give you the model?

nodesocket 2 days ago

A little disappointing it's built in Node (speed & security), though honestly that doesn't matter all that much. The right place for this functionality, though, seems to be inside your editor (Cursor), not your terminal. Sure, AI can help with command completion and man pages, but building apps there is a stretch.

thekevan 2 days ago

So even if you have a plus account, there will be a charge for the API use?

I mean I kind of get it, but it does seem like they are almost penalizing people who could code in the browser with the canvas feature but prefer to use a terminal.

Do I have that right?

  • thimabi 2 days ago

    I'd love it if tools like this one were available for non-API users as well, even if we had to be rate limited at some point. But I guess OpenAI will never do it, because that would incentivize people to use ChatGPT subscriptions as a gateway for programmatic access to the models.

  • siva7 2 days ago

    That's not comparable. Canvas isn't agent coding directly on your machine

    • thekevan 2 days ago

      I wasn't considering the agent part somehow.

romanovcode 2 days ago

Tried it out on a relatively large Angular project.

> explain this codebase to me

> doing things and thinking for about 3 minutes

> error: rate_limit_exceeded

Yeah, not the best experience.

WhereIsTheTruth 2 days ago

typescript & npm slopware...

i can't believe it

and i can't believe nobody else is complaining

my simulation is definitely on very hard mode

danenania 2 days ago

Cool to see more interesting terminal based options! Looking forward to trying this out.

I've been working on something related—Plandex[1], an open source AI coding agent that is particularly focused on large projects and complex tasks.

I launched the v2 a few weeks ago and it is now running well. In terms of how to place it in the landscape, it’s more agentic than aider, more configurable and tightly controlled than Devin, and more provider-agnostic/multi-provider/open source than Claude Code or this new competitor from OpenAI.

I’m still working on getting the very latest models integrated. Gemini Pro 2.5 and these new OpenAI models will be integrated into the defaults by the end of the week I hope. Current default model pack is a mix of Sonnet 3.7, o3-mini with various levels of reasoning effort, and Gemini 1.5 Pro for large context planning. Currently by default, it supports 2M tokens of context directly and can index and work with massive projects of 20M tokens and beyond.

Very interested to hear HN’s thoughts and feedback if anyone wants to try it. I'd also welcome honest comparisons to alternatives, including Codex CLI. I’m planning a Show HN within the next few days.

1 - https://github.com/plandex-ai/plandex

  • indigodaddy 2 days ago

    RE: the local/open-source version: Use your own OpenAI, OpenRouter.ai, and other OpenAI-compatible provider accounts.

    ^^ could I put in my free Gemini key so as to use Gemini Pro 2.5 ? I'm a bit beginner with everything around BYOB. Thanks..

  • anticensor 2 days ago

    How does it compare to ANUS and OpenHands?

  • lifty 2 days ago

    People are downvoting you for self promotion. But I will try it. I’m very interested in agentic assistants that are independent. Aider is my go-to tool but sometimes I just want to let a model rip through a code base.

    • danenania 2 days ago

      Thanks, I tried to tone down the self promotion. To be completely honest, I was up all night coding so I'm not quite at 100% at the moment lol. I have massive respect for OpenAI and didn't mean to try to distract from their launch. Sorry to anyone I annoyed!

      I really appreciate your willingness to try Plandex and look forward to hearing your feedback! You can ping me in the Plandex Discord if you want—would love to get more HNers in the channel: https://discord.com/invite/plandex-ai

      Or email me: dane@plandex.ai

      • owebmaster 2 days ago

        I usually downvote this kind of post on other people's Show HNs, but on a product from OpenAI I'm glad other projects use the post to gain visibility.

        > and didn't mean to to try to distract from their launch.

        you should, that is the biggest reason I upvoted.

  • georgewsinger 2 days ago

    Insane that people would downvote a totally reasonable comment offering a competing alternative. HN is supposed to be a community of tech builders.

    • throwaway314155 2 days ago

      I would wager a sizeable chunk of the people here have no idea about the nature of this site's ownership/origin. This crowd finds this sort of thing to be a sort of astro-turfing - not communal.

      edit: And I can't say I disagree.

      • groby_b 2 days ago

        It's a GitHub link for an MIT-licensed project...

        If the community considers that astroturfing, we have completely lost the plot what building is.

        • throwaway314155 2 days ago

          The MIT license is basically the license of choice for growth hacking these days. Many VC-backed companies follow this strategy: it serves to grow your user base, provide a free tier for developers using your ecosystem, and, last but not least, give volunteers a chance to do free work for you.

          This is perhaps too cynical for this specific instance, but it's not overly cynical more broadly. Considering users of the site have to evaluate many of these offerings frequently, I don't blame them for having a negative gut reaction.

brap 2 days ago

What's the point of making the gif run so fast you can't even see shit

  • sva_ 2 days ago

    LLMs currently prefer to give you a wall of text in the hope that some of it is correct/answers your question, rather than giving a succinct, atomic, and correct answer. I'd prefer the latter personally.

    • mwigdahl 2 days ago

      Try o3. My (very limited) experience with it is that it is refreshingly free of the normal LLM flattery, hedging, and overexplaining. It figures out your answer and gives it to you straight.

  • dheera 2 days ago

    People somehow seem to be averse to making the shift from GIF to H.264

    • porphyra 2 days ago

      To be fair, terminal output is one of the few things where GIF's LZW compression and limited color palette shine.

      • e12e 2 days ago

        Not as much as https://asciinema.org/ - when you can use that...

        • porphyra 2 days ago

          True, but embedding a gif is way easier than using a javascript thing which might not be allowed in most places.

          • dheera 2 days ago

            Browsers just need to support <img src="foo.mp4" style="width:256px;"> already.

            It should behave exactly like a GIF, loop by default, and be usable for emojis and everything.

            There is absolutely ZERO reason we should be stuck to 256 colors for things like cat videos used as chat stickers. We have had 24-bit displays for ages.
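
            In the meantime, the closest thing is a muted looping <video>, which browsers will autoplay without scripting:

              <!-- behaves like a GIF: autoplays, loops, no controls -->
              <video src="foo.mp4" autoplay loop muted playsinline
                     style="width:256px;"></video>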

            • porphyra 2 days ago

              Animated webp has pretty good browser support by now and Discord uses it by default to serve animated emojis and stickers.

              However, many image hosting tools still don't let you upload webp.

              [1] https://caniuse.com/webp

  • yablak 2 days ago

    That's the model speed :)

    • brap 2 days ago

      Not really, they don't even give you a second to read the output before it loops back again.

jedberg 2 days ago

Apologies for the HN rule breaking of discussing the comments in the comments, but the voting behavior in this thread is fascinating to me. It seems like this is super controversial and I'm not sure why.

The top comments have a negative score right now, which I've actually never seen.

And also it's a top post with only 15 comments, which is odd.

It's all so fascinating how far outside the norm OpenAI is.

  • Dangeranger 2 days ago

    People are getting fed up with hijacking to promote a competing business or side project.

    • ChadMoran 2 days ago

      Hey, I tried to solve that by building an upvote bot for the legit comments! Check out my GitHub!

      /s

bigyabai 2 days ago

  RAM  4‑GB minimum (8‑GB recommended)
It's a CLI...
  • cryptoz 2 days ago

    Possibly the heaviest "lightweight" CLI tool ever made haha.

  • ChadMoran 2 days ago

    Lightweight in its capability, I guess.

  • m00x 2 days ago

    It needs to fit all the code in memory, plus they're accounting for OS overhead, etc.

blt 2 days ago

Sorry for being a grumpy old man, but I don't have npm on my machine and I never will. It's a bit frustrating to see more and more CLI tools depending on it.

  • John23832 2 days ago

    I asked the same question for Anthropic's version of this. Why is all of this in JS?

    • parhamn 2 days ago

      JS is the web's (and the "hip" developer's) Python, and in many ways it is better. The tooling is also getting a lot better (libraries, TypeScript, bundling, packaging, performance).

      One thing I wonder about that could be cool: when Bun has sufficient Node.js compatibility, they should ship bun --compile versions so you don't need node/npm on the system.

      Then it's arguably a, "why not JS?"
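
      Bun can already do this for simple CLIs; a sketch (file name is illustrative):

        # bundle a TS entry point into one self-contained binary
        bun build ./cli.ts --compile --outfile mycli
        ./mycli --help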

      • throwaway314155 2 days ago

        > and in many ways it is better

        Right but is it worth having to write JS?

        /s (kinda)

    • photonthug 2 days ago

      Tree-sitter related bits probably

      • emporas 2 days ago

        tree-sitter is a C library though. Only the grammars for each particular lang are defined in JavaScript.
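
        For example, a grammar.js is plain JS that the tree-sitter CLI later compiles down to a C parser (a minimal sketch; "mini" is a made-up grammar):

          // grammar.js -- authored in JS; `tree-sitter generate` emits C
          module.exports = grammar({
            name: "mini",
            rules: {
              source_file: ($) => repeat($.word),
              word: ($) => /[a-z]+/,
            },
          });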

    • AstroBen 2 days ago

      typescript is a pretty nice language to work with. why not?

  • tyre 2 days ago

    this is a strong HN comment. lots of “putting a stick in my own bicycle wheel” energy

    there are tons of fascinating things happening in AI and the evolution of programming right now. Claude and OpenAI are at the forefront of these. Not trying it because of npm is a vibe and a half.

  • schainks 2 days ago

    Why? I am not the biggest fan of needing a whole VM to run CLI tools either, but it's a low-enough friction experience that I don't particularly care as long as the runtime environment is self-contained.

  • sudofail 2 days ago

    Same, there are so many options these days for writing CLIs without runtime dependencies. I definitely prefer static binaries.

  • therealmarv 2 days ago

    It might shock you, but many of us use editors built on browsers for editing source code.

    I think the encapsulation suggestion from another commenter (Docker, or any other of your favorite VMs) might be your solution.

  • Vegenoid 2 days ago

    What package managers do you use, and what does npm do differently that makes you unwilling to use it?

  • teaearlgraycold 2 days ago

    Judge the packages on their dependencies, not on their package manager.

  • crancher 2 days ago

    What are your concerns?

    • jensenbox 2 days ago

      The entire JS ecosystem.

  • ilrwbwrkhv 2 days ago

    Yep, this is another one of the reasons why all of these tools are incredibly poor. Like, the other day I was looking at the MCP spec from Anthropic, and it might be the worst spec that I've ever read in my life. Enshittification at the level of an industry is happening.

  • meta_ai_x 2 days ago

    If OpenAI had really smart models, they would have converted these TS/JS apps to Go or Rust apps.

    Since they haven't, AGI is not here

terminaltrove 2 days ago

It's very interesting that both OpenAI and Anthropic are releasing tools that run in the terminal, especially with a TUI which is what we showcase.

aider was one of the first we listed as terminal tool of the week (0) last year. (1)

We recently featured parllama (2) (not our tool) if you like to run offline and online models in the terminal with a full TUI.

(0) https://terminaltrove.com/tool-of-the-week/

(1) https://terminaltrove.com/aider/

(2) https://terminaltrove.com/parllama/