Which LLM Chatbot is best at writing AI code? Surprise!

Using AI to Make AI: Claude 3.5 vs. Gemini 1.5 vs. ChatGPT 4-o vs. GitHub Copilot

I did the head-to-head comparison so you don’t have to.


Over the past few weeks I’ve been developing a number of AI projects to train various models (more on the specific models later). These include time-series forecasting, computer vision, and LLM-based applications.

In these experiments, I left 98%+ of the AI coding to the LLM. I strategically write prompts, copy-paste code, let it train/predict for a few hours, and copy-paste results (fixing errors in between).

This conclusion surprised even me: Claude 3.5 Sonnet is the clear winner.

Sure, Claude 3.5 advertises slightly higher scores on many coding benchmarks, but having worked in DirectX for many years, I’ve always been skeptical of small differences in benchmark scores. This is a different story. Even when Google made its surprise announcement of the all-powerful Gemini 1.5 Pro, claiming it outperformed all other LLMs, it still lost slightly to Claude on coding.

Here’s the final breakdown:

| Feature | Claude 3.5 Sonnet | ChatGPT 4-o | Gemini 1.5 |
| --- | --- | --- | --- |
| Knowledge of AI Algorithms and Libraries | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ |
| Code Reliability and Quality | ⭐⭐ | | |
| Ability to Retain Context | ⭐⭐ | | ⭐⭐ |
| Ability to Run Code | | | |
| IDE Integration | | | |
| Web Interface Performance | | ⭐⭐⭐ | ⭐⭐⭐ |

You’ll notice I didn’t include GitHub Copilot in this table… that’s because its performance was so bad in initial tests that I didn’t even try it across further scenarios.

Here are some of the machine learning libraries and models used throughout these various experiments:

  • YOLOv8
  • H2O
  • ChatGPT 4-o API
  • CLIP
  • TensorFlow
  • Google Cloud Vision
  • Pillow
  • OpenCV
  • Tesseract (OCR)
  • EAST
  • scikit-learn / KMeans

Some of these libraries could be used inside ChatGPT 4-o’s query analyzer, but were unfortunately not sufficiently capable for my needs.

Score Breakdown

Here’s a breakdown of the scores. We’ll go into more detail on each in future blog posts.

Knowledge of AI Algorithms and Libraries

| Feature | Claude 3.5 Sonnet | ChatGPT 4-o | Gemini 1.5 |
| --- | --- | --- | --- |
| Knowledge of AI Algorithms and Libraries | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ |

Bringing some of my initial prompts over from ChatGPT 4-o to Gemini, it was clear that Gemini 1.5 could reason far better about what models and algorithms to use and how to combine them. It did a much better job of laying out tradeoffs and expectations for each model.


Claude performed as well as Gemini 1.5 did on planning, and substantially better on coding.

Code Reliability and Quality

| Feature | Claude 3.5 Sonnet | ChatGPT 4-o | Gemini 1.5 |
| --- | --- | --- | --- |
| Code Reliability and Quality | ⭐⭐ | | |

Gemini’s codegen left much to be desired: lots of syntax errors, and it would easily fall into traps where a fix for issue #1 would cause issue #2, and the fix for issue #2 would regress issue #1 (I’ll call this an LLM Regression Trap). ChatGPT falls into these all the time.

Across multiple projects, I only ran into an LLM Regression Trap with Claude around 3 times, and it was quickly resolved by starting a new chat with all of my code files.

To date, Claude consistently outperforms (albeit slightly, score-wise) ChatGPT, Llama, and Gemini in coding benchmarks. I have to say, this 2% difference in benchmark score appears to have a substantial impact on overall coding ability.

Ability to Retain Context

| Feature | Claude 3.5 Sonnet | ChatGPT 4-o | Gemini 1.5 |
| --- | --- | --- | --- |
| Ability to Retain Context | ⭐⭐ | | ⭐⭐ |

This is where ChatGPT 4-o really suffers. It struggles to retain context in a larger-than-tiny codebase, and will quickly forget code and functions that don’t happen to be relevant to the discussion in the last few prompts.

Gemini and Claude advertise larger context windows. I’m always skeptical of context window stats as it seems that LLMs still tend to forget important things as their context window fills. Nonetheless, I found that Claude and Gemini generally performed well at remembering projects with 10-12 code files. Once in a while, Claude would drop a feature we hadn’t used in a while, but you could reliably get it to add the feature back.

Ability to Run Code

| Feature | Claude 3.5 Sonnet | ChatGPT 4-o | Gemini 1.5 |
| --- | --- | --- | --- |
| Ability to Run Code | | | |

ChatGPT is the only LLM that offers the ability to run code using its query analyzer. This tool is incredibly powerful for small tasks – and I use it all the time in my day-to-day life (Resize this PDF! Manipulate this image! Graph this or that!). I’ve shared some best-practices in the past for how to get the most use out of it.
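
To make that concrete, here’s a sketch of the kind of one-off snippet the query analyzer typically writes and executes for a “manipulate this image” request. The file names are hypothetical; uploaded files generally land under /mnt/data in the sandbox.

```python
# A sketch of a typical throwaway "manipulate this image" task.
# File names are hypothetical; sandbox uploads appear under /mnt/data.
from PIL import Image

img = Image.open("/mnt/data/photo.png")
img.thumbnail((1024, 1024))  # resize in place, preserving aspect ratio
img.save("/mnt/data/photo_small.png")
```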

That being said, its execution environment is limited. This is as far as I could push it when it comes to AI algorithms:

You can run OpenCV for basic contouring

[Image: contour-detection output]

…no, it can’t generate results like this ‘out of the box’. This image was the result of multiple phases of histogramming and clustering (which, to its credit, the query analyzer was able to do).
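
For the curious, here’s a rough sketch of that kind of multi-phase pipeline: cluster the image colors with KMeans to build a mask, then contour the mask with OpenCV. This is my reconstruction, not the exact script the query analyzer produced; the file names and cluster count are assumptions.

```python
# Sketch: KMeans color clustering followed by OpenCV contour extraction.
# File names and n_clusters are assumptions for illustration.
import cv2
import numpy as np
from sklearn.cluster import KMeans

img = cv2.imread("input.png")
pixels = img.reshape(-1, 3).astype(np.float32)

# Phase 1: quantize colors into a few clusters (a crude segmentation step).
labels = KMeans(n_clusters=3, n_init=10).fit_predict(pixels)

# Build a binary mask from the cluster of the top-left pixel (treated as background here).
mask = (labels.reshape(img.shape[:2]) == labels[0]).astype(np.uint8) * 255

# Phase 2: extract and draw the contours of that mask.
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cv2.drawContours(img, contours, -1, (0, 255, 0), 2)
cv2.imwrite("contours.png", img)
```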

You can use Tesseract for basic OCR

Unsurprisingly, the results on handwritten text aren’t as good as just asking ChatGPT to OCR it.
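
The Tesseract path itself is nearly a one-liner; a minimal sketch, assuming pytesseract is available in the execution environment (the file name is hypothetical):

```python
# Minimal OCR pass with Tesseract via pytesseract (file name hypothetical).
import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open("/mnt/data/scan.png"))
print(text)
```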

On a limited basis, you can actually upload a trained model file for these libraries to use:

ChatGPT doesn’t have access to the internet, but you can download a fairly large model file yourself, upload it to ChatGPT, and it’ll run it! This was fun to do from my iPhone.
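
Here’s what that looks like once the file is uploaded; a sketch assuming a scikit-learn model saved with joblib (the file name and feature shape are hypothetical):

```python
# Load an uploaded model file from the sandbox and run a prediction.
# File name and feature shape are hypothetical.
import joblib
import numpy as np

model = joblib.load("/mnt/data/kmeans_model.joblib")
sample = np.array([[0.2, 0.7, 0.1]])  # must match the training feature shape
print(model.predict(sample))
```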


IDE Integration

| Feature | Claude 3.5 Sonnet | ChatGPT 4-o | Gemini 1.5 |
| --- | --- | --- | --- |
| IDE Integration | | | |

None of these systems have official IDE integration… yet. So there’s still a great deal of copy-pasting outputs from the LLM into your code files, and then copy-pasting the output back to the LLM. I’ll write an article shortly about the state of the art and the tools I’m building (or rather, having Claude build) to address this in the short term.

Claude has the best “agentic coding” capabilities, and so far the best plugin I’ve found for Visual Studio Code is Claude Dev. It will read the files from your project, and when it proposes changes you can authorize it with one click.

Not quite ready for prime time, though. Claude Dev just added the ability to re-run a command (before that, a Claude error would cripple the whole chat), and it doesn’t yet have the kind of robust LLM chat management you really need for a meaningful, productive co-development session with your favorite LLM.

Web Interface Performance

| Feature | Claude 3.5 Sonnet | ChatGPT 4-o | Gemini 1.5 |
| --- | --- | --- | --- |
| Web Interface Performance | | ⭐⭐⭐ | ⭐⭐⭐ |

Normally, this wouldn’t even be a category worth scoring. Unfortunately, the performance of Claude’s web GUI degrades so badly as conversations get longer that it has to be called out: past a certain length, the chat becomes totally unusable.

Right now I’m using TypingMind to work around this. I must admit that TypingMind is so good that calling it a ‘workaround’ is a little demeaning: it’s fantastic, and you can run it locally, which means you can store and back up your LLM chats.

While TypingMind is the cure for Claude’s terrible web GUI performance, it doesn’t address the secondary problem with Claude: the API receives the full chat history on every call, which means you burn through tokens faster and faster as your chats get longer. A single $35 chat is the least of your problems here: paid users will run up against the 2M-token limit quickly, and the Anthropic sales team that handles tier-upgrade tickets is severely backlogged right now.
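
To see why the tokens add up, here’s a minimal sketch of a chat loop against the Anthropic API (the model name and prompt handling are illustrative): every call resends the entire accumulated history, so each turn costs more than the last.

```python
# Sketch of a chat loop: the full message history goes up on every call.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
history = []

def ask(prompt: str) -> str:
    history.append({"role": "user", "content": prompt})
    resp = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # illustrative model name
        max_tokens=1024,
        messages=history,  # entire history, re-sent (and re-billed) every turn
    )
    reply = resp.content[0].text
    history.append({"role": "assistant", "content": reply})
    return reply
```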

In Summary

There are a surprising number of tradeoffs when choosing which LLM to co-develop your AI code with. Fortunately, the platforms are evolving quickly, and the rich ecosystem of third-party solutions is rapidly filling the gaps.

In the future, I’ll do a few dedicated posts on IDE tools and the many co-development best practices I’ve developed over the last few weeks… the best way to stay up to speed is to follow me on LinkedIn.
