“Not until we are lost do we begin to understand ourselves.”

— Henry David Thoreau

Prologue

Despite the arguments over whether we are getting closer to true Artificial Intelligence, today's Large Language Models (LLMs) are undeniably becoming more and more capable at various tasks. Coding is one of the most common tasks for LLMs, and it is also the one I am most familiar with. For years, until a few weeks ago, I didn’t really care about the code generated by LLMs and thought that they lacked the taste of a true software engineer. But recently, as I tried to work with the frontier models and the Agents powered by them, I found that things may have changed.

So, in this article, I am going to share some of my experience with LLMs, along with some of my own best practices. This is more of a self-reflection, but I would be glad if you also find it helpful.

LLMs are evolving really fast, so this article may also get outdated really fast.

Although I use LLMs more often nowadays, I refrain from using them to write blog posts. Where I do use an LLM, I will say so explicitly; otherwise, none was used, or it was used only to check for grammatical mistakes. See this post for an example of writing with an LLM.


The Adoption of LLMs

Since this article is mainly for myself, I would like to include this sentimental section. If you just want the key insights, jump to my practices or directly to the takeaways.

The pre-LLM era

Technology advancements are indeed beneficial. They make our lives easier. But there is a hidden menace: more of us no longer bother to learn what we take for granted.

I still remember when I first picked up programming. I read C Primer Plus page-by-page, doing every exercise on an old ASUS laptop with Visual C++ 6.0. The laptop did not even have an Internet connection. It looks stupid now as we have various resources on the Internet, and can even ask LLMs to get answers for almost any question. But the easy access to an LLM also deprives us of first-hand knowledge and the experience of doing things on our own.

Look at the top programmers: (almost) all of them come from the old times, when they had to write assembly, deal with hardware and limited resources, and fight with operating systems. Such hands-on experience made them what they are, and they simply cannot be replaced by an LLM.

Many Gods of computer science are still living in the world.

So, I just took the “dumb” way to learn programming, all the way to the second year at university. I read books about design patterns and best practices, looked up documentation, and searched the Internet for solutions. From the very first program in the book, I eventually wrote a game with more than 50K lines of code.

And I thought this was it: I could become a better programmer just by following this practice!

From line-completion to block-completion

Then there was a subtle change in the development cycle, following the introduction of GitHub Copilot. I say “subtle” because at the time I didn’t find it too exciting, so my memory of it is a little blurred.

GitHub Copilot was first released in 2021, and I didn’t use it until late 2022. At that time, I used Visual Studio 2019, and the built-in IntelliSense could already provide decent completion. When I first enabled GitHub Copilot, its accuracy was still pretty low, and it only worked well for repetitive code segments.

Very soon, GitHub Copilot introduced support for multi-line completion and even function completion. I seldom used this feature, because I found the generated code almost never matched my description in the comment.

Back then I just couldn’t see it. But now things are much clearer. It proved that new code can be generated by learning from existing code, massive amounts of it.

Ask whom, a search engine or an LLM?

Throughout my undergraduate life, whenever I encountered problems, I would consult search engines to find a solution. Although a copy-paste answer is available for many questions, the solution is not always obvious. Sometimes it still requires non-trivial effort to try out different answers from Stack Overflow, or to spend more time reading the documentation.

Due to service restrictions, I didn’t have an OpenAI or Claude account, so I missed this period. Back then, though, their coding ability was not very strong, and they could not solve even slightly complex tasks. Other abilities, such as writing, were already pretty good.

I belong to the very first generation whose Bachelor’s theses were checked for AI usage.

At this time, my experience was mostly with the GitHub Copilot chat mode. The biggest problem then was hallucination. There were undefined symbols and non-existent imports. So, most of the time I just hid the panel and directly went to a search engine.

Then, when I finally got an OpenAI account and better models became available, these issues were alleviated to some extent. From then on, I began to notice that asking an LLM could be more efficient. Using a search engine requires you to filter the results and summarize the answers yourself. An LLM has already absorbed much of that material from its training data and does the filtering and summarizing for you, so it can save a lot of time. Gradually, I came to ask LLMs more often than search engines, and that is still the case today.

Why not a Claude account? Because my Claude account was banned.🤬

However, LLMs are still prone to hallucination and bias, so their answers can be deceiving. Now the frontier models are more capable, and can explain things quite well with sources cited. You can generally believe them, but still have to be skeptical, and consult a search engine when feeling unsure.

Here comes the Agent

At the end of my undergraduate life, the introduction of Devin brought a new concept to the public — LLM Agents. Beyond conversation, LLMs can make decisions, interact with various tools, and complete software engineering tasks like a human developer. This really opened up a new paradigm for the application of LLMs.

Very soon, more and more papers were published that used LLMs to solve various software engineering tasks. They indeed solved many long-standing problems in software engineering, and one of the most significant is code understanding and generation. Take program repair as an example: although many methods have been proposed, traditional approaches could hardly understand the intent of the program, and the synthesized patches just couldn’t fit into the codebase. But an LLM could understand the code like a human, and generate code that generalizes to real-world codebases.

Lately, coding agents have been getting more and more popular. Although they looked promising, I was overconfident about my own coding skills. I was kind of obsessed with design patterns and best practices, and regarded LLM-generated code as junk. I mean, not all generated code is junk. You can have it generate one function, but you can’t expect it to write the whole project.

For quite a long time, I kept the thought that LLMs can’t generate code with good design and taste. But I was starting to accept that LLMs are going to generate a larger portion of the code.

Away from the keyboard?

Most recently, at the time of this post, the frontier models I’ve tried for coding are GPT-5.3-Codex and Claude Opus 4.7, which I use through GitHub Copilot in Visual Studio Code. I have to admit, they can write production-level code. And their code quality is… fairly good, given their generation time.

The performance of different Agents, such as Codex and Claude Code, may vary, but the underlying model is still the key factor. I also tried GPT-5.5 in Codex while writing this post. Much more capable in understanding the requirements and implementing large systems, they are.

Without LLMs, I have to refer to the documentation first to check viability, then design the system in my mind or on paper, and finally spend a few weeks on implementation. But now, things have changed. I only need to state what I want, and the Agent can then fill in the details, draft a plan, and implement the system in a few minutes! (Well, maybe a few days with supervision.) And you can again ask it to refine the implementation or fix bugs.

It is much more efficient than manual coding. Thousands of lines of code and a runnable program in a few hours, even with unit tests that one may never bother to write. It is astonishing, but also the reality. So, literally, one can write programs with natural language only.

But, for now, LLMs are not yet strong enough. They may generate code faster than humans, but that does not mean the generated code is all useful. Most likely, the generated code will be thrown away because it does not meet your expectations. See the practices in the next section for more.

Now the story of code generation has come to a new era, where we can “efficiently” generate “high-quality” code. The future is exciting, but also brings a new question: how can we trust the code generated by an LLM?

I put the word “efficiently” in quotes for a reason. From the user’s perspective, code generation is now efficient, with thousands of lines of code in a few minutes. But at what cost? How many computing resources? How much electricity and water? The rising price of the models may provide a clue.

Trust in human developers may come from their reputation, the effort they have invested in the project, and the fact that you can hold them accountable if anything goes wrong. But an LLM has none of these. LLM Agents can be truly accepted in the development workflow only if we trust them. But how?

Here is a good post about programming with trust: Agentic AI Software Engineers: Programming with Trust.

Of course, there are various analysis and testing techniques. But they only find bugs, and do not provide a bug-free guarantee. They may be sufficient for humans, as we know what we are doing and can be reasoned with, but they are not enough for LLMs. This leads to a new area to explore: introducing formal verification into code generation, to prove that the generated code is correct. This could be, and is already becoming, the next hot research area.


Agents in Action

Well, after the long and verbose murmuring, in this section, I will briefly talk about some projects in which I partially or fully delegated the development to the LLM Agent. I will also share the good and bad practices for reference.

The initial adoption — sysfonts

As I was writing my game engine, the font rendering module needed to load system fonts. Since I wanted multi-platform support, I decided to implement this as a standalone library that retrieves the metadata of all installed system fonts.

I am most comfortable with Windows development, so that was quite easy after searching the Windows API documentation. Linux was also not that hard after visiting Stack Overflow.

However, it was a lot more difficult for macOS. It was my first time targeting macOS, and I could not find anything related to font loading on the Internet (probably because I was too unfamiliar with macOS development). So, I decided to give the LLM a try. Very quickly, it spat out a solution. Although the code had a lot of errors, like non-existent API names and failed type checks, it was a good starting point for me to learn from.

Starting from the wrong code, I manually looked up the correct APIs, fixed the types according to the documentation, and sorted out the memory management issues. And it worked.

LLMs do not know everything and make mistakes. You may not get correct answers, but an incorrect one can be a good start for you to find the correct solution by yourself. It saves you a lot of searching time.

Vibe-coding for the first time — russian-explained

I’ve been learning Russian by myself using Duolingo after taking the introductory course. It’s fun, and addictive, but not that effective. It does not explain words, e.g. their formation and sample sentences, which makes it hard to remember words, let alone use them. Also, you practice a lot of examples that you do not use in daily life. As a complement, I consult LLMs to learn more.

However, it is tedious to type the same question every time, or to write prompts to adjust the personality of the LLM. So, I wondered whether I could turn it into an interactive dictionary. And as GitHub had just released Copilot CLI, I tried to build such an app with it.

I decided to make it a Web app so I could access it on any of my devices. I had not written a frontend project for a long time, so I just let the Agent handle all the implementation. All I did was select the React framework, and write the features list. Because Agent skills had just come out at that time, I also included some so-called frontend development skills.

The development was a total disaster.

LLM can’t stick to the plan. Since it was the first time I had used an LLM Agent for full development, I proceeded with great care. I always used the plan mode first before implementation. Although the plans looked good, the implementation might diverge. To bring it back on track, I had to go through more rounds of conversation, which risked overflowing the context and breaking the old plan. As a result, new plans got mixed up with old ones, and the Agent had no idea what it was doing.

The workflow is not under control. At the very beginning, the Agent focused on one feature at a time. But as the context grew, it might forget the old workflow or tackle multiple tasks in one batch. This was because all the tasks were given to the Agent at once, and I allowed the Agent to decide what to do next. Even if there were plans, the Agent might occasionally ignore them, and it only takes once to cause mayhem.

Skills may not help. Skills are condensed best practices, indeed, but one should not include them blindly, especially when you have no idea what the skill is about. For this project, I included skills from Vercel about best practices in React, and also a collection of superpowers. Vercel’s skills might be helpful, but the superpowers were just causing chaos. The brainstorming skill just wastes output tokens. Since I am not familiar with git worktrees, the using-git-worktrees skill caused a lot of trouble managing changes. Also, writing-plans kind of conflicts with the built-in plan mode, and the plans are most of the time outdated, thereby only leading to confusion.

This is an example of what will happen if you have little control over the development workflow. At the current stage, the LLM Agent is still not capable enough to handle large codebases and act consistently. Therefore, human-in-the-loop is required, and you must be clear about what is going on, as the Agent may not.

Agents under control — sponge-subtitles

Apologies in advance if you click the link to the repo. I am not ready yet to make it public.

This was the second time I vibe-coded a whole project. Initially it was more like a helper script, but it ended up being more complex.

Again, the idea came from my learning Russian. It is already difficult to learn a new language without a teacher, and it is even more difficult without exposure to the language. Unlike English, which we see and hear every day, for instance on packaging or in movie clips, you don’t often see Russian outside Russian-speaking countries.

To introduce more Russian in my daily life, I chose my favorite cartoon — SpongeBob SquarePants, a.k.a. Губка Боб Квадратные Штаны. Unfortunately, there are no existing subtitles, so I decided to add them by myself, which can also be part of the learning process.

Surprisingly, it is not that easy to find Russian dubbing of SpongeBob. To get it, you have to do it the Russian way. I mean, using ruTracker.org.

The goal was to write a helper script to transcribe each episode, and eventually create the subtitle file in .srt format. I initially tried local models with the OpenAI Whisper CLI, as well as OpenAI’s speech-to-text models. But eventually I found a free online service, TurboScribe, that performs better.
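
For reference, an .srt file is just a list of numbered cues, each with a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range followed by the text. Below is a minimal sketch of that last step, not the actual script from the project. It assumes the transcription result is already available as `(start_seconds, end_seconds, text)` tuples; that tuple shape and the function names are hypothetical and are not the output format of Whisper or TurboScribe.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(segments) -> str:
    """Render (start, end, text) tuples as the contents of an .srt file.

    The tuple layout is an assumption made for this sketch.
    """
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        cues.append(f"{index}\n{time_range}\n{text.strip()}\n")
    # Cues are separated by a blank line.
    return "\n".join(cues)
```

Writing the result to disk is then a matter of `open("episode.srt", "w", encoding="utf-8").write(build_srt(segments))`, where the file name is again only an example.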

Adjusting the timecodes was of course still my job. But after that, I wanted something more. Instead of just subtitles, I also wanted to add annotations to assist learning. So, the final challenge was to burn the custom annotations into the video. Of course, one can do this in video editing software, but this kind of repetitive and mechanical work is clearly better suited to automation. And not surprisingly, ffmpeg with ASS can achieve this perfectly.
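
To give a feel for what this involves, here is a minimal sketch (again, not the project's actual code) that builds a tiny ASS file with timed annotations and assembles the ffmpeg command to burn it in. The style values, the `Note` style name, the file names, and the helper names are all hypothetical. The `ass` video filter needs an ffmpeg build with libass, and paths containing characters that are special in ffmpeg's filter syntax (such as `:` or `'`) would need escaping, which this sketch does not handle.

```python
# Hypothetical header: one "Note" style, white text, anchored top-center.
ASS_HEADER = """\
[Script Info]
ScriptType: v4.00+
PlayResX: 1920
PlayResY: 1080

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
Style: Note,Arial,40,&H00FFFFFF,&H000000FF,&H00000000,&H64000000,0,0,0,0,100,100,0,0,1,2,0,8,20,20,20,1

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
"""


def ass_timestamp(seconds: float) -> str:
    """Format a time in seconds as an ASS timestamp: H:MM:SS.cc (centiseconds)."""
    total_cs = round(seconds * 100)
    hours, rest = divmod(total_cs, 360_000)
    minutes, rest = divmod(rest, 6_000)
    secs, cs = divmod(rest, 100)
    return f"{hours}:{minutes:02d}:{secs:02d}.{cs:02d}"


def build_ass(annotations) -> str:
    """Render (start, end, text) tuples as the contents of an .ass file."""
    lines = []
    for start, end, text in annotations:
        # ASS uses \N for a hard line break inside the Text field.
        body = text.replace("\n", "\\N")
        lines.append(
            f"Dialogue: 0,{ass_timestamp(start)},{ass_timestamp(end)},Note,,0,0,0,,{body}"
        )
    return ASS_HEADER + "\n".join(lines) + "\n"


def ffmpeg_burn_command(video: str, ass_file: str, output: str) -> list[str]:
    """Build (but do not run) an ffmpeg command that burns the ASS file into the video."""
    return ["ffmpeg", "-i", video, "-vf", f"ass={ass_file}", "-c:a", "copy", output]
```

The command list can be handed to `subprocess.run(cmd, check=True)` once the .ass file has been written. Re-encoding the video stream is unavoidable when burning text into the frames, which is why only the audio is copied here.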

It is without doubt tedious to read the documentation and write scripts to parse and build ASS files and call ffmpeg, so I gave that to an LLM Agent. Unlike last time, when I let the Agent do whatever it liked, this time I fixed the workflow and algorithm design, and tried to be as clear as I could when describing the requirements. This worked so much better. I also found that LLMs are much more proficient than I am at writing fancy console output. Of course, the result may also have benefited from the updated LLM, which was GPT-5.3-Codex.

So, in this project, I learnt that although LLMs are getting “smarter”, you are, at least for now, still in charge. Again, you should be clear about what you are doing. Planning is also very important. You may not want to hand the whole pile of work directly to the Agent. It is better to decompose it yourself first, and let the Agent implement it step by step, just as you would. Additionally, you can specify the preferred workflow and code style for the Agent to follow, and restate them when the Agent compacts its memory.

I design, Agents implement

As the final experience in this post, this is of course the most advanced one. For my recent research idea, I tried to use LLM Agents to implement the prototype. (Not because I became lazy.)

Following all the practices before, I first formalized my idea in detail, including background, motivation, goal and implementation sketch. After giving them to the Agent, I did not let it start implementation directly. Instead, I…

  1. let it explain first, or repeat what I wanted to do, to confirm that it understood well;
  2. switched to plan mode to decompose the task into smaller steps;
  3. revised the plan with my own expertise;
  4. hit the button for implementation.

To stay more in control, I also wrote a short set of development principles for the Agent to follow, covering things such as the package manager, preferred libraries, and the high-level workflow. For the same reason, I bootstrapped the project myself before letting the Agent do whatever it liked.

Actually, the idea formalization was also assisted by LLMs.

The ideal outcome would have been that I spend a few days polishing my idea and send it to the LLM to get a prototype in an afternoon. However, this did not happen. In reality, although the Agent could produce a lot of code in no time, it did not exactly follow my idea, and the implementation did not generalize beyond the given examples. So in the end, I still had to spend more than a week on implementation.

This did not work as expected mainly because I overestimated the capability of the Agent. Agents are not yet capable of turning abstract ideas into implementations. Therefore, one must be extra clear about what to do and formalize that into actionable plans for the Agent.

Don’t be lazy. It is OK not to write code anymore, but it is not OK to delegate thinking to an LLM Agent… yet.


Takeaways

To conclude, here is a summary of what I learnt working with LLM Agents.

  • You are in control. Do not set Agents free to do things you probably don’t understand.
  • Be specific. Make your idea crystal clear before handing over the implementation to Agents.
  • Plan with care. Properly decompose the task to make the Agent more focused, and always remind it of the guidelines.

Epilogue

As a programmer, you might be afraid of what will come next, as LLM Agents are so capable of writing code. But rest assured, they work differently from humans. A Large Language Model is like a collection of all (programming) knowledge. In essence, it is a tool, albeit one more powerful than any of its predecessors, while you, my friend, are the user of that tool. ᓚᘏᗢ