- First impression with LLM-powered coding assistants
- Challenge with BDD and Gherkin
- Opportunity
- Discussion
As AI coding assistants become increasingly common in software development, many teams are discovering their potential to transform how we design and maintain test automation. In this post, I share my early thoughts on how LLM-powered tools might revolutionize Behavior-Driven Development (BDD).
First impression with LLM-powered coding assistants
When working with AI coding assistants powered by large language models (LLMs), I have noticed they thrive on structure, checklists, and concrete examples.
Before large language models (LLMs) became as popular as they are today, I was already fascinated by Behavior-Driven Development (BDD) and its Gherkin syntax – a human- and business-friendly way to express test cases in natural language.
With its clear rules and structured syntax, Gherkin feels like a natural fit for LLM-driven assistance.
Challenge with BDD and Gherkin
As the number of test cases grows, maintaining them becomes a real challenge. Without a well-defined and structured set of step definitions, things can quickly turn chaotic.
From my experience, the most time-consuming part of setting up BDD with a human-readable syntax like Gherkin is designing the step definitions. The cost of creating and maintaining these definitions can outweigh the benefits they provide, and this is one of the main reasons Gherkin or Cucumber initiatives fail.
Keeping the testing specification maintainable requires a careful balance:
- Clarity: Steps must be understandable at a glance.
- Reusability: They should be generic enough to cover multiple cases.
- Simplicity: Too generic is also a problem, as over-abstracted steps can become confusing and difficult to maintain. Each definition should keep a single responsibility, so it is easy to identify its scope and decide whether to reuse it or create a new one.
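The balance above can be sketched in code. The following is a minimal, illustrative step registry in Python (a stand-in for what Cucumber or behave do with their `@given`/`@when`/`@then` decorators; the step texts and function names are hypothetical), contrasting focused steps with an over-abstracted one:

```python
import re

# Minimal step registry sketch -- not a real BDD framework.
STEP_REGISTRY = {}

def step(pattern):
    """Register a step definition under a regex pattern."""
    def decorator(func):
        STEP_REGISTRY[pattern] = func
        return func
    return decorator

# Focused, single-responsibility steps: the scope is obvious at a glance.
@step(r'the user logs in as "(?P<username>[^"]+)"')
def login(username):
    return f"logged in as {username}"

@step(r'the user adds "(?P<item>[^"]+)" to the cart')
def add_to_cart(item):
    return f"cart contains {item}"

# Over-abstracted step: highly reusable on paper, but one pattern now
# hides many unrelated behaviors and becomes hard to maintain.
@step(r'the user does "(?P<action>[^"]+)" with "(?P<target>[^"]+)"')
def do_anything(action, target):
    return f"{action} -> {target}"

def run_step(text):
    """Dispatch a Gherkin step line to the first matching definition."""
    for pattern, func in STEP_REGISTRY.items():
        match = re.fullmatch(pattern, text)
        if match:
            return func(**match.groupdict())
    raise LookupError(f"No step definition matches: {text!r}")
```

With focused steps, `run_step('the user logs in as "alice"')` is unambiguous, while every scenario routed through the catch-all `do_anything` step forces the reader to open its implementation to learn what it actually does.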
Once a “step definition dictionary” is established, another challenge emerges: identifying whether a similar step already exists. Without proper tooling, duplicate or overlapping steps can proliferate, creating inconsistency and maintenance pain.
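Even simple tooling can help here. As a sketch (the step dictionary below is hypothetical, and real tooling would likely use something smarter than string similarity), `difflib` from the Python standard library can flag likely near-duplicates before a new step is added:

```python
import difflib

# Hypothetical step definition dictionary; in practice this would be
# extracted from the project's actual step definition files.
EXISTING_STEPS = [
    "the user logs in as <username>",
    "the user adds <item> to the cart",
    "the cart total is <amount>",
]

def find_similar_steps(candidate, threshold=0.6):
    """Return existing steps that closely resemble the candidate,
    so authors can reuse one instead of adding a near-duplicate."""
    return difflib.get_close_matches(
        candidate, EXISTING_STEPS, n=3, cutoff=threshold
    )

# A near-duplicate phrasing is caught...
print(find_similar_steps("the user signs in as <username>"))
# ...while an unrelated step likely finds no close match.
print(find_similar_steps("the report is exported as PDF"))
```

This is exactly the kind of check an LLM-powered assistant could perform semantically rather than lexically, catching rephrasings that pure string similarity misses.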
These issues highlight how critical tooling and management are to the long-term success of BDD with Gherkin.
Opportunity
When I first used an LLM-based chatbot to generate test cases, I realized how powerful these tools could be. They can:
- Help with naming step definitions consistently.
- Translate requirements and expectations into Gherkin-style test cases, grounded in an existing step definition dictionary.
- Or simply refine my draft test cases, taking care of naming, finding existing step definitions I can reuse, and so on.
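The grounding idea in the second point can be sketched very simply (the prompt wording and step list here are hypothetical): embed the existing step definition dictionary directly in the assistant's prompt, so generated scenarios reuse known steps instead of inventing new ones.

```python
# Hypothetical step dictionary; in practice this would be extracted
# from the project's real step definition files.
STEP_DICTIONARY = [
    'Given the user logs in as "<username>"',
    'When the user adds "<item>" to the cart',
    'Then the cart total is "<amount>"',
]

def build_prompt(requirement):
    """Assemble an LLM prompt that grounds generation in existing steps."""
    steps = "\n".join(f"- {s}" for s in STEP_DICTIONARY)
    return (
        "Translate the requirement below into a Gherkin scenario.\n"
        "Reuse these existing step definitions wherever possible, and\n"
        "flag any step you had to invent:\n"
        f"{steps}\n\n"
        f"Requirement: {requirement}"
    )

print(build_prompt("A signed-in customer can buy two items at once."))
```

Asking the model to flag invented steps is the key design choice: new steps then become explicit review items for the step dictionary instead of silently accumulating.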
With this capability, I believe that LLM-based tools have the potential to elevate test automation to an entirely new level, especially for the BDD community – where structured syntax like Gherkin provides the perfect foundation for AI-assisted generation, validation, and management of test cases.
However, considering the LLM’s ability to understand human language, I believe there may be an opportunity to interpret Gherkin differently, or even create a new language/framework for describing test cases in a way that is closer to how humans naturally express them – and then trigger automated testing from that. This could help eliminate the limitations of the text-pattern matching approach that Cucumber currently uses.
Discussion
This might also influence how we manage our test cases and how Test Case Management systems evolve in the future. Future systems will likely integrate with AI coding assistants, feeding them text-based materials that support automated test case generation and optimization.
It could also mark a broader trend in which feature files and natural-language requirements become the primary input for automation, much like how GitOps and Infrastructure as Code are redefining infrastructure management.
Ultimately, I believe that LLM-powered testing will play a crucial role in modern software delivery, enhancing collaboration, reducing maintenance friction, and improving the overall productivity of testing teams.
This automated testing capability is also a crucial part of making an AI-powered coding assistant successful, so I believe there will be more frameworks, research, and best practices emerging around this topic in the coming years.