
Spec-Driven Development with LLMs: Precise Engineering Through Specifications


LLMs are transforming how we write code, but they've also exposed a fundamental truth: vague instructions produce vague implementations. This post introduces spec-driven development (SDD)—a methodology for building reliable software when working with Large Language Models as coding assistants. Specs aren't just documentation; they're the contract that ensures both humans and AI produce exactly what you need.

Why Specs Matter More Than Ever in the Age of LLMs

LLMs like Claude are remarkably capable coding assistants, but they have a fundamental limitation: they can only build what you describe. Vague instructions produce vague implementations. Incomplete requirements lead to incomplete features.

This is where spec-driven development becomes essential:

Without Specs                    With Specs
─────────────────────────────    ─────────────────────────────
"Add health endpoints"     →     Ambiguous implementation
                                 - What status codes?
                                 - What response format?
                                 - Which dependencies to check?

"Implement requirements    →     Precise implementation
 1.1 through 1.5 from            - HTTP 200 with JSON
 control-plane-health-           - RFC3339 timestamps
 endpoints spec"                 - Database/cache checks
                                 - Configurable timeouts

The Contract Between Human and Machine

Think of a spec as a legally binding contract between you and the LLM:

  1. You specify exactly what you want, with testable acceptance criteria

  2. The LLM implements according to those criteria

  3. Tests verify the implementation matches the spec

  4. Everyone wins: you get what you asked for, the LLM has clear guidance

Specs Prevent "AI Drift"

Without specs, LLMs can:

  • Make assumptions about behavior you didn't intend

  • Add features you didn't ask for

  • Implement patterns that don't match your architecture

  • Miss edge cases that seem obvious to you

With specs, these problems largely disappear: the LLM has explicit requirements to follow, and tests verify compliance.

Specs as Versioned Code Artifacts

Critical principle: Specs are not separate documentation—they are first-class code artifacts that live alongside your implementation.

Directory Structure

your-project/
├── .kiro/
│   └── specs/                    # All specifications
│       ├── README.md             # Spec conventions and overview
│       ├── mage-build-system/
│       │   ├── requirements.md   # What we're building
│       │   ├── design.md         # How we'll build it
│       │   └── tasks.md          # Implementation checklist
│       ├── authentication-middleware/
│       │   ├── requirements.md
│       │   ├── design.md
│       │   └── tasks.md
│       └── control-plane-health-endpoints/
│           ├── requirements.md
│           ├── design.md
│           └── tasks.md
├── internal/                     # Implementation
├── cmd/                          # Entry points
└── test/                         # Integration tests

Why Version Specs with Code?

  1. Traceability: git blame shows who changed what requirement and when

  2. History: You can see how requirements evolved over time

  3. Context: Understanding why code exists by reading the original spec

  4. Synchronization: Specs and code stay in sync through the same PR process

  5. Onboarding: New engineers read specs to understand the system

Specs as Living Documentation

Unlike external documentation that drifts from reality, versioned specs:

  • Are reviewed in PRs alongside code changes

  • Must be updated when requirements change

  • Provide audit trails for compliance and debugging

  • Explain rationale that comments can't capture

For example:

- [x] 12. Final checkpoint - Ensure all tests pass and architecture is validated
  - Build successful: `go build ./...` passes
  - Property-based tests passing: All test structure alignment tests pass
  - **PBT Status**:
    - Test structure alignment: All 6 properties PASS
    - Layer-based directory structure: All 5 properties PASS
    - Migration completeness: All 6 properties PASS
  - **Overall Status**: CLEAN architecture migration COMPLETE

This is permanent, searchable history of what was verified and when.

The Three-Document Structure

Every feature has three documents that work together:

| Document        | Purpose                | Audience                      | LLM Usage                  |
|-----------------|------------------------|-------------------------------|----------------------------|
| requirements.md | What to build and why  | Product, Engineering          | Context for implementation |
| design.md       | How to build it        | Engineering                   | Architecture guidance      |
| tasks.md        | Step-by-step checklist | Implementation (Human or LLM) | Direct instructions        |

How LLMs Use Each Document

When working with an LLM on a feature:

1. Share requirements.md → LLM understands the goal and constraints
2. Share design.md       → LLM follows your architecture decisions
3. Work through tasks.md → LLM implements each task with clear scope
4. Run verification      → Tests confirm correctness

Requirements: The "What" and "Why"

The requirements.md file defines success criteria. For LLMs, this is especially critical—they need explicit, testable statements.

The EARS Pattern

We use EARS (Easy Approach to Requirements Syntax) for machine-parseable requirements:

| Keyword   | Meaning           | Example                         |
|-----------|-------------------|---------------------------------|
| WHEN      | Trigger condition | WHEN a client sends GET /health |
| THE       | System component  | THE System                      |
| SHALL     | Mandatory         | SHALL return HTTP 200           |
| SHALL NOT | Forbidden         | SHALL NOT log secrets           |
| IF        | Conditional       | IF the cache is nil             |

Example: Health Endpoints Requirements

For example:

### Requirement 1

**User Story:** As a platform operator, I want a basic health check endpoint,
so that I can verify the Control Plane service is running and responsive.

#### Acceptance Criteria

1. WHEN a client sends GET /health, THE System SHALL return HTTP 200 with JSON
2. WHEN the health endpoint responds, THE System SHALL include status field
   with value "healthy"
3. WHEN the health endpoint responds, THE System SHALL include timestamp field
   in RFC3339 format
4. WHEN the health endpoint responds, THE System SHALL include service field
   with value "control-plane"
5. WHEN the health endpoint responds, THE System SHALL include version field
   with the current service version

Why this works for LLMs:

  • Each criterion is specific and testable

  • Values are explicitly stated ("healthy", "RFC3339")

  • No ambiguity about expected behavior

The Glossary: Shared Vocabulary

Define terms once, use them everywhere:

## Glossary

- **Task_Store**: Persistent storage for tasks (JSON file)
- **Zero_Magic**: Architectural principle requiring explicit behavior,
  no automatic discovery, and inspectable operations
- **12_Factor**: Application design methodology emphasizing configuration
  via environment, stateless processes, and explicit dependencies

The underscore convention (Task_Store not "task store") makes terms searchable and unambiguous for both humans and LLMs.

Design: The "How"

The design.md document captures architectural decisions that the LLM must follow.

Why Design Documents Matter for LLMs

Without design guidance, LLMs will:

  • Choose their own patterns (which may not match your codebase)

  • Make their own architectural decisions (which you'll have to reverse)

  • Miss integration points (which cause bugs later)

With design documents:

// From mage-build-system/design.md

### Mage Target Organization

Targets are organized into namespaces for clarity (100% namespaced):

```go
// Build namespace - compilation and build management (9 targets)
type Build mg.Namespace

func (Build) Default() error           // Build for current platform
func (Build) All() error               // Build for all platforms
func (Build) LinuxAmd64() error        // Build for linux-amd64
```

The LLM now knows:

  • Use namespaces (not flat functions)

  • Follow the naming convention

  • Match the existing pattern

Correctness Properties

A critical part of design documents is correctness properties—formal statements about system behavior:

```markdown
### Property 4: Status Code Mapping

*For any* ready endpoint response, if all checks have value "ok" then
HTTP status should be 200, and if any check has value "error" then
HTTP status should be 503.

**Validates: Requirements 2.3, 2.4, 4.2, 4.3**
```

These properties:

  1. Define invariants that must always hold

  2. Become property-based tests in the implementation

  3. Provide verification criteria for LLM output
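Property 4 above, for instance, reduces to a small pure function that a property-based test can hammer with generated check maps. A sketch under an assumed function name (`statusFromChecks` is ours):

```go
package main

import (
	"fmt"
	"net/http"
)

// statusFromChecks encodes Property 4: HTTP 200 when every check is
// "ok", HTTP 503 as soon as any check reports "error".
func statusFromChecks(checks map[string]string) int {
	for _, result := range checks {
		if result == "error" {
			return http.StatusServiceUnavailable // 503
		}
	}
	return http.StatusOK // 200
}

func main() {
	fmt.Println(statusFromChecks(map[string]string{"database": "ok", "cache": "ok"}))    // prints 200
	fmt.Println(statusFromChecks(map[string]string{"database": "error", "cache": "ok"})) // prints 503
}
```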

Tasks: The "When"

The tasks.md file is the implementation checklist—direct instructions for whoever (human or LLM) is writing the code.

Structure for LLM Consumption

- [ ] 1. Create health check logic file and implement dependency testing
  - Create `internal/control/health.go` with package declaration and imports
  - Implement `CheckDatabaseHealth(db *gorm.DB) string` function
    - Handle nil database connection (return "error")
    - Execute ping with 500ms timeout
    - Return "ok" on success, "error" on failure
  - _Requirements: 5.1, 5.2, 5.3, 5.5_

- [ ] 1.1 Write property test for database health check function
  - **Property 5: Database Check Result Mapping**
  - **Validates: Requirements 5.2, 5.3**
  - Tag: `Feature: control-plane-health-endpoints, Property 5`

Key elements:

  • Checkboxes track progress

  • Specific file paths eliminate guessing

  • Requirement references enable verification

  • Testing tasks follow implementation tasks
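Task 1 above is specific enough that its implementation is nearly determined. A sketch of what it might produce; note the spec's signature takes `*gorm.DB`, but we use the standard library's `*sql.DB` here to keep the example dependency-free:

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// CheckDatabaseHealth follows the task description: a nil connection is
// "error"; otherwise ping with a 500ms timeout and map the result.
func CheckDatabaseHealth(db *sql.DB) string {
	if db == nil {
		return "error" // no connection configured
	}
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()
	if err := db.PingContext(ctx); err != nil {
		return "error"
	}
	return "ok"
}

func main() {
	fmt.Println(CheckDatabaseHealth(nil)) // prints "error"
}
```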

The Implementation-Test Pattern

Notice how every implementation task has corresponding test tasks:

Task 1:   Implement feature X
Task 1.1: Write unit tests for X
Task 1.2: Write property test for X
Task 2:   Implement feature Y
Task 2.1: Write unit tests for Y
...
Task N:   Final checkpoint - verify all tests pass

This ensures nothing ships without verification.


The Verification Pyramid: Ensuring Correctness

Specs are only valuable if we can verify the implementation matches them. We use a multi-layered verification approach:

                    ┌─────────────────┐
                    │   E2E Tests     │  ← Full system verification
                    │   (Minutes)     │
                    └────────┬────────┘
                             │
                    ┌────────┴────────┐
                    │ Integration     │  ← Component interaction
                    │ (Seconds)       │
                    └────────┬────────┘
                             │
              ┌──────────────┴──────────────┐
              │     Property-Based Tests    │  ← Universal properties
              │         (Seconds)           │
              └──────────────┬──────────────┘
                             │
        ┌────────────────────┴────────────────────┐
        │            Unit Tests                   │  ← Individual functions
        │            (Milliseconds)               │
        └────────────────────┬────────────────────┘
                             │
    ┌────────────────────────┴────────────────────────┐
    │         Linting & Formatting                    │  ← Code quality
    │         (Milliseconds)                          │
    └─────────────────────────────────────────────────┘

Layer 1: Linting and Formatting

Purpose: Ensure code quality before tests even run

mage quality:lint     # Run golangci-lint
mage quality:fmt      # Format code with gofmt
mage quality:vet      # Run go vet
mage quality:check    # Verify formatting (CI-friendly)

Why this matters for LLM output:

  • LLMs sometimes generate code with style inconsistencies

  • Linting catches security issues, bugs, and anti-patterns

  • Formatting ensures consistent code style

From mage-build-system/requirements.md:

### Requirement 7: Validation and Quality Targets

1. THE Build_System SHALL provide a `quality:lint` target for golangci-lint
2. THE Build_System SHALL provide a `quality:fix` target for auto-fix
3. THE Build_System SHALL provide a `quality:fmt` target for formatting
4. THE Build_System SHALL provide a `quality:check` for verifying format (CI)
5. THE Build_System SHALL provide a `quality:vet` for running go vet

Layer 2: Unit Tests

Purpose: Verify individual functions behave correctly

// TestNewTask_EmptyTitle verifies that empty title returns error.
func TestNewTask_EmptyTitle(t *testing.T) {
    _, err := NewTask("", PriorityMedium)
    if err != ErrEmptyTitle {
        t.Errorf("expected ErrEmptyTitle, got %v", err)
    }
}

Maps to requirements:

5. WHEN the title is empty, THE System SHALL return an error "title is required"

Layer 3: Property-Based Tests

Purpose: Verify universal properties hold across ALL valid inputs

// Feature: control-plane-health-endpoints, Property 5: Database check result mapping
func TestProperty_DatabaseCheckResultMapping(t *testing.T) {
    parameters := gopter.DefaultTestParameters()
    parameters.MinSuccessfulTests = 100

    properties := gopter.NewProperties(parameters)

    properties.Property("database check returns correct status", prop.ForAll(
        func(dbState string) bool {
            switch dbState {
            case "working":
                return CheckDatabaseHealth(workingDB) == "ok"
            case "nil":
                return CheckDatabaseHealth(nil) == "error"
            case "failed":
                return CheckDatabaseHealth(failedDB) == "error"
            }
            return true
        },
        gen.OneConstOf("working", "nil", "failed"),
    ))

    properties.TestingRun(t)
}

Why property tests are essential:

  • Unit tests verify specific examples

  • Property tests verify universal truths

  • LLMs may miss edge cases that properties catch

From clean-architecture-reorganization/design.md:

**Property 6: Dependency rule enforcement**
*For any* Go source file in the repository, its import statements
should follow the dependency rule where entities import nothing
from other layers, use cases import only from entities, adapters
import from entities and use cases, and drivers import from any layer.

**Validates: Requirements 6.1, 6.2, 6.3, 8.1, 8.2, 8.3, 8.4**
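One way to make a property like this executable is to reduce it to a pure function over a file's import list, which a test can then apply to every Go file in the repository. A sketch, assuming layer names like `entities` and `usecases` (the actual layout may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// allowed maps each layer to the layers it may import from,
// per the dependency rule quoted above.
var allowed = map[string][]string{
	"entities": {},
	"usecases": {"entities"},
	"adapters": {"entities", "usecases"},
	"drivers":  {"entities", "usecases", "adapters"},
}

// violations returns the internal imports of a file in `layer` that the
// dependency rule forbids. External and stdlib imports are out of scope.
func violations(layer string, imports []string) []string {
	var bad []string
	for _, imp := range imports {
		i := strings.Index(imp, "/internal/")
		if i < 0 {
			continue // not an internal package
		}
		target := strings.Split(imp[i+len("/internal/"):], "/")[0]
		if target == layer {
			continue // same-layer imports are fine
		}
		ok := false
		for _, a := range allowed[layer] {
			if target == a {
				ok = true
				break
			}
		}
		if !ok {
			bad = append(bad, imp)
		}
	}
	return bad
}

func main() {
	fmt.Println(violations("entities", []string{"fmt", "example.com/app/internal/usecases"}))
}
```

A property-based test can walk the tree with `go/parser` in `ImportsOnly` mode, feed each file's imports to `violations`, and fail on any non-empty result.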

Layer 4: Integration Tests

Purpose: Verify components work together correctly

func TestJSONStore_Persistence(t *testing.T) {
    // Create temp directory
    tmpDir := t.TempDir()
    storePath := filepath.Join(tmpDir, "tasks.json")

    // Create store and add task
    store1, _ := NewJSONStore(storePath)
    task, _ := NewTask("Persistent task", PriorityLow)
    store1.Add(*task)

    // Create new store instance (simulates restart)
    store2, err := NewJSONStore(storePath)
    if err != nil {
        t.Fatalf("failed to create second store: %v", err)
    }

    // Verify task persisted
    tasks, _ := store2.GetAll()
    if len(tasks) != 1 {
        t.Errorf("expected 1 task, got %d", len(tasks))
    }
}

Maps to requirements:

### Requirement 5: Data Persistence

4. WHEN the application starts, THE System SHALL load tasks from Task_Store
5. WHEN the Task_Store file doesn't exist, THE System SHALL create it

Layer 5: End-to-End Tests

Purpose: Verify the complete system works as intended

From control-plane-health-endpoints/tasks.md:

- [x] 9. Write integration tests for failure scenarios
  - Test database failure scenario
    - Start Control Plane server
    - Stop database container
    - Make GET /ready request
    - Verify HTTP 503 response
    - Verify database check is "error"

The Verification Commands

mage test:unit          # Run unit tests (fast)
mage test:property      # Run property-based tests
mage test:integration   # Run integration tests (with testcontainers)
mage test:e2e           # Run end-to-end tests (with KIND)
mage test:all           # Run all tests
mage test:coverage      # Generate coverage report

Real Examples from Our Codebase

Example 1: Mage Build System Migration

The challenge: Migrate from Makefile to Mage while maintaining all functionality.

How specs helped:

  • 19 detailed requirements covering every target

  • Design document with exact interface signatures

  • 23 implementation tasks with checkboxes

Verification:

- [x] 23. Final Validation
  - Run full test suite (mage test:all)
  - Build for all platforms (mage build:all)
  - Generate all code (mage gen:all)
  - Validate all specs (mage validate:specs)
  - Test release process (mage release:dryRun)
  - Verify CI/CD workflows pass

Outcome: Complete migration with zero functionality loss, fully verified.

Example 2: CLEAN Architecture Reorganization

The challenge: Restructure entire codebase to follow CLEAN architecture.

How specs helped:

  • Property-based tests verify architectural constraints

  • Import restrictions enforced by linting rules

  • Clear migration path in tasks document

Key property test:

**Property 6: Dependency rule enforcement**
*For any* Go source file in the repository, its import statements
should follow the dependency rule...

Outcome: Architecture constraints are automatically verified on every commit.

Example 3: Authentication Middleware

The challenge: Implement JWT auth with development bypass mode.

How specs helped:

  • Clear requirements for production vs development behavior

  • Design specifies use of go-chi/jwtauth (no custom crypto)

  • Tests verify both modes work correctly

Key requirement:

1. WHEN running in development mode with X-Test-Namespace header present,
   THE Authentication_Middleware SHALL use the header value as the namespace
2. WHEN running in development mode without X-Test-Namespace header,
   THE Authentication_Middleware SHALL use a default namespace "default"

Working with LLMs: The Spec-Test-Verify Loop

Here's the workflow for LLM-assisted development:

Step 1: Write the Spec First

Before engaging the LLM:

  1. Write requirements.md with testable acceptance criteria

  2. Write design.md with architecture and interfaces

  3. Write tasks.md with implementation checklist

Step 2: Share Context with the LLM

You: "I need to implement the control-plane-health-endpoints feature.
     Here are the specs: [paste requirements.md, design.md, tasks.md]
     Please implement task 1."

Step 3: LLM Implements

The LLM follows:

  • Requirements for behavior

  • Design for architecture

  • Tasks for scope

Step 4: Verify with Tests

# After LLM generates code
mage quality:lint      # Does it pass linting?
mage quality:fmt       # Is it formatted correctly?
mage test:unit         # Do unit tests pass?
mage test:property     # Do properties hold?

Step 5: Iterate if Needed

If verification fails:

You: "Task 1 is failing property test 5. The requirement says:
     'WHEN the database connection is nil, THE System SHALL return error'
     But the implementation returns 'ok'. Please fix."

The LLM has specific feedback to address.

Step 6: Mark Complete and Continue

- [x] 1. Create health check logic file ← Mark done
- [x] 1.1 Write property test          ← Mark done
- [ ] 2. Define response types         ← Next task

Specs Provide Insight: The "Why" Behind the "What"

Specs aren't just for implementation—they're permanent records of decision-making.

Understanding Intent

Six months from now, when someone asks "why does the cache check return 'ok' when the cache is nil?":

For example:

5. WHEN the cache connection is nil, THE System SHALL mark cache status
   as "ok" (cache is optional)

The spec explains the requirement. The design explains the rationale:

For example:

### Cache Connectivity Errors

**Scenarios**:
- Cache connection is nil → Return "ok" (cache is optional)
- Cache ping fails → Return "error" status

**Rationale**: The cache is used for performance optimization, not core
functionality. A missing cache should not prevent the service from
being marked as ready.
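The rationale above maps directly onto a check function. A sketch, with `Pinger` standing in for whatever cache client the real code uses:

```go
package main

import "fmt"

// Pinger is a minimal stand-in for a cache client (e.g. a Redis wrapper).
type Pinger interface {
	Ping() error
}

// CheckCacheHealth follows the documented scenarios: a nil cache is "ok"
// because the cache is optional; a failing ping is "error".
func CheckCacheHealth(c Pinger) string {
	if c == nil {
		return "ok" // missing cache must not block readiness
	}
	if err := c.Ping(); err != nil {
		return "error"
	}
	return "ok"
}

func main() {
	fmt.Println(CheckCacheHealth(nil)) // prints "ok"
}
```

Without the spec's rationale, `nil → "ok"` looks like a bug; with it, the asymmetry against the database check is clearly deliberate.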

Debugging with Specs

When a bug is reported:

  1. Find the relevant spec

  2. Check if the requirement covers this case

  3. If yes → implementation bug (fix the code)

  4. If no → spec gap (update spec, then code)

Onboarding with Specs

New team members can:

  1. Read specs to understand what the system does

  2. Read designs to understand how it's built

  3. Read tasks to see what was verified

  4. Use git log on specs to see evolution

Best Practices and Anti-Patterns

Best Practices

1. Write Specs Before Implementation

Even if the LLM could "just figure it out," specs ensure you get what you actually need.

2. Make Every Requirement Testable

Bad:  "The system should be fast"
Good: "THE System SHALL respond within 100 milliseconds"

3. Include Verification in Tasks

Every implementation task should have corresponding test tasks:

- [ ] 3. Implement feature X
- [ ] 3.1 Write unit tests for X
- [ ] 3.2 Write property test for X

4. Run Full Verification Before Merge

mage quality:all && mage test:all

5. Update Specs When Requirements Change

Specs must stay synchronized with code. If a PR changes behavior, it must update the spec.

6. Reference Requirements in Tests

// Requirement 1.3: timestamp field in RFC3339 format
func TestHealthResponse_TimestampFormat(t *testing.T) {
    ...
}

Anti-Patterns to Avoid

1. Writing Specs After Implementation

This defeats the purpose. Specs guide implementation, not document it after the fact.

2. Skipping Tests "Because the LLM Seems Right"

LLMs are confident even when wrong. Always verify.

3. Vague Acceptance Criteria

Bad:  "The system should handle errors gracefully"
Good: "WHEN the database query fails, THE System SHALL return HTTP 503"

4. Not Running Linting

LLM output often has subtle issues that linting catches.

5. Orphan Tests

Every test should trace to a requirement. No requirement? No test needed.

6. Treating Specs as Separate from Code

Specs live in the repo, are reviewed in PRs, and evolve with the code.

Conclusion

Spec-driven development with LLMs is about precision and verification:

  1. Specs define success with testable acceptance criteria

  2. LLMs implement following explicit guidance

  3. Tests verify the implementation matches the spec

  4. Versioned specs provide permanent, searchable history

The result:

  • Reliable code that does exactly what you specified

  • Comprehensive tests that catch regressions

  • Living documentation that explains why code exists

  • Efficient LLM collaboration with clear contracts

Specs aren't overhead—they're the foundation of quality. Welcome to the team!

Last updated: 2026-01-11