
# Feedback Loops for Autonomous AI Development: How We Built a Self-Improving Development Environment

*What if your AI assistant could work while you sleep?*

We’re living through a paradigm shift in software development. Just two years ago, AI coding assistants could suggest completions and generate snippets. Today, we’re running **100% AI-generated platforms** with remarkable success. The breakthrough wasn’t just better models—it was giving AI the infrastructure to iterate, fail, learn, and improve autonomously.

This article explains how we built a development environment where Claude Code has full access to Docker containers, enabling it to build, test, troubleshoot, and gather feedback automatically—without human interaction. The whole system is isolated from production data, and bugs are tracked in a database that AI can query naturally. Here’s our methodology.

## The Architecture: Controlled Autonomy Through Isolation

The foundation of autonomous AI development is **controlled autonomy**—giving AI enough power to work independently while maintaining strict boundaries that protect your systems.

### The Standalone Server Approach

We dedicate a standalone server specifically for AI development work. This isn’t the developer’s laptop or a production server—it’s a purpose-built environment where AI agents can operate freely. Key principles:

**Controlled Egress**: The server allows online searches and Docker image fetching. AI needs access to documentation, package registries, and base images to work effectively. However, there’s no access to internal data except through purpose-built internal APIs.

**RBAC Everywhere**: Role-Based Access Control is implemented throughout the system. Specific agents can only perform specific tasks. A code generation agent doesn’t need access to production monitoring data, so it doesn’t get it.

**Limited Power Principle**: AI agents run as Docker users, not root. They can spin up containers, build images, and execute code—but they can’t compromise the host system. No sudo access, ever.

**Hourly Backups**: Despite all safeguards, things can go wrong. Automated hourly backups provide a safety net for recovery. If an AI agent takes the codebase in a catastrophically wrong direction, we can roll back.

**Internal API Access Only**: When AI needs to interact with company systems, it goes through purpose-built interfaces. No direct database access to production data, no raw file system access to sensitive directories.

This architecture enables something powerful: you can let AI run unsupervised for hours, knowing it operates within well-defined boundaries.
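To make the limited-power principle concrete, here is a minimal sketch of how such a sandboxed container might be launched with the Docker SDK for Python. The image name, user ID, and network name are illustrative placeholders, not our actual configuration.

```python
# A minimal sketch of the "limited power" principle using the Docker SDK for
# Python (docker-py). Image, user ID, and network names are illustrative.
import docker

client = docker.from_env()

output = client.containers.run(
    "ai-dev-env:latest",                 # purpose-built development image (assumed name)
    command="bash run_tests.sh",
    user="1000:1000",                    # non-root user inside the container
    cap_drop=["ALL"],                    # drop all Linux capabilities
    security_opt=["no-new-privileges"],
    network="ai-sandbox-net",            # egress-controlled bridge network (assumed)
    mem_limit="4g",
    pids_limit=512,
    remove=True,                         # clean up after the run
)
print(output.decode())
```

The agent can build, test, and iterate inside containers like this one, but it never gains root on the host and never reaches internal systems except through the purpose-built APIs.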

## The Feedback Loop Engine: Fully Automated Development Cycles

The heart of our system is the feedback loop—a fully automated cycle where AI iterates on its own work:

**Pull images → Develop code → Test → Gather errors (backend + frontend) → Analyze → Fix → Repeat**

### How It Works

When given a task, Claude Code doesn’t just write code and stop. It:

1. **Pulls necessary Docker images** for the development environment
2. **Develops the code** according to specifications
3. **Runs the full test suite** in containers
4. **Gathers error output** from both backend and frontend systems
5. **Analyzes the failures** to understand what went wrong
6. **Fixes the code** based on that analysis
7. **Repeats** until tests pass and quality gates are met

This entire cycle runs without human intervention. The AI might iterate 5 times or 50 times—whatever it takes to produce working code.
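As a rough illustration, the cycle can be driven by a small orchestration loop like the sketch below. It assumes a containerized test suite exposed as a docker compose service called `tests`, and it leaves the call to the coding agent as a hypothetical stub.

```python
# A simplified sketch of the develop -> test -> analyze -> fix loop. The
# run_tests() helper shells out to the containerized test suite; request_fix()
# stands in for whichever agent API you actually use (hypothetical).
import subprocess

MAX_ITERATIONS = 50

def run_tests() -> tuple[bool, str]:
    """Run the containerized test suite and return (passed, combined output)."""
    result = subprocess.run(
        ["docker", "compose", "run", "--rm", "tests"],
        capture_output=True, text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr

def request_fix(error_output: str) -> None:
    """Hypothetical: hand the error output to the coding agent for another pass."""
    raise NotImplementedError("wire this to your agent of choice")

for iteration in range(1, MAX_ITERATIONS + 1):
    passed, output = run_tests()
    if passed:
        print(f"Quality gates met after {iteration} iteration(s)")
        break
    request_fix(output)          # analyze failures, apply a fix, try again
else:
    print("Loop limit reached: escalate to a human (the Insanity Rule)")
```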

### The Insanity Rule

Early in our journey, we discovered an important principle that we now call **The Insanity Rule**: *“You can’t keep doing the same thing over and over and expect different results.”*

When AI struggles with a request—iterating repeatedly without progress—it’s often a signal that the request itself is the problem. Maybe the requirements are contradictory. Maybe the architectural constraints don’t allow for the desired solution. Maybe the original demand simply wasn’t the right one.

This is where human developers step in—not to write the code, but to **redirect the approach**. Sometimes the most valuable intervention is reframing the problem.

## Multi-Model Collaboration: The AI Team

We don’t rely on a single AI model. Our development environment uses multiple models in specific roles:

### Claude Code as Primary Developer

Claude Code handles the actual coding work—writing features, fixing bugs, refactoring code, creating tests. Its deep reasoning capabilities and ability to hold context across long sessions make it ideal for the core development loop.

### Second Opinion Models

Before major decisions or after completing significant work, we consult other models for second opinions:

– **Gemini** for alternative perspectives on architecture
– **OpenAI models** for research and feature ideation
– **Grok** for unconventional approaches
– **GitHub Copilot** for quick syntax suggestions

This multi-model approach catches blind spots. Each model has different training data and reasoning patterns—disagreements between models often highlight areas that need more thought.

### n8n for Orchestration

We use n8n to orchestrate specific automated workflows:

– **Competitor research**: Automated analysis of how competitors solve similar problems
– **Article development**: Content generation workflows (like this very article)
– **Team insights**: Aggregating multiple AI perspectives on a single problem

### Ollama for Custom Training

Here’s a unique aspect of our setup: we fine-tune local models, served through Ollama, on our own application datasets. The same data that powers our apps can train local models that understand our specific domain, coding patterns, and business logic.

## The Database as Knowledge Junction

AI excels at querying databases. While large language models can struggle with unstructured context, they’re remarkably good at formulating SQL queries and working with relational data. We exploit this by making **the database a junction between application data and AI work**.

### The Knowledge Schema

Our AI development environment includes dedicated database tables:

**Bug Tracker Table**: Every bug the AI encounters—error messages, stack traces, attempted fixes, successful solutions. Before tackling a new error, AI queries past solutions.

**Feature Request Table**: Requirements, acceptance criteria, implementation notes. AI can reference what was asked for and what was delivered.

**Asset Information Table**: Real-time data about the systems AI is working with—container states, service health, deployment configurations.

**Datasets for AI Operations**: Sample data, test fixtures, validation sets. AI can run operations against realistic data.

**Tasks Table**: A queue of automated test runs against containers. AI can schedule work and check results asynchronously.
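A hedged sketch of what two of these tables might look like, using SQLite for brevity; the column names are illustrative rather than our production schema:

```python
# A reduced sketch of the knowledge schema: the bug tracker and the task queue.
import sqlite3

conn = sqlite3.connect("ai_knowledge.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS bug_tracker (
    id            INTEGER PRIMARY KEY,
    error_message TEXT NOT NULL,
    stack_trace   TEXT,
    attempted_fix TEXT,
    resolution    TEXT,
    resolved_at   TIMESTAMP
);
CREATE TABLE IF NOT EXISTS tasks (
    id            INTEGER PRIMARY KEY,
    container     TEXT NOT NULL,
    test_command  TEXT NOT NULL,
    status        TEXT DEFAULT 'queued',   -- queued / running / done
    result        TEXT
);
""")

# Before tackling a new error, the agent looks for past solutions:
rows = conn.execute(
    "SELECT resolution FROM bug_tracker "
    "WHERE error_message LIKE ? AND resolution IS NOT NULL",
    ("%ModuleNotFoundError%",),
).fetchall()
```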

### The Training Loop

This architecture creates a powerful feedback mechanism: the same data used by the application can fine-tune the local models we serve through Ollama. Each bug fixed becomes training data. Each successful solution reinforces good patterns. The system genuinely improves over time.
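As an illustration of that loop, resolved bug-tracker rows can be exported as prompt/completion pairs for a local fine-tuning pipeline. The exact format depends on your toolchain; JSONL is a common denominator, and the field names below are assumptions:

```python
# A sketch of turning resolved bugs into fine-tuning examples (JSONL format).
import json
import sqlite3

conn = sqlite3.connect("ai_knowledge.db")
rows = conn.execute(
    "SELECT error_message, stack_trace, resolution FROM bug_tracker "
    "WHERE resolution IS NOT NULL"
)

with open("bugfix_training.jsonl", "w") as f:
    for error_message, stack_trace, resolution in rows:
        example = {
            "prompt": f"Error: {error_message}\nStack trace: {stack_trace}\nPropose a fix.",
            "completion": resolution,
        }
        f.write(json.dumps(example) + "\n")
```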

## Context Strategy: Teaching AI Your Codebase

Raw capability isn’t enough—AI needs to understand your specific codebase, patterns, and conventions.

### CLAUDE.md Files

We maintain CLAUDE.md files with project-specific instructions: coding standards, architectural decisions, common patterns, things to avoid. These files give AI immediate context without lengthy explanations.

### Component-Based Architecture

A critical insight: **keep files small**. AI struggles with 2500-line monolithic files. By maintaining a component-based architecture where each file has a single responsibility, we make the codebase more navigable for both AI and humans.

### Architecture Planning First

Before starting any project, we develop the architectural foundation. This isn’t optional—it’s the **first cornerstone** before any code gets written. Clear architecture gives AI clear constraints. Vague architecture leads to AI making decisions that may not align with your vision.

## Security: The Multi-Layered Approach

Letting AI write code that runs on your systems requires robust security measures.

### Multi-Agentic Code Auditing

Every significant piece of AI-generated code gets reviewed by multiple AI models: Claude, Gemini, GPT, and Ollama. Each audits the code for security vulnerabilities, logic errors, and edge cases. Agreement builds confidence; disagreement triggers human review.
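A hypothetical sketch of this audit step is shown below: the same prompt goes to several reviewer models, and any disagreement escalates to a human. The `reviewers` mapping stands in for whichever model APIs you actually use.

```python
# Hypothetical multi-model audit: ask each reviewer the same question and
# treat disagreement as a signal for human review.
from typing import Callable

def audit(code: str, reviewers: dict[str, Callable[[str], str]]) -> None:
    verdicts = {}
    for name, ask in reviewers.items():
        verdicts[name] = ask(
            "Audit this code for security vulnerabilities, logic errors, "
            "and unhandled edge cases. Answer PASS or FAIL with reasons:\n" + code
        )
    # Take the first word of each response as the verdict (PASS / FAIL).
    decisions = {name: v.strip().split()[0].upper() for name, v in verdicts.items()}
    if len(set(decisions.values())) > 1:
        print("Models disagree: escalate to human review", decisions)
    elif "FAIL" in decisions.values():
        print("Unanimous FAIL: block the merge", decisions)
    else:
        print("Unanimous PASS: proceed to human review for production code")
```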

### Pentesting with AI

We run automated security testing against AI-generated code. The same models that can write code can also think like attackers, identifying potential exploits before they reach production.

### Manual Dependency Auditing

Here’s an important lesson: **AI loves outdated libraries**. Training data has a cutoff, and models often suggest dependencies that were popular when they were trained but have since been deprecated or found to have vulnerabilities. Human developers manually verify dependencies, images, and libraries.

### SBOM Approach

Software Bill of Materials is mandatory. Every AI-generated project includes comprehensive SBOMs documenting all dependencies, making vulnerability tracking and compliance auditing straightforward.

### NATO Country Code Restriction

We restrict dependencies to those originating from NATO member countries. No Russian or Chinese code enters our supply chain. This protects against potential supply chain attacks and aligns with our security posture.
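Combining the SBOM and allowlist ideas, a scanner can walk a CycloneDX SBOM and flag components whose origin is unknown or outside the allowlist. CycloneDX has no standard country field, so this sketch assumes components are enriched with a custom `x-origin-country` property; the allowlist shown is a partial, illustrative subset.

```python
# Flag SBOM components whose origin country is unknown or not allowlisted.
import json

ALLOWED_COUNTRIES = {"US", "GB", "DE", "FR", "IT", "NL", "NO", "PL", "CA"}  # illustrative subset

with open("sbom.cdx.json") as f:
    sbom = json.load(f)

for component in sbom.get("components", []):
    country = None
    for prop in component.get("properties", []):
        if prop.get("name") == "x-origin-country":   # custom enrichment, assumed
            country = prop.get("value")
    if country not in ALLOWED_COUNTRIES:
        print(f"REVIEW: {component.get('name')} {component.get('version')} "
              f"(origin: {country or 'unknown'})")
```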

### Placeholder Detection

AI sometimes generates placeholder code—TODOs, incomplete implementations, hardcoded values meant to be replaced. Automated scanning identifies these before code merges.
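A minimal sketch of such a scan, with illustrative patterns, might look like this:

```python
# Placeholder scanning before merge: flag TODOs, stubbed implementations,
# and suspicious hardcoded values. Patterns and paths are illustrative.
import pathlib
import re
import sys

PATTERNS = [
    re.compile(r"\bTODO\b|\bFIXME\b"),
    re.compile(r"raise NotImplementedError"),
    re.compile(r"(api_key|password|secret)\s*=\s*[\"'][^\"']+[\"']", re.IGNORECASE),
]

findings = []
for path in pathlib.Path("src").rglob("*.py"):
    for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
        if any(p.search(line) for p in PATTERNS):
            findings.append(f"{path}:{lineno}: {line.strip()}")

if findings:
    print("\n".join(findings))
    sys.exit(1)   # fail the merge gate
```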

### Extensive Test Base

Every system requires comprehensive tests. AI-generated code isn’t trusted until it passes systematic testing—unit tests, integration tests, end-to-end tests.

### Human Code Review

Despite all automation, **human code review remains essential**. It’s slower, but it catches things automated systems miss. For production code, a human developer reviews and approves.

## Human-AI Boundaries: When Humans Must Intervene

Autonomous doesn’t mean unsupervised. Clear boundaries define when humans step in.

### Live Asset Execution

When AI-generated code needs to run on live assets—production servers, customer data, real infrastructure—**human authorization is required**. We’ve built this into our tooling with Remedy AI, our remediation agent.

Remedy AI can view data from your fleet and execute code on live assets. But it always requires human authorization first. This creates complete accountability: every action on production systems has an approval trail.

### When AI Gets Stuck

As mentioned with the Insanity Rule, when AI iterates without progress, humans redirect. Sometimes architectural choices leave no margin for the desired solution. Sometimes the original demand was simply wrong.

### Architectural Decisions

Major architectural decisions—choosing frameworks, defining data models, establishing patterns—benefit from human judgment. AI can propose options and analyze trade-offs, but strategic decisions remain with human architects.

## Real-World Example: Remedy AI

Remedy AI demonstrates our accountability pattern for production systems:

– **Fleet visibility**: AI can view data across your entire infrastructure
– **Remediation capability**: AI can execute code on live assets to fix issues
– **Authorization requirement**: Every execution requires explicit human authorization
– **Audit trail**: Complete logging of who authorized what and when

This pattern—capability with authorization—lets us leverage AI’s speed and consistency while maintaining the oversight production systems require.
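A simplified sketch of the pattern (not Remedy AI’s actual implementation) might gate every live-asset action behind an approval call and write an audit record before anything executes:

```python
# Capability-with-authorization sketch: execution on live assets is gated
# behind an explicit, logged human approval. Backends are hypothetical stubs.
import datetime
import json

class AuthorizationRequired(Exception):
    pass

def request_authorization(action: str, asset: str) -> str | None:
    """Hypothetical: post the request to an approval queue and wait for a decision."""
    raise NotImplementedError("wire this to your ticketing / approval system")

def execute_on_live_asset(action: str, asset: str, requested_by: str) -> None:
    approver = request_authorization(action, asset)
    if not approver:
        raise AuthorizationRequired(f"No human approval for {action!r} on {asset}")
    audit_entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "asset": asset,
        "action": action,
        "requested_by": requested_by,
        "approved_by": approver,
    }
    with open("remediation_audit.log", "a") as log:
        log.write(json.dumps(audit_entry) + "\n")
    # ... only now run the remediation code on the live asset ...
```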

## Measuring Success & ROI

How do you know if autonomous AI development is working?

### Iteration Metrics

Track how many iterations AI needs to complete tasks. Decreasing iteration counts over time indicate the system is learning and improving.

### Autonomous Completion Rates

What percentage of tasks does AI complete fully autonomously versus needing human intervention? For us, complex features often need guidance, but routine bug fixes and small features typically complete without intervention.

### API Costs vs. Human Time

Calculate the cost of API calls against the human time saved. Even at current token prices, AI completing routine tasks often costs less than the equivalent developer time.
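A back-of-the-envelope comparison is enough to start; the numbers below are made up purely to show the shape of the calculation:

```python
# Illustrative cost comparison with assumed numbers, not real pricing.
tokens_per_task  = 2_000_000   # total tokens an agent burns on a routine task
price_per_mtoken = 10.00       # blended $ per million tokens (assumed)
developer_rate   = 80.00       # loaded $ per developer hour (assumed)
hours_saved      = 1.5         # human time the task would have taken

api_cost   = tokens_per_task / 1_000_000 * price_per_mtoken
human_cost = developer_rate * hours_saved
print(f"API cost: ${api_cost:.2f}  vs  human cost: ${human_cost:.2f}")
```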

### Commercial Benchmarks

Compare your autonomous development metrics against commercial solutions. Devin claims “zero human keystrokes” for bug fixes. OpenAI Codex runs in cloud sandboxes. How does your self-hosted solution compare?

## Failure Modes & Recovery

What happens when things go wrong?

### Loop Detection

When AI iterates more than a threshold number of times without progress, the system flags for human review. The Insanity Rule kicks in: repeated failure indicates a problem with the approach, not just the execution.
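One way to implement this, sketched with assumed thresholds, is to hash a normalized error signature each iteration and stop when the same signature keeps repeating:

```python
# Loop detection sketch: if several consecutive iterations fail with the same
# normalized error signature, stop and flag for human review.
import hashlib
import re
from collections import deque

STALL_THRESHOLD = 3
recent_signatures: deque[str] = deque(maxlen=STALL_THRESHOLD)

def signature(error_output: str) -> str:
    """Normalize volatile details (paths, line numbers, addresses) before hashing."""
    normalized = re.sub(r"line \d+|0x[0-9a-f]+|/\S+", "<x>", error_output)
    return hashlib.sha256(normalized.encode()).hexdigest()

def record_failure(error_output: str) -> bool:
    """Return True when the loop has stalled and a human should take over."""
    recent_signatures.append(signature(error_output))
    return (len(recent_signatures) == STALL_THRESHOLD
            and len(set(recent_signatures)) == 1)
```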

### Hourly Backups

Every hour, the development environment backs up. If AI takes the codebase in a catastrophically wrong direction, we can roll back. This safety net enables aggressive experimentation.
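A minimal version of this safety net is a scheduled job (cron or a systemd timer) that snapshots the workspace and prunes old archives; the paths and retention below are illustrative:

```python
# Hourly snapshot sketch: tar the workspace and keep the most recent 48 archives.
import pathlib
import tarfile
import time

WORKSPACE  = pathlib.Path("/srv/ai-dev/workspace")   # illustrative path
BACKUP_DIR = pathlib.Path("/srv/ai-dev/backups")
KEEP       = 48

BACKUP_DIR.mkdir(parents=True, exist_ok=True)
stamp = time.strftime("%Y%m%d-%H%M%S")
with tarfile.open(BACKUP_DIR / f"workspace-{stamp}.tar.gz", "w:gz") as tar:
    tar.add(WORKSPACE, arcname="workspace")

# Prune everything older than the most recent KEEP archives.
for old in sorted(BACKUP_DIR.glob("workspace-*.tar.gz"))[:-KEEP]:
    old.unlink()
```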

### Human Escalation

Some failures automatically escalate to human developers: security warnings, resource exhaustion, repeated test failures. The system knows when to ask for help.

## Comparison to Commercial Solutions

How does this approach compare to commercial alternatives?

**Devin** claims autonomous development with zero human keystrokes. Our experience suggests that’s realistic for some tasks but not all—complex features still benefit from human direction.

**OpenAI Codex** runs in cloud sandboxes. We prefer self-hosted for data sovereignty and security control. Our code never leaves our infrastructure.

**GitHub Copilot Workspace** offers collaborative AI development. Our approach is more autonomous—AI works independently rather than waiting for human guidance at each step.

Open-source projects like **ralph-claude-code** and **claude-code-sandbox** demonstrate similar patterns. The approach is gaining traction across the industry.

## Future Directions

Where is this heading?

### Production Monitoring Feedback

Eventually, production monitoring data should feed back into development. When a bug appears in production, AI should automatically investigate, propose fixes, and create PRs.

### Cross-Project Pattern Learning

AI that works across multiple projects should recognize patterns and apply successful solutions from one codebase to another.

### Tighter CI/CD Integration

Autonomous AI development should integrate seamlessly with CI/CD pipelines—AI creates PRs, pipelines test them, AI responds to feedback, all without human involvement for routine changes.

### Encryption for Production

Our current approach works great in development. For production workloads with sensitive data, we need a proper encryption layer. This is our next major infrastructure investment.

## The Key Takeaway

If there’s one insight from our journey, it’s this:

**Sandboxing with limited power + feedback loops + hourly backups = decisive AI autonomy**

Give AI a safe place to work. Let it iterate and learn from failures. Maintain backups in case something goes wrong. With these foundations in place, you can confidently let AI work autonomously—and be surprised by what it accomplishes.

The future of development isn’t AI replacing developers. It’s developers directing AI that works tirelessly, learns continuously, and handles the routine so humans can focus on what matters.

*This article was developed using the methodology it describes—AI-generated with human oversight and editorial direction.*

## References

– [Docker Sandboxes: A New Approach for Coding Agent Safety](https://www.docker.com/blog/docker-sandboxes-a-new-approach-for-coding-agent-safety/)
– [Running Claude Code Agents in Docker Containers](https://medium.com/@dan.avila7/running-claude-code-agents-in-docker-containers-for-complete-isolation-63036a2ef6f4)
– [ralph-claude-code: Autonomous AI Development Loop](https://github.com/frankbria/ralph-claude-code)
– [claude-code-sandbox: Full Async Agentic Workflows](https://github.com/textcortex/claude-code-sandbox)
– [IBM: What Is AI Agent Memory?](https://www.ibm.com/think/topics/ai-agent-memory)
– [McKinsey: One Year of Agentic AI](https://www.mckinsey.com/capabilities/quantumblack/our-insights/one-year-of-agentic-ai-six-lessons-from-the-people-doing-the-work)
– [The Autonomous Code Agents Are Coming](https://medium.com/@khayyam.h/the-autonomous-code-agents-are-coming-are-you-ready-for-self-debugging-self-evolving-ai-82067b942c0a)
– [9 Agentic AI Workflow Patterns Transforming AI Agents in 2025](https://www.marktechpost.com/2025/08/09/9-agentic-ai-workflow-patterns-transforming-ai-agents-in-2025/)
