Automating Workflows with AI Agents Link to heading
The latter half of 2024 has brought significant advances in AI agent capabilities, particularly with Anthropic’s Model Context Protocol (MCP) and enhanced tool-using abilities in Claude. After months of working with multi-step prompts and manual context passing, I’m finally experimenting with true autonomous agents: systems that can use tools, make decisions, and execute complex workflows with minimal supervision. This represents a major leap from the copy-paste LLM interactions I was doing throughout most of 2024.
Evolution from Assistants to Agents Link to heading
The progression I’ve witnessed throughout 2024 has been remarkable:
- Early 2024 (Post-Copilot): GitHub Copilot with Roo Code in VS Code for code completion, plus ChatGPT for explanations
- Mid 2024: Multi-step prompts with manual context management between API calls
- Late 2024: True autonomous agents, with MCP enabling seamless tool integration
The key difference is agency: moving from “ask and copy-paste” to “describe the goal and let the agent figure out how to accomplish it.”
Agent Architecture Patterns Link to heading
The ReAct Pattern (Reasoning + Acting) Link to heading
The most successful agent pattern I’ve implemented follows the ReAct framework, alternating between reasoning about what to do next and taking actions:
from langchain.agents import AgentExecutor, create_react_agent
from langchain.tools import BaseTool, Tool
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
import subprocess
import requests
import os
class GitCommandTool(BaseTool):
name = "git_command"
description = "Execute git commands in the current repository"
def _run(self, command: str) -> str:
"""Execute a git command and return the output"""
try:
result = subprocess.run(
f"git {command}",
shell=True,
capture_output=True,
text=True,
timeout=30
)
if result.returncode == 0:
return f"Success: {result.stdout}"
else:
return f"Error: {result.stderr}"
except subprocess.TimeoutExpired:
return "Error: Command timed out"
except Exception as e:
return f"Error: {str(e)}"
class FileOperationTool(BaseTool):
name = "file_operations"
description = "Read, write, or modify files in the project"
def _run(self, operation: str) -> str:
"""Perform file operations"""
try:
parts = operation.split('|')
action = parts[0]
if action == "read":
file_path = parts[1]
with open(file_path, 'r') as f:
content = f.read()
return f"File content:\n{content}"
elif action == "write":
file_path = parts[1]
content = parts[2]
with open(file_path, 'w') as f:
f.write(content)
return f"Successfully wrote to {file_path}"
elif action == "append":
file_path = parts[1]
content = parts[2]
with open(file_path, 'a') as f:
f.write(content)
return f"Successfully appended to {file_path}"
            else:
                return f"Error: Unknown action '{action}'"
        except Exception as e:
            return f"Error: {str(e)}"
class APITestTool(BaseTool):
name = "api_test"
description = "Test API endpoints and return response status"
def _run(self, request_info: str) -> str:
"""Test API endpoints"""
try:
parts = request_info.split('|')
method = parts[0].upper()
url = parts[1]
if method == "GET":
response = requests.get(url, timeout=10)
elif method == "POST":
data = parts[2] if len(parts) > 2 else {}
response = requests.post(url, json=data, timeout=10)
else:
return f"Unsupported method: {method}"
return f"Status: {response.status_code}, Response: {response.text[:200]}"
except Exception as e:
return f"Error: {str(e)}"
class DeploymentAgent:
def __init__(self, openai_api_key):
self.llm = ChatOpenAI(
openai_api_key=openai_api_key,
model="gpt-4",
temperature=0.1
)
self.tools = [
GitCommandTool(),
FileOperationTool(),
APITestTool(),
Tool(
name="shell_command",
description="Execute shell commands (use carefully)",
func=self._execute_shell_command
)
]
# Create the agent prompt
self.prompt = PromptTemplate(
input_variables=["tools", "tool_names", "input", "agent_scratchpad"],
template="""
You are an AI agent responsible for deploying web applications safely.
Your workflow should typically follow these steps:
1. Check current git status and branch
2. Run tests to ensure code quality
3. Build the application if needed
4. Deploy to staging environment
5. Run integration tests against staging
6. If all tests pass, deploy to production
7. Verify production deployment
Available tools:
{tools}
Tool names: {tool_names}
IMPORTANT SAFETY RULES:
- Always run tests before deploying
- Never deploy if tests are failing
- Always deploy to staging before production
- Verify each step before proceeding
- If anything goes wrong, stop and report the issue
Use this format:
Thought: [what you need to do next]
Action: [tool name]
Action Input: [tool input]
Observation: [tool output]
When you're done, respond with:
Final Answer: [summary of what was accomplished]
Human: {input}
{agent_scratchpad}
"""
)
# Create the agent
self.agent = create_react_agent(
self.llm,
self.tools,
self.prompt
)
self.agent_executor = AgentExecutor(
agent=self.agent,
tools=self.tools,
verbose=True,
max_iterations=10,
handle_parsing_errors=True
)
def _execute_shell_command(self, command: str) -> str:
"""Execute shell command with safety checks"""
# Safety checks - prevent dangerous commands
dangerous_commands = [
'rm -rf', 'sudo rm', 'format', 'fdisk',
'chmod 777', 'chown -R', '>>', 'dd if='
]
if any(dangerous in command.lower() for dangerous in dangerous_commands):
return "Error: Dangerous command blocked for safety"
try:
result = subprocess.run(
command,
shell=True,
capture_output=True,
text=True,
timeout=60
)
if result.returncode == 0:
return f"Success: {result.stdout}"
else:
return f"Error: {result.stderr}"
except Exception as e:
return f"Error: {str(e)}"
def deploy_application(self, project_path: str, environment: str = "staging"):
"""Deploy application with the agent"""
os.chdir(project_path)
task = f"""
Deploy the application in the current directory to {environment}.
Project details:
- This is a Node.js application with package.json
- Tests can be run with 'npm test'
- Build with 'npm run build'
- Deploy script available as 'npm run deploy:{environment}'
Follow the safety workflow and report any issues.
"""
return self.agent_executor.invoke({"input": task})
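Invoking the deployment agent looks roughly like this. It’s a minimal usage sketch: the project path is a placeholder, and I’m assuming the OpenAI key lives in an environment variable:

import os

agent = DeploymentAgent(openai_api_key=os.environ["OPENAI_API_KEY"])
result = agent.deploy_application("/path/to/node-app", environment="staging")

# AgentExecutor.invoke returns a dict; the agent's final answer is under "output"
print(result["output"])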
The Planning Agent Pattern Link to heading
For complex workflows, I’ve found success with agents that create detailed plans before execution:
import json

class PlanningAgent:
def __init__(self, openai_api_key):
self.llm = ChatOpenAI(openai_api_key=openai_api_key, model="gpt-4")
def create_plan(self, objective: str, constraints: list = None) -> dict:
"""Create a detailed execution plan"""
constraints_text = ""
if constraints:
constraints_text = f"Constraints:\n" + "\n".join(f"- {c}" for c in constraints)
planning_prompt = f"""
Create a detailed execution plan for this objective: {objective}
{constraints_text}
Your plan should include:
1. High-level strategy
2. Detailed steps with expected outcomes
3. Risk assessment and mitigation strategies
4. Success criteria
5. Rollback plan if things go wrong
Format as JSON:
{{
"strategy": "brief description of approach",
"steps": [
{{
"step": 1,
"description": "what to do",
"expected_outcome": "what should happen",
"tools_needed": ["list", "of", "tools"],
"risks": ["potential", "issues"],
"verification": "how to confirm success"
}}
],
"success_criteria": ["criteria", "for", "success"],
"rollback_plan": "what to do if things fail"
}}
"""
response = self.llm.invoke(planning_prompt)
return json.loads(response.content)
def execute_plan(self, plan: dict, tools: list) -> dict:
"""Execute the plan step by step"""
results = {
"plan": plan,
"execution_results": [],
"success": True,
"final_status": ""
}
for step in plan["steps"]:
print(f"Executing step {step['step']}: {step['description']}")
# Execute step with appropriate tools
step_result = self._execute_step(step, tools)
results["execution_results"].append({
"step": step["step"],
"result": step_result,
"success": step_result.get("success", False)
})
# If step fails, consider rollback
if not step_result.get("success", False):
print(f"Step {step['step']} failed. Considering rollback...")
results["success"] = False
# Ask agent whether to continue or rollback
continue_decision = self._should_continue(step, step_result, plan)
if not continue_decision:
print("Executing rollback plan...")
self._execute_rollback(plan["rollback_plan"], tools)
results["final_status"] = "Rolled back due to failure"
break
if results["success"]:
results["final_status"] = "Plan executed successfully"
return results
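A quick usage sketch; the objective and constraints here are placeholders, and execute_plan assumes the step-execution helpers (_execute_step, _should_continue, _execute_rollback) have been filled in:

import os

planner = PlanningAgent(openai_api_key=os.environ["OPENAI_API_KEY"])
plan = planner.create_plan(
    "Migrate the blog from Jekyll to Hugo",  # placeholder objective
    constraints=["Keep existing URLs working", "No downtime"]
)
print(json.dumps(plan, indent=2))

# planner.execute_plan(plan, tools=[GitCommandTool(), FileOperationTool()])
# would then walk the steps once the elided helpers are implemented.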
Practical Agent Applications Link to heading
Code Review and Refactoring Agent Link to heading
class CodeRefactoringAgent:
def __init__(self, openai_api_key):
self.llm = ChatOpenAI(openai_api_key=openai_api_key, model="gpt-4")
self.tools = [
Tool(
name="analyze_code",
description="Analyze code files for issues and improvement opportunities",
func=self._analyze_code
),
Tool(
name="refactor_code",
description="Apply refactoring changes to code files",
func=self._refactor_code
),
Tool(
name="run_tests",
description="Execute test suite to verify changes",
func=self._run_tests
),
Tool(
name="create_pull_request",
description="Create a pull request with the changes",
func=self._create_pull_request
)
]
def _analyze_code(self, file_path: str) -> str:
"""Analyze code for refactoring opportunities"""
with open(file_path, 'r') as f:
code = f.read()
analysis_prompt = f"""
Analyze this code for refactoring opportunities:
{code}
Look for:
1. Code duplication
2. Long methods/functions
3. Complex conditionals
4. Performance issues
5. Maintainability problems
Return JSON with specific issues and suggested improvements.
"""
response = self.llm.invoke(analysis_prompt)
return response.content
def _refactor_code(self, refactor_request: str) -> str:
"""Apply specific refactoring changes"""
# Parse the refactoring request
# Apply changes to files
# Return summary of changes
pass
def refactor_codebase(self, project_path: str, scope: str = "medium"):
"""Autonomously refactor a codebase"""
task = f"""
Refactor the codebase in {project_path} with {scope} scope changes.
Process:
1. Analyze all Python files for refactoring opportunities
2. Prioritise improvements by impact and risk
3. Apply safe refactoring changes one at a time
4. Run tests after each change
5. If tests fail, revert the change and try alternative approach
6. Create a summary report of all changes made
Focus on low-risk, high-impact improvements.
"""
        # Note: assumes self.agent_executor has been built from self.tools,
        # in the same way as the DeploymentAgent above.
        return self.agent_executor.invoke({"input": task})
Content Management Agent Link to heading
class ContentManagementAgent:
def __init__(self, openai_api_key):
self.llm = ChatOpenAI(openai_api_key=openai_api_key)
self.tools = [
Tool(
name="analyze_content_gaps",
description="Analyze existing content to identify gaps",
func=self._analyze_content_gaps
),
Tool(
name="generate_content",
description="Generate new content based on requirements",
func=self._generate_content
),
Tool(
name="optimise_seo",
description="Optimise content for search engines",
func=self._optimise_seo
),
Tool(
name="schedule_publication",
description="Schedule content for publication",
func=self._schedule_publication
)
]
def _analyze_content_gaps(self, content_directory: str) -> str:
"""Analyze existing content for gaps and opportunities"""
# Scan content directory
# Analyze topics, keywords, content types
# Identify missing areas
pass
def manage_content_calendar(self, goals: str, timeline: str):
"""Autonomously manage content creation and publication"""
task = f"""
Manage content creation for the next {timeline} with these goals: {goals}
Process:
1. Analyze existing content to identify gaps
2. Generate content calendar with topic suggestions
3. Create high-priority content pieces
4. Optimise all content for SEO
5. Schedule publication across appropriate channels
6. Generate performance tracking metrics
Focus on creating valuable, engaging content that meets the stated goals.
"""
        # As above, assumes self.agent_executor has been constructed from self.tools.
        return self.agent_executor.invoke({"input": task})
Agent Safety and Reliability Link to heading
Implementing Safeguards Link to heading
import os
import re
import subprocess

class SafetyLayer:
def __init__(self, allowed_commands=None, blocked_patterns=None):
self.allowed_commands = allowed_commands or []
self.blocked_patterns = blocked_patterns or [
r'rm\s+-rf',
r'sudo\s+rm',
r'chmod\s+777',
r'>\s*/dev/',
r'dd\s+if='
]
def validate_command(self, command: str) -> tuple[bool, str]:
"""Validate that a command is safe to execute"""
# Check against blocked patterns
for pattern in self.blocked_patterns:
if re.search(pattern, command, re.IGNORECASE):
return False, f"Command blocked by safety pattern: {pattern}"
# Check if command is in allowed list (if specified)
if self.allowed_commands:
command_base = command.split()[0]
if command_base not in self.allowed_commands:
return False, f"Command not in allowed list: {command_base}"
return True, "Command validated"
def sandbox_execution(self, command: str, working_dir: str = None):
"""Execute command in a sandboxed environment"""
is_safe, message = self.validate_command(command)
if not is_safe:
raise ValueError(f"Unsafe command: {message}")
        # Switch to the requested working directory, restoring it afterwards
        if working_dir:
            original_dir = os.getcwd()
            os.chdir(working_dir)
try:
result = subprocess.run(
command,
shell=True,
capture_output=True,
text=True,
timeout=30,
env={**os.environ, 'PATH': '/usr/local/bin:/usr/bin:/bin'}
)
return {
'success': result.returncode == 0,
'stdout': result.stdout,
'stderr': result.stderr,
'returncode': result.returncode
}
finally:
if working_dir:
os.chdir(original_dir)
Human Oversight Integration Link to heading
import uuid
from datetime import datetime

class HumanOversightAgent:
def __init__(self, approval_threshold=0.7):
self.approval_threshold = approval_threshold
self.pending_approvals = []
def requires_approval(self, action: str, confidence: float) -> bool:
"""Determine if an action requires human approval"""
high_risk_actions = [
'deploy to production',
'delete database',
'modify user permissions',
'send external communications',
'make financial transactions'
]
# High-risk actions always require approval
if any(risk_action in action.lower() for risk_action in high_risk_actions):
return True
# Low confidence actions require approval
if confidence < self.approval_threshold:
return True
return False
def request_approval(self, action: str, context: dict, confidence: float) -> str:
"""Request human approval for an action"""
approval_request = {
'id': str(uuid.uuid4()),
'action': action,
'context': context,
'confidence': confidence,
'timestamp': datetime.now(),
'status': 'pending'
}
self.pending_approvals.append(approval_request)
# In a real system, this would send notifications
print(f"Approval requested for: {action}")
print(f"Context: {context}")
print(f"Confidence: {confidence}")
# For demo purposes, return a mock approval
return "approved" # In practice, would wait for human input
Monitoring and Observability Link to heading
Agent Performance Metrics Link to heading
import time
from dataclasses import dataclass, field
from typing import List, Dict, Any
@dataclass
class AgentMetrics:
agent_id: str
start_time: float = field(default_factory=time.time)
end_time: float = 0
total_steps: int = 0
successful_steps: int = 0
failed_steps: int = 0
tools_used: List[str] = field(default_factory=list)
tokens_consumed: int = 0
cost_estimate: float = 0.0
errors: List[str] = field(default_factory=list)
def record_step(self, success: bool, tool_used: str = None, error: str = None):
self.total_steps += 1
if success:
self.successful_steps += 1
else:
self.failed_steps += 1
if error:
self.errors.append(error)
if tool_used:
self.tools_used.append(tool_used)
def finalise(self):
self.end_time = time.time()
def get_summary(self) -> Dict[str, Any]:
duration = self.end_time - self.start_time if self.end_time > 0 else time.time() - self.start_time
return {
'agent_id': self.agent_id,
'duration_seconds': duration,
'total_steps': self.total_steps,
'success_rate': self.successful_steps / self.total_steps if self.total_steps > 0 else 0,
'tools_used': len(set(self.tools_used)),
'tokens_consumed': self.tokens_consumed,
'cost_estimate': self.cost_estimate,
'error_count': len(self.errors)
}
class MonitoredAgent:
def __init__(self, base_agent, metrics_collector):
self.base_agent = base_agent
self.metrics = metrics_collector
def execute_with_monitoring(self, task: str):
"""Execute task while collecting metrics"""
agent_metrics = AgentMetrics(agent_id=f"agent_{int(time.time())}")
try:
# Wrap the base agent to collect metrics
result = self._monitored_execution(task, agent_metrics)
agent_metrics.finalise()
# Store metrics for analysis
self.metrics.record_session(agent_metrics)
return result
except Exception as e:
agent_metrics.record_step(success=False, error=str(e))
agent_metrics.finalise()
self.metrics.record_session(agent_metrics)
raise
Real-World Results and Lessons Link to heading
Deployment Automation Success Link to heading
I deployed an agent that handles routine deployment tasks:
- Time saved: 3-4 hours per week previously spent on manual deployments
- Error reduction: 90% fewer deployment-related incidents
- Consistency: 100% compliance with deployment checklist
- Team satisfaction: Developers can focus on feature development
Content Generation Agent Link to heading
An agent that manages technical documentation:
- Coverage improvement: 40% increase in documented APIs
- Freshness: Documentation updates within 24 hours of code changes
- Quality: Consistent formatting and structure across all docs
- Maintenance: Reduced documentation debt by 60%
Code Review Assistant Link to heading
An agent that performs initial code reviews:
- Review speed: First-pass review within 5 minutes of PR creation
- Issue detection: Catches 70% of common issues before human review
- Learning: Helps junior developers learn best practices
- Coverage: Reviews 100% of PRs, unlike human reviewers
Challenges and Limitations Link to heading
Reliability Issues Link to heading
- Non-deterministic behavior: The same input can produce different results; a retry-with-validation wrapper (sketched after this list) helps contain this
- Context drift: Long-running agents can lose focus
- Error propagation: Mistakes compound across steps
- Tool limitations: Agents are constrained by available tools
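The mitigation that has worked best for me against non-determinism and error propagation is wrapping each agent step in a validate-and-retry loop. Here’s a minimal, generic sketch; the step and validate callables are placeholders you supply per task:

import time
from typing import Callable

def run_step_with_retry(step: Callable[[], str],
                        validate: Callable[[str], bool],
                        max_attempts: int = 3,
                        backoff_seconds: float = 2.0) -> str:
    """Run one agent step, retrying until its output validates or attempts run out."""
    last_output = ""
    for attempt in range(1, max_attempts + 1):
        last_output = step()
        if validate(last_output):  # e.g. tests pass, JSON parses, file exists
            return last_output
        time.sleep(backoff_seconds * attempt)  # simple linear backoff before retrying
    raise RuntimeError(f"Step failed validation after {max_attempts} attempts: {last_output[:200]}")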
Cost Considerations Link to heading
class CostManager:
def __init__(self, budget_limit=100.0):
self.budget_limit = budget_limit
self.current_spend = 0.0
self.token_costs = {
"gpt-4": 0.03, # per 1k tokens
"gpt-3.5-turbo": 0.002
}
def estimate_task_cost(self, task_complexity: str, model: str = "gpt-4"):
"""Estimate cost for a task before execution"""
complexity_multipliers = {
"simple": 1.0,
"medium": 3.0,
"complex": 8.0
}
base_tokens = 1000 # Minimum tokens for any task
multiplier = complexity_multipliers.get(task_complexity, 3.0)
estimated_tokens = base_tokens * multiplier
cost_per_token = self.token_costs.get(model, 0.03) / 1000
estimated_cost = estimated_tokens * cost_per_token
return estimated_cost
def can_afford_task(self, task_complexity: str, model: str = "gpt-4"):
"""Check if task is within budget"""
estimated_cost = self.estimate_task_cost(task_complexity, model)
return self.current_spend + estimated_cost <= self.budget_limit
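I gate agent runs on this check before invoking the executor; a short sketch (the budget and complexity label are arbitrary):

cost_manager = CostManager(budget_limit=25.0)

if cost_manager.can_afford_task("complex", model="gpt-4"):
    estimate = cost_manager.estimate_task_cost("complex", model="gpt-4")
    cost_manager.current_spend += estimate  # reserve the estimate up front
    # ... run the agent task here ...
else:
    print("Skipping task: estimated cost would exceed the budget")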
Future Directions Link to heading
Multi-Agent Systems Link to heading
class AgentTeam:
def __init__(self):
self.agents = {
'planner': PlanningAgent(),
'coder': CodeGenerationAgent(),
'reviewer': CodeReviewAgent(),
'tester': TestingAgent(),
'deployer': DeploymentAgent()
}
def coordinate_development(self, feature_request: str):
"""Coordinate multiple agents to develop a feature"""
# Planner creates development plan
plan = self.agents['planner'].create_plan(feature_request)
# Coder implements the feature
implementation = self.agents['coder'].implement_feature(plan)
# Reviewer checks the code
review = self.agents['reviewer'].review_code(implementation)
# Tester creates and runs tests
test_results = self.agents['tester'].test_implementation(implementation)
# Deployer handles deployment
if test_results['passed']:
deployment = self.agents['deployer'].deploy_feature(implementation)
return deployment
else:
return {'status': 'failed', 'reason': 'tests failed'}
Learning and Adaptation Link to heading
Agents that learn from their experiences and improve over time represent the next frontier. This involves building feedback loops, maintaining performance metrics, and adapting strategies based on outcomes.
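As a rough sketch of what that feedback loop could look like, here’s a small strategy tracker that records outcomes and prefers the best-performing approach over time; the persistence file and smoothing are my own assumptions, not part of any framework:

import json
from collections import defaultdict

class StrategyFeedbackLoop:
    """Track how often each strategy succeeds and prefer the best performer."""

    def __init__(self, path: str = "strategy_stats.json"):
        self.path = path
        self.stats = defaultdict(lambda: {"wins": 0, "attempts": 0})

    def record_outcome(self, strategy: str, success: bool):
        self.stats[strategy]["attempts"] += 1
        if success:
            self.stats[strategy]["wins"] += 1
        with open(self.path, "w") as f:
            json.dump(self.stats, f)  # persist so future sessions can keep adapting

    def pick_strategy(self, candidates: list[str]) -> str:
        # Prefer the candidate with the best observed win rate;
        # unseen strategies get a neutral prior via Laplace smoothing.
        def score(name: str) -> float:
            s = self.stats[name]
            return (s["wins"] + 1) / (s["attempts"] + 2)
        return max(candidates, key=score)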
Key Takeaways Link to heading
Start simple: Begin with single-purpose agents before building complex workflows
Safety first: Implement multiple layers of safeguards and human oversight
Monitor everything: Track performance, costs, and failure modes
Gradual autonomy: Increase agent independence as reliability improves
Human-agent collaboration: Best results come from combining human judgment with agent capabilities
Domain expertise matters: Agents work best in well-defined problem domains
Iteration is key: Continuously refine agent behavior based on real-world performance
AI agents represent a significant evolution in automation capabilities. While they’re not perfect, they’re already providing substantial value in specific domains where the benefits outweigh the risks. As the technology matures, we’ll see agents handling increasingly complex workflows with greater reliability and autonomy.
Have you experimented with building AI agents? What workflows have you found most suitable for automation, and what challenges have you encountered?