Persistent agent state: Refactor your agent to use a SQLite database (via sqlite3 or sqlmodel) to persist conversation history, tool call logs, and memory across sessions. Restart the agent and verify it remembers context from a previous session. Design the schema so you can replay any session.
Streaming responses: Implement streaming using the Anthropic SDK's stream() method. Print Claude's response token by token to the terminal as it arrives. Add a spinner/progress indicator for tool execution phases. Notice how streaming changes the user experience for long responses.
Agent red-teaming: Try to break your own agent with adversarial inputs: (1) prompt injection via tool results (return malicious instructions from your mock fetch_url), (2) context overflow (send a very long message), (3) tool call flooding (craft a prompt that makes the agent call the same tool 50 times). Document what broke and how you would fix it.
MCP (Model Context Protocol): Explore Anthropic's Model Context Protocol. Set up a local MCP server that exposes one of your tools. Connect to it from a Claude Desktop or SDK client. Understand why a standardised tool protocol matters for ecosystem composability.
Agentic benchmark: Run your research agent against a small subset of HotpotQA multi-hop reasoning questions. Score accuracy, measure tokens used per question, and estimate cost per 1,000 questions. What is the cost/accuracy trade-off of using Claude Haiku vs Sonnet for the worker agent?
Your agent has access to tools that can fetch URLs and query APIs. What is the worst thing that could happen if an attacker controlled the content of a page your agent fetched? How does this prompt injection scenario differ from a traditional web injection attack?
You implemented a max_iterations safety limit. What other limits should a production agentic system have? Think about: time, cost, memory, scope of actions.
How does the reliability of your agent change as the number of sequential tool calls increases? What compound failure rate do you get if each tool call has a 5% failure chance and you need 10 successful calls in sequence?
You evaluated your agent with an LLM-as-judge. What are the weaknesses of this evaluation approach? When might the judge LLM and the agent LLM agree on a wrong answer?
In a multi-agent system with an orchestrator and workers, where is the single point of failure? How would you design for resilience if the orchestrator crashes mid-task?