The Challenge: Understanding Agent Reasoning
Understanding how LLM agents reason through complex tasks is notoriously difficult. When an agent makes decisions, takes actions, and navigates toward a goal, it leaves behind a trace: a trajectory of steps that capture its thought process. But how do we extract meaningful insights at scale from these historical traces?
The traditional approach is to examine trajectories one at a time, manually inspecting what the agent did and why. This works for debugging specific issues, but it doesn't scale. We don't want to look at individual cases; we want to understand global patterns in the agent's behavior across thousands of runs.
Introducing Taxi: A Trajectory Taxonomy Generator
In this blog post, we're excited to present Taxi, a new tool we're developing to solve this problem. Taxi is a generic, trajectory-oriented taxonomy generator that helps you make sense of your agent's behavior at scale.
Let's break down what it does.
How Taxi Works
Step 1: Clustering Similar Steps
Taxi starts by ingesting all of your agents' trajectories. Each trajectory is a sequence of steps. Taxi then uses machine learning clustering techniques to automatically identify similar steps and group them into discrete clusters based on semantic similarity.
Think of it like this: if your agent frequently performs similar reasoning steps (even if they're phrased differently or appear unrelated without full context), Taxi will recognize these underlying patterns. For example, "verify the input matches expected format", "check if user request is valid", and "ensure data integrity before processing" might all cluster together as variations of an input validation step. This happens automatically, without you having to define categories in advance.
Step 2: Creating Structured Representations
Once we have these step clusters, Taxi assigns each individual step in every trajectory to its corresponding cluster. This transforms the unstructured trajectory data into a structured, symbolic representation.
Instead of having raw text describing what the agent did, we now have a sequence of discrete states or actions. For example, a trajectory that was previously a long narrative becomes something like: [validate_input] → [retrieve_context] → [generate_response] → [verify_output].
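To make this concrete, here is a minimal sketch of the mapping from raw step text to cluster labels. The step-to-cluster mapping would come from Taxi's clustering stage; here it is a hypothetical hand-written dict for illustration.

```python
# Hypothetical output of the clustering stage: raw step text -> cluster label.
step_to_cluster = {
    "verify the input matches expected format": "validate_input",
    "look up relevant documents": "retrieve_context",
    "draft an answer for the user": "generate_response",
    "double-check the answer for errors": "verify_output",
}

def to_symbolic(trajectory):
    """Replace each raw step with its cluster label."""
    return [step_to_cluster[step] for step in trajectory]

trajectory = [
    "verify the input matches expected format",
    "look up relevant documents",
    "draft an answer for the user",
    "double-check the answer for errors",
]
print(to_symbolic(trajectory))
# ['validate_input', 'retrieve_context', 'generate_response', 'verify_output']
```

The symbolic sequence on the last line is exactly the structured representation the rest of the pipeline operates on.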
Step 3: Unlocking New Analysis Capabilities
This is the fun part! Once your trajectories are represented as sequences of discrete clusters, you can analyze them in ways that weren't possible before. We're no longer limited to reading through text; we can apply computational methods to discover patterns.
- Visualize Common Patterns: We can generate flow charts that show the most common paths our agents take. This makes it easy to see at a glance how our agent(s) approach problems.
- Apply Classical Approaches (No LLM Needed Anymore): Because we're now working with discrete states rather than unstructured text, we can use classical techniques from fields like graph theory, Markov models, or sequence analysis. For instance, we can identify bottlenecks, measure state transition probabilities, or detect anomalous paths.
- Identify Success and Failure Patterns: By correlating trajectory patterns with outcomes, we can pinpoint which specific states or sub-sequences lead to success versus failure. This helps us understand what confuses our models, where they get stuck, and what conditions lead to optimal performance.
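As a taste of the classical techniques mentioned above, here is a small sketch that estimates state-transition probabilities (a first-order Markov model) from symbolic trajectories. The state names are illustrative, not part of Taxi's API.

```python
from collections import Counter, defaultdict

# Toy symbolic trajectories, as produced by the clustering step.
trajectories = [
    ["validate_input", "retrieve_context", "generate_response", "verify_output"],
    ["validate_input", "generate_response", "verify_output"],
    ["validate_input", "retrieve_context", "generate_response"],
]

def transition_probs(trajs):
    """Estimate P(next state | current state) from transition counts."""
    counts = defaultdict(Counter)
    for traj in trajs:
        for a, b in zip(traj, traj[1:]):
            counts[a][b] += 1
    return {
        state: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for state, nexts in counts.items()
    }

probs = transition_probs(trajectories)
print(probs["validate_input"])
# About 2/3 of the time the agent retrieves context after validating input.
```

The same table immediately yields the flow-chart edge weights and bottleneck statistics described above.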
Overview of Clustering and Labeling
Approaches to Clustering
The clustering step is crucial to Taxi's effectiveness: it determines how steps are grouped together and ultimately shapes the taxonomy that emerges. There are several approaches you can take, each with different trade-offs:
Embedding-Based Clustering
The most efficient approach is to use embeddings, vector representations of each step that capture semantic meaning. You can generate embeddings using models like OpenAI's text-embedding-ada-002 or open-source alternatives, then apply clustering algorithms like K-means, DBSCAN, or hierarchical clustering to group similar embeddings together.
This approach works well because it captures semantic similarity: steps that mean similar things will cluster together even if they use different words. However, you might need to choose the number of clusters (for K-means) or tune density parameters (for DBSCAN), which can require a bit of experimentation.
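Here is a hedged sketch of embedding-based clustering with scikit-learn's K-means. In practice the vectors would come from an embedding model; to keep the example self-contained, toy 2-D vectors stand in for real embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans

steps = [
    "verify the input matches expected format",
    "check if user request is valid",
    "look up relevant documents",
    "search the web for context",
]
# Toy stand-in embeddings: the first two steps are close in vector space,
# and so are the last two. Real embeddings would come from an embedding model.
embeddings = np.array([[0.9, 0.1], [1.0, 0.0], [0.1, 0.9], [0.0, 1.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
labels = kmeans.labels_
print(dict(zip(steps, labels)))
# The two validation-like steps share a cluster, as do the retrieval-like ones.
```

Swapping K-means for DBSCAN or hierarchical clustering is a one-line change with the same scikit-learn interface.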
LLM-Based Clustering
Another approach is to use an LLM directly to judge similarity and create clusters. You can prompt a model to compare steps and decide whether they belong in the same category, and even have it generate natural language descriptions of emerging clusters.
This method can produce more interpretable clusters since the LLM can explain its reasoning, but it's typically slower and more expensive than embedding-based methods, and often infeasible at the scale of thousands of trajectories.
Hierarchical Taxonomies
For complex agent behaviors, you might want multiple levels of granularity. Hierarchical clustering naturally supports this: you can have broad categories like "information retrieval" that split into more specific subcategories like "database query" and "web search". This allows you to zoom in and out depending on what level of detail you need for your analysis.
Labeling Clusters
Once we've identified clusters of similar steps, we need meaningful labels to make them interpretable. The most scalable approach is to use an LLM to automatically generate labels by providing representative examples from each cluster and asking it to produce a concise, descriptive name.
The labeling strategy depends on your taxonomy structure. With flat labeling, you simply generate one label per cluster, which is straightforward and efficient. The LLM examines representative steps from a cluster and produces a single descriptive name like "Authentication Verification" or "Data Validation."
With recursive labeling for hierarchical taxonomies, you label at multiple levels of granularity. You can choose either a bottom-up or top-down labeling strategy. With bottom-up labeling, you start by labeling the finest-grained clusters first, then progressively label parent clusters as you move up the hierarchy. With top-down labeling, you label broad categories first, then label more specific subcategories as you work your way down.
Recursive labeling creates a multi-level taxonomy where labels at different levels provide different perspectives: you might have "Information Retrieval" at the top level, with "Database Query" and "Web Search" as sub-labels, each potentially subdividing further. This hierarchical labeling structure allows you to analyze agent behavior at whatever level of abstraction best suits your current analytical needs.
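A minimal sketch of the flat labeling step: assemble a prompt from representative cluster members, then hand it to whatever LLM client you use. The `complete` call at the end is a hypothetical stand-in, not a real API.

```python
def build_label_prompt(examples):
    """Assemble a cluster-labeling prompt from representative steps."""
    bullet_list = "\n".join(f"- {e}" for e in examples)
    return (
        "The following agent steps belong to one cluster:\n"
        f"{bullet_list}\n"
        "Reply with a concise, descriptive name for this cluster."
    )

examples = [
    "verify the input matches expected format",
    "check if user request is valid",
]
prompt = build_label_prompt(examples)
print(prompt)
# label = complete(prompt)  # hypothetical LLM call, e.g. -> "Input Validation"
```

For recursive labeling, the same prompt builder is applied per tree node, with child labels included as extra context when labeling parents (bottom-up) or parent labels when labeling children (top-down).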
Advanced Analysis: Going Deeper with Taxi
Once you have your trajectories represented as discrete step sequences, Taxi opens the door to even more sophisticated analytical techniques. Here are some advanced ways to extract deeper insights from your agent's behavior:
Training Neural Models for Causal Analysis and Prediction
One powerful approach is to train a neural network on top of these discrete step sequences. This creates a "digital twin" of your agent: a model that learns to predict what steps come next based on the patterns it observes in your trajectory data.
This trained model serves multiple purposes. First, it enables predictive modeling: you can forecast where an agent is heading based on its current trajectory, predict the likelihood of success or failure, and even estimate how many more steps it will take to complete a task. This is valuable for understanding agent behavior patterns and identifying potential issues before they fully manifest.
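The text describes a neural sequence model; as a hedged, dependency-free stand-in, this sketch predicts the next step from bigram counts, the simplest possible version of the same idea. State names are illustrative.

```python
from collections import Counter, defaultdict

# Toy symbolic trajectories from the clustering step.
trajectories = [
    ["validate_input", "retrieve_context", "generate_response", "verify_output"],
    ["validate_input", "retrieve_context", "generate_response"],
    ["validate_input", "generate_response", "verify_output"],
]

# Count observed successors of each state.
next_counts = defaultdict(Counter)
for traj in trajectories:
    for a, b in zip(traj, traj[1:]):
        next_counts[a][b] += 1

def predict_next(state):
    """Return the most frequently observed successor of `state`."""
    return next_counts[state].most_common(1)[0][0]

print(predict_next("validate_input"))
# retrieve_context (observed after validate_input in 2 of 3 trajectories)
```

A neural model plays the same role but conditions on the whole prefix rather than just the last state, which is what lets it capture the non-linear relationships mentioned below.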
Second, once you have this predictive model, you can apply attribution methods, techniques that identify which specific steps or features had the most influence on particular outcomes. Think of it as asking "what caused the agent to succeed or fail?" but in a data-driven, quantitative way. Attribution methods can highlight critical decision points, reveal hidden dependencies between steps, and show you which states most strongly predict success or failure. This neural approach to causality complements traditional analytical methods by capturing complex, non-linear relationships in agent behavior.
Discovering Behavioral Motifs
Just like DNA sequences have recurring patterns (motifs) that serve specific functions, agent trajectories often contain recurring sub-sequences that represent meaningful behavioral patterns. Taxi can help you identify these motifs: common sequences of steps that appear frequently across different trajectories.
For example, you might discover that successful agents tend to follow a particular "verification motif" before committing to an action, while unsuccessful ones skip it. Identifying these motifs helps you understand the building blocks of agent behavior and can inform how you design prompts, guardrails, or training procedures.
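The simplest form of motif discovery is n-gram counting over symbolic trajectories, sketched below with illustrative step names.

```python
from collections import Counter

# Toy symbolic trajectories.
trajectories = [
    ["plan", "verify", "act", "report"],
    ["plan", "verify", "act"],
    ["plan", "act", "report"],
]

def count_motifs(trajs, n=2):
    """Count every length-n sub-sequence across all trajectories."""
    motifs = Counter()
    for traj in trajs:
        for i in range(len(traj) - n + 1):
            motifs[tuple(traj[i : i + n])] += 1
    return motifs

motifs = count_motifs(trajectories)
print(motifs.most_common(3))
# ("verify", "act") recurs: a candidate "verification motif".
```

Comparing motif counts between succeeding and failing trajectories is then a straightforward way to surface patterns like the verification motif described above.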
Anomaly Detection and Outlier Analysis
With trajectories represented as sequences of discrete states, you can apply anomaly detection algorithms to automatically flag unusual agent behavior. This is particularly valuable for identifying edge cases that cause unexpected reasoning patterns, detecting when agents encounter novel situations, spotting potential security issues, and finding interesting failure modes that warrant deeper investigation.
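One simple anomaly detector on symbolic trajectories scores each trajectory by its rarest transition under the observed transition counts and flags scores below a threshold. The trajectories and threshold here are illustrative.

```python
from collections import Counter

# Transitions observed in "normal" trajectories.
normal = [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "c"], ["a", "b", "d"]]
counts = Counter(pair for traj in normal for pair in zip(traj, traj[1:]))
total = sum(counts.values())

def rarity_score(traj):
    """Frequency of the least common transition in the trajectory."""
    return min(counts[pair] / total for pair in zip(traj, traj[1:]))

def is_anomalous(traj, threshold=0.05):
    return rarity_score(traj) < threshold

print(is_anomalous(["a", "b", "c"]))  # False: every transition is common
print(is_anomalous(["a", "c", "b"]))  # True: ("a", "c") was never observed
```

More sophisticated detectors (sequence autoencoders, one-class classifiers) slot into the same place once a scoring function over symbolic sequences is defined.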
These advanced techniques transform Taxi from a simple visualization tool into a comprehensive platform for understanding and optimizing agent behavior at scale.
Conclusion and Takeaway
The beauty of Taxi is its flexibility. You don't need to decide upfront exactly how you'll use these structured trajectories. The taxonomy serves as a general-purpose representation that supports multiple downstream analyses, whether you want to debug, optimize, visualize, or build predictive models of agent behavior.
Taxi makes it possible to understand your agents better at scale by bridging the gap between unstructured agent traces and structured analytical frameworks.
Taxi takes your agent places.
Stay tuned!