Setting the Standard: Our AI Model Outperforms Spider 2

Who We Are, and What We Do

Rival is developing a solution catered to solving the toughest cybersecurity challenges and time sinks, freeing human security teams to engage in the creative and proactive defense.

Rival’s autonomous platform rests on a huge collection of domain-specific approaches that push boundaries of agentic performance on security-related tasks.

Beyond these, Rival also implements generic advancements in LLM reasoning techniques, developed in-house by our leading cyber and machine learning experts to be scalable, trustworthy, and shatter existing analytical benchmarks.

The Challenge of Reasoning

On our path from research to productization, we came face-to-face with the toughest hurdles that agentic systems face in widespread adoption for real-world enterprise use cases.

Chief among these challenges is a simple fact - today’s agentic approaches excel at toy examples and trivial problems, but their reasoning falls apart when approaching problems that human reasoning faces in the corporate world.

To illustrate this, lets take a look at a more general task than cybersecurity - corporate data analysis.

All large companies hold vast datasets and have human analysts responsible for mining that data to extract critical business insights.

Translating natural language queries to executable analytical logic (like SQL) is one of the standard benchmarks to measure the effectiveness of different approaches for logical reasoning over complex data.

The Spider(man) Benchmark - What It Is, and Why It Is Relevant

A standard used to measure this are the “Spider” family of text-to-sql benchmarks. On first glance, things look good for LLMs: Spider 1.0’s leaderboard reports that state-of-the-art agents achieve above 90% accuracy in translating natural language analysis questions to verifiable answers backed by SQL!

Things start to fall apart when you examine the details of the benchmark itself, however. The dataset consists of challenging questions such as:

-- QUESTION:
-- What type of pet is the youngest animal, and how much does it weigh?
-- ANSWER:
SELECT pettype, weight FROM pets ORDER BY pet_age LIMIT 1

-- QUESTION:
-- What type of pet is the youngest animal, and how much does it weigh?
-- ANSWER:
SELECT pettype, weight FROM pets ORDER BY pet_age LIMIT 1

-- QUESTION:
-- What is the ship id and name that caused most total injuries?
-- ANSWER:
SELECT T2.id, T2.name FROM death AS T1 
JOIN ship AS t2 ON T1.caused_by_ship_id = T2.id 
GROUP BY T2.id ORDER BY count(*) DESC LIMIT 1

-- QUESTION:
-- How many courses are there?
-- ANSWER:
SELECT count(*) FROM Courses

‍
The problem is obvious: Spider 1.0 contains the same kind of trivial, toy examples we mentioned above.

The folks at Spider realized that too, and set to work creating a new coliseum in which to test the juggernauts or artificial reasoning - Spider 2.0.-- For each customer, calculate their daily balances for every day between

-- their earliest and latest transaction dates, including days without 
-- transactions by carrying forward the previous day's balance. 
-- Treat any negative daily balances as zero. Then, for each month, 
-- determine the highest daily balance each customer had during that month. 
-- Finally, for each month, sum these maximum daily balances across all customers 
-- to obtain a monthly total.
-- Answer omitted for brevity - but trust me, it's long.

And, what do you know, the numbers go down - hard. Spider 2.0-lite’s leaderboard has the best approach achieving only 37% on these kind of analytical tasks. Approaches that achieved above 80% in Spider 1.0, like DailSQL, achieve only 5.68%!

Looking at Spider 2.0’s questions, we immediately see that they are significantly harder, and much better represent the kind of real world questions an AI agent would have to reason through. Case in point:

The specifics of questions that next-generation cyber platforms need to answer are different than the wide spectrum of subjects in Spider, but their complexity is similar. Take this example adapted from one of our customers:

-- For every new vulnerability ingested over the last 12 months that is 
-- classified as "Medium" severity or higher:

-- First, identify the specific root code file(s) and function(s) responsible
-- for the vulnerability. Then, using an AI-assisted static analysis agent 
-- (like Grep.app or a similar tool), define the abstract "bad pattern" that 
-- this vulnerable code represents.

-- Next, scan all private code repositories to identify every other instance 
-- where the "bad pattern" appears. For each discovered instance, map 
-- it to its corresponding API route or application. Calculate a 
-- "Blast Radius Score" for the original vulnerability, defined as the total 
-- count of these propagated instances across the entire codebase.

-- For each of the propagated instances, enrich the data by identifying the 
-- primary owning team, the date of the last commit, and whether the 
-- application is internal or external-facing. For the original vulnerability
-- that has since been closed, capture the Mean Time to Remediate (MTTR).

-- Finally, aggregate these findings to generate a quarterly report. 
-- This report should rank engineering teams not just by the volume of 
-- vulnerabilities they own, but by the average "Blast Radius Score" of those 
-- vulnerabilities. Combine with each team's average MTTR to highlight teams 
-- that are introducing high-impact vulnerabilities and are slow to remediate.

‍This is clearly more similar to Spider 2.0 that 1.0.

Obviously, a 30% success rate in this case might as well be 0%. If an agent can’t reliably carry out this task, our customers cannot and will not use them at all, keeping security teams preoccupied with the copy-and-paste tasks necessary to perform this analysis manually.

Rival’s Approach - Orchestrating Workflows

We started at a disadvantage - not only do we have to break these existing benchmarks, we have to do it in a way that is trustworthy and can operate at scale. We realized that a fixed agentic design would not answer these criteria.

Taking inspiration from cutting-edge academic approaches, Rival’s core reasoning approach implements a two-stage process: workflow generation and workflow orchestration.

See this

Generation

When Rival is first given the description of an analytical workflow, it breaks down the described problem into discrete logical steps that solve the problem. The operations allowed in each step is definable and extensible by the user - such as SQL queries, python code, free-form agentic reasoning, and so on.

This is a similar approach to prompting techniques that have seen success in sterile academic environments, such as Plan-and-Solve and Skeleton-of-Thought.

Each logical step is created with well-defined output schemas, as well as inputs that link to the outputs of other steps. Together, this allows Rival to generate plans with complex interdepencies while still being able to ensure that each individual piece of logic is well-defined.

These can be relatively simple - this is the generated flow for the above Spider2 question regarding bank customers:

‍‍

‍

Or they can be quite complicated, such as when asked to summarize complicated aggregation statistics on cricket players that are distributed throughout multiple different tables:

‍

‍

Composition and Orchestration

Once a plan has been constructed, Rival has to implement the individual steps.

Each of these steps could be any number of “functions” that have fixed input and output schemas: SQL queries, Python/typescript code, or even on-the-fly subagents!

Large amounts of research has demonstrated the effectiveness of in-context few-shot prompting - where examples of correct behavior are given to the model at generation time.

Given that the plan is a DAG, Rival is able to implement the steps “in order”. Outputs of prerequisite steps are generated and “previewed” to the steps that depend on them, while still enabling full parallelization of independent steps for speed.

Rival also enforces the output schemas of each step. This acts as another guardrail against insidious logical errors and hallucinations, and empirically serves to “remind” Rival of how each individual step’s logic fits into the bigger picture.

Once orchestration is complete, a complete final answer has been fully assembled. However, assuming that a correct workflow has been composed, the same workflow can be run repeatedly to generate up-to-date answers to the same question, without requiring any additional inference.

Benchmarking Rival

Overview of the Technique

In order to test the flexibility and power of our approach, we created a slimmed-down version of our core logic. We stripped away the cybersecurity-specific logic, simplified some of the code generation capabilities, and rewrote some of the agent prompts to make more sense in the context of the benchmark.

We then gave this version of Rival access to the SQLite portion of the Spider 2.0-lite (not the Snowflake or Bigquery hosted components) benchmark, and asked Rival to solve each of the benchmark’s questions.

Given the potential for Rival to use code as well as SQL queries solve the problem, we would often receive answers that were semantically correct, but with additional information or in a schema that did not match the tabular golden outputs of the benchmark exactly.

To properly grade Rival’s performance, we inspected the outputs for semantic identity - programmatically, using a critic LLM, and with a final manual pass. We categorized each of Rival’s outputs into one of the following categories:

Semantically identical (potentially with additional information)

Semantically identical, but with rounding mismatches (3.3 and 3.333333 are the same value, right?)

Semantically identical, with an alternate valid interpretation of the initial prompt.

As an example, some challenges asked users to be distributed into buckets based on ranges of months. Rival identified these buckets by their end month, while the benchmark identified them by their start month. The provided question didn’t specify a single “correct” method, so Rival gets points for this.

Not identical (an error)

Results

So, how does this match up to the existing Spider 2.0 leaderboard (Taken from here on 19/6/25)?

It blows existing approaches out of the water. Rivalbot successfully answers nearly 63% of the challenge successfully.

Untitled

‍