TOT-SQL Safeguard submission #56

Open
NG-VikasV wants to merge 2 commits into ucbepic:main from NG-VikasV:submit-tot-sql-safeguard

@NG-VikasV

Agent Name: TOT-SQL Safeguard
Backbone LLM: openai.gpt-oss-safeguard-120b
Dataset Hints Used: No

A reasoning-based agent architecture for multi-step Data-to-SQL tasks. It uses LLM reasoning to construct SQL queries dynamically, run execution-verification loops, and retrieve correct query results across SQLite, PostgreSQL, DuckDB, and MongoDB databases without any hardcoded schemas or domain-specific hints.

@Ruiying-Ma

Collaborator

Hi @NG-VikasV — thank you for your contribution!

Would you mind also sharing the traces for all runs across all queries (for example, as a zip file)? Alternatively, sharing the traces for at least the agnews queries would also be very helpful. We’ll use them for validation checks. Once they’re available, we’ll re-run the verification and post the Pass@1 result.

@NG-VikasV

NG-VikasV commented Jun 9, 2026

Author

TT_SQL_V2_traces_all_runs.zip

Hi @Ruiying-Ma,

I have attached all the traces, including those for agnews. Please let us know if you need anything more. Thank you.

@Ruiying-Ma

Collaborator

Hi @NG-VikasV — thanks for sending the traces. We went through them and have two things to sort out before we can compute Pass@1.

First, run coverage. The leaderboard score averages over 5 runs per query, so we require at least 5 runs for each query. The bundle currently contains a single run per query (54 traces total), and the submission JSON repeats that one answer across all 5 run slots. Could you re-run each query at least 5 times and share the per-run traces and answers?

Second, the submission JSON does not match the traces for two agnews queries. For query 1 the submission lists THECHAT, but the trace answer is The Rundown 4 Miami at N.C. State.. For query 2 the submission lists N/A, but the trace shows 12/13. The other 52 queries match. Could you update the submission JSON so the answers reflect the runs you actually executed?
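A reconciliation check of this kind can be sketched in a few lines of Python. This is only an illustration, not the verification script used for the leaderboard: the function name `find_mismatches`, the key format `"agnews/query2"`, and the in-memory shape of the data (a dict mapping each query to a list of per-run answers) are all assumptions, since the actual submission JSON schema and trace format are not shown in this thread. The `agnews/query3` entry is made-up sample data.

```python
def find_mismatches(submission, traces):
    """Compare submitted answers against trace answers, run by run.

    Both arguments are assumed (hypothetically) to map a query key to a
    list of per-run answers. Returns one tuple per disagreement:
    (query_key, run_index, submitted_answer, traced_answer).
    """
    mismatches = []
    for query_key, submitted_runs in submission.items():
        traced_runs = traces.get(query_key, [])
        for run_index, submitted in enumerate(submitted_runs):
            # A run slot with no matching trace is reported with traced=None.
            traced = traced_runs[run_index] if run_index < len(traced_runs) else None
            if submitted != traced:
                mismatches.append((query_key, run_index, submitted, traced))
    return mismatches


# The query 2 values are the ones quoted above; query 3 is invented sample data.
submission = {"agnews/query2": ["N/A"], "agnews/query3": ["42"]}
traces = {"agnews/query2": ["12/13"], "agnews/query3": ["42"]}

for query_key, run_index, submitted, traced in find_mismatches(submission, traces):
    print(f"{query_key} run {run_index}: submitted {submitted!r}, trace {traced!r}")
```

Comparing run by run, rather than once per query, would also surface the case where a single answer has been copied into every run slot without a trace behind each one.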

Once we have the 5-run traces and a reconciled submission file, we'll re-run the verification and post the Pass@1 result.

@NG-VikasV

NG-VikasV commented Jun 11, 2026

Author

Hi, @Ruiying-Ma

Thank you for the detailed feedback — apologies for the issues with the initial submission.

Both points have been addressed. The updated files are committed directly to the PR branch:

submissions/TT_SQL_V2_traces_all_runs.zip — all 270 per-run traces (54 queries × 5 runs), structured as traces/{dataset}/query{id}/run{0–4}/
submissions/submission_spiderdin.json — reconciled submission with actual per-run answers
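For anyone unpacking the archive, the run coverage can be confirmed with a short script. The sketch below assumes the zip has been extracted so that the `traces/{dataset}/query{id}/run{0–4}/` layout described above exists on disk; the function names `runs_per_query` and `under_covered` are hypothetical and are not part of the submission or the benchmark tooling.

```python
from pathlib import Path


def runs_per_query(traces_root):
    """Count run directories under each {dataset}/query{id} directory."""
    counts = {}
    for query_dir in Path(traces_root).glob("*/query*"):
        if not query_dir.is_dir():
            continue
        run_dirs = [p for p in query_dir.glob("run*") if p.is_dir()]
        counts[f"{query_dir.parent.name}/{query_dir.name}"] = len(run_dirs)
    return counts


def under_covered(counts, required=5):
    """Return the queries that have fewer than `required` runs, sorted by name."""
    return sorted(key for key, n in counts.items() if n < required)


if __name__ == "__main__":
    counts = runs_per_query("traces")  # assumed extraction directory
    print(f"{len(counts)} queries, {sum(counts.values())} runs in total")
    for key in under_covered(counts):
        print(f"fewer than 5 runs: {key}")
```

With the layout stated above, the expected totals are 54 queries and 270 runs, and the list of under-covered queries should be empty.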

Direct link: Click here

Best regards, Vikas
