Play the First 3 Games
https://three.arcprize.org/
There are no instructions. You must play the game to discover controls, rules, and goal.
This is a sneak peek at ARC-3, the next-gen interactive reasoning benchmark designed to illuminate the capability gap between today's AI and tomorrow's AGI.
Interactive Reasoning Benchmarks (IRBs) test a broad range of capabilities:
• Exploration
• Percept → Plan → Action
• Memory
• Goal Acquisition
• Alignment
Game Design Constraints
• Easy for humans (can be picked up in under 1 minute of gameplay)
• Core Knowledge priors only (no language, trivia, or cultural symbols)
• Should require no instructions to play
• Should be fun for humans and playable in 5-10 minutes
• Novel game mechanics encouraged (hidden state, theory of mind, long-term planning, navigating other agents, etc.)
This week I'm posting about presentations from two cool events (over on Twitter): https://x.com/byrd_nick/status/1943219893291164057
What are the events?
(1) The 1st Experimental Argument Analysis workshop
(2) The 5th European #ExperimentalPhilosophy #Conference
For the next couple of days, I'm posting about talks and posters from the 2025 BioXPhi Summit in #Switzerland. Follow along on #BlueSky: https://bsky.app/profile/byrdnick.com/post/3lsim7t6gq22t
The #conference website: https://ibmb.unibas.ch/en/public-outreach/projects-to-the-public/basel-oxford-nus-bioxphi-summit-2025/
#AI reasoning models may seem to reason reflectively when they say things like, "Let me rethink that".
But do these "reflective" phrases predict better reasoning performance?
Not in #Deepseek R1 Zero: https://doi.org/10.48550/arXiv.2503.20783
Can task-switching hinder decisions?
Switching back and forth between a reflection test and a fluid #IQ test reduced both optimal reflection test scores and test completion, compared with taking the tests separately (N = 80).
Bad news for #multitasking?
https://ianburbidge.com/wp-content/uploads/2024/05/ian-burbidge-masters-dissertation-1.pdf
Accepted in Res Philosophica
"Reflective" thinking is rife in #cogSci and the #history of ideas.
But we lack a unified definition.
So I synthesized one.
Just 2 key factors.
Not just unifying, but useful!
Audiopaper: https://byrdnick.com/archives/28904/upon-reflection-ep-15-a-two-factor-explication-of-reflection
Preprint: https://osf.io/preprints/psyarxiv/d628j
Maybe @Dockers opted for the misspelled "TruTemp" #branding because #philosophy had already taken "Truetemp".
Aside: I recently published new data about #thoughtExperiments like Truetemp: https://doi.org/10.1093/analys/anaf015
https://osf.io/preprints/psyarxiv/y8sdm
Can group work/discussion cultivate #criticalThinking?
General #surgery trainees randomly assigned to team-based learning (rather than traditional curricula) had better reflection test scores (n = 36).
#AlgorithmAversion is a tendency to judge errors in automated decisions more harshly than errors in human decisions.
Telling people a decision is typically made by machines eliminated or even reversed the #bias.
Is reflective reasoning always better?
In "Bounded Reflectivism..." (2022), I argued that #cogSci data show reflection is NOT always best: https://doi.org/10.1111/meta.12534
Another #AI paper finds this: intuitive #LLM prompts were better for "common sense" tasks: https://doi.org/10.48550/arXiv.2502.12470
Another correlational study of #AI use and #CriticalThinking draws unmerited causal conclusions.
This one found a *positive* correlation between AI use and (self-report-derived) critical thinking.
Participants: about 100 pre-service teachers.
People were less averse to #risk (d = 0.4) when making #prenatalTesting decisions in their SECOND #language — even when they seemed to understand the relevant information.
I'm #teaching at #Bucknell on Tuesday:
- Kahneman's peak-end rule & #socialMedia
- Global Analytic #Atheism: https://doi.org/10.1017/S0034412525000198
- Scalable Socratic Reflection: https://www.researchgate.net/publication/370132037
- Strategic Reflection in #AI, #HCI, #cogSci: https://www.researchgate.net/publication/390166382
Is reflective reasoning always slower than, say, intuition?
A paper used process dissociation to explicate deliberate control:
- it wasn't reliably slower
- it didn't reliably involve more self-reported deliberation (such as stopping to think)
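For readers unfamiliar with process dissociation, here is a minimal sketch of the classic two-process equations (Jacoby, 1991). A controlled process C and an automatic process A jointly drive responses on "inclusion" trials, where both processes favor the same answer, and on "exclusion" trials, where they conflict: P(inclusion) = C + A(1 - C) and P(exclusion) = A(1 - C), so C = P(inclusion) - P(exclusion) and A = P(exclusion) / (1 - C). Whether the paper above used exactly this parameterization is an assumption, and the function name and example proportions are hypothetical, not data from that study.

```python
def process_dissociation(p_inclusion: float, p_exclusion: float) -> tuple[float, float]:
    """Estimate controlled (C) and automatic (A) contributions.

    Assumes the classic Jacoby-style equations (hypothetical helper):
        P(inclusion) = C + A * (1 - C)
        P(exclusion) = A * (1 - C)
    """
    for p in (p_inclusion, p_exclusion):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    # Subtracting the two equations cancels the A * (1 - C) term.
    control = p_inclusion - p_exclusion
    if control >= 1.0:
        # With C = 1, the exclusion equation gives no information about A.
        raise ValueError("A is undefined when C = 1")
    automatic = p_exclusion / (1.0 - control)
    return control, automatic


# Hypothetical proportions, for illustration only
c, a = process_dissociation(0.80, 0.20)
print(f"C = {c:.2f}, A = {a:.2f}")
```

Because C and A are estimated from response patterns rather than from timing or self-report, this kind of measure can dissociate "control" from both slowness and reported deliberation, which is the point the findings above turn on.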
Yet another paper shows that dual-minded #LLMs (intuitive + reflective) can improve accuracy-cost tradeoffs: https://doi.org/10.48550/arXiv.2504.12329
As I argue in #StrategicReflectivism, pragmatic switching between the two modes is key to intelligent systems: https://www.researchgate.net/publication/390166382
Do people diagnosed with #autism respond differently to moral dilemmas?
In MINORS, endorsement of sacrificial harm waned with age, but more slowly in the ASD group: https://doi.org/10.1007/s10803-022-05795-6
In ADULTS, decisions were similar: https://doi.org/10.1016/j.paid.2024.112889
Does thinking aloud disrupt reasoning?
We didn't find effects on a verbal reflection test (https://pubmed.ncbi.nlm.nih.gov/37103261), but Shealy et al. found effects on word count, completion time, and DLPFC activity during a design task (N = 50).