CausalARC: Abstract Reasoning with Causal World Models

Jacqueline R. M. A. Maasch¹, John Kalantari², Kia Khezeli²
¹Cornell Tech, New York, NY     ²YRIKKA, New York, NY


Reasoning requires adaptation to novel problem settings under limited data and distribution shift. This work introduces CausalARC: an experimental testbed for AI reasoning in low-data and out-of-distribution regimes, modeled after the Abstraction and Reasoning Corpus (ARC). Each CausalARC reasoning task is sampled from a fully specified causal world model, formally expressed as a structural causal model (SCM). Principled data augmentations provide observational, interventional, and counterfactual feedback about the world model in the form of few-shot, in-context learning demonstrations. As a proof of concept, we illustrate the use of CausalARC in four language model evaluation settings: (1) abstract reasoning with test-time training, (2) counterfactual reasoning with in-context learning, (3) program synthesis, and (4) causal discovery with logical reasoning.




Pearl Causal Hierarchy: observing factual realities (L1), exerting actions to induce interventional realities (L2), and imagining alternate counterfactual realities (L3) [1]. Lower levels generally underdetermine higher levels.
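
To make the underdetermination claim concrete, here is a minimal sketch with two hypothetical toy SCMs (illustrative assumptions, not models from CausalARC). Both share a single fair-coin exogenous variable `u` and induce the same observational (L1) distribution over `(x, y)`, yet they disagree once we intervene with do(X = 1) (L2). The function names are made up for this example.

```python
from collections import Counter
from fractions import Fraction

# Two hypothetical SCMs over binary X and Y, driven by one fair coin u.

def model_common_cause(u, do_x=None):
    """X and Y both copy u; there is no causal arrow from X to Y."""
    x = u if do_x is None else do_x
    y = u
    return x, y

def model_direct_effect(u, do_x=None):
    """X copies u and Y copies X, so X causes Y."""
    x = u if do_x is None else do_x
    y = x
    return x, y

def distribution(model, do_x=None):
    """Exact distribution over (x, y), enumerating the two equally likely values of u."""
    counts = Counter(model(u, do_x) for u in (0, 1))
    return {xy: Fraction(c, 2) for xy, c in counts.items()}

# L1: the observational distributions are identical.
obs_cc = distribution(model_common_cause)
obs_de = distribution(model_direct_effect)

# L2: under do(X = 1) the two models disagree about Y.
int_cc = distribution(model_common_cause, do_x=1)
int_de = distribution(model_direct_effect, do_x=1)
```

No amount of observational data can tell these two models apart, since `obs_cc == obs_de`, while a single intervention separates them: Y stays a fair coin in the first model and is forced to 1 in the second.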



This work extends and reconceptualizes the ARC setup to support causal reasoning evaluation under limited data and distribution shift. Given a fully specified SCM, all three levels of the Pearl Causal Hierarchy (PCH) are well-defined: any observational (L1), interventional (L2), or counterfactual (L3) query can be answered about the environment under study [2]. This formulation makes CausalARC an open-ended playground for testing reasoning hypotheses at all three levels of the PCH, with an emphasis on abstract, logical, and counterfactual reasoning.




The CausalARC testbed. (A) First, SCM M is manually transcribed in Python code. (B) Input-output pairs are randomly sampled, providing observational (L1) learning signals about the world model. (C) Sampling from interventional submodels M' of M yields interventional (L2) samples (x', y'). Given pair (x, y), performing multiple interventions while holding the exogenous context constant yields a set of counterfactual (L3) pairs. (D) Using L1 and L3 pairs as in-context demonstrations, we can automatically generate natural language prompts for diverse reasoning tasks.
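
Steps (A) through (D) can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not the CausalARC implementation: the world model (per-cell bits with Y := A AND B), the helper names `sample_context`, `evaluate`, and `format_prompt`, and the prompt layout are all hypothetical.

```python
import random

# (A) Hypothetical toy world model, assumed for illustration only:
#     per cell, exogenous bits (u_a, u_b); A := u_a, B := u_b, Y := A AND B.

def sample_context(n_cells, rng):
    """Draw the exogenous context; together with the SCM it fixes one world."""
    return [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(n_cells)]

def evaluate(context, do=None):
    """Evaluate the SCM on a context. `do` maps "A" or "B" to a constant
    that replaces that variable's mechanism (an interventional submodel)."""
    do = do or {}
    x, y = [], []
    for u_a, u_b in context:
        a = do.get("A", u_a)
        b = do.get("B", u_b)
        x.append((a, b))
        y.append(a & b)
    return x, y

def format_prompt(demos, query_x):
    """(D) Hypothetical prompt layout: demonstrations followed by an open query."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in demos]
    lines.append(f"Input: {query_x}\nOutput:")
    return "\n\n".join(lines)

rng = random.Random(0)

# (B) L1: an observational input-output pair.
context = sample_context(4, rng)
x, y = evaluate(context)

# (C) L2: a fresh context evaluated in the submodel with do(A = 1).
x_int, y_int = evaluate(sample_context(4, rng), do={"A": 1})

# (C) L3: several interventions applied to the SAME exogenous context.
interventions = [{"A": 1}, {"B": 0}]
counterfactuals = [evaluate(context, do=d) for d in interventions]

# (D) Use the L1 pair and one L3 pair as demonstrations; query the other L3 input.
prompt = format_prompt([(x, y)] + counterfactuals[:1], counterfactuals[1][0])
```

The distinction between L2 and L3 in this sketch is only whether the exogenous context is resampled or held fixed, which mirrors the description in the caption above.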

How to cite this work

Please cite our work using the following BibTeX.


      @misc{maasch2025carc,
        title={CausalARC: Abstract Reasoning with Causal World Models}, 
        author={Jacqueline Maasch and John Kalantari and Kia Khezeli},
        year={2025},
        eprint={2509.03636},
        archivePrefix={arXiv},
        primaryClass={cs.AI},
        url={https://arxiv.org/abs/2509.03636}, 
      }