Causal reasoning and compositional reasoning are two core aspirations in AI. Measuring these behaviors requires principled evaluation methods. Maasch et al. (2025) consider both behaviors simultaneously under the umbrella of compositional causal reasoning (CCR): the ability to infer how causal measures compose and, equivalently, how causal quantities propagate through graphs. CCR.GB applies the theoretical foundations provided by Maasch et al. (2025) to introduce new community datasets. CCR.GB provides a generative benchmark for measuring CCR at all three levels of Pearl's Causal Hierarchy: (1) associational, (2) interventional, and (3) counterfactual.
Pearl's Causal Hierarchy distinguishes three levels of reasoning: observing factual realities, exerting actions to induce interventional realities, and imagining alternate counterfactual realities [Bareinboim et al. 2022]. Lower levels underdetermine higher levels. The counterfactual quantity $p(y'_{x'} \mid x,y)$ is known as the probability of necessity (PN). CCR.GB uses a related counterfactual quantity, the probability of necessity and sufficiency (PNS), to evaluate AI reasoning [Pearl 1999].
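Concretely, for a binary treatment $X$ and outcome $Y$ (with $x'$ and $y'$ denoting the alternative values), these quantities can be written as $\mathrm{PN} = p(y'_{x'} \mid x, y)$ and $\mathrm{PNS} = p(y_{x}, y'_{x'})$ [Pearl 1999]. In words, PNS is the probability that the outcome responds to the treatment in both directions: $y$ occurs under $do(X = x)$ and $y'$ occurs under $do(X = x')$.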
CCR.GB provides two artifacts: (1) a static benchmark of pre-generated CCR tasks and (2) a generative pipeline for constructing custom tasks with user-specified themes and graph structures.
We define CCR as the ability to correctly infer (1) how local causal measures
compose into global causal measures and (2) how global
causal measures decompose into local causal measures, in
both factual and counterfactual worlds. By extension, this requires reasoning over the propagation
of causal quantities through graphs. Thus, our framework also implicitly tests for graphical reasoning.
To facilitate CCR evaluation, we introduce a framework for the exhaustive assessment of compositional consistency: correct inference that
equivalent compositions are indeed equal [Maasch et al. 2025]. We measure compositional
consistency both with respect to ground truth (external validity) and with respect to concordance among the LM's own responses (internal consistency).
While explicitly combining compositional and causal reasoning evaluation in LMs is new, the intersection of compositionality
and causality already enjoys a rich mathematical framework: the formalisms and methods of graphical modeling and causal
inference. A classic example of compositionality in causal inference is the decomposition of the total effect in linear structural causal models (SCMs).
When causal functions are linear, the total causal effect (TE) decomposes as the sum of the natural direct effect (NDE) and the natural indirect effect (NIE).
A compositionally consistent causal reasoning agent should correctly infer that TE is equivalent to NDE + NIE:
Example 1: Decomposition of the total effect in linear SCMs.
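For reference, these quantities can be written in Pearl's counterfactual mediation notation (our rendering, with treatment values $x$ and $x^*$ and mediator $M$): $\mathrm{TE} = \mathbb{E}[Y_{x}] - \mathbb{E}[Y_{x^*}]$, $\mathrm{NDE} = \mathbb{E}[Y_{x, M_{x^*}}] - \mathbb{E}[Y_{x^*}]$, and $\mathrm{NIE} = \mathbb{E}[Y_{x^*, M_{x}}] - \mathbb{E}[Y_{x^*}]$. In general, $\mathrm{TE} = \mathrm{NDE} - \mathrm{NIE}_r$, where $\mathrm{NIE}_r$ is the natural indirect effect of the reverse transition $x \to x^*$; linearity gives $\mathrm{NIE}_r = -\mathrm{NIE}$, which yields the additive composition $\mathrm{TE} = \mathrm{NDE} + \mathrm{NIE}$.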
Prior work has observed lower error rates on factual questions (recall) than on counterfactual questions (reasoning) [Hüyük et al. 2025].
When generating a custom CCR task, the user can specify the task theme, the graphical complexity of the causal DAG (total nodes, total biconnected components, connectivity), the desired causal functions (all logical OR, all logical AND, or randomly assigned), and the number of extraneous variables in the prompt (i.e., variables not needed to solve the task). Each task is then constructed from these specifications.
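Purely for intuition, the sketch below shows one way a random DAG with boolean (AND/OR) causal functions could be sampled and evaluated under interventions. Every name and design choice here is an illustrative assumption on our part, not the CCR.GB implementation, and it mirrors only a subset of the knobs listed above.

import random

def sample_task_spec(n_nodes=6, p_edge=0.4, functions="random", seed=0):
    # Illustrative only: sample a random DAG with boolean causal functions.
    rng = random.Random(seed)
    nodes = [f"V{i}" for i in range(n_nodes)]
    parents = {v: [] for v in nodes}
    # Orient all edges from lower to higher index to guarantee acyclicity.
    for j in range(1, n_nodes):
        for i in range(j):
            if rng.random() < p_edge:
                parents[nodes[j]].append(nodes[i])
    # Assign a logical AND or OR to every non-root node.
    func = {v: (rng.choice(["and", "or"]) if functions == "random" else functions)
            for v in nodes if parents[v]}
    return {"nodes": nodes, "parents": parents, "functions": func}

def evaluate(spec, assignment):
    # Propagate boolean values through the DAG. Nodes already present in
    # `assignment` (roots or intervened nodes) are never recomputed, which
    # gives do()-semantics for interventions.
    values = dict(assignment)
    for v in spec["nodes"]:  # index order is a topological order by construction
        if v in values:
            continue
        if not spec["parents"][v]:
            raise ValueError(f"root node {v} needs a value in `assignment`")
        pa_vals = [values[p] for p in spec["parents"][v]]
        values[v] = all(pa_vals) if spec["functions"][v] == "and" else any(pa_vals)
    return values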
Instead of directly prompting the model to perform formal causal inference (as in Jin et al. 2023), CCR.GB treats LMs as counterfactual data simulators (as in Gonzalez and Nori 2024, Hüyük et al. 2025). A series of yes/no questions is submitted to the LM, and the natural language responses are converted to their corresponding boolean values (e.g., using another LM as an extractor; see Hüyük et al. 2025, Maasch et al. 2025). The resulting boolean vectors are treated as samples from observational and interventional distributions. These boolean vectors are then used to compute PNS estimates, or classic performance metrics such as F1 score, precision, and accuracy.
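As a rough sketch of that last step (our illustration, not benchmark code): suppose the parsed yes/no answers about an outcome $Y$ have been collected under the interventions $do(X=1)$ and $do(X=0)$. If $Y$ is assumed monotone in $X$, PNS reduces to the difference of the two interventional probabilities (Tian and Pearl 2000) and can be estimated directly from the boolean vectors.

import numpy as np

def estimate_pns(y_do_x1, y_do_x0):
    # Assuming monotonicity of Y in X: PNS = P(Y=1 | do(X=1)) - P(Y=1 | do(X=0)).
    # Without monotonicity, max(0, this difference) is only a lower bound on PNS.
    y1 = np.asarray(y_do_x1, dtype=float)
    y0 = np.asarray(y_do_x0, dtype=float)
    return y1.mean() - y0.mean()

# Example with parsed yes/no answers (True = "yes") from repeated LM queries:
pns_hat = estimate_pns([True, True, False, True], [False, False, True, False])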
Responses generated using CCR.GB allow for reasoning evaluation at all three levels of Pearl's Causal Hierarchy.
Through the lens of compositional consistency – the ability to reason that theoretically equivalent compositions are indeed equal – this evaluation framework is able to reveal a range of taxonomically distinct error patterns. Arising from the notions of external validity and internal consistency described above, Maasch et al. (2025) introduce a four-category taxonomy for compositional reasoning errors: valid-consistent, valid-inconsistent, invalid-consistent, and invalid-inconsistent. This taxonomy can be used to categorize model performance on CCR.GB tasks, providing insights for further error analysis. See Maasch et al. (2025) for elaboration.
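Mechanically, the four categories arise by crossing the two binary criteria. A minimal sketch of how graded responses could be bucketed (the tolerance-based criteria and argument names below are our own illustration, not the benchmark's scoring code):

def classify_ccr_error(global_est, composed_est, ground_truth, tol=0.05):
    # External validity: does the model's direct estimate of the global quantity
    # match ground truth? Internal consistency: does it match the value obtained
    # by composing the model's own local estimates?
    valid = abs(global_est - ground_truth) <= tol
    consistent = abs(global_est - composed_est) <= tol
    return {(True, True): "valid-consistent",
            (True, False): "valid-inconsistent",
            (False, True): "invalid-consistent",
            (False, False): "invalid-inconsistent"}[(valid, consistent)]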
As LMs become more capable, CCR.GB is increasingly likely to be saturated. As shown in our preliminary results section and in Maasch et al. (2025), Llama models up to version 3.1 perform poorly even on very simple CCR tasks, GPT-4o performs poorly without chain-of-thought prompting plus demonstration, and o1 performs well. For further context, we provide results from more recent reasoning models on a simple FluVaccine task. While DeepSeek R1 outperformed o4-mini and GPT-4o overall, no model showed perfect interventional reasoning even on this simple structure.
As proof of concept, Maasch et al. (2025) demonstrate the design of CCR tasks for language models in the Llama, Phi, and GPT families.
On a simple math word problem, this framework revealed a range of taxonomically distinct error patterns. Additionally,
CCR errors increased with the complexity of causal paths for all models except o1.
Causal DAG of the simple CandyParty task used in Maasch et al. (2025).
The authors thank Dr Samuel M. Roh DMD, MD for insightful feedback on the structure of our ClinicalNotes task, which is loosely based on the format of real-world H&P notes.
@inproceedings{maasch2025ccr,
title={Compositional Causal Reasoning Evaluation in Language Models},
author={Jacqueline Maasch and Alihan Hüyük and Xinnuo Xu and Aditya V. Nori and Javier Gonzalez},
booktitle={Proceedings of the 42nd International Conference on Machine Learning (ICML)},
url={https://arxiv.org/abs/2503.04556},
year={2025}
}