1. Irrespective of whether causes are observed or unobserved, both
algorithms must assume that they detect the correct independencies in
the data. This is largely done by setting thresholds on independence
measures, which is liable to produce some misjudgments. However,
Spirtes et al. claim some stability for their algorithm even in the
face of such basic mistakes. Secondly, the algorithms must assume that
the dependencies they find are representative of the underlying causal
structure, meaning that they, and only they, would also be produced by
the same causal structure under different parameter settings. Pearl
calls this stability; Spirtes et al. call it faithfulness. Both
algorithms further suppose acyclic underlying causal structures.
In addition to these assumptions about the measured independencies in
the data and the nature of the underlying causal structure, the
algorithms are restricted by being purely observational, which leaves
them unable to distinguish between causal structures that do not
produce distinct independencies in the data. Such indistinguishable
structures may differ only in the identification of cause and effect
(A -> B vs. B -> A), but they may also be ambiguous as to the
existence of unmeasured common causes (A <- U -> B).
In fact, only a limited number of causal structures let these
algorithms rule out an unmeasured common cause as the explanation for
a dependency between two variables that cannot be made independent by
conditioning on any set of other measured variables. Take the most
basic relation: a dependency between A and B, with A and B as the
only variables. As mentioned above, it can be explained by a number
of possible causal structures, for example:
A -> B (A causes B)
A <- B (B causes A)
A <- U -> B (something unmeasured causes A and B)
and the best the algorithms can do is produce A - B, meaning that
there is some causal relation between A and B that may be due to
unmeasured causes.
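To make the indistinguishability concrete, here is a minimal sketch (in Python, with hand-picked example parameters that are purely illustrative): the factorizations corresponding to A -> B and B -> A can encode exactly the same joint distribution, so no test on observational data can separate them.

```python
# Joint distribution defined via A -> B: P(A) and P(B|A).
p_a = 0.3
p_b_given_a = {0: 0.2, 1: 0.9}

joint = {}
for a in (0, 1):
    for b in (0, 1):
        pa = p_a if a == 1 else 1 - p_a
        pb = p_b_given_a[a] if b == 1 else 1 - p_b_given_a[a]
        joint[(a, b)] = pa * pb

# Re-express the very same joint via B -> A: P(B) and P(A|B).
p_b = sum(joint[(a, 1)] for a in (0, 1))
p_a_given_b = {b: joint[(1, b)] / sum(joint[(a, b)] for a in (0, 1))
               for b in (0, 1)}

joint2 = {}
for a in (0, 1):
    for b in (0, 1):
        pb = p_b if b == 1 else 1 - p_b
        pa = p_a_given_b[b] if a == 1 else 1 - p_a_given_b[b]
        joint2[(a, b)] = pb * pa

# Both factorizations reproduce the identical joint distribution.
assert all(abs(joint[k] - joint2[k]) < 1e-12 for k in joint)
```

Since the two parameterizations produce indistinguishable data, only the weaker A - B conclusion is warranted.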
In some cases, the algorithms can use the independence patterns A and B
form with other variables to confirm or rule out the existence of
unmeasured common causes, and perhaps determine the direction of
causation between A and B. Assume A and B are dependent in the data
and cannot be made independent by conditioning on any set of other
measured variables (I will call this 'directly causally related'):
i) A third variable, C, that is also directly causally related to B
but not to A, can be used to orient the causal structure as follows,
if B is not part of a set that d-separates A and C:
A -> B <- C
The intuition behind this is that if conditioning on B does not make
A and C independent, then A and C must both be causes of B. Note,
however, that this does not rule out the existence of unobserved
causes like
A <- U1 -> B <- U2 -> C
because conditioning on B would still result in a dependency between A
and C.
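The collider pattern behind rule i) can be checked analytically. The following sketch (Python, with made-up noisy-OR parameters; the gate and numbers are my own illustration, not taken from the algorithms themselves) enumerates a small joint distribution in which A and C are marginally independent causes of B, and shows that conditioning on B induces exactly the dependency the rule exploits.

```python
from itertools import product

p_a, p_c = 0.4, 0.5          # P(A=1), P(C=1); A and C independent
w_a, w_c = 0.8, 0.7          # noisy-OR strengths of A and C on B

def p_b1(a, c):              # P(B=1 | A=a, C=c) under a noisy-OR gate
    return 1 - (1 - w_a * a) * (1 - w_c * c)

joint = {}                   # joint over (A, C, B)
for a, c, b in product((0, 1), repeat=3):
    pac = (p_a if a else 1 - p_a) * (p_c if c else 1 - p_c)
    joint[(a, c, b)] = pac * (p_b1(a, c) if b else 1 - p_b1(a, c))

def marg(**fix):             # sum the joint over all unfixed variables
    names = ("a", "c", "b")
    return sum(p for (a, c, b), p in joint.items()
               if all(dict(zip(names, (a, c, b)))[k] == v
                      for k, v in fix.items()))

# Marginally, A and C are independent ...
assert abs(marg(a=1, c=1) - marg(a=1) * marg(c=1)) < 1e-12

# ... but conditioning on the collider B makes them dependent:
# P(A=1,C=1 | B=1) != P(A=1 | B=1) * P(C=1 | B=1)
p_b = marg(b=1)
lhs = marg(a=1, c=1, b=1) / p_b
rhs = (marg(a=1, b=1) / p_b) * (marg(c=1, b=1) / p_b)
assert abs(lhs - rhs) > 1e-3
```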
ii) Note that the above cases may sometimes lead to doubly oriented
links, A <-> B. For example, i) may apply twice, with C and D. Say
that the pairs (A,B), (B,C), and (C,D) are all directly causally
related (but no other direct causal relations exist amongst them),
that B is not part of any set that d-separates A and C, and that C is
not part of any set that d-separates B and D. Then the disambiguation
in i) results in
A -> B <-> C <- D
which means that there must be an unmeasured common cause between B
and C. Intuitively, given A -> B <- C - D, conditioning on C should
make B independent of D. However, C not being part of any set that
d-separates B and D implies that this is not the case, and that C
must be caused by B and D (as opposed to causing them). This
contradicts the orientation already present, and implies that the
only explanation is an unmeasured common cause.
iii) There is one case in which the algorithms can be sure both that
no unmeasured common cause accounts for the causal relation between A
and B, and of whether A causes B or B causes A. If we have a variable
C that is directly causally connected to B but not to A, and we know
from the other disambiguation methods that C causes B and that A does
not cause B, then we can be sure that B causes A, with no unmeasured
common cause accounting for the causal relationship:
A <- B <- C and not A <- U -> B <- C
The intuition here is that i) would apply if B were not part of a set
that d-separates A and C, in which case we could not be sure that A
does not cause B. Given that we have ruled out A causing B, the
presence of C implies that conditioning on B must d-separate A and C,
which is only the case if A <- B, given that B <- C. Furthermore, we
know that this causal link is not due to an unmeasured common cause
U, because B does not d-separate A and C in A <- U -> B <- C.
iv) If a path from A to B through other variables is directed, and we
know that no link on the path can be explained by unmeasured causes,
then B cannot cause A, so A -> B (again, with unmeasured causes
possibly explaining the link, as in A <- U -> B). This holds because
directed links not due to common causes unambiguously specify a
direction of causation in an acyclic graph.
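Rule iv) amounts to a reachability check on the already-oriented part of the graph. A small sketch (Python; the toy graph and helper names are hypothetical) of how such an orientation step could look:

```python
def reachable(directed, src, dst):
    """Is there a directed path src -> ... -> dst?"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(child for parent, child in directed if parent == node)
    return False

def orient_by_acyclicity(directed, undirected):
    """Orient remaining undirected edges forced by existing directed paths."""
    oriented = set(directed)
    for x, y in undirected:
        if reachable(oriented, x, y):
            oriented.add((x, y))      # y -> x would close a cycle
        elif reachable(oriented, y, x):
            oriented.add((y, x))
    return oriented

# A -> C -> B is already oriented; the edge A - B is still undirected,
# so acyclicity forces A -> B.
result = orient_by_acyclicity({("A", "C"), ("C", "B")}, [("A", "B")])
assert ("A", "B") in result
```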
In summary, the algorithms of Pearl and Spirtes et al. can exploit the
above disambiguation patterns to determine the direction of causation
between measured variables in a limited number of cases. The
algorithms' conclusions range from the determination of definite and
direct causal links between variables, to cases in which there must or
may be unmeasured common causes explaining causal links, to cases
where the direction of causality cannot be determined from the data
and may also be due to unmeasured common causes. Which identification
obtains depends solely on the structure of the underlying causal graph.
The simple versions of the algorithms for cases in which unmeasured
variables exist produce only causal structures in which every definite
or possible unmeasured variable has exactly two children and no
parents. This is sufficient in the sense of what Pearl calls the
projection theorem, which states that for every causal structure with
unmeasured variables there is at least one causal structure in which
each unmeasured variable is a parentless common cause of exactly two
nonadjacent measured variables, and which stably implies the same set
of independence relationships as the original.
This finding implies that the algorithms yield graphs that posit
reasonable causal structures for the observed variables, but force the
possible unobserved variables into parentless common causes for pairs
of observed variables. This obviously may fail to reflect (at least up
to probabilistic dependency equivalence) the actual number and causal
structure of unobserved variables that might be useful and
interesting in the data's actual setting.
Spirtes et al. also give an algorithm for deriving further structure
amongst the unobserved variables in the case of a class of experimental
designs where experimenters specify the expected children of unobserved
variables. In this case, the presented algorithm can detect
inconsistencies with this experimental design and lead to a
re-estimation of the hidden structure.
-------------
2. Both the PC and the IC algorithm depend on an
accurate list of dependencies. Given that the data is representative
and that we can count on stability with infinite amounts of data,
mistakes in the finite case are due to spurious independencies, or a
spurious lack thereof, introduced by limited data. The question is
thus how the two algorithms use the dependency information to derive
causal structure, and how errors in this dependency information
change the results of the algorithms.
IC performs an exhaustive global search across all subsets of
variables for each pair of variables to determine whether they can be
made independent through conditioning on a subset. Spirtes et
al. comment that this is theoretically a relatively stable step. A
missed independence between two variables will likely still produce
the correct graph, because another set of variables may well imply the
same d-separation. A spurious independence will only result in a
single missing link in the initial graph. Both mistakes are of a
local nature.
PC, on the other hand, substitutes a local outwards search from each
variable for the global search in IC. The main type of error this can
lead to is an early and erroneous severing of the link between two
variables which would be corrected in IC by considering further subsets
of variables, but stays a mistake in PC because the path will never
again be checked. This type of mistake can lead to others, because all
variables along a severed path will now be judged to be in two
independent groups if there are no other paths connecting them. The IC
algorithm can make a different type of mistake in misjudging a more
remote independence and in response removing a local one, but the
effect would be limited to this mistake.
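The local search PC substitutes for IC's global one can be sketched as follows (Python; the oracle-based interface is my own simplification, replacing statistical tests with a function that answers conditional-independence queries). Conditioning sets are drawn only from the current neighbours of an edge's endpoints, growing in size, and a severed edge is never revisited, which is where the error mode described above enters:

```python
from itertools import combinations

def skeleton(variables, indep):
    """PC-style adjacency search. indep(x, y, S) answers CI queries."""
    adj = {v: set(variables) - {v} for v in variables}
    size = 0
    while any(len(adj[x]) - 1 >= size for x in variables):
        for x in variables:
            for y in list(adj[x]):
                # Condition only on current neighbours of x (local search).
                for s in combinations(adj[x] - {y}, size):
                    if indep(x, y, set(s)):
                        adj[x].discard(y)   # sever; PC never rechecks this pair
                        adj[y].discard(x)
                        break
        size += 1
    return {frozenset((x, y)) for x in variables for y in adj[x]}

# Oracle for the chain A -> B -> C: A and C independent given {B} only.
def chain_oracle(x, y, s):
    return {x, y} == {"A", "C"} and "B" in s

edges = skeleton(["A", "B", "C"], chain_oracle)
assert edges == {frozenset(("A", "B")), frozenset(("B", "C"))}
```

If the oracle spuriously reported an independence for, say, (A, B), the A - B edge would be removed early and never restored, illustrating the propagation risk discussed above.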
Both algorithms share the subsequent orientation steps that use the
disambiguation relations from my answer to question one. As implied by
that answer, the disambiguation steps are interdependent and thus
local mistakes can propagate in the form of misapplications of these
criteria.
In practice, Spirtes et al.'s test runs on random graphs show that the
number and severity of mistakes made by IC and PC depend strongly on
data size, especially for small amounts of data. They are also
affected by the in- and out-degrees of the true graph's vertices. In
these experiments the algorithms only become relatively independent
of the amount of data presented at more than 1000 data points. The
experiments further show that PC and IC perform differently along
different criteria. IC is less likely to erroneously add causal
structure to the graph (PC making far more mistakes of this type),
whereas PC is less likely to erroneously omit causal structure.
-------------
3. Applying the structure learning algorithms of Pearl and
Spirtes et al. as a model of human causal learning implies a
commitment to statistical bottom-up batch estimation of causal
structure. All three modifiers, statistical, bottom-up and batch,
might be considered problematic in the human case.
The algorithms rely on sufficient data to estimate useful independence
relationships amongst all the data. While human beings might estimate
some workings of the world this way, it seems implausible that all
human causal judgments rely on collecting the necessary amount of data
to make statistical decisions. Human beings are happy to make strong
causal judgments based on single examples. To use the constraint based
algorithms in these cases, very strong assumptions about the
representativeness of very few examples must be made, which may well fail
in general due to noise and spurious or lacking correlations.
Related to their problematic reliance on statistically relevant data
sets is the strict adherence to a general bottom-up search procedure
by these algorithms. They potentially consider all possible structure
configurations, and while one can force them to make use of prior
knowledge in the form of specific constraints on the structure, they
offer no general way to incorporate structural preferences and prior
knowledge, or to arbitrarily compare and revise different structures
in the face of evidence and top-down knowledge. Human beings
exhibit all of these features: they quickly map novel situations
onto known causal structures and know how to compare and update them
on the fly.
Finally, the algorithms are presented in a batch learning paradigm,
assuming that all data needed is present and accessible at once. This
is an unrealistic assumption in the human cognition case, where it is
hard to imagine that we collect and remember a lot of observational
data to make a one-shot estimation of causal structure. Rather, human
beings seem to incorporate new data in an online fashion, and revise
and update their structural assumptions continuously.
Whether these assumptions are computationally realistic depends on the
type of computational system imagined. Certainly, a stronger argument
can be made that computers in any situation can remember a near
arbitrary amount of data, making batch estimation and collecting
statistics slightly more reasonable. However, complete structural
re-estimation for every update is still a bad assumption for any even
moderately interactive system, and the problems of incorporating prior
beliefs and structural constraints in a top-down fashion are the same
as in the human case.
A Bayesian approach to the causal structure learning problem
eliminates some of these objections, but not all. Certainly, it
provides a clear way to incorporate structural priors into the
learning process in a top-down fashion, and allows for meaningful
global comparisons of hypothesized structures. Strong structural
priors also counteract some of the problems of sufficient statistics
in the sense that individual data points are embedded in an already
existing structural hypothesis, eliminating the need to estimate a
complete structure in a bottom-up fashion.
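As an illustration of such global structure comparison, here is a minimal sketch (Python; the two-variable hypotheses and Beta(1,1) parameter priors are my own toy choices, not a full structure learner) that scores 'A and B unconnected' against 'A causes B' by marginal likelihood, the quantity a structural prior would then weight:

```python
from math import lgamma

def log_beta_marginal(heads, tails):
    """Log marginal likelihood of binary data under a Beta(1,1) prior."""
    return lgamma(heads + 1) + lgamma(tails + 1) - lgamma(heads + tails + 2)

def log_score_indep(data):          # H0: A and B causally unconnected
    a = sum(x for x, _ in data)
    b = sum(y for _, y in data)
    n = len(data)
    return log_beta_marginal(a, n - a) + log_beta_marginal(b, n - b)

def log_score_a_causes_b(data):     # H1: A -> B, separate P(B|A=0), P(B|A=1)
    a = sum(x for x, _ in data)
    n = len(data)
    score = log_beta_marginal(a, n - a)
    for val in (0, 1):
        b1 = sum(y for x, y in data if x == val)
        b0 = sum(1 - y for x, y in data if x == val)
        score += log_beta_marginal(b1, b0)
    return score

# Data in which B closely tracks A: the dependent structure wins.
data = [(1, 1), (1, 1), (1, 1), (0, 0), (0, 0), (0, 0), (1, 1), (0, 0)]
assert log_score_a_causes_b(data) > log_score_indep(data)
```

Adding log-priors over the two structures to these scores is exactly the top-down influence the constraint-based algorithms lack.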
The question the Bayesian approach raises, however, is where the
strong and accurate prior structural assumptions come from. If they
need to be estimated statistically in a bottom-up fashion, the
Bayesian approach simply augments the constraint-based approach with a
way to decide quickly and rigorously between plausible structures
given little data. This is not what the Bayesian view seems to be
claiming, but to make a credible claim about the existence of
situation-specific structural priors, it needs to provide a theory as
to how these can be learned.
The full Bayesian view, just as the constraint-based view, is also
unrealistic in the total computational and memory load imposed on the
reasoner, who can sometimes be asked to consider large numbers of
possible models and data points to make an optimal decision,
especially in problems that extend over time. However, there are ways to
soften this assumption by computing locally optimal solutions in
structure space and time rather than the globally optimal ones the
rational estimator should consider.
Finally, there are large areas of causal structure estimation that are
not addressed by either proposed solution. These involve how human
beings decide how to carve the world up into possibly measurable
variables, and how their knowledge of physical, social and mental
workings interplays with the estimation of causal structure. From my
point of view, one of the most interesting questions left unanswered
as well is how language becomes part of the picture of human causal
judgments. For example, I am interested in how causal structure can be
communicated and shared through language, and how language use and
speakers' intentions can be modelled in a causal framework.
------------
4. Cheng's Power PC theory provides a local estimate of causal power
given that the reasoner knows a candidate cause and a candidate
effect. One should first note, therefore, that it does not as such
provide structural information, but rather estimates the strength of a
causal link. Given that the causal structure is known, and a causal
link exists between the candidate cause and its effect, Power PC
performs a maximum likelihood parameter estimation of the link
strength under the assumption that the candidate cause and other
possible causes of the same effect act mathematically as a noisy-OR
gate.
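Under that noisy-OR assumption, the maximum likelihood estimate of the link strength has a closed form. A short sketch (Python, with invented example probabilities): writing q_b = P(e|~c) for the combined influence of the other causes, noisy-OR gives P(e|c) = q_c + q_b - q_c * q_b, which can be solved for q_c:

```python
def causal_power(p_e_given_c, p_e_given_not_c):
    """Noisy-OR strength of a generative cause c on effect e (Power PC)."""
    return (p_e_given_c - p_e_given_not_c) / (1 - p_e_given_not_c)

# Example: the effect occurs 80% of the time with the cause, 50% without.
q_c = causal_power(0.8, 0.5)
assert abs(q_c - 0.6) < 1e-12

# Consistency check: plugging q_c back into the noisy-OR gate
# recovers the observed P(e|c).
assert abs((q_c + 0.5 - q_c * 0.5) - 0.8) < 1e-12
```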
Pearl introduces a quantity called the probability of necessity of a
cause, an answer to the question as to whether the effect would not
have occurred had the cause not occurred. He provides criteria for
when this quantity can be calculated from observational data, namely
when the causal effect of one variable on another can be calculated
as a conditional probability and a preventative effect is ruled
out. In this case, the probability of necessity of a cause is given by
Power PC.
Both of these findings show that Power PC should be interpreted as a
quantity that seems to correspond to human causal strength judgments
given a known causal structure with specific constraints, not as an
element of structure learning per se.
However, given that the quantity correlates well with human causal
strength judgments, and corresponds to the notion of a necessary cause
in this situation, it could see some use in structured approaches. For
one, the psychological fact that human beings seem to make judgments
related to this quantity supports measuring it in causal structures
inferred by other algorithms, and using it as a heuristic in
human-like reasoning and linguistic explanation. Power PC also
reappears as a possible parameter estimation component of a Bayesian
structure learning account, and can generally be used as an estimation
of causal parameters when the causal structure is known or can be
estimated by other means.
-----------
5.a. One answer given directly in Johnson-Laird seems to be the
frequentist interpretation of human performance, implying that human
beings convert the probabilistic problem into one of frequencies to
solve it. This is interpreted as calculating 'model ratios' of
possible models of the data, tagged with their frequency of
occurrence. Krynski and Tenenbaum argue explicitly against this
interpretation of their results, and Johnson-Laird acknowledges
findings that people may do better with probabilities than with
frequencies in some cases, without explicating what this implies for
his mental model theory.
Given this admission, I believe Johnson-Laird might actually agree
with the interpretation of Krynski and Tenenbaum with regards to the
causal structures people bring to a reasoning task. He might argue
that this prior knowledge dictates the variables people use in
constructing models to solve the problem, which may not match the
artificially limited question being asked. Furthermore, I think
Johnson-Laird should argue that his mental model theory does not
prescribe a specific computational-level (in Marr's sense) way of
performing probabilistic inference, but rather that mental models
provide algorithmic and representational constraints on the
implementation of calculations proposed by, for example, Bayesian
causal networks. This view in turn would augment the structured causal
network approach well by pointing to and testing possible algorithmic
and representational limitations in how human beings implement problem
solutions, even if the ideal solution approximated by the human
implementation corresponds well with a Causal Bayesian Network
account. In particular, mental model theory might suggest additional
versions of the questions or measurements on participants in Krynski
and Tenenbaum that might reveal why people do not achieve the correct
answer when they don't, taking into account algorithmic limitations
such as the number and types of models participants need to construct
to represent and reason about the question, rather than just prior
structural knowledge. These insights could lead to new versions
of questions that specifically isolate or ease the demands on
human reasoning in terms of mental model theory, and thus allow
for a better measurement of the relation of human probabilistic
reasoning to Bayesian causal networks.
5.b. Pearl puts forward two main reasons for viewing causal
relationships as fundamental and probabilistic relationships as
symptoms of an underlying causal structure. The first is one from
human intuition, which prefers some causal readings over others, even
though both can in principle lead to the same probabilistic
phenomena. The second concerns the usefulness of causal structures to
predict the response of the world to changes and actions.
Both arguments support Krynski and Tenenbaum's arguments and
discussion. People come to a probabilistic question that involves
real world entities with a set of useful and intuitive causal
structures that have helped them reason and act throughout life, and
quickly map the question into these, assuming features and
relationships that are not explicit in the question. They thus see the
probabilities given as symptoms of an underlying causal process that
should work according to what they know, and do not treat them as
isolated numbers that follow abstract mathematical laws.
--------------
6. Johnson-Laird's mental model theory attempts to correlate closely
with human limitations on parallel representations, memory
constraints, and on-line deductive reasoning processes, explaining
less the positive results people achieve than the errors they make. It
is important to note that this theory comes after a good number of
logical reasoning approaches that perform an ideal form of logical
deduction far more reliably and generally than human beings
do. Probabilistic reasoning approaches are attempting to model an
idealized version of human judgements first, and seem to account for
human performance limitations as an afterthought. In the future,
a theory like Johnson-Laird's mental models might provide a reasonable
explanation for human tendencies to err and perform sub-optimally.
Especially from reading Johnson-Laird's experiments on temporal event
models and human performance, I would think that his mental model
theory might contribute guidelines to the Bayesian reasoning efforts
as to the heuristics people use in considering variable
instantiations, and the limits imposed on them. The number of values
variables must take on to solve a reasoning problem, and the
preference with which humans consider them might lead to additional
constraints on and interpretations of human performance beyond the
ideal rational Bayesian estimate.
The problem with a direct application to the tasks in Kemp and
Tenenbaum is that the problems there are too simple for the existing
mental model theory to predict much about human performance in solving
them. This does not mean that it would not be useful for future
studies to phrase probabilistic estimations over entities suitable for
a mental model approach, and use the results of one as prior
information on the performance of the other, as suggested in
Johnson-Laird 2001. However, the problem in Kemp and Tenenbaum is
still to make a computational model produce the inductive conclusions
human beings draw, not to explain their failings.
In general, formal Bayesian reasoning solutions may provide loose
correlates to human performance for simple cases, but adjustments are
necessary in, for example, Steyvers, Tenenbaum, Wagenmakers and Blum,
to match limited memory and limited computational power in human
performance. Johnson-Laird's mental model theory seems to speak to
these adjustments. I do believe that a hybrid approach of the two
considerations would be fruitful, with Bayesian reasoning serving as
an idealized view that human thinking approximates, and mental models
providing a way to discretize the necessary computations (in time, and
perhaps also numerically) and to impose realistic memory and
representational constraints.
This hybrid approach would have all the advantages of Bayesian
modelling, like rational probability estimates, rigorous integration
across possible models using priors, and structural estimation. In
many instances these features seem to match tendencies in human
judgements closely. At the same time a version of mental model theory
would contribute predictions as to how human beings would go about
instantiating the actual structures and computations needed and
produced, hinting at biases (such as preferring models with variables
set to 'true') and errors (due, perhaps, to failure to consider
important instantiations). These performance considerations produced
by mental model theory might stay in a separate framework, but could
preferably also be merged as well-motivated priors and restrictions
into a generalized Bayesian account.
--------------
7. Pearl's approach to causal reasoning yields two
important features to explain and model human thinking. On the one
hand, it addresses the question of how one might go about detecting
and identifying causal relationships from observations alone and,
perhaps augmented by an information-theoretic criterion for deciding
on actions, from controlled experimentation. These basic questions are
not addressed in the qualitative reasoning approaches, but are
obviously important in addressing how human beings decide to act and
create mental models of the world. On the other hand, Pearl's findings
provide a framework for modeling previously unmodelled reasoning
processes at a relatively abstract level, namely the probabilistic
prediction of action effects, and the estimation of
counterfactuals. Again, neither is directly addressed in the
qualitative reasoning works, and Pearl's treatment shows that where
one may think they are addressed, many important considerations are
missing (for example, the global effects that counterfactual
statements need to take into account).
Because Pearl's theory comes in at a high level of abstraction, it
leaves the underlying workings of the world being modelled largely
unspecified. This has two effects: a) to identify causal
relationships, Pearl makes only very non-restrictive assumptions, and
thus his proposed algorithms need a lot of representative data and
computation time, and b) the relationships between variables in
Pearl's formalism are unspecified, and he does not consider it part of
his theory to determine their nature.
It seems that qualitative reasoning approaches can provide useful
insights in these places where Pearl's model is sparse. They provide a
relatively detailed model of at least some of the physical and
spatiotemporal paradigms people reason in, explicating the variables
and restricting the causal relationships that people might hypothesize
in time and space. They also supply examples of the qualitative
relationships between variables that people seem to use in reasoning,
thus filling in some of the missing details of Pearl's general
identification and prediction schema.
Overall, the integration of qualitative reasoning approaches about
specific systems and general causal reasoning a la Pearl should be a
bidirectional one. Human beings are rarely faced with utterly unknown
causal relationships in the world. When they are, perhaps as children,
perhaps as adults when reasoning about truly novel systems, they apply
something akin to Pearl's general reasoning strategy. In all other
instances, they come with a large amount of prior knowledge that not
only specifies assumed causal structures, but also their detailed
physical and temporal relationships, as given by qualitative reasoning
approaches. These priors put constraints on the types of causal
structures and the relationships of the variables in the systems human
reasoners will consider, making model building and inference quicker
and less data hungry. They also allow the filling in of the detailed
qualitative relationships between variables in an abstract model, and
provide existing algorithms to simulate their behaviour. In return,
phrasing these relationships in the paradigm of structural causal
models allows for the types of modeling and decision making explicated
by Pearl, which go beyond filling in values in qualitative
relationships and instead let the reasoner consider which actions to
take, and how to reason reliably over a larger causal context.