1. Irrespective of whether causes are observed or unobserved, both algorithms must first assume that they detect the correct independencies in the data. In practice this is done by setting thresholds on independence measures, which is liable to produce some misjudgments; Spirtes et al. claim, however, that their algorithm retains some stability even in the face of such basic mistakes. Second, the algorithms must assume that the dependencies they find are representative of the underlying causal structure, meaning that they, and only they, would also be produced by the same causal structure under different parameter settings. Pearl calls this stability, Spirtes et al. faithfulness. Both algorithms further suppose that the underlying causal structure is acyclic.

In addition to these assumptions about the measured independencies and the nature of the underlying causal structure, the algorithms are restricted by being purely observational, which leaves them unable to distinguish between causal structures that do not produce distinct independencies in the data. Such indistinguishable structures may differ only in the identification of cause and effect (A -> B vs. B -> A), but they may also be ambiguous as to the existence of unmeasured common causes (A <- U -> B). In fact, only a limited number of causal structures let these algorithms rule out an unmeasured common cause as the explanation of a dependency between two variables that cannot be removed by conditioning on any other set of measured variables. The most basic relation, a dependency between A and B with A and B as the only variables, can be explained by a number of possible causal structures, for example:

A -> B (A causes B)
A <- B (B causes A)
A <- U -> B (something unmeasured causes both A and B)

The best the algorithms can do here is produce A - B, meaning that there is some causal relation between A and B that may be due to unmeasured causes.

In some cases, the algorithms can use the independence patterns A and B form with other variables to confirm or deny the existence of unmeasured common causes, and perhaps to determine the direction of causation between A and B. Assume that A and B are dependent in the data and cannot be made independent by conditioning on any other set of measured variables (I will call this 'directly causally related'). Then:

i) A third variable C that is directly causally related to B but not to A can be used to orient the causal structure as

A -> B <- C,

provided B is not part of any set that d-separates A and C. The intuition behind this is that if conditioning on B does not make A and C independent, A and C must be causes of B. Note, however, that this does not rule out unobserved causes, as in A <- U1 -> B <- U2 -> C, because conditioning on B would still result in a dependency between A and C.
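To make rule i) concrete, here is a minimal sketch in Python (my own illustration, not the published pseudocode of either algorithm; the names adjacent and sepset are mine). Given an undirected skeleton and the separating sets found earlier, it orients every unshielded triple A - B - C as a collider whenever B is absent from the set that separated A and C:

    from itertools import combinations

    def orient_colliders(adjacent, sepset):
        # adjacent: each variable's neighbours in the undirected skeleton.
        # sepset: for every non-adjacent pair, the set that separated it.
        arrows = set()
        for b in adjacent:
            for a, c in combinations(adjacent[b], 2):
                if c in adjacent[a]:                   # shielded triple: skip
                    continue
                if b not in sepset[frozenset((a, c))]:
                    arrows.add((a, b))                 # orient A -> B
                    arrows.add((c, b))                 # orient C -> B
        return arrows

    # Skeleton A - B - C where A and C were separated by the empty set:
    adjacent = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
    sepset = {frozenset(("A", "C")): set()}
    print(orient_colliders(adjacent, sepset))          # {('A','B'), ('C','B')}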
ii) Note that rule i) may sometimes orient an edge in both directions, A <-> B. For example, i) may apply twice, with two variables C and D. Say the pairs (A,B), (B,C), (C,D) are all directly causally related (with no other direct causal relations among them), B is not part of any set that d-separates A and C, and C is not part of any set that d-separates B and D. Then the orientation rule in i) yields

A -> B <-> C <- D,

which means there must be an unmeasured common cause of B and C. Intuitively, given A -> B <- C - D, if C caused B, conditioning on C should make B independent of D. That C is not part of any set d-separating B and D implies that this is not the case, and hence that C must be caused by B and D rather than causing them. This contradicts the orientation C -> B already derived, and the only remaining explanation is an unmeasured common cause between B and C.

iii) There is one case in which the algorithms can be certain both that no unmeasured common cause accounts for the causal relation between A and B, and whether A causes B or B causes A. Suppose there is a variable C that is directly causally related to B but not to A, we know from the other disambiguation methods that C causes B, and we know that A does not cause B. Then we can be sure that B causes A, with no unmeasured common cause accounting for the relationship:

A <- B <- C and not A <- U -> B <- C.

The intuition here is that rule i) would apply if B were not part of a set that d-separates A and C, in which case we could not be sure that A does not cause B. Since we are given that A does not cause B, the presence of C implies that conditioning on B must d-separate A and C, which, given B <- C, is only the case if A <- B. Furthermore, we know this link is not due to the presence of an unmeasured U, because in A <- U -> B <- C, B does not d-separate A and C.

iv) If a path from A to B through other variables is directed, and we know that no link on the path is explained by unmeasured causes, then B cannot cause A, so A -> B (where unmeasured causes may still additionally explain the direct link, as in A <- U -> B). This holds because, in an acyclic graph, directed links that are not due to common causes unambiguously fix the direction of causation.

In summary, the algorithms of Pearl and Spirtes et al. can exploit the above disambiguation patterns to determine the direction of causation between measured variables in a limited number of cases. Their conclusions range from definite and direct causal links between variables, through cases in which there must or may be unmeasured common causes explaining causal links, to cases where the direction of causality cannot be determined from the data and may also be due to unmeasured common causes. Which identification obtains depends solely on the structure of the underlying causal graph.
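All of the orientation criteria above bottom out in d-separation queries against candidate graphs. For concreteness, a self-contained sketch of such a test (my own, assuming the DAG is passed as a dict of parent sets), using the standard construction: restrict the graph to the ancestors of the variables involved, moralize it, delete the conditioning set, and check connectivity:

    def d_separated(parents, xs, ys, zs):
        # parents: dict mapping each node to the set of its parents (a DAG).
        relevant, stack = set(), list(xs | ys | zs)
        while stack:                                   # 1. ancestral subgraph
            n = stack.pop()
            if n not in relevant:
                relevant.add(n)
                stack.extend(parents.get(n, ()))
        nbr = {n: set() for n in relevant}
        for n in relevant:                             # 2. moralize
            ps = parents.get(n, set()) & relevant
            for p in ps:
                nbr[n].add(p); nbr[p].add(n)           # parent-child edges
                nbr[p].update(ps - {p})                # marry co-parents
        seen, stack = set(xs - zs), list(xs - zs)
        while stack:                                   # 3. reachability, skipping zs
            n = stack.pop()
            for m in nbr[n] - zs:
                if m not in seen:
                    seen.add(m); stack.append(m)
        return not (seen & ys)

    # Collider A -> B <- C: A, C independent marginally, dependent given B.
    parents = {"A": set(), "C": set(), "B": {"A", "C"}}
    print(d_separated(parents, {"A"}, {"C"}, set()))   # True
    print(d_separated(parents, {"A"}, {"C"}, {"B"}))   # False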
The simple versions of the algorithms for cases with unmeasured variables only produce causal structures in which every definite and possible unmeasured variable has exactly two children and no parents. This is sufficient in the sense of what Pearl calls the projection theorem: every causal structure with unmeasured variables has at least one counterpart in which each unmeasured variable is a parentless common cause of exactly two nonadjacent measured variables, and this counterpart stably implies the same set of independence relationships as the original. The algorithms therefore yield graphs that posit reasonable causal structures over the observed variables, but force all possible unobserved variables into parentless common causes of pairs of observed variables. Obviously, this need not reflect (even up to equivalence of probabilistic dependencies) the actual number and causal structure of the unobserved variables that might be useful and interesting in the data's actual setting. Spirtes et al. also give an algorithm for deriving further structure among the unobserved variables for a class of experimental designs in which the experimenters specify the expected children of unobserved variables. In this case, the algorithm can detect inconsistencies with the experimental design and lead to a re-estimation of the hidden structure.

-------------

2. Both the PC and the IC algorithm depend on an accurate list of dependencies. Given that the data are representative and that stability can be counted on with infinite data, mistakes on finite data are due to spurious independencies, or spuriously missing ones, introduced by the lack of data. The question is thus how the two algorithms use the dependency information to derive causal structure, and how errors in this information change their results.

IC performs an exhaustive global search: for each pair of variables, it searches across all subsets of the remaining variables to determine whether the pair can be made independent by conditioning on some subset. Spirtes et al. comment that this is theoretically a relatively stable step. A missed independence between two variables will likely still produce the correct graph, because another set of variables may well imply the same d-separation; a spurious independence will only result in a single missing link in the initial graph. Both mistakes are local in nature.

PC, on the other hand, substitutes a local outwards search from each variable for IC's global search. The main type of error this can lead to is an early, erroneous severing of the link between two variables, which IC would correct by considering further subsets of variables, but which remains a mistake in PC because the path is never checked again. This mistake can breed others, because all variables along a severed path will now be judged to lie in two independent groups if no other paths connect them. IC can make a different type of mistake, misjudging a more remote independence and removing a local link in response, but the effect remains limited to that mistake.

Both algorithms share the subsequent orientation steps, which use the disambiguation relations from my answer to question one. As that answer implies, the disambiguation steps are interdependent, so local mistakes can propagate in the form of misapplications of these criteria.

In practice, Spirtes et al.'s test runs on random graphs show that the number and severity of the mistakes made by IC and PC depend strongly on data size, especially for small data sets, and are also affected by the in- and out-degrees of the true graph's vertices. In these experiments, the algorithms only become relatively insensitive to the amount of data at more than 1000 data points. The experiments further show that PC and IC perform differently along different criteria: IC is less likely to erroneously add causal structure to the graph (PC makes far more mistakes of this type), whereas PC is less likely to erroneously omit causal structure.
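The difference between the two search regimes is easiest to see in code. Below is a schematic Python sketch of a PC-style skeleton phase (a simplification of my own, not Spirtes et al.'s published pseudocode): edges are tested only against conditioning sets drawn from current neighbours, with growing size, and a severed edge is never revisited, which is exactly the failure mode discussed above. The oracle independent stands in for a fallible, thresholded independence test.

    from itertools import combinations

    def pc_skeleton(variables, independent):
        # Start from the complete undirected graph.
        adj = {v: set(variables) - {v} for v in variables}
        sepset = {}
        size = 0
        # Grow the conditioning-set size; condition only on current
        # neighbours, and never test a severed edge again.
        while any(len(adj[a] - {b}) >= size for a in variables for b in adj[a]):
            for a in variables:
                for b in list(adj[a]):
                    for cond in combinations(adj[a] - {b}, size):
                        if independent(a, b, set(cond)):
                            adj[a].discard(b); adj[b].discard(a)   # final!
                            sepset[frozenset((a, b))] = set(cond)
                            break
            size += 1
        return {frozenset((a, b)) for a in variables for b in adj[a]}, sepset

    # Toy oracle for the chain A -> B -> C (A and C independent given B):
    edges, seps = pc_skeleton(
        ["A", "B", "C"],
        lambda a, b, cond: frozenset((a, b)) == frozenset(("A", "C")) and "B" in cond,
    )
    print(edges)   # {frozenset({'A', 'B'}), frozenset({'B', 'C'})}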
-------------

3. Applying the structure learning algorithms of Pearl and Spirtes et al. as a model of human causal learning implies a commitment to statistical, bottom-up, batch estimation of causal structure. All three modifiers, statistical, bottom-up and batch, might be considered problematic in the human case.

The algorithms rely on sufficient data to estimate useful independence relationships among all the variables. While human beings might estimate some workings of the world this way, it seems implausible that all human causal judgments rely on collecting enough data to make statistical decisions; human beings are happy to make strong causal judgments based on single examples. To use the constraint-based algorithms in these cases, very strong assumptions about the representativeness of very few examples must be made, and these may well fail in general due to noise and spurious or missing correlations.

Related to this problematic reliance on statistically adequate data sets is the algorithms' strict adherence to a general bottom-up search procedure. They potentially consider all possible structure configurations, and while one can force them to make use of prior knowledge in the form of specific constraints on the structure, they offer no general way to incorporate structural preferences and prior knowledge, or to arbitrarily compare and revise different structures in the face of evidence and top-down knowledge. Human beings exhibit all of these features: they quickly map novel situations onto known causal structures and know how to compare and update them on the fly.

Finally, the algorithms are presented in a batch learning paradigm, assuming that all the data needed is present and accessible at once. This is an unrealistic assumption for human cognition, where it is hard to imagine that we collect and remember large amounts of observational data in order to make a one-shot estimation of causal structure. Rather, human beings seem to incorporate new data in an online fashion, revising and updating their structural assumptions continuously.

Whether these assumptions are computationally realistic depends on the type of computational system imagined. Certainly, a stronger argument can be made that computers can remember a near-arbitrary amount of data, making batch estimation and the collection of statistics somewhat more reasonable. However, complete structural re-estimation on every update is still a bad assumption for any even moderately interactive system, and the problems of incorporating prior beliefs and structural constraints in a top-down fashion are the same as in the human case.

A Bayesian approach to the causal structure learning problem eliminates some of these objections, but not all. Certainly, it provides a clear way to incorporate structural priors into the learning process in a top-down fashion, and it allows for meaningful global comparisons of hypothesized structures. Strong structural priors also counteract some of the problems of sufficient statistics, in the sense that individual data points are embedded in an already existing structural hypothesis, eliminating the need to estimate a complete structure bottom-up. The question raised by the Bayesian approach, however, is where the strong and accurate prior structural assumptions come from. If they must themselves be estimated statistically in a bottom-up fashion, the Bayesian approach merely augments the constraint-based approach with a rigorous way to decide quickly between plausible structures given little data. This is not what the Bayesian view seems to be claiming, but to make a credible claim about the existence of situation-specific structural priors it needs to provide a theory of how these can be learned.
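A toy sketch of the contrast (entirely illustrative; the two candidate structures and their likelihoods are invented for the example, and nothing here is from Pearl or Spirtes et al.): a Bayesian learner can hold a posterior over a handful of candidate structures and update it one observation at a time, so structural priors enter top-down and no batch re-estimation is required.

    # Two invented candidate structures over a pair of binary variables
    # (a, b), each with a per-observation likelihood function.
    likelihoods = {
        "A -> B":      lambda a, b: 0.5 * (0.9 if b == a else 0.1),  # B noisily copies A
        "A, B indep.": lambda a, b: 0.5 * 0.5,
    }
    posterior = {"A -> B": 0.7, "A, B indep.": 0.3}   # top-down structural prior

    def update(posterior, a, b):
        # One online Bayesian step: posterior ∝ prior × likelihood.
        unnorm = {h: p * likelihoods[h](a, b) for h, p in posterior.items()}
        z = sum(unnorm.values())
        return {h: u / z for h, u in unnorm.items()}

    for a, b in [(1, 1), (0, 0), (1, 0)]:             # a small data stream
        posterior = update(posterior, a, b)
    print(posterior)   # belief shifts with each observation, no batch pass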
The full Bayesian view, just like the constraint-based view, is also unrealistic in the total computational and memory load it imposes on the reasoner, who can sometimes be asked to consider large numbers of possible models and data points to make an optimal decision, especially in problems that extend over time. There are, however, ways to soften this assumption by computing locally optimal solutions in structure space and time rather than the globally optimal ones the rational estimator should consider.

Finally, there are large areas of causal structure estimation that are not addressed by either proposed solution. These involve how human beings decide to carve the world up into possibly measurable variables, and how their knowledge of physical, social and mental workings interacts with the estimation of causal structure. From my point of view, one of the most interesting questions left unanswered is how language becomes part of the picture of human causal judgments. For example, I am interested in how causal structure can be communicated and shared through language, and how language use and speakers' intentions can be modelled in a causal framework.

-------------

4. Cheng's Power PC theory provides a local estimate of causal power, given that the reasoner knows a candidate cause and a candidate effect. One should first note, therefore, that it does not as such provide structural information; rather, it estimates the strength of a causal link. Given that the causal structure is known and a causal link exists between the candidate cause and its effect, Power PC performs a maximum likelihood estimation of the link strength under the assumption that the candidate cause and the other possible causes of the same effect combine mathematically as a noisy-OR gate.
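Concretely (in my notation, not Cheng's: write w_c for the candidate's generative power and w_b for the net probability that the background causes alone produce the effect, with the two acting independently), the noisy-OR assumption gives

    P(e \mid c) = w_c + w_b - w_c\, w_b, \qquad P(e \mid \neg c) = w_b,

so the maximum likelihood estimate has the closed form

    w_c = \frac{P(e \mid c) - P(e \mid \neg c)}{1 - P(e \mid \neg c)},

which is Cheng's causal power for a generative cause.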
Pearl introduces a quantity called the probability of necessity of a cause, answering the question whether the effect would not have occurred had the cause not occurred. He provides criteria for when this quantity can be calculated from observational data, namely when the causal effect of one variable on another can be calculated as a conditional probability (exogeneity) and when a preventative effect is ruled out (monotonicity). In this case, the probability of necessity of a cause is given by Power PC.

Both of these findings show that Power PC should be interpreted as a quantity that seems to correspond to human causal strength judgments given a known causal structure with specific constraints, not as an element of structure learning per se. However, given that the quantity correlates well with human causal strength judgments, and corresponds to the notion of a necessary cause in this situation, it could see some use in structured approaches. For one, the psychological fact that human beings seem to make judgments related to this quantity supports measuring it in causal structures inferred by other algorithms, and using it as a heuristic in human-like reasoning and linguistic explanation. Power PC also reappears as a possible parameter estimation component of a Bayesian structure learning account, and it can generally be used to estimate causal parameters when the causal structure is known or can be estimated by other means.

-------------

5.a. One answer given directly in Johnson-Laird seems to be the frequentist interpretation of human performance, implying that human beings convert the probabilistic problem into one of frequencies in order to solve it. This is interpreted as calculating 'model ratios' of possible models of the data, tagged with their frequency of occurrence. Krynski and Tenenbaum argue explicitly against this interpretation of their results, and Johnson-Laird acknowledges findings that people may do better with probabilities than with frequencies in some cases, without explicating what this implies for his mental model theory.

Given this admission, I believe Johnson-Laird might actually agree with Krynski and Tenenbaum's interpretation with regard to the causal structures people bring to a reasoning task. He might argue that this prior knowledge dictates the variables people use in constructing models to solve the problem, which may not match the artificially limited question being asked. Furthermore, I think Johnson-Laird should argue that his mental model theory does not so much prescribe a specific computational (in Marr's sense) way of performing probabilistic inference as provide algorithmic and representational constraints on the implementation of calculations proposed by, for example, Bayesian causal networks. This view would in turn complement the structured causal network approach well by pointing to, and testing, possible algorithmic and representational limitations in how human beings implement problem solutions, even if the ideal solution approximated by the human implementation corresponds well with a causal Bayesian network account. In particular, mental model theory might suggest additional versions of the questions, or additional measurements on the participants, in Krynski and Tenenbaum that might reveal why people fail to reach the correct answer when they do fail, taking into account algorithmic limitations such as the number and types of models participants need to construct to represent and reason about the question, rather than just prior structural knowledge. These insights could lead to new versions of the questions that specifically isolate or ease the demands on human reasoning in terms of mental model theory, and thus allow for a better measurement of the relation between human probabilistic reasoning and Bayesian causal networks.

5.b. Pearl puts forward two main reasons for viewing causal relationships as fundamental and probabilistic relationships as symptoms of an underlying causal structure. The first is from human intuition, which prefers some causal readings over others, even though both can in principle lead to the same probabilistic phenomena. The second concerns the usefulness of causal structures for predicting the response of the world to changes and actions. Both arguments support Krynski and Tenenbaum's arguments and discussion. People come to a probabilistic question involving real-world entities with a set of useful and intuitive causal structures that have helped them reason and act throughout life, and they quickly map the question onto these, assuming features and relationships that are not explicit in the question. They thus see the given probabilities as symptoms of an underlying causal process that should work according to what they know, and do not treat them as isolated numbers that follow abstract mathematical laws.

-------------

6. Johnson-Laird's mental model theory attempts to align closely with human limitations on parallel representation, memory, and online deductive reasoning, explaining less the positive results people achieve than the errors they make.
It is important to note that this theory comes after a good number of logical reasoning approaches that perform an idealized form of logical deduction far more reliably and generally than human beings do. Probabilistic reasoning approaches, by contrast, attempt to model an idealized version of human judgment first, and seem to account for human performance limitations only as an afterthought. In the future, a theory like Johnson-Laird's mental models might provide a reasonable explanation for human tendencies to err and perform sub-optimally.

Especially from reading Johnson-Laird's experiments on temporal event models and human performance, I would think that his mental model theory might contribute to the Bayesian reasoning efforts guidelines as to the heuristics people use in considering variable instantiations, and the limits imposed on them. The number of values variables must take on to solve a reasoning problem, and the preference with which humans consider them, might lead to additional constraints on, and interpretations of, human performance beyond the ideal rational Bayesian estimate. The problem with a direct application to the tasks in Kemp and Tenenbaum is that the problems there are too simple for the existing mental model theory to predict much about human performance in solving them. This does not mean that it would not be useful for future studies to phrase probabilistic estimations over entities suitable for a mental model approach, and to use the results of one as prior information on the performance of the other, as suggested in Johnson-Laird 2001. However, the problem in Kemp and Tenenbaum is still to make a computational model produce the inductive conclusions human beings draw, not to explain their failings.

In general, formal Bayesian reasoning solutions may provide loose correlates of human performance in simple cases, but adjustments are necessary, for example in Steyvers, Tenenbaum, Wagenmakers and Blum, to match the limited memory and limited computational power evident in human performance. Johnson-Laird's mental model theory seems to speak to exactly these adjustments. I do believe that a hybrid approach would be fruitful, with Bayesian reasoning serving as an idealized view that human thinking approximates, and mental models providing a way to discretize the necessary computations (in time, and perhaps also numerically) and to impose realistic memory and representational constraints. This hybrid approach would have all the advantages of Bayesian modelling, such as rational probability estimates, rigorous integration across possible models using priors, and structural estimation; in many instances these features seem to match tendencies in human judgment closely. At the same time, a version of mental model theory would contribute predictions as to how human beings go about instantiating the actual structures and computations needed, hinting at biases (such as preferring models with variables set to 'true') and errors (perhaps due to failure to consider important instantiations). These performance considerations might stay in a separate framework, but preferably they would be merged, as well-motivated priors and restrictions, into a generalized Bayesian account.
-------------

7. Pearl's approach to causal reasoning offers two important features for explaining and modeling human thinking. On the one hand, it addresses the question of how one might go about detecting and identifying causal relationships from observations alone and (augmented perhaps by an information-theoretic criterion for deciding on actions) from controlled experimentation. These basic questions are not addressed in the qualitative reasoning approaches, but they are obviously important to how human beings decide to act and build mental models of the world. On the other hand, Pearl's findings provide a framework for modeling previously unmodelled reasoning processes at a relatively abstract level, namely the probabilistic prediction of action effects and the estimation of counterfactuals. Again, neither is directly addressed in the qualitative reasoning works, and Pearl's treatment shows that where one might think they are addressed, many important considerations are left out (for example, the global effects that must be taken into account for counterfactual statements).

Because Pearl's theory operates at a high level of abstraction, it leaves the underlying workings of the modelled world largely unspecified. This has two effects: a) to identify causal relationships, Pearl makes very unrestrictive assumptions, so his proposed algorithms need a lot of representative data and computation time, and b) the relationships between variables in Pearl's formalism are unspecified, and he does not consider it part of his theory to determine their nature.

It seems that qualitative reasoning approaches can provide useful insights exactly where Pearl's model is sparse. They provide a relatively detailed model of at least some of the physical and spatiotemporal paradigms people reason in, explicating the variables and restricting, in time and space, the causal relationships that people might hypothesize. They also supply examples of the qualitative relationships between variables that people seem to use in reasoning, thus filling in some of the missing details of Pearl's general identification and prediction scheme.

Overall, the integration of qualitative reasoning about specific systems and general causal reasoning à la Pearl should be bidirectional. Human beings are rarely faced with utterly unknown causal relationships in the world. When they are, perhaps as children, or as adults reasoning about truly novel systems, they apply something akin to Pearl's general reasoning strategy. In all other instances, they come with a large amount of prior knowledge that specifies not only assumed causal structures but also their detailed physical and temporal relationships, as given by qualitative reasoning approaches. These priors constrain the types of causal structures and the relationships among variables that human reasoners will consider, making model building and inference quicker and less data-hungry. They also allow the detailed qualitative relationships between variables in an abstract model to be filled in, and provide existing algorithms to simulate their behaviour. In return, phrasing these relationships in the paradigm of structural causal models allows for the kinds of modeling and decision making explicated by Pearl, which go beyond filling in values in qualitative relationships and instead let the reasoner consider which actions to take, and how to reason reliably over a larger causal context.
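To make the contrast concrete: in Pearl's framework, predicting an action's effect means intervening, which replaces a variable's mechanism rather than conditioning on its value. A toy structural model of my own (not a model from any of the readings) shows how P(B | A) and P(B | do(A)) come apart under confounding:

    import random

    # Toy confounded structural model: U -> A, U -> B, and A -> B,
    # with U unmeasured. All numbers are invented for illustration.
    def sample(do_a=None):
        u = random.random() < 0.5                     # unmeasured common cause
        a = u if do_a is None else do_a               # A := U, unless intervened on
        p_b = 0.8 if a else 0.1                       # A's effect on B ...
        p_b = min(1.0, p_b + 0.15 * u)                # ... plus U's direct effect
        return a, random.random() < p_b

    def p_b(do_a=None, given_a=None, n=100_000):
        draws = [sample(do_a) for _ in range(n)]
        if given_a is not None:
            draws = [(a, b) for a, b in draws if a == given_a]
        return sum(b for _, b in draws) / len(draws)

    # Conditioning mixes A's effect with the back-door path through U;
    # intervening cuts that path, so the two quantities differ.
    print(p_b(given_a=True))   # P(B=1 | A=1)     ~ 0.95
    print(p_b(do_a=True))      # P(B=1 | do(A=1)) ~ 0.875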