A simplicity principle for language learning:

Re-evaluating what can be

learned from positive evidence

 

 

Nick Chater

Institute for Applied Cognitive Science,

Department of Psychology,

University of Warwick.

 

Paul Vitányi

Centrum voor Wiskunde en Informatica

 

 

 

 

Please address correspondence concerning this article to Nick Chater, Institute for Applied Cognitive Science, Department of Psychology, University of Warwick, Coventry CV4 7AL, UK, email: nick.chater@warwick.ac.uk; or to Paul Vitányi, Centrum voor Wiskunde en Informatica, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands, email: Paul.Vitanyi@cwi.nl.

 

 

Abstract

This paper applies the “Simplicity Principle” to the problem of language acquisition. The Simplicity Principle is that the cognitive system seeks to choose the hypothesis that provides the briefest representation of the available data—here, the data are the linguistic input to the child. The Simplicity Principle allows learning from positive evidence alone, in apparent contrast to Gold’s (1967) celebrated theorems on language learnability in the limit. The results in this paper therefore suggest a re-evaluation of theoretical conclusions concerning child language acquisition based on apparent ‘logical’ difficulties in learning from positive linguistic evidence.  They also suggest that serious consideration should be given to the hypothesis that children learn language by using the Simplicity Principle.

Acknowledgements

We would like to thank Mark Ellison, Steven Finch, Ulrike Hahn, Stephen Muggleton, Gregory Mulhauser, Mike Oaksford, Luca Onnis, Martin Pickering, Emmanuel Pothos and Jerry Seligman for valuable discussions of these ideas at various stages in their development or for comments on this manuscript. Some of this research was conducted while Nick Chater was a Senior Research Fellow at BT Laboratories. Nick Chater was partially supported by European Commission grant RTN-HPRN-CT-1999-00065 and by Oliver, Wyman & Company; Paul Vitányi was partially supported by the EU through the NeuroColt II Working Group, and the QAIP project.

 

 


Children rapidly acquire grammatical mastery of their language, despite receiving what appears to be a noisy, degraded, unreliable, and partial sample of adult language (Chomsky, 1965). This phenomenon is sufficiently remarkable that many theorists have taken the nativist position that language acquisition cannot be a matter of mere learning at all. These theorists have argued that language acquisition is not primarily a matter of learning at all; they speak of the growth of a language ‘organ’ or the development of a language ‘instinct’  (e.g., Chomsky, 1980; Pinker, 1994). The noisiness and partial nature of language input are, though, matters of degree—and empiricists have responded that the language input to the child is sufficiently noise-free and extensive that language learning, in a non-trivial sense, may be possible after all.

            There is, however, one aspect of language input that appears to pose not mere problems of degree for language learning—but problems of principle. This is the fact that the child must learn from positive evidence alone. The child hears language that conforms with the rules of the language; but rarely, if ever, hears labelled non-examples of ungrammatical language. But learning a category (such as, here, the category of grammatical sentences) purely from positive examples seems to present fundamental difficulties. Viewed in general terms, the problem is that it is difficult to see how the learner can rule out overgeneral hypotheses—i.e., if the learner conjectures that a certain sentence is possible, when it is not, how is the learner ever to realize the mistake? The learner can not, of course, merely note that this sentence is never said. By this criterion, the learner would rule out all of the infinite number of perfectly grammatical sentences that have never actually been encountered, and be completely unable to produce or understand novel utterances.

In the literature on language acquisition, this kind of problem has often been considered in the context of specific linguistic examples. For example, Baker (e.g., 1981) has famously noted that the English auxiliary system poses difficult problems for language acquisition, because it contains what we might term ‘holes.’ That is, there is a highly complex set of restrictions that disallow certain structures, in a way that seems  entirely unpredictable from the overall pattern of the auxiliary system. As in the general case discussed above, it seems that the child cannot simply infer that these disallowed structures are ungrammatical, because they do not occur. This is because there are infinitely many other structures that have never been observed, but which are perfectly grammatical. The acquisition of the English auxiliary system, and many other linguistic constructions, have been viewed as sufficiently puzzling to earn the label Baker’s “paradox”—and the general puzzle of learning language from positive evidence only has been raised to the status of a ‘logical’ problem of language acquisition (Baker & McCarthy, 1981; Hornstein & Lightfoot, 1981).

            The puzzle of learning from positive evidence does not, interestingly, only pose problems for empiricist approaches to language acquisition. This is because many of the holes and idiosyncracies exhibited by specific linguistic structures do not follow from any universal grammatical principles. Indeed, language seems to be shot through with highly idiosyncratic patterns, which differ between languages, and which change over time (Bybee, Perkins & Pagliuca, 1994). Thus, it seems that both nativist and empiricist theorists must assume that these idiosyncratic aspects of language, such as the English auxiliary system, must somehow be learned. For this reason, the ‘logical’ problems associated with language acquisition have been extensively discussed both within the predominantly nativist framework of linguistics (e.g., Baker & McCarthy, 1981; Hornstein & Lightfoot, 1981), as well as by psychologists of an empiricist orientation (Braine, 1971; Bowerman, 1983, 1988).

            The concern that learning from positive evidence runs into fundamental ‘logical’ difficulties is crucially strengthened by results from an important mathematical analysis of learning, initiated by Gold (1967). Roughly, Gold showed that learning (technically, ‘identifying in the limit’) any interesting class of languages from positive evidence alone is impossible, given a particular formal analysis of the learning problem. In essence, Gold’s negative results stem from the point noted above—without negative evidence, the learner seems to have no way of eliminating over-general hypotheses (see, e.g., Pinker, 1979, 1984 for discussion).

            Three complementary lines of response to this problem can be envisaged. The first line of response is to argue that the assumptions of Gold’s argument (or the assumptions implicit in more informal arguments) are not met in practice. For example, perhaps the linguistic input in in some way richer than typically assumed, or perhaps learners only find some approximation to the structure of the language they hear. According to this kind of response, although mathematically correct, Gold’s results are not applicable to child language acquisition (MacWhinney, 1993, see the papers in Baker & McCarthy, 1981; Bates & MacWhinney, 1987; Gallaway & Richards, 1994; MacWhinney, 1987; Sokolov & Snow, 1994; see Rohde & Plaut, 1999, for a recent and insightful analysis). The second line of response attempts to back up the first line by demonstrating computational models that can learn aspects of language, when the learning problem is framed more realistically. This problem has so far proved far too difficult to address directly, and hence researchers have dealt instead with simplified learning problems, either by learning a highly idealized artificial language (e.g. Christiansen & Chater, 1999; Elman, 1990, 1993; Rohde & Plaut, 1999), or by learning very restricted aspects of language from corpora of natural language (e.g., Brent, 1996; Christiansen, Allen & Seidenberg, 1998; Elman, 1991; Redington, Chater & Finch, 1998; Siskind, 1996). Some of this work uses methods closely related to the formal framework advocated here, for example, from learning to segment linguistic units (Brent & Cartwright, 1996; Redlich, 1993; Wolff, 1977, 1982, 1988) to learning phrase structure or categorial grammars (Adriaans, 1992; Grünwald, 1996). This strand of computational work, although of great importance, is presently inconclusive, because it is not clear whether or how work in either tradition can scale-up to deal with the general problem of learning natural language from real input. The third line of argument, which will be the primary focus of the present paper, is to develop a new idealization of the language acquisition problem, as an alternative to Gold’s idealization. The goal of this line of argument is to prove formal results showing that language acquisition is, to some extent at least, possible from positive evidence alone.

Specifically, we show that substantial language learning from positive evidence is possible, if language learning is based on a Simplicity Principle. Roughly, the idea is that the learner postulates the underlying structure in the linguistic input which provides the simplest, i.e., briefest, description of that linguistic input. The general idea that cognition is a search for simplicity has a long history in psychology (Mach, 1959/1886; Koffka, 1962/1935), and has been widely discussed, in a relatively informal way, in the field of language and language learning (e.g., Fodor & Crain, 1987). In this paper, we describe a formal theory of inductive reasoning by simplicity, based on the branch of mathematics known as Kolmogorov complexity theory (Li & Vitányi, 1997). Kolmogorov complexity was developed independently by Solomonoff (1964), Kolmogorov (1965) and Chaitin (1969).1  Solomonoff’s primary motivation in developing the theory was to provide a formal model of learning by simplicity. Kolmogorov complexity and derivatives from it have been widely used in mathematics (e.g., Chaitin, 1987), physics (Zurek, 1991), computer science (e.g., Paul, Seiferas & Simon, 1981), artificial intelligence (Quinlan & Rivest, 1989), and statistics (Rissanen, 1987, 1989; Wallace & Freeman, 1987). This mathematical framework provides a concrete and well-understood specification of what it means to learn by choosing the simplest explanation, and provides a way of precisely defining the Simplicity Principle for cognitive science (Chater, 1996, 1997, 1999). This framework will prove to be useful in considering the amount of information available about the language that is inherent in positive evidence alone.

            It is important to note that the account developed here does not aim to specify a theory of how children acquire language. Rather, the aim of this paper is to help understand the nature of the learning problem that the child faces. Moreover, although the present analysis will be quite abstract, it may provide a starting point for a more specific analysis of the problem of language acquisition, considering, for example, specific types of language, specific computational restrictions or prior constraints on learners. Through such investigations, it should be possible to build up an increasingly realistic picture of the formal structure of the learning problem that the child faces, and of the nature of innate constraints and linguistic input that are required to explain how children learn language. Its role is thus analogous to Gold’s (1967) framework and their subsequent development (Jain, Osherson, Royer & Kumar Sharma, 1999; Osherson, Stob & Weinstein, 1985). But the upshot of the theoretical framework based on simplicity is very different from, and encompasses as a special case that based on Gold’s identification in the limit.2

It might be argued that the analysis provided below is unnecessary: perhaps, despite appearances, the child does have access to negative evidence of some kind. This question has generated a large body of research and considerable controversy (e.g., Baker, 1979; Baker & McCarthy, 1981; Bowerman, 1987; Brown & Hanlon, 1970; Fodor & Crain, 1987; Hirsch-Pasek, Treiman & Schniederman, 1984; Marcus, 1993; Morgan & Travis, 1989; Wexler & Cullicover, 1980). Despite the lack of consensus in this area, it is perhaps fair to say that the majority view is that direct forms of negative evidence, such as an adult correcting a child’s ungrammatical utterance, are too rare to be a substantial constraint in language acquisition. But even if negative evidence of some kind is available to the child, the problem of studying what can be learned from positive evidence alone is important for two reasons. The first reason is that it would help establish a ‘base-line’ against which the additional contribution of such putative negative evidence might be compared—only with such baseline can we establish how important such evidence is. The second reason is that learning from positive evidence arises in many other areas of cognition. Any kind of learning from experience (e.g., practicing a perceptual-motor task, or learning about the physical properties of the world through observation) involves learning from positive evidence alone—we are not provided with demonstrations of ‘impossible’ physical phenoeman, labelled as such. Indeed, scientific inquiry itself is limited to positive data, generated from actual natural laws (e.g., Boyle’s Gas Laws)—because the laws hold, every instance obeys the law, and there is no negative evidence (and Gold’s model has been widely applied in the context of scientific inference, e.g., Jain, Osherson, Royer & Kumar Sharma, 1999). Now, of course, the wrong, e.g., Gas Law might be falsified by observation—but this does not apply to the correct law, for which no negative evidence can occur. On the face of it, people do appear to be able to learn from experience; and science does seem to be able to progress, somehow. Of course, the sceptical possibility remains that all currently formulated laws are not correct; and this ‘fallibilist’ position is certainly entirely plausible in view of the history of science. But even if scientific laws are not formulated correctly, they appear to be formulated well enough to allow many aspects of the natural world to be predicted with great accuracy. So, whether or not language learning uses only positive evidence, there appears to be many interesting aspects of learning and inquiry that are limited to positive evidence—and hence the analysis described below should, in any case, be of interest in cognitive science.

The case of the fallibilist viewpoint in science is also suggestive, when related back to the problem of language acquisition. One might suspect that while the logical problems of language acquisition may preclude certainty that the language has been learned precisely, it might nonetheless  be possible for that learner to know enough about language to, for example, predict it, produce it, and make grammatical judgments about it. The formal results that we develop below show that, under fairly general conditions, there is a precise sense in which this is true.

            This paper has five sections. The first section, Learning by simplicity: The very idea, provides a general introduction to the Simplicity Principle, and its application both to learning in general, and to language acquisition in particular. The second section, Prediction and language acquisition, gives a proof of an important theorem in inductive inference in the framework of Kolmogorov complexity, first derived by Solomonoff (1978).3 This theorem shows that successful prediction is possible from positive examples alone, under quite broad assumptions. The third section, Prediction and grammaticality, derives a result concerning the accuracy of grammaticality judgements concerning sentences that arise in normal natural language, if language acquisition from positive evidence is guided by a Simplicity Principle. The fourth section, Language learning and language production, shows that the learner can learn to produce language in a way that is, asymptotically, indistinguishable from the language of other speakers in the linguistic community. Finally, the Discussion, considers open questions for future research, the implications of the current results, and assesses the hypothesis that human language acquisition may involve applying a Simplicity Principle.

 

Learning by simplicity: The very idea

To specify a set-up for language learning from positive examples, we need to provide (1) the class of linguistic inputs to be learned (the linguistic ‘environment’); (2) measures of learning performance; and (3) a learning method. Below, the class of linguistic inputs will be those that can be generated by a computable process combined with random noise. Learning performance will be measured both in terms of the ability to predict new input and in relation to judgements of grammaticality. And the learning method will be based on the Simplicity Principle—that the learner should prefer hypotheses which provide the simplest explanations of the linguistic input. We consider each of these points in more detail.

 

1. The class of allowable linguistic inputs

Let us assume that linguistic input corresponds to a potentially infinite sequence of atomic symbols (an ‘alphabet’), where there are only finitely many such symbols. Without loss of generality, we assume that the stream of sentences that form the input to the learner can be represented as a continuous binary string: a sequence of 0s and 1s. A simple way of doing this is to associate a fixed binary string with each element of the original alphabet in which the language is expressed (e.g., the standard English alphabet and punctuation symbols for sentences of English). Thus, the sequence of sentences of the input can be converted from its standard representation into a continuous binary sequence.4

            We assume, further, than the linguistic input is generated by a real computational process. But how, exactly, should we model what counts as a ‘real computational process’? A first suggestion would limit possible linguistic inputs to those that can be generated by Turing machines. But this would be over-restrictive, because linguistic input can also be affected by random factors. For example, imagine a speaker reporting successive tosses of a fair coin: “Heads, heads, tails, tails, tails, heads...” and so on. Assuming that the coin is tossed fairly, the corresponding utterance is a random infinite sequence, which cannot be generated by any Turing machine (Li & Vitányi, 1997). Real language input to the child is presumably a mixture of both kinds of factor—deterministic computational processes in the speaker (and perhaps also in other aspects of the environment which the speaker is describing or reacting to), mixed with random influences.5

            So how can we model this mixture of deterministic and random factors? One natural approach (which turns out to be quite elegant and general) is to assume that language is generated by a deterministic machine (concretely, a Turing machine), fed with a random input (concretely, the flips of an unbiased coin, which produce a potentially infinite string of binary inputs). As the stream of random input comes in, the machine writes its output as a binary string on a dedicated output tape--this corresponds to the utterances produced by the machine. This set-up needs to reflect a basic fact about utterances--that once they have be said, they cannot then be ‘unsaid.’ This can be expressed by saying that symbols on the output cannot be deleted. The intuitive picture to have in mind is that, as the input grows, the output (the corpus of things ‘said’) gradually increases, and can never decrease.6

Let us call a deterministic Turing machine which has this property a monotone Turing machine (Li & Vitányi, 1997).7

            We now have a model for how the linguistic input presented to the learner is generated. It is generated by a mixture of random factors (captured by the random input to the machine); and deterministic factors (captured by an arbitrary monotone Turing machine). The deterministic factors determine which sentences of the language are allowed, and their probabilities—the random factors generate an actual corpus of language, by choosing a particular stream of sentences. The output of the machine is a finite8 or infinite binary sequence, which can be viewed as encoding the corpus of linguistic material to which the learner is exposed. Because of the random component, many different outputs (corpora) can potentially be generated, and some corpora will more likely to be generated than others. The probability associated with each x is the probability that x will be the output of the computational process, when supplied with random binary input. Hence we can associate any given (monotone) computational process, C, with a probability distribution, mC(x),  over the set of finite and infinite binary output sequences, x. The fundamental assumption concerning the nature of the linguistic input outlined in this subsection can be summarised as the assumption that the linguistic input is generated by some monotone computable probability distribution mC(x).

            But is the definition of monotone computability sufficiently broad? Perhaps language is generated by some cognitive process which is not monotone computable. There is no definitive way to ascertain whether this occurs—a ‘monotone computability’ thesis must stand as a conjecture, which cannot be verified although it may, at some time in the future, be falsified. There are three points to consider.

The first point is the assumption that the ‘deterministic’ component is no more powerful than a Turing machine. This seems relatively uncontroversial, because it is in line with standard assumptions in cognitive science that cognitive processes are computable (this is the Church-Turing thesis). Thus the deterministic component does not seem overly restrictive.

The second point appears more problematic. The assumption that the input to the machine is the result of tossing a fair coin seems extremely restrictive. But, fortunately, this restriction can be weakened without affecting the class of probability distributions that are allowed. Specifically, the assumption can be weakened so that the input is generated by any ‘enumerable semi-measure.’ Roughly, this is the class of distributions that can be approximated in the limit9 by a computational process (Li & Vitányi, 1997). This is a very broad class of probabilistic inputs indeed and includes all the distributions standardly used in mathematics and probability theory10.

The third point concerns whether the sharp separation between deterministic computational processes and purely random factors is unrealistic. For example, on the natural assumption that the computations of the cognitive system are disrupted by internal ‘neural’ noise, then some of randomness will creep into the workings of the computational process itself. Fortunately, however, incorporating finite amounts of noise into the workings of the computational process itself does not affect the class of probability distributions over linguistic outputs that can be generated. This is because any finite amount of randomness in the internal workings of the machine can be simulated by a deterministic machine that ‘absorbs’ the relevant amount of randomness from part of its random input.11

            We shall therefore tentatively adopt the monotone computability conjecture henceforth, and therefore assume that language can be viewed as generated by a fair random binary input into an arbitrary deterministic, monotone computational process.

            To get an intuitive sense of how language generation occurs according to the present model, consider a variant of a well-worn example. It is often remarked that a monkey hitting typewriter keys at random would eventually produce the works of Shakespeare. Here, instead, the monkey is hitting the keys, not of a typewriter, but of a programmable computer. So the analogous remark is that a monkey randomly hitting the keys of a programmable computer will eventually write a computer program that produces the works of Shakespeare!

            The change from random typing to random programming is of vital importance. In the former case, the probability of a string being generated depends only on its length. So the probability of any billion symbol binary string being generated random is the probability of a billion successive coin tosses being coming up in some specific way, i.e.,  2-1000,000,000. Thus, all binary strings, whether they encode Shakespeare or are completely random are treated as having precisely the same probability. But this would be of no use as a model of language generation--because in providing the same probability to all outputs, the ‘monkey-at-the-typewriter’ scenario completely fails to favor grammatical, or relevant, or coherent, utterances over unintelligible gibberish.

            By contrast, the ‘monkey-at-the-programmable-computer’ scenario, allows for such biases to be incorporated into the structure of the computer. These biases can take forms as varied as the possible computer programs. To choose a simple example of relevance to language generation, the programmable computer might, for example, support a ‘programming language’ in which the rules of a phrase structure grammar can be specified (in this case by the random input). When run, such a program then generates a binary output, encoding a sequence of sentences that are in accordance with those rules. In this case, some binary strings (those that correspond to sequences of sentences which are grammatical according to a phase structure grammar) will be relatively probable; and others (those that do not correspond to such sequences) will have probability 0. So while the monkey typing at random is equally likely to produce any random sequence, the monkey programming at random (on this particular language generating ‘machine’) is very much more likely to produce some outputs than others.

            The specific probability distribution over binary outputs depends crucially, of course, on the choice of programmable computer, C, into which the random binary input is fed. But if we fix a particular C, we can quantify the probability that a particular output, x, will be generated from random input quite straightforwardly. This probability depends on the number of different ‘programs,’ y, that produce output x, and on the length of each of these programs, l(y).

            Focussing on a specific program, y, the probability that it is generated from random input is the probability that first l(y) symbols in the random input correspond precisely to y. There is a probability 1/2 that each particular symbol will correspond to the corresponding symbol of the sequence y, and hence a probability 2-l(y) that all l(y) symbols correspond to y. This means that outputs which correspond to short programs for the computer, C, are overwhelmingly more probable than outputs for which there are no short programs in C. Now we can derive the total probability, mC(x) that the output, x, is generated by random input to the computer C.  We just need to sum the probabilities of all the inputs y which produce an output x… (i.e., an output that begins with the subsequence x) when run on the C (in symbols, all the y such that C(y)=x). So the total probability, mC(x), that x is generated from random input to C is12:

 

                                                                                                (1)

 

where x… denote a finite or infinite sequence that begins with the subsequence x, and similarly for y… Thus, the fundamental assumption of the framework presented here is that the (possibly infinite) corpus of utterances which forms the corpus, x, for the language learner is generated by a probability distribution mC(x), associated with a monotone computational process, C, provided with fair coin flips.13 We shall call such mC(x), monotone computable probability distributions. We show below that the structure of corpora generated in this way can be learned, to an approximation, from positive evidence alone.

            Finally, we note that we have skated over some technical complications in the interests of clarity. First, defining probability distributions over infinite sequences requires care, because probabilities typically all tend to zero in the limit. Technically, this requires introducing the notion of a measure (Appendix A). Second, the ‘probability distributions’ we have been discussing do not all obey a standard feature of probability: that the probabilities of outcomes sum to 1. In particular, this arises because some inputs to a monotone machine may produce no well-defined output (the monkey typing into the computer may type a syntactically invalid program, or a program which goes into an infinite loop, or exhibits some other pathology). This means that the sum of the probabilities of the well-defined outputs, s, which we wrote mC(s), may be less than 1. This complication requires generalizing from the conventional measures used in probability theory to the notion of a semi-measure (Appendix A). Where measures and semi-measures are drawn on in the results below, the discussion is relegated to other Appendices. Thus, the reader may ignore such complications without substantial loss of comprehension or continuity below.

 

2. Measuring learning performance

We have defined the class of monotone computable generating mechanisms—and we assume that the linguistic input to the learner (e.g., the child) is generated by a monotone computable process. But how are we to measure how well learning succeeds? Our primary measure is prediction—how well can the learner specify the probabilities of each possible continuation of a sentence or text? The use of prediction as a measure of learning language structure can be traced back to Shannon (1951) and has been widely used in connectionist models in language learning (e.g, Christiansen & Chater, 1994, 1999; Elman, 1990, 1993).

This prediction criterion is very difficult to meet. Intuitively, it requires that the learner must not merely uncover the grammatical rules underlying language structure, but must also master whatever other regularities determine which sentences in the corpus are generated, whether these regularities are pragmatic, semantic, or due to the influence of world knowledge. Thus, this criterion requires that the learner acquire not merely the language, but much else besides. Nonetheless, it seems intuitively plausible that if language is learned in this strong sense, it is necessarily learned in the weaker, and more natural, sense of acquiring just language-specific information, such as the grammar of the language. We shall see below that there is a precise sense in which this is true, in the subsection Prediction and grammaticality, below.

            Let us frame the prediction criterion more exactly. A particular subsequence, x, is generated according to a monotone computable probability distributionmC. The learner is exposed to this specific x with a probability mC. The learner is faced with the problem of predicting how the binary sequence will continue, i.e., what language input is to come. Let us consider this input, symbol by symbol. The basic problem is to decide the probability that sequence, x, will be followed by either a ‘1’ or a ‘0’. The true probability that it will be followed by a ‘0’ can be written, using standard notation from probability, as:

 

                                                                            (2)

 

But, of course, the learner does not know the ‘true’ distribution mC (we will typically drop the subscript, C, below)—because the learner does not know the language at the outset. Instead, the learner must use some other probability distribution, and hope that the predictions that it makes approximate, to some degree, the predictions that arise from the true distribution.

 

3. The learning method: Predicting by simplicity

Rather than attempting to provide a detailed model of human language acquisition, we shall instead adopt an idealized formal model of learning. This formal model will allow an analysis of what can be learned from the linguistic input—and hence to address the issue of the poverty of the linguistic stimulus. We shall leave the question of whether a Simplicity Principle might be guide human language acquisition until the Discussion.

Specifically, the formal model is that learning follows a Simplicity Principle. The learner prefers hypotheses, theories, or patterns to the extent that they provide simple explanation of the data. Thus, we assume that the learner chooses the underlying theory of the probabilistic structure of the language that provides the simplest explanation of the history of linguistic input to which the learner has been exposed.

The learner can then make predictions about subsequent input by applying the prediction of this best (simplest) theory of the language. More accurately, as we shall see below, it makes predictions by considering the predictions of many different theories, and being influenced by each prediction to the extent that the theory that generates it provides a simple encoding of the data.

            So prediction by simplicity requires finding the theory which provides the simplest explanation of the language input that has been encountered (or, more exactly, a weighted combination of explanations with the simplest explanations weighted more heavily). What does this mean in practice? A first suggestion is that the simplest hypothesis should be preferred. To see how this might work, consider a trivial input which consists of, for example, an initial sub-sequence of 1,000,000 alternations of 1 and 0: 1010...1010. Intuitively, the simplest hypothesis is that the sequence continues to alternate indefinitely—leading to the prediction that the next symbol will be 1. This hypothesis is therefore favored over, for example, the intuitively more complex hypothesis that the sequence consists of 1,000,000 alternations of 1 and 0, followed by infinitely many 0s, which would make the opposite prediction. But, taken alone, the injunction to accept the simplest hypothesis has an absurd consequence. An even simpler hypothesis, e.g., that the sequence consists of an infinite sequence of 0s, leading to the prediction that the next symbol will therefore be a 0, will always be preferred. Such possibilities are, of course, ruled out by the constraint that the hypothesis  has to be consistent with the available data—i.e., some hypotheses are just too simple. But this point itself raises difficult questions: What does it mean for a hypothesis  to be consistent with the available data?  Can consistency with the input be traded against simplicity of hypothesis? If so, how are simplicity and consistency with the data to be jointly optimised? The theoretical account of simplicity presented below answers these questions. 

            There is, however, also a more subtle difficulty: What rules out the simplest possible “vacuous” hypothesis which allows any sequence whatever—such a “hypothesis” could be interpreted as saying that “anything goes?”  This hypothesis  seems extremely simple; and it is also consistent with the available data. Indeed it would be consistent with any data, because it rules nothing out. Mere consistency or compatibility with the data is plainly not enough; the hypothesis  must also, in some sense, capture regularities in the data. That is, it must have explanatory power (Harman, 1965). So we appear to be faced with the unattractive conclusion that we must somehow jointly optimize two factors, simplicity and explanatory power; and the relative influence of these two factors is unspecified.

            Fortunately, there is an alternative way to proceed. This is to view a hypothesis as a way of encoding the data; and to propose that the hypothesis chosen is that which allows the shortest encoding of the data.  This proposal disfavour vacuous or nearly vacuous hypotheses, that bear little or no relation to the data. These hypotheses do not help encode the data simply, because they capture no regularities in the data.  Focussing on using the hypothesis as a way of encoding the data also suggests an operational definition of the “explanatory power” of a hypothesis—as the degree to which that hypothesis helps provide a simple encoding of the data.  If a hypothesis captures the regularities in the data  (i.e., if it “explains” those regularities), then it will provide the basis for a short description of the data. Conversely, if a hypothesis fails to capture regularities in the data, then it does not provide a short description. Explanatory power is therefore not an additional constraint that must be traded off against simplicity; maximizing explanatory power is the same as maximizing the simplicity of the encoding of the data.

            Measuring simplicity as brevity of encoding appears to face two problems. First, it seems that a new description language, in terms of which the linguistic or other data are to be encoded, may be required for each new hypothesis (e.g., each new grammatical theory). Second, it seems that brevity of encoding of hypotheses and data will depend on the description language chosen—and hence the predictions derived from the Simplicity Principle will likewise depend on the choice of description language.

            These problems are addressed by the mathematical theory of Kolmogorov complexity (Li & Vitányi, 1997). The first problem, of needing a new language for each new type of data, is avoided by choosing a general coding language. Specifically, the language chosen is a universal programming language. A universal programming language is a general purpose language for programming a computer. The familiar programming languages such as PROLOG, LISP and PASCAL are all universal programming languages. All these programming languages are what is called “prefix-free,” that is, no syntactically correct program in the language is a proper prefix of any other syntactically correct program in the language. Moreover, the machine executing the program can determine where the program ends without having to read past the last symbol of the program. Such programs are prefix-free, effectively so, and are called “self-delimiting.” For example, a language consisting of the programs “01, 001, 0001” is prefix-free, and the language  “10, 100, 1000” is not prefix-free. For technical reasons we require that all universal programming languages considered in this paper are prefix-free. This is a crucial requirement for the development of the later mathematical arguments.

How can an object, such as a corpus of linguistic input, be encoded in a universal programming language such as LISP? The idea is that a program in LISP encodes an object if the object is generated as the output or final result of running the program. By the definition of a universal programming language, if an object has a description from which it can be reconstructed in some language, then it will have a description from which it can be reconstructed in the universal programming language. It is this that makes the programming language universal. Notice that, above, we assumed that linguistic input can be viewed as generated by a computational process (possibly mixed with a source of random input). By using a universal programming language, the learner can be sure to be able, at least in principle, represent every such computational process.14

            Moreover, in solving the first problem, the second problem, that different coding languages give different code lengths, is, at least partially, addressed.  A central result of Kolmogorov complexity theory, the invariance theorem (Li & Vitányi, 1997), states that the length of the shortest description of an object, x, is invariant (up to a constant) between different universal languages.15 The invariance theorem thus allows us to speak of the code length required to specify an object, x, related to a fixed choice of universal language,  where this code length is, up to an additive constant, independent of the particular universal language in which the shortest code for x is written. The additive constant depends on the universal language but not on the particular object, x. This quantity is defined to be the Kolmogorov complexity, K(x), of that object. So, by assuming that the coding language that the cognitive system uses is universal, we can avoid having to provide a detailed account of the codes that the learner uses.

            So far we have considered how to measure the complexity of individual objects. But linguistic input consists not of a single ‘objects’ (e.g., a single sentence, paragraph, or whatever, which can be represented as a finite binary sequence) but a sequence of objects, which may be indefinitely long. How can Kolmogorov complexity be applied to measure the complexity to potentially infinite sequences? The only modification that we require is that complexity is measured in terms of the shortest input to a monotone universal machine (as discussed above), which produces a particular sequence. Thus, given that a short input to a universal monotone machine will generate a long string of 0s, then this string has low monotone Kolmogorov complexity. We define the monotone Kolmogorov complexity of a finite sequence x as the length in bits of the shortest string such that every input that begins with this string produces as output that begins with x (the output sequence may then continue in any way at all). The monotone Kolmogorov complexity for a sequence, x, is denoted Km(x). The invariance theorem holds for Km(x) just as for K(x) and, indeed, for finite sequences, Km(x) closely approximates K(x) (see Li & Vitányi, 1997, p. 285). Note also that a program on a universal monotone machine can implement a probability distribution over potentially infinite binary strings. The probability associated with a binary string is simply the probability the string will be generated by that random binary input—that is, the probability that it will be generated by a monkey typing random input to the program, to pick up our earlier picture. More formally, an initial program p of length K(m)that is, the program is self-delimiting so the interpreting machine can parse itcauses the universal monotone machine to start simulating the machine  Tm that transforms the following (possibly infinite) input sequence into a (possibly infinite) output sequence in such a way that the uniform distribution over the input sequences (that is, infinite sequences of 0's and 1's generated by fair coin flips) following the initial program p is transformed in the distribution m over the output sequences. Thus, we can speak of the monotone complexity, K(m),  of a probability distribution, m—signifying the shortest self-delimiting program on the universal machine that implements m.16

            We have been focussing up to now purely on finding the single simplest input program which encodes the sequence, x. But for any sequence which can be encoded at all, there will be many—in fact infinitely many—such programs.17 Suppose that an input program (i.e., a binary sequence) p has length l(p). The probability that it will be generated by chance is 2-l(p). Thus, the probability, l(x), of a sequence x being generated on a universal monotone machine is a special case of (1) above:

 

                                                                               (3)

 

            We shall call l(x) the universal monotone distribution, and, following Solomonoff (1964, 1978), we assume that the learner uses l to predict the next linguistic input in the sequence:

 

                                                                                (4)

 

This is a special case of (2) above.

            So far, we have specified the weak condition that language is generated by a monotone computable distribution, m. We have also specified that the learner follows a Simplicity Principle—favoring hypotheses in so far as they provide brief encodings of linguistic data—and that the learner makes predictions according to a universal monotone computable distribution, l. We have, furthermore, suggested that the learner’s performance can usefully be assessed in terms of its ability to predict the linguistic input successfully, while allowing that another important criterion is the learner’s ability to judge the grammaticality of novel sentences. We can now consider the possible effectiveness of language learnability by simplicity, from positive instances alone.

 

Prediction and language acquisition

This section rests on a key result, which we label the Prediction Theorem, by Solomonoff (1978). This theorem shows that, in a specific rigorous sense, the universal monotone distribution l is reliable for predicting any computable monotone distribution m, with very little expected error. Similarly it can be shown that the predictions according to the universal distribution asymptotically approximate those according to the real

distribution m, for almost all sequences (the  m-random sequences) (Li & Vitányi, 1997). As it happens, this doesn't follow from Solomonoff's result and Solomonoff's result doesn’t follow from this one. Solomonoff's result states that the expected prediction error (square difference) in the n-th prediction decreases faster than 1/(nlogn), but it doesn't state that with m-probability1 the ratio between the conditional real probability of the n-th prediction and the universal probability of the n-th prediction (given the previous n-1

outcomes) goes to 1. That is, this must happen for almost all sequences individually, wheras Solomonoff's result tells us something over the average taken over all sequences. This is a similar difference as that between the “Strong Law of Large Numbers” that holds for almost all infinite sequences individually and the “Weak Law of Large Numbers” that holds on average. The problem is that it is consistent with Solomonoff's result that

m(0|x) = 0 infinitely often which prevents the ratio l(0|x)/m(0|x) from going to 1 in the limit. Nonetheless, Solomonoff's result has a speed-of-convergence estimate that is quite strong (but only holds for the average) while the convergence law has no speed-of-convergence estimate although it guaranties convergence with probability 1. Given the assumption, made in the previous section, that language is generated according to such a distribution, Solomonoff’s result is immediately relevant to the formal problem of language acquisition. It implies that the universal monotone distribution l is reliable for predicting what linguistic input is to come: it almost always is guarenteed to converge, and it soon makes little error. For the moment, though, let us consider the result in general terms.

According to these prediction theorems, l is a universal approximation for monotone computable probability distributions. A learner makes predictions by using l, the learner will rapidly close in on the ‘correct’ predictions of m. Given this informal statement, it may seem that l may sound too good to be true—and indeed it may seem conceptually impossible that a single distribution can simultaneously approximate  every one of the entire class of computable probability distributions, because these distributions will themselves be so diverse. To see how this is possible, let us state Solomonoff’s result more precisely.

            Suppose that a sequence of n-1 binary values are generated by a computable data-generating mechanism, associated with a probability distribution, m, in the way described in the previous section. Let us call this sequence of n-1 values x. Given this sequence, we can ask how closely the prediction according to m agree with the predictions from the prior, l. Specifically, we measure the difference in these predictions by the square of difference in the probabilities that m and l assign to the next symbol being 0.18 Formally, this difference is:

 

                                                                        (5)

 

Error(x) measures how good an approximation l(0|x)  is to m(0|x)—but its value clearly depends on which previous sequence of items, x, has been generated. To get a general comparison between l and m we need to take into account the various sequences x that m(x) might have generated. Moreover, we weight these sequences by the probability m(x) that they were generated by the true distribution, m.

 

                                                                                   (6)

 

sn is thus the expected value of the squared error between the predictions of l and m on the nth prediction. The smaller the sn, the better l predicts m.

This weighting by the actual distribution, m, reflects the fact that we would intuitively view l as a good approximation to the true distribution m if it assigns similar probabilities to events which are actually likely to occur (according to m). It does not matter whether m and l disagree on cases that never, or very rarely, arise in practice (i.e., where m(x) is 0 or nearly 0). This weighting by the actual distribution will be important below, when we apply these ideas to language learning. Specifically, in assessing how well a learner has learned the structure of a language, there will be considerable weight attached to linguistic material which might actually be said; and little weight attached to sentences (e.g., a sentence containing 1000 ‘and’s) which will never actually be produced.

Finally, the expected prediction performance over the entire sequence is just:

           

                                                                                                   (7)

 

We shall use this measure as our overall measure of predictive success. To get a feel for the meaning of , consider the case where the expected value of the sum square difference between two computable probability distributions m1 and m2 is always greater than some constant d, where d may be arbitrarily small. In this case,  is at least ¥.d  which is, of course,  ¥. But, remarkably, Solomonoff’s Prediction Theorem shows that, in relation to the distributions l and m considered here, the sum  has a limit bounded by a constant, and hence that as the amount of data increases, the expected prediction error tends to 0. That is, given enough data, expected prediction should be almost perfect—using the universal distribution l the learner should accurately be able to learn, in an approximate sense, any true computable distribution m. Specifically, the following result holds:

           

Prediction Theorem (Solomonoff, 1978): Let m be a computable monotone distribution. Then,

                                                                                          (8)

 

We shall consider below how the Prediction Theorem can be related to language acquisition, but first, we show how Solomonoff’s remarkable theorem can be proved19. The proof has four steps.

            The first, and crucial, step is to show that, for any finite or infinite computable data x:

 

                                                                                            (9)

 

This puts an upper bound on how much m(x) can exceed l(x). Intuitively, it implies that if x is probable according to m, it is also reasonably probable according to the universal prior probability l.

            The second step is to show that (9) implies a bound on a measure of similarity between the distributions  m and l over the set of computable data strings x. This measure of similarity is Kullback-Liebler distance D(m||l):

 

                                                                                             (10)

 

            The third step breaks up this similarity measure over the distribution over whole strings x into an infinite sum of Kullback-Liebler similarity measures over each of the j positions in the string Dj(m||l) . This step is conceptually straightforward, but algebraically complex:

 

                                                                                 (11)

 

            The final step makes the connection between the Kullback-Liebler measure of similarity and the measure of similarity with which we are primarily concerned: sum-squared error.

 

                                                                            (12)

 

We can now put the last three steps together to obtain the final result:

 

                     (13)

 

thus proving the theorem. We now prove each step in turn. 

 

Step 1

To prove:                                                                             (9)       

 

Consider a universal monotone machine, U. Because U is universal, for any monotone machine W, there will be a finite string, p, which ‘programs’ U to behave like W.20 That is, for any monotone machine W, there will be a program p (in fact, many such programs), such that for all x, U(px) = W(x). Since U must parse px into p and x it must be able to detect the end of p; that is, p must be self-delimiting. As noted above, the probability of a randomly generating a binary program, p, of length l(p), is 2-l(p). This means that short programs have the highest probability.

            Let us therefore focus on the shortest and hence most probable program for W in U; if there are several programs which tie for the shortest length, we choose one of them arbitrarily. The length of this shortest self-delimiting program is K(W) by the definition of prefix-complexity, K. Each monotone machine is associated with a distribution over all its possible output sequences, as defined above. If W is associated with the true probability distribution m, then we can write K(m) to denote the length of the shortest program which generates m.

            Now consider a string x, with probability m(x). What can we say about l(x)? By the considerations above, we know that one input (of many inputs) to l which will generate x is as follows: first a program of length K(m) which converts U into W (and hence l into m) followed by any of the inputs to W which produce x. The probability of the first part is the probability of a specific binary sequence of length K(m), which is 2-K(m). The probability of that the second part of the sequence then generates x is, by definition, m(x). Thus the probability of both parts of the sequence is the product of these two: . This is just one way of obtaining x in U (there are infinitely many other programs), which means that this cannot be greater than l(x), the overall probability of obtaining output x from U. Thus,

 

                                                                                           (14)

 

Rearranging, we get:

 

                                                                                                    (15)

 

Taking logs in base 2 of both sides of the inequality proves Step 1.

           

Step 2

To prove:                                                                              (10)

 

Let us introduce a measure of the similarity between two probability distributions, which can be related to, but is easier to deal with, than sum-squared difference, defined above. This measure is Kullback-Liebler divergence, D(P||Q). This measure originates in information theory. It measures the expected amount of information that is wasted in transmitting a message x which is actually generated by a probability distribution P, but is encoded using a code which is instead optimally adjusted to transmit messages generated by probability distribution Q. The waste arises because an optimal code assigns short codes to probable items, and longer codes to less probable items. Thus, if P and Q are very different, then codes will be assigned in an inappropriate way. Specifically, short codes will be used for items which are probable according to Q but which may not be probable according to the actual distribution P, and vice versa. When P = Q, there is, of course, no waste at all, and D(P||P) is therefore 0. Moreover, if P ¹ Q there is always some waste, so that D(P||Q) ³ 0—and the amount of waste measures how similar or different the two probability distributions are.21

            The Kullback-Liebler divergence between probability distributions P and Q is defined:22

 

                                                                         (16)

 

            Let us now consider the Kullback Liebler divergence between the distributions m(x) and l(x), where x ranges over (possibly infinite) output sequences:

 

                                                                         (17)

 

Applying (9), we have:

 

                                                                     (18)

 

where the second inequality follows because m(x) is a semi-measure, i.e., . This proves Step 2.

 

Step 3

To Prove:                                                                  (11)

 

We have seen how Kullback-Liebler divergence can be defined over distributions of entire (possibly infinite) sequences. It will turn out to be useful to relate this to the Kullback-Liebler divergence at each location in the sequence.

            A useful intuition concerning how this works is as follows. D(m||l) measures the expected amount of ‘wasted’ information required to send a randomly selected sequence generated by the distribution m, using codes which are optimal relative to the assumption that the distribution is l, over and above the expected amount of information required the codes used are optimized to the true distribution m. Suppose we consider the expected amount of information wasted in transmitting the first symbol; then the expected amount of information wasted in transmitting the second symbol; and so on. These quantities correspond to Kullback-Liebler divergences, defined over each symbol in turn. It seems plausible that the sum of the expected amounts of information wasted in transmitting each symbol should be equal to the total amount of information wasted in transmitting the entire sequence—this is the intuitive content of the result that we are aiming to prove.

            To put this more exactly, we need to express the expected amount of information wasted at symbol j. Suppose that the sequence of symbols from 1 to j-1 is x. According to m, the probabilities of the next symbols are given by m(×|x). Similarly, according to l, the probabilities of the next symbols are given by l(×|x).

            Then, using standard Kullback-Liebler distance regarding the outcomes for the jth symbol, we have:

 

 

                                               (19)     

 

The expected value of this term with respect to the true distribution m(.) requires weighting it by m(x), the probability that the first j-1 symbols in the sequence are x according to the true distribution m. Thus, the expected amount of wasted information in encoding the jth symbol using l instead of m, which we shall denote by Dj(m|l) is:

 

                                                    (20)

 

            Here we have merely defined the terms in the conjecture above, and explained the intuition behind it. That is, the amount of information wasted in transmitting a sequence by using a code optimized to the ‘wrong’ probability distribution is the same, whether the sequence is encoded all at once, or symbol by symbol. A rigorous derivation that substantiates this intuition is given in Appendix B.

 

Step 4:

To Prove:

                                                               (12)

 

We have shown that the learner’s distribution l is similar to any computable distribution m, where similarity is measured by Kullback-Liebler distance. Moreover, we have shown how the expected divergence between the distributions over infinite sequences can be converted to a sum of the expected divergences at each location in the series. It remains to relate Kullback-Liebler distance to the familiar measure of goodness of prediction with which we began: the expected sum-squared error between m and l.

            The key to doing this is the following result, which applies to arbitrary distributions P and Q that can take just two values 0 and 1 (the proof is given Appendix C).

 

                                                                   (21)

 

            It immediately follows that the same result holds if P and Q are conditional on a previous string, x:

 

                                         (22)

 

Substituting m and l and for P and Q, and using the definition of Error(.) in equation (5), we obtain:

 

                       (23)

 

Using the definition of sj (Equation 6),

 

                      (24)

 

We now take the sum squared error over all symbols in the sequence, which immediately gives equation (12), and hence proves Step 4.

            Having proved Steps 1 to 4, we have hence completed the proof of the Prediction Theorem.

            The Prediction Theorem provides a counterweight to Gold’s negative results concerning the possibility of learning language in the limit from positive evidence. It shows that learning by simplicity can, in principle, be expected to converge to the correct conditional probabilities in predicting subsequent linguistic material. Intuitively, this means that the learner must be able to learn a great deal about the range of linguistic, pragmatic, social and environmental factors which influence the linguistic input that is received. This appears to imply that the learner must know a good deal about the specifically linguistic structure of the language.

            It is appropriate to ask whether this intuition can be backed up with a quantitative measure of how well the learner must acquire specifically linguistic information. The results in the next section shows that this can be done, by putting an upper bound on the number of ‘grammaticality’ errors that the learner can make in the course of predicting the linguistic input.

 

Prediction and grammaticality

Consider the following test of the learner’s ability to distinguish grammatical from non-grammatical linguistic input: Suppose that the learner has to ‘guess’ the next word in the text at each point. The question is: How often is the continuation that the learner chooses ungrammatical? Notice that this is quite a rigorous test. For example, if the sentence is a center-embedded structure: dogs cats Fido chases chase run, then predicting which continuations are grammatical requires, for example, predicting the agreement of the verb (whether it is singular or plural) based on the corresponding noun; and knowing which noun is the corresponding noun requires understanding the structure of center-embedded sentence (see Christiansen & Chater, 1994, 1999; Elman, 1990, 1993, for connectionist simulations of learning recursive constructions according to this logic). But, of course, by setting the grammatical judgement task in the context of predicting naturally occurring language, different types of grammatical structure are weighted by their probability of occurrence in normal language. This reflects a desire to assess the degree to which the learner learns to agree with the grammaticality intuitions of other speakers for aspects of the language which are actually produced.  Thus, it downplays any possible disagreements between speaker and learner concerning, for example, whether sentences with 2000 relative clauses or 20 cases of WH-movement are grammatical—because these are never, or almost never, produced.

            Let us follow standard practice and consider the case where there is no noise in the linguistic input—that is, all sentences that the learner hears are grammatical. We play the ‘guessing game’—at each point in the linguistic input, the learner makes a guess. We can then ask: How often does this guess violate the rules of the grammar?

            Two kinds of error can occur in principle: overgeneralization and

 and undergeneralization.

 

Overgeneralization errors

In overgeneralization, linguistic sequences are allowed by the learner’s probability distribution (i.e., they are viewed as grammatical by the learner), but they are not allowed by the true grammar. We wish both to measure, and to attempt to put limits on, the amount of overgeneralization that learning according to the Simplicity Principle will involve.

            In the discussion of both overgeneralization and undergeneralization, it is convenient to consider language input as a sequence of words,23 rather than coded as a binary sequence. Of course, the binary sequence is, by stipulation, simply particular way of encoding words—words are encountered one-by-one, and each word stands in one-to-one correspondence with a binary string. Thus, each possible corpus of language, viewed as a sequence of words, stands in a one-to-one correspondence with a possible binary string; and the probabilities of each corpus of words are identical to the probabilities associated with the corresponding binary strings. Thus, instead of dealing with distributions over finite and infinite binary sequences, m, and the learner’s universal approximation, l, we shall deal with corresponding distributions defined over finite and infinite sequences of words. We shall call these corresponding distributions Pm and Pl.

Suppose that the learner has seen a specific corpus, x, of j-1 words. Suppose that the learner has a probability Dj(x) of erroneously guessing that the next (i.e., the jth) word in the input is a word which is actually not allowed by the grammar. In symbols, we can define:

 

 

                                                                    (25)

 

That is, Dj(x) is the amount of probability that the learner devotes to grammatically impossible overgeneralization on the jth word. Because we assume that the linguistic input contains no noise, ungrammatical continuations have zero probability of occurring.

The probability Dj(x) will, of course, depend on the specific x that has been encountered. The expected value of Dj(x), which we shall write , is defined as follows:

 

                                                                      (26)

 

Our goal is to put some bound on the expected number of overgeneralization errors throughout the corpus, i.e., to put a bound on . The following Overgeneralization Theorem holds.

 

Overgeneralization Theorem.

Where  is defined as above,

 

                                                                                             (27)

 

That is, the expected amount of probability devoted by the learner to overgeneralizations, in the course of encountering an infinite corpus.

 

Proof. The proof has two parts. The first part concerns how the waste of probability, Dj(x), due to overgeneralization after seeing a sequence x, inevitably leads to a waste of information. This information is quantified by the Kullback-Liebler divergence between Pm and Pl, which can later on be related to K(m). But this leaves a crucial gap—it deals with Dj(x) for some particular x; but it says nothing about , the expected amount of probability wasted by the learner, averaged across all x. The second part of the proof fills in this gap, and hence provides the required bound on . To finish the proof, we also need to relate the results from these two steps to some of the analysis we have described above, in proving the Prediction Theorem.

            The first part of the proof begins by considering the following scenario. Suppose that the learner uses its probability distribution Pl to encode the output from the true underlying distribution, Pl. After the sequence, x, of j-1 words has been encountered, we can ask: What is the expected amount of wasted information in encoding the jth item? Such waste is inevitable, because the learner is using codes which are optimized to the learner’s distribution (i.e., Pl(.|x)), rather than to the true (but from the learner’s point of view, unknown) distribution (i.e., Pm(.|x)). The key underlying intuition is that, to the extent that the learner has a tendency to overgeneralize, the learner must necessarily waste a certain amount of information. This is because the learner encodes items as if some continuations are possible, where in reality they are not possible. This means that some code length in specifying the actual continuation must be ‘used up’ in order to rule out these continuations. The greater to degree to which the learner overgeneralizes, the greater the amount of wasted information.

            Suppose, then, that the learner has a particular Dj(x). How much wasted information must follow from this wasted probability? The minimum level of wasted information is achieved as follows.24 Assume that, for all the other lexical items, k, which are possible continuations, the probability assigned by the learner to this continuation, Pl(k|x)), is just (1-Dj(x)) times the true probability Pm(k|x). Thus, a certain  amount of ‘probability’ is wasted by the learner, on continuations that are impossible; but otherwise the probabilities of all the possible continuations are correct, except that they have to be appropriately re-scaled. What is the expected amount of waste that occurs by encoding the actual continuation in terms of the learner’s Pl(k|x)), rather than the true Pm(k|x), using this maximally efficient ‘re-scaled’ encoding? Applying Kullback-Liebler divergence:

 

           

                    (28)

 

The first term is 0, because Pm(k|x) is zero for ungrammatical continuations. Simplifying the second term, we obtain:25

 

                                                                   (29)

 

Because the input continues in some way or other, , and hence we can conclude that:

 

                                                      (30)

 

This is the minimum expected amount of waste that accrues for a particular guess, with probability Dj(x) of the learner guessing an ungrammatical continuation.

            We have considered a particular x. We now average over all the possible sequences of j-1 words, to get the expected amount of information loss on encoding the jth item, which is denoted by  (using the definition in equation 20 above). Thus, we obtain:

 

            (31)

 

This completes the first part of the proof.

            The second part of the proof shows how the above result can be applied to put a bound on . Log is a concave function, and we can therefore use the standard result that for expectations over an arbitrary random variable, z:

 

                                                                                (32)

 

This implies that:

 

                                                                             (33)

 

if we then substitute in 1 - Dj(x) for z, we obtain:

 

                      (34)

 

This is a crucial part of the second step in the proof—we have now introduced the expected value, , across all possible x. We can now get at more directly, but replacing the log expression on the right hand side of the inequality using a Taylor expansion.

 

                                                                                                                        (35)

 

Stringing together the inequalities (31), (34) and (35), we obtain:

 

                                                (36)     

 

So far we have only considered the probability of overgeneralization, and consequent waste of information, for the jth word in the corpus. We now sum over all j. The left hand side of equation (36) immediately simplifies, using the result above (equation 20) that . This gives:

 

                                                                 (37)

 

In this section, we have so far worked with probability distributions Pm and Pl over sequences, rather than with the familiar m and l, which are defined over binary sequences. We can now relate the present discussion back to the binary analysis. By stipulation, there is as one to one correspondence between possible binary states and sequences of words.26 There is therefore also a direct correspondence between the probabilities of these corresponding states. The probability of generating a particular word sequence is the same as the probability of generating the corresponding binary sequence. Information-theoretic measures, such as Kullback-Liebler distance, are of interest precisely because they are independent of the details of the coding scheme used to represent a probability distribution. Thus, it makes no difference whether the probability distributions are defined over strings of words (like Pm and Pl) or are defined over the corresponding binary strings (like m and l). Hence,

 

                                                            (38)

 

where the right hand inequality follows from (20) above. Putting (37) and (38) together gives the result:

 

                                                                                 (39)

 

            The intuitive significance of the overgeneralization theorem can be thought of in the following way. Suppose that the language learner were to continually attempt to guess the next word of every linguistic interchange. If the learner follows the Simplicity Principle, and makes predictions according to the distribution Pl (or, equivalently, according to l over a binary code), then the expected number of times that the learner will make a prediction that violates the grammar of the language has a finite bound, even on an infinite corpus. This implies, for example, that, for a linguistic input of n words, the expected average number of overgeneralization errors can be no more than: . Thus, if we consider a sufficiently large corpus (i.e., we increase n), the average expected number of overgeneralization errors tends to zero.

 

Undergeneralization errors

In undergeneralization, a sentence, s, is allowed by the true grammar, but it disallowed by the learner’s probability distribution. If this were to occur, after hearing a prior sequence of words x, a word, k, would be encountered which the learner had assigned a probability of 0. The learner would have undergeneralized, by assuming that the language is more restrictive than it in fact is.27

            For a learner using the Simplicity Principle, however, such undergeneralizations never occur. This apparently remarkable result can be understood intuitively as following simply from the fact that the learner’s probability distribution, l, corresponds to a universal monotone computer. Any computable output (including any corpus of language generated by a monotone computable process) therefore has a non-zero probability of being generated by this universal machine—because a universal machine, by definition, can simulate the computable process that generated this output. There is therefore a non-zero probability that a program that simulates this computational process will be generated by chance.28

            But further reflection suggests that the problem of undergeneralization  has not really be ruled out effectively by the analysis above. So far we have ruled out the possibility that the learner assumes a continuation to be impossible, when it is actually possible; but it seems relevant also to consider the case where the learner drastically underestimates (perhaps by a vast factor) the probability that a sentence might occur. In this case, the true distribution might allow that a continuation (e.g., dogs after the context raining cats and...) is actually rather common; whereas the learner believes that it is so infinitesimally probable that it is unlikely to occur in the entire history of the universe. Such a learner would seem, intuitively, to be making an undergeneralization  error (and a rather blatant one!); but such errors will not be detected by the previous criterion, as the learner believes the probability of the continuation to be non-zero.

            To address this concern, let us therefore consider a ‘soft’ version of undergeneralization. Suppose, as before, that the sequence of words encountered by the learner is generated according to a computable probability distribution Pm, and that the learner attempts to predict this sequence by a universal probability distribution Pl. As usual, we denote the sequence of the initial j-1 words that the learner encounters by x, and let us call the jth word, k. If the learner undergeneralizes on word k by a factor f, this means that the learner underestimates the probability that k will occur after x by a factor f. That is, f.Pl(k|x) £Pm(k|x). What is the probability that the k that is chosen according to the true distribution is a word on which the learner undergeneralizes, given the preceding sequence, x? This probability, which we denote Lj(x), can be expressed:

 

                                                                   (40)

 

The expected probability, , with which this occurs on the jth item is expressed:

 

                                                                                 (41)

 

Our goal is to put some bound on the expected number of undergeneralization errors throughout the corpus, i.e., . The following result can be derived (see Appendix F for a proof):

 

Soft undergeneralization theorem

 

                                                                    (42)

 

(so long as f > e)

 

The theorem implies that the expected number of ‘soft’ undergeneralizations is bounded by a constant, even for an infinitely long sequence of linguistic input. As with overgeneralizations, the upper bound is proportional to the complexity of the underlying probabilistic mechanism generating the language (including, presumably, the grammar of the language). Moreover, the more severe the criterion for an undergeneralization (the greater the value of f), the fewer such undergeneralizations can occur.

            We have shown that, if language is generated by an arbitrary computable probability distribution, Pm, and the learner employs the universal distribution Pl, the expected number of over- and under-generalizations that the learner makes will be bounded by a constant, over an infinitely long linguistic input.

            Thus, in testing grammaticality judgements by prediction, as discussed above (and assuming the highly idealized case where linguistic input consists only of grammatical sentences), the learner can, in the limit, make highly accurate grammaticality judgements.

            Note that, although we have viewed this result as describing a ‘guessing game’ over the language represented as words (i.e., the learner must guess which word comes next), exactly the same argument applies at other units of linguistic analysis. Thus, in a the same results would apply if the learner’s task were to predict utterances phoneme-by-phoneme, syllable-by-syllable, or sentence-by-sentence. We shall use this fact below, in considering how well the learner may fair in producing language, rather than judging grammaticality. 

 

Language learning and language production

So far we have presented two results. First, we have shown that learning using a Simplicity Principle can be used to successfully predict linguistic input, in the asymptote; this result arises directly from Solomonoff’s (1978) Prediction Theorem. Second, we showed that the Prediction Theorem has implications for the ability to learn to make grammaticality judgements from positive evidence alone. Roughly, the logic of the argument was to show how a learner that can predict effectively can use this ability to make grammaticality judgements; and hence to use the result concerning the quality of prediction to provide an insight to the quality of grammaticality judgements.

            It might appear, however, that a more challenging task for the learner is not merely to judge whether sentences that it hears are grammatical, but to successfully produce sentences of its own. Fortunately, it is possible to show that by learning using the universal distribution, l, the learner can also produce language effectively, in the asymptote.

            For concreteness, let us imagine a rather idealized set-up. Let us imagine that the learner hears an indefinitely long conversation (i.e., the learner’s entire history of linguistic input), and will, at some point ‘join in’ by producing language of its own. Let us assume, further, that the learner has the rather restricted goal of producing an utterance which blends in with the previous conversation as well as possible--i.e., the learner aims to say something which one of the other participants in the conversation might equally well have said. This is, of course, a rather limited goal, given that part of the purpose of language production is to express one’s own distinctive perspective. Nonetheless, if the learner can ‘blend in’ to the conversation in this way, the learner must have learned to produce the language successfully, because the learner can mirror the behavior of the other speakers, who are, by assumption, speakers of the language. So by assessing whether the learner can blend in to the conversation, we can assess whether the learner has learned to speak the language successfully.

            Consider an arbitrary chunk of language--e.g., phoneme, word, sentence or longer, that the learner might utter at a certain point in the conversation. Suppose that the potential chunk of new material that the learner might add at this point is encoded in the binary sequence y; and the conversation so far is encoded in the binary sequence x. The actual probability that the sentence has this continuation, if the sequence continues to the generated by the existing speakers, is by definition m(y|x). The learner generates utterances instead by the same distribution that the learner uses in prediction, l, so that the probability of the learner producing this continuation is l(y|x). The learner will blend in, to the extent that l(y|x) is a good approximation to m(y|x)--i.e., that the learner has a propensity to produce language that the other speakers have a propensity to produce.

            Fortunately, the following theorem holds (Li & Vitányi, 1997, Theorem 5.2.2). Where m is, as above, a probability distribution (strictly, a semi-measure) associated with the probabilities of the output of some monotone computable process, and l denotes the universal distribution, then for any finite sequence y, then as the length of sequence x tends to infinity:29

 

                                                                                                   (43)

 

with a probability growing to 1 for fixed y and growing x. Interpreting (43) in the context of language production, this means that, in the asymptote, the learner will blend in arbitrarily well. The probability of the learner producing any continuation of the conversation will tend towards the probability of that continuation being made by another speaker. In particular, this means that there will not be sentences that the other speakers might say with some significant probability, but which the learner is incapable of saying; and conversely that everything that the learner might say with significant probability will be something that the other speakers might have said. Thus, in the asymptote, the learner can speak the language indistinguishably from the speakers in the language community in which the language was learned.

           

Discussion

We have shown that, under quite broad assumptions about the linguistic input over which learning occurs, there is enough information in positive input alone to learn a good deal about a language. In this section, we reconsider the relationship of the present results to those of Gold (1967), discuss the status of ‘logical problems’ for language acquisition, outline open questions for future research, and finally consider the hypothesis that a simplicity principle might actually guide how language acquisition and other cognitive processes occur.

 

Relationship to Gold (1967)

I noted at the outset that the scope for learning language from positive evidence alone has been viewed as very limited since Gold’s (1967) classic paper “Language identification in the limit.” By contrast, the present results suggest that under very general conditions positive evidence can provide enough information for a learner to gain a great deal of information about a language (though we shall mention a number of caveats below).

            The present results do not, of course, cast doubt on the validity of Gold’s results. Nor does it cast doubt on the usefulness of Gold’s approach to the study of learning. Indeed, an important subfield of research, learning theory, has emerged from extensions of Gold’s results (Angluin, 1980; Blum & Blum, 1975; Jain, Osherson, Royer & Kumar Sharma, 1999; Martin & Osherson, 1998; Osherson, Stob & Weinstein, 1985). Moreover, results from learning theory have been extensively related to human learning, including language learning (e.g., Osherson & Weinstein, 1982; Osherson, Stob & Weinstein, 1982, 1984; Pinker, 1979, 1984).

            The present results do emphasize the general truism that different formal idealizations of a single process--here the process of language acquisition--can lead to very different theoretical conclusions. The pressing question, therefore, is in what ways do the idealizations differ, and which idealization appears to be most relevant to how children learn natural language. An exhaustive analysis of the issues is beyond the scope of this paper. Here we briefly mention three critical points of difference (see Rohde & Plaut, 1999 for related discussion).

            A first difference is that Gold’s criterion for successful learning is more exacting than that considered here. Gold is concerned with precisely ‘identifying’ a language—i.e., specifying exactly (or almost exactly—see Osherson, Stob & Weinstein, 1985) what sentences it does or does not contain. This seems too strict a criterion of learning in relation to how children learn language—after all the idiolects of any two native speakers will presumably show at least subtle differences. Moreover, even a single difference over a specific grammatical rule between two idiolects can lead two speakers to disagree on the grammaticality of the infinite number of sentences in which that grammatical rule is involved. Thus, we would expect that any two people would disagree on the grammaticality of an infinite number of sentences. This means that theoretical results showing that a learner cannot precisely identify a language from a teacher providing only positive evidence, such as Gold provides, may not apply directly to language acquisition in the child. The model developed here allows that the learner and the ‘teachers’ from whom the language is learned may make different judgments about the grammaticality of an infinite number of sentences (and the teachers may, presumably, also differ among themselves). But the learner and teachers will agree on almost all sentences that have a substantial probability of being said. This means that, for example, a disagreement between learner and teachers concerning the application of a controversial grammatical rule in a ten billion word long sentence will not count noticeably against the learner’s having successfully acquired the language. From the pragmatic point of view of explaining how learners come to understand the actual sentences that they hear, and learn to produce similar sentences, the more relaxed criterion adopted here seems reasonable.

            A second, and related, difference is that Gold’s result makes a crucial simplification in ignoring statistical properties of the language. In Gold’s learning set-up, a language is a collection of sentences; and the goal of learning is to identify this set. But in the speech to which children are exposed, some types of sentences are more common than others—and learning the language critically involves learning these types of sentences, over and above types of sentences which are rarely or never produced. Thus, all native speakers of English agree that the cat is on the mat is an acceptable grammatical sentence; but examples of a rare structure such as this is the paper I filed without reading leaves native speakers, and even linguists, divided and uncertain regarding grammaticality.

To reinforce the point, an analogy from a much simpler learning problem, linear regression, is useful. Suppose we want to fit a straight line to a set of x-y values generated by some physical system. Suppose that we take a representative sample of x-y values and find that we have thousands of data points where x ranges from 0 to 10. If a linear regression appears to be a good fit, then we might (given certain assumptions) justifiably conclude that we can predict y values for new values of x in that region, with a reasonable degree of confidence. But this does not mean that we can have much, or perhaps any, confidence in our estimate for y for x values of -1010 or 1050. But this may not matter in practice, if we need to predict or control this natural system, if these x-values never arise. From a theoretical perspective, the behavior of the function at these extreme values may be of enormous significance; but the practical person does not care which theory is true, so long as they can predict within the range of data that is actually encountered.

A third difference between Gold’s framework and the present set-up is that Gold demands that for a language to be learnable, it must be possible for the learner to learn the language given any text for that language. Here a text is defined as a (typically infinite) sequence of sentences (allowing arbitrary repetitions) which includes all and every sentence of the language. This means that every grammatical sentence of the language will be encountered eventually, but that there are no further constraints whatever concerning the order in which sentences are encountered. Gold (1967) notes that the demand that language can be learned from every possible may be too strong. That is, he allows the possibility that language learning from positive evidence may be possible precisely because there are restrictions on which texts are possible.

Indeed, once the demand that the learner must successfully acquire the language from any text is abandoned, then a potentially powerful source of ‘implicit’ negative evidence becomes available: absence as implicit negative evidence. To see how critically important this factor can be consider a language learner that is considering the viability of the ‘vacuous’ grammar, that any set of words in any order is grammatical—‘anything goes.’ But suppose that the ten million words that the learner has so far encountered have been generated by a trivial finite state grammar. It might seem that the learner can pretty safely rule out the ‘vacuous’ hypothesis, under these conditions—and, indeed, it might seem that any intelligent learning mechanism is likely to reach this conclusion. The absence of all but a tiny fraction of possible sentences would seem to be strong evidence that these sentences (or at least, the vast bulk of them) are not allowed in the language. Thus, it seems reasonable to interpret absence as a potential source of implicit negative evidence. But in Gold’s set-up, a learner that adopts this assumption will be found wanting, because learners are required to acquire the language successfully, whatever the text on which they learn (so long as the text includes all and only the grammatical sentences of the language). Thus, any text at all is a perfectly legitimate text for the ‘vacuous’ grammar, including the one mentioned above; that is, the text can be ‘rigged’ arbitrarily to ‘mislead’ the learner; and Gold’s criterion requires that the learner should, nonetheless, always ultimately succeed in identifying the language correctly. More broadly, because the text can be rigged arbitrarily, the learner can never rule out ‘over-general’ grammars—i.e., grammars that allow more sentences in the language than the target grammar. Intuitively, the point is that for any ‘reasonable’ text, including the linguistic inputs to which children are exposed, absence can be used as negative evidence. Thus, by allowing ‘unreasonable’ texts, Gold’s idealization makes the learning problem unduly difficult.

   The potential importance of absence as a source of negative evidence applies not just at the general level mentioned above. As Rohde and Plaut (1999) have elegantly argued, it is also at the core of a wide range of specific proposals that attempt to explain how the child can acquire aspects of language from positive evidence alone. These proposals, which include the “uniqueness principle,” “competition,” “preemption,” “blocking,” the “principle of contrast,” “mutual exclusivity” and the “M-constraint” (Bowerman, 1988; MacWhinney, 1993; Pinker, 1984; Wexler & Cullicover, 1980), all rely on absence as an implicit signal than certain forms cannot occur. Rohde and Plaut (1999) point out that these principles require the learner to use ‘soft’ constraints such as that verbs typically have a single past tense, or that nouns typically have a single plural form. The constraints are ‘soft’ because they are some cases in which they do not apply. For example, in US English, ‘dive’ has two past tense forms ‘dived’ and ‘dove,’ both of which are reasonably frequent. But the soft constraint can nonetheless be extremely useful to the learner, if combined with the use of absence as negative evidence. Suppose, for example, that the learner hears countless examples of ‘went’ as the past tense of ‘go.’ The constraint that verbs typically have just one past tense means that the learner may reasonably conjecture that ‘goed’ is not viable. By using absence as surrogate negative evidence, the more examples of ‘went’ are heard, the more confident the learner can be. The learner can reason that if ‘goed’ existed, it would very likely have been encountered. Indeed, presumably it is just such an inference which underlies our adult intuition that ‘goed’ is not viable—it would seem incredibly unlikely that ‘goed’ is a valid past tense form, but that due to a remarkable chain of coincidence, one has never heard anyone say it. Note, by contrast, that this style of reasoning would not be appropriate for a learner aiming to learn according to Gold’s idealization of the learning problem—because the learner must succeed even in the ‘rigged’ text where ‘goed’ is legitimate, but is only heard after one billion examples of ‘went.’

To use absence as a source of negative evidence requires, then, some restrictions on the class of possible inputs to the learner (texts cannot be arbitrarily rigged). But which assumptions about the class of texts are appropriate? One extreme idealization would be to assume that texts are created by concatenating sentences chosen independently from an identical distribution over the (infinite) space of possible sentences. This idealization is attractive from a formal point of view—because it allows the application of the standard probability theory concerned with the properties of such sequences. But this assumption is clearly too restrictive, because there are patently very strong, and linguistically crucial, interdependencies between successive sentences. A natural direction to explore is to weaken this assumption by allowing dependencies between short sub-sequences of sentences, or in some other way assume that the language is relatively stationary (Rohde & Plaut, 1999). Any such assumption that the language is ‘stationary’ is subject to the concern, however, that there are dependencies between chunks of language over arbitrary scales. To see this, consider the dependencies in an academic journal, which apply between sentences and subsequent sentences; between paragraphs and subsequent paragraphs; between sections and subsequent sections; and even between articles and subsequent articles. Thus, it is not clear that language is a stationary stochastic process over any time-scale, although the possibility remains that it may be approximately stationary, to some useful degree, or at some level of linguistic analysis. The present framework places strong, but rather general, restrictions on texts, but without requiring stationarity. Specifically, infinite texts must be monotone computable. This restriction is significant. The overwhelming majority of infinite texts will correspond to uncomputable sequences.30 However, the uncomputable sequences, being incompressible (every initial segment is incompressible) to some degree, correspond more or less to "white noise" and has no meaning or regularity apart from being white noise, and hence there is no cogent reason why one should want to learn them or that they would express any interesting phenomenon. Note, too that the restriction to computability is still quite weak, in the sense that it does not impose any constraints which are specific to learning natural language. Nonetheless, the results we have discussed here show that adding this constraint on inputs suffices to make language learning possible.

            In a nutshell, Gold’s learning paradigm embodies the view that the child’s goal in learning language is primarily theoretical:The goal is to get the correct theory that decides all possible cases, whether or not they arise in practice or not; and Gold demands that this theory is learnable on all possible texts for the language. But it may be more appropriate to view the child’s primary goal as practical: What matters is learning to handle the language as it is actually spoken, from samples of the language that might actually be heard. In brief, Gold’s results show that language learning from positive evidence alone is impossible, when viewed as a problem of theory discovery; but the present results show that practical knowledge of how to predict, judge and produce sentences of a language can in principle be derived from positive evidence alone. The present analysis seems appropriate for natural languages where there is typically little consensus concerning what constitute correct sentences is necessarily fluid: different native speakers and linguists may completely disagree on the correctness of infinitely many sentences; and grammaticality judgments may be inconsistent across different occasions for the same speaker. In this sense, then, it seems appropriate to view a natural language is a probabilistic or fuzzy concept, and absolute criteria of constituting a set with absolute membership as used by Gold seem misplaced if applied to understanding the problem of real language acquisition, as faced by the child.

 

Baker’s Paradox and the ‘logical’ problem of language acquisition

The results that we have provided imply that there is no ‘logical’ problem of language acquisition from positive evidence. And, in particular, this means that the specific examples of “Baker’s paradox” that are discussed in the linguistics literature as presenting learnability puzzles are all learnable, given sufficient data. For suppose the contrary: that, say, some aspect of the English auxiliary system were not learnable, even with indefinitely large amounts of data. If so, this would mean that the predictions of the learner using the simplicity principle would never converge precisely on the true probabilities of different continuations—because, when the overgeneral structure arose, the learner would spuriously assign a non-zero probability to ‘hole’ in the language, that is ungrammatical and hence should be assigned zero probability. This lack of convergence would mean that there would always be a fixed residual error between the learners predictions and the true probabilities of different linguistic continuations, and hence that the sum of such errors, over an infinite corpus, would diverge to infinity. But this stands is inconsistent with Solomonoff’s prediction theorem, described above, which shows that a learner using the simplicity principle will only make a finite summed error, over an infinite corpus.

            The present proposal is that learning by simplicity can solve Baker’s paradox (given unlimited data). This may seem especially puzzling, givent that it has traditionally been assumed that simplicity is unable to handle the Baker’s paradox. After all, one way of framing the paradox is to say that the ‘simplest’ linguistic hypothesis that accounts for the positive data that has been encountered, does not predict the linguistic ‘holes’—this, after all, is what makes the holes unpredictable. So, given that it seems that simplicity can be used in framing the problem, it may seem mysterious that simplicity can be invoked to slve it! But in fact there is no mystery, because the simplicity principle, as a principle of learning, is that the learner should choose not the simplest hypothesis (which will, indeed, give over-general predictions), but that the learner should choose the hypothesis that gives the shortest description of the linguistic data. According to this criterion, it is important both that a hypothesis is reasonably simple, but also that the data can be expressed as compactly as possible in terms of that hypothesis. It is this second step that rules out overgeneral hypotheses—because by being too general, such hypotheses use an unnecessarily lengthy code to express the data in terms of hypothesis. More generally, any simplicity principle that considers only the simplicity of hypotheses, without taking account of the simplicity with which the data can be expressed in terms of that hypothesis is likely to be suspect, either as an account of learning, or as a methodological precept in linguistic theory (Chomsky & Halle, 1968; Culicover, 1999; Fodor & Crain, 1987; Hawkins, 1994).

            In a nutshell, the general asymptotic learnability results provided here necessarily apply to the asymptotic learnability of any particularly awkward linguistic structure. Hence there is no in principle ‘logical’ problems of language acquisition raised by such examples.

But this does not, of course, mean that such structures do not, in practice, raise puzzles for language acquisition. It remains to be established if or how, e.g., the English auxiliary system, can be learned, given the size and quality of the corpus that is available to the learner.

 

 

 

 

Open questions

The analysis in this paper considers the amount of information available to a learner from positive evidence alone; but it does not consider the extent to which it is possible for a learner to exploit this information fully.

            To consider whether this information can be exploited fully, let us assume that the learner has the same computational power as the mechanism producing the corpus (this seems a reasonable assumption, as today’s learner is tomorrow’s corpus-generator for future learners). Thus, the learner is modeled as a monotone Turing machine with access to a random input. To obtain optimal learning, the learner needs to predict according to the universal distribution, l, conditional on previous input. But in general, at least, this will not be possible, because l is an uncomputable distribution--this is a standard result of algorithmic complexity theory (Li & Vitányi, 1997). So, although the information may be available, the learner cannot exploit it fully.

            Hence, a psychological mechanism that learns using a Simplicity Principle must operate by approximating the probability distribution l--i.e., finding a short, but not necessarily the shortest, encoding of past linguistic data. This opens up the very interesting question of how approximations to l will fare in language acquisition--in prediction, making grammaticality judgements, and language production. Two extreme possibilities may be envisaged. One extreme possibility is that computational restrictions change the picture dramatically. Although for a learner with no computational limitations, the linguistic input contains enough information for successful learning, it might be that for real computational learners, very little useful information about language structure can be extracted from the input. The other extreme possibility is that computational limitations do not qualitatively affect what can be learned--i.e., the learner can predict, judge grammaticality, and produce language successfully, by choosing the simplest account of the language that it is able to find, although not, of course, quite as accurately as would be possible if the Simplicity Principle could be implemented precisely.  The question of which extreme represents the true situation, or which compromise between them is appropriate, is currently an open problem. Nonetheless, some steps have been made in this direction. Vitányi & Li (2000) consider a computable approximation to the universal distribution--the statistical Minimum Description Length Principle (e.g., Rissanen, 1987, 1989) and show via mathematical analysis that, under certain conditions, this computable approximation it is expected to lead to successful predictions with probability 1. There remains, though, a rich set of open questions concerning the properties of learners which various more specific computational properties and restrictions (e.g., learners that can only entertain certain languages). Most important, of course, is the analysis of  idealized learners that are psychologically realistic as models of human learners.

            A related area set of questions concerns more specific models of both of the language to be learned, and of the nature of the learner. In the analysis here, our only constraint on the language is that it could be produced by a ‘monotone’ Turing computable process (with access to a source of randomness). The learning problem may be expected to become substantially easier if constraints are placed in the class of languages that might need to be learned. These constraints might range from very general properties of language (which might emerge from communicative constraints, cognitive limitations, or in a variety of other ways) to highly specific and elaborate constraints of language structure, such as those embodied in ‘universal grammar’ (Chomsky, 1981).

            A third important set of open questions, that we touched on at the end of the last subsection, concerns the quality and amount of data required for language acquisition to occur. Formal results both in the tradition of formal learning theory started by Gold and learning by simplicity started by Solomonoff have focussed on learning in the asymptote, using a potentially infinite supply to data. But real language learning must occur reliably using limited amounts of data (although the available data to the child will comprise many millions of words each year). Thus a crucial set of open questions concerns how rapidly learners can converge well enough on the structure of the linguistic environment to succeed reasonably well in prediction, grammaticality judgements and language production. Some progress on this issue has already been made by Solomonoff, who has shown that the expected squared error in the n-th prediction probabilities of using the universal distribution to decrease more rapidly than 1/(n log(n)).

 

Language learning by simplicity as a

working hypothesis in cognitive science?

In the absence to answers to the open questions above, the analysis presented here, like learnability theory (Gold, 1967), has remained at a high level of generality. It has abstracted away computational limitations of the learner entirely; assumed only the most general restrictions on the structure of the linguistic input; and considered only asymptotic learning. Are such general results likely to be of relevance to the more specific questions that a theory of language learning must ultimately address? In particular, to what extent do these general results about the viability of learning language by simplicity really say about the plausibility of language learning by simplicity as a working hypothesis in cognitive science?

            No definitive answer to these questions is presently possible. But two considerations suggest that the possibility that language is learned using a Simplicity Principle deserves serious attention.

            The first is that learning by simplicity demonstrably works in a wide range of practical applications. Indeed, it comprises an entire research program in statistics (where simplicity is approximated using ‘minimum description length,’ Rissanen, 1987, 1989, and ‘minimum message length,’ Wallace & Freeman, 1987). Moreover, simplicity (often under the label “Occam’s razor”) is a fundamental principle of contemporary machine learning theory (e.g., Kearns & Vazirani, 1994; Quinlan & Rivest, 1989). In these contexts, the Simplicity Principle has to be applied used limited computational resources, data sets with restricted structure, and where the amount of data available is finite. Nonetheless, choosing the simplest hypothesis appears to work well as a practical method for learning; it seems therefore worth considering the possibility that a Simplicity Principle may be an effective practical method for language acquisition also.

            The second line of support for pursuing the Simplicity Principle as a working hypothesis about language learning is that Simplicity Principles have already proved useful in modeling a range of cognitive phenomena--indeed, we have touched on some examples in the discussion above. Most directly relevant is work that uses simplicity as a model of particular aspects of language learning. For example, Brent & Cartwright (1996) show how morphological structure can be found within isolated words using minimum description length statistical inference (Rissanen, 1989), a particular type of Simplicity Principle. Wolff (e.g., 1977, 1982, 1988, 1991; see also Redlich, 1993) has produced a large body of research showing how higher level structure can be found automatically in text, by attempting to compress the text into the simplest possible representation.

Furthermore, the Simplicity Principle as a proposal for understanding cognitive phenomena has much broader scope as well as a long tradition (e.g., Chater, 1997, 1999). The Simplicity Principle can be traced at least as far back as Mach (1959/1886), who proposed that the perceptual system seeks to find the simplest representations of sensory input. This viewpoint is echoed the study of perceptual organization in the Gestalt tradition: that perceptual organization is chosen to maximize “prägnanz” (Koffka, 1962/1935). Moreover, Hochberg and McAlister (1953) explicitly identified the goal of perceptual organisation as maximizing simplicity, and this work was followed by a variety of related proposals, where simplicity is measured in different ways (Buffart, Leeuwenberg & Restle, 1981; Chater, 1996; Garner, 1962, 1974; Leeuwenberg, 1969, 1971; Leeuwenberg & Boselie, 1988). Simplicity also arises in the very different tradition of the study of early vision—the early visual system is viewed as ‘compressing’ sensory information into the simplest possible form (Atick & Redlich, 1990; Barlow, Kaushal & Mitchison, 1989; Blakemore, 1990). A further connection, is the application of Kolmogorov complexity to the study of perceived randomness (Falk & Konald, 1997) who argue that perceived randomness may be determined by the degree to which the cognitive system fails to find a simple structure. Finally, the Simplicity Principle may apply to higher level cognitive phenomena: simplicity is after all used both as a principle of theory evaluation in scientific and common-sense reasoning, and an determinant of aesthetic preferences (e.g., Chater, 1997; Kemeny, 1953; Li & Vitányi, 1997, Chapter 5; Sober, 1975). This range of research suggests the search for simplicity might be a unifying cognitive principle (Chater, 1999).

            Thus, although the analysis of learning by simplicity in this paper is abstract, it appears that concrete computational and cognitive models based on simplicity are highly productive. This suggests that the notion that children learn language by simplicity may be a productive working hypothesis for cognitive science.

 

Conclusion

This paper presents an alternative to Gold’s idealization of the problem of language acquisition. Under this idealization, there is sufficient information in the linguistic input for a learner to make predictions about what will be said; to make grammaticality judgements; and to learn to produce language. These results are achieved by a learner using a Simplicity Principle--choosing the model of the language that provides the simplest (shortest) description of the linguistic data that has been encountered. These results re-open the question of the viability of language learning from positive evidence under less idealized conditions, of limited computational resources or amounts of linguistic data available to be learner. More concretely, we suggest that the working hypothesis that the search for simplicity is a guiding principle in language acquisition deserves serious attention.


Notes

1.                  Chaitin (1966) addresses issues of machine dependent minimal code length but does not hit on the crucial universal optimal coding as associated with Kolmogorov complexity. The same applies to Chaitin (1969) except for the last section where suddenly Kolmogorov's and Solomonoff's notion is introduced.

2.                  See Li & Vitányi, 1997, pp. 335-337 for analysis of identification in the limit as a special case of learning by simplicity.

3.                  Here, we shall follow the proof described in Li & Vitányi, 1997, principally due to Péter Gács, rather than Solomonoff’s original derivation.

4.                  The only subtlety here is that the mapping into the binary alphabet should be reversible, meaning that the original alphabetic representation can be uniquely decoded. This can be ensured by, for example, using a prefix binary code for the original alphabet and punctuation marks—that is, a code such that no initial portion (i.e., prefix) for any item corresponds to the code for some other item.

5.                  No great metaphysical weight needs to be borne by the concept of randomness here. What matters is that many aspects of linguistic input (e.g., those affected by coin tosses, the weather, and ‘chance’ events of all kinds) will be, from a practical point of view, random for the learner. That is, no underlying pattern can conceivably be found by the learner, whether or not some such pattern ultimately exists. This epistemic notion of randomness is made precise by defining random sequences as sequences that are their own shortest description, leading to the mathematical theory of algorithmic randomness (Li & Vitányi, 1997).

6.                  Technically, it is allowed that, at some point, no further output might be produced.

7.                  More precisely, the requirement is that the output is a produced by a monotone computational process acting on the input. We define a monotone computational process as follows: it reads its input in one direction only (i.e., it cannot ‘go back’ to look at earlier inputs, although it can store this input in its memory); and it cannot modify this input (the input is ‘read-only’). Moreover, the output can be written in one direction only (i.e., one an output is ‘written’ it cannot be altered); and the output cannot be read (the output is ‘write-only’). The output of the machine is defined as the binary sequence on the output tape, if the machine halts (and hence all subsequent inputs are ignored); and the infinite sequence binary sequence on the output tape, if the machine does not halt, but continues producing further outputs indefinitely. See Li and Vitányi (1997, p.276-277) for a rigorous description. Thus, as input is added, output cannot be deleted--although it is possible that the machine becomes ‘mute’--it produces no more output after a certain point.

8.                  The output is finite if the machine produces no more output after a certain point in the infinite binary input sequence. For example, the machine might halt, or go into an infinite loop.

9.                  Strictly, approximated in the limit from below.

10.              Provided that these distributions have rational parameters.

11.              This class of outputs of the machine is broader, however, if the internal noise in the system can contribute an infinite amount of randomness--more technically, if the internal randomness supplies an infinite number of bits of information. This is because the model presented here only allows a finite amount of randomness to be absorbed from the environment in making any particular output. For example, a computational process which depended on the real valued variable sampled from a probability density function--i.e., where the value of this variable must be known to infinite precision in order to assess its computational significance--could not be simulated by the model described here. It is conceptually possible that this might arise--but this assumption is not embodied in any current theoretical and computational model of language processing, to my knowledge.

12.              Equation 1 is a simplification, because it ignores the issue of double counting two input sequences which both start with a sub-sequence z, and where z alone generates x. See Li & Vitányi, 1997 for a rigorous specification, which takes account of this problem. we have ignored this subtlety here and elsewhere below in the interests of clarity.

13.              Or some other enumerable (semicomputable) probability distribution. This is a very broad class of distributions, including all those that are used in statistics (see Li & Vitányi, 1997).

14.              Strictly, a universal language can represent only the deterministic part of the mixture between deterministic and random factors assumed above to be involved in generating the corpus. This is not a substantial limitation for the learner in encoding the input, however. At any point in learning, the learner has only encountered a finite amount of data, and this finite amount of data only contains a finite amount of randomness. A universal machine can straightforwardly represent an input that contains only a finite amount of randomness (e.g., by just storing it verbatim).

15.              Crucially, this is true if all languages use the same alphabet—here, for simplicity, we assume that any coding language is, at bottom, encoded in a binary alphabet. With larger alphabets, shortest code lengths get shorter, as each choice of symbol can carry more information. Converting code lengths depending on alphabet size is straightforward—we lose no generality by restricting ourselves to a binary alphabet here.

16.              The reader may wonder why, given that we are dealing a monotone Universal Turing machine, the relevant measure for the complexity of a probability distribution is not Km(m) rather than K(m). The reasons are technical, but in essence, the reason is that we shall want to be able to specify a probability distribution, and then sample from it—and to do this, we have to know when the probability distribution has been specified. Therefore, we need to be able to specify a description of the distribution, rather than a sequence which begins with a specification of the distribution (see Li & Vitányi, 1997)—that is, the code for the distribution must be self-delimiting.

17.              Consider, for example, padding a computer program with arbitrarily large amounts of null operations, in the case of a conventional computer language.

18.              We could, of course, equally well consider the difference in the probability that the next symbol is a 1, with no substantive change to the proof.

19.              I here follow the spirit and much of the notation of Li and Vitányi’s (1997) treatment, which is based on a proof suggested by Peter Gács. Solomonoff’s original proof is quite different. We have also reworked the proof in order to reduce it to its essentials as far as possible, and to provide a self-contained presentation, not presupposing knowledge of algorithmic probability theory (e.g., Zvonkin & Levin, 1970) or the general theory of Kolmogorov complexity (Li & Vitányi, 1997).

20.              Strictly, we stipulate that this program is self-delimiting, which means that it is clear when the end of the program has been reached, and hence when the data input to the program begins. This apparently minor point actually has substantial mathematical consequences, which bedeviled early attempts to formalize these ideas (e.g., Solomonoff, 1964).

21.              Some theories of similarity in cognitive science presuppose that similarity must be symmetrical. That is, A must be exactly as similar to B as B is to A. But Kullback Liebler divergence is not symmetrical. Hence, from the perspective of these accounts, Kullback-Liebler distance can be related to similarity only at a metaphorical level. We nonetheless use the term ‘similarity’ in relation to Kullback-Liebler distance here, for clarity, without intending any particular stand on these issues (see, e.g., Hahn & Chater, 1998).

22.              Kullback-Liebler divergence is sometimes defined using logs in base e, rather than base 2. This leads to some minor differences between statements of results here and those in Li and Vitányi (1997).

23.              Nothing theoretically substantial rests on the choice of the word as the unit of choice. The important point here is that language is considered as a sequence of a finite number of linguistically significant and separate chunks. The arguments below would equally well go through if we assumed that language input were coded in terms of phonemes, morphemes or syllables.

24.              See Appendix D for a proof of this ‘re-scaling lemma.’

25.              Note that this formula allows for the possibility that there are grammatical sentences which have zero probability of being heard.

26.              Strictly, this is true for binary states with non-zero probability of occurrence. We assume that all and only the binary strings that can be generated are sequences of words—the whole point of the binary code is to encode language input.

27.              Note that the learner might undergeneralize not only because of an underestimation of which sentences are grammatical. The learner might, instead, assume that a certain sentence is impossible for a variety of other reasons. For example, the learner might wrongly assume that people can only produce center-embedded sentences of depth one—this could be viewed as an incorrect estimation of people’s short-term memory constraints, rather than a misconstrual of the grammar. In more general terms, to the extent that a distinction between linguistic competence and linguistic performance can be made (Chomsky, 1965), the learner may undergeneralize with respect to either competence or performance. The bounds that we develop here apply to undergeneralization of both kinds; and hence automatically provide bounds on undergeneralizations of linguistic competence, which are of most interest to linguists. Hence, we need not consider the difficult questions concerning how, if at all, the competence/performance distinction can be made precise (though see Christiansen & Chater, 1999).

28.              A proof is given in Appendix E.

29.              Strictly, this theorem does not hold for all sequences xy; but the probability that the theorem holds tends to 1, as the length of x tends to infinity. Thus, the ‘pathological’ sequences where the theorem does not hold will do not arise too often in practice.

30.              This follows because the number of computable texts is bounded by the number of Turing machines, which is countable; but the set of all infinite texts is uncountable.

 

 

 


References

Adriaans, P. W. (1992). Learning shallow context-free languages under simple distribution. Manuscript, Centrum voor Wiskunde en Informatica, Amsterdam.

Angluin, D. (1980). Inductive inference of formal languages from positive data. Informatoin and Control, 45, 117-135.

Atick, J. J. & Redlich, A. N. (1990). Towards a theory of early visual processing. Neural Computation, 2, 308-320.

Baker, C. L. (1979). Syntactic theory and the projection problem. Linguistic Inquiry, 10, 533-581.

Baker, C. L. & McCarthy, J. J. (Eds.) (1981). The logical problem of language acquisition. Cambridge, MA: MIT Press.

Barlow, H. B., Kaushal, T. P. & Mitchison, G. J. (1989). Finding minimum entropy codes. Neural Computation, 1, 412-423.

Bates, E. & MacWhinney, B. (1987). Competition, variation, and language learning. In B. MacWhinney (Ed.), Mechanisms of language acquisition (pp. 159-174). Hillsdale, NJ: Erlbaum.

Blakemore, C. (Ed.) (1990). Vision: Coding and efficiency. Cambridge, England: Cambridge University Press.

Blum, L. & Blum, M. (1975). Toward a mathematical theory of inductive inference. Information and Control, 28, 125-155.

Bowerman, M. (1983). How do children avoid constructing an overly general grammar  in the absence of feedback about what is not a sentence? Papers and Reports on Child Language Development, 22, Stanford, CA: Stanford University Press.

Bowerman, M. (1987). The ‘no negative evidence’ problem: How do children avoid constructing an overly general grammar? In J. A. Hawkins (Ed.), Explaining language universals (pp. 443-466). Oxford: Basil Blackwell.

Braine, M. D. S. (1971). On two types of models of the internalization of grammar. In D. I. Slobin (Ed.). The ontogenesis of grammar: A theoretical symposium. New York: Academic Press.

Brent, M. R. (1996). Advances in the computational study of language acquisition. Cognition, 61, 1-38.

Brent, M. R. & Cartwright, T. A. (1996). Distributional regularity and phonotactic constraints are useful for segmentation. Cognition, 61, 93-126.

Brown, R. & Hanlon, C. (1970). Derivational complexity and order of acquisition in child speech. In J. Hayes (Ed.), Cognition and the developmental of language (pp.11-53). New York: Wiley.

Buffart, H., Leeuwenberg, E. & Restle, F. (1981). Coding theory of visual pattern completion. Journal of Experimental Psychology: Human Perception and Performance, 7, 241-274.

Bybee, J., Perkins, R. & Pagliuca, W. (1994). The evolution of grammar. Chicago: Chicago University Press.

Chaitin, G. J. (1966). On the length of programs for computing finite binary sequences. Journal of the Association for Computing Machinery, 13, 547-569.

Chaitin, G. J. (1969). On the simplicity and speed of programs for computing infinite sets of natural numbers. Journal of the Association for Computing Machinery, 16, 407-422.

Chaitin, G. J. (1987). Beyond Godel’s proof. IBM Research Magazine, 25, 12-15.

Chater, N. (1996). Reconciling simplicity and likelihood principles in perceptual organization. Psychological Review, 103, 566-581.

Chater, N. (1997). Simplicity and the mind. The Psychologist, November, 1997, 495-498.

Chater, N. (1999). The search for simplicity: A fundamental cognitive principle? Quarterly Journal of Experimental Psychology, 52A, 273-302.

Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.

Chomsky, N. (1980). Rules and representations. Cambridge, MA: MIT Press.

Chomsky, N. (1981). Lectures on government and binding. Foris: Dordrecht.

Chomsky, N. & Halle, M. (1968). The sound patterns of English. New York, Harper and Row.

Christiansen, M. H. Allen, J. & Seidenberg, M. S. (1998). Learning to segment speech using multiple cues: A connectionist model. Language and Cognitive Processes. 13, 221-268.

Christiansen, M.H. & Chater, N. (1994). Generalization and connectionist language learning. Mind and Language, 9, 273-287.

Christiansen, M.H. & Chater, N. (1999). Toward a connectionist model of recursion in human linguistic performance. Cognitive Science, 23, 157-205.

Culicover, P. (1999). Syntactic nuts: Hard cases, syntactic theory, and language acquisition. Oxford: Oxford University Press.

Elman, J.L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.

Elman, J.L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 71-99.

Falk, R., & Konold, C. (1997). Making sense of randomness: Implicit encoding as a basis for judgment. Psychological Review, 104, 301-318.

Fodor, J. D. & Crain, S. (1987). Simplicity and generality of rules in language acquisition. In B. MacWhinney (Ed.), Mechanisms of language acquisition (pp. 35-63). Hillsdale, NJ: Erlbaum.

Gallaway, C. & Richards, B. (Eds.) (1994). Input and interaction in language acquisition. Cambridge: Cambridge University Press.

Garner, W. R. (1962). Uncertainty and structure as psychological concepts. New York: John Wiley.

Garner, W. R. (1974). The processing of information and structure. Potomac, Md.: Erlbaum.

Gold, E. M. (1967). Language identification in the limit. Information and Control, 16, 447-474.

Grünwald, P. D. (1996). A minimum description length approach to grammar inference. In G. Scheler, S. Wermter, & E. Riloff (Eds.) Connectionist, statistical and symbolic approaches to learning for natural language processing (pp. 203-216). New York: Springer.

Hahn, U.  & Chater, N. (1998). Similarity and rules: distinct? exhaustive? empirically distinguishable? Cognition, 65, 197-230.

Harman, G. (1965). The inference to the best explanation. Philosophical Review, 74, 88-95.

Hawkins, J. A. (1994). A performance theory of order and constituency. Cambridge: Cambridge University Press.

Hirsh-Pasek, K., Treiman, R. & Schniederman, M. (1984). Brown and Hanlon revisited: Mothers’ sensitivity to ungrammatical forms. Journal of Child Language, 11, 81-88.

Hochberg, J.  & McAlister, E. (1953). A quantitative approach to figure “goodness.” Journal of Experimental Psychology, 46, 361-364.

Hornstein, N. & Lightfoot, D. W. (1981). Explanation in linguistics: The logical problem of language acquisition. London, UK: Longman.

Jain, S., Osherson, D. N., Royer, J. S., & Kumar Sharma, A. (1999). Systems that learn (2nd edition). Cambridge, MA: MIT Press.

Kearns, M. J. & Vazirani, U. V. (1994). An introduction to computational learning theory. Cambridge, MA: MIT Press.

Kemeny, J. G. (1953). The use of simplicity in induction. Philosophical Review, 62, 391-408.

Koffka, K. (1962). Principles of Gestalt psychology (5th ed.). London: Routledge and Kegan Paul. (Original work published in 1935).

Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Problems in Information Transmission, 1, 1-7.

Leeuwenberg, E. (1969). Quantitative specification of information in sequential patterns. Psychological Review, 76, 216-220.

Leeuwenberg, E. (1971). A perceptual coding language for perceptual and auditory patterns. American Journal of Psychology, 84, 307-349.

Leeuwenberg, E. & Boselie, F. (1988).  Against the likelihood principle in visual form perception. Psychological Review, 95, 485-491.

Li, M. & Vitányi, P. (1997). An introduction to Kolmogorov complexity theory and its applications (2nd edition). Berlin: Springer.

Mach, E. (1959). The analysis of sensations and the relation of the physical to the psychical. New York: Dover Publications. (Original work published 1886).

MacWhinney, B. (1993). The (il)logical problem of language acquisition. In Proceedings of the 15th Annual Conference of the Cognitive Science Society (pp. 61-70). Mahwah, NJ: Erlbaum.

Marcus, G. F. (1993). Negative evidence in language acquisition. Cognition, 46, 53-85.

Martin, E. & Osherson, D. N. (1998). Elements of scientific inquiry. Cambridge, MA: MIT Press.

Morgan, J. L. & Travis, L. L. (1989). Limits on negative information in language input. Journal of Child Language, 16, 531-552.

Osherson, D. N. & Weinstein, S. (1982). A note on formal learning theory. Cognition, 11, 77-88.

Osherson, D. N., Stob, M. & Weinstein, S. (1982). Ideal learning machines. Cognitive Science, 6, 277-290.

Osherson, D. N., Stob, M. & Weinstein, S. (1984). Learning theory and natural language. Cognition, 17, 1-28.

Osherson, D. N., Stob, M. & Weinstein, S. (1985). Systems that learn. Cambridge, MA: MIT Press.

Paul, W. J., Seiferas, J. I., & Simon, J. (1982). An information theoretic approach to time bounds for on-line computation. Journal of Computer and System Sciences, 23, 108-126.

Pinker, S. (1979). Formal models of language learning. Cognition, 7, 217-283.

Pinker, S. (1984). Language learnability and language development. Cambridge, MA: MIT Press.

Pinker, S. (1994). The language instinct. Harmondsworth, UK: Penguin.

Quinlan, J. R. & Rivest, R. (1989). Inferring decision trees using the minimum description length principle. Information and Computation, 80, 227-248.

Redlich, A. N. (1993). Redundancy reduction as a strategy for unsupervised learning. Neural Computation, 5, 289-304.

Redington, M., Chater,  N., & Finch, S. (1998). Distributional information: A powerful cue for acquiring syntactic categories. Cognitive Science, 22, 425-469.

Rhode, D. L. T. & Plaut, D. C. (1999). Language acquisition in the absence of explicit negative evidence: How important is starting small? Cognition, 72, 68-109.

Rissanen, J. (1987). Stochastic complexity. Journal of the Royal Statistical Society, Series B, 49, 223-239.

Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.

Shannon, C. E. (1951). Prediction and entropy of printed English. Bell System Technical Journal, 30, 50-64.

Simon, H. A. (1972). Complexity and the representation of patterned sequences of symbols. Psychological Review, 79, 369-382 .

Siskind, J. M. (1996). A computational study of cross-situational techniques for learning word-to-meaning mappings. Cognition, 61, 39-91.

Sober, E. (1975). Simplicity. Oxford: Clarendon Press.

Sokolov, J. L. & Snow, C. E. (1994). The changing role of negative evidence in theories of language development. In C. Gallaway & B. Richards (Eds.) (1994) Input and interaction in language acquisition (pp. 38-55). Cambridge: Cambridge University Press.

Solomonoff, R. J. (1964). A formal theory of inductive inference, Parts 1 and 2. Information and Control, 7, 1-22, 224-254.

Solomonoff, R. J. (1978). Complexity-based induction systems: Comparisons and convergence theorems. IEEE Transactions on Information Theory, 24, 422-432.

Vitányi, P. M. B. & Li, M. (2000). Minimum Description Length Induction, Bayesianism, and Kolmogorov Complexity, IEEE Transactions on Information Theory, IT-46, 446-464.

Wallace, C. S. & Freeman, P. R. (1987). Estimation and inference by compact coding. Journal of the Royal Statistical Society, Series B, 49, 240-251.

Wexler, K. & Cullicover, P. (1980). Formal principles of language acquisition. Cambrdige, MA: MIT Press.

Wolff, J. G. (1977). The discovery of segmentation in natural language. British Journal of Psychology, 67, 377-390.

Wolff, J. G. (1982). Language acquisition, data compression and generalization. Language and Communication, 2, 57-89.

Wolff, J. G. (1988). Learning syntax and meanings through optimisation and distributional analysis. In Y. Levy, I. M. Schlesinger & M. D. S. Braine (Eds.), Categories and processes in language acquisition, (pp. 179-215). Hillsdale, NJ: LEA.

Wolff, J. G. (1991). Towards a theory of cognition and computing. Chichester: Ellis Horwood.

Zurek, W. H. (ed.) (1991). Complexity, entropy, and the physics of information. Redwood City, CA: Addison-Wesley.

Zvonkin, A. K. & Levin, L. A. (1970). The complexity of finite objects and the development of the concepts of information and randomness by means of the theory of algorithms. Russian Mathematical Surveys, 25, 83-124.