PDP Systems and the Connectionist Conception of Human Cognition

Greg Bowering, June 1998 (with slight corrections July 1998)

(This is the original 4300 word assignment that was then edited down to be closer to the 2500 word limit!)

 

Introduction

Parallel Distributed Processing (PDP) is a computational paradigm modeled on neural interaction in the brain. Man-made devices called Artificial Neural Networks (ANN) have been constructed to explore, develop, and utilise theories within the PDP paradigm. It has long been suspected that the central nervous system, composed of billions of special interconnected cells called neurons, is intimately involved with our intuitive concepts of mind and cognition. Greater understanding of PDP has enabled the development of a Connectionist Conception (CC) of cognition: that cognition can be modeled purely in terms of the combined abilities of biologically-realised PDP networks in the brain.

The aim here is to consider the plausibility of the CC as a theory of mind in its own right, not to compare it with alternative theories. In considering its plausibility, however, it is worth considering how well it fits with other theories at lower levels of description (such as neurophysiology) and at higher levels (such as cognitive psychology). Before the plausibility of the CC can be discussed, however, it is necessary to provide an overview of PDP.

Parallel Distributed Processing

The following sections overview the architectural, representational, and computational characteristics of the various ANN systems described by PDP theory.

Architectural Characteristics

An artificial neural network consists of units with variable levels of 'activation' connected to other units such that their own activation can influence the activation of those other units. Connections may vary in strength and may either have a positive (excitatory) or negative (inhibitory) influence on activation. Input units are those that are given their activation levels from outside the network, and are not then free to change their activation levels. Output units are those whose activation levels are provided to the outside world as the network's "response" to the inputs. Hidden units are the remaining units of the network.

The activation level of a unit is entirely determined as a function of influences from its afferent (input) connections. With a deterministic unit, output is the same as its activation. A deterministic network is one composed entirely of deterministic units. With a stochastic unit, output may be one of two discrete values (such as '1' or '0'). The output is non-deterministic in the sense that it is intrinsically uncertain what value it will take given the unit's current activation. The activation level determines the probability, P(1), that the output will be '1' rather than '0'. A stochastic network is one composed of stochastic units. The theory of stochastic networks is mostly inspired by statistical physics whereas deterministic networks are more biologically inspired.
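The difference between the two kinds of unit can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sigmoid squashing function, the weights, and all function names are chosen here for the example and are not taken from any particular PDP system.

```python
import math
import random

def net_input(inputs, weights):
    """Weighted sum of afferent activations (positive weights excite, negative inhibit)."""
    return sum(x * w for x, w in zip(inputs, weights))

def sigmoid(x):
    """Assumed squashing function mapping net input to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def deterministic_output(inputs, weights):
    """Deterministic unit: the output is simply the activation."""
    return sigmoid(net_input(inputs, weights))

def stochastic_output(inputs, weights, rng):
    """Stochastic unit: the activation gives P(1); the output is sampled as 1 or 0."""
    p_one = sigmoid(net_input(inputs, weights))
    return 1 if rng.random() < p_one else 0

rng = random.Random(0)
inputs, weights = [1.0, 0.5], [2.0, -1.0]   # hypothetical values; net input = 1.5
activation = deterministic_output(inputs, weights)
samples = [stochastic_output(inputs, weights, rng) for _ in range(10000)]
frequency_of_one = sum(samples) / len(samples)
```

Over many trials the proportion of '1' outputs from the stochastic unit approaches the activation of the deterministic unit given the same inputs, although any single output remains uncertain.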

Feedforward networks are acyclic in that connections are only one-way, and no path of influence through the network may form a loop or cycle. Because of this, they may be arranged conceptually into a strict hierarchy of layers. Input units in one layer are connected to successive layers of hidden units in the middle, which are connected to a layer of output units. Influences between units are restricted to those whereby A may influence B if B belongs to a layer further away from the input layer (and closer to the output layer) than A's layer. The influence of changes provided to activations of the units of the input layer can then only feed forward through the network to the output layer. Feedforward networks are always stable in that a change from one stable input to another will always lead to a change from one stable output response to another, after a small fixed transition delay.
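A minimal sketch of a feedforward pass follows. The 2-3-1 layout, the hand-picked weights, the bias terms, and the sigmoid activation are all assumptions made for illustration; the point is only that activation flows one layer at a time from input to output, with no path leading back.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(activations, weights, biases):
    """weights[j][i] is the connection strength from unit i (previous layer) to unit j."""
    return [sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)]

def feedforward(inputs, layers):
    """Propagate activations strictly forward, one layer at a time."""
    activations = list(inputs)
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

# Hypothetical network: 2 input units, 3 hidden units, 1 output unit.
layers = [
    ([[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.0]),
]
out_a = feedforward([1.0, 0.0], layers)
out_b = feedforward([1.0, 0.0], layers)   # same input, so the same stable output
```

Because the loop visits each layer exactly once, the response is available after a fixed number of steps, which is the sense in which such networks are always stable.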

The summed input to a unit determines its activation. The activation function (otherwise known as the transfer or decision function) is some scalar function of the summed inputs, often modelled mathematically by simple functions such as the identity (a linear function) or the threshold-step function, or by smooth nonlinear functions such as the sigmoid or tanh. Linear activation functions limit a network's successful performance to linear approximations to appropriate behaviour. Such a network might work fine in some conditions (linear problem domains), but rather poorly in others. This was seen as a major failing of PDP in the late 1960s. Networks that use non-linear activation functions can now perform better in a broader range of environments (P.M. Churchland 1988).
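One way to see why linear activation functions are so limiting is that any stack of linear layers collapses into a single linear transformation. The sketch below assumes identity activations and uses arbitrary illustrative weights; the helper names are hypothetical.

```python
def matvec(m, v):
    """Multiply matrix m by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Multiply matrix a by matrix b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

w1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 2 inputs -> 3 hidden units
w2 = [[1.0, -2.0, 0.5]]                       # 3 hidden units -> 1 output

def two_layer_linear(x):
    """Two layers with the identity as activation function."""
    return matvec(w2, matvec(w1, x))

collapsed = matmul(w2, w1)   # a single weight matrix with the same behaviour

x = [0.7, -1.3]
deep = two_layer_linear(x)
shallow = matvec(collapsed, x)
```

Here `deep` and `shallow` agree for every input, so the hidden layer adds nothing that a direct input-to-output connection could not already do. A nonlinear activation function between the layers prevents this collapse.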

Recurrent networks contain symmetrical (two-way) connections between units and/or longer cycles of influence. The existence of such cycles makes it impossible to conceive of these networks as being composed of a strict hierarchy of layers of units. However, they may sometimes still be conceived of in layers, but then allow for backward connections (creating feedback) and sometimes lateral connections (between units in the same layer). Because of feedback cycles, recurrent networks are potentially unstable dynamical systems (Gleick 1987). A change from one stable input to another might, after some variable delay, cause the system to settle into a stable state and thus exhibit a stable output response. However, this is not necessarily so. It might settle into an oscillating state such that the output units alternate between two or more different patterns. The third possibility is that the system might exhibit a chaotic response, with output units producing an unpredictable sequence of states. Which of these three outcomes occurs depends on the input to the system, its architecture, and its connections. Furthermore, an arbitrarily small difference in input or connection strength can be the difference between a stable outcome and a chaotic one. Even in the case of a stable outcome, it is possible for an arbitrarily small difference in input or connection strength to be the difference between settling to one stable response or another.
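The first two outcomes can be illustrated with a toy two-unit recurrent network. The sketch assumes binary threshold units that are all updated at once, with the external input treated as a constant extra influence on each unit; these choices and the weights are illustrative only. A network with finitely many discrete states can only settle or cycle, so the chaotic case, which needs continuous activations, is not shown.

```python
def step(state, weights, external):
    """Update every unit at once from the previous state (synchronous update)."""
    return tuple(
        1 if sum(w * s for w, s in zip(row, state)) + ext > 0 else 0
        for row, ext in zip(weights, external)
    )

def run(weights, external, start, max_steps=50):
    """Classify the network's behaviour as a stable state or a repeating cycle."""
    seen = {start: 0}
    state = start
    for t in range(1, max_steps + 1):
        nxt = step(state, weights, external)
        if nxt == state:
            return ("stable", state)
        if nxt in seen:
            return ("cycle", t - seen[nxt])   # length of the repeating sequence
        seen[nxt] = t
        state = nxt
    return ("unresolved", None)

# Mutual excitation: the two units support each other and settle.
settles = run([[0, 1], [1, 0]], external=(1, 0), start=(0, 0))

# Mutual inhibition with both units driven: they switch each other on and off.
oscillates = run([[0, -2], [-2, 0]], external=(1, 1), start=(0, 0))
```

Only the sign and size of the two connections differ between the cases, yet one network comes to rest and the other alternates between two patterns indefinitely.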

Composite networks (Deco and Obradovic 1996, Ch.9) include interconnected PDP networks. There are two distinct kinds of interconnection possible. The first kind involves the outputs of one network serving as inputs to another network. A more interesting kind involves the outputs of one network serving to determine connection weights in another network. McClelland (1986, cited by Bechtel 1990) refers to the latter as "programmable connection systems".

Representational Characteristics

The activation patterns of input units explicitly encode potential questions asked, problems posed, or environments presented to the network. The nature of this encoding is determined by design, which depends on the dimensions of the problem domain or the transducer systems that encode real-world events and properties into appropriate forms. A localised encoding is one in which each input unit activation can be readily interpreted to represent the degree to which a distinct problem feature is present. A distributed encoding is one in which it is not possible to pick out an individual unit and say what its activation represents; rather, it is only the whole pattern of a group of unit activations that can be interpreted as representing something.

When a network has been trained to perform a particular transformation (solve a problem, or recognise a pattern) the output response is similarly interpreted to explicitly encode the system's answer. Where the network has been trained by exposure to a set of example correct input-output pairs via a supervised learning paradigm, the output interpretation is again determined by design. Since all activation patterns potentially change whenever the input pattern changes, they are said to be a transient form of representation.

The activation patterns of hidden units can often be interpreted as distributed representations of the firings of particular transformations or rules, or as categorisations of inputs. The activation of N units in a layer can be considered as a vector in an N-dimensional activation space. The activation of a unit can be interpreted as saying whether or not the vector of units connecting into it is currently within a particular region of activation space. It is the connection weights of those connections which determine the boundaries of that region.

Whilst activation patterns are transient forms of representation, the connection weights only change as part of the training of the network. The connection weights thus provide a long-term store of what the network knows. If they are not being adjusted we may consider the network to be essentially "hard-wired" to do what it does. In this sense connection weights are part of that hard-wiring, and thus may be interpreted as part of the tacit knowledge possessed by the system. In a feedforward network they embody the transformation rules that relate the activation pattern in one layer to the activation pattern it causes as a response in the subsequent layer. Since the one set of connection weights may be interpreted as participating in a variety of distinct input-output transformations, they are said to store information in a superpositional fashion. For example, if we construct and then train a network to remember pairs of names and telephone numbers, the encoding of each pair stored by the network is spread across all its connections. This property is demonstrated by what happens when part of the network is destroyed, or when it is overloaded by too many pairs, or when two pairs are too similar to each other: graceful and global degradation in performance, and 'crosstalk' or 'blending' errors (Bechtel 1990, P.S. Churchland & Sejnowski 1990, Copeland 1993).
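Superpositional storage and crosstalk can be sketched with a very small associative memory. As an assumption for illustration, names and numbers are replaced by short vectors of +1 and -1, and the weights are built with a simple Hebbian-style sum of products rather than any specific method from the literature cited above.

```python
def train(pairs, n_in, n_out):
    """Superimpose every key/value pair onto one shared weight matrix."""
    w = [[0.0] * n_in for _ in range(n_out)]
    for key, value in pairs:
        for j in range(n_out):
            for i in range(n_in):
                w[j][i] += value[j] * key[i]
    return w

def recall(w, key):
    """Present a key and threshold each output unit's summed input."""
    return [1 if sum(wji * k for wji, k in zip(row, key)) >= 0 else -1 for row in w]

# Two dissimilar (orthogonal) keys: both values are recalled intact.
k1, k2 = [1, 1, 1, 1], [1, -1, 1, -1]
small = train([(k1, [1, 1]), (k2, [-1, 1])], 4, 2)
clean_1 = recall(small, k1)
clean_2 = recall(small, k2)

# Overloaded with three more pairs whose keys resemble k1: recall of k1 is swamped.
crowded = train([(k1, [1, 1]),
                 ([1, 1, 1, -1], [-1, -1]),
                 ([1, 1, -1, 1], [-1, -1]),
                 ([1, -1, 1, 1], [-1, -1])], 4, 2)
muddled_1 = recall(crowded, k1)
```

Every pair contributes to every weight, so no single connection holds any one pair. In the crowded memory the value recalled for `k1` is pulled towards the values stored under its near neighbours, which is the kind of blending error described above.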

Computational Characteristics

There are two main kinds of computation involved in PDP systems. The first kind is what is involved in mechanically transforming an input pattern into an output response. The second kind is what is involved in configuring the connection weights to optimally perform the first. It involves learning knowledge and storing it tacitly in the network ready to be applied appropriately on demand. Although learning is not traditionally part of the network per se, it is central to the network's ability to adapt to a broad variety of problems, and is often used to continuously improve performance while the network is being put to use.

Solution-Finding

As mentioned earlier, feedforward networks complete their input-output transformations, producing an 'answer' at the output layer with no fuss after a small fixed transition delay. When successfully trained, a good proportion of the time this answer will firstly make sense with respect to the chosen interpretation/encoding scheme for output patterns, and secondly it will most likely be a correct response, or at least a reasonable 'guess'.

In recurrent networks, solution-finding relies on the system settling into a stable state. The problem can be thought of as seeking to minimize some energy function of the system in activation space. If the network has been successfully trained, when the input units are given an input activation pattern, the system will quickly explode into a flurry of state changes: first large changes of unit activations, then gradually smaller changes as the whole system spirals down the 'energy landscape' of the network in activation space towards the desired global minimum, where it will stabilise. Unfortunately it is possible for a deterministic recurrent network to be caught in a 'local minimum' and never reach the best solution. To overcome this, the network design can include a means of jostling the system around with random noise so as to shake it out of any such ruts it might get stuck in. The level of random noise introduced into the system is sometimes thought of as its 'temperature' (this comes from the contribution of statistical physics to PDP theory). A successful method of making sure the recurrent network finds the best solution is Simulated Annealing. This involves a high temperature when the input is introduced, and then a gradual cooling until the system is stable, with no temperature (noise). The solution thus obtained can be checked by starting the system in that state, and then reheating the system and slowly cooling it back down. If the system does not find the best solution in the first instance, it will do so after repeated simulated annealing, provided enough heat (noise) is applied. The method takes its name from an analogous technique in metalwork for obtaining desirable qualities in steel.
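The annealing procedure can be sketched for a tiny network of four units taking the values +1 and -1. The energy function, the acceptance rule (accept an uphill flip with probability exp(-dE/T)), the cooling schedule, and the weights are all assumptions chosen for illustration. With these weights the state with every unit at -1 is a local minimum and the state with every unit at +1 is the global minimum.

```python
import math
import random

def energy(state, weights, biases):
    n = len(state)
    pair_term = sum(weights[i][j] * state[i] * state[j]
                    for i in range(n) for j in range(n))
    return -0.5 * pair_term - sum(b * s for b, s in zip(biases, state))

def flip_delta(state, weights, biases, i):
    """Energy change if unit i flips sign (assumes symmetric weights, zero diagonal)."""
    field = sum(weights[i][j] * state[j] for j in range(len(state))) + biases[i]
    return 2.0 * state[i] * field

def anneal(state, weights, biases, temperatures, rng):
    state = list(state)
    for temp in temperatures:                      # high noise first, then cooling
        for i in range(len(state)):
            d = flip_delta(state, weights, biases, i)
            if d <= 0 or rng.random() < math.exp(-d / temp):
                state[i] = -state[i]
    improved = True                                # finish with no noise at all
    while improved:
        improved = False
        for i in range(len(state)):
            if flip_delta(state, weights, biases, i) < 0:
                state[i] = -state[i]
                improved = True
    return state

n = 4
weights = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
biases = [0.1] * n
start = [-1] * n                                   # sits in the local minimum

stuck = anneal(start, weights, biases, [], random.Random(0))       # no heat applied
schedule = [4.0 * 0.9 ** k for k in range(60)]
annealed = anneal(start, weights, biases, schedule, random.Random(0))
```

Without noise the network never leaves the rut it starts in. With the heating and cooling schedule it is free to cross the energy barrier, although a single run is not guaranteed to end in the global minimum; repeated annealing makes that outcome more likely.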

The stochastic recurrent network, typically referred to as a Boltzmann machine (Deco & Obradovic 1996), differs in that random noise is modelled as part of the intrinsic behaviour of each unit, and in that the output of each unit is one of two discrete values. However, the broad solution-finding process is similar: instead of undergoing a smooth gradient descent in an N-dimensional activation space, the system bounces around to find the best corner of an N-dimensional hypercube.

Training and Learning

Each style of PDP system can learn via a number of strategies. Supervised learning involves a teacher presenting the system with example inputs and then telling it how incorrect each of its outputs is relative to the desired output. The best-known supervised strategy is backpropagation for feedforward networks. It is a mathematically derived algorithm for mechanically assigning proportionate blame for output errors to each of the network's connection weights, and then adjusting those weights in proportion to that blame. This process must be repeated many hundreds or thousands of times for each input/output example pair in a training-set until the network exhibits acceptable performance. There are two general ways in which this strategy is applied (Deco and Obradovic 1996). In batch training, the whole training set is presented to the network, one example at a time, and connection weights are not adjusted until the total error has been calculated over all training pairs (an epoch). Alternatively, incremental training involves adjustments to weights after each training example is presented. Because the order in which examples are presented is important in the latter case, they are often presented in a random order (stochastic training).
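The contrast between batch and incremental training can be sketched with a tiny 2-2-1 feedforward network trained by backpropagation. Everything specific here is an assumption for illustration: the sigmoid units, the squared-error measure, the learning rate, the number of epochs, and the choice of logical OR as the training set.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_net(rng, n_in=2, n_hid=2):
    """Small random weights; the last weight in each row acts as a bias."""
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hid)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hid + 1)]
    return hidden, output

def forward(net, x):
    hidden, output = net
    h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in hidden]
    y = sigmoid(sum(w * v for w, v in zip(output, h + [1.0])))
    return h, y

def gradients(net, x, target):
    """Assign blame for the error 0.5*(y - target)^2 to every weight."""
    hidden, output = net
    h, y = forward(net, x)
    delta_out = (y - target) * y * (1.0 - y)
    grad_output = [delta_out * v for v in h + [1.0]]
    grad_hidden = []
    for j in range(len(hidden)):
        delta_h = delta_out * output[j] * h[j] * (1.0 - h[j])
        grad_hidden.append([delta_h * v for v in x + [1.0]])
    return grad_hidden, grad_output

def apply_update(net, grads, rate):
    """Adjust each weight in proportion to its share of the blame."""
    hidden, output = net
    grad_hidden, grad_output = grads
    for row, grow in zip(hidden, grad_hidden):
        for i, g in enumerate(grow):
            row[i] -= rate * g
    for i, g in enumerate(grad_output):
        output[i] -= rate * g

def total_error(net, data):
    return sum(0.5 * (forward(net, x)[1] - t) ** 2 for x, t in data)

def train_batch(net, data, rate, epochs):
    for _ in range(epochs):
        all_grads = [gradients(net, x, t) for x, t in data]
        summed_hidden = [[sum(col) for col in zip(*rows)]
                         for rows in zip(*(gh for gh, _ in all_grads))]
        summed_output = [sum(col) for col in zip(*(go for _, go in all_grads))]
        apply_update(net, (summed_hidden, summed_output), rate)   # once per epoch

def train_incremental(net, data, rate, epochs, rng):
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)                                        # random order
        for x, t in order:
            apply_update(net, gradients(net, x, t), rate)         # after each example

data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
net_a = init_net(random.Random(1))
net_b = init_net(random.Random(1))
before = total_error(net_a, data)
train_batch(net_a, data, rate=0.5, epochs=500)
train_incremental(net_b, data, rate=0.5, epochs=500, rng=random.Random(2))
after_batch = total_error(net_a, data)
after_incremental = total_error(net_b, data)
```

Both networks start from identical weights. The only difference is when the adjustment is applied: once per pass over the whole set, or after every example.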

Unsupervised learning involves no such teacher but instead uses raw information (possibly feedback) from the environment for the system to self-organise so as to reflect some of the structure inherent in it. The best-known unsupervised rule is the Hebbian learning rule, formulated by Hebb in 1949 (Deco and Obradovic 1996). In contrast to backpropagation, it is a biologically motivated learning rule. Essentially, the rule means that whenever the activation of unit A is followed immediately by the activation of unit B, the connection strength of AB is increased in proportion to the product of those activation levels. In this way, a constant conjunction of 'A then B' caused by environmental inputs during learning causes the network to model a causal relation from A to B. An anti-Hebbian rule is introduced as a means of weakening excitatory connections and strengthening inhibitory connections between two units when they fail to activate together during learning.
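The Hebbian rule as described amounts to a one-line weight update. The learning rate and the observed activation values in this sketch are hypothetical.

```python
def hebbian_update(weight, pre, post, rate=0.1):
    """Strengthen the A->B connection in proportion to the product of activations."""
    return weight + rate * pre * post

w_ab = 0.0
# Each pair is (activation of A, activation of B immediately afterwards).
observations = [(1.0, 1.0), (1.0, 0.8), (0.0, 1.0), (1.0, 1.0)]
for a, b in observations:
    w_ab = hebbian_update(w_ab, a, b)
```

The third observation leaves the weight unchanged because A was inactive, so only genuine conjunctions of 'A then B' strengthen the connection. No teacher or error signal is involved at any point.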

Generalisation

An important property of a PDP network is its ability to generalise - to correctly perform transformations not included in its training set. An analogy to a network's generalisation can be made with the interpolation of data points on a graph:

If we have three points (training examples) and we only know how to draw straight lines (linear functions) then we can attempt to construct a line of best-fit for the three points, which gives us a linear equation for approximately predicting the positions of as yet unseen data points. Or, if we know how to construct parabolas, we can make one exactly fit our three points, giving us a quadratic equation. If, for some reason we're only able to construct polynomials of degree three, there are an infinite number of such curves we can exactly fit to the available points. If we choose one cubic equation to describe our data, we no longer have something likely to fit with new examples, especially if it happens that a quadratic equation was really the correct model for the underlying relation.
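The analogy can be worked through numerically. In this sketch the three training points are assumed to come from the relation y = x squared, and the particular cubic is just one arbitrary member of the infinite family that passes through them; the function names are illustrative.

```python
def fit_line(points):
    """Least-squares line of best fit."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x, _ in points))
    return lambda x: my + slope * (x - mx)

def fit_exact(points):
    """Lagrange interpolation: the one parabola through three points."""
    def poly(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return poly

def one_of_many_cubics(points, c):
    """Add c*(x-x1)(x-x2)(x-x3): still exact on the training points for any c."""
    base = fit_exact(points)
    def poly(x):
        bump = c
        for xi, _ in points:
            bump *= (x - xi)
        return base(x) + bump
    return poly

points = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0)]   # training examples from y = x**2
line = fit_line(points)
parabola = fit_exact(points)
cubic = one_of_many_cubics(points, c=1.0)

unseen_x = 3.0                                   # true value is 9.0
predictions = (line(unseen_x), parabola(unseen_x), cubic(unseen_x))
```

The line underfits and predicts about 5.67, the parabola predicts 9, and the cubic predicts 15 even though it matches all three training points perfectly. A perfect fit to the training data therefore says little about how well a model generalises.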

Both over-training on undersized training sets and an oversized network architecture can lead to overfitting which is what stops a trained network from successfully generalising to fresh examples. Deco and Obradovic (1996, p.31) summarise two principal strategies for obtaining good generalisation. The first involves starting with undersized network architecture and gradually adding units and connections to obtain the minimal network complexity that will succeed. The second strategy starts with an oversized network and then either gradually prunes the network down to size, or keeps it oversized but stops training when generalisation measured on a separate 'validation' example set begins to decline.
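The last option, stopping when performance on a validation set begins to decline, can be sketched as follows. The `patience` parameter (how many epochs without improvement to tolerate) and the error values are assumptions added for illustration; the text above describes only the general idea.

```python
def early_stopping_epoch(validation_errors, patience=3):
    """Return the epoch whose weights would be kept when training is halted."""
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(validation_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break                      # generalisation has stopped improving
    return best_epoch

# Hypothetical validation error measured after each training epoch.
errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55, 0.39]
kept = early_stopping_epoch(errors)
```

Training halts at the seventh value, so the weights from the epoch with error 0.40 are kept and the final value is never reached. A larger `patience` trades extra training time for a lower risk of stopping too soon.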

The design problem concerning both the optimal size and shape of network architecture often involves some trial-and-error, but some PDP researchers have extended the trial-and-error approach to the extreme by employing the AI method of genetic algorithms to determine network architectures.

Plausibility of the Connectionist Conception of Cognition

Theories are often argued for on the basis of one or more of the following kinds of support:

I assume here without argument that only the first two kinds of support are worth consideration in directly appraising the plausibility of a theory. They at least aim to provide means for an objective assessment of its consistency (both logical self-consistency and consistency with the observable universe). A more thorough appraisal of a theory would appeal to at least some other kinds of support.

The third and fourth kinds of support promote a theory as fitting nicely into an already accepted view of the world. These can only indirectly establish the plausibility of the theory in question in terms of both the plausibility of the analogy/isomorphism being true and the plausibility of the second theory/intuition.

The fifth and sixth kinds of support tend to promote the usefulness of a theory. One can conceive of quite plausible theories that might explain or predict little or nothing (consider, for example, a useless but plausible theory whose only constituent is the proposition: "This is a proposition"). Compare this with the currently less plausible theory which posits a celestial sphere to successfully explain the motion of the stars about the Earth, and accurately predict their positions in the sky at any given time for hundreds of years ahead.

The seventh and eighth kinds of support promote the efficiency of a theory. They include the use of such principles as Ockham's razor ("posit the minimum required entities"), Cosmological principles (anti-chauvinism: "nothing is particularly special about our personal corner of the universe or this moment in its history"), and unification ("try to find theories that explain more so that science requires fewer distinct theories to explain everything").

The final kind of support is related to the previous two, since the emergence of a successful, simple, parsimonious theory is often admired so much for these properties that it is looked upon as a miracle of nature, and a thing of beauty. Some take it as evidence that the universe is by design, evidence of a Designer. However, besides mere simplicity, human aesthetic judgement seems to be able to resolve a degree of formal consistency, a fact often referred to by the phrase "truth is beauty" (explaining this ability poses an interesting test for any theory of mind).

Each of these forms of support for a theory can also be used in attacking a theory. Research into PDP systems gives rise to the CC, which posits that the cognitive abilities of the human mind can be explained purely in terms of the combined abilities of biologically realised PDP networks in the brain. The remainder of this paper examines rational reasons and empirical evidence both for and against the CC.

Patricia Churchland and Terrence Sejnowski (1990, pp.232-4) suggest some a priori arguments as to why the CC is plausible. One argument points out how some kinds of cognitive task seem to quite naturally lend themselves to an approach by the PDP style of computation whilst at the same time being difficult to perform by any non-connectionist means. The NETtalk system, which transforms English text to phonemes, is just one such example, but the PDP literature is full of such cases since this is exactly why PDP is now much more than the curiosity it used to be.

One objection to the CC is that the higher-order cognitive processes of the mind do not even begin to resemble PDP-style transformations, categorisations and pattern-recognition. Churchland and Sejnowski answer this by pointing out, via analogy with DNA, that the structure of the cause does not have to resemble the structure of the effect. Hence a complete lack of similarity or resemblance between higher cognitive features and those of neural networks is no grounds for rejecting the theory. That we might not easily conceive of how one might give rise to the other is beside the point, as emphasised by Orgel's Second Rule: "Nature is more ingenious than we are."

What PDP theory manages to do is demonstrate how "wetware" could conceptually participate in a form of computation, and it thus keeps open the question of whether wetware might participate in all forms of human cognition.

Progress in both the neurosciences and cognitive psychology continues to accumulate empirical evidence predominantly in support of the CC. Paul Churchland (1988) offers evidence for materialism (a philosophy of mind) which also supports the CC. For example, there is a large amount of evidence for the neural dependence of the mind. Not only is cognition sensitive to changes in chemicals in the brain, but damage to (or removal of) neurons impacts memory and cognition in ways strikingly consistent with PDP theory. The ability of a theory of mind to correctly predict failure modes is an admirable one, and is evident in the CC. Most promisingly, neuroscience has already succeeded in explaining the behaviour of simple creatures purely in terms of their neural architecture; however, it is unknown whether higher cognition can submit to similar analysis.

Whilst the basic apparatus for neurally-realised PDP computation is apparent, the neurological mechanisms for learning remain a mystery - we still don't know how synaptic connections are adjusted appropriately in the brain. This remains a serious shortcoming of the CC, since it needs to identify this mechanism to have any hope of being completely plausible. Paul Churchland (1988, p.164) points to climbing fibres as candidate information pathways for some kind of backpropagation in the brain. Another possibility is that learning is achieved by a process more like the Hebbian learning rule mentioned earlier. Such learning requires neither supervision nor backpropagation pathways.

Aspects of the real-world performance of PDP systems lend strong support to the CC because of their similarity to aspects of human cognitive performance. Whilst the time-performance of feedforward networks in computing responses to inputs is very impressive, this is far from true when it comes to current learning strategies for them. Supervised backpropagation strategies seem to require a disturbingly large amount of repetitive training to produce good performance. However, the abilities of PDP systems to learn new concepts from scratch, to discern patterns, to categorise information, to form internal models and generalisations of their environments, and to store information in a content-addressable manner are quite compatible with what we know about human cognitive abilities. More impressive is the way PDP systems give rise to the same kinds of responses that humans do, when they are overloaded or damaged. Both are relatively tolerant to noisy, partial, or distorted information. For example, when human speech has some phonemes replaced with noise, the listener not only still understands what was said, but fails to even notice that some parts of words were completely missing (Warren 1970 cited in Copeland 1993, p.218). This is similar to the way a PDP system can tolerate some garbled input to give the same response as if the input was clean. Another similarity is seen when a content-addressable memory implemented by a PDP system is given too many records to remember, or two records that are too similar: the result is "blending" or "crosstalk" whereby recalled information can be a little muddled. This behaviour is the same as for a human trying to remember a list of say, names and telephone numbers. Likewise, the graceful degradation of PDP and biological cognitive systems alike is apparent when such systems are damaged.

Conclusion

PDP theory, neuroscience and cognitive psychology are relatively complex and advanced fields of ongoing research. However, there are still many gaps and open questions left untouched by these endeavours. As Churchland and Sejnowski (1990) conclude, these theories, among others, need to co-evolve at all levels before we can have a complete theory of cognition. It is my judgement, given a priori considerations along with the weight of empirical evidence from both brain and behavioural sciences, that the Connectionist Conception of human cognition is plausible. I believe it already successfully explains many aspects of micro-cognition, and whether or not explanatory extension to higher-order cognition is feasibly within its grasp, it nevertheless provides us with an acceptable starting theory with which to proceed towards that goal.

 

Bibliography

Bechtel, W. 1990. Connectionism and the philosophy of mind: An overview. In William G. Lycan (ed.) Mind and cognition: a reader. pp.252-273. Oxford, UK; New York, NY: Basil Blackwell.

 

Bechtel, W. and Abrahamsen, A.A. 1991. Connectionism and the mind: an introduction to parallel processing in networks. Cambridge, Mass.: B. Blackwell. pp.210-215.

 

Churchland, P.M. 1988. Matter and consciousness: a contemporary introduction to the philosophy of mind. Cambridge, Mass: MIT Press.

 

Churchland, P.S. and Sejnowski, T.J. 1990. Neural representation and neural computation. In William G. Lycan (ed.) Mind and cognition: a reader. pp.224-252. Oxford, UK; New York, NY: Basil Blackwell.

 

Churchland, P.M. 1995. The engine of reason, the seat of the soul: a philosophical journey into the brain. Cambridge, Mass.: MIT Press. pp.84-91.

 

Clark, A. 1989. Microcognition: philosophy, cognitive science, and parallel distributed processing. Cambridge, Mass: MIT Press. Ch.5.

 

Copeland, J. 1993. Artificial intelligence: a philosophical introduction. Oxford, UK: Basil Blackwell.

 

Deco, G. and Obradovic, D. 1996. An Information-Theoretic Approach to Neural Computing. New York, NY: Springer-Verlag.

 

Gleick, J. 1987. Chaos. William Heinemann Ltd.