A summary of this algorithm can be found here, or in more detail here. This algorithm allows us to teach a neural network without knowing in advance how many hidden nodes it should have. Fahlman & Lebiere (1991) describe how it begins with a simple network whose inputs connect directly to its outputs; during learning, whenever the error reduction stagnates, it adds hidden nodes as needed to reduce the error further. The cleverness of this algorithm is that it trains a pool of candidate nodes and calculates which one reduces the error the most before adding that one permanently to the network. The input weights of a node added in this way are then frozen. This reduces the herd effect (where the network alternates between converging on one subtask or another, taking a very long time to reach a point where it handles both at once), which would likely be exaggerated by the multiple input sequences we are using for reading.
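The candidate-selection step might be sketched roughly as follows. This is a minimal illustration, not Fahlman & Lebiere's implementation: the data is a toy XOR task, the weights are hypothetical, and a random-restart search stands in for their gradient ascent on the correlation measure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: XOR, a task a direct input->output network cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

def candidate_score(v, residual):
    """Magnitude of covariance between a candidate unit's activation v
    and the network's residual error over all training patterns."""
    return abs(np.sum((v - v.mean()) * (residual - residual.mean())))

# Suppose the current input->output network predicts a constant 0.5,
# leaving this residual error on every pattern.
residual = y - 0.5

# Train a pool of candidate hidden units (here: random restarts, to
# keep the sketch short) and keep the one whose activation tracks the
# residual error best.
best_w, best_s = None, -1.0
for _ in range(200):
    w = rng.normal(size=3)             # 2 input weights + bias
    v = np.tanh(X @ w[:2] + w[2])      # candidate activation per pattern
    s = candidate_score(v, residual)
    if s > best_s:
        best_w, best_s = w, s

# best_w would now be frozen and the unit wired into the network; only
# the new output-side weights are trained afterwards.
print(best_s)
```

The key point the sketch captures is that many candidates compete offline, only the winner is installed, and its input weights are never trained again.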
Recurrent cascade correlation extends this to handle inputs over time; Fahlman (1991) has investigated this as well. It appears, however, that the effect of earlier inputs in such networks fades over iterations. Lin et al. (1996) propose a way to extend this effect. Reading one letter at a time does imply an eventual upper limit on the number of iterations that must be taken into account, namely the length of the longest word in the language; but a sequence of symbols, for example in a URL, could represent a longer chain than these approaches cover. Somehow the length of the sequence so far must play a part in the interpretation of the item at each step, differentiating between a block of understandable text and an arbitrary block of symbols. Perhaps the arbitrary block could subsequently be re-examined with lower thresholds, permitting greater error but offering more options in a space where none currently exist. This implies that there should be thresholds which alter how the word breakup occurs. How would this be represented, and how would such a mechanism be trained?
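The fading effect can be illustrated with a single self-recurrent unit of the kind recurrent cascade correlation adds. The weights below are hypothetical (untrained) and chosen only to show the decay: with a self-weight below one, an early input's contribution shrinks roughly geometrically with each subsequent step, while the most recent input dominates.

```python
import numpy as np

def run(seq, w=1.0, r=0.5):
    """One RCC-style hidden unit with a self-recurrent weight r:
    v_t = tanh(w * x_t + r * v_{t-1}). Returns the final state."""
    v = 0.0
    for x in seq:
        v = np.tanh(w * x + r * v)
    return v

base = [0.0] * 10
early = [1.0] + [0.0] * 9   # differs from base only in the FIRST symbol
late = [0.0] * 9 + [1.0]    # differs from base only in the LAST symbol

print(abs(run(early) - run(base)))  # tiny: the early input has faded away
print(abs(run(late) - run(base)))   # large: the recent input dominates
```

This is the behaviour the Lin et al. (1996) extension targets: without it, anything the unit saw more than a handful of steps ago contributes almost nothing to its current state.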