Activation Functions

A number of activation functions exist for neural nets, each with its own advantages and disadvantages.  Donald Tveter provides a nice description of the most common ones.  As the number of nodes in the network increases, optimising the activation function can reduce the time needed to process and, in particular, to train the network.  A particularly interesting one is proposed by D Elliot, which uses the formula:

f(x) = x / (1 + |x|)

This is typically faster to compute than other sigmoids while presenting a roughly similar curve.  It does take longer to converge, but depending on the problem (such as classification) this is not necessarily an issue.
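As a rough illustration (my own sketch, not taken from Elliot's paper), the following Python compares this activation with tanh, which has the same output range; the point is simply that it avoids the exponential and so is cheaper per node:

    import numpy as np

    def elliott(x):
        # Sigmoid-shaped curve without calling exp(), so cheaper to evaluate.
        return x / (1.0 + np.abs(x))

    xs = np.linspace(-6.0, 6.0, 13)
    print(np.round(elliott(xs), 3))   # ranges over (-1, 1), like tanh
    print(np.round(np.tanh(xs), 3))   # tanh is steeper near 0 and saturates faster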

Cascade Correlation

A summary of this algorithm can be found here, or in more detail here.  This algorithm allows us to teach a neural network without knowing in advance how many hidden nodes to use.  Fahlman & Lebiere (1991) describe how it begins with a simple network in which the inputs are connected directly to the outputs; during learning, if the error reduction stagnates, it adds additional hidden nodes as needed to reduce the error.  The cleverness of this algorithm is that it adds a number of candidate nodes and calculates which one reduces the error the most before deciding to add it permanently to the network.  Once a node is added in this way, its incoming weights are frozen.  This reduces the herd effect (that is, the network alternating its convergence between one subtask and another, taking a very long time to reach a point where it handles both at the same time), which would likely be exaggerated with the multiple input sequences we are using for reading.
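A minimal sketch of the candidate step, under my own simplifying assumptions (the candidates here are only randomly initialised rather than trained to maximise the correlation, and all names are illustrative rather than from Fahlman & Lebiere's code):

    import numpy as np

    rng = np.random.default_rng(0)

    def candidate_score(candidate_out, residual_error):
        # Candidates are scored by the magnitude of the covariance between
        # the candidate's output and the residual error, summed over outputs.
        v = candidate_out - candidate_out.mean()
        e = residual_error - residual_error.mean(axis=0)
        return np.abs(v @ e).sum()

    def pick_best_candidate(inputs, residual_error, n_candidates=8):
        # Train/score a pool of candidate hidden units and keep the best one.
        n_features = inputs.shape[1]
        best_w, best_score = None, -np.inf
        for _ in range(n_candidates):
            w = rng.normal(scale=0.5, size=n_features)   # random init; the real algorithm trains these
            out = np.tanh(inputs @ w)                    # candidate activation over the training set
            score = candidate_score(out, residual_error)
            if score > best_score:
                best_w, best_score = w, score
        return best_w   # frozen once the unit is installed in the network

    # toy usage with made-up data
    X = rng.normal(size=(20, 5))
    E = rng.normal(size=(20, 2))   # residual error at the output units
    w_frozen = pick_best_candidate(X, E)
    print(w_frozen)

Once installed, only the connections from the new unit to the outputs continue to be trained; freezing its incoming weights is what keeps the rest of the network from being disturbed by the addition.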

Recurrent cascade correlation deals with handling this over time; Fahlman (1991) has also investigated this.  It would appear, however, that the effect of prior inputs in such networks fades over iterations.  Lin et al. (1996) propose a way to extend this effect.  Although reading one letter at a time does imply an eventual upper limit to the number of iterations that need to be taken into account, namely the length of the longest word in the language, sequences of symbols, for example in a URL, could represent a longer chain than these approaches cover.  Somehow the length of the sequence so far must play some part in the interpretation of the item at each step, differentiating between a block of understandable text and an arbitrary block of symbols.  Perhaps the arbitrary block can subsequently be re-examined with lower thresholds, which permit greater error but offer more options in a space where none exist.  This implies that there should be thresholds which alter how the word breakup occurs.  How would this be represented, and how would such a mechanism be trained?
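To make the fading concrete, here is an illustrative sketch (not Fahlman's code; the weights are arbitrary) of the single self-recurrent hidden unit that recurrent cascade correlation adds: each step mixes the current input with the unit's own previous output, so the trace of an early input shrinks step by step.

    import numpy as np

    def rcc_unit(inputs, w_in=0.8, w_self=0.5):
        # Self-recurrent hidden unit: v(t) depends on x(t) and on v(t-1).
        v = 0.0
        trace = []
        for x in inputs:
            v = np.tanh(w_in * x + w_self * v)
            trace.append(v)
        return trace

    # A single early spike followed by zeros: its influence decays each step.
    print([round(v, 4) for v in rcc_unit([1.0] + [0.0] * 9)])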