More about how light is processed

The path of light through the eye is manipulated on the way by a series of Müller cells (which function more or less like selective fibre-optic tubes) that lead the light to the back of the retina.  These cells optimise the flow of red, green and blue light to the cones, which handle color.  However, not all of the light is funnelled; varying amounts spill over the edges and reach the retina independently.  This effect has been described by Labin, Safuri, Ribak & Perlman (Nature Communications, 8 July 2014).

The effect also has an impact of about 20% on the detection of light by the rods.  This impact is not uniform; in particular there is a dip in detected intensity (~40%) around 560nm (yellow-green) light.  This adjustment for the rods should then be taken into account when judging the overall detected brightness or darkness of a picture from a human perspective.

The concentration of light provided by the Müller cells also skews the perception of red and green relative to blue.  While this goes quite some way towards explaining the apparent “correctness” of the various color models investigated so far, it also suggests a possibly simpler calculation for determining what a given scene looks like to a human just prior to neural processing.

Activation Functions

There exist a number of activation functions for neural nets, each with their own advantages and disadvantages.  Donald Tveter provides a nice description of several of the most common ones.  As the number of nodes in the network increases, optimising the activation function can reduce the time needed to run and, in particular, to train the network.  A particularly interesting one is proposed by D. Elliott, which uses the formula:

f(x) = x / (1 + |x|)

This is typically faster to compute than other sigmoids while presenting a roughly similar curve.  It does take longer to converge, but depending upon the problem (such as classification) this is not necessarily an issue.
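
As a minimal sketch (mine, not from Tveter's or Elliott's material; the class and method names are made up), the function and its derivative avoid the call to Math.Exp that the logistic sigmoid requires:

    public static class ElliottActivation
    {
        // f(x) = x / (1 + |x|): outputs in (-1, 1), similar shape to tanh but cheaper.
        public static double Value(double x)
        {
            return x / (1.0 + Math.Abs(x));
        }

        // Variant squashed into (0, 1) for comparison with the logistic sigmoid.
        public static double Positive(double x)
        {
            return 0.5 * x / (1.0 + Math.Abs(x)) + 0.5;
        }

        // Derivative, used during back propagation: 1 / (1 + |x|)^2.
        public static double Derivative(double x)
        {
            double d = 1.0 + Math.Abs(x);
            return 1.0 / (d * d);
        }
    }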

Cascade Correlation

A summary of this algorithm can be found here, or in more detail here.  The algorithm allows us to teach a neural network without knowing in advance how many hidden nodes it needs.  Fahlman & Lebiere (1991) describe how it begins with a simple network with inputs connected directly to outputs; during learning, if the error reduction stagnates, it adds hidden nodes as needed to reduce the error further.  The cleverness of this algorithm is that it trains a pool of candidate nodes and calculates which one reduces the error the most before adding that one permanently to the network.  The input weights of a node added in this way are then frozen.  This reduces the herd effect (the network alternating its convergence between one subtask and another, taking a very long time to reach a point where it handles both at the same time), which is likely to be exaggerated with the multiple input sequences we are using for reading.
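
As a hedged sketch of the candidate-scoring step (the method and parameter names here are illustrative, not Fahlman & Lebiere's notation), the candidate whose activation covaries most strongly with the remaining network error is the one installed:

        // Illustration only: scores one candidate unit for cascade correlation.
        // candidateOutputs[p] is the candidate's activation on training pattern p;
        // residualErrors[p, o] is the current network error at output o for pattern p.
        public static double CandidateScore(double[] candidateOutputs, double[,] residualErrors)
        {
            int patterns = candidateOutputs.Length;
            int outputs = residualErrors.GetLength(1);

            double meanV = 0.0;
            for (int p = 0; p < patterns; p++) meanV += candidateOutputs[p];
            meanV /= patterns;

            double score = 0.0;
            for (int o = 0; o < outputs; o++)
            {
                double meanE = 0.0;
                for (int p = 0; p < patterns; p++) meanE += residualErrors[p, o];
                meanE /= patterns;

                // Covariance between the candidate's activation and the residual error.
                double cov = 0.0;
                for (int p = 0; p < patterns; p++)
                    cov += (candidateOutputs[p] - meanV) * (residualErrors[p, o] - meanE);

                score += Math.Abs(cov);
            }
            // The candidate with the highest score is added permanently; the rest are discarded.
            return score;
        }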

Recurrent cascade correlation deals with handling this over time, and Fahlman (1991) has also investigated it.  It would appear, however, that the effect of prior inputs in such networks fades over iterations.  Lin et al (1996) propose a way to extend this effect.  Although reading one letter at a time does imply an eventual upper limit on the number of iterations that need to be taken into account, namely the length of the longest word in the language, sequences of symbols such as those in a URL could represent a longer chain than is covered by these approaches.  Somehow the length of the sequence so far must play some part in the interpretation of the item at each step, differentiating between a block of understandable text and an arbitrary block of symbols.  Perhaps the arbitrary block can subsequently be re-examined with lower thresholds, which permit greater error but more options in a space where none exist.  This implies that there should be thresholds which alter how the word breakup occurs.  How would this be represented and how would such a mechanism be trained?

Neural Networks for reading

A starting point for neural networks in c# is the AForge.NET library.  This library provides some of the basic building blocks for making a neural network.  Here the network is implemented as a series of layers, each layer being fully connected to the neurons in the subsequent layer; in this way the network as a whole is not fully connected.  Other types of network structure which might be of interest but are not in the library are fully connected neurons in a forward pattern, hierarchical layering, and fully connected with loops.  More complexity in the structure, however, is not necessarily a good thing.

Creating something that can read means we should first define what we mean by that.  For the moment I will define it as taking a sequence of symbols and spaces and converting it into a list of concepts.  In terms of natural language processing this is roughly equivalent to parsing; however, instead of producing a list of “words” as is typically done, the idea is to end up with a list of “concepts” directly.  I believe this will help with the parsing of invented words and quasi-words.

The structure of the neural network will be critical; however, what that structure should be is far from clear, and so far this hasn’t been solved by anyone I can find, so we will need to experiment.  The first problem to overcome is the fact that our “words” are made up of symbols, so our input is a sequence of symbols.  Assigning an input neuron to each known symbol seems like a good way to start.  We can even create a new neuron on the input layer for any symbol we didn’t previously know, making the network self-building.
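
A minimal sketch of that self-building input layer (my own illustration, not AForge code) could be a simple dictionary from symbol to input-neuron index, growing whenever an unfamiliar symbol is seen:

    // Illustration only: maps each distinct symbol to an input neuron index,
    // adding a new index (i.e. a new input neuron) for symbols not seen before.
    public class SymbolInputMap
    {
        private readonly Dictionary<char, int> indexBySymbol = new Dictionary<char, int>();

        public int Count { get { return indexBySymbol.Count; } }

        public int IndexFor(char symbol)
        {
            int index;
            if (!indexBySymbol.TryGetValue(symbol, out index))
            {
                index = indexBySymbol.Count;   // next free input neuron
                indexBySymbol[symbol] = index; // the input layer grows by one
            }
            return index;
        }

        // One-hot encode a symbol against the current input layer size.
        public double[] Encode(char symbol)
        {
            int index = IndexFor(symbol);
            double[] input = new double[Count];
            input[index] = 1.0;
            return input;
        }
    }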

Typical neural networks take a given input and produce a calculated output.  In this case we want to provide a sequence of inputs for which a single output is produced, which means that some of the neurons will need to act as stateful neurons.  What a stateful neuron is, however, needs to be determined.  It could simply be a neuron which holds its value between output calculations, or it could be something which stores a value in a memory, with that stored value used by other neurons as part of their activation function.  Elman networks are a type of recurrent neural network which can exhibit this behaviour.  It could be that both interpretations are true and one is an emergent behaviour of the other, in which case we should determine the most efficient way of computing it.
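
As a rough illustration of the first interpretation (a hypothetical Elman-style layer whose previous activations are fed back in as a context, not something taken from AForge):

    // Illustrative Elman-style layer: the previous hidden activations are fed
    // back in as extra inputs (the "context"), giving the layer state between calls.
    public class ElmanLayer
    {
        private readonly double[,] weights;  // [hidden, inputs + hidden]
        private readonly double[] context;   // previous activations, the layer's "state"

        public ElmanLayer(int inputCount, int hiddenCount)
        {
            weights = new double[hiddenCount, inputCount + hiddenCount];
            context = new double[hiddenCount];

            // Small random initial weights so the layer produces non-zero output.
            var rnd = new Random();
            for (int h = 0; h < hiddenCount; h++)
                for (int i = 0; i < inputCount + hiddenCount; i++)
                    weights[h, i] = rnd.NextDouble() - 0.5;
        }

        public double[] Compute(double[] input)
        {
            int hiddenCount = context.Length;
            double[] output = new double[hiddenCount];
            for (int h = 0; h < hiddenCount; h++)
            {
                double sum = 0.0;
                for (int i = 0; i < input.Length; i++)
                    sum += weights[h, i] * input[i];
                for (int c = 0; c < hiddenCount; c++)
                    sum += weights[h, input.Length + c] * context[c];
                output[h] = sum / (1.0 + Math.Abs(sum)); // Elliott activation from earlier
            }
            Array.Copy(output, context, hiddenCount);    // remember this state for the next input
            return output;
        }
    }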

Another piece of information to determine concerns “word” boundaries.  A parser will use things like whitespace and punctuation to determine word boundaries, but this has issues with the way we actually write and is rather unforgiving of the invented phrasing and quasi-language we often find in modern communication.  To this end I see two neural networks feeding each other: the first determining whether a sequence of text is a word, the second determining whether a word boundary has been found.  These two networks are interdependent, the first providing the second with an indication of whether it has a known word, and the second providing the first with an indication of how far the current letter is from the front of the word.

The reason not to have them as a single neural network is that the solution space is divided.  Back-propagation training, which has been shown to be a good means of training a neural network, cannot handle a divided solution set because the error is common across the whole set.  This would imply that it could be done in a single network with a modified training method that splits the error calculation across different output goals; this is still to be attempted.

CIECAM02

In order to have a solid basis for processing an image with results similar to the way humans do, we need to “see” the image in the same way.  Unfortunately we don’t really have a model for converting an RGB bitmap into an image as it would be seen by the human eye.  The closest model is the CIECAM02 model published by the CIE.  Unfortunately many of the articles discussing it are hidden behind paywalls, and so the field advances very slowly.  Still, Billy Biggs, with a little help from Nathan Moroney of HP, has published an implementation of CIECAM02 in C.  So here is an adaptation for c#.

There are a number of paywalled articles which discuss flaws in the CIECAM02 algorithm, and a number of possible future corrections are being examined by the CIE.  Perhaps the element most relevant to image processing, however, is that we gain greater depth perception with two eyes rather than one, so logically the image as perceived by each eye should be slightly different.   This is not currently represented in the algorithm, but it is something to be considered.   The choice of illumination should not really matter (we are able to understand the same image in a variety of settings).  There should however be sufficient contrast with the white point to be able to resolve the differences in color.  It seems to me that the selection of the white point should also be somehow derivable from the RGB image, as we are able to analyse an image 

    /* this code is adapted from the work of Billy Biggs and Nathan Moroney
     * http://scanline.ca/ciecam02/
     * 
     * The following is an implementation of the forward and inverse
     * functions for XYZ to JCh values from CIECAM02, as well as a function
     * to return all of J, C, h, Q, M and s.  It has been tested against the
     * spreadsheet of example calculations posted by Mark D.  Fairchild on
     * his website (http://www.cis.rit.edu/fairchild/).
     *
     * This code should be used with XYZ values in 0 to 100, not 0 to 1 like
     * most of my other code uses.  For input from sRGB, I recommend that
     * you use these values: D65 whitepoint, 20% gray, La value of 4 cd/m^2
     * to correspond to an ambient illumination of 64 lux:
     *
     *  la = 4.0;
     *  yb = 20.0;
     *  xw = 95.05;
     *  yw = 100.00;
     *  zw = 108.88;
     *
     *  The f, c and nc parameters control the surround.  CIECAM02 uses
     *  these values for average (relative luminance > 20% of scene white),
     *  dim (between 0% and 20%), and dark (0%).  In general, use average
     *  for both input and output.
     *
     *  // Average
     *  f = 1.0; c = 0.690; nc = 1.0;
     *  // Dim
     *  f = 0.9; c = 0.590; nc = 0.95;
     *  // Dark
     *  f = 0.8; c = 0.525; nc = 0.8;
     *
     * J is the lightness.
     * C is the chroma.
     * h is the hue angle in 0-360.
     * Q is the brightness.
     * M is the colourfulness.
     * s is the saturation.
     */
    public struct CIECAM02Color
    {
        public double x, y, z;
        public double Lightness; // J
        public double Chroma; // C
        public double HueAngle; // h
        public double HueComposition; // H
        public double Brightness; // Q
        public double Colorfulness; // M
        public double Saturation; // s
        public double ac, bc;
        public double asl, bs;
        public double am, bm;
    }

    public class CIECAM02
    {
        double xw, yw, zw, aw; // reference white point
        double LuminanceAdaptingField; // La
        double LuminanceFactorBackground; // Yb
        double BackgroundLuminousFactor, z;
        double IncompleteAdaptationFactor; // F
        double ImpactOfSurround; // c
        double ChromaticInductionFactor; // Nc
        double adaptation; // d
        double nbb, ncb, fl; 

        /// <summary>
        /// Sets up the initial viewing conditions
        /// </summary>
        public CIECAM02(double xw, double yw, double zw, double la, double yb, int surround)
        {
            this.xw = xw;
            this.yw = yw;
            this.zw = zw;
            this.LuminanceAdaptingField = la;
            this.LuminanceFactorBackground = yb;

            // Xw, Yw, Zw, La, Yb and surround
            // surrounds are 1 - average, 2 - dim, and 3 - dark.
            if (surround == 1)
            {
                /**
                 * Average
                 */
                IncompleteAdaptationFactor = 1.00;
                ImpactOfSurround = 0.69;
                ChromaticInductionFactor = 1.00;
            }
            else if (surround == 2)
            {
                /**
                 * Dim
                 */
                IncompleteAdaptationFactor = 0.90;
                ImpactOfSurround = 0.59;
                ChromaticInductionFactor = 0.90;
            }
            else if (surround == 3)
            {
                /**
                 * Dark
                 */
                IncompleteAdaptationFactor = 0.800;
                ImpactOfSurround = 0.525;
                ChromaticInductionFactor = 0.800;
            }


            /**
             * Read in and compute the parameters associated with the viewing conditions.
             */

            BackgroundLuminousFactor = yb / yw;
            z = 1.48 + Math.Pow(BackgroundLuminousFactor, 0.5);
            fl = Compute_fl(la);
            nbb = 0.725 * Math.Pow((1.0 / BackgroundLuminousFactor), 0.2);
            ncb = nbb;
            adaptation = IncompleteAdaptationFactor * (1.0 - ((1.0 / 3.6) * Math.Exp((-la - 42.0) / 92.0)));
            aw = achromatic_response_to_white();
        }

        private double achromatic_response_to_white()
        {
            double r, g, b;
            double rc, gc, bc;
            double rp, gp, bp;
            double rpa, gpa, bpa;

            xyz_to_cat02(out r, out g, out b, xw, yw, zw);

            rc = r * (((yw * adaptation) / r) + (1.0 - adaptation));
            gc = g * (((yw * adaptation) / g) + (1.0 - adaptation));
            bc = b * (((yw * adaptation) / b) + (1.0 - adaptation));

            cat02_to_hpe(out rp, out gp, out bp, rc, gc, bc);

            rpa = nonlinear_adaptation(rp, fl);
            gpa = nonlinear_adaptation(gp, fl);
            bpa = nonlinear_adaptation(bp, fl);

            return ((2.0 * rpa) + gpa + ((1.0 / 20.0) * bpa) - 0.305) * nbb;
        }

        static void xyz_to_cat02(out double r, out double g, out double b,
                          double x, double y, double z)
        {
            r = (0.7328 * x) + (0.4296 * y) - (0.1624 * z);
            g = (-0.7036 * x) + (1.6975 * y) + (0.0061 * z);
            b = (0.0030 * x) + (0.0136 * y) + (0.9834 * z);
        }

        void cat02_to_hpe(out double rh, out double gh, out double bh,
                          double r, double g, double b)
        {
            rh = (0.7409792 * r) + (0.2180250 * g) + (0.0410058 * b);
            gh = (0.2853532 * r) + (0.6242014 * g) + (0.0904454 * b);
            bh = (-0.0096280 * r) - (0.0056980 * g) + (1.0153260 * b);
        }

        static double nonlinear_adaptation(double c, double fl)
        {
            double p = Math.Pow((fl * c) / 100.0, 0.42);
            return ((400.0 * p) / (27.13 + p)) + 0.1;
        }

        private static double Compute_fl(double la)
        {
            double k = 1.0 / ((5.0 * la) + 1.0);
            return 0.2 * Math.Pow(k, 4.0) * (5.0 * la) + 0.1 *
                 (Math.Pow((1.0 - Math.Pow(k, 4.0)), 2.0)) *
                 (Math.Pow((5.0 * la), (1.0 / 3.0)));
        }

        public CIECAM02Color CreateColorFromXYZ(double X, double Y, double Z)
        {
            CIECAM02Color color = new CIECAM02Color();
            color.x = X;
            color.y = Y;
            color.z = Z;
            color = forwardCIECAM02(color);
            return color;
        }

        public CIECAM02Color CreateColorFromJCh(double J, double C, double h)
        {
            CIECAM02Color color = new CIECAM02Color();
            color.Lightness = J;
            color.Chroma = C;
            color.HueAngle = h;
            color = inverseCIECAM02(color);
            return color;
        }

        private CIECAM02Color inverseCIECAM02(CIECAM02Color theColor)
        {
            double r, g, b;
            double rw, gw, bw;
            double rc, gc, bc;
            double rp, gp, bp;
            double rpa, gpa, bpa;
            double a, ca, cb;
            double et, t;
            double p1, p2, p3, p4, p5, hr;
            double tx, ty, tz;

            xyz_to_cat02(out rw, out gw, out bw, this.xw, this.yw, this.zw);

            t = Math.Pow(theColor.Chroma / (Math.Sqrt(theColor.Lightness / 100.0) * Math.Pow(1.64 - Math.Pow(0.29, this.BackgroundLuminousFactor), 0.73)), (1.0 / 0.9));
            et = (1.0 / 4.0) * (Math.Cos(((theColor.HueAngle * Math.PI) / 180.0) + 2.0) + 3.8);

            a = Math.Pow(theColor.Lightness / 100.0, 1.0 / (this.ImpactOfSurround * this.z)) * this.aw;

            p1 = ((50000.0 / 13.0) * this.ChromaticInductionFactor * this.ncb) * et / t;
            p2 = (a / this.nbb) + 0.305;
            p3 = 21.0 / 20.0;

            hr = (theColor.HueAngle * Math.PI) / 180.0;

            if (Math.Abs(Math.Sin(hr)) >= Math.Abs(Math.Cos(hr)))
            {
                p4 = p1 / Math.Sin(hr);
                cb = (p2 * (2.0 + p3) * (460.0 / 1403.0)) /
                 (p4 + (2.0 + p3) * (220.0 / 1403.0) *
                 (Math.Cos(hr) / Math.Sin(hr)) - (27.0 / 1403.0) +
                 p3 * (6300.0 / 1403.0));
                ca = cb * (Math.Cos(hr) / Math.Sin(hr));
            }
            else
            {
                p5 = p1 / Math.Cos(hr);
                ca = (p2 * (2.0 + p3) * (460.0 / 1403.0)) /
                     (p5 + (2.0 + p3) * (220.0 / 1403.0) -
                     ((27.0 / 1403.0) - p3 * (6300.0 / 1403.0)) *
                     (Math.Sin(hr) / Math.Cos(hr)));
                cb = ca * (Math.Sin(hr) / Math.Cos(hr));
            }

            Aab_to_rgb(out rpa, out gpa, out bpa, a, ca, cb, this.nbb);

            rp = inverse_nonlinear_adaptation(rpa, this.fl);
            gp = inverse_nonlinear_adaptation(gpa, this.fl);
            bp = inverse_nonlinear_adaptation(bpa, this.fl);

            hpe_to_xyz(out tx, out ty, out tz, rp, gp, bp);
            xyz_to_cat02(out rc, out gc, out bc, tx, ty, tz);

            r = rc / (((this.yw * this.adaptation) / rw) + (1.0 - this.adaptation));
            g = gc / (((this.yw * this.adaptation) / gw) + (1.0 - this.adaptation));
            b = bc / (((this.yw * this.adaptation) / bw) + (1.0 - this.adaptation));

            cat02_to_xyz(out theColor.x, out theColor.y, out theColor.z, r, g, b);

            return (theColor);

        }

        static double inverse_nonlinear_adaptation(double c, double fl)
        {
            return (100.0 / fl) * Math.Pow((27.13 * Math.Abs(c - 0.1)) / (400.0 - Math.Abs(c - 0.1)), 1.0 / 0.42);
        }

        static void hpe_to_xyz(out double x, out double y, out double z,
                        double r, double g, double b)
        {
            x = (1.910197 * r) - (1.112124 * g) + (0.201908 * b);
            y = (0.370950 * r) + (0.629054 * g) - (0.000008 * b);
            z = b;
        }

        static void cat02_to_xyz(out double x, out double y, out double z,
                          double r, double g, double b)
        {
            x = (1.096124 * r) - (0.278869 * g) + (0.182745 * b);
            y = (0.454369 * r) + (0.473533 * g) + (0.072098 * b);
            z = (-0.009628 * r) - (0.005698 * g) + (1.015326 * b);
        }

        static void Aab_to_rgb(out double r, out double g, out double b, double A, double aa,
                        double bb, double nbb)
        {
            double x = (A / nbb) + 0.305;

            /*       c1              c2               c3       */
            r = (0.32787 * x) + (0.32145 * aa) + (0.20527 * bb);
            /*       c1              c4               c5       */
            g = (0.32787 * x) - (0.63507 * aa) - (0.18603 * bb);
            /*       c1              c6               c7       */
            b = (0.32787 * x) - (0.15681 * aa) - (4.49038 * bb);
        }

        CIECAM02Color forwardCIECAM02(CIECAM02Color theColor)
        {
            double r, g, b;
            double rw, gw, bw;
            double rc, gc, bc;
            double rp, gp, bp;
            double rpa, gpa, bpa;
            double a, ca, cb;
            double et, t, temp;

            xyz_to_cat02(out r, out g, out b, theColor.x, theColor.y, theColor.z);
            xyz_to_cat02(out rw, out gw, out bw, xw, yw, zw);

            rc = r * (((this.yw * this.adaptation) / rw) + (1.0 - this.adaptation));
            gc = g * (((this.yw * this.adaptation) / gw) + (1.0 - this.adaptation));
            bc = b * (((this.yw * this.adaptation) / bw) + (1.0 - this.adaptation));

            cat02_to_hpe(out rp, out gp, out bp, rc, gc, bc);

            rpa = nonlinear_adaptation(rp, this.fl);
            gpa = nonlinear_adaptation(gp, this.fl);
            bpa = nonlinear_adaptation(bp, this.fl);

            ca = rpa - ((12.0 * gpa) / 11.0) + (bpa / 11.0);
            cb = (1.0 / 9.0) * (rpa + gpa - (2.0 * bpa));

            theColor.HueAngle = (180.0 / Math.PI) * Math.Atan2(cb, ca);
            if (theColor.HueAngle < 0.0) theColor.HueAngle += 360.0;

            if (theColor.HueAngle < 20.14)
            {
                temp = ((theColor.HueAngle + 122.47) / 1.2) + ((20.14 - theColor.HueAngle) / 0.8);
                theColor.HueComposition = 300 + (100 * ((theColor.HueAngle + 122.47) / 1.2)) / temp;
            }
            else if (theColor.HueAngle < 90.0)
            {
                temp = ((theColor.HueAngle - 20.14) / 0.8) + ((90.00 - theColor.HueAngle) / 0.7);
                theColor.HueComposition = (100 * ((theColor.HueAngle - 20.14) / 0.8)) / temp;
            }
            else if (theColor.HueAngle < 164.25)
            {
                temp = ((theColor.HueAngle - 90.00) / 0.7) + ((164.25 - theColor.HueAngle) / 1.0);
                theColor.HueComposition = 100 + ((100 * ((theColor.HueAngle - 90.00) / 0.7)) / temp);
            }
            else if (theColor.HueAngle < 237.53)
            {
                temp = ((theColor.HueAngle - 164.25) / 1.0) + ((237.53 - theColor.HueAngle) / 1.2);
                theColor.HueComposition = 200 + ((100 * ((theColor.HueAngle - 164.25) / 1.0)) / temp);
            }
            else
            {
                temp = ((theColor.HueAngle - 237.53) / 1.2) + ((360 - theColor.HueAngle + 20.14) / 0.8);
                theColor.HueComposition = 300 + ((100 * ((theColor.HueAngle - 237.53) / 1.2)) / temp);
            }

            a = ((2.0 * rpa) + gpa + ((1.0 / 20.0) * bpa) - 0.305) * this.nbb;

            theColor.Lightness = 100.0 * Math.Pow(a / this.aw, this.ImpactOfSurround * this.z);

            et = (1.0 / 4.0) * (Math.Cos(((theColor.HueAngle * Math.PI) / 180.0) + 2.0) + 3.8);
            t = ((50000.0 / 13.0) * this.ChromaticInductionFactor * this.ncb * et * Math.Sqrt((ca * ca) + (cb * cb))) /
                 (rpa + gpa + (21.0 / 20.0) * bpa);

            theColor.Chroma = Math.Pow(t, 0.9) * Math.Sqrt(theColor.Lightness / 100.0)
                                 * Math.Pow(1.64 - Math.Pow(0.29, this.BackgroundLuminousFactor), 0.73);

            theColor.Brightness = (4.0 / this.ImpactOfSurround) * Math.Sqrt(theColor.Lightness / 100.0) *
                  (this.aw + 4.0) * Math.Pow(this.fl, 0.25);

            theColor.Colorfulness = theColor.Chroma * Math.Pow(this.fl, 0.25);

            theColor.Saturation = 100.0 * Math.Sqrt(theColor.Colorfulness / theColor.Brightness);

            theColor.ac = theColor.Chroma * Math.Cos((theColor.HueAngle * Math.PI) / 180.0);
            theColor.bc = theColor.Chroma * Math.Sin((theColor.HueAngle * Math.PI) / 180.0);

            theColor.am = theColor.Colorfulness * Math.Cos((theColor.HueAngle * Math.PI) / 180.0);
            theColor.bm = theColor.Colorfulness * Math.Sin((theColor.HueAngle * Math.PI) / 180.0);

            theColor.asl = theColor.Saturation * Math.Cos((theColor.HueAngle * Math.PI) / 180.0);
            theColor.bs = theColor.Saturation * Math.Sin((theColor.HueAngle * Math.PI) / 180.0);

            return (theColor);
        }

    }
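
For reference, a small usage sketch (the XYZ values are just an arbitrary example; the viewing conditions follow the recommendation in the header comment above):

        // Example only: D65 white point, Yb = 20, La = 4 cd/m^2, average surround.
        var model = new CIECAM02(95.05, 100.00, 108.88, 4.0, 20.0, 1);

        // Forward transform: XYZ (on the 0-100 scale) to appearance correlates.
        CIECAM02Color c = model.CreateColorFromXYZ(19.01, 20.00, 21.78);
        Console.WriteLine("J={0:F2} C={1:F2} h={2:F2}", c.Lightness, c.Chroma, c.HueAngle);

        // Inverse transform: back from J, C and h to XYZ.
        CIECAM02Color back = model.CreateColorFromJCh(c.Lightness, c.Chroma, c.HueAngle);
        Console.WriteLine("X={0:F2} Y={1:F2} Z={2:F2}", back.x, back.y, back.z);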

Stemming

Stemming is the process of reducing words to just their root portion, or stem.  For example, the stem of “running” would be “run”.  The English language has numerous exceptions which make this process not so straightforward.  It remains questionable as a way of simplifying context because, by its nature, stemming loses information, and that information can be important in understanding the sentence.  When comparing text, however, being able to compare in a tense-indifferent way can be very useful, and as a result selective stemming can improve matching.

Alski has written an English language stemmer in c# (based upon the Porter stemmer).  The code can be found here.
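
Not the Porter algorithm itself, but as an illustration of the idea, a naive suffix-stripping sketch might look like the following (a real stemmer applies ordered rule sets and handles the many exceptions):

        // Naive illustration only: strips a few common English suffixes.
        public static string NaiveStem(string word)
        {
            string w = word.ToLowerInvariant();
            string[] suffixes = { "ing", "ed", "ly", "es", "s" };
            foreach (string suffix in suffixes)
            {
                if (w.Length > suffix.Length + 2 && w.EndsWith(suffix))
                    return w.Substring(0, w.Length - suffix.Length);
            }
            return w;
        }

Even this tiny example shows the difficulty: “running” comes out as “runn” rather than “run”, which is exactly the kind of case the Porter rules are designed to handle.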

Brightness

Starting from an RGB value there are various ways of determining the brightness of that pixel.  The simplest is to compute the average of the R, G and B channels.

        private double GetBrightness(Color clr)
        {
            return (clr.R + clr.G + clr.B) / 3.0; 
        }

Another is to use the HSV hexcone model, which takes the maximum of the R, G and B channels.

        private double GetBrightnessHSV(Color clr)
        {
            return Math.Max(Math.Max(clr.R, clr.G), clr.B);
        } 

Another is to use the HSL method, which averages the maximum and minimum of the channels.

        private double GetBrightnessHSL(Color clr)
        {
            return 0.5 * Math.Max(Math.Max(clr.R, clr.G), clr.B) + 0.5 * Math.Min(Math.Min(clr.R, clr.G), clr.B);
        }

There is also luma, which is based upon a weighted average of gamma-corrected RGB values.  Naturally there are different coefficients depending upon the source: Rec. 601 refers to NTSC sources, Rec. 709 to sRGB, and Rec. 2020 to UHDTV.  Note that in the .net framework each color is represented with 8 bits per channel, whereas Rec. 2020 calls for 10 or 12 bits per sample, expanding the range of possible color representation; this requires a different “color” object to represent the pixel’s value.

        private double GetBrightnessLumaRec2020(UHDColor clr)
        {
            return 0.2627 * clr.R + 0.6780 * clr.G + 0.0593 * clr.B; // rec 2020 Luma coefficients
        }

        private double GetBrightnessLumaRec601(Color clr)
        {
            return 0.30 * clr.R + 0.59 * clr.G + 0.11 * clr.B; // rec 601 Luma coefficients
        }

        private double GetBrightnessLumaRec709(Color clr)
        {
            return 0.21 * clr.R + 0.72 * clr.G + 0.07 * clr.B; // rec 709 Luma coefficients
        } 

All of these approximations have flaws; the most accurate representation we have appears to be the latest standard from the International Commission on Illumination (yes, it really exists!), called CIECAM02.  This appears to be implemented in Windows from Vista onwards but is not yet available in .net.

Working with an Image

Our budding DevOps engineer will need to be able to look at diagrams and understand them.  In order to read a diagram we will need to be able to pick out objects from the background and capture the text associated with those objects.  The most general case is for this to be an image (Visio plugins could help pull out the underlying object structure, but not necessarily the text that goes with each object, nor the associations between objects, which might be joining or non-joining lines, or be based upon underlays or overlaps of other images).

If we think about it, although the image itself is flat, we humans are able to see the difference between the objects in the picture and determine where the boundaries are.  If such a picture is initially held as a byte array, and we can understand the displayed picture as a layered image, potentially with depth and shadow, then all the information necessary to determine this must already be in the byte array we started with.

So first let’s convert our bitmap into a byte array so we can do something fancier with it than we can with it as a bitmap.  This also has the side effect of being much faster to operate on than manipulating the bitmap directly through .net.  Perhaps it will be more useful as a 2-dimensional byte array.

        public byte[,] GetImage(string filename)
        {
            Bitmap bmap = new Bitmap(filename);
            int colorDepth = Bitmap.GetPixelFormatSize(bmap.PixelFormat);
            int sizex = bmap.Width;
            int sizey = bmap.Height;
            int bytesPerPixel = colorDepth / 8;
            int pixelCount = sizex * sizey;
            byte[] pixels = new byte[pixelCount * bytesPerPixel];

            Rectangle rect = new Rectangle(0, 0, sizex, sizey);
            var bitmapData = bmap.LockBits(rect, ImageLockMode.ReadWrite,
                  bmap.PixelFormat);
            IntPtr Iptr = bitmapData.Scan0;

            // Copy data from pointer to array
            // (this assumes the stride equals width * bytesPerPixel, i.e. no row padding)
            Marshal.Copy(Iptr, pixels, 0, pixels.Length);
            bmap.UnlockBits(bitmapData);
            byte[,] pixelgrid = pixels.ToSquare2D(sizex * bytesPerPixel);
            return pixelgrid;
        }	

The ToSquare2D extension method I used was originally posted by ZenLulz here.  I have kept the extension method but replaced its insides with Buffer.BlockCopy, which appears to be faster.

Buffer.BlockCopy(array, 0, buffer, 0, array.Length);
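
For reference, a hedged reconstruction of what such an extension method might look like around that call (assuming the flat array length is an exact multiple of the row width):

    public static class ByteArrayExtensions
    {
        // Sketch only: copies a flat byte array into a [rows, width] 2D array.
        // Buffer.BlockCopy works on multi-dimensional arrays of primitive types,
        // so no per-element loop is needed.
        public static byte[,] ToSquare2D(this byte[] array, int width)
        {
            int rows = array.Length / width;
            byte[,] buffer = new byte[rows, width];
            Buffer.BlockCopy(array, 0, buffer, 0, array.Length);
            return buffer;
        }
    }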

One of the first things we notice if we open up a picture is that nearly every single pixel is different.  Even areas which might look visually identical can have small variations: RGB(255,0,0) looks extremely similar to RGB(254,1,2), if not identical.  If two values look identical to me and I can still read the picture, then the distinction carries more information than I need.  Of course, those subtle differences might be what helps us determine orientation and depth in an otherwise flat image.
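
One simple way to play with that idea (my own illustration, not a claim about how vision works) is to quantise each channel so that visually near-identical values collapse to the same bucket:

        // Illustration only: reduces each channel to 'levels' coarse steps so that
        // values such as RGB(255,0,0) and RGB(254,1,2) map to the same quantised color.
        private static Color Quantise(Color clr, int levels)
        {
            int step = 256 / levels;
            Func<byte, int> q = v => Math.Min(255, (v / step) * step + step / 2);
            return Color.FromArgb(q(clr.R), q(clr.G), q(clr.B));
        }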

Introduction

Artificial intelligence is a topic which has fascinated me for a long time.  In my early days of programming the universe expanded for me when I first wrote some code which wrote code, and I thought at the time “if software can write software, then where is the limit?”.  My naivety back then is now somewhat sobering, but the excitement remained.  The obvious question is how the computer knows what to write.  Well, the same way we humans do: by getting requirements.  Ahh, but computers can’t understand written English.  So began a journey of discovery as I started to explore exactly that sentence.

Computers can’t understand written English.

There are two key items here: understanding, and written English.  What is “understanding” exactly, and how does it come about?  Is it codifiable, or is it emergent from a set of abilities such as logical inference and relational analysis?  These are questions to which we have no answers yet, but there are many theories and many clever people working on different approaches.  The answer may come, but for now the second key item looks far easier to deal with.

Written English brings to mind the expression “written in plain English”, used to indicate that something should be simple to understand.  However, when the starting point is a non-intelligent computer, this is some kind of vastly amusing joke.  Our simplest written language is anything but plain; it is in fact incredibly complex.  Researchers across the world have made great inroads into extracting this complexity and then got stuck, for the simple reason that language outside of context is meaningless.  The text “I bought a new red car” can have its grammatical elements isolated with good accuracy, but it has no meaning if you have no concept of self, do not understand the passage of time well enough for “new” to be significant, and have no idea what “red” is, let alone what the noun “car” refers to.

So perhaps we need to work on concepts first and how to represent those.  Concepts are also a topic subject to much research with great efforts being made to model words and their relationships, establish hierarchies of concepts and the like.  Alternatively perhaps we need to emulate the way humans learn, with our senses and extract the concepts from the sense information.  Vision is a key aspect of this and a number of companies are already working on the first practical aspects of image recognition technology.

There are theories which say that we generate a virtual world in our head and that not only is the sense information used to update that world but also our own thoughts where we run “simulations” in this virtual world.  With robotics we are in a way reverse engineering how humans operate in our environment and gaining valuable insight from that and at the same time we are learning ever more about the brain and how it functions.  Lots of research, lots of different approaches with a myriad of different aims.  Hence this blog.

This blog is an exploration of the various techniques being developed by researchers, the code (in c#), and the journey of an attempt to assemble software which can emulate a DevOps engineer.