Saturday, 16 June 2012

Speech Recognition for the Kinect, the Easy Way...

Speech Recognition for the Kinect:

While developing Kin-educate I found this to be one of the trickiest parts, largely because I couldn't find any complete tutorials out there other than the Quick Start series on Channel 9, which, if you have checked it out, you will know is helpful but not comprehensive.

Quite a lot of people have asked me how I did the speech recognition in the maths game for Kin-educate, so I thought I would do a quick tutorial that cuts out all the unnecessary bits and just focuses on getting speech recognition working quickly and easily. This tutorial assumes you already have a Kinect project set up - if you do, you should be able to copy and paste this code, in order, and you're all set!

*You decide what kind of output you would like from the speech recognition, but for this example I have used just three text boxes for feedback: textBox1 for the hypothesized result (good for debugging), textBox2 for rejected speech, and textBox3 for the reply when speech is recognized.

Add using statements and references:

//Make sure to add a reference to Microsoft.Kinect in the project references
using Microsoft.Kinect;
//Make sure you have the speech SDK installed
//go to Add Reference, Browse, navigate to Program Files, Microsoft SDKs,
//Speech, assemblies and select Microsoft.Speech.dll
using Microsoft.Speech.AudioFormat;
using Microsoft.Speech.Recognition;
using System.IO;
//System and System.Linq provide Func, StringComparison, Where and FirstOrDefault
using System;
using System.Linq;

Then, declare your variables and get the speech recognizer:

        //Create an instance of your kinect sensor
        public KinectSensor CurrentSensor;
        //and the speech recognition engine (SRE)
        private SpeechRecognitionEngine speechRecognizer;

        //Get the speech recognizer (SR): the installed recognizer flagged for Kinect with the en-US culture
        private static RecognizerInfo GetKinectRecognizer()
        {
            Func<RecognizerInfo, bool> matchingFunc = r =>
            {
                string value;
                r.AdditionalInfo.TryGetValue("Kinect", out value);
                return "True".Equals(value, StringComparison.InvariantCultureIgnoreCase) && "en-US".Equals(r.Culture.Name, StringComparison.InvariantCultureIgnoreCase);
            };
            //returns null if no matching recognizer is installed
            return SpeechRecognitionEngine.InstalledRecognizers().Where(matchingFunc).FirstOrDefault();
        }
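
To see what that filter does without the Kinect or the speech SDK installed, here is a minimal sketch that runs the same test over a plain list. FakeRecognizer, RecognizerFilterSketch and the sample names are hypothetical stand-ins invented for this example, not SDK types; only the matching logic mirrors GetKinectRecognizer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for RecognizerInfo so the filter can run without the SDK.
public class FakeRecognizer
{
    public string Name;
    public string CultureName;
    public Dictionary<string, string> AdditionalInfo = new Dictionary<string, string>();
}

public static class RecognizerFilterSketch
{
    // Same test as GetKinectRecognizer: the "Kinect" entry is "True" and the culture is en-US.
    public static bool IsKinectEnUs(FakeRecognizer r)
    {
        string value;
        r.AdditionalInfo.TryGetValue("Kinect", out value); // value stays null if the key is missing
        return "True".Equals(value, StringComparison.InvariantCultureIgnoreCase)
            && "en-US".Equals(r.CultureName, StringComparison.InvariantCultureIgnoreCase);
    }

    // Returns null when nothing matches, just as FirstOrDefault does in the tutorial code.
    public static FakeRecognizer Pick(IEnumerable<FakeRecognizer> installed)
    {
        return installed.Where(IsKinectEnUs).FirstOrDefault();
    }

    public static void Main()
    {
        var installed = new List<FakeRecognizer>
        {
            new FakeRecognizer { Name = "Desktop en-US", CultureName = "en-US" },
            new FakeRecognizer { Name = "Kinect en-US", CultureName = "en-US",
                AdditionalInfo = { { "Kinect", "True" } } }
        };

        var match = Pick(installed);
        Console.WriteLine(match == null ? "No Kinect recognizer found" : match.Name);
    }
}
```

The point to take away is the null case: if no recognizer matches, the result is null, so it is worth checking for that before using the recognizer's Id.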

When the window loads, we need to initialize the Kinect sensor:

        //When the window loads, initialize the Kinect
        public MainWindow()
        {
            InitializeComponent();
            InitializeKinect();
        }

        //Initialize the kinect
        private KinectSensor InitializeKinect()
        {
            //get the first available sensor and set it to the current sensor variable
            CurrentSensor = KinectSensor.KinectSensors
                                  .FirstOrDefault(s => s.Status == KinectStatus.Connected);
            speechRecognizer = CreateSpeechRecognizer();
            //Start the sensor
            CurrentSensor.Start();
            //then run the start method to start streaming audio
            Start();
            return CurrentSensor;
        }

Now we need to configure the audio stream:

        //Start streaming audio
        private void Start()
        {
            //set sensor audio source to variable
            var audioSource = CurrentSensor.AudioSource;
            //Set the beam angle mode - the direction the audio beam is pointing
            //we want it to be set to adaptive
            audioSource.BeamAngleMode = BeamAngleMode.Adaptive;
            //start the audiosource
            var kinectStream = audioSource.Start();
            //configure incoming audio stream: 16 kHz, 16-bit, mono PCM
            speechRecognizer.SetInputToAudioStream(
                kinectStream, new SpeechAudioFormatInfo(EncodingFormat.Pcm, 16000, 16, 1, 32000, 2, null));
            //make sure the recognizer does not stop after completing
            speechRecognizer.RecognizeAsync(RecognizeMode.Multiple);
            //turn off echo cancellation and automatic gain control for better accuracy
            CurrentSensor.AudioSource.EchoCancellationMode = EchoCancellationMode.None;
            CurrentSensor.AudioSource.AutomaticGainControlEnabled = false;
        }
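
If you are wondering where the 32000 and 2 in the SpeechAudioFormatInfo call come from, they follow from the other three numbers (16000 samples per second, 16 bits per sample, 1 channel). Here is a small sketch of that arithmetic; PcmFormatSketch and its method names are made up for this example and are not part of any SDK.

```csharp
using System;

public static class PcmFormatSketch
{
    // Bytes per sample frame: one sample for every channel.
    public static int BlockAlign(int bitsPerSample, int channels)
    {
        return channels * bitsPerSample / 8;
    }

    // Bytes of audio produced for every second of sound.
    public static int AverageBytesPerSecond(int samplesPerSecond, int bitsPerSample, int channels)
    {
        return samplesPerSecond * BlockAlign(bitsPerSample, channels);
    }

    public static void Main()
    {
        Console.WriteLine(BlockAlign(16, 1));                    // 2
        Console.WriteLine(AverageBytesPerSecond(16000, 16, 1));  // 32000
    }
}
```

So if you ever change the sample rate or bit depth, the last two numeric arguments need to change with them.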

Here we set the culture, define the words we want our program to recognize, and set up the grammar builder:

        //here is the fun part: create the speech recognizer
        private SpeechRecognitionEngine CreateSpeechRecognizer()
        {
            //set recognizer info
            RecognizerInfo ri = GetKinectRecognizer();
            //create instance of SRE
            SpeechRecognitionEngine sre;
            sre = new SpeechRecognitionEngine(ri.Id);

            //Now we need to add the words we want our program to recognise
            var grammar = new Choices();
            grammar.Add("hello");
            grammar.Add("goodbye");

            //set culture - language, country/region
            var gb = new GrammarBuilder { Culture = ri.Culture };
            //set up the grammar builder with our words
            gb.Append(grammar);

            //build the grammar and load it into the SRE
            var g = new Grammar(gb);
            sre.LoadGrammar(g);

            //Set events for recognizing, hypothesising and rejecting speech
            sre.SpeechRecognized += SreSpeechRecognized;
            sre.SpeechHypothesized += SreSpeechHypothesized;
            sre.SpeechRecognitionRejected += SreSpeechRecognitionRejected;
            return sre;
        }

Now all we need to do is set up the methods for hypothesizing, recognizing and rejecting speech:

        //if speech is rejected
        private void RejectSpeech(RecognitionResult result)
        {
            textBox2.Text = "Pardon Moi?";
        }

        private void SreSpeechRecognitionRejected(object sender, SpeechRecognitionRejectedEventArgs e)
        {
            RejectSpeech(e.Result);
        }

I use the hypothesized result for debugging and for tuning the confidence level to manage accuracy:

        //hypothesized result
        private void SreSpeechHypothesized(object sender, SpeechHypothesizedEventArgs e)
        {
            textBox1.Text = "Hypothesized: " + e.Result.Text + " " + e.Result.Confidence;
        }

This is where we decide what happens when speech is recognized. The confidence level is set quite low here. Experiment with it to see what suits you best:

        //Speech is recognised
        private void SreSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            //Very important! - change this value to adjust accuracy - the higher the value
            //the more accurate it will have to be, lower it if it is not recognizing you
            if (e.Result.Confidence < .4)
            {
                RejectSpeech(e.Result);
                //stop here so a low-confidence result does not trigger a reply
                return;
            }
            //and finally, here we set what we want to happen when
            //the SRE recognizes a word
            switch (e.Result.Text.ToUpperInvariant())
            {
                case "HELLO":
                    textBox3.Text = "Hi there.";
                    break;
                case "GOODBYE":
                    textBox3.Text = "Goodbye then.";
                    break;
            }
        }
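
Once you have more than a handful of words, the switch can get long. One alternative, shown here only as a sketch, is to keep the phrases and replies in a dictionary and do the confidence check in one place. CommandSketch, MinConfidence and ReplyFor are names invented for this example, and the 0.4 threshold is simply the starting value from the code above, not a recommended setting.

```csharp
using System;
using System.Collections.Generic;

public static class CommandSketch
{
    // Assumed threshold, taken from the tutorial's starting value; tune it for your room and voice.
    public const float MinConfidence = 0.4f;

    // Phrase -> reply, compared case-insensitively so no ToUpperInvariant call is needed.
    private static readonly Dictionary<string, string> Replies =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "hello", "Hi there." },
            { "goodbye", "Goodbye then." }
        };

    // Returns the reply, or null when the result should be treated as rejected.
    public static string ReplyFor(string text, float confidence)
    {
        if (confidence < MinConfidence) return null;
        string reply;
        return Replies.TryGetValue(text, out reply) ? reply : null;
    }

    public static void Main()
    {
        Console.WriteLine(ReplyFor("Hello", 0.85f) ?? "Pardon Moi?");  // Hi there.
        Console.WriteLine(ReplyFor("hello", 0.20f) ?? "Pardon Moi?");  // Pardon Moi?
    }
}
```

In the recognized handler you would pass e.Result.Text and e.Result.Confidence to ReplyFor, call RejectSpeech when it returns null, and otherwise put the reply in textBox3.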

And that is that. You should now have speech recognition working within your Kinect program. Check back for the next blog where I will be expanding upon this by making a speech-based application for controlling your media player!

Contact info:

Or, leave a comment on my YouTube channel


  1. This comment has been removed by the author.

  2. Hi! Thanks for the article. Can you upload the solution? I'm new to C# and it would be easier if I could download the source files.

  3. Hi,
    I'm a beginner with Kinect.

    I want to ask, can the culture in Kinect detect another language, such as Indonesian?
    Or does it only detect English words?

    What if I want to detect words in my own language?

    Thanks for your help.

  4. Awesome tutorial, thanks for the help!

  5. This is what I have been looking for. However, I cannot get past step 1. It says...

    //Make sure to add a reference to Kinect in the references
    using Microsoft.Kinect;
    //Make sure you have the speech SDK installed
    //go to add reference, browse, navigate to program files, micrsoft SDKs
    //speech, assemblies and select speech.dll

    Okay, add references where? What needs to be open to do this? It isn't in the Kinect Studio, not in the windows developer toolkit. I have looked everywhere for the starting point and have no idea of where to go to add this reference. Can you help me? I have been looking for weeks on how to set this all up and this is the closest thing I have found to a solution.