​There are three main strategies in converting user speech input to text:

  • Voice Commands
  • Free Dictation
  • Grammar.

These strategies exist in any voice detection engine (Google, Microsoft, Amazon, Apple, Nuance, Intel, or others), therefore the concepts described here will give you a good reference point to understand how to work with any of them. In today’s article, we’ll explore the differences of each method, understand their use-cases, and see a quick implementation of the main ones.

Prerequisites

To write and execute code, you need to install the following software:

Unity3D is using a Microsoft API that works on any Windows 10 device (Desktop, UWP, HoloLens, XBOX). Similar APIs also exist for Android and iOS.

Did you know?…

LightBuzz has been helping Fortune-500 companies and innovative startups create amazing Unity3D applications and games. If you are looking to hire developers for your project, get in touch with us.

Source code

The source code of the project is available in our LightBuzz GitHub account. Feel free to download, fork, and even extend it!

Download on GitHub

1) Voice commands

We are first going to examine the simplest form of speech recognition: plain voice commands.

Description

Voice commands are predictable single words or expressions, such as:

  • “Forward”
  • “Left”
  • “Fire”
  • “Answer call”

The detection engine is listening to the user and compares the result with various possible interpretations. If one of them is near the spoken phrase within a certain confidence threshold, it’s marked as a proposed answer.

Since that is a “one or anything” approach, the engine will either recognize the phrase or nothing at all.

This method fails when you have several ways to say one thing. For example, the words “hello”, “hi”, “hey there” are all forms of greeting. Using this approach, you have to define all of them explicitly.

Use-case

This method is useful for short, expected phrases, such as in-game controls.

Example

Our original article includes detailed examples of using simple voice commands. You may also check out the Voice Commands Scene on the sample project.

Below, you can see the simplest C# code example for recognizing a few words:

using UnityEngine;
using UnityEngine.Windows.Speech;

public class VoiceCommandsEngine : MonoBehaviour
{
    public string[] keywords = new string[] {"up", "down", "left", "right"};
    public ConfidenceLevel confidence = ConfidenceLevel.Medium;

    protected string word = "right";

    private void Start()
    {
        recognizer = new KeywordRecognizer(keywords, confidence);
        recognizer.OnPhraseRecognized += Recognizer_OnPhraseRecognized;
        recognizer.Start();
    }

    private void Recognizer_OnPhraseRecognized(PhraseRecognizedEventArgs args)
    {
        Debug.Log(args.text);
    }

    private void OnApplicationQuit()
    {
        if (recognizer != null && recognizer.IsRunning)
        {
            recognizer.OnPhraseRecognized -= Recognizer_OnPhraseRecognized;
            recognizer.Stop();
        }
    }
}

2) Free Dictation

To solve the challenges of simple voice commands, we shall use the dictation mode.

Description

While the user speaks in this mode, the engine listens for every possible word. While listening, it tries to find the best possible match of what the user meant to say.

This is the mode activated by your mobile device when you speak to it when writing a new email using voice. The engine manages to write the text in less than a second after you finish to say a word.

Technically, this is really impressive, especially considering that it compares your voice across multi-lingual dictionaries, while also checking grammar rules.

Use-case

Use this mode for free-form text. If your application has no idea what to expect, the Dictation mode is your best bet.

Example

You can see an example of the Dictation mode in the sample project Dictation Mode Scene. Here is the simplest way to use the Dictation mode:

using UnityEngine;
using UnityEngine.Windows.Speech;

public class DictationEngine : MonoBehaviour
{
    protected DictationRecognizer dictationRecognizer;

    void Start()
    {
        StartDictationEngine();
    }

    private void DictationRecognizer_OnDictationHypothesis(string text)
    {
        Debug.Log("Dictation hypothesis: " + text);
    }

    private void DictationRecognizer_OnDictationComplete(DictationCompletionCause completionCause)
    {
        switch (completionCause)
        {
            case DictationCompletionCause.TimeoutExceeded:
            case DictationCompletionCause.PauseLimitExceeded:
            case DictationCompletionCause.Canceled:
            case DictationCompletionCause.Complete:
                // Restart required
                CloseDictationEngine();
                StartDictationEngine();
                break;
            case DictationCompletionCause.UnknownError:
            case DictationCompletionCause.AudioQualityFailure:
            case DictationCompletionCause.MicrophoneUnavailable:
            case DictationCompletionCause.NetworkFailure:
                // Error
                CloseDictationEngine();
                break;
        }
    }

    private void DictationRecognizer_OnDictationResult(string text, ConfidenceLevel confidence)
    {
        Debug.Log("Dictation result: " + text);
    }

    private void DictationRecognizer_OnDictationError(string error, int hresult)
    {
        Debug.Log("Dictation error: " + error);
    }


    private void OnApplicationQuit()
    {
        CloseDictationEngine();
    }

    private void StartDictationEngine()
    {
        dictationRecognizer = new DictationRecognizer();

        dictationRecognizer.DictationHypothesis += DictationRecognizer_OnDictationHypothesis;
        dictationRecognizer.DictationResult += DictationRecognizer_OnDictationResult;
        dictationRecognizer.DictationComplete += DictationRecognizer_OnDictationComplete;
        dictationRecognizer.DictationError += DictationRecognizer_OnDictationError;

        dictationRecognizer.Start();
    }

    private void CloseDictationEngine()
    {
        if (dictationRecognizer != null)
        {
            dictationRecognizer.DictationHypothesis -= DictationRecognizer_OnDictationHypothesis;
            dictationRecognizer.DictationComplete -= DictationRecognizer_OnDictationComplete;
            dictationRecognizer.DictationResult -= DictationRecognizer_OnDictationResult;
            dictationRecognizer.DictationError -= DictationRecognizer_OnDictationError;

            if (dictationRecognizer.Status == SpeechSystemStatus.Running)
            {
                dictationRecognizer.Stop();
            }

            dictationRecognizer.Dispose();
        }
    }
}

As you can see, we first create a new dictation engine and register for the possible events.

  • It starts with DictationHypothesis events, which are thrown really fast as the user speaks. However, hypothesized phrases may contain lots of errors.
  • DictationResult is an event thrown after the user stops speaking for 1–2 seconds. It’s only then that the engine provides a single sentence with the highest probability.
  • DictationComplete is thrown on several occasions when the engine shuts down. Some occasions are irreversible technical issues, while others just require a restart of the engine to get back to work.
  • DictationError is thrown for other unpredictable errors.

Here are two general rules-of-thumb:

  • For the highest quality, use DictationResult.
  • For the fastest response, use DictationHypothesis.

Having both quality and speed is impossible with this technic.

3) Grammar

Is it even possible to combine high-quality recognition with high speed?

Well, there is a reason we are not yet using voice commands as Iron Man does: In real-world applications, users are frequently complaining about typing errors, which probably happens only less than 10% of the cases… Dictation has many more mistakes than that.

To increase accuracy and keep the speed fast at the same time, we need the best of both worlds — the freedom of the Dictation and the response time of the Voice Commands.

The solution is Grammar Mode. This mode requires us to write a dictionary. A dictionary is an XML file that defines various rules for the things that the user will potentially say. This way, we can ignore languages we don’t need, and phrases the user will probably not use.

The grammar file also explains to the engine what are the possible words it can expect to receive next, therefore shrinking the amount from ANYTHING to X. This significantly increases performance and quality.

For example, using a Grammar, we could greet with either of these phrases:

  • “Hello, how are you?”
  • “Hi there”
  • “Hey, what’s up?”
  • “How’s it going?”
  • Etc.

All of those could be listed in a rule that says:

"Hello, how are you?" OR 
"Hi there" OR
"Hey, what's up?" OR
"How's it going?"

If the user started saying something that sounds like” Hello”, it would be easily differentiated from e.g “Ciao”, compared to being differentiated also from e.g. “Yellow” or “Halo”.

We are going to see how to create our own Grammar file in a future article.

For your reference, this is the official specification for structuring a Grammar file.

Summary

In this tutorial, we described two methods of recognizing voice in Unity3D: Voice Commands and Dictation. Voice Commands are the easiest way to recognize pre-defined words. Dictation is a way to recognize free-form phrases. In a future article, we are going to see how to develop our own Grammar and feed it to Unity3D.

Until then, why don’t you start writing your code by speaking to your PC?

Source code

You made it to this point? Awesome! Here is the source code for your convenience.

Download on GitHub

Before you go…

LightBuzz has been helping Fortune-500 companies and innovative startups create amazing Unity3D applications and games. If you are looking to hire developers for your project, get in touch with us.

Sharing is caring!

If you liked this article, remember to share it on social media, so you can help other developers, too! Also, let me know your thoughts in the comments below. ‘Til the next time… keep coding!

Shachar Oz

Shachar Oz

Shachar Oz is a product manager and UX specialist with extensive experience with emergent technologies, like AR, VR and computer vision. He designed Human Machine Interfaces for the last 10 years for video games, apps, robots and cars, using interfaces like face tracking, hand gestures and voice recognition. Website

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.