There are three main strategies for converting user speech input to text:
- Voice Commands
- Free Dictation
- Grammar
These strategies exist in every speech recognition engine (Google, Microsoft, Amazon, Apple, Nuance, Intel, and others), so the concepts described here will give you a good reference point for working with any of them. In today’s article, we’ll explore the differences between the methods, understand their use cases, and see a quick implementation of the main ones.
Prerequisites
To write and execute code, you need to install the following software:
- Visual Studio 2019 Community
- Unity3D
- Windows 10
Unity3D uses a Microsoft API that works on any Windows 10 device (Desktop, UWP, HoloLens, Xbox). Similar APIs also exist for Android and iOS.
Did you know?…
LightBuzz has been helping Fortune-500 companies and innovative startups create amazing Unity3D applications and games. If you are looking to hire developers for your project, get in touch with us.
Source code
The source code of the project is available in our LightBuzz GitHub account. Feel free to download, fork, and even extend it!
1) Voice commands
We are first going to examine the simplest form of speech recognition: plain voice commands.
Description
Voice commands are predictable single words or expressions, such as:
- “Forward”
- “Left”
- “Fire”
- “Answer call”
The detection engine listens to the user and compares the result with the various possible interpretations. If one of them matches the spoken phrase within a certain confidence threshold, it’s marked as the proposed answer.
Since this is an all-or-nothing approach, the engine will either recognize one of the defined phrases or nothing at all.
This method falls short when there are several ways to say the same thing. For example, "hello", "hi", and "hey there" are all forms of greeting. Using this approach, you have to define all of them explicitly.
Use-case
This method is useful for short, expected phrases, such as in-game controls.
Example
Our original article includes detailed examples of using simple voice commands. You may also check out the Voice Commands scene in the sample project.
Below, you can see the simplest C# code example for recognizing a few words:
using UnityEngine;
using UnityEngine.Windows.Speech;

public class VoiceCommandsEngine : MonoBehaviour
{
    // The exact words the engine should listen for.
    public string[] keywords = new string[] { "up", "down", "left", "right" };
    public ConfidenceLevel confidence = ConfidenceLevel.Medium;

    private KeywordRecognizer recognizer;

    private void Start()
    {
        // Create a recognizer that only listens for the predefined keywords.
        recognizer = new KeywordRecognizer(keywords, confidence);
        recognizer.OnPhraseRecognized += Recognizer_OnPhraseRecognized;
        recognizer.Start();
    }

    private void Recognizer_OnPhraseRecognized(PhraseRecognizedEventArgs args)
    {
        // args.text holds the keyword that was recognized.
        Debug.Log(args.text);
    }

    private void OnApplicationQuit()
    {
        if (recognizer != null && recognizer.IsRunning)
        {
            recognizer.OnPhraseRecognized -= Recognizer_OnPhraseRecognized;
            recognizer.Stop();
        }
    }
}
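Attach this script to any GameObject in your scene and run it: whenever you say one of the four keywords, it is printed to the Unity Console. Since keywords and confidence are public fields, you can tweak both directly from the Inspector.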
2) Free Dictation
To overcome the limitations of simple voice commands, we can use Dictation mode.
Description
In this mode, the engine listens for any possible word as the user speaks. While listening, it tries to find the best possible match for what the user meant to say.
This is the mode your mobile device activates when you dictate a new email by voice. The engine manages to transcribe each word less than a second after you finish saying it.
Technically, this is really impressive, especially considering that it compares your voice against multilingual dictionaries while also checking grammar rules.
Use-case
Use this mode for free-form text. If your application has no idea what to expect, the Dictation mode is your best bet.
Example
You can see an example of the Dictation mode in the Dictation Mode scene of the sample project. Here is the simplest way to use the Dictation mode:
using UnityEngine;
using UnityEngine.Windows.Speech;

public class DictationEngine : MonoBehaviour
{
    protected DictationRecognizer dictationRecognizer;

    void Start()
    {
        StartDictationEngine();
    }

    private void DictationRecognizer_OnDictationHypothesis(string text)
    {
        Debug.Log("Dictation hypothesis: " + text);
    }

    private void DictationRecognizer_OnDictationComplete(DictationCompletionCause completionCause)
    {
        switch (completionCause)
        {
            case DictationCompletionCause.TimeoutExceeded:
            case DictationCompletionCause.PauseLimitExceeded:
            case DictationCompletionCause.Canceled:
            case DictationCompletionCause.Complete:
                // Restart required
                CloseDictationEngine();
                StartDictationEngine();
                break;
            case DictationCompletionCause.UnknownError:
            case DictationCompletionCause.AudioQualityFailure:
            case DictationCompletionCause.MicrophoneUnavailable:
            case DictationCompletionCause.NetworkFailure:
                // Error
                CloseDictationEngine();
                break;
        }
    }

    private void DictationRecognizer_OnDictationResult(string text, ConfidenceLevel confidence)
    {
        Debug.Log("Dictation result: " + text);
    }

    private void DictationRecognizer_OnDictationError(string error, int hresult)
    {
        Debug.Log("Dictation error: " + error);
    }

    private void OnApplicationQuit()
    {
        CloseDictationEngine();
    }

    private void StartDictationEngine()
    {
        dictationRecognizer = new DictationRecognizer();
        dictationRecognizer.DictationHypothesis += DictationRecognizer_OnDictationHypothesis;
        dictationRecognizer.DictationResult += DictationRecognizer_OnDictationResult;
        dictationRecognizer.DictationComplete += DictationRecognizer_OnDictationComplete;
        dictationRecognizer.DictationError += DictationRecognizer_OnDictationError;
        dictationRecognizer.Start();
    }

    private void CloseDictationEngine()
    {
        if (dictationRecognizer != null)
        {
            dictationRecognizer.DictationHypothesis -= DictationRecognizer_OnDictationHypothesis;
            dictationRecognizer.DictationComplete -= DictationRecognizer_OnDictationComplete;
            dictationRecognizer.DictationResult -= DictationRecognizer_OnDictationResult;
            dictationRecognizer.DictationError -= DictationRecognizer_OnDictationError;

            if (dictationRecognizer.Status == SpeechSystemStatus.Running)
            {
                dictationRecognizer.Stop();
            }

            dictationRecognizer.Dispose();
        }
    }
}
As you can see, we first create a new dictation engine and register for the possible events:
- DictationHypothesis events are raised really fast as the user speaks. However, hypothesized phrases may contain lots of errors.
- DictationResult is raised after the user stops speaking for 1–2 seconds. Only then does the engine provide the single sentence with the highest probability.
- DictationComplete is raised on several occasions when the engine shuts down. Some occasions are irreversible technical issues, while others just require a restart of the engine to get back to work.
- DictationError is raised for other unpredictable errors.
Here are two general rules of thumb:
- For the highest quality, use DictationResult.
- For the fastest response, use DictationHypothesis.
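A common UI pattern is to combine the two: display the hypothesis immediately for responsiveness, then overwrite it with the final result when it arrives. Below is a minimal sketch of that idea; the uiText field is a hypothetical UnityEngine.UI.Text reference that you would assign in the Inspector:

using UnityEngine;
using UnityEngine.UI;
using UnityEngine.Windows.Speech;

public class LiveDictationDisplay : MonoBehaviour
{
    public Text uiText; // hypothetical UI Text assigned in the Inspector

    private DictationRecognizer recognizer;

    private void Start()
    {
        recognizer = new DictationRecognizer();
        // Show the fast but possibly inaccurate hypothesis right away.
        recognizer.DictationHypothesis += (text) => uiText.text = text;
        // Replace it with the final, highest-probability sentence.
        recognizer.DictationResult += (text, confidence) => uiText.text = text;
        recognizer.Start();
    }

    private void OnApplicationQuit()
    {
        if (recognizer != null)
        {
            if (recognizer.Status == SpeechSystemStatus.Running)
            {
                recognizer.Stop();
            }
            recognizer.Dispose();
        }
    }
}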
Still, having both quality and speed in a single result is impossible with this technique.
3) Grammar
Is it even possible to combine high-quality recognition with high speed?
Well, there is a reason we are not yet using voice commands the way Iron Man does: in real-world applications, users frequently complain about typing errors, which probably occur in less than 10% of cases… Dictation makes many more mistakes than that.
To increase accuracy and keep the speed fast at the same time, we need the best of both worlds — the freedom of the Dictation and the response time of the Voice Commands.
The solution is Grammar mode. This mode requires us to write a dictionary: an XML file that defines various rules for the things the user will potentially say. This way, we can ignore languages we don’t need and phrases the user will probably not use.
The grammar file also tells the engine which words it can expect to hear next, shrinking the candidates from ANYTHING to just X options. This significantly increases both performance and quality.
For example, using a Grammar, we could greet with either of these phrases:
- “Hello, how are you?”
- “Hi there”
- “Hey, what’s up?”
- “How’s it going?”
- Etc.
All of those could be listed in a rule that says:
"Hello, how are you?" OR "Hi there" OR "Hey, what's up?" OR "How's it going?"
If the user starts saying something that sounds like "Hello", the engine only needs to differentiate it from, say, "Ciao", instead of also having to rule out words like "Yellow" or "Halo".
We are going to see how to create our own Grammar file in a future article.
For your reference, this is the official specification for structuring a Grammar file.
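Until then, here is a minimal, untested sketch of how such a file could be loaded in Unity with the built-in GrammarRecognizer class. The file name Greetings.xml is hypothetical, and we assume it has been copied into the StreamingAssets folder:

using UnityEngine;
using UnityEngine.Windows.Speech;

public class GrammarEngine : MonoBehaviour
{
    private GrammarRecognizer recognizer;

    private void Start()
    {
        // Greetings.xml is a hypothetical SRGS file placed in StreamingAssets.
        string path = System.IO.Path.Combine(Application.streamingAssetsPath, "Greetings.xml");

        recognizer = new GrammarRecognizer(path, ConfidenceLevel.Medium);
        recognizer.OnPhraseRecognized += Recognizer_OnPhraseRecognized;
        recognizer.Start();
    }

    private void Recognizer_OnPhraseRecognized(PhraseRecognizedEventArgs args)
    {
        // args.text contains the phrase that matched one of the grammar rules.
        Debug.Log(args.text);
    }

    private void OnApplicationQuit()
    {
        if (recognizer != null && recognizer.IsRunning)
        {
            recognizer.OnPhraseRecognized -= Recognizer_OnPhraseRecognized;
            recognizer.Stop();
        }
    }
}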
Summary
In this tutorial, we described two methods of recognizing voice in Unity3D: Voice Commands and Dictation. Voice Commands are the easiest way to recognize pre-defined words. Dictation is a way to recognize free-form phrases. In a future article, we are going to see how to develop our own Grammar and feed it to Unity3D.
Until then, why don’t you start writing your code by speaking to your PC?
Source code
You made it to this point? Awesome! Here is the source code for your convenience.
Before you go…
LightBuzz has been helping Fortune-500 companies and innovative startups create amazing Unity3D applications and games. If you are looking to hire developers for your project, get in touch with us.
Sharing is caring!
If you liked this article, remember to share it on social media, so you can help other developers, too! Also, let me know your thoughts in the comments below. ‘Til the next time… keep coding!
Hello, I have a question: while in Unity everything works perfectly, when I build the project for PC and open the application, it doesn’t work. Please help.
hi Omar,
well, i have built it with Unity 2019.1 as well as with 2019.3
and it works perfectly.
i apologize if it doesn’t. please try to make a build from the github source code, and feel free to send us some error messages that occur.
Hello, I’m trying Dictation Recognizer and I want to change the language to Spanish but I still don’t quite get it. Can you help me with this?
hi Alexis,
perhaps check if the code here could help you:
https://docs.microsoft.com/en-us/windows/apps/design/input/specify-the-speech-recognizer-language
You need an object – protected PhraseRecognizer recognizer;
in the example nr 1. Take care and thanks for this article!
Carl
Thank you Carl.
Happy you liked it.
does this support android builds
Hi there.
Sadly not. Android and iOS have different speech APIs; this API supports Microsoft devices.
Any working example for the grammar case?
Hi fabio,
Well, you can find this example from Microsoft. It should work on PC anyway. A combination of Grammar and machine learning is how most of these mechanisms work today.
https://learn.microsoft.com/en-us/dotnet/api/system.speech.recognition.grammar?view=netframework-4.8.1#examples