docs/06_Component_Documentation/10_agl_voice_agent_assistant.md

   1 ---
   2 title: AGL Voice Agent / Assistant
   3 ---
   4
   5 # AGL Voice Agent / Assistant
   6
   7 # Introduction
    8 The AGL voice agent is a gRPC-based service designed for Automotive Grade Linux (AGL). It uses GStreamer, Vosk, Snips, and RASA to process user voice commands. It converts spoken words into text, extracts intents from these commands, and performs actions through the Kuksa interface. The voice agent is designed to be modular and extensible, allowing for the addition of new speech recognition and intent extraction models.
   9
  10 # Installation and Usage
   11 Before we dive into the detailed component documentation, let's first look at how to install and use the voice agent service. All of the features of the voice agent service are encapsulated in the `meta-offline-voice-agent` sub-layer, which can be found under the `meta-agl-devel` layer. These features are currently part of the `master` branch only. This sub-layer can be built into the final image using the following commands:
  12
  13 ```shell
  14 $ source master/meta-agl/scripts/aglsetup.sh -m qemux86-64 -b build-master agl-demo agl-devel agl-offline-voice-agent
  15 $ source agl-init-build-env
  16 $ bitbake agl-ivi-demo-platform-flutter
  17 ```
  18
  19 After the build is complete, you can run the final image using QEMU. Once the image is running, you can start the voice agent service by running the following command:
  20 ```shell
  21 $ voiceagent-service run-server --default
  22 ```
  23
   24 The `--default` flag loads the voice agent service with the default configuration. The default configuration file looks like this:
  25 ```ini
  26 [General]
  27 base_audio_dir = /usr/share/nlu/commands/
  28 stt_model_path = /usr/share/vosk/vosk-model-small-en-us-0.15/
  29 wake_word_model_path = /usr/share/vosk/vosk-model-small-en-us-0.15/
  30 snips_model_path = /usr/share/nlu/snips/model/
  31 channels = 1
  32 sample_rate = 16000
  33 bits_per_sample = 16
  34 wake_word = hello auto
  35 server_port = 51053
  36 server_address = 127.0.0.1
  37 rasa_model_path = /usr/share/nlu/rasa/models/
  38 rasa_server_port = 51054
  39 rasa_detached_mode = 0
  40 base_log_dir = /usr/share/nlu/logs/
  41 store_voice_commands = 0
  42
  43 [Kuksa]
  44 ip = 127.0.0.1
  45 port = 8090
  46 protocol = ws
  47 insecure = True
  48 token = /usr/lib/python3.10/site-packages/kuksa_certificates/jwt/super-admin.json.token
  49
  50 [Mapper]
  51 intents_vss_map = /usr/share/nlu/mappings/intents_vss_map.json
  52 vss_signals_spec = /usr/share/nlu/mappings/vss_signals_spec.json
  53 ```
  54
   55 Most of the above configuration variables are self-explanatory; the ones that need more explanation are described below.
  56
  57 - **`store_voice_commands`**: This variable is used to enable/disable the storage of voice commands. If this variable is set to `1`, then the voice commands will be stored in the `base_audio_dir` directory. The voice commands are stored in the following format: `base_audio_dir/<timestamp>.wav`. The `timestamp` is the time at which the voice command was received by the voice agent service.
  58
   59 - **`rasa_detached_mode`**: This variable is used to enable/disable the detached mode for the RASA NLU engine. If this variable is set to `1`, then the RASA NLU engine will be run in detached mode, i.e. the voice agent service won't start it and will assume that RASA is already running. This is useful when you want to run the RASA NLU engine on a separate machine. If this variable is set to `0`, then the RASA NLU engine will be run as a subprocess of the voice agent service.
  60
  61 - **`intents_vss_map`**: This is the path to the file that actually maps the intent output from our intent engine to the VSS signal specification. This file is in JSON format and contains the mapping for all the intents that we want to support. The default file looks like this:
  62
  63     ```json
  64     {
  65     "intents": {
  66         "VolumeControl": {
  67             "signals": [
  68                 "Vehicle.Cabin.Infotainment.Media.Volume"
  69             ]
  70         },
  71         "HVACFanSpeed": {
  72             "signals": [
  73                 "Vehicle.Cabin.HVAC.Station.Row1.Left.FanSpeed",
  74                 "Vehicle.Cabin.HVAC.Station.Row1.Right.FanSpeed",
  75                 "Vehicle.Cabin.HVAC.Station.Row2.Left.FanSpeed",
  76                 "Vehicle.Cabin.HVAC.Station.Row2.Right.FanSpeed"
  77             ]
  78         },
  79         "HVACTemperature": {
  80             "signals": [
  81                 "Vehicle.Cabin.HVAC.Station.Row1.Left.Temperature",
  82                 "Vehicle.Cabin.HVAC.Station.Row1.Right.Temperature",
  83                 "Vehicle.Cabin.HVAC.Station.Row2.Left.Temperature",
  84                 "Vehicle.Cabin.HVAC.Station.Row2.Right.Temperature"
  85             ]
  86         }
  87     }
  88     }
  89     ```
  90     Notice that `VolumeControl`, `HVACFanSpeed`, and `HVACTemperature` are the intents that we want to support. The `signals` array contains the VSS signals that we want to change when the user issues a command for that intent. For example, if the user says "Set the volume to 50", then the voice agent service will extract the `VolumeControl` intent from the user's command and then change the `Vehicle.Cabin.Infotainment.Media.Volume` signal to 50.
  91
   92 - **`vss_signals_spec`**: This is the path to the file that defines the specs and values that can be mapped onto a VSS signal. This file is in JSON format and contains specs for all the VSS signals that we want to support. A sample VSS spec definition looks like this:
  93
  94     ```json
  95     {
  96     "signals": {
  97         "Vehicle.Cabin.Infotainment.Media.Volume": {
  98             "default_value": 15,
  99             "default_change_factor": 5,
 100             "actions": {
 101                 "set": {
 102                     "intents": ["volume_control_action"],
 103                     "synonyms": ["set", "change", "adjust"]
 104                 },
 105                 "increase": {
 106                     "intents":["volume_control_action"],
 107                     "synonyms": ["increase", "up", "raise", "louder"],
 108                     "modifier_intents": ["to_or_by"]
 109                 },
 110                 "decrease": {
 111                     "intents": ["volume_control_action"],
 112                     "synonyms": ["decrease", "lower", "down", "quieter", "reduce"],
 113                     "modifier_intents": ["to_or_by"]
 114                 }
 115             },
 116             "values": {
 117                 "ranged": true,
 118                 "start": 1,
 119                 "end": 100,
 120                 "ignore": [],
 121                 "additional": []
 122             },
 123             "default_fallback": true,
 124             "value_set_intents": {
 125                 "numeric_value": {
 126                     "datatype": "number",
 127                     "unit": "percent"
 128                 }
 129             }
 130         }
 131     }
 132     }
 133     ```
  134     Notice that `Vehicle.Cabin.Infotainment.Media.Volume` is the VSS signal whose specification we want to define. The `default_value` is the value to use if the user doesn't specify one in their command. The `default_change_factor` is the amount by which to increment or decrement the current value if the user doesn't specify a change factor. The `actions` object defines the actions that can be performed on the signal; currently, only "increase", "decrease", and "set" are supported. The `values` object defines the range of values that can be mapped onto the signal. The `value_set_intents` object defines the intent (or the `slot`, to be more precise) that contains the specific value of the signal given by the user in their command. Here `numeric_value` is the slot that contains the value of the signal, as defined during the training of the intent engine. The `datatype` is the type of the value, e.g. `number`, `string`, or `boolean`. The `unit` is the unit of the value, e.g. `percent`, `degree`, or `celsius`.
 135
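The way the two mapping files work together can be illustrated with a minimal Python sketch. It is not the service's actual implementation: `resolve_value` and `plan_updates` are hypothetical helper names, the dictionaries are trimmed copies of the JSON shown above, and clamping out-of-range values to `start`/`end` is an assumption (the real service may reject them instead).

```python
# Trimmed copies of the two mapping files described above.
SIG = "Vehicle.Cabin.Infotainment.Media.Volume"
INTENTS_VSS_MAP = {"intents": {"VolumeControl": {"signals": [SIG]}}}
VSS_SIGNALS_SPEC = {"signals": {SIG: {
    "default_value": 15,
    "default_change_factor": 5,
    "values": {"ranged": True, "start": 1, "end": 100},
}}}


def resolve_value(spec, action, current, requested=None):
    """Hypothetical helper: compute the new value for one signal."""
    if action == "set":
        # No value in the command: fall back to default_value.
        target = spec["default_value"] if requested is None else requested
    elif action in ("increase", "decrease"):
        # No change factor in the command: fall back to default_change_factor.
        step = spec["default_change_factor"] if requested is None else requested
        target = current + step if action == "increase" else current - step
    else:
        raise ValueError(f"unsupported action: {action}")
    values = spec["values"]
    if values["ranged"]:
        # Assumption: out-of-range results are clamped to [start, end].
        target = max(values["start"], min(values["end"], target))
    return target


def plan_updates(intent, action, current_values, requested=None):
    """Hypothetical helper: map an intent to {signal: new_value}."""
    updates = {}
    for signal in INTENTS_VSS_MAP["intents"][intent]["signals"]:
        spec = VSS_SIGNALS_SPEC["signals"][signal]
        current = current_values.get(signal, spec["default_value"])
        updates[signal] = resolve_value(spec, action, current, requested)
    return updates


# "Set the volume to 50"
print(plan_updates("VolumeControl", "set", {}, 50))
```

An intent that lists several signals, such as `HVACFanSpeed`, would produce one entry per signal in the returned dictionary.
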
 136
 137 If you want to change the default configuration, you can do so by creating a new configuration file and then passing it to the voice agent service using the `--config` flag. For example:
 138 ```shell
 139 $ voiceagent-service run-server --config path/to/config.ini
 140 ```
 141
 142 One thing to note here is that all the directory paths in the configuration file should be absolute and always end with a `/`.
 143
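The path rule above can be checked before starting the service. The following is a small sketch using Python's standard `configparser`; the helper name `find_bad_directory_paths` is hypothetical, and the list of directory keys is taken from the default `[General]` section shown earlier.

```python
import configparser
import posixpath

# Keys from the default [General] section whose values are directories.
DIRECTORY_KEYS = (
    "base_audio_dir",
    "stt_model_path",
    "wake_word_model_path",
    "snips_model_path",
    "rasa_model_path",
    "base_log_dir",
)


def find_bad_directory_paths(config_text):
    """Return the directory keys that are not absolute or lack a trailing '/'."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    bad = []
    for key in DIRECTORY_KEYS:
        value = parser.get("General", key, fallback=None)
        if value is None:
            continue  # key not present in this config; nothing to check
        if not posixpath.isabs(value) or not value.endswith("/"):
            bad.append(key)
    return bad


sample = """
[General]
base_audio_dir = /usr/share/nlu/commands/
stt_model_path = /usr/share/vosk/vosk-model-small-en-us-0.15
base_log_dir = logs/
"""
print(find_bad_directory_paths(sample))
```

Here `stt_model_path` is reported because it lacks the trailing `/`, and `base_log_dir` because it is relative.
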
 144 # High Level Architecture
 145 ![Voice_Agent_Architecture](images/agl-voice-agent/AGL_Offline_VoiceAgent_(High_Level_Architecture).png)
 146
 147 # Components
 148 - Voice Agent Service
 149     - Vosk Kaldi
 150     - Snips
 151     - RASA
 152 - Voice Assistant App
 153
 154 # Voice Agent Service
 155 The voice agent service is a gRPC-based service that is responsible for converting spoken words into text, extracting intents from these commands, and performing actions through the Kuksa interface. The service is composed of three main components: Vosk Kaldi, RASA, and Snips.
 156
 157 ## Vosk Kaldi
  158 Vosk Kaldi is a speech recognition toolkit that is based on Kaldi and Vosk. It is used to convert spoken words into text. It provides official pre-trained models for various popular languages, and we can also train our own models using the Vosk Kaldi toolkit. The current voice agent service requires two different models to run, one for **wake-word detection** and one for **speech recognition**. The wake-word detection model is used to detect when the user says the wake word, which is set by the `wake_word` option in the config file (`hello auto` in the default configuration shown earlier). The speech recognition model is used to convert the user's spoken words into text.
 159
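As a rough illustration of wake-word detection on top of a transcript, the sketch below checks whether a recognizer result contains the configured wake word. It assumes the result is a JSON string with a `text` field, which is the shape Vosk's `KaldiRecognizer.Result()` returns; the helper name `heard_wake_word` and the word-by-word matching strategy are assumptions, not a description of how the service itself matches.

```python
import json


def heard_wake_word(result_json, wake_word):
    """Return True if the wake word appears as a contiguous word sequence."""
    words = json.loads(result_json).get("text", "").lower().split()
    target = wake_word.lower().split()
    if not target:
        return False
    # Slide a window of len(target) words over the transcript.
    return any(
        words[i:i + len(target)] == target
        for i in range(len(words) - len(target) + 1)
    )


print(heard_wake_word('{"text": "hello auto turn up the volume"}', "hello auto"))
```

Matching whole words rather than substrings avoids triggering on a longer word that merely starts with the wake word.
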
 160 ## Snips
  161 Snips NLU (Natural Language Understanding) is a Python-based intent engine that lets you extract structured information from sentences written in natural language. The NLU engine first detects what the intention of the user is (a.k.a. intent), then extracts the parameters (called slots) of the query. The developer can then use this to determine the appropriate action or response. Our voice agent service uses either Snips or RASA to extract intents from the user's spoken commands.
 162
 163 It is recommended to take a brief look at [Snips Official Documentation](https://snips-nlu.readthedocs.io/en/latest/) to get a better understanding of how Snips works.
 164
 165 ### Dataset Format
 166 The Snips NLU engine uses a dataset to understand and recognize user intents. The dataset is structured into two files:
 167
 168 - `intents.yaml`: Contains the intent definitions, slots, and sample utterances for each intent.
 169 - `entities.yaml`: Defines the entities used in the intents, including their values, synonyms, and matching strictness.
 170
 171 To train the NLU Intent Engine model, a pre-processing step is required to convert the dataset into a format compatible with the Snips NLU engine. Once the model is trained, it can be used to parse user queries and extract the intent and relevant slots for further processing.
 172
 173 ### Training
  174 To train the NLU Intent Engine for your specific use case, you can modify the dataset files `intents.yaml` and `entities.yaml` to add new intents, slots, or entity values. You need to regenerate the dataset whenever you modify `intents.yaml` or `entities.yaml`. For this purpose, you need to install the [`snips-sdk-agl`](https://github.com/malik727/snips-sdk-agl) module. This module is an extension of the original Snips NLU with upgraded Python support and is specifically designed for data pre-processing and training purposes only.
 175
 176 After installation run the following command to generate the updated `dataset.json` file:
 177 ```shell
 178 $ snips-sdk generate-dataset en entities.yaml intents.yaml > dataset.json
 179 ```
 180
 181 Then run the following command to re-train the model:
 182 ```shell
 183 $ snips-sdk train path/to/dataset.json path/to/model
 184 ```
 185
 186 Finally, you can use the [`snips-inference-agl`](https://gerrit.automotivelinux.org/gerrit/gitweb?p=src/snips-inference-agl.git;a=summary) module to process commands and extract the associated intents.
 187
 188 ### Usage
 189 To set up and run the Snips NLU Intent Engine, follow these steps:
 190
  191 1. Train your model by following the steps laid out earlier or just clone a pre-existing model from [here](https://gerrit.automotivelinux.org/gerrit/gitweb?p=src/snips-model-agl.git;a=summary).
 192
 193 2. Install and set up the [`snips-inference-agl`](https://gerrit.automotivelinux.org/gerrit/gitweb?p=src/snips-inference-agl.git;a=summary) module on your local machine. This module is an extension of the original Snips NLU with upgraded Python support and is specifically designed for inference purposes only.
 194
 195 3. Once you have the [`snips-inference-agl`](https://gerrit.automotivelinux.org/gerrit/gitweb?p=src/snips-inference-agl.git;a=summary) module installed, you can load the pre-trained model located in the model/ folder. This model contains the trained data and parameters necessary for intent extraction. You can use the following command to process commands and extract the associated intents:
 196     ```shell
 197     $ snips-inference parse path/to/model -q "your command here"
 198     ```
 199
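Once a command has been parsed, the intent and slots can be pulled out of the result. The sketch below assumes the result follows the upstream Snips NLU output format (an `intent` object with `intentName`, and a `slots` list with `slotName` and `rawValue`); whether `snips-inference-agl` emits exactly this shape is an assumption. The sample result is hand-written for illustration and the helper name `summarize` is hypothetical.

```python
# Hand-written sample in the upstream Snips NLU result format (illustrative only).
SAMPLE_RESULT = {
    "input": "set the volume to 50",
    "intent": {"intentName": "VolumeControl", "probability": 0.9},
    "slots": [
        {"slotName": "volume_control_action", "rawValue": "set"},
        {"slotName": "numeric_value", "rawValue": "50"},
    ],
}


def summarize(result):
    """Return (intent_name, {slot_name: raw_value}) from a parse result."""
    intent = result.get("intent") or {}
    slots = {slot["slotName"]: slot["rawValue"] for slot in result.get("slots") or []}
    # intentName is None when the engine could not match any intent.
    return intent.get("intentName"), slots


print(summarize(SAMPLE_RESULT))
```
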
 200 ### Observations
 201 - The Snips NLU engine is very lightweight and uses around 250 MB - 300 MB of RAM when running on the target device.
  202 - The underlying AI architecture of the Snips NLU is not extensible or changeable.
  203 - The Snips NLU engine is less accurate than RASA; however, it is extremely lightweight and fast.
 204
 205 ## RASA
 206 RASA is an open-source machine learning framework for building contextual AI assistants and chatbots. It is based on Python and TensorFlow. It is used to extract intents from the user's spoken commands. The RASA NLU engine is trained on a dataset that contains intents, entities, and sample utterances. The RASA NLU engine is used to parse user queries and extract the intent and relevant entities for further processing.
 207
 208 It is recommended to take a brief look at [RASA Official Documentation](https://rasa.com/docs/rasa/) to get a better understanding of how RASA works.
 209
 210 ### Dataset Format
 211 Rasa uses YAML as a unified and extendable way to manage all training data, including NLU data, stories and rules.
 212
 213 You can split the training data over any number of YAML files, and each file can contain any combination of NLU data, stories, and rules. The training data parser determines the training data type using top level keys.
 214
 215 NLU training data consists of example user utterances categorized by intent. Training examples can also include entities. Entities are structured pieces of information that can be extracted from a user's message. You can also add extra information such as regular expressions and lookup tables to your training data to help the model identify intents and entities correctly. Example dataset for `check_balance` intent:
 216
 217 ```yaml
 218 nlu:
 219 - intent: check_balance
 220   examples: |
 221     - What's my [credit](account) balance?
 222     - What's the balance on my [credit card account]{"entity":"account","value":"credit"}
 223
 224 - synonym: credit
 225   examples: |
 226     - credit card account
 227     - credit account
 228 ```
 229
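To make the two annotation styles in the example above concrete, here is a small Python sketch that extracts the entities from an annotated utterance. It only handles the two forms shown, `[text](entity)` and `[text]{"entity": ..., "value": ...}`, and is not Rasa's own parser; `parse_example` is a hypothetical helper name.

```python
import json
import re

# Matches [text](entity) or [text]{...json...}.
ANNOTATION = re.compile(r'\[([^\]]+)\](?:\(([^)]+)\)|(\{[^}]*\}))')


def parse_example(example):
    """Return (plain_text, entities) for one annotated training example."""
    entities = []
    for match in ANNOTATION.finditer(example):
        text, short_entity, long_form = match.groups()
        if long_form is not None:
            meta = json.loads(long_form)
            # "value" lets the annotated text map to a canonical value.
            entities.append({"text": text, "entity": meta["entity"],
                             "value": meta.get("value", text)})
        else:
            entities.append({"text": text, "entity": short_entity, "value": text})
    plain = ANNOTATION.sub(lambda m: m.group(1), example)
    return plain, entities


print(parse_example("What's my [credit](account) balance?"))
```

Both example utterances resolve to the same entity value, `credit`, which is the effect the `synonym` entry is meant to achieve for the unannotated phrases as well.
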
 230 ### Training
  231 To train the RASA NLU intent engine model, you need to curate a dataset for your specific use case. You can also use the [RASA NLU Trainer](https://rasahq.github.io/rasa-nlu-trainer/) to curate your dataset. Once your dataset is ready, create a `config.yml` file. This file contains the configuration for the RASA NLU engine. A sample `config.yml` file is given below:
 232
 233 ```yaml
 234 language: en  # your 2-letter language code
 235 assistant_id: 20230807-130137-kind-easement
 236
 237 pipeline:
 238   - name: WhitespaceTokenizer
 239   - name: RegexFeaturizer
 240   - name: LexicalSyntacticFeaturizer
 241   - name: CountVectorsFeaturizer
 242   - name: CountVectorsFeaturizer
 243     analyzer: "char_wb"
 244     min_ngram: 1
 245     max_ngram: 4
 246   - name: DIETClassifier
 247     epochs: 100
 248     constrain_similarities: true
 249   - name: EntitySynonymMapper
 250   - name: ResponseSelector
 251     epochs: 100
 252     constrain_similarities: true
 253   - name: FallbackClassifier
 254     threshold: 0.3
 255     ambiguity_threshold: 0.1
 256 ```
 257
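The two `FallbackClassifier` settings in the sample pipeline can be read as the following decision rule. This is a simplified Python sketch of the documented behaviour, not Rasa's code: the top intent is replaced by `nlu_fallback` when its confidence is below `threshold`, or when the gap to the runner-up is smaller than `ambiguity_threshold`. The function name `apply_fallback` is hypothetical.

```python
FALLBACK_INTENT = "nlu_fallback"


def apply_fallback(ranking, threshold=0.3, ambiguity_threshold=0.1):
    """ranking: list of (intent_name, confidence) pairs in any order."""
    ranked = sorted(ranking, key=lambda item: item[1], reverse=True)
    if not ranked:
        return FALLBACK_INTENT
    top_name, top_confidence = ranked[0]
    if top_confidence < threshold:
        return FALLBACK_INTENT  # nothing was predicted confidently enough
    if len(ranked) > 1 and top_confidence - ranked[1][1] < ambiguity_threshold:
        return FALLBACK_INTENT  # the two best intents are too close to call
    return top_name


print(apply_fallback([("VolumeControl", 0.9), ("HVACFanSpeed", 0.05)]))
```
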
  258 Now install RASA (v3.6.4) using the following command:
 259 ```shell
 260 $ pip install rasa==3.6.4
 261 ```
 262 Finally, you can use the following command to train the RASA NLU engine:
 263 ```shell
 264 $ rasa train nlu --config config.yml --nlu path/to/dataset.yml --out path/to/model
 265 ```
 266
 267 ### Usage
 268 To set up and run the RASA NLU Intent Engine, follow these steps:
 269
  270 1. Train your model by following the steps laid out earlier or just clone a pre-existing model from [here](https://gerrit.automotivelinux.org/gerrit/gitweb?p=src/rasa-model-agl.git;a=summary).
 271
 272 2. Once you have RASA (v3.6.4) installed, you can load the pre-trained model located in the model/ folder. This model contains the trained data and parameters necessary for intent extraction. You can use the following command to process commands and extract the associated intents:
 273     ```shell
 274     $ rasa shell --model path/to/model
 275     ```
 276
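Besides the interactive shell, a running RASA server can be queried programmatically. The sketch below assumes the server was started with its HTTP API enabled (for example with `rasa run --enable-api`) and uses Rasa's `POST /model/parse` endpoint; the port `51054` is the `rasa_server_port` from the default configuration, and the helper names are hypothetical. This is not necessarily how the voice agent service itself talks to RASA.

```python
import json
import urllib.request


def build_parse_request(text, host="127.0.0.1", port=51054):
    """Build a POST request for Rasa's /model/parse endpoint."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/model/parse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_intent(response):
    """Return (intent_name, {entity: value}) from a /model/parse response."""
    intent = response.get("intent") or {}
    entities = {e["entity"]: e["value"] for e in response.get("entities") or []}
    return intent.get("name"), entities


def parse_with_rasa(text, **kwargs):
    """Send the text to the server; requires a reachable RASA instance."""
    with urllib.request.urlopen(build_parse_request(text, **kwargs), timeout=10) as reply:
        return extract_intent(json.load(reply))


# Hand-written response for illustration; no server is contacted here.
sample_response = {
    "text": "set the volume to 50",
    "intent": {"name": "VolumeControl", "confidence": 0.97},
    "entities": [{"entity": "numeric_value", "value": "50"}],
}
print(extract_intent(sample_response))
```
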
 277 ### Observations
 278 - The RASA NLU engine is heavy and uses around 0.8 GB - 1 GB of RAM when running on the target device.
  279 - The underlying AI architecture of the RASA NLU is extensible and changeable thanks to the TensorFlow backend.
  280 - The RASA NLU engine is more accurate than Snips; however, it is heavier and slightly slower.
 281
 282 # Voice Assistant App
  283 The voice assistant app is a Flutter-based application made for Automotive Grade Linux (AGL). It is responsible for interacting with the voice agent service for user voice command recognition, intent extraction, and command execution. It also receives the response from the voice agent service and displays it on the screen. Some app UI screenshots are attached below.
 284
 285 ![Voice_Agent_App_1](images/agl-voice-agent/voice-assistant-flutter-1.png)
 286 ![Voice_Agent_App_2](images/agl-voice-agent/voice-assistant-flutter-2.png)