ASR - Automatic speech recognition
What it's about
This section describes the blocks that provide voice-to-text functionality.
You can choose your preferred provider from those integrated with XCALLY, which are listed in the following paragraphs.
Important
Please note that ASR providers are third-party applications, and their functionalities, costs, and behaviors depend on the provider you select.
An internet connection is required for the ASR blocks to properly function.
Google ASR
This block performs a voice-to-text conversion using the Google ASR AGI.
Parameters
Label: here you can type a brief description
Key: insert your license key from the console.developers.google.com account
Language: select the language you would like to use from the dropdown list
Timeout: define the maximum recording duration in seconds. If set to -1, the timeout is unlimited
Interrupt Key: set special digits to exit the current recorded call
Beep: if Yes, a beep is played before the recording starts
Record Speech: select whether to save the voice recording used by Google ASR on XCALLY. If Yes, audio files will be saved in the Recordings section.
Number of seconds of silence: the number of seconds of silence permitted before moving on to the next block, regardless of the interrupt key or timeout
See the documentation page "How to retrieve Google Key for Cally Square blocks" for details on obtaining the key.
Exit Arrows
This block provides just one arrow out to the next step
The ASR saves the results in two channel variables:
GOOGLE_ASR_TRANSCRIPT: the result of the dictation recognition
GOOGLE_ASR_CONFIDENCE: the precision of the recognition, between 0 and 1. Usually, values above 0.8-0.9 mean that the dictation has been correctly recognized.
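As a rough illustration of how the confidence value might be used in later steps, the sketch below accepts a transcript only when its confidence reaches a threshold. The function name `accept_transcript` and the 0.8 default are assumptions made for this example (the default is taken from the range mentioned above), and the values are assumed to arrive as strings.

```python
def accept_transcript(transcript: str, confidence: str, threshold: float = 0.8) -> bool:
    """Return True when the transcript is non-empty and the confidence meets the threshold.

    Hypothetical helper: both arguments are assumed to be the string values
    read from the transcript and confidence channel variables.
    """
    try:
        score = float(confidence)
    except (TypeError, ValueError):
        # A missing or non-numeric confidence is treated as "not recognized".
        return False
    return bool(transcript and transcript.strip()) and score >= threshold
```

A flow could, for example, ask the caller to repeat the sentence whenever this check fails.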
AWS ASR
This block performs a voice-to-text conversion using the Amazon AWS Transcribe SDKs.
Parameters
Label: here you can type a brief description
Access Key ID: insert your license key from the Amazon account
Secret Access Key: insert your secret key from the Amazon account
Language: select the language you would like to use from the dropdown list. The supported languages are listed in the Amazon Transcribe documentation
Region: select the region from the dropdown list. See the AWS documentation for more details
Timeout: define the maximum recording duration in seconds. If set to -1, the timeout is unlimited
Interrupt Key: set special digits to exit the current recorded call
Beep: if Yes, a beep is played before the recording starts
Record Speech: select whether to save the voice recording used by AWS ASR on XCALLY. If Yes, audio files will be saved in the Recordings section.
Number of seconds of silence: the number of seconds of silence permitted before moving on to the next block, regardless of the interrupt key or timeout
Exit Arrows
This block provides just one arrow out to the next step
The ASR saves the results in two channel variables:
AWS_ASR_TRANSCRIPT: the result of the dictation recognition
AWS_ASR_CONFIDENCE: the precision of the recognition, between 0 and 1. Usually, values above 0.8-0.9 mean that the dictation has been correctly recognized.
ISpeech ASR
This block performs a voice-to-text conversion using the ISpeech ASR AGI.
Parameters
Label: here you can type a brief description
Key: insert your license key from the ispeech.org account
Model: define the grammar of the dictation, to increase the precision of the recognition
Language: select the language you would like to use from the dropdown list
Interrupt Key: set special digits to exit the current recorded call
Timeout: define the maximum recording duration in seconds. If set to -1, the timeout is unlimited
Beep: if Yes, a beep is played before the recording starts
Exit Arrows
This block provides just one arrow out to the next step
The ASR saves the results in two channel variables:
ispeech_asr_utterance: the result of the dictation recognition
ispeech_asr_precision: the precision of the recognition, between 0 and 1. Usually, values above 0.8-0.9 mean that the dictation has been correctly recognized.
Lumenvox ASR
This block performs a voice-to-text conversion using the Lumenvox ASR application.
Label: here you can type a brief description
Grammar: The grammar that should be used for the recognition. Grammars can be specified as text/XML inline for built-in grammars or by using a reference to an external file/URI.
To use multiple grammars, separate them with a delimiter character of your choice and declare that character in the gd option (e.g. grammar-a%grammar-b, with % defined as the delimiter in the options field gd).
LumenVox provides the following built-in grammars:
URI | Sample Input | Semantic Interpretation Result |
---|---|---|
builtin:grammar/boolean | "yes", "no thank you", etc. | "true" or "false" |
builtin:grammar/date | "january thirteenth" or "december first two thousand" | "????0113" or "20001201" |
builtin:grammar/digits | "one two three four" | "1234" |
builtin:grammar/currency | "eighteen dollars and four cents" | "USD18.04" |
builtin:grammar/number | "four hundred point five" | "400.5" |
builtin:grammar/phone | "area code eight five eight seven oh seven oh seven oh seven" | "8587070707" |
builtin:grammar/time | "six o clock" or "five thirty p m" | "0600?" or "0530p" |
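The date results in the table can be turned into structured values with a few lines of code. The sketch below is only an illustration: it assumes the result is always eight characters in year-month-day order, with `?` filling any part the caller did not say, which is inferred from the two sample rows and not from a formal specification. The function name `parse_builtin_date` is hypothetical.

```python
from typing import Optional, Tuple


def parse_builtin_date(result: str) -> Tuple[Optional[int], Optional[int], Optional[int]]:
    """Split a result such as "????0113" into (year, month, day).

    Assumption: 4 characters for the year, 2 for the month, 2 for the day,
    and a field made only of "?" means that part is unknown.
    """
    if len(result) != 8:
        raise ValueError(f"expected 8 characters, got {len(result)}: {result!r}")

    def field(text: str) -> Optional[int]:
        # A field consisting only of question marks was not spoken by the caller.
        if set(text) == {"?"}:
            return None
        return int(text)

    return field(result[0:4]), field(result[4:6]), field(result[6:8])
```

With this sketch, a missing year comes back as `None`, so the calling code can decide whether to assume the current year or ask the caller again.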
Options: define details about the recognition. Valid options are:
p - Profile to use in mrcp.conf
i - Digits that are allowed to interrupt recognition. Set this to "none" to allow LumenVox to process the DTMF using a DTMF grammar. Otherwise, if "any" or other digits are specified, recognition will be interrupted and the digit will be returned to the dialplan.
f - Filename to play while recognition occurs (if empty or not specified, no file is played)
t - The recognition timeout (in milliseconds). This is the total amount of time a caller has to speak.
b - Barge-in value (no barge-in=0, ASR engine barge-in=1, Asterisk barge-in=2). LumenVox strongly recommends allowing the ASR to perform barge-in instead of Asterisk.
gd - The grammar delimiter. Defaults to a comma.
ct - The confidence threshold (0.0 - 1.0). If a recognition result has a confidence score below this value, it will be returned as "no match." Defaults to 0.5.
sl - The barge-in sensitivity level (0.0 - 1.0). The higher this number, the easier it is to barge-in. Defaults to 0.5.
sva - Speed vs. accuracy, set on a scale of 0.0 - 1.0. The higher this number, the faster (and less accurate) recognitions will be. Defaults to 0.5.
nb - N-best list length. Defaults to 1; increase this value if you wish to get more answers back from the recognizer.
nit - No input timeout. This is the amount of time the caller has to start speaking before the recognizer returns a no-input result.
sct - Speech Complete Timeout. This is the amount of time, in milliseconds, LumenVox must detect silence after a user stops speaking before the recognizer begins processing the utterance. Set this lower for single word utterances and higher for longer utterances. In most cases, a value of 800 is correct.
dit - DTMF interdigit timeout
dtt - DTMF terminate timeout
dttc - DTMF terminate characters
sw - Save Waveform (true/false)
nac - new audio channel (true/false)
spl - speech language (en-US/en-GB/etc.). If a language is declared in a grammar (it should be) this will be ignored.
cdb - clear DTMF buffer (true/false)
mt - media type
iwu - input waveform URI (only applies to MRCPv2). Not supported by LumenVox.
sint - Speech Incomplete Timeout. Not supported by LumenVox.
rm - Recognition Mode. Not supported by LumenVox.
hmaxd - hotword max duration. Not supported by LumenVox.
hmind - hotword min duration. Not supported by LumenVox.
enm - early no match (true/false). Not supported by LumenVox.
Multiple options can be provided by joining options with an ampersand, e.g. f=sayHelloWorld&t=5000
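The Grammar and Options fields are plain strings, so they can be assembled programmatically before being pasted into the block. The following sketch shows one way to do that; the helper names `format_value`, `build_options` and `join_grammars` are hypothetical and not part of XCALLY or LumenVox.

```python
from typing import Iterable, Mapping


def format_value(value: object) -> str:
    """Render booleans as lowercase true/false, everything else with str()."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def build_options(options: Mapping[str, object]) -> str:
    """Join key=value pairs with an ampersand, keeping insertion order."""
    return "&".join(f"{key}={format_value(value)}" for key, value in options.items())


def join_grammars(grammars: Iterable[str], delimiter: str = ",") -> str:
    """Join several grammars with the delimiter declared in the gd option."""
    grammars = list(grammars)
    for grammar in grammars:
        # A delimiter that also appears inside a grammar would split it in two.
        if delimiter in grammar:
            raise ValueError(f"delimiter {delimiter!r} appears inside {grammar!r}")
    return delimiter.join(grammars)


# Example values for the two fields, using % as the grammar delimiter.
grammar_field = join_grammars(["grammar-a", "grammar-b"], delimiter="%")
options_field = build_options({"gd": "%", "f": "sayHelloWorld", "t": 5000})
```

Checking for the delimiter inside each grammar is a design choice of this sketch: it fails early instead of producing a string that the recognizer would split in the wrong place.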
Exit Arrows
This block provides just one arrow out to the next step.
The ASR saves the results in three channel variables:
LUMENVOX_ASR_TRANSCRIPT: the result of the dictation recognition
LUMENVOX_ASR_CONFIDENCE: the precision of the recognition, between 0 and 1. Usually, values above 0.8-0.9 mean that the dictation has been correctly recognized.
LUMENVOX_ASR_INSTANCE: the instance of the recognition.
Tilde ASR
This block performs a voice-to-text conversion using the Tilde ASR AGI.
Parameters
Label: here you can type a brief description
URI: insert here your custom Tilde server URI (default standard value is wss://runa.tilde.lv/client/ws/speech/LVASR-ONLINE)
App ID: insert your App ID from the tilde.com account
App Secret: insert your App secret from the tilde.com account
Interrupt Key: set special digits to exit the current recorded call
Timeout: define the maximum recording duration in seconds. If set to -1, the timeout is unlimited
Beep: if Yes, a beep is played before the recording starts
Exit Arrows
This block provides just one arrow out to the next step.
The ASR saves the results in the following channel variables:
TILDE_ASR_TRANSCRIPT: the result of the dictation recognition
TILDE_ASR_CONFIDENCE: the precision of the recognition, between 0 and 1. Usually, values above 0.8-0.9 mean that the dictation has been correctly recognized.
TILDE_ASR_STATUS: the response status (integer)
TILDE_ASR_STATUS_MESSAGE: the response status message
Status | Message |
---|---|
0 | Success |
1 | No speech. Sent when the incoming audio contains a large portion of silence or non-speech. |
2 | Aborted. Recognition was aborted for some reason. |
9 | Not available. Max load limit reached. |
10 | Authentication failed. |
11 | All recognition workers are currently in use and real-time recognition is not possible. |
The variables TILDE_ASR_TRANSCRIPT and TILDE_ASR_CONFIDENCE are available only when TILDE_ASR_STATUS is equal to 0.
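Because the transcript and confidence exist only on success, code that consumes these variables should check the status first. The sketch below illustrates that order of checks; it assumes the channel variables have already been collected into a dictionary of strings, and the function name `read_tilde_result` is hypothetical.

```python
from typing import Mapping, Optional, Tuple


def read_tilde_result(variables: Mapping[str, str]) -> Optional[Tuple[str, float]]:
    """Return (transcript, confidence) when the status is 0, otherwise None.

    Assumption: `variables` maps channel variable names to their string values.
    """
    try:
        status = int(variables.get("TILDE_ASR_STATUS", ""))
    except ValueError:
        # A missing or non-numeric status is treated as a failed recognition.
        return None
    if status != 0:
        # On any non-zero status the transcript and confidence are not set.
        return None
    return variables["TILDE_ASR_TRANSCRIPT"], float(variables["TILDE_ASR_CONFIDENCE"])
```

When the function returns `None`, TILDE_ASR_STATUS_MESSAGE can be logged to see which of the statuses in the table occurred.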
Sestek ASR
This block performs a voice-to-text conversion using the Sestek ASR.
Label: here you can type a brief description
Grammar: The grammar that should be used for the recognition. Grammars are specified as text/XML files, and the full path to the file on the local server must be provided, as in the example on the left
Options: define details about the recognition. Valid options are:
p - Profile to use in mrcp.conf
i - Digits that are allowed to interrupt recognition. Set this to "none" to allow Sestek to process the DTMF using a DTMF grammar. Otherwise, if "any" or other digits are specified, recognition will be interrupted and the digit will be returned to the dialplan.
f - Filename to play while recognition occurs (if empty or not specified, no file is played)
t - The recognition timeout (in milliseconds). This is the total amount of time a caller has to speak.
b - Barge-in value (no barge-in=0, ASR engine barge-in=1, Asterisk barge-in=2). Sestek strongly recommends allowing the ASR to perform barge-in instead of Asterisk.
gd - The grammar delimiter. Defaults to a comma.
ct - The confidence threshold (0.0 - 1.0). If a recognition result has a confidence score below this value, it will be returned as "no match." Defaults to 0.5.
sl - The barge-in sensitivity level (0.0 - 1.0). The higher this number, the easier it is to barge-in. Defaults to 0.5.
sva - Speed vs. accuracy, set on a scale of 0.0 - 1.0. The higher this number, the faster (and less accurate) recognitions will be. Defaults to 0.5.
nb - N-best list length. Defaults to 1; increase this value if you wish to get more answers back from the recognizer.
nit - No input timeout. This is the amount of time the caller has to start speaking before the recognizer returns a no-input result.
sct - Speech Complete Timeout. This is the amount of time, in milliseconds, Sestek must detect silence after a user stops speaking before the recognizer begins processing the utterance. Set this lower for single word utterances and higher for longer utterances. In most cases, a value of 800 is correct.
dit - DTMF interdigit timeout
dtt - DTMF terminate timeout
dttc - DTMF terminate characters
sw - Save Waveform (true/false)
nac - new audio channel (true/false)
spl - speech language (en-US/en-GB/etc.). If a language is declared in a grammar (it should be) this will be ignored.
cdb - clear DTMF buffer (true/false)
mt - media type
iwu - input waveform URI (only applies to MRCPv2)
sint - Speech Incomplete Timeout
rm - Recognition Mode
hmaxd - hotword max duration
hmind - hotword min duration
enm - early no match (true/false)
Multiple options can be provided by joining options with an ampersand, e.g. f=sayHelloWorld&t=5000
Exit Arrows
This block provides just one arrow out to the next step.
The ASR saves the results in three channel variables:
SESTEK_ASR_TRANSCRIPT: the result of the dictation recognition
SESTEK_ASR_CONFIDENCE: the precision of the recognition, between 0 and 1. Usually, values above 0.8-0.9 mean that the dictation has been correctly recognized.
SESTEK_ASR_INSTANCE: the instance of the recognition.
MRCP Recog
This block performs a voice-to-text conversion using the MRCP Recog ASR.
Label: here you can type a brief description
Grammar: The grammar that should be used for the recognition. Grammars can be specified as text/XML inline for built-in grammars or by using a reference to an external file/URI.
To use multiple grammars, separate them with a delimiter character of your choice and declare that character in the gd option (e.g. grammar-a%grammar-b, with % defined as the delimiter in the options field gd).
MRCPRecog provides the following built-in grammars:
URI | Sample Input | Semantic Interpretation Result |
---|---|---|
builtin:grammar/boolean | "yes", "no thank you", etc. | "true" or "false" |
builtin:grammar/date | "january thirteenth" or "december first two thousand" | "????0113" or "20001201" |
builtin:grammar/digits | "one two three four" | "1234" |
builtin:grammar/currency | "eighteen dollars and four cents" | "USD18.04" |
builtin:grammar/number | "four hundred point five" | "400.5" |
builtin:grammar/phone | "area code eight five eight seven oh seven oh seven oh seven" | "8587070707" |
builtin:grammar/time | "six o clock" or "five thirty p m" | "0600?" or "0530p" |
Options: define details about the recognition. Valid options are:
p - Profile to use in mrcp.conf
i - Digits that are allowed to interrupt recognition. Set this to "none" to allow MRCPRecog to process the DTMF using a DTMF grammar. Otherwise, if "any" or other digits are specified, recognition will be interrupted and the digit will be returned to the dialplan.
f - Filename to play while recognition occurs (if empty or not specified, no file is played)
t - The recognition timeout (in milliseconds). This is the total amount of time a caller has to speak.
b - Barge-in value (no barge-in=0, ASR engine barge-in=1, Asterisk barge-in=2). MRCPRecog strongly recommends allowing the ASR to perform barge-in instead of Asterisk.
gd - The grammar delimiter. Defaults to a comma.
ct - The confidence threshold (0.0 - 1.0). If a recognition result has a confidence score below this value, it will be returned as "no match." Defaults to 0.5.
sl - The barge-in sensitivity level (0.0 - 1.0). The higher this number, the easier it is to barge-in. Defaults to 0.5.
sva - Speed vs. accuracy, set on a scale of 0.0 - 1.0. The higher this number, the faster (and less accurate) recognitions will be. Defaults to 0.5.
nb - N-best list length. Defaults to 1; increase this value if you wish to get more answers back from the recognizer.
nit - No input timeout. This is the amount of time the caller has to start speaking before the recognizer returns a no-input result.
sct - Speech Complete Timeout. This is the amount of time, in milliseconds, MRCPRecog must detect silence after a user stops speaking before the recognizer begins processing the utterance. Set this lower for single word utterances and higher for longer utterances. In most cases, a value of 800 is correct.
dit - DTMF interdigit timeout
dtt - DTMF terminate timeout
dttc - DTMF terminate characters
sw - Save Waveform (true/false)
nac - new audio channel (true/false)
spl - speech language (en-US/en-GB/etc.). If a language is declared in a grammar (it should be) this will be ignored.
cdb - clear DTMF buffer (true/false)
mt - media type
iwu - input waveform URI (only applies to MRCPv2). Not supported by MRCPRecog.
sint - Speech Incomplete Timeout. Not supported by MRCPRecog.
rm - Recognition Mode. Not supported by MRCPRecog.
hmaxd - hotword max duration. Not supported by MRCPRecog.
hmind - hotword min duration. Not supported by MRCPRecog.
enm - early no match (true/false). Not supported by MRCPRecog.
Multiple options can be provided by joining options with an ampersand, e.g. f=sayHelloWorld&t=5000
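Since several options accept only a limited range of values, it can help to check an options string before saving it in the block. The sketch below parses the ampersand-separated format and validates a few of the documented ranges (ct, sl and sva between 0.0 and 1.0, b equal to 0, 1 or 2). The function names `parse_options` and `check_options` are hypothetical, and the checks cover only the ranges stated in the list above.

```python
from typing import Dict, List


def parse_options(text: str) -> Dict[str, str]:
    """Split a string such as "f=sayHelloWorld&t=5000" into a dictionary."""
    options: Dict[str, str] = {}
    for pair in text.split("&"):
        if not pair:
            continue  # tolerate an empty string or a trailing ampersand
        key, separator, value = pair.partition("=")
        if not separator:
            raise ValueError(f"option without '=': {pair!r}")
        options[key] = value
    return options


def check_options(options: Dict[str, str]) -> List[str]:
    """Return a list of problems found; an empty list means no problem was detected."""
    problems: List[str] = []
    for key in ("ct", "sl", "sva"):
        if key not in options:
            continue
        try:
            value = float(options[key])
        except ValueError:
            problems.append(f"{key} is not a number: {options[key]!r}")
            continue
        if not 0.0 <= value <= 1.0:
            problems.append(f"{key} must be between 0.0 and 1.0, got {value}")
    if "b" in options and options["b"] not in {"0", "1", "2"}:
        problems.append(f"b must be 0, 1 or 2, got {options['b']!r}")
    return problems
```

Returning a list of messages, instead of raising on the first error, lets all problems in one string be reported together.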
Exit Arrows
This block provides just one arrow out to the next step.
The ASR saves the results in three channel variables:
MRCP_RECOG_TRANSCRIPT: the result of the dictation recognition
MRCP_RECOG_CONFIDENCE: the precision of the recognition, between 0 and 1. Usually, values above 0.8-0.9 mean that the dictation has been correctly recognized.
MRCP_RECOG_INSTANCE: the instance of the recognition.
OpenAI Whisper
OpenAI Whisper performs automatic speech recognition and transcription using OpenAI (https://openai.com/research/whisper).
With OpenAI Whisper, automatic language detection can be used for transcription.
Label: here you can type a brief description
OpenAI Cloud Provider: select the provider already configured
Model: the AI model (non-editable field)
Language: select the language you would like to use from the dropdown list
Timeout: define the maximum recording duration in seconds. If set to -1, the timeout is unlimited
Interrupt Key: set special digits to exit the current recorded call
Beep: if Yes, a beep is played before the recording starts
Record b: select if you want to save the voice recording used by OpenAI.
Number of seconds of silence: the number of seconds of silence permitted before moving on to the next block, regardless of the interrupt key or timeout
Exit Arrows
This block provides just one arrow out to the next step.
The ASR saves the result in the channel variable OPENAI_WHISPER_TRANSCRIPT.
In addition, the Asterisk variable OPENAI_WHISPER_TRANSCRIPT_LANGUAGE holds the detected language, so the call flow can be designed around it.
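As an illustration of branching on the detected language, the sketch below maps a language value to a destination and falls back to a default when the language is unknown. The function name `pick_branch`, the destination names and the language keys are all assumptions for this example; the exact format of the value stored in OPENAI_WHISPER_TRANSCRIPT_LANGUAGE should be checked on a real call before relying on it.

```python
from typing import Mapping, Optional


def pick_branch(detected_language: Optional[str], branches: Mapping[str, str], default: str) -> str:
    """Return the destination for the detected language, or the default one.

    Hypothetical helper: `branches` maps lowercase language values to
    destination names chosen by whoever designs the flow.
    """
    key = (detected_language or "").strip().lower()
    return branches.get(key, default)


# Example mapping with made-up destination names.
branches = {"it": "queue_italian", "en": "queue_english"}
destination = pick_branch("IT", branches, default="queue_general")
```

Normalizing the value to lowercase and keeping a default branch means an unexpected or empty language value still leads somewhere sensible.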