In this section, you can find the blocks used for voice-to-text functions.

GoogleASR

Description

This box performs a voice-to-text conversion using the Google ASR AGI*

*Internet connection is required for this box to work


Parameters

  • Label: here you can type a brief description

  • Key: your acquired license key from the console.developers.google.com account

  • Language: the language you want to use for the transcription

  • Timeout: maximum recording duration in seconds. If set to -1, the timeout is unlimited

  • Interrupt Key: the digits that end the current recording

  • Beep: play a beep before the recording starts

  • Record Speech: select this option to save the voice recording that Google ASR turns into text on XCALLY. Recordings are saved in Recordings

  • Number of seconds of silence: the number of seconds of silence allowed before moving on to the next block, regardless of the interrupt key or timeout

Please note that Google ASR requires a valid key from the console.developers.google.com website and a sufficient amount of acquired credits. Furthermore, it is purely experimental and can lead to unexpected behavior.

The ASR saves the results in two channel variables:

  • GOOGLE_ASR_TRANSCRIPT: the result of the dictation recognition

  • GOOGLE_ASR_CONFIDENCE: the precision of the recognition, between 0 and 1. Values above 0.8-0.9 usually mean that the dictation has been correctly recognized.
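
The confidence check described above can be sketched as follows. This is a minimal illustration, assuming the value of the confidence variable has already been read from the channel as a string; the function name is_reliable and the 0.8 default are hypothetical and not part of XCALLY.

```python
def is_reliable(confidence: str, threshold: float = 0.8) -> bool:
    """Return True when the confidence string parses to a value in [threshold, 1]."""
    try:
        value = float(confidence)
    except (TypeError, ValueError):
        # A missing or malformed variable is treated as an unreliable result.
        return False
    return threshold <= value <= 1.0


# Hypothetical usage: decide whether to accept the transcript or ask again.
accept = is_reliable("0.92")
retry = not is_reliable("0.41")
```

The same check applies to the confidence variables of the other ASR boxes, since they all use the 0 to 1 scale.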


Exit Arrows

This box provides just one arrow out to the next step

Remember: this software is managed by others. Check if it works properly.

AWS ASR

Description

This box converts voice to text using the Amazon AWS Transcribe SDKs*

*Internet connection is required for this box to work


Parameters

  • Label: here you can type a brief description

  • Access Key ID: your acquired license key from the Amazon account

  • Secret Access Key: your secret key from the Amazon account

  • Language: the language you want to use for the transcription (see https://docs.aws.amazon.com/transcribe/latest/dg/supported-languages.html)

  • Region: select one of the available regions: https://docs.aws.amazon.com/general/latest/gr/transcribe.html#transcribe-streaming

  • Timeout: maximum recording duration in seconds. If set to -1, the timeout is unlimited

  • Interrupt Key: the digits that end the current recording

  • Beep: play a beep before the recording starts

  • Record Speech: select this option to save the voice recording that AWS ASR turns into text on XCALLY. Recordings are saved in Recordings

  • Number of seconds of silence: the number of seconds of silence allowed before moving on to the next block, regardless of the interrupt key or timeout

The ASR saves the results in two channel variables:

  • AWS_ASR_TRANSCRIPT: the result of the dictation recognition

  • AWS_ASR_CONFIDENCE: the precision of the recognition, between 0 and 1. Values above 0.8-0.9 usually mean that the dictation has been correctly recognized.


Exit Arrows

This box provides just one arrow out to the next step

Remember: this software is managed by others. Check if it works properly.

ISpeechASR

Description

This box performs a voice-to-text conversion using the Ispeech ASR AGI*

*Internet connection is required for this box to work

The ASR saves the results in two channel variables:

  • ispeech_asr_utterance: the result of the dictation recognition

  • ispeech_asr_precision: the precision of the recognition, between 0 and 1. Values above 0.8-0.9 usually mean that the dictation has been correctly recognised.


Parameters

  • Label: here you can type a brief description

  • Key: Your acquired license key from the ispeech.org account

  • Model: the grammar of the dictation, to increase the precision of the recognition

  • Language: the language you want to use for the transcription

  • Interrupt Key: the digits that end the current recording

  • Timeout: maximum recording duration in seconds. If set to -1, the timeout is unlimited

  • Beep: play a beep before the recording starts

Please note that Ispeech ASR requires a valid key from the ispeech.org website and a sufficient amount of acquired credits. Furthermore, it is purely experimental and can lead to unexpected behaviour. Each dictation processing requires 1 credit.


Exit Arrows  

This box provides just one arrow out to the next step

Remember: this software is managed by others. Check if it works properly.

LumenvoxASR

Description

This box allows you to do a voice-to-text conversion using the Lumenvox ASR application*

*In order to have this box working you must install Lumenvox on a machine that is reachable by your system.

The ASR saves the results in three channel variables:

  • LUMENVOX_ASR_TRANSCRIPT: the result of the dictation recognition

  • LUMENVOX_ASR_CONFIDENCE: the precision of the recognition, between 0 and 1. Values above 0.8-0.9 usually mean that the dictation has been correctly recognised.

  • LUMENVOX_ASR_INSTANCE: the instance of the recognition.


Parameters

  • Label: here you can type a brief description

  • Grammar: The grammar that should be used for the recognition. Grammars can be specified as text/XML inline for built-in grammars or by using a reference to an external file/URI.

If you need to use multiple grammars, separate each grammar from the next with a delimiter character of your choice, and declare that character in the gd option (e.g. grammar-a%grammar-b, with % defined as the delimiter in the options field gd).
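
As an illustration of the multiple-grammar rule, the sketch below builds the value for the Grammar field together with the matching gd option. The helper name join_grammars is hypothetical, and the single-character restriction is an assumption based on the wording above.

```python
def join_grammars(grammars: list[str], delimiter: str = ",") -> tuple[str, str]:
    """Return (Grammar field value, matching gd option) for several grammars."""
    if len(delimiter) != 1:
        # Assumption: the delimiter is a single character.
        raise ValueError("delimiter must be a single character")
    for grammar in grammars:
        if delimiter in grammar:
            # The delimiter must not occur inside a grammar, or the split would be ambiguous.
            raise ValueError(f"delimiter {delimiter!r} appears inside {grammar!r}")
    return delimiter.join(grammars), f"gd={delimiter}"


grammar_field, gd_option = join_grammars(["grammar-a", "grammar-b"], "%")
# grammar_field is "grammar-a%grammar-b" and gd_option is "gd=%"
```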

LumenVox provides the following built-in grammars:

  

URI                      | Sample Input                                                   | Semantic Interpretation Result
builtin:grammar/boolean  | "yes", "no thank you", etc.                                    | "true" or "false"
builtin:grammar/date     | "january thirteenth" or "december first two thousand"          | "????0113" or "20001201"
builtin:grammar/digits   | "one two three four"                                           | "1234"
builtin:grammar/currency | "eighteen dollars and four cents"                              | "USD18.04"
builtin:grammar/number   | "four hundred point five"                                      | "400.5"
builtin:grammar/phone    | "area code eight five eight seven oh seven oh seven oh seven"  | "8587070707"
builtin:grammar/time     | "six o clock" or "five thirty p m"                             | "0600?" or "0530p"

  

Options:

Options control details about the recognition. Valid options are:

  • p - Profile to use in mrcp.conf

  • i - Digits to allow recognition to be interrupted with. Set this to "none" to allow LumenVox to process the DTMF using a DTMF grammar. Otherwise, if "any" or other digits specified, recognition will be interrupted and the digit will be returned to dialplan.

  • f - Filename to play while recognition occurs (if empty or not specified, no file is played)

  • t - The recognition timeout (in milliseconds). This is the total amount of time a caller has to speak.

  • b - Barge-in value (no barge-in=0, ASR engine barge-in=1, Asterisk barge-in=2). LumenVox strongly recommends allowing the ASR to perform barge-in instead of Asterisk.

  • gd – The grammar delimiter. Defaults to a comma.

  • ct - The confidence threshold (0.0 - 1.0). If a recognition result has a confidence score below this value, it will be returned as "no match." Defaults to 0.5.

  • sl - The barge-in sensitivity level (0.0 - 1.0). The higher this number, the easier it is to barge-in. Defaults to 0.5.

  • sva - Speed vs. accuracy, set on a scale of 0.0 - 1.0. The higher this number, the faster (and less accurate) recognitions will be. Defaults to 0.5.

  • nb - N-best list length. Defaults to 1; increase this value if you wish to get more answers back from the recognizer.

  • nit - No input timeout. This is the amount of time the caller has to start speaking before the recognizer returns a no-input result.

  • sct - Speech Complete Timeout. This is the amount of time, in milliseconds, LumenVox must detect silence after a user stops speaking before the recognizer begins processing the utterance. Set this lower for single word utterances and higher for longer utterances. In most cases, a value of 800 is correct.

  • dit - DTMF interdigit timeout

  • dtt - DTMF terminate timeout

  • dttc - DTMF terminate characters

  • sw - Save Waveform (true/false)

  • nac - new audio channel (true/false)

  • spl - speech language (en-US/en-GB/etc.). If a language is declared in a grammar (it should be) this will be ignored.

  • cdb - clear DTMF buffer (true/false)

  • mt - media type

  • iwu - input waveform URI (only applies to MRCPv2). Not supported by LumenVox.

  • sint - Speech Incomplete Timeout. Not supported by LumenVox.

  • rm - Recognition Mode. Not supported by LumenVox.

  • hmaxd - hotword max duration. Not supported by LumenVox.

  • hmind - hotword min duration. Not supported by LumenVox.

  • enm - early no match (true/false). Not supported by LumenVox.

You are not required to supply any options. Multiple options can be provided by joining options with an ampersand, e.g. f=sayHelloWorld&t=5000
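
The joining rule can be sketched with a small helper that turns a mapping of option names to values into the string expected by the Options field. The function name build_options is hypothetical; only the key=value pairs joined by an ampersand come from the text above.

```python
def build_options(options: dict[str, object]) -> str:
    """Join option pairs as key=value separated by '&' (empty mapping gives '')."""
    parts = []
    for key, value in options.items():
        if isinstance(value, bool):
            # Boolean options such as sw or cdb are written as true/false.
            value = "true" if value else "false"
        parts.append(f"{key}={value}")
    return "&".join(parts)


options_field = build_options({"f": "sayHelloWorld", "t": 5000})
# options_field is "f=sayHelloWorld&t=5000"
```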


Exit Arrows

This box provides just one arrow out to the next step.

Remember: this software is managed by others. Check if it works properly.

TildeASR

Description

This box performs a voice-to-text conversion using the Tilde ASR AGI*

*Internet connection is required for this box to work

The ASR saves the results in four channel variables:

  • TILDE_ASR_TRANSCRIPT: the result of the dictation recognition

  • TILDE_ASR_CONFIDENCE: the precision of the recognition, between 0 and 1. Values above 0.8-0.9 usually mean that the dictation has been correctly recognised.

  • TILDE_ASR_STATUS: the response status (integer)

  • TILDE_ASR_STATUS_MESSAGE: the response status message

    TILDE_ASR_STATUS | TILDE_ASR_STATUS_MESSAGE
    0                | Success
    1                | No speech. Sent when the incoming audio contains a large portion of silence or non-speech
    2                | Aborted. Recognition was aborted for some reason.
    9                | Not available. Max load limit reached.
    10               | Authentication failed.
    11               | All recognition workers are currently in use and real-time recognition is not possible.

    Attention: the variables TILDE_ASR_TRANSCRIPT and TILDE_ASR_CONFIDENCE are available only when TILDE_ASR_STATUS is equal to 0
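
The status rule can be sketched as a guard that reads the transcript and confidence only when the status is 0. This is a minimal illustration, assuming the channel variables have already been collected into a dictionary of strings; the names read_tilde_result and describe_status are hypothetical, and the short labels are abbreviated from the status list, not confirmed message strings.

```python
from typing import Optional

# Short labels abbreviated from the documented status list (assumption: not the exact messages).
TILDE_STATUS_LABELS = {
    0: "Success",
    1: "No speech",
    2: "Aborted",
    9: "Not available",
    10: "Authentication failed",
    11: "All recognition workers in use",
}


def describe_status(status: int) -> str:
    """Return a short label for a status code, or a fallback for unknown codes."""
    return TILDE_STATUS_LABELS.get(status, "Unknown status")


def read_tilde_result(variables: dict[str, str]) -> Optional[tuple[str, float]]:
    """Return (transcript, confidence) when the status is 0, otherwise None."""
    try:
        status = int(variables.get("TILDE_ASR_STATUS", ""))
    except ValueError:
        return None  # missing or non-numeric status
    if status != 0:
        return None  # transcript and confidence are not available
    return variables["TILDE_ASR_TRANSCRIPT"], float(variables["TILDE_ASR_CONFIDENCE"])
```

A call flow would typically branch on the None case and use TILDE_ASR_STATUS_MESSAGE for logging.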

Please note that Tilde ASR requires a valid app key from the tilde.com website and a sufficient amount of acquired credits.

Furthermore, it is purely experimental and can lead to unexpected behaviour.


Parameters

  • Label: here you can type a brief description

  • App ID: Your App ID from the tilde.com account

  • App Secret: Your App secret from the tilde.com account

  • Interrupt Key: the digits that end the current recording

  • Timeout: maximum recording duration in seconds. If set to -1, the timeout is unlimited

  • Beep: play a beep before the recording starts

  • URI: your custom Tilde server URI (the default value is wss://runa.tilde.lv/client/ws/speech/LVASR-ONLINE)


Exit Arrows

This box provides just one arrow out to the next step.

Remember: this software is managed by others. Check if it works properly.

SestekASR

available from rel. 2.0.84

Description

This box allows you to do a voice-to-text conversion using the Sestek ASR*

*Internet connection is required for this box to work

The ASR saves the results in three channel variables:

  • SESTEK_ASR_TRANSCRIPT: the result of the dictation recognition

  • SESTEK_ASR_CONFIDENCE: the precision of the recognition, between 0 and 1. Values above 0.8-0.9 usually mean that the dictation has been correctly recognised.

  • SESTEK_ASR_INSTANCE: the instance of the recognition.


Parameters

  • Label: here you can type a brief description

  • Grammar: the grammar that should be used for the recognition. Grammars are specified as text/XML files, and the full path of the file on the local server must be provided, as in the following example:

  • Options: control details about the recognition. Valid options are:

    • p - Profile to use in mrcp.conf

    • i - Digits to allow recognition to be interrupted with. Set this to "none" to allow Sestek to process the DTMF using a DTMF grammar. Otherwise, if "any" or other digits specified, recognition will be interrupted and the digit will be returned to dialplan.

    • f - Filename to play while recognition occurs (if empty or not specified, no file is played)

    • t - The recognition timeout (in milliseconds). This is the total amount of time a caller has to speak.

    • b - Barge-in value (no barge-in=0, ASR engine barge-in=1, Asterisk barge-in=2). Sestek strongly recommends allowing the ASR to perform barge-in instead of Asterisk.

    • gd – The grammar delimiter. Defaults to a comma.

    • ct - The confidence threshold (0.0 - 1.0). If a recognition result has a confidence score below this value, it will be returned as "no match." Defaults to 0.5.

    • sl - The barge-in sensitivity level (0.0 - 1.0). The higher this number, the easier it is to barge-in. Defaults to 0.5.

    • sva - Speed vs. accuracy, set on a scale of 0.0 - 1.0. The higher this number, the faster (and less accurate) recognitions will be. Defaults to 0.5.

    • nb - N-best list length. Defaults to 1; increase this value if you wish to get more answers back from the recognizer.

    • nit - No input timeout. This is the amount of time the caller has to start speaking before the recognizer returns a no-input result.

    • sct - Speech Complete Timeout. This is the amount of time, in milliseconds, Sestek must detect silence after a user stops speaking before the recognizer begins processing the utterance. Set this lower for single word utterances and higher for longer utterances. In most cases, a value of 800 is correct.

    • dit - DTMF interdigit timeout

    • dtt - DTMF terminate timeout

    • dttc - DTMF terminate characters

    • sw - Save Waveform (true/false)

    • nac - new audio channel (true/false)

    • spl - speech language (en-US/en-GB/etc.). If a language is declared in a grammar (it should be) this will be ignored.

    • cdb - clear DTMF buffer (true/false)

    • mt - media type

    • iwu - input waveform URI (only applies to MRCPv2)

    • sint - Speech Incomplete Timeout

    • rm - Recognition Mode

    • hmaxd - hotword max duration

    • hmind - hotword min duration

    • enm - early no match (true/false)

You are not required to supply any options. Multiple options can be provided by joining options with an ampersand, e.g. f=sayHelloWorld&t=5000
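
An options string can be checked before it is entered in the box. The sketch below splits the string on ampersands and validates the options whose value range is documented in the list above (ct, sl and sva between 0.0 and 1.0; sw, nac, cdb and enm as true/false). The function name parse_options is hypothetical, and the checks are an illustration, not the validation performed by the product.

```python
UNIT_RANGE_OPTIONS = {"ct", "sl", "sva"}        # documented as 0.0 - 1.0
BOOLEAN_OPTIONS = {"sw", "nac", "cdb", "enm"}   # documented as true/false


def parse_options(text: str) -> dict[str, str]:
    """Split 'k=v&k=v' into a dict, raising ValueError on malformed or out-of-range values."""
    options: dict[str, str] = {}
    if not text:
        return options  # supplying no options is allowed
    for pair in text.split("&"):
        key, separator, value = pair.partition("=")
        if not key or not separator:
            raise ValueError(f"malformed option: {pair!r}")
        if key in UNIT_RANGE_OPTIONS and not 0.0 <= float(value) <= 1.0:
            raise ValueError(f"{key} must be between 0.0 and 1.0, got {value!r}")
        if key in BOOLEAN_OPTIONS and value not in ("true", "false"):
            raise ValueError(f"{key} must be true or false, got {value!r}")
        options[key] = value
    return options


parsed = parse_options("f=sayHelloWorld&t=5000")
# parsed is {"f": "sayHelloWorld", "t": "5000"}
```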


Exit Arrows

This box provides just one arrow out to the next step.

Remember: this software is managed by others. Check if it works properly.

MRCPRecog

available from rel. 2.5.2

Description

This box allows you to do a voice-to-text conversion using the MRCPRecog application*

*You must install MRCPRecog on a machine that is reachable by your system for this box to work

The ASR saves the results in three channel variables:

  • MRCP_RECOG_TRANSCRIPT: the result of the dictation recognition

  • MRCP_RECOG_CONFIDENCE: the precision of the recognition, between 0 and 1. Values above 0.8-0.9 usually mean that the dictation has been correctly recognised.

  • MRCP_RECOG_INSTANCE: the instance of the recognition.


Parameters

  • Label: here you can type a brief description

  • Grammar: The grammar that should be used for the recognition. Grammars can be specified as text/XML inline for built-in grammars or by using a reference to an external file/URI.

If you need to use multiple grammars, separate each grammar from the next with a delimiter character of your choice, and declare that character in the gd option (e.g. grammar-a%grammar-b, with % defined as the delimiter in the options field gd).

MRCPRecog provides the following built-in grammars:

  

URI                      | Sample Input                                                   | Semantic Interpretation Result
builtin:grammar/boolean  | "yes", "no thank you", etc.                                    | "true" or "false"
builtin:grammar/date     | "january thirteenth" or "december first two thousand"          | "????0113" or "20001201"
builtin:grammar/digits   | "one two three four"                                           | "1234"
builtin:grammar/currency | "eighteen dollars and four cents"                              | "USD18.04"
builtin:grammar/number   | "four hundred point five"                                      | "400.5"
builtin:grammar/phone    | "area code eight five eight seven oh seven oh seven oh seven"  | "8587070707"
builtin:grammar/time     | "six o clock" or "five thirty p m"                             | "0600?" or "0530p"

  

Options:

Options control details about the recognition. Valid options are:

  • p - Profile to use in mrcp.conf

  • i - Digits to allow recognition to be interrupted with. Set this to "none" to allow MRCPRecog to process the DTMF using a DTMF grammar. Otherwise, if "any" or other digits specified, recognition will be interrupted and the digit will be returned to dialplan.

  • f - Filename to play while recognition occurs (if empty or not specified, no file is played)

  • t - The recognition timeout (in milliseconds). This is the total amount of time a caller has to speak.

  • b - Barge-in value (no barge-in=0, ASR engine barge-in=1, Asterisk barge-in=2). MRCPRecog strongly recommends allowing the ASR to perform barge-in instead of Asterisk.

  • gd – The grammar delimiter. Defaults to a comma.

  • ct - The confidence threshold (0.0 - 1.0). If a recognition result has a confidence score below this value, it will be returned as "no match." Defaults to 0.5.

  • sl - The barge-in sensitivity level (0.0 - 1.0). The higher this number, the easier it is to barge-in. Defaults to 0.5.

  • sva - Speed vs. accuracy, set on a scale of 0.0 - 1.0. The higher this number, the faster (and less accurate) recognitions will be. Defaults to 0.5.

  • nb - N-best list length. Defaults to 1; increase this value if you wish to get more answers back from the recognizer.

  • nit - No input timeout. This is the amount of time the caller has to start speaking before the recognizer returns a no-input result.

  • sct - Speech Complete Timeout. This is the amount of time, in milliseconds, MRCPRecog must detect silence after a user stops speaking before the recognizer begins processing the utterance. Set this lower for single word utterances and higher for longer utterances. In most cases, a value of 800 is correct.

  • dit - DTMF interdigit timeout

  • dtt - DTMF terminate timeout

  • dttc - DTMF terminate characters

  • sw - Save Waveform (true/false)

  • nac - new audio channel (true/false)

  • spl - speech language (en-US/en-GB/etc.). If a language is declared in a grammar (it should be) this will be ignored.

  • cdb - clear DTMF buffer (true/false)

  • mt - media type

  • iwu - input waveform URI (only applies to MRCPv2). Not supported by MRCPRecog.

  • sint - Speech Incomplete Timeout. Not supported by MRCPRecog.

  • rm - Recognition Mode. Not supported by MRCPRecog.

  • hmaxd - hotword max duration. Not supported by MRCPRecog.

  • hmind - hotword min duration. Not supported by MRCPRecog.

  • enm - early no match (true/false). Not supported by MRCPRecog.

You are not required to supply any options. Multiple options can be provided by joining options with an ampersand, e.g. f=sayHelloWorld&t=5000


Exit Arrows

This box provides just one arrow out to the next step.

Remember: this software is managed by others. Check if it works properly.

OpenAI Whisper

Description

OpenAI Whisper allows you to do automatic speech recognition and transcription using OpenAI → https://openai.com/research/whisper

*Internet connection is required for this box to work

With OpenAI Whisper, automatic language detection can be used for the transcription


Parameters

  • Label: here you can type a brief description

  • OpenAI Cloud Provider: Select the provider already configured

  • Model: non-editable field; represents the AI model

  • Language: the language you want to use for the transcription

  • Timeout: maximum recording duration in seconds. If set to -1, the timeout is unlimited

  • Interrupt Key: the digits that end the current recording

  • Beep: play a beep before the recording starts

  • Record Speech: select this option to save the voice recording used by OpenAI.

  • Number of seconds of silence: the number of seconds of silence allowed before moving on to the next block, regardless of the interrupt key or timeout

The ASR saves the results in the channel variable: OPENAI_WHISPER_TRANSCRIPT

Moreover, the Asterisk variable OPENAI_WHISPER_TRANSCRIPT_LANGUAGE holds the detected language, which can be used to design the call flow
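
Branching on the detected language can be sketched as a simple lookup with a fallback. This is only an illustration: the function name pick_branch, the branch names and the exact format of the language value (shown here as a lowercase language name) are assumptions, not documented behaviour.

```python
def pick_branch(detected_language: str, branches: dict[str, str], default: str) -> str:
    """Return the branch mapped to the detected language, or the default branch."""
    key = (detected_language or "").strip().lower()
    return branches.get(key, default)


# Hypothetical branch names for an IVR that serves two languages.
branches = {"italian": "queue-it", "english": "queue-en"}
target = pick_branch("Italian", branches, default="queue-default")
```

Keeping a default branch matters because detection may return a language the call flow does not handle.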
