The live ASR (Automatic Speech Recognition), or live transcription, is performed over secure websocket connections.
Number of available transcribers for a language. A JSON message is received at the beginning and then each time the number of available transcribers changes.
Only start live transcription if there are transcribers available, that is, if the number of available transcribers is not 0. Otherwise, the transcription will return an error.
[LANG] can be either "eu" or "es"
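As an illustration, here is a minimal Python sketch of this availability check using the websockets library. The endpoint URL and the JSON field name ('available') are assumptions made for the example, not part of the API specification:

```python
import asyncio
import json

import websockets  # pip install websockets

# Hypothetical URL; substitute the real transcriber-count endpoint.
TRANSCRIBERS_URL = "wss://example.com/ws/available_transcribers/eu"

async def wait_for_free_transcriber() -> None:
    """Block until at least one transcriber is available."""
    async with websockets.connect(TRANSCRIBERS_URL) as ws:
        # One JSON message arrives on connect, and a new one arrives
        # each time the number of available transcribers changes.
        async for message in ws:
            info = json.loads(message)
            # "available" is an assumed field name.
            if info.get("available", 0) > 0:
                return  # safe to open a live transcription session

asyncio.run(wait_for_free_transcriber())
```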
Starts a live transcription, where a raw audio signal is sent through the websocket and the recognised words and sentences are received.
The server assumes by default that incoming audio is sent in 16 kHz, mono, 16-bit little-endian format. This can be overridden using the 'content-type' request parameter in the websocket URL.
After the live transcription finishes, the audio sent is saved in an audio file and recorded in the user's file logs together with the produced transcription, subtitles, etc. This can be overridden by sending a 'save' parameter with the value 'false', in which case no files will be saved and no storage space will be used (only the time used in the transcription is logged).
content-type (optional): GStreamer 1.0 caps format; default=audio/x-raw,+layout=(string)interleaved,+rate=(int)16000,+format=(string)S16LE,+channels=(int)1
save (optional): default=true
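For example, the request parameters can be appended to the websocket URL as a query string. The base URL below is a placeholder, and the caps string must be URL-encoded because it contains commas, parentheses and '+' characters:

```python
from urllib.parse import quote

# Placeholder base URL; [LANG] is "eu" or "es".
BASE_URL = "wss://example.com/ws/live/eu"

# Override the default input format (here: 8 kHz mono S16LE)
# and disable server-side saving of audio and transcription files.
caps = ("audio/x-raw,+layout=(string)interleaved,+rate=(int)8000,"
        "+format=(string)S16LE,+channels=(int)1")
url = f"{BASE_URL}?content-type={quote(caps)}&save=false"
```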
Before starting to send the audio, a text message with the API id and the API key must be sent for authentication, and the client must then wait for an authorization confirmation message in JSON format. If there has been a problem with the authentication, or the account has no time credit left, the JSON message will say so and the websocket must be closed.
After the authorization is received, the sending of the audio can begin.
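A sketch of this handshake, under the assumption that the credentials are sent as a single space-separated text message and that the confirmation JSON carries 'success' and 'message' fields; the real message formats are defined by the API, not by this example:

```python
import json

async def authenticate(ws, api_id: str, api_key: str) -> None:
    # The exact layout of the credentials message is an assumption;
    # the docs only require a text message with the API id and key.
    await ws.send(f"{api_id} {api_key}")
    reply = json.loads(await ws.recv())
    # "success" and "message" are assumed field names.
    if not reply.get("success", False):
        await ws.close()
        raise RuntimeError(f"authorization failed: {reply.get('message')}")
```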
Audio should be sent to the server in raw blocks of data, using the encoding specified when the session was opened. It is recommended that a new block be sent at least 4 times per second (less frequent blocks would increase the recognition lag). Blocks do not have to be of equal size.
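With the default format (16 kHz, 16-bit mono) the stream carries 32,000 bytes per second, so sending 4 blocks per second works out to blocks of about 8,000 bytes:

```python
SAMPLE_RATE = 16000     # Hz, the default input format
BYTES_PER_SAMPLE = 2    # 16-bit samples
BLOCKS_PER_SECOND = 4   # recommended minimum send rate

# 16000 * 2 / 4 = 8000 bytes per block (blocks need not be equal-sized).
CHUNK_SIZE = SAMPLE_RATE * BYTES_PER_SAMPLE // BLOCKS_PER_SECOND
```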
After the last block of audio data, a text message containing the 3-byte ANSI-encoded string "EOS" ("end-of-stream") needs to be sent to the server. This tells the server that no more speech is coming and the recognition can be finalized.
After sending "EOS", the client has to keep the websocket open to receive the final recognition results from the server. Server closes the connection itself when all recognition results have been sent to the client. No more audio can be sent via the same websocket after an "EOS" has been sent. In order to process a new audio stream, a new websocket connection has to be created by the client.
Example message sequence: <binary audio block> <binary audio block> ... <binary audio block> "EOS"
The server sends recognition results and other information to the client in JSON format. The response can contain the following fields:
The following status codes are currently in use:
The websocket is always closed by the server after sending a non-zero status update.
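Client-side handling of the status can therefore stay generic. A sketch assuming the response carries a numeric 'status' field where 0 denotes a normal result; both the field names and that convention are assumptions here:

```python
import json

def handle_response(raw):
    """Parse one server message; "status" and "message" are
    assumed field names."""
    response = json.loads(raw)
    if response.get("status", 0) != 0:
        # The server closes the websocket itself after a non-zero
        # status, so the client only needs to surface the error.
        print("recognition ended:", response.get("message", "unknown error"))
        return None
    return response
```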
The server transcribes the incoming audio on the fly. For each sentence or audio segment between silences, many non-final hypotheses are sent, followed by one final hypothesis. Non-final hypotheses are used to present partial recognition results to the client. A sequence of non-final hypotheses is always followed by a final hypothesis for that segment, which overrides the non-final hypotheses for the sentence or segment. The client is responsible for presenting the results to the user in a way suitable for the application.
After sending a final hypothesis for a segment, the server starts decoding the next segment, or closes the connection if all audio sent by the client has been processed.
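One common way to present this in a console client is to overwrite the current line with each non-final hypothesis and commit the text on the final one; the 'transcript' and 'final' field names below are assumptions for the example:

```python
def show(hypothesis: dict, width: int = 80) -> None:
    # "transcript" and "final" are assumed field names.
    text = hypothesis.get("transcript", "")
    if hypothesis.get("final"):
        # Commit the finished segment on its own line,
        # replacing whatever partial text was shown before.
        print("\r" + text.ljust(width))
    else:
        # Overwrite the current line with the newest partial
        # hypothesis; padding clears leftover characters.
        print("\r" + text.ljust(width), end="", flush=True)
```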