The live ASR (Automatic Speech Recognition) or transcription can be done in two ways:
The address for both is: live.aditu.eus
Number of available transcribers for a language. A JSON message is received at the beginning and then each time the number of available transcribers changes.
Only start live transcription if there are transcribers available, that is, if the number of available transcribers is not 0. Otherwise, the transcription will return an error.
wss://live.aditu.eus/[LANG]/client/ws/status
[LANG] can be either "eu", "es" or "eu-es"
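A minimal sketch in Python of checking transcriber availability before opening a speech connection, using the third-party 'websockets' library; the name of the JSON field that holds the count is an assumption here and may differ in the actual status message:

import asyncio
import json

import websockets  # third-party: pip install websockets

STATUS_URL = "wss://live.aditu.eus/eu/client/ws/status"

async def wait_for_free_transcriber():
    async with websockets.connect(STATUS_URL) as ws:
        # One status message arrives on connect, then one each time the count changes.
        async for message in ws:
            status = json.loads(message)
            available = status.get("num_workers_available", 0)  # assumed field name
            if available > 0:
                return available  # safe to open a /speech connection now

print(asyncio.run(wait_for_free_transcriber()))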
wss://live.aditu.eus/[LANG]/client/ws/speech
[LANG] can be either "eu", "es" or "eu-es"
Starts a live transcription, where a raw audio signal is sent through the websocket and the recognised words and sentences are received.
This is the preferred use mode for continuous transcription of long speech sections. An example is live transcription from microphone input.
The server assumes by default that incoming audio is sent using 16 kHz, mono, 16 bit little-endian format. This can be overridden using the 'content-type' request parameter in the websocket URL.
By default, the server does not perform any unexpansion of numbers, acronyms, abbreviations, etc. This can be overridden by sending an 'unexpand' parameter with the content 'true', in which case numbers, acronyms, abbreviations, etc. will come in their short form.
By default, the server does not punctuate or apply case to the recognized words. This can be overridden by sending a 'punctuationcase' parameter with the content 'true'.
After finishing the live transcription, the audio sent is saved in an audio file and recorded in the user's file logs together with the produced transcription, subtitles, etc. This can be overridden by sending a 'save' parameter with the content 'false', in which case no files will be saved and thus no space will be used (only a log of the time used in the transcription).
content-type (optional) in GStreamer 1.0 caps format; default=audio/x-raw,+layout=(string)interleaved,+rate=(int)16000,+format=(string)S16LE,+channels=(int)1
unexpand (optional); default=false
punctuationcase (optional); default=false
save (optional); default=true
wss://live.aditu.eus/eu/client/ws/speech?content-type=audio/x-raw,+layout=(string)interleaved,+rate=(int)16000,+format=(string)S16LE,+channels=(int)1
wss://live.aditu.eus/eu/client/ws/speech?save=false
wss://live.aditu.eus/eu/client/ws/speech?content-type=audio/x-raw,+layout=(string)interleaved,+rate=(int)16000,+format=(string)S16LE,+channels=(int)1&save=false
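For instance, a short Python sketch that assembles the /speech URL with optional parameters, mirroring the example URLs above:

BASE = "wss://live.aditu.eus/eu/client/ws/speech"
CONTENT_TYPE = ("audio/x-raw,+layout=(string)interleaved,+rate=(int)16000,"
                "+format=(string)S16LE,+channels=(int)1")

# Request parameters are appended as a normal query string.
url = f"{BASE}?content-type={CONTENT_TYPE}&punctuationcase=true&save=false"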
Before starting to send the audio, a text message with the API id and the API key must be sent for authentication; the client must then wait for an authorization confirmation message in JSON format. If there has been a problem with authentication or the account has no time credit left, the JSON message will say so and the websocket must be closed.
After the authorization is received, the sending of the audio can begin.
Audio should be sent to the server in raw blocks of data, using the encoding specified when the session was opened. It is recommended that a new block is sent at least 4 times per second (less frequent blocks would increase the recognition lag). Blocks do not have to be of equal size.
After the last block of audio data, a text message containing the 3-byte ANSI-encoded string "EOS" ("end-of-stream") needs to be sent to the server. This tells the server that no more speech is coming and the recognition can be finalized.
After sending "EOS", the client has to keep the websocket open to receive the final recognition results from the server. Server closes the connection itself when all recognition results have been sent to the client. No more audio can be sent via the same websocket after an "EOS" has been sent. In order to process a new audio stream, a new websocket connection has to be created by the client.
Öþ×ý¸þíþøþ:ÿ ÿ-ÿýÿFÿBÿùõ×ö"úr`ò@~?Å$Ö³!ÀõY3n...
á!â,¼"´*'ðS¸ïuñU@~?Å$Ö³!ÀõY3n´ÿºÿäÿøÿÞÿ·ÿÎÿðÿÍÿ§ÿÖÿ...
...
EOS
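A minimal sketch of this message sequence in Python with the 'websockets' library; the exact format of the authentication text message and of the authorization reply is an assumption here (see the reference client.py below for the authoritative version):

import asyncio
import json

import websockets  # third-party: pip install websockets

URL = "wss://live.aditu.eus/eu/client/ws/speech"
API_ID = "your-api-id"    # placeholder credentials
API_KEY = "your-api-key"

async def transcribe(path):
    async with websockets.connect(URL) as ws:
        # 1. Authenticate and wait for the authorization confirmation.
        await ws.send(f"{API_ID} {API_KEY}")        # assumed message format
        auth = json.loads(await ws.recv())
        print("authorization:", auth)

        # 2. Send raw 16 kHz, mono, S16LE audio in small blocks; 4000 bytes
        #    is 1/8 s of audio, i.e. 8 blocks per second.
        with open(path, "rb") as f:
            while block := f.read(4000):
                await ws.send(block)
                await asyncio.sleep(0.125)          # pace roughly in real time

        # 3. Signal end of stream and read results until the server closes.
        await ws.send("EOS")
        async for message in ws:
            print(json.loads(message))

asyncio.run(transcribe("audio.raw"))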
The server sends recognition results and other information to the client in JSON format. The response can contain the following fields:
The following status codes are currently in use:
The websocket is always closed by the server after sending a non-zero status update.
The server transcribes incoming audio on the fly. For each sentence or audio segment between silences, many non-final hypotheses are sent, followed by one final hypothesis. Non-final hypotheses are used to present partial recognition hypotheses to the client. A sequence of non-final hypotheses is always followed by a final hypothesis for that segment. The final hypothesis overrides the non-final hypotheses for the sentence or segment. The client is responsible for presenting the results to the user in a way suitable for the application.
Likewise, in bilingual live transcription, hypotheses for both languages are sent. The client is responsible for presenting the results to the user in a way suitable for the application.
After sending a final hypothesis for a segment, the server starts decoding the next segment or closes the connection if all audio sent by the client has been processed.
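As a sketch of how a client might consume these results, assuming the responses follow the kaldi-gstreamer-server field layout used by the reference clients (status, result, final, hypotheses, transcript), which is an assumption here:

import json

def handle_message(raw, final_lines):
    msg = json.loads(raw)
    if msg.get("status", 0) != 0:
        raise RuntimeError(f"server reported status {msg['status']}")
    result = msg.get("result")
    if not result:
        return
    text = result["hypotheses"][0]["transcript"]
    if result.get("final"):
        final_lines.append(text)   # final hypothesis: fixes this segment
        print("FINAL:  ", text)
    else:
        print("partial:", text)    # will be overridden by later hypotheses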
https://live.aditu.eus/[LANG]/client/http/recognize
[LANG] can be either "eu", "es" or "eu-es"
Immediately returns the transcription of a short section of speech.
This is the preferred use mode for transcription of pre-segmented sentences. An example is the transcription of a sentence for a smart speaker.
The server assumes by default that incoming audio is sent using 16 kHz, mono, 16 bit little-endian format. This can be overridden using the 'Content-Type' header.
If there is a 'Transfer-Encoding' header set to 'chunked', it indicates that the body is in chunked mode (see what this means below).
The 'unexpand' header indicates whether numbers, acronyms, abbreviations, etc. must come in their short form (defaults to false).
The 'punctuationcase' header indicates whether punctuation signs and case must be assigned (defaults to false).
Authentication credentials can be sent in the headers via the 'api_id' and 'api_key' headers, or in the body (see below).
The 'save' header indicates whether the audio and the transcription are to be saved in the user's file list (defaults to false).
Content-Type (optional) in MIME format; default=audio/x-raw-int; rate=16000
Transfer-Encoding (optional)
unexpand (optional); default=false
punctuationcase (optional); default=false
api_id (optional)
api_key (optional)
save (optional); default=false
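A minimal sketch of a non-chunked request with header authentication, using the Python 'requests' library and a raw 16 kHz, mono, S16LE audio file; the file name and credentials are placeholders:

import requests  # third-party: pip install requests

URL = "https://live.aditu.eus/eu/client/http/recognize"

with open("sentence.raw", "rb") as f:  # placeholder file, raw S16LE 16 kHz mono
    audio = f.read()

response = requests.post(
    URL,
    data=audio,
    headers={
        "Content-Type": "audio/x-raw-int; rate=16000",
        "api_id": "your-api-id",      # placeholder credentials
        "api_key": "your-api-key",
        "punctuationcase": "true",
    },
)
print(response.json())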
If there is a 'Transfer-Encoding' header set to 'chunked', then the body with the audio can be sent in chunks for the transcriber to start work without waiting for the whole audio.
Each chunk of the POST body is composed of: the length of the chunk's data in hexadecimal, a line jump, the chunk's data, and another line jump.
To end the transmission, a final chunk of 0 length must be sent, that is, a binary message with the text '0' plus two line jumps.
Without a 'Transfer-Encoding' header set to 'chunked', the audio is sent plainly in the body.
The authentication can be done in the header (see above), or else at the beginning of the body as explained here.
In chunked mode, a chunk with the API id and the API key must be sent before starting to send the audio.
In non-chunked mode, the API id and the API key followed by a newline must be sent before starting to send the audio.
In chunked mode, audio in raw format is sent in chunks as defined above.
Öþ×ý¸þíþøþ:ÿ ÿ-ÿýÿFÿBÿùõ×ö"úr`ò@~?Å$Ö³!ÀõY3n...
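A minimal sketch of chunked sending in Python: passing a generator to 'requests' makes the library send the body with 'Transfer-Encoding: chunked' and handle the chunk framing (hex length plus line jumps) itself; the layout of the credential chunk is an assumption here:

import requests  # third-party: pip install requests

URL = "https://live.aditu.eus/eu/client/http/recognize"

def chunked_body(path, block_size=4000):
    yield b"your-api-id your-api-key"   # assumed layout of the credential chunk
    with open(path, "rb") as f:
        while block := f.read(block_size):
            yield block                 # raw S16LE 16 kHz mono audio

# A generator body is sent with Transfer-Encoding: chunked automatically.
response = requests.post(URL, data=chunked_body("sentence.raw"))
print(response.json())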
In non-chunked mode, the whole audio in raw format is sent in the body.
Öþ×ý¸þíþøþ:ÿ ÿ-ÿýÿFÿBÿùõ×ö"úr`ò@~?Å$Ö³!ÀõY3n...
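And the same request in non-chunked mode with the credentials at the start of the body, as described above; the separator between the API id and the key is an assumption:

import requests  # third-party: pip install requests

URL = "https://live.aditu.eus/eu/client/http/recognize"

with open("sentence.raw", "rb") as f:
    audio = f.read()

# Credentials followed by a newline, then the whole raw audio.
body = b"your-api-id your-api-key\n" + audio   # assumed id/key separator
response = requests.post(URL, data=body)
print(response.json())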
The server sends recognition results and other information to the client in JSON format. The response can contain the following fields:
The following status codes are currently in use:
In bilingual live transcription, hypotheses for both languages are sent. The client is responsible for presenting the results to the user in a way suitable for the application.
Demo of a web page that captures audio from the microphone and calls the websockets live transcription API in JavaScript: https://live.aditu.eus/js-dictate-example/demo.html. The JavaScript code can be downloaded from there using the browser's developer tools.
It is based on dictate.js: https://kaljurand.github.io/dictate.js, https://github.com/Kaljurand/dictate.js.
Python client that calls the websockets live API with the contents of a raw audio file: https://live.aditu.eus/client.py.
It is based on client.py from kaldi-gstreamer-server: https://github.com/alumae/kaldi-gstreamer-server/blob/master/kaldigstserver/client.py.
React Native code for calling the websocket speech recognition service from a mobile app: https://live.aditu.eus/reactnative_ws.zip
Python clients that call the HTTP POST live API with the contents of a raw audio file:
React Native code for calling the HTTP speech recognition service from a mobile app: https://live.aditu.eus/reactnative_http.zip