Aditu - Elhuyar speech recognition - Documentation

Aditu live API

Live ASR (Automatic Speech Recognition), or transcription, can be done in two ways: via secure websocket connections or via HTTP POST.

The address for both is: live.aditu.eus

Get number of available transcribers (via secure websocket connections)

Returns the number of available transcribers for a language. A JSON message is received on connection and then each time the number of available transcribers changes.

Only start a live transcription if there are transcribers available, that is, if the number of available transcribers is not 0; otherwise, the transcription will return an error.

Websocket connection example

wss://live.aditu.eus/[LANG]/client/ws/status
					

[LANG] can be either "eu", "es" or "eu-es"

Received messages example


Monolingual:

{
	"num_workers_available": 1
}
...
{
	"num_workers_available": 0
}
...
{
	"num_workers_available": 1
}

Bilingual:

{
	"num_workers_available": "1-1"
}
...
{
	"num_workers_available": "1-0"
}
...
{
	"num_workers_available": "0-0"
}
...
{
	"num_workers_available": "0-1"
}
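
A small helper can decide from a status message whether a transcription may start. Note that the documentation above does not state the order of the two counts in the bilingual "N-M" string; this sketch assumes the eu count comes first, and the function name is illustrative:

```python
import json

def workers_available(message, lang="eu"):
    """Return True if at least one transcriber is free for `lang`.

    Handles both the monolingual integer form and the bilingual
    "N-M" string form (assumed here: eu count first, then es).
    """
    value = json.loads(message)["num_workers_available"]
    if isinstance(value, int):          # monolingual endpoint
        return value > 0
    eu_count, es_count = (int(n) for n in value.split("-"))
    return (eu_count if lang == "eu" else es_count) > 0
```
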
					

Live transcription via secure websocket connections


wss://live.aditu.eus/[LANG]/client/ws/speech
				

[LANG] can be either "eu", "es" or "eu-es"

Starts a live transcription: a raw audio signal is sent through the websocket and the recognized words and sentences are received.

This is the preferred use mode for continuous transcription of long speech sections. An example is live transcription from microphone input.

Websocket connection

The server assumes by default that incoming audio is sent in 16 kHz, mono, 16 bit little-endian format. This can be overridden using the 'content-type' request parameter in the websocket URL.

After the live transcription finishes, the audio sent is saved to an audio file and recorded in the user's file logs together with the produced transcription, subtitles, etc. This can be overridden by sending a 'save' parameter with the value 'false', in which case no files are saved and thus no storage space is used (only the time used in the transcription is logged).

Parameters

content-type (optional) in GStreamer 1.0 caps format; default=audio/x-raw,+layout=(string)interleaved,+rate=(int)16000,+format=(string)S16LE,+channels=(int)1
save (optional); default=true
					
Websocket connection examples

wss://live.aditu.eus/eu/client/ws/speech?content-type=audio/x-raw,+layout=(string)interleaved,+rate=(int)16000,+format=(string)S16LE,+channels=(int)1
wss://live.aditu.eus/eu/client/ws/speech?save=false
wss://live.aditu.eus/eu/client/ws/speech?content-type=audio/x-raw,+layout=(string)interleaved,+rate=(int)16000,+format=(string)S16LE,+channels=(int)1&save=false
					
Authentication

Before starting to send the audio, a text message with the API id and the API key must be sent for authentication; the client must then wait for an authorization confirmation message in JSON format. If there has been a problem with the authentication or the account has no time credit left, the JSON message will say so and the websocket must be closed.

Authentication message example

api_id=<YOUR_API_ID> api_key=<YOUR_API_KEY>
					
Authorization confirmation message examples

{
	"status": 0,
	"message": "Authentication OK"
}
					

{
	"status": 1,
	"message": "No workers available"
}
					

{
	"status": 2,
	"message": "No time credit"
}
					

{
	"status": 3,
	"message": "No space left"
}
					

{
	"status": 4,
	"message": "Authentication error: no user with those credentials"
}
					

{
	"status": 5,
	"message": "Authentication error: no database"
}
					

{
	"status": 6,
	"message": "Authentication error: credentials incorrect"
}
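
The authentication exchange above can be wrapped in two small helpers (the function names are illustrative): one builds the text message, the other checks the confirmation status. Any non-zero status means the websocket must be closed.

```python
import json

def auth_message(api_id, api_key):
    """Build the plain-text authentication message sent before any audio."""
    return f"api_id={api_id} api_key={api_key}"

def authorized(confirmation):
    """True when the JSON confirmation has status 0 ("Authentication OK");
    any other status (no workers, no credit, bad credentials, ...) means
    the client must close the websocket."""
    return json.loads(confirmation)["status"] == 0
```
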
					
Sending audio

After the authorization is received, the sending of the audio can begin.

Audio should be sent to the server in raw blocks of data, using the encoding specified when the session was opened. It is recommended that a new block is sent at least 4 times per second (less frequent blocks increase the recognition lag). Blocks do not have to be of equal size.

After the last block of audio data, a text message containing the 3-byte ANSI-encoded string "EOS" ("end-of-stream") needs to be sent to the server. This tells the server that no more speech is coming and the recognition can be finalized.

After sending "EOS", the client has to keep the websocket open to receive the final recognition results from the server. Server closes the connection itself when all recognition results have been sent to the client. No more audio can be sent via the same websocket after an "EOS" has been sent. In order to process a new audio stream, a new websocket connection has to be created by the client.
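
The pacing recommendation above (at least 4 blocks per second) translates directly into a block size: at 16 kHz mono S16LE, 250 ms of audio is 8000 bytes. A minimal sketch, with illustrative name and defaults:

```python
def audio_blocks(pcm, rate=16000, sample_width=2, blocks_per_second=4):
    """Yield raw PCM blocks small enough that sending one per iteration
    delivers at least `blocks_per_second` blocks per second of audio
    (16 kHz mono S16LE -> 8000 bytes per 250 ms block)."""
    block_size = rate * sample_width // blocks_per_second
    for offset in range(0, len(pcm), block_size):
        yield pcm[offset:offset + block_size]
```

After the last block, the client sends the text message "EOS" and keeps the websocket open for the remaining results, as described above.
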

Audio block examples

Öþ×ý¸þíþøþ:ÿ ÿ-ÿýÿFÿBÿ˜ùžõ×ö"úr`ò@~?Å‹$Ö³!ÀõY3n‚...
á!â,¼"´*'ðS¸ïuñU@~?Å‹$Ö³!ÀõY3n‚´ÿºÿäÿøÿÞÿ·ÿÎÿðÿÍÿ§ÿÖÿ...
...
EOS
					
Reading results

Server sends recognition results and other information to the client using the JSON format. The response can contain the following fields:

  • status: response status (integer), see codes below
  • message: (optional) status message
  • result: (optional) recognition result, containing the following fields:
    • hypotheses: recognized words, a list with each item containing the following:
      • transcript: recognized words
      • confidence: (optional) confidence of the hypothesis (float, 0 through 1)
      • likelihood: (optional) likelihood of the hypothesis (float)
      • language: (optional) language of the hypothesis (for bilingual)
    • final: true when the hypothesis is final, i.e., doesn't change any more

The following status codes are currently in use:

  • 0: Success. Usually used when recognition results are sent.
  • 1: No speech. Sent when the incoming audio contains a large portion of silence or non-speech.
  • 2: Aborted. Recognition was aborted for some reason.
  • 9: Not available. Used when all recognizer processes are currently in use and recognition cannot be performed.

Websocket is always closed by the server after sending a non-zero status update.

Server transcribes incoming audio on the fly. For each sentence or audio segment between silences, many non-final hypotheses are sent, followed by one final hypothesis. Non-final hypotheses are used to present partial recognition results to the client. A sequence of non-final hypotheses is always followed by a final hypothesis for that segment, which overrides the non-final hypotheses for that sentence or segment. The client is responsible for presenting the results to the user in a way suitable for the application.

Likewise, in bilingual live transcription, hypotheses for both languages are sent. The client is responsible for presenting the results to the user in a way suitable for the application.

After sending a final hypothesis for a segment, server starts decoding the next segment or closes the connection if all audio sent by the client has been processed.
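
The rule above (a run of non-final partials, then exactly one final hypothesis per segment) suggests a simple client-side policy: keep only the final transcript of each segment. A sketch against the message format shown in the examples (the function name is illustrative):

```python
import json

def final_transcripts(messages):
    """Collect the final hypothesis of each segment, discarding the
    non-final partials that precede it."""
    finals = {}
    for raw in messages:
        msg = json.loads(raw)
        if msg["status"] != 0:
            break  # server closes the socket after a non-zero status
        result = msg.get("result")
        if result and result["final"]:
            finals[msg["segment"]] = result["hypotheses"][0]["transcript"]
    return [finals[segment] for segment in sorted(finals)]
```
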

Recognition results examples

{
	"status": 0,
	"segment": 0,
	"result":
	{
		"hypotheses": [{"transcript": "BERRIZ."}],
		"final": false
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 0,
	"result":
	{
		"hypotheses": [{"transcript": "BERRIZ ERE."}],
		"final": false
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 0,
	"result":
	{
		"hypotheses": [{"transcript": "BERRIZ ERE PROBATZEN."}],
		"final": false
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 0,
	"result":
	{
		"hypotheses": [{"transcript": "BERRIZ ERE PROBATZEN ARI."}],
		"final": false
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 0,
	"segment-start": 1.21,
	"segment-length": 3.29,
	"total-length": 2.1,
	"result":
	{
		"hypotheses": [{"likelihood": 1329.08, "confidence": 0.9999984230216836, "transcript": "BERRIZ ERE PROBATZEN ARI NAIZ."}],
		"final": true
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 1,
	"result":
	{
		"hypotheses": [{"transcript": "BIGARREN."}],
		"final": false
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 1,
	"result":
	{
		"hypotheses": [{"transcript": "BIGARREN ESALDIA."}],
		"final": false
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 1,
	"segment-start": 6.63,
	"segment-length": 8.11,
	"total-length": 1.5,
	"result":
	{
		"hypotheses": [{"likelihood": 713.558, "confidence": 0.9995114164171041, "transcript": "BIGARREN ESALDIA DA HAU."}],
		"final": true
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
					

Live transcription via HTTP POST


https://live.aditu.eus/[LANG]/client/http/recognize
				

[LANG] can be either "eu", "es" or "eu-es"

Immediately returns the transcription of a short section of speech.

This is the preferred use mode for transcription of pre-segmented sentences. An example is the transcription of a sentence for a smart speaker.

Headers

The server assumes by default that incoming audio is sent in 16 kHz, mono, 16 bit little-endian format. This can be overridden using the 'Content-Type' header.

The 'Transfer-Encoding' header should be set to 'chunked'. The audio can be sent in several chunks so that the transcriber can start working without waiting for the whole audio, or it can be sent in a single chunk.

The 'save' header indicates whether the audio and the transcription are to be saved in the user's file list (defaults to false).

Example

Content-Type (optional) in MIME format; default=audio/x-raw-int; rate=16000
Transfer-Encoding (optional); default=chunked
save (optional); default=false
					
POST body

Each chunk of the POST body is composed of:

  • a line with the length of the chunk in hexadecimal format
  • a line with the chunk data itself (binary)

To end the transmission, a final chunk of length 0 must be sent, that is, a binary message with the text '0' followed by two line breaks.
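
Framed this way, each chunk can be produced by a small helper (a sketch; the helper names are illustrative, and CRLF is assumed as the line break, as in standard HTTP chunked transfer encoding):

```python
def encode_chunk(payload):
    """Frame one chunk: the payload length in hexadecimal, a line break,
    the payload bytes, and a trailing line break."""
    return b"%x\r\n" % len(payload) + payload + b"\r\n"

def final_chunk():
    """The zero-length chunk that terminates the body."""
    return b"0\r\n\r\n"
```
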

Authentication

Before starting to send the audio, a chunk with the API id and the API key must be sent for authentication.

Authentication chunk example (remember, it is preceded by a line with the length of the chunk and followed by a line break, and it is binary)

api_id=<YOUR_API_ID> api_key=<YOUR_API_KEY>
					
Sending audio

Audio in raw format is sent in chunks as defined above.
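
Putting authentication, audio, and the terminating chunk together, the full chunked POST body could be assembled like this (a sketch under the framing described above; the function name is illustrative and CRLF line breaks are assumed):

```python
def post_body(api_id, api_key, audio_chunks):
    """Assemble a chunked POST body: the authentication chunk first,
    then each audio chunk, then the terminating zero-length chunk.
    Each chunk is framed as hex length, line break, data, line break."""
    def frame(payload):
        return b"%x\r\n" % len(payload) + payload + b"\r\n"
    body = frame(f"api_id={api_id} api_key={api_key}".encode())
    for chunk in audio_chunks:
        body += frame(chunk)
    return body + b"0\r\n\r\n"
```
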

Audio chunk examples (remember, each is preceded by a line with the length of the chunk and followed by a line break, and it is binary)

Öþ×ý¸þíþøþ:ÿ ÿ-ÿýÿFÿBÿ˜ùžõ×ö"úr`ò@~?Å‹$Ö³!ÀõY3n‚...
					
Response

Server sends recognition results and other information to the client using the JSON format. The response can contain the following fields:

  • status: response status (integer), see codes below
  • message: (optional) status message (only in case of error)
  • hypotheses: (optional) recognized words (only in case of success), a list with each item containing the following:
    • transcript: recognized words
    • confidence: confidence of the hypothesis (float, 0 through 1)
    • likelihood: likelihood of the hypothesis (float)
    • language: (optional) language of the hypothesis (for bilingual)

The following status codes are currently in use:

  • 0: Success. Usually used when recognition results are sent.
  • 1: No speech. Sent when the incoming audio contains a large portion of silence or non-speech.
  • 2: Aborted. Recognition was aborted for some reason.
  • 9: Not available. Used when all recognizer processes are currently in use and recognition cannot be performed.

In bilingual live transcription, hypotheses for both languages are sent. The client is responsible for presenting the results to the user in a way suitable for the application.
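
For the bilingual endpoint, one reasonable presentation policy (not mandated by the API) is to keep the hypothesis with the highest confidence across the per-language responses. A sketch with an illustrative function name:

```python
import json

def best_hypothesis(responses):
    """From the bilingual responses (one per language), return the
    hypothesis dict with the highest confidence, or None."""
    best = None
    for raw in responses:
        msg = json.loads(raw)
        if msg["status"] != 0:
            continue  # skip error responses
        for hyp in msg["result"]["hypotheses"]:
            if best is None or hyp["confidence"] > best["confidence"]:
                best = hyp
    return best
```
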

Recognition results examples


Monolingual:

{
	"status": 0,
	"result": {"hypotheses": [{"likelihood": 1329.08, "confidence": 0.9999984230216836, "transcript": "BERRIZ ERE PROBATZEN ARI NAIZ."}],
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}

Bilingual:

{
	"status": 0,
	"result": {"hypotheses": [{"transcript": "BERRIZ ERE PROBACH EN HARINA.", "confidence": 0.9941423102866637, "likelihood": 1038.07, "language": "es"}], "final": true},
	"id": "9a165b65-b1db-4e5d-9edf-ebd529a22c91"
}
...
{
	"status": 0,
	"result": {"hypotheses": [{"transcript": "BERRIZ ERE PROBATZEN ARI NAIZ.", "confidence": 0.9999984230216836, "likelihood": 1265.48, "language": "eu"}], "final": true},
	"id": "9a165b65-b1db-4e5d-9edf-ebd529a22c91"
}
					

Client software examples

Websockets
JavaScript

Demo of a web page that captures audio from the microphone and calls the websockets live transcription API in JavaScript: https://live.aditu.eus/js-dictate-example/demo.html. The JavaScript code can be downloaded from there using the browser's developer tools.

It is based on dictate.js: https://kaljurand.github.io/dictate.js, https://github.com/Kaljurand/dictate.js.

Python

Python client that calls the websockets live API with the contents of a raw audio file: https://live.aditu.eus/client.py.

It is based on client.py from kaldi-gstreamer-server: https://github.com/alumae/kaldi-gstreamer-server/blob/master/kaldigstserver/client.py.

HTTP POST
Python

Python client that calls the HTTP POST live API with the contents of a raw audio file: https://live.aditu.eus/client_http.py.