Aditu live API

The live ASR (Automatic Speech Recognition), or transcription, is done via secure websocket connections.

URL

wss://live.aditu.eus

Get number of available transcribers

Returns the number of available transcribers for a language. A JSON message is received when the connection is opened and then again each time the number of available transcribers changes.

Only start a live transcription if there are transcribers available, that is, if the number of available transcribers is not 0. Otherwise, the transcription will return an error.

Websocket connection example

wss://live.aditu.eus/[LANG]/client/ws/status
					

[LANG] can be either "eu" (Basque) or "es" (Spanish)

Received messages example

{
	"num_workers_available": 1
}
...
{
	"num_workers_available": 0
}
...
{
	"num_workers_available": 1
}
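
For instance, a minimal Python sketch that connects to the status endpoint and waits until at least one transcriber is free could look like this (it assumes the third-party websockets package; the function name wait_for_worker is illustrative):

import asyncio
import json
import websockets  # third-party package, assumed to be installed


async def wait_for_worker(lang="eu"):
    # Read availability messages until at least one transcriber is free.
    url = f"wss://live.aditu.eus/{lang}/client/ws/status"
    async with websockets.connect(url) as ws:
        async for message in ws:
            if json.loads(message).get("num_workers_available", 0) > 0:
                return


asyncio.run(wait_for_worker("eu"))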
					

Live transcription


wss://live.aditu.eus/[LANG]/client/ws/speech
				

[LANG] can be either "eu" (Basque) or "es" (Spanish)

Starts a live transcription, where a raw audio signal is sent through the websocket and the recognized words and sentences are received.

Websocket connection

The server assumes by default that incoming audio is sent in 16 kHz, mono, 16-bit little-endian format. This can be overridden using the 'content-type' request parameter in the websocket URL.

After the live transcription finishes, the audio sent is saved in an audio file and recorded in the user's file logs together with the produced transcription, subtitles, etc. This can be overridden by sending a 'save' parameter with the value 'false', in which case no files will be saved and thus no storage space will be used (only a log of the time used in the transcription).

Parameters

content-type (optional) in GStreamer 1.0 caps format; default=audio/x-raw,+layout=(string)interleaved,+rate=(int)16000,+format=(string)S16LE,+channels=(int)1
save (optional); default=true
					
Websocket connection examples

wss://live.aditu.eus/eu/client/ws/speech?content-type=audio/x-raw,+layout=(string)interleaved,+rate=(int)16000,+format=(string)S16LE,+channels=(int)1
wss://live.aditu.eus/eu/client/ws/speech?save=false
wss://live.aditu.eus/eu/client/ws/speech?content-type=audio/x-raw,+layout=(string)interleaved,+rate=(int)16000,+format=(string)S16LE,+channels=(int)1&save=false
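
As an illustration, the connection URL with its optional parameters can be assembled with plain string formatting. The following Python sketch mirrors the examples above (the helper name speech_url is hypothetical):

LANG = "eu"
CONTENT_TYPE = ("audio/x-raw,+layout=(string)interleaved,+rate=(int)16000,"
                "+format=(string)S16LE,+channels=(int)1")


def speech_url(lang=LANG, content_type=CONTENT_TYPE, save=True):
    # Build the live transcription URL with the optional parameters.
    url = f"wss://live.aditu.eus/{lang}/client/ws/speech?content-type={content_type}"
    if not save:
        url += "&save=false"
    return url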
					
Authentication

Before starting to send the audio, a text message with the API id and the API key must be sent for authentication, and the client must then wait for an authorization confirmation message in JSON format. If there has been a problem with the authentication or the account has no time credit left, the JSON message will say so and the websocket must be closed.

Authentication message example

api_id=<YOUR_API_ID> api_key=<YOUR_API_KEY>
					
Authorization confirmation message examples

{
	"status": 0,
	"message": "Authentication OK"
}
					

{
	"status": 1,
	"message": "No workers available"
}
					

{
	"status": 2,
	"message": "No time credit"
}
					

{
	"status": 3,
	"message": "No space left"
}
					

{
	"status": 4,
	"message": "Authentication error: no user with those credentials"
}
					

{
	"status": 5,
	"message": "Authentication error: no database"
}
					

{
	"status": 6,
	"message": "Authentication error: credentials incorrect"
}
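
For example, a Python sketch of the authentication step over an already opened websocket could be the following (it assumes the third-party websockets package; the function name authenticate is illustrative):

import json


async def authenticate(ws, api_id, api_key):
    # Send the authentication text message before any audio.
    await ws.send(f"api_id={api_id} api_key={api_key}")
    # Wait for the authorization confirmation in JSON format.
    reply = json.loads(await ws.recv())
    if reply.get("status") != 0:
        # Authentication problem, no workers or no credit: close the websocket.
        await ws.close()
        raise RuntimeError(reply.get("message", "authorization failed"))
    return reply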
					
Sending audio

After the authorization confirmation is received, the client can begin sending audio.

Audio should be sent to the server in raw blocks of data, using the encoding specified when the session was opened. It is recommended that a new block be sent at least 4 times per second (less frequent blocks would increase the recognition lag). Blocks do not have to be of equal size.

After the last block of audio data, a text message containing the 3-byte ANSI-encoded string "EOS" ("end-of-stream") needs to be sent to the server. This tells the server that no more speech is coming and the recognition can be finalized.

After sending "EOS", the client has to keep the websocket open to receive the final recognition results from the server. The server closes the connection itself when all recognition results have been sent to the client. No more audio can be sent via the same websocket after an "EOS" has been sent; in order to process a new audio stream, the client has to create a new websocket connection.
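
A minimal Python sketch of this sending loop, assuming raw 16 kHz mono S16LE audio read from a file and the third-party websockets package, could look like this:

import asyncio

CHUNK_SECONDS = 0.25          # at least 4 blocks per second, as recommended
BYTES_PER_SECOND = 16000 * 2  # 16 kHz, mono, 16-bit samples
CHUNK_BYTES = int(BYTES_PER_SECOND * CHUNK_SECONDS)


async def send_audio(ws, pcm_path):
    # Stream the raw audio in small binary blocks, roughly in real time.
    with open(pcm_path, "rb") as f:
        while True:
            block = f.read(CHUNK_BYTES)
            if not block:
                break
            await ws.send(block)                # binary frame with raw audio
            await asyncio.sleep(CHUNK_SECONDS)  # pacing; blocks need not be equal in size
    # Signal the end of the stream; no more audio may be sent on this websocket.
    await ws.send("EOS")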

Audio block examples

Öþ×ý¸þíþøþ:ÿ ÿ-ÿýÿFÿBÿ˜ùžõ×ö"úr`ò@~?Å‹$Ö³!ÀõY3n‚...
á!â,¼"´*'ðS¸ïuñU@~?Å‹$Ö³!ÀõY3n‚´ÿºÿäÿøÿÞÿ·ÿÎÿðÿÍÿ§ÿÖÿ...
...
EOS
					
Reading results

The server sends recognition results and other information to the client in JSON format. The response can contain the following fields:

  • status: response status (integer), see codes below
  • message: (optional) status message
  • result: (optional) recognition result, containing the following fields:
    • hypotheses: recognized words, a list with each item containing the following:
      • transcript: recognized words
      • confidence: (optional) confidence of the hypothesis (float, 0 through 1)
    • final: true when the hypothesis is final, i.e., doesn't change any more

The following status codes are currently in use:

  • 0: Success. Usually used when recognition results are sent.
  • 1: No speech. Sent when the incoming audio contains a large portion of silence or non-speech.
  • 2: Aborted. Recognition was aborted for some reason.
  • 9: Not available. Used when all recognizer processes are currently in use and recognition cannot be performed.

The websocket is always closed by the server after sending a non-zero status update.

The server transcribes incoming audio on the fly. For each sentence or audio segment between silences, many non-final hypotheses are sent, followed by one final hypothesis. Non-final hypotheses are used to present partial recognition hypotheses to the client. A sequence of non-final hypotheses is always followed by a final hypothesis for that segment, and the final hypothesis overrides the non-final hypotheses for the sentence or segment. The client is responsible for presenting the results to the user in a way suitable for the application.

After sending a final hypothesis for a segment, the server starts decoding the next segment, or closes the connection if all audio sent by the client has been processed.
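
As an illustration, a Python sketch that reads results from the websocket, printing partial hypotheses and keeping the final one per segment, could be the following (it assumes the third-party websockets package; the function name read_results is illustrative):

import json


async def read_results(ws):
    # Consume JSON messages until the server closes the connection.
    async for message in ws:
        msg = json.loads(message)
        if msg.get("status") != 0:
            print("recognition stopped:", msg.get("message", ""))
            break
        result = msg.get("result")
        if not result:
            continue
        transcript = result["hypotheses"][0]["transcript"]
        if result.get("final"):
            print("final:", transcript)    # replaces the partials for this segment
        else:
            print("partial:", transcript)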

Recognition results examples

{
	"status": 0,
	"segment": 0,
	"result":
	{
		"hypotheses": [{"transcript": "BERRIZ."}],
		"final": false
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 0,
	"result":
	{
		"hypotheses": [{"transcript": "BERRIZ ERE."}],
		"final": false
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 0,
	"result":
	{
		"hypotheses": [{"transcript": "BERRIZ ERE PROBATZEN."}],
		"final": false
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 0,
	"result":
	{
		"hypotheses": [{"transcript": "BERRIZ ERE PROBATZEN ARI."}],
		"final": false
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 0,
	"segment-start": 1.21,
	"segment-length": 3.29,
	"total-length": 2.1,
	"result":
	{
		"hypotheses": [{"likelihood": 1329.08, "confidence": 0.9999984230216836, "transcript": "BERRIZ ERE PROBATZEN ARI NAIZ."}],
		"final": true
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 1,
	"result":
	{
		"hypotheses": [{"transcript": "BIGARREN."}],
		"final": false
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 1,
	"result":
	{
		"hypotheses": [{"transcript": "BIGARREN ESALDIA."}],
		"final": false
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 1,
	"segment-start": 6.63,
	"segment-length": 8.11,
	"total-length": 1.5,
	"result":
	{
		"hypotheses": [{"likelihood": 713.558, "confidence": 0.9995114164171041, "transcript": "BIGARREN ESALDIA DA HAU."}],
		"final": true
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
					

Client software example

https://live.aditu.eus/js-dictate-example/demo.html

Demo of a web page that captures audio from the microphone and calls the live transcription API from JavaScript. It is based on dictate.js: https://kaljurand.github.io/dictate.js, https://github.com/Kaljurand/dictate.js.