Python SDK for Kotoba speech APIs — REST batch transcription and streaming ASR, TTS, and speech-to-speech translation over WebSockets.
Phase-1 alpha. See docs/quickstart.md.
pip install kotoba-sdk

Or from a checkout:

git clone https://github.com/kotoba-tech/kotoba-python.git
cd kotoba-python
uv venv
uv pip install -e .

Python ≥ 3.10. Optional mic extra (pip install 'kotoba-sdk[mic]') installs sounddevice for live-microphone examples.
The SDK reads configuration from these env vars only — set the ones for the routes you actually need:
| Variable | Purpose |
|---|---|
| KOTOBA_API_KEY | Bearer token sent as Authorization: Bearer … (REST + WS) |
| KOTOBA_ASR_REST_URL | REST API base URL including version prefix, e.g. https://.../v1 |
| KOTOBA_ASR_URL | WebSocket URL for live ASR, e.g. wss://.../asr |
| KOTOBA_TTS_JA_URL | WebSocket URL for Japanese TTS, e.g. wss://.../tts |
| KOTOBA_S2ST_EN_JA_URL | WebSocket URL for English-to-Japanese speech translation |

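
Because only the variables for the routes you use need to be set, a small pre-flight check can report what is missing before a client is built. The sketch below is not part of the SDK: `REQUIRED_ENV` and `missing_env` are hypothetical names, and the grouping of variables per route is simply read off the table above.

```python
import os

# Hypothetical helper, not part of kotoba-sdk. The variables each route needs
# are taken from the table above; KOTOBA_API_KEY is needed by every route.
REQUIRED_ENV = {
    "asr_rest": ("KOTOBA_API_KEY", "KOTOBA_ASR_REST_URL"),
    "asr_ws": ("KOTOBA_API_KEY", "KOTOBA_ASR_URL"),
    "tts_ja": ("KOTOBA_API_KEY", "KOTOBA_TTS_JA_URL"),
    "s2st_en_ja": ("KOTOBA_API_KEY", "KOTOBA_S2ST_EN_JA_URL"),
}


def missing_env(route, environ=None):
    """Return the names of unset or empty variables for one route."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV[route] if not env.get(name)]


if __name__ == "__main__":
    for route in REQUIRED_ENV:
        missing = missing_env(route)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{route}: {status}")
```

Passing a plain dict as `environ` makes the check easy to exercise in tests without touching the real environment.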
You can also register routes from code:
import kotoba
kotoba.register_endpoint("tts", None, "ko", "wss://.../tts")

URLs passed explicitly via url=... on a call take precedence over the registry.
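
The precedence described here (an explicit `url=` first, then the registry) can be pictured with a toy resolver. This is a sketch under assumptions, not the SDK's implementation: `_ROUTES`, `register`, and `resolve` are stand-ins, `LookupError` stands in for the SDK's `UnsupportedRouteError`, and the `example.invalid` URLs are placeholders.

```python
# Toy model of route resolution; every name here is a stand-in, not SDK API.
_ROUTES = {}  # (modality, src, tgt) -> URL


def register(modality, src, tgt, url):
    """Record a URL for one (modality, src, tgt) route."""
    _ROUTES[(modality, src, tgt)] = url


def resolve(modality, src, tgt, url=None):
    """An explicit url wins; otherwise fall back to the registry."""
    if url is not None:
        return url
    try:
        return _ROUTES[(modality, src, tgt)]
    except KeyError:
        # Stand-in for the SDK's UnsupportedRouteError.
        raise LookupError(f"no URL registered for {(modality, src, tgt)}") from None


register("tts", None, "ko", "wss://example.invalid/tts-ko")
print(resolve("tts", None, "ko"))
print(resolve("tts", None, "ko", url="wss://example.invalid/override"))
```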
import kotoba

client = kotoba.KotobaClient()  # reads KOTOBA_API_KEY + KOTOBA_*_URL from env

# 1) Speech recognition (REST batch, the default for files)
result = client.asr.transcribe(
    "examples/audio/ja/example.mp3", language="ja"
)
print(result.text)

# 2) Text-to-Speech (Japanese, default speaker)
audio = client.tts.synthesize("こんにちは、世界。", language="ja")
audio.to_wav("hello.wav")

# 3) Speech-to-Speech translation (English -> Japanese)
translated = client.s2st.translate(
    "examples/audio/en/example.mp3", src="en", tgt="ja"
)
translated.to_wav("translated.wav")
print(translated.transcript_source)

KotobaClient() reads its credentials and URLs from env vars. Pass them explicitly if you'd rather not rely on the environment:
client = kotoba.KotobaClient(
    api_key="sk_...",
    url="https://.../v1",  # REST base
    asr_ws_url="wss://.../asr",
    tts_ja_ws_url="wss://.../tts",
    s2st_en_ja_ws_url="wss://.../sts",
)

ASR, TTS, and S2ST are all streaming-first. Audio chunks and partial transcripts surface the moment the server emits them, so you can play / display incrementally instead of waiting for the full response.
ASR accepts a generator of PCM16 chunks on the input side and streams transcript deltas back as the audio arrives (feed + drain run concurrently). TTS sends the full text in one frame and streams audio chunks back as the server produces them:
# ASR: pcm16 bytes in -> transcript deltas out
# (mic_chunks() stands for your own generator of PCM16 byte chunks)
for delta in client.asr.transcribe_stream(mic_chunks(), language="ja"):
    print(delta, end="", flush=True)

# TTS: full text in -> pcm audio chunks streamed out
# (speaker stands for your own audio output object)
for pcm in client.tts.synthesize_stream("こんにちは、世界。", language="ja"):
    speaker.write(pcm)

import asyncio
import kotoba

async def main():
    client = kotoba.AsyncKotobaClient()
    async with client.tts.stream(language="ja") as session:
        await session.synthesize("こんにちは。本日はよろしくお願いします。")
        async for event in session:
            if event.type == "audio_chunk":
                await play(event.audio)  # play() stands for your own audio sink
            elif event.type == "done":
                break

asyncio.run(main())

import kotoba

client = kotoba.KotobaClient()
with client.s2st.stream(src="en", tgt="ja") as session:
    # pcm16_chunks_from_mic() stands for your own microphone capture generator
    for chunk in pcm16_chunks_from_mic():
        session.send_audio(chunk)
    session.commit()
    for event in session:
        if event.type == "partial_transcript":
            print(event.text, end="", flush=True)
        elif event.type == "audio_chunk":
            speaker.write(event.audio)
        elif event.type == "done":
            break

The sync wrapper runs an asyncio loop on a background daemon thread, so the underlying transport is identical; only the call style differs.
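
The background-loop pattern mentioned above is a general asyncio technique. A minimal stdlib-only sketch of it follows; it assumes nothing about the SDK's internals, and `BackgroundLoop` and `fake_request` are hypothetical names used only for illustration.

```python
import asyncio
import threading


class BackgroundLoop:
    """Run an asyncio event loop on a daemon thread and call into it synchronously."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro, timeout=None):
        # Schedule the coroutine on the loop thread and block for its result.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout)

    def close(self):
        # Stop the loop from its own thread, then wait for the thread to exit.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fake_request(text):
    # Stand-in for an async network call.
    await asyncio.sleep(0.01)
    return text.upper()


runner = BackgroundLoop()
print(runner.run(fake_request("hello")))  # HELLO
runner.close()
```

The calling thread only ever blocks on a `concurrent.futures.Future`, which is why a sync facade built this way can share one async transport.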
| Module | What |
|---|---|
| kotoba.KotobaClient / AsyncKotobaClient | Top-level entry point |
| client.asr.transcribe(path, ...) | REST batch transcription with optional with_timestamps=True |
| client.asr.stream(...) / transcribe_stream(iter) | Streaming ASR (Japanese, English) over WebSocket |
| client.tts.stream(...) / synthesize(...) / synthesize_stream(...) | Streaming TTS (Japanese) |
| client.s2st.stream(...) / translate(...) | Streaming speech-to-speech translation |
| kotoba.register_endpoint(...) | Add (modality, src, tgt) -> URL routes |
| kotoba.audio.* | PCM16 / float32 WAV helpers |

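
For readers unfamiliar with the two sample formats named in the last row, here is a stdlib-only sketch of the kind of conversion such helpers typically perform: float samples in [-1.0, 1.0] to little-endian PCM16, then a WAV container. It does not show the `kotoba.audio` API; the function names and the 24 kHz default sample rate are assumptions made for the example.

```python
import io
import struct
import wave


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def pcm16_to_wav_bytes(pcm, sample_rate=24000, channels=1):
    """Wrap raw PCM16 bytes in a WAV container and return the file contents."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)  # assumed rate; use whatever your audio has
        wav.writeframes(pcm)
    return buf.getvalue()


pcm = float_to_pcm16([0.0, 0.25, -1.0, 2.0])  # 2.0 is clipped to 1.0
wav_bytes = pcm16_to_wav_bytes(pcm)
print(len(pcm), wav_bytes[:4])
```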
Each example under examples/ is runnable with uv run examples/<file>.py and uses bundled audio under examples/audio/ by default.
| File | What it shows | Required env |
|---|---|---|
| asr_rest_sync.py | REST batch transcription with with_timestamps=True, sync | KOTOBA_API_KEY, KOTOBA_ASR_REST_URL |
| asr_rest_async.py | Same, async with AsyncKotobaClient context manager | KOTOBA_API_KEY, KOTOBA_ASR_REST_URL |
| asr_stream_async.py | Live ASR via transcribe_stream(generator) with first-token-latency measurement | KOTOBA_API_KEY, KOTOBA_ASR_URL |
| tts_synthesize_sync.py | One-shot TTS with explicit speaker_id | KOTOBA_API_KEY, KOTOBA_TTS_JA_URL |
| tts_stream_async.py | One-shot text in → streamed audio chunks with first-audio-latency timing | KOTOBA_API_KEY, KOTOBA_TTS_JA_URL |
| s2st_stream_async.py | File in → live transcript + translated WAV out | KOTOBA_API_KEY, KOTOBA_S2ST_EN_JA_URL |
| s2st_mic_async.py | Live microphone in → translated WAV out (Ctrl-C to stop). Requires pip install 'kotoba-sdk[mic]' and PortAudio. | KOTOBA_API_KEY, KOTOBA_S2ST_EN_JA_URL |

REST is shown in both sync + async because the context-manager pattern matters for resource cleanup. Streaming examples are async-by-default — wrap with kotoba.KotobaClient() for sync (the snippets above show the conversion).
KotobaClient(
    *,
    api_key: str | None = None,            # KOTOBA_API_KEY
    url: str | None = None,                # KOTOBA_ASR_REST_URL (REST)
    asr_ws_url: str | None = None,         # KOTOBA_ASR_URL (WS ASR)
    tts_ja_ws_url: str | None = None,      # KOTOBA_TTS_JA_URL (WS TTS)
    s2st_en_ja_ws_url: str | None = None,  # KOTOBA_S2ST_EN_JA_URL
    timeout: float = 30.0,                 # per-request HTTP timeout (s)
    max_retries: int = 3,                  # for 429/5xx and network errors
)

Exposes:

- .asr: ASRClient / AsyncASRClient (REST + WS)
- .tts: TTSClient / AsyncTTSClient (WS)
- .s2st: S2STClient / AsyncS2STClient (WS)

The async variant supports async with … and exposes await client.close().
transcribe(
    audio_file_path: str | Path,
    *,
    language: str = "ja",
    with_timestamps: bool = False,   # ask server for per-segment timestamps
    poll_interval: float = 1.0,      # initial GET polling interval (s)
    poll_backoff: float = 1.5,       # multiplied each poll
    max_poll_interval: float = 10.0,
    timeout: float = 1200.0,         # overall deadline for job completion
) -> TranscriptResult

POSTs the file, polls GET /transcription_jobs/{id} with exponential backoff, returns the final transcript. Raises TranscriptionError on server-reported failure, TimeoutError if the deadline elapses.
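
To make the polling parameters concrete, the sketch below computes the wait times implied by the defaults above (start at 1.0 s, multiply by 1.5, cap at 10.0 s). `poll_intervals` is a hypothetical helper; whether the SDK applies the cap in exactly this way, or adds jitter, is an assumption.

```python
import itertools


def poll_intervals(poll_interval=1.0, poll_backoff=1.5, max_poll_interval=10.0):
    """Yield successive wait times: grow geometrically, then stay at the cap."""
    delay = min(poll_interval, max_poll_interval)
    while True:
        yield delay
        delay = min(delay * poll_backoff, max_poll_interval)


first = list(itertools.islice(poll_intervals(), 8))
print(first)  # [1.0, 1.5, 2.25, 3.375, 5.0625, 7.59375, 10.0, 10.0]
```

With these defaults the interval reaches the 10 s cap on the seventh poll, so a long job settles into a steady one-request-per-10-seconds rhythm until the overall timeout.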
When with_timestamps=True, TranscriptResult.segments is populated with [Segment(text, start, end), …].
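
One common use of per-segment timestamps is subtitle export. The sketch below renders segments as SRT cues. The `Segment` dataclass is a stand-in carrying the three fields named above, not the SDK's class, and it assumes `start` and `end` are in seconds, which the text above does not state.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Stand-in with the three fields named above; not the SDK's class."""
    text: str
    start: float  # assumed to be seconds
    end: float    # assumed to be seconds


def srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(f"{index}\n{srt_time(seg.start)} --> {srt_time(seg.end)}\n{seg.text}\n")
    return "\n".join(cues)


print(to_srt([Segment("こんにちは", 0.0, 1.25), Segment("世界", 1.25, 2.5)]))
```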
client.asr.submit_job(path, language="ja") -> JobIDResponse  # POST
client.asr.get_job(job_id) -> JobStatus                      # GET, 202→processing

JobStatus.state is one of JobState.processing | done | error. For done, read .transcription; for error, read .error_message.
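
If you drive the job API yourself instead of calling transcribe(), the loop looks roughly like the sketch below. It is written against stand-ins so it runs without the SDK: `JobState` and `JobStatus` here only mirror the states and fields named above, `wait_for_job` is a hypothetical helper, and `RuntimeError` stands in for whatever exception you prefer to raise on failure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JobState(Enum):  # stand-in mirroring the three states named above
    processing = "processing"
    done = "done"
    error = "error"


@dataclass
class JobStatus:  # stand-in; field names follow the text above
    state: JobState
    transcription: Optional[str] = None
    error_message: Optional[str] = None


def wait_for_job(get_job, job_id, sleep, interval=1.0, max_polls=100):
    """Poll get_job(job_id) until the job leaves the processing state."""
    for _ in range(max_polls):
        status = get_job(job_id)
        if status.state is JobState.done:
            return status.transcription
        if status.state is JobState.error:
            raise RuntimeError(status.error_message)
        sleep(interval)
    raise TimeoutError(f"job {job_id} still processing after {max_polls} polls")


# Demo with a fake get_job that finishes on the third poll.
replies = iter([
    JobStatus(JobState.processing),
    JobStatus(JobState.processing),
    JobStatus(JobState.done, transcription="hello"),
])
print(wait_for_job(lambda job_id: next(replies), "job-1", sleep=lambda s: None))
```

With a real client you would presumably pass `client.asr.get_job` and `time.sleep` in place of the fakes.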
client.asr.stream(language="ja", url=...) -> ASRSession
client.asr.transcribe_stream(audio_iter, ...) -> Iterator[str]
client.tts.stream(language="ja", speaker_id=..., url=...) -> TTSSession
client.tts.synthesize_stream(text, ...) -> Iterator[bytes]
client.tts.synthesize(text, ...) -> AudioResult
client.s2st.stream(src="en", tgt="ja", url=...) -> S2STSession
client.s2st.translate(path, src="en", tgt="ja") -> S2STResult

URLs resolve from the per-route env vars (KOTOBA_ASR_URL, KOTOBA_TTS_JA_URL, KOTOBA_S2ST_EN_JA_URL) unless passed explicitly with url=.
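
transcribe_stream takes an iterator of audio chunks. One way to produce such an iterator from a 16-bit WAV file, using only the standard library, is sketched below. `pcm16_chunks` is a hypothetical helper, the 100 ms chunk size is an arbitrary choice, and the sample rate and channel layout the server expects are not covered here.

```python
import io
import wave


def pcm16_chunks(wav_file, chunk_ms=100):
    """Yield raw PCM16 byte chunks of about chunk_ms each from a WAV file."""
    with wave.open(wav_file, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        frames_per_chunk = max(1, wav.getframerate() * chunk_ms // 1000)
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield data


# Demo on an in-memory WAV: 0.25 s of silence at 16 kHz mono.
buf = io.BytesIO()
with wave.open(buf, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 4000)
buf.seek(0)
sizes = [len(chunk) for chunk in pcm16_chunks(buf)]
print(sizes)  # [3200, 3200, 1600]
```

Because the function is a generator, nothing is read until the consumer asks for the next chunk, which suits a streaming call that feeds audio while it drains results.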
All inherit from kotoba.KotobaError:

| Exception | When |
|---|---|
| AuthError | HTTP 401/403, WS auth rejection |
| ProtocolError | Other 4xx, or a server error frame violating the contract |
| APIError | Transport or 5xx that exhausted retries |
| TimeoutError | HTTP timeout, WS handshake timeout, or transcribe() polling deadline exceeded |
| JobNotFoundError | GET returned 404 |
| TranscriptionError | Job completed in error state |
| UnsupportedRouteError | No WS URL registered for the requested (modality, src, tgt) |

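
Read as a decision procedure for HTTP responses, the table above amounts to something like the sketch below. `classify_status` is a hypothetical helper that returns exception names as strings so it runs without the SDK. Two points are assumptions, not statements from the table: that a 429 which exhausts its retries ends up as APIError like a 5xx, and that 404 maps to JobNotFoundError on any GET instead of only the job lookup.

```python
def classify_status(status):
    """Map an HTTP status code to the exception name suggested by the table above."""
    if status in (401, 403):
        return "AuthError"
    if status == 404:
        return "JobNotFoundError"  # assumption: applies to the job GET
    if 500 <= status <= 599:
        return "APIError"  # raised only after retries are exhausted
    if status == 429:
        return "APIError"  # assumption: treated like 5xx once retries run out
    if 400 <= status <= 499:
        return "ProtocolError"
    return None  # not an error status


for code in (200, 401, 404, 422, 429, 503):
    print(code, classify_status(code))
```

Since every exception inherits from kotoba.KotobaError, a single `except kotoba.KotobaError` clause placed after the more specific ones is enough as a catch-all.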
Both sync and async clients retry on network errors, 429, and 5xx with exponential backoff. Retry-After headers on 429 are honored (async client). 4xx other than 429 raise immediately.
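
The retry policy described above can be sketched as two small functions. The names and the numeric defaults (`base`, `factor`, `cap`) are assumptions chosen for illustration; the SDK's actual delays are not documented here. Only the numeric (seconds) form of Retry-After is handled.

```python
def should_retry(status):
    """Retry on 429 and 5xx; other 4xx statuses are raised immediately."""
    return status == 429 or 500 <= status <= 599


def retry_delay(attempt, retry_after=None, base=0.5, factor=2.0, cap=30.0):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled in this sketch
    return min(cap, base * factor ** attempt)


print([retry_delay(n) for n in range(4)])  # [0.5, 1.0, 2.0, 4.0]
print(retry_delay(1, retry_after="7"))     # 7.0
```

A server-provided Retry-After takes priority over the computed backoff because the server knows when capacity will return; the exponential schedule is the fallback when no hint is given.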
uv venv
uv pip install -e ".[dev]"
uv run pytest