Text-to-Voice Conversion

Psyflow supports text-to-speech (TTS) conversion to enhance accessibility and standardize instruction delivery across different languages.

Why it matters: Using text-to-speech improves accessibility—especially for children, elderly individuals, or participants with low literacy. It ensures consistent voice delivery across different language versions and eliminates the need to record human voiceovers for each translation. Moreover, by using standardized synthetic voices, it reduces variability introduced by different experimenters, helping to maintain consistency across sessions and sites.

How It Works: Psyflow uses edge-tts, a Python package that sends text to Microsoft’s cloud-based TTS service and saves the result as audio (MP3). The generated voice files are stored in the assets/ folder, skipped on later runs if they already exist (unless overwrite=True is specified), and registered in the StimBank as new Sound stimuli ready for playback.

Note

An internet connection is required for TTS generation. Offline tools exist but produce lower-quality audio.
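
Under the hood, each conversion boils down to a single edge-tts request per text. The following is a minimal standalone sketch of that kind of call (psyflow performs the equivalent for you; the text, voice, and output path here are illustrative placeholders, not psyflow defaults):

import asyncio

import edge_tts

async def synthesize(text: str, voice: str, out_path: str) -> None:
    # Stream the synthesized speech from Microsoft's service and save it as MP3.
    communicate = edge_tts.Communicate(text, voice)
    await communicate.save(out_path)

# Illustrative values only.
asyncio.run(synthesize("Welcome to the task.", "en-US-AriaNeural", "assets/welcome_voice.mp3"))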

Convert Existing Text Stimuli to Voice

from psyflow import StimBank
stim_bank = StimBank(config)
stim_bank.convert_to_voice(
    keys=["instruction_text", "good_bye"],
    voice="zh-CN-YunyangNeural",
)

This will create audio files such as instruction_text_voice.mp3 in assets/. The resulting voices are registered as instruction_text_voice and good_bye_voice in the StimBank.

Because existing audio files are skipped, delete any previously generated files in the assets/ folder (or pass overwrite=True) whenever you change the text; otherwise the stale audio is reused and your edits will not be heard. Also choose a TTS voice that matches the language of the text to ensure natural, accurate pronunciation. By default, zh-CN-XiaoxiaoNeural is used. A minimal sketch of both options follows.
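
For example, you could clear stale voice files with pathlib before regenerating, or force regeneration directly. The *_voice.mp3 pattern below follows the naming convention described above:

from pathlib import Path

# Remove previously generated voice files so stale audio is not reused.
for old_file in Path("assets").glob("*_voice.mp3"):
    old_file.unlink()

# Or force regeneration in place (overwrite=True is the option described in the notes above).
stim_bank.convert_to_voice(
    keys=["instruction_text", "good_bye"],
    voice="zh-CN-YunyangNeural",
    overwrite=True,
)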


Add Voice from Custom Text

If you want to add an arbitrary line of text as a voice stimulus:

stim_bank.add_voice(
    stim_label="welcome_voice",
    text="ようこそ。タスクを開始します。",
    voice="ja-JP-NanamiNeural"
)

The result will be registered as welcome_voice and available like any other stimulus.
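
As a quick check, you could retrieve and play the new stimulus. The sketch below assumes the StimBank exposes a get() accessor that returns a playable PsychoPy Sound; adjust it to whatever retrieval method your project already uses:

# Hypothetical playback check; assumes stim_bank.get() returns a PsychoPy Sound.
welcome = stim_bank.get("welcome_voice")
welcome.play()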

Voice Selection

Use the built-in helper to explore available voices:

from psyflow import list_supported_voices

# Print all voices
list_supported_voices(human_readable=True)

# Print all Japanese voices
list_supported_voices(filter_lang="ja", human_readable=True)

Sample output:

| ShortName          | Locale | Gender | Personalities      | FriendlyName                                                 |
|--------------------|--------|--------|--------------------|--------------------------------------------------------------|
| af-ZA-AdriNeural   | af-ZA  | Female | Friendly, Positive | Microsoft Adri Online (Natural) - Afrikaans (South Africa)   |
| af-ZA-WillemNeural | af-ZA  | Male   | Friendly, Positive | Microsoft Willem Online (Natural) - Afrikaans (South Africa) |
| sq-AL-AnilaNeural  | sq-AL  | Female | Friendly, Positive | Microsoft Anila Online (Natural) - Albanian (Albania)        |
| sq-AL-IlirNeural   | sq-AL  | Male   | Friendly, Positive | Microsoft Ilir Online (Natural) - Albanian (Albania)         |

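If you prefer to select a voice programmatically, you could filter the returned records yourself. The sketch below assumes that with human_readable=False the helper returns a list of dicts keyed like the columns above (ShortName, Locale, Gender, and so on); verify this against your psyflow version:

from psyflow import list_supported_voices

# Assumption: human_readable=False yields a list of dicts with keys matching the table columns.
voices = list_supported_voices(filter_lang="ja", human_readable=False)
female_ja = [v["ShortName"] for v in voices if v.get("Gender") == "Female"]
print(female_ja)
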
Alternatively, you can consult the full list of supported voices in the edge-tts documentation.

Tips and Caveats

  • Placeholder Limitation: The TTS engine does not resolve dynamic placeholders such as {duration} or {block_num}. If your text contains placeholders, it will not be converted as expected: synthesis may fail or produce unnatural output. Resolve placeholders yourself before conversion (see the sketch after this list).

  • Internet Connection Required: TTS generation relies on Microsoft’s cloud service and requires a stable internet connection. If you’re offline or behind a restrictive network (e.g., with proxy issues), voice generation will fail.

  • Overwrite: Use overwrite=True to regenerate voice files even if they already exist. Use it sparingly, though: with overwrite enabled, the audio is re-synthesized (and re-downloaded) every time you run the task ⚠️.

  • Voice Mismatch: Always match the voice language to the text language to avoid unnatural pronunciation. By default, “zh-CN-XiaoxiaoNeural” is used.

  • Preview Your Audio: Test the generated files in the assets/ folder before running full experiments. An empty or unplayable file can cause the task to fail at runtime; delete it and regenerate the voice file.
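
As noted in the placeholder caveat above, resolve any placeholders yourself before synthesis and pass the final string. The sketch below uses hypothetical values (block_num, duration) and a hypothetical stimulus label:

# Resolve placeholders before synthesis; the values below are illustrative only.
template = "Block {block_num} starts now. It lasts {duration} seconds."
final_text = template.format(block_num=1, duration=120)

stim_bank.add_voice(
    stim_label="block_intro_voice",
    text=final_text,
    voice="en-US-AriaNeural",
)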