Meta has unveiled Meta Spirit LM, an open-source multimodal language model focused on the seamless integration of speech and text.
The model addresses a shortcoming of existing speech pipelines, which typically rely on automatic speech recognition (ASR) to transcribe the input, a large language model (LLM) to generate a text response, and text-to-speech (TTS) to convert it back to audio. Such cascaded methods often lose the expressive qualities of speech.
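To make the limitation concrete, here is a minimal sketch of such a cascaded pipeline, not Meta's code: all three stages are toy stubs, and the point is simply that expressive cues such as pitch and emotion are discarded at the ASR step and never recovered.

```python
# A toy cascaded ASR -> LLM -> TTS pipeline (illustrative stubs only).

def asr(audio: bytes) -> str:
    return "hello"                   # transcription drops tone and emotion

def llm(text: str) -> str:
    return text + ", how are you?"   # the LLM only ever sees plain text

def tts(text: str) -> bytes:
    return text.encode()             # re-synthesised with default prosody

output_audio = tts(llm(asr(b"<input speech>")))
```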
Meta Spirit LM employs a word-level interleaving method during training, utilising both speech and text datasets to facilitate cross-modality generation.
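As an illustration of what word-level interleaving means, the sketch below randomly renders each word of an aligned speech-text pair either as text tokens or as speech units, with special markers flagging the modality. The token names ([TEXT], [SPEECH], hu*) and the per-word random switch are simplifications for illustration, not Spirit LM's actual vocabulary or sampling scheme.

```python
import random

def interleave(words, speech_units, p_speech=0.5, seed=0):
    """Render each aligned word as either text tokens or speech units."""
    rng = random.Random(seed)
    sequence = []
    for word, units in zip(words, speech_units):
        if rng.random() < p_speech:
            sequence += ["[SPEECH]"] + units   # word rendered as speech units
        else:
            sequence += ["[TEXT]", word]       # word rendered as text
    return sequence

words = ["the", "cat", "sat"]
speech_units = [["hu12", "hu7"], ["hu3", "hu44", "hu5"], ["hu9"]]
print(interleave(words, speech_units))
# -> ['[TEXT]', 'the', '[TEXT]', 'cat', '[SPEECH]', 'hu9']
```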
The model comes in two versions: Spirit LM Base, which utilises phonetic tokens for speech modelling, and Spirit LM Expressive, which adds pitch and style tokens to convey tone, capturing emotions such as excitement or anger.
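The difference between the two token streams might look roughly like the following; the token names here are made up for illustration, not Spirit LM's actual vocabulary.

```python
# Illustrative contrast between the two variants' speech token streams.

base_stream = ["[SPEECH]", "hu12", "hu7", "hu3"]   # phonetic units only

expressive_stream = ["[SPEECH]",
                     "[PITCH:high]",               # pitch token
                     "[STYLE:excited]",            # style token conveying tone
                     "hu12", "hu7", "hu3"]         # then the phonetic units
```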
The new model allows users to generate more natural-sounding speech and can learn tasks across modalities, including ASR, TTS, and speech classification. With the release, Meta aims to inspire further development in speech and text integration.