WhisperSpeech is an open-source text-to-speech (TTS) system for synthesizing speech from text: it converts text data into speech that resembles the human voice. Below is a detailed description of WhisperSpeech.
Overview of WhisperSpeech
- Text-to-Speech (TTS) system: Converts text data into a synthetic voice, allowing the user to listen to written text as speech.
- Open source: The source code is publicly available and can be used, improved, and distributed by anyone.
- Multi-language support: Support for multiple languages is planned, which will allow text in different languages to be converted into speech.
- High-quality speech synthesis: The aim is to generate natural-sounding speech that is close to the human voice.
Application Examples
- Audiobook creation: Text from books and documents can be converted to speech to create audiobooks for the visually impaired and those who have difficulty reading.
- Assistant devices: Use in voice assistants and interactive systems.
- Language learning aid: Helps language learners hear the correct pronunciation of texts.
File Content Description
The GitHub repository contains code, documentation, and related resources for the WhisperSpeech project.
https://github.com/collabora/WhisperSpeech
Specifically, it likely contains the following content (we have not examined the repository closely):
- Source code: implementation of the TTS system written in Python.
- Documentation: Instructions on how to use and set up the system.
- Example notebooks: Jupyter notebooks showing how to use the system in practice.
For more information about WhisperSpeech and how to use it, please refer to the project’s README file or the official documentation. This will give you a more detailed understanding of how the system works and how to set it up.
Now let’s proceed with the actual installation.
I have CUDA 11.8 and Python 3.10 (confirmed with python -V) installed on my Windows PC. By creating a Python virtual environment and installing the necessary libraries in it, I can run the project while protecting my PC's system environment.
It is important to match the version of CUDA used on the host system and in the virtual environment. Especially when using applications or libraries that utilize CUDA (e.g. PyTorch), CUDA version compatibility is very important.
Importance of Compatibility
- PyTorch and CUDA: PyTorch is built for a specific CUDA version. For example, PyTorch for CUDA 11.8 requires that CUDA 11.8 be installed on the host system as well.
- Driver Compatibility: The version of CUDA must also be compatible with the GPU driver you are using. Older drivers may not support newer versions of CUDA.
CUDA Version Check
- Host system CUDA version: You can check the installed CUDA version by running nvcc --version on the host system.
- PyTorch CUDA version: When installing PyTorch, you must select a compatible CUDA version; you can select the appropriate version for your system on the PyTorch website.
Virtual Environment Settings
- CUDA in Virtual Environments: When installing PyTorch in a virtual environment, you must choose a PyTorch version that matches the CUDA version of the host system.
This is important because CUDA version mismatches can cause unexpected errors and compatibility issues. If you experience problems using PyTorch or other CUDA-dependent libraries, checking and matching versions should be your first troubleshooting step.
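As a quick sanity check inside the virtual environment, you can compare the CUDA version your PyTorch build was compiled against with the output of nvcc --version on the host. A minimal sketch (torch.version.cuda is None on CPU-only builds):
import torch

# CUDA version this PyTorch build was compiled against (e.g. '11.8');
# it should match the CUDA toolkit installed on the host system.
print("PyTorch CUDA version:", torch.version.cuda)

# Whether PyTorch can actually reach a GPU through the installed driver.
print("CUDA available:", torch.cuda.is_available())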
Here are the steps.
- Create a Python virtual environment:
  - Open a command prompt or PowerShell.
  - Navigate to the appropriate directory (e.g., cd project_directory).
  - Create a virtual environment using the command python -m venv venv. Here venv is the name of the virtual environment, but you can use any name (e.g., python -m venv whisper).
The actual commands (go to the existing youtube directory directly under the C drive and create a new whisper directory there; then, after moving to the whisper directory, create the virtual environment):
cd\
cd youtube
mkdir whisper
cd whisper
python -m venv venv
- Activate the virtual environment:
  - On Windows, you can activate the virtual environment by executing the command venv\Scripts\activate.
- Install the required libraries:
  - Install the required libraries in the virtual environment with pip install WhisperSpeech. You will also need to install a version of PyTorch that supports CUDA; you can find the command for your system at the following page:
https://pytorch.org/
  - In our current environment, it is better to install PyTorch first:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install WhisperSpeech
  - The existence of the pip install WhisperSpeech command means that a package named WhisperSpeech is published on the Python Package Index (PyPI), so it can be installed with the pip command.
  - The first command installs PyTorch, TorchVision, and TorchAudio built for CUDA 11.8, which is appropriate for systems with a GPU that supports CUDA. torch is the main PyTorch library, torchvision provides image-processing functionality, and torchaudio provides audio-processing functions.
  - By running these commands in the virtual environment, you set up a CUDA-enabled PyTorch environment; if you have a GPU that supports CUDA, you should now have an environment ready to run WhisperSpeech scripts.
- Running the project:
- Run your project’s code or notebook within the virtual environment.
This procedure isolates the project execution environment from the PC’s main system, preventing contamination of the environment. In addition, the virtual environment can be easily removed if necessary, making it easy to clean up later.
- Projects such as WhisperSpeech, if registered on PyPI, can be installed with just a pip install command.
- You may need to clone the source code if you want to contribute directly to the project or if you need to do a custom installation using setup.py.
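Before writing a full script, a short smoke test can confirm that everything imports cleanly inside the virtual environment (a minimal sketch; the Pipeline import mirrors the example used later):
# Smoke test: if these imports succeed, the installation worked.
import torch
from whisperspeech.pipeline import Pipeline

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())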
After installing the libraries and CUDA-enabled PyTorch in the virtual environment, write the following in Notepad or a similar editor, and save it as "test.py" in the whisper folder.
from whisperspeech.pipeline import Pipeline
import torchaudio
# Initialize Pipeline
pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')
# Convert text to speech (use the correct method here)
# Assume you have a `generate` method for example (check the actual method name)
result = pipe.generate("""
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""")
# Move CUDA tensor to CPU (for tensors on GPU)
result = result.cpu()
# Save audio data as WAV file
torchaudio.save('output.wav', result, sample_rate=22050)
This script uses the whisperspeech library to synthesize speech from text and save the result as a WAV file. The process consists of the following steps:
- Initialize a Pipeline object.
- Generate speech data (a tensor) from text using the pipe.generate method.
- Move the generated tensor to the CPU (if using a GPU).
- Save the audio data as a WAV file using torchaudio.save.
Run this script in a virtual environment.
python test.py
However, an error occurs.
(venv) C:\youtube\whisper>python test.py
C:\youtube\whisper\venv\lib\site-packages\torch\nn\utils\weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
███████| 100.00% [752/752 00:07<00:00]
Traceback (most recent call last):
  File "C:\youtube\whisper\test.py", line 17, in <module>
    torchaudio.save('output.wav', result, sample_rate=22050)
  File "C:\youtube\whisper\venv\lib\site-packages\torchaudio\_backend\utils.py", line 311, in save
    backend = dispatcher(uri, format, backend)
  File "C:\youtube\whisper\venv\lib\site-packages\torchaudio\_backend\utils.py", line 221, in dispatcher
    raise RuntimeError(f"Couldn't find appropriate backend to handle uri {uri} and format {format}.")
RuntimeError: Couldn't find appropriate backend to handle uri output.wav and format None.
Deal with the error.
The error RuntimeError: Couldn't find appropriate backend to handle uri output.wav and format None. indicates that the torchaudio.save function could not find an audio backend capable of writing output.wav. This usually occurs when the required audio backend is not installed or configured.
To resolve this issue, try the following steps:
- Check for required dependencies:
  - Check that the dependencies required by torchaudio are properly installed. In particular, some libraries, such as soundfile and sox, may be required.
  - These libraries can be installed with a command such as pip install soundfile sox.
- Try different file formats:
  - Try a different file format (e.g., output.mp3) with the torchaudio.save function.
- Check the version of PyTorch and TorchAudio:
- Check that the versions of PyTorch and TorchAudio you are using are compatible. If there is a mismatch, you will need to update to a compatible version.
- Explicitly specify the backend:
  - Explicitly specifying the torchaudio backend may solve this problem. For example, try torchaudio.set_audio_backend('soundfile'). (A diagnostic sketch for listing available backends follows this list.)
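To see which backends torchaudio can actually use on your machine, before and after installing the dependencies, a small diagnostic sketch:
import torchaudio

# An empty list here matches the "Couldn't find appropriate backend" error;
# after installing soundfile it should contain 'soundfile'.
print(torchaudio.list_audio_backends())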
I was able to solve this problem with pip install soundfile sox. After installation, run the script and the audio file is created.
Next, try voice cloning.
Jupyter notebooks are a great choice for interactive data analysis, machine learning, and speech processing projects like this one. They are the preferred tool for many scientific computing and data science projects because they let you run Python code, visualize results, and document everything in one place.
Features of Jupyter Notebooks
- Interactive coding:
- Execute code cell by cell and see results instantly. This allows for step-by-step experimentation and data analysis.
- Rich text support:
- Markdown and HTML can be used to create documents that include descriptive text, mathematical formulas, images, etc.
- Data visualization:
- Easily chart and graph data, used in conjunction with libraries such as Matplotlib, Seaborn, Plotly, etc.
- Multilingual support:
- In addition to Python, multiple programming languages are supported, including R, Julia, and Scala.
- Sharing and reproducibility:
- Notebooks are stored in an integrated format with code, data, diagrams, and explanatory text, making it easy to share and reproduce results.
Uses
- Education: suitable for creating learning materials and for use in classes and workshops.
- Data analysis: useful for exploratory analysis, preprocessing, and visualization of data.
- Research: useful for documenting research results, generating figures and tables for publications, and sharing the analysis process.
- Machine learning: useful for model prototyping, parameter tuning, and visualization of results.
The flexibility and functionality of Jupyter notebooks make them a valuable tool for many scientists, researchers, data analysts, and educators.
Jupyter notebooks can be installed within a virtual environment.
pip install notebook
Start the Jupyter notebook:
- After installation, run the jupyter notebook command to start the notebook. This will open a browser and display the notebook's interface.
How to run scripts in a Jupyter notebook
- Create or open a notebook:
- In the Jupyter notebook interface, click “New” to create a new notebook or open an existing one.
- Writing code:
- Write your Python code in a cell in the notebook. For example, enter WhisperSpeech’s speech synthesis code in the cell.
- Execute the code:
- Select the cell containing the code and click the "Run" button on the toolbar or use the keyboard shortcut (usually Shift+Enter) to execute the code.
- Checking the result:
- The result of the code execution is displayed directly below the cell. This includes text output, graphics, and playback of audio files (see the sketch after this list).
- Document Saving:
- When you are done working, save the notebook. You can use the “Save” button on the toolbar.
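As an example of the audio playback mentioned above, IPython's display helpers can render an inline player in a notebook cell (a small sketch; it assumes output.wav from the earlier test.py is in the notebook's working directory):
from IPython.display import Audio

# Renders an inline audio player directly below the cell.
Audio("output.wav")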
To start with, we will check the GPU usage in the Jupyter notebook.
- To check if the GPU is available in PyTorch, run the following code in a Python shell or Jupyter notebook:
import torch
print(torch.cuda.is_available())
  - If this command returns True, PyTorch is ready to use the GPU.
- To check the details of available GPUs:
import torch
print(torch.cuda.get_device_name(0))
Following these steps, you can see whether PyTorch is using the host GPU.
For voice cloning, here is some example code:
from whisperspeech.pipeline import Pipeline
pipe = Pipeline()
pipe.generate_to_notebook("""
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""", lang='en', speaker='https://upload.wikimedia.org/wikipedia/commons/7/75/Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg')
The script uses the WhisperSpeech library's Pipeline object to convert the given text to speech. The process uses the English speech model with lang='en' and generates speech based on the reference audio passed in the speaker argument, here a URL pointing to a recording of a Winston Churchill speech.
In a browser, the following error occurred when the code was saved and executed.
C:\youtube\whisper\venv\lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
C:\youtube\whisper\venv\lib\site-packages\fastprogress\fastprogress.py:107: UserWarning: Couldn't import ipywidgets properly, progress bar will use console behavior
  warn("Couldn't import ipywidgets properly, progress bar will use console behavior")
C:\youtube\whisper\venv\lib\site-packages\torch\nn\utils\weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
C:\youtube\whisper\venv\lib\site-packages\speechbrain\utils\torch_audio_backend.py:22: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
C:\youtube\whisper\venv\lib\site-packages\speechbrain\utils\torch_audio_backend.py:22: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[3], line 3
      1 from whisperspeech.pipeline import Pipeline
      2 pipe = Pipeline()
----> 3 pipe.generate_to_notebook("""
      4 This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
      5 """, lang='en', speaker='https://upload.wikimedia.org/wikipedia/commons/7/75/Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg')

File C:\youtube\whisper\venv\lib\site-packages\whisperspeech\pipeline.py:93, in Pipeline.generate_to_notebook(self, text, speaker, lang, cps, step_callback)
     92 def generate_to_notebook(self, text, speaker=None, lang='en', cps=15, step_callback=None):
---> 93     self.vocoder.decode_to_notebook(self.generate_atoks(text, speaker, lang=lang, cps=cps, step_callback=None))

File C:\youtube\whisper\venv\lib\site-packages\whisperspeech\pipeline.py:80, in Pipeline.generate_atoks(self, text, speaker, lang, cps, step_callback)
     78 def generate_atoks(self, text, speaker=None, lang='en', cps=15, step_callback=None):
     79     if speaker is None: speaker = self.default_speaker
---> 80     elif isinstance(speaker, (str, Path)): speaker = self.extract_spk_emb(speaker)
     81     text = text.replace("\n", " ")
     82     stoks = self.t2s.generate(text, cps=cps, lang=lang, step=step_callback)

File C:\youtube\whisper\venv\lib\site-packages\whisperspeech\pipeline.py:70, in Pipeline.extract_spk_emb(self, fname)
     68 if self.encoder is None:
     69     from speechbrain.pretrained import EncoderClassifier
---> 70     self.encoder = EncoderClassifier.from_hparams("speechbrain/spkrec-ecapa-voxceleb",
     71                                                   savedir="~/.cache/speechbrain/",
     72                                                   run_opts={"device": "cuda"})
     73 samples, sr = torchaudio.load(fname)
     74 samples = self.encoder.audio_normalizer(samples[0,:30*sr], sr)

File C:\youtube\whisper\venv\lib\site-packages\speechbrain\pretrained\interfaces.py:467, in Pretrained.from_hparams(cls, source, hparams_file, pymodule_file, overrides, savedir, use_auth_token, revision, download_only, **kwargs)
    465 clsname = cls.__name__
    466 savedir = f"./pretrained_models/{clsname}-{hashlib.md5(source.encode('UTF-8', errors='replace')).hexdigest()}"
--> 467 hparams_local_path = fetch(
    468     filename=hparams_file,
    469     source=source,
    470     savedir=savedir,
    471     overwrite=False,
    472     save_filename=None,
    473     use_auth_token=use_auth_token,
    474     revision=revision,
    475 )
    476 try:
    477     pymodule_local_path = fetch(
    478         filename=pymodule_file,
    479         source=source,
    (...)
    484         revision=revision,
    485     )

File C:\youtube\whisper\venv\lib\site-packages\speechbrain\pretrained\fetching.py:181, in fetch(filename, source, savedir, overwrite, save_filename, use_auth_token, revision, cache_dir, silent_local_fetch)
    179 sourcepath = pathlib.Path(fetched_file).absolute()
    180 _missing_ok_unlink(destination)
--> 181 destination.symlink_to(sourcepath)
    182 return destination

File ~\AppData\Local\Programs\Python\Python310\lib\pathlib.py:1255, in Path.symlink_to(self, target, target_is_directory)
   1250 def symlink_to(self, target, target_is_directory=False):
   1251     """
   1252     Make this path a symlink pointing to the target path.
   1253     Note the order of arguments (link, target) is the reverse of os.symlink.
   1254     """
-> 1255     self._accessor.symlink(target, self, target_is_directory)

OSError: [WinError 1314] A required privilege is not held by the client: 'C:\\Users\\minok\\.cache\\huggingface\\hub\\models--speechbrain--spkrec-ecapa-voxceleb\\snapshots\\5c0be3875fda05e81f3c004ed8c7c06be308de1e\\hyperparams.yaml' -> '~\\.cache\\speechbrain\\hyperparams.yaml'
Investigation of the cause.
The displayed OSError: [WinError 1314] indicates that the process lacks the privilege to create symbolic links on Windows. This error occurs when the speechbrain library downloads the required files and tries to create a symbolic link in the cache directory.
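To confirm that missing symlink privileges are really the culprit, the failure can be reproduced outside speechbrain with a minimal diagnostic sketch:
import os
import tempfile

# On Windows, creating a symlink raises OSError [WinError 1314] unless the
# shell was started as administrator (or Developer Mode is enabled).
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "target.txt")
    link = os.path.join(d, "link.txt")
    open(target, "w").close()
    try:
        os.symlink(target, link)
        print("Symlink creation works.")
    except OSError as e:
        print("No symlink privilege:", e)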
Attempted solution.
- Try "running as administrator" a command prompt on Windows and executing the jupyter notebook command in it.
Since the virtual environment has already been created, go to the desired directory and activate the virtual environment.
cd\
cd youtube
cd whisper
venv\Scripts\activate
Run the preceding code in your browser again. An error still occurs, but its content has changed.
---------------------------------------------------------------------------
LibsndfileError                           Traceback (most recent call last)
Cell In[2], line 3
      1 from whisperspeech.pipeline import Pipeline
      2 pipe = Pipeline()
----> 3 pipe.generate_to_notebook("""
      4 This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
      5 """, lang='en', speaker='https://upload.wikimedia.org/wikipedia/commons/7/75/Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg')

File C:\youtube\whisper\venv\lib\site-packages\whisperspeech\pipeline.py:93, in Pipeline.generate_to_notebook(self, text, speaker, lang, cps, step_callback)
     92 def generate_to_notebook(self, text, speaker=None, lang='en', cps=15, step_callback=None):
---> 93     self.vocoder.decode_to_notebook(self.generate_atoks(text, speaker, lang=lang, cps=cps, step_callback=None))

File C:\youtube\whisper\venv\lib\site-packages\whisperspeech\pipeline.py:80, in Pipeline.generate_atoks(self, text, speaker, lang, cps, step_callback)
     78 def generate_atoks(self, text, speaker=None, lang='en', cps=15, step_callback=None):
     79     if speaker is None: speaker = self.default_speaker
---> 80     elif isinstance(speaker, (str, Path)): speaker = self.extract_spk_emb(speaker)
     81     text = text.replace("\n", " ")
     82     stoks = self.t2s.generate(text, cps=cps, lang=lang, step=step_callback)

File C:\youtube\whisper\venv\lib\site-packages\whisperspeech\pipeline.py:73, in Pipeline.extract_spk_emb(self, fname)
     69     from speechbrain.pretrained import EncoderClassifier
     70     self.encoder = EncoderClassifier.from_hparams("speechbrain/spkrec-ecapa-voxceleb",
     71                                                   savedir="~/.cache/speechbrain/",
     72                                                   run_opts={"device": "cuda"})
---> 73 samples, sr = torchaudio.load(fname)
     74 samples = self.encoder.audio_normalizer(samples[0,:30*sr], sr)
     75 spk_emb = self.encoder.encode_batch(samples)

File C:\youtube\whisper\venv\lib\site-packages\torchaudio\_backend\utils.py:204, in get_load_func.
Investigation of the cause.
The LibsndfileError displayed indicates that torchaudio failed to read the audio file directly from a remote URL, in this case the URL of Winston Churchill's speech; torchaudio.load does not support loading audio files directly from URLs.
Solution
- Download the audio file:
- Manually download the file you want to use as reference audio and save it to your local file system.
- Specify the path to the downloaded file:
  - Specify the local path to the downloaded audio file in the speaker argument. For example, speaker='downloaded_file.ogg' (where downloaded_file.ogg is the name of the downloaded file).
- Code modification:
  - In the Jupyter notebook, update the speaker argument to specify the path to the downloaded file. Note that if the audio file is in the same folder as the code, only the file name is needed.
from whisperspeech.pipeline import Pipeline
# Initialize the Pipeline object
pipe = Pipeline()
# use pipe.generate_to_notebook(...) in the rest of the code
pipe.generate_to_notebook(
"""
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""",
lang='en',
speaker='path/to/downloaded/file.ogg'
)
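If you prefer to script the manual download step instead of saving the file by hand, the standard library is enough. A minimal sketch (churchill.ogg is an arbitrary local file name; some servers reject the default Python user agent, hence the explicit header):
import urllib.request

url = ('https://upload.wikimedia.org/wikipedia/commons/7/75/'
       'Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg')

# Download the reference audio once into the current directory.
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
with urllib.request.urlopen(req) as r, open('churchill.ogg', 'wb') as f:
    f.write(r.read())
After that, speaker='churchill.ogg' works exactly as in the code above.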