Install WhisperSpeech on your PC (local)

WhisperSpeech is an open source text-to-speech (TTS) system for synthesizing speech from text: it converts text data into speech that resembles a human voice. Below is a detailed description of WhisperSpeech.

Overview of WhisperSpeech

  • Text-to-Speech (TTS) system: Converts text data into a synthetic voice. This allows the user to listen to written text in speech.
  • Open source: The source code is publicly available and can be used, improved, and distributed by anyone.
  • Multi-language support: Multiple languages may be supported, and further languages are planned; this allows text in different languages to be converted into speech.
  • High-quality speech synthesis: The aim is to generate natural sounding speech that is close to the human voice.

Application Examples

  • Audiobook creation: Text from books and documents can be converted to speech to create audiobooks for the visually impaired and those who have difficulty reading.
  • Assistant devices: Use in voice assistants and interactive systems.
  • Language learning aid: Helps language learners hear the correct pronunciation of texts.

File Content Description

GitHub contains code, documentation, and related resources for the WhisperSpeech project.

https://github.com/collabora/WhisperSpeech

Specifically, it likely contains the following content (we have not examined the repository closely):

  • Source code: implementation of the TTS system written in Python.
  • Documentation: Instructions on how to use and set up the system.
  • Example notebooks: Jupyter notebooks showing how to use the system in practice.

For more information about WhisperSpeech and how to use it, please refer to the project’s README file or the official documentation. This will give you a more detailed understanding of how the system works and how to set it up.

Now let’s proceed with the actual installation.
I have CUDA 11.8 and Python 3.10 (confirmed with python -V) installed on my Windows PC. By creating a Python virtual environment and installing the necessary libraries in it, you can run the project while protecting your PC’s system environment.

It is important to match the version of CUDA used on the host system and in the virtual environment. Especially when using applications or libraries that utilize CUDA (e.g. PyTorch), CUDA version compatibility is very important.

Importance of Compatibility

  • PyTorch and CUDA: PyTorch is built for a specific CUDA version. For example, PyTorch for CUDA 11.8 requires that CUDA 11.8 be installed on the host system as well.
  • Driver Compatibility: The version of CUDA must also be compatible with the GPU driver you are using. Older drivers may not support newer versions of CUDA.

CUDA Version Check

  • Host system CUDA version: You can check the installed CUDA version by running the nvcc --version command on the host system.
  • PyTorch CUDA Version: When installing PyTorch, you must select a compatible CUDA version; you can select the appropriate version for your system on the PyTorch website.
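
A quick way to cross-check from Python itself (a minimal sketch; torch.version.cuda reports the CUDA version your PyTorch build was compiled against):

import torch

# CUDA version this PyTorch build was compiled for (e.g. '11.8'); None on CPU-only builds
print(torch.version.cuda)
# True only when a compatible GPU and driver are visible at runtime
print(torch.cuda.is_available())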

Virtual Environment Settings

  • CUDA in Virtual Environments: When installing PyTorch in a virtual environment, you must choose a PyTorch version that matches the CUDA version of the host system.

This is important because CUDA version mismatches can cause unexpected errors and compatibility issues. If you experience problems using PyTorch or other CUDA-dependent libraries, checking and matching versions should be your first troubleshooting step.

Here are the steps:

  1. Create a Python virtual environment:
    • Open a command prompt or PowerShell.
    • Navigate to the appropriate directory (e.g., cd project_directory).
    • Create a virtual environment using the command python -m venv venv. Here venv is the name of the virtual environment, but you can use any name (e.g. python -m venv whisper).

      The actual commands (go to the existing youtube directory directly under the C drive, create a new whisper directory there, move into it, and create the virtual environment):
      cd\
      cd youtube
      mkdir whisper
      cd whisper
      python -m venv venv
  2. Activate the virtual environment:
    • On Windows, you can activate the virtual environment by executing the command venv\Scripts\activate.
  3. Install the required libraries:
    • Install the required libraries in the virtual environment with pip install WhisperSpeech. You will also need a version of PyTorch that supports CUDA; you can find the appropriate command on the following page:
      https://pytorch.org/
      In our current environment, it is better to install PyTorch first.
      pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
      pip install WhisperSpeech

      The existence of the pip install WhisperSpeech command means that a package named WhisperSpeech is published on the Python Package Index (PyPI), so you can install it with the pip command.
    • This is for installing PyTorch, TorchVision, and TorchAudio for CUDA 11.8. This command is appropriate for installing PyTorch-related packages for systems with GPUs that support CUDA.
    • torch is the main library for PyTorch.
    • torchvision provides functionality related to image processing.
    • torchaudio provides functions related to audio processing.
    • By running this command in a virtual environment, you can set up a CUDA-enabled PyTorch environment; if you have a GPU that supports CUDA, you should now have an environment ready to run WhisperSpeech scripts.
  4. Running the project:
    • Run your project’s code or notebook within the virtual environment.

This procedure isolates the project execution environment from the PC’s main system, preventing contamination of the environment. In addition, the virtual environment can be easily removed if necessary, making it easy to clean up later.
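
Because a virtual environment is just a folder, removing it is equally simple: deactivate it and delete the directory (a sketch for the Windows command prompt, assuming the default venv name used above):

deactivate
rmdir /s /q venv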

  • Projects such as WhisperSpeech, if registered with PyPI, can be installed with just a pip install command.
  • You may need to clone the source code if you want to contribute directly to the project or if you need to do a custom installation using setup.py.
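
If you do want the source, the usual pattern looks like the following (a sketch; pip install -e . performs an editable install from the repository's own setup.py):

git clone https://github.com/collabora/WhisperSpeech.git
cd WhisperSpeech
pip install -e .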

After installing the libraries and CUDA-enabled PyTorch in the virtual environment, write the following in Notepad or a similar editor. Save it as “test.py” in the whisper folder.

from whisperspeech.pipeline import Pipeline
import torchaudio

# Initialize Pipeline
pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')

# Convert text to speech
# (a `generate` method is assumed here; check the actual method name in the docs)
result = pipe.generate("""
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""")

# Move CUDA tensor to CPU (for tensors on GPU)
result = result.cpu()

# Save audio data as WAV file
torchaudio.save('output.wav', result, sample_rate=22050)

This script uses the whisperspeech library to synthesize speech from text and save the result as a WAV file. The process consists of the following steps:

  1. Initialization of a Pipeline object.
  2. Generating speech data (tensor) from text using the pipe.generate method.
  3. Moving the generated tensor to the CPU (if using a GPU).
  4. Save the audio data as a WAV file using torchaudio.save.
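
One detail worth knowing about step 4: torchaudio.save expects a 2-D (channels, frames) tensor, so a 1-D output needs a channel dimension added first. A minimal self-contained sketch (the silence.wav name and the one-second signal are purely illustrative):

import torch
import torchaudio

# torchaudio.save expects (channels, frames); unsqueeze adds the channel dimension
wav = torch.zeros(22050)  # placeholder: one second of silence at 22050 Hz
torchaudio.save('silence.wav', wav.unsqueeze(0), sample_rate=22050)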

Run this script in a virtual environment.

python test.py

However, an error occurs.

(venv) C:\youtube\whisper>python test.py
C:\youtube\whisper\venv\lib\site-packages\torch\nn\utils\weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
|████████| 100.00% [752/752 00:07<00:00]
Traceback (most recent call last):
  File "C:\youtube\whisper\test.py", line 17, in <module>
    torchaudio.save('output.wav', result, sample_rate=22050)
  File "C:\youtube\whisper\venv\lib\site-packages\torchaudio\_backend\utils.py", line 311, in save
    backend = dispatcher(uri, format, backend)
  File "C:\youtube\whisper\venv\lib\site-packages\torchaudio\_backend\utils.py", line 221, in dispatcher
    raise RuntimeError(f"Couldn't find appropriate backend to handle uri {uri} and format {format}.")
RuntimeError: Couldn't find appropriate backend to handle uri output.wav and format None.

Deal with the error.
The error RuntimeError: Couldn't find appropriate backend to handle uri output.wav and format None. indicates that the torchaudio.save function could not find an appropriate backend to handle the output file. This usually occurs when the required audio backend is not installed or configured.

To resolve this issue, try the following steps:

  1. Check for required dependencies:
    • Check that the dependencies required by torchaudio are properly installed. In particular, libraries such as soundfile and sox may be required (a quick check of the available backends is shown after this list).
    • These libraries can be installed with a command such as pip install soundfile sox.
  2. Trying different file formats:
    • Try different file formats (e.g. output.mp3 ) with the torchaudio.save function.
  3. Check the version of PyTorch and TorchAudio:
    • Check that the versions of PyTorch and TorchAudio you are using are compatible. If there is a mismatch, you will need to update to a compatible version.
  4. Explicitly specify the backend:
    • Explicitly specifying the torchaudio backend may solve this problem. For example, try torchaudio.set_audio_backend('soundfile').
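
The quick availability check mentioned in step 1 (a minimal sketch; torchaudio.list_audio_backends returns the backends torchaudio can actually use):

import torchaudio

# after pip install soundfile, 'soundfile' should appear in this list
print(torchaudio.list_audio_backends())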

I was able to solve this problem with pip install soundfile sox. After installation, running the script again creates the audio file.

Next, try voice cloning.

Jupyter notebooks are a great choice for interactive data analysis, machine learning, and speech processing projects like this one. They are the preferred choice for many scientific computing and data science projects because they let you run Python code, visualize results, and document everything in one place.

Features of Jupyter Notebooks

  1. Interactive coding:
    • Execute code cell by cell and see results instantly. This allows for step-by-step experimentation and data analysis.
  2. Rich text support:
    • Markdown and HTML can be used to create documents that include descriptive text, mathematical formulas, images, etc.
  3. Data visualization:
    • Easily chart and graph data, used in conjunction with libraries such as Matplotlib, Seaborn, Plotly, etc.
  4. Multilingual support:
    • In addition to Python, multiple programming languages are supported, including R, Julia, and Scala.
  5. Sharing and reproducibility:
    • Notebooks are stored in an integrated format with code, data, diagrams, and explanatory text, making it easy to share and reproduce results.

Uses

  • Education: Suitable for creating learning materials and for use in classes and workshops.
  • Data analysis: Useful for exploratory analysis, preprocessing, and visualization of data.
  • Research: Useful for documenting research results, generating figures and tables for publications, and sharing the analysis process.
  • Machine learning: Useful for model prototyping, parameter tuning, and visualization of results.

The flexibility and functionality of Jupyter notebooks make them a valuable tool for many scientists, researchers, data analysts, and educators.

Jupyter notebooks can be installed within a virtual environment.

pip install notebook

Start the Jupyter notebook:

  • After installation, run the jupyter notebook command to start the notebook. This will open a browser and display the notebook’s interface.

How to run scripts in a Jupyter notebook

  1. Create or open a notebook:
    • In the Jupyter notebook interface, click “New” to create a new notebook or open an existing one.
  2. Writing code:
    • Write your Python code in a cell in the notebook. For example, enter WhisperSpeech’s speech synthesis code in the cell.
  3. Execute the code:
    • Select the cell containing the code and click the “Run” button on the toolbar or use the keyboard shortcut (usually Shift+Enter) to execute the code.
  4. Checking the result:
    • The result of the code execution is displayed directly below the cell. This includes text output, graphics, and playback of audio files.
  5. Document Saving:
    • When you are done working, save the notebook. You can use the “Save” button on the toolbar.

To start with, we will check the GPU usage in the Jupyter notebook.

  • To check if the GPU is available in PyTorch, run the following code in a Python shell or Jupyter notebook

    import torch

    print(torch.cuda.is_available())
    • If this command returns True, PyTorch is ready to use the GPU.
  • To check the details of available GPUs

    import torch

    print(torch.cuda.get_device_name(0))

Following these steps, you can see if PyTorch is using the host GPU.

For voice cloning, here is example code:

from whisperspeech.pipeline import Pipeline
pipe = Pipeline()
pipe.generate_to_notebook("""
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""", lang='en', speaker='https://upload.wikimedia.org/wikipedia/commons/7/75/Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg')

The script uses the WhisperSpeech library’s Pipeline object to convert the given text to speech. The process uses the English speech model (lang='en') and generates speech based on the reference audio specified with the speaker argument, here a URL pointing to a recording of Winston Churchill.

When the code was saved and executed in the browser, the following error occurred.

C:\youtube\whisper\venv\lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
C:\youtube\whisper\venv\lib\site-packages\fastprogress\fastprogress.py:107: UserWarning: Couldn't import ipywidgets properly, progress bar will use console behavior
C:\youtube\whisper\venv\lib\site-packages\torch\nn\utils\weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
C:\youtube\whisper\venv\lib\site-packages\speechbrain\utils\torch_audio_backend.py:22: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[3], line 3
      1 from whisperspeech.pipeline import Pipeline
      2 pipe = Pipeline()
----> 3 pipe.generate_to_notebook("""
      4 This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
      5 """, lang='en', speaker='https://upload.wikimedia.org/wikipedia/commons/7/75/Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg')

File C:\youtube\whisper\venv\lib\site-packages\whisperspeech\pipeline.py:93, in Pipeline.generate_to_notebook(self, text, speaker, lang, cps, step_callback)
---> 93 self.vocoder.decode_to_notebook(self.generate_atoks(text, speaker, lang=lang, cps=cps, step_callback=None))

File C:\youtube\whisper\venv\lib\site-packages\whisperspeech\pipeline.py:80, in Pipeline.generate_atoks(self, text, speaker, lang, cps, step_callback)
---> 80 elif isinstance(speaker, (str, Path)): speaker = self.extract_spk_emb(speaker)

File C:\youtube\whisper\venv\lib\site-packages\whisperspeech\pipeline.py:70, in Pipeline.extract_spk_emb(self, fname)
---> 70 self.encoder = EncoderClassifier.from_hparams("speechbrain/spkrec-ecapa-voxceleb", savedir="~/.cache/speechbrain/", run_opts={"device": "cuda"})

File C:\youtube\whisper\venv\lib\site-packages\speechbrain\pretrained\interfaces.py:467, in Pretrained.from_hparams(...)
--> 467 hparams_local_path = fetch(filename=hparams_file, source=source, savedir=savedir, ...)

File C:\youtube\whisper\venv\lib\site-packages\speechbrain\pretrained\fetching.py:181, in fetch(...)
--> 181 destination.symlink_to(sourcepath)

File ~\AppData\Local\Programs\Python\Python310\lib\pathlib.py:1255, in Path.symlink_to(self, target, target_is_directory)
-> 1255 self._accessor.symlink(target, self, target_is_directory)

OSError: [WinError 1314] A required privilege is not held by the client: 'C:\\Users\\minok\\.cache\\huggingface\\hub\\models--speechbrain--spkrec-ecapa-voxceleb\\snapshots\\5c0be3875fda05e81f3c004ed8c7c06be308de1e\\hyperparams.yaml' -> '~\\.cache\\speechbrain\\hyperparams.yaml'

Investigation of the cause.
The displayed OSError: [WinError 1314] indicates that the process lacks the privilege to create symbolic links on Windows; by default only administrators hold this privilege (enabling Windows Developer Mode also grants it). The error occurs when the speechbrain library downloads the required files and tries to create a symbolic link in the cache directory.

Tried solution.

  1. Open a command prompt in Windows with “Run as administrator” and run the jupyter notebook command from it.

Since the virtual environment has already been created, go to the desired directory and activate the virtual environment.

cd\
cd youtube
cd whisper
venv\Scripts\activate

Run the preceding code in the browser again. An error still occurs, but its content has changed.

---------------------------------------------------------------------------
LibsndfileError                           Traceback (most recent call last)
Cell In[2], line 3
      1 from whisperspeech.pipeline import Pipeline
      2 pipe = Pipeline()
----> 3 pipe.generate_to_notebook("""
      4 This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
      5 """, lang='en', speaker='https://upload.wikimedia.org/wikipedia/commons/7/75/Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg')

File C:\youtube\whisper\venv\lib\site-packages\whisperspeech\pipeline.py:93, in Pipeline.generate_to_notebook(self, text, speaker, lang, cps, step_callback)
---> 93 self.vocoder.decode_to_notebook(self.generate_atoks(text, speaker, lang=lang, cps=cps, step_callback=None))

File C:\youtube\whisper\venv\lib\site-packages\whisperspeech\pipeline.py:80, in Pipeline.generate_atoks(self, text, speaker, lang, cps, step_callback)
---> 80 elif isinstance(speaker, (str, Path)): speaker = self.extract_spk_emb(speaker)

File C:\youtube\whisper\venv\lib\site-packages\whisperspeech\pipeline.py:73, in Pipeline.extract_spk_emb(self, fname)
---> 73 samples, sr = torchaudio.load(fname)

File C:\youtube\whisper\venv\lib\site-packages\torchaudio\_backend\utils.py:204, in get_load_func.<locals>.load(uri, ...)
--> 204 return backend.load(uri, frame_offset, num_frames, normalize, channels_first, format, buffer_size)

File C:\youtube\whisper\venv\lib\site-packages\torchaudio\_backend\soundfile.py:27, in SoundfileBackend.load(uri, ...)
---> 27 return soundfile_backend.load(uri, frame_offset, num_frames, normalize, channels_first, format)

File C:\youtube\whisper\venv\lib\site-packages\soundfile.py:1216, in SoundFile._open(self, file, mode_int, closefd)
-> 1216 raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))

LibsndfileError: Error opening 'https://upload.wikimedia.org/wikipedia/commons/7/75/Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg': System error.

Investigation of the cause.
The LibsndfileError indicates that torchaudio failed to read the audio file directly from a remote URL, in this case the URL of Winston Churchill’s speech. The soundfile backend does not support loading audio files directly from URLs.

Solution

  1. Download the audio file:
    • Manually download the file you want to use as reference audio and save it to your local file system (a small download sketch follows the code below).
  2. Specify the path to the downloaded file:
    • Specify the local path to the downloaded audio file in the speaker argument. For example, speaker='downloaded_file.ogg' (where downloaded_file.ogg is the name of the downloaded file).
  3. Code modification:
    • In the Jupyter notebook, update the speaker argument to the path of the downloaded file. Note that if the audio file is in the same folder as the notebook, only the file name is needed.
from whisperspeech.pipeline import Pipeline

# Initialize the Pipeline object
pipe = Pipeline()

# Generate speech in the notebook using the downloaded reference audio
pipe.generate_to_notebook(
    """
    This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
    """,
    lang='en',
    speaker='path/to/downloaded/file.ogg'
)
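
The small download sketch mentioned in step 1 (urllib is in the Python standard library; the churchill.ogg file name is just an example):

import urllib.request

# download the reference audio once and reuse the local copy
url = 'https://upload.wikimedia.org/wikipedia/commons/7/75/Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg'
urllib.request.urlretrieve(url, 'churchill.ogg')

You can then pass speaker='churchill.ogg' when calling generate_to_notebook.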