Previously, I explained how to use WhisperSpeech with Jupyter Notebook.
WhisperSpeech: A Tool Utilizing Speech Synthesis Technology to Convert Text into Speech
Here are some of the main features of WhisperSpeech:
Key Features of WhisperSpeech
Text-to-Speech Conversion:
Converts input text into speech using a selected voice. This feature is useful for creating narrations or voice content.
Voice Cloning Function:
You can generate speech that resembles a specific person’s voice by using a sample of their voice. This allows you to have new text read aloud in that particular voice.
Emotional Expression:
In speech synthesis, emotions such as joy, sadness, or surprise can be expressed in the generated speech.
Real-Time Synthesis:
WhisperSpeech can convert text to speech in real time, making it suitable for situations that require immediacy.
Creation of Custom Voice Models:
Users can train their own voice models to generate speech tailored for specific purposes.
Speech Speed Adjustment:
The speed of the synthesized speech can be adjusted, allowing for natural speech delivery or matching specific requirements.
Customization of Speech Style:
You can adjust the pitch and tone of the voice to generate more customized speech.
By leveraging these features, WhisperSpeech can be utilized in various scenarios. The voice cloning function and emotional expression, in particular, can be powerful tools in content creation, entertainment, and education. Note, however, that even with the same text, the generated speech can differ from run to run. There are several reasons for this, which I explain below.
1. Introduction of Randomness
Many speech synthesis systems introduce a slight degree of randomness into the synthesis process. As a result, even with the same text, subtle differences in intonation and pronunciation can be generated. This randomness is designed to make the synthesized speech sound more natural and human-like.
2. Model Uncertainty
Speech synthesis models are complex statistical models that consider multiple possible outputs when generating speech from text. As a result, different runs can settle on different outputs. This is particularly noticeable in models that allow for emotional expression and customization of speech style.
3. Internal Processing of the Speech Synthesis Engine
The internal processing performed by the speech synthesis engine when converting text to speech is extremely complex. For instance, the algorithms or acoustic models used in the speech generation process may behave slightly differently depending on the environment (for example, CPU versus GPU execution, or different library versions). These variations contribute to the subtle differences in the generated speech.
4. Differences in Output Formats and Encoding
The format and encoding settings used when saving the speech can also affect the sound quality. Even with the same model, the sound quality may vary depending on the output settings and encoding methods used.
5. System and Resource Load
The state of the system’s resources (such as CPU and memory) can also affect the result. Load mainly affects how long generation takes. If the system is under heavy load, the speech may not be synthesized in real time, which can cause timing delays or interruptions during playback.
Taken together, these factors mean that the generated speech may differ slightly each time, even with the same text. While this diversity makes speech synthesis more natural and flexible, it also explains why generating identical speech repeatedly can be challenging.
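The first reason, randomness, can be illustrated with a small sketch. This is not WhisperSpeech code: the token names, weights, and the sample_prosody function are hypothetical stand-ins for a model that samples its output instead of always taking the most likely choice. It only shows why a fixed seed gives a repeatable result while an unseeded run varies.

```python
import random
from typing import List, Optional

# Hypothetical stand-in for a synthesis model: for each word it samples one
# "prosody token" rather than always choosing the most likely one.
TOKENS = ["flat", "rising", "falling", "emphatic"]
WEIGHTS = [0.4, 0.3, 0.2, 0.1]


def sample_prosody(text: str, seed: Optional[int] = None) -> List[str]:
    """Sample one token per word; a fixed seed makes the result repeatable."""
    rng = random.Random(seed)  # seed=None takes entropy from the OS
    return [rng.choices(TOKENS, weights=WEIGHTS)[0] for _ in text.split()]


text = "the same sentence can sound slightly different every time"

# Same seed, same text: identical output.
print(sample_prosody(text, seed=0) == sample_prosody(text, seed=0))  # True

# No seed: the sequence usually changes from run to run.
print(sample_prosody(text))
```

Whether a given speech model exposes a seed, and whether fixing it is enough for bit-identical audio, depends on the model and the hardware, so treat this only as an illustration of the principle.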
Next, I will introduce how to operate WhisperSpeech via a browser, similar to Stable Diffusion WebUI, using the following GitHub reference:
https://github.com/Mateusz-Dera/whisperspeech-webui
The following is an explanation of the information written on the page:
export HSA_OVERRIDE_GFX_VERSION=11.0.0
The command export HSA_OVERRIDE_GFX_VERSION=11.0.0 sets an environment variable that overrides the graphics (GFX) version used for AMD GPUs. It does not apply to NVIDIA GPUs. I am using an NVIDIA GeForce RTX 4060, so this variable has no relevance to my system and there is no need to set it. If the setting is included in a script or configuration file, it was most likely written for AMD GPUs, and it can be removed or disabled without any issues in an NVIDIA environment.
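As a rough sketch of that rule, the snippet below builds the environment for a launch and sets the variable only when the GPU vendor is AMD. The function build_launch_env and its gpu_vendor argument are hypothetical helpers for illustration; they are not part of the project.

```python
import os
from typing import Dict, Optional

OVERRIDE_VAR = "HSA_OVERRIDE_GFX_VERSION"


def build_launch_env(gpu_vendor: str,
                     base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the AMD-only override applied or removed."""
    env = dict(os.environ if base_env is None else base_env)
    if gpu_vendor.lower() == "amd":
        # Value taken from the project's page; only meaningful for AMD GPUs.
        env.setdefault(OVERRIDE_VAR, "11.0.0")
    else:
        # On NVIDIA (or CPU-only) systems the variable is simply not needed.
        env.pop(OVERRIDE_VAR, None)
    return env


print(OVERRIDE_VAR in build_launch_env("nvidia"))  # False
print(build_launch_env("amd", {})[OVERRIDE_VAR])   # 11.0.0
```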
And the other command is as follows:
sudo apt install ffmpeg
This is the installation command for FFmpeg. If you are using a Python virtual environment on Windows, commands like sudo apt install are generally not available. sudo is a command used to obtain administrative privileges on Linux or macOS, and apt is the package management tool used in Debian-based Linux distributions (such as Ubuntu) for installing packages. If an operation like “Mount the repository directory” is required, it’s more appropriate to use WSL (Windows Subsystem for Linux) or a real Linux environment.
In a real Linux environment, all Linux features are available, making it suitable for tasks that require more complex operations or higher performance.
WSL (Windows Subsystem for Linux):
Since WSL allows you to run a Linux-like environment on Windows, you can proceed with tasks using mount commands and other Linux tools.
WSL is easy to install and integrates seamlessly with Windows, making it a convenient option, especially for Windows users.
Real Linux:
If a genuine Linux environment is required, or if specific hardware access is needed, you might consider using a virtual machine, dual boot, or a dedicated Linux machine.
Because the Linux installation on my physical machine is not high-spec, I will use WSL. If you want to run it with only the CPU, it may also work on VirtualBox or VMware. With Proxmox, GPU passthrough is possible.
For more information about pyenv, see the following article, where I explain it in detail. While that article covers the process on a physical Ubuntu machine, the same steps also work on WSL.
Check the Python Version
First, let’s check the current version of Python installed on your system. This is important to compare with the version recommended for the project.
Run the following command:
python3 -V
This command will display the current Python version. For example, you might see a result like:
Python 3.12.3
Here, you’ll notice that it is “slightly different” from the recommended Python 3.12.0 on the GitHub page. Although it may work with 3.12.3, it’s safer to align with the recommended version, especially if the developer has optimized dependencies or packages for a specific version. Using the same version helps prevent unexpected errors.
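The same comparison can be done from inside Python. The sketch below is only an illustration: compare_version is a hypothetical helper, and the recommended version (3.12.0) is the one named on the project’s GitHub page as described above.

```python
import sys
from typing import Sequence, Tuple

RECOMMENDED: Tuple[int, int, int] = (3, 12, 0)  # version named on the GitHub page


def compare_version(current: Sequence[int],
                    recommended: Tuple[int, int, int] = RECOMMENDED) -> str:
    """Classify how far the running interpreter is from the recommended version."""
    major_minor_patch = tuple(current[:3])
    if major_minor_patch == recommended:
        return "exact match"
    if major_minor_patch[:2] == recommended[:2]:
        return "same minor version, different patch"
    return "different minor or major version"


# With Python 3.12.3 this reports a patch-level difference only.
print(compare_version(sys.version_info))
```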
Cloning the Project
First, let’s clone the project from GitHub to set up your working environment. Cloning allows you to copy the project files and code to your local computer.
Use the following command to clone the project:
git clone https://github.com/Mateusz-Dera/whisperspeech-webui.git
This command creates a directory named whisperspeech-webui, and all project files are downloaded into that directory. Cloning is a very common practice when sharing or managing code using Git.
Next, navigate to the created directory. By moving into the directory, you can start working within the project. Run the following command to change into the directory:
cd whisperspeech-webui
Now, you are ready to proceed with the setup inside the whisperspeech-webui directory.
Installing the Recommended Python Version
Next, let’s install the recommended version of Python for the project. In this case, we will use a tool called pyenv to manage Python versions. pyenv allows you to switch easily between multiple versions of Python installed on your system, which is helpful when specific projects require certain versions.
First, install Python 3.12.0 by running the following command:
pyenv install 3.12.0
This command tells pyenv to download the specified version of Python from the internet, then build and install it under pyenv’s own directory.
Applying the Python Version to a Specific Directory
After installation, you can set this version of Python to be used exclusively within a specific project folder. With pyenv, you can apply a specific version of Python to a directory without affecting other projects or your system.
Run the following command:
pyenv local 3.12.0
This command ensures that Python 3.12.0 is automatically used inside the whisperspeech-webui directory. It creates a .python-version file in the project folder, recording the specified version. As a result, the recommended version is applied only to this project, without affecting other folders or projects.
Verifying the Applied Python Version
Next, let’s verify that the correct Python version is applied. By using the pyenv version command, you can see which Python version is currently set for the directory.
Run the following command:
pyenv version
The output will display the currently active Python version for the directory.
3.12.0 (set by /home/mamu/whisperspeech-webui/.python-version)
If this is displayed, you can confirm that Python 3.12.0 has been correctly set.
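To make the role of the .python-version file concrete, here is a simplified sketch of the lookup: start in the current directory and walk up through the parent directories until a .python-version file is found. The function find_local_version is a hypothetical helper, not pyenv’s actual code, and it ignores other sources pyenv can use, such as environment variables or the global setting.

```python
from pathlib import Path
from typing import Optional, Tuple


def find_local_version(start: Path) -> Optional[Tuple[str, Path]]:
    """Return (version, file) for the nearest .python-version at or above start."""
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / ".python-version"
        if candidate.is_file():
            # Simplification: treat the whole file as one version string.
            return candidate.read_text().strip(), candidate
    return None


# Inside the project directory this would report 3.12.0 and the file's path.
print(find_local_version(Path.cwd()))
```

This also shows why the setting applies to subdirectories of the project but not to unrelated folders.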
Creating a Python Virtual Environment
Next, let’s create a Python virtual environment. A virtual environment allows you to create an isolated Python environment for each project, enabling you to manage library installations and settings without affecting other projects.
Run the following command to create a virtual environment named venv:
python -m venv venv
This will create a folder named venv, where the virtual environment is prepared. However, simply creating the virtual environment does not activate it, so in the next step, we will “activate” the virtual environment.
Activating the Virtual Environment
To activate the virtual environment, run the following command:
source venv/bin/activate
Once this command is executed, the prompt will change, indicating that the virtual environment has been activated. While the virtual environment is active, any installed packages and Python settings are confined to this environment and will not affect the system as a whole.
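Besides looking at the prompt, you can confirm activation from Python itself. The following is a minimal sketch based on the standard library: inside a virtual environment created with venv, sys.prefix points to the environment while sys.base_prefix points to the Python installation it was created from. The function name in_virtualenv is just a label chosen for this example.

```python
import sys


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a venv-style environment."""
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

If the first line prints False after activation, the python on your PATH is probably not the one inside the venv folder.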
Pay Attention to Differences Between Windows and WSL
Here, I’d like to explain a common mistake beginners often make. There is a difference in the way file paths are written between Windows and Linux (including WSL), so be cautious when entering commands.
In Windows:
File paths are separated using backslashes (\).
Example: C:\Users\Username\Documents\
In WSL (Linux):
File paths are separated using forward slashes (/).
Example: /home/username/documents/
When activating a virtual environment in WSL, use forward slashes, like source venv/bin/activate, but in Windows Command Prompt, the format will be like venv\Scripts\activate. It’s important to use the correct command depending on your environment.
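The difference can be reproduced with Python’s pathlib, which models both path styles regardless of the system you run it on. This is a small sketch; activate_script is a hypothetical helper that only builds the path string and does not activate anything.

```python
from pathlib import PurePosixPath, PureWindowsPath


def activate_script(venv_dir: str, windows: bool) -> str:
    """Build the path of the activation script for the given platform style."""
    if windows:
        # Windows keeps the scripts in "Scripts" and separates with backslashes.
        return str(PureWindowsPath(venv_dir, "Scripts", "activate"))
    # Linux (including WSL) keeps them in "bin" and separates with slashes.
    return str(PurePosixPath(venv_dir, "bin", "activate"))


print(activate_script("venv", windows=True))   # venv\Scripts\activate
print(activate_script("venv", windows=False))  # venv/bin/activate
```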
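Installing FFmpeg
With the virtual environment activated, let’s install the necessary tools. This time, we’ll install FFmpeg, which is required for processing audio and video. Note that apt installs FFmpeg system-wide, not inside the Python virtual environment, so it stays available whether or not the virtual environment is active.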
First, install FFmpeg with the following command:
sudo apt install ffmpeg
Key Points:
What is FFmpeg?
FFmpeg is a powerful tool for converting, recording, and editing audio and video. WhisperSpeech requires FFmpeg to process audio data. By installing it, you will be able to encode and decode audio files.
Why use the apt command?
apt is the package management system used in Debian-based Linux distributions such as Ubuntu, including Ubuntu running under WSL (Windows Subsystem for Linux). It allows you to easily install the necessary software.
However, note that the apt command cannot be used in a Python virtual environment on native Windows. Therefore, when performing this task on Windows, you will need to use WSL to access a Linux environment. This will allow you to fully utilize Linux’s package management capabilities and easily install FFmpeg and other necessary software.
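After installation, a quick check confirms that the ffmpeg executable is visible on the PATH. The sketch below uses only the standard library; tool_version_line is a hypothetical helper, and it assumes the tool accepts a -version argument, as ffmpeg does.

```python
import shutil
import subprocess
from typing import Optional


def tool_version_line(name: str) -> Optional[str]:
    """Return the first line of '<name> -version', or None if the tool is not on PATH."""
    path = shutil.which(name)
    if path is None:
        return None
    result = subprocess.run([path, "-version"],
                            capture_output=True, text=True, check=False)
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


line = tool_version_line("ffmpeg")
if line is None:
    print("ffmpeg was not found on PATH")
else:
    print(line)
```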
Installing CUDA and Its Dependencies
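Next, to speed up the processing of WhisperSpeech, we will install the Python packages built for CUDA (NVIDIA’s GPU computing platform). This allows you to leverage your GPU for fast computational processing, improving the performance of speech generation. Note that this pip command installs Python libraries that target CUDA 12.1. It does not install the NVIDIA GPU driver, which must already be present. When using WSL, the driver is installed on the Windows side.
Install the CUDA-related dependencies using the following command: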
pip install -r requirements_cuda_12.1.txt
Key Points:
What is CUDA?
CUDA is a platform that enables high-speed parallel computing using NVIDIA GPUs. When large-scale data processing is required, CUDA allows for much faster computation compared to a CPU. For speech synthesis tools like WhisperSpeech, using a GPU can significantly improve processing speed.
What is a requirements file?
The requirements_cuda_12.1.txt file lists the libraries and packages needed to run WhisperSpeech. Using this file, the pip command installs all necessary dependencies in one go. Since we are using CUDA version 12.1, the corresponding dependency packages are managed through this file.
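Once the packages are installed, it is worth checking that the GPU is actually visible from Python. The sketch below assumes that the requirements file installs PyTorch, which I have not verified against the file itself. If PyTorch is missing, the function reports that instead of failing. The name cuda_status is chosen only for this example.

```python
def cuda_status() -> str:
    """Describe whether a CUDA-capable GPU is usable from this environment."""
    try:
        import torch  # assumption: provided by the requirements file
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__} found, but no CUDA device is usable"
    return (f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}, "
            f"device: {torch.cuda.get_device_name(0)}")


print(cuda_status())
```

If the result says that no CUDA device is usable, the usual suspects are a missing or outdated NVIDIA driver, or a CPU-only build of the library.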
Running the Script
Once everything is set up, the final step is to run the script to launch WhisperSpeech. Execute the following command:
python webui.py
Key Points:
The Role of the Script
This script launches WhisperSpeech’s web interface (GUI). Once executed, you can control the interface from your browser to convert text to speech.
Similarities with Stable Diffusion WebUI
This web interface operates similarly to Stable Diffusion WebUI. Since it can be operated intuitively through a browser, there’s no need to manually handle the program, making it easy for beginners to use.
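If the browser page does not load, you can check whether anything is listening on the address that the script printed at startup. The following is a generic sketch using a plain TCP connection. The host and port shown are placeholders that I am assuming for illustration, so replace them with the address shown in your terminal.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Placeholder address (an assumption): use the one printed by webui.py.
HOST, PORT = "127.0.0.1", 7860
print(f"{HOST}:{PORT} reachable:", is_listening(HOST, PORT))
```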
Additional Explanation: Why Use WSL?
The tools mentioned so far, such as apt, are designed for a Linux environment, and the project’s setup steps assume Linux. It is difficult to follow these steps directly in the standard Windows environment, so we use WSL (Windows Subsystem for Linux) to create a Linux environment and proceed with the tasks.
Here are the reasons why WSL is used:
Ability to Use Linux’s Package Management System apt:
With apt, which cannot be used in Windows, you can easily install and manage software. Tools like FFmpeg can be smoothly installed using this system.
CUDA Support:
When utilizing a GPU, a Linux environment is recommended. CUDA, especially when used with NVIDIA GPUs, has strong support on Linux. By using WSL, you can take full advantage of this support.
Maintaining Windows Compatibility While Utilizing the Flexibility of Linux:
WSL allows you to enjoy the convenience of Windows while freely using tools and commands that are only available in Linux. This flexibility is a major advantage of WSL.
Using WSL is highly convenient for smoothly setting up and running WhisperSpeech.
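When a script needs to know whether it is running under WSL or on a regular Linux machine, a common heuristic is to look for the word "microsoft" in the kernel release string. The sketch below shows that heuristic. looks_like_wsl is a hypothetical helper, and the check is a convention rather than an official API, so it may not hold for every setup.

```python
import platform


def looks_like_wsl(kernel_release: str) -> bool:
    """Heuristic: WSL kernels usually carry 'microsoft' in their release string."""
    return "microsoft" in kernel_release.lower()


release = platform.uname().release
print(release, "-> WSL" if looks_like_wsl(release) else "-> not WSL")
```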