AI video generation has been creating quite a buzz lately. If you’ve already tried your hand at image generation with tools like MidJourney or Stable Diffusion, you might be wondering what’s next. Well, today we’re taking it up a notch by diving into the world of AI video generation!
Imagine typing in a few words and watching them transform into a moving video, or seeing a single image come to life with motion… sounds like magic, right? Better yet, this magic is free to try, thanks to a tool called “Pyramid Flow.”
What makes this tool particularly exciting is its browser-based GUI interface. Even if you’re not too comfortable with programming, you can get started as long as you have the right environment set up.
Let’s jump in and explore the world of AI video generation with Pyramid Flow!
What is Pyramid Flow?
Pyramid Flow is a groundbreaking AI tool that can generate videos from either text descriptions or static images. Built on a technique called Flow Matching, it’s an efficient video generation model trained exclusively on open-source datasets, and the model itself is openly released, which means it’s completely free for anyone to use.
Here’s what makes it special:
- High-quality video generation (up to 10 seconds at 768p resolution, 24FPS)
- Text-to-video generation capabilities
- Image-to-video transformation
- User-friendly browser-based GUI
- Optimized memory usage features
Required Environment
To get started with Pyramid Flow, you’ll need:
- Python 3.8.10 (recommended version)
- A PC with an NVIDIA GPU
- Visual Studio Code (recommended editor)
- NVIDIA GPU drivers and the CUDA toolkit (we’ll use nvcc later to check the CUDA version)
Setting Up Your Environment
In this guide, I’ll walk you through the setup process on Windows using VSCode, which offers a great visual interface for managing your project.
Managing Python Versions
First things first – we need to make sure we’re using the right version of Python (3.8.10). I’m using pyenv for version management, which lets me pin a specific Python version per project without touching the system-wide install.
Let’s check our current Python version:
python --version
In my case, I was running Python 3.10.11, so I needed to switch to 3.8.10.
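If 3.8.10 isn’t installed under pyenv yet, you can grab it first and confirm it shows up (I’m using pyenv-win here; these commands assume pyenv is already on your PATH):
pyenv install 3.8.10
pyenv versions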
Project Setup
First, let’s grab the project from GitHub:
git clone https://github.com/jy0205/Pyramid-Flow
cd Pyramid-Flow
Now, here’s where I ran into my first challenge. Using pyenv, I tried to set the local Python version:
pyenv local 3.8.10
Pro Tip: If you’re using VSCode like I am, you’ll need to completely close and reopen it for the version change to take effect. I learned this the hard way! After reopening VSCode, double-check your Python version:
python --version
Creating a Virtual Environment
Next up, let’s set up a Python virtual environment. This keeps our project dependencies nice and tidy:
python -m venv venv
venv\Scripts\activate
You’ll know it’s working when you see (venv) appear at the start of your command prompt.
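If you want to be extra sure the venv’s interpreter is the one being picked up, Windows’ where command lists every python on your PATH in resolution order; the venv’s Scripts\python.exe should appear first:
where python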
Package Installation
First, let’s update pip to its latest version:
python -m pip install --upgrade pip
Now we have two options for installing the required packages. The most straightforward way is to use the requirements file:
pip install -r requirements.txt
Alternatively, you can install the packages individually:
pip install gradio torch Pillow diffusers huggingface_hub
I tried both methods, and they both work fine. The requirements file is generally recommended as it ensures you get the exact versions that have been tested with the project.
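As a quick sanity check that the key packages actually import, you can run a one-liner that prints the installed torch and gradio versions:
python -c "import torch, gradio; print(torch.__version__, gradio.__version__)"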
Starting the GUI
After installing the packages, let’s try launching the GUI:
python app.py
diffusion_transformer_768p/config.json: 100%|█████████████████| 465/465 [00:00<00:00, 226kB/s]
README.md: 100%|█████████████████████████████████████████| 9.38k/9.38k [00:00<00:00, 4.45MB/s]
diffusion_transformer_image/config.json: 100%|████████████████| 465/465 [00:00<00:00, 233kB/s]
text_encoder_2/config.json: 100%|█████████████████████████████| 782/782 [00:00<00:00, 391kB/s]
text_encoder/config.json: 100%|███████████████████████████████| 613/613 [00:00<00:00, 204kB/s]
(…)t_encoder_2/model.safetensors.index.json: 100%|███████| 19.9k/19.9k [00:00<00:00, 6.63MB/s]
tokenizer/merges.txt: 100%|████████████████████████████████| 525k/525k [00:00<00:00, 1.21MB/s]
tokenizer/special_tokens_map.json: 100%|██████████████████████| 588/588 [00:00<00:00, 235kB/s]
tokenizer/tokenizer_config.json: 100%|████████████████████████| 705/705 [00:00<00:00, 276kB/s]
tokenizer/vocab.json: 100%|██████████████████████████████| 1.06M/1.06M [00:00<00:00, 1.61MB/s]
tokenizer_2/special_tokens_map.json: 100%|███████████████| 2.54k/2.54k [00:00<00:00, 1.26MB/s]
spiece.model: 100%|████████████████████████████████████████| 792k/792k [00:00<00:00, 2.17MB/s]
tokenizer_2/tokenizer.json: 100%|████████████████████████| 2.42M/2.42M [00:01<00:00, 1.68MB/s]
tokenizer_2/tokenizer_config.json: 100%|█████████████████| 20.8k/20.8k [00:00<00:00, 5.93MB/s]
model.safetensors: 100%|███████████████████████████████████| 246M/246M [00:49<00:00, 5.00MB/s]
diffusion_pytorch_model.bin: 100%|███████████████████████| 1.34G/1.34G [02:03<00:00, 10.9MB/s]
Fetching 24 files: 17%|██████▌ | 4/24 [02:03<12:53, 38.69s/it]
diffusion_pytorch_model.safetensors: 18%|██▋ | 1.38G/7.89G [02:02<12:30, 8.66MB/s]
diffusion_pytorch_model.safetensors: 40%|██████ | 3.16G/7.89G [04:16<04:17, 18.4MB/s]
diffusion_pytorch_model.safetensors: 32%|████▊ | 2.53G/7.89G [04:16<05:16, 16.9MB/s]
diffusion_pytorch_model.safetensors: 32%|████▊ | 2.55G/7.89G [04:15<15:01, 5.92MB/s]
model-00001-of-00002.safetensors: 29%|█████▏ | 1.43G/4.99G [02:01<03:59, 14.9MB/s]
model-00001-of-00002.safetensors: 64%|███████████▌ | 3.22G/4.99G [04:15<02:12, 13.4MB/s]
model-00002-of-00002.safetensors: 27%|████▉ | 1.24G/4.53G [01:59<06:22, 8.62MB/s]
model-00002-of-00002.safetensors: 59%|██████████▋ | 2.69G/4.53G [04:14<03:21, 9.12MB/s]
Note: You might encounter this warning message:
[WARNING] CUDA is not available. Proceeding without GPU.
Don’t worry – we’ll address this in the GPU setup section.
GPU Configuration
To utilize your GPU, first check your CUDA version:
nvcc -V
In my case, I had CUDA 12.4 installed. Based on your CUDA version, install the corresponding PyTorch version. For CUDA 12.4:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
Now that we’ve covered the basic setup steps, let me share some of the challenges I encountered along the way. Trust me, knowing these beforehand will save you some headaches…
The Real Challenges I Encountered
While I’ve made the setup process sound straightforward above, I actually hit quite a few roadblocks along the way. Let me share my experience – it might save you some time and frustration!
The VSCode Python Version Puzzle
Here’s a tricky situation I ran into. After setting Python 3.8.10 with pyenv:
python --version
Python 3.10.11
Wait, what? Even after running pyenv local 3.8.10, my Python version wasn’t changing. After some head-scratching and research, I discovered this was actually a VSCode quirk – you need to completely close and reopen VSCode for the version change to take effect. Nobody mentions this in the tutorials!
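One way to verify outside of VSCode is to ask pyenv directly which interpreter it resolves for the current directory (assuming your pyenv build supports the which subcommand, as pyenv-win does):
pyenv which python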
Detective Work: Finding the Version File
After restarting VSCode, I decided to investigate my project structure:
dir
2024/11/22 16:52 <DIR> .
2024/11/22 16:51 <DIR> ..
2024/11/22 16:51 1,446 .gitignore
2024/11/22 16:52 8 .python-version
2024/11/22 16:51 <DIR> annotation
2024/11/22 16:51 15,269 app.py
2024/11/22 16:51 5,619 app_multigpu.py
2024/11/22 16:51 <DIR> assets
2024/11/22 16:51 8,105 causal_video_vae_demo.ipynb
2024/11/22 16:51 <DIR> dataset
2024/11/22 16:51 <DIR> diffusion_schedulers
2024/11/22 16:51 <DIR> docs
2024/11/22 16:51 3,391 image_generation_demo.ipynb
2024/11/22 16:51 4,909 inference_multigpu.py
2024/11/22 16:51 1,086 LICENSE
2024/11/22 16:51 <DIR> pyramid_dit
2024/11/22 16:51 16,508 README.md
2024/11/22 16:51 406 requirements.txt
2024/11/22 16:51 <DIR> scripts
2024/11/22 16:51 <DIR> tools
2024/11/22 16:51 <DIR> train
2024/11/22 16:51 <DIR> trainer_misc
2024/11/22 16:51 14,387 utils.py
2024/11/22 16:51 7,052 video_generation_demo.ipynb
2024/11/22 16:51 <DIR> video_vae
Here’s where I made an interesting discovery – a .python-version file that simply contained “3.8.10”. You can spot this either through VSCode’s explorer or Windows File Explorer.
Running the version check again:
python --version
Python 3.8.10
Finally! The version had switched correctly.
Another Gotcha: The Missing Package Saga
Just when I thought I was ready to roll:
python app.py
Boom – another error:
Traceback (most recent call last):
File "app.py", line 3, in
import gradio as gr
ModuleNotFoundError: No module named 'gradio'
One problem led to another. When I tried to install the missing packages:
pip install gradio torch Pillow diffusers huggingface_hub
I got this warning:
WARNING: You are using pip version 21.1.1; however, version 24.3.1 is available.
Well, might as well do things properly:
python -m pip install --upgrade pip
And for good measure:
pip install -r requirements.txt
Who knew setting up a virtual environment could be such an adventure? But step by step, we got there!
Continuing Our Setup Journey
After overcoming those initial hurdles, we were making progress. But there was still one crucial piece of the puzzle left: getting our GPU ready for action.
The Final Boss: GPU Configuration
When you run the application, you might see this warning:
[WARNING] CUDA is not available. Proceeding without GPU.
Don’t panic! This warning is telling us that your GPU isn’t being recognized yet. Since video generation requires some serious computational power, getting this right is super important.
Let’s check what CUDA version we’re working with:
nvcc -V
In my setup, I had CUDA 12.4:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:30:10_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
Time to install the matching PyTorch version:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
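Once the install finishes, it’s worth confirming that PyTorch can actually see the GPU before relaunching the app. A minimal check:
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
If this prints True along with your CUDA version, the warning should be gone the next time you run app.py.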
And there we have it! Finally, we’ve got everything we need to start generating some videos. Trust me, all that setup work is about to pay off…
Time to Generate Some Videos!
Now that we’ve got everything set up, let’s dive into the fun part. First, launch the application:
python app.py
When you first run this, you’ll notice it starts downloading some pretty hefty model files. Don’t worry if this takes a while – grab a coffee and let it do its thing. Once the download is complete, the Gradio interface opens in your browser (if it doesn’t open automatically, visit the local URL printed in the terminal, typically http://127.0.0.1:7860).
Getting to Know the GUI
The interface is divided into two main tabs:
- Text-to-Video
- Prompt: Where you describe your desired video
- Duration: Video length (up to 16 frames for 384p, 31 frames for 768p)
- Guidance Scale: Controls how closely it follows your description
- Video Guidance Scale: Adjusts the intensity of motion
- Resolution: Choose between 384p or 768p
- Image-to-Video
- Input Image: Upload your starting image
- Prompt: Describe how you want the image to animate
- Other settings: Similar to text-to-video options
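By the way, if you’d rather script generation than click through the GUI, the project also exposes a Python API. Here’s a rough sketch of text-to-video inference, adapted from the repo’s README as I remember it – treat the class name, the generate() parameters, and the checkpoint path as things to verify against the current repository, since the API may change:
import torch
from pyramid_dit import PyramidDiTForVideoGeneration
from diffusers.utils import export_to_video

# Point this at the directory where the model weights were downloaded
model = PyramidDiTForVideoGeneration(
    "PATH/TO/pyramid-flow-checkpoints",
    "bf16",
    model_variant="diffusion_transformer_384p",  # 384p is kinder to 8GB cards
)
model.vae.to("cuda")
model.dit.to("cuda")
model.text_encoder.to("cuda")
model.vae.enable_tiling()  # trades speed for lower VRAM use during decoding

prompt = "a drone shot flying over a mountain range at sunrise"
with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch.bfloat16):
    frames = model.generate(
        prompt=prompt,
        num_inference_steps=[20, 20, 20],
        video_num_inference_steps=[10, 10, 10],
        height=384,
        width=640,
        temp=16,                   # "Duration" in the GUI
        guidance_scale=7.0,        # "Guidance Scale"
        video_guidance_scale=5.0,  # "Video Guidance Scale"
        output_type="pil",
        save_memory=True,
    )
export_to_video(frames, "./sample.mp4", fps=24)
As far as I can tell, the GUI is just a Gradio front-end over this same pipeline, so the parameters map onto the sliders described above.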
Pro Tips for Better Results
Through my experimentation, I’ve learned a few things that might help you:
- Start with 384p resolution (it’s much easier on your GPU)
- Be as specific as possible with your prompts – vague descriptions lead to vague results
- If you’ve got 8GB GPU memory like me, you might hit the “CUDA out of memory” error after a few generations – just refresh your browser if this happens
Real-World Generation Examples
Text-to-Video: My First Attempt
Let’s start with this creative prompt:
A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors
Here’s how I configured it:
- Resolution: 384p (playing it safe for the first try)
- Duration: 16
- Guidance Scale: 7.0
- Video Guidance Scale: 5.0
The results were fascinating! The AI generated a scene featuring an astronaut with what looked like a red knitted helmet, walking across a desert landscape. It really captured that cinematic quality I was hoping for in the prompt.
Image-to-Video: Breathing Life into Still Images
For my next experiment, I tried the sample Great Wall image with this prompt:
FPV flying over the Great Wall
Settings used:
- Resolution: 384p
- Duration: 16
- Video Guidance Scale: 4.0
The transformation was incredible – the static image smoothly transitioned into a dynamic sequence that really did look like drone footage flying over the Great Wall.
A Deep Dive into Memory Usage
Curious about resource consumption, I ran this diagnostic script:
import torch

# Device 0 stats: total VRAM, what PyTorch currently holds, and what it has reserved
print(f"Total Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**2:.0f}MB")
print(f"Allocated: {torch.cuda.memory_allocated() / 1024**2:.0f}MB")
print(f"Cached: {torch.cuda.memory_reserved() / 1024**2:.0f}MB")
The results were enlightening:
Total Memory: 8188MB
Allocated: 0MB
Cached: 0MB
This explained why I was struggling with higher resolutions – with only 8GB of total memory, my GPU was being pushed to its limits. (Allocated and Cached read 0MB here because I ran the script before loading the model; the total capacity is the number that matters.)
Tips and Troubleshooting
The GPU Memory Challenge
The most significant limitation I encountered was GPU memory constraints. With my 8GB GPU setup, I faced several challenges:
- 768p resolution was nearly impossible to work with
- Even at 384p, I could only generate a few videos before running into memory issues
- “CUDA out of memory” errors became a familiar sight
Effective Workarounds
After some trial and error, I found these strategies helped:
- Stick to 384p resolution for your initial work
- Reduce the duration (frame count) when memory gets tight
- Refresh your browser when errors occur
- Restart the application to clear GPU memory if things get sluggish (or try the lighter-weight option sketched below)
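On that last point: if you’re comfortable poking at the code, you can also release cached GPU memory between generations without a full restart. This is a minimal sketch using standard PyTorch calls, not something Pyramid Flow exposes as a setting as far as I know:
import gc
import torch

gc.collect()              # drop unreferenced Python objects first
torch.cuda.empty_cache()  # hand cached blocks back to the driver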
Practical Usage Tips
Here’s how I’ve learned to work efficiently with limited resources:
- Start with low resolution to test your concepts
- Once you get the results you like, try bumping up to higher resolution
- When memory issues occur, take a quick break and restart the application
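One more habit that helps: keep nvidia-smi running in a second terminal so you can watch VRAM usage in real time. The -l flag makes it refresh every few seconds:
nvidia-smi -l 5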
Final Thoughts
While Pyramid Flow is an incredibly powerful tool, it does require decent GPU specifications to really shine. A GPU with 16GB+ memory would definitely provide a smoother experience.
That said, don’t let hardware limitations discourage you. Even with my modest 8GB setup, I was able to create some truly impressive videos. The key is understanding your system’s limits and working within them.
The world of AI video generation is evolving rapidly, and tools like Pyramid Flow are making it more accessible than ever. Whether you’re a content creator, an AI enthusiast, or just someone curious about the latest tech, it’s an exciting time to dive in and experiment.
Give it a try – you might be surprised at what you can create, even with basic hardware. And remember, today’s limitations are tomorrow’s laughable specs. The future of AI video generation is looking brighter every day!