Getting Started with Excel Automation in Python: Building on Our Foundation
In our previous post, we took the initial steps toward automating Excel by completing the following tasks:

- Creating Sample Data: Developed example datasets to work with
- Generating Split Sales Files: Divided the sales data into three separate Excel files
- Preparing Incomplete Data for Cleaning: Created datasets that required cleaning and preprocessing
- Setting Up Sales Data for Analysis: Organized the sales data files to be ready for in-depth analysis
Additionally, we implemented some fundamental automation scripts, including:
- Merging Excel Files: Combined multiple Excel files into a single dataset
- Data Cleaning Functions: Automated the process of cleaning and preparing data for analysis
- Automated Monthly Reporting: Created scripts to generate monthly reports without manual intervention
Enhancing Our Automation with Advanced Analysis
Building on this solid foundation, our goal for this installment is to introduce more practical analytical features and ensure everything is explained in detail for beginners. We’ll use the latest versions of pandas (2.3.2) and openpyxl (3.1.4) as of 2025, following official documentation best practices.
Prerequisites and Environment Setup
Step 1: Check Your Python Version
First, ensure you have Python 3.9 or later installed:
python --version
# or
python3 --version
Expected output: Python 3.9.0 or higher (up to 3.13)
Step 2: Create a Virtual Environment
Important: Always use a virtual environment for Python projects to avoid dependency conflicts.
On Windows:
# Navigate to your project folder
cd C:\Users\YourName\excel-automation
# Create virtual environment
python -m venv excel_env
# Activate it
excel_env\Scripts\activate
On macOS/Linux:
# Navigate to your project folder
cd ~/excel-automation
# Create virtual environment
python3 -m venv excel_env
# Activate it
source excel_env/bin/activate
Success indicator: Your terminal prompt should now show (excel_env) at the beginning.
Step 3: Create Project Structure
Create the following folder structure:
excel-automation/
│
├── excel_env/ # Virtual environment (created above)
├── excel_files/ # For storing generated Excel files
├── scripts/ # Python scripts
└── requirements.txt # Package dependencies
Create folders:
mkdir excel_files scripts
Installing Required Packages
Step 1: Create requirements.txt
Create a file named requirements.txt with the following content:
pandas==2.3.2
openpyxl==3.1.4
scipy==1.13.1
xlsxwriter==3.2.0
numpy==1.26.4
Step 2: Install Packages
With your virtual environment activated:
pip install --upgrade pip
pip install -r requirements.txt
Step 3: Verify Installation
Create a file scripts/test_imports.py:
"""Test if all required packages are installed correctly"""
import sys
print(f"Python version: {sys.version}")
try:
import pandas as pd
print(f"✓ pandas {pd.__version__} installed")
except ImportError as e:
print(f"✗ pandas not installed: {e}")
try:
import openpyxl
print(f"✓ openpyxl {openpyxl.__version__} installed")
except ImportError as e:
print(f"✗ openpyxl not installed: {e}")
try:
import scipy
print(f"✓ scipy {scipy.__version__} installed")
except ImportError as e:
print(f"✗ scipy not installed: {e}")
try:
import numpy as np
print(f"✓ numpy {np.__version__} installed")
except ImportError as e:
print(f"✗ numpy not installed: {e}")
print("\nAll packages imported successfully!")
Run it:
python scripts/test_imports.py
Creating Sample Data
Step 1: Generate Comprehensive Sales Data
Create scripts/create_sample_data.py:
"""
Create sample Excel files for learning Excel automation
"""
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import os
from pathlib import Path
# Set random seed for reproducibility
np.random.seed(42)
def create_sales_data():
"""Generate realistic sales data with multiple dimensions"""
# Configuration
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 12, 31)
num_records = 1000
# Sample data categories
products = ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones',
'Webcam', 'USB Cable', 'Phone Charger', 'Tablet', 'Smart Watch']
categories = ['Electronics', 'Accessories', 'Peripherals']
sales_reps = [f'Rep {i}' for i in range(1, 11)]
regions = ['North', 'South', 'East', 'West']
# Generate random dates
date_range = (end_date - start_date).days
random_days = np.random.randint(0, date_range, num_records)
dates = [start_date + timedelta(days=int(day)) for day in random_days]
# Create the main dataframe
df = pd.DataFrame({
'Date': dates,
'Year-Month': [date.strftime('%Y-%m') for date in dates],
'Product': np.random.choice(products, num_records),
'Category': np.random.choice(categories, num_records),
'Sales_Rep': np.random.choice(sales_reps, num_records),
'Region': np.random.choice(regions, num_records),
'Quantity': np.random.randint(1, 50, num_records),
'Unit_Price': np.round(np.random.uniform(10, 1000, num_records), 2),
})
# Calculate Sales Amount
df['Sales_Amount'] = np.round(df['Quantity'] * df['Unit_Price'], 2)
# Add some realistic patterns
# Higher sales in Q4 (holiday season)
q4_mask = df['Date'].dt.quarter == 4
df.loc[q4_mask, 'Quantity'] = (df.loc[q4_mask, 'Quantity'] * 1.5).astype(int)
df.loc[q4_mask, 'Sales_Amount'] = np.round(df.loc[q4_mask, 'Quantity'] *
df.loc[q4_mask, 'Unit_Price'], 2)
# Sort by date
df = df.sort_values('Date').reset_index(drop=True)
return df
def create_messy_data():
"""Generate data that needs cleaning for practice"""
# Create intentionally messy data
data = {
'Customer_Name': ['John Doe', ' Jane Smith ', 'Bob Johnson', None, 'Alice Brown'],
'Email': ['john@email.com', 'JANE@EMAIL.COM', 'bob@', 'invalid-email', 'alice@email.com'],
'Phone': ['123-456-7890', '9876543210', '(555) 123-4567', 'not a phone', None],
'Purchase_Date': ['2023-01-15', '2023/02/20', '03-15-2023', 'yesterday', '2023-04-10'],
'Amount': [100.50, '200', None, 'fifty', 350.75],
'Status': ['Active', 'active', 'ACTIVE', 'Inactive', None]
}
# Duplicate some rows
df = pd.DataFrame(data)
df = pd.concat([df, df.iloc[[0, 2]]], ignore_index=True)
return df
def split_sales_files(df):
"""Split the main sales data into multiple files"""
# Create excel_files directory if it doesn't exist
Path('excel_files').mkdir(exist_ok=True)
# Split data into 3 parts
chunk_size = len(df) // 3
for i in range(3):
start_idx = i * chunk_size
if i == 2: # Last chunk gets remainder
chunk = df.iloc[start_idx:]
else:
chunk = df.iloc[start_idx:start_idx + chunk_size]
filename = f'excel_files/sales_data_{i+1}.xlsx'
chunk.to_excel(filename, index=False, engine='openpyxl')
print(f"Created: {filename} ({len(chunk)} records)")
return True
def main():
"""Main function to create all sample files"""
print("=" * 60)
print("Creating Sample Excel Files for Learning")
print("=" * 60)
try:
# Create main sales data
print("\n1. Generating sales data...")
sales_df = create_sales_data()
sales_df.to_excel('sales_data_original.xlsx', index=False, engine='openpyxl')
print(f" ✓ Created sales_data_original.xlsx ({len(sales_df)} records)")
# Show sample of the data
print("\n Sample of sales data:")
print(sales_df[['Date', 'Product', 'Sales_Amount']].head())
# Create messy data for cleaning practice
print("\n2. Generating messy data for cleaning...")
messy_df = create_messy_data()
messy_df.to_excel('messy_data.xlsx', index=False, engine='openpyxl')
print(f" ✓ Created messy_data.xlsx ({len(messy_df)} records)")
# Split files
print("\n3. Creating split files...")
split_sales_files(sales_df)
print("\n" + "=" * 60)
print("✓ All sample files created successfully!")
print("=" * 60)
# Display file structure
print("\nYour file structure should now look like:")
print("""
excel-automation/
├── sales_data_original.xlsx
├── messy_data.xlsx
└── excel_files/
├── sales_data_1.xlsx
├── sales_data_2.xlsx
└── sales_data_3.xlsx
""")
except Exception as e:
print(f"\n✗ Error creating files: {e}")
print("\nTroubleshooting tips:")
print("1. Make sure you're in the correct directory")
print("2. Check that you have write permissions")
print("3. Close any Excel files that might be open")
return False
return True
if __name__ == "__main__":
main()
Basic Excel Operations
Step 1: Merging Multiple Excel Files
Create scripts/merge_excel_files.py:
"""
Merge multiple Excel files into one
"""
import pandas as pd
from pathlib import Path
import os
def merge_excel_files(input_folder='excel_files', output_file='merged_sales.xlsx'):
"""
Merge all Excel files in a folder into one file
Parameters:
-----------
input_folder : str
Folder containing Excel files to merge
output_file : str
Name of the output merged file
"""
# Check if folder exists
if not os.path.exists(input_folder):
raise FileNotFoundError(f"Folder '{input_folder}' not found!")
# Get all Excel files
excel_files = list(Path(input_folder).glob('*.xlsx'))
if not excel_files:
raise ValueError(f"No Excel files found in '{input_folder}'")
print(f"Found {len(excel_files)} Excel files to merge:")
for file in excel_files:
print(f" - {file.name}")
# Read and combine all files
all_dataframes = []
for file in excel_files:
try:
df = pd.read_excel(file, engine='openpyxl')
all_dataframes.append(df)
print(f" ✓ Read {file.name}: {len(df)} records")
except Exception as e:
print(f" ✗ Error reading {file.name}: {e}")
# Merge all dataframes
if all_dataframes:
merged_df = pd.concat(all_dataframes, ignore_index=True)
# Save to Excel
merged_df.to_excel(output_file, index=False, engine='openpyxl')
print(f"\n✓ Successfully merged {len(all_dataframes)} files")
print(f" Total records: {len(merged_df)}")
print(f" Saved to: {output_file}")
return merged_df
else:
print("✗ No files could be read")
return None
if __name__ == "__main__":
# Run the merge
merged_data = merge_excel_files()
if merged_data is not None:
print("\nFirst 5 rows of merged data:")
print(merged_data.head())
Step 2: Writing to Multiple Sheets
Create scripts/multi_sheet_writer.py:
"""
Write data to multiple sheets in an Excel file
"""
import pandas as pd
import numpy as np
from datetime import datetime
def create_multi_sheet_report(sales_df):
"""
Create an Excel report with multiple sheets
"""
# Create the Excel writer object
output_file = 'multi_sheet_report.xlsx'
# Using context manager ensures file is properly saved
with pd.ExcelWriter(output_file, engine='xlsxwriter') as writer:
# Sheet 1: Raw Data
sales_df.to_excel(writer, sheet_name='Raw Data', index=False)
print("✓ Added 'Raw Data' sheet")
# Sheet 2: Monthly Summary
monthly_summary = sales_df.groupby('Year-Month').agg({
'Sales_Amount': ['sum', 'mean', 'count'],
'Quantity': 'sum'
}).round(2)
monthly_summary.columns = ['Total_Sales', 'Avg_Sales', 'Num_Transactions', 'Total_Quantity']
monthly_summary.to_excel(writer, sheet_name='Monthly Summary')
print("✓ Added 'Monthly Summary' sheet")
# Sheet 3: Product Performance
product_summary = sales_df.groupby('Product').agg({
'Sales_Amount': 'sum',
'Quantity': 'sum'
}).sort_values('Sales_Amount', ascending=False).round(2)
product_summary.to_excel(writer, sheet_name='Product Performance')
print("✓ Added 'Product Performance' sheet")
# Sheet 4: Regional Analysis
regional_summary = sales_df.groupby(['Region', 'Category']).agg({
'Sales_Amount': 'sum'
}).unstack(fill_value=0).round(2)
regional_summary.to_excel(writer, sheet_name='Regional Analysis')
print("✓ Added 'Regional Analysis' sheet")
# Sheet 5: Top Performers
top_reps = sales_df.groupby('Sales_Rep')['Sales_Amount'].sum().nlargest(5)
top_products = sales_df.groupby('Product')['Sales_Amount'].sum().nlargest(5)
top_df = pd.DataFrame({
'Top Sales Reps': top_reps.index,
'Rep Sales': top_reps.values,
'Top Products': top_products.index,
'Product Sales': top_products.values
})
top_df.to_excel(writer, sheet_name='Top Performers', index=False)
print("✓ Added 'Top Performers' sheet")
print(f"\n✓ Multi-sheet report saved to: {output_file}")
return output_file
if __name__ == "__main__":
# Load the sales data
try:
df = pd.read_excel('sales_data_original.xlsx', engine='openpyxl')
print(f"Loaded {len(df)} records from sales_data_original.xlsx")
# Create the report
report_file = create_multi_sheet_report(df)
except FileNotFoundError:
print("✗ Error: sales_data_original.xlsx not found!")
print(" Please run create_sample_data.py first")
except Exception as e:
print(f"✗ Error: {e}")
Advanced Analysis with SciPy
Enhanced Analysis Script
Create scripts/smart_analysis.py:
"""
Advanced Excel data analysis with SciPy and statistical insights
"""
import pandas as pd
import numpy as np
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
def check_column_names(df):
"""
Display available columns and check for required columns
"""
print("\n=== Data Structure Check ===")
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")
print("\nAvailable columns:")
for i, col in enumerate(df.columns, 1):
print(f" {i}. {col}")
# Check for required columns
required_columns = ['Year-Month', 'Sales_Amount', 'Product', 'Region']
missing_columns = [col for col in required_columns if col not in df.columns]
if missing_columns:
print(f"\n⚠ Warning: Missing columns: {missing_columns}")
print(" Column names are case-sensitive!")
return False
else:
print("\n✓ All required columns found")
return True
def discover_insights(df):
"""
Extract comprehensive insights from sales data
"""
insights = []
# Ensure we have the right columns
if not check_column_names(df):
print("\nAttempting to fix column names...")
# Try to fix common column name issues
df.columns = df.columns.str.replace(' ', '_')
if not check_column_names(df):
return ["Error: Required columns not found in data"]
print("\n=== Analyzing Data ===")
try:
# 1. Monthly Sales Analysis
monthly_sales = df.groupby('Year-Month')['Sales_Amount'].sum()
# Best and worst months
best_month = monthly_sales.idxmax()
best_month_sales = monthly_sales.max()
worst_month = monthly_sales.idxmin()
worst_month_sales = monthly_sales.min()
insights.append(
f"Best Sales Month: {best_month} "
f"(¥{best_month_sales:,.0f})"
)
insights.append(
f"Worst Sales Month: {worst_month} "
f"(¥{worst_month_sales:,.0f})"
)
# 2. Statistical Analysis
sales_mean = df['Sales_Amount'].mean()
sales_median = df['Sales_Amount'].median()
sales_std = df['Sales_Amount'].std()
insights.append(
f"Average Transaction: ¥{sales_mean:,.2f} "
f"(Median: ¥{sales_median:,.2f}, Std Dev: ¥{sales_std:,.2f})"
)
# 3. Trend Analysis using SciPy
months_numeric = np.arange(len(monthly_sales))
slope, intercept, r_value, p_value, std_err = stats.linregress(
months_numeric, monthly_sales.values
)
trend = "increasing ↑" if slope > 0 else "decreasing ↓"
insights.append(
f"Sales Trend: {trend} "
f"(¥{abs(slope):,.0f}/month, R²={r_value**2:.3f})"
)
# 4. Product Performance
top_product = df.groupby('Product')['Sales_Amount'].sum().idxmax()
top_product_sales = df.groupby('Product')['Sales_Amount'].sum().max()
insights.append(
f"Top Product: {top_product} "
f"(¥{top_product_sales:,.0f} total sales)"
)
# 5. Regional Insights
regional_sales = df.groupby('Region')['Sales_Amount'].sum()
top_region = regional_sales.idxmax()
top_region_sales = regional_sales.max()
insights.append(
f"Top Region: {top_region} "
f"(¥{top_region_sales:,.0f} total sales)"
)
# 6. Outlier Detection using SciPy
z_scores = np.abs(stats.zscore(df['Sales_Amount']))
outliers = df[z_scores > 3]
if len(outliers) > 0:
insights.append(
f"Outliers Detected: {len(outliers)} transactions "
f"(>{3} standard deviations from mean)"
)
# 7. Seasonality Check (Q4 vs other quarters)
if 'Date' in df.columns:
df['Date'] = pd.to_datetime(df['Date'])
df['Quarter'] = df['Date'].dt.quarter
q4_sales = df[df['Quarter'] == 4]['Sales_Amount'].sum()
other_q_sales = df[df['Quarter'] != 4]['Sales_Amount'].sum()
q4_boost = ((q4_sales / 3) / (other_q_sales / 9) - 1) * 100
if q4_boost > 10:
insights.append(
f"Seasonal Pattern: Q4 shows {q4_boost:.1f}% higher sales "
f"(holiday season effect)"
)
except KeyError as e:
insights.append(f"Error: Column {e} not found in data")
print(f"\n✗ KeyError: {e}")
print(" Please check your column names match exactly")
except Exception as e:
insights.append(f"Analysis error: {str(e)}")
print(f"\n✗ Unexpected error: {e}")
return insights
def save_insights_to_excel(insights, df, output_file='analysis_report.xlsx'):
"""
Save insights and data to a formatted Excel file
"""
with pd.ExcelWriter(output_file, engine='xlsxwriter') as writer:
# Create insights dataframe
insights_df = pd.DataFrame({
'Insight': insights,
'Generated': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')
})
# Write insights
insights_df.to_excel(writer, sheet_name='Key Insights', index=False)
# Write raw data
df.head(100).to_excel(writer, sheet_name='Sample Data', index=False)
# Get the workbook and worksheet objects
workbook = writer.book
worksheet = writer.sheets['Key Insights']
# Add formatting
header_format = workbook.add_format({
'bold': True,
'bg_color': '#4CAF50',
'font_color': 'white',
'border': 1
})
# Apply header formatting
for col_num, value in enumerate(insights_df.columns.values):
worksheet.write(0, col_num, value, header_format)
# Adjust column widths
worksheet.set_column('A:A', 80)
worksheet.set_column('B:B', 20)
print(f"\n✓ Analysis report saved to: {output_file}")
# Main Execution
if __name__ == "__main__":
try:
# Load the sample data
print("Loading data...")
df = pd.read_excel("sales_data_original.xlsx", engine='openpyxl')
print(f"✓ Loaded {len(df)} records")
# Run the analysis
print("\nStarting analysis...")
found_insights = discover_insights(df)
# Display the results
print("\n" + "=" * 60)
print("ANALYSIS RESULTS")
print("=" * 60)
for i, insight in enumerate(found_insights, 1):
print(f"{i}. {insight}")
# Save to Excel
save_insights_to_excel(found_insights, df)
except FileNotFoundError:
print("\n✗ Error: sales_data_original.xlsx not found!")
print("\nSolution:")
print("1. Run: python scripts/create_sample_data.py")
print("2. Make sure you're in the correct directory")
except Exception as e:
print(f"\n✗ Unexpected error: {e}")
print("\nTroubleshooting:")
print("1. Check that all packages are installed correctly")
print("2. Verify the Excel file isn't corrupted")
print("3. Make sure the file isn't open in Excel")
Troubleshooting Guide
Common Errors and Solutions
1. ModuleNotFoundError
Error:
ModuleNotFoundError: No module named 'pandas'
Solution:
# Make sure your virtual environment is activated
# Windows:
excel_env\Scripts\activate
# macOS/Linux:
source excel_env/bin/activate
# Then reinstall
pip install pandas
2. KeyError with Column Names
Error:
KeyError: 'Year-Month'
Solution: Check exact column names (case-sensitive):
import pandas as pd
df = pd.read_excel('your_file.xlsx')
print("Actual column names:")
print(df.columns.tolist())
# Fix column names if needed
df.columns = df.columns.str.replace(' ', '_')
3. Permission Denied Error
Error:
PermissionError: [Errno 13] Permission denied: 'file.xlsx'
Solution:
- Close the Excel file if it’s open
- Check file permissions
- Use a different filename:
from datetime import datetime
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
filename = f'output_{timestamp}.xlsx'
4. Excel File Corruption
Error:
BadZipFile: File is not a zip file
Solution:
# Try different engines
try:
df = pd.read_excel('file.xlsx', engine='openpyxl')
except:
df = pd.read_excel('file.xlsx', engine='xlrd')
Best Practices Checklist
✅ Always use virtual environments – Isolate project dependencies
✅ Use context managers – with pd.ExcelWriter() ensures proper file closing
✅ Add error handling – Use try-except blocks for robust code
✅ Check column names – They’re case-sensitive
✅ Close Excel files – Before running Python scripts
✅ Use meaningful variable names – Makes code self-documenting
✅ Add progress indicators – For long-running operations
✅ Version control – Use Git to track changes
✅ Document your code – Add docstrings and comments
✅ Test incrementally – Run code step by step when debugging
Complete Workflow Example
Here’s how to run everything from start to finish:
# 1. Setup environment (one time only)
python -m venv excel_env
source excel_env/bin/activate # or excel_env\Scripts\activate on Windows
pip install -r requirements.txt
# 2. Create sample data
python scripts/create_sample_data.py
# 3. Run basic operations
python scripts/merge_excel_files.py
python scripts/multi_sheet_writer.py
# 4. Run advanced analysis
python scripts/smart_analysis.py
# 5. Check the generated files
ls *.xlsx
ls excel_files/*.xlsx
Next Steps and Resources
Official Documentation
- pandas documentation is available at pandas.pydata.org
- openpyxl documentation is at: https://openpyxl.readthedocs.io
Advanced Topics to Explore
- Data Visualization: Add charts using xlsxwriter
- Formatting: Apply conditional formatting and styles
- Formulas: Add Excel formulas programmatically
- Performance: Handle large files with chunking
- Automation: Schedule scripts with cron/Task Scheduler
Project Ideas
- Sales dashboard generator
- Automated monthly reports
- Data quality checker
- Excel template filler
- Multi-file consolidator
Conclusion
You now have a complete foundation for Excel automation with Python. This guide has covered:
- Proper environment setup with virtual environments
- Creating and manipulating Excel files with multiple sheets
- Advanced analysis using SciPy for statistical insights
- Comprehensive error handling and troubleshooting
Remember to always refer to the official documentation for the most up-to-date information, and practice with your own data to solidify these concepts.
Last updated: September 2025 | Python 3.9+ | pandas 2.3.2 | openpyxl 3.1.4

