Resolving KeyError Issues in Python Pandas for Excel Data Analysis

Getting Started with Excel Automation in Python: Building on Our Foundation

In our previous post, we took the initial steps toward automating Excel by completing the following tasks:

  • Creating Sample Data: Developed example datasets to work with
  • Generating Split Sales Files: Divided the sales data into three separate Excel files
  • Preparing Incomplete Data for Cleaning: Created datasets that required cleaning and preprocessing
  • Setting Up Sales Data for Analysis: Organized the sales data files to be ready for in-depth analysis

Additionally, we implemented some fundamental automation scripts, including:

  • Merging Excel Files: Combined multiple Excel files into a single dataset
  • Data Cleaning Functions: Automated the process of cleaning and preparing data for analysis
  • Automated Monthly Reporting: Created scripts to generate monthly reports without manual intervention

Enhancing Our Automation with Advanced Analysis

Building on that foundation, this installment introduces more practical analytical features, with each step explained in enough detail for beginners to follow. We’ll use pandas 2.3.2 and openpyxl 3.1.4 (the latest releases as of 2025) and stick to the best practices from the official documentation.

Prerequisites and Environment Setup

Step 1: Check Your Python Version

First, ensure you have Python 3.9 or later installed:

python --version
# or
python3 --version

Expected output: Python 3.9.0 or higher (up to 3.13)
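
If you prefer to check from inside Python instead of the shell, here is a minimal sketch; the 3.9 floor simply mirrors the requirement above:

import sys

# Stop early if the interpreter is older than this tutorial assumes
assert sys.version_info >= (3, 9), f"Python 3.9+ required, found {sys.version}"
print(f"OK: {sys.version}")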

Step 2: Create a Virtual Environment

Important: Always use a virtual environment for Python projects to avoid dependency conflicts.

On Windows:

# Navigate to your project folder
cd C:\Users\YourName\excel-automation

# Create virtual environment
python -m venv excel_env

# Activate it
excel_env\Scripts\activate

On macOS/Linux:

# Navigate to your project folder
cd ~/excel-automation

# Create virtual environment
python3 -m venv excel_env

# Activate it
source excel_env/bin/activate

Success indicator: Your terminal prompt should now show (excel_env) at the beginning.

Step 3: Create Project Structure

Create the following folder structure:

excel-automation/
│
├── excel_env/           # Virtual environment (created above)
├── excel_files/         # For storing generated Excel files
├── scripts/             # Python scripts
└── requirements.txt     # Package dependencies

Create folders:

mkdir excel_files scripts

Installing Required Packages

Step 1: Create requirements.txt

Create a file named requirements.txt with the following content:

pandas==2.3.2
openpyxl==3.1.4
scipy==1.13.1
xlsxwriter==3.2.0
numpy==1.26.4

Step 2: Install Packages

With your virtual environment activated:

pip install --upgrade pip
pip install -r requirements.txt

Step 3: Verify Installation

Create a file scripts/test_imports.py:

"""Test if all required packages are installed correctly"""

import sys
print(f"Python version: {sys.version}")

try:
    import pandas as pd
    print(f"✓ pandas {pd.__version__} installed")
except ImportError as e:
    print(f"✗ pandas not installed: {e}")

try:
    import openpyxl
    print(f"✓ openpyxl {openpyxl.__version__} installed")
except ImportError as e:
    print(f"✗ openpyxl not installed: {e}")

try:
    import scipy
    print(f"✓ scipy {scipy.__version__} installed")
except ImportError as e:
    print(f"✗ scipy not installed: {e}")

try:
    import numpy as np
    print(f"✓ numpy {np.__version__} installed")
except ImportError as e:
    print(f"✗ numpy not installed: {e}")

print("\nAll packages imported successfully!")

Run it:

python scripts/test_imports.py

Creating Sample Data

Step 1: Generate Comprehensive Sales Data

Create scripts/create_sample_data.py:

"""
Create sample Excel files for learning Excel automation
"""

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import os
from pathlib import Path

# Set random seed for reproducibility
np.random.seed(42)

def create_sales_data():
    """Generate realistic sales data with multiple dimensions"""
    
    # Configuration
    start_date = datetime(2023, 1, 1)
    end_date = datetime(2023, 12, 31)
    num_records = 1000
    
    # Sample data categories
    products = ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones', 
                'Webcam', 'USB Cable', 'Phone Charger', 'Tablet', 'Smart Watch']
    categories = ['Electronics', 'Accessories', 'Peripherals']
    sales_reps = [f'Rep {i}' for i in range(1, 11)]
    regions = ['North', 'South', 'East', 'West']
    
    # Generate random dates
    date_range = (end_date - start_date).days
    random_days = np.random.randint(0, date_range, num_records)
    dates = [start_date + timedelta(days=int(day)) for day in random_days]
    
    # Create the main dataframe
    df = pd.DataFrame({
        'Date': dates,
        'Year-Month': [date.strftime('%Y-%m') for date in dates],
        'Product': np.random.choice(products, num_records),
        'Category': np.random.choice(categories, num_records),
        'Sales_Rep': np.random.choice(sales_reps, num_records),
        'Region': np.random.choice(regions, num_records),
        'Quantity': np.random.randint(1, 50, num_records),
        'Unit_Price': np.round(np.random.uniform(10, 1000, num_records), 2),
    })
    
    # Calculate Sales Amount
    df['Sales_Amount'] = np.round(df['Quantity'] * df['Unit_Price'], 2)
    
    # Add some realistic patterns
    # Higher sales in Q4 (holiday season)
    q4_mask = df['Date'].dt.quarter == 4
    df.loc[q4_mask, 'Quantity'] = (df.loc[q4_mask, 'Quantity'] * 1.5).astype(int)
    df.loc[q4_mask, 'Sales_Amount'] = np.round(df.loc[q4_mask, 'Quantity'] * 
                                                 df.loc[q4_mask, 'Unit_Price'], 2)
    
    # Sort by date
    df = df.sort_values('Date').reset_index(drop=True)
    
    return df

def create_messy_data():
    """Generate data that needs cleaning for practice"""
    
    # Create intentionally messy data
    data = {
        'Customer_Name': ['John Doe', '  Jane Smith  ', 'Bob Johnson', None, 'Alice Brown'],
        'Email': ['john@email.com', 'JANE@EMAIL.COM', 'bob@', 'invalid-email', 'alice@email.com'],
        'Phone': ['123-456-7890', '9876543210', '(555) 123-4567', 'not a phone', None],
        'Purchase_Date': ['2023-01-15', '2023/02/20', '03-15-2023', 'yesterday', '2023-04-10'],
        'Amount': [100.50, '200', None, 'fifty', 350.75],
        'Status': ['Active', 'active', 'ACTIVE', 'Inactive', None]
    }
    
    # Duplicate some rows
    df = pd.DataFrame(data)
    df = pd.concat([df, df.iloc[[0, 2]]], ignore_index=True)
    
    return df

def split_sales_files(df):
    """Split the main sales data into multiple files"""
    
    # Create excel_files directory if it doesn't exist
    Path('excel_files').mkdir(exist_ok=True)
    
    # Split data into 3 parts
    chunk_size = len(df) // 3
    
    for i in range(3):
        start_idx = i * chunk_size
        if i == 2:  # Last chunk gets remainder
            chunk = df.iloc[start_idx:]
        else:
            chunk = df.iloc[start_idx:start_idx + chunk_size]
        
        filename = f'excel_files/sales_data_{i+1}.xlsx'
        chunk.to_excel(filename, index=False, engine='openpyxl')
        print(f"Created: {filename} ({len(chunk)} records)")
    
    return True

def main():
    """Main function to create all sample files"""
    
    print("=" * 60)
    print("Creating Sample Excel Files for Learning")
    print("=" * 60)
    
    try:
        # Create main sales data
        print("\n1. Generating sales data...")
        sales_df = create_sales_data()
        sales_df.to_excel('sales_data_original.xlsx', index=False, engine='openpyxl')
        print(f"   ✓ Created sales_data_original.xlsx ({len(sales_df)} records)")
        
        # Show sample of the data
        print("\n   Sample of sales data:")
        print(sales_df[['Date', 'Product', 'Sales_Amount']].head())
        
        # Create messy data for cleaning practice
        print("\n2. Generating messy data for cleaning...")
        messy_df = create_messy_data()
        messy_df.to_excel('messy_data.xlsx', index=False, engine='openpyxl')
        print(f"   ✓ Created messy_data.xlsx ({len(messy_df)} records)")
        
        # Split files
        print("\n3. Creating split files...")
        split_sales_files(sales_df)
        
        print("\n" + "=" * 60)
        print("✓ All sample files created successfully!")
        print("=" * 60)
        
        # Display file structure
        print("\nYour file structure should now look like:")
        print("""
excel-automation/
├── sales_data_original.xlsx
├── messy_data.xlsx
└── excel_files/
    ├── sales_data_1.xlsx
    ├── sales_data_2.xlsx
    └── sales_data_3.xlsx
        """)
        
    except Exception as e:
        print(f"\n✗ Error creating files: {e}")
        print("\nTroubleshooting tips:")
        print("1. Make sure you're in the correct directory")
        print("2. Check that you have write permissions")
        print("3. Close any Excel files that might be open")
        return False
    
    return True

if __name__ == "__main__":
    main()
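
Run it from the project root:

python scripts/create_sample_data.py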

Basic Excel Operations

Step 1: Merging Multiple Excel Files

Create scripts/merge_excel_files.py:

"""
Merge multiple Excel files into one
"""

import pandas as pd
from pathlib import Path
import os

def merge_excel_files(input_folder='excel_files', output_file='merged_sales.xlsx'):
    """
    Merge all Excel files in a folder into one file
    
    Parameters:
    -----------
    input_folder : str
        Folder containing Excel files to merge
    output_file : str
        Name of the output merged file
    """
    
    # Check if folder exists
    if not os.path.exists(input_folder):
        raise FileNotFoundError(f"Folder '{input_folder}' not found!")
    
    # Get all Excel files
    excel_files = list(Path(input_folder).glob('*.xlsx'))
    
    if not excel_files:
        raise ValueError(f"No Excel files found in '{input_folder}'")
    
    print(f"Found {len(excel_files)} Excel files to merge:")
    for file in excel_files:
        print(f"  - {file.name}")
    
    # Read and combine all files
    all_dataframes = []
    
    for file in excel_files:
        try:
            df = pd.read_excel(file, engine='openpyxl')
            all_dataframes.append(df)
            print(f"  ✓ Read {file.name}: {len(df)} records")
        except Exception as e:
            print(f"  ✗ Error reading {file.name}: {e}")
    
    # Merge all dataframes
    if all_dataframes:
        merged_df = pd.concat(all_dataframes, ignore_index=True)
        
        # Save to Excel
        merged_df.to_excel(output_file, index=False, engine='openpyxl')
        
        print(f"\n✓ Successfully merged {len(all_dataframes)} files")
        print(f"  Total records: {len(merged_df)}")
        print(f"  Saved to: {output_file}")
        
        return merged_df
    else:
        print("✗ No files could be read")
        return None

if __name__ == "__main__":
    # Run the merge
    merged_data = merge_excel_files()
    
    if merged_data is not None:
        print("\nFirst 5 rows of merged data:")
        print(merged_data.head())
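
Run it:

python scripts/merge_excel_files.py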

Step 2: Writing to Multiple Sheets

Create scripts/multi_sheet_writer.py:

"""
Write data to multiple sheets in an Excel file
"""

import pandas as pd
import numpy as np
from datetime import datetime

def create_multi_sheet_report(sales_df):
    """
    Create an Excel report with multiple sheets
    """
    
    # Create the Excel writer object
    output_file = 'multi_sheet_report.xlsx'
    
    # Using context manager ensures file is properly saved
    with pd.ExcelWriter(output_file, engine='xlsxwriter') as writer:
        
        # Sheet 1: Raw Data
        sales_df.to_excel(writer, sheet_name='Raw Data', index=False)
        print("✓ Added 'Raw Data' sheet")
        
        # Sheet 2: Monthly Summary
        monthly_summary = sales_df.groupby('Year-Month').agg({
            'Sales_Amount': ['sum', 'mean', 'count'],
            'Quantity': 'sum'
        }).round(2)
        monthly_summary.columns = ['Total_Sales', 'Avg_Sales', 'Num_Transactions', 'Total_Quantity']
        monthly_summary.to_excel(writer, sheet_name='Monthly Summary')
        print("✓ Added 'Monthly Summary' sheet")
        
        # Sheet 3: Product Performance
        product_summary = sales_df.groupby('Product').agg({
            'Sales_Amount': 'sum',
            'Quantity': 'sum'
        }).sort_values('Sales_Amount', ascending=False).round(2)
        product_summary.to_excel(writer, sheet_name='Product Performance')
        print("✓ Added 'Product Performance' sheet")
        
        # Sheet 4: Regional Analysis
        regional_summary = sales_df.groupby(['Region', 'Category']).agg({
            'Sales_Amount': 'sum'
        }).unstack(fill_value=0).round(2)
        regional_summary.to_excel(writer, sheet_name='Regional Analysis')
        print("✓ Added 'Regional Analysis' sheet")
        
        # Sheet 5: Top Performers
        top_reps = sales_df.groupby('Sales_Rep')['Sales_Amount'].sum().nlargest(5)
        top_products = sales_df.groupby('Product')['Sales_Amount'].sum().nlargest(5)
        
        top_df = pd.DataFrame({
            'Top Sales Reps': top_reps.index,
            'Rep Sales': top_reps.values,
            'Top Products': top_products.index,
            'Product Sales': top_products.values
        })
        top_df.to_excel(writer, sheet_name='Top Performers', index=False)
        print("✓ Added 'Top Performers' sheet")
    
    print(f"\n✓ Multi-sheet report saved to: {output_file}")
    return output_file

if __name__ == "__main__":
    # Load the sales data
    try:
        df = pd.read_excel('sales_data_original.xlsx', engine='openpyxl')
        print(f"Loaded {len(df)} records from sales_data_original.xlsx")
        
        # Create the report
        report_file = create_multi_sheet_report(df)
        
    except FileNotFoundError:
        print("✗ Error: sales_data_original.xlsx not found!")
        print("  Please run create_sample_data.py first")
    except Exception as e:
        print(f"✗ Error: {e}")

Advanced Analysis with SciPy

Enhanced Analysis Script

Create scripts/smart_analysis.py:

"""
Advanced Excel data analysis with SciPy and statistical insights
"""

import pandas as pd
import numpy as np
from scipy import stats
import warnings

# Keep the console output readable by silencing non-critical warnings
warnings.filterwarnings('ignore')

def check_column_names(df):
    """
    Display available columns and check for required columns
    """
    print("\n=== Data Structure Check ===")
    print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")
    print("\nAvailable columns:")
    for i, col in enumerate(df.columns, 1):
        print(f"  {i}. {col}")
    
    # Check for required columns
    required_columns = ['Year-Month', 'Sales_Amount', 'Product', 'Region']
    missing_columns = [col for col in required_columns if col not in df.columns]
    
    if missing_columns:
        print(f"\n⚠ Warning: Missing columns: {missing_columns}")
        print("  Column names are case-sensitive!")
        return False
    else:
        print("\n✓ All required columns found")
        return True

def discover_insights(df):
    """
    Extract comprehensive insights from sales data
    """
    insights = []
    
    # Ensure we have the right columns
    if not check_column_names(df):
        print("\nAttempting to fix column names...")
        # Try to fix common column name issues
        df.columns = df.columns.str.replace(' ', '_')
        if not check_column_names(df):
            return ["Error: Required columns not found in data"]
    
    print("\n=== Analyzing Data ===")
    
    try:
        # 1. Monthly Sales Analysis
        monthly_sales = df.groupby('Year-Month')['Sales_Amount'].sum()
        
        # Best and worst months
        best_month = monthly_sales.idxmax()
        best_month_sales = monthly_sales.max()
        worst_month = monthly_sales.idxmin()
        worst_month_sales = monthly_sales.min()
        
        insights.append(
            f"Best Sales Month: {best_month} "
            f"(¥{best_month_sales:,.0f})"
        )
        insights.append(
            f"Worst Sales Month: {worst_month} "
            f"(¥{worst_month_sales:,.0f})"
        )
        
        # 2. Statistical Analysis
        sales_mean = df['Sales_Amount'].mean()
        sales_median = df['Sales_Amount'].median()
        sales_std = df['Sales_Amount'].std()
        
        insights.append(
            f"Average Transaction: ¥{sales_mean:,.2f} "
            f"(Median: ¥{sales_median:,.2f}, Std Dev: ¥{sales_std:,.2f})"
        )
        
        # 3. Trend Analysis using SciPy
        months_numeric = np.arange(len(monthly_sales))
        slope, intercept, r_value, p_value, std_err = stats.linregress(
            months_numeric, monthly_sales.values
        )
        
        trend = "increasing ↑" if slope > 0 else "decreasing ↓"
        insights.append(
            f"Sales Trend: {trend} "
            f"(¥{abs(slope):,.0f}/month, R²={r_value**2:.3f})"
        )
        
        # 4. Product Performance
        top_product = df.groupby('Product')['Sales_Amount'].sum().idxmax()
        top_product_sales = df.groupby('Product')['Sales_Amount'].sum().max()
        
        insights.append(
            f"Top Product: {top_product} "
            f"(¥{top_product_sales:,.0f} total sales)"
        )
        
        # 5. Regional Insights
        regional_sales = df.groupby('Region')['Sales_Amount'].sum()
        top_region = regional_sales.idxmax()
        top_region_sales = regional_sales.max()
        
        insights.append(
            f"Top Region: {top_region} "
            f"(¥{top_region_sales:,.0f} total sales)"
        )
        
        # 6. Outlier Detection using SciPy
        z_scores = np.abs(stats.zscore(df['Sales_Amount']))
        outliers = df[z_scores > 3]
        
        if len(outliers) > 0:
            insights.append(
                f"Outliers Detected: {len(outliers)} transactions "
                f"(>{3} standard deviations from mean)"
            )
        
        # 7. Seasonality Check (Q4 vs other quarters)
        if 'Date' in df.columns:
            df['Date'] = pd.to_datetime(df['Date'])
            df['Quarter'] = df['Date'].dt.quarter
            q4_sales = df[df['Quarter'] == 4]['Sales_Amount'].sum()
            other_q_sales = df[df['Quarter'] != 4]['Sales_Amount'].sum()
            q4_boost = ((q4_sales / 3) / (other_q_sales / 9) - 1) * 100
            
            if q4_boost > 10:
                insights.append(
                    f"Seasonal Pattern: Q4 shows {q4_boost:.1f}% higher sales "
                    f"(holiday season effect)"
                )
        
    except KeyError as e:
        insights.append(f"Error: Column {e} not found in data")
        print(f"\n✗ KeyError: {e}")
        print("  Please check your column names match exactly")
    except Exception as e:
        insights.append(f"Analysis error: {str(e)}")
        print(f"\n✗ Unexpected error: {e}")
    
    return insights

def save_insights_to_excel(insights, df, output_file='analysis_report.xlsx'):
    """
    Save insights and data to a formatted Excel file
    """
    with pd.ExcelWriter(output_file, engine='xlsxwriter') as writer:
        # Create insights dataframe
        insights_df = pd.DataFrame({
            'Insight': insights,
            'Generated': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')
        })
        
        # Write insights
        insights_df.to_excel(writer, sheet_name='Key Insights', index=False)
        
        # Write raw data
        df.head(100).to_excel(writer, sheet_name='Sample Data', index=False)
        
        # Get the workbook and worksheet objects
        workbook = writer.book
        worksheet = writer.sheets['Key Insights']
        
        # Add formatting
        header_format = workbook.add_format({
            'bold': True,
            'bg_color': '#4CAF50',
            'font_color': 'white',
            'border': 1
        })
        
        # Apply header formatting
        for col_num, value in enumerate(insights_df.columns.values):
            worksheet.write(0, col_num, value, header_format)
        
        # Adjust column widths
        worksheet.set_column('A:A', 80)
        worksheet.set_column('B:B', 20)
    
    print(f"\n✓ Analysis report saved to: {output_file}")

# Main Execution
if __name__ == "__main__":
    try:
        # Load the sample data
        print("Loading data...")
        df = pd.read_excel("sales_data_original.xlsx", engine='openpyxl')
        print(f"✓ Loaded {len(df)} records")
        
        # Run the analysis
        print("\nStarting analysis...")
        found_insights = discover_insights(df)
        
        # Display the results
        print("\n" + "=" * 60)
        print("ANALYSIS RESULTS")
        print("=" * 60)
        for i, insight in enumerate(found_insights, 1):
            print(f"{i}. {insight}")
        
        # Save to Excel
        save_insights_to_excel(found_insights, df)
        
    except FileNotFoundError:
        print("\n✗ Error: sales_data_original.xlsx not found!")
        print("\nSolution:")
        print("1. Run: python scripts/create_sample_data.py")
        print("2. Make sure you're in the correct directory")
        
    except Exception as e:
        print(f"\n✗ Unexpected error: {e}")
        print("\nTroubleshooting:")
        print("1. Check that all packages are installed correctly")
        print("2. Verify the Excel file isn't corrupted")
        print("3. Make sure the file isn't open in Excel")

Troubleshooting Guide

Common Errors and Solutions

1. ModuleNotFoundError

Error:

ModuleNotFoundError: No module named 'pandas'

Solution:

# Make sure your virtual environment is activated
# Windows:
excel_env\Scripts\activate

# macOS/Linux:
source excel_env/bin/activate

# Then reinstall
pip install pandas

2. KeyError with Column Names

Error:

KeyError: 'Year-Month'

Solution: Check exact column names (case-sensitive):

import pandas as pd

df = pd.read_excel('your_file.xlsx')
print("Actual column names:")
print(df.columns.tolist())

# Fix column names if needed
df.columns = df.columns.str.replace(' ', '_')
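
If the mismatch goes beyond stray spaces (for example, the file says Year_Month while the scripts expect Year-Month), an explicit rename map plus an upfront check turns a cryptic KeyError into a readable message. A minimal sketch; the rename keys shown are hypothetical variants you should adjust to your own file:

import pandas as pd

df = pd.read_excel('your_file.xlsx', engine='openpyxl')

# Map whatever names the file actually uses to the names the scripts expect
# (the keys here are example variants -- adjust them to your data)
rename_map = {'Year_Month': 'Year-Month', 'Sales Amount': 'Sales_Amount'}
df = df.rename(columns={k: v for k, v in rename_map.items() if k in df.columns})

# Fail fast with a clear message instead of a bare KeyError later on
required = ['Year-Month', 'Sales_Amount', 'Product', 'Region']
missing = [col for col in required if col not in df.columns]
if missing:
    raise KeyError(f"Missing columns: {missing}. Found: {df.columns.tolist()}")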

3. Permission Denied Error

Error:

PermissionError: [Errno 13] Permission denied: 'file.xlsx'

Solution:

  • Close the Excel file if it’s open
  • Check file permissions
  • Use a different filename:
from datetime import datetime
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
filename = f'output_{timestamp}.xlsx'

4. Excel File Corruption

Error:

BadZipFile: File is not a zip file

Solution:

# "File is not a zip file" usually means the .xlsx is not really an xlsx
# workbook (e.g., a legacy .xls file or a CSV that was renamed)
import zipfile

if zipfile.is_zipfile('file.xlsx'):
    df = pd.read_excel('file.xlsx', engine='openpyxl')
else:
    # Legacy .xls content can still be read with xlrd (pip install xlrd);
    # a renamed CSV needs pd.read_csv instead
    df = pd.read_excel('file.xlsx', engine='xlrd')

Best Practices Checklist

✅ Always use virtual environments – Isolate project dependencies
✅ Use context managers – with pd.ExcelWriter() ensures proper file closing
✅ Add error handling – Use try-except blocks for robust code
✅ Check column names – They’re case-sensitive
✅ Close Excel files – Before running Python scripts
✅ Use meaningful variable names – Makes code self-documenting
✅ Add progress indicators – For long-running operations
✅ Version control – Use Git to track changes
✅ Document your code – Add docstrings and comments
✅ Test incrementally – Run code step by step when debugging

Complete Workflow Example

Here’s how to run everything from start to finish:

# 1. Setup environment (one time only)
python -m venv excel_env
source excel_env/bin/activate  # or excel_env\Scripts\activate on Windows
pip install -r requirements.txt

# 2. Create sample data
python scripts/create_sample_data.py

# 3. Run basic operations
python scripts/merge_excel_files.py
python scripts/multi_sheet_writer.py

# 4. Run advanced analysis
python scripts/smart_analysis.py

# 5. Check the generated files
ls *.xlsx
ls excel_files/*.xlsx

Next Steps and Resources

Official Documentation

  • pandas: https://pandas.pydata.org/docs/
  • openpyxl: https://openpyxl.readthedocs.io/
  • XlsxWriter: https://xlsxwriter.readthedocs.io/
  • SciPy: https://docs.scipy.org/doc/scipy/
  • NumPy: https://numpy.org/doc/

Advanced Topics to Explore

  1. Data Visualization: Add charts using xlsxwriter (see the sketch after this list)
  2. Formatting: Apply conditional formatting and styles
  3. Formulas: Add Excel formulas programmatically
  4. Performance: Handle large files with chunking
  5. Automation: Schedule scripts with cron/Task Scheduler
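
To give a taste of the first topic, here is a minimal, self-contained sketch of adding a column chart with xlsxwriter. The monthly_summary values and the chart_demo.xlsx filename are made up for illustration; the chart calls (add_chart, add_series, insert_chart) are standard xlsxwriter API.

"""Minimal sketch: write a small summary table and chart it with xlsxwriter"""

import pandas as pd

# Hypothetical summary data, standing in for the Monthly Summary sheet above
monthly_summary = pd.DataFrame(
    {'Total_Sales': [12000, 15000, 9000]},
    index=['2023-01', '2023-02', '2023-03']
)

with pd.ExcelWriter('chart_demo.xlsx', engine='xlsxwriter') as writer:
    monthly_summary.to_excel(writer, sheet_name='Summary')

    workbook = writer.book
    worksheet = writer.sheets['Summary']

    # Column chart pointing at the cells we just wrote
    # (ranges are [sheet, first_row, first_col, last_row, last_col], zero-based)
    n = len(monthly_summary)
    chart = workbook.add_chart({'type': 'column'})
    chart.add_series({
        'name': 'Total_Sales',
        'categories': ['Summary', 1, 0, n, 0],  # month labels in column A
        'values': ['Summary', 1, 1, n, 1],      # sales values in column B
    })
    chart.set_title({'name': 'Monthly Sales'})

    # Place the chart next to the data
    worksheet.insert_chart('D2', chart)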

Project Ideas

  • Sales dashboard generator
  • Automated monthly reports
  • Data quality checker
  • Excel template filler
  • Multi-file consolidator

Conclusion

You now have a complete foundation for Excel automation with Python. This guide has covered:

  • Proper environment setup with virtual environments
  • Creating and manipulating Excel files with multiple sheets
  • Advanced analysis using SciPy for statistical insights
  • Comprehensive error handling and troubleshooting

Remember to always refer to the official documentation for the most up-to-date information, and practice with your own data to solidify these concepts.

Last updated: September 2025 | Python 3.9+ | pandas 2.3.2 | openpyxl 3.1.4
