HAL 9000: (Gradient)Descent into (March)Madness Part 1: Data Acquisition and Cleaning
Building a March Madness Prediction Engine: Part 1 - Web Scraping and Data Acquisition
Important Disclaimer
This article is for educational purposes only. Web scraping KenPom may violate their terms of service. This project was conducted as a learning exercise and is not intended for commercial use. No data is being distributed or shared publicly. Code examples have been intentionally simplified or removed to prevent readers from overwhelming KenPom’s servers or violating their terms of service. The goal is to understand the data acquisition process, not to recreate it. Always respect website terms of service and consider subscribing to premium services when available.
March Madness represents one of the most unpredictable tournaments in sports, where statistical underdogs routinely upset heavily favored teams. Creating a reliable prediction model starts with acquiring comprehensive, high-quality data that captures the nuanced performance metrics distinguishing championship contenders from first-round exits. This four-part series explores building a complete NCAA basketball prediction system, beginning with the foundation: automated web scraping and data cleaning.
The KenPom Data Challenge
College basketball analytics revolve around KenPom (kenpom.com), the authoritative source for advanced basketball statistics. Ken Pomeroy’s efficiency-based metrics provide tempo-adjusted ratings that account for strength of schedule and pace of play—critical factors that raw statistics miss. However, accessing this premium data programmatically presents several challenges:
Subscription-Based Access: KenPom requires paid subscriptions and implements authentication barriers that prevent simple HTTP requests.
Dynamic Content Loading: The site uses JavaScript for some data rendering, making traditional scraping approaches unreliable.
Rate Limiting Considerations: As a small, independent site, KenPom requires respectful scraping practices to avoid overwhelming their servers.
Data Spread Across Multiple Pages: Team statistics are distributed across six different specialized pages, each with unique layouts and data structures.
Playwright-Based Scraping Architecture
The solution employs Playwright for browser automation, providing robust handling of authentication and dynamic content. The scraper implements several key components:
Session Management: Once authenticated, the browser session maintains cookies across multiple page requests, avoiding repeated login attempts.
Failure Detection: The system distinguishes between network timeouts, invalid credentials, and successful authentications that might load slowly.
Graceful Retry Logic: Temporary network issues are handled with exponential backoff rather than failing immediately.
class KenPomScraper:
    """Playwright-based scraper for KenPom team statistics (simplified)."""

    def __init__(self, email, password, begin_season=2025, end_season=2025, max_retries=3):
        self.email = email                # KenPom subscription credentials
        self.password = password
        self.begin_season = begin_season  # first season to scrape
        self.end_season = end_season      # last season to scrape
        self.max_retries = max_retries    # retry budget per page request
        # Authentication and scraping methods would go here
        # (Code simplified for educational purposes)
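To make the retry logic concrete, here is a minimal sketch of exponential backoff around a page load. The fetch_with_retry helper and its delay values are illustrative and not part of the original scraper:

import random
import time

def fetch_with_retry(load_page, url, max_retries=3, base_delay=2.0):
    """Retry a page load with exponential backoff and jitter.

    `load_page` is any callable that fetches and returns page content and
    raises on failure (for example, a wrapper around Playwright's page.goto).
    """
    for attempt in range(max_retries):
        try:
            return load_page(url)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the failure
            # Wait 2s, 4s, 8s, ... plus jitter before the next attempt
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))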
Multi-Page Data Extraction
KenPom organizes team statistics across specialized pages, each providing different analytical perspectives:
Efficiency Ratings Page: Core offensive and defensive efficiency metrics adjusted for tempo and strength of schedule.
Four Factors Page: Shooting efficiency, turnover rates, rebounding percentages, and free throw rates, the "Four Factors" that Dean Oliver identified as most predictive of success.
Advanced Statistics Page: Detailed breakdowns of shooting performance, including two-point and three-point efficiency metrics.
Point Distribution Page: Analysis of scoring patterns and shot selection tendencies.
Height and Experience Page: Team composition metrics including average height by position and player experience levels.
Dynamic URL Construction
Each page requires season-specific parameters, handled through systematic URL construction with retry logic for reliable data extraction across multiple seasons.
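As a rough sketch of this pattern, the snippet below maps page names to season-specific URLs. The page paths and the y query parameter are assumptions for illustration, not confirmed KenPom routes:

BASE_URL = "https://kenpom.com"

# Hypothetical page paths; the real routes may differ.
PAGE_PATHS = {
    "efficiency": "summary.php",
    "four_factors": "stats.php",
    "point_distribution": "pointdist.php",
    "height_experience": "height.php",
}

def build_urls(season):
    """Build season-specific URLs for each statistics page."""
    return {name: f"{BASE_URL}/{path}?y={season}" for name, path in PAGE_PATHS.items()}

# Example: build_urls(2024)["efficiency"] -> "https://kenpom.com/summary.php?y=2024"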
Data Cleaning and Standardization Pipeline
Raw scraped data requires extensive cleaning before analysis. The preprocessing pipeline addresses multiple data quality challenges:
Numeric Data Conversion
KenPom displays statistics in human-readable formats that require parsing for analysis:
import pandas as pd

def _process_dataframe(self, df, year):
    """Clean and standardize scraped data."""
    # Convert efficiency ratings and tempo columns to numeric
    for col in df.columns:
        if col.endswith(('Rtg', 'T')):
            df[col] = pd.to_numeric(df[col].str.replace(',', ''), errors='coerce')
        elif col.endswith('_Rank'):
            # Extract the numeric rank from formatted strings like "15 (2.3)"
            df[col] = pd.to_numeric(df[col].str.extract(r'(\d+)')[0], errors='coerce')

    # Parse win-loss records such as "27-6" into separate numeric columns
    if 'Record' in df.columns:
        df[['Wins', 'Losses']] = df['Record'].str.split('-', expand=True)
        df[['Wins', 'Losses']] = df[['Wins', 'Losses']].apply(pd.to_numeric, errors='coerce')
        df = df.drop(columns='Record')

    # Standardize team names (strip seed numbers and asterisks) for consistent joining
    df['Team'] = df['Team'].str.replace(r'[\d*]', '', regex=True).str.strip().str.lower()
    df['Season'] = year
    return df
Text Cleaning: Team names contain rankings, asterisks, and other formatting that must be stripped for consistent identification.
Numeric Parsing: Statistics appear with commas, parenthetical information, and other formatting that prevents direct numeric conversion.
Missing Data Handling: Some teams have incomplete statistics, requiring careful handling to avoid analysis errors.
Multi-Season Data Organization
The system handles historical and current season data differently:
Historical Data: Completed seasons that won’t change, stored permanently for training data.
Current Season Snapshots: Dated snapshots of ongoing seasons that capture team development over time.
Latest Current: Always-updated current season data for real-time predictions.
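A minimal sketch of how this three-way split might be persisted follows; the directory layout and file names are illustrative, not the project's actual structure:

from datetime import date
from pathlib import Path

def save_season_data(df, season, current_season=2025, data_dir="data"):
    """Write a cleaned season to the appropriate location.

    Completed seasons go to a permanent historical file; the ongoing season
    gets a dated snapshot plus an always-current copy.
    """
    root = Path(data_dir)
    if season < current_season:
        path = root / "historical" / f"kenpom_{season}.csv"
    else:
        snapshot = root / "snapshots" / f"kenpom_{season}_{date.today():%Y%m%d}.csv"
        snapshot.parent.mkdir(parents=True, exist_ok=True)
        df.to_csv(snapshot, index=False)  # dated snapshot of the season so far
        path = root / "current" / f"kenpom_{season}_latest.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)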
Advanced Data Preprocessing
Beyond basic cleaning, the data requires sophisticated preprocessing to prepare for modeling:
Statistical Outlier Management
The preprocessing pipeline implements comprehensive data quality measures:
Invalid Data Filtering: KenPom uses sentinel values like -99 to indicate invalid or insufficient data, which must be excluded from analysis.
Incomplete Profile Removal: Teams with too many missing statistics likely have insufficient game data for reliable analysis.
Redundancy Elimination: Rankings duplicate the information contained in underlying efficiency metrics while adding noise.
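A minimal sketch of these filters, assuming the -99 sentinel described above and a _Rank suffix on ranking columns (the column names and the 20% missing-data threshold are illustrative):

import numpy as np

def apply_quality_filters(df, max_missing_frac=0.2):
    """Drop sentinel values, incomplete team profiles, and redundant rank columns."""
    # Treat KenPom's -99 sentinel as missing data rather than a real value
    df = df.replace(-99, np.nan)

    # Remove teams whose statistical profile is mostly empty
    stat_cols = [c for c in df.columns if c not in ('Team', 'Season')]
    df = df[df[stat_cols].isna().mean(axis=1) <= max_missing_frac]

    # Rank columns duplicate the underlying efficiency metrics, so drop them
    rank_cols = [c for c in df.columns if c.endswith('_Rank')]
    return df.drop(columns=rank_cols)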
Team Identity Mapping
One of the most challenging aspects involves mapping team names consistently across seasons:
Name Standardization: Teams use different name variations across seasons and data sources.
Conference Changes: Teams change conferences, affecting their identities and competitive contexts.
Program Transitions: Schools occasionally change names, merge programs, or undergo other institutional changes.
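One common approach is to normalize raw names and then resolve known aliases through a lookup table; the aliases below are illustrative examples rather than the project's full mapping:

# Illustrative alias table; a real mapping would cover every mismatch found.
TEAM_ALIASES = {
    "st johns": "saint johns",
    "nc state": "north carolina state",
    "uconn": "connecticut",
}

def canonical_team_name(name):
    """Normalize a raw team name and resolve known aliases to one canonical form."""
    cleaned = name.strip().lower().replace(".", "").replace("'", "")
    return TEAM_ALIASES.get(cleaned, cleaned)

# Example: canonical_team_name("St. John's") -> "saint johns"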
Feature Space Optimization
The preprocessing pipeline implements dimensionality reduction to improve model performance:
Correlation Analysis: Some KenPom metrics are mathematically related, creating multicollinearity that can destabilize models.
Data Quality Assessment: Certain statistics have known reliability issues or sparse coverage across teams.
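One way to handle the correlation piece is to drop one feature from each highly correlated pair; the 0.95 threshold below is an illustrative choice, not the project's actual setting:

import numpy as np

def drop_correlated_features(df, threshold=0.95):
    """Drop one feature from each highly correlated pair to reduce multicollinearity."""
    numeric = df.select_dtypes(include="number").drop(columns=["Season"], errors="ignore")
    corr = numeric.corr().abs()
    # Inspect only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)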
Robust Feature Scaling
Statistical metrics operate on vastly different scales, requiring careful normalization:
from sklearn.preprocessing import RobustScaler

def scale_features(self, df, fit_scaler=True):
    """Apply robust scaling to handle outliers."""
    # Scale every column except identifiers
    features_to_scale = [col for col in df.columns if col not in ['TeamID', 'Season']]
    if fit_scaler:
        # RobustScaler is less sensitive to outliers than StandardScaler
        self.scaler = RobustScaler()
        df[features_to_scale] = self.scaler.fit_transform(df[features_to_scale])
    else:
        # Reuse the scaler fitted on training data for consistent prediction-time scaling
        df[features_to_scale] = self.scaler.transform(df[features_to_scale])
    return df
Robust Scaling: Uses median and interquartile range instead of mean and standard deviation, providing better handling of statistical outliers common in college basketball.
Persistent Scaling: Fitted scalers are saved and reused so that training and prediction data receive identical scaling.
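A common way to persist a fitted scaler is joblib; the file path here is just a placeholder:

from pathlib import Path
import joblib

def save_scaler(scaler, path="artifacts/robust_scaler.joblib"):
    """Persist the fitted scaler so training and prediction use identical scaling."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(scaler, path)

def load_scaler(path="artifacts/robust_scaler.joblib"):
    """Reload the previously fitted scaler at prediction time."""
    return joblib.load(path)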
Pipeline Orchestration and Error Handling
The complete acquisition pipeline coordinates multiple specialized components with comprehensive error handling:
Browser Management: Proper browser lifecycle management prevents resource leaks during long scraping sessions.
Partial Failure Handling: If individual seasons fail to scrape, the system continues rather than losing all progress.
Data Persistence: Results are saved incrementally to prevent data loss from unexpected failures.
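A minimal sketch of that orchestration pattern, where scrape_season and save_fn stand in for the real scraping and persistence methods (both names are hypothetical):

def run_pipeline(scraper, seasons, save_fn):
    """Scrape each season independently so one failure doesn't lose all progress."""
    results, failures = {}, []
    for season in seasons:
        try:
            df = scraper.scrape_season(season)  # hypothetical per-season scrape method
            save_fn(df, season)                 # persist immediately rather than at the end
            results[season] = df
        except Exception as exc:
            # Record the failure and continue with the remaining seasons
            failures.append((season, str(exc)))
    return results, failures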
Data Quality Validation
The pipeline implements comprehensive validation throughout the acquisition process:
Statistical Sanity Checks: Efficiency ratings must fall within expected ranges based on historical data.
Completeness Validation: Each team must have minimum statistical coverage to be included in datasets.
Temporal Consistency: Teams shouldn’t show impossible statistical changes between data points.
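A minimal sketch of such checks; the efficiency bounds, coverage threshold, and column names (AdjORtg, AdjDRtg) are illustrative assumptions:

def validate_team_stats(df, min_eff=70.0, max_eff=140.0, min_coverage=0.8):
    """Run basic sanity checks on a cleaned season of team statistics."""
    problems = []
    for col in ('AdjORtg', 'AdjDRtg'):  # assumed efficiency column names
        if col in df.columns:
            out_of_range = df[(df[col] < min_eff) | (df[col] > max_eff)]
            if not out_of_range.empty:
                problems.append(f"{len(out_of_range)} teams outside [{min_eff}, {max_eff}] on {col}")
    # Flag teams with too little statistical coverage to be reliable
    thin = (df.notna().mean(axis=1) < min_coverage).sum()
    if thin:
        problems.append(f"{thin} teams below {min_coverage:.0%} statistical coverage")
    return problems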
Ethical Considerations and Best Practices
When conducting any web scraping project, several important principles should guide the approach:
Respect Rate Limits: Implement delays between requests to avoid overwhelming target servers.
Honor robots.txt: Check and respect website crawling policies.
Use Data Responsibly: Ensure scraped data is used ethically and in compliance with terms of service.
Consider Alternatives: Always evaluate whether official APIs or data purchase options exist before scraping.
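As a small illustration of respectful pacing, a delay between requests can be as simple as the sketch below; the two-second value is arbitrary:

import time

def polite_fetch_all(load_page, urls, delay_seconds=2.0):
    """Fetch a list of pages with a fixed pause between requests."""
    pages = []
    for url in urls:
        pages.append(load_page(url))  # any page-fetching callable
        time.sleep(delay_seconds)     # pause so we don't hammer the server
    return pages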
Looking Ahead
This robust data acquisition foundation provides clean, comprehensive basketball statistics ready for advanced analysis. The automated pipeline ensures consistent data quality while handling the complexities of web scraping premium content.
In Part 2, we'll explore this clean dataset through exploratory data analysis and transform it with feature engineering to identify the statistical patterns that predict tournament success.
Next: Part 2 - Exploratory Data Analysis and Feature Engineering