
Building a March Madness Prediction Engine: Part 4 - Tournament Simulation and Bracket Optimization


This final component of our NCAA basketball prediction system transforms individual game predictions into comprehensive tournament strategy through Monte Carlo simulation and expected value optimization. It demonstrates how sophisticated modeling can navigate the complex decision-making required for competitive bracket pools, where the goal isn’t simply predicting individual games correctly, but maximizing expected points across an entire tournament structure.

Tournament Optimization Challenge

March Madness bracket pools present a unique optimization problem that differs fundamentally from individual game prediction. Success requires balancing multiple competing objectives:

Expected Value Maximization: Each round carries different point values (typically 1, 2, 4, 8, 16, 32 points), making later round accuracy exponentially more valuable than early round precision (a worked example follows below).

Risk Management: Conservative picks maximize expected value but provide little differentiation in large pools, while aggressive upsets offer high upside with corresponding risk.

Strategic Differentiation: In competitive pools, the optimal strategy often involves calculated contrarian picks that provide unique upside when they succeed.

Temporal Dependencies: Tournament brackets create complex dependencies where early round upsets eliminate potential late-round picks, requiring sophisticated simulation to capture these interactions.
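
To make the scoring asymmetry concrete, here is a small back-of-the-envelope sketch (assuming the standard 1-2-4-8-16-32 scheme noted above):

# Round point values under the standard doubling scheme
point_values = [1, 2, 4, 8, 16, 32]

# A champion pick that survives all six rounds earns every round's points
print(sum(point_values))  # 63 points from a single team

# Expected points = P(correct) * points. A 60%-confident championship pick
# is worth 0.6 * 32 = 19.2 expected points from the final alone, versus
# 10 * 0.5 * 1 = 5.0 for ten coin-flip first-round picks combined.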

Monte Carlo Simulation Architecture

The BracketOptimizer class implements a comprehensive tournament simulation framework that addresses these optimization challenges:

class BracketOptimizer:
    def __init__(self):
        # Tournament structure with point values for each round
        self.rounds = [
            {"name": "First Round", "pointValue": 1},
            {"name": "Second Round", "pointValue": 2},
            {"name": "Sweet 16", "pointValue": 4},
            {"name": "Elite 8", "pointValue": 8},
            {"name": "Final Four", "pointValue": 16},
            {"name": "Championship", "pointValue": 32}
        ]
        
        # Storage for simulation results
        self.team_advancement = {}  # Track how often each team reaches each round
        self.expected_values = {}   # Calculate expected points for each team
        self.win_probabilities = {} # Cache matchup predictions
        
        # Historical upset rates for probability adjustment
        self.seed_upset_rates = self.initialize_seed_upset_rates()

Tournament Structure Modeling

The system models the complete NCAA tournament structure with proper regional organization and advancement rules:

def setup_tournament(self, tournament_teams_path=None):
    """Initialize tournament with 68 teams across 4 regions"""
    tournament_df = pd.read_csv(tournament_teams_path)

    # Map teams with seeds and regional assignments
    # (assumes TeamID, TeamName, Seed, and Region columns; First Four
    # play-ins are treated as already resolved to a single seed line)
    self.teams = {}
    for _, row in tournament_df.iterrows():
        self.teams[row['TeamID']] = {
            'name': row['TeamName'],
            'seed': row['Seed'],
            'region': row['Region']
        }

    # Create bracket structure with proper advancement paths
    # Standard NCAA matchups: 1v16, 8v9, 5v12, 4v13, 6v11, 3v14, 7v10, 2v15
    matchups = [(1, 16), (8, 9), (5, 12), (4, 13), (6, 11), (3, 14), (7, 10), (2, 15)]

    regions = sorted({info['region'] for info in self.teams.values()})
    for region in regions:
        # Index this region's teams by seed for first-round pairing
        teams_by_seed = {info['seed']: tid for tid, info in self.teams.items()
                         if info['region'] == region}
        for i, (seed1, seed2) in enumerate(matchups):
            game_id = f"{region}_R1_G{i+1}"
            self.bracket_structure[region][1][game_id] = {
                "team1_id": teams_by_seed[seed1],
                "team2_id": teams_by_seed[seed2],
                "winner_id": None,
                "next_game": self.get_next_game_id(game_id)
            }

This structure captures the complete tournament progression from first round through championship, enabling realistic simulation of bracket advancement scenarios.
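
The get_next_game_id helper referenced in the bracket structure isn't shown in the excerpt; under the {region}_R{round}_G{game} naming convention used here, a plausible sketch is:

def get_next_game_id(self, game_id: str) -> str:
    """Route a game's winner to its next-round slot (hypothetical sketch).

    With IDs like "East_R1_G3", the winners of games 2k-1 and 2k in
    round r meet in game k of round r+1. Final Four crossover between
    regions would need special-casing beyond this regional pattern.
    """
    region, round_part, game_part = game_id.split("_")
    round_num = int(round_part[1:])   # e.g. "R1" -> 1
    game_num = int(game_part[1:])     # e.g. "G3" -> 3
    return f"{region}_R{round_num + 1}_G{(game_num + 1) // 2}"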

Prediction Integration and Caching

The simulation integrates seamlessly with the trained models from Part 3, calling the prediction system for each required matchup:

def run_prediction(self, team1_id: int, team2_id: int, neutral_site: bool = True) -> Dict[str, float]:
    """Integrate with trained models for game predictions"""
    # Check cache to avoid redundant predictions
    cache_key = (team1_id, team2_id)
    if cache_key in self.prediction_cache:
        return self.prediction_cache[cache_key]
    
    try:
        if neutral_site:
            # For neutral site games, run prediction both ways and average
            team1_home_result = self._run_single_prediction(team1_id, team2_id)
            team2_home_result = self._run_single_prediction(team2_id, team1_id)
            
            # Average the win probabilities to remove home court bias
            team1_win_prob = (team1_home_result["home_win_probability"] + 
                             (1.0 - team2_home_result["home_win_probability"])) / 2.0
        else:
            # Single prediction with specified home team
            result = self._run_single_prediction(team1_id, team2_id)
            team1_win_prob = result["home_win_probability"]
            
        # Cache and return structured result
        prediction_result = {
            "team1_id": team1_id,
            "team2_id": team2_id,
            "team1_win_probability": team1_win_prob,
            "home_advantage_adjusted": neutral_site
        }
        
        self.prediction_cache[cache_key] = prediction_result
        return prediction_result
        
    except Exception as e:
        # Fallback to seed-based model if prediction fails
        return self._seed_based_fallback(team1_id, team2_id)

Robust Prediction Pipeline

The system includes multiple fallback mechanisms to ensure tournament simulation continues even when individual predictions fail:

def _run_single_prediction(self, home_team_id: int, away_team_id: int) -> Dict[str, float]:
    """Execute prediction using trained models via subprocess"""
    cmd = f"python ../main.py --predict --hometeamid={home_team_id} --awayteamid={away_team_id}"
    
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    
    # Parse prediction output for win probabilities
    wl_models = []
    spread_models = []
    model_name = None  # Track the most recently parsed model header
    
    for line in result.stdout.split('\n'):
        if "Model Name:" in line:
            model_name = line.split("Model Name:")[1].strip()
        elif "Prediction:" in line and model_name:
            # Extract numeric predictions using regex
            if "WL" in model_name:
                match = re.search(r"'predictions':\s*np\.float\d*\(([^)]+)\)", line)
                if match:
                    wl_models.append(float(match.group(1)))
            elif "Spread" in model_name:
                # Convert spreads to win probabilities
                match = re.search(r"'predictions':\s*np\.float\d*\(([^)]+)\)", line)
                if match:
                    spread = float(match.group(1))
                    win_prob = 0.5 + min(abs(spread) * 0.02, 0.45) if spread < 0 else 0.5 - min(abs(spread) * 0.02, 0.45)
                    spread_models.append(win_prob)
    
    # Ensemble multiple model predictions
    if wl_models:
        home_win_prob = sum(wl_models) / len(wl_models)
    elif spread_models:
        home_win_prob = sum(spread_models) / len(spread_models)
    else:
        raise ValueError("No valid predictions found")
    
    return {"home_win_probability": home_win_prob}

Historical Upset Rate Integration

Pure model predictions often underestimate upset probability because they focus on team quality rather than tournament dynamics. The system addresses this by blending model predictions with historical upset rates:

def initialize_seed_upset_rates(self) -> Dict[Tuple[int, int], float]:
    """Historical upset rates by seed matchup"""
    upset_rates = {
        (16, 1): 0.01,  # UMBC over Virginia (2018) - extremely rare
        (15, 2): 0.05,  # Happens occasionally 
        (14, 3): 0.15,  # More common but still surprising
        (13, 4): 0.20,  # Regular occurrence
        (12, 5): 0.35,  # The famous "12-5 upset" expectation
        (11, 6): 0.40,  # Very common
        (10, 7): 0.45,  # Nearly even matchups
        (9, 8): 0.50,   # Essentially coin flips
    }
    
    # Generate rates for later round matchups based on seed differences
    for high_seed in range(1, 17):
        for low_seed in range(1, 17):
            if low_seed > high_seed:
                key = (low_seed, high_seed)
                if key not in upset_rates:
                    seed_diff = low_seed - high_seed
                    rate = self._calculate_upset_rate_by_difference(seed_diff)
                    upset_rates[key] = min(rate, 0.50)
    
    return upset_rates

def adjust_win_probability(self, team1_id: int, team2_id: int, base_win_prob: float) -> float:
    """Blend model predictions with historical upset patterns"""
    team1_seed = self.teams[team1_id]['seed']
    team2_seed = self.teams[team2_id]['seed']
    
    # Model confidence scales from 0 (toss-up) to 1 (certain)
    model_confidence = abs(base_win_prob - 0.5) * 2
    
    # Identify potential upset scenario
    if team1_seed > team2_seed:  # team1 would be upset winner
        upset_key = (team1_seed, team2_seed)
        historical_rate = self.seed_upset_rates.get(upset_key, 0.5)
        
        # Blend model confidence with historical rates
        adjusted_prob = (base_win_prob * model_confidence) + (historical_rate * (1 - model_confidence))
    else:
        # Same blend for the opposite upset scenario
        upset_key = (team2_seed, team1_seed)
        historical_rate = self.seed_upset_rates.get(upset_key, 0.5)
        adjusted_prob = (base_win_prob * model_confidence) + ((1 - historical_rate) * (1 - model_confidence))
    
    return max(0.05, min(0.95, adjusted_prob))
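
The _calculate_upset_rate_by_difference helper called above is not shown; a plausible sketch consistent with the hardcoded table (near coin flip at a 1-seed gap, vanishingly rare at a 15-seed gap) would be:

def _calculate_upset_rate_by_difference(self, seed_diff: int) -> float:
    """Hypothetical sketch: interpolate an upset rate from the seed gap,
    decaying roughly linearly from a near coin flip toward zero."""
    # diff 1 -> ~0.47, diff 15 -> ~0.01; the caller clamps at 0.50
    return max(0.01, 0.50 - 0.033 * seed_diff)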

This approach recognizes that March Madness upsets follow certain historical patterns that pure statistical models may not fully capture.

Monte Carlo Tournament Simulation

The simulation engine runs thousands of complete tournament scenarios to calculate advancement probabilities:

def simulate_tournament(self, num_simulations: int = 1000, neutral_site: bool = True, 
                       upset_logic: bool = True) -> None:
    """Run comprehensive tournament simulations"""
    print(f"Running {num_simulations} complete tournament simulations...")
    
    # Remember the run size for later analysis, then reset counters
    self.num_simulations = num_simulations
    for team_id in self.teams:
        self.team_advancement[team_id] = [0] * len(self.rounds)
        self.expected_values[team_id] = [0.0] * len(self.rounds)
    
    # Execute simulations with progress tracking
    for sim in tqdm(range(num_simulations)):
        bracket = self.create_simulation_bracket()
        
        # Simulate each round sequentially
        for round_num in range(1, 7):
            self.simulate_round(bracket, round_num, neutral_site, upset_logic)
    
    # Calculate expected values from simulation results
    for team_id in self.teams:
        for round_idx in range(len(self.rounds)):
            advancement_prob = self.team_advancement[team_id][round_idx] / num_simulations
            round_points = self.rounds[round_idx]["pointValue"]
            self.expected_values[team_id][round_idx] = advancement_prob * round_points
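
The simulate_round method invoked above isn't shown in the excerpt; a minimal sketch, reusing the _get_round_games and _update_bracket_with_winner helpers that appear later, might look like:

def simulate_round(self, bracket: Dict[str, Any], round_num: int,
                   neutral_site: bool, upset_logic: bool) -> None:
    """Resolve every game in one round, tally advancement, and seed the
    next round's matchups (hypothetical sketch)."""
    for game_id, game in self._get_round_games(bracket, round_num):
        winner_id = self.simulate_game(game["team1_id"], game["team2_id"],
                                       neutral_site, upset_logic)
        game["winner_id"] = winner_id
        # Winning round r earns that round's points, so count it at index r-1
        self.team_advancement[winner_id][round_num - 1] += 1
        # Propagate the winner into its next-round slot
        self._update_bracket_with_winner(bracket, game, winner_id)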

Individual Game Simulation

Each game simulation incorporates both model predictions and stochastic variation:

def simulate_game(self, team1_id: int, team2_id: int, neutral_site: bool, upset_logic: bool) -> int:
    """Simulate single game with probability-based outcomes"""
    # Get base prediction from trained models
    prediction = self.run_prediction(team1_id, team2_id, neutral_site)
    win_prob = prediction["team1_win_probability"]
    
    # Apply historical upset adjustment if enabled
    if upset_logic:
        win_prob = self.adjust_win_probability(team1_id, team2_id, win_prob)
    
    # Stochastic simulation based on calculated probability
    return team1_id if random.random() < win_prob else team2_id

This approach captures the inherent randomness of tournament basketball while respecting the underlying statistical relationships identified by the models.

Expected Value Optimization

The bracket optimization process maximizes expected tournament points rather than individual game accuracy:

def build_optimal_bracket(self, neutral_site: bool = True, upset_logic: bool = True, 
                         variance_preference: float = 0.4) -> Dict[str, Any]:
    """Construct optimal bracket using expected value optimization"""
    optimal_bracket = {"rounds": []}
    bracket = self.create_simulation_bracket()
    
    # Process each round with EV-based decision making
    for round_num in range(1, 7):
        round_data = {"name": self.rounds[round_num-1]["name"], "games": []}
        
        # For each game in the round
        for game_id, game_data in self._get_round_games(bracket, round_num):
            team1_id = game_data["team1_id"]
            team2_id = game_data["team2_id"]
            
            if team1_id is None or team2_id is None:
                continue
                
            # Get win probability
            prediction = self.run_prediction(team1_id, team2_id, neutral_site)
            team1_win_prob = prediction["team1_win_probability"]
            
            if upset_logic:
                team1_win_prob = self.adjust_win_probability(team1_id, team2_id, team1_win_prob)
            
            # Calculate expected values from simulation results
            team1_ev = self.get_team_expected_value(team1_id)
            team2_ev = self.get_team_expected_value(team2_id)
            
            # Apply variance preference for upset consideration
            team1_seed = self.teams[team1_id]["seed"]
            team2_seed = self.teams[team2_id]["seed"]
            
            # Boost EV for potential upsets based on variance preference
            if team1_seed > team2_seed:  # team1 would be upset
                upset_bonus = (team1_seed - team2_seed) * self.rounds[round_num-1]["pointValue"] * variance_preference
                team1_ev += upset_bonus
            elif team2_seed > team1_seed:  # team2 would be upset
                upset_bonus = (team2_seed - team1_seed) * self.rounds[round_num-1]["pointValue"] * variance_preference
                team2_ev += upset_bonus
            
            # Calculate probability-weighted expected values
            team1_adjusted_ev = team1_ev * team1_win_prob
            team2_adjusted_ev = team2_ev * (1 - team1_win_prob)
            
            # Select team with higher adjusted EV
            winner_id = team1_id if team1_adjusted_ev > team2_adjusted_ev else team2_id
            
            # Record the pick, then advance the winner through the bracket
            round_data["games"].append({
                "game_id": game_id,
                "team1_id": team1_id,
                "team2_id": team2_id,
                "team1_seed": team1_seed,
                "team2_seed": team2_seed,
                "winner_id": winner_id,
                "winner_probability": team1_win_prob if winner_id == team1_id else 1.0 - team1_win_prob
            })
            self._update_bracket_with_winner(bracket, game_data, winner_id)
        
        optimal_bracket["rounds"].append(round_data)
    
    return optimal_bracket

Variance Preference Tuning

The variance_preference parameter enables strategy customization:

Conservative Strategy (variance_preference = 0.0): Picks favorites in almost every game, maximizing expected value but providing little differentiation.

Balanced Strategy (variance_preference = 0.4): Selects strategic upsets where the risk/reward ratio justifies deviating from favorites.

Aggressive Strategy (variance_preference = 0.8): Emphasizes upsets heavily, providing high upside potential but increased risk.
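
A quick illustration of how this parameter moves the upset bonus computed in build_optimal_bracket (using an assumed 12-over-5 first-round pick):

# Hypothetical 12-over-5 matchup in the First Round (pointValue = 1)
seed_diff, round_value = 12 - 5, 1
for variance_preference in (0.0, 0.4, 0.8):
    upset_bonus = seed_diff * round_value * variance_preference
    print(f"variance={variance_preference}: +{upset_bonus:.1f} EV for the 12 seed")
# variance=0.0: +0.0 (pure chalk), 0.4: +2.8, 0.8: +5.6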

Strategic Upset Identification

Beyond basic bracket construction, the system identifies specific upset opportunities with favorable risk/reward profiles:

def find_strategic_upsets(self, bracket: Dict[str, Any], max_upsets: int = 10) -> List[Dict[str, Any]]:
    """Identify high-value upset opportunities"""
    potential_upsets = []
    
    for round_data in bracket["rounds"]:
        round_idx = self._get_round_index(round_data["name"])
        
        for game in round_data["games"]:
            # Identify potential upset scenarios
            team1_seed = game["team1_seed"]
            team2_seed = game["team2_seed"]
            winner_id = game["winner_id"]
            winner_prob = game["winner_probability"]
            
            # Check if current pick represents an upset
            is_upset = self._is_upset_pick(game)
            
            if is_upset:
                # Calculate strategic value metrics
                seed_diff = abs(team1_seed - team2_seed)
                round_value = self.rounds[round_idx]["pointValue"]
                
                # Quality factors for upset evaluation
                probability_quality = min(winner_prob, 1.0 - winner_prob) * 2  # Scale to 0-1
                
                # Strategic value combines multiple factors
                strategic_value = seed_diff * round_value * probability_quality
                
                # Bonus for early round upsets of top seeds (high differentiation value)
                if round_idx <= 1 and min(team1_seed, team2_seed) <= 3:
                    strategic_value *= 1.5
                
                potential_upsets.append({
                    **game,
                    "round_name": round_data["name"],
                    "seed_diff": seed_diff,
                    "strategic_value": strategic_value,
                    "upset_probability": winner_prob
                })
    
    # Return top strategic upset opportunities
    return sorted(potential_upsets, key=lambda x: x["strategic_value"], reverse=True)[:max_upsets]
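
The _is_upset_pick predicate used above isn't shown; a minimal sketch consistent with the game records built earlier would be:

def _is_upset_pick(self, game: Dict[str, Any]) -> bool:
    """Hypothetical sketch: a pick is an upset when the chosen winner carries
    the higher (worse) seed number of the two teams."""
    winner_seed = (game["team1_seed"] if game["winner_id"] == game["team1_id"]
                   else game["team2_seed"])
    return winner_seed > min(game["team1_seed"], game["team2_seed"])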

This analysis identifies upsets offering the best combination of seed differential, round point value, and realistic win probability, with extra weight given to early-round upsets of top seeds where differentiation value is highest.

Tournament Analysis and Insights

The simulation results enable comprehensive tournament analysis beyond simple bracket construction:

def analyze_simulation_results(self) -> Dict[str, Any]:
    """Extract strategic insights from simulation results"""
    analysis = {
        "most_likely_champions": [],
        "best_value_picks": [],
        "most_volatile_regions": {},
        "most_common_upsets": []
    }
    
    # Championship probability analysis
    championship_probs = {}
    for team_id in self.teams:
        championship_count = self.team_advancement[team_id][-1]  # Final round
        championship_probs[team_id] = championship_count / self.num_simulations
    
    # Sort by championship probability
    analysis["most_likely_champions"] = sorted(
        [(team_id, prob, self.teams[team_id]) for team_id, prob in championship_probs.items()],
        key=lambda x: x[1], reverse=True
    )[:10]
    
    # Value pick analysis (high EV relative to seed)
    value_picks = []
    for team_id in self.teams:
        team_ev = self.get_team_expected_value(team_id)
        team_seed = self.teams[team_id]["seed"]
        value_score = team_ev * team_seed / 100  # Seed-weight the EV so lower-seeded teams with strong EV rank higher
        
        value_picks.append((team_id, value_score, team_ev, self.teams[team_id]))
    
    analysis["best_value_picks"] = sorted(value_picks, key=lambda x: x[1], reverse=True)[:10]
    
    # Regional volatility analysis
    for region in self.regions:
        region_teams = [tid for tid in self.teams if self.teams[tid]["region"] == region]
        
        # Calculate upset frequency in this region
        top_seeds = [tid for tid in region_teams if self.teams[tid]["seed"] <= 4]
        lower_seeds = [tid for tid in region_teams if self.teams[tid]["seed"] > 4]
        
        # Elite 8 advancement rates
        elite8_top = sum(self.team_advancement[tid][3] for tid in top_seeds)
        elite8_lower = sum(self.team_advancement[tid][3] for tid in lower_seeds)
        total_elite8 = elite8_top + elite8_lower
        
        if total_elite8 > 0:
            volatility = elite8_lower / total_elite8
            analysis["most_volatile_regions"][region] = {
                "volatility_score": volatility,
                "upset_frequency": volatility
            }
    
    return analysis

Production Tournament Pipeline

The complete system integrates through a command-line interface that orchestrates the entire optimization process:

def main():
    parser = argparse.ArgumentParser(description="NCAA Tournament Bracket Optimizer")
    parser.add_argument("--simulations", type=int, default=1000, help="Number of simulations")
    parser.add_argument("--variance", type=float, default=0.4, help="Upset preference (0.0-1.0)")
    parser.add_argument("--neutral-site", action="store_true", default=True)
    parser.add_argument("--upset-logic", action="store_true", default=True)
    parser.add_argument("--analyze", action="store_true", help="Run additional analysis")
    parser.add_argument("--output", help="Output JSON file for results")
    
    args = parser.parse_args()
    
    # Initialize and configure optimizer
    optimizer = BracketOptimizer()
    optimizer.setup_tournament()  # Loads 2025 tournament data by default
    
    # Execute complete optimization pipeline
    print(f"Running {args.simulations} tournament simulations...")
    optimizer.simulate_tournament(args.simulations, args.neutral_site, args.upset_logic)
    
    print("Building optimal bracket...")
    bracket = optimizer.build_optimal_bracket(args.neutral_site, args.upset_logic, args.variance)
    
    print("Identifying strategic upsets...")
    upsets = optimizer.find_strategic_upsets(bracket)
    
    # Display results
    optimizer.print_bracket(bracket)
    optimizer.print_upsets(upsets)
    
    # Optional detailed analysis
    if args.analyze:
        analysis = optimizer.analyze_simulation_results()
        print("\nMost Likely Champions:")
        for i, (team_id, prob, team_info) in enumerate(analysis["most_likely_champions"][:5], 1):
            print(f"{i}. ({team_info['seed']}) {team_info['name']}: {prob*100:.1f}%")
    
    # Export results if requested
    if args.output:
        optimizer.export_bracket_json(bracket, upsets, args.output)
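
Assuming the script is saved as bracket_optimizer.py (the file name isn't shown in the excerpts), a typical run might look like:

python bracket_optimizer.py --simulations 5000 --variance 0.4 --analyze --output bracket_2025.json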

Real-World Application and Results

The tournament optimization system provides actionable insights for bracket pool strategy:

Expected Output Analysis

Optimal Bracket Structure: The system typically recommends 6-12 upset picks across all rounds, with the majority concentrated in the first two rounds where the risk/reward ratio is most favorable.

Strategic Upset Distribution:

Value Pick Identification: The system consistently identifies mid-major teams with strong advanced metrics but lower seeds as high-value championship contenders.

Performance Validation

Historical backtesting shows the optimization system would have:

Integration with Complete Prediction System

This tournament optimization represents the culmination of the complete four-part prediction system:

Part 1 Foundation: Clean, comprehensive KenPom data provides the statistical foundation for all predictions.

Part 2 Intelligence: Sophisticated feature engineering transforms raw statistics into predictive signals that capture basketball’s essential dynamics.

Part 3 Modeling: Robust machine learning models generate reliable win probabilities for any matchup combination.

Part 4 Optimization: Monte Carlo simulation and expected value optimization transform individual predictions into tournament strategy that maximizes expected points while managing risk.

Strategic Insights and Limitations

The system provides several key strategic insights:

Upset Timing Matters: Early round upsets provide maximum differentiation value, while late round upsets require extremely high confidence due to exponential point values.

Regional Balance: Successful brackets typically include upsets distributed across regions rather than concentrated in a single volatile region.

Seed-Based Value: The biggest value opportunities often come from teams seeded 2-3 lines lower than their true strength due to conference tournament losses or evaluation committee oversights.

Model Limitations: Even sophisticated models can’t eliminate March Madness’s inherent unpredictability—the system optimizes expected value rather than guaranteeing specific outcomes.

Conclusion

The tournament simulation and optimization system demonstrates how advanced analytics can navigate complex strategic decisions in competitive environments. By combining rigorous statistical modeling with sophisticated optimization techniques, the system provides a systematic approach to bracket construction that balances multiple competing objectives.

The complete four-part system—from data acquisition through tournament optimization—illustrates how modern data science can enhance decision-making in domains characterized by uncertainty, competition, and complex interdependencies. While no system can eliminate the genuine unpredictability that makes March Madness compelling, this framework provides a principled approach to maximizing expected outcomes in the face of inherent uncertainty.

The integration of machine learning models with Monte Carlo simulation and expected value optimization creates a powerful tool for tournament analysis that goes far beyond simple game-by-game predictions, enabling strategic thinking about risk, reward, and competitive positioning that defines successful bracket pool strategy.


This completes our four-part series on building a comprehensive NCAA basketball prediction system, from data acquisition through tournament optimization.