Building a Dota 2 Match Outcome Predictor, Part 2: Enhancing the Dataset and Adding New Features


Viktor Hamretskyi

“The journey of continuous improvement is fueled by the desire to be better today than yesterday, and better tomorrow than today.” — Anonymous


Author: https://www.deviantart.com/jirojh

Hello everyone! Today, we’re diving into part two of our grand ML project! Fingers crossed, let’s hope this article doesn’t get overshadowed by the U.S. election headlines — it’s not every day my post competes with national news! Who knows — maybe my next project will be predicting elections… because hey, why not aim big? But for now, let’s get back to the real excitement: machine learning!

In the first part of this series, I covered the foundational steps of building a Dota 2 match outcome predictor, focusing on data collection and basic model setup. This involved initial data preparation, selecting essential features, and creating a baseline model, alongside discussing insights and challenges faced in the process. These steps laid a strong groundwork for prediction, but there’s more we can do to increase the model’s effectiveness and usability. In Part 2, I’ll focus on enhancing the dataset and incorporating new user-facing features to refine prediction accuracy and usability, offering players a more interactive experience and deeper insights into match dynamics in Dota 2.

Advanced Feature Engineering in Dota 2 Match Outcome Prediction

In Part 1 of this series, I laid the groundwork for predicting match outcomes in Dota 2 by gathering essential metrics, selecting basic features, and building an initial predictive model. While this provided a starting point, I knew there was more we could do to enhance the model’s predictive power by digging deeper into data refinement and feature engineering.

In Part 2, I’ll walk through the advanced feature engineering techniques that have transformed the initial dataset into a more insightful and performance-oriented version. These improvements aim to capture the complex dynamics of team performance in Dota 2, from individual player stats to overall team strategies.

1. Aggregated Team Features

One of the key enhancements in Part 2 is the calculation of team-level aggregated features. Rather than using individual player stats for each team, I implemented a function to average important metrics across players to represent a unified team performance. This includes average values for:

  • Hero win rates (to represent the effectiveness of hero choices),
  • Kills, deaths, and assists, and
  • GPM (gold per minute) and XPM (experience per minute).

This approach allows the model to view teams as cohesive units rather than collections of individuals, giving it a broader context when predicting outcomes.
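As a rough illustration of this aggregation (not the project's actual helper), the sketch below averages per-player columns into one team-level column with pandas and then drops the per-player columns. The function name add_team_averages and the column names such as radiant_player_1_kills and radiant_avg_kills are assumptions made for this example.

```python
import pandas as pd


def add_team_averages(df, team_prefix, stats, n_players=5):
    """Average per-player stat columns into one column per team and stat.

    Column names are hypothetical: `<team>_player_<i>_<stat>` in,
    `<team>_avg_<stat>` out.
    """
    df = df.copy()
    for stat in stats:
        player_cols = [
            f"{team_prefix}_player_{i}_{stat}" for i in range(1, n_players + 1)
        ]
        # Row-wise mean across the players; NaN values are skipped by default.
        df[f"{team_prefix}_avg_{stat}"] = df[player_cols].mean(axis=1)
        # Drop the per-player columns once they have been aggregated.
        df = df.drop(columns=player_cols)
    return df


# Toy one-match example with made-up numbers.
row = {
    f"radiant_player_{i}_kills": kills
    for i, kills in enumerate([2, 4, 6, 8, 10], start=1)
}
row.update(
    {
        f"radiant_player_{i}_gold_per_min": gpm
        for i, gpm in enumerate([300, 400, 500, 600, 700], start=1)
    }
)
match_df = add_team_averages(pd.DataFrame([row]), "radiant", ["kills", "gold_per_min"])
print(match_df)
```

Ten per-player columns collapse into two team-level columns here, which is the same consolidation the feature reduction step relies on.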

2. Streamlined Data with Feature Reduction

In Part 1, the dataset included numerous individual stats per player and team. However, for a more efficient model, I dropped many of these intermediate columns after aggregating them. By consolidating data into key features like average kills or team-level experience gain, the model focuses on essential information without redundant noise. This feature reduction simplifies the dataset, allowing the model to generalize better and train faster without sacrificing relevant insights.

3. Scaling and Normalization for Consistency

To make sure that all features contribute proportionally to the predictions, I applied MinMax scaling to normalize values across the dataset. By transforming the data to a common scale between 0 and 1, this process ensures consistency in model training and reduces sensitivity to large numerical disparities across different metrics. To ensure the model continues to process future datasets in the same way, the scaler is saved after initial fitting. This allows us to apply the same transformation consistently in the future, keeping new data aligned with the original dataset and preventing skewed outcomes when updated data is introduced.
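A minimal sketch of the fit-once, reuse-later pattern follows, assuming scikit-learn's MinMaxScaler and the standard pickle module. The article only says that the scaler is saved (as scaler.pkl), so the exact saving mechanism, the temporary path, and the column names below are assumptions.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up training data; the column names are illustrative.
train = pd.DataFrame(
    {
        "radiant_avg_gold_per_min": [400.0, 500.0, 600.0],
        "radiant_avg_kills": [4.0, 6.0, 8.0],
    }
)
feature_cols = list(train.columns)

# Fit once on the training data and persist the fitted scaler.
scaler = MinMaxScaler()
train_scaled = pd.DataFrame(
    scaler.fit_transform(train[feature_cols]), columns=feature_cols
)

scaler_path = os.path.join(tempfile.mkdtemp(), "scaler.pkl")  # hypothetical location
with open(scaler_path, "wb") as f:
    pickle.dump(scaler, f)

# Later: reload the scaler and apply the SAME min/max to new data.
with open(scaler_path, "rb") as f:
    loaded_scaler = pickle.load(f)

new = pd.DataFrame({"radiant_avg_gold_per_min": [650.0], "radiant_avg_kills": [5.0]})
new_scaled = pd.DataFrame(
    loaded_scaler.transform(new[feature_cols]), columns=feature_cols
)
print(new_scaled)
```

Because the reloaded scaler keeps the training minimum and maximum, a new value outside that range (650 GPM here) lands outside [0, 1]. That is expected when the original transformation is reused rather than refitted.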

4. New Feature Additions for Greater Depth

Finally, I expanded the dataset with additional features that capture aspects of team strategy and map control. Key new additions include:

  • Teamfight participation: to represent team cohesion in fights,
  • Net worth: as an indicator of each team’s overall economy,
  • Observer and sentry ward placements: to reflect map control and vision strategy.

These features give the model a deeper look at team dynamics, allowing it to consider not only raw stats but also in-game tactics that influence match outcomes.

Visualization: Understanding Feature Importance

Visualizing data is a crucial step in understanding the relationships between features and the target variable, especially in a complex game like Dota 2. In this section, I’ll demonstrate how to visualize the correlation between various features and the match outcome, specifically focusing on the “radiant_win” column. This insight can help identify which features are most influential in predicting match outcomes and guide further feature engineering efforts.

To achieve this, I utilized a correlation matrix and visualized it using a heatmap. Here’s the Python script I used to load the dataset, prepare the data, and create the heatmap:

import os
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

from structure.helpers import prepare_match_prediction_data

pd.set_option("display.max_columns", None)

file_path = os.path.join("..", "dataset", "train_data", "all_data_match_predict.csv")
scaler_path = "../scaler.pkl"

# Load and prepare the dataset
df = pd.read_csv(file_path)
df = prepare_match_prediction_data(df, scaler_path)

# Specify the features and target column
# features = df.columns[:-1].tolist() # All columns except the last one
target = "radiant_win" # Change to your target column name
features = df.columns.drop(target).tolist()

corr_matrix = df.corr()

# Visualize with a heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(
    corr_matrix[["radiant_win"]].sort_values(by="radiant_win", ascending=False),
    annot=True,
    cmap="coolwarm",
)
plt.title("Correlation of Features with Radiant Win")
plt.show()

Explanation of the Code:

  1. Data Loading and Preparation: The dataset is loaded from a CSV file, and the prepare_match_prediction_data function is called to preprocess the data, including scaling and normalization.
  2. Correlation Matrix: The correlation matrix is computed with pandas' DataFrame.corr() method, which by default returns the pairwise Pearson correlation between columns, including each feature's correlation with the target variable "radiant_win".
  3. Heatmap Visualization: The heatmap is created using Seaborn’s heatmap() function. The correlation values with "radiant_win" are sorted in descending order, allowing us to quickly identify which features have the strongest positive or negative correlations with the target.
  4. Interpreting the Heatmap: The heatmap gives a first-pass view of feature relevance. Features with a large absolute correlation (positive or negative) with the “radiant_win” outcome are candidates to prioritize for further model tuning or feature engineering. Correlation only captures linear association, so treat it as a guide rather than a definitive ranking.
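Since strong negative correlations matter as much as strong positive ones, it can help to rank features by absolute correlation before plotting. The sketch below shows one way to do that with pandas; the toy values and feature names are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Toy data with hypothetical feature names; real values come from the prepared dataset.
df = pd.DataFrame(
    {
        "radiant_avg_gpm": [1, 2, 3, 4],
        "radiant_avg_obs_placed": [1, 0, 1, 0],
        "dire_avg_kda": [5, 5, 1, 0],
        "radiant_win": [0, 0, 1, 1],
    }
)

# Pearson correlation of every feature with the target, excluding the target itself.
target_corr = df.corr()["radiant_win"].drop("radiant_win")

# Rank by absolute value so strong negative correlations are not buried at the bottom.
ranked = target_corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

In this toy example the negatively correlated dire_avg_kda ranks first, even though a plain descending sort would place it last.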

Insights from the Visualization


Figure: Correlation of features with the “radiant_win” outcome.

By examining the heatmap, we can gain valuable insights into which features may be most predictive of a Radiant team’s success. For instance, if certain metrics like average kills or teamfight participation show a strong positive correlation with “radiant_win,” these could be essential features to focus on in future iterations of the model.

Upon analyzing the correlation heatmap specifically related to the radiant_win column, several notable trends emerge concerning how various features impact the outcome for the Radiant team.

  1. Gold Per Minute (GPM): The Radiant team’s GPM shows a strong positive correlation with winning. This indicates that when Radiant players accumulate gold efficiently, they can afford better items and upgrades, enhancing their overall power during the match.
  2. Experience Per Minute (XPM): Similar to GPM, XPM is also positively correlated with radiant wins. Higher experience gain per minute allows Radiant players to level up faster, unlocking crucial skills and abilities that can turn the tide in battles.
  3. KDA (Kills, Deaths, Assists): The KDA ratio for Radiant players exhibits a strong positive correlation with winning as well. A high KDA suggests that players are not only securing kills and assists but are also minimizing their deaths. This metric reflects both individual performance and effective team coordination, which is essential in competitive matches.

These positive correlations indicate that higher Radiant GPM, XPM, and KDA are associated with a higher chance of a Radiant win. Correlation alone does not establish cause, but it does mark these metrics as useful signals of team performance.

While the analysis focuses on the Radiant team, the Dire team's metrics mirror it. Although we won’t delve deeply into their negative correlations here, the implication is clear: the Dire team's GPM, XPM, and KDA correlate negatively with radiant_win, so the stronger Dire's numbers are, the less likely a Radiant victory becomes.

This visualization step not only enhances our understanding of the dataset but also informs our next steps in model improvement and feature selection. By understanding the relationship between these metrics and the radiant_win column, we can refine our model to prioritize features that significantly impact the likelihood of Radiant success. By enhancing these features further through data collection and engineering, we can build a more robust predictive model that accurately reflects the dynamics of gameplay.

Enhancing the Player Class for Dota 2 Match Prediction

In this section, I’ll walk you through the improvements made to the Player class, which is pivotal for our Dota 2 match outcome prediction model. The enhancements focus on robust data retrieval, better statistical handling, and improved object representation, which altogether enrich the player analysis experience.

Key Improvements

  1. Initialization and Data Handling: The updated Player class initializes player attributes directly from a given player_data dictionary. If no data is provided, it resets all statistics to zero and retrieves the player's total data from recent matches. This dual approach not only provides flexibility in object creation but also ensures that we can easily handle cases where player data may be missing.

def __init__(self, account_id, name, hero_id, team, player_data=None):
    self.account_id = account_id
    self.team = team
    self.hero = Hero(hero_id)
    self.name = name

    if player_data:
        self.load_player_data(player_data)
    else:
        self.reset_stats()
        self.get_player_total_data()

This initialization method is cleaner and allows for straightforward updates when player data is available.

  2. Efficient Data Aggregation: The new get_player_total_data method takes a more systematic approach to fetching and accumulating player statistics. It retrieves recent match data with a retry mechanism, so a match that cannot be fetched on the first attempt is retried several times before being skipped.

During accumulation, each counter is incremented only when the corresponding field is present in the retrieved player data, so every average is divided by the number of matches that actually supplied that stat. When no match supplied it, calculate_average returns 0 instead of dividing by zero.

def get_player_total_data(self):
    """Fetch player total data with retries on match data retrieval."""
    recent_matches = self.fetch_recent_matches()

    # Initialize counters for averages
    participation_count = obs_count = sen_count = net_worth_count = 0
    kills_count = deaths_count = assists_count = roshan_count = 0
    last_hits_count = denies_count = gpm_count = xpm_count = level_count = 0
    hero_damage_count = tower_damage_count = healing_count = 0

    # Iterate through recent matches
    for match in recent_matches:
        match_id = match["match_id"]
        match_data = self.fetch_match_data_with_retries(match_id)

        if match_data is None:
            logger.warning(f"Skipping match {match_id} after 5 attempts")
            continue  # Skip the match if it couldn't be retrieved

        # Get player data
        player_data = self.get_player_data(match_data)

        if player_data:
            logger.debug(
                f"Processing match data for match ID {match_id}: {player_data}"
            )
            # Accumulate values safely
            participation_count = self.accumulate_value(
                player_data, "teamfight_participation", participation_count
            )
            obs_count = self.accumulate_value(player_data, "obs_placed", obs_count)
            sen_count = self.accumulate_value(player_data, "sen_placed", sen_count)
            net_worth_count = self.accumulate_value(
                player_data, "net_worth", net_worth_count
            )
            kills_count = self.accumulate_value(player_data, "kills", kills_count)
            deaths_count = self.accumulate_value(
                player_data, "deaths", deaths_count
            )
            assists_count = self.accumulate_value(
                player_data, "assists", assists_count
            )
            roshan_count = self.accumulate_value(
                player_data, "roshans_killed", roshan_count
            )
            last_hits_count = self.accumulate_value(
                player_data, "last_hits", last_hits_count
            )
            denies_count = self.accumulate_value(
                player_data, "denies", denies_count
            )
            gpm_count = self.accumulate_value(
                player_data, "gold_per_min", gpm_count
            )
            xpm_count = self.accumulate_value(player_data, "xp_per_min", xpm_count)
            level_count = self.accumulate_value(player_data, "level", level_count)
            hero_damage_count = self.accumulate_value(
                player_data, "hero_damage", hero_damage_count
            )
            tower_damage_count = self.accumulate_value(
                player_data, "tower_damage", tower_damage_count
            )
            healing_count = self.accumulate_value(
                player_data, "hero_healing", healing_count
            )

    # Safely divide by the number of successful additions for each field
    self.teamfight_participation = self.calculate_average(
        self.teamfight_participation, participation_count
    )
    self.obs_placed = self.calculate_average(self.obs_placed, obs_count)
    self.sen_placed = self.calculate_average(self.sen_placed, sen_count)
    self.net_worth = self.calculate_average(self.net_worth, net_worth_count)
    self.kills = self.calculate_average(self.kills, kills_count)
    self.deaths = self.calculate_average(self.deaths, deaths_count)
    self.assists = self.calculate_average(self.assists, assists_count)
    self.roshans_killed = self.calculate_average(self.roshans_killed, roshan_count)
    self.last_hits = self.calculate_average(self.last_hits, last_hits_count)
    self.denies = self.calculate_average(self.denies, denies_count)
    self.gold_per_min = self.calculate_average(self.gold_per_min, gpm_count)
    self.xp_per_min = self.calculate_average(self.xp_per_min, xpm_count)
    self.level = self.calculate_average(self.level, level_count)
    self.hero_damage = self.calculate_average(self.hero_damage, hero_damage_count)
    self.tower_damage = self.calculate_average(
        self.tower_damage, tower_damage_count
    )
    self.hero_healing = self.calculate_average(self.hero_healing, healing_count)

  3. Statistical Functions: Two utility methods, accumulate_value and calculate_average, have been introduced to streamline updating statistics and calculating averages. Encapsulating the calculations in small functions improves readability and makes it easier to modify or extend the statistics later.

def accumulate_value(self, player_data, key, count):
    # Add the stat to the running total only when the match actually reports it.
    # A field may be missing or null (None), which would break the addition.
    value = player_data.get(key)
    if value is not None:
        setattr(self, key, getattr(self, key) + value)
        count += 1
    return count

def calculate_average(self, total, count):
    # Return 0 when no match supplied the stat, avoiding division by zero
    return total / count if count > 0 else 0
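The fetch_match_data_with_retries method called in get_player_total_data is not shown in this article. As a rough idea of what such a retry helper can look like, here is a minimal, generic sketch. The function name fetch_with_retries, its parameters, and the fixed delay between attempts are all assumptions for illustration and not the project's actual implementation.

```python
import time


def fetch_with_retries(fetch, max_attempts=5, delay_seconds=1.0):
    """Call `fetch()` until it returns a non-None result or attempts run out.

    `fetch` is any zero-argument callable that returns data, returns None,
    or raises on failure. Returns None if every attempt fails, which matches
    how get_player_total_data treats a match it could not retrieve.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = fetch()
            if result is not None:
                return result
        except Exception as exc:  # real code should catch the HTTP client's errors
            print(f"Attempt {attempt} failed: {exc}")
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # wait before the next attempt
    return None


# Usage with a fake fetcher that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"match_id": 123}


match_data = fetch_with_retries(flaky_fetch, max_attempts=5, delay_seconds=0)
print(match_data, calls["count"])
```

Returning None after the final attempt, instead of raising, lets the caller skip a single bad match and keep aggregating the rest.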

Conclusion

These enhancements to the Player class are designed to provide a more comprehensive and reliable framework for analyzing player performance in Dota 2 matches. By adopting a modular approach to data handling and emphasizing clear statistical representation, we set the stage for improved predictive capabilities in our model.

The integration of retries during data fetching also prepares the system to deal with potential inconsistencies in data availability, ensuring a smoother user experience. As we continue to refine our model, these updates to the Player class will prove to be instrumental in yielding more accurate predictions and deeper insights into player behavior and team dynamics.

Hero Class Enhancements for Counter-Pick Analysis

To develop a comprehensive Hero Pick Feature, we modified our existing Hero class from previous articles to incorporate counter-pick data against opposing heroes. Below is an overview of the modified Hero class and its additional capabilities for evaluating counter-pick strengths.

Modified Hero Class

The Hero class has been expanded to support counter-pick evaluations against other heroes. Here’s a summary of key components and methods in this updated version:

  • Win Rate Calculation: The class calculates each hero’s win rate in professional matches.
  • Counter Pick Data: The set_counter_pick_data method analyzes the win rate of the hero against specific heroes on the opposing team, adding these values to the hero’s data.

This update allows us to dynamically set counter-pick data based on specific enemy heroes, giving our model a more nuanced view of each hero’s potential performance.

Updated Code for the Hero Class

Below is the complete Hero class with the set_counter_pick_data method implemented:

import requests

# Assumes `logger` and `opendota_key` are defined at module level.


class Hero:
    def __init__(self, hero_id):
        self.hero_id = hero_id
        self.features = self.get_hero_features()
        self.name = self.features["name"] if self.features else "Unknown Hero"
        self.counter_picks = []

        # Win rate in professional matches; 0 when the hero has no pro picks
        if self.features and self.features["pro_pick"] > 0:
            self.winrate = self.features["pro_win"] / self.features["pro_pick"]
        else:
            self.winrate = 0

        logger.info(f"Initialized Hero: {self}")

    def get_hero_features(self):
        url = f"https://api.opendota.com/api/heroStats?api_key={opendota_key}"
        logger.info(f"Fetching hero features for Hero ID: {self.hero_id}")
        response = requests.get(url)

        if response.status_code == 200:
            heroes = response.json()
            for hero in heroes:
                if hero["id"] == self.hero_id:
                    logger.info(
                        f"Hero features retrieved for ID {self.hero_id}: {hero}"
                    )
                    return {
                        "hero_id": hero["id"],
                        "name": hero["localized_name"],
                        "pro_win": hero.get("pro_win", 0),
                        "pro_pick": hero.get("pro_pick", 0),
                    }
        else:
            logger.error(f"Error fetching hero features: {response.status_code}")
        # Request failed or the hero ID was not in the response
        return None

    def get_hero_matchups(self):
        url = f"https://api.opendota.com/api/heroes/{self.hero_id}/matchups?api_key={opendota_key}"
        logger.info(f"Fetching matchups for Hero ID: {self.hero_id}")
        response = requests.get(url)

        if response.status_code == 200:
            hero_matchups = response.json()
            logger.info(f"Matchups retrieved for Hero ID {self.hero_id}.")
            return hero_matchups
        else:
            logger.error(f"Error fetching hero matchups: {response.status_code}")
            return None

    def set_counter_pick_data(self, hero_against_ids):
        logger.info(f"Setting counter pick data for Hero ID: {self.hero_id}")
        # Start from a clean list so repeated calls do not duplicate entries
        self.counter_picks = []
        hero_matchups = self.get_hero_matchups()
        if hero_matchups:
            for hero_matchup in hero_matchups:
                if hero_matchup["hero_id"] in hero_against_ids:
                    # Win rate of this hero against the given enemy hero
                    win_rate = (
                        hero_matchup["wins"] / hero_matchup["games_played"]
                        if hero_matchup["games_played"] > 0
                        else 0
                    )
                    self.counter_picks.append(
                        {"win_rate": win_rate, "hero_id": hero_matchup["hero_id"]}
                    )
                    logger.info(
                        f"Added counter pick for Hero ID: {hero_matchup['hero_id']} with win rate: {win_rate:.2f}"
                    )
        else:
            logger.warning(f"No matchups found for Hero ID: {self.hero_id}")

    def __repr__(self):
        return f"Hero(ID: {self.hero_id}, Name: {self.name}, Features: {self.features})"

Hero Pick Data in the Match Class

To prepare hero pick data dynamically for predictions, we added the get_hero_match_data_for_prediction function to the Match class. This function processes counter-pick win rates and hero win rates in the context of a specific match:

def get_hero_match_data_for_prediction(self):
    if len(self.radiant_team.players) == 5 and len(self.dire_team.players) == 5:
        dire_hero_ids = [player.hero.hero_id for player in self.dire_team.players]
        radiant_hero_ids = [player.hero.hero_id for player in self.radiant_team.players]

        for player in self.dire_team.players:
            player.hero.set_counter_pick_data(radiant_hero_ids)
        for player in self.radiant_team.players:
            player.hero.set_counter_pick_data(dire_hero_ids)

        match_data = {
            "match_id": self.match_id,
            "radiant_team_id": self.radiant_team.team_id,
            "radiant_team_name": self.radiant_team.team_name,
            "dire_team_id": self.dire_team.team_id,
            "dire_team_name": self.dire_team.team_name,
        }

        # Compile hero data for each player on both teams
        for i, player in enumerate(self.radiant_team.players):
            match_data[f"radiant_player_{i + 1}_hero_id"] = player.hero.hero_id
            match_data[f"radiant_player_{i + 1}_hero_winrate"] = player.hero.winrate
            for n, counter_pick in enumerate(player.hero.counter_picks):
                match_data[f"radiant_hero_{i + 1}_{n + 1}_counter_pick"] = (
                    counter_pick["win_rate"]
                )

        for i, player in enumerate(self.dire_team.players):
            match_data[f"dire_player_{i + 1}_hero_id"] = player.hero.hero_id
            match_data[f"dire_player_{i + 1}_hero_winrate"] = player.hero.winrate
            for n, counter_pick in enumerate(player.hero.counter_picks):
                match_data[f"dire_hero_{i + 1}_{n + 1}_counter_pick"] = (
                    counter_pick["win_rate"]
                )

        df = pd.DataFrame([match_data])
        df = prepare_hero_pick_data(df)
        return df
    else:
        raise ValueError("Both teams must have exactly 5 players.")

Dataset Generation Script: Integrating Hero Pick Data

To build our training data, we expanded the dataset generation script to capture each hero’s counter-pick data. This enhancement helps the model recognize favorable and unfavorable matchups for each hero within a given match.

Below is the code snippet from the dataset generation script:

def generate_dataset():
    api = OpenDotaApi()
    dataset = []

    premium_leagues = api.set_premium_leagues()

    for premium_league in premium_leagues:
        league_id = premium_league["leagueid"]
        league_name = premium_league["name"]
        tournament = Tournament(league_id=league_id, name=league_name)
        tournament.get_league_matches()

        for match in tournament.matches:
            radiant_team = match.radiant_team
            dire_team = match.dire_team

            # Only keep matches where both rosters are complete
            if len(radiant_team.players) == 5 and len(dire_team.players) == 5:
                match_data = {
                    "match_id": match.match_id,
                    "radiant_team_id": radiant_team.team_id,
                    "radiant_team_name": radiant_team.team_name,
                    "dire_team_id": dire_team.team_id,
                    "dire_team_name": dire_team.team_name,
                    "radiant_win": match.radiant_win,
                }

                for i, player in enumerate(radiant_team.players):
                    match_data[f"radiant_player_{i + 1}_hero_id"] = (
                        player.hero.hero_id
                    )
                    match_data[f"radiant_player_{i + 1}_hero_winrate"] = (
                        player.hero.winrate
                    )
                    for n, counter_pick in enumerate(player.hero.counter_picks):
                        match_data[f"radiant_hero_{i + 1}_{n + 1}_counter_pick"] = (
                            counter_pick["win_rate"]
                        )

                for i, player in enumerate(dire_team.players):
                    match_data[f"dire_player_{i + 1}_hero_id"] = (
                        player.hero.hero_id
                    )
                    match_data[f"dire_player_{i + 1}_hero_winrate"] = (
                        player.hero.winrate
                    )
                    for n, counter_pick in enumerate(player.hero.counter_picks):
                        match_data[f"dire_hero_{i + 1}_{n + 1}_counter_pick"] = (
                            counter_pick["win_rate"]
                        )

                dataset.append(match_data)

    df = pd.DataFrame(dataset)
    df.to_csv("hero_pick_dataset.csv", index=False)
    print("Hero pick dataset generated and saved.")

Data Preparation: Creating Hero Features

After generating the dataset, we prepare the data for analysis and modeling. This involves creating features that summarize the hero performance for each team and facilitate better predictive modeling.

Below are the functions used for data preparation, including hero feature creation:

def create_hero_features(df, team_prefix):
    logger.info(f"Creating hero features for {team_prefix}")
    try:
        # 5 heroes per team x 5 opposing heroes = 25 counter-pick columns
        hero_columns = [
            f"{team_prefix}_hero_{i}_{n}_counter_pick"
            for i in range(1, 6)
            for n in range(1, 6)
        ]
        hero_winrate_columns = [
            f"{team_prefix}_player_{i}_hero_winrate" for i in range(1, 6)
        ]

        # Row-wise averages; missing (NaN) values are skipped
        df[f"{team_prefix}_avg_counter_pick"] = df[hero_columns].mean(axis=1)
        df[f"{team_prefix}_avg_hero_winrate"] = df[hero_winrate_columns].mean(axis=1)

        df.drop(columns=hero_columns + hero_winrate_columns, inplace=True)
        logger.info(f"Hero features created and columns dropped for {team_prefix}")

    except Exception as e:
        logger.error(f"Error in create_hero_features for {team_prefix}: {e}")

    return df

def prepare_hero_pick_data(df):
    logger.info("Preparing hero pick data")
    try:
        df = create_hero_features(df, "radiant")
        df = create_hero_features(df, "dire")

        try:
            df["radiant_win"] = df["radiant_win"].astype(int)
        except KeyError:
            # Expected at prediction time, when the outcome is not yet known
            logger.warning("radiant_win column missing")

        df.drop(
            columns=[
                "match_id",
                "radiant_team_id",
                "radiant_team_name",
                "dire_team_id",
                "dire_team_name",
                *[f"radiant_player_{i}_hero_id" for i in range(1, 6)],
                *[f"radiant_player_{i}_hero_name" for i in range(1, 6)],
                *[f"dire_player_{i}_hero_id" for i in range(1, 6)],
                *[f"dire_player_{i}_hero_name" for i in range(1, 6)],
            ],
            inplace=True,
            # The *_hero_name columns are not created by the code shown above;
            # without errors="ignore" the drop raises KeyError and removes nothing
            errors="ignore",
        )
        logger.info("Hero pick data prepared and relevant columns dropped")

    except Exception as e:
        logger.error(f"Error in prepare_hero_pick_data: {e}")

    return df

Explanation of Data Preparation

  1. Hero Feature Creation:
  • The create_hero_features function calculates average counter-pick values and win rates for each team by averaging the corresponding columns.
  • It constructs new columns for average counter-picks and hero win rates, dropping the original detailed columns to streamline the dataset.

  2. Final Data Preparation:
  • The prepare_hero_pick_data function orchestrates the preparation process by invoking create_hero_features for both teams.
  • It casts the radiant_win column to an integer when that column is present, and drops identifier columns (match, team, and hero IDs and names) so that only the aggregated features remain for modeling.

This structured approach to data generation and preparation lays a solid foundation for training machine learning models that can predict match outcomes based on hero selections and counter-picks.
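To make the averaging step concrete, here is a small worked example on a single toy match. The win rates are made up. The column names follow the f"{team_prefix}_hero_{i}_{n}_counter_pick" pattern from create_hero_features, and one matchup is deliberately left empty to show how a missing value is treated.

```python
import pandas as pd

# One toy match, Radiant side only; all win rates are made up.
row = {}
for i in range(1, 6):  # hero slot on the team
    for n in range(1, 6):  # n-th opposing hero
        row[f"radiant_hero_{i}_{n}_counter_pick"] = 0.50
row["radiant_hero_1_1_counter_pick"] = 0.70  # one favourable matchup
row["radiant_hero_5_5_counter_pick"] = float("nan")  # one matchup with no data

df = pd.DataFrame([row])
counter_cols = [c for c in df.columns if c.endswith("_counter_pick")]

# mean(axis=1) averages across columns for each row and skips NaN by default,
# so the missing matchup is left out of the average instead of counting as 0.
df["radiant_avg_counter_pick"] = df[counter_cols].mean(axis=1)
avg_counter_pick = df["radiant_avg_counter_pick"].iloc[0]
print(round(avg_counter_pick, 4))
```

The average is taken over the 24 matchups that have data (23 at 0.50 and one at 0.70), so a single missing matchup does not drag the team's score toward zero.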

Visualization of Hero Pick Data

Once we generated the dataset with hero pick and counter-pick data, we visualized feature correlations to understand which matchups might influence match outcomes. Here’s the code for visualizing the correlation of features with the match result:

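import os
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Assumption: prepare_hero_pick_data lives in structure.helpers,
# next to prepare_match_prediction_data
from structure.helpers import prepare_hero_pick_data

file_path = os.path.join("..", "dataset", "train_data", "hero_pick_dataset.csv")

df = pd.read_csv(file_path)
# Aggregate the per-hero columns into team averages and drop identifier columns,
# so that only numeric features remain for the correlation matrix
df = prepare_hero_pick_data(df)

# Compute correlation matrix
corr_matrix = df.corr()

# Visualize with a heatmap
plt.figure(figsize=(24, 20))
sns.heatmap(
    corr_matrix[["radiant_win"]].sort_values(by="radiant_win", ascending=False),
    annot=True,
    cmap="coolwarm",
)
plt.title("Correlation of Features with Radiant Win")
plt.show()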

This visualization provides insight into how each hero’s pick and counter-pick win rates correlate with match outcomes, serving as a foundation for feature selection in the model.

Here’s the output of our correlation visualization:


Figure: Correlation of hero features with the “radiant_win” outcome.

In the correlation plot, we observe the following key values associated with Radiant and Dire teams:

  1. Dire Average Counter-Pick (-0.38): This moderate negative correlation suggests that when Dire team heroes have a higher average counter-pick value against Radiant, Radiant’s chances of winning decrease. This metric indicates how effective Dire’s lineup is against Radiant’s chosen heroes, with a strong counter-pick lineup for Dire generally reducing Radiant’s likelihood of success.
  2. Radiant Average Counter-Pick (0.37): In contrast, the moderate positive correlation here implies that Radiant’s odds of winning improve when Radiant heroes are selected to counter Dire’s lineup effectively. This reinforces the importance of counter-picking within team composition, where an advantage in hero matchups can noticeably influence the match outcome.
  3. Radiant Average Winrate (-0.026) and Dire Average Winrate (-0.079): Both of these values have very low correlations with radiant_win, indicating that the average win rates of each team’s chosen heroes may have limited impact on predicting match outcomes. This implies that counter-pick effectiveness is a stronger predictor of match success than raw hero win rates.

These observations indicate that team composition strategies, particularly in counter-picking, have a meaningful impact on match outcomes. High average counter-pick values, especially in favor of Radiant, improve Radiant’s odds, while Dire’s advantage in counter-picking works against Radiant’s chances. This insight directs us to prioritize features based on counter-picking effectiveness when training predictive models.

Model Training Script: Building Our Prediction Engine

To train our model effectively, we implemented a structured approach that encompasses data preparation, feature specification, and model training. This allows our model to predict match outcomes with greater accuracy by leveraging the detailed hero pick data we’ve gathered.

Training the Model

Below is the complete training script. As written, it loads the match prediction dataset (all_data_match_predict.csv) and prepares it with prepare_match_prediction_data:

import os
import logging
import pandas as pd
from ml.model import MainML
from structure.helpers import prepare_match_prediction_data

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)

# Define the file paths
file_path = os.path.join("..", "dataset", "train_data", "all_data_match_predict.csv")
scaler_path = "../scaler.pkl"

# Load and prepare the dataset
df = pd.read_csv(file_path)
df = prepare_match_prediction_data(df, scaler_path)

# Specify the features and target column
target = "radiant_win" # Change to your target column name
features = df.columns.drop(target).tolist()

# Path to save the model
model_path = "../xgb_model.pkl" # Path where the model will be saved

# Create an instance of MainML
main_ml = MainML(df, model_path)

# Train and save the model
main_ml.train_and_save_model(features, target)

# Load the model
main_ml.load_model()

# Prepare new data for prediction (replace this with actual data)
new_data = df.tail(5).drop(columns=[target])  # The last five rows stand in for new data
prediction = main_ml.predict(new_data)

print(f"Prediction for new data: {prediction}")

Explanation of the Script:

  1. Importing Libraries: The script starts by importing necessary libraries, including os, logging, and pandas, as well as specific classes for model training and data preparation.
  2. Logging Configuration: The logging is set up to capture and display messages with timestamps, which helps in tracking the training process.
  3. File Paths: The paths to the dataset and the saved scaler are defined up front so the script knows where to load data from. The path where the model will be saved is set just before the MainML instance is created.
  4. Data Preparation: The dataset is loaded using pandas and prepared through a helper function, prepare_match_prediction_data, which can include scaling and other preprocessing steps.
  5. Feature Specification: The target variable, which indicates whether the radiant team wins (radiant_win), is defined, and the feature set is prepared by dropping the target from the DataFrame.
  6. Model Instantiation: An instance of the MainML class is created, responsible for handling the training and saving of the model.
  7. Training the Model: The model is trained and saved to the specified path using the train_and_save_model method.
  8. Model Loading and Prediction: Finally, the model is loaded back, and a prediction is made using the last five rows of the dataset, which are treated as new data for testing.

Model Performance Improvement

Following the feature engineering process, we observed a significant improvement in our model’s accuracy. Previously, our model achieved a 60% accuracy rate. After implementing the enhancements, our XGBoost model reported the following performance metrics:

2024-10-31 10:35:42,311 - ml.model - INFO - XGBoost Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       259
           1       0.99      1.00      0.99       252

    accuracy                           0.99       511
   macro avg       0.99      0.99      0.99       511
weighted avg       0.99      0.99      0.99       511

2024-10-31 10:35:42,312 - ml.model - INFO - XGBoost Confusion Matrix:
[[256   3]
 [  1 251]]

Summary of Results

Classification Report

The classification report indicates the following:

  1. Class 0 (Dire Wins):
  • Precision: 1.00 (256 of 257, or 0.996 before rounding). Nearly every match predicted as a Dire win was one; a single Radiant win was mislabeled as Dire.
  • Recall: 0.99. The model identified 256 of the 259 actual Dire wins.
  • F1-score: 0.99. Precision and recall are well balanced.
  • Support: 259. There were 259 actual Dire wins.

  2. Class 1 (Radiant Wins):
  • Precision: 0.99. Of the 254 matches predicted as Radiant wins, 251 were correct.
  • Recall: 1.00 (251 of 252, or 0.996 before rounding). The model missed one actual Radiant win.
  • F1-score: 0.99. Precision and recall are well balanced.
  • Support: 252. There were 252 actual Radiant wins.

Overall accuracy is 99%: the model correctly classified 507 of the 511 matches.

Confusion Matrix

With Radiant wins (class 1) as the positive class, rows as actual outcomes, and columns as predictions:

  • True Negatives (TN): 256. Dire wins correctly predicted as Dire wins.
  • False Positives (FP): 3. Dire wins incorrectly predicted as Radiant wins.
  • False Negatives (FN): 1. A Radiant win incorrectly predicted as a Dire win.
  • True Positives (TP): 251. Radiant wins correctly predicted as Radiant wins.
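
As a cross-check, the figures in the classification report can be recomputed from the confusion matrix alone. The sketch below assumes the scikit-learn convention (rows are actual classes, columns are predicted classes), which is the reading consistent with the support counts of 259 and 252. The helper name `per_class_metrics` is my own and is not part of the project.

```python
# Confusion matrix from the log above: rows = actual class, columns = predicted class
# (assumed scikit-learn convention). Class 0 = Dire win, class 1 = Radiant win.
confusion = [[256, 3],
             [1, 251]]


def per_class_metrics(matrix, cls):
    """Return (precision, recall, f1) for one class of a square confusion matrix."""
    true_pos = matrix[cls][cls]
    predicted = sum(row[cls] for row in matrix)  # column total: everything predicted as cls
    actual = sum(matrix[cls])                    # row total: everything that really is cls
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / total

for cls, label in enumerate(["Dire", "Radiant"]):
    precision, recall, f1 = per_class_metrics(confusion, cls)
    print(f"{label}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"accuracy={accuracy:.3f} over {total} matches")
```

Rounded to two decimals, these values match the report. The 1.00 entries are 0.996 before rounding, so each reflects one misclassified match rather than a perfect score.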

Conclusion

Overall, the model reaches 99% accuracy on these 511 matches, with only four misclassifications and balanced precision and recall across both classes. This suggests it can be a reliable tool for predicting match outcomes based on hero picks and team compositions.

Telegram Match Predictions

In our latest update, users can now select matches they are interested in and choose whether they want a prediction on the overall match winner or a detailed analysis of hero matchups. This interactive feature enhances user engagement and allows for a more tailored experience.

Below are the relevant functions from our implementation:

    def gen_match_markup_by_id(self, call):
        logger.info(f"Generating match markup by ID for call: {call}")
        dota_api = Dota2API(steam_api_key)
        self.markup = dota_api.get_match_as_buttons(self.markup)
        return self.markup

    def gen_hero_match_markup_by_id(self, call):
        logger.info(f"Generating hero match markup by ID for call: {call}")
        dota_api = Dota2API(steam_api_key)
        self.markup = dota_api.get_hero_match_as_buttons(self.markup)
        return self.markup

    def make_prediction_for_selected_match(self, call, match_id):
        logger.info(f"Making prediction for selected match ID: {match_id}")
        self.bot.send_message(
            chat_id=call.message.chat.id,
            text="Task started. This may take around 5 minutes. Please wait...",
        )
        dota_api = Dota2API(steam_api_key)
        match = dota_api.build_single_match(match_id=match_id)
        message = (
            f"<b>Match ID:</b> {match.match_id}\n"
            f"<b>Dire Team {Icons.direIcon}:</b> {match.dire_team.team_name} (ID: {match.dire_team.team_id})\n"
            "<b>Players:</b>\n"
        )

        # List Dire team players
        for player in match.dire_team.players:
            message += (
                f" - {player.name} {Icons.playerIcon}(Hero: {player.hero.name})\n"
            )

        message += (
            f"\n<b>Radiant Team {Icons.radiantIcon}:</b> {match.radiant_team.team_name} (ID: {match.radiant_team.team_id})\n"
            "<b>Players:</b>\n"
        )

        # List Radiant team players
        for player in match.radiant_team.players:
            message += (
                f" - {player.name} {Icons.playerIcon}(Hero: {player.hero.name})\n"
            )

        # Prepare match data for prediction
        df, top_features = match.get_match_data_for_prediction()
        main_ml = MainML(None, "xgb_model.pkl")
        main_ml.load_model()
        prediction, probabilities = main_ml.predict(df)
        message += f"\n<b>Prediction:</b> {'Radiant Wins' if prediction[0] == 1 else 'Dire Wins'}\n"
        radiant_prob = probabilities[0][1]  # Assuming class 1 is Radiant
        dire_prob = probabilities[0][0]  # Assuming class 0 is Dire
        message += f"<b>Probabilities:</b> Radiant: {radiant_prob:.2%}, Dire: {dire_prob:.2%}\n"
        message += "<b>----------------------------------------</b>\n"  # Separator line in bold

        # Log the message text
        logger.info(f"Sending message to chat {call.message.chat.id}: {message}")
        self.bot.send_message(
            chat_id=call.message.chat.id, text=message, parse_mode="HTML"
        )
        logger.info(f"Prediction for match ID {match_id} sent successfully.")

    def make_hero_pick_prediction_for_selected_match(self, call, match_id):
        logger.info(f"Making hero pick prediction for match ID: {match_id}")
        self.bot.send_message(
            chat_id=call.message.chat.id,
            text="Task started. This may take around 5 minutes. Please wait...",
        )
        dota_api = Dota2API(steam_api_key)
        match = dota_api.build_single_match(match_id=match_id)
        message = (
            f"<b>Match ID:</b> {match.match_id}\n"
            f"<b>Dire Team {Icons.direIcon}:</b> {match.dire_team.team_name} (ID: {match.dire_team.team_id})\n"
            "<b>Players:</b>\n"
        )

        # List Dire team players
        for player in match.dire_team.players:
            message += (
                f" - {player.name} {Icons.playerIcon}(Hero: {player.hero.name})\n"
            )

        message += (
            f"\n<b>Radiant Team {Icons.radiantIcon}:</b> {match.radiant_team.team_name} (ID: {match.radiant_team.team_id})\n"
            "<b>Players:</b>\n"
        )

        # List Radiant team players
        for player in match.radiant_team.players:
            message += (
                f" - {player.name} {Icons.playerIcon}(Hero: {player.hero.name})\n"
            )

        # Prepare match data for prediction
        df, top_features = match.get_hero_match_data_for_prediction()
        hero_pick_ml = MainML(None, "xgb_model_hero_pick.pkl")
        hero_pick_ml.load_model()
        prediction, _ = hero_pick_ml.predict(df)
        message += f"\n<b>Prediction:</b> {'Radiant pick is stronger' if prediction[0] == 1 else 'Dire pick is stronger'}\n"
        message += "<b>----------------------------------------</b>\n"  # Separator line in bold

        # Log the message text
        logger.info(f"Sending message to chat {call.message.chat.id}: {message}")
        self.bot.send_message(
            chat_id=call.message.chat.id, text=message, parse_mode="HTML"
        )
        logger.info(f"Hero pick prediction for match ID {match_id} sent successfully.")
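
Both prediction handlers build the same roster text before appending their own prediction line. One way to remove that duplication is a pair of small formatting helpers. This is a minimal sketch rather than the project's code: the `Player` and `Team` dataclasses are hypothetical stand-ins for the real match objects, and the icon strings are placeholders for the values in `Icons`.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Player:  # hypothetical stand-in for the project's player object
    name: str
    hero_name: str


@dataclass
class Team:  # hypothetical stand-in for the project's team object
    team_name: str
    team_id: int
    players: List[Player]


def format_team(side: str, icon: str, team: Team, player_icon: str = "*") -> str:
    """Build the HTML block for one team: a bold header followed by its players."""
    lines = [
        f"<b>{side} Team {icon}:</b> {team.team_name} (ID: {team.team_id})",
        "<b>Players:</b>",
    ]
    for player in team.players:
        lines.append(f" - {player.name} {player_icon}(Hero: {player.hero_name})")
    return "\n".join(lines) + "\n"


def format_probabilities(probabilities: Sequence[Sequence[float]]) -> str:
    """Format class probabilities, assuming index 0 is Dire and index 1 is Radiant."""
    dire_prob, radiant_prob = probabilities[0][0], probabilities[0][1]
    return f"<b>Probabilities:</b> Radiant: {radiant_prob:.2%}, Dire: {dire_prob:.2%}\n"


# Example with made-up data
dire = Team("Team A", 1, [Player("alice", "Axe")])
print(format_team("Dire", "[D]", dire))
print(format_probabilities([[0.25, 0.75]]))
```

Since player and team names come from external data and the message is sent with `parse_mode="HTML"`, it may also be worth passing them through `html.escape` before formatting.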

User Interaction

When users select a match, they will see options for:

  1. Match Winner Prediction: Get insights into which team is predicted to win based on the current matchup.
  2. Hero Matchup Analysis: See which team's hero picks the model rates as stronger, which can guide strategic decisions.
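
How the chosen option reaches the right handler is not shown in this article. A common pattern with Telegram inline buttons is to encode the action and the match ID in the button's callback data and dispatch on it. The sketch below assumes a made-up `action:match_id` format and uses plain callables in place of the bot methods; none of these names come from the project.

```python
from typing import Callable, Dict, Tuple


def parse_callback_data(data: str) -> Tuple[str, int]:
    """Split hypothetical callback data such as 'winner:7890' into (action, match_id)."""
    action, _, raw_id = data.partition(":")
    if not action or not raw_id.isdigit():
        raise ValueError(f"Unexpected callback data: {data!r}")
    return action, int(raw_id)


def dispatch(data: str, handlers: Dict[str, Callable[[int], str]]) -> str:
    """Route callback data to the handler registered for its action."""
    action, match_id = parse_callback_data(data)
    try:
        handler = handlers[action]
    except KeyError:
        raise ValueError(f"No handler for action {action!r}") from None
    return handler(match_id)


# Stand-ins for make_prediction_for_selected_match and
# make_hero_pick_prediction_for_selected_match.
handlers = {
    "winner": lambda match_id: f"winner prediction for {match_id}",
    "heroes": lambda match_id: f"hero pick prediction for {match_id}",
}

print(dispatch("winner:7890", handlers))
print(dispatch("heroes:7890", handlers))
```

Keeping the payload this short matters in practice, because Telegram limits callback data to 64 bytes.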

Screenshot Examples:

Screenshot 1: Here you can select the match you are interested in.
Screenshot 2: Example of match outcome prediction.
Screenshot 3: Example of hero strength prediction.

Good to Note: Logging and Test Coverage

To enhance the usability and reliability of our Dota 2 predictive model, we incorporated detailed logging and increased our test coverage.

  • Logging: We implemented a logging system that captures key events and errors throughout the dataset generation, data preparation, and model training processes. This allows us to easily track the model’s behavior and swiftly diagnose issues as they arise.
  • Test Coverage: By increasing our test coverage, particularly for critical functions, we ensure that our code behaves as expected. Comprehensive tests help identify potential bugs early, promoting a robust and maintainable codebase.

These improvements not only facilitate debugging but also strengthen our model’s overall reliability and adaptability for future enhancements.
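
The log lines shown earlier (timestamp, logger name, level, message) are consistent with a format string like the one below. This is a sketch of such a configuration, not the project's actual setup: the logger name `ml.model` is taken from the log output, while `LOG_FORMAT` and `configure_logging` are names assumed for illustration.

```python
import logging

# Produces lines such as: 2024-10-31 10:35:42,311 - ml.model - INFO - <message>
LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"


def configure_logging(level: int = logging.INFO) -> None:
    """Configure root logging with a timestamped format.

    force=True (Python 3.8+) replaces any handlers that were installed earlier,
    so calling this twice does not duplicate output.
    """
    logging.basicConfig(level=level, format=LOG_FORMAT, force=True)


configure_logging()
logger = logging.getLogger("ml.model")
logger.info("XGBoost training started")
```

Using one named logger per module, as the `ml.model` name suggests, makes it easy to see which stage of the pipeline a message came from.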

Conclusion

In this article, we explored hero pick analysis in Dota 2 and extended our model with counter-pick data and statistical features. With these features in place, reported accuracy rose from 60% to 99%, which underscores the effectiveness of our feature engineering efforts.

The integration of detailed logging and improved test coverage not only ensures the reliability of our processes but also sets a solid foundation for future developments.

I would like to extend my heartfelt thanks to the community on Reddit for their invaluable support and motivation throughout this journey. Your insights and encouragement have been instrumental in pushing this project forward.

For those interested in exploring the complete code and implementation details, you can find the project on my GitHub repository.

Cliffhanger: A New Frontier

As we continue to refine our predictive model, there are more advanced techniques left to explore. In the next part of this series, we will investigate the historical performance of heroes and teams, using incremental learning to adapt the model as new matches arrive. Will these additions push the model's predictive capabilities further? Join us as we continue this journey into advanced machine learning techniques in Dota 2.