“I am not a genius, I don’t know the future. But I like the idea of smart people using technology and data to help us improve. We use analytics, and it has been very helpful.” — Jürgen Klopp.
As avid gamers, we keep a close eye on the unpredictable nature of multiplayer online battle arena (MOBA) games. Dota 2 is one of the most complex MOBAs out there. There are many variables (heroes, items, player skill, and strategy) that make predicting the outcome of a tournament a difficult task. As someone who loves both programming and Dota 2, I started this project as a gift to my big, sports-loving brother, and it soon turned into an opportunity to learn data science, machine learning, and esports development.
In this article, I will take you through the journey of creating a Dota 2 predictor, from data collection and processing to training machine learning models and ultimately predicting outcomes. Whether you're a Dota 2 fan, a programmer, or just curious about the process, I hope this post gives you an understanding of how the system is built and the challenges involved.
Why I Created the Dota 2 Predictor
As a Dota 2 fan, I often find myself wondering why teams win or lose. Are there winning hero combinations? Do players' individual skills matter? What do they buy? Inspired by these questions and driven by my brother's enthusiasm for the game, I set out to answer them using data and machine learning.
Dataset Structure and Forecasting System Design
With a clear understanding of the classes we need — heroes, players, teams, matches, tournaments — it’s time to dive deeper into how they are processed and used to predict match results. Here’s how each category works and how to turn this data into key features for our prediction models.
Hero Class
The Hero class stores details about a specific hero, including win rates in professional games. The get_hero_features() method retrieves information such as the hero's name, pick count, and wins. This information is important because a hero's performance is a key factor in a match's outcome.
class Hero:
    def __init__(self, hero_id):
        self.hero_id = hero_id
        self.features = self.get_hero_features()
        self.name = self.features["name"] if self.features else "Unknown Hero"
        if self.features and self.features["pro_pick"] > 0:
            self.winrate = self.features["pro_win"] / self.features["pro_pick"]
        else:
            self.winrate = 0

    def get_hero_features(self):
        url = f"https://api.opendota.com/api/heroStats?api_key={opendota_key}"
        response = requests.get(url)
        if response.status_code == 200:
            heroes = response.json()
            for hero in heroes:
                if hero["id"] == self.hero_id:
                    return {
                        "hero_id": hero["id"],
                        "name": hero["localized_name"],
                        "pro_win": hero.get("pro_win", 0),
                        "pro_pick": hero.get("pro_pick", 0),
                    }
        else:
            print(f"Error fetching data: {response.status_code}")
        return None

    def __repr__(self):
        return f"Hero(ID: {self.hero_id}, Name: {self.name}, Features: {self.features})"
Player Class
The Player class represents an individual player and records statistics such as kills, deaths, assists, gold per minute, and experience per minute. These statistics are retrieved using the get_player_total_data() and get_player_data() methods, both of which rely on the OpenDota API.
class Player:
    def __init__(self, account_id, name, hero_id, team):
        self.account_id = account_id
        self.name = name
        self.team = team
        self.hero = Hero(hero_id)
        self.player_data = self.get_player_data()
        player_data = self.get_player_total_data()
        kills = find_dict_in_list(player_data, "field", "kills")
        self.kills = kills["sum"] / kills["n"] if kills["n"] > 0 else 0
        deaths = find_dict_in_list(player_data, "field", "deaths")
        self.deaths = deaths["sum"] / deaths["n"] if deaths["n"] > 0 else 0
        assists = find_dict_in_list(player_data, "field", "assists")
        self.assists = assists["sum"] / assists["n"] if assists["n"] > 0 else 0
        gold_per_min = find_dict_in_list(player_data, "field", "gold_per_min")
        self.gold_per_min = (
            gold_per_min["sum"] / gold_per_min["n"] if gold_per_min["n"] > 0 else 0
        )
        xp_per_min = find_dict_in_list(player_data, "field", "xp_per_min")
        self.xp_per_min = (
            xp_per_min["sum"] / xp_per_min["n"] if xp_per_min["n"] > 0 else 0
        )
        last_hits = find_dict_in_list(player_data, "field", "last_hits")
        self.last_hits = last_hits["sum"] / last_hits["n"] if last_hits["n"] > 0 else 0
        denies = find_dict_in_list(player_data, "field", "denies")
        self.denies = denies["sum"] / denies["n"] if denies["n"] > 0 else 0

    def get_player_total_data(self):
        """Fetch player total data with indefinite retries until success."""
        url = f"https://api.opendota.com/api/players/{self.account_id}/totals?api_key={opendota_key}&hero_id={self.hero.hero_id}&limit=30"
        while True:  # Retry loop
            try:
                response = requests.get(url)
                if response.status_code == 200:
                    return response.json()  # Successful response, exit loop
                else:
                    print(
                        f"Error fetching player data: {response.status_code}. Retrying..."
                    )
            except requests.exceptions.RequestException as e:
                print(f"Request failed: {e}. Retrying...")
            sleep(2)

    def get_player_data(self):
        # Fetch general win/loss data
        url = f"https://api.opendota.com/api/players/{self.account_id}/wl?api_key={opendota_key}"
        response = requests.get(url)
        if response.status_code == 200:
            data = response.json()
            player_stats = {
                "win_rate": (
                    data.get("win") / (data.get("win") + data.get("lose"))
                    if (data.get("win") + data.get("lose")) > 0
                    else 0
                ),
            }
            hero_url = f"https://api.opendota.com/api/players/{self.account_id}/heroes?api_key={opendota_key}&limit=30"
            hero_response = requests.get(hero_url)
            if hero_response.status_code == 200:
                hero_data = hero_response.json()
                for hero in hero_data:
                    if hero["hero_id"] == self.hero.hero_id:
                        # Calculate the hero's win rate
                        if hero["games"] > 0:
                            self.hero_win_rate = hero["win"] / hero["games"]
                        else:
                            self.hero_win_rate = 0
                        break
            else:
                print(f"Error fetching hero data: {hero_response.status_code}")
            return player_stats
        else:
            print(f"Error fetching player data: {response.status_code}")
            return None

    def __repr__(self):
        return f"Player({self.name}, Hero : {self.hero.name}, Team: {self.team}, Data: {self.player_data})"
In addition to general win/loss information, the Player class tracks the player's performance on the specific hero they are playing. This combination of player skill, hero selection, and past performance forms the basis of our prediction system.
Team Class
The Team class organizes players into their teams. Each Team instance can add a Player object using the add_player() method. This structure is important for simulating real competitive situations, where coordination between team members and hero choices can influence the outcome of a match.
class Team:
    def __init__(self, team_name: str, team_id: int):
        self.team_name = team_name
        self.team_id = team_id
        self.players = []

    def add_player(self, player):
        self.players.append(player)

    def __repr__(self):
        return f"Team({self.team_name}, ID: {self.team_id}, Players: {self.players})"
Match Class
The Match class stores all relevant statistics for the match, including both teams, their players, and the final match result. The get_match_data() method fills in this information by pulling match details from the OpenDota API, allowing us to break down each player’s impact on the match.
class Match:
    def __init__(
        self,
        match_id: int,
        radiant_team_id: int,
        dire_team_id: int,
        league_id: int,
        radiant_win=None,
    ):
        self.match_id = match_id
        self.radiant_team_id = radiant_team_id
        self.dire_team_id = dire_team_id
        self.radiant_team = None
        self.dire_team = None
        self.league_id = league_id
        self.radiant_win = radiant_win

    def get_match_data(self):
        url = f"https://api.opendota.com/api/matches/{self.match_id}?api_key={opendota_key}"
        response = requests.get(url)
        if response.status_code == 200:
            match_info = response.json()
            radiant_team = Team(
                match_info["radiant_name"], match_info["radiant_team_id"]
            )
            dire_team = Team(match_info["dire_name"], match_info["dire_team_id"])
            self.radiant_win = match_info["radiant_win"]
            for player in match_info["players"]:
                if player["isRadiant"]:
                    player = Player(
                        player["account_id"],
                        player["name"],
                        player["hero_id"],
                        radiant_team.team_name,
                    )
                    radiant_team.add_player(player)
                else:
                    player = Player(
                        player["account_id"],
                        player["name"],
                        player["hero_id"],
                        dire_team.team_name,
                    )
                    dire_team.add_player(player)
            self.radiant_team = radiant_team
            self.dire_team = dire_team

    def get_match_data_for_prediction(self):
        if len(self.radiant_team.players) == 5 and len(self.dire_team.players) == 5:
            # Create a single row with match and player data
            match_data = {
                "match_id": self.match_id,
                "radiant_team_id": self.radiant_team.team_id,
                "radiant_team_name": self.radiant_team.team_name,
                "dire_team_id": self.dire_team.team_id,
                "dire_team_name": self.dire_team.team_name,
            }
            # Add radiant team player data (5 players)
            for i, player in enumerate(self.radiant_team.players):
                match_data[f"radiant_player_{i + 1}_id"] = player.account_id
                match_data[f"radiant_player_{i + 1}_name"] = player.name
                match_data[f"radiant_player_{i + 1}_hero_id"] = player.hero.hero_id
                match_data[f"radiant_player_{i + 1}_hero_name"] = player.hero.name
                match_data[f"radiant_player_{i + 1}_hero_winrate"] = player.hero.winrate
                match_data[f"radiant_player_{i + 1}_winrate"] = player.player_data[
                    "win_rate"
                ]
                match_data[f"radiant_player_{i + 1}_kills"] = player.kills
                match_data[f"radiant_player_{i + 1}_deaths"] = player.deaths
                match_data[f"radiant_player_{i + 1}_assists"] = player.assists
                match_data[f"radiant_player_{i + 1}_gold_per_min"] = player.gold_per_min
                match_data[f"radiant_player_{i + 1}_xp_per_min"] = player.xp_per_min
            # Add dire team player data (5 players)
            for i, player in enumerate(self.dire_team.players):
                match_data[f"dire_player_{i + 1}_id"] = player.account_id
                match_data[f"dire_player_{i + 1}_name"] = player.name
                match_data[f"dire_player_{i + 1}_hero_id"] = player.hero.hero_id
                match_data[f"dire_player_{i + 1}_hero_name"] = player.hero.name
                match_data[f"dire_player_{i + 1}_hero_winrate"] = player.hero.winrate
                match_data[f"dire_player_{i + 1}_winrate"] = player.player_data[
                    "win_rate"
                ]
                match_data[f"dire_player_{i + 1}_kills"] = player.kills
                match_data[f"dire_player_{i + 1}_deaths"] = player.deaths
                match_data[f"dire_player_{i + 1}_assists"] = player.assists
                match_data[f"dire_player_{i + 1}_gold_per_min"] = player.gold_per_min
                match_data[f"dire_player_{i + 1}_xp_per_min"] = player.xp_per_min
            df = pd.DataFrame([match_data])
            df = prepare_data(df)
            top_features = df.columns.tolist()
            return df, top_features

    def __repr__(self):
        # Prepare the Radiant team players
        radiant_players = "\n".join(
            [
                f" Player: {player.name} (Hero : {player.hero.name})"
                for player in self.radiant_team.players
            ]
        )
        # Prepare the Dire team players
        dire_players = "\n".join(
            [
                f" Player: {player.name} (Hero : {player.hero.name})"
                for player in self.dire_team.players
            ]
        )
        # Format the result
        return (
            f"Match ID: {self.match_id}\n"
            f"League ID: {self.league_id}\n"
            f"Radiant Team: {self.radiant_team.team_name}\n"
            f"Radiant Players:\n{radiant_players}\n"
            f"Dire Team: {self.dire_team.team_name}\n"
            f"Dire Players:\n{dire_players}\n"
            f"Radiant Win: {'Yes' if self.radiant_win else 'No'}"
        )
The get_match_data_for_prediction() method collects data from both teams and formats it to feed into the prediction model. This ensures an appropriate dataset structure for machine learning by grouping related statistics.
Tournament Class
Finally, the Tournament class collects all matches for the tournament and stores them in a list. This is especially useful for analyzing large groups of matches and isolating broad trends.
class Tournament:
    def __init__(self, league_id: int, name: str):
        self.league_id = league_id
        self.name = name
        self.matches = []

    def add_match(self, match):
        self.matches.append(match)

    def get_league_matches(self):
        url = f"https://api.opendota.com/api/leagues/{self.league_id}/matches?api_key={opendota_key}"
        response = requests.get(url)
        if response.status_code == 200:
            for match_info in response.json():
                match_id = match_info["match_id"]
                radiant_team_id = match_info["radiant_team_id"]
                dire_team_id = match_info["dire_team_id"]
                radiant_win = match_info["radiant_win"]
                match = Match(
                    match_id, radiant_team_id, dire_team_id, self.league_id, radiant_win
                )
                match.get_match_data()
                self.add_match(match)
        else:
            print(
                f"Error fetching matches for league {self.league_id}: {response.status_code}"
            )

    def __repr__(self):
        return f"Tournament({self.name}, ID: {self.league_id})"
Dataset Generation for Premium Dota 2 Leagues
The predictive ability of a machine learning model depends heavily on the quality of the data it processes. For Dota 2, creating a comprehensive dataset is essential. This section describes how we extract historical and match-level data, focusing on premium tournaments, using the OpenDota API. This dataset is used to train a model that predicts tournament results based on a combination of historical performance and live match data.
1. Historical Data for Players
One of the most important sources of information is player profile data, which contains metrics that reflect overall performance in previous games. It provides insights into the following:
- Player consistency: Players who maintain a high kill, assist, or GPM (gold per minute) average are more effective and reliable.
- Hero-specific performance: Statistics such as the win rate of players with a specific hero tell us how proficient they are with a particular character.
For example, using OpenDota’s /totals or /heroes endpoints, you can collect:
- Kills per game: The average number of kills a player has earned in all or most recent games.
- Hero-specific win rate: How often players win when using specific heroes.
- GPM and XPM (Experience Per Minute): Measures how efficiently players farm gold and gain experience over time.
These statistics help predict how a player is likely to perform in future matches. In essence, historical data reveals patterns in players' general skill levels, tendencies, and hero-specific strengths.
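To make that concrete, here is a minimal sketch of turning /totals-shaped records into per-game averages. The averaged_features helper and the sample records are illustrative assumptions, not project code; the real endpoint returns many more fields per record.

```python
def averaged_features(totals, fields):
    """Convert OpenDota /totals-style records into per-game averages.

    Each record is assumed to carry a "field" name, a games count "n",
    and a running "sum" across those games.
    """
    by_field = {rec["field"]: rec for rec in totals}
    features = {}
    for field in fields:
        rec = by_field.get(field)
        # Guard against missing fields and zero games played
        if rec and rec["n"] > 0:
            features[f"avg_{field}"] = rec["sum"] / rec["n"]
        else:
            features[f"avg_{field}"] = 0
    return features

# Made-up records shaped like the /totals response
sample = [
    {"field": "kills", "n": 30, "sum": 240},
    {"field": "gold_per_min", "n": 30, "sum": 15600},
]
print(averaged_features(sample, ["kills", "gold_per_min", "denies"]))
# {'avg_kills': 8.0, 'avg_gold_per_min': 520.0, 'avg_denies': 0}
```

The zero fallback mirrors the guards in the Player class, so a player with no recorded games for a field simply contributes 0 rather than crashing the pipeline.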
2. Using Historical Data for Predictions
Historical data is not tied to a specific competition, so it serves as a comprehensive performance indicator. Here's how these measures contribute to the model:
- Player performance features: Metrics like kills, deaths, assists, GPM, and XPM are used as predictors of a player’s in-game behavior.
- Averaged historical data: For example, a player’s average kills, win rate, and GPM in the last N games provide a more stable estimate of their overall potential.
Example historical features for players might include:
- average_kills: Average player kills in the last N games.
- win_rate: Player’s win rate in the last N games.
- hero_win_rate: Player’s win rate with their specific hero in the last N games.
- gold_per_min: The player’s average GPM across their last N games.
These features allow the model to estimate the likely impact of each player in a given match based on their past performance.
3. Hero Features
In addition to player stats, hero-specific data also plays a critical role. These stats don’t refer to the ongoing game but are averages from all matches where the hero was picked. This helps assess the strength of heroes in the current meta:
- Pro Pick Rate: How frequently the hero is picked in professional matches.
- Pro Win Rate: The hero’s win rate across professional games.
These hero features give the model an understanding of the broader effectiveness of a given hero in professional play, which is crucial when making predictions.
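As an illustrative sketch (not project code), both rates can be derived from heroStats-shaped records. The sample entries are made up, and here "pick rate" is computed as a hero's share of all professional picks, which is one reasonable reading of pick frequency:

```python
def hero_pro_rates(hero_stats):
    """Compute pro pick share and win rate from heroStats-like records."""
    total_picks = sum(h.get("pro_pick", 0) for h in hero_stats)
    rates = {}
    for h in hero_stats:
        picks = h.get("pro_pick", 0)
        rates[h["localized_name"]] = {
            # Share of all professional picks that went to this hero
            "pick_rate": picks / total_picks if total_picks else 0,
            # Fraction of the hero's professional games that were won
            "win_rate": h.get("pro_win", 0) / picks if picks else 0,
        }
    return rates

# Made-up records shaped like /heroStats entries
sample_heroes = [
    {"localized_name": "Pudge", "pro_pick": 40, "pro_win": 22},
    {"localized_name": "Io", "pro_pick": 60, "pro_win": 27},
]
rates = hero_pro_rates(sample_heroes)
print(rates["Pudge"])  # {'pick_rate': 0.4, 'win_rate': 0.55}
```

The win-rate computation is the same pro_win / pro_pick ratio the Hero class uses for its winrate attribute.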
4. Combining Historical Data with Live Match Data
For the match outcome prediction model, both historical data and live match data are essential. The idea is to combine:
- Historical player data: Information like kills, assists, win rates, and GPM that indicate how players have performed in the past.
- Hero performance data: Stats like hero win rates and pick rates that represent a hero’s overall strength in professional matches.
This mix of historical and live data forms a comprehensive input set for machine learning models. Even though historical data doesn’t reflect the events of a specific match, it serves as a strong indicator of player and hero potential, which can be used to predict outcomes.
5. Extracting Match Data with OpenDota API
With this understanding of historical and hero data, the next step is extracting match-level data from premium Dota 2 leagues. Using OpenDota’s API, we gather detailed match information from prestigious tournaments like The International and ESL One.
Here’s a Python script that demonstrates how to generate a dataset from these leagues, incorporating both match-level and player-level data:
import pandas as pd

from structure.struct import Tournament
from structure.opendota import OpenDotaApi

def generate_dataset():
    api = OpenDotaApi()
    dataset = []
    premium_leagues = api.set_premium_leagues()
    last_big_leagues = [
        "ESL One Kuala Lumpur powered by Intel",
        "BetBoom Dacha Dubai 2024",
        "DreamLeague Season 22 powered by Intel",
        "Elite League Season 2 Main Event – presented by ESB",
        "ESL One Birmingham 2024 Powered by Intel",
        "DreamLeague Season 23 powered by Intel",
        "Riyadh Masters 2024 at Esports World Cup",
        "Clavision DOTA League S1 : Snow-Ruyi",
        "The International 2024",
        "PGL Wallachia 2024 Season 1",
    ]
    for premium_league in premium_leagues:
        league_id = premium_league["leagueid"]
        league_name = premium_league["name"]
        if league_name in last_big_leagues:
            tournament = Tournament(league_id=league_id, name=league_name)
            tournament.get_league_matches()  # Load matches for the tournament
            # Extract data from each match in the tournament
            for match in tournament.matches:
                radiant_team = match.radiant_team
                dire_team = match.dire_team
                # Ensure we have 5 players on each team
                if len(radiant_team.players) == 5 and len(dire_team.players) == 5:
                    # Create a single row with match and player data
                    match_data = {
                        "match_id": match.match_id,
                        "radiant_team_id": radiant_team.team_id,
                        "radiant_team_name": radiant_team.team_name,
                        "dire_team_id": dire_team.team_id,
                        "dire_team_name": dire_team.team_name,
                        "radiant_win": match.radiant_win,  # True/False if Radiant team won
                    }
                    # Add radiant team player data (5 players)
                    for i, player in enumerate(radiant_team.players):
                        match_data[f"radiant_player_{i + 1}_id"] = player.account_id
                        match_data[f"radiant_player_{i + 1}_name"] = player.name
                        match_data[f"radiant_player_{i + 1}_hero_id"] = player.hero.hero_id
                        match_data[f"radiant_player_{i + 1}_hero_name"] = player.hero.name
                        match_data[f"radiant_player_{i + 1}_hero_winrate"] = player.hero.winrate
                        match_data[f"radiant_player_{i + 1}_winrate"] = player.player_data["win_rate"]
                        match_data[f"radiant_player_{i + 1}_kills"] = player.kills
                        match_data[f"radiant_player_{i + 1}_deaths"] = player.deaths
                        match_data[f"radiant_player_{i + 1}_assists"] = player.assists
                        match_data[f"radiant_player_{i + 1}_gold_per_min"] = player.gold_per_min
                        match_data[f"radiant_player_{i + 1}_xp_per_min"] = player.xp_per_min
                    # Add dire team player data (5 players)
                    for i, player in enumerate(dire_team.players):
                        match_data[f"dire_player_{i + 1}_id"] = player.account_id
                        match_data[f"dire_player_{i + 1}_name"] = player.name
                        match_data[f"dire_player_{i + 1}_hero_id"] = player.hero.hero_id
                        match_data[f"dire_player_{i + 1}_hero_name"] = player.hero.name
                        match_data[f"dire_player_{i + 1}_hero_winrate"] = player.hero.winrate
                        match_data[f"dire_player_{i + 1}_winrate"] = player.player_data["win_rate"]
                        match_data[f"dire_player_{i + 1}_kills"] = player.kills
                        match_data[f"dire_player_{i + 1}_deaths"] = player.deaths
                        match_data[f"dire_player_{i + 1}_assists"] = player.assists
                        match_data[f"dire_player_{i + 1}_xp_per_min"] = player.xp_per_min
                        match_data[f"dire_player_{i + 1}_gold_per_min"] = player.gold_per_min
                    print(match_data)
                    # Append match data to dataset
                    dataset.append(match_data)
    df = pd.DataFrame(dataset)
    # Write DataFrame to a CSV file
    df.to_csv("premium_league_matches.csv", index=False)
    print("Match dataset has been generated and saved to 'premium_league_matches.csv'.")

generate_dataset()
6. Structuring and Saving the Dataset
The final dataset is saved in a CSV file. Each row represents a single match, capturing both match-level and player-level data, including historical player stats and hero performance. This dataset provides the input for our machine learning models to predict match outcomes.
Data Preparation
Data preparation is one of the most critical stages in the machine learning workflow, as it significantly impacts the performance and accuracy of predictive models. In this section, we outline how raw match data was transformed into a structured dataset ready for machine learning analysis.
1. Feature Engineering
Feature engineering involves creating new features from existing data to improve model insights. For this project, we developed team-based features that summarize key performance metrics for both Radiant and Dire teams:
- Average Hero Win Rate: We calculated the average win rate of the heroes selected by the players on each team. This provides insight into how historically successful each team’s hero lineup is.
- Average Player Win Rate: Similar to hero win rates, this feature represents the average win rate of the players on the team, offering a view of the team’s player strength.
- Total Kills, Deaths, and Assists: Summing the kills, deaths, and assists for all players on each team gives us a picture of overall team performance.
- Average Gold per Minute (GPM) and Experience per Minute (XPM): Calculating the average GPM and XPM for each team provides a measure of resource accumulation, which is crucial in understanding how well a team farms and levels up.
These team-based features were generated using the calculate_team_features function, which processes a DataFrame and calculates new metrics based on a given team prefix (either "radiant" or "dire"). Once the new features were created, the original columns used to derive them were dropped to avoid redundancy.
Here is the implementation of the calculate_team_features function:
def calculate_team_features(df, team_prefix):
    """
    Function to calculate team-based features for a given prefix (radiant or dire).
    """
    # Team Hero Win Rate: Average win rate of the heroes for the team
    hero_winrate_cols = [f"{team_prefix}_player_{i}_hero_winrate" for i in range(1, 6)]
    df[f"{team_prefix}_avg_hero_winrate"] = df[hero_winrate_cols].mean(axis=1)
    # Team Player Win Rate: Average win rate of the players for the team
    player_winrate_cols = [f"{team_prefix}_player_{i}_winrate" for i in range(1, 6)]
    df[f"{team_prefix}_avg_player_winrate"] = df[player_winrate_cols].mean(axis=1)
    # Team Kills, Deaths, Assists
    kills_cols = [f"{team_prefix}_player_{i}_kills" for i in range(1, 6)]
    deaths_cols = [f"{team_prefix}_player_{i}_deaths" for i in range(1, 6)]
    assists_cols = [f"{team_prefix}_player_{i}_assists" for i in range(1, 6)]
    df[f"{team_prefix}_total_kills"] = df[kills_cols].sum(axis=1)
    df[f"{team_prefix}_total_deaths"] = df[deaths_cols].sum(axis=1)
    df[f"{team_prefix}_total_assists"] = df[assists_cols].sum(axis=1)
    # Team GPM and XPM: Average GPM and XPM per team
    gpm_cols = [f"{team_prefix}_player_{i}_gold_per_min" for i in range(1, 6)]
    xpm_cols = [f"{team_prefix}_player_{i}_xp_per_min" for i in range(1, 6)]
    df[f"{team_prefix}_avg_gpm"] = df[gpm_cols].mean(axis=1)
    df[f"{team_prefix}_avg_xpm"] = df[xpm_cols].mean(axis=1)
    # Drop the original columns used to create these features
    df.drop(
        columns=hero_winrate_cols + player_winrate_cols + gpm_cols + xpm_cols,
        inplace=True,
    )
    return df
In addition to team features, we calculated the Kill-Death-Assist (KDA) ratio for each player. KDA is a widely used metric in competitive gaming that helps evaluate individual player performance. The formula used is:

KDA = (kills + assists) / deaths, where a death count of 0 is treated as 1.

This avoids division by zero, ensuring robust calculations. After calculating KDA for all players, the original columns for kills, deaths, and assists were dropped.
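A quick worked example of the rule, as a standalone sketch separate from the pandas implementation:

```python
def kda(kills, deaths, assists):
    # Treat zero deaths as one to avoid division by zero,
    # matching the .replace(0, 1) used in the pandas version
    return (kills + assists) / (deaths if deaths > 0 else 1)

print(kda(8, 2, 10))  # 9.0  -> (8 + 10) / 2
print(kda(5, 0, 7))   # 12.0 -> zero deaths counted as one
```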
Here’s the implementation of the calculate_player_kda function:
def calculate_player_kda(df, team_prefix):
    """
    Function to calculate KDA (Kill-Death-Assist ratio) for each player.
    """
    for i in range(1, 6):
        df[f"{team_prefix}_player_{i}_kda"] = (
            df[f"{team_prefix}_player_{i}_kills"]
            + df[f"{team_prefix}_player_{i}_assists"]
        ) / df[f"{team_prefix}_player_{i}_deaths"].replace(
            0, 1
        )  # Avoid division by zero
        # Drop kills, deaths, and assists for each player
        df.drop(
            columns=[
                f"{team_prefix}_player_{i}_kills",
                f"{team_prefix}_player_{i}_deaths",
                f"{team_prefix}_player_{i}_assists",
            ],
            inplace=True,
        )
    return df
2. Target Variable Creation
To transform the dataset for binary classification, we converted the radiant_win column into an integer format where:
- 1 represents a Radiant victory
- 0 represents a Dire victory
This step is essential for the supervised learning model, where predicting the winning team (Radiant or Dire) is the primary objective.
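The conversion itself is a one-liner; a toy DataFrame with made-up rows illustrates the cast:

```python
import pandas as pd

df = pd.DataFrame({"radiant_win": [True, False, True]})
# Cast the boolean outcome to the 1/0 labels expected by the classifier
df["radiant_win"] = df["radiant_win"].astype(int)
print(df["radiant_win"].tolist())  # [1, 0, 1]
```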
3. Data Cleaning
Data cleaning was performed to remove unnecessary columns that don’t contribute to model performance. Specifically, we dropped:
- Match-specific identifiers: such as match_id, radiant_team_id, dire_team_id, etc., since they are not predictive features.
- Player-specific identifiers: player names and IDs were removed to anonymize the data and focus solely on performance metrics.
These steps help reduce overfitting and ensure that the model generalizes well on unseen data.
4. Normalization
To ensure that all features contribute equally to the machine learning model, we applied Min-Max normalization to scale the features between 0 and 1. This is particularly important for algorithms like logistic regression and neural networks, which are sensitive to the magnitude of input features.
We used MinMaxScaler to normalize all numerical columns except for radiant_win (our target variable). The normalization ensures that features such as kills, GPM, and hero win rates are all on the same scale.
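Under the hood, Min-Max scaling maps each value x to (x - min) / (max - min) within its column. A pure-Python sketch of what MinMaxScaler applies per column, using toy GPM numbers rather than project data:

```python
def min_max_scale(values):
    """Scale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to all zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

gpm = [480, 520, 600]
print(min_max_scale(gpm))  # [0.0, 0.3333333333333333, 1.0]
```

After this transform, a team's GPM and a hero win rate live on the same 0-to-1 scale, which is the property the normalization step is after.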
def prepare_data(df):
    # Apply feature engineering for both Radiant and Dire teams
    df = calculate_team_features(df, "radiant")
    df = calculate_team_features(df, "dire")
    # Calculate KDA for each player (for both teams)
    df = calculate_player_kda(df, "radiant")
    df = calculate_player_kda(df, "dire")
    # Create a new column for the match target: 1 if radiant_win is True, else 0
    try:
        df["radiant_win"] = df["radiant_win"].astype(int)
    except KeyError:
        pass
    df.drop(
        columns=[
            "match_id",
            "radiant_team_id",
            "radiant_team_name",
            "dire_team_id",
            "dire_team_name",
            # Drop player names to anonymize data
            *[f"radiant_player_{i}_name" for i in range(1, 6)],
            *[f"radiant_player_{i}_id" for i in range(1, 6)],
            *[f"radiant_player_{i}_hero_name" for i in range(1, 6)],
            *[f"dire_player_{i}_name" for i in range(1, 6)],
            *[f"dire_player_{i}_id" for i in range(1, 6)],
            *[f"dire_player_{i}_hero_name" for i in range(1, 6)],
        ],
        inplace=True,
    )
    columns_to_normalize = df.columns.difference(["match_id", "radiant_win"])
    # Initialize the MinMaxScaler
    scaler = MinMaxScaler()
    # Apply Min-Max normalization
    df[columns_to_normalize] = scaler.fit_transform(df[columns_to_normalize])
    return df
Machine Learning Model Development
For our first iteration of model development, we selected XGBoost as the primary algorithm. XGBoost, short for Extreme Gradient Boosting, is known for its efficiency and high performance, particularly with structured/tabular data, making it a popular choice in machine learning competitions and real-world applications.
Why XGBoost?
There are several reasons why XGBoost was chosen for this initial iteration:
- Performance on Small Datasets: XGBoost is highly effective even with limited data, which aligns with our relatively small dataset size. It can deliver competitive results without requiring a massive amount of data for training.
- Efficiency: XGBoost is optimized for speed and performance. Its implementation is designed to handle large datasets efficiently, but it also shines with smaller datasets due to its fast training times and ability to quickly iterate over hyperparameters.
- Robustness: XGBoost inherently reduces the risk of overfitting through built-in regularization techniques like L1 (Lasso) and L2 (Ridge) regularization. This makes it more robust when dealing with noisy or complex datasets.
Given these advantages, we used XGBoost’s classification capabilities to predict the outcome of a match (Radiant win or loss). Our workflow included several key steps:
1. Data Splitting
We first divided our dataset into:
- Features: These included player statistics (e.g., kills, deaths, assists), hero selection, and other team performance metrics.
- Target Variable: The target is the match outcome (radiant_win), which indicates whether the Radiant team won or lost.
We further split the data into training and testing sets using an 80–20 split to ensure that the model has unseen data to evaluate its performance.
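The 80-20 idea can be sketched with the standard library alone; in the project, train_test_split does this job on the DataFrame (with a fixed random_state for reproducibility). The split_indices helper below is illustrative, not project code:

```python
import random

def split_indices(n_rows, test_frac=0.2, seed=42):
    """Shuffle row indices and carve off the last test_frac as the test set."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)  # Deterministic shuffle for reproducibility
    cut = int(n_rows * (1 - test_frac))
    return idx[:cut], idx[cut:]

train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))  # 80 20
```

Shuffling before cutting matters: matches from the same tournament sit next to each other in the CSV, and a naive head/tail split would leak tournament-level structure into the evaluation.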
2. Model Training
For the initial iteration, we used the default parameters of XGBoost to establish a baseline model. While hyperparameter tuning can improve performance, starting with default parameters allows us to gauge the base performance and identify areas for improvement.
3. Model Evaluation
After training the model, we evaluated its performance on the test set using key classification metrics:
- Accuracy: Measures overall correctness of the predictions.
- Precision, Recall, and F1-Score: These metrics give us a deeper understanding of how well the model distinguishes between classes, especially in scenarios where the class distribution may be imbalanced.
- Confusion Matrix: Provides a detailed breakdown of true positives, false positives, true negatives, and false negatives.
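These metrics all fall out of the confusion-matrix counts. The binary_metrics helper below is an illustrative sketch (not project code, which uses sklearn's classification_report) applied to a made-up prediction vector, where class 1 means a Radiant win:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for the positive (Radiant-win) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0,
        "recall": tp / (tp + fn) if tp + fn else 0,
        # Rows are actual class, columns are predicted class
        "confusion": [[tn, fp], [fn, tp]],
    }

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
m = binary_metrics(y_true, y_pred)
print(m["accuracy"])  # 0.75
```

F1 is then the harmonic mean of precision and recall, which is what classification_report prints per class.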
4. Model Saving
Upon successful training, the model was saved using the joblib library. This allows us to reuse the model for future predictions without retraining it every time.
5. Prediction on New Data
With the trained model, we can now make predictions on new, unseen data. This capability allows us to forecast match outcomes based on similar feature sets, providing insights into potential game results.
Implementation
Below is the Python implementation of the machine learning pipeline, encapsulated in the MainML class:
import joblib
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from xgboost import XGBClassifier

class MainML:
    """
    Main class that orchestrates model training, evaluation, and prediction.
    """

    def __init__(self, df, model_path):
        self.df = df
        self.model_path = model_path
        self.xgb_model = XGBClassifier(random_state=42)

    def train_and_save_model(self, features, target):
        """
        Trains the XGBoost model and saves it to the specified path.
        """
        # Split the dataset into features (X) and target (y)
        X = self.df[features]
        y = self.df[target]
        # Split data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )
        # Train the model
        self.xgb_model.fit(X_train, y_train)
        # Save the model
        joblib.dump(self.xgb_model, self.model_path)
        print(f"Model saved to {self.model_path}")
        # Evaluate the model on the test set
        self.evaluate_model(X_test, y_test)

    def evaluate_model(self, X_test, y_test):
        """
        Evaluates the model on the test data and prints the classification report and confusion matrix.
        """
        # Make predictions on the test set
        y_pred = self.xgb_model.predict(X_test)
        # Print classification report
        print("XGBoost Classification Report:")
        print(classification_report(y_test, y_pred))
        # Print confusion matrix
        print("XGBoost Confusion Matrix:")
        print(confusion_matrix(y_test, y_pred))

    def load_model(self):
        """
        Loads the model from the specified path.
        """
        self.xgb_model = joblib.load(self.model_path)
        print(f"Model loaded from {self.model_path}")

    def predict(self, new_data):
        """
        Predicts the class for the new data point.
        """
        # Ensure that the new_data has the same features as the training set
        prediction = self.xgb_model.predict(new_data)
        return prediction
Model Training and Prediction
In this section, we focus on the practical steps for training our machine learning model using the XGBoost classifier and then leveraging the trained model to make predictions. The workflow includes loading the dataset, preparing the data, training the model, saving it for future use, and ultimately making predictions on new data.
Here’s a step-by-step breakdown:
1. Loading and Preparing the Dataset
We begin by loading the dataset from a CSV file. The raw data contains various match-level and player-level statistics, which we prepare for machine learning by applying feature engineering and normalization. The prepare_data function, which we implemented earlier, handles these tasks by transforming the raw data into a format suitable for model training.
import os
import pandas as pd
from ml.model import MainML
from structure.helpers import prepare_data

# Define the file path to the dataset
file_path = os.path.join("..", "dataset", "train_data", "all_data.csv")
# Load the dataset into a DataFrame
df = pd.read_csv(file_path)
# Prepare the dataset for model training
df = prepare_data(df)
In the prepare_data function, we applied feature engineering techniques to create team-based and player-based metrics, normalized the features, and removed unnecessary columns. This prepares the data for efficient model training and ensures consistency.
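The real prepare_data lives in the project's structure/helpers.py and is not reproduced in this article, but the sketch below illustrates the same three ideas on made-up column names: aggregating per-player stats into a team-level feature, dropping columns the model should not see, and min-max normalizing the rest. The column names (r1_kills, match_id, and so on) are illustrative assumptions, not the project's real schema.

```python
import pandas as pd

def prepare_data_sketch(df):
    """Simplified stand-in for prepare_data. Hypothetical column names;
    not the repository's actual implementation."""
    df = df.copy()
    # Feature engineering: aggregate player kills into a team metric
    kill_cols = ["r1_kills", "r2_kills", "r3_kills", "r4_kills", "r5_kills"]
    df["radiant_total_kills"] = df[kill_cols].sum(axis=1)
    # Drop columns that should not reach the model
    df = df.drop(columns=["match_id"] + kill_cols)
    # Min-max normalize the remaining features (target excluded)
    for col in df.columns:
        if col == "radiant_win":
            continue
        rng = df[col].max() - df[col].min()
        df[col] = (df[col] - df[col].min()) / rng if rng else 0.0
    return df

# Tiny demonstration frame
raw = pd.DataFrame({
    "match_id": [1, 2],
    "r1_kills": [5, 2], "r2_kills": [3, 1], "r3_kills": [7, 4],
    "r4_kills": [2, 2], "r5_kills": [1, 0],
    "radiant_win": [1, 0],
})
prepared = prepare_data_sketch(raw)
print(prepared.columns.tolist())
```

The key property to preserve in any version of this function is that the exact same transformations run at training time and at prediction time, otherwise the saved model receives features on a different scale than it was trained on.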
2. Defining Features and Target
Next, we define the features (input variables) and the target (output variable). In our case, the target is radiant_win, a binary variable that indicates whether the Radiant team won (1) or lost (0) the match. The features consist of all other columns in the dataset that describe various team and player performance metrics.
# Specify the target column (the variable we want to predict)
target = "radiant_win"

# Specify the features (all columns except the target)
features = df.columns.drop(target).tolist()
This step ensures that the model knows which features to consider when making predictions and which column represents the match outcome.
3. Initializing and Training the Model
We create an instance of the MainML class, which encapsulates the XGBoost model, and proceed to train the model. The training process involves splitting the data into training and testing sets, fitting the model on the training data, and evaluating its performance on the test data.
# Path to save the trained model
model_path = "../xgb_model.pkl"

# Create an instance of MainML with the dataset and model path
main_ml = MainML(df, model_path)
# Train the model using the features and target, and save the model
main_ml.train_and_save_model(features, target)
In this step:
- The train_and_save_model method handles the entire training process.
- After training, the model is saved to a file (xgb_model.pkl) using the joblib library, so it can be reloaded for future predictions without retraining.
4. Loading the Saved Model
Once the model is saved, we can reload it at any time to make predictions on new data. Loading the model from disk avoids retraining, making it more efficient for real-time or repeated predictions.
# Load the previously saved model
main_ml.load_model()

This line reloads the saved XGBoost model from the model_path, making it available for prediction tasks.
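Note that load_model assumes the file at model_path already exists; if it does not, joblib.load raises an error. A small defensive wrapper can fall back to training when no saved model is found. The load_or_train helper below is a hypothetical sketch, not code from the repository, with the two MainML methods injected as plain functions:

```python
from pathlib import Path

def load_or_train(model_path, train_fn, load_fn):
    """Load a saved model if the file exists, otherwise fall back to
    training. train_fn and load_fn stand in for MainML's methods."""
    if Path(model_path).exists():
        return load_fn(model_path)
    return train_fn(model_path)

# Demonstration with stand-in functions (no real model involved)
result = load_or_train(
    "definitely_missing_model.pkl",
    train_fn=lambda p: "trained",
    load_fn=lambda p: "loaded",
)
print(result)
```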
5. Making Predictions on New Data
Finally, we prepare a sample of new data to demonstrate how the model can be used for prediction. Here, we simulate new data by taking the last five rows of our dataset (excluding the target column). In practice, this would be real-time match data or previously unseen data.
# Prepare new data for prediction (excluding the target column)
new_data = df.tail(5).drop(columns=[target])  # Example: using the last 5 rows as new data

# Make predictions using the loaded model
prediction = main_ml.predict(new_data)
# Output the predictions
print(f"Prediction for new data: {prediction}")
In this example:
- We use the .tail(5) method to select the last five rows and drop the radiant_win column, since we are treating these rows as unseen data.
- The predict method of the MainML class outputs the predicted match outcome (1 for Radiant win, 0 for Radiant loss).
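Our predict method returns hard 0/1 labels, but XGBClassifier also exposes predict_proba, which returns class probabilities. If you wanted to surface those to users, a small formatting helper could turn a Radiant-win probability into a readable message. format_prediction below is a hypothetical helper, not code from the repository:

```python
def format_prediction(radiant_win_proba, threshold=0.5):
    """Turn a Radiant-win probability (0.0-1.0) into a bot-style
    message. Hypothetical helper for illustration only."""
    winner = "Radiant" if radiant_win_proba >= threshold else "Dire"
    return f"Predicted winner: {winner} ({radiant_win_proba:.0%} Radiant win chance)"

print(format_prediction(0.73))
print(format_prediction(0.31))
```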
How to Run the Dota 2 Predictor Project
This section provides a step-by-step guide on how to set up and run the Dota 2 Predictor project on your local machine.
1. Clone the Repository
To begin, download the project code by cloning the GitHub repository. Open your terminal and run the following commands:
git clone https://github.com/masterhood13/dota2predictor.git
cd dota2predictor
git checkout tags/1.0.1

This will create a local copy of the repository, move you into the project directory, and check out the 1.0.1 release tag.
2. Install Dependencies
Ensure you have Python 3.9 or higher installed on your system. Next, you will need to install the required Python libraries by running:
pip install -r requirements.txt

This command installs all the necessary packages, such as pandas, xgboost, scikit-learn, and others needed for data processing and machine learning tasks.
3. Set Up Environment Variables
You will need API keys for OpenDota, Steam, and Telegram to enable data fetching and bot functionality. Create a .env file in the project directory and add your API keys as follows:
OPENDOTA_KEY=your_actual_opendota_api_key
STEAM_API_KEY=your_actual_steam_api_key
TELEGRAM_KEY=your_actual_telegram_bot_token

Replace your_actual_opendota_api_key, your_actual_steam_api_key, and your_actual_telegram_bot_token with your actual API keys.
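A missing key typically only surfaces later as a failed API call, which can be confusing to debug. A fail-fast check at startup makes the problem obvious immediately. missing_keys below is a hypothetical helper, not code from the repository:

```python
import os

REQUIRED_KEYS = ["OPENDOTA_KEY", "STEAM_API_KEY", "TELEGRAM_KEY"]

def missing_keys(env=os.environ):
    """Return the names of required API keys that are unset or empty.
    Hypothetical startup check for illustration only."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

# Simulated environment with one key absent
fake_env = {"OPENDOTA_KEY": "abc", "STEAM_API_KEY": "def"}
print(missing_keys(fake_env))
```

Calling missing_keys() with no argument checks the real process environment, so the bot could refuse to start and name the missing variables instead of failing mid-request.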
4. Run the Project
Once everything is set up, you can start the bot by running the following command:
python start.py

This will launch the predictor bot, allowing it to fetch data and interact with the Dota 2 API to make match outcome predictions based on the trained machine learning model.
Interacting with the Telegram Bot for Dota 2 Match Predictions
Once the model is trained and the bot is running, follow these steps to start using the Telegram bot for match predictions:
1. Access the Telegram Bot
After setting up the bot in the previous steps, open Telegram on your device. Search for your bot by its username or use the bot link (if you’ve shared it with yourself or others).
2. Start the Conversation
Send any message to the bot to begin the interaction. The bot will respond with a menu of options to choose from.
3. Get Predictions for Ongoing Dota 2 Matches
To receive predictions for all ongoing Dota 2 matches, simply press the relevant button in the bot’s menu (e.g., “Get Match Predictions”). The bot will then query the Dota 2 API and display predictions for each active game.
4. View Prediction Results
For each active match, the bot will return a predicted winner — either the Radiant or Dire team — based on the machine learning model’s evaluation of the match data. Instead of win probabilities, you’ll receive a clear prediction indicating which team is more likely to win the match, providing real-time insights into ongoing games.
Here is a sample of the output:
High-Level System Architecture
The architecture of our Dota 2 predictor bot demonstrates how the different components work together to deliver match outcome predictions to the end-user via a Telegram bot. The flow begins with a user query and proceeds through various stages of data collection, preprocessing, model prediction, and finally, response delivery.
Here’s a step-by-step breakdown of the system’s workflow:
1. User Interaction:
- The process starts when a user sends a request to the bot via Telegram. This could be a request to predict the outcome of ongoing Dota 2 matches.
2. Telegram Bot:
- The Telegram bot acts as the interface between the user and the backend system. Upon receiving the query, the bot forwards it to the backend for further processing.
3. Backend System:
- The backend server orchestrates the core logic of the prediction process. It gathers the necessary match and player data from external APIs like OpenDota and Steam.
4. Data Retrieval from External APIs:
- The backend system requests match and player data from OpenDota’s API and, if needed, from the Steam API to enrich the player data with additional context.
5. Data Preprocessing:
- Before feeding the data into the prediction model, the Preprocessing Engine transforms the raw data into a format suitable for analysis. This includes cleaning, feature engineering, and normalization to ensure the data is aligned with the requirements of the trained machine learning model.
6. XGBoost Model Prediction:
- The processed data is then passed into the XGBoost model, which has been trained to predict whether the Radiant or Dire team will win. The model makes a decision based on historical data and the current match context.
7. Returning Results:
- Once the model provides its prediction, the backend processes the result and sends it back to the Telegram bot, which then displays the predicted winner (Radiant or Dire) to the user.
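The seven stages above can be sketched as a single backend function in which each stage is injected as a callable. Everything here is illustrative — the stage names are assumptions and the real backend wires these components differently:

```python
def handle_prediction_request(fetch_matches, preprocess, model_predict):
    """Sketch of the request flow: retrieve matches, preprocess each,
    run the model, and build the messages the bot sends back.
    All stage names are illustrative, not the repository's API."""
    messages = []
    for match in fetch_matches():               # data retrieval (OpenDota/Steam)
        features = preprocess(match)            # preprocessing engine
        radiant_wins = model_predict(features)  # XGBoost prediction
        winner = "Radiant" if radiant_wins else "Dire"
        messages.append(f"Match {match['id']}: predicted winner is {winner}")
    return messages                             # bot delivers these to the user

# Demonstration with stub stages in place of the real APIs and model
out = handle_prediction_request(
    fetch_matches=lambda: [{"id": 101, "raw": [1, 2]}, {"id": 102, "raw": [2, 4]}],
    preprocess=lambda m: m["raw"],
    model_predict=lambda f: sum(f) % 2 == 1,  # stand-in for the trained model
)
print(out)
```

Keeping each stage behind its own function boundary is what lets the components "work independently" as described: the model can be retrained, or the data source swapped, without touching the bot code.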
System Flow Diagram
This architecture ensures a smooth, automated flow from user interaction to final match prediction, allowing users to easily access predictions for ongoing Dota 2 games. Each component works independently but contributes to the overall functionality of the prediction system.
Conclusion and What’s Next
In this first phase, we successfully built a machine learning model to predict Dota 2 match outcomes, achieving a baseline accuracy of around 60%. While promising, this leaves plenty of room for improvement. The complexity of Dota 2 means that more refined data and feature engineering could unlock even better predictions.
But this is just the beginning.
In Part Two, we’ll dive deeper into feature engineering, introducing new metrics to better capture team dynamics and player performance. We’ll also integrate additional data sources, including live match stats from a custom Telegram bot. Most excitingly, we’ll design and train a custom deep learning model tailored specifically to this enriched dataset.
Stay tuned — big improvements are on the horizon!
EDIT: Part two
For more details and to explore the entire project, visit the Dota 2 Predictor GitHub Repository.