“I am not a genius, I don’t know the future. But I like the idea of smart people using technology and data to help us improve. We use analytics, and it has been very helpful.” — Jürgen Klopp.
As avid gamers, we keep a close eye on the unpredictable nature of multiplayer online battle arena (MOBA) games. Dota 2 is one of the most complex MOBAs out there. There are many variables (heroes, items, player skill, and strategy) that make predicting the outcome of a tournament a difficult task. As someone who loves both programming and Dota 2, I started this project as a gift to my big, sports-loving brother, and it soon turned into an opportunity to learn data science, machine learning, and esports development.
In this article, I will take you through the journey of creating a Dota 2 predictor, from data collection and processing to training machine learning models and ultimately predicting outcomes. Whether you're a Dota 2 fan, a programmer, or just curious about the process, I hope this post gives you an understanding of how the system is built and the challenges involved.
Why I Created the Dota 2 Predictor
As a Dota 2 fan, I often find myself wondering why teams win or lose. Are there winning hero combinations? Do players' individual skills matter? What do they buy? Inspired by these questions and driven by my brother's enthusiasm for the game, I set out to answer them using data and machine learning.
Dataset Structure and Forecasting System Design
With a clear understanding of the classes we need — heroes, players, teams, matches, tournaments — it’s time to dive deeper into how they are processed and used to predict match results. Here’s how each category works and how to turn this data into key features for our prediction models.
Hero Class
The Hero class stores details about a specific hero, including win rates in professional games. The get_hero_features() method retrieves information such as the hero's name, pick count, and wins. This information is important because a hero's performance is a key factor in a match's outcome.
class Hero:
    def __init__(self, hero_id):
        self.hero_id = hero_id
        self.features = self.get_hero_features()
        self.name = self.features["name"] if self.features else "Unknown Hero"
        if self.features and self.features["pro_pick"] > 0:
            self.winrate = self.features["pro_win"] / self.features["pro_pick"]
        else:
            self.winrate = 0

    def get_hero_features(self):
        url = f"https://api.opendota.com/api/heroStats?api_key={opendota_key}"
        response = requests.get(url)
        if response.status_code == 200:
            heroes = response.json()
            for hero in heroes:
                if hero["id"] == self.hero_id:
                    return {
                        "hero_id": hero["id"],
                        "name": hero["localized_name"],
                        "pro_win": hero.get("pro_win", 0),
                        "pro_pick": hero.get("pro_pick", 0),
                    }
        else:
            print(f"Error fetching data: {response.status_code}")
        return None

    def __repr__(self):
        return f"Hero(ID: {self.hero_id}, Name: {self.name}, Features: {self.features})"
Player Class
The Player class represents an individual player and records statistics such as kills, deaths, assists, gold per minute, and experience per minute. These statistics are retrieved using the get_player_total_data() and get_player_data() methods, both of which rely on the OpenDota API.
class Player:
    def __init__(self, account_id, name, hero_id, team):
        self.account_id = account_id
        self.name = name
        self.team = team
        self.hero = Hero(hero_id)
        self.player_data = self.get_player_data()
        player_data = self.get_player_total_data()
        kills = find_dict_in_list(player_data, "field", "kills")
        self.kills = kills["sum"] / kills["n"] if kills["n"] > 0 else 0
        deaths = find_dict_in_list(player_data, "field", "deaths")
        self.deaths = deaths["sum"] / deaths["n"] if deaths["n"] > 0 else 0
        assists = find_dict_in_list(player_data, "field", "assists")
        self.assists = assists["sum"] / assists["n"] if assists["n"] > 0 else 0
        gold_per_min = find_dict_in_list(player_data, "field", "gold_per_min")
        self.gold_per_min = (
            gold_per_min["sum"] / gold_per_min["n"] if gold_per_min["n"] > 0 else 0
        )
        xp_per_min = find_dict_in_list(player_data, "field", "xp_per_min")
        self.xp_per_min = (
            xp_per_min["sum"] / xp_per_min["n"] if xp_per_min["n"] > 0 else 0
        )
        last_hits = find_dict_in_list(player_data, "field", "last_hits")
        self.last_hits = last_hits["sum"] / last_hits["n"] if last_hits["n"] > 0 else 0
        denies = find_dict_in_list(player_data, "field", "denies")
        self.denies = denies["sum"] / denies["n"] if denies["n"] > 0 else 0

    def get_player_total_data(self):
        """Fetch player total data with indefinite retries until success."""
        url = f"https://api.opendota.com/api/players/{self.account_id}/totals?api_key={opendota_key}&hero_id={self.hero.hero_id}&limit=30"
        while True:  # Retry loop
            try:
                response = requests.get(url)
                if response.status_code == 200:
                    return response.json()  # Successful response, exit loop
                else:
                    print(
                        f"Error fetching player data: {response.status_code}. Retrying..."
                    )
            except requests.exceptions.RequestException as e:
                print(f"Request failed: {e}. Retrying...")
            sleep(2)

    def get_player_data(self):
        # Fetch general win/loss data
        url = f"https://api.opendota.com/api/players/{self.account_id}/wl?api_key={opendota_key}"
        response = requests.get(url)
        if response.status_code == 200:
            data = response.json()
            player_stats = {
                "win_rate": (
                    data.get("win") / (data.get("win") + data.get("lose"))
                    if (data.get("win") + data.get("lose")) > 0
                    else 0
                ),
            }
            hero_url = f"https://api.opendota.com/api/players/{self.account_id}/heroes?api_key={opendota_key}&limit=30"
            hero_response = requests.get(hero_url)
            if hero_response.status_code == 200:
                hero_data = hero_response.json()
                for hero in hero_data:
                    if hero["hero_id"] == self.hero.hero_id:
                        # Calculate the hero's win rate
                        if hero["games"] > 0:
                            self.hero_win_rate = hero["win"] / hero["games"]
                        else:
                            self.hero_win_rate = 0
                        break
            else:
                print(f"Error fetching hero data: {hero_response.status_code}")
            return player_stats
        else:
            print(f"Error fetching player data: {response.status_code}")
            return None

    def __repr__(self):
        return f"Player({self.name}, Hero : {self.hero.name}, Team: {self.team}, Data: {self.player_data})"
In addition to general win/loss information, the Player class tracks the player's performance on the specific hero they are playing. This combination of player skill, hero selection, and past performance forms the basis of our prediction system.
Team Class
The Team class organizes players into their teams. Each Team instance can add a Player object using the add_player() method. This structure is important for simulating real competitive situations, where coordination between team members and hero choices can influence the outcome of a match.
class Team:
    def __init__(self, team_name: str, team_id: int):
        self.team_name = team_name
        self.team_id = team_id
        self.players = []

    def add_player(self, player):
        self.players.append(player)

    def __repr__(self):
        return f"Team({self.team_name}, ID: {self.team_id}, Players: {self.players})"
Match Class
The Match class stores all relevant statistics for the match, including both teams, their players, and the final match result. The get_match_data() method fills in this information by pulling match details from the OpenDota API, allowing us to break down each player’s impact on the match.
class Match:
    def __init__(
        self,
        match_id: int,
        radiant_team_id: int,
        dire_team_id: int,
        league_id: int,
        radiant_win=None,
    ):
        self.match_id = match_id
        self.radiant_team_id = radiant_team_id
        self.dire_team_id = dire_team_id
        self.radiant_team = None
        self.dire_team = None
        self.league_id = league_id
        self.radiant_win = radiant_win

    def get_match_data(self):
        url = f"https://api.opendota.com/api/matches/{self.match_id}?api_key={opendota_key}"
        response = requests.get(url)
        if response.status_code == 200:
            match_info = response.json()
            radiant_team = Team(
                match_info["radiant_name"], match_info["radiant_team_id"]
            )
            dire_team = Team(match_info["dire_name"], match_info["dire_team_id"])
            self.radiant_win = match_info["radiant_win"]
            for player in match_info["players"]:
                if player["isRadiant"]:
                    player = Player(
                        player["account_id"],
                        player["name"],
                        player["hero_id"],
                        radiant_team.team_name,
                    )
                    radiant_team.add_player(player)
                else:
                    player = Player(
                        player["account_id"],
                        player["name"],
                        player["hero_id"],
                        dire_team.team_name,
                    )
                    dire_team.add_player(player)
            self.radiant_team = radiant_team
            self.dire_team = dire_team

    def get_match_data_for_prediction(self):
        if len(self.radiant_team.players) == 5 and len(self.dire_team.players) == 5:
            # Create a single row with match and player data
            match_data = {
                "match_id": self.match_id,
                "radiant_team_id": self.radiant_team.team_id,
                "radiant_team_name": self.radiant_team.team_name,
                "dire_team_id": self.dire_team.team_id,
                "dire_team_name": self.dire_team.team_name,
            }
            # Add radiant team player data (5 players)
            for i, player in enumerate(self.radiant_team.players):
                match_data[f"radiant_player_{i + 1}_id"] = player.account_id
                match_data[f"radiant_player_{i + 1}_name"] = player.name
                match_data[f"radiant_player_{i + 1}_hero_id"] = player.hero.hero_id
                match_data[f"radiant_player_{i + 1}_hero_name"] = player.hero.name
                match_data[f"radiant_player_{i + 1}_hero_winrate"] = player.hero.winrate
                match_data[f"radiant_player_{i + 1}_winrate"] = player.player_data[
                    "win_rate"
                ]
                match_data[f"radiant_player_{i + 1}_kills"] = player.kills
                match_data[f"radiant_player_{i + 1}_deaths"] = player.deaths
                match_data[f"radiant_player_{i + 1}_assists"] = player.assists
                match_data[f"radiant_player_{i + 1}_gold_per_min"] = player.gold_per_min
                match_data[f"radiant_player_{i + 1}_xp_per_min"] = player.xp_per_min
            # Add dire team player data (5 players)
            for i, player in enumerate(self.dire_team.players):
                match_data[f"dire_player_{i + 1}_id"] = player.account_id
                match_data[f"dire_player_{i + 1}_name"] = player.name
                match_data[f"dire_player_{i + 1}_hero_id"] = player.hero.hero_id
                match_data[f"dire_player_{i + 1}_hero_name"] = player.hero.name
                match_data[f"dire_player_{i + 1}_hero_winrate"] = player.hero.winrate
                match_data[f"dire_player_{i + 1}_winrate"] = player.player_data[
                    "win_rate"
                ]
                match_data[f"dire_player_{i + 1}_kills"] = player.kills
                match_data[f"dire_player_{i + 1}_deaths"] = player.deaths
                match_data[f"dire_player_{i + 1}_assists"] = player.assists
                match_data[f"dire_player_{i + 1}_gold_per_min"] = player.gold_per_min
                match_data[f"dire_player_{i + 1}_xp_per_min"] = player.xp_per_min
            df = pd.DataFrame([match_data])
            df = prepare_data(df)
            top_features = df.columns.tolist()
            return df, top_features

    def __repr__(self):
        # Prepare the Radiant team players
        radiant_players = "\n".join(
            [
                f" Player: {player.name} (Hero : {player.hero.name})"
                for player in self.radiant_team.players
            ]
        )
        # Prepare the Dire team players
        dire_players = "\n".join(
            [
                f" Player: {player.name} (Hero : {player.hero.name})"
                for player in self.dire_team.players
            ]
        )
        # Format the result
        return (
            f"Match ID: {self.match_id}\n"
            f"League ID: {self.league_id}\n"
            f"Radiant Team: {self.radiant_team.team_name}\n"
            f"Radiant Players:\n{radiant_players}\n"
            f"Dire Team: {self.dire_team.team_name}\n"
            f"Dire Players:\n{dire_players}\n"
            f"Radiant Win: {'Yes' if self.radiant_win else 'No'}"
        )
The get_match_data_for_prediction() method collects data from both teams and formats it to feed into the prediction model. This ensures an appropriate dataset structure for machine learning by grouping related statistics.
Tournament Class
Finally, the Tournament class collects all matches for the tournament and stores them in a list. This is especially useful for analyzing large groups of matches and isolating broad trends.
class Tournament:
    def __init__(self, league_id: int, name: str):
        self.league_id = league_id
        self.name = name
        self.matches = []

    def add_match(self, match):
        self.matches.append(match)

    def get_league_matches(self):
        url = f"https://api.opendota.com/api/leagues/{self.league_id}/matches?api_key={opendota_key}"
        response = requests.get(url)
        if response.status_code == 200:
            for match_info in response.json():
                match_id = match_info["match_id"]
                radiant_team_id = match_info["radiant_team_id"]
                dire_team_id = match_info["dire_team_id"]
                radiant_win = match_info["radiant_win"]
                match = Match(
                    match_id, radiant_team_id, dire_team_id, self.league_id, radiant_win
                )
                match.get_match_data()
                self.add_match(match)
        else:
            print(
                f"Error fetching matches for league {self.league_id}: {response.status_code}"
            )

    def __repr__(self):
        return f"Tournament({self.name}, ID: {self.league_id})"
Dataset Generation for Premium Dota 2 Leagues
The predictive ability of a machine learning model depends heavily on the quality of the data it processes. For Dota 2, creating a comprehensive dataset is essential. This section describes how we extract historical and match-level data, focusing on premium tournaments, using the OpenDota API. This dataset is used to train a model that predicts tournament results based on a combination of historical performance and live match data.
1. Historical Data for Players
One of the most important sources of information is player profile data, which contains metrics that reflect overall performance in previous games. It provides insights into the following:
- Player consistency: Players who maintain a high kill, assist, or GPM (gold per minute) average are more effective and reliable.
- Hero-specific performance: Statistics such as the win rate of players with a specific hero tell us how proficient they are with a particular character.
For example, using OpenDota’s /totals or /heroes endpoints, you can collect:
- Kills per game: The average number of kills a player has earned in all or most recent games.
- Hero-specific win rate: How often players win when using specific heroes.
- GPM and XPM (Experience Per Minute): Measures how efficiently players farm gold and gain experience over time.
These statistics help predict how a player is likely to perform in future matches. In essence, historical data reveals patterns in players' general skill levels, tendencies, and hero-specific strengths.
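To make that concrete, here is a minimal sketch of turning /totals-shaped records into per-game averages. The averaged_features helper and the sample records are illustrative assumptions, not project code; the real endpoint returns many more fields per record.

```python
def averaged_features(totals, fields):
    """Convert OpenDota /totals-style records into per-game averages.

    Each record is assumed to carry a "field" name, a games count "n",
    and a running "sum" across those games.
    """
    by_field = {rec["field"]: rec for rec in totals}
    features = {}
    for field in fields:
        rec = by_field.get(field)
        # Guard against missing fields and zero games played
        if rec and rec["n"] > 0:
            features[f"avg_{field}"] = rec["sum"] / rec["n"]
        else:
            features[f"avg_{field}"] = 0
    return features

# Made-up records shaped like the /totals response
sample = [
    {"field": "kills", "n": 30, "sum": 240},
    {"field": "gold_per_min", "n": 30, "sum": 15600},
]
print(averaged_features(sample, ["kills", "gold_per_min", "denies"]))
# {'avg_kills': 8.0, 'avg_gold_per_min': 520.0, 'avg_denies': 0}
```

The zero fallback mirrors the guards in the Player class, so a player with no recorded games for a field simply contributes 0 rather than crashing the pipeline.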
2. Using Historical Data for Predictions
Historical data is not tied to a specific competition, so it serves as a comprehensive performance indicator. Here's how these measures contribute to the model:
- Player performance features: Metrics like kills, deaths, assists, GPM, and XPM are used as predictors of a player’s in-game behavior.
- Averaged historical data: For example, a player’s average kills, win rate, and GPM in the last N games provide a more stable estimate of their overall potential.
Example historical features for players might include:
- average_kills: Average player kills in the last N games.
- win_rate: Player’s win rate in the last N games.
- hero_win_rate: Player’s win rate with their specific hero in the last N games.
- gold_per_min: The player’s average GPM across their last N games.
These features allow the model to estimate the likely impact of each player in a given match based on their past performance.
3. Hero Features
In addition to player stats, hero-specific data also plays a critical role. These stats don’t refer to the ongoing game but are averages from all matches where the hero was picked. This helps assess the strength of heroes in the current meta:
- Pro Pick Rate: How frequently the hero is picked in professional matches.
- Pro Win Rate: The hero’s win rate across professional games.
These hero features give the model an understanding of the broader effectiveness of a given hero in professional play, which is crucial when making predictions.
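As an illustrative sketch (not project code), both rates can be derived from heroStats-shaped records. The sample entries are made up, and here "pick rate" is computed as a hero's share of all professional picks, which is one reasonable reading of pick frequency:

```python
def hero_pro_rates(hero_stats):
    """Compute pro pick share and win rate from heroStats-like records."""
    total_picks = sum(h.get("pro_pick", 0) for h in hero_stats)
    rates = {}
    for h in hero_stats:
        picks = h.get("pro_pick", 0)
        rates[h["localized_name"]] = {
            # Share of all professional picks that went to this hero
            "pick_rate": picks / total_picks if total_picks else 0,
            # Fraction of the hero's professional games that were won
            "win_rate": h.get("pro_win", 0) / picks if picks else 0,
        }
    return rates

# Made-up records shaped like /heroStats entries
sample_heroes = [
    {"localized_name": "Pudge", "pro_pick": 40, "pro_win": 22},
    {"localized_name": "Io", "pro_pick": 60, "pro_win": 27},
]
rates = hero_pro_rates(sample_heroes)
print(rates["Pudge"])  # {'pick_rate': 0.4, 'win_rate': 0.55}
```

The win-rate computation is the same pro_win / pro_pick ratio the Hero class uses for its winrate attribute.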
4. Combining Historical Data with Live Match Data
For the match outcome prediction model, both historical data and live match data are essential. The idea is to combine:
- Historical player data: Information like kills, assists, win rates, and GPM that indicate how players have performed in the past.
- Hero performance data: Stats like hero win rates and pick rates that represent a hero’s overall strength in professional matches.
This mix of historical and live data forms a comprehensive input set for machine learning models. Even though historical data doesn’t reflect the events of a specific match, it serves as a strong indicator of player and hero potential, which can be used to predict outcomes.
5. Extracting Match Data with OpenDota API
With this understanding of historical and hero data, the next step is extracting match-level data from premium Dota 2 leagues. Using OpenDota’s API, we gather detailed match information from prestigious tournaments like The International and ESL One.
Here’s a Python script that demonstrates how to generate a dataset from these leagues, incorporating both match-level and player-level data:
import pandas as pd

from structure.struct import Tournament
from structure.opendota import OpenDotaApi

def generate_dataset():
    api = OpenDotaApi()
    dataset = []
    premium_leagues = api.set_premium_leagues()
    last_big_leagues = [
        "ESL One Kuala Lumpur powered by Intel",
        "BetBoom Dacha Dubai 2024",
        "DreamLeague Season 22 powered by Intel",
        "Elite League Season 2 Main Event – presented by ESB",
        "ESL One Birmingham 2024 Powered by Intel",
        "DreamLeague Season 23 powered by Intel",
        "Riyadh Masters 2024 at Esports World Cup",
        "Clavision DOTA League S1 : Snow-Ruyi",
        "The International 2024",
        "PGL Wallachia 2024 Season 1",
    ]
    for premium_league in premium_leagues:
        league_id = premium_league["leagueid"]
        league_name = premium_league["name"]
        if league_name in last_big_leagues:
            tournament = Tournament(league_id=league_id, name=league_name)
            tournament.get_league_matches()  # Load matches for the tournament
            # Extract data from each match in the tournament
            for match in tournament.matches:
                radiant_team = match.radiant_team
                dire_team = match.dire_team
                # Ensure we have 5 players on each team
                if len(radiant_team.players) == 5 and len(dire_team.players) == 5:
                    # Create a single row with match and player data
                    match_data = {
                        "match_id": match.match_id,
                        "radiant_team_id": radiant_team.team_id,
                        "radiant_team_name": radiant_team.team_name,
                        "dire_team_id": dire_team.team_id,
                        "dire_team_name": dire_team.team_name,
                        "radiant_win": match.radiant_win,  # True/False if Radiant team won
                    }
                    # Add radiant team player data (5 players)
                    for i, player in enumerate(radiant_team.players):
                        match_data[f"radiant_player_{i + 1}_id"] = player.account_id
                        match_data[f"radiant_player_{i + 1}_name"] = player.name
                        match_data[f"radiant_player_{i + 1}_hero_id"] = player.hero.hero_id
                        match_data[f"radiant_player_{i + 1}_hero_name"] = player.hero.name
                        match_data[f"radiant_player_{i + 1}_hero_winrate"] = player.hero.winrate
                        match_data[f"radiant_player_{i + 1}_winrate"] = player.player_data["win_rate"]
                        match_data[f"radiant_player_{i + 1}_kills"] = player.kills
                        match_data[f"radiant_player_{i + 1}_deaths"] = player.deaths
                        match_data[f"radiant_player_{i + 1}_assists"] = player.assists
                        match_data[f"radiant_player_{i + 1}_gold_per_min"] = player.gold_per_min
                        match_data[f"radiant_player_{i + 1}_xp_per_min"] = player.xp_per_min
                    # Add dire team player data (5 players)
                    for i, player in enumerate(dire_team.players):
                        match_data[f"dire_player_{i + 1}_id"] = player.account_id
                        match_data[f"dire_player_{i + 1}_name"] = player.name
                        match_data[f"dire_player_{i + 1}_hero_id"] = player.hero.hero_id
                        match_data[f"dire_player_{i + 1}_hero_name"] = player.hero.name
                        match_data[f"dire_player_{i + 1}_hero_winrate"] = player.hero.winrate
                        match_data[f"dire_player_{i + 1}_winrate"] = player.player_data["win_rate"]
                        match_data[f"dire_player_{i + 1}_kills"] = player.kills
                        match_data[f"dire_player_{i + 1}_deaths"] = player.deaths
                        match_data[f"dire_player_{i + 1}_assists"] = player.assists
                        match_data[f"dire_player_{i + 1}_xp_per_min"] = player.xp_per_min
                        match_data[f"dire_player_{i + 1}_gold_per_min"] = player.gold_per_min
                    print(match_data)
                    # Append match data to dataset
                    dataset.append(match_data)
    df = pd.DataFrame(dataset)
    # Write DataFrame to a CSV file
    df.to_csv("premium_league_matches.csv", index=False)
    print("Match dataset has been generated and saved to 'premium_league_matches.csv'.")

generate_dataset()
6. Structuring and Saving the Dataset
The final dataset is saved in a CSV file. Each row represents a single match, capturing both match-level and player-level data, including historical player stats and hero performance. This dataset provides the input for our machine learning models to predict match outcomes.
Data Preparation
Data preparation is one of the most critical stages in the machine learning workflow, as it significantly impacts the performance and accuracy of predictive models. In this section, we outline how raw match data was transformed into a structured dataset ready for machine learning analysis.
1. Feature Engineering
Feature engineering involves creating new features from existing data to improve model insights. For this project, we developed team-based features that summarize key performance metrics for both Radiant and Dire teams:
- Average Hero Win Rate: We calculated the average win rate of the heroes selected by the players on each team. This provides insight into how historically successful each team’s hero lineup is.
- Average Player Win Rate: Similar to hero win rates, this feature represents the average win rate of the players on the team, offering a view of the team’s player strength.
- Total Kills, Deaths, and Assists: Summing the kills, deaths, and assists for all players on each team gives us a picture of overall team performance.
- Average Gold per Minute (GPM) and Experience per Minute (XPM): Calculating the average GPM and XPM for each team provides a measure of resource accumulation, which is crucial in understanding how well a team farms and levels up.
These team-based features were generated using the calculate_team_features function, which processes a DataFrame and calculates new metrics based on a given team prefix (either "radiant" or "dire"). Once the new features were created, the original columns used to derive them were dropped to avoid redundancy.
Here is the implementation of the calculate_team_features function:
def calculate_team_features(df, team_prefix):
    """
    Function to calculate team-based features for a given prefix (radiant or dire).
    """
    # Team Hero Win Rate: Average win rate of the heroes for the team
    hero_winrate_cols = [f"{team_prefix}_player_{i}_hero_winrate" for i in range(1, 6)]
    df[f"{team_prefix}_avg_hero_winrate"] = df[hero_winrate_cols].mean(axis=1)
    # Team Player Win Rate: Average win rate of the players for the team
    player_winrate_cols = [f"{team_prefix}_player_{i}_winrate" for i in range(1, 6)]
    df[f"{team_prefix}_avg_player_winrate"] = df[player_winrate_cols].mean(axis=1)
    # Team Kills, Deaths, Assists
    kills_cols = [f"{team_prefix}_player_{i}_kills" for i in range(1, 6)]
    deaths_cols = [f"{team_prefix}_player_{i}_deaths" for i in range(1, 6)]
    assists_cols = [f"{team_prefix}_player_{i}_assists" for i in range(1, 6)]
    df[f"{team_prefix}_total_kills"] = df[kills_cols].sum(axis=1)
    df[f"{team_prefix}_total_deaths"] = df[deaths_cols].sum(axis=1)
    df[f"{team_prefix}_total_assists"] = df[assists_cols].sum(axis=1)
    # Team GPM and XPM: Average GPM and XPM per team
    gpm_cols = [f"{team_prefix}_player_{i}_gold_per_min" for i in range(1, 6)]
    xpm_cols = [f"{team_prefix}_player_{i}_xp_per_min" for i in range(1, 6)]
    df[f"{team_prefix}_avg_gpm"] = df[gpm_cols].mean(axis=1)
    df[f"{team_prefix}_avg_xpm"] = df[xpm_cols].mean(axis=1)
    # Drop the original columns used to create these features
    df.drop(
        columns=hero_winrate_cols + player_winrate_cols + gpm_cols + xpm_cols,
        inplace=True,
    )
    return df
In addition to team features, we calculated the Kill-Death-Assist (KDA) ratio for each player. KDA is a widely used metric in competitive gaming that helps evaluate individual player performance. The formula used is:

KDA = (kills + assists) / deaths, where a death count of 0 is treated as 1.

This avoids division by zero, ensuring robust calculations. After calculating KDA for all players, the original columns for kills, deaths, and assists were dropped.
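A quick worked example of the rule, as a standalone sketch separate from the pandas implementation:

```python
def kda(kills, deaths, assists):
    # Treat zero deaths as one to avoid division by zero,
    # matching the .replace(0, 1) used in the pandas version
    return (kills + assists) / (deaths if deaths > 0 else 1)

print(kda(8, 2, 10))  # 9.0  -> (8 + 10) / 2
print(kda(5, 0, 7))   # 12.0 -> zero deaths counted as one
```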
Here’s the implementation of the calculate_player_kda function:
def calculate_player_kda(df, team_prefix):
    """
    Function to calculate KDA (Kill-Death-Assist ratio) for each player.
    """
    for i in range(1, 6):
        df[f"{team_prefix}_player_{i}_kda"] = (
            df[f"{team_prefix}_player_{i}_kills"]
            + df[f"{team_prefix}_player_{i}_assists"]
        ) / df[f"{team_prefix}_player_{i}_deaths"].replace(
            0, 1
        )  # Avoid division by zero
        # Drop kills, deaths, and assists for each player
        df.drop(
            columns=[
                f"{team_prefix}_player_{i}_kills",
                f"{team_prefix}_player_{i}_deaths",
                f"{team_prefix}_player_{i}_assists",
            ],
            inplace=True,
        )
    return df
2. Target Variable Creation
To transform the dataset for binary classification, we converted the radiant_win column into an integer format where:
- 1 represents a Radiant victory
- 0 represents a Dire victory
This step is essential for the supervised learning model, where predicting the winning team (Radiant or Dire) is the primary objective.
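The conversion itself is a one-liner; a toy DataFrame with made-up rows illustrates the cast:

```python
import pandas as pd

df = pd.DataFrame({"radiant_win": [True, False, True]})
# Cast the boolean outcome to the 1/0 labels expected by the classifier
df["radiant_win"] = df["radiant_win"].astype(int)
print(df["radiant_win"].tolist())  # [1, 0, 1]
```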
3. Data Cleaning
Data cleaning was performed to remove unnecessary columns that don’t contribute to model performance. Specifically, we dropped:
- Match-specific identifiers: such as match_id, radiant_team_id, dire_team_id, etc., since they are not predictive features.
- Player-specific identifiers: player names and IDs were removed to anonymize the data and focus solely on performance metrics.
These steps help reduce overfitting and ensure that the model generalizes well on unseen data.
4. Normalization
To ensure that all features contribute equally to the machine learning model, we applied Min-Max normalization to scale the features between 0 and 1. This is particularly important for algorithms like logistic regression and neural networks, which are sensitive to the magnitude of input features.
We used MinMaxScaler to normalize all numerical columns except for radiant_win (our target variable). The normalization ensures that features such as kills, GPM, and hero win rates are all on the same scale.
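Under the hood, Min-Max scaling maps each value x to (x - min) / (max - min) within its column. A pure-Python sketch of what MinMaxScaler applies per column, using toy GPM numbers rather than project data:

```python
def min_max_scale(values):
    """Scale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to all zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

gpm = [480, 520, 600]
print(min_max_scale(gpm))  # [0.0, 0.3333333333333333, 1.0]
```

After this transform, a team's GPM and a hero win rate live on the same 0-to-1 scale, which is the property the normalization step is after.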
def prepare_data(df):
    # Apply feature engineering for both Radiant and Dire teams
    df = calculate_team_features(df, "radiant")
    df = calculate_team_features(df, "dire")
    # Calculate KDA for each player (for both teams)
    df = calculate_player_kda(df, "radiant")
    df = calculate_player_kda(df, "dire")
    # Create a new column for the match target: 1 if radiant_win is True, else 0
    try:
        df["radiant_win"] = df["radiant_win"].astype(int)
    except KeyError:
        pass
    df.drop(
        columns=[
            "match_id",
            "radiant_team_id",
            "radiant_team_name",
            "dire_team_id",
            "dire_team_name",
            # Drop player names to anonymize data
            *[f"radiant_player_{i}_name" for i in range(1, 6)],
            *[f"radiant_player_{i}_id" for i in range(1, 6)],
            *[f"radiant_player_{i}_hero_name" for i in range(1, 6)],
            *[f"dire_player_{i}_name" for i in range(1, 6)],
            *[f"dire_player_{i}_id" for i in range(1, 6)],
            *[f"dire_player_{i}_hero_name" for i in range(1, 6)],
        ],
        inplace=True,
    )
    columns_to_normalize = df.columns.difference(["match_id", "radiant_win"])
    # Initialize the MinMaxScaler
    scaler = MinMaxScaler()
    # Apply Min-Max normalization
    df[columns_to_normalize] = scaler.fit_transform(df[columns_to_normalize])
    return df
Machine Learning Model Development
For our first iteration of model development, we selected XGBoost as the primary algorithm. XGBoost, short for Extreme Gradient Boosting, is known for its efficiency and high performance, particularly with structured/tabular data, making it a popular choice in machine learning competitions and real-world applications.
Why XGBoost?
There are several reasons why XGBoost was chosen for this initial iteration:
- Performance on Small Datasets: XGBoost is highly effective even with limited data, which aligns with our relatively small dataset size. It can deliver competitive results without requiring a massive amount of data for training.
- Efficiency: XGBoost is optimized for speed and performance. Its implementation is designed to handle large datasets efficiently, but it also shines with smaller datasets due to its fast training times and ability to quickly iterate over hyperparameters.
- Robustness: XGBoost inherently reduces the risk of overfitting through built-in regularization techniques like L1 (Lasso) and L2 (Ridge) regularization. This makes it more robust when dealing with noisy or complex datasets.
Given these advantages, we used XGBoost’s classification capabilities to predict the outcome of a match (Radiant win or loss). Our workflow included several key steps:
1. Data Splitting
We first divided our dataset into:
- Features: These included player statistics (e.g., kills, deaths, assists), hero selection, and other team performance metrics.
- Target Variable: The target is the match outcome (radiant_win), which indicates whether the Radiant team won or lost.
We further split the data into training and testing sets using an 80–20 split to ensure that the model has unseen data to evaluate its performance.
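The 80-20 idea can be sketched with the standard library alone; in the project, train_test_split does this job on the DataFrame (with a fixed random_state for reproducibility). The split_indices helper below is illustrative, not project code:

```python
import random

def split_indices(n_rows, test_frac=0.2, seed=42):
    """Shuffle row indices and carve off the last test_frac as the test set."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)  # Deterministic shuffle for reproducibility
    cut = int(n_rows * (1 - test_frac))
    return idx[:cut], idx[cut:]

train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))  # 80 20
```

Shuffling before cutting matters: matches from the same tournament sit next to each other in the CSV, and a naive head/tail split would leak tournament-level structure into the evaluation.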
2. Model Training
For the initial iteration, we used the default parameters of XGBoost to establish a baseline model. While hyperparameter tuning can improve performance, starting with default parameters allows us to gauge the base performance and identify areas for improvement.
3. Model Evaluation
After training the model, we evaluated its performance on the test set using key classification metrics:
- Accuracy: Measures overall correctness of the predictions.
- Precision, Recall, and F1-Score: These metrics give us a deeper understanding of how well the model distinguishes between classes, especially in scenarios where the class distribution may be imbalanced.
- Confusion Matrix: Provides a detailed breakdown of true positives, false positives, true negatives, and false negatives.
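These metrics all fall out of the confusion-matrix counts. The binary_metrics helper below is an illustrative sketch (not project code, which uses sklearn's classification_report) applied to a made-up prediction vector, where class 1 means a Radiant win:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for the positive (Radiant-win) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0,
        "recall": tp / (tp + fn) if tp + fn else 0,
        # Rows are actual class, columns are predicted class
        "confusion": [[tn, fp], [fn, tp]],
    }

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
m = binary_metrics(y_true, y_pred)
print(m["accuracy"])  # 0.75
```

F1 is then the harmonic mean of precision and recall, which is what classification_report prints per class.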
4. Model Saving
Upon successful training, the model was saved using the joblib library. This allows us to reuse the model for future predictions without retraining it every time.
5. Prediction on New Data
With the trained model, we can now make predictions on new, unseen data. This capability allows us to forecast match outcomes based on similar feature sets, providing insights into potential game results.
Implementation
Below is the Python implementation of the machine learning pipeline, encapsulated in the MainML class:
import joblib
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from xgboost import XGBClassifier

class MainML:
    """
    Main class that orchestrates model training, evaluation, and prediction.
    """

    def __init__(self, df, model_path):
        self.df = df
        self.model_path = model_path
        self.xgb_model = XGBClassifier(random_state=42)

    def train_and_save_model(self, features, target):
        """
        Trains the XGBoost model and saves it to the specified path.
        """
        # Split the dataset into features (X) and target (y)
        X = self.df[features]
        y = self.df[target]
        # Split data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )
        # Train the model
        self.xgb_model.fit(X_train, y_train)
        # Save the model
        joblib.dump(self.xgb_model, self.model_path)
        print(f"Model saved to {self.model_path}")
        # Evaluate the model on the test set
        self.evaluate_model(X_test, y_test)

    def evaluate_model(self, X_test, y_test):
        """
        Evaluates the model on the test data and prints the classification report and confusion matrix.
        """
        # Make predictions on the test set
        y_pred = self.xgb_model.predict(X_test)
        # Print classification report
        print("XGBoost Classification Report:")
        print(classification_report(y_test, y_pred))
        # Print confusion matrix
        print("XGBoost Confusion Matrix:")
        print(confusion_matrix(y_test, y_pred))

    def load_model(self):
        """
        Loads the model from the specified path.
        """
        self.xgb_model = joblib.load(self.model_path)
        print(f"Model loaded from {self.model_path}")

    def predict(self, new_data):
        """
        Predicts the class for the new data point.
        """
        # Ensure that the new_data has the same features as the training set
        prediction = self.xgb_model.predict(new_data)
        return prediction
Model Training and Prediction
In this section, we focus on the practical steps for training our machine learning model using the XGBoost classifier and then leveraging the trained model to make predictions. The workflow includes loading the dataset, preparing the data, training the model, saving it for future use, and ultimately making predictions on new data.
Here’s a step-by-step breakdown:
1. Loading and Preparing the Dataset
We begin by loading the dataset from a CSV file. The raw data contains various match-level and player-level statistics, which we prepare for machine learning by applying feature engineering and normalization. The prepare_data function, which we implemented earlier, handles these tasks by transforming the raw data into a format suitable for model training.
import os
import pandas as pd
from ml.model import MainML
from structure.helpers import prepare_data

# Define the file path to the dataset
file_path = os.path.join("..", "dataset", "train_data", "all_data.csv")
# Load the dataset into a DataFrame
df = pd.read_csv(file_path)
# Prepare the dataset for model training
df = prepare_data(df)
In the prepare_data function, we applied feature engineering techniques to create team-based and player-based metrics, normalized the features, and removed unnecessary columns. This prepares the data for efficient model training and ensures consistency.
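The real prepare_data lives in the project's structure/helpers.py and is not reproduced in this article, but the sketch below illustrates the same three ideas on made-up column names: aggregating per-player stats into a team-level feature, dropping columns the model should not see, and min-max normalizing the rest. The column names (r1_kills, match_id, and so on) are illustrative assumptions, not the project's real schema.

```python
import pandas as pd

def prepare_data_sketch(df):
    """Simplified stand-in for prepare_data. Hypothetical column names;
    not the repository's actual implementation."""
    df = df.copy()
    # Feature engineering: aggregate player kills into a team metric
    kill_cols = ["r1_kills", "r2_kills", "r3_kills", "r4_kills", "r5_kills"]
    df["radiant_total_kills"] = df[kill_cols].sum(axis=1)
    # Drop columns that should not reach the model
    df = df.drop(columns=["match_id"] + kill_cols)
    # Min-max normalize the remaining features (target excluded)
    for col in df.columns:
        if col == "radiant_win":
            continue
        rng = df[col].max() - df[col].min()
        df[col] = (df[col] - df[col].min()) / rng if rng else 0.0
    return df

# Tiny demonstration frame
raw = pd.DataFrame({
    "match_id": [1, 2],
    "r1_kills": [5, 2], "r2_kills": [3, 1], "r3_kills": [7, 4],
    "r4_kills": [2, 2], "r5_kills": [1, 0],
    "radiant_win": [1, 0],
})
prepared = prepare_data_sketch(raw)
print(prepared.columns.tolist())
```

The key property to preserve in any version of this function is that the exact same transformations run at training time and at prediction time, otherwise the saved model receives features on a different scale than it was trained on.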
2. Defining Features and Target
Next, we define the features (input variables) and the target (output variable). In our case, the target is radiant_win, a binary variable that indicates whether the Radiant team won (1) or lost (0) the match. The features consist of all other columns in the dataset that describe various team and player performance metrics.
# Specify the target column (the variable we want to predict)
target = "radiant_win"

# Specify the features (all columns except the target)
features = df.columns.drop(target).tolist()
This step ensures that the model knows which features to consider when making predictions and which column represents the match outcome.
3. Initializing and Training the Model
We create an instance of the MainML class, which encapsulates the XGBoost model, and proceed to train the model. The training process involves splitting the data into training and testing sets, fitting the model on the training data, and evaluating its performance on the test data.
# Path to save the trained model
model_path = "../xgb_model.pkl"

# Create an instance of MainML with the dataset and model path
main_ml = MainML(df, model_path)
# Train the model using the features and target, and save the model
main_ml.train_and_save_model(features, target)
In this step:
- The train_and_save_model method handles the entire training process.
- After training, the model is saved to a file (xgb_model.pkl) using the joblib library, so it can be reloaded for future predictions without retraining.
4. Loading the Saved Model
Once the model is saved, we can reload it at any time to make predictions on new data. Loading the model from disk avoids retraining, making it more efficient for real-time or repeated predictions.
# Load the previously saved model
main_ml.load_model()

This line reloads the saved XGBoost model from the model_path, making it available for prediction tasks.
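Note that load_model assumes the file at model_path already exists; if it does not, joblib.load raises an error. A small defensive wrapper can fall back to training when no saved model is found. The load_or_train helper below is a hypothetical sketch, not code from the repository, with the two MainML methods injected as plain functions:

```python
from pathlib import Path

def load_or_train(model_path, train_fn, load_fn):
    """Load a saved model if the file exists, otherwise fall back to
    training. train_fn and load_fn stand in for MainML's methods."""
    if Path(model_path).exists():
        return load_fn(model_path)
    return train_fn(model_path)

# Demonstration with stand-in functions (no real model involved)
result = load_or_train(
    "definitely_missing_model.pkl",
    train_fn=lambda p: "trained",
    load_fn=lambda p: "loaded",
)
print(result)
```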
5. Making Predictions on New Data
Finally, we prepare a sample of new data to demonstrate how the model can be used for prediction. Here, we simulate new data by taking the last five rows of our dataset (excluding the target column). In practice, this would be real-time match data or previously unseen data.
# Prepare new data for prediction (excluding the target column)
new_data = df.tail(5).drop(columns=[target])  # Example: using the last 5 rows as new data

# Make predictions using the loaded model
prediction = main_ml.predict(new_data)
# Output the predictions
print(f"Prediction for new data: {prediction}")
In this example:
- We use the .tail(5) method to select the last five rows and drop the radiant_win column, since we are treating these rows as unseen data.
- The predict method of the MainML class outputs the predicted match outcome (1 for Radiant win, 0 for Radiant loss).
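Our predict method returns hard 0/1 labels, but XGBClassifier also exposes predict_proba, which returns class probabilities. If you wanted to surface those to users, a small formatting helper could turn a Radiant-win probability into a readable message. format_prediction below is a hypothetical helper, not code from the repository:

```python
def format_prediction(radiant_win_proba, threshold=0.5):
    """Turn a Radiant-win probability (0.0-1.0) into a bot-style
    message. Hypothetical helper for illustration only."""
    winner = "Radiant" if radiant_win_proba >= threshold else "Dire"
    return f"Predicted winner: {winner} ({radiant_win_proba:.0%} Radiant win chance)"

print(format_prediction(0.73))
print(format_prediction(0.31))
```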
How to Run the Dota 2 Predictor Project
This section provides a step-by-step guide on how to set up and run the Dota 2 Predictor project on your local machine.
1. Clone the Repository
To begin, download the project code by cloning the GitHub repository. Open your terminal and run the following commands:
git clone https://github.com/masterhood13/dota2predictor.git
cd dota2predictor
git checkout tags/1.0.1

This will create a local copy of the repository, move you into the project directory, and check out the 1.0.1 release tag.
2. Install Dependencies
Ensure you have Python 3.9 or higher installed on your system. Next, you will need to install the required Python libraries by running:
pip install -r requirements.txt

This command installs all the necessary packages, such as pandas, xgboost, scikit-learn, and others needed for data processing and machine learning tasks.
3. Set Up Environment Variables
You will need API keys for OpenDota, Steam, and Telegram to enable data fetching and bot functionality. Create a .env file in the project directory and add your API keys as follows:
OPENDOTA_KEY=your_actual_opendota_api_key
STEAM_API_KEY=your_actual_steam_api_key
TELEGRAM_KEY=your_actual_telegram_bot_token

Replace your_actual_opendota_api_key, your_actual_steam_api_key, and your_actual_telegram_bot_token with your actual API keys.
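A missing key typically only surfaces later as a failed API call, which can be confusing to debug. A fail-fast check at startup makes the problem obvious immediately. missing_keys below is a hypothetical helper, not code from the repository:

```python
import os

REQUIRED_KEYS = ["OPENDOTA_KEY", "STEAM_API_KEY", "TELEGRAM_KEY"]

def missing_keys(env=os.environ):
    """Return the names of required API keys that are unset or empty.
    Hypothetical startup check for illustration only."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

# Simulated environment with one key absent
fake_env = {"OPENDOTA_KEY": "abc", "STEAM_API_KEY": "def"}
print(missing_keys(fake_env))
```

Calling missing_keys() with no argument checks the real process environment, so the bot could refuse to start and name the missing variables instead of failing mid-request.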
4. Run the Project
Once everything is set up, you can start the bot by running the following command:
python start.py

This will launch the predictor bot, allowing it to fetch data and interact with the Dota 2 API to make match outcome predictions based on the trained machine learning model.
Interacting with the Telegram Bot for Dota 2 Match Predictions
Once the model is trained and the bot is running, follow these steps to start using the Telegram bot for match predictions:
1. Access the Telegram Bot
After setting up the bot in the previous steps, open Telegram on your device. Search for your bot by its username or use the bot link (if you’ve shared it with yourself or others).
2. Start the Conversation
Send any message to the bot to begin the interaction. The bot will respond with a menu of options to choose from.
3. Get Predictions for Ongoing Dota 2 Matches
To receive predictions for all ongoing Dota 2 matches, simply press the relevant button in the bot’s menu (e.g., “Get Match Predictions”). The bot will then query the Dota 2 API and display predictions for each active game.
4. View Prediction Results
For each active match, the bot will return a predicted winner — either the Radiant or Dire team — based on the machine learning model’s evaluation of the match data. Instead of win probabilities, you’ll receive a clear prediction indicating which team is more likely to win the match, providing real-time insights into ongoing games.
Here is a sample of the output:
High-Level System Architecture
The architecture of our Dota 2 predictor bot demonstrates how the different components work together to deliver match outcome predictions to the end-user via a Telegram bot. The flow begins with a user query and proceeds through various stages of data collection, preprocessing, model prediction, and finally, response delivery.
Here’s a step-by-step breakdown of the system’s workflow:
1. User Interaction:
- The process starts when a user sends a request to the bot via Telegram. This could be a request to predict the outcome of ongoing Dota 2 matches.
2. Telegram Bot:
- The Telegram bot acts as the interface between the user and the backend system. Upon receiving the query, the bot forwards it to the backend for further processing.
3. Backend System:
- The backend server orchestrates the core logic of the prediction process. It gathers the necessary match and player data from external APIs like OpenDota and Steam.
4. Data Retrieval from External APIs:
- The backend system requests match and player data from OpenDota’s API and, if needed, from the Steam API to enrich the player data with additional context.
5. Data Preprocessing:
- Before feeding the data into the prediction model, the Preprocessing Engine transforms the raw data into a format suitable for analysis. This includes cleaning, feature engineering, and normalization to ensure the data is aligned with the requirements of the trained machine learning model.
6. XGBoost Model Prediction:
- The processed data is then passed into the XGBoost model, which has been trained to predict whether the Radiant or Dire team will win. The model makes a decision based on historical data and the current match context.
7. Returning Results:
- Once the model provides its prediction, the backend processes the result and sends it back to the Telegram bot, which then displays the predicted winner (Radiant or Dire) to the user.
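The seven stages above can be sketched as a single backend function in which each stage is injected as a callable. Everything here is illustrative — the stage names are assumptions and the real backend wires these components differently:

```python
def handle_prediction_request(fetch_matches, preprocess, model_predict):
    """Sketch of the request flow: retrieve matches, preprocess each,
    run the model, and build the messages the bot sends back.
    All stage names are illustrative, not the repository's API."""
    messages = []
    for match in fetch_matches():               # data retrieval (OpenDota/Steam)
        features = preprocess(match)            # preprocessing engine
        radiant_wins = model_predict(features)  # XGBoost prediction
        winner = "Radiant" if radiant_wins else "Dire"
        messages.append(f"Match {match['id']}: predicted winner is {winner}")
    return messages                             # bot delivers these to the user

# Demonstration with stub stages in place of the real APIs and model
out = handle_prediction_request(
    fetch_matches=lambda: [{"id": 101, "raw": [1, 2]}, {"id": 102, "raw": [2, 4]}],
    preprocess=lambda m: m["raw"],
    model_predict=lambda f: sum(f) % 2 == 1,  # stand-in for the trained model
)
print(out)
```

Keeping each stage behind its own function boundary is what lets the components "work independently" as described: the model can be retrained, or the data source swapped, without touching the bot code.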
System Flow Diagram
This architecture ensures a smooth, automated flow from user interaction to final match prediction, allowing users to easily access predictions for ongoing Dota 2 games. Each component works independently but contributes to the overall functionality of the prediction system.
Conclusion and What’s Next
In this first phase, we successfully built a machine learning model to predict Dota 2 match outcomes, achieving a baseline accuracy of around 60%. While promising, this leaves plenty of room for improvement. The complexity of Dota 2 means that more refined data and feature engineering could unlock even better predictions.
But this is just the beginning.
In Part Two, we’ll dive deeper into feature engineering, introducing new metrics to better capture team dynamics and player performance. We’ll also integrate additional data sources, including live match stats from a custom Telegram bot. Most excitingly, we’ll design and train a custom deep learning model tailored specifically to this enriched dataset.
Stay tuned — big improvements are on the horizon!
EDIT: Part two
For more details and to explore the entire project, visit the Dota 2 Predictor GitHub Repository.