Data Engineering for AI/ML - Event | Agentic AI Foundation

Your AI Means Nothing Without the Data.

Everyone in data engineering is fighting a hard battle you know nothing about.

This is a conference at the intersection of Data and AI. It will be fun and educational. Don't believe me? Check out what past guests have said.

Oh yeah, and the speakers are top of their game.

some-file-91c642d4-bf5e-4319-936d-75b8107410a9 some-file-5bc63651-3ce1-43cc-a0de-528b883ce8da

Speakers

Sadie St. Lawrence

Founder / AI Instructor @ Human Machine Collaboration Institute / LinkedIn Learning

Hannes Mühleisen

Co-Founder & CEO @ DuckDB Labs

Sol Rashidi

CEO and Founder @ ExecutiveAI

Tengyu Ma

Co-Founder and CEO @ Voyage AI

Shelby Heinecke

Senior AI Research Manager @ Salesforce

Joe Reis

CEO/Co-Founder @ Ternary Data

Ryan Wolf

Deep Learning Algorithm Engineer @ NVIDIA

Yangqing Jia

Founder @ Lepton AI

Chad Sanderson

CEO & Co-Founder @ Gable

Vinoth Chandar

Founder/CEO @ Onehouse

Miriah Peterson

Data Engineer @ Soypete tech

Benjamin Rogojan

Data Science And Engineering Consultant @ Seattle Data Guy

Michael Del Balso

CEO & Co-founder @ Tecton

Pushkar Garg

Staff Machine Learning Engineer @ Clari Inc.

Simon Whiteley

CTO & Co-Owner @ Advancing Analytics

Shailvi Wakhlu

Founder @ Shailvi Ventures LLC

Aishwarya Joshi

Machine Learning Engineer @ Chime

Aishwarya Ramasethu

AI Engineer @ Prediction Guard

Ciro Greco

Founder and CEO @ Bauplan

Sridhar Natarajan

Senior Software Engineer @ Intuit

Daniela Santisteban

Product Manager - Data Taxonomies @ Numerator

Nikhil Simha

CTO @ Zipline AI

Sri Harsha Yayi

Product Manager @ Intuit

Tobias Macey

Associate Director of Platform and DevOps Engineering @ Massachusetts Institute of Technology (MIT)

Alex Strick van Linschoten

ML Engineer @ ZenML

Demetrios Brinkmann

Chief Happiness Engineer @ MLOps Community

Beverly Wright

VP - Data Science & AI / CAIO @ Wavicle Data Solutions

Rebecca Taylor

Tech lead: Personalization @ Lidl e-commerce

Emily Ekdahl

AI Ops Engineer @ Gusto

Jacopo Himberg

Director, Data @ Wolt

Devon Mittow

Staff Software Engineer @ Lyft

Jose Navarro

MLOps Engineer @ Cleo

Stephen Bailey

Data Engineer @ Whatnot

Mark Freeman

Tech Lead, GTM Engineering @ Gable

Victor Cuadros

Senior Software/Data Engineer @ Microsoft

Akmal Chaudhri

Technical Evangelist @ SingleStore

Jesse Anderson

Managing Director @ Big Data Institute

Mehdi Ouazza

Data Eng & Devrel @ MotherDuck

Michelle Leon

Staff Product Manager @ Databricks

Victoria Bukta

Member of Technical Staff @ Databricks

Christophe Blefari

CTO & Co-founder @ NAO

Daniel Svonava

CEO & Co-founder @ Superlinked

Maggie Hays

Founding Community Product Manager, DataHub @ Acryl Data

Korri Jones

Senior Lead Machine Learning Engineer @ Chick-fil-A, Inc.

Nehil Jain

MLE Consultant @ TBA

Sonam Gupta

Sr. Developer Relations @ aiXplain

Valdimar Eggertsson

AI Developer @ Snjallgögn (Smart Data inc.)

Simba Khadder

Founder & CEO @ Featureform

Jay Chia

Cofounder @ Eventual

Colleen Tartow

Field CTO @ VAST Data

Elena Boiarskaia

Head of Applied Machine Learning @ Snorkel AI

Ben Wilson

Software Engineer, ML @ Databricks

Agenda

Welcome - Data Engineering for AI/ML

Speakers:

user's Avatar

11 lessons learned from doing deployments

With over 200+ POCs built and with nearly 40 products in production, Sol walks us through the journey of developing AI products at scale and the 11 Lessons Learned in the journey - spoiler alert, only 30% of the challenges are tech related, 70% are non-tech issues!

Speakers:

user's Avatar

Diving into the results of the 2024 Data Teams survey.

Speakers:

user's Avatar

Building Data Infrastructure at Scale for AI/ML with Open Data Lakehouses

Data engineers love to solve interesting new problems. Sometimes an existing off-the-shelf tool will suffice; sometimes we have to get creative and come up with new ways to build with our existing toolkit. And, perhaps most rewarding, some use cases call for us to develop something completely new that takes on a life of its own - see Apache Spark, Apache Kafka, and the entire data lakehouse category for somewhat recent examples.

AI and ML engineers find themselves at these crossroads all the time. In this keynote, we will explore how a data lakehouse architecture with Apache Hudi is being used to support real-world predictive ML and vector-based AI use cases across organizations such as NielsenIQ, Notion, and Uber.

We’ll explore how a data lakehouse can be used to ingest data with minute-level freshness and provide a single source of truth for all of an organization’s structured and unstructured data. We’ll show how the lakehouse can be used for feature engineering, to generate accurate training datasets and generate production features. We’ll further explain the role of the lakehouse for GenAI use cases, allowing organizations to operate vector generation pipelines at scale and integrate with vector databases for real-time vector serving.

Speakers:

user's Avatar

Going Beyond Two Tier Data Architectures with DuckDB

DuckDB is an in-process analytical data management system. DuckDB is lightweight yet fast and available under the permissive MIT license. DuckDB can be deployed everywhere, from a smart watch to a big iron server. This flexibility has lead to a plethora of new and exciting data architectures, for example on-device processing , SQL lambdas, efficient large-scale pipelines, in-browser SQL, and more.

In this talk, Hannes will give an overview of architectures observed in the wild and some ideas on what would be possible.

Speakers:

user's Avatar

Data Infrastructure Cost: Tips to Keep Our CFOs Happy

“Hey Platform team, any idea why the cloud bill is up by X% this term compared to our previous one?”

As platform engineers working at organisations developing or using AI products, that X % amount can be quite high very quickly.

In this talk, I will show some strategies that you can use to reduce the cost of your Data Infrastructure, share the responsibility across product teams and control it overtime.

Speakers:

user's Avatar

Building Hyper-Personalized LLM Applications with Rich Contextual Data

In the era of AI-driven applications, personalization is paramount. This talk explores the concept of Full RAG (Retrieval-Augmented Generation) and its potential to revolutionize user experiences across industries. We examine four levels of context personalization, from basic recommendations to highly tailored, real-time interactions. The presentation demonstrates how increasing levels of context - from batch data to streaming and real-time inputs - can dramatically improve AI model outputs. We discuss the challenges of implementing sophisticated context personalization, including data engineering complexities and the need for efficient, scalable solutions. Introducing the concept of a Context Platform, we showcase how tools like Tecton can simplify the process of building, deploying, and managing personalized context at scale. Through practical examples in travel recommendations, we illustrate how developers can easily create and integrate batch, streaming, and real-time context using simple Python code, enabling more engaging and valuable AI-powered experiences.

Speakers:

user's Avatar

The Daft distributed Python data engine: multimodal data curation at any scale

It's 2024 but data curation for ML/AI is still incredibly hard. Daft is an open-sourced Python data engine that changes that paradigm by focusing on the 3 fundamental needs for any ML/AI data platform:

ETL at terabyte+ scale: with steps that require complex model batch inference or algorithms that can only be expressed in custom Python.
Analytics: involving multimodal datatypes such as images and tensors, but with SQL as the language of choice.
Dataloading: performant streaming transport and processing of data from cloud storage into your GPUs for model training/inference

In this talk, we explore how other tools fall short of delivering on these hard requirements for any ML/AI data platform. We then showcase a full example of using the Daft Dataframe and simple open file formats such as JSON/Parquet to build out a highly performant data platform - all in your own cloud and on your own data!

Speakers:

user's Avatar

How Feature Stores Work: Enabling Data Scientists to Write Petabyte-Scale Data Pipelines for AI/ML

The term "Feature Store" often conjures a simplistic idea of a storage place for features. However, in reality, feature stores are powerful frameworks and orchestrators for defining, managing, and deploying data pipelines at scale. This session is designed to demystify feature stores, outlining the three distinct types and their roles within a broader ML ecosystem. We’ll explore how feature stores empower data scientists to build and manage their own data pipelines, even at petabyte scale, while efficiently processing streaming data, and maintaining versioning and lineage.

Join Simba Khadder, founder and CEO of Featureform, as he moves beyond concepts and marketing talk to deliver real-world, applicable examples. This session will demonstrate how feature stores can be leveraged to define, manage, and deploy scalable data pipelines for AI/ML, offering a practical blueprint for integrating feature stores into ML workflows.

We’ll also dive into the internals of feature stores to reveal how they achieve scalability, ensuring participants leave with actionable insights. You’ll gain a solid grasp of feature stores, equipped to drive meaningful enhancements in your ML platforms and projects.

Speakers:

user's Avatar

Guest Roundtable Discussion

DuckDB is fast for analytics, but what can it do for AI?

The need for versatile and efficient search mechanisms has never been more critical. This talk will explore DuckDB's underrated search capabilities and usage within an LLM stack.

Speakers:

user's Avatar

From Notebook to Kubernetes: Scaling GenAI Pipelines with ZenML

This lightning talk demonstrates how ZenML, an open-source MLOps framework, enables seamless transition from local development to cloud-scale deployment of generative AI pipelines. We'll showcase a workflow that begins in a Jupyter notebook, with data processing steps run locally, then scales up by offloading intensive training to Kubernetes. The presentation will highlight ZenML's Kubernetes integration and caching features, illustrating how they streamline the development-to-production pipeline for generative AI projects.

Speakers:

user's Avatar

LLMs in Financial Services: Personalized Portfolio Recommendation Engines

This session will delve into how LLMs can be leveraged to create highly personalized and efficient portfolio recommendation engines. Using a Kafka stock ticker feed, tick data will be ingested into a database system where we'll query the data using natural language and build a simple chatbot using speech-to-text.

Speakers:

user's Avatar

Unified Data + AI Governance with Unity Catalog

In today’s multi-vendor data and AI landscapes, organizations often find themselves struggling with fragmented governance. The proliferation of diverse tools and platforms leads to increased overhead, making it challenging to maintain a unified governance strategy across data and AI assets. This session will explore what a typical multi-vendor organization looks like and highlight the common challenges they face.

We’ll delve into the complexities of the current governance space, focusing on the inefficiencies and risks that arise from tool sprawl. The talk will then introduce Unity Catalog’s mission to simplify and unify governance across diverse data formats and AI assets. Attendees will gain insights into how Unity Catalog’s multi-format, multi-asset approach enables seamless governance, empowering organizations to effectively manage their data and AI resources under a cohesive framework.

Join us to discover how Unity Catalog can transform your organization’s governance strategy, reducing overhead and enhancing control over your data and AI assets.

Speakers:

user's Avatar

Musical Entertainment By Yours Truly

Speakers:

user's Avatar

Data Scientists & Data Engineers: How the Best Teams Work

There are clear patterns that make the highest functioning data teams work so well together. In this panel we will explore what data scientists and data engineers need to know about each other's responsibilities to speak the same language and align incentives.

Supercharging Your RAG System: Techniques and Challenges

Retrieval-augmented generation is the predominant way to ingest proprietary unstructured data into generative AI systems. First, I will briefly state my view on the comparison between RAG and other competing paradigms such as finetuning and long-context LLMs. Then, I will briefly introduce embedding models and rerankers, two key components responsible for the retrieval quality. I will then discuss a list of techniques for improving the retrieval quality, such as query generation/decomposition and proper evaluation methods. Finally, I will discuss some current challenges in RAG and possible future directions.

Speakers:

user's Avatar

Your AI Means Nothing Without the Data.

Speakers

Agenda

Sponsors