Abstract
Achieving low remote memory access latency remains the primary challenge in realizing memory disaggregation over Ethernet within datacenters. We present EDM, which attempts to overcome this challenge using two key ideas. First, while existing network protocols for remote memory access over Ethernet, such as TCP/IP and RDMA, are implemented on top of the Ethernet MAC layer, EDM takes a radical approach by implementing the entire network protocol stack for remote memory access within the Physical layer (PHY) of Ethernet. This overcomes fundamental latency and bandwidth overheads imposed by the MAC layer, especially for small memory messages. Second, EDM implements a centralized, fast, in-network scheduler for memory traffic within the PHY of the Ethernet switch. Inspired by the classic Parallel Iterative Matching (PIM) algorithm, the scheduler dynamically reserves bandwidth between compute and memory nodes by creating virtual circuits in the PHY, thus eliminating queuing delay and layer 2 packet processing delay at the switch for memory traffic while maintaining high bandwidth utilization. Our FPGA testbed demonstrates that EDM's network fabric incurs a latency of only ~300 ns for remote memory access in an unloaded network, which is an order of magnitude lower than state-of-the-art Ethernet-based solutions such as RoCEv2 and comparable to emerging PCIe-based solutions such as CXL. Larger-scale network simulations indicate that even at high network loads, EDM's average latency remains within 1.3x its unloaded latency.
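For readers unfamiliar with Parallel Iterative Matching, the following is a minimal software sketch of the classic request/grant/accept rounds from the switch-scheduling literature. It is not EDM's scheduler, which the abstract describes only as PIM-inspired and implemented in the switch PHY; the function name `pim_match`, the `demand` map, and the iteration count are illustrative assumptions.

```python
import random


def pim_match(demand, iterations=4, rng=None):
    """Classic PIM sketch (hypothetical helper, not EDM's hardware scheduler).

    demand maps each input port to the set of output ports it has traffic for.
    Returns a dict mapping matched input ports to their output port.
    """
    rng = rng or random.Random()
    matched_in = {}      # input port -> output port
    matched_out = set()  # output ports already taken

    for _ in range(iterations):
        # Request: each unmatched input asks every free output it needs.
        requests = {}
        for inp, outs in demand.items():
            if inp in matched_in:
                continue
            for out in outs:
                if out not in matched_out:
                    requests.setdefault(out, []).append(inp)
        if not requests:
            break  # nothing left to match

        # Grant: each requested output picks one requester at random.
        grants = {}
        for out, inps in requests.items():
            grants.setdefault(rng.choice(inps), []).append(out)

        # Accept: each input that received grants picks one at random.
        for inp, outs in grants.items():
            out = rng.choice(outs)
            matched_in[inp] = out
            matched_out.add(out)

    return matched_in
```

Every round that has at least one request adds at least one pair, so running as many rounds as there are ports yields a maximal matching; in practice a handful of rounds is typically used. For example, `pim_match({0: {0, 1}, 1: {0}, 2: {1, 2}}, iterations=3)` returns a conflict-free assignment of inputs to outputs.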