Abstract
In large-scale analytics, in-storage processing (ISP) can significantly boost query performance by letting ISP engines (e.g., FPGAs) pre-select only the relevant data before sending them to databases. This reduces not only the amount of data transferred between storage and host but also the amount of database computation, facilitating faster query processing. However, existing ISP solutions cannot effectively support a wide range of modern analytical queries because they only support simple combinations of frequently used operators (e.g., =, <), particularly on fixed-length columns. As modern databases allow filter predicates to include numerous operators/functions (e.g., dateadd) compatible with diverse data formats (and their complex combinations), it becomes more challenging for existing approaches to accelerate such queries efficiently.
To address the limitations, we propose a new ISP approach, called Universal Predicate Pushdown (UPP), that can accelerate modern analytical databases, leveraging hardware/software co-design for a high level of flexibility. Our core insight is that instead of programming for individual filter operators/functions, we should devise a compact instruction set architecture (ISA) tailored explicitly for predicate pushdown. The software (i.e., database) layer recognizes and compiles various general filters (called a universal predicate) to a set of UPP-compliant instructions, which are then processed efficiently by the FPGA using bitwise comparisons, leveraging lightweight metadata. In our experiments with a 100 GB TPC-H dataset, UPP running on SmartSSD could speed up Spark’s end-to-end query performance by 1.2×–7.9× without changing input data formats.
1 Introduction
Many analytical queries begin with reading data and filtering out irrelevant items. Reading the entire data from storage and processing it on the host CPU can consume significant CPU cycles, on/off-chip cache/memory capacity, and PCIe/memory channel bandwidth. If storage devices can send only relevant data to the host, query latencies and energy efficiency will greatly improve. As increasingly pursued by on/off-chip accelerators [39, 66], can modern storage devices aid this through in-storage processing, enhancing overall system efficiency?
In-storage processing (ISP) can reduce the amount of data that must be transferred to and processed by the host. It can improve energy efficiency and query latency by saving CPU cycles, on-chip cache and off-chip memory capacity, and limited PCIe bandwidth [24, 71]. Existing ISP techniques [24, 28, 33, 40, 58, 74, 75, 76] offload parts of database queries (especially filtering) to ISP devices. Ibex [75] presents the first prototype of in-storage multi-predicate filtering using FPGAs. POLARDB [24] demonstrates the pushdown of table filtering tasks to ISP devices in real-world databases. In general, these ISP techniques can deliver practical benefits if they can support a wide range of real-world queries, significantly improving energy efficiency and user-perceived end-to-end query latency.
Table 1:
| Mechanism | Novelty | Column format | Max predicates (max columns)† | Supported operations | ISP engine |
|---|---|---|---|---|---|
| Smart SSD [28] | Proving the concept of in-storage query processing | 4-byte record | Five (three) | Filter, Agg. | ARM cores |
| Ibex [75] | Supporting multi-predicate selection and aggregation | 4-byte record | Four (two) | Filter, Agg. | FPGA [17] |
| Biscuit [31] | Demonstrating industrial product-strength ISP framework | 16-byte record | Three (three) | Filter | ASIC+ARM cores |
| YourSQL [33] | Designing ISP-aware query planner for early filtering | 16-byte record | Three (three) | Filter | ASIC+ARM cores |
| Summarizer [38] | Dynamic load balancing between host and ISP engines | Unknown | Five (three) | Filter | ARM cores |
| FCAccel [74] | Designing ISP-friendly column-oriented data format | 16-byte record | Six (six) | Filter, Agg. | FPGA [15] |
| POLARDB [24] | Real-world deployment of DBMS powered by ISP | Custom format | Unknown | Filter | FPGA [8] |
| AQUOMAN [77] | Supporting complex multi-way join on large data sets | 4/8-byte record | Four (four) | Filter, Agg., join | FPGA [35] |
| Tobias et al. [72] | Transactionally consistent ISP platform for HTAP workloads | Unknown | Unknown | Filter, Agg., join | RISC-V cores |
| AIDE [42] | Designing a canonical interface for vendor-neutral ISP | Fixed-length | One (one) | Filter, join | FPGA [11] |
ISP-based database query acceleration mechanisms (focused on filtering capabilities).
† Maximum numbers of predicates and unique columns handled by ISP engines, derived from the evaluated queries in the original papers.
Unfortunately, existing ISP techniques suffer from two fundamental limitations in accelerating analytical queries [26, 53, 62, 63, 79]: filter generality and data generality.
Filter generality: Analytical databases like Apache Spark allow numerous functions (e.g., add_months, ceil) within filter predicates. Moreover, users can easily define custom functions (e.g., myfilter(col)) inside SQL. However, implementing each and every function requires significant amounts of FPGA resources or is even impossible since user-defined filter functions cannot be known in advance. Also, dynamically offloading arbitrary logic to FPGAs via high-level synthesis (HLS) [2] may offer suboptimal performance or it can be too time-consuming to compile for every query. For these reasons, existing ISP supports only relatively simple combinations of frequently used operators (e.g., =, <) without the ability to compile complex predicates into the forms that can be evaluated using ISP hardware engines.
Data generality: Existing ISP targets only fixed-length columns (e.g., padded strings) due to potential performance concerns in parsing and processing diverse data formats, which cannot be parallelized efficiently using FPGAs or wimpy embedded cores. However, most real-world data have columns of varying lengths and custom formats (e.g., timestamps with time zones). ISP for modern analytical queries should support these diverse formats by optionally leveraging hardware/software co-design for a high level of flexibility, while still achieving significant performance gains.
Our proposal. To overcome these limitations, we propose a novel ISP approach named Universal Predicate Pushdown (UPP). UPP accelerates a wide range of modern analytical queries by leveraging an instruction set architecture (ISA) specifically designed for predicate pushdown. The core insight of UPP is that many filter operations can be expressed using primitive constructs (as described shortly below). During query execution, databases dynamically compile a group of primitive constructs into an ISP instruction incorporating an FPGA-friendly representation of filter predicates called query hash. The FPGAs compute this instruction efficiently as bitwise operations, utilizing metadata (i.e., data hash) prepared alongside each database table. By storing row sizes along with data hash, UPP’s ISP can also process variable-length columns efficiently.
Requirements. To meet the needs of modern analytics, UPP strives to achieve the following three Requirements.
(R1) Guaranteeing correctness: when databases execute queries on UPP-filtered data, the results must be identical to those obtained from the original, unfiltered data.
(R2) Ensuring generality: our proposed mechanism with primitive constructs should be applicable to most queries.
(R3) Improving efficiency: ISP engines should achieve a high level of parallelism inside FPGAs while ensuring the ability to process variable-length column formats.
Our approach. UPP satisfies all three requirements, achieving correctness, generality, and efficiency.
(A1) UPP ensures correctness by guaranteeing that FPGAs never miss relevant data (i.e., no false negative). UPP-filtered data may include irrelevant data; however, regular query processing on those data produces identical results, according to the theory of relational algebra. We also demonstrate that our mechanisms achieve small false positive rates, minimizing host-side processing.
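The no-false-negative argument can be illustrated with a small sketch (our own illustration, not the paper's code): filtering with a conservative superset predicate inside storage and then re-applying the exact predicate on the host yields the same result as exact filtering alone.

```python
# Illustrative sketch of (A1): ISP-side filtering may admit false
# positives but never false negatives, so regular query processing
# on the pre-filtered data reproduces the original result.
rows = [{"year": y} for y in (2021, 2022, 2023, 2024)]
exact = lambda r: r["year"] == 2023            # the real predicate
coarse = lambda r: r["year"] >= 2022           # superset: no false negatives

isp_filtered = [r for r in rows if coarse(r)]        # done inside storage
host_result = [r for r in isp_filtered if exact(r)]  # host-side processing
assert host_result == [r for r in rows if exact(r)]  # identical results
```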
(A2) UPP offers generality by recognizing that most predicates can be expressed as one of primitive constructs:
\begin{align}\begin{split} \text{(Type I)} \quad & \text{mono\_func}(\text{numeric\_col}) \in [\text{lb}, \text{ub}] \\ \text{(Type II)} \quad & \text{contains}(\text{text\_col}, \text{val})\text{.and(...)} \end{split}\end{align}
(1)
where mono_func() is a monotonically increasing (e.g., log) or decreasing (e.g., exp(-x)) function, and contains() evaluates true if text_col includes val; additional conditions may appear inside and(). For example, trunc(‘year’, timestamp_col) = 2023 returns true if timestamp_col (e.g., ‘2023-11-21 23:59:59’) has 2023 in its year field. Note that trunc() is a monotonically increasing function of its argument (i.e., timestamp_col) since the output value (e.g., year) never decreases when the input column value increases (in time). Also, “= 2023” can be expressed as “∈ [2023, 2023]”, meaning the whole predicate can be expressed using a primitive construct. If this condition is satisfied, UPP can convert the predicate evaluation into an ISP operation—parallel bitwise comparisons.
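As a sketch of this Type I rewrite (our own illustration; year_of stands in for trunc(‘year’, ·) and is not UPP's API), an equality on a monotone function becomes an interval check:

```python
# Hypothetical helper mirroring trunc('year', timestamp_col):
# monotonically increasing, since the year never decreases as
# the timestamp grows.
def year_of(ts: str) -> int:
    return int(ts[:4])

# '= 2023' is rewritten as the closed interval [2023, 2023],
# matching the Type I construct mono_func(col) in [lb, ub].
lb, ub = 2023, 2023
assert lb <= year_of("2023-11-21 23:59:59") <= ub        # row kept
assert not (lb <= year_of("2022-01-01 00:00:00") <= ub)  # row pruned
```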
(A3) UPP’s bit-based mechanism is suitable for maximizing performance while simplifying logic-gate designs. For example, evaluating a pattern-matching predicate like method LIKE ‘%in person%’ can incur significant overhead when directly implemented in FPGAs, due to sequential scans and repeated comparisons; in UPP, however, every predicate evaluation reduces to simple parallel bitwise comparisons. Going one step further, a complex combination of predicates can be translated into custom instructions supported by ISP engines, reducing the amount of computation significantly. As such, for the first time in the ISP-for-DB literature, UPP can handle complex predicates on variable-length columns with high efficiency.
This work focuses on accelerating large-scale analytics of human-readable data stored in CSV-like row-wise formats, which are widely supported by modern query engines such as Spark [79], Presto [62], Pandas [50], and Polars [55]. However, our core technical contributions also apply to conventional RDBMSs [10, 19] by generating metadata at data ingestion, HTAP databases [48, 56] by generating metadata when recently written data are moved into a read-optimized view, and columnar layouts [65, 73] to enable fine-grained filtering within data blocks.
Novelty over prior work. Most previous work focuses on pushing more database operations into FPGAs (Table 1). In contrast, UPP pursues an orthogonal direction. It proposes a novel intermediate representation that offers higher generality in filter predicates and data formats, as well as higher efficiency in evaluating complex queries inside ISP engines. As such, this approach makes the most of the heterogeneous computational capabilities of ISP engines and host CPUs.
Summary of results. In end-to-end evaluations using TPC-H queries, UPP provides 1.2×–7.9× faster query performance while reducing system-wide energy consumption by 9%–87%, compared to Apache Spark’s regular query processing.
2 Diagnosis and Our Approach
2.1 Limitations of Existing Approaches
The primary advantage of ISP is its ability to greatly reduce I/O overhead by performing data-intensive operations directly within storage, thus transferring only intermediate or final results to the host CPU. This makes database query processing—particularly filtering—a strong candidate for ISP offloading, as the CPU only needs to read the relevant portions of database tables that meet the query conditions. As such, ISP offers several Benefits: (B1) saves CPU cycles and improves system-wide energy efficiency by processing data-intensive query filtering within lightweight ISP engines; (B2) boosts query performance through parallel evaluation of filter predicates; and (B3) increases effective bandwidth over PCIe/DRAM channels by transferring only output tables (typically smaller than the original tables) while leveraging SSD’s high internal bandwidth.
Building on this insight, a large body of ISP research has investigated methods to accelerate query filtering, as outlined in Table 1. However, with the growing variety of data formats and complexity of queries in modern analytical databases, previous solutions are facing two main Challenges that could significantly hinder the potential benefits (B2) and (B3).
Figure 1:

Hardware implementations of multi-predicate evaluation on a CSV format table with variable-length columns.
(C1) Limited flexibility for data formats. Modern databases employ various table formats, e.g., row-oriented and column-oriented tables with variable-length columns. However, previous ISP solutions support only predefined, fixed-length column formats (Table 1, 3rd column). This is because extracting variable-length columns is hard to parallelize, making it unsuitable for ISP processing.
For example, to execute a Filtering task in an SQL query on a row-oriented table, an ISP engine (F1) reads an input table from SSD, (F2) extracts individual columns by identifying row and column delimiters, and (F3) evaluates individual predicates to check if each row satisfies the WHERE clause. Figure 1a depicts a feasible FPGA implementation for (F2) and (F3), evaluating a conjunction of four equality predicates. To extract columns, an ISP engine temporarily buffers characters from streaming table data until a column or row delimiter (e.g., ‘|’ and ‘\n’, respectively) is encountered. Upon detecting a delimiter, it transfers the buffered characters to an input buffer for the comparison logic (COMP), aligning them with the predicate constants. Then, the extracted columns are compared with the predicate constants, which collectively JUDGE whether the current row satisfies the query condition.
As column extraction, including delimiter detection, must be carried out sequentially on streaming data, it is non-trivial to parallelize (F2) either within a single row or across multiple rows. This results in significantly high query processing latency, which will be further discussed in §2.3. To bypass this challenge, prior ISP approaches rely on pre-defined, fixed-length column formats, enabling column extraction without inspecting the table data. However, this approach limits practical deployment because: (1) columns with varying data types and lengths (e.g., integers, floats, strings) cannot be flexibly supported; (2) the layout of each database table must be pre-processed specifically for the target ISP engines; and (3) unnecessary zero-padding to align the columns can lead to significant storage and memory waste, particularly when data elements have large variations in length.
(C2) Limited scalability for multi-predicate evaluation. With respect to (F3), previous ISP solutions have difficulty evaluating complex multi-predicate conditions—such as disjunctive normal forms (DNFs)—and support only a few simple predicates (Table 1, 4th column). Additionally, they are unable to support DBMS-provided or user-defined functions (e.g., lower, date_trunc). The major reason is the limited computing capacity of ISP engines (Table 1, 6th column). With wimpy embedded cores [28, 38, 72], filtering latency increases linearly with the number of predicates. For dedicated FPGAs [42, 74, 75, 77] or ASICs [31, 33], the maximum number of predicates is constrained by the number of comparators in ISP engines (COMPs in Figure 1a). While a recursive architecture [24] could bypass this constraint by iteratively evaluating multiple predicates using a single comparator, it still faces the same drawback as embedded cores, i.e., increased filtering latency. Furthermore, custom functions cannot all be pre-programmed individually due to the limited resources in ISP engines. Even if they could somehow be programmed dynamically using HLS tools (e.g., partial dynamic reconfiguration [81]), this would cause non-negligible delays in query response time whenever new functions are encountered.
In summary, (C1) and (C2) lead to significantly longer query latency and thus lower processing throughput of FPGA when handling complex filtering queries across various table formats, thereby hindering (B2) and (B3). Note that these challenges are further exacerbated by the growing bandwidth of commodity SSDs, which has surpassed 2 million IOPS (approximately 96 Gbps) [4] as of 2024, necessitating more powerful ISP throughput to match their internal bandwidth. However, pushing massive computational resources into the SSD form factor is not feasible due to the power envelopes set by the PCIe form factor standard [14].
2.2 Intuition: Universal Predicate Pushdown
To address (C1) and (C2) and fully realize (B2) and (B3), we introduce a novel approach called UPP, which offers both the flexibility to handle diverse data formats and the scalability to support multi-predicate evaluations. The key insight is that we can efficiently determine which rows satisfy query conditions through bit-vector comparisons rather than naïve predicate evaluations, leveraging: (1) row vectors and (2) query vectors, collectively referred to as data hash and query hash, respectively. A row vector encodes values from a row, with each column allocated an equal portion of bits (e.g., a 256-bit row vector encoding 16 columns with 16 bits per column). A query vector encodes filter predicates specified in a user query, which is feasible when individual predicates conform to one of the two primitive constructs (eq. (1)). The following sections detail the mechanisms behind this approach.
Type I. Suppose a numeric column ncol can take any value between 0 and 16 million. With a 16-bit row vector per column, we quantize this range into 16 buckets—assigning 1 to the first 1/16th (i.e., 0-1M), 2 to the next 1/16th (i.e., 1M–2M), and so forth—while designating an associated position in the row vector for each column. This allows us to evaluate log10(ncol) between 6 and 6.5 using the quantized representation of ncol without directly referencing its actual values. Note that we are using log only as an example of monotonically increasing/decreasing functions, but the same logic applies to other cases (e.g., dateadd, floor). In order to check if any of the column values—in the first bucket (i.e., the first 1/16th range)— satisfy the filter predicate, it suffices to check the boundaries (i.e., 0 and 1M for the first bucket). There are a few cases: (1) both boundaries are less than “lb”, (2) both boundaries are greater than “ub”, and (3) otherwise. In cases 1 and 2, it is guaranteed that no column values in the bucket can satisfy the predicate due to the property of increasing/decreasing functions. Therefore, ISP can safely discard all the rows associated with the bucket. §5.2 presents a concrete example.
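The bucket-boundary test above can be sketched as follows (a simplified illustration under assumed parameters: 16 equal-width buckets over [0, 16M); UPP's actual quantiles and hash mapping differ):

```python
import math

BUCKETS, LO, HI = 16, 0, 16_000_000
WIDTH = (HI - LO) // BUCKETS               # 1M values per bucket

def bucket_can_satisfy(b, f, lb, ub):
    """For a monotonically increasing f, check only the bucket
    boundaries: if f(hi) < lb or f(lo) > ub, no value inside the
    bucket can satisfy lb <= f(v) <= ub (cases 1 and 2 in the text)."""
    lo, hi = LO + b * WIDTH, LO + (b + 1) * WIDTH
    return not (f(hi) < lb or f(lo) > ub)

f = lambda v: math.log10(max(v, 1))
# log10(ncol) in [6, 6.5] -> ncol in [1M, ~3.16M]; only boundary
# buckets survive (bucket 0 is a false positive at the 1M edge,
# which is safe: no false negatives).
keep = [b for b in range(BUCKETS) if bucket_can_satisfy(b, f, 6.0, 6.5)]
assert keep == [0, 1, 2, 3]
```

All remaining buckets are pruned without ever touching the raw column values.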
Type II. Let tcol be a text column. Many real-world predicates (e.g., startswith(tcol, ‘IEEE’)) can be expressed as contains followed by additional checks because, for example, a string starting with ‘IEEE’ must contain ‘IEEE’ as well. In general, there is no known method for performing an exact containment check for arbitrary words solely based on FPGA-friendly bit vectors. However, UPP offers efficient containment checks based on pre-extracted features encoded in row vectors. Specifically, for each column, UPP identifies frequently occurring words and constructs a dictionary, a common technique in databases [20, 51] that has not been explored in ISP-based database acceleration studies. If a column contains one or more recognized dictionary words, the associated positions are set to 1 in the row vector. Then, contains(tcol, val) can be evaluated solely based on the row vectors if val appears in the dictionary.
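A minimal sketch of this dictionary-based containment check (our own illustration; the 16-bit budget and Python's built-in hash stand in for UPP's per-column bit allocation and hash function):

```python
BITS = 16                                  # assumed per-column bit budget

def hash_bit(token: str) -> int:
    return hash(token) % BITS              # one dictionary word -> one bit

def encode_text_cell(text: str, dictionary: set) -> int:
    """Set a bit for every dictionary word appearing in the cell."""
    rv = 0
    for tok in text.split():
        if tok in dictionary:
            rv |= 1 << hash_bit(tok)
    return rv

def may_contain(rv: int, val: str, dictionary: set) -> bool:
    """Prune only when val is a dictionary word; a set bit may be a
    hash collision, so this admits false positives but never false
    negatives."""
    if val not in dictionary:
        return True                        # cannot prune: defer to host
    return (rv >> hash_bit(val)) & 1 == 1

d = {"IEEE", "order", "person"}
rv = encode_text_cell("DELIVER IN person", d)
assert may_contain(rv, "person", d)        # containment may hold
```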
Conjunctions/Disjunctions. Primitive constructs can be combined using AND and OR operations, making the WHERE clause more complex. To ensure efficient evaluation, we consolidate these individual bit vector comparisons into a single parallel operation within the FPGA, further minimizing overhead.
Implications. Putting it all together, UPP efficiently evaluates complex filter predicates within ISP engines by leveraging simple comparisons between fixed-length row vectors and query vectors. This approach significantly reduces computational overhead within ISP engines while enhancing both inter- and intra-row parallelism during table scans. As depicted in Figure 1b, instead of extracting individual columns and comparing them with predicate constants one by one (Figure 1a), UPP performs row-wise evaluations, minimizing the number of comparisons and thereby increasing intra-row parallelism. Additionally, since the length of row vector is fixed, evaluations across different rows can be easily parallelized, further improving inter-row parallelism. Detailed implementation of the universal predicate compare logic (UP-COMP) is discussed in §6.1.
Limitations. The data hash needs to be prepared prior to actual query processing. However, this process requires only a single scan over the data; thus, it can be performed efficiently (§8.1).
2.3 Performance Impact
Figure 2:

Query latency of original DBMS (SW only), Naïve ISP implementation (Figure 1a), and UPP-enabled ISP implementation (Figure 1b) with varying filter ratios (FRs).
Figure 2 demonstrates the advantages of UPP in evaluating a simple predicate (l_shipinstruct = ‘DELIVER IN PERSON’) on the LINEITEM table from TPC-H. For details on our experimental setup, refer to §7. Without ISP (SW only), it takes 0.4s to read the table from SSD and 7.7s to process the query, whereas the Naïve ISP implementation (Figure 1a) takes over 178s only for ISP processing. This is primarily due to FPGA’s extremely long latency for column extraction. Specifically, first, while both CPU and FPGA must perform byte-granular matching to identify delimiters, a CPU core can execute this much more quickly through aggressive out-of-order execution, using a 160-entry issue queue (IQ) and 200-entry load/store queues (LSQs) in an 8-wide issue pipeline. In contrast, Naïve ISP parses input data sequentially, at a rate of one byte per cycle [3], and struggles to parallelize this process due to its limited flexibility and resource constraints. Second, a CPU core operates at a significantly higher frequency (up to 3.4 GHz) than FPGA (300 MHz), further widening the performance gap between them. This finding highlights the potential performance bottlenecks in prior ISP solutions when handling variable-length columns.
Nevertheless, the SW-only configuration experiences relatively long query processing latency compared to SSD access latency, primarily due to the byte-granular extraction of arbitrary columns. By adopting UPP (Figure 1b), delimiter-based column extraction is entirely bypassed, allowing predicates to be evaluated much faster within the ISP engine. Notably, the resulting query latency varies with the filter ratio; while FPGA-side latency remains nearly constant, result transfer time and CPU-side latency increase linearly with the filter ratio due to the larger volume of selected data. Consequently, UPP outperforms the SW-only configuration when the filter ratio is 0.8 or lower, avoiding delimiter search and thus exploiting inter-row parallelism. We make two observations here. First, to prevent performance degradation from FPGA processing overhead, UPP should bypass FPGA-based filtering when the filter ratio is high. Second, the performance gains of UPP would further increase with more complex predicates, as they reduce the filter ratio while allowing our FPGA to process them within the same timeframe by leveraging intra-row parallelism.
3 UPP: System Architecture Overview
Figure 3:

UPP-enabled DBMS architecture.
3.1 Software and Hardware Architectures
Figure 3 illustrates the overall architecture of UPP, which comprises software components integrated into the DBMS (UPP-DB, §5) and hardware components that enable in-storage processing (UPP-ISP, §6). UPP-DB parses incoming queries and generates UPP-ISP instructions using a new instruction set architecture (ISA) specifically designed for UPP’s ISP engines (UPP-ISA, §4).
UPP-DB functions as a lightweight wrapper around Spark’s core engine, enabling seamless integration with UPP. UPP’s query processing is handled by the three DBMS components: (1) query analyzer, (2) ISP manager, and (3) query runner. The query analyzer comprises: (1-1) Spark’s analyzer (Catalyst), which remains unchanged, and (1-2) a lightweight wrapper for query hash, which operates externally to Spark, requiring no modifications to its core. This design is possible because, at query time, we can dynamically switch the raw data referenced by table names. UPP’s query analyzer, based on Apache Calcite [23], recursively inspects filter predicates appearing in nested select statements. All the predicates satisfying primitive constructs are identified and transformed into a universal predicate, subsequently compiled to a UPP-ISP instruction with operators and fixed-length query hash. The ISP manager coordinates between UPP-DB and UPP-ISP, through Xilinx runtime (XRT) that provides OpenCL-compliant APIs and kernel drivers for (2-1) managing FPGA global memory; (2-2) transferring data (including input parameters) between SSD, global memory, and host memory; and (2-3) loading and invoking ISP kernels. Finally, the query runner enables Spark’s core engine to process data pre-filtered by UPP-ISP. UPP leverages Spark’s flexibility such that a database table can be dynamically registered for any POSIX files.
UPP-ISP is implemented on SmartSSD [11] as FPGA kernels that are written in C/C++ and then converted into RTL kernels (.xo) using Vitis HLS toolchain. Each kernel is compiled into an executable device binary (.xclbin), which defines custom hardware operating within the programmable logic (PL) region of FPGA, and then loaded into FPGA. After setting input parameters and transferring the database table from SSD to FPGA memory, UPP-DB triggers ISP functions. Finally, the output tables are transferred to host memory for further processing. SmartSSD comprises three main components: (1) NVMe SSD, (2) FPGA with global memory, and (3) a PCIe switch (see Table 4 for details). It features direct data transfer between SSD and FPGA DRAM through the PCIe switch, mitigating I/O overhead by bypassing the host DRAM.
Table 2:
| OP | Description |
|---|---|
| INCL | Returns a bit vector indicating whether each RV contains all bits set in QV |
| OVLP | Returns a bit vector indicating whether each RV contains at least one bit set in QV |
| AND | Logically ANDs bit vectors resulting from INCLs and/or OVLPs |
| OR | Logically ORs bit vectors resulting from INCLs and/or OVLPs |

| Example query | UPP-ISP instruction |
|---|---|
| (Col0 EQUAL N0) ∧ (Col1 EQUAL N1) ∧ ... | RVs INCL QV |
| (Col0 >= N0) ∧ (Col0 <= N1) | RVs OVLP QV |
| (Conj0(EQUAL)) ∨ (Conj1(EQUAL)) ∨ ... | (RVs INCL QV0) OR (RVs INCL QV1) OR ... |
UPP instruction set architecture.
* RV: row vector, QV: query vector, Conj: conjunction
3.2 Workflow
Prior to query processing, UPP-DB prepares two metadata tables for each database table (§5.1): (1) Meta-DB for UPP-DB to generate UPP-ISP instructions (e.g., hash functions used for individual columns), and (2) Meta-ISP for UPP-ISP to perform filtering (e.g., data hash and row lengths). Upon receiving a user query, UPP-DB generates and offloads UPP-ISP instructions using Meta-DB (§5.2). UPP-ISP then performs table scan (§6.1) and pruning (§6.2) on the database table using Meta-ISP, transferring the filtered table to UPP-DB for further query processing. Host CPUs execute unmodified queries on pre-filtered data, producing correct results as validated in §5.3. To mitigate potential ISP-related overhead, a two-step selective kernel bypassing scheme is proposed in §5.4.
4 UPP: Instruction Set Architecture
UPP-DB compiles combinations of primitive constructs into UPP-ISP instructions, leveraging operators defined by the UPP-ISA (Table 2). The INCL (inclusion) operator verifies whether each row vector (RV) satisfies all conditions encoded in a query vector. For example, the UPP-ISP instruction RVs INCL 00100101 evaluates multiple AND-connected equality-like predicates encoded as the query vector 00100101 on individual row vectors. Here, row vectors like xx1xx1x1 satisfy the predicates, with x denoting a “don’t care” value. An OVLP (overlap) operator verifies whether each row vector satisfies any condition encoded in a query vector. For instance, a range predicate with hashed minimum and maximum constants 00100000 and 00001000, respectively, is transformed into RVs OVLP 00111000, where row vectors like xx1xxxxx, xxx1xxxx, and xxxx1xxx satisfy the predicates. AND and OR operators are used to connect multiple INCLs or OVLPs, accommodating various combinations of primitive constructs such as DNFs. As such, for each row vector, complex combinations of primitive constructs are reduced to a smaller number of fixed-length, bit-wise operations (INCLs and OVLPs) followed by 1-bit logical operations (ANDs and ORs). This facilitates their complexity-effective and faster evaluations within ISP engines while exploiting both inter- and intra-row parallelisms.
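On integer bit vectors, each operator in Table 2 reduces to one bitwise AND plus one comparison; a minimal sketch (our rendering, not UPP's FPGA code):

```python
def incl(rv: int, qv: int) -> bool:
    """RV satisfies all conditions encoded in QV."""
    return rv & qv == qv

def ovlp(rv: int, qv: int) -> bool:
    """RV satisfies at least one condition encoded in QV."""
    return rv & qv != 0

assert incl(0b00110101, 0b00100101)        # matches pattern xx1xx1x1
assert not incl(0b00100001, 0b00100101)    # one required bit missing
assert ovlp(0b00010000, 0b00111000)        # falls in the OVLP window
# A DNF such as (RVs INCL QV0) OR (RVs INCL QV1) becomes 1-bit logic:
rv = 0b01000010
assert incl(rv, 0b01000000) or incl(rv, 0b00000100)
```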
5 UPP: Database Layer Design
5.1 Metadata Generation at Data Ingestion
UPP’s predicate evaluations rely on data hash, generated for each table during data ingestion, prior to query processing. A primary property we leverage is that for the two primitive constructs (§1, §2.2), we can effectively capture the inherent properties of both numeric and string columns using bit vectors, which can be used during UPP-ISP’s filtering. For each row of a table, a size-K bit vector (e.g., K = 256), called a row vector, is created, as described below.
Data hash for a single column. UPP represents classes of a column value using a row vector, where those classes carry sufficient information for evaluating the satisfiability of the two primitive constructs. Specifically, our approach extracts those classes for numeric and text columns as follows. (Numeric) We calculate which 1/Kth quantile a value belongs to. As described in §2.2, the column value is bounded by the lower/upper bounds of the associated quantile. An integer indicating the quantile is then hashed into one of positions 1, 2, …, K, turning on the corresponding (exactly one) bit in the row vector. See §5.4 for the analysis of the false positive rate. (Text) We identify tokens in a text value, where a token is a word without whitespace (e.g., ‘order’, not ‘or_der’). These tokens are hashed into one of positions 1, 2, …, K, turning on the corresponding (one or more) bits in the row vector. Multiple bits are turned on if a text value consists of two or more tokens (e.g., ‘hello world’). These hashed tokens are recorded as part of the metadata (Meta-ISP), which is used during filtering inside ISP engines.
To identify those tokens used for hashing, UPP employs SpaceSaving, an online one-pass algorithm for mining frequent items [51]. This allows UPP to dynamically construct a dictionary consisting of dataset-aware common words, which becomes part of Meta-DB. The number of words in the dictionary (e.g., 1K or 10K) is independent of the hash bits (e.g., 16) used for a column, and having more words in the dictionary does not influence false positive rates. This is because while many words in the dictionary may share a common hash bit, a collision does not occur if those words do not appear in a cell. That is, only when the words in the same cell are hashed into the bit we are looking at, false positives occur, and its chance is proportional to the number of words in each cell, not the size of the dictionary. We analyze the false positive rate in §5.4. At query time, UPP’s pruning is employed only when text predicates involve those mined dictionary words, to prevent false negatives. This means that using a dictionary with many more words than the number of hash bits for a cell (e.g., 16) can be effective.
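For reference, the SpaceSaving pass [51] can be sketched as follows (a textbook rendering under an assumed counter budget k, not UPP's implementation):

```python
def space_saving(stream, k: int) -> dict:
    """One-pass frequent-items mining: keep at most k counters;
    on overflow, the new item inherits (and may overestimate by)
    the evicted minimum count."""
    counters = {}                          # word -> (count, error bound)
    for w in stream:
        if w in counters:
            c, e = counters[w]
            counters[w] = (c + 1, e)
        elif len(counters) < k:
            counters[w] = (1, 0)
        else:
            victim = min(counters, key=lambda x: counters[x][0])
            c, _ = counters.pop(victim)
            counters[w] = (c + 1, c)       # count inherited from victim
    return counters

top = space_saving("order order ship order ship cat".split(), k=2)
assert "order" in top and top["order"] == (3, 0)
```

The surviving high-count words form the dataset-aware dictionary stored in Meta-DB.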
Mining frequently used tokens takes additional preprocessing. UPP performs it efficiently using a fixed number (i.e., 1,000) of randomly chosen blocks. The block count does not scale with raw data size by design because, according to the theory of sampling (e.g., the central limit theorem [57]), the accuracy of sampling-based estimates can be bounded by \(O(1/\sqrt {n})\), where n is the (absolute) sample size. For example, 1,000 4 KB blocks contain about 27 K rows of the TPC-H LINEITEM table. Let the word cat appear in a row with (only) 0.1% probability. Then, the chance that cat does not appear in 27 K randomly chosen rows is \(1.9 \times 10^{-12}\) (i.e., very small). See §8.1 for empirical preprocessing time.
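The probability in this example can be checked directly (all values are taken from the text):

```python
p_word = 0.001        # the word appears in any given row with 0.1% probability
n_rows = 27_000       # rows covered by 1,000 randomly chosen 4 KB blocks

# Chance the word is absent from every sampled row, assuming independence;
# this lands on the order of 1.9e-12, matching the estimate in the text.
p_miss = (1 - p_word) ** n_rows
```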
Data hash for multiple columns. The above procedure is applied to each cell, turning on bits only within that column’s exclusive partition of a size-K row vector (e.g., 256 bits per row). For example, in the TPC-H LINEITEM table, which contains 16 (= W) columns, each column (within each row) is assigned 16 (= K/W) dedicated bits. Even 16 bits per cell is effective in pruning unnecessary data because false positive rates are low, as discussed in §5.4 and demonstrated in §8.1. This partitioning strategy has two advantages. First, it reduces false positive rates because values in other columns cannot conflict with the column we are examining. Second, it simplifies the FPGA design, because the number of pre-programmed logic gates does not need to scale with the number of columns.
Metadata and supported predicates. The above data hash is designed to indirectly evaluate the filters based on primitive constructs (eq. (1)). Specifically, a WHERE clause must be in a disjunctive normal form where each conjunction must include at least one primitive construct defined through our provided base classes (see §5.2 for an example). Non-conforming filters involving non-monotonic functions (e.g., sin (col) < 0.1) may still appear; however, UPP does not offload them.
5.2 Predicate Hashing at Query Time
Identifying primitive constructs. Let a WHERE clause be in disjunctive normal form (i.e., ORs of conjunctions), with Listing 1 as an example; we use individual predicates to refer to the expressions connected via logical operators (i.e., ORs of ANDs). A check is performed, as described below, for each conjunction (i.e., ANDs of individual predicates). If at least one individual predicate within each conjunction satisfies the primitive constructs, UPP-DB generates a UPP-ISP instruction because result correctness can be guaranteed (discussed below). Otherwise, UPP-DB falls back to regular query processing without using any ISP functionality.
Check. Using a generic SQL parser [23], UPP-DB evaluates whether each individual predicate can be expressed using one of the two primitive constructs. There are two cases: system-defined and user-defined. (System-defined) If an individual predicate matches a system-predefined format, UPP-DB recognizes it automatically. For example, we know that equality (e.g., lines 2, 5, 7, 10, 12, and 15) and between (e.g., lines 3, 4, 8, 9, 13, and 14) comparison operators are mathematically identical to “ ∈ [lb, ub]”, and that any composition of monotonically increasing/decreasing functions (e.g., log , exp , acos) is also monotonically increasing/decreasing. (User-defined) Users can add new user-defined predicates by overriding a provided base class without re-programming the FPGA. For example, users create this class:

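The original code listing did not survive extraction. The following is a hypothetical sketch consistent with the surrounding description: the method names (`argument`, `column`) are taken from the next paragraph, the base-class shape is an assumption, and an identity mapping stands in for a real monotone date conversion.

```python
class MonotonePredicate:
    """Assumed base-class shape for predicates f(col) = literal, f monotone."""
    def argument(self, literal):   # map the query-side literal
        raise NotImplementedError
    def column(self, value):       # map a column-side boundary value
        raise NotImplementedError

class ToLunarDatePredicate(MonotonePredicate):
    # Stand-in for ToLunarDate(usa_date) = '2025-01-29'; any monotonically
    # increasing f preserves order, so comparing mapped boundaries is safe.
    # Identity is used here purely as a placeholder monotone conversion.
    def argument(self, literal):
        return literal
    def column(self, value):
        return value

pred = ToLunarDatePredicate()
# Suppose a hash bit's bucket covers ['2024-12-01', '2024-12-31']; the mapped
# literal exceeds the mapped upper bound, so rows with that bit cannot match.
prune = pred.argument('2025-01-29') > pred.column('2024-12-31')
```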
UPP invokes the above functions at query time to examine which bits must be on to satisfy a predicate (e.g., ToLunarDate(usa_date) = ‘2025-01-29’). This process is fast since we only need to invoke the functions m + 1 times for a column represented by m bits. Specifically, it is sufficient to check only the boundaries (of the range represented by each hash bit), leveraging the property of a monotonically increasing function. For example, suppose the second bit (out of m) is on if a row contains a value between ‘2024-12-01’ and ‘2024-12-31’. Since argument(‘2025-01-29’) is strictly greater than column(‘2024-12-31’) based on their return values, a row with the second bit on cannot satisfy the predicate and thus can be safely pruned. A similar base class is provided for text filters. Defining such custom functions is common in Spark, as its interface is akin to programming rather than an SQL-only terminal.
Primitive constructs to UPP OPs. Each extracted primitive construct is converted to a lower-level operation (INCL or OVLP), as follows. (Numeric) An equality-like predicate (e.g., col = 3) is converted to INCL with a single bit on, because we aim to identify rows with that specific bit activated. Conversely, a range-like predicate (e.g., l_quantity >= 10 AND l_quantity <= 10 + 10) is converted to OVLP with one or more bits on, which allows us to include values within a specified range, represented by multiple bits. (Text) The second primitive construct aims to identify rows containing a specific search token (e.g., l_shipmode = ‘MAIL’). To identify such rows using a row vector, we convert the primitive construct into INCL by turning on the query vector’s bit position generated by the same hash function used during data hash generation.
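The conversion can be sketched as follows, reusing the same quantile-boundary and token-hash mappings assumed for data-hash generation (function names and the CRC hash are illustrative assumptions):

```python
import bisect
import zlib

def equality_to_incl(value, quantile_bounds, m):
    """col = value  ->  INCL with the single bit of value's quantile."""
    q = bisect.bisect_right(quantile_bounds, value)
    return ('INCL', 1 << (q % m))

def range_to_ovlp(lo, hi, quantile_bounds, m):
    """lo <= col <= hi  ->  OVLP with one bit per quantile the range touches."""
    q_lo = bisect.bisect_right(quantile_bounds, lo)
    q_hi = bisect.bisect_right(quantile_bounds, hi)
    qv = 0
    for q in range(q_lo, q_hi + 1):
        qv |= 1 << (q % m)
    return ('OVLP', qv)

def token_to_incl(token, m):
    """col = 'MAIL'  ->  INCL with the token's hash bit set."""
    return ('INCL', 1 << (zlib.crc32(token.encode()) % m))
```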
Figure 4: Translation from filter predicates to a UPP-ISP instruction and its pushdown to ISP engines.
UPP instruction generation. Given the UPP OPs (i.e., INCLs and OVLPs) extracted from individual predicates, UPP-DB now compiles their (possibly nested) conjunctions/disjunctions, as described from Step 1 to Step 2 in Figure 4. Specifically, let an SQL WHERE clause be ORs of conjunctions, where each conjunction consists of an arbitrary number of individual predicates connected via ANDs. The UPP OPs in each conjunction are merged into a single INCL and two OVLPs (the number of OVLPs can easily be increased in FPGA by re-programming) using the following rules (R1 and R2).
(R1) Convert multiple INCLs into one by ORing query vectors
(R2) Choose two most selective OVLPs
In Figure 4 (Listing 1), R1 is applied for converting the green predicates (lines 2, 5, 7, 10, 12, and 15), while R2 is used for converting the purple predicates (lines 3, 4, 8, 9, 13, and 14). For the first conjunction, lines 2 and 5 are identified as text-type primitive constructs for equality comparison, and thus converted into two INCLs and then into one INCL by bitwise ORing their query vectors—RVs INCL QV00 in Step 2. Lines 3 and 4 are identified as numeric-type primitive constructs for range comparison and thus converted into two OVLPs—RVs OVLP QV10 and RVs OVLP QV20, respectively. Once all three conjunctions are converted, one UPP-ISP instruction is generated for the 15 predicates in Listing 1, including an iteration count, an operation mask, and a query hash. Note that TPC-H Q19 is one of the most complex queries expressed in disjunctive normal form, suggesting that the majority of predicates in typical queries can be translated into one UPP-ISP instruction. In R2, choosing the most selective OVLPs (i.e., the conditions that the fewest rows satisfy) helps UPP-ISP discard more irrelevant data; these estimations—identifying the most selective predicates—can be performed inside the DBMS using small samples [43, 53] or heuristics [61].
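Rules R1 and R2 within one conjunction can be sketched as follows. The selectivity estimate is assumed to come from the DBMS (via samples or heuristics, as the text notes); the function shape is illustrative.

```python
def merge_conjunction(ops, selectivity):
    """Apply R1/R2 to one conjunction.
    ops: list of ('INCL' | 'OVLP', query_vector) pairs.
    selectivity(qv): estimated pass fraction of an OVLP (lower = more selective)."""
    incl_qv = 0
    ovlps = []
    for kind, qv in ops:
        if kind == 'INCL':
            incl_qv |= qv              # R1: merge INCLs by ORing query vectors
        else:
            ovlps.append(qv)
    ovlps.sort(key=selectivity)        # R2: keep the two most selective OVLPs
    return incl_qv, ovlps[:2]
```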
5.3 Correctness Guarantee
UPP’s offloading (i.e., predicate pushdown to ISP) is an extension of an approach widely used in DBMSs. The goal is to modify a query plan into an alternative form that runs more efficiently; the idea appears initially in [61] and is then formally presented by Maier [49]. One idea is to move filters (by switching the order of operations) closer to the data source, which is the SSD in our case. This does not affect results. For example,
\begin{align*}\sigma _{\text{cond}_{-}\!\text{S}}(T \bowtie S) &= \sigma _{\text{cond}_{-}\!\text{S}} (\sigma _{\text{injected}_{-}\!\text{S}} (T \bowtie S)) \\ &= \sigma _{\text{cond}_{-}\!\text{S}} (T \bowtie \sigma _{\text{injected}_{-}\!\text{S}} (S))\end{align*}
where σcond is a filter using “cond” as a predicate, and ⋈ stands for a join between two tables. First, \(\sigma _{\text{injected}_{-}\!\text{S}}\) is injected by UPP; the rows that satisfy “injected_S” are a superset of those that satisfy “cond_S”. \(\sigma _{\text{injected}_{-}\!\text{S}}\) is then pushed down to ISP, exploiting the commutativity between selection and join.
Here, “cond” must involve a single table and operate on individual rows (not on a group of rows). For example, TableA.colA - TableB.colB <= 10 or TableA.colA - TableA.nextrow.colA <= 5 cannot be pushed down because they involve multiple tables or multiple rows, respectively.1 The predicates constructed through our primitive constructs always satisfy the condition; thus, UPP can offload them. The system uses these rules internally to optimize query processing.
5.4 Effectiveness and Optimization
Bounded false positive. The data, pre-filtered by UPP-ISP, is a superset of an exact match, meaning the superset may contain false positives. Let size-K row vectors (e.g., K = 256) be used for a table with W columns (e.g., W = 32), meaning each column uses m = K/W bits (e.g., 8 = 256/32). We focus on the case where each column value turns on one of those m bits. There are two cases in which false positives occur (i.e., numeric and text). (Numeric) Let a query choose a range using OVLP. Irrelevant rows may be selected at the boundaries since we are using finite bits (m) to represent numeric values (of infinite cardinalities). Specifically, 1/m rows may be included at each lower/upper bound, meaning the false positive rate cannot be greater than 2/m. For equality (not range), the false positive rate reduces to 1/m (e.g., 12.5% for m = 8). (Text) Let each column value include d tokens, turning on d out of m bits randomly. False positives occur when unwanted tokens are hashed into the same bit. Such a chance is upper-bounded by d/m (e.g., 25% for m = 8, d = 2).
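A quick simulation illustrates the text-case bound, assuming tokens hash uniformly into the m bits (this assumption is ours; the analytic expectation is \(1-(1-1/m)^d \approx 0.234\), below the d/m = 0.25 upper bound from the text):

```python
import random

random.seed(0)
m, d, trials = 8, 2, 100_000
hits = 0
for _ in range(trials):
    # bits set by d unwanted tokens in a cell
    row_bits = {random.randrange(m) for _ in range(d)}
    # a query token (absent from the cell) is a false positive if its bit collides
    if random.randrange(m) in row_bits:
        hits += 1
rate = hits / trials   # empirical false positive rate, stays below d/m
```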
Optimization. ISP offloading may result in suboptimal performance, particularly if the filter ratio is high (§2.3). To address this, we apply a two-stage filter-bypassing optimization. UPP selectively bypasses ISP when the estimated filter ratio of a target table (first stage, in UPP-DB) or a 512 MB chunk (second stage, in UPP-ISP) suggests that ISP processing would provide minimal benefit. In the first stage, UPP-DB reverts to the regular data path (i.e., bypasses ISP) if the estimated filter ratio is close to one, indicating that ISP would prune almost no data. This selectivity estimation is based on sampling techniques [43, 53]. The second stage operates within UPP-ISP by splitting the ISP function into two separate kernels: a table scan kernel and a pruning kernel. The pruning kernel is invoked only for 512 MB data chunks where the table scan kernel detects a high proportion (more than 80%) of irrelevant data. Our evaluation shows that different chunks within a table typically exhibit similar filter ratios, meaning that most rows are either consistently selected for ISP processing or bypassed. Notably, the table scan kernel evaluates universal predicates to identify relevant rows—a fast and fundamental step for UPP—while also allowing the filter ratio to be efficiently retrieved during this process (§6.1).
Figure 5: UPP-ISP implementation overview.
6 UPP: ISP Layer Design
Figure 5 outlines the detailed workflow of UPP-ISP. Once a UPP-ISP instruction is generated for a given query, ① UPP-DB pushes it down along with other input parameters (e.g., query vectors) to the ISP engine. Then, ② UPP-DB transfers the corresponding Meta-ISP from SSD to FPGA DRAM and invokes a table scan kernel that estimates the filter ratios (i.e., row-level selectivity) for individual 512 MB database table chunks by evaluating the row vectors in Meta-ISP against the query vectors. ③ If the filter ratio is below a predefined threshold, ④ UPP-DB transfers the relevant chunk from SSD to FPGA DRAM and invokes a pruning kernel that filters out rows deemed irrelevant based on the table scan results. Finally, ⑤ the selected rows are transferred to host DRAM. Figure 5 also shows the implementation of the ISP engine’s table scan (upper) and pruning (lower) functions, which are discussed in detail below.
6.1 Table Scan Kernel
The table scan kernel scans a database table using row vectors (in Meta-ISP) and query vectors, and generates filter ratios with a valid row vector (i.e., input to the pruning kernel).
Software interface. UPP-DB invokes the table scan kernel after setting up the input arguments listed below.
• Data hash: a part of Meta-ISP, comprising the row vectors associated with individual rows in the database table.
• Iteration count: the number of iterations needed to evaluate a universal predicate within UP-COMP, derived from predicate hashing.
• Operation mask: a bit vector specifying the evaluation operations (i.e., INCLs and/or OVLPs) across iterations.
• Query hash: a sequence of query vectors used in individual evaluation operations.
• Output: the destinations of the table scan kernel outputs.
Kernel implementation. When a table scan kernel is invoked (Figure 5), (1) the data hash (i.e., row vectors) and a UPP-ISP instruction (i.e., iteration count, operation mask, and query vectors) are read from FPGA DRAM into the metadata buffer (M-buf) and instruction buffer (I-buf), respectively. Then, (2) the instruction is executed by the evaluation logic—a cluster of UP-COMPs—examining individual row vectors using the query vectors. At the end, (3) the evaluation logic produces the filter ratio (i.e., #valid rows/#all rows), which is subsequently read by the host CPU to determine whether to invoke the pruning kernel. The evaluation logic also generates a valid row vector indicating the rows that satisfy the primitive constructs encoded in the query vectors. This bit vector is later used by the pruning kernel to compile a collection of the valid rows.
As both row vector and query vector have a fixed size, predicate evaluations can be effectively parallelized and distributed across multiple UP-COMPs. Step 3 in Figure 4 and Algorithm 1 outline the behavior of UP-COMPs, which iteratively evaluate query vectors. We assume that a UP-COMP has one INCL and two OVLPs, which is easily adjustable (§7). In each iteration, UP-COMP concurrently assesses INCL and/or OVLP from a single conjunction (lines 4–18). To evaluate the INCL in a conjunction of the Nth row, UP-COMP executes a bitwise AND operation between RVN and QV0X (line 9), and then compares the result with QV0X (line 11). If they are the same, it confirms that RVN has all the bits set in QV0X and this row satisfies the INCL condition. For the OVLP, UP-COMP performs a bitwise AND operation between RVN and QV1X|2X (line 9), followed by a bitwise OR operation on the result (line 13). If the result is true, this row satisfies the OVLP because RVN has at least one bit set in QV1X|2X. In the end, these evaluation results are combined through an AND operation (the blue box and line 16) to get the decision for the current conjunction. In each iteration, conjunction-level decisions are aggregated through an OR operation (the orange box and line 19) to get the final decision for the given universal predicate. The individual evaluations can be performed in parallel for different rows using multiple UP-COMPs. Although not explicitly depicted, UP-COMP also has counters tracking the numbers of valid and overall rows to calculate the filter ratios.
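As a software model of one UP-COMP evaluation (Algorithm 1), the INCL/OVLP checks and the DNF combination can be sketched in a few lines; the real logic runs these as parallel bitwise operations on FPGA, and the function names here are illustrative.

```python
def incl(rv, qv):
    """Row passes INCL iff the row vector contains every bit set in qv."""
    return (rv & qv) == qv

def ovlp(rv, qv):
    """Row passes OVLP iff the row vector contains at least one bit of qv."""
    return (rv & qv) != 0

def evaluate(rv, conjunctions):
    """DNF decision: AND within each conjunction, OR across conjunctions.
    Each conjunction is a list of (op, query_vector) pairs."""
    return any(all(op(rv, qv) for op, qv in conj) for conj in conjunctions)
```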
6.2 Pruning Kernel
The pruning kernel generates a filtered table using the valid row vector (output of the table scan kernel) and row lengths (in Meta-ISP).
Software interface. UPP-DB triggers the pruning kernel after setting up the following input arguments.
• Database table: an input table for pruning.
• Row lengths: a part of Meta-ISP, indicating the lengths of individual rows in the input database table.
• Valid row vector: a bit vector indicating the rows that satisfy the primitive constructs encoded in the query vectors.
• Output: the destination of the filtered table.
Kernel implementation. When a pruning kernel is triggered (Figure 5), (4) the valid row vector and row lengths are read from FPGA DRAM to (5) compute the addresses of valid rows in the input (database) table and the output (filtered) table. Then, (6) 64 B input table chunks containing valid rows are read from FPGA DRAM into the table buffer (T-buf), and (7) valid rows are copied to the output buffer. (8) When the 64 B output buffer is full, it is written to the FPGA DRAM location specified by the output address.
Algorithm 2 details the pruning kernel. Iterating over the valid row vector, it advances the input address until a valid row is encountered (line 3). If the current row is valid, its length is written to remain_row_len, a register indicating the remaining length of the current row to be copied (line 5). Given that our FPGA reads/writes data blocks from/to FPGA DRAM in 64 B chunks [2], copying a valid row may span multiple sub-iterations, depending on its layout in the input/output buffers. In each sub-iteration, the memory indexes and offsets of the 64 B input/output data blocks are derived from their addresses (lines 7–10). The copy length (copy_len) in each sub-iteration is set to the smallest of remain_row_len, the valid section of the input buffer, and the remaining portion of the output buffer (lines 11–12). Then, the data block is read from FPGA DRAM into the input buffer (line 13), and the specified range of data is extracted and appended to the output buffer (line 14). When the output buffer is entirely filled with valid row data, it is written to the FPGA DRAM location specified by the output address (lines 15–16). At the end of each sub-iteration, remain_row_len and the input/output addresses are advanced by copy_len (lines 17–19). At the end of pruning, the final segment of valid row data, if any, is written to memory (lines 20–23).
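A simplified software model of the pruning kernel's addressing logic can clarify the sub-iteration structure. This is a sketch only: the 64 B granularity mirrors the FPGA DRAM access size, but buffers and addresses are modeled with plain Python byte strings rather than hardware registers.

```python
BLOCK = 64  # FPGA DRAM access granularity in bytes

def prune(table: bytes, row_lens, valid):
    """Copy valid rows out of a row-concatenated table, one 64 B block read
    per sub-iteration; invalid rows are skipped by advancing the address."""
    out = bytearray()
    in_addr = 0
    for length, ok in zip(row_lens, valid):
        if ok:
            remain, src = length, in_addr
            while remain:                                  # sub-iterations
                blk, off = divmod(src, BLOCK)              # block index/offset
                copy_len = min(remain, BLOCK - off)
                block = table[blk * BLOCK:(blk + 1) * BLOCK]  # one DRAM read
                out += block[off:off + copy_len]
                remain -= copy_len
                src += copy_len
        in_addr += length                                  # skip invalid rows
    return bytes(out)
```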
Figure 6: TPC-H query processing latencies and filter ratios on baseline analytical DB (CPU) and UPP-enabled DB (UPP). We use 4 CPU threads (cores) with one SmartSSD, 100 GB TPC-H datasets, and 256-bit row/query vectors.
Figure 7: Relative energy consumption and CPU utilization of UPP over baseline analytical DB (CPU).
Table 3: UPP’s resource usage on Xilinx KU15P FPGA.

| Element type | LUTs | BRAM | URAM |
|---|---|---|---|
| ISP kernels (scan/pruning) | 118,161 (22.6%) | 183 (18.6%) | 0 (0%) |
| Platform (e.g., memory ctrl.) | 143,624 (27.5%) | 330 (33.5%) | 12 (9.4%) |
7 Experimental Setup
Implementation. UPP-DB is built on Apache Spark’s SQL engine, with metadata generation and query analyzer/runner implemented in Java and Python. UPP-ISP is implemented within SmartSSD using HLS tools. Our current implementation includes two UP-COMPs, each capable of evaluating a DNF on a 256-bit row vector per cycle because our FPGA can read 512-bit data from its local DRAM (i.e., global memory) per cycle. Each UP-COMP assesses up to three OR-connected conjunctions, with each conjunction comprising up to four predicates (i.e., one INCL and three OVLPs). As detailed in Table 3, UPP utilizes 50.1% of LUTs and 52.1% of BRAM within the Xilinx KU15P FPGA, including logic for ISP kernels and essential platform components such as the DDR memory controller [13]. Notably, the number of UP-COMPs and the INCL/OVLP operation ratio can be easily adjusted through parameters in the query analyzer (UPP-DB) and table scan kernel (UPP-ISP).
Datasets/Queries. We use the 100 GB TPC-H dataset (i.e., scale factor of 100) with the standard 22 TPC-H queries [16]. The literals (e.g., ‘1998-12-01’) inside a predicate (e.g., l_shipdate <= ‘1998-12-01’) are set to have filter ratios around 20%. Those literals are variable components in the official documentation [16], and adjusting them to evaluate specific aspects of a newly proposed system is common in the literature [45, 53, 78, 80]. The remaining parts of the queries (e.g., joins, subqueries, and views) remain unchanged. As shown in Figure 2, our performance gains depend on the values of these literals, since if no tuples are filtered out, our in-storage filtering cannot provide bandwidth savings. Queries with such modifications are explicitly marked with asterisks in Figure 6 and Figure 7 (e.g., Q1*).
Evaluation. The details of our testbed configuration are presented in Table 4. We evaluate two approaches: (1) CPU: an analytical DB based on Apache Spark running on CPU cores, and (2) UPP: a UPP-enabled DB utilizing ISP functionality. Due to the proprietary nature and limitations of prior ISP prototypes—supporting only simple filter predicates on fixed-length columns—an apples-to-apples comparison is not feasible. Instead, we assess the coverage and analytical performance of UPP against a representative ISP solution (§8.3). Given the constrained FPGA resources in SmartSSD and prior demonstrations using similar or identical FPGAs—including real-world cloud-native RDBMS implementations [24, 36, 41, 59, 60, 74, 77]—we set the CPU-core-to-SmartSSD ratio at 4:1. Notably, due to the lack of internal cooling systems and the need to support more powerful ISP capability, vendors recommend using SmartSSDs in server configurations that ensure efficient cooling and provide multiple NVMe slots. For instance, [1] supports up to 24 SmartSSDs alongside dual-socket CPUs with 16–128 cores. We measure SmartSSD-isolated power, system-wide power, and CPU utilization using xbutil [18], ipmitool [7], and sar [12], respectively.
Table 4: Testbed configuration.

| Host machine | |
|---|---|
| OS (kernel) | Ubuntu 20.04.6 LTS (Linux kernel v5.4) |
| CPU | Intel® Xeon® Platinum 8380 (Ice Lake, 40 cores), Hyper-Threading/Turbo Boost disabled |
| Main memory | 8 × 32 GB DDR4-3200 DRAM (256 GB, 8 channels) |
| SmartSSD Computational Storage Drive (CSD) | |
| Host I/F | Single-port PCIe Gen3 ×4 (U.2 form factor) |
| NVMe SSD | Samsung V-NAND® (3.84 TB) |
| FPGA | Xilinx Kintex™ UltraScale+ KU15P, 300 K LUTs, 34.6 Mbit BRAM, 36.0 Mbit URAM, 4 GB DDR4-2400 |
8 Evaluation Results
8.1 TPC-H Evaluation
End-to-end performance. In Figure 6, the stacked bars for each query illustrate the end-to-end query processing latency for regular Spark (CPU) and UPP. We present two time components: (1) latency consumed for storage access and (2) latency explicitly consumed for computation (i.e., total latency − storage-access latency). There are two key contributors to the performance gains. First, UPP-DB’s hashing mechanism achieves low false positive rates. In Figure 6, empty and filled diamonds in each query represent the average filter ratios of the input files without and with UPP capability, respectively. The differences between them indicate false positives in our hashing schemes, which are 0–6 percentage points with 256-bit hashing. This is quite low considering the number of columns in each table and the potential variance of values within each column. That is, while not inspecting each column one by one, UPP successfully detects unnecessary rows with high accuracy. The second reason is that UPP-ISP’s evaluation logic realizes high-throughput table scans by leveraging universal predicates. UPP-ISP can exploit inter-/intra-row parallelism of predicate evaluations while effectively reducing the sizes of tables, thanks to fixed-sized universal predicates and their substantially low false positive rates. As such, in Figure 6, UPP’s storage-access latency (including UPP-ISP and data transfer) increases by only 15%–56%, while compute-only latency decreases by 14%–93%. Q21 shows a relatively low speedup of 1.2 ×, mainly due to the complex join and aggregation operations on multiple tables handled by the host CPU, which cannot benefit from UPP.
Leveraging UPP’s metadata, CPU-based filtering can also benefit from speedups by transforming complex predicate evaluations for variable-length columns into fixed-length bit operations through UPP’s hashing mechanism. This allows CPU cores to exploit their high instruction-level parallelism. However, these speedups would still be lower than those of UPP, which features highly optimized evaluation logic for parallel multi-bit-vector comparisons (UP-COMPs). Moreover, UPP provides additional advantages by offloading the entire filtering process to storage. This not only conserves CPU computational resources but also reduces PCIe and memory bandwidth consumption required for file transfers to the host CPU, with savings inversely proportional to the filter ratio. Notably, since SSD and FPGA communicate via a PCIe interface with a maximum read bandwidth of 3.3 GB/s, SmartSSD cannot fully exploit the SSD’s internal bandwidth advantages—despite its FPGA DRAM bandwidth of 15.4 GB/s—over external bandwidth, which could be 2–4 × higher [25, 41]. This limitation currently restricts UPP from fully leveraging (B3) in §2.1 [58]. With higher internal bandwidth, UPP is expected to achieve even greater speedups by taking full advantage of its optimized evaluation logic.
CPU utilization. As shown in Figure 7, UPP significantly reduces CPU utilization, by 9%–93%. Typically, more CPU cycles are saved with lower filter ratios, e.g., Q15 (93%). This is because parsing and filtering tables with variable-length columns typically takes the CPU a long time, especially as the table size increases. In contrast, UPP offloads most of such operations to UPP-ISP, using hardware modules tailored for faster table scan and pruning, and thus provides higher filtering performance while saving many CPU cycles.
Energy consumption. Figure 7 also shows system-wide energy consumption in serving individual queries, exhibiting a trend similar to CPU utilization. This is because UPP-ISP performs filtering operations with much higher processing efficiency (i.e., consuming much less energy) by using metadata generated at data ingestion and processing it on dedicated ISP engines. Besides the 16%–87% decrease in CPU energy, other system energy consumption is also significantly reduced, by 14%–87%, primarily due to the substantial reduction in data transfer. As a result, UPP yields 9%–87% energy savings.
Overhead analysis. UPP relies on metadata, incurring two types of overheads: preprocessing time and storage cost.
First, creating metadata is a two-step process: (1) creating a dictionary of frequently used words for each string column, and (2) generating data hash. For example, generating a dictionary for the 15 GB ORDERS table takes 0.383 seconds, which is efficient because UPP extracts commonly used words from 1,000 randomly chosen 4 KB blocks. Generating its data hash takes 142.0 seconds: 135.5 seconds for reading/parsing data and 6.5 seconds for generating 64-bit hashes. This is a one-time cost. The generated metadata can be re-used repeatedly for any queries that rely on the tables.
Second, the storage overhead of metadata varies according to the shape and organization of the original database tables as well as the length of hash values. With the 100 GB TPC-H dataset used in our evaluation, UPP’s metadata tables account for 5% (of the PARTSUPP table) to 7% (of the ORDERS table) of the raw table size, assuming 64-bit hash values. In addition, the total size of dictionaries is relatively small compared to the raw data because we do not scale dictionaries with data. For example, the TPC-H LINEITEM table in our experiment is about 74 GB when stored on disk; the table has three multi-character columns, and the three corresponding dictionaries (i.e., word lists) for those columns, each storing 1,000 words, take about 18 KB (6 characters per word on average), which is about \(2.4 \times 10^{-5}\%\) of the table size. Considering falling storage prices as well as the system-wide performance and energy-efficiency gains of UPP, these overheads are justifiable.
Third, online processing is lightweight as it does not involve (large) raw data. For TPC-H Q1 as an example, query analysis, optimization, and instruction generation take 1.1 ms. Offloading the query hash to SmartSSD takes 0.129 ms.
8.2 Sensitivity Study
Core:SmartSSD ratio. We initially set the ratio between the number of CPU cores and SmartSSDs to 4:1 (4C:1S). Figure 8a further explores the potential performance impact of UPP with Q14, assuming a server equipped with varying ISP capabilities, i.e., different numbers of SmartSSDs. Across different ratios, UPP achieves speedups of 2.5 × (8C:1S), 4.3 × (4C:1S), 5.4 × (2C:1S), and 7.5 × (1C:1S), demonstrating that increasing ISP capabilities directly translates into higher performance gains. As the core count decreases, the performance gain increases because filtering latency on large files (e.g., a 75 GB LINEITEM table) becomes a major bottleneck for the CPU. Conversely, as the core count increases, CPU-side filtering approaches the performance of UPP; however, UPP still achieves 37% higher performance at 40C:1S (not depicted in Figure 8a). Notably, UPP also benefits from increased core counts, as it leverages host processing after ISP filtering. Additionally, UPP’s fixed-length, row-wise scan (exploiting both inter- and intra-row parallelism) delivers significantly higher filtering performance than its counterpart.
Hash length. The length of hash values (i.e., row vectors and query vectors) plays a crucial role in UPP’s performance, as it directly impacts the false positive rates during ISP processing. As shown in Figure 8b, with hash lengths of 64 bits, 128 bits, 256 bits, and infinite (Ideal), the filter ratio (speedup) decreases (increases) as 25% (2.9 ×), 12% (3.3 ×), 6% (4.3 ×), and 0% (5.2 ×), respectively. This demonstrates that UPP achieves consistently higher filtering accuracy with longer hash lengths, directly leading to higher query processing performance. Notably, even with a 64-bit hash length, UPP achieves an impressive 2.9 × speedup, while a 256-bit hash length delivers performance comparable to Ideal. While the TPC-H datasets do not contain tables with over 100 columns, such cases are common in practice. To maintain the same hash length per column (and thus preserve filter ratios), the total hash length must scale accordingly. This ensures that the relative storage overhead of the data hash remains unchanged.
Figure 8: Impact of Core:SmartSSD ratio and hash length on performance and filter ratio while running Q14.
8.3 Comparison to Prior Work
To quantitatively understand UPP’s benefits, we compare its predicate coverage and performance with POLARDB [24], using 6.4 million SQL queries collected from open-source repositories. Using the GitHub API [5], we collected 141,136 SQL files containing 21 million (i.e., 21,074,824) correct/broken SQL statements, from which we could successfully parse 6,425,413 statements using Apache Calcite [23]. Unsuccessful parsing is due to lexical errors (or broken characters), non-standard SQL dialects (e.g., variables), unexpected inline comments, etc. We then discarded DDL/DML (i.e., non-SELECT statements such as ROLLBACK and GRANT)—which this work does not concern—obtaining 1,714,922 SELECT queries. From them, we extracted 623,849 predicates (i.e., WHERE clauses).
In short, POLARDB can directly process predicates in DNFs (same as UPP); however, its predicate evaluation is limited to numeric comparisons (e.g., =, ≠, <) and null checks (e.g., is null).
Coverage. Table 5 shows the result with a breakdown by operator/function types (i.e., numeric, text, and others). POLARDB does not support most mathematical functions (simply because the logic was not implemented), datetime functions, or pattern-matching operations. In contrast, UPP can support most of them without any changes to the FPGA logic because UPP is designed to support a wide range of predicates through software-only changes. Nevertheless, a few important predicate types are supported by neither POLARDB nor UPP, such as extracting integers from strings (e.g., locate) and subquery evaluation (e.g., exists (select...)). UPP does not aim to support those predicates.
Table 5:
| Numeric operations | POLARDB | UPP |
|---|---|---|
| Comparisons (e.g., <, !=) | 6/7 | 7/7 |
| Math functions (e.g., acos, sqrt) | 0† /6 | 6/6 |
| Datetime functions (e.g., timestampadd) | 0† /3 | 3/3 |
| Text operations | POLARDB | UPP |
| Pattern matching (e.g., like) | 0/5 | 5*/5 |
| String functions (e.g., trim) | 0/2 | 2/2 |
| Others | POLARDB | UPP |
| Null-aware checks (e.g., is null) | 2/4 | 4/4 |
| String ↔ Integer (e.g., cast, length) | 0/5 | 0/5 |
| Subquery (e.g., EXISTS (select...)) | 0/3 | 0/3 |
Coverage Analysis: POLARDB [24] vs. UPP (Ours). X/Y (e.g., 6/7) indicates X (6) out of Y (7) distinct types are supported. Analyzed queries are described in §7.
† Coverage increases if our idea (§2.2) is employed
* If extracted tokens appear
POLARDB’s coverage can increase by adopting our proposed primitive constructs and by converting original expressions (e.g., acos(col) < 1) into a simple range check (e.g., 0.54 < col <= 1), provided that an inverse function is known or the input range satisfying the original expression can be computed numerically. UPP does not pursue such an approach—a combination of direct ISP computing (not via data hash) and DB-level handling of primitive constructs—for two reasons. First, direct computation depends heavily on how data is stored. For example, FPGA logic implemented for a specific timestamp format (e.g., 2023-11-30 23:59:59) fails if data is stored even slightly differently (e.g., 2023/11/30T23:59:59). Second, direct computation is slow: its latency is orders of magnitude greater than that of UPP’s data hash-based computation (§2).
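The range-check conversion can be sketched as follows (illustrative only; neither system implements it). Since acos is strictly decreasing on its domain [-1, 1], acos(col) < 1 holds exactly when cos(1) < col <= 1, i.e., 0.54 < col <= 1.

```python
# Sketch of the coverage-extension idea: rewrite f(col) < k as a range
# check on col when f has a known monotonic inverse on its domain.
import math

def acos_lt(k):
    """Rewrite acos(col) < k into (lo, hi] bounds on col.

    acos is strictly decreasing on [-1, 1], so acos(col) < k
    iff col > cos(k), clamped to acos's domain.
    """
    lo = max(math.cos(k), -1.0)
    return lo, 1.0   # the predicate becomes: lo < col <= 1.0

lo, hi = acos_lt(1.0)
print(f"{lo:.2f} < col <= {hi}")   # 0.54 < col <= 1.0
```

The rewrite preserves semantics only when the inverse (or a numerically computed input range) exists, which is exactly the precondition stated above.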
Performance. While UPP is the first ISP-based filtering framework designed for variable-length columns, it can also be applied to fixed-length columns. Under the following assumptions, our approach provides 2.3× higher scanning throughput than POLARDB. We assume that the input table has 4-byte items and that each 64-byte row generates a 4-byte row vector. Evaluating a DNF with 3 conjunctions, each containing 3 equality and 2 range predicates, requires 21 comparisons per row (each range predicate needs two); a reused 4-byte comparator in POLARDB thus spends 21 cycles scanning each row (i.e., 3 rows per 21 cycles with 3 comparators). For the same query, UPP takes 3 cycles per row using 3 comparators (1 INCL and 2 OVLPs).
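The arithmetic behind the 2.3× figure can be checked directly; this back-of-the-envelope sketch follows the assumptions above.

```python
# Back-of-the-envelope check of the 2.3x throughput claim:
# a DNF of 3 conjunctions, each with 3 equality + 2 range predicates.
comparisons_per_row = 3 * (3 * 1 + 2 * 2)   # range checks need 2 compares
assert comparisons_per_row == 21

# POLARDB: each 4-byte comparator serially scans one row -> 21 cycles;
# 3 comparators therefore process 3 rows per 21 cycles.
polardb_rows_per_cycle = 3 / 21
# UPP: 3 comparators (1 INCL, 2 OVLP) finish a row every 3 cycles.
upp_rows_per_cycle = 1 / 3

print(f"{upp_rows_per_cycle / polardb_rows_per_cycle:.1f}x")  # 2.3x
```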
9 Discussion and Related Work
Preprocessing to other formats? One potential alternative approach could be to preprocess the CSV format into columnar formats (e.g., Parquet) or other optimized formats instead of using data hash. However, this method still encounters challenges when parsing variable-length data types (e.g., strings) and evaluating them with complex predicates within ISP engines (§2.3) [74].
Applying UPP to columnar? UPP can easily be extended to columnar formats by performing row-level data skipping within storage. Implementing this requires only a simple extension to UPP, as it is already designed to handle variable-length columns. The only change needed is to encode column value sizes (e.g., variable lengths for strings) in the metadata, replacing the row sizes.
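A hypothetical sketch of that metadata change (the names and layout here are illustrative, not UPP’s actual encoding): the row format needs one size per row, while the columnar extension records each value’s size so the ISP engine can locate value boundaries without parsing content.

```python
# Hypothetical metadata sketch: per-row sizes (row format) vs.
# per-value sizes (columnar extension) for variable-length data.
def encode_sizes(values):
    # Byte size of each variable-length value; lets the ISP engine
    # find value boundaries without inspecting the bytes themselves.
    return [len(v.encode()) for v in values]

rows = [["alice", "NY"], ["bob", "LA"]]
row_meta = [sum(encode_sizes(r)) for r in rows]     # one size per row
col_meta = encode_sizes(["alice", "bob", "charlie"])  # one size per value

print(row_meta, col_meta)   # [7, 5] [5, 3, 7]
```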
Applying preprocessing to prior approaches? The preprocessing approach of UPP can be adapted to prior ISP designs by pre-identifying column formats and sizes while reusing their ISP engines to bypass fixed-size limitations. However, challenges remain in handling variable-length columns. First, even with known formats and sizes, FPGAs cannot extract columns in parallel due to their limited flexibility, requiring sequential extraction: whether in software or hardware, parsing variable-length columns necessitates assigning column numbers to each byte, which inherently requires coordination and prevents full parallelism. Second, prior ISP solutions struggle to scale in performance and cost for complex multi-predicate evaluations. As seen in Listing 1, UPP processes nested predicates with a single instruction, whereas prior approaches require significantly more operations and comparators, diminishing their ISP benefits.
Predicate pushdown. Research on query processing optimization has extensively explored predicate pushdown [9, 29, 34, 46, 47], leveraging pre-indexed or decoded column values for early predicate evaluation. While these methods enhance performance, UPP offers additional advantages. It evaluates complex predicates in a single step using ISP-tailored instructions and dedicated UP-COMP logic, reducing CPU cycles. Moreover, UPP prunes data within storage, transferring only filtered rows—even across multiple OS pages—whereas CPU-based approaches always fetch entire pages. This saves PCIe/memory bandwidth and preserves off-chip memory and on-chip cache capacity.
Hardware-based acceleration. In both academia and industry, substantial efforts have been devoted to accelerating database queries with specialized hardware, such as FPGAs [27, 64, 67], ASICs [6, 21, 46], GPUs [22, 30, 32, 54, 68], and programmable switches [44, 52, 69]. ISP, another promising platform, has also been widely explored by leveraging embedded cores inside SSD controllers [25, 28, 38, 70], FPGAs [24, 42, 58, 72, 74, 75, 77], and ASICs [31, 33, 37]. After an initial prototype demonstrated its benefits for query processing [28], ISP has been explored in two major directions. The first is pushing more computations into ISP, such as aggregation [75] and join [77]. The other is developing practical ISP ecosystems, including general ISP frameworks [24, 31, 33, 74], ISP resource management [58], and task scheduling [38]. Recent studies have also explored ISP for preserving transactional consistency under HTAP workloads [42, 72].
10 Conclusion
This work presents UPP, a novel in-storage filtering mechanism that significantly improves the performance and system-wide energy efficiency of data-intensive query processing through universal predicates. Our hardware-software co-design allows various predicates (e.g., comparison operations, user-defined functions, and their disjunctive normal forms) to be offloaded to ISP by translating them into hardware-friendly operations and evaluating them using complexity-effective hardware modules inside ISP devices. To our knowledge, UPP is the first solution in the open literature to support variable-length column evaluation without requiring modifications to database tables, and does so on commercially available ISP devices without relying on simulation.
Acknowledgments
This work was supported in part by a grant from PRISM, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA; by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2024-00405857); by the MSIT, Korea, under the Global Scholars Invitation Program (RS-2024-00456287) supervised by the IITP; by a grant from the National Science Foundation (NSF) (No. 2312561); by the Technology Innovation Program (No. RS-2024-00420541, 2410000802) funded by the Ministry of Trade, Industry and Energy of Korea; by the Yonsei University Research Fund of 2025-22-0104; and by a generous gift from AMD-Xilinx HACC.
Footnote
1. We use nextrow for intuitive understanding; in SQL, it may be expressed using window functions.
References
[20]
Daniel Abadi, Samuel Madden, and Miguel Ferreira. 2006. Integrating compression and execution in column-oriented database systems. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data. 671–682.
[21]
Kathirgamar Aingaran, Sumti Jairath, and David Lutz. 2016. Software in silicon in the Oracle SPARC M7 processor. In 2016 IEEE Hot Chips 28 Symposium (HCS). IEEE, 1–31.
[22]
Peter Bakkum and Kevin Skadron. 2010. Accelerating SQL database operations on a GPU with CUDA. In Proceedings of the 3rd workshop on general-purpose computation on graphics processing units. 94–103.
[23]
Edmon Begoli, Jesús Camacho-Rodríguez, Julian Hyde, Michael J Mior, and Daniel Lemire. 2018. Apache calcite: A foundational framework for optimized query processing over heterogeneous data sources. In Proceedings of the 2018 International Conference on Management of Data. 221–230.
[24]
Wei Cao, Yang Liu, Zhushi Cheng, Ning Zheng, Wei Li, Wenjie Wu, Linqiang Ouyang, Peng Wang, Yijing Wang, Ray Kuan, et al. 2020. POLARDB Meets Computational Storage: Efficiently Support Analytical Workloads in Cloud-Native Relational Database. In FAST. 29–41.
[25]
Sangyeun Cho, Chanik Park, Hyunok Oh, Sungchan Kim, Youngmin Yi, and Gregory R Ganger. 2013. Active disk meets flash: A case for intelligent ssds. In Proceedings of the 27th international ACM conference on International conference on supercomputing. 91–102.
[26]
Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, et al. 2016. The snowflake elastic data warehouse. In Proceedings of the 2016 International Conference on Management of Data. 215–226.
[27]
Christopher Dennl, Daniel Ziener, and Jurgen Teich. 2012. On-the-fly composition of FPGA-based SQL query accelerators using a partially reconfigurable module library. In 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines. IEEE, 45–52.
[28]
Jaeyoung Do, Yang-Suk Kee, Jignesh M Patel, Chanik Park, Kwanghyun Park, and David J DeWitt. 2013. Query processing on smart ssds: Opportunities and challenges. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. 1221–1230.
[29]
Ziqiang Feng, Eric Lo, Ben Kao, and Wenjian Xu. 2015. Byteslice: Pushing the envelop of main memory data processing with a new storage layout. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 31–46.
[30]
Naga K Govindaraju, Brandon Lloyd, Wei Wang, Ming Lin, and Dinesh Manocha. 2005. Fast computation of database operations using graphics processors. In ACM SIGGRAPH 2005 Courses. 206–es.
[31]
Boncheol Gu, Andre S Yoon, Duck-Ho Bae, Insoon Jo, Jinyoung Lee, Jonghyun Yoon, Jeong-Uk Kang, Moonsang Kwon, Chanho Yoon, Sangyeun Cho, et al. 2016. Biscuit: A framework for near-data processing of big data workloads. ACM SIGARCH Computer Architecture News 44, 3 (2016), 153–165.
[32]
Bingsheng He, Ke Yang, Rui Fang, Mian Lu, Naga Govindaraju, Qiong Luo, and Pedro Sander. 2008. Relational joins on graphics processors. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. 511–524.
[33]
Insoon Jo, Duck-Ho Bae, Andre S Yoon, Jeong-Uk Kang, Sangyeun Cho, Daniel DG Lee, and Jaeheon Jeong. 2016. YourSQL: a high-performance database system leveraging in-storage computing. Proceedings of the VLDB Endowment 9, 12 (2016), 924–935.
[34]
Ryan Johnson, Vijayshankar Raman, Richard Sidle, and Garret Swart. 2008. Row-wise parallel predicate evaluation. Proceedings of the VLDB Endowment 1, 1 (2008), 622–634.
[35]
Sang-Woo Jun, Ming Liu, Sungjin Lee, Jamey Hicks, John Ankcorn, Myron King, Shuotao Xu, and Arvind. 2015. BlueDBM: an appliance for big data analytics. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (Portland, Oregon). 1–13.
[36]
Yangwook Kang, Yang-suk Kee, Ethan L Miller, and Chanik Park. 2013. Enabling cost-effective data processing with smart SSD. In 2013 IEEE 29th symposium on mass storage systems and technologies (MSST). IEEE, 1–12.
[37]
Sungchan Kim, Hyunok Oh, Chanik Park, Sangyeun Cho, Sang-Won Lee, and Bongki Moon. 2016. In-storage processing of database scans and joins. Information Sciences 327 (2016), 183–200.
[38]
Gunjae Koo, Kiran Kumar Matam, Te I, HV Krishna Giri Narra, Jing Li, Hung-Wei Tseng, Steven Swanson, and Murali Annavaram. 2017. Summarizer: trading communication with computing near storage. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. 219–231.
[39]
Reese Kuper, Ipoom Jeong, Yifan Yuan, Ren Wang, Narayan Ranganathan, Nikhil Rao, Jiayu Hu, Sanjay Kumar, Philip Lantz, and Nam Sung Kim. 2024. A Quantitative Analysis and Guidelines of Data Streaming Accelerator in Modern Intel Xeon Scalable Processors. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 37–54.
[40]
Gyusun Lee, Seokha Shin, Wonsuk Song, Tae Jun Ham, Jae W Lee, and Jinkyu Jeong. 2019. Asynchronous I/O stack: A low-latency kernel I/O stack for Ultra-Low latency SSDs. In 2019 USENIX Annual Technical Conference (USENIX ATC 19). 603–616.
[41]
Joo Hwan Lee, Hui Zhang, Veronica Lagrange, Praveen Krishnamoorthy, Xiaodong Zhao, and Yang Seok Ki. 2020. SmartSSD: FPGA accelerated near-storage data analytics on SSD. IEEE Computer architecture letters 19, 2 (2020), 110–113.
[42]
Kitaek Lee, Insoon Jo, Jaechan Ahn, Hyuk Lee, Hwang Lee, Woong Sul, and Hyungsoo Jung. 2023. Deploying Computational Storage for HTAP DBMSs Takes More Than Just Computation Offloading. Proceedings of the VLDB Endowment 16, 6 (2023), 1480–1493.
[43]
Viktor Leis, Bernhard Radke, Andrey Gubichev, Alfons Kemper, and Thomas Neumann. 2017. Cardinality Estimation Done Right: Index-Based Join Sampling. In CIDR.
[44]
Alberto Lerner, Rana Hussein, and Philippe Cudré-Mauroux. 2019. The Case for Network Accelerated Query Processing. In CIDR.
[45]
Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. 2016. Wander join: Online aggregation via random walks. In Proceedings of the 2016 International Conference on Management of Data. 615–629.
[46]
Yinan Li, Jianan Lu, and Badrish Chandramouli. 2023. Selection Pushdown in Column Stores Using Bit Manipulation Instructions. Proceedings of the ACM on Management of Data 1, 2 (2023), 1–26.
[47]
Yinan Li and Jignesh M Patel. 2013. Bitweaving: Fast scans for main memory data processing. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. 289–300.
[48]
Zhenghua Lyu, Huan Hubert Zhang, Gang Xiong, Gang Guo, Haozhou Wang, Jinbao Chen, Asim Praveen, Yu Yang, Xiaoming Gao, Alexandra Wang, et al. 2021. Greenplum: a hybrid database for transactional and analytical workloads. In Proceedings of the 2021 International Conference on Management of Data. 2530–2542.
[49]
David Maier. 1983. The theory of relational databases. Vol. 11. Computer Science Press, Rockville, MD.
[50]
Wes McKinney et al. 2011. pandas: a foundational Python library for data analysis and statistics. Python for high performance and scientific computing 14, 9 (2011), 1–9.
[51]
Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. 2005. Efficient computation of frequent and top-k elements in data streams. In International conference on database theory. Springer, 398–412.
[52]
Craig Mustard, Fabian Ruffy, Anny Gakhokidze, Ivan Beschastnikh, and Alexandra Fedorova. 2019. Jumpgate: In-Network Processing as a Service for Data Analytics. In 11th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 19).
[53]
Yongjoo Park, Barzan Mozafari, Joseph Sorenson, and Junhao Wang. 2018. Verdictdb: Universalizing approximate query processing. In Proceedings of the 2018 International Conference on Management of Data. 1461–1476.
[54]
Johns Paul, Jiong He, and Bingsheng He. 2016. GPL: A GPU-based pipelined query processing engine. In Proceedings of the 2016 International Conference on Management of Data. 1935–1950.
[55]
pola-rs. [n. d.]. Polars: Lightning-fast DataFrame library for Rust and Python. https://www.pola.rs/.
[56]
Jags Ramnarayan, Barzan Mozafari, Sumedh Wale, Sudhir Menon, Neeraj Kumar, Hemant Bhanawat, Soubhik Chakraborty, Yogesh Mahajan, Rishitesh Mishra, and Kishor Bachhav. 2016. Snappydata: A hybrid transactional analytical store built on spark. In Proceedings of the 2016 International Conference on Management of Data. 2153–2156.
[57]
Murray Rosenblatt. 1956. A central limit theorem and a strong mixing condition. Proceedings of the national Academy of Sciences 42, 1 (1956), 43–47.
[58]
Zhenyuan Ruan, Tong He, and Jason Cong. 2019. INSIDER: Designing In-Storage Computing System for Emerging High-Performance Drive. In USENIX Annual Technical Conference. 379–394.
[59]
Sahand Salamat, Armin Haj Aboutalebi, Behnam Khaleghi, Joo Hwan Lee, Yang Seok Ki, and Tajana Rosing. 2021. NASCENT: Near-storage acceleration of database sort on SmartSSD. In The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 262–272.
[60]
Sahand Salamat, Hui Zhang, Yang Seok Ki, and Tajana Rosing. 2022. NASCENT2: Generic near-storage sort accelerator for data analytics on SmartSSD. ACM Transactions on Reconfigurable Technology and Systems (TRETS) 15, 2 (2022), 1–29.
[61]
P Griffiths Selinger, Morton M Astrahan, Donald D Chamberlin, Raymond A Lorie, and Thomas G Price. 1979. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD international conference on Management of data. 23–34.
[62]
Raghav Sethi, Martin Traverso, Dain Sundstrom, David Phillips, Wenlei Xie, Yutian Sun, Nezih Yegitbasi, Haozhun Jin, Eric Hwang, Nileema Shingte, et al. 2019. Presto: SQL on everything. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 1802–1813.
[63]
Nikhil Sheoran, Supawit Chockchowwat, Arav Chheda, Suwen Wang, Riya Verma, and Yongjoo Park. 2023. A step toward deep online aggregation. Proceedings of the ACM on Management of Data 1, 2 (2023), 1–28.
[64]
Malcolm Singh and Ben Leonhardi. 2011. Introduction to the IBM Netezza warehouse appliance. In Proceedings of the 2011 Conference of the Center for Advanced Studies on Collaborative Research. 385–386.
[65]
Mike Stonebraker, Daniel J Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O’Neil, et al. 2018. C-store: a column-oriented DBMS. In Making Databases Work: the Pragmatic Wisdom of Michael Stonebraker. 491–518.
[66]
Yutaka Sugawara, Dong Chen, Ruud A Haring, Abdullah Kayi, Eugene Ratzlaff, Robert M Senger, Krishnan Sugavanam, Ralph Bellofatto, Ben J Nathanson, and Craig Stunkel. 2022. Data movement accelerator engines on a prototype power10 processor. IEEE Micro 43, 1 (2022), 67–75.
[67]
Bharat Sukhwani, Hong Min, Mathew Thoennes, Parijat Dube, Balakrishna Iyer, Bernard Brezzo, Donna Dillenberger, and Sameh Asaad. 2012. Database analytics acceleration using FPGAs. In Proceedings of the 21st international conference on Parallel architectures and compilation techniques. 411–420.
[68]
Chengyu Sun, Divyakant Agrawal, and Amr El Abbadi. 2003. Hardware acceleration for spatial selections and joins. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data. 455–466.
[69]
Muhammad Tirmazi, Ran Ben Basat, Jiaqi Gao, and Minlan Yu. 2020. Cheetah: Accelerating database queries with switch pruning. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2407–2422.
[70]
Devesh Tiwari, Simona Boboila, Sudharshan Vazhkudai, Youngjae Kim, Xiaosong Ma, Peter Desnoyers, and Yan Solihin. 2013. Active Flash: Towards Energy-Efficient, In-Situ Data Analytics on Extreme-Scale Machines. In 11th USENIX Conference on File and Storage Technologies (FAST 13). 119–132.
[72]
Tobias Vinçon, Christian Knödler, Leonardo Solis-Vasquez, Arthur Bernhardt, Sajjad Tamimi, Lukas Weber, Florian Stock, Andreas Koch, and Ilia Petrov. 2022. Near-data processing in database systems on native computational storage under htap workloads. Proceedings of the VLDB Endowment 15, 10 (2022), 1991–2004.
[73]
Deepak Vohra and Deepak Vohra. 2016. Apache parquet. Practical Hadoop Ecosystem: A Definitive Guide to Hadoop-Related Frameworks and Tools (2016), 325–335.
[74]
Satoru Watanabe, Kazuhisa Fujimoto, Yuji Saeki, Yoshifumi Fujikawa, and Hiroshi Yoshino. 2019. Column-oriented database acceleration using FPGAs. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 686–697.
[75]
Louis Woods, Zsolt István, and Gustavo Alonso. 2014. Ibex: An intelligent storage engine with support for advanced sql offloading. Proceedings of the VLDB Endowment 7, 11 (2014), 963–974.
[76]
Qiumin Xu, Huzefa Siyamwala, Mrinmoy Ghosh, Tameesh Suri, Manu Awasthi, Zvika Guz, Anahita Shayesteh, and Vijay Balakrishnan. 2015. Performance analysis of NVMe SSDs and their implication on real world databases. In Proceedings of the 8th ACM International Systems and Storage Conference. 1–11.
[77]
Shuotao Xu, Thomas Bourgeat, Tianhao Huang, Hojun Kim, Sungjin Lee, and Arvind. 2020. Aquoman: An analytic-query offloading machine. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 386–399.
[78]
Yifei Yang, Matt Youill, Matthew Woicik, Yizhou Liu, Xiangyao Yu, Marco Serafini, Ashraf Aboulnaga, and Michael Stonebraker. 2021. Flexpushdowndb: Hybrid pushdown and caching in a cloud dbms. Proceedings of the VLDB Endowment 14, 11 (2021).
[79]
Matei Zaharia, Reynold S Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J Franklin, et al. 2016. Apache spark: a unified engine for big data processing. Commun. ACM 59, 11 (2016), 56–65.
[80]
Yue Zhao, Gao Cong, Jiachen Shi, and Chunyan Miao. 2022. Queryformer: A tree transformer model for query plan representation. Proceedings of the VLDB Endowment 15, 8 (2022), 1658–1670.
[81]
Daniel Ziener, Florian Bauer, Andreas Becher, Christopher Dennl, Klaus Meyer-Wegener, Ute Schürfeld, Jürgen Teich, Jörg-Stephan Vogt, and Helmut Weber. 2016. FPGA-based dynamically reconfigurable SQL query processing. ACM Transactions on Reconfigurable Technology and Systems (TRETS) 9, 4 (2016), 1–24.