Abstract
Pattern matching using non-deterministic finite automata (NFA) is critical in applications such as network intrusion detection systems (IDS), but efficient acceleration of NFA computations remains a challenge. In this work, we present a high-performance NFA accelerator fabricated in 22 nm Fully-Depleted Silicon-on-Insulator (FD-SOI) technology. The accelerator achieves a throughput of 2822 MB/s at maximum frequency and 406 MB/s at 1.27 fJ/B/transition at the minimum energy point. We attain this performance by employing digital compute-in-memory (CIM) macros and integrating a CIM Bloom filter that gates the activity of the macros, enabling opportunistic symbol skipping. Our accelerator demonstrates significant improvements in throughput and energy efficiency, making it suitable for high-performance pattern matching applications.
1 Introduction
Regular expressions are a powerful descriptive language used to define signatures for network intrusion detection systems (IDS). An IDS checks all incoming and outgoing packets in a computer network for known malicious signatures. One widely used open-source IDS is SNORT, which provides a community-assembled rule set [1] that we also use in our evaluation. Fig. 1 shows the general structure of SNORT: it uses a publish-subscribe model, in which modules subscribe or publish on topics, to enable efficient multiprocessing and modularity. Raw data packets are first decoded, and related packets are merged to enable detection across packet boundaries. Modules such as the firewall, application identifier, or threat detection subscribe to this data, providing the different functionalities. During threat detection, which is what we focus on in this work, the packets are scanned for known malicious signatures. Finally, depending on the results from the firewall, threat detection, and other potential filters, the packet is dropped, an alert is issued, or other actions are taken. For a single packet stream, certain modules, such as threat detection, are single-threaded, but they can be parallelized across multiple packet streams.
Figure 1:

Basic structure of SNORT.
For effective threat detection, an up-to-date rule set containing signatures of potential threats is required. These rules are commonly defined as regular expressions. As shown in the profiling results in Fig. 2, the evaluation of these patterns is the single largest contributor, at 53.6% of the total runtime. The evaluation time scales with the rule set size. With the full community rule set loaded, SNORT achieves a throughput of 2.5 Gb/s, four times slower than the 10 Gb/s required by modern networking demands, thus necessitating customized hardware solutions.
The key contributions of this paper are:
•
Design of a novel CIM architecture capable of computing arbitrary non-deterministic finite automata (NFA).
•
Integration of a CIM Bloom filter to suppress irrelevant symbols and gate the NFA CIM array, thereby reducing energy per operation and increasing throughput.
•
Fabrication and validation of the proposed accelerator in 22 nm FD-SOI technology, with comprehensive measurements demonstrating its performance.
The remainder of this paper is structured as follows. In Section 2, we present related work in the fields of compute-in-memory (CIM) and finite automata. Section 3 discusses the structure of the proposed system and the design space exploration of its key CIM components. In Section 4, we present measurement results, compare the proposed system to the state of the art, and provide an outlook for future research.
Figure 2:

Left: Run time breakdown of SNORT. We focus on accelerating the match operation (bold). Right: Throughput as a function of rule set size.
2 Related Works
Compute-in-Memory (CIM) has emerged as a promising approach to overcome the memory wall [15]. By combining computation and memory, CIM removes the bottleneck between the memory and processing units, allowing for memory-level parallelism. This paradigm is especially beneficial in memory-bound applications such as matrix-vector multiplications for neural networks [11, 19, 20], search operations using Content Addressable Memory (CAM) and Ternary Content Addressable Memory (TCAM) [8], sorting algorithms [16, 17], combinatorial optimization problems [2, 4], and cryptographic functions [31].
The diversity of memory technologies and computational regimes in CIM implementations reflects the versatility of this approach. Digital and time-domain circuits often utilize custom 6T/8T SRAM cells [11], latches constructed from cross-coupled pairs of standard cells [9, 16, 19, 31], or flash memory [22]. Analog domain computations can additionally leverage novel devices such as memristors [5, 14], Ferroelectric Field-Effect Transistors (FeFETs) [27], and Spin-Transfer Torque Magnetic RAM (STT-MRAM) [13] for the actual computation. Each memory technology and compute domain offers unique trade-offs, ranging from area efficiency to approximate computations and energy consumption [10].
The compilation of compute-in-memory macros has become an important topic, aiming to reduce the design effort and bring it closer to the level of Register Transfer Level (RTL) design [6, 18, 29, 30]. This approach enables more flexible designs, faster turn-around times, and simplified transitions between technology nodes. Utilizing standard cells as the building blocks for CIM arrays allows for densely packed architectures while minimizing design complexity.
Finite automata, defined by a finite number of states, an input alphabet, and transition rules, are fundamental in pattern matching and regular expression evaluation. In particular, deterministic finite automata (DFA) and non-deterministic finite automata (NFA) are widely used. In a DFA, for each state and input symbol, there is at most one outgoing transition, whereas an NFA allows multiple possible transitions, including ε-transitions, which move to a next state without consuming an input symbol.
Although DFA and NFA are equivalent in the languages they can accept, the number of states required for a DFA can be exponentially larger than that of an equivalent NFA [24], making NFAs more compact and efficient for certain applications. However, there is no efficient algorithm to create an optimal (i.e., minimal) NFA equivalent to a given DFA. Heuristic methods exist to approximate this conversion and reduce the number of states effectively [21].
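The state blow-up can be made concrete with a small subset-construction sketch (our own illustrative code, not part of the proposed system; the classic pattern family (a|b)*a(a|b)^(n-1) requires at least 2^n DFA states):

```python
def subset_construction(nfa, alphabet, start, finals):
    """Convert an NFA (dict: (state, symbol) -> set of next states) into a
    DFA via subset construction; each DFA state is a set of NFA states."""
    start_set = frozenset([start])
    dfa_trans, seen, worklist = {}, {start_set}, [start_set]
    while worklist:
        S = worklist.pop()
        for a in alphabet:
            T = frozenset(t for s in S for t in nfa.get((s, a), ()))
            dfa_trans[(S, a)] = T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    dfa_finals = {S for S in seen if S & finals}
    return seen, dfa_trans, dfa_finals

# NFA for (a|b)*a(a|b): 3 states, accepting strings whose second-to-last
# character is 'a'.  The equivalent minimal DFA needs 2^2 = 4 states.
nfa = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'a'): {2}, (1, 'b'): {2}}
states, _, _ = subset_construction(nfa, 'ab', 0, {2})
print(len(states))  # 4
```

For n look-back positions the reachable subsets grow exponentially, which is exactly why the compact NFA form is preferred in hardware.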
Given the widespread applicability of regular expressions in fields such as network security, several DFA [12] and NFA accelerators [7, 23, 25] have been proposed, including commercial solutions [26]. These accelerators commonly employ CAM-like search mechanisms for the symbol and state lookup. The triggered transitions are typically stored as large parallel vectors (as wide as the memory) or require tie-breaking logic and memory lookups to determine the target states.
3 Proposed System
Figure 3 presents an overview of the proposed system. A RISC-V processor [28] is used to load the NFA into the accelerator, which is memory-mapped into the RISC-V’s address space. This programming occurs only once at the initialization stage, resulting in a negligible impact on system performance. During operation, the RISC-V processor offloads data management to a dual-channel Direct Memory Access (DMA) controller by writing descriptors into memory. The DMA uses these descriptors to load the input data stream into the accelerator and to write the stream of results back into main memory. The descriptors form a linked list, allowing the DMA to automatically continue with the next command without processor intervention. The entire system is based on the AXI and AXI stream protocols, with the arrows in Figure 3 indicating the direction of data flow.
Figure 3:

Proposed system, normal operation sequence and how it integrates into a RISC-V based SoC.
After buffering and clock domain crossing (CDC), the data is downsized to match the input data width of the key compute components. The original data width is 256 bits to provide high-throughput connectivity to the main memory and to limit congestion in the chip-wide AXI network. The rule set is defined at the character (byte) level, thus a straightforward hardware mapping would serialize the data into bytes, limiting the throughput to at most one byte per cycle, which is inadequate for modern networking speeds. This can be alleviated for multiple independent packet streams by running multiple units in parallel. However, on a per-stream basis, each stream is still subject to this bottleneck.
Figure 4:

Example of how a one-character-per-cycle NFA is equivalent to a two-character-per-cycle NFA.
To overcome this limitation, we process data symbols that combine two bytes (i.e., two characters per symbol). We create two data streams: the primary stream and an offset stream. At step k, the primary stream consists of characters c_{2k-1} and c_{2k}, while the offset stream consists of c_{2k} and c_{2k+1}. As illustrated in Figure 4, processing these two streams with the NFA is equivalent to handling the data one character at a time. The top row of the timing diagram shows how the NFA processes the input pattern, displaying the tuple {current symbol, set of active states} and ending at the accepting state. The subsequent rows depict the two streams, which can be processed in parallel—each with two characters per symbol at half the data rate. From this point in the processing chain, we handle two streams in parallel, effectively doubling the throughput.
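With 0-based indexing (the description above uses 1-based indices), the stream construction can be sketched as follows; the helper name is ours:

```python
def make_streams(chars):
    """Split a character stream into the primary and offset 2-gram streams.
    With 0-based indices, step k of the primary stream is (c[2k], c[2k+1])
    and step k of the offset stream is (c[2k+1], c[2k+2]), so every
    adjacent character pair appears in exactly one of the two streams."""
    primary = [(chars[i], chars[i + 1]) for i in range(0, len(chars) - 1, 2)]
    offset = [(chars[i], chars[i + 1]) for i in range(1, len(chars) - 1, 2)]
    return primary, offset

primary, offset = make_streams("abcde")
print(primary)  # [('a', 'b'), ('c', 'd')]
print(offset)   # [('b', 'c'), ('d', 'e')]
```

Because every adjacent pair lands in one of the two streams, a pattern boundary can never fall between symbols of both streams at once.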
An additional Bloom filter is used to prune the input data and is introduced in Section 3.2. In any case, the core operation mapped to the CIM accelerator is unchanged and discussed next.
To increase the capacity of the system and demonstrate scalability to arbitrary sizes, we implement multiple NFA banks in parallel. Each bank operates independently but works on the same data stream. The resulting matches from all banks are merged into a single output stream, which is transferred back into main memory by the DMA.
3.1 CIM Array
The core principle of the implemented NFA accelerator is illustrated in Algorithm 1. The NFA is defined by its transition function δ, which returns a set of next states for a given combination of current state and input symbol. It also includes the initial state q0 and the set of final (accepting) states F. The set of active states Q comprises all currently reached states, including the initial state. Keeping the initial state always active ensures that matches can start at any position in the input string, not just at the beginning. A match m is defined by the location in the input symbol stream i and the reached final state, which identifies the specific regular expression that matched.
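Our reading of this matching loop can be sketched in a few lines (delta maps a (state, symbol) pair to a set of next states; q0 is kept permanently active, and a match records the input position and the reached final state):

```python
def nfa_match(delta, q0, finals, symbols):
    """Behavioral sketch of the matching loop: the initial state q0 is
    always evaluated alongside the active states, so matches may start
    at any position in the input."""
    active = set()
    matches = []
    for i, sym in enumerate(symbols):
        nxt = set()
        for q in active | {q0}:          # active states plus q0 (line 7)
            nxt |= delta.get((q, sym), set())
        matches += [(i, q) for q in nxt if q in finals]
        active = nxt
    return matches

# Toy NFA matching the pattern "ab" anywhere in the input.
delta = {(0, 'a'): {1}, (1, 'b'): {2}}
print(nfa_match(delta, 0, {2}, "xabyab"))  # [(2, 2), (5, 2)]
```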
The key operation to implement in a CIM structure is the evaluation of the transition function δ. In practical NFAs, the transition function is sparse; that is, the number of non-zero entries is low relative to the total possible number of transitions. Therefore, a direct mapping into memory would be inefficient and wasteful. Instead, a Content Addressable Memory (CAM)-like structure is well-suited for storing the transition function, as proposed in previous works [12, 25].
In our proposed NFA accelerator, each column of the CIM array represents one non-zero entry of the transition function. It stores the current state, the triggering symbol, and one of the next states. If a state-symbol pair results in multiple next states (which is allowed in NFAs), multiple columns are utilized to represent each possible transition. This approach is more efficient than storing the active states in a binary mask, as done in some other designs [7, 25], because in most cases, a symbol-state pair triggers only a few next states, making the binary mask approach unnecessarily large.
Figure 5:

Left: Implemented computation per entry. Gray boxes correspond to stored data. Right: FSM which implements the core control for each entry. Together, they form the logic in each CIM slice.
To efficiently handle state transitions, we use two register bits per column to store whether the transition is active and whether it has been triggered. This eliminates the need for explicit read-outs of the state over multiple cycles, as required in some previous implementations [12]. The iteration over active states in line 7 of Algorithm 1 becomes a successive reading of the output states and clearing of the active transitions. Once all active transitions have been processed, the triggered transitions are moved into the active transition bitmap, and the next input symbol is loaded. The state machine governing this behavior is depicted on the right side of Figure 5. Through this method, the NFA accelerator computes one {symbol, state} pair evaluation per cycle.
Each column in the array also includes several flags to implement the full functionality:
•
Initial State Flag: For transitions originating from the initial state, a flag is set that bypasses the state comparison logic. This allows these transitions to evaluate to true regardless of the current active state, as shown in Figure 5.
•
Don’t-Care Symbol Flag: To support regular expressions with an odd length or with wildcard characters (e.g., the dot operator), we implement a don’t-care flag for single characters. This flag indicates that the transition can be triggered by any input symbol. Implementing this with a single flag bit is more storage-efficient compared to using a full TCAM approach [12], which would double the number of required bits.
•
Final State Flag: A flag marks transitions that lead into a final (accepting) state. When such a transition is triggered, it results in a match being output.
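A behavioral model of one column with its two state bits and the three flags might look as follows (our simplification: the hardware reads out active entries one per cycle, which we collapse into a single broadcast set):

```python
class Entry:
    """One CIM column: a stored transition plus its two state bits."""
    def __init__(self, cur, sym, nxt, init=False, dontcare=False, final=False):
        self.cur, self.sym, self.nxt = cur, sym, nxt
        self.init, self.dontcare, self.final = init, dontcare, final
        self.active = False     # source state reached for the current symbol
        self.triggered = False  # will become active for the next symbol

def step(entries, symbol):
    """Evaluate one input symbol against all stored transitions."""
    matches = []
    # states to compare against: target states of all active entries
    # (entries with the initial-state flag bypass this comparison)
    broadcast = {e.nxt for e in entries if e.active}
    for e in entries:
        sym_hit = e.dontcare or e.sym == symbol
        state_hit = e.init or e.cur in broadcast
        if sym_hit and state_hit:
            e.triggered = True
            if e.final:
                matches.append(e.nxt)
    # all active transitions processed: move triggered bits to active bits
    for e in entries:
        e.active, e.triggered = e.triggered, False
    return matches

# Pattern "ab": an initial-state entry 0-'a'->1 and a final entry 1-'b'->2.
ents = [Entry(0, 'a', 1, init=True), Entry(1, 'b', 2, final=True)]
out = [step(ents, c) for c in "xab"]
print(out)  # [[], [], [2]]
```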
The NFA array defines the timing-critical path in the system, as shown in Fig. 6: the read logic from the state register is a binary tree, so it scales logarithmically with the number of slices and produces the state to be compared (1). The comparison corresponds to an XOR with the stored state followed by an s-input AND (2). Finally, a flag has to be computed if the current symbol has not triggered any transition; this again scales logarithmically with the number of slices (3) and is used by the droppable FIFO discussed in the next section.
Figure 6:

Flow diagram of an NFA array with the timing critical path shown.
The actual NFA stored in the CIM array is static: once written, it does not change anymore. Thus, the write performance of the used memory is not important, and no read logic is required, as the compute logic is directly coupled to the memory elements. While previous standard-cell CIM designs used two cross-coupled OAI21 gates to implement a latch [9], corresponding to a gated SR latch, we instead propose an SR latch with global reset realized with an OAI21 and a NAND2 gate, shown in the green box in Fig. 8. This saves 20% of area compared to the OAI21 latch. At the same time, it provides a larger static noise margin than traditional 6T cells, preserving a wider range of voltage scaling.
A breakdown of the area and power consumption of the NFA CIM array is shown in Fig. 7. The memory’s area footprint is halved compared to using fully disjoint arrays as it is reused for both parallel streams, only taking up 14% of the total area. It does not contribute any power (except for negligible amounts of static power).
Figure 7:

Area (left) and power (right) breakdown for the NFA array. Compare and memory correspond to the left part of Fig. 5, the slice FSM corresponds to the right.
3.2 Bloom Filter
One key aspect to note in Algorithm 1 is that the loop over the states in line 7 includes the initial state in addition to all active states. If states are rarely active, then only the initial state is evaluated. As discussed earlier, the number of entries in δ is sparse, so there are few symbols that trigger outgoing transitions from the initial state. If we can determine that a symbol does not trigger an initial transition (and there are no active states), we can skip evaluation in the CIM array, thereby reducing energy per operation. Moreover, if multiple symbols in a row do not trigger initial transitions, we can skip all of them at once, further reducing energy consumption and effectively increasing throughput. It is important to note that false positives inherent to Bloom filters do not impact the final output; they only increase activity in the CIM array.
We can leverage this insight to efficiently gate the NFA array using a Bloom filter [3]. The Bloom filter operates by computing multiple hashes (in our case 3) of the input, addressing a bit-vector. During the initialization, the bits corresponding to these addresses are set to 1. To test if an input is in the dataset, we check the positions in the bit-vector indicated by the hashes; if all the corresponding bits are set, the value is likely in the dataset. However, false positives can occur. There is a trade-off among the number of bits in the vector, the number of elements in the dataset, the number of hash functions, and the resulting false-positive rate.
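A minimal software sketch of such a filter (the chip computes 3 simple hardware hashes; we substitute salted SHA-256 here purely for illustration):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: k hash functions over an m-bit vector."""
    def __init__(self, m_bits=512, k=3):
        self.m, self.k = m_bits, k
        self.bits = [0] * m_bits

    def _positions(self, item):
        for salt in range(self.k):
            h = hashlib.sha256(bytes([salt]) + item).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # False negatives are impossible; false positives only cause
        # unnecessary (but harmless) activity in the NFA array.
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add(b"ab")
print(bf.might_contain(b"ab"))  # True (always, once added)
```

A negative answer is always safe: it proves the symbol triggers no stored initial transition, so the CIM array can stay gated.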
Figure 8:

Data flow for the proposed CIM Bloom filter. The hash functions are amortized over the number of parallel NFA banks (p). The memory (green) is amortized over the number of parallel data streams (2).
We show the structure of the proposed CIM Bloom filter in Fig. 8: a very basic hash function is applied to the input streams. The results are then converted into one-hot representation and OR'd together. This resulting mask is broadcast over the CIM array, where, for each NFA CIM bank, an imply operation is performed (i.e., if a, then b). To obtain the output of the Bloom filter, we AND-reduce the result of this operation. We choose an efficient standard-cell implementation, for example using inverting and complex gates. The number of bits in the array is much larger than the number of tiles and streams, which would lead to a very narrow and tall array. To make the array more efficient, we store two bits of the array per entry and use a complex gate to merge the imply and the first stage of the AND operation. The actual standard-cell implementation of the base element is shown in the red box in Fig. 8. These outputs feed into two inverting AND trees (one for each input stream), which can be implemented in a single column of standard cells. For clarity, the cells are not drawn in one row in the figure, but in the actual layout they are placed in one row. Orange signals are only needed during the write operation. Gray boxes are a single complex gate each.
In addition to the Bloom filter, we implement a droppable FIFO that either pops the current symbol (in normal operation) or skips all successive symbols that are skippable according to the Bloom filter. The deeper the FIFO, the more successive values can potentially be dropped. This increases throughput, as the FIFO can temporarily process more than one symbol (two characters) per cycle. With increasing FIFO depth, an average throughput of one symbol per cycle is reached, which is the upper limit due to the Bloom filter only processing one symbol per cycle (per stream).
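The drop behavior can be sketched as follows (our simplification: we assume the NFA is idle, i.e. no active states, and let the FIFO refill to capacity between cycles):

```python
from collections import deque

def run_droppable_fifo(symbols, skippable, depth=8):
    """Behavioral sketch of the droppable FIFO: each cycle either pops one
    symbol for NFA evaluation or, when the head of the FIFO is marked
    skippable by the Bloom filter, drops the whole skippable run at once."""
    fifo = deque(maxlen=depth)
    stream = list(zip(symbols, skippable))
    evaluated, cycles = [], 0
    while stream or fifo:
        while stream and len(fifo) < depth:  # refill up to capacity
            fifo.append(stream.pop(0))
        cycles += 1
        if fifo and fifo[0][1]:              # head skippable:
            while fifo and fifo[0][1]:       # drop the run in one cycle
                fifo.popleft()
        elif fifo:
            evaluated.append(fifo.popleft()[0])
    return evaluated, cycles

# Six skippable symbols vanish in one cycle instead of six.
print(run_droppable_fifo("abcdefgh", [True] * 6 + [False] * 2))
# (['g', 'h'], 3)
```

A deeper FIFO lets longer skippable runs collapse into a single cycle, which is the throughput mechanism described above.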
To evaluate the optimal depth of the droppable FIFO and the capacity of the Bloom filter, we sweep these two parameters and assess their impact on energy, throughput, and area, as shown in Fig. 9. The results are normalized to the performance without these features. Increasing the Bloom filter size decreases the false-positive rate, which improves energy efficiency. At a size of 512 elements, the area has increased by only 7%, but the energy efficiency has improved by 3.2 ×. Increasing the size further causes the energy consumption of the Bloom filter itself to become significant, decreasing overall energy efficiency. Conversely, overly small sizes lead to a false-positive rate of effectively 100%, resulting in performance equivalent to having no Bloom filter. For the droppable FIFO, choosing a depth of eight almost doubles (1.95 ×) the throughput and increases energy efficiency by 2.45 × compared to not using a FIFO. The minor improvement in throughput for deeper FIFOs does not justify the additional area and energy costs. For patterns that are only one symbol long (of which there are few), a second Bloom filter with a capacity of 32 elements is necessary, but its area and power contributions are negligible.
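The qualitative shape of this sweep follows the textbook Bloom-filter false-positive estimate p ≈ (1 − e^(−kn/m))^k; the sketch below uses illustrative parameters, not the measured data:

```python
import math

def false_positive_rate(m_bits, n_items, k_hashes=3):
    """Standard Bloom filter false-positive estimate (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

# Sweeping the capacity shows why tiny filters are useless (p -> 1)
# and why gains saturate once p is already small.
for m in (64, 256, 512, 2048):
    print(f"m={m:5d}  p={false_positive_rate(m, 100):.3f}")
```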
Figure 9:

Evaluation of Bloom filter size and droppable FIFO depth. Dashed lines at (Bloom filter capacity of 512 and FIFO depth of 8) show final selected parameters. Bloom filter capacity in log scale.
Finally, Fig. 10 shows the area and power breakdown of the finalized Bloom filter and FIFO. Although the area contribution of the memory is significant, it has no relevant contribution in power as it is quasi-static. We use the same memory element composed of an OAI21 and NAND gate as used for the NFA arrays. We also demonstrate the benefit in area reduction by employing a CIM approach for the Bloom filter itself. The hash functions are synthesized RTL, while the imply operation and AND reduction of the Bloom filter are implemented in CIM. The hash function evaluation is amortized over all NFA tiles, thus reducing their contribution to area and power.
Figure 10:

Left: Area and power breakdown for the Bloom filter and droppable FIFO. Right: Area savings due to use of CIM approach over a synthesized baseline.
4 Results
The performance of finite automata accelerators depends on the data distribution and the specific automata used [23], as the number of active states can vary significantly. To address this variability, we use real-world rules and data to evaluate our design. The data is captured from voluntary participants in an office setting, including internet browsing, network shares, and media streaming. To construct the NFA, we use the Thompson construction, split the resulting graph into p = 4 disjoint graphs, and minimize the automata, converting them into NFAs [21].
Table 1: State-of-the-Art Comparison

| Metric | MICRO'17 [25] | MICRO'17 [25] | TNANO'19 [12] | CODES'16 [26] | MICRO'19 [23] | MICRO'19 [23] | Proposed | Proposed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Technology [nm] | 28 | 28 | 16 | 45 | 28 | 28 | 22 | 22 |
| State | Simulation | Simulation | Sim/Tapeout | Tapeout | Simulation | Simulation | Tapeout | Tapeout |
| Capability | NFA | NFA | DFA | NFA | NFA | NFA | NFA | NFA |
| Capacity | 131072 | 131072 | 250 | 49152 | 32768 | 32768 | 1024 (2048) | 1024 (2048) |
| Stride | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 |
| Approach | SRAM Space Opt | SRAM Throughput Opt | 4T2R MRAM Cell | DRAM | Custom 2T1D Cell | 8T SRAM Cell | SC CIM MEP | SC CIM fmax |
| Area/transition [µm²] | 117 | 117 | 5.92 | 458 | 61 | 130 | 73 | 73 |
| E/B/transition [fJ/B] | 24.7 | 106 | 6.12 | 381 | 73 | 243 | 1.27 | 3.4 |
| Throughput [MB/s] | 1175 | 1950 | 1000 | 133 | 1500 | 2500 | 406 | 2822 |
In post-layout simulations, we evaluate the impact of data and automata distribution, with the results shown in Fig. 11. Real workloads increase the energy per operation by 1.74 × while decreasing the throughput by 5%, highlighting the importance of realistic test settings. Using simulated power numbers for real data and automata, we present the power breakdown (as well as the data-independent area breakdown) in Fig. 11. The NFA CIM array dominates the area but contributes only half of the total power, due to the Bloom filter efficiently gating it. The Bloom filter’s small contribution in area and power demonstrates its effectiveness for the application.
Figure 11:

Left: Area and simulated power breakdown of the proposed system. Right: Simulated impact of random data and automata on throughput and energy/operation.
Figure 12:

Measurement results for energy per input byte and maximum frequency. The two marked points correspond to the minimum energy and maximum throughput point.
The proposed design was fabricated in a 22 nm FD-SOI technology. The die measures 1.73 mm × 1.73 mm and is shown in Fig. 13. The actual NFA accelerator occupies 405 µm × 370 µm, corresponding to 73 µm² per transition. Our measurement results are presented in Fig. 12. We achieve an energy efficiency of 2.62 pJ/B at an operating frequency of 220 MHz (406 MB/s) at the minimum energy point (MEP). The maximum throughput of 2.822 GB/s is reached at an energy per byte of 7 pJ/B.
The profiling results and measured throughput for the processor baseline shown in Fig. 2 are obtained from a commercial server running an AMD EPYC 7502. We estimate the energy per operation using its TDP, the number of cores, and the contribution of the match operation to the total run time. Extrapolating the proposed design's performance at 0.8 V, we achieve a 6.8 × higher throughput and 45 × lower energy per byte compared to the baseline on a per-packet-stream basis, i.e., for a single core. For multiple parallel data streams, the processor could utilize more cores, but similarly, we could instantiate parallel accelerators, exploiting the small footprint and maintaining our advantage. Other aspects, such as off-chip memory, are ignored for both the processor and the proposed design.
A comparison with the state of the art is shown in Table 1. As expected, full-custom DRAM and MRAM-based designs have a smaller area. However, our accelerator achieves up to 4.8 × lower energy per byte and transition (at MEP), and 1.12 × higher throughput (at fmax) than previous designs.
Figure 13:

Left: Die graph of the fabricated die. Right: Layout of the proposed accelerator. The vertical lines are columns of bias and power gating cells.
5 Conclusion
In this paper, we have presented a novel CIM architecture for an NFA accelerator that uses a Bloom filter to reduce power and increase throughput. The proposed combination of droppable FIFO and Bloom filter can be applied to any CIM architecture in which an input signal is broadcast without prior pruning, such as work focusing on the efficient implementation of the match operation in TCAMs [12]; exploring this remains future work.
The use of standard-cell memory enables the proposed accelerator to be tightly co-integrated with digital logic, enabling fast transitions to new nodes compared to full-custom designs with novel devices [12], DRAM-based designs [7, 23], or SRAM [25]. Applying previously proposed scaling methods [12] so that the architecture handles the entire rule set remains engineering work.
Acknowledgments
This work was partially funded by the Federal Ministry of Education and Research (BMBF, Germany) in the project NEUROTEC II (project numbers 16ME0399 and 16ME0398K).
References
[1]
2025. Snort Intrusion Detection System. Technical Report. snort.org
[2]
Jooyoung Bae, Wonsik Oh, Jahyun Koo, and Bongjin Kim. 2023. CTLE-Ising: A 1440-Spin Continuous-Time Latch-Based Ising Machine with One-Shot Fully-Parallel Spin Updates Featuring Equalization of Spin States. In 2023 IEEE International Solid-State Circuits Conference (ISSCC). 142–144.
[3]
Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (July 1970), 422–426.
[4]
Fuxi Cai, Suhas Kumar, Thomas Van Vaerenbergh, Xia Sheng, Rui Liu, Can Li, Zhan Liu, Martin Foltin, Shimeng Yu, Qiangfei Xia, Jianhua Joshua Yang, Ray Beausoleil, Wei Lu, and John Paul Strachan. 2020. Power-efficient combinatorial optimization using intrinsic noise in memristor Hopfield neural networks. Nature Electronics 3 (07 2020), 1–10.
[5]
Jia Chen, Jiancong Li, Yi Li, and Xiangshui Miao. 2021. Multiply accumulate operations in memristor crossbar arrays for analog computing. Journal of Semiconductors 42, 1 (jan 2021), 013104.
[6]
Jia Chen, Fengbin Tu, Kunming Shao, Fengshi Tian, Xiao Huo, Chi-Ying Tsui, and Kwang-Ting Cheng. 2023. AutoDCIM: An Automated Digital CIM Compiler. In 2023 60th ACM/IEEE Design Automation Conference (DAC). 1–6.
[7]
Paul Dlugosch, Dave Brown, Paul Glendenning, Michael Leventhal, and Harold Noyes. 2014. An Efficient and Scalable Semiconductor Architecture for Parallel Automata Processing. IEEE Transactions on Parallel and Distributed Systems 25, 12 (2014), 3088–3098.
[8]
Xin Fan, Niklas Meyer, and Tobias Gemmeke. 2022. Compiling All-Digital-Embedded Content Addressable Memories on Chip for Edge Application. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41, 8 (2022), 2560–2572.
[9]
Xin Fan, Jan Stuijt, Bo Liu, and Tobias Gemmeke. 2019. Synthesizable Memory Arrays Based on Logic Gates for Subthreshold Operation in IoT. IEEE Transactions on Circuits and Systems I: Regular Papers 66, 3 (2019), 941–954.
[10]
Florian Freye, Jie Lou, Christian Lanius, and Tobias Gemmeke. 2024. Merits of Time-Domain Computing for VMM – A Quantitative Comparison. In 2024 25th International Symposium on Quality Electronic Design (ISQED). 1–8.
[11]
Hidehiro Fujiwara, Haruki Mori, Wei-Chang Zhao, Mei-Chen Chuang, Rawan Naous, Chao-Kai Chuang, Takeshi Hashizume, Dar Sun, Chia-Fu Lee, Kerem Akarvardar, Saman Adham, Tan-Li Chou, Mahmut Ersin Sinangil, Yih Wang, Yu-Der Chih, Yen-Huei Chen, Hung-Jen Liao, and Tsung-Yung Jonathan Chang. 2022. A 5-nm 254-TOPS/W 221-TOPS/mm2 Fully-Digital Computing-in-Memory Macro Supporting Wide-Range Dynamic-Voltage-Frequency Scaling and Simultaneous MAC and Write Operations. In 2022 IEEE International Solid-State Circuits Conference (ISSCC), Vol. 65. 1–3.
[12]
Catherine E. Graves, Can Li, Xia Sheng, Wen Ma, Sai Rahul Chalamalasetti, Darrin Miller, James S. Ignowski, Brent Buchanan, Le Zheng, Si-Ty Lam, Xuema Li, Lennie Kiyama, Martin Foltin, Matthew P. Hardy, and John Paul Strachan. 2019. Memristor TCAMs Accelerate Regular Expression Matching for Network Intrusion Detection. IEEE Transactions on Nanotechnology 18 (2019), 963–970.
[13]
Zhezhi He, Shaahin Angizi, and Deliang Fan. 2017. Exploring STT-MRAM Based In-Memory Computing Paradigm with Application of Image Edge Extraction. In 2017 IEEE International Conference on Computer Design (ICCD). 439–446.
[15]
Xiaohe Huang, Chunsen Liu, Yu-Gang Jiang, and Peng Zhou. 2020. In-memory computing to break the memory wall. Chinese Physics B 29, 7 (2020), 078504.
[16]
Christian Lanius and Tobias Gemmeke. 2022. Multi-Function CIM Array for Genome Alignment Applications built with Fully Digital Flow. In 2022 IEEE Nordic Circuits and Systems Conference (NorCAS). 1–7.
[17]
Christian Lanius and Tobias Gemmeke. 2024. Fully Digital, Standard-Cell-Based Multifunction Compute-in-Memory Arrays for Genome Sequencing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 32, 1 (2024), 30–41.
[18]
Christian Lanius, Jie Lou, Johnson Loh, and Tobias Gemmeke. 2023. Automatic Generation of Structured Macros Using Standard Cells ‒ Application to CIM. In 2023 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED). 1–6.
[19]
Jie Lou, Florian Freye, Christian Lanius, and Tobias Gemmeke. 2023. Scalable Time-Domain Compute-in-Memory BNN Engine with 2.06 POPS/W Energy Efficiency for Edge-AI Devices. In Proceedings of the Great Lakes Symposium on VLSI 2023 (Knoxville, TN, USA) (GLSVLSI ’23). Association for Computing Machinery, New York, NY, USA, 665–670.
[20]
Jie Lou, Christian Lanius, Florian Freye, Tim Stadtmann, and Tobias Gemmeke. 2022. All-Digital Time-Domain Compute-in-Memory Engine for Binary Neural Networks With 1.05 POPS/W Energy Efficiency. In IEEE 48th European Solid State Circuits Conference (ESSCIRC). 149–152.
[21]
Richard Mayr and Lorenzo Clemente. 2013. Advanced automata minimization. In Proceedings of the 40th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Rome, Italy) (POPL ’13). Association for Computing Machinery, New York, NY, USA, 63–74.
[22]
MythicAI. 2019. Taking powerful, efficient inference to the edge. Technical Report.
[23]
Elaheh Sadredini, Reza Rahimi, Vaibhav Verma, Mircea Stan, and Kevin Skadron. 2019. eAP: A Scalable and Efficient In-Memory Accelerator for Automata Processing. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (Columbus, OH, USA) (MICRO ’52). Association for Computing Machinery, New York, NY, USA, 87–99.
[25]
Arun Subramaniyan, Jingcheng Wang, Ezhil R. M. Balasubramanian, David Blaauw, Dennis Sylvester, and Reetuparna Das. 2017. Cache automaton. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (Cambridge, Massachusetts) (MICRO-50 ’17). Association for Computing Machinery, New York, NY, USA, 259–272.
[26]
Ke Wang, Kevin Angstadt, Chunkun Bo, Nathan Brunelle, Elaheh Sadredini, Tommy Tracy, Jack Wadden, Mircea Stan, and Kevin Skadron. 2016. An overview of micron’s automata processor. In 2016 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS). 1–3.
[28]
Florian Zaruba and Luca Benini. 2019. The Cost of Application-Class Processing: Energy and Performance Analysis of a Linux-Ready 1.7-GHz 64-Bit RISC-V Core in 22-nm FDSOI Technology. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 27, 11 (2019), 2629–2640.
[30]
Hongyi Zhang, Haozhe Zhu, Siqi He, Mengjie Li, Chengchen Wang, Xiankui Xiong, Haidong Tian, Xiaoyang Zeng, and Chixiao Chen. 2024. ARCTIC: Agile and Robust Compute-In-Memory Compiler with Parameterized INT/FP Precision and Built-In Self Test. In 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE). 1–6.
[31]
Shutao Zhang and Tobias Gemmeke. 2023. A 22-nm 1,287-MOPS/W Structured Data-Path Array for Binary Ring-LWE PQC. In ESSCIRC 2023- IEEE 49th European Solid State Circuits Conference (ESSCIRC). 189–192.