In-memory analog computing for non-negative matrix factorization


Introduction

Non-negative matrix factorization (NMF), a low-rank decomposition technique that approximates a non-negative matrix as the product of two non-negative factor matrices (Fig. 1a), has become a pivotal tool for uncovering latent structures in high-dimensional datasets1. By imposing non-negativity constraints, NMF inherently aligns with real-world physical and semantic interpretations, such as facial features and user preferences2,3. Therefore, it produces parts-based object representations that enhance interpretability compared to other low-rank factorization methods4, such as singular value decomposition, principal component analysis and vector quantization. This unique property has propelled its adoption across diverse domains: in recommendation systems, NMF disentangles user-item interaction patterns to predict unknown ratings5,6; in bioinformatics, it extracts biologically meaningful features from gene expression profiles7,8; in image processing, it decomposes visual data into additive components resembling object parts9,10; and in clustering tasks, it serves as a dimensionality reduction tool for unsupervised learning11,12 (Fig. 1b). However, for modern large-scale matrices, such as the Yahoo Music dataset (1,200,000 × 137,000 rating matrix), traditional digital hardware struggles to meet the demands of real-time NMF tasks13,14. This limitation arises from intrinsic factors, including high algorithmic complexity, the slowdown of Moore’s Law15,16, and the von Neumann bottleneck17,18. Analog computing leverages massive parallelism to accelerate matrix operations. In particular, by combining block-matrix algorithms to scale up the addressable problem size19, analog computing can provide a viable path to solving large-scale NMF challenges.

Fig. 1: Introduction to NMF.

a Schematic illustration of NMF, where the original matrix R (\(m\times n\)) is decomposed into two non-negative factor matrices with a low rank k, namely, U (\(m\times k\)) and V (\(n\times k\)), such that \({{{\bf{R}}}}\,\approx \,{{{\bf{U}}}}\cdot {{{{\bf{V}}}}}^{{{\rm{T}}}}\). The element Ri,j in matrix R can be reconstructed by multiplying the ith row of U with the jth column of VT. b Applications of NMF in various fields, including recommendation systems using user-item interactions, bioinformatics with gene expression analysis, image processing by additive part-based decomposition, and clustering by dimensionality reduction for unsupervised learning. c Progressive complexity hierarchy of matrix operations. Matrix multiplication and matrix inversion both compute a vector from a matrix and a vector; they form a pair of forward and inverse problems, with the latter having higher computational complexity. NMF requires the joint optimization of two non-negative matrices from a single input matrix, which is a non-convex problem. By implementing the ANLS algorithm with AMC systems, NMF problems can potentially be solved efficiently.

Analog matrix computing (AMC) with resistive memory arrays presents a promising path and has recently been applied to accelerate matrix equations—that is, equations in which the coefficients and unknowns are represented as matrices or vectors. This class includes both linear problems, such as square matrix inversion20 and generalized inverse21, and certain nonlinear ones, such as sparse approximation22. When used in applications like wireless signal processing23, neural network training21, and compressed sensing recovery22, AMC can achieve orders-of-magnitude improvements in computing throughput and energy efficiency compared to conventional digital solvers. Although conventional analog complementary metal-oxide-semiconductor circuits may be constructed for similar purposes24, emerging resistive memory (or the memristor25) is advantageous in terms of chip area and power consumption, thanks to the high-density crossbar array architecture26,27. The high regularity of the crossbar also offers inherent parallelism for fast computation. In addition, in situ computing with resistive memory represents an efficient solution for addressing the von Neumann bottleneck28,29,30,31. By constructing a closed loop between the resistive memory array and operational amplifiers (OPAs), AMC circuits can solve matrix equations in a single step within nanoseconds to microseconds20,21,22, without having to compute the inverse of the data matrix. Theoretical analysis shows that AMC circuits achieve O(1) time complexity for solving matrix problems32,33, significantly outperforming the cubic time complexity of traditional digital computing.

The computational complexity of matrix operations increases progressively from multiplication to inversion and, finally, to decomposition (Fig. 1c). Matrix-vector multiplication is a forward problem that computes the product of a matrix and a vector, e.g., r = Uv. In contrast, matrix inversion is an inverse problem that solves Uv = r, where matrix U and vector r are known, and vector v is unknown. The solution is given by v = U−1r, where U−1 is the inverse matrix of U. The concept is similar for the matrix generalized inverse (GINV). Decomposition, however, represents a higher-order inverse problem, in which a known matrix R is factorized into two unknown matrices, U and V (e.g., R = U\(\cdot\)VT), making it fundamentally a non-convex optimization problem34. While AMC enables single-step solutions for multiplication and inversion using open-loop and closed-loop resistive crossbar circuits, respectively, NMF cannot be solved in a single step in the analog domain. In low-rank matrix factorization, a substantial basin of attraction often exists around the globally optimal solution34. The alternating non-negative least squares (ANLS) algorithm reformulates the non-convex problem into multiple convex subproblems, efficiently converging to this basin35, and is therefore widely adopted for NMF. AMC represents a highly parallel computing paradigm, in which matrix operations are executed in a massively parallel fashion: during computation, all cells in the activated rows and columns participate simultaneously. This stands in sharp contrast to digital computing, where basic operations (e.g., multiplication and addition) are performed sequentially, leading to polynomial complexity for solving NMF problems. By leveraging massive parallelism, AMC is particularly well suited for accelerating NMF with low computational complexity. In particular, the GINV operations with various constraints in ANLS suggest that AMC circuits could be leveraged for acceleration.
However, the conventional GINV circuit design relies on densely interconnected OPAs21, leading to excessive area and power consumption, which hinders large-scale integration. Additionally, the single-purpose nature of traditional GINV circuits limits their ability to accommodate various constraints in solving matrix equations. Therefore, the development of efficient and versatile AMC circuits is essential.

In this work, we pioneer the analog implementation of NMF. At the core of our approach is a reconfigurable compact GINV circuit, optimized using conductance compensation (CC) principles to minimize the number of OPAs compared to previous designs22,36,37. This circuit supports diverse regularization penalties, enabling non-negative linear regression (NLR), non-negative ridge regression (NRR), and non-negative constrained ridge regression (NCRR) with known entries. By integrating the compact-GINV circuit with the ANLS algorithm, we achieve a fully analog NMF solution, experimentally demonstrating its effectiveness in distinct scenarios such as image compression (via NLR) and recommender systems (via NRR and NCRR). The analog NMF solver operates with only a few alternating iterations, delivering reliable results and high energy efficiency.

Results

Algorithm for non-negative matrix factorization

As shown in Fig. 1a, NMF factorizes the original matrix R into two dense latent feature matrices with a low rank k, namely \({{{\bf{R}}}}\,\approx \,{{{\bf{U}}}} \cdot {{{{\bf{V}}}}}^{{{\rm{T}}}}\), where U (\(m\times k\)) and V (\(n\times k\)) are usually identified as the feature matrix and coefficient matrix, respectively. In other scenarios, such as recommender systems, U and V are the user and item latent feature matrices. The product of U and V approximates the original matrix R, enabling prediction of the missing values in recommender systems. Since its formalization in 1999, NMF has gained prominence in diverse fields, owing to its ability to extract interpretable representations4. Its development has expanded to incorporate sparse constraints, L2 constraints (L2 regularization term \({||}{{{\bf{x}}}}|{|}_{2}^{2}\)), constraints that reduce the optimization scope, and so on1, enhancing adaptability to complex data structures in various scenarios. NMF is typically solved by minimizing the following cost function:

$${\min }_{{{{\bf{U}}}},{{{\bf{V}}}}\ge 0}(||{{{\bf{R}}}}-{{{\bf{U}}}}{{{{\bf{V}}}}}^{T}|{|}_{F}^{2}+\alpha \varTheta ({{{\bf{U}}}})+\beta \varPsi ({{{{\bf{V}}}}}^{T})).$$

(1)

where the subscript F denotes the Frobenius norm. Sometimes the Kullback-Leibler (KL) divergence may also serve as the cost function4. \(\varTheta (\cdot )/\varPsi (\cdot )\) and \(\alpha\)/\(\beta\) represent the penalty terms and their penalty parameters, respectively. The optimization problem in Eq. (1) is usually solved with two algorithmic frameworks: stochastic gradient descent (SGD)38 and ANLS39. While SGD is more popular in the conventional computing paradigm, it converges relatively slowly, requiring tens to hundreds of iterations6,38,39,40. Matrix computing involves regular, massive multiply-and-accumulate operations, and this regularity makes the ANLS algorithm well suited for acceleration via analog parallel architectures. In particular, ANLS inherently avoids negative values, circumventing the need for signed signal representation in analog devices. Thanks to the substantial basin of attraction around the globally optimal solution33, a large set of initial points can converge to the global optimum. Therefore, even when large computing errors occur in one iteration cycle (updating the U and V matrices once), subsequent iterations are still able to converge to the optimal solution with high probability. In addition, for non-convex optimization problems, errors in analog computing may even facilitate escaping local optima41. In this work, the extensively used squared Frobenius norm is applied with different constraints, and two application scenarios are considered: image compression and recommender systems. The choice of regularization terms in NMF is tied to application-specific objectives.
In image compression, where the original matrix R is fully observed (i.e., pixel intensities) and the need for L2-regularization is correspondingly eliminated, the cost function degenerates to \(\begin{array}{c}\min \\ {{{\bf{U}}}},{{{\bf{V}}}}{{{\boldsymbol{\ge }}}}0\end{array}\left({||}{{{{\bf{R}}}}-{{{\bf{UV}}}}}^{{{{\rm{T}}}}}|{|}_{{{{\rm{F}}}}}^{2}\right)\). Conversely, recommender systems operate on highly incomplete user-item rating matrices5,37,38,39,42, where missing entries dominate, and L2-regularization on both U and V becomes critical to prevent overfitting to sparse observations, which is expressed as5:

$$\begin{array}{c}\min \\ {{{\bf{U}}}},{{{\bf{V}}}}\ge 0\end{array}(||{{{\boldsymbol{\Omega }}}}{{{\boldsymbol{\odot }}}}({{{\bf{R}}}}-{{{\bf{U}}}}{{{{\bf{V}}}}}^{T})|{|}_{F}^{2}+\lambda ||{{{\bf{U}}}}|{|}_{F}^{2}+\lambda ||{{{\bf{V}}}}|{|}_{F}^{2}).$$

(2)

where λ is the coefficient of the \({{\mbox{L}}}_{2}\) penalty to prevent overfitting and ⊙ denotes the Hadamard (element-wise) product. Ω represents the mask matrix, where Ωij = 1 if the rating of user i for item j is known in the training set, and Ωij = 0 otherwise. Equation (2) represents a general case: it reduces to the simplest form when \(\lambda=0\) and \({{{\bf{R}}}}\) is completely known (all elements of Ω are 1). The ANLS method is shown in Fig. 2a, where the initial feature matrix U0 is generated randomly and the subsequent coefficient matrix V1 is obtained through regression operations (NLR, NRR, or NCRR) based on the given matrix U0 and the original matrix R. Through this alternating closed-form optimization approach, the cost function is minimized. The converged matrices U and V are the final factorization results, and their product yields highly accurate predictions.
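To make the alternating scheme concrete, the following is a minimal algorithmic sketch of ANLS in Python (NumPy), independent of the analog hardware; here each non-negative subproblem is solved by projected gradient descent, a simple software stand-in for the regression circuits described later. The function names are illustrative, not from the original work.

```python
import numpy as np

def nnls_pg(A, b, iters=300):
    """Solve min_{x >= 0} ||A x - b||^2 by projected gradient descent."""
    L = np.linalg.norm(A.T @ A, 2)            # step-size scale (spectral norm)
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = np.maximum(x - (A.T @ (A @ x - b)) / L, 0.0)  # gradient step + clip
    return x

def anls_nmf(R, k, cycles=10, seed=0):
    """Alternate non-negative least squares so that R is approx U @ V.T."""
    m, n = R.shape
    U = np.random.default_rng(seed).random((m, k))
    V = np.zeros((n, k))
    for _ in range(cycles):
        V = np.vstack([nnls_pg(U, R[:, j]) for j in range(n)])  # fix U, solve V
        U = np.vstack([nnls_pg(V, R[i, :]) for i in range(m)])  # fix V, solve U
    return U, V
```

On a small synthetic rank-k matrix, a few alternating cycles typically suffice to drive the reconstruction error low, consistent with the fast convergence discussed in the text.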

Fig. 2: Compact-GINV circuit for implementing ANLS.

a Illustration of ANLS algorithm for NMF. Starting with an initial random feature matrix U0, the coefficient matrix V1 is solved via non-negative constrained ridge regression. Iterative alternation between optimizing U and V minimizes the cost function through a chain of regression operations until convergence. Treating NCRR as the general form, NRR and NLR can be derived under specific conditions. b Comparison between GINV and compact-GINV circuits. In conventional GINV circuit, only linear regression can be performed with q + p OPAs. Compact-GINV circuit can be reconfigured to perform multiple functions with p OPAs. The OPAs in the circuit are powered with single-power supply to guarantee the non-negativity of the outputs. When switch 1 and switch 2 are both turned off, the circuit solves NLR in one step. When switch 1 is turned on, it performs NRR. When switch 2 is partially turned on, it performs NCRR. c Experimental validation of NLR and NRR using compact-GINV circuit. Eight synthetic data points (xᵢ, yᵢ) are fitted with the model of \({w}_{0}+{w}_{1}{x}_{i}={y}_{i}\), where y is mapped as input voltages and x is mapped as the conductances of RRAM arrays. Output voltages from OPAs (solid lines) align closely with ideal solutions (dashed lines). d NMSE across 50 regression trials. NRR (NMSE = 0.0029) outperforms NLR (NMSE = 0.0059). e Relative errors of individual coefficients across 50 NLR and NRR experiments.

Analog compact-GINV circuit

The vanilla GINV circuit is shown in Fig. 2b. It was designed for solving linear regression problems in one step20, but without the constraint of non-negativity. In solving GINV, the target cost function is expressed as \(\min {||}{{{\bf{r}}}}-{{{\bf{Uv}}}}|{|}_{2}^{2}\), where U is a given matrix of size \(q\times p\) (q > p), r (size \(q\times 1\)) is the given vector, and v (size \(p\times 1\)) is the vector to be solved. In the GINV circuit, two copies of U are mapped as device conductances of two resistive random-access memory (RRAM) arrays, referenced to a unit conductance g0; vector r is mapped as voltages, and the output voltages then represent the solution v. This design, however, requires a large number (q + p) of OPAs, resulting in high power consumption and large chip area. In addition, the complex yet fixed topology limits the flexibility to incorporate other constraints critical for real-world applications; e.g., adding L2 penalties necessitates p additional OPAs43. To address these challenges, we propose a new GINV circuit design (Fig. 2b), which is compact (thus termed compact-GINV) thanks to the conductance compensation (CC) principle. It reduces the number of required OPAs to p and can be conveniently reconfigured to solve GINV problems with various constraints.

Since the cost function is defined as the squared Frobenius norm (i.e., the sum of squared errors), the matrix-form cost function is equivalent to the sum of vector-form cost functions. For example: \({||}{{{{\bf{R}}}}-{{{\bf{UV}}}}}^{{{{\rm{T}}}}}|{|}_{{{{\rm{F}}}}}^{2}={\sum }_{i=1}^{n}{||}{{{{\boldsymbol{r}}}}}_{{{{\boldsymbol{i}}}}}-{{{\bf{U}}}}{{{{\boldsymbol{v}}}}}_{{{{\boldsymbol{i}}}}}|{|}_{2}^{2}\), where ri and vi denote the ith column vectors in R and VT, respectively. Minimizing each vector-form cost function is therefore equivalent to minimizing the matrix-form cost function. For convenience, we adopt the vector formulation in the following circuit analysis. Similarly, two copies of U are encoded in RRAM crossbars referenced to the unit conductance g0. Meanwhile, a column of RRAM devices with compensation conductances gc is used to ensure that the conductance sum of each row (including the conductances of the input resistors) equals a constant c. Upon the application of analog inputs −r, the circuit instantaneously generates the regression solution v at the OPA outputs. Briefly, according to Kirchhoff’s current law, the potentials on the row lines constitute the vector \(\frac{1}{c}\)(Uv − r). Owing to the extremely high input impedance of the OPAs, the input currents must sum to zero, namely \(\frac{1}{c}\)UT(Uv − r) = 0. Consequently, the potential vector v is the GINV solution. Compared to the vanilla GINV circuit, this design eliminates the column of transimpedance amplifiers (TIAs) that would otherwise bridge the two crossbar arrays, substantially reducing the number of amplifiers. Moreover, because the TIAs (which would invert the voltage signal) are removed, various constraints can be conveniently introduced into the target cost function.
Non-negativity constraints are enforced via a single-supply configuration, solving \(\begin{array}{c}\min \\ {{{\bf{v}}}}{{{\boldsymbol{\ge }}}}0\end{array}\left({||}{{{\bf{r}}}}-{{{\bf{Uv}}}}|{|}_{2}^{2}\right)\) (NLR) for tasks such as image compression. For recommender systems with sparse observations, some vectors are fully known while most are only partially observed, requiring NRR and NCRR. By adding feedback conductances (\(\lambda /c{g}_{0}\)) between the output and inverting input nodes of the OPAs, the L2 penalty with parameter \(\lambda\) can be implemented, realizing NRR, which solves \(\begin{array}{c}\min \\ {{{\bf{v}}}}{{{\boldsymbol{\ge }}}}0\end{array}\left({||}{{{\bf{r}}}}-{{{\bf{Uv}}}}|{|}_{2}^{2}+\lambda {||}{{{\bf{v}}}}|{|}_{2}^{2}\right)\). Grounding the unmeasured entries in r restricts the optimization to the known data (\({{{\boldsymbol{\Omega }}}}\)), yielding the cost function \(\begin{array}{c}\min \\ {{{\bf{v}}}}{{{\boldsymbol{\ge }}}}0\end{array}\left({||}{{{\boldsymbol{\Omega }}}}\odot \left({{{\bf{r}}}}-{{{\bf{Uv}}}}\right)|{|}_{2}^{2}+\lambda {||}{{{\bf{v}}}}|{|}_{2}^{2}\right)\), which corresponds to NCRR. The theoretical analysis of the circuit equations is detailed in Supplementary Note 1, and the detailed circuit configurations for NLR, NRR, and NCRR are shown in Supplementary Fig. 1.
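The steady-state conditions derived above can be checked numerically. The short sketch below assumes ideal OPAs and exact conductance mapping; it verifies that the NLR fixed point coincides with the ordinary least-squares solution, and that the NRR feedback term adds λI to the normal matrix and shrinks the solution.

```python
import numpy as np

# Numerical check of the compact-GINV steady state (assuming ideal OPAs and
# exact conductance mapping): the loop enforces U^T (U v - r) = 0, i.e. the
# least-squares normal equations; the feedback conductance adds lam * I.
rng = np.random.default_rng(1)
U, r = rng.random((8, 2)), rng.random(8)

v_nlr = np.linalg.solve(U.T @ U, U.T @ r)        # fixed point without feedback
v_ref, *_ = np.linalg.lstsq(U, r, rcond=None)    # reference least squares
assert np.allclose(v_nlr, v_ref)

lam = 2.52                                       # lambda value used in the text
v_nrr = np.linalg.solve(U.T @ U + lam * np.eye(2), U.T @ r)  # NRR fixed point
assert np.linalg.norm(v_nrr) < np.linalg.norm(v_nlr)  # ridge shrinks the solution
```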

Experiment demonstration

For demonstration purposes, a widely adopted HfO2-based RRAM technology was used here (see Methods and Supplementary Fig. 2) to validate the proposed compact-GINV circuit. The electrical characteristics of the fabricated devices were tested in our previous works21, showing analog conductance states with good I-V linearity, endurance, and retention performance, which are beneficial for NMF tasks. To speed up device programming during experiments, an adaptive accuracy mapping method was developed for the widely applied write-verify process. For different target conductance values Gt, the programming precision is adjusted correspondingly (Supplementary Fig. 3). All programming cycles in the experiments were recorded (Supplementary Fig. 4), showing that an average of 9 programming cycles is required to program a single device, a 32.3% programming speedup compared to the conventional programming method (13.3 programming cycles)44. Here, one programming cycle is defined as a single sequence of programming, reading, and comparison operations (Supplementary Fig. 3). In our experiments, the unit conductance was selected as g0 = 100 μS. The compensation conductance gc is pre-calculated with minimal computational overhead and can be mapped into several parallel rows when dealing with large values21.
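For illustration, a toy write-verify loop is sketched below; the device model (`ToyRRAM`) and the adaptive tolerance rule are hypothetical stand-ins (the actual adaptive accuracy mapping is detailed in Supplementary Fig. 3), serving only to show the cycle structure of program, read, and compare.

```python
import random

class ToyRRAM:
    """Hypothetical device model, for illustration only: each pulse moves
    the conductance halfway to the target plus a random disturbance."""
    def __init__(self):
        self.g = 0.0
    def pulse(self, target_uS):
        self.g += 0.5 * (target_uS - self.g) + random.uniform(-2.0, 2.0)
    def read(self):
        return self.g

def write_verify(device, target_uS, tol_uS, max_cycles=100):
    """One cycle = program pulse, read, compare (as defined in the text).
    Returns the number of cycles used."""
    for cycle in range(1, max_cycles + 1):
        device.pulse(target_uS)
        if abs(device.read() - target_uS) <= tol_uS:
            return cycle
    return max_cycles

def adaptive_tolerance(target_uS):
    # Hypothetical adaptive rule: looser absolute tolerance for larger targets.
    return max(2.0, 0.05 * target_uS)
```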

As a first attempt, the NLR and NRR experiments were demonstrated with a simple model, w0 + w1 xi = yi. Eight points (xi, yi) were randomly generated, forming an \(8\times 2\) matrix U whose first column is all ones20. In the experiment, two copies of matrix U and gc were mapped into an RRAM array through the adaptive write-verify method, and the constant c was set to 8.3. In NRR, \(\lambda\) was selected as 2.52, resulting in a 30 μS feedback conductance. By encoding the given vector y as input voltages, both regression tasks were solved in continuous time, with the OPA outputs representing the solution coefficients. The experimental results are shown in Fig. 2c, demonstrating high consistency with the ideal fitting lines. Subsequently, 50 independent trials were performed across distinct (xi, yi) sets. To evaluate computing accuracy, the normalized mean square error (NMSE) is applied here. It is defined as \({{||}{{{\bf{x}}}}-\hat{{{{\bf{x}}}}}{||}}_{2}^{2}/{{||}\hat{{{{\bf{x}}}}}{||}}_{2}^{2}\), where \({{{\bf{x}}}}\) and \(\hat{{{{\bf{x}}}}}\) represent the experimental and ideal algorithmic results, respectively. The results show good agreement with the ideal ones, with NMSEs of merely 0.0059 and 0.0029 for NLR and NRR, respectively (Fig. 2d). Notably, the relative errors of individual coefficients were calculated for more detailed observation, and most of them are less than 10% (Fig. 2e). The experimental results demonstrate that NRR achieves lower computing error than NLR, which can be attributed to the reduced condition number of UTU + \(\lambda {{{\bf{I}}}}\) (I is an identity matrix) compared to UTU.
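The NMSE metric and the condition-number argument can be reproduced in a few lines; the design matrix below mirrors the w0 + w1x model with λ = 2.52 as in the experiment, while the synthetic data points are illustrative.

```python
import numpy as np

def nmse(x, x_ideal):
    """Normalized mean square error, ||x - x_ideal||^2 / ||x_ideal||^2."""
    x, x_ideal = np.asarray(x, float), np.asarray(x_ideal, float)
    return np.sum((x - x_ideal) ** 2) / np.sum(x_ideal ** 2)

# Ridge regularization lowers the condition number of the normal matrix,
# the stated reason for NRR's lower computing error compared to NLR.
rng = np.random.default_rng(0)
x_data = rng.random(8)
U = np.column_stack([np.ones(8), x_data])   # design matrix of w0 + w1*x = y
lam = 2.52                                   # lambda used in the experiment
cond_nlr = np.linalg.cond(U.T @ U)
cond_nrr = np.linalg.cond(U.T @ U + lam * np.eye(2))
assert cond_nrr < cond_nlr
```

For any symmetric positive-definite normal matrix, adding λI raises the smallest eigenvalue proportionally more than the largest, so the condition number strictly decreases whenever the eigenvalues differ.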

Application to image compression

The implementation of the ANLS algorithm for NMF involves frequent programming of U/V, circuit computing, and data readouts. We developed a microcontroller unit (MCU)-controlled printed circuit board (PCB) platform for running the algorithm. The PCBs integrate OPAs, digital-to-analog converters (DACs), analog-to-digital converters (ADCs), a write-verify module, and other peripheral circuits. The MCU manages data caching, dynamic signal scaling, compensation conductance gc computation, and system control. Figure 3a illustrates the hardware configuration (detailed in “Methods”). Key experimental parameters, including the constant c, the L2 penalty coefficient λ, the feedback conductances, and the circuit functions, are summarized in Supplementary Table 1. Taking NMF of a small matrix (4\(\times\)4) as an example, the original matrix (Fig. 3b) is decomposed into the product of two 4\(\times\)1 matrices via the NLR circuit. As shown in Fig. 3c, the initial feature matrix U0 is generated randomly and mapped into the RRAM array together with the calculated gc, leading to a \(4\times\)3 RRAM array. Then, the original matrix is scaled as input voltages and fed into the circuit column by column. As a result, the outputs of the compact-GINV circuit, after scaling, become the first coefficient matrix V1. The scaling methodology is explained in Supplementary Note 2. Iterative updates of U and V through alternating optimization led to rapid convergence, with all results shown in Supplementary Fig. 5. The final factorized U and V are depicted in Fig. 3d, where the product of U and VT represents the recovered matrix that approximates the original matrix. As shown in Fig. 3e, hardware-generated solutions progressively aligned with algorithmic predictions (full-precision digital solver). After merely one iteration cycle, the NMSE is reduced from 0.127 to 9.06 × 10−4, demonstrating the high effectiveness of the ANLS algorithm (Supplementary Fig. 6).

Fig. 3: Image compression with NMF.

a Hardware architecture of the automated analog NMF testing platform. An MCU coordinates DACs/ADCs, RRAM write-verify modules, and OPAs to execute iterative ANLS workflows. b Original 4\(\times\)4 non-negative matrix for NMF. The target is to factorize it into two 4\(\times\)1 matrices (U, V) with 50% compression. c Experimental results of computing the coefficient matrix V1 with a randomly initialized U0. First, the randomly initialized matrix U0 is mapped into a 4\(\times\)3 RRAM array. Second, the original matrix R is scaled as input voltages. Third, the compact-GINV circuit outputs yield V1. d The factorized U and V and their product as the recovered matrix. For demonstration purposes, U2 and V2 are multiplied by a constant and its reciprocal, respectively, to maintain the same value range while keeping their product unchanged. e Convergence trajectory of the analog NMF solver over two ANLS cycles. Experimental results (solid lines) gradually match full-precision digital solver predictions (dashed lines), achieving NMSE = 1.35\(\times\)10−4. f Original RGB nebular image and the reconstructed image using the analog NMF solver. The original image was captured by the Hubble Space Telescope, with credit to NASA, ESA, and The Hubble Heritage Team (STScI/AURA). The image used in the experiment was downsampled to an RGB image of 200 × 100 pixels. The image is partitioned into 4\(\times\)4 patches (3750 in total) for factorization. g Per-patch NMSE distribution of experimental results and the algorithmic solution. h The factorized matrices U and V of the RGB nebular image, computed by our analog NMF solver.

Using the same scheme, image compression tasks were conducted. The image is factorized into small matrices using NMF, which require less storage space, and their product represents the recovery of the original image. A red-green-blue (RGB) nebular image of size 200\(\times\)100 (Fig. 3f) from the National Aeronautics and Space Administration (NASA) was adopted; it was segmented into a 50\(\times\)25 grid of 4\(\times\)4 patches in each channel, leading to 3750 patches in total. For each small matrix, the ANLS process was performed individually for two iteration cycles to factorize it into two 4\(\times\)1 matrices, which corresponds to a compression ratio of 50%, thus saving half of the storage space. The same initial feature matrix U0 (Fig. 3b) and scaling strategy are applied for all small matrices. The recovery quality of an image is usually measured by the peak signal-to-noise ratio (PSNR), which is defined as \(10{{\mbox{lo}}}{{\mbox{g}}}_{10}(\frac{{255}^{2}\cdot M}{{\Vert{{{\bf{x}}}}-\hat{{{{\bf{x}}}}}\Vert}_{2}^{2}})\) (M is the length of vector x). In total, 3750 groups of NMF experiments were conducted, and the corresponding reconstructed image is depicted in Fig. 3f, whose NMSE and PSNR against the original image are 3.3 × 10−3 and 32.88 dB, respectively. Compared to the algorithmic solution running on a full-precision digital computer, the accuracy loss is merely 1.39 dB, suggesting high recovery quality. Specifically, the NMSE distributions in Fig. 3g exhibit excellent agreement between experimental measurements and theoretical predictions across all small patches, validating the effectiveness of the analog NMF solver. The decomposed results contain 50\(\times\)25\(\times\)3 = 3750 pairs of low-rank matrices U and V with size 4\(\times\)1, and the recombined results Uall and Vall in RGB format are shown in Fig. 3h. To evaluate the impact of matrix size on convergence speed, block matrices of varying sizes were extracted from the nebula image and simulated.
All cases consistently converged within 2–3 iteration cycles (Supplementary Fig. 7), indicating that problem size has only a negligible effect on convergence speed.
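The patch partitioning and PSNR evaluation are straightforward to sketch; `patchify` and `psnr` below are illustrative helpers, with PSNR following the definition given in the text.

```python
import numpy as np

def patchify(channel, p=4):
    """Split an (H, W) image channel into non-overlapping p x p patches."""
    H, W = channel.shape
    return channel.reshape(H // p, p, W // p, p).swapaxes(1, 2).reshape(-1, p, p)

def psnr(x, x_hat):
    """Peak signal-to-noise ratio for 8-bit data, per the definition above."""
    mse = np.mean((np.asarray(x, float) - np.asarray(x_hat, float)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```

Applied to a 200 × 100 channel with p = 4, `patchify` yields the 50 × 25 = 1250 patches per channel (3750 over three channels) described above.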

Application to recommender system

Recommender systems are another important application of NMF. They are algorithms that link people with their potential favorites based on historical data (e.g., a user-item rating matrix), thus enhancing user experience while generating substantial commercial and societal value. Collaborative filtering (CF)-based recommender systems are among the most successful methods, leveraging connections between users and items based on neighborhood information45. Among CF-based algorithms, NMF is a mainstream approach, thanks to its accuracy and scalability. Meanwhile, it also serves as a key enabler in many deep neural network-based hybrid models46,47. A tiny example of a user-item rating matrix is shown in Fig. 4a, containing both known and unknown ratings. NMF decomposes the rating matrix R into a user latent feature matrix U and an item latent feature matrix V, based on the known entries. The product of U and V predicts the unknown ratings for recommendation. The original matrix R is divided into a training set and a testing set (Fig. 4a), where the training set is used for optimizing Eq. (2) and the testing set is used to validate the predictions. The compact-GINV circuit configured for NRR and NCRR is adopted for this demonstration.

Fig. 4: Experimental results for small-scale recommender systems.

a User-item rating matrix for recommendation tasks. Seven known ratings serve as the training set, while four withheld entries form the testing set, allowing validation of the predictive capability of NMF. b Experimental workflow of the derivation from U0 to the first item feature matrix V1. First, the randomly initialized user matrix U0 is mapped into a 4\(\times\)3 RRAM array. Second, the training ratings scaled by 1/10 are input as voltages, with unknown entries grounded. Third, the compact-GINV circuit outputs yield V1. c Evolution of user (U) and item (V) latent feature matrices over about 4 ANLS cycles, from U0 to U3 and V1 to V4. d Cost function convergence across 4 cycles. The analog NMF solver exhibits a similar trend to the algorithm running on a digital solver. e Comparison of the ground truth, algorithm-predicted values, and experimentally predicted values of NMSE for the testing set.

The process of calculating V1 from U0 is summarized in Fig. 4b. Similar to the procedure in Fig. 3c, the initial user matrix is randomly generated and mapped as conductances, resulting in a 4\(\times\)3 RRAM array. The rating matrix R (training set) is then mapped to input voltages, where unknown entries can be assigned arbitrary values or left floating, as the corresponding column lines are grounded. The outputs of the compact-GINV circuit constitute the first item feature matrix V1. Four iteration cycles that progressively optimize the latent feature matrices U and V are shown in Fig. 4c, indicating fast convergence, as evidenced by the monotonic decrease of the cost function (Eq. (2)) in Fig. 4d. Experimental results show good agreement with full-precision algorithmic solutions (Fig. 4d), which is attributable to the ANLS algorithm’s inherent error tolerance for NMF. Based on the converged U and V matrices, precise prediction results are obtained on the testing set, with NMSE = 0.0027 (Fig. 4e).
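A minimal numerical sketch of this step is given below, assuming ideal circuit behavior; non-negativity is approximated here by clipping rather than the circuit's single-supply enforcement, and the check confirms that values placed at unknown entries do not affect the result, mirroring the grounded column lines. All names are illustrative.

```python
import numpy as np

def ncrr_column(U, r, mask, lam):
    """Ridge regression restricted to observed entries of r (the NCRR
    subproblem for one column); non-negativity approximated by clipping,
    a simplification of the circuit's single-supply constraint."""
    A, b = U[mask], r[mask]
    k = U.shape[1]
    v = np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ b)
    return np.maximum(v, 0.0)

rng = np.random.default_rng(3)
U0 = rng.random((4, 2))
r = rng.random(4)
mask = np.array([True, False, True, True])  # one unknown rating

# Values at unknown entries are irrelevant: only masked rows enter the solve.
r_garbage = r.copy()
r_garbage[~mask] = 999.0
assert np.allclose(ncrr_column(U0, r, mask, 0.1),
                   ncrr_column(U0, r_garbage, mask, 0.1))
```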

Real-world recommender systems typically operate on massive-scale matrices (m and n ranging from thousands to millions) with extremely high sparsity. To validate our analog NMF solver on a real-world dataset, we use the MovieLens 100k dataset (m = 943, n = 1682, 100,000 known entries) (Supplementary Fig. 8), with 80,000 randomly selected known entries for training (94.96% sparsity) and the remaining 20,000 for testing. To factorize large-scale matrices within resource-limited resistive memory arrays, a block matrix-based method is designed for NRR and NCRR in recommender systems (Supplementary Note 3). As shown in Fig. 5a, the original rating matrix and user matrix are segmented into many sub-matrices (e.g., R[i], U1[i]) with a parameter d for individual regression, generating many intermediate item matrices (e.g., V2,i). The next item latent matrix is obtained from the intermediate matrices after a nonlinear operation (maximization and scaling, Supplementary Note 3) to capture the global features. The validation of this block matrix-based method and the strategy for choosing the scaling factor are also provided in Supplementary Note 3. In the experiment, the number of latent features is set to 2 (k = 2) for extreme compression, namely the original 943 × 1682 rating matrix R is factorized into two matrices U and V of sizes 943 × 2 and 1682 × 2, respectively. The alternating user and item feature matrices (i.e., U and V) are segmented into 236 and 421 \(4\times 2\) sub-matrices (d = 4), respectively, with the help of zero padding. The resulting sub-matrices are mapped in a \(4\times 5\) RRAM array.
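The row-wise segmentation with zero padding can be sketched as follows; `segment` is an illustrative helper (the combining nonlinear operation is detailed in Supplementary Note 3), and the block counts match those stated above for the MovieLens matrices (236 user blocks and 421 item blocks for d = 4).

```python
import numpy as np

def segment(M, d):
    """Split matrix M (m x k) row-wise into ceil(m/d) blocks of d rows,
    zero-padding the final block as described in the text."""
    m, k = M.shape
    h = -(-m // d)                                  # ceil(m / d)
    padded = np.vstack([M, np.zeros((h * d - m, k))])
    return [padded[i * d:(i + 1) * d] for i in range(h)]
```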

Fig. 5: Experimental results of the recommender system for the MovieLens 100k dataset.

a Block matrix-based method, with the example of calculating V2 from U1. The transpose of the original rating matrix is segmented into h sub-matrices \({{{{\bf{R}}}}}_{[1-h]}\) along the rows with interval d. The first user matrix is also segmented into h sub-matrices U1[1-h] with size \(d\times k\). Based on the ith pair of \({{{{\bf{R}}}}}_{[i]}\) and U1[i], the ith intermediate matrix V2,i is acquired by minimizing the cost function in the figure. After repeating this h times, all the intermediate matrices V2,1-h are ready and the second item matrix V2 is calculated through the maximization and scaling operation. b Conductance mapping of matrices U1,[1~236] (4 × 2) in a 4 × 5 RRAM array. Small conductance values show larger errors due to device relaxation. c The intermediate matrices V2,1~236 obtained in the experiment, and the actual second item matrix V2. Global latent features are extracted through the nonlinear operation. d The experimental elements of matrix V2 plotted against the ideal ones, showing an NMSE of 0.0035. e All the experimental user and item latent feature matrices. Except for the random initialization of U0, all other matrices were obtained experimentally. The final factorization results are U3 and V3, with which the unknown entries in the original rating matrix are predicted. f The evolution of the predicted rating matrix over the iterations. Rpredicted = \({{{{\bf{U}}}}}_{3}{{{{\bf{V}}}}}_{3}^{{\mbox{T}}}\) is the final predicted rating matrix. g Optimization of the training cost function and the testing-set NMSE over iterations, together with the result obtained without segmentation.

Three iteration cycles were conducted in the experiment, with the product U3V3T representing the predicted rating matrix Rpredicted. Over the course of the experiment, the RRAM array was reprogrammed a total of 1971 times (or 17,739 writing pulses with write-verify for each device), which is manageable for the fabricated devices thanks to their robust endurance22. Taking the procedure from U1 to V2 as an example, two copies of U1[i] along with gc were first programmed into the \(4\times 5\) RRAM array; all 236 programming results of U1[1~236] are shown in Fig. 5b. Small conductances in Gleft and Gright suffer more from device relaxation, thus presenting larger absolute programming errors (e.g., > 30 \(\mu {\mbox{S}}\)). The programming results of the other cycles, shown in Supplementary Fig. 9, exhibit the same behavior. Then, the sub-matrix R[i] was encoded as input voltages and sequentially fed into the circuit column by column. After performing 1682 operations with the compact-GINV circuit, the ith intermediate matrix V2,i (size \(1682\times 2\)) was obtained. By repeating this procedure, all 236 intermediate matrices V2,1-236 were generated with good accuracy (NMSE = 0.007, Supplementary Fig. 10). The raw circuit outputs for the other iteration cycles are displayed in Supplementary Fig. 8, with all computed NMSE values below 0.01. The final \({{{{\bf{V}}}}}_{2}\) is acquired through the nonlinear operation (Fig. 5c), resulting in an NMSE of 0.0035 (Fig. 5d). The single-element relative errors of all experimental user and item matrices are calculated and shown in Supplementary Fig. 11, demonstrating very low NMSEs (ranging from 0.003 to 0.0044) and average relative errors (ranging from 1 to 4%). All user and item feature matrices are presented in Fig. 5e. In each iteration cycle of the NMF experiment, a new pair of user and item feature matrices was generated, which can then be used to compute the predicted rating matrix for evaluation.
The corresponding evolution is illustrated in Fig. 5f. The predicted rating matrix Rpredicted = U3V3T after 3 iteration cycles closely aligns with the original matrix (Supplementary Fig. 12). Notably, the predicted rating matrix after just one iteration cycle (i.e., U1V1T) is already visually close to the converged matrix, again demonstrating the algorithm's fast convergence. For the training set, the value of the cost function in Eq. (2) decreases with each iteration and converges rapidly; the NMSE of the test set behaves the same way. Notably, for both indicators, the experimental results are almost identical to the algorithmic ones (full-precision digital solver) (Fig. 5g). Compared to the original ANLS algorithm without segmenting the matrices (and its circuit simulation, see Methods), the results show some performance degradation, which is inherent to the block matrix method.

To investigate the impact of block size d on solving quality, simulations were performed across block sizes d ranging from 4 to 256. Results demonstrate that the cost function and the NMSE on the testing set decrease for larger d (Supplementary Fig. 13), suggesting that the quality deficiency observed in the experiment can be overcome with large-scale RRAM arrays, which are readily affordable with advanced RRAM technology48,49. Additionally, for all block sizes, the algorithm converges within 2–3 iteration cycles (Supplementary Fig. 13), indicating that block size has a minor impact on convergence speed. To compare the solving quality with deep learning methods, the indicator root mean square error (RMSE) is introduced, defined as \({({\Vert{{{\bf{x}}}}-\hat{{{{\bf{x}}}}}\Vert}_{2}^{2}/|{{{\bf{T}}}}|)}^{1/2}\), where x and \(\hat{{{{\bf{x}}}}}\) represent the predicted values and their ground truth, and \(|{{{\bf{T}}}}|\) denotes the size of the testing set. The state-of-the-art (SOTA) method we found for the MovieLens 100k dataset combines a graph model with deep learning and achieves a small RMSE50. As presented in Supplementary Table 2, the simulation result without segmentation is worse than this advanced machine learning method but comparable to a recurrent neural network method51. When employing the block matrix-based method, however, the experimental results underperform both approaches, which stems from the unavoidable information loss with a very small segmentation parameter (d = 4). Nevertheless, this degradation is avoidable by using large RRAM arrays (d = 256).
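The RMSE indicator defined above is straightforward to compute; a minimal sketch:

```python
import numpy as np

def rmse(pred, truth):
    """RMSE = (||x - x_hat||_2^2 / |T|)^(1/2) over the |T| testing entries."""
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

# toy check on two held-out ratings: one off by a star, one exact
err = rmse([4, 3], [5, 3])   # sqrt(1/2)
```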

Analysis of error tolerance

To demonstrate the error tolerance of the analog NMF solver, an additional 4\(\times\)4 NMF experiment was conducted. Significant mapping errors occurred while programming matrix V1, resulting in a high NMSE; however, subsequent iterations still converged to a minimum (Supplementary Fig. 14). Similar phenomena also occurred in the earlier experiments. Across all experiments (~430,000 trials), the average NMSE of circuit outputs was 0.0072, while the errors in the final NMF results were substantially suppressed, owing to the large basin of attraction around the global optimum in NMF problems34. In the image compression experiment, the recovered image shows only a 1.39 dB PSNR loss (exp: 32.88 dB vs. FP64: 34.27 dB) and a minor NMSE increase of 0.0009 (exp: 0.0033 vs. FP64: 0.0024). In the recommender system experiment (MovieLens 100k), the NMSE on the test dataset increased by just 0.0001 in the 3rd iteration cycle (exp: 0.0800 vs. FP64: 0.0799). To further investigate error tolerance in large-scale arrays, image compression was taken as an example, where a 200 × 66 RRAM array (including two columns for CC) was used to decompose the original 200 × 100 × 3 nebula image into 3 pairs of matrices (dimensions 200 × 32 and 100 × 32) over 2 iteration cycles. In simulations, a practical OPA model, the AD8572 from ADI (1.5 MHz bandwidth, 145 dB DC open-loop gain, 51 nV/\(\surd {\mbox{Hz}}\) voltage noise), was adopted to mimic the non-ideal characteristics of OPAs. Programming errors were varied from ±1% to ±60%. Results show that mapping errors within ±20% have negligible impact on the final NMSE and PSNR. Remarkably, even with substantial mapping errors (±50% and ±60%), the solver still correctly reconstructed the image within 2–3 iteration cycles (Supplementary Fig. 15).
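The mapping-error study can be emulated in software by perturbing the mapped matrix before each solve. The sketch below uses a toy rank-2 matrix and an ordinary least-squares solve with clipping in place of the analog circuit; only the ±20% error level is taken from the text, all other settings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
k = 2
R = rng.random((8, k)) @ rng.random((k, 6))   # exactly rank-k, non-negative

def noisy_solve(A, B, err=0.20):
    """Least squares with A perturbed elementwise by +/-err, mimicking mapping errors."""
    A_err = A * (1 + rng.uniform(-err, err, A.shape))
    X, *_ = np.linalg.lstsq(A_err, B, rcond=None)
    return np.maximum(0.0, X)                  # clipping as a crude non-negativity step

U = rng.random((8, k))
for _ in range(3):                             # a few iteration cycles, as in the experiments
    V = noisy_solve(U, R).T
    U = noisy_solve(V, R.T).T

nmse = np.linalg.norm(R - U @ V.T) ** 2 / np.linalg.norm(R) ** 2
```

Because each alternating step re-fits against the exact rating data, errors injected into one factor are largely absorbed by the next solve, which is the intuition behind the basin-of-attraction argument above.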

Temperature effects were also investigated; they primarily impact retention and conductance values (the latter similar to mapping errors). Since computing is completed immediately (≪1 s) after programming, and the 8-state RRAM retention was tested for \({10}^{4}\) s in our prior work22, retention fully meets the requirements. Regarding conductance variation, a temperature model for analog RRAM52 was applied, as described by Eq. (3).

$$R={R}_{0}(1+\rho (T-{T}_{0})),$$

(3)

where R is the resistance at temperature T, R0 is the resistance at room temperature T0, and ρ is the temperature coefficient. At high conductance, the device exhibits metallic behavior (ρ > 0), so increased temperature decreases conductance; conversely, at low conductance, it exhibits semiconducting behavior (ρ < 0), where increased temperature increases conductance (Supplementary Fig. 16). This phenomenon was also observed and verified in 1-bit RRAM53. Assuming programming is performed at room temperature (20 °C) with a ±5% programming error, simulations were conducted from 20 to 120 °C. The results demonstrate minimal impact on decomposition quality (Supplementary Fig. 16), owing to the high error tolerance.
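Eq. (3) translates directly into a conductance update; the ρ values and the metallic/semiconducting threshold below are illustrative placeholders, not the fitted coefficients of ref. 52:

```python
def conductance_at_T(G0_uS, T_C, T0_C=20.0):
    """Eq. (3) applied to conductance; rho values and threshold are illustrative."""
    rho = 1e-3 if G0_uS > 300 else -1e-3    # metallic vs. semiconducting behavior
    R0 = 1.0 / G0_uS                        # resistance at room temperature (in 1/uS)
    R = R0 * (1 + rho * (T_C - T0_C))       # Eq. (3)
    return 1.0 / R
```

At 120 °C this model lowers a high-conductance (metallic) state and raises a low-conductance (semiconducting) state, matching the trend reported in Supplementary Fig. 16.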

Discussion

Graphics processing units (GPUs) and field-programmable gate arrays (FPGAs) with optimized algorithms remain the state-of-the-art solutions for accelerating NMF38,54,55,56,57. In contrast, this work explores an alternative computing paradigm, i.e., analog computing, for solving this problem. To benchmark the analog NMF solver against advanced digital solvers, the case of recommender systems is considered. The datasets range from the MovieLens 100k dataset to larger ones, including Netflix (m = 480,000, n = 17,000) and Yahoo (m = 1,200,000, n = 137,000)38,55. A multi-core architecture is designed for fair comparison with entire FPGAs and GPUs. The architecture includes a global buffer, a functional module, instruction-related registers and decoders, and 1108 cores58. Each core contains an RRAM array, DACs, ADCs, I/O registers, a control module, a CC compute module, write circuitry, etc. Detailed configurations and parameters are provided in Supplementary Note 4. Since rating matrices in recommender systems typically consist of discrete integers, low-precision DACs can be employed. Supplementary Fig. 17 demonstrates decomposition results on MovieLens 100k using different DAC precisions, indicating that 3-bit DACs are an optimal choice. Consequently, a 3-bit R-2R DAC (Supplementary Fig. 17) was designed and implemented in the multi-core architecture to reduce system energy and chip area.
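The 3-bit choice follows from the rating alphabet: integer ratings 1–5 require only five distinct input levels, which an ideal 3-bit R-2R ladder already represents exactly (the Vref and code assignment below are illustrative):

```python
def r2r_dac(code, bits=3, vref=1.0):
    """Ideal R-2R ladder transfer function: Vout = Vref * code / 2**bits."""
    assert 0 <= code < 2 ** bits
    return vref * code / 2 ** bits

# integer ratings 1..5 occupy five of the eight available 3-bit levels
levels = [r2r_dac(r) for r in range(1, 6)]
```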

The computing core utilizes a large-scale 256\(\times\)66 (two columns for CC) RRAM array that is capable of supporting regression problems for 256\(\times\)32 matrices. Combined with the proposed block matrix-based method, it efficiently solves ultra-large-scale NMF problems in recommendation systems. However, several issues may arise in large-scale arrays, such as extra overhead for computing and programming CC, long-wire delay, and large numbers of DACs/ADCs. To reduce the conductance value of CC, two parallel columns are employed for mapping CC, halving the conductance required per column. System-level evaluation shows that the resources required for computing and mapping CC represent a minimal overhead: latency accounts for 3.18% of total time, and energy consumption accounts for 0.12% of total energy. Furthermore, wire-delay issues in large-scale arrays were investigated using a parasitic capacitance model (Supplementary Fig. 18)59. The convergence time of the compact-GINV circuit increases from 76 ns to 79.2 ns, which is also accounted for in the evaluations. Additionally, to reduce power consumption and chip area, the prevailing approach of sacrificing some computing speed by using multiplexers (MUXs) and shared ADCs29 is adopted.

For the MovieLens 100k dataset, the NMF task runs on a Xilinx Zynq-7000 FPGA (28 nm node) platform57. For larger datasets, the Virtex UltraScale+ FPGA (16 nm node)38 and the Nvidia Titan RTX GPU (4608 CUDA cores, 12 nm node)57 are the most advanced hardware we found performing the NMF algorithm. To benchmark the multi-core solver against SOTA digital solvers, the NVIDIA H100 (14,592 CUDA cores, 4 nm node) and the AMD Versal™ AI Core (VC2802, 7 nm node) FPGA are considered, and their performance is estimated by comparing peak computing capabilities. The MovieLens 100k task occupies 7 cores and achieves a 212\(\times\) speedup and an energy-efficiency improvement of 4.6\(\times {10}^{4}\) over the SOTA FPGA. For the Netflix and Yahoo tasks, the proposed architecture operates with all 1108 cores and achieves a speedup of thousands of times over the SOTA FPGA and several times over the H100 GPU on average (Supplementary Fig. 19). Additionally, it demonstrates a thousands-fold improvement in energy efficiency over the SOTA FPGA and a hundreds-fold improvement over the H100 GPU on average (Supplementary Fig. 19). The assessment reveals that latency is primarily dominated by device programming (54%, Supplementary Fig. 20), while power consumption is largely attributed to the DACs and ADCs (92%, Supplementary Fig. 20) in the analog NMF solver. This suggests that optimizing programming methods and the design of DACs and ADCs could further enhance the performance of the analog NMF solver. Additionally, a detailed comparison between the vanilla GINV circuit and the compact-GINV circuit is provided in Supplementary Note 5, demonstrating that the compact-GINV circuit achieves a 90% reduction in OPAs, 9\(\times\) energy efficiency at the circuit level, and 1.46\(\times\) at the system level, with comparable solving time and accuracy.

In conclusion, we have developed an in-memory analog solver for efficiently handling complex NMF problems. At the core of this solver is a closed-loop compact-GINV circuit, designed based on the CC principle, which enables one-step solutions to non-negative regression problems under various constraints. By utilizing a minimal number of OPAs, the circuit significantly reduces power consumption. Integrating the ANLS algorithm with the compact-GINV circuit allows for efficient NMF computation while leveraging ANLS’s inherent error tolerance for low-rank matrix factorization, which further enhances the robustness of the analog solver. Experimental results demonstrate that our solver can reliably handle diverse applications of NMF, such as RGB image compression and recommender systems, requiring only a few iterations with minimal accuracy loss compared to full-precision solvers. The analog approach shows great promise in terms of speed and energy efficiency, making it a compelling solution for tackling complex NMF problems in modern applications. This could pave the way for advanced real-time applications in image processing and recommender systems.

Methods

Device fabrication

The RRAM devices were fabricated on SiO₂/Si substrates using standard semiconductor laboratory processes. First, a 5 nm Ti adhesion layer and a 30 nm Pt bottom electrode were deposited via magnetron sputtering. Next, photolithography and lift-off processes were used to pattern the layers. A 5 nm HfO₂ resistive layer was then deposited by atomic layer deposition (ALD). Top electrodes consisting of 30 nm Ta and 40 nm Pt were fabricated by magnetron sputtering and patterned using photolithography and lift-off processes.

Experimental measurements

An automatic test platform was designed for NMF tasks in this work. The control script, written in C++, runs on a digital computer. During RRAM programming, the script loads data into the MCU (an Arduino Mega 2560). The MCU performs scaling and conductance compensation (CC) computations, then controls the on-board write-verify circuit to program devices. For compact-GINV operation, the script loads the original matrix into the MCU, which scales the data, controls DACs to supply input voltages, and uses ADCs to read circuit outputs. The compact-GINV circuit uses commercial AD8572 OPAs (Analog Devices, ADI) integrated on a PCB. Peripheral circuits, including multiplexers and switches, are also onboard. Four diode pairs limit output voltages to <0.3 V, protecting RRAM devices from conductance changes during testing.

Simulations

Large-scale compact-GINV circuit simulations (Fig. 5g) were performed in SPICE (Simulation Program with Integrated Circuit Emphasis). We modeled RRAM devices with a conductance range of 10–1000 μS, retaining the adaptive accuracy mapping method. AD8572 models (Analog Devices), identical to the experimental units, were used for the OPAs. The user/item latent matrices of MovieLens 100k were mapped directly into the RRAM array, bypassing the block matrix-based method. For consistency with the experiment, three iteration cycles were conducted to obtain the final predicted rating matrix.

Data availability

The data generated in this study have been deposited in the Zenodo database under accession code [https://zenodo.org/records/17105419].

Code availability

The code used in this paper is available from the corresponding authors upon request.

References

  1. Wang, Y.-X. & Zhang, Y.-J. Nonnegative matrix factorization: a comprehensive review. IEEE Trans. Knowl. Data Eng. 25, 1336–1353 (2012).

  2. Zhi, R., Flierl, M., Ruan, Q. & Kleijn, W. B. Graph-preserving sparse nonnegative matrix factorization with application to facial expression recognition. IEEE Trans. Syst. Man, Cybern. Part B 41, 38–52 (2010).

  3. Mehta, R. & Rana, K. A review on matrix factorization techniques in recommender systems. In 2017 2nd International Conference on Communication Systems, Computing and IT Applications (CSCITA) 269–274 (IEEE, 2017).

  4. Lee, D. D. & Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999).

  5. Koren, Y., Bell, R. & Volinsky, C. Matrix factorization techniques for recommender systems. Computer 42, 30–37 (2009).

  6. Luo, X., Zhou, M., Xia, Y. & Zhu, Q. An efficient non-negative matrix-factorization-based approach to collaborative filtering for recommender systems. IEEE Trans. Ind. Inform. 10, 1273–1284 (2014).

  7. Li, Z., Tang, J. & He, X. Robust structured nonnegative matrix factorization for image representation. IEEE Trans. Neural Netw. Learn. Syst. 29, 1947–1960 (2017).

  8. Liu, H., Wu, Z., Li, X., Cai, D. & Huang, T. S. Constrained nonnegative matrix factorization for image representation. IEEE Trans. Pattern Anal. Mach. Intell. 34, 1299–1311 (2011).

  9. Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).

  10. Johnson, J. A. et al. Inferring cellular and molecular processes in single-cell data with non-negative matrix factorization using Python, R and GenePattern Notebook implementations of CoGAPS. Nat. Protoc. 18, 3690–3731 (2023).

  11. Kuang, D., Ding, C. & Park, H. Symmetric nonnegative matrix factorization for graph clustering. In Proc. 2012 SIAM International Conference on Data Mining 106–117 (Society for Industrial and Applied Mathematics, 2012).

  12. Laxmi Lydia, E. et al. Charismatic document clustering through novel K-Means Non-negative Matrix Factorization (KNMF) algorithm using key phrase extraction. Int. J. Parallel Prog. 48, 496–514 (2020).

  13. Zhang, J., Wang, Y., Yuan, Z. & Jin, Q. Personalized real-time movie recommendation system: practical prototype and evaluation. Tsinghua Sci. Technol. 25, 180–191 (2019).

  14. Park, Y., Oh, J. & Yu, H. RecTime: real-time recommender system for online broadcasting. Inf. Sci. 409, 1–16 (2017).

  15. Waldrop, M. M. The chips are down for Moore’s law. Nat. News 530, 144 (2016).

  16. Ielmini, D. & Wong, H.-S. P. In-memory computing with resistive switching devices. Nat. Electron. 1, 333–343 (2018).

  17. Mannocci, P. et al. In-memory computing with emerging memory devices: status and outlook. APL Machine Learn. 1, 010902 (2023).

  18. Mehonic, A. et al. Roadmap to neuromorphic computing with emerging technologies. APL Materials 12, 109201 (2024).

  19. Pan, L., Zuo, P., Luo, Y., Sun, Z. & Huang, R. Blockamc: scalable in-memory analog matrix computing for solving linear systems. In 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE) 1–6 (IEEE, 2024).

  20. Sun, Z. et al. Solving matrix equations in one step with cross-point resistive arrays. Proc. Natl. Acad. Sci. USA 116, 4123–4128 (2019).

  21. Sun, Z., Pedretti, G., Bricalli, A. & Ielmini, D. One-step regression and classification with cross-point resistive memory arrays. Sci. Adv. 6, eaay2378 (2020).

  22. Wang, S. et al. In-memory analog solution of compressed sensing recovery in one step. Sci. Adv. 9, eadj2908 (2023).

  23. Zuo, P., Sun, Z. & Huang, R. Extremely-fast, energy-efficient massive MIMO precoding with analog RRAM matrix computing. IEEE Trans. Circuits Syst. II: Express Briefs 70, 2335–2339 (2023).

  24. Huang, Y., Guo, N., Seok, M., Tsividis, Y. & Sethumadhavan, S. Evaluation of an analog accelerator for linear algebra. SIGARCH Comput. Archit. N. 44, 570–582 (2016).

  25. Strukov, D. B., Snider, G. S., Stewart, D. R. & Williams, R. S. The missing memristor found. Nature 453, 80–83 (2008).

  26. Wang, Q. et al. A logic-process compatible RRAM with 15.43 Mb/mm² density and 10 years @ 150 °C retention using STI-less dynamic-gate and self-passivation sidewall. In 2023 International Electron Devices Meeting (IEDM) 1–4 (IEEE, 2023).

  27. Pellizzer, F., Pirovano, A., Bez, R. & Meyer, R. L. Status and perspectives of chalcogenide-based crosspoint memories. In 2023 International Electron Devices Meeting (IEDM) 1–4 (IEEE, 2023).

  28. Yao, P. et al. Fully hardware-implemented memristor convolutional neural network. Nature 577, 641–646 (2020).

  29. Wan, W. et al. A compute-in-memory chip based on resistive random-access memory. Nature 608, 504–512 (2022).

  30. Le Gallo, M. et al. A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference. Nat. Electron. 6, 680–693 (2023).

  31. Xue, C.-X. et al. Embedded 1-Mb ReRAM-based computing-in-memory macro with multibit input and weight for CNN-based AI edge processors. IEEE J. Solid-State Circ. 55, 203–215 (2019).

  32. Sun, Z. et al. Time complexity of in-memory solution of linear systems. IEEE Trans. Electron Devices 67, 2945–2951 (2020).

  33. Wang, S. et al. Optimization schemes for in-memory linear regression circuit with memristor arrays. IEEE Trans. Circuits Syst. I: Regul. Pap. 68, 4900–4909 (2021).

  34. Chi, Y., Lu, Y. M. & Chen, Y. Nonconvex optimization meets low-rank matrix factorization: an overview. IEEE Trans. Signal Process. 67, 5239–5269 (2019).

  35. Park, D., Kyrillidis, A., Caramanis, C. & Sanghavi, S. Finding low-rank solutions via nonconvex matrix factorization, efficiently and provably. SIAM J. Imaging Sci. 11, 2165–2204 (2018).

  36. Wang, S. & Sun, Z. Dual in-memory computing of matrix-vector multiplication for accelerating neural networks. Device 2, 100546 (2024).

  37. Luo, Y., Qiao, F. & Sun, Z. Capacitance and conductance compensation methods for efficient computing‐in‐memory designs. Adv. Elect. Mater. 10, 2400452 (2024).

  38. Zhou, S., Kannan, R. & Prasanna, V. K. Accelerating stochastic gradient descent based matrix factorization on FPGA. IEEE Trans. Parallel Distrib. Syst. 31, 1897–1911 (2020).

  39. Kim, H. & Park, H. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM J. Matrix Anal. Appl. 30, 713–730 (2008).

  40. Luo, X. et al. A nonnegative latent factor model for large-scale sparse matrices in recommender systems via alternating direction method. IEEE Trans. Neural Netw. Learn. Syst. 27, 579–592 (2015).

  41. Langenegger, J. et al. In-memory factorization of holographic perceptual representations. Nat. Nanotechnol. 18, 479–485 (2023).

  42. Aghdam, M. H., Analoui, M. & Kabiri, P. A novel non-negative matrix factorization method for recommender systems. Appl. Math. Inf. Sci. 9, 2721 (2015).

  43. Mannocci, P. & Ielmini, D. A generalized block-matrix circuit for closed-loop analog in-memory computing. IEEE J. Explor. Solid-state Comput. Devices Circuits 9, 47–55 (2023).

  44. Jiang, Z. et al. COPS: an efficient and reliability-enhanced programming scheme for analog RRAM and on-chip implementation of denoising diffusion probabilistic model. In 2023 International Electron Devices Meeting (IEDM) 1–4 (IEEE, 2023).

  45. Chen, R. et al. A survey of collaborative filtering-based recommender systems: from traditional methods to hybrid methods based on social networks. IEEE Access 6, 64301–64320 (2018).

  46. He, X. et al. Neural collaborative filtering. In Proc. 26th International Conference on World Wide Web 173–182 (International World Wide Web Conferences Steering Committee, 2017).

  47. Dong, X. et al. A hybrid collaborative filtering model with deep structure for recommender systems. In Proc. AAAI Conference on Artificial Intelligence 31, (PKP, 2017).

  48. Rao, M. et al. Thousands of conductance levels in memristors integrated on CMOS. Nature 615, 823–829 (2023).

  49. Khwa, W.-S. et al. A mixed-precision memristor and SRAM compute-in-memory AI processor. Nature 639, 1–7 (2025).

  50. Darban, Z. Z. & Valipour, M. H. GHRS: Graph-based hybrid recommendation system with application to movie recommendation. Expert Syst. Appl. 200, 116850 (2022).

  51. Monti, F., Bronstein, M. & Bresson, X. Geometric matrix completion with recurrent multi-graph neural networks. Adv. Neural Inform. Process. Syst. 30, 1–11 (2017).

  52. Yang, S. et al. Impact of programming process on temperature coefficient in analog RRAM. In 2023 International Conference on IC Design and Technology (ICICDT) 141–144 (IEEE, 2023).

  53. Walczyk, C. et al. Impact of temperature on the resistive switching behavior of embedded HfO2-based RRAM devices. IEEE Trans. Electron Devices 58, 3124–3131 (2011).

  54. Xie, X., Tan, W., Fong, L. L. & Liang, Y. CuMF_SGD: parallelized stochastic gradient descent for matrix factorization on GPUs. In Proc. 26th International Symposium on High-Performance Parallel and Distributed Computing 79–92 (ACM, 2017).

  55. Chen, J., Fang, J., Liu, W. & Yang, C. BALS: blocked alternating least squares for parallel sparse matrix factorization on GPUs. IEEE Trans. Parallel Distrib. Syst. 32, 2291–2302 (2021).

  56. Chen, Y., Nevarez, Y., Lu, Z. & Garcia-Ortiz, A. Accelerating non-negative matrix factorization on embedded FPGA with hybrid logarithmic dot-product approximation. In 2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) 239–246 (IEEE, 2022).

  57. Pajuelo-Holguera, F., Gómez-Pulido, J. A. & Ortega, F. Performance of two approaches of embedded recommender systems. Electronics 9, 546 (2020).

  58. Ankit, A. et al. PUMA: a programmable ultra-efficient memristor-based accelerator for machine learning inference. In Proc. Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems 715–731 (ACM, 2019).

  59. Zhang, Y. et al. Optimized programming scheme enabling symmetric conductance modulation in HfO₂ resistive random-access memory (RRAM) for neuromorphic systems. IEEE Electron Device Lett. 43, 1203–1206 (2022).

Acknowledgements

This work was supported by Beijing Natural Science Foundation (4252016), National Key R&D Program of China (2020YFB2206001), and the 111 Project (B18001).

Author information

Author notes

  1. These authors contributed equally: Shiqing Wang, Yubiao Luo.

Authors and Affiliations

  1. Institute for Artificial Intelligence, Peking University, Beijing, China

    Shiqing Wang, Yubiao Luo, Pushen Zuo, Yongxiang Li, Lunshuai Pan & Zhong Sun

  2. School of Integrated Circuits, Peking University, Beijing, China

    Shiqing Wang, Yubiao Luo, Pushen Zuo, Yongxiang Li, Lunshuai Pan & Zhong Sun

  3. Dipartimento di Elettronica e Informazione and IU.NET, Politecnico di Milano, Milano, Italy

    Daniele Ielmini

  4. Beijing Advanced Innovation Center for Integrated Circuits, Beijing, China

    Zhong Sun

Authors

  1. Shiqing Wang
  2. Yubiao Luo
  3. Pushen Zuo
  4. Yongxiang Li
  5. Lunshuai Pan
  6. Daniele Ielmini
  7. Zhong Sun

Contributions

S.W. and Z.S. conceived the idea of in-memory analog matrix factorization, and designed the compact-GINV circuit. Yubiao Luo designed the test board. S.W. and Yongxiang Li fabricated the RRAM arrays. S.W., Yubiao Luo, D.I., and Z.S. designed the experiments. S.W. and Yubiao Luo performed the experiments. S.W. and P.Z. designed the block matrix-based algorithm for non-negative matrix factorization. S.W. and L.P. designed the multi-core architecture. S.W. and Z.S. wrote the manuscript with input from all authors. Z.S. supervised the research.

Corresponding author

Correspondence to Zhong Sun.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Jong-Ho Lee and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

About this article

Cite this article

Wang, S., Luo, Y., Zuo, P. et al. In-memory analog computing for non-negative matrix factorization. Nat Commun 17, 1881 (2026). https://doi.org/10.1038/s41467-026-68609-8
