In-memory analog computing for non-negative matrix factorization


Introduction

Non-negative matrix factorization (NMF), a low-rank decomposition technique that approximates a non-negative matrix as the product of two non-negative factor matrices (Fig. 1a), has become a pivotal tool for uncovering latent structures in high-dimensional datasets1. By imposing non-negativity constraints, NMF inherently aligns with real-world physical and semantic interpretations, such as facial features and user preferences2,3. Therefore, it produces parts-based object representations that enhance interpretability compared to other low-rank factorization methods4, such as singular value decomposition, principal component analysis and vector quantization. This unique property has propelled its adoption across diverse domains: in recommendation systems, NMF disentangles user-item interaction patterns to predict unknown ratings5,6; in bioinformatics, it extracts biologically meaningful features from gene expression profiles7,8; in image processing, it decomposes visual data into additive components resembling object parts9,10; and in clustering tasks, it serves as a dimensionality reduction tool for unsupervised learning11,12 (Fig. 1b). However, for modern large-scale matrices, such as the Yahoo Music dataset (1,200,000 × 137,000 rating matrix), traditional digital hardware struggles to meet the demands of real-time NMF tasks13,14. This limitation arises from intrinsic factors, including high algorithmic complexity, the slowdown of Moore’s Law15,16, and the von Neumann bottleneck17,18. Analog computing leverages massive parallelism to accelerate matrix operations. In particular, by combining block-matrix algorithms to scale up the addressable problem size19, analog computing can provide a viable path to solving large-scale NMF challenges.

Fig. 1: Introduction to NMF.

a Schematic illustration of NMF, where the original matrix R (\(m\times n\)) is decomposed into two non-negative factor matrices with a low rank k, namely, U (\(m\times k\)) and V (\(n\times k\)), such that \({{{\bf{R}}}}\,\approx \,{{{\bf{U}}}}\cdot {{{{\bf{V}}}}}^{{{\rm{T}}}}\). The element Ri,j in matrix R can be reconstructed by multiplying the ith row of U with the jth column of VT. b Applications of NMF in various fields, including recommendation systems using user-item interactions, bioinformatics with gene expression analysis, image processing by additive part-based decomposition, and clustering by dimensionality reduction for unsupervised learning. c Progressive complexity hierarchy of matrix operations. Matrix multiplication and matrix inversion both compute a vector from a matrix and a vector; they form a pair of forward and inverse problems, with the latter having higher computational complexity. NMF requires the joint optimization of two non-negative matrices from a single input matrix, which is a non-convex problem. By implementing the ANLS algorithm with AMC systems, NMF problems can potentially be solved efficiently.

Analog matrix computing (AMC) with resistive memory arrays presents a promising path and has recently been applied to accelerate matrix equations—that is, equations in which the coefficients and unknowns are represented as matrices or vectors. This class includes both linear problems, such as square matrix inversion20 and generalized inverse21, and certain nonlinear ones, such as sparse approximation22. When used in applications like wireless signal processing23, neural network training21, and compressed sensing recovery22, AMC can achieve orders-of-magnitude improvements in computing throughput and energy efficiency compared to conventional digital solvers. Although conventional analog complementary metal-oxide-semiconductor circuits may be constructed for similar purposes24, emerging resistive memory (or the memristor25) is advantageous in terms of chip area and power consumption, thanks to the high-density crossbar array architecture26,27. The high regularity of the crossbar also offers inherent parallelism for fast computation. In addition, in situ computing with resistive memory represents an efficient solution for addressing the von Neumann bottleneck28,29,30,31. By constructing a closed loop between the resistive memory array and operational amplifiers (OPAs), AMC circuits can solve matrix equations in a single step within nanoseconds to microseconds20,21,22, without having to compute the inverse of the data matrix. Theoretical analysis shows that AMC circuits achieve O(1) time complexity for solving matrix problems32,33, significantly outperforming the cubic time complexity of traditional digital computing.

The computational complexity of matrix operations increases progressively from multiplication to inversion and, finally, to decomposition (Fig. 1c). Matrix-vector multiplication is a forward problem that computes the product of a matrix and a vector, e.g., r = Uv. In contrast, matrix inversion is an inverse problem that solves Uv = r, where matrix U and vector r are known, and vector v is unknown. The solution is given by v = U−1r, where U−1 is the inverse matrix of U. The concept is similar for the matrix generalized inverse (GINV). Decomposition, however, represents a higher-order inverse problem, in which a known matrix R is factorized into two unknown matrices, U and V (e.g., R = U\(\cdot\)VT), making it fundamentally a non-convex optimization problem34. While AMC enables single-step solutions for multiplication and inversion using open-loop and closed-loop resistive crossbar circuits, respectively, NMF cannot be solved in a single step in the analog domain. In low-rank matrix factorization, a substantial basin of attraction often exists around the globally optimal solution34. The alternating non-negative least squares (ANLS) algorithm reformulates the non-convex problem into multiple convex subproblems, efficiently converging to this basin35, and is therefore widely adopted for NMF. AMC represents a highly parallel computing paradigm, in which matrix operations are executed in a massively parallel fashion: during computation, all cells in the activated rows and columns participate simultaneously. This stands in sharp contrast to digital computing, where basic operations (e.g., multiplication and addition) are performed sequentially, leading to polynomial complexity for solving NMF problems. By leveraging massive parallelism, AMC is particularly well suited for accelerating NMF with low computational complexity. In particular, the GINV operations with various constraints in ANLS suggest that AMC circuits could be leveraged for acceleration.
However, the conventional GINV circuit design relies on densely interconnected OPAs21, leading to excessive area and power consumption, which hinders large-scale integration. Additionally, the single-purpose nature of traditional GINV circuits limits their ability to accommodate various constraints in solving matrix equations. Therefore, the development of efficient and versatile AMC circuits is essential.

In this work, we pioneer the analog implementation of NMF. At the core of our approach is a reconfigurable compact GINV circuit, optimized using conductance compensation (CC) principles to minimize the number of OPAs compared to previous designs22,36,37. This circuit supports diverse regularization penalties, enabling non-negative linear regression (NLR), non-negative ridge regression (NRR), and non-negative constrained ridge regression (NCRR) with known entries. By integrating the compact-GINV circuit with the ANLS algorithm, we achieve a fully analog NMF solution, experimentally demonstrating its effectiveness in distinct scenarios such as image compression (via NLR) and recommender systems (via NRR and NCRR). The analog NMF solver operates with only a few alternating iterations, delivering reliable results and high energy efficiency.

Results

Algorithm for non-negative matrix factorization

As shown in Fig. 1a, NMF factorizes the original matrix R into two dense latent feature matrices with a low rank k, namely \({{{\bf{R}}}}\,\approx \,{{{\bf{U}}}} \cdot {{{{\bf{V}}}}}^{{{\rm{T}}}}\), where U (\(m\times k\)) and V (\(n\times k\)) are usually identified as the feature matrix and coefficient matrix, respectively. In other scenarios, such as recommender systems, U and V are the user and item latent feature matrices. The product of U and V approximates the original matrix R, enabling prediction of the missing values in recommender systems. Since its formalization in 1999, NMF has gained prominence in diverse fields, owing to its ability to extract interpretable representations4. Its development has expanded to incorporate sparse constraints, L2 constraints (L2 regularization term \({||}{{{\bf{x}}}}|{|}_{2}^{2}\)), constraints that reduce the optimization scope, and so on1, enhancing adaptability to complex data structures in various scenarios. NMF is typically solved by minimizing the following cost function:

$${\min }_{{{{\bf{U}}}},{{{\bf{V}}}}\ge 0}(||{{{\bf{R}}}}-{{{\bf{U}}}}{{{{\bf{V}}}}}^{T}|{|}_{F}^{2}+\alpha \varTheta ({{{\bf{U}}}})+\beta \varPsi ({{{{\bf{V}}}}}^{T})).$$

(1)

where the subscript F denotes the Frobenius norm. Sometimes the Kullback-Leibler (KL) divergence may also serve as the cost function4. \(\varTheta (\cdot )/\varPsi (\cdot )\) and \(\alpha\)/\(\beta\) represent the penalty terms and their penalty parameters, respectively. The optimization problem in Eq. (1) is usually solved with two algorithmic frameworks: stochastic gradient descent (SGD)38 and ANLS39. While SGD is more popular in the conventional computing paradigm, it converges relatively slowly, requiring tens to hundreds of iterations6,38,39,40. Matrix computing involves regular, massive multiply-and-accumulate operations, and this regularity makes the ANLS algorithm well suited for acceleration via analog parallel architectures. In particular, ANLS inherently avoids negative values, circumventing the need for signed signal representation in analog devices. Thanks to the substantial basin of attraction around the globally optimal solution33, a large set of initial points can converge to the global optimum. Therefore, even when large computing errors occur in one iteration cycle (updating the U and V matrices once), subsequent iterations are still able to converge to the optimal solution with high probability. In addition, for non-convex optimization problems, errors in analog computing may even facilitate escaping local optima41. In this work, the extensively used squared Frobenius norm is applied with different constraints, and two application scenarios are considered: image compression and recommender systems. The choice of regularization terms in NMF is tied to application-specific objectives.
In image compression, where the original matrix R is fully observed (i.e., pixel intensities) and the need for L2-regularization is correspondingly eliminated, the cost function degenerates to \(\begin{array}{c}\min \\ {{{\bf{U}}}},{{{\bf{V}}}}{{{\boldsymbol{\ge }}}}0\end{array}\left({||}{{{{\bf{R}}}}-{{{\bf{UV}}}}}^{{{{\rm{T}}}}}|{|}_{{{{\rm{F}}}}}^{2}\right)\). Conversely, recommender systems operate on highly incomplete user-item rating matrices5,37,38,39,42, where missing entries dominate, and L2-regularization on both U and V becomes critical to prevent overfitting to sparse observations, which is expressed as5:

$$\begin{array}{c}\min \\ {{{\bf{U}}}},{{{\bf{V}}}}\ge 0\end{array}(||{{{\boldsymbol{\Omega }}}}{{{\boldsymbol{\odot }}}}({{{\bf{R}}}}-{{{\bf{U}}}}{{{{\bf{V}}}}}^{T})|{|}_{F}^{2}+\lambda ||{{{\bf{U}}}}|{|}_{F}^{2}+\lambda ||{{{\bf{V}}}}|{|}_{F}^{2}).$$

(2)

where λ is the coefficient of the \({{\mbox{L}}}_{2}\) penalty to prevent overfitting and ⊙ denotes the Hadamard (element-wise) product. Ω represents the mask matrix, where Ωij = 1 if the rating of user i for item j is known in the training set, and Ωij = 0 otherwise. Equation (2) represents a general case: it reduces to the simplest form when \(\lambda=0\) and \({{{\bf{R}}}}\) is completely known (all elements of Ω are 1). The ANLS method is shown in Fig. 2a, where the initial feature matrix U0 is generated randomly and the subsequent coefficient matrix V1 is obtained through regression operations (NLR, NRR, or NCRR) based on the given matrix U0 and the original matrix R. Through this alternating closed-form optimization approach, the cost function is minimized. The converged matrices U and V are the final factorization results, and their product yields highly accurate predictions.
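To make the alternating scheme concrete, the following is a minimal algorithmic sketch of ANLS in Python (NumPy), independent of the analog hardware; here each non-negative subproblem is solved by projected gradient descent, a simple software stand-in for the regression circuits described later. The function names are illustrative, not from the original work.

```python
import numpy as np

def nnls_pg(A, b, iters=300):
    """Solve min_{x >= 0} ||A x - b||^2 by projected gradient descent."""
    L = np.linalg.norm(A.T @ A, 2)            # step-size scale (spectral norm)
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = np.maximum(x - (A.T @ (A @ x - b)) / L, 0.0)  # gradient step + clip
    return x

def anls_nmf(R, k, cycles=10, seed=0):
    """Alternate non-negative least squares so that R is approx U @ V.T."""
    m, n = R.shape
    U = np.random.default_rng(seed).random((m, k))
    V = np.zeros((n, k))
    for _ in range(cycles):
        V = np.vstack([nnls_pg(U, R[:, j]) for j in range(n)])  # fix U, solve V
        U = np.vstack([nnls_pg(V, R[i, :]) for i in range(m)])  # fix V, solve U
    return U, V
```

On a small synthetic rank-k matrix, a few alternating cycles typically suffice to drive the reconstruction error low, consistent with the fast convergence discussed in the text.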

Fig. 2: Compact-GINV circuit for implementing ANLS.

a Illustration of ANLS algorithm for NMF. Starting with an initial random feature matrix U0, the coefficient matrix V1 is solved via non-negative constrained ridge regression. Iterative alternation between optimizing U and V minimizes the cost function through a chain of regression operations until convergence. Treating NCRR as the general form, NRR and NLR can be derived under specific conditions. b Comparison between GINV and compact-GINV circuits. In conventional GINV circuit, only linear regression can be performed with q + p OPAs. Compact-GINV circuit can be reconfigured to perform multiple functions with p OPAs. The OPAs in the circuit are powered with single-power supply to guarantee the non-negativity of the outputs. When switch 1 and switch 2 are both turned off, the circuit solves NLR in one step. When switch 1 is turned on, it performs NRR. When switch 2 is partially turned on, it performs NCRR. c Experimental validation of NLR and NRR using compact-GINV circuit. Eight synthetic data points (xᵢ, yᵢ) are fitted with the model of \({w}_{0}+{w}_{1}{x}_{i}={y}_{i}\), where y is mapped as input voltages and x is mapped as the conductances of RRAM arrays. Output voltages from OPAs (solid lines) align closely with ideal solutions (dashed lines). d NMSE across 50 regression trials. NRR (NMSE = 0.0029) outperforms NLR (NMSE = 0.0059). e Relative errors of individual coefficients across 50 NLR and NRR experiments.

Analog compact-GINV circuit

The vanilla GINV circuit is shown in Fig. 2b. It was designed for solving linear regression problems in one step20, but without the constraint of non-negativity. In solving GINV, the target cost function is expressed as \(\min {||}{{{\bf{r}}}}-{{{\bf{Uv}}}}|{|}_{2}^{2}\), where U is a given matrix of size \(q\times p\) (q > p), r (size \(q\times 1\)) is the given vector, and v (size \(p\times 1\)) is the vector to be solved. In the GINV circuit, two copies of U are mapped as device conductances of two resistive random-access memory (RRAM) arrays, referenced to a unit conductance g0; vector r is mapped as voltages, and the output voltages then represent the solution v. This design, however, requires a large number (q + p) of OPAs, resulting in high power consumption and large chip area. In addition, the complex yet fixed topology limits the flexibility to incorporate other constraints critical for real-world applications; e.g., adding L2 penalties necessitates p additional OPAs43. To address these challenges, we propose a new GINV circuit design (Fig. 2b), which is compact (thus termed compact-GINV) thanks to the conductance compensation (CC) principle. It reduces the number of required OPAs to p and can be conveniently reconfigured to solve GINV problems with various constraints.

Since the cost function is defined as the squared Frobenius norm (i.e., the sum of squared errors), the matrix-form cost function is equivalent to the sum of vector-form cost functions. For example: \({||}{{{{\bf{R}}}}-{{{\bf{UV}}}}}^{{{{\rm{T}}}}}|{|}_{{{{\rm{F}}}}}^{2}={\sum }_{i=1}^{n}{||}{{{{\boldsymbol{r}}}}}_{{{{\boldsymbol{i}}}}}-{{{\bf{U}}}}{{{{\boldsymbol{v}}}}}_{{{{\boldsymbol{i}}}}}|{|}_{2}^{2}\), where ri and vi denote the ith column vectors in R and VT, respectively. Minimizing each vector-form cost function is therefore equivalent to minimizing the matrix-form cost function. For convenience, we adopt the vector formulation in the following circuit analysis. Similarly, two copies of U are encoded in RRAM crossbars referenced to the unit conductance g0. Meanwhile, a column of RRAM devices with compensation conductances gc is used to ensure that the conductance sum of each row (including the conductances of the input resistors) equals a constant c. Upon the application of analog inputs −r, the circuit instantaneously generates the regression solution v at the OPA outputs. Briefly, according to Kirchhoff’s current law, the potentials on the row lines constitute the vector \(\frac{1}{c}\)(Uv − r). Owing to the extremely high input impedance of the OPAs, the input currents must sum to zero, namely \(\frac{1}{c}\)UT(Uv − r) = 0. Consequently, the potential vector v is the GINV solution. Compared to the vanilla GINV circuit, this design eliminates the column of transimpedance amplifiers (TIAs) that would otherwise bridge the two crossbar arrays, substantially reducing the number of amplifiers. Moreover, because the TIAs (which would invert the voltage signal) are removed, various constraints can be conveniently introduced into the target cost function.
Non-negativity constraints are enforced via a single-supply configuration, solving \(\begin{array}{c}\min \\ {{{\bf{v}}}}{{{\boldsymbol{\ge }}}}0\end{array}\left({||}{{{\bf{r}}}}-{{{\bf{Uv}}}}|{|}_{2}^{2}\right)\) (NLR) for tasks such as image compression. For recommender systems with sparse observations, some vectors are fully known while most are only partially observed, requiring NRR and NCRR. By adding feedback conductances (\(\lambda /c{g}_{0}\)) between the output and inverting input nodes of the OPAs, the L2 penalty with parameter \(\lambda\) can be implemented, realizing NRR, which solves \(\begin{array}{c}\min \\ {{{\bf{v}}}}{{{\boldsymbol{\ge }}}}0\end{array}\left({||}{{{\bf{r}}}}-{{{\bf{Uv}}}}|{|}_{2}^{2}+\lambda {||}{{{\bf{v}}}}|{|}_{2}^{2}\right)\). Grounding the unmeasured entries in r restricts the optimization to the known data (\({{{\boldsymbol{\Omega }}}}\)), yielding the cost function \(\begin{array}{c}\min \\ {{{\bf{v}}}}{{{\boldsymbol{\ge }}}}0\end{array}\left({||}{{{\boldsymbol{\Omega }}}}\odot \left({{{\bf{r}}}}-{{{\bf{Uv}}}}\right)|{|}_{2}^{2}+\lambda {||}{{{\bf{v}}}}|{|}_{2}^{2}\right)\), which corresponds to NCRR. The theoretical analysis of the circuit equations is detailed in Supplementary Note 1, and the detailed circuit configurations for NLR, NRR, and NCRR are shown in Supplementary Fig. 1.
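The steady-state conditions derived above can be checked numerically. The short sketch below assumes ideal OPAs and exact conductance mapping; it verifies that the NLR fixed point coincides with the ordinary least-squares solution, and that the NRR feedback term adds λI to the normal matrix and shrinks the solution.

```python
import numpy as np

# Numerical check of the compact-GINV steady state (assuming ideal OPAs and
# exact conductance mapping): the loop enforces U^T (U v - r) = 0, i.e. the
# least-squares normal equations; the feedback conductance adds lam * I.
rng = np.random.default_rng(1)
U, r = rng.random((8, 2)), rng.random(8)

v_nlr = np.linalg.solve(U.T @ U, U.T @ r)        # fixed point without feedback
v_ref, *_ = np.linalg.lstsq(U, r, rcond=None)    # reference least squares
assert np.allclose(v_nlr, v_ref)

lam = 2.52                                       # lambda value used in the text
v_nrr = np.linalg.solve(U.T @ U + lam * np.eye(2), U.T @ r)  # NRR fixed point
assert np.linalg.norm(v_nrr) < np.linalg.norm(v_nlr)  # ridge shrinks the solution
```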

Experiment demonstration

For demonstration purposes, a widely adopted HfO2-based RRAM technology was used here (see Methods and Supplementary Fig. 2) to validate the proposed compact-GINV circuit. The electrical characteristics of the fabricated devices were tested in our previous works21, showing analog conductance states with good I-V linearity, endurance, and retention performance, which are beneficial for NMF tasks. To speed up device programming during experiments, an adaptive accuracy mapping method was developed for the widely applied write-verify process. For different target conductance values Gt, the programming precision is adjusted correspondingly (Supplementary Fig. 3). All programming cycles in the experiments were recorded (Supplementary Fig. 4), showing that an average of 9 programming cycles is required to program a single device, a 32.3% programming speedup compared to the conventional programming method (13.3 programming cycles)44. Here, one programming cycle is defined as a single sequence of programming, reading, and comparison operations (Supplementary Fig. 3). In our experiments, the unit conductance was selected as g0 = 100 μS. The compensation conductance gc is pre-calculated with minimal computational overhead and can be mapped into several parallel rows when dealing with large values21.
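For illustration, a toy write-verify loop is sketched below; the device model (`ToyRRAM`) and the adaptive tolerance rule are hypothetical stand-ins (the actual adaptive accuracy mapping is detailed in Supplementary Fig. 3), serving only to show the cycle structure of program, read, and compare.

```python
import random

class ToyRRAM:
    """Hypothetical device model, for illustration only: each pulse moves
    the conductance halfway to the target plus a random disturbance."""
    def __init__(self):
        self.g = 0.0
    def pulse(self, target_uS):
        self.g += 0.5 * (target_uS - self.g) + random.uniform(-2.0, 2.0)
    def read(self):
        return self.g

def write_verify(device, target_uS, tol_uS, max_cycles=100):
    """One cycle = program pulse, read, compare (as defined in the text).
    Returns the number of cycles used."""
    for cycle in range(1, max_cycles + 1):
        device.pulse(target_uS)
        if abs(device.read() - target_uS) <= tol_uS:
            return cycle
    return max_cycles

def adaptive_tolerance(target_uS):
    # Hypothetical adaptive rule: looser absolute tolerance for larger targets.
    return max(2.0, 0.05 * target_uS)
```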

As a first attempt, the NLR and NRR experiments were demonstrated with a simple model, w0 + w1 xi = yi. Eight points (xi, yi) were randomly generated, forming an \(8\times 2\) matrix U whose first column is all ones20. In the experiment, two copies of matrix U and gc were mapped into an RRAM array through the adaptive write-verify method, and the constant c was set to 8.3. In NRR, \(\lambda\) was selected as 2.52, resulting in a 30 μS feedback conductance. By encoding the given vector y as input voltages, both regression tasks were solved in continuous time, with the OPA outputs representing the solution coefficients. The experimental results are shown in Fig. 2c, demonstrating high consistency with the ideal fitting lines. Subsequently, 50 independent trials were performed across distinct (xi, yi) sets. To evaluate computing accuracy, the normalized mean square error (NMSE) is applied here. It is defined as \({{||}{{{\bf{x}}}}-\hat{{{{\bf{x}}}}}{||}}_{2}^{2}/{{||}\hat{{{{\bf{x}}}}}{||}}_{2}^{2}\), where \({{{\bf{x}}}}\) and \(\hat{{{{\bf{x}}}}}\) represent the experimental and ideal algorithmic results, respectively. The results show good agreement with the ideal ones, with NMSEs of merely 0.0059 and 0.0029 for NLR and NRR, respectively (Fig. 2d). Notably, the relative errors of individual coefficients were calculated for more detailed observation, and most of them are less than 10% (Fig. 2e). The experimental results demonstrate that NRR achieves lower computing error than NLR, which can be attributed to the reduced condition number of UTU + \(\lambda {{{\bf{I}}}}\) (I is an identity matrix) compared to UTU.
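The NMSE metric and the condition-number argument can be reproduced in a few lines; the design matrix below mirrors the w0 + w1x model with λ = 2.52 as in the experiment, while the synthetic data points are illustrative.

```python
import numpy as np

def nmse(x, x_ideal):
    """Normalized mean square error, ||x - x_ideal||^2 / ||x_ideal||^2."""
    x, x_ideal = np.asarray(x, float), np.asarray(x_ideal, float)
    return np.sum((x - x_ideal) ** 2) / np.sum(x_ideal ** 2)

# Ridge regularization lowers the condition number of the normal matrix,
# the stated reason for NRR's lower computing error compared to NLR.
rng = np.random.default_rng(0)
x_data = rng.random(8)
U = np.column_stack([np.ones(8), x_data])   # design matrix of w0 + w1*x = y
lam = 2.52                                   # lambda used in the experiment
cond_nlr = np.linalg.cond(U.T @ U)
cond_nrr = np.linalg.cond(U.T @ U + lam * np.eye(2))
assert cond_nrr < cond_nlr
```

For any symmetric positive-definite normal matrix, adding λI raises the smallest eigenvalue proportionally more than the largest, so the condition number strictly decreases whenever the eigenvalues differ.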

Application to image compression

The implementation of the ANLS algorithm for NMF involves frequent programming of U/V, circuit computing, and data readouts. We developed a microcontroller unit (MCU)-controlled printed circuit board (PCB) platform for running the algorithm. The PCBs integrate OPAs, digital-to-analog converters (DACs), analog-to-digital converters (ADCs), a write-verify module, and other peripheral circuits. The MCU manages data caching, dynamic signal scaling, compensation conductance gc computation, and system control. Figure 3a illustrates the hardware configuration (detailed in “Methods”). Key experimental parameters, including the constant c, the L2 penalty coefficient λ, the feedback conductances, and the circuit functions, are summarized in Supplementary Table 1. Taking NMF of a small matrix (4\(\times\)4) as an example, the original matrix (Fig. 3b) is decomposed into the product of two 4\(\times\)1 matrices via the NLR circuit. As shown in Fig. 3c, the initial feature matrix U0 is generated randomly and mapped into the RRAM array together with the calculated gc, leading to a \(4\times\)3 RRAM array. Then, the original matrix is scaled as input voltages and fed into the circuit column by column. As a result, the outputs of the compact-GINV circuit, after scaling, become the first coefficient matrix V1. The scaling methodology is explained in Supplementary Note 2. Iterative updates of U and V through alternating optimization led to rapid convergence, with all results shown in Supplementary Fig. 5. The final factorized U and V are depicted in Fig. 3d, where the product of U and VT represents the recovered matrix that approximates the original matrix. As shown in Fig. 3e, hardware-generated solutions progressively aligned with algorithmic predictions (full-precision digital solver). After merely one iteration cycle, the NMSE is reduced from 0.127 to 9.06 × 10−4, demonstrating the high effectiveness of the ANLS algorithm (Supplementary Fig. 6).

Fig. 3: Image compression with NMF.

a Hardware architecture of the automated analog NMF testing platform. An MCU coordinates DACs/ADCs, RRAM write-verify modules, and OPAs to execute iterative ANLS workflows. b Original 4\(\times\)4 non-negative matrix for NMF. The target is to factorize it into two 4\(\times\)1 matrices (U, V) with 50% compression. c Experimental results of computing the coefficient matrix V1 with a randomly initialized U0. First, the randomly initialized matrix U0 is mapped into a 4\(\times\)3 RRAM array. Second, the original matrix R is scaled as input voltages. Third, the compact-GINV circuit outputs yield V1. d The factorized U and V and their product as the recovered matrix. For demonstration purposes, U2 and V2 are multiplied by a constant and its reciprocal, respectively, to maintain the same value range while keeping their product unchanged. e Convergence trajectory of the analog NMF solver over two ANLS cycles. Experimental results (solid lines) gradually match full-precision digital solver predictions (dashed lines), achieving NMSE = 1.35\(\times\)10−4. f Original RGB nebular image and the reconstructed image using the analog NMF solver. The original image was captured by the Hubble Space Telescope, with credit to NASA, ESA, and The Hubble Heritage Team (STScI/AURA). The image used in the experiment was downsampled to an RGB image of 200 × 100 pixels. The image is partitioned into 4\(\times\)4 patches (3750 in total) for factorization. g Per-patch NMSE distribution of experimental results and the algorithmic solution. h The factorized matrices U and V of the RGB nebular image, computed by our analog NMF solver.

Using the same scheme, image compression tasks were conducted. The image is factorized into small matrices using NMF, which require less storage space, and their product represents the recovery of the original image. A red-green-blue (RGB) nebular image of size 200\(\times\)100 (Fig. 3f) from the National Aeronautics and Space Administration (NASA) was adopted; it was segmented into a 50\(\times\)25 grid of 4\(\times\)4 patches in each channel, leading to 3750 patches in total. For each small matrix, the ANLS process was performed individually for two iteration cycles to factorize it into two 4\(\times\)1 matrices, which corresponds to a compression ratio of 50%, thus saving half of the storage space. The same initial feature matrix U0 (Fig. 3b) and scaling strategy are applied for all small matrices. The recovery quality of an image is usually measured by the peak signal-to-noise ratio (PSNR), which is defined as \(10{{\mbox{lo}}}{{\mbox{g}}}_{10}(\frac{{255}^{2}\cdot M}{{\Vert{{{\bf{x}}}}-\hat{{{{\bf{x}}}}}\Vert}_{2}^{2}})\) (M is the length of vector x). In total, 3750 groups of NMF experiments were conducted, and the corresponding reconstructed image is depicted in Fig. 3f, whose NMSE and PSNR against the original image are 3.3 × 10−3 and 32.88 dB, respectively. Compared to the algorithmic solution running on a full-precision digital computer, the accuracy loss is merely 1.39 dB, suggesting high recovery quality. Specifically, the NMSE distributions in Fig. 3g exhibit excellent agreement between experimental measurements and theoretical predictions across all small patches, validating the effectiveness of the analog NMF solver. The decomposed results contain 50\(\times\)25\(\times\)3 = 3750 pairs of low-rank matrices U and V with size 4\(\times\)1, and the recombined results Uall and Vall in RGB format are shown in Fig. 3h. To evaluate the impact of matrix size on convergence speed, block matrices of varying sizes were extracted from the nebula image and simulated.
All cases consistently converged within 2–3 iteration cycles (Supplementary Fig. 7), indicating that problem size has only a negligible effect on convergence speed.
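The patch partitioning and PSNR evaluation are straightforward to sketch; `patchify` and `psnr` below are illustrative helpers, with PSNR following the definition given in the text.

```python
import numpy as np

def patchify(channel, p=4):
    """Split an (H, W) image channel into non-overlapping p x p patches."""
    H, W = channel.shape
    return channel.reshape(H // p, p, W // p, p).swapaxes(1, 2).reshape(-1, p, p)

def psnr(x, x_hat):
    """Peak signal-to-noise ratio for 8-bit data, per the definition above."""
    mse = np.mean((np.asarray(x, float) - np.asarray(x_hat, float)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```

Applied to a 200 × 100 channel with p = 4, `patchify` yields the 50 × 25 = 1250 patches per channel (3750 over three channels) described above.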

Application to recommender system

Recommender systems are another important application of NMF. They are algorithms that link people with their potential favorites based on historical data (e.g., a user-item rating matrix), thus enhancing user experience while generating substantial commercial and societal value. Collaborative filtering (CF)-based recommender systems are among the most successful methods, leveraging connections between users and items based on neighborhood information45. Among CF-based algorithms, NMF is a mainstream approach, thanks to its accuracy and scalability. Meanwhile, it also serves as a key enabler in many deep neural network-based hybrid models46,47. A tiny example of a user-item rating matrix is shown in Fig. 4a, containing both known and unknown ratings. NMF decomposes the rating matrix R into a user latent feature matrix U and an item latent feature matrix V, based on the known entries. The product of U and V predicts the unknown ratings for recommendation. The original matrix R is divided into a training set and a testing set (Fig. 4a), where the training set is used for optimizing Eq. (2) and the testing set is used to validate the predictions. The compact-GINV circuit configured for NRR and NCRR is adopted for this demonstration.

Fig. 4: Experimental results for small-scale recommender systems.

a User-item rating matrix for recommendation tasks. Seven known ratings serve as the training set, while four withheld entries form the testing set, allowing validation of the predictive capability of NMF. b Experimental workflow of the derivation from U0 to the first item feature matrix V1. First, the randomly initialized user matrix U0 is mapped into a 4\(\times\)3 RRAM array. Second, the training ratings scaled by 1/10 are input as voltages, with unknown entries grounded. Third, the compact-GINV circuit outputs yield V1. c Evolution of user (U) and item (V) latent feature matrices over about 4 ANLS cycles, from U0 to U3 and V1 to V4. d Cost function convergence across 4 cycles. The analog NMF solver exhibits a similar trend to the algorithm running on a digital solver. e Comparison of the ground truth, algorithm-predicted values, and experimentally predicted values of NMSE for the testing set.

The process of calculating V1 from U0 is summarized in Fig. 4b. Similar to the procedure in Fig. 3c, the initial user matrix is randomly generated and mapped as conductances, resulting in a 4\(\times\)3 RRAM array. The rating matrix R (training set) is then mapped to input voltages, where unknown entries can be assigned arbitrary values or left floating, as the corresponding column lines are grounded. The outputs of the compact-GINV circuit constitute the first item feature matrix V1. Four iteration cycles that progressively optimize the latent feature matrices U and V are shown in Fig. 4c, indicating fast convergence, as evidenced by the monotonic decrease of the cost function (Eq. (2)) in Fig. 4d. Experimental results show good agreement with full-precision algorithmic solutions (Fig. 4d), which is attributable to the ANLS algorithm’s inherent error tolerance for NMF. Based on the converged U and V matrices, precise prediction results are obtained on the testing set, with NMSE = 0.0027 (Fig. 4e).
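A minimal numerical sketch of this step is given below, assuming ideal circuit behavior; non-negativity is approximated here by clipping rather than the circuit's single-supply enforcement, and the check confirms that values placed at unknown entries do not affect the result, mirroring the grounded column lines. All names are illustrative.

```python
import numpy as np

def ncrr_column(U, r, mask, lam):
    """Ridge regression restricted to observed entries of r (the NCRR
    subproblem for one column); non-negativity approximated by clipping,
    a simplification of the circuit's single-supply constraint."""
    A, b = U[mask], r[mask]
    k = U.shape[1]
    v = np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ b)
    return np.maximum(v, 0.0)

rng = np.random.default_rng(3)
U0 = rng.random((4, 2))
r = rng.random(4)
mask = np.array([True, False, True, True])  # one unknown rating

# Values at unknown entries are irrelevant: only masked rows enter the solve.
r_garbage = r.copy()
r_garbage[~mask] = 999.0
assert np.allclose(ncrr_column(U0, r, mask, 0.1),
                   ncrr_column(U0, r_garbage, mask, 0.1))
```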

Real-world recommender systems typically operate on massive-scale matrices (m and n ranging from thousands to millions) with extremely high sparsity. To validate our analog NMF solver on a real-world dataset, we use the MovieLens 100k dataset (m = 943, n = 1682, 100,000 known entries) (Supplementary Fig. 8), with 80,000 randomly selected known entries for training (94.96% sparsity) and the remaining 20,000 for testing. To factorize large-scale matrices within resource-limited resistive memory arrays, a block matrix-based method is designed for NRR and NCRR in recommender systems (Supplementary Note 3). As shown in Fig. 5a, the original rating matrix and user matrix are segmented into many sub-matrices (e.g., R[i], U1[i]) with a parameter d for individual regression, generating many intermediate item matrices (e.g., V2,i). The next item latent matrix is obtained from the intermediate matrices after a nonlinear operation (maximization and scaling, Supplementary Note 3) to capture the global features. The validation of this block matrix-based method and the strategy for choosing the scaling factor are also provided in Supplementary Note 3. In the experiment, the number of latent features is set to 2 (k = 2) for extreme compression, namely the original 943 × 1682 rating matrix R is factorized into two matrices U and V of sizes 943 × 2 and 1682 × 2, respectively. The alternating user and item feature matrices (i.e., U and V) are segmented into 236 and 421 \(4\times 2\) sub-matrices (d = 4), respectively, with the help of zero padding. The resulting sub-matrices are mapped in a \(4\times 5\) RRAM array.
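The row-wise segmentation with zero padding can be sketched as follows; `segment` is an illustrative helper (the combining nonlinear operation is detailed in Supplementary Note 3), and the block counts match those stated above for the MovieLens matrices (236 user blocks and 421 item blocks for d = 4).

```python
import numpy as np

def segment(M, d):
    """Split matrix M (m x k) row-wise into ceil(m/d) blocks of d rows,
    zero-padding the final block as described in the text."""
    m, k = M.shape
    h = -(-m // d)                                  # ceil(m / d)
    padded = np.vstack([M, np.zeros((h * d - m, k))])
    return [padded[i * d:(i + 1) * d] for i in range(h)]
```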

Fig. 5: Experimental results of the recommender system for the MovieLens 100k dataset.

a Block matrix-based method, with the example of calculating V2 from U1. The transpose of the original rating matrix is segmented into h sub-matrices \({{{{\bf{R}}}}}_{[1-h]}\) along the rows with interval d. The first user matrix is also segmented into h sub-matrices U1[1-h] with size \(d\times k\). Based on the ith pair of \({{{{\bf{R}}}}}_{[i]}\) and U1[i], the ith intermediate matrix V2,i is acquired by minimizing the cost function in the figure. After repeating this h times, all the intermediate matrices V2,1-h are ready and the second item matrix V2 is calculated through the maximization and scaling operation. b Conductance mapping of matrices U1,[1~236] (4 × 2) in a 4 × 5 RRAM array. Small conductance values show larger errors due to device relaxation. c The intermediate matrices V2,1~236 obtained in the experiment, and the actual second item matrix V2. Global latent features are extracted through the nonlinear operation. d The experimental elements of matrix V2 plotted against the ideal ones, showing an NMSE of 0.0035. e All the experimental user and item latent feature matrices. Except for the random initialization of U0, all other matrices were obtained experimentally. The final factorization results are U3 and V3, with which the unknown entries in the original rating matrix are predicted. f The evolution of the predicted rating matrix over the iterations. Rpredicted = \({{{{\bf{U}}}}}_{3}{{{{\bf{V}}}}}_{3}^{{\mbox{T}}}\) is the final predicted rating matrix. g Optimization of the training cost function and the testing-set NMSE over iterations, together with the result obtained without segmentation.

Three iteration cycles were conducted in the experiment, with the product U3V3T representing the predicted rating matrix Rpredicted. Over the course of the experiment, the RRAM array was reprogrammed a total of 1971 times (or 17,739 writing pulses with write-verify for each device), which is manageable for the fabricated devices thanks to their robust endurance22. Taking the procedure from U1 to V2 as an example, two copies of U1[i] along with gc were first programmed into the \(4\times 5\) RRAM array; all 236 programming results of U1[1~236] are shown in Fig. 5b. Small conductances in Gleft and Gright suffer more from device relaxation, thus presenting larger absolute programming errors (e.g., > 30 \(\mu {\mbox{S}}\)). The programming results of the other cycles, shown in Supplementary Fig. 9, exhibit the same behavior. Then, the sub-matrix R[i] was encoded as input voltages and sequentially fed into the circuit column by column. After performing 1682 operations with the compact-GINV circuit, the ith intermediate matrix V2,i (size \(1682\times 2\)) was obtained. By repeating this procedure, all 236 intermediate matrices V2,1-236 were generated with good accuracy (NMSE = 0.007, Supplementary Fig. 10). The raw circuit outputs for the other iteration cycles are displayed in Supplementary Fig. 8, with all computed NMSE values below 0.01. The final \({{{{\bf{V}}}}}_{2}\) is acquired through the nonlinear operation (Fig. 5c), resulting in an NMSE of 0.0035 (Fig. 5d). The single-element relative errors of all experimental user and item matrices are calculated and shown in Supplementary Fig. 11, demonstrating very low NMSEs (ranging from 0.003 to 0.0044) and average relative errors (ranging from 1 to 4%). All user and item feature matrices are presented in Fig. 5e. In each iteration cycle of the NMF experiment, a new pair of user and item feature matrices was generated, which can then be used to compute the predicted rating matrix for evaluation.
The corresponding evolution is illustrated in Fig. 5f. The predicted rating matrix Rpredicted = U3V3T after 3 iteration cycles closely aligns with the original matrix (Supplementary Fig. 12). Notably, the predicted rating matrix after just one iteration cycle (i.e., U1V1T) is already visually close to the converged matrix, again demonstrating the algorithm's fast convergence. For the training set, the value of the cost function in Eq. (2) decreases with each iteration and converges rapidly; the NMSE of the test set behaves the same way. Notably, for both indicators, the experimental results are almost identical to the algorithmic ones (full-precision digital solver) (Fig. 5g). Compared to the original ANLS algorithm without segmenting the matrices (and its circuit simulation, see Methods), the results show some performance degradation, which is inherent to the block matrix method.

To investigate the impact of block size d on solving quality, simulations were performed across block sizes d ranging from 4 to 256. Results demonstrate that the cost function and the NMSE on the testing set decrease for larger d (Supplementary Fig. 13), suggesting that the quality deficiency observed in the experiment can be overcome with large-scale RRAM arrays, which are readily affordable with advanced RRAM technology48,49. Additionally, for all block sizes, the algorithm converges within 2–3 iteration cycles (Supplementary Fig. 13), indicating that block size has a minor impact on convergence speed. To compare the solving quality with deep learning methods, the indicator root mean square error (RMSE) is introduced, defined as \({({\Vert{{{\bf{x}}}}-\hat{{{{\bf{x}}}}}\Vert}_{2}^{2}/|{{{\bf{T}}}}|)}^{1/2}\), where x and \(\hat{{{{\bf{x}}}}}\) represent the predicted values and their ground truth, and \(|{{{\bf{T}}}}|\) denotes the size of the testing set. The state-of-the-art (SOTA) method we found for the MovieLens 100k dataset combines a graph model with deep learning and achieves a small RMSE50. As presented in Supplementary Table 2, the simulation result without segmentation is worse than this advanced machine learning method but comparable to a recurrent neural network method51. When employing the block matrix-based method, however, the experimental results underperform both approaches, which stems from the unavoidable information loss with a very small segmentation parameter (d = 4). Nevertheless, this degradation is avoidable by using large RRAM arrays (d = 256).
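The RMSE indicator defined above is straightforward to compute; a minimal sketch:

```python
import numpy as np

def rmse(pred, truth):
    """RMSE = (||x - x_hat||_2^2 / |T|)^(1/2) over the |T| testing entries."""
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

# toy check on two held-out ratings: one off by a star, one exact
err = rmse([4, 3], [5, 3])   # sqrt(1/2)
```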

Analysis of error tolerance

To demonstrate the error tolerance of the analog NMF solver, an additional 4\(\times\)4 NMF experiment was conducted. Significant mapping errors occurred while programming matrix V1, resulting in a high NMSE; however, subsequent iterations still converged to a minimum (Supplementary Fig. 14). Similar phenomena also occurred in the earlier experiments. Across all experiments (~430,000 trials), the average NMSE of circuit outputs was 0.0072, while the errors in the final NMF results were substantially suppressed, owing to the large basin of attraction around the global optimum in NMF problems34. In the image compression experiment, the recovered image shows only a 1.39 dB PSNR loss (exp: 32.88 dB vs. FP64: 34.27 dB) and a minor NMSE increase of 0.0009 (exp: 0.0033 vs. FP64: 0.0024). In the recommender system experiment (MovieLens 100k), the NMSE on the test dataset increased by just 0.0001 in the 3rd iteration cycle (exp: 0.0800 vs. FP64: 0.0799). To further investigate error tolerance in large-scale arrays, image compression was taken as an example, where a 200 × 66 RRAM array (including two columns for CC) was used to decompose the original 200 × 100 × 3 nebula image into 3 pairs of matrices (dimensions 200 × 32 and 100 × 32) over 2 iteration cycles. In simulations, a practical OPA model, the AD8572 from ADI (1.5 MHz bandwidth, 145 dB DC open-loop gain, 51 nV/\(\surd {\mbox{Hz}}\) voltage noise), was adopted to mimic the non-ideal characteristics of OPAs. Programming errors were varied from ±1% to ±60%. Results show that mapping errors within ±20% have negligible impact on the final NMSE and PSNR. Remarkably, even with substantial mapping errors (±50% and ±60%), the solver still correctly reconstructed the image within 2–3 iteration cycles (Supplementary Fig. 15).
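The mapping-error study can be emulated in software by perturbing the mapped matrix before each solve. The sketch below uses a toy rank-2 matrix and an ordinary least-squares solve with clipping in place of the analog circuit; only the ±20% error level is taken from the text, all other settings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
k = 2
R = rng.random((8, k)) @ rng.random((k, 6))   # exactly rank-k, non-negative

def noisy_solve(A, B, err=0.20):
    """Least squares with A perturbed elementwise by +/-err, mimicking mapping errors."""
    A_err = A * (1 + rng.uniform(-err, err, A.shape))
    X, *_ = np.linalg.lstsq(A_err, B, rcond=None)
    return np.maximum(0.0, X)                  # clipping as a crude non-negativity step

U = rng.random((8, k))
for _ in range(3):                             # a few iteration cycles, as in the experiments
    V = noisy_solve(U, R).T
    U = noisy_solve(V, R.T).T

nmse = np.linalg.norm(R - U @ V.T) ** 2 / np.linalg.norm(R) ** 2
```

Because each alternating step re-fits against the exact rating data, errors injected into one factor are largely absorbed by the next solve, which is the intuition behind the basin-of-attraction argument above.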

Temperature effects were also investigated; they primarily impact retention and conductance values (the latter similar to mapping errors). Since computing is completed immediately (≪1 s) after programming, and the 8-state RRAM retention was tested for \({10}^{4}\) s in our prior work22, retention fully meets the requirements. Regarding conductance variation, a temperature model for analog RRAM52 was applied, as described by Eq. (3).

$$R={R}_{0}(1+\rho (T-{T}_{0})),$$

(3)

where R is the resistance at temperature T, R0 is the resistance at room temperature T0, and ρ is the temperature coefficient. At high conductance, the device exhibits metallic behavior (ρ > 0), so increased temperature decreases conductance; conversely, at low conductance, it exhibits semiconducting behavior (ρ < 0), where increased temperature increases conductance (Supplementary Fig. 16). This phenomenon was also observed and verified in 1-bit RRAM53. Assuming programming is performed at room temperature (20 °C) with a ±5% programming error, simulations were conducted from 20 to 120 °C. The results demonstrate minimal impact on decomposition quality (Supplementary Fig. 16), owing to the high error tolerance.
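Eq. (3) translates directly into a conductance update; the ρ values and the metallic/semiconducting threshold below are illustrative placeholders, not the fitted coefficients of ref. 52:

```python
def conductance_at_T(G0_uS, T_C, T0_C=20.0):
    """Eq. (3) applied to conductance; rho values and threshold are illustrative."""
    rho = 1e-3 if G0_uS > 300 else -1e-3    # metallic vs. semiconducting behavior
    R0 = 1.0 / G0_uS                        # resistance at room temperature (in 1/uS)
    R = R0 * (1 + rho * (T_C - T0_C))       # Eq. (3)
    return 1.0 / R
```

At 120 °C this model lowers a high-conductance (metallic) state and raises a low-conductance (semiconducting) state, matching the trend reported in Supplementary Fig. 16.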

Discussion

Graphics processing units (GPUs) and field-programmable gate arrays (FPGAs) with optimized algorithms remain the state-of-the-art solutions for accelerating NMF38,54,55,56,57. In contrast, this work explores an alternative computing paradigm, i.e., analog computing, for solving this problem. To benchmark the analog NMF solver against advanced digital solvers, the case of recommender systems is considered. The datasets range from the MovieLens 100k dataset to larger ones, including Netflix (m = 480,000, n = 17,000) and Yahoo (m = 1,200,000, n = 137,000)38,55. A multi-core architecture is designed for fair comparison with entire FPGAs and GPUs. The architecture includes a global buffer, a functional module, instruction-related registers and decoders, and 1108 cores58. Each core contains an RRAM array, DACs, ADCs, I/O registers, a control module, a CC compute module, write circuitry, etc. Detailed configurations and parameters are provided in Supplementary Note 4. Since rating matrices in recommender systems typically consist of discrete integers, low-precision DACs can be employed. Supplementary Fig. 17 demonstrates decomposition results on MovieLens 100k using different DAC precisions, indicating that 3-bit DACs are an optimal choice. Consequently, a 3-bit R-2R DAC (Supplementary Fig. 17) was designed and implemented in the multi-core architecture to reduce system energy and chip area.
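The 3-bit choice follows from the rating alphabet: integer ratings 1–5 require only five distinct input levels, which an ideal 3-bit R-2R ladder already represents exactly (the Vref and code assignment below are illustrative):

```python
def r2r_dac(code, bits=3, vref=1.0):
    """Ideal R-2R ladder transfer function: Vout = Vref * code / 2**bits."""
    assert 0 <= code < 2 ** bits
    return vref * code / 2 ** bits

# integer ratings 1..5 occupy five of the eight available 3-bit levels
levels = [r2r_dac(r) for r in range(1, 6)]
```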

The computing core utilizes a large-scale 256\(\times\)66 (two columns for CC) RRAM array that is capable of supporting regression problems for 256\(\times\)32 matrices. Combined with the proposed block matrix-based method, it efficiently solves ultra-large-scale NMF problems in recommendation systems. However, several issues may arise in large-scale arrays, such as extra overhead for computing and programming CC, long-wire delay, and large numbers of DACs/ADCs. To reduce the conductance value of CC, two parallel columns are employed for mapping CC, halving the conductance required per column. System-level evaluation shows that the resources required for computing and mapping CC represent a minimal overhead: latency accounts for 3.18% of total time, and energy consumption accounts for 0.12% of total energy. Furthermore, wire-delay issues in large-scale arrays were investigated using a parasitic capacitance model (Supplementary Fig. 18)59. The convergence time of the compact-GINV circuit increases from 76 ns to 79.2 ns, which is also accounted for in the evaluations. Additionally, to reduce power consumption and chip area, the prevailing approach of sacrificing some computing speed by using multiplexers (MUXs) and shared ADCs29 is adopted.

For the MovieLens 100k dataset, the NMF task runs on a Xilinx Zynq-7000 FPGA (28 nm node) platform57. For larger datasets, the Virtex UltraScale+ FPGA (16 nm node)38 and the Nvidia Titan RTX GPU (4608 CUDA cores, 12 nm node)57 are the most advanced hardware we found performing the NMF algorithm. To benchmark the multi-core solver against SOTA digital solvers, the NVIDIA H100 (14,592 CUDA cores, 4 nm node) and the AMD Versal™ AI Core (VC2802, 7 nm node) FPGA are considered, and their performance is estimated by comparing peak computing capabilities. The MovieLens 100k task occupies 7 cores and achieves a 212\(\times\) speedup and an energy-efficiency improvement of 4.6\(\times {10}^{4}\) over the SOTA FPGA. For the Netflix and Yahoo tasks, the proposed architecture operates with all 1108 cores and achieves a speedup of thousands of times over the SOTA FPGA and several times over the H100 GPU on average (Supplementary Fig. 19). Additionally, it demonstrates a thousands-fold improvement in energy efficiency over the SOTA FPGA and a hundreds-fold improvement over the H100 GPU on average (Supplementary Fig. 19). The assessment reveals that latency is primarily dominated by device programming (54%, Supplementary Fig. 20), while power consumption is largely attributed to the DACs and ADCs (92%, Supplementary Fig. 20) in the analog NMF solver. This suggests that optimizing programming methods and the design of DACs and ADCs could further enhance the performance of the analog NMF solver. Additionally, a detailed comparison between the vanilla GINV circuit and the compact-GINV circuit is provided in Supplementary Note 5, demonstrating that the compact-GINV circuit achieves a 90% reduction in OPAs, 9\(\times\) energy efficiency at the circuit level, and 1.46\(\times\) at the system level, with comparable solving time and accuracy.

In conclusion, we have developed an in-memory analog solver for efficiently handling complex NMF problems. At the core of this solver is a closed-loop compact-GINV circuit, designed based on the CC principle, which enables one-step solutions to non-negative regression problems under various constraints. By utilizing a minimal number of OPAs, the circuit significantly reduces power consumption. Integrating the ANLS algorithm with the compact-GINV circuit allows for efficient NMF computation while leveraging ANLS’s inherent error tolerance for low-rank matrix factorization, which further enhances the robustness of the analog solver. Experimental results demonstrate that our solver can reliably handle diverse applications of NMF, such as RGB image compression and recommender systems, requiring only a few iterations with minimal accuracy loss compared to full-precision solvers. The analog approach shows great promise in terms of speed and energy efficiency, making it a compelling solution for tackling complex NMF problems in modern applications. This could pave the way for advanced real-time applications in image processing and recommender systems.

Methods

Device fabrication

The RRAM devices were fabricated on SiO₂/Si substrates using standard semiconductor laboratory processes. First, a 5 nm Ti adhesion layer and a 30 nm Pt bottom electrode were deposited via magnetron sputtering. Next, photolithography and lift-off processes were used to pattern the layers. A 5 nm HfO₂ resistive layer was then deposited by atomic layer deposition (ALD). Top electrodes consisting of 30 nm Ta and 40 nm Pt were fabricated by magnetron sputtering and patterned using photolithography and lift-off processes.

Experimental measurements

An automatic test platform was designed for NMF tasks in this work. The control script, written in C++, runs on a digital computer. During RRAM programming, the script loads data into the MCU (an Arduino Mega 2560). The MCU performs scaling and conductance compensation (CC) computations, then controls the on-board write-verify circuit to program devices. For compact-GINV operation, the script loads the original matrix into the MCU, which scales the data, controls DACs to supply input voltages, and uses ADCs to read circuit outputs. The compact-GINV circuit uses commercial AD8572 OPAs (Analog Devices, ADI) integrated on a PCB. Peripheral circuits, including multiplexers and switches, are also onboard. Four diode pairs limit output voltages to <0.3 V, protecting RRAM devices from conductance changes during testing.

Simulations

Large-scale compact-GINV circuit simulations (Fig. 5g) were performed in SPICE (Simulation Program with Integrated Circuit Emphasis). We modeled RRAM devices with a conductance range of 10–1000 μS, retaining the adaptive accuracy mapping method. AD8572 models (Analog Devices), identical to the experimental units, were used for the OPAs. The user/item latent matrices of MovieLens 100k were mapped directly into the RRAM array, bypassing the block matrix-based method. For consistency with the experiment, three iteration cycles were conducted to obtain the final predicted rating matrix.

Data availability

The data generated in this study have been deposited in the Zenodo database under accession code [https://zenodo.org/records/17105419].

Code availability

The code used in this paper is available from the corresponding authors upon request.

References

  1. Wang, Y.-X. & Zhang, Y.-J. Nonnegative matrix factorization: a comprehensive review. IEEE Trans. Knowl. Data Eng. 25, 1336–1353 (2012).

  2. Zhi, R., Flierl, M., Ruan, Q. & Kleijn, W. B. Graph-preserving sparse nonnegative matrix factorization with application to facial expression recognition. IEEE Trans. Syst. Man, Cybern. Part B 41, 38–52 (2010).

  3. Mehta, R. & Rana, K. A review on matrix factorization techniques in recommender systems. In 2017 2nd International Conference on Communication Systems, Computing and IT Applications (CSCITA) 269–274 (IEEE, 2017).

  4. Lee, D. D. & Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999).

  5. Koren, Y., Bell, R. & Volinsky, C. Matrix factorization techniques for recommender systems. Computer 42, 30–37 (2009).

  6. Luo, X., Zhou, M., Xia, Y. & Zhu, Q. An efficient non-negative matrix-factorization-based approach to collaborative filtering for recommender systems. IEEE Trans. Ind. Inform. 10, 1273–1284 (2014).

  7. Li, Z., Tang, J. & He, X. Robust structured nonnegative matrix factorization for image representation. IEEE Trans. Neural Netw. Learn. Syst. 29, 1947–1960 (2017).

  8. Liu, H., Wu, Z., Li, X., Cai, D. & Huang, T. S. Constrained nonnegative matrix factorization for image representation. IEEE Trans. Pattern Anal. Mach. Intell. 34, 1299–1311 (2011).

  9. Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).

  10. Johnson, J. A. et al. Inferring cellular and molecular processes in single-cell data with non-negative matrix factorization using Python, R and GenePattern Notebook implementations of CoGAPS. Nat. Protoc. 18, 3690–3731 (2023).

  11. Kuang, D., Ding, C. & Park, H. Symmetric nonnegative matrix factorization for graph clustering. In Proc. 2012 SIAM International Conference on Data Mining 106–117 (Society for Industrial and Applied Mathematics, 2012).

  12. Laxmi Lydia, E. et al. Charismatic document clustering through novel K-Means Non-negative Matrix Factorization (KNMF) algorithm using key phrase extraction. Int. J. Parallel Prog. 48, 496–514 (2020).

  13. Zhang, J., Wang, Y., Yuan, Z. & Jin, Q. Personalized real-time movie recommendation system: practical prototype and evaluation. Tsinghua Sci. Technol. 25, 180–191 (2019).

  14. Park, Y., Oh, J. & Yu, H. RecTime: real-time recommender system for online broadcasting. Inf. Sci. 409, 1–16 (2017).

  15. Waldrop, M. M. The chips are down for Moore’s law. Nat. News 530, 144 (2016).

  16. Ielmini, D. & Wong, H.-S. P. In-memory computing with resistive switching devices. Nat. Electron. 1, 333–343 (2018).

  17. Mannocci, P. et al. In-memory computing with emerging memory devices: status and outlook. APL Machine Learn. 1, 010902 (2023).

  18. Mehonic, A. et al. Roadmap to neuromorphic computing with emerging technologies. APL Materials 12, 109201 (2024).

  19. Pan, L., Zuo, P., Luo, Y., Sun, Z. & Huang, R. Blockamc: scalable in-memory analog matrix computing for solving linear systems. In 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE) 1–6 (IEEE, 2024).

  20. Sun, Z. et al. Solving matrix equations in one step with cross-point resistive arrays. Proc. Natl. Acad. Sci. USA 116, 4123–4128 (2019).

  21. Sun, Z., Pedretti, G., Bricalli, A. & Ielmini, D. One-step regression and classification with cross-point resistive memory arrays. Sci. Adv. 6, eaay2378 (2020).

  22. Wang, S. et al. In-memory analog solution of compressed sensing recovery in one step. Sci. Adv. 9, eadj2908 (2023).

  23. Zuo, P., Sun, Z. & Huang, R. Extremely-fast, energy-efficient massive MIMO precoding with analog RRAM matrix computing. IEEE Trans. Circuits Syst. II: Express Briefs 70, 2335–2339 (2023).

  24. Huang, Y., Guo, N., Seok, M., Tsividis, Y. & Sethumadhavan, S. Evaluation of an analog accelerator for linear algebra. SIGARCH Comput. Archit. N. 44, 570–582 (2016).

  25. Strukov, D. B., Snider, G. S., Stewart, D. R. & Williams, R. S. The missing memristor found. Nature 453, 80–83 (2008).

  26. Wang, Q. et al. A logic-process compatible RRAM with 15.43 Mb/mm² density and 10 years @ 150 °C retention using STI-less dynamic-gate and self-passivation sidewall. In 2023 International Electron Devices Meeting (IEDM) 1–4 (IEEE, 2023).

  27. Pellizzer, F., Pirovano, A., Bez, R. & Meyer, R. L. Status and perspectives of chalcogenide-based crosspoint memories. In 2023 International Electron Devices Meeting (IEDM) 1–4 (IEEE, 2023).

  28. Yao, P. et al. Fully hardware-implemented memristor convolutional neural network. Nature 577, 641–646 (2020).

  29. Wan, W. et al. A compute-in-memory chip based on resistive random-access memory. Nature 608, 504–512 (2022).

  30. Le Gallo, M. et al. A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference. Nat. Electron. 6, 680–693 (2023).

  31. Xue, C.-X. et al. Embedded 1-Mb ReRAM-based computing-in-memory macro with multibit input and weight for CNN-based AI edge processors. IEEE J. Solid-State Circ. 55, 203–215 (2019).

  32. Sun, Z. et al. Time complexity of in-memory solution of linear systems. IEEE Trans. Electron Devices 67, 2945–2951 (2020).

  33. Wang, S. et al. Optimization schemes for in-memory linear regression circuit with memristor arrays. IEEE Trans. Circuits Syst. I: Regul. Pap. 68, 4900–4909 (2021).

  34. Chi, Y., Lu, Y. M. & Chen, Y. Nonconvex optimization meets low-rank matrix factorization: an overview. IEEE Trans. Signal Process. 67, 5239–5269 (2019).

  35. Park, D., Kyrillidis, A., Caramanis, C. & Sanghavi, S. Finding low-rank solutions via nonconvex matrix factorization, efficiently and provably. SIAM J. Imaging Sci. 11, 2165–2204 (2018).

  36. Wang, S. & Sun, Z. Dual in-memory computing of matrix-vector multiplication for accelerating neural networks. Device 2, 100546 (2024).

  37. Luo, Y., Qiao, F. & Sun, Z. Capacitance and conductance compensation methods for efficient computing‐in‐memory designs. Adv. Elect. Mater. 10, 2400452 (2024).

  38. Zhou, S., Kannan, R. & Prasanna, V. K. Accelerating stochastic gradient descent based matrix factorization on FPGA. IEEE Trans. Parallel Distrib. Syst. 31, 1897–1911 (2020).

  39. Kim, H. & Park, H. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM J. Matrix Anal. Appl. 30, 713–730 (2008).

  40. Luo, X. et al. A nonnegative latent factor model for large-scale sparse matrices in recommender systems via alternating direction method. IEEE Trans. Neural Netw. Learn. Syst. 27, 579–592 (2015).

  41. Langenegger, J. et al. In-memory factorization of holographic perceptual representations. Nat. Nanotechnol. 18, 479–485 (2023).

  42. Aghdam, M. H., Analoui, M. & Kabiri, P. A novel non-negative matrix factorization method for recommender systems. Appl. Math. Inf. Sci. 9, 2721 (2015).

  43. Mannocci, P. & Ielmini, D. A generalized block-matrix circuit for closed-loop analog in-memory computing. IEEE J. Explor. Solid-state Comput. Devices Circuits 9, 47–55 (2023).

  44. Jiang, Z. et al. COPS: an efficient and reliability-enhanced programming scheme for analog RRAM and on-chip implementation of denoising diffusion probabilistic model. In 2023 International Electron Devices Meeting (IEDM) 1–4 (IEEE, 2023).

  45. Chen, R. et al. A survey of collaborative filtering-based recommender systems: from traditional methods to hybrid methods based on social networks. IEEE Access 6, 64301–64320 (2018).

  46. He, X. et al. Neural collaborative filtering. In Proc. 26th International Conference on World Wide Web 173–182 (International World Wide Web Conferences Steering Committee, 2017).

  47. Dong, X. et al. A hybrid collaborative filtering model with deep structure for recommender systems. In Proc. AAAI Conference on Artificial Intelligence 31, (PKP, 2017).

  48. Rao, M. et al. Thousands of conductance levels in memristors integrated on CMOS. Nature 615, 823–829 (2023).

  49. Khwa, W.-S. et al. A mixed-precision memristor and SRAM compute-in-memory AI processor. Nature 639, 1–7 (2025).

  50. Darban, Z. Z. & Valipour, M. H. GHRS: Graph-based hybrid recommendation system with application to movie recommendation. Expert Syst. Appl. 200, 116850 (2022).

  51. Monti, F., Bronstein, M. & Bresson, X. Geometric matrix completion with recurrent multi-graph neural networks. Adv. Neural Inform. Process. Syst. 30, 1–11 (2017).

  52. Yang, S. et al. Impact of programming process on temperature coefficient in analog RRAM. In 2023 International Conference on IC Design and Technology (ICICDT) 141–144 (IEEE, 2023).

  53. Walczyk, C. et al. Impact of temperature on the resistive switching behavior of embedded HfO2-based RRAM devices. IEEE Trans. Electron Devices 58, 3124–3131 (2011).

  54. Xie, X., Tan, W., Fong, L. L. & Liang, Y. CuMF_SGD: parallelized stochastic gradient descent for matrix factorization on GPUs. In Proc. 26th International Symposium on High-Performance Parallel and Distributed Computing 79–92 (ACM, 2017).

  55. Chen, J., Fang, J., Liu, W. & Yang, C. BALS: blocked alternating least squares for parallel sparse matrix factorization on GPUs. IEEE Trans. Parallel Distrib. Syst. 32, 2291–2302 (2021).

  56. Chen, Y., Nevarez, Y., Lu, Z. & Garcia-Ortiz, A. Accelerating non-negative matrix factorization on embedded FPGA with hybrid logarithmic dot-product approximation. In 2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) 239–246 (IEEE, 2022).

  57. Pajuelo-Holguera, F., Gómez-Pulido, J. A. & Ortega, F. Performance of two approaches of embedded recommender systems. Electronics 9, 546 (2020).

  58. Ankit, A. et al. PUMA: a programmable ultra-efficient memristor-based accelerator for machine learning inference. In Proc. Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems 715–731 (ACM, 2019).

  59. Zhang, Y. et al. Optimized programming scheme enabling symmetric conductance modulation in HfO₂ resistive random-access memory (RRAM) for neuromorphic systems. IEEE Electron Device Lett. 43, 1203–1206 (2022).

Acknowledgements

This work was supported by Beijing Natural Science Foundation (4252016), National Key R&D Program of China (2020YFB2206001), and the 111 Project (B18001).

Author information

Author notes

  1. These authors contributed equally: Shiqing Wang, Yubiao Luo.

Authors and Affiliations

  1. Institute for Artificial Intelligence, Peking University, Beijing, China

    Shiqing Wang, Yubiao Luo, Pushen Zuo, Yongxiang Li, Lunshuai Pan & Zhong Sun

  2. School of Integrated Circuits, Peking University, Beijing, China

    Shiqing Wang, Yubiao Luo, Pushen Zuo, Yongxiang Li, Lunshuai Pan & Zhong Sun

  3. Dipartimento di Elettronica e Informazione and IU.NET, Politecnico di Milano, Milano, Italy

    Daniele Ielmini

  4. Beijing Advanced Innovation Center for Integrated Circuits, Beijing, China

    Zhong Sun

Authors

  1. Shiqing Wang
  2. Yubiao Luo
  3. Pushen Zuo
  4. Yongxiang Li
  5. Lunshuai Pan
  6. Daniele Ielmini
  7. Zhong Sun

Contributions

S.W. and Z.S. conceived the idea of in-memory analog matrix factorization, and designed the compact-GINV circuit. Yubiao Luo designed the test board. S.W. and Yongxiang Li fabricated the RRAM arrays. S.W., Yubiao Luo, D.I., and Z.S. designed the experiments. S.W. and Yubiao Luo performed the experiments. S.W. and P.Z. designed the block matrix-based algorithm for non-negative matrix factorization. S.W. and L.P. designed the multi-core architecture. S.W. and Z.S. wrote the manuscript with input from all authors. Z.S. supervised the research.

Corresponding author

Correspondence to Zhong Sun.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Jong-Ho Lee and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

About this article

Cite this article

Wang, S., Luo, Y., Zuo, P. et al. In-memory analog computing for non-negative matrix factorization. Nat Commun 17, 1881 (2026). https://doi.org/10.1038/s41467-026-68609-8
