Notation for Machine Learning - BAAI


The field of machine learning has been evolving rapidly in recent years, and communication between researchers and research groups has become increasingly important. A key obstacle to that communication is inconsistent notation across papers. This proposal suggests a standard for commonly used mathematical notation in machine learning. This first version covers only part of the notation in common use; more will be added over time, and the proposal will be updated regularly as the field progresses.

You can use this notation by downloading the LaTeX macro package MLMath.sty, which is maintained alongside this proposal; see GitHub for more information.
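As a quick illustration (a sketch only: the macro names \vx, \fX, \fY, \fH, and \vtheta come from the "simplified" column of the table below, and the surrounding usage is assumed rather than quoted from the package documentation), a document using MLMath.sty might look like this:

```latex
\documentclass{article}
\usepackage{MLMath}  % assumed to provide \vx, \vy, \fX, \fY, \fH, \vtheta, ...

\begin{document}
% Hypothesis function and training sample written with the simplified macros
Let $f_{\vtheta}\colon \fX \to \fY$ be a hypothesis in $\fH$, trained on the
sample $S=\{(\vx_i,\vy_i)\}_{i=1}^{n}$ by minimizing the empirical risk
$L_S(\vtheta)$.
\end{document}
```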

Notation Table

See the full Guide for more

| Symbol | Meaning | LaTeX | Simplified |
| --- | --- | --- | --- |
| x | input | `\bm{x}` | `\vx` |
| y | output, label | `\bm{y}` | `\vy` |
| d | input dimension | `d` | |
| d_o | output dimension | `d_{\rm o}` | `d_{\rm o}` |
| n | number of samples | `n` | |
| X | instances domain (a set) | `\mathcal{X}` | `\fX` |
| Y | labels domain (a set) | `\mathcal{Y}` | `\fY` |
| Z = X × Y | example domain | `\mathcal{Z}` | `\fZ` |
| H | hypothesis space (a set) | `\mathcal{H}` | `\fH` |
| θ | a set of parameters | `\bm{\theta}` | `\vtheta` |
| f_θ : X → Y | hypothesis function | `f_{\bm{\theta}}` | `f_{\vtheta}` |
| f or f* : X → Y | target function | `f`, `f^*` | |
| ℓ : H × Z → R_+ | loss function | `\ell` | |
| D | distribution of Z | `\mathcal{D}` | `\fD` |
| S = {z_i}_{i=1}^n = {(x_i, y_i)}_{i=1}^n | sample set | | |
| L_S(θ), L_n(θ), R_n(θ), R_S(θ) | empirical risk or training loss | | |
| L_D(θ), R_D(θ) | population risk or expected loss | | |
| σ : R → R_+ | activation function | `\sigma` | |
| w_j | input weight | `\bm{w}_j` | `\vw_j` |
| a_j | output weight | `a_j` | |
| b_j | bias term | `b_j` | |
| f_θ(x) or f(x; θ) | neural network | `f_{\bm{\theta}}` | `f_{\vtheta}` |
| ∑_{j=1}^{m} a_j σ(w_j · x + b_j) | two-layer neural network | | |
| VCdim(H) | VC-dimension of H | | |
| Rad(H ∘ S), Rad_S(H) | Rademacher complexity of H on S | | |
| Rad_n(H) | Rademacher complexity over samples of size n | | |
| GD | gradient descent | | |
| SGD | stochastic gradient descent | | |
| B | a batch set | `B` | |
| \|B\| | batch size | `b` | |
| η | learning rate | `\eta` | |
| k | discretized frequency | `\bm{k}` | `\vk` |
| ξ | continuous frequency | `\bm{\xi}` | `\vxi` |
| ∗ | convolution operation | `*` | |
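To show how these entries fit together, here is a brief sketch using the standard definitions these symbols conventionally denote (the formulas below are illustrative and not quoted from the guide):

```latex
% Empirical risk (training loss) on the sample S, and population risk over D
L_S(\bm{\theta}) = \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(f_{\bm{\theta}},\bm{z}_i\bigr),
\qquad
L_{\mathcal{D}}(\bm{\theta}) = \mathbb{E}_{\bm{z}\sim\mathcal{D}}\,\ell\bigl(f_{\bm{\theta}},\bm{z}\bigr)

% Two-layer neural network with activation \sigma
f(\bm{x};\bm{\theta}) = \sum_{j=1}^{m} a_j\,\sigma(\bm{w}_j\cdot\bm{x}+b_j)

% One SGD step on a batch B with learning rate \eta
\bm{\theta} \leftarrow \bm{\theta}
  - \eta\,\frac{1}{|B|}\sum_{i\in B}\nabla_{\bm{\theta}}\,\ell\bigl(f_{\bm{\theta}},\bm{z}_i\bigr)
```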
