Introduction
There is a thesis that high dimensional optimization is easy, and it is only our low dimensional intuition that is holding progress back. I’d like to try and explain why that isn’t quite the case.
The background
Some of the rigorous work on the nature of stationary and critical points in high dimensional optimization includes:
- Alan J. Bray and David S. Dean, “Statistics of Critical Points of Gaussian Fields on Large-Dimensional Spaces,” Phys. Rev. Lett. 98, 150201 – Published 10 April, 2007. DOI: https://doi.org/10.1103/PhysRevLett.98.150201
- Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. 2014. “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization” In Proceedings of the 28th International Conference on Neural Information Processing Systems – Volume 2 (NIPS’14), Vol. 2. MIT Press, Cambridge, MA, USA, 2933–2941. https://dl.acm.org/doi/10.5555/2969033.2969154
- Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., & LeCun, Y. (2015). “The Loss Surfaces of Multilayer Networks.” Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research 38:192–204. https://proceedings.mlr.press/v38/choromanska15.html
Unfortunately, the above get caricatured down to not-quite-true claims such as the following.
Optimizing in R^n is easy because the probability of a stationary point being a local minimum falls off exponentially in n. The misplaced fear of local minima comes from low dimensional experience.
The implication doesn’t hold in common practical work for a number of reasons.
- Unconstrained optimizers stop for a great number of reasons other than hitting local minima. A few of these include: low norm gradients, large flat regions, iteration bounds, and numeric instability. It isn’t enough to eliminate “false stops at local minima.”
- Linear independence in this context isn’t enough to establish statistical independence. Losing to a great number of simultaneous conditions can still be easy if the conditions are correlated, so the “exponentially unlikely” part of the argument need not apply.
- The original references had to assume or establish a number of additional conditions to put a favorable probabilistic model on the stationary points. If you are not in similar circumstances (such as exploring the properties of a spin-glass or eigenvectors of random matrices) you should not expect similar properties.
- Even if most states are nice, optimizers search for exceptional ones, with the desired global minimum itself being the most exceptional one. Even if local minima are rare in proportion to stationary points, there could still be a lot of them compared to the number of global optima, so they may still drown out the globally optimum solutions. This is a common “base rate fallacy”: each attempt to spoil the optimizer may have a low chance of spoiling, but if there are enough attempts they have a good chance of spoiling the optimization in aggregate (see the small numeric sketch just after this list).
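As a toy numeric sketch of this base rate effect (the per-trap probability and trap count below are made-up illustrative numbers, not measurements):

# chance that at least one of many individually unlikely "traps" fires
p_single = 0.01   # illustrative probability that any one trap spoils the run
n_traps = 1000    # illustrative number of independent trap opportunities
p_spoiled = 1 - (1 - p_single) ** n_traps
print(p_spoiled)  # about 0.99996: near-certain spoilage in aggregate

Even with each trap individually unlikely, the run is almost certain to be spoiled by some trap.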
Industrial style optimization
My experience is that common objective functions tend to be structured and full of coincidences and symmetries. And because they have these structures they are hard to optimize.
Let’s work up what I claim to be a fairly typical optimization problem that arises from planning or scheduling. I’ll call it the train arrival schedule problem.

Maekawa Sempan Gotanda Station 1932, Public Domain
We are going to model a proposed k-train schedule with variables:
- x(i): the time the i-th train arrives.
- y(i) = b i: the track number of the i-th train scaled by b to give a track position. Each train has its own track. b is modeled as being small.
- target(i): when we would prefer to schedule the i-th train to arrive.
- z(i) = c |x(i) - target(i)|: the penalty for train i arriving at time x(i). c is chosen to be small.
- d(i, j): the Euclidean distance between (x(i), y(i)) and (x(j), y(j)).
We will use an overall penalty (to minimize) for a schedule x = (x(0), x(1), ..., x(k-1)) of:

loss(x) = E * sum_{i < j} 1 / d(i, j) + G * sum_i z(i)

where E and G are positive constants. This function is designed to penalize trains arriving at other than the desired times, and also to penalize trains on nearby tracks arriving at close-by times. If the “trains arriving at similar times” (“E”) penalty is not chosen to be too large and none of the target times are near each other, then the optimal solution is near (target(0), target(1), ..., target(k-1)). If the “E” penalty is large, or some of the target times are close to each other, the objective starts trading off hitting the targets against keeping near-track arrival times far apart. I.e., we have a non-trivial search among competing trade-offs.
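As a concrete sketch of this penalty in Python (the function name penalty and the constant values below are illustrative assumptions; the opt_fns module used later may differ in details):

import numpy as np

def penalty(x, target, b=0.01, c=0.1, E=1.0, G=1.0):
    """Schedule penalty: per-train arrival time terms plus pairwise repulsion."""
    x = np.asarray(x, dtype=float)
    target = np.asarray(target, dtype=float)
    # "G" term: per-train arrival time penalties z(i) = c * |x(i) - target(i)|
    total = G * np.sum(c * np.abs(x - target))
    # "E" term: pairwise repulsion E / d(i, j); distinct tracks y(i) = b * i keep d(i, j) > 0
    k = len(x)
    for i in range(k):
        for j in range(i + 1, k):
            d_ij = np.hypot(x[i] - x[j], b * (i - j))
            total += E / d_ij
    return total

print(penalty([0.0, 1.0, 2.0], [0.0, 1.0, 2.0]))  # at the targets only the repulsion terms contribute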
The problem
The problem is in the term 1 / d(i, j). This encodes that arrival times repel each other, so we don’t schedule all the trains at the same time. It is a major source of non-convexity and difficulty in this formulation. A gradient-aware optimizer taking sufficiently small steps can never change the order of x(i) versus x(j): it would experience bad objective values when x(i) and x(j) take on close intermediate values as they try to trade places. Of course, this can be mitigated with large steps, annealing, and stochastic ideas.
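To see the barrier concretely, here is a tiny sketch (with made-up constants, independent of the opt_fns module) of a single pairwise repulsion term as two trains on adjacent tracks try to trade arrival times:

import numpy as np

b, E = 0.01, 1.0               # illustrative track spacing and repulsion strength
t = np.linspace(0.0, 1.0, 5)   # progress through the attempted swap
x_i = 3.0 + t                  # train i moves from time 3 toward time 4
x_j = 4.0 - t                  # train j moves from time 4 toward time 3
d_ij = np.hypot(x_i - x_j, b)  # distance on the planning table
print(E / d_ij)                # roughly [1, 2, 100, 2, 1]: a ridge in the middle of the swap

A small-step gradient method started on one side of that ridge stays on that side.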
But from a pure gradient point of view the n-dimensional phase space for this problem is cut up into n! (“n factorial”) non-communicating sub regions or large valleys. Our optimizer starts possibly trapped in one such region, and the globally optimal solution is likely inaccessible in another. Each valley has a large ridge system or boundary that makes it difficult to escape. There are not “cheap random gaps or passes” in the ridges, as the ridges in this high dimensional configuration space come from simple facts such as the relative order of x(i) and x(j).
Let’s see the problem in action.
We have translated all of the above into Python (article here, code and tests here). We are only going to work on the trivial case target(i) = i. And we will run into problems even there!
First a success
Let’s show that if we start in the right region, a common conjugate gradient optimizer can in fact finish the job and get us to the global optimum. I know this is not the same as the stochastic gradient descent optimizers for which so much is claimed, but I feel it is good enough to illustrate the claims about the gradient structure of high dimensional problems.
In [1]:
import numpy as np
from scipy.optimize import minimize
from opt_fns import f, g  # import objective function f, and gradient g

rng = np.random.default_rng(2025)  # seed at an obvious value for repeatability
Let’s show that the claimed solution is in fact near a local optimum. We will call 100 variables “big” (I know “big” is commonly much bigger, but this is for ease of demonstration) and just type in the ideal solution that train i is given time-slot i.
In [2]:
n_vars = 100
ideal_soln = np.array(list(range(n_vars)), dtype=float)
Now we build a starting point that isn’t exactly the ideal solution. Knowing a starting solution this good is unlikely, but we want to first show the optimizer working well.
In [3]:
# noisy "not quite at optimum" starting solution x0_good = ideal_soln + rng.normal(scale=0.1, size=n_vars) np.sum(np.abs(x0_good - ideal_soln))
Now we optimize to try and get back to the global best solution.
In [4]:
soln_good = minimize(
    fun=f,
    x0=x0_good,
    method="CG",
    jac=g,
)
And we can confirm the recovered solution is the ideal solution.
In [5]:
assert np.abs(f(soln_good.x) - f(ideal_soln)) < 1e-2
assert np.sum(np.abs(soln_good.x - ideal_soln)) < 1e-6
Now a failure
Let’s build a non-optimal initial solution that disagrees with the ideal solution. This simulates the more common use of optimizers: finding a good solution when we don’t already have one.
First we build a bad initial solution.
In [6]:
# initial solution with no knowledge of objective
x0_bad = np.array(list(reversed(range(n_vars))), dtype=float)
f(x0_bad)
In [7]:
soln_bad = minimize(
    fun=f,
    x0=x0_bad,
    method="CG",
    jac=g,
)
And it turns out the returned solution is not good.
We did see a great improvement in the objective. However, notice the solution is still bad: its objective value is not near the value achieved at the global optimum.
In [10]:
np.sum(np.abs(soln_bad.x - ideal_soln))
Examining the solution shows us planned train arrival times are near each other and also not near their ideal times. The solver is stuck with some nasty intermediate trade-offs that are not required in the final optimal solution.
Out[11]:
array([-0. , 1. , 2. , 5.1 , 4. , 5. , 6. , 8.24, 8. ,
8.15, 13.92, 11. , 12. , 13. , 13.83, 15. , 16. , 17. ,
18. , 19. , 20. , 21. , 22.07, 23. , 24.12, 25.13, 14.78,
27. , 28. , 24.02, 25.03, 21.97, 32. , 33. , 34. , 35.23,
35.13, 37. , 38. , 39. , 40. , 41. , 42. , 43. , 44. ,
45. , 27.89, 47. , 48. , 49. , 50. , 51. , 52. , 59.86,
54. , 55. , 56. , 57. , 58.07, 57.97, 59.76, 61. , 62. ,
63. , 64. , 65. , 66. , 67. , 84.88, 75.06, 74.06, 71. ,
72. , 85.14, 73.96, 74.96, 76. , 77. , 78. , 79. , 80. ,
81. , 82. , 83. , 84. , 85. , 84.69, 87. , 88. , 84.8 ,
90. , 91. , 92. , 93. , 95.33, 95. , 95.23, 97. , 98. ,
99. ])
Maybe the function was too artificial?
There can be a push-back of “of course you broke the optimizer, you designed the problem to do that.” In my experience optimizers are incredibly easy to break, and structured problems from nature routinely do this. Getting them to work at all remains a high value skill.
One could also argue perhaps the objective function chosen is itself just too artificial to actually encounter.
I would argue the objective function wasn’t completely artificial. It is bounded, and smooth away from the abs() kinks, as the differing tracks prevent the pairwise distances from going to zero. It did have some nasty features, such as those abs() terms. However, the deep learning problems claimed to be easy have tons of features like this (such as ReLUs).
In fact I chose the function to be physical. It is the potential energy function for a simplified system featuring:
- electrostatic repulsion
- gravity
- v-shaped tracks or slots.
In an odd, steam-punk, alternate reality we can even imagine our train planning facility has a special planning table where:
- The table is labeled “times 0 to k-1” on one edge, and “track 0 through k-1” on the other.
- For each track a slot is cut across the table, with the lowest point of slot i being at time i and the slot rising linearly from this low point in each direction.
- In each slot we put an identical heavy steel ball with an identical huge positive charge.
As close as Gemini gets to an optimal image of the planning table.
If the change in slot height is small (compared to slot length and slot separation), we can ignore it in our charge distance calculations (while retaining it for the gravity calculations). In this case the potential energy of this system matches our objective function:
- The “E: don’t be near each other” part of our objective function is then exactly the Coulomb potential between the charges.
- The “G: per train arrival time penalty” term is the height potential energy term.
Shaking the table for a while could be expected to anneal into a near optimal schedule (though this is likely to electrocute the operator).
This is just a fun way of claiming our objective function is the potential energy of a reasonable physical system, and therefore itself somewhat reasonable (or at least something one might encounter).
Unconstrained optimizer stops
In my experience non-pathological unconstrained optimizer stops are usually due to a low norm gradient. When one looks at these gradients the coordinates are all small (the low norm condition) and often appear to have random signs. It is easy to imagine the signs of the low norm gradient are in fact reading off the relative position between the stopping point and an actual nearby stationary point (where the gradient is all zero). And it is further easy to imagine that this relative position is somewhat random or isotropic. Thus I don’t attach too much meaning to these values, though I do think they have been driving some of the “problems fail to have systematic blocking structure” fallacy.
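For example, one can inspect the gradient at the failed stop from the notebook above (a small sketch re-using g, soln_bad, and numpy as np from the earlier cells):

grad_at_stop = g(soln_bad.x)
print(np.linalg.norm(grad_at_stop))  # the gradient norm at the stop: small, but not exactly zero
print(np.sign(grad_at_stop)[:10])    # signs of the first few coordinates, to eyeball any pattern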
Other high dimensional claims
There are additional hopes that appeals to hyperbolic geometry, ergodic theory or percolation theory can establish that high dimensional optimization is easy.
For example: David D. Nolte, “A Random Walk in 10 Dimensions”, 2021 link (some discussions here) makes the interesting argument that if the level-sets of an optimization function “percolate”, then they give a useful tour of the entire objective space. It is really worth a read, as it relates global optimization to an actual global property. However, to say “if” the system percolates (or, more precisely, all the level sets percolate, or a single percolation can be striated into many) is not quite the same as knowing the system percolates. And, as we have seen, problem structure can cause a partition that blocks such global access.
That being said: there is value in characterizing what makes for easy problems. This helps us reformulate our problems and solutions towards such realizations.
How to fix
For our problem there is a fairly standard fix. Turn off the “E” or repulsive terms for initial optimization. Then bring them back in and re-optimize.
The idea behind this is to identify the difficult terms and treat them more like inequality constraints. Which means to turn them off when they are not active (a standard treatment of inequality constraints).
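A minimal sketch of this two-stage idea, assuming hypothetical helpers f_no_repulsion and g_no_repulsion that evaluate the objective and its gradient with the “E” terms switched off (the code in the linked repository may organize this differently):

# stage 1: hypothetical objective/gradient with the repulsive "E" terms turned off,
# so each train heads straight toward its target time
stage1 = minimize(fun=f_no_repulsion, x0=x0_bad, method="CG", jac=g_no_repulsion)

# stage 2: re-optimize the full objective starting from the stage 1 solution
stage2 = minimize(fun=f, x0=stage1.x, method="CG", jac=g)

With the repulsion off, the objective decomposes per train, so stage 1 lands near (target(0), ..., target(k-1)); stage 2 then only needs to make local adjustments where targets crowd each other. Zooming back out, the larger points I want to make are: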
- It isn’t just local features (such as local minima) that make unconstrained global optimization hard. Global optimization is often hard due to global structure. Many problems have non-trivial global structure.
- Unconstrained optimizers tend to stop or fail for reasons other than hitting local minima. They tend to stop at points with merely low norm gradients that are not in fact even critical or stationary points (let alone local minima).
- Most operations research problems have a lot of structure and symmetries. These can partition the optimization space into a large number of large non-communicating regions. Even deep learning problems can have global structure, as they may need to break a symmetry by realizing some functionality in one part of a network or another.
- Many low dimensional bugbears do in fact have easy high dimensional analogues. For example, a long nearly flat region in a 1-d problem can arise in 2-d as a pie tin, or in n-d as a ball.
- Optimizers are quite easy to break when one tries. And nature tries hard. It is only when one pretends optimization is easy that long run times are considered proof of optimality.
- In my opinion classical machine learning and optimization lulled people into a false sense of security by supplying spectacular optimization results. However, most of these results were hard won by establishing strict conditions such as convexity. In deep learning, even if your loss function (square loss, deviance, entropy, cross entropy) is convex, your overall system of data -> parameters -> prediction -> loss is not going to be convex. This is especially true when you are dealing with billions of datums and deep nets with billions of parameters simultaneously. When your objective function is literally billions of times more complex than your optimizer, it isn’t obvious your optimizer wins.