**A decision tree for function minimization**

What R calls "optimization" is more generally known as function minimization. The tools in the **stats** package function *optim()* are all essentially function minimizers, as are those in the package **optimrx** found at https://r-forge.r-project.org/projects/optimizer/. **optimrx** tries to make it easier for users to call quite a few methods without having to learn many different sets of command syntax. It uses the same syntax as *optim()*, even down to parameter scaling, selecting the solver simply by changing the *method="name"* argument.
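As a quick illustration of that calling convention, here is a hedged sketch using the classic Rosenbrock test function. With *optim()* itself, switching algorithms is just a change of the *method=* argument; *optimr()* keeps the same signature, so its call is shown commented out since **optimrx** must be installed from R-forge:

```r
# The Rosenbrock banana function, a standard test problem
# (a stand-in here for your own fn).
fr <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2

# With stats::optim(), the solver is chosen by the method= argument:
a1 <- optim(c(-1.2, 1), fr, method = "Nelder-Mead")
a2 <- optim(c(-1.2, 1), fr, method = "BFGS")

# optimrx keeps the same signature, so (assuming it is installed
# from R-forge) the call would look identical:
# a3 <- optimr(c(-1.2, 1), fr, method = "Rvmmin")

a1$value  # near zero at the solution (1, 1)
a2$value
```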

There are, however, a number of methods. Which is best for your purpose? This posting presents a form of decision tree to try to help. It isn’t the last word, and you should not take it as a set of hard rules. But it does reflect a lot of experience and I have tried to give the reasoning behind some of the suggestions.

We assume there is a function we wish to minimize or maximize. Call it *fn(x)*, where *x* is a parameter vector of length *n*; we will let *edata* be exogenous data, that is, those numbers or factors that are needed to compute the function but which we will not be adjusting. So we should really write *fn(x, edata)*, but we will often be lazy, just as R allows us to be with exogenous data.

Function *fn(x)* is assumed:

- computable
- to have an initial set of starting parameters *x0* that is feasible (which I won't define just now)
- to have more or less one local minimum (though we often try to solve problems with multiple minima present if these minima are believed to be sufficiently separated)

Note: if *fn()* is a sum of squares, then packages **nlmrt** or **minpack.lm** are preferred. The **stats** function *nls()* may also be useful. I plan to write more about nonlinear least squares in a later post. If you are desperate, then get in touch with me!
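As a small, hedged illustration of that advice (the data and model here are invented for the example), *nls()* fits the sum-of-squares problem directly rather than through a general minimizer:

```r
# Simulated exponential-decay data: y = 3 * exp(-0.1 * x) plus noise.
set.seed(1)
x <- 1:25
y <- 3 * exp(-0.1 * x) + rnorm(25, sd = 0.01)

# nls() sees the model structure, so it can use Gauss-Newton
# rather than a general-purpose minimizer on the residual sum of squares.
fit <- nls(y ~ a * exp(b * x), start = list(a = 2, b = -0.05))
coef(fit)  # should be close to a = 3, b = -0.1
```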

Let us proceed in a stepwise fashion.

**Step 0: Presets.**

If we are maximizing, create *fn(x, edata) <- -fn.orig(x, edata)*. That is, we minimize the negative of the function we want to maximize.
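A minimal sketch of this preset, using a toy one-parameter function invented for illustration:

```r
# Toy function with its maximum (value 5) at x = 2.
fn.orig <- function(x) -(x - 2)^2 + 5

# Step 0: hand the minimizer the negated function.
fn <- function(x) -fn.orig(x)

res <- optim(0, fn, method = "BFGS")
res$par     # the maximizer of fn.orig, close to 2
-res$value  # the maximum itself, close to 5
```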

Let us set some logical variables to define our problem:

- *haveg* = TRUE if analytic gradients of *fn(x)* are available
- *bounds* = TRUE if there are lower and/or upper bounds on the parameters *x*
- *masks* = TRUE if some parameters are fixed; currently only packages **Rvmmin** and **Rcgmin** support masks well

**Step 1: Check the function.** (This is always a REALLY good idea.)

- compute *fn(x, edata)* for at least *x0* and verify the value is correct
- if *haveg* is TRUE: compute *g(x0)* and compare it with *grad(fn, x0, edata)* from package **numDeriv**
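A sketch of that gradient check in base R; package **numDeriv** automates it via *grad()*, but the same idea with a hand-rolled central difference keeps this self-contained (the toy objective and its claimed analytic gradient are invented for the example):

```r
# Toy objective and its claimed analytic gradient.
fn <- function(x) sum((x - 1)^2)
gr <- function(x) 2 * (x - 1)

# Central-difference numerical gradient (numDeriv::grad does this better).
num.grad <- function(f, x, h = 1e-6) {
  g <- numeric(length(x))
  for (i in seq_along(x)) {
    e <- numeric(length(x)); e[i] <- h
    g[i] <- (f(x + e) - f(x - e)) / (2 * h)
  }
  g
}

x0 <- c(0.5, 2, -1)
max(abs(gr(x0) - num.grad(fn, x0)))  # tiny, so the analytic gradient checks out
```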

**Step 2: Running a method**

- *bounds* TRUE, *haveg* FALSE: try *nmkb()* from package **dfOptim**, but note that the transfinite function approach to bounds in this method cannot have *x0* (in fact, any initial parameter) on a bound. *bobyqa()* from package **minqa** may be much more efficient, but can behave poorly. *hjkb()* from package **dfOptim** is very easy to understand, but sometimes stalls and is generally inefficient.
- *bounds* TRUE, *haveg* TRUE:
  - *n < 50* (i.e., small): packages **Rvmmin** or **nlminb**
  - *n* large: packages **Rtnmin**, **Rcgmin**, or method **L-BFGS-B** (from *optim()* in **stats**). Package **lbfgsb3** is an implementation of the 2011 FORTRAN version of the same algorithm and in theory may be preferred, but **L-BFGS-B** called via *optim()* or *optimr()* is likely more efficient. The three approaches mentioned can also be applied to small-*n* problems, and are sometimes as efficient as the methods recommended for small *n*. **Rtnmin** and the LBFGS variants are in the same general class of methods, but have many differences in implementation approach. Generally I have found that the relatively newer **Rtnmin** may be faster, but less reliable, than **Rcgmin**, though the former seems to work better when the function has more "nonlinearity" (which for now I will not define). **Rvmmin** is, by design, very aggressive in trying to find a minimum, so will "try too hard" in many cases. As such it should not be criticised for slowness when it is intended to favour reliability over speed.
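To make the *bounds* TRUE, *haveg* TRUE branch concrete, here is a small sketch with **L-BFGS-B** from *optim()* in **stats**. The Rosenbrock function and the box bounds are invented for illustration; the upper bound deliberately excludes the unconstrained minimum at (1, 1), so the solver should stop on the boundary:

```r
# Rosenbrock function and its analytic gradient (haveg TRUE).
fr  <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
frg <- function(x) c(-400 * x[1] * (x[2] - x[1]^2) - 2 * (1 - x[1]),
                      200 * (x[2] - x[1]^2))

# Box bounds (bounds TRUE): the cap x[1] <= 0.5 excludes the
# unconstrained minimum at (1, 1), so the constrained minimum is
# on the boundary at (0.5, 0.25) with value 0.25.
res <- optim(c(0, 0), fr, frg, method = "L-BFGS-B",
             lower = c(-2, -2), upper = c(0.5, 2))
res$par
res$value
```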

- *bounds* FALSE, *haveg* FALSE: methods *nmk()* from **dfOptim**, or *uobyqa()* from **minqa**, are indicated (but *bobyqa()* from **minqa** may be better, as it is more up-to-date than *uobyqa()*; we can use very loose values for the bounds on *x*)
- *bounds* FALSE, *haveg* TRUE:
  - *n* small: packages **Rvmmin** or **ucminf**, or function *nlm()* from **stats**
  - *n* large: method **L-BFGS-B** or packages **Rcgmin** and **Rtnmin**
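For the *bounds* FALSE, *haveg* TRUE, *n*-small branch, here is a sketch with *nlm()* from **stats**, which takes the analytic gradient as an attribute of the function value (the toy quadratic is invented for illustration):

```r
# Toy quadratic with its minimum at (3, -1); nlm() expects the
# gradient attached to the function value as an attribute.
fg <- function(x) {
  v <- sum((x - c(3, -1))^2)
  attr(v, "gradient") <- 2 * (x - c(3, -1))
  v
}

sol <- nlm(fg, c(0, 0))
sol$estimate  # close to (3, -1)
sol$minimum   # close to 0
```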

In the above I have not mentioned function *spg()* from package **BB**. My experience is that occasionally this can be very efficient, but more often it is slower than other methods. However, it has the particular advantage of allowing the application of quite complicated constraints.

Hi John,

In my experience nlminb works pretty well even if one does not have gradients. Can you provide some explanation of why the functions that can work with a gradient, but do not require one, should not be used when one does not have the gradient?

Thanks,

Henrik


That’s actually worth a blog post itself, and I’ll try to do one sometime. The general approach is that gradient methods supply an approximation when no gradient is provided. With a simple forward approximation ((f(x + delta) - f(x))/delta) we get a secant method for minimization. This often does even better than the analytic gradient in early steps. But when we get “near” a minimum, we really want the precise gradient to know we are “finished”. For a large percentage of problems, the approximation is fine. But there is a real fraction of problems for which we get into trouble and either stop early or just plain get things wrong. Unfortunately, people (and the one I see in the mirror!) want to get the answers and don’t investigate the failures enough. That’s work that needs to be done to give better support to this answer.
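That trade-off is easy to see numerically. In this sketch (toy function invented for illustration), the forward difference matches the analytic gradient well away from the minimum, but at the minimum itself it cannot return the true zero gradient:

```r
# Toy function with minimum at x = 1, and its analytic gradient.
f  <- function(x) (x - 1)^2
gf <- function(x) 2 * (x - 1)

# Simple forward-difference approximation to the gradient.
fwd <- function(x, h = 1e-8) (f(x + h) - f(x)) / h

# Far from the minimum: relative error of the approximation is tiny.
abs(fwd(5) - gf(5)) / abs(gf(5))

# At the minimum: the true gradient is exactly 0, but the forward
# difference is not -- it is pure approximation error, so a method
# relying on it cannot cleanly decide "we are finished".
fwd(1)
gf(1)
```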

So the argument comes down to “Are we there yet?”. You will find the non-gradient Nelder-Mead family methods do very well getting close to a minimum, but spend a lot of effort deciding they are done. That is where good gradient calculations are valuable.

FWIW, I’m working on an optimization test structure to try to record and analyze cases. It’s a lot of work, and not stuff that generally gets a lot of applause. However, I’ll keep at it steadily.

JN


Hi John,

I am interested in whether similar rules apply to the maximization problem. Can you provide a brief explanation?

I am trying to find the optimal promotional [spend] strategy across different digital channels to maximize revenue. For this I simply used the optimx function (optimx package). Since I have a budget constraint I decided to go with the L-BFGS-B method. Do you think any other method would be appropriate?

Regards Jerry

Response: Maximization is accomplished by minimizing -f(x), so the argument is the same.

There are several bounds-constrained methods. L-BFGS-B and Rcgmin are good if you have a lot of parameters (say more than 40). Rvmmin is rather crude (it is the same method as optim::BFGS and taken from my 1979 book) but has bounds constraints included. If you don’t have gradients, then BOBYQA from minqa (there are other packages that have it too) can sometimes be brilliant, but occasionally a disaster. I strongly recommend working out the gradients if you can. They generally give better efficiency, but especially allow better understanding of whether a solution has been found. Numerical approximations are just that, approximations.

optimx is best used for comparing a few methods. It got to be difficult to maintain and extend, and the code got very awkward, so I had to fork the package to optimr. I have a “to do” item to update that on CRAN with my version with many more methods. It is at https://r-forge.r-project.org/scm/viewvc.php/pkg/optimr/?root=optimizer or can be found by using the source code (subversion) browser. However, the CRAN version will run L-BFGS-B and Rvmmin and Rcgmin, and, in particular, the optimr() function uses the same syntax for each method as that in optim().
