Unifying packages to help users access R repositories

With over 11,000 packages in the main CRAN (Comprehensive R Archive Network) collection, users of R face a challenge when looking for tools to carry out their statistical and computational tasks, especially in subject areas outside their main domain of expertise. Moreover, many packages have similar or overlapping features. Some are, to be kind, not very well written, use methods that are outdated, or are simply inappropriate to a given situation.

In an attempt to address this, I joined with Julia Silge and Spencer Graves to organize a special session “Navigating the R package universe” at the UseR2017 meeting in Brussels. A lot of other people contributed to the session — see https://github.com/nashjc/Rnavpkg, and especially the wiki portion, https://github.com/nashjc/Rnavpkg/wiki. The presentation materials are there, along with links to various follow-on material such as notes and this and other blog posts. In particular, there is a general follow-up blog post at https://juliasilge.com/blog/navigating-packages/.

There are lots of ideas about how to help users find the right package, but we found that we could consider them as generally falling into three categories:

  • unification of packages that have similar goals or features (of which more in this posting)
  • guidance in the form of CRAN Task Views and similar infrastructure (see Julia’s posting https://juliasilge.com/blog/package-guidance/)
  • search tools for R packages (pending posting by Spencer Graves)

There are overlaps in these categories, and we are certain some ideas have been left out. Let us know!

Approaches to unification of packages

In the breakout session on unification at UseR2017, many ideas were suggested, sometimes in rapid-fire fashion. Alice Daish kindly took notes on her cell phone, but I will apologize in advance if I have missed anything important.

My own view of unification with regard to the package universe is to reduce the effective span of the number of “things” the user must consider. If the rule of thumb for presentations is to have no more than (fill in your favourite single digit integer) bullets per screen, surely a user should not have to face thousands of choices? Following this objective, we can consider:

  • developing wrappers that provide a common syntax or interface to multiple packages
  • finding ways to identify packages that should not generally be used so they can be archived out of the immediate or first-level view of users who are searching for tools
  • promoting collaboration between developers of similar packages with the aim of consolidating those packages
  • encouraging those working on guidance and search to provide filtering of their output to reduce the choice burden on users.

None of these approaches can be considered “easy”.

Wrappers

My own efforts at unifying packages have been in the field of optimization, particularly function minimization. With Ravi Varadhan, I developed the optimx package to provide a common interface to a number of base and package tools. It is an indicator of the difficulty of doing this well that I later refactored optimx into optimr to allow for easier maintenance of the package. Note that because these wrappers call other packages, failures in the called functions may give rise to error messages that appear to come from the unifying package. To avoid too many CRAN alerts of this kind, the extended version, optimrx, is kept on R-forge in the optimizer project https://r-forge.r-project.org/projects/optimizer/; it can call more solvers than optimr. I believe that optimrx makes a worthwhile step towards simplifying the user's task in carrying out function minimization where some of the parameters can also be bounded or temporarily fixed. A similar wrapper for global optimizers, gloptim, has been written by Hans Werner Borchers; it is so far only on R-forge in the optimizer project. As I mentioned, writing these codes takes a lot of effort, and they are never truly finished because new solvers become available from time to time.
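As a minimal sketch of the wrapper idea, here is the kind of call opm() from optimrx allows (the same call shape appears in the examples later in this post). It assumes optimrx and the underlying solver packages (Rvmmin, Rcgmin) are installed; the quadratic test function is invented purely for illustration.

require(optimrx)
fq <- function(x) sum((x - 1:length(x))^2)   # simple quadratic with minimum at 1:n
fq.g <- function(x) 2 * (x - 1:length(x))    # its analytic gradient
# One call, several underlying solvers; the wrapper translates the common
# syntax into each package's native interface.
res <- opm(rep(0, 5), fq, fq.g, method = c("Rvmmin", "Rcgmin", "L-BFGS-B"))
summary(res, order = value)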

A sample of some other packages which unify a number of methods:

  • mlr: Machine Learning in R
  • caret: Classification And REgression Training for developing predictive models
  • Matrix: Matrix methods for dense and sparse matrices (What would we do without them?)

We could also consider the nls() function in the stats package distributed with base R to unify several methods for nonlinear least squares. Unfortunately, the default method and its derivative approximation are quite often mis-applied by non-experts (and even some experts!). For the Marquardt stabilization, users must look to packages like minpack.lm and nlsr. Duncan Murdoch and I wrote the latter package to allow analytic derivatives of expressions in nonlinear modelling, but it would be nice to be able to use them in conjunction with separable models (essentially those with a linear structure plus some nonlinear terms). While nlsr has a wrapnls() function to call nls(), we don't yet have a way to access minpack.lm or some other packages with relevant features. There is always more work to do!
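As a small sketch of the contrast between nls() and a Marquardt-stabilized alternative, consider the following. The Michaelis-Menten fit follows the standard Puromycin example from the nls() help page; minpack.lm::nlsLM() accepts the same formula interface. Treat this as an illustration of the interfaces rather than a recommendation for this particular data set.

pur <- subset(Puromycin, state == "treated")
# Default Gauss-Newton iteration via base-R nls()
fit_gn <- nls(rate ~ Vm * conc / (K + conc), data = pur,
              start = list(Vm = 200, K = 0.05))
# Levenberg-Marquardt via minpack.lm, often more robust to poor starting values
if (requireNamespace("minpack.lm", quietly = TRUE)) {
  fit_lm <- minpack.lm::nlsLM(rate ~ Vm * conc / (K + conc), data = pur,
                              start = list(Vm = 200, K = 0.05))
  print(coef(fit_lm))
}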

Archiving

During the breakout session on unification, there were a lot of comments about packages that should be merged, as well as a few about packages that should be abandoned. A subtext of the discussion that is relevant to R in general is that younger workers were quite clear they felt they could not voice their opinions openly, because some packages had been written by members of well-established groups who might provide employment opportunities or other perks. It was noted that two of the session organizers (JN and SG) are retired and the third (JS) does not work in academia, so we are relatively protected from backlash to honest commentary. Here I genuinely mean honest, not capriciously critical. We must recognize that many packages have been important in the life of R but now need renewal or replacement. I have been trying to get my own contributions to the optim() function deprecated for some time! However, there are workers who, I am sure, will for a variety of reasons be unwilling to discuss ways to reorganize the package repositories.

Ultimately, the users will vote with their feet (or rather mouse/touchpad). However, there will be a lot of wasted effort that does them and R no good whatever.

Filtered presentation

The CRAN repository is, at present, offered to users as a list, alphabetically or by date, on the different mirrors (e.g., https://cloud.r-project.org/). While a way to see every package available is necessary, it is far from friendly to users, whether novices or experts. Clearly it is feasible to restructure the list(s) for different audiences. Moreover, this restructuring or filtering need not be done on the CRAN sites, but could be delegated to sites dealing with particular applications or methods. Here is a prime area of overlap with the guidance and search ideas for navigating the packages. In particular, it would be helpful if automated tools were developed to do the lion's share of this work.

Another approach to filtering is to present R tools within a different framework. https://www.jamovi.org/ uses a spreadsheet paradigm. https://www.tidyverse.org/ collects tools that deal with data science and adds a streamlined command set. Unfortunately, there is a learning cost to using new frameworks like these. They simplify the use of R in one domain, but add to the general collection of packages and tools.

Directions

Unification of R packages with the aim of reducing user costs in finding the right tools is never going to be trivial. Moreover, it is not as directly rewarding as writing an (apparently) new package or method and getting a paper published about it. On the other hand, making R infrastructure more accessible to users is key to the continued vitality of the overall resource and community.

John C. Nash     2017-9-3


Choosing which method to use for “optimizing” functions

A decision tree for function minimization

What R calls “optimization” is more generally known as function minimization. The tools in the stats package function optim() are all essentially function minimizers, as are those in the package optimrx found at https://r-forge.r-project.org/projects/optimizer/. optimrx tries to make it easier for users to call quite a few methods without having to learn many different sets of command syntax. It uses the same syntax as optim(), even down to parameter scaling, with the particular solver selected by the method="name" argument.

There are, however, a number of methods. Which is best for your purpose? This posting presents a form of decision tree to try to help. It isn’t the last word, and you should not take it as a set of hard rules. But it does reflect a lot of experience and I have tried to give the reasoning behind some of the suggestions.

We assume there is a function we wish to minimize or maximize. Call it fn(x), where x is a parameter vector of length n; we will let edata be exogenous data, that is, the numbers or factors that are needed to compute the function but which we will not be adjusting. So we should really write fn(x, edata), but will often be lazy, just as R allows us to be with exogenous data.
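As a tiny sketch of the fn(x, edata) idea (the model, names, and numbers below are invented for illustration): the exogenous data ride along through R's "..." mechanism and are never adjusted by the optimizer.

edata <- list(t = 0:10, y = 5 * exp(-0.4 * (0:10)))   # made-up exogenous data
ssq <- function(x, edata) {   # residual sum of squares for a decay model
  sum((edata$y - x[1] * exp(-x[2] * edata$t))^2)
}
# optim() (and opm()/optimr() in optimrx) pass extra named arguments on to fn:
optim(c(1, 1), ssq, edata = edata, method = "BFGS")$par   # should be near (5, 0.4)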

Function fn(x) is assumed:

  • computable
  • to have an initial starting set of parameters x0 that is feasible (which I won’t define just now)
  • to have essentially one local minimum (though we often try to solve problems with multiple minima present if these minima are believed to be sufficiently separated).

Note: if fn() is a sum of squares, then packages nlmrt or minpack.lm are preferred. The stats function nls() may also be useful. I plan to write more about nonlinear least squares in a later post. If you are desperate, then get in touch with me!

Let us proceed in a stepwise fashion.

Step 0: Presets.
If we are maximizing, create fn(x, edata) <- -fn.orig(x, edata). That is, we minimize the negative of the function we want to maximize.
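In runnable R (fn.orig being just the placeholder name used above), a minimal version of Step 0 is:

fn <- function(x, edata) -fn.orig(x, edata)   # minimizing fn maximizes fn.orig
# optim() offers the same effect without a wrapper via control = list(fnscale = -1)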

Let us set some logical variables to define our problem:

  • haveg = TRUE if there are analytic gradients of fn(x) available
  • bounds = TRUE if there are lower and/or upper bounds on the parameters x
  • masks = TRUE if some parameters are fixed; currently only packages Rvmmin and Rcgmin support masks well.

Step 1: Check the function (this is always a REALLY good idea).

  • compute fn(x, edata) for at least x0 and verify the value is correct
  • haveg TRUE: compute g(x0) and compare it with grad(fn, x0, edata) from package numDeriv (a sketch of this check follows)
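Continuing the small ssq()/edata sketch from earlier (all names there were invented for illustration), Step 1 might look like the following, with numDeriv's grad() supplying the accurate numerical check.

require(numDeriv)
ssq.g <- function(x, edata) {                  # analytic gradient of ssq()
  r <- edata$y - x[1] * exp(-x[2] * edata$t)   # residuals
  c(sum(-2 * r * exp(-x[2] * edata$t)),
    sum( 2 * r * x[1] * edata$t * exp(-x[2] * edata$t)))
}
x0 <- c(1, 1)
ssq(x0, edata)                       # Step 1a: is this value sensible?
ga <- ssq.g(x0, edata)               # Step 1b: analytic gradient ...
gn <- grad(ssq, x0, edata = edata)   # ... versus the numDeriv approximation
max(abs(ga - gn))                    # should be tiny if ssq.g() is correct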

Step 2: Running a method

  • bounds TRUE:
    • haveg FALSE: try nmkb() from package dfoptim, but note that the transfinite-function approach to bounds in this method means that x0 (indeed, any initial parameter) cannot lie on a bound. bobyqa() from package minqa may be much more efficient, but can behave poorly. hjkb() from package dfoptim is very easy to understand, but sometimes stalls and is generally inefficient.
    • haveg TRUE:
      • n < 50 (i.e., small): packages Rvmmin or nlminb
      • n large: packages Rtnmin, Rcgmin, or method L-BFGS-B (from optim() in stats). Package lbfgsb3 is an implementation of the 2011 FORTRAN version of the same algorithm and in theory may be preferred, but L-BFGS-B called via optim() or optimr() is likely more efficient. These three approaches can also be applied to small problems, and are sometimes as efficient as the methods recommended for small n. Rtnmin and the L-BFGS variants are in the same general class of methods, but have many differences in implementation approach. Generally I have found that the relatively newer Rtnmin may be faster, but less reliable, than Rcgmin, though the former seems to work better when the function has more “nonlinearity” (which for now I will not define). Rvmmin is, by design, very aggressive in trying to find a minimum, so will “try too hard” in many cases. As such it should not be criticised for slowness when it is intended to favour reliability over speed.
  • bounds FALSE:
    • haveg FALSE: methods nmk() from dfoptim or uobyqa() from minqa are indicated (but bobyqa() from minqa may be better, as it is more up to date than uobyqa(); we can use very loose values for the bounds on x)
    • haveg TRUE:
      • n small: packages Rvmmin or ucminf, or function nlm() from stats.
      • n large: method L-BFGS-B or packages Rcgmin and Rtnmin

In the above I have not mentioned function spg() from package BB. My experience is that occasionally this can be very efficient, but it is more often slower than other methods. However, it has the particular advantage of allowing quite complicated constraints to be applied.
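To make the decision tree concrete, here is a minimal sketch of one branch (bounds TRUE, haveg TRUE, n small) run through optimr(). The toy objective, its gradient, and the bounds are invented for illustration, and the sketch assumes optimrx (with the Rvmmin solver) is installed; optimr() uses the same calling syntax as optim().

require(optimrx)
fn <- function(x) sum((x - c(0.5, 2, 3))^2)   # toy objective, minimum at (0.5, 2, 3)
gr <- function(x) 2 * (x - c(0.5, 2, 3))      # its analytic gradient
sol <- optimr(par = c(1, 1, 1), fn = fn, gr = gr, method = "Rvmmin",
              lower = c(1, 0, 0), upper = c(4, 4, 4))
sol$par    # the first parameter should finish on its lower bound of 1
sol$value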

 

 

Why optim() is out of date

And perhaps you should be careful using it

Once upon a time

Once upon a time, there was a young Oxford D.Phil. graduate with a thesis on quantum mechanics who — by virtue of a mixup in identities — got hired as an Agricultural Economist. He was actually better qualified than the person he was mistaken for, especially for the work that his employers wanted him to do. This was to carry out some rather cutting edge modelling calculations that needed pretty good linear algebra and function minimization software.

In this far away time, computers were very exotic pieces of machinery. Our minor hero was lucky enough to have access to what was essentially a personal computer. This was a “share” of a Data General Nova computer. There were about 6 shares. Each share had about 4K (that’s K, or 1000) bytes for program and data, though for some reason the shares were slightly different sizes. Storage for programs and re-readable output was via a paper tape reader/punch and there was a 10 character per second teletype printer. Ear protectors were available as operating equipment. (No kidding!)

In this early 1970s world, S had not yet been thought of, and R was two decades in the future, but a crude dialect of BASIC was available. Less than a decade earlier, John Nelder and Roger Mead had published their “simplex” method, which will forever be confused with the Dantzig linear programming method of a similar name. Less than five years before, Roger Fletcher published his variant of the Variable Metric code, though he and Reeves had earlier described their non-linear conjugate-gradient method. These three methods were coded in BASIC for the small partition of the Data General Nova. In R, these methods are called “Nelder-Mead”, “BFGS” and “CG”, though “CG” offers two other ways of computing the conjugate-gradient search direction besides that of Fletcher-Reeves.

How they came to R

The various tools for small computers were written up as my 1979 book “Compact numerical methods for computers: Linear algebra and function minimization” that was published by Adam Hilger. This had about 25 algorithms in step and description form. The Institute of Physics acquired Adam Hilger and published the second edition in 1990 in which the major update was to present the algorithms in Turbo Pascal with a diskette in a pocket at the back.

In 1995, I was visiting Oxford and had a meeting with Brian Ripley, who happened to have the office once occupied by my D.Phil. supervisor (C.A.Coulson FRS — Wikipedia gives a reasonable but rather bland biography). I agreed R could use the codes.

Meanwhile etc.

Of course, the world of optimization did not stand still in the quarter century between Fletcher's VM code and R getting it in optim(). There were developments in quasi-Newton minimizers, and the 1980s code L-BFGS-B from Nocedal et al. was eventually brought into R, as was a rather crude Simulated Annealing (method “SANN”). I will not mention the last method further, as it is really a different beast from the other minimizers.

Dennis and Schnabel’s approach was codified in the function nlm() while David Gay’s Bell Labs code nlminb() supplied a bounds constrained minimizer. However, the calling syntax and controls of these functions differ, and differ from those of optim().

On the other hand, there were a lot of other developments in optimization and function minimization that could be of interest to R users, and some of the shortcomings of the original three methods should be acknowledged. Moreover, in 1987, Mary Walker-Smith and I had published “Nonlinear parameter estimation: an integrated system in BASIC”, where the methods were slightly improved and bounds constraints and masks (fixed parameter constraints) were added. The book and BASIC software are available at https://archive.org/details/NLPE87plus.

Weaknesses of Nelder-Mead, BFGS, and CG

Let us examine the results of applying some of these methods to the standard Rosenbrock banana-shaped valley. The original problem has starting point (-1.2, 1). The function is given in the code below. To save myself some work, I am calling the methods through the package optimrx, which allows multiple methods to be applied via the single function opm(). Alternatively, they can be applied one at a time with method = "name" via the function optimr(), which has the same syntax as the base-R optim(). However, with optimr() and opm() we can specify which gradient approximation to use; here we apply a very simple forward-difference approximation. I intend to blog about gradients in a later posting.

require(optimrx)
## Loading required package: optimrx
options(width=110)
fnRosenbrock <- function (x) 
{
    n <- length(x)
    x1 <- x[2:n]
    x2 <- x[1:(n - 1)]
    sum(100 * (x1 - x2^2)^2 + (1 - x2)^2)
}
meth0 <- c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "nlm", "nlminb")
st0 <- c(-1.2, 1) # the standard start
result0 <- opm(st0, fnRosenbrock, "grfwd", method=meth0)
summary(result0, order=value)
##                    p1        p2        value fevals gevals convergence  kkt1 kkt2 xtime
## BFGS        0.9999999 0.9999998 9.784363e-15    138     41           0  TRUE TRUE 0.003
## nlm         0.9999971 0.9999941 8.697058e-12     NA     24           0  TRUE TRUE 0.001
## nlminb      0.9999970 0.9999940 9.025650e-12     44     36           0  TRUE TRUE 0.002
## L-BFGS-B    0.9999970 0.9999940 9.119830e-12     49     49           0  TRUE TRUE 0.002
## Nelder-Mead 1.0002601 1.0005060 8.825241e-08    195     NA           0 FALSE TRUE 0.003
## CG          0.9174150 0.8413668 6.802418e-03   3957   1001           1 FALSE TRUE 0.049

Here we see that most of the methods get a “reasonable” answer in a short time. However, CG has not terminated within the allowed number of gradient evaluations. This is indicated by the convergence code and also the number of gradient evaluations recorded.

Truthfully, I have never been very happy with the CG algorithm or code as I specified it in the book or the Pascal code. That is a pity, as it is a very compact code, uses only a few vectors of working storage, and allows very nice vectorization. Fortunately, as we shall see later, there is a happy update to this code.

The issue may, of course, be the use of an approximate gradient. Just to be sure, let us try using analytic gradients.

grRosenbrock <- function (x) 
{
    n <- length(x)
    g <- rep(NA, n)
    g[1] <- -400 * x[1] * (x[2] - x[1]^2) - 2 * (1 - x[1])
    if (n > 2) {
        ii <- 2:(n - 1)
        g[ii] <- 2 * (x[ii] - 1) + 400 * x[ii] * (x[ii]^2 - x[ii + 
            1]) + 200 * (x[ii] - x[ii - 1]^2)
    }
    g[n] <- 200 * (x[n] - x[n - 1]^2)
    g
}
result0g <- opm(st0, fnRosenbrock, grRosenbrock, method=meth0)
summary(result0g, order=value)
##                    p1        p2        value fevals gevals convergence  kkt1 kkt2 xtime
## nlminb      1.0000000 1.0000000 4.291816e-22     43     36           0  TRUE TRUE 0.001
## nlm         1.0000000 1.0000000 1.182096e-20     NA     24           0  TRUE TRUE 0.002
## BFGS        1.0000000 1.0000000 9.594956e-18    110     43           0  TRUE TRUE 0.001
## L-BFGS-B    0.9999997 0.9999995 2.267577e-13     47     47           0  TRUE TRUE 0.001
## Nelder-Mead 1.0002601 1.0005060 8.825241e-08    195     NA           0 FALSE TRUE 0.001
## CG          0.9174130 0.8413632 6.802745e-03   3957   1001           1 FALSE TRUE 0.033

The methods needing gradients are generally doing better with the more precise gradient, except our CG code. And if we look at the timings (which may be untrustworthy, since the calculations are very fast), the gradient methods appear to have taken less time.

Improved codes

Wanting to incorporate the bounds and masks constraints that were in Nash and Walker-Smith, I prepared package Rvmmin a few years ago. Moreover, I had discovered a very nice approach to the CG code by Yuan and Dai that combined several formulas into one, giving a program that was simultaneously better and simpler. I was also able to incorporate bounds and masks, and this became package Rcgmin. Both these packages are written entirely in R, so the algorithms can be instrumented and reworked if people wish.

In the 1990s C T Kelley made some improvements to the Nelder-Mead code. Ravi Varadhan and Hans Werner Borchers put this code into package dfoptim as nmk() with a bounds constrained version nmkb(). However, the bounds are implemented via the transfinite function rescaling, and it is not possible to start on a bound. This limitation is an unfortunate and frequent nuisance, but having the bounds possibility is nevertheless welcome. Here, of course, we are not imposing bounds.
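As a quick illustration of that caveat (reusing fnRosenbrock from above, with bounds invented for the example), nmkb() is happy with a start strictly inside the box but is expected to fail when a starting parameter sits exactly on a bound, because the transfinite rescaling blows up there.

require(dfoptim)
# Start strictly inside the bounds: works
nmkb(c(-1.2, 1), fnRosenbrock, lower = c(-2, -2), upper = c(2, 2))$par
# Start with the second parameter on its upper bound: expected to stop with an error
try(nmkb(c(-1.2, 2), fnRosenbrock, lower = c(-2, -2), upper = c(2, 2)))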

meth1 <- c(meth0, "nmkb", "Rcgmin", "Rvmmin")
result1 <- opm(st0, fnRosenbrock, "grfwd", method=meth1)
## Warning in optimr(par, fn, gr, method = meth, lower = lower, upper = upper, : Successful convergence Restarts
## for stagnation =0
summary(result1, order=value)
##                    p1        p2        value fevals gevals convergence  kkt1 kkt2 xtime
## BFGS        0.9999999 0.9999998 9.784363e-15    138     41           0  TRUE TRUE 0.003
## Rcgmin      1.0000003 1.0000006 1.009084e-13    869    515           0  TRUE TRUE 0.041
## Rvmmin      0.9999972 0.9999944 7.779352e-12    196     42           0  TRUE TRUE 0.008
## nlm         0.9999971 0.9999941 8.697058e-12     NA     24           0  TRUE TRUE 0.002
## nlminb      0.9999970 0.9999940 9.025650e-12     44     36           0  TRUE TRUE 0.001
## L-BFGS-B    0.9999970 0.9999940 9.119830e-12     49     49           0  TRUE TRUE 0.002
## Nelder-Mead 1.0002601 1.0005060 8.825241e-08    195     NA           0 FALSE TRUE 0.001
## nmkb        1.0002749 1.0006092 4.268664e-07    120     NA           0 FALSE TRUE 0.007
## CG          0.9174150 0.8413668 6.802418e-03   3957   1001           1 FALSE TRUE 0.056

And using the analytic gradient:

result1g <- opm(st0, fnRosenbrock, grRosenbrock, method=meth1)
## Warning in optimr(par, fn, gr, method = meth, lower = lower, upper = upper, : Successful convergence Restarts
## for stagnation =0
summary(result1g, order=value)
##                    p1        p2        value fevals gevals convergence  kkt1 kkt2 xtime
## Rvmmin      1.0000000 1.0000000 1.232595e-32     59     39           0  TRUE TRUE 0.004
## nlminb      1.0000000 1.0000000 4.291816e-22     43     36           0  TRUE TRUE 0.001
## nlm         1.0000000 1.0000000 1.182096e-20     NA     24           0  TRUE TRUE 0.000
## Rcgmin      1.0000000 1.0000000 8.843379e-18    735    434           0  TRUE TRUE 0.022
## BFGS        1.0000000 1.0000000 9.594956e-18    110     43           0  TRUE TRUE 0.001
## L-BFGS-B    0.9999997 0.9999995 2.267577e-13     47     47           0  TRUE TRUE 0.001
## Nelder-Mead 1.0002601 1.0005060 8.825241e-08    195     NA           0 FALSE TRUE 0.001
## nmkb        1.0002749 1.0006092 4.268664e-07    120     NA           0 FALSE TRUE 0.006
## CG          0.9174130 0.8413632 6.802745e-03   3957   1001           1 FALSE TRUE 0.033

What do we see here? The conjugate gradients codes are still “slow”, but we now get a reasonable solution with Rcgmin. The all-R codes, while a bit slower in general than those from base-R written in C, are nevertheless likely fast enough for most users. And, for this problem, standard Nelder-Mead does as well as the “improved” version (which gives rise to the warning message about “stagnation”).

A less extreme problem

The Rosenbrock problem has very bad scaling (note the factor of 100 in the definition). Let us try a somewhat simpler problem, but allow for more parameters at once. Here we will try a problem in 25 parameters. Note that when we do the optimization, we turn off the computation of the KKT (Karush-Kuhn-Tucker) tests for optimality, because they require the computation of the 25 by 25 Hessian matrix approximation and then its eigenvalues. We could include the trace of the computations, and indeed had trace set to 1 during early tests of this problem, but do not want a lot of output in the final blog. Similarly, we select just the first 4 parameters for display using par.select in the summary output.

fourth <- function(x){
   n <- length(x)
   val <- 10 # to ensure we don't get a very small number that keeps changing
   for (i in 2:n){
      t1 <- (x[i-1] - i + 1)
      t2 <- (x[i] - i)
      val <- val + (t1*t1+t2*t2)^2
   }
   val
}
stt <- rep(1,25)
resultt <- opm(stt, fourth, "grfwd", method=meth1, control=list(kkt=FALSE, trace=0))
## Warning in optimr(par, fn, gr, method = meth, lower = lower, upper = upper, : Maximum number of fevals
## exceeded Restarts for stagnation =0
## Warning in Rvmminu(par = spar, fn = efn, gr = egr, control = mcontrol, ...): Too many gradient evaluations
summary(resultt, order=value, par.select=1:4)
##                    p1         p2       p3        p4        value fevals gevals convergence kkt1 kkt2 xtime
## Rcgmin      1.0000000   2.001618 3.000212  4.000777 1.000000e+01     46     41           0   NA   NA 0.070
## Rvmmin      1.0000000   2.000576 3.000087  4.000809 1.000000e+01    376   2501           1   NA   NA 4.288
## L-BFGS-B    1.0000000   1.996713 2.997969  3.997143 1.000000e+01     33     33           0   NA   NA 0.054
## CG          1.0000000   1.994371 2.996702  3.995209 1.000000e+01    389    193           0   NA   NA 0.332
## nlm         1.0000000   1.988893 3.003896  4.000518 1.000000e+01     NA    118           0   NA   NA 0.257
## nlminb      1.0000000   2.001322 2.988232  3.992054 1.000000e+01    155    151           1   NA   NA 0.261
## BFGS        1.0000000   2.007567 2.998411  4.002900 1.000000e+01    182     70           0   NA   NA 0.122
## nmkb        0.8702348   2.145185 2.977281  3.986003 1.001408e+01   2500     NA           1   NA   NA 0.423
## Nelder-Mead 4.5912291 -14.477946 4.450378 -3.549623 2.672876e+06   2502     NA           1   NA   NA 0.156

We can also compute with the analytic gradient. This gives the following code and results.

require(numDeriv)
## Loading required package: numDeriv
fourth.g <- function(x){
   n <- length(x)
   gg <- rep(0,n)
   for (i in 2:n){
      t1 <- (x[i-1] - i + 1)
      t2 <- (x[i] - i)
#      val <- val + (t1*t1+t2*t2)^2
      gg[i-1] <- gg[i-1] + 4*(t1*t1+t2*t2)*t1
      gg[i] <- gg[i] + 4*(t1*t1+t2*t2)*t2
   }
   gg
}
ga <- fourth.g(stt)
gn <- grad(fourth, stt)
cat("max abs difference (analytic - numeric)=", max(abs(gn-ga)), "\n")
## max abs difference (analytic - numeric)= 6.714727e-05
cat("max rel difference (analytic - numeric)=", max(abs(gn-ga))/max(abs(ga)), "\n")
## max rel difference (analytic - numeric)= 3.445995e-10
resulttg <- opm(stt, fourth, fourth.g, method=meth1, control=list(kkt=FALSE, trace=0))
## Warning in optimr(par, fn, gr, method = meth, lower = lower, upper = upper, : Maximum number of fevals
## exceeded Restarts for stagnation =0
## Warning in Rvmminu(par = spar, fn = efn, gr = egr, control = mcontrol, ...): Too many gradient evaluations
summary(resulttg, order=value, par.select=1:4)
##                    p1         p2        value fevals gevals convergence kkt1 kkt2 xtime
## Rcgmin      1.0000000   2.001618 1.000000e+01     46     41           0   NA   NA 0.075
## Rvmmin      1.0000000   2.000576 1.000000e+01    376   2501           1   NA   NA 4.371
## L-BFGS-B    1.0000000   1.996713 1.000000e+01     33     33           0   NA   NA 0.057
## CG          1.0000000   1.994371 1.000000e+01    389    193           0   NA   NA 0.341
## nlm         1.0000000   1.988893 1.000000e+01     NA    118           0   NA   NA 0.259
## nlminb      1.0000000   2.001322 1.000000e+01    155    151           1   NA   NA 0.271
## BFGS        1.0000000   2.007567 1.000000e+01    182     70           0   NA   NA 0.124
## nmkb        0.8702348   2.145185 1.001408e+01   2500     NA           1   NA   NA 0.419
## Nelder-Mead 4.5912291 -14.477946 2.672876e+06   2502     NA           1   NA   NA 0.160

Note that we ALWAYS test our gradient code using a good approximation such as that in the grad() function of package numDeriv.

Observations we can make on the results above are as follows:

  • The heuristic direct-search methods Nelder-Mead and nmkb have to try too many test sets of parameters to find the minimum of the fourth function. Both methods hit the limit of function evaluations without even getting near the solution.

  • The variable metric methods BFGS and Rvmmin both do quite well. In particular Rvmmin gets the best result with both approximate and analytic gradients. However, it takes quite a lot of function and gradient evaluations. With the trace flag set, we see these methods trying very hard to reduce the function long after a reasonable solution has been found. This is, in fact, part of the very conservative design of this method (Algorithm 21 in the original Compact Numerical Methods). We are getting the behaviour that was intended, but that might not quite give the performance users like.

  • Both CG and its update, Rcgmin, do well. Despite being purely in R, we see Rcgmin is very fast for this problem, taking relatively few gradient and function evaluations and turning in a very fast timing, along with a respectable solution.

  • Gradient methods do better with analytic gradients, both in result and timing. L-BFGS-B does particularly well (it shares some similarities with Rcgmin, but is written in C).

If we don't have analytic gradients, can we do better than a simple forward-difference approximation? Package optimrx allows a default approximation if we set the gradient function to NULL, but such a choice leaves the approximation unspecified, and certainly does not allow us to say how the gradient is computed. We can expect a backward difference to have similar properties to the forward difference. On the other hand, both central differences and extrapolated approximations are known to offer better accuracy. Let us try these as a final look at these methods.

# Central difference approximation
meth2<- c("BFGS", "CG", "L-BFGS-B", "nlm", "nlminb", "Rcgmin", "Rvmmin")
resultt2 <- opm(stt, fourth, "grcentral", method=meth2, control=list(kkt=FALSE, trace=0))
## Warning in Rvmminu(par = spar, fn = efn, gr = egr, control = mcontrol, ...): Too many gradient evaluations
summary(resultt2, order=value, par.select=1:4)
##          p1       p2 value fevals gevals convergence kkt1 kkt2 xtime
## Rvmmin    1 1.998545    10    379   2501           1   NA   NA 8.197
## Rcgmin    1 1.996014    10     45     41           0   NA   NA 0.130
## L-BFGS-B  1 1.996731    10     33     33           0   NA   NA 0.103
## CG        1 1.994366    10    389    193           0   NA   NA 0.617
## nlm       1 1.988894    10     NA    118           0   NA   NA 0.520
## BFGS      1 1.994444    10    178     66           0   NA   NA 0.221
## nlminb    1 2.001275    10    155    151           1   NA   NA 0.500
# numDeriv grad() approximation
resultt3 <- opm(stt, fourth, "grnd", method=meth2, control=list(kkt=FALSE, trace=0))
## Warning in Rvmminu(par = spar, fn = efn, gr = egr, control = mcontrol, ...): Too many gradient evaluations
summary(resultt3, order=value, par.select=1:2)
##          p1       p2 value fevals gevals convergence kkt1 kkt2  xtime
## Rcgmin    1 1.996722    10     44     38           0   NA   NA  0.503
## Rvmmin    1 1.996348    10    369   2501           1   NA   NA 33.771
## L-BFGS-B  1 1.996754    10     33     33           0   NA   NA  0.439
## CG        1 1.994370    10    389    193           0   NA   NA  2.593
## BFGS      1 1.996236    10    186     73           0   NA   NA  0.972
## nlm       1 1.988893    10     NA    117           0   NA   NA  2.031
## nlminb    1 2.001443    10    155    151           1   NA   NA  2.003

From these we see that the more accurate gradient approximations require more time. In conjunction with the very conservative and bulldog-like Rvmmin, they take so much time as to be a poor choice for regular usage. On the other hand, Rcgmin does “not too badly”.

Conclusions

The function minimization methods I developed that are in optim() are still quite effective, but they were cutting edge four decades ago and are now showing a few wrinkles. There are very similar tools available in R that will usually do better.

From the examples above, we see what I have more generally observed:

  • For large, not too badly scaled problems, Rcgmin and L-BFGS-B are good choices for solvers, especially if analytic gradients are available.

  • For large problems it is wise to turn off the computation of the KKT tests when using opm(). There is a function kktchk() in the package optextras which can be called separately once the trial solution has been found.

  • For small problems, most of the solvers do quite well. Rvmmin is, by design, very aggressive in seeking solutions, so quite useful when problems are “difficult” but the function and gradient are relatively quick to compute.

  • When gradients are not available, and the problem is not too “difficult”, a simple forward gradient approximation with a gradient method is often more efficient than a derivative-free method.

The above observations should be read with the proviso that this short discussion has omitted several of the solvers available through optimrx, some of which may be particularly suited to special cases users may encounter.