The module LSQ is for unconstrained linear least-squares fitting. It is
based upon Applied Statistics algorithm AS 274 (see comments at the start
of the module). A planar-rotation algorithm is used to update the
QR-factorization. This makes it suitable for updating regressions as more
data become available. The module contains a test for singularities which
is simpler and quicker than calculating the singular-value decomposition.
An important feature of the algorithm is that it does not square the
condition number; the matrix X'X is not formed. Hence it is suitable for
ill-conditioned problems, such as fitting polynomials.

By taking advantage of the MODULE facility, it has been possible to remove
many of the arguments to routines. Apart from the new function VARPRD, and
a back-substitution routine BKSUB2 which it calls, the routines behave as
in AS 274.

The package as posted comprises:
   LSQ.DOC       this note
   LSQ.F90       the least-squares module
   DEMO.F90      a simple demonstration program
   FUELCONS.DAT  a file of data to be read by the demo program
   TEST1.F90     another program which fits a cubic polynomial and has a
                 deliberate singularity

This package is being made available freely, and may be freely distributed.

Author: Alan Miller
        Formerly of CSIRO Division of Mathematics & Statistics, now retired.
e-mail: amiller@bigpond.net.au

Representation of floating-point numbers
----------------------------------------
With many Fortran 77 compilers on many computers, the REAL data type gives
floating-point numbers to about 7 decimal digits accuracy. This is barely
adequate for least-squares calculations if there is any ill-conditioning.
In such cases, the use of the DOUBLE PRECISION data type is highly
desirable, while quadruple precision is available with a few compilers.
Fortran 90 allows the user to specify the minimum accuracy required.

This module uses the parameter `lsq_kind' to specify the accuracy required.
It is set in the module as:

   INTEGER, PARAMETER :: lsq_kind = SELECTED_REAL_KIND(10,70)

This means that the compiler must provide a representation which stores
floating-point numbers to at least 10 significant decimal digits, with a
range of exponents of at least 10^(-70) to 10^(+70). To change the accuracy
of all floating-point operations in this module, only the value of this one
parameter needs to be changed. On many computer/compiler combinations,
`lsq_kind' will be identical to the Fortran 77 DOUBLE PRECISION.

N.B. All floating-point arguments passed to program units in this module
should be of kind `lsq_kind'. That is, they should be declared in the
calling program units as, e.g.

   REAL (lsq_kind) :: x(0:10), y, regn_coeff(0:10), resid(1:1000)
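Since `lsq_kind' is a public parameter of the module, a calling program can
check the precision that was actually selected. The following is a minimal
sketch; the program name is invented for illustration, while PRECISION and
RANGE are standard Fortran 90 intrinsics:

   PROGRAM kind_check
     USE lsq, ONLY: lsq_kind              ! the module's kind parameter
     IMPLICIT NONE
     REAL (lsq_kind) :: x
     PRINT *, 'Decimal digits of precision:', PRECISION(x)
     PRINT *, 'Decimal exponent range:     ', RANGE(x)
   END PROGRAM kind_check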
Contents of module LSQ
----------------------
The program units in LSQ are:

   SUBROUTINE startup(nvar, fit_const)
   SUBROUTINE includ(weight, xrow, yelem)
   SUBROUTINE regcf(beta, nreq, ifault)
   SUBROUTINE tolset
   SUBROUTINE sing(lindep, ifault)
   SUBROUTINE ss
   SUBROUTINE cov(nreq, var, covmat, dimcov, sterr, ifault)
   SUBROUTINE inv(nreq, rinv)
   SUBROUTINE partial_corr(in, cormat, dimc, ycorr, ifault)
   SUBROUTINE vmove(from, to, ifault)
   SUBROUTINE reordr(list, n, pos1, ifault)
   SUBROUTINE hdiag(xrow, nreq, hii, ifault)
   REAL (lsq_kind) FUNCTION varprd(x, nreq, var, ifault)
   SUBROUTINE bksub2(x, b, nreq)

Use of module LSQ
-----------------
If the user has one complete set of data, and simply wants to calculate the
least-squares fit, then proceed as follows (a minimal program combining all
six steps is sketched after this list):

1. CALL startup(nvar, fit_const)
   where nvar = the number of variables, and fit_const = .true. if a
   constant is to be fitted, fit_const = .false. otherwise. For instance,
   if you want to fit:

      Y = b0 + b1.X1 + b2.X2 + b11.X1^2 + b12.X1.X2 + b22.X2^2

   then you will need to pass X1, X2, X1^2, X1.X2 and X2^2 as if they were
   separate variables. Specify nvar = 5 in this case, and fit_const = .true.

2. For each case (observation), prepare a one-dimensional array, XROW in
   the example below, declared as type lsq_kind. In the above case, you may
   like to declare XROW as:

      REAL (lsq_kind) :: xrow(0:5)

   where the first element, xrow(0), should be set equal to 1.0 to give the
   constant in the model. Then

      CALL includ(weight, xrow, yelem)

   where yelem is the value of the Y-variable, and weight is the weight to
   be assigned to this case. If unweighted least squares is required,
   simply set weight = 1.0. Each time routine includ is called, a count is
   updated in the integer variable nobs. This is a public variable in
   module LSQ which is available for use in the user's calling program.

3. CALL tolset
   This automatically calculates tolerances for detecting singularities.

4. CALL sing(lindep, ifault)
   This call is optional, but recommended. It tests for singularities. If
   singularities are detected, a non-zero value is returned in the integer
   variable ifault. lindep is a logical array which must be declared in the
   user's calling program, e.g.

      LOGICAL :: lindep(0:5)

   If lindep(i) = .true., the i-th variable is linearly related to some of
   the earlier variables. For instance, if we are fitting the quadratic
   surface in the example above and have only 5 observations, then
   lindep(5) will be returned as .true. If singularities are detected, the
   QR-factorization produced by includ will be corrected.

5. CALL regcf(beta, nreq, ifault)
   This returns the least-squares regression coefficients in array beta.
   nreq = the number of coefficients required. For instance, in the above
   example, set nreq = 6 to calculate all the regression coefficients. If
   you would also like to look at the coefficients for the linear model
   (constant plus terms for X1 and X2), specify nreq = 3. If singularities
   are detected, one or more of the regression coefficients will be
   returned as zero, and a negative value will be returned for ifault.

6. The residual sum of squares with the full model fitted is contained in
   the public variable sserr.
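Putting the six steps together, here is a minimal sketch for the quadratic
surface used as the example above. The calls and public variables are as
documented in this note; the program name and the data values are invented
purely for illustration:

   PROGRAM fit_quadratic
     USE lsq                         ! lsq_kind, nobs, sserr & the routines
     IMPLICIT NONE
     INTEGER, PARAMETER :: nvar = 5
     REAL (lsq_kind)    :: xrow(0:nvar), beta(0:nvar), x1, x2, y, weight
     LOGICAL            :: lindep(0:nvar)
     INTEGER            :: i, ifault

     CALL startup(nvar, .true.)      ! step 1: 5 variables plus a constant
     weight = 1.0_lsq_kind           ! unweighted least squares

     DO i = 1, 20                    ! step 2: include each case in turn
       x1 = 0.1_lsq_kind * i         ! invented data for illustration
       x2 = SIN(0.5_lsq_kind * i)
       y  = 1.0_lsq_kind + 2.0_lsq_kind*x1 - x2 + 0.5_lsq_kind*x1*x2
       xrow(0) = 1.0_lsq_kind        ! the constant term
       xrow(1) = x1
       xrow(2) = x2
       xrow(3) = x1**2
       xrow(4) = x1*x2
       xrow(5) = x2**2
       CALL includ(weight, xrow, y)
     END DO

     CALL tolset                     ! step 3: set singularity tolerances
     CALL sing(lindep, ifault)       ! step 4: test for singularities
     IF (ifault /= 0) PRINT *, 'Singularities detected, ifault =', ifault

     CALL regcf(beta, nvar+1, ifault)   ! step 5: all 6 coefficients
     PRINT *, 'No. of observations =', nobs
     PRINT *, 'Regression coefficients:', beta
     PRINT *, 'Residual sum of squares =', sserr   ! step 6
   END PROGRAM fit_quadratic

Note that includ uses xrow as workspace, so it should be re-built for each
case, as in the loop above.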
If you want to add more data, just call includ as many times as necessary
to process the additional data. The module contains facilities for
calculating residual sums of squares for sub-models, and for re-ordering
variables. These are included to facilitate various kinds of subset
regression.

The standard errors and covariances returned by subroutine COV, and the
standard error of prediction returned by VARPRD, all assume that the
residuals are independent and homogeneous. In most cases these quantities
will be under-estimates if the residuals are correlated. For more details,
see the comments in the module or the demo program, or consult the
reference below.

Reference:

Miller, A.J. (1992). Algorithm AS 274: Least squares routines to supplement
   those of Gentleman. Appl. Statist., vol. 41(2), 458-478.

Some additional references on least-squares methods:

Farebrother, R.W. (1988). Linear least squares computations. Marcel Dekker.
   ISBN 0-8247-7661-5. Particularly chapter 6. Unfortunately it is
   ridiculously expensive.

Seber, G.A.F. (1977). Linear regression analysis. Wiley.
   ISBN 0-471-01967-4. Particularly chapter 11. George is professor of
   statistics at the University of Auckland. A very good book on the
   statistics of regression, and one of my basic references.

Longley, J.W. (1984). Least squares computations using orthogonalization
   methods. Volume 93 in Lecture Notes in Pure and Applied Mathematics.
   Marcel Dekker again. ISBN 0-8247-7232-6. Astronomical price even though
   it is photo-offset.

Miller, A.J. (1990). Subset selection in regression. Chapman & Hall
   (Monographs on Statistics & Applied Probability no. 40).
   ISBN 0-412-35380-6. See chapter 2. There is a second edition, published
   by CRC Press in 2002, ISBN 1-58488-171-2.

!--------------------------------------------------------------------------