CLsqFit:Rejecting Outliers from the Fit

Rejecting Outliers from the Fit

When a data sample is corrupted by points that don't follow the same functional form as the majority of points, then the fit will be biased and not reflect the properties of the majority. Somehow, the "bad" data need to be identified and removed from the fit to the majority, or the "good" data.

To objectively identify and remove bad data from a sample of points is not straightforward. In a statistical sense, bad data cannot be detected in the overall population unless the majority of points have enough "power" to define a trend that makes the bad data appear significantly different. Points too far from the trend to be drawn from the same population as the good points are called "outliers". The power to detect outliers increases as number of sample points increases and as the scatter shrinks about the underlying trend. Bad data detection does not work so well when there are only a small number of points to define the trend, even if the "good" data have a small scatter about the true trend. The reason for this behavior is that both the good and bad points are used to determine the fit, so the bad points need to be significantly bad to be distinguished from the good ones. The more points in the sample, the less effect the bad data have on the biased fit that includes them, hence it becomes easier to detect them.

To detect outliers, a measure of the scatter about the fit is normally used. The standard deviation, or "sigma", of the fit assesses the typical scatter of all points away from the fit. By setting threshold, such as "3 sigmas", the rejection algorithm can identify which points lie "too far" from the fit to be considered acceptable. Normally, rejection works iteratively, by identifying outlying points based on their large residual distance from the fit, removing them from the fit, and then recomputing the fit without them. This is repeated until some condition is met, such as no more points being rejected or after some specified number of iterations has been completed.

The CLsqFit class includes an iterative data rejection algorithm which is enabled using the CLsqFit:DoRejection method. The rejection process occurs during the fitting process after calling CLsqFit:Fit.

Data rejection uses the following 3 parameters:


sSigmaHigh	This parameter holds a list of 1 or more k values for k-sigma clipping above the fit. Sample points having a residual more than this distance above the fit will be rejected. This parameters is specified as a string of comma separated values. The default is "2.8,2.8,2.8".
sSigmaLow	This parameters holds a list of 1 or more k values for k-sigma clipping below the fit. Sample points having a residual more than this distance below the fit will be rejected. The parameter is specified as a string of comma separated values. The default is "2.8,2.8,2.8".
nCycles	This parameter describes the number of data rejection cycles (iterations) that will be performed. The default is 3.

Mira uses a so-called "k-sigma clipping" strategy for data rejection. Here, "sigma" corresponds to the standard deviation of the fit (see CLsqFit:GetSigmaFit) and k is some user-specified number. Points having residuals further than k sigmas from the fit are discarded and the fit recomputed. Values of k are specified by the data rejection configuration whereas sigma itself is recalculated each cycle based on the points remaining in the fit. This repeats for nCycle iterations. At the end, the number of points remaining ("used"), as well as the number of points rejected and the number manually deleted, may be queried by calling CLsqFit:GetNumPtsUsed and related methods.

Using strings for sSigmaHigh and sSigmaLow allows these parameters to contain 1 or more values of k so that k can be refined each cycle. If the list of k values is exhausted, rejection continues using the last value. The first cycle removes points higher than sSigmaHigh[1] and lower than sSigmaLow[1]. The second cycle rejects points further than sSigmaHigh[2] and sSigmaLow[2], and so on.

Adopting separate value sequences for sSigmaHigh and sSigmaLow allows the rejection strategy to be tailored to the characteristics of the data. For example, if the data are from an image that has cosmic ray hits, there should be many more "high" rejections than "low" rejections. Using a naive rejection strategy like "+/- 3 sigmas" would result in a biased fit that retains some positive outliers at the expense of trimming valid data below the fit. This happens because the outliers pull the fit away from the true center of the data distribution, artificially pushing low values to larger residuals. A better strategy would be to start with a high clipping threshold of 2.5 sigmas and a low clipping threshold or around 10 sigmas. This would allow aggressive removal of positive outliers while tolerating greater residuals below the fit. After the major pruning done in the first cycle, reduce the lower k threshold so the values are closer, such as 3, 4. On the 3rd iteration, use 3, 3.2, and so on. Remember that in this type of data, the fit is being pulled preferentially upward by the outliers, so the lower threshold needs to be a little larger so that good data below the fit are not accidentally rejected. Conversely, for the case in which positive and negative outliers are equally probable, the values of sSigmaHigh and sSigmaLow should usually be similar.

Since the CLsqFit class works with multi-channel data, such as RGB values, a given data rejection strategy can be used for all channels or can be tailored for each channel. For example, if the R channel values are more likely to contain outliers than the other channels, you may wish to enable data rejection for the R channel but not for the other channels.

Examples

The example below fits a line to (x,y) pair data which includes 1 point significantly below the trend. Data rejection is then enabled and the fit re-done, giving a significantly lower standard deviation after the point is rejected.

L = new_lsqfit()	-- create a CLsqFit object
L:SetNumCoefs( 2 )	-- set 2 coefficients to fit a line
L:AddPt( -3, -1 )	-- add data points
L:AddPt( 4, 12 )
L:AddPt( -1, 3 )
L:AddPt( 5, -10 )	-- significantly below the trend
L:AddPt( 6, 16.5 )
L:AddPt( 12, 29 )
L:AddPt( 3, 10.7 )
L:AddPt( 5, 15.3 )
L:AddPt( 2, 9.3 )
L:AddPt( 8, 20.8 )
L:AddPt( 4, 12.7 )
L:AddPt( 5, 15.3 )
L:AddPt( 3, 11.3 )
L:AddPt( 11, 27.1 )
L:Fit()	-- Fit the polynomial
sd = L:GetSigmaFit()	-- standard deviation of the fit
Printf( "sd = %lg\n", sd )	-- result: sd = 6.93996
L:DoRejection( true )	-- use data rejection, assume standard params
L:Fit()	-- re-do the fit using data rejection
sd = L:GetSigmaFit()	-- standard deviation of the fit
Printf( "sd = %lg\n", sd )	-- result: sd = 0.40018

Examples

Related Topics