I. Curve Fitting

A. Introduction

Curve fitting is the process of determining the mathematical relationship between the two data series in a 2D set. The phrase "curve fitting" comes from a traditional graphic solution: the data is plotted, a line or smooth curve is drawn through the data points, and an equation is computed by measuring on or interpreting from the graph. If the data contains measurements, there will be random errors, so a perfect fit is not possible. The line or curve will not pass through all the points, so a visual best-fit curve is drawn.

With enough data points, least squares can be used to numerically determine a best-fit equation.

 


B. Types of variables

Two-dimensional data is based on two variables: an independent variable and a dependent one. The dependent variable's value is based on the value of the independent one. The equation relating the variables indicates what happens to the dependent variable as the independent one changes.

With measurements, the independent variable values indicate where the measurements are made; the dependent variable values are the measurements themselves.

When plotting data, the independent variable is on the x-axis while the dependent is on the y-axis, Figure I-1.

 
 Figure I-1
Plotting Data

 


C. Theoretical vs Empirical

The independent and dependent variable relationship can be either theoretical or empirical.

1. Theoretical

A theoretical relationship is based strictly on a mathematical equation; there are no observed or measured data. There is an exact relationship between the two so there is no error.

A basic example in surveying is a proposed grade line in an alignment design. A +3.0% grade begins at station 10+00 with elevation 800.0 and ends at station 13+00. For every 100 ft horizontally, elevation changes 3.0 ft vertically. Expressed as an equation:

 Elev = 800.0+d(g) 
    g: 0.03
    d: distance from sta 10+00, in ft

 

We select a station and compute its elevation; elevation is the dependent variable.

Set up a table for elevations at full stations, Table I-1.

 Table I-1
Grade Elevations
Station Elevation
 10+00 800.0
 11+00 803.0
 12+00 806.0
 13+00 809.0


Although the table shows only four data pairs, because there is an exact relationship, we can add as many as we like: 801.5 ft at 10+50, 804.5 ft at 11+50, etc.
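Because the relationship is exact, the table can be generated at any interval directly from the equation. The following is a minimal sketch; the function names `grade_elevation` and `format_station` and the 50-ft interval are illustrative choices, not part of the original design data.

```python
def grade_elevation(station_ft, start_station_ft=1000.0, start_elev_ft=800.0, grade=0.03):
    """Elevation on the grade line: Elev = 800.0 + d(g), d measured from sta 10+00."""
    d = station_ft - start_station_ft
    return start_elev_ft + d * grade


def format_station(station_ft):
    """Format a distance in feet as stationing, e.g. 1050 -> '10+50'."""
    full, plus = divmod(int(round(station_ft)), 100)
    return f"{full}+{plus:02d}"


# Tabulate the grade line every 50 ft from 10+00 to 13+00.
for sta in range(1000, 1301, 50):
    print(format_station(sta), f"{grade_elevation(sta):.1f}")
```

Any station between 10+00 and 13+00 can be passed to `grade_elevation`; no measurement is involved, so there is no residual to consider.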

The table is one way to look at the data; plotting is another way to visualize it. Using the same scale for both axes, Figure I-2, results in a very flat plot that is difficult to interpret. That's because elevation changes only 9.0 ft while stationing changes 300 ft, a 1:33.33 ratio.

 
 Figure I-2
Grade; Same X and Y scales


Increasing the elevation scale exaggerates the plot vertically and makes the grade line more apparent. Figure I-3 uses a vertical exaggeration of 10:

Figure I-3
Grade; Vertical exaggeration 10

 

Because a theoretical plot is based on a mathematical equation, there are infinite data points. When the graph is drawn, only the line or curve is shown, not individual data points.

2. Empirical

Empirical data is based on measurement or observation. The dependent variable is measured at specific independent variable values. Being measured, the dependent values are subject to errors; the independent ones are considered error-free.

Let's use another surveying example. A profile level is run along a section of center line on existing terrain. The results are in Table I-2.

Table I-2
Profile Level
 Station Elevation 
22+00  1250.2 
23+00 1248.7 
24+00 1245.5 
25+00 1243.8 


Plotting the data, using a vertical exaggeration of 4, Figure I-4:

 
Figure I-4
Profile Elevations


It looks like a straight line will fit ... but will it? Since two points define a straight line, we can construct a whole bunch of grade lines, Figure I-5.

 
Figure I-5
Multiple Straight Lines


Any single line fits two points perfectly, but misses the other two, sometimes by quite a lot.

Draw a best-fit line, which may or may not go through individual data points but meets some specific criteria, Figure I-6.

 
Figure I-6
Best-fit Straight Line

 

Because this is a linear relationship, a best-fit line represents acceptable compromises but must still be straight. 


D. Fitting a Straight Line

1. Equation

The equation of a straight line, Figure I-7, is:

 y = mx+b    Equation I-1
    m is the line slope
b is the y intercept

 

 
Figure I-7
Straight Line Geometry


Slope, m, is rise/run. It is also the tangent of the angle from the x-axis to the line; m = tan(θ). 

2. Residuals

Empirical data doesn't exactly fit a straight line because:

  •   the objects measured may not have a linear relationship
  •   the measurements have random errors

A best-fit straight line is one which minimizes the sum of the squares of the residuals, Figure I-8.

 
Figure I-8
Profile Residuals

 

The residual is the difference between the theoretical and empirical values at a given independent variable value, Equation I-2.

 vi = Yt - Ye      Equation I-2
  Yt - theoretical value
  Ye - empirical value

3. Least Squares

The best-fit line minimizes the sum of the residuals squared, Equation I-3

 Σ(vi)² = minimum  Equation I-3
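To see what the least squares criterion does, the sum of squared residuals can be computed for any candidate line through the Table I-2 profile data. The sketch below compares a line drawn through only the first and last points with the best-fit line whose coefficients are derived in the Application Example later in this section. The helper name `sum_sq_residuals` is hypothetical.

```python
stations = [2200.0, 2300.0, 2400.0, 2500.0]      # X, ft (Table I-2)
elevations = [1250.2, 1248.7, 1245.5, 1243.8]    # Y, ft (Table I-2)


def sum_sq_residuals(m, b, xs, ys):
    """Sum of v_i^2, with v_i = Yt - Ye (Equation I-2), for the line y = m*x + b."""
    return sum(((m * x + b) - y) ** 2 for x, y in zip(xs, ys))


# Candidate line through the first and last points only.
m_ends = (elevations[-1] - elevations[0]) / (stations[-1] - stations[0])
b_ends = elevations[0] - m_ends * stations[0]

# Best-fit coefficients taken from the Application Example.
m_fit, b_fit = -0.0224, 1299.69

ssr_ends = sum_sq_residuals(m_ends, b_ends, stations, elevations)
ssr_fit = sum_sq_residuals(m_fit, b_fit, stations, elevations)
print(f"through end points: {ssr_ends:.3f} sq ft")
print(f"best-fit line:      {ssr_fit:.3f} sq ft")
```

The end-point line has zero residual at two stations, yet its total is larger than that of the best-fit line, which passes through none of the points exactly.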

 

There are two least squares solution methods for a straight line: Linear regression and Observation equations. Each has its particular advantage(s) and disadvantage(s).

a. Linear Regression

Linear regression can be done manually or using a built-in function on most scientific calculators. It is a standard function in most spreadsheet software and included in their graphic options (aka, trendline).

The line slope, m, is determined using Equation I-4:

Equation I-4


and intercept, b, from Equation I-5:

Equation I-5


The correlation coefficient, r, indicates how well the data fits a straight line.

Equation I-6
Equation I-7a
Equation I-7b


The coefficient varies between -1 (negatively sloped line) and +1 (positively sloped line); the closer to -1 or +1, the better the data fits a straight line.

 b. Observation Equations

An observation equation is written for each coordinate pair. Then one of two ways can be used to reach the solution:

1. Direct minimization: This is covered in Chapter C.
2. Matrix method: This is covered in Chapter D.

Direct minimization is the most arduous method, requiring a considerable number of computations to perform manually. Anything over four coordinate pairs increases the computations substantially. The matrix method is more efficient and easier to perform, even manually without benefit of software.

4. Application Example

Let's determine the best-fit line for the measured profile data of Table I-2.

a. Linear regression

Organizing the data in an extended table simplifies computations, particularly since some of the terms are so large.

X is Station, Y is Elevation.


        X, ft    Y, ft          X²     vx        vy       (vx)(vy)         vx²          vy²
         2200   1250.2   4,840,000   2195   1225.20   2,689,319.49   4,818,025   1,501,121.2
         2300   1248.7   5,290,000   2295   1223.70   2,808,397.24   5,267,025   1,497,447.8
         2400   1245.5   5,760,000   2395   1220.50   2,923,103.49   5,736,025   1,489,626.4
         2500   1243.8   6,250,000   2495   1218.80   3,040,912.24   6,225,025   1,485,479.5
sums     9400   4988.2  22,140,000   9380   4888.20  11,461,732.46  22,046,100   5,973,674.9


Substituting terms in the linear regression equations:

 


The best-fit equation is: Elev = -0.0224(Station) + 1299.69
The correlation coefficient is -0.99 which is a pretty good fit.

Because Stations are used for x values, the slope, m, is the grade expressed as a ratio: grade = -0.0224 ft/ft = -2.24%
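The regression results can be cross-checked with a short script. Equations I-4 through I-7 are not reproduced in this text, so the sketch below assumes the common mean-centered textbook formulas, which may be arranged differently from the original equations. The function name `linear_regression` is illustrative.

```python
from math import sqrt


def linear_regression(xs, ys):
    """Return slope m, intercept b, and correlation coefficient r for y = m*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    m = sxy / sxx
    b = y_mean - m * x_mean
    r = sxy / sqrt(sxx * syy)
    return m, b, r


stations = [2200.0, 2300.0, 2400.0, 2500.0]      # X, ft (Table I-2)
elevations = [1250.2, 1248.7, 1245.5, 1243.8]    # Y, ft (Table I-2)

m, b, r = linear_regression(stations, elevations)
print(f"Elev = {m:.4f}(Station) + {b:.2f}, r = {r:.2f}")
```

Working with deviations from the means keeps the intermediate sums small, which helps when the x values are large numbers such as stations.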

The simplicity of the table belies all the computations needed to construct it. The primary disadvantage of Linear Regression is the number of calculations if done manually. It's quick, however, if using built-in calculator or spreadsheet functions.

b. Matrix Method

An observation equation is written for each data pair, using Equation I-1, with a residual included on each dependent variable:

In matrix notation, the observation equations are [K] + [V] = [C] x [U].

The matrices are:

The matrix algorithm [U] = [Q] x [CᵀK], where [Q] = [CᵀC]⁻¹, is solved to determine m and b.

Instead of the complete solution process step-by-step, intermediate products are shown:

Since the [CᵀC] matrix is only 2x2, it can be quickly inverted using the determinant method.

The matrix algorithm results are m = -0.0224 and b = 1299.69 just like Linear Regression's results.
Surprise.
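The matrix steps above can be sketched in a few lines of plain Python. This assumes [C] holds one row [x, 1] per observation, [K] holds the measured elevations, and [Q] is the inverse of [CᵀC]; the variable names are illustrative, and the uncertainty computation from Chapter D is not attempted.

```python
stations = [2200.0, 2300.0, 2400.0, 2500.0]      # X, ft (Table I-2)
elevations = [1250.2, 1248.7, 1245.5, 1243.8]    # Y, ft (Table I-2)

# Observation equations y_i + v_i = m*x_i + b, i.e. [K] + [V] = [C] x [U].
C = [[x, 1.0] for x in stations]   # coefficient matrix, one row per data pair
K = elevations                     # constants: the measured elevations

# Products needed by the algorithm: [C^T C] (2x2) and [C^T K] (2x1).
CtC = [[sum(row[i] * row[j] for row in C) for j in range(2)] for i in range(2)]
CtK = [sum(row[i] * k for row, k in zip(C, K)) for i in range(2)]

# Invert the 2x2 [C^T C] with the determinant method to get [Q].
det = CtC[0][0] * CtC[1][1] - CtC[0][1] * CtC[1][0]
Q = [[CtC[1][1] / det, -CtC[0][1] / det],
     [-CtC[1][0] / det, CtC[0][0] / det]]

# [U] = [Q] x [C^T K]
m = Q[0][0] * CtK[0] + Q[0][1] * CtK[1]
b = Q[1][0] * CtK[0] + Q[1][1] * CtK[1]
print(f"m = {m:.4f}, b = {b:.2f}")
```

The same structure extends to more unknowns by adding columns to [C]; only the inversion step needs a more general method than the 2x2 determinant formula.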

While there is no direct equivalent to the correlation coefficient, the uncertainties for m and b can be determined.
Using the equations from Chapter D Section 3:

 


E. Interpolation, Extrapolation

1. Purpose

One of the reasons we fit a line or equation to data is to allow determination of one variable based on fixing the other variable's value. For example, Figure I-9 is a graph of a steel tape's length at different pull amounts.

 
 Figure I-9
Tape Pull Calibration

 

As pull was changed, the tape length was measured against a calibration line; length is dependent on pull.

The equation of a best-fit line determined by Linear Regression is L = 0.0059P+99.869.

2. Interpolation

How much pull should be applied to achieve 100.000 ft? We can determine the pull by either scaling it from the graph (22.1, shown in red) or by solving the equation:

100.000 = 0.0059P+99.869; rearrange to solve for P: P = (100.000-99.869)/0.0059 = 22.2 lbs.

This is called interpolation: using the data to predict one variable value from the other.
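The calibration line can be evaluated in either direction. A minimal sketch using the coefficients quoted above; the function names `length_at_pull` and `pull_for_length` are illustrative.

```python
SLOPE = 0.0059       # ft of tape length per lb of pull, from the best-fit line
INTERCEPT = 99.869   # ft, tape length at zero pull according to the fitted line


def length_at_pull(pull_lb):
    """Tape length predicted by L = 0.0059P + 99.869."""
    return SLOPE * pull_lb + INTERCEPT


def pull_for_length(length_ft):
    """Pull needed for a given tape length, from rearranging the same equation."""
    return (length_ft - INTERCEPT) / SLOPE


print(f"pull for 100.000 ft: {pull_for_length(100.000):.1f} lb")
```

The equation gives about 22.2 lb, slightly different from the 22.1 scaled from the graph; the small discrepancy reflects how precisely a value can be read from a plot.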

Interpolation is done within the data range. When collecting measurements, we try to bracket the range that we may later want to determine.

For example, we calibrate a tape to determine the conditions necessary for its length to be 100.000 ft. We started with a pull that yielded a tape length less than 100.000 ft, then progressively increased pull until the length exceeded 100.000 ft. Then we progressively decreased pull until the length was less than 100.000 ft. We have enough data on both sides of 100.000 ft to reliably determine the pull needed to achieve it.

3. Extrapolation

Using the same data in Figure I-9, what will be the tape length when 25 pounds of pull are applied?

We can't determine it directly from the graph because it doesn't go out far enough. We could extend the graph, but that's a bit cumbersome. Or we can use the equation:

L = 0.0059P+99.869 = 0.0059(25)+99.869 = 100.016 ft

This is extrapolation: predicting one variable value from the other outside the data range.

Extrapolations should be limited to values very close to the data range. The line fit is based on the behavior within the data range. We don't know what happens outside that range. Just because we can determine an equation doesn't mean it's a good predictor outside the range.

For example, how much pull would be needed to stretch the tape to 105.000 ft? From the equation, it would be:

P = (105.000-99.869)/0.0059 = 870 lbs

Not only is 870 lbs an unreasonable amount to attempt, the tape itself would fail before that. At some point it reaches its elastic limit and won't return to its original length; keep going and it will eventually fail by breaking. We can't determine either of those points from this data set because it is below the failure threshold.

We generally leave extrapolation to economists, politicians, and weather forecasters. 


F. Curved Lines

1. General

Not all data relationships are linear. The graph in Figure I-10 shows the results of an aerial camera calibration for radial lens distortion. The distortion is caused by lens material and surface curvature. Its effect is measured on the image plane at radial intervals from the principal point.

 
 Figure I-10
Lens Distortion Data

 

It's obvious the data doesn't fall on a straight line, but it doesn't follow a simple curve either. So how do we fit a curve to nonlinear data? Linear regression can't be used to fit a curve because, well, the curve isn't straight.

Actually, linear regression can be used in some cases if the logarithm of the dependent variable or logarithms of both variables are used. But those are exceptions to the rule; we want a universal method.
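As an illustration of the logarithm exception, an exponential relationship y = A·exp(kx) becomes the straight line ln(y) = kx + ln(A), so ordinary linear regression applies to ln(y) versus x. The sketch below uses made-up, error-free values generated from an assumed curve, not survey data, and the function name `fit_exponential` is hypothetical.

```python
from math import exp, log


def fit_exponential(xs, ys):
    """Fit y = A*exp(k*x) by linear regression of ln(y) on x; returns (A, k)."""
    ln_ys = [log(y) for y in ys]
    n = len(xs)
    x_mean = sum(xs) / n
    ln_mean = sum(ln_ys) / n
    sxy = sum((x - x_mean) * (ly - ln_mean) for x, ly in zip(xs, ln_ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    k = sxy / sxx                    # slope of the transformed line
    A = exp(ln_mean - k * x_mean)    # intercept of the transformed line is ln(A)
    return A, k


# Hypothetical values generated from y = 2*exp(0.5*x), for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * exp(0.5 * x) for x in xs]

A, k = fit_exponential(xs, ys)
print(f"A = {A:.3f}, k = {k:.3f}")
```

Note that with real measurements this approach minimizes the squared residuals of ln(y) rather than of y itself, which is one reason it is not a universal method.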

Let's examine a few surveying applications.


2. Vertical Curve

Table I-3 lists some elevations at particular stations through which it is desired to run a vertical alignment curve.

Table I-3
Station Elevation
11+00 848.8
12+00 850.7
13+00 851.1
14+00 850.3

 

A vertical curve is part of a parabola, which is a second-degree polynomial. That means its equation includes a squared independent variable term, Equation I-8.

  y = ax² + bx + c  Equation I-8

 Where two points are needed to define a straight line, three points are needed for a second-degree polynomial. Why three points? Because there are three unknown coefficients: a, b, and c. More than three allows a least squares solution; each point beyond three is a redundancy.

We don't have to stop there. We can get much more complex: an nth-degree polynomial needs n+1 points to fix it and more than n+1 points for a least squares solution. However, we'll stick to second-degree polynomials for now since they cover the bulk of complex survey curves.

The data in Table I-3 is plotted in Figure I-11 along with a best-fit parabolic curve.

 Figure I-11
Best-Fit Vertical Curve

 

To determine the curve equation, Equation I-8 is used to develop the observation equations, with the residual term on the Y (Elevation) value.

 The matrices are:

Now we just solve the matrix algorithm [U] = [Q] x [CᵀK].

Intermediate products:

Coefficients:

The equation is:

Notice that many intermediate matrix elements are either extremely large or small. That's not an issue using software to manipulate the matrices, but manual calculations can lead to errors. One way to lessen potential errors is to divide the stations by 100, reducing their size. The observation equations coefficient matrix becomes:

The [Q] matrix:

And the coefficients:

Compare these coefficients with the solution using the full station expression:

a increases by a factor of 10,000 = 100², which corresponds to the x² term
b increases by a factor of 100, which corresponds to the x term
c is the same

That means the equation can be written as


or

where S = Sta/100
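The vertical curve fit, and the effect of dividing the stations by 100, can be reproduced with a short script. This is a sketch under stated assumptions: each row of [C] is taken as [x², x, 1], the normal equations are solved by Gauss-Jordan elimination rather than by forming [Q] explicitly, and Python's `fractions.Fraction` is used so that the very large and very small intermediate elements cause no rounding. The function name `fit_parabola` is illustrative.

```python
from fractions import Fraction


def fit_parabola(xs, ys):
    """Least squares fit of y = a*x^2 + b*x + c (Equation I-8); returns (a, b, c)."""
    C = [[x * x, x, Fraction(1)] for x in xs]   # one row per observation
    # Augmented normal equations: [C^T C | C^T K].
    aug = [[sum(row[i] * row[j] for row in C) for j in range(3)]
           + [sum(row[i] * y for row, y in zip(C, ys))]
           for i in range(3)]
    # Gauss-Jordan elimination; exact because every entry is a Fraction.
    for col in range(3):
        pivot = next(r for r in range(col, 3) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(3):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return tuple(aug[i][3] for i in range(3))


stations = [Fraction(s) for s in (1100, 1200, 1300, 1400)]                # Table I-3, ft
elevations = [Fraction(e) for e in ("848.8", "850.7", "851.1", "850.3")]  # Table I-3, ft

a_full, b_full, c_full = fit_parabola(stations, elevations)
a_s, b_s, c_s = fit_parabola([s / 100 for s in stations], elevations)     # S = Sta/100

print("full stations:", float(a_full), float(b_full), float(c_full))
print("S = Sta/100:  ", float(a_s), float(b_s), float(c_s))
```

Comparing the two printed rows shows the scaling described above: the x² coefficient differs by a factor of 100², the x coefficient by a factor of 100, and the constant is unchanged.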


3. Horizontal Curve

Probably the most prevalent curves that surveyors deal with are horizontal curves, which use circular arc sections. A circular arc is also described by a second-degree equation but, unlike a parabola, it has a constant radius. Another difference is that there is really no dependent-independent variable distinction. The data points for a horizontal curve are coordinates (X/Y or N/E), with neither direction superior to the other.

Being a second-degree polynomial, three points are needed to define it.

Equation I-9 is used to solve the arc if one of the points is the radius point, O, Figure I-12.

Equation I-9

where:

Np, Ep - coordinates of a point on the arc
No, Eo - coordinates of the arc radius point

Figure I-12
Including radius point
 

 

Equations I-10 and I-11 are used to solve the arc if all three points are on the arc, Figure I-13.

  Equation I-10
  Equation I-11
Figure I-13
Only arc points


Four or more points allow a least squares solution.

Example

Determine the radius for the arc in Figure I-14 using the coordinates of the four arc points in Table I-4.

Figure I-14
Least Squares Arc Fitting

 

Table I-4
Point
E N
1 117.68 806.74
2 690.37 795.29
3 316.93 940.81
4 618.34 873.55


Use the coordinates and Equation I-10 to set up the observation equations. Include a residual on the constant term.

Create the initial matrices

Intermediate matrix

Solution matrix

Substitute the solution coefficients into the arc equation.

Determine the curve radius

Residuals

The observation equation forms of Equations I-9 and I-10 are:

 Equation I-12
 Equation I-13

In Equation I-12 the residual is on the radius which is a function of the coordinates.

In Equation I-13 the residual is on the arc point coordinates.

This makes the residuals perpendicular to the arc (radial), Figure I-15, since there is no dependent-independent variable condition.

 Figure I-15
Curve Fit Residuals
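The four-point arc example (Table I-4) can be sketched in code. The arc equations are not reproduced in this text, so the sketch assumes the general circle form E² + N² + aE + bN + c = 0, which is linear in a, b, and c, with the residual placed on the constant term. This algebraic fit is a stand-in and may differ from the original Equations I-10 through I-13. Radial residuals are computed afterward as the distance from each point to the fitted center minus the radius. The function names are illustrative.

```python
from fractions import Fraction
from math import hypot, sqrt


def solve_linear(A, rhs):
    """Solve A x = rhs by Gauss-Jordan elimination (exact with Fraction entries)."""
    size = len(rhs)
    aug = [list(A[i]) + [rhs[i]] for i in range(size)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][size] for i in range(size)]


def fit_circle(points):
    """Fit E^2 + N^2 + a*E + b*N + c = 0; returns (E_center, N_center, radius)."""
    pts = [(Fraction(str(east)), Fraction(str(north))) for east, north in points]
    C = [[east, north, Fraction(1)] for east, north in pts]
    K = [-(east * east + north * north) for east, north in pts]
    CtC = [[sum(row[i] * row[j] for row in C) for j in range(3)] for i in range(3)]
    CtK = [sum(row[i] * k for row, k in zip(C, K)) for i in range(3)]
    a, b, c = solve_linear(CtC, CtK)
    e0, n0 = -a / 2, -b / 2                        # center of the fitted circle
    radius = sqrt(float(e0 * e0 + n0 * n0 - c))
    return float(e0), float(n0), radius


# Table I-4 arc points as (E, N).
arc_points = [(117.68, 806.74), (690.37, 795.29), (316.93, 940.81), (618.34, 873.55)]

e_center, n_center, radius = fit_circle(arc_points)
radial_residuals = [hypot(e - e_center, n - n_center) - radius for e, n in arc_points]

print(f"center E = {e_center:.2f}, N = {n_center:.2f}, R = {radius:.2f}")
print("radial residuals:", [f"{v:+.2f}" for v in radial_residuals])
```

Because the algebraic form minimizes residuals on the constant term rather than on the radius or coordinates, its result can differ slightly from a fit that minimizes the radial residuals directly.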