Given a sequence of \( n \) points generated by a function \( f(x) \) and a number of elements \( k \), find the subset of \( k \) elements plus the first and last points of the sequence for which the approximation error is minimized

Let us define the error function as

\[e = \sum^n_{i=1} (y_i - g(x_i))^2\]

i.e. the square of the difference between \( y_i \) and the ordinate \( g(x_i) \) for point \( x_i \) on a segment connecting two other points (i.e. approximating the \( i^{th} \) point); the algorithm should choose \( k \) points and minimize \( e \).

The problem can be solved via dynamic programming by recursively considering each point \( i \) as the first point of the set (excluding the first one of the entire \( n \) sequence and the last one of the sequence) and consider the error given by the approximation on the right and left of it.

Suppose the input data is

\[X = \{ 4,3,2,1 \} \\ Y = \{ 4,2,3,1 \} \\ K = 1\]

which corresponds to the following function

we’re allowed to choose only one point except the first and last one to minimize the error. If we were to choose point \( (x_3, y_3) \) i.e. the third one with coordinates \( (3;2) \), there would be a segment from the point \( (1;1) \) to \( (3;2) \) and point \( 2;3 \) would yield an approximation error \( e \) equal to

\[e_{\overline{13}} = \left( y_2 - \left( \frac{y_3 - y_1}{x_3 - x_1} (x_2 - x_1) + y_1 \right) \right)^2\]

Similar calculations would have to be performed if point \( (x_2;y_2) \) had been chosen instead. It follows that

\[e_{\overline{14}} = min \left\{ e_{\overline{13}} + e_{\overline{34}}, e_{\overline{12}} + e_{\overline{24}} \right\}\]

where

\[e_{\overline{12}} = e_{\overline{34}} = 0\]

#include <iostream>
#include <algorithm>
#include <limits>
#include <vector>
#include <type_traits>
using namespace std;

template<typename V1Type, typename V2Type>
void sort_XY_by_X(V1Type& X, V2Type& Y) { // Sort X and Y by X

  V1Type XOld = X;
  V2Type YOld = Y;
  using iter_type = typename remove_reference<V1Type>::type::iterator;
  vector<pair<iter_type, int>> XI;
  for (int i = 0; i < X.size(); ++i)
    XI.emplace_back(X.begin() + i, i);
  sort(XI.begin(), XI.end(), [](const pair<iter_type, int>& a, 
                                const pair<iter_type, int>& b) {
    return *a.first < *b.first;
  });
  for (int i = 0; i < XI.size(); ++i) {
    X[i] = XOld[XI[i].second];
    Y[i] = YOld[XI[i].second];
  }
}

int main() {

  vector<float> X = { 4,3,2,1 };
  vector<float> Y = { 4,2,3,1 };
  const int K = 1;
  
  sort_XY_by_X(X, Y);

  const float inf = numeric_limits<float>::max();
  vector<vector<vector<float>>> E(K + 1, 
         vector<vector<float>>(X.size() + 1, vector<float>(X.size() + 1, inf)));

  auto calculateError = [&X, &Y](int start, int end) {
    if (start + 1 == end)
      return 0.0f;
    float error = 0.0f;
    for (int i = start + 1; i < end; ++i) {
      float m = (Y[end] - Y[start]) / (X[end] - X[start]);
      float y = m * (X[i] - X[start]) + Y[start];
      error += pow(Y[i] - y, 2);
    }
    return error;
  };

  for (int i = 0; i < X.size() - 1; ++i) { // Base cases

    for (int j = i + 1; j < X.size(); ++j) {
      E[0][i][j] = calculateError(i, j);
    }
  }

  for (int k = 1; k <= K; ++k) {
    float min = inf;
    // Recursively set i as the first point of the set (and

    // therefore considers [start;i] as a single segment)

    for (int i = 1; i < X.size() - k; ++i) {
      float opt = E[0][0][i] + E[k - 1][i][X.size() - 1];
      if (opt < min)
        min = opt;
    }
    E[k][0][X.size() - 1] = min;
  }

  cout << E[K][0][X.size() - 1]; // 2.25


  return 0;
}

The proposed solution runs in \( O(n) \) to calculate the error threshold (and \( O(n \log{n}) \)to sort the data) while the dynamic programming loop runs in \( O(n^2) \) therefore yielding a total of \( O(n^3) \).

References

Handouts fall10 @ csail.mit.edu