abraker wrote:
...
Your idea of finding a spline function to rescale the current values is in the end very similar to using percentiles, but with less precision (the percentile idea is the same as the spline idea, but with one piece per data point, instead of having less amount of pieces than data points).
The data looks smooth enough to use generalized linear functions instead of linear splines, so let's see how to do what you are proposing using that instead, for a toy example with random data:
The raw data (blue line) and the goal scaled data (red line) look like this:
So we want to find a function from the raw data to the scaled data that looks like this (because most of the change happens in the smaller values, it is better to use functions that grow slowly, such as logarithms, as the basis, instead of just using polynomials):
To get an isotonic fit, we can express the terms in a scaled Bernstein basis and restrict the coefficients so they are non-decreasing. Then this is a quadratic programming problem, which can be solved by several algorithms.
For example, using powers of Log[1 + x] up to 21, with a Bernstein basis scale that guarantees isotonicity up to 22025.5 (e^10-1) (considerably larger than what we should expect for the raw values in the future; we could increase the range of guaranteed isotonity, but the larger it is, the worse the resulting function fits the data), from the data in the example, we obtain as a function:
-4.2*10^-8 Log[1 + x]^2 + 3.98952*10^-8 Log[1 + x]^3 -
2.39335*10^-8 Log[1 + x]^4 + 1.01701*10^-8 Log[1 + x]^5 +
54.6007 Log[1 + x]^6 - 70.1823 Log[1 + x]^7 + 42.9753 Log[1 + x]^8 -
16.549 Log[1 + x]^9 + 4.46706 Log[1 + x]^10 -
0.893174 Log[1 + x]^11 + 0.136421 Log[1 + x]^12 -
0.0161863 Log[1 + x]^13 + 0.00150262 Log[1 + x]^14 -
0.00010905 Log[1 + x]^15 + 6.13243*10^-6 Log[1 + x]^16 -
2.6228*10^-7 Log[1 + x]^17 + 8.25477*10^-9 Log[1 + x]^18 -
1.80421*10^-10 Log[1 + x]^19 + 2.44792*10^-12 Log[1 + x]^20 -
1.55382*10^-14 Log[1 + x]^21
(The function resulted being isotonic for values up to 167,322)
Which when applied to the raw values (blue line) gives the green line, which is close enough to the goal red line (the more terms we add, the closer they are):
I can give you the scaling function if you give me the data you want to scale this way, or asks me for more details if you want to do this yourself.