Instead of assigning each judgment an arbitrary value (for example, a base value of 320 for a 300g, 300 for a 300, etc.), look at the timing windows of each judgment, so a score can be given based on an estimate of accuracy derived from the distribution of judgments in the play.
Because of the central limit theorem, the distribution of hit errors should be close to a Normal Distribution (experimentally, the real distributions tend to have heavier tails than a Normal Distribution: players get more misses than predicted; this is not really an issue for the purposes here, though). The mean of the distribution should be ~0 ms, since players adjust their offset to correct for the average error, so the standard deviation (Unstable Rate, in game terms) alone can be used to describe how accurate a player is: the lower the standard deviation, the more accurate the player.
With that, it's possible to perform a maximum-likelihood estimation of the standard deviation based on the distribution of judgments obtained.
Note: it's possible to perform the maximum-likelihood estimation without assuming the mean is 0 ms, and use the RMS of the error instead of just the standard deviation, but this makes the calculation considerably harder and more expensive (and I don't think it would really be needed here).
A MISS would be treated as if the player hit the note but the timing was so far off that it fell outside the time window of a 50 (the system can't tell the difference between being extremely off in the timing and not hitting the note at all; under this assumption, the penalty for a MISS is considerable).
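As a concrete sketch of the estimation (in Python rather than Mathematica; the window half-widths used in the example and the simple golden-section search are illustrative assumptions, not the game's actual values or an optimized implementation):

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def judgment_probs(sigma, windows):
    """Per-note probabilities of (300g, 300, 200, 100, 50, MISS), assuming
    hit error ~ Normal(0, sigma); `windows` holds the half-widths (ms) of
    the 300g..50 timing windows, and MISS is any error beyond the 50 window."""
    probs, prev = [], 0.0
    for w in windows:
        probs.append(2.0 * (phi(w / sigma) - phi(prev / sigma)))
        prev = w
    probs.append(2.0 * (1.0 - phi(prev / sigma)))  # MISS
    return probs

def log_likelihood(sigma, counts, windows):
    """Log-likelihood of the observed judgment counts given sigma."""
    return sum(c * math.log(max(p, 1e-300))  # clamp to avoid log(0)
               for c, p in zip(counts, judgment_probs(sigma, windows)))

def estimate_sigma(counts, windows, lo=0.1, hi=300.0, iters=200):
    """Maximum-likelihood sigma via golden-section search (the likelihood
    is unimodal in sigma for this model)."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - g * (b - a), a + g * (b - a)
        if log_likelihood(c, counts, windows) > log_likelihood(d, counts, windows):
            b = d
        else:
            a = c
    return (a + b) / 2.0
```

For example, `estimate_sigma((495, 5, 0, 0, 0, 0), (16.5, 40.0, 73.0, 103.0, 127.0))` gives a sigma of a few milliseconds, with the 5 hits in the 300 window pulling the estimate up. With extreme distributions (e.g. all 300g), double-precision probabilities underflow, which is the precision issue noted at the end of the post.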
With that, we have a measure of accuracy (the Standard Deviation/Unstable Rate that makes the observed distribution of judgments the most likely); the result can now be scaled arbitrarily without affecting the balance of the results.
A possible scaling would be one based on the "Base Score" of the current mania scoring system:
1000000 * (320*Probability_of_300g + 300*Probability_of_300 + ... ) / (320 * Amount_of_Judgments)
Here the probabilities in the scaling always assume OD10 (any other OD could be used). Fixing an OD, instead of using the play's own, has the advantage of making different scores on the same map comparable even when the ODs differ (it would allow comparing EZ/HR and No-Mod scores directly, except in cases of very high accuracy, for reasons explained below). If the OD of a map is 10 and a play has a close-to-perfect Normal Distribution of hit errors, then this scale gives about the same values as the "Base Score" of the current scoring system (multiplied by two, since Base Score ranges from 0 to 500000).
Another possible scaling is one based on the "EX score" of IIDX/LR2; its advantage is that the difference between high-accuracy scores is larger (for example, with the osu!mania scaling, 990000 is pretty close to 998000, even though the latter score is much better). (Note: since the formula here is just a scaling, it doesn't have the same weakness as the LR2 formula, where misses barely matter at all and there is no difference between a miss and any other bad-accuracy judgment.)
The formula for the LR2-Based scale would be:
1000000 * (2*Probability_of_300g + Probability_of_300) / (2 * Amount_of_Judgments)
The difference between the "Base Scaling" and the "LR2-Based Scale" doesn't really affect things much; the only situation where the choice of scale matters would be team multiplayer matches, where the scores of the players are added together.
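Both scalings are straightforward once the per-note judgment probabilities at the fixed OD are known. A sketch (the OD10 window half-widths below are placeholders for the real table; the per-note formulation folds the Amount_of_Judgments factor into the probabilities, since the expected count of a judgment is its probability times the number of notes):

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def judgment_probs(sigma, windows):
    """Per-note probabilities of (300g, 300, 200, 100, 50, MISS) given
    hit error ~ Normal(0, sigma); `windows` are half-widths in ms."""
    probs, prev = [], 0.0
    for w in windows:
        probs.append(2.0 * (phi(w / sigma) - phi(prev / sigma)))
        prev = w
    probs.append(2.0 * (1.0 - phi(prev / sigma)))  # MISS
    return probs

# Placeholder OD10 half-widths (ms) -- substitute the game's actual table.
WINDOWS_OD10 = (16.5, 34.0, 67.0, 97.0, 121.0)

def base_scaled_score(sigma):
    """'Base Score'-style scaling: expected judgment value per note,
    normalized so a perfect play gives 1,000,000."""
    values = (320, 300, 200, 100, 50, 0)  # 0 for MISS
    probs = judgment_probs(sigma, WINDOWS_OD10)
    return 1_000_000 * sum(v * p for v, p in zip(values, probs)) / 320

def lr2_scaled_score(sigma):
    """LR2/IIDX EX-score-style scaling: only 300g and 300 contribute."""
    probs = judgment_probs(sigma, WINDOWS_OD10)
    return 1_000_000 * (2 * probs[0] + probs[1]) / 2
```

Either function maps the estimated sigma to a score; since both are monotonically decreasing in sigma, swapping one for the other reorders nothing on a solo leaderboard.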
Note that the previous calculation is similar to the "Expected Unstable Rate" I made some years ago, but with some important differences:
- The Expected Unstable Rate estimated the standard deviation from the Base Score of a play (which uses the arbitrary values of 320, 300, 200, etc., so those values actually affect the balance of the results), while this uses the distribution of the judgments. Both calculations tend to give similar results, but when the distribution of judgments is farther from a perfect Normal Distribution, the calculation here gives more accurate results (this is especially important with misses or very bad accuracy judgments; the Base Score is very insensitive to the number of those judgments).
- The Expected Unstable Rate was based on the median (or other percentiles) of the Binomial Distribution associated with the number of 300g judgments obtained. The median is more robust to results that are just a fluke (for example, if a play consists of only 200 300g judgments, the unstable rate that makes this the most likely is 0, which makes it 100% likely; but even with a higher unstable rate, such as 50, the chance of getting 200 300g in a row is more than 50%, so it is very likely the real Unstable Rate of the play was higher than 0). The difference this causes in the results is only noticeable at very high accuracies (or with a very low number of objects, less than 50). For the purposes here, taking this into account is not really needed, and it would make the calculation formula more complex (when calculating pp, this is very important, though; it would also matter when mixing plays with different mods on the same leaderboard). Accounting for this in the formula is possible, but would make the calculation take a bit longer.
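The fluke example in the second point can be checked directly (assuming a ±16.5 ms 300g window and the convention that Unstable Rate is 10 times the standard deviation in ms; both are assumptions of this sketch):

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

sigma = 5.0                              # Unstable Rate 50 -> sigma = 5 ms
p_300g = 2.0 * phi(16.5 / sigma) - 1.0   # chance one hit lands in +/-16.5 ms
p_streak = p_300g ** 200                 # chance of 200 300g in a row
# p_streak comes out above 0.5: 200 straight 300g is more likely than not
# even at UR 50, so the maximum-likelihood estimate of 0 is overconfident.
```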
This makes the distinction between a "Score" and an "Accuracy Percentage" superfluous: the play with the best score is the one with the best accuracy (and I think the score calculated here reflects accuracy better than the current "Accuracy Percentage" formula).
Examples of results using the formula proposed:
Judgment counts are listed as (300g, 300, 200, 100, 50, MISS). All results use the "LR2-based scale", with OD8 for the plays considered.
(495, 5, 0, 0, 0, 0) -> 995000
(499, 0, 1, 0, 0, 0) -> 994563 (Note: when using OD 10, the score with 1 200 is actually better than the score with 5 300s)
(499, 0, 0, 1, 0, 0) -> 983084
(499, 0, 0, 0, 1, 0) -> 968606
(499, 0, 0, 0, 0, 1) -> 954903
Applied to leaderboards (Empress, and Anemone):
https://www.dropbox.com/s/wbz4ra1tqeb0r ... .xlsx?dl=0

The main issue would be implementation and performance: the code I wrote in Mathematica takes up to ~100 ms (about 32 ms on average) on my laptop to calculate a single score, and uses calculations with 70 decimal digits of precision (double-precision floating-point numbers aren't accurate enough to perform the numerical minimization of the likelihood function, and Mathematica supports only double-precision or arbitrary-precision calculations; implementing the same code with other number formats, such as binary256, could make things faster). Taking that long to calculate the value might not allow the score to be shown in real time during play (only at the results screen), and calculating the values for the current ranked scores would take a while.