
Statistical approach to Player Skill and Beatmap Difficulty

Topic Starter
Full Tablet
PM me if you want someone to be included in the next updates of this.


Based on scores on ranked beatmaps, the results here simultaneously estimate the difficulty of beatmaps (difficulty of achieving certain scores) and player skill (ability to score high on beatmaps).

A beatmap is rated high when players who are rated high in skill get low scores on it, whereas players are rated high when they get high scores on beatmaps that are rated high. The method is completely statistical, and doesn't look into the content of beatmaps (except for the number of objects, which sets the expected amount of variance).

Terms:
Tom Stars: Star Rating of a beatmap (based on the algorithm mainly designed by Tom94)

"X" Diff: The estimated difficulty of achieving at least X*1000 of score in a map with 1000 retries. They are measured in a scale that resembles star rating, with the 900K Difficulty giving the closest values to Tom Stars.

Score Count: Amount of scores retrieved for the beatmap or player in the online leaderboards.

Average Play Skill: The average difficulty of all the scores the player has set. Not a very meaningful measure, since it includes scores that may have been set when the player was at a lower skill level.

Peak Play Skill: Similar to Average Play Skill, but the best scores are weighted much more heavily than the rest. It's very sensitive to outliers, so it is not a very robust indicator of skill.

Accuracy Performance: Indicator of skill that works similarly to pp (setting a sub-par score doesn't lower the value; the best score has a weight of 100%, the 2nd one a weight of 95%, and so on). The scale is the same as the "X" Diff one, so having a lot of scores of a certain difficulty makes the Accuracy Performance converge to that difficulty value.

Technical Performance: Similar to Accuracy Performance, but it doesn't reward score past 900,000 (for example, setting a score of 960,000 awards the same performance as setting a score of 900,000). This estimates the ability of the player to set good scores in difficult maps, rather than rewarding very good accuracy in easier maps. A small sketch of how both performance values aggregate plays follows this list of terms.

xK ppv2: The pp obtained from all the ranked maps of that keymode the player has played, not counting the bonus pp from setting a large number of scores.
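As a rough illustration of the aggregation (a minimal sketch, not the exact implementation; the weighted-average form is an assumption, and for Technical Performance the scores would be capped at 900,000 before each play's difficulty value is derived):

def aggregate_performance(play_difficulties):
    # Best play weighs 100%, the next 95%, then 95%^2, and so on.
    ranked = sorted(play_difficulties, reverse=True)
    weights = [0.95 ** i for i in range(len(ranked))]
    # A weighted average makes many plays of difficulty D converge to D.
    return sum(d * w for d, w in zip(ranked, weights)) / sum(weights)

# Example: plays clustered around 3.5 difficulty give a value near 3.5.
print(aggregate_performance([3.8, 3.6, 3.5, 3.5, 3.4]))  # ~3.57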

Newest Version (2020/02/02), 7K Only
https://drive.google.com/file/d/1vmWpPannfXiR3xTYoypbplV8xsciNPtB/view?usp=sharing

Old Version (2019/02/04), 4K Only
https://docs.google.com/spreadsheets/d/1njYWZSQjV6D8EHrCnpnzRbQycH0BG7C-DWy2--T8Zjw/edit?usp=sharing

Old Version (2018/02/16), 7K/9K
https://docs.google.com/spreadsheets/d/16ik3TElUYhzTkm6U6QdA_J0owiQJJ_Wx1yjmYNCJ9jk/edit?usp=sharing

Different keymodes use different scalings, so values aren't meant to be directly compared across keymodes.

What are your opinions of the results?

This is not meant to replace the current beatmap difficulty algorithm used for pp, since it has the limitations of a purely statistical approach. It might be used to calibrate beatmap difficulty algorithms based on beatmap analysis, though.

Edit: 2020/02/02: Updated results for 7K.
Shoegazer
I've always wanted to see a walkure algorithm, or at least any form of algorithm based on leaderboards and a player's ability. The more I look at the spreadsheet, however, the more I realise that the scores everyone has gotten are inconsistent to an extent (or some of the top players don't play at all, making the leaderboards skewed), and leaderboards don't quite show how difficult a map actually is, because people of different skill levels play different sets of maps, and nobody plays all of them.

My biggest qualm is probably the fact that there's such a huge set of maps in the 2.4-3 range. I'm assuming it's because a good number of the scores in that list are SSs, and player skill wouldn't be captured very well because only those SS scores would be captured in the first place. I'm not sure how much more accurate a top 100-150 would be, though.

The maps ranging from AiAe to Bangin' Burst have odd numbers to me; is there any reason for those maps to be rated that much higher than, say, Kamui? A good number of players are overrated in terms of skill level as well, but that's probably because of the 5 maps above.

Anyway, it's a nice idea, but it's probably not that meaningful of an approach, mainly because the top players only play some maps and avoid a good number of others, which makes many maps underrated/overrated. Candy Galy, Sakura Mirage, Mastication Numerique and Brynhildr in the Darkness are definitely some examples - but for different statistical reasons (many players, mainly noticeably skilled ones, play CG/SM, while not many played Mastication and Brynhildr).
_Kemo
um, this is interesting, but somehow there are inconsistent results, since some unpopular easy maps also have really bad scores on the leaderboard, just as really hard maps do.

lol that's why I'm soooo overrated in 5k and 8k lol
abraker
Impressive! You actually beat me to it XD. But as Kemo and Shoegazer said, your approach may not be 100% reliable.

There are 3 ways I can think of to calculate beatmap difficulty:
- Do what you did and base it upon the scores achieved. This will work only if many people play the beatmap to the best of their ability, so low popularity may render this option useless.

- The current system: base it upon the highest note density. We all know how wrong this is.

- Calculate the difficulty from the beatmap composition. While this is the hardest of the 3 to do, it is also the most accurate. Composition would include patterns, the density of the patterns, extremes in BPM and SV, and keymode.
I am planning to inspect the beatmap patterns and come up with a difficulty index sometime in the (maybe far) future. But yeah, interesting.

stuff
4k: 2.85400073592293
7k: 3.55930806616186 <--- Unless the keymode is part of the calculation, I will not believe this. I struggle to get an A in 4* 7k, while I can do up to 5* 4k
8k: 2.66764821363234
Topic Starter
Full Tablet
For 6K, I used the same algorithm, but instead of only using top 50 scores, I used all the scores of 6K players that have at least 1 top 50 score (a total of 3443 scores, instead of 1200).

https://www.dropbox.com/s/1byfyyvo64b6d ... .xlsx?dl=0

Do you think the results are more accurate?

Doing the same with other keymodes would take me a while.
abraker
How come Sasaki Sayaka's [6K Normal] stat diff is the same as Sasaki Sayaka's [6K Beginner] stat diff?
Topic Starter
Full Tablet

abraker wrote:

How come Sasaki Sayaka's [6K Normal] stat diff is the same as Sasaki Sayaka's [6K Beginner] stat diff?
Beginner is rated 1.19984264020897, while Normal is rated 1.20305233104608 (very slight difference).
ovnz
I didn't know that star ratings were this precise holy shit
Bobbias
It's standard practice to use extremely precise numbers for all calculations and only round to whatever significant digits you want at the very end, to ensure no rounding errors enter the calculation.
[Crz]Player
9k when
Topic Starter
Full Tablet

ATTan wrote:

9k when
Added results for 9K in the first table (though they aren't very meaningful, since there is only 1 ranked mapset, and it is very easy).
Topic Starter
Full Tablet
Here are results for 4K using more scores (79840 instead of ~30000 taken from top 50 scores).
https://www.dropbox.com/s/1byfyyvo64b6d ... .xlsx?dl=0
This list used all the scores of randomly selected players (biased towards people with high amounts of pp), instead of top 50 scores.

I plan to change the algorithm to consider plays with DT/HT/EZ mods as different maps instead of taking the score with the score penalty/bonus applied (the main problem currently is that I don't know exactly how much bonus DT gives, and it might not be possible to determine without per-object data).

I need to find a way to retrieve scores more quickly (the current way of retrieving scores outside of the top-50 leaderboards or a player's top performances with the osu! API is very inefficient and slow; it took me several days to obtain a list of 4K scores).
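For reference, the retrieval is roughly one API call per player/beatmap pair, which is what makes it slow (a minimal sketch using the v1 get_scores endpoint; the key, the ID lists and the delay are placeholders):

import time
import requests

API_KEY = "..."         # personal osu! API key (placeholder)
PLAYER_IDS = [39828]    # hypothetical list of tracked player IDs
BEATMAP_IDS = [736215]  # hypothetical list of ranked beatmap IDs

for user in PLAYER_IDS:
    for beatmap in BEATMAP_IDS:
        # get_scores with a user filter returns that player's score(s) on
        # a single beatmap, even when they are outside the top 50.
        r = requests.get("https://osu.ppy.sh/api/get_scores",
                         params={"k": API_KEY, "b": beatmap,
                                 "u": user, "m": 3})  # m=3 is osu!mania
        for s in r.json():
            print(user, beatmap, s["score"])
        time.sleep(1)  # crude rate limiting; the safe rate is an assumption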
abraker

Full Tablet wrote:

I need to find a way to retrieve scores more quickly (the current way of retrieving scores outside of the top-50 leaderboards or a player's top performances with the osu! API is very inefficient and slow; it took me several days to obtain a list of 4K scores).
If you find a way, PM me. Osu!API++ is in need of that too.
Topic Starter
Full Tablet
I made some changes to the algorithm, inspired by this post: p/4383854

The algorithm fits the data (scores obtained by players) to logistic curves, where the parameters to fit are Player Skill, Beatmap Difficulty for 900K score, and Steepness of the difficulty curve for beatmaps.

The predicted score for a play is: Score = 1,000,000 / (1 + (1/9)·e^(−S·(P−B))).
Where P is the player skill, B is the beatmap difficulty (for 900K score; the curve gives exactly 900,000 when P = B), and S is the steepness parameter of the difficulty curve.

For example, two different maps can have the same difficulty at 900K but different steepness. (In the plot originally shown here, the orange curve represents the difficulty curve of a map with high steepness, while the blue one has lower steepness.)

The regression minimizes the sum of squared errors between the predicted scores and the scores in the data.
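A minimal sketch of that regression (assuming the logistic form above; scipy's least_squares stands in for the solver actually used, and the data, counts and initial guesses are toy values):

import numpy as np
from scipy.optimize import least_squares

# Toy data: (player index, map index, achieved score) triples.
plays = [(0, 0, 950_000), (0, 1, 870_000),
         (1, 0, 905_000), (1, 1, 780_000)]
n_players, n_maps = 2, 2

def predicted_score(P, B, S):
    # Logistic difficulty curve: gives exactly 900k when player skill P
    # equals the map's 900K difficulty B, and saturates at 1,000,000.
    return 1e6 / (1 + np.exp(-S * (P - B)) / 9)

def residuals(x):
    P = x[:n_players]                    # one skill value per player
    B = x[n_players:n_players + n_maps]  # one 900K difficulty per map
    S = x[n_players + n_maps:]           # one steepness value per map
    return [score - predicted_score(P[p], B[m], S[m])
            for p, m, score in plays]

x0 = np.concatenate([np.full(n_players, 3.0),  # initial skill guesses
                     np.full(n_maps, 3.0),     # initial difficulty guesses
                     np.full(n_maps, 1.0)])    # initial steepness guesses
fit = least_squares(residuals, x0)  # minimizes the sum of squared errors
skills = fit.x[:n_players]
difficulties = fit.x[n_players:n_players + n_maps]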

Here are results for ranked 6K maps: https://www.dropbox.com/s/vyoi1r86m9r8t ... .xlsx?dl=0

Take beatmap difficulty results based on few scores with a grain of salt (especially ones with only 1 score to base the calculation on; those use a default steepness parameter instead of a calculated one).

For the player rankings, there is also a "Performance" value. This value is calculated from the difficulty associated with each of the player's plays, with a score penalty based on map length (since fluke plays are more likely on shorter maps), and reduced weighting for beatmaps that had their difficulty estimated from few scores (since those estimates are more likely to be inaccurate). The "Player Skill" is the value used in the beatmap difficulty estimation, and is more indicative of the player's average performance across their plays.
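Very roughly, the shape of those two adjustments (the constants and functional forms here are placeholders to show the idea, not the actual formula):

def play_rating_and_weight(difficulty, map_length_sec, map_score_count):
    # Penalize short maps, since fluke plays are more likely on them.
    length_penalty = min(1.0, map_length_sec / 90.0)
    # Trust a map's difficulty estimate less when it comes from few scores.
    weight = min(1.0, map_score_count / 20.0)
    return difficulty * length_penalty, weight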

For running the algorithms for other keycounts, I would need to select players to base the calculations on (I can't use a very large number, since the algorithm is expensive in RAM and CPU use). Ideally, the players should have a large number of plays and a consistent performance (not many scores with a performance below their current level of play; for example, a player that has improved a lot over time but hasn't improved their old scores would be a bad candidate). The players should also represent a wide range of skill levels. Once the beatmap difficulty values are calculated, adding more players to the ranking is relatively simple (but score retrieval using the osu! API is still quite slow).
Clappy

Full Tablet wrote:

(I can't use a very large number, since the algorithm is expensive in RAM and CPU use)
Get some faggot with a i7 5960X and 128 gigs of ddr4 to test it out for you
-Maus-
Your nick is my reaction
abraker

Full Tablet wrote:

I can't use a very large number, since the algorithm is expensive in RAM and CPU use
How much CPU time are we talking about here? Surely leaving the computer on overnight would do the trick. As for RAM usage, I'm pretty sure there's a way to avoid too much RAM usage by doing it in C++ non-recursively.
Topic Starter
Full Tablet
Here are the current results for 7K maps and some (~400) 7K players:

https://www.dropbox.com/s/scz69rqs75g19 ... .xlsx?dl=0

Results for maps that have less than 20 plays in the data used are filtered out by default, since they are very likely to be inaccurate. Since the data used for the calculations contains no scores from players who struggle on the easiest maps, results for very easy maps are also likely to be inaccurate for low-level players.

The "Ranking" column in the beatmap list is based on the difficulty of achieving 900k score in the map.

The scores analyzed are several weeks old (scores made by players recently are not taken into account); it takes several days to refresh the scores of the players to their current values.

The algorithm used for calculating the values is subject to change.

What do you think of the current results?
Tristan97
Sweet! I made the list of top players at rank 283!
Topic Starter
Full Tablet
Ran the same algorithm for 4K maps and 4K players (listing the players that appear in the top 100 map leaderboards the most, plus a few manual additions)

Here are results:
https://www.dropbox.com/s/scz69rqs75g19 ... .xlsx?dl=0

(Results for 7K in the document are based on the previous calculations, which use only scores that are several months old)
Topic Starter
Full Tablet
Made a new version of the algorithm, and ran it with newer data (scores from about mid-April). Currently it only has data for 7K maps.

https://www.dropbox.com/s/74wnkix3ojmyd ... .xlsx?dl=0
Topic Starter
Full Tablet
Optimized some algorithms used in the calculation, allowing the tolerances to be set tighter without increasing the computation time (this should give more accurate results from the same data).

Added some more players to the data (players that also appeared in the previous version didn't have their scores updated since then; only new players have more recent score data). Fetching the data is still by far the most time-consuming part of the whole process.

https://docs.google.com/spreadsheets/d/ ... sp=sharing
Topic Starter
Full Tablet
Added many more players to the 7K rankings and beatmap difficulty estimation (including all of the top 1000 in the osu!mania pp system).
All ranked scores set before 9th June should be included; players that were added later might have more recent scores as well.

https://docs.google.com/spreadsheets/d/ ... sp=sharing
Topic Starter
Full Tablet
I have made a form for people who want to see themselves or other players included in the rankings in the future:

http://goo.gl/forms/6ZxF5XlMT2P0eaqf1

The next calculation will be for 4K maps.

Players of all skill levels are welcome in the rankings, as long as they have played at least a few (~20) beatmaps in the keymodes they are included in. Beginner or intermediate players are particularly useful for making the beatmap difficulty estimations more accurate for the easier maps.
Yuudachi-kun
Added myself. Hooray
snoverpk_old
owie i think i have the largest rank drop in the entire sheet
Topic Starter
Full Tablet

snoverpk wrote:

owie i think i have the largest rank drop in the entire sheet
The pp value shown in the table is the overall pp from all keymodes (adding those columns was a last-minute idea, so I didn't store the pp values of the scores beforehand); in the next calculations I will store the pp of each play, so the pp value shown for comparison will only include maps from the respective keymode. Since you play both 4K and 7K (being better at 4K), the rank difference is quite high.

Also, the system currently seems to favor accuracy players (people who get very high scores in relatively easy maps, but can't get good scores in hard maps) more than several people have told me it should. For the next calculations I will try some ways to nerf the rank of those players (a rough sketch of both ideas follows the list):
  1. Modifying how the residuals of the curve fitting are calculated, so that, for example, the model error when the predicted score is 850k and the achieved score is 900k is considered bigger than when the predicted score is 900k and the achieved score is 950k.
  2. Having a hard cap on the score goals: for example, at 900k score, consider the player to have "mastered" the map, so getting a higher score than that wouldn't increase the estimation of their skill.
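A rough sketch of both ideas (the 0.5 shrink factor in the first one is a placeholder, not a decided value):

def residual_idea_1(predicted, achieved):
    # Shrink the part of a score above 900k: an 850k prediction against a
    # 900k score leaves an error of 50k, while a 900k prediction against
    # a 950k score leaves only 25k.
    def squash(score):
        if score <= 900_000:
            return score
        return 900_000 + 0.5 * (score - 900_000)
    return squash(achieved) - squash(predicted)

def residual_idea_2(predicted, achieved):
    # Hard cap: anything above 900k counts as exactly 900k ("mastered"),
    # so extra score past that point can't raise the skill estimate.
    return min(achieved, 900_000) - min(predicted, 900_000)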
Aqo
Instead of putting all of this work into recreating ppv1, why not work on a more accurate diffcalc to improve stars?
Topic Starter
Full Tablet

Aqo wrote:

Instead of putting all of this work into recreating ppv1, why not work on a more accurate diffcalc to improve stars?
That's a long-term goal.

If the results are ever accurate enough, they can be used to construct a diffcalc algorithm eventually.
Topic Starter
Full Tablet
Made a new version of the algorithm (and ran it with the previous data for 7K, except that 7K players that were added in the submission form were included as well), with 2 performance values for each player:
  1. Technical Performance: increases when the player gets decent scores in hard maps, without giving much more for very high scores (above 900k). It is meant to reward being able to play hard maps decently.
  2. Accuracy Performance: increases when the player gets high scores in hard maps, giving more for very high scores. It is meant to reward being able to get high scores in maps that are hard (compared to Technical Performance, it is possible to get a higher value by playing easier maps and getting scores with very good accuracy).
Here is the new version:
https://docs.google.com/spreadsheets/d/ ... sp=sharing

For the 4K rankings (and future rankings), would you prefer this version of the algorithm, or the previous one?
snoverpk_old
i wouldn't know until i saw the 4k rankings
Topic Starter
Full Tablet
Ranking for 4K beatmaps and 4K players calculated:

https://docs.google.com/spreadsheets/d/ ... sp=sharing

What do you think of those results?

What should be done for the next calculation? http://www.strawpoll.me/10919193
Yuudachi-kun
My stats were done when I had 1,000 less ppv2 so am sad :(

According to osu track you took it ~7th July. Is that about right?
Topic Starter
Full Tablet

Khelly wrote:

My stats were done when I had 1,000 less ppv2 so am sad :(

According to osu track you took it ~7th July. Is that about right?
You are among the first players in the database (the 6th one), so the data is about 1 month old (it takes a long while to collect all the data, so the data of the first players is already old by the time the calculation is done). Take into consideration that the pp amount in the table only considers pp from 4K maps, and doesn't include the bonus pp from having many plays.
Ayaya
690 8-)
I'm ok with that~

But wow I dropped 384 :cry:
Topic Starter
Full Tablet
Updated the 4K rankings with more recent scores, added more recent maps, and added players that signed up through the form.

https://docs.google.com/spreadsheets/d/ ... sp=sharing
coldloops
Hello there,
have you tried comparing your difficulty measure with star rating? I made a few correlation plots to illustrate this:

http://imgur.com/a/F6HjL
the "rank" is calculated by ordering the difficulty values, 1 will be the lowest, 2 the second lowest and so on.

I found it pretty interesting that it correlates with star rating so well, given that they are different methods; what do you think?

actually, I made a similar analysis of beatmap diff and player skill using only score data and also got a high correlation (~0.88), so I was wondering: is it really worth the effort to do this if star rating seems to be giving the same results?

don't get me wrong, analysing score data to derive actual difficulty seems to be the best shot at getting that "true difficulty" people want, but when I see those correlations I can't help but conclude that star rating seems to be pretty good already, despite not taking patterns into account.
Topic Starter
Full Tablet

coldloops wrote:

Hello there,
have you tried comparing your difficulty measure with star rating? I made a few correlation plots to illustrate this:

http://imgur.com/a/F6HjL
the "rank" is calculated by ordering the difficulty values: 1 is the lowest, 2 the second lowest, and so on.

I found it pretty interesting that it correlates with star rating so well, given that they are different methods; what do you think?

actually, I made a similar analysis of beatmap diff and player skill using only score data and also got a high correlation (~0.88), so I was wondering: is it really worth the effort to do this if star rating seems to be giving the same results?

don't get me wrong, analysing score data to derive actual difficulty seems to be the best shot at getting that "true difficulty" people want, but when I see those correlations I can't help but conclude that star rating seems to be pretty good already, despite not taking patterns into account.
While star rating seems to be highly correlated with the difficulty of maps, because of how the pp system works that is not good enough for its purposes.

Since the overall rating of a player puts a heavy weight on the plays that give the most pp, errors in the difficulty ratings of overrated maps have a big influence on the overall quality of the players' ratings. In this case, the outliers in the data matter more than what correlation tests indicate.

For a person (or algorithm) to determine the rating of maps and players, the most objective way is to analyze the scores of the players. In cases where player X and player Y get a score of 700k and 800k respectively in map A, and 600k and 700k in map B, it's straightforward to infer that player Y is better than player X, and that map B is harder than map A. The problem is determining how to assign uni-dimensional ratings in cases where the higher-skilled players don't always get higher scores than lower-skilled players; changes in the algorithm used here mostly concern how to judge those cases.

I have a new version of the algorithm (which further reduces the rating of maps where most high-skill players have low scores but some low-skill players have good scores; usually Monster and other SV-heavy maps). I will use it for 4K maps and players (collecting scores will start around New Year, taking scores from players in January or February, and the calculation is estimated to finish in March).
coldloops
Full Tablet wrote:

While star rating seems to be highly correlated with the difficulty of maps, because of how the pp system works that is not good enough for its purposes.

Since the overall rating of a player puts a heavy weight on the plays that give the most pp, errors in the difficulty ratings of overrated maps have a big influence on the overall quality of the players' ratings. In this case, the outliers in the data matter more than what correlation tests indicate.
yes, that's a good point, I hadn't really considered the pp weighting thing, but for the outliers to be useful I need to figure out which one is closer to being "right"; there must be some sort of validation.

Full Tablet wrote:

The problem is determining how to assign uni-dimensional ratings in cases where the higher-skilled players don't always get higher scores than lower-skilled players; changes in the algorithm used here mostly concern how to judge those cases.
why does that happen? is it lack of effort from high-skilled players? I thought about using the number of times a user has played a map to give some sort of "trustworthiness" to the score, but this data is not available through the API.
abraker

coldloops wrote:

why does that happen? is it lack of effort from high-skilled players? I thought about using the number of times a user has played a map to give some sort of "trustworthiness" to the score, but this data is not available through the API.
Take me as an example. I used to be able to S 4.7* 4k a year ago. Now I can barely S a 4.4*. People get rusty, magically get input lag, or some other shit happens where they can't play as well as they once could.
Bobbias
Additionally, some people are simply REALLY good at specific things, but not at others. Some people can read SVs like they're not there, while other players might require quite a few tries to get a decent score on something with particularly nasty SVs.

As one example, look at ATTang vs Staiain in 4k. ATTang is extremely good at vibro files. Overall, he's a worse player than Staiain, but there are some files he can play that Staiain can't even pass (or does so poorly on he won't bother trying).
Yuudachi-kun
Attang is d8 jacks though; I don't think you can compare that to Staiain.

But staiain 1.1 AA'd uta and when I got attang to play 1.1 uta he quit 3/4 through saying it was probably too hard
coldloops
abraker wrote:

Take me as an example. I used to be able to S 4.7* 4k a year ago. Now I can barely S a 4.4*. People get rusty, magically get input lag, or some other shit happens where they can't play as well as they once could.
skill decay is something I have considered; actually, my initial idea was to only use multiplayer scores, since that way I can get recent scores from all players regardless of whether they are best scores or not (and as a bonus we also get unranked scores). the problem is that people don't play multiplayer as much as I hoped, especially high-level players; some don't play multi at all...

Bobbias wrote:

Additionally, some people are simply REALLY good at specific things, but not at others. Some people can read SVs like they're not there, while other players might require quite a few tries to get a decent score on something with particularly nasty SVs.
yea, I guess that's what Full Tablet was talking about when he mentioned uni-dimensional ratings; different types of skill complicate things, but I think the ideal "best" player should be the one that can maximize the score on all types of maps.
Bobbias
Yes, that would be why Staiain is still considered better than ATTang.

I was just pointing out a particularly good example of a case where someone can achieve good scores on specific types of maps that would allow them to rank similarly or better than better overall players.
Topic Starter
Full Tablet
Since each player can now have several scores per map stored on the osu! servers, I will delay the next update so players have time to set more scores (I expect several players will start setting DT scores on maps, making the ratings of DT versions of maps more accurate overall).
Topic Starter
Full Tablet
Here are updated results for 4K maps and players:

https://docs.google.com/spreadsheets/d/ ... sp=sharing

Next update will consider 7K maps and players.
snoverpk_old
nice update but all of the scores are from february
Topic Starter
Full Tablet

snoverpk wrote:

nice update but all of the scores are from february
It took a bit more than a month to retrieve the scores from the osu! servers using the API. The calculation after retrieving the scores then took several months (future updates should take less time, considering several optimizations made to the algorithm in the meantime).
Minisora
I'm too horrible at mania to be included in the list :)

Nice list though, I give an A+ for the computer making the calculations :P
Topic Starter
Full Tablet
https://docs.google.com/spreadsheets/d/ ... sp=sharing

Here are results for 9K maps and players. Results for 7K were delayed because of complications while retrieving the score data with the API (there was a bug in Mathematica 11.1 that made some API calls return incorrect data).