
Statistic approach to Player Skill and Beatmap Difficulty

Total Posts: 62
Aqo
Instead of putting all of this work into recreating ppv1, why not work on a more accurate diffcalc to improve stars?
Topic Starter
Full Tablet

Aqo wrote:

Instead of putting all of this work into recreating ppv1, why not work on a more accurate diffcalc to improve stars?
That's a long-term goal.

If the results are ever accurate enough, they can be used to construct a diffcalc algorithm eventually.
Topic Starter
Full Tablet
Made a new version of the algorithm (and ran it with the previous data for 7K, except that 7K players that were added in the submission form were included as well), with 2 performance values for each player:
  1. Technical Performance: increases when the player gets decent scores in hard maps, not giving much more for very high scores (above 900k). It is meant to award being able to play hard maps decently.
  2. Accuracy Performance: increases when the player gets high scores in hard maps, giving more for very high scores. It is meant to award being able to get high scores in maps that are hard (being possible to get a higher value by playing easier maps compared to Technical Performance, by getting scores with very good accuracy).
Here is the new version:
https://docs.google.com/spreadsheets/d/ ... sp=sharing

For the 4K rankings (and future rankings), would you prefer this version of the algorithm, or the previous one?
snoverpk_old
i wouldn't know until i saw the 4k rankings
Topic Starter
Full Tablet
Ranking for 4K beatmaps and 4K players calculated:

https://docs.google.com/spreadsheets/d/ ... sp=sharing

What do you think of those results?

What should be done for the next calculation? http://www.strawpoll.me/10919193
Yuudachi-kun
My stats were done when I had 1,000 less ppv2 so am sad :(

According to osu track you took it ~7th July. Is that about right?
Topic Starter
Full Tablet

Khelly wrote:

My stats were done when I had 1,000 less ppv2 so am sad :(

According to osu track you took it ~7th July. Is that about right?
You are among the first players in the database (the 6th one), so the data is about 1 month old (it takes a long while to collect all the data, so the data of the first players is already old when the calculation is done). Take into consideration that the pp amount in the table only considers pp from 4k maps, and doesn't consider the bonus pp from having many plays.
Ayaya
690 8-)
I'm ok with that~

But wow I dropped 384 :cry:
Topic Starter
Full Tablet
Updated 4K rankings with more recent scores, added more recent maps, and added players who signed up via the form.

https://docs.google.com/spreadsheets/d/ ... sp=sharing
coldloops
Hello there,
have you tried to compare your difficulty measure with star rating ? I made a few correlation plots to illustrate this:

http://imgur.com/a/F6HjL
the "rank" is calculated by ordering the difficulty values, 1 will be the lowest, 2 the second lowest and so on.

I found it pretty interesting that it correlates with star rating so well given that they are different methods, what do you think ?

actually I made a similar analysis of beatmap diff and player skill using only score data and also got a high correlation (~0.88), so I was wondering is it really worth the effort to do this if star rating seems to be giving the same results ?

don't get me wrong, analysing score data to derive actual difficulty seems to be the best shot at getting that "true difficulty" people want but when I see those correlations I can't help but conclude that star rating seems to be pretty good already, despite not taking patterns into account.
Topic Starter
Full Tablet

coldloops wrote:

Hello there,
have you tried to compare your difficulty measure with star rating ? I made a few correlation plots to illustrate this:

http://imgur.com/a/F6HjL
the "rank" is calculated by ordering the difficulty values, 1 will be the lowest, 2 the second lowest and so on.

I found it pretty interesting that it correlates with star rating so well given that they are different methods, what do you think ?

actually I made a similar analysis of beatmap diff and player skill using only score data and also got a high correlation (~0.88), so I was wondering is it really worth the effort to do this if star rating seems to be giving the same results ?

don't get me wrong, analysing score data to derive actual difficulty seems to be the best shot at getting that "true difficulty" people want but when I see those correlations I can't help but conclude that star rating seems to be pretty good already, despite not taking patterns into account.
While star rating seems to be highly correlated with the difficulty of maps, because of how the pp system works that is not good enough for its purposes.

Since the overall rating of a player puts a heavy weight on the plays that give the most pp, errors in the difficulty rating of overrated maps have a big influence on the overall quality of the player ratings. In this case, the outliers in the data matter more than what correlation tests indicate.

For a person (or algorithm) to determine the rating of maps and players, the most objective way is by analyzing scores of the players. In cases where player X and player Y get a score of 700k and 800k respectively in map A, and 600k and 700k in map B; it's straightforward to infer that player Y is better than player X, and map B is harder than map A. The problem is determining how to assign uni-dimensional ratings in cases where the higher skilled players don't always get higher scores than lower skilled players; changes in the algorithm used here concern mostly how to judge those cases.
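The straightforward case described above can be sketched in a few lines. This is only a toy illustration in Python (the thread's actual Mathematica code isn't posted); the player names, map names, and score values are the hypothetical ones from the example:

```python
# Hypothetical score table from the example above: players X, Y on maps A, B.
scores = {
    ("X", "A"): 700_000, ("Y", "A"): 800_000,
    ("X", "B"): 600_000, ("Y", "B"): 700_000,
}
players = ["X", "Y"]
maps = ["A", "B"]

def better_player(p, q):
    # p is clearly better than q if p outscores q on every map.
    return all(scores[(p, m)] > scores[(q, m)] for m in maps)

def harder_map(m, n):
    # m is clearly harder than n if every player scores lower on m than on n.
    return all(scores[(p, m)] < scores[(p, n)] for p in players)

print(better_player("Y", "X"), harder_map("B", "A"))  # True True
```

The hard cases the post mentions are exactly those where neither `all(...)` condition holds in either direction, and some tie-breaking judgment is needed.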

I have a new version of the algorithm (which further reduces the rating of maps where most high-skill players have low scores but some low-skill players have good scores; usually Monster and other SV-heavy maps). I will use it for 4K maps and players (collecting scores will start around New Year, taking scores from players in January or February, with the calculation estimated to finish in March).
coldloops
While star rating seems to be highly correlated with the difficulty of maps, because of how the pp system works that is not good enough for its purposes.

Since the overall rating of a player puts a heavy weight on the plays that give the most pp, errors in the difficulty rating of overrated maps have a big influence on the overall quality of the player ratings. In this case, the outliers in the data matter more than what correlation tests indicate.
Yes, that's a good point, I hadn't really considered the pp weighting. But for the outliers to be useful I need to figure out which measure is closer to being "right"; there must be some sort of validation.

The problem is determining how to assign uni-dimensional ratings in cases where the higher skilled players don't always get higher scores than lower skilled players; changes in the algorithm used here concern mostly how to judge those cases.
Why does that happen? Is it lack of effort from high-skilled players? I thought about using the number of times a user has played a map to give some sort of "trustworthiness" to the score, but this data is not available through the API.
abraker

coldloops wrote:

Why does that happen? Is it lack of effort from high-skilled players? I thought about using the number of times a user has played a map to give some sort of "trustworthiness" to the score, but this data is not available through the API.
Take me as an example. I used to be able to S 4.7* 4K a year ago. Now I can barely S a 4.4*. People get rusty, magically get input lag, or some other shit happens where they can't play as well as they once could.
Bobbias
Additionally, some people are simply REALLY good at specific things, but not at others. Some people can read SVs like they're not there, while other players might require quite a few tries to get a decent score on something with particularly nasty SVs.

As one example, look at ATTang vs Staiain in 4k. ATTang is extremely good at vibro files. Overall, he's a worse player than Staiain, but there are some files he can play that Staiain can't even pass (or does so poorly on he won't bother trying).
Yuudachi-kun
Attang is d8 jacks though; I don't think you can compare that to Staiain.

But staiain 1.1 AA'd uta and when I got attang to play 1.1 uta he quit 3/4 through saying it was probably too hard
coldloops
Take me as an example. I used to be able to S 4.7* 4K a year ago. Now I can barely S a 4.4*. People get rusty, magically get input lag, or some other shit happens where they can't play as well as they once could.
Skill decay is something I have considered. Actually, my initial idea was to only use multiplayer scores; that way I can get recent scores from all players regardless of whether they are best scores or not (and as a bonus we also get unranked scores). The problem is that people don't play multiplayer as much as I hoped, especially high-level players; some don't play multi at all...

Additionally, some people are simply REALLY good at specific things, but not at others. Some people can read SVs like they're not there, while other players might require quite a few tries to get a decent score on something with particularly nasty SVs.
Yeah, I guess that's what Full Tablet was talking about when he mentioned uni-dimensional ratings; different types of skill complicate things, but I think the ideal "best" player should be the one that can maximize the score on all types of maps.
Bobbias
Yes, that would be why Staiain is still considered better than ATTang.

I was just pointing out a particularly good example of a case where someone can achieve good scores on specific types of maps that would allow them to rank similarly or better than better overall players.
Topic Starter
Full Tablet
Since each player can now have several scores per map stored on the osu! servers, I will delay the next update so players have time to set more scores (I expect several players will start setting DT scores on maps, making the ratings of DT versions of maps more accurate overall).
Topic Starter
Full Tablet
Here are updated results for 4K maps and players:

https://docs.google.com/spreadsheets/d/ ... sp=sharing

Next update will consider 7K maps and players.
snoverpk_old
nice update but all of the scores are from february
Topic Starter
Full Tablet

snoverpk wrote:

nice update but all of the scores are from february
It took a bit more than a month to retrieve the scores from the osu! servers using the API, and the calculation after retrieving the scores took several months (future updates should take less time, considering several optimizations made to the algorithm in the meantime).
Minisora
I'm too horrible at mania to be included in the list :)

Nice list though, I give an A+ for the computer making the calculations :P
Topic Starter
Full Tablet
https://docs.google.com/spreadsheets/d/ ... sp=sharing

Here are results for 9K maps and players. Results for 7K were delayed because of complications while retrieving the score data with the API (there was a bug in Mathematica 11.1 that made some API calls return incorrect data).
Topic Starter
Full Tablet
Added results for 7K players and maps.

https://docs.google.com/spreadsheets/d/ ... sp=sharing
abraker
I have been wondering, is there any correlation between the length of the map and the number of people who get a higher score, when comparing maps of similar SR (Tom's stars)?
Topic Starter
Full Tablet

abraker wrote:

I have been wondering, is there any correlation between the length of the map and the number of people who get a higher score, when comparing maps of similar SR (Tom's stars)?
Here are some graphs of the number of notes in beatmaps versus the ratio of plays that pass certain score milestones (800k, 900k, 990k, 1M), for several star rating ranges.

There is a tendency for the number of passes to decrease as the number of notes increases, but the correlation is not strong.

The correlation coefficient of each linear regression is rather low, with r around 0.35 in the case of the [0.8, 1.2] star rating range with the 990k and 1M milestones, and the [1.8, 2.2] range with the 990k milestone.

For other score milestones and other star rating ranges, the correlation is even lower.
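For reference, the r values quoted are Pearson correlation coefficients. A minimal pure-Python sketch of the computation (the sanity-check points below are made up, not the actual beatmap data):

```python
import math

# Pearson correlation coefficient for a list of (x, y) pairs.
def pearson_r(pairs):
    xs, ys = zip(*pairs)
    n = len(pairs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Sanity checks on toy data:
print(round(pearson_r([(1, 2), (2, 4), (3, 6)]), 6))            # 1.0 (perfectly linear)
print(round(pearson_r([(1, 2), (2, 1), (3, 2.5), (4, 2)]), 2))  # ~0.31 (weak)
```

An r around 0.35 means the linear fit explains only about 12% of the variance (r squared), which matches the "not strong" reading above.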
Topic Starter
Full Tablet
https://drive.google.com/file/d/1KFFVOM_YsnRuvfSUWr4M2_KFZtoyyghH/view?usp=sharing

New update for 4K and 7K beatmaps and players, using newer scores and fixing a typo in the algorithm that made results slightly different from what was intended.
abraker
Blastix Riotz +DT seems a little weird for 500k and 600k, double the difficulty. I am going to guess there are not enough data points.

This makes me think, since you have been working with the data for sometime now, how many data points do you typically need for the results to be accurate or at least make sense?
Topic Starter
Full Tablet

abraker wrote:

Blastix Riotz +DT seems a little weird for 500k and 600k, double difficulty. I am going to guess not enough data points.


[Figure: Blastix Riotz (GRAVITY) +DT difficulty curve. Blue dots are the scores in the data (score achieved vs. average play skill of the player).]

The data on Blastix Riotz +DT is lacking. From what we can see in the data, it seems only the best players can achieve more than 500k score in the beatmap, but there aren't many scores set by the best players, so we can't be certain that the difficulty estimation is accurate. The difficulty rating for 700k score or better is heavily extrapolated, so it may not be accurate at all. The best player who has set a score in the beatmap is [Crz]Player (who is rated as the player who sets good scores in the most difficult maps, though not the player who consistently sets the best scores), and he was still far from reaching a 600k score.

The scale of difficulty is set so the skill of the players in the data follows a gamma distribution with mean 3 and standard deviation 1.5. So a value of 3 is something the "average" player in the data is expected to be able to do (which is still quite an achievement, since the data is mostly composed of the best players in the game), a value of 5 is roughly something only the top 10% is expected to be able to do, while 10.71 is something that goes a bit beyond what any player could do consistently.

The scale is not actually something of importance regarding calculations. If you can define what "double the amount of difficulty" means, maybe I could set the scale according to that definition.
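Those numbers can be checked directly: a mean of 3 and standard deviation of 1.5 pin down the gamma distribution's shape and scale, and with an integer shape the CDF has a closed form. A quick independent check in Python (stdlib only; the thread's actual Mathematica code isn't posted):

```python
import math

# Gamma distribution: mean = k*theta, sd = sqrt(k)*theta, so the stated
# mean 3 and sd 1.5 give shape k = 4 and scale theta = 0.75.
mean, sd = 3.0, 1.5
k = (mean / sd) ** 2        # 4.0
theta = sd ** 2 / mean      # 0.75

def gamma_cdf(x, k, theta):
    # Closed-form CDF for integer shape (Erlang distribution):
    # P(X <= x) = 1 - exp(-x/theta) * sum_{n < k} (x/theta)^n / n!
    lam = x / theta
    return 1.0 - math.exp(-lam) * sum(lam ** n / math.factorial(n)
                                      for n in range(int(k)))

# Fraction of players expected to be at skill 5 or above:
print(round(1.0 - gamma_cdf(5.0, k, theta), 3))  # 0.101, i.e. roughly the top 10%
```

This reproduces the "top 10% at a value of 5" figure from the post under that parameterization.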

abraker wrote:

This makes me think, since you have been working with the data for sometime now, how many data points do you typically need for the results to be accurate or at least make sense?

The more scores from players who struggle with a certain goal, the more confident we can be that the difficulty estimation for that goal is accurate.

Usually, about 20 scores in the same score range is the bare minimum to be confident about the estimation, but having more than 200 or even 1000 plays is much better. Some popular maps have 1000+ scores in total in the data, but still have few scores in some score ranges (for example, despite AiAe [MX] being the most popular map, few players in the data have less than 700k score on it).
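As a rough intuition for why ~20 scores is a bare minimum: if you think of estimation in a score range as estimating a pass rate p from n scores, the standard error shrinks like sqrt(p(1-p)/n). This is only a simplified sketch, not the actual algorithm:

```python
import math

# Standard error of an estimated pass rate p from n observed scores.
def pass_rate_stderr(p, n):
    return math.sqrt(p * (1 - p) / n)

# Worst case p = 0.5: with 20 scores the estimate is uncertain by about
# 11 percentage points (1 sigma); with 1000 scores, about 1.6 points.
for n in (20, 200, 1000):
    print(n, round(pass_rate_stderr(0.5, n), 3))  # 0.112, 0.035, 0.016
```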
Topic Starter
Full Tablet
New update for 4K with scores retrieved mostly during November 2018. This one took the scores from 4000 players, so it took considerably longer to retrieve the scores and calculate the results.

https://docs.google.com/spreadsheets/d/1njYWZSQjV6D8EHrCnpnzRbQycH0BG7C-DWy2--T8Zjw/edit?usp=sharing

Topic Starter
Full Tablet
Updated results for 7K beatmaps and players

Newest Version (2020/02/02), 7K Only
https://drive.google.com/file/d/1vmWpPannfXiR3xTYoypbplV8xsciNPtB/view?usp=sharing
coldloops
Hey, have you seen https://data.ppy.sh/ ? can you use those dumps for your calculations ?
Topic Starter
Full Tablet

coldloops wrote:

Hey, have you seen https://data.ppy.sh/ ? can you use those dumps for your calculations ?

I wasn't aware of those dumps. I can use those dumps for calculations (it might be much faster than using the API for obtaining the scores of each player, once I figure out how to access that format efficiently with Mathematica). Thanks!

The current bottleneck is my computer and the optimization of the algorithms (it currently takes about 4GB of RAM across several Mathematica sub-kernels and several weeks of calculation for about 5,000 players and all current Loved and Ranked 7K maps; calculation time and RAM use are O(n*m), with n and m being the number of players and beatmaps respectively).
coldloops

Full Tablet wrote:

once I figure out how to access that format efficiently with Mathematica). Thanks!


Yeah, I don't know about Mathematica, but you will probably need to load those dumps into an SQL server to extract a CSV of each table.


Full Tablet wrote:

The current bottleneck is my computer or the optimization of algorithms (it currently takes about 4GB RAM in several Mathematica sub-kernels and several weeks of calculation with about 5,000 players and all current Loved and Ranked 7K maps, and calculation time and RAM use is O(n*m), with n and m being the number of players and beatmaps respectively).


Are you using NMF? I looked at the 7K results, and the diffs for maps with too few scores aren't very useful; you could prune maps/users with too few scores to reduce the matrix.
Bobbias
If you can parse SQL, there's no need to load it into a server. You could write a fairly simple script to reformat the data into a more acceptable format pretty easily.

The top of the file contains several lines that you can basically skip over; the table description tells you what each column is, and then you move on to the INSERT statements, which are just comma-separated lists of (col1, col2, ...).



I have no idea what kind of support Mathematica has for SQL data, but you should be perfectly capable of writing a script to convert this data to a more usable format if you don't want to read directly from the file.
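That approach can be sketched like this in Python. The table name and rows below are made up (real data.ppy.sh dumps are gzipped .sql files with more columns), and this naive parser assumes fields contain no embedded parentheses or commas; a robust version would need a proper SQL tokenizer:

```python
import csv
import io
import re

# Hypothetical fragment of a MySQL dump, showing only the INSERT shape.
dump = """
-- header comments and CREATE TABLE statements can be skipped
INSERT INTO `osu_scores_mania_high` VALUES (1,'123',700000),(2,'456',812345);
INSERT INTO `osu_scores_mania_high` VALUES (3,'789',990000);
"""

rows = []
for line in dump.splitlines():
    if not line.startswith("INSERT INTO"):
        continue  # skip comments, CREATE TABLE, etc.
    # Grab each parenthesised tuple, then let csv split the fields,
    # treating single quotes as the quote character.
    for tup in re.findall(r"\(([^)]*)\)", line):
        rows.append(next(csv.reader(io.StringIO(tup), quotechar="'")))

print(rows)  # [['1', '123', '700000'], ['2', '456', '812345'], ['3', '789', '990000']]
```

From here the rows can be written out with `csv.writer`, or fed straight into whatever builds the score matrix.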