What data is required:
- Replays from various players, from low to high skill, with specific mods on specific maps
- Replays from specific players, from low to high skill, with specific mods on various maps
Data can be collected to help build the formula, but it takes a lot of it to figure out how the various factors play into breaking combo. Much of that data is inaccessible - only the top 500 plays on a leaderboard have replays, so there is no way to get data on lower-skilled players, and no way to get data on plays using the EZ mod.
As for the data that is accessible, you need to filter out plays that have no combo breaks. That rules out all maps under 4 stars, because most top 500 plays on such easy maps have no combo breaks. What remains are plays with a mix of mods that makes it hard to isolate factors, and 7+ star maps that most players can't FC with no mod.
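As a rough illustration of that filtering step, here is a minimal Python sketch. The `Play` record and its fields are made up for the example and are not any real osu! API. It also assumes a play "has a combo break" if it has a miss or its max combo falls short of the map's max combo, which is a simplification. It keeps only plays with breaks and then groups them by mod combination, so you can see how thin each group gets.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Play:
    # Hypothetical record; field names are assumptions for this sketch.
    player: str
    map_id: int
    stars: float
    mods: frozenset      # e.g. frozenset({"HD", "DT"}), empty set for no mod
    max_combo: int       # highest combo reached in the play
    map_max_combo: int   # highest combo achievable on the map
    misses: int


def has_combo_break(play):
    # Simplifying assumption: a miss or a shortfall in max combo counts as a break.
    return play.misses > 0 or play.max_combo < play.map_max_combo


def plays_with_breaks(plays):
    # Drop full-combo plays; they say nothing about where combo gets broken.
    return [p for p in plays if has_combo_break(p)]


def group_by_mods(plays):
    # Bucket plays by exact mod combination so each bucket isolates one mod setup.
    groups = defaultdict(list)
    for p in plays:
        groups[p.mods].append(p)
    return dict(groups)
```

Calling `group_by_mods(plays_with_breaks(plays))` on a leaderboard dump would show how many usable plays exist per mod combination, which is exactly where the sample sizes tend to get too small.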
You also need to analyze players, not just the maps. This requires many replays from single players to figure out how they respond to various patterns - what their skills are - to build a player profile to test models against. Currently it's possible to collect this kind of data from top players only. Unfortunately the data would have various mods, so again, this makes it hard to isolate factors.
There has been a thought to get data from maps that have been recently ranked (before they have 500 scores), but that needs automation, and you need to know which maps to target. It would be an active effort of constantly going through the qualified and modding queues to build a list of potential candidates to get data for. That would be quite a commitment.
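To give an idea of what the selection part of that automation might look like, here is a small Python sketch. `MapInfo` and its fields are hypothetical, and how you would actually obtain the score counts is left out. It picks maps that still have fewer than 500 scores and puts the ones closest to the limit first, since those are the ones about to start losing replays.

```python
from dataclasses import dataclass

# From the post: only the top 500 plays on a leaderboard have replays.
LEADERBOARD_REPLAY_LIMIT = 500


@dataclass
class MapInfo:
    # Hypothetical record; field names are assumptions for this sketch.
    map_id: int
    stars: float
    score_count: int  # number of scores currently on the leaderboard


def collection_candidates(maps, min_stars=0.0, limit=LEADERBOARD_REPLAY_LIMIT):
    # Keep maps where every score still has a replay available.
    picked = [m for m in maps if m.score_count < limit and m.stars >= min_stars]
    # Maps nearest the limit come first: they are the most urgent to collect.
    return sorted(picked, key=lambda m: m.score_count, reverse=True)
```

The `min_stars` cutoff is only there so you can skip maps you don't care about; the hard part, deciding which maps to watch in the first place, still has to be done by hand.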