
Will GPT-like models be able to generate Osu! beatmaps automatically from songs?

Topic Starter
Trackest
Osu! beatmaps are text data, so theoretically GPT should be able to learn to generate them pretty quickly. I doubt GPT-3 or GPT-4 have been trained on beatmaps specifically, but if someone could fine-tune a lower-resource model like Alpaca 7B on a multimodal dataset containing all previously mapped songs and their corresponding beatmaps, the model could plausibly learn to associate audio files with beatmap outputs, right? Of course you would have to turn the songs into machine-readable features like spectrograms or something. But beatmap files are basically code files, and we know GPT is really good at generating code.
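Something like this is the kind of preprocessing I mean, just as a rough sketch (librosa for the audio features; the file name and parameters are placeholders, not a recommendation):

import librosa
import numpy as np

# decode the song and turn it into a mel spectrogram (a frequency x time matrix)
y, sr = librosa.load("song.mp3", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=512)
features = librosa.power_to_db(mel, ref=np.max)  # shape (128, n_frames), roughly 23 ms per frame

# each column (time slice) could then be paired with the hitobjects that fall into that window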

I also asked ChatGPT what information a hitobject string represents and it gave an almost correct answer!

Can you explain what each section of this hitobject represents (separated by commas)? 256,192,540,70,0,L|257:217,1,25.000000372529,2|2,0:0|0:0,0:0:0:0:


-----

Sure, I can explain each section of the hitobject string you provided:

256,192,540,70,0,L|257:217,1,25.000000372529,2|2,0:0|0:0,0:0:0:0:

Each section is separated by a comma, and they represent the following information:

X-coordinate: 256
Y-coordinate: 192
Time: 540
Hit object type: 70 (this represents a circle)
Hit sound: 0 (this means no sound will be played when the object is hit)
Hit object extras: L|257:217 (this represents a slider, with the slider's endpoint at x-coordinate 257 and y-coordinate 217, forming a straight line between the circle and endpoint)
Edge hitsounds and timing points: 1,25.000000372529,2|2 (this represents two timing points - the first one is at time 540 with a 1/1 beat length and no hitsound, and the second one is at time 565 with a 1/1 beat length and a hitsound of 2)
Slider path: 0:0|0:0 (this represents the path the slider follows, in this case it's just a straight line from the circle to the endpoint)
Slider edge hitsounds: 0:0:0:0: (this means no hitsounds will be played when the slider is hit)
I hope that helps! Let me know if you have any other questions.
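-----

For comparison, here is roughly how that string actually breaks down according to the .osu file format documentation, as a quick decoding sketch (so you can see which parts the answer above gets right and which it fudges):

# decode the example hitobject, field by field, following the .osu file format docs
s = "256,192,540,70,0,L|257:217,1,25.000000372529,2|2,0:0|0:0,0:0:0:0:"
p = s.split(",")

x, y, time = int(p[0]), int(p[1]), int(p[2])   # position (256,192), hit at 540 ms
obj_type = int(p[3])                           # 70 is a bitfield, not a plain "circle" code
is_slider = bool(obj_type & 2)                 # bit 1 set -> this object is a slider
new_combo = bool(obj_type & 4)                 # bit 2 set -> it starts a new combo
hitsound = int(p[4])                           # 0 -> default hitnormal only

curve = p[5]          # "L|257:217": linear slider through control point (257,217)
slides = int(p[6])    # 1: the slider is played once, no repeats
length = float(p[7])  # 25.0: slider length in osu! pixels
edge_sounds = p[8]    # "2|2": whistle hitsound on both slider ends
edge_sets = p[9]      # "0:0|0:0": default sample sets for each end
hit_sample = p[10]    # "0:0:0:0:": default hitsample for the object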
yeah they won't map anything lol

the ai is trained to sound human, and osu's map file format is about as far from human-readable as it gets

olibomby made an osu ai mapper, tho the maps it makes are still shit
Ryu Sei
osu! beatmap structure is kinda sophisticated, so even GPT would have a hard time learning to make beatmaps from literally zero (like using a text editor, without even a GUI).
lewski
yeah based on currently existing mapping AIs I don't think it's gonna happen soon

it does seem kinda inevitable though
kaedori
gpt is a natural language model, so unless we figure out a way to translate natural language into osu maps, gpt specifically won't work

but if this happened, it would be interesting to see dozens of lines of "place eight 1-2 vertically mapped jumps during the kiai"
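(toy illustration of what that kind of "translation" could spit out, every number here is made up and it ignores spacing and flow entirely:)

def vertical_jumps(start_ms, beat_ms, count=8, x=256):
    # one circle per beat, alternating between a high and a low y position
    lines = []
    for i in range(count):
        y = 64 if i % 2 == 0 else 320
        lines.append(f"{x},{y},{start_ms + i * beat_ms},1,0,0:0:0:0:")  # type 1 = hit circle
    return lines

print("\n".join(vertical_jumps(start_ms=540, beat_ms=500)))  # eight "vertically mapped jumps"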
Corne2Plum3
Already tried it, and it was just adding a red timing point at every beat, at 7000 BPM...
dight_

Trackest wrote:

Osu! beatmaps are text data, so theoretically GPT should be able to learn to generate them pretty quickly. [...]
maybe if you made a model trained specifically on .osu files you could do it, but that would be incredibly difficult if not impossible for a home user. chat gpt is already alright with simple languages like python, but those are meant to be human readable. osu!'s hitobject markers aren't at all. so idk if gpt will ever be able to generate even a functional osu! map
Leviathan
i hope not
nominomu
well, i feel like an ai would struggle the most with perceiving the music and all the layers of instruments and turning that into a map, and not the placement of objects itself
Luminiscental
decoding what's going on in the song is definitely the main issue. language models like GPT-3 just don't have the capabilities for music deconstruction (although of course you can do these kinds of things with ML, it just requires more thought and a tailor-made model).

moreover it's not clear how you could encode information about the song into something that the language model could effectively process; music most naturally encodes as numerical data (think rhythm as sequences of time ratios or binary data on a beat-grid, melody and harmony as pitch ratios, etc.), which language models are much worse at dealing with than, well, language.
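to make the "binary data on a beat-grid" idea concrete, here is a rough sketch of one possible encoding (librosa for beat and onset detection; the 1/4-beat resolution is an arbitrary choice, not a claim about what would work):

import librosa
import numpy as np

y, sr = librosa.load("song.mp3")                     # placeholder path
tempo, beats = librosa.beat.beat_track(y=y, sr=sr)   # estimated beat positions, in frames
beat_times = librosa.frames_to_time(beats, sr=sr)
onset_times = librosa.onset.onset_detect(y=y, sr=sr, units="time")

# express each onset as a (possibly fractional) beat index, then snap it to a 1/4-beat grid
onset_beats = np.interp(onset_times, beat_times, np.arange(len(beat_times)))
grid_slots = np.unique(np.round(onset_beats * 4).astype(int))
# grid_slots is now a sparse "which 1/4-beat positions have something happening" representation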

if we look at tailor-made models rather than GPT-like language models, you could try some kind of two-stage pipeline: "decode the song à la stem isolation" (this seems like it would be very complicated in practice, especially for arbitrary songs) -> "generate a beatmap from the decoded information à la DALL-E" (it's unclear how much training this would require for good output). while theoretically plausible, this would require a lot of thought about encoding, model architecture, etc., and potentially more training data than we have ranked maps.

so I expect maps generated from audio files via ML not to happen (with any success) anytime soon, unless a serious ML-researcher develops an interest in osu mapping (or an osu mapper becomes a serious ML-researcher).
[DJ]Robsao
why
Randomness64
It's possible if you feed it enough maps, but thankfully my style is as random as it gets, so I'm safe :D
Mafumafu
Zero-shot transfer with only prompt engineering may not be very effective. But I think fine-tuning pre-trained transformers for downstream tasks such as beatmap generation could be feasible, and I am currently trying it.
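Roughly the kind of data preparation I have in mind, as a minimal sketch with Hugging Face transformers (gpt2 and the file path are only placeholders for whatever base model and dataset one would actually use):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# treat the [HitObjects] section of a .osu file as plain text training data
with open("some_map.osu", encoding="utf-8") as f:
    hitobjects = f.read().split("[HitObjects]")[1].strip()

batch = tokenizer(hitobjects, truncation=True, max_length=1024, return_tensors="pt")
out = model(**batch, labels=batch["input_ids"])   # standard causal language modeling objective
out.loss.backward()                               # from here, any usual fine-tuning loop applies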