
[Proposal] Metadata section overhaul

Fycho
To complete the Cantonese part, nold_1702 and I re-worded the Chinese and Cantonese parts of the proposal a bit.
We think Chinese / Standard Chinese / Written vernacular Chinese all refer to the same thing, and Mandarin is more a tone of them than a language of its own. So we went back to just using "Chinese".

Glossary
Character-by-character Romanisation: each Chinese character must be romanised as a capitalised word and separated with a space.

Rules
Songs with Chinese metadata must be romanised in accordance with the Character-by-character method, using the Hanyu Pinyin system, in Romanised fields when there is no Romanisation or translation information listed by a reputable source. The same applies to the Source field if a romanised Source is preferred by the mapper. As they are non-unicode fields, all diacritical tone marks must be omitted. Songs with Cantonese metadata must be romanised using the Jyutping system.
Ulysses
As a speaker of the two languages, I struggle to understand why you keep emphasising that Mandarin and Cantonese are just different tones. They are languages that use Chinese characters, not just different tones. But whatever, it is not something that we have to discuss here. (For those of you who may be interested, here is a fun and short YouTube video explaining the differences and similarities between the two: https://youtu.be/s2km_z4-1T8 )


Anyway, I modified the grammar and changed some words to more precise ones:

Glossary
Character-by-character Romanisation: each Chinese character must be Romanised as a capitalised word and separated with a space.


Rules
Songs with metadata in Chinese must be Romanised in accordance with the Character-by-character method, using the Hanyu Pinyin system, in Romanised fields if there is no Romanisation or translation information listed in a credible source. The same applies to the Source field if a romanised Source is preferred by the mapper. As they are non-unicode fields, all diacritical tone marks must be omitted. Songs with Cantonese metadata must be Romanised using the Jyutping system.
Topic Starter
Okoratu
Removed the part where it talked about unavailability of preferred romanisation: "If the artist provides a preferred way to romanise their title or name, that is to be followed unless it conflicts with other points of this criteria." handles that.

Reverted Glossary

Decluttered the rule into the following statements:
  1. Songs with Chinese metadata are to be handled with respect to the tones and dialects of Chinese they belong to. In any case, all diacritical tone marks must be omitted:
    1. Mandarin metadata must be romanised using the character-by-character method.
    2. Cantonese metadata must be romanised using the Jyutping system.
    3. If the song falls into neither category, this choice is left up to the mapper's discretion.


i hope this is more clear and captures the spirit of what you wanted to say while being more straightforward to digest

ToDo:
- Fixing the loopholes in special character spacing
- !where Korea
- common markers rules
- CrystilonZ's point needs to be applied but idk how

If the artist provides a preferred way to romanise their title or name, that is to be followed unless it conflicts with other points of this criteria.
nope, this refers to special characters and title formatting rulings above you cant stuff that into one thing
how is a translation not officially referring to the song in multiple ways?
Ulysses
Hmmm you missed the Hanyu pinyin thing in the Mandarin part and the character-by-character part in the Cantonese part.

So both of them use the character by character method.
Mandarin uses the Hanyu Pinyin system to romanise Chinese characters,
whereas Cantonese uses the Jyutping system to romanise Chinese characters.

So:

Mandarin metadata must be romanised using the Hanyu Pinyin system.

Cantonese metadata must be romanised using the Jyutping system.

In both cases, the character-by-character method is to be adopted.
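
For anyone who wants to sanity-check a title against the statements above, here is a minimal Python sketch. The PINYIN table is a hand-written, hypothetical stand-in for a real reading dictionary and the function names are made up for illustration; a Jyutping table could be passed in the same way for Cantonese. Only the four Pinyin tone marks are stripped, so how to write ü in a non-unicode field is not covered here.

```python
import unicodedata

# Combining marks used for the four Hanyu Pinyin tones:
# macron (1st), acute (2nd), caron (3rd), grave (4th).
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}

# Hypothetical readings, typed in by hand for illustration only.
PINYIN = {"周": "zhōu", "杰": "jié", "伦": "lún"}


def strip_tone_marks(syllable):
    """Remove the four tone marks from one romanised syllable."""
    decomposed = unicodedata.normalize("NFD", syllable)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def romanise_char_by_char(text, readings):
    """One capitalised word per character, separated by single spaces.

    Raises KeyError for characters missing from the readings table.
    """
    return " ".join(strip_tone_marks(readings[ch]).capitalize() for ch in text)


print(romanise_char_by_char("周杰伦", PINYIN))  # Zhou Jie Lun
```

A real tool would need a proper dictionary and a way to pick between multiple readings of the same character, which this sketch ignores.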
Topic Starter
Okoratu
fixed!
_PhiLL
should one also append (TV Size) to the end of songs which are tv size but don't indicate it in the title? as it stands now, the proposal seems to point to no. what's up with that?
Lanturn
Seeing how discussions have died, I want to post some ideas I was planning on bringing up later since the time limit was close (half a month ago). This also has a few rule changes and guidelines. Some may not even need to be guidelines, but I wanted to spark discussion on them anyways and decide whether or not they are worth adding.

Regarding Full Width Special Characters:
When it comes to adding spaces around special characters, there is one more issue that I think should be addressed. Some languages, like Japanese and Chinese, don't use spaces when reading or writing, and they normally write their special characters in full-width. Japanese is one of the most common languages here in osu!, so this comes up a lot. The comma (、 rather than ,), the colon (: rather than :), brackets (() rather than ()), as well as some others, wouldn't need a space. The current rule doesn't really mention these full-width characters.

For example:
チト(CV:水瀬いのり)、ユーリ(CV:久保ユリカ) (Official)
チト (CV: 水瀬いのり)、 ユーリ (CV: 久保ユリカ) (Proposal)
チト (CV:水瀬いのり)、ユーリ (CV:久保ユリカ) (Full-width without spaces; follows the proposal otherwise, including the parenthesis guideline. The parentheses are half-width, so they would naturally have a leading whitespace.)
Chito (CV: Minase Inori), Yuuri (CV: Kubo Yurika) (Romanized Proposal)

http://www.bjd.com.cn/ A Chinese newspaper site. All special characters are written in full-width and it doesn't utilize spacing.

The tl;dr is that certain special characters in full-width don't need to utilize spaces since they are somewhat naturally included in them. This is not the case with all characters and should be used accordingly.

-----------------------------------------------

Regarding half-width & full-width usages of characters in the Unicode & source fields:
(Brought up to me by S o h)
Special characters should retain their original full-width/half-width characters in the Unicode fields. An exception to this is when they are used for additional complementary info like the CV section or mix descriptors. Improper usage can result in errors while searching. https://osu.ppy.sh/ss/10623085
Example using "カラフル。(Extended edit)"
The period cannot be substituted for its counterpart. "カラフル.(Extended edit)" is not acceptable.
The parenthesis may be either half or full-width. "カラフル。(Extended edit)" is acceptable.

Original width usages should still be prioritized in the unicode field when possible.
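
A rough Python sketch of the check described above, assuming parentheses are the only characters allowed to switch width. The function name and the idea of comparing against a reference title are my own framing for illustration, not part of any existing tool.

```python
# Map full-width parentheses to their half-width counterparts.
# Everything else (for example the full-width period) is left untouched.
PAREN_WIDTHS = str.maketrans({"\uff08": "(", "\uff09": ")"})


def same_apart_from_parentheses(original, submitted):
    """True if the two titles differ at most in the width of their parentheses."""
    return original.translate(PAREN_WIDTHS) == submitted.translate(PAREN_WIDTHS)


original = "カラフル。(Extended edit)"

# Full-width parentheses instead of half-width ones: acceptable.
print(same_apart_from_parentheses(original, "カラフル。\uff08Extended edit\uff09"))

# Half-width period instead of the original full-width one: not acceptable.
print(same_apart_from_parentheses(original, "カラフル.(Extended edit)"))
```

The first call prints True and the second prints False, matching the two カラフル examples.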


------------------------------

Regarding Special Characters and Spacing:
(I posted this earlier, but I might as well add it here)
ジョジョ~その血の運命~ Archetype MIX Ver.
JoJo ~Sono Chi no Sadame~ Archetype MIX Ver.

When a symbol is alone and doesn't have spacing, the romanization should have a whitespace before and after it. (Ex. if the title was "ジョジョ~その" we'd use "JoJo ~ Sono" when romanizing.)

When a symbol comes in pairs (like mentioned above), use a space before the first symbol and after the last symbol (not needed if the symbol is the last character). (Ex. if the title was "ジョジョ~その血の運命~" we would use "JoJo ~Sono Chi no Sadame~".)

This can be excluded if the song has a good enough reason not to use it.
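
The two cases above can be sketched in Python like this. It assumes the romanised title uses the half-width "~" and that at most one pair of the symbol is present; space_symbol is a made-up helper name, not an existing function.

```python
import re


def space_symbol(title, symbol="~"):
    """Apply the lone/paired spacing idea to one symbol in a romanised title."""
    sym = re.escape(symbol)
    if title.count(symbol) == 1:
        # Lone symbol: exactly one space on each side.
        spaced = re.sub(rf"\s*{sym}\s*", f" {symbol} ", title)
    else:
        # Paired symbols: a space before the first and after the last,
        # and no spaces on the inside of the pair.
        spaced = re.sub(
            rf"\s*{sym}\s*(.*?)\s*{sym}\s*",
            rf" {symbol}\1{symbol} ",
            title,
        )
    # Drop the outer space when the symbol starts or ends the title.
    return spaced.strip()


print(space_symbol("JoJo~Sono"))
print(space_symbol("JoJo~Sono Chi no Sadame~"))
print(space_symbol("JoJo~Sono Chi no Sadame~Archetype MIX Ver."))
```

This prints "JoJo ~ Sono", "JoJo ~Sono Chi no Sadame~" and "JoJo ~Sono Chi no Sadame~ Archetype MIX Ver." respectively. Songs with a good reason to deviate would still need to be handled by hand.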

----------------------------------------

Standardizing the Romanised Artist Field Order:
Another topic I want to bring up is one from a few years ago. Since we're trying to 'standardize' metadata, I feel like pushing this old thread: Romanized Artist Preferences, as it would actually fit well with the current proposals.
Right now we basically have to search high and low to find an obscure reference for a preferred romanization when a much simpler method that most database and wiki sites use is a simple standardization of "Family Given" or "Given Family" and such. In the end, our artist fields end up messy to the point that you can't tell which order is which anymore.

Fycho also brought up the point that artists sometimes have an official translated or English name, so we'd have to figure out if those would get more priority or not. Ex. 周杰伦 is Jay Chou in English, but Zhou Jie Lun when romanized.

Right now this is my current proposal:

When romanizing the artist field, it must be written out in the order the Unicode field would be read. The sole exception to this is if the artist has an official translation and is widely known by this name. (Please English this better. The idea is simply that we type any order out how it would be read.)

The second line would be in cases like Girls' Generation where 소녀시대 is romanized as Sonyeo Sidae (I believe). We'd still use Girls' Generation in this case. This also includes the Chinese example mentioned earlier.

Pros:
- Consistent metadata with their Unicode counterparts and we no longer have to check for preferred romanization order anymore.
- It standardizes the romanized artist field for every language, not just Eastern.

Cons:
- It will conflict with some artists' preferred romanization (Kurosaki Maon will be used instead of Maon Kurosaki and such. A lot of famous video game composers are more recognized by Given - Family as well.)

If we're going to standardize things here in osu!, we might as well tackle this since it's also fairly inconsistent at times. Hi Shimotsuki Haruka Shimotsuki.

-----------------------------------------------------

Regarding TV Size:

Even if we were to open this to say, a community vote, (and I might be jumping the gun here) I'm sure the majority would rather include the length markers, so I'll try to keep it simple.

(TV Size) is used for cuts that are used in the show. (Anime/TV Show OP/ED, Insert Songs if shortened, etc)
(Short / Extended Ver.) for everything else. (Game Size is rarely used anyways now I think about it.)
Manually cut songs that closely resemble a (TV Size) on an applicable song would use (TV Size), otherwise, they should use (Short Ver.) or (Extended Ver.)

That's about as simple as I can make it I guess so it's as standardized as possible. The biggest downside to this is that it's difficult to tell Cuts and Official releases apart, but this makes it so we don't have to be direct when it comes to the versions, and it still does mention the length appropriately. The alternative is to use whatever the original release was before the cut, but then it contradicts the point of having a marker to reference the maps length on sight.

The main goal here is to treat the labels more as identifiers and less as official titles, where that makes sense.

--------------------------

Regarding songs that have multiple sources:

When a song has appeared in multiple media, it may use the source that the mapset is themed around (Backgrounds, Storyboards, Videos, etc.) as long as the song itself appeared in it. These should use the direct source instead of the franchise source if applied.
Examples:
https://osu.ppy.sh/s/446547 may use Grand Theft Auto Vice City as the map is themed around it and the song appears in-game.
https://www.youtube.com/watch?v=UrJcQ2nZips may not use Naruto as a source as the song doesn’t appear in any Naruto media, even if the map itself is themed around Naruto. These can be placed in the tags.

------------------------------------

Regarding Original Releases without a source:
This will have to be mostly case by case, but if a song has had a noticeable gap between its original release and then eventually ends up on another media, (take that GTA song mentioned above) the source field isn't required and can be moved to the tags instead.
This may not have to be so much of a time-gap as well. We could try focusing more on if the first source released has any major significance.

----------------------------------------------------

Repeated words in romanization:
When a song uses repeated words in the title (one in unicode, and the other as a basic romanization), the romanized field should omit the repeated word.
Examples:
AIRI-愛離- would normally be AIRI -Airi- as a romanization. This proposal would have the romanized field just be AIRI. The Unicode would still be AIRI-愛離- as it originally is.

A more severe example of this would be:
Normal: (Unicode) 花簪 HANAKANZASHI -> (Romanized) HANAKANZASHI HANAKANZASHI
Proposed: (U)花簪 HANAKANZASHI -> (R)HANAKANZASHI
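
A small Python sketch of what this would look like when applied to an already romanised title. It is only an illustration with a made-up function name: it drops every repeated word, including intentional repeats in a title, so it would work better as a way to flag candidates than as an automatic fix.

```python
import string


def drop_repeated_words(romanised):
    """Keep the first occurrence of each word, ignoring case and surrounding punctuation."""
    seen = set()
    kept = []
    for word in romanised.split():
        # "-Airi-" and "AIRI" should count as the same word.
        key = word.strip(string.punctuation).casefold()
        if key and key in seen:
            continue
        seen.add(key)
        kept.append(word)
    return " ".join(kept)


print(drop_repeated_words("AIRI -Airi-"))
print(drop_repeated_words("HANAKANZASHI HANAKANZASHI"))
```

This prints "AIRI" and "HANAKANZASHI", matching the two proposed results above. The Unicode field is not touched.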

--------------------------------------------------------

Using LOGOS to determine stylization choices:
Sometimes there is little to no info on how to romanize an artist's name from a non-roman language. In the case where the only reference is a logo on a website or a CD cover that writes it in all caps, we should use standard capitalization methods (https://capitalizemytitle.com/), as we generally would in any standard title or name.
Artist preference in any other case must still be followed over this.

In other words, this will hopefully prevent ITO KASHITARO cases from happening again. This is more of a case-by-case guideline, but the idea of romanizing based on what may possibly be just a font has led to some unfavorable romanizations in the past.

----------------------------------

Regarding covers and use of original metadata over the covers
Brought up originally by Monstrata. Sometimes a cover by another singer may be listed with slightly incorrect metadata compared to the original. We should probably use common sense when approaching this and judge them case by case. If the cover itself has very minor errors, then the original title would be recommended. If the cover feels like more of a remix or has been altered in some major way, the cover title would be recommended.


Umm yeah. Sorry I've been kinda absent on this proposal. I'm gonna try to be a bit more active so we can get this pushed forward as it was due 2 weeks ago. Hopefully, we can get this finalized by the end of the month (My goal now)

Anyways. Happy reading. Smack me if anything seems unreasonable. I mostly just want to spark a bit more discussion before we push this forward, and I wanted to attempt to merge a few more ideas I was originally planning on bringing up after this proposal went through.
pw384
Sorry for disturbing, but I would like to confirm whether the character-by-character method is also applied to the Romanization of Chinese artists' names (if s/he hasn't provided an official Romanization). If so, I suggest mentioning it in the proposal as it differs from native users' daily practice, and may result in confusion if not specifically mentioned in the ranking criteria. Like this: "Songs with metadata in Chinese, including both Artist and Title, must be Romanised in accordance with the Character-by-character method..."
Wafu

pw384 wrote:

Sorry for disturbing, but I would like to confirm whether the character-by-character method is also applied to the Romanization of Chinese artists' names (if s/he hasn't provided an official Romanization). If so, I suggest mentioning it in the proposal as it differs from native users' daily practice, and may result in confusion if not specifically mentioned in the ranking criteria. Like this: "Songs with metadata in Chinese, including both Artist and Title, must be Romanised in accordance with the Character-by-character method..."
Can you explain how it differs from "native users' daily practice"? Metadata = artist + title. Every language in osu! is (and always has been) using the same system for artists and titles (unless preferred Romanisation for one exists)
pw384

Wafu wrote:

Can you explain how it differs from "native users' daily practice"? Metadata = artist + title. Every language in osu! is (and always has been) using the same system for artists and titles (unless preferred Romanisation for one exists)
Officially, native speakers never romanize their names via the character-by-character method under any circumstance (e.g. 曹雪芹 is always romanized as Cao Xueqin in daily practice, instead of Cao Xue Qin or Cao Xue qin). So I am not sure whether the character-by-character method is applied to the Artist name, since it contradicts our habits. If it is, a specific clarification in the wording would be better in my opinion.
Topic Starter
Okoratu
@Chinese clarify pls im confused


Can someone sit down with me trying to digest Lanturn's post into rulings
CrystilonZ
@chinese in a nutshell Chinese names are usually romanised like "family given (or given family idk)" with only one space separating given name and family name. osu uses character-by-character cuz of the difficulty of separating words, but there is no difficulty separating given name from family name and vice versa, so Chinese names shouldn't be romanised character-by-character and instead should be romanised normally, with one space separating given name and family name.

Lanturn is just bringing up stuff that he thinks is worth considering. I'll try to simplify it here I guess

1. full width special chars already have built-in space so they don't need whitespace before or after those chars. Proposal should add this statement about full width chars.
2. Full width chars are handled differently. example: "。" is the full width version of the period "." and they are not interchangeable in the unicode field, but half-width brackets "(" and their full-width counterparts "(" are. << should be fixed
3. specify stuff regarding spacing when there are special characters involved. (romanisation)
4. screw artist's romanisation preferences. All names should be romanised like how they are read in their original languages. e.g. Japanese names will be romanised with Family-given order only regardless of artist's preference.
5. (TV size) and other designators.
6. what should we do when a song is featured in a lot of media (like featured in a lot of games/movies/animes). Lanturn proposed that source should be designated according to what the map is themed around (sb/bg etc.)
7. if source doesn't have major significance it can be moved to tags instead.
8. ignore repeated words when romanising stuff.
9. Logos aren't reliable when it comes to capitalisation. should use a standard method instead.
10. Covers often get metadata wrong. Compare data with the original release and use common sense when dealing with covers.
Topic Starter
Okoratu
Hi~

Noffy, Lanturn and I sat down and got this worked out as a draft implementing all the above points.

The draft as a whole is available over there: https://gist.github.com/Okorin/c551fd42 ... f51ffb2736

if nothing else is brought up i'll PR this in a week ok

thank
Fycho
A minor thing: for consistency, romanize => romanise
TheKingHenry

Fycho wrote:

A minor thing: for consistency, romanize => romanise
Pretty sure both ways of typing it out are fine (same with romanization = romanisation) though surely it'd be nice to use it consistently if that's what you meant with this ¯\_(ツ)_/¯
Kurai

Okoratu wrote:

Hi~

Noffy, Lanturn and I sat down and got this worked out as a draft implementing all the above points.

The draft as a whole is available over there: https://gist.github.com/Okorin/c551fd4263e437e0ffcbd3f51ffb2736

if nothing else is brought up i'll PR this in a week ok

thank

Romanisation, Romanise, Romanised, etc. should always be capitalised.

"Lenticular brackets should be romanised to either quotation marks or square brackets depending on the context they are used in."
A bit confusing, what are the two contexts of use?

In "Russian" Romanisation: "ё should be romanised to ye, however, use yo or o to avoid usage of special characters."
Don't even mention that it should be Romanised to ye if we're not doing that, it's confusing. Also, why "yo or o"? It should only be yo; it is never pronounced o, is it?
TheKingHenry
Songs with German metadata must romanise umlauts into two-letter equivalents (ue, oe, ae and ss).
Is this supposed to contain the nordic equivalents of these too? As well as the additional Ø Æ Å and whatever there are, how's the deal with them?
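
For reference, the quoted German rule on its own is easy to sketch in Python. The mapping for capital letters (Ä to Ae and so on) is my assumption, since the quoted wording only lists the lowercase pairs, and nothing here covers the Nordic letters asked about above.

```python
import unicodedata

# Two-letter equivalents from the quoted rule, plus assumed capital forms.
GERMAN_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def romanise_german(text):
    """Replace German umlauts and ß with their two-letter equivalents."""
    # Normalise first so that "a" + combining diaeresis is treated like "ä".
    return unicodedata.normalize("NFC", text).translate(GERMAN_MAP)


print(romanise_german("Mädchen"))
print(romanise_german("Straße"))
print(romanise_german("Über"))
```

This prints "Maedchen", "Strasse" and "Ueber". An all-caps title would come out as "UeBER", so that case would need its own decision.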
Topic Starter
Okoratu
you are a nordic person you should know what deal those are better than i do
i just added the things i knew about languages to that draft
provide me knowledge if it's relevant and i'll add

@kurai, sometimes they're used as actual quotes, sometimes they're used as highly stylized brackets
the context depends on whichever statement makes sense i think

i can't answer your russian question because idk what you're saying

romanise is a normal verb, unless it's referring to fields in the client it should be treated as such
TheKingHenry

Okoratu wrote:

you are a nordic person you should know what deal those are better than i do
i just added the things i knew about languages to that draft
provide me knowledge if it's relevant and i'll add
Okay so first of all I ain't really expert, but maybe some of this might prove useful.

As for Norwegian/Danish, Æ and Ø are roughly the same as Ä and Ö (they can be used for them if needed I think), which in turn are about the same as the respective ones in German, thus ae and oe should work fine mostly.

As for Finnish/Swedish, stuff gets a little trickier, as the umlauts aren't phonological but rather independent graphemes, so they don't have equivalents in the same way. So no idea if there's an official romanisation for those. I've personally seen both ä -> ae and ä -> a style stuff, and to quote a quick bit of Wikipedia info on this: "In contexts of technological limitation, e.g. in English based systems, Swedes can either be forced to omit the diacritics or use the two letter system." which implies both could be fine?
It could also be possible to use the machine-readable versions, as is done in passports and such here. That would recommend just dropping the additions (so ä -> a) but also has some other romanisation options like å -> aa; ä -> ae; ö -> oe for the ones in question here. Stuff like å -> aa might go a little off course though, if the point is to provide romanisations that'd help a non-speaker pronounce it. Or maybe it'd be fine, who knows ¯\_(ツ)_/¯

Hopefully there'd be someone more specialized here to offer some input lol :?
Topic Starter
Okoratu
i can't really make anything set in stone out of this - are there enough songs for this to matter right now?

otherwise i'd go ahead with the part of the draft that is pretty much approved by not disapproving of it anyways
Xayler

Okoratu wrote:

i can't really make anything set in stone out of this - are there enough songs for this to matter right now?

otherwise i'd go ahead with the part of the draft that is pretty much approved by not disapproving of it anyways
Well, there was a song which in normal romanisation should include ä, but didn't because the editor somehow wouldn't accept it. The map was this: https://osu.ppy.sh/s/740535 (that's mentioned in the map's description as well)

The title has Mikk Mae, but it's actually Mikk Mäe in Estonian.


Also what TheKingHenry said... for me, turning ä into ae and ö into oe doesn't make any sense. At least they have completely different pronunciations. We also have the letter õ, should that be ooee then?
For reference, there's a pretty funny video on YouTube that shows, right at the start, how close these pronunciations are: Link

Not that I'm a general expert on other nordic languages, but I at least know mine here. And tbh, using a instead of ae (ä) and o instead of oe (ö) makes more sense.
Topic Starter
Okoratu
as i said i pr'd the changes without the other nordic languages in consideration, we can add those if someone comes up with wording that is understandable

because idk what the heck is going on - case by case should be fine if the amount of songs isnt super high
Krfawy
Oko, the Russian Cyrillic rule still says "use the very ye for the very Russian ё", a part which literally must die as soon as possible, really.
pishifat
finalized