00:21:802 - 00:41:261 I think this whole section is too Hard-diff-like in its spacing, with nearly no spacing changes. I understand that it's for the sake of sectional contrast, but I would say having at least minimal spacing changes helps emphasize the sound and breaks up the monotonous gameplay.
You can reference tiny jumps like 00:39:801 (4,1) - which already feel good enough for showing the strong vocal, and the same can be applied in places like 00:22:774 (3,1) - 00:24:234 (4,1) - or 00:29:099 (6,1) - etc.