Saturday, July 22, 2023

DS Ram Leakage in TestNScripts

I had thought that fixing 'slope unit-tests' in addition to 'wall unit tests' would mean my fixes to get the scorpeye behaving as intended. That would mean the 'summer code review' would be over and the 'summer game dev' could start. Well, that was hoping too much, because there's actually one more test that, for some reason, doesn't show up in ./testme --list but still exists and is important: TestNScripts. This one loads the real .cmd files of SchoolRush and checks parsing went fine. And it does not only check one of them: it will reload many scripts in the same "engine", the way the game does.

while (ntests < n) {
   TestBench tc;
   for (uint i=0; i < REDO; i++) {
       printf("== R%i/%i redo%i ", ntests, n, i);
       ParseScript(tc, scripts[rand()%nscripts], __FUNCTION__);

No assertion failure here, instead one of the trickiest things to track: memory exhaustion. So it's time for me to learn how to use my tracking tools again. I even left documentation for my future self back in the past (thanks, past self ^^). Once again, the problem arise also in 'default'... looks like I had been over-optimistic in September 2021 when I merged sprites overlay into default.

Well, at least if gives me hope this time: on the 'newmap branch', the forgotten test terminates with 'Out of DS memory'. I managed to get a log of what is allocated at that time, but it would require filtering what has leaked from last ParseScript, and what belongs to the current, interrupted because there's no more memory, running ParseScript. Hopefully, when the bug occurs on default, it gives me a nice 'halt because of leakage' and not a 'out of DS memory' condition. 

  • [done] understand why the new branch gets so high on memory consumption
  • [todo] make sure .spr and .cmd files in fakeroot/ are working fine (despite they now need bilou.spr and school.spr)
  • [todo] that should be the job of

edit: dummy! dummy! dummy! ... the 'DS Ram Usage' statistics in those tests is a lie! when it says 95% used, it doesn't mean you'd use 95% of the NDS 4MiB of RAM to play the script. It means that the dummy-allocator-that-never-recycle-memory used 95% of its storage to process the script. Should parsing become more complex (e.g. because we're loading one more file, or creating more intermediate structures), the memory used by the script will grow while the true RAM usage on the NDS would stay roughly the same.

Eeet ouaip. Je me suis donc morfondu un bon morceau de l'été sur le fait que mon moteur de jeu faisait péter sa consommation de mémoire. En tout cas, ça a sérieusement refroidi mon appétit pour le dévelopement de jeu après les heures de boulot. Trouver le problème, ça demandait de se plonger dans une montagne de données et les outils d'il-y-a-longtemps pour comprendre où était la fuite... Sauf qu'en vrai il n'y avait pas de fuite. Tout ça, c'était juste un ou deux messages d'erreur mal nommés qui me faisaient croire que la lecture des scripts prenait plus de 4MB de mémoire (tout ce que la DS a, en somme), sauf que l'environnement de test pour faire tourner les niveaux de mes jeux DS sur PC n'est pas prévu pour qu'on joue, mais pour qu'on trouve les erreurs. 

Donc contrairement à un système de gestion de mémoire ordinaire, quand on lui "rend" un bloc de mémoire il ne le réutilise jamais. La mémoire qu'il fournit à la demande est toujours de la mémoire qui n'a jamais servi pour une autre tâche dans le niveau, et c'est elle qui finissait par faire défaut. Vous me permettrez de croire que si vous avez envie d'en savoir plus, vous êtes aussi en mesure de lire l'Anglais?

If I want to have an estimate of the memory it would take on the NDS, I'd have to keep track of individual allocs and frees and see whether they reach some maximum at a new alloc. (2023-10-23)

What is that one-time allocator ? well, maybe you've heard of bottom-up allocator, where you just keep track of one position in memory: the top-of-used-area, and everything above the top is unused memory. Of course, as soon as you free things out-of-order compared to how you allocated them, that allocator is no longer enough. You would like at least a list of freed-blocks-that-could-further-reduce-top in case the block at the top is eventually freed. Well, the one-time allocator does not even do that. Any byte of one-time memory can be allocated exactly only once. If freed, it is "painted" as free and will remain like that until the ongoing test terminates. 

That is silly as far as memory management is concerned, but it means if some block has not benn freed when the test terminates, you know what role this memory had when it failed. There can't be any tricky-as-hell case where "yeah, that memory had been used for A and then freed, but then it has been re-allocated for B. So maybe B is wrong or maybe something still had a reference of when it was used for A". Sometimes it truly helps. When I'm not misguided by my past self with lying error messages.

Thursday, July 20, 2023

class AppleWalls : public WallTest, UsingApple

So I had managed to get shell/walls interaction mostly fixed. I thought it would be wise to get it run 'unit tests' to ensure nothing got broken by the fix, but it would do more than a few seconds of tests before failing. In Walls/Walk test , a fairly recent addition to MapTests.cpp, for which I do not seem to have any commit where that test is a success.

The test failure will produce patterns such as

--(120,0)+<-508,0>:[-224,0]     's/7f.f, S/fe08.0, '
--(118,0)+<-512,0>     's/7f.f, S/fe08.0, '
--(116,0)+<-516,0>     's/7f.f, S/fe08.0, '

--(114,0)+<-520,0>:[-232,0]     's/77.f, S/fdf8.0, '
--(112,0)     's/77.f, S/fdf8.0, '
--(110,0):[-240,0]     's/77.f, S/fdf8.0, '
--(108,0)     's/77.f, S/fdf8.0, '
--(106,0):[-248,0]     's/6f.f, S/fdf8.0, '
--(104,0)     's/6f.f, S/fdf8.0, '
--(103,0)+<248,0>:[0,0]     's/6f.f, S/fdf8.0, '

which I, too, find cryptic. I should at least inform future self that they are from saved "last state" array, dumped on exception like assert failure during the test.
  • (%{x}, %{y}) is used to report new coordinates
  • <%{vx}, %{vy}> is used to report a new speed
  • [%{dx}, %{dy}] is used to report new delayed step

Together with the arena layout defined by walls1, it can start making sense. For instance, x=104 is the lowest valid position before entering the two-# "wall" of the upper platform, where the Gob-under-test is walking. Granted, this is arcane and confusing, and it would deserve a review fix even if it is only for me. Possibly even more confusing are the `s/%x.%x` codes reported between quotes afterwards. These are generated from ChiefInspector::report() call from the WalkerController.

  • lowercase s indicates that there has been a change of tile considered in doslopes(). hex values are the coordinates of the hotspot used to decide that.
  • uppercase S indicates that doslopes() reported we're on sloped ground. hex values are the speed defined during doslopes().

Even with that wrong report of sloped ground fixed, I still have my 'follower' capable of entering walls. It happens when we stop next to the wall, but do not cancel the 'step value still needed'. More investigation needed.

With some break getspeed and cond BREAKNO (x >> 8) < 105 && cdata[GOB_XSPEED] == -520 in an epic gdb session, I could trace precisely what happens, one GobController call after another, and how the speed, delayed steps and the like where further processed.

The issue turned to be linked to the 'maxmove computation', an engine feature added in 2020 that scans the animation commands ahead of think() calls, so that we don't ask doslopes() to check for something different than what we'll actually do, else the vertical move and the horizontal move won't match and we'll end up out of the slope. The code that decides what to do when we detect a move must be cancelled assumes that the motion debt stored in the 'delayed step' variable has already been validated by controllers. In case of failure, we're safe to move *at least there*.

That was true with the "school rush" engine, where it was helpful to ensure we actually get close enough to the walls for the testpoints to work as intended. It is no longer true if maxmove shortened some of it. How precisely this can be fixed still has needed a bit more time to be figured out.

PS: while looking for the 'inspector codes' in my 2020-notepad, I got my eyes caught by a character stating "this is contemporary to maxmove introduction"... The line below reads "hopefully, no need to tweak things (how many loops are allowed while processing animation instructions, btw) too much. All it takes is building an animation list with enough 'I_MOVETO(+1, *)' commands before it loops."

PPS: of course, fixing things here broke things there, but mostly because code was wrongly assuming that some things (like vertical position or FAIL counter)