Wednesday, May 23, 2007

Beware of DmaCopy()

Between two slides for the ACNM'07 workshop and the shuttle to IEEE IM'07, I've been enjoying my uncle's laptop for some homebrew coding. With its built-in SD card reader and Ubuntu "feisty", it's *way* more convenient than my regular hardware.

So I took the opportunity to debug something odd that occurred when "scrolling" the working grid, a feature I use heavily to make tiles that ... tile. While it worked flawlessly in the desmume emulator (for once), it seemed to miss the first click on the scroll button on real DS hardware. Since I spend my days hunting hardware issues on a Network Processor for my PhD thesis, I came to suspect the DMA copy commands.

And I was right. The scrolling code itself looks something like this:


for (i=0;i<16;i++) {
  char extra=data[i*16]; // save first pixel
  for (j=0;j<15;j++) data[i*16+j]=data[i*16+j+1]; // shift row left by one
  data[15+i*16]=extra; // first pixel wraps around to the end of the row
}
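
For reference, the same row rotation can be sketched with memmove(). This is only an illustration under my own naming: scroll_left, GRID_W and GRID_H are hypothetical and do not come from the original editor code.

```c
#include <assert.h>
#include <string.h>

enum { GRID_W = 16, GRID_H = 16 };  /* assumed grid size, matching the loops above */

/* Rotate every row of a GRID_W x GRID_H byte grid one pixel to the left:
   the leftmost pixel of each row wraps around to the right edge. */
void scroll_left(unsigned char *data)
{
    assert(data != NULL);
    for (int i = 0; i < GRID_H; i++) {
        unsigned char *row = data + i * GRID_W;
        unsigned char first = row[0];        /* save first pixel */
        memmove(row, row + 1, GRID_W - 1);   /* shift the other 15 pixels left */
        row[GRID_W - 1] = first;             /* wrapped pixel goes last */
    }
}
```

All of this happens through ordinary CPU stores, which is what matters for the rest of the story.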

I first shifted the working space (the data[] array) in place, as shown above, and then called the gridWidget::render() method again, which dutifully issued row-by-row DMA copies as below:

for (i=0;i<16;i++)
  dmaCopy(/*from, char*/data+16*i,
          /*to, uint16 */VRAM+ROWSIZE*i,
          /*bytesize*/16);

Well, if you have something like that in your code and it behaves weirdly, rush to query-replace dmaCopy with plain memcpy. I haven't investigated this down to the nitty-gritty details, but keep in mind that everything the ARM processor reads or writes goes through a cache, while DMA transfers operate directly on main memory. Put plainly, a DMA command may very well start reading our row of data _before_ the cache content has been written back to main memory. And bang, we're doomed.

That's what computer scientists call a race condition. Wuuzaa.
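
To make the failure mode concrete, here is a toy model that runs on any PC. It is in no way the DS hardware: toy_mem and every function below are hypothetical names, and the model only assumes a write-back cache sitting between the CPU and the RAM that DMA reads from.

```c
#include <assert.h>
#include <string.h>

enum { LINE = 32 };  /* line size of this toy model (32 bytes, as on the ARM9 afaik) */

/* One cache line sitting in front of "main memory". */
typedef struct {
    unsigned char ram[LINE];     /* what a DMA controller sees */
    unsigned char cached[LINE];  /* what the CPU sees */
    int dirty;                   /* cached[] holds writes not yet in ram[] */
} toy_mem;

void toy_init(toy_mem *m) { memset(m, 0, sizeof *m); }

/* CPU store: lands in the cache only (write-back policy). */
void cpu_write(toy_mem *m, int off, unsigned char v)
{
    assert(off >= 0 && off < LINE);
    m->cached[off] = v;
    m->dirty = 1;
}

/* Write dirty data back to RAM, the write-back half of what a cache flush does. */
void toy_flush(toy_mem *m)
{
    if (m->dirty) {
        memcpy(m->ram, m->cached, LINE);
        m->dirty = 0;
    }
}

/* DMA read: bypasses the cache and copies straight from RAM. */
void dma_read(const toy_mem *m, unsigned char *dst, int len)
{
    assert(len >= 0 && len <= LINE);
    memcpy(dst, m->ram, (size_t)len);
}
```

In this model, a cpu_write() followed directly by dma_read() returns the old RAM content, while inserting toy_flush() between the two returns the fresh value. On the real machine, whether the stale data shows up depends on when the cache happens to evict the line, which would explain why only the first click misbehaved.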

Edit: stellar date 2009.05.29, Coranac (the author of Tonc, a major reference in GBA programming) has published an impressive and detailed tutorial on caching and DMA: http://www.coranac.com/2009/05/28/dma-vs-arm9-fight/

3 comments:

PypeBros said...

The examples shipped with devkitPro use DC_FlushAll(); // Clean and invalidate entire data cache

or possibly DC_FlushRange(), applied to the "source" area of the dmaCopy() before calling it. Presumably, in that case, you would also need a DC_InvalidateRange() on the destination area after the dmaCopy() if you want to read the result back.

PypeBros said...

The examples in devkitPro call DC_FlushAll(), which cleans and invalidates the data cache, before calling dmaCopy(). In other words, it ensures that everything your program has modified is committed to memory before the transfer, and that anything modified by the transfer will be seen by your program (rather than obsolete cache content being reused).

In theory, we could also get by with
DC_FlushRange(src,len);
dmaCopy(src,dst,len);
DC_InvalidateRange(dst,len);
if we are concerned about the performance penalty of a full cache flush.

Afaik, the pre-flush can be omitted for an IO-to-mem transfer, and the post-invalidate can be omitted for a mem-to-IO transfer (unless you read back the content of the IO registers, but that may depend on your cache configuration).
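
The flush / copy / invalidate sequence and the two "can be omitted" cases could be wrapped as sketched below. DC_FlushRange(), DC_InvalidateRange() and dmaCopy() are the libnds calls; coherent_dma_copy and its SRC_CACHED / DST_CACHED flags are hypothetical names of mine. The #else branch only provides stand-ins that record the call order, so the sketch also compiles on a PC, and I am assuming the usual devkitPro makefiles that define ARM9 for ARM9 builds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifdef ARM9
#include <nds.h>   /* real DC_FlushRange, DC_InvalidateRange, dmaCopy */
#else
/* Host stand-ins (not libnds): they only log the order of the calls. */
static char call_log[8];
static size_t call_count;
static void note(char c)
{
    if (call_count < sizeof call_log - 1) call_log[call_count++] = c;
}
static void DC_FlushRange(const void *p, unsigned n)      { (void)p; (void)n; note('F'); }
static void DC_InvalidateRange(const void *p, unsigned n) { (void)p; (void)n; note('I'); }
static void dmaCopy(const void *s, void *d, unsigned n)   { memcpy(d, s, n); note('D'); }
#endif

/* Hypothetical flags: which side of the transfer lives in cached main RAM? */
enum { SRC_CACHED = 1, DST_CACHED = 2 };

void coherent_dma_copy(const void *src, void *dst, unsigned len, int flags)
{
    assert(src != NULL && dst != NULL);
    if (flags & SRC_CACHED)
        DC_FlushRange(src, len);        /* push pending CPU writes out to RAM */
    dmaCopy(src, dst, len);
    if (flags & DST_CACHED)
        DC_InvalidateRange(dst, len);   /* forget stale cached copies of dst */
}
```

For the grid rendering in the post, the call would be something like coherent_dma_copy(data+16*i, VRAM+ROWSIZE*i, 16, SRC_CACHED), assuming VRAM is not cached in your setup.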

PypeBros said...

And Coranac strikes back: "The problem is that a cache-invalidate invalidates entire cache-lines, not just the range you supply. If the start or end of the data you want to invalidate does not align to a cache-line, the adjacent data contained in that line is also thrown away. I hope you can see that this is bad."

As seen on his blog
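
One way to guard against that is to check alignment before invalidating. The helpers below are a sketch with hypothetical names, and CACHE_LINE = 32 is my assumption for the DS ARM9 data cache, so double-check it against your hardware documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CACHE_LINE = 32 };  /* assumed ARM9 data cache line size, in bytes */

/* Returns 1 when [p, p+len) starts and ends on cache-line boundaries,
   i.e. invalidating it cannot throw away a neighbour's data. */
int range_is_line_aligned(const void *p, size_t len)
{
    uintptr_t a = (uintptr_t)p;
    return (a % CACHE_LINE == 0) && (len % CACHE_LINE == 0);
}

/* Round a buffer size up to a whole number of cache lines. */
size_t round_up_to_line(size_t len)
{
    return (len + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
}

/* Debug guard to call before invalidating a DMA destination. */
void check_invalidate_range(const void *p, size_t len)
{
    assert(range_is_line_aligned(p, len));
}
```

Note that with 16-byte rows, two rows of the grid share one 32-byte line, so a per-row invalidate would fail this check every time.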