  8.42      0.80     0.14   156380     0.91     0.91  spr_enum
  6.76      0.91     0.11   483134     0.24     1.31  lcdc_trans
  6.16      1.02     0.10                             cpu_emulate
   .
   .
   .
  0.59      1.61     0.01   216497     0.05     0.05  mem_read

As you can see, not only does mem_read take up (proportionally) 1/20
as much time, since it is rarely called, but the main cpu loop in
cpu_emulate also runs considerably faster with all the function call
overhead and cache misses avoided.

These tests were performed on K6-2/450 with the assembly cores
enabled; your mileage may vary. Regardless, however, I think it's
clear that using the address mapping tables is quite a worthwhile
optimization.


  LCD RENDERING CORE DESIGN

The LCD core presently used in gnuboy is very much a high-level one,
performing the task of rasterizing scanlines as many independent steps
rather than one big loop, as is often seen in other emulators and the
original gnuboy LCD core. In some ways, this is a bit of a tradeoff --
there's a good deal of overhead in rebuilding the tile pattern cache
for roms that change their tile patterns frequently, such as full
motion video demos. Even still, I consider the method we're presently
using far superior to generating the output display directly from the
gameboy tiledata -- in the vast majority of roms, tiles are changed so
infrequently that the overhead is irrelevant. Even if the tiles are
changed rapidly, the only chance for overhead beyond what would be
present in a monolithic rendering loop lies in (host cpu) cache misses
and the possibility that we might (tile pattern) cache a tile that has
changed but that will never actually be used, or that will only be
used in one orientation (horizontally and vertically flipped versions
of all tiles are cached as well). Such tile caching issues could be
addressed in the long term if they cause a problem, but I don't see it
hurting performance too significantly at the present.

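
To make the idea of a tile pattern cache with flipped versions a bit
more concrete, here is a minimal sketch of decoding one tile into four
orientations. It assumes the standard gameboy tile layout (16 bytes
per 8x8 tile, two bitplanes per row). The function name decode_tile
and the output array layout are made up for illustration and are not
the names or layout used in the gnuboy source.

```c
#include <assert.h>
#include <stdint.h>

/* Decode one 8x8 tile (16 bytes, two bitplanes per row) into one byte
 * per pixel, in four orientations:
 *   out[0] normal, out[1] flipped horizontally,
 *   out[2] flipped vertically, out[3] flipped both ways.
 * Name and layout are hypothetical, chosen for this sketch only. */
void decode_tile(const uint8_t src[16], uint8_t out[4][8][8])
{
	int x, y;

	assert(src != 0 && out != 0);
	for (y = 0; y < 8; y++) {
		uint8_t lo = src[2 * y];     /* bit 0 of each pixel in the row */
		uint8_t hi = src[2 * y + 1]; /* bit 1 of each pixel in the row */
		for (x = 0; x < 8; x++) {
			int bit = 7 - x;         /* leftmost pixel lives in bit 7 */
			uint8_t c = (uint8_t)(((lo >> bit) & 1)
				| (((hi >> bit) & 1) << 1));
			out[0][y][x] = c;
			out[1][y][7 - x] = c;
			out[2][7 - y][x] = c;
			out[3][7 - y][7 - x] = c;
		}
	}
}
```

With all four orientations decoded up front, a scanline renderer only
has to pick the right row and copy 8 bytes, which is where the cost of
rebuilding the cache is meant to be earned back.
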
As for host cpu cache miss issues, I find that putting multiple data
decoding and rendering steps together in a single loop harms
performance much more significantly than building a 256k (pattern)
cache table, on account of interfering with branch prediction,
register allocation, and so on.

Well, with those justifications given, let's proceed to the steps
involved in rendering a scanline:

updatepatpix() - updates tile pattern cache.

tilebuf() - reads gb tile memory according to its complicated tile
addressing system which can be changed via the LCDC register, and
outputs nice linear arrays of the actual tile indices used in the
background and window on the present line.

Before continuing, let me explain the output format used by the
following functions. There is a byte array scan.buf, accessible by
macro as BUF, which is the output buffer for the line. The structure
of this array is simple: it is composed of 6 bpp gameboy color
numbers, where the bits 0-1 are the color number from the tile, bits
2-4 are the (cgb or dmg) palette index, and bit 5 is 0 for background
or window, 1 for sprite.

What is the justification for using a strange format like this, rather
than raw host color numbers for output? Well, believe it or not, it
improves performance. It's already necessary to have the gameboy color
numbers available for use in sprite priority. And, when running in
mono gb mode, building this output data is VERY fast -- it's just a
matter of doing 64 bit copies from the tile pattern cache to the
output buffer.

Furthermore, using a unified output format like this eliminates the
need to have separate rendering functions for each host color depth or
mode. We just call a one-line function to apply a palette to the
output buffer as we copy it to the video display, and we're done.
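
As a rough illustration of the 6 bpp format and the palette step
described above, the following sketch packs the three fields into one
byte and then maps a line of such bytes through a 64-entry table of
host colors. The names pack_pixel and apply_palette16 are hypothetical
and chosen for this example; they are not the render_[124] routines
themselves, only the same general idea for a 2 bytes-per-pixel target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack one intermediate pixel (hypothetical helper):
 *   bits 0-1: color number from the tile (0-3)
 *   bits 2-4: palette index (0-7)
 *   bit  5  : 0 = background or window, 1 = sprite */
uint8_t pack_pixel(unsigned color, unsigned pal, unsigned is_sprite)
{
	assert(color < 4 && pal < 8 && is_sprite < 2);
	return (uint8_t)(color | (pal << 2) | (is_sprite << 5));
}

/* Copy n pixels to a 16 bpp destination, looking each 6 bpp value up
 * in a 64-entry table of host colors (hypothetical helper). */
void apply_palette16(uint16_t *dest, const uint8_t *src,
	const uint16_t pal[64], size_t n)
{
	while (n--)
		*dest++ = pal[*src++ & 63];
}
```

Since every valid intermediate value fits in 6 bits, one 64-entry
table per host color depth is enough, whatever mix of background,
window and sprite pixels ends up on the line.
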
And, if you're not convinced about performance, just do some
profiling. You'll see that the vast majority of the graphics time is
spent in the one-line copy function (render_[124] depending on bytes
per pixel), even when using the fast asm versions of those routines.
That is to say, any overhead in the following functions is for all
intents and purposes irrelevant to performance. With that said, here
they are:

bg_scan() - expands the background layer to the output buffer.

wnd_scan() - expands the window layer.

spr_scan() - expands the sprites. Note that this requires spr_enum()
to have been called already to build a list of which sprites are
visible on the current scanline and sort them by priority.

It should be noted that the background and window functions also have
color counterparts, which are considerably slower due to merging of
palette data. At this point, they're staying down around 8% time
according to the profiler, so I don't see a major need to rewrite them
anytime soon. It should be considered, however, that a different
intermediate format could be used for gbc, or that asm versions of
these two routines could be written, in the long term.

Finally, some notes on palettes. You may be wondering why the 6 bpp
intermediate output can't be used directly on 256-color display
targets. After all, that would give a huge performance boost. The
problem, however, is that the gameboy palette can change midscreen,
whereas none of the presently targeted host systems can handle such a
thing, much less do it portably. For color roms, using our own
internal color mappings in addition to the host system palette is
essential. For details on how this is accomplished, read palette.c.

Now, in the long term, it MAY be possible to use the 6 bpp color
"almost" directly for mono roms. Note that I say almost. The idea is
this. Using the color number as an index into a table is slow. It
takes an extra read and causes various pipeline stalls depending on
the host cpu architecture.
But, since there are relatively few possible mono palettes, it may
actually be possible to set up the host palette in a clever way so as
to cover all the possibilities, then use some fancy arithmetic or
bit-twiddling to convert without a lookup table -- and this could
presumably be done 4 pixels at a time with 32bit operations. This area
remains to be explored, but if it works, it might end up being the
last hurdle to getting realtime emulation working on very low-end
systems like i486.


  SOUND

Rather than processing sound after every few instructions (and thus
killing the cache coherency), we update sound in big chunks. Yet this
in no way affects precise sound timing, because sound_mix is always
called before reading or writing a sound register, and at the end of
each frame.

The main sound module interfaces with the system-specific code through
one structure, pcm, and a few functions: pcm_init, pcm_close, and
pcm_submit. While the first two should be obvious, pcm_submit needs
some explaining. Whenever realtime sound output is operational,
pcm_submit is responsible for timing, and should not return until it
has successfully processed all the data in its input buffer (pcm.buf).
On *nix sound devices, this typically means just waiting for the write
syscall to return, but on systems such as DOS where low level IO must
be handled in the program, pcm_submit needs to delay until the current
position in the DMA buffer has advanced sufficiently to make space for
the new samples, then copy them.

For special sound output implementations like write-to-file or the
dummy sound device, pcm_submit should write the data immediately and
return 0, indicating to the caller that other methods must be used for
timing. On real sound devices that are presently functional,
pcm_submit should return 1, regardless of whether it buffered or
actually wrote the sound data.

And yes, for unices without OSS, we hope to add piped audio output
soon. Perhaps Sun audio device and a few others as well.
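
To illustrate the pcm_submit return value convention described above,
here is a minimal sketch of the non-realtime case (write-to-file or
dummy device). Only pcm.buf is named in this document; the struct
name, the len and pos fields, the function name pcm_submit_sketch and
its parameters are all assumptions made for the example, and the real
interface may well look different.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the pcm structure. Only buf is named in
 * the text; len and pos are assumptions made for this sketch. */
struct pcm_sketch {
	unsigned char *buf; /* mixed samples waiting for output */
	int len;            /* capacity of buf in bytes (assumed) */
	int pos;            /* bytes currently queued (assumed) */
};

/* Non-realtime output: write (or discard) everything at once.
 * Returns 0 to tell the caller that it must do its own timing.
 * A NULL stream models the dummy sound device. */
int pcm_submit_sketch(struct pcm_sketch *p, FILE *out)
{
	assert(p != NULL && p->pos >= 0 && p->pos <= p->len);
	if (out != NULL && p->pos > 0)
		fwrite(p->buf, 1, (size_t)p->pos, out);
	p->pos = 0; /* the buffer is free for the next chunk */
	return 0;
}
```

A realtime backend following the same convention would instead block
until the device has accepted the samples, and then return 1 so that
the caller knows the sound device is providing the timing.
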


  OPTIMIZED ASSEMBLY CODE

A lot can be said on this matter. Nothing has been said yet.


  INTERACTIVE DEBUGGER

Apologies, there is no interactive debugger in gnuboy at present. I'm
still working out the design for it. In the long run, it should be
integrated with the rc subsystem, kinda like a cross between gdb and
Quake's ever-famous console. Whether it will require a terminal device
or support the graphical display remains to be determined.

In the meantime, you can use the debug trace code already
implemented. Just "set trace 1" from your gnuboy.rc or the command
line. Read debug.c for info on how to interpret the output, which is
condensed as much as possible and not quite self-explanatory.


  PORTING

On all systems on which it is available, the gnu compiler should
probably be used. Writing code specific to non-free compilers makes it
impossible for free software users to actively contribute. On the
other hand, compiler-specific code should always be kept to a minimum,
to make porting to or from non-gnu compilers easier.

Porting to new cpu architectures should not be necessary. Just make
sure you unset IS_LITTLE_ENDIAN in the makefiles to enable the big
endian default if the target system is big endian. If you do have
problems building on certain cpus, however, let us know. Eventually,
we will also want asm cpu and graphics code for popular host cpus, but
this can wait, since the c code should be sufficiently fast on most
platforms.

The bulk of porting efforts will probably be spent on adding support
for new operating systems, and on systems with multiple video (or
sound, once that's implemented) architectures, new interfaces for
those. In general, the operating system interface code goes in a
directory under sys/ named for the os (e.g. sys/nix/ for *nix
systems), and display interfaces likewise go in their respective
directories under sys/ (e.g.
sys/x11/ for the x window system interface).

For guidelines in writing new system and display interface modules, I
recommend reading the files in the sys/dos/, sys/svga/, and sys/nix/
directories. These are some of the simpler versions (aside from the
tricky dos keyboard handling), as opposed to all the mess needed for
x11 support.

Also, please be aware that the existing system and display interface
modules are somewhat primitive; they are designed to be as quick and
sloppy as possible while still functioning properly. Eventually they
will be greatly improved.

Finally, remember your obligations under the GNU GPL. If you produce
any binaries that are compiled strictly from the source you received,
and you intend to release those, you *must* also release the exact
sources you used to produce those binaries. This is not pseudo-free
software like Snes9x where binaries usually appear before the latest
source, and where the source only compiles on one or two platforms;
this is true free software, and the source to all binaries always
needs to be available at the same time or sooner than the
corresponding binaries, if binaries are to be released at all. This of
course applies to all releases, not just new ports, but from
experience I find that ports people usually need the most reminding.


  EPILOGUE

That's it for now. More info will eventually follow. Happy hacking!