% 4. Scheduling
\begin{slide}
  \heading{Scheduling}

  Each {\tt CPU} module implements the method:
{\footnotesize\begin{verbatim}
  void run(ClockValue timeslice);
\end{verbatim}}
  which is supposed to execute exactly {\tt timeslice} instructions
  before returning.

  Scheduling is implemented by each CPU module, which, on each
  timeslice, should schedule a preemption event at the indicated time.

  Each CPU has its own clock frequency. The scheduler
  ({\tt CPU::run\_all} or the Tcl {\tt run} command) uses it to
  compute the timeslice of each CPU as:
{\footnotesize\begin{verbatim}
  cpu->timeslice = cpu->freq / freq;
\end{verbatim}}
  where {\tt freq} is the GCD of all CPU frequencies.

  The timeslices are also adjusted to ensure that each CPU's
  timeslice is in some preset range (i.e. between 32 and 1024 cycles)
  for performance reasons.
\end{slide}

% 5. Checkpointing
\begin{slide}
  \heading{Checkpointing}

  I provide facilities for saving the simulator state to disk.
  Each module takes care of only its own state, and the complete
  state is restored from a text file on request.

  The magic is performed using C++ multiple virtual inheritance (eek!).

  Each module {\tt T} must:
  \begin{itemize}
    \item be derived from {\tt virtual Serializable}
    \item define a member {\tt static SerialType<T> type} specifying
          its name and description
    \item define a checkpointing constructor {\tt T(Checkpoint\& cp)}
    \item define a {\tt T::checkpoint(Checkpoint\& cp, bool parent)}
          member.
  \end{itemize}
\end{slide}

\begin{slide}
  {\headcol Example:}
{\footnotesize\begin{verbatim}
  class MT48Tx2
    : public Module, public Device, public virtual Serializable
  {
  private:
    static SerialType<MT48Tx2> type;
    // ...
  public:
    MT48Tx2(Checkpoint& cp);
    void checkpoint(Checkpoint& cp, bool parent = false) const;
    // ...
  };

  SerialType<MT48Tx2> MT48Tx2::type(
    "MT48Tx2",
    "T48T02/12 Timekeeper(R) RAM"
  );
\end{verbatim}}
\vfill
\end{slide}

\begin{slide}
{\footnotesize\begin{verbatim}
  MT48Tx2::MT48Tx2(Checkpoint &cp)
    : Module(cp), Device(8), file_name(extract_word(cp))
  {
    define("fileName", conf.file_name, file_name);
    cp >> conf.read_latency >> conf.write_latency >> '\n';
    define("readLatency", conf.read_latency, conf.read_latency);
    define("writeLatency", conf.write_latency, conf.write_latency);
    reset(false);
  }
\end{verbatim}}
\vfill
\end{slide}

\begin{slide}
{\footnotesize\begin{verbatim}
  void
  MT48Tx2::checkpoint(Checkpoint &cp, bool parent) const
  {
    ClockValue rl = nanoseconds(read_latency, freq);
    ClockValue wl = nanoseconds(write_latency, freq);

    if (parent)
      cp << type << ' ' << id << '\n';

    cp << file_name << ' ' << rl << ' ' << wl << '\n';

    if (file_name[0]) {
      FILE *fp = fopen(conf.file_name, "wb");
      if (!fp)
        throw FileError(conf.file_name);
      fwrite(data, 1, size, fp);
      // Close the file exactly once, whether or not the write failed.
      bool failed = ferror(fp) != 0;
      if (fclose(fp) != 0)
        failed = true;
      if (failed)
        throw FileError(conf.file_name);
    }
  }
\end{verbatim}}
\end{slide}

% 6. A CPU simulator
\begin{slide}
  \heading{A CPU module: Koala}

  Everything happens in the {\tt run} loop:
{\footnotesize\begin{verbatim}
void
Koala::run(ClockValue timeslice)
{
    setjmp(env);
    for (;;) {

        // Reset gpr[0].
        gpr[0] = 0;

        // First of all, increment the PClock counter. After this, all we need
        // to handle is any slips and stalls.
        ++now;

        // Check for interrupts. In real hardware, these have a priority lower
        // than all exceptions, but simulating this effect is too hard to be
        // worth the effort (interrupts and resets are not meant to be
        // delivered accurately anyway.)
        if (events) {
            if (bits(events, 7, 0))
                process_reset();
            else if (bit(cp0[SR], SR_IE) && (events & cp0[SR]))
                process_interrupt();
        }

        // Look up the ITLB. It's not clear from the manuals whether the ITLB
        // stores the ASIDs or not. I assume it does. ITLB has the same size
        // as in the real hardware, mapping two 4KB pages.
        // Because decoding a
        // MIPS64 virtual address is far from trivial, ITLB and DTLB actually
        // improve the simulator's performance: something I cannot say about
        // caches and JTLB.
        PA pa;
        VA vpn = pc / 4096;
        if (vpn == itlb[0].vpn && asid_match(asid, itlb[0].asid)) {
            pa = itlb[0].pa + (pc % 4096);
            lru_itlb = 1;
        }
        else if (vpn == itlb[1].vpn && asid_match(asid, itlb[1].asid)) {
            pa = itlb[1].pa + (pc % 4096);
            lru_itlb = 0;
        }
        else {
            // Do a full address translation. This introduces a slip in the I
            // pipeline stage. The slip costs 1 cycle for branch, jump and
            // ERET instructions, and 2 cycles otherwise.
            ++now;
            pa = translate_vaddr(pc, instr_fetch);
            itlb[lru_itlb].vpn = vpn;
            itlb[lru_itlb].asid = asid;
            itlb[lru_itlb].pa = round_down(pa, 4096);
            lru_itlb = !lru_itlb;
        }

        // Access the instruction cache. Because the simulated caches are
        // slow, we maintain a two-entry buffer to cut down full fetches by
        // a factor of (up to) sixteen.
        Instr instr;
        if (ibuf_match(pc, ibuf[0].tag)) {
            instr = swizzle<word>(ibuf[0][pc], pc);
            lru_ibuf = 1;
        }
        else if (ibuf_match(pc, ibuf[1].tag)) {
            instr = swizzle<word>(ibuf[1][pc], pc);
            lru_ibuf = 0;
        }
        else {
            // No such luck: fetch the data from the cache.
            instr = fetch(pc, pa);
        }

        // Now, we can decode and execute the instruction.
        int next_state = decode(instr);

        // Dump the registers if required.
        if (trace_level >= dump_gprs)
            dump_gpr_registers();

        // Advance the PC.
        switch (pipeline) {
        case nothing_special:
            pc += 4;
            break;
        case branch_delay:
            pc = branch_target;
            break;
        case instr_addr_error:
            process_address_error(instr_fetch, branch_target);
        }
        pipeline = next_state;
    }
}
\end{verbatim}}
\end{slide}

\begin{slide}
  {\headcol Caches}

  Implemented as a completely generic module parameterized on
  the cache geometry:
{\footnotesize\begin{verbatim}
template <int log2_size, int log2_line_size, int log2_assoc>
struct MIPS64Cache
{
    // ...

    // Actual cache data itself.
    Set set[sets];

    // Extract an index from a virtual address.
    static VA index(VA va)
    {
        return bits(va, index_last, index_first);
    }
    static VA block(VA va)
    {
        return bits(va, index_last + log2_assoc, index_last + 1);
    }
    static PA tag(PA pa)
    {
        return bits(pa, paddr_width - 1, index_last + 1);
    }
};
\end{verbatim}}
\end{slide}

\begin{slide}
  which is used in {\tt Koala} as:
{\footnotesize\begin{verbatim}
  typedef MIPS64Cache<log2_icache_size,log2_icache_line,log2_icache_assoc>
    ICache;
  typedef MIPS64Cache<log2_dcache_size,log2_dcache_line,log2_dcache_assoc>
    DCache;
  ICache icache;
  DCache dcache;
\end{verbatim}}
\vfill
\end{slide}

\begin{slide}
  \heading{Putting it all together}

  To simulate a system with Sulima:
  \begin{slumerate}
    \item decide which peripheral devices you want simulated and to
          what degree of accuracy, and design simplified devices for
          everything else
    \item implement modules for each of the above
    \item if {\tt MIPS64Bus} is not good enough, implement a similar
          memory controller for your CPU
    \item implement a CPU state structure (e.g. {\tt MIPS64Cpu})
    \item implement the CPU on top of it, using existing caches.
          Most of the effort will go into designing the TLB and, of
          course, the instruction decoder loop
    \item debug it all
  \end{slumerate}
\end{slide}

% 7. Future directions
\begin{slide}
  \heading{TO DO}

  \begin{slitemize}
    \item a generic FPU module
    \item a generic annotation/debugging interface in Kaffe or C
    \item an L{\it n} cache (trivial modification of {\tt MIPS64Cache})
    \item an SMP memory controller
    \item translation-grade simulator
    \item simulators for ARM, Alpha and SPARC
    \item simulators for Ethernet cards, SCSI cards and other such joy
  \end{slitemize}
\end{slide}

\end{document}