technical_background.txt

This is the source code of a VDR plugin for the Linux environment, supporting
card reading, card sharing and other operations for CA systems such as
Irdeto, Seca, Viaccess, Nagra, Conax & Cryptoworks.
-------FFdecsa-------

This doc is for people who looked into the source code and found it
difficult to believe that this is a decsa algorithm, as it appears
completely different from other decsa implementations.

It appears different because it is different. Being different is what
enables it to be a lot faster than all the others (currently it has
more than 800% the speed of the best version I was able to find).

The csa algo was designed to be run in hardware, but people are now
running it in software.

Hardware has data lines carrying bits and functional blocks doing
calculations (logic operations, adders, shifters, table lookup, ...);
software instead uses memory to contain data values and executes a
sequence of instructions to transform the values. As a consequence,
writing a software implementation of a hardware algorithm can be
inefficient.

For example, if you have 32 data lines, you can permute the bits with
zero cost in hardware (you just permute the physical traces), but if
you have the bits in a 32 bit variable you have to use 32 "and"
operations with 32 different masks, 32 shifts and 31 "or" operations
(if you suggest using "if"s testing the bits one by one, you know
nothing about how jump prediction works in modern processors).

So the approach is *emulating the hardware*. Then there are some
additional cool tricks.

TRICK NUMBER 0: emulate the hardware
------------------------------------

We will work on bits one by one, that is, a 4 bit word is now four
variables. In this way we turn complex software operations back into
hardware emulation:

  software                        hardware
  --------------------------------------------------------------------
  copy values                     copy values
  logic op                        logic op
  (bit permut.) ands+shifts+ors   copy values
  additions                       logic op emulating adders
  (comparisons) if                logic op selecting one of the two results
  lookup tables                   logic op synthesizing a ROM (*)

  (*) sometimes lookup tables can be converted to logic expressions

The sbox in the stream cypher has been converted to efficient logic
operations using custom written software (look into the logic
directory) and is responsible for a lot of the speed increase. Maybe
there exists a slightly better way to express the sbox as logical
expressions, but it would be a minuscule improvement. The sbox in the
block cypher can't be converted to efficient logic operations (8 bits
of input are just too much) and is implemented with a traditional
lookup in an array.

But there is a problem: we want to process bits, but our external
input and output want bytes. We need conversion routines. Conversion
routines are similar to the awful permutations we described before, so
this has to be done efficiently somehow.

TRICK NUMBER 1: virtual shift registers
---------------------------------------

Shift registers are normally implemented by moving all data around.
Better to leave the data in the same memory locations and redefine
where the start of the register is (updating a pointer). That is
called a virtual shift register.

TRICK NUMBER 2: parallel bitslice
---------------------------------

Implementing the algorithm as described in tricks 0 and 1 gives us
about 15% of the speed of a traditional implementation. This happens
because we work on only one bit, even if our CPU is 32 bit wide. But
*we can process 32 different packets at the same time*. This is called
the "bitslice" method. It can be done only if the program flow is not
dependent on the data (if, while, ...).
Luckily this is true. Things like

  if(a){
    b=c&d;
  }
  else{
    b=e&f;
  }

can be coded as (think of how hardware would implement this)

  b1=c&d;
  b2=e&f;
  b=b2^(a&(b1^b2));

and things like

  if(a){
    b=c&d;
  }

can be transformed in the same way, as they may be written as

  if(a){
    b=c&d;
  }
  else{
    b=b;
  }

It could look wasteful, but it is not, and it destroys data
dependency. Our code takes the same time as before, but produces 32
results, so speed is now 480% the speed of a traditional
implementation.

TRICK NUMBER 3: multimedia instructions
---------------------------------------

If our CPU is 32 bit but it can also process larger blocks of data
efficiently (multimedia instructions), we can use them. We only need
logic ops and these are typically available.

We can use MMX and work on 64 packets, or SSE and work on 128 packets.
The speed doesn't automatically double going from 32 to 64 because the
integer registers of the processor are normally faster. However, some
speed is gained in this way.

Multimedia instructions are often used by writing assembler by hand,
but compilers are very good at register allocation, loop unrolling and
instruction scheduling, so it is better to write the code in C and use
native multimedia data types (intrinsics).

Depending on the number of available registers, execution latency and
number of execution units in the CPU, it may be good to process more
than one data block at the same time, for example 2 64 bit MMX values.
In this case we work on 128 bits by simulating a 128 bit op with two
consecutive 64 bit ops. This may or may not help (apparently not,
because the x86 architecture has a small number of registers).

We can also try working on 96 bits, pairing an MMX and an int op, or
on 192 bits by using MMX and SSE.
While this is doable in theory and could exploit different execution
units in the CPU, speed doesn't improve (because of cache line
handling problems inside the CPU, maybe).

Besides int, MMX and SSE, we can use long long int (64 bit) and, why
not, unsigned char.

Using groups of unsigned chars (8 or 16) could give the compiler an
opportunity to insert multimedia instructions automatically. For
example, icc can use one MMX instruction to do

  unsigned char a[8],b[8],c[8];
  for(i=0;i<8;i++){
    a[i]=b[i]&c[i];
  }

Some compilers (like icc) are efficient in this case, but using
intrinsics manually is generally faster.

All these experiments can be easily done if the code is written in a
way which abstracts the data type used. This is not easy but doable:
all the operations on data become (inlined) function calls or
preprocessor macros. Good compilers are able to simplify all the
abstraction at compile time and generate perfect code (gcc is great).
The data abstraction used in the code is called "group".

TRICK NUMBER 4: parallel byteslice
----------------------------------

The bitslice method works wonderfully on the stream cypher, but can't
be applied to the block cypher because of the evil big look up table.

As we have to convert input data from normal to bitslice before
starting processing and from bitslice to normal before output, we
convert the stream cypher output to normal before the block
calculations and do the block stage in a traditional way.

There are some xors in the block cypher, so we arrange bytes from
different packets side by side and use multimedia instructions to work
on many bytes at the same time. This is not exactly bitslice; maybe it
should be called byteslice. The conversion routines are similar (just
a bit simpler). The data type we use to do this in the code is called
"batch". The virtual shift register described in trick number 1 is
useful too.

The look up table is the only thing which is done serially, one byte
at a time.
Luckily, if we do it on 32 or 64 bytes the loop is heavily unrolled,
and the compiler and the CPU manage to get a good speed because there
is little dependency between instructions.

TRICK NUMBER 5: efficient bit permutation
-----------------------------------------

The block cypher has a bit permutation part. As we are not in a bit
sliced form at that point, permuting bits in a byte takes 8 masks, 8
and, 7 or; but three bits move in the same direction, so we make it
with 6 masks, 6 and, 5 or. Batch processing through multimedia
instructions is applicable too.

TRICK NUMBER 6: efficient normal<->slice conversion
---------------------------------------------------

The bitslice<->normal conversion routines are a sort of transposition
operation: you have bits in rows and want them in columns. This can be
done efficiently. For example, transposition of 8 bytes (a matrix of
8x8=64 bits) can be done this way (we want to exchange bit[i][j] with
bit[j][i] and we assume bit 0 is the MSB in the byte):

  // untested code, may be bugged
  unsigned char a[8];
  unsigned char b[8];
  for(i=0;i<8;i++) b[i]=0;
  for(i=0;i<8;i++){
    for(j=0;j<8;j++){
      b[i]|=((a[j]>>(7-i))&1)<<(7-j);
    }
  }

but it is slow (128 shifts, 64 and, 64 or), or

  // untested code, may be bugged
  unsigned char a[8];
  unsigned char b[8];
  for(i=0;i<8;i++) b[i]=0;
  for(i=0;i<8;i++){
    for(j=0;j<8;j++){
      if(a[j]&(1<<(7-i))) b[i]|=1<<(7-j);
    }
  }

but it is very very slow (128 shifts, 64 and, 64 or, 128 unpredictable
ifs!), or using a>>=1 and b<<=1, which gains you nothing, or

  // untested code, may be bugged
  unsigned char a[8];
  unsigned char b[8];
  unsigned char top,bottom;
  for(j=0;j<1;j++){
    for(i=0;i<4;i++){
      top=   a[8*j+i];
      bottom=a[8*j+4+i];
      a[8*j+i]  = (top&0xf0)    |((bottom&0xf0)>>4);
      a[8*j+4+i]=((top&0x0f)<<4)| (bottom&0x0f);
    }
  }
  for(j=0;j<2;j++){
    for(i=0;i<2;i++){
      top=   a[4*j+i];
      bottom=a[4*j+2+i];
      a[4*j+i]  = (top&0xcc)    |((bottom&0xcc)>>2);
      a[4*j+2+i]=((top&0x33)<<2)| (bottom&0x33);
    }
  }
  for(j=0;j<4;j++){
    for(i=0;i<1;i++){
      top=   a[2*j+i];
      bottom=a[2*j+1+i];
      a[2*j+i]  = (top&0xaa)    |((bottom&0xaa)>>1);
      a[2*j+1+i]=((top&0x55)<<1)| (bottom&0x55);
    }
  }
  for(i=0;i<8;i++) b[i]=a[i]; // easy to integrate into one of the stages above

which is very fast (24 shifts, 48 and, 24 or) and has redundant loops
and address calculations which will be optimized away by the compiler.
It can be written as 3 nested loops, but it becomes less readable and
makes it difficult to have the results in b without an extra copy. The
compiler always unrolls heavily.

The gain is much bigger when operating with 32 bit or 64 bit values
(we are going from N^2 to N*log(N)). This method is used for
rectangular matrices too (they have to be seen as square matrices side
by side).

Warning: this code is not *endian independent* if you use ints to work
on 4 bytes. Running it on a big endian processor will give you a
different and strange kind of bit rotation if you don't modify the
masks and shifts.

This is done in the code using int or long long int. It should be
possible to use MMX instead of long long int and it could be faster,
but this code doesn't cost a great fraction of the total time. There
are problems with the shifts, as multimedia instructions do not have
all the kinds of shift we need (SSE has none!).

TRICK NUMBER 7: try hard to process packets together
----------------------------------------------------

As we are able to process many packets together, we have to avoid
running with many slots empty. Processing one packet or 64 packets
takes the same time if the internal parallelism is 64! So we try hard
to aggregate packets that can be processed together; for simplicity
reasons we don't mix packets with even and odd parity (different
keys), even if it should be doable with a little effort. Sometimes the
transition from even to odd parity and vice versa is not sharp, and
there are sequences like EEEEEOEEOEEOOOO.
We try to group all the E together even if there are O between them.
This out-of-order processing complicates the interface to the
applications a bit, but saves us three or four runs with many empty
slots.

We also have logic to process together packets with a different size
of payload, which is not always 184 bytes. This involves sorting the
packets by size before processing and careful operation of the 23
iteration loop to exclude some packets from the calculations. It is
not CPU heavy.

Packets with payload <8 bytes are identical before and after
decryption (!), so we skip them without using a slot. (According to
the DVB specs these kinds of packets shouldn't happen, but they are
used in the real world.)

TRICK NUMBER 8: try to avoid doing the same thing many times
------------------------------------------------------------

Some calculations related to keys are only done when the keys are set;
then all the values depending on the keys are stored in a convenient
form and used every time we convert a group of packets.

TRICK NUMBER 9: compiler
------------------------

Compilers have a lot of optimization options. I used -march to target
my CPU and played with unusual options. In particular

  "--param max-unrolled-insns=500"

does a good job on the tricky table lookup in the block cypher. Bigger
values unroll too much somewhere and lose speed. All the testing has
been done on an AthlonXP CPU with a specific version of gcc

  gcc version 3.3.3 20040412 (Red Hat Linux 3.3.3-7)

Other combinations of CPU and compiler can give different speeds. If
the compiler is not able to simplify the group and batch structures
and stores everything in memory instead of registers, performance will
be low. Absolutely use a good compiler!

Note: the same code can be compiled in C or C++ mode.
g++ gives a 3% speed increase compared to gcc (I suppose some stricter
constraints on arrays and pointers in C++ mode give the optimizer more
freedom).

TRICK NUMBER a: a lot of brain work
-----------------------------------

The code started as a very slow but correct implementation and was
then tweaked for months, with a lot of experimentation, adding the
good ideas one after another to achieve little steps toward the best
speed possible, while continuously testing that nothing had been
broken. Many hours were spent on this code.

Enjoy the result.
