1Introduction
2============
3
4Having looked at the linux mtd/nand driver and more specific at nand_ecc.c
5I felt there was room for optimisation. I bashed the code for a few hours
6performing tricks like table lookup removing superfluous code etc.
7After that the speed was increased by 35-40%.
8Still I was not too happy as I felt there was additional room for improvement.
9
10Bad! I was hooked.
11I decided to annotate my steps in this file. Perhaps it is useful to someone
12or someone learns something from it.
13
14
15The problem
16===========
17
18NAND flash (at least SLC one) typically has sectors of 256 bytes.
19However NAND flash is not extremely reliable so some error detection
20(and sometimes correction) is needed.
21
22This is done by means of a Hamming code. I'll try to explain it in
23laymans terms (and apologies to all the pro's in the field in case I do
24not use the right terminology, my coding theory class was almost 30
25years ago, and I must admit it was not one of my favourites).
26
27As I said before the ecc calculation is performed on sectors of 256
28bytes. This is done by calculating several parity bits over the rows and
29columns. The parity used is even parity which means that the parity bit = 1
30if the data over which the parity is calculated is 1 and the parity bit = 0
31if the data over which the parity is calculated is 0. So the total
32number of bits over the data over which the parity is calculated + the
33parity bit is even. (see wikipedia if you can't follow this).
34Parity is often calculated by means of an exclusive or operation,
35sometimes also referred to as xor. In C the operator for xor is ^
36
37Back to ecc.
38Let's give a small figure:
39
40byte   0:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp4 ... rp14
41byte   1:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp2 rp4 ... rp14
42byte   2:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp4 ... rp14
43byte   3:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp4 ... rp14
44byte   4:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp5 ... rp14
45....
46byte 254:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp5 ... rp15
47byte 255:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp5 ... rp15
48           cp1  cp0  cp1  cp0  cp1  cp0  cp1  cp0
49           cp3  cp3  cp2  cp2  cp3  cp3  cp2  cp2
50           cp5  cp5  cp5  cp5  cp4  cp4  cp4  cp4
51
52This figure represents a sector of 256 bytes.
53cp is my abbreviation for column parity, rp for row parity.
54
55Let's start to explain column parity.
56cp0 is the parity that belongs to all bit0, bit2, bit4, bit6.
57so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even.
58Similarly cp1 is the sum of all bit1, bit3, bit5 and bit7.
59cp2 is the parity over bit0, bit1, bit4 and bit5
60cp3 is the parity over bit2, bit3, bit6 and bit7.
61cp4 is the parity over bit0, bit1, bit2 and bit3.
62cp5 is the parity over bit4, bit5, bit6 and bit7.
63Note that each of cp0 .. cp5 is exactly one bit.
64
65Row parity actually works almost the same.
66rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254)
67rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255)
68rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ...
69(so handle two bytes, then skip 2 bytes).
70rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...)
71for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc.
72so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...)
73and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, ..
74The story now becomes quite boring. I guess you get the idea.
75rp6 covers 8 bytes then skips 8 etc
76rp7 skips 8 bytes then covers 8 etc
77rp8 covers 16 bytes then skips 16 etc
78rp9 skips 16 bytes then covers 16 etc
79rp10 covers 32 bytes then skips 32 etc
80rp11 skips 32 bytes then covers 32 etc
81rp12 covers 64 bytes then skips 64 etc
82rp13 skips 64 bytes then covers 64 etc
83rp14 covers 128 bytes then skips 128
84rp15 skips 128 bytes then covers 128
85
86In the end the parity bits are grouped together in three bytes as
87follows:
88ECC    Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0
89ECC 0   rp07  rp06  rp05  rp04  rp03  rp02  rp01  rp00
90ECC 1   rp15  rp14  rp13  rp12  rp11  rp10  rp09  rp08
91ECC 2   cp5   cp4   cp3   cp2   cp1   cp0      1     1
92
93I detected after writing this that ST application note AN1823
94(http://www.st.com/stonline/) gives a much
95nicer picture.(but they use line parity as term where I use row parity)
96Oh well, I'm graphically challenged, so suffer with me for a moment :-)
97And I could not reuse the ST picture anyway for copyright reasons.
98
99
100Attempt 0
101=========
102
103Implementing the parity calculation is pretty simple.
104In C pseudocode:
105for (i = 0; i < 256; i++)
106{
107    if (i & 0x01)
108       rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1;
109    else
110       rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1;
111    if (i & 0x02)
112       rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3;
113    else
114       rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2;
115    if (i & 0x04)
116      rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5;
117    else
118      rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4;
119    if (i & 0x08)
120      rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7;
121    else
122      rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6;
123    if (i & 0x10)
124      rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9;
125    else
126      rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8;
127    if (i & 0x20)
128      rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11;
129    else
130    rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10;
131    if (i & 0x40)
132      rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13;
133    else
134      rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12;
135    if (i & 0x80)
136      rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15;
137    else
138      rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14;
139    cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0;
140    cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1;
141    cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2;
142    cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3
143    cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4
144    cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5
145}
146
147
148Analysis 0
149==========
150
151C does have bitwise operators but not really operators to do the above
152efficiently (and most hardware has no such instructions either).
153Therefore without implementing this it was clear that the code above was
154not going to bring me a Nobel prize :-)
155
156Fortunately the exclusive or operation is commutative, so we can combine
157the values in any order. So instead of calculating all the bits
158individually, let us try to rearrange things.
159For the column parity this is easy. We can just xor the bytes and in the
160end filter out the relevant bits. This is pretty nice as it will bring
161all cp calculation out of the if loop.
162
163Similarly we can first xor the bytes for the various rows.
164This leads to:
165
166
167Attempt 1
168=========
169
170const char parity[256] = {
171    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
172    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
173    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
174    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
175    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
176    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
177    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
178    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
179    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
180    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
181    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
182    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
183    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
184    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
185    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
186    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0
187};
188
189void ecc1(const unsigned char *buf, unsigned char *code)
190{
191    int i;
192    const unsigned char *bp = buf;
193    unsigned char cur;
194    unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
195    unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
196    unsigned char par;
197
198    par = 0;
199    rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
200    rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
201    rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
202    rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
203
204    for (i = 0; i < 256; i++)
205    {
206        cur = *bp++;
207        par ^= cur;
208        if (i & 0x01) rp1 ^= cur; else rp0 ^= cur;
209        if (i & 0x02) rp3 ^= cur; else rp2 ^= cur;
210        if (i & 0x04) rp5 ^= cur; else rp4 ^= cur;
211        if (i & 0x08) rp7 ^= cur; else rp6 ^= cur;
212        if (i & 0x10) rp9 ^= cur; else rp8 ^= cur;
213        if (i & 0x20) rp11 ^= cur; else rp10 ^= cur;
214        if (i & 0x40) rp13 ^= cur; else rp12 ^= cur;
215        if (i & 0x80) rp15 ^= cur; else rp14 ^= cur;
216    }
217    code[0] =
218        (parity[rp7] << 7) |
219        (parity[rp6] << 6) |
220        (parity[rp5] << 5) |
221        (parity[rp4] << 4) |
222        (parity[rp3] << 3) |
223        (parity[rp2] << 2) |
224        (parity[rp1] << 1) |
225        (parity[rp0]);
226    code[1] =
227        (parity[rp15] << 7) |
228        (parity[rp14] << 6) |
229        (parity[rp13] << 5) |
230        (parity[rp12] << 4) |
231        (parity[rp11] << 3) |
232        (parity[rp10] << 2) |
233        (parity[rp9]  << 1) |
234        (parity[rp8]);
235    code[2] =
236        (parity[par & 0xf0] << 7) |
237        (parity[par & 0x0f] << 6) |
238        (parity[par & 0xcc] << 5) |
239        (parity[par & 0x33] << 4) |
240        (parity[par & 0xaa] << 3) |
241        (parity[par & 0x55] << 2);
242    code[0] = ~code[0];
243    code[1] = ~code[1];
244    code[2] = ~code[2];
245}
246
247Still pretty straightforward. The last three invert statements are there to
248give a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash
249all data is 0xff, so the checksum then matches.
250
251I also introduced the parity lookup. I expected this to be the fastest
252way to calculate the parity, but I will investigate alternatives later
253on.
254
255
256Analysis 1
257==========
258
259The code works, but is not terribly efficient. On my system it took
260almost 4 times as much time as the linux driver code. But hey, if it was
261*that* easy this would have been done long before.
262No pain. no gain.
263
264Fortunately there is plenty of room for improvement.
265
266In step 1 we moved from bit-wise calculation to byte-wise calculation.
267However in C we can also use the unsigned long data type and virtually
268every modern microprocessor supports 32 bit operations, so why not try
269to write our code in such a way that we process data in 32 bit chunks.
270
271Of course this means some modification as the row parity is byte by
272byte. A quick analysis:
273for the column parity we use the par variable. When extending to 32 bits
274we can in the end easily calculate p0 and p1 from it.
275(because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0
276respectively)
277also rp2 and rp3 can be easily retrieved from par as rp3 covers the
278first two bytes and rp2 the last two bytes.
279
280Note that of course now the loop is executed only 64 times (256/4).
281And note that care must taken wrt byte ordering. The way bytes are
282ordered in a long is machine dependent, and might affect us.
283Anyway, if there is an issue: this code is developed on x86 (to be
284precise: a DELL PC with a D920 Intel CPU)
285
286And of course the performance might depend on alignment, but I expect
287that the I/O buffers in the nand driver are aligned properly (and
288otherwise that should be fixed to get maximum performance).
289
290Let's give it a try...
291
292
293Attempt 2
294=========
295
296extern const char parity[256];
297
298void ecc2(const unsigned char *buf, unsigned char *code)
299{
300    int i;
301    const unsigned long *bp = (unsigned long *)buf;
302    unsigned long cur;
303    unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
304    unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
305    unsigned long par;
306
307    par = 0;
308    rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
309    rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
310    rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
311    rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
312
313    for (i = 0; i < 64; i++)
314    {
315        cur = *bp++;
316        par ^= cur;
317        if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
318        if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
319        if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
320        if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
321        if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
322        if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
323    }
324    /*
325       we need to adapt the code generation for the fact that rp vars are now
326       long; also the column parity calculation needs to be changed.
327       we'll bring rp4 to 15 back to single byte entities by shifting and
328       xoring
329    */
330    rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff;
331    rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff;
332    rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff;
333    rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff;
334    rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff;
335    rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff;
336    rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff;
337    rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff;
338    rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff;
339    rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff;
340    rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff;
341    rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff;
342    rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff;
343    rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff;
344    par ^= (par >> 16);
345    rp1 = (par >> 8); rp1 &= 0xff;
346    rp0 = (par & 0xff);
347    par ^= (par >> 8); par &= 0xff;
348
349    code[0] =
350        (parity[rp7] << 7) |
351        (parity[rp6] << 6) |
352        (parity[rp5] << 5) |
353        (parity[rp4] << 4) |
354        (parity[rp3] << 3) |
355        (parity[rp2] << 2) |
356        (parity[rp1] << 1) |
357        (parity[rp0]);
358    code[1] =
359        (parity[rp15] << 7) |
360        (parity[rp14] << 6) |
361        (parity[rp13] << 5) |
362        (parity[rp12] << 4) |
363        (parity[rp11] << 3) |
364        (parity[rp10] << 2) |
365        (parity[rp9]  << 1) |
366        (parity[rp8]);
367    code[2] =
368        (parity[par & 0xf0] << 7) |
369        (parity[par & 0x0f] << 6) |
370        (parity[par & 0xcc] << 5) |
371        (parity[par & 0x33] << 4) |
372        (parity[par & 0xaa] << 3) |
373        (parity[par & 0x55] << 2);
374    code[0] = ~code[0];
375    code[1] = ~code[1];
376    code[2] = ~code[2];
377}
378
379The parity array is not shown any more. Note also that for these
380examples I kinda deviated from my regular programming style by allowing
381multiple statements on a line, not using { } in then and else blocks
382with only a single statement and by using operators like ^=
383
384
385Analysis 2
386==========
387
388The code (of course) works, and hurray: we are a little bit faster than
389the linux driver code (about 15%). But wait, don't cheer too quickly.
390THere is more to be gained.
391If we look at e.g. rp14 and rp15 we see that we either xor our data with
392rp14 or with rp15. However we also have par which goes over all data.
393This means there is no need to calculate rp14 as it can be calculated from
394rp15 through rp14 = par ^ rp15;
395(or if desired we can avoid calculating rp15 and calculate it from
396rp14).  That is why some places refer to inverse parity.
397Of course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13.
398Effectively this means we can eliminate the else clause from the if
399statements. Also we can optimise the calculation in the end a little bit
400by going from long to byte first. Actually we can even avoid the table
401lookups
402
403Attempt 3
404=========
405
406Odd replaced:
407        if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
408        if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
409        if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
410        if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
411        if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
412        if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
413with
414        if (i & 0x01) rp5 ^= cur;
415        if (i & 0x02) rp7 ^= cur;
416        if (i & 0x04) rp9 ^= cur;
417        if (i & 0x08) rp11 ^= cur;
418        if (i & 0x10) rp13 ^= cur;
419        if (i & 0x20) rp15 ^= cur;
420
421        and outside the loop added:
422    rp4  = par ^ rp5;
423    rp6  = par ^ rp7;
424    rp8  = par ^ rp9;
425    rp10  = par ^ rp11;
426    rp12  = par ^ rp13;
427    rp14  = par ^ rp15;
428
429And after that the code takes about 30% more time, although the number of
430statements is reduced. This is also reflected in the assembly code.
431
432
433Analysis 3
434==========
435
436Very weird. Guess it has to do with caching or instruction parallellism
437or so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting
438observation was that this one is only 30% slower (according to time)
439executing the code as my 3Ghz D920 processor.
440
441Well, it was expected not to be easy so maybe instead move to a
442different track: let's move back to the code from attempt2 and do some
443loop unrolling. This will eliminate a few if statements. I'll try
444different amounts of unrolling to see what works best.
445
446
447Attempt 4
448=========
449
450Unrolled the loop 1, 2, 3 and 4 times.
451For 4 the code starts with:
452
453    for (i = 0; i < 4; i++)
454    {
455        cur = *bp++;
456        par ^= cur;
457        rp4 ^= cur;
458        rp6 ^= cur;
459        rp8 ^= cur;
460        rp10 ^= cur;
461        if (i & 0x1) rp13 ^= cur; else rp12 ^= cur;
462        if (i & 0x2) rp15 ^= cur; else rp14 ^= cur;
463        cur = *bp++;
464        par ^= cur;
465        rp5 ^= cur;
466        rp6 ^= cur;
467        ...
468
469
470Analysis 4
471==========
472
473Unrolling once gains about 15%
474Unrolling twice keeps the gain at about 15%
475Unrolling three times gives a gain of 30% compared to attempt 2.
476Unrolling four times gives a marginal improvement compared to unrolling
477three times.
478
479I decided to proceed with a four time unrolled loop anyway. It was my gut
480feeling that in the next steps I would obtain additional gain from it.
481
482The next step was triggered by the fact that par contains the xor of all
483bytes and rp4 and rp5 each contain the xor of half of the bytes.
484So in effect par = rp4 ^ rp5. But as xor is commutative we can also say
485that rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can
486eliminate rp5 (or rp4, but I already foresaw another optimisation).
487The same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15.
488
489
490Attempt 5
491=========
492
493Effectively so all odd digit rp assignments in the loop were removed.
494This included the else clause of the if statements.
495Of course after the loop we need to correct things by adding code like:
496    rp5 = par ^ rp4;
497Also the initial assignments (rp5 = 0; etc) could be removed.
498Along the line I also removed the initialisation of rp0/1/2/3.
499
500
501Analysis 5
502==========
503
504Measurements showed this was a good move. The run-time roughly halved
505compared with attempt 4 with 4 times unrolled, and we only require 1/3rd
506of the processor time compared to the current code in the linux kernel.
507
508However, still I thought there was more. I didn't like all the if
509statements. Why not keep a running parity and only keep the last if
510statement. Time for yet another version!
511
512
513Attempt 6
514=========
515
516THe code within the for loop was changed to:
517
518    for (i = 0; i < 4; i++)
519    {
520        cur = *bp++; tmppar  = cur; rp4 ^= cur;
521        cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
522        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
523        cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
524
525        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
526        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
527	    cur = *bp++; tmppar ^= cur; rp4 ^= cur;
528	    cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
529
530	    cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur;
531        cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur;
532	    cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur;
533        cur = *bp++; tmppar ^= cur; rp8 ^= cur;
534
535        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
536        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
537        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
538        cur = *bp++; tmppar ^= cur;
539
540	    par ^= tmppar;
541        if ((i & 0x1) == 0) rp12 ^= tmppar;
542        if ((i & 0x2) == 0) rp14 ^= tmppar;
543    }
544
545As you can see tmppar is used to accumulate the parity within a for
546iteration. In the last 3 statements is added to par and, if needed,
547to rp12 and rp14.
548
549While making the changes I also found that I could exploit that tmppar
550contains the running parity for this iteration. So instead of having:
551rp4 ^= cur; rp6 = cur;
552I removed the rp6 = cur; statement and did rp6 ^= tmppar; on next
553statement. A similar change was done for rp8 and rp10
554
555
556Analysis 6
557==========
558
559Measuring this code again showed big gain. When executing the original
560linux code 1 million times, this took about 1 second on my system.
561(using time to measure the performance). After this iteration I was back
562to 0.075 sec. Actually I had to decide to start measuring over 10
563million iterations in order not to lose too much accuracy. This one
564definitely seemed to be the jackpot!
565
566There is a little bit more room for improvement though. There are three
567places with statements:
568rp4 ^= cur; rp6 ^= cur;
569It seems more efficient to also maintain a variable rp4_6 in the while
570loop; This eliminates 3 statements per loop. Of course after the loop we
571need to correct by adding:
572    rp4 ^= rp4_6;
573    rp6 ^= rp4_6
574Furthermore there are 4 sequential assignments to rp8. This can be
575encoded slightly more efficiently by saving tmppar before those 4 lines
576and later do rp8 = rp8 ^ tmppar ^ notrp8;
577(where notrp8 is the value of rp8 before those 4 lines).
578Again a use of the commutative property of xor.
579Time for a new test!
580
581
582Attempt 7
583=========
584
585The new code now looks like:
586
587    for (i = 0; i < 4; i++)
588    {
589        cur = *bp++; tmppar  = cur; rp4 ^= cur;
590        cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
591        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
592        cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
593
594        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
595        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
596	    cur = *bp++; tmppar ^= cur; rp4 ^= cur;
597	    cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
598
599	    notrp8 = tmppar;
600	    cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
601        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
602	    cur = *bp++; tmppar ^= cur; rp4 ^= cur;
603        cur = *bp++; tmppar ^= cur;
604	    rp8 = rp8 ^ tmppar ^ notrp8;
605
606        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
607        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
608        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
609        cur = *bp++; tmppar ^= cur;
610
611	    par ^= tmppar;
612        if ((i & 0x1) == 0) rp12 ^= tmppar;
613        if ((i & 0x2) == 0) rp14 ^= tmppar;
614    }
615    rp4 ^= rp4_6;
616    rp6 ^= rp4_6;
617
618
619Not a big change, but every penny counts :-)
620
621
622Analysis 7
623==========
624
625Actually this made things worse. Not very much, but I don't want to move
626into the wrong direction. Maybe something to investigate later. Could
627have to do with caching again.
628
629Guess that is what there is to win within the loop. Maybe unrolling one
630more time will help. I'll keep the optimisations from 7 for now.
631
632
633Attempt 8
634=========
635
636Unrolled the loop one more time.
637
638
639Analysis 8
640==========
641
642This makes things worse. Let's stick with attempt 6 and continue from there.
643Although it seems that the code within the loop cannot be optimised
644further there is still room to optimize the generation of the ecc codes.
645We can simply calculate the total parity. If this is 0 then rp4 = rp5
646etc. If the parity is 1, then rp4 = !rp5;
647But if rp4 = rp5 we do not need rp5 etc. We can just write the even bits
648in the result byte and then do something like
649    code[0] |= (code[0] << 1);
650Lets test this.
651
652
653Attempt 9
654=========
655
656Changed the code but again this slightly degrades performance. Tried all
657kind of other things, like having dedicated parity arrays to avoid the
658shift after parity[rp7] << 7; No gain.
659Change the lookup using the parity array by using shift operators (e.g.
660replace parity[rp7] << 7 with:
661rp7 ^= (rp7 << 4);
662rp7 ^= (rp7 << 2);
663rp7 ^= (rp7 << 1);
664rp7 &= 0x80;
665No gain.
666
667The only marginal change was inverting the parity bits, so we can remove
668the last three invert statements.
669
670Ah well, pity this does not deliver more. Then again 10 million
671iterations using the linux driver code takes between 13 and 13.5
672seconds, whereas my code now takes about 0.73 seconds for those 10
673million iterations. So basically I've improved the performance by a
674factor 18 on my system. Not that bad. Of course on different hardware
675you will get different results. No warranties!
676
677But of course there is no such thing as a free lunch. The codesize almost
678tripled (from 562 bytes to 1434 bytes). Then again, it is not that much.
679
680
681Correcting errors
682=================
683
684For correcting errors I again used the ST application note as a starter,
685but I also peeked at the existing code.
686The algorithm itself is pretty straightforward. Just xor the given and
687the calculated ecc. If all bytes are 0 there is no problem. If 11 bits
688are 1 we have one correctable bit error. If there is 1 bit 1, we have an
689error in the given ecc code.
690It proved to be fastest to do some table lookups. Performance gain
691introduced by this is about a factor 2 on my system when a repair had to
692be done, and 1% or so if no repair had to be done.
693Code size increased from 330 bytes to 686 bytes for this function.
694(gcc 4.2, -O3)
695
696
697Conclusion
698==========
699
700The gain when calculating the ecc is tremendous. Om my development hardware
701a speedup of a factor of 18 for ecc calculation was achieved. On a test on an
702embedded system with a MIPS core a factor 7 was obtained.
703On  a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor
7045 (big endian mode, gcc 4.1.2, -O3)
705For correction not much gain could be obtained (as bitflips are rare). Then
706again there are also much less cycles spent there.
707
708It seems there is not much more gain possible in this, at least when
709programmed in C. Of course it might be possible to squeeze something more
710out of it with an assembler program, but due to pipeline behaviour etc
711this is very tricky (at least for intel hw).
712
713Author: Frans Meulenbroeks
714Copyright (C) 2008 Koninklijke Philips Electronics NV.
715