doesn't make much sense to me. Seems like somebody screwed up the order.

Yes and no. The position of the return doesn't matter since the instruction "ld" doesn't affect flags. However, the "ld B, A" should go before the "ld A, $80" since the point of loading into B is to save the bit offset while A is occupied during shifting.

I guess I just want a line by line description of the code (the last part is only like 6 lines).

Here's the GetPixel routine that I made when I was making sure I understood how Asm28Days' worked. I think the only difference is that I use BC instead of DE in some parts (the input is A=x-location, L=y-location):

```
GetPixel:
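;Code chunk 1: HL = Y*12 (offset of the row's first byte)
ld H, 0 ;extend L into HL, so HL = Y
ld B, H ;BC = copy of Y for the x12 (B = 0 is reused in chunk 2)
ld C, L
add HL, HL ;HL = 2Y
add HL, BC ;HL = 3Y
add HL, HL ;HL = 6Y
add HL, HL ;HL = 12Y
;Code chunk 2: add the byte offset within the row, then the buffer address
ld C, A ;work on a copy so A keeps the original X
srl C ;C = int(X/8): need byte instead of bit
srl C
srl C
add HL, BC ;B is still 0, so this adds int(X/8)
ld BC, PlotSScreen
add HL, BC ;full address of byte
;Code chunk 3: build the bitmask in A
and $07 ;x-position (modulo 8); sets Z if the result is 0
ld B, A ;loop counter ("ld" leaves the flags alone)
ld A, %10000000
ret z ;don't loop if left-most bit
_maskLoop:
rra ;shift the set bit until in correct spot ("and" cleared carry, so a 0 comes in)
djnz _maskLoop
ret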
```

First, I'll get something out of the way so I can be a bit lazier in my following explanation (this is known as the "division algorithm"): Any integer N can be represented as N = mR + T where m is an integer ≥2, R is an integer and T is an integer 0≤T<m (note the strict inequality for the upper bound). We'll be making use of this for m=8 and m=12.
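
If you want to sanity-check the decomposition off-calculator, here's a small Python sketch (just the arithmetic, not calculator code; the helper name `decompose` is made up for illustration):

```python
def decompose(n, m):
    """Return (R, T) such that n == m*R + T and 0 <= T < m."""
    r, t = divmod(n, m)
    return r, t

# X = 8R + T: which byte of the row, and which bit inside that byte
print(decompose(77, 8))    # (9, 5)
# N = 12R + T: which row of the buffer, and which byte inside that row
print(decompose(125, 12))  # (10, 5)
```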

The graph screen/buffer has 64 rows of 96 bits. Each row is subdivided into 12 bytes (8 bits in a byte), so it's 64 rows of 12 bytes. This is represented by a one-dimensional array where the 0^{th}, 12^{th}, 24^{th}, …, 756^{th} elements are the start of a row's data. The start of this array is located at PlotSScreen in RAM.
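
To make the layout concrete, here's a rough Python sketch of where each row starts inside the buffer (the names `ROWS`, `BYTES_PER_ROW` and `row_start` are mine, only for illustration):

```python
ROWS = 64           # pixel rows on the graph screen
BYTES_PER_ROW = 12  # 96 pixels / 8 pixels per byte

def row_start(y):
    """Offset from the start of the buffer to the first byte of row y."""
    return y * BYTES_PER_ROW

starts = [row_start(y) for y in range(ROWS)]
print(starts[:3], starts[-1])   # [0, 12, 24] 756
print(ROWS * BYTES_PER_ROW)     # 768 bytes in the whole buffer
```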

Assume we want the X^{th} column of the Y^{th} row (so 0≤X≤95 and 0≤Y≤63).

Chunk 1: First we need to get to the start of the data for the Y^{th} row (actually, right now we're getting the offset into the buffer). This is achieved by multiplying Y by 12 (since the byte will be (12 bytes/row)*(Y rows) + some_number). Y*12 is equivalent to ((Y+Y)+Y)*2*2. We put the final value in HL.
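
If it helps, here is chunk 1 traced step by step as a Python sketch. The variables `hl` and `bc` simply stand in for the register pairs, and `times12` is a made-up name:

```python
def times12(y):
    """Mirror of chunk 1: HL = ((Y + Y) + Y) * 2 * 2."""
    hl = y        # ld H, 0            -> HL = Y
    bc = hl       # ld B, H / ld C, L  -> BC = Y
    hl = hl + hl  # add HL, HL         -> 2Y
    hl = hl + bc  # add HL, BC         -> 3Y
    hl = hl + hl  # add HL, HL         -> 6Y
    hl = hl + hl  # add HL, HL         -> 12Y
    return hl

print(times12(1), times12(63))  # 12 756
```

Doubling and adding like this is the usual way around the Z80 not having a multiply instruction.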

Chunk 2: Now we need to know how many bytes into the row the pixel's byte sits. Since X=8R+T, the byte that contains the bit we need is the R^{th} byte into the row. We can extract R by taking the integer part of X/8. A logical right shift of X is equivalent to int(X/2), so using "srl" three times gives int(X/2/2/2)=int(X/8). We'll still need the original X position in chunk 3, so we put a copy in the C register and shift that instead of A. The B register was already zeroed in chunk 1 (by "ld B, H"), so BC=R and adding BC to what we had in HL gives the offset of the byte we need. This is added to PlotSScreen to finally give the location in RAM that contains the desired pixel.
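
Here's the same address calculation as a Python sketch. Note that `PLOTSSCREEN = 0x9340` is an assumption on my part (it's the value I'd expect from the TI-83 Plus include file; check your own include file), and `byte_address` is just an illustrative name:

```python
PLOTSSCREEN = 0x9340  # assumed buffer address; check your include file

def byte_address(x, y):
    """Address of the byte holding pixel (x, y), following chunks 1 and 2."""
    hl = 12 * y       # result of chunk 1
    c = x             # ld C, A (A keeps the original X)
    c >>= 1           # srl C
    c >>= 1           # srl C
    c >>= 1           # srl C  -> C = int(X/8)
    hl += c           # add HL, BC (B is still 0)
    return PLOTSSCREEN + hl  # ld BC, PlotSScreen / add HL, BC

print(hex(byte_address(0, 0)))    # 0x9340
print(hex(byte_address(95, 63)))  # 0x963f
```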

Chunk 3: All that's left to do is get the bitmask that will enable us to manipulate only the bit we need. This is achieved by shifting %10000000 enough times so that the 1 is lined up with the bit. To get the number of shifts we take the remainder when X is divided by 8; essentially we extract the T from X=8R+T. Suppose you have the number S=abcdefgh (where a…h are binary digits). Shifting right three times (dividing by 8) gives 000abcde. Using the division algorithm we can put this in the form S = 8 * (000abcde) + T = (abcde000) + T. Looking closely you'll notice that to get back to the original value of S, T needs to be equal to the last three bits of S. By bit-wise ANDing S with %00000111 (=7d) these three bits can be extracted. Now the A register has the number of times to shift. We load this into the B register and then load %10000000 (=$80) into A. If the AND resulted in zero, we already have the proper bitmask and so we return (remember that "ld" doesn't affect flags, so the zero flag still reflects the AND). Otherwise, rotating A to the right inside of a loop with B as the counter will shift A the correct number of times.
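
And chunk 3 as a Python sketch, with `pixel_mask` as a made-up name. I model the rotate as a plain right shift, which gives the same result here only because the set bit moves at most 7 places and never wraps around:

```python
def pixel_mask(x):
    """Mirror of chunk 3: bitmask for pixel column x within its byte."""
    b = x & 0x07   # and $07 / ld B, A   -> number of shifts
    a = 0x80       # ld A, %10000000
    if b == 0:     # ret z
        return a
    while b:       # rotate right once per djnz iteration
        a >>= 1
        b -= 1
    return a

for x in (0, 3, 7):
    print(format(pixel_mask(x), "08b"))
# 10000000
# 00010000
# 00000001
```

On the calculator you'd then combine the mask in A with the byte at (HL), e.g. "and (HL)" to test the pixel or "or (HL)" followed by "ld (HL), A" to set it.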