Fast Affine Texture Mapping II

 

By Mats Byggmastar
MRI
3D programmer in Doomsday
mri@penti.sit.fi

Contents

About this document

The Original Ascii Version was relased on 17th April 1997
The HTML conversion by J. Sharman was released on 27th March 1998

This document is the follow-up to fatmap.txt released 19 Jun. 1996. The aim of this second document is to make a more accurate affine texture mapper. I have also skipped triangles and concentrated on convex polygons to make clipping more efficient. The tiled bitmap inner loop scheme described in fatmap.txt has now also been realized in a 6 clock tick inner loop that is put into action here with good results.

As in the previous document the inner loops in assembly are developed for the Intel Pentium processor.

People looking for descriptions on perspective correct texture mappers should visit Game Developer Magazine at http://www.gdmag.com and backorder the issues with Chris Hecker's articles on this subject!

I, the author, am a 25 year old computer and telecom engineer (B.Sc.) currently working as a teacher in a vocational school for students in the age of 16-19 years studying computers, electronic, automation and power electric. I do real-time 3D graphics programming mostly as a hobby and am an active member of the Finnish demogroup Doomsday. My dream is to one day work full-time with 3D graphics.

Special thanks to Harriet Mattfolk for proofreading this document and helping me with the english syntax.

About the source code

The source code, example executable and the original ASCII text version of this document can be downlaoded from here. (fatmap2.zip)

Except for the inner loops and some misc. functions the source is written in C++ very close to C. To make the most of the code I've made the inner loops as inline assembler. I personally have the whole mapping functions in fine-tuned assembly, but it would be pointless to include such a source in this type of document.

The source code that should accompany this document is divided into 8 files:

          misc.h      - Misc. declarations of structures and functions
  
          clip.cpp    - Sutherland-Hodgman 2D clipping
          draw.cpp    - Calculates deltas, clips and calls the mappers
          main.cpp    - Simple test program
  
          flat.cpp    - Flat filler
          gouraud.cpp - Gouraud filler
          txtmap.cpp  - Texture mapper
          txtarb.cpp  - Arbitrary size texture mapper
          txttil.cpp  - Tiled texture mapper

The last five files are the ones of interest. The other files were just included so I would be able to make the test program. Clip.cpp is my very own implementation of the Sutherland-Hodgman clipping algorithm and might be of some interest. You can find the theory behind the Sutherland-Hodgman algorithm in any graphics textbook.

From time to time I see people looking for a fast way to fill polygons with a solid color. I have therefore included the flat filler. You might want to replace the call to memset() with some other inner loop. I recently also participated in a discussion on the newsgroup comp.graphics.algorithms concerning floating point vs. fixed point in a gouraud filler so I decided to include my version of a gouraud filler.

The source should be compiled with Watcom C/C++ as the inline assembly is Watcom specific. It has only been tested with Watcom C/C++ version 10.0. The compiled version of the test program (main.exe) is included. You will need the DOS4GW protected mode stub to run it.

Convex polygons

First, let us define what a convex polygon is. A convex polygon must not have any of it's corners pointing inwards to the center of the polygon. In other words, the angle between two edges joining in a vertex must not be greater than 180 degree. This means that triangles are always convex polygons. The opposite of convex is concave and they can have corners pointing inwards.

The mappers presented in this document can only draw convex polygons with one exception. The third polygon above will be handled correctly even if it is concave. This is because it is concave in such a way that no scanlines need to be split while rendering it. The fourth polygon will not be handled properly because scanlines must be split.

Drawing convex polygons

Drawing convex polygons is just as simple and fast as drawing triangles. The only difference is that the vertex scanning code needs to be in a loop for polygons. If we only render triangles we know that we always have 3 vertices which makes the code a bit more straightforward.

First step in a polygon function is to search for the top and bottom vertex. So we scan the vertex array and search for min and max y. Let us take the following 4 vertex polygon as example:

Note that the vertices must be in an anti clockwise order. We find that v1 is the top vertex and v3 is the bottom vertex.

We now know that when scanning the left side of the polygon, we should start at v1 and move forward through the array until we come to v3. In general when moving forward, we should wrap to the start of the array if we move past the array. I.e. if we come to the last vertex, our next vertex will be v0.

For the right side we start at v1 and move backwards through the array until we come to v3. In general when moving backwards, we should wrap to the end of the array if we move outside the array. I.e. when we come to v0, our next vertex will be the last vertex in the array.

In our example we have 2 sections on the left side:

  
        v1 - v2   and  v2 - v3

and 2 sections on the right side:

  
        v1 - v0   and  v0 - v3

The second step in a polygon function is to calculate the x coordinate slope for the first section on the right side. If the section had zero height, try the next until a non-zero section is found.

The third step is to calculate x coordinate slope and any other slope (e.g. texture u and v) for the first section on the left side. If the section had zero height, try the next until a non-zero section is found.

If all sections on a side has zero height, the whole polygon has zero height and should not be plotted.

Now we can start to interpolate the values along the left and the right side as we draw each scanline using an inner loop.

When we come to the end of a section, we search for the next non-zero height section and calculate the new slope. If all sections on one side are done, i.e. if we have come to the bottom vertex, the polygon is done.

Constant texture gradients and polygons

With triangles the texture gradients (texture u,v slopes over the triangle surface) are guaranteed to be constant. Therefore we can calculate the gradients (dudx, dvdx) once and use the same values for all scanlines of the triangle.

If we start out with a triangle and clip it into a polygon, the texture gradients for the polygon are still constant over the whole surface. So if we calculate the gradients before the triangle gets clipped, we can still use the constant texture gradient method for polygons.

Calculating the texture gradients for a triangle can be done in the following way:

          double d = (v0.x - v2.x) * (v1.y - v2.y) -
                     (v1.x - v2.x) * (v0.y - v2.y);
          if(d == 0.0) return;
          double id = 1.0/d * 65536.0;
          long dudx = ((v0.u - v2.u) * (v1.y - v2.y) -
                       (v1.u - v2.u) * (v0.y - v2.y)) * id;
          long dvdx = ((v0.v - v2.v) * (v1.y - v2.y) -
                       (v1.v - v2.v) * (v0.y - v2.y)) * id;

The vertex data x,y,u,v is assumed to be in floating point and the resulting dudx,dvdx is in 16:16 bit fixed point.

16:16 bit fixed point idiv and imul

All mappers work with 16:16 bit fixed point internally. I will not explain how fixed point works here, you can look that up elsewhere. I'll just point out that we need to perform the fixed point divides and multiplies in assembler rather than in C. So from here on you can assume that code like:


          long dxdy = (width << 16) / height;
          long x = v1.x + ((prestep * dxdy) >> 16);

need to be performed by some inline assembly functions:


          long dxdy = idiv16(width, height);
          long x = v1.x + imul16(prestep, dxdy);

Fixed point ceil() function

We need a ceil() function for the subpixel and subtexel accuracy. Definition of ceil() is:

ceil(x) returns the smallest integer not < x

  
      e.g.  ceil(1.0)  returns  1 
            ceil(1.5)  returns  2 
            ceil(-2.0) returns -2 
            ceil(-1.5) returns -1

If we limit x to only positive numbers we can make a 16:16 bit fixed point version of ceil() very easily. We add 0xffff to x and right-shift by 16.


          inline long ceil(long x)
          {
              x += 0xffff;
              return x >> 16;
          }

Note that this function do not give the correct result if x is negative. This is no problem as x will never be negative the way ceil() is used in the mappers.

Subpixel accuracy

The first thing to note when aiming for subpixel accurate polygon drawing is that we must never truncate screen coordinates to integers. Today it's common to use floating point all the way trough the 3D engine and convert the projected screen coordinates to integers just before entering the polygon function. This conversion should of course be a float to fixed point conversion so that we do not lose the fractional part. The fractional part is in fact the subpixel position that will affect how the slope of a polygon edge will be rendered.

Without subpixel accuracy the polygons will jump one pixel at a time when the object is moving slowly over the screen. With subpixel accuracy the pixels that makes up the edges between the polygons will slowly "float", making the edges seem to move slowly over the screen.

Calculating the slope of a subpixel accurate section:

          long scanlines = ceil(v2.y) - ceil(v1.y);
          if(scanlines <= 0) return;              // Section had zero height
          
          long height = v2.y - v1.y;              
          long dxdy = ((v2.x - v1.x) << 16) / height; 
      
          long prestep = (ceil(v1.y) << 16) - v1.y;
          long left_x = v1.x + ((prestep * dxdy) >> 16);   

The height of the section, or the number of scanlines, will be an integer. When calculating the slope (dxdy) we will not use this height, rather the real height that includes the fractional part. We apply the subpixel accuracy to the slope by adjusting the initial x coordinate by the amount that was truncated when selecting the top y coordinate using ceil().

Interpolating along subpixel accurate left and right sections:

          for(  )
          {
              long x1 = ceil(left_x);               // Scanline start x
              long width = ceil(right_x) - x1;      // Scanline width
          
              if(width > 0)
              {
                  Now draw the scanline from x1,y, width pixels wide.
              }
                
              left_x  += left_dxdy;
              right_x += right_dxdy;
              y++;
          }

Subtexel accuracy

Calculate the u,v (dudy,dvdy) slopes along a section the same way as the x coordinate slope and prestep the initial values the same way:

          long height = v2.y - v1.y;              
          long dudy = ((v2.u - v1.u) << 16) / height; 
          long dvdy = ((v2.v - v1.v) << 16) / height; 
      
          long prestep = (ceil(v1.y) << 16) - v1.y;
          long left_u = v1.u + ((prestep * dudy) >> 16);   
          long left_v = v1.v + ((prestep * dvdy) >> 16);   

When we interpolate with subtexel accuracy we must prestep u,v before drawing each scanline. This is because we round the start of the scanline (x1) upward to the nearest pixel using ceil().


          for(  )
          {
              long x1 = ceil(left_x);                 // Scanline start x
              long width = ceil(right_x) - x1;        // Scanline width
      
              if(width > 0)
              {
                  long prestep = (x1 << 16) - left_x;
                  long u = left_u + ((prestep * dudx) >> 16); 
                  long v = left_v + ((prestep * dvdx) >> 16);
                  
                  Now draw the scanline from x1,y,u,v, width pixels wide.
              }
      
              left_x  += left_dxdy;
              left_u  += left_dudy;
              left_v  += left_dvdy;
              right_x += right_dxdy;
              y++;
          }

Avoiding divide overflow in slope calculations

For some sections that are very thin, the slope calculation might overflow.

Look at the following case:

          v1.x = 0000:0000    v2.x = 0002:0000
          v1.y = 0000:0000    v2.y = 0000:0001

ceil(v2.y) - ceil(v1.y) will report that this is one scanline even if the actual height is just a fraction of a pixel.

          height = v2.y - v1.y = 0000:00001
          width  = v2.x - v1.x = 0002:00000

so performing (width << 16) / height will cause divide overflow. We are aiming for nearly perfect affine mapping so we can't just take some default value for the slope. One way to avoid the overflow is to multiply width by the inverse of height using only 18:14 bit precision.

          long height = v2.y - v1.y;
          long inv_height = (0x10000 << 14) / height;
          long dxdy = ((v2.x - v1.x) * inv_height) >> 14;

Note that this method can only be used for this special case where the section height is less than one pixel. Other sections that are more than one pixel should of course be calculated as usual.

Loop counters in inner loops

There is a neat way to combine movement of destination pointer and loop counter in inner loops. It might not save much (half a clock tick on a Pentium) but you might find other ways of using it.

Assume that we want to draw a horizontal line on the screen with the following data:

          al  = color
          ecx = width of line
          edi = destination pointer to first pixel on the left

The standard way would be:

          inner:
              mov   [edi], al         ; plot
              inc   edi               ; advance destination pointer
              dec   ecx               ; decrease loop counter
              jnz   inner

Another way is to combine destination pointer and width and plot the line from right to left:


          inner:
              mov   [edi+ecx-1], al   
              dec   ecx               
              jnz   inner

In the above loop we got rid of one instruction. On the other hand, we are writing to memory from higher to lower addresses. This is bad for the write buffers. We should write to increasing addresses instead. That can be done in this way:


              lea   edi, [edi+ecx]  
              neg   ecx               ; loop counter goes from -width to 0
  
          inner:
              mov   [edi+ecx], al     
              inc   ecx               
              jnz   inner
      

The destination is first moved to the end of the line and the loop counter is negated. The first pixel will therefore be plotted at start+width-width, i.e. at the start of the line.


              neg   ecx  

can be replaced with:

  
              xor   ecx, -1
              inc   ecx

which most of the time can be paired with some other instructions in the setup code. neg is a non pairable, 1 tick instruction. We will then end up with:


              lea   edi, [edi+ecx]  
              xor   ecx, -1
              inc   ecx
          inner:
              mov   [edi+ecx], al     
              inc   ecx               
              jnz   inner

8:16 bit inner loop

After fatmap.txt was released I was sent a very nice 8:16 bit 4 clock tick inner loop made by Russel Simmons (Armitage/Beyond). In it's original form it plotted the scanlines from right to left. I have in a later stage changed this and present the new version here which plots the scanlines from left to right.

          ; bitmap (256x256) must be aligned on 64k. (low 16 bits zero)
          ; eax =  u frac           : -
          ; ebx =  bitmap ptr       : v int      : u int
          ; ecx =  scanline width
          ; edx =  v frac           : v int step : u int step
          ; esi =  u frac step      : 0          : 0
          ; edi =  destination ptr
          ; ebp =  v frac step      : 0          : 0
  
          lea   edi, [edi+ecx]
          xor   ecx, -1
          inc   ecx
  
       inner:
          mov   al, [ebx]      ; get color
          add   edx, ebp       ; v frac += v frac step
          adc   bh, dh         ; v int  += v int step (+carry from v frac)
          add   eax, esi       ; u frac += u frac step
          adc   bl, dl         ; u int  += u int step (+carry from u frac)
          mov   [edi+ecx], al  ; plot pixel
          inc   ecx
          jnz   inner

This is a good inner loop with enough precision, which also can handle texture wrapping. Note that the texture must be aligned on 64k.

One way of aligning texture maps on 64k is to first make a buffer which is 64k byte larger than the texture map and then align a pointer inside that buffer:


          char source[256*256];       // source bitmap
          char bigbuffer[256*256*2];  // 2*64k byte buffer
          
          char * aligned = (char *) (((int)(bigbuffer + 0xffff)) & ~0xffff); 
          memcpy(aligned, source, 256*256);

Now access the bitmap using the aligned pointer.

Arbitrary size bitmap inner loop

The idea for the following 5 clock tick inner loop was taken from Chris Hecker's scanline subdivision texture mapper. This loop doesn't rely on the fact that textures are always 256x256 pixels. We use a 2 dword lookup table instead to perform the v * width calculation. Therefore it can handle bitmaps of any size. The fractional part in this loop can be up to 32 bits but we only use 16 bits here. Note that the bitmap does not need to be aligned.

The trick in this loop is to convert the carry flag from the fractional v addition into an index in the lookup table. Then to get the texture increment from the lookup table.

          ; bitmap can be any size, setup duvdxstep table according to width
          ; dvdxfrac =  v frac step : 0
          ; eax =  u frac           : 0
          ; ebx =  v frac           : 0
          ; ecx =  scanline width
          ; edx =  u frac step      : 0
          ; esi =  bitmap ptr
          ; edi =  destination ptr
  
          lea   edi, [edi+ecx]
          xor   ecx, -1
          add   ebx, [dvdxfrac]               ; v frac += v frac step
          inc   ecx
          sbb   ebp, ebp                      ; -1 if carry from add
  
       inner:
          add   eax, edx                      ; u frac += u frac step
          mov   bl, [esi]                     ; get color
          adc   esi, [duvdxstep+4+ebp*4]      ; move texture pointer
          add   ebx, [dvdxfrac]               ; v frac += v frac step
          sbb   ebp, ebp                      ; -1 if carry from add
          mov   [edi+ecx], bl                 ; plot pixel
          inc   ecx
          jnz   inner

The duvdxstep lookup table is set up in the following way:


          long duvdxstep[2];
  
          duvdxstep[0] = (dudx >> 16) + (dvdx >> 16) * mapwidth + mapwidth;
          duvdxstep[1] = (dudx >> 16) + (dvdx >> 16) * mapwidth;  

This means that when we get that carry from the v addition and ebp becomes -1, we will address duvdxstep[0] and add one extra line in the bitmap.

A drawback of this inner loop is that it can not handle texture wrapping. I.e. you must never let the u,v coordinates be outside the texture map or you will read random data outside the texture map or even cause a protection fault.

8:15 bit tiled inner loop

In fatmap.txt I described a method of arranging the texture map as 8x8 tiles for more efficient cache usage. Back then I did not know any way of doing the bit-swapping inner loop in less than 11 clock ticks. But one day it just came to me and I could immediately write a 8 clock tick 8:16 bit version. That version did not even work in theory, but it was a starting point. The inner loop below has evolved from the original 8 clock tick version over a two month period. It was developed on paper only and the first time I actually tested any of the loops, was this version and it worked perfectly.

The version below runs at 6 clock ticks and uses 8:15 bit interpolation. Here 8x8 tiles are used but any type of scheme can be used, just modify the bit-masks. The 256x256 bitmap need not be aligned, but for the tiling scheme to be effective the bitmap should be aligned on 32 byte. Texture wrapping is allowed.

          ; bitmap (256x256) must be stored as 8x8 tiles
          ; tildudx =  wwwww11111111www1fffffffffffffffb  (w=whole, f=frac)
          ; tildvdx =  11111wwwwwwww1111fffffffffffffffb
          ; eax =  u   wwwww00000000www0fffffffffffffffb
          ; ebx =  v   00000wwwwwwww0000fffffffffffffffb
          ; ecx =  scanline width
          ; edi =  destination ptr
          ; esi =  bitmap ptr
  
          lea   edi, [edi+ecx-1]
          xor   ecx, -1
          lea   ebp, [eax+ebx]                            ; u+v
          inc   ecx
  
       inner:
          add   eax, [tildudx]                            ; u += tildudx
          add   ebx, [tildvdx]                            ; v += tildvdx
          shr   ebp, 16                                   ; (u+v) >> 16
          and   eax, 11111000000001110111111111111111b    ; clear bitgaps
          and   ebx, 00000111111110000111111111111111b
          inc   ecx
          mov   dl, [esi+ebp]                             ; get color
          lea   ebp, [eax+ebx]                            ; u+v
          mov   [edi+ecx], dl                             ; plot pixel
          jnz   inner
      

Converting 16:16 bit fixed point dudx,dvdx to the tiled format can be done the following way:

  
          tildudx = (((dudx << 8) & 0xf8000000) +
                     ((dudx >> 1) & 0x00007fff) +
                      (dudx       & 0x00070000)) | 0x07f88000;
      
          tildvdx = (((dvdx << 3) & 0x07f80000) +
                     ((dvdx >> 1) & 0x00007fff)) | 0xf8078000;

Note that we fill out the bit-gaps with 1's. This is because we must force the bits to jump over the gaps when we add in the inner loop. These 1's must be cleared out after the add. We do this with the 'and' instructions in the inner loop.

Before entering the inner loop we also convert u,v to the tiled format:

  
          u = ((u << 8) & 0xf8000000) +
              ((u >> 1) & 0x00007fff) +
               (u       & 0x00070000);
      
          v = ((v << 3) & 0x07f80000) +
              ((v >> 1) & 0x00007fff);

It should be noticed that 15 bits are only used for the fractional part and that it is right-shifted 1 bit. This is done because we must not let the fractional part overflow and clutter the texture pointer when we add u+v using the lea instruction. We have placed a bit-gap between the fractional part and the whole part to avoid this.

The bitmap itself can be tiled the following way:

          char source[256*256];   // linear source bitmap
          char tiled[256*256];    
  
          for(int v=0; v<256; v++) {
              for(int u=0; u<256; u++) {
                  int dst = ((u<<8) & 0xf800)+(u & 0x0007)+((v<<3) & 0x07f8);
                  tiled[dst] = source[u+v*256];
              }
          }

Polygons per second

From time to time I see coders talking about the speed of their 3D engine by specifying how many polygons per second can be plotted. In my opinion, this is completely irrelevant as they do not specify under what condition they have made the test. The speed will be _very_ dependent on the size and orientation of the polygons as well as how the texture is mapped onto the polygons. Many other factors will contribute to the speed, even the type of memory manager or DOS extender.

With the simple test program accompanying this document I try to compare the speed between the different mappers. In this case, a comparison is valid as they work under the exact same conditions. I get the following results on my Pentium 120:

          Flat polys/sec              3494
          Gouraud polys/sec           1849
          Texture polys/sec           1214
          Arbitrary texture polys/sec 1196
          Tiled texture polys/sec     1520

These figures do not seem very impressive, but keep in mind that the triangles can be very large as I allow x,y to be in the range [0...320],[0...200]. In any case you should notice that the flat filler (using standard memset() as inner loop) is about twice as fast as the others. The arbitrary size mapper and the ordinary mapper should run at about the same speed. The tiled mapper will come out as a winner, because we allow the u,v values to vary a lot over the triangle surface, i.e. the cache issues will become more important than a quick setup and a fast inner loop.

Greetings

Members of Doomsday
Members of SoCS
Members of Esteem
Phil Carmody
Nix / The Black Lotus
Submissive / Cubic Team & $eeN
Armitage / Beyond
Jare / Iguana
Wog / Orange
#coders. No one mentioned, no one forgotten.
All my students in YiJ 1996-1997; Data3, AD2, Elm2, El1A, El1B

...my heart in Oslo