Optimizations for the 68020+ by Erik H. Bakke Written 13/10-93 --- I ------------------------- INTRODUCTION ---------------------- I --- 1.1 Introduction The 68020+ (from here on 020) has several new registers and commands that help speeding up your code. This text also mentions some concepts new to the 68010 processor. This text contains information on how to use the new instructions, and address modes, as well as what modes are available to what instructions, and how much space they require. However, the timing diagrms for the different instructions are not included (I don't know them). 1.2 Index Chapter I-----------------Introduction------------------- 1.1 Introduction 1.2 Index Chapter II------------New Addressing Modes--------------- 2.1 Extended Address Register Indirect with Index mode 2.2 Memory Indirect modes Chapter III---------------Improvements------------------- 3.1 General Improvements 3.2 The CMP2 instruction 3.3 Improved Multiply/Divide instructions 3.4 The CHK2 instruction 3.5 The EXTB instruction Chapter IV--------------New Instructions----------------- 4.1 The BitField instructions 1...Single operand 2...Double operand 4.2 The RTD instruction 4.3 The BCD instructions 1...PACK 2...UNPK 4.4 The MC68040 Block Move instruction (MOVE16) 4.5 The 68010 BKPT instruction 4.6 The 68020 Module instructions 4.7 The CAS/CAS2 instructions 4.8 The CoProcessor interface instructions 1...User state 2...Supervisor state 4.9 Conditional TRAP instruction Chapter V-------------Addressing mode tables------------- 5.1 Allowed Adressing modes --- II --------------------- NEW ADDRESSING MODES ---------------- II --- First, some new addressing modes: The 020 supports 18 different addressing modes, where the 68000 only supports 12. The 6 new modes expand memory access. - The address register indirect with index now permit the index register to be scaled by a factor of 1,2,4 or 8 to allow easy access to byte, word, longword and quadword data units. This, in turn greatly improves access to arrays of such data. - The address register and PC indirect with index modes have been extended to a more general syntax, allowind 32-bit displacements. Any of the components of these modes are optional, giving us some very interesting addressing modes, such as DATA register indirect, called base displacement. - Another new concept in the 020 is the memory indirect addressing, which allows intermediate use of a pointer in memory. The contents of this pointer is then used as the base address for further memory access. We will see examples of how this is used later. 2.1 EXTENDED ADDRESS REGISTER INDIRECT with INDEX mode The 020 supports a scale factor to be used with the index register. This eliminates the need for an additional multiply/rotation instruction to compute the correct index value. Syntax: 1 (d8,An,Rm.Size*Scale) uses old 8-bit displacement 2 (bd,An,Rm.Size*Scale) uses 16 or 32-bit base displacement 3 (d8,PC,Rm.Size*Scale) uses old 8-bit displacement 4 (bd,PC,Rm.size*Scale) uses 16 or 32-bit base displacement 1 =d8+An+(Rm.Size*Scale) 2 =bd+An+(Rm.Size*Scale) 3 =d8+PC+(Rm.Size*Scale) 4 =bd+PC+(Rm.Size*Scale) The addressing mode works just like the old version of it, except that you may include the scale factor to multiply the index register by 1,2,4 or 8. The old version can be regarded as having a scale factor of 1. Example: "Table" is an array of quadwords (64-bits) "Element" (16 bits) is the element number to be looked up. (32-bit) move.l (Element,pc),d0 move.l (Table,pc,d0.w*8),d0 This code fetches the 32-bit element as indicated by "Element" from the table "Table". Many forms of this addressing mode is legal, as the different elements are optional. Allowed forms may be: (bd,An,Rm.Size) Corresponds to the old version (bd,Rm.Size*Scale) Omits the address register from the (bd) Equivalent to absolute addressing (Dm.l) Data register indirect addressing () Just an of 0 If you choose to omit the PC, you may have to use the notation ZPC and/or note the base displacement with .L, depending on your assembler. 2.2 MEMORY INDIRECT modes These modes enables the processor to step on a pointer in memory when computing an . These modes can be divided in two categories, Pre-indexed and Post-indexed. Syntax: 1 ([bd,An,Rm.Size*Scale],od) Pre-indexed form 2 ([bd,An],Rm.Size*Scale,od) Post-indexed form 3 ([bd,PC,Rm.Size*Scale],od) Pre-indexed form 4 ([bd,PC],Rm.Size*Scale,od) Post-indexed form 1 =Contents of(bd+An+Rm*Scale) + od 2 =Contents of(bd+An) + Rm*Scale+od 3 =Contents of(bd+PC+Rm*Scale) + od 4 =Contents of(bd+An) + Rm*Scale+od The same rules apllies to these addressing modes as do for the previously described modes. All elements are optional and may be omitted. Example: On Amiga computers, graphics rendering functions need a pointer to a rastport. To extract this RastPort pointer from a Window sctructure the following code would be used: move.l ([WindowBase,pc],wd_RPort),a2 When getting to grips with these addressing modes, they can greatly improve the performance of your program, as well as reducing the length of the code. --- III ---------------------- IMPROVEMENTS --------------------- III --- 3.1 General improvements: The 020 improves the branch instructions to use an 8, 16 or 32-bit displacement. 3.2 The CMP2 instruction The CMP instructions are extended to compare a register against a pair of bounds: CMP2.Size ,Rn The is a pointer to the bounds. The bounds are the same size as the operation. The lower bound is stored first, then the upper bound. If Rn is outside the bounds, the C flag is set, If Rn is equal to either of the bounds, the Z flag is set, and both are cleared if Rn is within the bounds. -This operation may be used on both unsigned and signed data. -If Rn is an address register, byte/word data is sign extended before comparison 3.3 Improved Multiply/Divide instructions The 020 greatly improves the multiplication/division instructions. Now, you have these possibilities: 3.3.1 MULU/MULS instructions Instruction Precision Syntax: 1 MULU.W ,Dn 16b*16b=>32b (68000 instruction) 2 MULU.L ,Dl 32b*32b=>32b 3 MULU.L ,Dh:Dl 32b*32b=>64b The syntax and precision of the MULS instruction are identical to those of the MULU instructions. 3.3.1.1 MULU.W ,Dn This instruction multiplies the 16-bit value indicated by with the 16-bit contents of Dn, and stores the result as a 32-bit value in Dn. This is the basic MULU instruction that is found on the 68000 processor. 3.3.1.2 MULU.L ,Dl This instruction multiplies the 32-bit value indicated by with the 32-bit contents of Dl. This produces a 64-bit result of which the low 32 bits are discarded. The high 32 bits are then stored in Dl. 3.3.2.3 MULU.L ,Dh:Dl This instruction multiplies the 32-bit value indicated by with the 32-bit contents of Dl. This produces a 64-bit result. The high 32 bits are stored in Dh, and the low 32 bits are stored in Dl. 3.3.2 DIVU/DIVS Instruction Precision Syntax: 1 DIVU.W ,Dn 32b/16b=>16r:16q (68000 instruction) 2 DIVU.L ,Dq 32b/32b=>32q 3 DIVU.L ,Dr:Dq 64b/32b=>32r:32q 4 DIVUL.L ,Dr:Dq 32b/32b=>32r:32q The syntax and precision of the DIVS instruction are identical to those of the DIVU instructions 3.3.2.1 DIVU.W ,Dn This instruction divides the 32-bit contents of Dn with the 16-bit value indicated by , and stores the quotient is stored in the lowest word of Dn, and the remainder is stored in the highest. This is the basic DIVU instruction that is found on the 68000 processor 3.3.2.2 DIVU.L ,Dq This instruction divides the 32-bit contents of Dq with the 32-bit value indicated by , and stores the quotient in Dq. The remainder is discarded. 3.3.2.3 DIVU.L ,Dr:Dq This instruction divides the 64-bit contents of Dr(MSLW):Dq(LSLW) with the 32-bit value indicated by and stores the quotien in Dq, and the remainder in Dr. 3.3.2.4 DIVUL.L ,Dr:Dq This instruction divides the 32-bit contents of Dq with the 32-bit value indicated by , and stores the quotient in Dq, and the remainder in Dr. 3.4 The CHK2 instruction The 020 extends the CHK instruction to check a value against a pair of bounds. See the description of CMP2 for information about these bounds. If the value is outside the specified bounds, a CHK exception is taken. 3.5 The EXTB instruction The 020 allows the direct sign extension from a byte to a longword. Syntax: EXTB.L Dn Extend byte to long Example: The following code: ext.w d0 ext.l d0 can be replaced with: extb.l d0 --- IV ---------------------- NEW INSTRUCTIONS ------------------- IV --- 4.1 Bit Field Instructions: The 020 is not confined to addressing data at byte, word or longword boundaries, the new bit-field instructions allows access to data at any arbitrarily bit position in a data register or memory. The length of the data may be from 1 up to 32 bits. These instructions have a different bit numbering than the ordinary instructions. The bits are numbered from the leftmost digit towards the right. To indicate a bit-field, the following syntax is used: {offset:width} The "offset" is the number of bits to skip The "width" is the number of bits included in the bit-field The following bit-field is described as {13:12} 31 23 15 7 0 Ordinary bit numbering | | | | | -------------XXXXXXXXXXXX------- Bit field | | | | | 0 8 16 24 31 Bit field numbering A bit field may also stretch across boundaries in memory, f. eks.: 31 23 15 7 031 23 15 | | | | || | | --------------------------XXXXXXXXXXXXXXXXX-------- | | | | | | | 0 8 16 24 32 40 48 This bit-field would have been described as {26:17} In addition, the offset may be negative when used in memory. The control of bit-field is supported by 8 instructions, 4 single- operand instructions (BFTST,BFCLR,BFSET, and BFCHG), and 4 double- operand instructions. (BFFFO,BFEXTU,BFEXTS, and BFINS) 4.1.1 Single operand Bit-field instructions These instructions can be viewed as extensions of the corresponding bit instructions (BTST,BCLR,BSET, and BCHG) Syntax: 1 BFTST {offset:width} Test bit-field 2 BFCLR {offset:width} Test bit-field and clear it 3 BFSET {offset:width} Test bit-field and set it 4 BFCHG {offset:width} Test bit-field and invert it Each of these instructions first tests the bit-field and sets the condition codes accordingly, then perform the action on the data (SET,CLR or CHG). Condition codes: N Set if the most significant digit was 1 Z Set if all bits are 0 C Cleared V Cleared X Not affected Only data register direct and control addressing modes are allowed for the operand. The offset may be either a value from 0-31 or contained as a 32-bit signed value in a data register. The width may be either a value from 1-32 or contained in the lower 5 bits of a data register. 4.1.2 Double operand Bit-field instructions These instruction provides more control over bit-fields, such as inserting, extracting and searching Syntax: 1 BFEXTU {offset:width},Dn Extract a bit-field 2 BFEXTS {offset:width},Dn Extract and sign extend 3 BFINS Dn,{offset:width} Inserts a bitfield 4 BFFFO {offset:width},Dn Find First 1 in a BF. Condition codes: N Set if the most significant bit in the BF is set Z Set if all bits in the BF are 0 C Cleared V Cleared X Not affected The offset may be either a value of 0-31 or contained as a 32-bit signed value in a data register. The width may be either a value of 1-32 or contained in the lower 5 bits of a data register. 4.1.2.1 BFEXTU instruction This instruction extracts a bit-field from the source operand, right-justifies it, and places it in the destination register. 4.1.2.2 BFEXTS instruction This instruction works just like the BFEXTU (4.1.2.1) instruction, but sign extends the bit-field to 32-bits before storing it in the destination register. 4.1.2.3 BFINS instruction This instruction extracts the lower bits of the source register, and inserts it into the destination bit-field. 4.1.2.4 BFFFO instruction The bit offset of the most significant 1 in the bit-field is stored in the data register. If no 1 is found in the bit-field, the sum of the offset and width is stored in the destination. The Bit-field instructions can be used for handling floating point numbers in software. 4.2 The RTD instruction (68010 and up) This instruction extends the operation of the RTS instruction. It pops the PC off the stack, then a 16-bit displacement is added to the SP. This makes it possible to clear parameters pushed on to the stack by a calling program. Syntax: RTD #displacement Example: A subroutine is called with parameters on the stack. The size of these parameters equals The subroutine allocates some stack space for local data. The size of this local data equals SubRoutine: link a5,#-LocalSize ;Allocate local data space movem.l d0-a6,-(sp) ;Save registers . . ;Do whatever... . movem.l (sp)+,d0-a6 ;Restore registers unlk a5 ;Deallocate local data space rtd #ParamSize ;Deallocate parameter space ; and return to caller Without this instruction, the stack cleanup would look like this: movem.l (sp)+,d0-a6 unlk a5 move.l (sp),(ParamSize,sp) lea (ParamSize,sp),sp rts 4.3 The BCD instructions The 68000 has some instructions for the manipulating of BCD coded data (ABCD,NBCD,SBCD). The 020 extends this command set to include instructions for packing/unpacking of such data, the PACK and UNPK instructions. 4.3.1 The PACK instruction This instruction packs data to a format usable by the other BCD instructions. When used in memory, the instruction fetches 2 bytes of data, adds a displacement, and concatenates bits 11-8 and 3-0 into a single byte. When used on a data register, the displacement and the contents of the data register is added, then bits 11-8 and 3-0 are contatenated to form a byte. This instruction is useful for encoding a decimal number stored as a string of ascii characters into a usable BCD code. Syntax: 1 PACK -(An),-(Am),#displacement 2 PACK Dn,Dm,#displacement Example: We want to encode the ascii string "76" ($3736) to BCD. Recall that the numeric characters have ascii codes $30-$39. "Data" is a pointer to the string we wish to convert. move.l (Data,pc),a0 move.w (a0),d0 pack d0,d1,#$0000 Register d1 now contains the hex value $76. If we'd wished to, we could have added the BCD 12 to the number in the same instruction, like this: pack d0,d1,#$0102 Register d1 now contains the hex value $88 4.3.2 The UNPK instruction This instruction unpacks a BCD coded number to a less compact version. When used in memory, this instruction copies the 2 nibbles of the source byte to the low nibble of two separate bytes, the two bytes are concatenated into a word, and a displacement is added to the word. When used on a data register, the nibbles are copied from the LSB or the source register, and the unpacked word is placed in the LSW of the destination register. Syntax: 1 UNPK -(An),-(An),#displacement 2 UNPK Dn,Dm,#displacement Example: We want to unpack the BCD number $76 to a printable ascii string. The numberical characters start at ascii $30, so we must add this value to both bytes. The displacement is then $3030. "Data" is a pointer to the string we wish to fill. Register d0 is preloaded with $76 move.l (Data,pc),a0 unpk d0,d1,#$3030 move.w d0,(a0) 4.4 The MC68040 Block move instruction (MOVE16) This instruction uses the burst mode for rapid movement of a block of data. The instruction can be used for fast block copy, memory initialization and co-processor communication. This instruction aligns all addresses to 16-byte boundaries by masking off the lower 4 bits of the addresses. A line of 16 bytes is copied from the source to the destination address. Syntax: 1 MOVE16 (Ax)+,(Ay)+ 2 MOVE16 xxx.L,(An) 3 MOVE16 xxx.L,(An)+ 4 MOVE16 (An),xxx.L 5 MOVE16 (An)+,xxx.L For the Amiga computers, some precautions must be taken when using this instruction in Chip Memory. Fast Memory works fine, though. Example: We want to copy an area of 128 bytes from "A" to "B" lea (A,pc),a0 lea (B,pc),a1 moveq #7,d0 Loop: move16 (a0)+,(a1)+ dbf d0,Loop 4.5 The 68010 BKPT instruction This instruction is used for hardware testing, and executes differently on the various members of the 68000 family, and is not described here. 4.6 The 68020 Module instructions These instructions, (CALLM and RTM) appear only on the 68020 processor, and the use of them is beyond the scope of this text. 4.7 The CAS/CAS2 instructions These instructrions are provided for maintaining and protecting critical data in a multiprocessor environment. Multiprocessing is beyond the scope of this text, but I'll explain these instructions anyway Suppose two different processes write to the same memory, and have access to a shared variable. This variable may be anything, such as a pointer to a list. Process 1 retrieves this variable in d7, and copies it to d0 as a backup pointer. It then operates upon the pointer in d7. Before Process 1 is finished with the operation, it is put to sleep, and another process is made ready to run. This process alters the contents of our variable. Later, our process is allowed to run again. Before it can update the variable, it must test if someone else has altered the variable in memory. This is done by comparing the backup pointer in d0 with the data in memory. -If the values are equal, the variable has not been altered by anybody, and the new value can safely be written to the variable. -However, if the values are not equal, the process should reload the new value of the variable, and rerun it's operation. The other process should protect itself in the same way as our process did. As a final point, the comparison and rewriting must be protected, so that the other process doesn't alter the variable between the comparison and rewriting. This is done by using an indivisible RMW cycle (RMW= Read-Modify-Write) 4.7.1 The CAS instruction The instruction CAS implements this comparison between global data and a register, as well as a data transfer using this RMW cycle. Syntax: CAS.Size Dc,Du, The is compared to the contents of register Dc. If they are equal, the contents of Du is written to . If they are not equal, the contents of is copied to Dc. The Z bit reflects the result of the comparison. 4.7.2 The CAS2 instruction The CAS2 instruction works like the CAS instruction except that it performs comparisons and updates on two data values. Syntax: CAS2.Size Dc1:Dc2,Du1:Du2,(Rn1):(Rn2) Rn may be any data or address register. Only if the contents of Dc1 equals (Rn1) and the contents of Dc2 equals (Rn2) will the contents of Du1 be written to (Rn1) and Du2 be written to (Rn2). This instruction is well suited to protect multi-linked lists in a multi-processor environment. WARNING: Like the TAS instruction, the CAS/CAS2 instructions should NOT be used on an Amiga, as they are not supported by the hardware. The indivisable RMW cycle conflicts with the Amiga's bus system. 4.8 The CoProcessor interface instructions. These instructions are outside the scope of this text, see the "MC68020 CoProcessor support instructions" text by this author for information about how to program using these instructions. However, here is a list of the instructions: 4.8.1 User state coprocessor instructions cpBcc Branch on Coprocessor Condition cpDBcc Test coprocessor Condition, Decrement and Branch cpGEN Coprocessor general function cpScc Set on Coprocessor Condition cpTRAPcc Trap on Coprocessor Condition 4.8.2 Supervisor state coprocessor instructions cpRESTORE Coprocessor Restore Functions cpSAVE Coprocessor save Function 4.9 Conditional TRAP instruction The 020 allows conditional traps. If the specified condition is true, a TRAPcc exception (exception 7) will be taken. Syntax: TRAPcc #Data The may be any of the 's that are supported by the ordinary conditional branch instructions. --- V ---------------------ADDRESSING MODE TABLES------------------- V --- 5.1 Allowed Adressing modes CMP2 Compare Register Against Bounds Syntax: CMP2 ,Rn Size: Byte/Word/Long Length: 4 bytes+ data Modes: (An) (d16,An) (d8,An,Xn) (bd,An,Xn) $xxx.W $xxx.L (d16,PC) (d8,PC,Xn) (bd,PC,Xn) ([bd,An,Xn],od) ([bd,An],Xn,od) ([bd,PC,Xn],od) ([bd,PC],Xn,od) ----------------------------------- MULU/MULS Multiply (Un)signed Syntax: MULU/S.W ,Dn 16b*16b=>32b MULU/S.L ,Dl 32b*32b=>32b MULU/S.L ,Dh:Dl 32b*32B=>64b Size: Word/Long Length: 2 bytes+ data (word) 4 bytes+ data (long) Modes: (both) Dn (An) (An)+ -(An) (d16,An) (d8,An,Xn) $xxx.W $xxx.L # (d16,PC) (d8,PC,Xn) (bd,An,Xn) ([bd,An,Xn],od) ([bd,An],Xn,od) (bd,PC,Xn) ([bd,PC,Xn],od) ([bd,PC],Xn,od) ----------------------------------- DIVU(L)/DIVS(L) Divide (Un)signed Syntax: DIVU.W ,Dn 32b/16b=>16r:16q DIVU.L ,Dq 32b/32b=>32q DIVU.L ,Dr:Dq 64b/32b=>32r:32q DIVUL.L ,Dr:Dq 32b/32b=>32r:32q Size: Word/Long Length: 2 bytes+ data (word) 4 bytes+ data (long) Modes: (both) Dn (An) (An)+ -(An) (d16,An) (d8,An,Xn) $xxx.W $xxx.L # (d16,PC) (d8,PC,Xn) (bd,An,Xn) ([bd,An,Xn],od) ([bd,An],Xn,od) (bd,PC,Xn) ([bd,PC,Xn],od) ([bd,PC],Xn,od) ----------------------------------- CHK2 Check Register Against Bounds Syntax: CHK2 ,Rn Size: Byte/Word/Long Length: 4 bytes+ data Modes: (An) (d16,An) (d8,An,Xn) (bd,An,Xn) $xxx.W $xxx.L (d16,PC) (d8,PC,Xn) (bd,PC,Xn) ([bd,An,Xn],od) ([bd,An],Xn,od) ([bd,PC,Xm],od) ([bd,PC],Xn,od) ----------------------------------- EXTB Extend Byte to long Syntax: EXTB.L Dn Size: Word/Long Length: 2 bytes Modes: Dn ----------------------------------- BFxxx BitField instructions Syntax: BFTST {offset:width} BF Test BFSET {offset:width} BF test and Set BFCLR {offset:width} BF test and clear BFCHG {offset:width} BF test and Change BFEXTS {offset:width},Dn BF Extract Signed BFEXTU {offset:width},Dn BF Extract Unsigned BFFFO {offset:width},Dn BF Find First One BFINS Dn,{offset:width} BF Insert Both offset and width may be specified as data registers Size: The BF instructions are unsized Length: 4 bytes+ data Modes: Dn (An) (d16,An) (d8,An,Xn) (db,An,Xn) ([bd,An,Xn],od) ([bd,An],Xn,od) ([bd,PC,Xn],od) ([bd,PC],Xn,od) $xxx.W $xxx.L (d16,PC) (d8,PC,Xn) (bd,PC,Xn) ([bd,PC,Xn],od) ([bd,PC],Xn,od) ----------------------------------- RTD Return and Deallocate Syntax: RTD #displacement Size: Unsized Length: 4 bytes Modes: # ----------------------------------- PACK Pack BCD data Syntax: PACK -(Ax),-(Ay),#displacement PACK Dx,Dy,#displacement Size: Unsized Length: 4 bytes Modes: See syntax ----------------------------------- UNPK Unpack BCD data Syntax: UNPK -(Ax),-(Ay),#displacement UNPK Dx,Dy,#displacement Size: Unsized Length: 4 bytes Modes: See sntax ----------------------------------- MOVE16 Move 16 bytes block Syntax: MOVE16 (Ax)+,(Ay)+ MOVE16 $xxx.L,(An) MOVE16 $xxx.L,(An)+ MOVE16 (An),$xxx.L MOVE16 (An)+,$xxx.L Size: 1 line (16 bytes) Length: 4 bytes (Ax)+,(Ay)+ 6 bytes The rest Modes: See syntax ----------------------------------- CAS/CAS2 Compare And Swap with operand Syntax: CAS Dc,Du, CAS2 Dc1:Dc2,Du1:Du2,(Rn1):(Rn2) Size: Byte/Word/Long (CAS) Word/Long (CAS2) Length: 4 bytes+ data (CAS) 6 bytes+ data (CAS2) Modes: (CAS) (An) (An)+ -(An) (d16,An) (d8,An,Xn) (bd,An,Xn) ([bd,An,Xn],od) ([bd,An],Xn,od) $xxx.W $xxx.L ----------------------------------- TRAPcc Trap on Condition Syntax: TRAPcc TRAPcc.W # TRAPcc.L # Size: Unsized/Word/Long Length: 2 bytes (Unsized) 4 bytes (Word) 6 bytes (Long) Modes: # ----------------------------------- *********************************************************************************************** * The text in this document is written by Erik H. Bakke * * © 1993 Erik H. Bakke/Bakke SoftDev * * This document may be freely redistributed as long as it remains unchanged and together with * * the FPU programming document * *********************************************************************************************** *Permission is granted to Michael Glew to incorporate it in his Asp68k project, and eventually* * editing it to fit in the Asp68k environment * *Permission is granted to include this document in the HowToCode archive as long as it remains* * unchanged. * *********************************************************************************************** * For error corrections, comments etc., the author can be reached at * * e-mail: db36@hp825.bih.no * * phone: +47-5630-5537 (13:00-21:00 GMT) * * post: Erik H. Bakke * * Bjørnen * * N-5227 SØRE NESET * * Norway * ***********************************************************************************************