Two Pass Assembler
In this project you are asked to write an Assembler program using the C programming language. Specifications for
this Assembler will be defined subsequently. Due to the size of this project it should be divided to several source
files. The source files have to meet standards of clarity readability and well formed syntactic conventions.
CLICK HERE to get answer for this project.
Background
As you may know, there are many programming languages. A huge amount of programs written in different languages
can run on the very same machine. How does the computer "know" so many languages? The answer is very simple, the
computer knows only one language. This language, known as machine language, includes instructions and data written
in binary code. The code is stored inside a block of memory appearing like a string of binary digits. The Central
Processing Unit - CPU - knows how to decode the instructions into small meaningful fields of instructions,
Addresses and data. The decoding process is determined unambiguously by the firmware of the CPU.
Computer memory is a collection of bits, usually seeing as collection of constant sized units, measured in bytes.
When a program is loaded into memory it is almost impossible for an unskilled eye to see any physical difference
between that part in memory where the program is stored and the rest of the memory.
The Central Processing Unit can execute certain amount of simple instructions using several build in registers.
For example: It can move value from memory cell to a CPU register and vice versa, it can add 1 to a value stored
in a register, it can check whether a value stored in a CPU register is equal to zero, etc. These simple instructions
and their combinations are the building blocks of computer program as appeared in memory. Program, written in any computer
language, can be translate to machine language by a special program called compiler, this program generates
file(s) called object file(s) . The object file(s) can be linked to one executable file that
can be load into computer memory.
Once our object file(s) have been loaded into memory, the computer starts to execute them in the following way: Each opcode, in the source file, refers to a datum, register or an address in memory, specified in the instruction itself. The CPU decode each line of instruction into operation and operands and executes it. A special register, Program Counter, built inside the CPU, is always pointing on the next instruction that need to be executed. The program stop executing when the CPU gets to a stop instruction.
Every programming language has a compiler or interpreter that translate the source code into machine language. In the assembly programming language the interpreter is also called assembler. In this project you are asked to write an assembler. For this purpose we are going to define a special type of hardware architecture and an assembly programming language written for this architecture.
Note: General explanation on how the assembler works also includes references to the working process of the linker program. The purpose of these references is to provide better understanding on the processing procedure of the assembler's output. Make no mistakes, you are not required to write linker or loader program. You should only write the assembler.
First let us define our imaginary computer machine and the assembler for this project.
"Hardware"
Our computer architecture consists from CPU (Central Processing Unit), registers and Random Access Memory RAM, where part of the memory is being used as a stack. The size of each word in memory is 16 bits. Arithmetics is to be carried by the '2 complement' method. Our computer machine can only handle integers (Positives or negatives), it doesn't handle real numbers.
Registers
Our computer machine includes the following list of registers:
- Eight general registers (r0, r1, r2, r3, r4, r5, r6, r7)
- One Program Counter register (PC).
- One Stack Pointer register (SP).
- One Status register (PSW - Program Status Word) which has two flags: carry flag and zero flag.
All registers are 16 bits in size.
The two first bits of the PSW register are C and Z in correspondence
The size of memory is 2000 words (each word is 16 bits in size).
Characters are coded in ascii.
Instruction Set Architecture ISA
Instruction
In our computer machine, instruction is a word (16 bits in size) that carries information about the operator and operands.
Although instruction is a string of 16 bits, it can be divided into fields. The following table provides further information about the instruction.
Fields |
Operation |
Source Operand |
Destination Operand |
Operator |
Addressing
Mode |
Register |
Addressing
Mode |
Register |
Bits |
15 - 12 |
11 - 9 |
8 - 6 |
5 - 3 |
2 - 0 |
Table 1: The structure of an instruction word.
From Table 1 we can see that bits 12-15 codes the operator (representing the operation or the operator field) this allow us to code 2
4=16 different operators. The following table maps operator's name to its corresponding instruction code (opcode).
Operator |
Opcode binary |
Opcode Octal |
Opcode Decimal |
mov |
0 |
0 |
0 |
cmp |
1 |
1 |
1 |
add |
10 |
2 |
2 |
sub |
11 |
3 |
3 |
mul |
100 |
4 |
4 |
div |
101 |
5 |
5 |
lea |
110 |
6 |
6 |
inc |
111 |
7 |
7 |
dec |
1000 |
10 |
8 |
jnz |
1001 |
11 |
9 |
jnc |
1010 |
12 |
10 |
shl |
1011 |
13 |
11 |
prn |
1100 |
14 |
12 |
jsr |
1101 |
15 |
13 |
rts |
1110 |
16 |
14 |
hlt |
1111 |
17 |
15 |
Table 2: The opcode
All operators are written in lower case letters, details on the
meaning of these operators will be specified later.
- Bits 9-11:
- This field refers to the addressing mode of the source operand. Depending on the value of this field (numeric values of bits 9-11) , the instruction may refer to additional word (first additional word)
- Bits 6-8:
- This field refers to the register of the source operand. The field (bits 6-8) maps its numeric value n to
register rn.
Notice: If the addressing mode in the source operand does not require
the source register, then the source register field are not in use. In such a case the numeric value of the field (bits
6-8) is equal to zero.
- Bits 3-5:
- This field refers to the addressing mode of the destination operand. Depending on the numeric value of this field (bits
3-5) , the instruction may refer to additional word (second additional word)
- Bits 0-2:
- This field refers to the register of the destination operand. The field (bits 0-2) maps its numeric value n to
register rn.
Notice: If the addressing mode in the destination operand does not require
the destination register, then the source register field are not in use. In such a case the numeric value of the field (bits
6-8) is equal to zero.
There are six types of addressing modes in our assembly language, some of these modes require additional information, i.e. additional word.
The following table provides information on all types of addressing mode.
First Word |
Additional Word |
Operand |
Way of Writing |
Example |
Addressing Mode |
Register |
Field Value (Decimal) |
Name |
Field Value (Decimal) |
0 |
Instant addressing |
zero Not in use |
Yes |
The numeric value of the operand is determined by the numeric value of the additional word. |
The operand is a number preceded by the number sign #. |
mov #-1,r2 |
1 |
Direct addressing |
zero Not in use. |
Yes |
The additional word contains memory address. The numeric value of the operand is the value of this address. |
The operand is a label, either declared or expected to be declared later in the file. |
mov x,r2 |
2 |
Indirect addressing |
zero Not in use. |
Yes |
The numeric value of the additional word contains memory address. The value of this address is also a memory address. The value of the second address is the numeric value of the operand. |
Indirect addressing is indicated by the at '@' sign which appeared just before the label. The label is declared in the same way as in the direct addressing mode. |
mov @x, r2 |
3 |
Relative Addressing |
zero Not in use. |
Yes |
The additional word contains an integer number (positive or negative) indicating the distance in words from the current command to the address of the operand. |
The operand is a label, defined as in the case of direct addressing mode, preceded by the asterisk sign *. |
mov *x, r2 |
4 |
Direct register addressing |
n (Positive integer) |
No |
Register rn contains the value of the operand. |
The operand is a legal register name. |
mov r1, r2 |
5 |
Indirect register addressing |
n (Positive integer) |
No |
Register rn contains information on memory address. This memory address contains the operand. |
|
mov @r1,r2 |
Table 3: Types of addressing mode.
As shown in Table 3, addressing mode(s) determine the appearance of the additional word(s).
Machine Instruction Characterization
Machine instruction may be classified into three different classes (according to the number of operands appeared in each instruction).
First Class of Operators
The first class contains all machine instructions that get two operands. Any machine instruction that belongs to this class may contain one of the following
operators:
mov, cmp, add, sub, mul, div, lea, shl
The following table provides further explanation on the operational aspects of these operators:
Numeric Code Octal |
Operator |
Description |
Example |
Example Description |
0 |
mov |
Copies the value of the source operand (the first operand) to the destination operand (the second operand). |
mov A, r1 |
copy the value of A to register r1 |
1 |
cmp |
Compare between two operands. The cmp operator subtracts the destination operand from the source operand, without saving the subtraction result, it then updates the zero flag, flag z, in the status register, PSW. |
cpm A, r1 |
If the values of A and r1 are equal, then the zero flag A, in the status register PSW, is turned on. Else the zero flag is turned off. |
2 |
add |
The destination operand is assigned with the value of the source operand plus the value of the destination operand. |
add A, r0 |
Register r0 gets the sum of r0 and A. |
3 |
sub |
The destination operand is assigned with the value of the source operand minus the value of the destination operand. |
sub #3, r1 |
Register r1 is assigned with the value of r1 minus 3 ; |
4 |
mul |
Destination operand assigned with the value of the source operand times the value of destination operand |
mul A, r2 |
Register r2 assigned with A × r2. |
5 |
div |
Destination operand is assigned with the value of destination operand divided by the source operand (destination divided by source) |
div A,r2 |
Register r2 assigned with r2/A |
6 |
lea |
Acronym for 'load effective address'. This operation loads memory address, marked with the label appeared in the first operand to the destination operand. |
lea ABC, r1 |
The memory address of label ABC is assigned to register r1. |
13 |
shl |
Shift bits to the left in the source operand. The number of shifts is determined by the value of the destination operand. |
shl r1, 1 |
Register r1 is shifted 1 bit to the left. |
Table 4: Operators with two operand arguments.
Second Class of Operators
The second class contains all machine instructions that gets one operand. In such cases there is no source operand, thus bits 6-11 are meaningless (their values is zero). Any machine instruction in this class may contain one of the following instruction:
inc, dec, jnz, jnc, prn, jsr
The following table provides further explanation on the operational aspects of these operators:
Numeric Code Octal |
Operator |
Description |
Example |
Example Description |
7 |
inc |
The operand is increased by one. |
inc r2 |
Register r2 is assigned with r2 plus 1. |
10 |
dec |
The operand is decreased by one. |
dec r2 |
Register r2 is assigned with r2 minus 1. |
11 |
jnz |
Acronym: jump if not zero. The Program Counter register PC is assigned with the source operand if the Z flag, in the Program Status Word register PSW is not zero. |
jnz LINE |
If the Z flag (in the PSW register) is not zero, then PC register is assigned with LINE. |
12 |
jnc |
Acronym: jump if not carry. The Program Counter register PC is assigned with zero if the C flag, in the Program Status Word register PSW is not 0. |
jnc LINE |
If the C flag (in the PSW register) is not zero, then PC register is assigned with LINE. |
14 |
prn |
Prints the ASCII equivalent of the operand to the standard output file (stdout). |
prn r1 |
The ASCII equivalent character of the value stored in r1 is printed to standard file. |
15 |
jsr |
Calls a subroutine that pushes register PC to the running time stack and assign the operand to the Program Counter register PC. |
jsr FUNC |
stack[sp]←PC
SP ←SP-1
PC←FUNC |
Table 5: Operators with one operand arguments.
Third Class of Operators
The third class contains all machine instructions that gets no operands. In such cases bits 0-11 are meaningless (their values is zero). Any machine instruction in this class may contain one of the following instruction:
rts, hlt
The following table provides further explanation on the operational aspects of these operators:
Numeric Code Octal |
Operator |
Description |
Example |
Example Description |
16 |
rts |
Pops a value from the running time stack and move this value to the Program Counter register. |
rts |
SP ← SP+1 PC← stack[SP] |
17 |
hlt |
Halts the program. |
hlt |
Halting the program. |
Table 6: Operators with one operand arguments.
Additional Notes
Our assembly language is consisted of statements separated by the new line character '\n'. When we look into a file it appeared to be made out of lines of statements, each statement appeared in its own line.
Our assembly language has four types of statements. These statements described in the following table.
Type of statement |
General Explanation |
Empty Statement |
Line with this kind of statement may contains only white spaces: tab character '\t' or space character ' ' |
Comment Statement |
The first character in a line with this statement is the semicolon ';' character. This line should be completely ignored by the assembler. |
Declarative Statement |
This statement is a directive to the assembler program. It does not generate machine instruction. |
Operation Statement |
This statement generates machine instruction that needs to be executed by the CPU. The statement represent machine instruction in symbolic form. |
Table 7: Operators with one operand arguments.
Directive Statement
Directive statement is of the following form:
Directive statement may optionally start with a label, the label has to follow certain syntax rules (to be described later). Directive can start with or without a label, in any case a directive name, preceded by a dot '.' character, must be included. NO whitespace allowed between the '.' character and the directive name. If the directive does include a label, then at least one whitespace character is separating between the label and the '.' character. Following the directive name, whitespace-separated, appearing, in the same line, the directive parameters (the number of parameters is determined by the type of the directive). As mentioned, directive statement may include four types of directive:
-
.data
The parameter(s) of data is a list of legal numbers separated by a comma ',' character. For example
.data +7,-57 ,17 , 9
Notice that any number of whitespace characters may appear between the number(s) and the comma character(s). However, the comma character must separate between two numeric values.
The '.data' directive statement directs the assembler to allocate space in its data image where the appropriate numeric parameters is to be stored. It also direct the assembler to advance the data counter by the number of parameters (of the '.data' directive). If the '.data' directive has a label name, then this label name is assigned with the value in the data image (before it was advanced) and get inserted to the symbols table. This way we can refer to certain place in the data image using the label name. For instance, if we write
XYZ: .data +7,-57,17,9
mov XYZ, r1
then register r1 is assigned with the value +7. If we continue to write
lea XYZ, r1
then r1 would have been assigned with the address (in the data image) that stores the +7 value.
-
.string
The '.string' directive statement gets only one legal string as parameter. The meaning of '.string' directive statement is similar to the '.data' directive statement. The ASCII characters composed the string are coded to their appropriate numeric ASCII value&40;s) and get inserted to the data image by their order. At the end a zero value is being inserted, to mark the end of the string. The value of the data counter is to be increase, according to the length of the string + one. If the line includes a label name, then the value of the label name is going to point to the location in memory that stores the ASCII code of the first character of the string, at the same way as it was done for the '.data' string. For instance the directive statement
ABC: .string "abcdef"
is going to allocate an array of characters of length 7 starting from the address stored in the ABC label name. This "array" is initialized to the ASCII value of characters a, b, c, d, e, f in correspondence, the array is to be ended with the zero value concatenate to the end of the array.
-
.enrty
The '.entry' directive statement gets one parameter only. This parameter is a label name, declared by other directive statement in the very same file where the
The purpose of the '.entry' directive statement is to deal handle cases where a label name defined in an assembly source file A needs to be referred by other assembly source file(s) B, C, D, etc. In this case the '.entry' directive statement, written in the file A, gets the label name as its parameter (the '.entry' directive statement has to have a single parameter). For instance, if an assembly source file A contains the following lines
HELLO .string "People"
.entry HELLO
HELLO: add #1, r1
then other assembly source file(s), may refer to HELLO label name. Notice that a label at the beginning of the '.entry' directive is meaningless.
-
.extern
The '.extern' directive statement gets one parameter this parameter is the name of a label name defined in other assembly source file. The purpose of this directive statement is to declare that the label has been defined in other source file and that this assembly source file (the one that contains the '.extern' directive statement) is using it. The correspondence between the value of the label, as appeared in the source file where it was defined, and the operation instruction(s) that are using it as an argument is to be done at linking time. Notice that a label at the beginning of the '.extern' directive is meaningless.
Operation Statement
Operation statement is composed from the following:
- Optional label.
- Operation name (Opname)
- Operands (the number of operands may be 0, 1 or 2 depending on the operation).
The length of a statement (of any type) cannot exceed 80 characters.
The name of the operation is to be written in lower case letter, operation name can be one of the 16 operations mentioned above.
After the operation name, separated with whitespace character(s), one or two operands may appear. In the case of two operands, the operands are separated with a comma ',' character. As mentioned before, whitespace character(s) may separate the comma and the operands. Operation statement with two operands has the following form:
<Label
name>: <operation> <Source operand> , <Destination operand>
For instance
HELLO: add r7, B
Operation with one operand has the following form:
<Label
name>: <operation> <Operand>
For instance
HELLO: jnc XYZ
Operation with no operands has the following form:
<Label
name>: <operation>
For instance
END: hlt
If a label is appearing in the beginning of the operation statement, then this label is inserted into the table of symbols. The value of the label is pointing on the location of the operation statement in the code image, built by the assembler.
Formal Definitions
Label
Every label must begin with an upper or lower case letter, the rest of the label may contain letters or numbers. The length of the label cannot exceed 30 characters. The label ends with a column ':' character. The column character is not part of the label name it is just a sign representing the end of the character. The label must begin with the first column of the line. Label name cannot have more than one definition. The following labels are written correctly.
hEllo:
x:
He78940
Label name cannot be the same as register or operation name.
The label derived its value from the syntax. Label written at the beginning of '.data' or '.string' directive gets the value of the appropriate data counter. Label written at the beginning of an operation statement gets the value of the appropriate operation counter.
Number
Number is a string of decimal digits &40;0-9) that may optionally be preceded by either '-' or '+' sign. The number gets its value from its decimal representation
represented by the string of digits &40;0-9). For instance the numbers 76, -5, +123 can be accepted as numbers. As mentioned, we do not handle rational or real numbers, only integers.
String
String is a sequence of visible ASCII characters surrounded by double quotation marks. The quotation marks are not part of the string. The string "Hello World" is an example for legal string.
Two Pass Assembler
When the assembler is starting to translate code it needs to carry two major assignments. Its first assignment is to identify and translate the operation code and its second assignment is to determine addresses for all data and variables appeared in the source file(s). For instance, when the assembler reads the following code:
it has to replace the operation names lea, jnz, prn, sub, inc, jnc, hlt with their equivalent binary codes, in addition, the assembler
has to replace the symbols STR, LENGTH, MAIN, LOOP, END with their appropriate addresses that have been allocated for the directive statements.
Assuming that the code in example I has being translated by the assembler and has been stored (operations and directives) in a memory block that starts from address 0100, then this translation can be described as follow:
Label |
Decimal Adress |
Octal Address |
Command |
Operands |
Binary Machine Code |
Octal Machine Code |
Main: |
0100 |
0144 |
mov |
LENGTH, r1 |
0000 001 000 100 001 |
001041 |
|
0101 |
0145 |
|
|
0000 000 001 111 000 |
000170 |
|
0102 |
0146 |
lea |
STR, r2 |
0110 001 000 100 010 |
061042 |
|
0103 |
0147 |
|
|
0000 000 001 110 001 |
000161 |
LOOP: |
0104 |
0150 |
jnz |
END |
1001 000 000 001 000 |
110010 |
|
0105 |
0151 |
|
|
0000 000 001 110 000 |
000160 |
|
0106 |
0152 |
prn |
@r2 |
1100 000 000 101 010 |
140052 |
|
0107 |
0153 |
sub |
#1, r1 |
0011 000 000 100 001 |
030041 |
|
0108 |
0154 |
|
|
0000 000 000 000 001 |
000001 |
|
0109 |
0155 |
jnc |
r2 |
0111 000 000 100 010 |
070042 |
|
0110 |
0156 |
jnc |
*LOOP |
1010 000 000 011 000 |
120030 |
|
0111 |
0157 |
|
|
1111 111 111 111 010 |
177772 |
END: |
0112 |
0160 |
hlt |
|
1111 000 000 000 000 |
170000 |
STR: |
0113 |
0161 |
.string |
"abcdef" |
0000 000 001 100 001 |
000141 |
|
0114 |
0162 |
|
|
0000 000 001 100 010 |
000142 |
|
0115 |
0163 |
|
|
0000 000 001 100 011 |
000143 |
|
0116 |
0164 |
|
|
0000 000 001 100 100 |
000144 |
|
0117 |
0165 |
|
|
0000 000 001 100 101 |
000145 |
|
0118 |
0166 |
|
|
0000 000 001 100 110 |
000146 |
|
0119 |
0167 |
|
|
0000 000 000 000 000 |
000000 |
LENGTH: |
0120 |
0170 |
.data |
6 |
0000 000 000 000 110 |
000006 |
If the assembler maintains a table of all the operation names and their corresponding binary codes, then all operation names can be easily converted. Whenever the assembler reads an operation name it can simply use the table to find its equivalent binary code. In order to carry the same conversion for the addresses of symbols the assembler has to build similar table.
For instance, in example I, prior to reading the source file(s) the assembler has no way to know that the LOOP symbol relates to address 0104 (decimal).
Thus, in regards to all symbols that have been defined by the programmer, the assembler has to accomplish two separate tasks. The first task is to build a table of all symbols and their related numeric values, and the second is to replace all the symbols, appeared in the source file(s) with the numeric values of the address fields. This two assignments can be achieved by performing two separate scans (passes) on the source file(s). In the first pass the assembler builds a table of symbols, this table correspond address to each symbol.
The following table refers to example I it represents the table of symbols generated by the assembler after the completion of the first pass.
Symbol |
Decimal Value |
Octal Value |
MAIN |
0100 |
0144 |
LOOP |
0104 |
0150 |
END |
0112 |
0160 |
STR |
0113 |
0161 |
LENGTH |
0120 |
0170 |
Table 8: Table of symbols derived from applying the first pass on the source file(s) of example I.
In the second pass the assembler translate the source file(s) into binary machine code.
Notice that the two passes are done by the assembler, during translation (in the assembly time), before the linking process.
After the translation process, the program may be linked and load to memory for execution.
First Pass
In order to assign each label name to an address value the assembler needs rules. The basic principle is to offset addresses. While reading an operation statement containing label operand(s), the assembler offset the next memory address(s) according to the number of label operand(s).
Address value determined by counting addresses this type of counting is done by the assembler and stored in the address location register. The initial value of the address location register is zero, thus the first statement is to be loaded into address 0. After the assembler determines the length of the instruction statement the value of the address location register is increased by the number of bytes, occupied by the statement operation, this way the address location register points to the next empty cell.
As mentioned, to assign each operation statement to its machine instruction the assembler stores a table (Table 2). At the time of translation the assembler replace each operator in the operation statement with its equivalent code. However the translation of all the operation statement is not so simple. There are many operators that are using different types of addressing modes. The same operator can get different meanings depending on addressing mode.
The assembler has to read the statement in its complete form and then decide on how to translate it to machine code instruction according to its addressing mode. Usually the statement is divided into operator field and additional fields, the additional fields contains information on addressing mode.
The same principle applies in our assembler. Bits 12-15 (in the first word) represent the operator, whereas bits 3-5 and 9-11 represent the addressing mode.
Our computer machine is flexible in regards to the addressing mode of two operands. Notice that this flexibility is not mandatory. There are computer machines where all of their operation statements get one operand (operations are executed on this operand and constant register) or other machines that allow addressing mode for only one operand, the other operand has to be any register or even other computer machines with 3 operands (the third operand stores the result of the operation).
When the assembler reads a label, appeared in the beginning of the line, it has to know that this is a definition of label, then the label is assigned with a memory address &40;the value of the location address register&41;. This way each of the labels get their addresses while reading by the assembler. The labels are inserted to the table of symbols containing the name of the label its address and other additional information, such as the type of the label. This way, when the assembler reads an operation statement with label name operand(s) it can get the appropriate address from the table of symbols.
As we already know, programmer may refer to symbol that has not been defined yet in the program, definition is later to come. For example: In the following code
jnc A
.
.
.
A: ...
Label name A, appeared as an argument for jnc, has not been defined yet, definition of this label appears later in the code. When the assembler reads a line with undefined label address, such as in the case of 'inc A', it cannot assigned it (label name A) with address value. This problem is to be solved in the second pass.
Second Pass
In the first pass the assembler cannot assign address value for undefined label address, during the first pass the assembler marked these label name(s) with zero(s). Only after the assembler complete a full scan, where all label name(s) have been inserted to the table of symbols, the assembler can assign an address value to all label name(s) appeared in the operand(s) field(s) of the operation statements. To do this the assembler reads the program once again assigning (using the table of symbols completed in the first pass) all label name(s) appeared in the operand(s) of the operation statements to their corresponding address value(s). This is the second pass, at the end of the second pass the assembler returns the final translated file(s).
Assembler of One Pass
There are assemblers that do not execute the second pass, the translation process in these kind of assemblers is done in the following way:
At the first pass the assembler maintains a table of symbols (as in the two pass example). Unlike the two pass assembler the table of symbols in the one pass assembler contains the label name and the memory address of the operation referring to the label name(s). At the end of the first pass the one pass assembler is to complete the translation by filling all missing addresses from the table of symbols.
For example: Let us consider the following code:
300: add TAB,r1
.
.
.
400: TAB: ...
.
.
.
500: jnz TAB
where 300, ... , 400,... ,500, ... are memory addresses. The one pass assembler is to insert label name TAB to the table of symbols and set aside memory addresses 301 and 501 for the TAB label. Note that addresses 300 and 500 are already occupied by the operation statements (operator
add and operator
jnz) the additional words is to be followed right after the operation instruction. After the first pass the one pass assembler will fill all missing addresses using the table of symbols.
The advantage of one-pass-assembler is speed, saving the need for the second pass. The disadvantage is lack of clarity, the two pass assembler is easier to maintain.
Separation of Data and Instructions
For several assembler programs there are different address location registers. This separation has several benefits, it allows us to separate the code and data to different segments in memory, this is much better than adjusting the definition of data into the instructions that are using them.
The method of adjusting definition of data into the instructions may cause runtime errors, following a small error the CPU may try to
execute the data. This kind of error may be caused by omitting the halt instruction or by wrong branching. If the CPU attempts to
execute none legal instruction code, then an error message should appear. Nevertheless, a line of code that is seems to look like a legal instruction,
may cause to more sever problem, since the error may not be found immediately.
In your assembler program we will be asked to separate the data segment from the instruction segment.
Assembler Error Detection
The assembler may detect errors in the syntax of the language, errors such as "Instruction does not exist" or "to many operands" and more.
The assembler has to verify that all symbols are defined exactly once. Thus all errors detected by the assembler may be attributed to a particular line
of input. If, for example, two addresses are assigned to instruction that should be assigned with only one address, then the assembler may respond with
"To many addresses" error message. The following table contains information on legal addressing mode for the source and destination operands.
Operator |
The Code in Hexadecimal Base |
Legal Addressing Modes for the Source Operand |
Legal Addressing Modes for the Destination Operand |
mov |
0 |
0,1,2,3,4,5 |
1,2,3,4,5 |
cmp |
1 |
0,1,2,3,4,5 |
0,1,2,3,4,5 |
add |
2 |
0,1,2,3,4,5 |
1,2,3,4,5 |
sub |
3 |
0,1,2,3,4,5 |
1,2,3,4,5 |
mul |
4 |
0,1,2,3,4,5 |
1,2,3,4,5 |
div |
5 |
0,1,2,3,4,5 |
1,2,3,4,5 |
lea |
6 |
1 |
1,2,3,4,5 |
inc |
7 |
No source operand |
1,2,3,4,5 |
dec |
10 |
No source operand |
1,2,3,4,5 |
jnz |
11 |
No source operand |
1,2,3,5 |
jnc |
12 |
No source operand |
1,2,3,5 |
shl |
13 |
1,2,3,4,5 |
0,1,2,3,4,5 |
prn |
14 |
No source operand |
0,1,2,3,4,5 |
jsr |
15 |
No source operand |
1,2,3,5 |
rts |
16 |
No source operand |
No source operand |
hlt |
17 |
No source operand |
No source operand |
Table 9: Legal addressing modes assigned as source and destination operands for the operators.
Other possible errors may involve with illegal operator, illegal register name, illegal label, etc.
General algorithm
In this section we introduce a general algorithm for the first and second passes. The code is divided into two segments, the instruction segment and the
data segment. Each segment has its own counter, marked by IC (Instruction Counter) and DC (Data Counter). The letter L denotes the number of words held by
the instruction.
First Pass
- IC <=0, DC<=0
- Read line
- If the first field is a symbol, then proceed. Else go to 5.
- Turn on "There is a Symbol" flag.
- If pseudo-instruction (directive for storing data, namely .data or .string), proceed. Else go to 8.
- if there is a symbol, then insert it into the table of signs with a mark (data mark). It's value should be DC.
- Identify the data, store them in memory, update the data counter (DC) according to their length. Go back to step 2.
- If it is an .extern or .entry directive, then proceed. Else go to 11.
- If it is an .extern declaration, then insert the symbols into the external symbols table, without an address.
- Go back to 2.
- If it is a symbol, then insert it into the table of symbols with a mark (code mark). It's value should be IC.
- Search the table of instructions. If no instruction has been find, then announce on error in the instruction code.
- IC <=L+IC.
- Go back to 2.
Second Pass
- IC <=0.
- Read line. If you done, go to 11.
- If the first field is a symbol, then skip on it.
- If the first field is a pseudo-instruction, (.data or .string), then go to 2.
- If the first field is a directive (an .entry or a .extern), then proceed. Else go to 7
- Identify the directive. Complete its corresponding operation. If it is an .entry directive, then mark the appropriate symbols as .entry
go back to 2.
- Evaluate the operands, search the table of instruction, substitute the instruction with its corresponding code.
- Store the operands starting from the nest byte. If it is a symbol, then find its address in the table of symbols, compute the addresses,
code the addressing mode.
- IC<=IC+L
- Go back to 2.
- Keep on a separate files the length of the program, length of data, table of external symbols, table of symbols marked with entrance point signs.
Let us apply this algorithm on the code written in
Example I repeated below for easy reference:
MAIN: mov LENGTH, r1
lea STR, r2
Loop: jnz END
prn @r2
sub #1, r1
inc r2
jnc *LOOP
END: hlt
STR: .string "abcdef"
LENGTH: .data 6
In the first pass, each instruction is being substituted with its appropriate code and the table of symbols is being built. The rest of the code are left
untouched. The code should be loaded at address zero. After applying the first pass on example I, we should get the following result
Label |
Decimal Address |
Octal Address |
Command |
Operands |
Binary Machine Code |
Octal Machine Code |
(MAIN::) |
0100 |
0144 |
mov |
LENGTH, r1 |
0000 001 000 100 001 |
001041 |
|
0101 |
0145 |
|
|
LENGTH |
?????? |
|
0102 |
0146 |
lea |
STR, r2 |
0110 001 000 100 010 |
061042 |
|
0103 |
0147 |
|
|
STR |
?????? |
(LOOP:) |
0104 |
0150 |
jnz |
END |
1001 000 000 001 000 |
110010 |
|
0105 |
0151 |
|
|
END |
?????? |
|
0106 |
0152 |
prn |
@r2 |
1100 000 000 101 010 |
140052 |
|
0107 |
0153 |
sub |
#1, r1 |
0011 000 000 100 001 |
030041 |
|
0108 |
0154 |
|
|
0000 000 000 000 001 |
000001 |
|
0109 |
0155 |
inc |
r2 |
0111 000 000 100 010 |
070042 |
|
0110 |
0156 |
jnc |
*LOOP |
1010 000 000 011 000 |
120030 |
|
0111 |
0157 |
|
|
1111 111 111 111 010 |
177772 |
(END:) |
0112 |
0160 |
hlt |
|
1111 000 000 000 000 |
170000 |
(STR:) |
0113 |
0161 |
.string |
"abcdef" |
0000 000 001 100 001 |
000141 |
|
0114 |
0162 |
|
|
0000 000 001 100 010 |
000142 |
|
0115 |
0163 |
|
|
0000 000 001 100 011 |
000143 |
|
0116 |
0164 |
|
|
0000 000 001 100 100 |
000144 |
|
0117 |
0165 |
|
|
0000 000 001 100 101 |
000145 |
|
0118 |
0166 |
|
|
0000 000 001 100 110 |
000146 |
|
0119 |
0167 |
|
|
0000 000 000 000 000 |
000000 |
(LENGTH:) |
0120 |
0170 |
.data |
6 |
0000 000 000 000 110 |
000006 |
In addition we get the following table of signs.
Symbol |
Decimal Value |
Octal Value |
MAIN |
0100 |
0144 |
LOOP |
0104 |
0150 |
END |
0112 |
0160 |
STR |
0113 |
0161 |
LENGTH |
0120 |
0170 |
Applying the second pass on the code of example I yields the following final result:
Label |
Decimal Address |
Octal Address |
Command |
Operands |
Binary Machine Code |
Octal Machine Code |
(MAIN::) |
0100 |
0144 |
mov |
LENGTH, r1 |
0000 001 000 100 001 |
001041 |
|
0101 |
0145 |
|
|
0000 000 001 111 000 |
000170 |
|
0102 |
0146 |
lea |
STR, r2 |
0110 001 000 100 010 |
061042 |
|
0103 |
0147 |
|
|
0000 000 001 110 001 |
000161 |
(LOOP:) |
0104 |
0150 |
jnz |
END |
1001 000 000 001 000 |
110010 |
|
0105 |
0151 |
|
|
0000 000 001 110 000 |
000160 |
|
0106 |
0152 |
prn |
@r2 |
1100 000 000 101 010 |
140052 |
|
0107 |
0153 |
sub |
#1, r1 |
0011 000 000 100 001 |
030041 |
|
0108 |
0154 |
|
|
0000 000 000 000 001 |
000001 |
|
0109 |
0155 |
inc |
r2 |
0111 000 000 100 010 |
070042 |
|
0110 |
0156 |
jnc |
*LOOP |
1010 000 000 011 000 |
120030 |
|
0111 |
0157 |
|
|
1111 111 111 111 010 |
177772 |
(END:) |
0112 |
0160 |
hlt |
|
1111 000 000 000 000 |
170000 |
(STR:) |
0113 |
0161 |
.string |
"abcdef" |
0000 000 001 100 001 |
000141 |
|
0114 |
0162 |
|
|
0000 000 001 100 010 |
000142 |
|
0115 |
0163 |
|
|
0000 000 001 100 011 |
000143 |
|
0116 |
0164 |
|
|
0000 000 001 100 100 |
000144 |
|
0117 |
0165 |
|
|
0000 000 001 100 101 |
000145 |
|
0118 |
0166 |
|
|
0000 000 001 100 110 |
000146 |
|
0119 |
0167 |
|
|
0000 000 000 000 000 |
000000 |
(LENGTH:) |
0120 |
0170 |
.data |
6 |
0000 000 000 000 110 |
000006 |
When the assembler program is done an object code is generated this object code is to be sent to a linker program. The purpose of the linker program is described as follows:
- To allocate the program with place in memory (allocation).
- To link the object file into one executable file (linking)
- To change addresses according to the loading place (relocation)
- To physically load the code into memory.
After the linker program is done the program can be loaded to memory and is ready to run. We are not
going to make further discussion on how the linker program works.
Your assembler program has to have the following properties:
Your assembler should get command line arguments, these arguments is a list
of text files written according to the assembly language that was defined above. For each file the assembly
create an object file, external file (if the source code has declared on external variables) and an entry file (if the source code
has declared on entry variables). The extension of source file is ".as" (for example the names a.as, b.as and start.as are all legal names
for source files). The assemblers' command line arguments are a list of source file names (without the ".as" extension). For instance if our assembler
program is called "assembler", then the following command line
assembler a b start
Reads the a.as, b.as and start.as source files and creates the appropriate object, entry and external files. In our case the file that should be created are
a.ob, a.ent (for the entry file), a.ext (for the external file); b.ob, b.ent, b.ext (for the b.as source file); start.ob, start.ent, start.ext
(for the start.as source file).
The structure of each file is to be described later.
The assembler holds two arrays, the code array and the data array. These arrays provide a practical picture of machine memory (the size of
each of the array element is equal to the size of one word-16 bits). In the code array the assembler writes the numeric code value of
machine instructions found during the assembly process in the input file. In the data array the assembly writes data (specified by .data and .string)
found during the assembly process in the input file.
The assembler has two counters: Instruction Counter (IC) and Data Counter (DC). These counters are pointing to the next available places in the above mentioned
arrays (Code array and Data array). When the assembler begin reading a file both the IC and DC counters are initialize to zero. In addition the assembler
maintains a table of symbols where all variables encountered during runtime are kept. The name, value and type (External, Relocatable, Absolute) are kept for
all variables.
Mode of Operation
The assembler reads the file line by line and decide on the type of the line (remark, operation, directive or empty-line).
- Empty or remark line: The assembler ignores the line and move on to the next line.
- Operation line:
When the assembler is encountered with operation-line it has to decide on the addressing modes and on the number of operands. The
assembler determines the value of each operands in the following way:
- If the operand is a register, then its value is the number of the register.
- If the operand is a label, then its value is determined by its appearance in the table of symbols.
- If it is a number (direct addressing), then its value is the value of the number.
Determination of the addressing mode is done according to the syntax of the operand(s), as mentioned in tables 3. In general the
# character indicates instance addressing, label indicates direct addressing, register name indicates register addressing, the @
character written before variable name indicates indirect addressing, the @ character written before register name indicates
indirect addressing, the * written before a label indicates on relative addressing.
Notice that in both cases, when addressing is a register or indirect register, the value of the operand field is the number of the
register. The same apply for direct and indirect addressing. The value of the operand field is the value of the label variable as
appeared in the table of symbols.
At this stage the assembler have decided up on the operation, addressing mode for the source operand, addressing mode for the destination operand,
the register of the source operand, the register of the destination operand, the requirement of additional word for the source operand, the
requirement of additional word for the destination operand. The assembler continue its operation in the following way:
If the operation requires two operands, then the assembler insert to the array of code, at the place indexed by the instruction counter (IC) the
numerical code of the operation, the addressing modes, and the registers. In addition the assembler save places for the additional words required for
this operation: 0,1 or 2. The assembler increase the instruction code (IC) in correspondence.
For example:
If the instruction is
cmp #-3,r1
then the first place, starting from index IC, in the array of code contains the word 0001 000 000 100 001 (or 010041 octal), the second place of the array of code
contains the word 1111 111 111 101 (177775 octal) (which is the value 3 in the two complement method applied for 16 bits length).
If the instruction gets only one operand (the source operand does not exist), then the translation is almost identical (the word may even
occupy two places in memory), however since the source operand is not being used by the CPU, bits 6-11 in the first word, representing
information on the source operand, can be of any value.
If the instruction has no operands (rts, hlt), then the only numeric value that should be inserted to the array of code at
index IC is the numeric value of the operation. In this case, where the instruction takes no operands, registers 0-11, in the first
word, can get any numeric value.
Label appeared in a line-code, written in a source file, should be inserted to the table of symbols under the appropriate name its numeric value
is equal to the value of the instruction counter (IC), prior to the coding of the instruction (appeared in this specific line). Its type should be
relocatable.
- Directive line:
When the assembler encounters with directive it operates in the following way:
- .data
The assembler reads the list of numbers appeared after the '.data' directive, each number, that has already being red, is
inserted to the array of data and exceeding the data counter DC by one.
If a line-code, found in the source file, contains a label that appears before the '.data' directive, then the label should be
inserted to the table of symbols. The label gets the value of the data counter DC, before the data have been inserted to the
code, plus the general length of the code. The type of this label should be relocatable.
- .string
The behaviour of the assembly in the case of the '.entry' directive is very similar to the case of the '.data' directive. However,
in the case of the '.string' directive the value that are being inserted to the array of data is the numeric ASCII value of the
characters. After that the zero value, which marks the end of the string, is being inserted to the table of data. The data counter
is exceeded by the length of the string plus 1 (zero occupies space too). In the case where a label appears, prior to the
.string directive, the assembly behave the same way as in the case of the '.data' directive.
- .entry
The '.entry' directive direct the assembler to insert the label, appeared
afeter the .entry (in the same line), to the entries file. The assembler
write this request, at the end the assembler writes these labels to the
entries file. The '.entry' directive declares on a label that is to be
used by other file, the information written by the assembler to the entries
file and to the externals file is to be used by the linker program to match
the references of the externals.
- .extern
The '.extern' directive declares on a variable defined in other file
and used by the current '.as' file.
The assembler inserts the requested variable to the table of symbols, the
value of the variable is zero (it may get any other value) and its type is
external. The question of where this variable is defined is insignificant
for the assembler, and remains unanswered.
Notice that either operation line or directive line may take variable
declared later in the file (the declaration may be indirect, by label, or
directly be extern).
The format of output files
The object file written by the assembler provides informations about machine's memory. The first
instruction is to be inserted to memory address 0, the second instruction is to be inserted to
be inserted to memory address 2,3 or 4 (depending on the length of the first instruction) and so
fourth until the translation of the last instruction. The next memory address, after
the last translated instruction, contains the data that were built by the '.data' and '.string'
instructions, their order of appearance in memory depends on their precedence of appearance in the
source file (first instruction occupies first free memory in a rising order).
The object file
The object file is composed out of lines of text. The first line contains (in octal) the length
of the code and the length of data, both are in terms of memory words. Those two numbers must be
separated by white space. Each of the next lines provides information on the content of memory
address (in octal form) starting from memory address 0. In addition, for each memory address,
occupied by instruction (not data), there appear additional information for the linker. This
additional information could be one of the following three characters: 'e' 'a' or 'r'. The character
'a' designates the fact that the content of the memory address is
absolute and does not
depend on where the file is to be loaded (the assembler assumes it to start from memory address 0).
The character 'r' designates the fact that memory address is
relocatable and should be added
with the appropriate offset, in regards to where the file is to be loaded. The offset is the first
memory address from which the first instruction of the program is to be loaded. The letter 'd'
designates the fact that the content of the file depends on
external variable, the linker
program is to take care on the insertion of the appropriate value.
The entries file
The entries file is composed out of lines of text. Each line contains the entry name and value, as
it was computed for this file.
The externals file
The externals file is composed out of lines of text. Each line contains the name and memory address
of the external variable.
Example files
The following source files (ps.as, cs.as, rs.as, a.as) comprising a single program.
________________________
ps.as __________________________________________
;File: ps.as -- Includes main routine of reversing string "abcdef"
; count length of string, print the
; original string, reverse string, print reversed string.
MAIN: lea STR, STRADD
jsr COUNT
jsr PRTSTR
mov *STRADD, LASTCHAR
add LEN, LASTCHAR
dec LASTCHAR
jsr REVERSE
jsr PRTSTR
hlt
.entry STRADD
.entry MAIN
.extern REVERSE
.extern PRTSTR
.extern COUNT
STRADD: .data 0
STR: .string "abcdef"
LASTCHAR: .data 0
LEN: .data 0
_________________________________________________________________________
_________________________
cs.as ___________________________________________
;file cs.as - includes routine count.
;This routine counts the length of the string
.entry COUNT
.extern STRADD
.extern LEN
COUNT: mov STRADD, r1
cmp #0, @r1
jnz ENDCOUNT
inc LEN
inc r1
jnc COUNT
ENDCOUNT: rts
_________________________________________________________________________
_________________________
rs.as ___________________________________________
;file rs.as - includes routine reverse. This routine reverses the string
.entry REVERSE
.extern LEN
REVERSE: mov STRADD, r1
mov LASTCHAR, r2
mov LEN, r3
;Exchanging places
LOOP: cmp #0, r3
jnz END
cmp r1, r2
jnz END
mov @r1, r4
mov @r2, @r1
mov r4, @r2
sub #2, r3
jnc LOOP
END: rts
.extern LASTCHAR
.extern STRADD
_________________________________________________________________________
__________________________
a.as ___________________________________________
;file a.as includes routine PRTSTR. This routine prints the string.
.entry PRTSTR
PRTSTR: mov STRADD, r2
; r2 now hold the memory location of the string.
LOOP: cmp #0,@r2
jnz BYE
prn @r2
inc r2
jnc LOOP
BYE: rts
.extern STRADD
__________________________________________________________________________
Running the assembler program on the ps.as source file creates the following output files:
ps.ob
24 12
0000 061010 a
0001 000025 r
0002 000024 r
0003 150010 a
0004 000000 e
0005 150010 a
0006 000000 e
0007 003010 a
0010 000015 a
0011 000034 r
0012 021010 a
0013 000035 r
0014 000034 r
0015 100010 a
0016 000034 r
0017 150010 a
0020 000000 e
0021 150010 a
0022 000000 e
0023 170000 a
0024 000000
0025 000141
0026 000142
0027 000143
0030 000144
0031 000145
0032 000146
0033 000000
0034 000000
0035 000000
ps.ent
STRADD 24
MAIN 0
ps.ext
COUNT 4
PRTSTR 6
REVERSE 20
PRTSTR 22
Running the assembler program on the cs.as source file creates the following output files:
cs.ob
14 0
0000 001041 a
0001 000000 e
0002 010051 a
0003 000000 a
0004 110010 a
0005 000013 r
0006 070010 a
0007 000000 e
0010 070041 a
0011 120010 a
0012 000000 r
0013 160000 a
cs.ent
COUNT 0
cs.ext
STRADD 1
LEN 7
Running the assembler program on the rs.as source file creates the following output files:
rs.ob
25 0
0000 001041 a
0001 000000 e
0002 001042 a
0003 000000 e
0004 001043 a
0005 000000 e
0006 010043 a
0007 000000 a
0010 110010 a
0011 000024 r
0012 014142 a
0013 110010 a
0014 000024 r
0015 005144 a
0016 005251 a
0017 004452 a
0020 030043 a
0021 000002 a
0022 120010 a
0023 000006 r
0024 160000 a
rs.ent
REVERSE 0
rs.ext
STRADD 1
LASTCHAR 3
LEN 5
Running the assembler program on the a.as source file creates the following output files:
a.ob
13 0
0000 001042 a
0001 000000 e
0002 010052 a
0003 000000 a
0004 110010 a
0005 000012 r
0006 140052 a
0007 070042 a
0010 120010 a
0011 000002 r
0012 160000 a
a.ent
PRTSTR 0
a.ext
STRADD 1
Notice that in a case where the input file contains no .extern declaration, the corresponding *.ext file is not created. Similarly, when the
input file contains no .entry declaration, the corresponding *.ent file is not created.
include("../adds/google/banner_468_60"); ?>
Written by Yonatan Zilpa