Two Pass Assembler

In this project you are asked to write an Assembler program using the C programming language. Specifications for this Assembler will be defined subsequently. Due to the size of this project it should be divided to several source files. The source files have to meet standards of clarity readability and well formed syntactic conventions.

CLICK HERE to get answer for this project.

Background

As you may know, there are many programming languages. A huge amount of programs written in different languages can run on the very same machine. How does the computer "know" so many languages? The answer is very simple, the computer knows only one language. This language, known as machine language, includes instructions and data written in binary code. The code is stored inside a block of memory appearing like a string of binary digits. The Central Processing Unit - CPU - knows how to decode the instructions into small meaningful fields of instructions, Addresses and data. The decoding process is determined unambiguously by the firmware of the CPU.

Computer memory is a collection of bits, usually seeing as collection of constant sized units, measured in bytes. When a program is loaded into memory it is almost impossible for an unskilled eye to see any physical difference between that part in memory where the program is stored and the rest of the memory.

The Central Processing Unit can execute certain amount of simple instructions using several build in registers. For example: It can move value from memory cell to a CPU register and vice versa, it can add 1 to a value stored in a register, it can check whether a value stored in a CPU register is equal to zero, etc. These simple instructions and their combinations are the building blocks of computer program as appeared in memory. Program, written in any computer language, can be translate to machine language by a special program called compiler, this program generates file(s) called object file(s) . The object file(s) can be linked to one executable file that can be load into computer memory.

Once our object file(s) have been loaded into memory, the computer starts to execute them in the following way: Each opcode, in the source file, refers to a datum, register or an address in memory, specified in the instruction itself. The CPU decode each line of instruction into operation and operands and executes it. A special register, Program Counter, built inside the CPU, is always pointing on the next instruction that need to be executed. The program stop executing when the CPU gets to a stop instruction.

Every programming language has a compiler or interpreter that translate the source code into machine language. In the assembly programming language the interpreter is also called assembler. In this project you are asked to write an assembler. For this purpose we are going to define a special type of hardware architecture and an assembly programming language written for this architecture.

Note: General explanation on how the assembler works also includes references to the working process of the linker program. The purpose of these references is to provide better understanding on the processing procedure of the assembler's output. Make no mistakes, you are not required to write linker or loader program. You should only write the assembler.

First let us define our imaginary computer machine and the assembler for this project.

"Hardware"

Our computer architecture consists from CPU (Central Processing Unit), registers and Random Access Memory RAM, where part of the memory is being used as a stack. The size of each word in memory is 16 bits. Arithmetics is to be carried by the '2 complement' method. Our computer machine can only handle integers (Positives or negatives), it doesn't handle real numbers.

Registers

Our computer machine includes the following list of registers:

Eight general registers (r₀, r₁, r₂, r₃, r₄, r₅, r₆, r₇)
One Program Counter register (PC).
One Stack Pointer register (SP).
One Status register (PSW - Program Status Word) which has two flags: carry flag and zero flag.

All registers are 16 bits in size.
The two first bits of the PSW register are C and Z in correspondence
The size of memory is 2000 words (each word is 16 bits in size).
Characters are coded in ascii.

Instruction Set Architecture ISA

Instruction

In our computer machine, instruction is a word (16 bits in size) that carries information about the operator and operands. Although instruction is a string of 16 bits, it can be divided into fields. The following table provides further information about the instruction.

Fields	Operation	Source Operand		Destination Operand
Fields	Operator	Addressing Mode	Register	Addressing Mode	Register
Bits	15 - 12	11 - 9	8 - 6	5 - 3	2 - 0

Table 1: The structure of an instruction word.

From Table 1 we can see that bits 12-15 codes the operator (representing the operation or the operator field) this allow us to code 2⁴=16 different operators. The following table maps operator's name to its corresponding instruction code (opcode).

Operator	Opcode binary	Opcode Octal	Opcode Decimal
mov	0	0	0
cmp	1	1	1
add	10	2	2
sub	11	3	3
mul	100	4	4
div	101	5	5
lea	110	6	6
inc	111	7	7
dec	1000	10	8
jnz	1001	11	9
jnc	1010	12	10
shl	1011	13	11
prn	1100	14	12
jsr	1101	15	13
rts	1110	16	14
hlt	1111	17	15

Table 2: The opcode

All operators are written in lower case letters, details on the meaning of these operators will be specified later.

Bits 9-11:: This field refers to the addressing mode of the source operand. Depending on the value of this field (numeric values of bits 9-11) , the instruction may refer to additional word (first additional word)
Bits 6-8:: This field refers to the register of the source operand. The field (bits 6-8) maps its numeric value n to register r_n.
Notice: If the addressing mode in the source operand does not require the source register, then the source register field are not in use. In such a case the numeric value of the field (bits 6-8) is equal to zero.
Bits 3-5:: This field refers to the addressing mode of the destination operand. Depending on the numeric value of this field (bits 3-5) , the instruction may refer to additional word (second additional word)
Bits 0-2:: This field refers to the register of the destination operand. The field (bits 0-2) maps its numeric value n to register r_n.
Notice: If the addressing mode in the destination operand does not require the destination register, then the source register field are not in use. In such a case the numeric value of the field (bits 6-8) is equal to zero.

There are six types of addressing modes in our assembly language, some of these modes require additional information, i.e. additional word.
The following table provides information on all types of addressing mode.

First Word			Additional Word	Operand	Way of Writing	Example
Addressing Mode		Register
Field Value (Decimal)	Name	Field Value (Decimal)
0	Instant addressing	zero Not in use	Yes	The numeric value of the operand is determined by the numeric value of the additional word.	The operand is a number preceded by the number sign #.	mov #-1,r2
1	Direct addressing	zero Not in use.	Yes	The additional word contains memory address. The numeric value of the operand is the value of this address.	The operand is a label, either declared or expected to be declared later in the file.	mov x,r2
2	Indirect addressing	zero Not in use.	Yes	The numeric value of the additional word contains memory address. The value of this address is also a memory address. The value of the second address is the numeric value of the operand.	Indirect addressing is indicated by the at '@' sign which appeared just before the label. The label is declared in the same way as in the direct addressing mode.	mov @x, r2
3	Relative Addressing	zero Not in use.	Yes	The additional word contains an integer number (positive or negative) indicating the distance in words from the current command to the address of the operand.	The operand is a label, defined as in the case of direct addressing mode, preceded by the asterisk sign *.	mov *x, r2
4	Direct register addressing	n (Positive integer)	No	Register r_n contains the value of the operand.	The operand is a legal register name.	mov r1, r2
5	Indirect register addressing	n (Positive integer)	No	Register r_n contains information on memory address. This memory address contains the operand.		mov @r1,r2

Table 3: Types of addressing mode.

As shown in Table 3, addressing mode(s) determine the appearance of the additional word(s).

Machine Instruction Characterization

Machine instruction may be classified into three different classes (according to the number of operands appeared in each instruction).

First Class of Operators

The first class contains all machine instructions that get two operands. Any machine instruction that belongs to this class may contain one of the following operators:
mov, cmp, add, sub, mul, div, lea, shl
The following table provides further explanation on the operational aspects of these operators:

Numeric Code Octal	Operator	Description	Example	Example Description
0	mov	Copies the value of the source operand (the first operand) to the destination operand (the second operand).	mov A, r1	copy the value of A to register r₁
1	cmp	Compare between two operands. The cmp operator subtracts the destination operand from the source operand, without saving the subtraction result, it then updates the zero flag, flag z, in the status register, PSW.	cpm A, r1	If the values of A and r₁ are equal, then the zero flag A, in the status register PSW, is turned on. Else the zero flag is turned off.
2	add	The destination operand is assigned with the value of the source operand plus the value of the destination operand.	add A, r0	Register r₀ gets the sum of r₀ and A.
3	sub	The destination operand is assigned with the value of the source operand minus the value of the destination operand.	sub #3, r1	Register r₁ is assigned with the value of r₁ minus 3 ;
4	mul	Destination operand assigned with the value of the source operand times the value of destination operand	mul A, r2	Register r₂ assigned with A × r₂.
5	div	Destination operand is assigned with the value of destination operand divided by the source operand (destination divided by source)	div A,r2	Register r₂ assigned with r₂/A
6	lea	Acronym for 'load effective address'. This operation loads memory address, marked with the label appeared in the first operand to the destination operand.	lea ABC, r1	The memory address of label ABC is assigned to register r₁.
13	shl	Shift bits to the left in the source operand. The number of shifts is determined by the value of the destination operand.	shl r1, 1	Register r₁ is shifted 1 bit to the left.

Table 4: Operators with two operand arguments.

Second Class of Operators

The second class contains all machine instructions that gets one operand. In such cases there is no source operand, thus bits 6-11 are meaningless (their values is zero). Any machine instruction in this class may contain one of the following instruction:
inc, dec, jnz, jnc, prn, jsr The following table provides further explanation on the operational aspects of these operators:

Numeric Code Octal	Operator	Description	Example	Example Description
7	inc	The operand is increased by one.	inc r2	Register r₂ is assigned with r₂ plus 1.
10	dec	The operand is decreased by one.	dec r2	Register r₂ is assigned with r₂ minus 1.
11	jnz	Acronym: jump if not zero. The Program Counter register PC is assigned with the source operand if the Z flag, in the Program Status Word register PSW is not zero.	jnz LINE	If the Z flag (in the PSW register) is not zero, then PC register is assigned with LINE.
12	jnc	Acronym: jump if not carry. The Program Counter register PC is assigned with zero if the C flag, in the Program Status Word register PSW is not 0.	jnc LINE	If the C flag (in the PSW register) is not zero, then PC register is assigned with LINE.
14	prn	Prints the ASCII equivalent of the operand to the standard output file (stdout).	prn r1	The ASCII equivalent character of the value stored in r₁ is printed to standard file.
15	jsr	Calls a subroutine that pushes register PC to the running time stack and assign the operand to the Program Counter register PC.	jsr FUNC	stack[sp]←PC SP ←SP-1 PC←FUNC

Table 5: Operators with one operand arguments.

Third Class of Operators

The third class contains all machine instructions that gets no operands. In such cases bits 0-11 are meaningless (their values is zero). Any machine instruction in this class may contain one of the following instruction:
rts, hlt The following table provides further explanation on the operational aspects of these operators:

Numeric Code Octal	Operator	Description	Example	Example Description
16	rts	Pops a value from the running time stack and move this value to the Program Counter register.	rts	SP ← SP+1 PC← stack[SP]
17	hlt	Halts the program.	hlt	Halting the program.

Table 6: Operators with one operand arguments.

Additional Notes

Our assembly language is consisted of statements separated by the new line character '\n'. When we look into a file it appeared to be made out of lines of statements, each statement appeared in its own line.
Our assembly language has four types of statements. These statements described in the following table.

Type of statement	General Explanation
Empty Statement	Line with this kind of statement may contains only white spaces: tab character '\t' or space character ' '
Comment Statement	The first character in a line with this statement is the semicolon ';' character. This line should be completely ignored by the assembler.
Declarative Statement	This statement is a directive to the assembler program. It does not generate machine instruction.
Operation Statement	This statement generates machine instruction that needs to be executed by the CPU. The statement represent machine instruction in symbolic form.

Table 7: Operators with one operand arguments.

Directive Statement

Directive statement is of the following form:
Directive statement may optionally start with a label, the label has to follow certain syntax rules (to be described later). Directive can start with or without a label, in any case a directive name, preceded by a dot '.' character, must be included. NO whitespace allowed between the '.' character and the directive name. If the directive does include a label, then at least one whitespace character is separating between the label and the '.' character. Following the directive name, whitespace-separated, appearing, in the same line, the directive parameters (the number of parameters is determined by the type of the directive). As mentioned, directive statement may include four types of directive:

```
.data
```
The parameter(s) of data is a list of legal numbers separated by a comma ',' character. For example
```
			.data    +7,-57 ,17   ,    9 
```
Notice that any number of whitespace characters may appear between the number(s) and the comma character(s). However, the comma character must separate between two numeric values.
The '.data' directive statement directs the assembler to allocate space in its data image where the appropriate numeric parameters is to be stored. It also direct the assembler to advance the data counter by the number of parameters (of the '.data' directive). If the '.data' directive has a label name, then this label name is assigned with the value in the data image (before it was advanced) and get inserted to the symbols table. This way we can refer to certain place in the data image using the label name. For instance, if we write
```
			XYZ:	.data	+7,-57,17,9
			mov 	XYZ, r1
```
then register r₁ is assigned with the value +7. If we continue to write
```
			lea	XYZ, r1 
```
then r₁ would have been assigned with the address (in the data image) that stores the +7 value.
```
.string
```
The '.string' directive statement gets only one legal string as parameter. The meaning of '.string' directive statement is similar to the '.data' directive statement. The ASCII characters composed the string are coded to their appropriate numeric ASCII value&40;s) and get inserted to the data image by their order. At the end a zero value is being inserted, to mark the end of the string. The value of the data counter is to be increase, according to the length of the string + one. If the line includes a label name, then the value of the label name is going to point to the location in memory that stores the ASCII code of the first character of the string, at the same way as it was done for the '.data' string. For instance the directive statement
```
			ABC:	.string		"abcdef" 
```
is going to allocate an array of characters of length 7 starting from the address stored in the ABC label name. This "array" is initialized to the ASCII value of characters a, b, c, d, e, f in correspondence, the array is to be ended with the zero value concatenate to the end of the array.
```
.enrty
```
The '.entry' directive statement gets one parameter only. This parameter is a label name, declared by other directive statement in the very same file where the The purpose of the '.entry' directive statement is to deal handle cases where a label name defined in an assembly source file A needs to be referred by other assembly source file(s) B, C, D, etc. In this case the '.entry' directive statement, written in the file A, gets the label name as its parameter (the '.entry' directive statement has to have a single parameter). For instance, if an assembly source file A contains the following lines
```
	HELLO	.string		"People"
	.entry	HELLO
	HELLO:	add		#1, r1
```
then other assembly source file(s), may refer to HELLO label name. Notice that a label at the beginning of the '.entry' directive is meaningless.
```
.extern
```
The '.extern' directive statement gets one parameter this parameter is the name of a label name defined in other assembly source file. The purpose of this directive statement is to declare that the label has been defined in other source file and that this assembly source file (the one that contains the '.extern' directive statement) is using it. The correspondence between the value of the label, as appeared in the source file where it was defined, and the operation instruction(s) that are using it as an argument is to be done at linking time. Notice that a label at the beginning of the '.extern' directive is meaningless.

Operation Statement

Operation statement is composed from the following:

Optional label.
Operation name (Opname)
Operands (the number of operands may be 0, 1 or 2 depending on the operation).

The length of a statement (of any type) cannot exceed 80 characters.
The name of the operation is to be written in lower case letter, operation name can be one of the 16 operations mentioned above.
After the operation name, separated with whitespace character(s), one or two operands may appear. In the case of two operands, the operands are separated with a comma ',' character. As mentioned before, whitespace character(s) may separate the comma and the operands. Operation statement with two operands has the following form:

For instance

			HELLO:	add	r7, B

Operation with one operand has the following form:

For instance

			HELLO:	jnc	XYZ

Operation with no operands has the following form:

For instance

			END:	hlt

If a label is appearing in the beginning of the operation statement, then this label is inserted into the table of symbols. The value of the label is pointing on the location of the operation statement in the code image, built by the assembler.

Formal Definitions

Label

Every label must begin with an upper or lower case letter, the rest of the label may contain letters or numbers. The length of the label cannot exceed 30 characters. The label ends with a column ':' character. The column character is not part of the label name it is just a sign representing the end of the character. The label must begin with the first column of the line. Label name cannot have more than one definition. The following labels are written correctly.

	hEllo:
	x:
	He78940

Label name cannot be the same as register or operation name.
The label derived its value from the syntax. Label written at the beginning of '.data' or '.string' directive gets the value of the appropriate data counter. Label written at the beginning of an operation statement gets the value of the appropriate operation counter.

Number

Number is a string of decimal digits &40;0-9) that may optionally be preceded by either '-' or '+' sign. The number gets its value from its decimal representation represented by the string of digits &40;0-9). For instance the numbers 76, -5, +123 can be accepted as numbers. As mentioned, we do not handle rational or real numbers, only integers.

String

String is a sequence of visible ASCII characters surrounded by double quotation marks. The quotation marks are not part of the string. The string "Hello World" is an example for legal string.

Two Pass Assembler

When the assembler is starting to translate code it needs to carry two major assignments. Its first assignment is to identify and translate the operation code and its second assignment is to determine addresses for all data and variables appeared in the source file(s). For instance, when the assembler reads the following code:

MAIN:		mov	 LENGTH, r1
		lea	 STR, r2
Loop:		jnz	 END
		prn	 @r2
		sub	 #1, r1
		inc	 r2
		jnc	 *LOOP
END:		hlt
STR:		.string	 "abcdef"
LENGTH:		.data	 6

Example I: Source file.

it has to replace the operation names lea, jnz, prn, sub, inc, jnc, hlt with their equivalent binary codes, in addition, the assembler has to replace the symbols STR, LENGTH, MAIN, LOOP, END with their appropriate addresses that have been allocated for the directive statements.
Assuming that the code in example I has being translated by the assembler and has been stored (operations and directives) in a memory block that starts from address 0100, then this translation can be described as follow:

Label	Decimal Adress	Octal Address	Command	Operands	Binary Machine Code	Octal Machine Code
Main:	0100	0144	mov	LENGTH, r1	0000 001 000 100 001	001041
	0101	0145			0000 000 001 111 000	000170
	0102	0146	lea	STR, r2	0110 001 000 100 010	061042
	0103	0147			0000 000 001 110 001	000161
LOOP:	0104	0150	jnz	END	1001 000 000 001 000	110010
	0105	0151			0000 000 001 110 000	000160
	0106	0152	prn	@r2	1100 000 000 101 010	140052
	0107	0153	sub	#1, r1	0011 000 000 100 001	030041
	0108	0154			0000 000 000 000 001	000001
	0109	0155	jnc	r2	0111 000 000 100 010	070042
	0110	0156	jnc	*LOOP	1010 000 000 011 000	120030
	0111	0157			1111 111 111 111 010	177772
END:	0112	0160	hlt		1111 000 000 000 000	170000
STR:	0113	0161	.string	"abcdef"	0000 000 001 100 001	000141
	0114	0162			0000 000 001 100 010	000142
	0115	0163			0000 000 001 100 011	000143
	0116	0164			0000 000 001 100 100	000144
	0117	0165			0000 000 001 100 101	000145
	0118	0166			0000 000 001 100 110	000146
	0119	0167			0000 000 000 000 000	000000
LENGTH:	0120	0170	.data	6	0000 000 000 000 110	000006

If the assembler maintains a table of all the operation names and their corresponding binary codes, then all operation names can be easily converted. Whenever the assembler reads an operation name it can simply use the table to find its equivalent binary code. In order to carry the same conversion for the addresses of symbols the assembler has to build similar table.
For instance, in example I, prior to reading the source file(s) the assembler has no way to know that the LOOP symbol relates to address 0104 (decimal).
Thus, in regards to all symbols that have been defined by the programmer, the assembler has to accomplish two separate tasks. The first task is to build a table of all symbols and their related numeric values, and the second is to replace all the symbols, appeared in the source file(s) with the numeric values of the address fields. This two assignments can be achieved by performing two separate scans (passes) on the source file(s). In the first pass the assembler builds a table of symbols, this table correspond address to each symbol.
The following table refers to example I it represents the table of symbols generated by the assembler after the completion of the first pass.

Symbol	Decimal Value	Octal Value
MAIN	0100	0144
LOOP	0104	0150
END	0112	0160
STR	0113	0161
LENGTH	0120	0170

Table 8: Table of symbols derived from applying the first pass on the source file(s) of example I.

In the second pass the assembler translate the source file(s) into binary machine code.
Notice that the two passes are done by the assembler, during translation (in the assembly time), before the linking process.
After the translation process, the program may be linked and load to memory for execution.

First Pass

In order to assign each label name to an address value the assembler needs rules. The basic principle is to offset addresses. While reading an operation statement containing label operand(s), the assembler offset the next memory address(s) according to the number of label operand(s). Address value determined by counting addresses this type of counting is done by the assembler and stored in the address location register. The initial value of the address location register is zero, thus the first statement is to be loaded into address 0. After the assembler determines the length of the instruction statement the value of the address location register is increased by the number of bytes, occupied by the statement operation, this way the address location register points to the next empty cell.
As mentioned, to assign each operation statement to its machine instruction the assembler stores a table (Table 2). At the time of translation the assembler replace each operator in the operation statement with its equivalent code. However the translation of all the operation statement is not so simple. There are many operators that are using different types of addressing modes. The same operator can get different meanings depending on addressing mode.
The assembler has to read the statement in its complete form and then decide on how to translate it to machine code instruction according to its addressing mode. Usually the statement is divided into operator field and additional fields, the additional fields contains information on addressing mode.
The same principle applies in our assembler. Bits 12-15 (in the first word) represent the operator, whereas bits 3-5 and 9-11 represent the addressing mode.
Our computer machine is flexible in regards to the addressing mode of two operands. Notice that this flexibility is not mandatory. There are computer machines where all of their operation statements get one operand (operations are executed on this operand and constant register) or other machines that allow addressing mode for only one operand, the other operand has to be any register or even other computer machines with 3 operands (the third operand stores the result of the operation).
When the assembler reads a label, appeared in the beginning of the line, it has to know that this is a definition of label, then the label is assigned with a memory address &40;the value of the location address register&41;. This way each of the labels get their addresses while reading by the assembler. The labels are inserted to the table of symbols containing the name of the label its address and other additional information, such as the type of the label. This way, when the assembler reads an operation statement with label name operand(s) it can get the appropriate address from the table of symbols.
As we already know, programmer may refer to symbol that has not been defined yet in the program, definition is later to come. For example: In the following code

	jnc	A
	.
	.
	.
A:	...

Label name A, appeared as an argument for jnc, has not been defined yet, definition of this label appears later in the code. When the assembler reads a line with undefined label address, such as in the case of 'inc A', it cannot assigned it (label name A) with address value. This problem is to be solved in the second pass.

Second Pass

In the first pass the assembler cannot assign address value for undefined label address, during the first pass the assembler marked these label name(s) with zero(s). Only after the assembler complete a full scan, where all label name(s) have been inserted to the table of symbols, the assembler can assign an address value to all label name(s) appeared in the operand(s) field(s) of the operation statements. To do this the assembler reads the program once again assigning (using the table of symbols completed in the first pass) all label name(s) appeared in the operand(s) of the operation statements to their corresponding address value(s). This is the second pass, at the end of the second pass the assembler returns the final translated file(s).

Assembler of One Pass

There are assemblers that do not execute the second pass, the translation process in these kind of assemblers is done in the following way:
At the first pass the assembler maintains a table of symbols (as in the two pass example). Unlike the two pass assembler the table of symbols in the one pass assembler contains the label name and the memory address of the operation referring to the label name(s). At the end of the first pass the one pass assembler is to complete the translation by filling all missing addresses from the table of symbols.
For example: Let us consider the following code:

300:	add	TAB,r1
	.
	.
	.
400:	TAB:	...
	.
	.
	.
500:	jnz	TAB

where 300, ... , 400,... ,500, ... are memory addresses. The one pass assembler is to insert label name TAB to the table of symbols and set aside memory addresses 301 and 501 for the TAB label. Note that addresses 300 and 500 are already occupied by the operation statements (operator add and operator jnz) the additional words is to be followed right after the operation instruction. After the first pass the one pass assembler will fill all missing addresses using the table of symbols.
The advantage of one-pass-assembler is speed, saving the need for the second pass. The disadvantage is lack of clarity, the two pass assembler is easier to maintain.

Separation of Data and Instructions

For several assembler programs there are different address location registers. This separation has several benefits, it allows us to separate the code and data to different segments in memory, this is much better than adjusting the definition of data into the instructions that are using them.
The method of adjusting definition of data into the instructions may cause runtime errors, following a small error the CPU may try to execute the data. This kind of error may be caused by omitting the halt instruction or by wrong branching. If the CPU attempts to execute none legal instruction code, then an error message should appear. Nevertheless, a line of code that is seems to look like a legal instruction, may cause to more sever problem, since the error may not be found immediately.
In your assembler program we will be asked to separate the data segment from the instruction segment.

Assembler Error Detection

The assembler may detect errors in the syntax of the language, errors such as "Instruction does not exist" or "to many operands" and more. The assembler has to verify that all symbols are defined exactly once. Thus all errors detected by the assembler may be attributed to a particular line of input. If, for example, two addresses are assigned to instruction that should be assigned with only one address, then the assembler may respond with "To many addresses" error message. The following table contains information on legal addressing mode for the source and destination operands.

Operator	The Code in Hexadecimal Base	Legal Addressing Modes for the Source Operand	Legal Addressing Modes for the Destination Operand
mov	0	0,1,2,3,4,5	1,2,3,4,5
cmp	1	0,1,2,3,4,5	0,1,2,3,4,5
add	2	0,1,2,3,4,5	1,2,3,4,5
sub	3	0,1,2,3,4,5	1,2,3,4,5
mul	4	0,1,2,3,4,5	1,2,3,4,5
div	5	0,1,2,3,4,5	1,2,3,4,5
lea	6	1	1,2,3,4,5
inc	7	No source operand	1,2,3,4,5
dec	10	No source operand	1,2,3,4,5
jnz	11	No source operand	1,2,3,5
jnc	12	No source operand	1,2,3,5
shl	13	1,2,3,4,5	0,1,2,3,4,5
prn	14	No source operand	0,1,2,3,4,5
jsr	15	No source operand	1,2,3,5
rts	16	No source operand	No source operand
hlt	17	No source operand	No source operand

Table 9: Legal addressing modes assigned as source and destination operands for the operators.

Other possible errors may involve with illegal operator, illegal register name, illegal label, etc.

General algorithm

In this section we introduce a general algorithm for the first and second passes. The code is divided into two segments, the instruction segment and the data segment. Each segment has its own counter, marked by IC (Instruction Counter) and DC (Data Counter). The letter L denotes the number of words held by the instruction.
First Pass

IC <=0, DC<=0
Read line
If the first field is a symbol, then proceed. Else go to 5.
Turn on "There is a Symbol" flag.
If pseudo-instruction (directive for storing data, namely .data or .string), proceed. Else go to 8.
if there is a symbol, then insert it into the table of signs with a mark (data mark). It's value should be DC.
Identify the data, store them in memory, update the data counter (DC) according to their length. Go back to step 2.
If it is an .extern or .entry directive, then proceed. Else go to 11.
If it is an .extern declaration, then insert the symbols into the external symbols table, without an address.
Go back to 2.
If it is a symbol, then insert it into the table of symbols with a mark (code mark). It's value should be IC.
Search the table of instructions. If no instruction has been find, then announce on error in the instruction code.
IC <=L+IC.
Go back to 2.

Second Pass

IC <=0.
Read line. If you done, go to 11.
If the first field is a symbol, then skip on it.
If the first field is a pseudo-instruction, (.data or .string), then go to 2.
If the first field is a directive (an .entry or a .extern), then proceed. Else go to 7
Identify the directive. Complete its corresponding operation. If it is an .entry directive, then mark the appropriate symbols as .entry go back to 2.
Evaluate the operands, search the table of instruction, substitute the instruction with its corresponding code.
Store the operands starting from the nest byte. If it is a symbol, then find its address in the table of symbols, compute the addresses, code the addressing mode.
IC<=IC+L
Go back to 2.
Keep on a separate files the length of the program, length of data, table of external symbols, table of symbols marked with entrance point signs.

Let us apply this algorithm on the code written in Example I repeated below for easy reference:

MAIN:		mov	 LENGTH, r1
		lea	 STR, r2
Loop:		jnz	 END
		prn	 @r2
		sub	 #1, r1
		inc	 r2
		jnc	 *LOOP
END:		hlt
STR:		.string	 "abcdef"
LENGTH:		.data	 6

In the first pass, each instruction is being substituted with its appropriate code and the table of symbols is being built. The rest of the code are left untouched. The code should be loaded at address zero. After applying the first pass on example I, we should get the following result

Label	Decimal Address	Octal Address	Command	Operands	Binary Machine Code	Octal Machine Code
(MAIN::)	0100	0144	mov	LENGTH, r1	0000 001 000 100 001	001041
	0101	0145			LENGTH	??????
	0102	0146	lea	STR, r2	0110 001 000 100 010	061042
	0103	0147			STR	??????
(LOOP:)	0104	0150	jnz	END	1001 000 000 001 000	110010
	0105	0151			END	??????
	0106	0152	prn	@r2	1100 000 000 101 010	140052
	0107	0153	sub	#1, r1	0011 000 000 100 001	030041
	0108	0154			0000 000 000 000 001	000001
	0109	0155	inc	r2	0111 000 000 100 010	070042
	0110	0156	jnc	*LOOP	1010 000 000 011 000	120030
	0111	0157			1111 111 111 111 010	177772
(END:)	0112	0160	hlt		1111 000 000 000 000	170000
(STR:)	0113	0161	.string	"abcdef"	0000 000 001 100 001	000141
	0114	0162			0000 000 001 100 010	000142
	0115	0163			0000 000 001 100 011	000143
	0116	0164			0000 000 001 100 100	000144
	0117	0165			0000 000 001 100 101	000145
	0118	0166			0000 000 001 100 110	000146
	0119	0167			0000 000 000 000 000	000000
(LENGTH:)	0120	0170	.data	6	0000 000 000 000 110	000006

In addition we get the following table of signs.

Applying the second pass on the code of example I yields the following final result:

Symbol	Decimal Value	Octal Value
MAIN	0100	0144
LOOP	0104	0150
END	0112	0160
STR	0113	0161
LENGTH	0120	0170

Label	Decimal Address	Octal Address	Command	Operands	Binary Machine Code	Octal Machine Code
(MAIN::)	0100	0144	mov	LENGTH, r1	0000 001 000 100 001	001041
	0101	0145			0000 000 001 111 000	000170
	0102	0146	lea	STR, r2	0110 001 000 100 010	061042
	0103	0147			0000 000 001 110 001	000161
(LOOP:)	0104	0150	jnz	END	1001 000 000 001 000	110010
	0105	0151			0000 000 001 110 000	000160
	0106	0152	prn	@r2	1100 000 000 101 010	140052
	0107	0153	sub	#1, r1	0011 000 000 100 001	030041
	0108	0154			0000 000 000 000 001	000001
	0109	0155	inc	r2	0111 000 000 100 010	070042
	0110	0156	jnc	*LOOP	1010 000 000 011 000	120030
	0111	0157			1111 111 111 111 010	177772
(END:)	0112	0160	hlt		1111 000 000 000 000	170000
(STR:)	0113	0161	.string	"abcdef"	0000 000 001 100 001	000141
	0114	0162			0000 000 001 100 010	000142
	0115	0163			0000 000 001 100 011	000143
	0116	0164			0000 000 001 100 100	000144
	0117	0165			0000 000 001 100 101	000145
	0118	0166			0000 000 001 100 110	000146
	0119	0167			0000 000 000 000 000	000000
(LENGTH:)	0120	0170	.data	6	0000 000 000 000 110	000006

When the assembler program is done an object code is generated this object code is to be sent to a linker program. The purpose of the linker program is described as follows:

To allocate the program with place in memory (allocation).
To link the object file into one executable file (linking)
To change addresses according to the loading place (relocation)
To physically load the code into memory.

After the linker program is done the program can be loaded to memory and is ready to run. We are not going to make further discussion on how the linker program works.
Your assembler program has to have the following properties:
Your assembler should get command line arguments, these arguments is a list of text files written according to the assembly language that was defined above. For each file the assembly create an object file, external file (if the source code has declared on external variables) and an entry file (if the source code has declared on entry variables). The extension of source file is ".as" (for example the names a.as, b.as and start.as are all legal names for source files). The assemblers' command line arguments are a list of source file names (without the ".as" extension). For instance if our assembler program is called "assembler", then the following command line

				assembler a b start

Reads the a.as, b.as and start.as source files and creates the appropriate object, entry and external files. In our case the file that should be created are a.ob, a.ent (for the entry file), a.ext (for the external file); b.ob, b.ent, b.ext (for the b.as source file); start.ob, start.ent, start.ext (for the start.as source file).
The structure of each file is to be described later.
The assembler holds two arrays, the code array and the data array. These arrays provide a practical picture of machine memory (the size of each of the array element is equal to the size of one word-16 bits). In the code array the assembler writes the numeric code value of machine instructions found during the assembly process in the input file. In the data array the assembly writes data (specified by .data and .string) found during the assembly process in the input file.
The assembler has two counters: Instruction Counter (IC) and Data Counter (DC). These counters are pointing to the next available places in the above mentioned arrays (Code array and Data array). When the assembler begin reading a file both the IC and DC counters are initialize to zero. In addition the assembler maintains a table of symbols where all variables encountered during runtime are kept. The name, value and type (External, Relocatable, Absolute) are kept for all variables.

Mode of Operation

The assembler reads the file line by line and decide on the type of the line (remark, operation, directive or empty-line).

Empty or remark line: The assembler ignores the line and move on to the next line.
Operation line:
When the assembler is encountered with operation-line it has to decide on the addressing modes and on the number of operands. The assembler determines the value of each operands in the following way:
- If the operand is a register, then its value is the number of the register.
- If the operand is a label, then its value is determined by its appearance in the table of symbols.
- If it is a number (direct addressing), then its value is the value of the number.
Determination of the addressing mode is done according to the syntax of the operand(s), as mentioned in tables 3. In general the # character indicates instance addressing, label indicates direct addressing, register name indicates register addressing, the @ character written before variable name indicates indirect addressing, the @ character written before register name indicates indirect addressing, the * written before a label indicates on relative addressing.
Notice that in both cases, when addressing is a register or indirect register, the value of the operand field is the number of the register. The same apply for direct and indirect addressing. The value of the operand field is the value of the label variable as appeared in the table of symbols.
At this stage the assembler have decided up on the operation, addressing mode for the source operand, addressing mode for the destination operand, the register of the source operand, the register of the destination operand, the requirement of additional word for the source operand, the requirement of additional word for the destination operand. The assembler continue its operation in the following way:
If the operation requires two operands, then the assembler insert to the array of code, at the place indexed by the instruction counter (IC) the numerical code of the operation, the addressing modes, and the registers. In addition the assembler save places for the additional words required for this operation: 0,1 or 2. The assembler increase the instruction code (IC) in correspondence.
For example:
If the instruction is
```
				cmp #-3,r1		
```
then the first place, starting from index IC, in the array of code contains the word 0001 000 000 100 001 (or 010041 octal), the second place of the array of code contains the word 1111 111 111 101 (177775 octal) (which is the value 3 in the two complement method applied for 16 bits length).
If the instruction gets only one operand (the source operand does not exist), then the translation is almost identical (the word may even occupy two places in memory), however since the source operand is not being used by the CPU, bits 6-11 in the first word, representing information on the source operand, can be of any value.

If the instruction has no operands (rts, hlt), then the only numeric value that should be inserted to the array of code at index IC is the numeric value of the operation. In this case, where the instruction takes no operands, registers 0-11, in the first word, can get any numeric value.

Label appeared in a line-code, written in a source file, should be inserted to the table of symbols under the appropriate name its numeric value is equal to the value of the instruction counter (IC), prior to the coding of the instruction (appeared in this specific line). Its type should be relocatable.
Directive line:
When the assembler encounters with directive it operates in the following way:
1. .data
  The assembler reads the list of numbers appeared after the '.data' directive, each number, that has already being red, is inserted to the array of data and exceeding the data counter DC by one.
  If a line-code, found in the source file, contains a label that appears before the '.data' directive, then the label should be inserted to the table of symbols. The label gets the value of the data counter DC, before the data have been inserted to the code, plus the general length of the code. The type of this label should be relocatable.
2. .string
  The behaviour of the assembly in the case of the '.entry' directive is very similar to the case of the '.data' directive. However, in the case of the '.string' directive the value that are being inserted to the array of data is the numeric ASCII value of the characters. After that the zero value, which marks the end of the string, is being inserted to the table of data. The data counter is exceeded by the length of the string plus 1 (zero occupies space too). In the case where a label appears, prior to the .string directive, the assembly behave the same way as in the case of the '.data' directive.
3. .entry
  The '.entry' directive direct the assembler to insert the label, appeared afeter the .entry (in the same line), to the entries file. The assembler write this request, at the end the assembler writes these labels to the entries file. The '.entry' directive declares on a label that is to be used by other file, the information written by the assembler to the entries file and to the externals file is to be used by the linker program to match the references of the externals.
4. .extern
  The '.extern' directive declares on a variable defined in other file and used by the current '.as' file.
  The assembler inserts the requested variable to the table of symbols, the value of the variable is zero (it may get any other value) and its type is external. The question of where this variable is defined is insignificant for the assembler, and remains unanswered.
  Notice that either operation line or directive line may take variable declared later in the file (the declaration may be indirect, by label, or directly be extern).

The format of output files

The object file written by the assembler provides informations about machine's memory. The first instruction is to be inserted to memory address 0, the second instruction is to be inserted to be inserted to memory address 2,3 or 4 (depending on the length of the first instruction) and so fourth until the translation of the last instruction. The next memory address, after the last translated instruction, contains the data that were built by the '.data' and '.string' instructions, their order of appearance in memory depends on their precedence of appearance in the source file (first instruction occupies first free memory in a rising order).

The object file

The object file is composed out of lines of text. The first line contains (in octal) the length of the code and the length of data, both are in terms of memory words. Those two numbers must be separated by white space. Each of the next lines provides information on the content of memory address (in octal form) starting from memory address 0. In addition, for each memory address, occupied by instruction (not data), there appear additional information for the linker. This additional information could be one of the following three characters: 'e' 'a' or 'r'. The character 'a' designates the fact that the content of the memory address is absolute and does not depend on where the file is to be loaded (the assembler assumes it to start from memory address 0). The character 'r' designates the fact that memory address is relocatable and should be added with the appropriate offset, in regards to where the file is to be loaded. The offset is the first memory address from which the first instruction of the program is to be loaded. The letter 'd' designates the fact that the content of the file depends on external variable, the linker program is to take care on the insertion of the appropriate value.

The entries file

The entries file is composed out of lines of text. Each line contains the entry name and value, as it was computed for this file.

The externals file

The externals file is composed out of lines of text. Each line contains the name and memory address of the external variable.

Example files

The following source files (ps.as, cs.as, rs.as, a.as) comprising a single program.

________________________ ps.as __________________________________________

;File: ps.as -- Includes main routine of reversing string "abcdef"
; count length of string, print the
; original string, reverse string, print reversed string.

MAIN:		lea STR, STRADD
	jsr	COUNT
	jsr	PRTSTR
	mov	*STRADD, LASTCHAR
	add	LEN, LASTCHAR
	dec	LASTCHAR
	jsr	REVERSE
	jsr	PRTSTR
	hlt


.entry	STRADD
	.entry	MAIN
	.extern REVERSE
	.extern PRTSTR
	.extern COUNT
STRADD:	  .data 0
STR:		.string "abcdef"
LASTCHAR: .data 0

LEN:		.data 0

_________________________________________________________________________

_________________________ cs.as ___________________________________________

;file cs.as - includes routine count.
;This routine counts the length of the string

		.entry		COUNT
		
		.extern		STRADD

		.extern		LEN
COUNT:	mov		STRADD,	r1
		
		cmp		#0, @r1
		
		jnz		ENDCOUNT
		
		inc		LEN

		inc		r1

		jnc		COUNT

ENDCOUNT:		rts

_________________________________________________________________________

_________________________ rs.as ___________________________________________

;file rs.as - includes routine reverse. This routine reverses the string


	.entry		REVERSE
	.extern		LEN
REVERSE: mov		STRADD, r1
	 mov		LASTCHAR, r2
	 mov		LEN, r3

;Exchanging places

LOOP:	cmp		#0, r3
	jnz		END
	cmp		r1, r2

	jnz		END
	mov		@r1, r4
	mov		@r2, @r1
	mov		r4, @r2
	sub		#2, r3
	jnc		LOOP
END:	rts
	.extern		LASTCHAR
	.extern		STRADD

_________________________________________________________________________

__________________________ a.as ___________________________________________

;file a.as includes routine PRTSTR. This routine prints the string.

	.entry PRTSTR
PRTSTR:	mov STRADD, r2

		; r2 now hold the memory location of the string.

LOOP:	cmp	#0,@r2
				jnz	BYE
				prn	@r2
				inc	r2
				jnc	LOOP
BYE:		rts
				.extern STRADD

__________________________________________________________________________

Running the assembler program on the ps.as source file creates the following output files:
ps.ob

	24 12
0000	061010   a 
0001	000025   r 
0002	000024   r 
0003	150010   a 
0004	000000   e 
0005	150010   a 
0006	000000   e 
0007	003010   a 
0010	000015   a 
0011	000034   r 
0012	021010   a 
0013	000035   r 
0014	000034   r 
0015	100010   a 
0016	000034   r 
0017	150010   a 
0020	000000   e 
0021	150010   a 
0022	000000   e 
0023	170000   a 
0024	000000 
0025	000141 
0026	000142 
0027	000143 
0030	000144 
0031	000145 
0032	000146 
0033	000000 
0034	000000 
0035	000000

ps.ent

STRADD	24 
MAIN	0

ps.ext

COUNT	4 
PRTSTR	6 
REVERSE	20 
PRTSTR	22

Running the assembler program on the cs.as source file creates the following output files:
cs.ob

14 0
0000	001041   a 
0001	000000   e 
0002	010051   a 
0003	000000   a 
0004	110010   a 
0005	000013   r 
0006	070010   a 
0007	000000   e 
0010	070041   a 
0011	120010   a 
0012	000000   r 
0013	160000   a

cs.ent

COUNT	0

cs.ext

STRADD	1 
LEN	7

Running the assembler program on the rs.as source file creates the following output files:
rs.ob

25 0
0000	001041   a 
0001	000000   e 
0002	001042   a 
0003	000000   e 
0004	001043   a 
0005	000000   e 
0006	010043   a 
0007	000000   a 
0010	110010   a 
0011	000024   r 
0012	014142   a 
0013	110010   a 
0014	000024   r 
0015	005144   a 
0016	005251   a 
0017	004452   a 
0020	030043   a 
0021	000002   a 
0022	120010   a 
0023	000006   r 
0024	160000   a

rs.ent

REVERSE	0

rs.ext

STRADD	1 
LASTCHAR	3 
LEN	5

Running the assembler program on the a.as source file creates the following output files:
a.ob

	13 0
0000	001042   a 
0001	000000   e 
0002	010052   a 
0003	000000   a 
0004	110010   a 
0005	000012   r 
0006	140052   a 
0007	070042   a 
0010	120010   a 
0011	000002   r 
0012	160000   a

a.ent

PRTSTR	0

a.ext

STRADD	1

Notice that in a case where the input file contains no .extern declaration, the corresponding *.ext file is not created. Similarly, when the input file contains no .entry declaration, the corresponding *.ent file is not created.

The C Programming Language

Two Pass Assembler

Background

"Hardware"

Registers

Instruction Set Architecture ISA

Instruction

Machine Instruction Characterization

First Class of Operators

Second Class of Operators

Third Class of Operators

Additional Notes

Directive Statement

Operation Statement

Formal Definitions

Label

Number

String

Two Pass Assembler

First Pass

Second Pass

Assembler of One Pass

Separation of Data and Instructions

Assembler Error Detection

General algorithm

Mode of Operation

The format of output files

The object file

The entries file

The externals file

Example files