Execution of the source code
To execute (in linux) the complete compiler it is necessary to use the command:
python3 runs_compiler.py
Running on windows requires the installation of python in its proper version (3 or higher)
Source files present
program.txt - file containing the program to be parsed. New codes should be entered here
analyzer_lexico.py - contains the lex analyzer file
resp.lex.txt - contains the result of the lexical analysis. It is created automatically after the lexical analyzer is run
analyzer_sync.py - contains the syntax analyzer file
resp-sint.txt - file that receives the result of the parsing
analyzer_semantico.py - will contain the semantic analyzer file
execute_compiler.py - run all compiler parsers in their correct order
Development Phases
Phase 1 - Lexical Analyzer (Completed);
Phase 2 - Syntactic Analyzer (Completed);
Phase 3 - Semantic Analyzer (in planning).
Detailed requirements for each development phase
Lexical Analyzer
Implementation of a lexical analyzer for the language defined by the following table (more details about the language were based on the programming language C):
Token Word Corresponding regular expression
Reserved words algorithm variables constants record function return empty if senao as to read write whole real boolean char string
Identifiers `Letra (Letra Number Dígito+(.Dígito+)? Letter ai Digit 0..9 Symbol ASCII de 32 a 126 Constant Chain
" (Letter Constant Character ``
'(Letter
Operators `` `. Eur-lex.europa.eu eur-lex.europa.eu
Delimiters ; , ( ) { } [ ]
Thread Starter /* Isto é um comentário de bloco /
Block Comments // Isto é um comentário de linha
The lexical analyzer results in a token chain (one per line of the output file), containing the code of the identified symbol, the contents of the symbol, and the line of occurrence of the symbol. The tokens have the following format: tok+código++conteúdo+->+númeroLinha. The following is the list of possible tokens in this language:
| tok1 - Operator |
| tok100.| | tok101_+| | tok102_-| | tok103_| | tok104_/| | tok105_++| | tok106_--| | tok107_==| | tok108_!=| | tok109_>| | tok110_>=| | tok111_<| | tok112_<=| | tok113_&&| | tok114_||| | tok115_=|
| tok2 - Delimiter |
| tok200_;| | tok201_,| | tok202_(| | tok203_)| | tok204_{| | tok205_}| | tok206_[| | tok207_]|
| tok3_Number |
| tok300_Numero Inteiro| | tok301_Numero Real|
| tok400 - Constant Character |
| tok500 - Identifier |
| tok6 - Reserved word |
| tok600_algoritmo| | tok601_variaveis| | tok602_constantes| | tok603_registro| | tok604_funcao| | tok605_retorno| | tok606_vazio| | tok607_se| | tok608_senao| | tok609_enquanto| | tok610_para| | tok611_leia| | tok612_escreva| | tok613_inteiro| | tok614_real| | tok615_booleano| | tok616_char| | tok617_cadeia| | tok618_verdadeiro| | tok619_falso|
| tok700_Current Chain |
Some of the tests performed in the lexical analyzer are present in the folder: "plans and tests /
Synthetic Analyzer
Construction of a context-free grammar to the left, with no recursion to the left, in the form of Backus-Naur (BNF), according to the specifications in annex A below
Implementation of a parser for the language defined by the built grammar.
Annex A - General Characteristics of the Programming Language
Structured language;
The declaration of constants should come in a block initiated by the reserved word constants. Constants can be of any primitive type (integer, real, bloolean, char, string). Each constant must be declared in a row within the block. This block is required. If there are no constants, the block must be empty.
Exemplo:
constantes
{
inteiro MAX = 10;
inteiro MAX2 = 50;
cadeia Texto = "mensagem";
}
All constants must be global;
The creation of a record type of this language must be started by the reserved record word, followed by the record type name, followed by a block containing the record fields declaration (type and name). A statement on each block line.
All record types, if any, must be global and declared at the beginning of the program before the declaration of constants and global variables.
Every declaration of variables should come in a block beginning with the reserved word variables. Each variable must be declared in a line within the block, containing the type and name of the variable followed by a semicolon. This block is required. If there are no variables, the block must be empty.
Exemplo:
variaveis
{
inteiro a;
inteiro b;
cadeia Mensagem;
}
Variables can be either primitive types or record types.
The declaration of vectors and matrices must be made in the variable declaration block. The language allows only the declaration and manipulation of vectors and matrices of the primitive types.
Initialization of variables in the declaration is allowed, but not vectors and arrays.
Variables can be local or global. If they are global must come must come declared after the declaration of constants. If they are local, they should be declared at the beginning of the function block, its scope being the body of the function where the variables are declared.
To access the fields of a variable of the record type, you must use the variable name followed by the dot operator, followed by the field name.
The main body of the program is a block initiated by the reserved word algorithm. When a program in this language is executed, it is this block that executes. This block is the last part of a program in this language.
A program in this language can have several functions.
Functions can have parameters of primitive types and of record type, and can return values (using the return command).
Every function must start with the reserved word function, followed by the return type (of the primitive types, of the record or empty type), followed by the function name identifier, followed by the list of formal parameters and the body of the function delimited by {e }.
Delimiters {e} mark start and end of blocks (commands, declarations, functions), respectively.
Relational, logical, and arithmetical exsessions are allowed.
The operators of this language have the precedence and associativity of operators identical to that of the C language.
Commands:
Command if..on:
In the condition of the command if only relational expressions or logic are allowed.
The part otherwise will be optional.
Command for:
Command similar to the for C languages and Java, being obligatory the three parts of the command.
Command while:
C-like and Java while-while command
In the condition of the command while only relational or logical expressions will be allowed.
Command type:
The command will start with the word type and what should be written in parentheses, ending with a semicolon.
Multiple impressions in the same command should be separated by commas.
The command can print: constants, constant strings, variables, vectors, matrices and expressions.
command read:
The command will start with the word read and the name of the variable, vector position or matrix in parentheses, ending with a semicolon.
Multiple readings in the same command should be separated by commas.
Variables, vectors, arrays, and record fields can be used in assignments, expressions, function returns, and parameters in function calls.
Free Grammar of Language Context - BNF Form
:= <registro_declaracao><constantes_declaracao><variaveis_declaracao><funcao_declaracao><algoritmo_declaracao>
<registro_declaracao> := registro token_identificador { <declaracao_reg> } <registro_declaracao> | Ɛ
<declaracao_reg> := ; <declaracao_reg> | Ɛ
:= <tipo_primitivo> token_identificador
<tipo_primitivo> := cadeia | real | inteiro | char | booleano
<constantes_declaracao> := constantes { <declaracao_const> }
<declaracao_const> := = <valor_primitivo>; <declaracao_const> | Ɛ
<valor_primitivo> := token_cadeia | token_real | token_inteiro | token_char | verdadeiro | falso
<variaveis_declaracao> := variaveis { <declaracao_var> }
<declaracao_var> := <identificador_deriva>; <declaracao_var> | token_identificador token_identificador; <declaracao_var> | Ɛ
<identificador_deriva> := [token_inteiro] | | Ɛ
:= [token_inteiro] | Ɛ
:= = <valor_primitivo> | Ɛ
<funcao_declaracao> := funcao <tipo_return> token_identificador (<decl_param>) { <deriva_cont_funcao> } <funcao_declaracao> | Ɛ
<tipo_return> := <tipo_primitivo> | vazio | token_identificador // Para retorno de variaveis e de registros
<decl_param> := <identificador_param_deriva> <deriva_param> | token_identificador token_identificador <deriva_param>
<identificador_param_deriva> := []<matriz_param> | Ɛ
<matriz_param> := [] | Ɛ
<deriva_param> := ,<decl_param> | Ɛ
<deriva_cont_funcao> := <variaveis_declaracao> <decl_comandos> retorno <return_deriva>; | <decl_comandos> retorno <return_deriva>;
<return_deriva> := vazio | token_identificador<identificador_imp_arm_deriva> | <valor_primitivo>
<decl_comandos> := <decl_comandos> | Ɛ
:= <se_declaracao> | <enquanto_declaracao> | <para_declaracao> | <escreva_declaracao> | <leia_declaracao> | <exp_aritmetica> | Ɛ
<se_declaracao> := se (<exp_rel_bol>) {<decl_comandos>}<senao_decl>
<senao_decl> := senao {<decl_comandos>} | Ɛ
<enquanto_declaracao> := enquanto (<exp_rel_bol>) { <decl_comandos> }
<para_declaracao> := para (token_identificador = token_inteiro; token_identificador <op_relacional> token_inteiro; token_identificador <op_cont>) {<decl_comandos>}
<leia_declaracao> := leia (<exp_leia>);
<exp_leia> := <exp_armazena><exp_leia_deriva><exp_leia> | Ɛ
<exp_leia_deriva> := ,<exp_armazena> | Ɛ
<exp_armazena> := token_identificador <identificador_imp_arm_deriva>
<escreva_declaracao> := escreva (<exp_escreva>);
<exp_escreva> := <exp_imprime><exp_escreva_deriva><exp_escreva> | Ɛ
<exp_escreva_deriva> := ,<exp_imprime> | Ɛ
<exp_imprime> := token_cadeia | token_char | token_identificador <identificador_imp_arm_deriva> | (<exp_simples>)
<identificador_imp_arm_deriva> := .token_identificador | [token_inteiro] | Ɛ
<exp_aritmetica> := token_identificador = <exp_simples>;
<exp_rel_bol> := <exp_simples> <op_relacional> <exp_simples> <exp_rel_deriva>
<exp_simples> := <op_ss><termo_deriva> | <termo_deriva>
<op_relacional> := < | > | == | != | <= | >=
<exp_rel_deriva> := <op_bolleano> <exp_simples> <op_relacional> <exp_simples> <exp_rel_deriva> | Ɛ
<op_ss> := + | -
:= <fator_deriva>
<termo_deriva> := +<op_soma_deriva> | -<op_sub_deriva> | Ɛ
<op_bolleano> := && | || | !
:= token_identificador <identificador_imp_arm_deriva> | token_inteiro | (<exp_simples>)
<fator_deriva> := <op_md><fator_deriva> | Ɛ
<op_soma_deriva> := <termo_deriva> | +
<op_sub_deriva> := <termo_deriva> | -
<op_md> := * | /
<op_cont> := ++ | --
<algoritmo_declaracao> := algoritmo {<deriva_cont_principal> }
<deriva_cont_principal> := <declaracao_var> <decl_comandos> | <decl_comandos> | Ɛ
The built-in syntax analyzer was of the Recursive Predictive Descending type. For each non-terminal symbol of the grammar, a new function has been constructed. The productions of grammar were represented by successive calls of these functions. It follows the algorithm used as base, remembering that 'A' in the algorithm represents a non-terminal symbol:
void A () {
Choose a production-A, A-> x1, x2, ..., xk
for (i = 1 ateh k) {
if (xi is a non-terminal) {
active procedure xi ();
}
else if (xi equals the input symbol a) {
enter the next symbol
}
else {
an error has occurred
}
}
}
Some of the tests performed on the parser are present in the folder: "schedules and tests / testing"
When a lexical error is found by the parser, it parses it and follows it with the analysis of the following tokens
When a syntactic error is encountered, an error message is indicated in the resp-sint.txt file stating the line where the error was encountered and the token that caused the error. If the user is using a terminal emulator (such as Linux) the errors found will appear there as well.
After indicating an error the parser may encounter cascading errors due to the initial error. If this happens, the user is advised to try to fix the initial errors before the later ones.
A successful message, both in the resp-sint.txt file and in the terminal emulator, if used, will indicate the complete syntax recognition of the input string.
Semantic Analyzer
Construct a semantic parser for the programming language then elaborated according to the specifications in Annex B.
Annex B - Semantic features of language
| Type |
| Assignment must be of the same type as declared | | Arithmetic, logic and relational operations must be done between operators of the same type | | Function calls must be made with the correct number and order of parameters | | Return of functions must be of the same declared type | | + - / * operations are only compatible with integer and real operands | | Operations && || can only be made between booleans | | Operations ++ - succeed only integer and real types | | Relational operations with operators ==! = Can be done with integer, real, char or string, as long as both operands are of the same type | | Operations> <> = <= can only be done with integer and real operands; and both operands must be of the same type | | Position vectors and arrays must be of integer type |
| Variables |
| Variables should be declared as local or global |
| Constants |
| Constants should be declared as global | | Constants can not be assigned out-of-block constants {} |
| Scope |
| Different scopes for subprograms | | Variable of the same global and local name may exist | | There can be no duplicity of variables and constants in the same scope | | There can be no duplicity of functions |
| Commands |
| The 'write' command accepts constant characters, constant strings, variables, vectors, arrays, and expressions |
Posted on Utopian.io - Rewarding Open Source Contributors