Preview of Programming Assignment #2: Lexical Scanner Module

 

Purpose:

To implement an interpreter for our very simple BIOLA language, we need a lexical scanner that can recognize and divide each statement of a BIOLA program into its individual lexical components (tokens). Later, a parser module will call the lexical scanner and rely on the information it provides for further syntax analysis.

 

General Functionality of the Lexical Scanner Module:

Other modules should be able to use the lexical scanner module to lexically analyze a BIOLA source program, given as a collection of lines of statements like the one you have in your editor module. From the source program, the lexical scanner module should derive, for every statement, (i) the collection of individual tokens (i.e. the basic lexical components) and (ii) the category of each of these tokens. Other modules should be able to get this information back from the lexical scanner module. When requested, the lexical scanner module should also be able to display, statement by statement, the individual tokens and their categories.

 

Sample Executable:

Please download, unzip, and play with this zipped sample executable to get a sense of what the lexical scanner module can accomplish for us.

 

Categories of Tokens:

We’ll use an enumeration type, tokenCategory, to encode the category of each token. Each token should be categorized as one of the following items in tokenCategory.

·        KEYWORD                // keywords of BIOLA, like function, return, display, read, while, if, else

·        ID_NAME                // identifiers, i.e. names of variables and functions; the first character must be a letter and the remaining characters can be letters or digits

·        ASSIGNMENT_OP          // =, the assignment operator

·        NUMERICAL_OP           // numerical operators (such as +, -, *, /, and %)

·        LOGICAL_OP             // logical operators (such as &&, ||, and !)

·        RELATIONAL_OP          // relational operators (such as >, ==, <, >=, <=, and !=)

·        NUMERICAL_LITERAL      // numerical values (such as 2.45)

·        STRING_LITERAL         // string literals (like "Hi Hi", in a pair of double quotes)

·        COLON                  // end of function title statement (i.e. :)

·        SEMICOLON              // end of statement (i.e. ;)

·        COMMA                  // a comma (i.e. ,)

·        LEFT_PARENTHESIS       // (

·        RIGHT_PARENTHESIS      // )

·        LEFT_CURLYBRACE        // {

·        RIGHT_CURLYBRACE       // }

·        COMMENT                // a comment, introduced by //

·        UNKNOWN                // everything else
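
The category list above could be written as a C++ enumeration along the following lines. This is only a sketch, not the official header for the assignment: the enumerator order, the spelling UNKNOWN, and the helper functions isKeyword, isIdentifier, isNumber, and categorize are assumptions made for illustration. The keyword and operator sets are taken from the examples in the list, which may not be exhaustive, and categorize assumes the token has already been separated from the rest of its line.

```cpp
#include <cctype>
#include <string>

// Enumerators follow the category list above, in the same order (assumed).
enum tokenCategory {
    KEYWORD, ID_NAME, ASSIGNMENT_OP, NUMERICAL_OP, LOGICAL_OP, RELATIONAL_OP,
    NUMERICAL_LITERAL, STRING_LITERAL, COLON, SEMICOLON, COMMA,
    LEFT_PARENTHESIS, RIGHT_PARENTHESIS, LEFT_CURLYBRACE, RIGHT_CURLYBRACE,
    COMMENT, UNKNOWN
};

// Hypothetical helper: true if t is one of the keywords named in the list.
bool isKeyword(const std::string& t) {
    static const char* const keywords[] = {
        "function", "return", "display", "read", "while", "if", "else"};
    for (const char* k : keywords)
        if (t == k) return true;
    return false;
}

// Hypothetical helper: a letter followed by letters or digits only.
bool isIdentifier(const std::string& t) {
    if (t.empty() || !std::isalpha(static_cast<unsigned char>(t[0]))) return false;
    for (char c : t)
        if (!std::isalnum(static_cast<unsigned char>(c))) return false;
    return true;
}

// Hypothetical helper: digits with at most one decimal point (assumed format).
bool isNumber(const std::string& t) {
    bool sawDigit = false;
    int dots = 0;
    for (char c : t) {
        if (std::isdigit(static_cast<unsigned char>(c))) sawDigit = true;
        else if (c == '.') ++dots;
        else return false;
    }
    return sawDigit && dots <= 1;
}

// Classify one token that has already been split out of its line.
tokenCategory categorize(const std::string& t) {
    if (isKeyword(t)) return KEYWORD;        // must be tested before ID_NAME
    if (isIdentifier(t)) return ID_NAME;
    if (isNumber(t)) return NUMERICAL_LITERAL;
    if (t.size() >= 2 && t.front() == '"' && t.back() == '"') return STRING_LITERAL;
    if (t == "=") return ASSIGNMENT_OP;
    if (t == "+" || t == "-" || t == "*" || t == "/" || t == "%") return NUMERICAL_OP;
    if (t == "&&" || t == "||" || t == "!") return LOGICAL_OP;
    if (t == ">" || t == "==" || t == "<" || t == ">=" || t == "<=" || t == "!=")
        return RELATIONAL_OP;
    if (t == ":") return COLON;
    if (t == ";") return SEMICOLON;
    if (t == ",") return COMMA;
    if (t == "(") return LEFT_PARENTHESIS;
    if (t == ")") return RIGHT_PARENTHESIS;
    if (t == "{") return LEFT_CURLYBRACE;
    if (t == "}") return RIGHT_CURLYBRACE;
    if (t.compare(0, 2, "//") == 0) return COMMENT;  // token starts with //
    return UNKNOWN;
}
```

With these definitions, categorize("while") would yield KEYWORD and categorize("count1") would yield ID_NAME. Testing for keywords before identifiers matters, because every keyword also satisfies the identifier rule.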

 

 

Data Structures:

As in the Interface module, we use a vector of strings as the data structure to store the source code of a BIOLA program, with each string representing one line (statement). We’ll define and use four types based on the vector class of the C++ Standard Template Library: perLineTokenVector, vectOfTokenVects, perLineCategoryVector, and vectOfCategoryVects, which store (i) the tokens of a line, (ii) the tokens of all lines, (iii) the categories of the tokens of a line, and (iv) the categories of the tokens of all lines, respectively.

·        perLineTokenVector: a type for storing the tokens in one line (statement) as a vector of strings.

·        vectOfTokenVects: a type for storing the tokens of all lines of the program as a vector of perLineTokenVector objects.

·        perLineCategoryVector: a type for storing the categories of the tokens in one line (statement) as a vector of tokenCategory items.

·        vectOfCategoryVects: a type for storing the categories of the tokens of all lines of the program as a vector of perLineCategoryVector objects.
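
One possible way to declare these four types is with typedefs over std::vector, as sketched below. The enumeration is repeated here only so that the example compiles on its own. Its enumerator order and the spelling UNKNOWN are assumed, and formatTokens is a hypothetical helper, not part of the assignment. The helper also assumes that the two outer vectors are parallel, i.e. entry [i][j] of the category structure describes token [i][j].

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Repeated so the example is self-contained; the order is an assumption.
enum tokenCategory {
    KEYWORD, ID_NAME, ASSIGNMENT_OP, NUMERICAL_OP, LOGICAL_OP, RELATIONAL_OP,
    NUMERICAL_LITERAL, STRING_LITERAL, COLON, SEMICOLON, COMMA,
    LEFT_PARENTHESIS, RIGHT_PARENTHESIS, LEFT_CURLYBRACE, RIGHT_CURLYBRACE,
    COMMENT, UNKNOWN
};

typedef std::vector<std::string>           perLineTokenVector;    // tokens of one line
typedef std::vector<perLineTokenVector>    vectOfTokenVects;      // tokens of all lines
typedef std::vector<tokenCategory>         perLineCategoryVector; // categories of one line
typedef std::vector<perLineCategoryVector> vectOfCategoryVects;   // categories of all lines

// Hypothetical helper: list each token with its numeric category code,
// statement by statement. Assumes tokens and categories are parallel.
std::string formatTokens(const vectOfTokenVects& tokens,
                         const vectOfCategoryVects& categories) {
    std::ostringstream out;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        out << "Line " << i + 1 << ":\n";
        for (std::size_t j = 0; j < tokens[i].size(); ++j)
            out << "  " << tokens[i][j] << " : "
                << static_cast<int>(categories[i][j]) << "\n";
    }
    return out.str();
}
```

Keeping tokens and categories in two parallel structures mirrors the four types described above; a parser can then walk both with the same pair of indices.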