Commit 09f2f067 authored by Alexander Hirsch

Update for 2020

parent 7eee3ca7
# Compiler Construction
| Date | Deadline |
| ---------- | ------------------------------------------ |
| 2019-03-15 | [Example Input](example_input.md) |
| 2019-04-05 | [Milestone 1](specification.md#milestones) |
| 2019-05-03 | [Milestone 2](specification.md#milestones) |
| 2019-05-24 | [Milestone 3](specification.md#milestones) |
| 2019-06-14 | [Milestone 4](specification.md#milestones) |
| 2019-06-21 | [Milestone 5](specification.md#milestones) |
| 2019-07-12 | [Final](evaluation_scheme.md) |
# Compiler Construction (Draft)
| Date | Topic / Recommended Schedule / Deadlines |
| ---------- | ----------------------------------------- |
| 2020-03-03 | Introduction |
| 2020-03-10 | Lexer complete |
| 2020-03-17 | |
| 2020-03-24 | |
| 2020-03-31 | Parser complete |
| 2020-04-07 | *no proseminar* |
| 2020-04-14 | *no proseminar* |
| 2020-04-21 | Semantic checks complete |
| 2020-04-28 | |
| 2020-05-05 | AST → TAC conversion complete |
| 2020-05-12 | |
| 2020-05-19 | TAC → ASM (no function calls) complete |
| 2020-05-26 | |
| 2020-06-02 | TAC → ASM (with function calls) complete |
| 2020-06-09 | CFG generation complete |
| 2020-06-16 | Polish |
| 2020-06-23 | Build test submission deadline |
| 2020-07-14 | Final submission deadline (no extensions) |
- [mC Compiler Specification](specification.md)
- [Getting Started Code-base](https://git.uibk.ac.at/c7031162/mcc)
......@@ -19,39 +30,46 @@
The ultimate goal of this course is to build a working compiler according to the given specification.
You are not allowed to use code from other people participating in this course or code that has been submitted previously by somebody else.
However, a *getting started* code-base is provided.
A *getting started* code-base is provided, but you can also start from scratch.
You will be able to work on your compiler during the lab.
During the lab, short QA sessions will be held.
You can work on your compiler in the meantime.
I'll be present for questions all the time, yet a big part of this course is to acquire the necessary knowledge yourself.
Please note that minor modifications may be made to the specification until 1 week before the final deadline.
Please note that minor modifications may be made to the specification until 2 weeks before the final deadline.
Therefore, double-check for modifications before submitting; Git provides you with the diff anyway.
Apart from this, there will be one *required* submission near the beginning of the semester.
You have to submit an additional example input, which may be added to the set of example inputs — this way the number of integration tests is extended.
Furthermore, there are five *optional* milestones.
They provide a golden thread and enable you to receive feedback.
You may work together in teams of 1–3 people.
Teams may span across pro-seminar groups.
## Grading
### Programming Language
The final grade is computed as the weighted average of the final submission (80%) and the QA sessions (20%).
Both of these parts as well as the majority of QA session grades must be positive to pass this course.
Any of the following programming languages can be used:
Other submissions are not graded.
- modern C (used for the getting started code-base)
- modern C++
- Go
- Rust
- Haskell
Be sure to adhere to the specification; deviating from it (without stating a proper reason) will negatively impact your grade.
See [Final Submission Evaluation Scheme](evaluation_scheme.md) for more details.
Go easy on external dependencies and obscure language extensions — yes, I'm looking at you, Haskell.
Code readability is paramount.
Using overly complex and cryptic concepts may negatively impact the evaluation process — again, looking at you, Haskell and your voodoo magic lenses.
### Evaluation System
I'll be using a virtualised, updated Ubuntu 18.04 LTS (64 bit) to examine your submissions.
I'll be using a virtualised, updated Ubuntu 20.04 LTS (64 bit) to examine your submissions.
From this you can infer the software versions I'll be using.
The submitted code has to compile and run on this system.
## Grading
The final grade is computed as the weighted average of the final submission (80%) and the QA sessions (20%).
Both of these parts as well as the majority of QA session grades must be positive to pass this course.
Be sure to adhere to the specification; deviating from it (without giving a proper reason) will negatively impact your grade.
See [Final Submission Evaluation Scheme](evaluation_scheme.md) for more details.
### Absence
You must not be absent more than three times to pass this course.
......
# Final Submission Evaluation Scheme
Each checkbox represents 1 point to score.
The following key is used for calculating the resulting grade:
- **1:** ≥ 92%
- **2:** [84%, 92%)
- **3:** [76%, 84%)
- **4:** [68%, 76%)
- **5:** < 68%
- **1:** ≥ 90%
- **2:** [80%, 90%)
- **3:** [70%, 80%)
- **4:** [60%, 70%)
- **5:** < 60%
It is required that for the *mandelbrot* test input, a respective executable can be built and run successfully.
Points *may* be subtracted for shortcomings not explicitly listed in this form.
Points will be subtracted for shortcomings discovered during evaluation.
This includes things like:
- Encountered issues not mentioned or justified in the *Known Issues* section
......@@ -24,92 +21,36 @@ This includes things like:
- Inconsistently formatted or unreadable source code
-
## Boundary Conditions
- [ ] Correct submission
- Subject is correct
- Attached file has correct name and structure
- [ ] README is present
- Contains instructions
- Contains dependencies
- Contains *Known Issues*
- [ ] Code builds successfully
- Warnings are enabled
- No unjustified warnings of any kind
- [ ] All unit tests succeed
- [ ] All integration tests succeed
- provided test inputs must be included
- [ ] Additional integration tests (provided by the instructor) succeed
- [ ] Architecture consists of shared library + executables
- [ ] All symbols exported by the library are prefixed with `mcc_`
## Front-end
Errors need to come with a meaningful error message and source location information (filename, start line, and start column).
- Syntactic checks:
- [ ] Syntactically invalid mC programs are rejected with an error
- [ ] AST data structure is present and instantiated by the parser
- [ ] AST can be visualised using `mc_ast_to_dot`
- Semantic checks:
- [ ] Shadowing is supported correctly
- [ ] Error on use of undeclared variable
- [ ] Error on conflicting variable declaration
- [ ] Error on use of unknown function
- [ ] Error on missing `main` function
- [ ] Error on conflicting function names
- includes built-in functions
- [ ] Error on missing return-statement for non-void functions
- [ ] Correct type checking on scalars
- [ ] Correct type checking on arrays
- [ ] Error on invalid call-expressions
- Mismatching argument count
- Mismatching argument types
- Return type is taken into account by the type checker
- [ ] Symbol table data structure is present
- [ ] Symbol table can be visualised using `mc_symbol_table`
- [ ] Type checking can be traced using `mc_type_check_trace`
## Core
## Hard Requirements
- [ ] TAC data structure is present
- README is present:
- Contains list of prerequisites
- Contains build instructions
- Contains *Known Issues* section
- Submitted code builds successfully.
- `mcc` executable operates as demanded by the specification.
- A respective executable can be built and run for the *mandelbrot* test input.
- [ ] TAC can be visualised using `mc_ir`
## General (10 Points)
- [ ] CFG data structure is present
This is all about compiling *valid* input programs.
- [ ] CFG can be visualised using `mc_cfg_to_dot`
- Provided test inputs (examples) build and run successfully.
- Additional, secret test inputs build and run successfully.
## Back-end
## Front-end (8 Points)
- [ ] Assembly code can be obtained using `mc_asm`
This is all about rejecting *invalid* input programs.
- [ ] GCC is invoked to generate the final executable
- Invalid input yields a meaningful error message including source location (filename, start line, and start column).
- Syntactically invalid input is rejected by the parser.
- Semantic checks demanded by the specification are implemented and run on the obtained AST.
## Driver
## Core (2 Points)
- [ ] `mcc` executable supports the requested command-line flags
The IR needs to be decoupled from the front- and back-end in order to exploit its benefits.
Furthermore, the control flow graph is an essential tool used by optimising compilers.
- [ ] Multiple input files are supported
- TAC data structure is present and independent from front- and back-end.
- A dedicated CFG data structure is present.
- A CFG of a given IR function can be obtained and visualised.
# Example Input
Some example inputs for the compiler are already provided.
These examples are to be used as integration tests.
Your initial task is to create another example which may be added to the set.
Try to use as many features of the mC language as possible.
The example may read from `stdin` and write to `stdout` using the built-in functions.
Provide `.stdin.txt` and `.stdout.txt` files for verification purposes.
The getting started code-base provides a stub for the mC compiler.
It converts mC to C and compiles the result using GCC.
See [Submission Guideline](submission.md).
# mC Compiler Specification
This document describes the mC compiler as well as the mC language itself along with some requirements.
This document describes the mC compiler as well as the mC language along with some requirements.
Like a regular compiler, the mC compiler is divided into 3 main parts: front-end, back-end, and a core in-between.
The front-end's task is to validate a given input using syntactic and semantic checks.
The syntactic checking is done by the *parser*, which, on success, generates an abstract syntax tree (AST).
The syntactic checking is done by the parser, which, on success, generates an abstract syntax tree (AST).
This tree data structure is mainly used for semantic checking, although transformations can also be applied to it.
Moving on, the AST is translated to the compiler's intermediate representation (IR) and passed to the core.
Invalid inputs cause errors to be reported.
Moving on, the AST is translated to the compiler's intermediate representation (IR) and passed on to the core.
The core provides infrastructure for running analyses and transformations on the IR.
These analyses and transformations are commonly used for optimisation.
......@@ -18,30 +18,7 @@ The back-end translates the platform *independent* IR code to platform *dependen
An assembler converts this code to *object code*, which is finally crafted into an executable by the linker.
For these last two steps, GCC is used — referred to as *back-end compiler* in this context.
The mC compiler is implemented using modern C (or C++) adhering to the C11 (or C++17) standard.
## Milestones
1. **Parser**
- Inputs are accepted / rejected correctly (syntax only).
- Syntactically invalid inputs result in a meaningful error message containing the corresponding source location.
- An AST is constructed for valid inputs.
- The obtained AST can be printed in the DOT format (see `mc_ast_to_dot`).
2. **Semantic checks**
- The compiler rejects semantically wrong inputs.
- Invalid inputs trigger a meaningful error message including source location information.
- Type checking can be traced (see `mc_type_check_trace`).
- Symbol tables can be viewed (see `mc_symbol_table`).
3. **Control flow graph**
- Valid inputs are converted to IR.
- The IR can be printed (see `mc_ir`).
- The CFG is generated and can be printed in the DOT format (see `mc_cfg_to_dot`).
4. **Back-end**
- Valid inputs are converted to IR and then to assembly code.
- The assembly code can be printed (see `mc_asm`).
- GCC is invoked to create the final executable.
5. **Build Infrastructure**
- Your code builds and tests successfully on my evaluation system.
Adapt project layout, build system, and coding guidelines according to the used programming language's conventions.
## mC Language
......@@ -50,7 +27,7 @@ The semantics of mC are identical to C unless specified otherwise.
### Grammar
The next segment defines the grammar of mC using this notation:
The grammar of mC is defined using the following notation:
- `#` starts a single line comment
- `,` indicates concatenation
......@@ -152,14 +129,14 @@ call_expr = identifier , "(" , [ arguments ] , ")"
arguments = expression , [ { "," expression } ]
# Program
# Program (Entry Point)
program = [ { function_def } ]
```
### Comments
mC supports only *C-style* comments, starting with `/*` and ending with `*/`.
mC supports only C-style comments, starting with `/*` and ending with `*/`.
Like in C, they can span across multiple lines.
Comments are discarded by the parser; however, line breaks are taken into account for line numbering.
......@@ -179,14 +156,16 @@ Furthermore, it is assumed that arrays and strings are at most `LONG_MAX` elemen
The operators `!`, `&&`, and `||` can only be used with Booleans.
Short-circuit evaluation is *not* supported.
An expression used as a condition (for `if` or `while`) is expected to be of type `bool`.
#### Strings
Strings are immutable and do not support any operation (e.g. concatenation).
Like comments, strings can span across multiple lines.
Newlines and indentation whitespace are part of the string when dealing with multiline strings.
Whitespace characters (i.e. newlines, tabs, spaces) are part of the string.
Escape sequences are *not* supported.
Their sole purpose is to be used with the built-in `print` function (see below).
The sole purpose of strings in mC is to be used with the built-in `print` function (see below).
#### Arrays
......@@ -232,7 +211,7 @@ Modifications made to an array inside a function are visible outside the functio
int main() {
int[5] arr;
foo(arr);
print_int(arr[2]); // outputs 42
print_int(arr[2]); /* outputs 42 */
return 0;
}
......@@ -246,7 +225,7 @@ While strings can be re-assigned (in contrast to arrays), this is not visible ou
string s;
s = "bar";
foo(s);
print(s); // outputs bar
print(s); /* outputs bar */
return 0;
}
......@@ -254,32 +233,26 @@ While strings can be re-assigned (in contrast to arrays), this is not visible ou
There is *no* type conversion, neither implicit nor explicit.
An expression used as a condition (for `if` or `while`) is expected to be of type `bool`.
*Note:* If the need for explicit type conversion arises, additional built-ins will be added for this purpose.
#### Entry Point
The top-level grammar rule is `program` which simply consists of 0 or more function definitions.
The top-level grammar rule is `program` which consists of 0 or more function definitions.
While the parser happily accepts empty source files, a semantic check enforces the presence of a function named `main`.
This function takes no arguments and returns an `int`.
On success, an mC program returns `0`.
On success, an mC program's `main` function returns `0`.
#### Declaration, Definition, and Initialization
`declaration` is used to declare variables which can then be initialised with `assignment`.
Splitting declaration and initialisation simplifies the creation of symbol tables.
Functions are always declared by their definition.
Forward declarations are therefore *not* supported.
Unlike in C, a function can be called before it has been declared (which, in mC, means defined).
Forward declarations are therefore *not* supported.
#### Empty Parameter List
In C, the parameter list of a function taking no arguments is written as `(void)`.
mC, in this case, just uses an empty parameter list `()`.
In mC, an empty parameter list is always written as `()`.
#### Dangling Else
......@@ -308,26 +281,23 @@ The following built-in functions are provided by the compiler for I/O operations
## mC Compiler
The mC compiler is implemented as a library.
It can be used either programmatically or via the provided command-line applications.
It can be used either programmatically or via the provided command-line applications (see below).
The focus lies on a clean and modular implementation as well as a straight forward architecture, rather than raw performance.
The focus lies on a clean and modular implementation as well as a straightforward architecture, rather than raw performance.
For example, each semantic check may traverse the AST in isolation.
The compiler guarantees the following:
- Exported symbols are prefixed with `mcc_`.
- It is thread-safe.
- No memory is leaked — even in error cases.
- All functions are thread-safe.
- Functions do not interact directly with `stdin`, `stdout`, or `stderr`.
- No function terminates the application on correct usage.
- No function terminates the application on correct usage (or replaces the running process using `exec`).
- No memory is leaked — even in error cases.
*Note for C++*:
Do not prefix symbols.
Put everything in an `mcc` namespace instead.
*Note for C*: Prefix symbols with `mcc_` due to the lack of namespaces.
### Logging
Logging infrastructure may be present; however, all log output is disabled by default.
Logging infrastructure *may* be present; however, all log (and debug) output is disabled by default.
The log level can be set with the environment variable `MCC_LOG_LEVEL`.
0 = no logging
......@@ -352,14 +322,11 @@ This allows for better IDE integration.
Displaying the offending source code along with the error message is helpful, but not required.
Parsing may stop on the first error.
Pay attention to operator precedence.
Error recovery is optional.
The parser component may be generated by tools like `flex` and `bison`, or similar.
However, you are encouraged to implement a recursive descent or combinator parser instead.
Nevertheless, pay attention to operator precedence.
Note that partial mC programs, like an expression or statement, are not valid inputs to the main *parse* function.
The library may provide additional functions for parsing single expressions or statements.
The library *may* provide additional functions for parsing single expressions or statements.
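The note on operator precedence can be illustrated with precedence climbing, one common way to handle binary operators in a hand-written recursive descent parser. The sketch below evaluates integer expressions directly instead of building an AST, and it only knows two precedence levels; both the function names and the operator table are assumptions made for illustration, not part of the specification. It performs no error handling (e.g. for division by zero or missing operands).

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical helper: binding strength of a binary operator, 0 if c is none. */
static int precedence(char c)
{
    switch (c) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default: return 0;
    }
}

static void skip_spaces(const char **s)
{
    while (isspace((unsigned char)**s)) (*s)++;
}

static long parse_expr(const char **s, int min_prec);

/* primary = integer literal | "(" expression ")" */
static long parse_primary(const char **s)
{
    skip_spaces(s);
    if (**s == '(') {
        (*s)++;
        long value = parse_expr(s, 1);
        skip_spaces(s);
        if (**s == ')') (*s)++;
        return value;
    }
    char *end;
    long value = strtol(*s, &end, 10);
    *s = end;
    return value;
}

/* Precedence climbing: only consume operators that bind at least as
 * tightly as min_prec; weaker operators are left to the caller. */
static long parse_expr(const char **s, int min_prec)
{
    long lhs = parse_primary(s);
    for (;;) {
        skip_spaces(s);
        char op = **s;
        int prec = precedence(op);
        if (prec == 0 || prec < min_prec) return lhs;
        (*s)++;
        /* Parsing the right operand with prec + 1 yields left-associativity. */
        long rhs = parse_expr(s, prec + 1);
        switch (op) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs /= rhs; break;
        }
    }
}

/* Evaluates a complete expression given as text. */
long evaluate(const char *text)
{
    return parse_expr(&text, 1);
}
```

In a real parser, the places where `lhs` is updated would construct a binary-operation AST node instead of computing a value.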
### Abstract Syntax Tree
......@@ -369,13 +336,11 @@ Consider using the visitor pattern for tree traversals.
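In C, the visitor pattern can be approximated with a struct of callbacks. The following is a minimal sketch under the assumption of a tiny AST with only two node kinds; all type and function names are hypothetical and not taken from the getting started code-base.

```c
#include <stddef.h>

enum node_kind { NODE_LITERAL, NODE_BINARY_OP };

struct node {
    enum node_kind kind;
    struct node *lhs; /* NULL for literals */
    struct node *rhs; /* NULL for literals */
};

/* Callbacks may be NULL; userdata is passed through untouched. */
struct visitor {
    void (*literal)(const struct node *node, void *userdata);
    void (*binary_op)(const struct node *node, void *userdata);
    void *userdata;
};

/* Post-order traversal: children first, then the node itself. */
void visit(const struct node *node, const struct visitor *visitor)
{
    if (node == NULL) return;
    visit(node->lhs, visitor);
    visit(node->rhs, visitor);
    switch (node->kind) {
    case NODE_LITERAL:
        if (visitor->literal) visitor->literal(node, visitor->userdata);
        break;
    case NODE_BINARY_OP:
        if (visitor->binary_op) visitor->binary_op(node, visitor->userdata);
        break;
    }
}

/* Example callback: counts the nodes it is invoked for. */
void count_node(const struct node *node, void *userdata)
{
    (void)node;
    (*(size_t *)userdata)++;
}
```

Keeping the traversal in one place means that a semantic check or a DOT printer only has to supply the callbacks it cares about.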
Given this example input:
```c
int fib(int n)
{
if (n < 2) return n;
return fib(n - 1) + fib(n - 2);
}
```
int fib(int n)
{
if (n < 2) return n;
return fib(n - 1) + fib(n - 2);
}
The visualisation of the AST for the `fib` function could look like this:
......@@ -392,7 +357,7 @@ As the parser only does syntactic checking, additional semantic checks are imple
- Checking for presence of `main` and correct signature
- Checking that all execution paths of a non-void function return a value
- Type checking (remember, neither implicit nor explicit type conversions)
- Includes checking operations on arrays
- Includes checking operations on arrays (including array size)
- Includes checking arguments and return types for call expressions
In addition to the AST, *symbol tables* are created and used for semantic checking.
......@@ -401,18 +366,23 @@ Be sure to correctly model [*shadowing*](https://en.wikipedia.org/wiki/Variable_
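One way to model shadowing is to give each scope its own table with a link to the enclosing scope. The sketch below assumes a fixed-capacity table and represents types as plain strings to keep it short; the names are hypothetical and a real implementation would likely grow dynamically and store richer type information.

```c
#include <stddef.h>
#include <string.h>

enum { MAX_SYMBOLS = 16 };

struct symbol {
    const char *name;
    const char *type;
};

/* One table per scope; parent is the enclosing scope (NULL for the outermost). */
struct symbol_table {
    struct symbol_table *parent;
    struct symbol symbols[MAX_SYMBOLS];
    size_t count;
};

/* Returns 0 on success, -1 if the name is already declared in this
 * scope (conflicting declaration) or the table is full. */
int symbol_table_insert(struct symbol_table *table, const char *name, const char *type)
{
    for (size_t i = 0; i < table->count; i++) {
        if (strcmp(table->symbols[i].name, name) == 0) return -1;
    }
    if (table->count == MAX_SYMBOLS) return -1;
    table->symbols[table->count++] = (struct symbol){name, type};
    return 0;
}

/* Innermost declaration wins: search this scope first, then the enclosing ones. */
const struct symbol *symbol_table_lookup(const struct symbol_table *table, const char *name)
{
    for (; table != NULL; table = table->parent) {
        for (size_t i = 0; i < table->count; i++) {
            if (strcmp(table->symbols[i].name, name) == 0) return &table->symbols[i];
        }
    }
    return NULL;
}
```

With this layout, redeclaring a name in an inner scope succeeds and hides the outer declaration, while redeclaring it in the same scope is reported as a conflict. A failed lookup corresponds to the use of an undeclared variable.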
### Intermediate Representation
As IR, a low-level [three-address code (TAC)](https://en.wikipedia.org/wiki/Three-address_code) is used.
The instruction set of this code is *not* specified.
The instruction set of this IR is *not* specified.
The compiler's core is independent from the front- and back-end.
### Control Flow Graph
*Hint:* Handle arguments and return values for function calls via an *imaginary* stack using dedicated `push` and `pop` instructions.
Have a look at the calling convention used for assembly code generation.
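Since the instruction set is left open, the following is only one possible shape for a TAC instruction, including the `push` and `pop` instructions from the hint. The opcodes, the struct layout, and the textual format are all assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical instruction set; the specification does not prescribe one. */
enum tac_op { TAC_ADD, TAC_SUB, TAC_COPY, TAC_PUSH, TAC_POP, TAC_CALL, TAC_RETURN };

/* One quadruple: result = arg1 op arg2. Unused fields are NULL. */
struct tac_instr {
    enum tac_op op;
    const char *result;
    const char *arg1;
    const char *arg2;
};

/* Writes a textual form of instr into buf and returns snprintf's result. */
int tac_format(const struct tac_instr *instr, char *buf, size_t size)
{
    assert(instr != NULL && buf != NULL);
    switch (instr->op) {
    case TAC_ADD:    return snprintf(buf, size, "%s = %s + %s", instr->result, instr->arg1, instr->arg2);
    case TAC_SUB:    return snprintf(buf, size, "%s = %s - %s", instr->result, instr->arg1, instr->arg2);
    case TAC_COPY:   return snprintf(buf, size, "%s = %s", instr->result, instr->arg1);
    case TAC_PUSH:   return snprintf(buf, size, "push %s", instr->arg1);
    case TAC_POP:    return snprintf(buf, size, "pop %s", instr->result);
    case TAC_CALL:   return snprintf(buf, size, "call %s", instr->arg1);
    case TAC_RETURN: return snprintf(buf, size, "return");
    }
    return -1;
}

/* The call fib(n - 1) as seen by the caller: the argument is pushed onto
 * the imaginary stack and the return value is popped from it afterwards. */
const struct tac_instr fib_call[] = {
    {TAC_SUB,  "t0", "n",   "1"},
    {TAC_PUSH, NULL, "t0",  NULL},
    {TAC_CALL, NULL, "fib", NULL},
    {TAC_POP,  "t1", NULL,  NULL},
};
```

The callee would mirror this sequence: pop its parameters on entry and push its result before returning.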
### Control Flow Graph (CFG)
A control flow graph data structure is present and can be constructed for a given IR program.
This graph is commonly used by analyses for extracting structural information crucial for transformation steps.
A control flow graph data structure consisting of edges and basic blocks (containing IR instructions) is present.
For each function in a given IR program, a corresponding CFG can be obtained.
It is recommended to also provide a visitor mechanism for this graph.
The CFG is commonly used by analyses for extracting structural information crucial for transformation steps.
Like the AST, it can be visualised.
Providing a visitor mechanism for CFGs is optional, yet recommended.
Like the AST, CFGs can be printed using the DOT format.
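Basic blocks are commonly formed by first identifying *leaders*: the first instruction, every jump target, and every instruction following a jump. The sketch below assumes a hypothetical IR in which every jump targets a dedicated label instruction, so labels can be treated as leaders directly; the enum and function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of IR instructions for block splitting. */
enum ir_kind { IR_OTHER, IR_LABEL, IR_JUMP, IR_COND_JUMP, IR_RETURN };

/* Marks every instruction that starts a basic block in is_leader and
 * returns the number of basic blocks. */
size_t find_leaders(const enum ir_kind *code, size_t count, bool *is_leader)
{
    assert(code != NULL && is_leader != NULL);
    size_t blocks = 0;
    for (size_t i = 0; i < count; i++) {
        bool after_branch = i > 0 && (code[i - 1] == IR_JUMP ||
                                      code[i - 1] == IR_COND_JUMP ||
                                      code[i - 1] == IR_RETURN);
        is_leader[i] = (i == 0) || code[i] == IR_LABEL || after_branch;
        if (is_leader[i]) blocks++;
    }
    return blocks;
}
```

Each basic block then spans from one leader up to, but not including, the next. Edges follow from the last instruction of each block: a jump links to the block of its target, a conditional jump additionally falls through to the next block.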
The example below is taken from [Marc Moreno Maza](http://www.csd.uwo.ca/~moreno/CS447/Lectures/CodeOptimization.html/node6.html).
Given this example IR:
......@@ -447,61 +417,61 @@ Pay special attention to floating point and integer handling.
Use [cdecl calling convention](https://en.wikipedia.org/wiki/X86_calling_conventions#cdecl).
It is paramount to correctly implement the calling convention, otherwise the stack may get corrupted during function calls and returns.
Note that *all* function calls (including built-ins) use the same calling convention — do not needlessly introduce special cases.
*Hint:* There is a `.float` assembler directive.
*Hint:* If you are not familiar with x86 assembly, pass small C snippets to GCC and look at the generated assembly code (using `-S`).
Optimisations, mitigations, and other unnecessary features (e.g. DWARF symbols, unwind tables) should be disabled.
There are also flags like `-fverbose-asm` which add additional annotations to the output.
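A small snippet for such an experiment could look as follows. The file name `add.c` and the function are arbitrary choices; the command in the comment is one possible invocation and assumes that 32-bit support for GCC is installed, since cdecl is a 32-bit x86 convention.

```c
/* add.c: mixes integer and floating point handling in one function.
 *
 * One possible invocation:
 *   gcc -S -m32 -O0 -fno-asynchronous-unwind-tables -fverbose-asm add.c -o add.s
 */
float scale_and_add(int a, int b, float factor)
{
    return (float)(a + b) * factor;
}
```

The resulting `add.s` shows how arguments are read from the stack and how the floating point result is returned to the caller.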
## Applications
Apart from the main compiler executable `mcc`, additional auxiliary executables are provided.
These executables aid the development process and are used for evaluation.
Do not omit details in the output (e.g. do not simplify the AST).
The applications are commonly defined by their usage information.
The applications are specified by their usage information.
Composing them with other command-line tools, like `dot`, is a core feature.
The exact output format is not specified in all cases.
However, details should *not* be omitted, for example by simplifying the AST.
All applications exit with code `EXIT_SUCCESS` *iff* they succeeded in their operation.
Each executable accepts multiple input files.
The inputs are parsed in isolation; the resulting ASTs are merged before semantic checks are run.
Errors are written to `stderr`.
### `mcc`
This is the main compiler executable, sometimes referred to as *driver*.
usage: mcc [OPTIONS] file...
usage: mcc [OPTIONS] <file>
The mC compiler. It takes mC input files and produces an executable.
The mC compiler. It takes an mC input file and produces an executable.
Errors are reported on invalid inputs.
Use '-' as input file to read from stdin.
OPTIONS:
-h, --help displays this help message
-v, --version displays the version number
-h, --help display this help message
-q, --quiet suppress error output
-o, --output <file> write the output to <file> (defaults to 'a.out')
-o, --output <out-file> write the output to <out-file> (defaults to 'a.out')
Environment Variables:
MCC_BACKEND override the back-end compiler (defaults to 'gcc' in PATH)
MCC_BACKEND override the back-end compiler (defaults to 'gcc')
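Looking up the back-end compiler could be done as sketched below. The function name is hypothetical, and treating an empty `MCC_BACKEND` like an unset one is an assumption not covered by the usage text.

```c
#include <stdlib.h>

/* Returns the back-end compiler command: the value of MCC_BACKEND if it
 * is set and non-empty (assumption), 'gcc' otherwise. */
const char *backend_compiler(void)
{
    const char *override = getenv("MCC_BACKEND");
    return (override != NULL && override[0] != '\0') ? override : "gcc";
}
```

As the library itself should not depend on process-wide state more than necessary, a lookup like this arguably belongs in the `mcc` driver rather than in the library.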
### `mc_ast_to_dot`
usage: mc_ast_to_dot [OPTIONS] file...
usage: mc_ast_to_dot [OPTIONS] <file>
Utility for printing an abstract syntax tree in the DOT format. The output
can be visualised using graphviz. Errors are reported on invalid inputs.
can be visualised using Graphviz. Errors are reported on invalid inputs.
Use '-' as input file to read from stdin.
OPTIONS:
-h, --help displays this help message
-o, --output <file> write the output to <file> (defaults to stdout)
-f, --function <name> limit scope to the given function
-h, --help display this help message
-o, --output <out-file> write the output to <out-file> (defaults to stdout)
### `mc_symbol_table`
usage: mc_symbol_table [OPTIONS] file...
usage: mc_symbol_table [OPTIONS] <file>
Utility for displaying the generated symbol tables. Errors are reported on
invalid inputs.
......@@ -509,27 +479,12 @@ This is the main compiler executable, sometimes referred to as *driver*.
Use '-' as input file to read from stdin.
OPTIONS:
-h, --help displays this help message
-o, --output <file> write the output to <file> (defaults to stdout)
-f, --function <name> limit scope to the given function
### `mc_type_check_trace`
usage: mc_type_check_trace [OPTIONS] file...
Utility for tracing the type checking process. Errors are reported on
invalid inputs.
Use '-' as input file to read from stdin.
OPTIONS:
-h, --help displays this help message
-o, --output <file> write the output to <file> (defaults to stdout)
-f, --function <name> limit scope to the given function
-h, --help display this help message
-o, --output <out-file> write the output to <out-file> (defaults to stdout)
### `mc_ir`
usage: mc_ir [OPTIONS] file...
usage: mc_ir [OPTIONS] <file>
Utility for viewing the generated intermediate representation. Errors are
reported on invalid inputs.
......@@ -537,13 +492,12 @@ This is the main compiler executable, sometimes referred to as *driver*.
Use '-' as input file to read from stdin.
OPTIONS:
-h, --help displays this help message
-o, --output <file> write the output to <file> (defaults to stdout)
-f, --function <name> limit scope to the given function
-h, --help display this help message
-o, --output <out-file> write the output to <out-file> (defaults to stdout)
### `mc_cfg_to_dot`
usage: mc_cfg_to_dot [OPTIONS] file...
usage: mc_cfg_to_dot [OPTIONS] <file>
Utility for printing a control flow graph in the DOT format. The output
can be visualised using graphviz. Errors are reported on invalid inputs.
......@@ -551,13 +505,13 @@ This is the main compiler executable, sometimes referred to as *driver*.
Use '-' as input file to read from stdin.
OPTIONS:
-h, --help displays this help message
-o, --output <file> write the output to <file> (defaults to stdout)
-f, --function <name> limit scope to the given function
-h, --help display this help message
-o, --output <out-file> write the output to <out-file> (defaults to stdout)
-f, --function <name> print the CFG of the given function (defaults to 'main')
### `mc_asm`
usage: mc_asm [OPTIONS] file...
usage: mc_asm [OPTIONS] <file>
Utility for printing the generated assembly code. Errors are reported on
invalid inputs.
......@@ -565,15 +519,14 @@ This is the main compiler executable, sometimes referred to as *driver*.
Use '-' as input file to read from stdin.
OPTIONS:
-h, --help displays this help message