类Python编译器项目Spec(一)

发布于 作者: maxsnew

Assignment 1: Straightline Code《作业 1:直线代码》

In this assignment you’ll implement a compiler for version 1.0 of the Snake language (codename "Adder"). The adder source language is an expression language supporting let-binding and binary arithmetic operations and we will compile it into a single block of straightline assembly code.

在这个作业中,你将实现一个用于 Snake 语言 1.0 版本的编译器(代号"加法器")。加法器源语言是一种支持 let 绑定和二元算术运算的表达式语言,我们将将其编译成一段直线汇编代码。

The "middle end" of the compiler will be covered in detail in the January 21st lecture, but the frontend can be completed after the January 15 lecture, and much of the backend as well.

编译器的"中间层"将在 1 月 21 日的讲座中详细讲解,但前端可以在 1 月 15 日的讲座后完成,后端的大部分工作也可以完成。

The Source Language 源语言

There are a few pieces that go into defining a language for us to compile.

定义我们编译的语言有几个部分组成。

  • A description of the concrete syntax — the text the programmer writes 具体语法的描述——程序员编写的文本

  • A description of the abstract syntax — how to express what the programmer wrote in a data structure our compiler uses. 对抽象语法的描述——如何用我们的编译器使用的数据结构来表达程序员所写的内容。

  • The semantics — or description of the behavior — of the abstract syntax, so our compiler knows what the code it generates should do. 语义——即行为描述——的抽象语法,这样我们的编译器就知道了它生成的代码应该做什么。

Concrete Syntax 具体语法

‹prog›: def main ( IDENTIFIER ) : ‹expr›
‹expr›: 
           | let IDENTIFIER = ‹expr› in ‹expr›
           | ‹binop-expr›
‹binop-expr›: 
                       | NUMBER
                       | IDENTIFIER
                       | ‹expr› + ‹expr›
                       | ‹expr› - ‹expr›
                       | ‹expr› * ‹expr›
                       | ( ‹expr› )

The main differences between this version of adder and the versions introduced in lecture are that let expressions can have one or many bindings. An IDENTIFIER is any non-sequence of letters and digits (starting with a letter). Any highlighted text is a literal token, meaning the programmer must type exactly those characters or keywords.

这个版本的加法程序与讲座中介绍的版本的主要区别在于 let 表达式可以有一个或多个绑定。IDENTIFIER 是任何非字母数字序列(以字母开头)。任何高亮文本都是一个字面量标记,这意味着程序员必须输入确切的字符或关键字。

Abstract Syntax 抽象语法

The abstract syntax of Adder is a Rust datatype, and corresponds closely with the concrete syntax. You don’t have to worry about implementing a parser, we have provided one for you that will parse the concrete syntax into an abstract syntax tree.

加法程序的抽象语法是一个 Rust 数据类型,与具体语法非常接近。你不必担心实现解析器,我们已经为你提供了一个将具体语法解析为抽象语法树的解析器。

The Adder program is parsed into the Prog struct which contains the input parameter name param and the main Expr called main. An Expr is either a number, a variable, a primitive operation or a sequence of let-bindings. Each of these constructors contains a loc: SrcLoc that corresponds to the source location information for that sub-expression. This information is only used to provide better error messages, you don’t need to inspect the internal details of SrcLoc type.

加法程序被解析为 Prog 结构,该结构包含输入参数名称 param 和主 Expr ,名为 main 。一个 Expr 要么是一个数字、一个变量、一个基本操作或一系列 let 绑定。这些构造器中的每一个都包含一个 loc: SrcLoc ,对应于该子表达式的源位置信息。这些信息仅用于提供更好的错误消息,你不需要检查 SrcLoc 类型的内部细节。

Semantics 语义

An Adder program takes in a single 64-bit integer as input and evaluates to a single 64-bit integer. Nums evaluate to themselves (so a program just consisting of Num(5) should evaluate to the integer 5). Primitive expressions perform addition or subtraction by one on their argument. Let bindings should evaluate all the binding expressions to values one by one, and after each, store a mapping from the given name to the corresponding value in both (a) the rest of the bindings, and (b) the body of the let expression. Identifiers evaluate to whatever their current stored value is. There are several examples further down to make this concrete.

一个加法程序接收一个 64 位整数作为输入,并计算出一个 64 位整数。 Num 评估为自己(所以一个仅由 Num(5) 组成的程序应该评估为整数 5 )。原始表达式对其参数执行加一或减一的运算。let 绑定应该逐个评估所有绑定表达式为值,并在每个之后,将给定的名称映射到相应的值存储在 (a) 剩余的绑定中,以及 (b) let 表达式的主体中。标识符评估为其当前存储的值。下面有多个示例来具体说明这一点。

The compiler should return an error if

编译器应返回错误,如果:

  • There is a binding list containing two or more bindings with the same name 存在一个包含两个或多个同名绑定的绑定列表

  • An identifier is unbound (there is no surrounding let binding for it) 标识符未绑定(没有围绕它的 let 绑定)

These errors are encoded in the CompileError type in compile.rs.

这些错误被编码在 CompileError 中的 compile.rs 类型中。

Note that shadowing of variables is allowed, but all variable names in a single binding list must be unique. So

请注意,变量遮蔽是允许的,但在单个绑定列表中的所有变量名必须是唯一的。因此

let x = 3 in
let x = 4 in
x

is a valid program, but

是一个有效的程序,但是

let x = 3, x = 4 in
x

should produce a compile-time error.

应该产生编译时错误。

Here are some examples of Adder expressions and how they parse into Rust values (ignoring the SrcLoc information):

这里是一些 Adder 表达式的例子,以及它们如何解析成 Rust 值(忽略 SrcLoc 信息):

| Concrete Syntax              | Abstract Syntax                                                                                                                                     | Result |
|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|--------|
| 5                            | Num(5)                                                                                                                                              | 5      |
| sub1(add1(sub1(5)))          | Prim{ prim: Sub1, args: [ Prim{ prim: Add1, args: [ Prim{ prim: Sub1, args: [ Num(5) ] } ] } ] }                                                     | 4      |
| let x = 5 in add1(x)         | Let{ bindings: [ Binding{ var: "x", expr: Num(5) } ], body: Prim{ prim: Add1, args: [ Var("x") ] } }                                                 | 6      |
| let x = 5, y = sub1(x) in sub1(y) | Let{ bindings: [ Binding{ var: "x", expr: Num(5) }, Binding{ var: "y", expr: Prim{ prim: Sub1, args: [ Var("x") ] } } ], body: Prim{ prim: Sub1, args: [ Var("y") ] } } | 3      |
| 4 + 7 * 3                    | Prim{ prim: Add, args: [ Num(4), Prim{ prim: Mul, args: [ Num(7), Num(3) ] } ] }                                                                     | 25     |

Your program will not just be a single expression, but a main function that takes an input, for instance the program

你的程序不会只是一个表达式,而是一个接收输入的主函数,例如程序

def main(x):
  x * x

will be parsed into the abstract syntax

将被解析为抽象语法

Prog { param: ("x") , main: Prim{ prim: Mul, args: [ Var("x"), Var("x") ]} }

and its semantics is a function that outputs the square of the input.

其语义是一个输出输入平方的函数。

Compilation and Starter Code 编译和起始代码

The starter code provides much of the architecture of the compiler already:

启动代码已经提供了编译器的大部分架构:

  • mk_submission.sh, a script that creates a zip file with the correct files to be submitted to the autograder.

  • mk_submission.sh ,一个创建包含提交给自动评分器的正确文件的 zip 文件的脚本。

  • Cargo.toml the package description file.

  • Cargo.toml 包描述文件。

  • runime/stub.rs a Rust wrapper that calls our compiled assembly code function with an argument provided from the command line.

  • runtime/stub.rs 一个 Rust 包装器,它使用命令行提供的参数调用我们编译的汇编代码函数

  • src/main.rs, src/compile.rs and runner.rs are wrappers providing a command line interface to the compiler pipeline and interpreters. You should look at src/compile.rs to see how the compiler pipeline functions.

  • src/main.rs 、 src/compile.rs 和 runner.rs 是提供编译器管道和解释器命令行界面的包装器。你应该查看 src/compile.rs 来了解编译器管道的工作方式。

  • src/ast.rs, src/ssa.rs and src/asm.rs have type definitions for the abstract syntax trees, intermediate representation and assembly code, respectively.

  • src/ast.rs 、 src/ssa.rs 和 src/asm.rs 分别为抽象语法树、中间表示和汇编代码提供类型定义

  • src/interp.rs and src/pretty.rs implement interpreters and pretty printers for abstract syntax trees and the intermediate representation

  • src/interp.rs 和 src/pretty.rs 实现了抽象语法树和中间表示的解释器和美化打印器

  • src/identifiers.rs utilities for working with identifiers.

  • src/identifiers.rs 用于处理标识符的工具。

  • src/parser.rs, generated code for lexing and parsing source programs into abstract syntax trees, which is generated from src/parser.lalrpop using the Lalrpop parser generator.

  • src/parser.rs ,用于将源程序词法分析和解析成抽象语法树的生成代码,该代码通过使用 Lalrpop 解析器生成器从 src/parser.lalrpop 生成。

  • examples/ a directory for putting example programs

  • examples/ 一个用于存放示例程序的目录

  • tests/examples.rs a file with some testing macros to help you test your compiler against the examples in examples/

  • tests/examples.rs 一个包含一些测试宏的文件,用于帮助你用 examples/ 中的示例测试你的编译器

You will complete the implementation of the compilation pipeline in three files: front_end.rs, middleend.rs and back_end.rs.

你将在三个文件中完成编译流程的实现: front_end.rs 、 middleend.rs 和 back_end.rs 。

Front End 前端

After the program is parsed into an abstract syntax tree, the final stage of the front end is to ensure that the program satisfies the scoping rules for the language: all variables are declared before they are used and that multiple-let forms do not declare the same variable more than once. These errors are defined in CompileError. Additionally, to avoid having to deal with shadowing later in the compiler, we want to resolve the source-level identifiers into unique identifiers for use in the remainder of the compiler. This functionality is implemented in the resolve_prog method of the Resolver struct, a simple wrapper object that is used as a source of unique identifiers. Your task is to fill in the details of resolve_prog sub-procedure for this method. You will need to choose an appropriate notion of auxiliary environment to correctly implement this functionality. As in class, you may find it useful to use the HashMap from the im package, a package for immutable data structures that support sharing. The documentation for this data structure is here, most relevant are the new, get and insert methods.

程序被解析成抽象语法树后,前端最后的阶段是确保程序满足语言的的作用域规则:所有变量在使用前必须声明,且多个 let 形式不能多次声明同一个变量。这些错误在 CompileError 中定义。此外,为了避免在编译器后续阶段处理阴影作用,我们希望将源级标识符解析为唯一标识符供编译器其余部分使用。该功能在 Resolver 结构的 resolve_prog 方法中实现,这是一个用作唯一标识符来源的简单包装对象。你的任务是填充这个方法中 resolve_prog 子过程的细节。你需要选择一个合适的辅助环境来正确实现这一功能。如课堂上所述,使用来自 im 包的 HashMap 可能会很有用,这是一个支持共享的不可变数据结构包。该数据结构的文档在这里,最相关的是 new 、 get 和 insert 方法。

Middle End 中间层

As discussed in lecture on January 22nd, we next want to make the execution order of our program explicit by compiling it into our intermediate representation, a BlockBody of straight-line code, defined in src/ssa.rs. A BlockBody is either Returns and immediate value or it performs an Operation, binds that to a dest variable and then continues with a next BlockBody.

正如 1 月 22 日的讲座中讨论的,我们接下来希望通过将其编译到我们的中间表示中,使程序的执行顺序显式化,即编译为 BlockBody 中定义的直线代码 src/ssa.rs 。一个 BlockBody 要么是 Returns 和立即值,要么执行 Operation ,将其绑定到 dest 变量,然后继续执行 next BlockBody 。

This lowering is implemented as a method lower_prog around the Lowerer struct. Your task is to implement the core sub-procedure for this lowering process, lower_expr_kont. The parameter k is the Continuation of the code you are lowering, that is, it is a pair of a destination variable where the output of the current expression should be stored and a BlockBody that will execute after the current expression. The initial continuation is provided in lower_prog to just return the output, but you will build this up recursively in order to produce your flattened code from a tree.

这种降低是通过围绕 Lowerer 结构实现的方法 lower_prog 来完成的。你的任务是实现这个降低过程的核心理序 lower_expr_kont 。参数 k 是你要降低的代码的 Continuation ,即它是一个目标变量的对,当前表达式的输出应该存储在这个变量中,以及一个 BlockBody ,它将在当前表达式执行后执行。初始的连续性在 lower_prog 中提供,只是为了返回输出,但你将递归地构建它,以便从树中生成你的扁平化代码。

Back End 后端

To implement the final code generation phase of the pipeline, we need a way to map the variables in our intermediate representation into concrete memory locations. As discussed in lecture, we will store our local variables on the stack, as a memory locations relative to the stack pointer rsp. Concretely, in backend.rs, you will implement the emit_prog function, which produces assembly instructions from the intermediate representation. We have already provided the outline of the emit function, which sets up the correct .text section and global declarations to link with runtime/stub.rs. In doing so, you will need to design and maintain some kind of environment for mapping the intermediate representation variables to stack offsets.

为了实现管道的最终代码生成阶段,我们需要一种方法将我们的中间表示中的变量映射到具体的内存位置。正如在讲座中讨论的,我们将局部变量存储在栈上,作为相对于栈指针 rsp 的内存位置。具体来说,在 backend.rs 中,你将实现 emit_prog 函数,该函数将中间表示转换为汇编指令。我们已经提供了 emit 函数的框架,该函数设置正确的 .text 部分和 global 声明以链接到 runtime/stub.rs 。为此,你需要设计和维护某种环境,用于将中间表示变量映射到栈偏移量。

Command-Line Interface 命令行界面

To build the command-line interface run cargo build --release, after which the executable will be available at target/release/snake. Run ./target/release/snake --help for an overview of the command-line interface. To help you implement your compiler gradually, the command-line interface that supports partial execution of the compiler pipeline. That is, you can pass the --target tgt or -t tgt option to specify that you want the compiler to produce an ast, a resolved ast, your ssa intermediate representation, your assembly code, or the linked executable. You can also pass the --execute n or -x n option to say to run your code with the input argument n. By default this will compile your program to an executable and then run it, but if you select an (resolved) ast or ssa this will run the provided interpreter instead.

要构建命令行界面,运行 cargo build --release ,之后可执行文件将位于 target/release/snake 。运行 ./target/release/snake --help 可查看命令行界面概览。为了帮助你逐步实现编译器,命令行界面支持编译器管道的部分执行。也就是说,你可以通过传递 --target tgt 或 -t tgt 选项来指定编译器生成抽象语法树(ast)、解析后的抽象语法树、你的 SSA 中间表示、你的汇编代码或链接的可执行文件。你也可以通过传递 --execute n 或 -x n 选项来指定使用输入参数 n 运行你的代码。默认情况下,这将编译你的程序并运行它,但如果你选择解析后的抽象语法树或 SSA,这将运行提供的解释器。

Testing 测试

The file test/examples.rs contains utilities for testing the output of your compiler against example programs you include in the examples/ directory. We have provided a few examples to show how you can use this to test your end-to-end compiler pipeline, as well as the front end and middle ends. You can execute the command cargo test to run your tests.

文件 test/examples.rs 包含用于测试你的编译器输出与你在 examples/ 目录中包含的示例程序的实用工具。我们提供了一些示例,以展示如何使用这些工具来测试你的端到端编译器管道,以及前端和中端。你可以执行命令 cargo test 来运行你的测试。

Producing executables manually, installation issues 手动生成可执行文件,安装问题

On any platform you need to have nasm and ar installed. Ask on Piazza if you have any issues installing these. If you have installed these correctly and are using Mac or Linux the runner should correctly use the right flags for your platform when creating the final binary.

在任何平台上,你都需要安装 nasm 和 ar 。如果在安装这些时遇到任何问题,请在 Piazza 上提问。如果你已正确安装这些,并且正在使用 Mac 或 Linux,那么在创建最终二进制文件时,运行器应该会正确地使用适合你平台的标志。

Here are the explicit commands the starter code uses to build an executable from an assembly code file compiled_code.s and the rust stub stub.rs. First build an object file using nasm, passing the appropriate format flag for your platform.

这里列出了启动代码用来从汇编代码文件 compiled_code.s 和 rust 框架 stub.rs 构建可执行文件的具体指令。首先使用 nasm 构建一个目标文件,并传递适合您平台的格式标志。

$ nasm -felf64 compiled_code.s -o compiled_code.o   # Linux
$ nasm -fmacho64 compiled_code.s -o compiled_code.o # Mac

Then build a static library file to link with the rust file

然后构建一个静态库文件,用于与 rust 文件链接

$ ar r libcompiled_code.a compiled_code.o # Linux & Mac

Finally, compile the rust file, indicating where to look for the static library.

最后,编译 rust 文件,并指定静态库的查找路径。

$ rustc stub.rs -L . -o stub.exe                              # Linux
$ rustc stub.rs -L . --target x86_64-apple-darwin -o stub.exe # Mac

This should make an executable that you can run.

这将生成一个可执行文件,你可以运行它。

$ ./stub.exe

We recommend you try this out with the sample file

我们建议你使用示例文件来尝试这个功能

section .text
global start_here
start_here:
  mov rax, 483
  ret

To make sure you have installed nasm and ar correctly.

为确保您已正确安装 nasm 和 ar 。

Rust Tips Rust 技巧

To read more about pattern matching on Rust enums see this chapter of the Rust book.

想了解更多关于 Rust 枚举的模式匹配,请查看 Rust 书籍的这一章节。

In checking binding-lists for duplicates you might want to use a data structure to keep track of the names. For this you may want to try out the HashSet module in the standard library or the im library.

在检查绑定列表的重复项时,你可能需要使用数据结构来跟踪名称。为此,你可以尝试标准库中的 HashSet 模块或 im 库。

List of Deliverables 交付物清单

For this assignment you will submit your updated versions of three files:

对于这个作业,你将提交三个文件的更新版本:

  • src/frontend.rs
  • src/middle_end.rs
  • src/backend.rs

These are the only files that you can change in your submission, your code will be linked with reference versions of the other files.

这些是你提交中唯一可以修改的文件,你的代码将与其他文件的参考版本链接在一起。

We’ve included a script mk_submission.sh in the starter code repo that should make zip file with just those three files and your testing files that you should upload to gradescope. The testing files are included for us to get some idea of how students are testing their code, they are not graded.

我们在启动代码库中包含了一个脚本 mk_submission.sh ,它应该会创建一个仅包含这三个文件和你测试文件的 zip 文件,你应该将这个 zip 文件上传到 gradescope。测试文件是为了让我们了解学生是如何测试他们的代码的,它们不会被评分。

Grading Standards 评分标准

For this assignment, you will be autograded on whether your code implements the specification (functional correctness). We encourage you to test but your test coverage is not graded.

对于这个作业,你的代码是否实现了规范(功能正确性)将决定你的自动评分。我们鼓励你进行测试,但你的测试覆盖率不会作为评分标准。

Submission 提交

Wait! Please read the assignment again and verify that you have not forgotten anything!

等等!请再次阅读作业,确保你没有遗漏任何内容!

Please submit your homework on gradescope by the above deadline.

请在上述截止日期前在 gradescope 上提交你的作业。