- 🚀 Introduction
- 🧱 Compiler Architecture
- 🔤 Lexical Analysis
- 🗂️ Syntactic Analysis
- ✅ Semantic Analysis
- ⚙️ Intermediate Code Generation
- 📒 Symbol Table
- 🖥️ Final Code Generation (RISC-V)
- 🚀 How to Run
- 📝 Examples
- ⚠️ Error Handling
- 👤 Credits
This project implements a full compiler for the greek++ programming language, a compact yet expressive language designed to capture the essential features of modern programming.
Although greek++ is intentionally simple (supporting only integer data types), it provides a wide range of key programming constructs, including:
- 🛠️ Functions and Procedures
- 🔁 Loops
- ✅ Conditionals
- 📥 Input/Output Operations
- ♻️ Recursive Calls
- 🔗 Parameter Passing by value & by reference
- 🗂️ Nested Function & Procedure Declarations
The greek++ compiler follows a multi-phase pipeline that systematically transforms high-level source code into an intermediate representation. The main phases, as orchestrated in the `GreekPP` main class, are:
1️⃣ Lexical Analysis:
The source file is loaded and scanned character-by-character. Tokens are generated (identifiers, keywords, numbers, operators, etc.).
2️⃣ Syntactic Analysis & AST Construction:
The token stream is parsed according to the grammar of greek++, and an Abstract Syntax Tree (AST) is constructed, capturing the full hierarchical structure of the program.
3️⃣ Semantic Analysis & Intermediate Code Generation:
The AST is traversed to ensure semantic correctness (scope handling, type checking, symbol resolution). Intermediate code is generated in the form of quadruples (quads), which provide a platform-independent representation of the program's logic.
4️⃣ Output Generation:
- The intermediate code is saved to a `.int` file.
- A symbol table log (with scopes, variables, and subroutines) is saved to a `.sym` file.
5️⃣ Final Code Generation:
The final stage targets the RISC-V architecture, translating the intermediate quadruples into executable RISC-V assembly code, ready for simulation or deployment.
────────────────────────────
```
+-------------------+
|    Source File    |
+-------------------+
         |
         v
+-------------------+
| Lexical Analysis  |
|      (Lexer)      |
+-------------------+
         |
         v
+-------------------+
| Syntactic Analysis|
|      (Parser)     |
|      --> AST      |
+-------------------+
         |
         v
+------------------------------+
|  Semantic Analysis & IR Gen  |
| (IRGenerator + QuadManager)  |
|  --> Quadruples (Quads)      |
+------------------------------+
         |
         v
+----------------------------+
|        Output Files        |
| - Intermediate Code (.int) |
| - Symbol Table (.sym)      |
+----------------------------+
         |
         v
+----------------------------------+
|      Final Code Generation       |
|    --> RISC-V Assembly Code      |
+----------------------------------+
```
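To make the pipeline concrete, here is a minimal, hypothetical sketch of how these phases could be wired together in the `GreekPP` entry point. Only `GreekPP`, `Lexer`, `CharStream`, `Parser`, and `IRGenerator` are component names taken from this project; `ASTNode` and the method names (`parseProgram`, `generate`, `writeQuadsTo`, `writeSymbolTableTo`) are illustrative assumptions, not the project's actual API.

```java
// Hypothetical orchestration sketch of the pipeline above.
// Method names and ASTNode are assumptions for illustration only.
public final class GreekPP {
    public static void main(String[] args) throws Exception {
        String sourcePath = args[0];

        // 1. Lexical analysis: a DFA-driven token stream over the source.
        Lexer lexer = new Lexer(new CharStream(sourcePath));

        // 2. Syntactic analysis: parse the token stream into an AST.
        Parser parser = new Parser(lexer);
        ASTNode program = parser.parseProgram();

        // 3. Semantic analysis + intermediate code generation (quadruples).
        IRGenerator irGenerator = new IRGenerator();
        irGenerator.generate(program);

        // 4. Output: intermediate code (.int) and symbol table log (.sym).
        irGenerator.writeQuadsTo(sourcePath + ".int");
        irGenerator.writeSymbolTableTo(sourcePath + ".sym");
    }
}
```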
The lexical analysis phase is implemented via a deterministic finite automaton (DFA) that scans the source code character-by-character and generates tokens. This process is crucial because it transforms the raw code into a structured stream of tokens, which are then consumed by the parser.
- `CharStream`:
  Responsible for reading characters from the source file. It keeps track of the current position, line, and column, offering methods like:
  - `peekNextChar()`: Peek at the next character.
  - `consumeNextChar()`: Consume the current character and advance.
  - Line & column tracking for precise error reporting.
- `Lexer`:
  The core of lexical analysis. It:
  - Utilizes a DFA (the `DFAState` enum) to transition between states based on the next character's type (`CharacterType`).
  - Detects token boundaries and creates a new `Token` object once a final DFA state is reached.
  - Skips whitespace and comments automatically using `DFAState.triggersDFARestart()`.
  - Provides methods:
    - `getNextToken()`: Retrieves the next token.
    - `printAllTokens()`: Debug method to print all tokens until EOF.
- `Token`:
  Encapsulates all information about a recognized token:
  - `TokenFamily`: Categorizes the token (e.g., `KEYWORD`, `IDENTIFIER`, `NUMBER`, `ADD_OPERATOR`, etc.).
  - Holds metadata like:
    - Start and end indices in the file.
    - Line and column of occurrence.
    - Recognized string value (the actual text).
- `LexerErrors`:
  Provides detailed error reporting when an invalid character sequence is encountered (e.g., illegal symbols, misplaced brackets, invalid transitions).
- `DFAState` & `CharacterType`:
  - `DFAState`: Represents the states of the automaton, including:
    - Start/final/error states.
    - Final states such as `IDENTIFIER_OR_KEYWORD_FINAL`, `DIGIT_FINAL`, etc.
  - `CharacterType`: Categorizes characters (e.g., `LETTER`, `DIGIT`, `ADD`, `SUBTRACT`, `GROUP_SYMBOL`, etc.).
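As an illustration of the `CharStream` component described above, here is a compact sketch of such a reader. The internal fields and the EOF sentinel are assumptions; only `peekNextChar()` and `consumeNextChar()` are documented method names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Condensed sketch of a CharStream-style reader with line/column tracking.
class CharStreamSketch {
    private final String source;
    private int position = 0;
    private int line = 1, column = 1;

    CharStreamSketch(String path) throws IOException {
        this.source = Files.readString(Path.of(path));
    }

    /** Peek at the next character without consuming it ('\uFFFF' = assumed EOF sentinel). */
    char peekNextChar() {
        return position < source.length() ? source.charAt(position) : '\uFFFF';
    }

    /** Consume the current character, advancing position, line, and column. */
    char consumeNextChar() {
        char c = peekNextChar();
        position++;
        if (c == '\n') { line++; column = 1; } else { column++; }
        return c;
    }
}
```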
────────────────────────────
The lexical analysis engine leverages a highly modular design using enums to implement the deterministic finite automaton (DFA) in a concise and performant way.
The `DFAState` enum represents all possible states of the DFA during lexical analysis. It includes:
- Scanning states: e.g., `DIGIT`, `IDENTIFIER_OR_KEYWORD`, `ADD_OPERATOR`.
- Final states: e.g., `DIGIT_FINAL`, `IDENTIFIER_OR_KEYWORD_FINAL`.
- Error states: e.g., `ILLEGAL_CHARACTER_ERROR_STATE`, `LETTER_AFTER_DIGIT_ERROR_STATE`.
Each state:
- Defines whether it is a final state (`isFinal()`), meaning the token is complete.
- Specifies whether it triggers a DFA restart (e.g., whitespace, comments) via `triggersDFARestart()`.
- Identifies whether it is an error state (e.g., illegal characters) via `isErrorState()`.
Transition logic:
- Uses a static `EnumMap<DFAState, Map<CharacterType, DFAState>>` to store the transition table.
- The static block `initializeTransitionTable()` sets up all the transitions between states based on the input `CharacterType`.
- Example:
  - From `START`, on a `DIGIT`, move to `DIGIT`.
  - Stay in `DIGIT` on more digits.
  - If a non-digit is seen, move to `DIGIT_FINAL` (finishing the token).
This is a high-performance way to build a DFA in Java, because:
- `EnumMap` ensures O(1) lookup speed.
- The enums themselves are type-safe and self-documenting.
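As a concrete illustration, here is a condensed, self-contained sketch of this enum-plus-`EnumMap` pattern, reduced to a handful of states and character types for brevity. The project's real `DFAState` and `CharacterType` enums contain many more members, and the exact transitions shown are assumptions.

```java
import java.util.EnumMap;
import java.util.Map;

enum CharacterType { LETTER, DIGIT, WHITESPACE }

enum DFAState {
    START, DIGIT, DIGIT_FINAL,
    LETTER_AFTER_DIGIT_ERROR_STATE, ILLEGAL_CHARACTER_ERROR_STATE;

    boolean isFinal()      { return this == DIGIT_FINAL; }
    boolean isErrorState() { return this == LETTER_AFTER_DIGIT_ERROR_STATE
                                 || this == ILLEGAL_CHARACTER_ERROR_STATE; }

    // EnumMap is array-backed, so every lookup is an O(1) index operation.
    private static final EnumMap<DFAState, Map<CharacterType, DFAState>> TRANSITIONS =
            new EnumMap<>(DFAState.class);

    static {
        EnumMap<CharacterType, DFAState> fromStart = new EnumMap<>(CharacterType.class);
        fromStart.put(CharacterType.DIGIT, DIGIT);
        TRANSITIONS.put(START, fromStart);

        EnumMap<CharacterType, DFAState> fromDigit = new EnumMap<>(CharacterType.class);
        fromDigit.put(CharacterType.DIGIT, DIGIT);            // stay on more digits
        fromDigit.put(CharacterType.WHITESPACE, DIGIT_FINAL); // number token complete
        fromDigit.put(CharacterType.LETTER, LETTER_AFTER_DIGIT_ERROR_STATE);
        TRANSITIONS.put(DIGIT, fromDigit);
    }

    DFAState next(CharacterType input) {
        return TRANSITIONS.getOrDefault(this, Map.of())
                          .getOrDefault(input, ILLEGAL_CHARACTER_ERROR_STATE);
    }
}

class DfaSketch {
    public static void main(String[] args) {
        // Simulate scanning "42 ": DIGIT, DIGIT, WHITESPACE.
        DFAState state = DFAState.START;
        for (CharacterType c : new CharacterType[] {
                CharacterType.DIGIT, CharacterType.DIGIT, CharacterType.WHITESPACE }) {
            state = state.next(c);
        }
        System.out.println(state + ", final = " + state.isFinal()); // DIGIT_FINAL, final = true
    }
}
```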
────────────────────────────
The use of a transition table implemented via `EnumMap<DFAState, Map<CharacterType, DFAState>>` provides multiple layers of performance and design benefits:
1️⃣ O(1) Transition Lookup:
- `EnumMap` in Java is a highly optimized map implementation for enum keys, backed internally by an array rather than a hash table.
- This allows constant-time access to the next DFA state based on the current state and the character type.
- Result: even in large files, every character transition is lightning-fast.
2️⃣ Type Safety & Clarity:
- Using enums (`DFAState` and `CharacterType`) ensures no magic strings or integers are used to represent states or character types.
- This makes the code self-documenting: anyone reading the table instantly sees meaningful names like `DIGIT`, `ADD_OPERATOR`, etc.
3️⃣ Memory Efficiency:
- Because `EnumMap` is based on arrays internally, its memory footprint is minimal compared to a standard `HashMap`.
- The same applies to the inner `Map<CharacterType, DFAState>` when implemented as an `EnumMap`, keeping everything tight.
4️⃣ Clear DFA Visualization:
- The transition table provides a clean, tabular mapping between current states and input character types, which reflects the classic way DFA tables are described in theory.
- This makes it easy to verify correctness and spot errors or missing transitions.
5️⃣ Easy Extensibility:
- Need to add a new operator or token type? Simply extend the enum and add a new entry in the transition table, without touching core logic.
- This follows the Open/Closed Principle in design.
6️⃣ Built-in Error Handling:
- Error states like `LETTER_AFTER_DIGIT_ERROR_STATE` are part of the DFA itself, making invalid sequences detectable within the same fast lookup; no extra validation passes are needed.
🛠️ In summary:
By combining enums with `EnumMap`-based transition tables, the lexer achieves:
- Fast execution (O(1) per char).
- Minimal memory overhead.
- Maximum readability and maintainability.
This is arguably the best way to build a DFA in Java for a lexer, blending theoretical clarity with practical performance.
────────────────────────────
Categorizes every possible character into a type. Examples:
- `LETTER`, `DIGIT`
- `ADD`, `SUBTRACT`
- `PARENTHESIS`, `GROUP_SYMBOL`, `CURLY_BRACKET_OPEN`
- `EOF`, `ILLEGAL_CHARACTER`

The static method `CharacterType.of(char c)` maps each character from the source file to the correct `CharacterType`. This handles:
- Latin & Greek alphabets.
- Symbols, whitespace, comments.
- Special cases: `EOF` (end-of-file), `ILLEGAL_CHARACTER`.
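The sketch below shows what such a classifier might look like. The exact symbol coverage, the grouping symbols, and the EOF sentinel are assumptions, not the project's actual implementation; note that `Character.isLetter` covers both Latin and Greek letters out of the box.

```java
enum CharacterType {
    LETTER, DIGIT, ADD, SUBTRACT, GROUP_SYMBOL, WHITESPACE, EOF, ILLEGAL_CHARACTER;

    // Hypothetical classifier: maps a raw character to its category.
    static CharacterType of(char c) {
        if (c == '\uFFFF') return EOF;                 // assumed end-of-file sentinel
        if (Character.isWhitespace(c)) return WHITESPACE;
        if (Character.isDigit(c)) return DIGIT;
        if (Character.isLetter(c)) return LETTER;      // covers Latin and Greek letters
        switch (c) {
            case '+': return ADD;
            case '-': return SUBTRACT;
            case ',': case ';': return GROUP_SYMBOL;   // assumed grouping symbols
            default:  return ILLEGAL_CHARACTER;
        }
    }
}
```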
────────────────────────────
Defines high-level categories for tokens (i.e., what kind of token each final DFA state produces):
- `KEYWORD`
- `IDENTIFIER`
- `NUMBER`
- `ADD_OPERATOR`
- `REL_OPERATOR`
- `GROUP_SYMBOL`
- `REFERENCE_OPERATOR`
- `EOF`
- ...and more.

The static method `TokenFamily.fromDFAState(...)` maps a final DFA state to its corresponding `TokenFamily`, e.g.:
- `DIGIT_FINAL` → `NUMBER`
- `IDENTIFIER_OR_KEYWORD_FINAL` → `KEYWORD` or `IDENTIFIER` (depending on whether the lexeme is a reserved word)
- `ADD_OPERATOR_FINAL` → `ADD_OPERATOR`
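Here is a sketch of how this mapping might look. The trimmed-down enums, the extra `lexeme` parameter, and the reserved-word set are illustrative assumptions based on the description above.

```java
import java.util.Set;

// Reduced stand-in for the real DFAState enum.
enum DFAState { DIGIT_FINAL, IDENTIFIER_OR_KEYWORD_FINAL, ADD_OPERATOR_FINAL }

enum TokenFamily {
    KEYWORD, IDENTIFIER, NUMBER, ADD_OPERATOR;

    // Assumed reserved-word set for illustration.
    private static final Set<String> KEYWORDS =
            Set.of("πρόγραμμα", "διάβασε", "γράψε", "συνάρτηση");

    static TokenFamily fromDFAState(DFAState state, String lexeme) {
        switch (state) {
            case DIGIT_FINAL:
                return NUMBER;
            case IDENTIFIER_OR_KEYWORD_FINAL:
                // Keywords and identifiers share one final state,
                // so the recognized lexeme decides the family.
                return KEYWORDS.contains(lexeme) ? KEYWORD : IDENTIFIER;
            case ADD_OPERATOR_FINAL:
                return ADD_OPERATOR;
            default:
                throw new IllegalArgumentException("No token family for " + state);
        }
    }
}
```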
────────────────────────────
1️⃣ Initialization:
- The `GreekPP` main class creates a `Lexer` instance, initializing it with a `CharStream` that reads the provided file.

2️⃣ Character Scanning:
- `Lexer.getNextToken()` begins in the `DFAState.START` state.
- Each character is classified via `CharacterType.of(char c)`, and the DFA determines the next state.

3️⃣ State Handling:
- Whitespace and comments are skipped without producing tokens.
- When a final state is reached (e.g., recognizing an identifier or number), the `Lexer` constructs a `Token`.

4️⃣ Error Handling:
- If an illegal transition is made (e.g., a digit followed by a letter), `LexerErrors.IllegalStateTransition(...)` is invoked, and the compilation aborts gracefully with a detailed error message.

5️⃣ Token Emission:
- The produced `Token` is returned to the parser for syntactic analysis.
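For reference, this is roughly how a caller might drive the lexer using the documented `getNextToken()` entry point. The `Token` accessor names (`getFamily`, `getValue`, `getLine`, `getColumn`) and the file name are assumptions for illustration.

```java
// Hypothetical driver loop: pull tokens until EOF, as the parser would.
public class LexerDemo {
    public static void main(String[] args) throws Exception {
        Lexer lexer = new Lexer(new CharStream("example.gpp")); // assumed source file
        Token token = lexer.getNextToken();
        while (token.getFamily() != TokenFamily.EOF) {
            System.out.printf("%-15s '%s' (line %d, col %d)%n",
                    token.getFamily(), token.getValue(),
                    token.getLine(), token.getColumn());
            token = lexer.getNextToken();
        }
    }
}
```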
Additional highlights:
- Greek alphabet support: The lexer recognizes both Latin and Greek letters, enabling a wide range of identifier names.
- Comprehensive keyword set: Tokens include constructs like `πρόγραμμα`, `διάβασε`, `γράψε`, `συνάρτηση`, and more.
- Robust error reporting: Every token includes precise line and column information, and the DFA handles edge cases like EOF and illegal characters gracefully.
────────────────────────────
... to be continued. See the accompanying reports for detailed information.