greek++ Compiler: Full Implementation of a Recursive-Descent, Multi-Phase Compiler Targeting RISC-V


📑 Table of Contents

  1. 🚀 Introduction
  2. 🧱 Compiler Architecture
  3. 🔤 Lexical Analysis
  4. 🗂️ Syntactic Analysis
  5. ✅ Semantic Analysis
  6. ⚙️ Intermediate Code Generation
  7. 📒 Symbol Table
  8. 🖥️ Final Code Generation (RISC-V)
  9. 🚀 How to Run
  10. 📝 Examples
  11. ⚠️ Error Handling
  12. 👤 Credits





🚀 Introduction

This project implements a full compiler for the greek++ programming language, a compact yet expressive language designed to capture the essential features of modern programming.

Although greek++ is intentionally simple (supporting only integer data types), it provides a wide range of key programming constructs, including:

  • 🛠️ Functions and Procedures
  • 🔁 Loops
  • 🔀 Conditionals
  • 📥 Input/Output Operations
  • ♻️ Recursive Calls
  • 🔗 Parameter Passing by Value & by Reference
  • 🗂️ Nested Function & Procedure Declarations





🧱 Compiler Architecture

The greek++ compiler follows a multi-phase pipeline that systematically transforms high-level source code into RISC-V assembly. The main phases, as orchestrated in the GreekPP main class, are:

1️⃣ Lexical Analysis:
The source file is loaded and scanned character-by-character. Tokens are generated (identifiers, keywords, numbers, operators, etc.).

2️⃣ Syntactic Analysis & AST Construction:
The token stream is parsed according to the grammar of greek++, and an Abstract Syntax Tree (AST) is constructed, capturing the full hierarchical structure of the program.

3️⃣ Semantic Analysis & Intermediate Code Generation:
The AST is traversed to ensure semantic correctness (scope handling, type checking, symbol resolution). Intermediate code is generated in the form of quadruples (quads), which provide a platform-independent representation of the program's logic.

4️⃣ Output Generation:

  • The intermediate code is saved to a .int file.
  • A symbol table log (with scopes, variables, and subroutines) is saved to a .sym file.

5️⃣ Final Code Generation:
The intermediate quadruples are translated into executable RISC-V assembly code, ready for simulation (e.g., in Ripes) or deployment.
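
To make the quadruple IR from phase 3 concrete, here is a minimal, self-contained Java sketch of what a quad might look like. The field names and the textual format are illustrative assumptions, not the project's actual Quad/QuadManager classes:

```java
// Hypothetical quadruple shape: label, operator, two operands, result.
record Quad(int label, String op, String arg1, String arg2, String result) {
    @Override public String toString() {
        return label + ": " + op + ", " + arg1 + ", " + arg2 + ", " + result;
    }
}

public class QuadDemo {
    public static void main(String[] args) {
        // Quads one might emit for an assignment like: c := a + b * 2
        System.out.println(new Quad(1, "*", "b", "2", "T_1"));
        System.out.println(new Quad(2, "+", "a", "T_1", "T_2"));
        System.out.println(new Quad(3, ":=", "T_2", "_", "c"));
    }
}
```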



────────────────────────────

🔗 Compilation Flow Diagram

           +-------------------+
           |   Source File     |
           +-------------------+
                    |
                    v
           +-------------------+
           |  Lexical Analysis |
           |      (Lexer)      |
           +-------------------+
                    |
                    v
           +-------------------+
           | Syntactic Analysis|
           |     (Parser)      |
           |    --> AST        |
           +-------------------+
                    |
                    v
           +------------------------------+
           | Semantic Analysis & IR Gen   |
           | (IRGenerator + QuadManager)  |
           | --> Quadruples (Quads)       |
           +------------------------------+
                    |
                    v
           +----------------------------+
           |       Output Files         |
           | - Intermediate Code (.int) |
           | - Symbol Table (.sym)      |
           +----------------------------+
                    |
                    v
           +----------------------------------+
           |      Final Code Generation       |
           |     --> RISC-V Assembly Code     |
           +----------------------------------+





🔤 Lexical Analysis

The lexical analysis phase is implemented via a deterministic finite automaton (DFA) that scans the source code character-by-character and generates tokens. This process is crucial because it transforms the raw code into a structured stream of tokens, which are then consumed by the parser.

🧩 Key Classes & Their Roles

  • CharStream:
    Responsible for reading characters from the source file. It keeps track of the current position, line, and column (a sketch of this class appears after this list), offering methods like:

    • peekNextChar(): Peek at the next character.
    • consumeNextChar(): Consume the current character and advance.
    • Line & column tracking for precise error reporting.
  • Lexer:
    The core of lexical analysis. It:

    • Utilizes a DFA (DFAState enum) to transition between states based on the next character's type (CharacterType).
    • Detects token boundaries and creates a new Token object once a final DFA state is reached.
    • Skips whitespace and comments automatically using DFAState.triggersDFARestart().
    • Provides methods:
      • getNextToken(): Retrieves the next token.
      • printAllTokens(): Debug method to print all tokens until EOF.
  • Token:
    Encapsulates all information about a recognized token:

    • TokenFamily: Categorizes the token (e.g., KEYWORD, IDENTIFIER, NUMBER, ADD_OPERATOR, etc.).
    • Holds metadata like:
      • Start and end indices in the file.
      • Line and column of occurrence.
      • Recognized string value (actual text).
  • LexerErrors:
    Provides detailed error reporting when an invalid character sequence is encountered (e.g., illegal symbols, misplaced brackets, invalid transitions).

  • DFAState & CharacterType:

    • DFAState: Represents the states of the automaton, including:
      • Start/final/error states.
      • Transitions such as IDENTIFIER_OR_KEYWORD_FINAL, DIGIT_FINAL, etc.
    • CharacterType: Categorizes characters (e.g., LETTER, DIGIT, ADD, SUBTRACT, GROUP_SYMBOL, etc.).
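
As a concrete illustration, here is a minimal sketch of a CharStream-style reader. It assumes only the peekNextChar()/consumeNextChar() interface described above; the real class likely differs in detail:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CharStream {
    private final String source;
    private int position = 0;
    private int line = 1;
    private int column = 1;

    public CharStream(String path) throws IOException {
        this.source = Files.readString(Path.of(path));
    }

    /** Peek at the next character without consuming it; '\0' signals EOF. */
    public char peekNextChar() {
        return position < source.length() ? source.charAt(position) : '\0';
    }

    /** Consume the current character, advancing position, line, and column. */
    public char consumeNextChar() {
        char c = peekNextChar();
        position++;
        if (c == '\n') { line++; column = 1; } else { column++; }
        return c;
    }

    public int line()   { return line; }
    public int column() { return column; }
}
```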



────────────────────────────

🧮 Enums & DFA Logic

The lexical analysis engine leverages a highly modular design using enums to implement the finite automaton (DFA) in a concise and performant way.

🗂️ DFAState

This enum represents all possible states of the DFA during lexical analysis. It includes:

  • Scanning states: e.g., DIGIT, IDENTIFIER_OR_KEYWORD, ADD_OPERATOR.
  • Final states: e.g., DIGIT_FINAL, IDENTIFIER_OR_KEYWORD_FINAL.
  • Error states: e.g., ILLEGAL_CHARACTER_ERROR_STATE, LETTER_AFTER_DIGIT_ERROR_STATE.

Each state:

  • Defines if it’s a final state (isFinal()), meaning the token is complete.
  • Specifies if it triggers a DFA restart (e.g., whitespace, comments) via triggersDFARestart().
  • Can identify if it’s an error state (e.g., illegal characters) via isErrorState().

Transition logic:

  • Uses a static EnumMap<DFAState, Map<CharacterType, DFAState>> to store the transition table.
  • The static initialization routine initializeTransitionTable() sets up all the transitions between states based on the input CharacterType.
  • Example:
    • From START, if you get DIGIT, you move to DIGIT.
    • Stay in DIGIT on more digits.
    • If a non-digit is seen, move to DIGIT_FINAL (finishing the token).
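
In code, the example above might look like the following self-contained mini-DFA. Only the number-scanning states are shown, and the exact transition entries are assumptions based on this README:

```java
import java.util.EnumMap;
import java.util.Map;

enum CharacterType { LETTER, DIGIT, WHITESPACE, OTHER }

enum DFAState {
    START, DIGIT, DIGIT_FINAL, LETTER_AFTER_DIGIT_ERROR_STATE;

    private static final Map<DFAState, Map<CharacterType, DFAState>> TRANSITIONS =
            new EnumMap<>(DFAState.class);

    static {
        // From START, a digit begins a number token.
        TRANSITIONS.put(START, new EnumMap<>(Map.of(CharacterType.DIGIT, DIGIT)));
        // Stay in DIGIT on more digits; a letter is an error; anything else ends the token.
        TRANSITIONS.put(DIGIT, new EnumMap<>(Map.of(
                CharacterType.DIGIT, DIGIT,
                CharacterType.LETTER, LETTER_AFTER_DIGIT_ERROR_STATE,
                CharacterType.WHITESPACE, DIGIT_FINAL,
                CharacterType.OTHER, DIGIT_FINAL)));
    }

    /** O(1) lookup of the next state; null means no transition is defined. */
    public DFAState next(CharacterType input) {
        return TRANSITIONS.getOrDefault(this, Map.of()).get(input);
    }

    public boolean isFinal()      { return this == DIGIT_FINAL; }
    public boolean isErrorState() { return this == LETTER_AFTER_DIGIT_ERROR_STATE; }
}
```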

This is a high-performance way to build a DFA in Java, because:

  • EnumMap ensures O(1) lookup speed.
  • The enums themselves are type-safe and self-documenting.



────────────────────────────

⚡ Why This Approach Is Highly Efficient

The use of a transition table implemented via EnumMap<DFAState, Map<CharacterType, DFAState>> provides multiple layers of performance and design benefits:

1️⃣ O(1) Transition Lookup:

  • EnumMap in Java is a highly optimized implementation for enums, backed internally by an array rather than a hash table.
  • This allows constant-time access to the next DFA state based on the current state and the character type.
  • Result: even in very large files, each character transition is a single constant-time array lookup.

2️⃣ Type Safety & Clarity:

  • Using enums (DFAState and CharacterType) ensures no magic strings or integers are used for state or character type representation.
  • This makes the code self-documenting: anyone reading the table instantly sees meaningful names like DIGIT, ADD_OPERATOR, etc.

3️⃣ Memory Efficiency:

  • Because EnumMap is based on arrays internally, its memory footprint is minimal compared to standard HashMap.
  • The same applies to the inner Map<CharacterType, DFAState> when implemented as an EnumMap, keeping everything tight.

4️⃣ Clear DFA Visualization:

  • The transition table provides a clean, tabular mapping between current states and input character types, which reflects the classic way DFA tables are described in theory.
  • This makes it easy to verify correctness and spot errors or missing transitions.

5️⃣ Easy Extensibility:

  • Need to add a new operator or token type?
    → Simply extend the enum and add a new entry in the transition table without touching core logic.
  • This follows the Open/Closed Principle in design.

6️⃣ Built-in Error Handling:

  • Error states like LETTER_AFTER_DIGIT_ERROR_STATE are part of the DFA, making invalid sequences detectable within the same fast lookup—no need for extra validation passes.

🛠️ In summary:
By combining enums with EnumMap-based transition tables, the lexer achieves:

  • Fast execution (O(1) per char).
  • Minimal memory overhead.
  • Maximum readability and maintainability.

This is arguably the best way to build a lexer DFA in Java, blending theoretical clarity with practical performance.



────────────────────────────

🔤 CharacterType

Categorizes every possible character into a type:

  • Examples:
    • LETTER, DIGIT, ADD, SUBTRACT
    • PARENTHESIS, GROUP_SYMBOL, CURLY_BRACKET_OPEN
    • EOF, ILLEGAL_CHARACTER

The static method CharacterType.of(char c) maps each character from the source file to the correct CharacterType.
This handles:

  • Latin & Greek alphabets.
  • Symbols, whitespace, comments.
  • Special cases: EOF (end-of-file), ILLEGAL_CHARACTER.
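
A sketch of what such a classifier might look like; the exact symbol sets and the '\0' EOF sentinel are assumptions:

```java
enum CharacterType {
    LETTER, DIGIT, ADD, SUBTRACT, GROUP_SYMBOL, WHITESPACE, EOF, ILLEGAL_CHARACTER;

    public static CharacterType of(char c) {
        if (c == '\0') return EOF;                 // assumed end-of-file sentinel
        if (Character.isDigit(c)) return DIGIT;
        if (Character.isLetter(c)) return LETTER;  // covers both Latin and Greek letters
        if (Character.isWhitespace(c)) return WHITESPACE;
        switch (c) {
            case '+': return ADD;
            case '-': return SUBTRACT;
            case '(': case ')': case '{': case '}': case '[': case ']':
                return GROUP_SYMBOL;
            default:  return ILLEGAL_CHARACTER;
        }
    }
}
```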



────────────────────────────

🏷️ TokenFamily

Defines high-level categories for tokens (what kind of token each DFA final state produces):

  • KEYWORD
  • IDENTIFIER
  • NUMBER
  • ADD_OPERATOR
  • REL_OPERATOR
  • GROUP_SYMBOL
  • REFERENCE_OPERATOR
  • EOF
  • And more.

The static method TokenFamily.fromDFAState(...) maps a final DFA state to its corresponding TokenFamily, e.g.:

  • DIGIT_FINAL → NUMBER
  • IDENTIFIER_OR_KEYWORD_FINAL → KEYWORD or IDENTIFIER (after checking whether the lexeme is a reserved word)
  • ADD_OPERATOR_FINAL → ADD_OPERATOR
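
A possible shape for this mapping, with a stubbed DFAState and an assumed reserved-word set drawn from the keywords mentioned in this README:

```java
import java.util.Set;

// Stub with only the final states used below; the real enum is far larger.
enum DFAState { DIGIT_FINAL, IDENTIFIER_OR_KEYWORD_FINAL, ADD_OPERATOR_FINAL }

enum TokenFamily {
    KEYWORD, IDENTIFIER, NUMBER, ADD_OPERATOR;

    public static TokenFamily fromDFAState(DFAState state, String lexeme) {
        return switch (state) {
            case DIGIT_FINAL                 -> NUMBER;
            case IDENTIFIER_OR_KEYWORD_FINAL ->
                    RESERVED_WORDS.contains(lexeme) ? KEYWORD : IDENTIFIER;
            case ADD_OPERATOR_FINAL          -> ADD_OPERATOR;
        };
    }

    // A few of the keywords named in this README; the full set is larger.
    private static final Set<String> RESERVED_WORDS =
            Set.of("πρόγραμμα", "διάβασε", "γράψε", "συνάρτηση");
}
```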



────────────────────────────

🔄 Lexical Flow

1️⃣ Initialization:

  • The GreekPP main class creates a Lexer instance, initializing it with a CharStream that reads the provided file.

2️⃣ Character Scanning:

  • Lexer.getNextToken() begins in the DFAState.START state.
  • Each character is classified via CharacterType.of(char c) and the DFA determines the next state.

3️⃣ State Handling:

  • Whitespace and comments are skipped without producing tokens.
  • When a final state is reached (e.g., recognizing an identifier or number), the Lexer constructs a Token.

4️⃣ Error Handling:

  • If an illegal transition is made (e.g., a digit followed by a letter), LexerErrors.IllegalStateTransition(...) is invoked, and the compilation aborts gracefully with a detailed error message.

5️⃣ Token Emission:

  • The produced Token is returned to the parser for syntactic analysis.
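
Putting steps 2 through 5 together, the scanning loop might look roughly like this. The sketch builds on the classes outlined earlier; the Token constructor, the chars field, and the LexerErrors call are simplified assumptions, not the project's verbatim code:

```java
// Condensed sketch of Lexer.getNextToken().
public Token getNextToken() {
    DFAState state = DFAState.START;
    StringBuilder lexeme = new StringBuilder();

    while (!state.isFinal()) {
        CharacterType type = CharacterType.of(chars.peekNextChar());
        DFAState next = state.next(type);

        if (next.isErrorState()) {
            // Reports precise position details and aborts compilation gracefully.
            LexerErrors.IllegalStateTransition(state, type, chars.line(), chars.column());
        }
        if (next.triggersDFARestart()) {
            chars.consumeNextChar();  // whitespace/comment: discard it, restart the DFA
            lexeme.setLength(0);
            state = DFAState.START;
        } else if (next.isFinal()) {
            state = next;             // boundary char is NOT consumed; it starts the next token
        } else {
            lexeme.append(chars.consumeNextChar());
            state = next;
        }
    }
    return new Token(TokenFamily.fromDFAState(state, lexeme.toString()),
                     lexeme.toString(), chars.line(), chars.column());
}
```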

✔️ Highlights

  • Greek alphabet support: The lexer recognizes both Latin and Greek letters, enabling a wide range of identifier names.
  • Comprehensive keyword set: Tokens include constructs like πρόγραμμα, διάβασε, γράψε, συνάρτηση, and more.
  • Robust error reporting: Every token includes precise line and column information, and the DFA handles edge cases like EOF and illegal characters gracefully.

────────────────────────────

To be continued. See the project reports for detailed information on the remaining phases.

About

Fully manual compiler for greek++: a custom language with Greek keywords and imperative syntax. Implements all stages — custom lexer, recursive descent parser, semantic analysis, quad-based IR, and RISC-V code generation — entirely from scratch in Java, with no parser generators. Outputs RISC-V assembly compatible with tools like Ripes.
