- 🚀 Introduction
- 🧱 Compiler Architecture
- 🔤 Lexical Analysis
- 🗂️ Syntactic Analysis
- ✅ Semantic Analysis
- ⚙️ Intermediate Code Generation
- 📒 Symbol Table
- 🖥️ Final Code Generation (RISC-V)
- 🚀 How to Run
- 📝 Examples
- ⚠️ Error Handling
- 👤 Credits
This project implements a full compiler for the greek++ programming language, a compact yet expressive language designed to capture the essential features of modern programming.
Although greek++ is intentionally simple (supporting only integer data types), it provides a wide range of key programming constructs, including:
- 🛠️ Functions and Procedures
- 🔁 Loops
- ✅ Conditionals
- 📥 Input/Output Operations
- ♻️ Recursive Calls
- 🔗 Parameter Passing by value & by reference
- 🗂️ Nested Function & Procedure Declarations
The greek++ compiler follows a multi-phase pipeline that systematically transforms high-level source code into an intermediate representation. The main phases, as orchestrated in the `GreekPP` main class, are:
1️⃣ Lexical Analysis:
The source file is loaded and scanned character-by-character. Tokens are generated (identifiers, keywords, numbers, operators, etc.).
2️⃣ Syntactic Analysis & AST Construction:
The token stream is parsed according to the grammar of greek++, and an Abstract Syntax Tree (AST) is constructed, capturing the full hierarchical structure of the program.
3️⃣ Semantic Analysis & Intermediate Code Generation:
The AST is traversed to ensure semantic correctness (scope handling, type checking, symbol resolution). Intermediate code is generated in the form of quadruples (quads), which provide a platform-independent representation of the program's logic.
4️⃣ Output Generation:
- The intermediate code is saved to a `.int` file.
- A symbol table log (with scopes, variables, and subroutines) is saved to a `.sym` file.
5️⃣ Final Code Generation:
The final stage targets the RISC-V architecture, translating the intermediate quadruples into executable RISC-V assembly code, ready for simulation or deployment.
────────────────────────────
```
+-------------------+
|    Source File    |
+-------------------+
         |
         v
+-------------------+
| Lexical Analysis  |
|      (Lexer)      |
+-------------------+
         |
         v
+-------------------+
| Syntactic Analysis|
|      (Parser)     |
|      --> AST      |
+-------------------+
         |
         v
+------------------------------+
|  Semantic Analysis & IR Gen  |
| (IRGenerator + QuadManager)  |
|  --> Quadruples (Quads)      |
+------------------------------+
         |
         v
+----------------------------+
|        Output Files        |
| - Intermediate Code (.int) |
| - Symbol Table (.sym)      |
+----------------------------+
         |
         v
+----------------------------------+
|      Final Code Generation       |
|    --> RISC-V Assembly Code      |
+----------------------------------+
```
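To make the pipeline concrete, here is a minimal, hypothetical sketch of how these phases could be wired together in the `GreekPP` entry point. Only `GreekPP`, `Lexer`, `CharStream`, `Parser`, and `IRGenerator` are component names taken from this project; `ASTNode` and the method names (`parseProgram`, `generate`, `writeQuadsTo`, `writeSymbolTableTo`) are illustrative assumptions, not the project's actual API.

```java
// Hypothetical orchestration sketch of the pipeline above.
// Method names and ASTNode are assumptions for illustration only.
public final class GreekPP {
    public static void main(String[] args) throws Exception {
        String sourcePath = args[0];

        // 1. Lexical analysis: a DFA-driven token stream over the source.
        Lexer lexer = new Lexer(new CharStream(sourcePath));

        // 2. Syntactic analysis: parse the token stream into an AST.
        Parser parser = new Parser(lexer);
        ASTNode program = parser.parseProgram();

        // 3. Semantic analysis + intermediate code generation (quadruples).
        IRGenerator irGenerator = new IRGenerator();
        irGenerator.generate(program);

        // 4. Output: intermediate code (.int) and symbol table log (.sym).
        irGenerator.writeQuadsTo(sourcePath + ".int");
        irGenerator.writeSymbolTableTo(sourcePath + ".sym");
    }
}
```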
The lexical analysis phase is implemented via a deterministic finite automaton (DFA) that scans the source code character-by-character and generates tokens. This process is crucial because it transforms the raw code into a structured stream of tokens, which are then consumed by the parser.
- `CharStream`:
  Responsible for reading characters from the source file. It keeps track of the current position, line, and column, offering methods like:
  - `peekNextChar()`: Peek at the next character.
  - `consumeNextChar()`: Consume the current character and advance.
  - Line & column tracking for precise error reporting.
- `Lexer`:
  The core of lexical analysis. It:
  - Utilizes a DFA (the `DFAState` enum) to transition between states based on the next character's type (`CharacterType`).
  - Detects token boundaries and creates a new `Token` object once a final DFA state is reached.
  - Skips whitespace and comments automatically using `DFAState.triggersDFARestart()`.
  - Provides methods:
    - `getNextToken()`: Retrieves the next token.
    - `printAllTokens()`: Debug method to print all tokens until EOF.
- `Token`:
  Encapsulates all information about a recognized token:
  - `TokenFamily`: Categorizes the token (e.g., `KEYWORD`, `IDENTIFIER`, `NUMBER`, `ADD_OPERATOR`, etc.).
  - Holds metadata like:
    - Start and end indices in the file.
    - Line and column of occurrence.
    - Recognized string value (the actual text).
- `LexerErrors`:
  Provides detailed error reporting when an invalid character sequence is encountered (e.g., illegal symbols, misplaced brackets, invalid transitions).
- `DFAState` & `CharacterType`:
  - `DFAState`: Represents the states of the automaton, including:
    - Start/final/error states.
    - Final states such as `IDENTIFIER_OR_KEYWORD_FINAL`, `DIGIT_FINAL`, etc.
  - `CharacterType`: Categorizes characters (e.g., `LETTER`, `DIGIT`, `ADD`, `SUBTRACT`, `GROUP_SYMBOL`, etc.).
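As an illustration of the `CharStream` component described above, here is a compact sketch of such a reader. The internal fields and the EOF sentinel are assumptions; only `peekNextChar()` and `consumeNextChar()` are documented method names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Condensed sketch of a CharStream-style reader with line/column tracking.
class CharStreamSketch {
    private final String source;
    private int position = 0;
    private int line = 1, column = 1;

    CharStreamSketch(String path) throws IOException {
        this.source = Files.readString(Path.of(path));
    }

    /** Peek at the next character without consuming it ('\uFFFF' = assumed EOF sentinel). */
    char peekNextChar() {
        return position < source.length() ? source.charAt(position) : '\uFFFF';
    }

    /** Consume the current character, advancing position, line, and column. */
    char consumeNextChar() {
        char c = peekNextChar();
        position++;
        if (c == '\n') { line++; column = 1; } else { column++; }
        return c;
    }
}
```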
────────────────────────────
The lexical analysis engine leverages a highly modular design using enums to implement the deterministic finite automaton (DFA) in a concise and performant way.
The `DFAState` enum represents all possible states of the DFA during lexical analysis. It includes:
- Scanning states: e.g., `DIGIT`, `IDENTIFIER_OR_KEYWORD`, `ADD_OPERATOR`.
- Final states: e.g., `DIGIT_FINAL`, `IDENTIFIER_OR_KEYWORD_FINAL`.
- Error states: e.g., `ILLEGAL_CHARACTER_ERROR_STATE`, `LETTER_AFTER_DIGIT_ERROR_STATE`.
Each state:
- Defines whether it is a final state (`isFinal()`), meaning the token is complete.
- Specifies whether it triggers a DFA restart (e.g., whitespace, comments) via `triggersDFARestart()`.
- Identifies whether it is an error state (e.g., illegal characters) via `isErrorState()`.
Transition logic:
- Uses a static `EnumMap<DFAState, Map<CharacterType, DFAState>>` to store the transition table.
- The static block `initializeTransitionTable()` sets up all the transitions between states based on the input `CharacterType`.
- Example:
  - From `START`, on a `DIGIT`, move to `DIGIT`.
  - Stay in `DIGIT` on more digits.
  - If a non-digit is seen, move to `DIGIT_FINAL` (finishing the token).
This is a high-performance way to build a DFA in Java, because:
- `EnumMap` ensures O(1) lookup speed.
- The enums themselves are type-safe and self-documenting.
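As a concrete illustration, here is a condensed, self-contained sketch of this enum-plus-`EnumMap` pattern, reduced to a handful of states and character types for brevity. The project's real `DFAState` and `CharacterType` enums contain many more members, and the exact transitions shown are assumptions.

```java
import java.util.EnumMap;
import java.util.Map;

enum CharacterType { LETTER, DIGIT, WHITESPACE }

enum DFAState {
    START, DIGIT, DIGIT_FINAL,
    LETTER_AFTER_DIGIT_ERROR_STATE, ILLEGAL_CHARACTER_ERROR_STATE;

    boolean isFinal()      { return this == DIGIT_FINAL; }
    boolean isErrorState() { return this == LETTER_AFTER_DIGIT_ERROR_STATE
                                 || this == ILLEGAL_CHARACTER_ERROR_STATE; }

    // EnumMap is array-backed, so every lookup is an O(1) index operation.
    private static final EnumMap<DFAState, Map<CharacterType, DFAState>> TRANSITIONS =
            new EnumMap<>(DFAState.class);

    static {
        EnumMap<CharacterType, DFAState> fromStart = new EnumMap<>(CharacterType.class);
        fromStart.put(CharacterType.DIGIT, DIGIT);
        TRANSITIONS.put(START, fromStart);

        EnumMap<CharacterType, DFAState> fromDigit = new EnumMap<>(CharacterType.class);
        fromDigit.put(CharacterType.DIGIT, DIGIT);            // stay on more digits
        fromDigit.put(CharacterType.WHITESPACE, DIGIT_FINAL); // number token complete
        fromDigit.put(CharacterType.LETTER, LETTER_AFTER_DIGIT_ERROR_STATE);
        TRANSITIONS.put(DIGIT, fromDigit);
    }

    DFAState next(CharacterType input) {
        return TRANSITIONS.getOrDefault(this, Map.of())
                          .getOrDefault(input, ILLEGAL_CHARACTER_ERROR_STATE);
    }
}

class DfaSketch {
    public static void main(String[] args) {
        // Simulate scanning "42 ": DIGIT, DIGIT, WHITESPACE.
        DFAState state = DFAState.START;
        for (CharacterType c : new CharacterType[] {
                CharacterType.DIGIT, CharacterType.DIGIT, CharacterType.WHITESPACE }) {
            state = state.next(c);
        }
        System.out.println(state + ", final = " + state.isFinal()); // DIGIT_FINAL, final = true
    }
}
```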
────────────────────────────
The use of a transition table implemented via `EnumMap<DFAState, Map<CharacterType, DFAState>>` provides multiple layers of performance and design benefits:
1️⃣ O(1) Transition Lookup:
- `EnumMap` in Java is a highly optimized map implementation for enum keys, backed internally by an array rather than a hash table.
- This allows constant-time access to the next DFA state based on the current state and the character type.
- Result: even in large files, every character transition is lightning-fast.
2️⃣ Type Safety & Clarity:
- Using enums (`DFAState` and `CharacterType`) ensures no magic strings or integers are used to represent states or character types.
- This makes the code self-documenting: anyone reading the table instantly sees meaningful names like `DIGIT`, `ADD_OPERATOR`, etc.
3️⃣ Memory Efficiency:
- Because `EnumMap` is based on arrays internally, its memory footprint is minimal compared to a standard `HashMap`.
- The same applies to the inner `Map<CharacterType, DFAState>` when implemented as an `EnumMap`, keeping everything tight.
4️⃣ Clear DFA Visualization:
- The transition table provides a clean, tabular mapping between current states and input character types, which reflects the classic way DFA tables are described in theory.
- This makes it easy to verify correctness and spot errors or missing transitions.
5️⃣ Easy Extensibility:
- Need to add a new operator or token type? Simply extend the enum and add a new entry in the transition table, without touching core logic.
- This follows the Open/Closed Principle in design.
6️⃣ Built-in Error Handling:
- Error states like `LETTER_AFTER_DIGIT_ERROR_STATE` are part of the DFA itself, making invalid sequences detectable within the same fast lookup; no extra validation passes are needed.
🛠️ In summary:
By combining enums with `EnumMap`-based transition tables, the lexer achieves:
- Fast execution (O(1) per char).
- Minimal memory overhead.
- Maximum readability and maintainability.
This is arguably the best way to build a DFA in Java for a lexer, blending theoretical clarity with practical performance.
────────────────────────────
Categorizes every possible character into a type. Examples:
- `LETTER`, `DIGIT`
- `ADD`, `SUBTRACT`
- `PARENTHESIS`, `GROUP_SYMBOL`, `CURLY_BRACKET_OPEN`
- `EOF`, `ILLEGAL_CHARACTER`

The static method `CharacterType.of(char c)` maps each character from the source file to the correct `CharacterType`. This handles:
- Latin & Greek alphabets.
- Symbols, whitespace, comments.
- Special cases: `EOF` (end-of-file), `ILLEGAL_CHARACTER`.
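The sketch below shows what such a classifier might look like. The exact symbol coverage, the grouping symbols, and the EOF sentinel are assumptions, not the project's actual implementation; note that `Character.isLetter` covers both Latin and Greek letters out of the box.

```java
enum CharacterType {
    LETTER, DIGIT, ADD, SUBTRACT, GROUP_SYMBOL, WHITESPACE, EOF, ILLEGAL_CHARACTER;

    // Hypothetical classifier: maps a raw character to its category.
    static CharacterType of(char c) {
        if (c == '\uFFFF') return EOF;                 // assumed end-of-file sentinel
        if (Character.isWhitespace(c)) return WHITESPACE;
        if (Character.isDigit(c)) return DIGIT;
        if (Character.isLetter(c)) return LETTER;      // covers Latin and Greek letters
        switch (c) {
            case '+': return ADD;
            case '-': return SUBTRACT;
            case ',': case ';': return GROUP_SYMBOL;   // assumed grouping symbols
            default:  return ILLEGAL_CHARACTER;
        }
    }
}
```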
────────────────────────────
Defines high-level categories for tokens (i.e., what kind of token each final DFA state produces):
- `KEYWORD`
- `IDENTIFIER`
- `NUMBER`
- `ADD_OPERATOR`
- `REL_OPERATOR`
- `GROUP_SYMBOL`
- `REFERENCE_OPERATOR`
- `EOF`
- ...and more.

The static method `TokenFamily.fromDFAState(...)` maps a final DFA state to its corresponding `TokenFamily`, e.g.:
- `DIGIT_FINAL` → `NUMBER`
- `IDENTIFIER_OR_KEYWORD_FINAL` → `KEYWORD` or `IDENTIFIER` (depending on whether the lexeme is a reserved word)
- `ADD_OPERATOR_FINAL` → `ADD_OPERATOR`
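Here is a sketch of how this mapping might look. The trimmed-down enums, the extra `lexeme` parameter, and the reserved-word set are illustrative assumptions based on the description above.

```java
import java.util.Set;

// Reduced stand-in for the real DFAState enum.
enum DFAState { DIGIT_FINAL, IDENTIFIER_OR_KEYWORD_FINAL, ADD_OPERATOR_FINAL }

enum TokenFamily {
    KEYWORD, IDENTIFIER, NUMBER, ADD_OPERATOR;

    // Assumed reserved-word set for illustration.
    private static final Set<String> KEYWORDS =
            Set.of("πρόγραμμα", "διάβασε", "γράψε", "συνάρτηση");

    static TokenFamily fromDFAState(DFAState state, String lexeme) {
        switch (state) {
            case DIGIT_FINAL:
                return NUMBER;
            case IDENTIFIER_OR_KEYWORD_FINAL:
                // Keywords and identifiers share one final state,
                // so the recognized lexeme decides the family.
                return KEYWORDS.contains(lexeme) ? KEYWORD : IDENTIFIER;
            case ADD_OPERATOR_FINAL:
                return ADD_OPERATOR;
            default:
                throw new IllegalArgumentException("No token family for " + state);
        }
    }
}
```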
────────────────────────────
1️⃣ Initialization:
- The `GreekPP` main class creates a `Lexer` instance, initializing it with a `CharStream` that reads the provided file.

2️⃣ Character Scanning:
- `Lexer.getNextToken()` begins in the `DFAState.START` state.
- Each character is classified via `CharacterType.of(char c)`, and the DFA determines the next state.

3️⃣ State Handling:
- Whitespace and comments are skipped without producing tokens.
- When a final state is reached (e.g., recognizing an identifier or number), the `Lexer` constructs a `Token`.

4️⃣ Error Handling:
- If an illegal transition is made (e.g., a digit followed by a letter), `LexerErrors.IllegalStateTransition(...)` is invoked, and the compilation aborts gracefully with a detailed error message.

5️⃣ Token Emission:
- The produced `Token` is returned to the parser for syntactic analysis.
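For reference, this is roughly how a caller might drive the lexer using the documented `getNextToken()` entry point. The `Token` accessor names (`getFamily`, `getValue`, `getLine`, `getColumn`) and the file name are assumptions for illustration.

```java
// Hypothetical driver loop: pull tokens until EOF, as the parser would.
public class LexerDemo {
    public static void main(String[] args) throws Exception {
        Lexer lexer = new Lexer(new CharStream("example.gpp")); // assumed source file
        Token token = lexer.getNextToken();
        while (token.getFamily() != TokenFamily.EOF) {
            System.out.printf("%-15s '%s' (line %d, col %d)%n",
                    token.getFamily(), token.getValue(),
                    token.getLine(), token.getColumn());
            token = lexer.getNextToken();
        }
    }
}
```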
Additional highlights:
- Greek alphabet support: The lexer recognizes both Latin and Greek letters, enabling a wide range of identifier names.
- Comprehensive keyword set: Tokens include constructs like `πρόγραμμα`, `διάβασε`, `γράψε`, `συνάρτηση`, and more.
- Robust error reporting: Every token includes precise line and column information, and the DFA handles edge cases like EOF and illegal characters gracefully.
────────────────────────────
... to be continued. See the accompanying reports for detailed information.