Skip to content

textcompare/detect-code-language

Repository files navigation

Programming Language Detector

This code provides a basic framework for detecting the programming language of a given code snippet. It has implemented in https://textcompare.io to compare code and text.

Supported Languages:

  • JavaScript
  • C
  • C++
  • Python
  • Java
  • HTML
  • CSS
  • Ruby
  • Go
  • PHP
  • C#
  • R
  • Objective-C
  • TypeScript
  • Swift
  • Perl
  • Haskell
  • Kotlin
  • Rust
  • Dart
  • Julia
  • Scala
  • Groovy
  • Bash
  • Lua
  • FSharp
  • MATLAB
  • Lisp
  • Prolog
  • COBOL
  • Fortran
  • Ada
  • Elm
  • Crystal
  • Elixir
  • Clojure
  • CoffeeScript
  • ObjectiveCPP
  • Racket
  • Erlang
  • Apex
  • VHDL
  • Verilog
  • Scheme
  • Tcl
  • COOL
  • Nim
  • K
  • Racket
  • Alice
  • APL
  • Bash
  • Kotlin
  • XML
  • YAML
  • JSON
  • Markdown
  • SGML
  • HTML5
  • TEI
  • LaTeX
  • AsciiDoc

How it Works:

  1. Language Definitions:

    • LANGUAGES is a dictionary where keys are language names (e.g., "JavaScript", "Python") and values are arrays of language-specific patterns.
    • Each pattern within an array is a regular expression object with:
      • pattern: The regular expression to match.
      • points: Points awarded for each match. Positive points indicate features of the language, negative points indicate features that suggest it's not that language.
      • nearTop: (Optional) If True, the pattern should be matched near the beginning of the code.
  2. Scoring:

    • For each language:
      • Iterate through the patterns for that language.
      • Apply each pattern to the code snippet.
      • Accumulate the points for each match.
  3. Language Prediction:

    • Determine the language with the highest score.
    • If no language has a significantly higher score than others, the prediction may be uncertain.

Limitations:

  • Simple Heuristics: The current implementation relies on simple heuristics (regular expressions) and may not accurately detect complex or ambiguous code.
  • Limited Language Support: While the code supports a wide range of languages, it may not be comprehensive and may miss some niche languages.
  • False Positives/Negatives: False positives (incorrectly identifying a language) and false negatives (failing to identify the correct language) are possible due to the nature of heuristics and the complexity of real-world code.

Potential Improvements:

  • More Sophisticated Language Models: Incorporate more advanced language models (e.g., machine learning models trained on large code datasets) for more accurate predictions.
  • Contextual Analysis: Consider the context of the code (e.g., surrounding files, project structure) to improve accuracy.
  • Dynamic Analysis: Analyze the code's behavior (e.g., by running it) to extract more meaningful features.
  • Lexical Analysis: Perform more in-depth lexical analysis (tokenization, parsing) to identify language-specific constructs.

Usage:

  1. Prepare the code snippet: The code snippet should be a string.
  2. Call the detection function:
    • Pass the code snippet to the detection function.
    • The function will analyze the code and return the predicted language.

Note: This is a simplified example. A production-ready implementation would require significant refinement and testing.

This README file provides a basic overview of the code and its limitations. Further documentation and testing are necessary for a complete understanding and evaluation.

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •