Skip to content

Automatic programming language detection of code snippets, in Emacs Lisp

Notifications You must be signed in to change notification settings

andreasjansson/language-detection.el

Repository files navigation

language-detection.el

MELPA Build Status

Emacs Lisp library that automatically detects the programming language in a buffer or string. Implemented as a random forest classifier, trained in scikit-learn and deployed to Emacs Lisp.

Usage

  • M-x language-detection-buffer
  • Interactive function, outputs the buffer's language to *Messages*
  • When called non-interactively, returns the language
  • (language-detection-string STRING)
  • Non-interactive function, returns the language of STRING

Model performance

Language classification accuracy of this model compared to a few existing systems on three different datasets:

  Stack Overflow Rosetta Code Github Linguist
language-detection.el 73% 84% 87%
Highlight.js 36% 54% 56%
Pygments 10% 25% 37%
SourceClassifier 51% 48% 65%

Computational performance

The actual random forest lookup is fast, the bottleneck is regular expression-based tokenization.

Function Name                         Call Count  Elapsed time  Average Time
language-detection-buffer             441         8.0538804580  0.0182627674
language-detection-string             441         8.0525858239  0.0182598318
language-detection-tokenize-string    441         6.2478335339  0.0141674229
language-detection-token-frequencies  441         1.6420298549  0.0037234237
language-detection-forest-lookup      441         0.0973103440  0.0002206583
language-detection-tree-lookup        4851        0.0885046049  1.824...e-05

Supported languages

  • ada
  • awk
  • c
  • clojure
  • cpp
  • csharp
  • css
  • dart
  • delphi
  • emacslisp
  • erlang
  • fortran
  • fsharp
  • go
  • groovy
  • haskell
  • html
  • java
  • javascript
  • json
  • latex
  • lisp
  • lua
  • matlab
  • objc
  • perl
  • php
  • prolog
  • python
  • r
  • ruby
  • rust
  • scala
  • shell
  • smalltalk
  • sql
  • swift
  • visualbasic
  • xml

EWW syntax highlighting

Write this in your .emacs:

(require 'cl-lib)

(defun eww-tag-pre (dom)
  (let ((shr-folding-mode 'none)
        (shr-current-font 'default))
    (shr-ensure-newline)
    (insert (eww-fontify-pre dom))
    (shr-ensure-newline)))

(defun eww-fontify-pre (dom)
  (with-temp-buffer
    (shr-generic dom)
    (let ((mode (eww-buffer-auto-detect-mode)))
      (when mode
        (eww-fontify-buffer mode)))
    (buffer-string)))

(defun eww-fontify-buffer (mode)
  (delay-mode-hooks (funcall mode))
  (font-lock-default-function mode)
  (font-lock-default-fontify-region (point-min)
                                    (point-max)
                                    nil))

(defun eww-buffer-auto-detect-mode ()
  (let* ((map '((ada ada-mode)
                (awk awk-mode)
                (c c-mode)
                (cpp c++-mode)
                (clojure clojure-mode lisp-mode)
                (csharp csharp-mode java-mode)
                (css css-mode)
                (dart dart-mode)
                (delphi delphi-mode)
                (emacslisp emacs-lisp-mode)
                (erlang erlang-mode)
                (fortran fortran-mode)
                (fsharp fsharp-mode)
                (go go-mode)
                (groovy groovy-mode)
                (haskell haskell-mode)
                (html html-mode)
                (java java-mode)
                (javascript javascript-mode)
                (json json-mode javascript-mode)
                (latex latex-mode)
                (lisp lisp-mode)
                (lua lua-mode)
                (matlab matlab-mode octave-mode)
                (objc objc-mode c-mode)
                (perl perl-mode)
                (php php-mode)
                (prolog prolog-mode)
                (python python-mode)
                (r r-mode)
                (ruby ruby-mode)
                (rust rust-mode)
                (scala scala-mode)
                (shell shell-script-mode)
                (smalltalk smalltalk-mode)
                (sql sql-mode)
                (swift swift-mode)
                (visualbasic visual-basic-mode)
                (xml sgml-mode)))
         (language (language-detection-string
                    (buffer-substring-no-properties (point-min) (point-max))))
         (modes (cdr (assoc language map)))
         (mode (cl-loop for mode in modes
                        when (fboundp mode)
                        return mode)))
    (message (format "%s" language))
    (when (fboundp mode)
      mode)))

(setq shr-external-rendering-functions
      '((pre . eww-tag-pre)))

Classifier training procedure

The model consists of a Random Forest ensemble of 11 Decision Tree classifiers.

It was trained on code snippets from the Stack Overflow data dump, extracted from <pre> tags. The number of snippets per language was capped at 10,000. The actual training snippets can be unpickled from snippets.cpkl.

The snippets were tokenized simple regex: ([a-zA-Z0-9_]+|[^ a-zA-Z0-9_\n\t]+). The set of single tokens and pairs of consecutive tokens are pruned to the top 500 token/token pairs per language.

Finally, the tokens are counted for frequency and placed in a 7772-dimensional bag-of-words vector. (7772 is the total number of unique tokens/pairs.) For example, this is a bag corresponding to a short Haskell snippet:

"         : 0.057971
(         : 0.0144928
)         : 0.0144928
->        : 0.0289855
-> IO     : 0.0144928
.         : 0.0289855
. h       : 0.0289855
::        : 0.0289855
IO        : 0.0144928
h         : 0.0289855
import    : 0.0289855

For classification I use a Random Forest with 11 decision trees. The hyperparameters were found using random grid search. Here's a confusion matrix:

Confusion matrix

The decision trees are written out as Emacs Lisp arrays and traversed recursively until leaves are found in each tree.

About

Automatic programming language detection of code snippets, in Emacs Lisp

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published