Support parsing unterminated statements #65

Open
XVilka opened this issue Apr 27, 2021 · 5 comments

Comments

@XVilka
Contributor

XVilka commented Apr 27, 2021

Currently the parser can only parse terminated statements, such as:

const char * myarray[25];

But if you feed it something like

const char * [25]

or

const char *

it emits an error.
It would be beneficial to support parsing such unterminated statements too.
@thestr4ng3r proposed the following change to the grammar:

diff --git a/grammar.js b/grammar.js
index 6a5fa25..5dc99a3 100644
--- a/grammar.js
+++ b/grammar.js
@@ -51,6 +51,7 @@ module.exports = grammar({
   word: $ => $.identifier,

   rules: {
+    the_actual_root: $ => $.type_descriptor,
     translation_unit: $ => repeat($._top_level_item),

     _top_level_item: $ => choice(
@thestr4ng3r

To clarify: in addition to parsing a full translation_unit like int a() { x = (const char *[25])y; }, we need the same application to parse only the part inside the cast, like const char *[25], which is a type_descriptor.

So essentially, we would need a way to choose the root rule at runtime, which isn't really specific to tree-sitter-c. I wonder whether that is even theoretically possible with how the code generator works.

Alternatively, we will have to use two grammars for this: one is the original tree-sitter-c, and the other is conceptually what is shown in the issue description (which of course breaks parsing of regular translation_units).

@XVilka
Contributor Author

XVilka commented May 6, 2021

Maybe it is worth transferring the issue to the tree-sitter repository then? @maxbrunsfeld

@maxbrunsfeld
Contributor

When you need to parse a fragment of incomplete source code (like a type_descriptor), can you just surround the fragment with a "context" that turns it into a valid C translation unit, and then extract out the piece of the syntax tree that you're interested in?

For example, to parse a type_descriptor, take the input string, append the suffix string x;, parse that combined string, and then take the subtree for the relevant byte range.

There is a long-standing Tree-sitter issue about selecting alternative root rules at runtime, but that is going to be complex to implement. This workaround actually seems quite straightforward, and it scales to cases where you have many different rules that you want to try.
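
The wrap-and-extract workaround described in this comment needs some bookkeeping: the fragment's byte range inside the combined string has to be known so that the matching subtree can be picked out afterwards. Below is a minimal sketch of that bookkeeping only, in plain Node.js. The helper name wrapFragment and its return shape are assumptions made for illustration, not part of Tree-sitter, and the actual parsing step is left to whichever binding is in use.

```javascript
// Hypothetical helper (not a Tree-sitter API): wrap a fragment in a context
// that turns it into a complete translation unit, and record the byte range
// the fragment occupies inside the wrapped source.
function wrapFragment(fragment, prefix, suffix) {
  // Byte lengths are used because parse trees are usually indexed in bytes.
  const startByte = Buffer.byteLength(prefix, "utf8");
  const endByte = startByte + Buffer.byteLength(fragment, "utf8");
  return { source: prefix + fragment + suffix, startByte, endByte };
}

// Suffix-only context, as suggested in the comment above.
const wrapped = wrapFragment("const char *", "", " x;");
console.log(wrapped.source);                      // const char * x;
console.log(wrapped.startByte, wrapped.endByte);  // 0 12
```

After parsing wrapped.source, the caller would look for the node whose byte range is startByte..endByte. Whether a node with exactly that range exists depends on the chosen context.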

@thestr4ng3r

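Appending x; would not work for type_descriptor, because the result is parsed as a declaration with a pointer_declarator and the tree contains no type_descriptor node: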

char *x;
(translation_unit [0, 0] - [1, 0]
  (declaration [0, 0] - [0, 8]
    type: (primitive_type [0, 0] - [0, 4])
    declarator: (pointer_declarator [0, 5] - [0, 7]
      declarator: (identifier [0, 6] - [0, 7]))))

In theory we could use a cast instead. Assuming we want to parse const char *[42], we would wrap it like so:

void a() { (const char *[42])x; }
(translation_unit [0, 0] - [1, 0]
  (function_definition [0, 0] - [0, 33]
    type: (primitive_type [0, 0] - [0, 4])
    declarator: (function_declarator [0, 5] - [0, 8]
      declarator: (identifier [0, 5] - [0, 6])
      parameters: (parameter_list [0, 6] - [0, 8]))
    body: (compound_statement [0, 9] - [0, 33]
      (expression_statement [0, 11] - [0, 31]
        (cast_expression [0, 11] - [0, 30]
          type: (type_descriptor [0, 12] - [0, 28]
            (type_qualifier [0, 12] - [0, 17])
            type: (primitive_type [0, 18] - [0, 22])
            declarator: (abstract_pointer_declarator [0, 23] - [0, 28]
              declarator: (abstract_array_declarator [0, 24] - [0, 28]
                size: (number_literal [0, 25] - [0, 27]))))
          value: (identifier [0, 29] - [0, 30]))))))

The reason we can't do this in practice is that the string we want to parse could perform a kind of injection and easily escape our wrapping. For example, when we try to parse int)0; now_i_have_escaped(); //, we want to get a meaningful error rather than a well-parsed int type_descriptor with some garbage in the wrapped tree.
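
To make the injection concrete, the sketch below builds the wrapped source for a benign and a hostile fragment. It also shows one possible guard that is not proposed in this thread: accept the result only if the type_descriptor covers the whole fragment. The names wrapInCast and coversFragment are hypothetical, and node is a plain object standing in for whatever the binding returns. Such a check would reject the hostile input, but it does not by itself produce the meaningful error asked for above.

```javascript
// Cast context taken from the example above.
const PREFIX = "void a() { (";
const SUFFIX = ")x; }";

// Hypothetical helper: place the fragment inside the cast.
function wrapInCast(fragment) {
  return PREFIX + fragment + SUFFIX;
}

// Hypothetical guard: the node must be a type_descriptor that spans exactly
// the bytes occupied by the fragment. `node` is assumed to be a plain
// { type, startByte, endByte } object, not a real binding type.
function coversFragment(node, fragment) {
  const start = Buffer.byteLength(PREFIX, "utf8");
  const end = start + Buffer.byteLength(fragment, "utf8");
  return node.type === "type_descriptor" &&
         node.startByte === start &&
         node.endByte === end;
}

const benign = "const char *[42]";
const hostile = "int)0; now_i_have_escaped(); //";

console.log(wrapInCast(benign));   // void a() { (const char *[42])x; }
console.log(wrapInCast(hostile));  // void a() { (int)0; now_i_have_escaped(); //)x; }

// Range from the tree printed above: type_descriptor [0, 12] - [0, 28].
console.log(coversFragment({ type: "type_descriptor", startByte: 12, endByte: 28 }, benign));
// For the hostile input the cast closes right after "int", so a
// type_descriptor there would span only bytes 12..15 and fail the check.
console.log(coversFragment({ type: "type_descriptor", startByte: 12, endByte: 15 }, hostile));
```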

But I think the first workaround proposed in tree-sitter/tree-sitter#870, which is to always prepend a magic string that tells the parser how to proceed, could work very well for us.

@XVilka
Contributor Author

XVilka commented May 14, 2021

Just for the record, this is what I came up with:

    [$._type_specifier, $._expression],
    [$._type_specifier, $._expression, $.macro_type_specifier],
    [$._type_specifier, $.macro_type_specifier],
+   [$.type_expression, $._abstract_declarator],
+   [$.type_expression],
    [$.sized_type_specifier],
  ],

  word: $ => $.identifier,

  rules: {
-    translation_unit: $ => repeat($._top_level_item),
+    translation_unit: $ => choice(
+            repeat1($.type_expression),
+            repeat1($._top_level_item)
+    ),
+
+    type_expression: $ => seq(
+       '__TYPE_EXPRESSION',
+       repeat($.type_qualifier),
+       field('type', $._type_specifier),
+       repeat($.abstract_pointer_declarator),
+       repeat($.abstract_array_declarator),
+       repeat($.abstract_pointer_declarator),
+    ),

You can see examples of what it can parse in XVilka@fed7bd0:

__TYPE_EXPRESSION const int* [5]
__TYPE_EXPRESSION volatile uint8_t* [2]
__TYPE_EXPRESSION const uintptr_t* []
__TYPE_EXPRESSION struct s1 *

__TYPE_EXPRESSION struct s2 {
  int x;
  float y : 5;
} [5]
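
With this grammar, a caller has to prepend the marker before parsing, and every position in the resulting tree is shifted by the length of that prefix. The following is a minimal sketch of that bookkeeping in plain Node.js. The marker text comes from the grammar change above; the helper names markTypeExpression and toFragmentOffset are assumptions made for illustration.

```javascript
// Marker token from the type_expression rule above.
const TYPE_EXPRESSION_MARKER = "__TYPE_EXPRESSION";

// Hypothetical helper: build the parser input for a bare type expression and
// remember how many bytes the marker prefix adds.
function markTypeExpression(fragment) {
  const prefix = TYPE_EXPRESSION_MARKER + " ";
  return {
    source: prefix + fragment,
    offset: Buffer.byteLength(prefix, "utf8"),
  };
}

// Hypothetical helper: map a byte offset in the marked source back to the
// original fragment. Returns null for positions inside the marker itself.
function toFragmentOffset(marked, byteOffset) {
  return byteOffset < marked.offset ? null : byteOffset - marked.offset;
}

const marked = markTypeExpression("const int* [5]");
console.log(marked.source);                 // __TYPE_EXPRESSION const int* [5]
console.log(toFragmentOffset(marked, 18));  // 0
```

Error positions and node ranges reported for the marked source can be translated the same way before they are shown to the user.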
