HTML entities in attribute values #51

Open
savetheclocktower opened this issue Apr 13, 2023 · 1 comment

Support for HTML entities was requested in #10, and was mostly addressed in #50, but I think it's reasonable to want entities to be recognized inside of attribute values as well.

This is a trickier request because attribute_value is currently a simple node that does not have any children and doesn't envision being broken up by tokens with special meanings. An entity is roughly equivalent to an escape_sequence node in other tree-sitter parsers, but those parsers tend to represent a string's contents as a series of string_content and escape_sequence nodes.

So the most intuitive solution might be to introduce a string_content node (or attribute_value_content or something), and make it so that attribute_value's children are some combination of string_content and entity nodes. By and large I think it wouldn't disrupt existing consumers of tree-sitter-html.

<abbr title="American Telephone &amp; Telegraph">AT&amp;T</abbr>
(fragment
  (element
    (start_tag
      (tag_name)
      (attribute
        (attribute_name)
        (quoted_attribute_value
          (attribute_value
            (string_content)
            (entity)
            (string_content)))))
    (text)
    (entity)
    (text)
    (end_tag
      (tag_name))))
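
To make the proposed shape concrete, here is a minimal Python sketch that splits an attribute value into the string_content and entity children shown in the tree above. It is not how the grammar itself would be written; the entity pattern is a simplifying assumption (a named, decimal, or hex reference terminated by a semicolon), and split_attribute_value is a hypothetical helper name.

```python
import re

# Assumption: a simplified entity pattern. The grammar's actual rule for
# entities may accept more or fewer forms than this.
ENTITY = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def split_attribute_value(value):
    """Return (node_type, text) pairs for the children of attribute_value."""
    children = []
    pos = 0
    for match in ENTITY.finditer(value):
        # Any text before the entity becomes a string_content child.
        if match.start() > pos:
            children.append(("string_content", value[pos:match.start()]))
        children.append(("entity", match.group()))
        pos = match.end()
    # Trailing text after the last entity, if any.
    if pos < len(value):
        children.append(("string_content", value[pos:]))
    return children


children = split_attribute_value("American Telephone &amp; Telegraph")
for node_type, text in children:
    print(node_type, repr(text))
```

Under this pattern, a bare ampersand such as the one in `?a=1&b=2` is not treated as an entity and stays inside a single string_content child.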

The only exception I can think of is injections. Since injection.include-children is false by default, anyone injecting into attribute_value nodes would no longer see any of the content inside them until they enable that setting.

Another option would be to do something like what tree-sitter-javascript does for template strings: make it so that attribute_value can contain entity nodes, but don't represent the non-entity text content of attribute_value with any sort of node. In this scenario, injections into attribute_value would at least still see all the non-entity content of the value when include-children is false. This might be more surprising because it runs contrary to how we handle entities in tag contents (where entity nodes break up text nodes), but folks might find it less disruptive.
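
The difference between the two layouts for injections can be sketched as follows. This is a rough Python model, not tree-sitter code: it assumes that when include-children is false, the byte ranges of a node's children are subtracted from the node's own range before the injected language sees it. injected_ranges is a hypothetical helper.

```python
def injected_ranges(node_range, child_ranges, include_children=False):
    """Ranges of a node that an injection would cover.

    Assumption: with include_children false, every child range is removed
    from the node's range and only the gaps between children remain.
    """
    start, end = node_range
    if include_children:
        return [(start, end)]
    ranges = []
    pos = start
    for child_start, child_end in sorted(child_ranges):
        if child_start > pos:
            ranges.append((pos, child_start))
        pos = max(pos, child_end)
    if pos < end:
        ranges.append((pos, end))
    return ranges


value = "American Telephone &amp; Telegraph"
entity_start = value.index("&amp;")
entity_end = entity_start + len("&amp;")
whole = (0, len(value))

# Layout 1: string_content and entity children cover the entire value.
layout_1 = [(0, entity_start), (entity_start, entity_end), (entity_end, len(value))]
# Layout 2: only the entity is a child; the other text has no node.
layout_2 = [(entity_start, entity_end)]

print(injected_ranges(whole, layout_1))  # nothing left for the injection
print([value[a:b] for a, b in injected_ranges(whole, layout_2)])
```

In this model the first layout leaves an injection with no ranges at all, while the second still exposes the text on either side of the entity.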

milahu commented Feb 21, 2024

lezer-parser-html does this already

<a t="a&amp;b">a&amp;b</a>
node 15 = Document: '<a t="a&amp;b">a&amp;b</a>\n'
node 20 = Element: '<a t="a&amp;b">a&amp;b</a>'
node 36 = OpenTag: '<a t="a&amp;b">'
node 6 = StartTag: '<'
node 22 = TagName: 'a'
node 23 = Attribute: 't="a&amp;b"'
node 24 = AttributeName: 't'
node 25 = Is: '='
node 26 = AttributeValue: '"a&amp;b"'
node 17 = EntityReference: '&amp;'
node 4 = EndTag: '>'
node 16 = Text: 'a'
node 17 = EntityReference: '&amp;'
node 16 = Text: 'b'
node 37 = CloseTag: '</a>'
node 11 = StartCloseTag: '</'
node 22 = TagName: 'a'
node 4 = EndTag: '>'
node 16 = Text: '\n'

lezer-parser-html parses the & in <a href="?a=1&b=2"> as InvalidEntity

ideally there should be 2 tokens, EntityReference and EntityReferenceInAttributeValue,
so that in a semantic stage I can ignore only EntityReferenceInAttributeValue
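
A small Python sketch of what such a semantic stage could look like, assuming the parser emitted the two distinct token types. EntityReferenceInAttributeValue is the name proposed above and does not exist in any parser today; the (kind, text) token format and decode_tokens are likewise hypothetical.

```python
import html


def decode_tokens(tokens):
    """Decode entity references in text, but leave attribute-value ones as-is.

    tokens is a list of (kind, text) pairs. Only the kind "EntityReference"
    is decoded; "EntityReferenceInAttributeValue" (a proposed, not existing,
    token type) is passed through untouched.
    """
    out = []
    for kind, text in tokens:
        if kind == "EntityReference":
            out.append(html.unescape(text))
        else:
            out.append(text)
    return "".join(out)


in_text = [("Text", "a"), ("EntityReference", "&amp;"), ("Text", "b")]
in_attribute = [
    ("Text", "a"),
    ("EntityReferenceInAttributeValue", "&amp;"),
    ("Text", "b"),
]

print(decode_tokens(in_text))       # a&b
print(decode_tokens(in_attribute))  # a&amp;b
```

With a single shared token type, the same stage would have to walk up to the parent node to tell the two cases apart.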
