How to get a CSS Selector given an HTML Document?

Abstract

Obtaining a CSS selector for an element in an HTML document is a common need in development tools.  Chrome DevTools’ Inspect feature does this through a point-and-click interaction, and the same interaction is useful in web-based editors like Webflow and in web scrapers like ParseHub.  It’s also relevant when stringification methods are used to update virtual DOM trees in JavaScript frameworks.  Because an HTMLElement has exactly one parent, iterative methods are the quicker way to reverse engineer a CSS selector.  However, since efficiently building virtual DOM trees calls for stringification, I’ll cover the stringification method, which generates an :nth-child selector.

Selection Strategies

In a web scraper or website builder, it’s helpful to consider selectors that match multiple elements of the same kind.  Commonly, this is done using CSS class/ID selectors and common ancestors in the DOM tree.  While not covered in this post, it’s important to heuristically refine a selection so that it matches the elements a user clicked, even when those elements sit in disparate subtrees.  If no class/ID selectors are used, the selector has to rely on the :nth-child pseudo-class.
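For contrast with the string-based method covered in the rest of this post, here is a minimal sketch of the iterative approach: walk up through parentElement, stop at the first ancestor with an ID, and otherwise fall back to :nth-child.  The function name selectorFor is hypothetical, and the sketch assumes IDs are simple identifiers that need no escaping.

```javascript
// Sketch only: builds a selector by walking from the element up to the root.
// Works on any object exposing tagName, id, parentElement and children,
// which real DOM elements do.
function selectorFor(el) {
  const parts = [];
  for (let node = el; node && node.tagName; node = node.parentElement) {
    const tag = node.tagName.toLowerCase();
    if (node.id) {
      // An ID is assumed to be unique, so the walk can stop here.
      parts.unshift(`#${node.id}`);
      break;
    }
    const parent = node.parentElement;
    if (!parent) {
      parts.unshift(tag); // root element: no siblings to disambiguate
      break;
    }
    // :nth-child is 1-based and counts every element sibling.
    const index = Array.from(parent.children).indexOf(node) + 1;
    parts.unshift(`${tag}:nth-child(${index})`);
  }
  return parts.join(' > ');
}
```

In a browser this could be called from a click handler, e.g. selectorFor(event.target).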

Stringification methods are additionally useful when considering the broader topic of reducing O(n log(n)) search times in a tree to O(n).  This rationale exists because DOM trees are unsorted and string contents must be searched on each iteration.  The same methodology applies to reconstructing JSON objects.

Stringification Requirements

To build a CSS selector, first a set of regular expressions must be generated to match an HTML document:

  • Open Elements (e.g. <p>)
  • Closed Elements (e.g. </strong>)
  • All elements (e.g. <p> and </strong>)
  • Self-Closing Elements (e.g. <img />)

The example function utilizes these regular expressions:
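A minimal sketch of such patterns follows.  The constant names and the tagNames helper are hypothetical, and the patterns assume well-formed markup: no comments, no script or style bodies, no ">" inside attribute values, and void elements written in self-closing form (e.g. <img />).

```javascript
// Open tags such as <p> or <p class="x">; the lookbehind rejects "/>" endings.
const OPEN_TAG = /<([a-zA-Z][\w-]*)(?:\s+[^<>]*?)?(?<!\/)>/g;
// Close tags such as </strong>.
const CLOSE_TAG = /<\/([a-zA-Z][\w-]*)\s*>/g;
// Self-closing tags such as <img /> or <br/>.
const SELF_CLOSING_TAG = /<([a-zA-Z][\w-]*)(?:\s+[^<>]*?)?\s*\/>/g;
// Any of the three kinds above.
const ANY_TAG = /<\/?([a-zA-Z][\w-]*)[^<>]*>/g;

// Collects the tag names matched by one of the patterns, in document order.
function tagNames(html, pattern) {
  return Array.from(html.matchAll(pattern), (match) => match[1]);
}

const sampleMarkup = '<body><p><img /><strong></strong></p><footer><p>';
console.log(tagNames(sampleMarkup, OPEN_TAG));         // body, p, strong, footer, p
console.log(tagNames(sampleMarkup, CLOSE_TAG));        // strong, p
console.log(tagNames(sampleMarkup, SELF_CLOSING_TAG)); // img
```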

The function reads back all nodes from the offset.  This may look like the following chart:

Put into a table, we can determine what the path of the left-most tag may look like:

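| tag name | index | type | include? | path |
| --- | --- | --- | --- | --- |
| p | 0 | open | yes | p |
| footer | 1 | open | yes | footer > p |
| p | 2 | close | no | footer > p |
| strong | 3 | close | no | footer > p |
| strong | 4 | open | no | footer > p |
| img | 5 | self-closing | no | footer > p |
| p | 6 | open | no | footer > p |
| body | 7 | open | yes | body > footer > p |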

Arranged as a tree, the requirements for obtaining a correct path become easier to see:



Note: the above tree is not visually sorted in left to right order to determine rank.

The code might start to look like this:
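As a starting point, here is a sketch of the read-back step only: it tokenizes everything before the offset and returns the tokens in reverse order, each tagged as open, close, or self-closing.  The names TAG and readBack are hypothetical, and the offset is assumed to point just inside the target element, right after its open tag.

```javascript
// One pattern for all tags: group 1 is "/" for close tags, group 2 is the
// tag name, group 3 is "/" for self-closing tags.
const TAG = /<(\/?)([a-zA-Z][\w-]*)[^<>]*?(\/?)>/g;

// Reads back every tag that appears before `offset`, nearest tag first.
function readBack(html, offset) {
  const tokens = [];
  for (const match of html.slice(0, offset).matchAll(TAG)) {
    const type = match[1] ? 'close' : match[3] ? 'self-closing' : 'open';
    tokens.push({ name: match[2].toLowerCase(), type });
  }
  return tokens.reverse();
}

const page =
  '<body><p><img /><strong></strong></p><footer><p>Hello</p></footer></body>';
const readTokens = readBack(page, page.indexOf('Hello'));
readTokens.forEach((token, index) => console.log(token.name, index, token.type));
// p 0 open
// footer 1 open
// p 2 close
// strong 3 close
// strong 4 open
// img 5 self-closing
// p 6 open
// body 7 open
```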

Basic HTML Heuristics

To reverse engineer a CSS path via an HTML string, it becomes important to define general rules about the path of a node:

  • Only open tags can be candidates for path inclusion
    • Self-closing tags per this rule are always excluded
  • All close tags must have their matching open tag met before an open tag may be considered for path inclusion

The first heuristic is most important.  Combined with the second, it implies that we need to cull the parts of the array that begin with a closed element, up to the point where that closed element is opened.  This is non-trivial because a tree may contain multiple elements of the same name, so searching for the first opened element of the same name may be incorrect.  Since in XML and HTML each close tag must have a matching open tag, and because each tag has exactly one parent, it becomes possible to start from the innermost closed tag and work outwards to find matching tags for previously discovered close tags.  A stack is the appropriate data structure to perform this operation elegantly.

To cull an irrelevant subtree, the array is iterated starting at the index after the first close tag.  A stack containing the first closed element is created before the loop, and all subsequent closed elements are pushed onto the stack.  When an open element matches the element at the top of the stack, that element is popped off the stack.  When the stack length reaches zero, the loop ends.  The number of iterations is counted, and the outer loop skips that many iterations, i.e. the irrelevant subtree is culled.

This is what the code looks like:
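A sketch of the culling step as described, under stated assumptions: tokens are plain { name, type } objects listed in reverse document order, starting at the target's open tag, and the names cullLength and pathFromTokens are hypothetical.

```javascript
// Counts how many tokens, starting at the close tag at `start`, belong to the
// closed subtree, so the outer loop can skip them.
function cullLength(tokens, start) {
  const stack = [tokens[start].name];
  let i = start + 1;
  while (stack.length > 0 && i < tokens.length) {
    const token = tokens[i];
    if (token.type === 'close') {
      stack.push(token.name);
    } else if (token.type === 'open' && token.name === stack[stack.length - 1]) {
      stack.pop();
    }
    i++;
  }
  return i - start;
}

// Builds the ancestor path; only open tags outside culled subtrees are kept.
function pathFromTokens(tokens) {
  const path = [];
  let i = 0;
  while (i < tokens.length) {
    const token = tokens[i];
    if (token.type === 'close') {
      i += cullLength(tokens, i); // skip the whole closed subtree
      continue;
    }
    if (token.type === 'open') path.unshift(token.name);
    i++; // self-closing tags are never path candidates
  }
  return path.join(' > ');
}

const footerTokens = [
  { name: 'p', type: 'open' }, { name: 'footer', type: 'open' },
  { name: 'p', type: 'close' }, { name: 'strong', type: 'close' },
  { name: 'strong', type: 'open' }, { name: 'img', type: 'self-closing' },
  { name: 'p', type: 'open' }, { name: 'body', type: 'open' },
];
console.log(pathFromTokens(footerTokens)); // body > footer > p
```

Because the stack only pops when the open tag matches its top, nested elements with the same name are paired from the innermost outwards.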

 

This becomes more complex when we consider tags with multiple siblings of the same tag name.

nth-child Heuristic

When a node on the path is preceded by sibling elements with the same tag name, the selector needs to specify the index of that path node.  The rationale is easier to see by re-examining the charts from above:

In a table this looks like the following:

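| tag name | index | type | include? | path |
| --- | --- | --- | --- | --- |
| p | 0 | open | yes | p |
| p | 1 | close | no | p:nth-child(2) |
| strong | 2 | close | no | p:nth-child(2) |
| strong | 3 | open | no | p:nth-child(2) |
| img | 4 | self-closing | no | p:nth-child(2) |
| p | 5 | open | no | p:nth-child(2) |
| body | 6 | open | yes | body > p:nth-child(2) |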

Although the generated path does not include the other <p> tag, a selector without an index would still match both <p> tags in the DOM tree unless a :nth-child selector is used.  The desired tree looks like this:

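The code only needs a minor modification.  Each token starts with an index value of 1.  The index is incremented whenever a subtree cull operation is started and the element that started the cull has the same tag type as the nextElement to be pushed.  Note that counting only siblings of the same tag type corresponds to the semantics of :nth-of-type; a strict :nth-child index counts every preceding sibling regardless of tag name.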
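A sketch of this modification, under stated assumptions: tokens are plain { name, type } objects in reverse document order, starting at the target's open tag, and cullLength, selectorFromTokens, and the pseudo parameter are hypothetical names.  By default the sketch counts every preceding sibling, which is what :nth-child requires.  Passing 'nth-of-type' counts only siblings with the same tag name, as in the rule described above.

```javascript
// Number of tokens in the closed subtree that starts at the close tag `start`.
function cullLength(tokens, start) {
  const stack = [tokens[start].name];
  let i = start + 1;
  while (stack.length > 0 && i < tokens.length) {
    const token = tokens[i];
    if (token.type === 'close') stack.push(token.name);
    else if (token.type === 'open' && token.name === stack[stack.length - 1]) stack.pop();
    i++;
  }
  return i - start;
}

function selectorFromTokens(tokens, pseudo = 'nth-child') {
  const path = []; // path[0] is always the most recently added (outermost) node
  let i = 0;
  while (i < tokens.length) {
    const token = tokens[i];
    if (token.type === 'open') {
      path.unshift({ name: token.name, index: 1 });
      i++;
      continue;
    }
    // A close or self-closing tag at this level is a preceding sibling of path[0].
    const node = path[0];
    if (node && (pseudo === 'nth-child' || token.name === node.name)) node.index++;
    i += token.type === 'close' ? cullLength(tokens, i) : 1;
  }
  return path
    .map((n) => (n.index > 1 ? `${n.name}:${pseudo}(${n.index})` : n.name))
    .join(' > ');
}

const siblingTokens = [
  { name: 'p', type: 'open' }, { name: 'p', type: 'close' },
  { name: 'strong', type: 'close' }, { name: 'strong', type: 'open' },
  { name: 'img', type: 'self-closing' }, { name: 'p', type: 'open' },
  { name: 'body', type: 'open' },
];
console.log(selectorFromTokens(siblingTokens)); // body > p:nth-child(2)
```

The two counting modes differ as soon as a preceding sibling has a different tag name: for <body><div></div><p></p><p>, the target is p:nth-child(3) but p:nth-of-type(2).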

Conclusions

Stringification of an HTML document is sometimes necessary to reduce the cost of tree traversal when the tree must be traversed from the root element and the subjects of the search are sparse.  Generating CSS selectors from a mouse event or from a stringified HTML document is handy for several popular use cases.

It’s worth noting that the overall method of converting a tree into a string is also useful for other popular data formats like JSON.  Often, it’s cheaper to stringify a tree and search the string than to traverse the tree and search it.  A string search is also easier to encapsulate and more straightforward to implement than an iterative search.

Code

Here is the code covered in this post:

 
