Extract links from a webpage

How to use JavaScript to list and sort links on a webpage.

An uncommon bit of website maintenance involves reviewing or updating links on a webpage.

Typically, you want a list of the URLs specified in anchor elements (<a href="..."></a>).

This means you want a list of the URLs specified in the HREF attribute of anchors.

Like many tasks, this can be more complicated than it first appears.

Use the Document.links property to get a list of the links on a webpage.

Document.links returns an HTMLCollection object, which is a list of Element objects.

This list includes anchor elements (<a>) and image-map area elements (<area>), but only those that have an HREF attribute.

The collection object also carries properties that are not elements, such as length and item. For best results, make sure you have an element before trying to read the attribute's value.

Here’s one way to do this:

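const aryLinks = document.links;
for (const key in aryLinks)
{
  const obj = aryLinks[key];
  // for...in also visits non-element keys such as length and item,
  // so keep only objects that expose getAttribute.
  if (typeof obj === 'object' &&
      obj != null && 'getAttribute' in obj)
  {
    console.log(obj.getAttribute("href"));
  }
}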

Example #1: JavaScript code to list a webpage’s links to the browser console.
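The same listing can also be sketched with a for...of loop, which visits only the elements in the collection, so no type check is needed. The helper name listHrefs is hypothetical and used here only for illustration. The guard at the end is an assumption that lets the snippet load outside a browser without errors.

```javascript
// Hypothetical helper (not part of the original examples): returns the
// href attribute value of every element in an iterable collection of links.
function listHrefs(links) {
  const hrefs = [];
  for (const link of links) {
    // getAttribute returns the value exactly as written in the markup.
    hrefs.push(link.getAttribute("href"));
  }
  return hrefs;
}

// Only runs where a DOM is available, such as the browser console.
if (typeof document !== "undefined") {
  for (const href of listHrefs(document.links)) {
    console.log(href);
  }
}
```

Note that getAttribute("href") returns the value as written in the page source, which may be a relative URL. Reading the element's href property instead returns the resolved, absolute URL.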

Sort and remove duplicates

To try out Example #1, copy the code and paste it into the Console tab of your browser's developer tools.

As you review the results, you might notice a few things:

  • There are a lot of links

  • They’re ordered by appearance in the page structure

  • There are duplicates

While there are good reasons for each of these, you might want to sort the list and remove duplicates.

Here’s one way to do this:

const aryLinks = document.links;
const aryURLs = [];
for (const key in aryLinks)
{
  const obj = aryLinks[key];
  // Skip non-element keys that for...in also visits, such as length and item.
  if (typeof obj === 'object' &&
      obj != null && 'getAttribute' in obj)
  {
    const sURL = obj.getAttribute("href");
    // Collect each URL only once.
    if (!aryURLs.includes(sURL))
    {
      aryURLs.push(sURL);
    }
  }
}

// sort() orders the array in place and returns it.
for (const sURL of aryURLs.sort())
{
  console.log(sURL);
}

Example #2: Extracting links, and listing unique URLs as sorted results.

Instead of logging each URL directly, this version adds a second step.

It uses a second array to collect unique URLs and then sorts the results before printing them to the console.
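As an alternative sketch, a Set can stand in for the second array, because a Set silently ignores values it already holds. The helper name uniqueSortedHrefs is hypothetical, and the guard at the end is an assumption that lets the snippet load outside a browser.

```javascript
// Hypothetical helper (not part of the original examples): returns the
// unique href values of a collection of links, sorted with the default
// Array.prototype.sort ordering.
function uniqueSortedHrefs(links) {
  const unique = new Set();
  for (const link of links) {
    // Adding a value the Set already contains has no effect.
    unique.add(link.getAttribute("href"));
  }
  return Array.from(unique).sort();
}

// Only runs where a DOM is available, such as the browser console.
if (typeof document !== "undefined") {
  for (const href of uniqueSortedHrefs(document.links)) {
    console.log(href);
  }
}
```

The default sort compares values as strings by UTF-16 code unit, so URLs that start with an uppercase letter are listed before those that start with a lowercase letter.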

Vital statistics

  • 10 May 2024: First post, based on private notes.