XML Sitemap Auditor

Chris Green
May 25, 2023
Updated: Mar 25, 2024

In an effort to grow my own skills (and provide useful tools/information to the community) I wanted to share a quick script that I've been working on to help you audit XML sitemaps.

NEW - Streamlit App to run this as a web test

A number of people said they had trouble running this on Windows - I've re-written the app on Streamlit & you can run it below! The app will display all the URLs it has processed and will drop in a download button when done - so keep scrolling to find it!

This is still in testing, so please let me know of any issues!

In this Post:

- What does it do?

- Why a separate auditor?

- Mac OS Sitemap Analyser Script

- Questions/Comments & Feedback Welcome!

The objectives of this audit are:

To judge the overall health of the XML sitemap quickly & without the need for third-party software
Understand which pages are "non-compliant" for SEO - i.e. shouldn't be there!
Produce a CSV that is easy to work with/share with teams who can help fix the issue.

So, a lightweight auditing script that I hope will be useful to you - and one that was quite a fun/interesting experience to build and test.

This is currently just for macOS; if enough people bug me, I'll work out the best way to build a Windows-specific one.

What does it do?

Quite simply, this code:

Fetches an XML sitemap
Parses the list of URLs
Crawls the URLs and checks the response code, canonical tag & meta robots tag
Outputs the data to a CSV
If the sitemap is an index, it should loop through the subsequent XML sitemaps
You can also set a user agent to crawl the site with - sometimes a plain curl request gets blocked, so a Chrome or Googlebot user agent can be set.

The output CSV has five columns: URL, Response Code, Canonical URL, Canonical Match and Meta Robots.

You can see that I have a field "canonical match" which evaluates whether the URL matches the canonical - which is handy.
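
As a rough sketch of how that field is worked out: the comparison is an exact string match between the sitemap URL and the canonical href. The helper below is hypothetical (the function name `canonical_status` and the extra "Missing" label are assumptions for illustration, not part of the script itself), but it shows the idea in isolation:

```shell
#!/bin/bash

# Hypothetical helper: compare a sitemap URL with the canonical found on the page.
# "Missing" is an assumed extra label; the script itself only reports Match/Mismatch.
canonical_status() {
    local url=$1
    local canonical=$2

    if [ -z "$canonical" ]; then
        echo "Missing"      # no canonical tag was found on the page
    elif [ "$canonical" = "$url" ]; then
        echo "Match"        # exact string match, including any trailing slash
    else
        echo "Mismatch"
    fi
}
```

Because this is a plain string comparison, `https://example.com/page` and `https://example.com/page/` would count as a mismatch, which is usually what you want to know about in a sitemap.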

There are a few improvements I can think of to add for this specific workflow - but this does well for what I need it for.

Why a Separate Auditor?

"Don't you have enough tools already?" Yes, maybe...

But since getting more familiar with Terminal and Bash, I have found that tasks I need to repeat often can be done more quickly there than with Screaming Frog or Sitebulb (both tools I am very fond of).

This script is something I wrote after the project where it was needed - for that, I did use desktop auditors and managed just fine. But part of my own curiosity around these things was to see if I could build something that could do the job and streamline the workflow.

Full disclosure: the version I wrote isn't the code you see below. It was functional, but clunky and broken out into different steps. ChatGPT has been my collaborator in this process, consolidating the code & refining the output file.

Mac OS Sitemap Analyser Script

Copy & paste this script into a text editor and save it as a .sh file. The instructions beneath give you the full details, but I've tried to write a version that should run on most Mac OS systems without the need to install too many (if any!) dependencies. Any issues in running it, or with the outputs - please drop me a message!

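#!/bin/bash

# The sitemap URL is the first argument; stop early if it is missing
url="$1"
if [ -z "$url" ]; then
    echo "Usage: $0 <sitemap_url> [user_agent]"
    exit 1
fi

# Extract the domain name (host) from the URL, whatever the path depth
domain=$(echo "$url" | sed -E 's|^[a-zA-Z]+://([^/]+).*|\1|')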

# Set the current date and time
current_datetime=$(date +"%m%d%Y_%H%M")

# Define column headers
column_headers="URL,Response Code,Canonical URL,Canonical Match,Meta Robots"

# User agent
user_agent="$2"

# Function to process XML sitemaps
process_sitemap() {
    local sitemap_url=$1
    
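    # Fetch the sitemap XML (declared local so nested calls keep their own copy)
    local sitemap_xml
    sitemap_xml=$(curl -s -A "$user_agent" "$sitemap_url")
    
    # Check if the sitemap is an index file (grep -q succeeds on any match,
    # even if '<sitemapindex' appears on more than one line)
    if echo "$sitemap_xml" | grep -q '<sitemapindex'; then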
        # Sitemap is an index file, parse the sitemap URLs
        echo "$sitemap_xml" | grep -o '<loc>[^<]*</loc>' | sed -e 's|<loc>||' -e 's|</loc>||' | while read -r nested_sitemap_url; do
            echo "Processing nested sitemap: $nested_sitemap_url"
            process_sitemap "$nested_sitemap_url"
        done
    else
        # Sitemap is a regular XML sitemap, parse the URLs
        echo "$sitemap_xml" | grep -o '<loc>[^<]*</loc>' | sed -e 's|<loc>||' -e 's|</loc>||' | while read -r url; do
            echo "Checking URL: $url"
            
            # Retrieve the response code for each URL
            response_code=$(curl -s -A "$user_agent" -o /dev/null -w "%{http_code}" "$url")
            
            # Check if the response code is 200 (OK)
            if [ "$response_code" -eq 200 ]; then
                echo "Fetching page content: $url"
                # Fetch the page content and check for rel="canonical" and meta robots tag
                page_content=$(curl -s -A "$user_agent" "$url")
                
                # Extract rel="canonical" href value
                canonical_url=$(echo "$page_content" | awk -F 'rel="canonical" href="' 'NF>1 {split($2, a, "\""); print a[1]; exit}')
                
                # Extract meta robots tag content
                meta_robots=$(echo "$page_content" | awk -F '<meta name="robots" content="' 'NF>1 {split($2, a, "\""); print a[1]; exit}')
                
                # Check if the rel="canonical" href matches the URL
                if [ "$canonical_url" = "$url" ]; then
                    canonical_match="Match"
                else
                    canonical_match="Mismatch"
                fi
                
                # Store the results in the CSV file (text fields are quoted, since meta robots values such as "noindex, follow" contain commas)
                echo "\"$url\",$response_code,\"$canonical_url\",$canonical_match,\"$meta_robots\"" >> "${current_datetime}_${domain}_xml_sitemap_urls.csv"
            else
                # Non-200 responses: leave the canonical and meta robots columns empty
                echo "\"$url\",$response_code,,," >> "${current_datetime}_${domain}_xml_sitemap_urls.csv"
            fi
        done
    fi
}

# Write column headers to the CSV file
echo "$column_headers" > "${current_datetime}_${domain}_xml_sitemap_urls.csv"

# Process the initial XML sitemap
echo "Processing XML sitemap: $url"
process_sitemap "$url"

# Print completion message
echo "Process completed. CSV file: ${current_datetime}_${domain}_xml_sitemap_urls.csv"
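
One limitation worth knowing: the awk patterns above only match when the tag is written exactly as `rel="canonical" href="..."`, in that attribute order and with double quotes. A slightly more tolerant extraction could look like the sketch below. The function name `extract_canonical` is hypothetical, and this is an illustration rather than a drop-in replacement; it relies only on `tr`, `grep` and `sed`.

```shell
#!/bin/bash

# Hypothetical helper: read HTML on stdin and print the href of the first
# <link rel="canonical"> tag, whatever the attribute order or quote style.
extract_canonical() {
    local q="[\"']"   # bracket expression matching either quote character

    # 1. Join lines so a tag split across lines can still be matched.
    # 2. Pull out every <link ...> tag that carries rel="canonical".
    # 3. Keep the first one and print only its href value.
    tr '\n' ' ' |
        grep -io "<link[^>]*rel=${q}canonical${q}[^>]*>" |
        head -n 1 |
        sed -n -E "s/.*[Hh][Rr][Ee][Ff]=${q}([^\"']*)${q}.*/\1/p"
}
```

It could be used as `canonical_url=$(echo "$page_content" | extract_canonical)`. It prints nothing when no canonical tag is found, so the result would still come out as a "Mismatch" in the CSV.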

Running the Mac OS Script

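Open a text editor (such as TextEdit) on your Mac. If you use TextEdit, switch the document to plain text first (Format > Make Plain Text), otherwise the script won't run.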
Copy the entire script (above) and paste it into the text editor.
Save the file with a descriptive name and the `.sh` extension. For example, you can save it as sitemap_checker.sh.
Open the Terminal application on your Mac. You can find it in the Applications > Utilities folder, or you can use Spotlight search (press Command + Space and type "Terminal").
In the Terminal, navigate to the directory where you saved the `sitemap_checker.sh` file. You can use the `cd` command followed by the directory path. For example, if you saved the file on your desktop, you can use the following command:

cd ~/Desktop

Make the script file executable by running the following command in the Terminal:
chmod +x sitemap_checker.sh
Now, you can run the script by typing the following command in the Terminal:

./sitemap_checker.sh "https://example.com/sitemap.xml" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Replace "https://example.com/sitemap.xml" with the URL of the XML sitemap you want to check, and "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" with a user agent of your choice.
The user agent is optional, so if you don't want to specify one, you can leave this bit out.
The script will start running, and you will see various messages indicating the progress. It may take some time depending on the size of the sitemap and the number of URLs.
Once the script completes, it will generate a CSV file with the results. The message in the Terminal will show the name of the CSV file, which will look something like '05152023_1230_example.com_xml_sitemap_urls.csv'.
You can find the generated CSV file in the directory you ran the script from, which is where you saved sitemap_checker.sh if you followed the steps above.
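
Once you have the CSV, a quick tally by response code is a handy first look. This is a minimal sketch: the function name `summarise_codes` is hypothetical, and it assumes the URL column contains no commas, so that the response code is always the second field.

```shell
#!/bin/bash

# Hypothetical helper: count rows per response code in the auditor's CSV.
# Assumes the URL column contains no commas, so the status is field 2.
summarise_codes() {
    local csv_file=$1

    # Skip the header row, tally each response code, then sort for stable output.
    awk -F ',' 'NR > 1 { count[$2]++ }
                END    { for (code in count) print code, count[code] }' "$csv_file" |
        sort
}
```

Run it as `summarise_codes your_output_file.csv` (swap in the file name the script printed). Each output line is a response code followed by the number of sitemap URLs that returned it.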

Questions/Comments & Feedback Welcome!

This is a hobby project and a bit of an experiment, so it's not been tested anywhere near as rigorously as I'd like. Any questions, comments & feedback - please let me know!