Analyzing the Web: Downloading the Majestic Million, Setting up SQLite, Crawling the Web, and Generating Reports

Articles in Web Scraping, Data Analysis | By August R. Garcia

Published Wed, 24 Apr 2019 03:29:27 -0700 | Last updated Thu, 25 Apr 2019 09:14:10 -0700

The Majestic Million is a publicly available listing of the top one million domains, as evaluated by Majestic. These rankings, based on the number of referring domains on unique subnets (defined here) found by Majestic’s crawlers, are available in CSV format and can be used as a good starting point for analyzing the internet.

This post analyzes a number of topics using the Majestic Million downloaded at 12:30 AM US Pacific time, 23 April, 2019 as a starting point. As part one of two, this post covers setting up the Majestic Million for analysis in SQLite as well as how to generate various reports on the raw data. Part two will cover web crawling to gather additional data about these domains for further analysis.

Introduction and Setup

Importing the Majestic Million CSV to SQLite

Since the Majestic Million contains one million rows, each with 12 columns, working with it in CSV format is impractical. Google Sheets, for example, caps out at 5,000,000 cells per spreadsheet, despite being good at a lot of other things; at 12,000,000 cells, this dataset is more than double that limit. A database is much better suited for handling data at this scale, and SQLite is a good option for lightweight data processing and one-off projects. Converting the Majestic Million CSV to SQLite is straightforward:

  1. Download the Majestic Million and store it in some arbitrary folder.
  2. From the terminal, navigate to the directory where majestic_million.csv is stored.
  3. Run the command sqlite3 to open SQLite in the terminal. If it is not installed, run this command or search for your OS's instructions:
    • sudo apt-get install sqlite3
  4. Once you’ve opened SQLite in the terminal (your prompt will start with sqlite>), enter this command to tell SQLite to interpret input in the CSV format:
    • .mode csv
  5. Run this command to import the CSV:
    • .import majestic_million.csv majestic_million
    • (General form: .import csvname.csv name_to_use_for_db_table)
  6. You can use some or all of these commands to verify that the import worked:
    • List all tables: .tables
    • Show the table’s structure: .schema majestic_million
    • List all of the data in the table (will take a figurative minute to show all of this): select * from majestic_million;

Note: These instructions are for Linux. The process should be roughly identical on other operating systems.
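If you prefer to skip the interactive shell, the sqlite3 command-line tool also accepts dot-commands as arguments, so the import can be done in one line (a minimal sketch, assuming majestic_million.csv is in the working directory; exact behavior may vary by sqlite3 version):

sqlite3 majestic_million.db ".mode csv" ".import majestic_million.csv majestic_million"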

Fixing the Column Formats

You will notice that the command .schema majestic_million indicates that all columns are of the type TEXT, since CSV files do not have data types and store all data as strings. Because SQLite does not support modifying column data types, converting the non-text columns to the correct types requires rebuilding the table. Run the following series of commands:

sqlite> .output update.sql
sqlite> .dump
sqlite> drop table majestic_million;

This will export the instructions to recreate the table you just imported as an SQL file, rather than CSV. This file can then be modified and reimported with the correct column types. Open the update.sql file and adjust the CREATE TABLE statement at the top of the file to be as shown below:

CREATE TABLE majestic_million(
"GlobalRank" INTEGER,
"TldRank" INTEGER,
"Domain" TEXT,
"TLD" TEXT,
"RefSubNets" INTEGER,
"RefIPs" INTEGER,
"IDN_Domain" TEXT,
"IDN_TLD" TEXT,
"PrevGlobalRank" INTEGER,
"PrevTldRank" INTEGER,
"PrevRefSubNets" INTEGER,
"PrevRefIPs" INTEGER
);

Then:

  1. Reimport the database with the corrected column formats;
  2. Return to the default output target (the terminal); and
  3. Optionally save and/or make backup copies of the SQLite file, in case you want to modify the copy you’re working with.

To do these three steps, run these commands:

sqlite> .read update.sql
sqlite> .output stdout
sqlite> .save majestic_million.db

Note that, if you don’t want to go through this process of exporting and reimporting, you can also cast column data types on the fly within a query, as in this example:

select * from majestic_million where CAST(GlobalRank AS int) < 100001;

Working With SQLite: Exporting SQLite Queries to CSV

When working with SQLite, note that you can output the results of your SQL queries to CSV files with the following process:

  1. Toggle headers on so that your output file will have headers, unless you don’t want headers for some reason
  2. Make sure you’re in CSV mode
  3. Change the output target from printing directly to the terminal (stdout) to the filename you want to export to
  4. Run the query
  5. Once you’re done exporting and want to display output in the terminal again, change the output target back to stdout

These are the commands:

sqlite> .headers on
sqlite> .mode csv
sqlite> .output report-1.csv
sqlite> SELECT tld FROM majestic_million;
sqlite> .output stdout
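If you only need to redirect a single query, the sqlite3 shell also supports the .once command, which sends only the next query's output to the given file and then reverts to the terminal automatically (report-2.csv here is an arbitrary example filename):

sqlite> .headers on
sqlite> .mode csv
sqlite> .once report-2.csv
sqlite> SELECT Domain FROM majestic_million LIMIT 100;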

Basic Reports with SQLite

Now that your data is in an SQLite database, you can analyze it easily in a number of ways. Here are some basics to start with.

What TLDs are the most common?

To start with a basic query, here’s the SQL command to count the number of domains by TLD (.com, .net, etc.) for the twenty most common TLDs:

select TLD, count(TLD) as num_tlds from majestic_million GROUP BY TLD ORDER BY num_tlds DESC LIMIT 20;
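Since the table contains exactly one million rows, each TLD's share of the total can also be expressed as a percentage directly in the query; a minimal sketch along the same lines (the hard-coded 1000000 assumes the full, unmodified dataset):

select TLD, count(TLD) as num_tlds, round(count(TLD) * 100.0 / 1000000, 4) as pct from majestic_million GROUP BY TLD ORDER BY num_tlds DESC LIMIT 20;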

The Data

The result of the query above (with a row added for "Other TLDs") is included below as tab-separated values. These tab-separated data can be imported to a spreadsheet with Ctrl+A > Ctrl+C > Ctrl+V.

After exporting the data from the query above and porting it into Google Sheets, the chart below was created.

[Chart: TLD vs. Count of Number of TLDs - Majestic Million]

Do Major Sites Use Subdomains?

The Majestic Million considers subdomains to be valid entries. The query below can be used to segment the Majestic Million by number of subdomains:

select length(Domain)-length(replace(Domain, ".", "")) as num_subdomains, count(Domain) as count from majestic_million GROUP BY num_subdomains ORDER BY count;

Which results in this data:

It should be noted that the query above considers site.co.uk to have two subdomains. For context, these queries indicate that there are 24,718 [...].uk domains, 17,940 of which are .co.uk domains:

select count(Domain) from majestic_million WHERE domain like '%.uk';
select count(Domain) from majestic_million WHERE domain like '%.co.uk';
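To partially correct for second-level TLDs like .co.uk, one rough adjustment is to subtract a dot for domains matching that pattern, since SQLite evaluates a LIKE expression to 1 or 0. A sketch that handles only the .co.uk case (other second-level TLDs, such as .com.au, would need the same treatment):

select length(Domain)-length(replace(Domain, ".", ""))-(Domain LIKE '%.co.uk') as num_subdomains, count(Domain) as count from majestic_million GROUP BY num_subdomains ORDER BY count;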

How long is the average domain name?

To generate a report of the number of characters in domain names, ignoring the TLD’s length, the following query can be used:

select LENGTH(Domain)-LENGTH(TLD)-1 as raw_length, COUNT(Domain) as num_of_length from majestic_million GROUP BY raw_length ORDER BY raw_length DESC;

And this query gets the domain lengths for subdomains only:

select LENGTH(Domain)-LENGTH(TLD)-1 as raw_length, COUNT(Domain) as num_of_length from majestic_million WHERE length(Domain)-1 <> length(replace(Domain, ".", "")) GROUP BY raw_length ORDER BY raw_length;
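To answer the section's headline question directly, the average can also be computed in a single query; a minimal sketch using the same length expression as the reports above:

select avg(LENGTH(Domain)-LENGTH(TLD)-1) as avg_length from majestic_million;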

Data and Charts

After duct taping the data together in Google Sheets, the following data is found.

This data is plotted in the chart below:

Number of Domains by Character Count, Full View - Majestic Million

This second chart is the same data as above with the 281 domains longer than 36 characters truncated.

Number of Domains by Character Count, Outliers Truncated - Majestic Million

Do larger sites have shorter domain names?

It is generally a good idea to have a shorter domain name rather than a longer one, since they are often easier to use, spell, and remember. With this in mind, it is reasonable to hypothesize that there may be a correlation between position in the Majestic Million and domain name length. This can be tested fairly easily with SQL queries.

These two queries can be used to get the top 10,000 and 100,000 domains:

select * from majestic_million where CAST(GlobalRank AS int) < 10001;
select * from majestic_million where CAST(GlobalRank AS int) < 100001;

Which can then be used as subqueries in reports such as the following (shown for the top 100,000, restricted to non-subdomain entries):

select LENGTH(Domain)-LENGTH(TLD)-1 as raw_length, COUNT(Domain) as num_of_length from (select * from majestic_million where CAST(GlobalRank AS int) < 100001) WHERE length(Domain)-1 = length(replace(Domain, ".", "")) GROUP BY raw_length ORDER BY raw_length;
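Alternatively, the hypothesis can be tested in a single query by averaging domain length per rank bucket; a sketch that groups the million rows into ten buckets of 100,000 (this assumes GlobalRank was converted to INTEGER earlier, so the integer division truncates as intended):

select (GlobalRank-1)/100000 as rank_bucket, avg(LENGTH(Domain)-LENGTH(TLD)-1) as avg_length from majestic_million WHERE length(Domain)-1 = length(replace(Domain, ".", "")) GROUP BY rank_bucket ORDER BY rank_bucket;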

Data and Charts

Using the queries above and some duct tape through Google Sheets, the data is shown below:

When plotted below, it is clear that there is a (presumably non-causal) correlation between a site having a larger number of referring subnets and a site having a shorter domain name:

[Chart - Majestic Million]

This trend is clearer in the chart below, which compares the top 100 non-subdomains in the Majestic Million with the full set of all non-subdomains:

[Chart - Majestic Million]

Part 2, Web Scraping and Crawling

Analyzing the Majestic Million with the setup and techniques in this article is a good start to analyzing the web. To get more interesting data, basic web crawling techniques can be used to gather additional data about the sites in question, which can then be imported into the database for analysis. Read on for the thrilling conclusion to this series.

Create a Python Script

One of the easiest ways to crawl the web is to use Python, which allows you to easily do tasks like download images in bulk or to scrape data from arbitrary websites. Create a file in the same directory as your database file and name it something relevant, like mm_crawler.py.

# mm_crawler.py
# Created 22 April, 2019
import time  # used by the optional rate-limiting sleep below
import sqlite3
from sqlite3 import Error

import requests

db_file = "./majestic_million.db"

# Connect to the SQLite DB file
conn = None
try:
    conn = sqlite3.connect(db_file)
except Error as e:
    print(e)

if not conn:
    raise Exception("Error connecting to the database file: \t\t" + db_file)
cur = conn.cursor()

# Create any new DB columns needed to store the data that will be scraped
try:
    cur.execute("ALTER TABLE majestic_million ADD COLUMN ResponseSize integer")
except sqlite3.OperationalError:
    print("Columns already created, probably. Nice error handling.")

# Get all DB rows that have not been crawled yet
cur.execute("SELECT * FROM majestic_million WHERE ResponseSize IS NULL")
rows = cur.fetchall()

# Get column names
field_names = [i[0] for i in cur.description]

try:
    for row in rows:

        # Get data from the current row
        i = row[field_names.index('GlobalRank')]
        u = row[field_names.index('Domain')]

        # If there is an error fetching the URL, set the response size to 0
        try:
            # Timeout added so dead hosts don't hang the crawl indefinitely
            resp = requests.get("http://" + u, timeout=10)
            l = len(resp.content)
        except requests.RequestException:
            l = 0

        # Use the response data to update the SQLite table
        sql = "UPDATE majestic_million SET ResponseSize = ? WHERE GlobalRank = ?"
        cur.execute(sql, (l, i))

        # Debugging/monitoring statement
        print(str(i) + "\t" + u + "\t" + str(l))
        # time.sleep(0.100)
except (KeyboardInterrupt, SystemExit):
    print("Keyboard interrupt or system exit detected.")

# Push the changes to the database
conn.commit()
conn.close()

Run the Python Script

From the (Linux) terminal's command line, type python3 mm_crawler.py and hit enter to run the script. This code will:

  1. Loop through all of the domains in the Majestic Million. For each domain, it will record the response size in bytes in a newly added "ResponseSize" column. For domains that could not be accessed, it will handle the error and list the response size as 0.
  2. Upon looping through all 1,000,000 domains or receiving a keyboard interrupt, it will exit and push the updates to the SQLite database.
  3. If the script is run multiple times, it will only make requests to domains that it has not previously handled.

This general program structure can be used to add other columns to your database and to update them with whatever arbitrary data you want to scrape. An efficient solution for scraping a non-trivial amount of data should also implement multithreading to crawl multiple pages at the same time without waiting 80 years for particularly slow requests to either finish or time out.
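As a rough illustration of that multithreaded approach (a hypothetical sketch, not part of the original script; the worker count, timeout, and 1,000-row limit are arbitrary assumptions), something like the following fetches response sizes in parallel and writes the results back on the main thread, since a sqlite3 connection should not be shared across threads:

# mm_crawler_threaded.py -- hypothetical sketch, not part of the original post
import sqlite3
from concurrent.futures import ThreadPoolExecutor

import requests

db_file = "./majestic_million.db"

def fetch_size(domain):
    # Return the response body size in bytes, or 0 on any request error
    try:
        # Assumption: 10-second timeout so slow hosts don't stall a worker
        resp = requests.get("http://" + domain, timeout=10)
        return len(resp.content)
    except requests.RequestException:
        return 0

conn = sqlite3.connect(db_file)
cur = conn.cursor()

# Assumption: crawl the next 1,000 unhandled rows per run
cur.execute("SELECT GlobalRank, Domain FROM majestic_million "
            "WHERE ResponseSize IS NULL LIMIT 1000")
rows = cur.fetchall()

# Fetch in parallel; 20 workers is an arbitrary starting point
with ThreadPoolExecutor(max_workers=20) as pool:
    sizes = list(pool.map(fetch_size, (domain for _, domain in rows)))

# Write results back on the main thread, since a sqlite3 connection
# should not be shared across threads
for (rank, _), size in zip(rows, sizes):
    cur.execute("UPDATE majestic_million SET ResponseSize = ? WHERE GlobalRank = ?",
                (size, rank))
conn.commit()
conn.close()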

Using the New Data in a Report

After crawling the top 1,000 sites in the Majestic Million, we have some data to work from. It can be retrieved with this query, which returns the 1,000 rows crawled:

SELECT * FROM majestic_million WHERE ResponseSize IS NOT NULL AND GlobalRank <= 1000;

Packet Sizes for Top 1000 Sites in the Majestic Million

Here's the raw data:

And when shown as a scatterplot, we can see that there is no particularly interesting trend from the 1,000 sites crawled.

[Chart: Scatter Plot - Packet Size of Top 1000 Sites in Majestic Million]

Conclusions

Web crawling at its most basic level is relatively easy to do, as is generating reports on large datasets. Of course, depending on your use case, you can save a substantial amount of time by using an existing dataset like Common Crawl, which contains petabytes of "raw web page data, extracted metadata, and text extractions," a related topic to be discussed in the future.

Comments

Hash Brown (Staff), 24 April, 2019:

This is excellent!

August R. Garcia (Site Owner), 29 April, 2019:

The longest domain names in the Majestic Million (listed as global rank, domain, and length), most of which are expired and basically all of which are terrible garbage:

  • 255461    federacionlatinoamericanadecomunidadesterapeuticas.org    50
  • 667525    all-free-hardcore-xxx-groupsex-orgy-gang-bang-pics.com    50
  • 753102    marketing-craigslist-facebook-twitter-youtube-ebay.info    50
  • 618561    family-vacation-getaways-at-los-angeles-theme-parks.com    51
  • 630284    all-free-nude-old-granny-mature-women-xxx-porn-pics.com    51
  • 913290    faithhopeloveandlucksurvivedespiteawhiskeredaccomplice.com    54
  • 235513    xn-------43dacaaoejhcgera1chbbd2aihms8dk0etarekmzcwc8mzl.top    56
  • 506839    sofortrente-vergleich-sofort-rente-vergleichen-berechnen.de    56
  • 176215    produzioneinstallazionegrondaiecoperturatettisoedilsicilia.com    58
  • 199186    laboratoires-des-produits-pharmaceutiques-dafrique-du-nord.net    58
  • 204652    laboratoires-des-produits-pharmaceutiques-dafrique-du-nord.com    58
  • 952995    booneoakleyshouldbetheofficaladagencyofthecarolinapanthers.com    58
  • 457932    111111111111111111111111111111111111111111111111111111111111.com    60
  • 553847    fixconnectionsbluetoothaudiodeviceswirelessdisplayswindows10.net    60
  • 177973    yucaipa-beaumont-redlands-computer-repair-services-inlandempire.com    63
  • 202235    thelongestlistofthelongeststuffatthelongestdomainnameatlonglast.com    63
  • 340049    141592653589793238462643383279502884197169399375105820974944592.com    63
  • 472467    abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com    63
  • 848123    yellowbirdsdonthavewingsbuttheyflytomakeyouexperiencea3dreality.com    63
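The comment does not include the query used, but a report along these lines would reproduce it:

select GlobalRank, Domain, length(Domain) as len from majestic_million ORDER BY len DESC LIMIT 20;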