Analyzing the Web: Downloading the Majestic Million, Setting up SQLite, Crawling the Web, and Generating Reports
By August R. Garcia | Web Scraping, Data Analysis
- Introduction and Setup
- Importing the Majestic Million CSV to SQLite
- Fixing the Column Formats
- Working With SQLite: Exporting SQLite Queries to CSV
- Basic Reports with SQLite
- What TLDs are the most common?
- The Data
- [Number of Sites by Subdomain Count] Do Major Sites Use Subdomains?
- [Number of Domains by Character Count] How long is the average domain name?
- Data and Charts
- Do larger sites have shorter domain names?
- Data and Charts
- Part 2, Web Scraping and Crawling
- Create a Python Script
- Run the Python Script
- Using the New Data in a Report
- Packet Sizes for Top 1000 Sites in the Majestic Million
- Conclusions
The Majestic Million is a publicly available listing of the top one million domains, as evaluated by Majestic. These rankings, based on the number of referring domains on unique subnets (defined here) found by Majestic’s crawlers, are available in CSV format and can be used as a good starting point for analyzing the internet.
This post analyzes a number of topics using the Majestic Million downloaded at 12:30 AM US Pacific time, 23 April, 2019 as a starting point. As part one of two, this post covers setting up the Majestic Million for analysis in SQLite as well as how to generate various reports on the raw data. Part two will cover web crawling to gather further data for these domains, which can be analyzed further.
Introduction and Setup
Importing the Majestic Million CSV to SQLite
Since the Majestic Million contains one million rows, each with 12 columns, working with it in CSV format is impractical. Google Sheets, for example, caps out at 5,000,000 cells per spreadsheet, despite being good at a lot of other things. A database is much better suited for handling data at this scale, and SQLite is a good option for lightweight data processing and one-off projects. Converting the Majestic Million CSV to SQLite is straightforward:
- Download the Majestic Million and store it in some arbitrary folder.
- From the terminal, navigate to the directory where majestic_million.csv is stored.
- Run the command sqlite3 to open SQLite in the terminal. If it is not installed, run this command or search for your OS’ instructions: sudo apt-get install sqlite3
- Once you’ve opened SQLite in the terminal (your prompt will start with sqlite>), enter this command to tell SQLite to interpret input in the CSV format:
- .mode csv
- Run this command to import the CSV (the general form is .import csvname.csv name_to_use_for_db_table):
- .import majestic_million.csv majestic_million
- You can use some or all of these commands to verify that the import worked:
- List all tables:
.tables
- Show the table’s structure:
.schema majestic_million
- List all of the data in the table (will take a figurative minute to show all of this):
select * from majestic_million;
Note: These instructions are for Linux. The process should be roughly identical on other operating systems.
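Put together, the whole import session looks something like this (a minimal sketch; it assumes majestic_million.csv is in the current directory):

$ sqlite3
sqlite> .mode csv
sqlite> .import majestic_million.csv majestic_million
sqlite> .tables
majestic_million
sqlite> .schema majestic_million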
Fixing the Column Formats
You will notice that the command .schema majestic_million indicates that all columns are of the type TEXT, since CSV files do not have data types and store all data as strings. Since SQLite does not support modifying column data types in place, converting the non-text columns to the correct types requires dumping, editing, and reimporting the table. Run the following series of commands:
sqlite> .output update.sql
sqlite> .dump
sqlite> drop table majestic_million;
This will export the instructions to create the database you just imported as an SQL file, rather than CSV. This can then be modified and reimported in the correct format. Open the update.sql file and adjust the CREATE TABLE statement at the top of the file to be as shown below:
CREATE TABLE majestic_million(
"GlobalRank" INTEGER,
"TldRank" INTEGER,
"Domain" TEXT,
"TLD" TEXT,
"RefSubNets" INTEGER,
"RefIPs" INTEGER,
"IDN_Domain" TEXT,
"IDN_TLD" TEXT,
"PrevGlobalRank" INTEGER,
"PrevTldRank" INTEGER,
"PrevRefSubNets" INTEGER,
"PrevRefIPs" INTEGER
);
Then:
- Reimport the database with the corrected column formats;
- Return to the default output target (the terminal); and
- Optionally save and/or make backup copies of the SQLite file, in case you want to modify the copy you’re working with.
To do these three steps, run these commands:
sqlite> .read update.sql
sqlite> .output stdout
sqlite> .save majestic_million.db
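To confirm the reimport worked, you can check the schema again and run a quick sanity query; both should now reflect the INTEGER column types:

sqlite> .schema majestic_million
sqlite> select GlobalRank, Domain from majestic_million limit 5;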
Note that, if you don’t want to go through this process of exporting and reimporting, you can also cast column data types on the fly, as in this example:
select * from majestic_million where CAST(GlobalRank AS int) < 100001;
Working With SQLite: Exporting SQLite Queries to CSV
When working with SQLite, note that you can output the result of your SQL queries to CSV files with the following process:
- Toggle headers on so that your output file will have headers, unless you don’t want headers for some reason
- Make sure you’re in CSV mode
- Change the output location from printing directly to the terminal (stdout) to the filename you want to export to
- Run the query
- Once you’re done exporting and want to display output in the console again, change the output target back to stdout
These are the commands:
sqlite> .headers on
sqlite> .mode csv
sqlite> .output report-1.csv
sqlite> SELECT tld FROM majestic_million;
sqlite> .output stdout
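As a side note, the same export can also be done non-interactively in one line from the shell, which is convenient for scripting (assuming the database was saved as majestic_million.db):

$ sqlite3 -header -csv majestic_million.db "SELECT tld FROM majestic_million;" > report-1.csv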
Basic Reports with SQLite
Now that your data is in an SQLite database, you can analyze it easily in a number of ways. Here are some basics to start with.
What TLDs are the most common?
To start with a basic query, here’s the SQL command to list the number of results listed by their TLD (.com, .net, etc) for the top twenty most common TLDs:
select TLD, count(TLD) as num_tlds from majestic_million GROUP BY TLD ORDER BY num_tlds DESC LIMIT 20;
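A slight variant of the same query (a sketch, not part of the original report) also reports each TLD's share of the full million, which makes the results easier to read at a glance:

select TLD, count(TLD) as num_tlds, round(count(TLD) * 100.0 / 1000000, 2) as pct_of_total from majestic_million GROUP BY TLD ORDER BY num_tlds DESC LIMIT 20;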
The Data
The result of the query above (with a row added for "Other TLD") is included below as tab-separated values. This tab-separated data can be imported into a spreadsheet with Ctrl+A > Ctrl+C > Ctrl+V.
After exporting the data from the query above and importing it into Google Sheets, the chart below was created.
[Number of Sites by Subdomain Count] Do Major Sites Use Subdomains?
The Majestic Million considers subdomains to be valid entries. The query below can be used to segment the Majestic Million by number of subdomains:
select length(Domain)-length(replace(Domain, '.', '')) as num_subdomains, count(Domain) as count from majestic_million GROUP BY num_subdomains ORDER BY count;
Which results in this data:
It should be noted that the query above considers site.co.uk to have two subdomains. For context, these queries indicate that there are 24,718 [...].uk domains, 17,940 of which are .co.uk domains:
select count(Domain) from majestic_million WHERE domain like '%.uk';
select count(Domain) from majestic_million WHERE domain like '%.co.uk';
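One rough way to adjust for this is to subtract a dot for known two-level suffixes; the sketch below handles only .co.uk, rather than the full public-suffix list:

select length(Domain)-length(replace(Domain, '.', '')) - (CASE WHEN Domain LIKE '%.co.uk' THEN 1 ELSE 0 END) as num_subdomains, count(Domain) as count from majestic_million GROUP BY num_subdomains ORDER BY count;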
[Number of Domains by Character Count] How long is the average domain name?
To generate a report of the number of characters in domain names, ignoring the TLD’s length, the following query can be used:
select LENGTH(Domain)-LENGTH(TLD)-1 as raw_length, COUNT(Domain) as num_of_length from majestic_million GROUP BY raw_length ORDER BY raw_length DESC;
And this query to get the domain lengths for only subdomains:
select LENGTH(Domain)-LENGTH(TLD)-1 as raw_length, COUNT(Domain) as num_of_length from majestic_million WHERE length(Domain)-1 <> length(replace(Domain, '.', '')) GROUP BY raw_length ORDER BY raw_length;
Data and Charts
After duct taping the data together in Google Sheets, the following data is found.
This data is plotted in the chart below:
This second chart is the same data as above with the 281 domains longer than 36 characters truncated.
Do larger sites have shorter domain names?
It is generally a good idea to have a shorter domain name rather than a longer one, since they are often easier to use, spell, and remember. With this in mind, it is reasonable to hypothesize that there may be a correlation between position in the Majestic Million and domain name length. This can be tested fairly easily with SQL queries.
These two queries can be used to get the top 10,000 and 100,000 domains:
select * from majestic_million where CAST(GlobalRank AS int) < 10000;
select * from majestic_million where CAST(GlobalRank AS int) < 100000;
Which can then be used as subqueries in the following reports:
select LENGTH(Domain)-LENGTH(TLD)-1 as raw_length, COUNT(Domain) as num_of_length from (select * from majestic_million where CAST(GlobalRank AS int) < 100000) WHERE length(Domain)-1 = length(replace(Domain, '.', '')) GROUP BY raw_length ORDER BY raw_length;
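Another way to test the same hypothesis (an assumed variant, not in the original post) is to bucket domains by rank and compare average name lengths directly; with the INTEGER columns set up earlier, integer division produces ten buckets of 100,000 domains each:

select (GlobalRank - 1) / 100000 as rank_bucket, avg(LENGTH(Domain)-LENGTH(TLD)-1) as avg_length from majestic_million WHERE length(Domain)-1 = length(replace(Domain, '.', '')) GROUP BY rank_bucket ORDER BY rank_bucket;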
Data and Charts
Using the queries above and some duct tape through Google Sheets, the data is shown below:
When plotted below, it is clear that there is a (presumably non-causal) correlation between a site having a larger number of referring subnets and a site having a shorter domain name:
This trend is clearer in the chart below, which compares the top 100 non-subdomains in the Majestic million with the full set of all non-subdomains:
Part 2, Web Scraping and Crawling
Analyzing the Majestic Million with the setup and techniques in this article is a good start to analyzing the web. To get more interesting data, basic web crawling techniques can be used to gather additional data about the sites in question, which can then be imported into the database for analysis. Read on for the thrilling conclusion to this series.
Create a Python Script
One of the easiest ways to crawl the web is to use Python, which allows you to easily do tasks like downloading images in bulk or scraping data from arbitrary websites. Create a file in the same directory as your database file and name it something relevant, like mm_crawler.py.
# mm_crawler.py
# Created 22 April, 2019
import time  # only needed if the optional sleep below is enabled
import sqlite3
from sqlite3 import Error

import requests

db_file = "./majestic_million.db"

# Connect to the SQLite DB file
conn = None
try:
    conn = sqlite3.connect(db_file)
except Error as e:
    print(e)
if not conn:
    raise Exception("Error connecting to the database file: \t\t" + db_file)
cur = conn.cursor()

# Create any new DB columns needed to store the data that will be scraped
try:
    cur.execute("ALTER TABLE majestic_million ADD COLUMN ResponseSize integer")
except sqlite3.OperationalError:
    print("Columns already created, probably. Nice error handling.")

# Get all DB rows that have not been crawled yet
cur.execute("SELECT * FROM majestic_million WHERE ResponseSize IS NULL")
rows = cur.fetchall()

# Get column names
field_names = [i[0] for i in cur.description]

try:
    for row in rows:
        # Get data from the current row
        i = row[field_names.index('GlobalRank')]
        u = row[field_names.index('Domain')]
        # If there is an error fetching the URL, set the response size to 0
        try:
            resp = requests.get("http://" + u)
            l = len(resp.content)
        except requests.RequestException:
            l = 0
        # Use the response data to update the SQLite table
        sql = "UPDATE majestic_million SET ResponseSize = ? WHERE GlobalRank = ?"
        cur.execute(sql, (l, i))
        # Debugging/monitoring statement
        print(str(i) + "\t" + u + "\t" + str(l))
        # time.sleep(0.100)
except (KeyboardInterrupt, SystemExit):
    print("Keyboard interrupt or system exit detected.")

# Push the changes to the database
conn.commit()
conn.close()
Run the Python Script
From the (Linux) terminal's command line, type python mm_crawler.py and hit enter to run the script. This code will:
- Loop through all of the domains in the Majestic Million. For each domain, it will record the response size in character count in a newly added "ResponseSize" column. For domains that could not be accessed, it will handle the error and list the response size as 0.
- Upon looping through all 1,000,000 domains or receiving a keyboard interrupt, it will exit and push the updates to the SQLite database.
- If the script is run multiple times, it will only make requests to domains that it has not previously handled.
This general program structure can be used to add other columns to your database and to update them with whatever arbitrary data you want to scrape. An efficient solution for scraping a non-trivial amount of data should also implement multithreading to crawl multiple pages at the same time without waiting 80 years for particularly slow requests to either finish or timeout.
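A minimal sketch of that multithreaded approach using Python's concurrent.futures is below; the fetch_size helper, the 20-worker pool, and the 10-second timeout are illustrative assumptions rather than part of the original script:

# mm_crawler_threaded.py -- threaded variant (sketch)
import sqlite3
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch_size(domain):
    # Hypothetical helper: fetch one homepage, return its size in bytes (0 on error)
    try:
        resp = requests.get("http://" + domain, timeout=10)
        return len(resp.content)
    except requests.RequestException:
        return 0

conn = sqlite3.connect("./majestic_million.db")
cur = conn.cursor()
cur.execute("SELECT GlobalRank, Domain FROM majestic_million "
            "WHERE ResponseSize IS NULL ORDER BY GlobalRank LIMIT 1000")
rows = cur.fetchall()

# Fetch pages in parallel; pool.map() preserves input order
with ThreadPoolExecutor(max_workers=20) as pool:
    sizes = list(pool.map(fetch_size, [domain for _, domain in rows]))

# Write results back from the main thread, since an SQLite connection
# should not be shared across threads
for (rank, _), size in zip(rows, sizes):
    cur.execute("UPDATE majestic_million SET ResponseSize = ? WHERE GlobalRank = ?",
                (size, rank))
conn.commit()
conn.close()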
Using the New Data in a Report
After crawling the top 1,000 sites in the Majestic Million, we have some data to work with, which we can get with the query below; it returns the 1,000 rows that were crawled:
SELECT * FROM majestic_million WHERE ResponseSize IS NOT NULL AND GlobalRank <= 1000;
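Before charting, some quick aggregate statistics can be pulled straight from SQLite (a sketch; note that failed fetches were recorded as 0, which will drag down min() and the average):

SELECT count(*), avg(ResponseSize), min(ResponseSize), max(ResponseSize) FROM majestic_million WHERE ResponseSize IS NOT NULL AND GlobalRank <= 1000;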
Packet Sizes for Top 1000 Sites in the Majestic Million
Here's the raw data:
And when shown as a scatterplot, we can see that there is no particularly interesting trend from the 1,000 sites crawled.
Conclusions
Web crawling at its most basic level is relatively easy to do, as is generating reports on large datasets. Of course, depending on your use case, you can save a substantial amount of time by using an existing data set like Common Crawl, which contains petabytes of "raw web page data, extracted metadata, and text extractions"; that is a related topic to be discussed in the future.
Comments

Hash Brown (24 April, 2019):
This is excellent!

August R. Garcia (29 April, 2019):
The longest domain names in the Majestic Million, most of which are expired and basically all of which are terrible garbage (columns: GlobalRank, domain, and name length excluding the TLD):
255461 federacionlatinoamericanadecomunidadesterapeuticas.org 50
667525 all-free-hardcore-xxx-groupsex-orgy-gang-bang-pics.com 50
753102 marketing-craigslist-facebook-twitter-youtube-ebay.info 50
618561 family-vacation-getaways-at-los-angeles-theme-parks.com 51
630284 all-free-nude-old-granny-mature-women-xxx-porn-pics.com 51
913290 faithhopeloveandlucksurvivedespiteawhiskeredaccomplice.com 54
235513 xn-------43dacaaoejhcgera1chbbd2aihms8dk0etarekmzcwc8mzl.top 56
506839 sofortrente-vergleich-sofort-rente-vergleichen-berechnen.de 56
176215 produzioneinstallazionegrondaiecoperturatettisoedilsicilia.com 58
199186 laboratoires-des-produits-pharmaceutiques-dafrique-du-nord.net 58
204652 laboratoires-des-produits-pharmaceutiques-dafrique-du-nord.com 58
952995 booneoakleyshouldbetheofficaladagencyofthecarolinapanthers.com 58
457932 111111111111111111111111111111111111111111111111111111111111.com 60
553847 fixconnectionsbluetoothaudiodeviceswirelessdisplayswindows10.net 60
177973 yucaipa-beaumont-redlands-computer-repair-services-inlandempire.com 63
202235 thelongestlistofthelongeststuffatthelongestdomainnameatlonglast.com 63
340049 141592653589793238462643383279502884197169399375105820974944592.com 63
472467 abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com 63
848123 yellowbirdsdonthavewingsbuttheyflytomakeyouexperiencea3dreality.com 63