256 Kilobytes

[BASH, cURL] Yellow Pages Scraper: Fully Functional Script with Source Code

Articles in Web Scraping, Data Analysis | By August R. Garcia


What a nice, free YellowPages scraper.


Edit: When trying to scrape indefinitely (~100+ pages), there is currently some buggy behavior with the exit conditions. If/when an updated script is posted, this notice will be deleted.

Wow. It's another post about cURL and BASH, similar to this introductory post and this post on scraping DuckDuckGo search results.

Don't Do Anything Fucking Stupid or Illegal

See heading. As a general practice:

  • Set a reasonable delay between requests (this script waits for five seconds)
  • Set some kind of a user agent
  • Don't do some dumb-ass shit, such as blatantly committing copyright infringement by making a YellowPages.com clone

Also see the "Scraping Best Practices" section of the post on scraping DuckDuckGo search results.
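As a minimal sketch of what those practices look like in one place (the user-agent string, contact address, and delay value below are placeholders to substitute with your own details):

```shell
# Hedged sketch of a polite fetch helper: identify the bot via the
# user agent and throttle requests in a single spot, so a forgotten
# sleep can't slip in elsewhere.
BOT_UA="ExampleBot/1.0 (your-contact@example.com) - cURL"
REQUEST_DELAY=5   # seconds to wait after each request

fetch_politely() {
	curl -s -G --user-agent "$BOT_UA" "$1"
	sleep "$REQUEST_DELAY"   # wait before the next request is allowed
}
```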

Yellow Pages Scraper: Full Source Code


# xpath_wrapper: run one XPath query against the current card and append the result to the output file as a tab-separated field
# Passing the result through xargs is a quality hack to remove leading and trailing whitespace.
# -0 flag needed to avoid this error: "xargs: unmatched single quote; by default quotes are special to xargs unless you use the -0 option"
xpath_wrapper() {
	echo $biz   | xmllint --noout --html -xpath "$1" - 2>/dev/null | xargs -0 | tr '\r\n' '\t' | sed "s/\t\t/\t/g" >> $fn
}

parse_card() {
	# Store the current loop's match in a temporary variable
	biz=$(xmllint --html -xpath "//div[@class='v-card'][$i]" "temp2.html" 2>/dev/null)  &&

	echo $biz > current.html &&

	xpath_wrapper "//*[@class='business-name']/span/node()"                                   # business_name    
	echo $(( $i+( ($page_number-1)*30 ) )) | xargs -0 | tr '\r\n' '\t' | sed "s/\t\t/\t/g" >> $fn &&  # position           

	xpath_wrapper "string(//*[@class='business-name']/@href)"                                 # yp_full_profile    

	xpath_wrapper "string(//*[@class='media-thumbnail']//img/@src)"                           # image              

	xpath_wrapper "//p[@class='adr']/*/node()"                                                # full_address            
	xpath_wrapper "//p[@class='adr'][starts-with(node(), 'S')]/node()"                        # area            
	xpath_wrapper "//p[@class='adr']/span[@class='street-address']/node()"                    # street_address           
	xpath_wrapper "//p[@class='adr']/span[@class='locality']/node()"                          # locality            
	xpath_wrapper "//p[@class='adr']/span[3]/node()"                                          # state            
	xpath_wrapper "//p[@class='adr']/span[4]/node()"                                          # zip            

	xpath_wrapper "//div[@class='street-address']/node()"                                     # street_address_alt_format           
	xpath_wrapper "//div[@class='locality']/node()"                                           # locality_alt_format            

	xpath_wrapper "//*[self::div or self::li][contains(@class, 'phone')][1]/node()"           # phone_primary              
	xpath_wrapper "//*[self::div or self::li][contains(@class, 'phone')][2]/node()"           # phone_alt_1              
	xpath_wrapper "//*[self::div or self::li][contains(@class, 'phone')][3]/node()"           # phone_alt_2              

	xpath_wrapper "//div[@class='categories']/a[1]/node()"                                    # main_category           
	xpath_wrapper "//div[@class='categories']/a[2]/node()"                                    # subcategory_1              
	xpath_wrapper "//div[@class='categories']/a[3]/node()"                                    # subcategory_2              
	xpath_wrapper "//div[@class='categories']/a[4]/node()"                                    # subcategory_3              

	xpath_wrapper "string(//div[@class='links']/a[@class='track-visit-website']/@href)"       # website             
	xpath_wrapper "string(//div[@class='links']/a[@class='track-map-it directions']/@href)"   # directions            
	xpath_wrapper "string(//div[@class='links']/a[@class='track-custom-link']/@href)"         # custom           
	xpath_wrapper "string(//div[@class='links']/a[@class='track-more-info']/@href)"           # more_info
	xpath_wrapper "string(//div[@class='links']/a[@class='menu']/@href)"                      # services 

	xpath_wrapper "//span[@class='count']/node()"                                             # aggregate_score    
	xpath_wrapper "string(//div[contains(@class, 'result-rating')]/@class)"                   # num_reviews        
	xpath_wrapper "//span[contains(@class,'bbb-rating')]/node()"                              # bbb_rating    

	echo $niche             | xargs -0 | tr '\r\n' '\t' | sed "s/\t\t/\t/g" >> $fn &&  # search_terms       
	echo $geo               | xargs -0 | tr '\r\n' '\t' | sed "s/\t\t/\t/g" >> $fn &&  # geo_location_terms  

	date                    | xargs -0 | tr '\r\n' '\t' | sed "s/\t\t/\t/g" >> $fn &&  # crawl_timestamp         
	date +%s                | xargs -0 | tr '\r\n' '\t' | sed "s/\t\t/\t/g" >> $fn &&  # crawl_unix_timestamp    
	echo $page_number       | xargs -0 | tr '\r\n' '\t' | sed "s/\t\t/\t/g" >> $fn &&  # page_number    

	# Add a line break to the file
	echo "" >> $fn
}

# Create a header row for the output TSV file
create_tsv_header_row() {
	echo -ne "business_name\t"             >> $fn     
	echo -ne "position\t"                  >> $fn     
	echo -ne "yp_full_profile\t"           >> $fn     
	echo -ne "image\t"                     >> $fn     
	echo -ne "full_address\t"              >> $fn     
	echo -ne "area\t"                      >> $fn     
	echo -ne "street_address\t"            >> $fn     
	echo -ne "locality\t"                  >> $fn     
	echo -ne "state\t"                     >> $fn     
	echo -ne "zip\t"                       >> $fn     
	echo -ne "street_address_alt_format\t" >> $fn     
	echo -ne "locality_alt_format\t"       >> $fn     
	echo -ne "phone_primary\t"             >> $fn     
	echo -ne "phone_alt_1\t"               >> $fn     
	echo -ne "phone_alt_2\t"               >> $fn     
	echo -ne "main_category\t"             >> $fn     
	echo -ne "subcategory_1\t"             >> $fn     
	echo -ne "subcategory_2\t"             >> $fn     
	echo -ne "subcategory_3\t"             >> $fn     
	echo -ne "website\t"                   >> $fn     
	echo -ne "directions\t"                >> $fn     
	echo -ne "custom\t"                    >> $fn     
	echo -ne "more_info\t"                 >> $fn     
	echo -ne "services\t"                  >> $fn     
	echo -ne "aggregate_score\t"           >> $fn     
	echo -ne "num_reviews\t"               >> $fn     
	echo -ne "bbb_rating\t"                >> $fn     
	echo -ne "search_terms\t"              >> $fn     
	echo -ne "geo_location_terms\t"        >> $fn     
	echo -ne "crawl_timestamp\t"           >> $fn     
	echo -ne "crawl_unix_timestamp\t"      >> $fn     
	echo -ne "page_number\t"               >> $fn     
	echo ""                                >> $fn
}

# Get the URL's contents and store it in a temporary file (temp.html); AND
# Put only the matches into a separate temporary file (temp2.html)
curl_wrapper() {
	curl -G --user-agent "YOUR INFO HERE - cURL"          \
		--data-urlencode "search_terms=$niche"              \
		--data-urlencode "geo_location_terms=$geo"          \
		--data-urlencode "page=$page_number"                \
		https://www.yellowpages.com/search > temp.html

	xmllint --html -xpath "//div[@class='search-results organic']//div[@class='v-card']" "temp.html" 2>/dev/null > "temp2.html"
}

max_pages=20                    # The depth of pages to crawl (ex: 2 will crawl [up to] pages one and two). Set to 99999 or similar to crawl all pages.
fn="yp/yp-output--$(date +%s).tsv" # The filename for the final output
ln=0                               # The current line number for the output file
create_tsv_header_row              # Write the header row once at the top of the output file


# Loop through all lines/keywords
# niches.txt is a file with one term to search for on each line (Ex: "Dentist")
# cities.txt is a file with one city/location term per line     (Ex: "Portland, OR")
# https://www.256kilobytes.com/content/show/10334/local-seo-resources-niche-list-and-cities-list
cat cities.txt | while read geo ; do
	cat niches.txt | while read niche; do

		page_number=0  # The page to start on (minus one)
		while true ; do 
			ln=$((ln+1))                    # Increment the line counter by one
			page_number=$(($page_number+1)) # The current YP result page number to crawl 

			start_time=$(date +%s)          # To track time per loop 
			(( $page_number <= $max_pages ))          || break           # If this page exceeds the maximum page number to crawl, exit the loop
			curl_wrapper                              # Fetch this results page into temp.html and extract the matches into temp2.html

			curl_complete=$(date +%s)          # To track time per loop 
			[ -s temp2.html ]                         || break           # If the current page has no results, exit the loop 
			cmp --silent temp2.html prev_matches.html && break           # If same matches as the previous loop, exit the loop 
			NUM_MATCHES=$(xmllint --html -xpath "count(//div[@class='v-card'])" "temp2.html" 2>/dev/null) 
			for i in $( seq 1 $NUM_MATCHES ) ; do parse_card ; done ;    # Loop through all matches
			cat temp2.html > prev_matches.html

			echo "=============================="
			echo "GEO:     $geo                                     "
			echo "Niche:   $niche                                   "
			echo "Page:    $page_number                             "
			echo "cURL:    $(( $curl_complete-$start_time )) Seconds"
			echo "Parse:   $(( $(date +%s)-$curl_complete )) Seconds"
			echo "Total:   $(( $(date +%s)-$start_time    )) Seconds"
			echo "=============================="
			echo "Waiting for five seconds before next request..."
			echo ""

			sleep 5 # Wait five seconds before making another cURL request to the YP server to prevent denial-of-service-ing it 
		done
	done
done

# Decode HTML entities, such as &nbsp; and &amp;
cat $fn > temp3.tsv
cat temp3.tsv | php -r 'while(($line=fgets(STDIN)) !== FALSE) echo html_entity_decode($line, ENT_QUOTES|ENT_HTML401);' > $fn
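The xargs pass that the script's comments mention can be seen in isolation. Plain xargs trims (and also collapses internal runs of) whitespace, but chokes on a lone single quote, which is common in business names; the -0 flag disables quote processing so the value passes through intact:

```shell
# Plain xargs re-splits its input into words and echo rejoins them
# with single spaces, which trims leading/trailing whitespace:
echo "   hello   world   " | xargs     # prints: hello world

# A single quote would make plain xargs fail with "unmatched single
# quote"; with -0, quotes are not special and pass through:
echo "Joe's Pizza" | xargs -0          # prints: Joe's Pizza
```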

How to Run the Yellow Pages Scraper

  1. Copy-paste the code above into a text file and save it as yp-scraper.sh
    • Edit the line that reads  curl -G --user-agent "YOUR INFO HERE - cURL" to include a user-agent that identifies your crawler/bot.
  2. In the same folder, create a text file called "cities.txt" and include a list of cities to crawl with one per line. Ex:
    • Chicago, IL
    • New York, NY
  3. In the same folder, create a text file called "niches.txt" and include a list of niches/keywords to search with one per line. Ex:
    • Dentists
    • Lawyers
    • Gardeners
  4. In the same folder, create an empty folder called "yp" to store the output files.
  5. To run the script:
    • On Linux, open a terminal and navigate to the directory/folder that contains the script. Enter the command bash yp-scraper.sh. You may need to install miscellaneous dependencies (e.g., curl, xmllint from libxml2, and php).
    • On Windows, macOS, or other operating systems, follow steps 1 through 4, then look up how to run a BASH script from Windows/macOS. It should be basically the same as the bullet point above for Linux.
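For reference, steps 2 through 4 amount to the following commands, run from the folder that contains yp-scraper.sh (the city and niche values are the examples from the steps above):

```shell
# Create the two input files, one entry per line
printf '%s\n' "Chicago, IL" "New York, NY"     > cities.txt
printf '%s\n' "Dentists" "Lawyers" "Gardeners" > niches.txt

# Create the output folder the script writes its .tsv files into
mkdir -p yp

# Then run the scraper (step 5):
# bash yp-scraper.sh
```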

Finding City and Niche Lists

Note that there are large lists of local business niches and cities in this thread, which is located in the subscribers section of the 256 Kilobytes site. Those lists were originally part of a different eBook on keyword research, but they will also work for this particular use case.

Ending the Script Early

The script can be ended early by pressing CTRL+C. Note that if you do this, the final lines of the script:

cat $fn > temp3.tsv
cat temp3.tsv | php -r 'while(($line=fgets(STDIN)) !== FALSE) echo html_entity_decode($line, ENT_QUOTES|ENT_HTML401);' > $fn

will not run, which means that some characters like "&nbsp;" will be left in the result rather than being decoded. If needed or desired, you can run those two lines directly from the terminal; replace "$fn" with your output file name if necessary.
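That PHP one-liner is just decoding HTML entities. If PHP isn't installed, a rough sed-based stand-in (covering only a handful of common entities, with &amp; handled last so nothing gets double-decoded) looks like this; the decode_entities function name is made up for the example:

```shell
# Hypothetical fallback: decode a few common HTML entities with sed.
# &amp; must come last, otherwise "&amp;lt;" would decode twice.
decode_entities() {
	sed -e 's/&nbsp;/ /g' \
	    -e 's/&lt;/</g'   \
	    -e 's/&gt;/>/g'   \
	    -e 's/&quot;/"/g' \
	    -e 's/&amp;/\&/g'
}

echo 'Smith &amp; Sons&nbsp;LLC' | decode_entities   # prints: Smith & Sons LLC
```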

In Conclusion

Hell yeah. Friendship ended with Google Sheets and IMPORTXML. Now cURL and BASH are my best friend.


Edit History

• [2019-07-05 23:22 PDT] August R. Garcia (1 year ago)
Posted at 05 July, 2019, 23:22 PDT

