{"id":176,"date":"2020-07-01T17:00:28","date_gmt":"2020-07-01T17:00:28","guid":{"rendered":"https:\/\/emersoncode.com\/blog\/?p=176"},"modified":"2023-02-03T16:33:52","modified_gmt":"2023-02-03T16:33:52","slug":"building-bootleg-builtwith","status":"publish","type":"post","link":"https:\/\/emersoncode.com\/blog\/building-bootleg-builtwith\/","title":{"rendered":"Building Bootleg BuiltWith"},"content":{"rendered":"\n<h5 class=\"wp-block-heading\">How I outperformed BuiltWith and PublicWWW&#8217;s free plans with a couple hours and 30 lines of code.<\/h5>\n\n\n\n<figure class=\"wp-block-image size-large is-style-default\"><img decoding=\"async\" loading=\"lazy\" width=\"954\" height=\"505\" src=\"https:\/\/emersoncode.com\/blog\/wp-content\/uploads\/2020\/07\/Screen-Shot-2020-07-01-at-12.16.58-PM.png\" alt=\"\" class=\"wp-image-201\" srcset=\"https:\/\/emersoncode.com\/blog\/wp-content\/uploads\/2020\/07\/Screen-Shot-2020-07-01-at-12.16.58-PM.png 954w, https:\/\/emersoncode.com\/blog\/wp-content\/uploads\/2020\/07\/Screen-Shot-2020-07-01-at-12.16.58-PM-300x159.png 300w, https:\/\/emersoncode.com\/blog\/wp-content\/uploads\/2020\/07\/Screen-Shot-2020-07-01-at-12.16.58-PM-768x407.png 768w\" sizes=\"(max-width: 954px) 100vw, 954px\" \/><\/figure>\n\n\n\n<p>What if you wanted to find a list of websites that are using a specific piece of technology? Like if you wanted a list of all websites using WordPress? <strong>Or in my case, all websites built with Shopify?<\/strong><\/p>\n\n\n\n<p>You might turn to <a href=\"http:\/\/builtwith.com\/\">BuiltWith<\/a> or <a href=\"https:\/\/publicwww.com\">PublicWWW<\/a>. These sites scan the source code of websites to determine the technology the website is built on. <\/p>\n\n\n\n<p>Very useful and cool. 
Until you see the pricing pages:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-style-default\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"517\" src=\"https:\/\/emersoncode.com\/blog\/wp-content\/uploads\/2020\/06\/Screen-Shot-2020-06-30-at-4.37.55-PM-1024x517.png\" alt=\"\" class=\"wp-image-188\" srcset=\"https:\/\/emersoncode.com\/blog\/wp-content\/uploads\/2020\/06\/Screen-Shot-2020-06-30-at-4.37.55-PM-1024x517.png 1024w, https:\/\/emersoncode.com\/blog\/wp-content\/uploads\/2020\/06\/Screen-Shot-2020-06-30-at-4.37.55-PM-300x151.png 300w, https:\/\/emersoncode.com\/blog\/wp-content\/uploads\/2020\/06\/Screen-Shot-2020-06-30-at-4.37.55-PM-768x388.png 768w, https:\/\/emersoncode.com\/blog\/wp-content\/uploads\/2020\/06\/Screen-Shot-2020-06-30-at-4.37.55-PM.png 1214w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Now, let me just say &#8211; I get it. <\/p>\n\n\n\n<p>The work they&#8217;re doing ain&#8217;t easy, and the pricing is justifiable, <strong>but since I can code, how about I just build my own for free?<\/strong><br><\/p>\n\n\n\n<p><strong>The way I see it, this problem has two components:<\/strong><\/p>\n\n\n\n<ol>\n<li>I need a script that can automatically check, when supplied a website domain, if that website is using Shopify or not.<\/li>\n\n\n\n<li>I need a really big list of website domains to run the first script against.<br><\/li>\n<\/ol>\n\n\n\n<p>Once I have both of those pieces, I can put them together to make my Bootleg Builtwith. 
<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Part 1: Determining if a certain website uses Shopify<\/h2>\n\n\n\n<p>Let&#8217;s start with the first component: When given a website URL, I need a reliable method to check if that website uses Shopify.<\/p>\n\n\n\n<p>The simplest way to check if a website is using Shopify is to inspect the website source, and look for code that only a Shopify store would use.<\/p>\n\n\n\n<p>Now, I already know of such a snippet that is unique to Shopify stores: It&#8217;s &#8216;trekkie&#8217; &#8211; which is what Shopify calls their custom analytics solution:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-style-default\"><img decoding=\"async\" loading=\"lazy\" width=\"753\" height=\"256\" src=\"https:\/\/emersoncode.com\/blog\/wp-content\/uploads\/2020\/06\/Screen-Shot-2020-06-30-at-3.14.56-PM.png\" alt=\"\" class=\"wp-image-180\" srcset=\"https:\/\/emersoncode.com\/blog\/wp-content\/uploads\/2020\/06\/Screen-Shot-2020-06-30-at-3.14.56-PM.png 753w, https:\/\/emersoncode.com\/blog\/wp-content\/uploads\/2020\/06\/Screen-Shot-2020-06-30-at-3.14.56-PM-300x102.png 300w\" sizes=\"(max-width: 753px) 100vw, 753px\" \/><\/figure>\n\n\n\n<p>Cool. Let&#8217;s use that as a base point and start coding.<\/p>\n\n\n\n<p>Let&#8217;s write a dead simple, 15-line Ruby script to kick this off:<\/p>\n\n\n\n<pre><code lang=\"ruby\" class=\"language-ruby\">#~\/desktop\/is-shopify-store.rb\n\nrequire 'open-uri'\n\n# Prompt me for a URL \nputs \"What URL do you want to check?\"\nurl = gets.chomp\n\n# Fetch the source of the URL I supplied above\n# (URI.open, since a bare open() no longer fetches URLs on Ruby 3+)\nhtml = URI.open(url).read\n\n# Checks the source code for the Shopify specific code snippet\nif html.include? \"var trekkie = window.ShopifyAnalytics.lib\"\n\tputs 'This site uses Shopify'\nelse\n\tputs 'This site does not use Shopify'\nend \n <\/code><\/pre>\n\n\n\n<p>Cool. 
Let&#8217;s run it, and try testing it against a couple websites to make sure it&#8217;s reliable:<\/p>\n\n\n\n<figure class=\"wp-block-video\"><video controls src=\"https:\/\/emersoncode.com\/blog\/wp-content\/uploads\/2020\/07\/My-Movie-Medium.mov\"><\/video><\/figure>\n\n\n\n<ul>\n<li>First test using a Shopify site => https:\/\/emerson-code-development.myshopify.com\n<ul>\n<li>That worked. Let&#8217;s try a different Shopify website:<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Second test with another Shopify site => https:\/\/allbirds.com\n<ul>\n<li>Once again, looking good. Now let&#8217;s try a website that definitely doesn&#8217;t use Shopify.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Third test with a site that doesn&#8217;t use Shopify => https:\/\/amazon.com\n<ul>\n<li>Error!<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p>Ah, yes, I should&#8217;ve anticipated this, but Amazon blocks my script from reading its source code. <\/p>\n\n\n\n<p>Now I could get crazy and break out a more sophisticated scraping solution like <a href=\"http:\/\/watir.com\/\">Watir<\/a>, but remember this is bootleg, so let&#8217;s think through that extra work for a second&#8230;<\/p>\n\n\n\n<p>So if Shopify stores don&#8217;t block my script, can&#8217;t I just assume that if my script is blocked, it&#8217;s not a Shopify store? Makes sense to me. Let&#8217;s do that:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"ruby\" class=\"language-ruby\">begin\n\t\thtml = URI.open(url).read\n\t\tif html.include? \"var trekkie = window.ShopifyAnalytics.lib\"\n\t\t\tputs url + ': This site uses Shopify'\n\t\t\t# It's a match! Let's write it to a CSV\n\t\t\tCSV.open(\"\/Users\/emerson\/Downloads\/results-shopify-stores.csv\", \"a\") do |csv|\n\t\t\t  csv &lt;&lt; [url]\n\t\t\tend\n\t\telse\n\t\t\tputs url + ': This site does not use Shopify'\n\t\tend \n\trescue\n\t\t# Any failed request (blocked, timed out, bad domain) ends up here\n\t\tputs url + ': Blocked. Probably does not use Shopify'\n\tend<\/code><\/pre>\n\n\n\n<p>Okay. Bingo. 
Part 1&#8217;s solution is decent enough.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Part 2: Get a big list of websites<\/h2>\n\n\n\n<p><strong>Onto the next one: <\/strong>I need a massive list of domains to check.<\/p>\n\n\n\n<p>Search engines typically use something called spiders here to find and create directories\/lists of webpages. Spiders start at one webpage, follow all the links on that webpage, and then the links on those webpages, and so on until they&#8217;ve mapped all the webpages they can get their hands on.<\/p>\n\n\n\n<p>Now if you wanted to build a BuiltWith competitor you&#8217;d probably need to do something like that, but this is bootleg, baby.<\/p>\n\n\n\n<p>I remember Alexa.com has a list of the top million website domains or whatever. I wonder if they make the list publicly available? Surely that&#8217;d give me a lot of high ranking websites to check, and high ranking websites are what we&#8217;d prefer here anyway, right?<\/p>\n\n\n\n<p>A quick Google search revealed <a rel=\"noreferrer noopener\" href=\"https:\/\/gist.github.com\/chilts\/7229605\" target=\"_blank\">this<\/a> GitHub thread with some decent leads, and from that thread I downloaded a CSV of the &#8216;top million domains&#8217;.<\/p>\n\n\n\n<p>Checking the spreadsheet, it only has 764,166 domains, not 1 million&#8230; Good enough!<\/p>\n\n\n\n<p>Let&#8217;s start with a couple lines of Ruby code that can use a CSV spreadsheet file as an input, and loop through each domain to perform the Shopify check I outlined in Part 1.<\/p>\n\n\n\n<p>Oh yeah&#8230; the domain list only includes the bare domain: e.g. bestbuy.com, not https:\/\/www.bestbuy.com. That&#8217;s okay, let&#8217;s just assume most domains are using https:\/\/ and prepend it to the start of every domain before we check it:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"ruby\" class=\"language-ruby\">url = 'https:\/\/' + url<\/code><\/pre>\n\n\n\n<p>Cool. One last thing. 
If my script does detect a Shopify store, let&#8217;s go ahead and save that to a separate CSV to serve as my output, so when my script&#8217;s finished, I&#8217;m left with just a list of websites using Shopify.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"ruby\" class=\"language-ruby\">if html.include? \"var trekkie = window.ShopifyAnalytics.lib\"\n   puts url + ': This site uses Shopify'\n   # It's a match! Let's write it to a separate CSV\n   # (\"a\" appends; \"w\" would overwrite the file on every match)\n   CSV.open(\"\/Users\/emerson\/Downloads\/results-shopify-stores.csv\", \"a\") do |csv|\n      csv &lt;&lt; [url]\n   end\nelse\n...<\/code><\/pre>\n\n\n\n<p>Okay&#8230; One hour later and I&#8217;ve barely gotten through 1000 URLs, and none of those are Shopify sites it seems. <\/p>\n\n\n\n<p>Makes sense. If you&#8217;re one of the top websites in the world, you&#8217;re probably running it on a custom platform.<\/p>\n\n\n\n<p>Given that, to make this a little bit easier, let&#8217;s go ahead and skip the first 50K top sites and start scanning at 50,001:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"ruby\" class=\"language-ruby\">domainList = CSV.read('\/Users\/emerson\/Downloads\/top-1m.csv', headers:true)\ndomainList.drop(50000).each do |row|\n    url = row[1]\n    url = 'https:\/\/' + url\n    isUsingShop(url)\nend\n<\/code><\/pre>\n\n\n\n<p>A couple minutes later and we got one! 
<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-style-default\"><img decoding=\"async\" loading=\"lazy\" width=\"517\" height=\"414\" src=\"https:\/\/emersoncode.com\/blog\/wp-content\/uploads\/2020\/06\/Screen-Shot-2020-06-30-at-4.21.48-PM.png\" alt=\"\" class=\"wp-image-187\" srcset=\"https:\/\/emersoncode.com\/blog\/wp-content\/uploads\/2020\/06\/Screen-Shot-2020-06-30-at-4.21.48-PM.png 517w, https:\/\/emersoncode.com\/blog\/wp-content\/uploads\/2020\/06\/Screen-Shot-2020-06-30-at-4.21.48-PM-300x240.png 300w\" sizes=\"(max-width: 517px) 100vw, 517px\" \/><\/figure>\n\n\n\n<p>After letting that script run in the background for a few hours, I&#8217;ve checked about 12K domains (Alexa&#8217;s top 50K-62K domains), and found about 220 of those sites are using Shopify. So roughly 1.83% of the domains checked so far were using Shopify.<\/p>\n\n\n\n<p>Quick and dirty math: Roughly 700K domains still need to be checked. Assuming a similar hit rate to the domains I have checked so far (700K x 1.83% is about 12.8K), my bootleg script should produce a list of 10K+ high ranking Shopify stores.<\/p>\n\n\n\n<p><strong>Free BuiltWith: 50 high ranking Shopify stores<\/strong><\/p>\n\n\n\n<p><strong>Free PublicWWW: 2359 high ranking Shopify stores<\/strong><\/p>\n\n\n\n<p><strong>30 lines of Ruby code: 10K+ high ranking Shopify stores<\/strong><\/p>\n\n\n\n<p><br><\/p>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Also, here is the complete Ruby script:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"ruby\" class=\"language-ruby line-numbers\">require 'csv'\nrequire 'open-uri'\ndef isUsingShop(url)\n\tbegin\n\t\thtml = URI.open(url).read\n\t\tif html.include? \"var trekkie = window.ShopifyAnalytics.lib\"\n\t\t\tputs url + ': This site uses Shopify'\n\t\t\t# It's a match! 
Let's write it to a CSV\n\t\t\tCSV.open(\"\/Users\/emerson\/Downloads\/results-shopify-stores.csv\", \"a\") do |csv|\n\t\t\t  csv &lt;&lt; [url]\n\t\t\tend\n\t\telse\n\t\t\tputs url + ': This site does not use Shopify'\n\t\tend \n\trescue\n\t\tputs url + ': Blocked. Probably does not use Shopify'\n\tend\nend \ndomainList = CSV.read('\/Users\/emerson\/Downloads\/top-1m.csv', headers:true)\ndomainList.drop(50001).each do |row|\n    url = row[1]\n    url = 'https:\/\/' + url\n    isUsingShop(url)\nend\n <\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>How I outperformed BuiltWith and PublicWWW&#8217;s free plans with a couple hours and 30 lines of code. What if you wanted to find a list of websites that are using a specific piece of technology? Like if you wanted a list of all websites using WordPress? Or in my case, all websites built with Shopify? [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/emersoncode.com\/blog\/wp-json\/wp\/v2\/posts\/176"}],"collection":[{"href":"https:\/\/emersoncode.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/emersoncode.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/emersoncode.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/emersoncode.com\/blog\/wp-json\/wp\/v2\/comments?post=176"}],"version-history":[{"count":38,"href":"https:\/\/emersoncode.com\/blog\/wp-json\/wp\/v2\/posts\/176\/revisions"}],"predecessor-version":[{"id":248,"href":"https:\/\/emersoncode.com\/blog\/wp-json\/wp\/v2\/posts\/176\/revisions\/248"}],"wp:attachment":[{"href":"https:\/\/emersoncode.com\/blog\/wp-json\/wp\/v2\/media?parent=176"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/emersoncode.com\/blog\/wp-json\/wp\/v2\/categories?post=176"
},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/emersoncode.com\/blog\/wp-json\/wp\/v2\/tags?post=176"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}