Web Scraping with No Coding in Google Sheets

Did you know that you can use Google Sheets to scrape data from the web? And the best part? You don't even need to know how to code! In this article, I’ll show you how to pull live data from websites using Google Sheets' built-in functions.

So, what you're about to learn is new content that I've just added to my BI Analyst Starter Program, available at the Learn BI Academy. You can sign up for a free preview via the link in the description.

Web scraping is an incredibly useful skill for anyone working with data. Whether you want to monitor competitor prices, track financial trends, or automate data collection for reporting, being able to pull information directly from websites can save you a lot of time and effort.

Google Sheets provides simple functions that make this process accessible to everyone, even without coding knowledge. While Excel also has web scraping capabilities, it comes with some limitations compared to Google Sheets.

Both tools allow users to pull in live data, but because Google Sheets operates entirely online, it handles dynamic updates more efficiently, whereas Excel often requires manual refreshes or VBA scripting to maintain live connections.

Anyway, let's start with something simple: pulling structured data from a webpage using the IMPORTHTML function.

The IMPORTHTML function has three parameters:

  • URL: The webpage URL that contains the data.

  • Query Type: This specifies whether you want to extract a table or a list. Use "table" for structured data and "list" for ordered or unordered lists (there's a quick example of the list option just after this list).

  • Index: If there are multiple tables or lists on the page, this specifies which one to pull. The first table or list is indexed as 1, the second as 2, and so on.
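
All the examples below use "table", so as a quick illustration of the "list" option: a formula like this (the URL is just a placeholder) would pull the second list on a page:

=IMPORTHTML("https://example.com/page", "list", 2)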

Now, let's put this function into action using Wikipedia. Wikipedia is perfect for beginners because it organizes its data into tables, making it easy to extract.

Let’s say you want to get a list of the largest companies in the world by revenue. You can use this formula:

=IMPORTHTML("https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue", "table", 1)

Paste this into a cell, and Google Sheets will automatically pull in the first table on that Wikipedia page. If you want a different table from the same page, just change the last number.
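
For instance, assuming the page has a second table, this pulls that one instead:

=IMPORTHTML("https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue", "table", 2)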

One thing to be wary of is the table formatting. You'll notice in the result that the header for the Revenue column has been split across two rows. It's not a big deal, but it means you'll need to check the data being pulled in for this kind of thing.
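
If those extra header rows get in the way, one option is to wrap the import in QUERY and skip them. This is just a sketch; adjust the offset to match how many rows you want to drop:

=QUERY(IMPORTHTML("https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue", "table", 1), "select * offset 1", 0)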

Let's do another example from a different site. Let's say I want the current English Premier League standings. Using the same formula structure as before but changing the URL, we can easily scrape that. Here’s the formula:

=IMPORTHTML("https://www.theguardian.com/football/premierleague/table", "table", 1)

And what's great is that the data will automatically update in Sheets when the source table changes (these IMPORT functions are re-fetched periodically, typically about once an hour). You don't need to manually refresh.

Let's do one final IMPORTHTML example and get the latest prices for some crypto, with this simple formula:

=IMPORTHTML("https://coinmarketcap.com/", "table", 1)

And there you go, really easy.
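
If you only want a single value rather than the whole table, you can wrap the import in INDEX. For example, this would return the cell in row 2, column 4 of the imported table; the exact row and column numbers depend on how the site lays out its table, so check yours:

=INDEX(IMPORTHTML("https://coinmarketcap.com/", "table", 1), 2, 4)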

So, this IMPORTHTML function is great for gathering structured business data like stock market indices, company rankings, or industry trends. In fact, there are tons of use cases for automating reports by linking them to live data sources.

Now let’s take things a step further and scrape specific elements from a webpage using IMPORTXML. Unlike IMPORTHTML, which grabs full tables or lists, IMPORTXML lets you target specific content on a webpage using something called XPath.
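
The general pattern is the same as before, except the second argument is now an XPath query. For example, this formula (with a placeholder URL) would pull every <h1> heading on a page:

=IMPORTXML("https://example.com/page", "//h1")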

Let's walk through a real-world example using an online store.

Now, XPath is a powerful way to locate elements within a webpage's HTML, and learning how to use it effectively can make your data extraction much more accurate.

Before writing any formulas, we need to inspect the page to locate the correct elements. This will help us determine the right XPath to extract product names and prices.

To begin, let's inspect the product listing on the Molly Jogger inventory page. By right-clicking on a product name and selecting 'Inspect', we can see that the product names are inside <a> tags with the class product-title. Now, we can construct our first formula to extract product names. And here it is:

=IMPORTXML("https://www.mollyjogger.com/collections/inventory", "//a[contains(@class, 'product-title')]")

We've successfully extracted product names, but how exactly does this work? Let's break down this key part of the formula: contains(@class, 'product-title').

The contains() function in XPath is used to match elements that have a specific class. In this case, we are looking for <a> elements where the class attribute contains 'product-title'. This is useful because some elements may have multiple classes, and instead of searching for an exact match, we can retrieve elements that contain this class anywhere within their attribute list. This makes the query more flexible and reduces the risk of breaking if additional classes are added in the future.
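
To make that concrete, suppose a product link were marked up as <a class='product-title featured'> (a hypothetical example). An exact-match query like //a[@class='product-title'] would miss it, while //a[contains(@class, 'product-title')] still finds it.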

Another important part of our XPath query is the // at the beginning. This tells Google Sheets to search for the specified element anywhere in the document, rather than requiring an exact path. This flexibility is useful because websites often have complex structures, and elements might not always be located in the same place. By using //, we ensure that our query still works even if the page structure changes slightly.
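
For example, a fully specified path like /html/body/div[2]/main/span (a made-up path, just for illustration) would break the moment the site added a wrapper <div>, whereas //span[contains(@class, 'money')] keeps working wherever the element sits.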

Now, let's extract product prices:

=IMPORTXML("https://www.mollyjogger.com/collections/inventory", "//span[contains(@class, 'money')]")

At first glance, this seems to work, but there aren't the same number of elements for products and prices, so something's not quite right. Let's look at the page again to verify the price elements. When we check, we notice that some products have two prices: an original price and a sale price. This has shifted our data out of alignment. To fix this, we need to modify our XPath query to only grab the correct price.

By inspecting the page, we see that sale prices use <span class='money'>, while original prices have an additional class: <span class='original-price money'>. Because our original formula matched anything containing 'money', we need to refine it to exclude the original prices and keep only the current one. (Or, if you'd rather have both prices in separate columns, you can; we'll come back to that in a moment.) So this is the revised formula:

=IMPORTXML("https://www.mollyjogger.com/collections/inventory", "//span[contains(@class, 'money')][not(contains(@class, 'original-price'))]")

We simply add a second contains() condition, this time wrapped in a not() qualifier. Once you have your formulas in place, your spreadsheet will automatically update with live product names and prices. This can be extremely useful for competitor price tracking or market research.
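
And as mentioned, if you'd like the original prices in a column of their own, a formula along these lines (based on the classes we saw when inspecting the page) should do it, though it will only return values for products that are actually on sale:

=IMPORTXML("https://www.mollyjogger.com/collections/inventory", "//span[contains(@class, 'original-price')]")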

Before we wrap up, it's important to talk about the limitations of these methods. One major challenge is that some websites load their data dynamically using JavaScript. IMPORTXML can only scrape static content, so if your formulas return empty results, it's likely that the data isn't present in the page's raw HTML, and Google Sheets functions won't be able to extract it. In those cases, you may need to use an API, a web scraping tool like Python with BeautifulSoup, or even browser automation with Puppeteer or Selenium.
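
One related built-in worth knowing about: if a site or API exposes its data as a plain CSV or TSV file, the IMPORTDATA function can pull it straight in (the URL here is just a placeholder):

=IMPORTDATA("https://example.com/data.csv")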

So to recap, IMPORTHTML is great for structured data like tables and lists, while IMPORTXML gives you more control by allowing you to extract specific elements from a webpage. Whether you're tracking business trends, monitoring e-commerce prices, or automating reports, these functions make it easy to pull live data directly into Google Sheets, saving you time and effort.

If you want to discover some advanced analytics techniques, why not check out this video here. And don’t forget to check out the BI Analyst Starter Program for more content like this. Until the next time, BI!
