If you are interested in anything that is sold nowadays, there is no way around Amazon. For your data science project that requires product data, you may wonder how to access their product data programmatically. Put simply, you have two different options: Speak to the Amazon product API or scrape the website directly.
Why the Amazon API may not be the right tool for you
If you can get access to the Amazon product API – great, use it! However, this isn’t as straight forward and reliable as one may think. You need an active partner account and people actually need to purchase things through your links so that your API key actually keeps working.
Of course, it makes sense. Amazon does not need to feed hungry data scientists through an open API. This interface is designed to drive retail business, so it’s supposed to be used by eCommerce sites and the likes. From what I can tell, this may not have been much of an issue in the past, but when I tried my old API key from when I still had referral links online – it wasn’t working anymore:
{"__type":"com.amazon.paapi5#TooManyRequestsException","Errors":[{"Code":"TooManyRequests","Message":"The request was denied due to request throttling. Please verify the number of requests made per second to the Amazon Product Advertising API."}]}
Fair enough, let’s try the more exciting route and scrape the website instead.
Why scraping from the terminal may not work for you
If you’re coming from any kind of data science background, your tool of choice is probably Python, so you fire up a notebook and grab a current copy from an Amazon product search. But behold, what’s this? Ah, we look like a bot, fair enough.
We may be able to get away with setting the user agent string and faking a user session somehow. But why not pause on Python and delve into JavaScript land again?
How to scrape right from your browser
When looking for a simple web scrapter, I found artoo.js: An older but cute little JavaScript project to scrape right from your browser.
And yes, it works! artoo.js works through a simple bookmarklet, and then with your own scripts right inside the browser console. The scraped results are downloaded as CSV or JSON.
I spend some time this weekend to create a scraping script to fetch a number of parameters for all items in an Amazon book search. A result then looks like this:
{
"title": "Unfinished Tales",
"url": "/Unfinished-Tales-J-R-Tolkien/dp/B00A2M4VZG/ref=sr_1_90?dchild=1&keywords=tolkien&qid=1589053005&s=books&sr=1-90",
"img": "https://m.media-amazon.com/images/I/41YaQLNWFWL._AC_UY218_.jpg",
"author": "J. R. R. Tolkien",
"published": "Jan 1, 1979,
"rating": 5,
"reviews": 1,
"price_1": 21.74,
"price_2": null,
"price_3": null
}