Thursday, November 3, 2016

Small Adventures In Data: Billboard Music

My Facebook-only friend Greg Newburn recently asked a question:


Greg is a world-renowned internet poster, so I try to serve him in any way I can. I have also recently been picking up SQL and Python and love a project where I can try out my skills, so I jumped at the opportunity to help.

The Billboard data that I found on Mode Analytics seemed useful at first, but it only gave yearly rankings, and zero recording artists have five number one hit singles if you define "number one" as "number one for the entire year." A little investigation revealed that Billboard actually publishes weekly charts (I should have known this already), and the hunt for weekly data led me to this wonderful post where, at the bottom, the author provides a Python web scraper for gathering weekly chart data.

I copied this into Jupyter Notebook and let it run, slowly, until Jupyter Notebook crashed somewhere around 1972. That's enough data for me, so I cleaned it up in Excel, uploaded it to Mode, and got cracking.

My first attempt that produced reasonable results was fairly simple:


But, predictably, it didn't work very well: artists with similar names were not grouped together, which lowered their counts:


The sad ending (for now) is that I need to work on my actual day-job stuff and I really don't know how to fix this. You can view my work here. Any suggestions are welcome, but I'll probably look at it more this evening. I know there are some more advanced matching methods in Python; maybe I'll finally figure out how to use Python on SQL queries!
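
As a rough idea of what one of those matching methods could look like, here is a minimal sketch using `difflib` from Python's standard library. It greedily groups names whose similarity ratio clears a threshold. The names in the list are made up, the 0.85 threshold is an untuned guess, and the `similarity` and `cluster_names` helpers are hypothetical, not something from the scraper or from Mode.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_names(names, threshold=0.85):
    """Greedily group names: each name joins the first cluster whose
    first member is at least `threshold` similar, else starts a new one."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Made-up artist credits, not real chart data.
names = ["The Example Band", "the example band", "The Example-Band", "Another Act"]
clusters = cluster_names(names)
for cluster in clusters:
    print(cluster)
```

A greedy pass like this depends on the order of the names and can merge artists that really are different, so the clusters would still need an eyeball check before trusting any count.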

So far it appears that no near-misses suffer from this issue, but the count is probably biased against rap/hip-hop artists, whose artist credits tend to include featured artists that would mess up the GROUP BY command. Will look again this evening.
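
One possible workaround for the featured-artist problem, sketched in Python under some assumptions: strip anything after "Featuring", "feat." or "ft." from the artist credit, normalize case and spacing, and only then count distinct number one songs per artist. The row layout (artist credit, song title) and the helper names `normalize_artist` and `count_number_ones` are hypothetical, and the sample rows are invented rather than taken from the scraped data.

```python
import re
from collections import defaultdict

# Matches a trailing featured-artist credit such as "X Featuring Y" or "X feat. Y".
FEATURE_PATTERN = re.compile(r"\s+(?:featuring|feat\.?|ft\.?)\s+.*$", re.IGNORECASE)


def normalize_artist(credit):
    """Return a grouping key: lead artist only, lowercased, whitespace collapsed."""
    lead = FEATURE_PATTERN.sub("", credit)
    return " ".join(lead.lower().split())


def count_number_ones(rows):
    """Count distinct number one songs per normalized lead artist.

    rows: iterable of (artist_credit, song_title), one row per week at number one.
    """
    songs = defaultdict(set)
    for credit, title in rows:
        songs[normalize_artist(credit)].add(title.strip().lower())
    return {artist: len(titles) for artist, titles in songs.items()}


# Invented sample rows, not real chart results.
sample = [
    ("Artist A", "Song One"),
    ("Artist A", "Song One"),  # same song, second week at number one
    ("ARTIST A Featuring Artist B", "Song Two"),
    ("Artist  A feat. Artist C", "Song Three"),
    ("Artist B", "Song Four"),
]
counts = count_number_ones(sample)
print(counts)
```

This credits only the lead artist, so the featured artists get nothing for the song. Whether that is the right call depends on how you read Greg's question.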

My guess for now: 17