Digital Media Project: Last.fm Spelunker

For the non-written element I've created/remixed, I'd like to talk about a personal project of mine: last.fm-spelunker.

Context

In its simplest form, last.fm is a service that records your listening history. I've had my Spotify account linked to last.fm for the past couple of years, and my last.fm account keeps a history of everything I've listened to on Spotify since I created the account in early 2017. For example, if I wanted to look up what I listened to on my birthday, I could easily check my history for March 21:

[Screenshot: last.fm recent tracks]

When I first came up with the idea for last.fm-spelunker, I was unfamiliar with both Python and SQL and wanted to learn more about each by building a data science project. I realized my last.fm history made for a great dataset and decided to create visualizations of my listening habits.

scraper.py

The first step to working with this dataset was downloading it into a SQL database. Since last.fm didn't (and still doesn't) have a user-friendly way to download an account's history, I had to do this programmatically using last.fm's API (Application Programming Interface), which is a way for more tech-savvy users to access last.fm's internal data from code:

import dataset   # lightweight SQL toolkit backing the local SQLite database
import requests  # HTTP client used to call the last.fm API

# self.RECENT_URL, self.PER_PAGE, and self.USER are attributes of the scraper class:
# the recent-tracks API URL template, the page size, and the last.fm username.
with dataset.connect('sqlite:///last-fm.db') as db:
	# ask the API how many pages of history exist
	resp = requests.get(self.RECENT_URL % (1, self.PER_PAGE)).json()
	TOTAL_PAGES = int(resp['recenttracks']['@attr']['totalPages'])
	# walk the pages oldest-first so rows land in chronological order
	for page in reversed(range(1, TOTAL_PAGES + 1)):
		scrobbles = requests.get(self.RECENT_URL % (page, self.PER_PAGE)).json()['recenttracks']['track']
		scrobbles = [process_scrobble(s) for s in scrobbles]  # flatten each raw scrobble into a row dict
		db[self.USER].insert_many(scrobbles)  # bulk-insert the page of scrobbles into the user's table

The actual code can be found in a file called scraper.py here and runs several hundred lines; the above is a (very) simplified snippet that builds a local database of my listening history through the last.fm API.
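Each entry returned by the API's recenttracks endpoint is a nested JSON object, so it has to be flattened before it can become a database row. Roughly, the process_scrobble helper does something like the sketch below (simplified, not the exact code in scraper.py; column names such as uts for the play's Unix timestamp are stand-ins for the real schema):

def process_scrobble(scrobble):
	# Sketch: flatten one raw scrobble from the recenttracks endpoint into a
	# flat dict whose keys become the table's columns (names are stand-ins).
	return {
		'artist': scrobble['artist']['#text'],
		'album': scrobble['album']['#text'],
		'track': scrobble['name'],
		'uts': int(scrobble['date']['uts']),  # Unix timestamp of the play
	}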

scrubber.py

Now that I had the local database, I could finally start to actually work with the data. But before I could do so, I had to clean it up. Because some albums could be titled Album Title (Explicit Version) or other such variations of the real album title, it was difficult to get reliable statistics out of the data. For example, if I wanted to see how many times I had listened to an album, e.g. The College Dropout, I'd have to check both The College Dropout and The College Dropout (Explicit). Or, if an artist had used a different name in the past, my listening history for that artist would be split across two different names (e.g. K.Dot changed his name to Kendrick Lamar). To clean the data and avoid any such issues, I first renamed all album titles that had variations such as "(Explicit)" or "(Remastered)" appended to the end, using the Python code shown below.

import dataset  # lightweight SQL toolkit backing the local SQLite database

# `username` is the last.fm username, which is also the name of the table.
with dataset.connect('sqlite:///last-fm.db') as db:
	# pull every distinct album title, escaping quotes SQLite-style for the UPDATEs below
	sql = 'SELECT DISTINCT album FROM %s' % username
	albums = [str(row['album']).replace("'", "''") for row in db.query(sql)]
	versions = [' (Explicit)', ' (Explicit Version)', ' (Edited)', ' (Remastered)']
	valid = {}       # base title -> variant titles that also appear in the data
	standalone = {}  # variant title -> its suffix, when the base title never appears on its own

	# case 1: both 'Album' and 'Album (Explicit)' appear, so map each variant onto its base
	for album in list(albums):
		altnames = [album + version for version in versions]
		for altname in altnames:
			if altname in albums:
				if album not in valid:
					valid[album] = []
				valid[album].append(altname)
				albums[:] = [copy for copy in albums if copy != album]

	# case 2: only the variant appears, so remember which suffix to strip from its end
	for album in albums:
		for version in versions:
			if album.endswith(version) and album[:-len(version)] not in valid:
				standalone[album] = version

	# rewrite every variant row to use its base album title
	for album in valid:
		for alt in valid[album]:
			sql = "UPDATE %s SET album = '%s' WHERE album = '%s'" % (username, album, alt)
			db.query(sql)

	for album in standalone:
		sql = "UPDATE %s SET album = '%s' WHERE album = '%s'" % (username, album[:-len(standalone[album])], album)
		db.query(sql)

The full code for this can be found in scrubber.py here. Then, I manually renamed all other incorrect names using a SQL script that can be found here. Below is a snippet of this SQL script.

UPDATE sbansal21 SET artist = 'Anderson .Paak' WHERE artist LIKE 'Anderson%Paak%' OR album = 'The Anderson .Paak EP';
UPDATE sbansal21 SET album = 'Malibu' WHERE album = 'MALIBU';
UPDATE sbansal21 SET album = 'Robots - EP' WHERE artist = 'EarthGang' AND track IN ('Flickted', 'So Many Feelings', 'Underwater (feat. Sir)', 'Robots', 'Artificial', 'Lyfted Intro');
UPDATE sbansal21 SET album = 'Charlie Eastern' WHERE artist = 'Jalen Santoy' AND track = 'Foreplay';
UPDATE sbansal21 SET artist = 'Jay IDK' WHERE artist = 'Jay idk';
UPDATE sbansal21 SET track = 'Pick Up the Phone (feat. Quavo)', album = 'Jeffery' WHERE track LIKE 'Pick Up the Phone%';
UPDATE sbansal21 SET album = 'Jeffery' WHERE artist = 'Young Thug' AND album = 'No, My track is JEFFERY';
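To apply the script, it just needs to be run against the same SQLite database that scraper.py produced; a minimal sketch (manual_renames.sql is a placeholder name for the linked script):

import sqlite3

# Sketch: run the manual-rename SQL script against the scraped database.
# 'manual_renames.sql' is a placeholder name for the script linked above.
with sqlite3.connect('last-fm.db') as conn:
	with open('manual_renames.sql') as f:
		conn.executescript(f.read())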

Finally, the data was pre-processed and cleaned.
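With the variants collapsed, a question like "how many times have I played The College Dropout?" now needs only a single query instead of one per title variation. A minimal sketch, assuming the table is named after my last.fm username (sbansal21, as in the SQL script above):

import dataset

# Count plays of a single album now that its '(Explicit)' variant has been folded in.
with dataset.connect('sqlite:///last-fm.db') as db:
	for row in db.query("SELECT COUNT(*) AS plays FROM sbansal21 WHERE album = 'The College Dropout'"):
		print(row['plays'])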

spelunker.py

Now that the data was clean and ready for visualization, I had to decide what to do with it. I discovered an interesting statistical visualization called a streamgraph, which is essentially a stacked area chart whose layers are displaced around a varying central baseline, resulting in a flowing, organic shape.

To create this visualization, I used the popular D3.js data visualization library, which is written in JavaScript, another language I wanted to become more familiar with, and used this code as a starting point. I first processed my listening history as necessary to get the data for the streamgraph and then used D3.js to render it as a web-based streamgraph, using a mix of Python and HTML/JS/CSS code that can be found here.
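The real processing lives in the linked code, but roughly speaking, a streamgraph needs one value per series per time bucket. The sketch below is an illustration rather than the actual spelunker.py code (the uts timestamp column and sbansal21 table name are the same stand-ins used above): it counts plays per artist per month and reshapes them into the per-month records that D3.js stacks into layers.

import json
from collections import Counter
from datetime import datetime, timezone

import dataset

# Sketch: tally plays per (month, artist) from the cleaned table, then reshape into
# one record per month with a play count for every artist, ready for D3.js to stack.
counts = Counter()
with dataset.connect('sqlite:///last-fm.db') as db:
	for row in db.query('SELECT artist, uts FROM sbansal21'):
		month = datetime.fromtimestamp(int(row['uts']), tz=timezone.utc).strftime('%Y-%m')
		counts[(month, row['artist'])] += 1

months = sorted({month for month, _ in counts})
artists = sorted({artist for _, artist in counts})
records = [{'month': m, **{a: counts[(m, a)] for a in artists}} for m in months]
print(json.dumps(records[:3], indent=2))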

Example screenshots and videos of the resulting streamgraphs can be found below.

[Screenshots and embedded videos of the resulting streamgraphs]