Note
This post is analysis on data that I obtained freely - through GDPR, but you’re not getting access to. I’m not sharing my full playing history for the last 3 months with everyone who reads this!
Also, the long delay since last post is because I moved house. Picking this back up now.
Background
GDPR means that EU citizens are entitled to a copy of personal data that a company is holding on them. In particular, it must be in a machine-readable format.
Right now, Spotify has an interface for downloading the last 90 days of data, and I’m asking them for my whole history. For now, let’s look at the last 90 days. To start - importing the data, and a bit of clean-up.
Exploratory Data Analysis
This is deliberately unorganised. I’m just diving into the data to see what stories it can tell me.
data <- fromJSON(file="~/bin/R/data I've not yet worked on//my_spotify_data/StreamingHistory.json") %>%
bind_rows() %>%
mutate(time = ymd_hms(time)) %>%
mutate(artistName = str_split(artistName, ", "))
Let’s start with my top artists. The artistName field is sometimes the band, sometimes including the members of the band. By the looks of things, the first entry in this list is always the band, so let’s look at that.
first <- function(x){
x[[1]]
}
top_artists <- data %>%
mutate(artistName = map(artistName, first)) %>%
mutate(artistName = as.character(artistName)) %>%
group_by(artistName) %>%
summarise(plays = n()) %>%
dplyr::arrange(desc(plays)) %>%
head(10)
top_artists
## # A tibble: 10 x 2
## artistName plays
## <chr> <int>
## 1 Billy Idol 195
## 2 Green Day 156
## 3 The Offspring 111
## 4 blink-182 108
## 5 Cyndi Lauper 87
## 6 Bowling For Soup 77
## 7 KISS 73
## 8 Alice Cooper 63
## 9 Van Halen 55
## 10 Good Charlotte 53
In fact, the first item in artistName seems to be rather useful, so let’s add it to the dataset.
data <- data %>%
mutate(band = map(artistName, first)) %>%
mutate(band = as.character(band))
My motivating example was “what am I paying per track?”. Spotify is £9.99/month UK (but I’m looking at switching to the £14.99/month family package.) and this is 3 months of data. So per track, this is a very easy calculation:
total_tracks <- data %>%
nrow()
9.99 * 3 / total_tracks
## [1] 0.006802088
0.7p/track isn’t bad. Of course, it beats a traditional store if I’m getting good diversity, otherwise I’m better off buying just what I want.
distinct_tracks <- data %>%
select(-time) %>%
unique() %>%
nrow()
9.99*3/distinct_tracks
## [1] 0.0183865
So about 2p per track. It’ll be interesting looking at this again when I get my whole data.
It also means that I listened to a track 2.7030675 times on average.
There are timestamps on the data, and I’ve already converted them to a sensible format. Let’s get some wide looks at the data
g <- ggplot(data, aes(x=time)) + geom_freqpoly()
ggplotly(g)
It looks like freqpoly is the right plot, but it’s not brilliant at binning the data. I’m going to throw the lubridate package at using sensible time intervals for the binning.
data %>%
mutate(date = round_date(time, unit = "day")) %>%
group_by(date) %>%
summarise(plays = n()) %>%
plot_ly(x=~date, y=~plays, type="scatter", mode="lines")
I think I need to check my calendar for what was so exciting about Apr 14. And May 23.
What about time of day?
data %>%
mutate(hour = hour(time)) %>%
group_by(hour) %>%
summarise(plays = n()) %>%
plot_ly(x=~hour, y=~plays, mode="lines")
So my busy period seems to be 17:00. Or, “When I leave work”.
And there’s already privacy concerns here. If I’m listening to music, then I must be awake. (Usually, anyway. In the past I have left cds playing overnight. It was rather calming.)
I now wonder if there is a difference in the shape of this by day?
data %>%
mutate(day = wday(time, label = TRUE)) %>%
mutate(hour = hour(time)) %>%
group_by(hour, day) %>%
summarise(plays = n()) %>%
group_by(day) %>%
plot_ly(x=~hour, y=~plays, type="scatter", mode="lines", split=~day)
It’s easier just to leave the days of the week as numeric, because I then don’t have to tell it what the ordering of days of the week is. I just read the documentation, and lubridate already has the option of returning an ordered factor.
It’s messy, but you can remove lines by clicking the legend.
Or a column chart by day:
data %>%
mutate(day = wday(time, label = TRUE, week_start = 1)) %>%
group_by(day) %>%
summarise(plays = n()) %>%
plot_ly(x=~day, y=~plays, type = "bar")
Now, I like stacked columns. I don’t have genre data, yet, but I can look at my most common artists.
top_artists <- pull(top_artists, artistName)
data %>%
mutate(day = wday(time, label = TRUE, week_start = 1)) %>%
filter(band %in% top_artists) %>%
group_by(day, band) %>%
summarise(plays = n()) %>%
group_by(band) %>%
plot_ly(x=~day, y=~plays, type = "bar", split = ~band)
I initially kept “other” bands, but they vastly outnumbered everyone else. Let’s look at that long tail.
data %>%
group_by(band) %>%
summarise(plays=n()) %>%
mutate(plays = plays/sum(plays)) %>%
dplyr::arrange(desc(plays)) %>%
plot_ly(y=~plays, type="scatter", mode="lines")
So my top artist is less than 5% of my played tracks, and I’ve over 600 artists in these 3 months.
It looks a bit like a 1/x function, but there’s nothing really interesting in doing a regression analysis on this: the x value must be a whole number, it must be at least 1, and I’ve no interest in extrapolating past 600.
I’m now interested in my average session. Let’s say that if a track is more than 10 minutes after the last one, then it’s a new session. Then I can just number sessions.
data <- data %>%
mutate(`new session` = (time - lag(time)) > dminutes(10)) %>%
replace_na(list("new session"=TRUE)) %>%
mutate(session = cumsum(`new session`)) %>%
select(-`new session`)
data %>%
group_by(session) %>%
summarise(plays=n()) %>%
group_by(plays) %>%
summarise(count=n()) %>%
plot_ly(y=~count, x=~plays, type="bar")
I’m rather disappointed that it doesn’t appear to be one of the usual distributions. Also, when did I listen to 136 tracks back-to-back?
data %>%
add_count(session) %>%
top_n(1, n) %>%
summarise(start=min(time), end=max(time))
## # A tibble: 1 x 2
## start end
## <dttm> <dttm>
## 1 2018-04-03 05:18:22 2018-04-03 13:35:24
I don’t recall getting up at 5 that day.
I don’t have album data, but do have track data. First thought was to just group by track name, but…
data %>%
select(trackName, band) %>%
unique() %>%
add_count(trackName) %>%
top_n(1) %>%
select(trackName) %>%
inner_join(data) %>%
select(trackName, band) %>%
unique()
## # A tibble: 48 x 2
## trackName band
## <chr> <chr>
## 1 Rockin' In The Free World Neil Young
## 2 Rockin' In The Free World Pussy Revolution
## 3 Surrender Cheap Trick
## 4 Surrender Less Than Jake
## 5 Walk This Way Aerosmith
## 6 Walk This Way Run–D.M.C.
## 7 Cryin' Aerosmith
## 8 Cryin' Vixen
## 9 Dancing With Myself Billy Idol
## 10 Dancing With Myself Generation X
## # ... with 38 more rows
Some nice examples there of the same track name covered by 2 different bands. Or the same band, under different names. So my “top track” needs to consider this. (Of course, internally Spotify will have a unique track ID, and I don’t have that.)
data %>%
select(trackName, band) %>%
group_by_all() %>%
summarise(n()) %>%
dplyr::arrange(desc(`n()`)) %>%
head(10)
## # A tibble: 10 x 3
## # Groups: trackName [10]
## trackName band `n()`
## <chr> <chr> <int>
## 1 White Wedding - Pt. 1 Billy Idol 28
## 2 The Kids Aren't Alright The Offspring 27
## 3 Poison Alice Cooper 25
## 4 Barracuda Heart 24
## 5 Lifestyles of the Rich & Famous Good Charlotte 23
## 6 All The Small Things blink-182 22
## 7 Hit Me With Your Best Shot Pat Benatar 20
## 8 Cum on Feel the Noize Quiet Riot 18
## 9 The Middle Jimmy Eat World 18
## 10 The Rock Show blink-182 18
I’m now going to look at a band that is well-represented in my dataset, and has 3 different lists of members - blink-182.
data %>%
filter(band=="blink-182") %>%
pull(artistName) %>%
map(unique) %>%
map_chr(toString) %>% #So far, this undoes the bit where I turned the string of artists into a list.
as.data.frame() %>%
group_by_all %>%
tally
## # A tibble: 5 x 2
## . n
## <fct> <int>
## 1 blink-182, JOHN FELDMANN 12
## 2 blink-182, Mark Hoppus, Tom DeLonge, Travis Barker, Jerry Finn 7
## 3 blink-182, Tom DeLonge, Mark Hoppus, Jerry Finn 5
## 4 blink-182, Tom DeLonge, Mark Hoppus, Travis Barker, Jerry Finn 76
## 5 blink-182, Tom DeLonge, Scott Raynor, Mark Hoppus, Mark Trombino 8
This definitely reveals that I was misunderstood about what the artistName field was for. At first I thought it was every artist on the track, so I’d get singers, guitarists, bassists, drummers for blink-182, and split by era. Instead it seems that Spotify have merged several fields, some with incomplete information, into one field. And not documented it. So I don’t have a way to reliably identify a producer. (And I’ve just identified some producers I might want to listen to more of.)
So, this analysis suggests that I might want to go back to Spotify and ask them to not merge the artistName field, but keep the various fields separate.
Or…
Next Steps
Most of this analysis has focused on time of listening rather than the track data, mostly because the track data is very sparse. I’ve found my top bands, and my top tracks, but I want to go into genre(s), writers, year, who was in the band at that point, album.
I want to look at graph generation, and graph visualisation, so I think I’ll make a massive graph linking any two tracks that share a person.