Tea Time Theme Time Term Matching

A quick code snippet to find matching terms between song lyrics

I love BBC 6 Music. On my morning train rides into LA I like to listen to the Radcliffe and Maconie Show. The hosts play a game at the end of their show where they play three tracks and ask listeners to guess what connects all three tracks.

I have yet to win this game which has prompted me to try to see if I can gain an advantage by analyzing some data about the the themes submitted. I have noticed that since the start of 2018, the theme relies on an exact term or phrase match among the lyrics 17% of the time.

library(tidyverse)
 
# read in the data
t <- read_csv("https://raw.githubusercontent.com/michaelpawlus/michaelpawlus.github.io/master/images/tttt1.csv")

## Parsed with column specification:
## cols(
##   index = col_integer(),
##   date = col_character(),
##   day = col_integer(),
##   month = col_integer(),
##   year = col_integer(),
##   submitter = col_character(),
##   submitter_city = col_character(),
##   country = col_character(),
##   Link = col_character(),
##   link_location = col_character(),
##   link_type = col_character(),
##   col = col_character(),
##   artist = col_character(),
##   song = col_character()
## )

# filter to one link per row, removing song 2 and 3
tr <- t %>% filter(col == "sa1")
 
# filter to just the rows were links are based on a term or phrase in the lyrics
tl <- t %>% filter(link_type == "Term", link_location == "Lyrics", col == "sa1")
 
# plot the proportion of cases were the link is a term or phrase in the lyrics
t %>%
select(index, link_location, link_type, col) %>%
filter(col == "sa1") %>%
  select(-col) %>%
  ggplot(aes(link_type)) + 
  geom_bar(aes(fill = link_location)) + 
  annotate("text", x = 'Term', y = 5, label = paste0(round(nrow(tl)/nrow(tr),2)*100,"%"), fontface = 2, size = 6) +
  labs(y = NULL, x = "Link Type") + 
  guides(fill=guide_legend(title="Link Location"))

plot of chunk unnamed-chunk-1

In these cases, these were the links.

tl %>% select(Link)

## # A tibble: 5 x 1
##   Link                                                  
##   <chr>                                                 
## 1 "All songs contain \"Joan of Arc\" in the lyrics"     
## 2 A reference to 4 AM in the lyrics                     
## 3 "All songs begin with phrase \"I was born\""          
## 4 "All lyrics include phrase \"Shut your mouth\""       
## 5 "All lyrics contain the phrase \"She's gone to stay\""

Using what I have recently learned from the excellent Tidy Text Mining and really helpful package that pulls in lyrics from the site Genius, I was able to write a really quick script to get the lyrics for two songs and look for common terms between them. Here is an example to find one of the links above:

library(tidytext)
library(geniusR)
 
song1 <- genius_lyrics(artist = "Bill Withers", song = "Ain't No Sunshine") %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  count(word1, word2, word3, sort = TRUE)
 
song2 <- genius_lyrics(artist = "Prince", song = "I Could Never Take The Place of Your Man") %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  count(word1, word2, word3, sort = TRUE)
 
song1 %>%
  inner_join(song2, by = c("word1","word2","word3")) %>%
  count(word1, word2, word3, sort = TRUE)

## # A tibble: 2 x 4
##   word1 word2 word3     n
##   <chr> <chr> <chr> <int>
## 1 and   i     know      1
## 2 gone  to    stay      1

My next steps is to wrap this all in a function and add a check that first looks for matching trigrams and then finding none looks for bigrams and failing that lists all matching terms ommitting stop words. I will post that update soon.

As of this post, I still have yet to win Tea Time Theme Time (#TTTT)

Written on February 9, 2018