October 11, 2017

Meet Dumb Hardwax

Twitter bots have gotten a fairly bad rap recently (often with good reason). But when they’re done right, a genuinely quirky robot can cut through a feed full of humans with beautiful tidbits. God bless [@tinycarebot](https://twitter.com/tinycarebot).


Another thing that gives me life is watching Hardwax, a hugely influential Berlin record store, get excited about their new stock. Their snappy, idiosyncratic descriptions of new music are steeped in electronic music folklore. Gems like this are peppered all over:

“Dancehall from cyberspace - awesomely fresh & fearless & full of Grime affinities”

I could flick through this stuff all day, but as a Londoner I’m usually gonna pick up records from my local stores. This means I don’t get around to checking these reviews as much as I should, which got me thinking - I wish I could have Hardwax reviews in my Twitter feed, or something. They even fit into 140 characters most of the time…

I’d also been meaning to have a go at generating pseudo-random text with Markov chains, after coming across Roel’s post here. For those who don’t know about this type of chain - here’s a wicked visual intro - in short, they are mathematical systems that describe the probabilities of moving from one “state” (or set of values) to the next.
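(As a toy illustration with made-up numbers: if the current word is “deep”, a single Markov step is just a weighted draw of the next word. The probabilities below are invented purely for demonstration.)

# toy example - invented transition probabilities for the word "deep"
next_word_probs <- c(techno = 0.6, house = 0.3, ambient = 0.1)

# one Markov step: sample the next word, weighted by those probabilities
sample(names(next_word_probs), size = 1, prob = next_word_probs)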

Could there be potential to subvert this principle, knitting together words to form sentences, imitating this inimitable style? I’m envisaging a bot that spits out pseudo-Hardwax reviews, just for my sadistic enjoyment. Let’s get it.1

Gettin’ into a scrape

First up, I went straight to the Hardwax web shop to get hold of the review/description text accompanying releases on there. This would serve as the corpus of text from which we can build our Markov chain review generator. Here’s what the release pages look like:

Those bits circled in red? Those are the Hardwax reviews. I was able to put together a fairly simple function (leaning heavily on rvest) to scrape them.

# web scraper for hardwax reviews
library(rvest)  # for read_html(), html_nodes(), html_text() (also re-exports %>%)

hardwax_scrape <- function(page, no) {
  
  # construct url for a given section and page number
  x <- paste0("https://hardwax.com/", page, "/?page=", no, "&paginate_by=50")
  
  # scrape reviews
  reviews <- x %>%
    read_html() %>%
    html_nodes("p") %>%
    html_text()
  
  return(reviews)
}

This simple URL structure meant the function could easily be applied to each section/genre on the site, like pulling the latest week’s new ish:

# scrape this week's new releases (first two pages)
lapply(1:2, hardwax_scrape, page="this-week") %>% unlist() %>% head()
## [1] "Flawless Acid Electro science by the truly professional grand master Carl Finlow"         
## [2] "Driving Electro Bass"                                                                     
## [3] "Stunning rhythm tracks somewhere in between leftfield Techno & Grime"                     
## [4] "Beautiful Ambient atmospheres & stepping House organics"                                  
## [5] "One sided 12\" - blinding, raw Proto-House reminiscent cut & noisy leftfield Electro trip"
## [6] "One sided 12\" - blinding, raw Proto-House reminiscent cut & noisy leftfield Electro trip"

Once the various sections were scraped, some data cleaning (removing releases with no reviews, reviews longer than 140 characters, and duplicate reviews) ensured the text was fit for purpose before heading on to the next stage.2
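For reference, the cleaning amounted to something like this (a sketch: reviews is assumed to be the combined character vector from the scraping calls above, and the exact filters live in the clean script on GitHub):

library(dplyr)

# tidy the scraped text ('reviews' is the combined character vector from above)
reviews_clean <- tibble(review = reviews) %>%
  filter(review != "") %>%             # drop releases with no review text
  filter(nchar(review) <= 140) %>%     # keep only tweetable lengths
  distinct(review)                     # drop duplicate reviews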

Preppin’ the Text

Now that we’re entering text mining territory, it’s time to call on the might of tidytext to bring our body of text into forms suitable for Markov chain text generation. A couple of wise steps (sketched in code after the list below) should see us through.3

  1. Word counts: To aid the probabilistic elements of Markov chain text generation, we need an understanding of how many times words appear in different contexts:

    • No. times words appear in the corpus (all website review text)
    • No. times words appear at the beginning of a review (hereafter known as ‘openers’)
    • No. times words precede commas
  2. Ngrams:

    • Bigram counts (pairs of consecutive words)
    • Trigram counts (groups of three consecutive words)
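Here’s roughly what those counting steps look like with tidytext (a sketch building on the reviews_clean tibble assumed above; the object and column names are chosen to match what the later functions expect, but the exact code in the repo may differ a little):

library(dplyr)
library(tidyr)
library(stringr)
library(tidytext)

# word counts across the whole corpus
word_counts <- reviews_clean %>%
  unnest_tokens(word, review) %>%
  count(word, sort = TRUE)

# how often each word appears immediately before a comma, turned into a
# rough percentage (the 'comma_prob' used by the sentence generator later)
comma_counts <- tibble(
  word = unlist(str_extract_all(str_to_lower(reviews_clean$review), "[a-z']+(?=,)"))
) %>%
  count(word) %>%
  rename(n_comma = n)

word_counts <- word_counts %>%
  left_join(comma_counts, by = "word") %>%
  mutate(comma_prob = 100 * coalesce(n_comma, 0L) / n)

# 'openers': the first word of each review
opener_counts <- reviews_clean %>%
  mutate(word = str_to_lower(word(review, 1))) %>%
  count(word, sort = TRUE) %>%
  left_join(select(word_counts, word, comma_prob), by = "word")

# bigram and trigram counts
bigram_counts <- reviews_clean %>%
  unnest_tokens(bigram, review, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!is.na(word2)) %>%
  count(word1, word2, sort = TRUE)

trigram_counts <- reviews_clean %>%
  unnest_tokens(trigram, review, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!is.na(word3)) %>%
  count(word1, word2, word3, sort = TRUE)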

With the above tasks done, the ‘fun’ can really begin - crafting our first Hardwax review.

My Bot’s First Words

In short, my tactic here is to take the two words that open a pseudo-review and add a third word onto the end (with words that usually follow that bigram weighted more heavily), repeating the process until a sentence of a specified length is formed.

Here’s a function that essentially performs a look-up on the trigram counts dataframe, filtering (using dplyr’s filter_() variant) on a couple of inputs (the two words the sentence currently ends with) and returning the trigram’s final word where possible. Otherwise, the bigram counts are filtered on the sentence’s most recent word, and the final word comes from there instead.

# function to return third word
return_third_word <- function(woord1, woord2){
  
  # sample a word to add to first two words
  woord <- trigram_counts %>%
    filter_(~word1 == woord1, ~word2 == woord2)
  
  if(nrow(woord) > 0) {
    woord <- sample_n(woord, 1, weight = n) %>%
      .[["word3"]]
    
  } else {
    # fall back to the bigram counts if no matching trigram exists
    woord <- filter_(bigram_counts, ~word1 == woord2) %>%
      sample_n(1, weight = n) %>%
      .[["word2"]]
  }
  
  # return the chosen word
  woord
}
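A quick sanity check before wiring it into anything bigger (the two input words here are arbitrary examples; the output depends entirely on what the scrape returned):

# example call - suggest a third word to follow two (arbitrary) corpus words
return_third_word("leftfield", "techno")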

The above word generator is called iteratively inside the function below to construct our review. Here, we again take two words as inputs, along with an argument for sentence length which determines the number of times we cycle through the word-adding loop. Note there is also a chance for commas to enter the review, based on how often each word precedes a comma in the corpus.

# capitalise first letter
firstup <- function(x) {
  substr(x, 1, 1) <- toupper(substr(x, 1, 1))
  x
}

generate_sentence <- function(word1, word2, sentencelength){
  
  # comma chance sample
  commas <- sample(0:100, 1)
  
  # choosing to add a comma based on probabilities
  if(commas <= as.numeric(word1$comma_prob)) {
    sentence <- paste(word1$word, ", ", word2$word, sep="")
  } else {
    sentence <- c(word1$word, word2$word)
  }
  
  # starting to add words
  woord1 <- word1$word
  woord2 <- word2$word
  for(i in seq_len(sentencelength)){
    
    commas <- sample(0:100, 1)
    
    word <- return_third_word( woord1, woord2)
    
    word <- left_join(as_data_frame(word), word_counts, by=c("value"="word"))
    
    if(commas <= as.numeric(word$comma_prob)) {
      sentence <- c(sentence, ", ", word$value[1])
    } else {
      sentence <- c(sentence, word$value[1])
    }
    
    woord1 <- woord2
    woord2 <- word$value[1]
  }
  
  # paste sentence together
  output <- paste(sentence, collapse = " ")
  output <- str_replace_all(output, " ,", ",")
  output <- str_replace_all(output, "  ", " ")
  
  # add suffix sometimes
  tip_n <- sample(1:20, 1)
  if(tip_n %in% c(1, 2)){
    output <- paste(output, "- TIP!")
  } else if(tip_n %in% c(3, 4)){
    output <- paste(output, "(one per customer)")
  } else if(tip_n %in% c(5)){
    output <- paste(output, "- Killer!")
  } else if(tip_n %in% c(6, 7)){
    output <- paste(output, "- Warmly Recommended!")
  } else if(tip_n %in% c(8, 9)){
    output <- paste(output, "- Highly Recommended!")
  } else if(tip_n %in% c(10, 11)){
    output <- paste(output, "(w/ download code)")
  }
  
  # capitalise first letter and return the finished review
  firstup(output)
}

The penultimate part of this function appears odd - this is my final artistic flourish in the process. Hardwax infamously ended reviews with the phrase “TIP!” to indicate strong positive feelings about a piece of music (until this phenomenon was parodied in an artist’s track title, after which Hardwax went through the site to remove almost all traces). I’m bringing it back, along with some of the shop’s other favourite ways to end a review.

Finally, we create a wrapper function for the word/sentence generator to be called at will - enter the (imaginatively titled) review generator!

# generate review
dumb_hardwax <- function(x) {
  a <- sample_n(opener_counts, size=1, weight = n)
  b <- sample_n(word_counts, size=1, weight = n)
  len <- sample(5:12, 1)
  
  generate_sentence(word1=a, word2=b, sentencelength=len)
}

dumb_hardwax()
## [1] "Industrial, dense big room techno album in gatefold cover dubbed out dj"

Look at that - the bot made its first review. Let’s share it with the world…

TwitteRing

At this point, we can freely generate a simulated Hardwax review, but it’s still just lurking in the R console. To bridge the Twitter-shaped gap, rtweet gets us there. I won’t go into authentication/set-up details here - you should visit the package’s dedicated site for all that (or check the footnotes for the GitHub repo and dig there). Once we’ve made a Twitter app and authenticated R to post on our behalf, we’re tweeting in one line yo’:

# post tweet
post_tweet(status = dumb_hardwax(), token = twitter_token)
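(For reference, the twitter_token above comes from a set-up step along these lines - a sketch using rtweet’s create_token(), with an assumed app name and placeholder values standing in for the keys from your own Twitter app.)

library(rtweet)

# authenticate once with the keys from your own Twitter app (placeholders here)
twitter_token <- create_token(
  app = "dumb_hardwax_app",   # assumed app name
  consumer_key = "XXXX",      # placeholder
  consumer_secret = "XXXX"    # placeholder
)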

Nice. I can generate pseudo-Hardwax reviews and share them with anyone who cares. Still, I need to actually press ‘go’, which is a bit of a problem. I have to eat, sleep, work, all that stuff, unfortunately - which means this bot is only tweeting when I get around to making one happen myself. There’s always cronR, which is a great way to schedule tasks on my machine, but what if my machine is dead? The people need their reviews, and I don’t want this burden on my shoulders. There’s gotta be a way…
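(For the record, the cronR route is only a couple of lines - sketched here with an assumed path to a script that calls dumb_hardwax() and posts the result - it just wouldn’t survive my machine being switched off.)

library(cronR)

# schedule an hourly run of the bot script on this machine
# ("~/dumb_hardwax/bot.R" is a placeholder path)
cmd <- cron_rscript("~/dumb_hardwax/bot.R")
cron_add(cmd, frequency = "hourly", id = "dumb_hardwax")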

Up in the Clouds

After some ambling around the hottest cloud providers (namely AWS, Google Cloud and Azure), I settled on a branch of the last of these known as Azure Functions. While R isn’t natively supported, it offers an array of ‘triggers’, including a timer (which executes a Function on a schedule) - a perfect fit for a simple tweetbot. By using GitHub as the deployment source, a continuous deployment workflow is possible, so I can update the corpus later on and the tweets will adjust accordingly. Dope!

I stumbled across an impeccable tutorial here to guide me through the steps to deployment. As with rtweet, I’m not going to spend time repeating what someone else has covered with aplomb - just read that guide (and check the ins and outs of my repo if you really have to) and you’ll be fine. I would just say that the first run of the script, during which you’ll need to install any packages used, does struggle to finish within five minutes (the default execution time allowed on the consumption plan) - you can up this to ten minutes by following the hosting plan documentation.
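(Most of that first-run time is swallowed by installing packages into a library the Function can write to. Something along these lines at the top of the bot script does the trick - note the library path below is an assumption about the Function host’s writable storage, so adjust it to whatever your app exposes.)

# install any missing packages into a writable library on first run
# (the path below is an assumed writable location on the Function host)
lib <- "D:/home/site/wwwroot/r-lib"
dir.create(lib, recursive = TRUE, showWarnings = FALSE)
.libPaths(c(lib, .libPaths()))

pkgs <- c("dplyr", "tidyr", "stringr", "tidytext", "rtweet")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) {
  install.packages(missing, lib = lib, repos = "https://cloud.r-project.org")
}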

Meet Dumb Hardwax

My work is done - introducing [@dumb_hardwax](https://twitter.com/dumb_hardwax). Trying its hardest to make Hardwax gold once an hour.

Well, the result is incredibly niche, but if you’ve made it this far, I’m sure you already have great plans for a uniquely useless bot (made with love).


  1. To keep the post concise I don’t show all of the code, especially code that generates figures. But you can find the full code here.

  2. Because the website isn’t static (i.e. the release pages change), the workflow is not entirely reproducible. While the code provided here will scrape and clean the data, the end-to-end result may differ. Please refer to the scrape, clean and bot scripts hosted on GitHub for a full audit trail of the code used to create the current version of Dumb Hardwax.

  3. Given that Roel’s post covers a lot of the same word count/bigram/trigram processing steps that I used, check that out if you want more commentary on the code used during this stage.