Robot-Sara – a recommendations robot

WOWHack is an annual, 2-day music hack event which precedes Way Out West festival, and this year Spotify kindly put us up in their Gothenburg office for the event. 67 developers organised themselves into 25 teams, and spent 2 days building cool music-related hacks.

My hack was called “Robot-Sara”, and this is a pretty accurate description of how I went about building it (a quote from my hack presentation):

Interactive Recommendations

I love Spotify’s Discover feed, and it works great when you’re using your laptop or mobile. Driven by a fascination with robots and human-computer interaction (and heavily inspired by Jibo), I wondered if I could try and build a more interactive and personal way of consuming music recommendations based on my Spotify listening habits. I built Robot-Sara, a recommendations robot – here’s a quick demo:

Designing the recommendations robot

After having the initial idea for this hack, I fleshed out the core functionality I wanted to be able to support:

  • Speech recognition & response – I wanted to be able to talk to the recommendations robot, and also have it talk back to me
  • Physical movement – to move from a purely software-based implementation into a real-world interactive device, this hack needed to move in some way
  • Music recommendations – can’t have a recommendations robot without recommendations
  • Audio playback – a way to play/preview specific recommendations

After a bit of brainstorming, I came up with this (very sketchy) sketch:

Robot-Sara-Sketch

The basic idea was to have an iPhone, running a speech-enabled app, act as the interface to a “recommendations robot”. The iPhone would be physically attached to a robotic arm, and as the user interacted with the iPhone app (via speech), the app would send requests to the robot arm over WiFi to make it move as necessary. The iPhone app would call a recommendations API to retrieve custom recommendations for the user, and the app could then play tracks based on the recommendations received.

Note: In my sketch, I envisaged the robot controller API and recommendations API running on a Raspberry Pi (“R-Pi” in the diagram) – after fighting with flakey WiFi on the Raspberry Pi, this was eventually replaced with the services running on my laptop.

The Parts

Robot-Sara consists of the following software and hardware:

An iOS app

I wrote a simple iOS app, with text-to-speech, speech-to-text, and audio playback capabilities. After playing around with the native text-to-speech and speech-to-text support in iOS, I settled on a third-party SDK by Nuance, which (in my opinion) gave more accurate speech detection results, and had higher quality voice synthesisers. I also integrated Spotify’s beta iOS SDK, to allow for audio playback.
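For reference, the native text-to-speech route I played with first only takes a few lines using AVSpeechSynthesizer (the Nuance SDK replaces this in the finished app; the voice and phrase below are just placeholders):

#import <AVFoundation/AVFoundation.h>

// Speak a phrase using the built-in iOS synthesiser (the voice choice here is arbitrary).
AVSpeechUtterance *utterance = [AVSpeechUtterance speechUtteranceWithString:@"Hello Aaron"];
utterance.voice = [AVSpeechSynthesisVoice voiceWithLanguage:@"en-GB"];
AVSpeechSynthesizer *synthesizer = [[AVSpeechSynthesizer alloc] init];
[synthesizer speakUtterance:utterance];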

The core design of the app involved a “brain” and a collection of “skills”. Once a user has uttered a phrase, and the speech-to-text library has processed the audio and returned a result, the phrase gets passed to the “brain” class for processing.

The brain is responsible for being aware of all supported skills, finding the right skill for an uttered phrase, and asking that skill to perform an action for the phrase.

RobotSaraBrainSkills
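In code, the brain’s dispatch boils down to a simple loop. Here’s a minimal sketch (the method and property names are illustrative, and I’m assuming the brain keeps its skills in an array):

// Called once the speech-to-text library has returned a recognised phrase.
- (void)processPhrase:(NSString *)phrase {
    for (id skill in self.skills) {
        if ([skill canActionPhrase:phrase]) {   // first skill that matches wins
            [skill actionPhrase:phrase];        // e.g. move the arm, speak a response
            return;
        }
    }
    // No skill matched closely enough; the phrase probably wasn't meant for Sara.
}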

The base “skill” class has the following properties and methods:

NSArray *phrases;                            // the phrases this skill responds to
- (BOOL)canActionPhrase:(NSString*)phrase;   // YES if phrase closely matches one of phrases
- (void)actionPhrase:(NSString*)phrase;      // perform this skill's action for phrase

Each skill implements its own set of supported phrases – these are the phrases that will cause a skill to be actioned. For example, the “WakeUp” skill has the following phrases:

phrases = @[@"sara wake up",
            @"wake up sara"];

The canActionPhrase method is implemented in the base skill class. It iterates over the skill’s supported phrases, performing fuzzy string matching between the uttered phrase and each supported phrase, to see if a close enough match is found to action the current skill.

For example, if the user utters “Sara wake up please”, the WakeUp skill’s phrase “sara wake up” has a fuzzy match of 80%. This is above the minimum fuzzy-match threshold required to be sure a skill can perform an action for a phrase, and will cause the WakeUp skill to return YES from canActionPhrase. When this happens, the brain has successfully found a skill for the phrase, and requests that the skill “actions” the phrase (with a call to the skill’s actionPhrase method). In this example, the WakeUp action involves telling the robot arm to lift up and rotate, and then telling the app to say “Hello Aaron”.
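Here’s a rough sketch of that check, as implemented in the base skill class. The word-overlap scorer below is only an illustrative stand-in for the fuzzy string matching I actually used (so it won’t reproduce the exact percentages above), but the shape is the same: score each supported phrase and accept anything over the threshold.

// Returns YES if the uttered phrase closely matches any of this skill's phrases.
- (BOOL)canActionPhrase:(NSString *)phrase {
    for (NSString *supported in phrases) {
        if ([self matchScoreForPhrase:phrase against:supported] >= 0.8) {
            return YES;   // above the minimum fuzzy-match threshold
        }
    }
    return NO;
}

// Illustrative stand-in scorer: the fraction of the supported phrase's words
// that also appear in the uttered phrase.
- (double)matchScoreForPhrase:(NSString *)uttered against:(NSString *)supported {
    NSSet *utteredWords = [NSSet setWithArray:
        [[uttered lowercaseString] componentsSeparatedByString:@" "]];
    NSArray *supportedWords = [[supported lowercaseString] componentsSeparatedByString:@" "];
    NSUInteger hits = 0;
    for (NSString *word in supportedWords) {
        if ([utteredWords containsObject:word]) { hits++; }
    }
    return supportedWords.count > 0 ? (double)hits / supportedWords.count : 0.0;
}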

A recommendations service

This is a simple web-service (using Sinatra), running on my laptop, which uses the Last.fm user.getTopArtists API to get my favourite artists based on scrobbles from Spotify. With that information, it hits EchoNest’s similar-artist API to recommend artists based on my listening habits. It has a single “/recommendations” endpoint; here’s an example response:

{"results":
  {"recommendations":[{
     "artistName": "Tame Impala",
     "sourceArtistName": "Unknown Mortal Orchestra",
     "listensCount": "14"
    }],
  },
}

When a user requests a recommendation from the app, the app makes a call to the recommendations service running on my laptop, and the required data is returned. The app can then use the recommended artist name to find a popular track for that artist on Spotify, and stream a 15-second preview via the Spotify iOS SDK.
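For illustration, the app-side call is just an HTTP GET plus a little JSON parsing, along these lines (the host and port are placeholders for my laptop’s address on the venue WiFi):

// Placeholder address for the recommendations service running on my laptop.
NSURL *url = [NSURL URLWithString:@"http://192.168.1.20:4567/recommendations"];
[[[NSURLSession sharedSession] dataTaskWithURL:url
      completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
    if (error || !data) { return; }
    NSDictionary *json = [NSJSONSerialization JSONObjectWithData:data options:0 error:nil];
    NSDictionary *recommendation = [json[@"results"][@"recommendations"] firstObject];
    NSString *artistName = recommendation[@"artistName"];
    // Next: search Spotify for a popular track by artistName and stream a preview.
}] resume];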

A robotic-arm controller service

This is another simple web-service (using Sinatra), running on my laptop, which allows the robotic arm to be controlled via an API. Using a robotic-arm driver I wrote at a previous hack day, I created a single endpoint, “/perform_action”, which accepted “action” and “seconds” parameters, e.g.

“/perform_action?action=shoulder_up&seconds=1.5”

which would move the robotic arm’s shoulder up for 1.5 seconds. I also created 3 pre-programmed actions for common tasks:

  • “wake_up” – lift the arm up, and rotate the base left
  • “sleep” – rotate the base right, lower the arm
  • “test” – turn the LED on the arm on for 2 seconds (useful before a demo to confirm the arm is connected!)

When a user requests an action that involves movement from the robotic arm, the skill class responsible for that action (e.g. WakeUp) will call the robotic arm API running on my laptop, and the arm will spring into life.
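As a sketch, a skill’s call to the controller is another plain HTTP request. This fires the documented shoulder_up example (the laptop’s address and port are placeholders; a skill like WakeUp would swap in its pre-programmed action name):

// Placeholder address for the robotic-arm controller service on my laptop.
NSString *urlString =
    @"http://192.168.1.20:4568/perform_action?action=shoulder_up&seconds=1.5";
[[[NSURLSession sharedSession] dataTaskWithURL:[NSURL URLWithString:urlString]
      completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
    // Fire-and-forget: the arm starts moving; nothing in the response is needed.
}] resume];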

A Spotify token-swap service

This is a simple OAuth token swap service, which the Spotify iOS SDK beta requires as part of the OAuth exchange process.

A robotic arm

The arm formerly known as Hipster-Robot, a dead-cheap, DIY robot arm I bought from Maplins and built in a couple of hours at a previous hack day.

An iPhone

‘Nuff said.

Why “Sara”?

I wanted an “action word” that could be programmed into the robot; a keyword that the robot could use to determine if a user was talking to the robot, or someone else in the room.

After a very unscientific experiment on some popular names, “Sara” and “David” appeared to be names the speech recognition library could consistently and accurately detect. From there, I tested the vocaliser libraries – the female voices were of a much higher quality than their male counterparts, so “Sara” it was.

With a very small amount of effort (and even less imagination) this became “SARA”, a backronym for Spotify Automated Recommendations App.

The Code

For this event, we submitted to private Github repos set up by the organisers – I’ll upload the code to my own Github page in the next day or two.