Mapping your Oyster card journeys in London with tidygraph and ggraph

If you live in London, or have ever visited it, you have probably used an Oyster Card. Introduced in 2003 by Transport for London, your Oyster is a smart payment card which you can use to travel on public transport around London: on buses, tubes, trains, trams and boats, by loading up pay-as-you-go cash onto the card, or pay for a travel card.

If you register your card with TfL, you can view your journey history (however you can only go back as far as 6 weeks). You can also set up TfL to email you at weekly or monthly intervals with your journeys in csv or pdf format. If you’re someone like me, who loves the idea of collecting data like this, then this is too good an opportunity to pass up.

I recent came across this great post by Jesse Sadler, which is a very good introduction to some network analysis concepts in R.

I have been collecting my own oyster journey history for a while now, but never quite been able to work out what to do with it. The concept of mapping it as a network interested me, but I could never get my ahead around some of the network and graph packages available. Jesse’s post really makes it straightforward, and also shows how to use the tidygraph and ggraph packages to do it in a tidy manner.

After I read it, I also queried (via Twitter) whether you could layout a ggraph geographically, and the package’s creator Thomas Lin Pedersen very helpfully replied which has prompted me to create this post (my first!)

So lets jump in with a random sample of 30 journeys I’ve made.

library(tidyverse)
library(stringr)
library(tidygraph)
library(ggraph)
library(ggmap)
oyster <- read_csv("https://gist.github.com/eldenvo/92d1a6bae98de429a47b6c229e140124/raw/86057c3156e8059f9e7281996deeb71fbfd79fa0/oyster-raw-sample.csv")

knitr::kable(head(oyster))

Date	Start.Time	End.Time	Journey.Action	Credit	Balance	Note
28-Nov-2016	20:51:00	21:11:00	Clapham Junction [National Rail] to Denmark Hill [National Rail]	NA	4.55	NA
17-Nov-2016	16:59:00	NA	Bus journey, route 40	NA	10.55	NA
25-Jul-2017	16:58:00	17:21:00	Westminster to Denmark Hill [National Rail]	NA	17.95	NA
26-Nov-2016	03:34:00	NA	Bus journey, route N37	NA	4.55	NA
12-Sep-2016	18:30:00	NA	Bus journey, route 345	NA	16.45	NA
09-Jul-2017	13:49:00	13:58:00	Denmark Hill [National Rail] to Clapham High Street	NA	2.25	NA

Now, in terms of mapping a network, there needs to be a start and a finish to each journey, so you can’t do it for buses, and in general we need to do a bit of work to manipulate the data into something we can use.

In this next bit we’re categorising the types of journey depending on the text contained in the Journey.Action column. Although they’re not there in my example extract, the history also records other actions such as when you top-up, buy a travelcard, or enter and exit the same station without making a journey. We’re also formatting the date and time.

oyster <- oyster %>%
  mutate(journey.type = case_when(str_detect(Journey.Action, "Bus journey") ~ "bus", 
                                  str_detect(Journey.Action, " to ") ~ "train.tube",
                                  str_detect(Journey.Action, "Season ticket") ~ "top.up",
                                  str_detect(Journey.Action, "Topped-up") ~ "top.up",
                                  str_detect(Journey.Action, "Topped up") ~ "top.up",
                                  str_detect(Journey.Action, "Entered and exited") ~ "other"),
         Date = as.Date(Date, format = "%d-%b-%Y"),
         Start.Time = as.POSIXct(paste(Date, Start.Time), format = "%Y-%m-%d %H:%M"),
         End.Time = as.POSIXct(paste(Date, End.Time), format = "%Y-%m-%d %H:%M"))

We want to split the train or tube journeys into a “to” and a “from” column. Before we do that however, we also have to remove some of the extra text that is included in the Journey History statements, such as “[National Rail]” or “[London Overground/National Rail]” and some other cases that I’ve found. There are also some things that we need to remove or edit (and this will become clearer why in a bit), such as “DLR” or changing “&” to “and”.

The way I’ve had to do this using mutate, str_replace_all seems to be very laborious and I feel like there must be a better way (e.g. to create a vector of all the character strings that I want to remove, and pass them to one line of mutate but I haven’t worked out how - please let me know if you have!)

To split the columns into “from” and “to”, we’re going to use str_split_fixed, which takes a string, in this case the Journey.Action column, takes a pattern to split on - " to " - a number of ‘pieces’ to return, i.e. how many columns we want to split this into (for us, 2) and it will return a matrix with two columns. We want the first column of this to become our “from”, and the second to become “to”.

oyster <- oyster %>%
  mutate(Journey.Action = str_replace_all(Journey.Action, " \\(platforms 9-19\\) \\[National Rail\\]", "")) %>%
  mutate(Journey.Action = str_replace_all(Journey.Action, " \\[National Rail\\]", "")) %>%
  mutate(Journey.Action = str_replace_all(Journey.Action, " \\[London Overground\\]", "")) %>%
  mutate(Journey.Action = str_replace_all(Journey.Action, " \\[London Underground\\]", "")) %>%
  mutate(Journey.Action = str_replace_all(Journey.Action, " \\[London Overground/National Rail\\]", "")) %>%
  mutate(Journey.Action = str_replace_all(Journey.Action, " \\[London Underground / National Rail\\]", "")) %>%
  mutate(Journey.Action = str_replace_all(Journey.Action, " \\(Bakerloo, Circle/District and H&C\\)", "")) %>%
  mutate(Journey.Action = str_replace_all(Journey.Action, " \\(Piccadilly, Victoria lines\\)", "")) %>%
  mutate(Journey.Action = str_replace_all(Journey.Action, " journey\\,", "")) %>%
  mutate(Journey.Action = str_replace_all(Journey.Action, " route", "")) %>%
  mutate(Journey.Action = str_replace_all(Journey.Action, "'", "")) %>%
  mutate(Journey.Action = str_replace_all(Journey.Action, " DLR", "")) %>%
  mutate(Journey.Action = str_replace_all(Journey.Action, "&", "and")) %>%
  mutate(Journey.Action = str_replace_all(Journey.Action, "St Pancras International", "St Pancras")) %>%
  mutate(Journey.Action = str_replace_all(Journey.Action, "Kings Cross", "Kings Cross St. Pancras")) %>%
  mutate(Journey.Action = str_replace_all(Journey.Action, "Sutton Surrey", "Sutton")) %>%
  mutate(from = if_else(journey.type == "train.tube",str_split_fixed(Journey.Action," to ",2)[,1],""),
         to = if_else(journey.type == "train.tube",str_split_fixed(Journey.Action," to ",2)[,2],"")) %>%
  select(2,3,10,11,4:9)

knitr::kable(head(oyster,5))

Start.Time	End.Time	from	to	Journey.Action	Credit	Balance	Note	journey.type
2016-11-28 20:51:00	2016-11-28 21:11:00	Clapham Junction	Denmark Hill	Clapham Junction to Denmark Hill	NA	4.55	NA	train.tube
2016-11-17 16:59:00	NA			Bus 40	NA	10.55	NA	bus
2017-07-25 16:58:00	2017-07-25 17:21:00	Westminster	Denmark Hill	Westminster to Denmark Hill	NA	17.95	NA	train.tube
2016-11-26 03:34:00	NA			Bus N37	NA	4.55	NA	bus
2016-09-12 18:30:00	NA			Bus 345	NA	16.45	NA	bus

Next we can prepare the data for network analysis. First of all, we’re filtering so we end up with only the train or tube journey. And then we’re filtering out the ones that don’t have a start or end point (because I forgot to tap in or out.) We only want the names of the stations - for our purposes from now on, the other info (times, charges) from the Journey History ins’t relevant, although there’s plenty of other things you could do with that data.

to.remove <- c("[No touch-out]","[No touch-in]","[Unspecified location]")

for.network <- oyster %>%
  filter(journey.type == "train.tube") %>%
  select(from, to) %>%
  filter(!from %in% to.remove,
         !to %in% to.remove)

(Like a Blue Peter presenter, I’m going to now skip ahead and use a file I prepared earlier that’s a bigger random selection of just tube and train journeys - but the file used from now was arrived at using the steps above. At this stage, I’m also loading a csv with all the tube and train stations in London, with their locations, zone, etc.)

for.network <- read_csv("https://gist.github.com/eldenvo/746d430bfffb90440d9c623db1818fa8/raw/2b749befe5887fff7295bfdccff882fe47aa7288/oyster-network-prepared.csv")

stations <- read_csv("https://gist.github.com/eldenvo/91914f027af228036e1c2dc535646a90/raw/c6dd4e33e6a818541a32aada3cda2dbb3ee572b2/stations.csv")

Following the steps again from Jesse Sadler’s post, referenced above, we want to then make a list of all the unique stations I went from, and then a list of all the unique stations I went to, and then join them together, to create our table of nodes (i.e stations) for the network, and also assign them an id. Lastly, we’ll join our list of nodes with the info we’ve imported from the stations.csv - and you’ll understand that this is why we had to tidy up the names of the stations from the raw journey history file, so that they match at this point.

sources <- for.network %>%
  distinct(from) %>%
  rename(label = from)

destinations <- for.network %>%
  distinct(to) %>%
  rename(label = to)

nodes <- full_join(sources, destinations, by = "label")


nodes <- nodes %>%
  mutate(id = row_number())

nodes <- nodes %>%
  left_join(stations, by = c("label" = "Station")) %>%
  select(id, label, x = Longitude, y = Latitude, zone = Zone, postcode = Postcode)

knitr::kable(head(nodes,5))

id	label	x	y	zone	postcode
1	Peckham Rye	-0.0699836	51.46951	2	SE15 5DQ
2	East Croydon	-0.0918635	51.37572	5	CR0 1LF
3	Old Street	-0.0876230	51.52558	1	EC1V 9NR
4	West Croydon	-0.1020343	51.37855	5	CR0 2TA
5	Forest Hill	-0.0533285	51.43931	3	SE23 3HD

Then we want to create the edges, which is the list of the journeys made. Essentially, this is to identify all the unique journeys made, and how many times they have been made (their weight), using group_by and summarise.

edges <- for.network %>%
  group_by(from, to) %>%
  summarise(weight = n()) %>%
  ungroup() %>%
  arrange(-weight)

head(edges,5)

## # A tibble: 5 x 3
##               from               to weight
##              <chr>            <chr>  <int>
## 1     West Croydon      Forest Hill      9
## 2     East Croydon Clapham Junction      2
## 3      Westminster     Denmark Hill      2
## 4      Blackfriars      Westminster      1
## 5 Clapham Junction     Denmark Hill      1

Finally, join this table with the nodes, in order to just get a list of the source and destination id, and the weight.

edges <- edges %>%
  left_join(nodes[,c(1,2)], by = c("from" = "label")) %>% 
  rename(from.id = id)

edges <- edges %>% 
  left_join(nodes[,c(1,2)], by = c("to" = "label")) %>% 
  select(from.id, to.id = id, weight)

head(edges,5)

## # A tibble: 5 x 3
##   from.id to.id weight
##     <int> <int>  <int>
## 1       4     5      9
## 2       2     6      2
## 3       8     7      2
## 4       9     8      1
## 5       6     7      1

And now, we’re ready to use tidygraph to turn this into a network! And now we have our edges and nodes neatly sorted out, it’s very simple to create a tbl_graphthat can be analysed and used further in ggraph.

routes <- tbl_graph(nodes = nodes, edges = edges, directed = TRUE)

print(routes)

## # A tbl_graph: 22 nodes and 20 edges
## #
## # A directed simple graph with 5 components
## #
## # Node Data: 22 x 6 (active)
##      id            label           x        y  zone postcode
##   <int>            <chr>       <dbl>    <dbl> <int>    <chr>
## 1     1      Peckham Rye -0.06998355 51.46951     2 SE15 5DQ
## 2     2     East Croydon -0.09186350 51.37572     5  CR0 1LF
## 3     3       Old Street -0.08762296 51.52558     1 EC1V 9NR
## 4     4     West Croydon -0.10203429 51.37855     5  CR0 2TA
## 5     5      Forest Hill -0.05332850 51.43931     3 SE23 3HD
## 6     6 Clapham Junction -0.17031500 51.46437     2 SW11 2QP
## # ... with 16 more rows
## #
## # Edge Data: 20 x 3
##    from    to weight
##   <int> <int>  <int>
## 1     4     5      9
## 2     2     6      2
## 3     8     7      2
## # ... with 17 more rows

With that, we have enough data to produce a simple network graph, that is also geographically accurate! Using layout = 'nicely' in ggraph will automatically pull the x and y coordinates of the stations from the edges frame. ggraph follows the grammar of ggplot2 and adds geoms to represent the nodes and edges.

ggraph(routes, layout = "nicely") + 
  geom_node_point() +
  geom_edge_link() +
  theme_graph()

Well - that’s something!
We can make that a bit more comprehensible by providing some further details, and there’s plenty more information in the vignettes for ggraph (the stuff below for the edges comes from this page):

Rather than showing the nodes as points, I’ll put them as labels with geom_node_label
- and we will give them a colour based on the zone of the station
Rather than geom_edge_link I’ll use geom_edge_fan so that journeys along the same route but in the opposite direction no longer overlap
- we will also add arrows to make the direction clear, and then use the start_cap and end_cap to include a bit of space between the arrows and the labels
The thickness and the transparency of the edges is determined by the weigh (i.e how frequently each journey is made)
- but because there can be quite a large discrepancy between the weights (e.g some journeys could be made every day, others a one off), we want to define a range to the scale_edge_alpha and scale_edge_width so they’re not too drastic

ggraph(routes, layout = "nicely") + 
  geom_node_label(aes(label = label, fill = as.factor(zone)),
                  size = 1.75, show.legend = F,label.padding = unit(0.15, "lines")) +
  geom_edge_fan(aes(start_cap = label_rect(node1.label),
                    end_cap = label_rect(node2.label),
                    width = weight, alpha = weight),
                arrow = arrow(length = unit(2.25, 'mm'), angle = 20), show.legend = F) + 
  scale_edge_width(range = c(0.5, 3)) +
  scale_edge_alpha(range = c(0.5,0.75)) +
  theme_graph()

That’s much better! There’s some overlapping and lack of clarity, but that could all be tidied up for a final output. We can now make a bit more sense of the network, the stations, which journeys are more frequent.

For my next trick, I will plot this on a map using ggmap - as mentioned by Thomas Lin Pedersen in the tweet I linked above, and also described in this RPubs post, which is very helpful if not altogether easy to follow..

I want to find the limits of my network, and slightly expand them, to ensure everything fits on our eventual map, and set these as a vector anti-clockwise from the left. From these, we’ll make a nice watercolor map of London from get_stamenmap.

x.range <- (max(nodes$x) - min(nodes$x))/20
y.range <- (max(nodes$y) - min(nodes$y))/20

coords <- c(left = min(nodes$x)-x.range, bottom = min(nodes$y)-y.range, right = max(nodes$x)+x.range, top = max(nodes$y)+y.range)

map <- get_stamenmap(coords, zoom = 12, maptype = "watercolor")

ggmap(map)

Pretty, no? Now we have our base map, it’s easy to plot our network on top of it, following the steps indicated in the RPubs post.

We need to pass our first ggraph() call to the base_layer= argument of ggmap, and then add the rest of our ggraph geoms.

gg <- ggraph(routes, layout = "nicely")

ggmap(map, base_layer = gg) +
  geom_node_label(aes(label = label, fill = as.factor(zone)),
                  size = 1.75, show.legend = F,label.padding = unit(0.15, "lines")) +
  geom_edge_fan(aes(start_cap = label_rect(node1.label),
                    end_cap = label_rect(node2.label),
                    width = weight, alpha = weight),
                arrow = arrow(length = unit(2.25, 'mm'), angle = 20), show.legend = F) + 
  scale_edge_width(range = c(0.5, 3)) +
  scale_edge_alpha(range = c(0.5,0.75)) +
  theme_graph()

Et voila! There we have it. I’ll save it at a higher quality with ggsave("oyster-map-1.png", width = 9, height = 9, dpi = 300, type = "cairo-png") So there we have it - actually plotting the map was really not that complicated at all, most of the work is just wrangling the data into the right format. With bigger numbers of journeys (I have about 500 recorded…), the maps becomes quite busy - but you can play around with the geoms and other arguments to make it work.

Things I’d be interested in doing with this (but don’t know how) would be:

Rather than a geographically accurate map, could you plot the nodes to match the stations’ locations on the tube map? You’d just need to pass a different set of coordinates, but how would you get them?
As I understand, you can facet in ggraph, but I don’t know how you could do that here to show a month-by-month sequence of small multiples
In addition to that, could you use gganimate to show each journey stacking up on eachother over time?

Any thoughts or ideas on this, or any other tips for me based on this post, just send me a tweet or leave a comment on github. I’m new to this and I’d welcome any feedback!