I’ve started working with Champaign-Urbana’s real-time departure API. Right now, I’m using a little Python script to send requests and store them in a local PostgreSQL database. Below is a probability density plot from the first 1,000 or so data points I’ve pulled down. It’s only from the weekday mornings when I’ve run the script, mostly from the 150 stops I queried just a moment ago.
But it looks like my prediction (see the earlier post) may not have been terribly far from the mark.
The next step is to determine a programmatic way to randomly query particular arrivals to make sure I avoid any systematic error in the sampling. This is necessary because I’m limited to 1,000 API calls per day and can’t just hammer their server with requests for every scheduled arrival.
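Here's a rough sketch of what that random selection might look like. It assumes I already have a list of the day's scheduled arrivals in hand; the function name, the `(stop, trip)` tuples, and the seed parameter are all made up for illustration, and nothing here touches the actual API.

```python
import random

DAILY_CALL_BUDGET = 1000  # the per-day API limit mentioned above


def pick_arrivals_to_query(scheduled_arrivals, budget=DAILY_CALL_BUDGET, seed=None):
    """Pick a simple random sample of scheduled arrivals, without replacement.

    Every scheduled arrival has the same chance of being picked, so no stop,
    route, or time of day is favored by the sampling itself.
    """
    rng = random.Random(seed)  # a seed makes the day's sample reproducible
    arrivals = list(scheduled_arrivals)
    if len(arrivals) <= budget:
        # Fewer arrivals than API calls available: just query all of them.
        return arrivals
    return rng.sample(arrivals, budget)


# Hypothetical schedule: 5,000 arrivals spread across 150 stops.
schedule = [("stop%d" % (i % 150), "trip%d" % i) for i in range(5000)]
todays_queries = pick_arrivals_to_query(schedule, seed=42)
print(len(todays_queries), "arrivals chosen for today")
```

A simple random sample like this treats every scheduled arrival equally, which means busy stops get queried more often than quiet ones. If I wanted equal coverage per stop instead, I'd have to stratify by stop first.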
I’m also recording location attributes for all these records so I’ll be able to do some spatial analysis too :-)
I’m about to start digging into various real-time data feeds for American (bus) transit systems. For the most part right now I’m interested in finding a simple, average distribution of lateness/earliness across all stops, the idea being that this could help riders predict, without live real-time feeds, when the bus is most likely to show up, by looking only at a fixed schedule.
Are buses more likely to be late than early? What percentage of buses are early, anyway? If it’s already five minutes late, is it very likely it’s coming in the next minute? Or should you start walking? What’s the difference in tardiness distributions between frequent and infrequent services? Are there types of places in a city which have consistently different distributions?
In the name of science, I’d like to make a prediction, i.e. state my hypothesis, before I’ve collected any actual data. So here it is:
I think that overall the distribution will have a strong late skew, a very short early tail, and a wide second hump around the time a second bus might start bunching up on the one in question. I’ll guess that between 10% and 20% of buses running on fixed schedules will be at least a few seconds early and that the median will be about 2 minutes late.
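Once there's data, checking the two numbers in that guess should be simple. This is a minimal sketch, assuming delays are recorded in seconds with negative values meaning early; the function name and the sample values are invented for illustration and are not real observations.

```python
from statistics import median


def summarize_delays(delays_seconds):
    """Return (share of buses that were early, median delay in seconds).

    Assumes delay = actual time minus scheduled time, so negative means early.
    Any negative delay counts as early here, even by a single second.
    """
    delays = list(delays_seconds)
    if not delays:
        raise ValueError("need at least one observation")
    early_share = sum(1 for d in delays if d < 0) / len(delays)
    return early_share, median(delays)


# Made-up delays, purely to show the call.
early_share, median_delay = summarize_delays([-30, 60, 120, 180, 240])
print(early_share, median_delay)
```

The hypothesis holds up if `early_share` lands between 0.10 and 0.20 and the median is somewhere near 120 seconds.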
Now…anyone want to suggest a city with a real-time feed? I have my eye set on Portland at the moment but only because I’m having trouble finding decent APIs.
As SORTA continues its endless dissembling, and TANK more honestly lacks the resources, I wonder if the bikeshare of all things will beat both to the postmodern age and release real-time fleet location data.
Dare a boy dream of such things? Perhaps a less cynical one might.
But in the meantime, if anyone is curious to see the actual locations of the bike share stations as they’re installed (which, oddly, I have not seen anywhere else yet), I’ve been adding them to OpenStreetMap as I find them. Look for nodes tagged ‘amenity’=‘bicycle_rental’ and ‘network’=‘Red Bike’, or spot them on the main OSM map stylesheet. Or just follow that link for a map with dynamic query results ;-)
Huge thanks to John Back of OpenDataCincy for hosting all these gigabytes and Dave Walters of the CTHA for providing the data. If any of you have old schedules or maps that you don’t see in this listing, Dave would like to borrow them! Leave a comment and I’ll put y’all in touch.
I read a clever little article while I was researching R graphics for this project. It started out with a whole paragraph just making the point that no one would read the paragraph itself first. Instead, our eyes would all flash down to the brightly colored graphic right below. Mine certainly did. Did yours? I could write pretty much anything here, but you won’t read it until you’ve satisfied yourself with the silly flashy GIF:
But since we’re not interested in psychology (see above, if you haven’t), some description is perhaps in order: blue is a balance in favour of people boarding buses, red of people getting off the buses. Circle area equals the absolute difference between boardings and alightings at each scheduled stop. This means among other things that stops with balanced passenger boardings and alightings will not show up at all[1]. Each time-step is a half-hour period.
The idea here is simply to show the temporal imbalance of passenger flow in the transit system: at some times of day, more people are going in one direction than in the other. This is why you occasionally hear the complaint that “buses are running empty” and implicitly that no one is using the system or that parts of it could bear trimming. The routes are not usually inefficient so much as the passengers are, with their chaotic plans and schedules, failing to neatly balance themselves and offset each other as more orderly citizens might do.
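For anyone who wants the symbol rule spelled out, here's a minimal sketch of it in Python. The function name and the `area_per_passenger` scale factor are assumptions of mine for illustration, and this isn't the code that drew the GIF.

```python
import math


def stop_symbol(boardings, alightings, area_per_passenger=1.0):
    """Return (color, radius) for one stop in one time-step, or None.

    Circle AREA (not radius) is proportional to |boardings - alightings|,
    so the radius grows with the square root of the imbalance.
    """
    net = boardings - alightings
    if net == 0:
        return None  # balanced stops are not drawn at all
    area = abs(net) * area_per_passenger
    radius = math.sqrt(area / math.pi)
    color = "blue" if net > 0 else "red"  # blue: net boarding, red: net alighting
    return color, radius


print(stop_symbol(10, 6))   # net boarding, drawn in blue
print(stop_symbol(2, 18))   # net alighting, drawn in red
print(stop_symbol(5, 5))    # balanced, nothing drawn
```

Scaling by area rather than radius matters: a stop with four times the imbalance gets a circle with four times the area, which is only twice as wide.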
Notice how the overall color of the outlying stops changes at the rush hours from blue (boarding) in the morning to red (alighting) in the evening, the reverse for downtown and most of uptown. Clearly, the overall pattern we’re seeing is that a lot of commuters are coming in from the outer neighborhoods for daytime work in the central ones.
As the title says though, these are preliminary results. I made the circles transparent so colors can blend and become more opaque where activity is more intense, but there’s still a tremendous amount that’s obscured in this map. The next step is to change it from a map of mutually interfering circles into something clearer at the busy confluences of multiple lines. I’m thinking, and if you’ve been following this blog lately it should be on the tip of your tongue already, of creating a kernel density surface to properly deal with the busy areas.
In the meantime, I’ll attempt to assuage your hungry eyes with a couple more GIFs.
My first attempt at a GIF, using a much less automated method[2], shows this same basic pattern, but in the context of Cincinnati neighborhoods.
Another one breaks it down further than the first one I showed, dropping the time-interval to two minutes rather than a half-hour. Again, blue is a balance in favour of boardings, red in favour of alightings.
This one is particularly interesting, I think, because you can really start to see individual buses flashing through the city. I love the finale, right around 25:10 (1:10AM), where the buses all explode out of Government Square on their last run at around the same time to deposit their passengers in an adorably pathetic little fizzle of red from blue.
Anyhoo, I’ll say yet again, this is preliminary and I share it now because I think this data is tremendously exciting. We’ll see what I can make of it in the next few days!
I’m just beginning to play with the new, highly detailed ridership data I got from SORTA, and boy is it a treat. I’ll start here with a high-level overview of the temporal dimension of the data, before looking at spatial aspects and breaking it down by line, stop, service type, etc. as the summer progresses. I think I may also use this data as the basis of my study in R this coming semester (Hi Michael!), so perhaps we can count on seeing some more detailed and particularly nerdy and multivariate analysis through the cooler months as well. I am, by the way, acutely aware that I’ve started a number of little projects on this blog, and have failed as yet to carry them through to their completion. I keep getting distracted by the realization that I have no idea what I’m talking about, the inevitably elusive prospect of making money some way or another, and the all too comforting thought that no one is reading this or taking it seriously anyway. But hopefully, this is a small enough commitment and certainly it’s interesting enough for me to actually provide a reasonably complete picture of this particular dataset before altogether too long. Perhaps I can even apply the same techniques to the TANK ridership dataset that I’ve been meaning to get to and publish for more than a year.
Anyway, let’s actually get to that temporal overview. Since we know the trip each record belongs to, identified by the line number and the trip’s start time, I was able to identify the actual scheduled time for the great majority of stops in the data set by matching the records to GTFS schedule data for the same period. About 170,000 of the 230,000 records matched to a precise time. The remainder account for a very small portion of total activity, about 2%, and I think it’s most likely that many of these records are an artifact of the way SORTA’s database is structured and not actual stops belonging to a trip. I’ll dig into that more some other time though.
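The matching step described above goes roughly like this. It's a sketch under some loud assumptions: the schedule rows are plain dicts already joined from GTFS `trips.txt` and `stop_times.txt`, the ridership records carry hypothetical `route_id`, `trip_start`, and `stop_id` fields, and times are `HH:MM:SS` strings. SORTA's actual column names are surely different.

```python
def build_schedule_lookup(stop_times):
    """Map (route_id, trip start time, stop_id) -> scheduled time at that stop.

    A trip's start time is taken to be the departure time at its lowest
    stop_sequence. Caveat: if a trip serves the same stop twice (a loop),
    the later visit overwrites the earlier one in this simple version.
    """
    first_stop = {}  # trip_id -> (lowest stop_sequence seen, its departure_time)
    for row in stop_times:
        seen = first_stop.get(row["trip_id"])
        if seen is None or row["stop_sequence"] < seen[0]:
            first_stop[row["trip_id"]] = (row["stop_sequence"], row["departure_time"])

    lookup = {}
    for row in stop_times:
        trip_start = first_stop[row["trip_id"]][1]
        lookup[(row["route_id"], trip_start, row["stop_id"])] = row["departure_time"]
    return lookup


def match_records(records, lookup):
    """Split ridership records into those with a scheduled time and those without."""
    matched, unmatched = [], []
    for rec in records:
        key = (rec["route_id"], rec["trip_start"], rec["stop_id"])
        if key in lookup:
            matched.append({**rec, "scheduled_time": lookup[key]})
        else:
            unmatched.append(rec)
    return matched, unmatched
```

Keeping the unmatched records in their own list, rather than dropping them, makes it easy to check later how much boarding and alighting activity they actually account for.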
For the ~98% of boardings and alightings that I could pin to a precise time of day, I created a histogram:
As would be expected, alightings (that is, people getting OFF) trail boardings (getting ON) by a half-hour or so. People need to get on and get somewhere before they’ll get off at their destination. The difference, therefore, is people’s travel time.
Anecdotally, the temporal distribution of transit users closely mirrors the distribution of actual service. This is a chicken/egg situation, and it would make good sense to inquire what ridership might look like late at night if service itself didn’t trail off into hourly or half-hourly frequencies where it continues at all past 10pm. There’s also good reason to suspect that changing service levels at one time of day could affect ridership at another. Might we, for example, see differently shaped rush-hour peaks if suburbanites had and got used to having the option of staying late at their downtown office? If service continued all night, might we see echoes of the main rush-hours as second and third-shifters head for work? Might there be a night-life peak if night service weren’t so abysmal?
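The binning behind a histogram like this one can be sketched in a few lines. The 30-minute bin width, the record fields, and the sample records are all assumptions for illustration. Times past midnight are kept as hours of 24 and up (like 25:10), so late-night trips stay attached to the service day they belong to.

```python
from collections import Counter


def time_bin(time_str, bin_minutes=30):
    """Turn an 'HH:MM:SS' string into a bin index counted from midnight.

    Hours of 24 or more are allowed, so '25:10:00' falls after '23:59:59'.
    """
    hours, minutes, _seconds = (int(part) for part in time_str.split(":"))
    return (hours * 60 + minutes) // bin_minutes


def activity_histogram(records, field, bin_minutes=30):
    """Sum one count field ('boardings' or 'alightings') per time bin."""
    counts = Counter()
    for rec in records:
        counts[time_bin(rec["scheduled_time"], bin_minutes)] += rec[field]
    return counts


# Made-up records, purely to show the call.
sample = [
    {"scheduled_time": "07:10:00", "boardings": 12, "alightings": 1},
    {"scheduled_time": "07:25:00", "boardings": 8, "alightings": 2},
    {"scheduled_time": "07:40:00", "boardings": 3, "alightings": 15},
]
print(activity_histogram(sample, "boardings"))
print(activity_histogram(sample, "alightings"))
```

Running the function once per field gives two histograms over the same bins, which is what lets the alighting curve be read as a lagged copy of the boarding curve.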
EDIT: For those of you who don’t have huge computer monitors…