I think I’ve finally perfected my method for linking real-time data with scheduled stops. This is a comparison of the average (weekly) scheduled speeds to the observed average speed for each stop->stop segment. Results that look roughly as expected are what we all hope for.
Note that each classification is broken into eight equal-sized quantiles.
There is a lot of information in that little gif! More than I can explain here. More to come…
Higher resolution here by the way. It’s interesting to look at even if you don’t know Toronto. Also, the line widths are determined by the number of trips scheduled for each segment.
The data is structured according to the GTFS real-time specification. I was able to parse it pretty easily in Python by following the instructions on that page. The fields currently included in the feed (many are optional in the specification) are as follows.
The feeds update every 30 seconds, which seems a little slow, but oh well.
Right now, my understanding is that these feeds have been tentatively released as-is for developers only, and that SORTA is not ready yet to make a general public announcement that real-time data is available. Tim Harrington at SORTA, who shared the links with us, has politely asked to see the neat stuff that we’re able to develop with this data. I imagine that the sooner someone sends him a link to a decent, working app, the sooner they’ll give us the go-ahead and the sooner we’ll all be able to use this data in everyday situations.
So who’s gonna make an app? There must be a dozen open-source applications that are already designed to work with GTFS-realtime. We probably just need to plug this feed in and maybe make a few localization tweaks. If you or anyone you know has the skills and/or interest to make an app…then for the love of transit, let’s make this happen ASAP!
A wee nit to pick from SORTA’s recent “State of Metro” dog and pony show:
I distinctly remember one of the speakers saying something like ‘and all this without raising fares!’, and this to my feeble memory reeked of bullshit, so I found the numbers again and ran them to see if I was remembering correctly. I was indeed.
Here are the facts, as reported by SORTA to the Federal Transit Administration. Over the period where we have data on both fare revenues and ridership (currently 2002 to 2012) SORTA has been steadily getting more money from fare revenues while moving fewer passengers. We are currently at the nadir of this trend, with
More fare revenue than ever
Fewer passenger trips than ever
When SORTA says by the way that they are a ‘most efficient’ agency, a title pinned on them by the laughably unscientific UC Economics Center, it is precisely this measure they have in mind. There is hardly a better example of doublespeak to be found. Here’s the trend:
In order to plot both agencies together, I normalized fares and passenger trips to the same range. The scale is linear.
Now you may rightly note that the standard fare for a zone 1 trip hasn’t changed lately. But that’s not the only kind of fare that can be paid. It might not even be the most common! I don’t know for certain. I haven’t personally paid standard fare in quite a while because my transit use is partly subsidized by UC. So for example, the fare revenue variable in this data almost certainly includes UC’s cash subsidy for my fare as well as the dollar I put in myself. Multiply that by the dozens of private fare subsidies each agency probably negotiates (or drops) each year and you get a more dynamic picture. Fare could also be affected, though probably isn’t, by people using transit cards more or less, while paying the same monthly price.
But anyway, I’ll be damned if the total price paid by riders or their agents, on a per-trip basis, doesn’t constitute a better definition of ‘fare’ than SORTA’s standard zone-1 single-segment price. And by that definition, fares have risen from $0.76 in 2002 to $1.78 in 2012 (+134%). For TANK, the change is from $0.72 in 2002 to $1.16 in 2012 (+60%). Adjusting for inflation, the changes are 84% and 26% respectively. So much for SORTA’s unchanging-fares lie.
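For anyone who wants to check the arithmetic, here is a minimal Python sketch of the per-trip fare calculation. The function names are mine, and the price-level ratio (roughly 1.276 for 2002 to 2012) is an assumed, approximate figure that you should replace with an official CPI ratio. The per-trip inputs are the rounded figures quoted above, so results may differ by a point or so from numbers computed on unrounded data.

```python
def average_fare(fare_revenue, passenger_trips):
    """Total fare revenue divided by total passenger trips."""
    return fare_revenue / passenger_trips


def percent_change(old, new):
    """Nominal change from old to new, in percent."""
    return (new - old) / old * 100.0


def real_percent_change(old, new, price_level_ratio):
    """Change after deflating the newer value to the older year's dollars.

    price_level_ratio is (price index in the later year) divided by
    (price index in the earlier year).
    """
    return (new / price_level_ratio - old) / old * 100.0


# Assumption: approximate ratio of 2012 to 2002 consumer prices.
ASSUMED_PRICE_RATIO_2002_2012 = 1.276

if __name__ == "__main__":
    nominal = percent_change(0.76, 1.78)
    real = real_percent_change(0.76, 1.78, ASSUMED_PRICE_RATIO_2002_2012)
    print(f"SORTA nominal change: {nominal:.0f}%")
    print(f"SORTA real change:    {real:.0f}%")
```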
I’ll end with an ineffectual plea to the people at SORTA. Please, understand that when you speak in lies and euphemisms, no matter how nice your breakfast spread, you turn off clever people and retain only the idiots and the cynical. People from all three of these categories vote, to be sure, but I know who I’d rather spend my time with. And I know who could build the better transit system.
It looks like a crude system map, and it is, but it’s actually made of thousands of vehicle GPS tracks.
The vehicle tracks are oddly pretty; I keep wasting time just zooming in on different spots as new tracks are added. Here’s a transfer point at one of the subway stations:
Bus stops shown as ~20m circles
And what looks like perhaps a train station or a bus garage, below. This image also shows the relative frequency of service on different streets, something that becomes quite visible in the data when the lines are given a high degree of transparency.
Buildings for scale
And here’s an expressway carrying some limited-stop services:
As of now, after just a week of erratic development and testing, I’ve collected ~60,000 unique tracks, representing ~160 scheduled lines, derived from 3,200,000+ vehicle location records. About 50 new vehicle locations come in each second that I have my little script running.
Here’s what I’m doing so far, described algorithmically here, and implemented in a Python script:
Request updated vehicle locations from the API every five or six seconds. The API only sends the ones which have updated since my last request, which ends up being between 200 and 600 depending on the day and time of day. There seem to be between 500 and 1500 vehicles operating at any given time in Toronto, so I’m seeing maybe a 10 second update from each on average. It lets me do this for the whole agency at once, which is slightly surprising.
Keep tabs on each vehicle by its given ID number, and begin building a track for each vehicle by putting points in order of their appearance.
If a vehicle fails to update within 60 seconds, gets a new route assignment, or gets a new direction identifier on the same route, I start a new track for it and insert the old one into the database.
Tracks shorter than 5 points, 500m, or 2 minutes (or some other arbitrary threshold) can then be ignored or dropped.
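The steps above can be sketched roughly like this. It is not the actual script: the class and method names are hypothetical, a plain list stands in for the database, and only the point-count threshold is applied when dropping short tracks. The 60-second timeout is checked both when a vehicle's next report arrives and in a separate sweep.

```python
from dataclasses import dataclass, field

TIMEOUT_S = 60   # split a track after this long without an update
MIN_POINTS = 5   # tracks with fewer points are dropped


@dataclass
class Track:
    vehicle_id: str
    route: str
    direction: str
    points: list = field(default_factory=list)  # (timestamp, lat, lon)


class TrackBuilder:
    def __init__(self):
        self.open_tracks = {}  # vehicle_id -> Track still being built
        self.finished = []     # stand-in for the database

    def add_report(self, vehicle_id, route, direction, timestamp, lat, lon):
        track = self.open_tracks.get(vehicle_id)
        if track is not None:
            last_time = track.points[-1][0]
            if (timestamp - last_time > TIMEOUT_S
                    or route != track.route
                    or direction != track.direction):
                # Gap, new route, or new direction: close the old track.
                self._close(vehicle_id)
                track = None
        if track is None:
            track = Track(vehicle_id, route, direction)
            self.open_tracks[vehicle_id] = track
        track.points.append((timestamp, lat, lon))

    def close_stale(self, now):
        """Close every track whose vehicle has gone quiet for too long."""
        for vehicle_id, track in list(self.open_tracks.items()):
            if now - track.points[-1][0] > TIMEOUT_S:
                self._close(vehicle_id)

    def _close(self, vehicle_id):
        track = self.open_tracks.pop(vehicle_id)
        if len(track.points) >= MIN_POINTS:
            self.finished.append(track)
```

Calling `add_report` once per vehicle location and `close_stale` after each polling cycle would keep the open tracks current; length and duration thresholds could be added to `_close` in the same way as the point count.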
A track is a set of ordered points, each point with a position and a time. The next step is to line the tracks up with the stop segments to which they’re scheduled, and if they’re actually close and the direction matches, to calculate stop times and segment durations from the observations. That’s actually turning out to be pretty difficult, but I’m sure I’ll crack it fairly soon. One thing I’ll have to seriously consider as I’m doing this is error in the location reports.
As the first image and the one immediately above show, there is significant error in the data, particularly downtown where tall buildings are presumably interfering with GPS signal reception.
My master’s thesis proposal, something I’ll be talking a lot about in the coming months:
The popular conflation of bus coach and railcar-based public transit lines with their typical relation to automotive traffic has caused much confusion in recent years. Though the superficial wheel may not much matter, the general public is right to sense a distinction in speed and reliability between transit services that operate in mixed traffic and those that are given priority over such traffic. As the public more and more aggressively demands rail-based transit services, these should be read as demands for increased speed and reliability (among many other things) and planners should respond by modifying existing services to meet these implicit demands.
Speed and reliability are in large part a function of the potential delays a line encounters along its course. Potential delay, or random delay, results from events that cannot be precisely planned for, such as automotive, pedestrian or bicycle traffic, flag-stop passenger boardings and alightings, traffic signals, and bus wheelchair boardings and securements. Scheduled delay, also known as schedule padding, is delay built into scheduled transit services that allows them to tolerate regular disruptions and unscheduled delays by conceding the average effects of such delays in advance. Agencies try to strike a balance between heavily padded (and thus slow) schedules and the disruptions of extra unscheduled delay to create schedules that are neither too slow nor too often late. While the public often reacts negatively to significantly late vehicles, they are typically unaware of schedule padding, though both are dependent on the same environmental factors.
Since the politically active public, and not transit schedulers, are in control of policy direction in most cities, it becomes important to explain delay and its causes and effects to a lay audience and thereby to direct them toward potentially fruitful responses. Further, since funds for radical infrastructure interventions are difficult to find in the current political regime, attention should be focused on potential incremental improvements to the surface-running bus lines which constitute the vast majority of all transit.
Toward a Solution:
Where can the smallest new delay-avoidance technique create the biggest potential improvement in speed and reliability for existing services?
This thesis proposes a technique for exposing and visualizing the spatial and temporal locations of random delay and schedule padding. Implicitly, this should reveal the space-time locations where transit is running more slowly and less reliably than it might if it could avoid delay, and suggest times and places where delay-avoidance techniques such as designated rights of way for transit could have the biggest impact. The focus will not be on any particular line or city but will try to demonstrate the possibility for and usefulness of the technique in a variety of different circumstances.
There will be two ways of identifying delay. The first will be to analyze the agency’s General Transit Feed Specification (GTFS) schedule data. For each scheduled trip segment ( stop A → stop B ), the minimum time to complete that segment in any trip will be identified and considered as the baseline for that segment. Assuming this gives reasonable values, any time another trip spends completing that same segment beyond the minimum will be considered to be schedule padding.
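As a rough illustration of this first method, here is a minimal Python sketch. It assumes the rows come from a GTFS `stop_times.txt` file already read into dictionaries (for example with `csv.DictReader`) and that every stop has both an arrival and a departure time filled in, which is not true of every feed. The function names are hypothetical.

```python
from collections import defaultdict


def parse_gtfs_time(text):
    """Convert 'HH:MM:SS' to seconds; GTFS hours may run past 24."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def segment_padding(stop_times):
    """Return {(trip_id, (from_stop, to_stop)): padding in seconds}."""
    by_trip = defaultdict(list)
    for row in stop_times:
        by_trip[row["trip_id"]].append(row)

    # Scheduled duration of every stop-to-stop segment in every trip.
    durations = defaultdict(list)
    for trip_id, rows in by_trip.items():
        rows.sort(key=lambda r: int(r["stop_sequence"]))
        for a, b in zip(rows, rows[1:]):
            seconds = (parse_gtfs_time(b["arrival_time"])
                       - parse_gtfs_time(a["departure_time"]))
            durations[(a["stop_id"], b["stop_id"])].append((trip_id, seconds))

    # Padding is whatever a trip is scheduled beyond the segment minimum.
    padding = {}
    for segment, scheduled in durations.items():
        baseline = min(seconds for _, seconds in scheduled)
        for trip_id, seconds in scheduled:
            padding[(trip_id, segment)] = seconds - baseline
    return padding
```

Whether the minimum is a reasonable baseline depends on the schedule; a robust variant might use a low percentile instead, but the minimum is what the proposal describes.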
The second method will take a large sample of representative data from a set of three or four public ‘real-time’ transit APIs from the same agencies. Requests to these APIs will be made with a Python script which will process the results and store them in a PostGIS database for later analysis. These real-time data will be compared to the GTFS schedule data for the same period. The basic task in this method will be to identify temporal vehicle trajectories by following particular vehicles along each line as they become more or less displaced from their schedule. These data will be used to:
Relocate the ideal speed that determines the extent of schedule padding in a given segment by looking at the best reasonable, observed speeds of late-running-but-catching-up vehicles, in each (spatial) trip segment.
Identify excess random delay by identifying segments where vehicles became late or were becoming later relative to their (padded) schedules. Determine the amount by which they were delayed beyond their schedule time in these segments and look for non-random unscheduled delay.
Since the API’s vehicle location and arrival estimate reporting is fairly discontinuous (these might be updated every 5-30 seconds), arrival times will be interpolated along the length of a trip.
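A minimal sketch of that interpolation, in Python, might look like the following. It assumes each vehicle report has already been projected onto the route and expressed as a distance along it, a step not described here, and it interpolates linearly between the two reports that bracket a stop. The function name is hypothetical.

```python
def interpolate_arrival(observations, stop_distance):
    """Estimate when a vehicle reached a stop.

    observations: list of (timestamp_s, distance_along_route_m),
                  sorted by timestamp.
    stop_distance: position of the stop along the same route, in metres.
    Returns the interpolated timestamp, or None if no pair of
    consecutive reports brackets the stop.
    """
    for (t0, d0), (t1, d1) in zip(observations, observations[1:]):
        if d0 <= stop_distance <= d1:
            if d1 == d0:
                # Vehicle was stationary at the stop; take the first report.
                return t0
            fraction = (stop_distance - d0) / (d1 - d0)
            return t0 + fraction * (t1 - t0)
    return None
```

Linear interpolation assumes constant speed between reports, which gets less believable as the reporting interval grows toward 30 seconds.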
Results will be mapped both by line and for whole transit systems. Depending on temporal variability, these maps may or may not include a distinct temporal component. The maps must be engaging, attractive and informative to a general audience. They must invite exploration and be able to explain their complex subject without reference to the main text of the thesis.
One problem with this thesis is that it will not fully distinguish between lines that make every stop and those that operate by request only as almost all bus lines do. This is an important point because we risk finding the maximum (theoretically unpadded) speeds for a segment at moments when there are few or no passengers boarding or alighting. It would be a mistake to assume that all or most delay is due to street traffic without trying to measure the effect of passenger boardings, particularly for bunched vehicles. Passenger boarding is not a thing to be avoided; it is however possible to reduce delay caused by boardings with infrastructure changes such as increased stop spacing, pre-payment fare systems, and multi-door boarding. It is therefore not totally unreasonable to consider the effects of passenger boarding on flag-stop services as a source of delay.
Basically this thesis is after a systematic and effective way of identifying, measuring and displaying the effect of choke-points in existing scheduled transit services. It does this by analyzing publicly available data to identify variance in operating speeds through space and time. It assumes that the fastest reasonable observed speeds are undelayed and that slower trips are delayed in one of several ways. It measures and displays this delay and suggests potential causes and interventions for certain types of delay scenarios.
Transit delay has been studied extensively, but this thesis is novel in its focus on spatio-temporal description, its emphasis on schedule padding as a source of avoidable delay, and its use of cartographic technique to display the results.
I’m taking a self-guided course in R this semester — that is, teaching myself, but with deadlines — and since I’ve been playing with transit data for the most part, it seems appropriate to tickle y’all with some of the mildly interesting data visualizations that I’ve so far produced.
I’ll be using the 2014 SORTA spatio-temporal ridership dataset, which I’ve already sliced a couple different ways on this blog. The first was here with a set of animated maps and the second here showing basic peaking in passenger activity through time.
This time, I’m going to take that latter analysis a little further by breaking out passenger activity into lines. Go ahead and take a look at the graphic, which I’ll explain in more detail below.
Ok. So first, it’s important to understand what we’re measuring here. Our dataset tells us the average number of people getting on a bus (boarding) and the average number getting off (alighting) for each scheduled stop. There are about 162,000 scheduled stops on a weekday. Of those, I was able to identify a precise, scheduled time for all but ~2,000. Of the remaining ~160,000 the dataset tells me that 77,763 have at least 0.1 people boarding or alighting on an average weekday. I used those stops to calculate a weighted density plot over the span of the service day for each route. Added together of course, the individual routes sum to the total ridership for the system. I then sorted the routes by their total ridership and plotted them.
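For the curious, here is one way a weighted density like this can be computed, sketched in plain Python rather than R. It is not the code behind the charts: it assumes a Gaussian kernel, the bandwidth is an arbitrary choice, and the names are hypothetical. The weights are deliberately left unnormalized so that the area under each route's curve equals that route's ridership, which is what lets the routes stack up to the system total.

```python
import math


def weighted_density(times, weights, grid, bandwidth):
    """Gaussian kernel density of `times`, weighted by `weights`.

    times:     scheduled stop times, e.g. in hours since midnight
    weights:   average boardings plus alightings at each of those times
    grid:      the time values at which to evaluate the curve
    bandwidth: kernel standard deviation, in the same units as `times`
    """
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    density = []
    for g in grid:
        total = 0.0
        for t, w in zip(times, weights):
            z = (g - t) / bandwidth
            total += w * norm * math.exp(-0.5 * z * z)
        density.append(total)
    return density


if __name__ == "__main__":
    step = 0.05
    grid = [i * step for i in range(481)]  # 0:00 to 24:00
    curve = weighted_density([8.0, 17.0], [100.0, 50.0], grid, 0.5)
    # The area under the curve approximates the total weight.
    print(sum(curve) * step)
```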
The first thing that becomes clear, to me at least, is that a minority of SORTA’s lines account for a large majority of actual riders. These lines by the way are precisely the ones featured in the Cincinnati Transit Frequency Map, and I’ve used their color from that map to distinguish them in the chart above. The remaining routes, as I knew even before I had this data, are relatively unimportant.
May 2013 routing
The one grey line mixed in among the colored lines is the m+ (a latecomer to the frequency map), which does actually run all day on weekdays.
Now another interesting question, to me at least, is what this would look like without the pea under the mattress; how large are the rush-hour peaks if we exclude the peak-only lines from the chart? Let’s try it. I’ll also reverse the order, so we can see some of the larger lines with less distortion.
Well, the rush-hours are still pretty distinct. More distinct than I would have expected. It’s an open question whether this is the result of more service in the rush-hours, or more crowding at the same level of service.
One last way (for now) to slice the data will be to take the total ridership at any given moment, and relativize each line’s total, showing each line’s percent share of the total. To keep it easy to read, I’ll leave the peak-only lines out of this one too.
I found it slightly surprising how straight these lines are. Only toward the end of the day do we see a major wobble in any direction, and that’s essentially the result of a few lines shutting down earlier than the others.