Discovering the Space-Time Dimensions of Schedule Padding and Delay from GTFS and Real-Time Transit Data

My master’s thesis proposal, something I’ll be talking a lot about in the coming months:

The popular conflation of bus coach and railcar-based public transit lines with their typical relation to automotive traffic has caused much confusion in recent years. Though the superficial wheel may not much matter, the general public is right to sense a distinction in speed and reliability between transit services that operate in mixed traffic and those that are given priority over such traffic. As the public more and more aggressively demands rail-based transit services, these should be read as demands for increased speed and reliability (among many other things) and planners should respond by modifying existing services to meet these implicit demands.

Speed and reliability are in large part a function of the potential delays a line encounters along its course. Potential delay, or random delay, results from events that cannot be precisely planned for, such as automotive, pedestrian or bicycle traffic, flag-stop passenger boardings and alightings, traffic signals, and bus wheelchair boardings and securements. Scheduled delay, also known as schedule padding, is delay that is built into scheduled transit services so that they can tolerate regular disruptions and unscheduled delays by conceding the average effects of such delays in advance. Agencies try to strike a balance between heavily padded (and thus slow) schedules and the disruptions of extra unscheduled delay, to create schedules that are neither too slow nor too often late. While the public often reacts negatively to significantly late vehicles, they are typically unaware of schedule padding, though both are dependent on the same environmental factors.

Since the politically active public, and not transit schedulers, is in control of policy direction in most cities, it becomes important to explain delay and its causes and effects to a lay audience and thereby to direct them toward potentially fruitful responses. Further, since funds for radical infrastructure interventions are difficult to find in the current political regime, attention should be focused on potential incremental improvements to the surface-running bus lines which constitute the vast majority of all transit.

Toward a Solution:
Where can the smallest new delay-avoidance technique create the biggest potential improvement in speed and reliability for existing services?
This thesis proposes a technique for exposing and visualizing the spatial and temporal locations of random delay and schedule padding. Implicitly, this should reveal the space-time locations where transit is running more slowly and less reliably than it might if it could avoid delay, and suggest times and places where delay-avoidance techniques such as designated rights of way for transit could have the biggest impact. The focus will not be on any particular line or city but will try to demonstrate the possibility for and usefulness of the technique in a variety of different circumstances.

There will be two ways of identifying delay. The first will be to analyze the agency’s General Transit Feed Specification (GTFS) schedule data. For each scheduled trip segment ( stop A → stop B ), the minimum time to complete that segment in any trip will be identified and considered as the baseline for that segment. Assuming this gives reasonable values, any time another trip spends completing that same segment beyond the minimum will be considered to be schedule padding.
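As a rough illustration of this first method, here is a minimal Python sketch. It assumes the rows of a GTFS stop_times.txt file have already been read into dictionaries (for instance with csv.DictReader). The function names and the tiny sample are hypothetical and are not the thesis tooling itself.

```python
from collections import defaultdict


def to_seconds(hms):
    """Convert a GTFS 'HH:MM:SS' time to seconds (hours may exceed 23)."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def segment_padding(stop_times):
    """Return (from_stop, to_stop, duration_s, padding_s) for every segment.

    Padding is the time a trip is scheduled to spend on a segment beyond
    the minimum scheduled duration of that same segment in any trip.
    """
    trips = defaultdict(list)
    for row in stop_times:
        trips[row["trip_id"]].append(row)

    segments = []
    for rows in trips.values():
        rows.sort(key=lambda r: int(r["stop_sequence"]))
        for a, b in zip(rows, rows[1:]):
            duration = to_seconds(b["arrival_time"]) - to_seconds(a["departure_time"])
            segments.append((a["stop_id"], b["stop_id"], duration))

    # Baseline: the fastest scheduled duration of each directed segment.
    baseline = {}
    for frm, to, duration in segments:
        key = (frm, to)
        baseline[key] = min(duration, baseline.get(key, duration))

    return [(frm, to, d, d - baseline[(frm, to)]) for frm, to, d in segments]


# Hypothetical sample: the same segment scheduled at two times of day.
sample_rows = [
    {"trip_id": "t1", "stop_sequence": "1", "stop_id": "A",
     "arrival_time": "08:00:00", "departure_time": "08:00:00"},
    {"trip_id": "t1", "stop_sequence": "2", "stop_id": "B",
     "arrival_time": "08:02:00", "departure_time": "08:02:00"},
    {"trip_id": "t2", "stop_sequence": "1", "stop_id": "A",
     "arrival_time": "17:00:00", "departure_time": "17:00:00"},
    {"trip_id": "t2", "stop_sequence": "2", "stop_id": "B",
     "arrival_time": "17:03:00", "departure_time": "17:03:00"},
]

for segment in segment_padding(sample_rows):
    print(segment)
```

In this sample the evening trip carries 60 seconds of padding relative to the morning trip. Whether the minimum is a reasonable baseline for a real feed is exactly the assumption flagged above.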

The second method will take a large sample of representative data from a set of three or four public ‘real-time’ transit APIs from the same agencies. Requests to these APIs will be made with a Python script which will process the results and store them in a PostGIS database for later analysis. These real-time data will be compared to the GTFS schedule data for the same period. The basic task in this method will be to identify temporal vehicle trajectories by following particular vehicles along each line as they become more or less displaced from their schedule. These data will be used to:

  1. Relocate the ideal speed that determines the extent of schedule padding in a given segment by looking at the best reasonable, observed speeds of late-running-but-catching-up vehicles, in each (spatial) trip segment.
  2. Identify excess random delay by identifying segments where vehicles became late or were becoming later relative to their (padded) schedules. Determine the amount by which they were delayed beyond their schedule time in these segments and look for non-random unscheduled delay.
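The two uses above both come down to watching how a vehicle's lateness changes from stop to stop. A minimal sketch, assuming scheduled and observed arrival times (in seconds) are already matched up per stop for one trip; the function name and sample values are hypothetical:

```python
def lateness_changes(scheduled, observed):
    """Change in lateness across each segment of one trip, in seconds.

    Positive values mean the vehicle fell further behind schedule in that
    segment; negative values mean it was catching up.
    """
    lateness = [obs - sch for sch, obs in zip(scheduled, observed)]
    return [later - earlier for earlier, later in zip(lateness, lateness[1:])]


# Hypothetical trip with four stops.
scheduled = [0, 120, 240, 360]
observed = [30, 150, 300, 400]
print(lateness_changes(scheduled, observed))
```

Segments with negative values are candidates for estimating the best reasonable speed (item 1). Segments that are repeatedly positive across many trips are candidates for non-random unscheduled delay (item 2).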

Since the APIs’ vehicle location and arrival estimate reporting is fairly discontinuous (updates might arrive only every 5-30 seconds), arrival times will be interpolated along the length of a trip.
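One simple way to do this interpolation is linearly between consecutive position reports. The sketch below assumes each report has already been converted to a distance along the route (for example by projecting the reported location onto the route shape, a preprocessing step not shown here). The function name and sample data are hypothetical.

```python
def interpolate_arrival(reports, stop_distance):
    """Estimate when a vehicle passed a stop.

    reports: list of (timestamp_s, distance_along_route_m), sorted by time.
    Returns the interpolated timestamp, or None if the stop's distance is
    not bracketed by any pair of consecutive reports.
    """
    for (t0, d0), (t1, d1) in zip(reports, reports[1:]):
        if d0 <= stop_distance <= d1:
            if d1 == d0:
                return t0  # vehicle was stationary at the stop
            fraction = (stop_distance - d0) / (d1 - d0)
            return t0 + fraction * (t1 - t0)
    return None


# Hypothetical reports 20 and 30 seconds apart; the stop is 250 m along.
reports = [(0, 0), (20, 100), (50, 400)]
print(interpolate_arrival(reports, 250))
```

Linear interpolation assumes constant speed between reports, which is least accurate around stops and signals; it is only the simplest possible choice.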

Results will be mapped both by line and for whole transit systems. Depending on temporal variability, these maps may or may not include a distinct temporal component. The maps must be engaging, attractive and informative to a general audience. They must invite exploration and be able to explain their complex subject without reference to the main text of the thesis.

Potential Problems:
One problem with this thesis is that it will not fully distinguish between lines that make every stop and those that stop by request only, as almost all bus lines do. This is an important point because we risk finding the maximum (theoretically unpadded) speeds for a segment at moments when there are few or no passengers boarding or alighting. It would be a mistake to assume that all or most delay is due to street traffic without trying to measure the effect of passenger boardings, particularly for bunched vehicles. Passenger boarding is not a thing to be avoided; it is however possible to reduce delay caused by boardings with infrastructure changes such as increased stop spacing, pre-payment fare systems, and multi-door boarding. It is therefore not totally unreasonable to consider the effects of passenger boarding on flag-stop services as a source of delay.

Basically this thesis is after a systematic and effective way of identifying, measuring and displaying the effect of choke-points in existing scheduled transit services. It does this by analyzing publicly available data to identify variance in operating speeds through space and time. It assumes that the fastest reasonable observed speeds are undelayed and that slower trips are delayed in one of several ways. It measures and displays this delay and suggests potential causes and interventions for certain types of delay scenarios.

Transit delay has been studied extensively, but this thesis is novel in its focus on spatio-temporal description, its emphasis on schedule padding as a source of avoidable delay, and its use of cartographic technique to display the results.

Schedule Padding in Time

An excerpt from (the background section of) the first draft of my thesis proposal:

“The bus is so slow! Isn’t rail just better?”
The popular confusion created by the conflation of coaches and railcars with their typical relation to automotive traffic has wasted billions of public dollars [1] (and caused me no end of frustration). Though the superficial wheel may not much matter, the general public is right to sense a distinction in the reliability and speed of transit services that operate in mixed traffic and those that are given priority over such traffic. As the public more and more aggressively demands train-based transit services, these should be read as demands for increased speed and reliability (among several other things) and planners should respond by modifying existing services to meet these demands.

Speed and reliability are in large part a function of how many potential delays a line will encounter along its course. Random delay results from unplanned disruptions such as higher than expected passenger loads, traffic, serial red lights, etc. Scheduled delay, also known as schedule ‘padding’, is delay that is built into scheduled transit services so that they can tolerate unscheduled disruptions by acknowledging their average effects in advance. Agencies try to balance scheduled and unscheduled delay to create schedules that are neither too slow nor too often disrupted by random delay. While the public often reacts negatively to random delay events, they’re typically unaware of schedule padding, though both are dependent on basically the same environmental factors.

Since the public, and not transit schedulers, is in control, it becomes important to explain delay and its causes and effects to a lay audience and thereby to direct them toward a fruitful response. Further, since funds for radical infrastructure interventions may be difficult to find in the current political regime, attention should be focused on marginal cases and incremental improvements to surface-running bus lines.

Simply, the question is: where can the smallest new delay-avoidance technique create the biggest improvement in speed and reliability for existing services?


As a first step toward an answer to this question, I’ve created a rough measure of the amount of schedule padding identifiable from just the schedule information itself. It’s not perfect by any means, but I’m going to run with it for a moment and see where it leads me. First I identified all unique trip segments in the transit system. A segment is defined here as the travel between two unique stops, so…

( stop A -> travel -> stop B ) = segment 1
( stop B -> travel -> stop C ) = segment 2
( stop A -> travel -> stop B ) = segment 1 again
( stop B -> travel -> stop A ) = segment 3
for a total of 4 segments, 3 unique in our example.
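The same count can be sketched in Python by treating each segment as an ordered (from, to) pair, so that A to B and B to A stay distinct. The variable names here are just for illustration.

```python
# Each segment is an ordered pair of stops; direction matters.
segments = [("A", "B"), ("B", "C"), ("A", "B"), ("B", "A")]

total_segments = len(segments)
unique_segments = len(set(segments))

print(total_segments, unique_segments)
```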

For SORTA’s current GTFS schedule, we observe:

            Total Segments   Unique Segments
Weekday            164,621             5,159
Saturday           103,757             3,914
Sunday              69,380             3,550

Each segment has some times associated with it:

  1. A departure from the first stop
  2. An arrival at the second stop
  3. Implicitly, the time scheduled to complete the segment.

Because schedulers expect that the amount of time it will take a bus to get from A to B will be different at different times of day, these otherwise identical segments will have different durations. By finding the deviation from the minimum duration for each segment, we can get a crude measure of the schedule padding built into the system.


            Hours of Padding   Scheduled Vehicle Hours   % padding
Weekday               414.60                  1,925.22      21.53%
Saturday              174.19                  1,036.57      16.80%
Sunday                 97.36                    664.13      14.66%

This method estimates that 21.5% of the weekday schedule is actually scheduled delay, more than 400 hours of it, each weekday. That is, at least relative to the fastest any bus is scheduled to complete a segment. Just where is this scheduled delay anyway? When and where are the schedules most heavily padded? I’ll save a spatial exploration for later, but let’s take a very preliminary peek into the temporal dimension.

The first question we must ask is: when are all of the segments? By taking a central moment as the time, we can plot them in a kernel-smoothed histogram:

[Figure: SORTA segments per hour]

This clearly shows the basic level of service throughout the day and week. It’s not a great measure of that as such, but it does give us a definite sense of the balanced weekday rush-hours and diminished weekend service.
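For anyone curious what a kernel-smoothed histogram of segment times involves, here is a minimal sketch using a Gaussian kernel. The half-hour bandwidth, the function names, and the sample times are all assumptions for illustration; the plots in this post were not necessarily made this way.

```python
import math


def midpoint_hour(departure_s, arrival_s):
    """Central moment of a segment, as an hour of the day."""
    return (departure_s + arrival_s) / 2 / 3600


def smoothed_counts(times_h, grid_h, bandwidth_h=0.5):
    """Gaussian-kernel estimate of events per hour at each grid point.

    Each event contributes a bell curve with area 1, so the area under the
    whole curve equals the number of events. Times are not wrapped around
    midnight in this sketch.
    """
    norm = 1.0 / (bandwidth_h * math.sqrt(2 * math.pi))
    return [
        sum(norm * math.exp(-0.5 * ((g - t) / bandwidth_h) ** 2) for t in times_h)
        for g in grid_h
    ]


# Hypothetical segment midpoints: two in the morning, one in the evening.
times = [8.0, 8.5, 17.0]
grid = [i * 0.05 for i in range(481)]  # 0:00 to 24:00 in 3-minute steps
curve = smoothed_counts(times, grid)
```

A wider bandwidth gives a smoother curve that hides short peaks; a narrower one gets closer to an ordinary histogram.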

Then, since most of the segments are padded, we ask: when are the segments without padding? On the same scale, we get:

[Figure: SORTA unpadded segments per hour]

As we might have expected, there is less random delay and thus less need for padding when the streets and buses are less congested: early morning and late at night. It also appears that there are relatively fewer padded segments on the weekends, though the total number of unpadded segments is roughly the same as on weekdays.

Ok, so when is the padding itself and how much of it is there? Note that we’re measuring something different here: hours of padding per hour.

[Figure: SORTA padding in time]

Now, this definitely has a different shape than the overall distribution of schedule segments, but it’s a little hard to compare them when they’re so far apart. Let’s combine all of these into one plot. I might have got a little carried away in Inkscape…

[Figure: SORTA Schedule Padding by Time-of-Day]

I’ll just let that speak for itself for now. We’ll get into spatial visualizations of this data next, and eventually real-time comparisons and measures in space-time.

Footnote:

  1. Citation needed…

Delay Variance Through Time

Just some preliminary results from my first attempt to archive real-time bus data, these from the Champaign-Urbana ‘Mass Transit’ District:

[Figure: Animated Delay Variance]

This is a look at variance in the distribution of delayed buses throughout the day. It’s only looking at off-schedule buses right now so we aren’t seeing any change in the proportion of precisely on-time buses (if there is any such change). The little clock in the top right is the time of day the event was recorded, and right below that is the number of events used to ascertain the momentary distribution. For now, this sample size is as much a reflection of when I was running my computer as it is of schedule frequency.

I assumed that the first and last percentiles of the overall distribution were outliers and clipped them off.
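A minimal sketch of that kind of clipping, using the 1st and 99th percentile cut points from Python's statistics module. The exact percentile convention used for the animation isn't stated here, so treat the method (and the function name) as an assumption.

```python
import statistics


def clip_outliers(values):
    """Drop values below the 1st or above the 99th percentile."""
    cuts = statistics.quantiles(values, n=100)  # 99 cut points
    low, high = cuts[0], cuts[-1]
    return [v for v in values if low <= v <= high]


# Hypothetical delays in seconds.
delays = list(range(1, 1001))
kept = clip_outliers(delays)
print(len(delays), len(kept))
```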

I don’t actually see much going on here except random fluctuation, but I suspect this will get much more interesting with larger samples and many more and more diverse agencies included. I’m actually quite eager to see what that turns up! Right now, I’m working on developing a tool to query APIs from the Toronto Transit Commission and Philly’s SEPTA. Unfortunately, because neither Cincinnati transit agency is willing to share their data, which they’ve been collecting and using for years, it will be impossible to include them in this sort of analysis.

Why the buses just stop

A friend of mine just emailed with a very good question: Why do the buses just stop sometimes, with no one getting on or off, no traffic jam, etc, and no clear reason why they aren’t running?

The Answer:
The buses here operate on a fixed schedule, with a set time for each and every stop. Planners try to adjust these times to accommodate varying traffic conditions like rush hours and other factors. Obviously, they can’t do this perfectly, and sometimes buses tend to run late, i.e. slower than the schedule says they ought to. Just as often, they tend to run early. This can happen if traffic is particularly light, there aren’t as many passengers as usual, or maybe the bus just hits a string of green lights. When that happens, the driver tries to stay on-schedule by driving more slowly, or sometimes stopping completely. When they do stop completely, they’ll often take the opportunity to read a couple pages of their magazine, send a text, take a bite of lunch, etc.
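Put as arithmetic, the wait is just the gap between when the bus got there and when the schedule says it may leave. A tiny sketch with hypothetical names and times (seconds since midnight):

```python
def hold_seconds(arrival, scheduled_departure):
    """How long an early bus waits at a stop; a late bus doesn't wait at all."""
    return max(0, scheduled_departure - arrival)


print(hold_seconds(arrival=30_000, scheduled_departure=30_045))  # early bus
print(hold_seconds(arrival=30_100, scheduled_departure=30_045))  # late bus
```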

To someone who doesn’t know what’s going on, it probably looks like the driver is behaving quite capriciously. Many systems make announcements explaining any longer-than-usual waits, but I’ll speculate that SORTA and TANK don’t do this because they’re not seriously trying to attract or retain new passengers. They see themselves as serving a relatively captive and stable audience who has already spent a lot of time learning the system’s quirks. This also explains their complete inattention to providing easy-to-understand overviews of the system in favour of obtuse and overly detailed maps and schedules.

Back Door Please

When you’re preparing to disembark the bus, please look ahead to your stop before the bus reaches it. Does it look like there are people about to get on? If so, now is the time to leave through the back door.

Leaving through the front door when people are trying to get on slows down the bus for everyone still on board because the people boarding have to wait for you to clear the door before they can step in and make their payment. If you go out the back door, the bus can unload you at the same time it’s loading other passengers, and everyone can be on their way more quickly. This is particularly important at major stops like Government Square where I’ve seen thirty people take a whole minute to file out through the front door while thirty more people waited anxiously to board. What the hell, people?

Over time, if everyone does little things like this to keep out of the way, schedule planners can start to assume that less time will be spent at stops boarding passengers and they’ll build less padding into their timetables. Quicker buses, less delay, a few fewer seconds standing out in the cold…and all we have to do is go out the back door. They put it there for a reason ;-)

Bunching, or why the buses all come at once

Have you ever been waiting way too long for a bus on a major line and then out of nowhere two or three buses come by one after another? Inevitably, the one you get on is completely stuffed with people. “What the hell is SORTA up to??” you wonder.

The drivers aren’t just being ridiculous. In fact, this behaviour of buses can be explained by statistics and a little algebra. And it’s really hard to avoid in a transit system set up the way ours is.

On transit lines that are frequent enough that people don’t tend to look at schedules, people will arrive at their stops in a fairly steady, predictable way. We have a few such routes, a fuzzy distinction to be sure, but including the main corridors of the 33, 17/19, 43, and 4 at least.

[Figure: Main transit corridors in Cincinnati.]

Since people won’t try to arrive at any particular time because of the schedule, and because there are a lot of people, their arrivals are more or less random and can reasonably be predicted for any given stop in similar circumstances.

Let’s say that stops along line X accumulate on average 1 person each minute. Let’s also say that during the day, a bus will come by every 10 minutes, and that it takes 7 seconds for each person to board. They have to line up single file, pay the fare, then move to their seat. In our hypothetical world, this bus will come by a given stop every ten minutes, spend 70 seconds loading 10 passengers and then move along to the next stop where it will do the same and so on. It will spend 20 seconds travelling between each stop.

What happens if one bus is delayed by just one minute?

A single bus gets delayed by one minute between stops; say a squirrel was in the way. When it arrives at its next stop, 11 minutes will have elapsed since the last bus came by and there are likely to be 11 people waiting at the stop. The bus will spend 77 seconds loading these passengers before moving on. When it gets to the next stop, it is now one minute and seven seconds late and has about 11.12 (in reality, 11 or sometimes 12) passengers to board, taking 77.82 seconds, putting it then one minute and 14.82 seconds behind. Remember that while this is happening, the bus immediately before it has not slowed down. It’s still on schedule, picking up its average ten people per stop and taking the expected time to do so.

As our hypothetical late bus progresses, needing to board more passengers than normal and taking longer than the normal 70 seconds to do so, the gap between it and the bus ahead of it is slowly widening. This delay compounds, growing exponentially rather than steadily; the further behind the bus gets, the longer it has to take to board passengers at each stop, since each stop has had more time to accumulate passengers in the interval.
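To make the growth of the delay explicit, here is the same arithmetic written as a recurrence. The symbols are introduced only for this sketch: h is the scheduled time between buses (600 s), λ the passenger arrival rate (1/60 per second), b the boarding time per passenger (7 s), and d_n the bus's delay on reaching the n-th stop after the squirrel, with d_0 = 60 s.

```latex
\begin{align*}
d_{n+1} &= d_n + b\lambda\,(h + d_n) - b\lambda\,h = (1 + b\lambda)\,d_n \\
d_n &= d_0\,(1 + b\lambda)^n \\
d_1 &= 60 \cdot \tfrac{67}{60} = 67\ \text{s}, \qquad
d_2 = 67 \cdot \tfrac{67}{60} \approx 74.82\ \text{s}
\end{align*}
```

Each stop multiplies the delay by the same factor, 1 + 7/60. If the bus behind simply stayed on schedule, the late bus would have lost a full ten minutes once d_n reached h, that is after n ≥ ln(h/d_0) / ln(1 + bλ) ≈ 20.9, or 21 stops.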

[Figure: The first bus in a bunching scenario, stop by stop.]

And here’s the math (in PHP) in case anyone is interested:

$time_between_stops = 20; // travel time between stops, in seconds
$initial_span = 600; // scheduled time between buses, in seconds
$delay = 60; // initial delay, in seconds
$board = 7; // boarding time per passenger, in seconds
$pass = 1/60; // passengers arriving per second at each stop
$time_elapsed = 0; // total time since the initial delay, in seconds
$span = $initial_span - $delay; // gap back to the next bus, in seconds
while ($span > 0) { // i.e., the bus behind, moving at scheduled speed, hasn't caught up yet
    $passengers_at_stop = ($initial_span + $delay) * $pass;
    // extra boarding time compared with a bus that is on schedule
    $additional_delay = ($passengers_at_stop * $board) - ($initial_span * $pass * $board);
    $delay += $additional_delay;
    // each stop costs its boarding time plus the trip to the next stop
    $time_elapsed += ($passengers_at_stop * $board) + $time_between_stops;
    $span -= $additional_delay;
}

Meanwhile, if the bus behind our late bus isn’t careful, it will speed up once it passes the squirrel-point. The nearest stop, as the squirrel will observe, is likely to have 9 passengers, taking only 63 seconds boarding time.

The time between it and the late bus decreases, such that once it has caught up to the point at which the first bus became late it will be picking up fewer and fewer people and spending less time dwelling at a stop while people pay fare and find their seats.
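Here is a rough Python sketch of both buses together, using the same hypothetical numbers as above (ten-minute spacing, one passenger a minute, 7 seconds per boarding, a one-minute squirrel). It is a simplified model and not the code behind the chart below: it assumes no passing, ignores bus capacity, and lets the following bus run as early as it likes unless told to hold.

```python
def stops_until_bunched(headway=600.0, delay=60.0, board=7.0, rate=1 / 60,
                        follower_holds=False):
    """Count stops until the following bus catches the delayed bus.

    follower_holds=True keeps the following bus exactly on schedule;
    False lets it run early as it finds fewer passengers waiting.
    """
    if delay <= 0 or board * rate <= 0:
        raise ValueError("need a positive delay and boarding load")
    k = board * rate      # extra dwell seconds per second of extra gap
    lead_late = delay     # lateness of the delayed bus, in seconds
    follow_late = 0.0     # lateness of the bus behind (negative = early)
    stops = 0
    while headway + follow_late - lead_late > 0:
        gap_to_follower = headway + follow_late - lead_late
        # The delayed bus finds more people waiting and falls further behind.
        lead_late += k * lead_late
        if not follower_holds:
            # The follower finds fewer people than usual and gains time.
            follow_late += k * (gap_to_follower - headway)
        stops += 1
    return stops


print(stops_until_bunched())
print(stops_until_bunched(follower_holds=True))
```

With these numbers the two buses meet noticeably sooner when the follower is allowed to run early than when it holds to its schedule, which is one argument for making early buses wait.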

[Figure: How a squirrel delaying a bus for one minute can stop all buses forever. A one minute delay can hypothetically lead to infinite delay for all buses some distance after the point of initial delay.]

The above chart shows the impact of a bus delayed by one minute on an infinitely long line with buses coming every ten minutes forever. It’s assumed that buses can’t pass and thus the first bus will dictate the speed of all. The chart doesn’t take into account the possibility for coordinated efforts of buses running the exact same route to pass one another and individually skip stops, such coordinated action as would be needed to avoid eventually infinite delay. But as all of our high-frequency trunk lines spread out into sub-routes once they get further from Downtown, such coordination would only even be theoretically possible in one direction anyway!

There are a few things we can do about vehicle bunching.

[Image: Drunk squirrel. Caption: Go home, squirrel. You’re drunk.]

This isn’t necessarily as hard as it sounds. Squirrels aside, things like putting transit underground, or letting it bypass stop-lights can go quite a long way toward avoiding random delays.
