I’m about to start digging into various real-time data feeds for American bus transit systems. For now I’m mostly interested in finding a simple, average distribution of lateness/earliness across all stops. The idea is that this could help riders who don’t have a real-time feed predict when the bus is most likely to show up, using only the fixed schedule.
Are buses more likely to be late than early? What percentage of buses are early, anyway? If it’s already five minutes late, is it very likely it’s coming in the next minute? Or should you start walking? What’s the difference in tardiness distributions between frequent and infrequent services? Are there types of places in a city which have consistently different distributions?
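The "already five minutes late" question is really a conditional probability: of the buses that still haven't shown up five minutes past schedule, what share arrive within the next minute? Here's a rough sketch of how I might compute that once I have data. The function name, the minutes-based units, and the sample numbers are all my own placeholders, not anything from a real feed.

```python
def prob_arrives_within(delays_min, already_late, window):
    """Among buses that have not arrived by `already_late` minutes past
    schedule, return the fraction that arrive within the next `window`
    minutes. Returns None if no bus in the sample was that late."""
    still_waiting = [d for d in delays_min if d > already_late]
    if not still_waiting:
        return None
    soon = [d for d in still_waiting if d <= already_late + window]
    return len(soon) / len(still_waiting)


# Made-up placeholder delays in minutes (negative = early). NOT real data.
sample = [-1.0, 0.5, 2.0, 3.0, 5.5, 5.8, 7.0, 12.0]
p = prob_arrives_within(sample, already_late=5, window=1)
print(p)
```

If that number turns out to be low for a given route, that's the "start walking" signal.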
In the name of science, I’d like to make a prediction, i.e. state my hypothesis, before I’ve collected any actual data. So here it is:
I think that overall the distribution will have a strong late skew, a very short early tail, and a wide second hump around the time a second bus might start bunching up on the one in question. I’ll guess that between 10% and 20% of buses running on fixed schedules will be at least a few seconds early and that the median will be about 2 minutes late.
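To keep myself honest, here's a minimal sketch of how the two numbers in that guess could be checked against observed delays. It assumes each observation is a delay in seconds (negative meaning early); the `summarize` name and the sample values are hypothetical stand-ins, not measurements.

```python
from statistics import median


def summarize(delays_sec):
    """Return the share of early buses and the median delay in minutes.
    A bus counts as early if its delay is below zero seconds."""
    early = sum(1 for d in delays_sec if d < 0)
    return {
        "share_early": early / len(delays_sec),
        "median_min": median(delays_sec) / 60,
    }


# Made-up placeholder delays in seconds. NOT real data.
sample_sec = [-30, 10, 60, 120, 150, 200, 240, 300, 600, 900]
summary = summarize(sample_sec)

# My guess: 10% to 20% of buses early, median around 2 minutes late.
early_guess_holds = 0.10 <= summary["share_early"] <= 0.20
print(summary, early_guess_holds)
```

How close the median has to be to 2 minutes to count as "about" is a judgment call I'm leaving open until I see real numbers.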
Now…anyone want to suggest a city with a real-time feed? I have my eye set on Portland at the moment, but only because I’m having trouble finding decent APIs.
Champaign-Urbana is looking like a pretty good option now…
https://developer.cumtd.com/