And here it is.
I’m pretty proud to say that besides searching for the elusive schedule padding, and possibly finding some, I managed to fit in a comment about the inevitability of death, a quote from Jerry Seinfeld, and a self-deprecating jab at the idea of human rights.
Also, I put videos in a PDF. Who the hell knew that was possible?
Just another way of looking at speed distributions. This shows the distribution of observed speeds across several thousand route segments by percentile of the speeds on each segment. The animation changes as we’re shown different parts of the speed distribution on each segment.
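For the curious, here is a minimal sketch of how per-segment percentiles like these could be computed. It is not the code behind the animation: the function names, the dictionary layout, and the sample speeds are all made up for illustration, and I'm assuming each animation frame corresponds to one percentile taken across every segment.

```python
from statistics import quantiles

def speed_percentiles(speeds_by_segment, n=100):
    """Map each segment id to its n-1 percentile cut points.

    speeds_by_segment is a hypothetical dict: segment id -> list of observed speeds.
    Segments with fewer than two observations are skipped.
    """
    result = {}
    for segment_id, speeds in speeds_by_segment.items():
        if len(speeds) < 2:
            continue
        # "inclusive" treats the observations as the whole population,
        # so the cut points never fall outside the observed range.
        result[segment_id] = quantiles(speeds, n=n, method="inclusive")
    return result

def frame(percentiles_by_segment, p):
    """Speeds at percentile p (1..99) across all segments, sorted: one frame."""
    return sorted(cuts[p - 1] for cuts in percentiles_by_segment.values())

# Made-up speeds (km/h) for two hypothetical segments.
observed = {"seg_a": [10, 20, 30], "seg_b": [40, 60]}
by_segment = speed_percentiles(observed)
median_frame = frame(by_segment, 50)
```

Stepping `p` from low to high would then give the sequence of frames, each one a distribution of segment speeds at that percentile.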
Not totally sure if it’s useful yet, but it is kind of pretty.
Some news from the thesis front:
I think I’ve finally perfected my method for linking real-time data with scheduled stops. This is a comparison of the average (weekly) scheduled speeds to the observed average speed for each stop->stop segment. Results that look roughly as expected are what we all hope for.
Note that each classification is broken into eight equal-sized quantiles.
There is a lot of information in that little gif! More than I can explain here. More to come…
Higher resolution here by the way. It’s interesting to look at even if you don’t know Toronto. Also, the line widths are determined by the number of trips scheduled for each segment.
Here are some early results from my efforts to track transit vehicles, these ones in Toronto:
It looks like a crude system map, and it is, but it’s actually made of thousands of vehicle GPS tracks.
The vehicle tracks are oddly pretty; I keep wasting time just zooming in on different spots as new tracks are added. Here’s a transfer point at one of the subway stations:
Bus stops shown as ~20m circles
And what looks like perhaps a train station or a bus garage, below. This image also shows the relative frequency of service on different streets, something that becomes quite visible in the data when the lines are given a high degree of transparency.
Buildings for scale
And here’s an expressway carrying some limited-stop services:
As of now, after just a week of erratic development and testing, I’ve collected ~60,000 unique tracks, representing ~160 scheduled lines, derived from 3,200,000+ vehicle location records. About 50 new vehicle locations come in each second that I have my little script running.
Here’s what I’m doing so far, described algorithmically and implemented in a Python script:
- Request updated vehicle locations from the API every five or six seconds. The API only sends the ones which have updated since my last request, which ends up being between 200 and 600 depending on the day and time of day. There seem to be between 500 and 1500 vehicles operating at any given time in Toronto, so I’m seeing an update from each roughly every 10 seconds on average. The API lets me do this for the whole agency at once, which is slightly surprising.
- Keep tabs on each vehicle by its given ID number, and begin building a track for each vehicle by putting points in order of their appearance.
- If a vehicle fails to update within 60 seconds, gets a new route assignment, or gets a new direction identifier on the same route, I start a new track for it and insert the old one into the database.
- Tracks shorter than 5 points, 500m, or 2 minutes (or some other arbitrary amount) can then be ignored or dropped.
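The track-splitting and filtering rules above can be sketched roughly like this. This is an illustration under assumptions, not my actual script: `TrackBuilder`, `Track`, `is_worth_keeping` and the rest are hypothetical names, a plain list stands in for the database, the API request itself is left out, and the 60-second gap is only noticed when the vehicle's next report arrives rather than on a timer.

```python
from dataclasses import dataclass, field
from math import radians, sin, cos, asin, sqrt

GAP_SECONDS = 60  # a vehicle silent for longer than this gets a new track

@dataclass
class Track:
    vehicle_id: str
    route: str
    direction: str
    points: list = field(default_factory=list)  # (time_s, lat, lon), in arrival order

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def track_length_m(track):
    """Sum of the distances between consecutive points."""
    pts = track.points
    return sum(haversine_m(a[1], a[2], b[1], b[2]) for a, b in zip(pts, pts[1:]))

def is_worth_keeping(track, min_points=5, min_metres=500, min_seconds=120):
    """Apply the (arbitrary) minimum point count, duration and length."""
    if len(track.points) < min_points:
        return False
    duration = track.points[-1][0] - track.points[0][0]
    return duration >= min_seconds and track_length_m(track) >= min_metres

class TrackBuilder:
    def __init__(self):
        self.open_tracks = {}  # vehicle_id -> Track currently being built
        self.finished = []     # stand-in for the database

    def add_report(self, vehicle_id, route, direction, time_s, lat, lon):
        current = self.open_tracks.get(vehicle_id)
        if current is not None:
            stale = time_s - current.points[-1][0] > GAP_SECONDS
            reassigned = (route, direction) != (current.route, current.direction)
            if stale or reassigned:
                # Close the old track; a new one is started just below.
                self.finished.append(current)
                current = None
        if current is None:
            current = Track(vehicle_id, route, direction)
            self.open_tracks[vehicle_id] = current
        current.points.append((time_s, lat, lon))
```

Feeding it a handful of made-up reports (the vehicle ID, route and coordinates here are invented) shows the splitting in action:

```python
builder = TrackBuilder()
for i in range(6):
    builder.add_report("1001", "504", "east", i * 30, 43.65 + 0.002 * i, -79.38)
# A new direction identifier closes the eastbound track.
builder.add_report("1001", "504", "west", 180, 43.66, -79.38)
kept = [t for t in builder.finished if is_worth_keeping(t)]
```

One consequence of checking the gap only on the next report is that a vehicle which never reports again leaves its last track open, so a real version would need to flush those periodically.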
A track is a set of ordered points, each point with a position and a time. The next step is to line the tracks up with the stop segments to which they’re scheduled, and if they’re actually close and the direction matches, to calculate stop times and segment durations from the observations. That’s actually turning out to be pretty difficult, but I’m sure I’ll crack it fairly soon. One thing I’ll have to seriously consider as I’m doing this is error in the location reports.
As the first image and the one immediately above show, there is significant error in the data, particularly downtown where tall buildings are presumably interfering with GPS signal reception.
I’m being urged to get my act together regarding my master’s thesis. I have a set of datasets I know I want to explore, but I need to find a question of sorts that I can quite thoroughly answer with them. I also need to decide what type of person would be good to oversee this project (the ‘committee’ and whatnot). As I so often do, I’ll use you anonymous readers as the spur to set my thoughts to bytes and thereby make rigorous my abstractions.
SO: My dataset is real-time transit data feeds. I don’t care what buses are doing right now unless I’m waiting for them; I care what patterns they’re scratching into our lives. I’ve already demonstrated a Python script that will make random requests from a real-time API and store the results. Comparable APIs exist at other agencies, and the script can easily be adapted to them, so I could squirrel data from as many agencies as have APIs. That’s the dataset, or set thereof.
My question has been more difficult to discover. I have so many! Here are a few:
- What is the distribution of delay? How does it vary? Spatially, temporally?
- What kinds of lines/agencies/times have non-random, systematic delay?
- How does the delay spread of ‘good’ transit systems compare to that of ‘bad’ transit systems and what might explain this?
- Good scheduling should minimize systematic delay: what sort of delay remains after that and what might riders learn from it? How should they learn to best accommodate this delay?
- What is the space-time trajectory of a vehicle in various states of delay?
- How different is the delay of lines that don’t mix with traffic?
- What relation does frequency have to delay? At what service frequency can we say quantitatively that schedules should be abandoned and headways maintained instead?
- What is the accuracy of arrival time predictions? What margin of error exists around predictions at various space-time distances?
I suppose the first question is probably my best shot. Though #5 is certainly intriguing. Now on to the lit review I suppose? *deep breath*
And then the committee! Besides my adviser, who is a regular transit user and quantitative geographer, I want another statistician/data-person, and this shouldn’t be too hard to find. I also want someone really good at graphic communication. For the latter, I want someone from DAAP. But I want to be sure that they don’t think or feel or act as though I’ve invited them to proof my presentation while others address its content; content is inseparable from presentation. Form does not follow function; rather both form and function must mirror each other. If I fail to make that happen, I will have miscommunicated or misunderstood my project.
Oh dear readers, what would you want to know if you knew, as I may, where all the buses are all the time?