I’m taking a self-guided course in R this semester — that is, teaching myself, but with deadlines — and since I’ve been playing with transit data for the most part, it seems appropriate to tickle y’all with some of the mildly interesting data visualizations that I’ve so far produced.
I’ll be using the 2014 SORTA spatio-temporal ridership dataset, which I’ve already sliced a couple of different ways on this blog. The first was here with a set of animated maps, and the second here showing basic peaking in passenger activity through time.
This time, I’m going to take that latter analysis a little further by breaking out passenger activity into lines. Go ahead and take a look at the graphic, which I’ll explain in more detail below.
Ok. So first, it’s important to understand what we’re measuring here. Our dataset tells us the average number of people getting on a bus (boarding) and the average number getting off (alighting) for each scheduled stop. There are about 162,000 scheduled stops on a weekday. Of those, I was able to identify a precise, scheduled time for all but ~2,000. Of the remaining ~160,000, the dataset tells me that 77,763 have at least 0.1 people boarding or alighting on an average weekday. I used those stops to calculate a weighted density plot over the span of the service day for each route. Added together, of course, the individual routes sum to the total ridership for the system. I then sorted the routes by their total ridership and plotted them.
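For anyone who wants to see the mechanics, here is a minimal sketch of that weighted density step in plain Python. It is not the R code behind the chart. The records, the route numbers, the half-hour Gaussian bandwidth, and the choice to weight each stop by boardings plus alightings are all assumptions made up for illustration.

```python
import math

def weighted_density(times, weights, grid, bandwidth=0.5):
    """Gaussian kernel density over `grid`, scaled so the curve integrates
    to the total weight rather than to 1 (times and bandwidth in hours)."""
    norm = bandwidth * math.sqrt(2 * math.pi)
    curve = []
    for g in grid:
        total = 0.0
        for t, w in zip(times, weights):
            total += w * math.exp(-0.5 * ((g - t) / bandwidth) ** 2) / norm
        curve.append(total)
    return curve

# Hypothetical records: (route, scheduled hour of day, average boardings + alightings)
stops = [
    ("17", 7.5, 12.0), ("17", 8.0, 9.5), ("17", 17.25, 11.0),
    ("33", 7.75, 4.0), ("33", 16.5, 3.5),
    ("90", 12.0, 0.05),  # below the 0.1 cutoff, so it gets dropped
]

MIN_ACTIVITY = 0.1
grid = [h / 4 for h in range(0, 24 * 4 + 1)]  # every 15 minutes, 0:00 to 24:00

# Group the stops that pass the cutoff by route
by_route = {}
for route, hour, activity in stops:
    if activity >= MIN_ACTIVITY:
        by_route.setdefault(route, []).append((hour, activity))

# One weighted curve per route
curves = {}
for route, recs in by_route.items():
    times = [r[0] for r in recs]
    weights = [r[1] for r in recs]
    curves[route] = weighted_density(times, weights, grid)

# Sort routes by total activity, biggest first, ready for stacking
order = sorted(by_route, key=lambda r: sum(w for _, w in by_route[r]), reverse=True)

# Because each curve integrates to its route's total, the curves sum to the system
system_curve = [sum(curves[r][i] for r in order) for i in range(len(grid))]
```

Scaling each curve by its total weight, rather than normalizing it to 1, is what lets the stacked routes add up to the system-wide total.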
The first thing that becomes clear, to me at least, is that a minority of SORTA’s lines account for a large majority of actual riders. These lines by the way are precisely the ones featured in the Cincinnati Transit Frequency Map, and I’ve used their color from that map to distinguish them in the chart above. The remaining routes, as I knew even before I had this data, are relatively unimportant.
May 2013 routing
The one grey line mixed in among the colored lines is the m+ (a latecomer to the frequency map), which does actually run all day on weekdays.
Now another interesting question, to me at least, is what this would look like without the pea under the mattress; how large are the rush-hour peaks if we exclude the peak-only lines from the chart? Let’s try it. I’ll also reverse the order, so we can see some of the larger lines with less distortion.
Well, the rush-hours are still pretty distinct. More distinct than I would have expected. It’s an open question whether this is the result of more service in the rush-hours, or more crowding at the same level of service.
One last way (for now) to slice the data will be to take the total ridership at any given moment, and relativize each line’s total, showing each line’s percent share of the total. To keep it easy to read, I’ll leave the peak-only lines out of this one too.
I found it slightly surprising how straight these lines are. Only toward the end of the day do we see a major wobble in any direction, and that’s essentially the result of a few lines shutting down earlier than the others.
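The relativizing step is simple enough to sketch. This is an illustration only: the three lines and their values are invented, and the last moment is set up so that two lines have already stopped running, which is the kind of thing that produces a late-day wobble.

```python
def percent_shares(curves):
    """Convert per-line curves (all on the same time grid) into each line's
    percent share of the total at every grid point."""
    n = len(next(iter(curves.values())))
    totals = [sum(c[i] for c in curves.values()) for i in range(n)]
    shares = {}
    for line, c in curves.items():
        shares[line] = [
            100.0 * c[i] / totals[i] if totals[i] > 0 else 0.0
            for i in range(n)
        ]
    return shares

# Hypothetical ridership for three lines at four moments of the day
curves = {
    "A": [10.0, 40.0, 20.0, 5.0],
    "B": [10.0, 20.0, 20.0, 0.0],  # stops running before the last moment
    "C": [5.0, 20.0, 10.0, 0.0],   # likewise
}
shares = percent_shares(curves)
```

In this toy data, line A's share jumps to 100% at the last moment. Its own ridership is falling, but it is the only line still in service.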
Just in time for Xmas, KINDA t-shirts are now available for purchase at Rock, Paper, Scissors in Over-the-Rhine!
(I also have the TANK T-shirts there.)
The KINDA shirts are printed on grey-white Gildan tees in small through extra large. The TANK shirts are printed on a slightly lighter Tultext 50/50 cotton/poly tee, available in small through large. The TANK shirts are available in both brown and grey.
Price is $20. If you’re not able to make it to OTR, you can also just email me and have one delivered straight to your door(!) for an extra $5 shipping.
Wear one to your next public meeting! Or while you’re riding your bike past some poor bus stuck in traffic ;-)
Just last year I had a fairly long email exchange with some CAGIS folks, including eventually a City of Cincinnati lawyer, who insisted that the data which they collect with public dollars for civil uses was not a ‘public record’ per Ohio state law. They gave no justification for that claim of course, but I didn’t, as they well knew, have the time or money to take them to court. But anyway…if they want to start playing nice and sharing some of their toys, I’m willing to let bygones be bygones.
So what’s in there? There are two files, the larger of which (‘annual release’) you probably don’t need to bother with. The smaller file (‘quarterly release’) has some real goodies which I’ll describe below. I say that you don’t need to bother with the larger file because it looks like it only has data which could be derived from public domain federal sources like the Census and the USGS. Those sources cover the whole US and are readily available, so why trust CAGIS employees to mediate when you can go straight to the source?
Here’s what’s in the ‘quarterly release’ file (with some notes of my own):
Building footprints, Hamilton County (they seem to have cut out all of the data on each building and just left us with the outline. Still, for what it is, the data set looks to be very thorough with ~360,000 buildings.)
Street Centerlines, Hamilton County (I would suggest OpenStreetMap as better for most uses, though there are some other fields in here which I haven’t properly explored. I did notice that the speed limit field seems to indicate a whole lot more 15mph streets than actually exist. The street I live on is tagged as 15mph and physically signed as 25mph for example.)
Neighborhood Boundaries, Cincinnati
‘Sidewalks’, ‘Pavements’, ‘Driveways’ & ‘Parking’, Hamilton County (These are lines only, no data or metadata. They might be useful for CAD-type architectural drawings, but for any sort of spatial analysis they’re probably not worth much.)
Zoning, Cincinnati and parts of Hamilton County
Railroad, Hamilton County (You’ll do better getting this from a federal source or from OpenStreetMap)
Parks, OKI Region (I’m not sure how CVG is considered a park, but this looks like an interesting dataset, with lots of useful fields like name and who’s responsible, and where you can find them)
Parcels, Hamilton County (There’s tons of juicy stuff in here, like the assessment and sale values for each and every parcel. I played with this a bit in the past, before I knew how to make maps.)
‘Subdivisions’, Hamilton County (This looks like a legal artifact more than anything. It seems to contain any parcel that’s been subdivided since GIS was invented. There are many ‘subdivisions’ downtown for example, and I can’t make any sense of other areas that I know well. 1/10: do not use.)
There are a few other datasets in there, but they’re either very obscure (survey benchmarks) or redundant with other datasets (parcel ‘pages’). As far as I can make out, each of the files is projected in the Ohio Southern State Plane, EPSG 3735, though some of them appear to be missing that metadata.
Today, a trip in the WABAC machine:
A few years ago, I eagerly offered my services to a rather inauspicious but important project, important I still say though perhaps for someone with more stamina than I. I offered to lead the CUF Community Council’s effort to tackle the dreaded ‘parking problem’.
Lots of meetings and a couple years later, I’d met a lot of important people and ‘important people’ too, learned how the City works, and found out I had better ways to spend my time. I actually still hang out with those thicker-skinned people who are working at the issue though, and an exchange last week prompted me to dig back into my parking archives and properly pass the torch.
All of this is a long wind-up to say that I rediscovered a series of photos I took that a certain tiny subset of central Cincinnati may find either extremely amusing or entirely aggravating. I intend to spoil any possible amusement by explaining the context of each photo in its caption.
There are dozens of things that the CUFNA board hated, but two of them were: 1. People who aren’t on the CUFNA board parking in the neighborhood and 2. Signs on telephone poles.
A common complaint: “What about the poor? They might not be able to afford [trivial amount] per year.” Though of course the ‘poor’ must have cars if they are to have this concern raised rather presumptuously on their behalf, inevitably by the rich.
Most people like it when outsiders come to the neighborhood–look at OTR–but not these folks.
Please give QUARTERS.
This photo is a collage of all of the things that the then CUFNA board (and the meetings’ regular curmudgeons) hated. I had to move that broken bottle to get it into the shot, but not very far ;-)
I’d have so much fun if my prop budget allowed more than cardboard, markers and string…
Also, my thanks to Tyler Catlin for modelling and for bringing some fun and prankishness into the whole discussion. I’ll never forget the night when I first presented our work to the CUF Neighborhood Association general meeting and things quickly devolved into a shouting match among the audience. At one point someone yelled “who gave you the right to speak!?”, I think to the board president, at which point Tyler replied for all to hear in a way that left the old windbag slightly flabbergasted and which of course only escalated the conflict.
It was difficult for professionalism to restrain my smile in that moment.
If anyone is interested in seeing where the process is now, Jack Martin and I are still going back and forth on parking issues between ourselves and various others at the City and in the neighborhoods. The latest thing I have is an outline of the current proposal which I’ve uploaded. I’ll let it speak for itself.
An excerpt from (the background section of) the first draft of my thesis proposal:
“The bus is so slow! Isn’t rail just better?” The popular confusion created by the conflation of coaches and railcars with their typical relation to automotive traffic has wasted billions of public dollars (and caused me no end of frustration). Though the superficial wheel may not much matter, the general public is right to sense a distinction in the reliability and speed of transit services that operate in mixed traffic and those that are given priority over such traffic. As the public more and more aggressively demands train-based transit services, these should be read as demands for increased speed and reliability (among several other things) and planners should respond by modifying existing services to meet these demands.
Speed and reliability are a function in large part of how many potential delays a line will encounter along its course. Random delay results from unplanned disruptions such as higher than expected passenger loads, traffic, serial red lights, etc. Scheduled delay, also known as schedule ‘padding’, is delay that is built into scheduled transit services that allows them to be tolerant of unscheduled disruptions by acknowledging their average effects in advance. Agencies try to balance scheduled and unscheduled delay to create schedules that are neither too slow nor too often disrupted by random delay. While the public often reacts negatively to random delay events, they’re typically unaware of schedule padding, though both are dependent on basically the same environmental factors.
Since the public, and not transit schedulers, is in control, it becomes important to explain delay and its causes and effects to a lay audience and thereby to direct them toward a fruitful response. Further, since funds for radical infrastructure interventions may be difficult to find in the current political regime, attention should be focused on marginal cases and incremental improvements to surface-running bus lines.
Simply, the question is: where can the smallest new delay-avoidance technique create the biggest improvement in speed and reliability for existing services?
As a first step toward an answer to this question, I’ve created a rough measure of the amount of schedule padding identifiable from just the schedule information itself. It’s not perfect by any means, but I’m going to run with it for a moment and see where it leads me. First I identified all unique trip segments in the transit system. A segment is defined here as the travel between two unique stops, so…
( stop A -> travel -> stop B ) = segment 1
( stop B -> travel -> stop C ) = segment 2
( stop A -> travel -> stop B ) = segment 1 again
( stop B -> travel -> stop A ) = segment 3
for a total of 4 segments, 3 unique in our example.
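To make the counting concrete, here is a small Python sketch of the same example. The three trips are hypothetical, chosen only so that they yield exactly the four segments listed above. A segment is treated as an ordered pair of stops, so A to B and B to A count as different segments.

```python
from collections import Counter

def trip_segments(stop_sequence):
    """Return the ordered (from_stop, to_stop) pairs for one trip."""
    return list(zip(stop_sequence, stop_sequence[1:]))

# Hypothetical trips: A->B->C, then A->B again, then B->A
trips = [["A", "B", "C"], ["A", "B"], ["B", "A"]]

# Count how many times each distinct segment is scheduled
counts = Counter(seg for trip in trips for seg in trip_segments(trip))

total_segments = sum(counts.values())  # every scheduled traversal
unique_segments = len(counts)          # distinct ordered stop pairs
```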
For SORTA’s current GTFS schedule, we observe:
Each segment has some times associated with it:
A departure from the first stop
An arrival at the second stop
Implicitly, the time scheduled to complete the segment.
Because schedulers expect that the amount of time it will take a bus to get from A to B will be different at different times of day, these otherwise identical segments will have different durations. By finding the deviation from the minimum duration for each segment, we can get a crude measure of the schedule padding built into the system.
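In code, the measure boils down to a group-by-minimum. What follows is a rough sketch under stated assumptions, not the script that produced the numbers in this post. The observations are invented, and times are given as minutes after midnight to keep the arithmetic obvious.

```python
from collections import defaultdict

# Hypothetical scheduled segments: (from_stop, to_stop, departure_min, arrival_min)
observations = [
    ("A", "B", 360, 364),    # 6:00 departure, 4 minutes
    ("A", "B", 480, 487),    # 8:00 departure, 7 minutes
    ("A", "B", 1020, 1026),  # 17:00 departure, 6 minutes
    ("B", "C", 364, 369),    # 5 minutes
    ("B", "C", 487, 492),    # 5 minutes
]

def estimate_padding(observations):
    """For each scheduled segment, return its duration minus the shortest
    duration scheduled anywhere for the same pair of stops."""
    durations = defaultdict(list)
    for a, b, dep, arr in observations:
        durations[(a, b)].append(arr - dep)
    fastest = {seg: min(d) for seg, d in durations.items()}
    return [
        (a, b, dep, (arr - dep) - fastest[(a, b)])
        for a, b, dep, arr in observations
    ]

padding = estimate_padding(observations)
total_padding = sum(p for _, _, _, p in padding)
total_scheduled = sum(arr - dep for _, _, dep, arr in observations)
padding_share = total_padding / total_scheduled
```

A segment that is only ever scheduled at one duration shows zero padding under this measure, which is one reason to call it crude: any delay baked into every single run is invisible.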
Hours of Padding
Scheduled Vehicle Hours
This method estimates that 21.5% of the weekday schedule is actually scheduled delay, more than 400 hours of it, each weekday. That is, at least relative to the fastest any bus is scheduled to complete a segment. Just where is this scheduled delay anyway? When and where are the schedules most heavily padded? I’ll save a spatial exploration for later, but let’s take a very preliminary peek into the temporal dimension.
The first question we must ask is: when are all of the segments? By taking a central moment as the time, we can plot them, in a kernel-smoothed histogram:
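Here is one way the central moment and the smoothing could be computed, sketched in Python. The segment times, the half-hour bandwidth, and the function names are assumptions for illustration. One real wrinkle it handles: GTFS writes times for trips that run past midnight with hours above 23 (25:05:00, say), so the parser can’t assume a 24-hour clock.

```python
import math

def gtfs_seconds(hms):
    """Parse a GTFS 'HH:MM:SS' time; hours may exceed 23 for service past midnight."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

def midpoint_hours(departure, arrival):
    """Central moment of a segment, in hours since the start of the service day."""
    return (gtfs_seconds(departure) + gtfs_seconds(arrival)) / 2 / 3600

def smoothed_count(moments, at_hour, bandwidth=0.5):
    """Gaussian-kernel estimate of segments per hour around `at_hour`."""
    norm = bandwidth * math.sqrt(2 * math.pi)
    return sum(
        math.exp(-0.5 * ((at_hour - m) / bandwidth) ** 2) for m in moments
    ) / norm

# Hypothetical (departure, arrival) pairs for three scheduled segments
segments = [
    ("07:58:00", "08:02:00"),
    ("08:10:00", "08:14:00"),
    ("24:55:00", "25:05:00"),  # after midnight, same service day
]
moments = [midpoint_hours(d, a) for d, a in segments]

# Evaluate the smoothed curve every 15 minutes across a 27-hour service day
grid = [h / 4 for h in range(0, 27 * 4 + 1)]
curve = [smoothed_count(moments, g) for g in grid]
```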
This clearly shows the basic level of service throughout the day and week. It’s not a great measure of that as such, but it does give us a definite sense of the balanced weekday rush-hours and diminished weekend service.
Then, since most of the segments are padded, we ask: when are the segments without padding? On the same scale, we get:
As we might have expected, there is less random delay and thus less need for padding when the streets and buses are less congested: early morning and late at night. It also appears that there are relatively fewer padded segments on the weekends, though the total number of unpadded segments is roughly the same as on weekdays.
Ok, so when is the padding itself and how much of it is there? Note that we’re measuring something different here: hours of padding per hour.
Now, this definitely has a different shape than the overall distribution of schedule segments, but it’s a little hard to compare them when they’re so far apart. Let’s combine all of these into one plot. I might have got a little carried away in Inkscape…
I’ll just let that speak for itself for now. We’ll get into spatial visualizations of this data next, and eventually real-time comparisons and measures in space-time.