It’s been a while since I posted on this topic, but a clever thought has brought it back to my attention.
Last year, I spent a little time exploring the City of Cincinnati’s publicly available 591-6000 dataset, a list of calls to the City’s public-service help-line. It’s the kind of thing your unpleasant neighbours call when they want to nag the City for not fixing a crack in the sidewalk or when they’re otherwise too lazy to do something themselves; also, sometimes, for legitimate reasons like broken signs and overflowing trash cans. Most interesting to me is that you can call the number to report that an animal is dead, almost always after having been struck by a reckless driver, these invariably having neither the fortitude nor the decency to carry through the butchery they’ve initiated. As a result, a living animal is reduced to a body, and the body is indecent and must be removed.
What I created then was a map showing a kernel density approximation of callers asking the city to basically pick up a rotting corpse in the right-of-way. I think that’s a pretty fucked-up thing to establish a bureaucracy for, so I mapped it! It occurred to me that a kernel density map looks like a splatter of blood if the colors are right, and so that’s what I tried to do. I even sampled the colors directly from a photo of a dissection.
But it didn’t come out like I had it in my mind… My clever thought, then, is the reason why: I was using pixels instead of blood!
The plan now, and I’m not sure what I’ll ever do with this, is to print a nice base-map on some decent paper, and then to either screenprint or stencil the density function over top with blood and other body fluids from some of the actual animals I’m concerned with. Initially, I asked my partner to bleed me, which he politely refused. It seems, however, and the data doesn’t at all contradict this, that the city is full enough of violent death that I needn’t worry my own fingers. He gave me a small bottle of blood and the promise of more.
I picked up some paper from the neighbourhood art store and I used this to test some swatches.
And I liked the color best on the off-white paper, so that’s what I’m currently designing for. I tossed together a super-simple base map this morning.
What will it look like when it’s done? How will it serve as a pretence for me to finally use that giant laser-cutting-machine? What’s the point in a ‘society’ that clearly doesn’t give one-tenth of a shit?
Tune in next time.
Just some preliminary results from my first attempt to archive real-time bus data, these from the Champaign-Urbana ‘Mass Transit’ District:
This is a look at variance in the distribution of delayed buses throughout the day. It’s only looking at off-schedule buses right now so we aren’t seeing any change in the proportion of precisely on-time buses (if there is any such change). The little clock in the top right is the time of day the event was recorded, and right below that is the number of events used to ascertain the momentary distribution. For now, this sample size is as much a reflection of when I was running my computer as it is of schedule frequency.
I assumed that the first and last percentiles of the overall distribution were outliers and clipped them off.
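That clipping step is simple enough to sketch in a few lines of numpy. This is just a generic illustration of the approach, not the code I actually ran, and the delay values below are invented:

```python
import numpy as np

# invented example: bus delays in seconds (negative = early)
delays = np.array([-900, -60, -30, 0, 10, 25, 40, 60, 90, 120, 3600])

# treat the first and last percentiles of the overall
# distribution as outliers and drop them
lo, hi = np.percentile(delays, [1, 99])
clipped = delays[(delays >= lo) & (delays <= hi)]
```

With real data you’d compute the percentile cut-offs once over the whole archive, then apply them to each momentary sample, so the clipping is consistent across the day.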
I don’t actually see much going on here except random fluctuation, but I suspect this will get much more interesting with larger samples and many more and more diverse agencies included. I’m actually quite eager to see what that turns up! Right now, I’m working on developing a tool to query APIs from the Toronto Transit Commission and Philly’s SEPTA. Unfortunately, because neither Cincinnati transit agency is willing to share their data, which they’ve been collecting and using for years, it will be impossible to include them in this sort of analysis.
I’m just beginning to play with the new, highly detailed ridership data I got from SORTA, and boy is it a treat. I’ll start here with a high-level overview of the temporal dimension of the data, before looking at spatial aspects and breaking it down by line, stop, service type, etc. as the summer progresses. I think I may also use this data as the basis of my study in R this coming semester (Hi Michael!), so perhaps we can count on seeing some more detailed and particularly nerdy and multivariate analysis through the cooler months as well. I am, by the way, acutely aware that I’ve started a number of little projects on this blog, and have failed as yet to carry them through to their completion. I keep getting distracted by the realization that I have no idea what I’m talking about, the inevitably elusive prospect of making money some way or another, and the all too comforting thought that no one is reading this or taking it seriously anyway. But hopefully, this is a small enough commitment and certainly it’s interesting enough for me to actually provide a reasonably complete picture of this particular dataset before altogether too long. Perhaps I can even apply the same techniques to the TANK ridership dataset that I’ve been meaning to get to and publish for more than a year.
Anyway, let’s actually get to that temporal overview. Since we know the trip each record belongs to, identified by the line number and the trip’s start time, I was able to identify the actual scheduled time for the great majority of stops in the data set by matching the records to GTFS schedule data for the same period. About 170,000 of the 230,000 records matched to a precise time. The remainder account for a very small portion of total activity, about 2%, and I think it’s most likely that many of these records are an artifact of the way SORTA’s database is structured and not actual stops belonging to a trip. I’ll dig into that more some other time though.
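The matching itself is conceptually simple: each ridership record carries a line number and a trip start time, and the GTFS stop_times data gives scheduled times keyed the same way. Here’s a minimal, purely illustrative Python sketch of that lookup; the field names and records are made up, and the real join was of course done over the full tables:

```python
# hypothetical GTFS-derived lookup:
# (route, trip start time, stop sequence) -> scheduled stop time
schedule = {
    ('33', '07:15:00', 1): '07:15:00',
    ('33', '07:15:00', 2): '07:19:00',
}

# hypothetical ridership records from the SORTA data
records = [
    {'route': '33', 'trip_start': '07:15:00', 'stop_sequence': 2, 'ons': 3},
    {'route': '33', 'trip_start': '09:00:00', 'stop_sequence': 1, 'ons': 1},
]

matched, unmatched = [], []
for rec in records:
    key = (rec['route'], rec['trip_start'], rec['stop_sequence'])
    if key in schedule:
        # attach the precise scheduled time to the record
        matched.append({**rec, 'scheduled': schedule[key]})
    else:
        # records with no corresponding scheduled trip
        unmatched.append(rec)
```

In my case about 170,000 of 230,000 records fell in the `matched` bucket, and the leftovers in `unmatched` are what I suspect to be database artifacts.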
For the ~98% of boardings and alightings that I could pin to a precise time of day, I created a histogram:
As would be expected, alightings (that is, people getting OFF) trail boardings (getting ON) by a half-hour or so. People need to get on and get somewhere before they’ll get off at their destination. The difference, therefore, is people’s travel time.
Anecdotally, the temporal distribution of transit users closely mirrors the distribution of actual service. This is a chicken/egg situation, and it would make good sense to inquire what ridership might look like late at night if service itself didn’t trail off into hourly or half-hourly frequencies where it continues at all past 10pm. There’s also good reason to suspect that changing service levels at one time of day could affect ridership at another. Might we, for example, see differently shaped rush-hour peaks if suburbanites had and got used to having the option of staying late at their downtown office? If service continued all night, might we see echoes of the main rush-hours as second and third-shifters head for work? Might there be a night-life peak if night service weren’t so abysmal?
EDIT: For those of you who don’t have huge computer monitors…
That last post was fun. Here’s some more :-)
All data is from current (April 2014) public GTFS feeds listed here. I calculated the straight-line distance, in feet, using the appropriate US state-plane projections, from each and every stop on each line to the next scheduled stop on that line. I then weighted each segment distance thus created by the number of times the agency traverses that segment each week: weekday trips count five times as much as Saturday-only trips, and so on. That should give us something like the average person’s experience of the distance to the next stop, if people are evenly distributed across vehicles. The chart was plotted in R using the density() function, output to SVG, and then tweaked and reoriented in Inkscape.
Draw your own conclusions!
EDIT: Some useful SQL code using the basic table structure from GTFS; I simply imported the calendar.txt, trips.txt, stop_times.txt & stops.txt from each agency’s feed into a POSTGIS DB and ran a spatial query on that. Here’s the basic procedure for Muni:
--create the whole tables, even though we only need a few columns
--it's easier than editing the CSVs
--(the column lists below follow the standard GTFS spec; adjust them
-- to match the actual headers in each agency's files)
CREATE TABLE sfmta_calendar (
	service_id text,
	monday int, tuesday int, wednesday int, thursday int,
	friday int, saturday int, sunday int,
	start_date text, end_date text);
CREATE TABLE sfmta_trips (
	route_id text, service_id text, trip_id text,
	trip_headsign text, direction_id int, block_id text, shape_id text);
CREATE TABLE sfmta_stop_times (
	trip_id text, arrival_time text, departure_time text,
	stop_id text, stop_sequence int, stop_headsign text,
	pickup_type int, drop_off_type int, shape_dist_traveled real);
CREATE TABLE sfmta_stops (
	stop_id text, stop_name text, stop_lat real, stop_lon real);
-- bring in the data
-- mind you get rid of the header rows first
COPY sfmta_calendar FROM '/home/nate/calendar.txt' DELIMITER ',' CSV;
COPY sfmta_trips FROM '/home/nate/trips.txt' DELIMITER ',' CSV;
COPY sfmta_stop_times FROM '/home/nate/stop_times.txt' DELIMITER ',' CSV;
COPY sfmta_stops FROM '/home/nate/stops.txt' DELIMITER ',' CSV;
-- add a geometry column in the appropriate state-plane projection,
-- transforming from the lat/lon (EPSG:4326) the feed provides
ALTER TABLE sfmta_stops ADD COLUMN the_geom geometry(POINT,3494);
UPDATE sfmta_stops SET the_geom =
	ST_Transform(
		ST_SetSRID(ST_MakePoint(stop_lon, stop_lat), 4326),
		3494);
--add columns which we'll update with the values we're actually interested in
ALTER TABLE sfmta_stop_times
	ADD COLUMN weight integer,
	ADD COLUMN next_stop real;
--run the spatial query
--first get each stop paired up with the next one down the line.
--this should return the total number of records in the stop_times
-- table minus the number of records in the trips table
--(each trip has one final stop)
WITH temp AS (
	SELECT t.trip_id, t.route_id, t.service_id,
		st.stop_sequence, s.the_geom
	FROM sfmta_trips AS t
	JOIN sfmta_stop_times AS st ON t.trip_id = st.trip_id
	JOIN sfmta_stops AS s ON st.stop_id = s.stop_id),
--now get a table of weights
--most agencies use the same schedule for all week days and
--we don't want to over-emphasize weekend-only services
--each day column simply has a true/false binary value, which we've treated as an integer
weights AS (
	SELECT service_id,
		(monday+tuesday+wednesday+thursday+friday+saturday+sunday) AS weight
	FROM sfmta_calendar)
--join all that shit and calculate the distance from each stop to the next
UPDATE sfmta_stop_times SET
	next_stop = t1.the_geom <-> t2.the_geom,
	weight = weights.weight
FROM temp AS t1
JOIN temp AS t2 ON
	t1.stop_sequence = t2.stop_sequence + 1
	AND t1.trip_id = t2.trip_id
	AND t1.route_id = t2.route_id
JOIN weights ON weights.service_id = t1.service_id
WHERE t1.trip_id = sfmta_stop_times.trip_id AND
	t1.stop_sequence = sfmta_stop_times.stop_sequence;
-- and export the data
COPY (
	SELECT weight, next_stop FROM sfmta_stop_times
	WHERE next_stop > 0
) TO '/home/nate/sfmta.csv' DELIMITER ',' CSV HEADER;
And then it’s on to R!
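For anyone not working in R, the density() step has a close Python analogue in scipy’s weighted Gaussian KDE. The numbers below are invented stand-ins; the real input is the next_stop/weight CSV exported from PostGIS above:

```python
import numpy as np
from scipy.stats import gaussian_kde

# invented sample: distance to the next stop (feet) for six segments,
# and the number of weekly trips traversing each segment
next_stop = np.array([400.0, 700.0, 850.0, 900.0, 1200.0, 2500.0])
weight = np.array([35, 35, 35, 5, 10, 5])

# weighted KDE: frequently-traversed segments pull the curve toward them
kde = gaussian_kde(next_stop, weights=weight)
xs = np.linspace(0, 3000, 200)
density = kde(xs)
```

The weights argument (scipy 1.2+) does the same job as the trip-frequency weighting in the chart: a segment run 35 times a week counts 35 times as much as a Sunday-only one.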
Scary factoid of the day: Greater Cincinnati has about 9,000 cul-de-sacs, or streets that end bulbously. Generally, such streets are part of a dendritic hierarchy, a branching development pattern very common in post-car/war/car-war urban development.
8,894 cul-de-sacs. Data is from OSM.
I grew up on a cul-de-sac, but we’ll not go there: too much baggage. Also, it’s an unpleasant trip and there’s no transit.
These cul-de-sacs are interesting to me, if I can use a word like ‘interesting’ anywhere near such a lifeless thing, in part because they present an opportunity to do something inverted: the opportunity to make an intensity map of the very opposite of intensity, a map of the extremity of dullness. So far as transportation is concerned, this will also be a proxy for the degree of disconnection between things or more practically, the degree to which one might reasonably be scared to be outside the protective machinations of a car.
Here’s a density analysis of the location of cul-de-sacs:
Places you probably won’t go on a bus
The darker the color, the more and closer-together are the cul-de-sacs.
Let’s take a closer look at the most dis-intense spot, shall we?
I actually thought this cluster must have been an error in the data when I first noticed it.
Leave it to the golf-course-crowd to take the top spot in this contest. This kind of pattern is perfectly typical of affluent post-car suburbs: houses are located for maximum isolation from neighbors and no one wants to live on a street with ‘traffic’. Of course the obvious irony is that in keeping the traffic off their part of the street, they’ve ensured it everywhere else. It’s such a middle-class arms race isn’t it?
There’s an interesting counter-variable here, though it’s not as completely represented in the data: pedestrian crosswalks. Where there are many crosswalks close together, we should find the opposite characteristics: walkability, liveliness, places where you’d rather not be in a car. So where are the crosswalks?
Locations of 2,770 known crosswalks. Only somewhere between a third and half of the region’s crosswalks are accounted for in this dataset, so this is not an accurate representation, but it’s the best I can do at the moment.
And then the reveal:
Kernel density of crosswalk locations, same scale and methods as with the cul-de-sacs above
This looks like it might actually line up well with the location of transit lines!
1km triweight kernel density of bus stop events (raster) compared against the contour lines from the crosswalks from above (slightly altered for legibility)
Not a terrible assumption! It’s not a superb fit, but you can definitely notice some areas that seem to have a rather strong correlation. Obviously, the most intense spot for both transit and crosswalks is right in downtown, which we’ve all seen, so I won’t bother with an aerial photo of that.
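For the curious: a triweight kernel just gives each point a contribution proportional to (1 − (d/r)²)³ out to the bandwidth r, and zero beyond. A brute-force raster version might look like the sketch below; the coordinates are toy values, not the actual stop data, and real GIS tools do this far more efficiently:

```python
import numpy as np

def triweight_density(points, xs, ys, radius):
    """Brute-force (unnormalized) triweight kernel density on a grid.
    points: (n, 2) array of x, y coordinates in the grid's units
    radius: kernel bandwidth, e.g. 1000 for a 1 km kernel in metres
    """
    gx, gy = np.meshgrid(xs, ys)
    dens = np.zeros_like(gx, dtype=float)
    for px, py in points:
        d = np.hypot(gx - px, gy - py)     # distance from each cell to the point
        u = d / radius
        mask = u < 1                       # the kernel has compact support
        dens[mask] += (1 - u[mask] ** 2) ** 3
    return dens

# toy example: two stops 500 m apart, 1 km kernel
pts = np.array([[0.0, 0.0], [500.0, 0.0]])
xs = np.linspace(-1500, 2000, 71)
ys = np.linspace(-1500, 1500, 61)
grid = triweight_density(pts, xs, ys, radius=1000)
```

The compact support is why isolated cul-de-sacs or stops simply vanish from the raster once you’re a kilometre away, which suits a map of intensity (or dis-intensity) nicely.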
Interestingly though not surprisingly, crosswalks and cul-de-sacs appear to be somewhat mutually exclusive.
Only a few relatively minor areas demonstrate substantial overlap
It seems odd that anyone would have taken the time to actually enter in almost 9,000 cul-de-sacs around Cincinnati, though indeed there have been about 85,000 buildings already entered by hand. I rather suspected that they might have been added in the big TIGER imports from a few years back. If they were, that would mean we’d be able to compare against other US cities. I tried a few, but it looks like the data is really just too spotty for any reliable analysis. Alas, Pittsburgh, Indy, Cleveland and the other cities I checked don’t seem quite ready to give up their subhuman suburban secrets just yet.
Demonstrative Pittsburgh data problems: Clearly, there should be more cul-de-sacs on the right here.
Indy seems fairly complete, but something about this just doesn’t feel right to me. From what I know about the city, I don’t think there are enough cul-de-sacs in the data here. Maybe someone will tell me I’m wrong and that Indy just hasn’t experienced as much post-war growth as Cincinnati.
One of my long-term mapping goals is to tag my taxidermist boyfriend with a GPS and get exact locations of all the roadkill he picks up. My bet is that it would primarily lie within or along the edges of the cul-de-clusters identified here.