Stop Dispersion, Inter-agency Comparison


That last post was fun. Here’s some more :-)

All data is from current(April 2014) public GTFS feeds listed here. I calculated the straight-line distance, in feet, using the appropriate US state-plane projections, from each and every stop on each line to the next scheduled stop in that line. I then weighted each segment distance thus created by the number of times the agency traverses that segment each week. Weekday trips count 5 times Saturday-only trips, etc. That should give us something like the average person’s experience of the distance to the next stop, if people are evenly distributed across vehicles. The chart was plotted in R using the density() function, output to SVG, and then tweaked and reoriented in Inkscape.

Draw your own conclusions!

EDIT: Some useful SQL code using the basic table structure from GTFS; I simply imported the calendar.txt, trips.txt, stop_times.txt & stops.txt from each agency’s feed into a POSTGIS DB and ran a spatial query on that. Here’s the basic procedure for Muni:

--create the whole table, even though we only need a few things
--it's easier than editing the CSV
CREATE TABLE sfmta_calendar (
	service_id varchar,
	monday integer,
	tuesday integer,
	wednesday integer,
	thursday integer,
	friday integer,
	saturday integer,
	sunday integer,
	start_date varchar,
	end_date varchar
CREATE TABLE sfmta_trips (
	route_id varchar,
	service_id varchar,
	trip_id varchar,
	trip_headsign varchar,
	direction_id varchar,
	block_id varchar,
	shape_id varchar
CREATE TABLE sfmta_stop_times (
	trip_id varchar,
	arrival_time varchar,
	departure_time varchar,
	stop_id varchar,
	stop_sequence integer,
	headsign varchar,
	pickup_type varchar,
	drop_off_type varchar,
	dist_travelled varchar
CREATE TABLE sfmta_stops (
	stop_id varchar,
	--stop_code varchar,
	stop_name varchar,
	stop_desc varchar,
	stop_lat varchar,
	stop_lon varchar,
	zone_id varchar,
	url varchar

-- bring in the data
-- mind you get rid of the headers first
COPY sfmta_calendar FROM '/home/nate/calendar.txt' DELIMITER ',' CSV;
COPY sfmta_trips FROM '/home/nate/trips.txt' DELIMITER ',' CSV;
COPY sfmta_stop_times FROM '/home/nate/stop_times.txt' DELIMITER ',' CSV;
COPY sfmta_stops FROM '/home/nate/stops.txt' DELIMITER ',' CSV;

-- add a geometry column
ALTER TABLE sfmta_stops ADD COLUMN the_geom geometry(POINT,3494);
UPDATE sfmta_stops SET the_geom = 
		ST_GeomFromText('POINT('|| stop_lon ||' '|| stop_lat ||')', 4326)
--add columns which we'll update with the values we're actually interested in
ALTER TABLE sfmta_stop_times
ADD COLUMN weight integer,
ADD COLUMN next_stop real;

--run the spatial query
--first get each stop paired up with the next one down the line.
--this should return the total number of records in the stop_times
-- table minus the number of records in the trips table
--(each trip has one final stop)
WITH temp AS (
	FROM sfmta_trips AS t 
	JOIN sfmta_stop_times AS st ON t.trip_id = st.trip_id
	JOIN sfmta_stops AS s ON st.stop_id = s.stop_id),
--now get a table of weights
--most agencies use the same schedule for all week days and 
--we don't want to over-emphasize weekend-only services
--each day column simply has a true/false binary value, which we've treated as an integer
weights AS (
		(monday+tuesday+wednesday+thursday+friday+saturday+sunday) AS weight
	FROM sfmta_calendar
--join all that shit and calculate the distance from each stop to the next
UPDATE sfmta_stop_times SET 
	next_stop = t1.the_geom <-> t2.the_geom,
	weight = weights.weight
FROM temp AS t1
JOIN temp AS t2 ON 
	t1.stop_sequence = t2.stop_sequence + 1
	AND t1.trip_id = t2.trip_id
	AND t1.route_id = t2.route_id
JOIN weights ON weights.service_id = t1.service_id
	t1.trip_id = sfmta_stop_times.trip_id AND
	t1.stop_sequence = sfmta_stop_times.stop_sequence;

-- and export the data
	FROM sfmta_stop_times
	WHERE next_stop > 0
) TO '/home/nate/sfmta.csv' DELIMITER ',' CSV HEADER;

And then it’s on to R!

Comments: 2
Posted in: Access | Analysis | Data
Tags: | | | | | | | | | | | | |

Finding the center

Interesting topic in Spatial Statistics 1 today: Just where exactly is the ‘center’ of something? There are a dozen different ways to pick one from a multivariate dataset. I thought I’d toy around with weighted averages to see if I had the concept figured out before I tackled the weekend’s homework. Here’s the approximate center of the transit system(s), weighted for trip origins and destinations as of the most recent data I have.


Pro-tip for cartographers: When your data is simple, throw in gratuitously pretty and engrossing things like elevation and hillshades as points of reference.

Comments: Leave one?
Posted in: Maps | Math
Tags: | | | | | | |

SORTA’s stop-level ridership stats are out

And here they are! I haven’t had a chance to really thoroughly pick through them yet. The data is a bit messy and I still need to write a little script to walk through that flat text file to clean it up before doing a join with the list of stop locations. But go ahead and poke around it yourself to see if you can find anything interesting. I’ll be able to get around to making some maps from this and comparing it with the 2009 data later this week or next. I’ll also share it again in a cleaned up format once I finish compiling it.

In related news, SORTA is reporting a 4.2% increase in ridership for 2012 over the previous year. I took a quick look into the National Transit Database to see how this compared to earlier years and immediately ran into a higher figure for unlinked trips in 2011 than was reported in that story for 2012. So I probably need to learn more about the methodology behind both numbers. For now, let’s celebrate a short term increase in transit use, and leave the potential downer of longitudinal context for later. Statistics is messy!

Comments: 2
Posted in: Data | Math
Tags: | | | | |