Predicting How NYC Moves from Live Citi Bike Data
I've been quietly recording every Citi e-bike in NYC, every minute, 24/7.
I started by looking at the network requests the Citi Bike app makes to populate the map. The app pulls from a public feed showing station and bike availability, so I wrote a script to save that data every minute.
Right now, I'm recording every e-bike at every station in NYC, including its location and battery level.
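A minimal sketch of what a once-a-minute recorder could look like, using only the standard library. The feed URL below is a placeholder, and the function names and the one-JSON-file-per-snapshot layout are assumptions for illustration, not the actual script.

```python
import json
import time
from datetime import datetime, timezone
from pathlib import Path
from urllib.request import urlopen

# Placeholder: the real feed URL is not given in this post.
FEED_URL = "https://example.com/bike-feed.json"


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """Download and decode one JSON snapshot of the feed."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def save_snapshot(payload: dict, out_dir: Path, now: datetime) -> Path:
    """Write one snapshot to a file named after its UTC capture time."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{now:%Y%m%dT%H%M%SZ}.json"
    path.write_text(json.dumps({"captured_at": now.isoformat(), "payload": payload}))
    return path


def poll_forever(out_dir: Path, interval_s: float = 60.0, fetch=fetch_json) -> None:
    """Save one snapshot per interval; log failures and keep going."""
    while True:
        started = time.monotonic()
        try:
            save_snapshot(fetch(FEED_URL), out_dir, datetime.now(timezone.utc))
        except Exception as exc:  # network or decode error: skip this minute
            print(f"snapshot failed: {exc}")
        # Sleep only for what is left of the interval to limit drift.
        time.sleep(max(0.0, interval_s - (time.monotonic() - started)))
```

Swallowing errors inside the loop is deliberate here: one failed request costs a single missing minute instead of stopping the recorder.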
That is more data than it sounds like. At full resolution, it comes to tens of gigabytes of raw data per week. To keep that manageable, I store everything as Parquet, a columnar format that compresses this kind of repetitive time series really well, and I back it up to Cloudflare R2. So the raw firehose is large, but the stored footprint stays reasonable and easy to query later.
That turned into a live map where you can watch the system move in near real time and see where individual e-bikes appear to have recently been.
The map is mostly the fun part. My real goal is the dataset.
I'm doing this as the final project for my deep learning course at Harvard this summer. Once I have enough historical data, I want to train predictive models around questions like:
- At a given time of day, in which direction is the city moving?
- How many bikes are likely to arrive at or leave a station in the next hour?
- If someone picks up a bike here, where are they most likely to drop it off?
- Is this station likely to run out of bikes in the next 30 minutes?
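The last question above needs a training label before any model can answer it. Here is a rough sketch of one way to build it, assuming a gap-free list of per-minute bike counts for a single station (the function name and input shape are hypothetical):

```python
def stockout_labels(bikes_available: list[int], horizon: int = 30) -> list[bool]:
    """For each minute t, True if the station hits zero bikes within the
    next `horizon` minutes.

    `bikes_available[t]` is the count at minute t. Only minutes with a full
    lookahead window get a label, so the output is shorter than the input.
    """
    labels = []
    for t in range(len(bikes_available) - horizon):
        window = bikes_available[t + 1 : t + 1 + horizon]
        labels.append(min(window) == 0)
    return labels
```

Real data will have missing minutes, so in practice the window would need to be defined by timestamps instead of list positions.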
On the modeling side, I plan to start simple and build from there. First, some honest baselines (historical averages and gradient-boosted trees on features like time of day, day of week, and weather) so I have something real to beat. From there, the interesting part is that this is a network problem in both space and time. Bikes flow between stations, and what happens at one station clearly depends on its neighbors. So the methods I want to try, roughly in order:
- Recurrent models (LSTMs and GRUs) for per-station demand over time.
- Graph neural networks that treat stations as nodes and flows as edges, since the spatial structure matters. Spatiotemporal GNNs like DCRNN and Graph WaveNet are built almost exactly for this kind of problem.
- Transformers, to let the model attend across the whole city and over longer time horizons.
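Before any of the neural models, the historical-average baseline mentioned above can be sketched in a few lines. This assumes the data has already been aggregated into hourly departure counts per station; the row layout and function names are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def fit_historical_average(rows):
    """rows: iterable of (station_id, hour_start: datetime, departures: int).

    Returns a dict mapping (station_id, weekday, hour) to mean departures.
    """
    buckets = defaultdict(list)
    for station_id, hour_start, departures in rows:
        key = (station_id, hour_start.weekday(), hour_start.hour)
        buckets[key].append(departures)
    return {key: mean(values) for key, values in buckets.items()}


def predict(model, station_id, when, default=0.0):
    """Look up the historical mean for this station, weekday, and hour."""
    return model.get((station_id, when.weekday(), when.hour), default)
```

Anything fancier has to beat this lookup table on held-out weeks to be worth its complexity.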
The drop-off question is a little different. It is closer to predicting a destination distribution given an origin and a time, so I may frame it as an origin-destination problem or a sequence-to-sequence model.
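The simplest version of that destination distribution is just empirical counts. A sketch, assuming reconstructed trips are available as (origin, hour of day, destination) tuples; the names are hypothetical and this is a counting baseline, not the sequence model itself:

```python
from collections import Counter, defaultdict


def fit_destination_model(trips):
    """trips: iterable of (origin, hour_of_day, destination)."""
    counts = defaultdict(Counter)
    for origin, hour, destination in trips:
        counts[(origin, hour)][destination] += 1
    return counts


def destination_distribution(counts, origin, hour):
    """Return {destination: probability} for one origin and hour,
    or an empty dict if that combination was never observed."""
    seen = counts.get((origin, hour))
    if not seen:
        return {}
    total = sum(seen.values())
    return {dest: n / total for dest, n in seen.items()}
```

With hundreds of stations, most (origin, hour) pairs will be sparse, which is part of why a learned model is appealing here.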
One part that makes all of this harder and more interesting is that Citi Bike only exposes the last 4 digits of each e-bike ID. There is a trick to get the full ID of every bike, but doing it at scale would basically mean DDoS-ing Lyft, and I don't want to do that 😅. With roughly 16,000 bikes and at most 10,000 possible 4-digit codes, those IDs collide constantly. You cannot perfectly track a single bike from pickup to drop-off.
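To put a rough number on how often codes collide, here is a back-of-the-envelope calculation. It assumes all 10,000 four-digit codes are possible and that codes are spread uniformly and independently across bikes, which may not hold for the real fleet.

```python
def collision_stats(n_bikes: int = 16_000, n_codes: int = 10_000):
    """Return (average bikes per code, chance a given bike shares its code).

    Assumes each bike's 4-digit code is drawn uniformly and independently.
    """
    bikes_per_code = n_bikes / n_codes
    # Chance that none of the other bikes has the same code as a given bike.
    p_unique = (1 - 1 / n_codes) ** (n_bikes - 1)
    return bikes_per_code, 1 - p_unique
```

Under those assumptions there are 1.6 bikes per code on average, and a given bike has roughly an 80% chance of sharing its code with at least one other bike.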
So instead of tracking bikes directly, I reconstruct the most likely paths. When a bike leaves one station, and a bike with the same 4-digit code shows up at another, I decide whether it is probably the same physical bike using a few physical constraints:
- Speed: the straight-line distance divided by the elapsed time has to be feasible for an e-bike, under roughly 20 mph.
- Battery: charge should only go down from one sighting to the next, whether the bike was ridden or sat parked, never up (within some sensor noise), since bikes do not charge at the dock.
- Alternation: a single bike has to go dock, undock, dock, undock. If I see two pickups in a row for the same code with no drop-off in between, that is proof that at least two physical bikes share it.
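The three rules above can be sketched as small checks. The `Sighting` schema, the 2-point battery tolerance, and the function names are assumptions for illustration; only the 20 mph cap comes from the rules themselves.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

MAX_SPEED_MPH = 20.0      # from the speed rule above
BATTERY_NOISE_PCT = 2.0   # assumed tolerance for sensor noise


@dataclass
class Sighting:
    """A code leaving or arriving at a station (hypothetical schema)."""
    lat: float
    lon: float
    t: float        # Unix time in seconds
    battery: float  # percent, 0 to 100


def haversine_miles(a: Sighting, b: Sighting) -> float:
    """Great-circle distance between two sightings, in miles."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = (sin(dlat / 2) ** 2
         + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2)
    return 2 * 3958.8 * asin(sqrt(h))


def is_feasible_hop(pickup: Sighting, dropoff: Sighting) -> bool:
    """True if one physical bike could have made this hop."""
    hours = (dropoff.t - pickup.t) / 3600
    if hours <= 0:
        return False
    speed_ok = haversine_miles(pickup, dropoff) / hours <= MAX_SPEED_MPH
    battery_ok = dropoff.battery <= pickup.battery + BATTERY_NOISE_PCT
    return speed_ok and battery_ok


def violates_alternation(events: list[str]) -> bool:
    """events: time-ordered "pickup"/"dropoff" labels for one code.

    True if two consecutive events are the same kind.
    """
    return any(a == b for a, b in zip(events, events[1:]))
```

Because riders do not travel in straight lines, the straight-line speed is a lower bound on the real speed, so this check can only rule hops out, never confirm them.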
When more than one candidate fits, I mark that hop as uncertain instead of pretending I know. It is essentially a constrained data-association problem, similar to multi-object tracking. Eventually I would like to learn the assignment probabilistically rather than rely on hand-tuned rules.
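A sketch of that "mark it uncertain" step, handling one pickup at a time. The function name, the status strings, and the `is_feasible` callback are hypothetical; a real version would likely solve the assignment jointly across all pickups instead of one by one.

```python
def resolve_hop(pickup, candidates, is_feasible):
    """Classify one pickup against drop-offs that share its 4-digit code.

    `is_feasible(pickup, candidate)` applies the physical constraints.
    Returns (status, matches), where status is "matched", "uncertain",
    or "unmatched".
    """
    matches = [c for c in candidates if is_feasible(pickup, c)]
    if len(matches) == 1:
        return "matched", matches
    if matches:
        return "uncertain", matches
    return "unmatched", matches
```

Keeping the full list of candidates on uncertain hops leaves room to assign probabilities to them later.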
That reconstruction problem is already interesting before getting into the modeling.
I'll share more as the dataset grows, including what works, what doesn't, where the models fail, and what that reveals about the system.
If you've worked on demand forecasting, mobility data, or spatiotemporal modeling, I'd love to hear how you'd approach this.