Mar 2026

IU Indy DataViz Games submission exploring how travel distance relates to game attendance.
Distance & Devotion is a data visualization project we created for the IU Indy DataViz Games 2026, a competition centered on turning sports data into a clear and compelling story. The repository includes a full reproducible pipeline with a raw NCAA dataset, intermediate CSVs, candidate charts, and a final stitched submission graphic.
The central question we were trying to answer from the start was: how does travel distance relate to women's college basketball attendance, and what can that reveal about fan devotion, program reach, and structural differences across institutions? Since fan-level travel telemetry data was largely unavailable, the analysis is conducted with the visiting team's campus-to-venue distance as a proxy for the travel burden facing an away fan base.
That framing allows us to turn attendance into more than a simple popularity metric and study whether high-interest programs and games still draw crowds even when travel is long, and whether those burdens are distributed differently across teams, conferences, and HBCU programs.
The main dataset is data/raw/ncaa_raw.csv, the raw NCAA dataset the competition provided for the project.
The raw data is then processed into two committed intermediate outputs for reproducibility: data/intermediate/games_with_distance.csv and data/intermediate/team_travel_distances.csv.
Those files are filtered versions of the raw dataset, each used for the specific analyses described below.
The project also depends on a US state shapefile stored in data/geographical_visualization_maps/, which is loaded in visuals.py and filtered to admin == 'United States of America' before the national travel map is drawn.
The first step of our process was to clean the raw data set of any extraneous fields and to filter the dataset to only include the fields needed for the analysis. Our data-prep script immediately narrows the working dataset to a smaller set of columns centered on team identity, conference, attendance, venue coordinates, campus coordinates, home/away status, neutral-site status, HBCU flag, and season record.
The specific fields used directly in the code include:
| Field | Purpose |
|---|---|
| TEAM_INSTITUTION_NAME | Team/school identifier. |
| TEAM_CONFERENCE | Conference grouping for comparisons. |
| ATTENDANCE | Main outcome metric used in charts and aggregations. |
| FACILITY_LATITUDE, FACILITY_LONGITUDE | Venue coordinates for travel calculations. |
| TEAM_INSTITUTION_LATITUDE, TEAM_INSTITUTION_LONGITUDE | Campus coordinates for travel calculations. |
| IS_HOME_CONTEST | Used to isolate away games. |
| IS_SITE_NEUTRAL | Used to exclude neutral-site contests. |
| IS_INSTITUTION_HBCU | Supports the HBCU vs. non-HBCU analysis. |
| SEASON_WINS, SEASON_LOSSES | Used for win-adjusted metrics and team summaries. |
All core preprocessing lives in src/scripts/geographic_gravity.py, which acts as the analytical engine for the project.
It loads the raw CSV, trims the data to the fields needed for analysis (that we just described), computes geographic distance, creates derived metrics, and exports the two intermediate CSVs used by the rest of the pipeline.
Specifically to filter the data, the script applies the following sequence:
- Keep only rows where IS_SITE_NEUTRAL is no.
- Filter IS_HOME_CONTEST to no.

This cleaning matters because travel distance is only meaningful from the visiting team's perspective, and coordinate completeness is required for the distance calculation to run cleanly.
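The filter sequence can be sketched in pandas roughly as follows. This is a minimal sketch, not the script itself: the names `filter_away_games`, `is_no`, and `NO_VALUES` are hypothetical, the set of values that count as "no" is an assumption about how the flags are encoded, and dropping rows with missing coordinates is one assumed way to get the coordinate completeness mentioned above.

```python
import pandas as pd

COORD_COLS = [
    "FACILITY_LATITUDE", "FACILITY_LONGITUDE",
    "TEAM_INSTITUTION_LATITUDE", "TEAM_INSTITUTION_LONGITUDE",
]

# Assumed encodings of "no"; adjust to match the real flag columns.
NO_VALUES = {"no", "n", "false", "0"}


def is_no(flag: pd.Series) -> pd.Series:
    """Return True where a flag column reads as 'no' (assumed encodings)."""
    return flag.astype(str).str.strip().str.lower().isin(NO_VALUES)


def filter_away_games(df: pd.DataFrame) -> pd.DataFrame:
    """Keep non-neutral away rows that have all four coordinates."""
    keep = is_no(df["IS_SITE_NEUTRAL"]) & is_no(df["IS_HOME_CONTEST"])
    # Assumption: rows with any missing coordinate are dropped.
    return df[keep].dropna(subset=COORD_COLS).reset_index(drop=True)
```

If the real columns are already booleans, the `is_no` helper still works, since `False` is converted to the string "false" before the comparison.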
Travel distance is computed with a custom Haversine implementation written directly in Python. The function uses an Earth radius of 3,959 miles and calculates the great-circle distance between the venue and the visiting institution.
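```python
from math import radians, sin, cos, atan2, sqrt

def haversine(lat1, lon1, lat2, lon2):
    R = 3959  # Earth radius in miles
    # Convert degrees to radians before applying the formula
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * atan2(sqrt(a), sqrt(1-a))
    return R * c  # great-circle distance in miles
```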
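A quick way to sanity-check a Haversine function is to feed it points with a known answer. The sketch below repeats the function so it runs on its own and applies it to two made-up rows that reuse the column names from the field table. The rows and the `travel_distance` helper are illustrative assumptions, not data or code from the repository.

```python
from math import radians, sin, cos, atan2, sqrt

EARTH_RADIUS_MILES = 3959


def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return EARTH_RADIUS_MILES * 2 * atan2(sqrt(a), sqrt(1 - a))


def travel_distance(row):
    """Hypothetical helper: venue-to-campus distance for one game row."""
    return haversine(
        row["FACILITY_LATITUDE"], row["FACILITY_LONGITUDE"],
        row["TEAM_INSTITUTION_LATITUDE"], row["TEAM_INSTITUTION_LONGITUDE"],
    )


# Made-up rows with known answers: a quarter of the equator, and a zero-length trip.
games = [
    {"FACILITY_LATITUDE": 0.0, "FACILITY_LONGITUDE": 0.0,
     "TEAM_INSTITUTION_LATITUDE": 0.0, "TEAM_INSTITUTION_LONGITUDE": 90.0},
    {"FACILITY_LATITUDE": 39.77, "FACILITY_LONGITUDE": -86.16,
     "TEAM_INSTITUTION_LATITUDE": 39.77, "TEAM_INSTITUTION_LONGITUDE": -86.16},
]
distances = [travel_distance(g) for g in games]
```

The first row should come out to a quarter of the Earth's circumference at this radius (3959 × π / 2 miles), and the second to zero.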
After distances are computed, the script removes the top 1% of travel-distance values using quantile(0.99).
We did this to limit extreme cases from distorting the later charts and summaries.
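As a rough illustration of the trimming step, the sketch below keeps rows at or below the 99th-percentile distance. The function name `trim_top_percent` and the column name `distance_miles` are hypothetical, and whether the script keeps or drops rows exactly at the cutoff is an assumption here.

```python
import pandas as pd


def trim_top_percent(df: pd.DataFrame, column: str = "distance_miles",
                     q: float = 0.99) -> pd.DataFrame:
    """Drop rows whose value in `column` exceeds the q-th quantile."""
    cutoff = df[column].quantile(q)
    # Assumption: rows exactly at the cutoff are kept.
    return df[df[column] <= cutoff].copy()
```

Because the cutoff is computed from the data itself, the number of rows removed scales with the dataset rather than relying on a fixed mileage threshold.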
The script then creates three especially important derived variables:
- attendance_per_mile
- attendance_per_win
- distance_bucket

These features support the efficiency ranking, the win-adjusted attendance lens, and the clean five-category narrative used in the distance-bucket chart.
The five distance buckets that are defined are:
| Bucket | Range |
|---|---|
| Local | 0–150 miles |
| Regional | 150–400 miles |
| Mid-distance | 400–800 miles |
| Long | 800–1,500 miles |
| Cross-country | 1,500+ miles |
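One way the three derived variables could be built is sketched below. The formulas are assumptions inferred from the variable names (attendance divided by distance, attendance divided by season wins), `distance_miles` is a hypothetical column name, and the bucket boundaries are treated as left-inclusive (so exactly 150 miles counts as Regional), which the table does not specify.

```python
import numpy as np
import pandas as pd

BUCKET_EDGES = [0, 150, 400, 800, 1500, np.inf]
BUCKET_LABELS = ["Local", "Regional", "Mid-distance", "Long", "Cross-country"]


def add_derived_metrics(df: pd.DataFrame,
                        distance_col: str = "distance_miles") -> pd.DataFrame:
    """Add attendance_per_mile, attendance_per_win, and distance_bucket."""
    out = df.copy()
    # Mask zeros so the divisions yield NaN instead of infinity.
    distance = out[distance_col].where(out[distance_col] != 0)
    wins = out["SEASON_WINS"].where(out["SEASON_WINS"] != 0)
    out["attendance_per_mile"] = out["ATTENDANCE"] / distance
    out["attendance_per_win"] = out["ATTENDANCE"] / wins
    # right=False makes each bucket include its lower edge: [0, 150), [150, 400), ...
    out["distance_bucket"] = pd.cut(
        out[distance_col], bins=BUCKET_EDGES, labels=BUCKET_LABELS, right=False
    )
    return out
```

Using `pd.cut` keeps the bucket labels as an ordered categorical, which helps charts list the buckets from Local to Cross-country rather than alphabetically.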
The script groups games by TEAM_INSTITUTION_NAME and computes team summaries including average distance, maximum distance, number of games played, average attendance, conference, HBCU flag, and average wins.
It then filters to teams with at least 10 games before saving team_travel_distances.csv.
The filter was applied since it reduces noise from tiny sample sizes and produces a more stable team-level ranking, though it also excludes some smaller programs from the final aggregate view.
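The team-level aggregation might look something like the following. It is a sketch under assumptions: `summarize_teams`, the output column names, and `distance_miles` are all hypothetical, and taking the first conference and HBCU value per team assumes those fields are constant within a team.

```python
import pandas as pd


def summarize_teams(games: pd.DataFrame, distance_col: str = "distance_miles",
                    min_games: int = 10) -> pd.DataFrame:
    """One row per team, limited to teams with at least `min_games` rows."""
    summary = (
        games.groupby("TEAM_INSTITUTION_NAME")
        .agg(
            avg_distance=(distance_col, "mean"),
            max_distance=(distance_col, "max"),
            games_played=(distance_col, "size"),
            avg_attendance=("ATTENDANCE", "mean"),
            conference=("TEAM_CONFERENCE", "first"),
            is_hbcu=("IS_INSTITUTION_HBCU", "first"),
            avg_wins=("SEASON_WINS", "mean"),
        )
        .reset_index()
    )
    # Drop teams with too few games to give a stable average.
    return summary[summary["games_played"] >= min_games].reset_index(drop=True)
```

Raising or lowering `min_games` is the knob that trades stability of the ranking against coverage of smaller programs.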
After applying the filters outlined above, we had 19,272 away-team rows. Across those rows, the mean travel distance is about 383 miles and the median is about 270 miles. The distribution is right-skewed: most trips are regional, but a smaller set of long-haul trips pulls the average upward.
The Pearson correlation (r) between travel distance and attendance is approximately 0.151, which is weakly positive rather than negative. This is interesting because it does not fit a simple “distance lowers attendance” story. Instead, it suggests that longer trips often coincide with games that already have larger audiences or broader visibility.
Most prominently, we found that average attendance rises from roughly 1,332 in local games to about 2,742 in cross-country games.
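For readers who want to reproduce these two summaries from the games table, a minimal pandas sketch is shown here. The function name `distance_attendance_summary` and the `distance_miles` column are hypothetical, and the sketch assumes a `distance_bucket` column already exists on the games table.

```python
import pandas as pd


def distance_attendance_summary(games: pd.DataFrame,
                                distance_col: str = "distance_miles"):
    """Return (Pearson r, mean attendance per distance bucket)."""
    # Series.corr uses the Pearson method by default.
    r = games[distance_col].corr(games["ATTENDANCE"])
    # observed=True skips unused categories if the bucket column is categorical.
    by_bucket = games.groupby("distance_bucket", observed=True)["ATTENDANCE"].mean()
    return r, by_bucket
```

A correlation near 0.15 describes a loose tendency across all games, so the bucket means are usually the easier of the two outputs to read on a chart.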
One of the most distinctive parts of the project, at least in my opinion, was the explicit HBCU lens. In our analysis we found that HBCU teams in the filtered set travel about 343 miles on average, compared with about 386 miles for non-HBCU teams. This suggests that HBCU programs may have slightly lower travel burdens on average, which could reflect factors such as conference alignment, geographic clustering, or resource constraints.
Based on this analysis, we concluded that games involving longer travel burdens tend to be associated with larger attendance environments. This actually makes a lot of sense since marquee matchups, larger venues, and geographically dispersed conferences all shape both distance and turnout at the same time.
All visuals that we came up with are generated in src/scripts/visuals.py using Matplotlib, GeoPandas, Shapely, NumPy, and a carefully tuned set of global style parameters.
The repository's src/plots/ directory is generated by visuals.py.
| File | Chart type | Purpose |
|---|---|---|
| ![]() | Flow map | Shows national travel patterns and top-attendance venues. |
| ![]() | Horizontal bar chart | Ranks the 15 teams with the highest average travel distance. |
| ![]() | Scatter plot | Compares average wins to average travel distance, with attendance as bubble size. |
| ![]() | Vertical bar chart | Compares top conferences by average travel distance. |
| ![]() | Histogram | Shows the distribution of travel distances for all away games in the dataset. |
| ![]() | Bar chart | Shows the monotonic increase in attendance across distance buckets. |
| ![]() | Scatter plot | Shows the weak positive correlation between distance and attendance. |
| ![]() | Box-and-whisker plot | Compares HBCU and non-HBCU travel distributions. |
| ![]() | Horizontal bar chart | Highlights teams with the highest attendance per mile traveled. |
The repository follows a clean three-stage pipeline: compute, visualize, then compose.
```
data/raw/ncaa_raw.csv
        │
        ▼
src/scripts/geographic_gravity.py
        │
        ├── data/intermediate/games_with_distance.csv
        └── data/intermediate/team_travel_distances.csv
                │
                ▼
src/scripts/visuals.py
        │
        └── src/plots/viz*.png
                │
                ▼
src/scripts/final.py
        │
        └── data/outputs/combined_fan_travel_story.png
```

```
.
├── README.md
├── requirements.txt
├── dictionary                             # plain-language data dictionary
├── data
│   ├── raw/ncaa_raw.csv
│   ├── intermediate
│   │   ├── games_with_distance.csv
│   │   └── team_travel_distances.csv
│   ├── geographical_visualization_maps/   # US states shapefile (basemap)
│   └── outputs
│       ├── combined_fan_travel_story.png
│       └── kz_mk_iu_submission.pdf
└── src
    ├── notebooks
    │   ├── 01_exploration.ipynb
    │   └── ncaa_attendance_analysis.png
    ├── plots                              # generated figures (saved by visuals.py)
    └── scripts
        ├── geographic_gravity.py
        ├── visuals.py
        └── final.py
```
Honestly, I don't remember off the top of my head which Python version we used, but 3.10+ should work.
| Library | Version | Role |
|---|---|---|
| pandas | ~=2.3.3 | Data manipulation and aggregation. |
| geopandas | ~=1.1.3 | Shapefile loading and geospatial plotting. |
| matplotlib | ~=3.10.8 | Figure generation. |
| numpy | ~=2.2.6 | Numerical computation, interpolation, and scaling. |
| shapely | ~=2.1.2 | Geometry objects such as LineString. |
| seaborn | ~=0.13.2 | Styling baseline. |
| Pillow | ~=10.4.0 | Final image composition. |
One subtle but important implementation detail is that the scripts anchor paths using Path.cwd().parent.parent, which assumes they are run from src/scripts/. In the README we explicitly instruct users to run the pipeline from that directory, so the repo is internally consistent, but the scripts are not path-agnostic from arbitrary working directories.
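For comparison, a path-agnostic alternative would anchor on the script's own location instead of the current working directory. The sketch below is not what the repository does; `project_root` is a hypothetical helper, and the example path is made up to show the result.

```python
from pathlib import Path


def project_root(script_path: Path) -> Path:
    """Return the repo root for a script located in <root>/src/scripts/."""
    # parents[0] is src/scripts, parents[1] is src, parents[2] is the repo root.
    return script_path.parents[2]


# Inside a script such as visuals.py this could be called as:
#   ROOT = project_root(Path(__file__).resolve())
#   RAW_CSV = ROOT / "data" / "raw" / "ncaa_raw.csv"
example_root = project_root(Path("/repo/src/scripts/visuals.py"))
example_csv = example_root / "data" / "raw" / "ncaa_raw.csv"
```

With this approach the scripts would resolve the same data paths whether they are launched from src/scripts/, the repo root, or anywhere else.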
The biggest methodological challenge we faced is the proxy itself. The project measures team-campus-to-venue distance, not where actual ticket-buying fans live, so the analysis should be interpreted as a travel-burden proxy rather than a direct observation of fan movement.
Attendance is also confounded by factors the current analysis does not explicitly control for, including venue capacity, promotions, rivalry intensity, team quality, and tournament context. That means the weak positive distance-attendance relationship is most plausibly capturing some mix of game significance and conference structure rather than a pure behavioral response to distance.
There are also some engineering tradeoffs. The map samples only 100 rows for readability, final.py attempts macOS-specific font paths before falling back to a default font, and the path logic assumes execution from src/scripts/, which is convenient but not fully portable.
Several natural next steps would make this project even stronger.
- A conda environment spec and more portable font handling in final.py.

The README also has these instructions:
```
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cd src/scripts
python geographic_gravity.py
python visuals.py
python final.py
```
That sequence generates the intermediate CSVs, candidate PNG charts, and final combined story graphic exactly in the locations documented by the repository.
What makes this repository stand out is not just the polish of the final graphic, but the balance between technical execution and analytical restraint. The project makes a bold visual claim, but it also acknowledges where the proxy is imperfect and where the correlation should not be overinterpreted.
Also, considering we made this in the span of only a few weeks when we were pretty busy, I'm pretty happy with the results. It really helped me practice my skills in analyzing and visualizing data and in working with a teammate. For me, the topic was unfamiliar but interesting, and I enjoyed the process of going through the data, figuring out how to clean it, and then trying to find the best way to visualize the story we wanted to tell with it.
I am open to ANY questions about this project! Shoot me an email or a message on my LinkedIn, and I will be happy to chat about the data, the code, the design decisions, or anything else related to this project.
Parts of this project were developed in collaboration with generative AI.