IU-Indy DataViz Games

Mar 2026

IU Indy DataViz Games submission exploring how travel distance relates to game attendance.

Python, seaborn, Pandas, NumPy, Matplotlib
Source
Collaborators: Kevin Zhou

Overview

Distance & Devotion is a data visualization project we created for the IU Indy DataViz Games 2026, a competition centered on turning sports data into a clear and compelling story. The repository includes a full reproducible pipeline with a raw NCAA dataset, intermediate CSVs, candidate charts, and a final stitched submission graphic.

The central question we were trying to answer from the start was: how does travel distance relate to women's college basketball attendance, and what can that reveal about fan devotion, program reach, and structural differences across institutions? Since fan-level travel telemetry data was largely unavailable, the analysis is conducted with the visiting team's campus-to-venue distance as a proxy for the travel burden facing an away fan base.

That framing allows us to turn attendance into more than a simple popularity metric and study whether high-interest programs and games still draw crowds even when travel is long, and whether those burdens are distributed differently across teams, conferences, and HBCU programs.


Data sources

The main dataset is data/raw/ncaa_raw.csv, the raw NCAA dataset the competition provided for the project.

The raw data is then processed into two committed intermediate outputs for reproducibility: data/intermediate/games_with_distance.csv and data/intermediate/team_travel_distances.csv. These files are filtered versions of the raw dataset, each used for the specific analyses described below.

The project also depends on a US state shapefile stored in data/geographical_visualization_maps/, which is loaded in visuals.py and filtered to admin == 'United States of America' before the national travel map is drawn.

Core fields used in the pipeline

The first step of our process was to strip the raw dataset down to the fields the analysis needs. Our data-prep script immediately narrows the working dataset to a smaller set of columns centered on team identity, conference, attendance, venue coordinates, campus coordinates, home/away status, neutral-site status, HBCU flag, and season record.

The specific fields used directly in the code include:

| Field | Purpose |
| --- | --- |
| TEAM_INSTITUTION_NAME | Team/school identifier. |
| TEAM_CONFERENCE | Conference grouping for comparisons. |
| ATTENDANCE | Main outcome metric used in charts and aggregations. |
| FACILITY_LATITUDE, FACILITY_LONGITUDE | Venue coordinates for travel calculations. |
| TEAM_INSTITUTION_LATITUDE, TEAM_INSTITUTION_LONGITUDE | Campus coordinates for travel calculations. |
| IS_HOME_CONTEST | Used to isolate away games. |
| IS_SITE_NEUTRAL | Used to exclude neutral-site contests. |
| IS_INSTITUTION_HBCU | Supports the HBCU vs. non-HBCU analysis. |
| SEASON_WINS, SEASON_LOSSES | Used for win-adjusted metrics and team summaries. |

Data cleaning and feature engineering

All core preprocessing lives in src/scripts/geographic_gravity.py, which acts as the analytical engine for the project. It loads the raw CSV, trims the data to the fields described above, computes geographic distance, creates derived metrics, and exports the two intermediate CSVs used by the rest of the pipeline.

Filtering pipeline

To filter the data, the script applies the following sequence:

  1. Removes neutral-site games by keeping only rows where IS_SITE_NEUTRAL is no.
  2. Drops rows with missing venue or campus coordinates, which cannot be used to calculate distance.
  3. Keeps only away-game rows by filtering IS_HOME_CONTEST to no.
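
The three steps above can be sketched in plain Python as follows. This is a minimal illustration, not the code in geographic_gravity.py: it assumes each row is a dictionary keyed by the field names listed earlier, that the flag columns hold the literal strings "yes" and "no" (the raw encoding may differ), and that a missing coordinate is represented as None (pandas would typically use NaN instead). The helper names filter_away_games, is_no, and toy_row are hypothetical.

```python
COORD_FIELDS = (
    "FACILITY_LATITUDE", "FACILITY_LONGITUDE",
    "TEAM_INSTITUTION_LATITUDE", "TEAM_INSTITUTION_LONGITUDE",
)


def is_no(value):
    """True when a flag holds 'no' (assumed encoding; compared case-insensitively)."""
    return str(value).strip().lower() == "no"


def filter_away_games(rows):
    """Apply the three filters in the documented order to an iterable of dicts."""
    kept = []
    for row in rows:
        if not is_no(row.get("IS_SITE_NEUTRAL")):
            continue  # step 1: neutral-site game
        if any(row.get(field) is None for field in COORD_FIELDS):
            continue  # step 2: missing venue or campus coordinate
        if not is_no(row.get("IS_HOME_CONTEST")):
            continue  # step 3: home game
        kept.append(row)
    return kept


def toy_row(name, neutral, home, venue_lat=39.77):
    """Build a made-up row; the coordinates are placeholders, not real venues."""
    return {
        "TEAM_INSTITUTION_NAME": name,
        "IS_SITE_NEUTRAL": neutral,
        "IS_HOME_CONTEST": home,
        "FACILITY_LATITUDE": venue_lat,
        "FACILITY_LONGITUDE": -86.16,
        "TEAM_INSTITUTION_LATITUDE": 40.0,
        "TEAM_INSTITUTION_LONGITUDE": -85.0,
    }


rows = [
    toy_row("A", "no", "no"),                  # kept: a true away game
    toy_row("B", "yes", "no"),                 # dropped: neutral site
    toy_row("C", "no", "yes"),                 # dropped: home game
    toy_row("D", "no", "no", venue_lat=None),  # dropped: missing coordinate
]
kept_names = [r["TEAM_INSTITUTION_NAME"] for r in filter_away_games(rows)]
print(kept_names)
```

Because each step only removes rows, the order does not change which rows survive; it only changes how many rows the later checks have to look at.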

This cleaning matters because travel distance is only meaningful from the visiting team's perspective, and coordinate completeness is required for the distance calculation to run cleanly.

Haversine distance

Travel distance is computed with a custom Haversine implementation written directly in Python. The function uses an Earth radius of 3,959 miles and calculates the great-circle distance between the venue and the visiting institution.

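from math import atan2, cos, radians, sin, sqrt

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    R = 3959  # Earth radius in miles
    # Convert degrees to radians before applying the formula
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * atan2(sqrt(a), sqrt(1-a))  # central angle in radians
    return R * c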
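
A quick way to sanity-check a function like this is to try cases where the answer is known in advance. The sketch below repeats the function so it runs on its own, assuming the trigonometric helpers come from Python's math module, and checks three properties: a point is zero miles from itself, a quarter of the equator equals R·π/2, and swapping the two endpoints does not change the result. The test coordinates are arbitrary.

```python
from math import atan2, cos, pi, radians, sin, sqrt


def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    R = 3959  # Earth radius in miles
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c


# A point is zero miles from itself.
same_point = haversine(39.77, -86.16, 39.77, -86.16)

# Ninety degrees of longitude along the equator is a quarter of a great circle.
quarter_equator = haversine(0, 0, 0, 90)
expected_quarter = 3959 * pi / 2

# Distance should not depend on which endpoint is listed first.
there = haversine(39.77, -86.16, 34.05, -118.24)
back = haversine(34.05, -118.24, 39.77, -86.16)

print(same_point, round(quarter_equator, 1), abs(there - back) < 1e-9)
```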

Outlier handling and derived metrics

After distances are computed, the script removes the top 1% of travel-distance values using quantile(0.99). We did this to keep extreme cases from distorting the later charts and summaries. The script then creates three especially important derived variables:

  • attendance_per_mile
  • attendance_per_win
  • distance_bucket

These features support the efficiency ranking, the win-adjusted attendance lens, and the clean five-category narrative used in the distance-bucket chart.
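
The trimming step and the two ratio metrics can be sketched as below. Several details are assumptions rather than facts from the repository: the column name distance_miles, the formulas (attendance divided by distance, and attendance divided by season wins), treating the cutoff as inclusive, and returning None when the denominator is zero. The quantile helper uses linear interpolation, which is the default behavior of quantile() in pandas.

```python
def quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def safe_ratio(numerator, denominator):
    """Divide, returning None instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else None


def trim_and_derive(games, q=0.99):
    """Drop rows above the q-th distance quantile, then add the ratio metrics."""
    cutoff = quantile([g["distance_miles"] for g in games], q)
    result = []
    for g in games:
        if g["distance_miles"] > cutoff:
            continue  # treated as an extreme trip
        row = dict(g)
        row["attendance_per_mile"] = safe_ratio(g["ATTENDANCE"], g["distance_miles"])
        row["attendance_per_win"] = safe_ratio(g["ATTENDANCE"], g["SEASON_WINS"])
        result.append(row)
    return result


# Made-up rows: 100 ordinary trips plus one extreme outlier.
games = [
    {"distance_miles": 100 + i, "ATTENDANCE": 2000, "SEASON_WINS": 20}
    for i in range(100)
]
games.append({"distance_miles": 5000, "ATTENDANCE": 2000, "SEASON_WINS": 0})

trimmed = trim_and_derive(games)
print(len(games), len(trimmed))
```

One consequence of a per-mile ratio is that very short trips produce very large values, so a ranking built on it tends to favor teams that play close to home.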

The five distance buckets that are defined are:

| Bucket | Range |
| --- | --- |
| Local | 0–150 miles |
| Regional | 150–400 miles |
| Mid-distance | 400–800 miles |
| Long | 800–1,500 miles |
| Cross-country | 1,500+ miles |
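
One way to assign these labels is a sorted list of edges plus a binary search. The sketch below is an illustration rather than the project's code, and it makes one assumption the ranges above leave open: a distance that lands exactly on an edge (for example 150 miles) goes to the higher bucket. The names BUCKET_EDGES, BUCKET_LABELS, and distance_bucket are hypothetical.

```python
from bisect import bisect_right

BUCKET_EDGES = [150, 400, 800, 1500]  # upper edges in miles
BUCKET_LABELS = ["Local", "Regional", "Mid-distance", "Long", "Cross-country"]


def distance_bucket(miles):
    """Return the bucket label; a value equal to an edge falls in the higher bucket."""
    return BUCKET_LABELS[bisect_right(BUCKET_EDGES, miles)]


for miles in (40, 150, 383, 800, 2100):
    print(miles, distance_bucket(miles))
```

With pandas, the same assignment is usually written with pandas.cut and right=False, which gives the same left-closed intervals.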

Team-level aggregation

The script groups games by TEAM_INSTITUTION_NAME and computes team summaries including average distance, maximum distance, number of games played, average attendance, conference, HBCU flag, and average wins. It then filters to teams with at least 10 games before saving team_travel_distances.csv.

The filter was applied since it reduces noise from tiny sample sizes and produces a more stable team-level ranking, though it also excludes some smaller programs from the final aggregate view.
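
A rough sketch of this aggregation in plain Python is shown below. It assumes each game row carries a distance_miles value (a hypothetical column name) and that conference and HBCU status are constant per team, so the first row's values are used. The output keys and the function name summarize_teams are illustrative, not the column names in team_travel_distances.csv.

```python
from collections import defaultdict
from statistics import mean


def summarize_teams(games, min_games=10):
    """Group game rows by team and keep teams with at least `min_games` rows."""
    by_team = defaultdict(list)
    for game in games:
        by_team[game["TEAM_INSTITUTION_NAME"]].append(game)

    summaries = []
    for team, rows in by_team.items():
        if len(rows) < min_games:
            continue  # too few games for a stable average
        distances = [r["distance_miles"] for r in rows]
        summaries.append({
            "team": team,
            "avg_distance": mean(distances),
            "max_distance": max(distances),
            "games": len(rows),
            "avg_attendance": mean(r["ATTENDANCE"] for r in rows),
            "conference": rows[0]["TEAM_CONFERENCE"],
            "is_hbcu": rows[0]["IS_INSTITUTION_HBCU"],
            "avg_wins": mean(r["SEASON_WINS"] for r in rows),
        })
    return summaries


def toy_game(team, distance):
    """Made-up row for illustration only."""
    return {
        "TEAM_INSTITUTION_NAME": team,
        "TEAM_CONFERENCE": "Example Conference",
        "IS_INSTITUTION_HBCU": "no",
        "ATTENDANCE": 1500,
        "SEASON_WINS": 18,
        "distance_miles": distance,
    }


games = [toy_game("Team A", 100 + 10 * i) for i in range(12)]
games += [toy_game("Team B", 300) for _ in range(3)]  # below the 10-game floor

team_summaries = summarize_teams(games)
print([(s["team"], s["games"], s["avg_distance"]) for s in team_summaries])
```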


Main dataset-level findings

After filtering the original dataset as outlined above, we had 19,272 away-team rows. Across those rows, the mean travel distance is about 383 miles and the median about 270 miles. The distribution is right-skewed: most trips are regional, but a smaller set of long-haul trips pulls the average upward.

The Pearson correlation (r) between travel distance and attendance is approximately 0.151, which is weakly positive rather than negative. This runs against a simple “distance lowers attendance” story and suggests instead that longer trips often coincide with games that already have larger audiences or broader visibility.
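
For reference, Pearson's r is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on made-up numbers. It is not how the project computed its figure (that was presumably a library call), and the function name pearson_r is hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up values, chosen only to show the two extremes of the scale.
distances = [50, 200, 400, 900, 1600]
rising = [2 * d + 100 for d in distances]    # perfectly linear, increasing
falling = [5000 - 3 * d for d in distances]  # perfectly linear, decreasing

r_up = pearson_r(distances, rising)
r_down = pearson_r(distances, falling)
print(round(r_up, 6), round(r_down, 6))
```

Squaring the reported value gives a sense of scale: 0.151² is about 0.023, so a straight-line fit on distance alone would account for roughly 2% of the variation in attendance.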

Most prominently, we found that average attendance rises from roughly 1,332 in local games to about 2,742 in cross-country games.

HBCU comparison

One of the most distinctive parts of the project, in my opinion, was the explicit HBCU lens. We found that HBCU teams in the filtered set travel about 343 miles on average, compared with about 386 miles for non-HBCU teams. This suggests that HBCU programs may face slightly lower travel burdens on average, which could reflect factors such as conference alignment, geographic clustering, or resource constraints.

Interpretation

Based on this analysis, we concluded that games involving longer travel burdens tend to be associated with larger attendance environments. This makes sense, since marquee matchups, larger venues, and geographically dispersed conferences all shape both distance and turnout at the same time.


Visualization approach

All visuals are generated in src/scripts/visuals.py using Matplotlib, GeoPandas, Shapely, NumPy, and a carefully tuned set of global style parameters.

Candidate figures

The repository's src/plots/ directory is generated by visuals.py.

| File | Chart type | Purpose |
| --- | --- | --- |
| viz0_travel_map_dark.png | Flow map | Shows national travel patterns and top-attendance venues. |
| viz1_top15.png | Horizontal bar chart | Ranks the 15 teams with the highest average travel distance. |
| viz2_wins_corr.png | Scatter plot | Compares average wins to average travel distance, with attendance as bubble size. |
| viz3_conference.png | Vertical bar chart | Compares top conferences by average travel distance. |
| viz4_distance_distribution.png | Histogram | Depicts the distribution of travel distances for all away games in the dataset. |
| viz5_distance_bucket_attendance.png | Bar chart | Shows the monotonic increase in attendance across distance buckets. |
| viz6_distance_vs_attendance.png | Scatter plot | Shows the weak positive correlation between distance and attendance. |
| viz7_hbcu_travel.png | Box-and-whisker plot | Compares HBCU and non-HBCU travel distributions. |
| viz8_attendance_efficiency.png | Horizontal bar chart | Highlights teams with the highest attendance per mile traveled. |

Technical architecture

The repository follows a clean three-stage pipeline: compute, visualize, then compose.

data/raw/ncaa_raw.csv
        │
        ▼
src/scripts/geographic_gravity.py
        │
        ├── data/intermediate/games_with_distance.csv
        └── data/intermediate/team_travel_distances.csv
                │
                ▼
        src/scripts/visuals.py
                │
                └── src/plots/viz*.png
                        │
                        ▼
                src/scripts/final.py
                        │
                        └── data/outputs/combined_fan_travel_story.png

Directory structure

.
├── README.md
├── requirements.txt
├── dictionary                       # plain-language data dictionary
├── data
│   ├── raw/ncaa_raw.csv
│   ├── intermediate
│   │   ├── games_with_distance.csv
│   │   └── team_travel_distances.csv
│   ├── geographical_visualization_maps/  # US states shapefile (basemap)
│   └── outputs
│       ├── combined_fan_travel_story.png
│       └── kz_mk_iu_submission.pdf
└── src
    ├── notebooks
    │   ├── 01_exploration.ipynb
    │   └── ncaa_attendance_analysis.png
    ├── plots                         # generated figures (saved by visuals.py)
    └── scripts
        ├── geographic_gravity.py
        ├── visuals.py
        └── final.py

Environment and dependencies

I did not record the exact Python version we used, but 3.10+ should work.

| Library | Version | Role |
| --- | --- | --- |
| pandas | ~=2.3.3 | Data manipulation and aggregation. |
| geopandas | ~=1.1.3 | Shapefile loading and geospatial plotting. |
| matplotlib | ~=3.10.8 | Figure generation. |
| numpy | ~=2.2.6 | Numerical computation, interpolation, and scaling. |
| shapely | ~=2.1.2 | Geometry objects such as LineString. |
| seaborn | ~=0.13.2 | Styling baseline. |
| Pillow | ~=10.4.0 | Final image composition. |

Path management

One subtle but important implementation detail is that the scripts anchor paths using Path.cwd().parent.parent, which assumes they are run from src/scripts/. In the README we explicitly instruct users to run the pipeline from that directory, so the repo is internally consistent, but the scripts are not path-agnostic from arbitrary working directories.
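
For comparison, a common way to make this kind of path logic independent of the working directory is to anchor on the script's own location instead. The sketch below is a possible alternative, not what the repository does. The function names are hypothetical, and the /repo paths are placeholders that stand in for wherever the project is checked out.

```python
from pathlib import Path


def repo_root_from_cwd(cwd):
    """Mirror of the current approach: two levels above the working directory."""
    return Path(cwd).parent.parent


def repo_root_from_script(script_path):
    """Alternative: two levels above the directory that contains the script."""
    return Path(script_path).resolve().parents[2]


# Inside a real script this would be called as repo_root_from_script(__file__).
root_from_script = repo_root_from_script("/repo/src/scripts/visuals.py")

# The cwd-based version is right only when run from src/scripts/.
root_ok = repo_root_from_cwd("/repo/src/scripts")
root_wrong = repo_root_from_cwd("/repo")  # launched from the repo root instead

print(root_from_script, root_ok, root_wrong)
```

The script-anchored version gives the same answer regardless of where the command is launched, at the cost of depending on the scripts staying exactly two directories below the repository root.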


Challenges and tradeoffs

The biggest methodological challenge we faced is the proxy itself. The project measures team-campus-to-venue distance, not where actual ticket-buying fans live, so the analysis should be interpreted as a travel-burden proxy rather than a direct observation of fan movement.

Attendance is also confounded by factors the current analysis does not explicitly control for, including venue capacity, promotions, rivalry intensity, team quality, and tournament context. That means the weak positive distance-attendance relationship is most plausibly capturing some mix of game significance and conference structure rather than a pure behavioral response to distance.

There are also some engineering tradeoffs. The map samples only 100 rows for readability, final.py attempts macOS-specific font paths before falling back to a default font, and the path logic assumes execution from src/scripts/, which is convenient but not fully portable.


Improvements with more time

Several natural next steps would make this project even stronger.

  • Add a multivariate regression model controlling for conference, venue capacity, and team strength to separate the effect of distance from marquee-game effects.
  • Replace the static output with an interactive map that supports hover tooltips and filtering by team or conference (although this could not be submitted, since the submission must be a PDF).
  • Add season-by-season analysis to see whether the travel-attendance relationship changes over time instead of aggregating everything into one combined view.
  • Improve reproducibility with a conda environment spec and more portable font handling in final.py.
  • Use a more appropriate national map projection, such as Albers Equal Area, rather than plotting in raw latitude/longitude space.

Reproduction

The README also has these instructions:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

cd src/scripts
python geographic_gravity.py
python visuals.py
python final.py

That sequence generates the intermediate CSVs, candidate PNG charts, and final combined story graphic exactly in the locations documented by the repository.
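
After a run, a small check like the one below can confirm that the documented outputs exist. It is a hypothetical helper, not part of the repository. The relative paths are taken from the pipeline diagram above, while the function name missing_outputs and the temporary-directory demo are illustrative.

```python
import tempfile
from pathlib import Path

# Output locations as documented in the pipeline diagram.
EXPECTED_OUTPUTS = (
    "data/intermediate/games_with_distance.csv",
    "data/intermediate/team_travel_distances.csv",
    "data/outputs/combined_fan_travel_story.png",
)


def missing_outputs(root):
    """Return the expected output paths that are not present under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]


# Demo against a throwaway directory that contains only the first file.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / EXPECTED_OUTPUTS[0]
    first.parent.mkdir(parents=True)
    first.touch()
    still_missing = missing_outputs(tmp)

print(still_missing)
```

From src/scripts/, the same check against the real repository would be missing_outputs(Path.cwd().parent.parent), and an empty list means all three files were found.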


Highlights

  • Runner-up / finalist project for the IU Indy DataViz Games 2026.
  • Reproducible pipeline from raw NCAA records to intermediate tables, candidate charts, and a final competition graphic.
  • Custom Haversine distance engine implemented directly in Python.
  • Geospatial flow map built with custom curved routes, log-scaled encodings, and venue annotations.
  • Cohesive dark-theme design system shared across the full figure set.

Final reflection

What makes this repository stand out is not just the polish of the final graphic, but the balance between technical execution and analytical restraint. The project makes a bold visual claim, but it also acknowledges where the proxy is imperfect and where the correlation should not be overinterpreted.

Also, considering we made this in the span of only a few weeks when we were pretty busy, I'm pretty happy with the results. It really helped me practice my skills for analyzing and visualizing data and working with a teammate. For me, the topic was pretty unfamiliar but interesting, and I enjoyed the process of going through the data, figuring out how to clean it, and then trying to find the best way to visualize the story we wanted to tell with it.


Still have questions?

I am open to ANY questions about this project! Shoot me an email or a message on my LinkedIn, and I will be happy to chat about the data, the code, the design decisions, or anything else related to this project.


Parts of this project were developed in collaboration with generative AI.