DSBA 2302 · Python · Linear Regression · Statistical Analysis · Tableau

Tech Industry Salary Demographics Study

A statistical analysis of compensation trends across 62,642 tech industry employees, investigating how demographic factors like education, race, gender, and experience drive differences in base salary. Built a linear regression model to predict total yearly compensation from behavioral and demographic variables.

research question

“How do demographic factors such as education, race, gender, and level of experience affect someone's base salary in the tech industry?”

my role

Team of five (Nick Krassy, Alec Lundy, Christian Chapman, Ezra Moen, Nash Bachman). Contributed to statistical modeling, data cleaning strategy, and correlation analysis, turning raw demographic data into usable model inputs and presenting findings to both technical and non-technical teammates.

dataset overview

Total Records: 62,642

Key Variables: 6

Model Features: 3

Data sourced from tech industry salary records covering employees across companies of varying sizes. Key variables include total yearly compensation, years of experience, years at current company, gender, race, and education level. The dataset contained substantial missing values requiring targeted handling prior to modeling.

missing values by variable

Variable | Missing | Count
Race | 64.2% | 40,215
Education | 51.5% | 32,272
Gender | 31.2% | 19,540
n = 62,642 total records

Race had the highest missingness at 64.2%, limiting its use as a reliable predictor. Gender (31.2%) and Education (51.5%) were also heavily incomplete. All three were excluded from the final regression model due to insufficient data density.
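
The missingness figures above can be reproduced with a short check. This is a minimal sketch, not the project's actual code: `missing_share` and the toy records are hypothetical, and only the counts in the last few lines come from the table above.

```python
def missing_share(records, column):
    """Fraction of records where `column` is None or absent."""
    missing = sum(1 for r in records if r.get(column) is None)
    return missing / len(records)

# Toy records, made up for illustration (not taken from the study's dataset)
toy = [
    {"race": None, "education": "Bachelor's", "gender": "Female"},
    {"race": None, "education": None, "gender": "Male"},
    {"race": "Asian", "education": None, "gender": None},
    {"race": None, "education": "Master's", "gender": "Male"},
]
shares = {col: missing_share(toy, col) for col in ("race", "education", "gender")}

# Sanity check: the reported percentages follow from the reported counts
n = 62_642
reported_counts = {"Race": 40_215, "Education": 32_272, "Gender": 19_540}
reported_pct = {k: round(100 * v / n, 1) for k, v in reported_counts.items()}
print(reported_pct)
```

Dividing each reported count by n = 62,642 gives back the percentages in the table, so the counts and shares are internally consistent.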

outlier analysis

[Figure: box plots of Yrs at Company (0–40 yrs), Yrs of Experience (0–40 yrs), and Compensation (0–400K, with outliers up to $5M+)]
Box = interquartile range (Q1–Q3)
Bold line = median
Circle = outlier
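
For readers unfamiliar with how a box plot decides which points become circles, here is a small sketch of the usual rule. It assumes the common 1.5 × IQR whisker convention, which the charts above may or may not have used; `iqr_outliers` and the toy values are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return (outliers, (low_fence, high_fence)) using the k * IQR rule."""
    # method="inclusive" treats the data as the full population of interest
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < low or v > high]
    return flagged, (low, high)

# Toy compensation values in $K, made up for illustration
toy_comp_k = [90, 110, 120, 130, 150, 160, 180, 5000]
outliers, fences = iqr_outliers(toy_comp_k)
print(outliers, fences)
```

A single extreme value like the 5000 here sits far outside the upper fence, which is the same pattern the compensation panel shows with its long tail of high earners.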

correlation matrix

 | Compensation | Yrs Experience | Yrs at Company
Compensation | 1.00 | 0.37 (moderate) | 0.11 (weak)
Yrs Experience | 0.37 (moderate) | 1.00 | 0.44 (moderate)
Yrs at Company | 0.11 (weak) | 0.44 (moderate) | 1.00
Scale: 0.0 = no correlation, 1.0 = perfect correlation

Years of experience showed the strongest correlation with compensation (r = 0.37), making it the primary predictor. Years at company had a weaker direct relationship with pay (r = 0.11) but correlated moderately with experience (r = 0.44), which indicates some collinearity between the two predictors.
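
One way to put a number on that collinearity is the variance inflation factor (VIF). The study does not report a VIF; the sketch below is an illustration that uses only the reported r = 0.44 and the fact that, with exactly two predictors, the R² of one predictor regressed on the other equals r². The function name is hypothetical.

```python
def vif_two_predictors(r):
    """VIF when a model has exactly two predictors with correlation r."""
    return 1.0 / (1.0 - r ** 2)

r_exp_tenure = 0.44  # reported correlation: years of experience vs. years at company
shared_variance = r_exp_tenure ** 2   # share of variance the two predictors have in common
vif = vif_two_predictors(r_exp_tenure)
print(f"shared variance = {shared_variance:.3f}, VIF = {vif:.2f}")
```

For r = 0.44 the VIF comes out at roughly 1.24, well under the informal cutoffs of 5 or 10 that are often quoted, so under this assumption the overlap between the two predictors is noticeable but mild.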

model selection

Linear Regression

Selected

Suited to a continuous quantitative target. Preliminary analysis showed a roughly linear trend between experience and compensation.

Logistic Regression

Rejected

Designed for categorical outcomes. Not appropriate for predicting a continuous salary value.

Decision Tree

Rejected

Most often used for classification, although regression trees do exist. Prone to overfitting on high-variance salary data.

model performance

Baseline: Linear Regression

R²: 0.187 (explains 18.7% of variance)

Durbin-Watson: 2.008 (no autocorrelation)

MSE: 6.715 × 10⁹ (mean squared error)
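
To make the three baseline metrics concrete, the sketch below computes R², MSE, and the Durbin-Watson statistic for an ordinary least squares fit. It is an illustration on synthetic data, not the project's pipeline: `ols_metrics`, the generated columns, and every coefficient in the data generator are assumptions. The last lines convert the reported MSE to an RMSE, assuming compensation is measured in dollars.

```python
import numpy as np

def ols_metrics(X, y):
    """Fit OLS with an intercept; return (r2, mse, durbin_watson), all in-sample."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])      # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    mse = ss_res / len(y)
    # Durbin-Watson: squared successive residual differences over squared residuals
    dw = float((np.diff(resid) ** 2).sum() / ss_res)
    return r2, mse, dw

# Synthetic stand-in data (made up; not the study's records)
rng = np.random.default_rng(0)
n = 2000
experience = rng.uniform(0, 30, n)
at_company = experience * rng.uniform(0, 1, n)
comp = 100_000 + 4_000 * experience + rng.normal(0, 80_000, n)

r2, mse, dw = ols_metrics(np.column_stack([experience, at_company]), comp)
print(f"R2={r2:.3f}  MSE={mse:.3e}  DW={dw:.3f}")

# Reported baseline MSE expressed as a typical error size
reported_rmse = 6.715e9 ** 0.5
print(f"reported RMSE is about {reported_rmse:,.0f}")
```

A Durbin-Watson value near 2 means successive residuals are not correlated with each other. Taking the square root of the reported MSE gives an RMSE of roughly 82,000, which is easier to interpret than the squared figure if compensation is indeed in dollars.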

Tuned: Ridge + Z-Score Filtering

Technique: Ridge regression (L2 regularization)

Filter: Z-score (outlier removal)

MSE: 6.714 × 10⁹ (marginal improvement)
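
The tuning step can be sketched as a z-score filter followed by a ridge fit. This is a minimal illustration under stated assumptions, not the project's code: the |z| ≤ 3 cutoff, the choice to filter on the target, `alpha=1.0`, the helper names, and the synthetic data are all hypothetical.

```python
import numpy as np

def zscore_filter(X, y, threshold=3.0):
    """Keep rows whose target z-score lies within +/- threshold (assumed cutoff)."""
    z = (y - y.mean()) / y.std()
    keep = np.abs(z) <= threshold
    return X[keep], y[keep]

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge on standardized features; the intercept is not penalized."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    Xs = (X - mu) / sigma
    y_mean = y.mean()
    p = Xs.shape[1]
    coef_s = np.linalg.solve(Xs.T @ Xs + alpha * np.eye(p), Xs.T @ (y - y_mean))
    coef = coef_s / sigma                 # back to original feature units
    intercept = y_mean - mu @ coef
    return intercept, coef

# Synthetic stand-in data with a few injected extreme salaries (made up)
rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([rng.uniform(0, 30, n), rng.uniform(0, 15, n)])
y = 100_000 + 4_000 * X[:, 0] + rng.normal(0, 50_000, n)
y[:5] += 3_000_000

X_f, y_f = zscore_filter(X, y)
intercept, coef = ridge_fit(X_f, y_f, alpha=1.0)
pred = intercept + X_f @ coef
mse_filtered = float(np.mean((y_f - pred) ** 2))
print(len(y) - len(y_f), "rows removed; MSE on filtered data:", f"{mse_filtered:.3e}")
```

With only a couple of predictors and many rows, a small L2 penalty barely moves the coefficients, which would be consistent with the marginal MSE change reported above. Note that an MSE computed after dropping outliers is measured on a different set of rows than the baseline, so the two numbers are only directly comparable if both are evaluated on the same data.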

Variance Explained (R²)

18.7%

The model explains ~19% of the variation in compensation. The remaining ~81% is left unexplained and likely reflects factors the model does not include, such as company size, location, stock grants, and role specifics.

key findings

limitations

Data Quality

Dataset contains significant missingness in key demographic variables, limiting the scope of the analysis.

Model Accuracy

An R² of 0.187 means the model captures only about 19% of compensation variance. Further feature engineering is needed.

Missing Context

Location, company size, and equity/bonus structures are missing, all of which are major drivers of real-world compensation.
