Tech Industry Salary Demographics Study
A statistical analysis of compensation trends across 62,642 tech industry employees, investigating how demographic factors such as education, race, gender, and experience relate to differences in base salary. Built a linear regression model to predict total yearly compensation from behavioral and demographic variables.
research question
“How do demographic factors such as education, race, gender, and level of experience affect someone's base salary in the tech industry?”
my role
Team of five (Nick Krassy, Alec Lundy, Christian Chapman, Ezra Moen, Nash Bachman). Contributed to statistical modeling, data cleaning strategy, and correlation analysis, turning raw demographic data into usable model inputs and presenting findings to both technical and non-technical teammates.
dataset overview
Total Records: 62,642
Key Variables: 6
Model Features: 3
Data sourced from tech industry salary records covering employees across companies of varying sizes. Key variables include total yearly compensation, years of experience, years at current company, gender, race, and education level. The dataset contained substantial missing values requiring targeted handling prior to modeling.
missing values by variable
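A per-variable missingness summary like the one this chart showed could be computed along the following lines. This is a minimal sketch on toy records, not the study's pipeline: the function name missing_share, the column names, and the choice to treat None, empty strings, and "NA" as missing are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def missing_share(rows: List[Dict[str, Optional[str]]]) -> Dict[str, float]:
    """Return the fraction of missing values per column.

    Assumption: a value is missing when it is None, blank, or the string "NA".
    """
    if not rows:
        return {}
    columns = list(rows[0].keys())
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or str(value).strip() in ("", "NA"):
                counts[col] += 1
    return {col: counts[col] / len(rows) for col in columns}


# Toy records for illustration only; these are not the study's data.
records = [
    {"gender": "Male", "race": None, "education": "Bachelor's Degree"},
    {"gender": "Female", "race": "Asian", "education": ""},
    {"gender": "NA", "race": None, "education": "Master's Degree"},
    {"gender": "Male", "race": None, "education": "PhD"},
]
shares = missing_share(records)
for column, share in shares.items():
    print(f"{column}: {share:.0%} missing")
```

Ranking variables by this share is one way to decide which ones are too sparse to model reliably.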
outlier analysis
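Z-score filtering, which the tuned model relies on, can be sketched as follows. The cutoff of 3 standard deviations is a common convention and an assumption here, since the study does not state its threshold. The function name and toy salary values are also hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def zscore_filter(values: List[float], threshold: float = 3.0) -> List[float]:
    """Keep values whose absolute z-score is at most `threshold`.

    Assumption: threshold=3.0 is a conventional default, not the study's setting.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return list(values)  # all values identical, nothing to filter
    return [v for v in values if abs((v - mu) / sigma) <= threshold]


# Toy compensation values in thousands of dollars, with one extreme record.
salaries = [float(s) for s in range(100, 200, 5)] + [2000.0]
kept = zscore_filter(salaries)
print(f"kept {len(kept)} of {len(salaries)} records")
```

With very small samples no point can reach a z-score of 3, so this filter only has an effect on reasonably sized data.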
correlation matrix
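A correlation matrix of this kind holds the Pearson coefficient for every pair of numeric variables. The sketch below shows the calculation on toy data. The column names and values are hypothetical and do not reproduce the study's figures.

```python
from math import sqrt
from typing import Dict, List


def pearson_r(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise Pearson correlations for every pair of named columns."""
    return {a: {b: pearson_r(columns[a], columns[b]) for b in columns} for a in columns}


# Toy values for illustration only (hypothetical column names).
toy = {
    "years_experience": [1.0, 3.0, 5.0, 8.0, 12.0],
    "years_at_company": [1.0, 2.0, 1.0, 4.0, 3.0],
    "total_comp": [110.0, 150.0, 140.0, 210.0, 260.0],
}
matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```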
model selection
Linear Regression
Selected: Targets a continuous quantitative variable. Preliminary analysis confirmed linear trends between experience and compensation.
Logistic Regression
Rejected: Designed for categorical outcomes. Not appropriate for predicting a continuous salary value.
Decision Tree
Rejected: Most commonly used for categorical outcomes. Regression trees can handle a continuous target, but they are prone to overfitting on high-variance salary data.
model performance
Baseline: Linear Regression
explains 18.7% of variance
no autocorrelation
mean squared error
Tuned: Ridge + Z-Score Filtering
L2 regularization
outlier removal
marginal improvement
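To illustrate what the L2 penalty does, here is a minimal single-feature ridge fit with an unpenalized intercept. The study's model used three features. This one-feature closed form is a simplification chosen for readability, and the function name, alpha values, and toy data are assumptions.

```python
from typing import List, Tuple


def fit_ridge_1d(x: List[float], y: List[float], alpha: float = 1.0) -> Tuple[float, float]:
    """Fit y ~ intercept + slope * x, penalizing slope**2 by `alpha`.

    Minimizes sum((y - intercept - slope * x)**2) + alpha * slope**2.
    With alpha = 0 this reduces to ordinary least squares.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / (sxx + alpha)  # the penalty shrinks the slope toward zero
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
ols_intercept, ols_slope = fit_ridge_1d(xs, ys, alpha=0.0)
ridge_intercept, ridge_slope = fit_ridge_1d(xs, ys, alpha=5.0)
print(f"OLS slope {ols_slope:.2f}, ridge slope {ridge_slope:.2f}")
```

Shrinking coefficients mainly helps when features are numerous or collinear. With only a few features, a marginal gain like the one reported here is unsurprising.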
Variance Explained (R²)
18.7%
The model explains about 19% of the variation in compensation. The remaining 81% likely reflects factors not captured in the dataset, such as company size, location, stock grants, and role specifics.
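For reference, R² compares the model's squared error with that of always predicting the mean. The sketch below shows the standard formula on toy numbers. The function name and values are illustrative and are not the study's code or data.

```python
from typing import List


def r_squared(y_true: List[float], y_pred: List[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty series of equal length")
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the target is constant")
    return 1 - ss_res / ss_tot


# Toy compensation values in thousands of dollars.
actual = [100.0, 120.0, 140.0, 160.0]
perfect = r_squared(actual, actual)        # predictions match exactly
mean_only = r_squared(actual, [130.0] * 4)  # always predict the mean
print(perfect, mean_only)
```

In a least-squares fit with a single predictor, R² equals the squared Pearson correlation, so a correlation of 0.37 on its own would correspond to an R² of roughly 0.14.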
key findings
- Years of experience is the strongest predictor of compensation (r = 0.37), consistently outweighing tenure at a single company.
- Demographic variables (race, gender, education) could not be reliably included due to high rates of missing data (up to 64% for race).
- Ridge regression and Z-score filtering produced only marginal MSE improvement over the baseline, suggesting the data itself is the primary constraint on accuracy.
- A Durbin-Watson statistic of 2.008 indicates no first-order autocorrelation in the residuals, supporting the independence assumption of the linear model.
- The model is best used as a directional tool for salary benchmarking, not precise prediction.
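The Durbin-Watson statistic cited above is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. A minimal sketch follows. The function name and residual series are toy values for illustration, not the study's residuals.

```python
from typing import List


def durbin_watson(residuals: List[float]) -> float:
    """Durbin-Watson statistic for a sequence of regression residuals.

    Values near 2 suggest no first-order autocorrelation; values toward 0
    suggest positive and values toward 4 suggest negative autocorrelation.
    """
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    if denominator == 0:
        raise ValueError("statistic is undefined when all residuals are zero")
    return numerator / denominator


# Toy residuals for illustration only.
alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])  # strong negative autocorrelation
constant = durbin_watson([1.0, 1.0, 1.0, 1.0])       # strong positive autocorrelation
print(alternating, constant)
```

The statistic depends on the order of the rows, so it is most informative when the records have a meaningful sequence such as time.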
limitations
Data Quality
Dataset contains significant missingness in key demographic variables, limiting the scope of the analysis.
Model Accuracy
An R² of 0.187 indicates the model captures only a small fraction (about 19%) of compensation variance. Further feature engineering is needed.
Missing Context
The dataset lacks location, company size, and equity/bonus structures, all of which are major drivers of real-world compensation.