2  Data preprocessing and More Visualizations

2.1 Python: Preprocessing and Cleaning

2.1.1 Library Imports

import pandas as pd
import os
from siuba import _, group_by, summarize, filter, select, mutate, arrange, count
import matplotlib.pyplot as plt

2.1.2 Importing Data

youth = pd.read_csv('../../dssg-2025-mentor-canada/Data/Data_2020-Youth-Survey.csv')

2.1.3 Dropping Columns

# drop columns up to AO
youth = youth.iloc[:, 41:]

# drop logic + validation columns 
youth = youth.drop(columns=['QAge_Validation', 'Logic_QS1_6_Qtext', 'Logic_Qtext', 'QS1_8_Validation', 'Logic_QS1_26_Ask', 'QS1_29_Validation', 
                              'QS1_30_MValidatio', 'QS1_30_SMValidati', 'QS1_31_BWValidati', 'QS1_32_WValidatio', 'QS2_10_Validation', 'Logic_QS2_14_Ask','Logic_MENTORID1_1_1',
                              'Logic_MENTORID1_2_2', 'Logic_MENTORID1_3_3', 'Logic_AP_QS2_23', 'Logic_QS2_27_Ask', 'Logic_QS2_34_Valid', 'Logic_QS2_35_Ask', 'Logic_QS2_35_Mask1_1_1',
                              'Logic_QS2_35_Mask1_2_2', 'Logic_QS2_35_Mask1_3_3', 'Logic_QS2_35_Mask1_4_4', 'Logic_QS2_35_Mask1_5_5', 'Logic_QS2_35_Mask1_6_6', 'Logic_QS2_35_Mask1_7_7',
                              'Logic_QS2_35_Mask1_8_8', 'Logic_QS2_35_Mask1_9_9', 'Logic_QS2_35_Mask1_10_10', 'QS4_14_Validatio', 'QS4_15_Validatio', 'QS4_19_Validatio', 'QS4_23_Validatio'
    
                              ])

# drop parent education columns (keeping first two parent education columns)
youth = youth.drop(youth.columns[102:120], axis=1)

2.1.4 Indicate Text Columns

text_columns = ['QS1_6_Other', 'QS1_9_Other', 'QS1_11_Other', 'QS1_16_Other', 'QS1_18_Other_1', 'QS1_18_Other_2',
                 'QS1_22_Other', 'QS1_26_Other', 'QS1_27_Other', 'QS2_13_Other', 'QS2_14_MENTORID', 'QS2_14_MENTORID_2', 'QS2_14_MENTORID_3', 'QS2_18_LOCATION_1_O', 'QS2_15_RELATIONSHIP2', 'QS2_17_TYPE_2_Other', 'QS2_18_LOCATION_2_O', 'QS2_15_RELATIONSHIP3',
                'QS2_17_TYPE_3_Other', 'QS2_18_LOCATION_3_O', 'QS2_25_YOUTHINIT2', 'QS2_27_MENTORPROGRA2', 'QS2_33_TRANSITIONS_Ot', 'QS2_34_SUPPORTS_Ot', 'QS2_38_NETGATIVEMENTO', 'QS3_2_TRANSITIONWITHOUTMEN', 'QS3_3_TRANSITIONSWITHOUTMENTO', 'QS4_4_Other', 'QS4_5_SATEDU_Other']

2.1.5 Encode Columns

2.1.5.1 Make column names interpretable for Q4

youth.rename(columns={
    'QS1_6_ETHNOCULTURAL1_1_1': 'Race_SouthAsian',
    'QS1_6_ETHNOCULTURAL1_2_2': 'Race_Chinese',
    'QS1_6_ETHNOCULTURAL1_3_3': 'Race_Black',
    'QS1_6_ETHNOCULTURAL1_4_4': 'Race_Filipino',
    'QS1_6_ETHNOCULTURAL1_5_5': 'Race_LatinAmerica',
    'QS1_6_ETHNOCULTURAL1_6_6': 'Race_Arab',
    'QS1_6_ETHNOCULTURAL1_7_7': 'Race_SouthEastAsian',
    'QS1_6_ETHNOCULTURAL1_8_8': 'Race_WestAsian',
    'QS1_6_ETHNOCULTURAL1_9_9': 'Race_Korean',
    'QS1_6_ETHNOCULTURAL1_10_10': 'Race_Japanese',
    'QS1_6_ETHNOCULTURAL1_11_11': 'Race_White',
    'QS1_6_ETHNOCULTURAL1_12_12': 'Race_Other',
    'QS1_6_ETHNOCULTURAL1_13_13':'Race_Unsure',
    'QS1_6_ETHNOCULTURAL1_14_14': 'Race_PreferNotToSay'
}, inplace=True)

3 Viualizations (continued)

3.1 Distribution of Age variable

3.2 Age DistributionEstimated yearly income distribution (15_yearly_income)

  • Yearly Income Distribution (Presented also in last week’s notebook)

  • Median as the central tendency is less susceptible to extreme outliers.

3.3 Income output and Transgender Identity

3.4 Income output and Transgender IdentityEarly mentor experience and mental health rating outcome

3.5 Early mentor experience and mental health rating outcome

3.6 Teen mentor experience and mental health rating outcome

Teen mentor experience and mental health rating outcome
  • Similar trend compared to those who had early mentor experience.
  • We can also tap into whether those individuals who had EME also tend to continue their mentorship into TME.