Student Social Media Addiction¶
About Dataset¶
Overview¶
The Student Social Media & Relationships dataset contains anonymized records of students’ social‐media behaviors and related life outcomes. It spans multiple countries and academic levels, focusing on key dimensions such as usage intensity, platform preferences, and relationship dynamics. Each row represents one student’s survey response, offering a cross‐sectional snapshot suitable for statistical analysis and machine‐learning applications.
Scope & Coverage¶
- Population: Students aged 16–25 enrolled in high school, undergraduate, or graduate programs.
- Geography: Multi‐country coverage (e.g., Bangladesh, India, USA, UK, Canada, Australia, Germany, Brazil, Japan, South Korea).
- Timeframe: Data collected via a one‐time online survey administered in Q1 2025.
- Volume: Configurable sample sizes (e.g., 100, 500, 1,000 records) based on research needs.
Data Collection & Methodology¶
- Survey Design: Questions adapted from validated scales on social‐media addiction (e.g., Bergen Social Media Addiction Scale) and relationship conflict indices.
- Recruitment: Participants recruited through university mailing lists and social‐media platforms, ensuring diversity in academic level and country.
- Data Quality Controls:
- Validation: Mandatory fields and range checks (e.g., usage hours between 0–24).
- De‐duplication: Removal of duplicate entries via unique Student_ID checks.
- Anonymization: No personally identifiable information collected.
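The range-check and de-duplication controls above can be sketched in pandas. This is a minimal illustration with hypothetical toy records, not the dataset's actual cleaning script; only the column names come from the dataset.

```python
import pandas as pd

# Toy records illustrating the documented checks (hypothetical values).
df = pd.DataFrame({
    "Student_ID": [1, 2, 2, 3],
    "Avg_Daily_Usage_Hours": [5.2, 2.1, 2.1, 30.0],  # 30.0 violates the 0-24 range
})

# Range check: usage hours must fall within 0-24.
valid_range = df["Avg_Daily_Usage_Hours"].between(0, 24)

# De-duplication: keep the first row per unique Student_ID.
clean = df[valid_range].drop_duplicates(subset="Student_ID", keep="first")

print(clean["Student_ID"].tolist())  # [1, 2]
```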
Key Variables¶
Variable | Type | Description |
---|---|---|
Student_ID | Integer | Unique respondent identifier |
Age | Integer | Age in years |
Gender | Categorical | “Male” or “Female” |
Academic_Level | Categorical | High School / Undergraduate / Graduate |
Country | Categorical | Country of residence |
Avg_Daily_Usage_Hours | Float | Average hours per day on social media |
Most_Used_Platform | Categorical | Instagram, Facebook, TikTok, etc. |
Affects_Academic_Performance | Boolean | Self‐reported impact on academics (Yes/No) |
Sleep_Hours_Per_Night | Float | Average nightly sleep hours |
Mental_Health_Score | Integer | Self‐rated mental health (1 = poor to 10 = excellent) |
Relationship_Status | Categorical | Single / In Relationship / Complicated |
Conflicts_Over_Social_Media | Integer | Number of relationship conflicts due to social media |
Addicted_Score | Integer | Social Media Addiction Score (1 = low to 10 = high) |
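The documented types can be verified at load time. A minimal sketch, assuming the documented variable names; the inline one-row frame stands in for the CSV:

```python
import pandas as pd

# Expected dtypes for a subset of the documented variables.
expected = {
    "Student_ID": "int64",
    "Avg_Daily_Usage_Hours": "float64",
    "Mental_Health_Score": "int64",
    "Gender": "object",
}

# Inline stand-in for one survey row (hypothetical values).
df = pd.DataFrame({
    "Student_ID": [1],
    "Avg_Daily_Usage_Hours": [5.2],
    "Mental_Health_Score": [6],
    "Gender": ["Female"],
})

# Collect any column whose actual dtype differs from the documented one.
mismatches = {c: str(df[c].dtype) for c, want in expected.items()
              if str(df[c].dtype) != want}
print(mismatches)  # {} when every column matches its documented type
```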
Potential Analyses¶
- Correlation Studies: Examine associations between daily usage hours and mental health score or sleep hours.
- Predictive Modeling: Build classifiers to predict relationship conflicts based on usage patterns and platform type.
- Clustering: Identify user segments (e.g., “high‐usage high‐stress” vs. “moderate‐usage balanced”) across countries.
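The correlation study suggested above can use Spearman's rank correlation, which captures monotone (not just linear) association. A small sketch on synthetic usage/sleep pairs; the values are illustrative only, not drawn from the dataset:

```python
import pandas as pd

# Synthetic usage/sleep pairs with a monotone decreasing trend (illustrative only).
df = pd.DataFrame({
    "Avg_Daily_Usage_Hours": [1.5, 3.0, 4.5, 6.0, 8.5],
    "Sleep_Hours_Per_Night": [9.0, 8.0, 7.0, 6.0, 4.5],
})

# Spearman correlates the ranks of the two variables.
rho = df["Avg_Daily_Usage_Hours"].corr(df["Sleep_Hours_Per_Night"], method="spearman")
print(round(rho, 2))  # -1.0 for a perfectly monotone decreasing relationship
```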
Limitations¶
- Self‐Report Bias: All measures are self-reported and may be subject to social‐desirability effects.
- Cross‐Sectional Design: One‐time survey prevents causal inference.
- Sampling Variability: Recruitment via online channels may underrepresent students with limited internet access.
Data Loading¶
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import joblib
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
df = pd.read_csv('../data/raw/Students Social Media Addiction.csv')
df.head(5)
 | Student_ID | Age | Gender | Academic_Level | Country | Avg_Daily_Usage_Hours | Most_Used_Platform | Affects_Academic_Performance | Sleep_Hours_Per_Night | Mental_Health_Score | Relationship_Status | Conflicts_Over_Social_Media | Addicted_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 19 | Female | Undergraduate | Bangladesh | 5.2 | Instagram | Yes | 6.5 | 6 | In Relationship | 3 | 8 |
1 | 2 | 22 | Male | Graduate | India | 2.1 | Twitter | No | 7.5 | 8 | Single | 0 | 3 |
2 | 3 | 20 | Female | Undergraduate | USA | 6.0 | TikTok | Yes | 5.0 | 5 | Complicated | 4 | 9 |
3 | 4 | 18 | Male | High School | UK | 3.0 | YouTube | No | 7.0 | 7 | Single | 1 | 4 |
4 | 5 | 21 | Male | Graduate | Canada | 4.5 | Facebook | Yes | 6.0 | 6 | In Relationship | 2 | 7 |
df.columns
Index(['Student_ID', 'Age', 'Gender', 'Academic_Level', 'Country', 'Avg_Daily_Usage_Hours', 'Most_Used_Platform', 'Affects_Academic_Performance', 'Sleep_Hours_Per_Night', 'Mental_Health_Score', 'Relationship_Status', 'Conflicts_Over_Social_Media', 'Addicted_Score'], dtype='object')
num_cols = ['Student_ID', 'Age', 'Avg_Daily_Usage_Hours', 'Sleep_Hours_Per_Night', 'Mental_Health_Score',
'Conflicts_Over_Social_Media', 'Addicted_Score']
cat_cols = ['Gender', 'Academic_Level', 'Country', 'Most_Used_Platform', 'Relationship_Status']
target_col = 'Affects_Academic_Performance'
df.describe()
Student_ID | Age | Avg_Daily_Usage_Hours | Sleep_Hours_Per_Night | Mental_Health_Score | Conflicts_Over_Social_Media | Addicted_Score | |
---|---|---|---|---|---|---|---|
count | 705.000000 | 705.000000 | 705.000000 | 705.000000 | 705.000000 | 705.000000 | 705.000000 |
mean | 353.000000 | 20.659574 | 4.918723 | 6.868936 | 6.226950 | 2.849645 | 6.436879 |
std | 203.660256 | 1.399217 | 1.257395 | 1.126848 | 1.105055 | 0.957968 | 1.587165 |
min | 1.000000 | 18.000000 | 1.500000 | 3.800000 | 4.000000 | 0.000000 | 2.000000 |
25% | 177.000000 | 19.000000 | 4.100000 | 6.000000 | 5.000000 | 2.000000 | 5.000000 |
50% | 353.000000 | 21.000000 | 4.800000 | 6.900000 | 6.000000 | 3.000000 | 7.000000 |
75% | 529.000000 | 22.000000 | 5.800000 | 7.700000 | 7.000000 | 4.000000 | 8.000000 |
max | 705.000000 | 24.000000 | 8.500000 | 9.600000 | 9.000000 | 5.000000 | 9.000000 |
print(df.info())
print(df.isnull().sum())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 705 entries, 0 to 704
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Student_ID                    705 non-null    int64
 1   Age                           705 non-null    int64
 2   Gender                        705 non-null    object
 3   Academic_Level                705 non-null    object
 4   Country                       705 non-null    object
 5   Avg_Daily_Usage_Hours         705 non-null    float64
 6   Most_Used_Platform            705 non-null    object
 7   Affects_Academic_Performance  705 non-null    object
 8   Sleep_Hours_Per_Night         705 non-null    float64
 9   Mental_Health_Score           705 non-null    int64
 10  Relationship_Status           705 non-null    object
 11  Conflicts_Over_Social_Media   705 non-null    int64
 12  Addicted_Score                705 non-null    int64
dtypes: float64(2), int64(5), object(6)
memory usage: 71.7+ KB
None
Student_ID                      0
Age                             0
Gender                          0
Academic_Level                  0
Country                         0
Avg_Daily_Usage_Hours           0
Most_Used_Platform              0
Affects_Academic_Performance    0
Sleep_Hours_Per_Night           0
Mental_Health_Score             0
Relationship_Status             0
Conflicts_Over_Social_Media     0
Addicted_Score                  0
dtype: int64
Exploratory Data Analysis (EDA)¶
df.hist(grid=True, bins=25, figsize=(15,8))
array([[<Axes: title={'center': 'Student_ID'}>, <Axes: title={'center': 'Age'}>, <Axes: title={'center': 'Avg_Daily_Usage_Hours'}>], [<Axes: title={'center': 'Sleep_Hours_Per_Night'}>, <Axes: title={'center': 'Mental_Health_Score'}>, <Axes: title={'center': 'Conflicts_Over_Social_Media'}>], [<Axes: title={'center': 'Addicted_Score'}>, <Axes: >, <Axes: >]], dtype=object)
plt.figure(figsize=(8, 6))
sns.countplot(x=target_col, data=df, palette='viridis')
plt.title('Distribution of self-reported academic impact')
plt.xlabel('Affects Academic Performance')
plt.ylabel('Count')
plt.show()
# #### Age and Gender Distribution
plt.figure(figsize=(16, 6))
# Age distribution
plt.subplot(1, 2, 1)
sns.histplot(df['Age'], kde=True, bins=15)
plt.title('Age Distribution of Students', fontsize=14)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Count', fontsize=12)
# Gender distribution
plt.subplot(1, 2, 2)
gender_counts = df['Gender'].value_counts()
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', startangle=90, colors=sns.color_palette('viridis', len(gender_counts)))
plt.title('Gender Distribution', fontsize=14)
plt.axis('equal')
plt.tight_layout()
plt.savefig('../figures/Age_Gender_Distribution')
plt.show()
plt.figure(figsize=(18, 8))
# Country distribution
plt.subplot(1, 2, 1)
country_counts = df['Country'].value_counts().head(10)
sns.barplot(x=country_counts.index, y=country_counts.values, palette='viridis')
plt.title('Top 10 Countries', fontsize=14)
plt.xlabel('Country', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45, ha='right')
# Academic level distribution
plt.subplot(1, 2, 2)
academic_counts = df['Academic_Level'].value_counts()
sns.barplot(x=academic_counts.index, y=academic_counts.values, palette='viridis')
plt.title('Academic Level Distribution', fontsize=14)
plt.xlabel('Academic Level', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('../figures/Country_AcademicLev_Distribution')
plt.show()
# #### Most Used Platforms
plt.figure(figsize=(14, 7))
platform_counts = df['Most_Used_Platform'].value_counts()
sns.barplot(x=platform_counts.index, y=platform_counts.values, palette='viridis')
plt.title('Most Used Social Media Platforms', fontsize=14)
plt.xlabel('Platform', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.savefig('../figures/Most_Used_Social_Media_Distribution')
plt.show()
# #### Academic Performance vs Social Media Usage
import plotly.express as px
fig = px.box(df, x='Affects_Academic_Performance', y='Avg_Daily_Usage_Hours',
title='Social Media Usage Hours vs. Impact on Academic Performance',
labels={'Affects_Academic_Performance': 'Does it Affect Academic Performance?',
'Avg_Daily_Usage_Hours': 'Average Daily Usage Hours'})
fig.update_layout(width=800, height=500)
fig.show()
plt.figure(figsize=(12, 8))
for i, col in enumerate(num_cols, 1):
    plt.subplot(4, 2, i)
    sns.boxplot(x=target_col, y=col, data=df, palette='viridis')
    plt.title(f'{col} by Academic Impact')
plt.tight_layout()
plt.suptitle('Distribution of numeric features by academic impact', y=1.05)
plt.savefig('../figures/distCatPorAfectAcad.png')
plt.show()
corr_matrix = df[num_cols].corr(method='spearman')
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Correlation heatmap of numeric attributes')
plt.savefig('../figures/mapaCalorCorrelacion.png')
plt.show()
Feature Engineering¶
# Create new features
# Interaction between Gender and Academic Level
df['Gender_Academic_Interaction'] = df['Gender'].astype(str) + "_" + df['Academic_Level'].astype(str)
# Interaction between Platform and Relationship Status
df['Platform_Relationship_Interaction'] = df['Most_Used_Platform'] + "_" + df['Relationship_Status']
# Social media usage category (Low, Medium, High)
def usage_category(hours):
    if hours <= 2:
        return 0  # Low
    elif hours <= 5:
        return 1  # Medium
    else:
        return 2  # High
df['Usage_Category'] = df['Avg_Daily_Usage_Hours'].apply(usage_category)
# Features derived from mental health and sleep
df['Mental_Health_Risk'] = df['Mental_Health_Score'] - df['Sleep_Hours_Per_Night']
df['Sleep_Usage_Discrepancy'] = df['Sleep_Hours_Per_Night'] - df['Avg_Daily_Usage_Hours']
# Accumulated relationship stress attributed to social media
df['Relationship_Stress'] = df['Conflicts_Over_Social_Media'] * df['Addicted_Score']
# Interaction between addiction score and academic impact.
# CAUTION: this feature is built from the prediction target itself, so any
# model trained on it effectively sees the answer (target leakage).
df['Addiction_Affects_Academic'] = df['Addicted_Score'] * df['Affects_Academic_Performance'].map({'Yes': 1, 'No': 0}).astype(int)
# Social media behavior features
# Define popular platforms
popular_platforms = ['Instagram', 'Facebook', 'TikTok']
# Indicator for use of a popular platform
df['Frequent_Use_Popular_Platforms'] = df['Most_Used_Platform'].apply(lambda x: 1 if x in popular_platforms else 0)
# Usage impact weighted by platform
df['Social_Media_Usage_Impact'] = df['Avg_Daily_Usage_Hours'] * df['Most_Used_Platform'].apply(lambda x: 2 if x in ['Instagram', 'TikTok'] else 1)
# Per-country averages (computed on the full dataset, so a small amount of
# information crosses the later train/test split)
df['Avg_Country_Usage'] = df.groupby('Country')['Avg_Daily_Usage_Hours'].transform('mean')
df['Avg_Country_Mental_Health'] = df.groupby('Country')['Mental_Health_Score'].transform('mean')
# Preview the first rows to check the new columns
df.head(5)
Student_ID | Age | Gender | Academic_Level | Country | Avg_Daily_Usage_Hours | Most_Used_Platform | Affects_Academic_Performance | Sleep_Hours_Per_Night | Mental_Health_Score | ... | Platform_Relationship_Interaction | Usage_Category | Mental_Health_Risk | Sleep_Usage_Discrepancy | Relationship_Stress | Addiction_Affects_Academic | Frequent_Use_Popular_Platforms | Social_Media_Usage_Impact | Avg_Country_Usage | Avg_Country_Mental_Health | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 19 | Female | Undergraduate | Bangladesh | 5.2 | Instagram | Yes | 6.5 | 6 | ... | Instagram_In Relationship | 2 | -0.5 | 1.3 | 24 | 8 | 1 | 10.4 | 4.800000 | 5.050000 |
1 | 2 | 22 | Male | Graduate | India | 2.1 | Twitter | No | 7.5 | 8 | ... | Twitter_Single | 1 | 0.5 | 5.4 | 0 | 0 | 0 | 2.1 | 6.116981 | 5.452830 |
2 | 3 | 20 | Female | Undergraduate | USA | 6.0 | TikTok | Yes | 5.0 | 5 | ... | TikTok_Complicated | 2 | 0.0 | -1.0 | 36 | 9 | 1 | 12.0 | 6.890000 | 4.900000 |
3 | 4 | 18 | Male | High School | UK | 3.0 | YouTube | No | 7.0 | 7 | ... | YouTube_Single | 1 | 0.0 | 4.0 | 4 | 0 | 0 | 3.0 | 5.472727 | 5.681818 |
4 | 5 | 21 | Male | Graduate | Canada | 4.5 | Facebook | Yes | 6.0 | 6 | ... | Facebook_In Relationship | 1 | 0.0 | 1.5 | 14 | 7 | 1 | 4.5 | 4.714706 | 6.235294 |
5 rows × 24 columns
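The three-level usage bucketing defined above with `usage_category` can be written equivalently with `pandas.cut`, which makes the bin edges explicit. A minimal sketch on a standalone series (the upper edge of 24 follows the dataset's documented 0-24 hour range):

```python
import pandas as pd

hours = pd.Series([1.5, 2.0, 4.8, 5.0, 8.5])

# Same cut points as usage_category: <=2 -> 0 (low), <=5 -> 1 (medium), else 2 (high).
cats = pd.cut(hours, bins=[0, 2, 5, 24], labels=[0, 1, 2], include_lowest=True).astype(int)
print(cats.tolist())  # [0, 0, 1, 1, 2]
```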
Data Processing¶
for col in cat_cols:
    print(col)
    print(df[col].unique())
Gender
['Female' 'Male']
Academic_Level
['Undergraduate' 'Graduate' 'High School']
Country
['Bangladesh' 'India' 'USA' 'UK' 'Canada' 'Australia' 'Germany' 'Brazil' 'Japan' 'South Korea' 'France' 'Spain' 'Italy' 'Mexico' 'Russia' 'China' 'Sweden' 'Norway' 'Denmark' 'Netherlands' 'Belgium' 'Switzerland' 'Austria' 'Portugal' 'Greece' 'Ireland' 'New Zealand' 'Singapore' 'Malaysia' 'Thailand' 'Vietnam' 'Philippines' 'Indonesia' 'Taiwan' 'Hong Kong' 'Turkey' 'Israel' 'UAE' 'Egypt' 'Morocco' 'South Africa' 'Nigeria' 'Kenya' 'Ghana' 'Argentina' 'Chile' 'Colombia' 'Peru' 'Venezuela' 'Ecuador' 'Uruguay' 'Paraguay' 'Bolivia' 'Costa Rica' 'Panama' 'Jamaica' 'Trinidad' 'Bahamas' 'Iceland' 'Finland' 'Poland' 'Romania' 'Hungary' 'Czech Republic' 'Slovakia' 'Croatia' 'Serbia' 'Slovenia' 'Bulgaria' 'Estonia' 'Latvia' 'Lithuania' 'Ukraine' 'Moldova' 'Belarus' 'Kazakhstan' 'Uzbekistan' 'Kyrgyzstan' 'Tajikistan' 'Armenia' 'Georgia' 'Azerbaijan' 'Cyprus' 'Malta' 'Luxembourg' 'Monaco' 'Andorra' 'San Marino' 'Vatican City' 'Liechtenstein' 'Montenegro' 'Albania' 'North Macedonia' 'Kosovo' 'Bosnia' 'Qatar' 'Kuwait' 'Bahrain' 'Oman' 'Jordan' 'Lebanon' 'Iraq' 'Yemen' 'Syria' 'Afghanistan' 'Pakistan' 'Nepal' 'Bhutan' 'Sri Lanka' 'Maldives']
Most_Used_Platform
['Instagram' 'Twitter' 'TikTok' 'YouTube' 'Facebook' 'LinkedIn' 'Snapchat' 'LINE' 'KakaoTalk' 'VKontakte' 'WhatsApp' 'WeChat']
Relationship_Status
['In Relationship' 'Single' 'Complicated']
from sklearn.preprocessing import LabelEncoder
# Initialize the encoder
le = LabelEncoder()
# Label-encode the low-cardinality categorical columns
df['Gender'] = le.fit_transform(df['Gender'])
df['Academic_Level'] = le.fit_transform(df['Academic_Level'])
df['Relationship_Status'] = le.fit_transform(df['Relationship_Status'])
# Label-encode the engineered interaction columns
df['Gender_Academic_Interaction'] = le.fit_transform(df['Gender_Academic_Interaction'])
df['Platform_Relationship_Interaction'] = le.fit_transform(df['Platform_Relationship_Interaction'])
# One-hot encode 'Country' and 'Most_Used_Platform'
df = pd.get_dummies(df, columns=['Country', 'Most_Used_Platform'])
# Preview the transformed DataFrame
df.head(5)
Student_ID | Age | Gender | Academic_Level | Avg_Daily_Usage_Hours | Affects_Academic_Performance | Sleep_Hours_Per_Night | Mental_Health_Score | Relationship_Status | Conflicts_Over_Social_Media | ... | Most_Used_Platform_KakaoTalk | Most_Used_Platform_LINE | Most_Used_Platform_LinkedIn | Most_Used_Platform_Snapchat | Most_Used_Platform_TikTok | Most_Used_Platform_Twitter | Most_Used_Platform_VKontakte | Most_Used_Platform_WeChat | Most_Used_Platform_WhatsApp | Most_Used_Platform_YouTube | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 19 | 0 | 2 | 5.2 | Yes | 6.5 | 6 | 1 | 3 | ... | False | False | False | False | False | False | False | False | False | False |
1 | 2 | 22 | 1 | 0 | 2.1 | No | 7.5 | 8 | 2 | 0 | ... | False | False | False | False | False | True | False | False | False | False |
2 | 3 | 20 | 0 | 2 | 6.0 | Yes | 5.0 | 5 | 0 | 4 | ... | False | False | False | False | True | False | False | False | False | False |
3 | 4 | 18 | 1 | 1 | 3.0 | No | 7.0 | 7 | 2 | 1 | ... | False | False | False | False | False | False | False | False | False | True |
4 | 5 | 21 | 1 | 0 | 4.5 | Yes | 6.0 | 6 | 1 | 2 | ... | False | False | False | False | False | False | False | False | False | False |
5 rows × 144 columns
df.shape
(705, 144)
# Convert 'Affects_Academic_Performance' from object to numeric
df['Affects_Academic_Performance'] = df['Affects_Academic_Performance'].map({'Yes': 1, 'No': 0})
df.to_csv('../data/processed/processing_dataset.csv', index=False)
from sklearn.model_selection import train_test_split
X = df.drop(columns=[target_col])
y = df[target_col]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
target_encoder = LabelEncoder()
y_train = target_encoder.fit_transform(y_train)
y_test = target_encoder.transform(y_test)
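Fitting the scaler on the training split only, as done above, avoids leaking test-set statistics. An alternative that makes this automatic is to put scaling inside a `Pipeline`, so it is refit on each training fold during cross-validation. A minimal sklearn-only sketch on synthetic data (SMOTE is omitted here, since including it would require `imblearn`'s own pipeline class):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 60 samples, 4 features, a separable binary target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(60, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

# The scaler is refit on each training fold, so test folds never leak into it.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_demo, y_demo, cv=3, scoring='f1_weighted')
print(scores.shape)  # (3,)
```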
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
# Candidate models with hyperparameter grids
models = {
    'logistic': {
        'model': LogisticRegression(max_iter=1000),
        'use_scaled': True,
        'params': {'C': np.logspace(-4, 4, 20), 'solver': ['lbfgs', 'liblinear']}
    },
    'svm': {
        'model': SVC(probability=True),
        'use_scaled': True,
        'params': {'C': np.logspace(-3, 3, 20), 'kernel': ['rbf', 'linear'], 'gamma': ['scale', 'auto', 0.1, 1]}
    },
    'rf': {
        'model': RandomForestClassifier(random_state=42),
        'use_scaled': False,
        'params': {'n_estimators': [100, 150, 200], 'max_depth': [None, 10, 20], 'min_samples_split': [2, 5, 10]}
    },
    'knn': {
        'model': KNeighborsClassifier(),
        'use_scaled': True,
        'params': {'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']}
    },
    'dt': {
        'model': DecisionTreeClassifier(random_state=42),
        'use_scaled': False,
        'params': {'max_depth': [None, 5, 10, 15], 'min_samples_split': [2, 5, 10]}
    }
}
# Hyperparameter search
results = []
best_models = {}
for model_name, mp in models.items():
    clf = GridSearchCV(mp['model'], mp['params'], cv=3, scoring='f1_weighted', n_jobs=-1)
    # Use the scaled or unscaled training set as each model requires
    X_train_current = X_train_scaled if mp['use_scaled'] else X_train
    clf.fit(X_train_current, y_train)
    # Record the results
    results.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
    # Keep the best estimator
    best_models[model_name] = clf.best_estimator_
# Show the best results
for result in results:
    print(f"Model: {result['model']}")
    print(f"Best score: {result['best_score']}")
    print(f"Best hyperparameters: {result['best_params']}")
    print("-" * 50)
Model: logistic
Best score: 1.0
Best hyperparameters: {'C': 0.004832930238571752, 'solver': 'lbfgs'}
--------------------------------------------------
Model: svm
Best score: 1.0
Best hyperparameters: {'C': 0.00206913808111479, 'gamma': 'scale', 'kernel': 'linear'}
--------------------------------------------------
Model: rf
Best score: 1.0
Best hyperparameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 100}
--------------------------------------------------
Model: knn
Best score: 0.9834289482388958
Best hyperparameters: {'n_neighbors': 3, 'weights': 'uniform'}
--------------------------------------------------
Model: dt
Best score: 1.0
Best hyperparameters: {'max_depth': None, 'min_samples_split': 2}
--------------------------------------------------
df_results = pd.DataFrame(results)
df_results
model | best_score | best_params | |
---|---|---|---|
0 | logistic | 1.000000 | {'C': 0.004832930238571752, 'solver': 'lbfgs'} |
1 | svm | 1.000000 | {'C': 0.00206913808111479, 'gamma': 'scale', '... |
2 | rf | 1.000000 | {'max_depth': None, 'min_samples_split': 2, 'n... |
3 | knn | 0.983429 | {'n_neighbors': 3, 'weights': 'uniform'} |
4 | dt | 1.000000 | {'max_depth': None, 'min_samples_split': 2} |
model_names = [result['model'] for result in results]
model_scores = [result['best_score'] for result in results]
colors = plt.cm.viridis(np.linspace(0, 1, len(model_names)))
plt.figure(figsize=(10, 6))
plt.bar(model_names, model_scores, color=colors)
plt.ylim(min(model_scores) - 0.05, max(model_scores) + 0.001)
plt.ylabel('Best Score')
plt.title('Model Comparison')
plt.savefig('../figures/comparacionModelos.png')
plt.show()
from sklearn.metrics import accuracy_score, f1_score
for model_name, model in best_models.items():
    # Evaluate on the scaled test set when the model was trained on scaled features;
    # predicting on unscaled data would misrepresent the scaled models.
    X_test_current = X_test_scaled if models[model_name]['use_scaled'] else X_test
    model_predic = model.predict(X_test_current)
    accuracy = accuracy_score(y_test, model_predic)
    f1 = f1_score(y_test, model_predic, average='weighted')
    print('------------------------------')
    print(model_name + ':\n')
    print(f"Accuracy: {accuracy}")
    print(f"F1 Score: {f1}")
------------------------------
logistic:

Accuracy: 0.8794326241134752
F1 Score: 0.8722352255166287
------------------------------
svm:

Accuracy: 0.7446808510638298
F1 Score: 0.7493955512572534
------------------------------
rf:

Accuracy: 1.0
F1 Score: 1.0
------------------------------
knn:

Accuracy: 0.6524822695035462
F1 Score: 0.5223954185155687
------------------------------
dt:

Accuracy: 1.0
F1 Score: 1.0
from sklearn.metrics import classification_report, confusion_matrix
final_model = best_models['rf']
y_predic = final_model.predict(X_test)
# Class 0 = 'No', class 1 = 'Yes' (from the earlier Yes/No -> 1/0 mapping)
print(classification_report(y_test, y_predic, target_names=['No', 'Yes']))
              precision    recall  f1-score   support

          No       1.00      1.00      1.00        50
         Yes       1.00      1.00      1.00        91

    accuracy                           1.00       141
   macro avg       1.00      1.00      1.00       141
weighted avg       1.00      1.00      1.00       141
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix(y_test, y_predic), annot=True, fmt='d', cmap='Blues')
plt.title('Confusion matrix of the selected model')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.savefig('../figures/matrizConfusionResultados.png')
plt.show()
conf_matrix = confusion_matrix(y_test, y_predic)
conf_matrix_normalized = conf_matrix.astype('float') / conf_matrix.sum(axis=1)[:, np.newaxis]
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix_normalized, annot=True, fmt='.2f', cmap='Blues', xticklabels=target_encoder.classes_, yticklabels=target_encoder.classes_)
plt.title('Normalized Confusion Matrix - Random Forest')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.savefig('../figures/matrizConfusionResultadosNormalizada.png')
plt.show()
importances = pd.Series(final_model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
top_n = 15
top_importances = importances.head(top_n)
sns.set(style="whitegrid", palette="muted")
plt.figure(figsize=(10, 6))
sns.barplot(x=top_importances.values, y=top_importances.index)
plt.title(f"Top {top_n} Feature Importances in Random Forest")
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.savefig('../figures/varImportantesRF.png')
plt.show()
from sklearn.metrics import roc_curve, auc
# Predicted probabilities for the ROC curve
y_prob = final_model.predict_proba(X_test)[:, 1]  # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Random Forest')
plt.legend(loc='lower right')
plt.savefig('../figures/curvaROC_RF.png')
plt.show()
from sklearn.model_selection import learning_curve
# The Random Forest was trained on unscaled features, so use X_train here
train_sizes, train_scores, test_scores = learning_curve(
    final_model, X_train, y_train, cv=3, scoring='f1_weighted', n_jobs=-1
)
train_mean = np.mean(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
plt.figure(figsize=(8, 6))
plt.plot(train_sizes, train_mean, label='Training', color='blue')
plt.plot(train_sizes, test_mean, label='Validation', color='green')
plt.xlabel('Training Set Size')
plt.ylabel('F1 Score')
plt.title('Learning Curve - Random Forest')
plt.legend()
plt.savefig('../figures/curvaAprendizajeRF.png')
plt.show()
from sklearn.model_selection import cross_val_score
# Again evaluate on the unscaled training set, matching how the model was fit
cv_scores = cross_val_score(final_model, X_train, y_train, cv=5, scoring='f1_weighted')
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(cv_scores) + 1), cv_scores, marker='o', linestyle='-', color='purple')
plt.title('Cross-Validation Scores - Random Forest')
plt.xlabel('Fold')
plt.ylabel('F1 Score')
plt.savefig('../figures/valCruzadaRF.png')
plt.show()
plt.figure(figsize=(8, 6))
plt.hist([y_test, y_predic], bins=20, label=['True', 'Predicted'], alpha=0.7)
plt.legend(loc='best')
plt.title('Distribution of True vs. Predicted Labels')
plt.savefig('../figures/distVerdaderos_VS_Prediccion.png')
plt.show()
Distribution of Countries and Academic Levels:
- Top 10 countries: India, the USA, and Canada have a substantial representation in the dataset, while countries like Switzerland and Spain have a smaller sample size.
- Academic level distribution: Most students are from the undergraduate level, with very few from high school.
Impact on Academic Performance:
- Comparisons across average daily usage hours, mental health scores, and social-media conflicts suggest that heavier usage co-occurs with poorer mental health and more relationship conflicts, and that this cluster is associated with a negative impact on academic performance.
- Students whose academic performance is affected tend to use social media more and report more conflicts, possibly reflecting higher addiction.
Predicting Academic Performance:
- Confusion matrix and the distribution of true vs. predicted labels: the model classifies academic impact ("Yes"/"No") almost perfectly.
- The normalized confusion matrix shows 100% correct predictions for both the "Yes" and "No" categories. Scores this perfect warrant caution: the engineered Addiction_Affects_Academic feature is derived from the target itself, so the classifiers likely benefit from target leakage.
Correlations:
- The correlation heatmap suggests that hours of social media usage and addiction score are strongly correlated. It also shows that mental health has a negative relationship with social media usage hours and social media conflicts, indicating a detrimental impact on students' lives.
Important Features:
- In the Random Forest model, the most influential variables for predicting academic impact are Addicted_Score, the engineered Addiction_Affects_Academic interaction, and Relationship_Stress.
Most Used Social Media Platforms:
- Instagram and TikTok are the most widely used platforms, with significantly higher usage compared to platforms like LinkedIn and YouTube.
Conclusion:¶
Social media addiction seems to be strongly related to students' academic performance, with factors such as usage hours, conflicts, and mental health having a significant influence. The Random Forest model has shown excellent predictive performance, and the most popular platforms in the dataset are Instagram and TikTok.
joblib.dump(final_model, "../models/student-social-media-model.pkl")
['../models/student-social-media-model.pkl']