Behavior of Extroverts vs. Introverts¶
About the Dataset¶
Overview¶
Dive into the Extrovert vs. Introvert Personality Traits dataset—a rich collection of behavioral and social data designed to explore the spectrum of human personality. This dataset captures key indicators of extroversion and introversion, making it a valuable resource for psychologists, data scientists, and researchers studying social behavior, personality prediction, or data preprocessing techniques.
Context¶
Personality traits like extroversion and introversion shape how individuals interact with their social environment. This dataset provides insights into behaviors such as time spent alone, social event attendance, and social media engagement, enabling applications in psychology, sociology, marketing, and machine learning. Whether you're predicting personality types or analyzing social patterns, this dataset is your gateway to discovering fascinating insights.
Dataset Details¶
Size: The dataset contains 2,900 rows and 8 columns.
Features (see the range sanity check after this list):
- Time_spent_Alone: Hours spent alone daily (0–11).
- Stage_fear: Presence of stage fright (Yes/No).
- Social_event_attendance: Frequency of social events (0–10).
- Going_outside: Frequency of going outside (0–7).
- Drained_after_socializing: Feeling drained after socializing (Yes/No).
- Friends_circle_size: Number of close friends (0–15).
- Post_frequency: Social media post frequency (0–10).
- Personality: Target variable (Extrovert/Introvert).
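Since each feature comes with a documented range, it is worth sanity-checking the raw file against those bounds before modeling. Below is a minimal sketch (assuming the same CSV path used later in this notebook); the bounds are taken directly from the list above.
import pandas as pd

# Expected bounds, taken from the feature list above
EXPECTED_RANGES = {
    'Time_spent_Alone': (0, 11),
    'Social_event_attendance': (0, 10),
    'Going_outside': (0, 7),
    'Friends_circle_size': (0, 15),
    'Post_frequency': (0, 10),
}
raw = pd.read_csv('../data/raw/personality_dataset.csv')
for col, (lo, hi) in EXPECTED_RANGES.items():
    # NaNs are dropped here; missing values are handled during preprocessing
    bad = (~raw[col].dropna().between(lo, hi)).sum()
    print(f'{col}: {bad} values outside [{lo}, {hi}]')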
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import joblib
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
Data Loading and Analysis¶
df = pd.read_csv('../data/raw/personality_dataset.csv')
df.head(5)
|   | Time_spent_Alone | Stage_fear | Social_event_attendance | Going_outside | Drained_after_socializing | Friends_circle_size | Post_frequency | Personality |
|---|---|---|---|---|---|---|---|---|
| 0 | 4.0 | No | 4.0 | 6.0 | No | 13.0 | 5.0 | Extrovert |
| 1 | 9.0 | Yes | 0.0 | 0.0 | Yes | 0.0 | 3.0 | Introvert |
| 2 | 9.0 | Yes | 1.0 | 2.0 | Yes | 5.0 | 2.0 | Introvert |
| 3 | 0.0 | No | 6.0 | 7.0 | No | 14.0 | 8.0 | Extrovert |
| 4 | 3.0 | No | 9.0 | 4.0 | No | 8.0 | 5.0 | Extrovert |
df.columns
Index(['Time_spent_Alone', 'Stage_fear', 'Social_event_attendance', 'Going_outside', 'Drained_after_socializing', 'Friends_circle_size', 'Post_frequency', 'Personality'], dtype='object')
num_cols = ['Time_spent_Alone', 'Social_event_attendance', 'Going_outside', 'Friends_circle_size', 'Post_frequency']
cat_cols = ['Stage_fear', 'Drained_after_socializing']
target_col = 'Personality'
df.describe()
|   | Time_spent_Alone | Social_event_attendance | Going_outside | Friends_circle_size | Post_frequency |
|---|---|---|---|---|---|
| count | 2837.000000 | 2838.000000 | 2834.000000 | 2823.000000 | 2835.000000 |
| mean | 4.505816 | 3.963354 | 3.000000 | 6.268863 | 3.564727 |
| std | 3.479192 | 2.903827 | 2.247327 | 4.289693 | 2.926582 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 2.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 |
| 50% | 4.000000 | 3.000000 | 3.000000 | 5.000000 | 3.000000 |
| 75% | 8.000000 | 6.000000 | 5.000000 | 10.000000 | 6.000000 |
| max | 11.000000 | 10.000000 | 7.000000 | 15.000000 | 10.000000 |
print(df.info())
print(df.isnull().sum())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2900 entries, 0 to 2899
Data columns (total 8 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   Time_spent_Alone           2837 non-null   float64
 1   Stage_fear                 2827 non-null   object
 2   Social_event_attendance    2838 non-null   float64
 3   Going_outside              2834 non-null   float64
 4   Drained_after_socializing  2848 non-null   object
 5   Friends_circle_size        2823 non-null   float64
 6   Post_frequency             2835 non-null   float64
 7   Personality                2900 non-null   object
dtypes: float64(5), object(3)
memory usage: 181.4+ KB
None
Time_spent_Alone             63
Stage_fear                   73
Social_event_attendance      62
Going_outside                66
Drained_after_socializing    52
Friends_circle_size          77
Post_frequency               65
Personality                   0
dtype: int64
df.hist(grid=True, bins=25, figsize=(15,8))
[Figure: histogram grid for Time_spent_Alone, Social_event_attendance, Going_outside, Friends_circle_size, and Post_frequency]
plt.figure(figsize=(8, 6))
sns.countplot(x=target_col, data=df)
plt.title('Distribution of Personality Types')
plt.xlabel('Personality')
plt.ylabel('Count')
plt.show()
sns.pairplot(df[num_cols + [target_col]], hue=target_col, diag_kind='hist')
plt.suptitle('Numeric Features by Personality', y=1.05)
plt.savefig('../figures/catNumRespecto_Personalidad.png')
plt.show()
plt.figure(figsize=(8, 6))
for i, col in enumerate(num_cols, 1):
    plt.subplot(3, 2, i)
    sns.boxplot(x=target_col, y=col, data=df)
    plt.title(f'{col} by Personality')
plt.tight_layout()
plt.suptitle('Feature Distributions by Personality', y=1.05)
plt.savefig('../figures/distCatPorPersonalidad.png')
plt.show()
There are outliers in several of the numeric features; the quick check below quantifies them.
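A minimal sketch using the same 1.5×IQR rule that the preprocessing step applies later:
# Count values outside the 1.5*IQR fences for each numeric column
for col in num_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
    print(f'{col}: {int(outside.sum())} potential outliers')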
corr_matrix = df[num_cols].corr(method='spearman')
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Correlation Heatmap of the Numeric Features')
plt.savefig('../figures/mapaCalorCorrelacion.png')
plt.show()
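To read the same Spearman matrix programmatically rather than by eye, a small sketch that ranks feature pairs by absolute correlation:
# Keep the upper triangle only, then rank pairs by absolute Spearman correlation
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(key=abs, ascending=False)
print(pairs.head())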
Data Preprocessing¶
# Data transformation
from sklearn.impute import SimpleImputer
# For the numeric columns:
numeric_imputer = SimpleImputer(strategy='mean') # or 'median'
df[num_cols] = numeric_imputer.fit_transform(df[num_cols])
# For the categorical columns:
categorical_imputer = SimpleImputer(strategy='most_frequent')
df[cat_cols] = categorical_imputer.fit_transform(df[cat_cols])
df.isnull().sum()
Time_spent_Alone             0
Stage_fear                   0
Social_event_attendance      0
Going_outside                0
Drained_after_socializing    0
Friends_circle_size          0
Post_frequency               0
Personality                  0
dtype: int64
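One caveat worth flagging: the imputers above are fitted on the full dataset before the train/test split, so test-set statistics leak into the imputed values. The effect here is likely small, but a leakage-free variant would fit the imputation on the training split only, e.g. with a ColumnTransformer. A sketch (the X_train_raw/X_test_raw names are hypothetical pre-split frames, not variables defined in this notebook):
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Sketch: fit imputers on the training rows only, then apply to both splits
impute = ColumnTransformer([
    ('num', SimpleImputer(strategy='mean'), num_cols),
    ('cat', SimpleImputer(strategy='most_frequent'), cat_cols),
])
# impute.fit(X_train_raw)                    # fit on training rows only
# X_train_imp = impute.transform(X_train_raw)
# X_test_imp = impute.transform(X_test_raw)  # reuse training statistics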
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
cat_encoder = OneHotEncoder(sparse_output=False)  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
encoded_cols = cat_encoder.fit_transform(df[cat_cols])
encoded_df = pd.DataFrame(encoded_cols, columns=cat_encoder.get_feature_names_out(cat_cols))
df = pd.concat([df, encoded_df], axis=1)
df = df.drop(columns=cat_cols)
Feature Engineering¶
# Derived social interaction time
df['Interaction_Time'] = df['Social_event_attendance'] + df['Going_outside']
# Social stress (1 if the person has stage fright or feels drained after socializing, 0 otherwise)
df['Social_Stress'] = ((df['Stage_fear_Yes'] == 1) | (df['Drained_after_socializing_Yes'] == 1)).astype(int)
# Friends vs. social events
df['Friends_to_Events_Ratio'] = df['Friends_circle_size'] / (df['Social_event_attendance'] + 1e-5)
# Social media activity
df['Social_Media_Engagement'] = df['Post_frequency'] * df['Social_event_attendance']
# Combined social behavior
df['Social_Behavior'] = (df['Social_event_attendance'] + df['Going_outside'] + df['Friends_circle_size']) / 3
# Social energy flag
df['Energy_Social'] = (df['Drained_after_socializing_No'] == 1).astype(int)
# Clip outliers in num_cols to the 1.5*IQR fences
for col in num_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[col] = df[col].clip(lower=lower_bound, upper=upper_bound)
df.head()
|   | Time_spent_Alone | Social_event_attendance | Going_outside | Friends_circle_size | Post_frequency | Personality | Stage_fear_No | Stage_fear_Yes | Drained_after_socializing_No | Drained_after_socializing_Yes | Interaction_Time | Social_Stress | Friends_to_Events_Ratio | Social_Media_Engagement | Social_Behavior | Energy_Social |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.0 | 4.0 | 6.0 | 13.0 | 5.0 | Extrovert | 1.0 | 0.0 | 1.0 | 0.0 | 10.0 | 0 | 3.249992 | 20.0 | 7.666667 | 1 |
| 1 | 9.0 | 0.0 | 0.0 | 0.0 | 3.0 | Introvert | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1 | 0.000000 | 0.0 | 0.000000 | 0 |
| 2 | 9.0 | 1.0 | 2.0 | 5.0 | 2.0 | Introvert | 0.0 | 1.0 | 0.0 | 1.0 | 3.0 | 1 | 4.999950 | 2.0 | 2.666667 | 0 |
| 3 | 0.0 | 6.0 | 7.0 | 14.0 | 8.0 | Extrovert | 1.0 | 0.0 | 1.0 | 0.0 | 13.0 | 0 | 2.333329 | 48.0 | 9.000000 | 1 |
| 4 | 3.0 | 9.0 | 4.0 | 8.0 | 5.0 | Extrovert | 1.0 | 0.0 | 1.0 | 0.0 | 13.0 | 0 | 0.888888 | 45.0 | 7.000000 | 1 |
df.to_csv('../data/processed/processing_dataset.csv', index=False)  # index=False avoids an extra unnamed column on reload
from sklearn.model_selection import train_test_split
X = df.drop(columns=[target_col])
y = df[target_col]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
target_encoder = LabelEncoder()
y_train = target_encoder.fit_transform(y_train)
y_test = target_encoder.transform(y_test)
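LabelEncoder assigns integers in alphabetical order of the class names, so Extrovert maps to 0 and Introvert to 1; this makes Introvert the positive class (index 1) in the probability-based plots further down. A quick check of the mapping:
# Class-to-integer mapping produced by LabelEncoder (expected: Extrovert -> 0, Introvert -> 1)
print(dict(zip(target_encoder.classes_, target_encoder.transform(target_encoder.classes_))))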
Model Training and Evaluation¶
Chosen Models¶
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score, GridSearchCV
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
models = {
    'logistic': {
        'model': LogisticRegression(max_iter=1000),
        'use_scaled': True,
        'params': {'C': np.logspace(-4, 4, 20), 'solver': ['lbfgs', 'liblinear']}
    },
    'svm': {
        'model': SVC(probability=True),
        'use_scaled': True,
        'params': {'C': np.logspace(-3, 3, 20), 'kernel': ['rbf', 'linear'], 'gamma': ['scale', 'auto', 0.1, 1]}
    },
    'rf': {
        'model': RandomForestClassifier(random_state=42),
        'use_scaled': False,
        'params': {'n_estimators': [100, 150, 200], 'max_depth': [None, 10, 20], 'min_samples_split': [2, 5, 10]}
    },
    'knn': {
        'model': KNeighborsClassifier(),
        'use_scaled': True,
        'params': {'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']}
    },
    'dt': {
        'model': DecisionTreeClassifier(random_state=42),
        'use_scaled': False,
        'params': {'max_depth': [None, 5, 10, 15], 'min_samples_split': [2, 5, 10]}
    }
}
# Hyperparameter tuning
results = []
best_models = {}
for model_name, mp in models.items():
    clf = GridSearchCV(mp['model'], mp['params'], cv=3, scoring='f1_weighted', n_jobs=-1)
    X_train_current = X_train_scaled if mp['use_scaled'] else X_train
    clf.fit(X_train_current, y_train)
    results.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
    best_models[model_name] = clf.best_estimator_
df_results = pd.DataFrame(results)
df_results
|   | model | best_score | best_params |
|---|---|---|---|
| 0 | logistic | 0.937968 | {'C': 0.0001, 'solver': 'lbfgs'} |
| 1 | svm | 0.938387 | {'C': 0.0379269019073225, 'gamma': 0.1, 'kerne... |
| 2 | rf | 0.938387 | {'max_depth': None, 'min_samples_split': 10, '... |
| 3 | knn | 0.936291 | {'n_neighbors': 9, 'weights': 'uniform'} |
| 4 | dt | 0.936710 | {'max_depth': 5, 'min_samples_split': 2} |
model_names = [result['model'] for result in results]
model_scores = [result['best_score'] for result in results]
colors = plt.cm.viridis(np.linspace(0, 1, len(model_names)))
plt.figure(figsize=(10, 6))
plt.bar(model_names, model_scores, color=colors)
plt.ylim(min(model_scores) - 0.05, max(model_scores) + 0.001)
plt.ylabel('Best Score')
plt.title('Model Comparison')
plt.savefig('../figures/comparacionModelos.png')
plt.show()
from sklearn.metrics import accuracy_score, f1_score
rf_model = best_models['rf']
rf_predictions = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, rf_predictions)
f1 = f1_score(y_test, rf_predictions, average='weighted')
print(f"Accuracy: {accuracy}")
print(f"F1 Score: {f1}")
Accuracy: 0.9155172413793103
F1 Score: 0.9155210084576291
from sklearn.metrics import accuracy_score, f1_score
for model_name, model in best_models.items():
    # Use the scaled test set for models that were tuned on scaled features
    X_test_current = X_test_scaled if models[model_name]['use_scaled'] else X_test
    model_predic = model.predict(X_test_current)
    accuracy = accuracy_score(y_test, model_predic)
    f1 = f1_score(y_test, model_predic, average='weighted')
    print('------------------------------')
    print(model_name + ':\n')
    print(f"Accuracy: {accuracy}")
    print(f"F1 Score: {f1}")
------------------------------
logistic:

Accuracy: 0.6948275862068966
F1 Score: 0.6761589928468138
------------------------------
svm:

Accuracy: 0.4862068965517241
F1 Score: 0.3181214497159773
------------------------------
rf:

Accuracy: 0.9155172413793103
F1 Score: 0.9155210084576291
------------------------------
knn:

Accuracy: 0.7344827586206897
F1 Score: 0.724572670684715
------------------------------
dt:

Accuracy: 0.9103448275862069
F1 Score: 0.9103448275862069
Model Evaluation¶
from sklearn.metrics import classification_report, confusion_matrix
final_model = best_models['rf']
y_predic = final_model.predict(X_test)
print(classification_report(y_test, y_predic, target_names=target_encoder.classes_))
              precision    recall  f1-score   support

   Extrovert       0.94      0.89      0.92       298
   Introvert       0.89      0.94      0.92       282

    accuracy                           0.92       580
   macro avg       0.92      0.92      0.92       580
weighted avg       0.92      0.92      0.92       580
# Confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix(y_test, y_predic), annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix of the Selected Model')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.savefig('../figures/matrizConfusionResultados.png')
plt.show()
conf_matrix = confusion_matrix(y_test, y_predic)
conf_matrix_normalized = conf_matrix.astype('float') / conf_matrix.sum(axis=1)[:, np.newaxis]
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix_normalized, annot=True, fmt='.2f', cmap='Blues', xticklabels=target_encoder.classes_, yticklabels=target_encoder.classes_)
plt.title('Normalized Confusion Matrix - Random Forest')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.savefig('../figures/matrizConfusionResultadosNormalizada.png')
plt.show()
Important Features for the Model¶
importances = pd.Series(final_model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
sns.set(style="whitegrid", palette="muted")
plt.figure(figsize=(10,6))
sns.barplot(x=importances.values, y=importances.index)
plt.title("Importancia de variables en Random Forest")
plt.savefig('../figures/varImportantesRF.png')
plt.show()
from sklearn.metrics import roc_curve, auc
# Predict probabilities for the ROC curve
y_prob = final_model.predict_proba(X_test)[:, 1] # Probability of the positive class (Introvert = 1)
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
# Graficar
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Random Forest')
plt.legend(loc='lower right')
plt.savefig('../figures/curvaROC_RF.png')
plt.show()
from sklearn.model_selection import learning_curve
# Random Forest was tuned on unscaled features, so use X_train here as well
train_sizes, train_scores, test_scores = learning_curve(
    rf_model, X_train, y_train, cv=3, scoring='f1_weighted', n_jobs=-1
)
train_mean = np.mean(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
plt.figure(figsize=(8, 6))
plt.plot(train_sizes, train_mean, label='Training', color='blue')
plt.plot(train_sizes, test_mean, label='Validation', color='green')
plt.xlabel('Training Set Size')
plt.ylabel('F1 Score')
plt.title('Learning Curve - Random Forest')
plt.legend()
plt.savefig('../figures/curvaAprendizajeRF.png')
plt.show()
cv_scores = cross_val_score(final_model, X_train, y_train, cv=5, scoring='f1_weighted')  # unscaled features, matching the RF training setup
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(cv_scores) + 1), cv_scores, marker='o', linestyle='-', color='purple')
plt.title('Cross-Validation Scores - Random Forest')
plt.xlabel('Fold')
plt.ylabel('F1 Score')
plt.savefig('../figures/valCruzadaRF.png')
plt.show()
plt.figure(figsize=(8, 6))
plt.hist([y_test, y_predic], bins=20, label=['True', 'Predicted'], alpha=0.7)
plt.legend(loc='best')
plt.title('Distribution of True vs. Predicted Labels')
plt.savefig('../figures/distVerdaderos_VS_Prediccion.png')
plt.show()
Conclusion¶
To predict whether a person is an extrovert or an introvert, a Random Forest model was used, which is effective at classifying complex tabular data. After processing and transforming the data (social event attendance, social media engagement, time spent alone, and so on), the model performed well, identifying personalities correctly. The most relevant variables were social stress, social behavior, and social media engagement, which are key to distinguishing extroverts from introverts. Comparing the distributions of the true values and the predictions shows that the model classifies in a fairly balanced way: although there is a slight tilt toward the "Extrovert" class, the distribution of predictions closely tracks the true distribution, which suggests effective and representative predictions.
import joblib
joblib.dump(final_model, "../models/human-personality-model.pkl")
['../models/human-personality-model.pkl']
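As a closing usage sketch, the saved artifact can be reloaded and applied to new observations, provided they go through the same preprocessing and feature-engineering steps above and keep the training-time column order; here the first test row stands in for new data:
# Reload the persisted model and score one already-preprocessed row
loaded_model = joblib.load('../models/human-personality-model.pkl')
sample = X_test.iloc[[0]]  # single row with the training-time feature columns
pred = loaded_model.predict(sample)[0]
print(target_encoder.inverse_transform([pred])[0])  # 'Extrovert' or 'Introvert'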