Español

# Comportamiento de Extrovertidos vs. Introvertidos

Acerca del Conjunto de Datos¶

Visión general¶

Sumérgete en el conjunto de datos de rasgos de personalidad Extrovertido vs. Introvertido, una rica colección de datos de comportamiento y sociales diseñada para explorar el espectro de la personalidad humana. Este conjunto de datos captura indicadores clave de extroversión e introversión, lo que lo convierte en un recurso valioso para psicólogos, científicos de datos e investigadores que estudian el comportamiento social, la predicción de la personalidad o las técnicas de preprocesamiento de datos.

Contexto¶

Rasgos de personalidad como la extroversión e introversión moldean cómo los individuos interactúan con su entorno social. Este conjunto de datos proporciona información sobre comportamientos como el tiempo pasado a solas, la asistencia a eventos sociales y la participación en redes sociales, lo que permite aplicaciones en psicología, sociología, marketing y aprendizaje automático. Ya sea que estés prediciendo tipos de personalidad o analizando patrones sociales, este conjunto de datos es tu puerta de entrada para descubrir ideas fascinantes.

Detalles del Conjunto de Datos¶

Tamaño: El conjunto de datos contiene 2,900 filas y 8 columnas.

Features:

- Time_spent_Alone: Hours spent alone daily (0–11).
- Stage_fear: Presence of stage fright (Yes/No).
- Social_event_attendance: Frequency of social events (0–10).
- Going_outside: Frequency of going outside (0–7).
- Drained_after_socializing: Feeling drained after socializing (Yes/No).
- Friends_circle_size: Number of close friends (0–15).
- Post_frequency: Social media post frequency (0–10).
- Personality: Target variable (Extrovert/Introvert).*

English

Behavior of Extroverts vs. Introverts¶

About the Dataset¶

Overview¶

Dive into the Extrovert vs. Introvert Personality Traits dataset—a rich collection of behavioral and social data designed to explore the spectrum of human personality. This dataset captures key indicators of extroversion and introversion, making it a valuable resource for psychologists, data scientists, and researchers studying social behavior, personality prediction, or data preprocessing techniques.

Context¶

Personality traits like extroversion and introversion shape how individuals interact with their social environment. This dataset provides insights into behaviors such as time spent alone, social event attendance, and social media engagement, enabling applications in psychology, sociology, marketing, and machine learning. Whether you're predicting personality types or analyzing social patterns, this dataset is your gateway to discovering fascinating insights.

Dataset Details¶

Size: The dataset contains 2,900 rows and 8 columns.

Features:

  • Time_spent_Alone: Hours spent alone daily (0–11).
  • Stage_fear: Presence of stage fright (Yes/No).
  • Social_event_attendance: Frequency of social events (0–10).
  • Going_outside: Frequency of going outside (0–7).
  • Drained_after_socializing: Feeling drained after socializing (Yes/No).
  • Friends_circle_size: Number of close friends (0–15).
  • Post_frequency: Social media post frequency (0–10).
  • Personality: Target variable (Extrovert/Introvert).
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import joblib
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

Carga y análisis de datos¶

In [2]:
df = pd.read_csv('../data/raw/personality_dataset.csv')

df.head(5)
Out[2]:
Time_spent_Alone Stage_fear Social_event_attendance Going_outside Drained_after_socializing Friends_circle_size Post_frequency Personality
0 4.0 No 4.0 6.0 No 13.0 5.0 Extrovert
1 9.0 Yes 0.0 0.0 Yes 0.0 3.0 Introvert
2 9.0 Yes 1.0 2.0 Yes 5.0 2.0 Introvert
3 0.0 No 6.0 7.0 No 14.0 8.0 Extrovert
4 3.0 No 9.0 4.0 No 8.0 5.0 Extrovert
In [3]:
df.columns
Out[3]:
Index(['Time_spent_Alone', 'Stage_fear', 'Social_event_attendance',
       'Going_outside', 'Drained_after_socializing', 'Friends_circle_size',
       'Post_frequency', 'Personality'],
      dtype='object')
In [4]:
num_cols   = ['Time_spent_Alone', 'Social_event_attendance', 'Going_outside', 'Friends_circle_size', 'Post_frequency']
cat_cols   = ['Stage_fear', 'Drained_after_socializing']
target_col = 'Personality'
In [5]:
df.describe()
Out[5]:
Time_spent_Alone Social_event_attendance Going_outside Friends_circle_size Post_frequency
count 2837.000000 2838.000000 2834.000000 2823.000000 2835.000000
mean 4.505816 3.963354 3.000000 6.268863 3.564727
std 3.479192 2.903827 2.247327 4.289693 2.926582
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2.000000 2.000000 1.000000 3.000000 1.000000
50% 4.000000 3.000000 3.000000 5.000000 3.000000
75% 8.000000 6.000000 5.000000 10.000000 6.000000
max 11.000000 10.000000 7.000000 15.000000 10.000000
In [6]:
print(df.info())
print(df.isnull().sum())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2900 entries, 0 to 2899
Data columns (total 8 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Time_spent_Alone           2837 non-null   float64
 1   Stage_fear                 2827 non-null   object 
 2   Social_event_attendance    2838 non-null   float64
 3   Going_outside              2834 non-null   float64
 4   Drained_after_socializing  2848 non-null   object 
 5   Friends_circle_size        2823 non-null   float64
 6   Post_frequency             2835 non-null   float64
 7   Personality                2900 non-null   object 
dtypes: float64(5), object(3)
memory usage: 181.4+ KB
None
Time_spent_Alone             63
Stage_fear                   73
Social_event_attendance      62
Going_outside                66
Drained_after_socializing    52
Friends_circle_size          77
Post_frequency               65
Personality                   0
dtype: int64
In [7]:
df.hist(grid=True, bins=25, figsize=(15,8))
Out[7]:
array([[<Axes: title={'center': 'Time_spent_Alone'}>,
        <Axes: title={'center': 'Social_event_attendance'}>],
       [<Axes: title={'center': 'Going_outside'}>,
        <Axes: title={'center': 'Friends_circle_size'}>],
       [<Axes: title={'center': 'Post_frequency'}>, <Axes: >]],
      dtype=object)
No description has been provided for this image
In [8]:
plt.figure(figsize=(8, 6))
sns.countplot(x=target_col, data=df)
plt.title('Distribución de los tipos de Personalidad')
plt.xlabel('Personalidad')
plt.ylabel('Cantidad')
plt.show()
No description has been provided for this image
In [9]:
sns.pairplot(df[num_cols + [target_col]], hue=target_col, diag_kind='hist')
plt.suptitle('Categorías númericas con respecto a la Personalidad', y=1.05)
plt.savefig('../figures/catNumRespecto_Personalidad.png')
plt.show()
No description has been provided for this image
In [10]:
plt.figure(figsize=(8, 6))
for i, col in enumerate(num_cols, 1):
    plt.subplot(3, 2, i)
    sns.boxplot(x=target_col, y=col, data=df)
    plt.title(f'{col} by Personality')
plt.tight_layout()
plt.suptitle('Distribución de categorías en base a la personalidad', y=1.05)
plt.savefig('../figures/distCatPorPersonalidad.png')
plt.show()
No description has been provided for this image

Hay valores atípicos en los diferentes categorías.

In [11]:
corr_matrix = df[num_cols].corr(method='spearman')
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Mapa de calor de correlación entre atributos')
plt.savefig('../figures/mapaCalorCorrelacion.png')
plt.show()
No description has been provided for this image

Procesar Datos¶

In [12]:
# Transformación de los datos
from sklearn.impute import SimpleImputer

# Para columnas numéricas:
numeric_imputer = SimpleImputer(strategy='mean')  # o 'median'
df[num_cols] = numeric_imputer.fit_transform(df[num_cols])

# Para columnas categóricas:
categorical_imputer = SimpleImputer(strategy='most_frequent')
df[cat_cols] = categorical_imputer.fit_transform(df[cat_cols])

df.isnull().sum()
Out[12]:
Time_spent_Alone             0
Stage_fear                   0
Social_event_attendance      0
Going_outside                0
Drained_after_socializing    0
Friends_circle_size          0
Post_frequency               0
Personality                  0
dtype: int64
In [13]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

cat_encoder = OneHotEncoder(sparse=False)

encoded_cols = cat_encoder.fit_transform(df[cat_cols])

encoded_df = pd.DataFrame(encoded_cols, columns=cat_encoder.get_feature_names_out(cat_cols))

df = pd.concat([df, encoded_df], axis=1)

df = df.drop(columns=cat_cols)

Creación de características.¶

In [14]:
# Tiempo de interacción social calculado
df['Interaction_Time'] = df['Social_event_attendance'] + df['Going_outside']

# Tensión social (1 si hay miedo al escenario o se siente agotado después de socializar, 0 si no)
df['Social_Stress'] = ((df['Stage_fear_Yes'] == 1) | (df['Drained_after_socializing_Yes'] == 1)).astype(int)

#  Amigos vs. eventos sociales
df['Friends_to_Events_Ratio'] = df['Friends_circle_size'] / (df['Social_event_attendance'] + 1e-5)

# Actividad en redes sociales
df['Social_Media_Engagement'] = df['Post_frequency'] * df['Social_event_attendance']

# Comportamiento social combinado
df['Social_Behavior'] = (df['Social_event_attendance'] + df['Going_outside'] + df['Friends_circle_size']) / 3

# Escala de energía social
df['Energy_Social'] = (df['Drained_after_socializing_No'] == 1).astype(int)
In [15]:
# Manejar outliers en las num_cols
for col in num_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[col] = df[col].clip(lower=lower_bound, upper=upper_bound)
In [16]:
df.head()
Out[16]:
Time_spent_Alone Social_event_attendance Going_outside Friends_circle_size Post_frequency Personality Stage_fear_No Stage_fear_Yes Drained_after_socializing_No Drained_after_socializing_Yes Interaction_Time Social_Stress Friends_to_Events_Ratio Social_Media_Engagement Social_Behavior Energy_Social
0 4.0 4.0 6.0 13.0 5.0 Extrovert 1.0 0.0 1.0 0.0 10.0 0 3.249992 20.0 7.666667 1
1 9.0 0.0 0.0 0.0 3.0 Introvert 0.0 1.0 0.0 1.0 0.0 1 0.000000 0.0 0.000000 0
2 9.0 1.0 2.0 5.0 2.0 Introvert 0.0 1.0 0.0 1.0 3.0 1 4.999950 2.0 2.666667 0
3 0.0 6.0 7.0 14.0 8.0 Extrovert 1.0 0.0 1.0 0.0 13.0 0 2.333329 48.0 9.000000 1
4 3.0 9.0 4.0 8.0 5.0 Extrovert 1.0 0.0 1.0 0.0 13.0 0 0.888888 45.0 7.000000 1
In [17]:
df.to_csv('../data/processed/processing_dataset.csv')
In [18]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=[target_col])
y = df[target_col]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
In [19]:
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler

smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

target_encoder = LabelEncoder()
y_train = target_encoder.fit_transform(y_train)
y_test = target_encoder.transform(y_test)

Entrenamiento y Evaluación del Modelo¶

Modelos escogidos¶

In [20]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score, GridSearchCV
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

models = {
    'logistic': {
        'model': LogisticRegression(max_iter=1000),
        'use_scaled': True,
        'params': {'C': np.logspace(-4, 4, 20), 'solver': ['lbfgs', 'liblinear']}
    },
    'svm': {
        'model': SVC(probability=True),
        'use_scaled': True,
        'params': {'C': np.logspace(-3, 3, 20), 'kernel': ['rbf', 'linear'], 'gamma': ['scale', 'auto', 0.1, 1]}
    },
    'rf': {
        'model': RandomForestClassifier(random_state=42),
        'use_scaled': False,
        'params': {'n_estimators': [100, 150, 200], 'max_depth': [None, 10, 20], 'min_samples_split': [2, 5, 10]}
    },
    'knn': {
        'model': KNeighborsClassifier(),
        'use_scaled': True,
        'params': {'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']}
    },
    'dt': {
        'model': DecisionTreeClassifier(random_state=42),
        'use_scaled': False,
        'params': {'max_depth': [None, 5, 10, 15], 'min_samples_split': [2, 5, 10]}
    }
}
In [21]:
# Trabajar los hiperparámetros
results = []
best_models = {}
for model_name, mp in models.items():
    clf = GridSearchCV(mp['model'], mp['params'], cv=3, scoring='f1_weighted', n_jobs=-1)
    X_train_current = X_train_scaled if mp['use_scaled'] else X_train
    clf.fit(X_train_current, y_train)
    results.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
    best_models[model_name] = clf.best_estimator_
In [22]:
df_results = pd.DataFrame(results)

df_results
Out[22]:
model best_score best_params
0 logistic 0.937968 {'C': 0.0001, 'solver': 'lbfgs'}
1 svm 0.938387 {'C': 0.0379269019073225, 'gamma': 0.1, 'kerne...
2 rf 0.938387 {'max_depth': None, 'min_samples_split': 10, '...
3 knn 0.936291 {'n_neighbors': 9, 'weights': 'uniform'}
4 dt 0.936710 {'max_depth': 5, 'min_samples_split': 2}
In [23]:
model_names = [result['model'] for result in results]
model_scores = [result['best_score'] for result in results]

colors = plt.cm.viridis(np.linspace(0, 1, len(model_names)))

plt.figure(figsize=(10, 6))
plt.bar(model_names, model_scores, color=colors)
plt.ylim(min(model_scores) - 0.05, max(model_scores) + 0.001)
plt.ylabel('Best Score')
plt.title('Comparación de Modelos')
plt.savefig('../figures/comparacionModelos.png')
plt.show()
No description has been provided for this image
In [24]:
from sklearn.metrics import accuracy_score, f1_score

rf_model = best_models['rf']
rf_predictions = rf_model.predict(X_test)

accuracy = accuracy_score(y_test, rf_predictions)
f1 = f1_score(y_test, rf_predictions, average='weighted')

print(f"Accuracy: {accuracy}")
print(f"F1 Score: {f1}")
Accuracy: 0.9155172413793103
F1 Score: 0.9155210084576291
In [25]:
from sklearn.metrics import accuracy_score, f1_score

for model_name, model in best_models.items():
    model_predic = model.predict(X_test)
    accuracy = accuracy_score(y_test, model_predic)
    f1 = f1_score(y_test, model_predic, average='weighted')
    print('------------------------------')
    print(model_name + ':\n')
    print(f"Accuracy: {accuracy}")
    print(f"F1 Score: {f1}")
------------------------------
logistic:

Accuracy: 0.6948275862068966
F1 Score: 0.6761589928468138
------------------------------
svm:

Accuracy: 0.4862068965517241
F1 Score: 0.3181214497159773
------------------------------
rf:

Accuracy: 0.9155172413793103
F1 Score: 0.9155210084576291
------------------------------
knn:

Accuracy: 0.7344827586206897
F1 Score: 0.724572670684715
------------------------------
dt:

Accuracy: 0.9103448275862069
F1 Score: 0.9103448275862069

Evaluación del modelo¶

In [26]:
from sklearn.metrics import classification_report, confusion_matrix
final_model = best_models['rf']

y_predic = final_model.predict(X_test)
print(classification_report(y_test, y_predic, target_names=target_encoder.classes_))
              precision    recall  f1-score   support

   Extrovert       0.94      0.89      0.92       298
   Introvert       0.89      0.94      0.92       282

    accuracy                           0.92       580
   macro avg       0.92      0.92      0.92       580
weighted avg       0.92      0.92      0.92       580

In [27]:
# Matriz de confusión
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix(y_test, y_predic), annot=True, fmt='d', cmap='Blues')
plt.title('Matriz de confusión del modelo seleccionado')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.savefig('../figures/matrizConfusionResultados.png')
plt.show()
No description has been provided for this image
In [28]:
conf_matrix = confusion_matrix(y_test, y_predic)
conf_matrix_normalized = conf_matrix.astype('float') / conf_matrix.sum(axis=1)[:, np.newaxis]

plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix_normalized, annot=True, fmt='.2f', cmap='Blues', xticklabels=target_encoder.classes_, yticklabels=target_encoder.classes_)
plt.title('Matriz de Confusión Normalizada - Random Forest')
plt.xlabel('Predicción')
plt.ylabel('Real')
plt.savefig('../figures/matrizConfusionResultadosNormalizada.png')
plt.show()
No description has been provided for this image

Características importantes para el modelo¶

In [29]:
importances = pd.Series(final_model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
sns.set(style="whitegrid", palette="muted")

plt.figure(figsize=(10,6))
sns.barplot(x=importances.values, y=importances.index)
plt.title("Importancia de variables en Random Forest")
plt.savefig('../figures/varImportantesRF.png')
plt.show()
No description has been provided for this image
In [30]:
from sklearn.metrics import roc_curve, auc

# Predecir probabilidades para obtener la curva ROC
y_prob = final_model.predict_proba(X_test)[:, 1]  # Probabilidad de la clase positiva
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

# Graficar
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Curva ROC - Random Forest')
plt.legend(loc='lower right')
plt.savefig('../figures/curvaROC_RF.png')
plt.show()
No description has been provided for this image
In [31]:
from sklearn.model_selection import learning_curve

train_sizes, train_scores, test_scores = learning_curve(
    rf_model, X_train_scaled, y_train, cv=3, scoring='f1_weighted', n_jobs=-1
)

train_mean = np.mean(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)

plt.figure(figsize=(8, 6))
plt.plot(train_sizes, train_mean, label='Entrenamiento', color='blue')
plt.plot(train_sizes, test_mean, label='Validación', color='green')
plt.xlabel('Tamaño de Entrenamiento')
plt.ylabel('F1 Score')
plt.title('Curva de Aprendizaje - Random Forest')
plt.legend()
plt.savefig('../figures/curvaAprendizajeRF.png')
plt.show()
No description has been provided for this image
In [32]:
cv_scores = cross_val_score(final_model, X_train_scaled, y_train, cv=5, scoring='f1_weighted')

plt.figure(figsize=(8, 6))
plt.plot(range(1, len(cv_scores) + 1), cv_scores, marker='o', linestyle='-', color='purple')
plt.title('Puntajes de Validación Cruzada - Random Forest')
plt.xlabel('Fold')
plt.ylabel('F1 Score')
plt.savefig('../figures/valCruzadaRF.png')
plt.show()
No description has been provided for this image
In [33]:
plt.figure(figsize=(8, 6))
plt.hist([y_test, y_predic], bins=20, label=['Verdadero', 'Predicción'], alpha=0.7)
plt.legend(loc='best')
plt.title('Distribución de Verdaderos vs Predicciones')
plt.savefig('../figures/distVerdaderos_VS_Prediccion.png')
plt.show()
No description has been provided for this image

Conclusión¶


Para predecir si una persona es extrovertida o introvertida, se utilizó un modelo de Random Forest, que es eficaz para clasificar datos complejos. Después de procesar y transformar los datos (como la asistencia a eventos sociales, la interacción en redes y el tiempo a solas), el modelo mostró un buen rendimiento al identificar correctamente las personalidades. Las variables más relevantes fueron el estrés social, el comportamiento social y la frecuencia de interacción en redes sociales, las cuales son claves para diferenciar entre extrovertidos e introvertidos. Al comparar las distribuciones de los valores reales y las predicciones, podemos observar que el modelo está haciendo una clasificación bastante equilibrada. Aunque hay una ligera tendencia hacia la clase "Extrovertido", la distribución de las predicciones sigue de cerca a la distribución real, lo que sugiere una predicción efectiva y representativa.

In [34]:
import joblib

joblib.dump(final_model, "../models/human-personality-model.pkl")
Out[34]:
['../models/human-personality-model.pkl']