Learn the fundamentals of scikit-learn, Python's most popular machine learning library, and build your first ML models. This lesson covers:
- The scikit-learn API pattern (fit, predict, score)
- Train/test split for model evaluation
- Linear Regression for predicting continuous values
- Classification models (Logistic Regression, Decision Trees)
- Data preprocessing (scaling, encoding)
- Model evaluation metrics (accuracy, MSE, confusion matrix)
- Cross-validation for robust evaluation
- Pipelines for cleaner workflows
scikit-learn (often imported as sklearn) is the go-to library for machine learning in Python. It provides:
- Simple, consistent API — almost every model uses the same pattern
- Wide variety of algorithms — regression, classification, clustering, and more
- Built-in preprocessing tools — scaling, encoding, feature engineering
- Model evaluation utilities — metrics, cross-validation, grid search
- Excellent documentation — clear examples and explanations
The beauty of scikit-learn is that once you learn the pattern for one model, you can apply it to dozens of others.
Every scikit-learn model follows the same workflow:
```python
# 1. Create the model
model = SomeModel()

# 2. Train it on data
model.fit(X_train, y_train)

# 3. Make predictions
predictions = model.predict(X_test)

# 4. Evaluate performance
score = model.score(X_test, y_test)
```
This consistency makes it easy to experiment with different algorithms.
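Once you know the pattern, swapping algorithms means changing only the line that creates the model. Here is a minimal, self-contained sketch; the synthetic dataset from make_classification is an assumption made just to keep the example runnable:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic data so the example runs on its own
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The exact same fit/score calls work for both models
for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```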
To evaluate how well a model generalizes to new data, we split our dataset:
```python
from sklearn.model_selection import train_test_split

# 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
The random_state parameter ensures reproducibility: the same seed always produces the same split.
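For classification problems, it is often worth passing stratify=y so that both splits keep the same class proportions, a small variation on the call above:

```python
# Preserve the class balance of y in both the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```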
Linear Regression finds the best-fit line through your data:
```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# R² score (1.0 is perfect)
score = model.score(X_test, y_test)
```
Common metrics:
- R² score — how much variance the model explains (1.0 is perfect; it can be negative for a model worse than always predicting the mean)
- Mean Squared Error (MSE) — average squared difference between predictions and actual values (lower is better)
- Mean Absolute Error (MAE) — average absolute difference (lower is better)
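As a quick sketch, all three can be computed from sklearn.metrics (for regressors, r2_score matches what model.score returns):

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

r2 = r2_score(y_test, predictions)             # variance explained
mse = mean_squared_error(y_test, predictions)  # penalizes large errors heavily
mae = mean_absolute_error(y_test, predictions) # in the same units as y
```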
Logistic Regression (despite the name, it's for classification):
```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Accuracy score
accuracy = model.score(X_test, y_test)
```
Decision Trees — intuitive, interpretable models:
```python
from sklearn.tree import DecisionTreeClassifier

# max_depth caps how deep the tree can grow, which helps prevent overfitting
model = DecisionTreeClassifier(max_depth=5)
model.fit(X_train, y_train)
```
Common metrics:
- Accuracy — percentage of correct predictions
- Confusion Matrix — a table of true positives, false positives, true negatives, and false negatives
- Precision, Recall, F1-score — for imbalanced datasets
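Precision, recall, and F1 don't appear in the snippets above; one way to get all three per class is classification_report:

```python
from sklearn.metrics import classification_report

# Precision, recall, F1, and support for every class in one table
print(classification_report(y_test, predictions))
```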
Many models, especially those based on distances or gradient descent, perform best with properly scaled data:
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use the same scaling as training!
```
Important: always fit the scaler on the training data only, then apply it to both train and test sets. Fitting on test data leaks information into training and inflates your evaluation scores.
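Scaling covers numeric features; the categorical encoding mentioned in the topic list is typically handled with OneHotEncoder. A minimal sketch on made-up string data:

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column (one string feature per row)
colors = [['red'], ['green'], ['blue'], ['green']]

encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors).toarray()  # one binary column per category
print(encoder.get_feature_names_out())             # e.g. ['x0_blue' 'x0_green' 'x0_red']
```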
Evaluation metrics live in sklearn.metrics:

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, mean_absolute_error, mean_squared_error
)

# For regression
mse = mean_squared_error(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)

# For classification
accuracy = accuracy_score(y_test, predictions)
cm = confusion_matrix(y_test, predictions)
```
Instead of relying on a single train/test split, cross-validation trains and evaluates the model on several different splits:
```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```
This gives a more reliable estimate of model performance than any single split.
Pipelines combine preprocessing and modeling into one step:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# Scaling and model fitting happen together, in the right order
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```
Benefits:
- Cleaner code
- Prevents data leakage
- Easy to deploy
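The leakage benefit is easiest to see with cross-validation: passing the whole pipeline to cross_val_score re-fits the scaler on each fold's training portion, so no fold's test data ever influences the scaling:

```python
from sklearn.model_selection import cross_val_score

# The scaler inside the pipeline is re-fit within every fold automatically
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.3f}")
```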
Watch out for these common mistakes:
- Forgetting to split data — always evaluate on unseen test data
- Scaling test data incorrectly — fit scaler on training data only
- Overfitting — model memorizes training data but fails on new data
- Ignoring data preprocessing — most models need scaled/normalized features
- Using wrong metrics — accuracy can be misleading for imbalanced datasets (see the sketch below)
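To see the last point concretely, here is a hypothetical 95/5 imbalanced dataset where a model that always predicts the majority class looks accurate while catching nothing:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # a "model" that always predicts class 0

print(accuracy_score(y_true, y_pred))  # 0.95, looks great
print(f1_score(y_true, y_pred))        # 0.0, it never finds a positive
```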
For quick reference, the key imports from this lesson:

```python
# Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

# Evaluation
from sklearn.model_selection import cross_val_score
```
After mastering these basics, explore:
- Random Forests and ensemble methods
- Support Vector Machines (SVM)
- K-Nearest Neighbors (KNN)
- Model tuning with GridSearchCV
- Feature selection and engineering
- Handling imbalanced datasets
Check out example.py for a complete working example.
Try the practice problems in exercises.py to test your understanding.