Update Soccer Prediction Project with Data File Modifications and Code Cleanup by ronyka77 · Pull Request #6 · ronyka77/TheDrawCode

ronyka77 · 2025-10-12T22:35:58Z

Updated binary data files new_api_training_final.parquet and new_api_training_final.xlsx to reflect the latest training data.
Refactored add_ELO_scores.py and add_poisson_xG.py for improved readability and organization by adjusting export paths and formatting.
Enhanced event_features_calculator.py to improve DataFrame handling and logging for better debugging.
Cleaned up imports and formatting across various scripts to maintain consistency and improve code quality.

These changes aim to enhance data management, improve code readability, and ensure better organization within the Soccer Prediction Project.

Summary by CodeRabbit

New Features
- Full TabNet outlier-aware modeling pipeline with staged feature selection, hyperparameter tuning, and MLflow tracking.
- Ensemble expanded to include PyTorch, SVM, and FNN base models, richer meta-features, and dynamic weighting.
- New dataset balancing/cleaning and feature‑importance selection utilities.
- SHAP-based explanations and enhanced error-analysis tools.
- Prediction utility now returns results plus precision and recall metrics and supports thresholded outputs.
Improvements
- More robust data loading, logging, diagnostics, and MLflow signature handling across training and inference.

…e Cleanup - Updated binary data files `new_api_training_final.parquet` and `new_api_training_final.xlsx` to reflect the latest training data. - Refactored `add_ELO_scores.py` and `add_poisson_xG.py` for improved readability and organization by adjusting export paths and formatting. - Enhanced `event_features_calculator.py` to improve DataFrame handling and logging for better debugging. - Cleaned up imports and formatting across various scripts to maintain consistency and improve code quality. These changes aim to enhance data management, improve code readability, and ensure better organization within the Soccer Prediction Project.

coderabbitai · 2025-10-12T22:36:15Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

This PR makes broad formatting and typing changes, adds/rewrites many model training, hypertuning, feature-selection and ensemble pipelines (including a new TabNet outliers module and specialized FNN integration), expands ensemble base-model support (PyTorch/SVM/FNN), hardens loaders/prediction flows, and introduces new utilities for weighting, diagnostics and data preparation.

Changes

Cohort / File(s)	Summary of Changes
Small data scripts `data/Create_data/add_ELO_scores.py`, `data/Create_data/add_poisson_xG.py`	Removed unused imports; path/signature formatting and minor reflow; no behavior changes.
API Football ingestion `data/Create_data/api_football/add_current_features_postgre.py`, `data/Create_data/api_football/get_fixtures_postgre.py`	Quoting/formatting, env-based DB config, added helpers (JSON save), stricter typing, defensive parsing, refined upsert and prediction handling.
Feature selection utils `data/Create_data/utils/feature_selection.py`	Replaced inline weight optimization with modular pipeline and helpers; Optional logger; align_columns now returns (train,test,eval).
Shared data/config/hyper utilities `src/models/StackedEnsemble/shared/config_loader.py`, `.../data_loader.py`, `.../data_loader_new.py`, `.../hypertuner_utils.py`, `.../validation.py`	Added Optional typing for logger/model_type, imported np/pd, added predict helper, renamed X_val→x_val in places, None-tolerant initialization.
SVM & kernel models `src/models/StackedEnsemble/base/kernel_based/svm_model.py`, `.../svm_model_25.py`	Cosmetic formatting and logging changes; potential hypertune return-path regression noted in one file.
Neural models & hypertuners `src/models/StackedEnsemble/base/neural/.py` (mlp_model, pytorch_hypertuner, specialized_fnn_integration, tabnet_model*, tabnet_model_outliers.py)	Defensive checks for fitted models; RobustScaler adoption in PyTorch hypertuner; specialized FNN runtime patching; new TabNet outlier-aware module (training, HPO, feature-selection, MLflow); signature/name normalizations (x_* / x_eval).
Tree-based models & FS `src/models/StackedEnsemble/base/tree_based/.py` (catboost_model.py, lightgbm_model.py, xgboost_model.py, random_forest_30.py, feature_selection/)`	Added typed helpers (create_model/train_model/optimize_hyperparameters/logging), permutation importance, RFECV tweaks, x_*/x_eval naming, MLflow signature inference; Boruta/feature-selection eval-data prep adjustments.
Ensemble core, loaders and meta-features `src/models/ensemble/.py` (ensemble_model, meta_features_, run_ensemble.py, data_utils.py, diagnostics.py, training.py, weights_.py)	Deferred MLP init, added MLflow loaders for many flavors (including PyTorch/SVM), expanded dynamic weighting to include PyTorch/SVM/FNN, new balancing/feature-selection utilities, SHAP/error diagnostics modularization, hardened run pipeline and parameter-logging helpers, x_* naming normalization.
Predictor `src/predictors/predict_ensemble.py`	Robust model loading (sklearn/pyfunc), probability fallbacks, thresholding guards, stronger DataFrame handling; make_prediction now returns (DataFrame, precision, recall).
K-factor & misc `src/utils/K_factor_calculation.py`, `.gitignore`	Formatting and quoting; temporal_factor introduced for draw distribution; .gitignore updated to restore prompts entry and add SonarLint exclusion.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant M as specialized_fnn_integration.main
  participant HT as hypertuner.module
  participant F as create_specialized_fnn
  Note over M,HT: Runtime patching flow for specialized FNN
  M->>HT: save originals, patch create_pytorch_model, patch load_hyperparameter_space
  HT->>F: build model via patched_create_pytorch_model
  HT->>HT: extend hyperparameter space via patched_load_hyperparameter_space
  HT-->>M: return after run
  M->>HT: restore originals

sequenceDiagram
  autonumber
  participant E as EnsembleModel.train
  participant L as MLflow Loaders
  participant B as Base Models (xgb,lgb,tabnet,extra,mlp,pytorch,svm,fnn)
  participant W as Weighting (precision-focused)
  participant ML as Meta-learner
  E->>L: load models, scalers, feature signatures
  E->>B: prepare/initialize base models (deferred where applicable)
  B-->>E: produce base predictions for x_val/x_test
  E->>W: compute per-model dynamic weights
  W-->>E: weights vector
  E->>ML: assemble meta-features and train meta-learner
  ML-->>E: final metrics and thresholds

sequenceDiagram
  autonumber
  participant P as predict_ensemble.make_prediction
  participant LD as Model Loader (mlflow)
  participant M as Model
  participant U as Threshold Utils
  P->>LD: load model by URI (sklearn or pyfunc)
  P->>M: obtain probabilities (predict_proba or pyfunc)
  alt probability available
    M-->>P: probability array
  else fallback
    LD-->>P: pyfunc predict fallback
  end
  P->>U: apply threshold -> binary predictions
  U-->>P: (DataFrame, precision, recall)
  P-->>Caller: return predictions and metrics

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

I thump my paws on lines so wide,
New weights to learn, new flows to ride.
TabNet hops in, FNN joins the crew,
Models knit together, old and new.
Logs lit bright — the rabbit's pride. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 73.97% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title succinctly captures the primary changes—updating data files and performing code cleanup—reflecting the core objectives without listing individual files, making it clear and specific enough for team members to understand the PR’s intent.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch bugfix/sonarqube_fixes

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 22

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (23)

src/models/StackedEnsemble/shared/data_loader.py (1)

97-99: Pre-existing bug: NaN count calculated after replacement.

The NaN counts are calculated after the NaN replacement (lines 87-89), so they will always be zero. The counts should be calculated before calling fillna(0) to accurately reflect the number of NaN values that were replaced.

Move the NaN counting before the replacement operations:

+        # Count NaN values before replacement
+        train_nan_count_before = X_train.isna().sum().sum()
+        test_nan_count_before = X_test.isna().sum().sum()
+        val_nan_count_before = X_val.isna().sum().sum()
+
         # Replace NaN values with 0 in all data splits
         self.logger.info("Replacing NaN values with 0 in all data splits")
         X_train = X_train.fillna(0)
         X_test = X_test.fillna(0)
         X_val = X_val.fillna(0)
         # Replace inf values with 0 in all data splits
         self.logger.info("Replacing inf values with 0 in all data splits")
         X_train = X_train.replace([np.inf, -np.inf], 0)
         X_test = X_test.replace([np.inf, -np.inf], 0)
         X_val = X_val.replace([np.inf, -np.inf], 0)
-
-        # Log NaN replacement statistics
-        train_nan_count_before = X_train.isna().sum().sum()
-        test_nan_count_before = X_test.isna().sum().sum()
-        val_nan_count_before = X_val.isna().sum().sum()

src/models/ensemble/data_utils.py (1)

218-231: Critical: Invalid parameter passed to SMOTE.

Line 230 passes n_neighbors=nn_estimator to SMOTE, but SMOTE's constructor does not accept an n_neighbors parameter. This will raise a TypeError at runtime. The k_neighbors parameter is already set on line 229, and if you want to pass a custom estimator, it should be a NearestNeighbors object (not KNeighborsClassifier) passed through the k_neighbors parameter.

Apply this diff to fix the SMOTE configuration:
     try:
-        # Configure SMOTE for CPU usage
-        # Using KNeighborsClassifier with n_jobs parameter to avoid FutureWarning
-        from sklearn.neighbors import KNeighborsClassifier
-
-        k_neighbors = min(5, min(n_pos, n_neg) - 1)  # Adaptive neighbors
-        nn_estimator = KNeighborsClassifier(n_neighbors=k_neighbors, n_jobs=-1)
-
         smote = SMOTE(
             sampling_strategy=sampling_strategy,
             random_state=42,
-            k_neighbors=k_neighbors,
-            n_neighbors=nn_estimator,  # Pass the estimator to avoid FutureWarning
+            k_neighbors=min(5, min(n_pos, n_neg) - 1),
         )
Note: The n_jobs parameter for SMOTE has been deprecated. If you need parallel processing, consider using SMOTE within a pipeline or joblib directly.

src/models/StackedEnsemble/base/kernel_based/svm_model_25.py (1)

458-473: Undefined final_metrics return will crash hypertuning.

Inside hypertune_svm, the happy-path return best_hpo_params, final_metrics references final_metrics that never gets assigned, so the first successful HPO run blows up with a NameError. Please capture the best trial’s metrics before returning. One straightforward fix is to let the optimization runner hand back both params and metrics, e.g.:

@@
-    best_params_from_hpo = {}
+    best_params_from_hpo = {}
+    best_metrics_from_hpo = {}
@@
-    def callback(study, trial):
-        nonlocal best_score, best_params_from_hpo, top_trials, global_top_trials
+    def callback(study, trial):
+        nonlocal best_score, best_params_from_hpo, best_metrics_from_hpo, top_trials, global_top_trials
@@
-                best_params_from_hpo = trial.params
+                best_params_from_hpo = trial.params
+                best_metrics_from_hpo = dict(current_run_metrics)
@@
-    return best_params_from_hpo
+    return best_params_from_hpo, best_metrics_from_hpo
@@
-        best_hpo_params = optimize_hyperparameters_svm(
+        best_hpo_params, final_metrics = optimize_hyperparameters_svm(
             X_train,
             y_train,
             X_test,
             y_test,
             X_eval,
             y_eval,
             hyperparameter_space,
         )
@@
-        if not best_hpo_params:
+        if not best_hpo_params:
             logger.error("Optuna optimization failed for SVM. Aborting.")
             return None, None
-        return best_hpo_params, final_metrics
+        return best_hpo_params, final_metrics

src/models/StackedEnsemble/base/kernel_based/svm_model.py (1)

407-425: final_metrics is undefined in hypertune return path.

hypertune_svm happily returns (best_hpo_params, final_metrics) even though final_metrics is never set, so the first successful run will explode with a NameError. Please propagate the metrics from the Optuna study before returning. Mirroring the approach above keeps the signature intact:

@@
-    best_params_from_hpo = {}
+    best_params_from_hpo = {}
+    best_metrics_from_hpo = {}
@@
-    def callback(study, trial):
-        nonlocal best_score, best_params_from_hpo, top_trials, global_top_trials
+    def callback(study, trial):
+        nonlocal best_score, best_params_from_hpo, best_metrics_from_hpo, top_trials, global_top_trials
@@
-                best_params_from_hpo = trial.params
+                best_params_from_hpo = trial.params
+                best_metrics_from_hpo = dict(current_run_metrics)
@@
-    return best_params_from_hpo
+    return best_params_from_hpo, best_metrics_from_hpo
@@
-        best_hpo_params = optimize_hyperparameters_svm(
+        best_hpo_params, final_metrics = optimize_hyperparameters_svm(
             X_train,
             y_train,
             X_test,
             y_test,
             X_eval,
             y_eval,
             hyperparameter_space,
         )
@@
-        if not best_hpo_params:
+        if not best_hpo_params:
             logger.error("Optuna optimization failed for SVM. Aborting.")
             return None, None
-        return best_hpo_params, final_metrics
+        return best_hpo_params, final_metrics

data/Create_data/add_ELO_scores.py (4)

171-180: Season reset is global across leagues — causes incorrect ELO carryover.

current_season tracks one season for all leagues; teams in other leagues won’t reset when season changes, leading to contaminated ratings. Track season per league.

Apply this diff in add_elo_scores and add a per‑league map:

-        # Check for season change
-        if row["season_encoded"] != self.current_season:
-            self.current_season = row["season_encoded"]
-            league_teams = matches[matches["league_encoded"] == row["league_encoded"]]
-            league_teams = pd.concat(
-                [league_teams["home_encoded"], league_teams["away_encoded"]]
-            ).unique()
-            self.reset_season_ratings(league_teams)
+        # Check for season change per league
+        season = row["season_encoded"]
+        league = row["league_encoded"]
+        prev = self.current_season_by_league.get(league)
+        if season != prev:
+            self.current_season_by_league[league] = season
+            league_rows = matches[matches["league_encoded"] == league]
+            league_teams = pd.concat(
+                [league_rows["home_encoded"], league_rows["away_encoded"]]
+            ).unique()
+            self.reset_season_ratings(league_teams)

Add this in init:

self.current_season_by_league = {}

150-161: Handle NaN leagues when computing K-factors.

NaN league IDs create a NaN dict key and later default to 30 due to NaN key lookup semantics.

Apply:

-            unique_leagues = matches["league_encoded"].unique()
+            unique_leagues = matches["league_encoded"].dropna().unique()
             for league in unique_leagues:
                 if league in self.prefixed_k_factors:

249-258: Numeric conversion can raise on None; use vectorized to_numeric.

float(None) raises TypeError. Switch to robust coercion.

-                for col in numeric_columns:
-                    if col in df.columns:
-                        df[col] = df[col].apply(
-                            lambda x: float(str(x).replace(",", "."))
-                            if isinstance(x, str)
-                            else float(x)
-                        )
+                for col in numeric_columns:
+                    if col in df.columns:
+                        # Normalize decimal commas and coerce invalids to NaN
+                        s = df[col].astype(str).str.replace(",", ".", regex=False)
+                        df[col] = pd.to_numeric(s, errors="coerce")

354-361: Ensure real None values are treated as NaN in loader.

Current replace list covers the string "None" but not actual None inserted during row padding.

-        # Replace NA values
-        na_values = ["NaN", "N/A", "NA", "null", "None", "", "Infinity", "-Infinity", "inf", "-inf"]
-        df = df.replace(na_values, np.nan)
+        # Replace NA values (strings) and actual None
+        na_values = ["NaN", "N/A", "NA", "null", "None", "", "Infinity", "-Infinity", "inf", "-inf"]
+        df = df.replace(na_values, np.nan)
+        df = df.where(pd.notnull(df), np.nan)

(import pandas as pd is already present.)

Also applies to: 365-372

src/models/ensemble/meta_features_20.py (1)

13-25: Meta-feature count/order mismatch (8 models) causes wrong column labels.

You stack 8 base probs, 1 avg, 23 diffs, 3 range, 8 ranks, then 3 meta, then 2 votes (total 48). create_meta_dataframe expects a different count and a different order (votes before meta). This triggers the fallback and breaks downstream by losing semantic column names.

Update docstring/comments to 8 models.
Keep stacking order as-is and fix col_names to length 48 with matching order, or reorder stacking to match the current names. Example fixes (keep current stacking and adjust names):

-    Now handles 7 base models (incl. PyTorch, SVM, FNN).
+    Now handles 8 base models (XGB, TabNet, LGB, Extra, MLP, PyTorch, SVM, FNN).

-    # Add to meta-features
-    meta_features = np.column_stack([meta_features, weighted_avg.reshape(-1, 1)])
+    # Add to meta-features
+    meta_features = np.column_stack([meta_features, weighted_avg.reshape(-1, 1)])

Ensure diffs list reflects 23 entries (your current all_diffs already has 23).

Option A (recommended): Adjust column names to match stacking order (meta after ranks, then vote_sum, vote_agreement last):

-    # Define column names based on the expected structure (43 features)
-    # 7 base + 1 avg + 21 diffs + 3 range + 7 ranks + 2 votes = 41 ? Mistake somewhere
-    # Re-count: 7 base + 1 avg + 21 diffs + 3 range + 7 ranks + 2 votes = 41 features
-    col_names = (
-        ["p_xgb", "p_tabnet", "p_lgb", "p_extra", "p_mlp", "p_pytorch", "p_svm", "p_fnn"]  # 8 base
+    # Define column names to match stacking order:
+    # 8 base + 1 avg + 23 diffs + 3 range + 8 ranks + 3 meta + 2 votes = 48
+    col_names = (
+        ["p_xgb", "p_tabnet", "p_lgb", "p_extra", "p_mlp", "p_pytorch", "p_svm", "p_fnn"]
         + ["weighted_avg"]
         + [
             "diff_xgb_tabnet",
             "diff_xgb_lgb",
             "diff_xgb_extra",
             "diff_xgb_mlp",
             "diff_xgb_pytorch",
             "diff_xgb_svm",
             "diff_tabnet_lgb",
             "diff_tabnet_extra",
             "diff_tabnet_mlp",
             "diff_tabnet_pytorch",
             "diff_tabnet_svm",
             "diff_lgb_extra",
             "diff_lgb_mlp",
             "diff_lgb_pytorch",
             "diff_lgb_svm",
             "diff_extra_mlp",
             "diff_extra_pytorch",
             "diff_extra_svm",
             "diff_mlp_pytorch",
             "diff_mlp_svm",
             "diff_pytorch_svm",
             "diff_fnn_pytorch",
             "diff_fnn_svm",
-        ]  # 21 diffs
-        + ["max_prob", "min_prob", "range_prob"]  # 3 range
+        ]
+        + ["max_prob", "min_prob", "range_prob"]
         + [
             "rank_xgb",
             "rank_tabnet",
             "rank_lgb",
             "rank_extra",
             "rank_mlp",
             "rank_pytorch",
             "rank_svm",
             "rank_fnn",
-        ]  # 8 ranks
-        + ["vote_sum", "vote_agreement"]  # 2 votes
-        + ["league_encoded", "season_encoded", "date_encoded"]  # 4 meta
+        ]
+        + ["league_encoded", "season_encoded", "date_encoded"]
+        + ["vote_sum", "vote_agreement"]
     )

Option B: If you prefer votes before meta, move the stacking of vote_sum/vote_agreement before adding league/season/date to match existing names.

Also fix the comments that still refer to 7 models and 21 diffs.

Also applies to: 49-75, 107-133, 145-149, 151-163, 165-168, 175-179, 189-241

data/Create_data/add_poisson_xG.py (1)

105-115: String/None to numeric conversion can fail; use vectorized coercion.

.str on non-strings and float(None) paths cause exceptions. Normalize and coerce.

-            for col in self.all_features:
-                if X[col].dtype == "object":
-                    try:
-                        # Replace commas with periods and convert to float
-                        X[col] = X[col].str.replace(",", ".").astype(float)
-                    except Exception as e:
-                        self.logger.warning(f"Could not convert column {col} to numeric: {str(e)}")
+            for col in self.all_features:
+                # Normalize decimal commas and coerce invalids to NaN
+                s = X[col].astype(str).str.replace(",", ".", regex=False)
+                X[col] = pd.to_numeric(s, errors="coerce")

-            for col in self.all_features:
-                if df[col].dtype == "object":
-                    try:
-                        # Replace commas with periods and convert to float
-                        df[col] = df[col].str.replace(",", ".").astype(float)
-                    except Exception as e:
-                        self.logger.warning(f"Could not convert column {col} to numeric: {str(e)}")
+            for col in self.all_features:
+                s = df[col].astype(str).str.replace(",", ".", regex=False)
+                df[col] = pd.to_numeric(s, errors="coerce")

Also applies to: 226-233

data/Create_data/api_football/add_current_features_postgre.py (1)

436-440: Future match query now returns past fixtures

WHERE "date" BETWEEN ... OR home_total_shots IS NOT NULL evaluates as (date in range) OR (stats present), so every row with recorded stats (i.e., historical matches) bypasses the date window. The exported “future matches” file will be dominated by old games with completed stats. Please tighten the predicate—e.g. require both the date window and home_total_shots IS NULL, or wrap the BETWEEN in parentheses before the OR.

Apply something like:
-        WHERE "date" between '{today_str}' and '{two_weeks_str}' or home_total_shots is not null;
+        WHERE "date" BETWEEN '{today_str}' AND '{two_weeks_str}'
+          AND home_total_shots IS NULL;

src/models/ensemble/training.py (1)

803-814: TabNet trials crash because weights=1 is invalid

TabNetClassifier.fit expects weights to be an array-like of sample weights (or None). Passing the scalar 1 trips the fit loop (len(weights)), giving 'int' object is not subscriptable before any trial completes. Remove the argument or supply a proper sample-weight vector computed from meta_targets_np.
-                    eval_metric=params.get("eval_metric", ["logloss", "auc"]),
-                    weights=1,  # Use automatic class weighting
+                    eval_metric=params.get("eval_metric", ["logloss", "auc"]),

data/Create_data/utils/feature_selection.py (1)

373-399: Shared list corrupts stability scores

dict.fromkeys(..., []) reuses the same list for every key, so appending scores for one feature mutates the “scores” list of every feature. The subsequent mean/selection logic then treats all features as identical. Initialize with independent lists instead.
-        feature_counts = dict.fromkeys(X.columns, 0)
-        feature_scores = dict.fromkeys(X.columns, [])
+        feature_counts = {col: 0 for col in X.columns}
+        feature_scores = {col: [] for col in X.columns}

src/models/StackedEnsemble/base/neural/pytorch_hypertuner.py (1)

129-135: Use the passed batch_size in predict_proba (avoid hardcoded 32).

Hardcoding 32 ignores the API and can degrade performance.

Apply this diff:

-        dataloader = TorchDataLoader(
-            dataset,
-            batch_size=32,
-            num_workers=1,
-            # pin_memory=True,  # Enables faster CPU to GPU transfers
-            # persistent_workers=True  # Keeps workers alive between epochs
-        )
+        dataloader = TorchDataLoader(
+            dataset,
+            batch_size=batch_size,
+            num_workers=0,
+            pin_memory=(self.device_.type == "cuda"),
+        )

src/models/StackedEnsemble/base/tree_based/lightgbm_model.py (1)

426-434: Don’t mutate X_eval when fixing dtypes for signature; modify input_example.

Current code alters the caller’s DataFrame.

Apply this diff:

-            if hasattr(input_example, "dtypes"):
-                for col in input_example.columns:
-                    if X_eval[col].dtype.kind == "i":
+            if hasattr(input_example, "dtypes"):
+                for col in input_example.columns:
+                    if input_example[col].dtype.kind == "i":
                         logger.info(
                             f"Converting integer column '{col}' to float64 to handle potential missing values"
                         )
-                        X_eval[col] = X_eval[col].astype("float64")
+                        input_example[col] = input_example[col].astype("float64")

src/models/StackedEnsemble/base/tree_based/random_forest_30.py (1)

543-550: Robust y_val handling and safe baseline precision (avoid divide-by-zero).

.values assumes pandas; precision can divide by zero.

Apply this diff:

-    y_val_np = y_val.values
+    y_val_np = y_val.values if hasattr(y_val, "values") else np.asarray(y_val)
     # Compute baseline metric
     probs = model.predict_proba(X_val)[:, 1]
     preds = (probs >= threshold).astype(int)
-    # Fix: metric is being passed as a float value instead of a function
-    # We'll calculate precision directly since that's what was passed in
-    baseline = np.sum((y_val_np == 1) & (preds == 1)) / (np.sum(preds == 1))
+    # Calculate baseline precision safely
+    _pp = np.sum(preds == 1)
+    baseline = 0.0 if _pp == 0 else np.sum((y_val_np == 1) & (preds == 1)) / _pp

src/models/ensemble/meta_features_0412.py (1)

13-223: Add the FNN channel; current signature breaks dynamic weighting

We now call create_meta_features_optimized with eight model outputs (see EnsembleModel.train / predict_proba), but the function still accepts only seven arrays. The extra p_fnn positional argument ends up bound to dynamic_weights, so (dynamic_weights.get(...)) tries to run on a NumPy array and we crash immediately. Even if it didn’t blow up, all meta-features would ignore the FNN output. Please extend the function (and column headers) to include the eighth model so downstream code keeps working.

-def create_meta_features_optimized(
-    p_xgb: np.ndarray,
-    p_tabnet: np.ndarray,
-    p_lgb: np.ndarray,
-    p_extra: np.ndarray,
-    p_mlp: np.ndarray,
-    p_pytorch: np.ndarray,
-    p_svm: np.ndarray,
-    dynamic_weights: Optional[dict] = None,
-    thresholds: Optional[dict] = None,
-) -> np.ndarray:
+def create_meta_features_optimized(
+    p_xgb: np.ndarray,
+    p_tabnet: np.ndarray,
+    p_lgb: np.ndarray,
+    p_extra: np.ndarray,
+    p_mlp: np.ndarray,
+    p_pytorch: np.ndarray,
+    p_svm: np.ndarray,
+    p_fnn: np.ndarray,
+    dynamic_weights: Optional[dict] = None,
+    thresholds: Optional[dict] = None,
+) -> np.ndarray:
@@
-    all_preds = [p_xgb, p_tabnet, p_lgb, p_extra, p_mlp, p_pytorch, p_svm]
+    all_preds = [p_xgb, p_tabnet, p_lgb, p_extra, p_mlp, p_pytorch, p_svm, p_fnn]
@@
-            + dynamic_weights.get("svm", default_weight) * p_svm
+            + dynamic_weights.get("svm", default_weight) * p_svm
+            + dynamic_weights.get("fnn", default_weight) * p_fnn
@@
-    # Stack all differences (original 15 + new 6 = 21)
+    # Stack all differences (C(8, 2) = 28)
@@
-        diff_pytorch_svm,  # pytorch vs svm
+        diff_pytorch_svm,
+        diff_xgb_fnn,
+        diff_tabnet_fnn,
+        diff_lgb_fnn,
+        diff_extra_fnn,
+        diff_mlp_fnn,
+        diff_pytorch_fnn,
+        diff_svm_fnn,
@@
-    vote_thresholds = [
+    vote_thresholds = [
         thresholds.get("xgb", default_threshold) if thresholds else default_threshold,
         ...
-        thresholds.get("svm", default_threshold) if thresholds else default_threshold,
+        thresholds.get("svm", default_threshold) if thresholds else default_threshold,
+        thresholds.get("fnn", default_threshold) if thresholds else default_threshold,
@@
-        ["p_xgb", "p_tabnet", "p_lgb", "p_extra", "p_mlp", "p_pytorch", "p_svm"]
+        [
+            "p_xgb",
+            "p_tabnet",
+            "p_lgb",
+            "p_extra",
+            "p_mlp",
+            "p_pytorch",
+            "p_svm",
+            "p_fnn",
+        ]
@@
-        + ["max_prob", "min_prob", "range_prob"]  # 3 range
+        + ["max_prob", "min_prob", "range_prob"]
         + [
             "rank_xgb",
             ...
             "rank_svm",
+            "rank_fnn",
         ]

Don’t forget to define the new diff_*_fnn arrays before the list, and to bump the expected feature count comment to 50 (8 base + 1 avg + 28 diffs + 3 range + 8 ranks + 2 votes). Based on the call sites in ensemble_model_0414.py.

src/models/StackedEnsemble/base/neural/mlp_model_25.py (1)

348-351: Pass an actual DataFrame to log_to_mlflow during Optuna logging

Inside the objective we call log_to_mlflow(..., X_eval), but here X_eval is the scaled NumPy array. log_to_mlflow immediately tries X_eval.iloc[:5], triggering 'numpy.ndarray' object has no attribute 'iloc' as soon as a trial exceeds the score threshold. Please thread the raw evaluation DataFrame through the optimizer so logging keeps working.

-def optimize_hyperparameters(
-    X_train, y_train, X_test, y_test, X_eval, y_eval, hyperparameter_space
-):
+def optimize_hyperparameters(
+    X_train,
+    y_train,
+    X_test,
+    y_test,
+    X_eval,
+    y_eval,
+    hyperparameter_space,
+    X_eval_raw,
+):
@@
-                log_to_mlflow(model, metrics, params, experiment_name, scaler, X_eval)
+                log_to_mlflow(
+                    model,
+                    metrics,
+                    params,
+                    experiment_name,
+                    scaler,
+                    X_eval_raw,
+                )
@@
-        best_params = optimize_hyperparameters(
+        best_params = optimize_hyperparameters(
             X_train_scaled,
             y_train,
             X_test_scaled,
             y_test,
             X_eval_scaled,
             y_eval,
-            hyperparameter_space,
+            hyperparameter_space,
+            X_eval,
         )

Make the same signature change in any other call paths (e.g., tests) to avoid regressions.

src/models/ensemble/ensemble_model_0410.py (1)

346-351: Stop double-scaling PyTorch inputs

self.model_pytorch.predict_proba already applies its own scaler (scaler_.transform(...), see src/models/StackedEnsemble/base/neural/pytorch_hypertuner.py). Pre-scaling here runs that transform twice, skewing every prediction. Pass the raw features instead.
-        X_pytorch_scaled = self.model_pytorch_scaler.transform(X_pytorch)
-        p_pytorch = self.model_pytorch.predict_proba(X_pytorch_scaled)[:, 1]
+        p_pytorch = self.model_pytorch.predict_proba(X_pytorch)[:, 1]

src/models/StackedEnsemble/base/tree_based/catboost_model.py (1)

538-553: Guard against zero positives in permutation importance

When the model predicts no positives, both the baseline precision (np.sum(preds == 1)) and the per-shuffle precision denominator drop to zero, raising a ZeroDivisionError. Add a safe guard (e.g., denom = max(np.sum(...), 1) or an epsilon) before dividing so permutation importance still runs on edge cases.

-    baseline = np.sum((y_val_np == 1) & (preds == 1)) / (np.sum(preds == 1))
+    predicted_positives = np.sum(preds == 1)
+    baseline = (
+        np.sum((y_val_np == 1) & (preds == 1)) / predicted_positives
+        if predicted_positives > 0
+        else 0.0
+    )
...
-            precision = np.sum((y_val_np == 1) & (preds_shuffled == 1)) / (
-                np.sum(preds_shuffled == 1)
-            )
+            predicted_positives_shuffled = np.sum(preds_shuffled == 1)
+            precision = (
+                np.sum((y_val_np == 1) & (preds_shuffled == 1)) / predicted_positives_shuffled
+                if predicted_positives_shuffled > 0
+                else 0.0
+            )

src/models/StackedEnsemble/base/tree_based/lightgbm_model_25.py (1)

438-466: Log the LightGBM model only once.

Lines 438-466 call mlflow.lightgbm.log_model twice on the same artifact path (“model”). Mlflow raises MlflowException when the second call tries to overwrite the existing artifacts, so the run fails. Log once after you’ve inferred the signature.

-            # Log model
-            model_info = mlflow.lightgbm.log_model(
-                model,
-                "model",
-                registered_model_name=f"lightgbm_{datetime.now().strftime('%Y%m%d_%H%M')}",
-            )
             # Create input example for model signature
             input_example = X_eval.head(5)
             # Handle integer columns by converting them to float64 to properly manage missing values
             input_example = X_eval.iloc[:5].copy() if hasattr(X_eval, "iloc") else X_eval[:5].copy()
@@
-            model_info = mlflow.lightgbm.log_model(
-                model,
-                "model",
-                registered_model_name=f"lightgbm_{datetime.now().strftime('%Y%m%d_%H%M')}",
-                signature=signature,
-            )
+            model_info = mlflow.lightgbm.log_model(
+                model,
+                "model",
+                registered_model_name=f"lightgbm_{datetime.now().strftime('%Y%m%d_%H%M')}",
+                signature=signature,
+            )

src/models/ensemble/ensemble_model_20.py (1)

145-160: Restore optional validation/test handling in train()

x_val, y_val, X_test, and y_test still default to None, yet Lines 156-159 assert they are DataFrames before any fallback logic runs. Any existing caller relying on the previous “split_validation=True” behaviour now hits an AssertionError (or AttributeError when prepare_data runs) even though the signature still advertises optional inputs. Please reinstate the old guard (or make the arguments required) before preparing the data, e.g.:
- assert isinstance(x_val, pd.DataFrame), f"x_val must be a DataFrame, got {type(x_val)}"
- assert isinstance(X_test, pd.DataFrame), f"X_test must be a DataFrame, got {type(X_test)}"
+ if any(arg is None for arg in (x_val, y_val, X_test, y_test)):
+     if split_validation:
+         # fall back to splitting below
+         pass
+     else:
+         raise ValueError("train() now requires explicit x_val/y_val/X_test/y_test inputs when split_validation=False")
and only call prepare_data after the inputs are known to exist. Otherwise the refactor breaks every path that depended on the earlier optional arguments.

src/models/StackedEnsemble/shared/config_loader.py (1)

51-63: YAML: add UTF‑8 encoding, validate type, and log with traceback

Ensure loaded YAML is a dict; include traceback with logger.exception.

-            with open(config_path) as f:
+            with open(config_path, encoding="utf-8") as f:
                 config = yaml.safe_load(f)
 
-            # Validate configuration
+            if not isinstance(config, dict):
+                self.logger.error(f"Configuration at {config_path} must be a mapping (dict)")
+                raise ValueError(f"Invalid YAML structure at {config_path}")
+
+            # Validate configuration
             self._validate_model_config(config, model_type)
             return config
 ...
-        except yaml.YAMLError as e:
-            self.logger.error(f"Error parsing configuration file: {e}")
+        except yaml.YAMLError as e:
+            self.logger.exception(f"Error parsing configuration file: {e}")
             raise

-            with open(space_path) as f:
+            with open(space_path, encoding="utf-8") as f:
                 space = yaml.safe_load(f)
 
-            # Validate search space
+            if not isinstance(space, dict):
+                self.logger.error(f"Hyperparameter space at {space_path} must be a mapping (dict)")
+                raise ValueError(f"Invalid YAML structure at {space_path}")
+
+            # Validate search space
             self._validate_hyperparameter_space(space, model_type)
             return space
 ...
-        except yaml.YAMLError as e:
-            self.logger.error(f"Error parsing hyperparameter space file: {e}")
+        except yaml.YAMLError as e:
+            self.logger.exception(f"Error parsing hyperparameter space file: {e}")
             raise

Also applies to: 82-94

🧹 Nitpick comments (21)

src/models/StackedEnsemble/shared/data_loader.py (1)

92-94: Consider extracting duplicate inf replacement logic.

The inf value replacement logic is repeated three times for the three dataframes. Consider extracting this into a helper method to reduce duplication and improve maintainability.

Example refactor:

def _replace_inf_values(self, df: pd.DataFrame) -> pd.DataFrame:
    """Replace infinite values with 0 in a DataFrame."""
    return df.replace([np.inf, -np.inf], 0)

Then apply it to each dataset:

         # Replace inf values with 0 in all data splits
         self.logger.info("Replacing inf values with 0 in all data splits")
-        X_train = X_train.replace([np.inf, -np.inf], 0)
-        X_test = X_test.replace([np.inf, -np.inf], 0)
-        X_val = X_val.replace([np.inf, -np.inf], 0)
+        X_train = self._replace_inf_values(X_train)
+        X_test = self._replace_inf_values(X_test)
+        X_val = self._replace_inf_values(X_val)

src/utils/K_factor_calculation.py (4)

13-25: Consider optimizing repeated DataFrame filtering.

The current implementation filters league_data twice for each unique team (once for home matches, once for away matches), resulting in O(n×m) complexity. For large datasets, this could be slow.

Consider using groupby to compute team-level statistics more efficiently:
# Compute home stats
home_stats = league_data.groupby("home_encoded").agg({
    "Home_draws": "first",
    "Home_team_matches": "first"
})

# Compute away stats  
away_stats = league_data.groupby("away_encoded").agg({
    "Away_draws": "first",
    "Away_team_matches": "first"
})

# Combine stats
for team_id in league_data["home_encoded"].unique():
    if team_id not in home_stats.index or team_id not in away_stats.index:
        continue
    
    home_draws = home_stats.loc[team_id, "Home_draws"]
    home_matches = home_stats.loc[team_id, "Home_team_matches"]
    away_draws = away_stats.loc[team_id, "Away_draws"]
    away_matches = away_stats.loc[team_id, "Away_team_matches"]
    
    if home_matches > 0 and away_matches > 0:
        team_draw_rate = (home_draws + away_draws) / (home_matches + away_matches)
        draw_rates.append(team_draw_rate)
This reduces complexity to O(m) for the groupby operations plus O(n) for the iteration.

32-36: Consider vectorized operations instead of iterrows().

Using .iterrows() is inefficient for large DataFrames. For better performance, consider vectorized operations:
# Vectorized alternative
draw_mask = league_data["match_outcome"] == 2
draw_matches = league_data[draw_mask]
goal_diff_data = abs(draw_matches.get("home_goals", 0) - draw_matches.get("away_goals", 0)).tolist()
Note: The current use of .get() with default values is good defensive programming.

69-72: Consider vectorized operations for better performance.

Similar to the earlier loop, using .iterrows() is inefficient. Consider a vectorized approach:
# Vectorized alternative
draw_mask = (league_data["match_outcome"] == 2) & \
            league_data["home_team_elo"].notna() & \
            league_data["away_team_elo"].notna()
draw_matches = league_data[draw_mask]
draw_elo_diffs = abs(draw_matches["home_team_elo"] - draw_matches["away_team_elo"]).tolist()
94-96: Use logger.exception() for better error diagnostics.

When logging within an exception handler, logger.exception() automatically includes the stack trace, making debugging easier than logger.error().

Apply this diff:
 except Exception as e:
-    logger.error(f"Error calculating draw K-factor: {str(e)}")
+    logger.exception("Error calculating draw K-factor")
     return 25  # Default draw K-factor
Note: logger.exception() automatically includes the exception message and traceback, so explicit message formatting is unnecessary.

Based on coding guidelines.

data/Create_data/add_ELO_scores.py (1)

211-213: Use logger instead of print.

For consistency with the rest of the module.
-            print(f"ELO created for {elo_count} matches")
+            self.logger.info(f"ELO created for {elo_count} matches")

src/models/StackedEnsemble/base/tree_based/feature_selection/xgboost_boruta.py (2)

52-66: Set determinism and align threads on the classifier.

Add random_state and n_jobs to match global seeds and thread caps.

-    params = {
+    params = {
         "alpha": 55.8,
         "colsample_bytree": 0.885,
         "eval_metric": ["aucpr", "error", "logloss"],
         "gamma": 4.43,
         "lambda": 6.94,
         "learning_rate": 0.15,
         "max_depth": 10,
         "min_child_weight": 635,
         "scale_pos_weight": 2.7,
         "subsample": 0.795,
-    }
-    xgb_clf = xgb.XGBClassifier(**params)
+        "random_state": SEED,
+        "n_jobs": 4,
+    }
+    xgb_clf = xgb.XGBClassifier(**params)

87-90: Avoid print; log or persist to a defined artifacts directory.

Use logger and explicit path (e.g., under logs/artifacts) for traceability.

-    selected_features = feature_selector.Subset().columns.tolist()
-    print("Selected features:", selected_features)
+    selected_features = feature_selector.Subset().columns.tolist()
+    logger.info(f"Selected features: {len(selected_features)}")

data/Create_data/add_poisson_xG.py (3)

161-163: Consider removing add_constant with scikit-learn PoissonRegressor.

PoissonRegressor already includes an intercept; adding a constant risks collinearity.

-            # Add constant term for statsmodels
-            X_scaled = sm.add_constant(X_scaled)
+            # Intercept is handled by scikit-learn's PoissonRegressor (fit_intercept=True)
+            # Keep features as-is

238-239: Use logger instead of print.

-            print(f"X shape: {X.shape}")
+            self.logger.debug(f"Feature matrix shape: {X.shape}")

285-288: Avoid using 'type' as a parameter name.

Shadows built-in and reduces clarity.

-    def add_poisson_xG(
-        self, df: pd.DataFrame, base_df: pd.DataFrame, type: str, is_training: bool = False
-    ) -> pd.DataFrame:
+    def add_poisson_xG(
+        self, df: pd.DataFrame, base_df: pd.DataFrame, dataset_type: str, is_training: bool = False
+    ) -> pd.DataFrame:
...
-        output_path = datasets[type]
+        output_path = datasets[dataset_type]
...
-            self.logger.error(f"Failed to export {type} data: {str(e)}")
+            self.logger.error(f"Failed to export {dataset_type} data: {str(e)}")

Also applies to: 316-316, 335-341

src/models/StackedEnsemble/base/neural/pytorch_hypertuner.py (2)

565-577: Replace lambda with named function for clarity and lint compliance.

Improves readability and satisfies E731.

Apply this diff:

-    objective_func = lambda trial: objective(
-        trial,
-        X_train,
-        y_train,
-        X_test,
-        y_test,
-        X_val,
-        y_val,
-        hyperparameter_space,
-        input_dim,
-        device,
-        scaler,
-    )
+    def objective_func(trial):
+        return objective(
+            trial,
+            X_train,
+            y_train,
+            X_test,
+            y_test,
+            X_val,
+            y_val,
+            hyperparameter_space,
+            input_dim,
+            device,
+            scaler,
+        )

419-423: Log full traceback when threshold optimization fails.

Use logger.exception for actionable diagnostics.

Apply this diff:

-            logger.error(f"TypeError calling optimize_threshold: {te}")
-            logger.error(
+            logger.exception(f"TypeError calling optimize_threshold: {te}")
+            logger.exception(
                 "Ensure optimize_threshold in hypertuner_utils.py expects (model, X, y, min_recall)."
             )

src/models/StackedEnsemble/base/tree_based/lightgbm_model.py (1)

594-596: Prefer logger.exception for failures in feature-combination trials.

Keeps stack traces for debugging; behavior unchanged.

Apply this diff:
-        except Exception as e:
-            logger.error(f"Trial {trial + 1} failed: {e}")
+        except Exception as e:
+            logger.exception(f"Trial {trial + 1} failed: {e}")

src/models/ensemble/meta_features_0414.py (1)

195-233: Align column-name generation with num_models programmatically.

Hardcoded lists are brittle (comment says 41, actual is 45). Generate names from all_preds and computed diffs to avoid drift.

I can provide a helper to auto-build col_names from base names, pairs, and ranks to always match meta_features.shape[1]. Want a patch?

src/models/StackedEnsemble/shared/config_loader.py (1)

28-30: Avoid "None_config_loader" logger name when model_type/experiment_name are None

Fallback currently yields "None_config_loader". Derive a sensible default.

-        self.logger = logger or ExperimentLogger(
-            experiment_name=experiment_name or f"{model_type}_config_loader"
-        )
+        loader_name = (
+            experiment_name
+            or (f"{model_type}_config_loader" if model_type else "config_loader")
+        )
+        self.logger = logger or ExperimentLogger(experiment_name=loader_name)

data/Create_data/api_football/get_fixtures_postgre.py (5)

223-226: Return type mismatch (function can return None)

Function returns None on validation failures but is annotated as dict. Adjust to Optional.

-    def _parse_team_stat_response(
-        self, api_response_data: dict, fixture_id: int, target_team_id: Any
-    ) -> dict:
+    def _parse_team_stat_response(
+        self, api_response_data: dict, fixture_id: int, target_team_id: Any
+    ) -> Optional[dict[str, Any]]:

Add import if missing:

-from typing import Any
+from typing import Any, Optional

385-386: Use timezone-aware timestamps (UTC) for updated_at

Naive datetimes cause ambiguity across services/DB. Prefer UTC-aware.

-            "updated_at": datetime.now(),
+            "updated_at": datetime.now(timezone.utc),

Also update imports:

-from datetime import datetime, timedelta
+from datetime import datetime, timedelta, timezone

Also applies to: 875-875, 1064-1064, 1173-1173

470-473: Log exceptions with tracebacks

Use logger.exception inside except blocks to capture stack traces. Also avoid blanket Exception where possible.

Example changes:

-                    except SQLAlchemyError as e_check:
-                        self.logger.error(
+                    except SQLAlchemyError as e_check:
+                        self.logger.exception(
                             f"DB error checking existing season stats for team {team_id_to_fetch}, league {league_id}, season {season}: {e_check}"
                         )

-            except Exception as e:
-                self.logger.error(
+            except Exception as e:
+                self.logger.exception(
                     f"Error updating statistics for fixture {fixture_id} in PostgreSQL: {e}"
                 )

-        except requests.exceptions.RequestException as e:
-            message = f"API request failed for fixture ID {fixture_id}: {e}"
-            self.logger.error(message)
+        except requests.exceptions.RequestException as e:
+            message = f"API request failed for fixture ID {fixture_id}: {e}"
+            self.logger.exception(message)

-        except (Exception, psycopg2.Error) as e:
-            message = f"Error processing or upserting prediction for fixture ID {fixture_id}: {e}"
-            self.logger.error(message)
+        except Exception as e:
+            message = f"Error processing or upserting prediction for fixture ID {fixture_id}: {e}"
+            self.logger.exception(message)

-        except Exception as e:
-            self.logger.error(
+        except Exception as e:
+            self.logger.exception(
                 f"Unexpected error processing events for fixture_id {fixture_id}: {e}"
             )

Based on static analysis hints.

Also applies to: 550-552, 699-701, 1246-1248, 932-938

921-923: Remove prints; rely on logger

Keep output consistent and structured.

-                self.logger.info(message)
-                print(message)
+                self.logger.info(message)

-                message = f"No prediction data found in API response for fixture ID: {fixture_id}"
-                self.logger.warning(message)
-                print(message)
+                message = f"No prediction data found in API response for fixture ID: {fixture_id}"
+                self.logger.warning(message)

Also applies to: 807-807

1011-1015: Add timeout to teams API request (venues update)

Prevent indefinite hangs; optionally reuse _get_request for consistency.

-                    response = requests.get(url, headers=headers)
+                    response = requests.get(url, headers=headers, timeout=30)

coderabbitai · 2025-10-12T22:59:37Z


 # PostgreSQL engine setup
-engine = create_engine('postgresql+psycopg2://postgres:ronaldo99@localhost:5432/api_football')
+engine = create_engine("postgresql+psycopg2://postgres:ronaldo99@localhost:5432/api_football")


⚠️ Potential issue | 🔴 Critical

Critical: hardcoded Postgres credentials in source

Leak of DB password in VCS. Load from env and enable pool_pre_ping. Rotate the exposed credential.

- engine = create_engine("postgresql+psycopg2://postgres:ronaldo99@localhost:5432/api_football") + db_url = os.getenv("API_FOOTBALL_DB_URL") + if not db_url: + raise RuntimeError( + "Missing API_FOOTBALL_DB_URL. Put your SQLAlchemy URL in the environment/.env." + ) + engine = create_engine(db_url, pool_pre_ping=True)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

engine = create_engine("postgresql+psycopg2://postgres:ronaldo99@localhost:5432/api_football")

db_url = os.getenv("API_FOOTBALL_DB_URL")

if not db_url:

raise RuntimeError(

"Missing API_FOOTBALL_DB_URL. Put your SQLAlchemy URL in the environment/.env."

)

engine = create_engine(db_url, pool_pre_ping=True)

🤖 Prompt for AI Agents

In data/Create_data/api_football/get_fixtures_postgre.py around line 34, the Postgres connection string is hardcoded causing a credentials leak; replace the hardcoded URL with a connection string sourced from environment variables (e.g., DB_USER, DB_PASS, DB_HOST, DB_PORT, DB_NAME or a single DATABASE_URL) and construct the SQLAlchemy engine from that value, enable pool_pre_ping=True when calling create_engine to avoid stale connections, and remove the plaintext credential from the repo (rotate the exposed password immediately).

coderabbitai · 2025-10-12T22:59:37Z

+                    if (
+                        k == "passes_percent"
+                        or k == "passes_%"
+                        and isinstance(val, str)
+                        and val.endswith("%")
+                    ):
                        try:
-                            val = float(val.replace('%', ''))
+                            val = float(val.replace("%", ""))
                        except Exception:


⚠️ Potential issue | 🟠 Major

Fix precedence bug in passes percentage handling

Current condition mixes or/and without parentheses and checks an unused key. Tighten to only handle "passes_%" key.

- if ( - k == "passes_percent" - or k == "passes_%" - and isinstance(val, str) - and val.endswith("%") - ): + if k == "passes_%" and isinstance(val, str) and val.endswith("%"): try: val = float(val.replace("%", "")) except Exception: val = None

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if (

k == "passes_percent"

or k == "passes_%"

and isinstance(val, str)

and val.endswith("%")

):

try:

val = float(val.replace('%', ''))

val = float(val.replace("%", ""))

except Exception:

if k == "passes_%" and isinstance(val, str) and val.endswith("%"):

try:

val = float(val.replace("%", ""))

except Exception:

val = None

🧰 Tools

🪛 Ruff (0.13.3)

671-673: Parenthesize a and b expressions when chaining and and or together, to make the precedence clear

Parenthesize the and subexpression

(RUF021)

677-677: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

In data/Create_data/api_football/get_fixtures_postgre.py around lines 669 to 677, the conditional mixes or/and without parentheses and checks an unintended key; change the condition to only handle the "passes_%" key with proper precedence: use if k == "passes_%" and isinstance(val, str) and val.endswith("%"): then proceed to strip the "%" and convert to float inside the try/except as before (no other keys), ensuring the boolean logic only applies to the percent-string check.

coderabbitai · 2025-10-12T22:59:37Z

+                result = conn.execute(query, {"today": today, "league_ids": target_league_ids})
+


⚠️ Potential issue | 🟡 Minor

Do not pass unused parameters to SQLAlchemy text() execution

The query doesn't use :today or :league_ids. Passing extra bind params can fail on some drivers.

- result = conn.execute(query, {"today": today, "league_ids": target_league_ids}) + result = conn.execute(query)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

result = conn.execute(query, {"today": today, "league_ids": target_league_ids})

result = conn.execute(query)

🤖 Prompt for AI Agents

In data/Create_data/api_football/get_fixtures_postgre.py around lines 767-768, the call to conn.execute(query, {"today": today, "league_ids": target_league_ids}) is passing bind parameters that the SQL text() does not use; remove the unused "today" and "league_ids" parameters (or alternatively add matching :today/:league_ids bind placeholders to the SQL if they are intended to be used) so that execute() only receives parameters referenced by the query and avoid driver failures.

coderabbitai · 2025-10-12T22:59:37Z

+        api_url = f"{self.base_url}/predictions?fixture={fixture_id}"
+        try:
+            response = requests.get(api_url, headers=self.headers)
+            response.raise_for_status()  # Raises an HTTPError for bad responses (4XX or 5XX)

-                # Calculate H2H stats
-                h2h_home_wins_calc, h2h_away_wins_calc, h2h_draws_calc = 0, 0, 0
-                if fixture_home_team_id is not None and fixture_away_team_id is not None and h2h_list_api:
-                    for match in h2h_list_api:
-                        match_teams = match.get("teams", {})
-                        match_goals = match.get("goals", {})
-
-                        h2h_match_home_id = match_teams.get("home", {}).get("id")
-
-                        h2h_gh = match_goals.get("home") 
-                        h2h_ga = match_goals.get("away") 
-
-                        if isinstance(h2h_gh, (int, float)) and isinstance(h2h_ga, (int, float)):
-                            if h2h_gh == h2h_ga:
-                                h2h_draws_calc += 1
-                            elif h2h_match_home_id == fixture_home_team_id: 
-                                if h2h_gh > h2h_ga: 
-                                    h2h_home_wins_calc +=1 
-                                else: 
-                                    h2h_away_wins_calc +=1 
-                            elif h2h_match_home_id == fixture_away_team_id: 
-                                if h2h_ga > h2h_gh:
-                                    h2h_home_wins_calc += 1
-                                else:
-                                    h2h_away_wins_calc += 1
-
-                data_to_upsert["h2h_home_wins"] = h2h_home_wins_calc
-                data_to_upsert["h2h_away_wins"] = h2h_away_wins_calc
-                data_to_upsert["h2h_draws"] = h2h_draws_calc
-                data_to_upsert["h2h_total_matches"] = len(h2h_list_api) if h2h_list_api else 0
-
-                # Upsert to PostgreSQL
-                stmt = pg_insert(predictions_table).values(**data_to_upsert)
-                update_dict = {c: stmt.excluded[c] for c in data_to_upsert.keys() if c != 'fixture_id'}
-                stmt = stmt.on_conflict_do_update(index_elements=['fixture_id'], set_=update_dict)
-                try:
-                    with engine.begin() as conn:
-                        conn.execute(stmt)
-                    message = f"Successfully upserted prediction for fixture ID: {fixture_id}"
-                    self.logger.info(message)
-                    print(message)
-                    return True
-                except SQLAlchemyError as e:
-                    message = f"Error upserting prediction for fixture ID {fixture_id}: {e}"
-                    self.logger.error(message)
-                    print(message)
-                    return False
-
-            except requests.exceptions.RequestException as e:
-                message = f"API request failed for fixture ID {fixture_id}: {e}"
-                self.logger.error(message)
+            api_data = response.json()
+
+            if not api_data.get("response"):
+                message = f"No prediction data found in API response for fixture ID: {fixture_id}"
+                self.logger.warning(message)


⚠️ Potential issue | 🟠 Major

Predictions fetch: fix double slash and add request timeout

Avoid base_url double slash and include a timeout to prevent hanging calls.

- api_url = f"{self.base_url}/predictions?fixture={fixture_id}" + api_url = f"{self.base_url.rstrip('/')}/predictions?fixture={fixture_id}" try: - response = requests.get(api_url, headers=self.headers) + response = requests.get(api_url, headers=self.headers, timeout=30) response.raise_for_status() # Raises an HTTPError for bad responses (4XX or 5XX)

Also applies to: 799-801

🧰 Tools

🪛 Ruff (0.13.3)

799-799: Probable use of requests call without timeout

(S113)

🤖 Prompt for AI Agents

In data/Create_data/api_football/get_fixtures_postgre.py around lines 797 to 806, the API call may produce a double slash when self.base_url already ends with a slash and lacks a timeout which can cause hanging requests; ensure the URL is built without duplicating slashes (e.g., strip a trailing slash from self.base_url or join paths safely) and add a timeout parameter to the requests.get call (apply the same change for the adjacent calls on lines 799–801) so the request becomes requests.get(<cleaned_url>, headers=self.headers, timeout=<reasonable_seconds>).

coderabbitai · 2025-10-12T22:59:39Z

+    xgb_features, _ = xgboost_staged_selection(
+        X, y, X_test, y_test, X_eval, y_eval, target_features + 20
+    )
+    lgb_features, _ = lightgbm_staged_selection(
+        X, y, X_test, y_test, X_eval, y_eval, target_features + 20
+    )
+


⚠️ Potential issue | 🔴 Critical

Fix staged-selection call signature.

xgboost_staged_selection and lightgbm_staged_selection accept (X, y, X_eval, y_eval, target_features). The new call site passes X_test/y_test, so the function invocation raises TypeError: ... takes 5 positional arguments but 7 were given, aborting the Boruta pipeline. Drop the extra arguments:

- xgb_features, _ = xgboost_staged_selection( - X, y, X_test, y_test, X_eval, y_eval, target_features + 20 - ) + xgb_features, _ = xgboost_staged_selection( + X, y, X_eval, y_eval, target_features + 20 + ) - lgb_features, _ = lightgbm_staged_selection( - X, y, X_test, y_test, X_eval, y_eval, target_features + 20 - ) + lgb_features, _ = lightgbm_staged_selection( + X, y, X_eval, y_eval, target_features + 20 + )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

xgb_features, _ = xgboost_staged_selection(

X, y, X_test, y_test, X_eval, y_eval, target_features + 20

)

lgb_features, _ = lightgbm_staged_selection(

X, y, X_test, y_test, X_eval, y_eval, target_features + 20

)

xgb_features, _ = xgboost_staged_selection(

X, y, X_eval, y_eval, target_features + 20

)

lgb_features, _ = lightgbm_staged_selection(

X, y, X_eval, y_eval, target_features + 20

)

🤖 Prompt for AI Agents

In src/models/StackedEnsemble/base/tree_based/feature_selection/lightgbm_boruta.py around lines 287 to 293, the staged-selection calls pass X_test and y_test but the signatures are (X, y, X_eval, y_eval, target_features); remove the extra X_test and y_test arguments and call xgboost_staged_selection and lightgbm_staged_selection with (X, y, X_eval, y_eval, target_features + 20) so the number and order of positional arguments match the function definitions.

coderabbitai · 2025-10-12T22:59:39Z

            # Calculate precision directly instead of using metric parameter
-            precision = np.sum((y_val_np == 1) & (preds_shuffled == 1)) / (np.sum(preds_shuffled == 1))
+            precision = np.sum((y_val_np == 1) & (preds_shuffled == 1)) / (
+                np.sum(preds_shuffled == 1)
+            )


⚠️ Potential issue | 🔴 Critical

Guard precision calculation in permutation loop.

Avoid division by zero when no positives are predicted.

Apply this diff:

- # Calculate precision directly instead of using metric parameter - precision = np.sum((y_val_np == 1) & (preds_shuffled == 1)) / ( - np.sum(preds_shuffled == 1) - ) + # Calculate precision safely + _pp_shuf = np.sum(preds_shuffled == 1) + precision = ( + 0.0 if _pp_shuf == 0 else np.sum((y_val_np == 1) & (preds_shuffled == 1)) / _pp_shuf + )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Calculate precision directly instead of using metric parameter

precision = np.sum((y_val_np == 1) & (preds_shuffled == 1)) / (np.sum(preds_shuffled == 1))

precision = np.sum((y_val_np == 1) & (preds_shuffled == 1)) / (

np.sum(preds_shuffled == 1)

)

# Calculate precision safely

_pp_shuf = np.sum(preds_shuffled == 1)

precision = (

0.0 if _pp_shuf == 0 else np.sum((y_val_np == 1) & (preds_shuffled == 1)) / _pp_shuf

)

🤖 Prompt for AI Agents

In src/models/StackedEnsemble/base/tree_based/random_forest_30.py around lines 561 to 564, the precision calculation can divide by zero when preds_shuffled contains no positive predictions; modify the permutation loop to compute denom = np.sum(preds_shuffled == 1) and only perform the division when denom > 0, otherwise set precision to 0 (or an agreed sentinel like np.nan), so the code guards against ZeroDivisionError and yields a defined precision value when there are no predicted positives.

coderabbitai · 2025-10-12T22:59:39Z

+            if hasattr(self.model, "predict_proba") and callable(getattr(self.model, "predict_proba", None)):
+                # sklearn model
+                predictions = self.model.predict(df)
+                pos_probas = self.model.predict_proba(df)[:, 1]  # type: ignore # Get positive class probabilities
+            else:
+                # PyFuncModel - use predict method which returns probabilities
+                predictions = self.test_model.predict(df)
+                # For binary classification, PyFuncModel predict returns probabilities for both classes
+                if predictions.ndim == 2 and predictions.shape[1] == 2:
+                    pos_probas = predictions[:, 1]
+                    predictions = (pos_probas >= self.threshold).astype(int)
+                else:
+                    # If predict returns 1D array, it might be class predictions
+                    pos_probas = predictions
+                    predictions = (pos_probas >= self.threshold).astype(int)
+
            # Ensure we have a 1D numpy array
            if not isinstance(pos_probas, np.ndarray):
                pos_probas = np.array(pos_probas)
        except Exception as e:
            print(f"Error predicting: {e}")
            if "use_label_encoder" in str(e):
                print("Attribute error due to missing 'use_label_encoder'. Patching model...")
-                self.model.use_label_encoder = False
-                predictions = self.model.predict(df)
-                pos_probas = self.model.predict_proba(df)
+                if self.model is None:
+                    raise ValueError(MODEL_NOT_LOADED_ERROR) from e
+                # Fallback to PyFuncModel predict
+                predictions = self.test_model.predict(df)
+                if predictions.ndim == 2 and predictions.shape[1] == 2:
+                    pos_probas = predictions[:, 1]
+                    predictions = (pos_probas >= self.threshold).astype(int)
+                else:
+                    pos_probas = predictions
+                    predictions = (pos_probas >= self.threshold).astype(int)
+
+        # Ensure predictions and probabilities were obtained


⚠️ Potential issue | 🔴 Critical

Apply the tuned threshold to sklearn predictions

For the sklearn path you keep self.model.predict(df) (hard-coded 0.5 threshold) and never reapply the tuned self.threshold, so all downstream metrics ignore the calibrated cutoff. Mirror the PyFunc branch: threshold the positive-class probabilities yourself.

- if hasattr(self.model, "predict_proba") and callable(getattr(self.model, "predict_proba", None)): - # sklearn model - predictions = self.model.predict(df) - pos_probas = self.model.predict_proba(df)[:, 1] # type: ignore # Get positive class probabilities + if hasattr(self.model, "predict_proba") and callable(getattr(self.model, "predict_proba", None)): + # sklearn model + pos_probas = self.model.predict_proba(df)[:, 1] # type: ignore + predictions = (pos_probas >= self.threshold).astype(int)

Make the same adjustment in the exception fallback so both paths honour the calibrated threshold.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if hasattr(self.model, "predict_proba") and callable(getattr(self.model, "predict_proba", None)):

# sklearn model

predictions = self.model.predict(df)

pos_probas = self.model.predict_proba(df)[:, 1] # type: ignore # Get positive class probabilities

else:

# PyFuncModel - use predict method which returns probabilities

predictions = self.test_model.predict(df)

# For binary classification, PyFuncModel predict returns probabilities for both classes

if predictions.ndim == 2 and predictions.shape[1] == 2:

pos_probas = predictions[:, 1]

predictions = (pos_probas >= self.threshold).astype(int)

else:

# If predict returns 1D array, it might be class predictions

pos_probas = predictions

predictions = (pos_probas >= self.threshold).astype(int)

# Ensure we have a 1D numpy array

if not isinstance(pos_probas, np.ndarray):

pos_probas = np.array(pos_probas)

except Exception as e:

print(f"Error predicting: {e}")

if "use_label_encoder" in str(e):

print("Attribute error due to missing 'use_label_encoder'. Patching model...")

self.model.use_label_encoder = False

predictions = self.model.predict(df)

pos_probas = self.model.predict_proba(df)

if self.model is None:

raise ValueError(MODEL_NOT_LOADED_ERROR) from e

# Fallback to PyFuncModel predict

predictions = self.test_model.predict(df)

if predictions.ndim == 2 and predictions.shape[1] == 2:

pos_probas = predictions[:, 1]

predictions = (pos_probas >= self.threshold).astype(int)

else:

pos_probas = predictions

predictions = (pos_probas >= self.threshold).astype(int)

# Ensure predictions and probabilities were obtained

if hasattr(self.model, "predict_proba") and callable(getattr(self.model, "predict_proba", None)):

# sklearn model

pos_probas = self.model.predict_proba(df)[:, 1] # type: ignore

predictions = (pos_probas >= self.threshold).astype(int)

else:

# PyFuncModel - use predict method which returns probabilities

predictions = self.test_model.predict(df)

# For binary classification, PyFuncModel predict returns probabilities for both classes

if predictions.ndim == 2 and predictions.shape[1] == 2:

pos_probas = predictions[:, 1]

predictions = (pos_probas >= self.threshold).astype(int)

else:

# If predict returns 1D array, it might be class predictions

pos_probas = predictions

predictions = (pos_probas >= self.threshold).astype(int)

🤖 Prompt for AI Agents

In src/predictors/predict_ensemble.py around lines 108 to 142, the sklearn branch currently uses self.model.predict(df) (implicit 0.5 cutoff) and never applies the calibrated self.threshold; update the sklearn path to compute pos_probas = self.model.predict_proba(df)[:,1] and then set predictions = (pos_probas >= self.threshold).astype(int); also make the same change in the exception fallback branch so when you fallback to PyFuncModel or re-run prediction you always derive pos_probas and then apply (pos_probas >= self.threshold).astype(int) for predictions, ensuring pos_probas is converted to a 1D numpy array if needed.

coderabbitai · 2025-10-12T22:59:40Z

+            home_draws = league_data[league_data["home_encoded"] == team_id]["Home_draws"].iloc[0]
+            home_matches = league_data[league_data["home_encoded"] == team_id][
+                "Home_team_matches"
+            ].iloc[0]
+            away_draws = league_data[league_data["away_encoded"] == team_id]["Away_draws"].iloc[0]
+            away_matches = league_data[league_data["away_encoded"] == team_id][
+                "Away_team_matches"
+            ].iloc[0]


⚠️ Potential issue | 🟠 Major

Risk of IndexError when filtering returns empty results.

The code filters league_data by team ID and immediately accesses .iloc[0] without verifying the filtered result is non-empty. If a team appears in home_encoded.unique() but has no matching rows when filtered by that ID (or vice versa for away matches), this will raise an IndexError.

Add a defensive check before accessing .iloc[0]:

for team_id in league_data["home_encoded"].unique(): - home_draws = league_data[league_data["home_encoded"] == team_id]["Home_draws"].iloc[0] - home_matches = league_data[league_data["home_encoded"] == team_id][ - "Home_team_matches" - ].iloc[0] - away_draws = league_data[league_data["away_encoded"] == team_id]["Away_draws"].iloc[0] - away_matches = league_data[league_data["away_encoded"] == team_id][ - "Away_team_matches" - ].iloc[0] + home_data = league_data[league_data["home_encoded"] == team_id] + away_data = league_data[league_data["away_encoded"] == team_id] + + if home_data.empty or away_data.empty: + continue + + home_draws = home_data["Home_draws"].iloc[0] + home_matches = home_data["Home_team_matches"].iloc[0] + away_draws = away_data["Away_draws"].iloc[0] + away_matches = away_data["Away_team_matches"].iloc[0]

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

home_draws = league_data[league_data["home_encoded"] == team_id]["Home_draws"].iloc[0]

home_matches = league_data[league_data["home_encoded"] == team_id][

"Home_team_matches"

].iloc[0]

away_draws = league_data[league_data["away_encoded"] == team_id]["Away_draws"].iloc[0]

away_matches = league_data[league_data["away_encoded"] == team_id][

"Away_team_matches"

].iloc[0]

for team_id in league_data["home_encoded"].unique():

home_data = league_data[league_data["home_encoded"] == team_id]

away_data = league_data[league_data["away_encoded"] == team_id]

if home_data.empty or away_data.empty:

continue

home_draws = home_data["Home_draws"].iloc[0]

home_matches = home_data["Home_team_matches"].iloc[0]

away_draws = away_data["Away_draws"].iloc[0]

away_matches = away_data["Away_team_matches"].iloc[0]

coderabbitai · 2025-10-12T22:59:40Z

        for _, row in league_data.iterrows():
-            if row['match_outcome'] == 2 and 'home_team_elo' in row and 'away_team_elo' in row:
-                elo_diff = abs(row['home_team_elo'] - row['away_team_elo'])
+            if row["match_outcome"] == 2 and "home_team_elo" in row and "away_team_elo" in row:
+                elo_diff = abs(row["home_team_elo"] - row["away_team_elo"])
                draw_elo_diffs.append(elo_diff)


⚠️ Potential issue | 🟠 Major

Logic error in ELO column existence check.

The expression "home_team_elo" in row checks if the column name exists in the DataFrame's column index, not whether the specific row has a non-null value. When iterating with .iterrows(), this check will always return True if the column exists, even if the value is NaN.

Replace with a null-check:

for _, row in league_data.iterrows(): - if row["match_outcome"] == 2 and "home_team_elo" in row and "away_team_elo" in row: + if row["match_outcome"] == 2 and pd.notna(row["home_team_elo"]) and pd.notna(row["away_team_elo"]): elo_diff = abs(row["home_team_elo"] - row["away_team_elo"]) draw_elo_diffs.append(elo_diff)

Additionally, import pandas if not already imported:

import pandas as pd

🤖 Prompt for AI Agents

In src/utils/K_factor_calculation.py around lines 69 to 72, the check `"home_team_elo" in row` is incorrect because iterrows() yields a Series for each row and that test only checks column existence, not whether the value is non-null; change the condition to explicitly test for non-null values (e.g., use pandas.notnull()/pd.notna() or Series.notnull() on row["home_team_elo"] and row["away_team_elo"]) before computing elo_diff, and ensure pandas is imported (import pandas as pd) if not already.

…t Variable Integration - Enhanced `get_fixtures_postgre.py` by adding support for environment variables for database connection, improving security and flexibility. - Refactored variable names in `pytorch_hypertuner_20.py` for consistency and clarity, ensuring better readability. - Updated `.gitignore` to include additional files and directories, maintaining a clean project structure. - Improved error handling in various scripts to ensure robustness during execution. These changes aim to enhance code quality, improve security practices, and ensure better organization within the Soccer Prediction Project.

sonarqubecloud · 2025-10-12T23:13:41Z

Quality Gate failed

Failed conditions
29.0% Duplication on New Code (required ≤ 3%)
E Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

Catch issues before they fail your Quality Gate with our IDE extension SonarQube for IDE

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/models/StackedEnsemble/base/tree_based/lightgbm_model.py (1)

439-501: Critical: Undefined variables will cause runtime NameError.

The log_to_mlflow function references X_train (line 470) and X_eval (lines 472, 478, 481) but these variables are not defined in the function's scope—they are neither parameters nor accessible globals.

Additionally:

Line 470 assigns input_example = X_train.head(5) but immediately overwrites it on line 472
Line 481 mutates the caller's X_eval DataFrame (side effect)

Add x_eval parameter and use it consistently:

-def log_to_mlflow(model, metrics, params, experiment_name):
+def log_to_mlflow(model, metrics, params, experiment_name, x_eval):
     """
     Log trained model, metrics, and parameters to MLflow.
     Args:
         model: Trained LightGBM model
         metrics: Model evaluation metrics
         params: Model parameters
         experiment_name: Experiment name
+        x_eval: Evaluation features for signature generation
     Returns:
         str: Run ID
     """
     try:
         # Set up MLflow tracking
         mlflow.set_experiment(experiment_name)
         # Start a new run
         with mlflow.start_run(run_name=f"lightgbm_{datetime.now().strftime('%Y%m%d_%H%M')}") as run:
             # Log parameters
             for param_name, param_value in params.items():
                 mlflow.log_param(param_name, param_value)

             # Log metrics
             for metric_name, metric_value in metrics.items():
                 mlflow.log_metric(metric_name, metric_value)

-            # Create input example for model signature
-            input_example = X_train.head(5)
-            # Handle integer columns by converting them to float64 to properly manage missing values
-            input_example = X_eval.iloc[:5].copy() if hasattr(X_eval, "iloc") else X_eval[:5].copy()
+            # Create input example for signature inference
+            input_example = x_eval.iloc[:5].copy() if hasattr(x_eval, "iloc") else x_eval[:5].copy()

             # Identify and convert integer columns to float64 to prevent schema enforcement errors
             if hasattr(input_example, "dtypes"):
                 for col in input_example.columns:
-                    if X_eval[col].dtype.kind == "i":
+                    if input_example[col].dtype.kind == "i":
                         logger.info(
                             f"Converting integer column '{col}' to float64 to handle potential missing values"
                         )
-                        X_eval[col] = X_eval[col].astype("float64")
+                        input_example[col] = input_example[col].astype("float64")

             # Infer signature with proper handling for integer columns with potential missing values
             signature = infer_signature(input_example, model.predict(input_example))

Then update the caller on line 544:

-        log_to_mlflow(model, metrics, params, experiment_name)
+        log_to_mlflow(model, metrics, params, experiment_name, x_eval)

♻️ Duplicate comments (4)

data/Create_data/api_football/get_fixtures_postgre.py (3)
683-692: Fix precedence bug in passes percentage handling.

This is a duplicate of a previous review comment. The condition mixes or and and without proper parentheses, causing incorrect logic. The key "passes_percent" is also not used in the API response.

Apply this diff to fix:
-                    elif (
-                        k == "passes_percent"
-                        or k == "passes_%"
-                        and isinstance(val, str)
-                        and val.endswith("%")
-                    ):
+                    elif k == "passes_%" and isinstance(val, str) and val.endswith("%"):
                         try:
                             val = float(val.replace("%", ""))
-                        except (ValueError, AttributeError):
+                        except Exception:
                             val = None
780-782: Remove unused SQL bind parameters.

This is a duplicate of a previous review comment. The query doesn't use :today or :league_ids placeholders, so passing these parameters can cause driver failures.

Apply this diff:
             with engine.connect() as conn:
-                result = conn.execute(query, {"today": today, "league_ids": target_league_ids})
+                result = conn.execute(query)
-
                 for row in result:
811-814: Add timeout to API request to prevent hanging.

This is a duplicate of a previous review comment. The requests.get call lacks a timeout parameter, which can cause the application to hang indefinitely if the API is slow or unresponsive.

Apply this diff:
         api_url = f"{self.base_url}/predictions?fixture={fixture_id}"
         try:
-            response = requests.get(api_url, headers=self.headers)
+            response = requests.get(api_url, headers=self.headers, timeout=30)
             response.raise_for_status()  # Raises an HTTPError for bad responses (4XX or 5XX)
src/models/StackedEnsemble/base/neural/pytorch_hypertuner_20.py (1)
1041-1041: Guard baseline precision against zero positives (duplicate concern)

The baseline precision calculation at line 1041 still divides by np.sum(preds == 1) without protection, which will raise ZeroDivisionError when the model predicts no positives. This issue was flagged in a previous review but remains unresolved.

Apply this diff to guard the denominator:
-        baseline = np.sum((y_val_np == 1) & (preds == 1)) / (np.sum(preds == 1))
+        denom_baseline = max(np.sum(preds == 1), 1)
+        baseline = np.sum((y_val_np == 1) & (preds == 1)) / denom_baseline
This matches the guard used for shuffled precision at lines 1062-1064.

🧹 Nitpick comments (10)

data/Create_data/api_football/get_fixtures_postgre.py (2)
309-311: Use logging.exception in exception handlers for better debugging.

Multiple exception handlers use self.logger.error() without automatically capturing the traceback. Using self.logger.exception() provides better debugging information by including the full stack trace.

Example fixes (apply similar pattern to all flagged locations):
         except (ValueError, TypeError):
-            self.logger.error(
-                f"_parse_team_stat_response: Invalid target_team_id type. Got {target_team_id}"
-            )
+            self.logger.exception(
+                f"_parse_team_stat_response: Invalid target_team_id type. Got {target_team_id}"
+            )
             return None
Apply similar changes at lines 483-485, 564-566, 713-715, 940, 946, 951, 1253, and 1256-1258.

Also applies to: 483-485, 564-566, 713-715, 940-940, 946-946, 951-951, 1253-1253, 1256-1258

949-953: Tighten exception handling to avoid masking errors.

The broad Exception catch can mask unexpected programming errors. Since you're already catching psycopg2.Error separately, consider being more specific about the exceptions you expect.

If you need a catch-all for unexpected errors, consider logging at a different level or re-raising after logging:
         except requests.exceptions.RequestException as e:
             message = f"API request failed for fixture ID {fixture_id}: {e}"
             self.logger.error(message)
             print(message)
             return False
-        except (Exception, psycopg2.Error) as e:
+        except psycopg2.Error as e:
+            message = f"Database error processing prediction for fixture ID {fixture_id}: {e}"
+            self.logger.exception(message)
+            print(message)
+            return False
+        except Exception as e:
             message = f"Error processing or upserting prediction for fixture ID {fixture_id}: {e}"
-            self.logger.error(message)
+            self.logger.exception(message)
             print(message)
             return False
-
src/models/StackedEnsemble/base/tree_based/lightgbm_model.py (4)
391-436: Refactor: Data loading duplication between hypertune_lightgbm and main.

hypertune_lightgbm now loads and prepares data internally (lines 401-411), but main (lines 657-668) also loads and prepares the same data. This creates duplication.

Consider one of these approaches:

Have main load data once and pass it to hypertune_lightgbm (restoring the previous signature)

Have hypertune_lightgbm return the prepared data for use in main

Extract data loading to a separate function used by both

Option 1: Pass data as parameters (restores original pattern):
-def hypertune_lightgbm():
+def hypertune_lightgbm(x_train, y_train, x_test, y_test, x_eval, y_eval):
     """
     Main training function with MLflow tracking.
-    Updated name from hypertune_mlp to hypertune_lightgbm to match notebook.
-    Args:
+    Args:
+        x_train: Training features
+        y_train: Training labels
+        x_test: Testing features
+        y_test: Testing labels
+        x_eval: Evaluation features
+        y_eval: Evaluation labels
     Returns:
         tuple: (best_params, best_metrics)
     """
     try:
-        # Load data
-        dataloader = DataLoader()
-        x_train_raw, y_train, x_test_raw, y_test, x_eval_raw, y_eval = dataloader.load_data()
-        features = import_selected_features_ensemble(model_type="lgbm")
-
-        # Ensure features is a list of strings
-        if not isinstance(features, list):
-            raise ValueError("Expected features to be a list for lgbm model type")
-
-        x_train = prepare_data(x_train_raw, features)
-        x_test = prepare_data(x_test_raw, features)
-        x_eval = prepare_data(x_eval_raw, features)
-
         # Load hyperparameter space
         hyperparameter_space = load_hyperparameter_space()
648-685: Cleanup: Unused global variable declaration.

Line 654 declares global X_train, y_train, X_test, y_test, X_eval, y_eval (uppercase), but these globals are never used since:

The prepared data is stored in lowercase local variables (lines 665-667)

hypertune_lightgbm() loads its own data internally (no parameters)

Remove the unused global declaration:
     try:
         logger.info("Starting LightGBM model training")
-        global X_train, y_train, X_test, y_test, X_eval, y_eval
-
         # Load data
         dataloader = DataLoader()
         X_train, y_train, X_test, y_test, X_eval, y_eval = dataloader.load_data()
Alternatively, if you intend to use the prepared data in main, align the variable names and pass them to hypertune_lightgbm (see previous comment).

407-407: Use TypeError for type validation.

Lines 407 and 663 raise ValueError for type mismatches, but Python convention recommends TypeError when the issue is an incorrect type rather than an incorrect value.
-        if not isinstance(features, list):
-            raise ValueError("Expected features to be a list for lgbm model type")
+        if not isinstance(features, list):
+            raise TypeError(f"Expected features to be a list for lgbm model type, got {type(features)}")
Apply the same change on line 663.

Also applies to: 663-663

641-642: Use logging.exception for better error diagnostics.

Lines 642 and 675 use logger.error in exception handlers. Python's logging.exception automatically includes the full traceback, making debugging easier.
         except Exception as e:
-            logger.error(f"Trial {trial + 1} failed: {e}")
+            logger.exception(f"Trial {trial + 1} failed: {e}")
Apply the same pattern on line 675.

Also applies to: 675-675
src/models/StackedEnsemble/base/tree_based/xgboost_model.py (4)
425-425: Remove unused experiment_name parameter.

The experiment_name parameter is not used in the function body. The objective function on line 414 references the global experiment_name variable instead.

Either remove the parameter:
-def hypertune_xgboost(x_train, y_train, x_test, y_test, x_eval, y_eval, experiment_name: str):
+def hypertune_xgboost(x_train, y_train, x_test, y_test, x_eval, y_eval):
Or use the parameter in the objective function by passing it through the closure (preferred if you want to support multiple experiments):
     def objective_func(trial):
         return objective(
-            trial, x_train, y_train, x_test, y_test, x_eval, y_eval, hyperparameter_space
+            trial, x_train, y_train, x_test, y_test, x_eval, y_eval, hyperparameter_space, experiment_name
         )
And update the objective function signature accordingly.

754-762: Remove commented code with stale noqa directives.

Lines 754-762 contain commented-out code with unused noqa directives (flagged by Ruff). Commented code creates maintenance burden and should be removed or moved to version control history.

Remove the commented block:
-        # best_model, best_metrics = train_with_precision_target( # noqa: E501
-        #     x_train, # noqa: E501
-        #     y_train, # noqa: E501
-        #     x_test, # noqa: E501
-        #     y_test, # noqa: E501
-        #     x_eval, # noqa: E501
-        #     y_eval,  # Pass DataFrames # noqa: E501
-        # ) # noqa: E501
-        # logger.error("Hyperparameter optimization failed. Skipping precision target training.") # noqa: E501
-
     except Exception as e:
If you need to preserve this code, use feature flags or extract it to a separate branch.

580-582: Use logging.exception for better error diagnostics.

Lines 581 and 675 use logger.error in exception handlers. Python's logging.exception automatically includes the full traceback, significantly improving debugging.
     except Exception as e:
-        logger.error(f"Error in precision-focused training: {str(e)}")
+        logger.exception("Error in precision-focused training")
         return None, None
Apply the same pattern on line 675:
     except Exception as e:
-        logger.error(f"Error in main execution: {str(e)}")
+        logger.exception("Error in main execution")
Note: logging.exception automatically includes the exception message and traceback, so explicit string formatting is unnecessary.

Also applies to: 674-675

696-698: Extract long exception messages to improve maintainability.

Lines 696-698 and 704 contain long exception messages directly in the raise statement. Consider extracting these to constants or module-level error classes for better maintainability.
+# At module level
+class ModelError(Exception):
+    """Base exception for model-related errors."""
+    pass
+
+class ModelNotTrainedError(ModelError):
+    """Raised when attempting to use an untrained model."""
+    pass
+
+class FeatureMismatchError(ModelError):
+    """Raised when feature counts don't match expectations."""
+    pass
+
 def select_top_features_xgb(model, x_features, n_features: int = 30) -> list[str]:
     """..."""
     if not hasattr(model, "feature_importances_"):
-        raise ValueError(
-            "The provided model has not been trained yet or does not support feature importances."
-        )
+        raise ModelNotTrainedError(
+            "Model has not been trained or does not support feature importances"
+        )
     
     # ...
     
     if len(importances) != len(feature_names):
-        raise ValueError("Mismatch between the number of feature importances and feature names.")
+        raise FeatureMismatchError(
+            f"Feature count mismatch: {len(importances)} importances vs {len(feature_names)} names"
+        )
Also applies to: 704-704

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1228b5c and efd96f9.

📒 Files selected for processing (5)

.gitignore (1 hunks)
data/Create_data/api_football/get_fixtures_postgre.py (29 hunks)
src/models/StackedEnsemble/base/neural/pytorch_hypertuner_20.py (32 hunks)
src/models/StackedEnsemble/base/tree_based/lightgbm_model.py (14 hunks)
src/models/StackedEnsemble/base/tree_based/xgboost_model.py (20 hunks)

🧰 Additional context used

🧬 Code graph analysis (4)

src/models/StackedEnsemble/base/tree_based/lightgbm_model.py (4)

src/models/StackedEnsemble/shared/hypertuner_utils.py (2)

optimize_threshold (167-235)

predict (47-73)

src/models/StackedEnsemble/shared/data_loader.py (1)

load_data (25-116)

src/utils/create_evaluation_set.py (1)

import_selected_features_ensemble (1104-1189)

src/models/ensemble/data_utils.py (1)

prepare_data (16-55)

src/models/StackedEnsemble/base/neural/pytorch_hypertuner_20.py (4)

src/models/StackedEnsemble/shared/hypertuner_utils.py (2)

optimize_threshold (167-235)

predict_proba (76-98)

src/models/StackedEnsemble/shared/data_loader.py (2)

DataLoader (14-126)

load_data (25-116)

src/utils/create_evaluation_set.py (1)

import_selected_features_ensemble_new (1893-1968)

src/models/ensemble/data_utils.py (1)

prepare_data (16-55)

src/models/StackedEnsemble/base/tree_based/xgboost_model.py (3)

src/models/StackedEnsemble/base/tree_based/lightgbm_model.py (4)

train_model (139-186)

objective (231-256)

log_to_mlflow (439-501)

main (648-684)

src/models/StackedEnsemble/base/tree_based/xgboost_model_25.py (5)

train_model (182-231)

objective (369-449)

objective (693-733)

log_to_mlflow (504-564)

main (761-803)

src/models/StackedEnsemble/shared/hypertuner_utils.py (2)

optimize_threshold (167-235)

predict (47-73)

data/Create_data/api_football/get_fixtures_postgre.py (1)

src/utils/logger.py (4)

ExperimentLogger (198-377)

info (336-343)

warning (345-352)

error (354-361)

🪛 Ruff (0.13.3)

src/models/StackedEnsemble/base/tree_based/lightgbm_model.py

407-407: Prefer TypeError exception for invalid type

(TRY004)

407-407: Abstract raise to an inner function

(TRY301)

407-407: Avoid specifying long messages outside the exception class

(TRY003)

641-641: Do not catch blind exception: Exception

(BLE001)

642-642: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

663-663: Prefer TypeError exception for invalid type

(TRY004)

663-663: Abstract raise to an inner function

(TRY301)

663-663: Avoid specifying long messages outside the exception class

(TRY003)

src/models/StackedEnsemble/base/neural/pytorch_hypertuner_20.py

116-118: Avoid specifying long messages outside the exception class

(TRY003)

265-265: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue

(S301)

270-270: Do not catch blind exception: Exception

(BLE001)

271-271: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

271-271: Use explicit conversion flag

Replace with conversion flag

(RUF010)

331-331: Unused function argument: trial

(ARG001)

479-479: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

480-482: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

487-487: Consider moving this statement to an else block

(TRY300)

491-491: Do not catch blind exception: Exception

(BLE001)

492-492: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

492-492: Use explicit conversion flag

Replace with conversion flag

(RUF010)

591-591: Consider moving this statement to an else block

(TRY300)

594-594: Use raise without specifying exception name

Remove exception name

(TRY201)

595-595: Do not catch blind exception: Exception

(BLE001)

596-596: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

596-596: Use explicit conversion flag

Replace with conversion flag

(RUF010)

922-922: Unused function argument: experiment_name

(ARG001)

1076-1076: Consider moving this statement to an else block

(TRY300)

1077-1077: Do not catch blind exception: Exception

(BLE001)

1078-1078: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

1078-1078: Use explicit conversion flag

Replace with conversion flag

(RUF010)

1367-1367: Unused function argument: experiment_name

(ARG001)

1534-1534: Do not catch blind exception: Exception

(BLE001)

1535-1535: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

1535-1535: Use explicit conversion flag

Replace with conversion flag

(RUF010)

src/models/StackedEnsemble/base/tree_based/xgboost_model.py

425-425: Unused function argument: experiment_name

(ARG001)

579-579: Consider moving this statement to an else block

(TRY300)

580-580: Do not catch blind exception: Exception

(BLE001)

581-581: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

581-581: Use explicit conversion flag

Replace with conversion flag

(RUF010)

674-674: Do not catch blind exception: Exception

(BLE001)

675-675: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

696-698: Avoid specifying long messages outside the exception class

(TRY003)

704-704: Avoid specifying long messages outside the exception class

(TRY003)

754-754: Unused noqa directive (non-enabled: E501)