Update Soccer Prediction Project with Data File Modifications and Code Cleanup#6
Update Soccer Prediction Project with Data File Modifications and Code Cleanup#6ronyka77 wants to merge 2 commits into
Conversation
…e Cleanup - Updated binary data files `new_api_training_final.parquet` and `new_api_training_final.xlsx` to reflect the latest training data. - Refactored `add_ELO_scores.py` and `add_poisson_xG.py` for improved readability and organization by adjusting export paths and formatting. - Enhanced `event_features_calculator.py` to improve DataFrame handling and logging for better debugging. - Cleaned up imports and formatting across various scripts to maintain consistency and improve code quality. These changes aim to enhance data management, improve code readability, and ensure better organization within the Soccer Prediction Project.
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. WalkthroughThis PR makes broad formatting and typing changes, adds/rewrites many model training, hypertuning, feature-selection and ensemble pipelines (including a new TabNet outliers module and specialized FNN integration), expands ensemble base-model support (PyTorch/SVM/FNN), hardens loaders/prediction flows, and introduces new utilities for weighting, diagnostics and data preparation. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant M as specialized_fnn_integration.main
participant HT as hypertuner.module
participant F as create_specialized_fnn
Note over M,HT: Runtime patching flow for specialized FNN
M->>HT: save originals, patch create_pytorch_model, patch load_hyperparameter_space
HT->>F: build model via patched_create_pytorch_model
HT->>HT: extend hyperparameter space via patched_load_hyperparameter_space
HT-->>M: return after run
M->>HT: restore originals
sequenceDiagram
autonumber
participant E as EnsembleModel.train
participant L as MLflow Loaders
participant B as Base Models (xgb,lgb,tabnet,extra,mlp,pytorch,svm,fnn)
participant W as Weighting (precision-focused)
participant ML as Meta-learner
E->>L: load models, scalers, feature signatures
E->>B: prepare/initialize base models (deferred where applicable)
B-->>E: produce base predictions for x_val/x_test
E->>W: compute per-model dynamic weights
W-->>E: weights vector
E->>ML: assemble meta-features and train meta-learner
ML-->>E: final metrics and thresholds
sequenceDiagram
autonumber
participant P as predict_ensemble.make_prediction
participant LD as Model Loader (mlflow)
participant M as Model
participant U as Threshold Utils
P->>LD: load model by URI (sklearn or pyfunc)
P->>M: obtain probabilities (predict_proba or pyfunc)
alt probability available
M-->>P: probability array
else fallback
LD-->>P: pyfunc predict fallback
end
P->>U: apply threshold -> binary predictions
U-->>P: (DataFrame, precision, recall)
P-->>Caller: return predictions and metrics
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 22
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (23)
src/models/StackedEnsemble/shared/data_loader.py (1)
97-99: Pre-existing bug: NaN count calculated after replacement.The NaN counts are calculated after the NaN replacement (lines 87-89), so they will always be zero. The counts should be calculated before calling
fillna(0)to accurately reflect the number of NaN values that were replaced.Move the NaN counting before the replacement operations:
+ # Count NaN values before replacement + train_nan_count_before = X_train.isna().sum().sum() + test_nan_count_before = X_test.isna().sum().sum() + val_nan_count_before = X_val.isna().sum().sum() + # Replace NaN values with 0 in all data splits self.logger.info("Replacing NaN values with 0 in all data splits") X_train = X_train.fillna(0) X_test = X_test.fillna(0) X_val = X_val.fillna(0) # Replace inf values with 0 in all data splits self.logger.info("Replacing inf values with 0 in all data splits") X_train = X_train.replace([np.inf, -np.inf], 0) X_test = X_test.replace([np.inf, -np.inf], 0) X_val = X_val.replace([np.inf, -np.inf], 0) - - # Log NaN replacement statistics - train_nan_count_before = X_train.isna().sum().sum() - test_nan_count_before = X_test.isna().sum().sum() - val_nan_count_before = X_val.isna().sum().sum()src/models/ensemble/data_utils.py (1)
218-231: Critical: Invalid parameter passed to SMOTE.Line 230 passes
n_neighbors=nn_estimatorto SMOTE, but SMOTE's constructor does not accept ann_neighborsparameter. This will raise aTypeErrorat runtime. Thek_neighborsparameter is already set on line 229, and if you want to pass a custom estimator, it should be aNearestNeighborsobject (notKNeighborsClassifier) passed through thek_neighborsparameter.Apply this diff to fix the SMOTE configuration:
try: - # Configure SMOTE for CPU usage - # Using KNeighborsClassifier with n_jobs parameter to avoid FutureWarning - from sklearn.neighbors import KNeighborsClassifier - - k_neighbors = min(5, min(n_pos, n_neg) - 1) # Adaptive neighbors - nn_estimator = KNeighborsClassifier(n_neighbors=k_neighbors, n_jobs=-1) - smote = SMOTE( sampling_strategy=sampling_strategy, random_state=42, - k_neighbors=k_neighbors, - n_neighbors=nn_estimator, # Pass the estimator to avoid FutureWarning + k_neighbors=min(5, min(n_pos, n_neg) - 1), )Note: The n_jobs parameter for SMOTE has been deprecated. If you need parallel processing, consider using SMOTE within a pipeline or joblib directly.
src/models/StackedEnsemble/base/kernel_based/svm_model_25.py (1)
458-473: Undefinedfinal_metricsreturn will crash hypertuning.Inside
hypertune_svm, the happy-pathreturn best_hpo_params, final_metricsreferencesfinal_metricsthat never gets assigned, so the first successful HPO run blows up with aNameError. Please capture the best trial’s metrics before returning. One straightforward fix is to let the optimization runner hand back both params and metrics, e.g.:@@ - best_params_from_hpo = {} + best_params_from_hpo = {} + best_metrics_from_hpo = {} @@ - def callback(study, trial): - nonlocal best_score, best_params_from_hpo, top_trials, global_top_trials + def callback(study, trial): + nonlocal best_score, best_params_from_hpo, best_metrics_from_hpo, top_trials, global_top_trials @@ - best_params_from_hpo = trial.params + best_params_from_hpo = trial.params + best_metrics_from_hpo = dict(current_run_metrics) @@ - return best_params_from_hpo + return best_params_from_hpo, best_metrics_from_hpo @@ - best_hpo_params = optimize_hyperparameters_svm( + best_hpo_params, final_metrics = optimize_hyperparameters_svm( X_train, y_train, X_test, y_test, X_eval, y_eval, hyperparameter_space, ) @@ - if not best_hpo_params: + if not best_hpo_params: logger.error("Optuna optimization failed for SVM. Aborting.") return None, None - return best_hpo_params, final_metrics + return best_hpo_params, final_metricssrc/models/StackedEnsemble/base/kernel_based/svm_model.py (1)
407-425:final_metricsis undefined in hypertune return path.
hypertune_svmhappily returns(best_hpo_params, final_metrics)even thoughfinal_metricsis never set, so the first successful run will explode with aNameError. Please propagate the metrics from the Optuna study before returning. Mirroring the approach above keeps the signature intact:@@ - best_params_from_hpo = {} + best_params_from_hpo = {} + best_metrics_from_hpo = {} @@ - def callback(study, trial): - nonlocal best_score, best_params_from_hpo, top_trials, global_top_trials + def callback(study, trial): + nonlocal best_score, best_params_from_hpo, best_metrics_from_hpo, top_trials, global_top_trials @@ - best_params_from_hpo = trial.params + best_params_from_hpo = trial.params + best_metrics_from_hpo = dict(current_run_metrics) @@ - return best_params_from_hpo + return best_params_from_hpo, best_metrics_from_hpo @@ - best_hpo_params = optimize_hyperparameters_svm( + best_hpo_params, final_metrics = optimize_hyperparameters_svm( X_train, y_train, X_test, y_test, X_eval, y_eval, hyperparameter_space, ) @@ - if not best_hpo_params: + if not best_hpo_params: logger.error("Optuna optimization failed for SVM. Aborting.") return None, None - return best_hpo_params, final_metrics + return best_hpo_params, final_metricsdata/Create_data/add_ELO_scores.py (4)
171-180: Season reset is global across leagues — causes incorrect ELO carryover.current_season tracks one season for all leagues; teams in other leagues won’t reset when season changes, leading to contaminated ratings. Track season per league.
Apply this diff in add_elo_scores and add a per‑league map:
- # Check for season change - if row["season_encoded"] != self.current_season: - self.current_season = row["season_encoded"] - league_teams = matches[matches["league_encoded"] == row["league_encoded"]] - league_teams = pd.concat( - [league_teams["home_encoded"], league_teams["away_encoded"]] - ).unique() - self.reset_season_ratings(league_teams) + # Check for season change per league + season = row["season_encoded"] + league = row["league_encoded"] + prev = self.current_season_by_league.get(league) + if season != prev: + self.current_season_by_league[league] = season + league_rows = matches[matches["league_encoded"] == league] + league_teams = pd.concat( + [league_rows["home_encoded"], league_rows["away_encoded"]] + ).unique() + self.reset_season_ratings(league_teams)Add this in init:
self.current_season_by_league = {}
150-161: Handle NaN leagues when computing K-factors.NaN league IDs create a NaN dict key and later default to 30 due to NaN key lookup semantics.
Apply:
- unique_leagues = matches["league_encoded"].unique() + unique_leagues = matches["league_encoded"].dropna().unique() for league in unique_leagues: if league in self.prefixed_k_factors:
249-258: Numeric conversion can raise on None; use vectorized to_numeric.float(None) raises TypeError. Switch to robust coercion.
- for col in numeric_columns: - if col in df.columns: - df[col] = df[col].apply( - lambda x: float(str(x).replace(",", ".")) - if isinstance(x, str) - else float(x) - ) + for col in numeric_columns: + if col in df.columns: + # Normalize decimal commas and coerce invalids to NaN + s = df[col].astype(str).str.replace(",", ".", regex=False) + df[col] = pd.to_numeric(s, errors="coerce")
354-361: Ensure real None values are treated as NaN in loader.Current replace list covers the string "None" but not actual None inserted during row padding.
- # Replace NA values - na_values = ["NaN", "N/A", "NA", "null", "None", "", "Infinity", "-Infinity", "inf", "-inf"] - df = df.replace(na_values, np.nan) + # Replace NA values (strings) and actual None + na_values = ["NaN", "N/A", "NA", "null", "None", "", "Infinity", "-Infinity", "inf", "-inf"] + df = df.replace(na_values, np.nan) + df = df.where(pd.notnull(df), np.nan)(import pandas as pd is already present.)
Also applies to: 365-372
src/models/ensemble/meta_features_20.py (1)
13-25: Meta-feature count/order mismatch (8 models) causes wrong column labels.You stack 8 base probs, 1 avg, 23 diffs, 3 range, 8 ranks, then 3 meta, then 2 votes (total 48). create_meta_dataframe expects a different count and a different order (votes before meta). This triggers the fallback and breaks downstream by losing semantic column names.
- Update docstring/comments to 8 models.
- Keep stacking order as-is and fix col_names to length 48 with matching order, or reorder stacking to match the current names. Example fixes (keep current stacking and adjust names):
- Now handles 7 base models (incl. PyTorch, SVM, FNN). + Now handles 8 base models (XGB, TabNet, LGB, Extra, MLP, PyTorch, SVM, FNN).- # Add to meta-features - meta_features = np.column_stack([meta_features, weighted_avg.reshape(-1, 1)]) + # Add to meta-features + meta_features = np.column_stack([meta_features, weighted_avg.reshape(-1, 1)])Ensure diffs list reflects 23 entries (your current all_diffs already has 23).
Option A (recommended): Adjust column names to match stacking order (meta after ranks, then vote_sum, vote_agreement last):
- # Define column names based on the expected structure (43 features) - # 7 base + 1 avg + 21 diffs + 3 range + 7 ranks + 2 votes = 41 ? Mistake somewhere - # Re-count: 7 base + 1 avg + 21 diffs + 3 range + 7 ranks + 2 votes = 41 features - col_names = ( - ["p_xgb", "p_tabnet", "p_lgb", "p_extra", "p_mlp", "p_pytorch", "p_svm", "p_fnn"] # 8 base + # Define column names to match stacking order: + # 8 base + 1 avg + 23 diffs + 3 range + 8 ranks + 3 meta + 2 votes = 48 + col_names = ( + ["p_xgb", "p_tabnet", "p_lgb", "p_extra", "p_mlp", "p_pytorch", "p_svm", "p_fnn"] + ["weighted_avg"] + [ "diff_xgb_tabnet", "diff_xgb_lgb", "diff_xgb_extra", "diff_xgb_mlp", "diff_xgb_pytorch", "diff_xgb_svm", "diff_tabnet_lgb", "diff_tabnet_extra", "diff_tabnet_mlp", "diff_tabnet_pytorch", "diff_tabnet_svm", "diff_lgb_extra", "diff_lgb_mlp", "diff_lgb_pytorch", "diff_lgb_svm", "diff_extra_mlp", "diff_extra_pytorch", "diff_extra_svm", "diff_mlp_pytorch", "diff_mlp_svm", "diff_pytorch_svm", "diff_fnn_pytorch", "diff_fnn_svm", - ] # 21 diffs - + ["max_prob", "min_prob", "range_prob"] # 3 range + ] + + ["max_prob", "min_prob", "range_prob"] + [ "rank_xgb", "rank_tabnet", "rank_lgb", "rank_extra", "rank_mlp", "rank_pytorch", "rank_svm", "rank_fnn", - ] # 8 ranks - + ["vote_sum", "vote_agreement"] # 2 votes - + ["league_encoded", "season_encoded", "date_encoded"] # 4 meta + ] + + ["league_encoded", "season_encoded", "date_encoded"] + + ["vote_sum", "vote_agreement"] )Option B: If you prefer votes before meta, move the stacking of vote_sum/vote_agreement before adding league/season/date to match existing names.
Also fix the comments that still refer to 7 models and 21 diffs.
Also applies to: 49-75, 107-133, 145-149, 151-163, 165-168, 175-179, 189-241
data/Create_data/add_poisson_xG.py (1)
105-115: String/None to numeric conversion can fail; use vectorized coercion..str on non-strings and float(None) paths cause exceptions. Normalize and coerce.
- for col in self.all_features: - if X[col].dtype == "object": - try: - # Replace commas with periods and convert to float - X[col] = X[col].str.replace(",", ".").astype(float) - except Exception as e: - self.logger.warning(f"Could not convert column {col} to numeric: {str(e)}") + for col in self.all_features: + # Normalize decimal commas and coerce invalids to NaN + s = X[col].astype(str).str.replace(",", ".", regex=False) + X[col] = pd.to_numeric(s, errors="coerce")- for col in self.all_features: - if df[col].dtype == "object": - try: - # Replace commas with periods and convert to float - df[col] = df[col].str.replace(",", ".").astype(float) - except Exception as e: - self.logger.warning(f"Could not convert column {col} to numeric: {str(e)}") + for col in self.all_features: + s = df[col].astype(str).str.replace(",", ".", regex=False) + df[col] = pd.to_numeric(s, errors="coerce")Also applies to: 226-233
data/Create_data/api_football/add_current_features_postgre.py (1)
436-440: Future match query now returns past fixtures
WHERE "date" BETWEEN ... OR home_total_shots IS NOT NULLevaluates as (date in range) OR (stats present), so every row with recorded stats (i.e., historical matches) bypasses the date window. The exported “future matches” file will be dominated by old games with completed stats. Please tighten the predicate—e.g. require both the date window andhome_total_shots IS NULL, or wrap theBETWEENin parentheses before theOR.Apply something like:
- WHERE "date" between '{today_str}' and '{two_weeks_str}' or home_total_shots is not null; + WHERE "date" BETWEEN '{today_str}' AND '{two_weeks_str}' + AND home_total_shots IS NULL;src/models/ensemble/training.py (1)
803-814: TabNet trials crash becauseweights=1is invalid
TabNetClassifier.fitexpectsweightsto be an array-like of sample weights (orNone). Passing the scalar1trips the fit loop (len(weights)), giving'int' object is not subscriptablebefore any trial completes. Remove the argument or supply a proper sample-weight vector computed frommeta_targets_np.- eval_metric=params.get("eval_metric", ["logloss", "auc"]), - weights=1, # Use automatic class weighting + eval_metric=params.get("eval_metric", ["logloss", "auc"]),data/Create_data/utils/feature_selection.py (1)
373-399: Shared list corrupts stability scores
dict.fromkeys(..., [])reuses the same list for every key, so appending scores for one feature mutates the “scores” list of every feature. The subsequent mean/selection logic then treats all features as identical. Initialize with independent lists instead.- feature_counts = dict.fromkeys(X.columns, 0) - feature_scores = dict.fromkeys(X.columns, []) + feature_counts = {col: 0 for col in X.columns} + feature_scores = {col: [] for col in X.columns}src/models/StackedEnsemble/base/neural/pytorch_hypertuner.py (1)
129-135: Use the passedbatch_sizein predict_proba (avoid hardcoded 32).Hardcoding 32 ignores the API and can degrade performance.
Apply this diff:
- dataloader = TorchDataLoader( - dataset, - batch_size=32, - num_workers=1, - # pin_memory=True, # Enables faster CPU to GPU transfers - # persistent_workers=True # Keeps workers alive between epochs - ) + dataloader = TorchDataLoader( + dataset, + batch_size=batch_size, + num_workers=0, + pin_memory=(self.device_.type == "cuda"), + )src/models/StackedEnsemble/base/tree_based/lightgbm_model.py (1)
426-434: Don’t mutate X_eval when fixing dtypes for signature; modify input_example.Current code alters the caller’s DataFrame.
Apply this diff:
- if hasattr(input_example, "dtypes"): - for col in input_example.columns: - if X_eval[col].dtype.kind == "i": + if hasattr(input_example, "dtypes"): + for col in input_example.columns: + if input_example[col].dtype.kind == "i": logger.info( f"Converting integer column '{col}' to float64 to handle potential missing values" ) - X_eval[col] = X_eval[col].astype("float64") + input_example[col] = input_example[col].astype("float64")src/models/StackedEnsemble/base/tree_based/random_forest_30.py (1)
543-550: Robust y_val handling and safe baseline precision (avoid divide-by-zero).
.valuesassumes pandas; precision can divide by zero.Apply this diff:
- y_val_np = y_val.values + y_val_np = y_val.values if hasattr(y_val, "values") else np.asarray(y_val) # Compute baseline metric probs = model.predict_proba(X_val)[:, 1] preds = (probs >= threshold).astype(int) - # Fix: metric is being passed as a float value instead of a function - # We'll calculate precision directly since that's what was passed in - baseline = np.sum((y_val_np == 1) & (preds == 1)) / (np.sum(preds == 1)) + # Calculate baseline precision safely + _pp = np.sum(preds == 1) + baseline = 0.0 if _pp == 0 else np.sum((y_val_np == 1) & (preds == 1)) / _ppsrc/models/ensemble/meta_features_0412.py (1)
13-223: Add the FNN channel; current signature breaks dynamic weightingWe now call
create_meta_features_optimizedwith eight model outputs (seeEnsembleModel.train/predict_proba), but the function still accepts only seven arrays. The extrap_fnnpositional argument ends up bound todynamic_weights, so(dynamic_weights.get(...))tries to run on a NumPy array and we crash immediately. Even if it didn’t blow up, all meta-features would ignore the FNN output. Please extend the function (and column headers) to include the eighth model so downstream code keeps working.-def create_meta_features_optimized( - p_xgb: np.ndarray, - p_tabnet: np.ndarray, - p_lgb: np.ndarray, - p_extra: np.ndarray, - p_mlp: np.ndarray, - p_pytorch: np.ndarray, - p_svm: np.ndarray, - dynamic_weights: Optional[dict] = None, - thresholds: Optional[dict] = None, -) -> np.ndarray: +def create_meta_features_optimized( + p_xgb: np.ndarray, + p_tabnet: np.ndarray, + p_lgb: np.ndarray, + p_extra: np.ndarray, + p_mlp: np.ndarray, + p_pytorch: np.ndarray, + p_svm: np.ndarray, + p_fnn: np.ndarray, + dynamic_weights: Optional[dict] = None, + thresholds: Optional[dict] = None, +) -> np.ndarray: @@ - all_preds = [p_xgb, p_tabnet, p_lgb, p_extra, p_mlp, p_pytorch, p_svm] + all_preds = [p_xgb, p_tabnet, p_lgb, p_extra, p_mlp, p_pytorch, p_svm, p_fnn] @@ - + dynamic_weights.get("svm", default_weight) * p_svm + + dynamic_weights.get("svm", default_weight) * p_svm + + dynamic_weights.get("fnn", default_weight) * p_fnn @@ - # Stack all differences (original 15 + new 6 = 21) + # Stack all differences (C(8, 2) = 28) @@ - diff_pytorch_svm, # pytorch vs svm + diff_pytorch_svm, + diff_xgb_fnn, + diff_tabnet_fnn, + diff_lgb_fnn, + diff_extra_fnn, + diff_mlp_fnn, + diff_pytorch_fnn, + diff_svm_fnn, @@ - vote_thresholds = [ + vote_thresholds = [ thresholds.get("xgb", default_threshold) if thresholds else default_threshold, ... - thresholds.get("svm", default_threshold) if thresholds else default_threshold, + thresholds.get("svm", default_threshold) if thresholds else default_threshold, + thresholds.get("fnn", default_threshold) if thresholds else default_threshold, @@ - ["p_xgb", "p_tabnet", "p_lgb", "p_extra", "p_mlp", "p_pytorch", "p_svm"] + [ + "p_xgb", + "p_tabnet", + "p_lgb", + "p_extra", + "p_mlp", + "p_pytorch", + "p_svm", + "p_fnn", + ] @@ - + ["max_prob", "min_prob", "range_prob"] # 3 range + + ["max_prob", "min_prob", "range_prob"] + [ "rank_xgb", ... "rank_svm", + "rank_fnn", ]Don’t forget to define the new
diff_*_fnnarrays before the list, and to bump the expected feature count comment to 50 (8 base + 1 avg + 28 diffs + 3 range + 8 ranks + 2 votes). Based on the call sites inensemble_model_0414.py.src/models/StackedEnsemble/base/neural/mlp_model_25.py (1)
348-351: Pass an actual DataFrame to log_to_mlflow during Optuna loggingInside the objective we call
log_to_mlflow(..., X_eval), but hereX_evalis the scaled NumPy array.log_to_mlflowimmediately triesX_eval.iloc[:5], triggering'numpy.ndarray' object has no attribute 'iloc'as soon as a trial exceeds the score threshold. Please thread the raw evaluation DataFrame through the optimizer so logging keeps working.-def optimize_hyperparameters( - X_train, y_train, X_test, y_test, X_eval, y_eval, hyperparameter_space -): +def optimize_hyperparameters( + X_train, + y_train, + X_test, + y_test, + X_eval, + y_eval, + hyperparameter_space, + X_eval_raw, +): @@ - log_to_mlflow(model, metrics, params, experiment_name, scaler, X_eval) + log_to_mlflow( + model, + metrics, + params, + experiment_name, + scaler, + X_eval_raw, + ) @@ - best_params = optimize_hyperparameters( + best_params = optimize_hyperparameters( X_train_scaled, y_train, X_test_scaled, y_test, X_eval_scaled, y_eval, - hyperparameter_space, + hyperparameter_space, + X_eval, )Make the same signature change in any other call paths (e.g., tests) to avoid regressions.
src/models/ensemble/ensemble_model_0410.py (1)
346-351: Stop double-scaling PyTorch inputs
self.model_pytorch.predict_probaalready applies its own scaler (scaler_.transform(...), seesrc/models/StackedEnsemble/base/neural/pytorch_hypertuner.py). Pre-scaling here runs that transform twice, skewing every prediction. Pass the raw features instead.- X_pytorch_scaled = self.model_pytorch_scaler.transform(X_pytorch) - p_pytorch = self.model_pytorch.predict_proba(X_pytorch_scaled)[:, 1] + p_pytorch = self.model_pytorch.predict_proba(X_pytorch)[:, 1]src/models/StackedEnsemble/base/tree_based/catboost_model.py (1)
538-553: Guard against zero positives in permutation importanceWhen the model predicts no positives, both the baseline precision (
np.sum(preds == 1)) and the per-shuffle precision denominator drop to zero, raising aZeroDivisionError. Add a safe guard (e.g.,denom = max(np.sum(...), 1)or an epsilon) before dividing so permutation importance still runs on edge cases.- baseline = np.sum((y_val_np == 1) & (preds == 1)) / (np.sum(preds == 1)) + predicted_positives = np.sum(preds == 1) + baseline = ( + np.sum((y_val_np == 1) & (preds == 1)) / predicted_positives + if predicted_positives > 0 + else 0.0 + ) ... - precision = np.sum((y_val_np == 1) & (preds_shuffled == 1)) / ( - np.sum(preds_shuffled == 1) - ) + predicted_positives_shuffled = np.sum(preds_shuffled == 1) + precision = ( + np.sum((y_val_np == 1) & (preds_shuffled == 1)) / predicted_positives_shuffled + if predicted_positives_shuffled > 0 + else 0.0 + )src/models/StackedEnsemble/base/tree_based/lightgbm_model_25.py (1)
438-466: Log the LightGBM model only once.Lines 438-466 call
mlflow.lightgbm.log_modeltwice on the same artifact path (“model”). Mlflow raisesMlflowExceptionwhen the second call tries to overwrite the existing artifacts, so the run fails. Log once after you’ve inferred the signature.- # Log model - model_info = mlflow.lightgbm.log_model( - model, - "model", - registered_model_name=f"lightgbm_{datetime.now().strftime('%Y%m%d_%H%M')}", - ) # Create input example for model signature input_example = X_eval.head(5) # Handle integer columns by converting them to float64 to properly manage missing values input_example = X_eval.iloc[:5].copy() if hasattr(X_eval, "iloc") else X_eval[:5].copy() @@ - model_info = mlflow.lightgbm.log_model( - model, - "model", - registered_model_name=f"lightgbm_{datetime.now().strftime('%Y%m%d_%H%M')}", - signature=signature, - ) + model_info = mlflow.lightgbm.log_model( + model, + "model", + registered_model_name=f"lightgbm_{datetime.now().strftime('%Y%m%d_%H%M')}", + signature=signature, + )src/models/ensemble/ensemble_model_20.py (1)
145-160: Restore optional validation/test handling intrain()
x_val,y_val,X_test, andy_teststill default toNone, yet Lines 156-159 assert they are DataFrames before any fallback logic runs. Any existing caller relying on the previous “split_validation=True” behaviour now hits an AssertionError (orAttributeErrorwhenprepare_dataruns) even though the signature still advertises optional inputs. Please reinstate the old guard (or make the arguments required) before preparing the data, e.g.:- assert isinstance(x_val, pd.DataFrame), f"x_val must be a DataFrame, got {type(x_val)}" - assert isinstance(X_test, pd.DataFrame), f"X_test must be a DataFrame, got {type(X_test)}" + if any(arg is None for arg in (x_val, y_val, X_test, y_test)): + if split_validation: + # fall back to splitting below + pass + else: + raise ValueError("train() now requires explicit x_val/y_val/X_test/y_test inputs when split_validation=False")and only call
prepare_dataafter the inputs are known to exist. Otherwise the refactor breaks every path that depended on the earlier optional arguments.src/models/StackedEnsemble/shared/config_loader.py (1)
51-63: YAML: add UTF‑8 encoding, validate type, and log with tracebackEnsure loaded YAML is a dict; include traceback with logger.exception.
- with open(config_path) as f: + with open(config_path, encoding="utf-8") as f: config = yaml.safe_load(f) - # Validate configuration + if not isinstance(config, dict): + self.logger.error(f"Configuration at {config_path} must be a mapping (dict)") + raise ValueError(f"Invalid YAML structure at {config_path}") + + # Validate configuration self._validate_model_config(config, model_type) return config ... - except yaml.YAMLError as e: - self.logger.error(f"Error parsing configuration file: {e}") + except yaml.YAMLError as e: + self.logger.exception(f"Error parsing configuration file: {e}") raise- with open(space_path) as f: + with open(space_path, encoding="utf-8") as f: space = yaml.safe_load(f) - # Validate search space + if not isinstance(space, dict): + self.logger.error(f"Hyperparameter space at {space_path} must be a mapping (dict)") + raise ValueError(f"Invalid YAML structure at {space_path}") + + # Validate search space self._validate_hyperparameter_space(space, model_type) return space ... - except yaml.YAMLError as e: - self.logger.error(f"Error parsing hyperparameter space file: {e}") + except yaml.YAMLError as e: + self.logger.exception(f"Error parsing hyperparameter space file: {e}") raiseAlso applies to: 82-94
🧹 Nitpick comments (21)
src/models/StackedEnsemble/shared/data_loader.py (1)
92-94: Consider extracting duplicate inf replacement logic.The inf value replacement logic is repeated three times for the three dataframes. Consider extracting this into a helper method to reduce duplication and improve maintainability.
Example refactor:
def _replace_inf_values(self, df: pd.DataFrame) -> pd.DataFrame: """Replace infinite values with 0 in a DataFrame.""" return df.replace([np.inf, -np.inf], 0)Then apply it to each dataset:
# Replace inf values with 0 in all data splits self.logger.info("Replacing inf values with 0 in all data splits") - X_train = X_train.replace([np.inf, -np.inf], 0) - X_test = X_test.replace([np.inf, -np.inf], 0) - X_val = X_val.replace([np.inf, -np.inf], 0) + X_train = self._replace_inf_values(X_train) + X_test = self._replace_inf_values(X_test) + X_val = self._replace_inf_values(X_val)src/utils/K_factor_calculation.py (4)
13-25: Consider optimizing repeated DataFrame filtering.The current implementation filters
league_datatwice for each unique team (once for home matches, once for away matches), resulting in O(n×m) complexity. For large datasets, this could be slow.Consider using
groupbyto compute team-level statistics more efficiently:# Compute home stats home_stats = league_data.groupby("home_encoded").agg({ "Home_draws": "first", "Home_team_matches": "first" }) # Compute away stats away_stats = league_data.groupby("away_encoded").agg({ "Away_draws": "first", "Away_team_matches": "first" }) # Combine stats for team_id in league_data["home_encoded"].unique(): if team_id not in home_stats.index or team_id not in away_stats.index: continue home_draws = home_stats.loc[team_id, "Home_draws"] home_matches = home_stats.loc[team_id, "Home_team_matches"] away_draws = away_stats.loc[team_id, "Away_draws"] away_matches = away_stats.loc[team_id, "Away_team_matches"] if home_matches > 0 and away_matches > 0: team_draw_rate = (home_draws + away_draws) / (home_matches + away_matches) draw_rates.append(team_draw_rate)This reduces complexity to O(m) for the groupby operations plus O(n) for the iteration.
32-36: Consider vectorized operations instead of iterrows().Using
.iterrows()is inefficient for large DataFrames. For better performance, consider vectorized operations:# Vectorized alternative draw_mask = league_data["match_outcome"] == 2 draw_matches = league_data[draw_mask] goal_diff_data = abs(draw_matches.get("home_goals", 0) - draw_matches.get("away_goals", 0)).tolist()Note: The current use of
.get()with default values is good defensive programming.
69-72: Consider vectorized operations for better performance.Similar to the earlier loop, using
.iterrows()is inefficient. Consider a vectorized approach:# Vectorized alternative draw_mask = (league_data["match_outcome"] == 2) & \ league_data["home_team_elo"].notna() & \ league_data["away_team_elo"].notna() draw_matches = league_data[draw_mask] draw_elo_diffs = abs(draw_matches["home_team_elo"] - draw_matches["away_team_elo"]).tolist()
94-96: Use logger.exception() for better error diagnostics.When logging within an exception handler,
logger.exception()automatically includes the stack trace, making debugging easier thanlogger.error().Apply this diff:
except Exception as e: - logger.error(f"Error calculating draw K-factor: {str(e)}") + logger.exception("Error calculating draw K-factor") return 25 # Default draw K-factorNote:
logger.exception()automatically includes the exception message and traceback, so explicit message formatting is unnecessary.Based on coding guidelines.
data/Create_data/add_ELO_scores.py (1)
211-213: Use logger instead of print.For consistency with the rest of the module.
- print(f"ELO created for {elo_count} matches") + self.logger.info(f"ELO created for {elo_count} matches")src/models/StackedEnsemble/base/tree_based/feature_selection/xgboost_boruta.py (2)
52-66: Set determinism and align threads on the classifier.Add random_state and n_jobs to match global seeds and thread caps.
- params = { + params = { "alpha": 55.8, "colsample_bytree": 0.885, "eval_metric": ["aucpr", "error", "logloss"], "gamma": 4.43, "lambda": 6.94, "learning_rate": 0.15, "max_depth": 10, "min_child_weight": 635, "scale_pos_weight": 2.7, "subsample": 0.795, - } - xgb_clf = xgb.XGBClassifier(**params) + "random_state": SEED, + "n_jobs": 4, + } + xgb_clf = xgb.XGBClassifier(**params)
87-90: Avoid print; log or persist to a defined artifacts directory.Use logger and explicit path (e.g., under logs/artifacts) for traceability.
- selected_features = feature_selector.Subset().columns.tolist() - print("Selected features:", selected_features) + selected_features = feature_selector.Subset().columns.tolist() + logger.info(f"Selected features: {len(selected_features)}")data/Create_data/add_poisson_xG.py (3)
161-163: Consider removing add_constant with scikit-learn PoissonRegressor.PoissonRegressor already includes an intercept; adding a constant risks collinearity.
- # Add constant term for statsmodels - X_scaled = sm.add_constant(X_scaled) + # Intercept is handled by scikit-learn's PoissonRegressor (fit_intercept=True) + # Keep features as-is
238-239: Use logger instead of print.- print(f"X shape: {X.shape}") + self.logger.debug(f"Feature matrix shape: {X.shape}")
285-288: Avoid using 'type' as a parameter name.Shadows built-in and reduces clarity.
- def add_poisson_xG( - self, df: pd.DataFrame, base_df: pd.DataFrame, type: str, is_training: bool = False - ) -> pd.DataFrame: + def add_poisson_xG( + self, df: pd.DataFrame, base_df: pd.DataFrame, dataset_type: str, is_training: bool = False + ) -> pd.DataFrame: ... - output_path = datasets[type] + output_path = datasets[dataset_type] ... - self.logger.error(f"Failed to export {type} data: {str(e)}") + self.logger.error(f"Failed to export {dataset_type} data: {str(e)}")Also applies to: 316-316, 335-341
src/models/StackedEnsemble/base/neural/pytorch_hypertuner.py (2)
565-577: Replace lambda with named function for clarity and lint compliance.Improves readability and satisfies E731.
Apply this diff:
- objective_func = lambda trial: objective( - trial, - X_train, - y_train, - X_test, - y_test, - X_val, - y_val, - hyperparameter_space, - input_dim, - device, - scaler, - ) + def objective_func(trial): + return objective( + trial, + X_train, + y_train, + X_test, + y_test, + X_val, + y_val, + hyperparameter_space, + input_dim, + device, + scaler, + )
419-423: Log full traceback when threshold optimization fails.Use logger.exception for actionable diagnostics.
Apply this diff:
- logger.error(f"TypeError calling optimize_threshold: {te}") - logger.error( + logger.exception(f"TypeError calling optimize_threshold: {te}") + logger.exception( "Ensure optimize_threshold in hypertuner_utils.py expects (model, X, y, min_recall)." )src/models/StackedEnsemble/base/tree_based/lightgbm_model.py (1)
594-596: Prefer logger.exception for failures in feature-combination trials.Keeps stack traces for debugging; behavior unchanged.
Apply this diff:
- except Exception as e: - logger.error(f"Trial {trial + 1} failed: {e}") + except Exception as e: + logger.exception(f"Trial {trial + 1} failed: {e}")src/models/ensemble/meta_features_0414.py (1)
195-233: Align column-name generation with num_models programmatically.Hardcoded lists are brittle (comment says 41, actual is 45). Generate names from
all_predsand computed diffs to avoid drift.I can provide a helper to auto-build
col_namesfrom base names, pairs, and ranks to always matchmeta_features.shape[1]. Want a patch?src/models/StackedEnsemble/shared/config_loader.py (1)
28-30: Avoid "None_config_loader" logger name when model_type/experiment_name are NoneFallback currently yields "None_config_loader". Derive a sensible default.
- self.logger = logger or ExperimentLogger( - experiment_name=experiment_name or f"{model_type}_config_loader" - ) + loader_name = ( + experiment_name + or (f"{model_type}_config_loader" if model_type else "config_loader") + ) + self.logger = logger or ExperimentLogger(experiment_name=loader_name)data/Create_data/api_football/get_fixtures_postgre.py (5)
223-226: Return type mismatch (function can return None)Function returns None on validation failures but is annotated as dict. Adjust to Optional.
- def _parse_team_stat_response( - self, api_response_data: dict, fixture_id: int, target_team_id: Any - ) -> dict: + def _parse_team_stat_response( + self, api_response_data: dict, fixture_id: int, target_team_id: Any + ) -> Optional[dict[str, Any]]:Add import if missing:
-from typing import Any +from typing import Any, Optional
385-386: Use timezone-aware timestamps (UTC) for updated_atNaive datetimes cause ambiguity across services/DB. Prefer UTC-aware.
- "updated_at": datetime.now(), + "updated_at": datetime.now(timezone.utc),Also update imports:
-from datetime import datetime, timedelta +from datetime import datetime, timedelta, timezoneAlso applies to: 875-875, 1064-1064, 1173-1173
470-473: Log exceptions with tracebacksUse logger.exception inside except blocks to capture stack traces. Also avoid blanket Exception where possible.
Example changes:
- except SQLAlchemyError as e_check: - self.logger.error( + except SQLAlchemyError as e_check: + self.logger.exception( f"DB error checking existing season stats for team {team_id_to_fetch}, league {league_id}, season {season}: {e_check}" )- except Exception as e: - self.logger.error( + except Exception as e: + self.logger.exception( f"Error updating statistics for fixture {fixture_id} in PostgreSQL: {e}" )- except requests.exceptions.RequestException as e: - message = f"API request failed for fixture ID {fixture_id}: {e}" - self.logger.error(message) + except requests.exceptions.RequestException as e: + message = f"API request failed for fixture ID {fixture_id}: {e}" + self.logger.exception(message)- except (Exception, psycopg2.Error) as e: - message = f"Error processing or upserting prediction for fixture ID {fixture_id}: {e}" - self.logger.error(message) + except Exception as e: + message = f"Error processing or upserting prediction for fixture ID {fixture_id}: {e}" + self.logger.exception(message)- except Exception as e: - self.logger.error( + except Exception as e: + self.logger.exception( f"Unexpected error processing events for fixture_id {fixture_id}: {e}" )Based on static analysis hints.
Also applies to: 550-552, 699-701, 1246-1248, 932-938
921-923: Remove prints; rely on loggerKeep output consistent and structured.
- self.logger.info(message) - print(message) + self.logger.info(message)- message = f"No prediction data found in API response for fixture ID: {fixture_id}" - self.logger.warning(message) - print(message) + message = f"No prediction data found in API response for fixture ID: {fixture_id}" + self.logger.warning(message)Also applies to: 807-807
1011-1015: Add timeout to teams API request (venues update)Prevent indefinite hangs; optionally reuse _get_request for consistency.
- response = requests.get(url, headers=headers) + response = requests.get(url, headers=headers, timeout=30)
|
|
||
| # PostgreSQL engine setup | ||
| engine = create_engine('postgresql+psycopg2://postgres:ronaldo99@localhost:5432/api_football') | ||
| engine = create_engine("postgresql+psycopg2://postgres:ronaldo99@localhost:5432/api_football") |
There was a problem hiding this comment.
Critical: hardcoded Postgres credentials in source
Leak of DB password in VCS. Load from env and enable pool_pre_ping. Rotate the exposed credential.
- engine = create_engine("postgresql+psycopg2://postgres:ronaldo99@localhost:5432/api_football")
+ db_url = os.getenv("API_FOOTBALL_DB_URL")
+ if not db_url:
+ raise RuntimeError(
+ "Missing API_FOOTBALL_DB_URL. Put your SQLAlchemy URL in the environment/.env."
+ )
+ engine = create_engine(db_url, pool_pre_ping=True)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| engine = create_engine("postgresql+psycopg2://postgres:ronaldo99@localhost:5432/api_football") | |
| db_url = os.getenv("API_FOOTBALL_DB_URL") | |
| if not db_url: | |
| raise RuntimeError( | |
| "Missing API_FOOTBALL_DB_URL. Put your SQLAlchemy URL in the environment/.env." | |
| ) | |
| engine = create_engine(db_url, pool_pre_ping=True) |
🤖 Prompt for AI Agents
In data/Create_data/api_football/get_fixtures_postgre.py around line 34, the
Postgres connection string is hardcoded causing a credentials leak; replace the
hardcoded URL with a connection string sourced from environment variables (e.g.,
DB_USER, DB_PASS, DB_HOST, DB_PORT, DB_NAME or a single DATABASE_URL) and
construct the SQLAlchemy engine from that value, enable pool_pre_ping=True when
calling create_engine to avoid stale connections, and remove the plaintext
credential from the repo (rotate the exposed password immediately).
| if ( | ||
| k == "passes_percent" | ||
| or k == "passes_%" | ||
| and isinstance(val, str) | ||
| and val.endswith("%") | ||
| ): | ||
| try: | ||
| val = float(val.replace('%', '')) | ||
| val = float(val.replace("%", "")) | ||
| except Exception: |
There was a problem hiding this comment.
Fix precedence bug in passes percentage handling
Current condition mixes or/and without parentheses and checks an unused key. Tighten to only handle "passes_%" key.
- if (
- k == "passes_percent"
- or k == "passes_%"
- and isinstance(val, str)
- and val.endswith("%")
- ):
+ if k == "passes_%" and isinstance(val, str) and val.endswith("%"):
try:
val = float(val.replace("%", ""))
except Exception:
val = None📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if ( | |
| k == "passes_percent" | |
| or k == "passes_%" | |
| and isinstance(val, str) | |
| and val.endswith("%") | |
| ): | |
| try: | |
| val = float(val.replace('%', '')) | |
| val = float(val.replace("%", "")) | |
| except Exception: | |
| if k == "passes_%" and isinstance(val, str) and val.endswith("%"): | |
| try: | |
| val = float(val.replace("%", "")) | |
| except Exception: | |
| val = None |
🧰 Tools
🪛 Ruff (0.13.3)
671-673: Parenthesize a and b expressions when chaining and and or together, to make the precedence clear
Parenthesize the and subexpression
(RUF021)
677-677: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In data/Create_data/api_football/get_fixtures_postgre.py around lines 669 to
677, the conditional mixes or/and without parentheses and checks an unintended
key; change the condition to only handle the "passes_%" key with proper
precedence: use if k == "passes_%" and isinstance(val, str) and
val.endswith("%"): then proceed to strip the "%" and convert to float inside the
try/except as before (no other keys), ensuring the boolean logic only applies to
the percent-string check.
| result = conn.execute(query, {"today": today, "league_ids": target_league_ids}) | ||
|
|
There was a problem hiding this comment.
Do not pass unused parameters to SQLAlchemy text() execution
The query doesn't use :today or :league_ids. Passing extra bind params can fail on some drivers.
- result = conn.execute(query, {"today": today, "league_ids": target_league_ids})
+ result = conn.execute(query)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| result = conn.execute(query, {"today": today, "league_ids": target_league_ids}) | |
| result = conn.execute(query) |
🤖 Prompt for AI Agents
In data/Create_data/api_football/get_fixtures_postgre.py around lines 767-768,
the call to conn.execute(query, {"today": today, "league_ids":
target_league_ids}) is passing bind parameters that the SQL text() does not use;
remove the unused "today" and "league_ids" parameters (or alternatively add
matching :today/:league_ids bind placeholders to the SQL if they are intended to
be used) so that execute() only receives parameters referenced by the query and
avoid driver failures.
| api_url = f"{self.base_url}/predictions?fixture={fixture_id}" | ||
| try: | ||
| response = requests.get(api_url, headers=self.headers) | ||
| response.raise_for_status() # Raises an HTTPError for bad responses (4XX or 5XX) | ||
|
|
||
| # Calculate H2H stats | ||
| h2h_home_wins_calc, h2h_away_wins_calc, h2h_draws_calc = 0, 0, 0 | ||
| if fixture_home_team_id is not None and fixture_away_team_id is not None and h2h_list_api: | ||
| for match in h2h_list_api: | ||
| match_teams = match.get("teams", {}) | ||
| match_goals = match.get("goals", {}) | ||
|
|
||
| h2h_match_home_id = match_teams.get("home", {}).get("id") | ||
|
|
||
| h2h_gh = match_goals.get("home") | ||
| h2h_ga = match_goals.get("away") | ||
|
|
||
| if isinstance(h2h_gh, (int, float)) and isinstance(h2h_ga, (int, float)): | ||
| if h2h_gh == h2h_ga: | ||
| h2h_draws_calc += 1 | ||
| elif h2h_match_home_id == fixture_home_team_id: | ||
| if h2h_gh > h2h_ga: | ||
| h2h_home_wins_calc +=1 | ||
| else: | ||
| h2h_away_wins_calc +=1 | ||
| elif h2h_match_home_id == fixture_away_team_id: | ||
| if h2h_ga > h2h_gh: | ||
| h2h_home_wins_calc += 1 | ||
| else: | ||
| h2h_away_wins_calc += 1 | ||
|
|
||
| data_to_upsert["h2h_home_wins"] = h2h_home_wins_calc | ||
| data_to_upsert["h2h_away_wins"] = h2h_away_wins_calc | ||
| data_to_upsert["h2h_draws"] = h2h_draws_calc | ||
| data_to_upsert["h2h_total_matches"] = len(h2h_list_api) if h2h_list_api else 0 | ||
|
|
||
| # Upsert to PostgreSQL | ||
| stmt = pg_insert(predictions_table).values(**data_to_upsert) | ||
| update_dict = {c: stmt.excluded[c] for c in data_to_upsert.keys() if c != 'fixture_id'} | ||
| stmt = stmt.on_conflict_do_update(index_elements=['fixture_id'], set_=update_dict) | ||
| try: | ||
| with engine.begin() as conn: | ||
| conn.execute(stmt) | ||
| message = f"Successfully upserted prediction for fixture ID: {fixture_id}" | ||
| self.logger.info(message) | ||
| print(message) | ||
| return True | ||
| except SQLAlchemyError as e: | ||
| message = f"Error upserting prediction for fixture ID {fixture_id}: {e}" | ||
| self.logger.error(message) | ||
| print(message) | ||
| return False | ||
|
|
||
| except requests.exceptions.RequestException as e: | ||
| message = f"API request failed for fixture ID {fixture_id}: {e}" | ||
| self.logger.error(message) | ||
| api_data = response.json() | ||
|
|
||
| if not api_data.get("response"): | ||
| message = f"No prediction data found in API response for fixture ID: {fixture_id}" | ||
| self.logger.warning(message) |
There was a problem hiding this comment.
Predictions fetch: fix double slash and add request timeout
Avoid base_url double slash and include a timeout to prevent hanging calls.
- api_url = f"{self.base_url}/predictions?fixture={fixture_id}"
+ api_url = f"{self.base_url.rstrip('/')}/predictions?fixture={fixture_id}"
try:
- response = requests.get(api_url, headers=self.headers)
+ response = requests.get(api_url, headers=self.headers, timeout=30)
response.raise_for_status() # Raises an HTTPError for bad responses (4XX or 5XX)Also applies to: 799-801
🧰 Tools
🪛 Ruff (0.13.3)
799-799: Probable use of requests call without timeout
(S113)
🤖 Prompt for AI Agents
In data/Create_data/api_football/get_fixtures_postgre.py around lines 797 to
806, the API call may produce a double slash when self.base_url already ends
with a slash and lacks a timeout which can cause hanging requests; ensure the
URL is built without duplicating slashes (e.g., strip a trailing slash from
self.base_url or join paths safely) and add a timeout parameter to the
requests.get call (apply the same change for the adjacent calls on lines
799–801) so the request becomes requests.get(<cleaned_url>,
headers=self.headers, timeout=<reasonable_seconds>).
| xgb_features, _ = xgboost_staged_selection( | ||
| X, y, X_test, y_test, X_eval, y_eval, target_features + 20 | ||
| ) | ||
| lgb_features, _ = lightgbm_staged_selection( | ||
| X, y, X_test, y_test, X_eval, y_eval, target_features + 20 | ||
| ) | ||
|
|
There was a problem hiding this comment.
Fix staged-selection call signature.
xgboost_staged_selection and lightgbm_staged_selection accept (X, y, X_eval, y_eval, target_features). The new call site passes X_test/y_test, so the function invocation raises TypeError: ... takes 5 positional arguments but 7 were given, aborting the Boruta pipeline. Drop the extra arguments:
- xgb_features, _ = xgboost_staged_selection(
- X, y, X_test, y_test, X_eval, y_eval, target_features + 20
- )
+ xgb_features, _ = xgboost_staged_selection(
+ X, y, X_eval, y_eval, target_features + 20
+ )
- lgb_features, _ = lightgbm_staged_selection(
- X, y, X_test, y_test, X_eval, y_eval, target_features + 20
- )
+ lgb_features, _ = lightgbm_staged_selection(
+ X, y, X_eval, y_eval, target_features + 20
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| xgb_features, _ = xgboost_staged_selection( | |
| X, y, X_test, y_test, X_eval, y_eval, target_features + 20 | |
| ) | |
| lgb_features, _ = lightgbm_staged_selection( | |
| X, y, X_test, y_test, X_eval, y_eval, target_features + 20 | |
| ) | |
| xgb_features, _ = xgboost_staged_selection( | |
| X, y, X_eval, y_eval, target_features + 20 | |
| ) | |
| lgb_features, _ = lightgbm_staged_selection( | |
| X, y, X_eval, y_eval, target_features + 20 | |
| ) |
🤖 Prompt for AI Agents
In
src/models/StackedEnsemble/base/tree_based/feature_selection/lightgbm_boruta.py
around lines 287 to 293, the staged-selection calls pass X_test and y_test but
the signatures are (X, y, X_eval, y_eval, target_features); remove the extra
X_test and y_test arguments and call xgboost_staged_selection and
lightgbm_staged_selection with (X, y, X_eval, y_eval, target_features + 20) so
the number and order of positional arguments match the function definitions.
| # Calculate precision directly instead of using metric parameter | ||
| precision = np.sum((y_val_np == 1) & (preds_shuffled == 1)) / (np.sum(preds_shuffled == 1)) | ||
| precision = np.sum((y_val_np == 1) & (preds_shuffled == 1)) / ( | ||
| np.sum(preds_shuffled == 1) | ||
| ) |
There was a problem hiding this comment.
Guard precision calculation in permutation loop.
Avoid division by zero when no positives are predicted.
Apply this diff:
- # Calculate precision directly instead of using metric parameter
- precision = np.sum((y_val_np == 1) & (preds_shuffled == 1)) / (
- np.sum(preds_shuffled == 1)
- )
+ # Calculate precision safely
+ _pp_shuf = np.sum(preds_shuffled == 1)
+ precision = (
+ 0.0 if _pp_shuf == 0 else np.sum((y_val_np == 1) & (preds_shuffled == 1)) / _pp_shuf
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Calculate precision directly instead of using metric parameter | |
| precision = np.sum((y_val_np == 1) & (preds_shuffled == 1)) / (np.sum(preds_shuffled == 1)) | |
| precision = np.sum((y_val_np == 1) & (preds_shuffled == 1)) / ( | |
| np.sum(preds_shuffled == 1) | |
| ) | |
| # Calculate precision safely | |
| _pp_shuf = np.sum(preds_shuffled == 1) | |
| precision = ( | |
| 0.0 if _pp_shuf == 0 else np.sum((y_val_np == 1) & (preds_shuffled == 1)) / _pp_shuf | |
| ) |
🤖 Prompt for AI Agents
In src/models/StackedEnsemble/base/tree_based/random_forest_30.py around lines
561 to 564, the precision calculation can divide by zero when preds_shuffled
contains no positive predictions; modify the permutation loop to compute denom =
np.sum(preds_shuffled == 1) and only perform the division when denom > 0,
otherwise set precision to 0 (or an agreed sentinel like np.nan), so the code
guards against ZeroDivisionError and yields a defined precision value when there
are no predicted positives.
| if hasattr(self.model, "predict_proba") and callable(getattr(self.model, "predict_proba", None)): | ||
| # sklearn model | ||
| predictions = self.model.predict(df) | ||
| pos_probas = self.model.predict_proba(df)[:, 1] # type: ignore # Get positive class probabilities | ||
| else: | ||
| # PyFuncModel - use predict method which returns probabilities | ||
| predictions = self.test_model.predict(df) | ||
| # For binary classification, PyFuncModel predict returns probabilities for both classes | ||
| if predictions.ndim == 2 and predictions.shape[1] == 2: | ||
| pos_probas = predictions[:, 1] | ||
| predictions = (pos_probas >= self.threshold).astype(int) | ||
| else: | ||
| # If predict returns 1D array, it might be class predictions | ||
| pos_probas = predictions | ||
| predictions = (pos_probas >= self.threshold).astype(int) | ||
|
|
||
| # Ensure we have a 1D numpy array | ||
| if not isinstance(pos_probas, np.ndarray): | ||
| pos_probas = np.array(pos_probas) | ||
| except Exception as e: | ||
| print(f"Error predicting: {e}") | ||
| if "use_label_encoder" in str(e): | ||
| print("Attribute error due to missing 'use_label_encoder'. Patching model...") | ||
| self.model.use_label_encoder = False | ||
| predictions = self.model.predict(df) | ||
| pos_probas = self.model.predict_proba(df) | ||
| if self.model is None: | ||
| raise ValueError(MODEL_NOT_LOADED_ERROR) from e | ||
| # Fallback to PyFuncModel predict | ||
| predictions = self.test_model.predict(df) | ||
| if predictions.ndim == 2 and predictions.shape[1] == 2: | ||
| pos_probas = predictions[:, 1] | ||
| predictions = (pos_probas >= self.threshold).astype(int) | ||
| else: | ||
| pos_probas = predictions | ||
| predictions = (pos_probas >= self.threshold).astype(int) | ||
|
|
||
| # Ensure predictions and probabilities were obtained |
There was a problem hiding this comment.
Apply the tuned threshold to sklearn predictions
For the sklearn path you keep self.model.predict(df) (hard-coded 0.5 threshold) and never reapply the tuned self.threshold, so all downstream metrics ignore the calibrated cutoff. Mirror the PyFunc branch: threshold the positive-class probabilities yourself.
- if hasattr(self.model, "predict_proba") and callable(getattr(self.model, "predict_proba", None)):
- # sklearn model
- predictions = self.model.predict(df)
- pos_probas = self.model.predict_proba(df)[:, 1] # type: ignore # Get positive class probabilities
+ if hasattr(self.model, "predict_proba") and callable(getattr(self.model, "predict_proba", None)):
+ # sklearn model
+ pos_probas = self.model.predict_proba(df)[:, 1] # type: ignore
+ predictions = (pos_probas >= self.threshold).astype(int)Make the same adjustment in the exception fallback so both paths honour the calibrated threshold.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if hasattr(self.model, "predict_proba") and callable(getattr(self.model, "predict_proba", None)): | |
| # sklearn model | |
| predictions = self.model.predict(df) | |
| pos_probas = self.model.predict_proba(df)[:, 1] # type: ignore # Get positive class probabilities | |
| else: | |
| # PyFuncModel - use predict method which returns probabilities | |
| predictions = self.test_model.predict(df) | |
| # For binary classification, PyFuncModel predict returns probabilities for both classes | |
| if predictions.ndim == 2 and predictions.shape[1] == 2: | |
| pos_probas = predictions[:, 1] | |
| predictions = (pos_probas >= self.threshold).astype(int) | |
| else: | |
| # If predict returns 1D array, it might be class predictions | |
| pos_probas = predictions | |
| predictions = (pos_probas >= self.threshold).astype(int) | |
| # Ensure we have a 1D numpy array | |
| if not isinstance(pos_probas, np.ndarray): | |
| pos_probas = np.array(pos_probas) | |
| except Exception as e: | |
| print(f"Error predicting: {e}") | |
| if "use_label_encoder" in str(e): | |
| print("Attribute error due to missing 'use_label_encoder'. Patching model...") | |
| self.model.use_label_encoder = False | |
| predictions = self.model.predict(df) | |
| pos_probas = self.model.predict_proba(df) | |
| if self.model is None: | |
| raise ValueError(MODEL_NOT_LOADED_ERROR) from e | |
| # Fallback to PyFuncModel predict | |
| predictions = self.test_model.predict(df) | |
| if predictions.ndim == 2 and predictions.shape[1] == 2: | |
| pos_probas = predictions[:, 1] | |
| predictions = (pos_probas >= self.threshold).astype(int) | |
| else: | |
| pos_probas = predictions | |
| predictions = (pos_probas >= self.threshold).astype(int) | |
| # Ensure predictions and probabilities were obtained | |
| if hasattr(self.model, "predict_proba") and callable(getattr(self.model, "predict_proba", None)): | |
| # sklearn model | |
| pos_probas = self.model.predict_proba(df)[:, 1] # type: ignore | |
| predictions = (pos_probas >= self.threshold).astype(int) | |
| else: | |
| # PyFuncModel - use predict method which returns probabilities | |
| predictions = self.test_model.predict(df) | |
| # For binary classification, PyFuncModel predict returns probabilities for both classes | |
| if predictions.ndim == 2 and predictions.shape[1] == 2: | |
| pos_probas = predictions[:, 1] | |
| predictions = (pos_probas >= self.threshold).astype(int) | |
| else: | |
| # If predict returns 1D array, it might be class predictions | |
| pos_probas = predictions | |
| predictions = (pos_probas >= self.threshold).astype(int) |
🤖 Prompt for AI Agents
In src/predictors/predict_ensemble.py around lines 108 to 142, the sklearn
branch currently uses self.model.predict(df) (implicit 0.5 cutoff) and never
applies the calibrated self.threshold; update the sklearn path to compute
pos_probas = self.model.predict_proba(df)[:,1] and then set predictions =
(pos_probas >= self.threshold).astype(int); also make the same change in the
exception fallback branch so when you fallback to PyFuncModel or re-run
prediction you always derive pos_probas and then apply (pos_probas >=
self.threshold).astype(int) for predictions, ensuring pos_probas is converted to
a 1D numpy array if needed.
| home_draws = league_data[league_data["home_encoded"] == team_id]["Home_draws"].iloc[0] | ||
| home_matches = league_data[league_data["home_encoded"] == team_id][ | ||
| "Home_team_matches" | ||
| ].iloc[0] | ||
| away_draws = league_data[league_data["away_encoded"] == team_id]["Away_draws"].iloc[0] | ||
| away_matches = league_data[league_data["away_encoded"] == team_id][ | ||
| "Away_team_matches" | ||
| ].iloc[0] |
There was a problem hiding this comment.
Risk of IndexError when filtering returns empty results.
The code filters league_data by team ID and immediately accesses .iloc[0] without verifying the filtered result is non-empty. If a team appears in home_encoded.unique() but has no matching rows when filtered by that ID (or vice versa for away matches), this will raise an IndexError.
Add a defensive check before accessing .iloc[0]:
for team_id in league_data["home_encoded"].unique():
- home_draws = league_data[league_data["home_encoded"] == team_id]["Home_draws"].iloc[0]
- home_matches = league_data[league_data["home_encoded"] == team_id][
- "Home_team_matches"
- ].iloc[0]
- away_draws = league_data[league_data["away_encoded"] == team_id]["Away_draws"].iloc[0]
- away_matches = league_data[league_data["away_encoded"] == team_id][
- "Away_team_matches"
- ].iloc[0]
+ home_data = league_data[league_data["home_encoded"] == team_id]
+ away_data = league_data[league_data["away_encoded"] == team_id]
+
+ if home_data.empty or away_data.empty:
+ continue
+
+ home_draws = home_data["Home_draws"].iloc[0]
+ home_matches = home_data["Home_team_matches"].iloc[0]
+ away_draws = away_data["Away_draws"].iloc[0]
+ away_matches = away_data["Away_team_matches"].iloc[0]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| home_draws = league_data[league_data["home_encoded"] == team_id]["Home_draws"].iloc[0] | |
| home_matches = league_data[league_data["home_encoded"] == team_id][ | |
| "Home_team_matches" | |
| ].iloc[0] | |
| away_draws = league_data[league_data["away_encoded"] == team_id]["Away_draws"].iloc[0] | |
| away_matches = league_data[league_data["away_encoded"] == team_id][ | |
| "Away_team_matches" | |
| ].iloc[0] | |
| for team_id in league_data["home_encoded"].unique(): | |
| home_data = league_data[league_data["home_encoded"] == team_id] | |
| away_data = league_data[league_data["away_encoded"] == team_id] | |
| if home_data.empty or away_data.empty: | |
| continue | |
| home_draws = home_data["Home_draws"].iloc[0] | |
| home_matches = home_data["Home_team_matches"].iloc[0] | |
| away_draws = away_data["Away_draws"].iloc[0] | |
| away_matches = away_data["Away_team_matches"].iloc[0] |
| for _, row in league_data.iterrows(): | ||
| if row['match_outcome'] == 2 and 'home_team_elo' in row and 'away_team_elo' in row: | ||
| elo_diff = abs(row['home_team_elo'] - row['away_team_elo']) | ||
| if row["match_outcome"] == 2 and "home_team_elo" in row and "away_team_elo" in row: | ||
| elo_diff = abs(row["home_team_elo"] - row["away_team_elo"]) | ||
| draw_elo_diffs.append(elo_diff) |
There was a problem hiding this comment.
Logic error in ELO column existence check.
The expression "home_team_elo" in row checks if the column name exists in the DataFrame's column index, not whether the specific row has a non-null value. When iterating with .iterrows(), this check will always return True if the column exists, even if the value is NaN.
Replace with a null-check:
for _, row in league_data.iterrows():
- if row["match_outcome"] == 2 and "home_team_elo" in row and "away_team_elo" in row:
+ if row["match_outcome"] == 2 and pd.notna(row["home_team_elo"]) and pd.notna(row["away_team_elo"]):
elo_diff = abs(row["home_team_elo"] - row["away_team_elo"])
draw_elo_diffs.append(elo_diff)Additionally, import pandas if not already imported:
import pandas as pd🤖 Prompt for AI Agents
In src/utils/K_factor_calculation.py around lines 69 to 72, the check
`"home_team_elo" in row` is incorrect because iterrows() yields a Series for
each row and that test only checks column existence, not whether the value is
non-null; change the condition to explicitly test for non-null values (e.g., use
pandas.notnull()/pd.notna() or Series.notnull() on row["home_team_elo"] and
row["away_team_elo"]) before computing elo_diff, and ensure pandas is imported
(import pandas as pd) if not already.
…t Variable Integration - Enhanced `get_fixtures_postgre.py` by adding support for environment variables for database connection, improving security and flexibility. - Refactored variable names in `pytorch_hypertuner_20.py` for consistency and clarity, ensuring better readability. - Updated `.gitignore` to include additional files and directories, maintaining a clean project structure. - Improved error handling in various scripts to ensure robustness during execution. These changes aim to enhance code quality, improve security practices, and ensure better organization within the Soccer Prediction Project.
|
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/models/StackedEnsemble/base/tree_based/lightgbm_model.py (1)
439-501: Critical: Undefined variables will cause runtime NameError.The
log_to_mlflowfunction referencesX_train(line 470) andX_eval(lines 472, 478, 481) but these variables are not defined in the function's scope—they are neither parameters nor accessible globals.Additionally:
- Line 470 assigns
input_example = X_train.head(5)but immediately overwrites it on line 472- Line 481 mutates the caller's
X_evalDataFrame (side effect)Add
x_evalparameter and use it consistently:-def log_to_mlflow(model, metrics, params, experiment_name): +def log_to_mlflow(model, metrics, params, experiment_name, x_eval): """ Log trained model, metrics, and parameters to MLflow. Args: model: Trained LightGBM model metrics: Model evaluation metrics params: Model parameters experiment_name: Experiment name + x_eval: Evaluation features for signature generation Returns: str: Run ID """ try: # Set up MLflow tracking mlflow.set_experiment(experiment_name) # Start a new run with mlflow.start_run(run_name=f"lightgbm_{datetime.now().strftime('%Y%m%d_%H%M')}") as run: # Log parameters for param_name, param_value in params.items(): mlflow.log_param(param_name, param_value) # Log metrics for metric_name, metric_value in metrics.items(): mlflow.log_metric(metric_name, metric_value) - # Create input example for model signature - input_example = X_train.head(5) - # Handle integer columns by converting them to float64 to properly manage missing values - input_example = X_eval.iloc[:5].copy() if hasattr(X_eval, "iloc") else X_eval[:5].copy() + # Create input example for signature inference + input_example = x_eval.iloc[:5].copy() if hasattr(x_eval, "iloc") else x_eval[:5].copy() # Identify and convert integer columns to float64 to prevent schema enforcement errors if hasattr(input_example, "dtypes"): for col in input_example.columns: - if X_eval[col].dtype.kind == "i": + if input_example[col].dtype.kind == "i": logger.info( f"Converting integer column '{col}' to float64 to handle potential missing values" ) - X_eval[col] = X_eval[col].astype("float64") + input_example[col] = input_example[col].astype("float64") # Infer signature with proper handling for integer columns with potential missing values signature = infer_signature(input_example, model.predict(input_example))Then update the caller on line 544:
- log_to_mlflow(model, metrics, params, experiment_name) + log_to_mlflow(model, metrics, params, experiment_name, x_eval)
♻️ Duplicate comments (4)
data/Create_data/api_football/get_fixtures_postgre.py (3)
683-692: Fix precedence bug in passes percentage handling.This is a duplicate of a previous review comment. The condition mixes
orandandwithout proper parentheses, causing incorrect logic. The key"passes_percent"is also not used in the API response.Apply this diff to fix:
- elif ( - k == "passes_percent" - or k == "passes_%" - and isinstance(val, str) - and val.endswith("%") - ): + elif k == "passes_%" and isinstance(val, str) and val.endswith("%"): try: val = float(val.replace("%", "")) - except (ValueError, AttributeError): + except Exception: val = None
780-782: Remove unused SQL bind parameters.This is a duplicate of a previous review comment. The query doesn't use
:todayor:league_idsplaceholders, so passing these parameters can cause driver failures.Apply this diff:
with engine.connect() as conn: - result = conn.execute(query, {"today": today, "league_ids": target_league_ids}) + result = conn.execute(query) - for row in result:
811-814: Add timeout to API request to prevent hanging.This is a duplicate of a previous review comment. The
requests.getcall lacks a timeout parameter, which can cause the application to hang indefinitely if the API is slow or unresponsive.Apply this diff:
api_url = f"{self.base_url}/predictions?fixture={fixture_id}" try: - response = requests.get(api_url, headers=self.headers) + response = requests.get(api_url, headers=self.headers, timeout=30) response.raise_for_status() # Raises an HTTPError for bad responses (4XX or 5XX)src/models/StackedEnsemble/base/neural/pytorch_hypertuner_20.py (1)
1041-1041: Guard baseline precision against zero positives (duplicate concern)The baseline precision calculation at line 1041 still divides by
np.sum(preds == 1)without protection, which will raiseZeroDivisionErrorwhen the model predicts no positives. This issue was flagged in a previous review but remains unresolved.Apply this diff to guard the denominator:
- baseline = np.sum((y_val_np == 1) & (preds == 1)) / (np.sum(preds == 1)) + denom_baseline = max(np.sum(preds == 1), 1) + baseline = np.sum((y_val_np == 1) & (preds == 1)) / denom_baselineThis matches the guard used for shuffled precision at lines 1062-1064.
🧹 Nitpick comments (10)
data/Create_data/api_football/get_fixtures_postgre.py (2)
309-311: Uselogging.exceptionin exception handlers for better debugging.Multiple exception handlers use
self.logger.error()without automatically capturing the traceback. Usingself.logger.exception()provides better debugging information by including the full stack trace.Example fixes (apply similar pattern to all flagged locations):
except (ValueError, TypeError): - self.logger.error( - f"_parse_team_stat_response: Invalid target_team_id type. Got {target_team_id}" - ) + self.logger.exception( + f"_parse_team_stat_response: Invalid target_team_id type. Got {target_team_id}" + ) return NoneApply similar changes at lines 483-485, 564-566, 713-715, 940, 946, 951, 1253, and 1256-1258.
Also applies to: 483-485, 564-566, 713-715, 940-940, 946-946, 951-951, 1253-1253, 1256-1258
949-953: Tighten exception handling to avoid masking errors.The broad
Exceptioncatch can mask unexpected programming errors. Since you're already catchingpsycopg2.Errorseparately, consider being more specific about the exceptions you expect.If you need a catch-all for unexpected errors, consider logging at a different level or re-raising after logging:
except requests.exceptions.RequestException as e: message = f"API request failed for fixture ID {fixture_id}: {e}" self.logger.error(message) print(message) return False - except (Exception, psycopg2.Error) as e: + except psycopg2.Error as e: + message = f"Database error processing prediction for fixture ID {fixture_id}: {e}" + self.logger.exception(message) + print(message) + return False + except Exception as e: message = f"Error processing or upserting prediction for fixture ID {fixture_id}: {e}" - self.logger.error(message) + self.logger.exception(message) print(message) return False -src/models/StackedEnsemble/base/tree_based/lightgbm_model.py (4)
391-436: Refactor: Data loading duplication betweenhypertune_lightgbmandmain.
hypertune_lightgbmnow loads and prepares data internally (lines 401-411), butmain(lines 657-668) also loads and prepares the same data. This creates duplication.Consider one of these approaches:
- Have
mainload data once and pass it tohypertune_lightgbm(restoring the previous signature)- Have
hypertune_lightgbmreturn the prepared data for use inmain- Extract data loading to a separate function used by both
Option 1: Pass data as parameters (restores original pattern):
-def hypertune_lightgbm(): +def hypertune_lightgbm(x_train, y_train, x_test, y_test, x_eval, y_eval): """ Main training function with MLflow tracking. - Updated name from hypertune_mlp to hypertune_lightgbm to match notebook. - Args: + Args: + x_train: Training features + y_train: Training labels + x_test: Testing features + y_test: Testing labels + x_eval: Evaluation features + y_eval: Evaluation labels Returns: tuple: (best_params, best_metrics) """ try: - # Load data - dataloader = DataLoader() - x_train_raw, y_train, x_test_raw, y_test, x_eval_raw, y_eval = dataloader.load_data() - features = import_selected_features_ensemble(model_type="lgbm") - - # Ensure features is a list of strings - if not isinstance(features, list): - raise ValueError("Expected features to be a list for lgbm model type") - - x_train = prepare_data(x_train_raw, features) - x_test = prepare_data(x_test_raw, features) - x_eval = prepare_data(x_eval_raw, features) - # Load hyperparameter space hyperparameter_space = load_hyperparameter_space()
648-685: Cleanup: Unused global variable declaration.Line 654 declares
global X_train, y_train, X_test, y_test, X_eval, y_eval(uppercase), but these globals are never used since:
- The prepared data is stored in lowercase local variables (lines 665-667)
hypertune_lightgbm()loads its own data internally (no parameters)Remove the unused global declaration:
try: logger.info("Starting LightGBM model training") - global X_train, y_train, X_test, y_test, X_eval, y_eval - # Load data dataloader = DataLoader() X_train, y_train, X_test, y_test, X_eval, y_eval = dataloader.load_data()Alternatively, if you intend to use the prepared data in
main, align the variable names and pass them tohypertune_lightgbm(see previous comment).
407-407: UseTypeErrorfor type validation.Lines 407 and 663 raise
ValueErrorfor type mismatches, but Python convention recommendsTypeErrorwhen the issue is an incorrect type rather than an incorrect value.- if not isinstance(features, list): - raise ValueError("Expected features to be a list for lgbm model type") + if not isinstance(features, list): + raise TypeError(f"Expected features to be a list for lgbm model type, got {type(features)}")Apply the same change on line 663.
Also applies to: 663-663
641-642: Uselogging.exceptionfor better error diagnostics.Lines 642 and 675 use
logger.errorin exception handlers. Python'slogging.exceptionautomatically includes the full traceback, making debugging easier.except Exception as e: - logger.error(f"Trial {trial + 1} failed: {e}") + logger.exception(f"Trial {trial + 1} failed: {e}")Apply the same pattern on line 675.
Also applies to: 675-675
src/models/StackedEnsemble/base/tree_based/xgboost_model.py (4)
425-425: Remove unusedexperiment_nameparameter.The
experiment_nameparameter is not used in the function body. The objective function on line 414 references the globalexperiment_namevariable instead.Either remove the parameter:
-def hypertune_xgboost(x_train, y_train, x_test, y_test, x_eval, y_eval, experiment_name: str): +def hypertune_xgboost(x_train, y_train, x_test, y_test, x_eval, y_eval):Or use the parameter in the objective function by passing it through the closure (preferred if you want to support multiple experiments):
def objective_func(trial): return objective( - trial, x_train, y_train, x_test, y_test, x_eval, y_eval, hyperparameter_space + trial, x_train, y_train, x_test, y_test, x_eval, y_eval, hyperparameter_space, experiment_name )And update the
objectivefunction signature accordingly.
754-762: Remove commented code with stale noqa directives.Lines 754-762 contain commented-out code with unused
noqadirectives (flagged by Ruff). Commented code creates maintenance burden and should be removed or moved to version control history.Remove the commented block:
- # best_model, best_metrics = train_with_precision_target( # noqa: E501 - # x_train, # noqa: E501 - # y_train, # noqa: E501 - # x_test, # noqa: E501 - # y_test, # noqa: E501 - # x_eval, # noqa: E501 - # y_eval, # Pass DataFrames # noqa: E501 - # ) # noqa: E501 - # logger.error("Hyperparameter optimization failed. Skipping precision target training.") # noqa: E501 - except Exception as e:If you need to preserve this code, use feature flags or extract it to a separate branch.
580-582: Uselogging.exceptionfor better error diagnostics.Lines 581 and 675 use
logger.errorin exception handlers. Python'slogging.exceptionautomatically includes the full traceback, significantly improving debugging.except Exception as e: - logger.error(f"Error in precision-focused training: {str(e)}") + logger.exception("Error in precision-focused training") return None, NoneApply the same pattern on line 675:
except Exception as e: - logger.error(f"Error in main execution: {str(e)}") + logger.exception("Error in main execution")Note:
logging.exceptionautomatically includes the exception message and traceback, so explicit string formatting is unnecessary.Also applies to: 674-675
696-698: Extract long exception messages to improve maintainability.Lines 696-698 and 704 contain long exception messages directly in the
raisestatement. Consider extracting these to constants or module-level error classes for better maintainability.+# At module level +class ModelError(Exception): + """Base exception for model-related errors.""" + pass + +class ModelNotTrainedError(ModelError): + """Raised when attempting to use an untrained model.""" + pass + +class FeatureMismatchError(ModelError): + """Raised when feature counts don't match expectations.""" + pass + def select_top_features_xgb(model, x_features, n_features: int = 30) -> list[str]: """...""" if not hasattr(model, "feature_importances_"): - raise ValueError( - "The provided model has not been trained yet or does not support feature importances." - ) + raise ModelNotTrainedError( + "Model has not been trained or does not support feature importances" + ) # ... if len(importances) != len(feature_names): - raise ValueError("Mismatch between the number of feature importances and feature names.") + raise FeatureMismatchError( + f"Feature count mismatch: {len(importances)} importances vs {len(feature_names)} names" + )Also applies to: 704-704
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
.gitignore(1 hunks)data/Create_data/api_football/get_fixtures_postgre.py(29 hunks)src/models/StackedEnsemble/base/neural/pytorch_hypertuner_20.py(32 hunks)src/models/StackedEnsemble/base/tree_based/lightgbm_model.py(14 hunks)src/models/StackedEnsemble/base/tree_based/xgboost_model.py(20 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
src/models/StackedEnsemble/base/tree_based/lightgbm_model.py (4)
src/models/StackedEnsemble/shared/hypertuner_utils.py (2)
optimize_threshold(167-235)predict(47-73)src/models/StackedEnsemble/shared/data_loader.py (1)
load_data(25-116)src/utils/create_evaluation_set.py (1)
import_selected_features_ensemble(1104-1189)src/models/ensemble/data_utils.py (1)
prepare_data(16-55)
src/models/StackedEnsemble/base/neural/pytorch_hypertuner_20.py (4)
src/models/StackedEnsemble/shared/hypertuner_utils.py (2)
optimize_threshold(167-235)predict_proba(76-98)src/models/StackedEnsemble/shared/data_loader.py (2)
DataLoader(14-126)load_data(25-116)src/utils/create_evaluation_set.py (1)
import_selected_features_ensemble_new(1893-1968)src/models/ensemble/data_utils.py (1)
prepare_data(16-55)
src/models/StackedEnsemble/base/tree_based/xgboost_model.py (3)
src/models/StackedEnsemble/base/tree_based/lightgbm_model.py (4)
train_model(139-186)objective(231-256)log_to_mlflow(439-501)main(648-684)src/models/StackedEnsemble/base/tree_based/xgboost_model_25.py (5)
train_model(182-231)objective(369-449)objective(693-733)log_to_mlflow(504-564)main(761-803)src/models/StackedEnsemble/shared/hypertuner_utils.py (2)
optimize_threshold(167-235)predict(47-73)
data/Create_data/api_football/get_fixtures_postgre.py (1)
src/utils/logger.py (4)
ExperimentLogger(198-377)info(336-343)warning(345-352)error(354-361)
🪛 Ruff (0.13.3)
src/models/StackedEnsemble/base/tree_based/lightgbm_model.py
407-407: Prefer TypeError exception for invalid type
(TRY004)
407-407: Abstract raise to an inner function
(TRY301)
407-407: Avoid specifying long messages outside the exception class
(TRY003)
641-641: Do not catch blind exception: Exception
(BLE001)
642-642: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
663-663: Prefer TypeError exception for invalid type
(TRY004)
663-663: Abstract raise to an inner function
(TRY301)
663-663: Avoid specifying long messages outside the exception class
(TRY003)
src/models/StackedEnsemble/base/neural/pytorch_hypertuner_20.py
116-118: Avoid specifying long messages outside the exception class
(TRY003)
265-265: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue
(S301)
270-270: Do not catch blind exception: Exception
(BLE001)
271-271: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
271-271: Use explicit conversion flag
Replace with conversion flag
(RUF010)
331-331: Unused function argument: trial
(ARG001)
479-479: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
480-482: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
487-487: Consider moving this statement to an else block
(TRY300)
491-491: Do not catch blind exception: Exception
(BLE001)
492-492: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
492-492: Use explicit conversion flag
Replace with conversion flag
(RUF010)
591-591: Consider moving this statement to an else block
(TRY300)
594-594: Use raise without specifying exception name
Remove exception name
(TRY201)
595-595: Do not catch blind exception: Exception
(BLE001)
596-596: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
596-596: Use explicit conversion flag
Replace with conversion flag
(RUF010)
922-922: Unused function argument: experiment_name
(ARG001)
1076-1076: Consider moving this statement to an else block
(TRY300)
1077-1077: Do not catch blind exception: Exception
(BLE001)
1078-1078: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
1078-1078: Use explicit conversion flag
Replace with conversion flag
(RUF010)
1367-1367: Unused function argument: experiment_name
(ARG001)
1534-1534: Do not catch blind exception: Exception
(BLE001)
1535-1535: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
1535-1535: Use explicit conversion flag
Replace with conversion flag
(RUF010)
src/models/StackedEnsemble/base/tree_based/xgboost_model.py
425-425: Unused function argument: experiment_name
(ARG001)
579-579: Consider moving this statement to an else block
(TRY300)
580-580: Do not catch blind exception: Exception
(BLE001)
581-581: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
581-581: Use explicit conversion flag
Replace with conversion flag
(RUF010)
674-674: Do not catch blind exception: Exception
(BLE001)
675-675: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
696-698: Avoid specifying long messages outside the exception class
(TRY003)
704-704: Avoid specifying long messages outside the exception class
(TRY003)
754-754: Unused noqa directive (non-enabled: E501)
Remove unused noqa directive
(RUF100)
755-755: Unused noqa directive (non-enabled: E501)
Remove unused noqa directive
(RUF100)
756-756: Unused noqa directive (non-enabled: E501)
Remove unused noqa directive
(RUF100)
757-757: Unused noqa directive (non-enabled: E501)
Remove unused noqa directive
(RUF100)
758-758: Unused noqa directive (non-enabled: E501)
Remove unused noqa directive
(RUF100)
759-759: Unused noqa directive (non-enabled: E501)
Remove unused noqa directive
(RUF100)
760-760: Unused noqa directive (non-enabled: E501)
Remove unused noqa directive
(RUF100)
761-761: Unused noqa directive (non-enabled: E501)
Remove unused noqa directive
(RUF100)
762-762: Unused noqa directive (non-enabled: E501)
Remove unused noqa directive
(RUF100)
764-764: Do not catch blind exception: Exception
(BLE001)
765-765: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
765-765: Use explicit conversion flag
Replace with conversion flag
(RUF010)
data/Create_data/api_football/get_fixtures_postgre.py
42-42: Avoid specifying long messages outside the exception class
(TRY003)
309-311: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
483-485: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
564-566: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
685-687: Parenthesize a and b expressions when chaining and and or together, to make the precedence clear
Parenthesize the and subexpression
(RUF021)
712-712: Do not catch blind exception: Exception
(BLE001)
713-715: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
813-813: Probable use of requests call without timeout
(S113)
937-937: Consider moving this statement to an else block
(TRY300)
940-940: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
946-946: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
949-949: Do not catch blind exception: Exception
(BLE001)
951-951: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
1251-1251: Consider moving this statement to an else block
(TRY300)
1253-1253: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
1255-1255: Do not catch blind exception: Exception
(BLE001)
1256-1258: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cursor Bugbot
🔇 Additional comments (9)
data/Create_data/api_football/get_fixtures_postgre.py (5)
34-44: Good: DB credentials now use environment variables, but consider adding connection resilience.The hardcoded credentials issue from the previous review has been properly addressed. However, the past review also recommended enabling
pool_pre_ping=Trueto avoid stale connections.Consider adding connection health checks:
-engine = create_engine(f"postgresql+psycopg2://{db_user}:{db_password}@{db_host}:{db_port}/{db_name}") +engine = create_engine( + f"postgresql+psycopg2://{db_user}:{db_password}@{db_host}:{db_port}/{db_name}", + pool_pre_ping=True +)
894-925: Good: Robust H2H calculation with defensive checks.The H2H calculation logic properly handles missing data and validates team IDs before aggregating statistics. The defensive checks prevent incorrect calculations when API data is incomplete.
1147-1167: Good: Defensive event detail handling with proper defaults.The use of
VAR_CHECK_DETAILconstant and the defaulting logic prevents database constraint violations when the API doesn't provide event details. The logging helps track when defaults are applied.
1215-1233: Good: Proper handling of fixtures without events.The method correctly distinguishes between error cases (returning
False) and valid no-event fixtures (returningTrue). This prevents unnecessary error logging for fixtures that legitimately have no events.
1302-1336: Good: Consistent and respectful API rate limiting.The rate limiting implementation matches the pattern used in
get_team_stats_for_fixturesand includes appropriate progress logging for monitoring long-running operations.src/models/StackedEnsemble/base/tree_based/lightgbm_model.py (1)
21-21: LGTM: MLflow signature inference import.The
infer_signatureimport is correctly added to enable automatic schema inference for model logging, aligning with MLflow 3.x best practices.src/models/StackedEnsemble/base/tree_based/xgboost_model.py (3)
22-22: LGTM: MLflow signature inference import.The
infer_signatureimport enables automatic schema inference for model logging on line 520, aligning with MLflow 3.x patterns.
82-153: LGTM: Systematic hyperparameter space refinement.The updated bounds with "Slightly below/above min/max" comments demonstrate data-driven tuning based on previous trial results. This focused search space should improve convergence.
477-537: LGTM: Correct MLflow logging with signature inference.The
log_to_mlflowfunction properly:
- Accepts
x_evalas a parameter (line 477)- Creates
input_examplefromx_evalwithout side effects (line 510)- Handles integer column conversion in the copy (line 517)
- Infers signature correctly (line 520)
This is the correct pattern for MLflow model logging.
| *prompts | ||
| .sonarlint/connectedMode.json |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
Tighten ignore patterns for prompts and SonarLint (prevent accidental IP/secret leakage).
- Line 162:
*promptsonly matches names ending with “prompts” (won’t coverprompt_*.txtor aprompts/dir). Use a directory or broader pattern per your intent. - Line 163: Ignoring only
.sonarlint/connectedMode.jsoncan still leak other .sonarlint local artifacts. Prefer ignoring the whole.sonarlint/dir.
Recommended change:
-*prompts
-.sonarlint/connectedMode.json
+# If you want to ignore a top-level prompts directory:
+prompts/
+# Or, to ignore any 'prompts' directory anywhere:
+# **/prompts/**
+# If the intent is to ignore all prompt-related artifact files:
+# *prompt*
+
+# Ignore all SonarLint local artifacts (safer; prevents token leaks)
+.sonarlint/Run to verify if anything sensitive is already tracked:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Tracked SonarLint files:"
git ls-files -z | tr '\0' '\n' | rg -n '^\.sonarlint/'
if git ls-files --error-unmatch .sonarlint/connectedMode.json >/dev/null 2>&1; then
echo "WARNING: .sonarlint/connectedMode.json is tracked."
fi
echo
echo "Tracked prompt artifacts (review before broad ignores):"
git ls-files -z | tr '\0' '\n' | rg -n '(?i)(^|/)(prompts)(/|$)|\bprompt'Length of output: 24
Tighten .gitignore patterns for prompts and SonarLint
- Line 162: replace
*promptswith an explicit pattern matching your intent (e.g.prompts/,**/prompts/**or*prompt*) - Line 163: replace
.sonarlint/connectedMode.jsonwith.sonarlint/to ignore all SonarLint artifacts
🤖 Prompt for AI Agents
In .gitignore around lines 162 to 163, the patterns are too broad/specific:
replace the ambiguous "*prompts" with the explicit pattern that matches your
intent (for example use "prompts/" to ignore a top-level prompts directory,
"**/prompts/**" to ignore any prompts directory recursively, or "*prompt*" to
match files containing "prompt"), and replace ".sonarlint/connectedMode.json"
with ".sonarlint/" to ignore the entire SonarLint directory and all its
artifacts.




new_api_training_final.parquetandnew_api_training_final.xlsxto reflect the latest training data.add_ELO_scores.pyandadd_poisson_xG.pyfor improved readability and organization by adjusting export paths and formatting.event_features_calculator.pyto improve DataFrame handling and logging for better debugging.These changes aim to enhance data management, improve code readability, and ensure better organization within the Soccer Prediction Project.
Summary by CodeRabbit
New Features
Improvements