Hyperparameter Tuning in Machine Learning: Ten Practical Advanced Techniques for Bayesian Optimization - deephub blog

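Bayesian optimization (BO) is a powerful tool for hyperparameter tuning, but in practice it often converges slowly and carries heavy computational overhead. Running a standard library's BO straight out of the box frequently does no better than a few extra rounds of random search.

To get real value out of BO, you have to work on the search strategy, the injection of prior knowledge, and the control of compute cost. This article collects ten field-tested techniques that help the optimizer search more intelligently, converge faster, and noticeably speed up model iteration.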

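1. Bring in Priors Like a Bayesian Expert

Never cold-start. An optimizer that begins without any clues wastes a great deal of compute probing the boundaries of the search space. We usually have some domain knowledge about sensible hyperparameter ranges, or data from similar past experiments, and that knowledge should be put to use.

A weak prior leaves the optimizer wandering aimlessly through the search space, whereas a strong prior collapses the space quickly. In an expensive ML training loop, the quality of the prior directly determines how much GPU time you save.

So start with a tiny grid or random search (say 5-10 trials) and use its results, above all the best-performing points, as prior observations to initialize the Gaussian process (GP).

Initializing the Gaussian process with informed priors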

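 import numpy as np
 # Use skopt's own GP and kernel classes: skopt's Optimizer is built around them
 # (input normalization, acquisition gradients) rather than the plain sklearn ones
 from skopt.learning import GaussianProcessRegressor
 from skopt.learning.gaussian_process.kernels import Matern
 from skopt import Optimizer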
 
 # Step 1: Quick cheap search to build priors  
 def objective(params):  
     lr, depth = params  
     return train_model(lr, depth)  # your training loop returning validation loss  
 
 search_space = [  
     (1e-4, 1e-1),   # learning rate  
     (2, 10)         # depth  
 ]  
 
 # quick 8-run grid/random search  
 initial_points = [  
     (1e-4, 4), (1e-3, 4), (1e-2, 4),  
     (1e-4, 8), (1e-3, 8), (1e-2, 8),  
     (5e-3, 6), (8e-3, 10)  
 ]  
 initial_results = [objective(p) for p in initial_points]  
 
 # Step 2: Build priors for Bayesian Optimization  
 kernel = Matern(nu=2.5)  
 gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)  
 
 # Step 3: Initialize optimizer with priors  
 opt = Optimizer(  
     dimensions=search_space,  
     base_estimator=gp,  
     initial_point_generator="sobol",  
 )  
 
 # Feed prior observations  
 for p, r in zip(initial_points, initial_results):  
     opt.tell(p, r)  
 
 # Step 4: Bayesian Optimization with informed priors  
 for _ in range(30):  
     next_params = opt.ask()  
     score = objective(next_params)  
     opt.tell(next_params, score)  
 
 best_params = opt.get_result().x  
 print("Best Params:", best_params)
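
One lightweight way to turn a cheap pre-search into a stronger prior is to shrink the search bounds around its best points before handing them to the optimizer. The sketch below is only an illustration in plain Python: the helper name shrink_bounds, the top_k and margin parameters, and the toy loss values are all assumptions made for this example and are not part of scikit-optimize.

```python
import math

def shrink_bounds(points, losses, top_k=3, margin=0.5, log_scale=True):
    """Return (low, high) for one parameter, tightened around the top_k lowest-loss values.

    margin widens the range on each side: in decades when log_scale is True,
    otherwise as a fraction of the observed span (a convention chosen for this sketch).
    """
    ranked = sorted(zip(losses, points))[:top_k]   # lowest loss first
    best = [p for _, p in ranked]
    if log_scale:
        lo = math.log10(min(best)) - margin
        hi = math.log10(max(best)) + margin
        return 10 ** lo, 10 ** hi
    span = max(best) - min(best)
    return min(best) - margin * span, max(best) + margin * span

# Toy pre-search over the learning rate (made-up losses, for illustration only)
lrs = [1e-4, 1e-3, 1e-2, 5e-3, 8e-3, 1e-1]
losses = [0.90, 0.40, 0.35, 0.30, 0.33, 0.80]

low, high = shrink_bounds(lrs, losses, top_k=3, margin=0.5)
print(f"narrowed lr range: [{low:.2e}, {high:.2e}]")
```

The narrowed range can then replace the original learning-rate bounds in the search space. A margin that is too small risks cutting off the true optimum, so it is worth keeping it generous when the pre-search is tiny.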

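One Kaggle Grandmaster reportedly cut tuning rounds by 40% by reusing prior configurations from similar problems. Trading a few cheap evaluations for a faster Bayesian search is a good deal.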

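2. Adjust the Acquisition Function Dynamically

Expected Improvement (EI) is the most common acquisition function because it strikes a reasonable balance between exploration and exploitation. Late in the search, however, EI often becomes too conservative and convergence stalls.

The search strategy does not have to stay fixed. When the search gets stuck on a plateau, try switching acquisition functions on the fly: move to UCB (Upper Confidence Bound; scikit-optimize minimizes, so it exposes the mirrored form as LCB) when you want to close in on the optimum aggressively, or to PI (Probability of Improvement) early in the search or when the objective is noisy and you need to escape a local optimum. Keep in mind that PI is the greediest of the three, so give it a larger improvement margin (xi in scikit-optimize) if escaping a local optimum is the goal.

A dynamic strategy can break late-stage plateaus and cut down on the "garbage time" that contributes nothing to model quality. The following uses scikit-optimize to show how to switch strategy based on convergence behaviour: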

 import numpy as np  
 from skopt import Optimizer  
   
 # Dummy expensive objective  
 def objective(params):  
     lr, depth = params  
     return train_model(lr, depth)  # Replace with your actual training loop  
 
 space = [(1e-4, 1e-1), (2, 10)]  
 opt = Optimizer(  
     dimensions=space,  
     base_estimator="GP",  
     acq_func="EI"   # initial acquisition function  
 )  
 
 def should_switch(iteration, recent_scores):  
     # Simple heuristic: if scores haven't improved in last 5 steps, switch mode  
     if iteration > 10 and np.std(recent_scores[-5:]) < 1e-4:  
         return True  
     return False  
 
 scores = []  
 for i in range(40):  
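     # Dynamically pick acquisition function
     if should_switch(i, scores):
         # skopt minimizes, so its confidence-bound criterion is "LCB"; "UCB" is not a valid name.
         # Heuristic: LCB if the latest score beats the median, otherwise PI
         new_acq = "LCB" if scores[-1] < np.median(scores) else "PI"
         opt.acq_func = new_acq
         # Assumption: skopt's internals read cand_acq_funcs_ when proposing points,
         # so update it too (implementation detail; verify against your skopt version)
         opt.cand_acq_funcs_ = [new_acq]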
     x = opt.ask()  
     y = objective(x)  
     scores.append(y)  
     opt.tell(x, y)  
 
 best_params = opt.get_result().x  
 print("Best Params:", best_params)
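
To make the difference between the three criteria concrete, here is a minimal sketch of their textbook closed forms for a minimization problem with a Gaussian posterior. It is plain Python and independent of any library; the function names, the default xi and kappa values, and the two toy candidates are assumptions chosen for illustration.

```python
import math

def _pdf(z):
    # Standard normal density
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _cdf(z):
    # Standard normal cumulative distribution
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probability_of_improvement(mu, sigma, y_best, xi=0.01):
    # P(f(x) < y_best - xi); assumes sigma > 0
    return _cdf((y_best - xi - mu) / sigma)

def expected_improvement(mu, sigma, y_best, xi=0.01):
    # E[max(y_best - xi - f(x), 0)]; assumes sigma > 0
    gap = y_best - xi - mu
    z = gap / sigma
    return gap * _cdf(z) + sigma * _pdf(z)

def lower_confidence_bound(mu, sigma, kappa=1.96):
    # Optimistic estimate of the loss; smaller is more attractive
    return mu - kappa * sigma

y_best = 0.30
# A: safe, tiny expected gain. B: worse mean, but very uncertain.
candidates = {"A": (0.29, 0.01), "B": (0.35, 0.10)}
for name, (mu, sigma) in candidates.items():
    print(
        name,
        f"PI={probability_of_improvement(mu, sigma, y_best):.3f}",
        f"EI={expected_improvement(mu, sigma, y_best):.4f}",
        f"LCB={lower_confidence_bound(mu, sigma):.3f}",
    )
```

With these numbers PI prefers the safe candidate A, while EI and LCB both prefer the uncertain candidate B, which illustrates why PI is usually regarded as the most exploitative of the three.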

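3. Make Good Use of Log Transforms

Many hyperparameters (learning rate, regularization strength, batch size) span several orders of magnitude and are effectively distributed on an exponential scale. That is very unfriendly to a Gaussian process, because the GP assumes the space is smooth and uniformly scaled.

Searching in the raw space makes the optimizer waste a lot of time fitting steep "cliffs". Applying a log transform to these parameters turns the exponential scale into a linear one, so the optimizer runs on a "flat" playing field. This stabilizes the GP kernel and sharply reduces curvature; in practice it can often cut convergence time roughly in half.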

 import numpy as np  
 from skopt import Optimizer  
 from skopt.space import Real  
   
 # Expensive training function  
 def objective(params):  
     log_lr, log_reg = params  
     lr = 10 ** log_lr          # inverse log transform  
     reg = 10 ** log_reg  
     return train_model(lr, reg)  # replace with your actual training loop  
 
 # Step 1: Define search space in log10 scale  
 space = [  
     Real(-5, -1, name="log_lr"),     # lr in [1e-5, 1e-1]  
     Real(-6, -2, name="log_reg")     # reg in [1e-6, 1e-2]  
 ]  
 
 # Step 2: Create optimizer with log-transformed space  
 opt = Optimizer(  
     dimensions=space,  
     base_estimator="GP",  
     acq_func="EI"  
 )  
 
 # Step 3: Run Bayesian Optimization entirely in log-space  
 n_iters = 40  
 scores = []  
 for _ in range(n_iters):  
     x = opt.ask()              # propose in log-space  
     y = objective(x)           # evaluate in real-space  
     opt.tell(x, y)  
     scores.append(y)  
 
 best_log_params = opt.get_result().x  
 best_params = {  
     "lr": 10 ** best_log_params[0],  
     "reg": 10 ** best_log_params[1]  
 }  
 print("Best Params:", best_params)
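
A quick sampling experiment shows why the raw scale is so lopsided. The sketch below, in plain Python, draws learning rates uniformly on the linear scale and uniformly on the log10 scale over the same range [1e-5, 1e-1], then reports how the samples spread across the four decades. The helper name decade_shares and the sample size are assumptions made for this illustration.

```python
import math
import random

def decade_shares(samples, low_exp=-5, high_exp=-1):
    """Fraction of samples falling in each decade [10^k, 10^(k+1))."""
    counts = [0] * (high_exp - low_exp)
    for s in samples:
        k = int(math.floor(math.log10(s))) - low_exp
        k = max(0, min(k, len(counts) - 1))   # clamp boundary values
        counts[k] += 1
    return [c / len(samples) for c in counts]

rng = random.Random(0)
n = 10_000

linear = [rng.uniform(1e-5, 1e-1) for _ in range(n)]
log_uniform = [10 ** rng.uniform(-5, -1) for _ in range(n)]

linear_shares = decade_shares(linear)
log_shares = decade_shares(log_uniform)

print("linear scale:", [f"{s:.3f}" for s in linear_shares])
print("log scale:   ", [f"{s:.3f}" for s in log_shares])
```

On the linear scale about nine out of ten samples land in the top decade [1e-2, 1e-1), so small learning rates are almost never tried; on the log scale each decade receives roughly a quarter of the samples.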

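4. Don't Let BO Fall into the "Nesting Doll" Trap (Hyper-hypers)

BO has hyperparameters of its own: kernel length scales, the noise term, prior variance, and so on. If you try to tune those as well, you fall into an endless recursion of tuning for the sake of tuning.

Optimizing BO's internal hyperparameters is very sensitive and easily leads to an overfitted surrogate or a wrong noise estimate. For industrial use, a more robust approach is to early-stop the GP's internal optimizer, or to initialize these hyper-hyperparameters with empirical values obtained from meta-learning. This makes the surrogate more stable and cheaper to update; AutoML systems typically take this route rather than learning from scratch.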

 import numpy as np  
 from skopt import Optimizer  
 # Use skopt's GP subclass and kernels, which skopt's Optimizer is built around
 from skopt.learning import GaussianProcessRegressor
 from skopt.learning.gaussian_process.kernels import Matern, WhiteKernel
   
 # Meta-learned priors from previous similar tasks  
 meta_length_scale = 0.3  
 meta_noise_level = 1e-3  
 kernel = (  
     Matern(length_scale=meta_length_scale, nu=2.5) +  
     WhiteKernel(noise_level=meta_noise_level)  
 )  
 
 # Limit BO's own hyperparameter tuning
 gp = GaussianProcessRegressor(
     kernel=kernel,
     optimizer="fmin_l_bfgs_b",
     n_restarts_optimizer=0,    # Crucial: a single L-BFGS run from the meta-learned values, no random restarts
     normalize_y=True
 )
 
 # BO with a stable, meta-initialized GP  
 opt = Optimizer(  
     dimensions=[(1e-4, 1e-1), (2, 12)],  
     base_estimator=gp,  
     acq_func="EI"  
 )  
 
 def objective(params):  
     lr, depth = params  
     return train_model(lr, depth)   # your model's validation loss  
 
 scores = []  
 for _ in range(40):  
     x = opt.ask()  
     y = objective(x)  
     opt.tell(x, y)  
     scores.append(y)  
 
 best_params = opt.get_result().x  
 print("Best Params:", best_params)

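5. Penalize High-Cost Regions

Standard BO cares only about accuracy, not about your electricity bill. Some parameter combinations (very large batch sizes, extremely deep networks, huge embedding dimensions) bring only marginal gains while compute cost grows exponentially.

Without cost control, BO easily burrows into regions that score well but cost far too much. The remedy is to modify the acquisition function with a cost penalty: instead of absolute performance, look at the performance gain per unit of cost. A Stanford ML lab has reportedly found that ignoring cost awareness can lead to budget overruns of more than 37%.

Cost-aware acquisition function (Cost-Aware EI)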

 import numpy as np
 from skopt import Optimizer
 from skopt.acquisition import gaussian_ei

 # Objective returns BOTH validation loss and estimated training cost
 def objective(params):
     lr, depth = params
     val_loss = train_model(lr, depth)
     cost = estimate_cost(lr, depth)   # e.g., GPU hours or FLOPs proxy
     return val_loss, cost

 # Custom cost-aware EI: maximize EI / Cost
 def cost_aware_ei(model, X, y_opt, costs):
     raw_ei = gaussian_ei(X, model, y_opt=y_opt)   # skopt's keyword is y_opt
     normalized_costs = costs / np.max(costs)
     penalty = 1.0 / (1e-6 + normalized_costs)
     return raw_ei * penalty

 # Search space
 opt = Optimizer(
     dimensions=[(1e-4, 1e-1), (2, 20)],
     base_estimator="GP"
 )

 observed_losses = []
 observed_costs = []

 for _ in range(40):
     if not opt.models:
         # Initial random phase: no surrogate has been fitted yet
         next_x = opt.ask()
     else:
         # Draw random candidates and estimate each one's cost up front
         candidates = opt.space.rvs(n_samples=200)
         cand_costs = np.array([estimate_cost(lr, depth) for lr, depth in candidates])
         # Score with the latest fitted GP; it expects points in the transformed space
         cost_scores = cost_aware_ei(
             opt.models[-1],
             opt.space.transform(candidates),
             y_opt=np.min(observed_losses),
             costs=cand_costs
         )
         # Pick best candidate under cost-awareness
         next_x = candidates[int(np.argmax(cost_scores))]

     (loss, cost) = objective(next_x)

     observed_losses.append(loss)
     observed_costs.append(cost)

     opt.tell(next_x, loss)

 best_params = opt.get_result().x
 print("Best Params (Cost-Aware):", best_params)
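
The effect of dividing by cost is easiest to see on a handful of numbers. The following sketch, in plain Python, ranks three hypothetical candidates first by raw EI and then by EI per unit cost. The candidate names, EI values, and costs are invented for illustration, and the two helper functions are not part of any library.

```python
def pick_by_ei(candidates):
    """candidates: list of (name, expected_improvement, cost). Highest raw EI wins."""
    return max(candidates, key=lambda c: c[1])[0]

def pick_by_ei_per_cost(candidates):
    """Highest EI per unit of cost wins."""
    return max(candidates, key=lambda c: c[1] / c[2])[0]

# (name, expected improvement in validation loss, estimated GPU-hours)
candidates = [
    ("depth=20", 0.020, 8.0),   # best raw EI, but very expensive
    ("depth=8", 0.015, 1.5),    # slightly lower EI at a fraction of the cost
    ("depth=4", 0.004, 0.5),    # cheap, but little to gain
]

print("raw EI picks:     ", pick_by_ei(candidates))
print("EI per cost picks:", pick_by_ei_per_cost(candidates))
```

Raw EI chooses the deepest model, whereas EI per cost chooses the mid-sized one. In a real setup it can be worth softening the penalty (for example by raising the cost to a power below one) so that cheap but uninformative points do not dominate; that exponent would be one more knob to choose.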

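6. Hybrid Strategy: BO + Random Search

On noisy tasks (such as RL or deep learning training), BO is not infallible. The GP surrogate is sometimes "fooled" by noise, becomes overconfident about the wrong region, and gets stuck in a local optimum.

Injecting a little "chaos" works surprisingly well here. Mixing roughly 10% random search into the BO loop breaks the surrogate's fixation and improves global coverage. It is a hybrid strategy that uses the diversity of randomness to compensate for BO's deterministic weaknesses, and it is the default configuration in many large-scale AutoML systems.

Random-BO hybrid mode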

 import numpy as np  
 from skopt import Optimizer  
 from skopt.space import Real, Integer  
   
 # Define search space  
 space = [  
     Real(1e-4, 1e-1, name="lr"),  
     Integer(2, 12, name="depth")  
 ]  
 
 # Expensive training loop  
 def objective(params):  
     lr, depth = params  
     return train_model(lr, depth)   # your model's validation loss  
 
 # BO Optimizer  
 opt = Optimizer(  
     dimensions=space,  
     base_estimator="GP",  
     acq_func="EI"  
 )  
 
 n_total = 50
 random_frac = 0.10                  # ~10% of trials are pure random draws
 rng = np.random.default_rng(0)
 results = []

 for i in range(n_total):
     if rng.random() < random_frac:
         # ----- Random step: ignore the surrogate -----
         x = [
             float(rng.uniform(1e-4, 1e-1)),
             int(rng.integers(2, 13))      # upper bound is exclusive, so depth is in [2, 12]
         ]
     else:
         # ----- BO step: follow the acquisition function -----
         x = opt.ask()
     y = objective(x)
     results.append((x, y))
     # Tell BO about every evaluation, random or not (keeps history consistent)
     opt.tell(x, y)
 
 best_params = opt.get_result().x  
 print("Best Params (Hybrid):", best_params)

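7. Parallelization: Faking Parallel Computation

BO is inherently sequential, because every step depends on the posterior updated by the previous one. That is a real handicap on multi-GPU setups. We can, however, "fake" parallelism.

Launch several independent BO instances with different random seeds or priors. Let them run independently, then pool their results into one master GP model and retrain it. This uses the parallel hardware and, thanks to the diversified exploration, makes the final surrogate more robust. The approach is very common in NAS (neural architecture search).

Multi-track parallel BO + result merging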

 import numpy as np  
 from skopt import Optimizer  
 from multiprocessing import Pool  
   
 # Search space  
 space = [(1e-4, 1e-1), (2, 10)]  
 
 # Expensive objective  
 def objective(params):  
     lr, depth = params  
     return train_model(lr, depth)  
 
 # Create BO instances that differ only in their random seed
 def make_optimizer(seed):
     return Optimizer(
         dimensions=space,
         base_estimator="GP",
         acq_func="EI",
         random_state=seed
     )

 # Run pseudo-parallel BO for N steps
 def run_parallel_steps(optimizers, steps=10):
     results = []
     with Pool(len(optimizers)) as pool:
         for _ in range(steps):
             # Ask in the parent process: state changed inside a worker would be
             # lost, because each worker only receives a pickled copy of the optimizer
             xs = [opt.ask() for opt in optimizers]
             # Only the expensive evaluations run in parallel
             ys = pool.map(objective, xs)
             for opt, x, y in zip(optimizers, xs, ys):
                 opt.tell(x, y)
                 results.append((x, y))
     return results

 if __name__ == "__main__":   # needed for multiprocessing on spawn-based platforms
     optimizers = [make_optimizer(seed) for seed in [0, 1, 2, 3]]  # 4 BO tracks

     # Step 1: parallel exploration
     parallel_results = run_parallel_steps(optimizers, steps=15)

     # Step 2: merge results into a master BO (one batched tell, one GP refit)
     master = make_optimizer(seed=99)
     master.tell(
         [x for x, _ in parallel_results],
         [y for _, y in parallel_results]
     )

     # Step 3: refine with unified BO
     for _ in range(30):
         x = master.ask()
         y = objective(x)
         master.tell(x, y)

     print("Best Params:", master.get_result().x)

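8. Handling Non-Numeric Inputs

Gaussian processes like continuous, smooth spaces, but real hyperparameters often include non-numeric variables (optimizer type: Adam vs SGD, activation function type, and so on). These discrete "jumps" break the assumptions behind the GP kernel.

Feeding them to the GP as category IDs is wrong, because integer IDs impose an ordering that does not exist. The correct approach is one-hot encoding or an embedding: map the categorical variable into a continuous numeric space so that BO can reason about the "distance" between categories, which restores the smoothness of the search space. In one BERT fine-tuning case, merely encoding adam_vs_sgd correctly reportedly brought a 15% performance improvement.

Handling categorical hyperparameters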

 import numpy as np  
 from skopt import Optimizer  
 from sklearn.preprocessing import OneHotEncoder  
   
 # --- Step 1: Prepare categorical encoder ---  
 optimizers = np.array([["adam"], ["sgd"], ["adamw"]])  
 enc = OneHotEncoder(sparse_output=False).fit(optimizers)  
 
 def encode_category(cat_name):  
     return enc.transform([[cat_name]])[0]  # returns continuous 3-dim vector  
 
 # --- Step 2: Combined numeric + categorical search space ---  
 # Continuous params: lr, dropout  
 # Encoded categorical: optimizer  
 space_dims = [  
     (1e-5, 1e-2),          # learning rate  
     (0.0, 0.5),            # dropout  
     (0.0, 1.0),            # optimizer_onehot_dim1  
     (0.0, 1.0),            # optimizer_onehot_dim2  
     (0.0, 1.0)             # optimizer_onehot_dim3  
 ]  
 
 opt = Optimizer(  
     dimensions=space_dims,  
     base_estimator="GP",  
     acq_func="EI"  
 )  
 
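 # --- Step 3: Objective that decodes embedding back to category ---
 def decode_optimizer(vec):
     idx = np.argmax(vec)
     # Use the encoder's own category order (sorted: adam, adamw, sgd) so that
     # decoding stays consistent with encode_category
     return str(enc.categories_[0][idx])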
 
 def objective(params):  
     lr, dropout, *opt_vec = params  
     opt_name = decode_optimizer(opt_vec)  
     return train_model(lr, dropout, optimizer=opt_name)  
 
 # --- Step 4: Hybrid categorical-continuous BO loop ---  
 for _ in range(40):  
     x = opt.ask()  
     # Snap encoded optimizer vector to nearest valid one-hot  
     opt_vec = np.array(x[2:])  
     snapped_vec = np.zeros_like(opt_vec)  
     snapped_vec[np.argmax(opt_vec)] = 1.0  
     clean_x = [x[0], x[1], *snapped_vec]  
     y = objective(clean_x)  
     opt.tell(clean_x, y)  
 
 best_params = opt.get_result().x  
 print("Best Params:", best_params)
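
The problem with integer IDs can be shown with a few distance calculations. This plain-Python sketch compares the pairwise Euclidean distances between three optimizer names under an integer-ID encoding and under a one-hot encoding. The category list and helper names are assumptions made for the example.

```python
import math

CATEGORIES = ["adam", "sgd", "adamw"]

def integer_id(name):
    # Naive encoding: position in the list as a single number
    return [float(CATEGORIES.index(name))]

def one_hot(name):
    # One dimension per category, exactly one of them set to 1
    return [1.0 if c == name else 0.0 for c in CATEGORIES]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

pairs = [("adam", "sgd"), ("adam", "adamw"), ("sgd", "adamw")]
for a, b in pairs:
    d_id = distance(integer_id(a), integer_id(b))
    d_oh = distance(one_hot(a), one_hot(b))
    print(f"{a:>5} vs {b:<5}  integer-ID: {d_id:.3f}   one-hot: {d_oh:.3f}")
```

Under integer IDs, adam and adamw end up twice as far apart as adam and sgd purely because of list order. Under one-hot encoding every pair is equally distant, so the surrogate is not handed a spurious ordering; a learned embedding goes one step further and can place genuinely similar categories closer together.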

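9. Constrain Unexplorable Regions

Many hyperparameter combinations exist in theory but cannot run in practice: batch_size larger than the dataset, say, or a logical contradiction such as num_layers < num_heads. Left unconstrained, BO wastes a lot of time trying combinations that are bound to crash or be invalid.

By defining the constraints explicitly, or by returning a huge loss for invalid regions in the objective, you can force BO to steer clear of these "minefields". This markedly reduces the number of failed trials and can typically save 25-40% of search time.

Constraint-aware Bayesian optimization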

 from skopt import gp_minimize  
 from skopt.space import Integer, Real, Categorical  
 import numpy as np  
   
 # Hyperparameter search space  
 space = [  
     Integer(8, 512, name="batch_size"),  
     Integer(1, 12, name="num_layers"),  
     Integer(1, 12, name="num_heads"),  
     Real(1e-5, 1e-2, name="learning_rate", prior="log-uniform"),  
 ]  
 
 # Define constraints
 DATASET_SIZE = 12800  # number of training samples (illustrative value)

 def valid_config(params):
     batch_size, num_layers, num_heads, _ = params
     return (batch_size <= DATASET_SIZE) and (num_layers >= num_heads)
 
 # Wrapped objective that enforces constraints  
 def objective(params):  
     if not valid_config(params):  
         # Penalize invalid regions so BO learns to avoid them  
         return 10.0  # large synthetic loss  
       
     # Fake expensive training loop  
     batch_size, num_layers, num_heads, lr = params  
     loss = (  
         (num_layers - num_heads) * 0.1  
         + np.log(batch_size) * 0.05  
         + np.random.normal(0, 0.01)  
         + lr * 5  
     )  
     return loss  
 
 # Run constraint-aware BO  
 result = gp_minimize(  
     func=objective,  
     dimensions=space,  
     n_calls=40,  
     n_initial_points=8,  
     noise=1e-5  
 )  
 print("Best hyperparameters:", result.x)
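
The penalty approach still spends one evaluation slot on every invalid point it proposes. The other option mentioned above, enforcing the constraints explicitly, can be sketched as rejection sampling: draw configurations, discard the invalid ones before anything is trained, and pass only valid ones on (for example as initial points). The code below is a plain-Python illustration; the function names, the dataset_size of 256, and the parameter ranges are assumptions for this example.

```python
import random

def is_valid(cfg, dataset_size=256):
    # Same two rules as in the text: batch fits the dataset, layers >= heads
    return cfg["batch_size"] <= dataset_size and cfg["num_layers"] >= cfg["num_heads"]

def sample_config(rng):
    return {
        "batch_size": rng.randint(8, 512),
        "num_layers": rng.randint(1, 12),
        "num_heads": rng.randint(1, 12),
    }

def sample_valid_configs(n, rng, max_tries=10_000):
    """Rejection sampling: keep drawing until n valid configs are found."""
    valid, tries = [], 0
    while len(valid) < n and tries < max_tries:
        cfg = sample_config(rng)
        tries += 1
        if is_valid(cfg):
            valid.append(cfg)
    return valid, tries

rng = random.Random(0)
configs, tries = sample_valid_configs(20, rng)
print(f"kept {len(configs)} valid configs out of {tries} draws")
```

Rejection sampling is cheap because checking a constraint costs nothing compared with a training run, but it becomes inefficient when the valid region is a tiny fraction of the space; in that case reparameterizing the space so that invalid combinations cannot be expressed is usually the better fix.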

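10. Ensemble Surrogate Models

A single Gaussian process model is not always reliable. In high-dimensional spaces or with sparse data, a GP tends to "hallucinate" and produce wrong confidence estimates.

A more robust approach is to ensemble several surrogates. We can maintain a GP, a random forest, and gradient-boosted trees (GBDT) side by side, or even a simple MLP, and decide the next search direction by voting or weighted averaging. This exploits the strengths of ensemble learning and markedly lowers prediction variance. Frameworks such as Optuna let you implement this idea through a custom sampler.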

 import optuna  
 from sklearn.gaussian_process import GaussianProcessRegressor  
 from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor  
 import numpy as np  
   
 # Build surrogate ensemble  
 def build_surrogates():  
     return [  
         GaussianProcessRegressor(normalize_y=True),  
         RandomForestRegressor(n_estimators=200),  
         GradientBoostingRegressor()  
     ]  
 
 # Train all surrogates on past trials  
 def train_surrogates(surrogates, X, y):  
     for s in surrogates:  
         s.fit(X, y)  
 
 # Aggregate predictions with a plain (unweighted) average
 def ensemble_predict(surrogates, X):
     preds = []
     for s in surrogates:
         p = s.predict(X)   # tree-based regressors do not accept return_std
         preds.append(p)
     return np.mean(preds, axis=0)
 
 def objective(trial):  
     # Hyperparameters  
     lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)  # suggest_loguniform is deprecated
     depth = trial.suggest_int("depth", 2, 8)  
       
     # Fake expensive evaluation  
     loss = (depth * 0.1) + (np.log1p(1/lr) * 0.05) + np.random.normal(0, 0.02)  
     return loss  
 
 # Custom sampling strategy that ensembles surrogate predictions  
 class EnsembleSampler(optuna.samplers.BaseSampler):  
     def __init__(self):  
         self.surrogates = build_surrogates()  
     def infer_relative_search_space(self, study, trial):
         return {}  # empty relative space: every parameter uses independent sampling
     def sample_relative(self, study, trial, search_space):  
         return {}  
     def sample_independent(self, study, trial, param_name, distribution):  
         trials = study.get_trials(deepcopy=False)  
         # Warm-up phase: random sampling  
         if len(trials) < 15:  
             return optuna.samplers.RandomSampler().sample_independent(  
                 study, trial, param_name, distribution  
             )  
         # Collect training data  
         X = []  
         y = []  
         for t in trials:  
             if t.values:  
                 X.append([t.params["lr"], t.params["depth"]])  
                 y.append(t.values[0])  
         X = np.array(X)  
         y = np.array(y)  
         train_surrogates(self.surrogates, X, y)  
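         # Generate candidate points that respect the parameter's type and scale
         is_int = isinstance(distribution, optuna.distributions.IntDistribution)
         if is_int:
             candidates = np.random.randint(
                 distribution.low, distribution.high + 1, size=64
             ).astype(float)
         elif getattr(distribution, "log", False):
             candidates = np.exp(np.random.uniform(
                 np.log(distribution.low), np.log(distribution.high), size=64
             ))
         else:
             candidates = np.random.uniform(
                 low=distribution.low, high=distribution.high, size=64
             )
         # Predict surrogate losses
         if param_name == "lr":
             Xcand = np.column_stack([candidates, np.full_like(candidates, trial.params.get("depth", 5))])
         else:
             Xcand = np.column_stack([np.full_like(candidates, trial.params.get("lr", 1e-3)), candidates])
         preds = ensemble_predict(self.surrogates, Xcand)
         # Pick best predicted candidate, returned in the type the distribution expects
         best = candidates[np.argmin(preds)]
         return int(best) if is_int else float(best)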
 
 # Run ensemble-driven BO  
 study = optuna.create_study(sampler=EnsembleSampler(), direction="minimize")  
 study.optimize(objective, n_trials=40)  
 print("Best:", study.best_params)
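
Averaging alone throws away one useful signal: how much the surrogates disagree. A simple way to use it, sketched below in plain Python, is to treat the spread of the ensemble's predictions as an uncertainty estimate and score candidates with a confidence-bound rule (mean minus kappa times spread). The function names, the kappa parameter, and the prediction values are assumptions made for this illustration, not the behaviour of any particular framework.

```python
import math

def ensemble_mean_std(predictions):
    """predictions: one list of predicted losses per surrogate, same candidate order."""
    n_models = len(predictions)
    n_cands = len(predictions[0])
    means, stds = [], []
    for j in range(n_cands):
        col = [predictions[i][j] for i in range(n_models)]
        m = sum(col) / n_models
        var = sum((v - m) ** 2 for v in col) / n_models
        means.append(m)
        stds.append(math.sqrt(var))
    return means, stds

def pick_candidate(predictions, kappa=1.0):
    """Index of the candidate with the lowest (mean - kappa * disagreement)."""
    means, stds = ensemble_mean_std(predictions)
    scores = [m - kappa * s for m, s in zip(means, stds)]
    return scores.index(min(scores))

# Made-up predicted losses for three candidates from three surrogates
preds = [
    [0.30, 0.32, 0.50],   # GP
    [0.31, 0.20, 0.52],   # random forest
    [0.29, 0.44, 0.51],   # gradient boosting
]

print("pure averaging picks candidate", pick_candidate(preds, kappa=0.0))
print("disagreement-aware picks candidate", pick_candidate(preds, kappa=1.0))
```

With kappa set to zero the rule reduces to plain averaging and picks the candidate every model agrees on. With kappa at one it picks the candidate the models disagree about most, which is where a new observation is likely to teach the ensemble something.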

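Summary

Calling an off-the-shelf library directly is rarely enough for complex industrial problems. The ten techniques above all bridge the gap between theoretical assumptions (smoothness, unlimited compute, homogeneous noise) and engineering reality (budget limits, discrete parameters, failed trials).

In practice, do not treat Bayesian optimization as a black box that cannot be touched. It should be a component you can customize deeply. Only when you carefully design the search space, adjust the acquisition strategy, and add the necessary constraints for the problem at hand does Bayesian optimization become a true accelerator for model performance rather than a bottomless pit for GPU resources.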

https://avoid.overfit.cn/post/bb15da0bacca46c4b0f6a858827b242f
