SecretFlow Vertical Federated SecureBoost Benchmark White Paper

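SecretFlow (隱語) is an open-source framework for trusted privacy-preserving computing. It ships with several virtual devices for encrypted computation, including MPC, TEE, and homomorphic encryption, which can be chosen flexibly, and it provides a rich set of federated learning algorithms and differential privacy mechanisms.

Open-source project: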
https://github.com/secretflow
https://gitee.com/secretflow

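Introduction:

XGB is a classic algorithm that gets a lot of attention in data science competitions. Some users worry, though, whether XGB is efficient enough in vertical federated learning, whether security and efficiency can be had at the same time, and whether privacy-preserving computation takes so long that model iteration slows to a crawl. With the efficient implementation of the federated SecureBoost algorithm in SecretFlow, training can run as much as 10 times faster.

SecretFlow recently open-sourced a high-performance implementation of the vertical federated algorithm SecureBoost. Compared with SS-XGB, which is built on secret sharing, SecureBoost performs better; because it is not an MPC algorithm, however, its security level is lower than that of SS-XGB.

SecretFlow SecureBoost (hereafter SecretFlow SGB) builds on SecretFlow's security foundation and its distributed architecture for multi-party joint computation, which greatly improves the efficiency and flexibility of encrypted computation. With a simple configuration change, SecretFlow SGB can switch between homomorphic encryption protocols such as Paillier and OU to meet the security and efficiency needs of different scenarios.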
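As a rough illustration of that configuration switch, the sketch below builds an HEU configuration dictionary with the encryption schema as a parameter. The helper name `make_heu_config` is hypothetical and not part of SecretFlow; the dictionary layout is copied from the benchmark script later in this article, and 'paillier' and 'ou' are the two schema values used in this benchmark.

```python
def make_heu_config(schema="paillier", key_bits=2048):
    """Build an HEU config dict shaped like the one in the benchmark script.

    make_heu_config is a hypothetical helper used for illustration only.
    """
    if schema not in ("paillier", "ou"):
        raise ValueError(f"unsupported schema for this sketch: {schema}")
    return {
        "sk_keeper": {"party": "alice"},
        "evaluators": [{"party": "bob"}, {"party": "carol"}],
        "mode": "PHEU",
        "he_parameters": {
            "schema": schema,
            "key_pair": {"generate": {"bit_size": key_bits}},
        },
        "encoding": {
            "cleartext_type": "DT_I32",
            "encoder": "IntegerEncoder",
            "encoder_args": {"scale": 1},
        },
    }


paillier_cfg = make_heu_config("paillier")
ou_cfg = make_heu_config("ou")
# The two configurations differ only in the schema field.
print(paillier_cfg["he_parameters"]["schema"], ou_cfg["he_parameters"]["schema"])
```

Keeping the schema as the only varying field makes it easy to run the same benchmark once per protocol and compare the timings.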

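This article describes the test environment, steps, and data for SecretFlow SGB, so that you can see how the protocol is used and how it performs, and judge how well it fits your business needs.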

Test method and steps:

1. Test machine

Python: 3.8
pip: >= 19.3
OS: CentOS 7
CPU/Memory: recommended minimum of 8C16G (8 cores, 16 GB of memory)
Disk: 500G
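Before installing anything, it can help to confirm that a machine meets these minimums. The following is a minimal sketch using only the Python standard library; the function name `check_environment` is hypothetical, and the thresholds are the ones listed above (memory and disk are not checked here).

```python
import os
import sys


def check_environment(min_python=(3, 8), min_cpus=8):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < min_python:
        found = ".".join(str(v) for v in sys.version_info[:2])
        wanted = ".".join(str(v) for v in min_python)
        problems.append(f"Python {found} found, {wanted} or newer expected")
    cpus = os.cpu_count() or 0  # cpu_count() may return None
    if cpus < min_cpus:
        problems.append(f"{cpus} CPU cores found, at least {min_cpus} recommended")
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```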

2. Install conda

Use conda to manage the Python environment. If conda is not on the machine, install it first.

# sudo yum install -y wget
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Install
bash Miniconda3-latest-Linux-x86_64.sh

# Keep pressing Enter, then type yes
please answer 'yes' or 'no':
>>> yes

# Choose the install path; a leading dot in the name makes it a hidden directory
Miniconda3 will now be installed into this location:
>>> ~/.miniconda3

# Add the configuration to ~/.bashrc
Do you wish the installer to initialize Miniconda3 by running conda init? [yes|no]
[no] >>> yes

# Load the configuration file, or restart the machine
source ~/.bashrc

# Check the installation; a version number means it succeeded
conda --version

3. Install secretflow

conda create -n sf-benchmark python=3.8

conda activate sf-benchmark

pip install -U secretflow

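4. Data requirements

Two-party data scale:

alice: 1 million rows, 50 dimensions
bob: 1 million rows, 50 dimensions

Three-party data scale:

alice: 1 million rows, 34 dimensions
bob: 1 million rows, 33 dimensions
carol: 1 million rows, 33 dimensions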
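The per-party dimensions above amount to splitting 100 feature columns as evenly as possible. The sketch below shows one way to arrive at the same split; the function names and the `x0`, `x1`, ... column names are hypothetical. Judging by the benchmark script, each party's CSV file also needs an `id` column, and the label column `y` sits with alice.

```python
def split_dims(total_dims, n_parties):
    """Split feature columns as evenly as possible.

    Any remainder goes to the earliest parties, one extra column each.
    """
    base, extra = divmod(total_dims, n_parties)
    return [base + (1 if i < extra else 0) for i in range(n_parties)]


def assign_columns(parties, total_dims):
    """Map each party to the (hypothetical) feature column names it holds."""
    sizes = split_dims(total_dims, len(parties))
    layout, start = {}, 0
    for party, size in zip(parties, sizes):
        layout[party] = [f"x{j}" for j in range(start, start + size)]
        start += size
    return layout


two_party = assign_columns(["alice", "bob"], 100)
three_party = assign_columns(["alice", "bob", "carol"], 100)
print({p: len(cols) for p, cols in two_party.items()})
print({p: len(cols) for p, cols in three_party.items()})
```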

5. Benchmark script

import logging
import sys
import time

import spu
from sklearn.metrics import mean_squared_error, roc_auc_score

import secretflow as sf
from secretflow.data.vertical import read_csv as v_read_csv
from secretflow.device.driver import reveal, wait
from secretflow.ml.boost.sgb_v import Sgb


# init log
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.info("test")

_parties = {
    # Replace the addresses of alice, bob, and carol with the actual IPs.
    'alice': {'address': '192.168.0.1:23041'},
    'bob': {'address': '192.168.0.2:23042'},
    'carol': {'address': '192.168.0.3:23043'},
}


def setup_sf(party, alice_ip, bob_ip, carol_ip):

    cluster_conf = {
        'parties': _parties,
        'self_party': party,
    }

    # init cluster
    _system_config = {'lineage_pinning_enabled': False}
    sf.init(
        address='local',
        num_cpus=8,
        log_to_driver=True,
        cluster_config=cluster_conf,
        exit_on_failure_cross_silo_sending=True,
        _system_config=_system_config,
        _memory=5 * 1024 * 1024 * 1024,
        cross_silo_messages_max_size_in_bytes=2 * 1024 * 1024 * 1024 - 1,
        object_store_memory=5 * 1024 * 1024 * 1024,
    )
    # SPU settings
    cluster_def = {
        'nodes': [
            {'party': 'alice', 'id': 'local:0', 'address': alice_ip},
            {'party': 'bob', 'id': 'local:1', 'address': bob_ip},
            {'party': 'carol', 'id': 'local:2', 'address': carol_ip},
        ],
        'runtime_config': {
            # SEMI2K supports 2PC and 3PC, ABY3 supports only 3PC, CHEETAH supports only 2PC.
            # The number of nodes above must match the protocol's party count.
            'protocol': spu.spu_pb2.ABY3,
            'field': spu.spu_pb2.FM64,
        },
    }

    # HEU settings
    heu_config = {
        'sk_keeper': {'party': 'alice'},
        'evaluators': [{'party': 'bob'}, {'party': 'carol'}],
        'mode': 'PHEU',  # change the homomorphic encryption settings below
        'he_parameters': {
            'schema': 'paillier',
            'key_pair': {
                'generate': {
                    'bit_size': 2048,
                },
            },
        },
        'encoding': {
            'cleartext_type': 'DT_I32',
            'encoder': "IntegerEncoder",
            'encoder_args': {"scale": 1},
        },
    }
    return cluster_def, heu_config


class SGB_benchmark:
    def __init__(self, cluster_def, heu_config):
        self.alice = sf.PYU('alice')
        self.bob = sf.PYU('bob')
        self.carol = sf.PYU('carol')
        self.heu = sf.HEU(heu_config, cluster_def['runtime_config']['field'])

    def run_sgb(self, test_name, v_data, label_data, y, logistic, subsample, colsample):
        sgb = Sgb(self.heu)
        start = time.time()
        params = {
            'num_boost_round': 5,
            'max_depth': 5,
            'sketch_eps': 0.08,
            'objective': 'logistic' if logistic else 'linear',
            'reg_lambda': 0.3,
            'subsample': subsample,
            'colsample_by_tree': colsample,
        }
        model = sgb.train(params, v_data, label_data)
        # reveal(model.weights[-1])
        print(f"{test_name} train time: {time.time() - start}")

        # Predict, then reveal the result to the driver and time it.
        start = time.time()
        yhat = model.predict(v_data)
        yhat = reveal(yhat)
        print(f"{test_name} predict time: {time.time() - start}")
        if logistic:
            print(f"{test_name} auc: {roc_auc_score(y, yhat)}")
        else:
            print(f"{test_name} mse: {mean_squared_error(y, yhat)}")

        # Predict again, this time delivering the result to alice only.
        fed_yhat = model.predict(v_data, self.alice)
        assert len(fed_yhat.partitions) == 1 and self.alice in fed_yhat.partitions
        yhat = reveal(fed_yhat.partitions[self.alice])
        assert yhat.shape[0] == y.shape[0], f"{yhat.shape} == {y.shape}"
        if logistic:
            print(f"{test_name} auc: {roc_auc_score(y, yhat)}")
        else:
            print(f"{test_name} mse: {mean_squared_error(y, yhat)}")

    def test_on_linear(self, sample_num, total_num):
        """
        sample_num: int. Number of samples in the dataset, in units of 10,000.
        total_num: int. Total number of feature dimensions, as used in the data path.
        """
        io_start = time.perf_counter()
        common_path = "/root/sf-benchmark/data/{}w_{}d_3pc/independent_linear.".format(
            sample_num, total_num
        )
        vdf = v_read_csv(
            {
                self.alice: common_path + "1.csv",
                self.bob: common_path + "2.csv",
                self.carol: common_path + "3.csv",
            },
            keys='id',
            drop_keys='id',
        )
        # Split y out of the dataset.
        # <<< !!! >>> change 'y' if the label column has a different name.
        label_data = vdf["y"]
        # v_data keeps all the features.
        v_data = vdf.drop(columns="y")
        # <<< !!! >>> change alice if y does not belong to alice.
        y = reveal(label_data.partitions[self.alice].data)
        wait([p.data for p in v_data.partitions.values()])
        io_end = time.perf_counter()
        print("io takes time", io_end - io_start)
        self.run_sgb("independent_linear", v_data, label_data, y, True, 1, 1)


def run_test(party):
    # Only runtime_config['field'] from cluster_def is used, to build the HEU.
    cluster_def, heu_config = setup_sf(
        party,
        _parties['alice']['address'],
        _parties['bob']['address'],
        _parties['carol']['address'],
    )
    test_suite = SGB_benchmark(cluster_def, heu_config)
    test_suite.test_on_linear(100, 100)

    sf.shutdown()


if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser(prog='sgb benchmark remote')
    parser.add_argument('party')
    args = parser.parse_args()
    run_test(args.party)

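Download the script to each test machine and save it as, for example, sgb_benchmark.py. alice, bob, and carol all use the same script.

Start two-party SGB as follows. The script as listed is configured for three parties, so a two-party run presumably needs the carol entries removed from it first.

alice: python sgb_benchmark.py alice
bob: python sgb_benchmark.py bob

Start three-party SGB as follows:

alice: python sgb_benchmark.py alice
bob: python sgb_benchmark.py bob
carol: python sgb_benchmark.py carol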

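SGB Benchmark report

Interpretation:

The benchmark dataset has 1 million samples and 100 feature dimensions. We ran the experiments under two sets of network parameters and with two values of the schema algorithm parameter, 'paillier' and 'ou'. Each experiment trains 5 XGB trees of depth 5 with 13 feature buckets on a binary classification task. We ran these experiments in both two-party and three-party settings. In the two-party case, alice and bob each hold 50 of the dimensions. In the three-party case, alice, bob, and carol hold 34, 33, and 33 dimensions respectively.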
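The figure of 13 feature buckets lines up with the `sketch_eps` value of 0.08 in the benchmark script if the bucket count is taken as the ceiling of 1 / sketch_eps. That relation is an assumption here, as is the use of equal-frequency (quantile) buckets; the sketch below only illustrates the arithmetic with the standard library and is not SecretFlow's implementation.

```python
import math
import statistics


def bucket_count(sketch_eps):
    """Assumed relation: number of buckets = ceil(1 / sketch_eps)."""
    return math.ceil(1.0 / sketch_eps)


def quantile_split_points(values, sketch_eps):
    """Cut points that divide values into equal-frequency buckets."""
    n = bucket_count(sketch_eps)
    # statistics.quantiles returns n - 1 cut points for n buckets.
    return statistics.quantiles(values, n=n)


feature = [float(i) for i in range(1000)]  # stand-in feature column
points = quantile_split_points(feature, 0.08)
print(bucket_count(0.08), len(points))
```

A smaller `sketch_eps` gives more buckets and finer split candidates, at the cost of more encrypted sums per feature.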

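Overall, the three-party setting is more efficient, which reflects the benefit of parallel computation across parties.

The LAN experiments simulate performance on a local area network, and the WAN experiments simulate performance on a low-latency Internet connection. For a homomorphic encryption scheme, computation should be the bottleneck, so running time is far less sensitive to network latency than it is for secret sharing schemes, and the time taken differs little between LAN and WAN modes.

When configuring the protocol used by HEU, we ran both paillier and ou for comparison (the key length defaults to 2048 bits). Paillier and OU are both IND-CPA secure, that is, semantically secure, cryptosystems, but they rest on different hardness assumptions. OU outperforms Paillier in encryption and in ciphertext addition, and its ciphertexts are half the size of Paillier's; see the link below for a more detailed introduction to OU. Overall, OU makes SecretFlow SGB 3 to 4 times faster than Paillier and halves the memory requirement.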
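To make "ciphertext addition" concrete, here is a toy Paillier sketch in pure Python. It uses tiny hard-coded primes and is insecure by construction; it is not how HEU implements Paillier, and all function names are illustrative. It shows that multiplying ciphertexts adds the underlying plaintexts, which is the operation used to sum encrypted gradients in SecureBoost-style training, and that a Paillier ciphertext lives modulo n², twice the bit length of n. An OU ciphertext lives modulo n, which is where the half-size figure comes from.

```python
import math
import random


def keygen(p, q):
    """Toy Paillier key generation from two small primes (demo only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse, needs Python 3.8+
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt 0 <= m < n with generator g = n + 1."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    # (n + 1)^m mod n^2 equals 1 + m*n.
    return ((1 + m * n) * pow(r, n, n2)) % n2


def decrypt(n, sk, c):
    lam, mu = sk
    u = pow(c, lam, n * n)
    return (((u - 1) // n) * mu) % n


def add(n, c1, c2):
    """Homomorphic addition: multiply ciphertexts modulo n^2."""
    return (c1 * c2) % (n * n)


n, sk = keygen(1009, 1013)
gradients = [12, 30, 7]  # stand-in integer-encoded gradients
total = encrypt(n, 0)
for g in gradients:
    total = add(n, total, encrypt(n, g))
assert decrypt(n, sk, total) == sum(gradients)
print("modulus bits:", n.bit_length(), "ciphertext bits:", (n * n).bit_length())
```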

References:

Introduction to the Okamoto-Uchiyama algorithm

https://www.secretflow.org.cn/docs/heu/zh_CN/getting_started/...

🏠 SecretFlow community:
https://github.com/secretflow
https://gitee.com/secretflow
https://www.secretflow.org.cn (official website)

👇 Follow us:
WeChat official account: 隱語的小劇場
Bilibili: 隱語secretflow
Email: secretflow-contact@service.alipay.com
