Adaptive Learned Bloom Filter

论文：https://par.nsf.gov/servlets/purl/10217248

　Learned Bloom filter

Learned Bloom filter (LBF) 在Bloom filter之前增加了机器学习模型作为预过滤器，对于每个查询的元素$x$，给出得分$s(x)$，$s(x)$通常与$x\in S$的概率正相关。分类器先在一些可用的训练数据上进行预训练，根据其训练数据的特征对给定查询$x\in S$进行判断（二分类）。LBF设置阈值$\tau$，当$s(x)\geq \tau$时则判断$x\in S$，否则便使用Bloom filter进行进一步的判断。就像standard Bloom filter，LBF的false negative rate（FNR）为0。而false positive rate（FPR）可以由分类模型与Bloom filter的误报引起。
可以看出，当区域$s(x)≥τ$包含更多的$x$时，插入Bloom filter的关键字数会减少，从而导致良好的FPR。由于区域$s(x)≥τ$为正值，为了保证预过滤器判断正确，所以$τ$值越高越好，但$\tau$的增大会增加Bloom filter的负载。因此需要找到两者的平衡。

Wastage of Information

当$s(x)<\tau$时，采用Bloom filter进行查询判断，此时$s(x)$中的信息并没有得到使用。例如，当有两个元素$x_1,x_2$，并且$\tau>s(x_1)\gg s(x_2)$。在使用LBF时，两者会采用相同方法进行判断，而直观来看，$x_1$比$x_2$更有可能属于$S$。

Strong dependency on Generalization

由原理可知，当数据分布没有变化时，预过滤器更高的准确率可以降低FPR。但是，在部署了Bloom filters的在线环境中，数据分布可能会发生变化。众所周知，数据流具有分布上有漂移的突发性质。因此，分类器的置信度以及阈值并不是完全可靠的。其次，机器学习对特殊例子的敏感性给系统带来了新的问题，对于任何给定置信水平$τ$的分类器，可以很容易地创建被错误分类的示例。Bloom filter通常用于增加对抗性误报率的网络中，这可能会损害性能。由于冲突而增加的延迟可能会引发Denial-of-Service attacks(DoS)。

Motivation

对于一个分类器，分布密度$f(s(x))$在集合内和集合外的元素呈现不同的趋势。观察到，对于键，$f(s(x)|x \in S)$随着$s(x)$的增加呈现上升趋势，而$f(s(x)|x \notin S)$呈现相反的趋势。为了减少整体FPR，$f(s(x)|x \notin S)$高的组需要更低的FPRs。因此，如果用不同的方式调整哈希函数的数量，相应的组需要更多的哈希函数，而对于有几个非键的组允许更高的FPRs。

Adaptive Learned Bloom Filter (Ada-BF)

上一节中LBF的公式，LBF实际上将$x$分成了两组。当$s(x)≥τ$时，$x$将被直接识别为集合中的元素，而不需要使用Bloom filter测试。换句话说，它先使用零哈希函数来标识资格。否则，将使用$K$哈希函数测试，则通过$s(x)≥τ$判断从$0$哈希函数到$K$哈希函数的条件。Ada-BF根据$s(x)$将$x$分为$g$个组，对于组$j$，采用$K_j$个哈希函数来判断它的资格。Ada-BF的结构如Figure 1(b)所示。
Ada-BF将$s(x)$分为$g$组，其中$x\in \text{Group}\ j$，$s(x)\in[\tau_{j-1};\tau_j), j = 1,2,···,g$，设$0 = τ_0 < τ_1 <···< τ_{g−1} < τ_g = 1$，来自组$j$的关键字使用$K_j$个独立的哈希函数插入到Bloom filter中。对于$j$组，FPR的期望值可表示为
$\mathbb{E}\left(\mathrm{FPR}_{j}\right)=\left(1-\left(1-\frac{1}{R}\right)^{\sum_{t=1}^{g} n_{t} K_{t}}\right)^{K_{j}}=\alpha^{K_{j}}$
其中$n_{t}=\sum_{t=1}^{n} I\left(\tau_{t-1} \leq s\left(x_{i} \mid x_{i} \in S\right)<\tau_{t}\right)$落在$\text{group}\ t$的关键字数。为了避免比特数组的过载，我们只在键数$n_j$较多的组中增加$K_j$，而在键数$n_j$较少的组中减少$K_j$。很明显，Ada-BF对LBF进行了推广。当Ada-BF只通过将查询分成两组时$K_1 = K, K_2 = 0， τ_1 = τ$， Ada-BF降为LBF。

值得注意的是，随着$s(x)$的增加，$f(s(x)|x\in S)$与$f(s(x)|x\notin S)$的趋势相反(Figure 2)。

Simplifying the Hyper-Parameters

为了实现Ada-BF，需要确定$2g-1$个超参数，包括每组的哈希函数数量$K_j$和分组的评分阈值$τ_j (τ_0 = 0， τ_g = 1)$。使用这些超参数，对于Ada-BF，总体FPR的期望值可以表示为:
$\mathbb{E}(\mathrm{FPR})=\sum_{j=1}^{g} p_{j} \mathbb{E}\left(\mathrm{FPR}_{j}\right)=\sum_{j=1}^{g} p_{j} \alpha^{K_{j}}$
其中$p_{j}=\operatorname{Pr}\left(\tau_{j-1} \leq s\left(x_{i} \mid x_{i} \notin S\right)<\tau_{j}\right)$，$
p_{j}$可以用$\hat{p}_{j}=\frac{1}{m} \sum_{i=1}^{m} I\left(\tau_{j-1} \leq s\left(x_{i} \mid x_{i} \notin S\right)<\tau_{j}\right)=\frac{m_{j}}{m}$表示($m$为训练数据中非键的数量，$m_j$为$j$组中非键的数量)。由于$\sum_{j=1}^{g} m_{j} \alpha^{K_{j}}=O\left(\max _{j}\left(m_{j} \alpha^{K_{j}}\right)\right)$，可以考虑使用$m_{j} \alpha^{K_{j}}$来近似。而$\alpha^{K_{j}}$随着$K_j$的增大呈指数级下降，为了保持$m_j\alpha^{K_{j}}$在不同组中的稳定，我们需要$m_j$随着$K_j$呈指数级增长。此外，由于$f(s(x)|x\notin S)$在大多数情况下随着$s(x)$变小而增大，所以对于较小的$s(x)$， $K_j$也应该变大，当$j$减少时，我们应该线性增加$K_j$，让$m_j$指数增长。
根据以上观点，设置$\frac{p_j}{p_{j+1}}=c$和$K_j-K_{j+1}=1$，其中$j=1,2,\cdots,g-1$，因为$s(x|x\notin S)$的真实密度是未知的，用$\hat{\frac{p_j}{p_{j+1}}}=\frac{m_j}{m_{j+1}}=c$来估计$\frac{p_j}{p_{j+1}}$，该策略确保了$\hat{p}_j$与$K_j$的指数级增长。现在只有三个超参数，$c,K_{min},K_{max}(K_{max}=K_1)$。默认情况下设置$K_{min} = K_{g} = 0$，等价于将$g$组中的所有项目标识为键。

Analysis of Adaptive Learned Bloom Filter

与LBF相比，Ada-BF充分利用密度分布$s(x)$，对不同区间的FPR进行优化，而且Ada-BF可以在不增加内存使用的情况下降低LBF的FPR。

当$p_j/p_{j+1}=c_j\geq c > 1$，$K_j-K_{j+1}= 1$时，预期的FPR如下

$\mathbb{E}(\mathrm{FPR})=\sum_{j=1}^{g} p_{j} \alpha^{K_{j}}=\frac{\sum_{j=1}^{g} c^{g-j} \alpha^{K_{j}}}{\sum_{j=1}^{g} c^{g-j}} \leq\left\{\begin{array}{ll} \frac{(1-c)\left(1-(c \alpha)^{g}\right)}{\left(\frac{1}{\alpha}-c\right)\left(\alpha^{g}-(c \alpha)^{g}\right)} \alpha^{K_{\max }}, & c \alpha \neq 1 \\ \frac{1-c}{1-c^{g}} \cdot g, & c \alpha=1 \end{array}\right.$

其中$K_{max} = K_1$。为了方便分析，设$cα > 1$。假设$g$组的数目是固定的，提高$c$就可以满足这以上条件，而$α$会随着$c$的增大而增大。为了比较，需要LBF的$τ$等于Ada-BF的$\tau_{g-1}$，在这种情况下，分数高于$τ$的查询会被机器学习模型直接判断为键。比较整体的FPR便只需要比较得分低于$τ$的查询的FPR。

Theorem 1：对于Ada-BF，对于所有的$j\in[g-1],$给定$\frac{p_j}{p_{j+1}}\geq c >1$，如果存在$λ > 0$使$cα≥1+λ$成立，并且对于所有的$j\in[g-1],$$n_{j+1}−n_j > 0$($n_j$是$j$组的键数)。当$g$足够大且$g≤\lfloor2K\rfloor$时(K是LBF哈希函数的个数)，Ada-BF的FPR小于LBF。
Theorem 1要求$n_j$的键数不断增加，而$p_j$随着$j$呈指数级下降。如图2所示，在真实数据集上，随着分数的增加，$f(s(x)|x\notin S)$下降得非常快，而$f(s(x)|x \in S)$增加，能够满足定理的条件。

Disjoint Adaptive Learned Bloom Filter (Disjoint Ada-BF)

Ada-BF根据键的分数将键分成$g$组，并使用不同数量的哈希函数将键散列到同一个Bloom Filter中。基于类似的想法，disjoint Ada-BF也将键分成$g$组，但是将不同组的键散列到独立的Bloom Filter中。disjoint Ada-BF的结构如图1(c)所示。

假设Bloom Filter的总预算为$R$位，键被分成$g$组。因此，来自组$j$的键被插入到长度为$R_j (R = \sum_{j=1}^g R_j)$的Bloom filter。在查找阶段，需要确定查询所属的组并检查其在对应Bloom Filter中的成员关系。

Simplifying the Hyper-Parameters

与Ada-BF类似，disjoint Ada-BF也有需要设置的超参数，例如组分割的分数阈值$τ_j$和Bloom Filter的长度$R_j$。阈值$τ_j$的确定与上一节类似。为了找到优化整体FPR的$R_j$，我们再次引用上一节的思想，即不同组的false postivives预期应该是相似的。对于具有$Rj$位的Bloom filter，最优哈希函数数量可以近似为$K_j=\frac{R_j}{n_j}\text{log(2)}$，其中$n_j$是组$j$中的键数。对应的最优FPR的期望为$\mathbb{E}\left(\mathrm{FPR}_{j}\right)=\mu^{R_{j} / n_{j}}(\mu \approx 0.618)$，因此，为了使不同组的false postivives预期相似，$R_j$需要满足 $m_{j} \cdot \mu^{\frac{R_{j}}{n_{j}}}=m_{1} \cdot \mu^{\frac{R_{1}}{n_{1}}} \Longleftrightarrow \frac{R_{j}}{n_{j}}-\frac{R_{1}}{n_{1}}=\frac{(j-1) \log (c)}{\log (\mu)}$ 在已知阈值$τ_j$的前提下，$n_j$是已知的，并且桶的总预算$R$是已知的，因此可以求解$R_j$。

Analysis of Disjoint Adaptive Learned Bloom Filter

Disjoint Ada-BF使用一组较短的Bloom Filter来存储键的哈希输出。核心思想与Ada-BF相同，都是通过减少组的FPR来降低整体的FPR。Disjoint Ada-BF为每组分配Bloom Filter，以实现更小的FPR。在下面的定理中，证明在达到LBF的最优预期FPR时，Disjoint Ada-BF消耗更少的桶。
Theorem 2：对于所有的$j\in[g-1]$($n_j$是组$j$的键数)，如果$\frac{p_j}{p_{j+1}}=c>1$和$n_{j+1}−n_j > 0$，为了实现LBF的最优FPR，当$g$较大时，Disjoint Ada-BF消耗的桶数比LBF少。

Experiment

论文测试现有几种Bloom filter算法的性能:1)standard Bloom filter，2)learned Bloom filter，3)sandwiched learned Bloom filter ，4)adaptive learned Bloom filter ，5)disjoint adaptive learned Bloom filter。使用两个具有不同关联任务的数据集，即恶意url检测和病毒扫描，通过它们的FPRs和相应的内存使用来衡量性能。作者在https://github.com/DAIZHENWEI/Ada-BF 上开源了自己的代码与数据集。
url数据集一共有三列（url，label，score），分别为需要检测的url，标签与预先训练的随机森林模型对应得分$\text{score}(x)$(采用“sklearn.ensemble.RandomForestClassifier“)。

病毒扫描数据集一共有三列（label，score，MD5），分别为标签、预先训练的随机森林模型对应得分$\text{score}(x)$(采用“sklearn.ensemble.RandomForestClassifier“)与文件的MD5签名。
实验的相关设置可参考原论文，最终结果如下。

Implement

以下为论文代码实现的相关说明与注释，包括Bloom_filter、learned_Bloom_filter、Ada-BF、disjoint_Ada-BF四种。

Bloom_filter

首先采用sklearn.utils.murmurhash3_32函数作为Bloom filter中的哈希函数，根据key值得到32位的哈希值，对Bloom filter的大小$m$取余则得到$[0,m-1]$范围的值作为下标索引。
1
2
3
4
5
def hashfunc(m): ss = randint(1, 99999999) def hash_m(x): return murmurhash3_32(x,seed=ss)%m return hash_m

Standard Bloom filter的类初始化方式如下，根据推导公式$k=ln\ 2\ \cdot\ (m/n)$得到最优$k$值，

class BloomFilter():
    def __init__(self, n, hash_len):
        self.n = n
        self.hash_len = int(hash_len)
        if (self.hash_len == 0):
            raise SyntaxError('The hash table is empty')
        # 根据推导公式得到最优k值
        if (self.n > 0) & (self.hash_len > 0):
            self.k = max(1,int(self.hash_len/n*0.6931472))
        elif (self.n==0):
            self.k = 1
        # 创建k个哈希函数
        self.h = []
        for i in range(self.k):
            self.h.append(hashfunc(self.hash_len))
        # 初始化“位数组”
        self.table = np.zeros(self.hash_len, dtype=int)

插入时进行$k$次哈希映射，将位数组中对应位置$1$，

def insert(self, key):
    if self.hash_len == 0:
        raise SyntaxError('cannot insert to an empty hash table')
    for i in key:
        for j in range(self.k):
            t = self.h[j](i)
            self.table[t] = 1

检测时，同样进行$k$次哈希映射，判断对应位是否都为$1$，返回结果。

def test(self, keys, single_key = True):
    if single_key:
        test_result = 0
        match = 0
        if self.hash_len > 0:
            for j in range(self.k):
                t = self.h[j](keys)
                match += 1 * (self.table[t] == 1)
            if match == self.k:
                test_result = 1
    else:
        test_result = np.zeros(len(keys))
        ss=0
        if self.hash_len > 0:
            for key in keys:
                match = 0
                for j in range(self.k):
                    t = self.h[j](key)
                    match += 1*(self.table[t] == 1)
                if match == self.k:
                    test_result[ss] = 1
                ss += 1
    return test_result

使用 python Bloom_filter.py --data_path ./Datasets/URL_data.csv --size_of_BF 200000运行代码，url任务对应main函数如下

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_path', action="store", dest="data_path", type=str, required=True,
                        help="path of the dataset")
    parser.add_argument('--size_of_BF', action="store", dest="R_sum", type=int, required=True,
                        help="size of the BF")
    
    # 数据集路径与位数组大小
    results = parser.parse_args()
    DATA_PATH = results.data_path
    R_sum = results.R_sum

    data = pd.read_csv(DATA_PATH)

    negative_sample = data.loc[(data['label'] == -1)]
    positive_sample = data.loc[(data['label'] == 1)]

    # 对所有正样例进行插入
    query = positive_sample['url']
    n = len(query)
    bloom_filter = BloomFilter(n, R_sum)
    bloom_filter.insert(query)
    query_negative = negative_sample['url']
    
    # 对所有负样例进行检测，得到结果
    n1 = bloom_filter.test(query_negative, single_key=False)
    print('False positive items: ', sum(n1))
    print('FPR:', sum(n1)/len(negative_sample))

learned_Bloom_filter

由于数据中已经有了预先训练的随机森林模型对应得分，因此需要确定最优的$\tau$，以url任务为例。

# train_negative为30%抽样得到的负样例
def Find_Optimal_Parameters(max_thres, min_thres, R_sum, train_negative, positive_sample):
    FP_opt = train_negative.shape[0]
    
    # 从最小值到最大值遍历计算确定最优值
    for threshold in np.arange(min_thres, max_thres+10**(-6), 0.01):
        # 根据阈值得到插入Bloomfilter的样例
        query = positive_sample.loc[(positive_sample['score'] <= threshold),'url']
        n = len(query)
        bloom_filter = BloomFilter(n, R_sum)
        bloom_filter.insert(query)
        # 分类器分类错误的样例
        ML_positive = train_negative.loc[(train_negative['score'] > threshold),'url']
        # Bloomfilter检测错误的样例
        bloom_negative = train_negative.loc[(train_negative['score'] <= threshold),'url']
        BF_positive = bloom_filter.test(bloom_negative, single_key=False)
        # False Positive
        FP_items = sum(BF_positive) + len(ML_positive)
        print('Threshold: %f, False positive items: %d' %(round(threshold, 2), FP_items))
        # 保存最优的阈值与对应Bloomfilter
        if FP_opt > FP_items:
            FP_opt = FP_items
            thres_opt = threshold
            bloom_filter_opt = bloom_filter
    return bloom_filter_opt, thres_opt

获得最优的$\tau$后，在整个负样例上进行检测，得到最终的FPR。

if __name__ == '__main__':
    '''Stage 1: Find the hyper-parameters (spare 30% samples to find the parameters)'''
    bloom_filter_opt, thres_opt = Find_Optimal_Parameters(max_thres, min_thres, R_sum, train_negative, positive_sample)

    '''Stage 2: Run LBF on all the samples'''
    ### Test queries
    ML_positive = negative_sample.loc[(negative_sample['score'] > thres_opt), 'url']
    bloom_negative = negative_sample.loc[(negative_sample['score'] <= thres_opt), 'url']
    score_negative = negative_sample.loc[(negative_sample['score'] < thres_opt), 'score']
    BF_positive = bloom_filter_opt.test(bloom_negative, single_key = False)
    FP_items = sum(BF_positive) + len(ML_positive)
    FPR = FP_items/len(negative_sample)
    print('False positive items: {}; FPR: {}; Size of quries: {}'.format(FP_items, FPR, len(negative_sample)))

Ada-BF

Ada-BF中Bloom_filter与原始Bloom_filter的差别在于哈希函数的个数，由于$K_j-K_{j+1}=1$，且$K_g=0$，则$K_{max}$与划分的组数有关，因此Ada-BF需要根据组数进行哈希函数数量的修改，在检测时需要根据$k$值进行$k$次映射。

class Ada_BloomFilter():
    def __init__(self, n, hash_len, k_max):
        self.n = n
        self.hash_len = int(hash_len)
        self.h = []
        # 创建k_max个哈希函数
        for i in range(int(k_max)):
            self.h.append(hashfunc(self.hash_len))
        self.table = np.zeros(self.hash_len, dtype=int)
    def insert(self, key, k):
        for j in range(int(k)):
            t = self.h[j](key)
            self.table[t] = 1
    def test(self, key, k):
        test_result = 0
        match = 0
        for j in range(int(k)):
            t = self.h[j](key)
            match += 1*(self.table[t] == 1)
        if match == k:
            test_result = 1
        return test_result

根据上面的分析，需要确定的最优参数有组数$g$与$\hat{\frac{p_j}{p_{j+1}}}=\frac{m_j}{m_{j+1}}=c$，同样采用从最小值到最大值遍历并保存最优值的方式。

def Find_Optimal_Parameters(c_min, c_max, num_group_min, num_group_max, R_sum, train_negative, positive_sample):
    # c_min到c_max寻找最优值
    c_set = np.arange(c_min, c_max+10**(-6), 0.1)
    FP_opt = train_negative.shape[0]

    k_min = 0
    # k_max 与组数有关
    for k_max in range(num_group_min, num_group_max+1):
        for c in c_set:
            # tau = sum([c**0, c**1,...]) 则各划分区间的最小单位
            tau = sum(c ** np.arange(0, k_max - k_min + 1, 1))
            n = positive_sample.shape[0]
            hash_len = R_sum
            bloom_filter = Ada_BloomFilter(n, hash_len, k_max)
            thresholds = np.zeros(k_max - k_min + 1)
            thresholds[-1] = 1.1
            # 抽样后的负样例总数
            num_negative = sum(train_negative['score'] <= thresholds[-1])
            # 按比例分成碎片，一共tau个碎片
            num_piece = int(num_negative / tau) + 1
            
            score = train_negative.loc[(train_negative['score'] <= thresholds[-1]), 'score']
            score = np.sort(score)
            for k in range(k_min, k_max):
                i = k - k_min
                score_1 = score[score < thresholds[-(i + 1)]]
                # 从后往前计算区间划分阈值 根据m_j/m_{j+1} = c并且剩下样例足够划分
                if int(num_piece * c ** i) < len(score_1):
                    thresholds[-(i + 2)] = score_1[-int(num_piece * c ** i)]

            query = positive_sample['url']
            score = positive_sample['score']
            
            # 插入数据
            for score_s, query_s in zip(score, query):
                ix = min(np.where(score_s < thresholds)[0])
                k = k_max - ix
                bloom_filter.insert(query_s, k)
            # 进行FPR估计
            ML_positive = train_negative.loc[(train_negative['score'] >= thresholds[-2]), 'url']
            query_negative = train_negative.loc[(train_negative['score'] < thresholds[-2]), 'url']
            score_negative = train_negative.loc[(train_negative['score'] < thresholds[-2]), 'score']
			
            # 进行检测
            test_result = np.zeros(len(query_negative))
            ss = 0
            for score_s, query_s in zip(score_negative, query_negative):
                ix = min(np.where(score_s < thresholds)[0])
                # thres = thresholds[ix]
                k = k_max - ix
                test_result[ss] = bloom_filter.test(query_s, k)
                ss += 1
            FP_items = sum(test_result) + len(ML_positive)
            print('False positive items: %d, Number of groups: %d, c = %f' %(FP_items, k_max, round(c, 2)))
            # 保存最优结果
            if FP_opt > FP_items:
                FP_opt = FP_items
                bloom_filter_opt = bloom_filter
                thresholds_opt = thresholds
                k_max_opt = k_max

    # print('Optimal FPs: %f, Optimal c: %f, Optimal num_group: %d' % (FP_opt, c_opt, num_group_opt))
    return bloom_filter_opt, thresholds_opt, k_max_opt

得到最优参数后在整个测试集上做检测，得到最终FPR。

disjoint_Ada-BF

与Ada-BF不同的是，disjoint_Ada-BF需要计算每个组的Bloom filter的位数组大小$R_j$，计算使用的推导公式为

$\frac{R_{j}}{n_{j}}-\frac{R_{1}}{n_{1}}=\frac{(j-1) \log (c)}{\log (\mu)}$

由于在划分阈值的时候，各组区间的非键数按$c$指数增长，因此可以用以计算$c^{j-1}$。

def R_size(count_key, count_nonkey, R0):
    R = [0]*len(count_key)
    R[0] = max(R0,1)
    #　根据推导公式计算R
    for k in range(1, len(count_key)):
        R[k] = max(int(count_key[k] * (np.log(count_nonkey[0]/count_nonkey[k])/np.log(0.618) + R[0]/count_key[0])), 1)
    return R

根据以上函数，调节$R_0$使得最终得到的$R$数组满足所给定内存，则退出调节。

### Search the Bloom filters' size
R = np.zeros(num_group_1 - 1)
R[:] = 0.5 * R_sum
non_empty_ix = min(np.where(count_key > 0)[0])
if non_empty_ix > 0:
    R[0:non_empty_ix] = 0
kk = 1
# 通过调节R_0的大小,使得根据推导公式分配的R满足要求
while abs(sum(R) - R_sum) > 200:
    if (sum(R) > R_sum):
        R[non_empty_ix] = R[non_empty_ix] - int((0.5 * R_sum) * (0.5) ** kk + 1)
    else:
        R[non_empty_ix] = R[non_empty_ix] + int((0.5 * R_sum) * (0.5) ** kk + 1)
    R[non_empty_ix:] = R_size(count_key[non_empty_ix:-1], count_nonkey[non_empty_ix:-1], R[non_empty_ix])
    # 调节大小可以忽略时结束
    if int((0.5 * R_sum) * (0.5) ** kk + 1) == 1:
        break
    kk += 1

为每个组创建一个Bloom filter，其他操作类似。

由于互联网的发展，像Twitter这样的社交媒体几秒内便可在全球范围内传播信息。每天大约有5亿条推文，也就是每秒产生6000 - 10000条推文。Twitter最近报道了一项对其全球分布式索引系统(Twia)的重大改革，该系统使一条推文在一秒钟内就能被全世界看到，而误导内容(或错误信息)也因此会以极快速度在网络上产生并传播。
如果已经确定了一个错误信息的来源及其内容，为了防止其在网络快速传播，过滤这些错误信息的系统必须满足几个关键的条件。首先，一旦确定了错误信息的来源及其内容，应该在毫秒内告知全球各地的所有服务器错误信息的相关内容，同时需要保证最小的通讯开销。其次，给定大量生成的信息，需要在没有任何计算开销的情况下正确地清除信息。由于前所未有的数据生成速度，在清除信息的过程中任何计算开销都将导致系统崩溃。这是Bloom filter的经典用例，我们不希望错误信息通过过滤器，并且不会产生任何计算开销。否则，数据生成速率将超过我们的过滤速率。
在理想情况下，使用Bloom filter可以压缩错误信息的数据库。每在数据库中中增加一个错误信息，只需要传输几个比特。只要FPR足够低，系统就不会崩溃。总的来说，FPR非常重要，我们希望每次更新的通信(内存)尽可能少。由于机器学习在自然语言处理方面取得显著的成功，因此有了一个更高性能的LBF，但仅靠机器学习无法保证应用程序所需的零false negative。
在Preventing Misinformation Spread over Social Media in Real-Time实验在假新闻数据集上进行，该数据集包括23481条假新闻和21417条真新闻，采用的机器学习模型利用TFIDF特征训练朴素贝叶斯分类模型，并测试不同Learned Bloom filter的FPR。与之前的实验结果相似，Ada-BF和disjoint Ada-BF相比于BF和LBF有明显的优势。在相同的内存预算下，BF和LBF使FPRs降低了70%以上。而Ada-BF和disjoint Ada-BF只需要LBF一半的空间，有效地将通信开销减少了$1 / 2$。
在进一步实验中，除了精确的成员测试外，查询到的新闻是否与数据库中的假新闻高度相似也很重要。然而，目前的Bloom filter不能处理这个任务。Kirsch和Mitzenmacher提出了Distance-sensitive bloom filters，这可能是一个可能的解决方案，Learned Bloom filter可以很容易地扩展到Distance-sensitive bloom filters。

论文笔记

Bloom Filter

本博客所有文章除特别声明外，均采用 CC BY-SA 4.0 协议，转载请注明出处！

《Learning Reporting Dynamics during Breaking News for Rumour Detection in Social Media》笔记上一篇

《Network Applications of Bloom Filters：A Survey》笔记下一篇

《Adaptive Learned Bloom Filter (Ada-BF) Efficient Utilization of the Classifier with Application to Real-Time Information Filtering on the Web》笔记

Adaptive Learned Bloom Filter

Learned Bloom filter

Wastage of Information

Strong dependency on Generalization

Motivation

Adaptive Learned Bloom Filter (Ada-BF)

Simplifying the Hyper-Parameters

Analysis of Adaptive Learned Bloom Filter

Disjoint Adaptive Learned Bloom Filter (Disjoint Ada-BF)

Simplifying the Hyper-Parameters

Analysis of Disjoint Adaptive Learned Bloom Filter

Experiment

Implement

Bloom_filter

learned_Bloom_filter

Ada-BF

disjoint_Ada-BF

Preventing Misinformation Spread over Social Media in Real-Time

　Learned Bloom filter