# 绽放过滤器的启发式分析 -- probability-theory 领域 和 hashing 领域 和 bloom-filters 领域 cs 相关 的问题

## Heuristic analysis of Bloom filters

2

### 问题

• 长度为\$ n \$ bith的绽放过滤器。
• 数据将\$ s \$插入盛开过滤器。

I am currently watching a lecture on Bloom filters, and the professor is doing a heuristic analysis of Bloom filters.

It's all based on the following assumption:

All \$h_{i}(x)\$'s are uniformly random and independent (across different \$i\$'s and \$x\$'s)

Setup:

• Bloom filter of length \$n\$ bits.
• Data set \$S\$ is inserted into the Bloom filters.

The professors claims that for each bit of array \$A\$, the probability that it has been set to 1 is (under above assumption, and after data set has been inserted): \$1 - (1 - 1/n)^{k|S|}\$, where \$k\$ is the number of hash functions.

I am trying to understand why that's the probability.

He explains that as each of the \$|S|\$ objects gets inserted (0 or 1), there are \$k\$ "darts" thrown uniformly at random and independent from each other thrown into the array, so that gives me the intuition of why \$k\$ is being multiplied by \$|S|\$. He then says the probability that a given "dart" is one of the \$n\$ bits is \$1/n\$. Since there are a total of \$k|S|\$ "darts" thrown at the end, it makes sense that after the whole set \$S\$ has been inserted, the probability that a given bit has been hit is \$(1/n)^{k|S|}\$, and the probability that it hasn't been hit is \$(1-1/n)^{k|S|}\$. But I don't understand why all that is being subtracted from \$1\$ at the end to end up being \$1 - (1 - 1/n)^{k|S|}\$, so clearly I'm missing something...

Can somebody explain this to me?

## 回答列表

2

A bit is set to 1 if it has been hit. It has \$k|S|\$ chances of being hit, and each time it is hit with probability \$1/n\$. Using a union bound, the probability that a bit is hit is at most \$k|S|/n\$.

We can improve on this bound by calculating instead the probability that a bit is 0, that is, that it is not hit. For a bit not to be hit, it has to be missed each time. Since it has \$k|S|\$ opportunities to be hit, and each time it is missed with probability \$1-1/n\$, in total a bit is missed with probability \$(1-1/n)^{k|S|}\$. Thus a bit is not missed with probability \$1-(1-1/n)^{k|S|}\$.

## 相关问题

5  在绽放过滤器中，为什么哈希数*增加*滤波器中的位数？  ( In a bloom filter why does the optimal number of hashes increase with the num )

1  如何估算插入绽放过滤器的元素数量  ( How to estimate the number of elements inserted to a bloom filter )
a bloom filter 是一个概率数据结构，允许使用误报的组编码集。< / p> 由位数 \$ m \$ 中的参数化，在数组 \$ a \$ （初始化为零）和哈希函数的数量 \$ k \$ ，Item \$ x \$ 对应于设置 \$ a [h_i（x）] = 1 \$ for \$ i = 1， ldots，k \$ 。...

2  绽放过滤器的启发式分析  ( Heuristic analysis of bloom filters )

1  为什么绽放过滤器需要\$ frac {m} {n} ln {2} \$散列函数？  ( Why bloom filter needs fracmn ln2 hash functions )

3  绽放过滤器为20800万URL  ( Bloom filter for 208 million urls )

7  双重散列和绽放过滤器的变化  ( Double hashing and variations for bloom filters )

2  找到有效地占主导地位的或副线性的线性不平等  ( Find dominated or subsumed linear inequalities efficiently )

2  绽放过滤器变型  ( Bloom filter variant )

3  了解Murmur3  ( Understanding murmur3 )
Bloom过滤器数据结构需要一组散列函数。 Murmur3家族非常适合，因为它包含 99887665544330 参数，可以轻松创建各种不同的功能（加上它具有良好的价值分配和充足的速度）。 选择 seed 是否有任何特定的方法选择Bloom Filter应用程序的 998807665544332 值？通过随机选...

1  双重散列碰撞  ( Double hashing collision )

1  Bloom滤波器交叉点和盛开过滤器联合之间的比率等于其原始集合的比率？  ( Is the ratio between a bloom filter intersection and a bloom filter union equal )

7  盛开滤波器的第一个公众参考是什么？  ( What was the first public reference to bloom filters where the number of hash fu )

0  快速和空间有效的方法，以找到两个套是否相交  ( Quick and space efficient way to find whether two sets intersect )

4  为什么绽放过滤器工作？  ( Why do bloom filters work )

9  删除盛开过滤器  ( Deleting in bloom filters )

11  替代浏览过滤器以获得极端参数  ( Alternative to bloom filter for extreme parameters )

3  比盛开过滤器的理论假阳性率更好  ( Achieving better than the theoretical false positive rate for bloom filters )

4  有效地执行“批处理”近似会员查询的方法  ( Ways to perform batch approximate member queries efficiently )

1  Count-Min草图和商滤波器之间的性能差异（空间和时间复杂性）是什么？  ( What are the performance differnces space and time complexity between count mi )
Count-min草图和商滤波器的空间和时间复杂度（查找，插入）？ 我找不到这些复杂性。 我想制作一个节目，以最快的方式找到一个集合中对象的频率，因此时间复杂性对我来说非常重要。 我读取了关于这些对象的研究文章，关于"donâ€™t thrash：如何缓存闪存上的哈希值" ，它据说它的查找和插入操作比绽放过滤器更...

1  你怎么称呼“非概率绽放过滤器”？  ( What do you call a non probabilistic bloom filter )

1  为什么我们使用多个比特而不是单个位，以存储在绽放过滤器中的对象  ( Why do we use several bits instead of a single bit for storage of objects in blo )

0  在磁盘上使用大规模数据算法是否有效？  ( Is it efficient to use large scale data algorithms on disk )

4  整数设置与同性恋散列等类似的草图上的脱节查询  ( Integer set disjointness query on sketches with something like homomorphic hashi )

27  有防绽放过滤器吗？  ( Is there an anti bloom filter )
a bloom filter 使得可以有效地跟踪是否已经遇到了各种值在处理过程中。当有许多数据项时，盛开的过滤器可能会导致在哈希表上保存的重要内存。盛开过滤器的主要特征，它与哈希表共享，如果一个项目不是新的，它总是说"不是新的" ，但是一个非零概率将被标记为"不是新的"即使是新的。 是否有一个"反绽放过滤器" ...

7  绽放过滤器和完美的哈希  ( Bloom filter and perfect hashing )
a bloom filter 使用哈希函数以给定的设置\$ s \$测试成员身份，通过检查项目是否存在于指定位置。 为了减轻散列碰撞的影响，使用多种功能，如果使用通用哈希，则产生概率界限。 我们可以使用每元素10位以具有"合理" 错误率。 如果我们可以直接构建一个完美的哈希函数设置\$ s + idty \$ ，最后...