Reservoir Sampling

# Reservoir Sampling
# https://www.geeksforgeeks.org/reservoir-sampling/
# Reservoir sampling is a family of randomized algorithms for choosing k samples from a list of n items, where n is either very large or unknown. Typically n is large enough that the list doesn't fit into main memory, for example a list of search queries at Google or Facebook.
# So we are given a big array (or stream) of numbers (to simplify), and we need to write an efficient function that randomly selects k numbers, where 1 <= k <= n. Let the input array be stream[].
# A simple solution is to create an array reservoir[] of maximum size k and, one by one, randomly select an item from stream[0...n-1]. If the selected item has not been selected before, put it in reservoir[]. To check whether an item was previously selected, we need to search for it in reservoir[], which costs O(k) per check, so the algorithm takes about O(k^2) time when k is small relative to n and longer when repeated picks are frequent. This can be costly if k is big. It also requires random access to the whole input, so it is not suitable when the input arrives as a stream.
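# A minimal sketch of the simple solution described above, for comparison only. The name naive_sample and the rng parameter are illustrative assumptions, not part of the original article. The sketch tracks selected indexes rather than values so that duplicate values in stream[] are not mistaken for repeats; that choice is also an assumption.

```python
import random


def naive_sample(stream, k, rng=random):
    """Pick k items by repeated random draws, rejecting positions already chosen."""
    n = len(stream)  # the naive method needs n up front
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    chosen = []  # indexes picked so far (plays the role of reservoir[])
    while len(chosen) < k:
        idx = rng.randrange(n)  # random index in 0..n-1
        if idx not in chosen:   # linear search through the picks: O(k) per check
            chosen.append(idx)
    return [stream[idx] for idx in chosen]
```

# Because the function calls len(stream) and indexes into it, it cannot be used when n is unknown or the input can only be read once.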
# The problem can be solved in O(n) time with a method that also suits input in the form of a stream. The idea is similar to the Fisher–Yates shuffle algorithm. The steps are as follows.
# 1. Create an array reservoir[0...k-1] and copy the first k items of stream[] to it.
# 2. One by one, consider all items from the (k+1)th item to the nth item.
# 2.1. Generate a random number from 0 to i, where i is the index of the current item in stream[]. Let the generated random number be j.
# 2.2. If j is in the range 0 to k-1, replace reservoir[j] with stream[i].

# Efficient program to randomly select k items from a stream of items 
import random 
# A utility function to print an array
def printArray(stream, n):
    for i in range(n):
        print(stream[i], end=" ")
    print()

# A function to randomly select k items from stream[0...n-1]
def selectKItems1(stream, n, k):
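    # reservoir[] is the output array. Initialize it with the first k elements from stream[]
    reservoir = [0 for _ in range(k)]
    for i in range(k):
        reservoir[i] = stream[i]

    # i is the index for elements in stream[]. Start it at k: the for loop above
    # leaves i at k-1, and re-processing stream[k-1] would copy it over a reservoir slot.
    i = k
    # Iterate from the (k+1)th element to the nth element
    while i < n: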
        # Pick a random index from 0 to i
        j = random.randrange(i+1)
        # if the randomly picked index is smaller than k, then replace the element present at the index with new element from stream
        if j < k:
            reservoir[j] = stream[i]
        i += 1
    
    print('Following are k randomly selected items using 1st method')
    printArray(reservoir, k)

def selectKItems2(stream, n, k):
    reservoir = [0 for _ in range(k)]
    for i in range(k):
        reservoir[i] = stream[i]
    
    for i in range(k, n):
        j = random.randrange(i+1)
        if j < k:
            reservoir[j] = stream[i]
    print('Following are k randomly selected items using 2nd method')
    printArray(reservoir, k)
    
# Driver Code 
if __name__ == "__main__":
    stream = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
    n = len(stream)
    k = 5
    selectKItems1(stream, n, k)
    selectKItems2(stream, n, k)
    
# Time Complexity O(n)
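
# The functions above take n as an argument, which a true stream does not provide. The following is a hedged sketch of the same algorithm for any iterable of unknown length, returning the reservoir instead of printing it. The name sample_from_iterable and its rng parameter are hypothetical, introduced here for illustration.

```python
import random
from itertools import islice


def sample_from_iterable(iterable, k, rng=random):
    """Return up to k items chosen uniformly at random from a single pass over iterable."""
    it = iter(iterable)
    # Fill the reservoir with the first k items (fewer if the input is shorter).
    reservoir = list(islice(it, k))
    # i is the 0-based position of each remaining item in the overall stream.
    for i, item in enumerate(it, start=k):
        j = rng.randrange(i + 1)  # random index in 0..i
        if j < k:
            reservoir[j] = item
    return reservoir
```

# Passing a seeded random.Random instance as rng makes a run repeatable; with the default, the module-level generator is used.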

# How does this work?
# To prove that this solution works, we must show that the probability that any item stream[i], where 0 <= i < n, is in the final reservoir[] is k/n. We split the proof into two cases, because the first k items are treated differently.
# Case 1: the last n - k stream items, i.e. stream[i] where k <= i < n
# For every such item stream[i], we pick a random index from 0 to i, and if the picked index is one of the first k indexes, we replace the element at that index with stream[i].
# To simplify the proof, let us first consider the last item. The probability that the last item is in the final reservoir = the probability that one of the first k indexes is picked for the last item = k/n (the probability of picking one of k indexes out of n).
# Let us now consider the second-last item. The probability that the second-last item is in the final reservoir[] = [probability that one of the first k indexes is picked in the iteration for stream[n-2]] * [probability that the index picked in the iteration for stream[n-1] is not the index picked for stream[n-2]] = [k/(n-1)] * [(n-1)/n] = k/n
# Similarly, we can work backwards through the remaining items down to stream[k] and generalize the proof: stream[i] enters the reservoir with probability k/(i+1) and survives each later item stream[m] with probability m/(m+1), giving [k/(i+1)] * [(i+1)/(i+2)] * ... * [(n-1)/n] = k/n
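
# The Case 1 argument can be checked with exact arithmetic. This is a small sketch, not part of the original proof; the helper name survival_probability is an assumption made here. It multiplies the probability that stream[i] enters the reservoir by the probability that each later item leaves its slot alone.

```python
from fractions import Fraction


def survival_probability(i, n, k):
    """Exact probability that stream[i] (k <= i < n) is in the final reservoir."""
    p = Fraction(k, i + 1)       # stream[i] enters: one of k indexes out of i+1
    for m in range(i + 1, n):    # each later item stream[m] ...
        p *= Fraction(m, m + 1)  # ... misses stream[i]'s slot with probability m/(m+1)
    return p


# Every item in Case 1 should come out at exactly k/n.
n, k = 12, 5
assert all(survival_probability(i, n, k) == Fraction(k, n) for i in range(k, n))
```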

# Case 2: the first k stream items, i.e. stream[i] where 0 <= i < k
# The first k items are initially copied to reservoir[] and may be replaced later in the iterations for stream[k] to stream[n-1].
# The probability that an item from stream[0...k-1] is in the final array = the probability that the slot holding the item is not picked when items stream[k], stream[k+1], ..., stream[n-1] are considered = [k/(k+1)] * [(k+1)/(k+2)] * [(k+2)/(k+3)] * ... * [(n-1)/n] = k/n
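
# As an empirical complement to the proof, the sketch below runs the algorithm many times over the positions 0..n-1 and reports how often each position ends up in the reservoir. The function name inclusion_frequencies, the trial count, and the seed are all assumptions chosen for illustration; the frequencies should land near k/n but will not match it exactly.

```python
import random
from collections import Counter


def inclusion_frequencies(n, k, trials, seed=0):
    """Estimate, for each position 0..n-1, how often it is in the final reservoir."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same estimate
    counts = Counter()
    for _ in range(trials):
        reservoir = list(range(k))    # the first k positions fill the reservoir
        for i in range(k, n):
            j = rng.randrange(i + 1)  # random index in 0..i
            if j < k:
                reservoir[j] = i
        counts.update(reservoir)
    return {item: counts[item] / trials for item in range(n)}


if __name__ == "__main__":
    for item, freq in inclusion_frequencies(12, 5, 20000).items():
        print(item, round(freq, 3))
```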

# Example Implementation -- Samples the set of English Wikipedia page titles
'''
import random 
sample_count = 10 

# Force the value of the seed so the results are repeatable 
random.seed(12345)

sample_title = []
for index, line in enumerate(open("enwiki-20091103-all-titles-in-ns0")):
    # Generate the reservoir
    if index < sample_count:
        sample_title.append(line)
    else:
        # Randomly replace elements in the reservoir
        # with a decreasing probability.
        # Choose an integer between 0 and index (inclusive)
        r = random.randint(0, index)
        if r < sample_count:
            sample_title[r] = line 
print(sample_title)
'''
# Reservoir Sampling
# https://medium.com/100-days-of-algorithms/day-33-reservoir-sampling-252062ce0baa
import random
def reservoir_sample(size):
    i, sample = 0, []
    while True:
        item = yield i, sample 
        i += 1
        # Draw from 0..i-1 for the i-th item; random.randint(0, i) would also include i
        k = random.randrange(i)
        
        if len(sample) < size:
            sample.append(item)
        elif k < size:
            sample[k] = item
            
# Sampling 
reservoir = reservoir_sample(5)
next(reservoir)

for i in range(1000):
    k, sample = reservoir.send(i)
    if k % 100 == 0:
        print(k, sample)