# Reservoir Sampling
# https://www.geeksforgeeks.org/reservoir-sampling/

# Reservoir sampling is a way of randomly choosing k samples from a list of
# n items, where n is either very large or unknown. Typically n is large
# enough that the list doesn't fit into main memory, for example a list of
# search queries at Google or Facebook.

# So we are given a big array (or stream) of numbers (to simplify), and we
# need to write an efficient function to randomly select k numbers, where
# 1 <= k <= n. Let the input array be stream[].

# A simple solution is to create an array reservoir[] of maximum size k and,
# one by one, randomly select an item from stream[0...n-1]. If the selected
# item was not previously selected, put it in reservoir[]. To check whether an
# item was previously selected, we need to search for it in reservoir[]. The
# time complexity of this algorithm is O(k^2), which can be costly if k is
# big. It also does not suit input in the form of a stream, because it needs
# n in advance and random access to the items.

# The problem can be solved in O(n) time, in a single pass that also suits
# input in the form of a stream. The idea is similar to shuffling an array
# with the Fisher–Yates shuffle algorithm. The steps are:
# 1. Create an array reservoir[0...k-1] and copy the first k items of
#    stream[] to it.
# 2. One by one, consider all items from the (k+1)th item to the nth item.
#    2.1. Generate a random number from 0 to i, where i is the index of the
#         current item in stream[]. Let the generated random number be j.
#    2.2. If j is in the range 0 to k-1, replace reservoir[j] with stream[i].

# Efficient program to randomly select k items from a stream of items
import random


# A utility function to print an array
def printArray(stream, n):
    for i in range(n):
        print(stream[i], end=" ")
    print()


# A function to randomly select k items from stream[0...n-1]
def selectKItems1(stream, n, k):
    # reservoir[] is the output array. Initialize it with the first k
    # elements from stream[]
    reservoir = [0 for _ in range(k)]
    for i in range(k):
        reservoir[i] = stream[i]

    # Iterate from the (k+1)th element to the nth element. The index is set
    # explicitly: a Python for loop leaves i at k-1, not k, so reusing it
    # would process stream[k-1] a second time.
    i = k  # index for elements in stream[]
    while i < n:
        # Pick a random index from 0 to i
        j = random.randrange(i + 1)

        # If the randomly picked index is smaller than k, replace the
        # element at that index with the new element from the stream
        if j < k:
            reservoir[j] = stream[i]
        i += 1

    print('Following are k randomly selected items using 1st method')
    printArray(reservoir, k)


# The same algorithm, written with a for loop over the remaining indexes
def selectKItems2(stream, n, k):
    reservoir = [0 for _ in range(k)]
    for i in range(k):
        reservoir[i] = stream[i]

    for i in range(k, n):
        j = random.randrange(i + 1)
        if j < k:
            reservoir[j] = stream[i]

    print('Following are k randomly selected items using 2nd method')
    printArray(reservoir, k)


# Driver Code
if __name__ == "__main__":
    stream = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
    n = len(stream)
    k = 5
    selectKItems1(stream, n, k)
    selectKItems2(stream, n, k)

# Time Complexity: O(n)

# How does this work?
# To prove that this solution works, we must show that the probability that
# any item stream[i], where 0 <= i < n, is in the final reservoir[] is k / n.
# Let us divide the proof into 2 cases, as the first k items are treated
# differently.

# Case 1: For the last n - k stream items, i.e. for stream[i] where k <= i < n
# For every such stream item stream[i], we pick a random index from 0 to i,
# and if the picked index is one of the first k indexes, we replace the
# element at the picked index with stream[i].
# To simplify the proof, let us first consider the last item. The probability
# that the last item is in the final reservoir
#   = the probability that one of the first k indexes is picked for the last item
#   = k / n (the probability of picking one of k items from a list of size n)
# Let us now consider the second-last item. The probability that the
# second-last item is in the final reservoir[]
#   = [probability that one of the first k indexes is picked in the iteration
#      for stream[n-2]]
#     * [probability that the index picked in the iteration for stream[n-1]
#        is not the same as the index picked for stream[n-2]]
#   = [k/(n-1)] * [(n-1)/n]
#   = k / n
# Similarly, we can consider the other items from stream[n-1] down to
# stream[k] and generalize the proof: stream[i] enters the reservoir with
# probability k/(i+1) and survives each later iteration m (i < m < n) with
# probability m/(m+1), so the product telescopes:
#   [k/(i+1)] * [(i+1)/(i+2)] * ... * [(n-1)/n] = k / n

# Case 2: For the first k stream items, i.e. for stream[i] where 0 <= i < k
# The first k items are initially copied to reservoir[] and may be removed
# later in the iterations for stream[k] to stream[n-1].
# The probability that an item from stream[0...k-1] is in the final array
#   = the probability that the item is not replaced when items stream[k],
#     stream[k+1], ..., stream[n-1] are considered
#   = [k/(k+1)] * [(k+1)/(k+2)] * [(k+2)/(k+3)] * ... * [(n-1)/n]
#   = k / n

# Example Implementation -- Samples the set of English Wikipedia page titles
'''
import random

sample_count = 10

# Force the value of the seed so the results are repeatable
random.seed(12345)

sample_title = []
for index, line in enumerate(open("enwiki-20091103-all-titles-in-ns0")):
    # Generate the reservoir
    if index < sample_count:
        sample_title.append(line)
    else:
        # Randomly replace elements in the reservoir
        # with a decreasing probability.
        # Choose an integer between 0 and index (inclusive)
        r = random.randint(0, index)
        if r < sample_count:
            sample_title[r] = line

print(sample_title)
'''

# Reservoir Sampling
# https://medium.com/100-days-of-algorithms/day-33-reservoir-sampling-252062ce0baa
import random


def reservoir_sample(size):
    # i counts the items received so far; sample is the reservoir
    i, sample = 0, []
    while True:
        item = yield i, sample
        i += 1
        # Pick an index from 0 to i-1. random.randint(0, i) would include i
        # itself and make the replacement probability size/(i+1) instead of
        # size/i.
        k = random.randrange(i)
        if len(sample) < size:
            sample.append(item)
        elif k < size:
            sample[k] = item


# Sampling
reservoir = reservoir_sample(5)
next(reservoir)  # advance the generator to its first yield

for i in range(1000):
    k, sample = reservoir.send(i)
    if k % 100 == 0:
        print(k, sample)
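
# The proof above claims that every item ends up in the final reservoir with
# probability k / n. A rough empirical check of that claim could look like the
# sketch below. The names select_k_items and inclusion_frequencies are
# hypothetical helpers introduced only for this illustration, and the trial
# count and seed are arbitrary assumptions, not values from the sources above.

```python
import random
from collections import Counter


def select_k_items(stream, k, rng):
    """Return k items sampled from an iterable using the steps described above."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly
            reservoir.append(item)
        else:
            # Pick an index from 0 to i; replace only if it falls in 0..k-1
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir


def inclusion_frequencies(n, k, trials, seed=0):
    """Estimate how often each of the items 0..n-1 lands in the reservoir."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable runs
    counts = Counter()
    for _ in range(trials):
        counts.update(select_k_items(range(n), k, rng))
    return {item: counts[item] / trials for item in range(n)}


if __name__ == "__main__":
    n, k = 12, 5
    freqs = inclusion_frequencies(n, k, trials=20000)
    print("expected frequency:", round(k / n, 3))
    for item, freq in freqs.items():
        print(item, round(freq, 3))
```

# With enough trials, every observed frequency should sit close to k / n.
# An item whose frequency is clearly off usually points to an off-by-one
# error in the range of the random index.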