# Data-Intensive Distributed Computing

Part 9: Real-Time Data Analytics (2/2)

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

CS 431/631 451/651 (Winter 2019)

Adam Roegiest (Kira Systems)

April 2, 2019

These slides are available at http://roegiest.com/bigdata-2019w/

Since last time…

Storm/Heron: gives you pipes, but you have to connect everything up yourself

Spark Streaming: gives you RDDs, transformations, and windowing, but no event-time/processing-time distinction

Beam: gives you transformations, windowing, and the event-time/processing-time distinction, but it's too complex

Source: Wikipedia (River)

Stream Processing Frameworks

Spark Structured Streaming

Step 1: From RDDs to DataFrames

Step 2: From bounded to unbounded tables

Source: Spark Structured Streaming Documentation

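A minimal sketch of what those two steps look like in code, assuming PySpark and a socket text source (the host, port, and app name below are illustrative): the incoming stream is treated as an unbounded table, and ordinary DataFrame operations run over it.

```python
# Structured Streaming word count: the stream of lines is an unbounded table
# (Step 2), and the transformations on it are plain DataFrame operations (Step 1).
# Assumes a local Spark installation and a text source on localhost:9999
# (e.g., started with `nc -lk 9999`).
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

# The stream appears as a DataFrame with a single column, "value"
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Ordinary DataFrame transformations, exactly as in batch Spark
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Running counts are updated as new rows arrive and printed on each trigger
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```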

Source: Wikipedia (River)

Interlude

Stream Processing Challenges

Inherent challenges:
Latency requirements
Space bounds

System challenges:
Bursty behavior and load balancing
Out-of-order message delivery and non-determinism
Consistency semantics (at most once, exactly once, at least once)

Algorithmic Solutions

Throw away data: sampling

Accept some approximation: hashing

Reservoir Sampling

Task: select s elements from a stream of size N with uniform probability

N can be very, very large
We might not even know what N is (infinite stream)

Solution: reservoir sampling
Store the first s elements
For the k-th element thereafter, keep it with probability s/k (randomly discarding an existing element)

Example: s = 10
Keep the first 10 elements
11th element: keep with probability 10/11
12th element: keep with probability 10/12
…
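A minimal sketch of the algorithm just described, assuming the stream is any Python iterable and the sample size s fits in memory (this is the classic "Algorithm R" formulation of what the slide outlines):

```python
# Reservoir sampling: keep the first s elements, then keep the k-th element
# with probability s/k, evicting a uniformly chosen existing element.
import random

def reservoir_sample(stream, s):
    reservoir = []
    for k, item in enumerate(stream, start=1):
        if k <= s:
            reservoir.append(item)       # store the first s elements
        else:
            j = random.randrange(k)      # uniform in [0, k)
            if j < s:                    # happens with probability s/k
                reservoir[j] = item      # randomly discard an existing element
    return reservoir

# Example: sample s = 10 elements from a "stream" of a million integers
print(reservoir_sample(range(1_000_000), 10))
```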

Reservoir Sampling: How does it work?

Example: s = 10
Keep the first 10 elements
11th element: keep with probability 10/11

If we decide to keep it: it is sampled uniformly by definition
Probability an existing item is discarded: 10/11 × 1/10 = 1/11
Probability an existing item survives: 10/11

General case: at the (k + 1)-th element
Probability of having selected each item so far: s/k
Probability an existing item is discarded: s/(k + 1) × 1/s = 1/(k + 1)
Probability an existing item survives: k/(k + 1)
Probability each item survives to the (k + 1)-th round: (s/k) × k/(k + 1) = s/(k + 1)
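Chaining those per-round probabilities is the induction the slide is gesturing at; written out as a single step (this only restates the quantities above):

```latex
\Pr[\text{item in reservoir after } k+1 \text{ elements}]
  \;=\; \underbrace{\frac{s}{k}}_{\text{in reservoir after } k}
        \times
        \underbrace{\frac{k}{k+1}}_{\text{survives round } k+1}
  \;=\; \frac{s}{k+1}
```

The (k + 1)-th element itself is kept with probability s/(k + 1), so by induction every element of the stream ends up in the sample with probability s/N.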

Hashing for Three Common Tasks

Cardinality estimation: what's the cardinality of set S? (How many unique visitors to this page?) Exact: HashSet. Approximate: HLL counter.

Set membership: is x a member of set S? (Has this user seen this ad before?) Exact: HashSet. Approximate: Bloom filter.

Frequency estimation: how many times have we observed x? (How many queries has this user issued?) Exact: HashMap. Approximate: CMS.

HyperLogLog Counter

Task: cardinality estimation of a set
size() → number of unique elements in the set

Observation: hash each item and examine the hash code
On expectation, 1/2 of the hash codes will start with 0
On expectation, 1/4 of the hash codes will start with 00
On expectation, 1/8 of the hash codes will start with 000
On expectation, 1/16 of the hash codes will start with 0000
…

How do we take advantage of this observation?
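One way to see the idea in code: a deliberately crude, single-register sketch. This is not the real HyperLogLog algorithm, which keeps many registers bucketed by a hash prefix and combines them with a bias-corrected harmonic mean; all names below are illustrative.

```python
# Toy cardinality estimator: track the longest run of leading zero bits seen in
# any hash code, and guess the cardinality from it. A run of r leading zeros is
# expected once every ~2^r distinct items, which is the observation above.
import hashlib

def leading_zeros(h, bits=32):
    # Number of leading zero bits in a `bits`-bit hash value
    return bits - h.bit_length()

def crude_cardinality_estimate(items, bits=32):
    max_zeros = 0
    for x in items:
        h = int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:4], "big")
        max_zeros = max(max_zeros, leading_zeros(h, bits))
    return 2 ** (max_zeros + 1)

# A single register is very noisy, which is exactly why HyperLogLog
# averages over many of them
print(crude_cardinality_estimate(range(100_000)))
```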

Bloom Filters

Task: keep track of set membership
put(x) → insert x into the set
contains(x) → yes if x is a member of the set

Components:
m-bit bit vector (initially all zeros): 0 0 0 0 0 0 0 0 0 0 0 0
k hash functions: h1 … hk

Bloom Filters: put

put(x): h1(x) = 2, h2(x) = 5, h3(x) = 11

0 1 0 0 1 0 0 0 0 0 1 0

Bloom Filters: contains

contains(x): h1(x) = 2, h2(x) = 5, h3(x) = 11

0 1 0 0 1 0 0 0 0 0 1 0

A[h1(x)] AND A[h2(x)] AND A[h3(x)] = YES

contains(y): h1(y) = 2, h2(y) = 6, h3(y) = 9

0 1 0 0 1 0 0 0 0 0 1 0

A[h1(y)] AND A[h2(y)] AND A[h3(y)] = NO

What's going on here?

Bloom Filters

Error properties of contains(x):
False positives possible
No false negatives

Usage:
Constraints: capacity, error probability
Tunable parameters: size of bit vector m, number of hash functions k
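A minimal Bloom filter sketch in Python. The k hash functions are simulated here by salting a single hash with the function index, which is a common simplification; the slide's h1 … hk can be any independent hash functions, and m, k, and the example keys below are illustrative.

```python
# Bloom filter: an m-bit vector plus k hash functions. put() sets k positions;
# contains() answers YES only if all k positions are set, so false positives
# are possible but false negatives are not.
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m                      # size of the bit vector
        self.k = k                      # number of hash functions
        self.bits = [0] * m

    def _positions(self, x):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def put(self, x):
        for pos in self._positions(x):
            self.bits[pos] = 1          # set all k positions for x

    def contains(self, x):
        return all(self.bits[pos] for pos in self._positions(x))

bf = BloomFilter(m=1024, k=3)
bf.put("user42:ad7")
print(bf.contains("user42:ad7"))   # True
print(bf.contains("user99:ad7"))   # almost certainly False
```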


Count-Min Sketches

Task: frequency estimation
put(x) → increment the count of x by one
get(x) → return the frequency of x

Components:
m by k array of counters
k hash functions: h1 … hk

Initially (k = 4 rows of m = 12 counters):

0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0

Count-Min Sketches: put

put(x): h1(x) = 2, h2(x) = 5, h3(x) = 11, h4(x) = 4

0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 1 0 0 0 0 0 0 0 0

put(x) again:

0 2 0 0 0 0 0 0 0 0 0 0
0 0 0 0 2 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 0
0 0 0 2 0 0 0 0 0 0 0 0

put(y): h1(y) = 6, h2(y) = 5, h3(y) = 12, h4(y) = 2

0 2 0 0 0 1 0 0 0 0 0 0
0 0 0 0 3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 2 1
0 1 0 2 0 0 0 0 0 0 0 0

Count-Min Sketches: get

get(x): h1(x) = 2, h2(x) = 5, h3(x) = 11, h4(x) = 4
MIN(A[h1(x)], A[h2(x)], A[h3(x)], A[h4(x)]) = MIN(2, 3, 2, 2) = 2

get(y): h1(y) = 6, h2(y) = 5, h3(y) = 12, h4(y) = 2
MIN(A[h1(y)], A[h2(y)], A[h3(y)], A[h4(y)]) = MIN(1, 3, 1, 1) = 1

Count-Min Sketches

Error properties of get(x):
Reasonable estimates for heavy hitters
Frequent over-estimation of the tail

Usage:
Constraints: number of distinct events, distribution of events, error bounds
Tunable parameters: number of counters m, number of hash functions k, size of counters
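A minimal count-min sketch along the same lines as the walkthrough above. As with the Bloom filter, the k hash functions are simulated by salting one hash with the row index; m, k, and the example queries are illustrative.

```python
# Count-min sketch: k rows of m counters. put() increments one counter per row;
# get() takes the minimum over the rows, so hash collisions can only inflate
# the estimate, never shrink it.
import hashlib

class CountMinSketch:
    def __init__(self, m, k):
        self.m = m                                    # counters per row
        self.k = k                                    # number of rows / hashes
        self.counts = [[0] * m for _ in range(k)]

    def _pos(self, row, x):
        h = hashlib.sha1(f"{row}:{x}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.m

    def put(self, x):
        for row in range(self.k):
            self.counts[row][self._pos(row, x)] += 1

    def get(self, x):
        return min(self.counts[row][self._pos(row, x)] for row in range(self.k))

cms = CountMinSketch(m=1000, k=4)
for q in ["apple", "apple", "banana"]:
    cms.put(q)
print(cms.get("apple"))    # 2 (exact here; collisions would only inflate it)
print(cms.get("banana"))   # 1
```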

Hashing for Three Common Tasks

Cardinality estimation: what's the cardinality of set S? (How many unique visitors to this page?) Exact: HashSet. Approximate: HLL counter.

Set membership: is x a member of set S? (Has this user seen this ad before?) Exact: HashSet. Approximate: Bloom filter.

Frequency estimation: how many times have we observed x? (How many queries has this user issued?) Exact: HashMap. Approximate: CMS.

Source: Wikipedia (River)

Stream Processing Frameworks

[Diagram: the classic data warehousing architecture. Users interact with a frontend; the backend writes to an OLTP database; ETL (Extract, Transform, and Load) loads the data into a Data Warehouse, which analysts query with BI tools.]

My data is a day old… Yay!

Kafka, Heron, Spark Streaming, Spark Structured Streaming, …

Source: Wikipedia (Cake)

What about our cake?

Hybrid Online/Batch Processing

Example: count historical clicks and clicks in real time

[Diagram: sources write into Kafka; a Storm topology does the online processing and writes online results to one set of stores, while batch processing over HDFS writes batch results to another; the client reads from both sets of stores and merges the online and batch results.]
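A sketch of the read-time merge in such a hybrid design, assuming the batch results (recomputed periodically over HDFS) and the online results (maintained by the streaming job) are both exposed as key-value lookups. The function name and stores below are illustrative, not a specific system's API.

```python
# Merge the batch view (clicks up to the last completed batch run) with the
# online view (clicks since that run) at query time.

def total_clicks(url, batch_store, online_store):
    historical = batch_store.get(url, 0)   # written by the batch job
    recent = online_store.get(url, 0)      # written by the streaming job
    return historical + recent             # client merges the two views

batch_store = {"example.com/landing": 10_000}
online_store = {"example.com/landing": 42}
print(total_clicks("example.com/landing", batch_store, online_store))  # 10042
```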
