From SpectLog
Jump to: navigation, search

Official Documentation


State Management


Table-table join

[$] If you have changelog streams for several database tables, you can write a stream processing job which keeps the latest state of each table in a local key-value store, where you can access it much faster than by making queries to the original database.

Stream-table join

[$] Activity events such as page views generally only include a small number of attributes, such as the ID of the viewer and the viewed items, but not detailed attributes of the viewer and the viewed items, such as the ZIP code of the user.

Stream-stream join

[$] You cannot rely on the events arriving at the stream processor at the same time, but you can set a maximum period of time over which you allow the events to be spread out.

[!][@] This is important questions. [?] How to correlate two streams? [?] Should there be a dictionary of (aggregate type, aggregate id)? If only this information is provided, there is no big coupling. [!][@] Eventually, there is always a way to misplace two streams in time when one stream get correlated with other on duplicate (old) events rather than current ones.

[$] In order to perform a join between streams, your job needs to buffer events for the time window over which you want to join. [@] I've started elaborating on this in other notes - the join should be bound within time. Even associating events (a primitive join to match the same key) cannot wait forever.

Approaches to managing task state

In-memory state with checkpointing

Using an external store


Local state in Samza

[*] TODO

Key-value storage

[*] TODO

Debug Key-value storage

[*] TODO

Implementing common use cases with the key-value store

[*] TODO

Other storage engines

[*] TODO

Fault tolerance semantics with state

[*] TODO