BigQuery: Check for duplication during streaming


We have data generated by our devices installed on clients' side. Duplicated data exists by design, which means we won't be able to eliminate the duplicates in the data-generating phase. We are now looking at the possibility of avoiding duplication while streaming into BigQuery (rather than cleaning the data by doing a table copy and delete later). That is to say, for every ready-to-be-streamed record, we check whether it's already in BigQuery first; if not, we continue to stream it in, and if it does exist, we won't stream it in.

But here's our concern (quoted from here: https://developers.google.com/bigquery/streaming-data-into-bigquery):

Data availability

The first time a streaming insert occurs, the streamed data is inaccessible for a warm-up period of up to 2 minutes. After the warm-up period, all streamed data added during and after the warm-up period is immediately queryable. After several hours of inactivity, the warm-up period will occur again during the next insert.

Data can take up to 90 minutes to become available for copy and export operations.

Our data goes to different BigQuery tables (the table name is dynamically generated from the data's date_time). What does "the first time a streaming insert occurs" mean? Is it per table?

Does the above doc mean that we cannot rely on query results to check for duplicates in the process of streaming?

If you provide an insert ID, BigQuery will automatically do the deduplication for you, as long as the duplicates fall within the de-duplication window. The official docs don't mention how long the de-duplication window is, but it is generally from 5 minutes to 90 minutes (if you write data to a table very quickly, it will be closer to 5 than 90, but if data is trickled in, it will last longer in the deduplication buffers).
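For the insert ID to help, the same logical record has to get the same ID every time it is sent. Below is a minimal sketch of one way to do that, assuming each record has a set of fields that uniquely identify it. The field names `device_id` and `date_time` and the helper names `make_insert_id` and `build_insert_all_body` are hypothetical, chosen only for illustration. The sketch only builds the request body for the `tabledata.insertAll` REST call (each row carries an `insertId` and a `json` payload); sending it with an authenticated client is left out.

```python
import hashlib
import json


def make_insert_id(record, key_fields):
    """Derive a deterministic insertId from the fields that identify a record.

    key_fields is an assumption: the list of fields that together make a
    record unique. Two records with equal key fields get the same insertId.
    """
    key = {name: record[name] for name in key_fields}
    # Canonical JSON (sorted keys, no extra whitespace) so that the hash
    # does not depend on dictionary ordering.
    canonical = json.dumps(key, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_insert_all_body(records, key_fields):
    """Build a tabledata.insertAll request body with one insertId per row."""
    return {
        "kind": "bigquery#tableDataInsertAllRequest",
        "rows": [
            {"insertId": make_insert_id(record, key_fields), "json": record}
            for record in records
        ],
    }


if __name__ == "__main__":
    # Hypothetical device readings; the second one is a duplicate of the first.
    readings = [
        {"device_id": "dev-1", "date_time": "2014-05-01T10:00:00", "value": 3.2},
        {"device_id": "dev-1", "date_time": "2014-05-01T10:00:00", "value": 3.2},
        {"device_id": "dev-2", "date_time": "2014-05-01T10:00:00", "value": 7.9},
    ]
    body = build_insert_all_body(readings, ["device_id", "date_time"])
    print(json.dumps(body, indent=2))
```

Because the ID is derived from the record itself rather than generated randomly, a duplicate sent later by the device carries the same insertId, which is what lets BigQuery drop it while it is still inside the de-duplication window. Duplicates arriving after the window has passed would still need a later cleanup.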

Regarding "the first time a streaming insert occurs": this is per table. If you have a new table and start streaming to it, it may take a few minutes for that data to become available for querying. Once you've started streaming, however, new data will be available immediately.

