I have an AWS Kinesis stream and I created an external table in Hive pointing at it. I then created a DynamoDB table for the checkpoints, and in my Hive query I set the following properties as described here:
set kinesis.checkpoint.enabled=true;
set kinesis.checkpoint.metastore.table.name=my_dynamodb_table;
set kinesis.checkpoint.metastore.hash.key.name=HashKey;
set kinesis.checkpoint.metastore.range.key.name=RangeKey;
set kinesis.checkpoint.logical.name=my_logical_name;
set kinesis.checkpoint.iteration.no=0;
I have the following questions:
- Do I always have to start with iteration.no set to 0?
- Does this always start from the beginning of the stream (the oldest Kinesis record, about to be evicted)?
- If I set up a cron job to schedule the execution of the script, how do I retrieve the 'next' iteration number?
- To re-execute the script on the same data, is it enough to rerun the query with the same iteration number?
- If I execute select * from kinesis_ext_table limit 100 with iteration.no=0 over and over, will I get different/weird results once the first Kinesis records start to be evicted?
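To make the cron scenario in the questions above concrete, here is a minimal Python sketch of one way a wrapper could track the iteration number itself and render the set statements for each run. It is an assumption on my part, not something the connector is known to provide: the local state file and the helper names render_settings and claim_iteration are hypothetical, and only the property names and placeholder values come from the settings above.

```python
from pathlib import Path

# Property names and placeholder values are copied from the settings above.
CHECKPOINT_PROPERTIES = {
    "kinesis.checkpoint.enabled": "true",
    "kinesis.checkpoint.metastore.table.name": "my_dynamodb_table",
    "kinesis.checkpoint.metastore.hash.key.name": "HashKey",
    "kinesis.checkpoint.metastore.range.key.name": "RangeKey",
    "kinesis.checkpoint.logical.name": "my_logical_name",
}


def render_settings(iteration_no: int) -> str:
    """Return the Hive set statements for one run, iteration number last."""
    lines = [f"set {key}={value};" for key, value in CHECKPOINT_PROPERTIES.items()]
    lines.append(f"set kinesis.checkpoint.iteration.no={iteration_no};")
    return "\n".join(lines)


def claim_iteration(state_file: Path) -> int:
    """Hypothetical helper: return the stored counter and save counter + 1.

    A missing state file is treated as iteration 0 (an assumption).
    """
    current = int(state_file.read_text().strip()) if state_file.exists() else 0
    state_file.write_text(str(current + 1))
    return current
```

A cron wrapper could call claim_iteration, prepend the output of render_settings to the query, and pass the result to Hive. Whether the counter should advance before or only after a successful run is a separate decision that this sketch does not settle.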
Given the DynamoDB checkpoint entry:
{"startSeqNo":"1234",
"endSeqNo":"5678",
"closed":false}
- What's the meaning of the closed field?
- Are sequence numbers incremental, and is there a relation between the start and end (e.g. end - start = number of records read)?
- I noticed that sometimes there is only the endSeqNo (no startSeqNo); how should I interpret that?
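For reference, this is roughly how I read such an entry when inspecting it by hand, as a small Python sketch. The Checkpoint type and the parse_checkpoint helper are hypothetical names; the field names are taken from the entry above. It only tolerates a missing startSeqNo and makes no claim about what the absence or the closed flag means. Sequence numbers are kept as strings rather than assuming that arithmetic on them is meaningful.

```python
import json
from typing import Optional, TypedDict


class Checkpoint(TypedDict):
    start_seq_no: Optional[str]  # None when the entry has no startSeqNo
    end_seq_no: Optional[str]
    closed: Optional[bool]


def parse_checkpoint(raw: str) -> Checkpoint:
    """Parse one checkpoint entry, leaving absent fields as None."""
    entry = json.loads(raw)
    return {
        "start_seq_no": entry.get("startSeqNo"),
        "end_seq_no": entry.get("endSeqNo"),
        "closed": entry.get("closed"),
    }


full = parse_checkpoint('{"startSeqNo":"1234", "endSeqNo":"5678", "closed":false}')
partial = parse_checkpoint('{"endSeqNo":"5678"}')
```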
I know that's a lot of questions, but I could not find the answers in the documentation.