Monday, 4 May 2015

AWS Hive + Kinesis on EMR = Understanding check-pointing

I have an AWS Kinesis stream and created an external table in Hive pointing at it. I then created a DynamoDB table for the checkpoints, and in my Hive query I set the following properties, as described here:

set kinesis.checkpoint.enabled=true;
set kinesis.checkpoint.metastore.table.name=my_dynamodb_table;
set kinesis.checkpoint.metastore.hash.key.name=HashKey;
set kinesis.checkpoint.metastore.range.key.name=RangeKey;
set kinesis.checkpoint.logical.name=my_logical_name;
set kinesis.checkpoint.iteration.no=0;

I have the following questions:

  • Do I always have to start with iteration.no set to 0?
  • Does iteration 0 always start from the beginning of the stream (the oldest Kinesis record, the one about to be evicted)?
  • If I set up a cron job to schedule the execution of the script, how do I retrieve the 'next' iteration number?
  • To re-execute the script on the same data, is it enough to re-run the query with the same iteration number?
  • If I execute select * from kinesis_ext_table limit 100 with iteration.no=0 over and over, will I get different or unexpected results once the first Kinesis records start to be evicted?
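To make the cron scenario concrete, here is a minimal sketch of one way a scheduled wrapper might track the iteration number itself. Everything in it is an assumption for illustration: the local state file, the helper names (read_iteration, commit_iteration, hive_command), and the idea that iterations are consecutive integers starting at 0. It is not taken from the EMR documentation. The only Hive feature it relies on is standard variable substitution (--hivevar on the command line, ${hivevar:iteration} inside the script).

```python
from pathlib import Path
from typing import List


def read_iteration(state_file: Path) -> int:
    """Return the iteration number to use for the next run.

    Assumption: the first run uses 0 when no state file exists yet.
    """
    if state_file.exists():
        return int(state_file.read_text().strip())
    return 0


def commit_iteration(state_file: Path, iteration: int) -> None:
    """Record that `iteration` finished, so the next run uses iteration + 1."""
    state_file.write_text(str(iteration + 1))


def hive_command(script: str, iteration: int) -> List[str]:
    """Build a Hive CLI invocation that passes the iteration as a variable.

    The script would then contain (hypothetical usage):
        set kinesis.checkpoint.iteration.no=${hivevar:iteration};
    """
    return ["hive", "--hivevar", f"iteration={iteration}", "-f", script]
```

The wrapper would call read_iteration, run the command returned by hive_command (for example with subprocess.run), and call commit_iteration only if Hive exits successfully, so that a failed run is retried with the same number. Whether re-running with the same number really re-reads the same records is exactly what the questions above are asking.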

Given the DynamoDB checkpoint entry:

{"startSeqNo":"1234",
 "endSeqNo":"5678",
 "closed":false}

  • What's the meaning of the closed field?
  • Are sequence numbers incremental, and is there a relation between the start and the end (e.g. end - start = number of records read)?
  • I noticed that sometimes there is only the endSeqNo (no startSeqNo); how should I interpret that?
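For anyone inspecting such entries, the following is a small sketch of how a checkpoint value could be parsed defensively. It assumes only the three field names shown above and that startSeqNo may be absent. The function name parse_checkpoint is hypothetical, and the code does not claim to know what a missing start or the closed flag means. Sequence numbers are converted to Python integers because real Kinesis sequence numbers are very long digit strings. Note that the Kinesis documentation describes sequence numbers as generally increasing over time but not usable as indexes into the stream, so a difference between two of them should not be read as a record count.

```python
import json
from typing import Optional, Tuple


def parse_checkpoint(raw: str) -> Tuple[Optional[int], Optional[int], bool]:
    """Parse a checkpoint entry into (start, end, closed).

    A missing startSeqNo or endSeqNo is returned as None rather than
    guessed; a missing closed flag is treated as False (an assumption).
    """
    entry = json.loads(raw)

    def to_int(key: str) -> Optional[int]:
        value = entry.get(key)
        return int(value) if value is not None else None

    return to_int("startSeqNo"), to_int("endSeqNo"), bool(entry.get("closed", False))
```

Returning None for a missing field keeps the "only endSeqNo" case visible to the caller instead of silently defaulting it.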

I know that's a lot of questions, but I could not find the answers in the documentation.



