mercredi 30 septembre 2015

Expected behavior for AWS Kinesis ShardIteratorType TRIM_HORIZON

Context: I'm not necessarily referring to a KCL-based application, just pure Kinesis API calls.

Does the using the TRIM_HORIZON shard iterator type immediately give you the earliest published record in the stream (ie earliest available within Kinesis' built-in 24hr window), or simply an iterator/cursor for some time period as much as 24 hours ago, that you must then use to advance along the stream until you hit the earliest published record?

Put another way, in case that's not quite clear....

When using the shard iterator type of TRIM_HORIZON, is the expected behavior that it will begin with returning the records that were available 24 hours ago, BUT if zero records were published exactly 24 hours ago, and instead only 3 hours ago, that your application will need to iteratively poll through the previous 21 hours before it reaches the records published 3 hours ago?

Timeline example:

  1. Sept 29 5:00 am - Create a stream "foo" with 1 shard
  2. Sept 29 5:02 am - Publish a single record, "Item=A", to the "foo" stream
  3. Sept 29 5:03 am - Issue a GetShardIterator call with TRIM_HORIZON as your shard iterator type, then issue a GetRecords call with that shard iterator and receive the record "Item=A"
  4. Sept 30 7:02 am - Publish a second record, "Item=B", to the "foo" stream
  5. Sept 30 7:03 am - Issue a GetShardIterator call with TRIM_HORIZON as your shard iterator type, then issue a GetRecords call with that shard iterator. What should be expected as the result from this call? (Note: we did not remember/re-use the shard iterator from step 3)

For Step 5 above, it's been more than 24 hours since the "Item=A" message was published on the stream and only a minute since "Item=B" was published. Will a fresh shard iterator with TRIM_HORIZON immediately give you the earliest available record, or do you need to need to keep iterating until you hit a time period when something has been published?

I'd been experimenting with Kinesis and everything was working fine yesterday or two days ago (ie. I was publishing AND consuming without any issues). I made some additional modifications to my code and began publishing again today. When I fired up my consumer, nothing was coming out at all even after letting it run for a few minutes. I tried publishing and consuming at exactly the same time, and still nothing. After manually playing with the AFTER_SEQUENCE_NUMBER iterator type, and using some sequence numbers from my consumer logs from a few days ago, I was able to reach my recently published messages. But then if I go back to using the TRIM_HORIZON type, I see no messages at all.

I've looked at the docs, but most of docs I found assume you are using the KCL (I actually was using KCL initially, but when it started failing I dropped down to raw API calls) and mention that you must have an application name and that DynamoDB tables are used for tracking state. Which as best I can tell is not true if you're using pure Kinesis API calls or the Kinesis CLI, both of which I eventually tried. I finally wrote a pure API script to start with TRIM_HORIZON and poll infinitely and eventually it hit new records (took ~600 iterations; started out 14hrs behind "now" and found records at about 5 hours behind "now"). If this is expected behavior, it seems like the wording in the docs is just a little confusing/misleading:

TRIM_HORIZON - Start reading at the last untrimmed record in the shard in the system, which is the oldest data record in the shard.

I assumed (now seemingly incorrectly) that the terms "oldest data record" meant record that I've published into the stream, not simply a time period in the stream.

It'd be great if someone can help confirm/explain the behavior I'm seeing.

Thanks!




Aucun commentaire:

Enregistrer un commentaire