22.5.12

First and last rows in an HBase table

HBase's Client API is quite limited.
This is by design: HBase supports only simple operations so that the data store can be "cloudy", that is, highly distributed and highly scalable. For this reason the API will probably stay simple.

One interesting question is:
How do I get the last row in an HBase table?

And the disappointing answer is:
You have no direct access to the last row of an arbitrary HBase table.

The word "direct" is important, because technically you can get the last row in an indirect and inefficient way: scan the table starting from any row, and when the scan finishes, the last result returned is the last row of the table. This is obviously impractical if your table is even slightly big.

However, if you can afford to change the row key schema of a table, it becomes possible to do this efficiently.

The strategy is to keep the last row key constant, so you get direct access to it, and to allow the table to grow at the beginning or in the middle. Getting the first row requires a plain scan that stops after returning the first result.

This strategy works well with tables with row keys that are consecutive integers, like timestamps or integers as IDs. Normally, when row keys are integers, the table receives rows in increasing order of the row keys. However, if you can have the table receiving rows always in decreasing order of the row keys, you then have easy access to the first and last rows. This is possible because HBase tables are always sorted by row key. 

The first inserted row should have the highest row key. Subsequent inserted rows are "prepended" to the table.


Essentially, we are turning the table "upside down" with regard to row keys.

Consider an example table whose rows are the comments on a blog post. Row keys are (long) integers, chosen so that the first inserted row has the highest row key and every subsequently inserted row gets a key smaller than all existing ones. For simplicity, in our example, assume that the largest long value is 20.
The table starts with the first comment inserted:
rowkey   commentmsg     name      date
20       "Nice post"   "Smith"  "May 22"

The next comment added should have a row key that is smaller than the largest row key. For example, the number before 20.
rowkey   commentmsg     name      date
19       "I agree"     "John"   "May 23"
20       "Nice post"   "Smith"  "May 22"

When row keys are integers as IDs, it is convenient to use consecutive decreasing numbers, so essentially we are always prepending the table when inserting a new row.
rowkey   commentmsg     name      date
18       "Cool"        "Roger"  "May 24"
19       "I agree"     "John"   "May 23"
20       "Nice post"   "Smith"  "May 22"
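The "upside down" ordering above can be simulated without an HBase cluster. The sketch below (an illustration, not from the original post) uses a TreeMap with an unsigned byte comparator to stand in for HBase's byte-lexicographic row ordering, and a small encode helper that mimics Bytes.toBytes(long):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.TreeMap;

public class UpsideDownKeys {
    // Big-endian encoding, as Bytes.toBytes(long) produces for HBase row keys.
    // For non-negative longs this preserves numeric order under unsigned
    // byte-wise comparison, which is how HBase sorts rows.
    static byte[] encode(long key) {
        return ByteBuffer.allocate(Long.BYTES).putLong(key).array();
    }

    public static void main(String[] args) {
        // TreeMap with unsigned byte comparison stands in for the sorted table.
        TreeMap<byte[], String> table = new TreeMap<>(Arrays::compareUnsigned);

        // Insert comments in arrival order, assigning decreasing row keys
        // starting from the chosen maximum (20 in the running example).
        long nextKey = 20;
        for (String msg : new String[] {"Nice post", "I agree", "Cool"}) {
            table.put(encode(nextKey--), msg);
        }

        // Newest comment sorts first; the first-ever comment keeps the
        // constant, highest key and therefore always sorts last.
        System.out.println(table.firstEntry().getValue());
        System.out.println(table.lastEntry().getValue());
    }
}
```

Running this prints "Cool" (the newest comment, the table's first row) followed by "Nice post" (the oldest comment, parked at the constant key 20).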

So for retrieving the first row, one just needs to:
Scan scan = new Scan();
ResultScanner scanner = commentsTable.getScanner(scan);
Result firstRow = scanner.next();
scanner.close();

Retrieving the last row is simple, since the row key of the last row is constant:
Get get = new Get(Bytes.toBytes(20L)); // long literal: the row keys are 8-byte longs
Result lastRow = commentsTable.get(get);

If you are wondering how to "prepend" when inserting rows, that is the hardest part, but not actually that hard. Basically, one just needs to: (1) retrieve the row key of the current first row, and (2) try to insert a row immediately before it.

Step 1 is simply retrieving the first row and reading its row key. Step 2 uses checkAndPut to deal with concurrent clients. This is necessary because, between step 1 and step 2, another client could have prepended the table, invalidating our knowledge of the first row.

The prepending code should look like:
// Get the first row key (Step 1)
Scan scan = new Scan();
ResultScanner scanner = commentsTable.getScanner(scan);
Result firstRow = scanner.next();
scanner.close();

long prependedRowKey = Bytes.toLong(firstRow.getRow()) - 1;

// Try to prepend (Step 2)
boolean prependSucceeded = false;
do {
    Put put = new Put(Bytes.toBytes(prependedRowKey));
    put.add(Bytes.toBytes("colfam"),
            Bytes.toBytes("commentmsg"),
            Bytes.toBytes("I like your post"));
    prependSucceeded = commentsTable.checkAndPut(
        Bytes.toBytes(prependedRowKey), // row we are trying to create
        Bytes.toBytes("colfam"),
        Bytes.toBytes("commentmsg"),
        null,                           // succeed only if this cell is absent
        put);
    if (!prependSucceeded) {
        // Another client claimed this key first; move one key lower and retry.
        prependedRowKey--;
    }
} while (!prependSucceeded);

The checkAndPut call is the crucial part. In one atomic step, it checks whether colfam:commentmsg is null in the row we are trying to create, and writes to colfam:commentmsg only if the check passes. The column you use for the check should be one that always has a value when the row exists. If your table has no such column, you can create a flag column called e.g. "exists". The Put inside the checkAndPut can also be customized to your needs.
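With a null expected value, checkAndPut effectively means "create only if absent". The retry loop can be sketched without HBase at all (an illustration, not the post's actual code): ConcurrentSkipListMap is sorted like an HBase table, and its atomic putIfAbsent plays the role of checkAndPut.

```java
import java.util.concurrent.ConcurrentSkipListMap;

public class PrependSim {
    public static void main(String[] args) {
        // Sorted map stands in for the table: row key -> comment message.
        ConcurrentSkipListMap<Long, String> table = new ConcurrentSkipListMap<>();
        table.put(20L, "Nice post");
        table.put(19L, "I agree");

        // Step 1: read the current first row key and go one below it.
        long candidate = table.firstKey() - 1;

        // Step 2: atomically create the row only if the key is still free;
        // on a lost race, move one key lower and retry (mirrors checkAndPut).
        while (table.putIfAbsent(candidate, "Cool") != null) {
            candidate--;
        }

        System.out.println(table.firstKey()); // the freshly prepended row key
    }
}
```

Here the loop succeeds on the first try and the new first row key is 18.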


Conclusion:
If your table fits the requirements of this strategy, you have easy access to the first and last rows of the table. These are efficient operations, and also safe in concurrent scenarios. The prepending algorithm can suffer from contention in highly concurrent situations, but that is unlikely to cause large latencies.

Regarding region management and general performance of this approach, there are no apparent problems.

I have been using this strategy in a project I am working on, and it has been unit tested, benchmarked and used without headaches.

7 comments:

  1. Thanks for sharing. However, with this type of row key design you are getting continuous numbers as row keys, and that will create a "hot spot" for HBase reads/writes. In other words, most of your reads/writes will go to the same region.

    The normal fix for that is to prefix some random number/string to your sequential numeric row keys. But if you prefix a random number/string to the row key, then you won't be able to get the first row easily.

    Any comments?

  2. Hi Jason, thanks for the first comment on my blog. :)

    Could you explain how continuous numbers as row keys cause hot-spotting? Shouldn't that depend on the application?

  3. Hello,

    Here is what I take from the HBase reference guide (http://hbase.apache.org/book.html):

    '13.3.1.3.1. HBase "Hot Spot" Region
    We hypothesized that we were experiencing a familiar point of pain: a "hot spot" region in an HBase table, where uneven key-space distribution can funnel a huge number of requests to a single HBase region, bombarding the RegionServer process and cause slow response time. Examination of the HBase Master status page showed that the number of HBase requests to the troubled node was almost zero. Further, examination of the HBase logs showed that there were no region splits, compactions, or other region transitions in progress. This effectively ruled out a "hot spot" as the root cause of the observed slowness.'

    Because the keys are continuous numbers and the HBase table arranges/stores the rows according to the key - if you add the rows 20, 19, 18, ..., 10, these 11 rows will likely be stored in the same region. Thus all your reads/writes during this time period will go to only this region, and this is what I meant by 'hot-spot'.

  4. Yes hot-spotting for writing new data happens, but not for read/writes on random existing data.

  5. True.

    This is the con I want to mention - and I can't figure out a way to avoid this 'hot-spot' when writing new data. :(

  6. I wrote a table operation for Accumulo that will find the last row in a table. Accumulo is based on BigTable, like HBase. The utility does something similar to a binary search. It starts with the string byte[0] and byte[]={0xff,0xff,0xff,0xff}. The code finds the mid point, using BigInteger, and scans Accumulo. If something exists it searches the upper half, using the row from the scanner as the start; if nothing exists it searches the lower half. Below is a link to the code. There is also special handling for the case where the last row is > 0xff,0xff,0xff,0xff.

    http://svn.apache.org/viewvc/accumulo/tags/1.4.1/src/core/src/main/java/org/apache/accumulo/core/client/admin/FindMax.java?view=markup

  7. Indeed, binary search based on existence/nonexistence sounds like a good idea! Would work in HBase as well. I guess it's a matter of performance requirements, in some applications log(table_size) reads might be ok.
