Accurate Countdown Timers

Time is a critical component in many games. How long until your health returns? How long until your crops are ready to be harvested? How long until the next player’s turn in a turn-based game?

There are two main ways to do timers, depending on what you are trying to represent. If you are just trying to represent a “count down” to a single user, to keep them waiting in suspense, then you want the illusion of a certain number of seconds passing. It does not matter if they are close to accurate.

Conversely, if you are trying to really synchronize a user with a server-side game state (or other players) then you want the timer to be as accurate as possible.

Simple countdown

In the case where accuracy is a minor concern, you can make a timer which is guaranteed to count down from a million to 0 like this:

var start_count_1 = 1000000;
function timer1(){
    start_count_1--;
    render("timer1", start_count_1);
}
setInterval(timer1, 1000);

The code is Javascript, but the principle works for any language which can simulate a callback once per second (or sleep for a second before calling your code).

By definition, every number will be hit exactly once. But various delays caused by other tasks, processing time, etc. mean that the numbers will not necessarily be hit every second. This code will take more than a million seconds to run. How many more seconds depends on how busy the computer is. On a heavily loaded machine maybe your million second countdown will take two million seconds. It's unlikely to be that bad, but it's bad enough to be a problem.

One extreme situation that I have seen in real life was an applet that stopped counting down when the focus was on another tab or screen!

More accurate countdown

A more accurate solution is to make a counter that corrects itself automatically by keeping track of the time that has passed since it started. Even a heavily loaded computer can keep time properly, because the time-keeping hardware is separate from the CPU in the Real Time Clock (RTC).

Now if the Real Time Clock hardware is inaccurate for any reason then either technique will fail. But if we presume it is good, then our second technique will give you a more accurate countdown by skipping seconds when necessary. This can be disconcerting for the user, but if it is what's required for your game logic then it is still the right choice.

Here is an implementation of a more accurate counter, which skips seconds when necessary:

var start_count_2 = 1000000;
var start_time_2 = (new Date()).getTime();
function timer2(){
    var now = (new Date()).getTime();
    var seconds_elapsed = (now - start_time_2)/1000;
    var current_count = start_count_2 - seconds_elapsed;
    render("timer2", current_count);
}
setInterval(timer2, 1000);

This example does not truncate milliseconds, so you can actually watch and see how many milliseconds the interval counter has "lost" between ticks and use that to predict when an integer second would be skipped in a real application. The best way to simulate this is to really overload the computer with many CPU-intensive tasks like busy loops and Flash applications.

"More accurate" is not the same as "accurate"

Albert Einstein proved that the notion of a universally correct time is an illusion. As long as bits travel down a wire at less than the speed of light, the client computer will never be absolutely synchronized with the server computer. There are also limits of engineering quality to consider. But the biggest issue is that hackers may try to fiddle with your game's sense of time. For all of these reasons, you must treat the game player's clock with suspicion and minimize dependence on it.

Summary

The first technique I showed was like counting down during a game of hide-and-seek. It feels familiar to human beings, because every second is counted. The second example is like checking your watch over and over again and saying: "How long is it until it is time for the event?" If the user happens to look at the screen when a second is skipped, it might seem unnatural, but the count is more accurate.

Computers are nearly as bad at accurately counting down as humans are. Just as we can get distracted during a countdown, they do too. Just as you would use a stopwatch if you wanted an accurate countdown, you should force your program to consult the computer's built-in watch, the Real-Time-Clock.

Share and Enjoy:
  • Digg
  • Facebook
  • Reddit
  • Twitter

Hockey and HTML 5

For Ayogans, Hockey and HTML 5 are like chocolate and peanut butter.

Robby Macdonell has put together a fun interactive visualization of the history of the Stanley Cup.

Share and Enjoy:
  • Digg
  • Facebook
  • Reddit
  • Twitter

Facebook’s f8 and Social Games

Courtesy of Flickr's kohtzy

Courtesy of Flickr's kohtzy

I had the opportunity to attend Facebook’s f8 Conference and I was simply blown away by what they are doing. The tone of the conference felt very much like the 2nd JavaOne conference: a hugely growing platform with many eager super sharp companies and developers using it to change the world. I’ve known Bret Taylor since he started FriendFeed and he was simply amazing as the second keynote speaker. It’s clear that he has had huge impact on Facebook and I could easily see him becoming CTO. It’s taken a while for me to gather my thoughts together on all the different aspects of what they announced, but here it is.

I think the biggest announcement is a combination of announcements, so let’s start with the top-down view. What Facebook can now do is be the gatekeeper for all aspects of personal information. If you like a movie, song, restaurant, article, person…whatever…on a 3rd party site like imdb.com or yelp.com, that site will notify Facebook. Then Facebook will update your profile in real-time! That itself is simply amazing, that the profile you statically filled out on sign-up and never revisited is now real-time with you. But wait, there’s more. Applications that you have added can subscribe to your profile and changes, and will be notified, typically in less than a minute. In fact, they will try up to 5 times and keep the callback for up to 24 hours.

Now, Facebook will be the place that every web site will publish information to. Move over Google, no need for a search engine for user data all over the web as it will be centrally stored on Facebook, conveniently (mostly) hidden from the public including Google. And if you want to find out information about a user, you’ll need to talk to an app that the user has installed. What does this mean for big game companies like Zynga? They’re just about to be worth even more, as they will have the real-time updates on hundreds of millions of people, and that’s worth a lot. Another interesting take away was that many noteworthy issues were barely talked about. For example, Docs.com is Microsoft Office online for Facebook users. Move over Google Docs, 400 million Facebook users can now use Office, however there wasn’t (barely) any mention of this in the press.

Now I’ll get into a bit of the technical details. They announced three main technology developments: 1) the Graph API, which is a much simpler interface to all Facebook user data; 2) the open graph protocol, a mechanism that allows web sites to publish information of users interests to Facebook; 3) Social Plugins, simple one line additions additions to web sites that allow use the open graph protocol.

Graph API
The Graph API is the heavyweight of the announcements in terms of technical change. The main part is that all Facebook “things” are now first class objects with their own URLs. I and all of my “things” like friends, comments, wall posts, pictures, likes, have their own URL. My public information is at “http://graph.facebook.com/david.orchard“. Now you can find out all the information about the thing represented by the URLs by attaching ?metadata=1. http://graph.facebook.com/david.orchard?metadata=1 returns a lengthy structure with links to everything that I connect to, like “friends”: “http://graph.facebook.com/david.orchard/friends“. If you then try to see who my friends are, you are then led to another part of the announcements. Facebook is standardizing on the OAuth 2.0 specification for security. Thus you’ll need OAuth 2.0 credentials from Facebook to access my information. This guarantees that I have authorized you and specifically your application to see my data. Which also leads us to the Yet Another Facebook Privacy Furor (YAFPF).

Privacy Settings
By default, all apps you have installed can see all of your data. In fact, if you go to a 3rd party site, by default they can see your data to customize your view. You may go to a 3rd party site and see information about you much to your surprise! You have to specifically disallow the applications AND there is no “select all” to make it easy for you to not share your data. Further, if you disallow the app from your data but your friends don’t, the app can still see your data.  This is shown in http://news.cnet.com/8301-31322_3-20003185-256.html. One (not-so) funny site has shown CEO Mark Zuckberg’s personal calender including where/when he’s at functions you’d think were private, like birthday parties and baby showers: http://blogs.sfweekly.com/thesnitch/2010/04/want_to_meet_mark_zuckerberg_h.php. Moving back to the graph API, because the profile information is now changing in real-time, they have provided a mechanism for web sites to receive real-time updates to the profile information. I think this is simply amazing. I asked a question about errors in the update callback, and they said they will try up to 5 times over a 24 hour period. I am boggled by the amount of data that they may end up storing. Imagine 10s of millions of people liking a particular movie and dozens of applications per user have subscribed to the users. And that’s just one piece of data that Facebook is now collecting.

Social Plugins
Social Plugins are a very simple way that a web site can communicate with Facebook about a user’s actions. The best known will be the Like plugin. It’s a single line of code into an HTML document, though it has to be used in conjunction with the Open Graph Protocol.

Open Graph Protocol
The Open Graph Protocol is a specification for how to add Meta tags to web pages to describe them. An example is a page about a movie will have the title, the movie type, image, the canonical URL for the movie. Then if the user uses the Like Social Plugin, that information is sent to Facebook and your information is updated in real time. The specification is being developed on opengraphprotocol.org and it is licensed under the Open Web Foundation’s license. That is a very good thing. However, it seems that whenever something has “open” in it, it doesn’t mean what they think it means. It appears that Facebook will decide exactly what is in or out of the specification. Personally, I’d call this a “Freely Usable” specification rather than “Open”.

The Semantic Web community appears quite excited about this specification. Much of it is based upon the RDFa specification, which is how to put Semantic Web information in HTML documents. However, the specification runs into all of the known and expected problems with it’s design. If I like a movie on the movies web site and imdb.com, how are these treated by facebook? There is two likes, of two different URLs on different web sites. What exactly is identified when I say I like a movie? It could be the movie in general, but what if I like the director’s cut but not the theater version? In fact, there is no way of saying I don’t like something. The lack of a no is a typical design in these kind of assertion systems fwiw. There is also no way of undoing a Like from the web site, though you can update your information in facebook. I don’t think that Facebook cares too much about these issues because acquiring the coarsest of data is a big win for them. Also interesting was how underwhelming some of the announcements were.

Mobile
There was almost no mention of Mobile at the conference. I didn’t go to the one Mobile talk and much of the tweets about it were about people leaving the talk because of lack of content. That’s very disappointing to a company like us that wants to extend Facebook apps into mobile devices.

Credits
As a company, we care a lot of about virtual currencies. Facebook Credits are still an invite only program. The credits announcement was pretty much a non-announcement. As in, here’s what we’re doing, send us some feedback. I did like that they are providing many payment mechanisms, but they have only one offer provider. We saw no screen shots that would show that using Credits is as easy as Apple’s in-app purchases. And finally, there was no mobile mention of Credits. Personally, I don’t think that Credits will ever be usable on the iPhone because Apple won’t want to give up their 30% vig.

Location

Facebook had previously announced they working working on Location.  The lack of mention of Location at F8 has given Foursquare, Gowalla and other apps at least 6 months reprieve.  Unlike the poor guy I met from feedback.com or the folks from Twitter…

Insights
Insights is the Facebook analytics tool that was announced. Many of us were hoping for something similar to Google Analytics. Daily users, new users, time on site, exit pages, entry pages. Very little of that was introduced. And that’s not even counting the very useful Google custom events. Another interesting perspective on http://siliconangle.com/blog/2010/04/23/facebook-insights-not-scary-and-not-effective-yet/.

Summary

Facebook has done an amazing job of setting itself up to be the conduit for personal information.  They’ve deprioritized everything else in order to roll out their amazing platform changes.

Share and Enjoy:
  • Digg
  • Facebook
  • Reddit
  • Twitter

The EC2 EBS First-Write Penalty

I wanted to summarize some information on the EC2/EBS “first-write penalty” and give myself a container for future information that arises with respect to it.

Amazon’s docs say:

Due to how Amazon EC2 virtualizes disks, the first write to any location on an instance’s drives performs slower than subsequent writes. For most applications, amortizing this cost over the lifetime of the instance is acceptable. However, if you require high disk performance, we recommend initializing drives by writing once to every drive location before production use.

Although Amazon says that you typically do not need to worry about it, I do prefer to initialize our drives because we have an append-heavy workload (we store a lot of data for analytics). In any case, initialization is very important before you benchmark, because benchmarks may allocate a lot of space in an unrealistically short period of time.

Note that this quote is also about instance stores. According to Amazon technical support (though not EBS documentation) a similar issue applies to EBS.

1. There is an extra penalty on both ephemeral local spindles, and EBS volumes, for first write. For EBS there is an identical first-read penalty.

Note the second sentence…EBS also has a first-read penalty, if the first access is a read.

Unfortunately, in one circumstance, we put a database into production before doing the usual initialization step. This lead to a bit of a conundrum: did I really want to write to every unused block on the same volume as my database server? Could I do that easily without causing fragmentation, slowdown, or out-of-space risks? This particular database is not running on RAID because it has a requirement for snapshots. The RAID initialization step would have obviated the problem. (in the future we’ll handle snapshots from a real-time replication slave).

The interesting thing about the “first-write penalty” is that it is actually misnamed, according to Amazon Technical Support. It is also incurred if the first access to a block is a read:

The EBS penalty is best described as a first-use penalty.

So, if you have written a block, you will take a penalty on the first write, but not on the (subsequent) first read of the block.

Of course, real applications (filesystems et al) always write before they ever read, so real applications will always experience the EBS first-use penalty as a first-write penalty.

But in this particular case, it was more convenient for me to do a first-read rather than a first-write. Unix is cool (and was in its day revolutionary!) because it exposes the underlying device to you as a “file”, which you can work with like any other file. In this case, let’s pretend our device was called “/dev/sdx1″.

My first thought was to use “cat /dev/sdx1 > /dev/null”. That would have read every byte in a very elegant Unix-y way. But I didn’t want to interfere with the load on the server. nice and ionice were not doing the job (I didn’t investigate why, exactly).

So I just wrote a Python program that was designed to read a bit of data, pause to let the drive handle other traffic, then read a bit more. It dynamically increases its chunk size to approximate a goal of 0.05 seconds of read time for every 0.25 seconds of clock time. Of course, if the server had been working flat out, then even this little bit of workload might have had negative consequences. But if your server is frequently working flat out, then you’ve got bigger problems.

What would have happened if the server had been working flat-out is that the chunk size would have shrunk to a single byte every 0.25 seconds. This would have “stolen” 4 of the roughly 100 operations per second that I could expect from the disk, a 1 out of the roughly 100MB per second. It would also have taken approximately forever to finish.

import os, sys
import time

MiB = 1000**2
chunk_size = 5 * MiB

f = open(sys.argv[1], "r")
f.seek(-1, os.SEEK_END)
filesize = f.tell()
f.seek(0)

print "Size", filesize

count = 0
data = "dummy"

while 1:
    before = time.time()
    data = f.read(chunk_size)
    count += chunk_size
    delta = time.time() - before
    print time.ctime(before), "%5f" %  delta , "%5f%%" %(100 * float(count)/filesize), count, "of", filesize, "incrementing", chunk_size
    if not data: break

    # adjust read speed dynamically
    if delta<0.05: chunk_size= min(chunk_size * 1.25, 1000 * MiB)
    elif delta>0.05: chunk_size/=1.25
    chunk_size = int(chunk_size)
    time.sleep(0.25)

print "done"

Given that this is a production system with so many other variables, it’s hard for me to say whether initializing the disk really made a measurable difference or not. Although it is only tangentially related, I found this article (also linked above) about EC2 performance modelling to be quite helpful.

Note also that the first-read technique only applies to EBS disks, not ephemeral ones.

2. Unless you first fill an ephemeral local drive with data, you will get ‘free’ reads against it. So, doing a read-only benchmark on a virgin ephermal drive will give excellent, but inaccurate, results.

Share and Enjoy:
  • Digg
  • Facebook
  • Reddit
  • Twitter

Sorting in Cassandra

Sorting in Cassandra

Courtesy of Flickr's Oryctes

Courtesy of Flickr's Oryctes

The Cassandra distributed data store is, at heart, a key/value storage system. Unlike pure key/value storage systems, however, Cassandra implements first-class notions of key and column ordering. Dominic Williams covered a good chunk of the topic wonderfully in his post “Cassandra: RandomPartitioner vs OrderPreservingPartitioner”. But his framing of the question is a bit different than mine and was somewhat dominated by his focus on full-text search indexing.

You’ve got options

There are three ways to get Cassandra to order your data for you, each with its own costs and benefits.

  1. Use the OrderPreservingPartitioner and order rows by key.For example, given a log file structure, I could have a KeySpace named “Logger”, a Column Family named “Events”, and a list of events keyed by some form of timestamp (probably TimeUUID — more on this later).This technique aligns the physical structure of the database with your sort order.
  2. Use sortable column names and order columns within a row by the column name.In this case my Keyspace and Column family would be the same, but I would probably have a single row Key which I might call “EventsColumns” within the “Events” Column family. This row would have columns which are timestamps (actually TimeUUIDs).This technique stores all of the sorted items in a single row, which lives on a single machine, and thus is limited to the capabilities of that machine.
  3. Hack sorting on top of a random set of keys by using techniques like creating keys that are date named (find the most recent one by working backwards from today) or by partitioning a list of items into multiple rows.

Each of these strategies has its own strengths and weaknesses.

Sorting by Key

Sorting by key is enabled at the top level of the the Cassandra conf/storage-conf.xml file. You specify the OrderPreservingPartitioner which informs Cassandra to sort the physical layout of the data according to the sort order of the keys. Typically this is done in a byte-oriented way, but you can customize Cassandra to sort however you like by creating your own OrderPreservingPartitioner.

The properties (good and bad) of the sort-by-key strategy emerge from the fact that all of Cassandra’s replication and sharing features are based upon key ranges. Even the files-on-disk are sorted by key.

Aligning the storage order with our sorting requirements gives us the following benefits:

  1. “Key range queries” allow us to fetch a pre-sorted slice of the key space based on a predicate. For example, a date range for date-keyed data, or an alphabetic range for word-keyed data.
  2. Using pagination techniques, it is possible to walk through the entire universe of keys (or any subset) forwards or backwards. With the RandomPartitioner, you would either need to walk through in a random order (as might be appropriate in a MapReduce job) or you need to have some external means for knowing the order of the keys.

On the other hand, using the same ordering for storage and querying has some serious side effects that you must plan for. For starters, the Cassandra wiki documentation says:

With OrderPreservingPartitioner the keys themselves are used to place on the ring. One of the potential drawbacks of this approach is that if rows are inserted with sequential keys, all the write load will go to the same node.

This is probably not strictly true, because more than one node may be responsible for the node representing the “end” of the list of data, but it’s approximately true: the load will certainly not be shared by nodes representing the beginning or middle of the data.

This becomes even more problematic when you realize that the partitioner applies to all Column families in all Keyspaces. It is an unfortunate design consequence (flaw?) of Cassandra that you cannot set the partitioner by Column family or Keyspace. You must choose globally.

If you really need random partitioning for some column families and ordered for others, then you should do the randomization yourself by prepending some kind of hash (e.g. MD5). This is similar to what Cassandra would do automatically for you with a RandomPartitioner in any case. It’s just somewhat ugly to write client code to randomize your own content in order to work around a flaw in the datastore.

This flaw is exacerbated by three further facts:

  • if you design a partitioner that partitions according to IEEE floating point comparison rules (as an extreme example), then you’ll end up with some really weird sorting of keys that are not really intended to be interpreted as numbers. You can work around that by converting everything to a uniform structure (typically just ordering by lexicographic byte order).
  • imagine that all of your keys in a particular column family are URLs. This column family will tend to congregate on machines that manage the range of (“http://…”). If you do not care about the sorting of these keys then you can MD5-prepend them before insertion as described above. If you DO care, then you’re kind of out of luck. Now you have “hot” machines for (“http://…”) in one part of your cluster and hot machines for “today’s date” in another part of your cluster. Furthermore, the “hotness” could be in EITHER data size OR read load OR write load OR any subset of the three. (thanks to Dominic Williams for clarifying these aspects for me)
  • you cannot change partitioners on a live cluster without exporting and reloading data. The Cassandra wiki says: “Achtung! Changing this parameter requires wiping your data directories, since the partitioner can modify the !sstable on-disk format.” You must get this decision right at the very beginning of your design process. In this sense, Cassandra is not very “agile”.

On the other hand, running a second “cluster” of Cassandra does not necessarily imply running twice as many machines. A cluster could be nothing more than some more directories on each machine and a second JVM instance.

Even so, the OrderPreserveringPartitioner has enough side effects that you may be hoping that the next alternative will be a magic bullet. Unfortunately it is not.

Sorting by Column Name

Cassandra always stores column names in order. So you get a certain amount of sorting “for free”.

You configure this sorting by turning on a comparator for a particular column family using the compareWith attribute in your conf/storage-conf.xml.

<ColumnFamily CompareWith="BytesType"
       Name="RawData"/>
    <ColumnFamily CompareWith="UTF8Type"
           Name="UniodeData"/>
    <ColumnFamily CompareWith="TimeUUIDType"
           Name="SortedByTime"/>


The complete list of built-in comparators is:

  • BytesType: Simple sort by byte value. No validation is performed.
  • AsciiType: Like BytesType, but validates that the input can be parsed as US-ASCII.
  • UTF8Type: A string encoded as UTF8
  • LongType: A 64bit long
  • LexicalUUIDType: A 128bit UUID, compared lexically (by byte value)
  • TimeUUIDType: a 128bit version 1 UUID, compared by timestamp

The most interesting of these is the TimeUUIDType.

TimeUUIDs

A “TimeUUID” is an old-fashioned form of UUID that is based on the time of day. TimeUUIDs are more appropriate for a distributed system than timestamps, because they are universally unique. Two different nodes in a cluster could generate the same timestamp, but if they are properly configured, they cannot generate the same TimeUUID.

Out of the box, Cassandra does not support TimeUUIDs for sort ordering in an OrderPreservingPartitioner. The timestamp encoded in a TimeUUID does not sort properly according to lexicographic rules. You could create your own Partitioner that parses them and partitions that way, but there is a simpler way. Generate keys by concatenating time stamps and UUIDs or time stamps and machine identifiers (whether mac addresses, or IP addresses or whatever).

Column Sort issues

Column sorting is great in that it happens for free, automatically, all of the time. But it isn’t appropriate for all jobs.

At this point I’m going to quote Dominic Williams, who covered this issue well:

… there is a really simple if brutal solution [to the OrderPreservingPartitioner issues]: simply store your index inside a single column family row as a series of columns. Since Cassandra can in principle cope with millions of columns, this is perfectly possible. Although it is true each index won’t be distributed across your whole cluster, the load will at the least be distributed across the nodes holding the replicas. If you use a typical replication factor (RF) of 3 the load associated with each index will be shared by 3 nodes etc.

In the vast majority of cases, this will be enough, and it will be sufficient that the rest of your data is properly balanced across your cluster.

But, I hear you saying, this is too brutal. Your index is too massive to fit on 3 nodes, is extremely hot and this just won’t work. You moved to Cassandra because you want your load distributed across your entire cluster. Period.

This is a perfectly reasonably point of view.

I’ll add some further caveats:

  • while it is true that Cassandra can in principle cope with millions of columns, Cassandra 0.6 and earlier (latest versions as of this writing) cannot handle rows that grow larger than memory.
  • if you have one or two keys on the a particular machine that each have millions of rows, and another machine randomly happens to lack any such keys, then your load will get unbalanced again.

Workarounds

Before we get to workarounds, I should reiterate Dominic’s point that many people do not have millions or tens of millions of ordered items to deal with, and therefore do not need these workarounds.

That being said, the basic structure of any workaround to these issues is to use the random partitioner as a substrate and build ordered lists on top.

For example, what if the log events for every hour of every day were stored in its own row? So the columns would be events, and the key would be “2010-04-10:9:00-10:00″ Using a random partitioner, these keys would naturally migrate to various servers. When you want to know what happened on “Saturday” you fetch 24 rows and you are off to the races.

Or here’s another option: what if the log events for a particular user or system on a particular day, were kept in a row specific to that user or system? As long as no particular user or system generated millions of events in a single hour, you’d never push row size limits.

Dominic also explains that for his full-text indexing use-case, he could split “big” keys automatically into two smaller keys. You might use some other column family to build an index on top of these chunks, although Dominic describes how that may be redundant for his specific use case.

Building on Dominic’s idea, you can imagine a Column Family that has a Row that contains a million rows and each row is a “pointer” to a row in a different Column Family that has a million log events. If a trillion rows is not enough for you then you can add one more layer of list and you’ll get enough room to reference a quintillion events. Each “chunk” (Row) of a million events would be managed by a single machine and its replicas.

Of course, once you start nesting and indirecting, you’re going to add latency. That said: three queries to drill down into a data set of a quintillion items is not bad. Even so, if you’d like to avoid the overhead, read Dominic’s explanation of why Cassandra’s data structure may make indexing unnecessary for some use-cases.

Final Thoughts

I’d appreciate your thoughts on the following issues:

  • are the reputed problems with the OrderPreservingPartition overblown or exaggerated?
  • are there any open source libraries for hiding the complexity of the workarounds from developers? Should there be?
  • could any of the “workarounds” be incorporated into Cassandra and become “features” which might obsolete the OrderPreservingPartitioner? For example, could Cassandra automatically split rows that cross some site-defined threshold and somehow abstract above the split so that the developer need not see it reflected in the API-view of the data? I’m thinking of the way that memory managers and file systems present linear abstractions on top of the messy reality of disk blocks and fragmented memory.
Share and Enjoy:
  • Digg
  • Facebook
  • Reddit
  • Twitter

Blackberry users are starting to browse the web

According to StatCounter, the BlackBerry browser is now a major force on the mobile web. It is just as popular as the iPhone, and growing quickly.

Share and Enjoy:
  • Digg
  • Facebook
  • Reddit
  • Twitter

Buzzword Bingo: NoSQL cloud databases are a disruptive technology

I apologize for the buzzwordy, markety blog title. But I am confident that the war between cloud databases and relational databases is partially, perhaps primarily, a question of marketing, and a marketing concept is therefore relevant: Disruptive technology.

Disruptive technologies start at the low-end of the market, (like microcomputers, or TCP/IP) and work their way up, disrupting “enterprise” solutions like minicomputers and value-added networks. Just as microcomputers did not completely replace minis and mainframes, I don’t expect cloud databases to completely eradicate traditional databases, but I hope to show why they will grow tremendously in the next few years: to the point that it is very unclear exactly what will happen to traditional databases over the long term.

By a “cloud” database, I mean one with the following properties:

  • provided as part of a hosting package by large web companies like Amazon, Google, Rackspace, Microsoft and even smaller ones like Joyent and Heroku.
  • designed so that they scale up automatically as traffic and content grows, rather than requiring new architectural and deployment decisions to be made as traffic grows. The hardware behind a cloud is finite, but consumers should only bump into limitations as they approach the limits of the cloud itself, rather than the limits of a particular box or cluster.

These product are still in their infancy, and I suspect that some of the vendors listed in point 1 do not actually achieve the linear scalability requirement. I haven’t tested them all myself.

I know the limitations of databases like App Engine and SimpleDB, which is why Ayogo has not jumped on that bandwagon (yet).

But imagine I was a decade younger and I was making a decision about what system to use for my world-beating new dorm-room headquartered startup.

On my left shoulder is a little relational database angel, on my right shoulder is a little cloud database devil.

SQL Angel: “Relational database theory is powerful, and SQL is VERY flexible.”

Cloud Devil: “Our query languages are simpler, and you just can use programming code to do anything tricky (like summation or joining).”

SQL Angel: “Relational databases have been proven to scale. Like at stock exchanges and stuff. You just need to understand the query language, the optimizer, indexing, the disk layout, RAID striping, master-slave replication, read/write splitting, snapshot backups and the physics of  how hard disks spin. Just learn that stuff and you’ll be able to scale beautifully. With SQL, you’ll be the next eBay.”

Cloud Devil: “Our database just scales. Avoid huge numbers of writes to a single row. That’s all you need to learn about scalability. Everything else will just work. With your data in the cloud, you’ll be the next Google.”

SQL Angel: “You lose a lot of ad hoc query capability if you don’t use a database as flexible as a relational database. What about the analytics???”

Cloud Devil: “Analytics? Doesn’t that sound like some kind of stupid idea that would come out of marketing? Or Finance? Do you really need that? Anyhow, Google uses MapReduce for analytics. Major cloud databases either have, or will have access to MapReduce soon.”

SQL Angel: “But SQL databases are used in finance. Like at Goldman Sachs. Those guys are bad-ass enough to lend money to the country of Greece!”

Cloud Devil: “Cloud databases are used at Google. Those guys are bad-ass enough to give the Chinese government the finger.”

SQL Angel: “And there are free relational databases. Like MySQL! Which is used by Facebook and YouTube.”

Cloud Devil: “MySQL sucks. Facebook and YouTube are migrating away from it and onto Cassandra, BigTable, etc.”

SQL Angel: “But there are other free relational databases!”

Cloud Devil: “Wow….this is getting complicated. Do you really want to choose a minority relational database that for some reason is less popular than MySQL?

SQL Angel: “Plus there are some amazing commercial RDBMS’.”

Cloud Devil: “Before you have a dollar of revenue, you’re supposed to shell out cash for a commercial database? When App Engine is FREE and SimpleDB is CHEAP?”

SQL Angel: “SQL is a multi-vendor standard! You can port your code from one SQL engine to another!”

Cloud Devil: “Did you ever actually try porting SQL code and a large volume of data from one server to another? What a headache. If you pick something that scales all the way up from the start, you won’t have to worry about porting. Anyhow, maybe standards will arise for cloud databases too. Trust us.”

SQL Angel: “None of this is new…we already discarded this stuff in the 1960s!”

Cloud Devil: “Do you really care about stuff that happened before you were born? Anyhow, go ask the SQL guy about ‘paxos’ which was invented in 1998, ‘MapReduce’, patented in 2004 and ‘CAP Theorem’ from 2000. Ask about transparent sharding and automatic replication.”

SQL Angel: “The cloud devil is oversimplifying things. Look: SQL databases are just what PROFESSIONALS use.”

Cloud Devil: “Do you want things simple…or complex? Do you want to be like Google or like a Health Insurance company?”

I hope you can see how this debate is going to turn out.

You might wonder what the inevitable popularity of cloud databases will have to do you or your business. That’s where we need to consider the nature of disruptive technologies: they start with the snotty nosed kids. And some of those kids go on to be Larry or Sergey or Mark Z or Kevin Rose or Jerry Yang, and are therefore influential because of their success.

And what about the rest of those kids? Eventually (a few years from now) the other kids arrive at your department, having failed at their dorm-room startup. But they start asking why you’re maintaining your own servers for the customer support site on a managed RDB instead of just deploying it for free into the cloud? After all, not all corporate data is equally secretive.

Then the snotty nosed kid gets the ear of an architect or CIO and asks: “Why don’t we run a cloud database cluster within the firewall? Wouldn’t that be easier than tweaking each relational database to the requirements of each app?” The next thing you know…the old timers need to defend their decision to use “legacy” technology that does not “scale smoothly” and has been rejected by “thought leaders” like “Google”.

I’m not claiming that the old timers will necessarily and consistently lose that debate. Analytics and flexibility are very valuable in the enterprise. But I am saying that by the time you have that debate, you’ll probably have about 15 small internal, departmental systems running on cloud databases that you did not authorize rather than on the “blessed” Oracle installations that you’re paying through the nose for. At some point, the relational database will need to constantly justify its existence, rather than being the default choice.

Share and Enjoy:
  • Digg
  • Facebook
  • Reddit
  • Twitter

Universally Unique Identifiers and You! (Part 2)

Courtesy of Flickr's Koen Vereeken

Courtesy of Flickr's Koen Vereeken

In part 1 we went into the why of UUIDs, now lets get into the how.

Representation of UUIDs in MySQL

UUIDs are intrinsically fairly large, so storage space is something to consider. Large database rows result in more cache misses, more disk seeks, slower backups, etc. There are some techniques you can use to optimize storage if this is a concern for you.

It’s tempting to store UUIDs are some sort of text string since we see them in that format (e.g. 550e8400-e29b-41d4-a716-446655440000) so often, but it’s also a waste of space: it takes 36 bytes to store the string in its default format and a UUID just represents a 16 byte number! It’s also a waste of processing time to store UUIDs as strings since mysql then has to worry about string lengths, character sets and collations. Instead, you can use a BINARY(16) column to be efficient. You might think 16 bytes is a lot of space to use up for just storing an identifier, but it’s only 4x larger than the default INT or 2x larger than a BIGINT.

An annoyance when you’re developing an app or dealing with data full of binary UUIDs is that the bytes are rendered raw when using MySQL on the command line so you end up with control codes in the rendered string and get fun things like the bell characters and misaligned columns. Use MySQL’s HEX and UNHEX functions to create some nice human readable representations and you’re off to the races. If you’re dealing with tables frequently you could create a database view for easy querying.

UUIDs and table layout

Also, consider the issue of your on-disk table layout. At small scales, you do not need to worry about that, but as you scale up, you should really understand the details of how your database manages disk space.

UUIDs are not always the best choice for a primary key. Since MySQL loads data from disk in batches based on primary key, you could end up with wasted memory and unnecessary disk reads. To alleviate this, you can add another column for the primary key (a standard auto-increment column is a good candidate here as it clusters data well), and put an index on your UUID column, but refer to data by only its UUID. Of course the index takes up disk space and slows down reads, but it is often more important to keep chronologically related data together for reads.

UUIDs in Ruby

We use the uuidtools gem.

UUIDTools can sometime have trouble finding your machine’s mac address (needed to generate some UUIDs), so we use the macaddr gem.

require ‘macaddr’
require ‘uuidtools’
UUIDTools::UUID.mac_address = Mac.addr

UUIDTools::UUID the key methods you’re going to need are:

  • to_s – the formatted one with dashes
  • hexdigest – like to_s but without dashes. We use this for memcache keys since you can hit the key length limit surprisingly easily.
  • raw – the raw byte string. Store this in MySQL.

If you are having trouble getting the byte strings into your database, you can use the MySQL literal “x’#{uuid.hexdigest()}’” in your queries to have the database server parse the bytes for you.

Summary

UUIDs are a useful tools for distributed applications. They take some thought to implement well, but hopefully we’ve helped you get a jumpstart on that.

Share and Enjoy:
  • Digg
  • Facebook
  • Reddit
  • Twitter

Universally Unique Identifiers and You! (Part 1)

Courtesy of Flickr's Smithsonian Institution

Courtesy of Flickr's Smithsonian Institution

The computer scientist Phil Karlton is reputed to have said, “there are only two hard things in Computer Science: cache invalidation and naming things.” Universally Unique Identifiers (UUIDs) are a way of naming things such that the name is essentially guaranteed to be unique, literally, universally. A UUID is a string that looks like this: “F7C27314-5CFC-4E36-8F87-C8CC8DDAC0B1″. I generated that string with the uuidgen application and it is universally unique. I just Googled that string, and it does not appear anywhere else on the Internet. If it does when you Google it now, then it was copied from this document. That’s the power of UUIDs: when you see the same UUID in two different places you know that you are referring to the same object. If I ever wish to find copies of this blog post, Googling that UUID will turn them up.

What is a UUID?

UUIDs are basically very large random numbers. They are virtually guaranteed to be unique for basically the same reason that it is very unlikely that hundreds of people will win the lottery: simple statistics. Here’s what Wikipedia has to say about it: “In other words, only after generating 1 billion UUIDs every second for the next 100 years, would the probability of creating just one duplicate be about 50%. The probability of one duplicate would be about 50% if every person on earth owned 600 million UUIDs.”

Why do we care about identifiers that are universally unique?

We want identifiers to be universally unique when they are going to be generated by multiple distributed systems that are not necessarily in constant communication with each other. UUIDs are used in Amazon’s SimpleDB, for example, because internally SimpleDB nodes are not stored in just one location — it is a distributed database. This is in contrast to using an auto-increment column in a single-server relational database implementation, where every thread depends on a single memory location to insure that identifiers are unique.

Another benefit of identifiers being unique within a system is that datasets can be partitioned and merged between systems without the headache of trying to synchronize the migration of identifiers that overlap. As an example, you could take “vehicle” objects from a car racing game and migrate them into a database with “sword” objects from a fantasy game. This gives us great flexibility with our underlying resource allocation.

When to use UUIDs

There are basically only two ways to name things and have the names be universally unique. One is to use a hierarchical system like URLs, where there is a process of “nested registries”. For example: a URL like http://www.facebook.com/ayogo involves a DNS registry for “.com”, which gives Facebook license to use “facebook”, a registry for Facebook.com sub-domains which binds “www.facebook.com” and a registry for Facebook “usernames” which binds “ayogo”.

This is obviously a very useful and popular system of universal naming, and it has the advantage of having names that are more human-readable than UUIDs. But it depends on the management of the hierarchical registry system. When it comes to domain names and Facebook usernames, the management of the database costs millions of dollars per year. Sometimes you’d like to achieve the universal uniqueness without the overhead of a registry. UUIDs are the standard technology for accomplishing that.

Summary

Naming is hard. But if you can live with UUID’s verbosity and poor readability, they are actually quite a bit simpler than hierarchical universal naming schemes.

In Part 2, we will go into some detail as to how to implement UUIDs in your application.

Share and Enjoy:
  • Digg
  • Facebook
  • Reddit
  • Twitter

Bionic Scripting: BigCo’s Bet on Scripting

Courtesy fo Flickrs niallkennedy

Courtesy fo Flickr's niallkennedy

Just as the “Office of Scientific Intelligence” invested millions in turning Lee Majors into a superhuman cyborg, the biggest IT companies in the world are making interesting investments in scripting languages.

Exhibit A: PHP is in the process of being gutted and rebuilt by Facebook.

Exhibit B: Python is being overhauled by Google, Microsoft and Sun (or what remains of it).

Exhibit C: Ruby is being reconstructed by Apple, Microsoft and Sun (ditto).

Javascript is being dramatically accelerated by … basically everybody.

Some conclusions:

  • scripting languages are mainstream
  • no language dominates and the Facebook move may give PHP a new lease on life (though speed was never the primary problem with it)
  • most players are converging on just-in-time compilation, often using LLVM (the JVM and CLR also do Just-in-time compilation).
  • Facebook’s “HipHop” is the exception to the hegemony of the JIT compilation technique
  • multi-language virtual machines have become dominant very quickly.
  • Sun blew a five year lead in multi-language VMs by equating the “Java Virtual Machine” with “Java” when they should have been supporting JPython way back in 1997 when it was desperate for commercial support
  • I predict that in the next few years we’ll see the same trend for functional languages.  Scala.NET and ClojureCLR are on the way. Look out for ports of Haskell, O’Caml and Erlang in the near future.
Share and Enjoy:
  • Digg
  • Facebook
  • Reddit
  • Twitter