2006-06-01
Amazon S3
I've been playing around with Amazon S3 for a project I'm working on and I'm quite impressed with it. Amazon S3 provides unlimited storage for $0.15 per gigabyte of storage and $0.20 per gigabyte of bandwidth.
Once you sign up and get an account you can create 'buckets', which are named collections of stored objects. Each bucket acts as a billing point, providing a way to separate data stored for different applications so you can track bandwidth, storage utilisation, cost, etc.
Within a bucket you can store and retrieve objects by key. There is no way to update or append to an object, so you either create a new one or replace an existing one completely. You can, however, use an HTTP 'partial get' to retrieve part of an object's contents.
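As a rough sketch of what a partial get looks like from Rhino (assuming the object has been made publicly readable; a private object would also need the usual authentication headers, and the bucket and key names here are just placeholders):
importPackage(java.io);
// Fetch only the first 100 bytes of the object using an HTTP Range header.
var url = new java.net.URL("http://s3.amazonaws.com/mybucket1/key1");
var con = url.openConnection();
con.setRequestProperty("Range", "bytes=0-99");
var reader = new BufferedReader(new InputStreamReader(con.getInputStream()));
var chunk = "", line;
while ((line = reader.readLine()) != null) {
    chunk += line;
}
reader.close();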
The S3 datastore is managed via a REST-based HTTP API or via SOAP. Amazon have provided a number of sample client libraries in different languages to get you going quickly, including Java, Ruby, Perl and Python.
The objects stored can be made publicly accessible via an HTTP URL, or kept private so that only the owner can access them. Every object stored has automatic BitTorrent support: by appending a torrent suffix onto the object's URL you get a torrent file that lets you use the BitTorrent protocol to share the bandwidth of downloading large files (there's a rough sketch of this further down).
People have already started building interesting libraries on top of it. For example:
- S3Ajax. This is a Javascript library that lets you call the S3 API from within the web browser. By hosting the Javascript files on your S3 system and making them publicly accessible via the browser you get a simple web site. As it is hosted on the S3 domain, the Javascript functions can call the S3 API, providing read/write access to the storage. S3Wiki uses this, for example.
- This thread in the S3 forums discusses a filesystem built on top of S3. It works under Linux and can be mounted like a normal drive. Each block in the filesystem is stored as an object in the S3 datastore. It can only be mounted on one system at a time, but once unmounted it can safely be mounted on another machine, providing unlimited storage that can be accessed from any Linux system.
- JungleDisk has a similar idea but uses a local WebDAV server and transparently copies and encrypts data to the S3 datastore. It can be used on Windows, Linux or Mac OS X. The main difference between JungleDisk and the S3 filesystem above is that JungleDisk is cross platform. I believe JungleDisk also stores files as S3 objects with a one-to-one mapping, whereas the S3 filesystem stores the blocks that make up the files as objects. The latter approach makes it easier to support partial access to files and streaming without having to download the entire object first, at the risk of making it harder to reconstruct the file itself if things go wrong.
I'm using S3 for holding uploads of media provided by users for later analysis and classification. I don't know how much data the system will eventually collect, so having storage that is unlimited (except by cost) is useful. It also means I don't have to pay for storage up front; instead I can pay as I go.
So how is S3 different from something like Openomy? My understanding is that the main intent of Openomy is to provide a place for users to store data from web applications, with the user owning that data. An Openomy compliant web application would be granted access to an area of the user's Openomy storage and could read/write to it. Should the web application go out of business or become inaccessible, the user still has the data in their control.
With S3 the usage model is that the web application uses its own S3 store for storing data. The user does not get S3 storage and provide it to the web application: to do that would require giving up their 'secret key', which can't be revoked, and would be bad since the web application could then access all of the user's data. With Openomy you can authorise an application and later revoke its access to the data. Both usage models are useful and I think they are complementary services.
Currently I'm using S3 from Javascript on the server. It's very easy to call the S3 Java API from Rhino. To import the basic Java classes into Rhino and create an authenticated connection:
importClass(Packages.com.amazon.s3.AWSAuthConnection);
importClass(Packages.com.amazon.s3.S3Object);
var conn = new AWSAuthConnection(accessKeyId, secretAccessKey);
The 'accessKeyId' and 'secretAccessKey' are the keys supplied by Amazon once you've subscribed to the S3 web service. Once you have a connection you can create buckets and store and retrieve objects:
js> conn.createBucket('mybucket1', null).connection.getResponseMessage()
OK
js> conn.listBucket("mybucket1", null, null, null, null).entries
[]
js> var obj1 = new S3Object(new java.lang.String("Hello!").getBytes(), null);
js> conn.put('mybucket1', 'key1', obj1, null).connection.getResponseMessage();
OK
js> conn.listBucket("mybucket1", null, null, null, null).entries
[key1]
js> new java.lang.String(conn.get('mybucket1', 'key1', null).object.data);
Hello!
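Objects can also be made publicly readable, as mentioned earlier, by sending the 'x-amz-acl' header along with the put. The following is only a sketch: I haven't double-checked exactly what shape the sample library expects the headers map to be, and the bucket and key names are placeholders.
var acl = new java.util.ArrayList();
acl.add("public-read");
var headers = new java.util.HashMap();
headers.put("x-amz-acl", acl);
var obj2 = new S3Object(new java.lang.String("Shared!").getBytes(), null);
conn.put('mybucket1', 'public1', obj2, headers);
// If that worked, anyone should be able to fetch the object at
// http://s3.amazonaws.com/mybucket1/public1 and the torrent version at
// http://s3.amazonaws.com/mybucket1/public1?torrent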
Wrapping this in a nicer server side Javascript API would probably be a good idea. For example, I can't call the 'delete' method of AWSAuthConnection using dot notation, as 'delete' is a reserved word in Javascript. And as Rhino allows serialising any Javascript object, you could even store continuations in S3.
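In the meantime, bracket notation gets around the reserved word, and a thin wrapper over the sample library is easy enough. A minimal sketch (assuming 'delete' takes the same bucket, key and headers arguments as the other calls; I haven't checked the signature):
// A thin server side wrapper around the sample library for a single bucket.
function S3Store(conn, bucket) {
    this.put = function(key, text) {
        var obj = new S3Object(new java.lang.String(text).getBytes(), null);
        return conn.put(bucket, key, obj, null).connection.getResponseMessage();
    };
    this.get = function(key) {
        return String(new java.lang.String(conn.get(bucket, key, null).object.data));
    };
    // 'delete' is a reserved word, so call it via bracket notation and
    // expose it under a different name.
    this.remove = function(key) {
        return conn['delete'](bucket, key, null).connection.getResponseMessage();
    };
}
var store = new S3Store(conn, 'mybucket1');
store.put('key2', 'Hello again!');
store.get('key2');      // Hello again!
store.remove('key2');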