Hosting Large Public Datasets on Amazon S3
September 04 2008
There's a great deal of interest in large, publicly available datasets (see, for example, this thread from theinfo.org), but for very large datasets it is still expensive to provide the bandwidth to distribute them. Imagine if you could get your hands on the data from a large web crawl, the… read more
Elastic Hadoop Clusters with Amazon's Elastic Block Store
August 23 2008
I gave a talk on Tuesday at the first Hadoop User Group UK about Hadoop and Amazon Web Services - how and why you can run Hadoop with AWS. I mentioned how integrating Hadoop with Amazon's "Persistent local storage", which Werner Vogels had pre-announced in April, would be a great… read more
July 23 2008
I'm noticing an increased desire to make Hadoop more modular. I'm not sure why this is happening now, but it's probably because as more people start using Hadoop it needs to be more malleable (people want to plug in their own implementations of things), and the way to do that… read more
RPC and Serialization with Hadoop, Thrift, and Protocol Buffers
July 08 2008
Hadoop and related projects like Thrift provide a choice of protocols and formats for doing RPC and serialization. In this post I'll briefly run through them and explain where they came from, how they relate to each other and how Google's newly released Protocol Buffers might fit in.
RPC and Writables
Hadoop… read more
Hadoop beats terabyte sort record
July 03 2008
Hadoop has beaten the record for the terabyte sort benchmark, bringing it from 297 seconds to 209. Owen O'Malley wrote the MapReduce program (which by the way has a clever partitioner to ensure the reducer outputs are globally sorted and not just sorted per output partition, which is what the… read more
June 20 2008
If you want a high-level query language for drilling into your huge Hadoop dataset, then you've got some choice:
Pig, from Yahoo! and now incubating at Apache, has an imperative language called Pig Latin for performing operations on large data files.
Jaql, from IBM and soon to be open sourced, is a… read more
June 13 2008
James Hamilton on The Next Big Thing:
Storing blobs in the sky is fine but pretty reproducible by any competitor. Storing structured data as well as blobs is considerably more interesting but what has even more lasting business value is the storing data in the cloud AND providing a programming platform… read more
May 30 2008
Today I visited Raglan Castle in Monmouthshire with my family. Cadw, the government body that manages the castle, were running a trial to deliver audio files to visitors' mobile phones using Bluetooth. As I walked through the entrance I simply made my phone discoverable, waited a few seconds for the… read more
April 30 2008
Last July I asked "Why are there no Amazon S3/EC2 competitors?", lamenting the lack of competition in the utility or cloud computing market and the implications for disaster recovery. Closely tied to disaster recovery is portability -- the ability to switch between different utility computing providers as easily as I… read more
April 14 2008
On Friday in Amsterdam there was a lot of Hadoop on the menu at ApacheCon. I kicked it off at 9am with A Tour of Apache Hadoop, Owen O'Malley followed with Programming with Hadoop’s Map/Reduce, and Allen Wittenauer finished off after lunch with Deploying Grid Services using Apache Hadoop. Find… read more
Turn off the lights when you're not using them, please
March 30 2008
One of the things that struck me about this week's new Amazon EC2 features was the pricing model for Elastic IP addresses:
$0.01 per hour when not mapped to a running instance
The idea is to encourage people to stop hogging public IP addresses, which are a limited resource, when they don't… read more
March 23 2008
I made this image a few years ago (as a postcard to give to friends), but it's appropriate to show again today as it's a neat visual demonstration that Easter this year is the earliest this century.
The scale at the bottom shows the maximum range of Easter: from 22 March… read more
March 22 2008
On Wednesday, I ran a session at SPA 2008 entitled "Understanding MapReduce with Hadoop". SPA is a very hands-on conference, with many sessions having a methodological slant, so I wanted to get people who had never encountered MapReduce before actually writing MapReduce programs. I only had 75 minutes, so I… read more
March 18 2008
MapReduce is a programming model for processing vast amounts of data. One reason it works so well is that it exploits a sweet spot of modern disk drive technology trends. In essence MapReduce works by repeatedly sorting and merging data that is streamed to and from disk… read more
March 02 2008
There's a class of MapReduce applications that use Hadoop just for its distributed processing capabilities. Telltale signs are:
1. Little or no input data of note. (Certainly not large files stored in HDFS.)
2. Map tasks are therefore not limited by their ability to consume input, but by their ability to run… read more