After opening our new Boston office earlier this year (for any of you locals, we're down in the Innovation District on Summer St), we finally got the chance to attend our first AWS Boston meetup.
And we were quickly reminded that we should have attended sooner, with some great talks on this month's topic of 'Big Data' and a good turnout from the Boston AWS community. For anyone who missed it, here's a brief rundown of what was on show:
Matt Drayer from Carbonite presented the results of their recent internal hackday adventures, where he and a number of his colleagues built a solution combining AWS, Elasticsearch, and some C++ hacking to help answer questions from their product team about the types of information they manage for their customers. Some interesting findings from their work included:
- 4% of the files they back up for end users were documents, 4% were audio, and 15% were images, with the remainder made up of application files (32%) and files relating to file and folder info (37%). Note this data was from a sample set only, rather than an analysis of all production data. I thought it was interesting that the percentage of docs, images and audio files was so low – although I'd imagine if you took size into consideration, audio and images in particular would likely consume a high percentage of overall disk space.
- Carbonite’s online backup service manages more than 100 petabytes of customer data. That’s a lot!
- Jocasta Nu was a female Human who served as the Chief Librarian of the Jedi Archives in Star Wars…. the Carbonite guys named their status file importer library after her 🙂 A Carbonite status file contains metadata on the type of files users are uploading to their service, and formed a key part of their analysis effort.
Next up, Mary Flynn showed how to get data out of various Hadoop implementations and into analysts' hands using Jaspersoft. Mary showed her multitasking skills by fighting with the live demo gods while at the same time giving a concise overview of the likes of Cloudera CDH (HBase and Hive), Cloudera Impala, and Jaspersoft BI for AWS. Interestingly, Jaspersoft now has per-hour pricing – pay for what you use!
Finally, Patrick Eaton from Stackdriver discussed how they are applying a number of different types of data analysis to create a 'more intelligent monitoring solution' than the current state of the art. One interesting example was how they are using MapReduce on AWS (via Python's mrjob library) to perform what they have termed 'pre-compute summarizations'. These are essentially a way to identify commonly used data summarizations (e.g. average response time over the last week) and then pre-compute them, outside of the critical path. This naturally helps with performance, as the values are pre-computed and stored for future use (you do not need to compute them on the fly!). Another advantage is that they can be computed in such a way as to give you a rolled-up view of a lot of data – a rolled-up view can reduce noise in your metrics graphs and thus give you a clearer picture of what is going on. For instance, if you want to look at two weeks' worth of data, a rolled-up view will not show every raw data point – instead the data might be rolled up to 30-minute granularity. A disadvantage here is that rolling up too much data can cost you some of the accuracy of the graphs, especially if there is a lot of jitter – but that obviously depends on the type of data you are looking at and how much you roll it up.
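To make the roll-up idea concrete, here's a minimal Python sketch of bucketing raw metric samples into 30-minute windows and pre-computing the average per window. This is just an illustration of the general technique, not Stackdriver's actual implementation – the sample format `(unix_timestamp, value)` and the bucket size are my own assumptions.

```python
from collections import defaultdict

BUCKET_SECONDS = 30 * 60  # 30-minute roll-up granularity (illustrative choice)

def roll_up(samples, bucket_seconds=BUCKET_SECONDS):
    """Roll raw (unix_timestamp, value) samples up into coarse time buckets,
    pre-computing the average value per bucket."""
    buckets = defaultdict(list)
    for ts, value in samples:
        bucket_start = ts - (ts % bucket_seconds)  # align to bucket boundary
        buckets[bucket_start].append(value)
    # Emit one averaged point per bucket, sorted by bucket start time
    return sorted((start, sum(vals) / len(vals)) for start, vals in buckets.items())

# Example: four response-time samples spanning two 30-minute windows
samples = [(0, 100.0), (600, 120.0), (1800, 90.0), (2400, 110.0)]
print(roll_up(samples))  # [(0, 110.0), (1800, 100.0)]
```

In a real pre-compute pipeline this aggregation would run as a batch job off the critical path (e.g. as an mrjob MapReduce step), with the rolled-up points stored for graphing so they never need to be computed on the fly.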
I assumed the event was finished when I left the venue, only to continue further AWS-related discussions with some of the CloudHealth team in attendance, who I randomly bumped into on the T home 🙂 Looking forward to the next AWS T… I mean Boston meetup already!
By the way, if you are an AWS expert and are interested in a DevOps role with some cool technology that processes billions of events every day – we're hiring!
Don’t believe our technology is cool? B-) Check it out for free here!