September 18, 2013, by Kevin Nickels

Big Data, Hubi, and Beer

Okay, this blog has nothing to do with beer. But hey! I had to get you here somehow.

This blog article focuses on how FatFractal (FF) uses big data and why it is important to the developer. There are lots of ways to collect this kind of data and extrapolate meaning from it. At FF we designed analytics into the platform from day one so that we could generate, store, and mine data using common big data technologies such as Hadoop, MapReduce, Hive, Pig, Flume, and Cassandra. The data that is ultimately stored comes from conventional sources such as logs and infrastructure services such as CloudWatch, but also from our instrumented application container, which provides a real-time view into what is really happening with the applications. Our goal is ultimately to provide developers with the tools and information they need to effectively manage and monitor their application’s compute usage.

At FatFractal (FF) we use big data primarily for:

  • Billing - FF charges developers for their compute consumption. All usage metrics are ultimately stored in the FF Hadoop cluster, and at the end of each developer’s monthly billing cycle the data is MapReduced into billing records that are stored in Cassandra (see the sketch after this list).
  • Usage Profile - All applications deployed to FF have a Usage Profile (UP) constructed for them. The UP represents a set of compute constraints based upon either a subscription (BaaS) or a number of FatFractal Virtual Spaces (FFVSs; an FFVS is a custom LXC container) and services (e.g., a database) (PaaS). If an application’s compute usage consistently approaches or exceeds the thresholds of the UP, the developer is notified so that they have an opportunity to upgrade their subscription or allocate additional FFVSs.
  • Application Analytics Service - FF provides analytic reporting for all applications, accessible from the FF console, which allows developers to track their application’s compute usage. Ultimately the goal of this service is to give developers the information and tools to truly monitor and manage their application’s compute usage.
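To make the billing item concrete, here is a minimal MapReduce sketch of the kind of rollup described above. It is an illustration only: the input line format (appId, timestamp, CPU milliseconds, request count) and the class names are my assumptions, and a real job would persist into Cassandra rather than HDFS text output.

    // BillingRollup.java - a minimal sketch, not FatFractal's actual billing job.
    // Assumes usage metrics land in HDFS as text lines like:
    //   <appId> <epochMillis> <cpuMillis> <requestCount>
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BillingRollup {

      public static class UsageMapper
          extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] f = line.toString().split("\\s+");
          if (f.length < 3) return;  // skip malformed lines
          // key = appId, value = CPU milliseconds for this sample
          ctx.write(new Text(f[0]), new LongWritable(Long.parseLong(f[2])));
        }
      }

      public static class UsageReducer
          extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text appId, Iterable<LongWritable> vals, Context ctx)
            throws IOException, InterruptedException {
          long totalCpuMillis = 0;
          for (LongWritable v : vals) totalCpuMillis += v.get();
          // One record per app per billing cycle; the real pipeline would
          // write these into Cassandra rather than HDFS text output.
          ctx.write(appId, new LongWritable(totalCpuMillis));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "billing-rollup");
        job.setJarByClass(BillingRollup.class);
        job.setMapperClass(UsageMapper.class);
        job.setReducerClass(UsageReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }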

This blog article will focus primarily on the UP and on using analytics to provision properly for an application. In addition, it will cover scheduled scaling based on a real-world application (Hubi) that is currently deployed on the FF infrastructure.

Application Compute Usage Metrics

This section provides background on why and how FF collects application compute usage metrics.

Planning and scaling in multi-tenanted environments is challenging because you don’t know which applications are actually consuming the resources unless you have baked in the necessary instrumentation. When an instance hits, say, 80% CPU utilization, the simplest thing to do is clone all the applications onto a newly minted instance and add it to the load-balanced mix (which is what the FF traffic directors do). However, if you can identify the pertinent application(s), you simply need to clone them onto existing under-utilized instances, or spin up a new one matching the compute needs (e.g., an EC2 m1.small) and let the FF traffic directors do their job.

The kind of data you need to assess each application’s compute usage includes: 1) CPU milliseconds consumed per unit time, 2) request and response counts/sizes per unit time, and 3) memory consumption per unit time (this one is kind of hazy, but a relative number can be arrived at). You then compare the numbers and zero in on which applications are consuming the most compute. It may well be a situation where the instance is oversubscribed and the applications need to be segmented onto different instances.

At FF we collect instance-level compute usage from infrastructure services (i.e., CloudWatch), which tells us what is going on with the instance. For application-level compute usage we rely on metrics generated by the FF application container. FF uses a custom application container (think Google App Engine) to facilitate the deployment and execution of all applications independent of their type (i.e., NoServer, R-o-R, Servlets, etc.). The FF application container has been instrumented to generate compute usage metrics in real time and ultimately propagates them to a Hadoop cluster. It should also be mentioned that the application containers reside in customized paravirtualized containers (LXC/FFVS) that are each assigned a slice (i.e., 1 proc) of the instance’s compute resources. The diagram below provides a high-level view of the application container.
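To complement that high-level view with something concrete, here is a minimal servlet-filter sketch of what per-request instrumentation can look like. This is not FF’s actual container code; the metric sink (a log line) stands in for the real pipeline into Hadoop.

    // UsageMetricsFilter.java - a minimal sketch of per-request instrumentation,
    // not FatFractal's actual container code.
    import java.io.IOException;
    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;

    public class UsageMetricsFilter implements Filter {
      private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

      @Override public void init(FilterConfig cfg) {}
      @Override public void destroy() {}

      @Override
      public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
          throws IOException, ServletException {
        // CPU time consumed by the current thread so far, in nanoseconds.
        long cpuBefore = THREADS.getCurrentThreadCpuTime();
        long wallBefore = System.nanoTime();
        try {
          chain.doFilter(req, res);
        } finally {
          long cpuMs = (THREADS.getCurrentThreadCpuTime() - cpuBefore) / 1_000_000;
          long wallMs = (System.nanoTime() - wallBefore) / 1_000_000;
          // A real container would batch these records and ship them (e.g.,
          // via Flume) to the Hadoop cluster; here we just log a line.
          System.out.printf("app=%s cpuMs=%d wallMs=%d%n",
              req.getServletContext().getServletContextName(), cpuMs, wallMs);
        }
      }
    }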

Usage Profile (UP)

OK, we now know how application compute usage metrics are generated. Next, let’s look at how those analytics can be leveraged.

When an application is deployed to the FF infrastructure, nothing is known about its compute usage requirements. The UP dictates the compute thresholds, which may be an indicator (i.e., the developer signs up for a bronze subscription knowing the associated compute quotas closely match the application’s compute usage); however, most greenfield applications are undersubscribed, with significant headroom to grow.

Compute provisioning for BaaS and PaaS applications is defined differently. BaaS developers sign up for a specific subscription, which defines the quotas for the UP. PaaS developers explicitly choose how many FFVSs their application should be deployed to and which services it will use.

While the FFVS compute quotas are published, it remains difficult to specify precisely how much compute a PaaS application is going to need, especially if it starts out as a high-volume application (i.e., a migration from another service). If the PaaS application is oversubscribed, auto-scaling will mitigate the situation; however, this use case is not what auto-scaling was designed for, and it is not optimal from a cost or provisioning perspective.


Most applications deployed to FatFractal are greenfield apps that typically have little to no load to start with. With these types of applications there is sufficient lead time to collect analytics and reconcile them against the UP. Once compute usage hits certain thresholds, the developer is notified that they should upgrade their subscription or allocate additional FFVSs.
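A minimal sketch of what that threshold reconciliation might look like follows; the quota values, the 80% warning level, and the notification hook are all invented for illustration, not FF’s real numbers.

    // UsageProfileCheck.java - an illustrative sketch; the quota values and
    // notification hook are assumptions, not FatFractal's real implementation.
    public class UsageProfileCheck {

      // A UP reduced to two example quotas: CPU seconds and requests per day.
      static final long CPU_SECONDS_PER_DAY_QUOTA = 50_000;
      static final long REQUESTS_PER_DAY_QUOTA    = 500_000;
      static final double WARN_THRESHOLD = 0.80;   // notify at 80% of quota

      static void reconcile(String app, long cpuSecondsToday, long requestsToday) {
        double cpuUse = (double) cpuSecondsToday / CPU_SECONDS_PER_DAY_QUOTA;
        double reqUse = (double) requestsToday / REQUESTS_PER_DAY_QUOTA;
        if (cpuUse >= WARN_THRESHOLD || reqUse >= WARN_THRESHOLD) {
          // Real system: notify the developer to upgrade their subscription
          // (BaaS) or allocate additional FFVSs (PaaS).
          System.out.printf(
              "%s approaching UP limits: cpu=%.0f%%, requests=%.0f%%%n",
              app, cpuUse * 100, reqUse * 100);
        }
      }

      public static void main(String[] args) {
        reconcile("hubi", 42_000, 310_000);  // sample numbers, not real data
      }
    }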

There is another class of application that is not greenfield but rather a migration from another platform (i.e., Google App Engine) that may generate huge amounts of load once the switch is fully flipped. If the developer knows the compute characteristics of the application (i.e., the app requires 2 x 2.6 GHz worth of CPU), then it is relatively straightforward to formulate a reasonable UP; however, this is generally not the case. The challenge in this situation is to define a UP with enough compute up front to accommodate the application’s usage needs without overcharging the developer or impacting the users of the application.

This can be done three ways:

  1. Over-provision, collect application compute usage metrics over some period of time, and later make deployment adjustments and redefine the UP.
  2. Under-provision, collect usage analytics over some period of time, rely on auto-scaling to mitigate spikes in load, and later make deployment adjustments and redefine the UP.
  3. Provision minimal compute (i.e., a single FFVS), have the developer partially open the spigot, collect usage analytics over some period of time, and later make deployment adjustments and redefine the UP based on some multiple of the number of requests for a certain corpus of users over a period of time (see the sketch after this list).
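To make option 3 concrete, here is a back-of-the-envelope sketch of the extrapolation. The 5% sample fraction and the 25% headroom factor are invented for illustration.

    // SpigotExtrapolation.java - a back-of-the-envelope sketch of option 3;
    // the sample fraction and headroom factor are illustrative assumptions.
    public class SpigotExtrapolation {
      public static void main(String[] args) {
        double sampledFraction   = 0.05;    // spigot opened to ~5% of users
        long   sampledCpuSeconds = 25_000;  // measured over the sampling window
        double headroom          = 1.25;    // 25% cushion so spikes don't hit the ceiling

        long projected = Math.round(sampledCpuSeconds / sampledFraction * headroom);
        // 25,000 / 0.05 * 1.25 = 625,000 CPU seconds for the full corpus -
        // size the UP (subscription tier or FFVS count) from this projection.
        System.out.printf("Projected full-corpus CPU seconds per window: %d%n", projected);
      }
    }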

With all three options it is preferable to work closely with the developer, which FF recommends and generally does. Ultimately the goal is to give developers the information and tools they need to do it themselves.

All three methodologies will work, and each is optimal for certain use cases. IMHO option 3 is the preferred approach, but unfortunately very few migration scenarios are actually in a position to take advantage of it. Independent of which methodology is employed, application compute usage metrics are critical to scoping the final UP and making adjustments to it over time.

Next I will cover a real use case where option 3 was employed.

Introducing Hubi

Hubi is a very cool mobile application developed by Megadevs. It is available on both Android and iOS and has 500,000+ users all over the globe. The application was recently named the best movie-streaming Android app by heavy.com.

The Servlet back end for the application was originally deployed onto a hosting provider. FF was approached by one of the Megadevs developers (Dario Marcato, a great guy BTW) at AppsWorld 2013 to discuss migrating Hubi from the hosting provider to FF, and a couple of months later the journey began.

Hubi generates significant request load but is unique in that it spikes every day at about the same time and is CPU bound for certain request types. Below is a table showing the number of unique users, requests, and CPU seconds (which won’t mean too much yet) per month from 05/12/2013 through 09/17/2013, to give you an idea of its load.

Month       Users (unique)   Requests     CPU (seconds)
May         20,842           343,994      307,839.533
June        165,781          2,706,665    1,841,253.982
July        412,126          7,381,020    5,008,210.832
August      491,571          9,644,024    8,915,138.303
September   205,124          5,571,600    3,975,262.658
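(September is a partial month, covering only through 09/17.) To give the CPU-seconds column some meaning ahead of the discussion below: dividing CPU seconds by requests works out to roughly 0.68 CPU seconds per request in June and July versus roughly 0.92 in August. That ratio moves with the request-type mix, which is exactly the effect the hourly analysis below digs into.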

Hubi was originally provisioned onto one FFVS, and for the month of May, when there were a limited number of users, things worked out fine. We then profiled the compute usage with the analytics we had collected up to that point and formulated a UP based on the full corpus of users (approximately 500,000). We provisioned for that UP, and for twenty-one (21) hours a day things went smoothly, but between the hours of 1pm PST and 3pm PST we would experience load issues where instance CPU usage would hit 80%+ and result in request timeouts.

We then profiled Hubi on an hourly basis across the month of June and observed a spike that occurred every day between 1pm PST and 3pm PST. At that point we could simply have adjusted the UP and added n more FFVSs, where the compute for the additions would be used approximately three (3) hours a day. While that plan was simple, it was not palatable from a cost perspective, since there is an incremental cost associated with each additional FFVS. So we decided to leverage an FF scaling feature where we predictively spin up n FFVSs at a scheduled time and then tear them down once the time allotment has been hit. We then charge for the cumulative hours, which effectively amounted to the addition of one (1) FFVS. Hubi has been running with this UP for approximately 1.5 months and there have been no load issues.
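The mechanics of that scheduled scaling might look something like the sketch below. The spinUp/tearDown calls are hypothetical placeholders rather than the FF scaling API; the point is only the shape of the idea: spin up shortly before the daily spike, tear down after the window.

    // ScheduledScaler.java - a sketch of time-based scaling. The spinUp/tearDown
    // calls are hypothetical placeholders, not the FF scaling API.
    import java.time.Duration;
    import java.time.ZoneId;
    import java.time.ZonedDateTime;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class ScheduledScaler {
      private final ScheduledExecutorService scheduler =
          Executors.newScheduledThreadPool(1);

      // Spin up extra FFVSs shortly before the daily spike and tear them down
      // ~3 hours later. Fixed-rate scheduling ignores DST shifts; a production
      // scheduler would recompute the next start time each day.
      public void scheduleDailyWindow(String app, int extraFfvs) {
        ZoneId pacific = ZoneId.of("America/Los_Angeles");
        ZonedDateTime now = ZonedDateTime.now(pacific);
        ZonedDateTime start =
            now.withHour(12).withMinute(45).withSecond(0).withNano(0);
        if (start.isBefore(now)) start = start.plusDays(1);
        long initialDelayMin = Duration.between(now, start).toMinutes();

        scheduler.scheduleAtFixedRate(() -> {
          spinUp(app, extraFfvs);
          scheduler.schedule(() -> tearDown(app, extraFfvs), 180, TimeUnit.MINUTES);
        }, initialDelayMin, TimeUnit.DAYS.toMinutes(1), TimeUnit.MINUTES);
      }

      private void spinUp(String app, int n)   { /* placeholder: allocate n FFVSs */ }
      private void tearDown(String app, int n) { /* placeholder: release n FFVSs */ }
    }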

I apologize in advance for the diagram below; I am still ramping up on the nuances of RRDTool (which I do like). The Y-axis is the number of CPU seconds and request counts. The X-axis is the hours on 7/31/2013, in UTC (8 hours ahead of PST). If I were to provide a diagram for every day, it would be a carbon copy of what you see below. You’ll observe that at about 19:00 (1pm PST) things really start to ramp up. The traffic is effectively a combination of two request types, one of which is extremely CPU intensive. That specific request type ultimately drives the CPU seconds above the request count, which can be attributed directly to the request-type distribution. You’ll notice that in most hours the CPU seconds consumed are well below the request count; that is because very few users are actually invoking the culpable request type.

The spikes in CPU seconds at 7:00 and 9:00 are not normal, and as of this writing I am still running MapReduce jobs to try to understand that data. The bad news is that I am unsure what they represent; the good news is that I would not even be aware of them if we did not have application compute usage metrics.

So at the end of the day we were able to minimize Megadevs’ cost by formulating a UP that represents the load for twenty-one (21) hours a day, plus the addition of one (1) FFVS to accommodate the load between 1pm PST and 3pm PST.


Conclusion

At FF we knew application compute usage metrics would be necessary for bookkeeping activities such as billing, and intuitively we believed these analytics would be critical to scaling and managing applications. Hubi and a couple of other applications have validated those assumptions. Now the challenge is to deliver the information and tools to developers so that they have a highly granular viewport into their applications and can scale and manage them in the most informed manner possible.
