Monday, September 30, 2013

Fault Tolerance in Elastic Charging Engine

Joke: The pilot screamed - The plane is overloaded and we need four people to jump off the plane to save it. Unfortunately we only have three parachutes so one person has to make a sacrifice. The Doctor said - I save lives and I must be saved and he jumps with one parachute. The Lawyer said - I am the smartest person on Earth and I must be saved  and he jumps with a parachute. Seeing no one else was jumping, the priest said to the kid - Son, you have your entire life ahead of you, you take the last parachute and jump and I will sacrifice my life. Kid said - Don't worry Father, the smartest person on Earth just jumped off with my backpack.

Overloaded systems require drastic measures to be taken and different patterns have to be applied to save the system. What is important to understand is, once overloaded the system operates on the edge of the complete meltdown and patterns that are applied to mitigate and recovery are different from other types of fault recovery.
Elastic Charging Engine was built with fault tolerance in mind. With Oracle Coherence providing the core execution platform there are implicit fault tolerance in design. ECE extended its fault tolerance in its own design:
  • Redundancy
    • By the virtue of Oracle Coherence's design each cache entry in a partitioned scheme are backed up on another node preferably on another host. If the primary cache entry is lost the back up becomes the primary and in process creates a new backup on another host. Coherence implicitly is NODE SAFE. Any change in the state of the primary cache entry is by default applied to its backup synchronously before calling the process success (Pattern (I) - Data Redundancy).
    •  After the provisioning exercise of finding the total number of nodes required for a given data set one additional host is typically added called a (n+1) configuration to make sure the cluster can sustain a machine failure without losing any data.
  • Processing are co-located with Data
    • As when/if nodes die the partitions move to new members and processing scheduled on those lost partitions are moved to new members. (Pattern (II) - Processing Failover).
    • Need to provision the system to minimize the time taken for partition move. Faster the data moves to new owners, faster the processing moves. Minimize the Mean Time To Recovery.
    • Requests destined for the just lost member are put in a suspense queue and are retried when the partitions get its new owner (Pattern (III) - Retry).
  • Checkpoints
    • ECE by default bundles requests together so that a batch of requests for Customers that are managed in partitions owned by a single cluster member can be processed together. This strategy is key to support higher through puts.
    • If a batch is partially processed that some requests have been processed but not all before the node dies the same batch is submitted to a suspense queue to be retried. Problem is we do not want the same request already processed before to be reprocessed. 
    • When a request is submitted ECE checks if the same request has been processed before (Pattern (IV) - Duplicate Detection).
With its second released version ( also nick named Amazon, Elastic Charging Engine introduced a new feature "Overload Protection" extending its commitment to provide a true fault tolerant system.

Overload Protection:
ECE provides a layer of APIs that allows any application to interact with its system. This layer has APIs to construct messages that ECE can understand, data structures to send the request and receive responses in and policies around how when these requests are processed. Internally this layer called Batch Request Service (Or BRS) is an amalgam of the following behaviors:
  • Identifying the Member where the submitted request be sent to for processing.
  • Maintaining a batch of requests, its ripe policies and its responses.
  • Tuning parameters to provision the ECE for ultra low latencies or excessive high throughput or something in the middle.
  • Maintaining a suspense queue for the requests that needs to be retried.
  • Managing a high degree of parallelization of threads.
  • To make sure the entry point to the charging system can not be overloaded either by a natural increase in network traffic or denial of service attacks.
How does it work?
BRS has a pool of Threads dedicated to process events. Most of the time these threads wait for new batches to be sent but as these batches are sent to processing nodes of the cluster the CPU usage on those nodes increase. ECE is primarily a CPU bound application and the most precious element in its design are CPU cores. Though ECE has been benchmarked for extremely high throughput (>1000 requests/sec/core) but sending an extremely high number of requests still risks high CPU usage on processing nodes to spike up resulting in long queue build ups in BRS and eventually resulting in a meltdown.
ECE v11. introduced a new overload protection that guarantees if provisioned right will never accept requests that can not be processed with in a given SLA irrespective of how high the incoming throughput is. This is a key feature of the product when it comes to deploying a fault tolerant system at different levels. And the pattern used comes right out of the Joke in the beginning of this blog.

Pattern (V) - Shed Load:
BRS monitors the internal queue of the Thread Pool Executor and at any given time if the pending count is more than a configured value it rejects the submitted request. The rejected request has to be resubmitted by the network mediation based on certain policies typically suggested by ECE. As the system being in an overload situation is dynamic the pending queue size check is done for each submitted request. It is quite possible that one request has been rejected but immediately after requests are accepted for processing. Shedding Load is a legitimate fault mitigation policy specially more so in latency sensitive applications.
Question is how do we configure the acceptable pending count? It depends on the infrastructure's latency sensitivities. Wait adds to Latencies. If it is required to have zero latency impact on acceptable requests then the acceptable pending count should be set to zero. As the requests are submitted it would go through the provisioning flow immediately after. As the acceptable pending count is increased the average latencies would increase as well and between none and equal to the throughput will dictate what the latency per request would end up in.
Following are two snapshots taken from one of a smaller test setups. One snapshot plots throughput and the other latencies. In the following test the infrastructure is configured for around 40K throughput with acceptable latencies of around 50ms. The snapshot is taken from one of the instances of the event simulators configured at 20K with total of two such simulators running on one test host. Look at how the latency increases as the acceptable count is increased in an "overloaded" system gives an insight how this feature should be configured. At 120K which for this test set up was way more than what it was provisioned for, at zero acceptable count had no impact on the processing latencies at all even if the system was receiving way more number of requests than what the infrastructure is provisioned for. Acceptable pending count @20K increases the latency but still prevents a complete system meltdown. So a single tuning parameter allows you to prevent the system from fault to failure.



No comments: