Friday, August 15, 2008

Managing Coherence as an IT Organization

Scenario: A development manager creates tasks and gives them to a technical lead who, after assessing them, hands them to the developers on his team to complete. The developers run some logic and algorithms on a piece of hardware and return the results to their lead, who in turn passes them to the manager.
This is a typical workflow in any software organization. The terms and roles may differ, but the flow remains more or less the same. Oracle Coherence supports an architecture strikingly similar to this flow. A *Extend application (Manager) submits tasks to proxies (Leads), which pass agents such as Invocables or EntryProcessors (Tasks) to storage nodes (Developers), which run the logic (Implementation) on a piece of hardware. Now, if an overall Director/CTO finds the system slower than expected, how would you find the bottleneck? Let's see.


  • Is the manager creating enough tasks? Is your Extend application really pushing Tasks or Objects at a rate the system is designed for?

  • Are you overloading your technical leads with far more tasks than they can handle? If there is too much work and only one lead to manage it, that lead will obviously be overloaded. What's the resolution? Depending on the workload, you create more lead roles. In the *Extend scenario you have to be careful and meticulous in working out how many Extend proxies to run for the load. This number varies from system to system, but as a starting point I would recommend a proxy count of at least 30% of the total number of storage nodes. If the grid has 10 storage nodes, configure at least three storage-disabled proxies.

  • Do you know if the logic is well written? In synchronous flows it is very important that tasks are short-lived. As a rule of thumb, long-running processes must always be asynchronous. Make sure the EntryProcessors or Invocables that run on the nodes are well written, fast, and generate minimal network traffic. Short-lived processes provide maximum throughput.

  • Have you hired the top developers? Make sure the JVMs are not frequently in GC. Look for stable nodes. Just as developers leaving the group hurts overall delivery, nodes that keep dropping out hurt the grid's response. A stable team is very important.

  • Piece of hardware - Yes. Give them the best you can afford to work on. Make sure the CPU has enough power. With agents running on the nodes, it is important that it is multi-core. Processors consume CPU time, so the more cores the better.

  • Most important - the communication channel. How fast and clear is the communication? This translates to available network bandwidth. Always make sure you have at least a 1 Gb network with the servers connected to the same switch. More switches lower the bandwidth. It is the same in any organization: the more administrative hops there are, the slower the organization becomes.
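
The proxy sizing rule of thumb from the list above can be worked out as a small calculation. This is only a sketch of the arithmetic: the class and method names (`ProxySizing`, `minProxies`) are hypothetical and are not part of Coherence, and rounding up is my assumption so that small grids still get at least one proxy.

```java
public final class ProxySizing {
    private ProxySizing() {}

    /**
     * Rule of thumb from this post: run at least 30% as many
     * storage-disabled proxies as there are storage nodes.
     * Rounding up is an assumption made for this sketch.
     */
    public static int minProxies(int storageNodes) {
        if (storageNodes < 1) {
            throw new IllegalArgumentException("storageNodes must be >= 1");
        }
        // Integer form of ceil(0.3 * n), avoiding floating-point rounding.
        return (storageNodes * 3 + 9) / 10;
    }

    public static void main(String[] args) {
        for (int n : new int[] {4, 10, 25}) {
            System.out.println(n + " storage nodes -> at least "
                    + minProxies(n) + " proxies");
        }
    }
}
```

Treat the result as a starting point for load testing, not a fixed answer, since the right number varies from system to system.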

So whether it is organizational efficiency or a Coherence data grid, check and test where the bottleneck is before making any decision.
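
As one example of measuring rather than guessing, the "is the JVM in GC too often?" question can be checked roughly from inside a node with the standard `java.lang.management` beans. This is a minimal sketch and not a Coherence API: the class name `GcOverhead` is hypothetical, and what counts as "too much" GC is something you would have to decide for your own system.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public final class GcOverhead {
    private GcOverhead() {}

    /** Sum of the approximate accumulated collection time, in milliseconds. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc
                : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Reported GC time divided by JVM uptime since start. */
    public static double gcFraction() {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptime <= 0 ? 0.0 : (double) totalGcMillis() / uptime;
    }

    public static void main(String[] args) {
        System.out.printf("GC time: %d ms, fraction of uptime: %.4f%n",
                totalGcMillis(), gcFraction());
    }
}
```

Sampling this periodically on each storage node and comparing the values across nodes is one way to spot a node that spends noticeably more time collecting than its peers.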
