July 14, 2014

Building Scalable Software Systems

With this article I want to shed more light on a vital aspect of any computer system: scalability. Why is scalability important? The answer is simple: it gives the business that is built on or supported by the system the freedom to grow. An unscalable system is like a tree with very weak roots: as the load on it grows, it will eventually fall.

Before diving further into the topic let’s define the term “scalability” for a computing information system.

I personally like this definition: scalability refers to a system’s ability to handle proportionally more load as more resources are added. Scalability of a system’s “information-exchange” infrastructure thus refers to the ability to take advantage of underlying hardware and networking resources, as well as the ability to support larger systems as more physical resources are added.

Here I need to mention that there are two types of scalability: horizontal and vertical. Vertical scalability means the ability to increase the capacity of an existing computing unit’s hardware. This approach is limited and quickly becomes unacceptably expensive.

Horizontal scalability, by contrast, refers to a system’s ability to engage additional hardware computing units interconnected by a network.

But here is the catch: systems built with classic object-oriented methodologies and software design approaches, which work superbly for local processing, begin to break down in distributed or decentralized environments.

Why? Because a distributed computing environment brings a whole new class of challenges to the scene.

Distributed systems must deal with partial failures, arising from failure of independent components and/or communication links (in general the failure of a component is indistinguishable from the failure of its connecting communication links). In such systems, there is no single point of resource allocation, resource consumption, synchronization, or failure recovery. Unlike local processes, a distributed system may simply not be in a consistent state after a failure. In the “fallacies of distributed computing” [Van Den Hoogen 2004], summarized below, the author captures the key assumptions that break down (but are nonetheless still often made by architects) when building distributed systems.

  • The network is reliable.
  • Latency is zero.
  • Bandwidth is infinite.
  • The network is secure.
  • Topology doesn’t change.
  • There is one administrator.
  • Transport cost is zero.
  • The network is homogeneous (it’s doubtful that anyone today could believe this)

I prefer to treat this list not as a set of fallacies but as challenges a software architect has to meet to create a horizontally-scalable system. As an architect who has had a chance to work with large-scale systems, I can attest that if one attacks those challenges directly and adds code that resolves the issues one by one, the result is a heap of wiring code which has nothing to do with the business idea. And that code can easily become more complex than the system itself! Implementing communication transactions, zipping/encoding/decoding data, tracking state machines, supporting asynchronous communication, handling network failures, creating and maintaining environment configuration and update scripts, and so on… all this stuff evokes despondency when it comes to maintainability.

So – is there any good solution to make a system easily scalable?

Luckily, yes. In three words: data-oriented programming.

The main idea of data-oriented programming is exposing the data structure as the universal API between system parts and then defining the roles of those parts as “data producer” and “data consumer”. Now, in order to make such a system scalable we just need to decouple data producers from data consumers in location, space, platform, and multiplicity. Here the trusty old “publish/subscribe” pattern comes in handy.

Here’s how it generally works: a data producer declares the intent to produce data of a certain type (let’s call it Topic-X) by creating a data writer for it; a data consumer registers interest in a topic by creating a data reader for it. The data bus in the middle manages these declarations and automatically routes messages from the publisher to all subscribers interested in Topic-X.
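The flow just described can be sketched in a few lines of Python. This is only a toy, in-process stand-in for a real data bus: the `DataBus` class, its `create_reader`/`create_writer` methods, and the sample fields are names made up for this illustration, not the API of any actual middleware.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List


class DataBus:
    """Toy in-process data bus (hypothetical): routes each sample
    to every reader registered for the sample's topic."""

    def __init__(self) -> None:
        self._readers: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def create_reader(self, topic: str, on_data: Callable[[Any], None]) -> None:
        # A consumer registers interest in a topic.
        self._readers[topic].append(on_data)

    def create_writer(self, topic: str) -> Callable[[Any], None]:
        # A producer declares its intent to publish on a topic and gets back
        # a write function; it never learns who (if anyone) is listening.
        def write(sample: Any) -> None:
            for on_data in list(self._readers[topic]):
                on_data(sample)

        return write


bus = DataBus()
received_a: list = []
received_b: list = []

# Two independent consumers subscribe to the same topic.
bus.create_reader("Topic-X", received_a.append)
bus.create_reader("Topic-X", received_b.append)

# One producer publishes a sample; the bus fans it out to both readers.
write_x = bus.create_writer("Topic-X")
write_x({"order_id": 1, "amount": 9.5})
```

Note that the producer and the consumers share nothing except the topic name and the shape of the data, which is exactly the decoupling the pattern is after.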

It’s time to draw a picture to illustrate how the classic client-server architecture would look had it been designed as a data-centric system.

As you can see all system components are isolated and have no knowledge of each other. They only know the data structure or “topic” they can consume or produce.

Now imagine that the number of clients that want to consume information from our system has grown so much that the system can no longer handle all the requests in time. Let’s try to scale this system horizontally.

In the figure above you can see that I have increased the number of business logic processor units. This is easily done because the system doesn’t care which computing unit will do the job and doesn’t even need to know that the units actually exist. Each system unit just waits for the data it can consume or publishes the data it has declared. I’ve also decoupled client input from client output, spreading the burden across different servers. Since only the number of clients that want to consume information from our system has increased, we add more servers to handle read requests. And to avoid bottlenecks on the DB access side, I’ve decoupled DB writes from DB reads and allocated more computing power to the “read” side. Of course, in reality things are more complex, but this figure shows the basic principles of system scaling.
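To make the “add more processor units” step concrete, here is a rough sketch that uses threads and in-memory queues in place of separate machines and a real data bus. It assumes the bus hands each request to exactly one of the identical units (a shared subscription); that delivery mode, the queue names, and the squaring “business logic” are all assumptions of this sketch.

```python
import queue
import threading

requests = queue.Queue()   # stands in for a "request" topic
responses = queue.Queue()  # stands in for a "response" topic
STOP = object()            # sentinel telling a unit to shut down


def processor_unit(unit_id: int) -> None:
    # Each unit knows only the topics it reads and writes, not its peers.
    while True:
        item = requests.get()
        if item is STOP:
            break
        responses.put((unit_id, item, item * item))  # placeholder business logic


def run(num_units: int, workload: list) -> list:
    units = [threading.Thread(target=processor_unit, args=(i,)) for i in range(num_units)]
    for unit in units:
        unit.start()
    for item in workload:
        requests.put(item)
    for _ in units:
        requests.put(STOP)  # one sentinel per unit
    for unit in units:
        unit.join()
    results = []
    while not responses.empty():
        results.append(responses.get())
    return results


# Scaling out is just a matter of changing num_units; producers are untouched.
results = run(num_units=4, workload=list(range(10)))
```

Which unit handles which request is not deterministic, and nothing else in the system needs to care.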

There are several more important benefits of the data-oriented approach:
1) It’s easy to make the system more reliable by adding redundant processing power. If one of the business process units fails, nothing critical will happen because other units of the same type continue to handle requests.
2) The system becomes more flexible – new functionality can be added on the fly by adding new data producers/consumers.
3) Maintainability goes to a whole new level since components are very well isolated from one another.
4) It’s easy to work on the system in parallel.

You might say: that’s all good, but what should I do with my existing system?

Fortunately, we can isolate all this data-centric publish/subscribe magic into a middleware layer that will handle all communications, and there is a wide variety of such solutions.

What you need to do is define a system data model (most probably its entities will be very similar to the DB model you already have) and then create data readers/writers for each system component which will publish or consume data to/from the middleware.
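As a small illustration of such a data model, the sketch below defines one entity and the encode/decode pair that a data writer and a data reader would share. The `CustomerOrder` entity, its fields, and the choice of JSON as the wire format are assumptions made for this example; real middleware typically brings its own type definitions and serialization.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CustomerOrder:
    """Hypothetical entity of the system data model, mirroring a DB table."""
    order_id: int
    customer_id: int
    total_cents: int


def encode(order: CustomerOrder) -> bytes:
    # Writer side: turn the entity into the bytes published on the bus.
    return json.dumps(asdict(order), sort_keys=True).encode("utf-8")


def decode(payload: bytes) -> CustomerOrder:
    # Reader side: rebuild the entity from the bytes received from the bus.
    return CustomerOrder(**json.loads(payload.decode("utf-8")))


wire = encode(CustomerOrder(order_id=7, customer_id=42, total_cents=1999))
order = decode(wire)
```

The data class is the only thing the producing and consuming components need to agree on.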

In my opinion, the most prominent and promising messaging solutions that support the publish/subscribe model are:

1) https://kaazing.com/products/kaazing-websocket-gateway/ for web-based solutions

2) http://www.rti.com/products/index.html (or any other DDS implementation) for TCP/IP or in-memory real-time peer-to-peer communication. There are no brokers or servers in the middle; instead, it leverages TCP/IP and IP multicast for true peer-to-peer message transport.

But you are encouraged to conduct your own research.

Practical hint: keep your messages small. Don’t try to push megabytes through your data bus in a single message. The data bus is a vital component, and big messages can turn it into a bottleneck, causing the whole system to struggle. If you need to transfer a significant amount of data from one system component to another, the data producer should store the data elsewhere and publish a link to it, so that the data consumer can fetch it directly.
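One way to apply this hint is sketched below: small payloads travel inside the message, while large ones go to shared storage and only a link is published. The `blob_store` dictionary stands in for whatever shared storage you have (a file server, an object store), and the 1024-byte threshold, the `blob://` link scheme, and the message fields are all assumptions of this sketch.

```python
import hashlib
import json

MAX_INLINE_BYTES = 1024  # assumed threshold for this sketch
blob_store = {}          # stands in for shared storage outside the data bus


def make_message(topic: str, payload: bytes) -> bytes:
    if len(payload) <= MAX_INLINE_BYTES:
        # Small payloads are carried in the message itself.
        body = {"topic": topic, "inline": payload.hex()}
    else:
        # Large payloads are stored elsewhere; the message carries only a link.
        key = hashlib.sha256(payload).hexdigest()
        blob_store[key] = payload
        body = {"topic": topic, "link": f"blob://{key}", "size": len(payload)}
    return json.dumps(body).encode("utf-8")


def read_payload(message: bytes) -> bytes:
    body = json.loads(message)
    if "inline" in body:
        return bytes.fromhex(body["inline"])
    # Follow the link to fetch the data directly from storage.
    return blob_store[body["link"][len("blob://"):]]


big = b"x" * 5_000_000
msg = make_message("report", big)  # a short message, not 5 MB on the bus
```

The consumer calls `read_payload` and gets the same bytes either way, so the rest of the system doesn’t need to know which path was taken.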

Happy data-oriented programming!