Is your SQL database creaking? ...Time to architect for growth

When is SQL not enough?

SQL was developed in the 1970s to provide a way of querying structured data. Data was organised logically, with defined relationships between the different bits of data. For example, a person has a name and an address. Some of those bits of data have very specific relationships, such as a person having a single, specific date of birth. These relational databases have done amazingly well at allowing us to model massively complex data, and the big database platforms (Oracle, Microsoft SQL Server and MySQL) have done a great job at letting us manipulate enormous data sets of tens of millions of items. How big they can go depends largely on the complexity of the application and queries. Scaling up a system by hosting it on bigger, faster hardware works well up to a point, but it leads to higher costs, and for larger applications, if the biggest server you can find just isn’t fast enough, you have nowhere to go. Fundamentally, the limitation is that none of those systems is great at distributing workload horizontally across multiple servers. When Larry Page and Sergey Brin wrote their paper on building a large-scale search engine, they suggested that indexing more than around 100 million web pages would be much more complicated. There are now approaching 5 billion, and the number is still growing!

 

Clues that you are approaching the limits of a traditional SQL database design

When looking at a specific application, there are often some telltale signs that your application might need some re-architecture. Here are a few of them:

  • The database occasionally slows to a snail’s pace, perhaps because of a “large query” or a “batch job”.
  • Your database server is already running on large, expensive hardware.
  • You are reaching the economic limits of how much RAM you can install in a single database server (at the time of writing, 512GB or more qualifies).
  • You spend a LOT of time optimising SQL queries.
  • Traditional performance enhancements such as looking at efficient indexing have already been done. Probably multiple times.
  • You are thinking of employing, or already have, an SQL performance specialist.
  • Developers are looking for ways to split up the database load onto multiple servers.
  • Your finance director suggests you build a separate data warehouse server so that they get swift access to routine metrics.

If some or all of these are happening for you, then it’s time to look at re-architecting your systems.

 

What are the solutions?

There are many ways of tackling scaling, and the most appropriate solution will depend on your specific environment.  The following steps can help guide you towards a solution.

  1. Analyse current capability
    What is the current system capable of? How much data can it hold? What volume of transactions and queries can it deal with? Under what circumstances does it slow down?
  2. Understand the growth targets
    You have to agree some realistic targets. Nobody wants an open-ended design goal, as that tends to lead to open-ended budgets, which don’t go down well with your finance team.
  3. Optimise SQL queries and indexing
    The fastest way to improve the performance of almost any database application is first to optimise the database indexing (which your DBA should be keeping a continuous eye on) and then to dig into your most frequently used and most resource-hungry queries and optimise them. All the major platforms can show you the execution plan for a query, i.e. how it will actually be run against the database, and it is often possible to rewrite the query to directly reduce the load on the servers (see the first sketch after this list).
  4. Optimise hardware
    Another speedy way to increase capacity is to upgrade hardware. Typical upgrades include adding RAM, using faster hard disks, replacing traditional spinning disks with SSDs, using more disks, increasing the speed or number of network ports to the server, and increasing the speed or number of processors.
  5. Split the workload to multiple databases
    Re-architecting the system to split the workload from one server across multiple servers is typically the first step that can truly increase the scale potential of the system. Whether the system has a natural logical split or the database can be sharded into multiple parts (see the second sketch after this list), this approach has traps: separating out one part of the system might speed that part up, but the additional communication overhead might slow down another part of the system.
  6. Move the problem to software
    Sometimes it is possible simply to rewrite the application so that it does not depend on the traditional SQL database to get its answer. For example, rather than asking the database for a count of certain data items, maintain that count in the application and keep it continuously in memory, where it is available to whatever processes need to know it. This can give lightning-fast results to specific, known questions (see the third sketch after this list).
  7. Migrate to NoSQL / Big Data platforms
    Systems such as Hadoop, CouchDB, MongoDB and Redis were architected from the ground up to deal with huge quantities of data. They all have different strengths, so the choice can be complex, but the potential for scaling an application on these platforms is huge. These are the types of technologies that underpin Internet-scale applications such as Google and Facebook.
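
To make step 3 concrete, here is a minimal sketch using Python and SQLite. The orders table, its columns and the index name are hypothetical; Oracle, SQL Server and MySQL have their own execution-plan tools that serve the same purpose. The sketch shows how an execution plan exposes a full table scan, and how adding an index on the filtered column changes the plan.

```python
import sqlite3

# Hypothetical schema: a large orders table queried by customer_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                 [(i % 1000, i * 0.1) for i in range(100_000)])

query = "SELECT COUNT(*), SUM(total) FROM orders WHERE customer_id = ?"

# Before: the plan shows a full scan of the orders table.
for row in conn.execute("EXPLAIN QUERY PLAN " + query, (42,)):
    print("before:", row)

# Add an index on the column used in the WHERE clause...
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# ...and the plan now searches the index instead of scanning every row.
for row in conn.execute("EXPLAIN QUERY PLAN " + query, (42,)):
    print("after: ", row)
```

On a production system you would run the equivalent plan command against your real, slowest queries rather than a toy table, and focus on the ones that run most often.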
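
Step 5’s sharding idea can be sketched as a routing function. This is a hypothetical illustration rather than a production design: it simply maps a customer ID to one of several database shards, and the comments point at the communication-overhead trap mentioned above.

```python
import hashlib

# Hypothetical shard map: in practice these would be connection strings
# or connection pools for separate database servers.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(customer_id: int) -> str:
    """Route a customer to a shard using a stable hash of their ID."""
    digest = hashlib.sha256(str(customer_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# All reads and writes for one customer go to the same shard, so
# single-customer queries stay fast and local.
print(shard_for(12345))

# But a query that spans customers (e.g. "total revenue today") now has to
# fan out to every shard and combine the results, which is the extra
# communication overhead described above. Changing len(SHARDS) also moves
# most customers to a different shard, so plan any re-sharding early.
```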
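
Step 6 can be as simple as keeping a running counter in application memory instead of repeatedly asking the database for a COUNT. The OrderCounter class below is a hypothetical, minimal sketch; a real system would also need to seed, persist or rebuild the counter and share it safely across processes.

```python
import threading

class OrderCounter:
    """Keeps the answer to "how many orders so far?" in memory, updated as
    orders are written, instead of running SELECT COUNT(*) on every request."""

    def __init__(self, initial_count: int = 0):
        self._count = initial_count    # seeded once, e.g. from a COUNT(*) at start-up
        self._lock = threading.Lock()  # protect against concurrent in-process updates

    def record_order(self) -> None:
        with self._lock:
            self._count += 1           # called on the same code path that inserts the order

    def current(self) -> int:
        return self._count             # instant answer, no database round trip

counter = OrderCounter(initial_count=0)
counter.record_order()
print(counter.current())               # -> 1
```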

 

Keep it simple

The solution depends on where you need to get to in terms of scale. However, it also depends on where you are now and what resources you have available. For a greenfield application you can architect a slick solution based on the latest technologies, but if you need an existing SQL database to keep running through next month’s customer growth, that might not be feasible. The best solution is the one that achieves your design goal with the least resources and in the least time. There is little point embarking on a total application rewrite if it will take 18 months and your systems are creaking right now.

First look for the simpler, frictionless solutions and implement them: the ones that take days rather than months. Then use the capacity headroom you have created as breathing space to do the more complex work, and don’t build something yourself if you can buy it off the shelf. Both Google and Amazon now sell platform-based big data solutions where you load up your data and start running queries right away. It is easy to get sucked into thinking the answer to a complex problem has to be complex, but new solutions are being developed every day, and they can be simpler and more accessible than you might think.

Disaster Recovery is more than just an insurance policy

A CEO's guide to Disaster Recovery

A few years ago, I got a call at 3:01am. My guess was that lottery winners only get informed during office hours, so I was expecting bad news! Sure enough, our primary data centre site had gone offline due to a major power failure. This had taken down both the primary and backup power feeds, so recovery was going to take some time. That primary site had been set up with a fault-tolerant, high-availability (HA) configuration with dual servers, but in this case the entire data centre had failed, so we were beyond normal HA. It is in this type of scenario that people normally call on their Disaster Recovery (DR) setup.

The company ran critical systems for thousands of customers, so of course we had a plan. We failed over to our second data centre site, which was automatically kept up to date using database synchronisation. All would have been fine, but then two hard disk failures took down another server. Thankfully the second server at that site continued to operate, so we kept all systems online for customers. However, the only thing between us and customer-facing data loss was that final working server, plus the extra backup server I had set up as additional protection for DR. Despite this, it was a stressful night.

This was a well architected setup with multiple levels of fault tolerance, but circumstances had blown through several layers in quick succession:

  1. Primary site server A - taken out due to data centre power failure
  2. Primary site server B - taken out due to the same data centre power failure
  3. Secondary site server A - taken out due to 2 hard disk failures
  4. Secondary site server B - continued to operate
  5. Backup server C - continued to operate

Imagine sitting down with your PR or communications team to draft a statement to customers. Which of these two scenarios would you prefer not to face:

A) a brief service outage;  or

B) a brief outage, but having lost some customer transactions…irrecoverably!

It’s one thing to have a service interruption, but quite another to face the prospect of losing customer data. Whether IT forms part of your product or service, or is a support function to your main business, customers quite rightly need to be able to trust your business. A service interruption might be annoying for customers, but losing their data will certainly hurt your customer retention numbers.

If DR were simply an insurance policy, then the payback would be the time saved in getting systems back up and running; but once you have considered the cost of damage to reputation or brand, you might reassess the risk and business impact of failure. When planning DR expenditure, think this through and make sure you fully understand the potential risks and are comfortable that your level of protection is appropriate. It’s easy to believe you have this covered because you spent the money on a backup server, but make sure it is protecting you against your worst-case scenario.

Is complexity the enemy of reliability?

Every CIO/CTO I have spoken to over the past 12 months wants 100% availability. Why? Because they want to sleep like a baby every night, knowing that their systems will continue to operate even through inevitable system failures. They also want their teams to be able to focus on making functional improvements rather than wasting time firefighting the latest issue. 

Why is it that my Linux server at home, with one power supply, one network interface and one hard disk, has had 100% availability over the past three years? Why is it that many business-critical systems with N+1 fault tolerance struggle to reach even 99.9% uptime?

[Figure: reliability versus complexity]

Of course there are lots of reasons:  demand profile, number of changes, human error, and just plain luck that the single hard drive in that machine has not failed yet.

However, it often appears that complex ‘high availability’ solutions are not necessarily more reliable.

The truth is that there is a difference between our perception of reliability and theoretical or calculated reliability. We perceive the simple, single server as reliable on the basis that it has not yet failed, whereas when we add more hardware and software to provide more complex high-availability solutions, it sometimes looks as though we have made matters worse. That’s because we have added more physical components that could fail and more software that needs to be properly configured and managed.

[Figure: reliability versus time]

The answer lies in looking at the problem with a time dimension. When complex systems are first installed, they often have teething issues: configuration and set-up problems, software and patch levels, new hardware that dies after a short time in use, or operators who do not yet fully understand how to run the new setup. Soon after any new installation there is often a dip in reliability, in some cases making the whole service worse than it was before! Eventually, with hard work and perseverance, the issues get sorted and the system becomes stable. You can shorten the route to stability by employing experts who have done it before or by building vendor-tested reference platforms, but in most cases the bedding-in period will not be zero.

No, complexity is not the enemy of reliability; we just need to make sure we have allowed for the bedding-in period and found a way to run in parallel before pressing ‘go’.

The real question is: Will you be forced to throw a system into live service before it is fully stable?