18 Things You Can Do to Remove a Bottleneck Caused by High Traffic (Part One)

unexpected high traffic event
This is a three-part blog series. Be on the lookout for parts two and three in the coming days.

unexpected high traffic eventThere was no reason to plan for it, but the load on your system increased 100%, 300%, 500%, and your MySQL database has to support it. This is a reality many online systems have to deal with these days. This series focuses on dealing with the unexpected high traffic event as it is happening.

There are also a lot of things you can do proactively, which we covered in “Prepare Your Databases for High Traffic on Black Friday.”

First, let’s see what impact a traffic spike can cause on the database — what problems will your application engineering team likely see?

  • Poor query response time
  • Elevated error rate (connecting to the database and executing queries)
  • The database is down (unavailable)
  • Incorrect (outdated) data due to replication lag or batch jobs unable to complete

The immediate goal of handling the traffic spike is to eliminate these issues as quickly as possible, which for most teams means focusing on “low hanging fruit” — solutions that can be deployed within hours, or days, and which do not require a massive application or infrastructure changes.

The good news is that for the majority of applications, you can gain several times database capacity by taking some simple actions:

1. Scale your Cloud Instance Size 

Complexity: Low
Potential Impact: High

If you’re running in the cloud (or some virtualized environment), going to a larger instance size (also known as “tuning by credit card”) is often the easiest thing you can do. It is one of the most expensive things you can do, but it is a short term action you can take while you proceed to implement other performance optimization actions.

Note: Databases tend NOT to scale linearly so do not fall into a false sense of security — if your cloud vendor has 10x larger instances available, do not expect it will handle 10x traffic. It may be a lot less depending on workload.

2. Deploy More MySQL Slaves/Replicas

Complexity: Medium
Potential Impact: High

If your workload is read-intensive, deploying more replicas can be a great way to improve performance. Do not know what kind of workload you have? Reviewing “Is it Read Intensive or Write Intensive Workload” should help you figure that out.

Deploying replicas is not enough though; you need to make sure your application is capable of routing traffic to them. Some applications may find it easy to enable this functionality on the application level. For others, deploying ProxySQL and using its read-write splitting functionality may be a better choice.

In many cases, you can even move complete applications to use replicas: reporting applications or applications utilizing MySQL Full-Text Search are often great candidates.

Keep in mind, MySQL replication is asynchronous which means replicas will have date propagated with delay (sometimes significant) therefore, only route queries to replicas which can tolerate not up to date data and make sure to monitor for replication lag and health.

3. Deploy ProxySQL for Connection Management and Caching

Complexity: Medium
Potential Impact: High

ProxySQL is a great tool to help manage MySQL traffic, especially during traffic spikes. One way ProxySQL can help is through connection pooling so the application does not run out of connections and does not overload MySQL by having too many concurrent connections.

If you’re running Percona Server for MySQL or MariaDB, you can also enable ThreadPool which can allow a MySQL instance to handle 100k+ connections directly.

Another feature of ProxySQL which can be even more helpful during traffic spikes is ProxySQL Query Cache, which allows you to cache results of the queries for a period of time.

As you find queries that do not absolutely need to serve completely up to date results, route them to MySQL replicas you can cache the same queries for added benefit.

4. Disable Heavy Load Applications Features 

Complexity: Medium
Potential Impact: Medium 

Management and development teams will often hate such ideas, but it is a great tool to have. Not all application features provide the same value or are used with the same frequency, yet it is advanced, rarely applied features which often can be the most costly, as not a lot of time has been spent optimizing them. Disabling them, at least temporarily, while you go through traffic spikes or finding a time to optimize them, is often a great thing to do.

It does not have to be user-facing features — think if there are any internal-facing reports which you can live without?

5. Check for Resource Bottlenecks 

Complexity: Low
Potential Impact: High 

Hardware-level databases are likely to be bottlenecked by one (or more) primary resources — CPU, memory, disk, or network. If you’re running Percona Monitoring and Management (PMM), you can see these in the Node Summary Section in your MySQL Instance Summary Dashboard.

If a particular resource is saturated, you can typically get better performance by increasing that resource, though focusing on reducing that resource usage is another thing to consider. For example, a high CPU usage problem is often better solved by optimizing your queries rather than by getting a faster CPU.

6. Get More Cores or Faster Cores

Complexity: Low
Potential Impact: Medium

An important thing to know about MySQL is, it can only use a single CPU core to do most of the work of running a single query, which means getting more CPU cores often does not make your slow queries or batch jobs running many queries sequentially run faster. If that is your problem, you need to focus on getting faster CPU cores or you may need to focus on getting more cores.

But how do you know what kind of workload you’re running?

Take a look at your CPU Usage, CPU Saturation, and Max Core Usage in the Node Summary in Percona Monitoring and Management (or comparable graph in your favorite monitoring system).

If CPU Usage is high (excluding IOwait), and if Normalized CPU Load is 2 or more, your system would benefit from having more CPU cores available to run a workload.

If, however, Max CPU Core Utilization is pegged closer to 100% and your CPU usage is not high, you should focus on faster cores.

For example, if you’re running on AWS, Cloud C5 instance type offers higher CPU performance compared to the General Purpose M5 Instance type.

Another thing to watch out for when it comes to CPU, especially in cloud and virtualized environments, is “CPU Stealing” — it may leave your MySQL instance with a lot fewer CPU resources than CPU frequency and core number can indicate. Read “Choose Your EC2 Instance Type Wisely on AWS” for more detail.

This is a three-part blog series. Parts two and three, with numbers 7-18,  coming soon.

by Peter Zaitsev via Percona Database Performance Blog