Azure SQL Synapse: 5 Billion Rows of Lessons Learned

I think I can now officially call it a “success”: I designed and built a SQL Synapse data warehouse (now just called Dedicated SQL Pool) in Azure for one of my clients and it’s working great (**knocks on wood**). It’s a fairly large but mostly dimensionless data set with over 5 billion rows of Point of Sale data. It wasn’t all smooth sailing, but it was also much easier than I expected it to be, and that’s coming from someone who is merely proficient in SQL Server, nowhere near DBA level.

I wanted to share my experiences and highlight the things that I found most valuable during the process. This is not a comprehensive list of the capabilities or best practices for Azure Dedicated SQL Pool; it’s more a list of the things that gave me the best bang for my buck.

1. Dedicated SQL Pool is OLAP based, not OLTP

This is probably common knowledge to anyone who works in database technologies regularly, but for the average developer it can be a point of brief confusion. Traditional databases like SQL Server or MySQL are OLTP databases (OnLine Transaction Processing). They are built for row-oriented operations and many small transactions (INSERT, UPDATE, DELETE). OnLine Analytical Processing (OLAP) databases, like Dedicated SQL Pool, are built for aggregating and summarizing large amounts of historical data, usually with complex queries that don’t run often. In general, it’s better/faster to think of OLAP in terms of columnar operations rather than row operations. If you try to pull back a large amount of data with many columns, you will likely take a big performance hit due to how much data has to be shuffled between nodes. And if you constantly need to find or seek to a specific row in the data, OLAP is probably not the best solution. OLAP is awesome for many reasons, but keep in mind that it means you lose some of the familiar features and advantages that come with traditional SQL Server.
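To make the distinction concrete, here’s a rough sketch (the table and column names are made up for illustration): the first query is the kind of narrow, columnar aggregation an OLAP engine like Dedicated SQL Pool chews through happily, while the second is the kind of single-row, all-columns lookup that really belongs in an OLTP database.

    -- OLAP-friendly: touch two columns, aggregate billions of rows
    SELECT StoreNumber, SUM(SaleAmount) AS TotalSales
    FROM dbo.FactPointOfSale
    GROUP BY StoreNumber;

    -- OLTP-style: hunt down one specific row and pull back every column
    SELECT *
    FROM dbo.FactPointOfSale
    WHERE TransactionId = 123456789;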

2. Dedicated SQL Pool has 60 distributions

Dedicated SQL Pool has one control node, and your data is always spread across 60 distributions no matter which level of DWU you sign up for. The control node is the orchestrator and the compute nodes do all the heavy lifting; the DWU level you are running at determines how many compute nodes you get and what their capabilities are (CPU + memory). This is important to keep in mind as you make certain design decisions about your tables, especially large tables. Probably the most important performance gain you can achieve is to make sure your large tables are Hash-distributed (rather than Replicated or Round Robin) and that you choose a hash column that evenly distributes your data across all 60 distributions. This should be a column in your table that has many unique values, but DO NOT use a date column. In general, you should not use any column that you regularly use in your WHERE clauses. As soon as you narrow your query down to operate only on one value of your distribution column, you have basically eliminated 59 of the 60 distributions from sharing the work! Yikes! On the other hand, it can help performance in certain situations to use a distribution column that is commonly used in JOINs with other tables. In my case, working with Point of Sale data, I used the store number as the distribution column. I rarely use it in my WHERE clauses, but I do join almost every other table in the database on that column.
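For illustration, a hash-distributed fact table might be declared something like this (the table and column names are made up and the real schema is much wider):

    CREATE TABLE dbo.FactPointOfSale
    (
        TransactionId BIGINT         NOT NULL,
        StoreNumber   INT            NOT NULL,  -- many unique values, used in JOINs, rarely in WHERE clauses
        SaleDate      DATE           NOT NULL,
        ProductId     INT            NOT NULL,
        SaleAmount    DECIMAL(18, 2) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(StoreNumber),       -- spreads rows across all 60 distributions
        CLUSTERED COLUMNSTORE INDEX             -- the default index type in Dedicated SQL Pool
    );

Once the table is loaded, you can check how evenly the rows actually landed across the distributions with DBCC PDW_SHOWSPACEUSED.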

3. Partitions Will Help You (or Hurt You)

Partitioning your largest tables (greater than 60 million rows, at a minimum) will speed up both your queries and your data load times. Partitioning a table is analogous to breaking the table down into smaller tables (with much of the complexity abstracted away from you). The advantage it gives to queries is partition elimination. Say you have a table with 60 million rows split into two equal partitions; if you use a WHERE clause that touches only one of the partitions, you have effectively eliminated the need to scan 30 million of your rows! So unlike distribution columns, the columns you choose to partition on should frequently be used in your WHERE clauses.

I ran into trouble with partitioning in my first draft of the design; I over-partitioned my largest table. I thought I was clever by partitioning on a “DayOfYear” column. That gave me 365 (well, 366) partitions to work with. However, the table only had 4.5 billion rows, which sounds like a lot, but I failed to take into consideration that each partition is also spread across the 60 distributions. That meant each partition/distribution combination only had about 200K rows. To take advantage of clustered columnstore indexes (the default in Dedicated SQL Pool) you should have at least 1M rows per partition/distribution combination, otherwise you might as well not be using clustered columnstore at all. I had to create a new table partitioned by month of the year instead (giving me 12 partitions) and painfully transfer all of the data over to it (a rough sketch of that rebuild is below). Fortunately, data operations are super fast in Dedicated SQL Pool, especially when you can take advantage of this next tip….
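For illustration only (the table and column names are hypothetical and the schema is simplified), the rebuild boiled down to a CTAS into a monthly-partitioned copy, and the payoff is that a query filtering on the partition column can skip the other eleven partitions entirely:

    -- Rebuild the fact table with 12 monthly partitions instead of 366 daily ones
    -- (assumes the original table has a SaleDate column to derive the month from)
    CREATE TABLE dbo.FactPointOfSale_ByMonth
    WITH
    (
        DISTRIBUTION = HASH(StoreNumber),
        CLUSTERED COLUMNSTORE INDEX,
        PARTITION (MonthOfYear RANGE RIGHT FOR VALUES (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12))
    )
    AS
    SELECT f.*, MONTH(f.SaleDate) AS MonthOfYear
    FROM dbo.FactPointOfSale AS f;

    -- Partition elimination: only the June partition gets scanned
    SELECT SUM(SaleAmount)
    FROM dbo.FactPointOfSale_ByMonth
    WHERE MonthOfYear = 6;

Once the copy is verified you can drop the old table and use RENAME OBJECT to swap the new one into its place.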

4. COPY INTO (is your friend)

It took me a little while to get comfortable with how the COPY INTO command worked, and I still think it’s black magic, but it lets me import new data to the tune of millions of rows in seconds instead of minutes or hours. I can’t explain how it all works in my own words, honestly, but I can tell you how I used it. Since I was working with lots of Point of Sale data, I aggregated the data every hour into bite-sized files in Azure Blob Storage (one file per store per hour). Then a nightly process took all of the small files and created one large CSV file per customer (typically 100MB - 200MB). From there I use the COPY INTO command directly from my Dedicated SQL Pool instance to load the data straight from Blob Storage into a staging table, and a nightly SPROC transforms the staging data as it inserts it into the final production table. COPY INTO also works with Azure Data Lake Storage. It’s great with giant gobs of data but not so great if you are just trying to incrementally load small chunks of data, so keep that in mind.
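My real command has more columns and options, but a minimal COPY INTO from Blob Storage (with a made-up storage account, container, and staging table, and assuming the pool’s managed identity has access to the storage account) looks roughly like this:

    -- Bulk load a nightly CSV extract from Blob Storage into a staging table
    COPY INTO dbo.StagePointOfSale
    FROM 'https://mystorageaccount.blob.core.windows.net/pos-exports/customer-001/2021-06-01.csv'
    WITH
    (
        FILE_TYPE = 'CSV',
        CREDENTIAL = (IDENTITY = 'Managed Identity'),  -- a SAS token or storage key works here too
        FIELDTERMINATOR = ',',
        ROWTERMINATOR = '0x0A',
        FIRSTROW = 2                                   -- skip the header row
    );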

5. Replicate Small Tables

Perhaps the biggest unexpected performance boost I found while building my Dedicated SQL Pool came when I moved every table under 2GB in size from Hash or Round Robin distribution to Replicated. Replicated tables are “copied” in full to each compute node in the SQL Pool. That’s obviously not a great idea for large tables or tables that are frequently updated, but replicating all (or most) of your dimension tables can pay off with a pretty big speed improvement. I was shocked (as was my client) when some of my queries went from taking 10 seconds to less than 1 second after I moved all of my dimension tables to Replicated distribution (they were previously a mix of Round Robin and Hash).
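Declaring a replicated dimension table is just a distribution option (the names below are illustrative); note that you can’t change the distribution of an existing table in place, so converting one means rebuilding it with CTAS or a create-and-insert.

    -- Small dimension table cached in full on every compute node
    CREATE TABLE dbo.DimStore
    (
        StoreNumber INT           NOT NULL,
        StoreName   NVARCHAR(100) NOT NULL,
        Region      NVARCHAR(50)  NOT NULL
    )
    WITH
    (
        DISTRIBUTION = REPLICATE,
        CLUSTERED INDEX (StoreNumber)  -- small tables often do better as rowstore than columnstore
    );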

6. Update Your Statistics

If you are going to be dumping many rows into your Dedicated SQL Pool on a regular basis (and if you’re not, why use Dedicated SQL Pool?), it is really important to give the query optimizer up-to-date information about your tables. Otherwise the wrong query plan can be chosen and your queries will lose performance over time. It gets even worse if your data is skewed, but you can avoid that by following my advice from #2 above. Anyway, Dedicated SQL Pool will automatically create statistics (as long as you haven’t disabled it) but it will not update statistics on your tables. I have a nightly SPROC that updates all statistics on my main tables. It could probably be improved to update only certain columns or partitions, but the entire process only takes a couple of minutes right now so it’s not worth optimizing. The takeaway is that I saw a noticeable difference in query speeds once I started updating the table statistics every single day.
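My SPROC essentially boils down to statements like these (the table and statistics names are illustrative):

    -- Refresh every statistics object on the big fact table after the nightly load
    UPDATE STATISTICS dbo.FactPointOfSale;

    -- If a full refresh ever gets too slow, you can create and refresh
    -- statistics on just the columns your queries lean on most
    CREATE STATISTICS stats_StoreNumber ON dbo.FactPointOfSale (StoreNumber);
    UPDATE STATISTICS dbo.FactPointOfSale (stats_StoreNumber);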

7. Scale Down During Off Hours

I’m not going to lie, Dedicated SQL Pool is too expensive for most projects, so it’s important to keep your costs down. One way to do that is to scale the instance down to the lowest acceptable DWU level during off hours. I scale my instance down to DW100c from 9PM to 5AM, then scale it back up to DW300c at 5AM so that it’s ready to import yesterday’s 20M rows of data! I do this by running a simple ALTER DATABASE command against the master database twice a day. It’s triggered by a timer on a Function App, and the C# code is simply:

    using System;
    using System.Data.SqlClient;
    using Microsoft.Extensions.Logging;

    public static class ScaleDwuFunctions // illustrative wrapper; in my Function App this sits next to the timer-triggered entry points
    {
        private static void ScaleDWU(bool scaleUp, ILogger log)
        {
            // Connection string must point at the master database, not the pool itself
            var str = Environment.GetEnvironmentVariable("sqldwConnectionStringMaster");
            try
            {
                // Scale up to DW300c for the morning load, back down to DW100c overnight
                var query = $"ALTER DATABASE DigiopSqlMain MODIFY (SERVICE_OBJECTIVE = '{(scaleUp ? "DW300c" : "DW100c")}')";
                using (SqlConnection conn = new SqlConnection(str))
                {
                    using (SqlCommand cmd = new SqlCommand(query, conn))
                    {
                        cmd.CommandTimeout = 100;

                        conn.Open();
                        cmd.ExecuteNonQuery();
                    }
                }
            }
            catch (SqlException sqlEx)
            {
                log.LogError($"ScaleDWU failed with sql exception: {sqlEx}");

                // If it's a SQL exception, rethrow so the function retries the message; it could be a transient error
                throw;
            }
            catch (Exception ex)
            {
                log.LogError($"ScaleDWU failed: {ex}");
            }

            log.LogInformation("ScaleDWU finished");
        }
    }

I think that wraps it up for now. Like I said, this is not a comprehensive tutorial on how to create a Dedicated SQL Pool instance but these are the most important lessons I learned along the way on my own journey in Azure.