If you’ve ever found yourself wondering how to solve EDW query contention with Snowflake, we’ve got the answer for you right here. At Crimson Macaw we use this feature a lot, so check out our guide on what we do below.
What is Snowflake?
Snowflake is a market-leading cloud data warehousing solution which offers flexible storage & processing of data. One of the key advantages of the Snowflake architecture is its separation of compute and storage resources, giving it a critical edge over the traditional data warehouse.
Contention Issues in Traditional EDW platforms
In traditional architectures, all database queries compete over the same resources – so a user running a report may find that the performance of the system is less good during the hours when an ETL is running. As a result, some processes are carefully scheduled during more idle periods for the database, for example running the ETL overnight when the volume of user queries is low.
Global user bases and the need for 24/7 reporting have accentuated these resource constraints – with users operating in multiple time zones, it may be impossible to find a window of time to run the ETL without impacting reporting performance. The separation of storage and compute resources within the Snowflake architecture offers a comprehensive solution to this problem.
How does Snowflake work?
Snowflake operates a multi-cluster, shared data architecture whose service, compute and storage layers are physically separated but logically integrated. Individual workloads can be allocated to distinct compute resources that can scale independently of one another, removing any chance of contention over resources.
The diagram below shows the basic Snowflake architecture.
The Cloud Services layer handles key global processes such as authentication, access control, security, metadata management and query optimization.
The processing layer consists of multiple virtual warehouses, which are allocated into clusters. These clusters are physically separate from both each other and the data located in the Database Storage layer. As such, the activity of one virtual warehouse has no impact on the performance of any others.
A virtual warehouse is a discrete allocation of compute that is used for reading and writing data from/into the shared Database storage layer. Except for caching purposes, data is not persisted inside the virtual warehouse.
Each virtual warehouse is provided with full transactional history of the database(s) by the Cloud Services layer, such that queries will always read the latest consistent version of the data based on the most recent commit. This consistency is maintained across all virtual warehouses without locks or contentions.
Why does this matter for contention?
Virtual warehouses can be configured specifically to handle different processes such as ETL and analytic queries, meaning data loads and user activity can operate harmoniously.
For the provisioning of ETL and analytic-specific virtual warehouses, consider the following:
- A high analytic query load would be best served by provisioning additional warehouses within a multi-cluster warehouse to offer a wider pool of compute resources to service the large volume of requests
- A smaller number of intensive queries from an ETL would be best served by selecting a larger virtual warehouse size to offer more compute resources for a smaller volume of complex requests
In both cases, the AUTO_SUSPEND property can be configured to suspend the virtual warehouses when they are not being used. This helps to preserve credits and avoid paying for processing resources when they are not needed.
By separating the ELT and analytic resources, we can also avoid paying for the coverage of a larger virtual warehouse when the ETL is not running and resources are only required for user queries.
With multi-cluster warehouses, the minimum / maximum number of warehouses can be configured (via the MIN_CLUSTER_COUNT and MAX_CLUSTER_COUNT properties) such that a cluster will always revert to a single warehouse and only invoke additional resources when needed, and not provision additional warehouses beyond a certain number – again useful for pricing.
We can take this further to support other scenarios
Imagine, within an organisation, a scenario where the analysts working in the Customer Service department run regular reports to support operational decision making. Then come period end, the Risk or Finance department need to run very intensive and complex queries which leads to degradation of performance for customer service analysts, reducing the ability to perform in their role effectively.
This can be solved by provisioning separate compute resource tailored to the needs of those departments. For example, Finance could access a separate larger virtual warehouse at period end to run their complex queries in a timely fashion. The Customer Service department can use smaller auto-scaling virtual warehouses to support their continuous less complex queries daily.
In this scenario, Snowflake allows very specific tailoring of resources for different use cases, allowing billing to be allocated accordingly between different departments – in this example, between Finance and Customer Service. Any heavy queries run by Finance will not impact the ability of the Customer Services teams to access the data they need. When considering the costing of resources, users & business units need not be constrained by a central IT budget which usually results in a compromise for all parties. Instead, users can access resources purpose-built for their use case to maximise productivity and provide transparency on cost.
The Snowflake architecture offers vastly increased flexibility in the provisioning of compute resources compared to traditional approaches. The separation of compute from storage is a key selling point of the solution which allows the user to strike a sensible and flexible balance between resources and cost, whilst maintaining a great end user analytic experience.
Want to find out more? Need support on your own data journey? Get in touch here.