Reasons why our customers tend to chose Amazon Redshift as their data warehousing platform?
- Quick Time to Market.
- Faster executing of Analytical and Business Intelligence queries.
- Removes the need of Dedicated admin team.
- Scalability.
- Column based and Massively Parallel Support.
- AWS Support for Redshift.
Although the reasons to choose Redshift may be apparent, the true benefits are reaped when the right architecture and best practices are applied. Below are key architecture criteria that would be considered as the pillars of a good implementation.
Pillar 1. Temporary Tables as Staging:Â Too many parallel writes into a table would result in write lock on the table. Temporary tables in Redshift can be used to load the data first, then copy the data from temporary table to main table. This approach helps in reduction of contention to main table and hence fewer locks.
Pillar 2.Column Compression Encoding:Â Column compression Encoding helps in performance. Amazon Redshift Column Encoding Utility helps in identifying the right encoding for the columns. Compression of data makes it faster and cheaper because compression & decompression consumes only CPU. Any data warehouse application is I/O intensive than CPU intensive hence it pretty much works on any large size table.
Advantages
- Reduce I/O.
- 100% improvement in performance when compared to uncompressed columns.
- Scale down of the cluster size without compromising on the performance.
Pillar 3.Natural keys as Distribution Keys:Â Distribution keys should be selected wisely for optimal query performance in Redshift. As a best practice have the natural key and distribution same. It helps in even distribution of data in linear Scale
Pillar 4.Concurrency:
WLM – Query Concurrency
In Amazon Redshift, the golden rule is to run multiple queries in a sequence rather than running large chunk of queries in parallel. Amazon Redshift system is not designed for query execution in parallel but to execute queries in sequence, since queries run much faster if there aren’t too many queries running in parallel at that time. Amazon Redshift Utils available in Github is useful utilities for identifying the concurrency issues e.g. WLM Queue slot etc. Query concurrency parameter in redshift is by default 5 (query_concurrecy = 5) it can be increased to improve performance. One should execute caution while increasing the number since higher query_concurrency number would degrade the performance, optimal query_concurrency number for the application can be identified by testing with multiple values.
Concurrent Writes
It is important to remember that concurrent writes into the same table would result in lot of rollbacks, increased table access times. It is not a best practice to do lot of writes in parallel. Rather do writes in sequence, it would yield better performance vs writes in parallel.
Pillar 5.Scheduled vs On Demand Vacuum: Standard scheduled vacuum of 10% in 24 hours is not a best solution always. If the throughput of the data in system is not steady, i.e. if there is a burst of load, 30% of the volume of the table in an hour and if you perform Update/Insert in table will result in lot of free space (the way Amazon Redshift handles update is, it does Delete and Insert the data. Hence the amount of free space in table increases quickly). To overcome this issue SVV_TABLEINFO_VIEW can be monitored periodically for percentage of unsorted regions in the table. If it crosses a threshold vacuum operation on the table can be performed.