Amazon Redshift, the new era in Data warehousing staunched the natal spring of fully managed, cloud based, petabyte scale Data warehouse. A vivacious touch of on-demand and scalability to handle increasing volumes of data stacks it up against the more traditional approach of on-premise data warehousing.lets explore the world of amazon redshift.

Amazon Redshift Architecture

Amazon Redshift primarily consists of one Leader Node and multiple compute nodes. Primary operations of Leader and compute nodes are as given below.

Leader Node

  • Stores Metadata
  • Co-ordinates parallel SQL Processing
  • SQL Endpoint

Compute Nodes

  • Local Columnar Storage
  • Executes queries in parallel
  • Load, Backup and Restore

Data Distribution

  • Data in Redshift should be distributed evenly across all compute nodes to leverage parallel processing.
  • Data movement within nodes has be minimized
    • Co-located joins
    • Localized aggregations.

Data can be distributed with following methods

Distribution Key

Data is distributed based on the distribution key selected, distribution key can be a simple key or a compound key. While using a compound distribution key for data distribution it is recommended to use the columns with the same data type. If the distribution key has columns with different data types there will be a noticeable performance degradation.

Even

Round robin distribution of data in compute nodes in the cluster.

All

Stores full table data on the 1st slice of each node. This type of distribution is particularly useful in a Star Schema representation.

Sort Keys

Amazon Redshift stores your data on disk in sorted order according to the sort key. The Amazon Redshift query optimizer uses sort order when it determines optimal query plans.

Zone maps :  A zone map exists for each 1 MB block, and consists of in-memory metadata that tracks the minimum and maximum values within the block, Hence if you sort the column e.g. a date_column If it is sorted then it will be faster to find the block in which data is stored. Amazon redshift does not use indexes as any conventional database.

Compound Sort Keys : In any select statement in Amazon Redshift the column with the least resolution has to 1st column and go ascending in the order of resolution.

Interleaved Sort Keys: An interleaved sort gives equal weight to each column, or subset of columns, in the sort key. If multiple queries use different columns for filters, then you can often improve performance for those queries by using an interleaved sort style. When a query uses restrictive predicates on secondary sort columns, interleaved sorting significantly improves query performance as compared to compound sorting.

**Interleaved sort keys are useful only on very very large tables.

**Do not use interleaved sort keys on Time stamp columns

Continuous Monitoring Amazon Redshift Cluster

  • Check cluster status periodically. Check the following
    • WLM Queue
    • Commit Queue
    • Database Locks
  • Vacuum and Analyze Regularly. Check for
    • Table Skew
    • Uncompressed columns
    • Unsorted Data
    • Missing Statistics
  • Check table status.

 

Columnar Datastore vs Traditional Datawarehouse

A very important thing to note if you are coming from a conventional Data warehouse implementation is redshift is a columnar data store internally. A data warehouse table will have a large number of columns. A common practice of writing SQL queries with SELECT ‘*’ in big NO in Redshift. You need to select only the columns that are required, It is a crime to do a selection of all columns in a columnar database.

 

AWS Schema Conversion Tool for Data Migration

AWS Schema conversion Tool is a highly useful tool for Data Migration and Suggestions. This tool can read the source databases tables and their usage statistics and suggest a best Amazon Redshift target schema with the best distribution key, sort key etc.. The new feature can have AWS Redshift  as source. It helps us take get insights on  improving the performance of our current Redshift Schema.

Source Database Target Database on Amazon Redshift
Greenplum Database (version 4.3 and later) Amazon Redshift
Microsoft SQL Server (version 2008 and later) Amazon Redshift
Netezza (version 7.0.3 and later) Amazon Redshift
Oracle (version 10 and later) Amazon Redshift
Teradata (version 13 and later) Amazon Redshift
Vertica (version 7.2.2 and later) Amazon Redshift
Amazon Redshift Amazon Redshift

QuickSight

QuickSight is a useful tool for building dashboards and BI Reports on Redshift. It is tuned into work faster with Redshift.

Amazon Redshift Utils in GitHub

Amazon Redshift Github utilities available in github have highly useful admin scripts.

https://github.com/awslabs/amazon-redshift-utils

This post will help you get off the ground from Traditional on-premise Data warehouse and usher you into the New era of Data warehousing with Amazon Redshift.