Data Storage in a Modern Analytics Architecture by Meagan Longoria


In classic business intelligence environments, we integrate, transform, and summarize data to monitor business operations and conditions, helping decision makers determine what actions to take. Where we were once limited to a data warehouse and some reports, we now have more choices for how to store, process, and analyze our data. 

Organizations want to reduce “time to value”: the amount of time it takes to acquire data, perform necessary transformations, and deliver it to consumers. They also want to expand their analytical capabilities to include more types and sizes of data.

Modern analytics architectures often follow a polyglot persistence strategy to help achieve those goals. This means we use multiple types of data persistence layers, each selected because it is the optimal choice for the type of data and how it will be used. Common data storage services in a modern analytics architecture include file storage, MPP (Massively Parallel Processing) databases, SMP (Symmetric Multi-Processing) databases, and analytical databases. In the Microsoft Azure data platform, this could translate to Data Lake Storage Gen 2, SQL Data Warehouse, SQL Database, and Analysis Services.

While introducing more components increases development complexity, it can more efficiently provide data to the people who need it.

Three common mistakes to avoid when deciding where to store your data are: 

  1. Trying to store data where it doesn’t fit 
  2. Transforming and standardizing data before its value has been determined
  3. Losing track of sensitive data

Store Data Where It Best Fits

Data warehouses built in relational databases expect standardized tabular data with a common schema. The schema is imposed when the data is written to the table. Although images can be stored in a relational database as binary objects, it is not optimal to store terabytes of images there.

Data lakes are a good low-cost solution for storing data in a variety of formats. They don’t require or impose an up-front schema definition. Schema-on-read techniques are used to impose structure and meaning at query time. This can be useful when you have files that vary in format and number of columns. Data lakes easily store images and video files.
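To make schema-on-read concrete, here is a minimal Python sketch using only the standard library. The file contents and column names are invented for illustration: two extracts landed in the lake with different column sets, and structure is discovered from each file at query time rather than enforced when the data was written.

```python
import csv
import io

# Two hypothetical extracts with different column sets; no schema was
# enforced when the files were written to the lake.
extract_v1 = io.StringIO("order_id,amount\n1,9.99\n2,4.50\n")
extract_v2 = io.StringIO("order_id,amount,currency\n3,12.00,USD\n")

def read_records(f):
    # Schema-on-read: each file's structure is discovered at query time
    # from its own header row, not from a table definition.
    return list(csv.DictReader(f))

records = read_records(extract_v1) + read_records(extract_v2)

# The effective schema is the union of whatever columns each file carried.
schema = sorted({col for row in records for col in row})
print(schema)        # ['amount', 'currency', 'order_id']
print(len(records))  # 3
```

A relational table would have rejected the second file or required a schema change; here the newer column simply appears for the rows that carry it.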

Data Transformation for Business Value

Data lakes can be useful archives of transactional systems and reference data. They often contain staging areas for data waiting to be loaded to a data warehouse or other application. 

Data lakes can also serve as an exploration area where data scientists can review new data without first transforming it to fit the data warehouse (a process that can cause a long wait in some organizations). If data scientists are analyzing a new business question, they may not want the structure and transformations imposed by pulling data from the data warehouse. If the exploration proves useful and there are requests for similar analyses, that data can then be standardized and added to a curated zone in the data lake or to the data warehouse. But it would be a waste of effort to immediately integrate new data into the data warehouse before repeated value has been demonstrated.

Not every analyst has or needs the tools and skills to extract data from the data lake. You don’t want every analyst to perform their own transformation and cleansing. Automating data transformation and load to a data warehouse reduces duplicate transformation efforts and undesirable variations in calculations.  
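As a sketch of that idea, the function below centralizes one hypothetical cleansing rule (the field names and rules are illustrative, not from any particular warehouse). Defining the standardization once in the automated load means every consumer sees the same calculation, instead of each analyst re-implementing it with slight variations.

```python
# Hypothetical shared cleansing step: centralizing the rule means every
# load applies the same standardization, and a calculation like
# lifetime_value is defined once for the whole warehouse.
def standardize_customer(raw: dict) -> dict:
    return {
        "customer_id": int(raw["customer_id"]),
        "email": raw["email"].strip().lower(),
        "lifetime_value": round(float(raw["total_spend"]), 2),
    }

# Staged rows as they might arrive from a source system.
staged = [
    {"customer_id": "42", "email": " Ada@Example.COM ", "total_spend": "199.999"},
]
curated = [standardize_customer(r) for r in staged]
print(curated[0]["email"])           # ada@example.com
print(curated[0]["lifetime_value"])  # 200.0
```

In a real pipeline this function would live in the scheduled transformation job, so the curated output, not the raw staging data, is what analysts query.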

An analytical database summarizes data and delivers it to end users with predefined relationships and calculations, making it easy to see and compare metrics through a drag-and-drop reporting interface. It is used to provide quick, consistent answers to common or important questions. The predefined relationships and calculations may meet 70% of the organization’s analytical needs. Some analysts or report developers may need direct access to the data warehouse to write queries that use different relationships and calculations. If they often write the same queries, that data should be considered for addition to the analytical database. 

Catalog and Limit Access to Sensitive Data

Special consideration should be given to sensitive data, such as PII (personally identifiable information). It may not be advisable to keep a copy of sensitive data in every layer of your data platform. One strategy is to load sensitive data only to an area of the data warehouse with restricted access, deleting any staged copies from the data lake and declining to add it to analytical databases. The data warehouse may offer features to encrypt sensitive columns so that only users with special permissions can view them, while other users can still see the less sensitive columns in the table. Sensitive data should be catalogued and tagged for easy identification, audit, and deletion as needed. In this era of data leaks and GDPR, keeping an inventory of sensitive data is important for compliance and risk management.
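One lightweight way to keep such an inventory is a tag-based catalog of sensitive columns. The sketch below is purely illustrative (the table names, column names, and tags are invented, and a real implementation would use your platform's data catalog service rather than an in-memory list), but it shows how tagging makes sensitive data easy to find for audits or deletion requests.

```python
# Illustrative catalog: each entry records where a column lives and which
# sensitivity tags apply to it. Names and tags are invented for the sketch.
catalog = [
    {"table": "dw.customer", "column": "email",      "tags": {"PII"}},
    {"table": "dw.customer", "column": "birth_date", "tags": {"PII", "GDPR"}},
    {"table": "dw.sales",    "column": "amount",     "tags": set()},
]

def find_tagged(catalog, tag):
    """Return every (table, column) carrying the given tag."""
    return [(c["table"], c["column"]) for c in catalog if tag in c["tags"]]

# An audit or a deletion request starts with a lookup like this:
print(find_tagged(catalog, "PII"))
# [('dw.customer', 'email'), ('dw.customer', 'birth_date')]
```

Because every sensitive column is enumerated in one place, responding to an audit or a GDPR erasure request becomes a query over the catalog rather than a hunt through every layer of the platform.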

Modernizing your analytics architecture is more than just adding technologies to your stack. It gives you the opportunity to reassess how your organization uses data and to meet business needs with timely data in usable formats.  
