The Ferrari In First Gear: The False Promise of Lift And Shift and How To Move Your Data & Analytics Projects Into High Gear
We have seen first-hand companies approaching the “modernization” of their data architecture by simply applying traditional design techniques to their new Cloud data architecture. Traditionally, data architectures were similar between transactional and analytic systems. When dealing with the Cloud, these two data architectures need to be bifurcated.
Although a Lift and Shift approach is comfortable, it usually fails to leverage the benefits of the Cloud and results in higher costs, reduced performance, reduced speed to market, and most significantly, reduced value from the organization’s data assets when the purpose is analytics. In fact, if the designs are not singularly focused on supporting “value delivery” methods of Business Intelligence, Data Sharing and Artificial Intelligence (AI)/machine learning (ML), they will most likely fail to realize value from an organization’s data assets.
Why? If you are designing a data architecture to deliver actionable data, then you need Data Management Solutions for Analytics that focus on directionally accurate, aggregated information. This is different from Data Management Solutions for Transactional Systems, which focus on individual transactions to make sure stored information is as accurate as possible. With Data Management Solutions for Analytics that support BI and Analytics, heavy traditional data management solutions such as Master Data Management (MDM), Data Quality (DQ) projects, structured storage and the like are less critical. Yes, these concepts are important down the road, but to quickly deliver value from your data assets, they need to be right-sized with “good-enough” techniques that get you “close enough”. After all, artificial intelligence (AI), machine learning and advanced analytics are about providing enough information, in aggregate, to make decisions that will move the needle.
Let’s consider just a few design differences between a Cloud data architecture that supports analytics vs. an On-prem data architecture that supports transactional systems:
Business Intelligence (BI). Many of the traditional BI platforms were built to solve scalability challenges with On-prem Data warehouses. They, therefore, had to include data transformation (ETL), in-memory processing and storage, and visualizations. With today’s Cloud Data warehouses such as Redshift, Snowflake and Azure DW, the in-memory processing and storage are no longer necessary. Pushing computation down to the underlying compute infrastructure leads to lower TCO, better compliance and governance, and increased scalability. In fact, if a BI tool does not allow you to push the compute processing to the Data warehouse and leverage MPP (massively parallel processing), then this is like only driving a Ferrari in first gear.
The second piece of a traditional BI platform that is significantly different is data movement and transformation (ETL), which is now performed much more efficiently with purpose-built tools. This leaves visualizations – the main purpose of a BI tool. Would you rather have a Cloud-built BI tool that invests all of its R&D into visualizations and getting value from data, or a BI tool that must share R&D investments between ETL, in-memory processing, and visualizations? With tool specialization comes a much richer feature set and increased value.
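To make the “first gear” analogy concrete, here is a minimal Python sketch of the difference between aggregating in the BI layer and pushing the aggregation down to the warehouse. The `run_on_warehouse` helper and the `sales` table are invented for illustration (sqlite3 stands in for an MPP warehouse); this is not any vendor’s actual API.

```python
import sqlite3  # stands in for an MPP Cloud warehouse (Redshift, Snowflake, etc.)

def run_on_warehouse(conn, sql):
    """Hypothetical stand-in: in a real deployment this would execute
    on the warehouse's MPP engine, not in the BI tool's process."""
    return conn.execute(sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 50.0), ("west", 75.0)])

# First gear: pull every raw row, then aggregate in the BI tool's memory.
rows = run_on_warehouse(conn, "SELECT region, amount FROM sales")
in_memory = {}
for region, amount in rows:
    in_memory[region] = in_memory.get(region, 0.0) + amount

# Pushdown: ship the aggregation as SQL; only the small result set travels.
pushed_down = dict(run_on_warehouse(
    conn, "SELECT region, SUM(amount) FROM sales GROUP BY region"))

assert in_memory == pushed_down  # same answer, very different cost at scale
```

Same result either way on three rows; at billions of rows, the second form is the one that lets the warehouse do the heavy lifting.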
Another trend that we are seeing is that companies are choosing multiple BI tools. Thinking of this from another perspective, most users simply want greater access to data. In many cases, they are looking for an “Excel-type” tool on steroids that is connected to Cloud data stores and Data warehouses with ready-to-use data. These Power Users want to understand what is happening and then evangelize their learnings. Sounds like a great approach to increase adoption, promote a data-driven culture and uncover value, huh? As an organization matures its data-driven culture, it can then institutionalize corporate learnings into dashboards and visualizations that are directly actionable for users.
Data movement and transformation. ETL vs. ELT + Orchestration. Although several of the On-prem “all-in-one” ETL tools have ported their legacy products to the Cloud, the design patterns for moving, loading and processing data in the Cloud have dramatically changed. Traditionally, the pattern was “ETL” or “Extract, Transform then Load”. This approach pulled data from a source, transformed the data (aka cleaned, standardized, applied business rules) and then loaded it into the target system. A few of the challenges with this approach in a Cloud data architecture are: 1) it becomes difficult to consistently apply business rules, especially as different sources change or sources must be chained to ensure the integrity of the data; 2) it becomes increasingly difficult to monitor, coordinate and manage a larger and larger number of transformation jobs; 3) re-loads or partial loads are difficult to perform; 4) processing is difficult to scale and usually runs up against or over the “processing window”, missing delivery SLAs; and 5) real-time or even near-real-time or intra-day data is difficult to obtain.
With an “Extract, Load then Transform” + Orchestration approach, data is moved as quickly as possible (aka Landed) from the source system to your Data Lake. It is usually landed in the same structure as the source to limit processing time on, and connectivity to, transactional systems. With this simple change, a Cloud data architecture can now leverage best-of-breed tools for transformations and orchestration (control of the jobs). This approach offers several advantages over the traditional ETL approach: 1) Cloud compute can be applied to transformations, greatly reducing the elapsed time and shortening the “processing window”; 2) real-time, near-real-time and intra-day updates are now possible; 3) reloads and storage of historical information are now possible; and 4) business rules can be centralized, governed and re-applied when changes are made.
What we are also seeing is that multiple EL and T tools are being used, especially if they all can be orchestrated by another tool. In addition, data transformation pipelines can be broken out into smaller, atomic pieces that allow for job parallelization (aka reduced processing time), reusability, and easier maintenance when paired with an appropriate orchestration tool. Another benefit of this approach is that all changes can be source-controlled, a common software development methodology that is traditionally lacking in a data development methodology. These approaches allow companies to pick the best of breed and not have to be locked into a single vendor. When better tools become available, a massive rewrite is not necessary.
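The pattern above can be sketched in miniature. This pure-Python illustration uses invented step names and an intentionally naive orchestrator; a real pipeline would use a dedicated orchestration tool, but the shape is the same: land the data unchanged, then run small, atomic, individually re-runnable transforms in dependency order.

```python
# Minimal ELT + orchestration sketch. The step functions and their
# dependency order are illustrative, not any specific tool's API.

raw_landing = {}   # stands in for the Data Lake's landing zone
transformed = {}   # stands in for curated warehouse tables

def extract_load():
    # "EL": land source rows as-is, no business rules applied yet.
    raw_landing["orders"] = [
        {"id": 1, "amount": "100.50"},
        {"id": 2, "amount": "20.00"},
    ]

def clean_orders():
    # Atomic transform: type casting only, easy to re-run on its own.
    transformed["orders_clean"] = [
        {**row, "amount": float(row["amount"])}
        for row in raw_landing["orders"]
    ]

def daily_revenue():
    # Atomic transform: depends on the cleaned table, not the raw landing.
    transformed["daily_revenue"] = sum(
        row["amount"] for row in transformed["orders_clean"])

# A tiny orchestrator: run the steps in topological (dependency) order.
dag = [extract_load, clean_orders, daily_revenue]
for step in dag:
    step()

print(transformed["daily_revenue"])  # 120.5
```

Because each transform is small and atomic, steps can be re-run or parallelized independently, and the whole definition lives in plain code that can be source-controlled.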
Other patterns that we are seeing are EL-only tools (think Stitch, which was recently acquired by Talend) and tools that allow transformations to be developed in well-known languages such as SQL or Python. There was a push not too long ago to move to low- or no-code ELT. We would recommend caution. Although on the surface this has a lot of nice benefits, we have seen that limitations quickly manifest as the environment becomes more complicated and interconnected.
Analytics/ML/AI. When people think of Artificial Intelligence (AI), they think of the plots of Sci-Fi movies where machines become self-aware and take over the world. This type of general AI, where a machine is pointed at all your data and figures out what you should look at, is still years away…thankfully. With the current generation of AI that most organizations need, humans need to help machines by putting up guard rails. These guard rails take the form of the business-related question being answered. What is the likelihood that a customer will leave? Should I spend time and resources chasing this prospect? What is my competitor likely to bid on that job? Once the question is known, then AI and machine learning can be used to find more patterns than humans would be able to find. It’s still up to the humans to figure out what it means and, more importantly, what actions to take to improve the outcome, but at least we have actionable data that can move the needle!
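As a toy illustration of a guard-railed question (“what is the likelihood that a customer will leave?”), the sketch below scores customers with a logistic function. The features and weights are invented for this example; a real churn model would learn its weights from historical data rather than hard-coding them.

```python
import math

# Illustrative only: these features and weights are invented for the sketch.
# A real churn model would be trained on historical customer data.
WEIGHTS = {"days_since_last_login": 0.05,
           "support_tickets": 0.30,
           "tenure_years": -0.40}
BIAS = -1.0

def churn_likelihood(customer):
    """Answers the guard-railed question: the likelihood (0..1)
    that this customer will leave."""
    z = BIAS + sum(WEIGHTS[k] * customer[k] for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))  # logistic squash to a probability

at_risk = {"days_since_last_login": 60, "support_tickets": 4, "tenure_years": 0.5}
loyal   = {"days_since_last_login": 2, "support_tickets": 0, "tenure_years": 6.0}

print(round(churn_likelihood(at_risk), 2))  # 0.95
print(round(churn_likelihood(loyal), 2))    # 0.04
```

The model outputs a likelihood, not a decision: it is still up to the humans to decide what action (a retention offer, a check-in call) moves the needle.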
Traditionally, a lot of planning went into selecting which analytic model and platform to use and preparing the data in a format the model could accept. These models were created in tools such as SPSS and SAS in a closed-ecosystem “black box”, making it difficult to leverage the work of others and fine-tune the parameters. Since modeling was usually done in a “back room”, the output had to be painstakingly integrated with other systems and processes, making the time to realize value very long. A lot of time was spent planning and executing because choosing the wrong model was a costly and lengthy mistake. Therefore, years ago, predictive modeling and advanced analytics were reserved for the Fortune 50 companies.
Fortunately, analytics in the Cloud has reduced the cost and time so much that nearly everyone can leverage advanced analytics. When setting up Data Management Solutions to support Analytics, a main foundational component is Data Storage (aka Data Lake / Data Warehouse). This allows for the ingestion and processing of large amounts of data quickly, with multiple modeling techniques and more importantly, the ability to send the output to users and systems in real- or at least near-real-time.
A new breed of AI/ML tools is also gaining popularity, allowing companies to quickly try several different modeling techniques to find the optimal model. The Cloud landscape has enabled vendors to create advanced ML/AI tools and capabilities that can be leveraged in Cloud environments. A few select tools and capabilities are:
- Utilizing commercially available use case-specific APIs, such as speech-to-text, optical character recognition (OCR), etc.
- Leveraging “analytic-as-a-service” SaaS platforms with the ability to customize & serve pre-built model templates
- Building custom models by leveraging the work of others and stitching together libraries within R, Python, and Scala for unique use cases
- Utilizing auto-ML tools, such as DataRobot, to apply numerous models to a data set to determine which is the most viable.
These are just a few of the new capabilities making it easier for organizations to extract tremendous value from their data assets while greatly reducing TCO and development time.
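In spirit, an auto-ML tool automates a loop like the following pure-Python sketch (the toy candidate “models” are invented for illustration and are not any vendor’s API): fit several techniques on the same training data, score each on a hold-out set, and keep the most viable.

```python
# Auto-ML in miniature: try several candidate models on the same data,
# score each on a hold-out set, and keep the most viable one.
# The candidates here are deliberately tiny toy models.

train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]   # (x, y) pairs
holdout = [(5, 10.1), (6, 11.9)]

def fit_mean(data):
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y                      # ignores the input entirely

def fit_line(data):
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    slope = (sum((x - mx) * (y - my) for x, y in data)
             / sum((x - mx) ** 2 for x, _ in data))
    return lambda x: my + slope * (x - mx)       # least-squares line

def holdout_mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

candidates = {"mean": fit_mean, "line": fit_line}
scores = {name: holdout_mse(fit(train), holdout)
          for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # the linear model wins on this roughly linear data
```

A real auto-ML platform sweeps far richer model families and hyperparameters, but the selection loop — fit, score on held-out data, keep the best — is the same.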
Moving away from technology and to resources, it is only natural that companies leverage their existing data management and analytics staff. As with any change, human capital will need to be nurtured and adapted. What we are seeing is that these resources typically fall into two categories: 1) those with the ability and attitude to adapt to a Cloud-based data architecture, and 2) those that resist, are not willing to learn, or are not able to learn the differences. The difference is between gaining measurable benefits that Executives care about and watered-down results that amount to simply “getting by”.
Seasoned architects have the benefit of experience, but may be burdened with the attitude of “this has worked before, so it should continue to work” or “I don’t need to learn something new”. The other key characteristic of a good architect is whether they have the attitude to adapt. When was the last time they needed to learn a completely different technology? Do they only know one way of solving a problem? Are they uncomfortable with (and therefore resistant to) change? When did they last proactively identify new trends and pitch how they might apply to the business? Are they care-takers or change-agents?
The following will be critical in evaluating, retraining, and refocusing your resources:
| Old Mindset | New Approach |
| --- | --- |
| Lift and Shift – what’s worked before will be fine in the Cloud | Understanding the different design patterns to take the best of the existing data architecture and merge it with the best of a cloud data architecture |
| Using familiar tools that have been ported to the Cloud | Truly evaluating cloud-built tools as a potential replacement |
| Learning on the fly and reading reviews | “Buying” experience by leveraging an expert or hiring someone that “has been there and done that” to accelerate the process and improve the chances of success |
| Build it and they will come | Choosing a single use case, sponsored by the business, to prove out the cloud data architecture |
| A Data warehouse and BI tool is all we need | A comprehensive data & analytics strategy is needed to turn data into a strategic asset that contributes to the top- and bottom-lines. A Data Store is a tool and not how value is delivered. |
| Users are not ready for this | Especially in an age when the workforce is increasingly comprised of employees that only know the Cloud – ask them – they want answers, they want data, they want to institutionalize and share learnings |
This is an exciting time for Data & Analytics in the Cloud. The Titans (Amazon, Microsoft, and Google) are in an all-out war to get their share of the pie. Recent projections have this space growing at a 50% compound annual growth rate and the expected returns from these efforts to be game-changing.
As with all shifts, it is prudent to dip a toe in the water and assess the environment before jumping right in. By leveraging the new capabilities in the Cloud around Data & Analytics it is even easier to rapidly iterate and see what works. After the journey has begun, it is important to understand the goals and who in the business will benefit from the outcomes. Having a clear vision and strategy will allow nearly any company to get out of first gear and open up that sports car to extract huge returns from their Data & Analytics projects.
If you don’t have a Data & Analytics Strategy, now is the time to act before your competitors do. Let us know how we can help by clicking here.