It is a well-known statistic that a large percentage of big data projects fail to deliver the results originally promised in their charters. This occurs in spite of “big data” being an almost universal bullet point on CIOs’ lists of strategic initiatives. Part 1 of this post explores some of the technical reasons for the failure of advanced analytics projects, while Part 2 will focus on the process and people challenges.
According to Forrester, an analyst or data scientist can spend up to 80% of their time simply preparing a data set for analysis. Similarly, when creating an operational (i.e., production-ready) process to ingest a source of data on a regular basis, a data engineer, without a proper framework in place, can spend a large chunk of time simply getting data into the analytics platform.
While we all agree that analytics without data would be a useless exercise, the real value in the process comes from the insights that will be produced from the data–not simply having the data present on the platform. In order to focus on the high value-add tasks, a big data effort should allocate significant time in the early stages of the project to focus on creating a simple and effective data ingestion strategy. This includes both the initial “data wrangling” that will be done during the experimentation cycle, as well as the development of production-ready processes during the implementation cycle.
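To make the “data wrangling” step concrete, here is a minimal sketch of the kind of cleanup that consumes so much preparation time. The field names, formats, and sample records are hypothetical; real source extracts are typically far messier.

```python
import csv
import io

# Hypothetical raw extract: stray whitespace, inconsistent casing,
# missing values, and mixed date formats are typical of source data.
raw = """customer_id, signup_date ,region
 101,2023-01-15,NORTH
102,, south
103,15/01/2023,North
"""

def normalize_date(value):
    """Coerce the two date formats seen in this extract to ISO 8601."""
    value = value.strip()
    if not value:
        return None
    if "/" in value:  # assumed DD/MM/YYYY in this hypothetical source
        day, month, year = value.split("/")
        return f"{year}-{month}-{day}"
    return value

cleaned = []
for row in csv.DictReader(io.StringIO(raw)):
    # Strip whitespace from both header names and values.
    row = {k.strip(): (v or "").strip() for k, v in row.items()}
    cleaned.append({
        "customer_id": int(row["customer_id"]),
        "signup_date": normalize_date(row["signup_date"]),
        "region": row["region"].title() or None,
    })
```

Even this toy example needs per-field rules for dates, casing, and missing values; multiply that across dozens of sources and the 80% figure becomes easy to believe, which is why an ingestion framework that standardizes these steps pays for itself quickly.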
In general, we can consume the output of any analytics process in one of two ways: programmatically, where the output feeds downstream systems and applications directly, or visually, where people consume the output through reports and visualizations.
In most cases, both become a key part of any data project. As such, both should be carefully considered when determining how the solution will contribute to achieving the specified business objectives.
To give an example, if the output of the analytics process is a recommendation model that will power product placement on a website, there must be a technical solution including software and infrastructure that can handle the volume of expected requests. In many cases, this might involve teams and processes that are outside of the project’s direct scope.
Similarly, when crafting the optimal presentation for a complex, multidimensional report or visualization, the requirements commonly exceed the capabilities of traditional charting and graphing solutions. We task data scientists and information delivery experts with the creative use of non-traditional tools and techniques to optimize the visual delivery of the information. This implies a strong need to ensure the right people, tools and delivery mechanisms are in place to extract all relevant value from the insights produced.
Both technologists and data scientists have a tendency to focus on the most interesting part of the technical solution–the tools and products that are actually cranking through the data to mine and produce valuable information. In doing so, it is common to neglect the basics such as monitoring, alerting and log aggregation–these are also essential to assessing and ensuring the health of the overall platform, troubleshooting issues, and planning for sufficient capacity.
Once an analytics solution has started to produce tangible and measurable value, the tolerance for downtime will quickly decrease. Prior to building out and deploying a production-ready solution, one must consider, test, and manage the ancillary components to ensure long-term success.
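The monitoring and alerting basics mentioned above need not be elaborate to be useful. The sketch below shows the shape of a simple capacity check; the threshold and the disk-usage target are illustrative assumptions, and a real platform would feed such checks into its alerting pipeline rather than a local log.

```python
import logging
import shutil

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("platform-health")

# Illustrative threshold; real values come from the platform's capacity plan.
DISK_ALERT_PCT = 85.0

def check_capacity(used_bytes, total_bytes, alert_pct=DISK_ALERT_PCT):
    """Return True if usage is under the alert threshold; log an alert if not."""
    used_pct = 100.0 * used_bytes / total_bytes
    if used_pct >= alert_pct:
        log.error("ALERT: %.1f%% used (threshold %.1f%%)", used_pct, alert_pct)
        return False
    log.info("%.1f%% used", used_pct)
    return True

if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    check_capacity(usage.used, usage.total)
```

The point is less the specific check than the habit: every production component should emit health signals that someone, or something, is watching before the first outage, not after.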
On every successful big data project, there will inevitably be a clash between the day-to-day operational needs of the platform and the need for power users (e.g., data scientists) to execute experiments at scale. While the technical platforms themselves have evolved to assist in doling out resources as fairly as possible, one must account for these competing needs early on in the project to ensure adoption and continued success as the effort begins to scale.
There are various technical, as well as process-based solutions to the problem. Ideally, both would be employed, and would be considered when laying out the initial architecture and determining appropriate sizing.
In the past, security for a typical analytics environment (i.e., Hadoop) was an afterthought. The only real option was an in-or-out policy based on convoluted integration with Kerberos. Nowadays, technologies such as Ranger, Knox and Sentry, while relatively immature, are laying the foundation for a much richer security model on the typical big data platform. In addition to more fine-grained security, we should consider techniques such as obfuscation and encryption at rest to secure sensitive data.
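As a simple illustration of the obfuscation technique mentioned above, sensitive identifiers can be replaced with keyed-hash tokens before the data lands on the platform. The field names are hypothetical, and the hard-coded key is purely for the sketch; in practice the key would live in a secrets manager.

```python
import hashlib
import hmac

# Assumption: in production this key comes from a secrets manager,
# never from source code.
TOKENIZATION_KEY = b"replace-with-managed-secret"

def tokenize(value: str) -> str:
    """Deterministically obfuscate a sensitive value with HMAC-SHA256.
    The same input always yields the same token, so joins and group-bys
    still work, but the raw value cannot be recovered without the key."""
    return hmac.new(TOKENIZATION_KEY, value.encode("utf-8"),
                    hashlib.sha256).hexdigest()

record = {"customer_id": "101", "ssn": "123-45-6789"}
safe_record = {**record, "ssn": tokenize(record["ssn"])}
```

Because the tokens are deterministic, analysts can still link records across data sets without ever seeing the underlying sensitive values, which complements, rather than replaces, the fine-grained access controls and encryption at rest discussed above.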
Closely related to security is the need for a clear data governance policy. In the age of the data lake, it is common to stuff whatever bits and bytes can be found lying around into the platform. Without proper governance surrounding both ingestion and usage, a “wild west” scenario can ensue, resulting in inappropriate data use and inconsistent answers to similar questions. This can lead to enterprise-wide confusion when reports and insights fail to “tick and tie.”
While a heavy-handed bureaucratic governance procedure will be stifling in the fast-moving analytics world, there is, at a minimum, the requirement for some oversight to ensure consistency, accuracy and proper usage of the data that enters and leaves the platform.
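Lightweight oversight can be as simple as refusing to ingest a data set that arrives without basic metadata. The sketch below shows one possible gate; the required fields and sensitivity levels are hypothetical examples of what a governance policy might demand.

```python
# Hypothetical minimum metadata a data set must carry before entering the lake.
REQUIRED_FIELDS = {"owner", "source_system", "sensitivity", "refresh_schedule"}
ALLOWED_SENSITIVITY = {"public", "internal", "restricted"}

def validate_registration(metadata: dict) -> list:
    """Return a list of governance violations; an empty list means the
    data set may be ingested."""
    problems = [f"missing field: {f}"
                for f in sorted(REQUIRED_FIELDS - metadata.keys())]
    if metadata.get("sensitivity") not in ALLOWED_SENSITIVITY:
        problems.append("sensitivity must be one of: "
                        + ", ".join(sorted(ALLOWED_SENSITIVITY)))
    return problems
```

A gate like this takes minutes to enforce at ingestion time, yet it guarantees that every data set on the platform has a named owner and a declared sensitivity level, which is exactly the minimum oversight argued for above.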
In part 2 of this post, we’ll explore the people and process-related reasons for big data failures. For a sneak peek at Precocity’s solution to some of these common challenges, check out ZipTie, our big data accelerator that is designed to reduce the time to value and increase your project’s chances of success.