Building reliable data pipelines with AI and DataOps


The industry’s use of analytics is ubiquitous and highly varied: correlating all the components in a technology ecosystem, learning from and adapting to new events, and automating and optimising processes. In their different ways, these use cases are all about assisting the human in the loop, making people more productive and reducing error rates.

As a society we are finding that analytics is increasingly seen as the glue, or brain, that drives emerging business and social ecosystems - ecosystems that are already transforming our economy and the way we live, work and play.

From people data to ‘thing’ data

The old touchstone of the technology industry - ‘people, processes and technology’ - is firmly entrenched, but we might start replacing ‘technology’ with ‘things’ as embedded, unseen tech becomes truly ubiquitous, with sensors and connectivity in everything around us.

As we become more connected, the result has been called the Internet of Things, or an internet of everything, but for a truly connected and efficient system we are beginning to layer a much-needed ‘analytics of things’ on top. Forrester talks of ‘systems of insight’ and believes these are the engines powering future-proofed digital businesses. Only through analytics can businesses and institutions synchronise the varied components of this complex ecosystem, which is driving business and social transformation. Put another way, if we can’t understand and make use of all this data, why are we bothering to generate it?

A digital fabric means that a great deal can connect together, from varied enterprise and manufacturing systems to consumer solutions such as home control applications, but it is analytics that coordinates and adapts to demand, applying cognitive capabilities in the face of new forces and events. Analytics is needed to automate and optimise processes, making humans more productive and able to respond to pressures such as the money markets, global social media feeds and other complex systems in a timely, adaptive manner.

However, the fly in the analytics ointment has tended to be the well-known problems with data warehouses – even well-designed ones. Data warehouses are good at answering known questions, but business has tended to ask the warehouse to do too much. It is generally ideal for reporting and dashboarding, with some ad hoc analysis around those views, but it is just one component of many data pipelines, and it has tended to be slow to deploy, hard to change, expensive to maintain and ill-suited to many ad hoc queries or big data requirements.


Spaghetti data pipelines

The modern data environment relies on a variety of sources beyond the data warehouse: production databases, applications, data marts, enterprise service buses, big data stores, social media and other external data sources, as well as unstructured data. The trouble is that it often relies on a spaghetti architecture joining these up with the ecosystem and with targets such as production applications, analytics, reporting, dashboards, websites and apps.

To get from these sources to the right endpoints, data pipelines consist of a number of steps that convert data as a raw material into a usable output. Some pipelines are relatively simple, such as ‘export this data into a CSV file and place into this file folder’. But many are more complex, such as ‘move select tables from ten sources into the target database, merge common fields, array into a dimensional schema, aggregate by year, flag null values, convert into an extract for a BI tool, and generate personalised dashboards based on the data’.
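As a rough illustration of those steps, here is a minimal sketch in Python, assuming pandas and an in-memory SQLite stand-in for the source database; the table names, columns and output file are all hypothetical, not a prescribed design.

```python
# A minimal, hypothetical pipeline sketch: extract from a source database,
# merge on a common field, flag nulls, aggregate by year, export to CSV.
import sqlite3
import pandas as pd

# Stand-in source: an in-memory SQLite database with two small tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region_id INTEGER, year INTEGER, amount REAL);
    INSERT INTO sales VALUES (1, 2017, 120.0), (1, 2018, 150.0),
                             (2, 2018, NULL), (2, 2019, 95.5);
    CREATE TABLE regions (region_id INTEGER, region_name TEXT);
    INSERT INTO regions VALUES (1, 'EMEA'), (2, 'APAC');
""")

# Extract: move the tables we need out of the source.
sales = pd.read_sql_query("SELECT * FROM sales", conn)
regions = pd.read_sql_query("SELECT * FROM regions", conn)

# Transform: merge on the common field, flag nulls, aggregate by year.
merged = sales.merge(regions, on="region_id", how="left")
merged["amount_missing"] = merged["amount"].isna()
by_year = (merged.groupby(["region_name", "year"], as_index=False)
                 .agg(total_amount=("amount", "sum"),
                      missing_rows=("amount_missing", "sum")))

# Load: write an extract that a BI tool or dashboard could pick up.
by_year.to_csv("sales_by_year.csv", index=False)
```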

Complementary pipelines can also run together - operations and development, for example - with development feeding innovative new processes into the operations workflow at the right moment, usually before data transformation is passed into data analysis.

As long as the process works efficiently, effectively and repeatably - pulling data from sources, through the various data processes, to the business users who need it, be they data explorers, analysts, scientists or consumers - then it’s a successful pipeline.

Dimensions of DataOps

DataOps brings a series of values into the mix. From the agile perspective, Scrum, kanban, sprints and self-organising teams keep development on the right path. DevOps contributes continuous integration, deployment and testing, with code and configuration repositories and containers. Total quality management brings performance metrics, continuous monitoring, benchmarking and a commitment to continuous improvement. Lean techniques feed into automation, orchestration, efficiency and simplicity.
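To make the DevOps strand concrete, the hypothetical sketch below shows the kind of automated data-quality check a continuous integration job might run on every change, gating pipelines the way unit tests gate application code; the file name, columns and thresholds are purely illustrative.

```python
# Hypothetical data-quality checks that a CI job could run on every change.
import pandas as pd

def check_extract(df):
    """Return a list of failed checks for a freshly built extract."""
    failures = []
    if df.empty:
        failures.append("extract is empty")
    if df["amount"].isna().mean() > 0.05:              # illustrative threshold
        failures.append("more than 5% of amounts are null")
    if df.duplicated(subset=["region_id", "year"]).any():
        failures.append("duplicate region/year rows")
    return failures

if __name__ == "__main__":
    extract = pd.read_csv("staging_extract.csv")       # hypothetical pipeline output
    problems = check_extract(extract)
    if problems:
        raise SystemExit("Data checks failed: " + "; ".join(problems))
    print("All data checks passed")
```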

The benefits this miscellany of dimensions brings include speed, with faster cycle times and faster changes; economy, with more reuse and coordination; quality, with fewer defects and more automation; and higher satisfaction, based on greater trust in the data and in the process.

AI can add considerable value to the DataOps mix, as data plus AI is becoming the default stack on which many modern enterprise applications are built. There is no part of the DataOps framework that AI cannot optimise: the data processes (development, deployment, orchestration), the data technologies (capture, integration, preparation, analytics), and the pipeline itself, from ingestion through engineering to analytics.

This value will come from machine learning and advanced analytics applied beyond troubleshooting (though that alone will deliver massive cost, resource and time savings), automating and rightsizing the process and its parts so that they work in optimal harmony.
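As one small illustration of the troubleshooting angle, the sketch below flags an unusually slow pipeline run against a simple statistical baseline; the durations and threshold are made up, and a production system would use richer models and real telemetry rather than this minimal approach.

```python
# Hypothetical sketch: flag pipeline runs whose duration deviates sharply
# from the recent baseline - a very simple form of automated troubleshooting.
import statistics

# Stand-in telemetry: durations (in seconds) of recent pipeline runs.
run_durations = [310, 295, 330, 305, 320, 298, 315, 910]

baseline = run_durations[:-1]                   # treat the latest run as "new"
mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)

latest = run_durations[-1]
z_score = (latest - mean) / stdev if stdev else 0.0

if abs(z_score) > 3:                            # illustrative threshold
    print(f"alert: latest run took {latest}s (z-score {z_score:.1f})")
else:
    print("latest run is within the normal range")
```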

Where DataOps adds value

The goal of good architecture is to coordinate and simplify data pipelines; the goal of DataOps is to fit in and then automate, monitor and optimise those pipelines. Enterprises need to inventory their data pipelines and explore DataOps processes and tools carefully, so that they solve their challenges with right-sized tools. AI will layer on top, drawing the ultimate value from DataOps.

Kunal Agarwal, CEO of Unravel Data

Kunal Agarwal co-founded Unravel Data in 2013 and serves as CEO. Kunal has led sales and implementation of Oracle products at several Fortune 100 companies. He co-founded Yuuze.com, a pioneer in personalised shopping and what-to-wear recommendations. Before Yuuze.com, he helped Sun Microsystems run Big Data infrastructure such as Sun's Grid Computing Engine.