Data Jigsaw
- Shashank Shekhar
- Dec 8, 2020
- 6 min read

This domain has seen a decade and a half of innovation and disruption, but before it could settle, tears have started forming at its seams.
According to Deloitte, three disruptive forces - Ambient Experience, Exponential Intelligence and Quantum - are gaining ground and are poised to make significant business contributions in the coming decade. Ambient experience pertains to technology becoming part of the environment, almost invisible. Exponential intelligence refers to the shift from transactional interactions, like Q&A, to systems that are proactive, connected, context-aware and embedded in functional tools, and therefore able to offer the unanticipated suggestion. Quantum will miniaturize and super-charge computing resources; a side-effect will be to completely displace cyber-security as we understand it today. In this article, I will limit myself to the domain I know well - "Data" - a domain actively working towards the goal of "self-detecting and adapting to a new normal on a continuous basis". I view traditional data analytics, big data, and AI as points on a continuum: Data.
The trajectory of technology is best documented by tracing what has happened in the last 10 years alongside a directional prediction. And who better than McKinsey when it comes to the art of predicting? Here is a peek at their 2011 predictions on big data:
- The use of big data will become a key basis of competition and growth for individual firms, underpinning new waves of productivity growth and consumer surplus.
- Opportunities and challenges vary from sector to sector: computer and electronic products and information sectors, as well as finance and insurance, and government are poised to gain substantially from the use of big data.
- There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use big-data analysis to make effective decisions.
- Policies related to privacy, security, intellectual property, and even liability will need to be addressed in a big-data world.
In June 2013, Accel launched a 2nd $100M fund, with Accel Partner Jake Flomenberg commenting, “Over the past few years, we’ve focused a tremendous amount of attention on what people like to call the ‘three Vs’ of big data: variety, volume, and velocity. We now believe there is a fourth V, which is end user value, and that hasn’t been addressed to the same extent.”
2021 will mark a decade since big data came into prominence, and based on industry findings, the promise remains largely unfulfilled. NewVantage Partners have been running an annual survey of Fortune 1000 companies since 2012, and here is the takeaway from their 2020 survey: "the great majority of spending on big data and AI goes for technology and its development. We hear little about initiatives devoted to changing human attitudes and behaviours around data. Unless the focus shifts to these types of activities, we are likely to see the same problem areas in the future that we've observed year after year in this survey".
It was predicted in 2010 that most Fortune 1000 companies already had upward of 200 TB of data, expected to grow exponentially for several decades. In 2020, it is safe to assume that most Fortune 50 companies hold upward of 100 PB. As Dave McCrory beautifully analogized, data has gravity: data accumulates applications and services around it. Unless the enterprise is in the business of monetizing data (e.g. Facebook or LinkedIn), it prefers to keep data within its firewalls or in a private cloud. The SaaSification of infrastructure and applications (modern-day cloud computing) gave enterprises a platform to quickly develop services and a scalable mechanism to digitize. Applications accessing data sitting a thousand miles away led to several technology beachheads: data quality, data catalogs, visualization and KPIs, and streaming.
Data Quality (DQ) and Data Catalog:
High-quality data is key to interpretable and trustworthy data analytics and the basis for meaningful data-driven decisions. Gartner lists the following DQ capabilities: connectivity; data profiling, measurement and visualization; monitoring; parsing; standardization and cleaning; matching, linking and merging; multidomain support; address validation/geocoding; data curation and enrichment; issue resolution and workflow; metadata management; DevOps environment; deployment environment; architecture and integration; and usability. Over time, through migration of applications, mergers of databases, moves to the cloud and so on, data does become dirty.
Commercial software players like Informatica, SAS, Tibco and IBM took a lead in this domain. However, open-source alternatives have not only caught up but are leading in several areas: CDQM (Apache Griffin), data routing and distribution (Apache NiFi), and of course Spark, Kafka and Airflow, which have become synonymous with large-scale data (batch or stream) and flow handling. The data you already have could be dirty too, and that is where OpenRefine, with its MetricDoc functionality, does a fabulous job. In my opinion, OpenRefine needs a makeover on the usability front, and Griffin would do well to subsume the functionality of data search and cataloguing tools like Amundsen (Lyft), Metacat (Netflix) or DataHub (LinkedIn). Airtable addresses the space of making data available in a schematic way, no matter what the data is or where it resides. Similarly, Alteryx is bringing flexibility to analytics operations for data engineers and is increasingly investing in cross-domain analytics. Superset lets business users explore their data.
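To make the data-quality discussion concrete, here is a minimal, library-free sketch of two capabilities that tools like Griffin and OpenRefine automate at scale: profiling a column (completeness, distinct values, top values) and standardizing messy free-text values. The `profile_column` and `standardize_country` helpers and the sample rows are my own illustrations, not any tool's API.

```python
from collections import Counter

def profile_column(rows, col):
    """Basic data-quality metrics for one column: completeness
    (non-null, non-empty rate), distinct count, and most common values."""
    values = [r.get(col) for r in rows]
    non_null = [v for v in values if v not in (None, "")]
    return {
        "completeness": len(non_null) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
    }

# Hypothetical canonical mapping for one field, as a cleaning rule.
CANONICAL = {"usa": "US", "u.s.": "US", "united states": "US"}

def standardize_country(value):
    """Normalize messy country strings to a canonical code;
    unknown values pass through unchanged."""
    v = (value or "").strip().lower()
    return CANONICAL.get(v, value)

rows = [
    {"country": "USA"}, {"country": "u.s."}, {"country": None},
    {"country": "United States"}, {"country": "India"},
]
report = profile_column(rows, "country")        # completeness: 0.8
cleaned = [standardize_country(r["country"]) for r in rows]
```

Production DQ tools wrap exactly this kind of rule in connectivity, workflow and monitoring layers, which is where the real engineering effort lies.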
KPI observability and storied data: trends and clues
In the observability, or APM (application performance monitoring), space, commercial vendors like AppDynamics and New Relic have taken the lead. However, Apache SkyWalking and Jaeger (Uber) are catching up quickly and together could field an alternative in the APM-as-a-service space. Jaeger's collector framework and tracing are more flexible and scalable than SkyWalking's and could easily be integrated with it; SkyWalking's visual topology outsmarts its commercial rivals too. The KPI observability space still has gaps around end-to-end tracing: which causal relationship to track, which directed graph to retain so that it represents the correct sequence of events flowing from storage to infrastructure to application to service, and which metadata trace to correlate with a given diagnostic. This is an area that needs a lot more work from vendors. In an adjacent domain, AIOps, vendors are still grappling with offerings for ML engineers, and in some part for scientists too. But the domain of observing business KPIs, like the cost of deteriorating predictions or the cost of algorithm rewrites or model retraining, has remained largely untouched.
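To illustrate what end-to-end tracing must reconstruct, here is a toy sketch of spans linked by parent pointers, from which the causal chain (storage to infra to application to service) can be recovered. The `Span` structure and the span names are hypothetical and far simpler than the models Jaeger or SkyWalking actually use.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One hop in a distributed trace (illustrative only)."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    parent_id: Optional[str] = None   # None marks the root span

def causal_chain(spans, leaf_id):
    """Walk parent pointers from a leaf span back to the root,
    recovering the directed sequence of events behind one request."""
    by_id = {s.span_id: s for s in spans}
    chain, cur = [], by_id.get(leaf_id)
    while cur is not None:
        chain.append(cur.name)
        cur = by_id.get(cur.parent_id)
    return list(reversed(chain))

trace = uuid.uuid4().hex
root = Span("storage.read", trace)
infra = Span("infra.network", trace, parent_id=root.span_id)
app = Span("app.transform", trace, parent_id=infra.span_id)
svc = Span("service.api", trace, parent_id=app.span_id)
chain = causal_chain([root, infra, app, svc], svc.span_id)
```

The hard part the text points at is not this walk but deciding, among thousands of candidate spans, which edges of the graph to retain and which metadata to correlate with a given diagnostic.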
If you read the NewVantage survey referenced above, you will notice a peculiar data point: less than 12% of CDOs (chief data/science/analytics officers) carry responsibilities tied to revenue numbers. Given the sample size of the survey, it is fair to assume that less than 5% of Fortune 1000 CDOs have revenue-impacting responsibilities. The same report rues that most decision-makers in these enterprises are not data-savvy. Business users work at an abstraction level, most often avoiding the weeds. Herein lies the third big gaping hole in this domain, what I term storytelling. Algorithmic predictions based on trends will see accelerating production deployments, but getting models to discover data and insights from other models will see richer research in the short term.
Streaming
Industry 4.0, IoT, supply-chain resilience, digital twins, oil and gas, smart cities and the ever-growing data volumes of BFSI have brought data stream processing and analytics to the centre stage of technological maturity and innovation. There are choices, even among open-source alternatives, to handle these use-cases. A typical deployment will look like this:

It is difficult to argue which distributed messaging platform is better between Apache Kafka (LinkedIn) and Pulsar (Yahoo), except that if messages of different types must be stored and therefore processed with differing priority, I would go with Pulsar. For real-time data processing, Apache Flink is the best known. Flink processes elements as they occur rather than in micro-batches like Spark Streaming; for that reason, comparing Flink with Spark is not quite fair, and it is more appropriate to compare it with Storm. Flink, besides being easier to deploy, beats Storm by natively supporting event time (when the event was generated) rather than only the processing or ingestion time of the event. Its watermark implementation also makes out-of-order data easier to handle than in Storm. If the use-case requires real-time analytics, say anomaly detection, it is best to bypass Kafka and feed the data directly to Flink. If SQL is the preferred way to analyse streaming data, it is best to route the messages from Kafka to Confluent KSQL, which has better SQL support than Flink SQL.
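The event-time versus processing-time distinction is easier to see in code. The sketch below simulates Flink-style tumbling windows in plain Python: events are bucketed by the time they occurred, a watermark trails the highest event time seen by a lateness allowance, a window is emitted only once the watermark passes its end, and anything arriving later than that is dropped. This is my simplified model of the semantics, not Flink's API.

```python
from collections import defaultdict

def event_time_windows(events, window=10, max_lateness=5):
    """Tumbling windows keyed by *event* time, not arrival time.
    `events` is an iterable of (event_time, payload) in arrival order."""
    windows = defaultdict(list)   # window start -> payloads collected so far
    emitted = {}                  # closed windows
    watermark = float("-inf")
    for ts, value in events:
        start = (ts // window) * window
        if start + window <= watermark:
            continue              # arrived after its window closed: dropped
        windows[start].append(value)
        # Watermark advances with the max event time seen, minus lateness.
        watermark = max(watermark, ts - max_lateness)
        for s in [s for s in windows if s + window <= watermark]:
            emitted[s] = windows.pop(s)   # close windows the watermark passed
    for s, vals in windows.items():
        emitted[s] = vals         # flush remaining windows at end of stream
    return emitted

# Events arrive out of order: (4, "c") arrives after (12, "b") but still
# lands in window [0, 10); (3, "e") arrives after that window has closed.
arrivals = [(1, "a"), (12, "b"), (4, "c"), (25, "d"), (3, "e")]
result = event_time_windows(arrivals)
```

With processing time, "c" would have been counted in whichever window happened to be open when it arrived; event time plus watermarks is exactly what puts it back where it belongs.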
If the use-case involves streaming text processing, basic machine learning on streaming data (like linear regression), or distributed graph analysis (like PageRank), it is best to use Husky or Timely Dataflow. Husky's performance is usually better than Timely's, which is better than Flink's, which is better than Spark's.
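As a flavour of the "basic machine learning on a stream" case, here is a single-pass online linear regression: one SGD update per arriving event, never materializing the dataset. This is a generic sketch of in-stream learning, not Husky's or Timely Dataflow's actual API; the learning rate and the simulated stream are my own choices.

```python
def online_linear_regression(stream, lr=0.01):
    """Fit y ~ w*x + b, updating the model on each (x, y) pair as it
    arrives. Memory use is constant regardless of stream length."""
    w, b = 0.0, 0.0
    for x, y in stream:
        err = (w * x + b) - y     # prediction error on this event
        w -= lr * err * x         # gradient step for the slope
        b -= lr * err             # gradient step for the intercept
    return w, b

# Simulate a stream drawn from y = 2x (x cycles through 0..9).
stream = ((x % 10, 2 * (x % 10)) for x in range(5000))
w, b = online_linear_regression(stream)   # w converges towards 2, b towards 0
```

Systems like Husky and Timely Dataflow distribute this style of per-event update across workers, which is where their performance differences show up.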


