Directed Acyclic Graph (DAG)

A directed acyclic graph (DAG) is a conceptual representation of a series of activities. The graph depicts the order of the activities and is usually drawn as a set of circles, each representing an activity, some of which are connected by lines that represent the flow from one activity to another. Each circle is known as a “vertex,” and each line is known as an “edge.” “Directed” means that each edge has a defined direction, so each edge represents a one-way flow from one vertex to another. “Acyclic” means that there are no loops (i.e., “cycles”) in the graph: if you start at any vertex and follow the edges in their direction, no path leads back to that starting vertex.
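To make the definition concrete, here is a minimal Python sketch. It stores a directed graph as a dictionary that maps each vertex to the vertices its edges point to, and checks the "acyclic" property with Kahn's algorithm. The vertex names and the `is_acyclic` helper are illustrative assumptions, not part of any particular product.

```python
from collections import deque

# Hypothetical example graph: each key is a vertex, and each value lists the
# vertices that its outgoing edges point to.
EXAMPLE_DAG = {
    "load": ["clean"],
    "clean": ["summarize", "store"],
    "summarize": ["store"],
    "store": [],
}


def is_acyclic(graph):
    """Return True if the directed graph contains no cycles (Kahn's algorithm)."""
    # Count incoming edges for every vertex, including vertices that only
    # appear as edge targets.
    in_degree = {vertex: 0 for vertex in graph}
    for targets in graph.values():
        for target in targets:
            in_degree[target] = in_degree.get(target, 0) + 1

    # Repeatedly remove vertices that have no remaining incoming edges.
    ready = deque(v for v, degree in in_degree.items() if degree == 0)
    removed = 0
    while ready:
        vertex = ready.popleft()
        removed += 1
        for target in graph.get(vertex, []):
            in_degree[target] -= 1
            if in_degree[target] == 0:
                ready.append(target)

    # If a cycle exists, its vertices never reach in-degree zero.
    return removed == len(in_degree)
```

Adding an edge from "store" back to "load" would create a cycle, and `is_acyclic` would then return False.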

Why Are Directed Acyclic Graphs Useful?

DAGs help represent many different types of flows, including data processing flows. By thinking about large-scale processing flows in terms of DAGs, one can more clearly organize the various steps and the order in which they run. In many data processing environments, a series of computations is run on the data to prepare it for one or more ultimate destinations. This type of data processing flow is often called a data pipeline. As an example, sales transaction data might be processed immediately to prepare it for making real-time recommendations to consumers. As part of the processing lifecycle, the data can go through many steps, including cleansing (correcting incorrect/invalid data), aggregation (calculating summaries), enrichment (identifying relationships with other relevant data), and transformation (writing the data into a new format).
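As a rough illustration of a pipeline organized as a DAG, the sketch below declares which step depends on which and lets Python's standard `graphlib.TopologicalSorter` derive a valid run order. The step order, the record fields (`sku`, `amount`), the category lookup, and the CSV-style output are all assumptions made for the example.

```python
from graphlib import TopologicalSorter


def cleanse(records):
    # Drop records whose amount is missing or negative.
    return [
        r for r in records
        if isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0
    ]


def enrich(records):
    # Attach a product category from a hypothetical lookup table.
    categories = {"A1": "books", "B2": "games"}
    return [{**r, "category": categories.get(r["sku"], "unknown")} for r in records]


def aggregate(records):
    # Sum the amounts per category.
    totals = {}
    for r in records:
        totals[r["category"]] = totals.get(r["category"], 0) + r["amount"]
    return totals


def transform(totals):
    # Write the summary in a new format: one "category,total" line per entry.
    return [f"{category},{total:.2f}" for category, total in sorted(totals.items())]


STEPS = {"cleanse": cleanse, "enrich": enrich, "aggregate": aggregate, "transform": transform}

# Each step maps to the set of steps that must finish before it starts.
PREDECESSORS = {
    "cleanse": set(),
    "enrich": {"cleanse"},
    "aggregate": {"enrich"},
    "transform": {"aggregate"},
}


def run_pipeline(records):
    data = records
    # static_order() yields the steps so that every predecessor comes first.
    for name in TopologicalSorter(PREDECESSORS).static_order():
        data = STEPS[name](data)
    return data
```

Because the dependencies here form a single chain, only one run order is valid. A DAG with branches would allow several valid orders, and independent steps could run in parallel.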

Characteristics of DAGs in Stream Processing

One key characteristic of DAGs and the data processing flows that they model is that there can be multiple paths in the flow. This is important because it recognizes the need to process data in multiple ways to accommodate different outputs and needs. In the example flow below, a stream of sensor data is processed. The streaming data is first loaded from the sensors and then separated by sensor type. Sensor X data is summarized per second and then analyzed in real time. If any critical status is observed, an alert is sent. The data is also saved to long-term storage for possible later analysis. The flow also includes data from sensor Y, which for now is summarized per minute and then stored in the same long-term store as the data from sensor X.

Figure: A stream of sensor data represented as a directed acyclic graph (DAG).
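The sensor flow described above can be sketched as a small Python structure, with a helper that lists every path from the source to an end point. The vertex names are invented labels for the steps in the text, and the sketch assumes that the per-second summary of sensor X is what feeds long-term storage; the original figure may draw that edge differently.

```python
# Hypothetical vertex names for the steps described in the text. Each key maps
# to the vertices that its outgoing edges point to.
SENSOR_DAG = {
    "load_sensor_data": ["split_by_sensor_type"],
    "split_by_sensor_type": ["summarize_x_per_second", "summarize_y_per_minute"],
    "summarize_x_per_second": ["analyze_x", "long_term_store"],
    "analyze_x": ["send_alert"],
    "summarize_y_per_minute": ["long_term_store"],
    "send_alert": [],
    "long_term_store": [],
}


def all_paths(graph, start):
    """Return every path from start to a vertex with no outgoing edges."""
    if not graph[start]:
        return [[start]]
    paths = []
    for next_vertex in graph[start]:
        for tail in all_paths(graph, next_vertex):
            paths.append([start] + tail)
    return paths


if __name__ == "__main__":
    for path in all_paths(SENSOR_DAG, "load_sensor_data"):
        print(" -> ".join(path))
```

The recursion terminates only because the graph is acyclic. Under these assumptions it finds three paths: one ending in an alert and two ending in the shared long-term store.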

DAGs in Batch Processing

To give an example of how DAGs apply to batch processing pipelines, suppose you have a database of global sales, and you want a report of all sales by region, expressed in U.S. dollars. You might first load all data into a processing engine, separate the data by currency, convert the financial figures to U.S. dollars, summarize the data by country/region, and then bring all the data together into a final report. Suppose, too, that the U.S.-only data will also be written to a separate report. This data flow could be represented by the DAG shown below.

Figure: Global sales data in a batch processing environment represented as a directed acyclic graph (DAG).
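A minimal Python sketch of this batch flow follows, with one function per step. The exchange rates, the sample records, and the function names are hypothetical, and deriving the U.S.-only report from the regional totals is one possible reading of the flow rather than the only one.

```python
from collections import defaultdict

# Hypothetical exchange rates to U.S. dollars, chosen only for illustration.
USD_RATES = {"USD": 1.0, "EUR": 1.25, "GBP": 1.5}

# Hypothetical sample of the global sales data.
SALES = [
    {"region": "US", "currency": "USD", "amount": 100.0},
    {"region": "Germany", "currency": "EUR", "amount": 200.0},
    {"region": "UK", "currency": "GBP", "amount": 80.0},
    {"region": "US", "currency": "USD", "amount": 50.0},
]


def split_by_currency(sales):
    # Separate the records into one group per currency.
    groups = defaultdict(list)
    for sale in sales:
        groups[sale["currency"]].append(sale)
    return groups


def convert_to_usd(groups):
    # Convert every group to U.S. dollars and merge the results.
    converted = []
    for currency, rows in groups.items():
        rate = USD_RATES[currency]
        converted.extend(
            {"region": row["region"], "usd": row["amount"] * rate} for row in rows
        )
    return converted


def summarize_by_region(rows):
    # Total the converted amounts per country/region.
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["usd"]
    return dict(totals)


def build_reports(sales):
    totals = summarize_by_region(convert_to_usd(split_by_currency(sales)))
    # The flow branches here: the same totals feed two separate outputs.
    global_report = totals
    us_report = {region: total for region, total in totals.items() if region == "US"}
    return global_report, us_report
```

Calling `build_reports(SALES)` returns the global report and the U.S.-only report as two dictionaries, mirroring the two end points of the DAG.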

Since DAGs apply to both batch and stream processing, it is increasingly common to have hybrid data processing environments that handle both stream and batch data sets. Technologies such as Hazelcast Platform, designed to handle both types of data, let companies build architectures that take advantage of all their data.
