Beyond the Lambda Architecture: Effective Scheduling for Large Scale EO Information Mining and Interactive Thematic Mapping

Authors: Marco Quartulli Javier Lozano Silva Igor García Olaizola

Date: 27.07.2015


PDF

Abstract

As per the 2013 EOSDIS annual metrics reportfootnote{url{https://earthdata.nasa.gov/sites/default/files/field/document/FY13AnnualReport_V4.xlsx}, visited 09 Jan 2015}, Petabyte-scale Earth Observation (EO) raster data archive volumes are growing at rates of about ten Gigabytes per day while around 95% of their contents have never been accessed by a human observer~cite{koubarakis2012building}. Metadata-based search clearly needs to be complemented by semi-automatic raster catalog content mining.

While large scale near real-time data processing is a crucial component of EO mining for the interactive semantic labeling of archive contents, available architectures~cite{quartulli2013review} typically only permit batch processing strategies on large scale coverages, hence trying to adapt continuous ingestion from rolling archives and catalog content index updates to this prevalent paradigm.

In this contribution, we propose instead to consider the streaming approach as the standard one for EO mining, adapting large and historical batch data processing to a near-real time scenario by task queueing and smart scheduling.

To this end, we show how the organization of the data on N-dimensional lattices and the locality of access patterns in the spatial and in the resolution dimensions allow us to define and exploit a methodology that is based on streaming cluster computing frameworks~cite{zaharia2012discretized}, Hilbert curve scheduling~cite{drozdowski2010scheduling} and multi-scale pyramid decompositions for optimizing access to distributed storage and computing resources and maximize perceived processing speed in interactive operations.

In doing so, we propose a methodology to improve on the `Lambda architecture'~cite{marz2013big}, the prevalent approach in managing the contradiction between the large sizes of remote sensing products and the significant data rates their processing involves, especially in the interactive training sessions needed for semantic labeling of the archive contents.
The Lambda architecture solves the problem of computing arbitrary functions on arbitrary data in realtime by combining (as in a $Lambda$-shaped diagram) a batch layer for processing large scale historical data and a streaming layer for processing items being retrieved in real time from an input queue.

While resulting systems have good scalability characteristics, their composition in terms of two different and typically incompatible processing paradigms represents a significant disadvantage in terms of programming complexity and maintainability.
Our approach effectively supersedes the Lambda architecture for one focusing on streaming exclusively for both the near-real time and the batch processing in the design of distributed EO mining systems.

The specific and simplified scenario we consider is that of interactive thematic mapping over Gigapixel input coverages supervised via a tile-based user interface (UI). 
As per~cite{mascolo2014bids}, machine learning algorithms based on cluster computing frameworks~cite{zaharia2012resilient} can be used to try and manage the rapid growth in processing costs, yet in the case of near-real time processing their operation can effectively be improved by the adoption of specific scheduling mechanisms.

BIB_text

@Article {
title = {Beyond the Lambda Architecture: Effective Scheduling for Large Scale EO Information Mining and Interactive Thematic Mapping},
pages = {1492-1495},
keywds = {

Big Data, thematic mapping


}
abstract = {

As per the 2013 EOSDIS annual metrics reportfootnote{url{https://earthdata.nasa.gov/sites/default/files/field/document/FY13AnnualReport_V4.xlsx}, visited 09 Jan 2015}, Petabyte-scale Earth Observation (EO) raster data archive volumes are growing at rates of about ten Gigabytes per day while around 95% of their contents have never been accessed by a human observer~cite{koubarakis2012building}. Metadata-based search clearly needs to be complemented by semi-automatic raster catalog content mining.

While large scale near real-time data processing is a crucial component of EO mining for the interactive semantic labeling of archive contents, available architectures~cite{quartulli2013review} typically only permit batch processing strategies on large scale coverages, hence trying to adapt continuous ingestion from rolling archives and catalog content index updates to this prevalent paradigm.

In this contribution, we propose instead to consider the streaming approach as the standard one for EO mining, adapting large and historical batch data processing to a near-real time scenario by task queueing and smart scheduling.

To this end, we show how the organization of the data on N-dimensional lattices and the locality of access patterns in the spatial and in the resolution dimensions allow us to define and exploit a methodology that is based on streaming cluster computing frameworks~cite{zaharia2012discretized}, Hilbert curve scheduling~cite{drozdowski2010scheduling} and multi-scale pyramid decompositions for optimizing access to distributed storage and computing resources and maximize perceived processing speed in interactive operations.

In doing so, we propose a methodology to improve on the `Lambda architecture'~cite{marz2013big}, the prevalent approach in managing the contradiction between the large sizes of remote sensing products and the significant data rates their processing involves, especially in the interactive training sessions needed for semantic labeling of the archive contents.
The Lambda architecture solves the problem of computing arbitrary functions on arbitrary data in realtime by combining (as in a $Lambda$-shaped diagram) a batch layer for processing large scale historical data and a streaming layer for processing items being retrieved in real time from an input queue.

While resulting systems have good scalability characteristics, their composition in terms of two different and typically incompatible processing paradigms represents a significant disadvantage in terms of programming complexity and maintainability.
Our approach effectively supersedes the Lambda architecture for one focusing on streaming exclusively for both the near-real time and the batch processing in the design of distributed EO mining systems.

The specific and simplified scenario we consider is that of interactive thematic mapping over Gigapixel input coverages supervised via a tile-based user interface (UI). 
As per~cite{mascolo2014bids}, machine learning algorithms based on cluster computing frameworks~cite{zaharia2012resilient} can be used to try and manage the rapid growth in processing costs, yet in the case of near-real time processing their operation can effectively be improved by the adoption of specific scheduling mechanisms.


}
isbn = {978-1-4799-7929-5},
isi = {1},
date = {2015-07-27},
year = {2015},
}
Vicomtech

Parque Científico y Tecnológico de Gipuzkoa,
Paseo Mikeletegi 57,
20009 Donostia / San Sebastián (Spain)

+(34) 943 309 230

close overlay