IGNITE-7437: Partition based dataset implementation


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.5
    • Component/s: ml
    • Labels: None

    Description

      We want to implement our datasets based on entire partitions instead of key sets.

      The main idea behind partition based datasets is the classic MapReduce approach.

      The most important advantage of MapReduce is the ability to perform computations on data distributed across the cluster without significant data transmission over the network. This idea is adopted in partition based datasets in the following way:

      1. Every dataset consists of partitions.
      2. Every partition consists of a context, built on top of an Apache Ignite cache, and recoverable data stored locally on every node.
      3. Computations that need to be performed on a dataset are split into Map operations, which are executed on every partition, and Reduce operations, which fold the results of the Map operations into one final result (see the sketch below).
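
      The sketch below is a minimal, single-JVM illustration of this map/reduce contract. The Dataset and LocalDataset names, the compute signature and the way partitions are stored are simplified assumptions made for this description, not the actual Ignite ML API.

{code:java}
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

/**
 * Minimal illustration of the map/reduce contract described above (illustrative names,
 * not the actual Ignite ML API).
 *
 * @param <C> Type of the partition context (kept in an Apache Ignite cache).
 * @param <D> Type of the partition data (recoverable, stored locally on the node).
 */
interface Dataset<C, D> {
    /** Applies {@code map} to every partition and folds per-partition results with {@code reduce}. */
    <R> R compute(BiFunction<C, D, R> map, BinaryOperator<R> reduce);
}

/** Toy single-JVM dataset used only to demonstrate the contract. */
class LocalDataset<C, D> implements Dataset<C, D> {
    private final List<C> ctxParts;
    private final List<D> dataParts;

    LocalDataset(List<C> ctxParts, List<D> dataParts) {
        this.ctxParts = ctxParts;
        this.dataParts = dataParts;
    }

    /** {@inheritDoc} */
    @Override public <R> R compute(BiFunction<C, D, R> map, BinaryOperator<R> reduce) {
        R res = null;

        // Map phase: executed independently for every partition (on a cluster this runs on
        // the node that owns the partition, so the raw data never leaves that node).
        for (int part = 0; part < dataParts.size(); part++) {
            R partRes = map.apply(ctxParts.get(part), dataParts.get(part));

            // Reduce phase: only the small per-partition results are combined into one value.
            res = res == null ? partRes : reduce.apply(res, partRes);
        }

        return res;
    }
}
{code}

      On a real cluster the Map phase is executed where each partition resides, so only the small per-partition results travel over the network to be reduced.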

      Why have partitions been selected as the building block of the dataset and learning context, instead of cluster nodes?

      One of the fundamental ideas of the Apache Ignite cache is that partitions are atomic, which means that they cannot be split between multiple nodes. As a result, in the case of rebalancing or node failure, a partition will be recovered on another node with the same data it contained on the previous node.

      For machine learning algorithms this is very important, because most ML algorithms are iterative and require some context to be maintained between iterations. This context cannot be split or merged and should be kept in a consistent state during the whole learning process.
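
      To illustrate why the per-partition context matters, the toy example below reuses the Dataset and LocalDataset types from the sketch above (all names and the learning step are illustrative assumptions). It runs an iterative gradient descent in which the current weight is kept in every partition's context between compute() calls instead of being rebuilt on each iteration.

{code:java}
import java.util.Arrays;
import java.util.List;

/** Iterative use of the toy dataset: the per-partition context survives between iterations. */
class IterativeExample {
    public static void main(String[] args) {
        // Two partitions. Each partition's data is a set of (x, y) rows and its context is a
        // mutable one-element weight vector maintained between iterations.
        List<double[]> contexts = Arrays.asList(new double[] {0.0}, new double[] {0.0});
        List<double[][]> data = Arrays.asList(
            new double[][] {{1, 2}, {2, 4}},
            new double[][] {{3, 6}, {4, 8}});

        Dataset<double[], double[][]> dataset = new LocalDataset<>(contexts, data);

        for (int iter = 0; iter < 50; iter++) {
            // Map: local gradient of sum (y - w * x)^2, computed with the weight stored in the
            // partition context; Reduce: sum of the per-partition gradients.
            double grad = dataset.compute(
                (ctx, rows) -> {
                    double g = 0.0;
                    for (double[] row : rows)
                        g += -2 * row[0] * (row[1] - ctx[0] * row[0]);
                    return g;
                },
                Double::sum);

            // Push the updated weight back into every partition context (a plain in-memory update
            // in this toy example; on a cluster the context lives in the Ignite cache).
            for (double[] ctx : contexts)
                ctx[0] -= 0.01 * grad;
        }

        // The data follows y = 2 * x, so the learned weight should be close to 2.0.
        System.out.println("Learned weight: " + contexts.get(0)[0]);
    }
}
{code}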

      Another idea behind partition based datasets is that the data in every partition should be kept in a BLAS-like format as much as possible.

      BLAS and CUDA can make machine learning up to 100x faster and more reliable than algorithms based on hand-written linear algebra subroutines, which means that not using BLAS is a recipe for disaster. In other words, we need to keep data in a BLAS-like format at any price.
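
      As a rough illustration of what such a layout could look like, the sketch below stores all feature rows of a partition in a single dense, column-major double[] array. The class and method names are assumptions made for this description, not the Ignite ML API; the inner loop is written the way a BLAS dgemv call would consume the array and would be replaced by the actual BLAS call in production.

{code:java}
/**
 * Illustrative partition data holder that keeps all feature rows of the partition in one
 * dense, column-major double[] so it can be handed to BLAS routines without copying.
 */
class DensePartitionData implements java.io.Serializable {
    /** Feature matrix in column-major order (BLAS convention): element (i, j) is at j * rows + i. */
    private final double[] features;
    /** Labels, one per row. */
    private final double[] labels;
    /** Number of rows (vectors) stored in this partition. */
    private final int rows;
    /** Number of columns (features) per vector. */
    private final int cols;

    DensePartitionData(double[] features, double[] labels, int rows, int cols) {
        this.features = features;
        this.labels = labels;
        this.rows = rows;
        this.cols = cols;
    }

    /** Reads feature (row, col) from the flat, column-major array. */
    double feature(int row, int col) {
        return features[col * rows + row];
    }

    /** Reads the label of the given row. */
    double label(int row) {
        return labels[row];
    }

    /** y = A * x over the flat array, in the same access order a BLAS dgemv call would use. */
    double[] times(double[] x) {
        double[] y = new double[rows];
        for (int j = 0; j < cols; j++)
            for (int i = 0; i < rows; i++)
                y[i] += features[j * rows + i] * x[j];
        return y;
    }
}
{code}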

People

    Assignee: Anton Dmitriev (dmitrievanthony)
    Reporter: Yury Babak (chief)
