A comparison on scalability for batch big data processing on Apache Spark and Apache Flink


Ontology type: schema:ScholarlyArticle      Open Access: True


Article Info

DATE

2017-03-01

AUTHORS

Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera

ABSTRACT

The large amounts of data have created a need for new frameworks for processing. The MapReduce model is a framework for processing and generating large-scale datasets with parallel and distributed algorithms. Apache Spark is a fast and general engine for large-scale data processing based on the MapReduce model. The main feature of Spark is in-memory computation. Recently, a novel framework called Apache Flink has emerged, focused on distributed stream and batch data processing. In this paper we perform a comparative study on the scalability of these two frameworks using the corresponding Machine Learning libraries for batch data processing. Additionally, we analyze the performance of the two Machine Learning libraries that Spark currently has, MLlib and ML. For the experiments, the same algorithms and the same dataset are used. Experimental results show that Spark MLlib has better performance and overall lower runtimes than Flink.

PAGES

1

Identifiers

URI

http://scigraph.springernature.com/pub.10.1186/s41044-016-0020-2

DOI

http://dx.doi.org/10.1186/s41044-016-0020-2

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1064134953



JSON-LD is the canonical representation for SciGraph data.


[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information and Computing Sciences", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Artificial Intelligence and Image Processing", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "alternateName": "Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, Calle Periodista Daniel Saucedo Aranda, 18071, Granada, Spain", 
          "id": "http://www.grid.ac/institutes/grid.4489.1", 
          "name": [
            "Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, Calle Periodista Daniel Saucedo Aranda, 18071, Granada, Spain"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Garc\u00eda-Gil", 
        "givenName": "Diego", 
        "id": "sg:person.013613661642.85", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013613661642.85"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, Calle Periodista Daniel Saucedo Aranda, 18071, Granada, Spain", 
          "id": "http://www.grid.ac/institutes/grid.4489.1", 
          "name": [
            "Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, Calle Periodista Daniel Saucedo Aranda, 18071, Granada, Spain"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Ram\u00edrez-Gallego", 
        "givenName": "Sergio", 
        "id": "sg:person.015360430211.35", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.015360430211.35"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Faculty of Computing and Information Technology, King Abdulaziz University, North Jeddah, Saudi Arabia", 
          "id": "http://www.grid.ac/institutes/grid.412125.1", 
          "name": [
            "Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, Calle Periodista Daniel Saucedo Aranda, 18071, Granada, Spain", 
            "Faculty of Computing and Information Technology, King Abdulaziz University, North Jeddah, Saudi Arabia"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Garc\u00eda", 
        "givenName": "Salvador", 
        "id": "sg:person.01221271101.39", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01221271101.39"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Faculty of Computing and Information Technology, King Abdulaziz University, North Jeddah, Saudi Arabia", 
          "id": "http://www.grid.ac/institutes/grid.412125.1", 
          "name": [
            "Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, Calle Periodista Daniel Saucedo Aranda, 18071, Granada, Spain", 
            "Faculty of Computing and Information Technology, King Abdulaziz University, North Jeddah, Saudi Arabia"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Herrera", 
        "givenName": "Francisco", 
        "id": "sg:person.011360734641.33", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011360734641.33"
        ], 
        "type": "Person"
      }
    ], 
    "datePublished": "2017-03-01", 
    "datePublishedReg": "2017-03-01", 
    "description": "The large amounts of data have created a need for new frameworks for processing. The MapReduce model is a framework for processing and generating large-scale datasets with parallel and distributed algorithms. Apache Spark is a fast and general engine for large-scale data processing based on the MapReduce model. The main feature of Spark is the in-memory computation. Recently a novel framework called Apache Flink has emerged, focused on distributed stream and batch data processing. In this paper we perform a comparative study on the scalability of these two frameworks using the corresponding Machine Learning libraries for batch data processing. Additionally we analyze the performance of the two Machine Learning libraries that Spark currently has, MLlib and ML. For the experiments, the same algorithms and the same dataset are being used. Experimental results show that Spark MLlib has better perfomance and overall lower runtimes than Flink.", 
    "genre": "article", 
    "id": "sg:pub.10.1186/s41044-016-0020-2", 
    "isAccessibleForFree": true, 
    "isPartOf": [
      {
        "id": "sg:journal.1158761", 
        "issn": [
          "2058-6345"
        ], 
        "name": "Big Data Analytics", 
        "publisher": "Springer Nature", 
        "type": "Periodical"
      }, 
      {
        "issueNumber": "1", 
        "type": "PublicationIssue"
      }, 
      {
        "type": "PublicationVolume", 
        "volumeNumber": "2"
      }
    ], 
    "keywords": [
      "machine learning libraries", 
      "batch data processing", 
      "data processing", 
      "Apache Flink", 
      "Apache Spark", 
      "MapReduce model", 
      "learning libraries", 
      "large-scale data processing", 
      "big data processing", 
      "large-scale datasets", 
      "Spark MLlib", 
      "memory computation", 
      "low runtime", 
      "general engine", 
      "novel framework", 
      "Flink", 
      "same dataset", 
      "same algorithm", 
      "MLlib", 
      "new framework", 
      "scalability", 
      "good perfomance", 
      "Spark", 
      "experimental results", 
      "dataset", 
      "algorithm", 
      "framework", 
      "processing", 
      "large amount", 
      "runtime", 
      "main features", 
      "library", 
      "computation", 
      "engine", 
      "streams", 
      "model", 
      "performance", 
      "features", 
      "perfomance", 
      "need", 
      "parallel", 
      "data", 
      "experiments", 
      "comparative study", 
      "amount", 
      "results", 
      "comparison", 
      "study", 
      "mL", 
      "paper"
    ], 
    "name": "A comparison on scalability for batch big data processing on Apache Spark and Apache Flink", 
    "pagination": "1", 
    "productId": [
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1064134953"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1186/s41044-016-0020-2"
        ]
      }
    ], 
    "sameAs": [
      "https://doi.org/10.1186/s41044-016-0020-2", 
      "https://app.dimensions.ai/details/publication/pub.1064134953"
    ], 
    "sdDataset": "articles", 
    "sdDatePublished": "2022-08-04T17:04", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-springernature-scigraph/baseset/20220804/entities/gbq_results/article/article_732.jsonl", 
    "type": "ScholarlyArticle", 
    "url": "https://doi.org/10.1186/s41044-016-0020-2"
  }
]
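As a minimal sketch of how a record like the one above might be consumed, the following Python snippet parses a trimmed excerpt of the JSON-LD and reads the author names and the DOI from it. The excerpt keeps only fields that appear above; the helper names `author_names` and `product_id` are hypothetical and are not part of any SciGraph tooling.

```python
import json

# Trimmed excerpt of the record shown above (affiliations and most fields omitted).
RECORD = json.loads("""
[
  {
    "name": "A comparison on scalability for batch big data processing on Apache Spark and Apache Flink",
    "datePublished": "2017-03-01",
    "author": [
      {"familyName": "García-Gil", "givenName": "Diego", "type": "Person"},
      {"familyName": "Ramírez-Gallego", "givenName": "Sergio", "type": "Person"},
      {"familyName": "García", "givenName": "Salvador", "type": "Person"},
      {"familyName": "Herrera", "givenName": "Francisco", "type": "Person"}
    ],
    "productId": [
      {"name": "dimensions_id", "type": "PropertyValue", "value": ["pub.1064134953"]},
      {"name": "doi", "type": "PropertyValue", "value": ["10.1186/s41044-016-0020-2"]}
    ]
  }
]
""")


def author_names(article):
    """Return 'Given Family' strings in the order the record lists them."""
    return [f"{a['givenName']} {a['familyName']}" for a in article["author"]]


def product_id(article, name):
    """Return the first value of the productId entry called `name`, or None."""
    for entry in article.get("productId", []):
        if entry["name"] == name:
            return entry["value"][0]
    return None


if __name__ == "__main__":
    article = RECORD[0]  # the document is a JSON array holding one article object
    print(author_names(article))
    print(product_id(article, "doi"))
```

The record is a JSON array wrapping a single object, so the sketch indexes `RECORD[0]` before reading any field.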
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1186/s41044-016-0020-2'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1186/s41044-016-0020-2'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1186/s41044-016-0020-2'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1186/s41044-016-0020-2'
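The four curl commands above differ only in the `Accept` header, which is plain HTTP content negotiation. A rough Python equivalent using only the standard library might look like the sketch below. The format labels in `ACCEPT_TYPES` and the function names are assumptions made for illustration, and whether the endpoint still answers these requests has not been verified here.

```python
import urllib.request

# URL taken from the curl commands above.
BASE_URL = "https://scigraph.springernature.com/pub.10.1186/s41044-016-0020-2"

# Short labels (hypothetical) mapped to the media types used in the curl commands.
ACCEPT_TYPES = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(fmt, url=BASE_URL):
    """Build a GET request asking for the serialization named by `fmt`."""
    if fmt not in ACCEPT_TYPES:
        raise ValueError(f"unknown format: {fmt!r}")
    return urllib.request.Request(url, headers={"Accept": ACCEPT_TYPES[fmt]})


def fetch(fmt, url=BASE_URL, timeout=30):
    """Perform the request and return the response body as text (needs network access)."""
    with urllib.request.urlopen(build_request(fmt, url), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch("turtle")` would correspond to the Turtle curl command. Building the request is kept separate from sending it so the header logic can be checked without network access.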


 

This table displays all metadata directly associated with this object as RDF triples.

131 TRIPLES      20 PREDICATES      74 URIs      66 LITERALS      6 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1186/s41044-016-0020-2 schema:about anzsrc-for:08
2 anzsrc-for:0801
3 schema:author N2c571131f82245a1b3c696db9c6d62be
4 schema:datePublished 2017-03-01
5 schema:datePublishedReg 2017-03-01
6 schema:description The large amounts of data have created a need for new frameworks for processing. The MapReduce model is a framework for processing and generating large-scale datasets with parallel and distributed algorithms. Apache Spark is a fast and general engine for large-scale data processing based on the MapReduce model. The main feature of Spark is the in-memory computation. Recently a novel framework called Apache Flink has emerged, focused on distributed stream and batch data processing. In this paper we perform a comparative study on the scalability of these two frameworks using the corresponding Machine Learning libraries for batch data processing. Additionally we analyze the performance of the two Machine Learning libraries that Spark currently has, MLlib and ML. For the experiments, the same algorithms and the same dataset are being used. Experimental results show that Spark MLlib has better perfomance and overall lower runtimes than Flink.
7 schema:genre article
8 schema:isAccessibleForFree true
9 schema:isPartOf N3b60dd915f724573ad57ba799c7ead7e
10 Na47a97ab13344b638362bbe08eaefe6d
11 sg:journal.1158761
12 schema:keywords Apache Flink
13 Apache Spark
14 Flink
15 MLlib
16 MapReduce model
17 Spark
18 Spark MLlib
19 algorithm
20 amount
21 batch data processing
22 big data processing
23 comparative study
24 comparison
25 computation
26 data
27 data processing
28 dataset
29 engine
30 experimental results
31 experiments
32 features
33 framework
34 general engine
35 good perfomance
36 large amount
37 large-scale data processing
38 large-scale datasets
39 learning libraries
40 library
41 low runtime
42 mL
43 machine learning libraries
44 main features
45 memory computation
46 model
47 need
48 new framework
49 novel framework
50 paper
51 parallel
52 perfomance
53 performance
54 processing
55 results
56 runtime
57 same algorithm
58 same dataset
59 scalability
60 streams
61 study
62 schema:name A comparison on scalability for batch big data processing on Apache Spark and Apache Flink
63 schema:pagination 1
64 schema:productId N324768812d8447cba15f68157c880ee6
65 Nc23637694ed84d5ca5429ffcb74a6d9f
66 schema:sameAs https://app.dimensions.ai/details/publication/pub.1064134953
67 https://doi.org/10.1186/s41044-016-0020-2
68 schema:sdDatePublished 2022-08-04T17:04
69 schema:sdLicense https://scigraph.springernature.com/explorer/license/
70 schema:sdPublisher N1be32b3dac20471a900962d03563fb7e
71 schema:url https://doi.org/10.1186/s41044-016-0020-2
72 sgo:license sg:explorer/license/
73 sgo:sdDataset articles
74 rdf:type schema:ScholarlyArticle
75 N1be32b3dac20471a900962d03563fb7e schema:name Springer Nature - SN SciGraph project
76 rdf:type schema:Organization
77 N2c571131f82245a1b3c696db9c6d62be rdf:first sg:person.013613661642.85
78 rdf:rest N60d0bc4d1f4b4a3a9f8352ccac3dfb29
79 N324768812d8447cba15f68157c880ee6 schema:name dimensions_id
80 schema:value pub.1064134953
81 rdf:type schema:PropertyValue
82 N3b60dd915f724573ad57ba799c7ead7e schema:volumeNumber 2
83 rdf:type schema:PublicationVolume
84 N60d0bc4d1f4b4a3a9f8352ccac3dfb29 rdf:first sg:person.015360430211.35
85 rdf:rest N7adb004d73094f23b3085f93ae6449b1
86 N7adb004d73094f23b3085f93ae6449b1 rdf:first sg:person.01221271101.39
87 rdf:rest Nbef6af1ab5684340abb31163c4f5bce5
88 Na47a97ab13344b638362bbe08eaefe6d schema:issueNumber 1
89 rdf:type schema:PublicationIssue
90 Nbef6af1ab5684340abb31163c4f5bce5 rdf:first sg:person.011360734641.33
91 rdf:rest rdf:nil
92 Nc23637694ed84d5ca5429ffcb74a6d9f schema:name doi
93 schema:value 10.1186/s41044-016-0020-2
94 rdf:type schema:PropertyValue
95 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
96 schema:name Information and Computing Sciences
97 rdf:type schema:DefinedTerm
98 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
99 schema:name Artificial Intelligence and Image Processing
100 rdf:type schema:DefinedTerm
101 sg:journal.1158761 schema:issn 2058-6345
102 schema:name Big Data Analytics
103 schema:publisher Springer Nature
104 rdf:type schema:Periodical
105 sg:person.011360734641.33 schema:affiliation grid-institutes:grid.412125.1
106 schema:familyName Herrera
107 schema:givenName Francisco
108 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011360734641.33
109 rdf:type schema:Person
110 sg:person.01221271101.39 schema:affiliation grid-institutes:grid.412125.1
111 schema:familyName García
112 schema:givenName Salvador
113 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01221271101.39
114 rdf:type schema:Person
115 sg:person.013613661642.85 schema:affiliation grid-institutes:grid.4489.1
116 schema:familyName García-Gil
117 schema:givenName Diego
118 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013613661642.85
119 rdf:type schema:Person
120 sg:person.015360430211.35 schema:affiliation grid-institutes:grid.4489.1
121 schema:familyName Ramírez-Gallego
122 schema:givenName Sergio
123 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.015360430211.35
124 rdf:type schema:Person
125 grid-institutes:grid.412125.1 schema:alternateName Faculty of Computing and Information Technology, King Abdulaziz University, North Jeddah, Saudi Arabia
126 schema:name Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, Calle Periodista Daniel Saucedo Aranda, 18071, Granada, Spain
127 Faculty of Computing and Information Technology, King Abdulaziz University, North Jeddah, Saudi Arabia
128 rdf:type schema:Organization
129 grid-institutes:grid.4489.1 schema:alternateName Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, Calle Periodista Daniel Saucedo Aranda, 18071, Granada, Spain
130 schema:name Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, Calle Periodista Daniel Saucedo Aranda, 18071, Granada, Spain
131 rdf:type schema:Organization
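In the table above, `schema:author` points to a blank node, and the author order is encoded as an RDF list: each blank node carries an `rdf:first` item and an `rdf:rest` link, ending at `rdf:nil`. The sketch below walks that chain using the node identifiers copied from the table. The dictionary layout and the function name `walk_rdf_list` are illustrative assumptions, not a SciGraph or RDF library API.

```python
RDF_NIL = "rdf:nil"

# Head of the list, i.e. the object of schema:author in the table above.
AUTHOR_LIST_HEAD = "N2c571131f82245a1b3c696db9c6d62be"

# Blank-node list cells from the table: node -> (rdf:first, rdf:rest).
LIST_CELLS = {
    "N2c571131f82245a1b3c696db9c6d62be": ("sg:person.013613661642.85", "N60d0bc4d1f4b4a3a9f8352ccac3dfb29"),
    "N60d0bc4d1f4b4a3a9f8352ccac3dfb29": ("sg:person.015360430211.35", "N7adb004d73094f23b3085f93ae6449b1"),
    "N7adb004d73094f23b3085f93ae6449b1": ("sg:person.01221271101.39", "Nbef6af1ab5684340abb31163c4f5bce5"),
    "Nbef6af1ab5684340abb31163c4f5bce5": ("sg:person.011360734641.33", RDF_NIL),
}

# schema:familyName values from the table, keyed by person identifier.
FAMILY_NAMES = {
    "sg:person.013613661642.85": "García-Gil",
    "sg:person.015360430211.35": "Ramírez-Gallego",
    "sg:person.01221271101.39": "García",
    "sg:person.011360734641.33": "Herrera",
}


def walk_rdf_list(head, cells):
    """Follow rdf:rest links from `head` and collect each rdf:first item in order."""
    items = []
    seen = set()
    node = head
    while node != RDF_NIL:
        if node in seen:
            raise ValueError(f"cycle detected at {node}")
        seen.add(node)
        first, rest = cells[node]
        items.append(first)
        node = rest
    return items


if __name__ == "__main__":
    for person in walk_rdf_list(AUTHOR_LIST_HEAD, LIST_CELLS):
        print(person, FAMILY_NAMES[person])
```

The cycle check is a defensive addition: well-formed RDF lists terminate at `rdf:nil`, but malformed data could otherwise loop forever.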
 



