CCFinder: using Spark to find clustering coefficient in big graphs View Full Text


Ontology type: schema:ScholarlyArticle     


Article Info

DATE

2017-04-12

AUTHORS

Mehdi Alemi, Hassan Haghighi, Saeed Shahrivari

ABSTRACT

Networks with billions of vertices introduce new challenges to perform graph analysis in a reasonable time. Clustering coefficient is an important analytical measure of networks such as social networks and biological networks. To compute clustering coefficient in big graphs, existing distributed algorithms suffer from low efficiency such that they may fail due to demanding lots of memory, or even, if they complete successfully, their execution time is not acceptable for real-world applications. We present a distributed MapReduce-based algorithm, called CCFinder, to efficiently compute clustering coefficient in very big graphs. CCFinder is executed on Apache Spark, a scalable data processing platform. It efficiently detects existing triangles through using our proposed data structure, called FONL, which is cached in the distributed memory provided by Spark and reused multiple times. As data items in the FONL are fine-grained and contain the minimum required information, CCFinder requires less storage space and has better parallelism in comparison with its competitors. To find clustering coefficient, our solution to triangle counting is extended to have degree information of the vertices in the appropriate places. We performed several experiments on a Spark cluster with 60 processors. The results show that CCFinder achieves acceptable scalability and outperforms six existing competitor methods. Four competitors are those methods proposed based on graph processing systems, i.e., GraphX, NScale, NScaleSpark, and Pregel frameworks, and two others are the Cohen’s method and NodeIterator++, introduced based on MapReduce. More... »

PAGES

4683-4710

References to SciGraph publications

Identifiers

URI

http://scigraph.springernature.com/pub.10.1007/s11227-017-2040-8

DOI

http://dx.doi.org/10.1007/s11227-017-2040-8

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1084813021


Indexing Status Check whether this publication has been indexed by Scopus and Web Of Science using the SN Indexing Status Tool
Incoming Citations Browse incoming citations for this publication using opencitations.net

JSON-LD is the canonical representation for SciGraph data.

TIP: You can open this SciGraph record using an external JSON-LD service: JSON-LD Playground Google SDTT

[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information and Computing Sciences", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Artificial Intelligence and Image Processing", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0802", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Computation Theory and Mathematics", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0806", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information Systems", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "alternateName": "Faculty of Computer Science and Engineering, Shahid Beheshti University, G. C., Tehran, Iran", 
          "id": "http://www.grid.ac/institutes/grid.412502.0", 
          "name": [
            "Faculty of Computer Science and Engineering, Shahid Beheshti University, G. C., Tehran, Iran"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Alemi", 
        "givenName": "Mehdi", 
        "id": "sg:person.012730516031.30", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012730516031.30"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Faculty of Computer Science and Engineering, Shahid Beheshti University, G. C., Tehran, Iran", 
          "id": "http://www.grid.ac/institutes/grid.412502.0", 
          "name": [
            "Faculty of Computer Science and Engineering, Shahid Beheshti University, G. C., Tehran, Iran"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Haghighi", 
        "givenName": "Hassan", 
        "id": "sg:person.012616426732.32", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012616426732.32"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Department of Computer Engineering, Tarbiat Modares University (TMU), Tehran, Iran", 
          "id": "http://www.grid.ac/institutes/grid.412266.5", 
          "name": [
            "Department of Computer Engineering, Tarbiat Modares University (TMU), Tehran, Iran"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Shahrivari", 
        "givenName": "Saeed", 
        "id": "sg:person.0726542073.40", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0726542073.40"
        ], 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "sg:pub.10.1155/2007/52861", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1012295124", 
          "https://doi.org/10.1155/2007/52861"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/s00778-015-0405-2", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1038135216", 
          "https://doi.org/10.1007/s00778-015-0405-2"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1038/30918", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1041985305", 
          "https://doi.org/10.1038/30918"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/11427186_54", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1013907940", 
          "https://doi.org/10.1007/11427186_54"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/s10115-013-0693-z", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1041186630", 
          "https://doi.org/10.1007/s10115-013-0693-z"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1186/1471-2105-6-270", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1024400474", 
          "https://doi.org/10.1186/1471-2105-6-270"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "2017-04-12", 
    "datePublishedReg": "2017-04-12", 
    "description": "Networks with billions of vertices introduce new challenges to perform graph analysis in a reasonable time. Clustering coefficient is an important analytical measure of networks such as social networks and biological networks. To compute clustering coefficient in big graphs, existing distributed algorithms suffer from low efficiency such that they may fail due to demanding lots of memory, or even, if they complete successfully, their execution time is not acceptable for real-world applications. We present a distributed MapReduce-based algorithm, called CCFinder, to efficiently compute clustering coefficient in very big graphs. CCFinder is executed on Apache Spark, a scalable data processing platform. It efficiently detects existing triangles through using our proposed data structure, called FONL, which is cached in the distributed memory provided by Spark and reused multiple times. As data items in the FONL are fine-grained and contain the minimum required information, CCFinder requires less storage space and has better parallelism in comparison with its competitors. To find clustering coefficient, our solution to triangle counting is extended to have degree information of the vertices in the appropriate places. We performed several experiments on a Spark cluster with 60 processors. The results show that CCFinder achieves acceptable scalability and outperforms six existing competitor methods. Four competitors are those methods proposed based on graph processing systems, i.e., GraphX, NScale, NScaleSpark, and Pregel frameworks, and two others are the Cohen\u2019s method and NodeIterator++, introduced based on MapReduce.", 
    "genre": "article", 
    "id": "sg:pub.10.1007/s11227-017-2040-8", 
    "isAccessibleForFree": false, 
    "isPartOf": [
      {
        "id": "sg:journal.1133522", 
        "issn": [
          "0920-8542", 
          "1573-0484"
        ], 
        "name": "The Journal of Supercomputing", 
        "publisher": "Springer Nature", 
        "type": "Periodical"
      }, 
      {
        "issueNumber": "11", 
        "type": "PublicationIssue"
      }, 
      {
        "type": "PublicationVolume", 
        "volumeNumber": "73"
      }
    ], 
    "keywords": [
      "big graphs", 
      "billions of vertices", 
      "graph processing systems", 
      "data processing platform", 
      "real-world applications", 
      "less storage space", 
      "Apache Spark", 
      "Spark cluster", 
      "data items", 
      "processing platform", 
      "data structure", 
      "execution time", 
      "storage space", 
      "acceptable scalability", 
      "CCFinder", 
      "degree information", 
      "competitor methods", 
      "processing system", 
      "reasonable time", 
      "social networks", 
      "graph analysis", 
      "MapReduce", 
      "biological networks", 
      "network", 
      "new challenges", 
      "good parallelism", 
      "graph", 
      "Spark", 
      "clustering coefficient", 
      "algorithm", 
      "NScale", 
      "GraphX", 
      "Pregel", 
      "scalability", 
      "multiple times", 
      "parallelism", 
      "information", 
      "processors", 
      "analytical measures", 
      "low efficiency", 
      "memory", 
      "platform", 
      "vertices", 
      "billions", 
      "appropriate place", 
      "method", 
      "competitors", 
      "applications", 
      "challenges", 
      "system", 
      "time", 
      "space", 
      "efficiency", 
      "solution", 
      "clusters", 
      "counting", 
      "items", 
      "experiments", 
      "triangle", 
      "results", 
      "coefficient", 
      "measures", 
      "comparison", 
      "analysis", 
      "structure", 
      "minimum", 
      "place", 
      "Cohen's method"
    ], 
    "name": "CCFinder: using Spark to find clustering coefficient in big graphs", 
    "pagination": "4683-4710", 
    "productId": [
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1084813021"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1007/s11227-017-2040-8"
        ]
      }
    ], 
    "sameAs": [
      "https://doi.org/10.1007/s11227-017-2040-8", 
      "https://app.dimensions.ai/details/publication/pub.1084813021"
    ], 
    "sdDataset": "articles", 
    "sdDatePublished": "2022-12-01T06:35", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-springernature-scigraph/baseset/20221201/entities/gbq_results/article/article_720.jsonl", 
    "type": "ScholarlyArticle", 
    "url": "https://doi.org/10.1007/s11227-017-2040-8"
  }
]
 

Download the RDF metadata as:  json-ld nt turtle xml License info

HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/s11227-017-2040-8'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/s11227-017-2040-8'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/s11227-017-2040-8'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/s11227-017-2040-8'


 

This table displays all metadata directly associated to this object as RDF triples.

174 TRIPLES      21 PREDICATES      100 URIs      84 LITERALS      6 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1007/s11227-017-2040-8 schema:about anzsrc-for:08
2 anzsrc-for:0801
3 anzsrc-for:0802
4 anzsrc-for:0806
5 schema:author N922116be12f64bbd8e6d37523ea23e16
6 schema:citation sg:pub.10.1007/11427186_54
7 sg:pub.10.1007/s00778-015-0405-2
8 sg:pub.10.1007/s10115-013-0693-z
9 sg:pub.10.1038/30918
10 sg:pub.10.1155/2007/52861
11 sg:pub.10.1186/1471-2105-6-270
12 schema:datePublished 2017-04-12
13 schema:datePublishedReg 2017-04-12
14 schema:description Networks with billions of vertices introduce new challenges to perform graph analysis in a reasonable time. Clustering coefficient is an important analytical measure of networks such as social networks and biological networks. To compute clustering coefficient in big graphs, existing distributed algorithms suffer from low efficiency such that they may fail due to demanding lots of memory, or even, if they complete successfully, their execution time is not acceptable for real-world applications. We present a distributed MapReduce-based algorithm, called CCFinder, to efficiently compute clustering coefficient in very big graphs. CCFinder is executed on Apache Spark, a scalable data processing platform. It efficiently detects existing triangles through using our proposed data structure, called FONL, which is cached in the distributed memory provided by Spark and reused multiple times. As data items in the FONL are fine-grained and contain the minimum required information, CCFinder requires less storage space and has better parallelism in comparison with its competitors. To find clustering coefficient, our solution to triangle counting is extended to have degree information of the vertices in the appropriate places. We performed several experiments on a Spark cluster with 60 processors. The results show that CCFinder achieves acceptable scalability and outperforms six existing competitor methods. Four competitors are those methods proposed based on graph processing systems, i.e., GraphX, NScale, NScaleSpark, and Pregel frameworks, and two others are the Cohen’s method and NodeIterator++, introduced based on MapReduce.
15 schema:genre article
16 schema:isAccessibleForFree false
17 schema:isPartOf Nd932edb00b9940a2a40b65d4c9293ec7
18 Nf789c9f0d7144bd89758a7a36c2d2abe
19 sg:journal.1133522
20 schema:keywords Apache Spark
21 CCFinder
22 Cohen's method
23 GraphX
24 MapReduce
25 NScale
26 Pregel
27 Spark
28 Spark cluster
29 acceptable scalability
30 algorithm
31 analysis
32 analytical measures
33 applications
34 appropriate place
35 big graphs
36 billions
37 billions of vertices
38 biological networks
39 challenges
40 clustering coefficient
41 clusters
42 coefficient
43 comparison
44 competitor methods
45 competitors
46 counting
47 data items
48 data processing platform
49 data structure
50 degree information
51 efficiency
52 execution time
53 experiments
54 good parallelism
55 graph
56 graph analysis
57 graph processing systems
58 information
59 items
60 less storage space
61 low efficiency
62 measures
63 memory
64 method
65 minimum
66 multiple times
67 network
68 new challenges
69 parallelism
70 place
71 platform
72 processing platform
73 processing system
74 processors
75 real-world applications
76 reasonable time
77 results
78 scalability
79 social networks
80 solution
81 space
82 storage space
83 structure
84 system
85 time
86 triangle
87 vertices
88 schema:name CCFinder: using Spark to find clustering coefficient in big graphs
89 schema:pagination 4683-4710
90 schema:productId N0d209768a401457e930e852a2c5d0b75
91 N10d9ae0a16ed49ab8813e75ac547f17b
92 schema:sameAs https://app.dimensions.ai/details/publication/pub.1084813021
93 https://doi.org/10.1007/s11227-017-2040-8
94 schema:sdDatePublished 2022-12-01T06:35
95 schema:sdLicense https://scigraph.springernature.com/explorer/license/
96 schema:sdPublisher Nca98f622202a4cfb857862892a4c8b81
97 schema:url https://doi.org/10.1007/s11227-017-2040-8
98 sgo:license sg:explorer/license/
99 sgo:sdDataset articles
100 rdf:type schema:ScholarlyArticle
101 N0d209768a401457e930e852a2c5d0b75 schema:name dimensions_id
102 schema:value pub.1084813021
103 rdf:type schema:PropertyValue
104 N10d9ae0a16ed49ab8813e75ac547f17b schema:name doi
105 schema:value 10.1007/s11227-017-2040-8
106 rdf:type schema:PropertyValue
107 N86c8205c685e4d6db528f3e2fc59b70a rdf:first sg:person.012616426732.32
108 rdf:rest Nca288d4e28154d34acc3e1bcde5cf407
109 N922116be12f64bbd8e6d37523ea23e16 rdf:first sg:person.012730516031.30
110 rdf:rest N86c8205c685e4d6db528f3e2fc59b70a
111 Nca288d4e28154d34acc3e1bcde5cf407 rdf:first sg:person.0726542073.40
112 rdf:rest rdf:nil
113 Nca98f622202a4cfb857862892a4c8b81 schema:name Springer Nature - SN SciGraph project
114 rdf:type schema:Organization
115 Nd932edb00b9940a2a40b65d4c9293ec7 schema:issueNumber 11
116 rdf:type schema:PublicationIssue
117 Nf789c9f0d7144bd89758a7a36c2d2abe schema:volumeNumber 73
118 rdf:type schema:PublicationVolume
119 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
120 schema:name Information and Computing Sciences
121 rdf:type schema:DefinedTerm
122 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
123 schema:name Artificial Intelligence and Image Processing
124 rdf:type schema:DefinedTerm
125 anzsrc-for:0802 schema:inDefinedTermSet anzsrc-for:
126 schema:name Computation Theory and Mathematics
127 rdf:type schema:DefinedTerm
128 anzsrc-for:0806 schema:inDefinedTermSet anzsrc-for:
129 schema:name Information Systems
130 rdf:type schema:DefinedTerm
131 sg:journal.1133522 schema:issn 0920-8542
132 1573-0484
133 schema:name The Journal of Supercomputing
134 schema:publisher Springer Nature
135 rdf:type schema:Periodical
136 sg:person.012616426732.32 schema:affiliation grid-institutes:grid.412502.0
137 schema:familyName Haghighi
138 schema:givenName Hassan
139 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012616426732.32
140 rdf:type schema:Person
141 sg:person.012730516031.30 schema:affiliation grid-institutes:grid.412502.0
142 schema:familyName Alemi
143 schema:givenName Mehdi
144 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012730516031.30
145 rdf:type schema:Person
146 sg:person.0726542073.40 schema:affiliation grid-institutes:grid.412266.5
147 schema:familyName Shahrivari
148 schema:givenName Saeed
149 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0726542073.40
150 rdf:type schema:Person
151 sg:pub.10.1007/11427186_54 schema:sameAs https://app.dimensions.ai/details/publication/pub.1013907940
152 https://doi.org/10.1007/11427186_54
153 rdf:type schema:CreativeWork
154 sg:pub.10.1007/s00778-015-0405-2 schema:sameAs https://app.dimensions.ai/details/publication/pub.1038135216
155 https://doi.org/10.1007/s00778-015-0405-2
156 rdf:type schema:CreativeWork
157 sg:pub.10.1007/s10115-013-0693-z schema:sameAs https://app.dimensions.ai/details/publication/pub.1041186630
158 https://doi.org/10.1007/s10115-013-0693-z
159 rdf:type schema:CreativeWork
160 sg:pub.10.1038/30918 schema:sameAs https://app.dimensions.ai/details/publication/pub.1041985305
161 https://doi.org/10.1038/30918
162 rdf:type schema:CreativeWork
163 sg:pub.10.1155/2007/52861 schema:sameAs https://app.dimensions.ai/details/publication/pub.1012295124
164 https://doi.org/10.1155/2007/52861
165 rdf:type schema:CreativeWork
166 sg:pub.10.1186/1471-2105-6-270 schema:sameAs https://app.dimensions.ai/details/publication/pub.1024400474
167 https://doi.org/10.1186/1471-2105-6-270
168 rdf:type schema:CreativeWork
169 grid-institutes:grid.412266.5 schema:alternateName Department of Computer Engineering, Tarbiat Modares University (TMU), Tehran, Iran
170 schema:name Department of Computer Engineering, Tarbiat Modares University (TMU), Tehran, Iran
171 rdf:type schema:Organization
172 grid-institutes:grid.412502.0 schema:alternateName Faculty of Computer Science and Engineering, Shahid Beheshti University, G. C., Tehran, Iran
173 schema:name Faculty of Computer Science and Engineering, Shahid Beheshti University, G. C., Tehran, Iran
174 rdf:type schema:Organization
 




Preview window. Press ESC to close (or click here)


...