
A cluster computing framework

for processing large-scale spatial data




Scalable Spatial Computing

GeoSpark extends Apache Spark with a set of out-of-the-box Spatial Resilient Distributed Datasets (SRDDs) that efficiently load, process, and analyze large-scale spatial data across machines.

Better Developer Experience

GeoSpark provides APIs that let Apache Spark programmers easily develop spatial analysis programs with Spatial Resilient Distributed Datasets (SRDDs), which have built-in support for geometrical and distance operations.

Speed Up Your Program

Extensive experiments show that GeoSpark runs much faster than Hadoop-based systems in spatial analysis applications such as spatial join, spatial aggregation, and spatial co-location.

About

GeoSpark is an open-source, in-memory cluster computing system for processing large-scale spatial data. It extends RDDs to form Spatial RDDs (SRDDs), efficiently partitions SRDD data elements across machines, and introduces novel parallelized spatial transformations and actions for SRDDs (geometric operations that follow the Open Geospatial Consortium (OGC) standard), which give users a more intuitive interface for writing spatial data analytics programs. GeoSpark also extends the SRDD layer to execute spatial queries (e.g., range query, KNN query, and join query) on large-scale spatial datasets. Once geometrical objects are retrieved in the Spatial RDD layer, users can invoke the spatial query processing operations provided by GeoSpark's Spatial Query Processing Layer, which runs over the in-memory cluster, decides how spatial object-relational tuples are stored, indexed, and accessed using SRDDs, and returns the spatial query results to the user.
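
To make the query types named above concrete, here is a minimal single-machine sketch in plain Java of what a range query and a KNN query compute. It does not use GeoSpark or Spark; the class `SpatialQuerySketch`, its `Point` type, and both method names are hypothetical and chosen only for illustration. GeoSpark evaluates such queries over partitioned SRDDs, optionally with a spatial index, rather than with the linear scans shown here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: hypothetical names, not part of the GeoSpark API.
class SpatialQuerySketch {

    /** A 2D point, standing in for a spatial object. */
    static final class Point {
        final double x;
        final double y;

        Point(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    /** Range query: all points inside the window [minX, maxX] x [minY, maxY], borders included. */
    static List<Point> rangeQuery(List<Point> points,
                                  double minX, double minY, double maxX, double maxY) {
        List<Point> hits = new ArrayList<>();
        for (Point p : points) {
            if (p.x >= minX && p.x <= maxX && p.y >= minY && p.y <= maxY) {
                hits.add(p);
            }
        }
        return hits;
    }

    /** KNN query: the k points closest to (qx, qy) by Euclidean distance, nearest first. */
    static List<Point> knnQuery(List<Point> points, double qx, double qy, int k) {
        List<Point> sorted = new ArrayList<>(points);
        sorted.sort(Comparator.comparingDouble((Point p) -> Math.hypot(p.x - qx, p.y - qy)));
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }
}
```

Calling `rangeQuery(points, 0, 0, 2, 2)` keeps only the points inside that window, and `knnQuery(points, 5, 5, 2)` returns the two points nearest to (5, 5). Sorting the whole list keeps the sketch short; a bounded heap would avoid the full sort on large inputs.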

    Supported objects and operations in brief
  • Objects: Point, Rectangle, Polygon, LineString
  • Spatial index: R-Tree and Quad-Tree
  • Geometrical operations: Minimum Bounding Rectangle, PolygonUnion, and Overlap/Inside (Self-Join)
  • Spatial query operations: Spatial range query, spatial join query and spatial KNN query
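
As an illustration of two of the geometrical operations listed above, the following plain-Java sketch computes a minimum bounding rectangle and tests two rectangles for overlap. The class `GeometrySketch`, its methods, and the `{minX, minY, maxX, maxY}` array layout are assumptions made for this example, not GeoSpark's API.

```java
// Illustrative only: hypothetical names, not part of the GeoSpark API.
class GeometrySketch {

    /**
     * Minimum bounding rectangle of a set of {x, y} vertices,
     * returned as {minX, minY, maxX, maxY}.
     */
    static double[] minimumBoundingRectangle(double[][] vertices) {
        if (vertices.length == 0) {
            throw new IllegalArgumentException("at least one vertex is required");
        }
        double minX = vertices[0][0], maxX = vertices[0][0];
        double minY = vertices[0][1], maxY = vertices[0][1];
        for (double[] v : vertices) {
            minX = Math.min(minX, v[0]);
            maxX = Math.max(maxX, v[0]);
            minY = Math.min(minY, v[1]);
            maxY = Math.max(maxY, v[1]);
        }
        return new double[] {minX, minY, maxX, maxY};
    }

    /** True if two {minX, minY, maxX, maxY} rectangles share at least one point. */
    static boolean overlaps(double[] a, double[] b) {
        return a[0] <= b[2] && b[0] <= a[2]   // x-extents intersect
            && a[1] <= b[3] && b[1] <= a[3];  // y-extents intersect
    }
}
```

Bounding rectangles are cheap to compare, so an overlap test like this is commonly used as a first filter before an exact geometry test, and it is also the kind of rectangle that R-Tree nodes store.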
  News

    • GeoSpark is listed as an Infrastructure Project on the official Apache Spark Third-Party Projects page
    • GeoSpark v0.6.0 is released - 04/03/2017
    • GeoSpark v0.5.3 is released - 04/03/2017
    • GeoSpark v0.5.2 is released - 03/03/2017
    • GeoSpark v0.5.1 is released - 02/15/2017
    • GeoSpark v0.5.0 is released - 01/19/2017. Starting with this release, the Babylon Visualization Framework is released along with GeoSpark.
    • GeoSpark v0.4.0 is released - 12/22/2016
    • GeoSpark v0.3.2 is released - 10/26/2016
    • GeoSpark v0.3.1 is released - 10/06/2016
    • GeoSpark v0.3 is released - 8/17/2016
    • GeoSpark v0.2 is released - 10/25/2015
    • GeoSpark v0.1 is released - 7/7/2015
    • GeoSpark website is online now! - 7/7/2015
    Users

      Companies that are using GeoSpark (incomplete list):

      Please contact us if you want to be shown here.

      Tutorial

        Prerequisites
      1. Read GeoSpark Important Features

      2. Read GeoSpark Full Tutorial

      3. Read Babylon Visualization Framework

        GeoSpark Spatial Join Query + Babylon Choropleth Map: USA mainland tweets per USA county
    • Assume PointRDD is a geo-tagged Twitter dataset (Point) and PolygonRDD is the set of USA county boundaries (Polygon). The Spatial Join Query result then has the following schema: County, Number of Tweets. Pseudo code using the latest GeoSpark is listed below (written in Java, also works for Scala):
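
Independently of any GeoSpark listing, the computation behind this join can be sketched on a single machine in plain Java: for each county polygon, count the tweet points that fall inside it. The class `TweetsPerCountySketch` and its methods are hypothetical and use no GeoSpark or Spark calls. The point-in-polygon test is a standard ray-casting check, swapped in here for illustration; GeoSpark runs the join over partitioned SRDDs rather than with this nested loop.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: hypothetical names, not part of the GeoSpark API.
class TweetsPerCountySketch {

    /** Ray casting: true if (x, y) lies inside the polygon given as {x, y} vertices. */
    static boolean contains(double[][] polygon, double x, double y) {
        boolean inside = false;
        for (int i = 0, j = polygon.length - 1; i < polygon.length; j = i++) {
            double xi = polygon[i][0], yi = polygon[i][1];
            double xj = polygon[j][0], yj = polygon[j][1];
            // Does a horizontal ray from (x, y) towards +x cross edge (j, i)?
            boolean crosses = (yi > y) != (yj > y)
                    && x < (xj - xi) * (y - yi) / (yj - yi) + xi;
            if (crosses) {
                inside = !inside;
            }
        }
        return inside;
    }

    /** Join result in the schema (County, Number of Tweets), in county insertion order. */
    static Map<String, Integer> tweetsPerCounty(Map<String, double[][]> counties,
                                                double[][] tweets) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, double[][]> county : counties.entrySet()) {
            int n = 0;
            for (double[] tweet : tweets) {
                if (contains(county.getValue(), tweet[0], tweet[1])) {
                    n++;
                }
            }
            counts.put(county.getKey(), n);
        }
        return counts;
    }
}
```

The resulting map of county to tweet count is the kind of input a choropleth map needs, with each county shaded by its count. Points that lie exactly on a shared county border are not handled specially in this sketch.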
        Babylon Example 1: Scatter Plot: USA mainland rail network
        Babylon Example 2: Heat Map: New York City Taxi Trips (with a given map background)

      Download

      GeoSpark source code and precompiled JARs are available in the GeoSpark GitHub repository

      GeoSpark artifacts are hosted on Maven Central

      Publication

      • A Demonstration of GeoSpark: A Cluster Computing Framework for Processing Big Spatial Data

        Jia Yu, Jinxuan Wu, Mohamed Sarwat. In Proceedings of the IEEE International Conference on Data Engineering, IEEE ICDE 2016, Helsinki, Finland, May 2016

      • GeoSpark: A Cluster Computing Framework for Processing Large-Scale Spatial Data

        Jia Yu, Jinxuan Wu, Mohamed Sarwat. In Proceedings of the ACM International Conference on Advances in Geographic Information Systems, ACM SIGSPATIAL GIS 2015, Seattle, WA, USA, November 2015

      Contact

        • Jia Yu

        • E-mail: jiayu2@asu.edu

        • Mohamed Sarwat

        • E-mail: msarwat@asu.edu

        • Physical address: Brickyard Engineering (BYENG) 407, 699 S. Mill Ave., Tempe, AZ 85281