Apache Spark is a framework for processing large volumes of data in parallel. Spark supports the Scala, Java, Python, R, and SQL programming languages. Spark integrates with Hadoop, an open-source ecosystem of tools that includes libraries, a cluster management system, a technology for storing files across different servers, and the MapReduce computing model.
Appealing APIs
Spark offers a DataFrame abstraction with object-oriented methods for transformations, joins, filters, and more. This object orientation makes it easy to create custom, reusable code that is also testable with mature testing frameworks.
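As a minimal sketch, assuming a local SparkSession and a hypothetical orders dataset, a transformation can be written as a plain function on DataFrames so it can be unit-tested like any other code:

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

def add_total_price(df: DataFrame) -> DataFrame:
    """A pure function on DataFrames: easy to unit-test with any framework."""
    return df.withColumn("total_price", F.col("quantity") * F.col("unit_price"))

# Illustrative data; in practice this would be read from storage.
orders = spark.createDataFrame(
    [("o1", 2, 9.99), ("o2", 5, 3.50)],
    ["order_id", "quantity", "unit_price"],
)

add_total_price(orders).show()
```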
Lazy execution
Lazy execution allows you to define a complex series of transformations represented as a single object, and to inspect the structure of the result without executing any of the intermediate steps.
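A small sketch of this, assuming a local SparkSession and an illustrative events dataset: the filter and aggregation below only build a plan, and printSchema() and explain() inspect the result without running anything.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

events = spark.createDataFrame(
    [("u1", "click", 3), ("u2", "view", 7)],
    ["user_id", "event_type", "duration"],
)

# Nothing is computed here: this is just a description of the work to do.
result = (
    events.filter(F.col("duration") > 5)
          .groupBy("event_type")
          .agg(F.avg("duration").alias("avg_duration"))
)

result.printSchema()   # inspect the output columns without executing
result.explain()       # inspect the query plan; still no execution
result.show()          # an action finally triggers the computation
```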
Easy conversion
PySpark offers a "toPandas()" method that lets you work in memory once Spark has crunched the data down to a smaller dataset. Combined with Pandas' plotting methods, you can chain commands to join your large datasets, filter, aggregate, and plot, all in one statement.
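The sketch below, assuming a local SparkSession, matplotlib for the plot, and illustrative sales and regions frames, chains a join, filter, and aggregation in Spark before handing the small result to Pandas for plotting:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

sales = spark.createDataFrame(
    [(1, "r1", 100.0), (2, "r2", 250.0), (3, "r1", 75.0)],
    ["sale_id", "region_id", "amount"],
)
regions = spark.createDataFrame(
    [("r1", "North"), ("r2", "South")],
    ["region_id", "region_name"],
)

# Join, filter, and aggregate in Spark, then pull the small result into
# Pandas and plot it, all in one chained expression.
(
    sales.join(regions, "region_id")
         .filter(F.col("amount") > 50)
         .groupBy("region_name")
         .agg(F.sum("amount").alias("total"))
         .toPandas()
         .plot(kind="bar", x="region_name", y="total")
)
```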
Easy transformations
Spark has an easy and intuitive way of pivoting a DataFrame. It also offers a "map-side join" broadcast method that speeds up joins significantly when one of the tables is small enough to fit in its entirety on each machine.
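A short sketch of both features, assuming a local SparkSession and illustrative datasets: pivot() reshapes the DataFrame, and broadcast() marks the small lookup table for a map-side join.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.master("local[*]").getOrCreate()

sales = spark.createDataFrame(
    [("2023", "Q1", 10.0), ("2023", "Q2", 20.0), ("2024", "Q1", 15.0)],
    ["year", "quarter", "amount"],
)

# Pivot: one row per year, one column per quarter.
sales.groupBy("year").pivot("quarter").sum("amount").show()

# Broadcast join: the small lookup table is copied to every executor,
# avoiding a shuffle of the larger table.
lookup = spark.createDataFrame(
    [("2023", "last year"), ("2024", "this year")],
    ["year", "label"],
)
sales.join(broadcast(lookup), "year").show()
```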
dbt
What is dbt?
dbt is an open-source framework created by Fishtown Analytics for executing, testing, and documenting SQL queries.
Dataform
What is Dataform?
Dataform is an application to manage data in BigQuery, Snowflake, Redshift, and other data warehouses.
Matillion
What is Matillion?
Matillion transforms business data, across its various locations and formats, and loads it into cloud data warehouses to enable informed decision-making.