Apache Spark is a framework for processing large structured and unstructured datasets in parallel. Spark supports the Scala, Java, Python, R, and SQL programming languages. It integrates with Hadoop, an open-source ecosystem of tools that includes libraries, a cluster management system (YARN), a technology for storing files across different servers (HDFS), and the MapReduce computing model.
Spark offers a DataFrame abstraction with object-oriented methods for transformations, joins, filters, and more. This object orientation makes it easy to build custom, reusable code that can also be tested with mature testing frameworks.
It allows you to define a complex series of transformations represented as a single object, and because Spark evaluates transformations lazily, you can inspect the structure of the result without executing any of the intermediate steps.
PySpark offers a toPandas() method that allows you to work in memory once Spark has crunched the data down to a smaller dataset. Combined with pandas' plotting methods, you can chain commands to join your large datasets, then filter, aggregate, and plot, all in one statement.
Spark has an easy and intuitive way of pivoting a DataFrame. It also offers the "map-side join" (broadcast) method, which speeds up joins significantly when one of the tables is small enough to fit in its entirety in memory on each machine.
What is dbt?
dbt is an open-source framework created by Fishtown Analytics for executing, testing, and documenting SQL queries.
What is Dataform?
Dataform is an application for managing data in BigQuery, Snowflake, Redshift, and other data warehouses.