04.11.2022 | Nikolay Valiotti

Data Terminology

Data analysis has gone mainstream, carrying a dense vocabulary along with it. With so much jargon being tossed around by data scientists, we wanted to shed light on some data terms and make them more palatable for day-to-day use.

It is impossible to cover every data term, so we have selected and grouped around 50 that you should know first and foremost.

General

Big data refers to extremely large and complex data sets. Because big data is characterized by great variety, volume, and velocity (the three Vs), traditional data processing software cannot handle it, and more sophisticated and powerful solutions are required.

Business intelligence is a set of processes, architectures, and technologies that convert raw data into meaningful information. It helps executives, managers, and employees uncover insights for strategic decisions through data analytics, data mining, data visualization, and other means.

Dashboard is a graphical representation of data. It helps aggregate and consolidate various critical pieces of information (e.g., key performance indicators, metrics, and other data points) into a single screen.

Data contract is a formal agreement between the owner of a source system and the team ingesting data from it. It describes the data to be exchanged, accountability regarding data delivery, and a lot of technical metadata.
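
To make the idea of a data contract more concrete, here is a minimal sketch in Python. It assumes a contract can be reduced to a list of expected fields and types; the `FieldSpec` class, the `ORDERS_CONTRACT` list, and the field names are hypothetical and chosen only for illustration. Real contracts usually cover much more, such as delivery schedules and ownership.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One field the source team promises to deliver (illustrative only)."""
    name: str
    dtype: type
    required: bool = True


# Hypothetical contract for an "orders" feed; the field names are made up.
ORDERS_CONTRACT = [
    FieldSpec("order_id", int),
    FieldSpec("amount", float),
    FieldSpec("coupon", str, required=False),
]


def violations(record: dict, contract: list) -> list:
    """Return human-readable contract violations for a single record."""
    problems = []
    for spec in contract:
        if spec.name not in record:
            if spec.required:
                problems.append(f"missing required field: {spec.name}")
            continue
        if not isinstance(record[spec.name], spec.dtype):
            problems.append(f"wrong type for {spec.name}")
    return problems


good = {"order_id": 1, "amount": 9.99}
bad = {"order_id": "1"}  # wrong type, and "amount" is missing

good_problems = violations(good, ORDERS_CONTRACT)
bad_problems = violations(bad, ORDERS_CONTRACT)
```

The ingesting team could run a check like this on every incoming batch and reject or quarantine records that break the agreement.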

Data flow is a path for data to move through an information system from a source to a destination. Data flows can use a data model or diagram to map out how information passes from one component to the next.

Data governance encompasses processes, roles, policies, standards, and metrics that dictate how data is gathered, processed, stored, and deleted.

Data lake is a centralized repository that stores, processes, and secures big data in a raw, granular format. Such a scalable platform can ingest any data (including structured data, semi-structured data, and unstructured data) from any system at any speed.

Data warehouse is a central repository of information that contains structured data after it has been cleaned and processed based on specific business needs. In data warehousing, data is generally stored and maintained only when it is expected to be used.

DataOps – Data Operations – is a collaborative approach to data management that focuses on enhanced communication, integration, and automation.

DBMS – Database Management System – is an interface between a database and an end user, helping them to manage and manipulate data. The most widely used types of DBMS are distributed, hierarchical, relational, object-oriented, and network.

Deep learning is a subfield of machine learning and artificial intelligence that uses multi-layered neural networks to learn by example, loosely imitating the way humans learn.

Encryption is a computing process converting information into a code to conceal it and prevent unauthorized access.
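
As a toy illustration of converting information into a code and back, the sketch below uses a one-time pad: each byte of the message is XOR-ed with a random key byte of the same length. This is an assumption made for teaching purposes, not a recommendation; production systems rely on vetted algorithms and libraries. The `xor_bytes` helper and the sample message are hypothetical.

```python
import secrets


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR two equal-length byte strings (the core of a one-time pad)."""
    if len(data) != len(key):
        raise ValueError("key must be exactly as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


message = b"quarterly revenue"
key = secrets.token_bytes(len(message))  # random key, used only once

ciphertext = xor_bytes(message, key)     # unreadable without the key
decrypted = xor_bytes(ciphertext, key)   # applying the same key restores the text
```

Anyone who intercepts `ciphertext` without `key` cannot recover the message, which is the "prevent unauthorized access" part of the definition.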

Metadata defines and describes data. It contains information about the structure, nature, and context of the data it is attached to.

Modern data stack is a suite of tools used by an organization for data integration. Compared to a traditional data stack, the way data is created, processed, and analyzed is faster, more scalable, and more accessible.

Lakehouse is a data management architecture that combines the elements of a data warehouse and a data lake. For instance, it has standardized storage formats while also providing direct access to source data.

Pipeline is an end-to-end sequence of processes an organization uses to collect, transform, and move data from a source to a target location. It also specifies the business logic behind data management.

Raw data, also referred to as primary data, is data in its initial state. It is collected from a source and delivered to a provider without being processed by a machine or a human.

Semi-structured data has characteristics of both structured and unstructured data. It has some organizing structure, such as tags or key-value pairs, but does not conform to a rigid data model. For example, HTML code, graphs and tables, emails, and XML documents do not have a fixed schema.

Silo is a data repository that can only be accessed or managed by one business unit, making it completely isolated from the rest of an organization. Many companies are trying to move past this practice because siloed data often hinders data analytics initiatives and wastes resources.

Stack is a coordinated set of technologies that allow organizations to make use of data. The tools within a stack take data through a processing pipeline from sources to storage and then finally to insights.

Structured data, often categorized as quantitative data, follows a standardized format and has a well-defined structure. It is typically formatted before being placed in data storage so that it will be more straightforward to analyze.

Unstructured data, often categorized as qualitative data, is information that does not follow a conventional data model or have an easily identifiable structure.
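
The difference between structured, semi-structured, and unstructured data can be sketched with three small Python values. The sample records are invented for illustration.

```python
import json

# Structured: every row has the same columns in the same order.
structured_rows = [("alice", 34), ("bob", 27)]

# Semi-structured: records describe themselves with keys,
# but different records may carry different fields.
semi_structured = [
    json.loads('{"name": "alice", "age": 34}'),
    json.loads('{"name": "bob", "tags": ["vip"]}'),
]

# Unstructured: free text with no predefined fields.
unstructured = "Alice called about her order; Bob wants a refund."

# The set of keys is only known after reading every semi-structured record.
keys_seen = sorted({key for record in semi_structured for key in record})
```

Here `keys_seen` ends up as `['age', 'name', 'tags']`, even though no single record contains all three keys, which is what "no fixed schema" means in practice.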

Data Formatting and Processing

API – Application Programming Interface – is a software intermediary that allows two software components to communicate, e.g., your product or service with other products and services. Each API serves a distinct function and exposes it through its own set of definitions and protocols.

ELT – Extract, Load, Transform – is a data integration process where raw data is moved from different sources to a unified data repository, such as a data warehouse. Data transformations (e.g., cleansing, validation, and calculations) are performed at the destination.

ETL – Extract, Transform, Load – is another form of data integration that moves raw data from multiple sources. But unlike the ELT approach, ETL pre-processes data and prepares it for upload to the target destination; it loads the data as the final step.
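
The only difference between the two approaches is where the transformation happens. The sketch below shows both side by side, with an in-memory SQLite database standing in for the data warehouse; the table names and the sample rows are assumptions made for illustration.

```python
import sqlite3

# Made-up source data: untidy names and amounts stored as text.
raw_rows = [(" Alice ", "10.5"), ("BOB", "3")]


def etl(conn: sqlite3.Connection) -> None:
    # Transform in application code first, then load the cleaned rows.
    cleaned = [(name.strip().lower(), float(amount)) for name, amount in raw_rows]
    conn.execute("CREATE TABLE sales_etl (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)


def elt(conn: sqlite3.Connection) -> None:
    # Load the raw rows untouched, then transform inside the destination with SQL.
    conn.execute("CREATE TABLE sales_raw (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", raw_rows)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT lower(trim(name)) AS name, CAST(amount AS REAL) AS amount "
        "FROM sales_raw"
    )


conn = sqlite3.connect(":memory:")
etl(conn)
elt(conn)

etl_result = conn.execute("SELECT name, amount FROM sales_etl ORDER BY name").fetchall()
elt_result = conn.execute("SELECT name, amount FROM sales_elt ORDER BY name").fetchall()
```

Both paths produce the same cleaned table. With ELT the raw table `sales_raw` also remains in the destination, so the transformation can be rerun or changed later without going back to the source.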

FTP – File Transfer Protocol – is a standard internet protocol used to communicate and transfer files. An FTP client allows users to download, upload, copy, move, rename, and delete files on servers over a TCP/IP (Transmission Control Protocol/Internet Protocol) network, such as the Internet.

JSON – JavaScript Object Notation – is an open standard format used for storing data and transporting it. It represents structured data as human-readable text, which makes it easy to read, write, and generate. Like other data interchange formats, it is designed to carry data rather than display it.
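
A short round trip with Python's standard `json` module shows what this looks like in practice. The sample record is invented for illustration.

```python
import json

# A Python dictionary with a string, a boolean, and a list.
record = {
    "term": "JSON",
    "open_standard": True,
    "aliases": ["JavaScript Object Notation"],
}

# Serialize to human-readable text, e.g. for storage or transport.
text = json.dumps(record, indent=2)

# Parse the text back into a Python object on the receiving side.
restored = json.loads(text)
```

The serialized `text` is plain text that any language with a JSON parser can read, and `restored` is equal to the original `record`.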

Python is a computer programming language that is interpreted (= an interpreter executes the code directly, line by line), object-oriented (= built around self-contained entities), and high-level (= easily understandable, close to human language, and abstracted from the details of computer operation).

RESTful API (REST API) is an application programming interface for exchanging information over the Internet that conforms to the constraints of the REST (REpresentational State Transfer) architectural style. These constraints include stateless client-server requests, typically sent over HTTP, and cacheability, among others.
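
In a REST API, each resource has its own URL, and the HTTP method states what to do with it. The sketch below only builds the requests with Python's standard library and does not send them. The base URL, the `/terms` resource, and the helper functions are hypothetical.

```python
import json
from urllib.request import Request

BASE_URL = "https://api.example.com"  # hypothetical service


def get_term(term_id: int) -> Request:
    """Build a request that reads one resource (GET /terms/<id>)."""
    return Request(
        f"{BASE_URL}/terms/{term_id}",
        method="GET",
        headers={"Accept": "application/json"},
    )


def create_term(name: str) -> Request:
    """Build a request that creates a resource (POST /terms)."""
    body = json.dumps({"name": name}).encode("utf-8")
    return Request(
        f"{BASE_URL}/terms",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


read = get_term(7)
write = create_term("lakehouse")
```

Passing either object to `urllib.request.urlopen` would send it. Each request carries everything the server needs, which reflects the stateless constraint of REST.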

SQL – Structured Query Language – is a domain-specific programming language used to manage relational databases and perform various operations on the data stored in them. It is the most common language for extracting and organizing data in a relational database management system (RDBMS). The standard commands are CREATE, SELECT, INSERT, DROP, UPDATE, and DELETE.
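
The six standard commands can be tried out against an in-memory SQLite database from Python. The `terms` table and its rows are made up for illustration, and SQLite is only one of many SQL dialects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CREATE: define a table.
conn.execute("CREATE TABLE terms (name TEXT PRIMARY KEY, category TEXT)")

# INSERT: add rows.
conn.executemany(
    "INSERT INTO terms VALUES (?, ?)",
    [("ETL", "processing"), ("Silo", "general")],
)

# UPDATE: change existing rows.
conn.execute("UPDATE terms SET category = 'formatting' WHERE name = 'ETL'")

# DELETE: remove rows.
conn.execute("DELETE FROM terms WHERE name = 'Silo'")

# SELECT: read rows back.
rows = conn.execute("SELECT name, category FROM terms").fetchall()

# DROP: remove the table itself.
conn.execute("DROP TABLE terms")
tables_left = conn.execute(
    "SELECT count(*) FROM sqlite_master WHERE type = 'table' AND name = 'terms'"
).fetchone()[0]
```

After the UPDATE and DELETE, the SELECT returns the single remaining row, and after DROP the table no longer exists.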

Server Technologies

Airflow is an open-source platform for data engineering pipelines. It offers the functionality to author, schedule, and monitor workflows as directed acyclic graphs (DAGs) of tasks.
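
To show what a directed acyclic graph of tasks means, the sketch below orders four tasks by their dependencies using Python's standard `graphlib` module. It is not the Airflow API; the task names are hypothetical, and a scheduler such as Airflow adds scheduling, retries, and monitoring on top of this ordering.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# A valid execution order exists only because the graph has no cycles.
run_order = list(TopologicalSorter(dependencies).static_order())
```

If two tasks depended on each other, `static_order()` would raise `graphlib.CycleError`, which is why workflows are required to be acyclic.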

Apache Kafka is an open-source distributed platform optimized for event stores and stream processing. It has three main functions: publishing and subscribing to streams of records, storing streams in chronological order, and processing streams in real-time.
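
The three functions named above (publish and subscribe, store in order, process) can be imitated with a toy in-memory log. This is a simplified sketch, not the Kafka API; the `TopicLog` class, the consumer names, and the records are hypothetical.

```python
class TopicLog:
    """Toy append-only log with per-consumer read positions."""

    def __init__(self):
        self._records = []   # records kept in the order they arrived
        self._offsets = {}   # consumer name -> next offset to read

    def publish(self, record):
        """Append a record and return its offset in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer):
        """Return the records this consumer has not read yet."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:]
        self._offsets[consumer] = len(self._records)
        return batch


log = TopicLog()
log.publish("signup:alice")
log.publish("signup:bob")

first = log.poll("billing")      # both records published so far
log.publish("signup:carol")
second = log.poll("billing")     # only the new record
late = log.poll("analytics")     # a new subscriber replays the full history
```

Because records are stored rather than deleted on delivery, the `analytics` consumer that joins late still receives all three events in their original order.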

Apache Spark is an open-source unified engine for data processing, data science, and machine learning. The framework can quickly perform tasks on vast sets of unstructured data in parallel.

AWS (Amazon Web Services) offers on-demand cloud computing services on a pay-as-you-go basis. The list includes packaged software-as-a-service (SaaS), platform-as-a-service (PaaS), and infrastructure-as-a-service (IaaS) offerings.

Dagster is an open-source orchestrator for developing, producing, observing, and maintaining data assets (tables, data sets, reports, machine learning models, and more).

dbt (data build tool) is an open-source command line tool for transforming data. By writing, executing, testing, and documenting SQL queries, data analysts and engineers can work with data directly in the cloud data warehouse.

Docker is a set of platform-as-a-service products for automating application deployment and management. It uses OS-level virtualization to deliver software in isolated packages called containers.

Google Cloud is a suite of cloud computing services for data management and data analytics, hybrid and multi-cloud, artificial intelligence and machine learning, and more. It provides an execution environment where all Google and third-party services are treated as building blocks.

JupyterLab is a web-based user interface for working with notebooks, code, and data, and for arranging them into workflows. It includes the features of the Jupyter Notebook along with a terminal, text editor, file browser, rich outputs, and more.

Snowflake is a cloud-based data platform. It combines data from multiple sources into a unified view and provides the functionality of an enterprise analytic database.

Databases

Amazon Redshift is a fully managed, petabyte-scale data warehouse. It is offered as a cloud service and is known for its ability to handle huge volumes of data.

ClickHouse is an open-source database management system that uses SQL queries to generate analytical reports in real-time.

Cloudera Impala is an open-source engine for running high-performance, low-latency SQL queries on data stored in Apache Hadoop.

Google BigQuery is a fully managed, serverless data warehouse that makes it possible to perform scalable analysis over petabytes of data.

HP Vertica is an analytic database management platform for big data that can run on cloud-based technologies or be deployed as an on-premises solution.

MySQL is an open-source relational database management system. It is a common choice for websites with massive volumes of both data and end users, and has been used by Facebook, Twitter, and Wikipedia.

PostgreSQL is an open-source object-relational database system that supports both relational (SQL) and non-relational (JSON) querying. It is often described as the most advanced open-source relational database in the world.

BI Tools

Looker is a cloud-based platform for business intelligence, data applications, and embedded analytics that takes big data and turns it into data visualizations.

Looker Studio (ex Google Data Studio) is a web-based tool that allows users to build and customize interactive reports and dashboards.

Metabase is an open-source data visualization platform that enables analysis of data from a variety of sources.

Mode is a cloud-based collaborative analytics tool that helps data analysts query, visualize, and share data easily and quickly.

Mprove is a self-service business intelligence application where users can fine-tune data models, visualizations, and dashboards with version control for DataOps.

Plotly Dash is an open-source analytics software framework for building interactive web applications in Python.

Power BI is a data analytics and reporting tool that can connect to a wide range of data sets. It includes a collection of apps, services, and connectors that work together to generate business insights.

Redash is an open-source web application used for querying databases and visualizing the results. It is also known for its robust integration capabilities.

Superset is an open-source, enterprise-ready web application. It focuses on data exploration and data visualization and can augment an organization's existing proprietary BI tools.

Tableau is interactive data visualization software available as a fully managed or self-service product. It enables flexible end-to-end analytics for business users of all technical levels.

That concludes our glossary of the most important data terms that any businessperson introducing or already using data analytics should understand. Remember that knowing the precise definition is less important than understanding the meaning and interpreting the term correctly for the context.