07.10.2022 | Nikolay Valiotti
10 Tips for a Strong Data Infrastructure
Businesses are becoming increasingly reliant on data. That comes as no surprise: massive amounts of data enable organizations to draw real-time insights and gain a competitive edge. But there are a lot of bumps along the way.
One of the most challenging parts of analytics and data science is still building and maintaining modern data pipelines. The processes for making raw data ready for analysis can go wrong at any stage. Pipelines get stuck, data transformations take forever, security errors crop up, and components run out of sequence. And when those issues occur, every organization needs a hero to fix them. Among the many vendors vying for that position is FiveTran, a cloud-based, fully managed data integration service.
Let’s discuss how FiveTran allows a business to radically simplify its data pipelines.
A data pipeline is a set of actions and technologies for data processing. It has three main components: one or more data sources, the processing steps, and a destination.
Data pipelines help organizations centralize data collected from disparate sources, feed data into operational systems, and ensure consistent data quality.
For example, a company’s marketing and commerce stack integrates multiple independent platforms—say, Facebook Ads, Google Analytics, and Shopify. Before a customer experience analyst can make sense of various data points, data needs to be transferred and normalized from these disparate sources into a data warehouse with the help of a data pipeline.
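To make the normalization step concrete, here is a minimal Python sketch of how records from the three platforms might be mapped onto one shared schema before they land in a warehouse table. The field names, sample values, and `normalize_*` helpers are illustrative assumptions, not the vendors’ actual export formats or any FiveTran API.

```python
from datetime import date, datetime

# Hypothetical raw records; field names are assumptions for illustration only.
FACEBOOK_ADS = [{"date_start": "2022-10-01", "campaign_name": "Fall Sale", "spend": "12.50"}]
GOOGLE_ANALYTICS = [{"date": "20221001", "campaign": "Fall Sale", "sessions": 340}]
SHOPIFY = [{"created_at": "2022-10-01T09:15:00", "utm_campaign": "Fall Sale", "total_price": "59.90"}]


def unified_row(source, day, campaign, metric, value):
    """Build one row in the shared (hypothetical) warehouse schema."""
    return {"source": source, "day": day, "campaign": campaign,
            "metric": metric, "value": float(value)}


def normalize_facebook(row):
    # Dates arrive as ISO strings, amounts as text.
    return unified_row("facebook_ads", date.fromisoformat(row["date_start"]),
                       row["campaign_name"], "spend", row["spend"])


def normalize_google_analytics(row):
    # Dates arrive as compact YYYYMMDD strings.
    day = datetime.strptime(row["date"], "%Y%m%d").date()
    return unified_row("google_analytics", day, row["campaign"],
                       "sessions", row["sessions"])


def normalize_shopify(row):
    # Timestamps are reduced to the calendar day.
    day = datetime.fromisoformat(row["created_at"]).date()
    return unified_row("shopify", day, row["utm_campaign"],
                       "revenue", row["total_price"])


def build_warehouse_rows():
    """Combine all sources into one list of rows with identical keys."""
    rows = [normalize_facebook(r) for r in FACEBOOK_ADS]
    rows += [normalize_google_analytics(r) for r in GOOGLE_ANALYTICS]
    rows += [normalize_shopify(r) for r in SHOPIFY]
    return rows
```

Once every row shares the same keys, an analyst can group by `day` or `campaign` across all three platforms without caring where a record came from.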
FiveTran is a data integration platform that uses ELT-style (Extract-Load-Transform) data pipelines. It’s built for companies that don’t want to manage their own data pipeline infrastructure.
As a paid tool, FiveTran needs to have distinct advantages over open-source tools. The main one is the simplification of data processes. Connecting sources to targets requires virtually no coding or engineering expertise, and the more complex and time-consuming tasks (maintaining, managing, and debugging data pipelines) are delegated to FiveTran’s internal team.
FiveTran opts for ELT in its product, which means it switches the order of the loading and transformation stages. By doing so, it addresses three major shortcomings of ETL: complexity, brittle and risky architecture, and inaccessibility.
Data flows from sources through the FiveTran data pipeline to the data warehouse and then to the business intelligence and visualization tool. As a result, the data pipeline is:
Initially, FiveTran focused almost exclusively on data movement. But the eventual integration with dbt allowed the company to branch beyond the “E” and the “L” in “ELT.” dbt Core lets it add the “T” to the workflow.
Users can also integrate third-party tools via REST APIs to expand built-in capabilities. For example, you can connect it to SageMaker hosting or to Python data and machine learning libraries such as pandas for predictive capabilities. Of course, that works only if an organization has the resources to leverage additional technology.
FiveTran doesn’t support on-premises data warehouses, so in all examples, data warehouses refer strictly to the cloud. The tool is a better fit for modern cloud environments with abundant computing power and plenty of storage but limited engineering time.
To understand how FiveTran works, you have to know the difference between ETL and ELT, so here is a quick refresher. A “traditional” ETL pipeline does the transformation in the middle, which often requires users to rerun jobs and re-validate the results later in the process. ELT moves the transformation step to the end of a data pipeline, so errors and bugs in transformation don’t stop the process of loading data. As a result, data reaches the destination sooner and with fewer failed loads.
After loading the data, FiveTran allows data teams to set up data transformations using SQL or dbt.
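The difference can be sketched in a few lines of Python, with an in-memory SQLite database standing in for a cloud warehouse. This is a simplified illustration under stated assumptions, not FiveTran’s implementation: the table names, the sample rows, and the `run_etl` and `run_elt` helpers are all hypothetical.

```python
import sqlite3


def extract():
    # Hypothetical source rows; one amount is malformed on purpose.
    return [(1, "19.99"), (2, "n/a"), (3, "5.00")]


def to_cents(amount):
    # Raises ValueError for values such as "n/a".
    return round(float(amount) * 100)


def run_etl(conn, rows):
    """ETL: transform first, so one bad value aborts before anything loads."""
    transformed = [(order_id, to_cents(amount)) for order_id, amount in rows]
    conn.execute("CREATE TABLE etl_orders (order_id INTEGER, amount_cents INTEGER)")
    conn.executemany("INSERT INTO etl_orders VALUES (?, ?)", transformed)


def run_elt(conn, rows):
    """ELT: load raw rows as-is, then transform inside the warehouse."""
    conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)
    # The transformation runs in SQL after loading. Rows that do not look
    # like a decimal amount are skipped here but remain in raw_orders.
    conn.execute(
        """
        CREATE TABLE orders AS
        SELECT order_id,
               CAST(ROUND(CAST(amount AS REAL) * 100) AS INTEGER) AS amount_cents
        FROM raw_orders
        WHERE amount GLOB '[0-9]*.[0-9][0-9]'
        """
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    try:
        run_etl(conn, extract())
    except ValueError as err:
        print("ETL stopped before loading:", err)
    run_elt(conn, extract())
    print(conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0], "raw rows loaded")
```

In the ELT variant the malformed row still lands in `raw_orders`, so the transformation can be corrected and rerun later without extracting the data again.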
Users can see all their active transformations listed alphabetically by name. The short description of each transformation includes the status and time of its last run, the next scheduled run, its frequency, and the number of connectors associated with its output.
You can also see a more detailed rundown of a transformation in the run log.
Raw data is available alongside transformed data, which allows for greater analytical flexibility. If a transformation run fails, users can look up the logs to identify the issue. If the analytical needs of the business change, users can edit transformations and run them again.
After the initial sync of historical data representing your workspace’s current state, FiveTran will automatically check for changes at a set interval. If there are changes, it will load the updated rows into the target destination. Users can make these checks more or less frequent as necessary.
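The general idea behind this kind of incremental sync can be sketched as follows. This is a minimal illustration that assumes every source row carries an `id` and an `updated_at` timestamp; the `incremental_sync` function and the dictionary used as a destination are hypothetical and do not reflect FiveTran’s internals. Scheduling and deleted rows are out of scope here.

```python
from datetime import datetime


def incremental_sync(source_rows, destination, cursor):
    """Copy rows modified after `cursor` into `destination`.

    Returns the new cursor and the number of rows written.
    """
    changed = [row for row in source_rows if row["updated_at"] > cursor]
    for row in changed:
        # Upsert: insert new ids, overwrite ids that already exist.
        destination[row["id"]] = dict(row)
    # Advance the cursor to the latest change seen, or keep it if nothing changed.
    new_cursor = max((row["updated_at"] for row in changed), default=cursor)
    return new_cursor, len(changed)


if __name__ == "__main__":
    source = [
        {"id": 1, "name": "Alice", "updated_at": datetime(2022, 10, 1)},
        {"id": 2, "name": "Bob", "updated_at": datetime(2022, 10, 2)},
    ]
    warehouse = {}
    # Initial sync: a minimal cursor means every historical row is loaded.
    cursor, written = incremental_sync(source, warehouse, datetime.min)
    print(written, "rows loaded in the initial sync")
    # A later run only picks up rows changed since the stored cursor.
    source[0] = {"id": 1, "name": "Alice B.", "updated_at": datetime(2022, 10, 3)}
    cursor, written = incremental_sync(source, warehouse, cursor)
    print(written, "row loaded in the incremental run")
```

Storing the cursor between runs is what lets each scheduled check transfer only the rows that changed instead of the whole table.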
A simple data pipeline takes less than 15 minutes to set up, which is impressive compared to the time needed to build your own data pipeline infrastructure. You can also factor in the time and cost savings associated with infrastructure management—data parsing, maintaining connections, and running transformations.
More complex data pipelines (including complex transformations) take weeks rather than months to build. The platform is designed to handle thousands of data sources and run thousands of pipelines simultaneously. It allows you to run a large number of data pipelines without needing to keep track of all changes made to each of them. What’s more, it provides access controls that prevent unauthorized users from making changes.
As a result, FiveTran unlocks raw data that might have been bottlenecked by the complexity of the data engineering process.
Just like any tool, FiveTran isn’t the right fit for every organization. The two biggest limiting factors are data sources and destinations. Firstly, to be able to use FiveTran, your organization must use the data sources it supports (150+ in total, with the most recent additions being Google Analytics 4, Amazon Ads, Zendesk Sell, Splunk, AWS CloudTrail, and Salesforce Commerce Cloud). If your sources aren’t on that list, you should look for another solution.
Secondly, FiveTran packages are available to organizations that use one of the supported outputs. These include Azure Synapse, Redshift, Snowflake, BigQuery, Databricks, and a few others. There is an option to use FiveTran’s built-in data transformation functionality, but it’s less flexible.
Also, if something goes wrong with a data pipeline, you’ll have to either work with the support team or rely on the information in the error message. With no access to the pipeline’s internals, debugging is likely to be challenging.
Finally, you might not be able to adapt FiveTran’s architecture to your needs if you need special functionality in your data pipelines. You’ll be limited to best practices as specified by the tool.
FiveTran uses a credit-based system, where you pay for the rows updated or inserted by its connectors. Plans with a different number of credits are advertised as perfect for:
We should also note that the tool is suitable for less-technical users, which is another advantage over self-managed solutions.
At first glance, it may seem that the tool is perfect for any business arrangement. But you should account for the limitations we’ve mentioned and the fact that FiveTran relies on dbt under the hood. So, FiveTran is a good fit as long as the team is comfortable using dbt or some other means to handle data transformations.
Overall, the benefits of using FiveTran are evident. As a fully automated zero-maintenance data pipeline built especially for analysts, it offers a faster, better way to centralize all of your data sources. It enables analysts to deploy modern data pipelines for effective data discovery, exploration, and delivery to business users.
The tool replicates all of your data in minutes, synchronizes it in one place, and allows you to easily connect your data warehouse with your favorite BI tools for full-scale visualization.
Bear in mind that FiveTran can be costly if you have a lot of data to sync. You’ll be charged based on the unique database rows that get processed by a FiveTran data pipeline, i.e., you’ll be paying for the resources you use. At the same time, you’ll be saving the time of data scientists and engineers. So, it’s up to you to decide whether the simplification of data processes and a lighter burden on the workforce will be worth the price.
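To get a rough feel for how usage-based billing scales, you can count the unique rows a pipeline touches in a given period. The sketch below assumes a hypothetical log of change events and treats a row as unique per table and primary key within a calendar month. The event format, the monthly grouping, and the `unique_rows_per_month` helper are assumptions made for illustration, not FiveTran’s actual billing logic.

```python
from collections import defaultdict


def unique_rows_per_month(change_events):
    """Count distinct (table, primary_key) pairs touched in each month."""
    seen = defaultdict(set)
    for event in change_events:
        month = event["synced_on"][:7]  # "YYYY-MM" prefix of an ISO date
        seen[month].add((event["table"], event["primary_key"]))
    return {month: len(keys) for month, keys in seen.items()}


if __name__ == "__main__":
    # Hypothetical change log: order 1 is updated twice in October.
    events = [
        {"table": "orders", "primary_key": 1, "synced_on": "2022-10-01"},
        {"table": "orders", "primary_key": 1, "synced_on": "2022-10-15"},
        {"table": "orders", "primary_key": 2, "synced_on": "2022-10-20"},
        {"table": "customers", "primary_key": 1, "synced_on": "2022-10-21"},
        {"table": "orders", "primary_key": 1, "synced_on": "2022-11-02"},
    ]
    print(unique_rows_per_month(events))
```

Under these assumptions, a row that is updated many times within one month counts once, so the number of distinct rows touched drives the cost rather than the raw number of updates.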