For the novice, or for anyone outside the industry, I’m sure the number of tools available in the data market is overwhelming. In this article I am going to try to simplify these tools into groups and describe what each group does.
Event Capture
Let’s start at the beginning with our first category, event capture. In this category we are going to count anything that captures data about your business. If your customers interact with you through a website or an app, you probably have a tracking tool such as Google Analytics, Adobe Analytics or your own in-house tool. These tools will be capturing events for you. If you have physical stores, you probably have data generated by your point-of-sale (POS) system. If you sell a physical product, you might also have data on stock levels. Another source could be your email service provider. You could also include feeds from your social media accounts. Whatever you may have, these tools are generating data for you that can be analysed.
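To make this concrete, a captured event is usually just a small structured record. Here is a sketch of what a page-view event might look like; the field names are illustrative, not any particular vendor’s schema:

```python
import json
from datetime import datetime, timezone

# A hypothetical page-view event, similar in spirit to what web tracking
# tools capture. All field names and values here are invented.
event = {
    "event_type": "page_view",
    "user_id": "u-12345",
    "page": "/products/blue-widget",
    "timestamp": datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc).isoformat(),
    "referrer": "email_campaign_42",
}

# Events are typically serialised (e.g. as JSON) before being sent on
# to a collection endpoint.
payload = json.dumps(event)
print(payload)
```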
Storage
From this I would say the next category of tool is storage. Some of the tools mentioned above have some way of analysing the data in their systems, but they can only see their own data. If we want to combine data sources to see the interactions between them (e.g. transactions generated by sent emails), we need to store the data from all the sources in one location.
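The email-to-transaction example above is exactly the kind of question a shared store answers. As a toy sketch (using Python’s built-in SQLite as a stand-in for a real store; table and column names are invented), a simple join connects the two sources:

```python
import sqlite3

# Joining email sends (from an email service provider) to transactions
# (from a POS or web shop) is only possible once both live in one place.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE email_sends (customer_id TEXT, campaign TEXT, sent_date TEXT);
    CREATE TABLE transactions (customer_id TEXT, amount REAL, tx_date TEXT);
    INSERT INTO email_sends VALUES ('c1', 'spring_sale', '2024-03-01'),
                                   ('c2', 'spring_sale', '2024-03-01');
    INSERT INTO transactions VALUES ('c1', 49.99, '2024-03-02');
""")

# Which campaigns were followed by a purchase?
rows = conn.execute("""
    SELECT e.campaign, COUNT(t.customer_id) AS purchases
    FROM email_sends e
    LEFT JOIN transactions t
      ON t.customer_id = e.customer_id AND t.tx_date >= e.sent_date
    GROUP BY e.campaign
""").fetchall()
print(rows)  # [('spring_sale', 1)]
```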
Before the era of the cloud and big data, this data would have lived in an on-premises relational database or data warehouse. Times change, and there are now many different ways in which you can store this data, each optimised for the type of data being stored and how you want to use it.
Data Lake
For this post we’re going to concentrate on the two options most useful for data analytics. The first is some form of cloud storage for your data. You may have heard the term data lake before; that’s exactly what we’re creating here. We store the data in simple file-based formats, and it can be structured or unstructured. The cost of storing data this way is much lower than in an on-premises data warehouse, which is why this method was adopted for big data. Providers in this space include Amazon S3, Google Cloud Storage and Microsoft Azure Data Lake Storage.
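At heart a data lake is just cheap file storage with a sensible folder layout. This sketch writes raw events as JSON-lines files under a date-partitioned path on local disk, mimicking the key structure you might use on Amazon S3 or Google Cloud Storage (the paths and fields are illustrative):

```python
import json
import tempfile
from pathlib import Path

# Local stand-in for a cloud bucket; in practice this root would be an
# object-store location such as s3://my-bucket/.
lake_root = Path(tempfile.mkdtemp())

events = [
    {"event_type": "page_view", "user_id": "u-1"},
    {"event_type": "add_to_cart", "user_id": "u-2"},
]

# Date-partitioned layout: raw/<source>/date=<YYYY-MM-DD>/<file>
partition = lake_root / "raw" / "web_events" / "date=2024-03-01"
partition.mkdir(parents=True)
target = partition / "events.jsonl"
target.write_text("\n".join(json.dumps(e) for e in events))

print(target.read_text().splitlines())
```

Because the files are plain text, anything that can read a file can read the lake, which is a big part of why this approach is cheap and flexible.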
Data Warehouse
The second option is a data warehouse. A data warehouse is still a repository for data, but in this instance the system built around it is optimised for reporting and analytics. The data contained in the data warehouse will be structured, and probably filtered depending on the objective of the system. Again there are lots of providers in this space. Examples include Amazon Redshift, Snowflake, Microsoft Azure Synapse Analytics and Google BigQuery.
Coding Tool
So we have multiple sources capturing data for us, and we have a central location to store that data. We now need a tool or tools that help us do things with the data we have. Depending on the other tools you choose, you might be restricted in what you can use to access your data. However, the most common solutions here are an SQL-based client and/or Python. Depending on what you want to do with your data, other tools include R and SAS.
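As a flavour of what “doing things with the data” means at the simplest level, here is a tiny pure-Python analysis over some made-up CSV order data; in practice the rows would come from your warehouse or lake rather than an inline string:

```python
import csv
import io
import statistics

# Made-up order data standing in for a query result or exported file.
raw = io.StringIO("order_id,amount\n1,20.0\n2,35.5\n3,12.5\n")
amounts = [float(row["amount"]) for row in csv.DictReader(raw)]

# Basic summary statistics, the bread and butter of ad hoc analysis.
print("total:", sum(amounts))            # 68.0
print("average:", statistics.mean(amounts))
```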
You might also hear about tools like Databricks and AWS SageMaker. These tools also act as a way of performing actions on and with your data. They also take away the hard work of managing the clusters of servers needed to cope with the workloads generated when carrying out data science projects on big data. So if you are not intending to run data science projects, or you don’t have any big data datasets in the short term, you might not need to worry about these types of tools.
Automation and Orchestration Tool
We’ll now look at tools that can help to orchestrate and automate the actions we perform on our data.
You might have heard the term “ETL tools” before. ETL stands for Extract, Transform and Load. You might also hear the term “ELT tools”, for Extract, Load and Transform. ELT is essentially a newer variation on ETL in which the transformation stage happens in the data warehouse rather than on a separate server. Most commonly used tools are capable of both ETL- and ELT-style pipelines; examples include Apache Airflow and AWS Glue.
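The three ETL steps can be sketched as three plain functions. Real pipelines in tools like Airflow or Glue wrap steps like these with scheduling, monitoring and retries; the names and data below are invented for illustration:

```python
import sqlite3

def extract():
    # In practice this would pull from an API, a file drop, or a source
    # database; here we return hard-coded raw records.
    return [
        {"customer_id": " c1 ", "amount": "49.99"},
        {"customer_id": "c2", "amount": "not-a-number"},
    ]

def transform(rows):
    # Clean and validate: trim whitespace, parse numbers, drop bad rows.
    clean = []
    for row in rows:
        try:
            clean.append((row["customer_id"].strip(), float(row["amount"])))
        except ValueError:
            continue  # drop rows that fail basic validation
    return clean

def load(rows, conn):
    # Write the cleaned rows into the destination table.
    conn.execute("CREATE TABLE IF NOT EXISTS orders (customer_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT * FROM orders").fetchall())  # [('c1', 49.99)]
```

In an ELT pipeline the same transform logic would instead run as SQL inside the warehouse, after the raw rows had been loaded.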
There is also another variation on these tools: SaaS-based ETL tools. Typically these have built-in integrations with the data sources mentioned above, which can mean pipelines are set up more quickly and with less experience needed. The downside is that you can find yourself locked into a service, and most charge a subscription. Examples of this type of tool include Hevo and Segment.
Data Visualisation Tool
The final category is an analysis/visualisation/reporting tool. These tools help to create dashboards and reports that are distributed across your business. Examples would be Power BI, Tableau, Looker, Amazon QuickSight and, to an extent, Excel. You can also include tools like Mixpanel and Google Analytics in this category. These tools allow you to show your data in interesting ways and help you gain insight into what’s going on.
As a side note to these categories, the reason I think the landscape has become so hard to comprehend is that there are now so many tools, and a lot of them actually sit in more than one of the categories above. Google Analytics, for instance, can sit in every single category I mentioned.