From Synapse to Fabric: How Glowi Modernized Its Data Architecture
- Digital Hive

- Mar 20
- 5 min read

Industry: Domestic services
Client Introduction
Glowi is a multi-brand organization that includes Het Poetsbureau, which operates 75 offices across Flanders and employs over 10,000 domestic helpers. Their data is mostly about customers and employees and is primarily used to create schedules, matching the right customers to the right employees. It also serves as input for their recruiting and marketing teams.
Challenge
The current data setup is built on Microsoft Synapse running on Azure infrastructure.
It is cost-effective and fit for purpose, tailored to the size and requirements of Glowi. To stay up to date with new technologies, Glowi wanted to investigate moving to Microsoft Fabric (What is Microsoft Fabric - Microsoft Fabric | Microsoft Learn), Microsoft's new data platform, which should be able to replace their older Synapse setup.
There are many reasons to move to Fabric. Microsoft's focus is clearly on developing this platform and shifting away from Synapse, which means new development and features will land in Fabric. Furthermore, the platform is more closely integrated with Power BI and machine learning use cases, ensuring that Glowi is both future-proof and ready for AI.
One of the main challenges was cost. The current system was very effective: the team at Glowi had managed to align the hardware with the requirements of the data, keeping costs very low. The question was whether it would be possible to migrate to Fabric without significantly increasing costs, both for infrastructure and for licenses (Fabric as well as Power BI).
The other challenge was the setup of the new Fabric environment itself. Since Fabric is a new offering, no one at Glowi had deep knowledge of the platform, so they were unsure how to implement it and what the best way of working on it would be, from how to run jobs to best practices and platform limitations.
Scope
A workshop on efficiently ingesting new data into the Fabric lakehouses.
An investigation of capacity and licenses.
A one-month POC to do an initial onboarding and setup of the Fabric platform.
Results
Cost analysis
To address the concerns around cost, we provided Glowi with an overview of the costs of using Fabric and Power BI, as well as a comparison of the costs of Synapse and Fabric (part of this analysis can be found in A comparison of Spark pools in Synapse and Fabric). The main conclusion is that reserved capacity is highly preferable, since it comes at quite a discount. It does have a limitation, however: reserved capacity is billed 24/7, so you keep paying for it even when it is not being used. To identify the right capacity size for their jobs, we suggested starting with pay-as-you-go resources; once the right size is known, that capacity can be reserved for a year or longer. This is also taken into account in the infrastructure design, with runs overnight and analysis during the day, to make sure the reserved capacity is fully used.
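To illustrate the trade-off, the back-of-the-envelope calculation behind this advice looks roughly as follows. The rates below are placeholders, not actual Fabric pricing, and the script is only a sketch of the reasoning:

```python
# Illustrative break-even check for reserved vs pay-as-you-go capacity.
# The rates below are placeholders, NOT actual Fabric pricing; plug in the
# current prices for your region and SKU.

PAYG_RATE_PER_HOUR = 1.00     # hypothetical pay-as-you-go price per capacity-hour
RESERVED_DISCOUNT = 0.40      # hypothetical discount for a one-year reservation
HOURS_PER_MONTH = 730

# Reserved capacity is billed for every hour of the month, used or not.
reserved_cost = HOURS_PER_MONTH * PAYG_RATE_PER_HOUR * (1 - RESERVED_DISCOUNT)

# Pay-as-you-go only bills the hours the capacity actually runs, so the
# reservation pays off once usage passes this many hours per month.
break_even_hours = reserved_cost / PAYG_RATE_PER_HOUR

print(f"Reservation wins above ~{break_even_hours:.0f} h/month "
      f"({break_even_hours / HOURS_PER_MONTH:.0%} utilization)")
```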
POC: setup and infrastructure
During a one-month POC, we set up the infrastructure of their new Fabric environment and migrated a first data source to it. This was done using a medallion architecture, with a lakehouse for each layer. Multiple ingestion flows were defined, with data coming in from APIs and Azure databases. An example of this medallion architecture in Fabric can be seen below:

As part of the POC, we identified a few key flows in the current Synapse workflows and reimplemented them in Fabric. This allowed us to establish the way of working: using PySpark in notebooks, reusing shared logic, and orchestrating through notebooks. It provides a clear starting point for the future implementation of new data sources.
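To make this concrete, a reimplemented flow could look like the minimal sketch below. The lakehouse and table names are placeholders and the transformation is purely illustrative, not Glowi's actual logic; the silver lakehouse is assumed to be the notebook's default lakehouse, with the bronze lakehouse in the same workspace referenced by name.

```python
from pyspark.sql import functions as F

# Minimal bronze -> silver sketch. `spark` is the SparkSession that Fabric
# notebooks provide out of the box; names below are placeholders.

raw_customers = spark.read.table("bronze.customers_raw")

clean_customers = (
    raw_customers
    .dropDuplicates(["customer_id"])                      # one row per customer
    .withColumn("postal_code", F.trim("postal_code"))     # basic cleaning
    .withColumn("processed_at", F.current_timestamp())    # load metadata
)

# Written as a Delta table in the (default) silver lakehouse.
clean_customers.write.format("delta").mode("overwrite").saveAsTable("customers")
```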
We will now provide the details of some of the choices we made during this POC:
Implementation using PySpark and notebooks: Moving away from more UI-based workflows, we decided to do all the ETL using PySpark in notebooks. This makes it much easier to test and standardize how the ETL is done, and it allows for easier versioning and sharing of the code. A clear advantage was that all data writes go through a specific helper function, which later allowed us to add logging to all flows with only a single line of code (see the sketch after this list).
Medallion architecture with each layer in its own lakehouse: Following the standard advised by Microsoft, we created a lakehouse for each of the bronze, silver, and gold layers of the medallion architecture. This gives a clear separation of data and makes it easier to set up different access rules and rights per layer. In each of these lakehouses the data is stored as Delta (Parquet) tables, which is ideal for distributed access using PySpark. In the final layer, we suggest organizing the data in a dimensional model, using facts and dimensions.
Orchestration using notebooks: While it is possible to create a workflow that runs notebooks, for now the most efficient approach is to use a notebook that references other notebooks. For each layer we therefore created an orchestration notebook, which first calls the helper notebook with shared functions and then in turn calls the notebooks relevant to that layer, structured based on the data being processed. If you need to find the details about some data, you can easily do so by searching for the concept in the right layer (e.g. Employee, Customer, ...). It also makes the shared logic easy to find, since it lives in a separate notebook. A sketch of both the shared write helper and such an orchestration notebook follows below.
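To make the helper-function and orchestration choices concrete, here is a minimal sketch of what they could look like. The function names, table names, and notebook names are assumptions for illustration, not Glowi's actual code.

```python
from datetime import datetime, timezone

# Hypothetical shared write helper, kept in a common notebook that every ETL
# notebook pulls in first. Because all writes go through it, logging could
# later be added here once (the single log_write call) and apply to every flow.

def write_delta(df, table_name, mode="overwrite"):
    df.write.format("delta").mode(mode).saveAsTable(table_name)
    log_write(table_name, df.count(), datetime.now(timezone.utc))

def log_write(table_name, row_count, written_at):
    # Placeholder: append a row to a hypothetical logging table.
    spark.createDataFrame(
        [(table_name, row_count, written_at)],
        "table_name string, row_count long, written_at timestamp",
    ).write.format("delta").mode("append").saveAsTable("etl_write_log")
```

The per-layer orchestration notebook can then be little more than a sequence of reference calls; again, the notebook names and timeout are placeholders:

```python
# Hypothetical silver-layer orchestration notebook.
# A preceding cell pulls in the shared helpers with the %run magic, e.g.:
#   %run nb_shared_helpers
# Each concept notebook for this layer is then executed in turn;
# mssparkutils is available out of the box in Fabric notebooks and
# notebook.run waits for the child notebook to finish.
mssparkutils.notebook.run("nb_silver_customer", 1800)   # timeout in seconds
mssparkutils.notebook.run("nb_silver_employee", 1800)
```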
To ingest new data into the system, we defined three methods. First, for the Azure databases that support it, we set up mirroring in Fabric. This ensures that the data from the database is automatically copied into the delta lake without any extra maintenance or setup; a shortcut then makes this data accessible from the bronze lakehouse. Second, for databases that are too small to allow mirroring, we designed a reusable notebook setup that incrementally ingests the data into the bronze lakehouse. Third, for data coming from API sources, the same setup can be followed: a Python notebook calls the API and loads the result into the bronze lakehouse. This gives API ingestion that fits neatly into the existing tech stack and infrastructure, making it easy to maintain.
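As an illustration of the third method, an API ingestion notebook could be as simple as the sketch below. The endpoint, authentication, and table name are assumptions for the example, not an actual Glowi source.

```python
import requests
from pyspark.sql import functions as F

# Hypothetical API ingestion notebook: fetch -> Spark DataFrame -> append to
# a bronze Delta table. All names and the endpoint are placeholders.

response = requests.get(
    "https://api.example.com/v1/customers",            # placeholder endpoint
    headers={"Authorization": "Bearer <token>"},        # e.g. read from a key vault
    timeout=30,
)
response.raise_for_status()
records = response.json()                               # assumed: list of flat JSON objects

df = (spark.createDataFrame(records)
      .withColumn("loaded_at", F.current_timestamp()))  # ingestion metadata

df.write.format("delta").mode("append").saveAsTable("customers_api")  # bronze table
```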
Conclusion
Together with Glowi, we created the basis of their Fabric infrastructure, ensuring that they are future-proof and ready to get started with this new platform. We used our expertise to make sure the setup and infrastructure fit the needs and expectations of Glowi.
In a short period of time, we were able not only to design and document the setup, but also to onboard a first data source into the system, giving Glowi a strong starting point for further development.
When we started this project, Glowi mostly had questions about Fabric. Now they have a clear idea of what the future will look like and, even better, a working proof of concept they can use to evaluate the platform and make it their own.


