How to build a common data model in an EDW using data from siloed applications?
One of our customers has a siloed data landscape: applications were built in isolation, but each represents a step in a business process. We stream the data from the underlying data sources to Google BigQuery using Apache Kafka.
Among the data sources is MongoDB. We derive Avro schemas from the JSON documents, merge schemas in real time when schema changes are detected, and insert the data using the merged schema. Downstream, the Kafka connector adjusts the Google BigQuery table structures as needed to stream the data into tables. The streaming synchronisation of RDBMSs is less complex.
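To illustrate the schema-merge step, here is a minimal Python sketch of the idea. The schemas are plain dicts and the field names are made up for illustration; a real pipeline would of course go through a schema registry and proper compatibility checks:

```python
import json

def make_nullable(avro_type):
    """Wrap a type in a union with null so records lacking the field stay valid."""
    if isinstance(avro_type, list):  # already a union
        return avro_type if "null" in avro_type else ["null"] + avro_type
    return ["null", avro_type]

def merge_schemas(old, new):
    """Merge two Avro record schemas: keep all fields, make one-sided fields nullable."""
    old_fields = {f["name"]: f for f in old["fields"]}
    new_fields = {f["name"]: f for f in new["fields"]}
    merged = []
    for name in list(old_fields) + [n for n in new_fields if n not in old_fields]:
        if name in old_fields and name in new_fields:
            merged.append(old_fields[name])  # sketch: assume the types are compatible
        else:
            f = dict(old_fields.get(name) or new_fields[name])
            f["type"] = make_nullable(f["type"])
            f["default"] = None  # null must be the first union branch for this default
            merged.append(f)
    return {"type": "record", "name": old["name"], "fields": merged}

# Illustrative schemas: a new optional field appears in the source.
old = {"type": "record", "name": "Property",
       "fields": [{"name": "id", "type": "string"},
                  {"name": "price", "type": "long"}]}
new = {"type": "record", "name": "Property",
       "fields": [{"name": "id", "type": "string"},
                  {"name": "surface_m2", "type": "double"}]}

merged = merge_schemas(old, new)
print(json.dumps(merged, indent=2))
```

The merged schema keeps every field ever seen and makes one-sided fields nullable with a null default, so both old and new records remain readable.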
How do we model all data into a common data model, and how do we implement this? Do you have any guidelines or recommendations, or is this explained in the course? The different applications don't use the same vocabulary; e.g. "Property" and "Good" for the same concept.
What's already clear to us is that some applications will need to be adapted before all data from the different applications can be joined/linked; in other words, data quality needs to be improved and data models will need to change.
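To make the vocabulary problem concrete, a small sketch of a term-mapping layer. The canonical term "asset", the source names, and the field names are all invented for illustration; the real mapping would come from a business glossary:

```python
# Hypothetical source-to-canonical term mapping; "asset" as the canonical
# business term is an illustrative choice, not an agreed one.
CANONICAL_TERMS = {
    "app_a": {"property": "asset", "property_id": "asset_id"},
    "app_b": {"good": "asset", "good_ref": "asset_id"},
}

def to_canonical(source, record):
    """Rename a record's keys to the shared vocabulary; unmapped keys pass through."""
    mapping = CANONICAL_TERMS[source]
    return {mapping.get(k, k): v for k, v in record.items()}

print(to_canonical("app_b", {"good_ref": "G-42", "good": "warehouse"}))
# {'asset_id': 'G-42', 'asset': 'warehouse'}
```

Keeping this mapping explicit and versioned makes the "same concept, different word" problem a maintained artifact rather than tribal knowledge.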
Additionally, does anybody have experience moving the batch code that joins all data to real-time?
And how do we model consolidated 360° views in real time? Are you in favour of APIs disclosing this data, or of using a streaming platform to publish it for consumption by other components?
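As a sketch of what we mean by a consolidated real-time view: each incoming event updates per-key state and the merged record is re-emitted. An in-memory dict stands in for a real state store (Kafka Streams, ksqlDB, Flink), and the source/field names are invented:

```python
from collections import defaultdict

# Keyed state: business key -> {source -> latest payload}.
# In a real deployment this would be fault-tolerant state, not a dict.
state = defaultdict(dict)

def on_event(source, key, payload):
    """Merge the event into the keyed state and return the updated 360° view."""
    state[key][source] = payload
    view = {"key": key}
    for src, data in state[key].items():
        view.update({f"{src}.{k}": v for k, v in data.items()})
    return view  # publish to a changelog topic, or serve via an API layer

on_event("crm", "asset-1", {"owner": "Alice"})
view = on_event("valuation", "asset-1", {"price": 250_000})
print(view)  # {'key': 'asset-1', 'crm.owner': 'Alice', 'valuation.price': 250000}
```

The API-vs-streaming question then becomes where the returned view goes: pushed to a topic for other components, or held in a queryable store behind an API.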
Creating a common data model for such a complex architecture can be a huge undertaking, depending on the level of model needed (business terms, logical, or physical). Maintaining such a model can also be a challenge.
The first steps would be to identify what the common model will be used for (how it will make the organization more successful) and the audience that would benefit most (e.g. business analysts, modelers/architects, developers).
Can you post these answers? We can continue the conversation.
These applications were developed in isolation to comply with laws; each application is a step in a business process consisting of multiple steps. All data will be streamed into a data lake to get a view on top of all the data belonging to this business process. So the first goal is to be able to report on all data. That is not easy, because the applications don't speak the same business language (e.g. property vs good) and the data cannot be joined easily: it is not saved with the same keys, or it sits at a different level of granularity. The source applications will have to be adapted to improve data quality and to make the data properly joinable.
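A tiny sketch of what joining on a shared business key could look like: a crosswalk maps each application's local identifier to a common key, and detail rows are rolled up to the shared granularity before joining. The crosswalk entries, source names, and field names are invented for illustration:

```python
# Hypothetical key crosswalk: (source, local id) -> shared business key.
CROSSWALK = {
    ("permits", "P-7"): "asset-1",
    ("sales", "S-19"): "asset-1",
}

def business_key(source, local_id):
    """Resolve an application-local identifier to the shared business key."""
    return CROSSWALK.get((source, local_id))

def rollup(rows, key_fn, value_field):
    """Aggregate detail rows up to the shared granularity before joining."""
    totals = {}
    for row in rows:
        k = key_fn(row)
        totals[k] = totals.get(k, 0) + row[value_field]
    return totals

sales = [{"source": "sales", "id": "S-19", "amount": 100},
         {"source": "sales", "id": "S-19", "amount": 50}]
print(rollup(sales, lambda r: business_key(r["source"], r["id"]), "amount"))
# {'asset-1': 150}
```

Populating and maintaining such a crosswalk is exactly the data-quality work the source applications need to support.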
The second goal is to use data insights in business processes through AI (e.g. a property valuation model). To be able to do so, we need to be able to join data properly.
The third goal is to offer curated data to business analysts in a curated environment.
In the long term, the business wants to evolve into a platform ecosystem where multiple partners connect, data is shared among the parties, and insights are used to deliver added value to everyone involved. This is the fourth goal.
Like the admin said, what you are aiming for can potentially be a huge undertaking (especially if approached wrongly).
As a starting point, I would have a look at John Giles’ work (especially his book https://technicspub.com/fridge/ and the latest two blog posts in https://tdan.com/author/jgilescountrye-com-au ). John’s approach can help you with arriving at a target model and mapping your sources to it.
The “consolidated 360° views in real-time” on top of all that might be a challenge as well, depending on your consistency and historization requirements. Before making detailed implementation decisions, you should be clear about the presentation modes you want to support (dimensional/flat, if/how to historize, ...).
This is not meant to discourage you but to give you some pointers for arriving at a proper approach.
If you want to go into more detail, you can find me on LinkedIn ( https://www.linkedin.com/in/christian-kaul/).