Data is the new oil - Clive Humby
Gathering data from external systems into a central platform is no easy task. When you manage to pull it off, what do you get? A data lake? Not so fast! Countless companies have jumped on board the Big Data train, but were they able to extract the value they expected from these Big Data platforms? After all, it is not a cheap endeavor. The truth is that most of them never saw the light at the end of the tunnel and got stuck with a data swamp: a messy dump site for all the data that an engineer can get their hands on. Merely throwing data into a centralized location is not enough to make it usable for downstream data consumers. It has to be refined and curated into analytics-friendly, use-case-specific datasets that a user can utilize out of the box for data analysis.
Bronze - The Landing Layer
Data can arrive from various sources, some of which are out of our control. We cannot hold data in flight, so we dedicate a landing zone to all data ingested into the platform. This layer acts as the original source of truth in our data lake. Data is kept in its original form and is only used as input for transformation into the silver layer.
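To make "kept in its original form" concrete, here is a minimal sketch of a landing step, assuming a plain file system stands in for the data lake. The function name `land_raw`, the `bronze/<source>/<date>/` layout, and the sidecar metadata file are all hypothetical choices for illustration, not a prescribed design.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def land_raw(payload: bytes, source: str, root: Path,
             now: Optional[datetime] = None) -> Path:
    """Write a payload to the bronze layer exactly as it was received."""
    now = now or datetime.now(timezone.utc)
    digest = hashlib.sha256(payload).hexdigest()

    # Hypothetical layout: partition by source system and ingestion date.
    target_dir = root / "bronze" / source / now.strftime("%Y-%m-%d")
    target_dir.mkdir(parents=True, exist_ok=True)

    # The payload is stored byte for byte: no parsing, no transformation.
    data_path = target_dir / f"{digest}.raw"
    data_path.write_bytes(payload)

    # A sidecar file records where the payload came from and when it landed.
    meta = {"source": source, "ingested_at": now.isoformat(), "sha256": digest}
    (target_dir / f"{digest}.meta.json").write_text(json.dumps(meta))
    return data_path
```

Naming the file after its content hash means that landing the same payload twice on the same day overwrites an identical file instead of creating a duplicate.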
Bronze Data Lifecycle
Data is transitioned to long term cold storage as it ages. Data retention is enforced according to compliance standards.
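As a rough sketch, the lifecycle rule boils down to a decision based on the age of the data. The thresholds below (90 days before cold storage, 7 years of retention) are made-up placeholders; the real values would come from whichever compliance standard applies to you, and in practice the rule would usually live in your storage system's lifecycle configuration instead of application code.

```python
from datetime import date, timedelta

# Hypothetical thresholds; replace with the values your compliance policy requires.
COLD_AFTER = timedelta(days=90)
RETAIN_FOR = timedelta(days=365 * 7)


def lifecycle_action(ingested_on: date, today: date) -> str:
    """Return the storage action for data of a given age."""
    age = today - ingested_on
    if age >= RETAIN_FOR:
        return "delete"  # retention period has elapsed
    if age >= COLD_AFTER:
        return "cold"    # move to long term cold storage
    return "hot"         # keep in regular storage
```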
Bronze Data Security
This layer contains data in its purest and most sensitive format. Therefore, controlled access to this layer is enforced. Only approved personnel and automated systems are given access to the data. All access must be logged.
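The two rules above (an allow-list, and a log entry for every attempt) can be sketched in a few lines. The principal names and the `authorize_bronze_read` helper are hypothetical; a real platform would typically delegate this to its identity and audit services.

```python
import logging

logger = logging.getLogger("bronze.access")

# Hypothetical allow-list of approved personnel and automated systems.
APPROVED_PRINCIPALS = {"ingest-service", "alice"}


def authorize_bronze_read(principal: str, dataset: str) -> bool:
    """Decide whether a principal may read a bronze dataset, and log the attempt."""
    allowed = principal in APPROVED_PRINCIPALS
    # Every attempt is logged, whether it was granted or denied.
    logger.info("bronze read principal=%s dataset=%s allowed=%s",
                principal, dataset, allowed)
    return allowed
```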
Silver - The Warehouse Layer
Relevant data (a subset of the bronze layer) is transformed and condensed into an analytics-friendly columnar format such as Parquet. This processed data will be used as the basis of curated gold datasets. Any essential transformations such as anonymization, de-duplication, and cleaning are applied to the data in order to make it ready for on-demand consumption.
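Here is a small sketch of the three transformations named above, applied to rows held as plain dictionaries. The field names (`order_id`, `email`, `amount`) are invented for the example. Note that a salted hash is strictly speaking pseudonymization, which may or may not be enough for your anonymization requirements. Writing the result to Parquet is left out; it would need a library such as pyarrow.

```python
import hashlib


def anonymize(value: str, salt: str) -> str:
    """Replace an identifier with a salted SHA-256 digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def to_silver(rows, salt):
    """Clean, de-duplicate, and anonymize bronze rows (hypothetical schema)."""
    seen = set()
    silver = []
    for row in rows:
        order_id = row.get("order_id")
        email = (row.get("email") or "").strip().lower()
        # Cleaning: drop rows that are missing required fields.
        if order_id is None or not email:
            continue
        # De-duplication: keep only the first occurrence of each order.
        if order_id in seen:
            continue
        seen.add(order_id)
        silver.append({
            "order_id": order_id,
            # Anonymization: the raw email never reaches the silver layer.
            "customer_key": anonymize(email, salt),
            "amount": float(row["amount"]),
        })
    return silver
```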
Silver Data Lifecycle
Data is transitioned to long term cold storage as it ages. Data retention is enforced based on application usage needs instead of compliance needs. Data can be expired earlier since it can be regenerated from the bronze layer.
Silver Data Security
This layer contains data that an engineer can use to curate datasets for downstream consumers. Regular data users should not have access to this layer, because its format is typically inefficient for their access patterns.
Gold - The Curated Layer
These datasets are curated for the specific use case of an end user. Data is reshaped, joined, and stored in a location that will be directly accessed by consumers. The dataset will be optimized for the access patterns of its use case.
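To illustrate "reshaped, joined", suppose a hypothetical finance team only ever asks for revenue per region. A gold dataset for them could be built by joining two silver tables and pre-aggregating the result, so that consumers read a handful of rows instead of scanning every order. The table shapes and field names below are assumptions for the example.

```python
from collections import defaultdict


def build_revenue_by_region(orders, customers):
    """Join silver orders to customers and aggregate revenue per region."""
    # Join key: the anonymized customer identifier (hypothetical field name).
    region_of = {c["customer_key"]: c["region"] for c in customers}

    totals = defaultdict(float)
    for order in orders:
        region = region_of.get(order["customer_key"], "unknown")
        totals[region] += order["amount"]

    # Reshape into one row per region, sorted for stable output.
    return [{"region": r, "revenue": totals[r]} for r in sorted(totals)]
```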
Gold Data Lifecycle
Gold datasets should typically be retained for the lifetime of their use case and cleaned up when no longer needed.
Gold Data Security
This layer contains data that has been tailor-made for a particular team, therefore only that team should have access to the data. If other teams find the dataset useful, they can create a copy of it in their own gold layer.
Platinum - The Enriched Layer
The insights and reports generated by our data analysts are used to enrich the gold datasets, and the enriched results form the platinum layer. These valuable datasets are used by business stakeholders for making data-driven decisions.
Platinum Data Lifecycle
Platinum datasets should be made available for as long as the business stakeholders need them to support decision making.
Platinum Data Security
All business stakeholders should be encouraged to make data-driven decisions from these datasets with minimum friction.