Today, we’re seeing more companies embrace cloud-based technologies to deliver superior customer experiences. An underlying architectural pattern is the use of an open data lakehouse. That’s no surprise: open data lakehouses can easily handle digital-era data types that traditional data warehouses weren’t designed for.
Data warehouses are great at both analyzing and storing the tables and schemas that represent traditional business processes surrounding products, sales transactions, accounts, and other structured data. Open data lakehouses can additionally analyze and store semi-structured and unstructured data, such as click-stream data, sensor data, geospatial data, and media files. Analysis is performed via traditional SQL queries and ML/AI programming frameworks. On top of this flexibility, the open data lakehouse provides these capabilities with free, open-source software and open data formats. But unlike the data warehouse, open data lakehouses don’t come as one integrated platform. Instead, they combine best-of-breed OSS components to deliver query execution, transactional support, and robust security.
In this article, we’ll look at how companies are building the open data lakehouse to augment the data warehouse. The open data lakehouse is a more flexible stack that addresses the high costs, lock-in, and limitations of the traditional data warehouse. Specifically, we’ll look at how companies are securing the open data lakehouse, including the initial challenges and their open-source solutions.
A Quick Overview of the Open Data Lakehouse

The open data lakehouse consists of low-cost, scalable data lake storage (e.g., AWS S3), database-like data management functionality (e.g., Apache Hudi, Apache Iceberg), open data formats (e.g., Apache Parquet, ORC), governance/security (e.g., Apache Ranger, AWS Lake Formation), ML and AI frameworks (e.g., TensorFlow, PyTorch), and SQL query processing engines (e.g., Presto). On top, you have your reporting and dashboarding tools along with your data science, ML, and AI tools.
While this article will focus on security, it’s important to note that SQL query capabilities, ML and AI frameworks, and transactional support can all be added to your data lake. Many companies are evolving to this architecture for the reasons listed above: better cost, more flexibility, and better price-performance than the data warehouse paradigm.
Implementing Data Security: The Data Platform Team
As the data lake has become widely used, digital-native companies are more closely managing the security and governance of their diverse data sets and their corresponding use. Controlling who has access to what data and what permissions a user has is critical. The team working on data lakehouse security typically consists of the data platform owner, the data practitioner (i.e., data analyst, data scientist, data engineer), and the security administrator. For the purposes of this article, we’ll focus on the data platform owner and the data practitioner.
When it comes to data lakehouse security, there are three key areas that need to be addressed:
- Multi-user support
- Role-based access control
- Auditing
In the last year, we’ve seen a pronounced effort around building technologies that address these areas for the data lakehouse. Before, it was a challenge to meet these security requirements: the data platform team had to custom-build and manage these policies on their own. As companies grow, their data and the number of users who need access to that data increase dramatically. Keeping up with that scale from a security perspective was very hard; many times, it meant sharing access credentials across teams or simply giving everyone access to everything in the lake.
Now, as more proprietary and personal data is being stored and more data practitioners work on the data lakehouse, security needs to be much tighter. Below, we’ll dive into these three key security areas and why they’re important.
Multi-User Support
Data practitioners need access to the computing clusters that the data platform owner provisions for them, which is why identity and access management and authorization are important. Multi-user support within an open data lakehouse architecture makes this possible, so it is a critical component of security. Instead of everyone being a data platform owner, multiple users get narrower rights, or specific users get credentials to specific clusters, which reduces “key-person” risk across teams. Ultimately, the data platform team wants easy management of a set of users. Sharing credentials across an organization doesn’t meet today’s security requirements.
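To make the idea of per-user, per-cluster access concrete, here is a minimal Python sketch. It is not how any particular product implements multi-user support; the `ClusterAccess` class and the user and cluster names are hypothetical and exist only to show each user being granted access to specific clusters instead of sharing one set of credentials.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterAccess:
    """Tracks which users may connect to which compute clusters."""

    # cluster name -> set of user names granted access (all names hypothetical)
    grants: dict[str, set[str]] = field(default_factory=dict)

    def grant(self, user: str, cluster: str) -> None:
        # Give one user access to one cluster.
        self.grants.setdefault(cluster, set()).add(user)

    def revoke(self, user: str, cluster: str) -> None:
        # Remove a user's access; a no-op if the grant does not exist.
        self.grants.get(cluster, set()).discard(user)

    def can_connect(self, user: str, cluster: str) -> bool:
        return user in self.grants.get(cluster, set())


access = ClusterAccess()
access.grant("analyst_ana", "reporting-cluster")
access.grant("engineer_eli", "etl-cluster")

print(access.can_connect("analyst_ana", "reporting-cluster"))  # True
print(access.can_connect("analyst_ana", "etl-cluster"))        # False
```

Because grants are tracked per user, access can be revoked for one person without rotating credentials for everyone else.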
Role-Based Access Control (RBAC)
Authorization levels for an organization’s users are the next critical piece of security. Access to data needs to be authenticated and authorized in a unified way: you want to make sure the right people within your organization have the right access to their data. Some of the more common RBAC technologies we see in the open data lakehouse stack are Apache Ranger and AWS Lake Formation. Both offer fine-grained access control for your data, giving data platform owners more control over who can access what data.
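The core idea of RBAC can be sketched in a few lines of Python. This is an illustration of the concept only, not the policy model of Apache Ranger or AWS Lake Formation; the roles, users, table names, and the `is_allowed` helper are all hypothetical.

```python
# Hypothetical role definitions: role -> set of (table, action) pairs it permits.
ROLE_PERMISSIONS = {
    "analyst": {("sales.orders", "select")},
    "engineer": {("sales.orders", "select"), ("sales.orders", "insert")},
}

# Hypothetical user-to-role assignments.
USER_ROLES = {
    "ana": {"analyst"},
    "eli": {"engineer"},
}


def is_allowed(user: str, table: str, action: str) -> bool:
    """Return True if any of the user's roles permits the action on the table."""
    return any(
        (table, action) in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )


print(is_allowed("ana", "sales.orders", "select"))  # True
print(is_allowed("ana", "sales.orders", "insert"))  # False
print(is_allowed("eli", "sales.orders", "insert"))  # True
```

Permissions attach to roles, not to individuals, so onboarding a new analyst means assigning one role instead of copying a list of table grants.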
Auditing
Audit support allows for centralized auditing of user access based on permission levels. Additionally, Apache Ranger maintains an audit trail: when users interact with data, it tracks what they did. It’s also important to be able to track when users request access to data and whether those requests are approved or denied based on permission levels.
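As a rough illustration of what such an audit trail captures, the Python sketch below records each access request together with whether it was allowed or denied. It does not reflect Apache Ranger’s actual audit schema; the `AuditEvent` and `AuditLog` names, their fields, and the sample users and tables are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    """One access request and its outcome (hypothetical schema)."""

    timestamp: datetime
    user: str
    resource: str
    action: str
    allowed: bool


class AuditLog:
    """Central, append-only record of access decisions."""

    def __init__(self) -> None:
        self._events: list[AuditEvent] = []

    def record(self, user: str, resource: str, action: str, allowed: bool) -> AuditEvent:
        # Timestamp every event in UTC so entries from different clusters line up.
        event = AuditEvent(datetime.now(timezone.utc), user, resource, action, allowed)
        self._events.append(event)
        return event

    def denied(self) -> list[AuditEvent]:
        return [e for e in self._events if not e.allowed]

    def for_user(self, user: str) -> list[AuditEvent]:
        return [e for e in self._events if e.user == user]


log = AuditLog()
log.record("ana", "sales.orders", "select", allowed=True)
log.record("ana", "hr.salaries", "select", allowed=False)

for event in log.denied():
    print(event.user, event.action, event.resource, "DENIED")
```

Recording denied requests alongside approved ones is what lets a platform team answer both “who touched this table?” and “who tried to?”.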
Key Technologies to Enable Data Security
We’ve touched on a few technologies, so let’s dive a little deeper into them. When it comes to securing your data in the data lakehouse, there are three technologies to look at: Apache Ranger, AWS Lake Formation, and Presto.
Apache Ranger
Apache Ranger is an open-source framework that allows users to manage data security across the data lake. One of the big benefits of Ranger is its open and pluggable architecture, meaning it can be used across clouds, on-prem, or in hybrid environments and can be integrated with various compute and query engines, including Presto, Google BigQuery, Azure HDInsight, and many more. Apache Ranger gives you unified data access governance and security for your data.
AWS Lake Formation
AWS Lake Formation is an Amazon service that makes it easy to set up a secure data lake in a matter of days. For AWS users, this service is easy to integrate into an existing stack and is often the go-to choice. Lake Formation provides the governance layer for AWS S3, and setup is straightforward: users define their data sources and the access and security policies they want to apply, and they’re up and running.
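For a sense of what defining an access policy can look like in practice, here is a hedged Python sketch that builds the arguments for a table-level grant. The `build_grant_request` helper, the role ARN, and the database and table names are hypothetical. The dictionary layout follows our understanding of the Lake Formation `GrantPermissions` request, so check it against the current AWS documentation before relying on it.

```python
def build_grant_request(principal_arn: str, database: str, table: str,
                        permissions: list[str]) -> dict:
    """Build keyword arguments for a table-level Lake Formation grant.

    Hypothetical helper: it only assembles a dictionary and never calls AWS.
    """
    if not permissions:
        raise ValueError("at least one permission is required")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


# Placeholder account ID, role, database, and table names.
request = build_grant_request(
    "arn:aws:iam::123456789012:role/AnalystRole",
    "sales",
    "orders",
    ["SELECT"],
)
print(request)

# Assumption: with boto3 installed and AWS credentials configured, the request
# could be sent with
#   boto3.client("lakeformation").grant_permissions(**request)
```

Keeping the request construction separate from the AWS call makes the policy easy to review and test before anything is applied to a real account.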
Presto
Presto is an open-source SQL query engine for the data lakehouse. It’s used for interactive, ad hoc analytics on data as well as the common reporting and dashboarding use cases. It runs at scale at some of the top digital companies, such as Meta (Facebook), Uber, ByteDance, and Twitter. With Presto, data platform owners get built-in multi-user support for their Presto clusters (which access the data in the data lake to run queries). Presto makes it easy to control who has access to what data. If you use a Presto managed service, you can leverage pre-built integrations with Apache Ranger and/or AWS Lake Formation to take advantage of the security and governance these technologies provide as well.
Securing data in the data lakehouse has become even more important as more companies look to augment their cloud data warehouse with the insights on their lake. With all the benefits the data lakehouse provides, including better cost, more flexibility, better scale, and greater openness, digital-native companies want to leverage it more than ever before. And now it’s possible to rest assured that data lakehouse security is on par with the data warehouse. With more fine-grained access control and governance capabilities available today, it’s now possible to architect a fully secured data lakehouse.