The data lake has come on strong in recent years as a modern design pattern that fits today’s data and the way many users want to organize and use their data. For example, many users want to ingest data into the lake quickly so it’s immediately available for operations and analytics. They want to store data in its original raw state so they can process it many different ways as their requirements for business analytics and operations evolve.
They need to capture — in a single pool — big data, unstructured data, and data from new sources such as the Internet of Things (IoT), social media, customer channels, and external sources such as partners and data aggregators. Furthermore, users are under pressure to develop business value and organizational advantage from all these data collections, often via discovery-oriented analytics.
A data lake, especially when deployed atop Hadoop, can assist with all of these trends and requirements — if users can get past the lake’s challenges. In particular, the data lake is still very new, so its best practices and design patterns are just now coalescing. Most data lakes are on Hadoop, which itself is immature; a data lake can bring much-needed methodology to Hadoop. To the uninitiated, data lakes appear to have no methods or rules, yet that’s not true. In fact, best practices for the data lake exist, and you’ll fail without them.
Why would you need this service?
- Onboard and ingest data quickly with little or no up-front improvement.
One of the innovations of the data lake is early ingestion and late processing, which is similar to ELT, but the T is far later in time and sometimes defined on the fly as data is read. Adopting the practice of early ingestion and late processing will allow integrated data to be available ASAP for operations, reporting, and analytics. This demands diverse ingestion methods to handle diverse data structures, interfaces, and container types; to scale to large data volumes and real-time latencies; and to simplify the onboarding of new data sources and data sets.
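The early-ingestion, late-processing idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory `raw_zone` list stands in for lake storage, and the field names (`device`, `temp_c`, `temperature`) are hypothetical sample data, not part of any real source system.

```python
import json

# Hypothetical raw landing zone: every record is kept exactly as received.
raw_zone = []

def ingest(raw_line):
    """Early ingestion: store the raw line without parsing or validating it."""
    raw_zone.append(raw_line)

def read_temperatures(lines):
    """Late processing: the 'T' runs at read time, with a schema chosen by the reader."""
    for line in lines:
        record = json.loads(line)
        # Sources disagree on field names and types; reconcile them here, on read.
        value = record.get("temp_c", record.get("temperature"))
        if value is not None:
            yield {
                "device": str(record.get("device", "unknown")),
                "temp_c": float(value),
            }

ingest('{"device": "sensor-1", "temp_c": 21.5}')
ingest('{"device": 7, "temperature": "19"}')
ingest('{"device": "sensor-2", "status": "offline"}')

rows = list(read_temperatures(raw_zone))
```

All three records land in the raw zone untouched. Only the reader decides which ones are usable and how to shape them, so a different reader could apply a different transformation to the same raw lines later.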
- Control who loads which data into the lake and when or how it is loaded.
Without this control, a data lake can easily turn into a data swamp, which is a disorganized and undocumented data set that’s difficult to navigate, govern, and leverage. Establish control via policy-based data governance. A data steward or curator should enforce a data lake’s anti-dumping policies. Even so, the policies should allow exceptions — as when a data analyst or data scientist dumps data into analytics sandboxes.
Document data as it enters the lake using metadata, an information catalog, business glossary, or other semantics so users can find data, optimize queries, govern data, and reduce data redundancy.
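As a rough sketch of how policy-based control and documentation on entry might fit together, the following Python example gates every load and records a catalog entry for it. Everything here is an assumption made for illustration: the `Lake` and `CatalogEntry` classes, the zone names, and the loader names are hypothetical and do not correspond to any particular governance product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CatalogEntry:
    dataset: str
    owner: str
    description: str
    zone: str
    loaded_at: str

@dataclass
class Lake:
    approved_loaders: set
    catalog: dict = field(default_factory=dict)

    def load(self, dataset, owner, description="", zone="raw"):
        # Sandbox exception: analysts may load without approval or a description.
        if zone != "sandbox":
            if owner not in self.approved_loaders:
                raise PermissionError(f"{owner} is not approved to load into '{zone}'")
            if not description.strip():
                raise ValueError("a description is required outside the sandbox")
        # Every accepted load is documented in the catalog as it enters the lake.
        entry = CatalogEntry(dataset, owner, description, zone,
                             datetime.now(timezone.utc).isoformat())
        self.catalog[dataset] = entry
        return entry

lake = Lake(approved_loaders={"etl-service"})
lake.load("orders_2024", "etl-service", "Raw order events from the web shop")
lake.load("scratch_clusters", "analyst-ann", zone="sandbox")
```

An unapproved load into the raw zone, such as `lake.load("dump", "analyst-ann")`, raises `PermissionError`, while the same analyst can still work freely in the sandbox zone.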
- Persist data in a raw state to preserve its original details and schema.
Detailed source data is preserved in storage so it can be repurposed repeatedly as new business requirements emerge for the lake’s data. Furthermore, raw data is great for exploration and discovery-oriented analytics (e.g., mining, clustering, and segmentation), which work well with large samples, detailed data, and data anomalies (outliers, nonstandard data).
As users work with lake data over time, they sometimes break this rule to apply light data standardization when required for reporting, complete customer views, recurring queries, and general data exploration.
- Improve data at read time as lake data is accessed and processed.
This is common with self-service user practices, namely data exploration and discovery, coupled with data prep and visualization. Data is modeled and standardized as it is queried iteratively, and metadata may also be developed during exploration. Note that these data improvements should be applied to copies of data so that the raw detailed source remains intact. As an alternative, some users improve lake data on the fly with virtualization, metadata management, and other semantics.
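The rule that improvements go onto copies, not onto the raw source, might look like this in Python. The customer records and the `standardized_view` helper are hypothetical and serve only to illustrate the practice.

```python
import copy

# Hypothetical raw records, kept exactly as they arrived in the lake.
raw_customers = [
    {"id": 1, "name": "  Ada Lovelace ", "country": "uk"},
    {"id": 2, "name": "GRACE HOPPER", "country": "US"},
]

def standardized_view(records):
    """Return improved copies; the raw records are never mutated."""
    view = copy.deepcopy(records)
    for rec in view:
        # Collapse stray whitespace and normalize capitalization.
        rec["name"] = " ".join(rec["name"].split()).title()
        rec["country"] = rec["country"].upper()
    return view

clean = standardized_view(raw_customers)
```

Because the view is a deep copy, `raw_customers` still holds the original whitespace and casing after the call, so later users can apply a different standardization to the same source.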
- Capture big data and other new data sources in the data lake.
1Tech survey data shows that over half of data lakes are deployed exclusively on Hadoop, with another quarter deployed partially on Hadoop and partially on traditional systems. Many data lakes are deployed to handle big data (i.e., large volumes of Web data), and so Hadoop is a good fit. Hadoop-based data lakes are increasingly capturing large data collections from new sources, especially the IoT (machines, sensors, devices, vehicles), social media, and marketing channels.
- Integrate data of diverse sources, structures, and vintages.
Data lakes aren’t just for IoT and big data. Many users blend traditional enterprise data and modern big data on a Hadoop-based lake to enable advanced analytics, extend customer views with big data, enlarge data samples of existing fraud and risk analytics, and enrich cross-source correlations for more insightful clusters and segments. In addition, 1Tech has seen blended lake data enable logistics optimization, sentiment analysis, near-time business monitoring, patient outcome analytics in healthcare, and predictive maintenance.
- Extend and improve enterprise data architectures, both old and new.
Data lakes are rarely siloed. Most are integral parts of a larger data architecture or multiplatform data ecosystem; common examples include the multiplatform data warehouse environment, omnichannel marketing, and the digital supply chain. A lake can also extend traditional applications, such as those for multimodule ERP, financials, content management, and data or document archiving. Hence, a data lake can be a modernization strategy that extends the useful life and functionality of an existing application or data environment.
- Make each data lake serve multiple technical and architectural purposes.
A single lake typically fulfills multiple architectural purposes, such as data landing and staging, archiving for detailed source data, sandboxing for analytics data sets, and managing operational data sets (especially complete views and data masters). Even so, when a single data lake plays this many architectural roles, it may need to be distributed over multiple data platforms, each with unique storage or processing characteristics. For example, 1Tech surveys show that a quarter of data lakes are on both Hadoop and multiple instances of relational databases.
- Enable new self-service data-driven business best practices.
These include data exploration, prep, visualization, and some kinds of analytics. Nowadays, savvy users (both business and technical) expect self-service access to lake data, and they will consider the lake a failure without it. Note that self-service functionality depends on two key components: tools built for the ease of use that business users need, and business metadata and other specialized semantics.
- Select data management platforms that satisfy data lake requirements.
Hadoop is the preferred data platform for most lakes due to its low price, linear scalability, and powerful in situ processing for analytics. However, some users implement a massively parallel processing (MPP) relational database when the lake’s data is relational and/or requires relational processing (complex SQL, OLAP, materialized views).
Hybrid platforms are on the rise with data lakes; they may combine Hadoop and relational systems or on-premises and on-cloud systems. With many data collections (data lakes, warehouses, big data, analytics, etc.), 1Tech sees an increase in cloud storage, whether file/folder, object, or block.
How we deliver this service
With a data lake built on Amazon S3, you can use native AWS services to run big data analytics, artificial intelligence (AI), machine learning (ML), high-performance computing (HPC) and media data processing applications to gain insights from your unstructured data sets.
Using Amazon FSx for Lustre, you can launch file systems for HPC and ML applications and process large media workloads directly from your data lake. You also have the flexibility to use your preferred analytics, AI, ML, and HPC applications from the AWS Partner Network (APN). Because Amazon S3 supports a wide range of features, IT managers, storage administrators, and data scientists can enforce access policies, manage objects at scale, and audit activities across their S3 data lakes.
Amazon S3 hosts tens of thousands of data lakes for household brands such as Netflix, Airbnb, Sysco, Expedia, GE, and FINRA, which use them to scale securely with their needs and to discover business insights every minute.
- Ingest structured and unstructured data;
- Store, secure and protect data at unlimited scale;
- Catalog and index data for analysis without data movement;
- Connect data with analytics and ML tools;
- Build a data lake in days instead of months with AWS Lake Formation;
- Run AWS analytics applications with no data movement;
- Connect data to file systems for high-performance workloads;
- Manage data at every level across your data lake;
- Configure fine-grained access policies for sensitive data;
- Cost-effectively store objects across the S3 Storage Classes;
- Audit all access requests to S3 resources and other activities.
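To make the access-policy item above more concrete, here is a minimal Python sketch that builds an S3 bucket policy document granting one role read-only access to a single prefix. The bucket name, prefix, account ID, and role name are all hypothetical placeholders, and the helper `read_only_prefix_policy` is not an AWS API. It only assembles the JSON that you would then attach to a bucket through the console, CLI, or an SDK.

```python
import json

def read_only_prefix_policy(bucket, prefix, role_arn):
    """Build a bucket policy allowing one principal to read objects under a prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReadOfPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            }
        ],
    }

# Hypothetical names, for illustration only.
policy = read_only_prefix_policy(
    bucket="example-data-lake",
    prefix="curated/finance/",
    role_arn="arn:aws:iam::123456789012:role/FinanceAnalyst",
)
policy_json = json.dumps(policy, indent=2)
```

Scoping the `Resource` to a prefix rather than the whole bucket is one common way to keep sensitive zones of a lake separate. Real deployments typically combine this with IAM policies and, where applicable, AWS Lake Formation permissions.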
Benefits / Typical Outcomes
- Ability to derive value from unlimited types of data
- Ability to store all types of structured and unstructured data in a data lake, from CRM data to social media posts
- More flexibility—you don’t have to have all the answers up front
- Ability to store raw data—you can refine it as your understanding and insight improves
- Unlimited ways to query the data
- Application of a variety of tools to gain insight into what the data means
- Elimination of data silos
- Democratized access to data via a single, unified view of data across the organization when using an effective data management platform