From blueprint to data lake: empowering manufacturing with AWS

  • September 05, 2023

We recently had the opportunity to build an AWS data lake from the ground up for a manufacturing client. It’s the kind of project where making technical and architectural decisions can be challenging because of the number and diversity of requirements. The pitfall is analysis paralysis: you never start, so you never learn. Instead, we recommend embracing uncertainty, identifying and inventorying the high-level requirements and finding a valuable part of the project to start with.

Large manufacturing companies operate hundreds of sites spread across the globe. Each site manufactures multiple products, and each manufacturing line generates data in diverse formats (log files, measurements for temperature, voltage, pressure, vibration and so on). All this data offers immense business potential if it can be analyzed and used with machine learning.

When starting a data lake, we recommend performing a company-wide inventory of the possible use cases. A survey can be crafted for each engineering team to report on their use cases and interest in a data lake. The survey data can then be used to identify common needs and prioritize the implementation phase accordingly.
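One lightweight way to capture the responses is a structured record per use case, so the answers can be aggregated later. The sketch below is illustrative only; the field names are hypothetical, not a standard schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical survey record; adapt the fields to your organization.
@dataclass
class UseCaseSurveyResponse:
    project_name: str
    business_unit: str
    point_of_contact: str
    data_formats: List[str] = field(default_factory=list)  # e.g. ["JSON", "CSV"]
    source_location_type: str = "internal"      # "internal", "partner" or "client"
    ingestion_cadence: str = "batch"             # "batch", "discrete" or "streaming"
    yearly_volume_gb: float = 0.0
    latency_requirement_ms: Optional[int] = None
    needs_ml_inference: bool = False
    needs_analytics: bool = False
    estimated_roi: str = ""
    funding_reserved: bool = False
```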

Here are some broad examples of questions that can be used in the survey and how they impact the data lake design:

  • Where are the data sources located? This informs which cloud region to use for ingestion and storage.
  • What type of location hosts each source (internal, partner, client)? This impacts how device authentication and device management are handled.
  • What is the type of data (JSON, CSV, XML, binary, DOC, XLS)? This helps determine which services can be used to analyze the data.
  • What is the minimum, average and maximum size of the data? This will determine which cloud services can process the data. For example, services like Lambda and API Gateway cap request payload sizes (see the ingestion-routing sketch after this list).
  • What is the direction of the data flow (is the data source just pushing, is it expecting a response back based on the data sent, or does it also query the data lake)? If the client needs real-time inference from a SageMaker model in the API response, that rules out a one-way collection service like Kinesis (see the inference sketch after this list).
  • How many data sources currently exist and are expected in the future? What is the total size of data expected to be ingested yearly? Those questions will impact the plan for scaling the ingestion and storage, cost forecast and management of the software and devices at the source.
  • How often do you expect to send the data to the cloud data lake (as a daily or hourly batch, as discrete events or as a continuous stream)? This will influence which service to use to ingest data (a streaming producer sketch follows this list).
  • Is there a need for machine learning inference? A detailed assessment of machine learning needs would require its own follow-up, but at the very least this question flags where inference is expected.
  • Is there a need for data analytics? Which platform do you need? Identifying the analytics tools needed or expected by engineers and business users helps plan for the integration with these tools. We can propose using AWS services to answer this need if those tools don't exist.
  • What are the latency requirements (100ms, 500ms, one minute, none)? For tight latency requirements, we'll have to consider provisioned resources across the services concerned: Lambda, DynamoDB, SageMaker endpoints, EC2. It may also reveal the need for an IoT device for local computing, storage and inference.
  • What are the security and data residency requirements (PII, government requirements)? Who should have access to the data? This will impact where the data is stored (which region, which S3 bucket), what type of encryption should be used and what access policies should be implemented (see the bucket-encryption sketch after this list).
  • How easy is the integration? In some cases, data is already shipped to an existing data silo outside the data lake. Changing applications to use completely new APIs and data flows may not be practical; sometimes too many applications depend on a shared library that would have to be modified to communicate with the data lake. A one-off ingestion pattern on the cloud side may be more practical. On the other hand, if this is a greenfield implementation, we have more freedom.
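To make the payload-size question concrete, here is a minimal ingestion-routing sketch, assuming Python and boto3. The size threshold and bucket name are hypothetical, and the exact request limits should be confirmed against current AWS documentation.

```python
import os
import boto3

s3 = boto3.client("s3")

# Hypothetical threshold: payloads beyond a few MB are better written to S3
# directly than pushed through an API Gateway/Lambda endpoint, whose request
# sizes are capped (check current AWS limits for the exact numbers).
MAX_INLINE_BYTES = 4 * 1024 * 1024
RAW_BUCKET = os.environ.get("RAW_BUCKET", "example-raw-data-bucket")  # placeholder

def choose_ingestion_path(record_bytes: bytes, key: str) -> str:
    """Send small payloads through the API path; upload large ones to S3."""
    if len(record_bytes) <= MAX_INLINE_BYTES:
        return "inline-api"  # small enough for a synchronous API/Lambda call
    s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=record_bytes)
    return "s3-direct"
```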
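For the two-way flow mentioned in the data-direction question, one common pattern is a Lambda function behind API Gateway that calls a SageMaker real-time endpoint and returns the prediction in the API response. A minimal inference sketch, assuming Python/boto3 and a hypothetical endpoint name:

```python
import json
import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime")

ENDPOINT_NAME = "vibration-anomaly-endpoint"  # hypothetical endpoint name

def handler(event, context):
    """Lambda behind API Gateway: forward the device payload to a SageMaker
    real-time endpoint and return the inference result in the API response."""
    payload = event["body"]  # JSON string sent by the device or application
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=payload,
    )
    prediction = json.loads(response["Body"].read())
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```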
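Where continuous streaming is the answer to the cadence question, sources can push records into a Kinesis data stream for delivery into the lake. A minimal producer-side sketch, again assuming Python/boto3 and a hypothetical stream name:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

STREAM_NAME = "plant-telemetry-stream"  # hypothetical stream name

def publish_measurement(site_id: str, measurement: dict) -> None:
    """Push one telemetry record to a Kinesis data stream; a downstream
    consumer (for example, Kinesis Data Firehose) can then deliver batches
    to the S3 raw zone of the data lake."""
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps({"site_id": site_id, **measurement}).encode("utf-8"),
        PartitionKey=site_id,  # keeps records from one site on the same shard
    )
```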
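And for the security and residency question, default encryption and public-access blocking can be enforced on the landing bucket. A sketch assuming Python/boto3, with hypothetical bucket and KMS key identifiers:

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "example-raw-data-bucket"                                  # placeholder
KMS_KEY_ARN = "arn:aws:kms:eu-central-1:111122223333:key/EXAMPLE"   # placeholder

# Enforce default server-side encryption with a customer-managed KMS key.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": KMS_KEY_ARN,
                }
            }
        ]
    },
)

# Block all forms of public access to the bucket.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```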

Along with these technical requirements, the survey should collect information about the project (point of contact, business unit, project name), information about the business drivers (estimated ROI, risk if not implemented, deadlines) and whether funding has been reserved for the implementation.

Once the survey data is gathered, you may find dozens to hundreds of use cases. Build charts to summarize the data and identify common requirements, then start architecting and building the data lake for the first few selected use cases while keeping future needs in mind.
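As an illustration, assuming the responses are exported to a CSV with hypothetical column names, a few lines of pandas can surface the most common formats and ingestion cadences:

```python
import pandas as pd

# Hypothetical export of survey responses; column names are illustrative.
responses = pd.read_csv("use_case_survey_responses.csv")

# Count how many use cases share each data format and ingestion cadence,
# which helps decide which ingestion patterns to build first.
format_counts = responses["data_format"].value_counts()
cadence_counts = responses["ingestion_cadence"].value_counts()

print(format_counts.head(10))
print(cadence_counts)
```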

The cloud is an ideal place to build a data lake. AWS offers a wide range of services to adapt to the diversity of technical requirements, requires minimal (if any) licensing or upfront investment, and allows the technical implementation to change and adapt as the initial uncertainty settles and clarity emerges.

We're excited to share our experiences and insights at the upcoming AWS User Group meeting this September 21 in Cologne, Germany, where I'll delve even deeper into the intricacies of constructing data lakes tailored for manufacturing. Join us to learn more about our lessons learned, innovative strategies and practical advice for navigating the complexities of this transformative technology.

Visit NTT DATA and Amazon Web Services to learn more.


Matt Buchner
Matt Buchner is Sr. Director of Cloud Solution Architecture at NTT DATA. He brings 15 years of international experience delivering technology solutions to solve complex business challenges in a rapidly evolving business and technology landscape.