Step-By-Step Guide to Building a Serverless Data Lake with AWS’s Aditya Challa

On August 19, 2020, more than 50 participants gathered virtually to hear Aditya “Adi” Challa, a Solutions Architect with Amazon Web Services (AWS), carefully walk through how to build and automate a modern serverless data lake on AWS as part of ASU UTO’s Innov8: A Speaker Series.

Defining Data Lakes & Serverless Computing

Challa has more than 15 years of experience in architecting, designing, building and implementing IT solutions for various verticals, including academic, financial and fundraising organizations. He started the session by polling the audience on their understanding of data lakes and serverless computing.

A data lake, a system or repository of data stored in its natural/raw format, usually represents the single store of all data from an enterprise. It can be established “on premises” (within an organization’s data centers) or “in the cloud” (cloud services from vendors like Amazon Web Services), and data lakes are essential to the maintenance of an organization’s crucial information.

Serverless computing, in which a cloud provider runs the servers and dynamically manages resources, can be used in support of data lakes.

“We believe there are more than 10,000 serverless data lakes that are currently being built and maintained on AWS,” Challa said.

Steps to Building a Data Lake & Common Misconceptions of Data Lakes

He went on to explain that there are five typical steps in building a data lake:

  1. Set up storage
  2. Move data
  3. Cleanse, prep, and catalog data
  4. Configure and enforce security and compliance policies
  5. Make data available for analytics
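
To make the five steps concrete, here is a toy, in-memory sketch of the same flow. It is not the architecture Challa presented: plain Python dictionaries stand in for cloud storage, the data catalog and the access policies, and every name in it (`lake`, `move_data`, `cleanse_and_catalog`, `grant`, `query`, the sample file) is hypothetical.

```python
import json

# Step 1: set up storage. A dict stands in for an object store.
lake = {"raw": {}, "curated": {}}
catalog = {}          # table name -> sorted column names
allowed_readers = {}  # table name -> set of roles allowed to read it


def move_data(key, payload):
    """Step 2: land incoming data in the raw zone, unchanged."""
    lake["raw"][key] = payload


def cleanse_and_catalog(key, table):
    """Step 3: parse the raw object, drop rows with missing values, record columns."""
    lines = [line for line in lake["raw"][key].splitlines() if line.strip()]
    rows = [json.loads(line) for line in lines]
    clean = [row for row in rows if all(v is not None for v in row.values())]
    lake["curated"][table] = clean
    catalog[table] = sorted({column for row in clean for column in row})


def grant(table, role):
    """Step 4: record which roles may read a table."""
    allowed_readers.setdefault(table, set()).add(role)


def query(table, role):
    """Step 5: make curated data available, but only to authorized roles."""
    if role not in allowed_readers.get(table, set()):
        raise PermissionError(f"{role} may not read {table}")
    return lake["curated"][table]


sample = '{"student": "a1", "credits": 12}\n{"student": "b2", "credits": null}\n'
move_data("enrollments.jsonl", sample)
cleanse_and_catalog("enrollments.jsonl", "enrollments")
grant("enrollments", "analyst")
print(catalog["enrollments"])            # ['credits', 'student']
print(query("enrollments", "analyst"))   # [{'student': 'a1', 'credits': 12}]
```

In a real deployment each function would be replaced by a managed service, but the order of operations stays the same.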

“The whole purpose of the data lake is to democratize access to this data and to avoid silos,” said Challa. “This [data lake] brings everything together.”

While there are common misconceptions of what a data lake is, Challa explained it is more flexible than the more traditional “data warehouse.”

“In the old days when we had data in data warehouses, we had to, ahead of time, know the schema of the data that’s being stored and if there was any ETL (extracting, transforming and loading) that had to be done,” explained Challa. “When there was a change in the data we had to stop, change the schema of the tables in the data warehouse and then write it.”

“Data lakes are schema on read,” Challa said. “Data warehouses only do structured data, whereas data lakes can take videos, text files, logs, JSONs, XMLs—you name it. It can take any kind of data as long as you have room for that data in your data lake.”
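
A small sketch may help illustrate what “schema on read” means in practice. The example below is an assumption-laden toy, not anything shown in the session: a Python list stands in for the lake, the records are invented, and `read_with_schema` is a hypothetical helper. The point is that data is stored without validation, and each reader decides which fields and types it wants at query time.

```python
import json

# Raw objects are stored as-is; nothing checks their shape at write time.
raw_store = [
    '{"user": "a1", "event": "login", "ts": "2020-08-19T10:00:00"}',
    '{"user": "b2", "event": "upload", "bytes": 2048}',
    "plain text log line with no structure",
]


def read_with_schema(records, schema):
    """Apply a schema while reading: keep only the requested fields, cast to type.

    Records that are not JSON objects are skipped for this query,
    but they remain in the store for other readers.
    """
    for line in records:
        try:
            doc = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(doc, dict):
            continue
        yield {
            field: cast(doc[field]) if field in doc else None
            for field, cast in schema.items()
        }


# Two readers apply two different schemas to the same raw data.
logins = list(read_with_schema(raw_store, {"user": str, "event": str}))
sizes = list(read_with_schema(raw_store, {"user": str, "bytes": int}))
print(logins)
print(sizes)
```

Adding a new field to incoming records requires no change to the store; only readers that care about the field need to update their schema.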

Benefits of Using AWS for Big Data & Analytics and Featured Services & Products

Challa shared that AWS has launched two new services in the last two years that have become extremely popular:

  • AWS Transfer for SFTP: Fully managed service that enables transferring files over SFTP directly into and out of Amazon S3
  • AWS DataSync: Transfer service that simplifies, automates and accelerates data movement

And did you know that 80 percent of any work on the data lake is data preparation? “We want to make sure that we provide the best tools and most cost-effective tools to our customers,” said Challa. “And that’s why we have AWS Glue.”

AWS Glue is a serverless ETL service. Challa explained how to set up a catalog, ETL and data prep with AWS Glue. He also presented AWS Lambda, a serverless compute service used to build powerful, dynamic and modular applications in the cloud.
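
One common way these two services fit together is for a Lambda function to react to a new file arriving in Amazon S3 and start a Glue ETL job for it. The sketch below assumes that pattern; it is not necessarily the design from the talk. The job name `example-etl-job` and the `--source_path` job argument are hypothetical, and the optional `glue` parameter exists only so the function can be exercised without AWS credentials.

```python
import urllib.parse

GLUE_JOB_NAME = "example-etl-job"  # hypothetical Glue job name


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(s3["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def handler(event, context, glue=None):
    """Lambda entry point: start one Glue job run per newly arrived object."""
    if glue is None:
        import boto3  # imported lazily so the module loads without boto3 installed
        glue = boto3.client("glue")
    run_ids = []
    for bucket, key in extract_s3_objects(event):
        response = glue.start_job_run(
            JobName=GLUE_JOB_NAME,
            # Hypothetical job argument; the Glue script would have to read it.
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return run_ids
```

Because Lambda calls the handler only when an event arrives, nothing runs (and nothing is billed for compute) while no files are coming in.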

Summary & Resources

The main benefits of a serverless data lake are that it is exactly that, serverless from start to finish, and that you pay only when files come in and are transferred and processed.

Challa ended his session by sharing a valuable resource: an AWS Big Data Blog post that recaps how to build and automate a serverless data lake using an AWS Glue trigger for the Data Catalog and ETL jobs. The post includes an AWS CloudFormation template that sets up the architecture for you, so you can follow the instructions and try your hand at seeing how a serverless data lake works.

Special thanks to Challa for sharing his knowledge!

ASU’s monthly Innov8 Speaker Series brings industry leaders to ASU to highlight innovative topics across many domains and is devoted to spreading ideas and sharing knowledge through short, powerful tech-driven talks. Keep an eye on the Events Calendar for our next installment!