AWS Glue

AWS Glue is a fully managed ETL (extract, transform, and load) service. It simplifies the ETL process by automating the creation, maintenance, and execution of ETL jobs.

We can point AWS Glue at our data stored on AWS; it discovers the data and stores the associated metadata in the AWS Glue Data Catalog. Once the data is cataloged, it is available for search, query, and ETL operations.
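
As a rough illustration of what "metadata in the Data Catalog" looks like, the sketch below summarizes table entries shaped like the `TableList` items returned by boto3's Glue `get_tables` call. The sample response is hand-written rather than fetched from AWS, and the table, column, and bucket names in it are hypothetical.

```python
from typing import Any, Dict, List


def summarize_tables(table_list: List[Dict[str, Any]]) -> List[str]:
    """Return one line per table: name, location, columns, and partition keys."""
    lines = []
    for table in table_list:
        storage = table.get("StorageDescriptor", {})
        columns = ", ".join(
            f"{col['Name']}:{col.get('Type', '?')}"
            for col in storage.get("Columns", [])
        )
        partitions = ", ".join(key["Name"] for key in table.get("PartitionKeys", []))
        lines.append(
            f"{table['Name']} @ {storage.get('Location', '?')} "
            f"[{columns}] partitioned by [{partitions}]"
        )
    return lines


# Hand-written sample in the shape of a get_tables response (names are hypothetical).
SAMPLE_RESPONSE = {
    "TableList": [
        {
            "Name": "web_logs",
            "StorageDescriptor": {
                "Location": "s3://example-bucket/logs/",
                "Columns": [
                    {"Name": "request_id", "Type": "string"},
                    {"Name": "status", "Type": "int"},
                ],
            },
            "PartitionKeys": [{"Name": "dt", "Type": "string"}],
        }
    ]
}

SUMMARY = summarize_tables(SAMPLE_RESPONSE["TableList"])

if __name__ == "__main__":
    for line in SUMMARY:
        print(line)
```

With real credentials, the same function could presumably be fed the `TableList` from `boto3.client("glue").get_tables(DatabaseName=...)` for a database that a crawler has populated.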

Benefits of AWS Glue

  1. Less hassle :-
    • Integrates with many AWS services, which reduces onboarding effort.
    • Is serverless, so there is no infrastructure to manage.
    • Natively supports data stored in Amazon Aurora, Amazon RDS engines, Amazon Redshift, and Amazon S3.
    • Supports common database engines running on Amazon EC2 instances.
    • Handles provisioning, configuration, and scaling of the resources required to run ETL jobs on a fully managed, scale-out Apache Spark environment.
  2. Cost effective :- Pay only for the resources used while jobs are running.
  3. More power :-
    • Automates much of the effort in building, maintaining, and running ETL jobs.
    • Crawls data sources, identifies data formats, and suggests transformations and a loading process.
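
To make the pay-per-use point concrete, here is a back-of-the-envelope estimate. It assumes billing by DPU-hour (a DPU, or data processing unit, is Glue's unit of capacity). The hourly rate below is a placeholder rather than a quoted price, and any minimum billing duration is ignored, so treat the result as illustrative only.

```python
def estimate_job_cost(dpus: int, runtime_minutes: float, rate_per_dpu_hour: float) -> float:
    """Estimate the cost of one job run as DPUs x hours x hourly rate."""
    if dpus <= 0 or runtime_minutes < 0 or rate_per_dpu_hour < 0:
        raise ValueError("dpus must be positive; runtime and rate must be non-negative")
    hours = runtime_minutes / 60.0
    return round(dpus * hours * rate_per_dpu_hour, 4)


# Placeholder rate for illustration only (an assumption, not a quoted AWS price).
ASSUMED_RATE = 0.50

# A 15-minute run on 10 DPUs: 10 * 0.25 h * 0.50 = 1.25
EXAMPLE_COST = estimate_job_cost(dpus=10, runtime_minutes=15, rate_per_dpu_hour=ASSUMED_RATE)

if __name__ == "__main__":
    print(f"Estimated cost: {EXAMPLE_COST}")
```

A job that does not run costs nothing under this model, which is the main contrast with a cluster that is kept running between jobs.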

How AWS Glue works

  1. Build the Data Catalog :- AWS Glue crawls data sources and constructs your Data Catalog using pre-built classifiers for many popular source formats and data types, including JSON, CSV, and Parquet.
  2. Generate and edit transformations :- AWS Glue generates Scala or Python code that extracts data from the source, transforms it to match the target schema, and loads it into the target. The generated code is editable.
  3. Schedule and run your jobs :- AWS Glue makes it easy to schedule recurring ETL jobs, chain multiple jobs together, or invoke jobs on demand from other services like AWS Lambda. AWS Glue manages the dependencies between jobs, automatically scales underlying resources, and retries jobs if they fail.
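
The three steps above can be sketched against the boto3 Glue client. The helpers below only build the keyword arguments for `create_crawler`, `create_job`, and `create_trigger`, so they can be inspected without an AWS account. Every name, ARN, bucket path, and the cron expression are hypothetical placeholders, and this is one possible setup under those assumptions, not the only way to wire Glue together.

```python
from typing import Any, Dict


def crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> Dict[str, Any]:
    """Step 1: parameters for create_crawler, which populates the Data Catalog."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def job_request(name: str, role_arn: str, script_location: str, max_retries: int = 1) -> Dict[str, Any]:
    """Step 2: parameters for create_job, pointing at a generated or edited script."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "MaxRetries": max_retries,  # number of automatic retries after a failure
    }


def schedule_request(name: str, job_name: str, cron: str) -> Dict[str, Any]:
    """Step 3: parameters for create_trigger, running the job on a schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def set_up_pipeline(glue: Any) -> None:
    """Apply the three steps with a Glue client such as boto3.client("glue")."""
    role = "arn:aws:iam::123456789012:role/ExampleGlueRole"  # hypothetical role
    glue.create_crawler(
        **crawler_request("sales-crawler", role, "sales_db", "s3://example-bucket/sales/")
    )
    glue.start_crawler(Name="sales-crawler")
    glue.create_job(
        **job_request("sales-etl", role, "s3://example-bucket/scripts/sales_etl.py")
    )
    # Run every day at 02:00 UTC.
    glue.create_trigger(
        **schedule_request("sales-nightly", "sales-etl", "cron(0 2 * * ? *)")
    )
```

Keeping the request dictionaries separate from the client calls makes it possible to review or unit-test the configuration before anything is created in an account.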

Features of AWS Glue

  • Integrated data catalog :- The Data Catalog is a persistent metadata store for all data assets, irrespective of their location. It contains table definitions, job definitions, and other control information for managing the Glue environment. It makes queries against the data efficient and cost effective by computing statistics and registering partitions. It also maintains schema versions to track changes over time.
  • Automatic schema discovery :- AWS Glue crawlers connect to a source or target data store, progress through a prioritized list of classifiers to determine the schema of your data, and then create metadata in the AWS Glue Data Catalog. The metadata is stored in tables in the Data Catalog and used when authoring ETL jobs. We can run crawlers on a schedule, on demand, or trigger them based on an event to keep our metadata up to date.
  • Code generation :- AWS Glue generates code to extract, transform, and load data. The code is generated in Scala or Python and is written for Apache Spark.
  • Clean and deduplicate data :- AWS Glue helps clean and prepare your data for analysis by providing a Machine Learning Transform called FindMatches for deduplication and finding matching records.
  • Developer endpoints :- Let us interactively develop, test, and debug ETL code from an IDE or notebook.
  • Flexible job scheduler :- AWS Glue jobs can be invoked on a schedule, on demand, or based on an event. We can start multiple jobs in parallel or specify dependencies across jobs to build complex ETL pipelines. AWS Glue handles all inter-job dependencies, filters bad data, and retries jobs if they fail. All logs and notifications are pushed to Amazon CloudWatch, so we can monitor jobs and receive alerts from a central service.
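
The "prioritized list of classifiers" idea from the schema discovery feature can be mimicked in a few lines. This is a toy simulation, not Glue's implementation: the two classifier functions and their certainty scores are made up. It follows the behaviour described in the Glue documentation, where the first classifier reporting a certainty of 1.0 wins, otherwise the highest certainty is used, and data that no classifier recognizes is labelled UNKNOWN.

```python
import csv
import io
import json
from typing import Callable, List, Tuple

# A classifier takes sample text and returns (classification, certainty in [0.0, 1.0]).
Classifier = Callable[[str], Tuple[str, float]]


def json_classifier(sample: str) -> Tuple[str, float]:
    """Toy classifier: fully certain if the sample parses as a JSON object or array."""
    try:
        parsed = json.loads(sample)
    except ValueError:
        return ("json", 0.0)
    return ("json", 1.0 if isinstance(parsed, (dict, list)) else 0.0)


def csv_classifier(sample: str) -> Tuple[str, float]:
    """Toy classifier: fairly confident if all rows share the same column count (> 1)."""
    rows = [row for row in csv.reader(io.StringIO(sample)) if row]
    if len(rows) < 2:
        return ("csv", 0.0)
    widths = {len(row) for row in rows}
    return ("csv", 0.8 if len(widths) == 1 and widths.pop() > 1 else 0.0)


def classify(sample: str, classifiers: List[Classifier]) -> str:
    """Walk classifiers in priority order; stop at the first with certainty 1.0."""
    best_label, best_certainty = "UNKNOWN", 0.0
    for classifier in classifiers:
        label, certainty = classifier(sample)
        if certainty == 1.0:
            return label
        if certainty > best_certainty:
            best_label, best_certainty = label, certainty
    return best_label


# Earlier entries have higher priority.
PRIORITY = [json_classifier, csv_classifier]

if __name__ == "__main__":
    for text in ['{"id": 1}', "id,name\n1,a\n2,b\n", "just some text"]:
        print(repr(text), "->", classify(text, PRIORITY))
```

The order of `PRIORITY` matters only when more than one classifier is fully certain; otherwise the highest score decides.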

Use Cases

  • Query an Amazon S3 data lake.
  • Analyze log data in a data warehouse.
  • Build a unified view of data across data stores.
  • Build event-driven ETL pipelines.
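
For the event-driven case, one common pattern is an AWS Lambda function that starts a Glue job whenever a new object lands in S3. The sketch below assumes the Lambda is subscribed to S3 event notifications; the job name and the `--input_path` argument are hypothetical and would have to match whatever the job script actually reads.

```python
import urllib.parse
from typing import Any, Dict, List

JOB_NAME = "sales-etl"  # hypothetical Glue job name


def start_runs_for_event(event: Dict[str, Any], glue: Any) -> List[str]:
    """Start one Glue job run per S3 object in the event; return the run IDs."""
    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        response = glue.start_job_run(
            JobName=JOB_NAME,
            # "--input_path" is a hypothetical argument name chosen for this sketch.
            Arguments={"--input_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return run_ids


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Lambda entry point; boto3 is imported here so the helper above works offline."""
    import boto3  # provided by the AWS Lambda Python runtime

    return {"started": start_runs_for_event(event, boto3.client("glue"))}
```

Passing the client into `start_runs_for_event` keeps the logic testable with a stub object, without AWS credentials.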

Frequently Asked Questions: