Skip to content
GitHubXDiscord

Crawler

The Crawler resource allows you to create and manage AWS Glue Crawlers, which can automatically discover and catalog data stored in various data sources. For more detailed information, refer to the AWS Glue Crawlers documentation.

This example demonstrates how to create a basic Glue Crawler with required properties and a couple of common optional properties.

import AWS from "alchemy/aws/control";
const basicCrawler = await AWS.Glue.Crawler("basicCrawler", {
role: "arn:aws:iam::123456789012:role/service-role/AWSGlueServiceRole",
targets: {
s3Targets: [{
path: "s3://my-data-bucket/"
}]
},
databaseName: "my_database",
name: "basic-crawler"
});

In this example, we configure a Glue Crawler with additional settings such as a schema change policy and a schedule for regular crawling.

import AWS from "alchemy/aws/control";
const advancedCrawler = await AWS.Glue.Crawler("advancedCrawler", {
role: "arn:aws:iam::123456789012:role/service-role/AWSGlueServiceRole",
targets: {
s3Targets: [{
path: "s3://my-data-bucket/"
}]
},
databaseName: "my_database",
name: "advanced-crawler",
schemaChangePolicy: {
updateBehavior: "UPDATE_IN_DATABASE",
deleteBehavior: "LOG"
},
schedule: {
scheduleExpression: "cron(0 12 * * ? *)" // Daily at 12 PM UTC
}
});

This example demonstrates how to include custom classifiers for the Crawler, which can help in identifying the schema of data files.

import AWS from "alchemy/aws/control";
const classifierCrawler = await AWS.Glue.Crawler("classifierCrawler", {
role: "arn:aws:iam::123456789012:role/service-role/AWSGlueServiceRole",
targets: {
s3Targets: [{
path: "s3://my-data-bucket/"
}]
},
classifiers: ["my-json-classifier", "my-csv-classifier"],
databaseName: "my_database",
name: "classifier-crawler"
});

This example illustrates the use of a recrawl policy to determine how the Crawler should handle data changes in the target location.

import AWS from "alchemy/aws/control";
const recrawlPolicyCrawler = await AWS.Glue.Crawler("recrawlPolicyCrawler", {
role: "arn:aws:iam::123456789012:role/service-role/AWSGlueServiceRole",
targets: {
s3Targets: [{
path: "s3://my-data-bucket/"
}]
},
recrawlPolicy: {
recrawlBehavior: "CRAWL_NEW_DATA_ONLY"
},
databaseName: "my_database",
name: "recrawl-policy-crawler"
});