mycsHQ/amazon-cloudfront-access-logs-queries

Analyzing your Amazon CloudFront access logs at scale

This is a sample implementation for the concepts described in the AWS blog post "Analyze your Amazon CloudFront access logs at scale", using AWS CloudFormation, Amazon Athena, AWS Glue, AWS Lambda, and Amazon Simple Storage Service (S3).

This application is available in the AWS Serverless Application Repository. You can deploy it to your account from there:

[Launch Stack button]

Overview

The application has two main parts:

  • An S3 bucket <ResourcePrefix>-<AccountId>-cf-access-logs that serves as a log bucket for Amazon CloudFront access logs. As soon as Amazon CloudFront delivers a new access log file, an event triggers the AWS Lambda function moveAccessLogs, which moves the file to an Apache Hive style prefix.

    [Figure: infrastructure overview]

  • An hourly scheduled AWS Lambda function transformPartition that runs an INSERT INTO query on a single partition per run, taking one hour of data into account. It writes the content of the partition in Apache Parquet format to the <ResourcePrefix>-<AccountId>-cf-access-logs S3 bucket.

    [Figure: infrastructure overview]
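
To illustrate what "moving the file to an Apache Hive style prefix" can look like, here is a minimal Python sketch of the key mapping only. It relies on the documented CloudFront standard log file name pattern (<distribution-id>.YYYY-MM-DD-HH.<unique-id>.gz). The partition column names (year, month, day, hour), the function name hive_style_key, and the sample file name are assumptions made for illustration; the moveAccessLogs function in this repository may differ in detail.

```python
import re

# CloudFront standard log file names look like:
#   <distribution-id>.YYYY-MM-DD-HH.<unique-id>.gz
LOG_NAME = re.compile(
    r"^(?P<dist>[^.]+)\."
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})-(?P<hour>\d{2})\."
    r"(?P<uid>[^.]+)\.gz$"
)


def hive_style_key(source_key, new_prefix="new/", target_prefix="partitioned-gz/"):
    """Map an incoming log object key to a Hive-style partitioned key.

    The year/month/day/hour partition names are an assumption of this sketch.
    """
    if not source_key.startswith(new_prefix):
        raise ValueError(f"key {source_key!r} is not under {new_prefix!r}")
    filename = source_key[len(new_prefix):]
    match = LOG_NAME.match(filename)
    if match is None:
        raise ValueError(f"{filename!r} is not a CloudFront access log file name")
    parts = match.groupdict()
    return (
        f"{target_prefix}"
        f"year={parts['year']}/month={parts['month']}/"
        f"day={parts['day']}/hour={parts['hour']}/{filename}"
    )


# Hypothetical example key; the distribution ID and unique ID are made up.
print(hive_style_key("new/E2EXAMPLE.2019-01-26-15.a1b2c3d4.gz"))
```

Encoding the partition values in the key is what lets Amazon Athena prune partitions: a query restricted to one hour only has to read the objects under that hour's prefix.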

FAQs

Q: How can I get started?

Use the Launch Stack button above to start the deployment of the application to your account. The AWS Management Console will guide you through the process. You can override the following parameters during deployment:

  • The NewKeyPrefix (default: new/) is the S3 prefix that is used in the configuration of your Amazon CloudFront distribution for log storage. The AWS Lambda function will move the files from here.
  • The GzKeyPrefix (default: partitioned-gz/) and ParquetKeyPrefix (default: partitioned-parquet/) are the S3 prefixes for partitions that contain gzip and Apache Parquet files, respectively.
  • The ResourcePrefix (default: myapp) is a prefix that is used for the S3 bucket and the AWS Glue database to prevent naming collisions.

The stack contains a single S3 bucket called <ResourcePrefix>-<AccountId>-cf-access-logs. After the deployment you can modify your existing Amazon CloudFront distribution configuration to deliver access logs to this bucket with the new/ log prefix.
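
As a rough illustration of the values involved, the following Python sketch derives the bucket name from the naming scheme above and builds the Logging section of a CloudFront distribution configuration. The field names (Enabled, IncludeCookies, Bucket, Prefix) follow the CloudFront API's logging configuration; the account ID, the choice IncludeCookies=False, and the helper names are assumptions for illustration only.

```python
def log_bucket_name(resource_prefix, account_id):
    """Bucket name following the <ResourcePrefix>-<AccountId>-cf-access-logs scheme."""
    return f"{resource_prefix}-{account_id}-cf-access-logs"


def logging_config(resource_prefix, account_id, new_key_prefix="new/"):
    """Build the 'Logging' section of a CloudFront distribution config.

    CloudFront expects the bucket as an S3 domain name, not a bare bucket name.
    IncludeCookies=False is an assumption of this sketch.
    """
    bucket = log_bucket_name(resource_prefix, account_id)
    return {
        "Enabled": True,
        "IncludeCookies": False,
        "Bucket": f"{bucket}.s3.amazonaws.com",
        "Prefix": new_key_prefix,
    }


# 123456789012 is a placeholder account ID.
print(logging_config("myapp", "123456789012"))
```

The same values can be entered by hand in the distribution's logging settings in the AWS Management Console.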

As soon as Amazon CloudFront delivers new access logs, the files will be moved to GzKeyPrefix. After 1-2 hours, they will be transformed to files in ParquetKeyPrefix.
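
The delay comes from the hourly transformation working on one complete hour of data at a time. A minimal Python sketch of that idea follows. The two-hour lag, the table names partitioned_gz and partitioned_parquet, the partition column names, and the use of SELECT * are all assumptions for illustration; the transformPartition function in this repository may build its query differently.

```python
from datetime import datetime, timedelta, timezone


def partition_to_transform(now, lag_hours=2):
    """Return the hour partition to transform, lag_hours before `now` (UTC).

    `now` must be a timezone-aware datetime. The lag is an assumption.
    """
    target = now.astimezone(timezone.utc) - timedelta(hours=lag_hours)
    return {
        "year": f"{target.year:04d}",
        "month": f"{target.month:02d}",
        "day": f"{target.day:02d}",
        "hour": f"{target.hour:02d}",
    }


def insert_into_statement(partition, database,
                          source_table="partitioned_gz",
                          target_table="partitioned_parquet"):
    """Build an INSERT INTO query that copies a single partition.

    Table names are hypothetical placeholders.
    """
    predicate = " AND ".join(f"{col} = '{val}'" for col, val in partition.items())
    return (
        f"INSERT INTO {database}.{target_table} "
        f"SELECT * FROM {database}.{source_table} WHERE {predicate}"
    )


now = datetime(2019, 1, 26, 15, 30, tzinfo=timezone.utc)
print(insert_into_statement(partition_to_transform(now), "myapp_cf_access_logs_db"))
```

Waiting before transforming an hour gives late-arriving log files time to land in the gzip partition first.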

You can query your access logs at any time in the Amazon Athena Query editor using the AWS Glue view called combined in the database called <ResourcePrefix>_cf_access_logs_db:

SELECT * FROM <ResourcePrefix>_cf_access_logs_db.combined LIMIT 10;

Q: How can I customize and deploy the template?

  1. Fork this GitHub repository.

  2. Clone the forked GitHub repository to your local machine.

  3. Modify the templates.

  4. Install the AWS CLI and the AWS Serverless Application Model (SAM) CLI.

  5. Validate your template:

    $ sam validate -t template.yaml
  6. Package the files for deployment with SAM (see SAM docs for details) to a bucket of your choice. The bucket must be in the region you want to deploy the sample application to:

    $ sam package \
        --template-file template.yaml \
        --output-template-file packaged.yaml \
        --s3-bucket <BUCKET>
  7. Deploy the packaged application to your account:

    $ aws cloudformation deploy \
        --template-file packaged.yaml \
        --stack-name my-stack \
        --capabilities CAPABILITY_IAM

Q: How can I use the sample application for multiple Amazon CloudFront distributions?

If your data does not need to be partitioned by Amazon CloudFront distribution, you can use the same bucket and path (new/) for more than one distribution and filter your queries on the host column. If you need to shorten the Parquet transformation (each run must finish within 15 minutes) or speed up queries, deploy a separate AWS CloudFormation stack from the same template for each distribution. The stack name is added to all resource names (e.g. AWS Lambda functions, S3 bucket) so you can distinguish the stacks in the AWS Management Console.
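
For the shared-bucket case, a query against the combined view can be narrowed to one distribution through the host column. The Python sketch below only assembles the SQL text. The database name follows the <ResourcePrefix>_cf_access_logs_db scheme described earlier, the column name host is taken from the paragraph above and should be checked against your table schema, and the helper name and sample domain are hypothetical.

```python
def host_query(resource_prefix, host, limit=10):
    """Build an Athena query on the combined view for a single host value."""
    database = f"{resource_prefix}_cf_access_logs_db"
    # Double any single quotes so the value stays a valid SQL string literal.
    escaped = host.replace("'", "''")
    return (
        f"SELECT * FROM {database}.combined "
        f"WHERE host = '{escaped}' LIMIT {int(limit)}"
    )


# d111111abcdef8.cloudfront.net is a placeholder distribution domain name.
print(host_query("myapp", "d111111abcdef8.cloudfront.net"))
```

The resulting statement can be pasted into the Amazon Athena Query editor.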

Q: In which region can I deploy the sample application?

The Launch Stack button above opens the AWS Serverless Application Repository in the US East (N. Virginia) region (us-east-1). You can switch to another region there before deployment.

Q: How can I add a new question to this list?

If you found yourself wishing this set of frequently asked questions had an answer for a particular problem, please submit a pull request. The chances are good that others will also benefit from having the answer listed here.

Q: How can I contribute?

See the Contributing Guidelines for details.

License Summary

This sample code is made available under a modified MIT license. See the LICENSE file.
