Optimizing file formats and compression

In the earlier blog post Athena: Beyond the Basics – Part 1, we examined working with Twitter data and executing complex queries using Athena. In this article, we will look at the pricing model, experiment with different file formats and compression techniques, and analyze the results to determine the best price-to-performance solution for this use case.

Athena Pricing

The Athena service is priced by the amount of data scanned when running queries. Selecting the appropriate file format and compression, and balancing those choices against query response time, can yield considerable cost savings while still meeting the expected response times.
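To see how much format choice matters, here is a minimal sketch of the cost arithmetic. The $5-per-TB rate is an assumption that should be verified against the pricing page linked below, and the 4:1 compression ratio is purely illustrative.

```python
def athena_query_cost(scanned_bytes, price_per_tb=5.00):
    """Estimate the cost of a single Athena query from bytes scanned.

    price_per_tb is an assumption ($5/TB at the time of writing);
    check the Athena pricing page for the current rate.
    """
    tb_scanned = scanned_bytes / 1024 ** 4
    return tb_scanned * price_per_tb

# A full scan of 1 TB, versus the same data compressed ~4:1 to 256 GB:
full_scan = athena_query_cost(1 * 1024 ** 4)       # 5.0
compressed = athena_query_cost(256 * 1024 ** 3)    # 1.25
```

Since a columnar format like Parquet also lets Athena skip columns a query never touches, the real savings are often larger than the raw compression ratio suggests.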

https://aws.amazon.com/athena/pricing/

The following sections walk through converting the Twitter data set to multiple file formats and compression codecs, then measuring the cost and response time for each combination.

Converting File Formats and Compression

Athena currently doesn’t support inserting or updating data on S3. An EMR cluster (or similar) has to be used to create a new data set in the preferred file format from the existing data set. We are going to add EMR to the previous architecture to convert the data to different file formats and compression codecs.


The following AWS documentation covers converting to columnar formats in detail:
http://docs.aws.amazon.com/athena/latest/ug/convert-to-columnar.html

The approach for converting to different file formats and querying them from Athena is as follows:

  • Create a Hive table with the target storage format and compression

CREATE EXTERNAL TABLE <target_table> ([column data-type, ...])
STORED AS <file_format>
LOCATION 's3://<bucket>/<path>/'
TBLPROPERTIES ('<property>'='<value>');

  • Populate the data into the new storage/compression format

INSERT INTO TABLE <target_table> SELECT * FROM <source_table>;

  • Create an Athena table based on the new data set stored on S3. Currently, the Athena catalog manager doesn’t share the Hive catalog, so the table has to be defined again in Athena.
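The Hive statements for the first two steps can be templated as a small helper. Everything below is illustrative: the table names, columns, S3 location, and the 'parquet.compression' table property are hypothetical placeholders, and the actual schema must match your own data set.

```python
def conversion_ddl(source_table, target_table, s3_location,
                   storage_format="PARQUET",
                   compression=("parquet.compression", "SNAPPY")):
    """Render the HiveQL for the create-and-populate conversion flow.

    All names are placeholders; the column list here is a stand-in
    for the real Twitter schema.
    """
    prop_key, prop_value = compression
    create = (
        f"CREATE EXTERNAL TABLE {target_table} (id BIGINT, text STRING)\n"
        f"STORED AS {storage_format}\n"
        f"LOCATION '{s3_location}'\n"
        f"TBLPROPERTIES ('{prop_key}'='{prop_value}');"
    )
    insert = f"INSERT INTO TABLE {target_table} SELECT * FROM {source_table};"
    return create, insert

create_stmt, insert_stmt = conversion_ddl(
    "tweets_json", "tweets_parquet", "s3://my-bucket/tweets-parquet/")
```

Generating the DDL this way makes it easy to loop over several format/codec combinations and produce one converted copy of the data set per combination.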

The following code snippets are used to create multiple versions of the same data set for experimenting with Athena.

JSON FORMAT:

  • To convert the JSON data to Snappy compression, we execute these commands in Hive:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
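Compression directly reduces the bytes Athena scans. Snappy itself requires a third-party package, so as a rough stand-in the sketch below uses gzip from the Python standard library to show how much a repetitive JSON-lines payload shrinks; the actual ratio depends on the data and the codec.

```python
import gzip
import json

# Build a repetitive JSON-lines payload loosely resembling tweet records.
records = [{"id": i, "lang": "en", "text": "hello athena"} for i in range(1000)]
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")

compressed = gzip.compress(raw)
ratio = len(raw) / len(compressed)
# Fewer bytes stored means fewer bytes scanned, and a proportionally lower bill.
```

Text-heavy, repetitive data such as tweets tends to compress well, which is exactly why the codec experiments in this series are worth running.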