
Optimizing file formats and compression
In the earlier blog post Athena: Beyond the Basics – Part 1, we examined working with Twitter data and executing complex queries using Athena. In this article, we will look at the Athena pricing model, experiment with different file formats and compression techniques, and analyze the results to determine the best price-to-performance solution for this use case.
Athena Pricing
Athena is priced by the amount of data scanned by each query. Selecting the appropriate file format and compression codec, and balancing those choices against the expected query response times, can yield considerable cost savings.
https://aws.amazon.com/athena/pricing/
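As a quick back-of-the-envelope illustration, using the $5 per TB scanned rate listed on the pricing page at the time of writing (check the link above for current rates) and a hypothetical 10x reduction in scanned data from a compressed columnar format:
1.0 TB scanned x $5/TB = $5.00 per query (raw JSON)
0.1 TB scanned x $5/TB = $0.50 per query (compressed columnar copy)
The 10x ratio here is illustrative, not a measurement from our data set; the actual savings are what the experiments below are meant to quantify.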
The following sections walk through converting the Twitter data set into multiple file formats and compression codecs so that cost and response time can be measured for each combination.
Converting File Formats and Compression
Athena currently doesn't support inserting or updating data on S3. An EMR cluster (or similar) has to be used to create a new data set in the preferred file format from the existing data set. We are going to add EMR to the previous architecture to convert the data to different file formats and compression codecs.

The following link covers converting to multiple file formats in great detail:
http://docs.aws.amazon.com/athena/latest/ug/convert-to-columnar.html
The approach for converting to different file formats and querying from Athena (a worked example follows this list):
- Create a Hive table with the target storage format and compression:
CREATE EXTERNAL TABLE <table_name> (<column> <data_type>, ...)
STORED AS <format>
LOCATION '<s3_path>'
TBLPROPERTIES ('<key>'='<value>');
- Populate the data into the new storage/compression format:
INSERT INTO TABLE <target_table> SELECT * FROM <source_table>;
- Create an Athena table based on the new data set stored on S3. Currently, the Athena catalog manager doesn't share the Hive catalog, so the table has to be defined again in Athena.
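As a concrete sketch of these three steps, the Hive statements below convert a JSON-backed source table to Snappy-compressed Parquet. The table names (tweets_json, tweets_parquet), columns, and S3 path are illustrative placeholders, not the exact definitions from this project:

-- Step 1: Hive table for the Parquet copy
CREATE EXTERNAL TABLE tweets_parquet (
  id BIGINT,
  created_at STRING,
  text STRING
)
STORED AS PARQUET
LOCATION 's3://my-bucket/tweets-parquet/'
TBLPROPERTIES ('parquet.compression'='SNAPPY');

-- Step 2: populate it from the existing JSON-backed table
INSERT INTO TABLE tweets_parquet SELECT * FROM tweets_json;

For step 3, running the same CREATE EXTERNAL TABLE statement in the Athena console, pointed at the same S3 location, makes the converted data queryable from Athena.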
The following code snippets create multiple versions of the same data set for experimenting with Athena:
JSON FORMAT:
- To convert the JSON data to Snappy compression, we execute these commands in Hive:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
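With those settings in place, a sketch of the remaining steps might look like the following; the table names, columns, and S3 path are illustrative, and the example assumes the Hive JSON SerDe is available on the EMR cluster:

-- Target table for Snappy-compressed JSON text files
CREATE EXTERNAL TABLE tweets_json_snappy (
  id BIGINT,
  created_at STRING,
  text STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION 's3://my-bucket/tweets-json-snappy/';

INSERT INTO TABLE tweets_json_snappy SELECT * FROM tweets_json;

Because compressed output is enabled by the SET commands above, the INSERT writes Snappy-compressed files to the new S3 location.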