Optimizing file formats and compression

In the earlier blog post Athena: Beyond the Basics – Part 1, we examined working with Twitter data and executing complex queries using Athena. In this article, we will look at the pricing model, experiment with different file formats and compression techniques, analyze the results, and decide on the best price-to-performance solution for this use case.

Athena Pricing

Athena is priced by the amount of data scanned when running queries. Selecting an appropriate file format and compression codec, and balancing those choices against query response time, can yield considerable cost savings while still meeting the expected response times.

https://aws.amazon.com/athena/pricing/
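
As a rough back-of-the-envelope example (the $5 per TB scanned rate is the published price at the time of writing; check the pricing page above, and note that the query volumes below are purely illustrative):

-- Illustrative cost estimate only
SELECT 1000            -- queries per month
     * 2.0 / 1024      -- ~2 GB scanned per query, expressed in TB
     * 5.0             -- USD per TB scanned
       AS estimated_monthly_cost_usd;   -- ≈ 9.77 USD before any savings

Reducing the bytes scanned with a columnar format or compression reduces this figure proportionally.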

The following sections walk through converting the Twitter data set into multiple file formats and compression codecs, so that cost and response time can be measured for each combination.

Converting File Formats and Compression

Athena currently doesn’t support inserting or updating data on S3. An EMR cluster (or similar) has to be used to create a new data set in the preferred file format from the existing data set. We are going to add EMR to the previous architecture to convert the data to different file formats and compression codecs.

[Figure: previous architecture extended with an EMR cluster for file format and compression conversion]

The following link covers converting to columnar file formats in detail:
http://docs.aws.amazon.com/athena/latest/ug/convert-to-columnar.html

The approach for converting to different file formats and querying from Athena is the following (a concrete example of the Athena step is shown after the list):

  • Create a Hive table with the target file format and compression

CREATE EXTERNAL TABLE <table_name> (<column> <data_type>, ...)
STORED AS <file_format>
LOCATION 's3://<bucket>/<prefix>/'
TBLPROPERTIES ('<property>'='<value>');

  • Populate the data into the new storage/compression format

INSERT INTO TABLE <target_table> SELECT * FROM <source_table>;

  • Create an Athena table based on the new data set stored on S3. Currently, Athena’s catalog manager doesn’t share the Hive catalog, so the table has to be defined again in Athena
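
For instance, a hypothetical Athena table over a Parquet copy of the tweets on S3 could look like the sketch below (table name, columns, and bucket are placeholders; use the schema from Part 1):

-- Athena DDL over the converted data; names and columns are illustrative
CREATE EXTERNAL TABLE tweets_parquet (
  id BIGINT,
  screen_name STRING,
  tweet_text STRING,
  created_at STRING
)
STORED AS PARQUET
LOCATION 's3://my-twitter-bucket/tweets_parquet/';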

The following code snippets are used to create multiple versions of the same data set for experimenting with Athena.

JSON FORMAT:

  • To convert from JSON to Snappy-compressed JSON, we execute these commands in Hive:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

Then we create the Snappy external table exactly the same way we created the JSON table, but with a different name (tweets_snappy, for example) and a different S3 location, and insert from the JSON table into the Snappy table.

  • To convert from JSON to gzip-compressed JSON, we execute these commands in Hive:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapred.output.compression.type=BLOCK;

Then we create the gzip external table exactly the same way we created the JSON table, but with a different name (tweets_gzip, for example) and a different S3 location, and insert from the JSON table into the gzip table.
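
A concrete sketch of the Snappy case (the gzip variant is identical apart from the GzipCodec setting above; tweets_json stands for the JSON table from Part 1, and table and bucket names are hypothetical):

-- With the Snappy SET commands above in effect; one way to clone the JSON
-- table definition is CREATE TABLE ... LIKE, pointing at a new S3 location
CREATE EXTERNAL TABLE tweets_snappy LIKE tweets_json
LOCATION 's3://my-twitter-bucket/tweets_snappy/';

-- The copy is written out as Snappy-compressed JSON text files
INSERT INTO TABLE tweets_snappy SELECT * FROM tweets_json;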

TIP: Hive runs faster when copying from an already compressed table, since it has to read less data and run fewer map tasks.

PARQUET FORMAT:

  • To convert to Parquet with Snappy compression (parquet_snappy), we add the following to the end of the CREATE Parquet table statement:

TBLPROPERTIES ("PARQUET.COMPRESS"="SNAPPY");

Then we insert from any other table that has already been created and populated with data (json, json_snappy, parquet, …) into the parquet_snappy table.

  • To convert to Parquet with gzip compression (parquet_gzip), we add the following to the end of the CREATE Parquet table statement:

TBLPROPERTIES ("PARQUET.COMPRESS"="GZIP");

Then we insert from any other table that has already been created and populated with data (json, json_snappy, parquet, …) into the parquet_gzip table.
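
Putting it together for the Snappy case (table name, columns, and bucket are hypothetical placeholders; the gzip variant only changes the TBLPROPERTIES value):

CREATE EXTERNAL TABLE parquet_snappy (
  id BIGINT,
  screen_name STRING,
  tweet_text STRING,
  created_at STRING
)
STORED AS PARQUET
LOCATION 's3://my-twitter-bucket/parquet_snappy/'
TBLPROPERTIES ("PARQUET.COMPRESS"="SNAPPY");

-- Populate it from any table that already holds the data
INSERT INTO TABLE parquet_snappy SELECT * FROM tweets_json;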

ORC FORMAT:

  • To convert to ORC with Snappy compression (orc_snappy), we add the following to the end of the CREATE ORC table statement:

TBLPROPERTIES ("ORC.COMPRESS"="SNAPPY");

Then we insert from any other table that has already been created and populated with data (json, json_snappy, parquet, …) into the orc_snappy table.
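
The ORC version follows the same pattern as the Parquet sketch above (names and columns remain hypothetical placeholders):

CREATE EXTERNAL TABLE orc_snappy (
  id BIGINT,
  screen_name STRING,
  tweet_text STRING,
  created_at STRING
)
STORED AS ORC
LOCATION 's3://my-twitter-bucket/orc_snappy/'
TBLPROPERTIES ("ORC.COMPRESS"="SNAPPY");

-- Populate it from any table that already holds the data
INSERT INTO TABLE orc_snappy SELECT * FROM tweets_json;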

  • To convert to ORC with gzip compression (orc_gzip), we add the following to the end of the CREATE ORC table statement: