Query content of S3 objects with SQL using Laravel

Mar 30, 2022

Recently, I was working on a Laravel application that deals with a lot of CSV data. Different business partners were uploading a new version of the CSV files to S3 every hour.

My application was processing these files to build various charts for reporting. It also provides an interface for authorized users to view the content of uploaded CSV files. Users can also filter and search different records from the interface. Most of the time application only needs to access the most recent rows.

My initial approach was to download the entire CSV file from s3 into memory on every request. Then loop through each row to calculate the required data for charts. I was caching the calculated results for half an hour to improve the performance. This approach had a few drawbacks:

Another approach was to get the content of CSV every 15 minutes. Then insert these records into the SQL database. I did not like this approach because now the data exists in two different places. This approach was not scalable either. As the number of records in the CSV file increase, I have to update or insert more records into the database. On the bright side, I can use Eloquent or Database query builder to access the data.

AWS S3 Select

While Amazon S3 is awesome for storage, It also has a feature called S3 Select. With S3 Select, You can use a simple SQL query to filter the content of the stored objects. and retrieve only a subset of data that you need.

S3 Select currently works on objects that store data in CSV, JSON, and Apache Parquet formats. It also works with GZIP or BZIP2 compressed CSV and JSON objects.

Example of S3 Select Query with Laravel

For example purposes, I am going to use the following sample data. You can download the complete dataset here

date euro_to_usd uk_pound_to_usd australian_dollar_to_usd canadian_dollar_to_usd
2022-03-18 0.908430233 0.761643627 1.35501355 0.792581438
2022-03-17 0.904895485 0.760138345 1.366493577 0.790451348
2022-03-16 0.909587047 0.763825237 1.387539892 0.786101722
2022-03-15 0.90983532 0.765990042 1.391594768 0.781066937

Laravel filesystem does not support QueryObjectContent operation out of the box. I created the following macro so I can reuse the same logic in many places.

<?php

namespace App\Providers;

use Illuminate\Filesystem\AwsS3V3Adapter;
use Illuminate\Support\ServiceProvider;

class MacroServiceProvider extends ServiceProvider
{
    public function register(): void
    {
        AwsS3V3Adapter::macro('query', function (string $key, string $expression, array $overrides = []) {
            /** @var AwsS3V3Adapter $this */
            $default = [
                'Bucket' => $this->config['bucket'],
                'Expression' => $expression,
                'ExpressionType' => 'SQL',
                'InputSerialization' => [
                    'CSV' => [
                        'FileHeaderInfo' => 'USE',
                        'FieldDelimiter' => ','
                    ],
                ],
                'Key' => $key,
                'OutputSerialization' => [
                    'CSV' => [
                        'RecordDelimiter' => ','
                    ],
                ],
                'RequestProgress' => [
                    'Enabled' => false,
                ],
            ];

            return $this->getClient()->selectObjectContent(array_merge($default, $overrides));
        });
    }
}

This maco uses the underlying S3Client to execute the QueryObjectContent command. It accepts the name of the object, a query, and an optional parameter for the QueryObjectContent command. and It returns Aws\Result back.

Select & Limit Clause

S3 Select uses simple SQL statements. The following code snippet retrieves the first five records from our example dataset.

$records = collect();

// S3 Query to select first 5 records
$query = 'SELECT "date", "canadian_dollar_to_usd" FROM s3object LIMIT 5';

// Execute the query and fetch the results
$result = Storage::disk('s3')->query('exchange_rate_to_usd.csv', $query);

// Loop through payload events to collect result
foreach($result['Payload'] as $event) {
    if (isset($event['Records'])) {
        $records = $records->merge(
            Str::of($event['Records']['Payload'])
                ->explode(PHP_EOL)
                ->filter()
                ->map(fn(string $item) => Str::of($item)->explode(','))
        );
    }
}

dd($records);

If your CSV file does not have headers, You can also specify index numbers in your query. You may have to update the FileHeaderInfo parameter value. The possible values are USE, IGNORE, or NONE. If your CSV does not include headers use NONE otherwise use IGNORE.

$query = 'SELECT _1, _2 FROM s3object LIMIT 5';

Like SQL statement, You can also use * to retrieve all the columns.

$query = "SELECT * FROM s3Object LIMIT 5"

Where Clause

To explore where clause, Let's get the list of the CAD to the USD conversion rate since March 1, 2022. Notice that the date is a reserved keyword in S3 select. To use these reserved keywords, you have to wrap them in double quotes. You can find the list of reserved keywords here.

// ...

$query = 'SELECT "date", canadian_dollar_to_usd FROM s3object WHERE CAST("date" AS TIMESTAMP) >= CAST(\'2022-03-01\' AS TIMESTAMP)';

$result = Storage::disk('s3')->query('exchange_rate_to_usd.csv', $query);

// ...

Using Functions

Like SQL, S3 select also allows us to use functions. You can find a complete list of functions in AWS documentation. In the following code, we will get the total number of records in a file using the COUNT aggregate function.

$count = null;

$query = 'SELECT COUNT(*) FROM s3object';

$result = Storage::disk('s3')->query('exchange_rate_to_usd.csv', $query);

// Loop through payload events to get records
foreach($result['Payload'] as $event) {
    if (isset($event['Records'])) {
        $count = Str::of($event['Records']['Payload']);
    }
}

dd($count);

Cost of S3 Select

Amazon S3 Select is very cheap. The cost of S3 Select depends on the number of select requests, data scanned and amount of data returned. Let's say I have 50GB of data stored in S3. I make 100,000 Select requests per month and return 20GB of data. It will cost me less than 2 US dollars. Following is the pricing estimate from the AWS pricing calculator as of March 2022.

Tiered price for: 50 GB
50 GB x 0.0230000000 USD = 1.15 USD
Total tier cost = 1.1500 USD (S3 Standard storage cost)
100,000 GET requests in a month x 0.0000004 USD per request = 0.04 USD (S3 Standard GET requests cost)
20 GB x 0.0007 USD = 0.014 USD (S3 select returned cost)
30 GB x 0.002 USD = 0.06 USD (S3 select scanned cost)
1.15 USD + 0.04 USD + 0.014 USD + 0.06 USD = 1.26 USD (Total S3 Standard Storage, data requests, S3 select cost)
S3 Standard cost (monthly): 1.26 USD

Conclusion

In conclusion, You can reduce data transfer time and cost by using s3 select. Although S3 only supports simple queries, It can be very quiet useful in some scenarios.

If you have any questions or suggestions feel free to leave a comment below.

Subscribe to Newsletter I send out a newsletter every two weeks containing lot of interesting articles.

© Copyright 2022

Proudly hosted with Digital Ocean