Query content of S3 objects with SQL using Laravel

Recently, I was working on a Laravel application that deals with a lot of CSV data. Different business partners were uploading a new version of the CSV files to S3 every hour.

My application was processing these files to build various charts for reporting. It also provides an interface for authorized users to view the content of uploaded CSV files. Users can also filter and search different records from the interface. Most of the time application only needs to access the most recent rows.

My initial approach was to download the entire CSV file from s3 into memory on every request. Then loop through each row to calculate the required data for charts. I was caching the calculated results for half an hour to improve the performance. This approach had a few drawbacks:

  • When the cache expires, I had to download the entire file from the s3. Some files were over 5MB in size. Downloading such large files decreases the performance significantly.
  • In S3, You have to pay when you retrieve data. Downloading large files frequently can increase the cost.
  • Users of my application wanted to see recent records. Most of the time they were viewing the old cached data.
  • Writing a code that filters the CSV content was not as pleasant as writing a query in SQL.

Another approach was to get the content of CSV every 15 minutes. Then insert these records into the SQL database. I did not like this approach because now the data exists in two different places. This approach was not scalable either. As the number of records in the CSV file increase, I have to update or insert more records into the database. On the bright side, I can use Eloquent or Database query builder to access the data.

AWS S3 Select

While Amazon S3 is awesome for storage, It also has a feature called S3 Select. With S3 Select, You can use a simple SQL query to filter the content of the stored objects. and retrieve only a subset of data that you need.

S3 Select currently works on objects that store data in CSV, JSON, and Apache Parquet formats. It also works with GZIP or BZIP2 compressed CSV and JSON objects.

Example of S3 Select Query with Laravel

For example purposes, I am going to use the following sample data. You can download the complete dataset here

dateeuro_to_usduk_pound_to_usdaustralian_dollar_to_usdcanadian_dollar_to_usd
2022-03-180.9084302330.7616436271.355013550.792581438
2022-03-170.9048954850.7601383451.3664935770.790451348
2022-03-160.9095870470.7638252371.3875398920.786101722
2022-03-150.909835320.7659900421.3915947680.781066937

Laravel filesystem does not support QueryObjectContent operation out of the box. I created the following macro so I can reuse the same logic in many places.

<?php

namespace App\Providers;

use Illuminate\Filesystem\AwsS3V3Adapter;
use Illuminate\Support\ServiceProvider;

class MacroServiceProvider extends ServiceProvider
{
    public function register(): void
    {
        AwsS3V3Adapter::macro('query', function (string $key, string $expression, array $overrides = []) {
            /** @var AwsS3V3Adapter $this */
            $default = [
                'Bucket' => $this->config['bucket'],
                'Expression' => $expression,
                'ExpressionType' => 'SQL',
                'InputSerialization' => [
                    'CSV' => [
                        'FileHeaderInfo' => 'USE',
                        'FieldDelimiter' => ','
                    ],
                ],
                'Key' => $key,
                'OutputSerialization' => [
                    'CSV' => [
                        'RecordDelimiter' => ','
                    ],
                ],
                'RequestProgress' => [
                    'Enabled' => false,
                ],
            ];

            return $this->getClient()->selectObjectContent(array_merge($default, $overrides));
        });
    }
}
Code language: PHP (php)

This maco uses the underlying S3Client to execute the QueryObjectContent command. It accepts the name of the object, a query, and an optional parameter for the QueryObjectContent command. and It returns Aws\Result back.

Select & Limit Clause

S3 Select uses simple SQL statements. The following code snippet retrieves the first five records from our example dataset.

$records = collect();

// S3 Query to select first 5 records
$query = 'SELECT "date", "canadian_dollar_to_usd" FROM s3object LIMIT 5';

// Execute the query and fetch the results
$result = Storage::disk('s3')->query('exchange_rate_to_usd.csv', $query);

// Loop through payload events to collect result
foreach($result['Payload'] as $event) {
    if (isset($event['Records'])) {
        $records = $records->merge(
            Str::of($event['Records']['Payload'])
                ->explode(PHP_EOL)
                ->filter()
                ->map(fn(string $item) => Str::of($item)->explode(','))
        );
    }
}

dd($records);Code language: PHP (php)

If your CSV file does not have headers, You can also specify index numbers in your query. You may have to update the FileHeaderInfo parameter value. The possible values are USE, IGNORE, or NONE. If your CSV does not include headers use NONE otherwise use IGNORE.

$query = 'SELECT _1, _2 FROM s3object LIMIT 5';Code language: PHP (php)

Like SQL statement, You can also use * to retrieve all the columns.

$query = "SELECT * FROM s3Object LIMIT 5"Code language: PHP (php)

Where Clause

To explore where clause, Let’s get the list of the CAD to the USD conversion rate since March 1, 2022. Notice that the date is a reserved keyword in S3 select. To use these reserved keywords, you have to wrap them in double quotes. You can find the list of reserved keywords here.

// ...

$query = 'SELECT "date", canadian_dollar_to_usd FROM s3object WHERE CAST("date" AS TIMESTAMP) >= CAST(\'2022-03-01\' AS TIMESTAMP)';

$result = Storage::disk('s3')->query('exchange_rate_to_usd.csv', $query);

// ...Code language: PHP (php)

Using Functions

Like SQL, S3 select also allows us to use functions. You can find a complete list of functions in AWS documentation. In the following code, we will get the total number of records in a file using the COUNT aggregate function.

$count = null;

$query = 'SELECT COUNT(*) FROM s3object';

$result = Storage::disk('s3')->query('exchange_rate_to_usd.csv', $query);

// Loop through payload events to get records
foreach($result['Payload'] as $event) {
    if (isset($event['Records'])) {
        $count = Str::of($event['Records']['Payload']);
    }
}

dd($count);Code language: PHP (php)

Cost of S3 Select

Amazon S3 Select is very cheap. The cost of S3 Select depends on the number of select requests, data scanned and amount of data returned. Let’s say I have 50GB of data stored in S3. I make 100,000 Select requests per month and return 20GB of data. It will cost me less than 2 US dollars. Following is the pricing estimate from the AWS pricing calculator as of March 2022.

Tiered price for: 50 GB
50 GB x 0.0230000000 USD = 1.15 USD
Total tier cost = 1.1500 USD (S3 Standard storage cost)
100,000 GET requests in a month x 0.0000004 USD per request = 0.04 USD (S3 Standard GET requests cost)
20 GB x 0.0007 USD = 0.014 USD (S3 select returned cost)
30 GB x 0.002 USD = 0.06 USD (S3 select scanned cost)
1.15 USD + 0.04 USD + 0.014 USD + 0.06 USD = 1.26 USD (Total S3 Standard Storage, data requests, S3 select cost)
S3 Standard cost (monthly): 1.26 USD

Conclusion

In conclusion, You can reduce data transfer time and cost by using s3 select. Although S3 only supports simple queries, It can be very quiet useful in some scenarios.

If you have any questions or suggestions feel free to leave a comment below.

Leave a Reply

Your email address will not be published. Required fields are marked *