Accessing S3 Buckets from Python

By Eddy Davies – Data Science Associate

Introduction

Here at Crimson Macaw, we use SageMaker as our Machine Learning platform and store our training data in an S3 Bucket. This is both a cheap and easy solution due to the excellent integration between Python and S3. However, some of the Python code can prove less than intuitive, depending on the data being used. So here are four ways to load and save to S3 from Python.

Pandas for CSVs

Firstly, if you are using Pandas and CSVs, as is commonplace in many data science projects, you are in luck. The very simple lines you are likely already familiar with still work well to read from S3:

import pandas as pd
df = pd.read_csv('s3://example-bucket/test_in.csv')

And to write back to S3:

df.to_csv('s3://example-bucket/test_out.csv')
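One detail worth flagging: to_csv writes the row index by default, so a round trip through S3 grows an extra "Unnamed: 0" column on the next read. A quick local sketch with an in-memory buffer (the same call accepts an s3:// URI):

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# Without index=False, the row index is written as an extra unnamed
# column and comes back as "Unnamed: 0" on the next read_csv.
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue())  # a, then one value per line
```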

Smart Open for JSONs

The next simplest option to try is the very useful smart_open Python package, which can be installed and imported with:

!pip install smart_open
from smart_open import open

It can then be used interchangeably with the default Python open function. This supports not only S3 buckets but also Azure Blob Storage, Google Cloud Storage, SSH, SFTP and even the Apache Hadoop Distributed File System. After pip-installing and importing smart_open, it can be used for a CSV, but it is more useful for opening JSON files. This is because pandas has not overloaded its read_json() function to work with S3 in the way it has read_csv(), so the JSON must be opened as a file handle:

import pandas as pd
with open("s3://cm-yelp/test.json", "rb") as f:
    df = pd.read_json(f)

And JSON Lines requires an extra option applied to the reading line:

    df = pd.read_json(f, lines=True)
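To see what lines=True does, here is a small local example, with an in-memory buffer standing in for the S3 file handle and made-up data:

```python
import io

import pandas as pd

# JSON Lines: one JSON object per line, rather than a single top-level array.
jsonl = '{"name": "a", "score": 1}\n{"name": "b", "score": 2}\n'

# io.StringIO stands in for the handle smart_open would return.
df = pd.read_json(io.StringIO(jsonl), lines=True)
print(df.shape)  # one row per line
```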

SageMaker S3 Utilities

The SageMaker-specific Python package provides a variety of S3 utilities that may be helpful to your particular needs. You can upload a whole file or a string from the local environment:

from sagemaker.s3 import S3Uploader as S3U
S3U.upload(local_path, desired_s3_uri)
S3U.upload_string_as_file_body(string_body, desired_s3_uri)

Alternatively, to download a file or read one:

from sagemaker.s3 import S3Downloader as S3D
S3D.download(s3_uri, local_path)
file = S3D.read_file(s3_uri)

These functions generate the required SageMaker session automatically, but if you have created one yourself it can also be passed in.

Custom Functions using Boto3

The final method I will describe is the most complex, using the low-level boto3 Python package. Maybe your environment restricts the use of third-party packages, you are working with an unusual data type, or you just need more control over the process. Regardless of the reason to use boto3, you must first get an execution role and start a connection:

import boto3
from sagemaker import get_execution_role
role = get_execution_role()
s3client = boto3.client('s3')

I have created the following function to demonstrate how to use boto3 to read from S3; you just need to pass in the file name and bucket. This is the lowest possible level of interaction with S3.

def read_s3(file_name: str, bucket: str):
    # Fetch the object from S3.
    fileobj = s3client.get_object(
        Bucket=bucket,
        Key=file_name
    )

    # Read the streaming body into the variable filedata.
    filedata = fileobj['Body'].read()

    # Decode the binary file data and return it as a string.
    return filedata.decode('utf-8')
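The string this returns can feed straight into pandas. Since get_object needs live AWS credentials, the sketch below uses io.BytesIO (with made-up data) to stand in for the StreamingBody that S3 returns, then shows the same read-and-decode step:

```python
import io

import pandas as pd

# io.BytesIO stands in here for fileobj['Body'] from get_object.
body = io.BytesIO(b"name,score\na,1\nb,2\n")
text = body.read().decode("utf-8")

# A CSV read this way can be handed to pandas via an in-memory buffer.
df = pd.read_csv(io.StringIO(text))
print(len(df))  # one row per data line
```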

Then you have the following function to save a CSV to S3; by swapping df.to_csv() for a different writer, it works for other file types too.

import io

def upload_csv(df, data_key: str):
    with io.StringIO() as csv_buffer:
        # Serialise the DataFrame into an in-memory buffer.
        df.to_csv(csv_buffer, index=False)

        response = s3client.put_object(
            Bucket=BUCKET, Key=data_key, Body=csv_buffer.getvalue()
        )

        status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")

        if status == 200:
            print(f"Successful S3 put_object response. Status - {status}")
        else:
            print(f"Unsuccessful S3 put_object response. Status - {status}")
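As an example of swapping the writer, the same buffer pattern serialises a DataFrame to JSON Lines; only the serialisation line changes, and put_object would take the buffer's value exactly as before (the data here is made up):

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "score": [1, 2]})

# Same pattern as upload_csv, with to_csv swapped for to_json.
with io.StringIO() as json_buffer:
    df.to_json(json_buffer, orient="records", lines=True)
    body = json_buffer.getvalue()

print(body)  # one JSON object per line
```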

Conclusion

SageMaker Notebooks and SageMaker Studio are AWS' recommended solutions for prototyping a pipeline, and although they can also be used for training or inference, this is not recommended: once the given task is completed they continue running in an idle state and incur unnecessary charges. Instead, Container Mode or Script Mode is recommended for running a large amount of data through your model, and we will have blogs out on these topics soon.

Regardless of the flavour of SageMaker you choose to use, I have described four different methods for loading data from and placing data on S3 from a Python environment. I hope you found this helpful. Look out for more blogs posted soon discussing how we can put this data to good use.

Want to know more? Check out our latest blogs or get in touch with us here.