Week 1-2 of AWS MLOps: Data repositories and AWS managed storage choices
Introduction to AWS S3#
Object storage service that offers scalability, data availability, security, and performance.
- 99.999999999% (11 nines) durability
- Easy to use management features
- Can respond to event triggers
Use cases:
- Content storage/distribution
- Backup, restore, and archive
- Data lakes and big data analytics
- Disaster recovery
- Static website hosting
Components:
- Bucket:
https://s3-<aws-region>.amazonaws.com/<bucket-name>/
- Object:
https://s3-<aws-region>.amazonaws.com/<bucket-name>/<object-key>
- Objects in an S3 bucket can be referred to by their URL
- The key value identifies the object in the bucket
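The URL pattern above can be sketched as a small helper. This is only an illustration of the path-style format shown in these notes; the function name `s3_object_url` is hypothetical, and the bucket and key values are placeholders.

```python
from urllib.parse import quote


def s3_object_url(region: str, bucket: str, key: str = "") -> str:
    """Build a path-style URL following the pattern shown above."""
    # quote() percent-encodes characters that are unsafe in a URL path,
    # while leaving "/" alone so prefixes still look like folders.
    return f"https://s3-{region}.amazonaws.com/{bucket}/{quote(key)}"


# Placeholder values, not real resources.
print(s3_object_url("us-east-1", "samplebucket", "core.css"))
```

Calling the helper without a key gives the bucket URL; with a key it gives the object URL.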
Prefixes:
- Use prefixes to imply a folder structure in an S3 bucket
- Specify prefix:
2021/doc-example-bucket/math
- Returns the following keys:
2021/doc-example-bucket/math/john.txt
2021/doc-example-bucket/math/maris.txt
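Listing by prefix might look like the sketch below. It assumes `s3_client` is a boto3 S3 client (for example `boto3.client("s3")`); the helper name `keys_with_prefix` is hypothetical, and the bucket name in the usage comment is a placeholder.

```python
def keys_with_prefix(s3_client, bucket, prefix):
    """Return every object key in `bucket` that starts with `prefix`."""
    keys = []
    # list_objects_v2 returns at most 1,000 keys per call, so use a paginator.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no matching objects.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Usage (assumes AWS credentials are configured):
#   import boto3
#   s3 = boto3.client("s3")
#   keys_with_prefix(s3, "samplebucket", "2021/doc-example-bucket/math")
```

S3 has no real folders; the prefix is simply matched against the start of each key.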
Object metadata:
- System-defined:
- object creation date
- object size
- object version
- User-defined:
- information that you assign to the object
- user-defined keys start with x-amz-meta- followed by a custom name. Example: x-amz-meta-alt-name
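As a rough sketch of how both kinds of metadata show up in code, assuming `s3_client` is a boto3 S3 client: the helper names `put_with_metadata` and `read_metadata` are hypothetical. boto3 adds the x-amz-meta- prefix itself, so the dictionary keys omit it.

```python
def put_with_metadata(s3_client, bucket, key, body, metadata):
    """Upload an object together with user-defined metadata."""
    # boto3 sends each entry as an "x-amz-meta-<name>" header, so the
    # prefix is not included in the dictionary keys.
    s3_client.put_object(Bucket=bucket, Key=key, Body=body, Metadata=metadata)


def read_metadata(s3_client, bucket, key):
    """Fetch metadata with a HEAD request (the object body is not downloaded)."""
    head = s3_client.head_object(Bucket=bucket, Key=key)
    return {
        "size": head["ContentLength"],           # system-defined
        "last_modified": head["LastModified"],   # system-defined
        "version_id": head.get("VersionId"),     # only present with versioning
        "user": head["Metadata"],                # user-defined
    }


# Usage (placeholder names):
#   put_with_metadata(s3, "samplebucket", "core.css", b"...", {"alt-name": "main-stylesheet"})
#   read_metadata(s3, "samplebucket", "core.css")["user"]
```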
Versioning:
- Keep multiple variants of an object in the same bucket
- In versioning-enabled S3 buckets, each object has a version ID
- After versioning is enabled, it cannot be disabled, only suspended.
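Turning versioning on or suspending it could be sketched like this, assuming `s3_client` is a boto3 S3 client. The helper name `set_versioning` is hypothetical.

```python
def set_versioning(s3_client, bucket, enabled):
    """Enable or suspend versioning on a bucket and return the status sent."""
    # The API accepts "Enabled" or "Suspended"; there is no "Disabled" status.
    status = "Enabled" if enabled else "Suspended"
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": status},
    )
    return status


# Usage (placeholder bucket name):
#   set_versioning(s3, "samplebucket", enabled=True)
```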
Three operations:
- PUT:
- Uploads an entire object to a bucket. Maximum size for a single PUT: 5 GB
- Consider multipart upload for objects over 100 MB
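import boto3

# Create an S3 client in the target region
S3API = boto3.client("s3", region_name="us-east-1")
bucket_name = "samplebucket"
filename = "/resources/website/core.css"

# Upload the local file under the key "core.css",
# setting the content type and cache headers on the object
S3API.upload_file(
    filename,
    bucket_name,
    "core.css",
    ExtraArgs={"ContentType": "text/css", "CacheControl": "max-age=0"},
)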
- GET:
- Used to retrieve objects from Amazon S3
- Can retrieve the complete object at once or a range of bytes
- DELETE:
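- Versioning disabled: the object is permanently deleted from the bucket
- Versioning enabled: deleting by key alone only adds a delete marker; to permanently delete a specific version, specify both the key and the version ID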
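The GET and DELETE behaviors above might be sketched as follows, assuming `s3_client` is a boto3 S3 client. The helper names `get_byte_range` and `delete_object` are hypothetical wrappers, not part of boto3.

```python
def get_byte_range(s3_client, bucket, key, start, end):
    """Retrieve bytes start..end (inclusive) of an object instead of all of it."""
    resp = s3_client.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={start}-{end}",  # standard HTTP Range header syntax
    )
    return resp["Body"].read()


def delete_object(s3_client, bucket, key, version_id=None):
    """Delete an object, optionally targeting one specific version."""
    kwargs = {"Bucket": bucket, "Key": key}
    if version_id is not None:
        # With a version ID, that version is permanently removed.
        kwargs["VersionId"] = version_id
    return s3_client.delete_object(**kwargs)


# Usage (placeholder names):
#   first_100 = get_byte_range(s3, "samplebucket", "core.css", 0, 99)
#   delete_object(s3, "samplebucket", "core.css", version_id="<version-id>")
```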
S3 SELECT:
- A powerful tool to query data in place, without first fetching the whole object from the bucket.
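A minimal sketch of an S3 Select query over a CSV object, assuming `s3_client` is a boto3 S3 client and the object has a header row. The helper name `select_csv_rows` and the example query are hypothetical.

```python
def select_csv_rows(s3_client, bucket, key, sql):
    """Run a SQL expression against a CSV object and return the matching rows."""
    resp = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=sql,
        # Treat the first line as column names so they can be used in the query.
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )
    chunks = []
    # The response payload is a stream of events; only "Records" events carry data.
    for event in resp["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks).decode("utf-8")


# Usage (placeholder names and columns):
#   select_csv_rows(s3, "samplebucket", "grades.csv",
#                   "SELECT s.name FROM s3object s WHERE s.subject = 'math'")
```

Only the rows that match the expression travel over the network, which is the point of querying in place.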
Access control: S3 has two types of policies for bucket access:
- ACLs: access control lists.
- Resource-based access policy to manage access at the object level or bucket level.
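A resource-based policy is a JSON document attached to the bucket. The sketch below builds a read-only policy for one principal; the function name `read_only_bucket_policy`, the bucket name, and the account ID in the ARN are all placeholders chosen for illustration.

```python
import json


def read_only_bucket_policy(bucket, principal_arn):
    """Build a policy that lets one principal read every object in the bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReadOnly",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": ["s3:GetObject"],
                # "/*" scopes the statement to objects rather than the bucket itself.
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            }
        ],
    }


policy_json = json.dumps(
    read_only_bucket_policy("samplebucket", "arn:aws:iam::123456789012:user/example"),
    indent=2,
)
print(policy_json)

# Attaching it would then be (assumes a boto3 S3 client named s3):
#   s3.put_bucket_policy(Bucket="samplebucket", Policy=policy_json)
```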
Data storage#
As the first step, catalog all of the different data sources in the organization into a master list. Once the master list is created, develop a strategy for processing the data in a data engineering pipeline.
Determine the correct storage medium:
- Database
- Key/value database, e.g. DynamoDB, ideal for user records or game stats
- Graph database, e.g. Neptune, for relationship building
- SQL, e.g. Amazon Aurora, RDS, for transaction-based queries
- Data lake:
- Built on top of S3
- Metadata + Storage + Compute
- Can index things that are inside S3
- EFS
- Elastic File System
- Amazon EFS is a cloud-based file storage service for apps and workloads that run in the AWS public cloud
- Automatically grows and shrinks as you add and remove files
- The system manages the storage size automatically without any provisioning
- EBS
- Stands for Elastic Block Store. This is a high-performance block-storage service designed for Amazon Elastic Compute Cloud (Amazon EC2)
- Offers a very fast file system, ideal for machine learning training that requires fast file I/O
MLOps Template GitHub#
Great templates for MLOps projects with GPU: