Study notes: MLOps Week 1-2 Data Repositories

Weeks 1-2 of AWS MLOps: data repositories and AWS managed storage choices

Introduction to AWS S3

  • Object storage service that offers scalability, data availability, security, and performance.

    • 99.999999999% (11 nines) durability
    • Easy to use management features
    • Can respond to event triggers
  • Use cases:

    • Content storage/distribution
    • Backup, restore, and archive
    • Data lakes and big data analytics
    • Disaster recovery
    • Static website hosting
  • Components:

    • Bucket: https://s3-<aws-region>.amazonaws.com/<bucket-name>/
    • Object: https://s3-<aws-region>.amazonaws.com/<bucket-name>/<object-key>
    • Objects in an S3 bucket can be referred to by their URL
    • The key identifies the object within the bucket
  • Prefixes:

    • Use prefixes to imply a folder structure in an S3 bucket
      • Specify prefix: 2021/doc-example-bucket/math
      • Returns the following keys:
        • 2021/doc-example-bucket/math/john.txt
        • 2021/doc-example-bucket/math/maris.txt
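
A minimal sketch of listing keys by prefix with boto3, assuming an S3 client is passed in. The helper name `list_keys_with_prefix` is hypothetical; the paginator is used because a single `list_objects_v2` call returns at most 1,000 keys.

```python
def list_keys_with_prefix(s3_client, bucket, prefix):
    """Return all object keys in `bucket` that start with `prefix`."""
    keys = []
    # list_objects_v2 returns at most 1,000 keys per call, so paginate.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no matching objects.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys
```

With a real client this would be called as `list_keys_with_prefix(boto3.client("s3"), "<bucket-name>", "2021/doc-example-bucket/math")`.
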
  • Object metadata:

    • System-defined:
      • object creation date
      • object size
      • object version
    • User-defined:
      • information that you assign to the object
      • Named with the x-amz-meta- prefix followed by a custom name. Example: x-amz-meta-alt-name
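
A short sketch of writing and reading object metadata with boto3, assuming a client is passed in. The helper names are hypothetical. Note that boto3 takes bare metadata names (for example `alt-name`) and sends them as `x-amz-meta-` headers itself.

```python
def put_with_metadata(s3_client, bucket, key, body, metadata):
    """Upload `body` and attach user-defined metadata to the object."""
    # Pass bare names such as {"alt-name": "..."}; boto3 sends each one
    # as an x-amz-meta-<name> HTTP header.
    return s3_client.put_object(
        Bucket=bucket, Key=key, Body=body, Metadata=metadata
    )


def read_metadata(s3_client, bucket, key):
    """Return (user_metadata, size_in_bytes) without downloading the object."""
    head = s3_client.head_object(Bucket=bucket, Key=key)
    # "Metadata" holds user-defined values; "ContentLength" is system-defined.
    return head["Metadata"], head["ContentLength"]
```
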
  • Versioning:

    • Keep multiple variants of an object in the same bucket
    • In versioning-enabled S3 buckets, each object has a version ID
    • After versioning is enabled, it cannot be disabled, only suspended.
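
As a rough illustration, versioning can be switched on or suspended through the `put_bucket_versioning` call. The wrapper name `set_versioning` is hypothetical and the client is assumed to be created elsewhere.

```python
def set_versioning(s3_client, bucket, enabled=True):
    """Enable or suspend versioning on `bucket` and return the new status."""
    # The only statuses accepted by this call are "Enabled" and "Suspended".
    status = "Enabled" if enabled else "Suspended"
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": status},
    )
    return status
```
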
  • Three operations:

    • PUT:
      • Upload entire object to a bucket. Max size: 5 GB
      • Should use multipart upload for objects over 100 MB
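      import boto3

      # Create a low-level S3 client in the target region.
      S3API = boto3.client("s3", region_name="us-east-1")
      bucket_name = "samplebucket"
      filename = "/resources/website/core.css"

      # upload_file switches to multipart upload automatically for large files.
      S3API.upload_file(
          filename,      # local path of the file to upload
          bucket_name,   # destination bucket
          "core.css",    # object key in the bucket
          ExtraArgs={"ContentType": "text/css", "CacheControl": "max-age=0"},
      )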
      
    • GET:
      • Used to retrieve objects from Amazon S3
      • Can retrieve the complete object at once or a range of bytes
    • DELETE:
      • Versioning disabled - the object is permanently deleted from the bucket
      • Versioning enabled - deleting by key alone only adds a delete marker; deleting by key and version ID permanently removes that version
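
The GET and DELETE behaviour above can be sketched as follows, assuming a boto3 S3 client is passed in. Both helper names are hypothetical. The `Range` value uses the HTTP byte-range form, where both ends are inclusive.

```python
def read_byte_range(s3_client, bucket, key, start, end):
    """Fetch bytes start..end (inclusive) of an object instead of all of it."""
    resp = s3_client.get_object(
        Bucket=bucket, Key=key, Range=f"bytes={start}-{end}"
    )
    return resp["Body"].read()


def delete_object(s3_client, bucket, key, version_id=None):
    """Delete by key, or delete one specific version when version_id is given."""
    kwargs = {"Bucket": bucket, "Key": key}
    if version_id is not None:
        # Only meaningful on a bucket where versioning has been enabled.
        kwargs["VersionId"] = version_id
    return s3_client.delete_object(**kwargs)
```
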
  • S3 SELECT:

    • A powerful tool for querying data in place with SQL, so only the matching subset is returned instead of fetching the whole object from the bucket.
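
A hedged sketch of an S3 Select query through boto3's `select_object_content`, assuming the object is a CSV file with a header row. The helper name `select_rows` and the CSV assumption are illustrative and do not come from the course material.

```python
def select_rows(s3_client, bucket, key, sql):
    """Run `sql` against a CSV object and return the matching rows as text."""
    resp = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=sql,
        # Assumption: the object is CSV and its first line names the columns.
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )
    chunks = []
    # The response payload is a stream of events; only "Records" events
    # carry result bytes.
    for event in resp["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks).decode("utf-8")
```

An example expression would be `SELECT s.name FROM S3Object s WHERE s.subject = 'math'`, where the column names are hypothetical.
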
  • Access control: S3 has two types of policies for bucket access:

    • ACLs: access control lists.
    • Resource-based access policy to manage access at the object level or bucket level.
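
For illustration, a resource-based policy attached to a bucket might be built and applied as below. This is only a sketch: the helper names, the `AllowReadOnly` statement ID, and the choice of a read-only grant are assumptions, and the principal ARN must be supplied by the caller.

```python
import json


def read_only_policy(bucket, principal_arn):
    """Build a policy document letting one principal read objects in `bucket`."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReadOnly",  # hypothetical statement ID
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": ["s3:GetObject"],
                # Object-level actions apply to keys under the bucket ARN.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


def apply_policy(s3_client, bucket, policy):
    """Attach `policy` to `bucket`; the API expects the document as JSON text."""
    return s3_client.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```
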

Data storage

  • As a first step, catalog all of the different data sources in the organization into a master list. Once the master list exists, develop a strategy for processing the data in a data engineering pipeline.

  • Determine the correct storage medium

    • Database
      • Key/value database, e.g. DynamoDB, ideal for user records or game stats
      • Graph database, e.g. Neptune, for relationship building
      • SQL, e.g. Amazon Aurora, RDS, for transaction-based queries
    • Data lake:
      • Built on top of S3
      • Metadata + Storage + Compute
      • Can index things that are inside S3
    • EFS
      • Elastic File System
      • Amazon EFS is a cloud-based file storage service for apps and workloads that run in the AWS public cloud
      • Automatically grows and shrinks as you add and remove files
      • The system manages the storage size automatically without any provisioning
    • EBS
      • Stands for Elastic Block Store. This is a high-performance block-storage service designed for Amazon Elastic Compute Cloud (Amazon EC2)
      • Offers very fast file system access, ideal for machine learning training that requires fast file I/O

MLOps Template GitHub

A great template for MLOps projects with GPU support:

https://github.com/nogibjj/mlops-template