
Study notes: MLOps Week 2 AWS ML Data Preparation


Week 2 of AWS MLOps: data preparation using AWS services

Machine Learning (ML) and AWS ML Services

  • Deep learning $\subset$ Machine learning $\subset$ Artificial intelligence

  • ML is the scientific study of algorithms and statistical models that perform a task using inference instead of explicit instructions

    Typical workflow: Data → Model → Prediction

  • Types of ML algorithms and common business use cases

    • Supervised learning:
      • Fraud detection
      • Image recognition - computer vision
      • Customer retention
      • Medical diagnostics - computer vision
      • Personalized advertising
      • Sales prediction
      • Weather forecasting
      • Market projection
      • Population growth prediction
      • $\ldots$
    • Unsupervised learning:
      • Product recommendations
      • Customer segmentation
      • Targeted marketing
      • Medical diagnostics
      • Natural language processing - chatbot, translation, sentiment analysis
      • Data structure discovery
      • Gene sequencing
      • $\ldots$
    • Reinforcement learning: best when the desired outcome is known but the exact path to achieving it is not known
      • Game AI
      • Self-driving cars
      • Robotics
      • Customer service routing
      • $\ldots$
  • Use ML when you have:

    • Large datasets, large number of variables
    • Lack of clear procedures to obtain solutions
    • Existing ML expertise
    • Infrastructure already in place to support ML
    • Management support for ML
  • Typical ML workflow

    • Iterative process
      • Data processing
      • Training
      • Evaluation
  • ML frameworks and infrastructure

    • Frameworks provide tools and code libraries
      • Customized scripting
      • Integration with AWS services
      • Community of developers
      • Examples: PyTorch, TensorFlow, scikit-learn, $\ldots$
    • Infrastructure
      • Designed for ML applications
      • AWS IoT Greengrass provides an infrastructure for building ML for IoT devices
      • Amazon Elastic Inference reduces the cost of running ML inference
  • AWS managed ML services (no ML experience required)

    • Computer vision: Amazon Rekognition, Amazon Textract
    • Speech: Amazon Polly, Amazon Transcribe
    • Language: Amazon Comprehend, Amazon Translate
    • Chatbots: Amazon Lex
    • Forecasting: Amazon Forecast
    • Recommendations: Amazon Personalize
  • Three layers of the Amazon Machine Learning stack:

    • Managed Services
    • Machine Learning Services
    • Machine Learning Frameworks
  • ML challenges

    • Data
      • Poor quality
      • Non-representative
      • Insufficient
      • Overfitting and underfitting
    • Business
      • Complexity in formulating questions
      • Explaining models to business stakeholders
      • Cost of building systems
    • Users
      • Lack of data science expertise
      • Cost of staffing with data scientists
      • Lack of management support
    • Technology
      • Data privacy issues
      • Tool selection can be complicated
      • Integration with other systems
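
The iterative workflow noted in this section (data processing, training, evaluation) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data is synthetic, the "model" is a simple least-squares line, and the function names `make_data`, `process`, `train`, and `evaluate` are hypothetical, not part of any AWS service or of the course material.

```python
import random
from statistics import mean


def make_data(n, seed=0):
    """Synthetic data: y = 3x + 2 plus a little noise (purely illustrative)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [3.0 * x + 2.0 + rng.gauss(0, 0.5) for x in xs]
    return xs, ys


def process(xs, ys, train_fraction=0.8):
    """Data processing: split into training and validation sets."""
    cut = int(len(xs) * train_fraction)
    return (xs[:cut], ys[:cut]), (xs[cut:], ys[cut:])


def train(xs, ys):
    """Training: closed-form least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def evaluate(model, xs, ys):
    """Evaluation: mean squared error on held-out data."""
    slope, intercept = model
    return mean((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))


(train_x, train_y), (val_x, val_y) = process(*make_data(200))
model = train(train_x, train_y)
val_mse = evaluate(model, val_x, val_y)
print(f"slope={model[0]:.2f} intercept={model[1]:.2f} validation MSE={val_mse:.3f}")
```

In practice the three steps are repeated: if the validation error is unsatisfactory, one goes back to the data processing or the model choice and runs the loop again.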

Feature Engineering

  • Public datasets for feature engineering and model tuning

    • Hugging Face public datasets
    • Kaggle public datasets
    • Amazon S3 buckets
  • A useful concept: combine existing features to produce new features for training and validation

  • It helps to create an ML project structure so that the project can be managed and tracked phase by phase

    • Data ingest
    • Exploratory data analysis (EDA)
    • Modeling
    • Conclusion
  • Typical approaches in the EDA phase

    • Look at descriptive statistics
    • Graph the data and examine trends: linear, logarithmic, $\ldots$
    • Cluster the data
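
Two of the ideas in this section, combining existing features into a new one and looking at descriptive statistics during EDA, can be sketched together. The records and the column names (`total_spend`, `num_orders`, `avg_order_value`) are made up for illustration, and the helper `describe` is hypothetical; a real project would more likely use a dataframe library.

```python
from statistics import mean, median, stdev

# Hypothetical toy records; the column names are invented for this sketch.
rows = [
    {"total_spend": 120.0, "num_orders": 4},
    {"total_spend": 300.0, "num_orders": 5},
    {"total_spend": 45.0, "num_orders": 3},
    {"total_spend": 800.0, "num_orders": 10},
]

# Feature engineering: combine two existing features into a new one.
for row in rows:
    row["avg_order_value"] = row["total_spend"] / row["num_orders"]


def describe(values):
    """EDA: a few descriptive statistics for one numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
    }


summary = describe([row["avg_order_value"] for row in rows])
print(summary)
```

The derived column is then available alongside the original ones for training and validation, and the summary gives a first check for outliers or skew before any graphing or clustering.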