Week 2 of the AWS MLOps: Data preparation using AWS services
Machine Learning (ML) and AWS ML Services
Deep learning $\subset$ Machine learning $\subset$ Artificial intelligence
ML is the scientific study of algorithms and statistical models that perform a task using inference instead of explicit instructions
Typical workflow: Data → Model → Prediction
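The Data → Model → Prediction flow can be illustrated with a minimal sketch. The data points, the choice of a least-squares line as the "model", and the names `fit_line` and `predict` are all assumptions made for illustration; they do not come from the course material.

```python
# Data: hypothetical (advertising spend, sales) pairs, invented for illustration.
data = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 8.8)]


def fit_line(points):
    """Model: ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Prediction: apply the fitted line to an unseen input."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(data)            # Data -> Model
prediction = predict(model, 5.0)  # Model -> Prediction
print(f"slope={model[0]:.2f}, intercept={model[1]:.2f}, prediction={prediction:.2f}")
```

The point of the sketch is that the rule mapping spend to sales is inferred from the data rather than written by hand.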
Types of ML algorithms and common business use cases
- Supervised learning:
- Fraud detection
- Image recognition - computer vision
- Customer retention
- Medical diagnostics - computer vision
- Personalized advertising
- Sales prediction
- Weather forecasting
- Market projection
- Population growth prediction
- $\ldots$
- Unsupervised learning:
- Product recommendations
- Customer segmentation
- Targeted marketing
- Medical diagnostics
- Natural language processing - chatbot, translation, sentiment analysis
- Data structure discovery
- Gene sequencing
- $\ldots$
- Reinforcement learning: best when the desired outcome is known but the exact path to achieving it is not known
- Game AI
- Self-driving cars
- Robotics
- Customer service routing
- $\ldots$
Use ML when you have:
- Large datasets, large number of variables
- Lack of clear procedures to obtain solutions
- Existing ML expertise
- Infrastructure already in place to support ML
- Management support for ML
Typical ML workflow
- Iterative process
- Data processing
- Training
- Evaluation
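The iterative loop of data processing, training, and evaluation can be sketched as follows. This is only a toy under stated assumptions: the dataset is invented, the model is a single-weight regression `y = w * x` trained by gradient descent, and the "iteration" is a sweep over hypothetical candidate learning rates. The helper names `process`, `train`, and `evaluate` are illustrative, not part of any AWS API.

```python
def process(raw):
    """Data processing: drop records with missing values, then split 75/25."""
    clean = [(x, y) for x, y in raw if x is not None and y is not None]
    split = int(len(clean) * 0.75)
    return clean[:split], clean[split:]


def train(train_set, learning_rate, epochs=200):
    """Training: gradient descent on mean squared error for y = w * x."""
    w = 0.0
    n = len(train_set)
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in train_set) / n
        w -= learning_rate * grad
    return w


def evaluate(w, validation_set):
    """Evaluation: mean squared error on held-out data."""
    return sum((w * x - y) ** 2 for x, y in validation_set) / len(validation_set)


# Hypothetical raw data with one incomplete record.
raw = [(1.0, 2.1), (2.0, 3.9), (None, 5.0), (3.0, 6.2), (4.0, 7.8),
       (5.0, 10.1), (6.0, 12.0), (7.0, 13.9), (8.0, 16.2)]

train_set, validation_set = process(raw)

# Iterate: retrain with each candidate setting and keep the best evaluation.
best_lr, best_w, best_error = None, None, float("inf")
for lr in (0.0001, 0.001, 0.01):
    w = train(train_set, lr)
    error = evaluate(w, validation_set)
    if error < best_error:
        best_lr, best_w, best_error = lr, w, error

print(f"best learning rate={best_lr}, w={best_w:.3f}, validation MSE={best_error:.4f}")
```

In a real project each pass may also send you back to data processing, for example to collect more data or repair features, rather than only changing a training setting.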
ML frameworks and infrastructure
- Frameworks provide tools and code libraries
- Customized scripting
- Integration with AWS services
- Community of developers
- Examples: PyTorch, TensorFlow, scikit-learn, $\ldots$
- Infrastructure
- Designed for ML applications
- AWS IoT Greengrass provides an infrastructure for building ML for IoT devices
- Amazon Elastic Inference reduces the cost of running ML inference
AWS ML managed services, no ML experience required
- Computer vision: Amazon Rekognition, Amazon Textract
- Speech: Amazon Polly, Amazon Transcribe
- Language: Amazon Comprehend, Amazon Translate
- Chatbots: Amazon Lex
- Forecasting: Amazon Forecast
- Recommendations: Amazon Personalize
Three layers of the Amazon Machine Learning stack:
- Managed Services
- Machine Learning Services
- Machine Learning Frameworks
ML challenges
- Data
- Poor quality
- Non-representative
- Insufficient
- Overfitting and underfitting
- Business
- Complexity in formulating questions
- Explaining models to business stakeholders
- Cost of building systems
- Users
- Lack of data science expertise
- Cost of staffing with data scientists
- Lack of management support
- Technology
- Data privacy issues
- Tool selection can be complicated
- Integration with other systems
Feature Engineering
Public datasets for feature engineering and model tuning
- Hugging Face public datasets
- Kaggle public datasets
- Amazon S3 buckets
A useful concept: combine existing features to produce new features for training and validation
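As a minimal sketch of combining features, the snippet below derives a ratio feature and an interaction feature from existing columns. The customer records, the field names (`total_spend`, `num_orders`, `tenure_months`), and the helper `add_derived_features` are hypothetical and chosen only for illustration.

```python
def add_derived_features(record):
    """Return a copy of the record with two combined features added."""
    enriched = dict(record)
    orders = record["num_orders"]
    # Ratio feature: spend per order, guarding against division by zero.
    enriched["avg_order_value"] = record["total_spend"] / orders if orders else 0.0
    # Interaction feature: product of two existing features.
    enriched["spend_x_tenure"] = record["total_spend"] * record["tenure_months"]
    return enriched


# Hypothetical customer records.
customers = [
    {"total_spend": 120.0, "num_orders": 4, "tenure_months": 12},
    {"total_spend": 0.0, "num_orders": 0, "tenure_months": 1},
]

features = [add_derived_features(c) for c in customers]
for row in features:
    print(row)
```

The same transformation should be applied to both the training and the validation data so that the model sees identical features in each.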
It helps to create an ML project structure so that the project can be managed and tracked phase by phase
- Data ingest
- Exploratory data analysis (EDA)
- Modeling
- Conclusion
Typical approaches in the EDA phase:
- Look at descriptive statistics
- Graph the data and examine trends: linear, logarithmic, $\ldots$
- Cluster the data
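The three EDA approaches above can be sketched with the Python standard library alone. The numbers are invented, and the trend check (comparing the Pearson correlation of `y` against `x` and against `log(x)`) and the tiny two-cluster routine `two_means` are illustrative assumptions rather than a prescribed method; in practice a plotting library and a dedicated clustering implementation would normally be used.

```python
import math
import statistics

# Hypothetical measurements that grow roughly logarithmically with x.
xs = [1, 2, 4, 8, 16, 32]
ys = [0.1, 1.0, 2.1, 2.9, 4.0, 5.1]

# 1. Descriptive statistics.
summary = {
    "mean": statistics.mean(ys),
    "median": statistics.median(ys),
    "stdev": statistics.stdev(ys),
    "min": min(ys),
    "max": max(ys),
}


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


# 2. Trend check: is y closer to linear in x or linear in log(x)?
r_linear = pearson(xs, ys)
r_log = pearson([math.log(x) for x in xs], ys)
trend = "logarithmic" if r_log > r_linear else "linear"


def two_means(values, iterations=10):
    """Tiny 1-D k-means with k=2, seeded at the min and max values."""
    centers = [min(values), max(values)]
    groups = ([], [])
    for _ in range(iterations):
        groups = ([], [])
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [statistics.mean(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# 3. Clustering: hypothetical purchase amounts with two obvious groups.
centers, groups = two_means([10, 12, 11, 95, 100, 105])

print(summary)
print(f"r_linear={r_linear:.3f}, r_log={r_log:.3f}, trend looks {trend}")
print(centers, groups)
```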