Description: Overview videos for each section below - MOV format.
Description: ETL is a process that extracts, transforms, and loads data from multiple sources to a data warehouse or other unified data repository. Slides and diagrams for the section.
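As a rough illustration (not part of the course materials), the three ETL stages can be sketched in a few lines of Python; the orders.csv source file, its column names, and the SQLite target below are all hypothetical stand-ins for real sources and a real warehouse:

```python
import csv
import sqlite3

# Extract: read raw rows from a source (hypothetical file "orders.csv").
def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Transform: clean and reshape rows to match the target schema.
def transform(rows):
    return [
        (
            row["order_id"],
            row["customer"].strip().title(),
            round(float(row["amount"]), 2),
        )
        for row in rows
    ]

# Load: write the transformed records into the unified repository
# (a local SQLite database standing in for the warehouse).
def load(records, db="warehouse.db"):
    conn = sqlite3.connect(db)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))
```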
Description: Hadoop - An open-source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Slides and diagrams for the section.
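One common way to run Python on Hadoop is Hadoop Streaming, which pipes data from HDFS through a mapper and reducer via stdin/stdout. A minimal word-count sketch, with hypothetical file names, saved as two separate scripts:

```python
#!/usr/bin/env python3
# mapper.py - emit one "word<TAB>1" pair per word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")
```

```python
#!/usr/bin/env python3
# reducer.py - input arrives sorted by key, so counts for each
# word can be summed in a single pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(current_word + "\t" + str(count))
        current_word, count = word, 0
    count += int(n)
if current_word is not None:
    print(current_word + "\t" + str(count))
```

The pipeline can be tested locally without a cluster (`cat book.txt | python3 mapper.py | sort | python3 reducer.py`) and then submitted with the Hadoop Streaming jar, whose exact path varies by installation.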
Description: Spark is a scalable open-source data processing engine built around speed, ease of use, and analytics, with APIs in Java, Scala, Python, R, and SQL. Spark runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Slides and diagrams for the section. Code examples included.
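A minimal sketch of the classic word count in PySpark; the input file book.txt and the local master URL are assumptions for illustration, and in a cluster deployment the master would instead point at YARN, Kubernetes, or a standalone cluster:

```python
from pyspark.sql import SparkSession

# Start a local Spark session; "local[*]" uses all cores on this machine.
spark = SparkSession.builder.appName("wordcount").master("local[*]").getOrCreate()

# Word count using the RDD API (hypothetical input file "book.txt").
counts = (
    spark.sparkContext.textFile("book.txt")
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

# Print the ten most frequent words.
for word, n in counts.takeOrdered(10, key=lambda pair: -pair[1]):
    print(word, n)

spark.stop()
```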
Description: Apache Airflow (or simply Airflow) is a platform to programmatically author, schedule, and monitor workflows. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative. Airflow is used to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes tasks on an array of workers while following the specified dependencies. Command line utilities make it easy to perform complex operations on DAGs, and the graphical user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed. Slides and diagrams for the section. Code examples included.
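A minimal sketch of a DAG definition, assuming Airflow 2.x; the DAG id, task names, and callables are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the source")

def load():
    print("writing data to the warehouse")

# A two-task DAG: the scheduler runs `extract` before `load`
# because of the dependency declared at the bottom.
with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # on Airflow < 2.4, use schedule_interval instead
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task
```

The `>>` operator is Airflow's shorthand for setting a downstream dependency, which is what makes the workflow a directed acyclic graph rather than a flat list of tasks.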
Description: Sensemaking - make sense of the data in MIT's course catalog by creating a data pipeline to complete the project. Slides and diagrams for the section. Starter code. Visualization code. Solution (do not share with learners).