In this advanced part of our HPC Cloud tutorial you will run an exercise that compares the scale-out and scale-up scenarios. You will use Principal Component Analysis (PCA) to study flight delays; the data can be analysed either by scaling up one VM or by scaling out across multiple VMs.
The original dataset comes from here, but we have already prepared some files. Among other preparation steps, we selected a subset of dates and variables and applied some cleaning to make the data more useful.
You are now in the advanced section of the workshop. You have your laptop and an Internet connection, and we expect you to find out more on your own about topics that we explain only briefly or not at all. For example, at this point we would already have searched for:
- Principal Component Analysis
- R language
- R packages
Start a new single-core VM with 1 GB of memory (you are now in the advanced part, so you should be able to do this on your own). The steps in this exercise assume that you are using an Ubuntu image.
In this part of the exercise you will prepare the software and download the data for analysis. After logging into the VM:

sudo apt-get update
sudo apt-get install r-base
wget http://doc.hpccloud.surfsara.nl/UvA-20190130/code/airplane-delay.tar
tar -xvf airplane-delay.tar
Food for brain:
- What version of R do you have?
- How can you inspect the files without opening them?
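For instance, you can answer the first question from within R itself. The commented data-file path below is an assumption; substitute whatever the tar file actually contains:

```r
# Print the version of the running R interpreter
print(R.version.string)

# Peek at the first lines of a data file without loading it fully;
# uncomment and adjust the path once you have unpacked the tar file:
# print(readLines("~/airplane-delay/delay-2018-01.csv", n = 5))
```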
cd ~/airplane-delay
Rscript airplane-delay-all-comp.r
You just ran an R script and saw its output. What do these numbers mean? Which variables (columns) were used to perform the PCA?
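If the numbers are unclear, it may help to first run a tiny PCA on a dataset bundled with R. This is only a sketch using the built-in mtcars data, not the workshop script itself:

```r
# PCA on four numeric columns of the built-in mtcars data.
# scale. = TRUE standardises each column so that variables measured on
# different scales (e.g. horsepower vs. weight) contribute comparably.
pca <- prcomp(mtcars[, c("mpg", "disp", "hp", "wt")], scale. = TRUE)

print(summary(pca))  # standard deviation and proportion of variance per component
print(pca$rotation)  # loadings: the weight of each original column in each component
```

The same two pieces of output, the variance explained per component and the loadings, are what you should look for in the workshop script's output.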
For simplicity the script plots only part of the data. You may use all the data points to create the plots.
Food for brain:
- How can you display these plots? (Hint: you can log in with X11 forwarding enabled, e.g. ssh -X)
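If X11 forwarding is not an option, one alternative (a sketch, using stand-in data rather than your script's objects) is to write the plots to files and copy them to your laptop with scp:

```r
# Write a PCA biplot to a PNG file instead of opening an X11 window.
pca <- prcomp(mtcars[, c("mpg", "disp", "hp", "wt")], scale. = TRUE)  # stand-in data
png("pca-biplot.png")
biplot(pca)   # scores and loadings in one plot
dev.off()
```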
How do these numbers compare to the previous analysis with the full dataset? A similar example, which can help you interpret and further analyse the results, can be found here.
So far you have worked on a single dataset on a single-core VM with 1 GB of memory. Two datasets are provided to you in the airplane-delay.tar file, and data files for a few more months are available here (delay-2018-*.csv). How would you run the analysis for the year as a whole?
You can scale up your VM and run the analysis serially over the datasets, combining the results at the end. How big should the VM be to optimize the analysis?
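The scale-up approach might look like the following sketch: process the monthly files serially on one (larger) VM and pool the rows before a single PCA. The file pattern is an assumption; adjust it to the files you actually downloaded.

```r
# Gather all monthly files matching the assumed naming pattern.
files <- Sys.glob("delay-2018-*.csv")
if (length(files) > 0) {
  # Read each file and stack the rows into one data frame.
  combined <- do.call(rbind, lapply(files, read.csv))
  # Keep only numeric columns for the PCA.
  numeric_cols <- combined[, sapply(combined, is.numeric), drop = FALSE]
  pca <- prcomp(numeric_cols, scale. = TRUE)
  print(summary(pca))
}
```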
Alternatively, you may choose to scale out: run a separate VM for each dataset and combine the results at the end. How would you go about doing this?
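One hypothetical way to combine results from separate VMs is to have each worker save its result with saveRDS, copy the files to one machine (e.g. with scp), and read them back there with readRDS. The snippet below simulates one worker locally; the object and file names are illustrative only.

```r
# On a worker VM: fit the PCA for one dataset and save it to a file.
pca <- prcomp(mtcars[, c("mpg", "hp")], scale. = TRUE)   # stand-in data
saveRDS(pca, "pca-part-01.rds")

# On the machine that combines results, after copying the files over:
parts <- lapply(Sys.glob("pca-part-*.rds"), readRDS)
print(length(parts))
```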
The data was downloaded from here, taking only a few columns into account. You can use either of the above methods, or both, to run the same analysis for several years.