Yantrajaal by Prithwis Mukerjee
Big Data with Spark & NoSQL in Google Colab
Students of Data Science need to install or have access to a range of complex software, most of which runs on Linux. This means that students must either reconfigure their Windows machines to run Linux through VMs, Docker, WSL and the like, or get a separate Linux machine. Even then, installing this software proves difficult because of differences in machine configuration, home directories, paths and many other complications.
To avoid wasting time sorting out these installation challenges, students at Praxis Business School are encouraged to work on Google Colab. To learn the what and how of Colab, see this or other freely available tutorials. While Colab was designed as a hosted version of Jupyter Notebook for running Python programs, it is in reality a very powerful Ubuntu VM whose underlying shell and OS can be accessed by prefixing commands with ‘!’.
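As a rough illustration of what the ‘!’ prefix does, the sketch below runs shell commands from plain Python with the standard `subprocess` module. The helper name `run_shell` is hypothetical and is not part of Colab. In a notebook cell, `!uname -a` would print the same information directly.

```python
import subprocess


def run_shell(command: str) -> str:
    """Run a command in the underlying shell and return its standard output."""
    result = subprocess.run(
        command,
        shell=True,           # let the shell parse the command, as '!' does
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Roughly equivalent to typing `!uname -a` in a Colab cell.
print(run_shell("uname -a"))
print(run_shell("echo hello from the VM shell"))
```

The ‘!’ form is more convenient inside a notebook. The function form is useful when the output needs to be captured in a Python variable.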
Using this strategy, it is possible to install and use any software in terminal mode on the underlying VM and use it for academic and pedagogical purposes. With some clever and creative hacks, even production grade applications are possible. Since the VM is not persistent, the installation needs to be done each time the Colab notebook is started. However, data and configuration files can be stored persistently on the user's Google Drive, which can be mounted in the VM and used with read-write access. The biggest advantage of this strategy is that all dependencies are taken care of within the notebook cells themselves. Web based GUI frontends can be accessed with tunnels, as shown in the Spark WordCount example.
The following notebook URLs demonstrate, and serve as templates for, Colab notebooks for the installation and usage of different software. They should work out-of-the-box. After opening the URL, use the button to open a safe, editable and executable copy of the codebase in Google Colab. The only changes that might be necessary are the version number of the download file and the corresponding changes in the names of directories, the values of the $HOME environment variables and the contents of the $PATH.
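The version-dependent settings mentioned above can be kept in one place so that a version bump needs a single edit. The following is a minimal sketch under stated assumptions: the package name `mytool`, the version `1.2.3`, the `.tgz` naming pattern and the `/content` base directory are all hypothetical placeholders, and the actual notebooks may organise this differently.

```python
import os


def configure_install(name: str, version: str, base_dir: str = "/content") -> dict:
    """Derive the download file name, install directory and environment
    settings for a package from its version number (hypothetical naming)."""
    archive = f"{name}-{version}.tgz"                 # assumed archive naming pattern
    install_dir = os.path.join(base_dir, f"{name}-{version}")
    home_var = f"{name.upper()}_HOME"                 # e.g. MYTOOL_HOME
    bin_dir = os.path.join(install_dir, "bin")

    # Export the home variable for this process and any child shells.
    os.environ[home_var] = install_dir

    # Prepend the bin directory to PATH unless it is already present.
    entries = [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]
    if bin_dir not in entries:
        os.environ["PATH"] = os.pathsep.join([bin_dir] + entries)

    return {"archive": archive, "install_dir": install_dir, "home_var": home_var}


settings = configure_install("mytool", "1.2.3")
print(settings["archive"], settings["install_dir"], settings["home_var"])
```

Only the two arguments in the final call need to change when a newer download is used, provided the real archive follows the same naming pattern.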
The Colab notebooks, Python code and datasets in this repository are used to teach the course on BigData Spark NoSQL at Praxis Business School. Some of the more interesting programs are listed below.
- Install MySQL in the Colab VM and execute SQL statements: MySQL Local Shell Pandas
- Connect to a remote MySQL server and execute SQL statements: MySQL Remote Shell Pandas
- Install Hadoop in the VM, run the Wordcount program: Hadoop Wordcount
- Install Hadoop and Hive, run queries, load bulk data: Hadoop Hive
- Install Hadoop and HBase, run queries, load bulk data: Hadoop HBase Shell
- Install Hadoop, HBase and HappyBase to use HBase with Python: Hadoop Hbase Python
- Install Spark, run the WordCount program: Spark WordCount
- Install Spark, use SparkSQL, SQLContext, HiveContext: Spark SQL Hive
- Install MongoDB in the local VM, run basic CRUD operations: MongoDB getting started
- Access MongoDB on a remote site, load bulk data, run complex queries: MongoDB Remote Complex queries
- Install MongoDB and Spark on the local VM and access MongoDB from Spark: MongoDB Spark
- Install Cassandra, access it from the shell and from Python: Cassandra Getting Started
- Access Cassandra as a remote service from DataStax, using Python: Cassandra_DataStax_Python
- Build ML pipelines with Spark for customer conversion: ML_Pipeline_1_Customer_Conversion
- Build reusable ML pipelines with Spark for diabetes prediction using multiple ML algorithms: ML_Pipeline_2_Diabetes_Prediction
xterm Access
Linux / Ubuntu commands in Google Colab are generally run by prefixing them with ! (exclamation mark). However, an xterm console with TTY support can be created with colab-xterm. This is demonstrated in these two notebooks: MySQL Local Shell Pandas and Cassandra Getting Started.
Examples of deploying ML models using Google Colab are available in the Praxis DEMD repository. This includes the use of Python Flask WebApps that make ML models accessible through a public website.
Password matters
In certain cases, passwords and other user credentials need to be supplied. While normal users may hardcode these into the notebook as variables, a public notebook, like the ones stored here, cannot have them hardwired and visible. Hence in these cases, we have taken the strategy of storing the credentials as .py files in Google Drive. During usage, the Google Drive is mounted, the file is copied into the VM and the variables are then imported.
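The mount, copy and import steps can be sketched as follows. This is an illustration only, not the exact code used in the notebooks: the file name `db_credentials.py`, the variable names `DB_USER` and `DB_PASSWORD`, and the helper `load_credentials` are hypothetical. A temporary directory stands in for the mounted Drive so that the sketch also runs outside Colab.

```python
import importlib.util
import shutil
import tempfile
from pathlib import Path


def load_credentials(drive_file: Path, work_dir: Path):
    """Copy a credentials .py file into work_dir and import it as a module."""
    work_dir.mkdir(parents=True, exist_ok=True)
    local_copy = work_dir / drive_file.name
    shutil.copy(drive_file, local_copy)

    # Import the copied file by path, without touching sys.path.
    spec = importlib.util.spec_from_file_location(local_copy.stem, local_copy)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# In Colab, the Drive would first be mounted with:
#   from google.colab import drive
#   drive.mount("/content/drive")
# Here a temporary directory plays the role of the mounted Drive (assumption).
fake_drive = Path(tempfile.mkdtemp())
(fake_drive / "db_credentials.py").write_text(
    'DB_USER = "demo_user"\nDB_PASSWORD = "not-a-real-password"\n'
)

creds = load_credentials(fake_drive / "db_credentials.py", Path(tempfile.mkdtemp()))
print(creds.DB_USER)  # the password itself is deliberately never printed
```

Because the credentials live only in the user's own Drive, the public notebook carries the variable names but never their values.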