BDSN

Data Science Tools - MySQL, Hadoop, Hive, HBase, Spark, MongoDB, Cassandra - in Google Colab

Praxis Business School

Big Data with Spark & NoSQL in Google Colab

Students of Data Science need to install, or have access to, a range of complex software, most of which runs on Linux. This means that students must either reconfigure their Windows machines to run Linux through VMs, Docker, WSL and the like, or get a separate Linux machine. Even then, installing this software proves difficult because of differences in machine configuration, different home directories, differing paths and many other complications.

To avoid wasting time on these installation challenges, students at Praxis Business School are encouraged to work on Google Colab. To learn what Colab is and how to use it, see any of the freely available tutorials. While Colab was designed as a hosted version of Jupyter Notebook for running Python programs, it is in reality a very powerful Ubuntu VM whose underlying shell and OS can be accessed by prefixing commands with ‘!’.
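As a minimal sketch of what shell access looks like: in a Colab cell, a line such as `!uname -s` is passed to the VM's shell. The block below does roughly the same thing in plain Python through the standard `subprocess` module, so it also runs outside a notebook. The helper name `sh` is hypothetical and is not taken from the notebooks listed here.

```python
import subprocess


def sh(command: str) -> str:
    """Run a command in the underlying shell and return its standard output."""
    result = subprocess.run(
        command, shell=True, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


# In a Colab cell the same commands could be written as:  !uname -s
print(sh("uname -s"))
print(sh("echo hello from the shell"))
```

With `check=True`, a failing command raises an exception instead of passing silently, which makes a broken installation step easier to spot.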

Using this strategy, it is possible to install and use any software in terminal mode on the underlying VM for academic and pedagogical purposes. Obviously, production-grade installations are not recommended. Since the VM is not persistent, the installation needs to be done each time the Colab notebook is started. However, data and configuration files can be stored persistently on the user's Google Drive, which can be mounted in the VM with read-write access. The biggest advantage of this strategy is that all dependencies are taken care of within the notebook cells themselves. Web-based GUI frontends can be accessed through tunnels, as shown in the Spark Wordcount example.

The following notebook URLs demonstrate, and serve as templates for, Colab notebooks for the installation and usage of different software. They should work out of the box. After opening the URL, use the Open in Colab button to open a safe, editable and executable copy of the codebase in Google Colab. The only change that might be necessary is the version number of the download file, with corresponding changes to the names of directories, the values of the $HOME environment variables and the contents of $PATH.
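The version-dependent settings mentioned above can be kept in one place, so that a new release means changing a single value. The following is only a sketch under stated assumptions: the helper `configure_tool`, the `/content` install root, the `<name>-<version>` directory layout and the version string are all illustrative and are not taken from the notebooks themselves.

```python
import os


def configure_tool(name: str, version: str, install_root: str = "/content") -> str:
    """Set <NAME>_HOME and put the tool's bin directory first on PATH.

    Assumes the download unpacks into <install_root>/<name>-<version>.
    Returns the home directory that was configured.
    """
    home = os.path.join(install_root, f"{name}-{version}")
    os.environ[f"{name.upper()}_HOME"] = home

    bin_dir = os.path.join(home, "bin")
    paths = [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]
    if bin_dir not in paths:
        os.environ["PATH"] = os.pathsep.join([bin_dir] + paths)
    return home


# Placeholder version: replace with the version of the file actually downloaded.
HADOOP_VERSION = "3.3.6"
print(configure_tool("hadoop", HADOOP_VERSION))
```

Variables set through `os.environ` are inherited by shell commands started later from the same notebook process, so the directory names and paths stay consistent across cells.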

  1. Install MySQL in the Colab VM and execute SQL statements: MySQL Local Shell Pandas
  2. Connect to a remote MySQL server and execute SQL statements: MySQL Remote Shell Pandas
  3. Install Hadoop in the VM and run the Wordcount program: Hadoop Wordcount
  4. Install Hadoop and Hive, run queries, load bulk data: Hadoop Hive
  5. Install Hadoop and HBase, run queries, load bulk data: Hadoop HBase Shell
  6. Install Hadoop, HBase and HappyBase to use HBase with Python: Hadoop Hbase Python
  7. Install Spark and run the WordCount program: Spark WordCount
  8. Install Spark and use SparkSQL, SQLContext, HiveContext: Spark SQL Hive
  9. Install MongoDB in the local VM and run basic CRUD operations: MongoDB getting started
  10. Access MongoDB on a remote site, load bulk data, run complex queries: MongoDB Remote Complex queries
  11. Install MongoDB and Spark on the local VM and access MongoDB from Spark: MongoDB Spark
  12. Install Cassandra and access it from the shell and Python: Cassandra Getting Started

Password matters

In certain cases, passwords and other user credentials need to be supplied. While normal users may hardcode these into the notebook as variables, a public notebook, like the ones stored here, cannot have them hardwired and visible. In these cases, we have therefore adopted the strategy of storing the credentials as .py files in Google Drive. During usage, the Google Drive is mounted, the file is copied into the VM and the variables are then imported.
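The mount, copy and import steps could look roughly like the sketch below. The function `load_credentials`, the file name `credentials.py` and the variable names `DB_USER` and `DB_PASSWORD` are hypothetical; the notebooks may organise this differently. The Drive mount itself only works inside Colab, so it is shown as a comment.

```python
import importlib.util
import shutil
from pathlib import Path


def load_credentials(drive_file: str, vm_dir: str = "/content"):
    """Copy a credentials .py file into the VM and import it as a module."""
    source = Path(drive_file)
    target = Path(vm_dir) / source.name
    shutil.copy(source, target)

    # Import the copied file by path, without touching sys.path.
    spec = importlib.util.spec_from_file_location(target.stem, target)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Inside Colab, Drive would be mounted first:
#   from google.colab import drive
#   drive.mount('/content/drive')
# Hypothetical file and variable names:
#   creds = load_credentials('/content/drive/MyDrive/credentials.py')
#   user, password = creds.DB_USER, creds.DB_PASSWORD
```

Because the credentials live in the user's own Drive, the public notebook contains only the import logic and never the values themselves.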

Data Science @ Praxis Business School