Setup Spark and Jupyter on Windows

Linh Ngo
5 min read · Nov 22, 2020


This is yet another guide to setting up Spark and Jupyter on Windows. Even after consulting a number of other sources, such as phoenixNAP and Chang Hsin Lee, there were still several steps that I needed to stumble through on my own. In the remainder of this article, I will describe the steps needed to get a running Jupyter notebook that is powered by a local Spark cluster on the back-end.

Windows Terminal

My computer is running Windows 10 Professional. As I am on the Windows Insider track, my current build is 20262.fe_release.20113-1436. As Windows becomes more and more friendly toward Linux, the first item that we want to set up is a proper terminal. Windows Terminal is my application of choice. To acquire it, open the Microsoft Store app, then search for and install Windows Terminal.

Windows Terminal in Microsoft Store

Install Java

It is possible that you already have Java installed. The easiest way to confirm this is to open your Windows Terminal and run javac -version. The version that you see on your screen may be different from mine.

Checking version/availability of javac

If you don’t see this, there is still a chance that you have Java set up. Go to C:\Program Files and check whether some variety of Java is already present, from either Oracle or AdoptOpenJDK.

If you don’t have Java, you will need to install it. I recommend using the Java distribution maintained by OpenJDK:

  • Go to OpenJDK website
  • Choose OpenJDK 8 (LTS) for version and HotSpot for JVM
  • Click on the generated download link to get the installation package.
  • Run the installer. You can keep all default settings on the installer.
  • Once the installation finishes, you can run javac -version again to confirm the installation (or try the Python check after this list).
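
Once you have a Python interpreter available (for instance after the Anaconda step below), you can also confirm the Java setup programmatically. A minimal sketch using only the standard library:

# Check whether javac is reachable on PATH and, if so, print its version.
import shutil
import subprocess

javac = shutil.which("javac")
print(javac)                                   # None means javac is not on PATH
if javac:
    subprocess.run(["javac", "-version"])      # prints something like "javac 1.8.0_222"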

Install Anaconda

  • Visit Anaconda’s download page and download the Anaconda installer for Windows.
  • You should select the 64-bit variant of the installer.
  • Run the installation for Anaconda.
  • Remember the installation directory for Anaconda.
  • For Windows, this is typically C:\Users\YOUR_WINDOWS_USERNAME\anaconda3 or C:\ProgramData\anaconda3.

Download Spark

  • Visit Spark’s download page
  • Select the download options as shown in the figure below.
  • Click on the generated link to download Spark.
Link to how to download Apache Spark
  • Untar and store the final directory somewhere that is easily accessible.
  • You might need to download and install 7-Zip to decompress the .tgz file on Windows (or see the Python alternative after this list).
  • When decompressing with 7-Zip, you might have to do it twice: the first pass produces a .tar file, and a second pass on that .tar file yields the Spark directory.
  • Move the resulting directory under the C: drive.
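
If you prefer not to install 7-Zip, Python’s built-in tarfile module can unpack the .tgz in a single pass, since it handles the gzip and tar layers together. This is just an alternative sketch; the download location and file name below are assumptions, so adjust them to match your system and the version you downloaded:

# Extract the gzip-compressed tar archive straight into C:\ in one step.
import tarfile

archive_path = r"C:\Users\YOUR_WINDOWS_USERNAME\Downloads\spark-3.0.1-bin-hadoop3.2.tgz"
with tarfile.open(archive_path) as archive:      # mode "r" auto-detects gzip compression
    archive.extractall("C:/")                    # creates C:\spark-3.0.1-bin-hadoop3.2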

Install libraries to support Hadoop functionalities

Open Windows Terminal, and create a hadoop directory and a bin subdirectory under the C: drive.

  • cd c:\
  • mkdir -p hadoop\bin
  • Visit the link to winutils.exe, right-click on the Download button, and choose Save Link As.
  • Save the file to C:\hadoop\bin.
  • Visit the link to hadoop.dll, right-click on the Download button, and choose Save Link As.
  • Save the file to C:\hadoop\bin.

Setup environment variables

Click on the Windows icon and start typing environment variables in the search box, then click on Edit the system environment variables.

Click on Environment Variables. Under User variables for ..., click New and enter the following name/value pair for each of the items below. Click OK when done.

  • Java
  • Variable name: JAVA_HOME
  • Variable value: Typically C:\Program Files\AdoptOpenJDK\jdk-8.0.222.10-hotspot
  • Spark
  • Variable name: SPARK_HOME
  • Variable value: C:\spark-3.0.1-bin-hadoop3.2
  • Hadoop
  • Variable name: HADOOP_HOME
  • Variable value: C:\hadoop
  • Anaconda3
  • Variable name: ANACONDA_HOME
  • Variable value: C:\Users\YOUR_WINDOWS_USERNAME\anaconda3

In User variables for ..., select Path and click Edit. Then click New and add each of the following entries to the list. Click OK when done.

  • Java: %JAVA_HOME%\bin
  • Spark: %SPARK_HOME%\bin
  • Hadoop: %HADOOP_HOME%\bin
  • Anaconda3: %ANACONDA_HOME%\Scripts

Close your terminal and relaunch it. Test that all paths are set up correctly by running the following:

> where.exe javac
> where.exe spark-shell
> where.exe winutils
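
If you want to double-check from Python as well (handy later, since findspark and the notebook rely on these same variables), a minimal sketch, run once Python is available (for instance from an Anaconda prompt):

# Print the variables defined above; None indicates a missing entry.
# Run this in a fresh terminal so the new values are picked up.
import os

for name in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME", "ANACONDA_HOME"):
    print(name, "=", os.environ.get(name))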

Setup Jupyter and pyspark

Open a terminal and run the following:

> conda create -y -n pyspark python=3.6
> conda init powershell
> conda activate pyspark
> conda install -y -c conda-forge findspark
> conda install -y ipykernel
> python -m ipykernel install --user --name=pyspark
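
Before launching Jupyter, you can optionally confirm that the new environment can locate Spark. A minimal sketch, run with python inside the activated pyspark environment:

# findspark locates Spark via the SPARK_HOME variable set earlier.
import findspark
findspark.init()

import pyspark
print(pyspark.__version__)   # should report the downloaded version, e.g. 3.0.1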

Test Jupyter and pyspark

  • Download the following Shakespeare collection.
  • Open a terminal
  • Launch Jupyter Notebook using the jupyter notebook command.
  • A web browser will pop up for the Jupyter Notebook Server.
  • Open a new notebook using the pyspark kernel.
  • Enter the following Python code into a cell of the new notebook.
  • Replace PATH_TO_DOWNLOADED_SHAKESPEARE_TEXT_FILE with the actual path (including the file name) to the location where you downloaded the file earlier.
import os
import sys

# Make Spark's Python bindings importable from the notebook
spark_path = os.environ['SPARK_HOME']
sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
sys.path.append(spark_path + "/python/lib/py4j-0.10.9-src.zip")

import findspark
findspark.init()
import pyspark

# Local Spark configuration: adjust number_cores and memory_gb to your machine
number_cores = 8
memory_gb = 16
conf = (pyspark.SparkConf()
        .setMaster('local[{}]'.format(number_cores))
        .set('spark.driver.memory', '{}g'.format(memory_gb)))
sc = pyspark.SparkContext(conf=conf)

# Word count over the downloaded text file
textFile = sc.textFile("PATH_TO_DOWNLOADED_SHAKESPEARE_TEXT_FILE")
wordcount = (textFile.flatMap(lambda line: line.split(" "))
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b))
wordcount.saveAsTextFile("output-wordcount-01")
  • You can adjust number_cores and memory_gb to be more suitable for your computer. Mine runs on an Intel i7 with 32GB of memory, so I set them to 8 and 16.
  • Run the cell. Once the run completes successfully, you can revisit your Jupyter server and observe that an output-wordcount-01 directory has been created.
  • _SUCCESS is an empty file that signals a successful execution.
  • part-00000 and part-00001 contain the resulting outputs. (See the follow-up cell after this list for a way to inspect the top words without opening these files.)
  • You can also visit 127.0.0.1:4040/jobs to observe the running Spark cluster spawned by the Jupyter notebook.
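
If you would rather see the most frequent words directly in the notebook, a small follow-up cell along these lines should work; it reuses the wordcount RDD from the cell above, and takeOrdered is a standard RDD action:

# Show the ten most frequent tokens without reading the output files.
# Lines are split on spaces only, so punctuation stays attached to words.
top10 = wordcount.takeOrdered(10, key=lambda pair: -pair[1])
for word, count in top10:
    print(word, count)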

Change the Spark page’s tab to Executors to observe the configuration of the cluster:

  • The cluster has 8 cores
  • The amount of available memory is only 8.4GB out of the 16GB requested; this is due to Spark’s reserved memory and memory-fraction settings (a rough calculation follows below).
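
The 8.4GB figure can be roughly reproduced from Spark’s defaults. The sketch below is only a back-of-the-envelope estimate, assuming Spark 3.0 defaults (300MB of reserved memory and spark.memory.fraction = 0.6) and a hypothetical usable JVM heap of about 14.6GB out of the 16GB requested, since the JVM keeps part of the heap for itself:

# Rough estimate of the memory figure shown on the Executors tab.
usable_heap_mb = 14600      # hypothetical usable heap for a 16GB driver (JVM overhead excluded)
reserved_mb = 300           # Spark's reserved memory (default)
memory_fraction = 0.6       # spark.memory.fraction (default)

unified_memory_gb = (usable_heap_mb - reserved_mb) * memory_fraction / 1024
print('{:.1f} GB'.format(unified_memory_gb))   # ~8.4 GB, close to what the UI reports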


Written by Linh Ngo

Associate Professor, West Chester University of Pennsylvania
