Setup Spark and Jupyter on Windows

Linh Ngo
5 min read · Nov 22, 2020


This is yet another guide to setting up Spark and Jupyter on Windows. Even after consulting a number of other sources, such as phoenixNAP and Chang Hsin Lee, there were still several steps that I needed to stumble through on my own. In the remainder of this article, I will describe the steps needed to get a running Jupyter notebook that is powered by a local Spark cluster on the back-end.

Windows Terminal

My computer is running Windows 10 Professional. As I am on the Windows Insider track, my current build is 20262.fe_release.20113-1436. As Windows becomes more and more friendly toward Linux, the first item that we want to set up is a proper terminal. Windows Terminal is my application of choice. To acquire it, open the Microsoft Store app, then search for and install Windows Terminal.

Windows Terminal in Microsoft Store

Install Java

It is possible that you already have Java installed. The easiest way to confirm this is to open your Windows Terminal and run javac -version. The version that you see on your screen may be different from mine.

Checking version/availability of javac

If you don’t see this, there is still a chance that you have Java set up. Go to C:\Program Files and check whether some variety of Java is already present, from either Oracle or AdoptOpenJDK.

If you don’t have Java, you will need to install it. I recommend using the Java distribution maintained by OpenJDK:

  • Go to OpenJDK website
  • Choose OpenJDK 8 (LTS) for version and HotSpot for JVM
  • Click on the generated download link to get the installation package.
  • Run the installer. You can keep all default settings on the installer.
  • Once the installation finishes, you can run javac -version again to confirm the installation (or try the Python check after this list).
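
Once you have a Python interpreter available (for instance after the Anaconda step below), you can also confirm the Java setup programmatically. A minimal sketch using only the standard library:

# Check whether javac is reachable on PATH and, if so, print its version.
import shutil
import subprocess

javac = shutil.which("javac")
print(javac)                                   # None means javac is not on PATH
if javac:
    subprocess.run(["javac", "-version"])      # prints something like "javac 1.8.0_222"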

Install Anaconda

  • Visit Anaconda’s download page and download the Anaconda installer for Windows.
  • You should select the 64-bit variant of the installer.
  • Run the installation for Anaconda.
  • Remember the installation directory for Anaconda.
  • For Windows, this is typically C:\Users\YOUR_WINDOWS_USERNAME\anaconda3 or C:\ProgramData\anaconda3.

Download Spark

  • Visit Spark’s download page
  • Select the download options as shown in the figure below.
  • Click on the generated link to download Spark.
Link to how to download Apache Spark
  • Untar and store the final directory somewhere that is easily accessible.
  • You might need to download and install 7-Zip to decompress the .tgz file on Windows (or see the Python alternative after this list).
  • When decompressing with 7-Zip, you might have to do it twice: the first pass produces a .tar file, and a second pass on that .tar file yields the Spark directory.
  • Move the resulting directory under the C: drive.
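
If you prefer not to install 7-Zip, Python’s built-in tarfile module can unpack the .tgz in a single pass, since it handles the gzip and tar layers together. This is just an alternative sketch; the download location and file name below are assumptions, so adjust them to match your system and the version you downloaded:

# Extract the gzip-compressed tar archive straight into C:\ in one step.
import tarfile

archive_path = r"C:\Users\YOUR_WINDOWS_USERNAME\Downloads\spark-3.0.1-bin-hadoop3.2.tgz"
with tarfile.open(archive_path) as archive:      # mode "r" auto-detects gzip compression
    archive.extractall("C:/")                    # creates C:\spark-3.0.1-bin-hadoop3.2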

Install libraries to support Hadoop functionalities

Open Windows Terminal, and create a hadoop directory and a bin subdirectory under the C: drive.

  • cd c:\
  • mkdir -p hadoop\bin
  • Visit the link to winutils.exe, right-click on the Download button, and choose Save Link As.
  • Save the file to C:\hadoop\bin.
  • Visit the link to hadoop.dll, right-click on the Download button, and choose Save Link As.
  • Save the file to C:\hadoop\bin.

Setup environment variables

Click on the Windows icon and start typing environment variables in the search box, then click on Edit the system environment variables.

Click on Environment Variables. Under User variables for ..., click New and enter the following name/value pair for each of the items below. Click OK when done.

  • Java
  • Variable name: JAVA_HOME
  • Variable value: Typically C:\Program Files\AdoptOpenJDK\jdk-8.0.222.10-hotspot
  • Spark
  • Variable name: SPARK_HOME
  • Variable value: C:\spark-3.0.1-bin-hadoop3.2
  • Hadoop
  • Variable name: HADOOP_HOME
  • Variable value: C:\hadoop
  • Anaconda3
  • Variable name: ANACONDA_HOME
  • Variable value: C:\Users\YOUR_WINDOWS_USERNAME\anaconda3

In User variables for ..., select Path and click Edit. Then click New and add each of the following entries to the list. Click OK when done.

  • Java: %JAVA_HOME%\bin
  • Spark: %SPARK_HOME%\bin
  • Hadoop: %HADOOP_HOME%\bin
  • Anaconda3: %ANACONDA_HOME%\Scripts

Close your terminal and relaunch it. Test that all paths are set up correctly by running the following:

> where.exe javac
> where.exe spark-shell
> where.exe winutils
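
If you want to double-check from Python as well (handy later, since findspark and the notebook rely on these same variables), a minimal sketch, run once Python is available (for instance from an Anaconda prompt):

# Print the variables defined above; None indicates a missing entry.
# Run this in a fresh terminal so the new values are picked up.
import os

for name in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME", "ANACONDA_HOME"):
    print(name, "=", os.environ.get(name))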

Setup Jupyter and pyspark

Open a terminal and run the following:

> conda create -y -n pyspark python=3.6
> conda init powershell
> conda activate pyspark
> conda install -y -c conda-forge findspark
> conda install -y ipykernel
> python -m ipykernel install --user --name=pyspark
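
Before launching Jupyter, you can optionally confirm that the new environment can locate Spark. A minimal sketch, run with python inside the activated pyspark environment:

# findspark locates Spark via the SPARK_HOME variable set earlier.
import findspark
findspark.init()

import pyspark
print(pyspark.__version__)   # should report the downloaded version, e.g. 3.0.1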

Test Jupyter and pyspark

  • Download the following Shakespeare collection.
  • Open a terminal
  • Launch Jupyter Notebook using the jupyter notebook command.
  • A web browser will pop up for the Jupyter Notebook Server.
  • Open a new notebook using the pyspark kernel.
  • Enter the following Python code into a cell of the new notebook.
  • Replace PATH_TO_DOWNLOADED_SHAKESPEARE_TEXT_FILE with the actual path (including the file name) to the location where you downloaded the file earlier.
import os
import sys

# Make Spark's Python bindings importable from the notebook
spark_path = os.environ['SPARK_HOME']
sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
sys.path.append(spark_path + "/python/lib/py4j-0.10.9-src.zip")

import findspark
findspark.init()
import pyspark

# Local Spark configuration: adjust number_cores and memory_gb to your machine
number_cores = 8
memory_gb = 16
conf = (pyspark.SparkConf()
        .setMaster('local[{}]'.format(number_cores))
        .set('spark.driver.memory', '{}g'.format(memory_gb)))
sc = pyspark.SparkContext(conf=conf)

# Word count over the downloaded text file
textFile = sc.textFile("PATH_TO_DOWNLOADED_SHAKESPEARE_TEXT_FILE")
wordcount = (textFile.flatMap(lambda line: line.split(" "))
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b))
wordcount.saveAsTextFile("output-wordcount-01")
  • You can adjust number_cores and memory_gb to be more suitable for your computer. Mine runs on an Intel i7 with 32GB of memory, so I set them to 8 and 16.
  • Run the cell. Once the run completes successfully, you can revisit your Jupyter server and observe that an output-wordcount-01 directory has been created.
  • _SUCCESS is an empty file that signals a successful execution.
  • part-00000 and part-00001 contain the resulting outputs. (See the follow-up cell after this list for a way to inspect the top words without opening these files.)
  • You can also visit 127.0.0.1:4040/jobs to observe the running Spark cluster spawned by the Jupyter notebook.
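
If you would rather see the most frequent words directly in the notebook, a small follow-up cell along these lines should work; it reuses the wordcount RDD from the cell above, and takeOrdered is a standard RDD action:

# Show the ten most frequent tokens without reading the output files.
# Lines are split on spaces only, so punctuation stays attached to words.
top10 = wordcount.takeOrdered(10, key=lambda pair: -pair[1])
for word, count in top10:
    print(word, count)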

Change the Spark page’s tab to Executors to observe the configuration of the cluster:

  • The cluster has 8 cores
  • The amount of available memory is only 8.4GB out of the 16GB requested; this is due to Spark’s reserved memory and memory-fraction settings (a rough calculation follows below).
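
The 8.4GB figure can be roughly reproduced from Spark’s defaults. The sketch below is only a back-of-the-envelope estimate, assuming Spark 3.0 defaults (300MB of reserved memory and spark.memory.fraction = 0.6) and a hypothetical usable JVM heap of about 14.6GB out of the 16GB requested, since the JVM keeps part of the heap for itself:

# Rough estimate of the memory figure shown on the Executors tab.
usable_heap_mb = 14600      # hypothetical usable heap for a 16GB driver (JVM overhead excluded)
reserved_mb = 300           # Spark's reserved memory (default)
memory_fraction = 0.6       # spark.memory.fraction (default)

unified_memory_gb = (usable_heap_mb - reserved_mb) * memory_fraction / 1024
print('{:.1f} GB'.format(unified_memory_gb))   # ~8.4 GB, close to what the UI reports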


Written by Linh Ngo

Associate Professor, West Chester University of Pennsylvania
