This is yet another guide to setting up Spark and Jupyter on Windows. After consulting a number of other sources, such as phoenixNAP and Chang Hsin Lee, there were still a number of steps that I needed to stumble through on my own. In the remainder of this article, I will describe the steps needed to have a running Jupyter notebook powered by a local Spark cluster on the back-end.
Windows Terminal
My computer is running Windows 10 Professional. As I am on the Windows Insider track, my current build is 20262.fe_release.20113–1436. As Windows becomes more and more friendly toward Linux, the first item that we want to set up is a proper terminal. Windows Terminal is my application of choice. To acquire Windows Terminal, open the Microsoft Store app, then search for and install Windows Terminal.
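If you prefer installing from the command line and your Windows build ships with the winget package manager, the following should also work; the package ID is my assumption of the one currently published in the winget repository:
> winget install --id Microsoft.WindowsTerminal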

Install Java
It is possible that you already have Java installed. The easiest way to confirm this is to open your Windows Terminal and run javac -version. The version that you see on your screen may be different from mine.

If you don’t have this, there is still a chance that you have Java set up. Go to C:\Program Files and check whether there is a Java directory there from either Oracle or AdoptOpenJDK.
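As a quick illustration, you can list the vendor directories under C:\Program Files from Windows Terminal (PowerShell); the name patterns below are simply the ones I would expect from Oracle or AdoptOpenJDK installers:
> Get-ChildItem 'C:\Program Files' -Directory | Where-Object { $_.Name -match 'Java|AdoptOpenJDK' }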
If you don’t have Java, you will need to install it. I recommend using the Java distribution maintained by OpenJDK:
- Go to the OpenJDK website
- Choose OpenJDK 8 (LTS) for the version and HotSpot for the JVM
- Click on the generated download link to get the installation package.
- Run the installer. You can keep all default settings on the installer.
- Once the installation finishes, you can run javac -version again to confirm the installation.
Install Anaconda
- Visit Anaconda’s download page and download the corresponding Anaconda installer.
- You should select the 64-Bit variant of the installer.
- Run the Anaconda installer.
- Remember the installation directory for Anaconda. For Windows, this is typically C:\Users\YOUR_WINDOWS_USERNAME\anaconda3 or C:\ProgramData\anaconda3.
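As a quick sanity check before the PATH is configured later in this guide, you can open the Anaconda Prompt created by the installer and confirm that conda responds (your version number will likely differ from mine):
> conda --version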
Download Spark
- Visit Spark’s download page
- Select the download options as shown in the figure below.
- Click on the generated link to download Spark.

- Untar and store the final directory somewhere that is easily accessible.
- You might need to download and install 7-Zip to decompress the .tgz file on Windows.
- When decompressing, you might have to do it twice, because the first decompression returns a .tar file, and a second decompression is needed to completely retrieve the Spark directory.
- Move the resulting directory under the C: drive.
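If you prefer doing the two-step decompression from the terminal instead of the 7-Zip GUI, a sketch along these lines should work, assuming 7-Zip is installed in its default location and the archive was saved to your Downloads folder; adjust the file names to match the Spark version you downloaded:
> cd $env:USERPROFILE\Downloads
> & 'C:\Program Files\7-Zip\7z.exe' x spark-3.0.1-bin-hadoop3.2.tgz
> & 'C:\Program Files\7-Zip\7z.exe' x spark-3.0.1-bin-hadoop3.2.tar
> Move-Item spark-3.0.1-bin-hadoop3.2 C:\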
Install libraries to support Hadoop functionalities
Open Windows Terminal, and create a hadoop directory with a bin subdirectory under the C: drive.
cd c:\
mkdir hadoop\bin

- Visit the link to winutils.exe, right-click on Download, and choose Save Link As.
- Save the file to C:\hadoop\bin.
- Visit the link to hadoop.dll, right-click on Download, and choose Save Link As.
- Save the file to C:\hadoop\bin.
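If you would rather script these two downloads, something like the following should work in PowerShell; the two URL variables are placeholders for the links mentioned above, not actual addresses:
> $winutilsUrl = 'URL_OF_winutils.exe'
> $hadoopDllUrl = 'URL_OF_hadoop.dll'
> Invoke-WebRequest -Uri $winutilsUrl -OutFile C:\hadoop\bin\winutils.exe
> Invoke-WebRequest -Uri $hadoopDllUrl -OutFile C:\hadoop\bin\hadoop.dll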

Setup environment variables
Click on the Windows icon and start typing environment variables in the search box, then click on Edit the system environment variables.

Click on Environment Variables. Under User variables for ..., click New and enter the following variable name and value for each of the items below. Click OK when done.
- Java
Variable name: JAVA_HOME
Variable value: Typically C:\Program Files\AdoptOpenJDK\jdk-8.0.222.10-hotspot
- Spark
Variable name: SPARK_HOME
Variable value: C:\spark-3.0.1-bin-hadoop3.2
- Hadoop
Variable name: HADOOP_HOME
Variable value: C:\hadoop
- Anaconda3
Variable name: ANACONDA_HOME
Variable value: C:\Users\YOUR_WINDOWS_USERNAME\anaconda3
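If you prefer to skip the GUI, the same user-level variables can be created from Windows Terminal with setx; this is only a sketch, so substitute the values that match your actual installation paths:
> setx JAVA_HOME "C:\Program Files\AdoptOpenJDK\jdk-8.0.222.10-hotspot"
> setx SPARK_HOME "C:\spark-3.0.1-bin-hadoop3.2"
> setx HADOOP_HOME "C:\hadoop"
> setx ANACONDA_HOME "C:\Users\YOUR_WINDOWS_USERNAME\anaconda3"
Note that setx only affects newly opened terminals, so relaunch Windows Terminal before continuing.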

In User variables for ..., select Path and click Edit. Next, press New and add each of the items below as a new entry in the list. Click OK when done.
- Java: %JAVA_HOME%\bin
- Spark: %SPARK_HOME%\bin
- Hadoop: %HADOOP_HOME%\bin
- Anaconda3: %ANACONDA_HOME%\Scripts

Close your terminal and relaunch it. Test that all paths are set up correctly by running the following:
> where.exe javac
> where.exe spark-shell
> where.exe winutils

Setup Jupyter and pyspark
Open a terminal and run the following:
> conda create -y -n pyspark python=3.6
> conda init powershell
> conda activate pyspark
> conda install -y -c conda-forge findspark
> conda install -y ipykernel
> python -m ipykernel install --user --name=pyspark
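You can optionally confirm that the new kernel was registered with Jupyter; the output should include an entry named pyspark:
> jupyter kernelspec list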
Test Jupyter and pyspark
- Download the following Shakespeare collection.
- Open a terminal.
- Launch Jupyter Notebook using the jupyter notebook command.
- A web browser will pop up with the Jupyter Notebook Server.
- Open a new notebook using the pyspark kernel.
- Enter the following Python code into a cell of the new notebook.
- Replace PATH_TO_DOWNLOADED_SHAKESPEARE_TEXT_FILE with the actual path (including the file name) to where you downloaded the file earlier.
import os
import sys

spark_path = os.environ['SPARK_HOME']
sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
sys.path.append(spark_path + "/python/lib/py4j-0.10.9-src.zip")

import findspark
findspark.init()

import pyspark

number_cores = 8
memory_gb = 16
conf = (pyspark.SparkConf()
        .setMaster('local[{}]'.format(number_cores))
        .set('spark.driver.memory', '{}g'.format(memory_gb)))
sc = pyspark.SparkContext(conf=conf)

textFile = sc.textFile("PATH_TO_DOWNLOADED_SHAKESPEARE_TEXT_FILE")
wordcount = (textFile.flatMap(lambda line: line.split(" "))
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b))
wordcount.saveAsTextFile("output-wordcount-01")
- You can adjust number_cores and memory_gb to be more suitable for your computer. Mine runs on an Intel i7 with 32GB of memory, so I picked 8 and 16.
- Run the cell. Once/if the run completes successfully, you can revisit your Jupyter Server and observe that an output-wordcount-01 directory has been created. _SUCCESS is an empty file that serves the purpose of signaling a successful execution. part-00000 and part-00001 contain the resulting outputs (see the snippet below for a quick way to inspect them).
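If you want to peek at the word counts without leaving the terminal, you can print the first few lines of one of the part files; the relative path below assumes your terminal is in the directory from which you launched Jupyter:
> Get-Content .\output-wordcount-01\part-00000 | Select-Object -First 10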


- You can also visit 127.0.0.1:4040/jobs to observe the running Spark cluster spawned by the Jupyter notebook.

Change the Spark page’s tab to Executors to observe the configuration of the cluster:
- The cluster has 8 cores.
- The amount of available memory is only 8.4GB out of 16GB; this is due to Spark’s memory storage reservation/protection.
