Spark Installation in WSL2

Step 1. Install required files

  • Install Java and download the Spark archive. (Skip if already installed.)

    $ sudo apt-get install openjdk-8-jdk
    $ sudo wget https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
    $ sudo tar -xvzf spark-3.2.0-bin-hadoop3.2.tgz
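
  • Optionally, check the Java version to verify the JDK installed correctly:

    $ java -version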

Step 2. Set environment variables

  • Open the .bashrc file and add the lines below.

    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    export SPARK_HOME=/mnt/c/hadoop/spark-3.2.0-bin-hadoop3.2
    export PATH=$JAVA_HOME/bin:$PATH
    export PATH=$SPARK_HOME/bin:$PATH
    export PYSPARK_PYTHON=/usr/bin/python3
  • Reload the file and make sure the changes are actually reflected.

    source ~/.bashrc
    echo $SPARK_HOME

    /mnt/c/hadoop/spark-3.2.0-bin-hadoop3.2
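
  • Since $SPARK_HOME/bin is now on PATH, you can also confirm that the Spark binaries are found:

    spark-submit --version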

Step 3. Run PySpark

  • Run pyspark in the Spark installation path, as shown below.
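
    Since $SPARK_HOME/bin was added to PATH in Step 2, the pyspark shell can be launched from the installation directory (the README.md used in the next step sits there):

    $ cd $SPARK_HOME
    $ pyspark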

  • Run the code below in the PySpark shell and check the printed result.

    >>> rd = sc.textFile("README.md")
    >>> rd.count()

    109
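
  • When you are done, exit the PySpark shell before moving on to the next step:

    >>> exit()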

Step 4. Run an application and check the web UI in a browser.

  • Create a new directory temp and a virtual environment.

    $ mkdir temp && cd temp
    $ virtualenv venv
  • Activate the virtual environment and install pyspark.

    $ source venv/bin/activate
    $ pip install pyspark
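
  • Optionally, confirm that pyspark can be imported inside the virtual environment:

    $ python -c "import pyspark; print(pyspark.__version__)"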
  • Create a new data directory and a README.md file containing the text below.

    $ mkdir data && cd data
    $ vi README.md

    This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file. Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed. As with the Scala and Java examples, we use a SparkSession to create Datasets. For applications that use custom classes or third-party libraries, we can also add code dependencies to spark-submit through its --py-files argument by packaging them into a .zip file (see spark-submit --help for details). SimpleApp is simple enough that we do not need to specify any code dependencies.

    We can run this application using the bin/spark-submit script:

  • Go back to temp and create SimpleApp.py.

    $ cd ..
    $ vi SimpleApp.py
    # SimpleApp.py
    from pyspark.sql import SparkSession

    logFile = "data/README.md" # Should be some file on your system
    spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
    logData = spark.read.text(logFile).cache()

    numAs = logData.filter(logData.value.contains('a')).count()
    numBs = logData.filter(logData.value.contains('b')).count()

    print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

    input("Press Enter to exit...")  # keep the application alive so the web UI stays available

    spark.stop()
  • Run SimpleApp.py with spark-submit.

    $SPARK_HOME/bin/spark-submit --master local[4] SimpleApp.py
  • Check the Spark web UI address printed in the spark-submit output and copy it.

  • Enter that address in the web browser and check the web UI. By default it is served at http://localhost:4040 while the application is running; the input() call in SimpleApp.py keeps the application alive so the page stays reachable.
