Hello Spark! (Installing Apache Spark on Windows 7)

In this post I will walk through downloading and running Apache Spark on Windows 7 x64 in local mode on a single computer.

Prerequisites

  1. Java Development Kit (JDK 7 or 8). I installed it at ‘C:\Program Files\Java\jdk1.8.0_40\’.
  2. Python 2.7. I installed it at ‘C:\Python27\’.
  3. After installation, we need to set the following environment variables:
    1. JAVA_HOME: the value is the JDK installation path.
      In my case it is ‘C:\Program Files\Java\jdk1.8.0_40\’; for more details click here.
      Then append ‘%JAVA_HOME%\bin’ to the PATH environment variable.
    2. PYTHONPATH: the value is the Python home directory plus the Scripts directory inside it, separated by a semicolon.
      In my case it is ‘C:\Python27\;C:\Python27\Scripts;’.
      Then append ‘%PYTHONPATH%’ to the PATH environment variable.

Downloading and installing Spark

  1. Download Apache Spark 1.3.0 (package type: Pre-built for Hadoop 2.4 and later); visit this site.
  2. Extract the zipped file to C:\Spark.
  3. Spark has two shells; both live in the ‘C:\Spark\bin\’ directory:
    1. Scala shell (C:\Spark\bin\spark-shell.cmd).
    2. Python shell (C:\Spark\bin\pyspark.cmd).
  4. If you run either of them now, you will see the following exception:
    java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. The reason is that Spark expects the HADOOP_HOME environment variable to point to a Hadoop binary distribution; for more details click here.
  5. Let’s download the Hadoop common binaries, extract the zipped file to ‘C:\hadoop’, then set HADOOP_HOME to that folder.
  6. Spark uses Apache log4j for logging. To configure it, go to ‘C:\Spark\conf\’, where you will find a template file called ‘log4j.properties.template’.
    Remove the ‘.template’ extension, open the file in a text editor, and find the property called ‘log4j.rootCategory’; set it to the level you want.
    In my case I changed it from ‘INFO’ to ‘WARN’; here are more details about levels.

And that’s it! Spark is installed and we are ready to write a sample.
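
As a quick sanity check, you can run a small job in the Python shell (pyspark.cmd) using the shell’s built-in SparkContext ‘sc’; counting the even numbers below 1000 should return 500:

>>> nums = sc.parallelize(range(1000))
>>> nums.filter(lambda x: x % 2 == 0).count()
500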

How to use IPython & IPython notebook with Apache Spark:

  • IPython:
    • Spark version <= 1.1.0: set environment variable IPYTHON=1
    • Spark version > 1.1.0: set environment variable PYSPARK_DRIVER_PYTHON=ipython
  • IPython notebook:
    • Spark version <= 1.1.0: set environment variables IPYTHON=1 and IPYTHON_OPTS=notebook
    • Spark version > 1.1.0: set environment variables PYSPARK_DRIVER_PYTHON=ipython and PYSPARK_DRIVER_PYTHON_OPTS=notebook

Resources: the Apache Spark programming documentation and the pyspark script.

Text search sample

I have a text file of the War and Peace novel at ‘C:\war_and_peace.txt’ (I uploaded it to Google Drive; download). Let’s use Spark to count the lines containing the words ‘war’ and ‘peace’:

In Scala:

scala> val file = sc.textFile("C:\\war_and_peace.txt")
scala> val warsCount = file.filter(line => line.contains("war"))
scala> val peaceCount = file.filter(line => line.contains("peace"))
scala> warsCount.count()
res0: Long = 1218
scala> peaceCount.count()
res1: Long = 128

In Python:

>>> file = sc.textFile("C:\\war_and_peace.txt")
>>> warsCount = file.filter(lambda line: "war" in line)
>>> peaceCount = file.filter(lambda line: "peace" in line)
>>> warsCount.count()
1218
>>> peaceCount.count()
128
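
Note that these counts are the numbers of lines containing each substring, not occurrences of the exact words: the match is case-sensitive, and ‘war’ also matches inside words such as ‘warm’. A rough word-count variant, splitting lines on whitespace, could look like this:

>>> words = file.flatMap(lambda line: line.split())
>>> words.filter(lambda w: w == "war").count()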

Hope that helps! I’d be glad to hear from you.

13 thoughts on “Hello Spark! (Installing Apache Spark on Windows 7)”

  1. Thanks … I have a question.
    I installed Anaconda (IPython Notebook), and it works well under Windows 7.
    I’d like to access the IPython notebook from Spark, but it does not work for me.
    Do you have any idea how to do that under Windows (the correct setting of the system environment variables)?

  2. I got lost when you jumped to the scripts…

    Can you please elaborate on the steps you did before jumping to the Python script…

    file = sc.textFile("C:\war_and_peace.txt")

    I get the following error when I open the Python notebook…

    NameError Traceback (most recent call last)
    in ()
    ----> 1 file = sc.textFile("C:\war_and_peace.txt")

    NameError: name 'sc' is not defined

    1. Well, for my sample you have to work with Spark inside the Python Spark shell. Inside the Spark directory there is a ‘bin’ folder, where you will find a file called ‘pyspark.cmd’. Open a command line in Windows and run this file, and you will enter the Spark shell.

      Alternatively, you can initialize a new SparkContext yourself so you can connect to Spark from outside the shells; I will try to write a new post about it, but see the sketch below.
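
      Here is a minimal sketch, assuming the pyspark package shipped in ‘C:\Spark\python\’ (and its bundled py4j dependency) is on PYTHONPATH; the app name is just illustrative:

      from pyspark import SparkConf, SparkContext

      # Run Spark in local mode, using all available cores.
      conf = SparkConf().setMaster("local[*]").setAppName("StandaloneSample")
      sc = SparkContext(conf=conf)

      print(sc.parallelize(range(10)).sum())  # quick check: prints 45
      sc.stop()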

      For more details, read the scripts on GitHub:
      https://github.com/apache/spark/blob/master/bin/pyspark.cmd
      https://github.com/apache/spark/blob/master/bin/pyspark
      https://github.com/apache/spark/blob/master/python/pyspark/shell.py

  3. Thank you a lot. I had been looking for a guide like this for a long time; I tried different approaches (even much harder ones than you suggest) and always had problems… Now everything works like a charm!
    Thank you so much 🙂
