Setting up a Hadoop cluster on Windows using Docker and WSL2

2 minute read

I wanted to set up a Hadoop cluster as a playground on my Windows 10 laptop. I thought that using Docker with the new WSL2 (Windows Subsystem for Linux version 2) included in Windows 10 version 2004 could be a solution: Docker can use WSL2 to run Linux natively on Windows. I basically followed the tutorial How to set up a Hadoop cluster in Docker, which is normally designed for a Linux host machine running Docker (and not Windows).

1. Install Docker on Windows

I’m currently using Docker Desktop version 2.3.0.3 from the stable channel, but any version that supports WSL2 should work. The corresponding engine version is 19.03.8 and the docker-compose version is 1.25.5:

Docker version
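The same information is available from a terminal if you prefer the command line:

docker version
docker-compose version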

You can confirm that Docker is running properly by launching a web server:

docker run -d -p 80:80 --name myserver nginx
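If the container started correctly, nginx should now answer on port 80. You can verify this and then remove the test container:

curl http://localhost
docker rm -f myserver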

2. Setting up the Hadoop cluster using Docker

Use git to download the Hadoop Docker files from the Big Data Europe repository:

git clone git@github.com:big-data-europe/docker-hadoop.git
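If you have no SSH key registered with GitHub, cloning over HTTPS works just as well:

git clone https://github.com/big-data-europe/docker-hadoop.git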

Move into the cloned repository and deploy the cluster using the commands:

cd docker-hadoop
docker-compose up -d

You can check that the containers are running using:

docker ps
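To keep the listing readable, you can restrict the output to names and statuses; every container of the cluster should report an Up status:

docker ps --format "table {{.Names}}\t{{.Status}}"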

You can also double-check with the Docker dashboard:

Docker Dashboard

The current status can also be checked on the namenode web interface at http://localhost:9870:

Hadoop Overview
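The same check works from a terminal too. A plain HTTP request to port 9870 should get an answer, and, assuming the stock Hadoop configuration of the image, the built-in JMX servlet returns the namenode status as JSON:

curl -s http://localhost:9870
curl -s "http://localhost:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"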

3. Testing the Hadoop cluster

We will test the Hadoop cluster by running the Word Count example (a scripted version of all the steps is given after the list).

  • Open a terminal session on the namenode
    docker exec -it namenode bash
    

    This will open a session on the namenode for the root user.

  • Create some simple text files to be used by the wordcount program
    cd /tmp
    mkdir input
    echo "Hello World" >input/f1.txt
    echo "Hello Docker" >input/f2.txt
    
  • Create an HDFS directory named input
    hadoop fs -mkdir -p input
    
  • Put the input files on HDFS, where they will be replicated across the datanodes
    hdfs dfs -put ./input/* input
    
  • On the host PC, download the word count program from this link (e.g. into the directory just above the docker-hadoop directory)

  • Run the command below in a terminal on the Windows host to identify the namenode container id:
    docker container ls
    

    namenode id

  • Use the command below on the Windows host to copy the word count program into the namenode container:
    docker cp ../hadoop-mapreduce-examples-2.7.1-sources.jar afb235f8629c:/tmp
    
  • Run the word count program in the namenode:
    hadoop jar hadoop-mapreduce-examples-2.7.1-sources.jar org.apache.hadoop.examples.WordCount input output
    

    The program should display something like:

Hadoop Job

  • Print the output of the word count program
    hdfs dfs -cat output/part-r-00000
    

    Hadoop Output

  • Shut down the Hadoop cluster by running the following on the Windows host:
    docker-compose down
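
All the steps above can also be replayed from the host as a single script. This is a minimal sketch, not the original tutorial's method: it assumes a WSL2 (or Git Bash) shell on the host, a running cluster, a container reachable by the name namenode (docker cp accepts names as well as ids), and the example jar sitting one directory above docker-hadoop. The HDFS output directory is removed first so the job can be re-run:

#!/usr/bin/env bash
# Minimal sketch: run the word count example end to end from the host.
set -e

JAR=../hadoop-mapreduce-examples-2.7.1-sources.jar

# Create the input files and load them into HDFS inside the namenode
docker exec namenode bash -c '
  mkdir -p /tmp/input &&
  echo "Hello World"  > /tmp/input/f1.txt &&
  echo "Hello Docker" > /tmp/input/f2.txt &&
  hadoop fs -mkdir -p input &&
  hdfs dfs -put -f /tmp/input/* input'

# Copy the job jar into the container, addressing it by name
docker cp "$JAR" namenode:/tmp

# Remove any previous output, then launch the job
docker exec namenode bash -c '
  hdfs dfs -rm -r -f output;
  cd /tmp &&
  hadoop jar hadoop-mapreduce-examples-2.7.1-sources.jar \
    org.apache.hadoop.examples.WordCount input output'

# Print the result
docker exec namenode hdfs dfs -cat output/part-r-00000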
    

That’s all!