This tutorial aims to provide a basic understanding of Hadoop and HDFS (Hadoop Distributed File System). We will learn to set up a Hadoop environment, understand the basic operations of HDFS, and write simple programs to interact with HDFS.
By the end of this tutorial, you will be able to:
- Set up a basic Hadoop environment on a single machine
- Perform basic file operations on HDFS using shell commands
- Write simple Java programs that interact with HDFS
Basic knowledge of Linux commands and Java programming is recommended, but not mandatory.
Hadoop is an open-source software framework for storing and processing big data in a distributed manner on large clusters of commodity hardware. Essentially, it accomplishes two tasks: distributed storage of massive datasets and parallel processing of that data.
HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop applications. It is a distributed file system that allows you to store data across multiple machines.
You can download Hadoop from the official Apache website. After downloading, extract the tarball and set the environment variables for Hadoop in your bash profile.
# Extract the downloaded archive
tar xzf hadoop-3.3.0.tar.gz
# Set the environment variables
export HADOOP_HOME=/path/to/hadoop-3.3.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
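Hadoop also needs to know where Java is installed, so make sure JAVA_HOME is set (in your bash profile or in etc/hadoop/hadoop-env.sh). You can then verify the installation:
# Print the Hadoop version to confirm the installation works
hadoop version
To run HDFS on a single machine in pseudo-distributed mode, point fs.defaultFS at a local NameNode in etc/hadoop/core-site.xml. A minimal sketch (hdfs://localhost:9000 matches the address used by the Java example later in this tutorial):
<!-- etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
After formatting the NameNode with hdfs namenode -format and starting HDFS with start-dfs.sh, the commands in the next section will run against your local cluster.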
Hadoop provides shell-like commands, invoked as hadoop fs (or equivalently hdfs dfs), for interacting with HDFS directly.
# List files in the root directory
hadoop fs -ls /
# Expected output: A list of files/directories in the root of HDFS
# Create a directory named 'test'
hadoop fs -mkdir /test
# Expected output: No output if the command is successful
# Upload a local file to HDFS
hadoop fs -put localfile.txt /test
# Expected output: No output if the command is successful
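A few other commonly used commands follow the same pattern (the paths below assume the /test directory and localfile.txt from the examples above):
# Print the contents of a file in HDFS
hadoop fs -cat /test/localfile.txt
# Copy a file from HDFS back to the local filesystem
hadoop fs -get /test/localfile.txt localcopy.txt
# Delete a file from HDFS
hadoop fs -rm /test/localfile.txt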
Hadoop also provides a Java API to interact with HDFS programmatically. The following program (saved as HdfsRead.java; the class name is arbitrary) opens a file in HDFS and prints it line by line:
// Import necessary classes
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        // Initialize the configuration and point it at the HDFS server
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        // Create a filesystem object
        FileSystem fs = FileSystem.get(conf);

        // Specify the file path
        Path path = new Path("/test/localfile.txt");

        // Open the file for reading; FSDataInputStream.readLine() is deprecated,
        // so wrap the stream in a BufferedReader to read it line by line
        try (FSDataInputStream inStream = fs.open(path);
             BufferedReader reader = new BufferedReader(new InputStreamReader(inStream))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }

        // Close the filesystem
        fs.close();
    }
}
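The same FileSystem API covers most of what the shell commands do. As one more example, here is a minimal sketch of listing a directory's contents, reusing the conf and fs objects from the program above (FileStatus comes from the same org.apache.hadoop.fs package):
// List the files and directories under /test
for (FileStatus status : fs.listStatus(new Path("/test"))) {
    System.out.println(status.getPath() + (status.isDirectory() ? " (dir)" : ""));
}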
We have covered the basics of Hadoop and HDFS, including how to install Hadoop and how to perform basic operations on HDFS using both shell commands and the Java API.
Exercise 1: Use HDFS commands to create a new directory, upload a file to it, and list the files in the directory.
Exercise 2: Write a Java program to write data to a file in HDFS.
Exercise 3: Write a Java program that counts the number of lines in a file in HDFS.
Remember, practice is the key to mastering any skill, so be sure to practice these exercises and explore the additional resources to deepen your understanding of Hadoop and HDFS.