Hadoop File Formats and Their Types

Data flows into Hadoop from a wide variety of sources, and it can be stored in several different file formats. This tutorial discusses each of these formats in detail.

In the following section, we’ll take a look at the different types of Hadoop file formats.

Types of Hadoop File Formats

Hive and Impala tables in HDFS can be created using one of four different Hadoop file formats:

Text files
Sequence files
Avro data files
Parquet files

Take a closer look at each of the Hadoop file formats.

Text Files

Text files are the most fundamental and easily understandable of the file formats. Fields are usually delimited by a comma or a tab, which allows the files to be read and written in any programming language.

However, a numeric value must be stored as a string in a text file, which consumes additional disk space. Binary data, such as an image, is also difficult to represent in this format.

Sequence File

If you want to store binary data such as an image, you can use the SequenceFile format. It stores key-value pairs in a binary container format, which is more efficient than a text-based format. The trade-off is that sequence files are not human-readable.
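
As a minimal sketch of this idea, assuming the standard Hadoop Java API and placeholder file paths, an image could be written to a SequenceFile as a single key-value record like this:

import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImageToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative paths: a local image written as one record of a SequenceFile.
        byte[] imageBytes = Files.readAllBytes(Paths.get("photo.jpg"));
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/tmp/images.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            // Each record is a key-value pair stored in the binary container format.
            writer.append(new Text("photo.jpg"), new BytesWritable(imageBytes));
        }
    }
}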

Avro Data Files

The Avro file format provides efficient storage thanks to its optimized binary encoding. It is well supported both inside and outside the Hadoop ecosystem.

The Avro format is a good choice for storing important data over long periods of time. It is supported by a variety of programming languages, including Java and Scala. Schema metadata can be embedded in the file to ensure that it will always be readable, and schema evolution allows changes to the schema to be accommodated over time. Hadoop users generally agree that the Avro file format is the best option for general-purpose storage.

Parquet File Format

Parquet is a columnar format developed by Cloudera and Twitter, together with others. It is supported by many frameworks, including Spark, MapReduce, Hive, Pig, Impala, and Crunch. Like Avro, Parquet embeds schema metadata in the file.

The Parquet file format makes use of advanced optimizations described in Google’s Dremel paper. These optimizations reduce the amount of storage space required while increasing performance. Parquet is most efficient when many records are inserted at once, because some of its optimizations depend on identifying recurring patterns in the data. In the following section, we’ll go over what data serialization is and how it works.

What is Data Serialization?

In computer storage, data serialization refers to the representation of data in the form of a sequence of bytes. It enables you to save data to disk or send it over a network connection.

For example, consider how to serialize the number 123456789: stored as a Java int, it takes 4 bytes, but stored as a Java String it takes at least 9 bytes, one per character.
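
A minimal Java sketch illustrating this size difference, assuming UTF-8 encoding for the string form:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SerializationSize {
    public static void main(String[] args) {
        int value = 123456789;
        // Binary form: a Java int always serializes to 4 bytes.
        byte[] asInt = ByteBuffer.allocate(4).putInt(value).array();
        // Text form: one byte per decimal digit in UTF-8, so 9 bytes here.
        byte[] asString = Integer.toString(value).getBytes(StandardCharsets.UTF_8);
        System.out.println("As int: " + asInt.length + " bytes");       // 4 bytes
        System.out.println("As string: " + asString.length + " bytes"); // 9 bytes
    }
}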

Avro is a high-performance data serialization framework that is widely used throughout the Hadoop ecosystem and beyond. It also supports Remote Procedure Calls (RPC) and provides cross-language compatibility without compromising performance. Consider the data types supported by Apache Avro and Avro schemas, which are listed below.

Supported Data Types in Avro

Avro supports the following primitive data types:

Null: the absence of a value.
Boolean: a binary (true/false) value.
Int: a 32-bit signed integer.
Long: a 64-bit signed integer.
Float: a single-precision floating-point value.
Double: a double-precision floating-point value.
Bytes: a sequence of 8-bit unsigned bytes.
String: a sequence of Unicode characters.

Avro schemas also support the following complex data types; a short example schema follows the list.

Record: a user-defined type consisting of one or more named fields.
Enum: a specified set of named values.
Array: zero or more values, all of the same type.
Map: a set of key-value pairs, where the key is a string and the value is of a specified type.
Union: exactly one value that matches one of a specified set of types.
Fixed: a fixed number of 8-bit unsigned bytes.
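
Here is a small, purely illustrative Avro schema (the record name and fields are hypothetical) that combines several of these complex types:

{"namespace": "com.example",
 "type": "record",
 "name": "profile",
 "fields": [
   {"name": "status", "type": {"type": "enum", "name": "Status", "symbols": ["ACTIVE", "INACTIVE"]}},
   {"name": "tags", "type": {"type": "array", "items": "string"}},
   {"name": "attributes", "type": {"type": "map", "values": "string"}},
   {"name": "nickname", "type": ["null", "string"]},
   {"name": "checksum", "type": {"type": "fixed", "name": "MD5", "size": 16}}
 ]
}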

Next, consider how an Avro schema corresponds to a Hive table.

Hive Table and Avro Schema

Listed below is an example that shows a Hive table and its equivalent Avro schema. The Hive table is defined as:

CREATE TABLE orders (id INT, name STRING, title STRING);

The equivalent Avro schema is shown below:

{"namespace": "com.simplilearn",
 "type": "record",
 "name": "orders",
 "fields": [
   {"name": "id", "type": "int"},
   {"name": "name", "type": "string"},
   {"name": "title", "type": "string"}
 ]
}

Operational Activities of Avro

Avro also allows you to set a default value for a field in a schema, as shown in the example below.

{"namespace": "com.opennel",
 "type": "record",
 "name": "orders",
 "fields": [
   {"name": "id", "type": "int"},
   {"name": "name", "type": "string", "default": "opennel"},
   {"name": "title", "type": "string", "default": "bigdata"}
 ]
}

The following sections describe how the Avro and Parquet file formats are used with Sqoop.

Avro With Sqoop

Sqoop is a tool designed to transfer data between Hadoop and a relational database management system (RDBMS), and it can work with both the Avro and Parquet file formats.

Sqoop allows you to import data into HDFS in the Avro format and export data in the Avro format to a relational database management system. You must include the --as-avrodatafile option in the Sqoop command to carry out this operation.
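
For example, a typical import command might look like the following sketch, where the connection string, credentials, table name, and target directory are all placeholders:

sqoop import \
  --connect jdbc:mysql://dbserver/salesdb \
  --username dbuser \
  --password-file /user/hadoop/.db-password \
  --table orders \
  --target-dir /user/hadoop/orders_avro \
  --as-avrodatafile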

Hive supports all Avro types. Impala, on the other hand, does not support complex or nested Avro types such as enum, array, fixed, map, union, and nested record types.

In the following section, we’ll go over how to import and export data in the Parquet file format using Sqoop.

Parquet With Sqoop

Sqoop allows you to import data into HDFS in the Parquet file format and export data in the Parquet file format to a relational database management system (RDBMS). You must specify the --as-parquetfile option in the Sqoop command line to carry out this operation. Importing MySQL data into HDFS in the Parquet file format is covered in the following section.
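
The command is the same sketch as the Avro example, with only the format option changed (connection string, table name, and target directory remain placeholders):

sqoop import \
  --connect jdbc:mysql://dbserver/salesdb \
  --username dbuser \
  --password-file /user/hadoop/.db-password \
  --table orders \
  --target-dir /user/hadoop/orders_parquet \
  --as-parquetfile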

Importing MySQL to HDFS in Parquet File Format

MySQL data imported into HDFS in the Parquet file format can then be worked with in two ways: by creating a new table in Hive or Impala, or by reading the Parquet files directly. Let’s take a closer look at both.

Creating a New Table in Hive With Parquet File Format

Impala and Hive both allow you to create new tables stored in the Parquet file format. In Impala, you can use the LIKE PARQUET clause to derive column definitions from the metadata of an existing Parquet data file, as shown in the example below. Because Parquet files are stored in a binary format, however, they are difficult to read directly.
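
A minimal sketch of such an Impala statement, assuming a hypothetical existing Parquet data file path:

CREATE TABLE new_orders
  LIKE PARQUET '/user/hadoop/orders_parquet/part-m-00000.parquet'
  STORED AS PARQUET;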

Reading Parquet Files

The parquet-tools package can be used to read Parquet files, as shown below.
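
For example, assuming a hypothetical file path, a few common parquet-tools subcommands are:

parquet-tools schema /user/hadoop/orders_parquet/part-m-00000.parquet    # print the Parquet schema
parquet-tools head -n 5 /user/hadoop/orders_parquet/part-m-00000.parquet # show the first five records
parquet-tools meta /user/hadoop/orders_parquet/part-m-00000.parquet      # show file and row-group metadata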

Summary

The key points of this Hadoop file formats tutorial are summarized below.

  • Text files can be used to create Hive and Impala tables in the HDFS storage system.
  • Sequence files, Avro data files, and Parquet file formats are all examples of data storage formats.
  • Data serialization is the representation of data as a sequence of bytes, which allows it to be stored on disk or sent over a network.
  • Avro is a data serialization framework that is widely used throughout the Hadoop ecosystem and beyond.
  • Data in the Avro and Parquet file formats can be imported into HDFS with the help of Sqoop.
  • Sqoop can also export data in the Avro and Parquet file formats to a relational database management system.

Conclusion

Understanding the different Hadoop file formats is unquestionably essential, even though they are just one of the many nuances of Big Data and Hadoop.
