What To Expect In This Data Era
How do you think companies like Google and Facebook manage their data?
“Data is the new oil”
is perhaps one of the most popular catchphrases that can describe the fuel that makes our increasingly interconnected world go round. Although such a metaphor is rather inaccurate, it captures the very essence of our collective online footprint with respect to the global economy and our digital lifestyle.
The Internet is full of data, available online in both structured and unstructured formats. Around 2.5 quintillion bytes of data are generated every day. This massive amount of data is often referred to as Big Data. It is estimated that by 2021, almost 1.7 megabytes of data will be generated every second for every person on earth.
Data Generated By Some Companies :-
- Facebook generates 4 new petabytes of data per day.
- Google currently processes over 20 petabytes of data per day.
The problem is not only storing the data: companies also want to minimize the time taken to store and retrieve it. They therefore need a tool that can overcome these problems, and Hadoop is just such a tool.
Hadoop As A Tool/Software :-
Hadoop is an open-source project created for distributed storage clusters, where multiple nodes are connected to each other and share their storage, RAM, and CPU (their resources) with a particular node called the Master Node, known in Hadoop terms as the NameNode (NN). The nodes that contribute their resources are called Slave Nodes or DataNodes (DN).
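As a rough illustration, the NameNode can be thought of as a lookup table that records which DataNodes hold which blocks, while the DataNodes hold the actual bytes. The Python sketch below is not the real Hadoop API; the class and method names are invented for illustration.

```python
# Illustrative sketch only: Hadoop's real NameNode is far more complex.
class NameNode:
    """Master node: holds metadata (which DataNode stores which block),
    never the file contents themselves."""
    def __init__(self):
        self.block_map = {}  # block_id -> list of DataNode names

    def register_block(self, block_id, datanodes):
        self.block_map[block_id] = list(datanodes)

    def locate(self, block_id):
        return self.block_map.get(block_id, [])

nn = NameNode()
nn.register_block("file1_block0", ["dn1", "dn2", "dn3"])
print(nn.locate("file1_block0"))  # ['dn1', 'dn2', 'dn3']
```

A client asks the NameNode where a block lives, then talks to those DataNodes directly; this keeps the master lightweight, since it only ever moves metadata.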
Why is Hadoop important?
- Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that's a key consideration.
- Computing power. Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
- Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
- Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
- Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
- Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
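The fault-tolerance point above can be sketched in a few lines of Python. This is a toy model, not Hadoop code: it places each block on three nodes (three is the default HDFS replication factor; the node names are made up) and checks that losing one node still leaves surviving replicas of every block.

```python
# A toy model of HDFS-style fault tolerance (illustrative names, not Hadoop code).
datanodes = ["dn1", "dn2", "dn3", "dn4"]
replication = 3  # HDFS default replication factor

# Place each block on `replication` distinct nodes (round-robin for the sketch).
def place(block_index):
    return [datanodes[(block_index + i) % len(datanodes)] for i in range(replication)]

placement = {f"block{i}": place(i) for i in range(4)}

# Simulate losing dn2: every block still has replicas on other nodes.
alive = {b: [n for n in nodes if n != "dn2"] for b, nodes in placement.items()}
assert all(len(nodes) >= 2 for nodes in alive.values())
print("all blocks survive the loss of dn2")
```

In a real cluster the NameNode would also notice the dead node via missed heartbeats and re-replicate its blocks to restore the replication factor.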
The 3 Important V’s of Big Data :-
(i) Volume — The name Big Data itself relates to an enormous size. The size of data plays a crucial role in determining its value, and whether a particular dataset can actually be considered Big Data depends on its volume. Hence, ‘Volume’ is one characteristic that needs to be considered when dealing with Big Data.
(ii) Variety — The next aspect of Big Data is its variety.
Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only data sources considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.
(iii) Velocity — The term ‘velocity’ refers to the speed at which data is generated. How fast data is generated and processed to meet demand determines the real potential of the data.
Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
Hadoop is used in one domain or another in most companies. It allows you to create a network through which resources such as RAM and disk space can be shared. The Master virtually owns all the resources connected to it through its slaves (DataNodes, usually commodity hardware to cut costs, which share their resources with the Master/NameNode). Hadoop offers the flexibility to expand storage space as the need arises, simply by increasing the number of nodes connected to the Master (horizontal scaling).
If the data is bigger than the physical capacity of the Master, Hadoop stripes the data into n parts and sends each block of data (striped data) over the network to the required number of slaves for storage. Since the blocks are stored in parallel, the time to store and retrieve the data is reduced drastically.
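The striping step above can be sketched as follows. This is a toy illustration, not HDFS internals: real HDFS blocks default to 128 MB, while here a 4-byte block size makes the split visible.

```python
# Toy striping sketch (assumption: real HDFS uses 128 MB blocks; 4 bytes
# here just makes the split visible).
def stripe(data: bytes, block_size: int):
    """Cut `data` into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = stripe(b"hello distributed world", 4)
# Each block would be sent over the network to a different DataNode;
# here we only show the split itself.
print(blocks[0])     # b'hell'
print(len(blocks))   # 6
```

Writing the blocks to different nodes at the same time is what makes the parallel speedup in the next example possible.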
For Example :-
- 100 GB of data stored on a single computer takes 10 minutes.
- The same 100 GB, divided into 10 parts of 10 GB each and stored on 10 different computers simultaneously, can take only 1 minute.
This type of system is known as Distributed Storage.
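The arithmetic behind the example is simple division: writing the parts in parallel divides the wall-clock time by the number of nodes. This is a best case that ignores network overhead, which a real cluster would add. A quick Python check:

```python
# Back-of-the-envelope check of the 100 GB example (best case, ignoring
# network transfer time).
total_gb = 100
single_node_minutes = 10
nodes = 10

per_node_gb = total_gb / nodes                   # 10 GB per node
parallel_minutes = single_node_minutes / nodes   # all nodes write at once
print(parallel_minutes)  # 1.0
```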
In the same way, data-oriented companies have millions of slaves (servers) of their own. With the help of those servers, data can be stored and processed quickly.