Introduction to HDFS-1 | Hadoop Distributed File System-1 Tutorial | Hadoop HDFS

so guys, the next topic is HDFS, the Hadoop Distributed File System, the storage layer of Hadoop. Let's quickly see the agenda for this section: first the HDFS introduction, then HDFS nodes (we know there are two types of nodes and the daemons responsible for HDFS), then the data storage mechanism of HDFS and its basic architecture, architecture one. Then we will discuss its features in detail and a few important flows, that is, the file read operation and the file write operation. After that we will see rack awareness and then the advanced architecture, architecture two.

Let's quickly start the introduction of HDFS. So, what is HDFS? HDFS is a file system designed for storing very large files, running on a cluster of commodity hardware — commodity hardware meaning non-expensive, low-end hardware. HDFS is designed for the storage of large files, files of terabytes to petabytes in size, and it is designed on the principle of storing a small number of large files, not a large number of small files. Why so? Both in terms of processing and in terms of storage, HDFS is not developed for small files.

It provides a fault-tolerant storage layer for Hadoop and its other components. If you recall the ecosystem architecture, I showed you other components like Hive, Pig and Flume; all of those components ultimately store their data inside HDFS only. For the whole Hadoop ecosystem the storage layer is common, and that is HDFS; only for the processing do we have other mechanisms. It stores data reliably even in case of hardware failure: suppose your hard disk crashes, the network goes down, a machine burns out — in such extreme scenarios also your system will be highly available and your data will be reliably stored. It is a distributed file system that provides high-throughput access to application data. As we already discussed, instead of low latency it focuses on high throughput, and this comes from the distributed storage.

Now let's understand HDFS in terms of nodes. So guys, there are two types of HDFS nodes, and they work in a master-slave fashion. Please note this down: the NameNode is a daemon that runs on the master. The master machine is also commonly called the name node, but strictly the NameNode is just the daemon running on the master. Likewise, the DataNode is a daemon that runs on all the slaves. So we have one master node and n data nodes.

Now guys, the master node is a single point of failure. Up to Hadoop 1.x we have just one master — please note this down — and being a single master, it is a single point of failure. What do I mean by single point of failure? If the name node goes down, the cluster becomes inaccessible. I put the same question to all of you: why will the cluster become inaccessible if the name node goes down?
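While you think about that, here is a rough idea of what talking to this storage layer looks like from client code. This is only a minimal sketch using the standard Hadoop FileSystem Java API, assuming a Hadoop 2-style fs.defaultFS setting; the name node address and the file path are hypothetical placeholders, not anything from this session.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; "hdfs://namenode:8020" is a made-up address.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        // FileSystem is the client-side entry point: the name node handles the
        // metadata side of this call, the data nodes store the actual bytes.
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");   // hypothetical path
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes("UTF-8"));
            }
            System.out.println("Wrote " + fs.getFileStatus(file).getLen() + " bytes");
        }
    }
}
```

Tools like Hive and Pig ultimately go through this same storage layer when they read and write their data.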

Suppose I am having a 101-node cluster, out of which one is the master and a hundred are slaves. The question was: if the name node goes down, what will happen? One point I have already mentioned — the cluster will become inaccessible — but the question to all of you is why. I appreciate your answers: Sandal answered correctly, good; expecting a few more — Mohit, Mani, Pratik... and there is a second correct answer. So Mohit and Sandal answered correctly.

Guys, why does the complete cluster become inaccessible if the master goes down? The reason is that we won't have access to the metadata on the master node. The master holds all the metadata: which blocks correspond to which file, plus information about the slaves, about all the blocks, about the network configuration of the slaves — the complete picture. So if your master is down, the actual data that is stored on the slaves will be of no use. This particular issue has been resolved in the latest version of Hadoop; we'll see that later. Also, if your master goes down, all the currently running MapReduce jobs will fail as well.

So guys, with this I am giving you two points in parallel, one for Hadoop 1 and one for Hadoop 2. We will have complete sessions on Hadoop 1 as well as Hadoop 2; what exactly is new in Hadoop 2 we are going to discuss specifically later, and whatever is common we will cover during these sessions.

Now, slave nodes are expected to fail at some point — Hadoop has this in its design principles. That's why Hadoop is built so that if a few of the slaves go down, it is not an issue. For the slaves, I already told you, we use commodity hardware. But can I take the master as commodity hardware? Can I choose just commodity hardware for the master? All of you, please think about it; I'm giving you 30 seconds.

So guys, since in a small cluster we usually have just one master, it is not recommended to have the master on commodity hardware, because the master is a single machine and it can be a single point of failure. Apart from that, for the master machine there are a few hardware considerations we need to keep in mind, in terms of memory, in terms of CPU, and disk also. Staying at the abstract level for now, I would say the master should be a reliable machine, one that shouldn't go down at any point of time. The slaves, on the other hand, can go down at any point of time — that is an assumption made while designing Hadoop — and the name node will automatically re-replicate the blocks that were on the failed node to the other slaves; we'll see that later.

So that's about the HDFS nodes; now let's talk about the HDFS master. An HDFS cluster consists of a master server that manages the file system namespace and regulates access to the files in the cluster by the clients. Guys, the master is a manager: it doesn't really do the work itself; its main job is to manage, coordinate and monitor. But it is a very important machine — it manages all the slave nodes and assigns work to the slaves. Apart from that, it executes file system namespace operations like opening, closing or renaming files and directories. Any operation related to the namespace starts from the master only — opening or closing files, renaming, and that type of thing is done via the master. The master should be deployed on reliable hardware; it should be highly available, as it is a single point of failure.

So guys, please note this down — it is an assignment also, although I have already explained it in detail: what is a single point of failure (SPOF) in Hadoop 1.x, and how to resolve it? Apart from that, okay, let me assign someone to note down all the assignments and remind me of them — is anyone taking this responsibility? Sandal, will you do the same? Thank you so much, because I haven't noted these assignments down anywhere; they are on-the-fly assignments while discussing these concepts with you.

Let's talk about the HDFS slaves. An HDFS cluster consists of a number of slaves, which actually store the data. The master machine does not store the data; the slaves actually store the data blocks. Slaves are the actual worker nodes, which do the actual work of reading, writing, processing and so on. Slaves are responsible for serving read and write requests from the file system's clients. Ultimately the read, write and processing requests go to these slaves; there is a master ahead of them which regulates, manages and monitors all such requests, but ultimately the requests land on the slaves, because the actual data is on the slaves. They also perform block creation, deletion and replication upon instruction from the master. HDFS slaves — Hadoop slaves, or data nodes — can be deployed on commodity hardware.

Okay, so guys, if we discuss the HDFS daemons a little more: we already discussed that there are two daemons, the NameNode and the DataNode. The NameNode daemon runs on the master. Please note down what exactly is stored on the master, or name node: the name node stores the metadata, like file names, file paths, number of blocks, block IDs, block locations, number of replicas, slave-related configuration, and so on. Now another important point, please note it down: the name node keeps this metadata in memory.
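The namespace operations mentioned above — creating directories, renaming, deleting, opening and closing files — are exactly the calls that the name node serves; no data node is involved until actual bytes have to move. Below is a minimal, hedged sketch using the FileSystem API; the paths are invented purely for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        // Assumes the cluster configuration (core-site.xml etc.) is on the classpath.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // All three calls below are pure metadata operations handled by the name node.
            fs.mkdirs(new Path("/user/demo/raw"));                                  // create a directory
            fs.rename(new Path("/user/demo/raw"), new Path("/user/demo/staged"));   // rename it
            fs.delete(new Path("/user/demo/staged"), true);                         // recursive delete
        }
    }
}
```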

The name node keeps the metadata in memory; all the metadata is available in memory. Why so? For faster retrieval. The name node keeps all the metadata in memory so that it is available for fast retrieval, because if that data were on disk there would be a disk seek for every lookup, which would significantly degrade performance. The name node's metadata is used very frequently — by the clients, by the cluster — to get access to the blocks and to process the blocks; every request goes via the name node, that is, via the metadata. Is this clear to all of you so far?

Can I now make one point about the name node's hardware — that it should have a little higher memory, a little more RAM, because it is keeping all this metadata in memory? Yes. Although, understand this carefully: a persistent copy of the same metadata is also stored on disk. It's not that if your name node goes down you lose all your metadata; the persistent copy is available on disk. But for fast retrieval it keeps the metadata in memory, and that's why I would say your name node should have higher memory.

Now again it's time for an assignment. There is a thumb rule — if you do a little bit of googling you'll find it — so the assignment is: corresponding to one block, how much metadata is created on the master? What is the size of the metadata that is created on the master per block? Guys, there is an intention, there are reasons, why I am giving this many assignments. The first reason is that you should identify the resources, where exactly these things are documented; apart from that, if you go through three or four websites to find the answers, the thing will stay with you permanently, and while researching one topic you will find at least three new topics, three new technical terms, that we will come back to again and again. So it's my humble request to finish all the assignments before the next session. Pratik is asking me to repeat the question: corresponding to one data block, how much metadata is created, or what is the size of the metadata that is created, on the master?

The second daemon is the DataNode daemon. The DataNode daemon runs on all the slaves, and the data nodes store the actual data.

Now let's move ahead. Guys, an important point: this is how data is actually stored in HDFS, so understand it carefully. When a file is stored in HDFS — I think this is the third time I'm showing you this — the file is broken down into small chunks, small blocks or pieces, and the default size of a block is 64 MB.
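To see the client-visible slice of that metadata — file name, path, length, block size, number of replicas — you can simply ask the name node for the file status. A small, hedged sketch; the directory name is made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowMetadata {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            // listStatus is answered from the name node's metadata;
            // no data node is contacted for this call.
            for (FileStatus st : fs.listStatus(new Path("/user/demo"))) {
                System.out.printf("%s  length=%d  blockSize=%d  replicas=%d%n",
                        st.getPath(), st.getLen(), st.getBlockSize(), st.getReplication());
            }
        }
    }
}
```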

These blocks are stored on multiple nodes in the cluster, in a distributed manner. Guys, Hadoop tries to distribute the data as much as possible, and the more the data is distributed, the better the performance I am going to get. Suppose I am having a 100-node cluster and the data is distributed across only 50 nodes; then just those 50 nodes will come into the picture during the MapReduce processing and the other 50 will sit idle. So it is the philosophy of Hadoop, its design principle, to distribute the data as much as possible. Since the data is stored distributedly, it provides a mechanism for MapReduce to process the data in parallel over the cluster. Surely MapReduce is the heart of Hadoop, but to give MapReduce its distributed capabilities we must have a file system that is itself distributed — and that is HDFS.

Now guys, I think it's the first time I am giving you another concept: HDFS stores multiple copies of the data on different nodes; by default, three copies of each block are created. Replication of data provides fault tolerance, reliability and high availability. If you recall, I gave you these three features of HDFS in the Hadoop session — and it is the replication that gives us these three characteristics.

So guys, look at the data storage in HDFS: my data blocks are stored in a distributed manner, and apart from that, by default three copies of the data are created. Look at this: block 1 is here, as well as here, as well as here; block 2 is available here, here and here. In this manner all the blocks are replicated thrice — overall, the replication factor is 3 by default. It's a very important funda, although it's a little tough to digest, because now you'll ask me: if I am copying 100 terabytes of data it will consume 300 terabytes of space. That's totally correct — it is going to occupy 300 terabytes of space. But guys, hard disk space is very cheap nowadays, so everybody is pretty much happy to invest that much in disk, because simply from keeping multiple copies of the data I am getting these facilities: reliability, fault tolerance, high availability. That's why a fundamental principle of Hadoop is: create multiple copies of the data.

Now, if you create multiple copies of the data — and nobody has asked me this question yet — what happens if any machine goes down? Look at this: if this machine goes down, my block 1 is available here as well, and the master knows that block 1 is available at three different locations. Pratik is asking what the advantages of three copies are — Pratik, I am explaining exactly that. If you concentrate here: there is a chance that at any point of time any machine goes down; if this machine is down, I can still access my data from this other slave.
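That replication factor of 3 is only the default of the dfs.replication property; a client can ask for a different value for the files it writes. A hedged sketch — the property name is the standard HDFS one, while the path and value are made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask for 2 replicas instead of the default 3 for files written by this client.
        conf.set("dfs.replication", "2");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/user/demo/twocopies.txt"))) {
            out.write("stored twice, not thrice".getBytes("UTF-8"));
        }
    }
}
```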

And even suppose that machine is also down — I can still access my block 1 from this third machine, and my block 4 from this machine. So the main advantage of multiple copies is high availability, fault tolerance and reliable storage. Pratik, I hope that answers your question.

Okay, a good question: Sandal is asking whether there is a possibility that all the data nodes which hold a particular block are down. Correct, Sandal, there is a possibility. Even with three copies: look at this, block 1 is available here, here and here; if those three data nodes are down, I am going to lose access to that data and I need to wait until one of those machines is up again.

Now guys, this is a deciding factor for your application. The replication factor is 3 by default, but I can increase or decrease it (see the sketch below for changing it on an existing file). If I identify, based on my previous experience or on an assumption, that at any point of time at most three machines can go down — or if I have an SLA from the hardware team that when three machines are down they will bring at least one back up before a fourth goes down — then I can set the replication factor to four. Okay, Sandal? Usually on a smaller cluster, say a 10-node cluster, we can go ahead with a replication factor of two, because on 10 nodes hardly one machine goes down at a time — although it depends on your hardware, on its capabilities, and on a few other factors, like whether the machines are in a stable environment. In production, machines usually do not go down, but Hadoop has the provision that if they do, it is still pretty much fine — up to a certain limit, of course; if 50 nodes out of 100 are down, Hadoop also can't do anything. On a 100-machine cluster we can surely increase the replication factor, say up to around 5, but again it depends on a number of factors. I think this can be a good research assignment: ideally, what should be the replication factor on a cluster of 500 nodes? Tomorrow, before starting the session, we are going to discuss all the assignments, so be ready for that.

So guys, is this replication clear? Can I get a smiley if the reason for replication, and how exactly we benefit from it, is one hundred percent clear? Mani — okay, good.

Now guys, this is the very basic architecture of Hadoop; let's understand it carefully, component by component. The name node is our master: it manages the file system namespace, the metadata operations and so on, so the name node has all the metadata with it. For the backup of the name node we have the secondary name node — please note this down if you haven't noted it in the Hadoop architecture — we have a secondary name node, or secondary master, which is there for the backup of the name node, although it is not a very good solution.
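Coming back to the point that the replication factor can be raised or lowered: it can also be changed for a file that already exists, and the name node then schedules the extra copies (or marks surplus ones for deletion) in the background. A hedged sketch; the path and the target value are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            Path file = new Path("/user/demo/twocopies.txt");  // hypothetical file

            // Raise the target to 5 replicas; the name node schedules the extra copies.
            fs.setReplication(file, (short) 5);

            // The new target shows up in the metadata immediately, even while the
            // additional copies are still being created in the background.
            System.out.println("target replicas = " + fs.getFileStatus(file).getReplication());
        }
    }
}
```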

Now, there is the HDFS client, which needs to interact with the name node first, before writing or reading any data. So guys, we have a few terms here; the first is heartbeats. What is a heartbeat — anybody know? Not really? Okay — yes, I got a correct answer from Sandal: it is the data node that sends a very small message to the name node saying "I am alive"; it simply keeps saying "I am alive". Look at the word itself — heartbeat — it is pretty much self-explanatory. This message is sent by the data node to the name node, and only then does the name node come to know that all its slaves are up and running. Also, each and every data node has its own separate disk where the data belonging to HDFS is stored.

Now, talking about the third point, replication: look at the screen. This block is green here, here and here — three copies of the block are available. Take any block, orange or black: first copy here, second copy here, third copy here. By the way, guys, all copies are identical; there is really no such thing as a first, second or third copy, no priority among the copies, nothing like an original copy versus duplicate copies — all copies of a block are exactly the same. So one copy is here, the same copy is available here, and another copy is available here; that is the replication I was trying to show. And at any point of time, if a machine goes down, the master will issue the command to rebalance the cluster. It means, if I talk about this green block and this machine goes down, the green block is now under-replicated — that is the correct terminology, under-replicated — so from this machine, or from that machine, one copy will be made on another machine, and it is back to the full count. The name node will do this replication automatically, and a system admin can also issue a command to rebalance the cluster. So that is the very basic architecture I was trying to explain here, just to give you a glimpse of the flow of HDFS — how the daemons and the blocks look.

Okay, Sandal is asking a question: at any given point, will the name node always send the address of the first block, the one on data node 1? No, Sandal, it's not really like that. For a specific file, suppose this is block 1 that is here — by the way, the position or placement of a block is usually random, but there are a few factors, and in the next few slides, Sandal, we are going to discuss on what basis a block is placed on a specific slave. So I understood your question, Sandal; please explain it once again if I miss any point of it. To the other guys also: if anybody is having any questions, please ask. Sandal, if it's a little hard for you to type, you can speak yourself.

Okay, understood. So guys, the question Sandal is asking: suppose I'm having these data nodes — data node 1, data node 2, data node 3, data node 4 and data node 5 — and this is block 1, with its copies placed on, say, data node 1, data node 3 and data node 5. The question is: every time there is a read request for block 1, will the request always go to this same node? The answer is: it depends. Suppose at some point this slave itself is busy serving other clients' requests — it is busy processing data, busy reading or writing — the slave is overloaded; then the request will go elsewhere. By the way, which replica serves a read is usually pretty much random, but there are a few factors. The first factor: if that data node is busy, the name node will ask another data node to respond. The second: the distance between the client who wants to read this block and the particular slave. Let me give a very simple example — my data node itself can act as a client. If I want to read the data on that very data node, then instead of pulling the data from a third data node, that data node itself will serve the response; that's an example of distance. Or the client and the slave may be in the same rack — although I haven't discussed racks yet, so I shouldn't use that term; I'll explain racks later. So these are the few factors, Sandal, on which the read request depends, and similar reasons apply to block placement, which I am going to discuss in the next slides. Sandal, is this clear? Okay, that's good; let me clear the drawing.

So, after the HDFS basic architecture, let's talk about the features of HDFS. Under features we are going to discuss distributed storage — how your data is stored distributedly, and what the placement policies depend on; blocks — how exactly a file is split up into blocks; the replication factor — what it depends on and how replication actually happens; high availability, data reliability and fault tolerance — these three features are all related to replication, but what are the differences between them; and scalability. After that we'll discuss high-throughput access to application data. These are the features of HDFS; let's see them one by one.

So guys, distributed storage: data is stored distributedly across the cluster of nodes. As I told you earlier as well, Hadoop tries to distribute the data as much as possible: your data is first split up into small pieces and stored across multiple nodes.
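You can actually see which nodes a file's blocks landed on — and therefore which data nodes are candidates for serving a read like the one in Sandal's question — by asking the name node for the block locations. A hedged sketch using the standard FileSystem API; the path is made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhereAreMyBlocks {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            FileStatus st = fs.getFileStatus(new Path("/user/demo/bigfile.dat"));

            // One BlockLocation per block of the file; getHosts() lists the
            // data nodes that hold the replicas of that block.
            for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        loc.getOffset(), loc.getLength(),
                        String.join(",", loc.getHosts()));
            }
        }
    }
}
```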

Now, it is due to this distributed storage only that MapReduce gets a way to process a subset of a large dataset in parallel on multiple nodes. Since my data is stored distributedly, since it is split up into small pieces, MapReduce can process the data in parallel. This distributed storage also provides the fault tolerance capability: it is because the data is split and distributed that we get fault tolerance.

Okay, let's go ahead — blocks. We'll discuss block placement also. Guys, understand blocks carefully; it is one of the most important topics of HDFS. Your files are split up into a number of blocks. Do you know that the block size of an ordinary OS file system is merely a few KB — 1 KB, or 4 KB if we talk about a more recent file system like the one in Windows 8? On the other hand, the HDFS block size is 64 MB — it's huge. Why is the HDFS block size that big? And I'm even saying it can be increased according to requirements. The reason is that we are dealing with data that is itself huge; apart from that, dealing with small blocks would be difficult to manage, copying would take more time, and there would be a huge amount of disk seek. You know that while reading any data from a hard disk, the disk head needs to seek to that particular sector before reading the data, so if my block size is large it saves disk seek time, and that is a really significant factor for the overall performance of the framework, in terms of storage as well as processing. Another advantage is in terms of processing: when we discuss MapReduce, the first part of MapReduce is the map, and at a time one map task will process one block. So if my block size is small — suppose 1 KB — a map task will process just 1 KB at a time, but if my block size is 64 MB, it can process 64 MB of data.

So guys, let me ask one question to all of you. Suppose I am having a file whose size is 65 MB. How many blocks will there be? Correct — two blocks. Sandal and one more of you answered; what about Mani and Pratik? Two blocks, right. So let me draw two blocks, the first block and the second block. What is the default block size? 64 MB — keep giving me answers whenever I'm asking questions. The default block size is 64 MB, so my first block will be completely full, since I'm having 64 MB of data; the first block will store 64 MB. What about the second block? The second block has merely 1 MB of data.
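A quick aside on that 64 MB figure before we resolve what happens to the rest of the second block: the block size is only a per-file default, and a client can ask for a larger one when it creates a file. This is a hedged sketch using the long form of FileSystem.create (buffer size, replication, block size); the numbers and the path are made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSize {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            long blockSize = 128L * 1024 * 1024;            // 128 MB instead of the 64 MB default
            Path big = new Path("/user/demo/bigfile.dat");  // hypothetical path

            // create(path, overwrite, bufferSize, replication, blockSize)
            try (FSDataOutputStream out = fs.create(big, true, 4096, (short) 3, blockSize)) {
                out.write(new byte[1024]);                  // placeholder payload
            }
            System.out.println("block size = " + fs.getFileStatus(big).getBlockSize());
        }
    }
}
```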

Now my question to all of you is: what will happen to the remaining 63 MB of space in that second block? Think about it; I'm giving you 20 seconds.

Sandal is saying it will be left as it is. No, Sandal — how can we leave 63 MB of space? That would be quite a waste of storage, because understand, we will be having millions of blocks, we will be having petabytes of data, and if space is wasted like this it won't be a good solution. Mani is saying it will be used for another file. Mani, a block is allocated to just one file; one block cannot hold the data of two files, we can't mix that. So the answer is — please note it down — if the data size is less than the block size, then the block size will be equal to the data size. What do I mean by that? This case of wasted space will not arise; rather, what will happen is that one block of 64 MB has already been created and the second block will be created of merely 1 MB. If the data is less than the block size, the block size becomes equal to the data size. It means the block size is flexible in nature: if a block is not getting full, it is not going to occupy the complete space; it will occupy only whatever space is necessary. Is this clear? Okay, let me explain the same again if it's not clear. What about the other guys — Sandal, Pratik, Mani and Mohit?

So guys, suppose I am having a file whose size is 130 MB. Now all of you answer me: how many blocks will be created, and what will be the size of the blocks? I got just one correct answer... what about Mani and Pratik? I got a second correct answer — Sandal answered correctly — and I'm waiting for one more. So guys, overall three blocks will be there: two blocks of 64 MB and one block of 2 MB. 64 plus 64 plus 2, that is 130 MB, and I'll be having three blocks. Correct, Mohit.

So now let's understand the block placement policy. Guys, again, don't consider that block 1 has a "first copy", a "second copy" and a "third copy" — that numbering is only for the illustration; here all the configurations are at their defaults, and just for simplicity only a few copies have been shown.
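The 65 MB and 130 MB answers above come down to simple arithmetic: as many full 64 MB blocks as fit, plus one final block that is only as big as the leftover data. A tiny sketch of that calculation (plain Java, nothing Hadoop-specific, with the 64 MB default assumed):

```java
public class BlockMath {
    // Prints how many HDFS blocks a file of the given size needs,
    // assuming the 64 MB default block size discussed in the session.
    static void describe(long fileSizeMb) {
        long blockMb = 64;
        long fullBlocks = fileSizeMb / blockMb;
        long lastBlockMb = fileSizeMb % blockMb;          // the final, partially filled block
        long totalBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0);
        System.out.printf("%d MB -> %d block(s), last block %d MB%n",
                fileSizeMb, totalBlocks, lastBlockMb == 0 ? blockMb : lastBlockMb);
    }

    public static void main(String[] args) {
        describe(65);    // 2 blocks: 64 MB + 1 MB
        describe(130);   // 3 blocks: 64 MB + 64 MB + 2 MB
    }
}
```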

Now, we have the name node that stores the metadata — the file system namespace: file names, number of replicas, block IDs, etcetera. Look at this: the first block is available here and here, the second block here, here and here, the fifth block here, here and here. On what basis are these blocks placed on these machines? Guys, if you look at the block placement policy, placements are pretty much random, but they do depend on a few factors. The very first factor is the load on a specific machine. Suppose this machine is heavily loaded, loaded in terms of storage as well as processing; then your name node will try to send fewer blocks to this machine. Apart from that, the distance is also measured. Sure, Sandal, let me explain that again: in the block placement policy, if this machine is already loaded — say it is more than 80% full while the other machines are much less occupied — and apart from that lots of tasks, user-submitted work, are also running there, then fewer blocks will come to this machine; the name node will send fewer blocks to it. That is the first reason. The second is distance: suppose the client is very near to this machine; then the chances of this machine getting more blocks are a little higher. By the way, guys, we'll see this during our workshops also: we will use one of the data nodes itself as a client, and from that data node itself I will copy data into HDFS, and we'll see that a lot of blocks get stored on that same machine, since the distance is very small. These are fairly minor factors; apart from them, your blocks are placed — and processed — pretty much randomly.

Okay, all of you guys: regarding blocks, is everything clear till now? Because after this we are going to increase the depth of the knowledge and dive deeper into concepts like replication. If everything is a hundred percent clear — I got a few smileys; what about the other guys? Okay, thank you Sandal, I got the biggest smiley from Sandal. Mani, Pratik, Mohit and the other guys, okay. By the way, guys, this recording will be uploaded today itself in the LMS; I recommend all of you to go through the recording before the next session, so that we can continue from this point.

So guys, I think we should stop for the day; let's call it a day. But let me show you what we are going to see in the next session. In the next session we are going to cover replication: we are going to discuss a number of concepts, like file replication, where exactly you can set the replication, and the different places where we can configure or edit the replication settings. We are also going to discuss high availability in great detail, then data reliability, then fault tolerance, scalability, and high-throughput access to application data.

Then we'll see file access and other operations — how we can access the files and perform other operations. Then a very important concept, the file read operation, and then the file write operation — again very important; in terms of interviews and in terms of certification, these two concepts are very, very important. Then we'll see another view just to illustrate the file write operation, then rack awareness — a very important funda, although a small one, for HDFS — and then the advanced HDFS architecture, which gives the complete end-to-end picture of HDFS: what is happening where, how a client reads the data, how a client writes the data, what exactly a rack is, how data gets replicated, and what the different block operations are. So that's it for today; hope you enjoyed this session. The next session is tomorrow, Sunday, same time, sharp at 8 p.m. Alright guys, thanks so much for attending, good night. Sandal is saying: "again, one more awesome session" — thank you, Sandal.