There is a lot of talk about, and increasing adoption of, Big Data, especially Apache Hadoop and the technologies in its ecosystem. Business Intelligence and analytics are the primary areas of focus, and these new technologies are revolutionizing that space. Hadoop is a disruptive force that creates value through the possibilities opened up by advances in technology. However, there is more to Apache Hadoop than business intelligence and analytics.
Apache Hadoop and its MapReduce paradigm are about much more than BI. Hadoop is about distributed computing at scale. It is about moving the code to where the data is, rather than moving the data to where the code is. The latter has been the traditional, de facto mechanism of data processing: there are big databases and big, powerful computing nodes, and the data is moved to those nodes for processing, much like raw material is moved to industrial production machines. The processed data, the 'finished product', is moved back to the big databases and warehoused for use then or at a later point in time. Hadoop changes this core tenet of data processing. It instead ships the processing code to the servers where the data is stored. The size of the code is not even a fraction of the volume of data it processes. Processing the data locally and in parallel creates economies of scale, achieves high collective throughput and avoids the high network cost of moving data around.
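To make the 'move the code to the data' idea concrete, here is a minimal sketch of the canonical word-count job written against the Hadoop MapReduce Java API. The class and file names are illustrative, but the pattern is the point: a few kilobytes of mapper and reducer code are shipped to the nodes that already hold the input blocks, instead of terabytes of input being shipped to a central server.

```java
// Minimal Hadoop MapReduce job: the small mapper/reducer classes below are
// distributed to the nodes holding the data blocks and run there locally.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Runs on the node that stores this block of the input file.
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework, not the application code, decides which node runs which map task, preferring nodes that already hold a local copy of the data block.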
When you think about it, most business processes are data processing: enriching data and creating value through the application of rules, transformations and logic. With advances in technology and the wider adoption of data-generating devices (increased data collection, mobile devices and social media), the challenge has become the volume of data and the speed, or 'velocity', at which it is created. The traditional processing paradigm is an increasingly poor fit for today's needs. The network bandwidth needed to move unprocessed data to the processing nodes, and the processed data back to storage, is prohibitively expensive. At the same time, the processing nodes have to be ever more powerful to handle such large volumes. Hadoop solves this by keeping data movement to a minimum, and it removes the need for huge computing machines by making each of the less powerful machines, the 'nodes' in the cluster, contribute to the overall processing.
Let's take an example of massive processing in the telecommunications space: mediation and billing. Millions of call records are generated per minute by the network switches and received by the mediation system. The mediation system stores this information in files and processes the records by enriching, collating, guiding and summarizing them into a format ready for the rating engine. The rating engine takes these records and applies rate information to create usage records. The rated records are stored in a relational database. At the end of the billing cycle the usage records are pulled out of the database, and the billing processes summarize them for each billing entity, apply volume discounts and create the bill. The bill kicks off the next set of revenue-collection processes by creating an accounts-receivable entry, effectively enabling the telco to realize revenue from subscribers' usage of its network.
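As a sketch of how a rating step might look in MapReduce terms, the hypothetical mapper below rates simplified call detail records. The CDR layout ("subscriberId,callType,durationSeconds") and the flat per-minute rate are illustrative assumptions; a real rating engine would look up tariff plans and handle many more record types.

```java
// Hypothetical rating step as a Hadoop mapper. Assumes a simplified CDR line
// of "subscriberId,callType,durationSeconds" and a flat per-minute rate.
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RatingMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  private static final double RATE_PER_MINUTE = 0.05; // illustrative flat rate

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    if (fields.length < 3) {
      return; // skip malformed records; real mediation would route them to error files
    }
    String subscriberId = fields[0];
    double minutes = Double.parseDouble(fields[2]) / 60.0;
    double charge = minutes * RATE_PER_MINUTE;
    // Emit a rated usage record keyed by subscriber, ready for billing-cycle aggregation.
    context.write(new Text(subscriberId), new DoubleWritable(charge));
  }
}
```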
If we look at the whole process above, it is the processing of files at multiple processing stations, enriching the information at each step. Anyone who has ever worked on a billing platform has stories about its performance bottlenecks and scaling issues. The major culprits are the volume of data, the constrained capacity of the processing nodes and the difficulty of scaling the system reliably.
Fortunately, Hadoop addresses all of these problems. If we build the billing processes on Hadoop, every node becomes a processing node and works in coordinated collaboration. Processing nodes can be added on the fly, solving the scaling issue. Distributed processing is in the DNA of the Hadoop architecture, so the billing logic needs no distribution code of its own to run across the nodes of the cluster. Thus the core architecture of Hadoop addresses the major issues of performance, scaling, distributed processing and data movement.
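A companion sketch of the billing-cycle summarization: because Hadoop groups all rated records for a subscriber onto a single reduce call, summing the cycle's usage and applying a volume discount requires no custom distribution logic. The 10% discount above a 100-unit threshold is purely an illustrative assumption.

```java
// Hypothetical billing-cycle reducer: sums the rated charges emitted by the
// mapper above for each subscriber and applies an assumed volume discount.
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class BillingReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

  private static final double DISCOUNT_THRESHOLD = 100.0; // illustrative
  private static final double DISCOUNT_RATE = 0.10;       // illustrative

  @Override
  protected void reduce(Text subscriberId, Iterable<DoubleWritable> charges, Context context)
      throws IOException, InterruptedException {
    double total = 0.0;
    for (DoubleWritable charge : charges) {
      total += charge.get();
    }
    if (total > DISCOUNT_THRESHOLD) {
      total *= (1.0 - DISCOUNT_RATE); // apply the volume discount
    }
    // The summarized amount becomes the bill line for this subscriber.
    context.write(subscriberId, new DoubleWritable(total));
  }
}
```

The framework handles shuffling the rated records to the right reducer and restarting failed tasks; adding nodes to the cluster simply gives it more map and reduce slots to spread the same jobs across.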
There are other processes in every business that suffer even bigger performance issues because they store data in centralized relational databases between each processing stage. Such processes can benefit greatly from Big Data and Hadoop. So when looking at Hadoop and what is possible with it, think beyond BI and analytics. Go to the core business processes and improve them from the bottom up, gaining business-process efficiency as well as improved analytics and decision making.