Saturday, January 25, 2014

Why are Big Data and Hadoop not just for BI? Think broader data processing...

There is a lot of talk about Big Data, and adoption is growing, especially for Apache Hadoop and the technologies in its ecosystem. Business intelligence (BI) and analytics are the primary areas of focus, and these new technologies are revolutionizing that space. Hadoop is a disruptive force that brings a lot of value through the possibilities created by advances in technology. However, there is more to Apache Hadoop than business intelligence and analytics.

Apache Hadoop and its MapReduce paradigm are about much more than BI. Hadoop is about distributed computing at scale. It is about moving the code to where the data is rather than moving the data to where the code is. The latter has been the traditional de facto mechanism of data processing: there are big databases and there are big, powerful computing nodes, and the data is moved to those nodes for processing, much like material fed to industrial production machines. The processed data, or 'finished product', is moved back to the big databases and warehoused for use then or at a later point in time. Hadoop has changed this core tenet of data processing. It instead moves the processing code to the servers where the data is stored. The code is a tiny fraction of the size of the data it processes. Processing the data locally on distributed nodes creates economies of scale, achieves high collective processing throughput and avoids the high cost of network traffic from data movement.
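To make the "move the code to the data" idea concrete, here is a minimal sketch in plain Python. It is not Hadoop code: the `cluster` dictionary, the node names and the functions `count_locally`, `merge` and `run_job` are all hypothetical stand-ins. Each simulated node runs the same small function over its own partition, and only the small partial results are combined.

```python
from collections import Counter
from functools import reduce

# Hypothetical cluster: each "node" already holds its own partition of records.
cluster = {
    "node-1": ["voice", "sms", "voice", "data"],
    "node-2": ["data", "data", "voice"],
    "node-3": ["sms", "voice"],
}


def count_locally(records):
    """The 'code' shipped to a node: count record types in the local partition."""
    return Counter(records)


def merge(left, right):
    """Combine two partial results; only these small summaries would move."""
    return left + right


def run_job(nodes):
    """Run the same function on every partition, then merge the partial counts."""
    partials = [count_locally(partition) for partition in nodes.values()]
    return reduce(merge, partials, Counter())


totals = run_job(cluster)
print(dict(totals))
```

In a real cluster the partitions would be file blocks spread over many machines, and the per-node summaries are typically far smaller than the raw records they were computed from.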

When you think about it, most business processes involve data processing: enriching the data and creating value by applying rules, transformations and logic. With advances in technology and greater adoption of data-generating devices (increased data collection, mobile devices and social media), the challenge has become the volume of data and the speed, or 'velocity', at which it is created. The traditional processing paradigm is an increasingly difficult fit for today's processing needs. The cost of the network bandwidth needed to move unprocessed data to the processing nodes and to move the processed data back to storage is prohibitive. At the same time, the processing nodes have to be increasingly powerful to process such large volumes of data. Hadoop addresses this problem by keeping data movement to a minimum. It addresses the need for huge computing machines by making each of the less powerful machines, the 'nodes' in the cluster, contribute to the overall processing.

Let's take an example of massive processing in the telecommunications space: mediation and billing. The switches generate millions of call records per minute, and the mediation system receives them. The mediation system stores this information in files and processes them by enriching, collating, guiding and summarizing the records into a format ready for the rating engine. The rating engine takes these records and applies the rate information, creating the usage records. The rated records are stored in a relational database. At the end of the billing cycle the usage records are pulled out of the database, and the billing processes summarize the usage records for each billing entity, apply the volume discounts and create the bill. This bill starts the next set of revenue collection processes by creating an accounts receivable entry, effectively enabling the telco to realize revenue from subscribers' usage of its network.
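The rating and billing steps described above can be sketched as a map step (rate each call record) followed by a reduce step (summarize per billing entity). This is an illustrative sketch only: the record fields, the `RATES` table, the discount threshold and the discount percentage are invented for the example and do not come from any real billing system.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical per-minute rates and volume discount (assumed values).
RATES = {"local": Decimal("0.02"), "international": Decimal("0.25")}
DISCOUNT_THRESHOLD = Decimal("10.00")
DISCOUNT = Decimal("0.10")


def rate_record(cdr):
    """Map step: turn one call record into an (account, charge) pair."""
    minutes = Decimal(cdr["seconds"]) / 60
    return cdr["account"], minutes * RATES[cdr["call_type"]]


def bill_accounts(cdrs):
    """Reduce step: sum rated usage per account, then apply the volume discount."""
    usage = defaultdict(Decimal)
    for cdr in cdrs:
        account, charge = rate_record(cdr)
        usage[account] += charge
    bills = {}
    for account, total in usage.items():
        if total > DISCOUNT_THRESHOLD:
            total = total * (1 - DISCOUNT)
        bills[account] = total.quantize(Decimal("0.01"))
    return bills


call_records = [
    {"account": "A", "call_type": "local", "seconds": 600},
    {"account": "A", "call_type": "international", "seconds": 3000},
    {"account": "B", "call_type": "local", "seconds": 300},
]
bills = bill_accounts(call_records)
print(bills)
```

Because each record is rated independently and the summary is keyed by account, both steps can in principle be split across many nodes, which is what makes this kind of workload a candidate for MapReduce.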

If we look at the whole process above, it amounts to processing files at multiple processing stations and enriching the information at each one. Anyone who has ever worked on a billing platform has stories of its performance bottlenecks and scaling issues. The major culprits are the volume of data, the constrained capacity of the processing nodes and the challenges of reliably scaling the system.

Fortunately, Hadoop addresses these problems. If we build the billing processes on Hadoop, every node becomes a processing node, and the nodes work in coordinated collaboration. Processing nodes can be added on the fly, which eases the scaling issues. Distributed processing is in the DNA of the Hadoop architecture: once the billing processes are written as MapReduce jobs, the framework runs them in a distributed manner across the nodes of the cluster without the application having to manage the distribution itself. Thus the core architecture of Hadoop addresses the major issues of performance, scaling, distributed processing and data movement.

There are other processes in any business that have even bigger performance issues because they store data between processing stages in centralized relational databases. Such processes can benefit greatly from Big Data and Hadoop. So when we look at using Hadoop and at what is possible with it, think beyond BI and analytics. Go to the core business processes and improve them from the bottom up, leading to business process efficiency as well as improved analytics and decision making.

Tuesday, October 29, 2013

Big Data in Telecom - Is mediation the right starting point?

Mediation is the starting point for the revenue generation processes in the back-office systems of telecommunications companies or, more broadly, communications service providers (CSPs). Telecommunications is a highly capital-intensive industry in which invested capital typically takes at least 7-10 years to start turning a profit. Realizing ROIC (return on invested capital) is the result of a combination of managing the network, keeping up with technological advancements and making sure every bit of usage is translated into revenue.

The more effective the process of revenue recognition or generation, the better the return on the capital invested in the network. Translating subscribers' use of the network into revenue starts with mediation. Simply put, streams and files of data from switches, routers and other gear across the network are collected, collated, formatted and then processed through a myriad of billing sub-applications to generate the bill and thus recognize revenue.

Today’s networks are a complex mesh of interconnected, intelligent devices and systems, which generate a lot more information than is needed for revenue recognition alone. Traditionally, billing systems have focused on three categories of information from the mediation feed: the charging attributes; the attributes influencing rates, also called qualitative attributes; and other non-charging attributes, which provide supplementary information. The rest of the information from the network devices, typically ignored and often discarded by the billing systems, carries much more meaningful information about the usage patterns and usage behavior of the subscribers.

How often are particular services or features used? How does the subscriber use the services? What is the geographical usage pattern, and how mobile is the user? How do demographic attributes affect the usage of particular services and the adoption of new services? How diverse is the use of the various products across the subscriber base? Many of these questions are not directly relevant to the billing process in the short run, and are thus often ignored under a billing-centric view of mediation data.

There are also system-level limitations or constraints that encourage a very billing-system-centric view of the data acquired by mediation devices. The mediation platforms are typically an integral part, or subsystems, of the billing platform. Billing platforms are designed as structured, relational-database-based systems, where additional storage and additional attributes mean additional cost. The cost of change is very high for the monolithic billing systems in place today: a change requires thousands of man-hours of effort over many months of release cycles. Any change in the mediation feed structure, or the addition of new switches resulting in new data feeds, leads to a cascade of changes in the mediation system. Mediation systems therefore avoid this cost by discarding, right at the door, the information that is not directly needed from a billing perspective, and by focusing on the charging aspect of the data.

In the end, by using traditional relational-database-based billing systems, we end up losing a lot of meaningful information from the mediation feeds. Also, whatever we do capture is very billing-centric and carries a high cost of storage. Communications service providers thus bear a high cost and, at the same time, are not able to realize the full potential of the network data.

Thinking out loud, what if we could store all kinds of usage data provided to the mediation platforms at a much lower cost and for a much longer retention period? What if we could accommodate all formats of usage data from switches (there is a fancy term for it: unstructured data) without having to invest in defining schemas and the associated databases upfront? What if we could keep all this usage data and add further streams of data, such as diagnostic information, network outage information, incident information from CRM systems and customer profiles, to create a mesh of meaningful information?

All of the above would create much higher value from the mediation data, part of which is discarded today because of the associated cost and the lack of immediately known value. The CSPs would be able to derive usage patterns, segment-level subscriber behavior, analytics on devices and their usage, revenue patterns for subscribers, and geographical usage patterns down to the level of devices, towers and subscriber segments. The possibilities created by just the ability to store and process this unstructured mediation feed are numerous.

Fortunately, the technology to achieve the above is available today in the Big Data technologies of the Apache Hadoop ecosystem. The Hadoop Distributed File System (HDFS) accepts unstructured data and can store the data feeds with a high level of redundancy on commodity hardware, requiring no database or schema definition upfront. HDFS itself is a file system, not a database. Once the data feeds are ingested into the big data repository, the ‘Data Lake’, MapReduce applications can process the data to create insights and meaningful information when needed. MapReduce applications are distributed applications that run where the data is, among the nodes of the ‘Data Lake’, and provide a very high level of horizontally scaled processing. They can also provide charging-specific information to the billing system, effectively replacing the billing-centric mediation systems.
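The idea of storing feeds without an upfront schema and interpreting them only when they are read can be sketched in a few lines of Python. The two record layouts, the field names and the functions `read_usage` and `bytes_per_cell` are assumptions made up for this illustration; real switch feeds will look different.

```python
import json

# Hypothetical raw feed lines, kept exactly as received from different sources.
RAW_FEED = [
    '{"subscriber": "S1", "cell": "T-100", "bytes": 5000, "signal_dbm": -71}',
    "S2|T-100|1200",
    '{"subscriber": "S1", "cell": "T-200", "bytes": 800}',
    "corrupt line",
]


def read_usage(line):
    """Apply a schema at read time; return None for lines we cannot interpret."""
    line = line.strip()
    if line.startswith("{"):
        try:
            record = json.loads(line)
            return {
                "subscriber": record["subscriber"],
                "cell": record["cell"],
                "bytes": int(record["bytes"]),
            }
        except (ValueError, KeyError, TypeError):
            return None
    parts = line.split("|")
    if len(parts) == 3 and parts[2].isdigit():
        return {"subscriber": parts[0], "cell": parts[1], "bytes": int(parts[2])}
    return None


def bytes_per_cell(lines):
    """Summarize data volume per cell tower from whatever lines can be parsed."""
    totals = {}
    for line in lines:
        record = read_usage(line)
        if record is None:
            continue
        totals[record["cell"]] = totals.get(record["cell"], 0) + record["bytes"]
    return totals


cell_totals = bytes_per_cell(RAW_FEED)
print(cell_totals)
```

Note that the extra `signal_dbm` attribute stays in the stored line even though this particular reader ignores it, so a later analysis can pick it up without any change to how the feed was ingested.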

By creating data lakes of mediation data and wiring in additional information feeds, CSPs can create meaningful datasets, which can be analyzed and correlated to produce new insights that can shape network planning, customer care and product design. Insights into usage patterns and subscriber behavior can provide opportunities for creating personalized offerings. Segment-level usage analytics can create opportunities for targeted marketing and for network development to serve the targeted segments.


The telcos also gain the advantage of having the usage information outside the billing systems: they can tie the rest of their information systems to the ‘Data Lake’ at a much lower cost and without depending on the billing systems. With petabytes of information about the use of their network and the usage patterns of the applications offered on top of the network layer, should communications service providers not use this information to their competitive advantage, as Google and Yahoo have done for the World Wide Web?