Churn prediction is currently a relevant subject in data mining and has been applied in banking, mobile telecommunications, life insurance, and other fields. In fact, any company that deals with long-term customers can take advantage of churn prediction methods. Recently, the mobile telecommunication market has changed from a rapidly growing market into a state of saturation and fierce competition. The focus of telecommunication companies has therefore shifted from building a large customer base to retaining existing customers. For that reason, it is valuable to know which customers are likely to switch to a competitor in the near future. Given a set of data records of telecom customers, each of which belongs to one of a number of predefined classes, the problem is to discover classification rules that allow records with unknown class membership to be correctly classified. In this computer science project we discuss data mining by evolutionary learning (DMEL) using HBase.
Many algorithms have been developed to mine large data sets for classification models, and they have been shown to be very effective. However, many of them are not designed to determine the likelihood of each classification made. Hence we use a new data mining algorithm, called data mining by evolutionary learning (DMEL), to handle churn prediction problems in which the accuracy of each prediction made has to be estimated. In performing its tasks, DMEL searches the space of possible rules using an evolutionary approach.
When identifying interesting rules, an objective interestingness measure is used. The fitness of a solution is defined in terms of the probability that the attribute values of a record can be correctly determined using the rules it encodes. HBase is used to perform operations on multiple records of the database simultaneously, which results in fast retrieval. Experiments with different data sets showed that DMEL is able to effectively discover interesting classification rules.
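To make the evolutionary search concrete, the following is a minimal sketch of evolving classification rules over a toy churn data set. The data, attribute names, and the simple accuracy-based fitness are illustrative assumptions, not the actual DMEL encoding or fitness function, which is defined over full rule sets.

```python
import random

# Toy data set (illustrative): each record is (attribute values, class label).
# Attribute 0: plan type, attribute 1: usage level; class: churn or stay.
RECORDS = [
    (("prepaid", "low"), "churn"),
    (("prepaid", "low"), "churn"),
    (("prepaid", "high"), "stay"),
    (("contract", "low"), "stay"),
    (("contract", "high"), "stay"),
    (("prepaid", "low"), "churn"),
]

ATTR_VALUES = [("prepaid", "contract"), ("low", "high")]
CLASSES = ("churn", "stay")

def predict(rule, attrs):
    """A rule is ((a0, a1), label): if the record matches every non-None
    attribute value, predict label; otherwise predict the default class."""
    cond, label = rule
    if all(c is None or c == a for c, a in zip(cond, attrs)):
        return label
    return "stay"  # default (majority) class

def fitness(rule):
    """Fraction of records classified correctly -- a stand-in for DMEL's
    probability-based fitness."""
    return sum(predict(rule, a) == c for a, c in RECORDS) / len(RECORDS)

def mutate(rule):
    """Randomly perturb one condition slot and re-draw the class label."""
    cond, _ = rule
    i = random.randrange(len(cond))
    new_cond = list(cond)
    new_cond[i] = random.choice((None,) + ATTR_VALUES[i])
    return (tuple(new_cond), random.choice(CLASSES))

def evolve(generations=200, pop_size=20):
    """Keep the fitter half each generation; refill by mutating survivors."""
    random.seed(0)
    pop = [mutate(((None, None), "stay")) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        pop = survivors + [mutate(random.choice(survivors)) for _ in survivors]
    return max(pop, key=fitness)

best = evolve()
```

On this toy data the search converges to the rule "plan = prepaid and usage = low implies churn", which classifies every record correctly.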
DBMiner is a data mining system for interactive mining of multiple-level knowledge in large relational databases. The system implements a wide spectrum of data mining functions, including generalization, characterization, discrimination, association, classification, and prediction. By incorporating several interesting data mining techniques, including attribute-oriented induction, progressive deepening for mining multiple-level rules, and meta-rule-guided knowledge mining, the system provides a user-friendly, interactive data mining environment with good performance. It is an SQL-based system.
Important features of DBMiner
- It incorporates data cube and OLAP technology, attribute-oriented induction, and statistical analysis.
- It implements several data mining functions, such as summarization, association, classification, and prediction.
- It allows the user to specify and adjust various thresholds that control the data mining process, and to perform roll-up or drill-down at multiple levels of abstraction.
- It generates multiple forms of output: graphics, tables, and different kinds of charts.
- It has a user-friendly interface, and for each mining function it has a wizard, which guides the user through the mining process.
ACID Properties
ACID is an acronym for atomicity, consistency, isolation, and durability.
- Atomicity: A transaction is atomic; either all of its operations take effect or none of them do.
- Consistency: A transaction enforces consistency in the system state by ensuring that at the end of any transaction the system is in a valid state.
- Isolation: When a transaction runs in isolation, it appears to be the only action that the system is carrying out at one time.
- Durability: A transaction is durable in that once it has been successfully completed, all of the changes it made to the system are permanent.
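The atomicity property can be demonstrated with a small SQLite example; the `accounts` table and balances below are assumptions for illustration. A transfer that violates a constraint is rolled back as a unit, so neither of its updates survives.

```python
import sqlite3

# In-memory database with a CHECK constraint (illustrative schema).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE accounts "
    "(name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
con.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
con.commit()

# Atomicity: the overdraft violates the CHECK constraint, so the whole
# transaction rolls back -- the credit to bob never happens either.
try:
    with con:  # begins a transaction; commits on success, rolls back on error
        con.execute("UPDATE accounts SET balance = balance - 200 "
                    "WHERE name = 'alice'")
        con.execute("UPDATE accounts SET balance = balance + 200 "
                    "WHERE name = 'bob'")
except sqlite3.IntegrityError:
    pass  # the failed transaction was rolled back automatically

balances = dict(con.execute("SELECT name, balance FROM accounts"))
# Both balances are unchanged: {'alice': 100, 'bob': 50}
```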
DBMiner works on top of a customized data cube, built using either a multidimensional array (for small and medium-sized data sets) or relational tables (for large data sets). In order for DBMiner to work, the user must define the data on which the OLAP or mining functions will be performed. The first step in defining the data set is to choose the DB server used to manipulate it. After choosing the DB, the user can browse it to view the tables it contains. He/she can also browse a table to see its structure, but cannot see the data contained in the tables.
All these browsing operations are very easy to perform; the user just has to select the object he/she wants to browse. DBMiner can only work on a single table or view. If the user wants DBMiner to work on multiple tables, he/she has to define a query that produces a single view integrating these tables. If the user wants to work on a single table, he/she has to import a data-mart (actually a table). This operation is very easy and can be done by choosing some items from a menu. If he/she wants to use more than one table (define a view), then he/she must create a data-mart. This operation is really difficult for a typical user because, in order to perform it, he/she has to know the table structures and must have some knowledge of SQL.
Limitations
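The view-based data-mart described above can be sketched with SQLite. The `customers` and `usage` tables and their columns are hypothetical; the point is that a single view joins multiple tables into the one relation a mining tool can consume.

```python
import sqlite3

# Hypothetical customer and usage tables (illustrative schema and data).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, plan TEXT);
CREATE TABLE usage (customer_id INTEGER, minutes INTEGER);
INSERT INTO customers VALUES (1, 'prepaid'), (2, 'contract');
INSERT INTO usage VALUES (1, 120), (1, 80), (2, 300);

-- A single view integrating both tables, acting as the data-mart
-- on which a mining function would operate.
CREATE VIEW mining_view AS
SELECT c.id, c.plan, SUM(u.minutes) AS total_minutes
FROM customers c JOIN usage u ON u.customer_id = c.id
GROUP BY c.id, c.plan;
""")

rows = list(con.execute("SELECT * FROM mining_view ORDER BY id"))
# rows -> [(1, 'prepaid', 200), (2, 'contract', 300)]
```

Writing the JOIN and GROUP BY is exactly the SQL knowledge the text says a typical user lacks.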
DBMiner depends only on MS SQL Server as its back-end and uses MS Excel 2000 as its visualization tool for OLAP browsing. Other functional modules, such as the data dispersion, time-series analysis, and prediction modules, are unavailable. Summarizing the limitations:
- Uses MS SQL Server alone as the back-end and MS Excel for visualization.
- Takes a long time to process data sets with billions of records.
- Time-series data models are not supported.
Limitations of SQL Technique
- Does not scale to a huge inflow of data.
- Consumes a large amount of time to perform operations on larger sets of objects.
- To process 15 minutes of data, MySQL takes around 5-6 hours.
Proposed Solution - Data mining by evolutionary learning using HBase
Owing to the limitations of these existing techniques, we propose a new algorithm, called data mining by evolutionary learning (DMEL), to mine classification rules in databases. When identifying interesting rules, an objective interestingness measure is used. The fitness of a chromosome is defined in terms of the probability that the attribute values of a record can be correctly determined using the rules it encodes. The likelihood of each prediction (or classification) made is estimated, so that subscribers can be ranked according to their likelihood to churn.
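The ranking step can be sketched as follows. Here each rule carries a confidence value, and a subscriber's churn likelihood is taken as the confidence of the strongest matching rule; the rules, attribute names, and confidence values are illustrative assumptions, not DMEL's actual likelihood estimate.

```python
# Illustrative rule set: (conditions, confidence), where confidence is the
# assumed fraction of matching training records that actually churned.
RULES = [
    ({"plan": "prepaid", "usage": "low"}, 0.90),
    ({"plan": "prepaid"}, 0.60),
    ({"usage": "high"}, 0.15),
]

def churn_likelihood(subscriber):
    """Likelihood of churn = confidence of the strongest matching rule."""
    matching = [conf for cond, conf in RULES
                if all(subscriber.get(k) == v for k, v in cond.items())]
    return max(matching, default=0.0)

subscribers = [
    {"id": "s1", "plan": "prepaid", "usage": "low"},
    {"id": "s2", "plan": "contract", "usage": "high"},
    {"id": "s3", "plan": "prepaid", "usage": "high"},
]

# Rank subscribers from most to least likely to churn, so that a
# retention campaign can target the top of the list first.
ranked = sorted(subscribers, key=churn_likelihood, reverse=True)
# ranked ids -> ['s1', 's3', 's2']
```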
The customer data is stored in HBase. In a normal RDBMS we are able to perform operations on only a single record at a time, but in HBase operations are performed such that multiple records are affected simultaneously. Thus HBase helps in fast retrieval of data. Please go through the attached project report for more information.