THE SQL Server Blog Spot on the Web

SQLblog.com

Dejan Sarka

Fraud Detection with the SQL Server Suite Part 4

This is the fourth part of the fraud detection whitepaper. You can find the first part, the second part, and the third part in my previous blog posts about this topic.

Data Mining Models

We create multiple mining models by using different algorithms, different input data sets, and different algorithm parameters. Then we evaluate the models in order to find the most appropriate candidates for the actual deployment to production.

Many different algorithms can be used for fraud detection, and it is difficult to say which one would generally yield the best result. In a project, the algorithms are typically chosen from those available, based on experience and knowledge of the given domain. Because we use the Microsoft SQL Server suite, we use the Microsoft Decision Trees, Microsoft Neural Network, and Microsoft Naïve Bayes directed algorithms, and Microsoft Clustering as the undirected one. In recent years, Support Vector Machines methods have become more and more popular. SSAS does not include this algorithm out of the box. However, it can be downloaded as a free plug-in algorithm for SSAS from the Microsoft CodePlex site at

Valkonet, J. (2008). Support Vector Machine plug-in in Analysis Services. Retrieved from Microsoft CodePlex: http://svmplugin.codeplex.com/

Of course, if there are time and software policy constraints that prevent us from using this download, we simply skip it. We do not lose much, because, according to

Sahin Y., & Duman E. (2011). Detecting Credit Card Fraud by Decision Trees and Support Vector Machines. Proceedings of the International MultiConference of Engineers and Computer Scientists 2011 Vol 1. Hong Kong: IMECS.,

the Decision Trees algorithm usually yields better results in fraud detection analysis than Support Vector Machines. For details on specific data mining algorithms, please refer to

Han J., Kamber M., & Pei J. (2011). Data Mining: Concepts and Techniques, Third Edition. Morgan Kaufmann,

or to the SolidQ course

Sarka D. (2012). Data Mining with SQL Server 2012. SolidQ. Retrieved from http://www.solidq.com/squ/courses/Pages/Data-Mining-with-SQL-Server-2012.aspx.

We evaluate the efficiency of different supervised models by using standard techniques, namely the Lift Chart, the Classification Matrix, and Cross Validation. All of these techniques are built into the SSAS data mining feature and are described in more detail in

MacLennan J., Tang Z., & Crivat B. (2009). Data Mining with Microsoft SQL Server 2008. Wiley.
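As a rough illustration of what two of these techniques measure, here is a minimal Python sketch. It is not the SSAS implementation; it assumes we have already exported the actual fraud flags (1 = fraud, 0 = not fraud) and the predicted fraud probabilities from a model. The function names and the 0/1 encoding are our own choices for this example.

```python
from collections import Counter


def classification_matrix(actual, predicted):
    """Count the four outcomes of a binary fraud prediction (1 = fraud)."""
    counts = Counter(zip(actual, predicted))
    return {
        "true_positive": counts[(1, 1)],
        "false_negative": counts[(1, 0)],
        "false_positive": counts[(0, 1)],
        "true_negative": counts[(0, 0)],
    }


def lift_at(actual, scores, fraction):
    """Lift in the top `fraction` of cases, ranked by predicted probability.

    Lift = fraud rate among the selected cases / fraud rate overall.
    Assumes that `actual` contains at least one fraud.
    """
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    base_rate = sum(actual) / len(actual)
    return top_rate / base_rate
```

A lift of 2.5 at 20% would mean that the top-scored fifth of the cases contains frauds at 2.5 times the overall rate, which is the kind of comparison the Lift Chart shows across all population fractions.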

To evaluate the Clustering models, we have developed a technique of our own, based on entropy. If the individual clusters are homogeneous, the entropy in any given cluster must be low. We calculate the average entropy and the standard deviation of the entropy across all clusters. In an SSAS Clustering model that has been trained (or processed), it is possible to read the model data with DMX queries. In the cluster nodes we can identify the distribution of the input variables, and then use it to calculate the entropy.
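The calculation can be sketched in Python as follows. This is only an illustration under our own assumptions, not the exact procedure used in the project: we assume the state probabilities of each discrete input variable have already been read from the model content for every cluster, and we take the entropy of a cluster to be the plain average of the entropies of its input variables. The data layout and function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def entropy(probabilities):
    """Shannon entropy (in bits) of one discrete distribution."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


def cluster_entropy(attribute_distributions):
    """Assumption: a cluster's entropy is the mean entropy of its input variables."""
    return mean(entropy(dist.values()) for dist in attribute_distributions.values())


def summarize(clusters):
    """Return per-cluster entropy, plus the average and standard deviation."""
    per_cluster = {name: cluster_entropy(attrs) for name, attrs in clusters.items()}
    values = list(per_cluster.values())
    return per_cluster, mean(values), pstdev(values)


# Hypothetical distributions, as they might be read from the cluster nodes.
clusters = {
    "Cluster 1": {"Gender": {"M": 0.5, "F": 0.5}, "Region": {"North": 1.0}},
    "Cluster 2": {"Gender": {"M": 1.0, "F": 0.0}, "Region": {"North": 1.0}},
}
per_cluster, average, deviation = summarize(clusters)
```

Models with a lower average entropy have more homogeneous clusters, so this gives one number per model that can be compared across candidate Clustering models.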

From experience, we have learned that not all algorithms are equally useful for all data sets. The Microsoft Neural Network algorithm works best when the frequency of the target state (i.e. fraud) is about 50%. Microsoft Naïve Bayes can work well when the target state is represented by approximately 10% of the population or more. However, Microsoft Decision Trees works well even if the target state frequency is only about 1%, and is thus a very suitable algorithm for small data sets and a low frequency of the target state as well.

The Continuous Learning Cycle

The continuous learning cycle is shown graphically in Figure 1.

Figure 1: The continuous learning cycle

We start by creating the directed models, assuming that the customer has already flagged frauds in the existing data. We evaluate the directed models and then use the best one to predict the frauds in the new data. We also create the undirected models, evaluate them, and use the best one to select potential frauds. We do this over time and check the difference between the number or percentage of frauds caught with the deployed directed model and with the deployed undirected model. When this difference drops, it is time to refine the directed model. In addition, we store the predictions of both models and the actual, confirmed or reported frauds in a data warehouse. When the percentage of the predicted frauds in the total number of frauds drops, it is time to refine both models. We use an OLAP cube on top of the data warehouse to measure the efficiency of the models over time.
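The two refinement checks described above can be sketched in Python as follows. This is a simplified illustration, not the cube-based solution: it assumes we have the confirmed fraud counts and the number caught by each model for two consecutive periods, and the `tolerance` threshold of five percentage points is a hypothetical value that would have to be tuned for a real project.

```python
from dataclasses import dataclass


@dataclass
class PeriodStats:
    confirmed_frauds: int      # actual, confirmed or reported frauds
    caught_by_directed: int    # of those, predicted by the directed model
    caught_by_undirected: int  # of those, selected by the undirected model
    caught_by_either: int      # of those, flagged by at least one model


def refinement_advice(previous, current, tolerance=0.05):
    """Compare two periods and suggest which models to refine."""
    def coverage(stats):
        return stats.caught_by_either / stats.confirmed_frauds

    def gap(stats):
        directed = stats.caught_by_directed / stats.confirmed_frauds
        undirected = stats.caught_by_undirected / stats.confirmed_frauds
        return directed - undirected

    # Share of all frauds that were predicted at all has dropped.
    if coverage(previous) - coverage(current) > tolerance:
        return "refine both models"
    # The directed model's advantage over the undirected one has dropped.
    if gap(previous) - gap(current) > tolerance:
        return "refine directed model"
    return "keep both models"
```

For example, if the directed model caught 80 of 100 confirmed frauds in one period and only 65 of 100 in the next, while the undirected model stayed at 50, the sketch would suggest refining the directed model.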

About Dejan Sarka

Dejan Sarka, MCT and SQL Server MVP, is an independent consultant, trainer, and developer focusing on database & business intelligence applications. His specialties are advanced topics like data modeling, data mining, and data quality. On these toughest topics, he works and researches together with SolidQ and The Data Quality Institute. He is the founder of the Slovenian SQL Server and .NET Users Group. Dejan Sarka is the main author or coauthor of eleven books about databases and SQL Server, with more to come. Dejan Sarka has also developed, and continues to develop, many courses and seminars for SolidQ, Microsoft and Pluralsight. He has been a regular speaker at many conferences worldwide for more than 15 years, including Microsoft TechEd, PASS Summit and others.