Detecting credit card fraud by genetic algorithm and scatter search

[toggle title=”English title”]

Detecting credit card fraud by genetic algorithm and scatter search

[/toggle]

[toggle title=”Table of contents”]

Introduction

Problem definition

The GASS algorithm

Discussion and results

Summary and conclusions

[/toggle]

[toggle title=”Translated abstract”]

In this study we developed a method that improves the credit card fraud detection solution currently used in a bank. With this solution each transaction is given a score, and based on these scores the transactions are classified as fraudulent or legitimate. The usual objective of fraud detection solutions is to minimize the number of misclassified transactions. In reality, however, misclassifying one transaction does not have the same effect as misclassifying another: if a card is in the hands of fraudsters, its entire available limit is used up. That loss is what we want to minimize in this study. As the solution method we propose a combination of two well-known meta-heuristics, namely genetic algorithms and scatter search. The method was applied to real data, and very successful results were obtained compared to current practice.
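The difference between counting misclassifications and measuring the limit that is lost can be illustrated with a small Python sketch. This is not the paper's cost function: the `Transaction` type, the two helper functions, the threshold rule (a score at or above the threshold means "fraudulent") and all sample values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    score: float            # hypothetical fraud score given to the transaction
    is_fraud: bool          # true class of the transaction
    available_limit: float  # unused card limit when the transaction occurs


def misclassified_count(transactions, threshold):
    """Count transactions whose predicted class differs from the true class."""
    return sum((t.score >= threshold) != t.is_fraud for t in transactions)


def limit_at_risk(transactions, threshold):
    """Sum the available limit of fraudulent transactions that were missed."""
    return sum(
        t.available_limit
        for t in transactions
        if t.is_fraud and t.score < threshold
    )


# Made-up sample: two frauds (small and large limit) and two legitimate ones.
sample = [
    Transaction(score=0.90, is_fraud=True, available_limit=200.0),
    Transaction(score=0.40, is_fraud=True, available_limit=9000.0),
    Transaction(score=0.70, is_fraud=False, available_limit=1500.0),
    Transaction(score=0.10, is_fraud=False, available_limit=3000.0),
]

for threshold in (0.8, 0.3):
    print(
        threshold,
        misclassified_count(sample, threshold),
        limit_at_risk(sample, threshold),
    )
```

In this made-up sample both thresholds misclassify exactly one transaction, so a count-based measure cannot tell them apart. At a threshold of 0.8 the missed transaction is the fraud on the card with 9000 of available limit, while at 0.3 the only error is a false alarm and no limit is exposed.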

[/toggle]

[toggle title=”Translated introduction”]

This study grew out of an industrial consultancy project. Our industrial partner, a major bank in Turkey, has for several years used a credit card fraud detection solution that was developed in-house. Although the solution has been regarded as successful, the bank's authorities believed it could be improved further, for two reasons. First, the weights of the parameters it uses could be tuned better using recent card usage behavior and recently observed frauds. Second, it had become clear that a good solution is not necessarily one that detects many frauds, but one that may detect fewer frauds carrying larger risk.

Fraud can be defined as the illegal use of any system or good; correspondingly, legal activity is called legitimate. Fraud occurs in many domains, including banking, insurance, telecommunications, health care and public services. In banking, it can be observed in the use of credit cards, debit cards, internet banking accounts and call centers (telephone banking). Money laundering and personnel fraud are other banking-related fraud types. Losses due to fraud add up to huge amounts and pose a major threat to the legal economy. Because of its importance, the subject has attracted the interest of many researchers: according to ISI Web of Knowledge data, a search for the keyword “fraud” returns 1361 articles published during the ten years 1999-2009.

In this paper we consider only credit card fraud. When we analyzed the data of our industrial partner and several other banks, we observed that only several out of every 100,000 transactions are fraudulent; the rest are legitimate. This extreme imbalance between the two classes makes fraud detection a challenging task. Fraud detection is usually treated as a data mining problem in which the objective is to classify transactions correctly as legitimate or fraudulent. Many performance measures have been defined for classification problems, most of them related to the number of correctly classified cases. The most popular are the accuracy ratio, the capture rate, the hit rate, the Gini index and the lift (Gadi, Wang, & Lago, 2008; Kim & Han, 2003).

In line with this popularity, the literature contains many studies on fraud detection using various data mining algorithms, including decision trees, regression and artificial neural networks. Quah and Srinagesh (2008) propose a framework that can be applied in real time: first an outlier analysis is carried out separately for each customer using self-organizing maps, and then a predictive algorithm classifies the abnormal-looking transactions. Panigrahi, Kundu, Sural, and Majumdar (2009) propose a four-component fraud detection solution whose components are connected serially. The main idea is first to determine a set of suspicious transactions and then to run a Bayesian learning algorithm on this set to predict the frauds. Sanchez, Vila, Cerda, and Serrano (2009) take a different approach, using association rule mining to define patterns of normal card usage and flagging transactions that do not fit these patterns as suspicious. Bolton and Hand (2002) provide a very good summary of the literature on fraud detection problems. In these studies, the performance of the algorithms is mostly measured with the measures listed above.

When fraudsters obtain a card, they usually spend its entire available (unused) limit. According to the statistics, they do this in four or five transactions on average. Therefore, although the measures above are quite relevant to fraud detection, the bank's authorities pointed out that a more important criterion is the loss that can be saved on the cards whose transactions are identified as fraudulent. In other words, detecting a fraud on a card with a large available limit is more valuable than detecting one on a card with a small available limit. What we face, then, is a classification problem with variable misclassification costs. Classical data mining (DM) algorithms are not designed for such a misclassification cost structure, so they are not directly applicable to our case (they work well when the objective is to minimize the number of incorrectly classified cases). Either they must be modified or new algorithms must be developed specifically for this purpose. (Some DM software packages, such as SAS Enterprise Miner or SPSS PASW Modeler, do allow different misclassification costs for the two classes, but the ratio between the costs must be fixed, so they are not sufficient for our case.)

Since the classical DM algorithms cannot be used directly, we need alternative methods for our classification problem. We thought that meta-heuristic algorithms, which are applicable to many different problem domains, could serve this purpose. After analyzing the main characteristics of the meta-heuristic algorithms, we decided to use the genetic algorithm (GA) and scatter search (SS) in combination, and we named this hybrid solution method GASS. Genetic algorithms are evolutionary algorithms that aim to obtain better solutions as time progresses (Mitchell, 1998). Since their introduction by Holland (1975), they have been applied successfully to many domains, from astronomy (Charbonneau, 1995) to sports (Charbonneau, 1995), and from optimization (Levi, Burrows, Fleming, & Hopkins, 2007; Krzysztof & Peter, 2004) to computer science (Kaya, 2010). They have also been used in data mining, mainly for variable selection (Bidgoli, Kashy, Kortemeyer, & Punch, 2003), and are mostly coupled with other DM algorithms. Scatter search is another type of evolutionary algorithm, first introduced by Glover (1977). It was then almost forgotten for about 20 years, until it was reintroduced in 1997 (Glover, 1997); since then it has been applied to many different problems. However, to the best of our knowledge, nobody has so far used it for DM problems.

The contribution of this study to the literature is twofold. First, a new classification cost function for the fraud detection problem is introduced. Second, a novel implementation of two well-known meta-heuristic algorithms is presented. The rest of the paper is organized as follows. The next section describes in detail the fraud detection problem we faced, together with the detection system currently used by our industrial partner. Section 3 briefly summarizes the basic principles of genetic algorithms and scatter search and then gives the details of the GASS implementation. Section 4 discusses the results obtained on the sample databases and the selection of the best solution parameters, and also presents a sensitivity analysis with respect to the parameter values. Section 5 concludes the paper with a summary of the study and its main findings.

[/toggle]

[toggle title=”English introduction”]

This study is motivated by an industrial consultancy project. Our industrial partner (a major bank in Turkey) has been using an internally developed credit card fraud detection solution for some years. Although that solution has been regarded as successful, the bank authorities thought that it could be improved further for two reasons. First, the weights of the parameters used could be better adjusted using the recent card usage behaviors and the frauds that have occurred. Second, it has been understood that a good solution is not necessarily the one detecting many frauds but the one detecting frauds maybe fewer in number but larger in risk. Fraud can be defined as the illegal usage of any system or good. Correspondingly, legal activities can be called legitimate. We can encounter fraud in a variety of different domains including banking, insurance, telecommunications, health care and public services. In banking, frauds can be observed in the use of credit cards, debit cards, internet banking accounts and call centers (telephone banking). Money laundering and personnel fraud are the other banking related fraud types. The losses due to fraud add up to huge amounts and are a major threat to the legal economy. Owing to its importance, it has attracted the interest of many scientists. According to ISI Web of Knowledge data, a search with the keyword “fraud” finds 1361 articles published during the last 10 years (1999–2009). In this study we are concerned only with credit card frauds. When we analyzed the data of our industrial partner and several other banks, we observed that only several out of 100,000 transactions are fraudulent. The rest are legitimate. This extremely high imbalance between the two classes makes fraud detection a challenging task. Fraud detection has usually been seen as a data mining problem where the objective is to correctly classify the transactions as legitimate or fraudulent. 
For classification problems many performance measures are defined, most of which are related to the number of cases classified correctly. Among these the accuracy ratio, the capture rate, the hit rate, the Gini index and the lift are the most popular ones (Gadi et al., 2008; Kim and Han, 2003). Parallel to its popularity, in the literature there are many studies on fraud detection using various data mining algorithms including decision trees, regression and artificial neural networks. Quah and Srinagesh (2008) suggest a framework which can be applied in real time, where first an outlier analysis is made separately for each customer using self-organizing maps and then a predictive algorithm is utilized to classify the abnormal looking transactions. Panigrahi, Kundu, Sural, and Majumdar (2009) suggest a four component fraud detection solution which is connected in a serial manner. The main idea is first to determine a set of suspicious transactions and then run a Bayesian learning algorithm on this list to predict the frauds. Sanchez, Vila, Cerda, and Serrano (2009) presented a different approach and used association rule mining to define the patterns for normal card usage and to flag the transactions not fitting these patterns as suspicious. The study of Bolton and Hand (2002) provides a very good summary of literature on fraud detection problems. In these studies, the performance of the algorithms is mostly measured by the above measures. When the fraudsters obtain a card, they usually use (spend) its entire available (unused) limit. According to the statistics, they do this in four or five transactions, on the average. Thus, for the fraud detection problem, although the above mentioned measures are quite relevant, as indicated by the bank authorities, a measure of the loss that can be saved on the cards whose transactions are identified as fraud is more prominent. 
In other words, detecting a fraud on a card having a large available limit is more valuable than detecting a fraud on a card having a small available limit. As a result, what we are faced with is a classification problem with variable misclassification costs. As the classical DM algorithms are not designed for such a misclassification cost structure, they are not directly applicable to our case (they work well when the objective is to minimize the incorrectly classified number of cases). Either some modifications should be made on them or new algorithms should be developed specifically for this purpose (actually in some popular DM software packages like SAS Enterprise Miner or SPSS PASW Modeler, it is possible to introduce different misclassification costs for the two classes but there has to be a fixed ratio between them and thus they are not sufficient to handle our case). As the classical DM algorithms are not directly usable, we need alternative methods for our classification problem. In this regard, we thought that the meta-heuristic algorithms which are applicable to many different problem domains could serve. After analyzing the main characteristics of the popular meta-heuristic algorithms, for our problem we decided to use the genetic algorithm (GA) and the scatter search (SS) in a combined manner. We called our hybrid solution method GASS. Genetic algorithms are evolutionary algorithms which aim at obtaining better solutions as time progresses (Mitchell, 1998). Since their first introduction by Holland (1975), they have been successfully applied to many problem domains from astronomy (Charbonneau, 1995) to sports (Charbonneau, 1995), from optimization (Levi et al., 2007; Krzysztof and Peter, 2004) to computer science (Kaya, 2010), etc. They have also been used in data mining mainly for variable selection (Bidgoli, Kashy, Kortemeyer, & Punch, 2003) and are mostly coupled with other DM algorithms. Scatter search is another type of evolutionary algorithm. 
It was first introduced by Glover (1977). Afterwards, it was almost forgotten for about 20 years, and since its re-introduction in 1997 (Glover, 1997) it has been applied to many different problems. However, to the best of our knowledge nobody has used it in DM problems so far. The contributions of this study to the literature are twofold. First, a new classification cost function for the fraud detection problem is introduced. Secondly, a novel implementation of two well-known meta-heuristic algorithms is presented. The rest of the paper is organized as follows. In the next section, the fraud detection problem we were faced with is described in detail together with the current detection system used by our industrial partner. Section 3 briefly summarizes the basic principles of genetic algorithms and scatter search and then details the GASS implementation. The results obtained on the sample databases and the selection of the best solution parameters are discussed in Section 4. The sensitivity analysis regarding the parameter values is also made and presented in this section. The paper is finalized in Section 5 by providing the summary of the study and the major conclusions reached.

[/toggle]

[toggle title=”Source”]

Journal : Expert Systems with Applications, Volume 38, Issue 10, 15 September 2011, Pages 13057–13063
Publisher : Science Direct (Elsevier)

[/toggle]


Paper file: 7-page PDF

Translation file: 14-page Word document

Publication year: 2011
