How we’re using machine learning to fight shell selling
In this first in an occasional series, we’re taking a look at machine learning initiatives at WePay — the kinds of problems we use machine learning for, how we build technology to address them, and how the unique challenges of the payments industry shape our approach.
We thought the best introduction would be to look at an actual fraud problem we face, shell selling, and how we built the algorithm we’re now using to solve it.
What is shell selling?
While fraud is a concern pretty much anywhere money is exchanged for a good or service, there are certain types of fraud that are unique to platforms — those services that act as an intermediary in a transaction for the purposes of making it easier. Whereas a traditional buyer or seller just has to worry that the other party is a fraudster, platforms need to be worried about both sides of the transaction. If either of them is committing credit fraud, it will often be the platform itself that will be footing the bill for the refund if/when the real cardholder finds out and reverses the charges.
Shell selling is one of the types of fraud that’s of particular concern in this situation. Basically, it’s what happens when both sides of a transaction are fraudulent — there’s a criminal with two accounts paying themselves with a stolen credit card. Once the real cardholder finds out, the fraudster disappears with the money they’ve stolen and the platform is left to foot the bill for the chargeback.
Shell selling can be tough to spot, because these fraudsters keep a low profile. They don’t generally have many “real” customers, so you can’t rely on user feedback scores the way you can with more traditional scammers. And while it’s obvious when a merchant gets a bunch of payments from different cards at the same IP in a short amount of time, the fraudsters who perpetrate this crime tend to be more sophisticated than that. They often employ all manner of techniques to hide their identity and evade detection.
Since shell selling is a common problem, and one that’s difficult for humans to spot, we decided to build a machine learning algorithm to help us catch it.
A note on how we build machine learning algorithms
At WePay, we build our entire machine learning pipeline in Python, using the popular, open-source scikit-learn machine learning package. If you haven’t used scikit-learn, we highly recommend you check it out. For things like fraud modeling where you need to retrain constantly and deploy quickly, it offers a lot of advantages:
- Scikit-learn uses a uniform API for model fitting and prediction across different machine learning algorithms, making it really efficient to reuse code from algorithm to algorithm.
- Scoring web services can be directly hosted with Python using Django or Flask, making deployment much simpler. One need only install scikit-learn and copy the exported model file and necessary data processing pipeline code to the web service instance to start.
- The entire model development and deployment cycle is self-contained in Python. This gives us an advantage over other popular languages for machine learning like R or SAS, which require converting the model to another language before it runs in production. In addition to simplifying development by eliminating unnecessary steps, this gives us more flexibility to try different algorithms, because there are many that don’t handle this conversion process especially well and thus would be more trouble than they’re worth in another environment.
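To make the first two points concrete, here is a minimal sketch of what the uniform API and the export step can look like. The data is synthetic, and the candidate models, their settings, and the in-memory buffer are illustrative assumptions rather than our production setup.

```python
import io

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): real fraud features are not shown here.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Every scikit-learn classifier exposes the same fit / predict_proba calls,
# so the comparison loop does not change from algorithm to algorithm.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "ada_boost": AdaBoostClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X, y)
    scores[name] = model.predict_proba(X)[:, 1]  # probability of class 1

# Export the fitted model and load it back, as a scoring service would.
# A BytesIO buffer stands in for the model file copied to the web service.
buffer = io.BytesIO()
joblib.dump(candidates["random_forest"], buffer)
buffer.seek(0)
restored = joblib.load(buffer)
```

The restored model produces the same scores as the original, which is what lets the same Python object move from development to a Django or Flask scoring service without a conversion step.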
The algorithm: Random Forest
Getting back specifically to shell selling, we tested several algorithms before we settled on the one that gave us the best performance: Random Forest.
Random Forest is a tree-based ensemble method developed by Leo Breiman and Adele Cutler, and first put forward by Breiman in a peer-reviewed article in the journal Machine Learning in 2001. Random Forest trains many decision trees on random subsets of the training data, and then uses the mean prediction of the individual trees as the final prediction. The random subsets are drawn from the original training data by sampling with replacement (bootstrapping) on the record level, and by random subsampling on the feature level.
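The averaging step is easy to check in scikit-learn. The sketch below uses synthetic data and arbitrary settings (both assumptions), fits a small forest, and averages the per-tree probabilities by hand. One detail worth noting: scikit-learn's implementation draws the random feature subset at each split rather than once per tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=400, n_features=12, random_state=1)

forest = RandomForestClassifier(
    n_estimators=25,
    bootstrap=True,       # each tree sees records sampled with replacement
    max_features="sqrt",  # size of the random feature subset tried at each split
    random_state=1,
).fit(X, y)

# Average the class probabilities of the individual trees by hand.
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

# The forest's own prediction is that same mean.
forest_proba = forest.predict_proba(X)
```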
Random Forest gave us the best precision at fixed recall of the algorithms we tried, followed closely by neural networks and another ensemble method, AdaBoost. Compared to other algorithms, Random Forest had a number of advantages for the kinds of fraud data we work with, which are ultimately why it won out:
- Tree-based ensemble methods handle both non-linearity and non-monotonicity well, and both are quite common in fraud signals. By comparison, neural networks handle non-linearity quite well but get tripped up by non-monotonicity, and logistic regression cannot handle either. For the latter two methods to deal with non-linearity and/or non-monotonicity, extensive and careful feature transformations are required.
- Random Forest requires minimal feature preparation and transformation. It does not require standardization of input variables the way neural networks and logistic regression do, nor does it require binning and risk-rating conversion for non-monotonic variables.
- Random Forest gave the best out-of-the-box performance of the algorithms we compared. Another tree-based method, Gradient Boosted Trees, can achieve comparable performance, but requires more parameter tuning.
- Random Forest outputs feature importance as a by-product of model training, which is very useful for feature selection.
- Random Forest has better tolerance for overfitting compared with other algorithms, and it can handle a large number of variables without seeing much overfitting, since overfitting can be reduced with more trees. Variable selection and reduction are not as critical as they are for other algorithms.
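As a rough illustration of the feature importance point, scikit-learn exposes the importances on the fitted model as `feature_importances_`. The data, the feature names, and the choice to keep the top three are all hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): 3 informative features plus 5 noise features.
X, y = make_classification(
    n_samples=1000,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances come for free after fitting; they are normalized to sum to 1.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

# A simple selection rule (assumption): keep the three highest-ranked features.
top_k = [name for name, _ in ranked[:3]]
```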
Here’s how Random Forest performed vs. the competition:
Training the algorithm
Our machine learning pipeline follows a standard procedure, which includes data extraction, data cleaning, feature derivation, feature engineering and transformation, feature selection, model training, and model performance evaluation:
After an extensive period of training, our random forest algorithm for shell selling identification is now live and actively stopping fraud. It was a lot of work to select, train, and deploy this algorithm, but it has made our risk processes even more robust and capable of catching more fraud with less manual review. At the same fraud recall rate, the model’s precision is about 2 to 3 times higher than that of constantly tuned and optimized rules.
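For readers who want to run the same kind of comparison, "precision at fixed recall" can be computed from scikit-learn's `precision_recall_curve`. The helper name `precision_at_recall` and the toy labels and scores below are hypothetical, not our data.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def precision_at_recall(y_true, scores, target_recall):
    """Best precision among thresholds whose recall is at least target_recall."""
    precision, recall, _ = precision_recall_curve(y_true, scores)
    return float(precision[recall >= target_recall].max())


# Toy example (assumption): 1 marks fraud, and higher scores mean more suspicious.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
model_scores = np.array([0.1, 0.2, 0.9, 0.3, 0.8, 0.35, 0.4, 0.7])

p_at_75 = precision_at_recall(y_true, model_scores, 0.75)   # 1.0
p_at_100 = precision_at_recall(y_true, model_scores, 1.0)   # 0.8
```

Fixing recall first and then comparing precision puts a model and a rule set on the same footing: both must catch the same share of fraud, and the one that flags fewer legitimate payments along the way wins.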
In addition to the obvious benefits of having the algorithm live, we also learned a lot about our data and our approach in the process:
- Through the feature selection process, we found that the most predictive features for this kind of fraud are velocity-type variables. These include things like transaction volume by user, device, true IP, and credit card. We also found that account-linking features based on device ID, bank accounts, and credit cards are quite useful, such as multiple accounts logged on to one device or multiple withdrawals to one bank account.
- Risk ratings of some categorical variables, such as email domain, application ID, user country, and hour of the day, also proved highly predictive.
- Digital footprints such as browser language, OS fonts, screen resolution, user agent, and Flash version were somewhat useful in fighting fraud. Somewhat more predictive was the presence of practices people use to hide their digital footprints, like VPN tunneling or the use of virtual machines and Tor.
- We also found that model performance deteriorates quickly. This wasn’t really a surprise: fraudsters change their methods constantly to avoid detection, so even the best model will eventually become outdated if it doesn’t change as well. But we were surprised at how quickly this happens. For shell selling, precision drops by one half in just the first month after the model has been trained. Hence, refreshing the model frequently to maintain high detection precision is crucial to the success of fraud detection.
- Unfortunately, frequent refreshes present their own problems. While refreshing the model as frequently as possible is ideal, one needs to be careful when using the most recent transaction data for training models. Fraud labels can take as long as a month to mature, so using data that’s too recent may actually contaminate the model. Despite what we initially assumed, online learning with the most current data doesn’t always deliver the best results.
- Random Forest is a superior machine learning algorithm for producing high-performance models. However, it is mostly used as a black-box method. This is an issue, because we’re not trying to cut humans out of the process entirely, and probably couldn’t even if we wanted to. Human analysts always want reason codes that tell them why things were flagged to guide their case review. But random forest, by itself, can’t readily provide reason codes. Interpreting data from the model is difficult, and may involve digging into the structure of the “forest”, which can significantly increase the scoring time. To combat this problem, WePay’s data science team actually had to invent a new proprietary method for generating reason codes from random forest algorithms, which we’ve filed a provisional patent on.
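Two of the lessons above, velocity-type features and label maturity, are simple enough to sketch in plain Python. The field names `ts` and `device_id`, the one-hour window, and the 30-day maturity cutoff are illustrative assumptions rather than values from a production system.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def trailing_counts(transactions, key, window):
    """For each transaction, count earlier ones with the same `key` inside `window`.

    `transactions` must already be sorted by timestamp.
    """
    recent = defaultdict(deque)
    counts = []
    for txn in transactions:
        seen = recent[txn[key]]
        # Drop timestamps that have fallen out of the trailing window.
        while seen and txn["ts"] - seen[0] > window:
            seen.popleft()
        counts.append(len(seen))
        seen.append(txn["ts"])
    return counts


def mature_only(transactions, as_of, maturity):
    """Keep only transactions old enough for their fraud labels to have settled."""
    return [t for t in transactions if as_of - t["ts"] >= maturity]


# Hypothetical transactions, sorted by time.
t0 = datetime(2024, 1, 1, 12, 0)
transactions = [
    {"ts": t0, "device_id": "d1"},
    {"ts": t0 + timedelta(minutes=10), "device_id": "d1"},
    {"ts": t0 + timedelta(minutes=20), "device_id": "d2"},
    {"ts": t0 + timedelta(minutes=30), "device_id": "d1"},
    {"ts": t0 + timedelta(hours=3), "device_id": "d1"},
]

# Velocity by device over the trailing hour.
device_velocity = trailing_counts(transactions, "device_id", timedelta(hours=1))

# Training set as of 30 days and 15 minutes after t0: only the first two qualify.
as_of = t0 + timedelta(days=30, minutes=15)
trainable = mature_only(transactions, as_of, timedelta(days=30))
```

The same counting function works for any linking key (user, IP, card), and the maturity filter is the simplest guard against training on labels that may still flip to fraud.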
At WePay, risk management technology is the core of what we do. It’s the thing that lets us bear the considerable fraud risk we shoulder on behalf of the more than 1,200 platforms that use us to settle funds between their users.
That being said, risk management isn’t just about technology. It’s about a seamless partnership between humans and technology. It’s still largely humans that have to think of the ways that fraudsters can attack a payment system and write rules to block them, and it’s still an experienced professional that has to make the judgment call whether to block a transaction when it falls in the gray area between “obvious fraud” and “obviously legitimate,” as it so often does.
And that’s why we’re so excited about machine learning and artificial intelligence. That might seem a bit weird if you’ve been raised on a steady diet of movies where the robot overlords enslave the human race. But it makes perfect sense when you think of what these technologies let you do.
This isn’t Terminator II. When WePay thinks about machine learning, we’re not trying to replace human beings — we just want to make the machines they work with better. We want machine intelligence to be smarter so we can focus human intelligence on the hard problems where it makes the most difference.
If you’d like to hear more about what we’re doing with machine learning, drop us a line at firstname.lastname@example.org.
Jon Xavier contributed to this post.