Patralekha Bhattacharya
Removing duplicate transactions is an important step when building a predictive model, and models built to create credit risk scorecards in the commercial lending industry are no exception. What is specific to this industry, however, is that many duplicate transactions occur because the borrower splits the loan amount into multiple parts and requests the loan piecemeal in order to increase the chances of approval. These individual loans appear in the dataset as separate records, even though they are actually part of the same transaction and only one of them should be kept. The method of removing duplicates may therefore differ from that used in other industries. Since borrowers care more about paying off their larger loans on time than their smaller ones, we prefer to keep the transaction with the largest transaction amount when removing duplicates.
A commonly used method is to remove only the duplicates that appear within the same calendar quarter. This method is easy to implement: a simple sort is enough to keep only the transaction with the highest transaction amount in each quarter. Its drawback is that transactions that took place just one day apart (e.g., March 31st and April 1st) are not candidates for deduping, because they do not belong to the same calendar quarter.
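The calendar-quarter approach can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in the paper: the record layout `(borrower_id, date, amount)`, the function names, and the sample data are all assumptions made for the example.

```python
from datetime import date


def quarter_key(d):
    """Return (year, quarter) for a date, e.g. 2023-03-31 -> (2023, 1)."""
    return (d.year, (d.month - 1) // 3 + 1)


def dedupe_by_quarter(transactions):
    """Keep the largest transaction per borrower per calendar quarter.

    `transactions` is a list of (borrower_id, date, amount) tuples.
    This layout is hypothetical and chosen only for illustration.
    """
    best = {}
    for borrower, d, amount in transactions:
        key = (borrower, quarter_key(d))
        # Keep the record with the highest amount seen so far for this key.
        if key not in best or amount > best[key][2]:
            best[key] = (borrower, d, amount)
    # Return survivors ordered by borrower, then date.
    return sorted(best.values(), key=lambda t: (t[0], t[1]))


# Hypothetical sample data.
sample = [
    ("A", date(2023, 1, 10), 5000),
    ("A", date(2023, 2, 15), 20000),   # same quarter as the row above
    ("B", date(2023, 3, 31), 8000),
    ("B", date(2023, 4, 1), 12000),    # one day later, but next quarter
]
kept = dedupe_by_quarter(sample)
```

In this sample, borrower A's two first-quarter loans collapse to the larger one, while borrower B's two loans both survive even though they are one day apart, which is the drawback described above.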
In this paper we present a different method, which deletes duplicate observations that fall within 90 days of each other. Since the time range we consider depends on the date of each transaction, this method is more complex and may need multiple iterations. However, if done correctly, it is more accurate and correctly removes more duplicate transactions than the calendar quarter deduping method described above.
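One possible way to read the 90-day rule is sketched below in Python. It is not the paper's algorithm, whose details are not given in this abstract. The sketch replaces the iterative procedure with a greedy pass that visits each borrower's transactions from largest to smallest and keeps one only if no already-kept transaction lies within the window. The record layout `(borrower_id, date, amount)`, the inclusive treatment of the 90-day boundary, the tie-breaking rule, and all names are assumptions.

```python
from collections import defaultdict
from datetime import date

WINDOW_DAYS = 90  # assumption: "within 90 days" is treated as inclusive


def dedupe_within_window(transactions, window_days=WINDOW_DAYS):
    """Keep the largest transaction among those within `window_days` of each other.

    `transactions` is a list of (borrower_id, date, amount) tuples
    (a hypothetical layout used only for this sketch).
    """
    by_borrower = defaultdict(list)
    for rec in transactions:
        by_borrower[rec[0]].append(rec)

    kept = []
    for recs in by_borrower.values():
        survivors = []
        # Visit largest amounts first; ties go to the earlier date (assumption).
        for rec in sorted(recs, key=lambda r: (-r[2], r[1])):
            # Keep the record only if it is more than `window_days` away
            # from every transaction already kept for this borrower.
            if all(abs((rec[1] - s[1]).days) > window_days for s in survivors):
                survivors.append(rec)
        kept.extend(survivors)
    # Return survivors ordered by borrower, then date.
    return sorted(kept, key=lambda r: (r[0], r[1]))


# Hypothetical sample data.
sample_window = [
    ("B", date(2023, 3, 31), 8000),
    ("B", date(2023, 4, 1), 12000),   # one day after the row above
    ("B", date(2023, 9, 1), 3000),    # more than 90 days from both
]
kept_window = dedupe_within_window(sample_window)
```

Here the March 31st loan is dropped in favor of the larger April 1st loan even though the two fall in different calendar quarters, and the September loan is kept because it lies outside the window. Other readings of the rule, such as chaining windows forward from the earliest transaction, would give different results on some inputs.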