by Hyorim (Grace) Lee 

To keep customer service departments from violating the prohibition on Unfair, Deceptive, or Abusive Acts or Practices (UDAAP), companies need to monitor conversations between call center employees and customers. If employees give customers false information or treat them unethically, the company can suffer significant financial loss, and both the company and the customer are put under stress.

For this project, I used a public consumer complaint database, analyzed the patterns in the complaints, and trained a machine learning model to classify each complaint’s topic.

About the Dataset

2.1M cases (rows)

data features (columns):

‘Date received’, ‘Product’, ‘Sub-product’, ‘Issue’, ‘Sub-issue’, ‘Consumer complaint narrative’, ‘Company public response’, ‘Company’, ‘State’, ‘ZIP code’, ‘Tags’, ‘Consumer consent provided?’, ‘Submitted via’, ‘Date sent to company’, ‘Company response to consumer’, ‘Timely response?’, ‘Consumer disputed?’, ‘Complaint ID’

The ‘Consumer complaint narrative’ column serves as the complaint description and the ‘Product’ column as the topic label. Out of roughly 2.1 million cases, only about 700,000 have a valid complaint description, and those cases span 18 topics.
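The filtering step can be sketched with pandas. This is a minimal illustration on a toy DataFrame, not the original code: the post does not say what counts as a "valid" description, so treating missing or whitespace-only narratives as invalid is an assumption, and the names `TEXT_COL`, `LABEL_COL`, and `labeled` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the complaint export; the real data has about 2.1M rows.
df = pd.DataFrame({
    "Product": ["Mortgage", "Debt collection", "Mortgage", "Virtual currency"],
    "Consumer complaint narrative": [
        "My escrow payment was applied late.",
        None,    # no narrative provided
        "   ",   # whitespace only, treated as missing (assumption)
        "My transfer never arrived.",
    ],
})

TEXT_COL = "Consumer complaint narrative"
LABEL_COL = "Product"

# Keep only rows whose narrative is present and non-blank.
has_text = df[TEXT_COL].fillna("").str.strip().ne("")
labeled = df.loc[has_text, [TEXT_COL, LABEL_COL]].reset_index(drop=True)

# Number of distinct topics among the usable rows.
n_topics = labeled[LABEL_COL].nunique()
print(len(labeled), "usable rows,", n_topics, "topics")
```

On the full dataset, the same two lines of filtering would produce the description and topic pairs used in the rest of the post.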

Exploratory Data Analysis

Since 2011, the number of complaints has increased steadily, because companies now offer customers more complicated services and the number of customers has grown.

On September 8, 2017, the number of complaints spiked, and most of them were about improper use of credit reports.

From a geographic perspective, Washington, D.C. has the most complaints reported per capita, followed by Georgia, Florida, and Delaware.
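A per-capita ranking divides each region's complaint count by its population. The sketch below uses placeholder regions with made-up counts and populations (they are not figures from the dataset), and the helper `per_100k` is a hypothetical name.

```python
# Hypothetical counts and populations for placeholder regions; not real figures.
complaints = {"Region A": 1_200, "Region B": 9_000, "Region C": 400}
population = {"Region A": 700_000, "Region B": 10_000_000, "Region C": 1_000_000}


def per_100k(complaints, population):
    """Return complaints per 100,000 residents for regions present in both inputs."""
    return {
        region: complaints[region] / population[region] * 100_000
        for region in complaints
        if region in population
    }


rates = per_100k(complaints, population)
# Sort regions from highest to lowest per-capita rate.
ranking = sorted(rates, key=rates.get, reverse=True)
print(ranking)
```

In this toy data, Region B has the most complaints in absolute terms, but Region A ranks first per capita because its population is much smaller.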

For each of the 18 unique topics, I built a word cloud from its most frequently used words, shown below. I created the word clouds after stemming and lemmatization, so some words look incomplete (for example, Bank of America appears as “bank america” and Wells Fargo as “well fargo”).
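The word counts behind a word cloud can be sketched with the standard library alone. This is a simplified illustration: it skips the stemming and lemmatization used in the post, the stop-word list is a tiny made-up one, and `top_words` is a hypothetical helper.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "of", "my", "to", "and", "was", "on", "i", "is", "about"}


def top_words(docs, labels, k=3):
    """Return the k most frequent non-stop-words for each topic label."""
    counts = defaultdict(Counter)
    for text, label in zip(docs, labels):
        tokens = re.findall(r"[a-z]+", text.lower())
        counts[label].update(t for t in tokens if t not in STOP_WORDS)
    return {label: [w for w, _ in c.most_common(k)] for label, c in counts.items()}


docs = [
    "The bank charged a fee on my account",
    "My account was closed and the bank kept the fee",
    "The collector called about a debt",
    "A debt I paid is still on the collector list",
]
labels = [
    "Bank account or service",
    "Bank account or service",
    "Debt collection",
    "Debt collection",
]

top = top_words(docs, labels, k=3)
print(top)
```

The per-topic frequencies in `counts` are what a word cloud library would take as input to size each word.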

After TF-IDF vectorizing each document, it was clear that each topic has a distinct pattern, as shown in the line graph and heatmap. Because some topics have very similar or overlapping names, such as “Bank account or service” and “Checking or savings account”, I wanted to see how similar they are compared with a completely unrelated topic, “Virtual currency”. The original TF-IDF matrix has 182,351 columns, which I reduced to 200 using Truncated SVD.

The same comparison as a heatmap:

As you can see, the similar topics show very similar patterns. Below is another heatmap covering all 18 topics in the dataset.
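The comparison rests on three steps: TF-IDF vectorization, Truncated SVD reduction, and a per-topic summary of the reduced vectors. The sketch below runs them with scikit-learn on a toy corpus. Averaging each topic's vectors and comparing the averages by cosine similarity is an assumption made for illustration (the post compares the patterns visually), and the documents are invented.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented toy documents, two per topic.
docs = [
    "bank account overdraft fee charged",
    "bank closed account deposit missing",
    "checking account deposit fee bank",
    "savings account overdraft bank charged",
    "wallet coin exchange frozen token",
    "coin exchange token withdrawal wallet",
]
labels = np.array([
    "Bank account or service",
    "Bank account or service",
    "Checking or savings account",
    "Checking or savings account",
    "Virtual currency",
    "Virtual currency",
])

# Sparse TF-IDF matrix: one row per document, one column per term.
tfidf = TfidfVectorizer().fit_transform(docs)

# Reduce the term dimension; the post used 200 components, 3 suits this toy corpus.
svd = TruncatedSVD(n_components=3, random_state=0)
reduced = svd.fit_transform(tfidf)

# Average the reduced vectors per topic, then compare topics pairwise.
topics = sorted(set(labels))
centroids = np.vstack([reduced[labels == t].mean(axis=0) for t in topics])
similarity = cosine_similarity(centroids)

for name, row in zip(topics, similarity):
    print(name, np.round(row, 2))
```

The `similarity` matrix is the kind of table a heatmap would display: the two bank-related topics score higher with each other than either does with “Virtual currency”.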

Random Forest Classifier

Because each topic has a distinct pattern, I trained a Random Forest classification model with 300 trees. After about 12 hours of training, it reached 75% accuracy on the test set (over 140k test cases).
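A training run of this kind might look like the scikit-learn sketch below. Only the 300 trees come from the post. The toy corpus, the 80/20 split (roughly consistent with 140k test cases out of 700k), feeding TF-IDF features straight into the forest, and `n_jobs=-1` are all assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented toy corpus: five short complaints per topic.
corpus = {
    "Mortgage": [
        "escrow payment applied late on my mortgage",
        "mortgage servicer lost my loan modification",
        "foreclosure notice sent despite mortgage payment",
        "mortgage escrow account shortage never explained",
        "loan servicer misapplied my mortgage payment",
    ],
    "Debt collection": [
        "collector keeps calling about a debt I paid",
        "debt collector threatened me over an old debt",
        "collection agency reported a debt that is not mine",
        "collector refused to validate the debt",
        "repeated calls from a debt collection agency",
    ],
    "Virtual currency": [
        "exchange froze my coin wallet",
        "token transfer never reached my wallet",
        "coin exchange blocked my withdrawal",
        "wallet balance missing after token swap",
        "exchange lost my coin deposit",
    ],
}
docs = [d for ds in corpus.values() for d in ds]
labels = [topic for topic, ds in corpus.items() for _ in ds]

# Hold out 20% of the complaints, keeping the topic proportions the same.
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, stratify=labels, random_state=0
)

# TF-IDF features feeding a 300-tree forest; n_jobs=-1 uses all CPU cores.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0),
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```

The accuracy on a corpus this small says nothing about the real model. The sketch only shows the shape of the pipeline.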

However, because training took so long, it was hard to try other algorithms or parameters. Overall, the model performs well, but there is still room for improvement.


It is important for companies to provide fair service to all customers from the beginning, but it is even more critical to accept customers’ feedback and improve the business. As a next step, I would like to analyze customers’ patterns during conversations with customer service and train a machine learning model to predict potential complaints.