Enhancing Data Insights with Text Cleaning and AI


In today’s data-driven world, organizations want to extract insights not only from structured numeric data but also from the text data they have historically generated and collected. Textual information holds significant potential for informed decision-making. However, text data is often riddled with errors, inconsistencies, and noise that impede the extraction of valuable information. Enter text cleaning, supported by AI, which can transform unstructured text into a valuable asset.

Common text analytics use cases and their associated challenges include: 

  • Customer Sentiment Analysis
    Determining the sentiment expressed in product or service reviews necessitates understanding subtle emotional nuances. 
  • Chat Logs and Customer Support Tickets
    Handling abbreviations, acronyms, misspellings, and non-standard grammar in chat logs without altering the intended meaning can be challenging. 
  • Social Media Mentions
    Understanding social media comments requires a contextual understanding of the discussion thread. In addition, social media comments and discussions are often multilingual, making language detection and multilingual text handling a requirement. 

To tackle these challenges, text cleaning is an essential step in data processing. Text cleaning ranges from correcting simple spelling mistakes to using natural language processing (NLP) to ensure accurate analysis and extraction of insights from text data. Through effective text cleaning, organizations can transform text data into a valuable asset that can inform their decision-making. 

Common Approaches to Text Cleaning 

There are several approaches to performing text cleaning programmatically. Depending on the use case, each method has its advantages and disadvantages; a brief combined sketch follows the list below. 

  1. Regular Expressions (RegEx)
    Regular expressions are a powerful tool for pattern matching and text manipulation. They allow complex search-and-replace operations, making it easier to clean and transform text based on specific patterns or rules.
  2. Named Entity Recognition (NER)
    NER identifies and classifies named entities within text data, such as people, locations, and dates. The primary goal of NER is to extract structured information from unstructured text data by identifying and categorizing named entities.
  3. Natural Language Processing (NLP) Techniques
    NLP encompasses a range of methods and algorithms designed to understand and process human language. Techniques like tokenization, part-of-speech tagging, and syntactic parsing are commonly used to break down text into smaller units, assign grammatical tags, and analyze the syntactic structure. 
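
As a rough illustration, the sketch below touches all three approaches on a single noisy string. It assumes spaCy and its small English model (en_core_web_sm) are installed; spaCy is one common choice here, not a tool named in this article.

```python
# Minimal sketch of the three approaches, assuming spaCy is installed
# (pip install spacy && python -m spacy download en_core_web_sm).
import re
import spacy

nlp = spacy.load("en_core_web_sm")

raw = "Ordered on 01/02/2023!!  GREAT svc from support@example.com, thx!!!"

# 1. RegEx: strip email addresses, collapse repeated punctuation and whitespace.
text = re.sub(r"\S+@\S+", "", raw)         # drop email addresses
text = re.sub(r"([!?.])\1+", r"\1", text)  # "!!!" -> "!"
text = re.sub(r"\s+", " ", text).strip()   # normalize whitespace

doc = nlp(text)

# 2. NER: pull out structured entities (dates, people, places, ...).
entities = [(ent.text, ent.label_) for ent in doc.ents]

# 3. NLP techniques: tokenization and part-of-speech tagging.
tokens = [(tok.text, tok.pos_) for tok in doc]

print(text)      # cleaned string
print(entities)  # e.g., the date may surface as a DATE entity
print(tokens)    # (token, POS) pairs
```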

Improving Performance with Large Language Models (LLMs) 

In recent years, advancements in natural language processing and AI have produced powerful tools like OpenAI’s ChatGPT, which can understand natural language and generate human-like text responses. The key advantage of LLMs is that they are good at understanding the context in which a question or prompt is given. They can comprehend the meaning and intent conveyed by words and phrases, thereby excelling at producing contextually accurate and appropriate outputs. 

Figure 1
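
As a small, hedged illustration of this contextual ability, the snippet below asks an OpenAI chat model to rewrite a noisy message in standard English. The model name, prompt wording, and temperature are illustrative choices, not settings taken from this article.

```python
# Sketch of LLM-based text cleaning via the OpenAI Python client
# (pip install openai; requires the OPENAI_API_KEY environment variable).
from openai import OpenAI

client = OpenAI()

noisy_review = "gr8 prodct, arrived l8 tho... wud buy agn :)"

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative; any chat-capable model works
    messages=[
        {
            "role": "system",
            "content": "Rewrite the user's text in clean, standard English. "
                       "Preserve the meaning; do not add information.",
        },
        {"role": "user", "content": noisy_review},
    ],
    temperature=0,  # deterministic output suits cleaning tasks
)

print(response.choices[0].message.content)
```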

Use Case: Color Categorization 

Our client, a pet policy management software company, recently changed the pet color field in their web application form, which previously allowed free-text input. The client wanted to map the historical responses from the color field to a predefined list, as the lack of standardized color classification prevented practical data analysis.  

With Dataiku and OpenAI’s ChatGPT, Aimpoint Digital’s automated mapping approach helped the client classify 20,000 challenging user-specified color entries in less than 4 hours. 

Throughout this process, we encountered several challenges that needed to be addressed: 

  1. Spelling Variations
    Users may spell colors differently, such as “gray” vs. “grey.” These minor differences can lead to inconsistencies when mapping the responses to a standardized color list.
  2. Domain-Specific Colors
    The color field may contain domain-specific color descriptions not present in the predefined color list. For example, “tri-color” or “brindle” may describe specific coat patterns or markings. Mapping these domain-specific colors to standard color categories requires domain knowledge and a mapping mechanism to handle these variations.
  3. Overly Descriptive Colors
    Users often provide overly descriptive terms to describe the color of their pets, such as “Oreo cookie” or “honeycomb.” These descriptions may not align with standard color categories, making it challenging to map them accurately. 

To address the spelling variations mentioned above (1), fuzzy matching based on string-similarity metrics such as Levenshtein distance or cosine similarity, which measure how different two strings are, is an effective remedy. The real difficulty lies in mapping the unconventional color descriptions (2 and 3), which do not map easily to the pet coat list and therefore require a more sophisticated approach. By leveraging ChatGPT’s natural language processing capabilities, we can generate accurate and contextually appropriate color mappings. 

Figure 2
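
As a concrete sketch of that first, fuzzy-matching step, the snippet below maps misspelled entries to a standard list with the RapidFuzz library. The color list and the 70-point similarity cutoff are illustrative assumptions, not the client’s actual values.

```python
# Fuzzy matching sketch using RapidFuzz (pip install rapidfuzz).
from rapidfuzz import fuzz, process

STANDARD_COLORS = ["black", "white", "brown", "gray", "golden", "cream"]

def map_color(user_entry: str, cutoff: float = 70) -> str | None:
    """Return the closest standard color, or None if nothing is close enough."""
    match = process.extractOne(
        user_entry.lower().strip(),
        STANDARD_COLORS,
        scorer=fuzz.ratio,      # normalized edit-distance similarity, 0-100
        score_cutoff=cutoff,
    )
    return match[0] if match else None

print(map_color("grey"))         # -> "gray"
print(map_color("blck"))         # -> "black"
print(map_color("Oreo cookie"))  # -> None: cases like this need the LLM
```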

Accelerating Color Classification with AI and Batch Processing 

One challenge in using ChatGPT for color mapping is the slow, one-at-a-time processing. To overcome this, we utilized Dataiku’s batching mechanism to optimize efficiency. Splitting the user-entered color dataset into multiple splits within Dataiku lets us create multiple batches that can be sent to ChatGPT simultaneously, enabling parallel processing and pet color categorization at scale. 

The Batching Mechanism 

  1. Data Preparation in Dataiku
    In Dataiku, the incoming user-entered pet color dataset is split into multiple smaller datasets or splits, depending on the volume of data (see Figure 1). This splitting process allows for parallel processing and faster execution.
  2. Batch Creation
    For each split, multiple batches are created within Dataiku. Each batch represents a subset of the split dataset, ensuring manageable data sizes for querying ChatGPT.
  3. ChatGPT Integration
    Dataiku interacts with ChatGPT using the OpenAI API. It queries ChatGPT with batches of color data, submitting multiple color entries simultaneously for processing.
  4. Parallel Processing
    By leveraging Dataiku’s parallel processing capabilities, the batches are sent to ChatGPT in parallel, enabling efficient utilization of computational resources and minimizing overall processing time.
  5. ChatGPT Responses
    As responses from ChatGPT arrive, Dataiku uses its ‘Append instead of overwrite’ functionality to continuously populate the incoming mappings (see Figure 2). This ensures that the mappings are gradually updated and expanded as each batch is processed. A simplified plain-Python sketch of this batching flow follows. 
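
For readers without Dataiku, the sketch below reproduces the same batch-and-parallelize pattern in plain Python. The batch size, worker count, model, and prompt are illustrative assumptions rather than the project’s actual configuration.

```python
# Plain-Python sketch of batched, parallel ChatGPT queries
# (pip install openai; requires the OPENAI_API_KEY environment variable).
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()
STANDARD_COLORS = "black, white, brown, gray, golden, cream"

def map_batch(colors: list[str]) -> str:
    """Ask the model to map one batch of free-text entries to the standard list."""
    prompt = (
        f"Map each pet color below to exactly one of: {STANDARD_COLORS}.\n"
        "Answer as 'entry -> color', one per line.\n\n" + "\n".join(colors)
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

entries = ["oreo cookie", "brindle", "grey", "honeycomb", "tri-color", "blck"]
batches = [entries[i:i + 3] for i in range(0, len(entries), 3)]  # 3 per batch

# Send batches in parallel and append results as they arrive, analogous
# to Dataiku's 'Append instead of overwrite' behavior described above.
with ThreadPoolExecutor(max_workers=4) as pool:
    for mapping in pool.map(map_batch, batches):
        print(mapping)
```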

Our solution successfully converted 20,000 challenging entries, which fuzzy-matching algorithms had previously struggled with, in less than 4 hours (compared to an estimated two-week manual effort). As a result, the number of entries requiring further review was reduced to just 300. Our approach both expedited the process and ensured greater precision in the classification of pet coat colors.  

Harness Dataiku and ChatGPT for your business needs 

By harnessing the power of Dataiku and ChatGPT, we can automate your data-cleaning process and achieve accurate results. Contact us through the form below to explore how innovative use cases like this one can simplify complex tasks and deliver intelligent solutions for your business. 

Author
William Wirono
Senior Data Scientist
