
In today’s data-driven world, organizations want to extract insights from structured numeric data and also monetize the text data they have historically generated and collected. Textual information holds significant potential for informed decision-making. However, text data is often riddled with errors, inconsistencies, and noise that impede the extraction of valuable information. AI-supported text cleaning addresses this problem by transforming unstructured data into a valuable asset.
Common text analytics use cases and their associated challenges include:
To tackle these challenges, text cleaning is an essential step in data processing. Text cleaning ranges from correcting simple spelling mistakes to using natural language processing (NLP) to ensure accurate analysis and extraction of insights from text data. Through effective text cleaning, organizations can transform text data into a valuable asset that can inform their decision-making.
There are several approaches to performing text cleaning programmatically. Depending on the use case, each method has its advantages and disadvantages.
In recent years, advancements in natural language processing and AI have produced powerful tools like OpenAI’s ChatGPT, which is built on large language models (LLMs) that can interpret and generate human-like text. The key advantage of LLMs is that they are good at understanding the context in which a question or prompt is given. They can comprehend the meaning and intent conveyed by words and phrases, and therefore excel at producing contextually accurate and appropriate outputs.
Figure 1
Use Case: Color Categorization
Our client, a pet policy management software company, recently changed their web application form for the pet color field, which previously allowed free-text input. The client wanted to map the historic responses from the color field to a predefined list, as the lack of standardized color classification prevented practical data analysis.
With Dataiku and OpenAI’s ChatGPT, Aimpoint Digital’s automated mapping approach helped our pet policy management software client classify 20,000 challenging user-specified color entries in less than 4 hours.
Throughout this process, we encountered several challenges that needed to be addressed:
To address the common misspellings mentioned above (1), fuzzy matching is an effective remedy. Metrics such as Levenshtein distance, which measures how different two strings are, or cosine similarity, which measures how alike their vector representations are, can match a misspelled entry to its closest valid color. The difficulty lies in mapping unconventional color descriptions (2 and 3), which do not map easily to the pet coat color list and require a more sophisticated approach. By leveraging ChatGPT’s natural language processing capabilities, we can generate accurate and contextually appropriate color mappings.
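As a rough illustration of the fuzzy-matching step for misspellings, the sketch below maps an entry to the closest color by Levenshtein distance. The color list, the `fuzzy_map` helper, and the distance threshold of 2 are assumptions made for this example, not the client's actual list or settings.

```python
from typing import Optional

# Hypothetical predefined coat-color list; the real list is not given in the article.
COAT_COLORS = ["black", "white", "brown", "golden", "gray", "cream"]


def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute if different
            ))
        previous = current
    return previous[-1]


def fuzzy_map(entry: str, max_distance: int = 2) -> Optional[str]:
    """Map a free-text entry to the closest allowed color, or None if none is close."""
    cleaned = entry.strip().lower()
    best = min(COAT_COLORS, key=lambda color: levenshtein(cleaned, color))
    # Assumed threshold: anything further away is left for a more capable method.
    return best if levenshtein(cleaned, best) <= max_distance else None
```

With this list, `fuzzy_map("Blak")` returns `"black"`, while a description such as `"salt and pepper"` returns `None`. Entries of that second kind are the ones that call for a context-aware approach.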
Figure 2
One challenge in using ChatGPT for color mapping is that processing entries one by one is slow. To overcome this, we used Dataiku’s batching mechanism. Splitting the user-entered color dataset into multiple subsets within Dataiku lets us create batches that are sent to ChatGPT simultaneously, so pet colors can be processed in parallel and categorized at scale.
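The general idea of batching plus parallel requests can be sketched in plain Python. This is not Dataiku's API: the standard-library thread pool stands in for Dataiku's splitting, and `classify_batch_stub` is a hypothetical placeholder for the call that would send one batch to the model. The batch size and worker count are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def make_batches(entries: List[str], batch_size: int) -> List[List[str]]:
    """Split the entries into consecutive batches of at most batch_size items."""
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]


def classify_batch_stub(batch: List[str]) -> Dict[str, str]:
    """Hypothetical stand-in for one model request that covers a whole batch."""
    return {entry: "unmapped" for entry in batch}


def classify_all(
    entries: List[str],
    classify_batch: Callable[[List[str]], Dict[str, str]] = classify_batch_stub,
    batch_size: int = 50,   # assumed value, tune to the model's limits
    max_workers: int = 4,   # assumed value, tune to rate limits
) -> Dict[str, str]:
    """Classify all entries by sending several batches concurrently."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs batches in parallel and yields results in batch order.
        for mapping in pool.map(classify_batch, make_batches(entries, batch_size)):
            results.update(mapping)
    return results
```

Threads suit this kind of workload because each request spends most of its time waiting on the network, so several batches can be in flight at once.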
The Batching Mechanism
Our solution successfully classified 20,000 challenging entries, which fuzzy-matching algorithms had previously struggled with, in less than 4 hours (compared to an estimated two-week manual effort). As a result, the number of entries requiring further review was reduced to 300. Our approach both expedited the process and improved the precision of the pet coat color classification.
By harnessing the power of Dataiku and ChatGPT, we can automate your data-cleaning process and achieve accurate results. Contact us through the form below for more innovative use cases like this to help simplify complex tasks and provide intelligent solutions.