10 Step Toolkit to Improve Messy Conversational Data for Chatbots – ICTworks

10 Step Toolkit to Improve Messy Conversational Data for AI Chatbots

By Guest Writer on February 8, 2024

Many direct-to-client digital services rely on interactions with real people to make decisions, provide support, and improve their platforms over time. But real people produce ‘messy’ data, which makes it hard – at scale – to accurately draw a line between what they need, and the official services and systems that can support them.

User-generated conversational data

For example, Jacaranda’s AI-enabled digital health tool PROMPTS is built to read, process and triage large volumes of conversational inputs from new and expecting mothers. Users are prompted to report on pregnancy-related health, danger signs, and experiences of facility-based care, and the helpdesk uses these reports to decide an appropriate course of action.

PROMPTS is expecting data from women that can be mapped to official DHIS2 facility data to create a bridge between care experience (eg. women reporting poor care) and care provision (eg. facility-reported data on provider skills gaps).

Linking user data directly with official data gives a better idea of what is driving nationally-reported outcomes data, and a more personalized referral pathway – ie. deciding which facility to refer a client to based on poor/positive experiences of care.

Messy conversational data challenge

But the challenge is that conversational data can be ‘messy’. The facility names that mothers mention can’t always be mapped to their official names because of inconsistencies, like formatting issues (eg. irregular capitalization), misspellings, abbreviations (eg. Level 4 > L4), and incomplete or mis-described names (eg. ‘Sub-county hospital’ > ‘Sub-district hospital’).

‘Messy data’ is not unique to the interactions we have on PROMPTS. Other services could benefit from tools that aggregate or standardize diverse or conversational inputs.

For instance, tools that match farmers’ descriptions of crop or pest problems with scientific names to accurately identify crop diseases. Or a means to standardize names of government departments or public facilities from various sources to create unified public databases or improve accessibility of government services.

Improving conversational data with AI

Jacaranda developed an automated approach to augment and standardize conversational data to address this challenge. The approach borrows ‘perturbation attacks’: small, intentional changes to input data that are normally used to trick machine learning algorithms into making incorrect decisions. Here, the same changes are applied to official names to generate realistic variants.

The result is a database of variations on an official name, entity or program, to mimic the inconsistencies in user-reported data.

10 Step Toolkit for Conversational Data

A step-by-step approach to implementing these attacks is below.

1. Setup

In your augmentation script, import the necessary libraries and dependencies, such as pandas, random, re, string and nltk, as below. The textattack and sklearn libraries are also needed in later steps.

import pandas as pd
import random
import re
import string
import nltk

Load your dataset and inspect its structure to understand its features, data types, and layout.
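As a rough illustration of this loading-and-inspecting step, the sketch below reads a small facility list and reports its columns, row count and any columns with blank values. It rests on assumptions: the column names (`facility_id`, `facility_name`), the IDs and the two "Example" facilities are hypothetical, and it uses the standard-library `csv` module on an in-memory string so that it runs as-is. With a real file, pandas' `pd.read_csv` followed by `df.head()` and `df.dtypes` would serve the same purpose.

```python
import csv
import io

# Hypothetical sample standing in for an official facility export.
# Column names, IDs and the two "Example" facilities are illustrative assumptions.
SAMPLE_CSV = """facility_id,facility_name
F001,Kianyaga Sub-County Hospital
F002,Example Health Centre
F003,Example Dispensary
"""

def load_facilities(csv_text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def summarize(rows):
    """Return column names, row count, and columns that contain blank values."""
    columns = list(rows[0].keys()) if rows else []
    blanks = [c for c in columns if any(not (r[c] or "").strip() for r in rows)]
    return {"columns": columns, "n_rows": len(rows), "columns_with_blanks": blanks}

rows = load_facilities(SAMPLE_CSV)
summary = summarize(rows)
print(summary)

# The list of official names is what the later steps augment.
original_dataset = [r["facility_name"] for r in rows]
print(original_dataset)
```

Checking for blank values before augmenting helps keep empty names from being perturbed into meaningless variants.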

2. Install the TextAttack library

The TextAttack library (installable with pip install textattack) offers a framework for introducing subtle changes, or modifications, to target text inputs (eg. DHIS2 facility data), including misspellings, word substitutions or character swapping.

3. Choose an NLP model

Choose a Natural Language Processing (NLP) model suitable for your task and load it using TextAttack, for example a transformer model for sequence classification.

4. Initialize TextAttack Augmenter

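TextAttack provides the Augmenter class for this purpose. It generates augmentations of a string or a list of strings, and the library also ships ready-made augmenter recipes such as CharSwapAugmenter and EasyDataAugmenter. Create an instance (called augmenter in the snippets that follow) configured with the perturbations you want.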

5. Define an augmentation function

Next, define an augmentation function to apply perturbation attacks to your text data. These can be character swaps, misspellings, lowercasing, word substitutions, or character replacements, as outlined in the table below. Augmenting a dataset using TextAttack requires only a few lines of code, as below, and can be done either in a Python script or from the command line.

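def perturbation_augmentation(text):
    # `augmenter` is the TextAttack Augmenter initialized in step 4.
    # augment() returns a list of perturbed strings for a single input.
    augmented_text = augmenter.augment(text)
    return augmented_text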

Correct DHIS2 Name	Perturbation	Augmented Name
Kianyaga Sub-County Hospital	Misspelling	Kianyaga subounty hospital
	Character replacement	Kianyaga sub couGty hospital
	Character omission	Kianyaga sc hospital
	Character substitution	Kianyaga subconuty hospital
	Omission of word	Kianyaga subcounty
	Character deletion	Kianyaga subcounty hospial
	Lowercasing	kianyaga subcounty hospital
	Character swaps	Kianyaga subocunty hospital
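To make the perturbation types in the table concrete, here is a minimal standard-library sketch of four of them: lowercasing, character swap, character deletion and word omission. The helper functions are hypothetical illustrations, not TextAttack's implementation; TextAttack's own augmenters apply comparable transformations with additional constraints. A seeded `random.Random` is assumed so that runs are repeatable.

```python
import random

def lowercase(text):
    """Lowercasing: convert the whole name to lower case."""
    return text.lower()

def swap_adjacent_chars(text, rng):
    """Character swap: exchange two different neighbouring letters."""
    candidates = [
        i for i in range(len(text) - 1)
        if text[i].isalpha() and text[i + 1].isalpha() and text[i] != text[i + 1]
    ]
    if not candidates:
        return text
    i = rng.choice(candidates)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def delete_char(text, rng):
    """Character deletion: drop one letter."""
    candidates = [i for i, c in enumerate(text) if c.isalpha()]
    if not candidates:
        return text
    i = rng.choice(candidates)
    return text[:i] + text[i + 1:]

def omit_word(text, rng):
    """Omission of word: drop one word, but never the first (the place name)."""
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(1, len(words))
    return " ".join(words[:i] + words[i + 1:])

def perturb(text, seed=0):
    """Return a dict of perturbation name -> variant for one official name."""
    rng = random.Random(seed)
    return {
        "lowercasing": lowercase(text),
        "character_swap": swap_adjacent_chars(text, rng),
        "character_deletion": delete_char(text, rng),
        "omission_of_word": omit_word(text, rng),
    }

official = "Kianyaga Sub-County Hospital"
variants = perturb(official)
for name, variant in variants.items():
    print(f"{name}: {variant}")
```

Keeping each perturbation in its own function makes it easy to switch individual types on or off when iterating later.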

6. Repeat the augmentation process

Repeat the augmentation for every data point in your dataset, using the sample code snippet below. The snippet applies the perturbation augmentation function to each entry in turn, for example to a list of five hospital names, starting with the first.

augmented_dataset = [perturbation_augmentation(text) for text in original_dataset]
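One way to turn the augmented output into the database of variations described earlier is a lookup table from each variant back to its official name. The sketch below assumes `perturbation_augmentation` returns a list of strings per input. The stand-in defined here (lowercasing and hyphen handling only) is a hypothetical placeholder for the TextAttack-backed function, and the second facility name is invented for illustration.

```python
def perturbation_augmentation(text):
    """Hypothetical stand-in for the TextAttack-backed function:
    returns a small list of deterministic variants."""
    lowered = text.lower()
    return [lowered, lowered.replace("-", ""), lowered.replace("-", " ")]

def build_variant_lookup(official_names):
    """Map every generated variant (and the official name itself) to the official name."""
    lookup = {}
    for official in official_names:
        for variant in [official] + perturbation_augmentation(official):
            # Lower-case the key so lookups are case-insensitive;
            # setdefault keeps the first owner if two names collide.
            lookup.setdefault(variant.lower(), official)
    return lookup

def resolve(user_text, lookup):
    """Return the official name for a user-reported name, or None if unknown."""
    return lookup.get(user_text.strip().lower())

original_dataset = ["Kianyaga Sub-County Hospital", "Example Health Centre"]
lookup = build_variant_lookup(original_dataset)

print(resolve("kianyaga subcounty hospital", lookup))
print(resolve("Unknown Clinic", lookup))
```

An exact-match lookup only resolves variants that were actually generated, which is why broad coverage of perturbation types matters.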

7. Save the augmented dataset

Save the augmented dataset to a new file, or overwrite the existing one, using the sample code snippet below.

with open('augmented_dataset.txt', 'w', encoding='utf-8') as f:
    for variants in augmented_dataset:
        # Each entry is a list of variants; write one per line.
        for text in variants:
            f.write('%s\n' % text)

8. Inspect and Validate

Review a few samples from the augmented dataset to ensure it aligns with your expectations. Optionally, assess the model’s performance on both the original and augmented datasets.
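A quick automated check can back up the manual review. The sketch below flags variants that are identical to the original, duplicated, or that have drifted too far from it, using `difflib.SequenceMatcher` from the standard library. The 0.6 similarity threshold and the choice of sample variants are assumptions for illustration, to be tuned against your own data.

```python
from difflib import SequenceMatcher

def validate_variants(original, variants, min_similarity=0.6):
    """Split variants into kept and rejected, with a reason for each rejection."""
    kept, rejected = [], []
    seen = set()
    for variant in variants:
        similarity = SequenceMatcher(None, original.lower(), variant.lower()).ratio()
        if variant == original:
            rejected.append((variant, "identical to original"))
        elif variant in seen:
            rejected.append((variant, "duplicate"))
        elif similarity < min_similarity:
            rejected.append((variant, "too different from original"))
        else:
            kept.append(variant)
        seen.add(variant)
    return kept, rejected

original = "Kianyaga Sub-County Hospital"
variants = [
    "Kianyaga subcounty hospial",    # taken from the table above
    "kianyaga subcounty hospital",   # taken from the table above
    "kianyaga subcounty hospital",   # repeated on purpose
    "Kianyaga Sub-County Hospital",  # unchanged on purpose
    "clinic",                        # unrelated on purpose
]
kept, rejected = validate_variants(original, variants)
print("kept:", kept)
print("rejected:", rejected)
```

Rejected variants are returned with their reasons instead of being silently dropped, so they can be reviewed when adjusting parameters.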

9. Iterate and fine-tune

We needed to iterate and fine-tune our approach throughout the process, adjusting parameters or selecting different models to achieve the desired augmentation.

10. Document

Document the augmentation process, including the model used, parameters, and any specific considerations. A clear record makes it easier to revisit augmentation options you missed, experiment with different models or techniques, or modify parameters going forward.
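One lightweight way to document a run is a small machine-readable record saved next to the augmented file. The sketch below is an assumption about what such a record might contain. The field names, the recipe label and the parameter values are hypothetical and should be replaced with whatever your own run actually used.

```python
import json
from datetime import date

def build_run_record(recipe, parameters, n_original, n_augmented, notes=""):
    """Collect the facts needed to reproduce or revisit an augmentation run."""
    return {
        "run_date": date.today().isoformat(),
        "recipe": recipe,
        "parameters": parameters,
        "n_original": n_original,
        "n_augmented": n_augmented,
        "notes": notes,
    }

record = build_run_record(
    recipe="character-level perturbations",          # hypothetical label
    parameters={"variants_per_name": 4, "seed": 0},  # hypothetical values
    n_original=5,
    n_augmented=20,
    notes="Lowercasing applied to all variants.",
)

# Serialize with sorted keys so records from different runs are easy to compare.
record_json = json.dumps(record, indent=2, sort_keys=True)
print(record_json)
```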

Did You Improve Messy Conversational Data?

The hope is that this toolkit will support faster, cheaper augmentation of diverse data at scale, helping implementers make sense of the data they collect from users, and connect them with systems and services that could support them better.

By producing ‘cleaner’, standardized datasets, this process will also help implementers train AI-based models more smoothly and better report on the insights or implications of the data they generate.

We are keen to hear from other implementers how this toolkit has supported data augmentation in their services and systems. Please share feedback, learnings, and areas for improvement in the comments below.

By Stanslaus Mwongela, Machine Learning Manager, Jay Patel, Head of Technology, and Laura Down, Head of Global Communications, Jacaranda Health

This post was originally published on 3rd party site mentioned in the title of this site
