Chatbot Training Guide
Train Chatbots With Clean Data
The data you provide is used as the CONTEXT INFORMATION the chatbot draws on when responding to queries. The cleaner the CONTEXT INFORMATION, the more reliably the chatbot can pick out the essential messages in the context and answer the query accurately. This is especially critical when GPT-4o mini is the underlying model, since smaller models are more easily thrown off by noisy context.
Sources of Polluted Data
Polluted data typically arises when text content is extracted from PDF files or when websites are crawled:
Unrecognized Unicode
Some PDF files use specialized Unicode encodings as a copy-protection measure. Extracting text from these files can therefore produce "garbled code": output that is unreadable and incomprehensible. If this garbled text is used as context information, it will hinder the chatbot's performance. To verify the integrity of the extracted text, preview the content by clicking the eye icon and check for garbled passages. Garbled output typically appears as runs of placeholder codes or nonsense symbols (for example, sequences of "(cid:…)" tokens or box characters) instead of readable words.
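If you want to screen extracted text automatically before uploading, one rough heuristic is to measure how many characters fall outside normal ranges. Below is a minimal Python sketch; the 15% threshold and the character classes checked are assumptions you may need to tune, not part of any official tooling.

import unicodedata

def looks_garbled(text: str, threshold: float = 0.15) -> bool:
    # Heuristic: flag text where too many characters are replacement
    # characters, Private Use Area code points, or unassigned code
    # points -- common symptoms of a PDF with a non-standard encoding.
    if not text:
        return True
    suspicious = 0
    for ch in text:
        if ch == "\ufffd":                        # replacement character
            suspicious += 1
        elif "\ue000" <= ch <= "\uf8ff":          # Private Use Area
            suspicious += 1
        elif unicodedata.category(ch) == "Cn":    # unassigned code point
            suspicious += 1
    return suspicious / len(text) > threshold

print(looks_garbled("Hello, world!"))             # False: readable text
print(looks_garbled("\ue045\ue046\ue047\ue048"))  # True: PUA-heavy text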
The fix for the "garbled code" issue is to run the PDF through an OCR tool, which produces a new PDF with a normal text encoding. For example, you can use the following command on Ubuntu (ocrmypdf is typically installable with "sudo apt install ocrmypdf"):
ocrmypdf --force-ocr input.pdf output.pdf
After converting the PDF to a normal encoding, upload the file again. The previously unreadable text should now be readable.
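To double-check the result before uploading, you can extract a text preview from the converted file and inspect it. Here is a minimal sketch using the third-party pypdf library; the file name output.pdf simply refers to the converted file from the command above.

from pypdf import PdfReader  # third-party: pip install pypdf

reader = PdfReader("output.pdf")                       # the OCR-converted PDF
first_page_text = reader.pages[0].extract_text() or ""
print(first_page_text[:300])                           # preview should now be readable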
Redundant Metadata
Both PDF files and websites often carry extensive metadata that provides no valuable information yet occupies significant text space, space that would be better spent on relevant content. Removing this metadata helps streamline the data, although it typically requires considerable manual effort. This is a key reason we recommend that Shopify/WooCommerce store owners import product details directly rather than scraping their websites: scraping pulls in irrelevant clutter alongside the content.
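If scraping is unavoidable, stripping the most obvious boilerplate before upload already helps. The sketch below uses the third-party BeautifulSoup library; the list of tags to drop is an assumption that will vary from site to site.

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def extract_main_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Remove tags that usually hold boilerplate rather than content.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    # Collapse what remains into whitespace-normalized plain text.
    return soup.get_text(separator=" ", strip=True)

html = "<html><body><nav>Menu</nav><p>Ceramic mug, 350 ml.</p></body></html>"
print(extract_main_text(html))  # -> "Ceramic mug, 350 ml."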
To address this issue, it is advisable to manually condense the useful information and then upload the consolidated content to the database.
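For illustration, a condensed entry for a single product might look like the following; the product and fields here are hypothetical, so adapt them to your catalog.

Product: Acme Travel Mug
Price: $18.99
Materials: stainless steel body, BPA-free lid
Key points: keeps drinks hot for 6 hours; dishwasher safe; 350 ml capacity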