Chatbot Training Guide
Structured Data vs. Unstructured Data
Structured data maintains its integrity better and enhances chatbot performance when used as context information. In contrast, unstructured data is divided into text chunks to provide context in the Retrieval-Augmented Generation (RAG) workflow. Because determining the precise boundaries at which to segment unstructured data is challenging, some important information from the unstructured data might be lost when it serves as context information.
The Retrieval-Augmented Generation (RAG) Workflow
+----------------+          +-----------------------+          +-------------------------+
|                |          |                       |          |                         |
|  Create Query  +--------->+  Generate Embedding   +--------->+    Calculate Cosine     |
|                |          |        Vector         |          |       Similarity        |
+----------------+          +-----------------------+          +------------+------------+
                                                                            |
                                                                            v
                                                               +------------+-------------+
                                                               |                          |
                                                               |    Find Relevant Text    |
                                                               |          Chunks          |
                                                               |                          |
                                                               +------------+-------------+
                                                                            |
                                                                            v
                                                               +------------+------------+
                                                               |                         |
                                                               |    Chatbot Responds     |
                                                               |      Using Context      |
                                                               |                         |
                                                               +-------------------------+
The Retrieval-Augmented Generation (RAG) workflow follows these steps:
1. Create Query: Initiate the process with a query based on the user's input.
2. Generate Embedding Vector: Transform the query into a numerical vector that encapsulates its information.
3. Calculate Cosine Similarity: Compute the cosine similarity between the query's vector and the vector of each stored text chunk to measure how relevant each chunk is.
4. Find Relevant Text Chunks: Select the text chunks most pertinent to providing context for the conversation.
5. Chatbot Responds: Utilize the selected context to formulate the chatbot’s response.
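The retrieval steps above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: `embed` is a hypothetical stand-in that uses word counts instead of a learned embedding model, and `find_relevant_chunks` and the sample chunks are made up for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A stand-in for a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_relevant_chunks(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Rank chunks by similarity to the query and keep the top_k best."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical knowledge base, already split into chunks.
chunks = [
    "Jack Freeman is a good friend of Jimmy",
    "The store opens at nine in the morning",
    "Shipping takes three to five business days",
]
query = "Who is a good friend of Jimmy?"

# Steps 1-4: query, embedding, similarity, chunk selection.
context = find_relevant_chunks(query, chunks, top_k=1)

# Step 5: the selected context would be handed to the chatbot with the question.
prompt = f"Context: {context[0]}\nQuestion: {query}"
print(prompt)
```

In a real deployment the word-count vectors would be replaced by dense vectors from an embedding model, but the ranking logic stays the same.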
Challenges with Unstructured Data
At step 4 above (Find Relevant Text Chunks), issues arise when a relevant passage has been split across chunk boundaries. Consider the following text:
Jack Freeman is a good friend of Jimmy
It might be divided into two chunks:
part1:
Jack
part2:
Freeman is a good friend of Jimmy
If the query is "Who is a good friend of Jimmy?", the chatbot would retrieve part2, "Freeman is a good friend of Jimmy", and disregard part1, "Jack", because it appears unrelated to the query. Consequently, the chatbot might answer incorrectly, for example with "Freeman" instead of "Jack Freeman". This segmentation issue is typically absent in structured data, where each record is kept whole rather than split mid-sentence.
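The failure described above can be reproduced with a small sketch. The helper names `tokens` and `split_after_words` are hypothetical, and shared-word counting is used here only as a crude stand-in for cosine similarity over embeddings.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set, used here as a crude relevance signal."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_after_words(text: str, n: int) -> tuple[str, str]:
    """Simulate a chunk boundary that happens to fall after the n-th word."""
    words = text.split()
    return " ".join(words[:n]), " ".join(words[n:])


sentence = "Jack Freeman is a good friend of Jimmy"
query = "Who is a good friend of Jimmy?"

# The boundary lands after the first word, as in the example above.
part1, part2 = split_after_words(sentence, 1)

# Score each chunk by how many words it shares with the query.
scores = {chunk: len(tokens(chunk) & tokens(query)) for chunk in (part1, part2)}
best = max(scores, key=scores.get)

print(best)            # the chunk that would be passed to the chatbot
print("Jack" in best)  # whether the first name survived retrieval
```

Because part1 shares no words with the query, only part2 is selected, and the first name never reaches the chatbot.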
Structured Data
Structured data refers to the following types of uploaded data:
.csv
CSV files facilitate the batch import of structured, JSON-formatted data. The first row outlines the JSON schema, and each subsequent row provides the corresponding data as JSON objects. It is not advisable to use the chatbot to count rows, since the content of the CSV is processed row by row into JSON payloads. For example, consider the following CSV file:
| user | expected response |
|--------|-------------------|
| query1 | response1 |
| query2 | response2 |
This file will be converted into JSON objects as follows:
[
{"user": "query1", "expected response": "response1"},
{"user": "query2", "expected response": "response2"}
]
You may include multiple columns, each aligning with an element of the JSON schema. Ensure each column header is semantically meaningful to enhance clarity and understanding.
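The row-by-row conversion can be illustrated with Python's standard `csv` and `json` modules. This is only a sketch of the idea under the assumption that the header row supplies the keys; the function name `csv_to_records` is hypothetical, and the platform's actual import code is not shown in this guide.

```python
import csv
import io
import json


def csv_to_records(csv_text: str) -> list[dict[str, str]]:
    """Turn CSV text into one dict per data row, keyed by the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


# The same two-row example as above, written as raw CSV text.
csv_text = """user,expected response
query1,response1
query2,response2
"""

records = csv_to_records(csv_text)
print(json.dumps(records, indent=2))
```

Each record is independent of the others, which is why questions about the file as a whole, such as a row count, are a poor fit for the chatbot.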
Q&A
The question-and-answer pairs entered by the user.
Product
Product information imported directly from Shopify or WooCommerce stores.
Unstructured Data
The following types of training data are categorized as unstructured data:
.pdf
The plain text and tables extracted from PDF files.
.docx
The plain text extracted from DOCX files.
.html
The plain text content, tables and anchors extracted from HTML files.
.txt
The raw text content of TXT files.
Websites
The plain text content, tables and anchors extracted from websites by URL.
Text
The plain text inputs provided by the user.
Audio and Video
The text content extracted by transcribing the audio of the media files.