Chatbot Creation
Train With Your Data
If you expect the chatbot to introduce your company or website, read your documents, answer queries, or conduct complex logic-based questionnaires to evaluate your users, we recommend using the custom-data-upload backend. Train the chatbot with your personalized data to achieve these capabilities.
Upload Training Data
The custom-data-upload backend is designed for training with your own provided data. Upon selecting this option, the chatbot employs the default gpt-3.5-turbo model, processing your data through in-context learning. You can also opt for the gpt-4-turbo model, which offers more precise responses, albeit at a higher cost of 20 credits per message.
If you happen to deplete your credits while on the Professional plan, you can seamlessly switch to using your personal OpenAI API key. Once obtained, simply enter your API key in the Account settings.
Currently, we support the following types of training data:
- Documents: We extract raw text from various file types including .pdf, .docx, .csv, .html, and .txt.
- Websites: We crawl websites, extracting raw text and anchor links using your preferred crawling method.
- Text: Directly upload plain text for training.
- Q&A: Input question-and-answer pairs for training. This structured data format is highly recommended because the chatbot prioritizes these inputs when generating responses.
- Video & Audio: We transcribe raw text from video and audio files, then use a traditional Retrieval-Augmented Generation (RAG) workflow to segment the text appropriately for chatbot response generation.
- Product: Import product JSON payloads directly from your WooCommerce or Shopify store. Since product data is structured, we keep it intact. The chatbot prefers these inputs over unstructured product information extracted from websites by crawling.
Documents
Choose to train your chatbot with data stored on your local device. Access the file upload tool, select the appropriate file, and use the eye icon to review and modify the text extracted from the file. Make any necessary edits to the extracted text before submitting it for training.
.pdf
We extract raw text from PDF files page by page, preserving page number information in the extracted text. For PDFs amenable to table extraction, we also retrieve tables. However, due to the unstructured nature of PDFs, table extraction is limited to one table per page.
.docx
Extraction from DOCX files involves retrieving all raw text collectively, without preserving page numbers. Table extraction from DOCX files is not supported.
.csv
CSV files are used to import structured JSON-formatted payloads in batches. The first row defines the schema of the JSON payload, and each subsequent row supplies the corresponding payload values. Avoid asking the chatbot to count rows, as the CSV content is converted into JSON payloads row by row for processing. For instance, a CSV file like this:
| user | expected response |
|--------|-------------------|
| query1 | response1 |
| query2 | response2 |
will be converted into JSON objects as follows:
[
  {"user": "query1", "expected response": "response1"},
  {"user": "query2", "expected response": "response2"}
]
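As a rough illustration of that row-to-payload conversion, here is a minimal Python sketch (the file name is a placeholder, not a required convention):

```python
import csv
import json

# The first CSV row becomes the JSON keys; every following row
# becomes one standalone payload. This is why the chatbot cannot
# reliably answer questions that span rows, such as counting them.
with open("training_data.csv", newline="") as f:
    payloads = list(csv.DictReader(f))

print(json.dumps(payloads, indent=2))
```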
.html
HTML files are utilized to extract text content, tables and anchors from private pages that are inaccessible via public web crawling. First, download the content of the private page into an HTML file, then use this method for content extraction.
.txt
Similar to .docx files, uploading a .txt file involves extracting raw text from the document without regard to tables or page numbers.
Websites
Train your chatbot on website content by using the "Website" option in the left sidebar. Start by entering the URL of the website and selecting the appropriate crawling method. After the crawling process completes, click the eye icon to inspect and edit the text extracted from the page. Make any necessary modifications before submitting the text for training.
Page Options
The page options let you pinpoint specific sections of a webpage for extraction while omitting irrelevant content. This selective approach maximizes your chatbot's training capacity by focusing on pertinent information. You can also choose to extract only the main content of the page, automatically removing elements such as the header, footer, sidebar, and navbar. For pages requiring authentication, you can supply cookies to authorize the scraping process, enabling the chatbot to access content as if you were logged in. Let's examine each of these options in detail.
Exclude Tags
You can use a list of selector strings, separated by commas, to represent the DOM elements you want to exclude from the scraping result. For example, to exclude the div element with id="sidebar", use div#sidebar as the exclude tag.
Include Only Tags
You can specify a list of selector strings, separated by commas, to represent the only DOM elements you want extracted from the webpage. For example, you can use div.site-content to extract only the content from that specific div.
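Conceptually, the two options behave like this sketch, a minimal illustration using BeautifulSoup rather than our actual scraper (the HTML snippet and selectors are placeholders):

```python
from bs4 import BeautifulSoup

html = """
<div id="sidebar">Related links</div>
<div class="site-content"><p>The article you actually want.</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exclude Tags: drop every element matching the comma-separated selectors.
for selector in "div#sidebar".split(","):
    for element in soup.select(selector.strip()):
        element.decompose()

# Include Only Tags: keep only the text of the matching elements.
included = soup.select("div.site-content")
print(" ".join(el.get_text(" ", strip=True) for el in included))
```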
Cookies
You can use a list of cookies separated by semicolons to authorize the scraping process, enabling the chatbot to access content as if you were logged in. For example, you can use the following cookies:
cookie1=value1; cookie2=value2; cookie3=value3
To get the cookies for the website, use Chrome DevTools to inspect the site's cookies (for example, the Cookie request header in the Network tab) and copy the whole cookie string directly.
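Functionally, supplying cookies is roughly equivalent to attaching the copied string to every request the scraper makes, as in this sketch (the URL and cookie values are placeholders):

```python
import requests

# The cookie string copied verbatim from Chrome DevTools.
cookie_string = "cookie1=value1; cookie2=value2; cookie3=value3"

# Sending it as a Cookie header makes the request look like a logged-in session.
response = requests.get(
    "https://example.com/members-only",
    headers={"Cookie": cookie_string},
)
print(response.status_code)
```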
Extract only the main content
This is a shortcut that removes the header, footer, nav, .navbar, .menu, .sidebar, .advertisement, and .ad content from the webpage, so you don't have to set up these selectors in the Exclude Tags yourself.
The page options are applied to all the following crawling methods:
Add Single Site
Extracts raw text, along with corresponding anchor URLs and tables from a single specified URL. This method excludes images, scripts, media files, and documents. This basic method forms the foundation for the rest.
Submit Sitemap
After you provide a sitemap URL, our system determines the total number of URLs listed and extracts the raw text, along with corresponding anchor URLs and tables, from each URL. Please do not submit a sitemap of sitemaps, as we cannot process that extra layer of nesting.
Crawl a List of Websites
Simply provide a list of URLs; the extraction process mirrors that of the single site method. This method is nearly identical to sitemap crawling, except that the URLs come directly from you rather than a sitemap extraction.
Automatic Crawl
This method begins with a seed URL you provide and recursively follows the anchor links discovered in the crawled pages. To ensure thorough and relevant crawling:
- We restrict crawling to URLs that share the same prefix. For example, if the seed URL is https://developers.google.com/google-ads/api/performance-max/getting-started, we will only crawl URLs that begin with https://developers.google.com/google-ads/api/performance-max/.
- We only include URLs that appear in the pages we've automatically crawled.
- We exclude any documents, images, or media pages.
Positive Filters
You can use positive filters to instruct the crawler to follow URLs that match certain Regex Patterns.
Example:
To crawl all URLs in https://www.homes.com/california/ that match the pattern https://www.homes.com/property/*, set https://www.homes.com/california/ as the starting URL and /property/ as the positive filter. The crawler will then collect all URLs with this pattern for the California area.
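Conceptually, the crawler queues only the discovered links that match the positive filter, as in this illustrative sketch (a simplification, not our crawler's actual code):

```python
import re

positive_filter = re.compile(r"/property/")

discovered = [
    "https://www.homes.com/property/123-main-st/",
    "https://www.homes.com/california/agents/",
]

# Only URLs matching the positive filter are queued for crawling.
to_crawl = [url for url in discovered if positive_filter.search(url)]
print(to_crawl)  # ['https://www.homes.com/property/123-main-st/']
```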
Negative Filters
Negative filters allow you to skip URLs that match specific Regex Patterns.
Example:
To crawl all URLs in https://developers.google.com/google-ads/api/docs/start but exclude pages that are not in English, set https://developers.google.com/google-ads/api/docs/start as the starting URL and ^.*\?hl=[a-zA-Z0-9-]{2,}(?:-[a-zA-Z0-9-]{2,})?$ as the negative filter. The crawler will skip every URL matching this pattern, i.e. localized pages carrying an hl query parameter.
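The gating works in reverse for negative filters; a sketch using the regex above (again a simplification):

```python
import re

negative_filter = re.compile(r"^.*\?hl=[a-zA-Z0-9-]{2,}(?:-[a-zA-Z0-9-]{2,})?$")

discovered = [
    "https://developers.google.com/google-ads/api/docs/start",
    "https://developers.google.com/google-ads/api/docs/start?hl=zh-cn",
]

# URLs matching the negative filter are skipped.
to_crawl = [url for url in discovered if not negative_filter.search(url)]
print(to_crawl)  # ['https://developers.google.com/google-ads/api/docs/start']
```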
Automated Website Retraining
Once your chatbot has been trained with a list of website URLs, you can set up a cron job to automatically refetch the content of those URLs and retrain the chatbot with the new content. This ensures your chatbot's knowledge base always stays up to date with the latest information from your websites. Based on your plan, you can
- set up a monthly cron job on the Entry plan.
- set up a weekly or daily cron job on the Standard plan or above.
During cron job configuration, you can specify the appropriate Page Options to ensure the extracted website content contains only the most pertinent information. Additionally, you can select from three crawl types: list, automatic, or sitemap. Both automatic and sitemap modes will identify and remove any pages that are no longer discovered, marking them as obsolete. The list mode maintains the existing page structure without additions or removals.
To ensure enough scraping workers for all our users, the cron job consumes scrape credits for each scraped page URL.
Text
This format is similar to a .txt file for document training, allowing you to directly input relevant plain text information. For instance, you can create mappings from Website URLs to Page names as shown below:
https://www.producthunt.com/: Home Page
This main landing page showcases the latest and most popular products featured in ProductHunt
https://www.producthunt.com/marketplace: Discover deals for your next favorite product
List of deals from different startups
https://www.producthunt.com/launch: ProductHunt Launch Guide
This detailed guide covers common questions, dispels myths, and outlines best practices for launching your product. It provides insights into defining "success" and prepares you for a successful launch. Bookmark this page to get started.
Q&A
For customization, you have the option to manually input questions and their corresponding answers. To do this, navigate to the Q&A section via the left sidebar and click on "Add" to access input fields for both questions and answers. Fill these fields as needed. The data entered here is structured and won't be fragmented into separate text chunks. Furthermore, the chatbot assigns a higher priority to this structured data when generating responses, due to its curated nature.
For bulk uploads of Q&As, it is advisable to format your data into CSV files as shown below. This method facilitates the uploading of thousands of Q&As efficiently using the CSV file upload feature.
| user | expected response |
|--------|-------------------|
| query1 | response1 |
| query2 | response2 |
| ... | ... |
| query1000 | response1000 |
Audio or Video
Just as with training on websites or files, you can extract text from audio files or from the audio streams of videos. Once the text is extracted, it can be reviewed and edited by clicking the eye icon.
Audio
We directly extract the text from the audio and then segment this text into chunks to serve the RAG (Retrieval-Augmented Generation) workflow.
Video
We do not process the visual frames within videos. Instead, we extract the audio component and use the transcribed text in the RAG (Retrieval-Augmented Generation) workflow.
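As a rough sketch of the segmentation step mentioned above (the chunk size and overlap below are arbitrary placeholders, not our production values):

```python
def chunk_transcript(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a transcript into overlapping chunks for retrieval."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# Each chunk is then embedded and indexed so the chatbot can retrieve
# the relevant passages of the transcript at answer time.
chunks = chunk_transcript("...full transcribed text of the audio track...")
```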
Product
If you have an online store on either WooCommerce or Shopify and want to train the chatbot with its product information, this import method is the perfect tool for you. Compared with crawling the product pages of your website, it imports your store's product catalog and inventory details directly into the chatbot's knowledge base, offering:
- All of a product's information stays together, without being segmented into different text chunks as can happen with website crawling.
- Only relevant, clean product information is imported into the knowledge base, whereas website crawling may pick up noise such as the site's header and footer.
- More complete product data: metadata such as inventory information might not appear on the web page but can still be imported through direct product importing.
- Real-time sync: after importing the products from your store, you can set up webhook syncing from WooCommerce or Shopify so that every update in your store is automatically sent to your chatbot's knowledge base, keeping its information about your online store current.
WooCommerce
WooCommerce is a flexible, open-source eCommerce plugin designed for WordPress, allowing users to build and manage an online store with ease. Here, we will go through the steps to import all products from your WooCommerce store into Chat Data for training your custom chatbot.
Step 1: Get Your Consumer Key and Secret
Prior to importing products from your WooCommerce store for training, you'll need to obtain a consumer key and secret pair for making API calls. Please be assured that we use your consumer key and secret solely for this purpose; they will not be retained by our system. Below are the steps to obtain the key pair:
1.1 Go to your WordPress Admin Page
You can navigate to your WordPress Admin page by appending /wp-admin to your WordPress website domain, e.g. https://example.wordpress.com/wp-admin.
1.2 Go to WooCommerce's Advanced Settings
Click WooCommerce on the left sidebar and then click the Settings tab. This will take you to the WooCommerce settings page.
1.3 Add the API Key
After clicking the Advanced tab and then the REST API section, click the Add key button to start the API key creation process.
1.4 Create the API Key
Enter the API key name and select the user account, choose the read permission, then click the Generate API key button to create your API key pair.
1.5 Save API Key Pair
After creation, your consumer key and consumer secret will be displayed. Make a copy of them to prepare for importing your products into Chat Data.
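If you'd like to sanity-check the key pair before importing, you can query the WooCommerce REST API directly; here is a minimal sketch (the store domain and credentials are placeholders):

```python
import requests

store = "https://example.com"  # your WordPress site
auth = ("ck_your_consumer_key", "cs_your_consumer_secret")

# List a few products via the WooCommerce REST API v3 using basic auth.
response = requests.get(
    f"{store}/wp-json/wc/v3/products",
    auth=auth,
    params={"per_page": 5},
)
response.raise_for_status()
for product in response.json():
    print(product["id"], product["name"])
```

If this call returns your products, the key pair is ready for the import step below.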
Step 2: Import Products
In the Sources tab, choose the Product tab as the source, fill in the domain of your WordPress website (without the http:// prefix), and enter the consumer key and consumer secret obtained from Step 1. Then press the Import button to import products from your WooCommerce store. By clicking on the eye icon, you can access a detailed view of each product's information. Should you find any discrepancies, correct the product details within your WooCommerce store and re-import the products; direct inline modifications of product details within the product modal are not permitted. You can also remove any undesired products, retaining only those you wish to keep. Once you have verified the accuracy of all product details, click the Create/Retrain Chatbot button to initiate the chatbot training process. Below is a sample view of an imported product page, representing the information we will submit to the chatbot for training purposes.
Shopify
Step 1: Get Your Access Token
Prior to importing products from your Shopify store for training purposes, you'll need to obtain an access token. This token enables our platform to invoke the Shopify GraphQL API and import all your store's products. Please be assured that we utilize your access token solely for this purpose and it will not be retained by our system. Below are the steps to acquire your access token:
1.1 Create the Dev App
Click the Create an app button under the Apps and sales channels tab.
1.2 Configure the Admin API Scope
Add the read_products scope to the created App and save the configuration.
1.3 Install the App
Click the Install app button so that the created app can take effect.
1.4 Reveal Access Token
Click the Reveal token once button to get the access token.
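If you'd like to verify the token before importing, you can call the Shopify Admin GraphQL API yourself; a minimal sketch (the store name, API version, and token are placeholders):

```python
import requests

store = "your-store.myshopify.com"
token = "shpat_your_access_token"

# Ask the Admin GraphQL API for a few products using the access token.
query = "{ products(first: 5) { edges { node { id title } } } }"
response = requests.post(
    f"https://{store}/admin/api/2024-01/graphql.json",
    headers={"X-Shopify-Access-Token": token, "Content-Type": "application/json"},
    json={"query": query},
)
response.raise_for_status()
for edge in response.json()["data"]["products"]["edges"]:
    print(edge["node"]["title"])
```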
Step 2: Import Products From Shopify Store
Press the Import button to import products from your Shopify Store. By clicking on the eye icon, you can access a detailed view of each product's information. Should you find any discrepancies, kindly correct the product details within your Shopify store and proceed to re-import the products. Please note, direct inline modifications of product details within the product modal are not permitted. Additionally, you have the option to remove any undesired products, retaining only those you wish to keep. Once you have verified the accuracy of all product details, you may proceed to click the Create/Retrain Chatbot button to initiate the chatbot training process. Below is a sample view of an imported product page, representing the information we will submit to the chatbot for training purposes.