Chatbot Creation

Train With Your Data

If you expect the chatbot to introduce your company or website, read your documents, answer queries, or conduct complex logic-based questionnaires to evaluate your users, we recommend the custom-data-upload backend. Train the chatbot with your own data to achieve these capabilities.


Upload Training Data

The custom-data-upload backend is designed for training with your own data. Upon selecting this option, the chatbot employs the default gpt-3.5-turbo model to process your data through in-context learning. You can also opt for the gpt-4-turbo model, which offers more precise responses, albeit at a higher cost of 20 credits per message.

If you happen to deplete your credits while on the Professional plan, you can seamlessly switch to your personal OpenAI API key. Once obtained, simply enter the API key in the Account settings.

Currently, we support the following types of training data:

  • Documents: We extract raw text from various file types including .pdf, .docx, .csv, .html, and .txt.
  • Websites: We crawl websites, extracting raw text and anchor links using your preferred crawling method.
  • Text: Directly upload plain text for training.
  • Q&A: Input question-and-answer pairs for training. This structured data format is highly recommended, as the chatbot prioritizes these inputs when generating responses.
  • Video & Audio: We transcribe raw text from video and audio files, then use a traditional Retrieval-Augmented Generation (RAG) workflow to segment the text appropriately for chatbot response generation.
  • Product: Import product JSON payloads directly from your WooCommerce or Shopify store. Since product data is structured, we keep it intact. The chatbot prefers these inputs over unstructured product information extracted by crawling websites.

Documents

Choose to train your chatbot with data stored on your local device. Open the file upload tool, select the appropriate file, and use the eye icon to review the text extracted from it. Make any necessary edits to the extracted text before submitting it for training.

.pdf

We extract raw text from PDF files page by page, preserving page number information in the extracted text. For PDFs amenable to table extraction, we also retrieve tables. However, due to the unstructured nature of PDFs, table extraction is limited to one table per page.

.docx

Extraction from DOCX files involves retrieving all raw text collectively, without preserving page numbers. Table extraction from DOCX files is not supported.

.csv

CSV files are used to import structured JSON-formatted payloads in batches. The first row defines the schema of the JSON payload, with subsequent rows representing the corresponding payload values. Avoid asking the chatbot to count rows, since the CSV content is converted into JSON payloads row by row for processing. For instance, a CSV file like this:

| user   | expected response |
|--------|-------------------|
| query1 | response1         |
| query2 | response2         |

will be converted into JSON objects as follows:

```json
[
  {"user": "query1", "expected response": "response1"},
  {"user": "query2", "expected response": "response2"}
]
```
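
For reference, this row-to-payload conversion can be reproduced in a few lines of Python: csv.DictReader uses the header row as the keys for each row's JSON object. A minimal sketch (the file name training.csv is just an example):

```python
import csv
import json

# Minimal sketch: the header row supplies the JSON keys, and each
# subsequent row becomes one payload object, mirroring the conversion
# described above.
with open("training.csv", newline="", encoding="utf-8") as f:
    payloads = list(csv.DictReader(f))

print(json.dumps(payloads, indent=2))
```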

.html

HTML files let you extract text content, tables, and anchors from private pages that public web crawling cannot reach. First save the private page's content as an HTML file, then upload it for content extraction.

.txt

Similar to .docx files, uploading a .txt file involves extracting raw text from the document without regard to tables or page numbers.

Websites

Train your chatbot with website content by using the "Website" option located in the left sidebar. Start by entering the URL of the website and selecting the appropriate crawling method. After the crawling process completes, click the eye icon to inspect and edit the text extracted from the page. Ensure all modifications are made before submitting the text for training.

Page Options

The page options let you pinpoint specific sections of a webpage for extraction while omitting irrelevant content. This selective approach maximizes your chatbot's training capacity by focusing on pertinent information. You can also extract only the main content of the page, automatically removing elements such as the header, footer, sidebar, and navbar. For pages requiring authentication, you can use cookies to authorize the scraping process, enabling the chatbot to access content as if you were logged in. Let's examine each of these options in detail.

Exclude Tags

You can provide a comma-separated list of selector strings representing the DOM elements you want to exclude from the scraping result. For example, to exclude every div element with id="sidebar", use div#sidebar as the exclude tag.

Include Only Tags

You can specify a comma-separated list of selector strings representing the only DOM elements you want to extract from the webpage. For example, use div.site-content to extract only the content of that specific div.
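
The effect of both options can be approximated with standard CSS selector tooling. This is only an illustrative sketch using BeautifulSoup, not Chat Data's actual scraping code:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <div id="sidebar">ads and navigation</div>
  <div class="site-content">the article text we want</div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Exclude Tags: remove every element matching the listed selectors.
for node in soup.select("div#sidebar"):
    node.decompose()

# Include Only Tags: keep only the elements matching the listed selectors.
included = soup.select("div.site-content")
print("\n".join(node.get_text(strip=True) for node in included))
```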

Cookies

You can use a list of cookies separated by semicolons to authorize the scraping process, enabling the chatbot to access content as if you were logged in. For example, you can use the following cookies:

cookie1=value1; cookie2=value2; cookie3=value3

To obtain the cookies for a website, open Chrome Developer Tools, inspect the site's cookies, and copy the whole cookie string directly.
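
Under the hood, a cookie string like the one above is simply sent along with each page request. A hypothetical sketch of such an authenticated fetch (the URL and cookie values are placeholders):

```python
import requests

cookie_string = "cookie1=value1; cookie2=value2; cookie3=value3"

# The cookie string is passed verbatim in the Cookie header, so the
# server responds as if you were logged in.
resp = requests.get(
    "https://example.com/private-page",
    headers={"Cookie": cookie_string},
    timeout=30,
)
print(resp.status_code)
```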

Extract only the main content

This is a shortcut that removes the header, footer, nav, .navbar, .menu, .sidebar, .advertisement, and .ad content from the webpage, so you don't have to set up those selectors in Exclude Tags yourself.

The page options apply to all of the following crawling methods:

Add Single Site

Extracts raw text, along with corresponding anchor URLs and tables from a single specified URL. This method excludes images, scripts, media files, and documents. This basic method forms the foundation for the rest.
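
As a rough illustration of what single-site extraction involves (not the exact pipeline), here is a sketch that pulls the raw text and anchor URLs from one page:

```python
import requests
from bs4 import BeautifulSoup

# Sketch of single-page extraction: fetch the page, keep its text and
# the anchor URLs it links to; images, scripts, and media are ignored.
resp = requests.get("https://example.com/docs/getting-started", timeout=30)
soup = BeautifulSoup(resp.text, "html.parser")

raw_text = soup.get_text(separator="\n", strip=True)
anchors = [a["href"] for a in soup.find_all("a", href=True)]
print(raw_text[:200])
print(anchors[:5])
```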

Submit Sitemap

After you provide a sitemap URL, our system enumerates all the URLs it lists and extracts the raw text, anchor URLs, and tables from each one. Please do not submit a sitemap of sitemaps, as we cannot process multiple layers of nesting.

Crawl a List of Websites

Simply provide a list of URLs; the extraction process mirrors that of the single-site method. This method is nearly identical to sitemap crawling, except that the URLs come directly from you rather than from a sitemap.

Automatic Crawl

This method begins with a seed URL you provide and recursively follows the anchor links discovered on crawled pages. To ensure thorough and relevant crawling (a minimal sketch follows the list):

  1. We restrict crawling to URLs that share the same prefix. For example, if the seed URL is https://developers.google.com/google-ads/api/performance-max/getting-started, we will only crawl URLs that begin with https://developers.google.com/google-ads/api/performance-max/.
  2. We only include URLs that appear in the pages we've automatically crawled.
  3. We exclude any documents, images, or media pages.
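
A minimal sketch of the prefix-restricted crawl described above, assuming only the three rules (the real crawler also handles politeness, deduplication, and media filtering more carefully):

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

seed = ("https://developers.google.com/google-ads/api/"
        "performance-max/getting-started")
prefix = seed.rsplit("/", 1)[0] + "/"  # rule 1: only URLs under this prefix

seen, queue = set(), [seed]
while queue:
    url = queue.pop()
    if url in seen:
        continue
    seen.add(url)
    page = requests.get(url, timeout=30).text
    # rule 2: follow only anchors discovered on pages we've crawled
    for a in BeautifulSoup(page, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"]).split("#")[0]
        # rules 1 and 3: stay under the prefix, skip documents/images/media
        if link.startswith(prefix) and not link.endswith(
            (".pdf", ".jpg", ".png", ".mp4")
        ):
            queue.append(link)
print(len(seen), "pages crawled")
```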

Positive Filters

You can use positive filters to instruct the crawler to follow only URLs that match certain regex patterns.

Example:

To crawl all URLs under https://www.homes.com/california/ that match the pattern https://www.homes.com/property/*, set https://www.homes.com/california/ as the starting URL and /property/ as the positive filter. The crawler will collect all URLs with this pattern for the California area.

Negative Filters

Negative filters allow you to skip URLs that match specific regex patterns.

Example:

To crawl all URLs in https://developers.google.com/google-ads/api/docs/start while excluding pages that are not in English, set https://developers.google.com/google-ads/api/docs/start as the starting URL and ^.*\?hl=[a-zA-Z0-9-]{2,}(?:-[a-zA-Z0-9-]{2,})?$ as the negative filter. The crawler will collect all URLs except those matching this pattern (localized pages carrying an hl query parameter).
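
In effect, both filter types are regular-expression gates on discovered URLs. A hypothetical sketch of how they select what to follow, using the two example patterns above:

```python
import re

positive = re.compile(r"/property/")  # follow only matching URLs
negative = re.compile(r"^.*\?hl=[a-zA-Z0-9-]{2,}(?:-[a-zA-Z0-9-]{2,})?$")  # skip matches

urls = [
    "https://www.homes.com/property/123-main-st/",
    "https://www.homes.com/california/schools/",
    "https://developers.google.com/google-ads/api/docs/start?hl=de",
    "https://developers.google.com/google-ads/api/docs/start",
]

print([u for u in urls if positive.search(u)])     # positive filter keeps matches
print([u for u in urls if not negative.match(u)])  # negative filter drops matches
```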

Automated Website Retraining

Once your chatbot has been trained with a list of website URLs, you can set up a cron job that automatically refetches the content of those URLs and retrains the chatbot on it. This keeps your chatbot's knowledge base up to date with the latest information from your websites. Depending on your plan, you can:

  1. Set up a monthly cron job on the Entry plan.
  2. Set up a weekly or daily cron job on the Standard plan or above.

During cron job configuration, you can specify the appropriate Page Options to ensure the extracted website content contains only the most pertinent information. You can also select from three crawl types: list, automatic, or sitemap. The automatic and sitemap modes identify pages that can no longer be discovered and mark them as obsolete, while the list mode maintains the existing page structure without additions or removals.

To ensure enough scraping workers for all our users, the cron job consumes scrape credits for each scraped page URL.

Text

This format is similar to a .txt file for document training, allowing you to directly input relevant plain-text information. For instance, you can create mappings from website URLs to page names as shown below:

https://www.producthunt.com/: Home Page
This main landing page showcases the latest and most popular products featured in ProductHunt

https://www.producthunt.com/marketplace: Discover deals for your next favorite product
List of deals from different startups

https://www.producthunt.com/launch: ProductHunt Launch Guide
This detailed guide covers common questions, dispels myths, and outlines best practices for launching your product. It provides insights into defining "success" and prepares you for a successful launch. Bookmark this page to get started.

Q&A

For customization, you can manually input questions and their corresponding answers. To do this, navigate to the Q&A section via the left sidebar and click "Add" to open input fields for both questions and answers. Fill these in as needed. The data entered here is structured and won't be fragmented into separate text chunks. Furthermore, the chatbot assigns a higher priority to this structured data when generating responses, due to its curated nature.

For bulk uploads of Q&As, it is advisable to format your data as a CSV file, as shown below. This makes it efficient to upload thousands of Q&As via the CSV file upload feature; a small generation sketch follows the table.

| user   | expected response |
|--------|-------------------|
| query1 | response1         |
| query2 | response2         |
| ...    | ...               |
| query1000 | response1000   |
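
If your Q&A pairs live in code or a database, a CSV in this layout is easy to produce. A minimal sketch (column names follow the table above):

```python
import csv

qa_pairs = [
    ("query1", "response1"),
    ("query2", "response2"),
    # ... up to thousands of pairs
]

with open("qa_training.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["user", "expected response"])  # header row = schema
    writer.writerows(qa_pairs)
```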

Audio or Video

Just as with training on websites or files, you can extract text from the audio streams of video or audio files. Once the text is extracted, it can be reviewed and edited by clicking the eye icon.

Audio

We directly extract the text from the audio and then segment this text into chunks to serve the RAG (Retrieval-Augmented Generation) workflow.

Video

We do not process the visual frames within videos. Instead, we extract the audio component and use the transcribed text in the RAG (Retrieval-Augmented Generation) workflow.
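
For intuition, the segmentation step applied to both audio and video transcripts resembles a fixed-size splitter with overlap. The chunk sizes below are illustrative assumptions, not Chat Data's actual parameters:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split a transcript into overlapping chunks for RAG retrieval."""
    # size and overlap are assumed values for illustration only.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

transcript = "word " * 2000  # stand-in for a transcribed audio track
print(len(chunk_text(transcript)))
```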

Product

If you have an online store on either WooCommerce or Shopify and want to train your chatbot with product information, this import method is the perfect tool for you. Compared with crawling the product pages of your website, it directly imports your store's product catalog and inventory details into the chatbot's knowledge base, offering:

  1. All of a product's information stays together instead of being segmented into separate text chunks, which can happen with website crawling.
  2. Only relevant, clean product information is imported into the knowledge base, whereas website crawling may pick up unhelpful content such as the site's header and footer.
  3. More complete product data: some product metadata, such as inventory information, may not appear on the website page but can still be imported through direct product importing.
  4. Real-time sync: after importing the products from your store, you can set up webhook sync from WooCommerce or Shopify so that every update in your store is automatically sent to your chatbot's knowledge base, keeping its information about your online store current.

WooCommerce

WooCommerce is a flexible, open-source eCommerce plugin for WordPress that lets users build and manage an online store with ease. Here, we will go through the steps to import all products from your WooCommerce store into Chat Data for training your custom chatbot.

Step 1: Get Your Consumer Key and Secret

Prior to importing products from your WooCommerce store for training purposes, you'll need to obtain a consumer key and secret for making API calls. Please be assured that we use your consumer key and secret solely for this purpose; they are not retained by our system. Below are the steps to obtain the key pair:

  • 1.1 Go to your WordPress Admin Page

You can navigate to your WordPress Admin page by appending /wp-admin to your WordPress website domain, e.g., https://example.wordpress.com/wp-admin.

  • 1.2 Go to WooCommerce's Advanced Settings

    Click WooCommerce on the left sidebar and then click the Settings tab. This will take you to the WooCommerce settings page.

  • 1.3 Add the API Key

    After clicking the Advanced tab and then the REST API tab, click the Add key button to start the API key creation process.

  • 1.4 Create the API Key

    Enter the API key name, select the user account, choose the Read permission, then click the Generate API key button to create your API key pair.

  • 1.5 Save API Key Pair

    After creation, your consumer key and consumer secret will be displayed. Copy them in preparation for importing your products into Chat Data.

Step 2: Import Products

Open the Sources tab, choose Product as the source, and fill in the domain of your WordPress website (without the http:// prefix) along with the consumer key and consumer secret obtained in Step 1. Then press the Import button to import products from your WooCommerce store.

By clicking the eye icon, you can access a detailed view of each product's information. Should you find any discrepancies, correct the product details within your WooCommerce store and re-import the products; direct inline modification of product details within the product modal is not permitted. You can also remove any unwanted products, retaining only those you wish to keep. Once you have verified that all product details are accurate, click the Create/Retrain Chatbot button to initiate the chatbot training process.
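
For the curious, an import like this corresponds to the standard WooCommerce REST API products endpoint. The sketch below shows the kind of call involved; our actual import logic may differ, and the domain and credentials are placeholders:

```python
import requests

store = "https://example.com"    # your WordPress site
consumer_key = "ck_xxxxxxxx"     # from Step 1
consumer_secret = "cs_xxxxxxxx"  # from Step 1

# WooCommerce REST API: list products, authenticating with the key pair
# as HTTP Basic credentials (valid over HTTPS).
resp = requests.get(
    f"{store}/wp-json/wc/v3/products",
    auth=(consumer_key, consumer_secret),
    params={"per_page": 10},
    timeout=30,
)
for product in resp.json():
    print(product["id"], product["name"], product.get("stock_quantity"))
```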

Shopify

Step 1: Get Your Access Token

Prior to importing products from your Shopify store for training purposes, you'll need to obtain an access token. This token enables our platform to invoke the Shopify GraphQL API and import all your store's products. Please be assured that we utilize your access token solely for this purpose and it will not be retained by our system. Below are the steps to acquire your access token:

  • 1.1 Create the Dev App

    Click the Create an app button under the Apps and sales channels tab.

  • 1.2 Config Admin API Scope

    Add the read_products scope to the created app and save the configuration.

  • 1.3 Install the App

    Click the Install app button so that the created app takes effect.

  • 1.4 Reveal Access Token

    Click the Reveal token once button to get the access token.

Step 2: Import Products From Shopify Store

Press the Import button to import products from your Shopify store. By clicking the eye icon, you can access a detailed view of each product's information. Should you find any discrepancies, correct the product details within your Shopify store and re-import the products; direct inline modification of product details within the product modal is not permitted. You can also remove any unwanted products, retaining only those you wish to keep. Once you have verified that all product details are accurate, click the Create/Retrain Chatbot button to initiate the chatbot training process.
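
The read_products scope from Step 1 is what authorizes product queries against the Shopify Admin GraphQL API. A hypothetical sketch of such a query (the shop domain, token, and API version are placeholders; our actual import logic may differ):

```python
import requests

shop = "your-store.myshopify.com"
access_token = "shpat_xxxxxxxx"  # from Step 1

query = """
{
  products(first: 10) {
    edges { node { id title totalInventory } }
  }
}
"""

# Shopify Admin GraphQL API call authenticated with the access token.
resp = requests.post(
    f"https://{shop}/admin/api/2024-01/graphql.json",
    json={"query": query},
    headers={"X-Shopify-Access-Token": access_token},
    timeout=30,
)
for edge in resp.json()["data"]["products"]["edges"]:
    print(edge["node"]["title"], edge["node"]["totalInventory"])
```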
