Chatbot Kit

Generate Knowledge Base

Transform your website content into intelligent, searchable knowledge bases with this automated workflow that crawls URLs, processes content with AI, and creates vector embeddings for semantic search.

What it is

The Generate Knowledge Base workflow allows you to:

  • Submit multiple website URLs through a secure web form
  • Automatically crawl and extract clean content from each page
  • Generate intelligent Q&A pairs using AI to improve searchability
  • Store everything as vector embeddings in your Qdrant knowledge base
  • Keep your data fresh by automatically updating existing content

How to use it

Step 1: Access the Form

Navigate to your workflow's webhook URL to access the knowledge base form. You'll see:

  • Knowledge Base Dropdown: Select which collection to populate (e.g., "customer-service", "wellness-center")
  • Website URLs Field: Enter multiple URLs, one per line

Step 2: Submit URLs

Add the website URLs you want to crawl:

https://example.com/help/getting-started
https://example.com/faq
https://example.com/documentation/api
...

Click submit to start the automated processing.
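
As a rough illustration of how a one-URL-per-line field can be turned into a clean list, here is a minimal Python sketch. The function name parse_url_field is hypothetical, and the validation and de-duplication rules are assumptions; the workflow's own parsing may differ.

```python
from urllib.parse import urlparse


def parse_url_field(raw: str) -> list[str]:
    """Split a multi-line form field into a de-duplicated list of http(s) URLs."""
    seen = set()
    urls = []
    for line in raw.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # skip blank lines
        parts = urlparse(candidate)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue  # assumption: silently skip anything that is not an absolute http(s) URL
        if candidate not in seen:
            seen.add(candidate)
            urls.append(candidate)
    return urls


field_value = """https://example.com/help/getting-started
https://example.com/faq
https://example.com/faq
"""
print(parse_url_field(field_value))
```

Checking the list this way before submitting helps avoid spending crawl requests on malformed or repeated entries.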

Step 3: Automated Processing

The workflow will automatically:

  1. Clean existing data - Remove any outdated content from the same URLs
  2. Crawl websites - Extract clean, readable content from each URL
  3. Generate Q&A pairs - Use AI to create 5-10 relevant questions and answers per page
  4. Create embeddings - Convert content into searchable vectors
  5. Store in Qdrant - Save everything to your selected knowledge base collection
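
The order of these five steps can be sketched in plain Python. Everything below is a stand-in: InMemoryStore, crawl, generate_qa, and embed are hypothetical placeholders for the Qdrant collection, the crawler, the language model, and the embedding model, so the sketch only shows the control flow, not the real integrations.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for a knowledge base collection."""
    points: list = field(default_factory=list)

    def delete_by_url(self, url: str) -> None:
        self.points = [p for p in self.points if p["url"] != url]

    def add(self, point: dict) -> None:
        self.points.append(point)


def crawl(url: str) -> str:
    # Placeholder for the crawl step; a real run would fetch the page content.
    return f"Content of {url}"


def generate_qa(text: str) -> list[tuple[str, str]]:
    # Placeholder for the AI step; a real run would ask the language model.
    return [("What is this page about?", text)]


def embed(text: str) -> list[float]:
    # Placeholder embedding: NOT semantic, it only keeps the sketch runnable.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def process(urls: list[str], store: InMemoryStore) -> None:
    for url in urls:
        store.delete_by_url(url)                      # 1. clean existing data
        text = crawl(url)                             # 2. crawl the page
        for question, answer in generate_qa(text):    # 3. generate Q&A pairs
            vector = embed(f"{question}\n{answer}")   # 4. create embeddings
            store.add({"url": url, "question": question,
                       "answer": answer, "vector": vector})  # 5. store


store = InMemoryStore()
process(["https://example.com/faq"], store)
process(["https://example.com/faq"], store)  # re-running replaces, it does not duplicate
print(len(store.points))
```

Because the cleanup step runs first for every URL, submitting the same URL twice leaves a single, current set of entries rather than two copies.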

Step 4: Content Enhancement

The AI processes each page to:

  • Extract the most relevant information
  • Generate professional, concise question/answer pairs
  • Handle incomplete or unclear content gracefully
  • Create content optimized for chatbot responses

How it works

  1. Smart Crawling: Uses Firecrawl to extract only the main content, excluding navigation, ads, and images
  2. AI Enhancement: Transforms raw content into structured Q&A pairs using your configured language model
  3. Vector Search: Creates semantic embeddings that enable intelligent content retrieval
  4. Data Management: Automatically handles duplicates and maintains content freshness
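
To make the vector search idea concrete, here is a toy Python sketch of similarity-based retrieval. The three-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and cosine similarity is assumed as the distance metric; your collection may be configured with a different one.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_match(query_vector: list[float], entries: list[dict]) -> str:
    """Return the text of the entry whose vector is most similar to the query."""
    best = max(entries, key=lambda e: cosine_similarity(query_vector, e["vector"]))
    return best["text"]


# Made-up vectors standing in for stored Q&A embeddings.
entries = [
    {"text": "How do I reset my password?", "vector": [0.9, 0.1, 0.0]},
    {"text": "What are your opening hours?", "vector": [0.0, 0.2, 0.9]},
]

# A query vector that points in roughly the same direction as the first entry.
print(top_match([0.8, 0.2, 0.1], entries))
```

Retrieval works on direction rather than exact wording, which is why a question phrased differently from the stored text can still find the right entry.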

Customization Options

Language Model

The workflow uses OpenAI GPT-4o-mini by default, but you can replace it with any compatible language model. To switch, update the LLM node in your n8n workflow to use your preferred model.

Content Processing

You can customize:

  • Chunk size: Adjust how content is split for processing
  • Q&A generation: Modify the AI prompt to change question/answer style (the prompt is loaded from the get scenario data workflow)
  • Content filtering: Configure which HTML elements to exclude
  • Metadata: Add custom fields for better content organization
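
To show what the chunk size setting controls, here is a minimal character-based splitter in Python. The function name chunk_text, the default values, and the overlap parameter are assumptions for illustration; the splitter node in your workflow may count tokens instead of characters and use different defaults.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

Smaller chunks give more precise matches but less context per result; larger chunks do the opposite, so the right value depends on how long your typical answers are.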

Best Practices

  • URL Quality: Use URLs with substantial, well-structured content
  • Batch Size: Process 1-10 URLs at a time for optimal performance
  • Content Updates: Re-run the workflow periodically to keep knowledge bases current
  • Knowledge Base Organization: Use descriptive collection names for different content types
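
If you have a long list of URLs, one way to follow the batch size advice is to split the list before submitting. This is a small Python sketch; the helper name batches and the example URLs are hypothetical.

```python
from typing import Iterator


def batches(urls: list[str], size: int = 10) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


# Hypothetical list of 23 pages, submitted as three separate form runs.
all_urls = [f"https://example.com/page-{n}" for n in range(23)]
for batch in batches(all_urls):
    print(len(batch), "URLs in this submission")
```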

The workflow handles all technical complexity automatically, letting you focus on building comprehensive knowledge bases for your chatbots and AI applications.

Clearing Existing Data

By default, the workflow removes any existing content in the selected knowledge base that matches the submitted URLs. This keeps your knowledge base up to date and free of duplicates.
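
For reference, a URL-based cleanup like this can be expressed as a filtered delete against Qdrant's REST API. The Python sketch below only builds the request path and body; it does not send anything. The helper name build_delete_request and the payload key metadata.url are assumptions, so check how your workflow actually stores the source URL before reusing it.

```python
import json


def build_delete_request(collection: str, urls: list[str],
                         url_key: str = "metadata.url") -> tuple[str, dict]:
    """Build the path and JSON body for a filtered point deletion in Qdrant.

    url_key is an assumed payload field holding each point's source URL.
    """
    path = f"/collections/{collection}/points/delete"
    body = {
        "filter": {
            "must": [
                # Match points whose source URL is any of the submitted URLs.
                {"key": url_key, "match": {"any": urls}},
            ]
        }
    }
    return path, body


path, body = build_delete_request(
    "customer-service",
    ["https://example.com/faq", "https://example.com/help/getting-started"],
)
print("POST", path)
print(json.dumps(body, indent=2))
```

Only points that match the filter are removed, so content from URLs you did not resubmit stays in the collection.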

To clear a knowledge base collection completely, use the Clear Full Scenario Knowledge Base workflow. It deletes the collection and all its data, which is useful when you want to start fresh with entirely new content.