Generate Knowledge Base

Transform your website content into intelligent, searchable knowledge bases with this automated workflow that crawls URLs, processes content with AI, and creates vector embeddings for semantic search.

What it is

The Generate Knowledge Base workflow allows you to:

Submit multiple website URLs through a secure web form
Automatically crawl and extract clean content from each page
Generate intelligent Q&A pairs using AI to improve searchability
Store everything as vector embeddings in your Qdrant knowledge base
Keep your data fresh by automatically updating existing content

How to use it

Step 1: Access the Form

Navigate to your workflow's webhook URL to access the knowledge base form. You'll see:

Knowledge Base Dropdown: Select which collection to populate (e.g., "customer-service", "wellness-center")
Website URLs Field: Enter multiple URLs, one per line

Step 2: Submit URLs

Add the website URLs you want to crawl:

https://example.com/help/getting-started
https://example.com/faq
https://example.com/documentation/api
...

Click submit to start the automated processing.
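
How the form splits the URL field is not documented here. As a rough illustration, a one-URL-per-line field could be turned into a clean list along these lines. The function name `parse_url_field` and the choice to drop duplicates and non-http(s) lines are assumptions for this sketch, not the workflow's confirmed behaviour.

```python
from urllib.parse import urlparse


def parse_url_field(raw: str) -> list[str]:
    """Split a multi-line form field into an ordered, de-duplicated URL list.

    Hypothetical helper: blank lines, lines without an http(s) scheme,
    and repeated URLs are skipped.
    """
    seen = set()
    urls = []
    for line in raw.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # ignore empty lines
        parsed = urlparse(candidate)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue  # ignore anything that is not an absolute http(s) URL
        if candidate not in seen:
            seen.add(candidate)
            urls.append(candidate)
    return urls
```

Filtering before submission keeps stray text in the field from being passed to the crawler as if it were a page address.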

Step 3: Automated Processing

The workflow will automatically:

Clean existing data - Remove any outdated content from the same URLs
Crawl websites - Extract clean, readable content from each URL
Generate Q&A pairs - Use AI to create 5-10 relevant questions and answers per page
Create embeddings - Convert content into searchable vectors
Store in Qdrant - Save everything to your selected knowledge base collection
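
The order of the five steps above can be sketched in plain Python. This is only an outline under stated assumptions: the store is an in-memory list standing in for a Qdrant collection, and `crawl`, `generate_qa`, and `embed` are hypothetical callables standing in for Firecrawl, the language model, and the embedding model. It also assumes each Q&A pair is stored as one record, which the workflow may or may not do.

```python
def process_url(url, store, crawl, generate_qa, embed):
    """Run the five processing steps for one URL against an in-memory store.

    All collaborators are injected, so this sketch makes no claim about
    the real Firecrawl, LLM, or Qdrant calls.
    """
    # 1. Clean existing data: drop records previously stored for this URL.
    store[:] = [record for record in store if record["url"] != url]

    # 2. Crawl: fetch the page's main content as text.
    text = crawl(url)

    # 3. Generate Q&A pairs from the page text.
    pairs = generate_qa(text)

    # 4 and 5. Create an embedding for each pair and store it with its source URL.
    for question, answer in pairs:
        store.append(
            {
                "url": url,
                "question": question,
                "answer": answer,
                "vector": embed(question + " " + answer),
            }
        )
    return len(pairs)
```

Because the cleanup step runs first, processing the same URL twice leaves one set of records rather than two.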

Step 4: Content Enhancement

The AI processes each page to:

Extract the most relevant information
Generate professional, concise question/answer pairs
Handle incomplete or unclear content gracefully
Create content optimized for chatbot responses

How it works

Smart Crawling: Uses Firecrawl to extract only the main content, excluding navigation, ads, and images
AI Enhancement: Transforms raw content into structured Q&A pairs using your configured language model
Vector Search: Creates semantic embeddings that enable intelligent content retrieval
Data Management: Automatically handles duplicates and maintains content freshness
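
To make the vector search idea concrete, the sketch below ranks stored records by cosine similarity to a query vector. It is a simplified stand-in for what a vector database does internally. The record layout and the use of cosine similarity are assumptions, since the distance metric configured for the Qdrant collection is not stated here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, records, k=3):
    """Return the k records whose 'vector' field is most similar to the query."""
    ranked = sorted(
        records,
        key=lambda record: cosine_similarity(query_vector, record["vector"]),
        reverse=True,
    )
    return ranked[:k]
```

Records whose embeddings point in nearly the same direction as the query rank first, which is what lets a question retrieve content that is phrased differently but means the same thing.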

Customization Options

Language Model

The workflow uses OpenAI GPT-4o-mini by default, but you can replace it with any compatible language model. To switch, update the LLM node in your n8n workflow to use your preferred model.

Content Processing

You can customize:

Chunk size: Adjust how content is split for processing
Q&A generation: Modify the AI prompt to change question/answer style (the prompt is loaded from the Get Scenario Data workflow)
Content filtering: Configure which HTML elements to exclude
Metadata: Add custom fields for better content organization
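
As an illustration of what the chunk size setting controls, here is a minimal character-based splitter with overlap. The function name, the default values, and the character-based strategy are assumptions for this sketch. The text splitter configured in the workflow may split on tokens or separators instead.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks
```

Smaller chunks give more precise matches but less context per result. Larger chunks do the opposite, so the right value depends on how long the answers on your pages tend to be.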

Best Practices

URL Quality: Use URLs with substantial, well-structured content
Batch Size: Process 1-10 URLs at a time for optimal performance
Content Updates: Re-run the workflow periodically to keep knowledge bases current
Knowledge Base Organization: Use descriptive collection names for different content types
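
If you have a long list of pages, the batch size advice above can be followed by splitting the list before submitting it. A small sketch, where the helper name `batch_urls` and the default of 10 per batch are illustrative choices:

```python
def batch_urls(urls, batch_size=10):
    """Split a list of URLs into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]
```

Each batch can then be pasted into the form as its own submission.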

The workflow handles all technical complexity automatically, letting you focus on building comprehensive knowledge bases for your chatbots and AI applications.

Clearing Existing Data

By default, the workflow removes any existing content from the selected knowledge base that matches the submitted URLs. This ensures that your knowledge base remains up-to-date without duplicates.
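
For readers who want to see what such a cleanup can look like at the Qdrant level, the sketch below builds a delete-by-filter request for Qdrant's REST API (`POST /collections/{collection_name}/points/delete`). The payload key `metadata.url` is an assumption about where the source URL is stored. Check your own collection's payload before relying on it. The function only builds the request and does not send anything.

```python
def build_delete_request(collection, urls, url_key="metadata.url"):
    """Build the path and JSON body for deleting all points that came from the given URLs.

    url_key is a hypothetical payload field name; adjust it to match
    how your points actually record their source URL.
    """
    path = f"/collections/{collection}/points/delete"
    body = {
        "filter": {
            "must": [
                # "any" matches points whose url_key equals one of the listed values
                {"key": url_key, "match": {"any": list(urls)}}
            ]
        }
    }
    return path, body
```

Deleting by filter rather than by point ID means the cleanup does not need to know how many records a page produced on its previous run.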

To clear a knowledge base collection completely, use the Clear Full Scenario Knowledge Base workflow. It deletes the collection and all its data, which is useful when you want to start fresh with entirely new content.
