Turn your website docs into a GPT-4.1-mini support chatbot with MrScraper and Pinecone

Created by: riandra (riandradiva)

Last update: 17 hours ago

Description

This n8n template turns any website or documentation portal into a fully functional AI-powered support chatbot — no manual copy-pasting, no static FAQs. It uses MrScraper to crawl and extract your site's content, OpenAI to generate embeddings, and Pinecone to store and retrieve that knowledge at chat time.

The result is a retrieval-augmented chatbot that answers questions using only your actual website content, always cites its sources, and is explicitly instructed not to invent policies or pricing.


How It Works

  • Phase 1 – URL Discovery: The Map Agent crawls your target domain using include/exclude patterns to discover all relevant documentation or help center pages. It returns a clean, deduplicated list of URLs ready for content extraction.
  • Phase 2 – Page Content Extraction: Each discovered URL is processed in controlled batches by the General Agent, which extracts the readable content (title + main text) from every page. Low-quality or near-empty pages are automatically filtered out.
  • Phase 3 – Chunking & Embedding: Page text is split into overlapping chunks (default: ~1,100 chars with 180-char overlap) to preserve context at boundaries. Each chunk is sent to OpenAI Embeddings to generate a vector, then stored in Pinecone with metadata including the source URL, page title, and chunk index.
  • Phase 4 – Chat Endpoint: A Chat Trigger exposes a webhook endpoint your website or widget can connect to. When a user asks a question, the Support Chat Agent queries Pinecone for the most relevant chunks and generates a grounded answer using GPT-4.1-mini — always with source URLs included and strict anti-hallucination rules enforced.
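The chunking step in Phase 3 can be sketched as a simple sliding window. This is a minimal illustration of the logic, not part of the template itself (n8n's text splitter handles this for you); `chunk_text` and `to_records` are hypothetical helpers, and the default sizes match the workflow's ~1,100-char chunks with 180-char overlap:

```python
def chunk_text(text: str, size: int = 1100, overlap: int = 180) -> list[str]:
    """Split text into overlapping chunks so context survives chunk boundaries."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
        # Step forward by (size - overlap) so consecutive chunks share text.
        start += size - overlap
    return chunks

def to_records(url: str, title: str, text: str) -> list[dict]:
    """Pair each chunk with the metadata the workflow stores alongside its vector."""
    return [
        {"text": chunk,
         "metadata": {"source_url": url, "title": title, "chunk_index": i}}
        for i, chunk in enumerate(chunk_text(text))
    ]
```

Because each chunk repeats the last 180 characters of the previous one, a sentence that straddles a boundary still appears intact in at least one chunk, which is what makes retrieval reliable at the edges.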

How to Set Up

  1. Create two scrapers in your MrScraper account:

    • Map Agent Scraper (for crawling and discovering page URLs)
    • General Agent Scraper (for extracting title + content from each page)
    • Copy the scraperId for each — you'll need these in n8n.
  2. Set up your Pinecone index:

    • Create a Pinecone index with dimensions that match your chosen OpenAI embedding model (e.g. 1536 for text-embedding-ada-002)
    • Choose a namespace (recommended format: docs-yourdomain)
  3. Add your credentials in n8n:

    • MrScraper API token
    • OpenAI API key (used for both embeddings and the chat model)
    • Pinecone API key
  4. Configure the Map Agent node:

    • Set your target domain or docs root URL (e.g. https://docs.yoursite.com)
    • Set includePatterns to focus on relevant sections (e.g. /docs/, /help/, /support/)
    • Optionally set excludePatterns to skip noise (e.g. /assets/, /tag/, /static/)
  5. Configure the General Agent node:

    • Enter your General Agent scraperId
    • Adjust the batch size in the SplitInBatches node (start with 1–5 to stay within rate limits)
  6. Configure the Pinecone nodes:

    • Select your Pinecone index in both the Upsert and Retriever nodes
    • Set the correct namespace in both nodes so indexing and retrieval use the same data
  7. Customise the chatbot system prompt:

    • Edit the Support Chat Agent's system message to set the chatbot's name, tone, and rules
    • Adjust topK in the Pinecone Retriever (default: 8) based on how much context you want per answer
  8. Connect your chat widget or frontend to the Chat Trigger webhook URL generated by n8n
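The include/exclude filtering configured in step 4 behaves roughly like this. The real matching happens inside the Map Agent; `filter_urls` is a hypothetical helper shown only to make the pattern semantics concrete, using the example patterns from the steps above:

```python
from urllib.parse import urlparse

def filter_urls(urls,
                include=("/docs/", "/help/", "/support/"),
                exclude=("/assets/", "/tag/", "/static/")):
    """Deduplicate URLs, then keep those whose path matches at least one
    include pattern and none of the exclude patterns."""
    seen, kept = set(), []
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        path = urlparse(url).path
        if any(p in path for p in include) and not any(p in path for p in exclude):
            kept.append(url)
    return kept
```

A URL has to match an include pattern to survive, so anything outside your docs sections (blog posts, marketing pages) is dropped even without an explicit exclude rule.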


Requirements

  • MrScraper account with API access enabled
  • OpenAI account (for embeddings and GPT-4.1-mini chat)
  • Pinecone account with an index created and ready

Good to Know

  • The overlap between chunks (default 180 chars) is intentional — it prevents answers from being cut off at chunk boundaries and significantly improves retrieval quality.
  • The chatbot is configured to cite 1–3 source URLs per answer, so users can always verify the information themselves.
  • The anti-hallucination rules in the system prompt instruct the agent to say it can't find an answer rather than guess — making it safe to use for support, pricing, or policy questions.
  • Re-indexing is as simple as re-running the workflow. Use a consistent Pinecone namespace and upsert mode to update existing vectors without duplicating them.
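Stable vector IDs are what make that duplicate-free re-indexing work: if each chunk's ID is derived from its source URL and chunk index, re-running the workflow upserts over the same IDs instead of creating new vectors. A minimal sketch of one way to do it (`chunk_id` is illustrative, not a node in the template):

```python
import hashlib

def chunk_id(url: str, index: int) -> str:
    """Derive a deterministic vector ID from source URL + chunk index,
    so re-indexing overwrites existing vectors rather than duplicating them."""
    digest = hashlib.sha1(f"{url}#{index}".encode("utf-8")).hexdigest()[:16]
    return f"chunk-{digest}"
```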

Customising This Workflow

  • Swap the chat model: Replace GPT-4.1-mini with GPT-4o or another OpenAI model for higher-quality answers on complex queries.
  • Scheduled re-indexing: Add a Schedule Trigger to automatically re-crawl and re-index your docs whenever content changes.
  • Multiple knowledge bases: Use different Pinecone namespaces (e.g. docs-product, docs-api) and route questions to the right namespace based on user intent.
  • Embed on your website: Connect the Chat Trigger webhook to any chat widget library to give your users a live support experience powered entirely by your own documentation.
  • Multilingual support: Add a translation node before chunking to index content in multiple languages and serve a global audience.
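For the multiple-knowledge-bases idea, routing a question to the right namespace can start as simple keyword matching before the Pinecone query. This is a hypothetical sketch (in n8n you would put equivalent logic in a Code or Switch node); the namespaces and keywords below are examples to replace with your own:

```python
# Example keyword -> namespace map; tune the keywords to your own docs.
ROUTES = {
    "docs-api": ("endpoint", "api key", "webhook", "rate limit"),
    "docs-product": ("pricing", "plan", "feature", "billing"),
}

def route_namespace(question: str, default: str = "docs-product") -> str:
    """Pick the Pinecone namespace to query based on keywords in the question."""
    q = question.lower()
    for namespace, keywords in ROUTES.items():
        if any(k in q for k in keywords):
            return namespace
    return default
```

For fuzzier intents you could replace the keyword map with an embedding-similarity check or a small classification call, but a keyword table is often enough to start.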