> ## Documentation Index
> Fetch the complete documentation index at: https://docs.convocore.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Crawler

> Efficiently build your agents knowledge base with the Convocore AI Crawler

The Crawler is a powerful tool designed to streamline the process of creating and maintaining your agent's knowledge base. By automatically scraping and importing content from specified websites, the crawler ensures your agent stays up-to-date with the latest information.

## How the Crawler Works

<Info>
  The crawler operates by systematically visiting web pages, extracting relevant content, and organizing it into makrdown suitable for your agent's knowledge base. This process involves crawling through links, scraping text, and formatting the data for optimal use by AI models.
</Info>

## Crawler Jobs

The core of the crawler functionality revolves around crawler jobs. Each job is associated with specific source URLs you want to crawl and is identified by a unique ID. The [developer API](/api-reference) offers several ways to interact with the crawler.

<Card title="Job Details UI" icon="briefcase">
  <img src="https://mintcdn.com/convocore/h_96SLAvmHo1k2ly/images/crawler-ui-shots.png?fit=max&auto=format&n=h_96SLAvmHo1k2ly&q=85&s=f65c915120f85787a41fb5e9dae15ef4" alt="Scraped Pages Interface" width="1875" height="625" data-path="images/crawler-ui-shots.png" />
</Card>

### Creating a New Crawler Job

To initiate a new crawler job, follow these steps:

<Steps>
  <Step title="Set Source URLs">
    Navigate to the crawler tab from the menu on the left side of your dashboard. Click on `new job` and Enter the main URL(s) you want the crawler to begin with (e.g., [https://www.convocore.ai](https://www.convocore.ai)).

    <Warning>
      Ensure you use valid URLs in the correct format: [https://example.com](https://example.com)
    </Warning>
  </Step>

  <Step title="Configure Crawl Settings">
    There are two options to consider that determines the quality of the scrape:

    <Accordion title="Regular crawl">
      * Default option, costs **1 credit** per page scraped.
    </Accordion>

    <Accordion title="Deep crawl">
      * Autoscrolls and forces loading of images for better quality. Costs **10 credits** per page.
    </Accordion>
  </Step>

  <Step title="Refresh Rate">
    ## Set Crawler Refresh Rate

    The crawler refresh rate determines how often the crawler will update the current job with potential new information from the scraped site. This is particularly useful for websites that update frequently, such as e-commerce sites.

    <Tip>
      You can create separate crawler jobs for different sub-pages. This allows you to set different refresh rates for various sections of a website. For example, on a Shopify site, you might want to update `/collections/protein-powder` more frequently than the main page, as product information changes more often.
    </Tip>

    ### Available Refresh Rate Options:

    ```bash Refresh Rates theme={null}
    Every 6 hours
    Every 12 hours
    Every 24 hours
    Every 7 days
    Never
    ```
  </Step>

  <Step title="Specify Page Limit">
    Set the maximum number of pages to scrape for that job, ranging from **10** up to **500** pages.

    <Tip>
      Review the **sitemap** beforehand to determine the optimal number of pages to scrape. To view, write `/sitemap.xml` at the end of a valid URL. Ex. [https://www.convocore.ai/sitemap.xml](https://www.convocore.ai/sitemap.xml) or use [this](https://www.seowl.co/sitemap-extractor/)
    </Tip>
  </Step>

  <Step title="Define URL Patterns">
    <Card title="Match URLs: Include subpages you want to scrape based on the source URL." icon="link">
      <Accordion title="Example: Match URLs">
        ```
        /collections
        /products
        /blog
        ```

        This would include URLs containing the above.
      </Accordion>
    </Card>

    <Card title="Unmatch Patterns: Specify URLs or patterns to exclude from the scrape." icon="filter">
      <AccordionGroup>
        <Accordion title="Example: Unmatch Patterns">
          ```
          /blog
          ```

          This would exclude any URLs containing "/blog"
        </Accordion>
      </AccordionGroup>
    </Card>
  </Step>
</Steps>

<Note>
  Coming soon: Ability to assign crawl jobs directly to specific agents for automatic knowledge base updates.
</Note>

## Scraped Pages

After completing a crawler job, you can review and manage the scraped pages in the jobs dedicated interface. This section provides an overview of all pages collected during the job and status messages, such as when the maximum page limit is reached or when the crawler is active.

<img src="https://mintcdn.com/convocore/h_96SLAvmHo1k2ly/images/crawler-job-ui-shots.png?fit=max&auto=format&n=h_96SLAvmHo1k2ly&q=85&s=ae38e50e7a5b4bd4252492127357549c" alt="Scraped Pages Interface" width="600" data-path="images/crawler-job-ui-shots.png" />

<Card icon="file-lines" title="The pages view provides:">
  <AccordionGroup>
    <Accordion title="Unique document ID">
      A distinct identifier for each scraped page
    </Accordion>

    <Accordion title="URL">
      The web address of the scraped page
    </Accordion>

    <Accordion title="Title">
      The main title of the document
    </Accordion>

    <Accordion title="Description">
      A brief summary of the page content
    </Accordion>

    <Accordion title="Character count">
      The total number of characters in the scraped document
    </Accordion>
  </AccordionGroup>
</Card>

## Managing Scraped Pages

You can perform the following actions on the scraped pages:

<Steps>
  <Step title="Select Pages" icon="check">
    Check the pages you want to process further
  </Step>

  <Step title="Export" icon="file-export">
    Download selected pages as a zip file containing .txt documents
  </Step>

  <Step title="Import" icon="file-import">
    Add selected pages to the knowledge base of your chosen agent

    <video controls autoPlay muted loop playsInline style={{ width: '500px', borderRadius: '0.8rem' }} src="https://mintcdn.com/convocore/h_96SLAvmHo1k2ly/images/crawler-import-page-video.mp4?fit=max&auto=format&n=h_96SLAvmHo1k2ly&q=85&s=ebe2ab63009f367aa4f06b25ef5903b9" data-path="images/crawler-import-page-video.mp4" />
  </Step>
</Steps>

## Scraped Page Data

The scraped page data shows a detailed view of each page scraped in the job:

<img src="https://mintcdn.com/convocore/h_96SLAvmHo1k2ly/images/crawler-page-ui-shots.png?fit=max&auto=format&n=h_96SLAvmHo1k2ly&q=85&s=c9aa69e52991018d8cf3fa1a1ed7d729" alt="Scraped Page Data Interface" width="600" data-path="images/crawler-page-ui-shots.png" />

<Card icon="file-lines" title="The page data view provides:">
  <AccordionGroup>
    <Accordion title="URL">
      The web address of the scraped page
    </Accordion>

    <Accordion title="Title">
      The main title of the page
    </Accordion>

    <Accordion title="Description">
      A descriptive sentence that works as a summary of the page content and provides context to the LLM when retrieving from the knowledge base.
    </Accordion>

    <Accordion title="Detected URLs">
      Links found within the scraped page
    </Accordion>

    <Accordion title="Content">
      The main text scraped from the page, formatted in markdown
    </Accordion>
  </AccordionGroup>
</Card>

#

**Example snippet of scraped information in markdown:**

```markdown theme={null}
convocore AI provides access to a wide range of **state-of-the-art** AI models, ensuring that your agents are always equipped with the best and newest models on the market.

=========================================================

As soon as **new models** are released, the Convocore AI team promptly updates the platform. This means you typically get access to the latest and most powerful models **right away**.
```

<Note>
  Scraped information is formatted in markdown for easy reading by LLMs. To learn more about formatting KB documents, visit the [formatting doc](/agent-creation/knowledgebase/structuring-kb-documents).
</Note>

## Crawler Job Status

When you initiate a new crawler job, it will progress through several status stages:

1. **Pending**: The crawler has started and is in the process of gathering URLs and scraping content.
2. **Active**: The crawler is actively scraping pages.
3. **Completed**: The job has finished, and all specified pages have been scraped.

<Check>
  You will receive a notification in the dashboard when the job status changes to `Completed`.
</Check>

## Best Practices and Tips

<CardGroup cols={2}>
  <Card title="Optimize URL Patterns" icon="sitemap">
    Carefully define match and unmatch patterns to focus on the most relevant content.
  </Card>

  <Card title="Monitor Credit Usage" icon="coins">
    Remember that each page scraped costs credits (1 for normal, 10 for deep scrape).
  </Card>

  <Card title="Regular Updates" icon="clock-rotate-left">
    Set appropriate refresh rates for dynamic content to keep your knowledge base current.
  </Card>

  <Card title="Review Before Import" icon="magnifying-glass">
    Always review scraped content before importing it into your agent's knowledge base.
  </Card>
</CardGroup>
