Advanced Settings & Filters

Get more control over your crawls.

Written by Jeff
Updated over a week ago

Advanced Settings

Custom Headers: These allow your server to identify our crawlers, in case you need to whitelist them so they can pass through your security layers.

Crawl an XML URL:
Enter the URL of an XML sitemap. We crawl only the URLs found in the sitemap; links on those pages will not be followed.
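
To make the sitemap behaviour concrete, here is a minimal Python sketch of how URLs could be pulled out of a sitemap without following any links. It is an illustration only, not our crawler's actual code: the function name `urls_from_sitemap` and the sample sitemap are hypothetical, and it assumes the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Standard namespace used by sitemaps.org-style sitemaps.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> value in the sitemap, in document order."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


# Hypothetical sample sitemap, for illustration only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://site.com/dog/</loc></url>
  <url><loc>https://site.com/cat/</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in urls_from_sitemap(SAMPLE):
        print(url)
```

Only the listed URLs are returned; nothing on those pages is parsed for further links.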

Crawl a CSV file:
Format: the first column must contain a URL.
We crawl only the URLs found in the file; links on those pages will not be followed. Remember to use a comma (,) as the delimiter and a double quote (") as the escape character.
The first row of the CSV may contain headers.
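
If you want to check a CSV before uploading it, a rough Python sketch like the one below reads the first column the way the format is described above. The function name `urls_from_csv` and the sample data are hypothetical, and the header handling (skipping a first cell that does not start with http:// or https://) is an assumption made for this sketch, not a description of our importer.

```python
import csv
import io


def urls_from_csv(csv_text: str) -> list[str]:
    """Collect URLs from the first column of comma-delimited CSV text."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=",", quotechar='"')
    urls = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        first = row[0].strip()
        # Assumption: anything that is not an http(s) URL is a header or junk.
        if first.startswith(("http://", "https://")):
            urls.append(first)
    return urls


# Hypothetical sample: header row, then a quoted field containing a comma.
SAMPLE = 'url,note\nhttps://site.com/dog/,"food, toys"\nhttps://site.com/cat/,litter\n'

if __name__ == "__main__":
    print(urls_from_csv(SAMPLE))
```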


Advanced Filters

Include & Exclude Directories
Directory matches are now NOT strict (i.e. they are "fuzzy"): the directory can appear anywhere in the URL's path.

e.g. /dog/ will match either of these URLs:

site.com/dog/... OR
site.com/products/dog/...


e.g. /pets/dog/ will match either of these URLs:

site.com/pets/dog/... OR
site.com/products/pets/dog/... 
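
The fuzzy directory behaviour above can be pictured as a simple "contains" check on the URL's path. The Python sketch below is a simplified model under that assumption, not the crawler's real matching code; `directory_matches` is a hypothetical name and the sample paths are made up.

```python
from urllib.parse import urlparse


def directory_matches(url: str, directory: str) -> bool:
    """Return True if the directory (e.g. '/dog/') appears anywhere in the path."""
    # Add a scheme when it is missing so urlparse can separate host and path.
    parsed = urlparse(url if "://" in url else "https://" + url)
    return directory in parsed.path


if __name__ == "__main__":
    print(directory_matches("site.com/dog/toys.html", "/dog/"))
    print(directory_matches("site.com/products/dog/toys.html", "/dog/"))
    print(directory_matches("site.com/dog-cages/", "/dog/"))
```

In this model /dog/ matches both site.com/dog/toys.html and site.com/products/dog/toys.html, but not site.com/dog-cages/, because the directory must appear with both of its slashes.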



Include & Exclude Keywords

Keyword matches are also NOT strict ("fuzzy"), and keywords are more flexible than directories because they can match any part of the URL's path.

To include all URLs containing 'dog' (e.g. https://site.com/dog-products.html), enter dog. This will also include URLs such as https://site.com/categories/dogs/food

IMPORTANT!

For URLs that have parameters, like:

site.com/index.php?page=1

to exclude all of those pages, enter only the parameter as the keyword:

page=

Do not use:

index.php?page=


Advanced Combos:

All of the filters can be used in combination with each other:
e.g. /directory/ + keyword = returns only the URLs that match both.
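
As a rough illustration of combining two filters, the Python sketch below treats a directory filter and a keyword filter as an AND: a URL passes only when both appear in its path. The name `passes_both` and the sample values are hypothetical, and the plain substring check is a simplifying assumption rather than the crawler's actual logic.

```python
from urllib.parse import urlparse


def passes_both(url: str, directory: str, keyword: str) -> bool:
    """True only when the path contains the directory AND the keyword."""
    parsed = urlparse(url if "://" in url else "https://" + url)
    return directory in parsed.path and keyword in parsed.path


if __name__ == "__main__":
    # Hypothetical filters: directory '/pets/' combined with keyword 'dog'.
    print(passes_both("site.com/pets/dog-food.html", "/pets/", "dog"))
    print(passes_both("site.com/pets/cat-food.html", "/pets/", "dog"))
    print(passes_both("site.com/shop/dog-food.html", "/pets/", "dog"))
```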

Example:  Directories + Keywords together

will not crawl:  site.com/products/dog/monkey-pants.html  ✘
will not crawl:  site.com/dog/monkey-pants.html  ✘
will not crawl:  site.com/dog/monkeys.html  ✘

it will crawl:  site.com/dog/foo.html  ✅
_________________

Multiple values within a single filter use "OR" logic.

That is, if you enter 3 keywords (dog, cat, monkey), a URL will pass the check as long as it has dog OR cat OR monkey in its path.
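
The "OR" rule for several values in one filter can be sketched in Python as follows. This is a simplified model that assumes a plain substring check on the path; `matches_any_keyword` is a hypothetical name, not part of the product.

```python
from urllib.parse import urlparse


def matches_any_keyword(url: str, keywords: list[str]) -> bool:
    """True if at least one keyword appears somewhere in the URL's path."""
    parsed = urlparse(url if "://" in url else "https://" + url)
    return any(keyword in parsed.path for keyword in keywords)


if __name__ == "__main__":
    keywords = ["dog", "cat", "monkey"]
    print(matches_any_keyword("site.com/tags/cat/toys", keywords))
    print(matches_any_keyword("site.com/birds/seed.html", keywords))
```

One matching keyword is enough; a URL fails the check only when none of the values appear in its path.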

Example:  Multiple Keywords

will not crawl:  site.com/dog/foo.html  ✘
will not crawl:  site.com/products/dog/...  ✘
will not crawl:  site.com/tags/cat/...  ✘
will not crawl:  site.com/category/monkey/...  ✘

it will crawl:  site.com/cats/foo.html  ✅


Example:  Multiple /Directories/ + Multiple Keywords

will not crawl:  site.com/dog/monkey-pants.html  ✘
will not crawl:  site.com/cat/monkeys.html  ✘
will not crawl:  site.com/dog/birds/..  ✘
will not crawl:  site.com/monkey/..  ✘
will not crawl:  site.com/bird-cages/..  ✘

it will crawl:  site.com/dog-cages/  ✅
it will crawl:  site.com/categories/cat-litters/  ✅



Example: Exclude + Include Combos:

will not crawl:  site.com/cake/monkeys.html  ✘
will not crawl:  site.com/cake/monkey-pants/foo.html  ✘
will not crawl:  site.com/monkey/cakes.html  ✘

it will crawl:  site.com/cake/foo.html  ✅

*Exclude and Include filters can also cancel each other out, for example if you use:
Exclude Directory = "/cat/" + Include Keyword = "cat"
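
To see why such a pair can cancel out, here is a hedged Python sketch of one possible evaluation order, in which an exclude match always wins over an include match. That precedence is an assumption made for this illustration, and `should_crawl` is a hypothetical name.

```python
from urllib.parse import urlparse


def should_crawl(url: str, include_keywords: list[str], exclude_directories: list[str]) -> bool:
    """Apply exclude directories first, then require an include keyword."""
    path = urlparse(url if "://" in url else "https://" + url).path
    # Assumption: any excluded directory in the path rejects the URL outright.
    if any(directory in path for directory in exclude_directories):
        return False
    # If include keywords are set, at least one must appear in the path.
    if include_keywords and not any(keyword in path for keyword in include_keywords):
        return False
    return True


if __name__ == "__main__":
    # The include keyword 'cat' matches, but the '/cat/' exclude rejects it.
    print(should_crawl("site.com/cat/toys.html", ["cat"], ["/cat/"]))
```

Under this model, every URL inside /cat/ is picked up by the include keyword and then dropped by the exclude directory, so the two settings work against each other for that directory.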
