Job
A job is a task that you run on the Webcrawler API. Jobs are asynchronous: you receive a notification when the job is done (read more about async requests).
Job request parameters
- url - (required) the seed URL where the crawler starts. Can be any valid URL.
- scrape_type - (default: html) the type of scraping you want to perform. Can be html or cleaned.
- items_limit - (default: 20) the crawler stops when it reaches this limit of pages for the job.
- webhook_url - (optional) the URL where the server will send a POST request once the task is completed (read more about webhooks and async requests).
- crawl_delay - (default: 2000) delay between requests in milliseconds. To respect the website and avoid being blocked, we recommend leaving the default.
- max_retries - (default: 2) the number of retries if a page request fails.
- whitelist_regexp - (optional) a regular expression to whitelist URLs. Only URLs that match the pattern will be crawled.
- blacklist_regexp - (optional) a regular expression to blacklist URLs. URLs that match the pattern will be skipped.
- allow_subdomains - (default: false) if true, the crawler will also crawl subdomains (for example, blog.example.com if the seed URL is example.com).
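As a rough illustration of how the documented defaults combine with caller-supplied values, the sketch below assembles a request body in Python. The helper name build_job_request is hypothetical and not part of the Webcrawler API. It only builds a dictionary from the parameters listed above and does not send anything, since the endpoint and authentication are not covered in this section.

```python
import json

# Defaults as listed in the parameter descriptions above.
DEFAULTS = {
    "scrape_type": "html",
    "items_limit": 20,
    "crawl_delay": 2000,
    "max_retries": 2,
    "allow_subdomains": False,
}
ALLOWED_SCRAPE_TYPES = {"html", "cleaned"}
# Optional parameters that have no documented default.
OPTIONAL_KEYS = {"webhook_url", "whitelist_regexp", "blacklist_regexp"}


def build_job_request(url, **options):
    """Hypothetical helper: merge caller options with the documented defaults."""
    if not url:
        raise ValueError("url is required")
    unknown = set(options) - set(DEFAULTS) - OPTIONAL_KEYS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    # Caller-supplied options override the defaults.
    body = {"url": url, **DEFAULTS, **options}
    if body["scrape_type"] not in ALLOWED_SCRAPE_TYPES:
        raise ValueError("scrape_type must be 'html' or 'cleaned'")
    return body


if __name__ == "__main__":
    body = build_job_request(
        "https://stripe.com/",
        items_limit=10,
        scrape_type="cleaned",
    )
    print(json.dumps(body, indent=2))
```

Validating scrape_type on the client is a design choice of this sketch; the server may well accept or reject values on its own terms.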
Example:
{
  "url": "https://stripe.com/",
  "webhook_url": "https://yourserver.com/webhook",
  "items_limit": 10,
  "crawl_delay": 2000,
  "max_retries": 1,
  "scrape_type": "cleaned",
  "allow_subdomains": false
}
Job response
- id - the unique identifier of the job.
- url - the seed URL where the crawler started.
- status - the status of the job. Can be new, in_progress, done, error.
- scrape_type - the type of scraping you want to perform.
- extract_rules - an object with rules to extract data from the page.
- whitelist_regexp - a regular expression to whitelist URLs.
- blacklist_regexp - a regular expression to blacklist URLs.
- allow_subdomains - if the crawler will also crawl subdomains.
- items_limit - the limit of pages for this job.
- crawl_delay_ms - delay between requests in milliseconds.
- max_retries - the number of retries if page request fails.
- created_at - the date when the job was created.
- finished_at - the date when the job was finished.
- webhook_url - the URL where the server will send a POST request once the task is completed.
- webhook_status - the status of the webhook request.
- webhook_error - the error message if the webhook request failed.
- job_items - an array of items that were extracted from the pages.
Job Item:
- id - the unique identifier of the item.
- status - the status of the item. Can be new, in_progress, done, error.
- job_id - the job identifier.
- original_url - the URL of the page.
- page_status_code - the status code of the page request.
- raw_content_url - the URL to the raw content of the page.
- cleaned_content_url - the URL to the cleaned content of the page (if scrape_type is cleaned).
- title - the title of the page.
- created_at - the date when the item was created.
- cost - the cost of the item in $.
Example:
{
  "id": "23b81e21-c672-4402-a886-303f18de9555",
  "url": "https://stripe.com/",
  "scrape_type": "cleaned",
  "extract_rules": "",
  "whitelist_regexp": "",
  "blacklist_regexp": "",
  "allow_subdomains": false,
  "items_limit": 10,
  "created_at": "2024-06-17T12:22:08.034Z",
  "crawl_delay_ms": 0,
  "finished_at": "2024-06-17T12:23:01.53Z",
  "webhook_url": "https://yourserver.com/webhook",
  "webhook_status": 0,
  "webhook_error": "",
  "status": "done",
  "job_items": [
    {
      "id": 578720,
      "job_id": "23b81e21-c672-4402-a886-303f18de9555",
      "original_url": "https://stripe.com/docs/no-code/tap-to-pay",
      "page_status_code": 200,
      "raw_content_url": "https://data.webcrawlerapi.com/raw/clwgv3ywz000hsy99lwbk7q18/23b81e21-c672-4402-a886-303f18de9555/https___stripe_com_docs_no_code_tap_to_pay",
      "cleaned_content_url": "https://data.webcrawlerapi.com/raw/clwgv3ywz000hsy99lwbk7q18/23b81e21-c672-4402-a886-303f18de9555/https___stripe_com_docs_no_code_tap_to_pay",
      "status": "done",
      "title": "Tap to Pay on the Dashboard mobile app | Stripe Documentation",
      "created_at": "2024-06-17T12:22:19.511Z",
      "updated_at": "2024-06-17T12:22:33.334Z",
      "retries": 0,
      "cost": 0.002
    }
  ]
}
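Once a job has finished, its response can be summarised on the client side. The sketch below is a minimal Python example, assuming the response has already been parsed into a dictionary with the fields described above. The helper summarize_job is hypothetical, and two of its choices are assumptions rather than documented behaviour: treating done and error as terminal job statuses, and preferring cleaned_content_url over raw_content_url when both are present. The sample URLs in the usage part are placeholders.

```python
from collections import Counter


def summarize_job(job):
    """Hypothetical helper: summarise a parsed job response dictionary."""
    items = job.get("job_items", [])
    # Count items per status (new, in_progress, done, error).
    status_counts = Counter(item["status"] for item in items)
    # Sum the per-item cost, which the response reports in dollars.
    total_cost = sum(item.get("cost", 0) for item in items)
    # Collect one content URL per finished item.
    # Assumption: prefer the cleaned content when it is available.
    content_urls = [
        item.get("cleaned_content_url") or item["raw_content_url"]
        for item in items
        if item["status"] == "done"
    ]
    return {
        # Assumption: "done" and "error" are terminal job statuses.
        "finished": job["status"] in ("done", "error"),
        "status_counts": dict(status_counts),
        "total_cost": total_cost,
        "content_urls": content_urls,
    }


if __name__ == "__main__":
    # Trimmed-down response with placeholder content URLs.
    sample_job = {
        "status": "done",
        "job_items": [
            {
                "status": "done",
                "raw_content_url": "https://example.com/raw/page",
                "cleaned_content_url": "https://example.com/cleaned/page",
                "cost": 0.002,
            }
        ],
    }
    print(summarize_job(sample_job))
```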