Nutch / NUTCH-2034

CrawlDB filtered documents counter.

Details

    Description

      When running large crawls, we would like to know how many URLs are being discarded by the regex filters. Currently this is reported only by the Injector class:

      Injector: Total number of urls rejected by filters: 0

      It would be nice to have a counter in the CrawlDb class so that in every round we know how many URLs were discarded by our filters:

      CrawlDb update: Total number of URLs filtered by regex filters: 31415
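
      The requested behaviour can be illustrated with a minimal, standalone sketch. It is not Nutch's CrawlDb code and does not use the Nutch URL filter plugin API or Hadoop counters. The class name, method names, and the two example patterns are assumptions made for illustration only. It shows the idea of counting every URL that a set of regex exclusion rules rejects during one update round and reporting the total in the proposed log format.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: NOT the Nutch CrawlDb implementation or its URLFilter API.
class FilteredUrlCounterSketch {

    // Returns true when the URL matches any exclusion pattern, i.e. it is filtered out.
    static boolean isRejected(String url, List<Pattern> excludePatterns) {
        for (Pattern p : excludePatterns) {
            if (p.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    // Counts how many URLs seen in one update round are rejected by the filters.
    static long countRejected(List<String> urls, List<Pattern> excludePatterns) {
        long rejected = 0;
        for (String url : urls) {
            if (isRejected(url, excludePatterns)) {
                rejected++;
            }
        }
        return rejected;
    }

    // Formats the total using the log line proposed in this issue.
    static String summary(long rejected) {
        return "CrawlDb update: Total number of URLs filtered by regex filters: " + rejected;
    }

    public static void main(String[] args) {
        // Example exclusion rules (assumed): image suffixes and query-like characters.
        List<Pattern> excludePatterns = List.of(
                Pattern.compile("\\.(gif|jpg|png)$"),
                Pattern.compile("[?*!@=]"));

        List<String> urls = List.of(
                "http://example.org/index.html",
                "http://example.org/logo.png",
                "http://example.org/search?q=nutch",
                "http://example.org/docs/");

        System.out.println(summary(countRejected(urls, excludePatterns)));
    }
}
```

      In a real MapReduce job the running total would more plausibly be kept in a job-level counter and read once the job completes, rather than in a local variable as here.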

          People

            lewismc Lewis John McGibbney
            betolink Luis Lopez