Nutch / NUTCH-1919

Getting timeout when server returns Content-Length: 0

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.10
    • Component/s: protocol
    • Labels: None
    • Patch Info: Patch Available

    Description

      This has been investigated and fixed in Storm-Crawler: https://github.com/DigitalPebble/storm-crawler/issues/48.

      curl -I "http://www.dailynewslosangeles.com/"
      HTTP/1.1 301 Moved Permanently
      Location: http://www.dailynews.com
      Connection: close
      Content-Length: 0
      Content-Type: text/html; charset=UTF-8

      When fetching the same URL with Nutch, we get a timeout exception:

      ./nutch parsechecker -D http.agent.name="PebbleCrawler" "http://www.dailynewslosangeles.com/"
      fetching: http://www.dailynewslosangeles.com/
      Fetch failed with protocol status: exception(16), lastModified=0: java.net.SocketTimeoutException: Read timed out

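      The cause is that we try to read the response body from the stream even though the Content-Length header already tells us the length is 0. The server sends no body, so the read blocks until the socket timeout expires.

      The attached patch fixes the issue.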
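
      The idea can be sketched as follows. This is not the attached patch, only a minimal illustration of the guard it describes: if the declared content length is 0, return an empty body without touching the stream. The class name ContentLengthGuard and the methods parseContentLength and readBody are hypothetical and do not correspond to Nutch's actual protocol classes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Nutch: illustrates skipping the body
// read when the server declares Content-Length: 0.
final class ContentLengthGuard {

    /** Returns the declared length, or -1 if the header is absent, malformed, or negative. */
    static long parseContentLength(String headerValue) {
        if (headerValue == null) {
            return -1;
        }
        try {
            long value = Long.parseLong(headerValue.trim());
            return value < 0 ? -1 : value;
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Reads the response body, never touching the stream when the declared length is 0. */
    static byte[] readBody(InputStream in, String contentLengthHeader) throws IOException {
        long declared = parseContentLength(contentLengthHeader);
        if (declared == 0) {
            // Nothing to read: a blocking read here would only end in a socket timeout.
            return new byte[0];
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        long remaining = declared; // only meaningful when declared >= 0
        while (declared < 0 || remaining > 0) {
            int max = declared < 0 ? buffer.length : (int) Math.min(buffer.length, remaining);
            int n = in.read(buffer, 0, max);
            if (n == -1) {
                break; // end of stream
            }
            out.write(buffer, 0, n);
            if (declared >= 0) {
                remaining -= n;
            }
        }
        return out.toByteArray();
    }
}
```

      With a response like the 301 above, readBody(stream, "0") returns an empty array immediately instead of waiting on the socket.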

      Attachments

        1. NUTCH-1919.patch
          0.7 kB
          Julien Nioche

          People

            Assignee: Unassigned
            Reporter: Julien Nioche (jnioche)
            Votes: 0
            Watchers: 5