NUTCH-1723

nutch updatedb fails due to avro (de)serialization issues on images


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Auto Closed
    • Affects Version/s: 2.3, 2.2.1
    • Fix Version/s: 2.5
    • Component/s: crawldb, parser
    • Environment:
      • Ubuntu 12.04.3 LTS (GNU/Linux 3.2.0-36-generic x86_64)
      • DataStax Community Edition Apache Cassandra 2.0.4

    Description

      Running `bin/crawl` for 2 iterations, using either the nutch-2.2.1 release or the latest 2.x checkout, on a seed file containing for example http://www.mountsinai.on.ca and http://www.dhzb.de (or any other webpage with image files that lack an obvious file extension) causes java.lang.IllegalArgumentException, IOException and/or IndexOutOfBoundsException to be thrown in the readFields method of WebPageWritable:

      @Override
      public void readFields(DataInput in) throws IOException {
        webPage = IOUtils.deserialize(getConf(), in, webPage, WebPage.class);
      }

      @Override
      public void write(DataOutput out) throws IOException {
        IOUtils.serialize(getConf(), out, webPage, WebPage.class);
      }

      2014-02-04 13:50:15,421 INFO util.WebPageWritable - Try reading fields: ...
      2014-02-04 13:50:15,423 ERROR util.WebPageWritable - Error - Failed to read fields: http://www.mountsinai.on.ca/carousel/patient-care-banner/image
      2014-02-04 13:50:15,423 ERROR util.WebPageWritable - Error - Reading fields of the WebPage class failed - java.lang.IllegalArgumentException
      2014-02-04 13:50:15,425 ERROR util.WebPageWritable - Error - Printing stacktrace - java.lang.IllegalArgumentException

      Or,
      java.lang.IndexOutOfBoundsException
      at java.nio.Buffer.checkBounds(Buffer.java:559)
      at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:143)
      at org.apache.avro.ipc.ByteBufferInputStream.read(ByteBufferInputStream.java:52)
      at org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:183)
      at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:265)
      at org.apache.gora.mapreduce.FakeResolvingDecoder.readString(FakeResolvingDecoder.java:131)
      at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:280)
      at org.apache.avro.generic.GenericDatumReader.readMap(GenericDatumReader.java:191)
      at org.apache.gora.avro.PersistentDatumReader.readMap(PersistentDatumReader.java:183)
      at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:83)
      at org.apache.gora.avro.PersistentDatumReader.readRecord(PersistentDatumReader.java:139)
      at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:80)
      at org.apache.gora.avro.PersistentDatumReader.read(PersistentDatumReader.java:103)
      at org.apache.gora.avro.PersistentDatumReader.read(PersistentDatumReader.java:98)
      at org.apache.gora.mapreduce.PersistentDeserializer.deserialize(PersistentDeserializer.java:73)
      at org.apache.gora.mapreduce.PersistentDeserializer.deserialize(PersistentDeserializer.java:36)
      at org.apache.gora.util.IOUtils.deserialize(IOUtils.java:205)
      at org.apache.nutch.util.WebPageWritable.readFields(WebPageWritable.java:45)
      at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
      at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
      at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
      at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:117)
      at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
      at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
      at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)

      The exceptions are caused by image files that sneak through the URL filter (their URLs carry no extension indicating an image file) and that then get (properly?) parsed by the Tika library.

      Note that silently catching the thrown exceptions corrupts the Cassandra database, as the deserializer then reads over multiple WebPage entries in the DataInput, resulting in the loss of several pages from other hosts present in the seed file.
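To see why swallowing the exception loses neighbouring pages, consider a minimal sketch of stream desynchronization (plain java.io, not Nutch or Gora code; the class and method names below are hypothetical): each record is length-prefixed, and a reader that consumes the wrong number of bytes leaves the stream positioned mid-record, so the next record's length prefix is read from payload bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical illustration of stream desynchronization, not Nutch code.
public class StreamDesyncDemo {

    // Write one length-prefixed record, as a serializer would.
    static void writeRecord(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Returns the length prefix of the *second* record after a first read
    // that under-reads by `shortBy` bytes (0 = correct behaviour).
    static int secondRecordLength(int shortBy) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        writeRecord(out, "page-one".getBytes("UTF-8"));
        writeRecord(out, "page-two".getBytes("UTF-8"));

        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bos.toByteArray()));
        int len = in.readInt();
        in.readFully(new byte[len - shortBy]); // buggy reader leaves `shortBy` stray bytes
        return in.readInt();                   // next length prefix -- or garbage
    }
}
```

With `shortBy = 0` the second prefix comes back as the correct 8; with any other value the reader interprets leftover payload bytes as a length prefix, which is exactly the kind of bogus length that later blows up inside the decoder.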

      Moreover, if one makes sure that the image pages don't end up in the DataInput written by DBUpdateMapper, e.g. by disabling the Tika parser in nutch-site.xml, nutch updatedb finishes properly:

      <property>
      <name>plugin.excludes</name>
      <value>parse-tika</value>
      </property>

      I strongly suspect that the issues are due to Gora's dependency on the outdated avro-1.3.3 library.
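For what it's worth, the top two frames of the stack trace (HeapByteBuffer.get calling Buffer.checkBounds) can be reproduced in isolation: java.nio.Buffer.checkBounds throws IndexOutOfBoundsException when a read is requested with a length that does not fit the destination array, which is what a decoder that trusts a corrupted length prefix ends up doing. A minimal sketch (the class and method names are hypothetical, and this only mimics the failure mode, not Avro's actual code path):

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of the Buffer.checkBounds failure mode, not Avro code.
public class BoundsDemo {

    // Mimics a decoder that trusts `claimedLen` from the wire: it asks the
    // buffer for `claimedLen` bytes although the destination only holds `dstLen`.
    static boolean boundsCheckFails(int dstLen, int claimedLen) {
        ByteBuffer buf = ByteBuffer.wrap(new byte[]{1, 2, 3, 4, 5, 6, 7, 8});
        byte[] dst = new byte[dstLen];
        try {
            // HeapByteBuffer.get(byte[], int, int) calls Buffer.checkBounds first,
            // matching the top two frames of the stack trace above.
            buf.get(dst, 0, claimedLen);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }
}
```

With a sane length (`boundsCheckFails(8, 4)`) the read succeeds; with a bogus one (`boundsCheckFails(2, 4)`) checkBounds throws, just as in the reported trace.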


          People

            Assignee: Unassigned
            Reporter: Koen Smets (ksmets)
