PDFBox / PDFBOX-4542

Suggestion: Don't load large streams completely into memory, reference them instead


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.0.14
    • Fix Version/s: None
    • Component/s: Parsing, PDModel

    Description

      While processing large PDF files, many of which contain large image streams, we wanted to avoid loading entire streams into memory. Instead, we implemented a mechanism that merely references their location on disk.

      We eventually did this by subclassing COSStream and overriding COSParser.parseCOSStream(COSDictionary) to create our stream conditionally. Here is the code. It is still a work in progress; I have just refactored the entire mechanism.

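      import java.io.File;
      import java.io.FileInputStream;
      import java.io.FileNotFoundException;
      import java.io.IOException;
      import java.io.InputStream;
      import java.io.OutputStream;
      import java.nio.channels.FileChannel;
      import java.util.Map;

      import org.apache.pdfbox.cos.COSBase;
      import org.apache.pdfbox.cos.COSInputStream;
      import org.apache.pdfbox.cos.COSName;
      import org.apache.pdfbox.cos.COSStream;
      import org.apache.pdfbox.filter.DecodeOptions;
      import org.apache.pdfbox.io.ScratchFile;

      /**
       * A COSStream whose content can stay in the source file and is only read on demand.
       */
      public class ReferencedCOSStream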
         extends COSStream
      {
         //~ Instance members ------------------------------------------------------------------------------------------------------------------------------
      
         boolean isReference = false;
         File    reference   = null;
         long    offset      = -1;
         long    length      = -1;
      
         //~ Constructors ----------------------------------------------------------------------------------------------------------------------------------
      
         private ReferencedCOSStream(final ScratchFile scratchFile)
         {
            super(scratchFile);
         }
      
         //~ Methods ---------------------------------------------------------------------------------------------------------------------------------------
      
         public static ReferencedCOSStream createFromCOSStream(final COSStream stream)
         {
            final ReferencedCOSStream out = new ReferencedCOSStream(stream.getScratchFile());
      
            for (final Map.Entry<COSName, COSBase> entry : stream.entrySet())
            {
               out.setItem(entry.getKey(), entry.getValue());
            }
            return out;
         }
      
      
         @Override
         public COSInputStream createInputStream(final DecodeOptions options)
            throws IOException
         {
            if (this.isReference)
            {
               final InputStream in = new SlicedFileInputStream(this.reference, this.offset, this.length);
      
               return COSInputStream.create(getFilterList(), this, in, this.getScratchFile(), options);
            }
            else
            {
               return super.createInputStream(options);
            }
         }
      
      
         @Override
         public InputStream createRawInputStream()
            throws IOException
         {
            if (this.isReference)
            {
               return new SlicedFileInputStream(this.reference, this.offset, this.length);
            }
            else
            {
               return super.createRawInputStream();
            }
         }
      
      
         @Override
         public OutputStream createOutputStream(final COSBase filters)
            throws IOException
         {
            this.isReference = false;
            return super.createOutputStream(filters);
         }
      
      
         @Override
         public OutputStream createRawOutputStream()
            throws IOException
         {
            this.isReference = false;
            return super.createRawOutputStream();
         }
      
      
         public void setReference(final File file,
                                  final long offset,
                                  final long length)
         {
            this.isReference = true;
            this.reference   = file;
            this.offset      = offset;
            this.length      = length;
            this.setLong(COSName.LENGTH, length);
         }
      
         //~ Inner Classes ---------------------------------------------------------------------------------------------------------------------------------
      
         private class SlicedFileInputStream
            extends FileInputStream
         {
            //~ Instance members ---------------------------------------------------------------------------------------------------------------------------
      
            private long       index;
            private final long length;
      
            //~ Constructors -------------------------------------------------------------------------------------------------------------------------------
      
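            public SlicedFileInputStream(final File file,
                                         final long offset,
                                         final long length)
               throws FileNotFoundException, IOException
            {
               super(file);
               this.length = length;

               // Position at the start of the slice. Use super.skip() so that the bounds check in our own
               // skip() does not apply, and fail early if the offset cannot be reached.
               if (super.skip(offset) != offset)
               {
                  super.close();
                  throw new IOException("Could not skip to offset " + offset + " in " + file);
               }
               this.index = 0;
            }

            //~ Methods ------------------------------------------------------------------------------------------------------------------------------------

            @Override
            public int available()
               throws IOException
            {
               final long remaining = length - index;

               if (remaining <= 0)
               {
                  return 0;
               }

               // Clamp instead of casting: a slice larger than 2 GiB would otherwise overflow to a negative int.
               return (int)Math.min(remaining, Integer.MAX_VALUE);
            }


            @Override
            public int read()
               throws IOException
            {
               // Without this override, single-byte reads would run past the end of the slice.
               if (index >= length)
               {
                  return -1;
               }

               final int value = super.read();

               if (value >= 0)
               {
                  index++;
               }
               return value;
            }


            @Override
            public int read(final byte[] b)
               throws IOException
            {
               return this.read(b, 0, b.length);
            }


            @Override
            public int read(final byte[] b,
                            final int    off,
                            final int    len)
               throws IOException
            {
               if (len == 0)
               {
                  return 0;
               }

               final int remaining = this.available();

               if (remaining <= 0)
               {
                  return -1;
               }

               // Honour the caller's offset and advance the index only by the bytes actually read.
               final int count = super.read(b, off, Math.min(remaining, len));

               if (count > 0)
               {
                  index += count;
               }
               return count;
            }


            @Override
            public long skip(final long n)
               throws IOException
            {
               // Never skip past the end of the slice, and count only the bytes actually skipped.
               final long remaining = length - index;

               if ((n <= 0) || (remaining <= 0))
               {
                  return 0;
               }

               final long skipped = super.skip(Math.min(n, remaining));

               index += skipped;
               return skipped;
            }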
      
      
            @Override
            public FileChannel getChannel()
            {
               throw new UnsupportedOperationException("Obtaining a FileChannel is not supported because a correct offset cannot be ensured.");
            }
         }
      }
      
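         // The following method is the override in our COSParser subclass (the rest of that class is not shown).
         @Override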
         protected COSStream parseCOSStream(final COSDictionary dic)
            throws IOException
         {
            /*
             * This needs to be dic.getItem because when we are parsing, the underlying object might still be null.
             */
            final COSNumber streamLengthObj = getLength(dic.getItem(COSName.LENGTH), dic.getCOSName(COSName.TYPE));
      
            COSStream       stream          = document.createCOSStream(dic);
      
            // read 'stream'; this was already tested in parseObjectsDynamically()
            readString();
      
            skipWhiteSpaces();
      
            if (streamLengthObj == null)
            {
               if (isLenient)
               {
                  LOG.warn("The stream doesn't provide any stream length, using fallback readUntilEnd, at offset " + source.getPosition());
               }
               else
               {
                  throw new IOException("Missing length for stream.");
               }
            }
      
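            // Streams of at least 1024 bytes are referenced instead of copied. The length is validated first, so that
            // a wrong /Length entry still falls through to the lenient handling in the else branch.
            if ((streamLengthObj != null) && (streamLengthObj.longValue() >= 1024) && validateStreamLength(streamLengthObj.longValue()))
            {
               final long                streamBegPos = source.getPosition();
               final ReferencedCOSStream refStream    = ReferencedCOSStream.createFromCOSStream(stream);

               try
               {
                  // Passing null is meant to skip the bytes without copying them; this requires a
                  // readValidStream variant that tolerates a null output stream.
                  readValidStream(null, streamLengthObj);
               }
               finally
               {
                  stream.setItem(COSName.LENGTH, streamLengthObj);
               }

               // 'reference' is a field of the parser subclass (not shown) that names the PDF file being parsed.
               refStream.setReference(new File(reference), streamBegPos, source.getPosition() - streamBegPos);
               stream = refStream;
            }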
            else
            {
               try(final OutputStream out = stream.createRawOutputStream())
               {
                  if ((streamLengthObj != null) && validateStreamLength(streamLengthObj.longValue()))
                  {
                     readValidStream(out, streamLengthObj);
                  }
                  else
                  {
                     readUntilEndStream(new EndstreamOutputStream(out));
                  }
               }
               finally
               {
                  stream.setItem(COSName.LENGTH, streamLengthObj);
               }
            }
      
            final String endStream = readString();
      
            if (endStream.equals("endobj") && isLenient)
            {
               LOG.warn("stream ends with 'endobj' instead of 'endstream' at offset " + source.getPosition());
      
               // avoid follow-up warning about missing endobj
               source.rewind(ENDOBJ.length);
            }
            else if ((endStream.length() > 9) && isLenient && endStream.substring(0, 9).equals(ENDSTREAM_STRING))
            {
               LOG.warn("stream ends with '" + endStream + "' instead of 'endstream' at offset " + source.getPosition());
      
               // unread the "extra" bytes
               source.rewind(endStream.substring(9).getBytes(ISO_8859_1).length);
            }
            else if (!endStream.equals(ENDSTREAM_STRING))
            {
               throw new IOException("Error reading stream, expected='endstream' actual='" + endStream + "' at offset " + source.getPosition());
            }
      
            return stream;
         }
      

      The class ReferencedCOSStream exposes the underlying data in exactly the same way as COSStream does, but instead of keeping the content in memory, it opens a new FileInputStream each time the content is requested. SlicedFileInputStream extends FileInputStream and restricts it to this specific chunk of the file, so that it behaves like an InputStream over just those bytes.
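
      The slicing idea can also be shown without PDFBox. The following is a minimal sketch, not part of the patch: it assumes a wrapper around an arbitrary InputStream rather than a FileInputStream subclass, which avoids having to override every FileInputStream method separately. The names SliceInputStream and SliceDemo are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative only: exposes the bytes [offset, offset + length) of an underlying stream. */
final class SliceInputStream extends InputStream
{
   private final InputStream in;
   private long remaining;

   SliceInputStream(final InputStream in, final long offset, final long length) throws IOException
   {
      this.in = in;
      this.remaining = length;
      // skip() may skip fewer bytes than requested, so loop until the offset is reached.
      long toSkip = offset;
      while (toSkip > 0)
      {
         final long skipped = in.skip(toSkip);
         if (skipped <= 0)
         {
            throw new IOException("Could not skip to offset " + offset);
         }
         toSkip -= skipped;
      }
   }

   @Override
   public int read() throws IOException
   {
      if (remaining <= 0)
      {
         return -1;
      }
      final int value = in.read();
      if (value >= 0)
      {
         remaining--;
      }
      return value;
   }

   @Override
   public int read(final byte[] b, final int off, final int len) throws IOException
   {
      if (len == 0)
      {
         return 0;
      }
      if (remaining <= 0)
      {
         return -1;
      }
      // Never request more than the slice still holds; count only what was actually read.
      final int count = in.read(b, off, (int)Math.min(len, remaining));
      if (count > 0)
      {
         remaining -= count;
      }
      return count;
   }

   @Override
   public void close() throws IOException
   {
      in.close();
   }
}

class SliceDemo
{
   /** Reads one slice of a file and decodes it as ISO-8859-1. */
   static String readSlice(final Path file, final long offset, final long length) throws IOException
   {
      try (InputStream raw = Files.newInputStream(file);
           InputStream in = new SliceInputStream(raw, offset, length))
      {
         final ByteArrayOutputStream out = new ByteArrayOutputStream();
         final byte[] buf = new byte[8192];
         int n;
         while ((n = in.read(buf, 0, buf.length)) != -1)
         {
            out.write(buf, 0, n);
         }
         return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
      }
   }

   public static void main(final String[] args) throws IOException
   {
      final Path tmp = Files.createTempFile("slice", ".bin");
      try
      {
         Files.write(tmp, "HEADER-payload-TRAILER".getBytes(StandardCharsets.ISO_8859_1));
         System.out.println(readSlice(tmp, 7, 7)); // prints "payload"
      }
      finally
      {
         Files.deleteIfExists(tmp);
      }
   }
}
```

      Because the wrapper tracks only the number of remaining bytes, a length that reaches past the end of the file simply ends at end-of-file instead of failing.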

      I needed to expose some APIs for these classes. The method ReferencedCOSStream.createFromCOSStream(COSStream) would be better located in PDDocument, where it could create the stream directly; I just didn't want to modify PDDocument as well.

      Right now, encrypted streams are loaded into memory by the SecurityHandler directly after creation. If you want to accept this proposal, it might make sense to move the decryption handling into COSStream and ReferencedCOSStream as well, and to perform it on request.
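
      To make "decrypt on request" concrete, here is a rough sketch that uses only the JDK, not PDFBox. Everything in it is an assumption for illustration: RawSource, LazyDecryptedStream and LazyDecryptDemo are made-up names, and the key and IV are passed in directly. In a real PDF the key is derived per object, and for AES the IV is the first 16 bytes of the stream data, so an actual implementation would have to go through the SecurityHandler.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical: opens the still-encrypted bytes of one stream, for example a slice of the PDF file. */
interface RawSource
{
   InputStream open() throws IOException;
}

/** Hypothetical holder that stores only the reference and the key, and decrypts while being read. */
final class LazyDecryptedStream
{
   private final RawSource raw;
   private final byte[] key;
   private final byte[] iv;

   LazyDecryptedStream(final RawSource raw, final byte[] key, final byte[] iv)
   {
      this.raw = raw;
      this.key = key.clone();
      this.iv = iv.clone();
   }

   /** Nothing is opened or decrypted until this method is called. */
   InputStream createInputStream() throws IOException
   {
      try
      {
         final Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
         cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
         // CipherInputStream decrypts block by block as the caller reads.
         return new CipherInputStream(raw.open(), cipher);
      }
      catch (GeneralSecurityException e)
      {
         throw new IOException("Could not set up decryption", e);
      }
   }
}

class LazyDecryptDemo
{
   /** Test helper: produces ciphertext that createInputStream() can decrypt. */
   static byte[] encrypt(final byte[] plain, final byte[] key, final byte[] iv) throws GeneralSecurityException
   {
      final Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
      cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
      return cipher.doFinal(plain);
   }

   static byte[] readAll(final InputStream in) throws IOException
   {
      final ByteArrayOutputStream out = new ByteArrayOutputStream();
      final byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf, 0, buf.length)) != -1)
      {
         out.write(buf, 0, n);
      }
      return out.toByteArray();
   }

   public static void main(final String[] args) throws Exception
   {
      final byte[] key = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII); // 16-byte demo key
      final byte[] iv = new byte[16];                                            // all-zero demo IV
      final byte[] encrypted = encrypt("stream data".getBytes(StandardCharsets.ISO_8859_1), key, iv);

      final LazyDecryptedStream stream =
         new LazyDecryptedStream(() -> new ByteArrayInputStream(encrypted), key, iv);

      try (InputStream in = stream.createInputStream())
      {
         System.out.println(new String(readAll(in), StandardCharsets.ISO_8859_1)); // prints "stream data"
      }
   }
}
```

      With this shape, a referenced stream would hand its file slice to the decrypting wrapper as the raw source, so the encrypted bytes never need to be held in memory as a whole.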


          People

            Assignee: Unassigned
            Reporter: Rahn2 Jonathan
