PDF.js Network Streaming

Mukul Mishra bio photo By Mukul Mishra

Hello !!

This is the fourth post in the series of my Google Summer of Code 2017 experience. In last post, I gave detailed overview of how we are streaming text content of PDFs in PDF.js, with some performance measurements. In this post, I am going to give updates of my project Streams API in PDF.js.

During this phase of our project, we are working on the networking part of PDF.js to support streaming. PDF.js internally use XHR to fetch PDF data, but as XHR response is not streamable, we have to store all the response internally into the buffer. This may cause inefficiency in terms of memory. To support Streams in networking part, we have to fulfil some of the TODO’s:

  • Move networking part from worker to main side.
  • Implement node_stream, and use it in node environment.
  • Implement fetch_stream, and use when fetch is supported by browser.

First part of TODO is to move networking part from worker to main side. PDF.js internally uses XHR at worker thread to make requests for PDF file. Although it is bit memory inefficient. XHR requests for PDF data and stores all the response at worker. Moving XHR logic to main side and implementing required IPDFStream interfaces at worker side(for communication with main) may reduce memory usage. PDFWorkerStream(in worker thread) uses sendWithPromise/sendWithStream methods of MessageHandler internally for communication with main thread. After this, worker will ask for PDF data from main only when needed. This will not force worker thread to store whole PDF data, and hence reduces memory usage. Using sendWithStream in PDFWorkerStream makes it stream data in small chunks, hence increases efficiency.

Second part is to implement node_stream, and use it in node environment. We can use node.js http/https module for remote urls whereas fs for local file system urls. I will publish detailed post for node_stream logic in future.

Third part is to implement fetch_stream and use it when browser has native support for fetch API. fetch gives streamable response, that can be used to stream PDF data. I will be publish detailed post for fetch_stream logic in future.

For more information and disussion about the patch, please follow PR Stream API support in networking task of PDF.js.