PDF.js Node Stream

By Mukul Mishra

Hello !!

This is the fifth post in the series on my Google Summer of Code 2017 experience. In the last post, I gave an overview of how we are adding streaming support to the networking task of PDF.js. This post continues the updates on my project, Streams API in PDF.js.

In this post, I am going to talk about adding Node.js logic to the networking task of the PDF.js project. PDF.js internally uses XHR to request PDF data, but this does not work in a Node.js environment (Node.js doesn't have native support for XHR). The only way to load a PDF in Node.js is to create a blob of data (or load the PDF file into a typed array) and pass it to getDocument as is, something like:

const fs = require('fs');

let pdfData = new Uint8Array(fs.readFileSync('path/to/pdf.pdf'));
let loadingTask = PDFJS.getDocument(pdfData);

This is a bit memory inefficient, as we have to keep the whole file in memory on the main thread. To remove this inefficiency, we can implement the required IPDFStream interfaces for Node.js and use them to read PDF data. Modules like http(s) and fs give us a streamable response, so we can read PDF data incrementally without buffering the whole response.
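
To make the difference concrete, here is a minimal sketch of reading a file incrementally instead of loading it in one go. It is only an illustration, not the PDF.js code: `measureStreamedRead` is a hypothetical helper name, and the chunk size is an arbitrary choice passed through the `highWaterMark` option of fs.createReadStream.

```javascript
const fs = require('fs');

// Hypothetical helper (not part of PDF.js): reads a file chunk by chunk and
// reports how many bytes were seen and how large the biggest chunk was.
function measureStreamedRead(filePath, chunkSize) {
  return new Promise((resolve, reject) => {
    let totalBytes = 0;
    let largestChunk = 0;
    // `highWaterMark` caps how many bytes a single chunk may hold.
    const stream = fs.createReadStream(filePath, { highWaterMark: chunkSize });
    stream.on('data', (chunk) => {
      // Only the current chunk is held here; nothing is accumulated.
      totalBytes += chunk.length;
      largestChunk = Math.max(largestChunk, chunk.length);
    });
    stream.on('error', reject);
    stream.on('end', () => resolve({ totalBytes, largestChunk }));
  });
}
```

Calling something like `measureStreamedRead('path/to/pdf.pdf', 16 * 1024)` should walk through the whole file while never holding more than one chunk of at most 16 KB at a time, which is the property we want for the networking task.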

HTTP(s) and fs modules of Node.js

The http(s) and fs modules of Node.js are an interesting substitute for XHR in the networking task of PDF.js. fs.createReadStream returns a readable stream, and http.request hands its callback a response object that is also a readable stream, so both can be used to stream PDF data in small chunks.

A simple example that creates a readable stream using the fs module:

const fs = require('fs');

let readableStream = fs.createReadStream('path/to/some/pdf.pdf');

// Listen for the `readable` event, emitted when there is data available
// to be read from the stream. We can also listen for the `data` event, but
// the `readable` event might increase throughput.
readableStream.on('readable', () => {
  // Drain the internal buffer; `read()` returns null once it is empty.
  let chunk;
  while ((chunk = readableStream.read()) !== null) {
    // Do some stuff with `chunk`.
  }
});

// The `error` event is emitted when an underlying internal failure occurs.
readableStream.on('error', (reason) => {
  // Do some stuff when an `error` occurred.
  // `reason` is already an Error object, so rethrow it as is.
  throw reason;
});

// The `end` event is emitted when there is no more data to be consumed from the stream.
readableStream.on('end', () => {
  // Do some stuff when done.
  readableStream.destroy();
});
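
The events above are push-style, while a stream reader is usually consumed pull-style, by calling `read()` and waiting for the next chunk. As a rough sketch of how the two can be bridged, the class below wraps the `readable`, `end` and `error` events behind a promise-returning `read()`. `SketchFileReader` and its `{ value, done }` result shape are assumptions made for this illustration; it is not the class used in PDF.js.

```javascript
const fs = require('fs');

// Hypothetical wrapper (not the PDF.js class): exposes a pull-style
// `read()` on top of the `readable`, `end` and `error` events.
class SketchFileReader {
  constructor(filePath) {
    this._stream = fs.createReadStream(filePath);
    this._done = false;
    this._error = null;
    this._waiting = null; // resolver of a pending `read()`, if any
    const wake = () => {
      if (this._waiting) {
        const resolve = this._waiting;
        this._waiting = null;
        resolve();
      }
    };
    this._stream.on('readable', wake);
    this._stream.on('end', () => { this._done = true; wake(); });
    this._stream.on('error', (err) => { this._error = err; wake(); });
  }

  // Resolves with { value, done }, where `value` is a Buffer chunk.
  async read() {
    for (;;) {
      if (this._error) throw this._error;
      if (this._done) return { value: undefined, done: true };
      const chunk = this._stream.read();
      if (chunk !== null) return { value: chunk, done: false };
      // Nothing buffered yet: wait for the next stream event.
      await new Promise((resolve) => { this._waiting = resolve; });
    }
  }
}
```

A consumer would loop on `await reader.read()` until `done` is true, handling one chunk per iteration. The sketch supports only one pending `read()` at a time, which keeps it short.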

A simple example that creates a readable stream using the http(s) module:

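const http = require('http');
const url = require('url');

// Parse using the url module. The result gets its own name so that it
// does not shadow the `url` module itself.
let parsedUrl = url.parse('http://example.com/path/to/some/pdf.pdf');
let options = {
  protocol: parsedUrl.protocol,
  host: parsedUrl.host,
  path: parsedUrl.path,
  headers: { ... },
};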
let request = http.request(options, (response) => {
  // `response` is a readable stream (an http.IncomingMessage), so it emits the events below.
  response.on('readable', () => { ... });
  response.on('error', (reason) => { ... });
  response.on('end', () => { ... });
});
// `request.end()` must be called to signify end of the request.
request.end();
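
Putting the pieces together, here is a small sketch that reads an HTTP response body chunk by chunk and only keeps a running byte count. `countResponseBytes` is a hypothetical helper written for this post's illustration, not a PDF.js function, and it assumes a plain `http:` URL.

```javascript
const http = require('http');

// Hypothetical helper (not part of PDF.js): requests `requestUrl` and
// resolves with the number of body bytes received, read chunk by chunk.
function countResponseBytes(requestUrl) {
  return new Promise((resolve, reject) => {
    const request = http.request(requestUrl, (response) => {
      let totalBytes = 0;
      response.on('readable', () => {
        let chunk;
        // Drain whatever is buffered; `read()` returns null when empty.
        while ((chunk = response.read()) !== null) {
          totalBytes += chunk.length;
        }
      });
      response.on('error', reject);
      response.on('end', () => resolve(totalBytes));
    });
    // Connection-level failures are reported on the request object.
    request.on('error', reject);
    // `request.end()` must be called to actually send the request.
    request.end();
  });
}
```

Note that failures such as a refused connection are emitted on `request`, not on `response`, so the sketch listens for `error` on both.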

For the full discussion and code, please see node.js logic for networking task of PDF.js.