# Streaming PDF TextContent

## Hello !!

This is the third post in the series about my Google Summer of Code 2017 experience. In my last post, I gave a detailed overview of the sendWithStream method of MessageHandler. In this post, I am going to give updates on my project, Streams API in PDF.js.

The first phase of coding is about to end, and we already have our first Streams-API-supported PDF.js API: streamTextContent. This API is inspired by the PDF.js getTextContent API, which is used to extract the text content of PDFs. As the name suggests, streamTextContent can be used to stream the text content of a PDF incrementally in small chunks.

#### So how is it going to help?

Earlier, we were using the getTextContent API to retrieve the text content of PDFs, which is not very efficient in terms of memory and speed. When getTextContent requests text content from the worker thread, it has to wait until all of the text content is built. The text content is accessible on the main thread only when it has been completely built on the worker thread and sent via MessageHandler. This creates inefficiency in terms of:

- Speed: we have to wait for the whole text content to be built.
- Memory: we have to store all of the text content in the worker thread and send it in bulk.

Using streamTextContent solves this problem, as we can stream text content in small chunks whenever any is available in the worker thread. This eliminates the inefficiency in terms of:

- Speed: the main thread no longer has to wait for the full text content, and rendering can proceed incrementally.
- Memory: the worker thread no longer has to store the whole text content, and can stream it in small chunks.
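To make the consumer side concrete, here is a minimal, hypothetical sketch (not the actual PDF.js code) of reading text-content chunks from a ReadableStream on the main thread. The two-chunk stream below is just a stand-in for what streamTextContent would return:

```javascript
// Hypothetical stand-in for the stream returned by streamTextContent:
// it produces two small textContent chunks and then closes.
const stream = new ReadableStream({
  start(controller) {
    controller.enqueue({ items: [{ str: 'Hello' }], styles: {} });
    controller.enqueue({ items: [{ str: 'World' }], styles: {} });
    controller.close();
  },
});

const reader = stream.getReader();
const received = [];

function pump() {
  return reader.read().then(({ value, done }) => {
    if (done) {
      return received;
    }
    // Each chunk can be processed as soon as it arrives,
    // without waiting for the rest of the page.
    received.push(...value.items.map((item) => item.str));
    return pump();
  });
}

const pumpDone = pump();
pumpDone.then((strings) => {
  console.log(strings.join(' ')); // → "Hello World"
});
```

Notice that the caller never sees the whole text content at once; every `read()` hands over one chunk, which can go straight into rendering.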

#### So how does streamTextContent work?

I would recommend reading my last post first; it explains how we stream data between the two threads (main + worker).

Let’s first understand how getTextContent works:

When the getTextContent API is called on the main thread, it sends a message to the worker using MessageHandler. This message is handled by the GetTextContent handler in the worker.js file. Parsing of PDF commands is done in evaluator.js; the GetTextContent handler calls the getTextContent method of PartialEvaluator via the document.js file. In the getTextContent method of PartialEvaluator, the whole textContent is built and sent back to the main thread by resolving promises.

Building the whole text content in the worker and sending it back to the main thread in bulk takes a lot of memory. The text content is available on the main thread only when the promise in PartialEvaluator is resolved (i.e. resolve(textContent)), which forces rendering processes to wait. This degrades the user experience (janky scrolling, slow rendering, ...).
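The bulk behaviour can be sketched like this (a simplified stand-in, not the real PartialEvaluator code): nothing is resolved until every single item has been parsed.

```javascript
// Simplified sketch of the old bulk approach: the worker parses every
// operation into textContent first, and only then resolves the promise,
// so the main thread receives everything at once.
function buildTextContentInBulk(operations) {
  return new Promise((resolve) => {
    const textContent = { items: [], styles: Object.create(null) };
    for (const op of operations) {
      textContent.items.push({ str: op }); // parse every op first...
    }
    resolve(textContent); // ...only then hand the result to the main thread
  });
}

buildTextContentInBulk(['A', 'B', 'C']).then((textContent) => {
  console.log(textContent.items.length); // → 3
});
```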

#### Incrementally sending chunks using streamTextContent

The above-mentioned problems can be eliminated by incrementally sending data chunks from the worker to the main thread, and rendering the PDF on the fly. In the streamTextContent method, instead of building the whole text content in the worker, we send it to the main thread whenever we have any.

In the next function of PartialEvaluator, we call enqueueChunk() to send data chunks back to the main thread. If we look into the enqueueChunk function, it is self-explanatory:

```javascript
function enqueueChunk() {
  let length = textContent.items.length;
  if (length > 0) {
    // Enqueue chunks to sink.
    sink.enqueue(textContent, length);
    // Reset textContent to free memory.
    textContent.items = [];
    textContent.styles = Object.create(null);
  }
}
```


The next function is best suited for calling enqueueChunk, because it is called whenever any promise needs to be resolved. Basically, it pauses the parsing process for some time, and that is the best moment to enqueue chunks into the sink, e.g.:

```javascript
next(promise) {
  enqueueChunk();
  promise.then(() => {
    // Resume parsing process.
  });
}
```
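Putting the two snippets together, here is a hypothetical, self-contained sketch of the worker side. The sink object below is only a stand-in for the one that MessageHandler's sendWithStream provides, and the pushes simulate items accumulating between two parsing pauses:

```javascript
// Stand-in sink: in PDF.js this would be provided by MessageHandler's
// sendWithStream; here it just records the chunks it receives.
const chunks = [];
const sink = {
  enqueue(data, size) {
    chunks.push({ items: data.items.slice(), size });
  },
};

const textContent = { items: [], styles: Object.create(null) };

function enqueueChunk() {
  let length = textContent.items.length;
  if (length > 0) {
    // Flush the accumulated items to the main thread...
    sink.enqueue(textContent, length);
    // ...and reset the buffers to free worker memory.
    textContent.items = [];
    textContent.styles = Object.create(null);
  }
}

// Simulate two parsing pauses, as next() would trigger them.
textContent.items.push({ str: 'chunk-1' }, { str: 'chunk-2' });
enqueueChunk();
textContent.items.push({ str: 'chunk-3' });
enqueueChunk();

console.log(chunks.length); // → 2
```

The key point is that after every flush the worker-side buffers are empty again, so memory usage stays proportional to one chunk rather than to the whole page.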


#### Using streamTextContent internally in getTextContent

To avoid breaking support for the getTextContent API, we have created a parallel method, streamTextContent, to support the Streams API when retrieving the text content of PDFs. But it is a better idea to use streamTextContent internally in getTextContent, so that text content is streamed there too. Using the readAllChunk approach in getTextContent refactors the API into something like:

```javascript
getTextContent(params) {
  params = params || {};
  let readableStream = this.streamTextContent(params);

  return new Promise(function(resolve, reject) {
    function pump() {
      reader.read().then(function({ value, done, }) {
        if (done) {
          resolve(textContent);
          return;
        }
        // Build text content by looping through all read operations.
        Util.extendObj(textContent.styles, value.styles);
        Util.appendToArray(textContent.items, value.items);
        pump();
      }, reject);
    }

    let reader = readableStream.getReader();
    let textContent = {
      items: [],
      styles: Object.create(null),
    };
    pump();
  });
}
```


#### Some changes to PDF.js viewer to use streamTextContent

Some changes have been made to the PDF.js standard viewer to use streamTextContent instead of getTextContent, so that we can measure the live performance of the Streams API support.

For the same, we pass the ReadableStream from pdf_page_view.js to text_layer.js via text_layer_builder.js. Reading chunks in the _render method of TextLayerRenderTask and passing them incrementally to _processItems lowers the memory usage of the viewer. Cancelling the read operation of the stream in the cancel method, when the page is not in view, makes the viewer more responsive.
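The cancellation pattern can be sketched as follows. This is a hypothetical stand-in, not the real viewer code: a stream that never closes plays the role of a page's text content, and cancel() stops both our reading loop and the producer.

```javascript
// Hypothetical sketch of the cancellation pattern used by the viewer:
// when a page scrolls out of view, cancel() stops reading further chunks.
const stream = new ReadableStream({
  start(controller) {
    controller.enqueue({ items: [{ str: 'visible text' }] });
    // On a real page, more chunks would keep arriving here.
  },
});

const reader = stream.getReader();
let cancelled = false;

function renderNextChunk() {
  return reader.read().then(({ value, done }) => {
    if (done || cancelled) {
      return;
    }
    // In the viewer, _processItems(value.items) would run here.
    return renderNextChunk();
  });
}

function cancel() {
  cancelled = true;
  // Cancelling the reader resolves any pending read with done: true,
  // so the loop above unwinds instead of waiting forever.
  return reader.cancel();
}

renderNextChunk();
cancel().then(() => {
  console.log('stream cancelled');
});
```

Without the reader.cancel() call, the worker would keep producing and sending chunks for a page the user can no longer see.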

#### Performance measurement of streamTextContent

Steps for measuring the performance:

- Open the viewer in the master branch, and paste this code into the console to record data for automating the scrolling.
- Paste the recorded data (or this) into the console (as rr = data), and run this code in the console to perform automatic viewer scrolling.
- Measure the memory usage using devtools, or run this code as python name_of_file.py <pID>, where pID is the process ID of the viewer. This will give the peak memory used when the process ends.

Measurement Configuration:

Output:

| Branch \ measurement[2] | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- |
| master | 399.39 | 397.07 | 416.43 | 405.59 | 433.07 | 428.06 |
| streams | 390.02 | 389.16 | 385.09 | 384.05 | 387.34 | 387.46 |

| Branch \ measurement[3] | 1 | 2 | 3 |
| --- | --- | --- | --- |
| master | 16.74 | 16.56 | 16.81 |
| streams | 17.44 | 18.07 | 18.77 |

[1] Run the script in the web console to perform automatic scrolling.
[2] Memory used (in MB) during measurement.
[3] Average fps during measurement.

Result:

From the above measurements, we can conclude that the PDF.js viewer uses comparatively less memory with the Streams API. Using less memory, and cancelling the streaming of text content chunks when a page is scrolled out of view, makes the viewer more responsive.

Measurement snapshots

Follow the streams-getTextContent PR for all the discussions, or read the full code here.