PDF Image Extraction and Validation with Playwright and sharp

09 / Sep / 2024 by Rahul Khurana

PDFs store text and images while preserving their formatting across platforms and devices. In software testing, there are cases where it is critical to make sure that the right images are included in a PDF and that their quality is maintained. Verifying these images manually is slow and error-prone, especially when working with large PDFs or many test cases. Automating the process reduces manual effort and the chance of missing a discrepancy.

The extraction and validation of images from PDFs can be automated with the help of the Node.js library sharp. sharp is an image processing library that, together with the sharp-pdf package (which returns the images in a PDF as sharp objects) and Playwright, a robust end-to-end testing framework, can be used to extract images from PDFs and compare them to expected images.


Step 1: Set Up the Environment

Prerequisite:

Node.js must be installed on the system before installing the required libraries. Once Node.js is installed, install the following libraries:

sharp-pdf: Library for extracting images from PDFs as sharp objects (it provides the PDF.sharpsFromPdf function used in Step 2).
sharp: An image processing library used for image manipulation.

To install these libraries, run the following command in the terminal (fs and path are built into Node.js and do not need to be installed):
npm install sharp-pdf sharp


Step 2: Extracting Images from the PDF

Once the libraries are installed, we can use sharp-pdf to parse the PDF and pull out its images. The code below extracts the images from a PDF file.

const fs = require('fs');
const PDF = require('sharp-pdf'); // provides PDF.sharpsFromPdf
const sharp = require('sharp');
const path = require('path');

async function extractImageFromPdf(pdfPath) {
  const pathToExtract = './extractedImg/'; // folder in the project where all images from the PDF are extracted
  try {
    // Make sure the output folder exists before writing to it
    fs.mkdirSync(pathToExtract, { recursive: true });

    const images = await PDF.sharpsFromPdf(pdfPath);
    const writes = [];
    images.forEach(({ image, name, channels }) => {
      const ext = '.png'; // other extensions can be chosen conditionally
      writes.push(image.toFile(path.join(pathToExtract, `${name}${ext}`)));
    });
    // Wait until every image has been written to disk
    await Promise.all(writes);

    // Progress events (optional; note that this call parses the PDF a second time)
    await PDF.sharpsFromPdf(pdfPath, {
      handler(event, data) {
        if (event === 'loading') {
          console.log('Loading PDF:', (data.loaded / data.total) * 100);
        } else if (event === 'loaded') {
          console.log('PDF loaded');
        } else if (event === 'image' || event === 'skip' || event === 'error') {
          console.log('Parsing images:', (data.pageIndex / data.pages) * 100);
        } else if (event === 'done') {
          console.log('Done');
        }
      },
    });
    return pathToExtract;
  } catch (error) {
    console.error('Error extracting images from PDF:', error);
  }
}

In this function:

  • PDF.sharpsFromPdf(pdfPath) extracts the images from the provided PDF and returns them as an array of objects, each with image, name, and channels properties. image is the image data as a sharp object, which lets you manipulate and save the image. name is the name assigned to the image, often derived from the page number or some identifier within the PDF. channels is the number of color channels in the image (not used in the function above).
  • The images.forEach loop iterates over the extracted images, saving each one to the specified directory with a .png extension using image.toFile.
  • A second call to PDF.sharpsFromPdf is made, this time with a handler function that tracks progress. The handler logs various stages like loading, parsing images, and completion.
  • If successful, the function returns the path where images are extracted. Otherwise, an error message is logged.
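
The function above always writes .png files and leaves channels unused. As a minimal sketch of choosing the extension conditionally, the helpers below pick .png when an image has more than three channels (so an alpha channel can be kept) and .jpg otherwise. Both extensionForChannels and outputPathFor are hypothetical names introduced for illustration, and the channel rule is one possible convention, not a requirement of sharp-pdf.

```javascript
const path = require('path');

// Hypothetical helper: choose a file extension from the channel count.
// Assumption: more than 3 channels means the image carries an alpha channel,
// which PNG can store and JPEG cannot.
function extensionForChannels(channels) {
  return channels > 3 ? '.png' : '.jpg';
}

// Hypothetical helper: build the output path for one extracted image.
function outputPathFor(folder, name, channels) {
  return path.join(folder, `${name}${extensionForChannels(channels)}`);
}

// Example: an RGBA image and an RGB image extracted from the same page
console.log(outputPathFor('extractedImg', 'img_p0_1', 4));
console.log(outputPathFor('extractedImg', 'img_p0_2', 3));
```

If a rule like this is used, the file-name filter in the comparison step has to accept .jpg as well, since it currently matches only .png files.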

Step 3: Comparing Extracted Images with Expected Images

Once the images are extracted from the PDF, we can compare them with the expected images to verify their accuracy. The sharp library is used to perform a pixel-level comparison.

Here’s an example of a comparison function:

async function compareImages(expectedImgPath, extractedImgFolderPath) {
  let flag = false;
  try {
    // Load the reference image and convert it to a raw RGBA pixel buffer
    const img1 = await sharp(expectedImgPath).resize(500, 500).ensureAlpha().raw().toBuffer({ resolveWithObject: true });
    const { data: data1, info: info1 } = img1;

    // Read the directory and filter image files
    const files = fs.readdirSync(extractedImgFolderPath);

    const imageFiles = files.filter(file => /^img_p3_\d+\.png$/i.test(file)); // restrict search to images starting with img_p3

    for (const file of imageFiles) {
      const filePath = path.join(extractedImgFolderPath, file);

      // Load the extracted image and process it the same way as the reference
      const img2 = await sharp(filePath).resize(500, 500).ensureAlpha().raw().toBuffer({ resolveWithObject: true });
      const { data: data2, info: info2 } = img2;

      // Check if dimensions match
      if (info1.width !== info2.width || info1.height !== info2.height) {
        console.log(`Image dimensions do not match for file: ${file}`);
        continue;
      }

      // Calculate Mean Squared Error (MSE) over all channel values
      let mse = 0;
      for (let i = 0; i < data1.length; i++) {
        mse += (data1[i] - data2[i]) ** 2;
      }
      mse /= data1.length;

      if (mse < 550) {
        flag = true;
        console.log(`Image ${file} matches with the given image.`);
        break;
      } else {
        console.log(`Image ${file} does not match.`);
      }
    }
  } catch (error) {
    console.error('Error comparing images:', error);
  }
  return flag;
}

In this code:

  • The function takes two parameters. expectedImgPath is the path to the reference image, and extractedImgFolderPath is the directory containing the images to be compared against the reference image.
  • The function uses the sharp library to load, resize, and convert the images to a raw pixel buffer. The ensureAlpha() function ensures that the images have an alpha channel, even if they don’t originally.
  • The function reads the directory at extractedImgFolderPath and filters the files using a regular expression (/^img_p3_\d+\.png$/i). This regex restricts the comparison to images named according to a pattern (e.g., “img_p3_123.png”) that indicates they belong to page 3.
  • For each image in the folder, the function checks if its dimensions match the reference image. If they do, it calculates the Mean Squared Error (MSE) between the images’ pixel data.
    MSE is a measure of the difference between the images: the lower the value, the closer the images are, and 0 means the pixel data is identical. If the MSE is below a threshold (550 in this case), the images are considered a match. The threshold can be tuned to make the comparison stricter or more lenient.
  • If a match is found (MSE < 550), the flag is set to true, and the loop breaks. The function returns the flag, and the calling test asserts that it is true, meaning at least one image matched.
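
To make the file-name filter and the MSE calculation described above concrete, here is a small sketch that runs on hand-written byte arrays instead of real images, so it needs neither sharp nor a PDF. The helper names pageImagePattern and meanSquaredError are hypothetical and introduced only for illustration; the naming pattern is assumed to be the img_p<page>_<n>.png form used in the filter above.

```javascript
// Hypothetical helper: build the file-name filter for a given page number,
// assuming extracted files are named img_p<page>_<n>.png.
function pageImagePattern(pageNumber) {
  return new RegExp(`^img_p${pageNumber}_\\d+\\.png$`, 'i');
}

// Hypothetical helper: mean squared error between two equal-length buffers
// of channel values (0 to 255), as computed in compareImages.
function meanSquaredError(a, b) {
  if (a.length !== b.length) {
    throw new RangeError('Buffers must have the same length');
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += (a[i] - b[i]) ** 2;
  }
  return sum / a.length;
}

// One RGBA pixel from a reference image and one from a candidate image
const reference = Uint8Array.from([10, 20, 30, 255]);
const candidate = Uint8Array.from([12, 20, 27, 255]);

console.log(meanSquaredError(reference, candidate)); // (4 + 0 + 9 + 0) / 4 = 3.25
console.log(pageImagePattern(3).test('img_p3_12.png')); // true
console.log(pageImagePattern(3).test('img_p4_12.png')); // false
```

Because channel values range from 0 to 255, a threshold of 550 tolerates an average difference of roughly 23 per channel value (the square root of 550), which gives a rough sense of how lenient the comparison is.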


Step 4: Automating the Process with Playwright

To take it a step further, we can integrate this into a broader testing framework like Playwright. Here’s how you could use Playwright to automate the verification process:

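const { test, expect } = require('@playwright/test');
// Hypothetical module path: point this at the file that defines and exports the two functions from Steps 2 and 3
const { extractImageFromPdf, compareImages } = require('./pdfImageUtils');

test('Verify images in PDF', async ({}) => {
  const pdfPath = 'path/to/your/pdf/document.pdf';
  const expectedImgPath = 'path/to/expected/image.png'; // a single reference image file

  // extractImageFromPdf returns the folder the images were written to
  const extractedImagesFolder = await extractImageFromPdf(pdfPath);
  const imageCompareFlag = await compareImages(expectedImgPath, extractedImagesFolder);

  expect(imageCompareFlag).toBeTruthy();
});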

Conclusion

One effective method for automating visual content validation is to extract images from a PDF and compare them with expected images. This procedure fits easily into a testing pipeline, so each run checks that your PDFs include the right images. Playwright and libraries like sharp-pdf and sharp make this automation straightforward, giving you more confidence in the quality and consistency of your visual content.

 

References:

https://www.npmjs.com/package/sharp-pdf
