Monday, March 15, 2010

Asserting PDFs

A question came up on the JMeter mailing list about extracting/asserting text inside a PDF file, and since I ran into this for a functional scenario, I wrote up my attempt.

I use PDFBox as the library. Download the binaries and copy pdfbox-1.0.0.jar and external/fontbox-1.0.0.jar into JMeter's lib directory.

We'll access the PDF at http://jakarta.apache.org/jmeter/usermanual/jmeter_distributed_testing_step_by_step.pdf and check that it contains "Distributed Testing Step-by-step".

Here's the JMeter sample script. It has a Transaction Controller, 'Check PDF', so that we only get a single result item. The HTTP Request Sampler 'Request PDF' requests the PDF. The bulk of the code is in the BeanShell PostProcessor titled 'Extract Text':
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

PDDocument document = null;
StringWriter sw = new StringWriter();
try {
    // 'data' holds the raw response bytes of the parent sampler
    ByteArrayInputStream bais = new ByteArrayInputStream(data);
    document = PDDocument.load(bais);
    PDFTextStripper stripper = new PDFTextStripper("UTF-8");
    stripper.setSortByPosition( false );
    stripper.setShouldSeparateByBeads( true );
    // Extract every page
    stripper.setStartPage( 1 );
    stripper.setEndPage( Integer.MAX_VALUE );
    stripper.writeText( document, sw );
} catch (Throwable t) {
    t.printStackTrace();
    sw.append("ERROR");
} finally {
    sw.close();
    // document stays null if loading failed, so guard the close
    if (document != null) {
        document.close();
    }
}
// Expose the extracted text to the rest of the test plan
vars.put("extractedText", sw.toString());

Update - for PDFBox 2.0.26

import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

PDDocument document = null;
StringWriter sw = new StringWriter();
try {
    // 'data' holds the raw response bytes of the parent sampler
    ByteArrayInputStream bais = new ByteArrayInputStream(data);
    document = PDDocument.load(bais);
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition( false );
    stripper.setShouldSeparateByBeads( true );
    // Extract every page
    stripper.setStartPage( 1 );
    stripper.setEndPage( Integer.MAX_VALUE );
    stripper.writeText( document, sw );
} catch (Throwable t) {
    t.printStackTrace();
    sw.append("ERROR");
} finally {
    sw.close();
    // document stays null if loading failed, so guard the close
    if (document != null) {
        document.close();
    }
}
// Expose the extracted text to the rest of the test plan
vars.put("extractedText", sw.toString());

All this does is use the PDFBox API to extract text from the PDF (the bytes are available in the data object) and write it to a JMeter variable, extractedText.

The next sampler is a Java Sampler which sets ResultData to ${extractedText}. This echoes the contents of the variable back as the response of the sampler. Once this is done, you can use a normal Response Assertion, Regular Expression Extractor or whatever else you need to process the text.

Update: as mentioned by Milamber in the comments, you can instead make the last line of the BeanShell script
prev.setResponseData(sw.toString());
This will set the response of the HTTP sampler to the extracted text and lets you attach the assertion to it directly (and you can also eliminate the Transaction Controller).

Update: if you are interested in MS Office formats, follow the steps in this post and just change the BeanShell PostProcessor to the one given in Asserting MS Office Formats.

Future work
a. Make a custom PDF (or any-format) sampler that exposes the options that are currently hardcoded (e.g. startPage or endPage).
b. Experiment with ways to use less memory. Typically one should extract the text, write it to a file and use that.
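
The file-based idea in item b of the future work could look roughly like the sketch below. It is plain Java with no PDFBox dependency so that it runs on its own. The class name ExtractToFileSketch and the helpers writeExtractedText and fileContains are hypothetical names made up for illustration. In the real script, the out.write(text) call would be replaced by stripper.writeText(document, out), since writeText accepts any Writer. This is an untested idea for JMeter itself, not something the script above does.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExtractToFileSketch {

    // Hypothetical helper: writes text to a temp file instead of a StringWriter.
    // Assumption: with PDFBox the body would call stripper.writeText(document, out).
    static Path writeExtractedText(String text) throws IOException {
        Path tmp = Files.createTempFile("extracted", ".txt");
        try (Writer out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            out.write(text);
        }
        return tmp;
    }

    // Hypothetical helper: scans the file one line at a time, so the full
    // extracted text is never held in memory as a single String.
    static boolean fileContains(Path file, String expected) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains(expected)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the text PDFBox would extract from the PDF
        Path file = writeExtractedText(
                "Title page\nDistributed Testing Step-by-step\nMore text\n");
        try {
            // prints true
            System.out.println(fileContains(file, "Distributed Testing Step-by-step"));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Two caveats with this sketch: a phrase that is split across two lines of extracted text will not match, and the PDF bytes themselves are still held in memory by JMeter in the data object, so only the extracted text is kept off the heap.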

10 comments:

Steve Eckhart said...

Very useful. I'm pretty new to JMeter and have no experience with BeanShell. In order to get the test to run successfully, I had to put the fontbox-1.0.0.jar file in my lib\ext directory.

Deepak Shetty said...

@Steve
just the lib directory should have worked. There are other supporting jars too and if you want to use any of the other features you'd need to copy the other jars into JMeter's directory

Milamber said...

Great post!

If you replace last bsh line with :
prev.setResponseData(sw.toString());

You can directly put Assertion Response as a child to first request.

Milamber

Deepak Shetty said...

@Milamber
Sometimes I can't see the forest for the trees! Yes, that is probably a better solution.

Steve Eckhart said...

I have a question about how the script is actually working. On the first line of the try, we have:

ByteArrayInputStream bais = new ByteArrayInputStream(data);

What is data and where did it come from?

Deepak Shetty said...

@Steve
When you add a beanshell postprocessor, certain variables are made available to it by default.
In the script section in the post processor you will see
Script (variables : ctx,vars,props,prev,data,log)

ctx is JMeterContext
vars is JMeterVariables etc etc
data is the raw data of the sampler.

Deepak Shetty said...

@Steve
This is described in the manual
http://jakarta.apache.org/jmeter/usermanual/component_reference.html#BeanShell_PostProcessor
and you can use the javadocs to investigate the methods of the objects
http://jakarta.apache.org/jmeter/api/org/apache/jmeter/samplers/SampleResult.html

Klein Balázs said...

This was very useful.

I found a tool that seems to do for odt something similar to what pdfbox does for pdf, but I can't figure out how to use it in a similar BeanShell PostProcessor.

Do you think it is possible to do the same with odt files?

Here is the tool I found:
http://www.java2s.com/Code/Jar/o/Downloadodfutils21jar.htm

Thanks for the help.

Deepak Shetty said...

@Klein
Yes it should work.
a. Did you copy the jar file to jmeter's lib directory?
b. What is your beanshell script? Did you check jmeter.log for errors?

Klein Balázs said...

Hi,
thanks for the response and for the pdf solution.

I did put odf_utils-2.1.jar into the lib directory of Jmeter.

Unfortunately I know nothing about java programming. I did try to experiment with the BeanShell but it fails on the first relevant line:

import com.catcode.odf.OpenDocumentTextInputStream;

with the error

Constructor error: Can't find default constructor for: class com.catcode.odf.OpenDocumentTextInputStream