import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

PDDocument document = null;
StringWriter sw = new StringWriter();
try {
    // data holds the raw response bytes of the sampler (the PDF)
    ByteArrayInputStream bais = new ByteArrayInputStream(data);
    document = PDDocument.load(bais);
    PDFTextStripper stripper = new PDFTextStripper("UTF-8");
    stripper.setSortByPosition(false);
    stripper.setShouldSeparateByBeads(true);
    // extract every page
    stripper.setStartPage(1);
    stripper.setEndPage(Integer.MAX_VALUE);
    stripper.writeText(document, sw);
} catch (Throwable t) {
    t.printStackTrace();
    sw.append("ERROR");
} finally {
    sw.close();
    // document is still null if PDDocument.load failed
    if (document != null) {
        document.close();
    }
}
vars.put("extractedText", sw.toString());
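Update: for PDFBox 2.0.26 (PDFTextStripper is now in the org.apache.pdfbox.text package and is constructed without an encoding argument)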
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

PDDocument document = null;
StringWriter sw = new StringWriter();
try {
    // data holds the raw response bytes of the sampler (the PDF)
    ByteArrayInputStream bais = new ByteArrayInputStream(data);
    document = PDDocument.load(bais);
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition(false);
    stripper.setShouldSeparateByBeads(true);
    // extract every page
    stripper.setStartPage(1);
    stripper.setEndPage(Integer.MAX_VALUE);
    stripper.writeText(document, sw);
} catch (Throwable t) {
    t.printStackTrace();
    sw.append("ERROR");
} finally {
    sw.close();
    // document is still null if PDDocument.load failed
    if (document != null) {
        document.close();
    }
}
vars.put("extractedText", sw.toString());
All this does is use the PDFBox API to extract the text from the PDF (whose bytes are available in the data object) and write it to a JMeter variable, extractedText.
The next sampler is a Java Sampler which sets the ResultData to ${extractedText}. This echoes the contents of the variable back as the response of the sampler.
Once this is done, you can use a normal Response Assertion, Regular Expression Extractor or whatever else you need to process the text.
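To make the last step concrete, here is a minimal plain-Java sketch of the kind of match a Regular Expression Extractor performs on the extracted text. The class name, the sample text and the pattern are all made up for illustration; inside a test plan JMeter's own extractor element does this work for you.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of JMeter or PDFBox.
class ExtractedTextRegexSketch {

    // Returns capture group 1 of the first match, or defaultValue when nothing matches.
    static String firstGroup(String text, String regex, String defaultValue) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : defaultValue;
    }

    public static void main(String[] args) {
        // Stand-in for the value stored in the extractedText variable.
        String extractedText = "Invoice number: 12345\nTotal: 99.50\n";
        System.out.println(firstGroup(extractedText, "Invoice number: (\\d+)", "NOT_FOUND"));
    }
}
```

The default value plays the same role as the extractor's Default Value field: it gives you something to assert on when the PDF does not contain the expected text.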
Update: as mentioned by Milamber in the comments, you can instead make the last line of the BeanShell script
prev.setResponseData(sw.toString());
This will set the response of the HTTPSampler to be the text value and allow you to directly specify the assertion (and you can also eliminate the transaction controller).
Update: if you are interested in MS Office formats, follow the steps in this post but change the BeanShell PostProcessor script to the one given in Asserting MS Office Formats.
Future work
a. make a custom PDF (or any format) sampler that exposes the options that are currently hardcoded (e.g. startPage or endPage)
b. experiment with ways to use less memory. Typically the text should be extracted and written to a file, and the file used from there.
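As a rough sketch of item b, the text can be streamed into a temporary file instead of being held in a StringWriter. The class name and the TextSource callback below are hypothetical and exist only to keep the example free of PDFBox; the assumption is that in the real script the file writer would be the Writer handed to stripper.writeText.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch; names are illustrative and not part of JMeter or PDFBox.
class FileBackedExtractionSketch {

    // Stand-in for anything that can stream text into a Writer.
    interface TextSource {
        void writeTo(Writer out) throws IOException;
    }

    // Streams the text into a temp file and returns its path, so the full
    // text never has to sit in memory as a single String.
    static Path extractToTempFile(TextSource source) throws IOException {
        Path file = Files.createTempFile("extracted-", ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            source.writeTo(out);
        }
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path file = extractToTempFile(out -> out.write("text extracted from the PDF"));
        System.out.println(file + " holds " + Files.size(file) + " bytes");
        Files.delete(file);
    }
}
```

The path could then be stored in a JMeter variable in place of the text itself. The PDF bytes and the parsed document are still held in memory, so this only removes the extracted text from the heap.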
10 comments:
Very useful. I'm pretty new to JMeter and have no experience with BeanShell. In order to get the test to run successfully, I had to put the fontbox-1.0.0.jar file in my lib\ext directory.
@Steve
Just the lib directory should have worked. There are other supporting jars too, and if you want to use any of the other features you'd need to copy those jars into JMeter's directory as well.
Great post!
If you replace the last bsh line with:
prev.setResponseData(sw.toString());
you can put a Response Assertion directly as a child of the first request.
Milamber
@Milamber
Sometimes I can't see the forest for the trees! Yes, that is probably a better solution.
I have a question about how the script is actually working. On the first line of the try, we have:
ByteArrayInputStream bais = new ByteArrayInputStream(data);
What is data and where did it come from?
@Steve
When you add a beanshell postprocessor, certain variables are made available to it by default.
In the script section in the post processor you will see
Script (variables : ctx,vars,props,prev,data,log)
ctx is JMeterContext
vars is JMeterVariables etc etc
data is the raw data of the sampler.
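A small plain-Java sketch of what that means in practice: data is just a byte array, so it can be wrapped in a ByteArrayInputStream and inspected before being parsed. The class and method names here are hypothetical; the check relies only on the fact that a PDF file starts with the marker "%PDF-".

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch; in a real PostProcessor the byte array comes from JMeter as "data".
class SamplerDataSketch {

    // True when the bytes begin with the "%PDF-" marker that opens a PDF file.
    static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        ByteArrayInputStream bais = new ByteArrayInputStream(data);
        byte[] head = new byte[magic.length];
        int read = bais.read(head, 0, head.length);
        return read == magic.length && Arrays.equals(head, magic);
    }

    public static void main(String[] args) {
        byte[] pdfLike = "%PDF-1.4\n...".getBytes(StandardCharsets.US_ASCII);
        byte[] htmlLike = "<html>error page</html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(pdfLike));
        System.out.println(looksLikePdf(htmlLike));
    }
}
```

A check like this can help tell apart a real PDF response from, say, an HTML error page before the bytes are handed to PDDocument.load.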
@Steve
This is described in the manual
http://jakarta.apache.org/jmeter/usermanual/component_reference.html#BeanShell_PostProcessor
and you can use the javadocs to investigate the methods of the objects
http://jakarta.apache.org/jmeter/api/org/apache/jmeter/samplers/SampleResult.html
This was very useful.
I found a tool that seems to do for odt something similar to what pdfbox does for pdf, but I can't figure out how to use it in a similar BeanShell PostProcessor.
Do you think it is possible to do the same with odt files?
Here is the tool I found:
http://www.java2s.com/Code/Jar/o/Downloadodfutils21jar.htm
Thanks for the help.
@Klein
Yes it should work.
a. Did you copy the jar file to jmeter's lib directory?
b. What is your beanshell script? Did you check jmeter.log for errors?
Hi,
thanks for the response and for the pdf solution.
I did put odf_utils-2.1.jar into the lib directory of Jmeter.
Unfortunately I know nothing about java programming. I did try to experiment with the BeanShell but it fails on the first relevant line:
import com.catcode.odf.OpenDocumentTextInputStream;
with the error
Constructor error: Can't find default constructor for: class com.catcode.odf.OpenDocumentTextInputStream