Populating PDF Forms with MultiValue Data
Walk into any shop, MultiValue or not, that's been around for years and you are likely to find special form-overlay programs that PRINT data on forms, using either physical printers or virtual printers that overlay raw print data on images. It's a tried-and-true way to get the job done. These programs always work great until the form changes for one reason or another.
The reason each form change becomes an issue is that we aren't really working with the form when we create an overlay. We are working with where we expect the spaces to be. PDF forms address this problem. Once you have a PDF document with form prompts on it, you can merge the data into your form and not worry about where on the form it needs to go. The PDF document will take care of that for you.
What You Need
In order to populate a PDF with data, you will need one third-party program on your system:
PDFtk by PdfLabs
https://www.pdflabs.com/tools/pdftk-server/
This program ships with many Linux distributions, but it is not limited to Linux. There is a Windows version of the same program, so it will work for those with Windows-based systems as well.
PDFtk (PDF toolkit) does a number of useful things, even before we add our MultiValue magic. It is designed to merge, encrypt, and decrypt PDFs, add watermarks, and split single PDFs into multiple individual files. And, of course, it can fill in PDF form data.
Example Used
I want to keep this article business-practical, so my example will involve filling out a legal form for the payroll department. The sample PDF that I will be using is the IRS W9 Form [ Figure 1 ]. While this form isn't something that is used every day, it is a good example. If you aren't working with an American company, you'll find that there's an equivalent document in most if not all other countries.
You can get your own copy of the original PDF at https://www.irs.gov/pub/irs-pdf/fw9.pdf. Once you have that, you can follow along and build your program as we continue our way through the article together.
Retrieving the Form Data
Like HTML forms, PDF forms are set up with a unique name assigned to each field. This is very much like how we assign dictionaries to individual fields in our database. Unfortunately, the names aren't visible when you look at the unfilled document. Since we don't know what these names are, we have to extract them before we can populate the form.
The following command will extract this information for you:
$ pdftk fw9.pdf dump_data_fields > fw9_fields.txt
This will produce an output file [ Figure 2 ] that contains information about each field in the PDF document. Each PDF input will have 4-7 pieces of information that describe how the field is to be populated. The key piece of data you need is FieldName. This is the unique identifier that marks each spot that can be filled in. Connect the right data to the right name and the results will make sense.
---
FieldType: Text
FieldName: topmostSubform[0].Page1[0].f1_1[0]
FieldFlags: 8388608
FieldJustification: Left
---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].FederalClassification[0].c1_1[0]
FieldFlags: 0
FieldJustification: Left
FieldStateOption: 1
FieldStateOption: Off
Figure 2
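If you would rather not read the dump by eye, the output is regular enough to parse. The following is a minimal Python sketch, not part of PDFtk or of the subroutines mentioned later in this article. The function name parse_field_dump is my own invention, and the sketch assumes the layout shown in Figure 2: a line of three dashes starts each field, and every other line is a "Key: value" pair.

```python
def parse_field_dump(text):
    """Turn `pdftk ... dump_data_fields` output into a list of dicts.

    Assumes the layout shown in Figure 2. parse_field_dump is a
    hypothetical helper name, not something PDFtk provides.
    """
    fields = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "---":
            # A line of three dashes starts a new field record.
            current = {"FieldStateOption": []}
            fields.append(current)
            continue
        key, separator, value = line.partition(":")
        if not separator or current is None:
            continue  # Skip anything that is not a "Key: value" line.
        value = value.strip()
        if key == "FieldStateOption":
            # This key repeats, once per allowed state, so collect a list.
            current["FieldStateOption"].append(value)
        else:
            current[key] = value
    return fields


sample = """---
FieldType: Text
FieldName: topmostSubform[0].Page1[0].f1_1[0]
FieldFlags: 8388608
FieldJustification: Left
---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].FederalClassification[0].c1_1[0]
FieldFlags: 0
FieldJustification: Left
FieldStateOption: 1
FieldStateOption: Off
"""

fields = parse_field_dump(sample)
for field in fields:
    print(field["FieldType"], field["FieldName"], field["FieldStateOption"])
```

In practice you would read fw9_fields.txt from disk and pass its contents to the function instead of the inline sample.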
I have found that the input field names aren't always self-explanatory. You may have to do a little bit of homework in order to get the right field for the right input [ Figure 3 ]. The easiest way to do this is to test the tab order. Open the PDF document and tab between the fields to verify which fields are the first, second, third, etc. in order.
Figure 3
You will also need to watch for the FieldType information to make sure you are providing valid information. If you look at Figure 2, you will see a FieldType for the button, which has two FieldStateOption values. The first value is the checked (Yes) value and the second value is the unchecked (No) value.
You will also need to watch for the FieldType for Choice, which may contain two or more FieldStateOptions as well, if it is present. This might be a good time to remind you that I didn't design this methodology, I'm just explaining what PDF forms provide.
If the FieldType is Button, then you need to look at the FieldStateOption field to find out what values are allowed to be assigned to the field.
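A small check can catch an invalid value before it ever reaches the form. The sketch below is in Python and uses helper names of my own (allowed_values and is_valid_value). It assumes each field is held as a dictionary with the FieldStateOption lines gathered into a list, and it treats every Choice field as restricted to its listed options, which may be too strict for a form that allows free-text entry in a drop-down.

```python
def allowed_values(field):
    """Return the list of accepted values, or None when any text is accepted.

    Assumption: Button and Choice fields accept only their FieldStateOption
    values; every other field type is treated as free text.
    """
    if field.get("FieldType") in ("Button", "Choice"):
        return list(field.get("FieldStateOption", []))
    return None


def is_valid_value(field, value):
    """Check a candidate value against the field's allowed values."""
    options = allowed_values(field)
    return options is None or value in options


# Field records shaped like the entries in Figure 2.
checkbox = {
    "FieldType": "Button",
    "FieldName": "topmostSubform[0].Page1[0].FederalClassification[0].c1_1[0]",
    "FieldStateOption": ["1", "Off"],
}
text_box = {
    "FieldType": "Text",
    "FieldName": "topmostSubform[0].Page1[0].f1_1[0]",
}

print(is_valid_value(checkbox, "1"))    # True: "1" is the checked state
print(is_valid_value(checkbox, "Yes"))  # False: not a listed state
print(is_valid_value(text_box, "International Spectrum"))  # True: free text
```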
Form Data File
Once you know what the field names are, you need to create a Forms Data Format (FDF) file. This is a special file format used by PDFs to populate the data. They made it really easy for us by keeping this a text file, but it does look a little odd [ Figure 4 ].
%FDF-1.2
1 0 obj << /FDF << /Fields [
<< /T(topmostSubform[0].Page1[0].f1_1[0]) /V(International Spectrum) >>
<< /T(topmostSubform[0].Page1[0].FederalClassification[0].c1_1[1]) /V(2) >>
<< /T(topmostSubform[0].Page1[0].Address[0].f1_7[0]) /V(3691 E 102nd Ct) >>
] >> >>
endobj
trailer
<< /Root 1 0 R >>
%%EOF
Figure 4
If you have read previous articles on generating PDFs from within MultiValue BASIC ( http://www.intl-spectrum.com/mag/JULAUG.2009/default.aspx and https://www.intl-spectrum.com/resource/category/168/PDF.aspx ), you'll see a similarity in the file formats and structures.
That's a truly ugly layout. If this is your first look at PDF internals, it may be hard to follow. Believe it or not, it is actually pretty simple. This file is basically a key/value pair file. The /T indicates the key and the /V represents the value. The data is wrapped in parentheses, much like you would use quotes. Once again, not my design. The first two lines in Figure 4 are the header of the file, and the last five lines are the footer. Both the header and the footer will always be the same for any FDF-formatted file.
In between the header and footer is where we need to put the data we want to merge into the PDF [ Figure 5 ].
<< /T(topmostSubform[0].Page1[0].f1_1[0]) /V(International Spectrum) >>
<< /T(topmostSubform[0].Page1[0].FederalClassification[0].c1_1[1]) /V(2) >>
<< /T(topmostSubform[0].Page1[0].Address[0].f1_7[0]) /V(3691 E 102nd Ct) >>
Figure 5
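Since the header and footer never change, building the file comes down to writing one line per field. Here is a rough Python sketch of that idea. The names build_fdf and escape_fdf are mine, not part of any library. One detail the figures don't show: because the data is wrapped in parentheses, a parenthesis or backslash inside your data would confuse the reader of the file, so the sketch escapes those characters with a backslash, which is the PDF convention for literal strings.

```python
# Fixed header (two lines) and footer (five lines), as described above.
FDF_HEADER = "%FDF-1.2\n1 0 obj << /FDF << /Fields [\n"
FDF_FOOTER = "] >> >>\nendobj\ntrailer\n<< /Root 1 0 R >>\n%%EOF\n"


def escape_fdf(text):
    """Escape backslashes and parentheses so they survive inside ( ... )."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def build_fdf(values):
    """Build the text of an FDF file from a {field name: value} mapping.

    build_fdf is a hypothetical helper written for this example.
    """
    lines = [
        f"<< /T({escape_fdf(name)}) /V({escape_fdf(value)}) >>\n"
        for name, value in values.items()
    ]
    return FDF_HEADER + "".join(lines) + FDF_FOOTER


form_values = {
    "topmostSubform[0].Page1[0].f1_1[0]": "International Spectrum",
    "topmostSubform[0].Page1[0].FederalClassification[0].c1_1[1]": "2",
    "topmostSubform[0].Page1[0].Address[0].f1_7[0]": "3691 E 102nd Ct",
}

fdf_text = build_fdf(form_values)
print(fdf_text)
```

The same logic ports directly to MultiValue BASIC: loop over your field names and values, append one line per pair, and wrap the result in the fixed header and footer.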
Once you have created your FDF file and saved it with the .fdf extension, you can merge the PDF and the data together to create a new PDF document:
$ pdftk fw9.pdf fill_form fw9_data.fdf output fw9_merged.pdf flatten
If you look at this command line, you will see the original PDF is named fw9.pdf, the data is in the FDF file fw9_data.fdf, and the final merged document will be called fw9_merged.pdf. The flatten keyword will create the new PDF document without editable input fields. The original files will remain as-is and can be used again.
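If you are driving the merge from a script rather than typing the command, it helps to build the argument list in one place. The Python sketch below wraps the same command line shown above. The function names are my own, and the sketch assumes pdftk is installed and on the PATH. Nothing runs until you call fill_form yourself.

```python
import subprocess


def fill_form_command(pdf_path, fdf_path, output_path, flatten=True):
    """Build the pdftk argument list for merging form data into a PDF."""
    command = ["pdftk", pdf_path, "fill_form", fdf_path, "output", output_path]
    if flatten:
        # flatten produces a PDF without editable input fields.
        command.append("flatten")
    return command


def fill_form(pdf_path, fdf_path, output_path, flatten=True):
    """Run pdftk. Assumes the pdftk executable is on the PATH."""
    # check=True raises CalledProcessError if pdftk exits with an error.
    subprocess.run(
        fill_form_command(pdf_path, fdf_path, output_path, flatten),
        check=True,
    )


# Show the command that would be run, without running it.
print(" ".join(fill_form_command("fw9.pdf", "fw9_data.fdf", "fw9_merged.pdf")))
```

Passing the arguments as a list, rather than as one long string, means file names containing spaces don't need any special quoting.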
Alternate Form Data Format
There is an alternate FDF format called XFDF, which is XML based [ Figure 6 ]. Why didn't I cover that format first? Well, depending upon the version of pdftk you have on your system, XFDF may not be supported.
<?xml version="1.0" encoding="UTF-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
  <fields>
    <field name="topmostSubform[0].Page1[0].f1_1[0]">
      <value>International Spectrum</value>
    </field>
    <field name="topmostSubform[0].Page1[0].FederalClassification[0].c1_1[1]">
      <value>2</value>
    </field>
    <field name="topmostSubform[0].Page1[0].Address[0].f1_7[0]">
      <value>3691 E 102nd Ct</value>
    </field>
  </fields>
</xfdf>
Figure 6
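As with FDF, the wrapper is fixed and only the field entries change. A minimal Python sketch of a generator follows. build_xfdf is a name I made up for this example. Because XFDF is XML, characters such as the ampersand and the less-than sign in your data must be escaped, and the sketch uses the standard library's escape and quoteattr helpers for that.

```python
from xml.sax.saxutils import escape, quoteattr


def build_xfdf(values):
    """Build the text of an XFDF file from a {field name: value} mapping.

    build_xfdf is a hypothetical helper written for this example.
    """
    parts = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">',
        "<fields>",
    ]
    for name, value in values.items():
        # quoteattr adds the surrounding quotes and escapes the name.
        parts.append(f"<field name={quoteattr(name)}>")
        # escape handles &, < and > inside the value text.
        parts.append(f"<value>{escape(value)}</value>")
        parts.append("</field>")
    parts.append("</fields>")
    parts.append("</xfdf>")
    return "\n".join(parts) + "\n"


xfdf_text = build_xfdf({
    "topmostSubform[0].Page1[0].f1_1[0]": "International Spectrum",
    "topmostSubform[0].Page1[0].FederalClassification[0].c1_1[1]": "2",
    "topmostSubform[0].Page1[0].Address[0].f1_7[0]": "3691 E 102nd Ct",
})
print(xfdf_text)
```

When you save the result, write it out as UTF-8 so that the file matches the encoding named in its first line.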
I thought it would be best if you have the most up-to-date version of pdftk, but that is not always the case, so I started with the harder format first. Besides being easier to understand, XFDF has one more advantage. It will support Unicode in UTF-8 format. The FDF format does not support Unicode.
Extended Features Error
Some original PDF documents start with Extended Features enabled. If this is the case with a document you are working with, you'll get an error when you open the merged document in Acrobat:
"This Document enabled extended features in Adobe Reader. This document has been changed since it was created and use of extended features is no longer available."
Sometimes this is due to signed PDFs; other times, it's due to security settings, such as those related to page extraction. To remove these errors, run the PDFtk command one more time to strip this information:
$ pdftk fw9_merged.pdf cat output fw9_finished.pdf
Putting This All Together
As you can see, this is all really easy to do. While you can do it yourself, there are subroutines available at the following URL that take all of this into account:
https://www.intl-spectrum.com/resource/category/168/PDF.aspx
Creating Your Own PDF Documents with Form Inputs
You aren't limited to pre-made PDF documents. If your company has documents they regularly fill out, like liens, mortgage forms, tax forms, or credit requests, then you can convert any existing PDF document into a PDF document with input fields. You just need the right program. Adobe Acrobat Pro is the most commonly used, but also the most expensive. A good open source alternative is OpenOffice.