How to Automate Filling In Web Forms with Python

Python is a great, easy-to-learn programming language that can help you automate routine tasks and make your life easier.

Have you ever encountered a situation where you need to fill in some online forms and do this multiple times per day? If so, Python can help you automate most of these tedious tasks. Join me on this journey to learn how a simple Python script can automate online data-entry.

Yes, you can use Python to automatically fill out a form online.

Download the Completed Project

Before we begin, here is the completed Python script, as well as the web form I’ll reference.

How to Extract Data from a PDF with Python

Three Types of PDF Format

Text-Based PDF Format

There are three ways data can be stored in a PDF. The most common way is by having the data as text within the PDF file, which is known as a Text-based PDF.

In this case, the PDF is either an unstructured document (one without a specific layout, e.g. a manual) or a semi-structured document (one that conforms to a layout, e.g. an invoice). The data is simply the text that resides within the PDF file itself, which is visible and readable to the human eye. Below is an example.

Image-Based PDF Format

Another common type of PDF file is the Image-based PDF. These are PDFs that are literally scanned copies of paper documents. The text that is visible and readable to the human eye is really part of the image and can only be extracted by using Optical Character Recognition (OCR).

Extracting the text contained within these PDFs is harder, as specialized OCR engines are required. Even then, there is no guarantee that the extracted text is fully readable, as the outcome depends on the quality of the scanned image.

Besides that, it is possible that the scanned image within the PDF is not in the correct orientation, which makes the process of extracting any data even more difficult. Below is an example of such a document, which is simply a scanned image with the wrong orientation, embedded within a PDF.

PDF Forms Format

Finally, there’s a third type of PDF file that is neither one nor the other. The information in this type of PDF file is kept within internal PDF fields. This type of document is known as a PDF form.

PDF Forms can easily be created using specialized software such as Adobe Acrobat or PDFelement.

Running the Python Form Filling Script

Before we start, let’s see an example of the online mortgage loan software we’re going to make. This is how my folder looks: It contains the Python script, the .ini files and the PDF form document with the applicant’s data.

This is what the empty online mortgage application form looks like.

If I execute the Python script (.py), I see that a .txt file with the same name as the PDF form file gets created in the folder where the Python script resides.

Next, let’s open the JavaScript code (.txt) file created and copy all the code contained within it.

Now, back to the online form on the browser, let’s open the Developer tools and then go to the Console tab and paste in the copied code.

With the code pasted, just press enter (with the focus inside the Console tab). Then the online form will be automatically filled in, with the same data contained within the PDF form document.

We can check that the data filled into the online form is indeed the same as the one on the PDF form document.

Creating the PDF from Scratch

One of the pain points with the first two types of PDF documents described (Text-based PDFs and Image-based PDFs) is that the information contained within the PDF itself is not organized.

This means that even if we are able to extract the text by programmatically reading the PDF lines, or by performing an OCR operation on the image embedded within the PDF that contains the text, we still need to make sense of that resultant extracted text.

All that text will be nothing more than words within lines or sentences if we are not able to give any meaning to it. Understanding how to find an invoice total amount within lines of text that contain multiple numbers is not an easy feat and such a process requires a certain level of algorithmic intelligence.
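To make the difficulty concrete, here is a minimal sketch using only the standard library. The invoice text is hypothetical, and the "line starting with Total" rule is just one assumption about how such a document might be laid out.

```python
import re

# Hypothetical text, as it might come out of a text-based invoice PDF.
invoice_text = """Invoice 2041
Qty 3 x 19.99
Subtotal 59.97
Tax 4.80
Total 64.77"""

# Every number with two decimals is a candidate for the "total".
amounts = re.findall(r"\d+\.\d{2}", invoice_text)
print(amounts)

# Picking the right one needs an extra rule, e.g. "the amount on the line
# that starts with the word Total". Another invoice layout would need another rule.
match = re.search(r"^Total\s+(\d+\.\d{2})$", invoice_text, flags=re.MULTILINE)
total = match.group(1) if match else None
print(total)
```

The first search returns four candidates, and only a layout-specific rule singles out the total. A PDF form avoids this guesswork because each value arrives with a field name attached.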

So, the first step in automating the data acquisition process is to change how people send their information.

Instead of having people send over scanned copies of their paper documents or any PDF versions of their scanned employment letters when applying for a mortgage, why not have them fill in the data for you, by using a PDF form?

To get a sense of what this really means, take a look at the following PDF form document.

This document is a PDF file, just like any other PDF, with a small but important difference. It contains editable text boxes (fields) where data can be entered. In this particular example, this is a new customer onboarding PDF form that can be sent through email, after being filled in manually by the applicant.

So, instead of having the applicant send in their information as scanned documents, let’s have them fill in all the required information using PDF forms, which already saves 90% or more of the time required to gather the data for applying for a mortgage loan.

Creating a PDF form document is a very simple process and this video describes the steps involved.

In essence, a PDF form is a great way to have applicants submit information that is easier to extract and process. For my financial advisor friend, this is the option that I recommended.

Automating the PDF Data Extraction Process

For our real-world scenario, we’ve reached a major milestone, which is to have the data nicely organized and structured. This is why PDF forms are a great way of gathering data.

The next step is to write some Python code that extracts the data contained within the PDF form documents and creates a JavaScript script, which can then be executed within the Console tab of the browser Developer tools to automatically fill in an online form. To better understand the whole process, let’s have a look at the following diagram.

So in essence, the PDF form document is put through the Python script, and this script reads the content of the document and checks each field. Then for each field, the value of the field is extracted and a JavaScript script is generated, which contains the name of the equivalent online (web) field.
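As a rough illustration of that last step, the sketch below turns field names and values into JavaScript statements. The field names and values are hypothetical stand-ins for data read from a PDF form, and to_js_line is a helper named here only for illustration.

```python
# Hypothetical field names and values, standing in for data read from a PDF form.
pdf_fields = {"firstName": "Ana", "loanAmount": "250000"}

def to_js_line(field_id, value):
    """Build one JavaScript statement that sets a text input, found by its id."""
    return "document.getElementById('" + field_id + "').value = '" + value + "';"

# One statement per PDF field; the PDF field name doubles as the HTML element id.
js_lines = [to_js_line(k, v) for k, v in pdf_fields.items()]
print("\n".join(js_lines))
```

Each printed line is plain JavaScript that looks up an element by id and assigns its value, which is why the PDF field names have to match the ids used on the web page.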

This JavaScript script can be executed (on the online form we wish to automatically fill in) with the data extracted from the PDF form document, by opening the browser Developer tools and then running the JavaScript script using the Console tab.

The JavaScript script will automatically fill in the values of the fields of the online form. For all this to work properly, each field within the PDF form document must correspond to a field within the online form.

To ensure that this is the case, it’s recommended practice (when creating the PDF form document) to give each PDF field the same name as the id attribute of the corresponding online field.

So, before creating the PDF form document, you must inspect each online field with the browser and retrieve the value of the id attribute of the HTML element that corresponds to the field.

Those same id values are what you will use to name each of the PDF form fields in the PDF document that you will create (for your users to fill in later).
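Inspecting fields one by one in the browser works fine. If you happen to have the page saved as HTML, the ids could also be collected with a few lines of standard-library Python. This is only a sketch: the markup below is hypothetical, and FieldIdCollector is a name made up for this example.

```python
from html.parser import HTMLParser

class FieldIdCollector(HTMLParser):
    """Collect the id attribute of every input, select and textarea element."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "select", "textarea"):
            field_id = dict(attrs).get("id")
            if field_id:
                self.ids.append(field_id)

# Hypothetical form markup; a real page would be saved from the browser.
sample_html = """
<form>
  <input type="text" id="firstName">
  <select id="loanType"><option>Fixed</option></select>
  <input type="checkbox" id="firstTimeBuyer">
</form>
"""

collector = FieldIdCollector()
collector.feed(sample_html)
print(collector.ids)
```

The resulting list is the set of names to reuse for the PDF form fields.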

So now that we’ve reviewed how the Data Extraction Automation process works, it’s important to keep in mind that to achieve it, there are essential steps involved:

1. Check which online form(s) you would like to automate. Inspect the HTML element of each field that you want the generated JavaScript script to fill in, and note the value of its id attribute.

2. Open Notepad or any other text editor and save those id values to a text (.txt) file, e.g. fields.txt. This file is only for your reference and won’t be read by the Python script. You will need these field names when you design your PDF form using Adobe Acrobat or PDFelement.

3. If your online form has selectable fields (with a drop-down menu), add the id value of each of these fields, one per line, to a .ini (text) file. Save this file with the same name as you will use for your Python script (the script in this article reads myview.ini).

4. If your online form has fields that can be checked, e.g. radio buttons and checkboxes, add the id value of each of these fields, one per line, to another .ini (text) file. Save this file with the same name as the Python script plus the _ext suffix, before the .ini extension (here, myview_ext.ini).

Once these steps have been done, we are ready to write our Python script.

Writing the Python Script

The Python script is the heart and soul of the whole process. It’s where the magic happens. It takes a PDF form document, reads its content, identifies each field with its respective value and generates a JavaScript script which you can then use on the browser to automatically fill in your online form.

If you need to manually enter different data into the same online form multiple times a day, this script can be an invaluable time-saving tool.

Take my friend the financial advisor, who has to enter mortgage loan data into the same online form, for different applicants, sometimes up to 30 times a day.

“I’m a financial advisor who helps people arrange mortgage loans. The process of entering all this data manually, for each applicant, is a tedious, error-prone and time-consuming process, which takes many hours to complete. If I learn Python could I write a script to automate filling in online forms? – Mike”

Imagine manually entering the data for a form that has at least 20 fields, for each person. Then imagine doing that 30 times a day. That’s 600 fields a day that need to be manually entered. Not a job that I would be excited to do.

So at a high level, how does the Python script work? The script does essentially three things:

1. It identifies all fields that exist within the PDF form document.
2. By checking the .ini file with the same name as the Python script (myview.ini in the code below), it identifies which fields are drop-down menus with multiple possible responses. This distinction matters when generating the JavaScript code. The file can be left empty if there are no selectable fields.
3. By checking the _ext.ini file with the same name as the Python script (myview_ext.ini in the code below), it identifies which fields are radio-button or checkbox fields. This distinction also matters when generating the JavaScript code. The file can be left empty if there are no radio-button or checkbox fields.
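The role of the two lists can be pictured with a simplified sketch. The field ids are hypothetical, and classify_field is not part of the script. The script itself also looks at the field value (for example /Yes or /Off) when deciding what to generate.

```python
def classify_field(name, select_ids, check_ids):
    """Return the kind of JavaScript a field needs, based on the two id lists."""
    if name in check_ids:
        return "checkbox-or-radio"
    if name in select_ids:
        return "select"
    return "text"

# Hypothetical contents of the two .ini files, one field id per line.
select_ids = ["loanType", "propertyState"]
check_ids = ["firstTimeBuyer"]

kinds = {name: classify_field(name, select_ids, check_ids)
         for name in ["firstName", "loanType", "firstTimeBuyer"]}
print(kinds)
```

Any field that appears in neither list is treated as a regular text field.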

So, let’s look at the complete Python script and then break it down into smaller chunks, to understand it better. All the code was written using Python version 3.6 or higher. You can download Python from the official site.

import os
import sys
from collections import OrderedDict
from PyPDF2 import PdfFileReader

def _getFields(obj, tree=None, retval=None, fileobj=None):
    fieldAttributes = {'/FT': 'Field Type', '/Parent': 'Parent', '/T': 'Field Name', 
    '/TU': 'Alternate Field Name', '/TM': 'Mapping Name', '/Ff': 'Field Flags', 
    '/V': 'Value', '/DV': 'Default Value'}
    if retval is None:
        retval = OrderedDict()
        catalog = obj.trailer["/Root"]
        if "/AcroForm" in catalog:
            tree = catalog["/AcroForm"]
        else:
            return None
    if tree is None:
        return retval

    obj._checkKids(tree, retval, fileobj)
    for attr in fieldAttributes:
        if attr in tree:
            obj._buildField(tree, retval, fileobj, fieldAttributes)
            break

    if "/Fields" in tree:
        fields = tree["/Fields"]
        for f in fields:
            field = f.getObject()
            obj._buildField(field, retval, fileobj, fieldAttributes)

    return retval

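def get_form_fields(infile):
    # Keep the file open while reading: PyPDF2 loads PDF objects lazily.
    with open(infile, 'rb') as fh:
        reader = PdfFileReader(fh)
        # _getFields returns None when the PDF has no form (/AcroForm).
        fields = _getFields(reader) or OrderedDict()
        return OrderedDict((k, v.get('/V', '')) for k, v in fields.items())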

def selectListOption(all_lines, k, v):
    all_lines.append('function setSelectedIndex(s, v) {')
    all_lines.append('for (var i = 0; i < s.options.length; i++) {')
    all_lines.append('if (s.options[i].text == v) {')
    all_lines.append('s.options[i].selected = true;')
    all_lines.append('return;') 
    all_lines.append('}')
    all_lines.append('}')
    all_lines.append('}')
    all_lines.append('setSelectedIndex(document.getElementById("' + k + '"), "' + v + '");')

def readList(fname):
    lst = []
    with open(fname, 'r') as fh:  
        for l in fh:
            lst.append(l.rstrip(os.linesep))
    return lst

def createBrowserScript(fl, fl_ext, items, pdf_file_name):
    # fl: ids of drop-down fields; fl_ext: ids of radio-button and checkbox fields.
    # Either list may be empty.
    if pdf_file_name:
        of = os.path.splitext(pdf_file_name)[0] + '.txt'
        all_lines = []
        for k, v in items.items():
            print(k + ' -> ' + v)
            if v in ['/Yes', '/On']:
                all_lines.append("document.getElementById('" + k + "').checked = true;\n")
            elif v in ['/0'] and k in fl_ext:
                all_lines.append("document.getElementById('" + k + "').checked = true;\n")
            elif v in ['/No', '/Off']:
                all_lines.append("document.getElementById('" + k + "').checked = false;\n")
            elif v == '' and k in fl_ext:
                # An empty value on a radio button or checkbox means "not checked".
                all_lines.append("document.getElementById('" + k + "').checked = false;\n")
            elif k in fl:
                selectListOption(all_lines, k, v)
            else:
                all_lines.append("document.getElementById('" + k + "').value = '" + v + "';\n")
        with open(of, 'w') as outF:
            outF.writelines(all_lines)

def execute(args):
    try: 
        fl = readList('myview.ini')
        fl_ext = readList('myview_ext.ini')
        if len(args) == 2:
            pdf_file_name = args[1]
            items = get_form_fields(pdf_file_name)
            createBrowserScript(fl, fl_ext, items, pdf_file_name)
        else:
            files = [f for f in os.listdir('.') if os.path.isfile(f) and f.endswith('.pdf')]
            for f in files:
                items = get_form_fields(f)
                createBrowserScript(fl, fl_ext, items, f)
    except Exception as msg:
        print('An error occurred... :( ' + str(msg))

if __name__ == '__main__':
    execute(sys.argv)

Import Our Python Libraries

So, let’s start from the very beginning. To make things happen we’ll need to use some Python libraries.

import os
import sys
from collections import OrderedDict
from PyPDF2 import PdfFileReader

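Every library is part of the Python standard library, except PyPDF2.

The PyPDF2 library is required to read PDF form documents. The script uses the library’s older API (PdfFileReader, getObject and the private helpers _checkKids and _buildField), and PyPDF2 3.0 removed PdfFileReader, so install an older release. Version 1.26.0 is one release that provides these names:

pip install PyPDF2==1.26.0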

A Step By Step Guide to Reading the PDF Fields

Next we have the _getFields function. The objective of this function is to read the fields within any PDF form document by inspecting the document’s field tree. This is achieved by using the following code.

def _getFields(obj, tree=None, retval=None, fileobj=None):
    fieldAttributes = {'/FT': 'Field Type', '/Parent': 'Parent', '/T': 'Field Name', 
    '/TU': 'Alternate Field Name', '/TM': 'Mapping Name', '/Ff': 'Field Flags', 
    '/V': 'Value', '/DV': 'Default Value'}
    if retval is None:
        retval = OrderedDict()
        catalog = obj.trailer["/Root"]
        if "/AcroForm" in catalog:
            tree = catalog["/AcroForm"]
        else:
            return None
    if tree is None:
        return retval

    obj._checkKids(tree, retval, fileobj)
    for attr in fieldAttributes:
        if attr in tree:
            obj._buildField(tree, retval, fileobj, fieldAttributes)
            break

    if "/Fields" in tree:
        fields = tree["/Fields"]
        for f in fields:
            field = f.getObject()
            obj._buildField(field, retval, fileobj, fieldAttributes)

    return retval

In essence, this code looks up the form dictionary (/AcroForm) under the document’s root node (/Root), loops through the fields found under the fields tree (/Fields), and builds an entry for each field by inspecting specific field attributes.

Field attributes are used internally by PDF form documents to describe how fields are structured. By using field attributes, it is possible to determine a field name, a field value and also any flags or default values a field might have.
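To give a feel for these attributes, here is a small sketch with a plain Python dictionary standing in for a PDF field. The field itself is hypothetical. The value /Tx is how the PDF specification marks a text field (/Btn is used for buttons and checkboxes, /Ch for choice fields).

```python
# Readable labels for some of the attribute keys used by the script.
field_attributes = {'/FT': 'Field Type', '/T': 'Field Name',
                    '/V': 'Value', '/DV': 'Default Value'}

# Hypothetical field dictionary, shaped like the ones stored inside a PDF form.
raw_field = {'/FT': '/Tx', '/T': 'firstName', '/V': 'Ana'}

def describe(field):
    """Translate the attribute keys that are present into readable labels."""
    return {field_attributes[k]: v for k, v in field.items() if k in field_attributes}

description = describe(raw_field)
print(description)
```

Attributes that a field does not carry, such as /DV here, are simply absent from the result.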

Next, we have the get_form_fields function.

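def get_form_fields(infile):
    # Keep the file open while reading: PyPDF2 loads PDF objects lazily.
    with open(infile, 'rb') as fh:
        reader = PdfFileReader(fh)
        # _getFields returns None when the PDF has no form (/AcroForm).
        fields = _getFields(reader) or OrderedDict()
        return OrderedDict((k, v.get('/V', '')) for k, v in fields.items())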

This function simply opens the PDF form document, calls the _getFields function, and returns the value (/V) of every field in the file as an ordered dictionary keyed by field name.
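The shape of that result can be sketched without a PDF at hand. The dictionary below is a hypothetical stand-in for what _getFields returns, and it shows why the default of an empty string matters for fields that were left blank.

```python
from collections import OrderedDict

# Hypothetical output of _getFields: field name -> attribute dictionary.
fields = OrderedDict([
    ("firstName", {"/FT": "/Tx", "/V": "Ana"}),
    ("middleName", {"/FT": "/Tx"}),  # left blank in the form, so no /V entry
    ("firstTimeBuyer", {"/FT": "/Btn", "/V": "/Yes"}),
])

# Same expression as in get_form_fields: keep only the value of each field.
values = OrderedDict((k, v.get("/V", "")) for k, v in fields.items())
print(dict(values))
```

A blank field ends up with an empty string instead of raising a KeyError, and a ticked checkbox shows up as a name such as /Yes.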

Next, we have the selectListOption function.

def selectListOption(all_lines, k, v):
    all_lines.append('function setSelectedIndex(s, v) {')
    all_lines.append('for (var i = 0; i < s.options.length; i++) {')
    all_lines.append('if (s.options[i].text == v) {')
    all_lines.append('s.options[i].selected = true;')
    all_lines.append('return;') 
    all_lines.append('}')
    all_lines.append('}')
    all_lines.append('}')
    all_lines.append('setSelectedIndex(document.getElementById("' + k + '"), "' + v + '");')

This function simply emits a JavaScript function that, at runtime (when the JavaScript script is pasted into the Console window of the browser Developer tools and executed), selects the drop-down option of an online form field whose text matches the value of the equivalent PDF form field.
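Note that selectListOption writes the setSelectedIndex helper again for every drop-down field. That works, because a later definition just replaces the earlier one. A possible variation, sketched below with hypothetical field ids and function names, is to emit the helper once and then one call per field.

```python
def select_helper_js():
    """JavaScript helper that selects the option whose visible text equals v."""
    return (
        "function setSelectedIndex(s, v) {\n"
        "  for (var i = 0; i < s.options.length; i++) {\n"
        "    if (s.options[i].text == v) { s.options[i].selected = true; return; }\n"
        "  }\n"
        "}\n"
    )

def select_call_js(field_id, value):
    """One call per drop-down field; field_id and value come from the PDF form."""
    return 'setSelectedIndex(document.getElementById("' + field_id + '"), "' + value + '");\n'

# Hypothetical drop-down fields and the values chosen in the PDF form.
script = select_helper_js()
for field_id, value in [("loanType", "Fixed"), ("propertyState", "Texas")]:
    script += select_call_js(field_id, value)
print(script)
```

The generated script is shorter and easier to read in the Console, while the matching logic stays the same as in the original function.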

Next we have the readList function.

def readList(fname):
    lst = []
    with open(fname, 'r') as fh:  
        for l in fh:
            lst.append(l.rstrip(os.linesep))
    return lst

This function simply reads the .ini files, which list the selectable fields and the radio-button and checkbox fields.
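The .ini files are plain lists with one field id per line. As a sketch of a slightly more forgiving reader (read_ids is a hypothetical name, and the file contents are made up), blank lines and trailing spaces can be ignored like this:

```python
import os
import tempfile

def read_ids(path):
    """Read one field id per line, ignoring blank lines and surrounding spaces."""
    with open(path, "r") as fh:
        return [line.strip() for line in fh if line.strip()]

# Write a hypothetical .ini file to a temporary folder, then read it back.
with tempfile.TemporaryDirectory() as folder:
    ini_path = os.path.join(folder, "myview.ini")
    with open(ini_path, "w") as fh:
        fh.write("loanType\n\npropertyState  \n")
    select_ids = read_ids(ini_path)

print(select_ids)
```

An empty file simply produces an empty list.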

Next we have the createBrowserScript function.

def createBrowserScript(fl, fl_ext, items, pdf_file_name):
    # fl: ids of drop-down fields; fl_ext: ids of radio-button and checkbox fields.
    # Either list may be empty.
    if pdf_file_name:
        of = os.path.splitext(pdf_file_name)[0] + '.txt'
        all_lines = []
        for k, v in items.items():
            print(k + ' -> ' + v)
            if v in ['/Yes', '/On']:
                all_lines.append("document.getElementById('" + k + "').checked = true;\n")
            elif v in ['/0'] and k in fl_ext:
                all_lines.append("document.getElementById('" + k + "').checked = true;\n")
            elif v in ['/No', '/Off']:
                all_lines.append("document.getElementById('" + k + "').checked = false;\n")
            elif v == '' and k in fl_ext:
                # An empty value on a radio button or checkbox means "not checked".
                all_lines.append("document.getElementById('" + k + "').checked = false;\n")
            elif k in fl:
                selectListOption(all_lines, k, v)
            else:
                all_lines.append("document.getElementById('" + k + "').value = '" + v + "';\n")
        with open(of, 'w') as outF:
            outF.writelines(all_lines)

This function is the main part of the Python script. It is responsible for creating the JavaScript script that will be executed on the browser.

It basically goes through all the PDF form fields and creates, for each one, the JavaScript code that fills in the corresponding online field when executed. The code generated depends on whether the field is a regular field, a selectable field, a radio button or a checkbox.

The function saves the JavaScript script next to the input PDF form document, which in this walkthrough is the folder that also holds the Python script and the .ini files. The JavaScript script file gets the same base name as the input PDF form document.

So, if the input PDF form file is called form_1.pdf, then the resultant JavaScript script file will be called form_1.txt. Notice that a .txt extension is preferred over a .js extension.

This is so that the generated JavaScript code can be opened with a text editor and easily be copied to the Clipboard, and then be pasted within the browser’s Developer tools Console window to be executed, by pressing enter.
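One thing to watch for in the generated code: values are placed between single quotes as they are, so a value that itself contains an apostrophe would end the JavaScript string early. A possible safeguard, sketched here as an assumption and not as part of the original script, is to let json.dumps do the quoting, since the string literals it produces are also valid JavaScript. The name set_value_js and the sample value are hypothetical.

```python
import json

def set_value_js(field_id, value):
    """Build a JavaScript assignment with both strings safely quoted and escaped."""
    return ("document.getElementById(" + json.dumps(field_id) + ").value = "
            + json.dumps(value) + ";")

# Hypothetical value containing an apostrophe, which would break a '...' string.
line = set_value_js("lastName", "O'Brien")
print(line)
```

The same idea would apply to the drop-down and checkbox statements, because they embed the field id in the same way.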

The Final Execute Function is Ready!

Finally, we have the execute function.

def execute(args):
    try: 
        fl = readList('myview.ini')
        fl_ext = readList('myview_ext.ini')
        if len(args) == 2:
            pdf_file_name = args[1]
            items = get_form_fields(pdf_file_name)
            createBrowserScript(fl, fl_ext, items, pdf_file_name)
        else:
            files = [f for f in os.listdir('.') if os.path.isfile(f) and f.endswith('.pdf')]
            for f in files:
                items = get_form_fields(f)
                createBrowserScript(fl, fl_ext, items, f)
    except Exception as msg:
        print('An error occurred... :( ' + str(msg))

This function, as its name implies, basically runs the rest of the Python script functions already described.

It starts off by reading the .ini files. Then it either creates the JavaScript script file (with the .txt extension) for the PDF form file passed to the Python script, or creates one for each PDF form document found in the folder the script is run from.

So, you can either run the Python script with a PDF form document name as a parameter, or run it without any parameter, in which case the script looks for every PDF form document in the current folder and creates a corresponding JavaScript script (.txt) file for each.

If we have a folder with two PDF form documents and within it, also our Python script and .ini files, then after executing the Python script without any parameters, we can expect two resultant .txt files, one for each PDF form document.
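The folder scan and the naming rule can be tried out on their own, without any real PDF files. The sketch below uses a temporary folder with hypothetical, empty files, and output_name is a helper named only for this example.

```python
import os
import tempfile

def output_name(pdf_file_name):
    """form_1.pdf -> form_1.txt, as the script does with os.path.splitext."""
    return os.path.splitext(pdf_file_name)[0] + ".txt"

# Create a hypothetical folder with two PDF forms and one unrelated file.
with tempfile.TemporaryDirectory() as folder:
    for name in ["form_1.pdf", "form_2.pdf", "myview.ini"]:
        open(os.path.join(folder, name), "w").close()
    # Same filter as in execute: keep only the files ending in .pdf.
    pdfs = sorted(f for f in os.listdir(folder) if f.endswith(".pdf"))
    outputs = [output_name(f) for f in pdfs]

print(outputs)
```

Only the two PDF files are picked up, and each one maps to a .txt file with the same base name.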

Each resultant JavaScript script (.txt) file can be opened with a text editor, its code copied, and then you could simply navigate on your browser to the online form you wish to fill in, then paste the copied JavaScript code to the Developer tools Console and press enter to execute it.

Then you should auto-magically see the fields of the online form filled with the same values as those of the PDF form document.

You can then open the other resultant .txt file with a text editor, copy the code to the clipboard, navigate to the corresponding online form, open Developer tools, paste the copied code into the Console and press enter to execute it. Voila, the online form fields should be automatically populated. How cool is that!

Conclusion

We’ve accomplished something really cool, which is how to extract data contained within any PDF form document, and automatically fill in an equivalent online form using a relatively small and uncomplicated Python script.

The same technique described here, and the exact same Python script, can be used to eliminate manual data-entry for any online form, not just this specific real-world example. The Python script and processes are generic enough to work for any PDF form and online form.

The key is to facilitate the data-acquisition process by creating PDF form documents that have the same field names as the fields on the online form into which you want to enter the acquired data.

Overall, this relatively simple technique, if applied correctly, can be a massive time-saver for manually intensive and time-consuming online data-entry tasks.

Again, you can download the completed Python script, as well as the web form we used, and try it all for yourself!

Hopefully, you can also apply this technique and script in your daily job, save valuable time and have some fun along the way. Thank you for reading and until next time.

Written by Chris Castiglione and based on a project by Ed Freitas.