PDF (Portable Document Format) files are ubiquitous in business, often containing crucial information from contracts to invoices. Business flows often rely upon the data inside these documents for reports, onward processes, auditing and more.
Manually processing piles of PDFs to extract this information is tedious and time-consuming. While many solutions exist to extract data from PDF's - including AI options, their high costs often put them out of reach for many.
Fortunately, Power Automate Desktop (PAD) offers an affordable automation alternative without compromising efficiency. With the right steps applied, this Microsoft 365 tool - that you already have - can optimize PDF workflows for a fraction of the cost.
The Power of PDF Automation (PAD)
For those acquainted with PAD, its benefits in terms of speed and accuracy are evident. But for the uninitiated, PAD is all about automating manual steps. It is available as part of the make.powerautomate.com cloud service under the download button in the top right of the home screen.
Much like it's cloud sister Power Automate, the Desktop version focusses on turning manual repetitive tasks into repeatable, robust and fast automated processes. PAD is very much designed around the need to gather and interact with artefacts you can access from your laptop or desktop (hence its name).
Think of PAD as a macro creator for business processes using desktop resources. It turns tedious, repetitive tasks (like pulling text or tables from your PDF's) into streamlined workflows.
Extracting Key Data from PDFs
The backbone of most PDFs is textual data. Instead of manually copying or typing out data, PAD offers you the "Extract text from PDF" action which pulls text into a variable for inspection and use downstream. No more manual copying.
The PAD action "Extract tables from PDF" recognizes tables of any format within a PDF document. You can then programmatically extract them, navigate rows and columns and select specific items for use in other documents.
Graphics and images play a pivotal role in conveying information. PAD also has the "Extract Images from PDF" action you can use. Once run, the images are held within a PAD flow and can be passed on to other tools retaining their integrity and resolution.
Once You Have the Data, What Next?
Recognize Values Automatically in a PDF
PAD has some great tools built in for you to use when you have accessed this data from your PDF. We'll explore these in a little depth, then I will give you an example to paste into your PAD client to test (along with instructions).
First off, you can use the "Recognize entities in text" action. This presents a configurable action where you can automatically search for any emails, dates, hashtags, URL's and more in the text you just grabbed from your PDF.
Using this one action alone may get the data you want from your PDF without any further work.
Retrieve Values Using Searches and Positions
The data you want may not be easily recognisable by this method. However all is not lost. So long as your PDF's follow a regular structure and the data you need can be recognized somehow using a rule that remains consistent, you have options.
You can use the "Parse Text" action to look for text that matches what you need. Just type in what you are looking for (perhaps you want the "Invoice Number").
Parse text returns the position of the first character of the text match that it finds. What you then need to do is decide where the item sits in relation to this position. For example, the Invoice number "123456ABC" sits 14 characters after the "Invoice Number" text begins. So to retrieve this, you then can implement the "Get Subtext" action using the correct start position for the values along with the length of the text you want to extract.
Use Regular Expressions to Find Items
If you need more sophisticated searches, Regex (regular expressions) are a powerful string matching technique.
Mastering regex reduces reliance on high-priced extraction tools, giving you the same results at a fraction of the cost. A great tool if you want to experiment with regex strings is Regexstorm.
You can use this online tool to build expressions which will find a pattern in your text and return it to you. The "Parse text" action - with the "Is Regular Expression" option turned on can use Regex strings to grab exactly what you need from your text automatically.
I've included an example towards the end of the post so you can see how it works.
Navigating and Extracting Data from Tables:
Sometimes your PDF's contain tables. These are no barrier for PAD, but navigating them programmatically requires understanding of the PAD syntax for tables. To get the tables ad data in your PDF document use the "Extract tables from PDF" action.
Fetching Items From Tables
Navigation of table columns and rows can be a little tricky due to the Syntax used.
Each table in a PDF document is stored in an Array when you extract it. That means you can access data using techniques as you would for any array. The one caveat is to use the correct PAD syntax.
Here are some examples:
Get the Second Table From The Document
Invoice Line Items are stored in the second table gathered when you run an "Extract Tables from PDF" using the example we will provide. Use the following syntax to retrieve that table into a variable:
Note the use of . in an array, the first item is always at position 0. So the "" will give you the second item in the stored tables within PAD.
Fetch A Column From the Line Items Table
To get the item within a column called "DESCRIPTION" which is the first column in the Second table in your document, use:
Could This Help You?
As promised here is a text version of a PAD flow I wrote to accompany this post. To use it, copy and paste the text into a new Desktop Flow (don't worry if it looks a bit odd, it will work when you paste it as is). I've also included a demonstration invoice you can use (save this locally).
Be sure to change the PDF document location to wherever you have saved the example. You'll see some examples I've written about above.
Folder.GetFiles Folder: $'''C:\\Users\\jonm_\\Collab365\\Collab365 - Documents\\Academy Events''' FileFilter: $'''*.pdf''' IncludeSubfolders: False FailOnAccessDenied: True SortBy1: Folder.SortBy.Name SortDescending1: False SortBy2: Folder.SortBy.NoSort SortDescending2: False SortBy3: Folder.SortBy.NoSort SortDescending3: False Files=> PDFFiles
LOOP FOREACH PDFCurrentItem IN PDFFiles
Pdf.ExtractTextFromPDF.ExtractText PDFFile: PDFCurrentItem DetectLayout: False ExtractedText=> ExtractedPDFText
# Get the items we can identify easily - email address, date of invoice
**REGION Entity Extraction
Text.RecognizeEntitiesInText Text: ExtractedPDFText Mode: Text.RecognizerMode.DateTime Language: Text.RecognizerLanguage.English RecognizedEntities=> varDate
Text.RecognizeEntitiesInText Text: ExtractedPDFText Mode: Text.RecognizerMode.Email Language: Text.RecognizerLanguage.English RecognizedEntities=> varContact
**REGION Parse And Get subtext
Text.ParseText.ParseForFirstOccurrence Text: ExtractedPDFText TextToFind: $'''INVOICE NO''' StartingPosition: 0 IgnoreCase: False OccurrencePosition=> Position
Variables.IncreaseVariable Value: Position IncrementValue: 11
Text.GetSubtext.GetSubtext Text: ExtractedPDFText CharacterPosition: Position NumberOfChars: 7 Subtext=> varInvoiceNumber
**REGION Use RegEx
Text.ParseText.RegexParseForFirstOccurrence Text: ExtractedPDFText TextToFind: $'''(?<=INVOICE NO\\r\\n.*\\r\\n).+''' StartingPosition: 0 IgnoreCase: False Match=> varInvoiceName
**REGION Extract From a Table in PDF
Pdf.ExtractTablesFromPDF.ExtractTables PDFFile: PDFCurrentItem MultiPageTables: True SetFirstRowAsHeader: True ExtractedPDFTables=> ExtractedPDFTables
LOOP FOREACH CurrentTable IN ExtractedPDFTables
SET DESCRIPTION TO ExtractedPDFTables.DataTable['DESCRIPTION']
When you run it, look at the variables in the right hand panel to see them get populated with data from your Invoice. Have an experiment to get used to how the data is being held.
Want some more examples?
From a recent competition we ran on the Academy, here are some other great tips to try out in relation to Document text extraction using PAD. Have a go and let us know how you got on in the Academy.
Cycle through the rows in a table
Using the table extraction technique, create a For Each Loop to read each row in a table using the following syntax:
Tidy Up White Space in your Extracted Data Items
Once you have the row selected and you have an item from that row, maybe remove leading and trailing spaces to prepare the data for onward movement. Use this approach to do this:
Create a new list of Data to pass on elsewhere
Once you have narrowed your search, you may simply want to create a new list of only the prepared and cleansed data items you need. You can do this by first of all using the Create New List action before a loop and then using the following action to add items to the list dynamically.
as a final process, you may wish to search the list and extract items at the correct time. You can do this by treating the list as an array and searching within it. The syntax to do this looks like this:
Will retrieve the value of the DESCRIPTION field in the first row of your list.
We hope with this little bit of inspiration, it will entice you to embrace PAD and embark on a journey of efficiency and affordability! You can do that over at our Collab365 Community!