Data Extraction using Regexes

Regular expressions, commonly known as regexes, are a text-processing tool that modern programming languages provide for locating and extracting data. At its core, a regular expression is a search pattern (a sequence of symbols) that helps you match, locate and manage text, making it easier to extract information from code, log files, spreadsheets, PDFs and other documents. Surprisingly, the idea behind regexes did not originate in computer science: it grew out of research in neuroscience. So how was this powerful tool first developed, and how has the use of regular expressions evolved over the years?

The Creation of Regexes

In 1943, Warren S. McCulloch (a neuroscientist) and Walter Pitts (a logician) began to develop models describing how the human nervous system works. Their research tried to explain how the brain could produce complex patterns from simple cells that are connected together. In 1956, mathematician Stephen Kleene described the McCulloch-Pitts neural models with an algebraic notation that he called ‘regular expressions’. Influenced by Kleene’s notation, in 1968 computer scientist and Unix pioneer Ken Thompson described an algorithm for searching text with regular expressions, building the idea into the text editor QED and later into its Unix successor, ‘ed’. His aim was to let users do advanced pattern matching in text files. Searching by regular expression became a core feature of ed, and this is how regexes entered the computing world.

Evolution of regular expressions

Many people have contributed to the development and promotion of regular expressions since they entered popular usage through ed. Notably, Larry Wall’s Perl programming language, released in the late 1980s, helped regular expressions become mainstream. Perl was originally designed as a flexible text-processing language but grew into a fully fledged programming language that remains a natural choice for text processing to this day. The language still relies heavily on regexes.

Future of regular expressions

Despite being hard to read, hard to validate, hard to document and notoriously hard to master, regexes are still widely used today. They are supported by virtually all modern programming languages, text-processing programs and advanced text editors, and are now used in more than a third of both Python and JavaScript projects. With this in mind, more than 50 years since their inception, regexes seem very much here to stay.

Use of regular expressions within PDFDataNet

Regexes are widely used within PDFDataNet as a text-processing tool for locating and extracting data. For example, suppose we wish to extract a UK postcode from within a delivery address where the number of lines in the address varies and the postcode is not always on the last line. We can select the whole address and use the regex:

^(GIR 0AA)|((([A-Z][0-9]{1,2})|(([A-Z][A-HJ-Y][0-9]{1,2})|(([A-Z][0-9][A-Z])|([A-Z][A-HJ-Y][0-9]?[A-Z])))) [0-9][A-Z]{2})$

to locate the postcode anywhere within the address.
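As a rough illustration outside PDFDataNet, the sketch below runs the same pattern with Python's standard re module. The article does not say which regex engine or flags PDFDataNet uses, so Python, the helper name find_postcode and the sample address are all assumptions made for this example. One detail worth noting: in Python the trailing $ only matches at the end of the whole text unless the re.MULTILINE flag is set, so the sketch sets that flag to allow a postcode at the end of any line.

```python
import re
from typing import Optional

# The postcode pattern from the article, unchanged (split only for line length).
# re.MULTILINE is an assumption: it lets "^" and "$" match at the start and end
# of every line, so a postcode on a middle line of the address can be found.
POSTCODE = re.compile(
    r"^(GIR 0AA)|((([A-Z][0-9]{1,2})|(([A-Z][A-HJ-Y][0-9]{1,2})"
    r"|(([A-Z][0-9][A-Z])|([A-Z][A-HJ-Y][0-9]?[A-Z])))) [0-9][A-Z]{2})$",
    re.MULTILINE,
)


def find_postcode(address: str) -> Optional[str]:
    """Return the first postcode found in a multi-line address, or None.

    Hypothetical helper for illustration; not part of PDFDataNet.
    """
    match = POSTCODE.search(address)
    return match.group(0) if match else None


# Made-up sample address with the postcode on the second-to-last line.
sample = "Mr J Smith\n10 High Street\nLondon\nSW1A 1AA\nUnited Kingdom"
print(find_postcode(sample))  # SW1A 1AA
```

The pattern expects upper-case letters and a single space inside the postcode, so input such as "sw1a1aa" would need normalising before it is matched.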

Another use is to extract SKUs or product codes from the line details of a sales order. The regex ^[A-B]{2}[0-9]{6} matches codes, anchored at the start of the text, that consist of two letters (each either A or B) followed by six digits, such as AB123456. For more details on the PDFDataNet functions and data manipulations, see our syntax guide.
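To make the behaviour of the product-code pattern concrete, here is a minimal Python sketch that applies it to each line of a sales order. The function name extract_skus and the sample order lines are hypothetical, and Python's re module is only a stand-in for whatever engine PDFDataNet uses.

```python
import re
from typing import List

# The product-code pattern from the article: two letters, each A or B,
# followed by six digits, anchored at the start of the text.
SKU = re.compile(r"^[A-B]{2}[0-9]{6}")


def extract_skus(lines: List[str]) -> List[str]:
    """Return the code at the start of each order line that has one.

    Hypothetical helper for illustration; not part of PDFDataNet.
    """
    codes = []
    for line in lines:
        match = SKU.match(line)
        if match:
            codes.append(match.group(0))
    return codes


# Made-up order lines: two carry a matching code, two do not.
order_lines = [
    "AB123456 Widget x 2",
    "BA654321 Gadget x 1",
    "Delivery charge",
    "CD123456 Other item",
]
print(extract_skus(order_lines))  # ['AB123456', 'BA654321']
```

Because the character class [A-B] accepts either letter in both positions, codes starting AA, BA or BB match as well as AB, while a code such as CD123456 is skipped.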

 
