FLIE: Form Labelling for Information Extraction

Elzbieta Pustulka, Fachhochschule Nordwestschweiz

2 June 2020

Information extraction from forms is a challenging topic with high practical relevance, in particular for the insurance industry in Switzerland. We have gathered over 20,000 anonymized insurance policies and related documents in German, French, English and Italian and have prototyped an automated method for information extraction. So far, we have tested this method on three policy types in German.

Given a user schema, expressed as a list of attributes to be found in an insurance policy, we extract the relevant information and map it to those attributes. To do that, we first extract the text from the PDF and write the bounding boxes of the text to a CSV file. We then reconstruct each page, group the text boxes into horizontal groups and into columns within those groups, and annotate the geometry. Twenty-four policies from various insurers, representing three policy types, were annotated manually by the user with the desired attribute names. Machine learning was used to propagate this annotation in two steps: in the first step, each piece of text was tagged as either metadata or data; in the second step, attribute names were mapped to the extracted text. The accuracy of the first step is currently 88%. In the second step, we can map attributes that appear more than eight times in the documents with similar accuracy, while the remaining attributes are often singletons and cannot be mapped yet. Data extraction then uses these annotations to produce the output required by the user. With more annotated data, we expect to reach the required accuracy of over 90%.
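The geometric grouping step described above can be illustrated with a minimal sketch. It is not the FLIE implementation: the CSV column names (`text`, `x0`, `y0`, `x1`, `y1`), the tolerances `y_tol` and `x_tol`, the helper names, and the assumption that `y` grows downward from the top of the page are all hypothetical choices made for this example. Boxes whose top edges nearly coincide are placed in the same horizontal group, and column indices are derived by clustering left edges.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Box:
    text: str
    x0: float  # left edge
    y0: float  # top edge (assumed: y grows downward)
    x1: float  # right edge
    y1: float  # bottom edge


def read_boxes(csv_text):
    """Parse bounding boxes from CSV text with the assumed column names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        Box(r["text"], float(r["x0"]), float(r["y0"]),
            float(r["x1"]), float(r["y1"]))
        for r in reader
    ]


def group_rows(boxes, y_tol=3.0):
    """Put boxes whose top edge is within y_tol of a row's first box in that row."""
    rows = []
    for box in sorted(boxes, key=lambda b: (b.y0, b.x0)):
        if rows and abs(box.y0 - rows[-1][0].y0) <= y_tol:
            rows[-1].append(box)
        else:
            rows.append([box])
    # Order each row from left to right.
    return [sorted(row, key=lambda b: b.x0) for row in rows]


def assign_columns(rows, x_tol=5.0):
    """Label each box with a column index obtained by clustering left edges."""
    lefts = sorted({b.x0 for row in rows for b in row})
    anchors = []
    for x in lefts:
        # Start a new column when the left edge is far from the last anchor.
        if not anchors or x - anchors[-1] > x_tol:
            anchors.append(x)

    def column_of(box):
        return min(range(len(anchors)), key=lambda i: abs(anchors[i] - box.x0))

    return [[(column_of(b), b.text) for b in row] for row in rows]


# Made-up sample data, for illustration only.
SAMPLE_CSV = """text,x0,y0,x1,y1
Policy number,50,100,150,112
12345,200,101,260,113
Insured person,50,130,160,142
Jane Doe,201,130,270,142
"""

layout = assign_columns(group_rows(read_boxes(SAMPLE_CSV)))
for row in layout:
    print(row)
```

On this sample, the first column holds the labels and the second holds the values, which is the kind of geometric cue a later metadata-versus-data tagging step could use as a feature. Real policies would likely need more robust handling, for example of multi-line cells and differing font sizes.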
