Effortless PDF Data Extraction with Textricator and Open Source Tools

Textricator is an open-source tool, written in Kotlin, for extracting structured data from PDF documents. This presentation covers its parser configuration, web GUI, command-line interface, and advanced features, along with possible future support for Word and Excel files.

  • PDF extraction
  • Textricator
  • open source
  • data parsing
  • document processing

Presentation Transcript


  1. PDF Data extraction made simple (and Open Source)

  2. Textricator
     • https://textricator.mfj.io
     • https://github.com/measuresforjustice/textricator
     • https://github.com/measuresforjustice/textricator-web
     • 100% Kotlin (Java-interoperable)
     • AGPLv3
     • Expression Parser: https://github.com/measuresforjustice/expr (Apache 2.0)

  3. Examples of documents Textricator can parse

  4. Parser configuration (YAML)
     Contains:
     • extractor config
     • record definitions
     • type definitions
     • FSM state definitions
     • FSM transition conditions

  5. Web GUI https://textricator-demo.mfj.io

  6. Command Line Interface

  7. Command Line Interface

  8. Advanced Features
     • Match on color/background color
     • Regex matching
     • Page number matching (first and last page are often different)
     • Output data match/replace
     • Variables
     • Complex data types
     • JSON output

  9. Possible Future Features
     • GUI to build the parser configuration
     • doc / docx parsing: We have received Word file reports. We converted them to PDF and then parsed them with Textricator.
     • Spreadsheet parsing: We have received Excel file reports (for which we wrote report-specific parsers) that could probably be easily parsed by the Textricator FSM if column/row were mapped to ulx/uly.
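
The spreadsheet idea on slide 9, mapping column/row to ulx/uly so that a finite-state machine (FSM) can walk the cells, can be sketched in a few lines. The Python below is an illustration only: it is not Textricator code and does not use Textricator's YAML configuration format. The `Segment` class, the `CONDITIONS` and `STATES` tables, the column width and row height, and the sample rows are all hypothetical names and values chosen for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    """One piece of positioned text, loosely modeled on a PDF text box (hypothetical)."""
    text: str
    ulx: float  # upper-left x coordinate
    uly: float  # upper-left y coordinate
    page: int = 1


def cells_to_segments(rows, col_width=100.0, row_height=12.0):
    """Map spreadsheet cells to segments: column index -> ulx, row index -> uly."""
    segments = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if value not in (None, ""):  # skip empty cells
                segments.append(Segment(str(value), c * col_width, r * row_height))
    return segments


# Hypothetical transition conditions: name -> predicate over a segment.
CONDITIONS = {
    "in_name_col": lambda s: s.ulx == 0.0,
    "in_charge_col": lambda s: s.ulx == 100.0,
    "is_date": lambda s: re.fullmatch(r"\d{4}-\d{2}-\d{2}", s.text) is not None,
}

# Hypothetical state definitions: which field a state fills, whether it starts
# a new record, and the (condition, next state) pairs tried in order.
STATES = {
    "START": {"transitions": [("in_name_col", "NAME")]},
    "NAME": {"field": "name", "starts_record": True,
             "transitions": [("in_charge_col", "CHARGE")]},
    "CHARGE": {"field": "charge", "transitions": [("is_date", "DATE")]},
    "DATE": {"field": "date", "transitions": [("in_name_col", "NAME")]},
}


def run_fsm(segments, states=STATES, conditions=CONDITIONS, initial="START"):
    """Walk segments in reading order, following the first matching transition."""
    records, current, state = [], None, initial
    for seg in sorted(segments, key=lambda s: (s.page, s.uly, s.ulx)):
        for cond_name, next_state in states[state]["transitions"]:
            if conditions[cond_name](seg):
                state = next_state
                break
        else:
            raise ValueError(f"no transition from {state} for {seg!r}")
        spec = states[state]
        if spec.get("starts_record"):
            current = {}
            records.append(current)
        current[spec["field"]] = seg.text
    return records


# Made-up sample rows standing in for a spreadsheet report.
rows = [
    ["Doe, Jane", "Trespass", "2019-03-01"],
    ["Roe, Richard", "Theft", "2019-04-15"],
]
records = run_fsm(cells_to_segments(rows))
```

With the sample rows, `records` holds two dictionaries, each with `name`, `charge`, and `date` keys. Because the FSM only sees positioned text, the same state and condition tables would in principle apply whether the segments came from a PDF page or from spreadsheet cells, which is the point the slide makes.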
