ipxpdf

This program extracts elements - mainly text - from PDF files.

It uses a parameter file in tables/setup. This can be selected from the DF
FipHdr field and defaults to XPDF.FIP

Keywords for the parameter file are :
    ; comment line
    newname: (FipSeq)   name of the output file     default: the same as input
        eg  replace:Q1  SN  .pdf=""
            newname:\Q1.article.\$v.txt
    supercede: yes/no   Overwrite the output if it exists   default: do not overwrite

    outque: output queue for the new file.          default: spool/2go or -o switch
    doneque: done queue for the old file.           default: none or the -d switch
    infoque: queue for the hints/stats/info  file.      default: none
    checkque: if a PDF errors - not readable, no pages etc - put the input file in
        this folder, so it can be reviewed/checked manually
                                default: none

    extra-fiphdr:   more fixed FipHdr fields to add to the file (before any new
        matched additions)
                                default: none
    extra-fiphdr-file: (File in tables/setup)       default: none
        Include the contents of this file in the FipHdr
    script: script to run against the New file.     default: none
        eg  ; tidy up the output... the full path/filename is added to the end
            script:/fip/bin/ipxchg -D xpdf_clean -1

    want-data: yes/no/pdf
        Rip apart the text from the PDF ?
        want-data:yes   - flag metadata and render PDF to text (default)
        want-data:no    - flag metadata only and ignore all data
        want-data:pdf   - flag metadata and preserve the PDF as data
        The default is YES for text (not PDF) but the default may be changed by the
        -D input switch
    use-sx:
or  use-external-file:
        If there is an SX FipHdr field with a path to the data file, use that rather
        than the data in the input file.

    fiphdr-for-page-width: (2 letter FipHdr field)
        Put the Page Width (Media or Crop Box) in this FipHdr zone - default: ignored.
    fiphdr-for-page-height: (2 letter FipHdr field)
        Put the Page Height (Media or Crop Box) in this FipHdr zone - default: ignored.
    fiphdr-for-pdf-version: (2 letter FipHdr field)
        Put the PDF version in this FipHdr zone - default: ignored.
    fiphdr-for-page-total: (2 letter FipHdr field)
        Put the number of pages in the document in this FipHdr zone - default: ignored.
    fiphdr-for-docinfo-total: (2 letter FipHdr field)
        Put the number of doc info elements in the document in this FipHdr zone -
        default: ignored.
    fiphdr-for-docinfo: (2 letter FipHdr field)
        Put ALL the doc info elements in the document in this FipHdr zone -
        default: ignored.
        They are separated by a pipe : eg
        AB:Producer-DynaPDF 2.5.4.557|Creator-Asura Version 9.6 (SR 3)|OneVisionQueueName-Q229_WORKFLOW_2_PAIRSCORCERER|Title-HA-A-LEI-15-08-13-p012.eps|OneVisionDongleID-_9WXs9sImmNuhtq9|OneVisionCreationDate-D:20130813184434+01'00'|OneVisionProducer-OneVision PDFengine (Windows Build 21.066.S)|OneVisionCreator-Asura Version 9.6 (SR 3)|Author-asuraadmin|
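
As an illustration only, a downstream script could split a fiphdr-for-docinfo
value back into its elements. The Python sketch below is not part of ipxpdf:
the function name parse_docinfo is hypothetical, and it assumes each element
is Name-Value with the name ending at the first hyphen, as the example above
suggests.

```python
def parse_docinfo(value):
    """Split a pipe-separated docinfo string into a dict of name -> value.

    Assumption: each element looks like Name-Value and the name itself
    contains no hyphen, so only the first hyphen is treated as the separator.
    """
    info = {}
    for element in value.split("|"):
        if not element:
            continue  # the trailing pipe leaves an empty last element
        name, _, data = element.partition("-")
        info[name] = data
    return info


# Shortened sample taken from the example value (without the AB: field name).
sample = "Producer-DynaPDF 2.5.4.557|Title-HA-A-LEI-15-08-13-p012.eps|Author-asuraadmin|"
docinfo = parse_docinfo(sample)
print(docinfo["Title"])   # HA-A-LEI-15-08-13-p012.eps
```

Splitting on the first hyphen only keeps hyphens inside a value (such as the
Title above) intact.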

    log-line: extra logging information for the Fip log default: none
        Logging is done at the end of each page
            EN is filename
            EP is path
            S1 MAY be the size
            S2 is the page number within the pages generated from this input file
    show-changes:yes/all/no or a series of entries
        Show point size and font information inline default: no
        tags such as <font.Arial> <ptsize.8.04> are added
            no  - display nothing (default)
            all     - display all style changes
            font    - display font changes
            ptsize  - display pointsize changes
            x   - display x posn of line from left
            y   - display y posn of line from bottom
    add-space-x:NO or (number of chrs)
    added-space-chr: (FipSeq single chr)        default: SPC
        Where 'show-changes:no' or NOT displaying the 'x' position,
        add a space between blocks of text if the gap between them is >=
        (ptsize * add-space-x)
        This number can be smaller than '1'.   default: 1.0
        (ie the start of the next block (on the right of the line) is more than a
        single chr width from the end of the last block)
    max-body-ptsize: (number in points)     default: 15.0
    gutter-x: (number in points)            default: 6
        when reading DOWN, what is the approx gutter between columns
    min-col-x: (number in points)           default: 90.0
        when reading DOWN, what is the approx col width for grouping elements
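
The add-space-x rule above can be sketched in Python. This is only an
illustration of the documented comparison (gap >= ptsize * add-space-x), not
the program's actual code; the function name, the tuple layout and the sample
coordinates are hypothetical.

```python
def join_blocks(blocks, ptsize, add_space_x=1.0, space_chr=" "):
    """Join text blocks on one line, adding space_chr where the gap is wide.

    blocks is a list of (x_start, x_end, text) tuples in points, sorted left
    to right. A separator is added when the gap between two blocks is
    >= ptsize * add_space_x, mirroring the rule described for add-space-x.
    """
    out = []
    prev_end = None
    for x_start, x_end, text in blocks:
        if prev_end is not None and (x_start - prev_end) >= ptsize * add_space_x:
            out.append(space_chr)
        out.append(text)
        prev_end = x_end
    return "".join(out)


line = [(72.0, 100.0, "Total"), (110.0, 130.0, "42"), (131.0, 140.0, "%")]
print(join_blocks(line, ptsize=8.0))                   # Total 42%
print(join_blocks(line, ptsize=8.0, add_space_x=0.1))  # Total 42 %
```

With the default of 1.0 only the 10 point gap is wide enough for a space; a
value below 1 makes the 1 point gap count as well.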

    read-direction:down/across
        Is the text in multiple columns across the page ?   default: down
        If so,     should the columns be read DOWN - like a magazine page
            or should the columns be read ACROSS - like a spreadsheet
    output-single-file:no/yes               default: no
        if yes, ignore the PDF page end and continue to write in the same output file

    group-furniture:yes/no
        Group all Furniture items at the top of the output file default: no
        Furniture items are flagged by font - see below
    group-headings:yes/no
        Group all Headings at the top of the output file    default: no
        Headings are flagged by font - see below

    symbol-font-default-char:*
    symbol-font-char:l<Bullet>
    symbol-font-char:L<bullet>
        If the font is flagged as a Symbol font (internal PDF setting), map the data
        to these strings.

    font:(Name) type:(type) min:(minPtSize) max:(maxPtSize)
        type can be body, head, caption or furniture

        font:IdentikalSansRegular type:body
        font:AGBook-Stencil type:head   min:14
        font:IdentikalSansBold  type:head
        font:DIN-Regular    type:caption
        font:MagistralA  type:furniture
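
How a font: line might be matched against a run of text can be sketched as
follows. This Python snippet is an illustration built on assumptions, not
ipxpdf's implementation: it assumes font names match exactly, that min and max
are inclusive point-size bounds, and that the first matching rule wins. None
of that is confirmed above, and the names FONT_RULES and classify are
hypothetical.

```python
# Rules transcribed from the font: examples above as (name, type, min, max);
# None means the bound was not given.
FONT_RULES = [
    ("IdentikalSansRegular", "body", None, None),
    ("AGBook-Stencil", "head", 14.0, None),
    ("IdentikalSansBold", "head", None, None),
    ("DIN-Regular", "caption", None, None),
    ("MagistralA", "furniture", None, None),
]


def classify(font_name, ptsize, rules=FONT_RULES):
    """Return the type of the first rule matching font name and point size.

    Assumptions: names match exactly, min/max are inclusive, and None is
    returned when no rule applies.
    """
    for name, kind, min_pt, max_pt in rules:
        if name != font_name:
            continue
        if min_pt is not None and ptsize < min_pt:
            continue
        if max_pt is not None and ptsize > max_pt:
            continue
        return kind
    return None


print(classify("AGBook-Stencil", 18.0))  # head
print(classify("AGBook-Stencil", 9.0))   # None
print(classify("MagistralA", 7.5))       # furniture
```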

    force-single-caps: yes/no
        By default single letters are forced uppercase
        Use this when massive letterspacing produces lots of single letters !

Where sections of FipHdr fields are required, or the output style needs changing,
use the keywords : fixed, partial, combie, optional, repeat, newdate and/or style
(see the SysAdmin manual for more information).

    They are normally specified :
        fixed:QZ    1234543
        partial:QT  ST,3,2,U,<,>
        combie:QY   ep|na,(0000000)a
        option:QE   ep,11,7,s
        repeat:QK   XK,-,3
    or  repeat:QP   PK,,4,#X
        style:QS    XN,%.03d

The FipHdr of the incoming file can also be used to change the following :
    PDF_FIPHDR:(yes/no)
        Add/Don't add the FipHdr to the output file  default: add
    PDF_OUTQUE:(FipSeq)
        Output folder to override the -o Input switch   default: /fip/spool/2go

Input Parameters are (all optional) :
Either
    -1 : path/filename for single shot      default: spooled
        The input file is NOT deleted
        If this does NOT start with a '/', it is assumed relative to the current
        path.
Or
    -i : input queue                    default: spool/xpdf
        If this does NOT start with a '/', it is assumed under spool.
    -d : done folder for the input file in FipSeq       default: none
        If this does NOT start with a '/', it is assumed under spool.
    -D : default for want-data: -D no, -D yes or -D pdf   default: -D yes for text
    -L : do NOT log files in                default: log
    -o : output queue                   default: spool/2go
        If this does NOT start with a '/', it is assumed under spool.
    -w : file wait for files arriving across a network. default: no wait
    -z : default parameter file in tables/setup     default: tables/setup/XPDF.FIP
    -v : print version number and exit
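
For scripted single-shot runs, the switches above can be assembled
programmatically. The Python sketch below only builds the argument list and
does not run anything. The helper name and the example file path are
hypothetical; it assumes ipxpdf is on the PATH and that -z, -D and -o may be
combined with -1, which the list above does not state explicitly.

```python
def ipxpdf_args(pdf_path, param_file=None, want_data=None, outque=None):
    """Build an argument list for a single-shot ipxpdf run using -1.

    Only switches listed above are used: -1, -z, -D and -o.
    """
    if want_data is not None and want_data not in ("yes", "no", "pdf"):
        raise ValueError("want_data must be yes, no or pdf")
    args = ["ipxpdf", "-1", pdf_path]
    if param_file:
        args += ["-z", param_file]   # parameter file in tables/setup
    if want_data:
        args += ["-D", want_data]    # default for want-data
    if outque:
        args += ["-o", outque]       # output queue
    return args


print(ipxpdf_args("/tmp/example.pdf", param_file="XPDF.FIP", want_data="no"))
```

The resulting list could then be handed to subprocess.run, which avoids shell
quoting problems with awkward file names.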

---NOTES---

Version Control
;0z 10feb10 original version
    ;p 11sep13 added want-data and the 6 fiphdr-for..
    ;q-r 2nov13 added read-direction and single-output-file
    ;s-t 9apr14 added force-single-caps
    ;u-y 14may14 allow unlimited number of lines and added add-space-x
    ;z 15feb17 dyna v4

(copyright) 2024 and previous years FingerPost Ltd.