ipxpdf
This program extracts elements - mainly text - from PDF files.
It uses a parameter file in tables/setup. This can be selected by the DF
FipHdr field and defaults to XPDF.FIP
Keywords for the parameter file are :
; comment line
newname: (FipSeq) name of the output file default: the same as input
eg replace:Q1 SN .pdf=""
newname:\Q1.article.\$v.txt
supercede: yes/no Overwrite the output if it exists default: do not
outque: output queue for the new file. default: spool/2go or -o switch
doneque: done queue for the old file. default: none or the -d switch
infoque: queue for the hints/stats/info file. default: none
checkque: if a PDF errors - not readable, no pages etc - put the input file in
this folder, so it can be reviewed/checked manually
default: none
extra-fiphdr: more fixed Fip Hdr fields to add to the file (before any new
matched additions)
default: none
extra-fiphdr-file: (File in tables/setup) default: none
Include the contents of this file in the FipHdr
script: script to run against the New file. default: none
eg ; clean up some of the crap... full path/filename is added to end
script:/fip/bin/ipxchg -D xpdf_clean -1
want-data: yes/no/pdf
Rip apart the text from the PDF ?
want-data:yes - flag metadata and render PDF to text (default)
want-data:no - flag metadata only and ignore all data
want-data:pdf - flag metadata and preserve the PDF as data
The default is YES for text (not PDF) but the default may be changed by the
-D input switch
use-sx:
or use-external-file:
if there is an SX FipHdr field with a path to the data file, use that rather
than the data in the input file.
fiphdr-for-page-width: (2 letter FipHdr field)
Put the Page Width (Media or Crop Box) in this FipHdr zone - default:
ignored.
fiphdr-for-page-height: (2 letter FipHdr field)
Put the Page Height (Media or Crop Box) in this FipHdr zone - default:
ignored.
fiphdr-for-pdf-version: (2 letter FipHdr field)
Put the PDF version in this FipHdr zone - default: ignored.
fiphdr-for-page-total: (2 letter FipHdr field)
Put the number of pages in the document in this FipHdr zone - default: ignored.
fiphdr-for-docinfo-total: (2 letter FipHdr field)
Put the number of doc info elements in the document in this FipHdr zone -
default: ignored.
fiphdr-for-docinfo: (2 letter FipHdr field)
Put ALL the doc info elements in the document in this FipHdr zone - default:
ignored.
They are separated by a pipe : eg
AB:Producer-DynaPDF 2.5.4.557|Creator-Asura Version 9.6 (SR 3)|OneVisionQueueName-Q229_WORKFLOW_2_PAIRSCORCERER|Title-HA-A-LEI-15-08-13-p012.eps|OneVisionDongleID-_9WXs9sImmNuhtq9|OneVisionCreationDate-D:20130813184434+01'00'|OneVisionProducer-OneVision PDFengine (Windows Build 21.066.S)|OneVisionCreator-Asura Version 9.6 (SR 3)|Author-asuraadmin|
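A downstream script may want the doc info elements as separate values. The sketch below is one possible way to split the field, not part of ipxpdf. It assumes the FIRST hyphen in each element separates the name from the value (values such as the Title contain hyphens themselves) and that no value contains a pipe. The function name parse_docinfo is made up for this example.

```python
def parse_docinfo(field_value):
    """Split a pipe-separated docinfo string into a dict of name -> value."""
    info = {}
    for element in field_value.split("|"):
        if not element:
            continue  # skip the empty piece left by the trailing pipe
        # Assumption: the first hyphen separates name from value.
        name, _, value = element.partition("-")
        info[name] = value
    return info


# Shortened version of the example field above (without the 'AB:' prefix).
sample = "Producer-DynaPDF 2.5.4.557|Title-HA-A-LEI-15-08-13-p012.eps|Author-asuraadmin|"
docinfo = parse_docinfo(sample)
print(docinfo["Title"])
```

If two elements share the same name, the later one overwrites the earlier one in this sketch.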
log-line: extra logging information for the Fip log default: none
Logging is done at the end of each page
EN is filename
EP is path
S1 MAY be the size
S2 is the page number within the pages generated from this input file
show-changes:yes/all/no or a series of entries
Show point size and font information inline default: no
tags such as <font.Arial> <ptsize.8.04> are added
no - display nothing (default)
all - display all style changes
font - display font changes
ptsize - display pointsize changes
x - display x posn of line from left
y - display y posn of line from bottom
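The inline tags can be separated from the text again by a later step. The following is a rough sketch only, assuming every tag has the form <name.value> as in the <font.Arial> and <ptsize.8.04> examples. The layout of the x and y tags is not shown here, so they are deliberately left out. The name split_tags is invented for this example.

```python
import re

# Assumption: tags look like <font.Name> or <ptsize.Number>, nothing else.
TAG = re.compile(r"<(font|ptsize)\.([^<>]+)>")


def split_tags(line):
    """Return (plain_text, [(tag_name, value), ...]) for one output line."""
    tags = TAG.findall(line)   # list of (name, value) tuples, in order
    text = TAG.sub("", line)   # the same line with the tags removed
    return text, tags


text, tags = split_tags("<font.Arial><ptsize.8.04>Hello world")
print(text)
print(tags)
```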
add-space-x:NO or (number of chrs)
added-space-chr: (FipSeq single chr) default: SPC
Where 'show-changes:no' or NOT displaying the 'x' position,
add a space between blocks of text if the gap between them is >= (ptsize *
add-space-x)
This number can be smaller than '1'. default: 1.0
(ie the start of the next block (on the right of the line) is more than a
single chr width from the end of the last block)
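The add-space-x rule can be pictured as follows. This is an illustration of the stated threshold (gap >= ptsize * add-space-x) under simple assumptions, not the program's own code: each block is a hypothetical (start_x, end_x, text) tuple in points, already sorted left to right, and one point size applies to the whole line.

```python
def needs_space(prev_end_x, next_start_x, ptsize, add_space_x=1.0):
    """True if the gap between two blocks reaches ptsize * add_space_x."""
    gap = next_start_x - prev_end_x
    return gap >= ptsize * add_space_x


def join_blocks(blocks, ptsize, add_space_x=1.0, space_chr=" "):
    """Join blocks on one line, adding space_chr where the gap is wide enough."""
    out = []
    prev_end = None
    for start_x, end_x, text in blocks:
        if prev_end is not None and needs_space(prev_end, start_x, ptsize, add_space_x):
            out.append(space_chr)
        out.append(text)
        prev_end = end_x
    return "".join(out)


line = [(10, 40, "Total"), (52, 70, "12.50"), (70.5, 80, "p")]
print(join_blocks(line, ptsize=10))                    # gaps of 12 and 0.5 points
print(join_blocks(line, ptsize=10, add_space_x=0.04))  # lower threshold
```

With the default of 1.0 only the 12 point gap gets a space. With 0.04 the threshold drops to 0.4 points, so the 0.5 point gap gets one too.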
max-body-ptsize: (number in points) default: 15.0
gutter-x: (number in points) default: 6
when reading DOWN, what is the approx gutter between columns
min-col-x: (number in points) default: 90.0
when reading DOWN, what is the approx col width for grouping elements
read-direction:down/across
Is the text in multiple columns across the page ? default: down
If so, should the columns be read DOWN - like a magazine page
or should the columns be read ACROSS - like a spreadsheet
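The difference between the two directions can be shown with a small ordering sketch. It is NOT the grouping ipxpdf performs. It assumes each block is a hypothetical (x, y, text) tuple with y measured from the bottom of the page, and it places blocks in columns by simply dividing x by a fixed column width (90 points is borrowed from the min-col-x default purely for illustration).

```python
def order_blocks(blocks, direction="down", col_width=90.0):
    """Return the block texts in reading order for the given direction."""
    if direction == "down":
        # Column by column; inside a column from the top (largest y) down.
        key = lambda b: (int(b[0] // col_width), -b[1])
    else:
        # Row by row from the top; inside a row from left to right.
        key = lambda b: (-b[1], b[0])
    return [b[2] for b in sorted(blocks, key=key)]


page = [(100, 700, "B1"), (10, 700, "A1"), (10, 680, "A2"), (100, 680, "B2")]
print(order_blocks(page, "down"))    # magazine style
print(order_blocks(page, "across"))  # spreadsheet style
```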
output-single-file:no/yes default: no
if yes, ignore the PDF page end and continue to write in the same output file
group-furniture:yes/no
Group all Furniture items at the top of the output file default: no
Furniture items are flagged by font - see below
group-headings:yes/no
Group all Headings at the top of the output file default: no
Headings are flagged by font - see below
symbol-font-default-char:*
symbol-font-char:l<Bullet>
symbol-font-char:L<bullet>
If the font is flagged as a Symbol font (internal PDF setting), map the data
to these strings.
font:(Name) type:(type) min:(minPtSize) max:(maxPtSize)
type can be body, head, caption or furniture
font:IdentikalSansRegular type:body
font:AGBook-Stencil type:head min:14
font:IdentikalSansBold type:head
font:DIN-Regular type:caption
font:MagistralA type:furniture
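To show how font lines like these might be applied, here is a small sketch. It rests on assumptions that are not confirmed above: min and max are treated as inclusive limits, the first matching line wins, and font names contain no spaces. The names parse_font_line and classify are invented for this example.

```python
def parse_font_line(line):
    """Parse e.g. 'font:AGBook-Stencil type:head min:14' into a dict."""
    rule = {"min": None, "max": None}
    for token in line.split():
        key, _, value = token.partition(":")
        if key in ("min", "max"):
            rule[key] = float(value)  # point sizes may be fractional
        else:
            rule[key] = value
    return rule


def classify(font, ptsize, rules):
    """Return the type of the first rule matching font and point size."""
    for rule in rules:
        if rule["font"] != font:
            continue
        if rule["min"] is not None and ptsize < rule["min"]:
            continue
        if rule["max"] is not None and ptsize > rule["max"]:
            continue
        return rule["type"]
    return None  # font not flagged


rules = [parse_font_line(l) for l in (
    "font:IdentikalSansRegular type:body",
    "font:AGBook-Stencil type:head min:14",
    "font:DIN-Regular type:caption",
)]
print(classify("AGBook-Stencil", 18, rules))
print(classify("AGBook-Stencil", 9, rules))
```

In this sketch AGBook-Stencil at 18 points is a head, while the same font at 9 points falls below min:14 and is left unflagged.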
force-single-caps: yes/no
By default single letters are forced uppercase
Use this when massive letterspacing produces lots of single letters !
Where sections of FipHdr fields are required or changes to the output style,
use keywords : fixed, partial, combie, optional, repeat, newdate and/or style.
(see The SysAdmin manual for more information).
They are normally specified :
fixed:QZ 1234543
partial:QT ST,3,2,U,<,>
combie:QY ep|na,(0000000)a
option:QE ep,11,7,s
repeat:QK XK,-,3
or repeat:QP PK,,4,#X
style:QS XN,%.03d
The FipHdr of the incoming file can also be used to change the following :
PDF_FIPHDR:(yes/no)
Add/Don't add the FipHdr to the output file default: add
PDF_OUTQUE:(FipSeq)
Output folder to override the -o Input switch default: /fip/spool/2go
Input Parameters are (all optional) :
Either
-1 : path/filename for single shot default: spooled
The input file is NOT deleted
If this does NOT start with a '/', it is assumed relative to the current
path.
Or
-i : input queue default: spool/xpdf
If this does NOT start with a '/', it is assumed under spool.
-d : done folder for the input file in FipSeq default: none
If this does NOT start with a '/', it is assumed under spool.
-D : default for want-data -D no -D yes or -D pdf default: -D yes for text
-L : do NOT log files in default: log
-o : output queue default: spool/2go
If this does NOT start with a '/', it is assumed under spool.
-w : file wait for files arriving across a network. default: no wait
-z : default parameter file in tables/setup default: tables/setup/XPDF.FIP
-v : print version number and exit
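The rule for the -i, -d and -o switches (a name that does not start with a '/' is taken to be under spool) can be sketched as below. The spool root /fip/spool is an assumption taken from the /fip/spool/2go default mentioned for PDF_OUTQUE, and resolve_queue is a made-up helper, not part of ipxpdf. Note that -1 differs: a relative name there is relative to the current path.

```python
import posixpath

SPOOL_ROOT = "/fip/spool"  # assumption, based on the /fip/spool/2go default


def resolve_queue(name, spool_root=SPOOL_ROOT):
    """Return the full folder for a queue name given to -i, -d or -o."""
    if name.startswith("/"):
        return name  # already a full path, use as given
    return posixpath.join(spool_root, name)


print(resolve_queue("xpdf"))
print(resolve_queue("/data/incoming/pdf"))
```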
---NOTES---
Version Control
;0z 10feb10 original version
;p 11sep13 added want-data and the 6 fiphdr-for..
;q-r 2nov13 added read-direction and output-single-file
;s-t 9apr14 added force-single-caps
;u-y 14may14 allow unlimited number of lines and added add-space-x
;z 15feb17 dyna v4
(copyright) 2025 and previous years FingerPost Ltd.