ipxpdf (Sat Oct 25 2014 01:31:01)

ipxpdf

This program extracts elements - mainly text - from PDF files.

It uses a parameter file in tables/setup. This can be selected from the DF
FipHdr field and defaults to XPDF.FIP

Keywords for the parameter file are :
	; comment line
	newname: (FipSeq)	name of the output file		default: the same as input
		eg	replace:Q1	SN	.pdf=""
			newname:Q1.article.$v.txt
	supercede: yes/no Overwrite the output if it exists	default: do not

	outque:	output queue for the new file.			default: spool/2go or -o switch
	doneque: done queue for the old file.			default: none or the -d switch
	infoque: queue for the hints/stats/info  file.		default: none
	checkque: if a PDF errors - not readable, no pages etc - put the input file in
		this folder so it can be reviewed/checked manually	default: none

	extra-fiphdr:	more fixed FipHdr fields to add to the file (before any new
		matched additions)				default: none
	extra-fiphdr-file: (File in tables/setup)		default: none
		Include the contents of this file in the FipHdr
	script:	script to run against the New file.		default: none
		eg	; clean up some of the crap... full path/filename is added to end
			script:/fip/bin/ipxchg -D xpdf_clean -1

	want-data: yes/no/pdf
		Rip apart the text from the PDF ?
		want-data:yes 	- flag metadata and render PDF to text (default)
		want-data:no 	- flag metadata only and ignore all data
		want-data:pdf 	- flag metadata and preserve the PDF as data
		The default is YES for text (not PDF) but the default may be changed by the
		-D input switch
	use-sx:
or	use-external-file:
		If there is an SX FipHdr field with a path to the data file, use that rather
		than the data in the input file.

	fiphdr-for-page-width: (2 letter FipHdr field)
		Put the Page Width (Media or Crop Box) in this FipHdr zone - default: ignored.
	fiphdr-for-page-height: (2 letter FipHdr field)
		Put the Page Height (Media or Crop Box) in this FipHdr zone - default: ignored.
	fiphdr-for-pdf-version: (2 letter FipHdr field)
		Put the PDF version in this FipHdr zone - default: ignored.
	fiphdr-for-page-total: (2 letter FipHdr field)
		Put the number of pages in the document in this FipHdr zone - default: ignored.
	fiphdr-for-docinfo-total: (2 letter FipHdr field)
		Put the number of doc info elements in the document in this FipHdr zone -
		default: ignored.
	fiphdr-for-docinfo: (2 letter FipHdr field)
		Put ALL the doc info elements in the document in this FipHdr zone -
		default: ignored.
		They are separated by a pipe, eg (a single FipHdr line):
		AB:Producer-DynaPDF 2.5.4.557|Creator-Asura Version 9.6 (SR 3)|OneVisionQueueName-Q229_WORKFLOW_2_PAIRSCORCERER|Title-HA-A-LEI-15-08-13-p012.eps|OneVisionDongleID-_9WXs9sImmNuhtq9|OneVisionCreationDate-D:20130813184434+01'00'|OneVisionProducer-OneVision PDFengine (Windows Build 21.066.S)|OneVisionCreator-Asura Version 9.6 (SR 3)|Author-asuraadmin|
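
A downstream script may want the doc info elements as separate values. The
following is only a sketch in Python, not part of ipxpdf: it assumes the value
of the FipHdr field (the text after 'AB:' in the example) is already in hand,
and that each element's name ends at the FIRST hyphen, since values such as
the Title can contain hyphens themselves. The function name parse_docinfo is
hypothetical.

```python
def parse_docinfo(field_value):
    """Split a pipe-separated doc info string into a dict of name -> value."""
    info = {}
    for element in field_value.split("|"):
        if not element:
            continue  # the trailing pipe leaves an empty final element
        # Assumption: the name stops at the first hyphen; the value keeps the rest
        name, _, value = element.partition("-")
        info[name] = value
    return info


sample = ("Producer-DynaPDF 2.5.4.557|"
          "Title-HA-A-LEI-15-08-13-p012.eps|"
          "Author-asuraadmin|")
info = parse_docinfo(sample)
print(info["Title"])
```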

	log-line: extra logging information for the Fip log	default: none
		Logging is done at the end of each page
			EN is filename
			EP is path
			S1 MAY be the size
			S2 is the page number within the pages generated from this input file
	show-changes:yes/all/no or a series of entries
		Show point size and font information inline	default: no
		tags such as <font.Arial> <ptsize.8.04> are added
			no	- display nothing (default)
			all 	- display all style changes
			font	- display font changes
			ptsize	- display pointsize changes
			x	- display x posn of line from left
			y	- display y posn of line from bottom
	add-space-x:NO or (number of chrs)
	added-space-chr: (FipSeq single chr)		default: SPC
		Where 'show-changes:no' or NOT displaying the 'x' position,
		add a space between blocks of text if the gap between them is
		>= (ptsize * add-space-x)
		This number can be smaller than '1'.	default: 1.0
		(ie with the default, the start of the next block (on the right of the line)
		is at least a single chr width from the end of the last block)
	max-body-ptsize: (number in points)		default: 15.0
	gutter-x: (number in points)			default: 6
		when reading DOWN, what is the approx gutter between columns
	min-col-x: (number in points)			default: 90.0
		when reading DOWN, what is the approx col width for grouping elements
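
The add-space-x test described above can be sketched in Python as follows.
This is an illustration of the stated rule only (gap >= ptsize * add-space-x),
not the program's actual code. The block layout (start x, end x, text), the
use of points for x positions and the function names are all assumptions made
for the example.

```python
def needs_space(prev_end_x, next_start_x, ptsize, add_space_x=1.0):
    """True if the gap between two blocks is >= (ptsize * add-space-x)."""
    gap = next_start_x - prev_end_x
    return gap >= ptsize * add_space_x


def join_blocks(blocks, ptsize, add_space_x=1.0, space_chr=" "):
    """Join (start_x, end_x, text) blocks of one line, ordered left to right.

    space_chr stands in for added-space-chr (default SPC).
    """
    out = []
    prev_end = None
    for start_x, end_x, text in blocks:
        if prev_end is not None and needs_space(prev_end, start_x, ptsize, add_space_x):
            out.append(space_chr)
        out.append(text)
        prev_end = end_x
    return "".join(out)


# Hypothetical 8pt line: a 9pt gap before "world", a 1pt gap before "!"
line = [(10.0, 40.0, "Hello"), (49.0, 80.0, "world"), (81.0, 90.0, "!")]
print(join_blocks(line, ptsize=8.0))
```

Raising add-space-x makes the test stricter, so fewer spaces are added; a
value below 1 adds spaces for narrower gaps.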

	read-direction:down/across
		Is the text in multiple columns across the page ?	default: down
		If so,	   should the columns be read DOWN - like a magazine page
			or should the columns be read ACROSS - like a spreadsheet
	output-single-file:no/yes				default: no
		if yes, ignore the PDF page end and continue to write in the same output file

	group-furniture:yes/no
		Group all Furniture items at the top of the output file	default: no
		Furniture items are flagged by font - see below
	group-headings:yes/no
		Group all Headings at the top of the output file	default: no
		Headings are flagged by font - see below

	symbol-font-default-char:*
	symbol-font-char:l<Bullet>
	symbol-font-char:L<bullet>
		If the font is flagged as a Symbol font (internal PDF setting), map the data
to these strings.

	font:(Name)	type:(type)	min:(minPtSize) max:(maxPtSize)
		type can be body, head, caption or furniture

		font:IdentikalSansRegular type:body
		font:AGBook-Stencil	type:head	min:14
		font:IdentikalSansBold  type:head
		font:DIN-Regular	type:caption
		font:MagistralA	 type:furniture
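
As an illustration of how font lines like those above might be read and
applied, here is a sketch in Python. It is not how ipxpdf itself is
implemented. It assumes that font names contain no spaces, that min and max
are inclusive limits in points, that the first matching line wins, and that a
font with no matching line is left unflagged. The names parse_font_line and
classify are hypothetical.

```python
def parse_font_line(line):
    """Turn 'font:Name type:head min:14' into a dict; min/max default to None."""
    rule = {"min": None, "max": None}
    for token in line.split():  # fields are separated by spaces or tabs
        key, _, value = token.partition(":")
        if key in ("min", "max"):
            rule[key] = float(value)
        else:
            rule[key] = value
    return rule


def classify(rules, font, ptsize):
    """Return the type of the first rule matching font and point size, else None."""
    for rule in rules:
        if rule["font"] != font:
            continue
        if rule["min"] is not None and ptsize < rule["min"]:
            continue
        if rule["max"] is not None and ptsize > rule["max"]:
            continue
        return rule["type"]
    return None


rules = [parse_font_line(text) for text in (
    "font:IdentikalSansRegular type:body",
    "font:AGBook-Stencil\ttype:head\tmin:14",
    "font:MagistralA\t type:furniture",
)]
print(classify(rules, "AGBook-Stencil", 18.0))
```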

	force-single-caps: yes/no
		By default single letters are forced uppercase
		Use this when massive letterspacing produces lots of single letters !

Where sections of FipHdr fields or changes to the output style are required,
use keywords : fixed, partial, combie, optional, repeat, newdate and/or style.
(see the SysAdmin manual for more information).

	They are normally specified :
		fixed:QZ	1234543
		partial:QT	ST,3,2,U,<,>
		combie:QY	ep|na,(0000000)a
		option:QE	ep,11,7,s
		repeat:QK	XK,-,3
	or	repeat:QP	PK,,4,#X
		style:QS	XN,%.03d

The FipHdr of the incoming file can also be used to change the following :
	PDF_FIPHDR:(yes/no)
		Add/Don't add the FipHdr to the output file	default: add
	PDF_OUTQUE:(FipSeq)
		Output folder to override the -o Input switch	default: /fip/spool/2go

Input Parameters are (all optional) :
Either
	-1 : path/filename for single shot		default: spooled
		The input file is NOT deleted
		If this does NOT start with a '/', it is assumed relative to the current
		path.
Or
	-i : input queue					default: spool/xpdf
		If this does NOT start with a '/', it is assumed under spool.
	-d : done folder for the input file in FipSeq		default: none
		If this does NOT start with a '/', it is assumed under spool.
	-D : default for want-data -D no -D yes or -D pdf	default: -D yes for text
	-L : do NOT log files in				default: log
	-o : output queue					default: spool/2go
		If this does NOT start with a '/', it is assumed under spool.
	-w : file wait for files arriving across a network.	default: no wait
	-z : default parameter file in tables/setup		default: tables/setup/XPDF.FIP
	-v : print version number and exit
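
The rule used by -i, -d and -o (a name that does not start with a '/' is taken
to be under spool) can be sketched in Python as below. This is illustrative
only. The spool root /fip/spool is an assumption inferred from the
/fip/spool/2go default quoted for PDF_OUTQUE, and resolve_queue is a
hypothetical name.

```python
import posixpath

SPOOL_ROOT = "/fip/spool"  # assumption: inferred from the /fip/spool/2go default


def resolve_queue(name, spool_root=SPOOL_ROOT):
    """Return the full path of a queue given as for -i, -d or -o."""
    if name.startswith("/"):
        return name  # already a full path, used as given
    return posixpath.join(spool_root, name)  # otherwise assumed under spool


print(resolve_queue("xpdf"))
print(resolve_queue("/data/pdf/in"))
```

Note that -1 differs: a relative name there is taken from the current path,
not from spool.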

---NOTES---

Version Control
;000y	10feb10 original version
	;p 11sep13 added want-data and the 6 fiphdr-for..
	;q-r 2nov13 added read-direction and single-output-file
	;s-t 9apr14 added force-single-caps
	;u-y 14may14 allow unlimited number of lines and added add-space-x

(copyright) 2014 and previous years FingerPost Ltd.