iplookup (Sat Oct 25 2014 01:31:01)

iplookup

This program is used to add FipHdr fields matched against a key field in a
lookup file. The original header and text are left untouched.

The format of the lookup file is :
	(Search String) (separator) (NewField1) [optional sep] [opt new2]...(eoln)
	One item per line.
	Lines starting with a semicolon ';' are considered comments and are ignored.

Examples - a tab delimited file
	ICI		ICI.EU	CHEMICALS	"Imperial Chemical Industries"
	ATT		ATT.US	TELECOMS	"A.T. & T."
or a CSV file as Lotus or Excel might generate :
	"Maserati","Italian","Brillo"
	"TVR","British","Knockout"
or a file of your own making - such as one with pipes as separators :
	747|Boeing|long distance|4 engines|commercial
	300|Airbus|short distance|2 engines|commercial

In each of these cases the program will attempt to match the content of a
specified FipHdr field with the FIRST field on the line.
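
The file format and first-field match described above can be sketched in
Python. This is only an illustration, not iplookup's own code: the function
name 'parse_lookup' is hypothetical, skipping blank lines and letting a later
duplicate key win are assumptions, and quoted fields and case-insensitive
matching are left out.

```python
import re

def parse_lookup(lines, sep=None):
    """Build {key: [new fields]} from lookup-file lines.

    sep=None mimics the documented default of any run of tabs/spaces.
    Quoted fields are NOT handled in this sketch.
    """
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: blank lines are skipped; ';' comment lines are documented.
        if not line.strip() or line.startswith(";"):
            continue
        if sep is None:
            fields = re.split(r"[ \t]+", line.strip())
        else:
            fields = line.split(sep)
        # The FIRST field is the key; the rest are the new fields.
        table[fields[0]] = fields[1:]
    return table

planes = parse_lookup([
    "; aircraft codes",
    "747|Boeing|long distance|4 engines|commercial",
    "300|Airbus|short distance|2 engines|commercial",
], sep="|")
print(planes["747"])
```

A FipHdr field containing '747' would then pick up the four fields after the
key on that line.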

Note that if you have a very big lookup file, processing is vastly speeded up
by sorting it beforehand in ascending order and letting the program hash this
sorted file.
For Unix Sort :				sort -b -o sorted.file orig.file
or if the separator is a Pipe '|' :	sort -b -t'|' -o sorted.file orig.file
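
As a general illustration of why a key column in ascending order helps, the
Python sketch below finds a key by binary search instead of scanning every
line. It does not describe how iplookup hashes the file internally; the name
'find_sorted' is hypothetical, and Python's code-point ordering may differ
from the ordering your locale gives 'sort'.

```python
from bisect import bisect_left

def find_sorted(rows, key):
    """rows: list of field lists, sorted ascending on the first field."""
    keys = [r[0] for r in rows]      # in real use, build this list once
    i = bisect_left(keys, key)       # binary search: O(log n) comparisons
    if i < len(keys) and keys[i] == key:
        return rows[i][1:]
    return None

rows = sorted([
    ["PBL.AU", "AU000000PBL6"],
    ["FXJ.AU", "AU000000FXJ5"],
])
print(find_sorted(rows, "PBL.AU"))
```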

If you change the Lookup file in a version before 04c, please kill and
reactivate the program to get it to read the new version, as those versions read
the lookup file once only and use the copy in memory for processing.

The parameter file describing where the lookup file is for the data is held in
tables/setup and defaults to 'LOOKUP'. This can be overridden by the DY: FipHdr
field. As per normal the name of the parameter file is forced uppercase.

There is also a question of where to send the output file as this, by default,
is put in spool/2go for IPWHEEL to distribute. So it needs a Destination(s) or
DU FipHdr field. This is added by either :
	- If there is a DX FipHdr field in the input file, that is used.
	- If not, the keyword 'dest' is used in the parameter file.
	- If that is not specified either, it is sent to 'woops', the Intercept queue.
	- You may also specify it from the incoming data or attribute-data using the
	  'fixhdr' keyword.
	  In this case the contents of DX, 'dest' or 'woops' will be the default if
	  there is no data.
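
The order of precedence above can be written as a short Python sketch. The
function and argument names are hypothetical; 'fixhdr_value' stands for
whatever the 'fixhdr' keyword extracted from the incoming data, which is an
assumption about how that value arrives.

```python
def pick_destination(fiphdr, params, fixhdr_value=""):
    """fiphdr: input FipHdr fields; params: parameter-file keywords."""
    if fixhdr_value:              # data found via 'fixhdr' wins (assumed)
        return fixhdr_value
    if fiphdr.get("DX"):          # DX field in the input file
        return fiphdr["DX"]
    if params.get("dest"):        # 'dest' keyword in the parameter file
        return params["dest"]
    return "woops"                # the Intercept queue

print(pick_destination({}, {"dest": "desk2"}))
```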

If using the Reuters MetaData Repository switch, the lookup (but NOT template)
files are ignored and the data is added to the output file directly (minus the
NewsML tags).

Keywords for LOOKUP parameter file are :
Mandatory:
	; comment line
	lookup: file containing codes to match and any additional fields.
		eg lookup:/data1/MATCHCODES
	match:(existing FipHdr field)	Match FipHdr field with lookup table entry
		Optional subKeywords
			newfld:(Fip1, Fip2, Fip3)	one or more 2 letter FipHdr Codes
					The additional fields on the lookup lines will be allocated to these
					FipHdr names.
					By default these are L0 for the first field, L1 the second etc
					eg if the lookup file has 5 fields - the first being the match.
						newfld:AB,FF,AC,L6
						will stuff field 2 in AB, 3 in FF, 4 in AC and 5 into L6
			default:	Default value if NOT found (in FipSeq)
					There is NO default default - the field is ignored.
		Up to 50 hdr fields may be matched.
	There is also a 'repeat-match' keyword which allows a single field containing
	multiple items to be broken automatically into zones and EACH zone is matched
	in turn.
	(see below)
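
The 'newfld' allocation can be pictured with the Python sketch below. It is an
illustration only: 'allocate' is a hypothetical name, and it assumes that L0
refers to the first field AFTER the match key and that any position without a
name in 'newfld' falls back to its L-number.

```python
def allocate(new_values, newfld=None):
    """Map the fields after the key onto FipHdr names."""
    names = [n.strip() for n in newfld.split(",")] if newfld else []
    out = {}
    for i, value in enumerate(new_values):
        # Assumption: positions without a newfld name use L0, L1, ...
        name = names[i] if i < len(names) else f"L{i}"
        out[name] = value
    return out

# Lookup line with 5 fields, the first being the match key.
line = "747|Boeing|long distance|4 engines|commercial"
key, *rest = line.split("|")
print(allocate(rest, "AB,FF,AC,L6"))
```

With newfld:AB,FF,AC,L6 this puts field 2 in AB, 3 in FF, 4 in AC and 5 in L6,
as in the example above.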

Optional:
	sorted	The match field (first field) of the Lookup File
			is in the correct sort order.
			default:no
	sep:	single chr Field separator in the Lookup file
				This defaults to any run of Tabs/Spcs
	casesens:y	Match Fields are Case sensitive - normally NOT.
	fmt:csv		Comma separated	lookup file - ie strip double quotes.
			This does NOT affect the Separator which should be set if NOT tab/spcs.
			default is Space/Tab separated
	outque:	output queue for the new file.
	dest:	Destination (as in sys/USERS)
	extra-fiphdr:
or	fixhdr:	more fixed Fip Hdr fields to add to the file (before any new matched
		additions)
	script:	script to run against the New file.		default: none
	newname: name of the output file.
	template: Use the template and fill in with the new FipHdr values.
	repeat-sep:	Separator for repeat-match fields	default: '+'
	log-line: Substitute log line.

	REUTERS-HEADLINE:KK,M9
	REUTERS-SLUGLINE:MD,M8
	REUTERS-LANGUAGE:KF
	REUTERS-PRIORITY:KH
	REUTERS-GET-XML:yes for XXnews

	reuters-topic-lookup:\R8
	reuters-ccs-lookup:\R6
  	REUTERS-iso-LC-file:mrm.language.fip
	This is read once on startup and should be in the default
	parameter file only.
		syntax is	LANG-VARIANT-Duid (NL)
			eg	en-GB-T0024959592

	The Genre processing uses FipHdr fields J0 to J9
	plus	R9 for the Genre Duid
			R8 for Language Duid
			R0 for Language Variant
			R6 list of paths/topics

	For Reuters PNAC processing, the relevant FipHdr fields are needed :
	reuters-sq-topics:KB
	reuters-sq-priority:KF
	reuters-sq-filed:WK
	reuters-sq-source:QR
	combie:QR	M3|KZ|KV,RTRS

	;move these from Topic to Products
	reuters-rnp:DF XRNP

	; NTM teXt or Table Fiphdr to add a <PRE>
	reuters-TX-fiphdr:KE

Where sections of FipHdr fields or changes to the output style are required,
use keywords : fixed, partial, combie, optional, repeat, newdate and/or style.
(see the SysAdmin manual for more information).

	They are normally specified :
		fixed:QZ	1234543
		partial:QT	ST,3,2,U,<,>
		combie:QY	ep|na,(0000000)a
		option:QE	ep,11,7,s
		repeat:QK	XK,-,3
	or	repeat:QP	PK,,4,#X
		style:QS	XN,%.03d
		unique:QT	XT

------------------------------------------------------
Example LOOKUP file :
; Codes file is in MATCHCODES
lookup:/fip/tables/setup/MATCHCODES
; field sep in MATCHCODES is a pipe
sep:|
; incoming FipHdr field is in the format :
;SH::LP:N0297:VZ:H-----:PU:CF:TFXJ.AU-PBL.AU-BRY.NZ:KHeavy Selling in Australian Publishing
; Break up the XT Fip Source Header field into 3 fields called Q1,Q2,Q3 divided
; by a hyphen
; note that Q1,Q2,Q3 are for internal use only - they do NOT get created in the
; output file.
repeat:Q1	XT,-,1
repeat:Q2	XT,-,2
repeat:Q3	XT,-,3
; match each one of these against the MATCHCODES table and create (up to) 4 new
; header fields.
match:Q1	newfld:A1,A2,A3,A4
match:Q2	newfld:B1,B2,B3,B4
match:Q3	newfld:C1,C2,C3,C4

-- The MATCHCODES file in this case could look like :
; with a sep of PIPE, first field is the key, a newline finishes each line
; the file should be sorted on the first field.
FXJ.AU|AU000000FXJ5|6467074|FAIRFAX(JOHN)|AU|PUB
PBL.AU|AU000000PBL6|6637082|PUBLISHING & BROAD|AU|PUB
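
As a rough walk-through of this example, the Python sketch below splits the
source field on hyphens and matches each zone against the table. It is not
iplookup's own logic: 'load' and 'enrich' are hypothetical names, and dropping
the fields beyond the four named in 'newfld' is an assumption.

```python
MATCHCODES = """\
; with a sep of PIPE, first field is the key
FXJ.AU|AU000000FXJ5|6467074|FAIRFAX(JOHN)|AU|PUB
PBL.AU|AU000000PBL6|6637082|PUBLISHING & BROAD|AU|PUB
"""

def load(text, sep="|"):
    table = {}
    for line in text.splitlines():
        if line and not line.startswith(";"):
            key, *rest = line.split(sep)
            table[key] = rest
    return table

def enrich(source, table):
    """Zones 1-3 play the part of Q1-Q3; prefixes A, B, C follow the newfld lists."""
    out = {}
    for zone, prefix in zip(source.split("-"), "ABC"):
        fields = table.get(zone)
        if fields is None:
            continue              # no 'default:' given, so nothing is added
        # Assumption: only the four fields named in newfld are kept.
        for n, value in enumerate(fields[:4], start=1):
            out[f"{prefix}{n}"] = value
    return out

hdr = enrich("FXJ.AU-PBL.AU-BRY.NZ", load(MATCHCODES))
for name in sorted(hdr):
    print(name, hdr[name])
```

Here FXJ.AU and PBL.AU each produce four new fields (A1-A4 and B1-B4), while
BRY.NZ is not in the table and so adds nothing.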


------------------------------------------------------
Notes
-----
Use the repeat-match for cases where the input field looks something like :
	WT:ASIA EMRG IN IND AUT MAC RES

To use :

; make sure the codes are unique and separated by a single plus sign
unique:AT	WT
; go get a match for each one ! - rptfld holds the FipHdr containing just this
; single search.
repeat-match:AT		newfld:W2,W3,W4		rptfld:W1

You then have to define what to do with the output.
 - If you do nothing, then there will be multiple W1, W2 etc in the FipHdr
   BUT only the last one will be accessible!
 - If you are using templates, then a new template is generated for each
   match. Each template will be appended to the last for the output file.
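
The behaviour of 'repeat-match' can be pictured with the Python sketch below.
It is an illustration under assumptions: the function name and the lookup
entries are hypothetical, and zones with no match are simply skipped.

```python
def repeat_match(value, table, newfld, rptfld, sep="+"):
    """Return one set of new FipHdr fields per matched zone."""
    results = []
    for zone in value.split(sep):
        fields = table.get(zone)
        if fields is None:
            continue                    # assumption: unmatched zones add nothing
        hdr = {rptfld: zone}            # rptfld holds just this single search
        hdr.update(zip(newfld, fields))
        results.append(hdr)
    return results

# Hypothetical lookup entries; real ones come from the lookup file.
table = {"ASIA": ["Asia", "region"], "IND": ["India", "country"]}
matches = repeat_match("ASIA+EMRG+IN+IND", table, ["W2", "W3", "W4"], "W1")

# Doing nothing further: later matches overwrite earlier ones,
# so only the last W1, W2 etc remain visible.
flat = {}
for m in matches:
    flat.update(m)
print(flat["W1"])
```

With a template, each entry in 'matches' would instead fill its own copy of
the template.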

------------------------------
Normally 'iplookup' only adds new FipHdrs and the data or text of the file is
not changed at all.

There are times, however, when you wish to use the new lookup headers
immediately.

The FipHdr fields used are :
	V1 is the index to our table - which provides uniqueness inside the file for
	Duids and other ids
	V2 is the original search key
	V3... is the first new field
	if there are any more fields, they are V4, V5 etc

text-template: (file in tables/setup)

<rn2wEntity><Category Type="COMPANY" Resolve="tf_entity" Alias="V1"
SearchCount="1"><SearchResult IdRef="V3"/></Category>V2</Entity>

newV1: (FipHdr to use in place of V1, V2 and V3 if they are used elsewhere)

------------------------------
In the current version only ONE lookup file can be searched per file. To search
more, you need to run the file through the program twice, hopefully against
different parameter files!

In the first parameter file, use 'outque' and 'extra-fiphdr' to loop :
	; stuff it back into 2lookup afterwards
	outque:2lookup
	; use the 2nd parameter file /fip/tables/setup/PARAM2
	extra-fiphdr:#DY:param2

OR if there are two very big lookups, you will not want to read and hash each
lookup for every file coming through - so if sub-second speed is important, use
two 'iplookups' in the SYSTEM file and use either input switch '-o' or 'outque'
to get the files from one to the other.
	look1	local	iplookup -i 2lookup -o 2lookup2 -z param1
	look2	local	iplookup -i 2lookup2  -z param2

------------------------------
The Parameter file is read every time a file is processed. The lookup file is
first checked to see if it is the same as the last request or if it has changed
before reloading.

Note this is a change from version 04c, before which a change to the lookup
needed a stop and restart of iplookup to load the new version.
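
A check of this kind can be sketched in Python by comparing the file's
modification time and size with those seen at the last load. The class name
'LookupCache' is hypothetical and the sketch is not iplookup's own code; it
only shows the general technique.

```python
import os
import tempfile

class LookupCache:
    def __init__(self):
        self.path = None
        self.stamp = None
        self.lines = []
        self.loads = 0                  # counts real reads, for the demo

    def get(self, path):
        st = os.stat(path)
        stamp = (st.st_mtime_ns, st.st_size)
        # Reload only if a different file is requested or it has changed.
        if path != self.path or stamp != self.stamp:
            with open(path, encoding="utf-8") as fh:
                self.lines = fh.read().splitlines()
            self.path, self.stamp = path, stamp
            self.loads += 1
        return self.lines

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "MATCHCODES")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("FXJ.AU|AU000000FXJ5\n")
    cache = LookupCache()
    cache.get(path)
    cache.get(path)                     # unchanged: served from memory
    loads_unchanged = cache.loads
    with open(path, "a", encoding="utf-8") as fh:
        fh.write("PBL.AU|AU000000PBL6\n")
    reloaded = cache.get(path)          # size changed: read again
    loads_after_edit = cache.loads
```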

------------------------------
There are extra parameters for the MRM v2 :
	reuters-check-headline:K5
This will also check the FipHdr K5 and replace any RICs with Entity tags. Any
'<' and '>' tags are mapped to 036 and 037 for mapping back in a subsequent
xchg. 
	reuters-RICs-in-text:U9
Split CompanyIds/Rics into the original FipHdr field plus U9 for any that
appear in Text
	reuters-ignore-RICS-in-text:B9
If this FipHdr field is NON-blank, do NOT look for Rics in the text.

------------------------------

Input Parameters are (all optional) :
Either
	-1 : path/filename for single shot		default: spooled
		The input file is NOT deleted
		If this does NOT start with a '/', it is assumed relative to the current
		path.
Or
	-i : input queue			default: spool/2lookup
		If this does NOT start with a '/', it is assumed under spool.

	-D : time in seconds to sweep the queue		default: not enabled
		Use this to batch and delay copy for 20 mins -D 1200
	-h : Show results from Reuters MRM			default: no
	-H : display ONLY the new FipHdr fields
		The default is a complete file with FipHdr and data
		This is of most use with the '-1' single file switch.
	-l : do NOT log files in		default: log
	-o : output queue			default: spool/2go
		If this does NOT start with a '/', it is assumed under spool.
		This is overridden by the 'outque' or 'outdest' parameters if they exist.
	-r : Reuters Duid checking for internal NewsML feed			default: no
	-R : use the Reuters Metadata Repository for the Lookup.	default: no
	-C : do NOT use Reuters CCS codes (ie old variants)			default: CCS
	-w : file wait for files arriving across a network.	def: 8 secs
	-z : default parameter file in tables/setup	default: tables/setup/LOOKUP
	-v : print version number and exit

Version Control
;008q	10dec04 Roy-mrmCache added
	;f 07feb05 added NLS/Genre Cache too
	;g-i 21feb05, Roy - ignore language variants
	;j 09mar05 Roy added table-pre
	;k-o 21mar05 Roy buglettes
	;p-q 15jul05 xxnews - added REUTERS-GET-XML
;007z7	05may04 N2Wversion4
	;b-i 01jun04 woops - [] only for .ULs
	;j-l 28jul04 Roy - leg/CCS dual Rics and NSC codes too for -r/ccs
	;m-r 04aug04 for RFC 46 [*] (p for dualric bug) (q speedy) (r-duid<002)
	;s  18oct04 AVANT-PAPIER catered for plus FEATURE
	;t-u 20oct04 ADVISORY and allow 4 x head and 5 x slug genres
	;v-z7 27oct04 rfc87/46/110 work
;006w	13aug03 bugette in RTR Genre ;c upped Rics 300->500
	;d 06oct03 reworked Duid bit for EITHER n2000 OR rrr.
	;e 31oct03 timings
	;f-h 21nov03 v3.0 RICs - markup NoCoys and ONLY markup the RIC not the name
	;i 09dec03 added FIP_maxFipHdrSize = (4*STDBUF) - 1;
	;j 17dec03 allow multiple TOPICS for a single N2000 code inbound
	;k 05feb04 bug in non-mrm version - double FipHdrs
	;l-n 01mar04 RTR added UnlistedRics and made Genre generic
	;o-q 12mar04 added -h and reuters-iso.lc-tags (q v4 mrm .h)
	;r-w 26mar04 bugette in Genre-Headline and UTF8
;005z	11apr03 added Dual Rics for RTR
	;b-g 26may03 added Genre
	;h-k 17jun03 added reuters-check-alertline plus & in Genre plus Features
			plus reuters-priority
	;l-n 23jun03 NNG variant ie V3
	;o-p 04jul03 bugette - missing last number.in headline
			plus if UPDATE with no number - default is 1
	;q 12jul03 CoInst chg to APARTMENTTHREADED
	;r-u 16jul03 bugette - ignoring 1st Co Duid plus redid rn2wcats plus bug in
			Genre
	;v-z 31jul03 **see notes of Nathan's changes (y-FEATURE bugette)
;004z	29nov01 added COM support for Reuters Metadata Repository
	;b 23jan02 added text-template 
	;c/d/e 05apr02 added check time/size of lookup file
	;f Reuters-lookup NOT be uppercase
	;g 29aug02 bug for Reuters-lookup picking up bad xml
	;h-m 17oct02 Reuters MRM version 2 (j=added headline and bug if no data)
	;n-p 09dec02 Tuned routine for finding Reuters MRM name
	;r-s 14feb03 MRMv2 build 58/59 mods for RICs
	;t-z 11mar03 if FipHdr B9 is TOP - ignore Rics-in-text
;003	29oct01 added repeat-match: and template

(copyright) 2014 and previous years FingerPost Ltd.