ipw4 (Sat Oct 25 2014 01:31:01)

ipw4

This program generates w4 structures - lists and files - for the w4
browser-based tasting system.

Each incoming file is compared against the parameter file and added to each
directory found.

The text is left unaltered, as ipw4 assumes 'ipxchg' has already cleaned it up.

A single file can be in none, one or many lists.

A single copy of the file is maintained for all the lists so that when it is
copied/exported, the audit message is inserted at all relevant points in all
relevant lists.

If more than one publication is specified, then a copy of the file is made for
each publication. In this case the audit message is restricted to those lists
belonging to that publication.


To decide which lists a file should be in, each destination in the Fip
destination field 'DU' is compared to all the entries of the 'dest' parameters.

Then the same is done for any 'testforlist' parameters.

Note there MUST always be a DU field - even if you are only using
'testforlist'.
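
As a rough illustration of the 'dest' step described above, the following
Python sketch maps the destinations in a DU field to list names. It is not
ipw4's own code: the function name, the dict standing in for the 'dest'
parameters and the assumption that destinations in DU are separated by spaces
are all hypothetical, and wildcard matching is not handled.

```python
def lists_for_destinations(du_field, dest_table):
    """Return the list names selected by the destinations in a DU field.

    du_field   -- raw DU value; splitting on whitespace is an assumption
    dest_table -- dict mapping each 'dest' name to its list names
    """
    selected = []
    for destination in du_field.split():
        # 'dest' matching is case sensitive, so the name is used as-is.
        for list_name in dest_table.get(destination, []):
            # Case is ignored for list names; fold to upper case so that
            # the same list named on several 'dest' lines gives one entry.
            name = list_name.upper()
            if name not in selected:
                selected.append(name)
    return selected


dest_table = {
    "w4arte": ["KULTUR", "ARTE"],
    "w4soccer": ["SOCCER_SUNDAY", "SPORT_SUNDAY"],
    "w4sport": ["sport_sunday"],
}
print(lists_for_destinations("w4arte w4soccer w4sport W4ARTE", dest_table))
```

Here 'W4ARTE' selects nothing because the 'dest' comparison is case sensitive,
and SPORT_SUNDAY appears only once although two destinations point to it.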

The Parameter file is in tables/w4 and, by default, is called W4. The syntax is
the normal Fip style :
	; comment
	dest:	define which lists an entry is inserted into for each destination
	eg	dest:w4arte	list:KULTUR,ARTE
			Each list must be defined by a 'list:'
			parameter as below.
			These are fixed lists. eg
				dest:w4soccer	list:SOCCER_SUNDAY,SPORT_SUNDAY
				list:SOCCER_SUNDAY	maxitems:200
				list:SPORT_SUNDAY
			.. or a specified FipSeq.
				dest:w4client	list:DA
				list:SUBLIST
				default-list-parameters:SUBLIST

		If you do not use 'default-list', any non-matching file is ignored.
		Case is ignored for the names of the 'list' and 'pub'
		There can be multiple lists separated by a comma.
		The same 'list' can be defined on several 'dest' lines but only one entry
		will be made.

	testforlist:	define lists for one or more FipHdr tests.
		Syntax is
			testforlist:(list1,list2,..) (FipHdr)=(test) (FipHdr)#(test)

			There can be one or more lists separated by a comma (no spaces)
			There can be one or more tests which can be either equal '='
				or not equal '#' (not equal can also be '!=' )
			For the test a single wildcard '*' can be added at the end.
			To test for a blank field (or a field which does not exist),
				use double quotes :	XY="" ZZ#""
		eg:	testforlist:AFX_SPORTS	SU=afx XC=s* XC#sdd

		Both the FipHdr and the Test fields can be FipSeq .. eg
			; Check if the source is 'epd'
			; should be the XA field XA:epd
			; BUT also XA:/AFP-SX77, so repeat on punctuation
			; if XA does NOT exist or there is no data, chk SU
			repeat:Q1	XA,,1,#x
			repeat:Q2	XA,,2,#x
			combie:QA	Q1|Q2|SU
			testforlist:epd		QA=epd$d

		Note that 'testforlist' and 'dest' can be equivalent, except that the test
		for 'testforlist' is case INsensitive while for 'dest' it is case sensitive.

	list:	define the size of a list and optional ticker.
		Sub Parameters are :
			maxitems	maximum number of items
					Specify zero to mean all items
					default is all files.
			maxsize		maximum no of chrs of text per item
					default is 1000 bytes
			ticker-items	maximum no of items for the ticker
					if not specified, there is no ticker
			ticker-size	max no of chrs per ticker item
					default is 60 bytes
			refresh		(optional) refresh the main list only
					once every X secs. Normally the
					main LIST is refreshed on every new file
			pub		(optional) publication name
					It restricts audits to a single publication.
					pub:sunday
			entry		(optional) name of a specific entry if not the default.
			group		(optional) Group List name
					Use this to build collated lists of several sub lists.
					eg group:ALL_SPORT
					The item is also put in this LIST
					The GROUP list is specified as an ordinary List but may or may not have any
					'dests' or 'testforlists' pointing to it.
					It must be specified BEFORE any other list refers to it.
			maint:500	(optional) Trim the Top List to this number of items
					NONE - implying no maintenance - ie all items will be
					left in the main top list - ** Only use this with
					extreme care as the list can get very big !!
					MIDNIGHT or 0 (default) - trim the top list to start at midnight
					(number of items from 1 to 3000) - just that !

	eg	list:MOTOR maxitems:0 maxsize:300 ticker-items:100 ticker-size:60

	pub:	define publication - optional, use only for multi-pub sites.
		The same parameter is added to each 'list' line which
		means that any audit message will be restricted to that pub
	eg	pub:sunday
	before:	text to add at the top of the data file.
	after:	text to add at the bottom of the data file.
	filebefore:	file to add at the top of the data file.
	fileafter:	file to add at the bottom of the data file.
	entry:	List entry for each file in HTML with FipSeq.
		This is the directory line in the LIST.
		Special care should be taken if you need to change from the
		default, as certain key fields are required for Audit and Search.
		These include the '<!-- @@## -->' and the first '<br>'.
		If more than one entry is specified (up to 100 may be), the
		first (ie top of file) is considered the default.
		Syntax:	entry:(name)	(HTML in FipSeq)
		see below for an example
	entry-abstract: ditto for the abstract part of the list (ie the bit underneath
		the clickable link to the data)
	ticker-entry:	The List entry for Tickers.
		This is generally fairly short with no or few comments to
		reduce the size of each ticker list.
	script:	Run a script after the file has been written.
	log:	Item log entry if not default.
	folder:	name of a sub-folder under /fip/data/w4 for this list.
		This should be used for your own scripts as the standard
		Fip w4 does not normally track folders.
	default-maxsize:(number of bytes)	default is 1000 bytes
	metadata-for-source: Define the MetaData for a particular source.
		Do NOT put a tab in or NL or CR.
		syntax is 	metadata-for-source:(agencyName)	(Meta Strings)
		The default is the 'default-metadata' keyword
			or 	'pri=WP cat=WC' for non-fip search
			and	'WP WC' for fip search
		eg	metadata-for-source:WIRE2	sender=XU ref=XR
		The Headline (WK), Source (SU) and Filename (SN) are always added
		automatically and do not need to be specified.
	default-metadata: Define default search metadata
		Default meta is 'pri=WP cat=WC' for non-fip search
			and	'WP WC' for fip search
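
The 'testforlist' tests described earlier in this parameter list can be
sketched in Python as below. This is only an illustration of the documented
rules (equal '=', not equal '#' or '!=', one trailing wildcard '*', double
quotes for a blank or missing field, case ignored). The function names are
hypothetical, FipSeq expansion is left out, and treating several tests as
"all must pass" is an assumption.

```python
import re


def field_matches(value, pattern):
    """Compare one FipHdr value with a test pattern, ignoring case."""
    value, pattern = value.lower(), pattern.lower()
    if pattern == '""':
        return value == ""                      # blank or missing field
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])   # single trailing wildcard
    return value == pattern


def matches_testforlist(tests, fiphdr):
    """Return True when every test holds (ANDing them is an assumption)."""
    for test in tests.split():
        found = re.match(r"(.+?)(!=|#|=)(.*)", test)
        if found is None:
            raise ValueError("not a valid test: " + test)
        name, operator, pattern = found.groups()
        matched = field_matches(fiphdr.get(name, ""), pattern)
        wanted = operator == "="        # '#' and '!=' both mean not equal
        if matched != wanted:
            return False
    return True


tests = "SU=afx XC=s* XC#sdd"
print(matches_testforlist(tests, {"SU": "AFX", "XC": "sport"}))
print(matches_testforlist(tests, {"SU": "AFX", "XC": "sdd"}))
print(matches_testforlist('XY="" ZZ#""', {"ZZ": "data"}))
```

The first and third calls pass; the second fails because XC is 'sdd', which
the 'XC#sdd' test excludes.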

Other less often used parameters :
	missing-list: (name of list)
		If the file is NOT in any other list, it is added to this one.	default: file is ignored
	default-list-parameters: (name of list)
		Use this 'list' for all default parameters NOT specified.
	syndication-list: (FipSeq containing name of client)
	default-list-for-syndication: (name of list)
		Any file that matches a 'dest' but the 'list' is not specified
		uses the parameters as specified by this 'list' but is named
		as in the 'dest' line.
		ie if DA:biggles and DU:w4planes :
			syndication-list:DA
			list:heros	maxitems:0
			default-list-for-syndication:heros
		so the list will be called BIGGLES with no check on the number of items.
	use-hour-folders:
		Where there are masses of data, store the files in hour folders
		in order to improve disk access.
	number: default number system - octal,decimal or hex.
	chkexists: for NFS or NT mapped drives, a check-file to make sure the
		drive is valid
	outputdrive: (NT only) drive letter for data
	audit-msg: Html string to replace the default audit message
		default: <font color="green">Fetched by \WA at \WT<br></font>
	audit-text: Html string to replace the default audit text point
	output-filename: change the output filename
	supercede:	files with the same name are normally replaced
	no-supercede:   files with the same name are normally replaced;
			use this to create a new file every time.
	owner:	Unix only, logon of the owner of the files if not yours
	archive:	Archive the file in log/data
	NewSU:		FipHdr field for source if NOT 'SU'
	wild:		wild string chr for matching if not the default '*'
	singlewild:	wild single chr for matching if not the default '?'
	hostname:	Name of this host if not that booted from (for IP address)
	log-unmatched-files: If a file is NOT in any list - log it with a !ox flag
	allow-deletes:	Allow Delete tokens to zap files and list items	- default:no
	chrmap: (old 8 bit chr) (replacement 8 bit chr)	default:no
		chrmap:236243
		for FipHdr fields only
	list-end-of-line:	String (in FipSeq) to flag an end-of-line in a List or Search
		default is none - all end-of-lines are translated to a space.
		Take care not to reuse a special chr which you are using to flag something else.
		In particular the text-marker which is usually a TAB or a NL/CR which are
		end-of-item.
	default-unwrap-abstract: yes/no
		if the abstract text is wrapped - at 64 chrs for example - use this to put the
		list-end-of-line marker at the end of a para.
		Each file is checked for the optional FipHdr field W4_ABSTRACT_UNWRAP: yes/no
		which, if found, will override the default.
	zap-xml-abstract: yes/no
		Zap all xml <p>, <br> etc in the abstract	default: just zap the < and >
		Each file is checked for the optional FipHdr field W4_ABSTRACT_ZAP_XML: yes/no
		which, if found, will override the default.
	hdr-hash:05
		A single chr (usually a control chr - 05 or 35) to use internally in
		place of a hash '#' in the FipHdr
		default is 035

	allow-flow:	Allow data to input into the Fip Web Flow system
	flow-default-section: default section		default: fip
	flow-default-status: default status		default: Input
	flow-unique-id:	FipSeq for generating the unique-id if there is not a
			W4_FLOW_ID
			default:\WR
	flow-ext:	File extension for files	default: fip
			Do Not add the '.'
			This should match any filemapping on the client side
			for flow_edit.pl or flow_read.pl
	flow-balance: Balance Group for all data files	default:none
	Files should have one or more of the FipHdr fields :
		W4_FLOW:
			This flag is needed to signal the file is part of a flow.
			no parameters required
		W4_FLOW_SECTION: (section name required - if not default)
		W4_FLOW_STATUS: (status required - if not default)
		W4_FLOW_ID:(actual ID to use)

		Optionally they can also have :
			W4_FLOW_L1: (data)
				..
			W4_FLOW_L9: (data)
			These are extra fields for the LISTs in addition to the first line of data.
			They can be defaulted using parameters 'flow-default-1' etc

Plus the usual suspects for FipSeq - such as fixed: partial: combie: option:
repeat: style: replace: newdate: etc (pls link to http://www.fingerpost.co.uk
and look for FipSeq )


Ordinary incoming files are checked for FIP header fields :
	W4_TOP:		name of template file to add before the data of the file.
			The full path should be specified.
			default: none
	W4_BOTTOM:	name of template file to add after the data of the file.
			The full path should be specified.
			default: none
	W4_HTML_IN_LIST: This flag will NOT strip any HTML in the List file
			Normally all tags - HTML, SGML or XML are stripped for the list
			Nor are they counted in the 'chunks' for a list.
			** Please label all Pictures this way : ie in sys/USERS
		w4reupix=	DP:localhost   DQ:2w4   DC:SC  W4_HTML_IN_LIST:

	W4_TOP_LIST:	name of file to add before the List Entry.
			The full path should be specified.
			default: none
	W4_BOTTOM_LIST: name of template file to add after the List Entry.
			The full path should be specified.
			default: none
	W4_CHRSET: (chrset) Used with -C utf8 to flag files which are already UTF8 and
			so need no conversion
			This changes both fiphdrs and the abstract
			use	W4_ABSTRACT_CHRSET: utf8 to change/flag the Abstract only
			use	W4_FIPHDR_CHRSET: utf8 to change/flag the FipHdr only
			The chrset can be blank or utf8
	W4_ABSTRACT: (FipSeq)
			Replacement for the abstract in the List and Search from the data in this FipHdr
			which is normally the first bit of text OR the entry-abstract:(entryname)
			for that service
	W4_ABSTRACT_FILE: (FullPathName)
			Replacement for the abstract in the List and Search from the contents of this file
			which is normally the first bit of text OR the entry-abstract:(entryname)
			for that service
	W4_ABSTRACT_UNWRAP: yes/no
			unwrap/ do not unwrap the abstract for this file
			default: no
	W4_ABSTRACT_ZAP_XML: yes/no
			remove any XML tags from the abstract
			default: no
	W4_LIST_DATE: (yyyymmdd)
			Force the List/Search date to be this
			(default is current system time when the file hits the input folder)

FipHdr fields used include :
	WM:	Mime Type
	WZ:	Xchg to use when reading the file.
	WI:	IP address of the host creating this
	DS:	Supercede this file if it already exists	default: yes
	XD:	DO NOT Supercede this file if it already exists default: yes
	WB:	if the mimetype is NOT text, use this as replacement text
		for the list
	WN:	filename
	WQ:	subpath (the top path is assumed as /fip/data)
	WL:	all the lists this file is in, semicolon separated
	WV:	all the lists, space separated - for displaying
	WD:	all the list DELTAS
	WG:	all the list GROUPS
	WJ:	Julian day of this file
	WH:	Date of this file
	WC:	Category
	WP:	Priority
	WK:	Headline
	WW:	No of words	(added 07y1)
	W$:	No of chrs	(added 07y1)


For AUDIT messages, incoming files are checked for FIP header fields :
	WA:	audit file logon
	WT:	Time and date of audit
	WY:	audit message
	WN:	(From Data)	filename
	WQ:	(From Data)	subpath (the top path is assumed as /fip/data)
	WL:	(From Data)	all the lists this file is in, comma separated
	WV:	(From Data)	all the lists, space separated - for displaying
	WD:	(From Data)	all the list DELTAS
	WJ:	(From Data)	Julian day of this file
	WH:	(From Data)	Date of this file

For DELETE messages, incoming files are checked for FIP header fields :
	WX:	Security checksum for this file
	WA:	logon of the delete person
	WT:	Time and date of delete
	WN:	(From Data)	filename
	WQ:	(From Data)	subpath (the top path is assumed as /fip/data)
	WL:	(From Data)	all the lists this file is in, comma separated
	WV:	(From Data)	all the lists, space separated - for displaying
	WD:	(From Data)	all the list DELTAS (semicolon separated)
	WJ:	(From Data)	Julian day of this file
	WH:	(From Data)	Date of this file

For Flow messages, ipw4 will ADD the following FipHdr fields :
	WR	Duid
	WF	1st line of text
(Section and Status are implicit in the Flow system and are NOT carried in
FipHdr fields)

IPW4 uses the following environment variables :
	FIP_W4_defEQ		default queue		default: general
	FIP_W4_LINE		default line length for $L	def: 80
	FIP_W4_WORD		default word length for $W	def: 6

	$2 is the second line of text
		..
	$9 is the ninth line of text

Input switches (all optional) :
	-0 : Use Old Version 0 format files		default: current version
	-9 : run in Speedy mode			 default: no
	-a : alert file if not the default which is
		no publications specified	: tables/w4/ALERT
		publications specified		: tables/w4/ALERT_PUBLICATION
	-c : check this queue or file exists before writing files
		(for NFS and other mounted queues
		- see CHKEXISTS above)			default: no
	-C : convert list entry characters to ..	default: unconverted
		-C utf8		convert to utf8
	-d : Output Drive (WINNT only)			default: drive with Fip on
			This is overridden by the 'outputdrive' keyword.
	-D : name of a done queue for input files after processing.
		If this does not start with a '/', it is assumed to be under /fip/spool.
							default: files are deleted
	-f : default flow path				default: /fip/data/flow
	-F : default no of flow sub queues (before 07r was 256). default: 100
	-g : do NOT make search Group lists		default: do
	-l : log all files				default: do NOT log
	-L : do NOT log files				default: do NOT log
	-m : UNIX file mask - input to umask for file creation.
		Pls remember this is input as a DECIMAL number while
		access is normally an octal ie -m 420 = 0644.
						default: 0 for rw-rw-rw- access
	-N : use the next/previous flags		default: do not
	-o : Output path name
		default for Version 0 : /fip/spool/w4data
		If this does not start with a '/', it is assumed to be under /fip/spool.
		default for other versions : /fip/data/w4/
	-q : queue to scan				default: 2w4
	-Q : keep quiet if the queue for the incoming file does not exist
		or there are too many duplicates. default:no
	-r : reindex - just reindex incoming (resent) files.
		do not add to the lists.		default: no
		do not add the files either.
	-R : reindex - just reindex incoming (resent) files.
		do not add to the lists.		default: no
	-s : using external Search			default: fip search
		with a search Group list too
	-S : using external Search			default: fip search
		WITHOUT a search Group list too
	-t : sleep time betwix scans			default: 1 sec
	-T : name of search tickers file		default: none
	-u : default owner for ALL files.		default: that of 'ip'
		This may be overridden by the 'owner' parameter.
	-V : version					default: 8
		0 - html lists
		5 - audit in list
		8 - filesize in lists
	-X : No Search file nor Index file required	default: fip search
	-z : default parameter file			default: tables/w4/W4
	-v : print version number and exit
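
The decimal/octal point made for the -m switch above can be checked with a
couple of lines of Python. This is only arithmetic; the function name is
hypothetical and nothing here is part of ipw4.

```python
def mask_switch_value(octal_mode):
    """Return the decimal number to pass to -m for an octal mode string."""
    return int(octal_mode, 8)


print(mask_switch_value("0644"))   # 420, as in the -m example above
print(oct(420))                    # 0o644
```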

---------- Example ----------

pub:herald
pub:times
pub:sunday

; Text at start of file - Put time stamp and cross references at the end of the file
filebefore:/fip/web/setup/w4.file.top

; Text at end of file
fileafter:/fip/web/setup/w4.file.bottom

; The aim is to have a cross reference to a file in a directory below this level,
; with the SU as the name of the directory where stories are saved
entry:default <DT><!-- $U CAT:XC PRI:XP --><a
href="/fip-cgi/pick_showlist.pl?Fipid=91251948919514&file=19981201_rtr/reu4052.0502.html"
TARGET="wirecopy_window"> <IMG SRC=/fip-pages/gifs/crush.gif width=10 height=10
border=0> </a><A HREF="/fip-cgi/wir_readfile.pl?Fipid=##FIPID##&file=WQ/WN"
TARGET="wirecopy_window">WK</A>s<FONT SIZE=-1 FACE="Helvetica"
COLOR="red">(s$D $M $Y,s$H:$N<!-- ##@@ -->s)</FONT><BR>

; Run Verity index program afterwards
script:/bin/echo "/fip/data/w4/files/WQ/WN" > /fip/spool/2verity/WN

; Actual lists

list:ALL_WIRES_HER	maxitems:0	maxsize:300	ticker-items:100	ticker-size:25	pub:herald
list:ALL_WIRES_SUN	maxitems:0	maxsize:300	ticker-items:100	ticker-size:25	pub:sunday
list:AP_ADVISORIES_HER  maxitems:0	maxsize:300	ticker-items:100	ticker-size:25	pub:herald

; ----------------------------------------------------------------------
; Associated Press/Press Association/Reuters
dest:all_wires	  list:ALL_WIRES_HER,ALL_WIRES_SUN

;
; Associated Press
;
dest:ap_advisories	list:AP_ADVISORIES_HER,AP_ADVISORIES_SUN,AP_ADVISORIES_ET
; NO Financial for Evening Times
dest:ap_financial	list:AP_FINANCIAL_HER,AP_FINANCIAL_SUN


audit-msg:Read by WA at WT

----------------------------------------------------------------------
 Notes

- Installation
Do you need to run UTF8 ???
	SYSTEM - ipw4 -l -N -C utf8 -T sticker
	USERS
	- just text needs
		w4cp1251	DP:localhost	DQ:2w4 W4_ABSTRACT_DC:W4ABS	DC:SC CX:PREW4 W4_ABSTRACT_CHRSET:UTF8
	- fiphdr and text
		w4cp1251	DP:localhost	DQ:2w4 W4_ABSTRACT_DC:W4ABS	DC:SC CX:PREW4 W4_CHRSET:UTF8
	- xchg
	(SC)2W4ABS
; W4 Abstract -  Russian (CP1251) to UTF8
;
; Default character set
c:isoascii

z:chghdr:IH,HK
z:convert-fiphdr:utf8,map

z:unicode-map:CP1251.TXT

; Convert to UTF-8
z:convert-to-utf8

- Tuning points
For the OLD system (-0 input switch or pre version 05 of ipw4) if you have any
Lists which will :
	either	have more than 0.5 mb of data at the end of the day
	or	have more than 3 or 4 items per minute
Then use the 'refresh' parameter in the list to put the last x secs worth of
data into the cache file. This does not affect searching or anything else - but
it really speeds up processing. In particular this covers wires like Dow Jones,
Bloomberg, Business Wire and Bridge/KR.

This is not applicable for ipw4 05+ as the new lists are in chunks of 100
files.

- examples for the SYSTEM
Using Glimpse as the Search - including Groups (or Collated)
w4	local	ipw4 -l -s

Using Verity as the Search - Excluding Groups (or Collated)
w4	local	ipw4 -l -S

Using Fip Search - with Groups
w4	local	ipw4 -l

Using Fip Search - withOUT Groups
w4	local	ipw4 -l -g

-- What if it is not text in the incoming file
	- check a couple of areas
			1.1 FipHdr field	WM does NOT start 'text'
			1.2 and there is NO fiphdr marker W4_TEXT_REPLACEMENT
				Nothing will be put out - except the contents of an optional fiphdr field WB
	- and/or	2.1 add FipHdr field W4_LIST_ABSTRACT with the text/html to
	- and/or	3.1 match the entry-abstract
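
Points 1.1 and 1.2 above can be pictured with the small Python sketch below.
It is a guess at the decision only, not ipw4's code: the function name is
hypothetical, a missing WM field is assumed to mean text, and in every case
other than the one described in 1.1/1.2 the normal text is assumed to be used.

```python
def list_text(fiphdr, body_text):
    """Pick the text shown in the list for one incoming file."""
    mime = fiphdr.get("WM", "text")     # assumption: no WM means text
    if not mime.startswith("text") and "W4_TEXT_REPLACEMENT" not in fiphdr:
        # Not text and no replacement marker: only the optional WB is used.
        return fiphdr.get("WB", "")
    # Assumption: otherwise the text of the file is used as normal.
    return body_text


print(list_text({"WM": "image/jpeg", "WB": "Picture: cup final"}, "(binary)"))
print(list_text({"WM": "text/plain"}, "First line of the story"))
```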

----------------------------------------------------------------------

Version Control
;007z32	25jul02 added flow (do not use versions 7a or 7b) (7d for WR)
	;h 13may03 audit on other sys was broken.
	;i 10jun03 flow - added delete BEFORE adding search link
	;j 08aug03 bugette with large files.
	;k-o 23apr04 bugette with Audit...
	;p-q 05sep04 zippy and timing stats
	;r-u 03feb05 added -F for flow queues 256 -> 100
	;v-w 18apr05 added flow-balance-group
	;x1 23sep07 buggette in filename - not always unique!
	;y1 19mar08 redid search meta to allow for FipSearch too/add WW and W$
		;2-3 16may08 added -C utf8 and chrmap
	;z4 6jun08 added next/prv 'np' and -N ;5-6 bugette in search WW/W$
		;8 added default-maxsize ;9 18nov08 added W4_CHRSET
		;10 28nov08 bugette with utf8 ;11-12 entry-abstract added
		;13-14 29dec08 added W4_DATE ; 15 bugette with utf8
		;16-18 added eolnList and unwrapAbs
		;19-24 added W4_ABSTRACT_FILE/UNWRAP/ZAP_XML/CHRSET
		;25 added missing-list: to hold files not in any other list ;26 minor bugette
if zero length file
		;27-28 added hdr-hash and bugette with size of W$
		;30 30apr14 file-trace ;31 buffer sizes -> STDBUF ;32 added zap-xml-abstract
;006l	14feb01	added mimetypes with different entries
	;a/b 23feb01 added WG: for groups as fiphdr field in file and -X
	;c/d/e 26feb01 added W4_HTML_IN_LIST:
	;f 08may01 maint:none
	;g/h/i 10sep01 testforlist not catching NOT fields (XC#ABC or XC!=ABC)
	;j/k 18mar02 added syndication stuff to version 1
	;l 14may02 added log-unmatched-files
;005g	09aug00 version 2 lists -cdef
	;b added -S and w4index for external searches
	;c 17oct00 bugette for txt64 for BIG (>64k) files
	;d 02nov00 added Groups plus -Reindex
	;e 14nov00 added metadata-for-source, audit balanced.
	;f 26nov00 added -s and addGroupSearch
	;g 14feb01 cleanup

(copyright) 2014 and previous years FingerPost Ltd.