ipw4

This program generates w4 structures - lists and files - for the w4
browser-based tasting system.

Each incoming file is compared against the parameter file and added to each
directory found.

The text is left unaltered as it assumes 'ipxchg' has already cleaned it up.

A single file can be in none, one or many lists.

A single copy of the file is maintained for all the lists so that when it is
copied/exported, the audit message is inserted at all relevant points in all
relevant lists.

If more than one publication is specified, then a copy of the file is made for
each publication. In this case the audit message is restricted to those lists
belonging to that publication.

To decide which lists a file should be in, each destination in the Fip
destination field 'DU' is compared to all the entries of the 'dest' parameters.

Then the same is done for any 'testforlist' parameters.

Note there MUST always be a DU field - even if you are only using
'testforlist'.

The Parameter file is in tables/w4 and, by default, is called W4. The syntax is
the normal Fip style :
    ; comment
    dest:   define which lists an entry is inserted for each destination
    eg  dest:w4arte list:KULTUR,ARTE
            Each list must be defined by a 'list:'
            parameter as below.
            These are fixed lists. eg
                dest:w4soccer   list:SOCCER_SUNDAY,SPORT_SUNDAY
                list:SOCCER_SUNDAY  maxitems:200
                list:SPORT_SUNDAY
            .. or a specified FipSeq.
                dest:w4client   list:\DA
                list:SUBLIST
                default-list-parameters:SUBLIST

        If you do not use 'default-list', any non-matching file is ignored.
        Case is ignored for the names of the 'list' and 'pub'
        There can be multiple lists separated by a comma.
        The same 'list' can be defined on several 'dest' lines but only one entry
        will be made.
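
As a rough illustration of the 'dest' matching described above, the following Python sketch collects the lists for one file. It is not ipw4's code: the function name, the table layout and the assumption that the DU destinations arrive already split into separate names are all hypothetical. It shows only the documented behaviour: the destination comparison is case sensitive, case is ignored for list names (folded to upper case here as an assumption), and a list named on several 'dest' lines gets only one entry.

```python
# Hypothetical sketch of 'dest' matching; not the ipw4 implementation.

def lists_for_destinations(destinations, dest_table):
    """Return the lists a file belongs to, each list at most once.

    destinations: destination names taken from the DU field (ASSUMED
                  to be already split into separate names).
    dest_table:   (dest, [list, ...]) pairs, one per 'dest:' line.
    """
    found = []
    for du in destinations:
        for dest, list_names in dest_table:
            if du != dest:          # 'dest' comparison is case sensitive
                continue
            for name in list_names:
                key = name.upper()  # case is ignored for list names
                if key not in found:
                    found.append(key)
    return found


dest_table = [
    ("w4arte", ["KULTUR", "ARTE"]),
    ("w4soccer", ["SOCCER_SUNDAY", "SPORT_SUNDAY"]),
    ("w4sport", ["sport_sunday"]),   # hypothetical extra 'dest' line
]
result = lists_for_destinations(["w4soccer", "w4sport", "W4ARTE"], dest_table)
print(result)
```

Here 'W4ARTE' matches nothing because of its case, and SPORT_SUNDAY appears once even though two destinations point at it.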

    testforlist:    define lists for one or more FipHdr tests.
        Syntax is
            testforlist:(list1,list2,..) (FipHdr)=(test) (FipHdr)#(test)

            There can be one or more lists separated by a comma (no spaces)
            There can be one or more tests, which can be either equal '='
                or not equal '#' (not equal can also be '!=' )
            For the test a single wildcard '*' can be added at the end.
            To test for a blank field (or a field which does not exist),
                use double quotes : XY="" ZZ#""
        eg: testforlist:AFX_SPORTS  SU=afx XC=s* XC#sdd

        Both the FipHdr and the Test fields can be FipSeq .. eg
            ; Check if the source is 'epd'
            ; should be the XA field XA:epd
            ; BUT also XA:/AFP-SX77, so repeat on punctuation
            ; if XA does NOT exist or there is no data, chk SU
            repeat:Q1   XA,,1,#x
            repeat:Q2   XA,,2,#x
            combie:QA   Q1|Q2|SU
            testforlist:epd     QA=epd\$d

        Note that 'testforlist' and 'dest' can be equivalent, except that the test
        for 'testforlist' is case INsensitive while for 'dest' it is case sensitive.
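
The 'testforlist' rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the ipw4 implementation: the function names are hypothetical, the FipHdr is modelled as a plain dict, FipSeq expansion is not modelled, and it is ASSUMED that every test on the line must pass for the lists to apply.

```python
# Hypothetical sketch of 'testforlist' tests; not the ipw4 implementation.
import re


def check_test(fiphdr, test):
    """Evaluate one test such as SU=afx, XC=s*, XC#sdd, XC!=sdd or XY=""."""
    field, op, wanted = re.match(r"([^=#!]+)(!=|#|=)(.*)", test).groups()
    value = fiphdr.get(field, "").lower()  # a missing field counts as blank
    wanted = wanted.lower()                # 'testforlist' ignores case
    if wanted == '""':
        matched = value == ""
    elif wanted.endswith("*"):
        matched = value.startswith(wanted[:-1])  # single trailing wildcard
    else:
        matched = value == wanted
    return matched if op == "=" else not matched


def lists_for_tests(fiphdr, lists, tests):
    """Return the lists if every test passes (ASSUMPTION: tests are ANDed)."""
    return list(lists) if all(check_test(fiphdr, t) for t in tests) else []


tests = ["SU=afx", "XC=s*", "XC#sdd"]
print(lists_for_tests({"SU": "AFX", "XC": "spo"}, ["AFX_SPORTS"], tests))
print(lists_for_tests({"SU": "AFX", "XC": "sdd"}, ["AFX_SPORTS"], tests))
```

The first call prints the list name because all three tests pass; the second prints an empty list because XC#sdd fails.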

    list:   define the size of a list and optional ticker.
        Sub Parameters are :
            maxitems    maximum number of items
                    Specify zero to mean all items
                    default is all files.
            maxsize     maximum no of chrs of text per item
                    default is 1000 bytes
            ticker-items    maximum no of items for the ticker
                    if not specified, there is no ticker
            ticker-size max no of chrs per ticker item
                    default is 60 bytes
            refresh     (optional) refresh the main list only
                    once every X secs. Normally the
                    main LIST is refreshed on every new file
            pub     (optional) publication name
                    It restricts audits to a single publication.
                    pub:sunday
            entry       (optional) name of a specific entry if not the default.
            group       (optional) Group List name
                    Use this to build collated lists of several sub lists.
                    eg group:ALL_SPORT
                    The item is also put in this LIST
                    The GROUP list is specified as an ordinary List but may/may not have any
                    'dests' or 'testforlists' pointing to it.
                    It must be specified BEFORE any other list refers to it.
            maint:500   (optional) Trim the Top List to this number of items
                    NONE - implying no maintenance - ie all items will be
                    left in the main top list - ** Only use this with
                    extreme care as the list can get very big !!
                    MIDNIGHT or 0 (default) - trim the top list to start at midnight
                    (number of items from 1 to 3000) - just that !

    eg  list:MOTOR maxitems:0 maxsize:300 ticker-items:100 ticker-size:60
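
To make the sub-parameter defaults concrete, here is a small Python sketch that reads one 'list:' line. It is an illustration only: the parser, its names and the whitespace-separated key:value layout are assumptions drawn from the example line, and the 'default-maxsize' keyword (which can change the maxsize default) is not modelled.

```python
# Hypothetical sketch: read the sub-parameters of one 'list:' line.
# Not the ipw4 parser; defaults are those stated in the text above.

DEFAULTS = {
    "maxitems": 0,      # zero means all items
    "maxsize": 1000,    # chrs of text per item
    "ticker-size": 60,  # chrs per ticker item
}


def parse_list_line(line):
    """Return (list name, parameters) for a line such as 'list:MOTOR maxitems:0'."""
    fields = line.split()
    name = fields[0].split(":", 1)[1]
    params = dict(DEFAULTS)
    for field in fields[1:]:
        key, _, value = field.partition(":")
        params[key] = int(value) if value.isdigit() else value
    return name, params


name, params = parse_list_line(
    "list:MOTOR maxitems:0 maxsize:300 ticker-items:100 ticker-size:60"
)
print(name, params)

# A list with no 'ticker-items' has no ticker.
plain_name, plain = parse_list_line("list:SPORT_SUNDAY pub:sunday")
print(plain_name, "ticker-items" in plain)
```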

    pub:    define publication - optional, use only for multi-pub sites.
        The same parameter is added to each 'list' line which
        means that any audit message will be restricted to that pub
    eg  pub:sunday
    before: text to add at the top of the data file.
    after:  text to add at the bottom of the data file.
    filebefore: file to add at the top of the data file.
    fileafter:  file to add at the bottom of the data file.
    entry:  List entry for each file in HTML with FipSeq.
        This is the directory line in the LIST.
        Special care should be taken if you need to change from the default as
        certain key fields are required for Audit and Search.
        These include the '<!-- @@## -->' and the first '<br>'.
        If more than one entry is specified (up to 100 are allowed), the first (ie top of
        file) is considered the default.
        Syntax: entry:(name)    (HTML in FipSeq)
        see below for an example
    entry-abstract: ditto for the abstract part of the list (ie the bit underneath
        the clickable link to the data)
    search-entry: ditto for the search entry which defaults to
        "\\WQ/\\WN|\\$U|\\WK|\\WM|"
    ticker-entry:   The List entry for Tickers.
        This is generally fairly short with no or few comments to reduce the size of
        each ticker list.
    script: Run a script after the file has been written.
    log:    Item log entry if not default.
    folder: name of a sub-folder under /fip/data/w4 for this list.
        This should be used for your own scripts as the standard
        Fip w4 does not normally track folders.
    default-maxsize:(number of bytes)   default is 1000 bytes
    metadata-for-source: Define the MetaData for a particular source.
        Do NOT put a tab in or NL or CR.
        syntax is   metadata-for-source:(agencyName)    (Meta Strings)
        Default: the 'default-metadata' keyword
            or  'pri=\WP cat=\WC' for non-fip search
            and '\WP \WC' for fip search
        eg  metadata-for-source:WIRE2  sender=\XU ref=\XR
        The Headline (\WK), Source (\SU) and Filename (\SN) are always added
        automatically and do not need to be specified.
    default-metadata: Define default search metadata
        Default meta is 'pri=\WP cat=\WC' for non-fip search
            and '\WP \WC' for fip search

Other less often used parameters :
    missing-list: (name of list)
        If the file is NOT in any other list, it is added to this one.    default: file is ignored
    default-list-parameters: (name of list)
        Use this 'list' for all default parameters NOT specified.
    syndication-list: (FipSeq containing name of client)
    default-list-for-syndication: (name of list)
        Any file that matches a 'dest' but the 'list' is not specified
        uses the parameters as specified by this 'list' but is named
        as in the 'dest' line.
        ie if DA:biggles and DU:w4planes :
            syndication-list:\DA
            list:heros  maxitems:0
            default-list-for-syndication:heros
        so the list will be called BIGGLES with no check on the number of items.
    use-hour-folders:
        Where there are masses of data, store the files in hour folders
        in order to improve disk access.
    number: default number system - octal,decimal or hex.
    chkexists: for NFS or NT mapped drives, a check-file to make sure the
        drive is valid
    outputdrive: (NT only) drive letter for data
    audit-msg: Html string to replace the default audit message
        default: <font color=\"green\">Fetched by \\WA at \\WT<br></font>
    audit-text: Html string to replace the default audit text point
    output-filename: change the output filename
    supercede:  files with the same name are normally replaced
    no-supercede:   files with the same name are normally replaced
            use this to create a new file every time.
    owner:  Unix only, logon of the owner of the files if not yours
    archive:    Archive the file in log/data
    NewSU:      FipHdr field for source if NOT 'SU'
    wild:       wild string chr for matching if not the default '*'
    singlewild: wild single chr for matching if not the default '?'
    hostname:   Name of this host if not that booted from (for IP address)
    log-unmatched-files: If a file is NOT in any list - log it with a !ox flag
    allow-deletes:  Allow Delete tokens to zap files and list items - default:no
    chrmap: (old 8 bit chr) (replacement 8 bit chr) default:no
        chrmap:\236\243
        for FipHdr fields only
    list-end-of-line:   String (in FipSeq) to flag an end-of-line in a List or
        Search
        default is none - all end-of-lines are translated to a space.
        Take care not to reuse a special chr which you are using to flag something
        else.
        In particular the text-marker which is usually a TAB or a NL/CR which are
        end-of-item.
    default-unwrap-abstract: yes/no
        If the abstract text is wrapped - at 64 chrs for example - use this to put the
        list-end-of-line marker at the end of a para.
        Each file is checked for the optional FipHdr field W4_ABSTRACT_UNWRAP: yes/no
        which, if found, will override the default.
    zap-xml-abstract: yes/no
        Zap all xml <p>, <br> etc in the abstract   default: just zap the < and >
        Each file is checked for the optional FipHdr field W4_ABSTRACT_ZAP_XML:
        yes/no which, if found, will override the default.
    hdr-hash:\005
        A single chr (usually a control chr - \005 or \035) to use internally in
        place of a hash '#' in the FipHdr
        default is 035
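
The effect of 'default-unwrap-abstract' together with 'list-end-of-line' can be pictured with the Python sketch below. It is a guess at the general idea, not ipw4's algorithm: the function name is hypothetical, and it ASSUMES that a blank line marks the end of a paragraph, which the text above does not state.

```python
# Hypothetical sketch of abstract unwrapping; not the ipw4 implementation.

def unwrap_abstract(text, eol_marker):
    """Join the wrapped lines of each paragraph with single spaces and
    put eol_marker at the end of every paragraph.

    ASSUMPTION: a blank line separates paragraphs.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return "".join(" ".join(p.split()) + eol_marker for p in paragraphs)


# Made-up sample text, wrapped over several lines.
wrapped = "Shares rose sharply in\nearly trading today.\n\nAnalysts were\nsurprised."
unwrapped = unwrap_abstract(wrapped, "<br>")
print(unwrapped)
```

With '<br>' as the marker, the two wrapped paragraphs come out as two single lines of text, each followed by the marker.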

    allow-flow: (version9)  Allow data to be input into the Fip Web Flow system
        version 0 for pre 2014 multi-instance mods (default)
        version 1 for multi-instance
    flow-default-section: default section       default: fip
    flow-default-status: default status     default: Input
    flow-unique-id: FipSeq for generating the unique-id if there is not a
            W4_FLOW_ID
            default:\\WR
    flow-ext:   File extension for files    default: fip
            Do Not add the '.'
            This should match any filemapping on the client side
            for flow_edit.pl or flow_read.pl
    flow-balance: Balance Group for all data files  default:none
    Files should have one or more of the FipHdr fields :
        W4_FLOW:
            This flag is needed to signal the file is part of a flow.
            no parameters required
        W4_FLOW_SECTION: (section name required - if not default)
        W4_FLOW_STATUS: (status required - if not default)
        W4_FLOW_ID:(actual ID to use)

        Optionally they can also have :
            W4_FLOW_L1: (data)
                ..
            W4_FLOW_L9: (data)
            These are extra fields for the LISTs in addition to the first line of data.
            They can be defaulted using parameters 'flow-default-1' etc

Plus the usual suspects for FipSeq - such as fixed: partial: combie: option:
repeat: style: replace: newdate: etc (pls link to http://www.fingerpost.co.uk
and look for FipSeq )

Ordinary incoming files are checked for FIP header fields :
    W4_TOP:     name of template file to add before the data of the file.
            The full path should be specified.
            default: none
    W4_BOTTOM:  name of template file to add after the data of the file.
            The full path should be specified.
            default: none
    W4_HTML_IN_LIST: This flag will NOT strip any HTML in the List file
            Normally all tags - HTML, SGML or XML are stripped for the list
            Nor are they counted in the 'chunks' for a list.
            ** Please label all Pictures this way : ie in sys/USERS
        w4reupix=   DP:localhost   DQ:2w4   DC:\SC  W4_HTML_IN_LIST:

    W4_TOP_LIST:    name of file to add before the List Entry.
            The full path should be specified.
            default: none
    W4_BOTTOM_LIST: name of template file to add after the List Entry.
            The full path should be specified.
            default: none
    W4_CHRSET: (chrset) Used with -C utf8 to flag files which are already UTF8 and
            so need no conversion
            This changes both fiphdrs and the abstract
            use W4_ABSTRACT_CHRSET: utf8 to change/flag the Abstract only
            use W4_FIPHDR_CHRSET: utf8 to change/flag the FipHdr only
            The chrset can be blank or utf8
    W4_ABSTRACT: (FipSeq)
            Replacement for the abstract in the List and Search from the data in this
            FipHdr
            which is normally the first bit of text OR the entry-abstract:(entryname)
            for that service
    W4_ABSTRACT_FILE: (FullPathName)
            Replacement for the abstract in the List and Search from the contents of
            this file
            which is normally the first bit of text OR the entry-abstract:(entryname)
            for that service
    W4_ABSTRACT_UNWRAP: yes/no
            unwrap/ do not unwrap the abstract for this file
            default: no
    W4_ABSTRACT_ZAP_XML: yes/no
            remove any XML tags from the abstract
            default: no
    W4_LIST_DATE: (yyyymmdd)
            Force the List/Search date to be this
            (default is current system time when the file hits the input folder)

FipHdr fields used include :
    WM: Mime Type
    WZ: Xchg to use when reading the file.
    WI: IP address of the host creating this
    DS: Supercede this file if it already exists    default: yes
    XD: DO NOT Supercede this file if it already exists default: yes
    WB: if the mimetype is NOT text, use this as replacement text
        for the list
    WN: filename
    WQ: subpath (the top path is assumed as /fip/data)
    WL: all the lists this file is in, semicolon separated
    WV: all the lists, space separated - for displaying
    WD: all the list DELTAS
    WG: all the list GROUPS
    WJ: Julian day of this file
    WH: Date of this file
    WC: Category
    WP: Priority
    WK: Headline
    WW: No of words (added 07y1)
    W$: No of chrs  (added 07y1)

For AUDIT messages, incoming files are checked for FIP header fields :
    WA: audit file logon
    WT: Time and date of audit
    WY: audit message
    WN: (From Data) filename
    WQ: (From Data) subpath (the top path is assumed as /fip/data)
    WL: (From Data) all the lists this file is in, comma separated
    WV: (From Data) all the lists, space separated - for displaying
    WD: (From Data) all the list DELTAS
    WJ: (From Data) Julian day of this file
    WH: (From Data) Date of this file

For DELETE messages, incoming files are checked for FIP header fields :
    WX: Security checksum for this file
    WA: logon of the delete person
    WT: Time and date of delete
    WN: (From Data) filename
    WQ: (From Data) subpath (the top path is assumed as /fip/data)
    WL: (From Data) all the lists this file is in, comma separated
    WV: (From Data) all the lists, space separated - for displaying
    WD: (From Data) all the list DELTAS (semicolon separated)
    WJ: (From Data) Julian day of this file
    WH: (From Data) Date of this file

For Flow messages, ipw4 will ADD the following FipHdr fields :
    WR  Duid
    WF  1stline of text
(Section and Status are implicit in the Flow system and are NOT carried in
FipHdr fields)

IPW4 uses the following environment variables :
    FIP_W4_defEQ        default queue       default: general
    FIP_W4_LINE     default line length for \$L def: 80
    FIP_W4_WORD     default word length for \$W def: 6

    \$2 is the second line of text
        ..
    \$9 is the ninth line of text

Input switches (all optional) :
    -0 : Use Old Version 0 format files     default: current version
    -9 : run in Speedy mode          default: no
    -a : alert file if not the default which is
        no publications specified   : tables/w4/ALERT
        publications specified      : tables/w4/ALERT_PUBLICATION
    -c : check this queue or file exists before writing files
        (for NFS and other mounted queues
        - see CHKEXISTS above)          default: no
    -C : convert list entry characters to ..    default: unconverted
        -C utf8     convert to utf8
    -d : Output Drive (WINNT only)          default: drive with Fip on
            This is overridden by the 'outputdrive' keyword.
    -D : name of a done queue for input files after processing.
        If this does not start with a '/', it is assumed to be under /fip/spool.
                            default: files are deleted
    -f : default flow path              default: /fip/data/flow
    -F : default no of flow sub queues (before 07r was 256). default: 100
    -g : do NOT make search Group lists     default: do
    -l : log all files              default: do NOT log
    -L : do NOT log files               default: do NOT log
    -m : UNIX file mask - input to umask for file creation.
        Pls remember this is input as a DECIMAL number while
        access is normally an octal ie -m 420 = 0644.
                        default: 0 for rw-rw-rw- access
    -N : use the next/previous flags        default: do not
    -o : Output path name
        default for Version 0 : /fip/spool/w4data
        If this does not start with a '/', it is assumed to be under /fip/spool.
        default for other versions : /fip/data/w4/
    -q : queue to scan              default: 2w4
    -Q : keep quiet if the queue for the incoming file does not exist
        or there are too many duplicates. default:no
    -r : reindex - just reindex incoming (resent) files.
        do not add to the lists.        default: no
        do not add the files either.
    -R : reindex - just reindex incoming (resent) files.
        do not add to the lists.        default: no
    -s : using external Search          default: fip search
        with a search Group list too
    -S : using external Search          default: fip search
        WITHOUT a search Group list too
    -t : sleep time between scans           default: 1 sec
    -T : name of search tickers file        default: none
    -u : default owner for ALL files.       default: that of 'ip'
        This may be overridden by the 'owner' parameter.
    -V : version                    default: 8
        0 - html lists
        5 - audit in list
        8 - filesize in lists
    -X : No Search file nor Index file required default: fip search
    -z : default parameter file         default: tables/w4/W4
    -v : print version number and exit
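
Two of the switch conventions above are easy to get wrong, so here is a short Python sketch of each. Both helpers are hypothetical and are not part of ipw4; the queue name 'w4done' is made up. The first applies the rule given for -D and -o (a name not starting with '/' is taken to be under /fip/spool); the second converts a mask written in octal to the DECIMAL number that -m expects.

```python
# Hypothetical helpers illustrating two switch conventions; not ipw4 code.
import posixpath


def resolve_queue(name, base="/fip/spool"):
    """Apply the rule given for -D and -o: a name that does not start
    with '/' is taken to be under /fip/spool."""
    if name.startswith("/"):
        return name
    return posixpath.join(base, name)


def mask_for_switch(octal_text):
    """Convert a mask written in octal (eg '0644') to the DECIMAL
    number that the -m switch expects."""
    return int(octal_text, 8)


print(resolve_queue("w4done"))        # /fip/spool/w4done
print(resolve_queue("/data/w4done"))  # /data/w4done
print(mask_for_switch("0644"))        # 420
```

The last line reproduces the example in the -m description: octal 0644 is decimal 420.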

---------- Example ----------

pub:herald
pub:times
pub:sunday

; Text at start of file - Put time stamp and cross references at the end of the file
filebefore:/fip/web/setup/w4.file.top

; Text at end of file
fileafter:/fip/web/setup/w4.file.bottom

; The aim is to have a cross reference to a file in a directory below this level,
; with the SU as the name of the directory where stories are saved
entry:default <DT><!-- \$U CAT:\XC PRI:\XP --><a
href="/fip-cgi/pick_showlist.pl?Fipid=91251948919514&file=19981201_rtr/reu4052.0502.html"
TARGET="wirecopy_window"> <IMG SRC=/fip-pages/gifs/crush.gif width=10 height=10
border=0> </a><A HREF="/fip-cgi/wir_readfile.pl?Fipid=##FIPID##&file=\WQ/\WN"
TARGET="wirecopy_window">\WK</A>\s<FONT SIZE=-1 FACE="Helvetica"
COLOR="red">(\s\$D \$M \$Y,\s\$H:\$N<!-- ##@@ -->\s)</FONT><BR>

; Run Verity index program afterwards
script:/bin/echo "/fip/data/w4/files/\WQ/\WN" > /fip/spool/2verity/\WN

; Actual lists

list:ALL_WIRES_HER  maxitems:0  maxsize:300 ticker-items:100    ticker-size:25  pub:herald
list:ALL_WIRES_SUN  maxitems:0  maxsize:300 ticker-items:100    ticker-size:25  pub:sunday
list:AP_ADVISORIES_HER  maxitems:0  maxsize:300 ticker-items:100    ticker-size:25  pub:herald

; ----------------------------------------------------------------------
; Associated Press/Press Association/Reuters
dest:all_wires    list:ALL_WIRES_HER,ALL_WIRES_SUN

;
; Associated Press
;
dest:ap_advisories  list:AP_ADVISORIES_HER,AP_ADVISORIES_SUN,AP_ADVISORIES_ET
; NO Financial for Evening Times
dest:ap_financial   list:AP_FINANCIAL_HER,AP_FINANCIAL_SUN

audit-msg:Read by \WA at \WT

----------------------------------------------------------------------
 Notes

- Installation
Do you need to run UTF8 ???
    SYSTEM - ipw4 -l -N -C utf8 -T sticker
    USERS
    - just text needs
        w4cp1251    DP:localhost    DQ:2w4 W4_ABSTRACT_DC:W4ABS DC:\SC CX:PREW4 W4_ABSTRACT_CHRSET:UTF8
    - fiphdr and text
        w4cp1251    DP:localhost    DQ:2w4 W4_ABSTRACT_DC:W4ABS DC:\SC CX:PREW4 W4_CHRSET:UTF8
    - xchg
    (SC)2W4ABS
; W4 Abstract -  Russian (CP1251) to UTF8
;
; Default character set
c:isoascii

z:chghdr:IH,HK
z:convert-fiphdr:utf8,map

z:unicode-map:CP1251.TXT

; Convert to UTF-8
z:convert-to-utf8

- Tuning points
For the OLD system (-0 input switch or pre version 05 of ipw4) if you have any
Lists which will :
    either  have more than 0.5 mb of data at the end of the day
    or  have more than 3 or 4 items per minute
Then use the 'refresh' parameter in the list to put the last x secs worth of
data into the cache file. This does not affect searching or anything else - but
it really speeds up processing. In particular this covers wires like Dow Jones,
Bloomberg, Business Wire and Bridge/KR.

This is not applicable for ipw4 05+ as the new lists are in chunks of 100
files.

- examples for the SYSTEM
Using Glimpse as the Search - including Groups (or Collated)
w4  local   ipw4 -l -s

Using Verity as the Search - Excluding Groups (or Collated)
w4  local   ipw4 -l -S

Using Fip Search - with Groups
w4  local   ipw4 -l

Using Fip Search - withOUT Groups
w4  local   ipw4 -l -g

-- What if it is not text in the incoming file
    - check a couple of areas
            1.1 FipHdr field    WM does NOT start 'text'
            1.2 and there is NO fiphdr marker W4_TEXT_REPLACEMENT
                Nothing will be put out - except the contents of an optional fiphdr field
                WB
    - and/or    2.1 add FipHdr field W4_LIST_ABSTRACT with the text/html to
    - and/or    3.1 match the entry-abstract

----------------------------------------------------------------------

Version Control
;07z39  25jul02 added flow (do not use versions 7a or 7b) (7d for WR)
    ;h 13may03 audit on other sys was broken.
    ;i 10jun03 flow - added delete BEFORE adding search link
    ;j 08aug03 bugette with large files.
    ;k-o 23apr04 bugette with Audit...
    ;p-q 05sep04 zippy and timing stats
    ;r-u 03feb05 added -F for flow queues 256 -> 100
    ;v-w 18apr05 added flow-balance-group
    ;x1 23sep07 buggette in filename - not always unique!
    ;y1 19mar08 redid search meta to allow for FipSearch too/add WW and W$
        ;2-3 16may08 added -C utf8 and chrmap
    ;z4 6jun08 added next/prv 'np' and -N ;5-6 bugette in search WW/W$
        ;8 added default-maxsize ;9 18nov08 added W4_CHRSET
        ;10 28nov08 bugette with utf8 ;11-12 entry-abstract added
        ;13-14 29dec08 added W4_DATE ; 15 bugette with utf8
        ;16-18 added eolnList and unwrapAbs
        ;19-24 added W4_ABSTRACT_FILE/UNWRAP/ZAP_XML/CHRSET
        ;25 added missing-list: to hold files not in any other list ;26 minor bugette
        if zero length file
        ;27-28 added hdr-hash and bugette with size of W$
        ;30 30apr14 file-trace ;31 buffer sizes -> STDBUF ;32 added zap-xml-abstract
        ;33 allow-flow:1 for multi-instance
        ;34 12may15 made search-entry variable ;35 added SX tracking !
        ;36-39 better NPseqno - and max items for Ticker
;006l   14feb01 added mimetypes with different entries
    ;a/b 23feb01 added WG: for groups as fiphdr field in file and -X
    ;c/d/e 26feb01 added W4_HTML_IN_LIST:
    ;f 08may01 maint:none
    ;g/h/i 10sep01 testforlist not catching NOT fields (XC#ABC or XC!=ABC)
    ;j/k 18mar02 added syndication stuff to version 1
    ;l 14may02 added log-unmatched-files
;005g   09aug00 version 2 lists -cdef
    ;b added -S and w4index for external searches
    ;c 17oct00 bugette for txt64 for BIG (>64k) files
    ;d 02nov00 added Groups plus -Reindex
    ;e 14nov00 added metadata-for-source, audit balanced.
    ;f 26nov00 added -s and addGroupSearch
    ;g 14feb01 cleanup

(copyright) 2017 and previous years FingerPost Ltd.