webwire

NOTE - These FipHdrs are used internally by webwire
    C0-9 for cookies
    F0-9 for forms
So pls do NOT use them for 'fiphdr:XX ....' etc

    webwire

FOR HTTPS/port 443, please use the 'webwiressl' version of this program.

Webwire goes and gets pages of data from Other people's web sites automatically
and then sends those pages to your destination - usually the editorial system -
in the normal Fip fashion.

These can be updates of weather, financial data, sports results, backup for
Wire services if the satellite is down (those were the days !), graphics,
software. In fact most things.

It can be used either :
    - on a timed basis to get regular known pages.
    - on demand by sending a file into spool/webpoll with the FipHdr field DF set
to the parameter file required.

What it can do -
    - drill down links to several layers deep,
        optionally ignoring the data on the top levels.
    - select only certain links - either in XML, HTML, JSON or CSV
        - you set masks to filter which to get and which to ignore.
    - logon automatically to protected sites
        and save Cookie information for use in later accesses.
    - fill in standard form data to make on-demand searches.
    - strip or rework HTML tags to make the data more presentable.
        This is meant for reasonably simple pages while more complicated ones
        will be routed through 'ipsgml' and/or 'ipxchg'.
    - Use an external list of values to make several grabs to the same
site/page/script
        but varying the search data for each hit. eg to pull all the values of a
financial index. (This we call a 'values-file')
    - Grab a 'id' from a List-of-items from a REST web service and then
sequentially call all items

What it cannot do -
    - play tunes.
    - run javascripts or any other applet type affairs. (yet..)
    - run FTP, GOPHER or whatever (for these and especially FTP, see program
'ipftp' and 'iptimer').

The current version is primarily for getting text data but can be used for
images etc if required.

There is a TUNING mode to be used for setting up a new link and trying to clean
up the relevant parameter file WITHOUT sending (possibly) live data to the
required destination.
    - This shows the data with escaped unprintables and '$' at the end of a line.
    - All links and forms are also displayed.
    - Any pages saved in Tuning mode are NOT sent to the normal output queue
(spool/2go) but are left in spool/webtest for future perusal and/or deletion.
    - To run, choose your parameter file in tables/webwire and run 'webwire'
manually in a window:
        webwire -T AUS.STOX | more      for prompt before calls
    or  webwire -A -T AUS.STOX | tee aussies    for no prompting

There are Two (sometimes three) types of parameter file :
    1. Main Parameter file which sets up the polling of certain pages at set times
(if any).
    2. A Page Description file for each site/page accessed.
    3. Optional lookup file of values where you want to repetitively hit a site
changing certain values each time. (eg a sport site for several divisions or a
list of stox to get)

----- Main Parameter file -----

The syntax of the Main Parameter File - by default tables/webwire/XWEB :
    ; comment line
    poll:(pram file)    day:(MTuWThFSaSu)   time:20:30  mustget:

In detail, the 'poll' keyword :
    Pram file is the name of the Page Description file - see below for its syntax
    day:    Day of week to run the job :
            M   Monday
            Tu  Tuesday
            W   Wednesday
            Th  Thursday
            F   Friday
            Sa  Saturday
            Su  Sunday
            X   Every day.
            Z   Weekdays M-F.
            Case is NOT important.
            Commas (but NOT spaces) may be used to separate.
            Default is every day.
    either
    time:   Time of the day on 24 hour clock.   Default is 18:00.
    or
    every:  interval between grabs          Default: none
        every: (mins)   [(optional) start:(starttime) end:(endtime)]
        every:30    start:07:30 end:19:00
        The minimum interval is 1 min and maximum is 3 hours (ie every:180 mins)
        You may also specify in seconds using 'secs' or 'seconds'
        immediately after the number (with no spaces)
            every:10secs    start:09:30 end:09:50
eg:
    poll:AP     day:ALL     time:20:10
        Get the Page file tables/webwire/AP every day at 20:10
    poll:Forex  day:MTuWThF time:16:30
    poll:Forex  day:MTuWThF time:16:40
        Get the Page file tables/webwire/Forex every week day at 16:30 and 16:40

There can be none or up to 200 polls in the main parameter file.
Note that the page is grabbed ONLY if the program is running.
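
The 'day:' and 'every:' rules above can be sketched in Python as below. This is
only an illustration of the syntax as described, NOT webwire's own code: the
function names are made up, and the 1-180 range check is assumed to apply only
to values given in minutes.

```python
import re

# Day codes as listed for the 'day:' subkeyword; X and Z are the shorthands.
DAY_CODES = {"m": 0, "tu": 1, "w": 2, "th": 3, "f": 4, "sa": 5, "su": 6}
SHORTHANDS = {"x": set(range(7)), "z": set(range(5))}


def parse_days(spec):
    """Return the set of weekday numbers (0 = Monday) named by a day: value."""
    days = set()
    text = spec.replace(",", "").lower()      # commas optional, case ignored
    pos = 0
    while pos < len(text):
        two, one = text[pos:pos + 2], text[pos]
        if two in DAY_CODES:                  # try Tu, Th, Sa, Su first
            days.add(DAY_CODES[two])
            pos += 2
        elif one in DAY_CODES:
            days.add(DAY_CODES[one])
            pos += 1
        elif one in SHORTHANDS:
            days |= SHORTHANDS[one]
            pos += 1
        else:
            raise ValueError(f"unknown day code at {text[pos:]!r}")
    return days or set(range(7))              # nothing given means every day


def parse_every(value):
    """Return the interval in seconds for an every: value (30 or 10secs)."""
    match = re.fullmatch(r"(\d+)(secs|seconds)?", value.strip().lower())
    if not match:
        raise ValueError(f"bad interval {value!r}")
    number, unit = int(match.group(1)), match.group(2)
    if unit:
        return number                         # already in seconds
    if not 1 <= number <= 180:                # assumption: limits are in minutes
        raise ValueError("interval must be between 1 and 180 minutes")
    return number * 60


print(sorted(parse_days("MTuWThF")))
print(parse_every("30"), parse_every("10secs"))
```

Here 'MTuWThF' and 'Z' resolve to the same five days, and 'every:30' is 1800
seconds.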

----- Page Description Parameter files -----

The individual Page description parameter files are also in tables/webwire. The
syntax of these is :
    ; comments start with a semi colon like this

MANDATORY
    url:    Full url of the page.               default: none
        There MUST be one and only one 'url:' specified.
        You can also specify the page, cgi and any subparameters.
        eg  url:www.fingerpost.co.uk
            url:www.big-press-org/sports/baseball/index.htm
            url:www.marketlook.co.uk/scripts/Summary.dll?HandleSummary

    dest:   Fip Destination for the files           default: WEBDATA
        This is the 'DU' FipHdr field as per the USERS file.
        eg  dest:w3saves

OPTIONAL:
    use-tls: no/yes
    use-ssl: no/yes
    use-https: no/yes
        Use Secure Sockets Layer (TLS/SSL) - also called HTTPS  default: no
        If the url starts 'https://....' then this command is NOT needed.
        (There is also a setup option for openssl s/r to use either Bio or SSL
functions for the secure connection)
            use-ssl:BIO
            use-ssl:SSL
    ssl-method: tls1.3 tls tls1 tls1.1 tls1.2 sslv2 sslv3 sslv2and3
        Version number to use for TLS/SSL       default: 999 for current default (2 or 3)
        (only the digits are significant, so add other text to make it readable)
        For 'modern' connection, pls do NOT use sslv2 ! as it is deemed insecure
        If default it will check the available list and pick the highest.
        The default is currently 23 which on a modern server is sslv3 and tls1_2 !
    ssl-password: (password)
    ssl-passwd: (password)                default: none
        Optional password if the handshake requires a shared secret
    ssl-key: (name of a certificate key file)       default: none
    ssl-cert: (name of a certificate file)      default: none
    ssl-root-cert: (name of a root PEM certificate file)    default: none
        Optional certificates are in tables/ssl unless name starts with '/'
    ssl-verify: yes/no  verify server certificates  default: yes
    ssl-ciphers: (list) acceptable ciphers
        (use 'openssl ciphers' to list)
        default: 
"ECDH+AESGCM:ECDH+CHACHA20:ECDH+AES256:ECDH+AES128:!aNULL:!SHA1:!AESCCM"
        pre 2021oct default: 
"ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!MD5:!DSS"
        pre 2017 default:  "HIGH:!aNULL:!kRSA:!SRP:!PSK:!CAMELLIA:!RC4:!MD5:!DSS"
    ssl-display: yes/no display SSL connection details  default: no

    port:   Port number of the Remote Server.       default: 80
        This forces the port to be this if none is specified.
    nofiphdr: Do NOT add a Fip Hdr to the file.     default: yes pls
    source: Fip Source of the files. (FipHdr 'SU').     default: XWEB
        Unless 'noarchive' is specified, all data files will be archived under this
name in log/data.
        This can be in FipSeq so that 'combie' can be used to set a default..
    noarchive: Do NOT archive these files in log/data.  default: archive
    maxlevel:3  Maximum no of levels to drill down. default: 1
        Normally the URL you have requested is the data you want.
        However if that is an index page with links that may change, it may be these
lower-level pages that are needed. 'maxlevel' states how many levels of link
down the actual data pages are.
        Default is 1 = do NOT drill down any of the links.
        Note that level 1 is the first page.
    ignorelevel: Used with 'maxlevel' where the information     def: no
        required is on a linked page and NOT on the first page,
        use 'ignorelevel' to ignore all those pages on intermediate levels.  Note
that level 1 is the first page.
        eg  ; ignore levels 1, 2, 4 and 6
            ignorelevel:1,2,4,6
    matchlinks: Only follow links which match this mask.    def: all links
        Used only if 'maxlevel' is greater than 1.
        There can be many 'matchlinks'.
        Use the '*' as a wild card string and '?' as a wild chr.
        eg  ; get all links ENDING 'html'
            matchlinks:*html
    matchforms: Only process forms which match this mask.   default:no forms
        Used only if 'maxlevel' is greater than 1.
        There can be many 'matchforms'.
        Use the '*' as a wild card string and '?' as a wild chr.
            eg  ; get all forms ENDING 'asp'
                matchforms:*asp
    matchframes: Only follow frames which match this mask.  def: all frames
        Used only if 'maxlevel' is greater than 1.
        There can be many 'matchframes'.
        Use the '*' as a wild card string and '?' as a wild chr.
        eg  ; get all frames ENDING '.top'
            matchframes:*.top
    matchkeys: Only follow links which match this test. def: all links
        Used only if 'maxlevel' is greater than 1.
        Used only for 'linktag' where an attribute MUST be set for the link to be
valid
        There can be many 'matchkeys'.
        Use the '*' as a wild card string and '?' as a wild chr.
        eg  ; <hotel id=33 name="Fawlty Towers" url="http://www.ohnonotagain.com"
status="current" />
            linktag:hotel@url
            matchkeys:hotel@status=current
            matchkeys:hotel@status=ready
    match-case-sensitive: yes/no
        all matches and ignores can be case sensitive or in-sensitive
        DEFAULT changed 05u to INsensitive - previously sensitive.
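
As a rough Python sketch of how such masks behave: '*' stands for any string,
'?' for one chr, and matching is case-insensitive by default. The function
names are made up for illustration. Two things are ASSUMED here and not taken
from webwire itself: a mask must cover the whole link, and an ignore mask
(see 'ignorelinks') wins over a match mask.

```python
import re


def mask_to_regex(mask, wild="*", singlewild="?", case_sensitive=False):
    """Compile a mask into a regular expression."""
    parts = []
    for ch in mask:
        if ch == wild:
            parts.append(".*")                # wild string: any run of chrs
        elif ch == singlewild:
            parts.append(".")                 # wild chr: exactly one chr
        else:
            parts.append(re.escape(ch))       # everything else is literal
    flags = re.DOTALL | (0 if case_sensitive else re.IGNORECASE)
    return re.compile("".join(parts), flags)


def link_wanted(link, matchlinks, ignorelinks):
    """Keep a link that fits a match mask (if any) and no ignore mask."""
    # Assumption: masks are tested against the whole link (fullmatch).
    matched = not matchlinks or any(
        mask_to_regex(m).fullmatch(link) for m in matchlinks)
    # Assumption: an ignore mask overrides a match mask.
    ignored = any(mask_to_regex(m).fullmatch(link) for m in ignorelinks)
    return matched and not ignored


print(link_wanted("news/story1.html", ["*html"], ["*index*"]))
print(link_wanted("news/index.html", ["*html"], ["*index*"]))
```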
    match-dedup: (FipSeq)
        Check and ignore Sequential duplicate items with (possibly) diff urls -
FipSeq
        It can be the same as skip-save-data if that is used, else use \W$ for the
normal grab url
        match-dedup:\VX\Q6-\Q8\$o

    force-lower-levels: (levelNumber)
        When data is on more than one level - maybe a text page has a link to a PDF
and you need both bits, use this to get all bits of this element before
continuing with the next element.
        The default (without this parameter) is to get all this level and then all
the next lowest level etc.
        ; force the lower levels below level 2
        force-lower-levels:2
    mime-type-fiphdr:(2 letter FipHdr field)
        if the MimeType is present, add the mime-type  to the fiphdr as  this fiphdr
field
    level-fiphdr:(2 letter FipHdr field)
        add the Level of this file to the fiphdr as this fiphdr field
        This can be used for option inside the parameter file:
            level-fiphdr:AL
            option:V1   AL,,,,1,
            option:V2   AL,,,,2,
            filename:level\V1ONE\$o\V2TWO\$o_file
    level-link-fiphdr: (FipSeq - 2 letter FipHdr)
        This gives access to the top level link in force-lower-levels
        eg  force-lower-levels:2
            level-link-fiphdr:C1
        so C1 is for level 1 link, C2 for level 2 etc
        If you want only a part of the link use FipSeq to pull apart
        there is no default

    top-level-nextpage:(FipSeq)
        Some BIG feeds will only return the first 'n' hundred items of a list of
items - eg S3 is up to 1000
        Use this to name (in FipSeq) the FipHdr field into which an input tag is
saved; while that field has data, webwire loops for more
        eg  ; FH for next page
            top-level-nextpage:\JY
            ; save the contents of NextContinuationToken tag at the top level 1
            fiphdr:JY   level:1 tag:NextContinuationToken
            ; only add 'continuation-token=' and the Token if the Token HAS data
            option:VY   JY
            fixed:P1
\VYcontinuation-token=\JY&\$odelimiter=/&encoding-type=url&list-type=2&max-keys=\JG&prefix=\JE
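
The looping idea behind 'top-level-nextpage' can be sketched in Python as
follows. This is only an illustration of paging on a continuation token; the
function name and the canned pages are made up, and no real service is called.

```python
def fetch_all_items(fetch_page):
    """Collect items from every page, following continuation tokens."""
    items, token = [], None
    while True:
        page = fetch_page(token)              # token is None on the first call
        items.extend(page["items"])
        token = page.get("NextContinuationToken")
        if not token:                         # no token: that was the last page
            return items


# Stand-in for the remote list service: three pages chained by token.
FAKE_PAGES = {
    None: {"items": ["a", "b"], "NextContinuationToken": "t1"},
    "t1": {"items": ["c", "d"], "NextContinuationToken": "t2"},
    "t2": {"items": ["e"]},
}

print(fetch_all_items(lambda token: FAKE_PAGES[token]))
```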

    skip-links: Name of a file in /fip/fix/webwire holding names of links
        and forms already accessed; so that only new ones are tried.
            eg  skip-links:webwirelinks.\$d
            default: none
    skip-details-tag: (tagname) extra details (such as a publishdate) to check if
existing links have been updated
            see below on the section for RSS feeds
            default: none
    skip-purge-after: (hours) Number of hours to keep the skip entry
        default is 24.  You might want to tune this :
            - make bigger if sites add/take off old material
            - reduce the time if the same link is used for different data
    skip-save-data: (FipSeq field)
        Sometimes there is some data in the link which changes for every access -
such as a Cookie or SessionId
        eg the first access might get
            search.do;jsessionid=A9823A4622A23C10C4EC7F1825BF9E26.node1?messageId=268482
        and the second
            search.do;jsessionid=FCC18E9582E77C2AD9EFE6C68CA0F0A2.node1?messageId=268482
        But they both happen to be the same file - messageId=268482
        Use FipSeq to just get the data that contains ONLY the information you want
to save.
        Certain FipHdr fields hold relevant info:
            WX is the field marker '^'
            WS is the skip details tag (optional - see above)
            WT is the type - 'a'-anchor
            WL is the level no
            W$ is the actual link - anchor, form etc
            S$ is the actual hostname or IP address
            WH is the associated display text from an anchor tag
        In the above example :
            ; split on the '?' - get the second field
            repeat:Q1   W$,?.2
            ; skip string is now 'messageId=268482' - note the FipSeq needs a backslash
            skip-save-data:\Q1
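
The effect of that split can be checked with a few lines of Python. This is
just a sketch: the function name is made up, and falling back to the whole
link when there is no '?' is an assumption.

```python
def skip_key(link):
    """Return the second '?'-separated field of a link."""
    fields = link.split("?")
    # Assumption: with no '?' in the link, the whole link is the key.
    return fields[1] if len(fields) > 1 else link


first = ("search.do;jsessionid=A9823A4622A23C10C4EC7F1825BF9E26.node1"
         "?messageId=268482")
second = ("search.do;jsessionid=FCC18E9582E77C2AD9EFE6C68CA0F0A2.node1"
          "?messageId=268482")

print(skip_key(first))
print(skip_key(first) == skip_key(second))
```

Both links reduce to 'messageId=268482', so the second access is seen as
already done.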

    skip-balance-group: name of a balance group (in tables/sys/BALANCE) to
distribute
        the skip file when changed (see doc on 'ipbalan')
        This is often used where a second system could be used as a redundant server
        if the main system fails. (see also -B input switch)
    ignorelinks:    Of the Links found, skip any matching this mask. default: all
links
        Used only if 'maxlevel' is greater than 1.
        There can be many 'ignorelinks'.
        Use the '*' as a wild card string and '?' as a wild chr.
        eg  ; ignore any links pointing at any 'netscape' or 'microsoft' site
            ignorelinks:*microsoft*
            ignorelinks:*netscape*
            ; ignore any links requiring 'ftp:'
            ignorelinks:ftp:// *
            ; ignore any links to other sections
            ignorelinks:../ *
            ; ignore any links to any index
            ignorelinks:*index*
    httphdr: Extra lines of HTTP header you may need.   default: none
        Remember to add a NL at the end of each line.
        There can be multiple httphdr lines but pls remember to add a '\n' at the
        end of each one. (or you can try to force all on one httphdr line!)
        eg  httphdr:Authorization: Basic AbGtGgbhpdOkOTE=\n
            httphdr:User-Agent: Mozilla/4.0\n
            httphdr:Host: wibble.wobble.com\n
        see below for 'useful, common header lines'
        ** ALL basic-authentication MUST BE HIGHER IN THE PARAMETER FILE THAN httphdr
OR proxy-logon
    httphdr-on-all-grabs:yes/no
        Normally the httphdr is only for a single host. So if the 2nd or subsequent
level is to a different host, by default, nothing defined as 'httphdr' will be
added.
        if 'yes', the option adds the httphdrs to all grabs
    httphdr-on-proxy:yes/no
        Normally the httphdr is only for data grabs NOT for getting thru the Proxy.
        if 'yes', the option adds the httphdrs to the proxy call

    basic-authentication: (fiphdr field) (logon:password)
        Build a FipHdr field with the BasicAuthentication formatted logon:password
        Pls remember to escape any funny chrs - like backslashes
        ** ALL basic-authentication MUST BE HIGHER IN THE PARAMETER FILE THAN httphdr
OR proxy-logon
        eg  basic-authentication:BA DOMMY\\zipple:Ardvark99
            httphdr:Authorization: Basic \BA\n

    method: POST/GET/DELETE/PUT etc             default: GET unless 'post:' is specified
        normally this is a single UPPERCASE action - with NO spaces.
    post:    Post a Form                    default: get url
        see below for processing a form using method=POST.
    filename: Filename for the output file in FipSeq.   default: WEB\$Z
    newname: ditto
        If this does NOT start with a '/' it is left under the Output Queue as
specified on startup (default spool/2go)
        eg  filename:AFP\$d.\$z
        eg  newname:#SN:\JC.\JK.\JZ#XX:\$u.z\$z.v\@v
        Note \@v is no of items in this file
        It is ignored if a -N (forcename) is specified as an input parameter
    supercede:(FipSeq which should resolve to yes|no)   default: no
        if supercede:yes, the contents of any existing file is overwritten
    striptags:(yes|no) Strip tags and attributes        default: no
    wild: (FipSeq)  Character used as a Wild String for default: '*'
        'matchlinks/ignorelinks'.
        eg  wild:\377
    singlewild: (FipSeq) Character used as a single     default: '?'
        Wild chr for 'matchlinks/ignorelinks'.
        eg  singlewild:!
    number: (o|d|h) Number system for FipSeq        default: octal
        octal, decimal or hexadecimal
        The following are all equivalent :
            number:octal
            before:\040
            number:decimal
            before:\032
            number:hex
            before:\020
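
A quick Python check that the three forms above are the same character (a
space, value 32). The function name and the BASES table are made up for the
illustration.

```python
BASES = {"octal": 8, "decimal": 10, "hex": 16}


def code_point(digits, number="octal"):
    """Convert the digits of a FipSeq-style escape in the given number system."""
    return int(digits, BASES[number])


for digits, number in (("040", "octal"), ("032", "decimal"), ("020", "hex")):
    value = code_point(digits, number)
    print(f"{number:8} {digits} = {value} = {chr(value)!r}")
```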
    before: FipSeq String to add before any data.       default: none
    after:  FipSeq String to add after any data.        default: none
    script: Script to run on the data of the incoming file. default: none
    outque: Output folder (in FipSeq)           default: spool/2go
        This overrides both the default and the '-o' output switch
        except for Testing/Tuning mode where the file is forced to spool/webtest.
    log:    FipSeq custom logging for the item log.     default:\SN \SU \EF : \EH,\EP
        This logs each Page grabbed
        Note that
            EH or ST    remote site host
            EP or SP    remote site port
            EN or SF or SG  remote site url     SG is the actual link, the others are the
link used to grab
            EF      parameter file used
        The default is that no incoming files are logged by webwire
    custom-log: FipSeq custom logging for the item log.     default: none
        This can be used to log link details in a custom log
/fip/log/webwire/(date)_(paramfile).fip
        eg  custom-log:pnac.\YN|date.\YT|procdate.\T7|taketime.\T9|source.\TU|take.\TZ|head.\TH
    log-errors:w (warn)         - for all communication errors
    log-https-errors: w (for warning)   - for all HTTPS comms errors
        Any failure to go secure in https connections are flagged as warnings
        The transmission is always aborted. This parameter affects only the logging.
        default:  !x for failures

    extra:
    extra-grab:
    extra-pre: Extra FipHdr fields (in FipSeq) to be added to the output.   default:
none
        To separate FipHdr fields, pls use a '#' or a newline.
        extra-pre is added as soon as the file is read - so may be used for
information in the URL
        extra is only used for any output file and is not used at all for any other
purpose.
        extra-grab is added before each grab
        eg  extra:ZH:NYNZ#DI:Headline News#QZ:333
            extra-grab:\nD1:\nD2:\nD3:\nD4:\$p\n
    tag:    FipSeq String to replace the start tag      default: none
        such as <H1>. There can be many 'tag's.
        eg  tag:P       {Para}\n
    endtag: FipSeq String to replace the End tag        default: none
        such as </P>, </TITLE>. There can be many 'endtag's.
        eg  endtag:TITLE    \n
    getimages: Also get all the images
        By default all images - *.gif or *.jpeg are ignored.
    keep-alive: yes/no                  default: no
        Just that ! default:no
    http-version: 1.0 or 1.1                default:1.0
    only-get-if-modified: (FipSeq for yes or no or etag)        default: NO for get data
each time
        This will check the remote server for the time the page was last modified.
This does not work with old servers and some set to HTTP/1.0.
        If remote data has been modified since, data is grabbed and processed
normally
        If not - it is ignored (unless logging is on)
        If the parameter is 'etag' then any incoming ETag tag is saved and subsequent
requests use 'If-None-Match: (ETag)'
    if-modified-suffix: (FipSeq)
        'only-get-if-modified' uses a save file named by the parameter file (or the
poll name)
        If there are several grabs using the same parameter file, they may need their
own separate times.
        (Otherwise they would all use the one, latest time for all grabs ! - not good
!)
        This adds a suffix to the name of the saved time file
        combie:QJ   AJ|BJ,json
        if-modified-suffix:\QJ
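
For reference, the standard HTTP conditional-request headers involved can be
sketched in Python as below. This shows the general HTTP mechanism only; the
function names are made up, and how webwire builds its own request is not
shown here.

```python
from email.utils import formatdate


def conditional_headers(last_fetch_epoch=None, etag=None):
    """Build the extra request headers for a conditional GET."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag       # the 'etag' style of check
    elif last_fetch_epoch is not None:
        # HTTP dates are always in GMT, eg 'Thu, 01 Jan 1970 00:00:00 GMT'
        headers["If-Modified-Since"] = formatdate(last_fetch_epoch, usegmt=True)
    return headers


def should_process(status_code):
    """A 304 Not Modified reply means the page is unchanged."""
    return status_code != 304


print(conditional_headers(last_fetch_epoch=0))
print(conditional_headers(etag='"abc123"'))
```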

    ignore-key:PHPSESSID
        When matching for skip files, ignore this key-value pair.
        see the section below on Repeat Offenders
    max-items: (number)                 default: 0 for all
        Max number of items to grab per session
        Some sites only allow you to read 5 or 10 items before blocking you.
        Use this to creep under that total.
        (from 6a20) The number can be in FipSeq.
        Note this is the number of files produced - ignoring Skipped files
        So if the number of linked grabs is 2 * a FipHdr field, use FipSeq 'sum' to
adjust.
        eg if AL:7 is a FipHdr field and 2 files per link are generated - THUMBNAIL
and HIRES
            sum:Q7  (\AL * 2)
            max-items:\Q7
        There can be a subparameter - level:(number) - where there are multiple
levels and you want to grab all the items on the lowest level BUT need to
track the previous level
    pause-between-files: (secs)
        Gap/wait/pause between grabs    default is 5 for standalone, 1 for iptimer
        This is overridden by the -w input switch
    one-output-file: Put ALL data in a single output file.
        The default is one file per page/access
        Use this with 'values' to create a single output file.
        This ONLY uses the FipHdr of the first file if 'values' have been specified.
    end-of-document: Where a site is sending really really crap HTML - or XML
        use this to state what the last tag is.
        For no checking at all : end-of-document:
        Default:        end-of-document:</HTML>
        See below for a standard-fingerpost-rant on crap HTML.....
    end-of-cookie-page: end text which signifies the end of a logon or cookie page
        This is rarely changed.
        default is </HTML>
    connection-retries: (number)
        No of retries for a failed connection or a broken connection (ie before a
response is received)
        Some slow sites are throttled and will kick the n+1 th connection off before
servicing it.
        Use this to retry. Default is 1 connection - ie NO retries.
    connection-timeout: (secs)
        Slow, busy sites, may take a lot longer than normal to connect to. Use this
to adjust the time to connect.
        Default is 90
    wait-end-timeout: (secs)
        For slow, busy sites, data - especially big files - may take a lot longer
than normal to be retrieved. Use this to expand that time.
        Default is 120 (it should be divisible by 5 for some arcane reason)
    pretend-301: (3 digit number)
    pretend-302: (3 digit number)
        Ignore redirects (HTTP return code 301/307 or 302/308) and assume they are
this return code
        pretend-301:200
        this will take a 301 and save the data as though it were an incoming file.
    dump-data:
        Save/Dump a copy of each request and response and data in a dump file in
/fip/dump default:no
    dump-filename: (FipSeq)
        Name to be appended to the dump filename in /fip/dump   default: none
    no-data: (FipSeq string in place of data)
        Do not get/send the data - just this string
    data-is-binary:(yes/no/maybe - can be FipSeq)
        Data files at the lowest level are binary or not
        default is check for <?xml, Tiff, Jpeg, MsWord/Office, EPS and PDF
automatically; otherwise it is treated as text
    ignore-mime-if-binary: (yes/no - can be FipSeq)
        if yes = Strip the MimeHeader off binary files
        default is no to leave it on - so you know what the file really is !
For Socks 4/5 - use these parameters to control
    use-socks:4/5 yes/no (yes is same as 5)
    socks-host: (hostname of the socks proxy)   no default
    socks-port: (port number of the socks proxy)    default: 1080
    socks-user: (user name for the socks proxy) no default
        if nothing specified, assumed that there is none
    socks-pwd: (password for the socks proxy)   no default

For old-style HTTP Proxies :
    proxy-server: If using a proxy, these are the name and port to aim at.
    proxy-port:
    proxy-logon: This is the logon and password to get thru the firewall if
required. The format is (logon) (colon) (password) and is converted to base 64.
        proxy-logon:Y2hyaXMuaHVnaGpvbmVzOnBhbnRoZXIK=

        ** ALL basic-authentication MUST BE HIGHER IN THE PARAMETER FILE THAN httphdr
OR proxy-logon
        To generate use basic-authentication or:
            echo -n "logon:password" | sffb64 -i
        eg  echo -n "chris:sleekpanther" | sffb64 -i
        gives   Y2hyaXM6c2xlZWtwYW50aGVy
            proxy-logon:Y2hyaXM6c2xlZWtwYW50aGVy=
    proxy-is-squid:yes/no   Is the proxy a Squid ?  default: no
    proxy-handshake:yes/no  Does the proxy need to CONNECT first ?  default: no
        If the proxy is a Squid, this MUST be NO
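
If 'sffb64' is not to hand, the same logon:password encoding can be reproduced
with standard base 64 in Python. This is a sketch only; the function name is
made up. Note that it gives the value shown above for 'chris:sleekpanther'
without the trailing '=' that appears on the proxy-logon line.

```python
import base64


def encode_logon(logon, password):
    """Base 64 encode 'logon:password' as used for Basic authentication."""
    raw = f"{logon}:{password}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


print(encode_logon("chris", "sleekpanther"))
```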

    logeachfile:(dest) Send a Success/failed msg to this destination
            for each file. There is no default. This log file is
            just a FipHdr with the following extra fields :
                DR-File Sent OK     DR:ok or DR:error
                DG-Will Retry later DG:retrying, DG:stopped
                DT-Some message text    DT:No connection
            default: no log created.
        The text for the DR and DG can be in FipSeq and so can contain
        FipHdr and other variables. As they are FipHdr fields, please
        do NOT put NL, CR etc in the fields.
        Note that System Variable \$q holds the time taken for transmission.
    DRgood:(text)   Message for the FipHdr field DR on a   successful tx
            default: ok
    DRbad: (text)   Message for the FipHdr field DR on a unsuccessful tx
            default: error
    DGcont:(text)   Message for the FipHdr field DG if, after an
            unsuccessful tx, another attempt will be made.
            default: retrying
    DGstop:(text)   Message for the FipHdr field DG if no further
            attempts will be made as the file was sent successfully
            or the maximum no of attempts has been tried.
            default: stopped
    fiphdr-for-logeachfile: (FipSeq) or
    msgeachfile:(FipSeq) Additional information to add to the FipHdr of the
            'logeachfile' or 'loglasterrfile' msg. This should be in FipHdr
            format and be in FipSeq. It can be used to pass FipHdr fields
            in the outgoing file into the log file.
            eg  msgeachfile:    DF:logdial\nSS:\SS\n
            default: nothing added

    convert-CDATA-sections:
        convert-CDATA-sections:no   - no dont ! (default)
        convert-CDATA-sections:zap  - no but zap the '<![CDATA[' and ']]>'
        convert-CDATA-sections:yes  - yes pls and zap the '<![CDATA[' and ']]>'
        convert-CDATA-sections:preserve - yes pls and leave the '<![CDATA[' and ']]>'
        Normally a CDATA section like :
            <![CDATA[ Vongerful Vondafool C&oe;penh&areing;gen <99thisIsAnon-compliant
XMLtag> ]]>
        is considered a single, raw string of XML/SGML data. And all the tags and
entities (like &lt;) are not changed either.
        Use this parameter to convert them.
        Note that you should use this option CAREFULLY if any tag in the CDATA is the
same as a tag in the main envelope. See below for more comments.
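
To see what 'a single, raw string' means in practice, here is how a standard
XML parser (Python's ElementTree, used purely as an illustration and nothing
to do with webwire's own parsing) treats a CDATA section: the tag and the
entity inside it come back as plain text, with no child elements created.
The function name and the sample document are made up.

```python
import xml.etree.ElementTree as ET


def cdata_view(doc):
    """Return the text and the number of child tags seen by the parser."""
    root = ET.fromstring(doc)
    return root.text, len(root)


text, children = cdata_view(
    "<story><![CDATA[ Fish &amp; <b>chips</b> ]]></story>")
print(repr(text))
print(children)
```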

To save the contents of a particular Tag or TagAttribute, use the 'fiphdr'
keyword :
    fiphdr:(FipHdr field)  (optional subkeywords)
        Either  tag:(name of tag)
                specify the tag name which contains the data required.
        Or  data:(FipSeq)
                for adding FipHdrs with standing data.
                fiphdr:TT   data:\$e\$y\$i\$d
                will create a FipHdr field TT with the current date in it
        Or  tag:(name of tag)@(name of attribute)
                specify the tag name and the attribute name which contains the data
required.
        Or there can also be a 'key' parameter for selecting the data ONLY if there
is a Key attribute with its data equal to a certain string:
            eg: if the tag is <meta name="category" content="f"/>
                fiphdr:NC   tag:meta@content key:meta@name=category
                Get the contents of the content attribute of 'meta' where another attribute
called 'name' has the value 'category'
            or  fiphdr:NC   tag:meta    key:meta@name=category
            or  fiphdr:NC   tag:meta@name=category
                Get the data for the 'meta' tag that has an att 'name' = 'category'
            Double quotes around the Key Data are optional unless there are embedded
spaces. The Key Data can be in FipSeq.

        For any of the tag options, use 'dup' to flag duplicated fields.
            dup:(optional separator)
                This field may be duplicated. Duplicate fields are separated
                with a space unless a separator chr is also specified.

        Where there might be embedded tags inside the main tag, use 'repxml' to
specify a replace string
            repxml:(FipSeq)
            eg fiphdr:AL    tag:TD  repxml:+\s+
                and the data is <td>abc<br>efg<br>line3</td>
                will give   AL:abc+ +efg+ +line3
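
The 'repxml' result above can be reproduced with this small Python sketch.
The function name is made up, and the replace string '+ +' assumes that \s in
the FipSeq stands for a space, as the result line suggests.

```python
import re


def tag_data(xml, tag, repxml):
    """Return the data of the first <tag>, with embedded tags replaced."""
    pattern = rf"<{re.escape(tag)}\b[^>]*>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, xml, re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    # A function is used as the replacement so repxml is taken literally.
    return re.sub(r"<[^>]*>", lambda _m: repxml, match.group(1))


print(tag_data("<td>abc<br>efg<br>line3</td>", "TD", "+ +"))
```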

        As some FipHdr fields have distinct meanings - SN, DU, DP etc - please use
other 2 letter codes starting N or Q.
        In the current version of webwire, you CANNOT specify trees of tags ie
fiphdr:AA tag:entry/id.

    eg  fiphdr:NA   tag:itemid  dup:+
            get the data from each <ITEMID> field. If there is more than one,
            they are separated by a '+'.
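
A Python sketch of the 'dup' behaviour in that example, collecting the data of
every <ITEMID> and joining with the separator. The function name and sample
data are made up, and stripping spaces around each value is an assumption.

```python
import re


def collect(xml, tag, separator=" "):
    """Join the data of every <tag> found, using the dup separator."""
    pattern = rf"<{re.escape(tag)}\b[^>]*>(.*?)</{re.escape(tag)}>"
    values = re.findall(pattern, xml, re.IGNORECASE | re.DOTALL)
    # Assumption: surrounding white space is not wanted in the FipHdr field.
    return separator.join(value.strip() for value in values)


sample = "<item><ITEMID>A1</ITEMID></item><item><ITEMID>B2</ITEMID></item>"
print(collect(sample, "itemid", "+"))
```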

    fiphdr-save:(FipSeq)
    fiphdr-file:(Filename in /fip/fix/webwire/fiphdr)
        This allows data to be stored as FipHdrs at the end of the session - and read
at the beginning of the next
        So items like Sequence numbers and time-of-access can be passed between
attempts.
            ; default name
            combie:QA   WA,default
            ; save and possibly reuse the FipHdrs ....
            repeat:JQ   J1,+,1
            repeat:JD   J2,+,1
            fiphdr-save:BQ:\JQ\nBD:\JD\nXX:some comment\n
            fiphdr-file:websave_\QA
        ** This must be lower down the parameter file than any FipSeq if you are
using FipHdr fields as the example above !
        There can be multiple 'fiphdr-file' - all of which are read as the parameter
file is read.
        But if there is a fiphdr-save, ONLY the last 'fiphdr-file' is stored to.

    fiphdr-on-all-levels:
        Add the FipHdr to each file on every level - default: no
    fiphdr-hash: (single chr in FipSeq)
        This will replace a Hash '#' in a FipHdr field (as Hashes are normally
end-of-fiphdr field)

    meta-to-save:(FipSeq)
    meta-save-file: (Filename)
    meta-save-on-tag: (tag name)
        This meta file is appended to on the End-of-tag specified (or end-of-file if
no tag specified)
            ; save these fields to the lookup file
            meta-to-save:\J3|\J5|\J6|\J1|\J4|\$h:\$n:\$b\n
            meta-save-file:/fip/data/blob/\$e\$y\$i\$d/\WA
            meta-save-on-tag:LINK
    reset-fiphdr-on-tag: (tagName)
        Trim the FipHdr - and extra, added fields - on the end of this tag to the
same position when the tag started
        This can be used in meta-save to make sure that FipHdr fields from one grab
do NOT exist for the second or subsequent grabs
        default: not used.
    grab-on-tag: (tagName)
    grab-on-endtag: (tagName)
        Any links should be grabbed at the start or end of this Tag
        default: all links are grabbed at the end of the page
        An extra parameter may be specified on the same line for level eg
            grab-on-endtag:VALUE    level:3
            grab-on-endtag:params/param/value/struct/member
        NOTE that grab-on-endtag does not trim the FipHdr (as we might need the extra
meta for a fiphdr-save). So use reset-fiphdr-on-tag with the same tag to trim
(if there is NO fiphdr-save)

    retry-404-max:3
    retry-404-gap:1
    retry-404-error:abort/ignore/move
    retry-404-queue:2go
    retry-404-fiphdr:#CE:300#DU:nextstage
        Retry links which return a 404 Not Found error. Max is the number of retries
and Gap is the pause in seconds between the retries
        Use this for those sites which are a bit slow to add the data files the links
point to.
        If the files really are not there - and you do NOT want to abort the
transmission - use 'retry-404-error:ignore' to continue with the next grab
        OR you can use retry-404-error:move and retry-404-queue:(queue in spool) and
retry-404-fiphdr:(FipSeq) to send an item
    retry-500-code:505
    retry-500-max:5
    retry-500-gap:1
    retry-500-error:abort/ignore/move
    retry-500-fiphdr-file:delete/ignore
    retry-500-queue:2go
    retry-500-fiphdr:#CE:300#DU:nextstage
        Retry links which return this system error - code can be any 3 digit number
above 400.
        Max is the number of retries and Gap is the pause in seconds between the
retries
        Use this for those sites which are a bit slow to add the data files the links
point to.
        If the errors continue - and you do NOT want to abort the transmission - use
'retry-500-error:ignore' to continue with the next grab
        OR you can use retry-500-error:move and retry-500-queue:(queue in spool) and
retry-500-fiphdr:(FipSeq) to send an item
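
The retry-404 and retry-500 behaviour described above can be sketched as a small loop. This is a minimal sketch under assumptions: 'grab_with_retry' and its 'fetch' callable are hypothetical names, and it assumes Max counts the retries made AFTER the first attempt.

```python
import time
from typing import Callable, Iterable

def grab_with_retry(fetch: Callable[[], int],
                    max_retries: int = 3,
                    gap_secs: float = 1.0,
                    retry_codes: Iterable[int] = (404,),
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Call fetch() and retry while it returns one of retry_codes.

    Returns the last HTTP status seen; the caller then decides whether to
    abort, ignore or move (as retry-404-error / retry-500-error do).
    """
    codes = set(retry_codes)
    status = fetch()
    retries = 0
    while status in codes and retries < max_retries:
        sleep(gap_secs)          # the 'gap' between retries
        status = fetch()
        retries += 1
    return status
```

A caller would pass something that performs the real request and returns the status code; for a 505 retry, set retry_codes=(505,).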
    save-data-path: (Fipseq pathname for data)
        This puts the data of the incoming file into this folder and creates a FipHdr
        file that contains 2 FipHdrs containing the full path/filename
            SX: and FTP_EXTERNAL_FILE:
        (ipbalan uses SX and ipftp uses FTP_EXTERNAL_FILE)
            eg  save-data-path:/fip/data/jpegs/\$e\$y\$i\$d/
        Use this for big files that you do not want to copy around the Fip Spool
area.
    save-data-filename: (FipSeq name)
        Use this to specify exactly what the 'save-data-path' name should be
        default is (incoming filename).(time).(seqno)
        eg  save-data-filename:\HR-\SU.raw
    save-data-balance-group: (Balance group)
        Balance all save-data files to the following group. default: do not balance
    save-data-balance-folder: (Balance folder)
        If balancing, put the token in this folder under spool. default: 2balance

    max-children:(number in Fipseq) - same as the -E switch. default: none
    forks-per-sec: (number in FipSeq)
        Throttle forks to this number per second    default: no throttle

-- More Complex sites ------

-------- Oauth2, Oauth digest and AWS notes --------

-- For accessing Oauth2 protected assets - eg GCP Cloud Storage or G-Drive or
Microsoft Azure features

    ; OAUTH2 authentication as per Google GCP or Microsoft Azure

    use-oauth2:yes/no
        Use OAUTH2 to grab/use an access-token or Bearer token eg for Gmail access
        default is NO
    ; We need an access token
    use-oauth2:yes

    ; which flavour of Oauth2 ? - only the first letter is meaningful
    ; oauth-flavour: Google (Gmail) or Microsoft (Office365)
    oauth-flavour:microsoft for office 365

    ; Current token file will be saved in /fip/fix/goauth2
    oauth-token-file:\OT

    ; Credentials file in /fip/tables/cert
    oauth-credentials-file:\OC

    ; sffoauth and imapwire
    oauth-scope:https://outlook.office365.com/.default

    ; Script to run when token expires - approximately every 12 hours
    oauth-refresh-script: (Script in FipSeq)        script to generate the access_token
using a refresh_token
    oauth-refresh-script:/fip/bin/sffoauth -z wire/IMAP.O365.OAUTH.SEA -c \OC -t
\OT -H '#WN:\WN' -a

    These 5 FipHdrs are used to generate, check, add/renew permissions to access
the remote data - normally Gmail or Office365

    oauth-client-fiphdr: (FipHdr)   default: IC
    oauth-secret-fiphdr: (FipHdr)   default: IS
    oauth-access-fiphdr: (FipHdr)   default: IA
    oauth-refresh-fiphdr: (FipHdr)  default: IR
    oauth-expiry-fiphdr: (FipHdr)   default: IX

-- For accessing other Oauth protected assets - like twitter
    OAUTH digest - such as twitter and dropbox use

There are four parameters to define:
    ; this is the salt key
    oauth-signature-key:\JC&\JA
    ; string to use
    oauth-signature-data:\R5
    ; fiphdr to add to
    oauth-signature-fiphdr:RS
    ; type sha1
    oauth-signature-type:sha1

    oauth-signature-key: (FipSeq)   Key for Oauth; normally the ConsumerKey and the
AccessToken
    oauth-signature-data: (FipSeq)  Data string to encode - see your remote api doc
for what needs to be included and how it should be formatted
    oauth-signature-fiphdr: (2 letter FipHdr field) FipHdr which will hold the
signature
    oauth-signature-type: (type)    Signature type
        valid types are md5, sha1, sha224, sha256, sha384 and sha512
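
As a rough illustration of what the four oauth-signature parameters combine to, here is a Python sketch. It assumes the signature is a keyed HMAC of the data string, Base64 encoded, as OAuth 1.0a does for HMAC-SHA1; how webwire encodes the result is not stated here, and the function name and the sample key and data are made up.

```python
import base64
import hmac

def oauth_signature(key: str, data: str, sig_type: str = "sha1") -> str:
    """HMAC `data` with `key` using the named hash and return it in Base64.

    sig_type may be md5, sha1, sha224, sha256, sha384 or sha512.
    """
    digest = hmac.new(key.encode("utf-8"), data.encode("utf-8"), sig_type).digest()
    return base64.b64encode(digest).decode("ascii")

# Hypothetical key ("first&second") and data string, for illustration only.
print(oauth_signature("consumer&token", "GET&https%3A%2F%2Fapi.example.com%2F"))
```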

-- For AWS grabs, use the same parameters as oauth (almost!)

    aws-signature-key: (FipSeq) Key for AWS; use | as a sep :
secretkey|dateStamp|regionName|serviceName
    aws-signature-fiphdr: (2 letter FipHdr field) FipHdr which will hold the
signature
    aws-signature-type: (type)  Signature type
        valid types are md5, sha1, sha224, sha256, sha384 and sha512
    aws-request: (FipSeq)   FipSeq string to hash using the signature key
            eg
aws-request:GET\n/349445556777/fiptest1\nAction=ReceiveMessage&AttributeName=All&MaxNumberOfMessages=10&MessageAttributeName=All&Version=2012-11-05&VisibilityTimeout=1&WaitTimeSeconds=20\nhost:sqs.us-east-99.amazonaws.com\nx-amz-date:20180907T112658Z\n\nhost;x-amz-date\ne3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
    aws-data-fiphdr: (2 letter FipHdr field)    FipHdr field which will hold the
payload sha256 hash
        if POST, the data part which needs to be hashed; for GET, it is left blank
(obviously as there cannot be a payload)
            the hash is added to the last line of the 'aws-request' either as a fixed
string (for GET) or as a FipHdr field (aws-data-fiphdr)
            if GET, this can be ignored and (for SHA256), this is
'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'
            eg sffhmac -I "" -Z sha256 -S -D -H
    aws-data-md5-fiphdr: (2 letter FipHdr field)    FipHdr field which will hold the
payload md5 hash
        This is used for the S3 Content-MD5: mime header as an alternative to the
SHA256 hash in the x-amz-content-sha256: mime header, which should be set to
UNSIGNED-PAYLOAD if you use md5

EG :
; if split into FipHdr fields for :
;   JA-accessKey, JB-secretKey, JH-sha of payload,JI-aws-id, JP-Url Params,
;   JQ-sqsque, JS-service, JR-region, JZ-utc datetime
combie:JA   AA,-noAccessKey
combie:JB   AB,-noSecretKey

combie:JI   AI,349445556777
combie:JM   AM,GET
combie:JP   AP,Action=ReceiveMessage&AttributeName=All&MaxNumberOfMessages=10&MessageAttributeName=All&Version=2012-11-05&VisibilityTimeout=1&WaitTimeSeconds=20
combie:JQ   AQ,fiptest
combie:JS   AS,sqs
combie:JR   AR,us-east-99
newdate:JZ  gmt unixdate=\$p "\ZZ\ZM\ZGT\ZH\ZF\ZEZ"

; For GET - choose either
fixed:JH    e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
; or
; aws-data:
; aws-data-fiphdr:JH

aws-signature-key:AWS4\JB|\JD|\JR|\JS|aws4_request
aws-signature-type:sha256
aws-signature-fiphdr:RX
aws-request:\JM\n/\JI/\JQ\n\JP\nhost:\JS.\JR.amazonaws.com\nx-amz-date:\JZ\n\nhost;x-amz-date\n\JH

- In this case, the default request boils down to :
GET
/349445556777/fiptest
Action=ReceiveMessage&AttributeName=All&MaxNumberOfMessages=10&MessageAttributeName=All&Version=2012-11-05&VisibilityTimeout=1&WaitTimeSeconds=20
host:sqs.us-east-99.amazonaws.com
x-amz-date:20180907T112658Z

host;x-amz-date
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
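
For reference, the fixed hash on the last line and the '|'-separated signature key can be reproduced in Python. This sketch follows the published AWS Signature Version 4 key derivation (a chain of HMAC-SHA256 steps); it is not webwire's own code, and the function names and the sample values in the print line are assumptions.

```python
import hashlib
import hmac

# SHA256 of an empty payload - the fixed string used for a GET.
EMPTY_PAYLOAD_SHA256 = hashlib.sha256(b"").hexdigest()

def _hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

def sigv4_signing_key(secret_key: str, date_stamp: str,
                      region: str, service: str) -> bytes:
    """Derive the signing key from AWS4+secretkey, dateStamp, regionName,
    serviceName and the literal 'aws4_request', in that order."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")

print(EMPTY_PAYLOAD_SHA256)
# Made-up secret key; date stamp is YYYYMMDD.
print(sigv4_signing_key("noSecretKey", "20180907", "us-east-99", "sqs").hex())
```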

-- The links are not in the normal Anchor or Frame tags.
If the Site returns an XML feed rather than HTML, you can specify which tags
hold the contents you want to play with. There can be up to 20 tags
specified.
    linktag:(tagname)
or  linktag:(tagname)@(attribute)   (for version 05u onwards)
    linktag:TEXT
    linktag-2:Slavver
    linktag-3:Bone

or to imitate the defaults :
    linktag-1:a@href
    linktag-2:frame@src

- sites which return other data which is not xml - such as CSVs
    data-type:CSV   (can be CSV for comma sep format, JSON, PSV for Pipe sep, TXT)
    data-type-sep:|
    data-type-eoln:
    data-link-idx:2
        define the column containing the link to the data
    headline-link-idx:3
        define the column containing the headline
    skipdetails-link-idx:1
        define the column containing the skipdetails

-- RSS feeds
Sometimes a link can point to data which gets updated and there is a second tag
which gives either a unique-id or a date/time which you need to track for any
changes. Use the 'skip-details-tag' to specify the second tag - it is the
combination of the 'linktag' and 'skip-details-tag' which should be unique.
For general RSS 2.0 feeds, this can either be 'pubDate' or 'guid' :
    linktag:link
    skip-details-tag:pubDate
In RSS feeds there is often a fake 'link' at the top which is the channel.
Usually you do not want this one - often it is a URN not a real URL, so use
'matchlinks' or 'ignorelinks' to bypass it.
If more than one skip detail is needed, up to 9 skip-details-tag-X can be
specified.
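
The idea of a link plus skip-details pair being the unique key can be sketched in Python. This is an illustration only: the function name 'skip_keys' is hypothetical, and it sidesteps the channel-level 'link' by looking only inside 'item' elements, which is a shortcut of this sketch rather than what 'matchlinks' or 'ignorelinks' do.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

def skip_keys(rss_xml: str, link_tag: str = "link",
              details_tag: str = "pubDate") -> List[Tuple[str, str]]:
    """Return one (link, skip-details) pair per RSS item.

    A grab is only 'already seen' if BOTH parts of the pair match.
    """
    root = ET.fromstring(rss_xml)
    keys = []
    for item in root.iter("item"):
        link = (item.findtext(link_tag) or "").strip()
        details = (item.findtext(details_tag) or "").strip()
        keys.append((link, details))
    return keys
```

So the same link with a newer pubDate gives a new pair and would be grabbed again.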

-- If the data in the link is not complete..
Use templates to slot data from a link into another call. This is again used
extensively for XML work - like soap.

It uses either just a template (in FipSeq so you can add Header Fields etc) or
a template AND a template file if there is a lot of data.
    level2template:/query.dll?src=\QD
    level3template:/getFile.dll?file=\W$
    level3template-file:soap-getfile.xml

There are 8 templates for levels 2 to 9. 'maxlevel:' and 'ignorelevel:' must
always be used with these to specify which one you need the data from.

A levelXtemplate on its own will generate a GET.
To POST something, you will also have to specify a 'levelXdata: (FipSeq)' eg
    ; level 3 - To get THAT file is always a POST of
FileManager1%24gvwFiles%24ctl03%24gvlnkNam
    ; .. using the different EVTVAL and VWSTAT
    level3template:/proximity/Admin/FileManager.aspx

level3data:__EVENTTARGET=\A2&__EVENTARGUMENT=&__VIEWSTATE=\G8&__EVENTVALIDATION=\G9
will force a
    POST /proximity/Admin/FileManager.aspx
with data filled in for fiphdrs A2, G8 and G9 eg

__EVENTTARGET=FileManager1%24gvwFiles%24ctl03%24gvlnkName&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUJMjQyMzY1MzEX%3D&__EVENTVALIDATION=%2FwEWAwKVxPyBCgLc

Note there is no level1template as that is the same as the URL:.. BUT there is
a 'level1template-file' version. In this case the URL: should be just that.

There is a little used parameter levelXmime which can be used to change the
Content-type / Mime type just for that level.

The template-files are normally in /fip/tables/webwire. They are NOT forced
to uppercase.

The default Content-Type for POSTing data or forms,
'application/x-www-form-urlencoded', sometimes needs to be changed for
templates.
It can be changed with the 'levelXmime' parameter. For example, soap normally
likes a content-type of 'application/soap+xml':
    level1mime: application/soap+xml
unless you are Microsoft of course who usually/sometimes want
    level1mime: text/soap

The 'W$' in the example is because each link is put into a temporary FipHdr
field called W$ as it is being used. If the link data is too much or too
little, use FipSeq to chop/add/replace.
Eg  if the data in the link is "nine:/rt/newsart/Id="z1jit4":text"
    And you want a link like
        /searchDB?database=nine&link=/rt/newsart/Id="z1jit4"&format=text
    use repeat:R1   W$,:,1
        repeat:R2   W$,:,2
        repeat:R3   W$,:,3
        ; if there is no 3rd field, use 'xml' instead
        combie:W4   R3,xml
        level2template:/searchDB?database=\R1&link=\R2&format=\W4
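
The same chop-and-rebuild can be followed in Python. This is a sketch of the logic only, with a hypothetical function name: the link is split on ':', the first field is the database, the second is the link wanted in the target URL, and the third is the format with 'xml' as the fallback.

```python
def build_level2_url(link: str) -> str:
    """Split the W$ link on ':' and slot the pieces into the level 2 URL."""
    fields = link.split(":")
    r1 = fields[0] if len(fields) > 0 else ""   # like repeat:R1 W$,:,1
    r2 = fields[1] if len(fields) > 1 else ""   # like repeat:R2 W$,:,2
    r3 = fields[2] if len(fields) > 2 else ""   # like repeat:R3 W$,:,3
    w4 = r3 or "xml"                            # like combie:W4 R3,xml
    return f"/searchDB?database={r1}&link={r2}&format={w4}"

print(build_level2_url('nine:/rt/newsart/Id="z1jit4":text'))
```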

-- Values -------
Values can be -
    - EITHER a file containing lines of values to be used to repeatedly grab data
for a single file.
        using values-file:(filename in tables/webwire)
    - OR a sequential number
        using values-seqno:(min value):(max value):(incremental value)
        plus    values-seqno-fiphdr-from: (FipHdr field containing the From seqno - ie
start grabbing from the NEXT id after this)
            values-seqno-fiphdr-to: (FipHdr field containing the To seqno - ie each
seqno until and INCLUDING this one)
    values-get-url:
    values-post-url:
    values-post-data:
        Fipseq to POST a form or GET a link from a line in the
        values file. See below for a description.
    values-sep: Separator chr for splitting fields in the values file.
        default is a pipe - '|'
    values-leave-spaces: Normally leading spaces are trimmed from each
        field in the values file. Use this to preserve them.
    values-parallel: (Number of Simultaneous Hits)
        For 'values' the default is to run the hits serially, one after the other has
finished. Use this to send out a number of hits at the same time which should
reduce the total time by a large factor. However, you should check with the
remote and test what the number should be. For Apache sites for example, 8 is a
common default setting.
        eg  values-parallel: 10
    values-fiphdr: Normally fipHdr W1 will contain the first field of the values
file, W2 the second etc.
        So data can be specified by \W1
        Use this parameter to specify another field - ie if W1 is being used
elsewhere.
        ** Note that if you are using iptimer to start webwire running a values file,
the Wx fields will be zapped in the output file.
        So in this case, always use 'values-fiphdr:' with a different FipHdr if you
want to use the Values in iproute or another downstream program.
        eg  values-fiphdr:R1
    values-pause: (secs)
        Gap/wait/pause between Grabs using the next value   default is 0 for none
    values-comment: (single FipSeq chr)
        comment - ignore any value line which has this chr as the first non-blank chr
default: ';' - semicolon
        values-comment:;
    values-allow: (single FipSeq chr)
        allow - only process values lines which have this (case-insensitive) chr as
the first non-blank chr default: all
        values-allow:E
    zap-values-file: (yes/no)
        Delete the values file after it has been used.  default no
        Only files in /fip/x/ starting TMP.. can be deleted.

----

Note that in the FipHdr - unless the 'nofiphdr' keyword has been requested, the
following fields will be filled in :
    Day and time in the normal HH,HD,HY etc fields
    ST  host
    SP  port
    SF  url - path/filename being grabbed
    SG  url - path/filename which is the link
Where webwire is sitting on a scrolled queue (using -i), the folder name is in
EQ and the filename in EN (with all '#' replaced by the chr specified by
'fiphdr-hash')

Extra FipHdr values are
    \@v is no of items in this file
    \@i is the childId if spinning off children

Input Parameters (all optional) :
either  -i : scrolled queue             default: no default
        This checks the folder and for each file, checks the FipHdr for 'DF' which is
used for the name of the parameter file to run against
        This allows a variety of parameter files to be run
or  -1 : Run a single time and exit         default: spool
        The parameter is the name of the individual parameter file in tables/webwire
(ie NOT The top or main parameter file)
or  -T : Tuning mode                default: spool
        Display links and data for the page requested. Runs only that page and then
exits.
        The parameter is the name of the individual parameter file in tables/webwire
(ie NOT The top or main parameter file)
    -A : In Tuning mode, do NOT prompt before searching a link  default: prompt
    -a : log the actual link of each access in the FipLog     default: no
        This can be quite a lot of logging if you are grabbing lots of files !
        But is quite useful when starting/adding a new feed.
    -B : default balance group for skip files           default: none
        (see skip-balance-group parameter)
    -C : warm restart for cookies/api-keys      default: always ask for new
cookies/api-keys for logon
        ie do NOT re-logon if the previous session logged on and saved the cookie or
api-key
        if any apikey is missing or has timed out, all cookies and api-keys are wiped
and webwire needs to be re-run to logon and download.
        see note below
    -d : done folder for -i scrolled queue      default: none
        This can be overwritten by the 'doneque:' parameter
    -D : display the Request and Response       default: do not
    -e : exit with the Result Code of the last grab.    default: normal program exit
        The Normal exit is 0 if ok, negative number if not
        With -e this will be 0 for ok, and -1 (timeout) but 4XX or 5XX for page
errors.
    -E : maximum number of threads up to a max of 100 (not Win2k).  default: 1
        Note this is also a hardware limit in that small systems may not be able to
run as many.
    -f : path and filename of the output file if a non-200 HTTP code is returned;
default: fip standard
        use this to leave the file(s) in a non-std folder.
        ++ NOTE this was -E before version 6a47
    -F : do NOT add a FipHdr to the output file default: do
        this can be overridden by the 'nofiphdr:no' parameter
    -h : extra FipHdr information           default: none
        This is in FipSeq and should normally be quoted
        Note this is the means by which 'iptimer' sends variable information to webwire
        eg : -h"SN:hello#TC:200401031"
    -H : display the Request and Response in fancy HTML default: do not
    -I : wire id                        default: 0
        used to track which instance of a multi-webwire system a file arrived/logged
    -k : ignore the Skip list (used mainly in tuning)   default: use skip-links:
    -K : Do NOT save or process any data, just build up a skip file.
        This can be used before putting sites into production so that all old links
are ignored and only new links will be tracked.
        ie run 'webwire -1 (name) -K' once beforehand.
    -l : no logging to the FipLog except for errors default: log all
    -L : log new files and errors to the FipLog default: log all
    -m : (FipSeq) no of items               default: grab ALL items
        eg -m 3 or -m \A1
        Generally used in testing to reduce the number of files grabbed
        This is overridden by 'max-items:...' parameter
    -N : path and filename of the output file   default: fip standard
        use this to leave the file(s) in a non-std folder.
    -o : output queue in 'spool'            default: spool/2go
        This can be overwritten by the 'outque' parameter
        This is ignored in Tuning mode.
    -O : force ALL output to this queue in 'spool'  default: spool/2go
        This overwrites the 'outque' parameter
        This is ignored in Tuning mode.
    -s : generate statistics for bandwidth usage    default: no
        using Hour_group files
    -S : generate statistics for bandwidth usage    default: no
        using name of group_client files
    -t : track status               default: no
        this can be overridden by the parameter
            track-status:no
    -V : if using spool-a-folder (-i) then stop when it is empty    default: keep
spooling
    -w : Wait in seconds between accessing links.   default: 5
    -x : Proxy server host or IP address        default: none
    -X : Proxy server port              default: 80
    -y : Proxy logon                default: none
    -Y : Proxy server is Squid          default: no
    -z : parameter file in 'tables/webwire'.    default: XWEB
    -v : Print the version number and exit

---- Other Notes ----

-- Netiquette --

Pls note if you are grabbing data off another site, then you should contact the
webmaster of the remote and let them know. Certainly if you are accessing every
few seconds, then there is a good chance they will put you on some refuse list.
So it pays to be nice !

-- How to find out the actual url....

Sometimes it is quite difficult to find out the real path to use for the url.

Especially so for script-driven gets and puts.

NetScape or Iexploiter is invaluable in this case..
 - using either 'View Source' or 'History' normally gives the game away!

Snooping using tcpdump or windump
    0. Open a Terminal/Cmd window and start your browser - without hitting the site
yet
    1. Find out which interface
        tcpdump -D
    2. Leave tcpdump running in background
        On Mac OSX you will need to be sudo
        tcpdump -i1 -w remo.tdmp -X host www.remote.host

    3. On the browser, do the absolute minimum ..
        .. do a simple logon and grab one file using Firefox, Mozilla, IExp, Safari
etc
    4. Ctrl-C to stop tcpdump
    5. run tcpdump to show data
        tcpdump -r remo.tdmp > remo.fip
    6. call up remo.fip in an editor.

-- Cookie Cookie Cookie Cookie Cookie Cookie Cookie Cookie Cookie Cookie

Cookies are neat but nasty.
If you already know the cookie you need, just make a file in /fip/fix/webwire
with the name of the cookie (case is important on Unix boxes) and slap in the
whole of that cookie which has the syntax
    (key)=(data)
ie  zumzum=hungryTummy

Before grabbing data pages we can attempt to logon to a box and get its cookies
!!
This uses from 1 to 9 GETs or POSTs

    add-cookie:\C1; \C2 Add the Cookie on to the end of the HTTP headers in this
form
    get-cookie-1:   Command to send to get a cookie or to logon.
    get-cookie-data-1: Optional data usually required for a POST
    get-cookie-http-1: more HTTP headers used ONLY for this GET/POST
    cookie-fiphdr-1: name of the cookie to use as a FipHdr field C1 to C9
            ie if there are several cookies returned but only one
            is needed, put the key as the cookie-fiphdr
            ie Set-Cookie: ABC=12345
                add-cookie:\C1; perm=yes
                cookie-fiphdr-1:ABC
            will result in a Cookie: ABC=12345; perm=yes
            If you  want all the cookies to be saved, use '*'
                cookie-fiphdr-1:*
    follow-cookie-redirect: (yes/no)
        ie if you get a 302 Moved Temporarily status Plus a Location from a
cookie-request,
        use that rather than the 'url:..' specified.
            HTTP/1.1 302 Moved Temporarily$
            Date: Fri, 29 Oct 2010 00:17:19 GMT$
            Cache-Control: max-age=3$
            Location:
http://fippo.fip.fip/palio/html.run?_Instance=cms_csi&_PageID=1&_SessionID=1068051&_SessionKey=922432532&_CheckSum=328747502$
    cookie-form-1: find and save an input tag in a form and put the data in a
FipHdr field starting F*
        use this to add hidden form zones into a reply for a logon for example.
        ; csrf_token will go into F1
        cookie-form-1:csrf_token
        ; then send it back
        get-cookie-2:POST /login/

get-cookie-data-2:csrf_token=\F1&email=dot%40sniggerfrost.com&pwhash=somut&caform=1&submit=Login
    keep-cookie-fiphdrs:yes/no
        Normally the access to the cookies do NOT give any data you need to save in
the FipHdr for use later on
        But there are times - eg when the cookie (or api-key) is a logon code - when
you DO want to save
        However if you do not want this (maybe there is some data which clashes) turn
this OFF by specifying NO

There can be up to 9 of these.
    eg  add-cookie:\C1
        get-cookie-1:GET /
        get-cookie-2:POST /logon.pl
        get-cookie-data-2:logon=helpme&password=iamswimming

Rarely are any  get-cookie-http-1 fields needed as
    Host, Content-type, and Content-length are added automatically
    Referer is added if you have specified a 'referer'
        which you should if running 'http-version:1.1'
    Keep-alive is added if you specify 'keep-alive:yes'
    Other 'httphdr' fields should be specified as normal..

As a general rule, some Microsoft IIS sites (who else!) have problems if your
HTTP headers are in the wrong order. Basically, make sure your CONTENT* lines
are last.

Example 1
; ------------------------------------------------------
; we need to go and get a cookie for this service
; we will call it C1 - so the httphdr will be 'Cookie: (contents of C1)'
add-cookie:\C1
; C1 will hold the contents of an incoming 'WebLogicSession=.....'
cookie-fiphdr-1:WebLogicSession
; this is the URL to hit (with parameters) to trigger the Cookie
get-cookie-1:GET /servlet/com.login.DispatchServlet?Login=&User=guest&Pwd=guest

Example 2
; ----------------------------------------------
; in this case we have 3 cookies C1, C2 and a fixed one 'b'
; C1 is SID=..
; C2 is ASP...=...
; add the fixed 'b=b' on the end
add-cookie:\C1 ;\C2 ;b=b
; just one grab at a cookie - and Logon at the same time
get-cookie-1:POST /login/Login.asp
; one logon string
get-cookie-data-1:u=%2Findex.asp%3F&l=letmein&p=ohpleaseplease&x=0&y=0
; ignore the 302 return - it is only trying to send us to index.asp
cookie-noredirect-1:
; Save the two cookies as C1 and C2
cookie-fiphdr-1:SID
cookie-fiphdr-2:ASPSESSIONIDASDQCAAD

This will POST - ie pretend to be a filled out html FORM - the logon back.

Note that the cookie-data is 'URI escaped' ie if it is a special chr - like
/?&+ - and is in the data bit, you must use the '%xx' notation (where xx is
the HEX value). But hopefully you would have seen that in your tcpdump/snoop
anyway.
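
If you need to build the escaped string by hand, Python's standard library does the same '%xx' escaping. A minimal sketch, with a hypothetical helper name; the field names and values are taken from the examples above.

```python
from urllib.parse import quote, urlencode

def escape_value(value: str) -> str:
    """Escape every special chr (including '/') as %xx."""
    return quote(value, safe="")

# Single values
print(escape_value("dot@sniggerfrost.com"))
print(escape_value("/index.asp?"))

# A whole key=value&key=value string for get-cookie-data
print(urlencode({"u": "/index.asp?", "l": "letmein", "x": "0", "y": "0"}))
```

Note that urlencode turns a space into '+' rather than '%20', which is the usual convention for form data.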

-- Proxies Proxies Proxies Proxies Proxies Proxies Proxies Proxies Proxies
Proxies

When running through a proxy server, you will need :
    1. hostname of the proxy server
    2. port number on the proxy server if it is NOT port 80
    3. (optionally) a logon and password
    4. Is the proxy SQUID ?
        If so headers are slightly different.

If this information is NOT available, normally you can find it easily from any
PC or Mac on the internal network using a browser like Netscape or IExplorer.

Start a NEW copy of either of these.  - It must be a new copy to check on
logons etc.

Under 'Preferences' or 'Internet Options' there should be a 'Connections'
section and under that, the host name or ip address plus host name of any proxy
used.

Note that often the main Fip server is NOT running DNS and will not be able to
resolve external hostnames, so the IP address must be used in this case.

Enter these values in the Fip parameter file as :
    proxy-server:195.13.83.99   (no default)
    proxy-port:412          (this defaults to port 80)

Use the Browser to attempt to access a web site outside the firewall - like
'www.fingerpost.co.uk'.

If you are asked for a password to get through, you will probably need to add a
'proxy-logon' parameter too unless the keeper of the Firewall has made a hole
through just for you.

The data for 'proxy-logon' is in base64 in the format (logon) (colon)
(password).

Use 'sffb64' to generate this string :
    On a Sparc  echo -n "chris:magicman" | sffb64 -i
    On Linux    echo "chris:magicman" | sffb64 -i
    On Winnt    type "chris:magicman" | sffb64 -i

    proxy-logon:Y2hyaXM6bWFnaWNtYW4=

The actual 'You need to Logon, Pal' message is a '407 Proxy Authentication
Required' message.
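
If 'sffb64' is not to hand, the same Base64 string can be checked with Python's standard library. This is only a cross-check, with a hypothetical function name; note that a trailing newline on the input gives a different string.

```python
import base64

def proxy_logon(logon: str, password: str) -> str:
    """Base64 of (logon) (colon) (password), with no trailing newline."""
    return base64.b64encode(f"{logon}:{password}".encode("utf-8")).decode("ascii")

print(proxy_logon("chris", "magicman"))
```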

-- Repeat Offenders -----------------

Some sites add a session-id into each and every link. And this Id changes on
each access.

To 'webwire' this appears to be a new file and so it is grabbed every time -
falsely.

There is an 'ignore-key' command to isolate and ignore the relevant parameter.
eg Take a site like :
    url:http://www.fingerdong.com/
    matchlinks:*&news=yes&newsid=*
    ignorelevel:1

which returns links like
    /en/pressrelease.php?date=20080910&news=yes&PHPSESSID=11bf21&newsid=7866

If the value of PHPSESSID changes on each access, then you will get a copy of
newsid 7866 every time.

Use :
    ignore-key:PHPSESSID

Do NOT specify the '=' or '?' etc.
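
What ignoring a key amounts to can be sketched in Python: drop the named parameter before comparing a link with the ones already seen. The function name 'strip_key' is hypothetical and this is not how webwire is necessarily coded.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def strip_key(link: str, ignore_key: str) -> str:
    """Return the link with the `ignore_key` parameter removed."""
    parts = urlsplit(link)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != ignore_key]
    return urlunsplit(parts._replace(query=urlencode(kept)))

a = "/en/pressrelease.php?date=20080910&news=yes&PHPSESSID=11bf21&newsid=7866"
b = "/en/pressrelease.php?date=20080910&news=yes&PHPSESSID=99aa00&newsid=7866"
# Both now compare as the same link, so the second would be skipped.
print(strip_key(a, "PHPSESSID") == strip_key(b, "PHPSESSID"))
```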

-- Others Others Others Others Others Others Others Others Others Others Others

--Where 'webwire' is used to drill down links, there is a wait of about 5
seconds between accesses which, hopefully, is enough time for other people to
use that server.

--Where a logon and password is requested as part of the Browser - ie a pop-up
from Netscape or IExplorer, NOT an HTML form - you will need to add a
'Authorization' line. This will be true if you get a message like :
    HTTP/1.0 999 Authorization failure
        ... etc etc etc ...
    Assuming you know your logon and password :
    1. Use uuencode or sffb64 to generate a Base64 string
        echo -n "logon:passwd" | sffb64 -i
    2. Add an extra line to the parameter file with the result of the sffb64 line
using 'httphdr'.
        Syntax: Authorization (colon) (spc) Basic (spc) (Base64 of logon:password)
(\n FipSeq for NL)
        Eg  httphdr:Authorization: Basic AbGtGgbhpdOkOTE=\n

-- Valid links are :
    - The HREF tag attribute in A for Anchor    <a href="www.fingerpost.co.uk">
    - The SRC  tag attribute in FRAME       <frame src="ax1000.html">
    - The URL in a META/Refresh         <META HTTP-EQUIV="Refresh" CONTENT="0;
url=go4thAndMulitply.com">

-- For 'matchlinks', the term LINK is the contents of the <a href="THISONE">,
NOT the associated text
    ie matchlinks:*boonies*
        will find   <a href="/rubbo/boonies/tunies.html">This is a Wonderful Page</a>
        BUT not     <a href="/tunies.html">This is the boonies Wonderful Page</a>
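
The same test can be tried out in Python with shell-style wildcards. This is a sketch under assumptions: it treats the mask as a case-sensitive '*' wildcard pattern, and Python's fnmatch also gives '?' and '[' a special meaning, which webwire masks may not.

```python
from fnmatch import fnmatchcase

def link_matches(href: str, mask: str) -> bool:
    """Match the mask against the href itself, never the anchor text."""
    return fnmatchcase(href, mask)

print(link_matches("/rubbo/boonies/tunies.html", "*boonies*"))
print(link_matches("/tunies.html", "*boonies*"))
```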

-- Note that 'ignorelinks' refers to both Links and Forms.

-- If you want to ignore all links and only get forms, use a weirdo name in
matchlinks
        matchlinks:gobbLedeGook9981

-- What are reasonable HTTP headers ?
1. If you are using HTTP Version 1.1, you MUST add a line in the headers which
specifies the actual host you are trying to access (ie the REMOTE hostname or
IP address):
    httphdr:Host: www.theirsite.com\n
or if DNS is a problem
    httphdr:Host: 123.456.789.012\n

2. Most servers would like to know what you are and what you can do - so lie !
    Try this for starters :
    httphdr:Accept: \052/\052\n
    httphdr:Accept-Language: en\n
    httphdr:User-Agent: Mozilla/4.0 (compatible; MSIE 4.01)\n
Note the syntax is httphdr:(Keyword) (colon) (space) (Parameter) (NL)
    Keyword is case-INsensitive
    There MUST be a Colon-Space between the Keyword and Parameter.
    The line MUST finish with a single NL (which webwire will handle correctly)
    as Double NLs mean end of header.

3. If the data on a lower level is being served from a different host, if you
need authentication or some other httphdr, use the 'httphdr-on-all-grabs:yes'
parameter to add them for that server too.

-- ValuesFile ValuesFile ValuesFile ValuesFile ValuesFile ValuesFile --

Take the case where you need to get the 10 foreign exchange rates every 20
minutes from a site like Yahoo.

The normal way would be to test using one forex rate and, when ready, just
duplicate that parameter file another 9 times, just changing the forex
name/search string in the 'url' or 'post'.

The classy way is to put all the search values (ie the bits that change) into
a single 'values-file' and reference them using FipHdr fields W1 to W9.

To Do this :
If the original url is :
  http://finance.yahoo.com/m5?a=1&s=USD&t=LAK

1. Create a values-file in /fip/tables/webwire - let's call it VALUES_4_FOREX
    This can have the normal Fip-style comments of ';' at the start of line
    ;
    ; Values file for Forex
    ;
    USD|LAK
    USD|YEN
    USD|MYR
    ; end of values file

2. Add these lines to the WebWire parameter file - let's call it FOREX :
    ;
    ;   FoREX
    ;
    port:8080
    url:http://finance.yahoo.com

    values-file:VALUES_4_FOREX

    values-get-url:/m5?a=1&s=\W1&t=\W2

... and let rip.....

Note that W1 is the first field, W2 the second etc. If you are already using W1
for something else, specify another FipHdr field to start on with the
'values-fiphdr' parameter.

Note that the FipHdr fields are useable for filename and other Fippy things.
  filename:Forex-\W1-\W2.fip

will give filenames (and/or FipHdr SN) for our example of
    Forex-USD-LAK.fip
    Forex-USD-YEN.fip
    Forex-USD-MYR.fip
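
To see the substitution at work, here is a small Python sketch of the idea. It
is NOT webwire's own code: read_values and expand are hypothetical helpers, and
it assumes '|' separates the fields, ';' starts a comment line, and there are
no more than 9 fields per line.

```python
def read_values(text, sep="|"):
    """Yield one list of fields per data line, skipping ';' comment lines."""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue
        yield line.split(sep)


def expand(template, fields):
    """Replace W1, W2 ... (each with a leading backslash) by field 1, 2 ..."""
    out = template
    for n, value in enumerate(fields, start=1):
        out = out.replace("\\W%d" % n, value)
    return out


VALUES = """;
; Values file for Forex
;
USD|LAK
USD|YEN
USD|MYR
; end of values file
"""

# Raw strings keep the backslash in \W1 and \W2 exactly as in the parameter file
urls = [expand(r"/m5?a=1&s=\W1&t=\W2", f) for f in read_values(VALUES)]
names = [expand(r"Forex-\W1-\W2.fip", f) for f in read_values(VALUES)]
```

Each data line in the values-file gives one url to grab and one filename, in
the same order as the file.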

-- Standard-FingerPost-Rant on bad HTML ----------------------
-- Using Webwire to pull off other file formats

Sometimes, 'webwire' seems to only grab part of a page and never returns
errors. Well, if you use a browser to look at the page and then 'View Source'
or 'View Frame Source', lo and behold there is probably a random </HTML> at
that point.

</HTML> is of course the End Tag of an HTML document. So we SHOULD stop there
really.

But a lot of web sites do not care how awful their stuff is - or maybe a
conversion program has been set up wrongly (a well-known news agency in New
York uses </html> in place of </image> to end pictures for example)

So use the keyword 'end-of-document' to track either nothing - just timeout -
or the REAL end of document.

If the data is NOT html - some XML variant for example - use 'end-of-document'
to track that.

By the way, did you know you can immunise yourself from fingerpost-rants; pls
contact the sales dept.

-- Wrinkles with Ports and RSS

Some RSS servers like to service the initial list from one port - but you have
to grab the data from another
    port:8080
    url:http://finance.yahoo.com

-- using warmrestarts and cookies to keep a logon current

1. Add '-C'

if using iptimer, add as ' switch:-C '
client:abc type:w template:abc.fip days:X  every:1s fiphdr:'#' switch:-C

2. Run manually to see what the HTTP response code is for a BAD (ie 'logon
again pls') response
eg good is normally a 200 code :    HTTP/1.1 200 OK
    bad is something like
HTTP/1.1 303 See Other$
Date: Fri, 24 Jul 2015 15:53:18 GMT$
Server: Apache$
Expires: Thu, 19 Nov 1981 08:52:00 GMT$
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0$
Pragma: no-cache$
Location: /login/$

3. Add to the parameter file
; YES or FIPHDR= logon returns data; COOKIE= logon returns a cookie in the
; mimeheader
need-logon-token:yes
; look for this in the incoming data
matchlogon:/login/
add-cookie:\C1
cookie-fiphdr-1:*
get-cookie-1:POST /login/
get-cookie-data-1:csrf_token=\F1&email=shouta%40bingbong.com&pwhash=N1&caform=1&submit=Login

when it goes bad, you will get this message in the item log
Fri Jul 24 16:58:22 webwiressl !z : Zap cookies for Logon NZXIS.APAC
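
When running manually, a quick way to tell the good response from the bad is to
look at the status code and the Location header. The Python sketch below does
just that; it is an illustration under assumptions, not the test webwire
applies for 'matchlogon', and looks_like_logon_redirect is a hypothetical name.

```python
def looks_like_logon_redirect(raw_headers, marker="/login/"):
    """True if the response is a 3xx redirect whose Location holds the marker."""
    lines = raw_headers.splitlines()
    if not lines:
        return False
    # Status line looks like: HTTP/1.1 303 See Other
    parts = lines[0].split(None, 2)
    status = parts[1] if len(parts) > 1 else ""
    location = ""
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        # Header keywords are case-insensitive
        if sep and name.strip().lower() == "location":
            location = value.strip()
    return status.startswith("3") and marker in location


BAD = "HTTP/1.1 303 See Other\nServer: Apache\nLocation: /login/\n"
GOOD = "HTTP/1.1 200 OK\nServer: Apache\n"
bad_result = looks_like_logon_redirect(BAD)
good_result = looks_like_logon_redirect(GOOD)
```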

-- API keys in the data

basic-authentication:JA whomee:never
need-apikey:yes

apikey-host-1:https://auth.weather.gods
apikey-fiphdr-1:*
apikey-url-1:POST /oauth/token
apikey-httphdr-1:Authorization: Basic \JA\n
apikey-postdata-1:grant_type=client_credentials&scope=ukmo-warning-read
apikey-data-type-1:json
apikey-save-1:\VZBZ:\JZ\nXX:as at \$h:\$n:\$b\nXX:expires \J1\nXX:scope
\J2\nXX:domain \J3\n\$o

; READ last token
fiphdr-file:\R2.\R3.\R4.BZ
combie:RZ   JZ|BZ

matchlogon:/login/
; ; ..
{"timestamp":1482758750452,"status":401,"error":"Unauthorized","message":"Not
authorized","path":"/active"}
; if status code =401-Unauthorized .. redo the logon ..
fiphdr:J0   tag:status
matchlogon-fiphdr:J0=401

; .. and if the test is invalid, zap the FipHdr file - use this very carefully
; if you are using multiple FipHdr files !
matchlogon-invalid:zap
; this will zap any cookies (default) matchlogon-invalid:yes
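
The status test above amounts to pulling the 'status' tag out of the JSON reply
and comparing it with 401. A minimal Python sketch of that check, using the
sample reply shown; needs_logon is a hypothetical name and the comparison as
text is an assumption, not a statement of how webwire does it.

```python
import json


def needs_logon(body, tag="status", bad_value="401"):
    """True if the JSON body carries tag == bad_value (compared as text)."""
    try:
        doc = json.loads(body)
    except ValueError:
        # Not JSON at all, so this test cannot say a logon is needed
        return False
    if not isinstance(doc, dict):
        return False
    return str(doc.get(tag)) == bad_value


SAMPLE = (
    '{"timestamp":1482758750452,"status":401,"error":"Unauthorized",'
    '"message":"Not authorized","path":"/active"}'
)
redo_logon = needs_logon(SAMPLE)
```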

-------------------------------------------------------
Version Control
;6a54   26oct18 chgs for multiple aws grabs
    ;2-3 retry-404/500 parsed
    ;4-5 internal - trimming the FipHdr
    ;6-7 location can be same url but diff host/port and log-errors
    ;8-11 bugette with if-not-modified and method parsed, maxTree 100->300
    ;12-14 9jul19 added socks proxy too
    ;15 1nov19 cater for when there are > 9 values fields plus added values-allow
and allow-comment
    ;16 11nov19 added S2 S7 etc for system
    ;17 25nov19 redid lower levels and added fiphdr-ll ;18 12feb20 bug in Values
buffers
    ;19 4mar20 minor dontuse ;20-24 minor ; 25 sizeof url and relocate tuning
    ;26 linktags 10->20
    ;27-28 20jul20 redid apikey so params are parsed at runtime NOT on param file
read
    ;29-30 minor json [] issuette
    ;31-32 7feb21 redid AWS STS token and retry-404
    ;33-34 25feb21 woops - KAlive but diff host issue / FHlevel on last-TOPlevel
issue
    ;35 17mar21 better zap of expired ApiKey
    ;36-37  5may21 preserve FH in copy_tmp as the new bits are needed for
outque/script etc
    ;38 6oct21 always pull_apart_json/xml for cookies to get FipHdrs
    ;39 29oct21 added -m maxItems
    ;40 12apr22 better handling of JSON Arrays
    ;41  8sep22 added RAW type for EFE api
    ;42-44 23sep22 better VIEWSTATE ;45 tuned VALUES
    ;46abc bugette in waitTimeout and (slightly) better messages
    ;47abcd  1aug23 FORK and children and fiphdr S5: lnko idx ;b
convert-CDATA-sections fipseq ;d BIOnotSSL tuned
    ;48a 7sep23 added oauth for gcp
    ;49 added poll-or-select and shorten-urn
    ;50a-d tuning FORK;  localSeqno better
    ;51-53 19mar24 BUG in openBio (RC needs this mod - bad from 6a47d) ;a minor ;
53 no RESET on end of JSON (only on TAG end)
    ;54 29may24 timeouts map to SSL too
;5z99   10may05 hourly bandwidth stats files rather than per client
    ;a 13may05 balance skiplists if changed
    ;b-c 25jun05 added -M and -K
    ;d-g 05aug05 added fiphdr:XX data:abc\A3 and wait-end-timeout
    ;h-k 04sep05 changed -x-X to force not default
    ;l-m 07nov05 added 24hour+ skip files
    ;n-p 25sep06 added ssl at last
    ;q-t 17oct06 added skip-details-tag
    ;u 29apr07 major change to linktag, added matchkeys and match-case-sensitive
    ;v-w14 21may07 add rest of path if 3rd+ level and no starting '/' (w14 -
tweaks to stuff_cookie)
    ;x1-6  8may08 added save-fiphdrs ;3 added -N newname ;6 bugette with VALUES
file and port != 80
        ;7 added -e and -E errname ;8 balance fiphdr fields ;9 meta-files ;10-12
minor
        ;13-14 note_balance_action ;15-16 spc in url ;17 added pretend-301:200 ;19
allow feed:
        ;20-23 finally added basic-authentication: and redid ssl
        ;24 bugette/modette - allow multiple spaces in mime headers
        ;25 allow intergap of zero
        ;26 bugette - save_metadata missing if one and only one found
        ;27-29 25jun10 bugette when proxy is a Squid and host changes
    ;y1-9 26jul10 added grab-on-tag/endtag (major release) ;10-11 6sep10 bugette
with 302-move and http://...
        ;12-14 added matchlogon, bug (bg) with data-type:CSV, plus tom bug :
retry-404-max:3 retry-404-gap:1
        ;15-17 14oct10 added skip-save-data and days:Z for weekdays
        ;18 15nov10 added follow-cookie-redirect: ; 19 able to parse VALUES-FILE: ;20
added nofiphdr
        ;21-25 mess if too many 404 plus added -D and fiphdr-hash
        ;26-27 16mar11 added repxml for fiphdr: / include fiphdr-file in start of
hdr..
        ;28-29 31mar11 added zap-values-file:yes
        ;30-32 poll.every secs bugette ;32 added need-proxy-cookie
        ;33 6jul11 better skips handling now allow 15000 skips and zap olds with
different skipdetails
        ;34 29jul11 added need-logon-token and cookie-host-X for rconnect
        ;35-36 added dbl-dblqtes in links plus Bugette in Chunks and redid outque for
speedy
        ;37-41 added CONNECT for proxy https plus started minitracking and sleep
between polls for XWEB
        ;42 allow multiple spaces in custom tag link and added filter ;43
null_next_link added
        ;43-45 added retry-404-error
    ;z1-8 15mar12 added eventvalidation and viewstate and level5* and json
        ;9-10 allow multiple grabs, added level to grab-on-tag and matchlinks etc
        ;11-12 redid 302 moved to handle full paths better ;13 ;14 bugettes -
proxy/do NOT output file for cookies
        ;15 28feb13 tuning for level1template-file:
        ;16  4apr13 bug in skips if no headline
        ;17-23 11apr13 added trees, levels and keys to fiphdr:,  grab-on*tab: and
linktag:
        ;24-28 17may13 added retry-500 kwds and better proxy handling ;27 added
level1mime and -I wireId
        ;29-31 17mar14 added 404/500action=move, que and FipHdr ;31 modette-repxml
for all tags
        ;32 14apr14 for custom logging ;33 4aug14 added -Z force DF ;34 bugette with
fiphdr.. key:
        ;35-36 12nov14 added httphdr-on-all-grabs
        ;37 17dec14 bugette WINNT Only, cookie=* ;38 ;39 fiphdr-hash for W$ too ;40
CDATA
        ;41 28dec15 new apache does not like 443 on the end of Host:..
        ;42 13jan16 bug with https and proxy
        ;43-45 22mar16 allow 302/301 with Values and Bug with skipDetailsTag ;46
httphdrs on proxy
        ;47 10jun16 pullapartJson better
        ;48-52 14jun16 proxy and TLS1_2 and httphdr on proxy
        ;53  7sep16 allow same tag or tag@att to be in multiple fiphdrs
        ;54-55 19sep16 added SX/save-data-pathname
        ;56-57 cleanups bugettes - SU/DU if in extra, proxy and 302 handling
        ;58-59 16oct16 for ANP_FOTO fiphdrs and keydepth
        ;60-64 30dec16 added apikey stuff ; 65-68 maxLowerLevels 5->10
        ;69-72 bugette to newname/forcenewname and cookieForm
        ;73-76 28jul17 redid hmac and added recode ;77 JSON grab on endtag if in an
array
        ;78 3nov17 do not attempt to drill down during cookies
        ;79 15jan18 better JSON handling
        ;80-81 18feb18 updated ssl and added mime-type-fiphdr and level-fiphdr
        ;82 redid convertCDATA slightly ;83-4 cookie-host in FipSeq so we can vary it
;85-86 issuette with data-type and cookies
        ;87 matchlogon-invalid:zap added to zap the FIPHDR file as well as any
cookies
        ;88-90 added if-mod-suffix for filename if-modified
        ;91-92 better json support
        ;93 amz hmac support
        ;94-95 12sep18 added connection-retries and \@V
        ;96-98 6oct18 better handling of truncated data (aws-data added)
;004z   07jul04 tweaks...
    ;b 01aug04 added fiphdr:....
    ;c 10aug04 added levelXtemplate: where X is 2->4
    ;d-k 01sep04 -9 speedy and timing stats (f-maxlevel and values bugette)
    ;l-n 07oct04 added skps2, fixed one-file,
        fixed HTTP results with no messages
    ;o 28oct04 redid skps2
    ;p 01dec04 buglette with spaces in URLS - need to be stripped.
        plus lvl1file-lvl5file added
    ;s 31dec04 added -x proxy-host, -X proxy-port plus -y/-Y
    ;t-u 01feb05 added bandwidth-stats
    ;v-w 19feb05 added -u testPid and -U singlelevel only and split into files
        plus bugette with Chunking
    ;x-z 18apr05 added -O for rpt-offenders/small-diffs flag
;003z   15dec00 added one output file, tracking sents, only-get-if-modified
    ;a 20dec00 added watch on XWEB
    ;b/c 22jan01 allow hrefs to be NOT in dbl quotes plus added end-of-document
    ;d/e 19mar01 started proxies
    ;f 17sep01 proxies again
    ;g 29oct01 proxies again
    ;h 13dec01 minor mods - allow http:name:port in url and proxy
    ;i 08jan02 values-fiphdr and bugette with values
    ;j 08apr02 bug with one output file - core dump
    ;k 01jan03 MACOSX
    ;l-p 21jan04 added -h and allows secs for 'every'
        and 'no-data:'
    ;q-u 08jun04 added matchlinks/ignorelinks/url and now FipSeq
    ;u 27jun04 added -H html, -k ignore skipfile
    ;w-z 30jun04 proxy-is-squid added
;002b   24oct00 added values-file
    ; 06nov00 added 'every' and Chunks

(copyright) 2024 and previous years FingerPost Ltd.