NOTE - These FipHdrs are used internally by webwire :
    C0-9 for cookies
    F0-9 for forms
So pls do NOT use them for 'fiphdr:XX ....' etc

webwire

FOR HTTPS/port 443, please use the 'webwiressl' version of this program.

Webwire goes and gets pages of data from other people's web sites automatically and then sends those pages to your destination - usually the editorial system - in the normal Fip fashion.

These can be updates of weather, financial data, sports results, backup for Wire services if the satellite is down (those were the days !), graphics, software. In fact most things.

It can be used either :
- on a timed basis to get regular known pages.
- on demand by sending a file into spool/webpoll with the FipHdr field DF set to the parameter file required.

What it can do -
- drill down links to several layers deep, optionally ignoring the data on the top levels.
- select only certain links - either in XML, HTML, JSON or CSV - you set masks to filter which to get and which to ignore.
- logon automatically to protected sites and save Cookie information for use in later accesses.
- fill in standard form data to make on-demand searches.
- strip or rework HTML tags to make the data more presentable. This is meant for reasonably simple pages while more complicated ones will be routed through 'ipsgml' and/or 'ipxchg'.
- Use an external list of values to make several grabs to the same site/page/script but varying the search data for each hit, eg to pull all the values of a financial index. (This we call a 'values-file')
- Grab an 'id' from a List-of-items from a REST web service and then sequentially call all items

What it cannot do -
- play tunes.
- run javascripts or any other applet type affairs. (yet..)
- run FTP, GOPHER or whatever (for these and especially FTP, see program 'ipftp' and 'iptimer').

The current version is primarily for getting text data but can be used for images etc if required.

There is a TUNING mode to be used for setting up a new link and trying to clean up the relevant parameter file WITHOUT sending (possibly) live data to the required destination.
- This shows the data with escaped unprintables and '$' at the end of a line.
- All links and forms are also displayed.
- Any pages saved in Tuning mode are NOT sent to the normal output queue (spool/2go) but are left in spool/webtest for future perusal and/or deletion.
- To run, choose your parameter file in tables/webwire and run 'webwire' manually in a window:
      webwire -T AUS.STOX | more
  for prompt before calls, or
      webwire -A -T AUS.STOX | tee aussies
  for no prompting

There are Two (sometimes three) types of parameter file :
1. Main Parameter file which sets up the polling of certain pages at set times (if any).
2. A Page Description file for each site/page accessed.
3. Optional lookup file of values where you want to repetitively hit a site changing certain values each time. (eg a sport site for several divisions or a list of stox to get)

----- Main Parameter file -----

The syntax of the Main Parameter File - by default tables/webwire/XWEB :
    ; comment line
    poll:(pram file) day:(MTuWThFSaSu) time:20:30 mustget:

In detail, the 'poll' keyword :
    Pram file is the name of the Page Description file - see below for its syntax
    day:
        Day of week to run the job :
            M   Monday
            Tu  Tuesday
            W   Wednesday
            Th  Thursday
            F   Friday
            Sa  Saturday
            Su  Sunday
            X   Every day.
            Z   Weekdays M-F.
        Case is NOT important. Commas (but NOT spaces) may be used to separate.
        Default is every day.
    either
    time:
        Time of the day on 24 hour clock. Default is 18:00.
    or
    every:
        interval between grabs          Default: none
            every: (mins) [(optional) start:(starttime) end:(endtime)]
            every:30 start:07:30 end:19:00
        The minimum interval is 1 min and maximum is 3 hours (ie every:180 mins)
        You may also specify in seconds using 'secs' or 'seconds' immediately after the number (with no spaces)
            every:10secs start:09:30 end:09:50

    eg:
        poll:AP day:ALL time:20:10
            Get the Page file tables/webwire/AP every day at 20:10
        poll:Forex day:MTuWThF time:16:30
        poll:Forex day:MTuWThF time:16:40
            Get the Page file tables/webwire/FOREX every week day at 16:30 and 16:40

There can be none or up to 200 polls in the main parameter file.
Note that the page is grabbed ONLY if the program is running.

----- Page Description Parameter files -----

The individual Page description parameter files are also in tables/webwire.
The syntax of these is :
    ; comment start with a semi colon like this

MANDATORY
url:
    Full url of the page.           default: none
    There MUST be one and only one 'url:' specified.
    You can also specify the page, cgi and any subparameters.
    eg  url:www.fingerpost.co.uk
        url:www.big-press-org/sports/baseball/index.htm
        url:www.marketlook.co.uk/scripts/Summary.dll?HandleSummary
dest:
    Fip Destination for the files   default: WEBDATA
    This is the 'DU' FipHdr field as per the USERS file.
    eg  dest:w3saves

OPTIONAL:
use-tls: no/yes
use-ssl: no/yes
use-https: no/yes
    Use Secure Sockets Layer (TLS/SSL) - also called HTTPS      default: no
    If the url starts 'https://....' then this command is NOT needed.
    (There is also a setup option for openssl s/r to use either Bio or SSL functions for the secure connection :
        use-ssl:BIO
        use-ssl:SSL)
ssl-method: tls1.3 tls tls1 tls1.1 tls1.2 sslv2 sslv3 sslv2and3
    Version number to use for TLS/SSL       default: 999 for current default (2 or 3)
    (only the digits are significant, so add other text to make it readable)
    For 'modern' connection, pls do NOT use sslv2 ! as it is deemed insecure
    If default it will check the available list and pick the highest.
    (The default is currently 23 which on a modern server is sslv3 and tls1_2 !)
ssl-password: (password)
ssl-passwd: (password)
    Optional password if the handshake requires a shared secret     default: none
ssl-key: (name of a certificate key file)               default: none
ssl-cert: (name of a certificate file)                  default: none
ssl-root-cert: (name of a root PEM certificate file)    default: none
    Optional certificates are in tables/ssl unless name starts with '/'
ssl-verify: yes/no
    verify server certificates      default: yes
ssl-ciphers: (list)
    acceptable ciphers (use 'openssl ciphers' to list)
    default: "ECDH+AESGCM:ECDH+CHACHA20:ECDH+AES256:ECDH+AES128:!aNULL:!SHA1:!AESCCM"
    pre 2021oct default: "ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!MD5:!DSS"
    pre 2017 default: "HIGH:!aNULL:!kRSA:!SRP:!PSK:!CAMELLIA:!RC4:!MD5:!DSS"
ssl-display: yes/no
    display SSL connection details  default: no
port:
    Port number of the Remote Server.       default: 80
    This forces the port to be this if none is specified.
nofiphdr:
    Do NOT add a Fip Hdr to the file.       default: yes pls (a Fip Hdr is added)
source:
    Fip Source of the files. (FipHdr 'SU').     default: XWEB
    Unless 'noarchive' is specified, all data files will be archived under this name in log/data.
    This can be in FipSeq so that 'combie' can be used to set a default.
noarchive:
    Do NOT archive these files in log/data.     default: archive
maxlevel:3
    Maximum no of levels to drill down.         default: 1
    Normally the URL you have requested is the data you want. However if that is an index page with links that may change, it may be these lower-level pages that are needed.
    'maxlevel' states how many levels of link down the actual data pages are.
    Default is 1 = do NOT drill down any of the links.
    Note that level 1 is the first page.
ignorelevel:
    Used with 'maxlevel' where the information required is on a linked page and NOT on the first page.      def: no
    Use 'ignorelevel' to ignore all those pages on intermediate levels.
    Note that level 1 is the first page.
    eg  ; ignore levels 1, 2, 4 and 6
        ignorelevel:1,2,4,6
matchlinks:
    Only follow links which match this mask.        def: all links
    Used only if 'maxlevel' is greater than 1.
    There can be many 'matchlinks'.
    Use the '*' as a wild card string and '?' as a wild chr.
    eg  ; get all links ENDING 'html'
        matchlinks:*html
matchforms:
    Only process forms which match this mask.       default: no forms
    Used only if 'maxlevel' is greater than 1.
    There can be many 'matchforms'.
    Use the '*' as a wild card string and '?' as a wild chr.
    eg  ; get all forms ENDING 'asp'
        matchforms:getfile.asp
matchframes:
    Only follow frames which match this mask.       def: all frames
    Used only if 'maxlevel' is greater than 1.
    There can be many 'matchframes'.
    Use the '*' as a wild card string and '?' as a wild chr.
    eg  ; get all frames ENDING '.top'
        matchframes:*.top
matchkeys:
    Only follow links which match this test.        def: all links
    Used only if 'maxlevel' is greater than 1.
    Used only for 'linktag' where an attribute MUST be set for the link to be valid
    There can be many 'matchkeys'.
    Use the '*' as a wild card string and '?' as a wild chr.
    eg  ; <hotel id=33 name="Fawlty Towers" url="http://www.ohnonotagain.com" status="current" />
        linktag:hotel@url
        matchkeys:hotel@status=current
        matchkeys:hotel@status=ready
match-case-sensitive: yes/no
    all matches and ignores can be case sensitive or insensitive
    DEFAULT changed 05u to INsensitive - previously sensitive.
match-dedup: (FipSeq)
    Check and ignore Sequential duplicate items with (possibly) diff urls - FipSeq
    It can be set the same as 'skip-save-data' if that is used; else use \W$ for the normal grab url
        match-dedup:\VX\Q6-\Q8\$o
force-lower-levels: (levelNumber)
    When data is on more than one level - maybe a text page has a link to a PDF and you need both bits - use this to get all bits of this element before continuing with the next element.
    The default (without this parameter) is to get all this level and then all the next lowest level etc.
        ; force the lower levels below level 2
        force-lower-levels:2
mime-type-fiphdr:(2 letter FipHdr field)
    if the MimeType is present, add the mime-type to the fiphdr as this fiphdr field
level-fiphdr:(2 letter FipHdr field)
    add the Level of this file to the fiphdr as this fiphdr field
    This can be used for option inside the parameter file:
        level-fiphdr:AL
        option:V1 AL,,,,1,
        option:V2 AL,,,,2,
        filename:level\V1ONE\$o\V2TWO\$o_file
level-link-fiphdr: (FipSeq - 2 letter FipHdr)
    This gives access to the top level link in force-lower-levels
    eg  force-lower-levels:2
        level-link-fiphdr:C1
    so C1 is for level 1 link, C2 for level 2 etc
    If you want only a part of the link use FipSeq to pull it apart
    there is no default
top-level-nextpage:(FipSeq)
    Some BIG feeds will only return the first 'n' hundred items of a list of items - eg S3 returns up to 1000
    Use this to name (in FipSeq) the FipHdr field into which an input tag is saved, so that webwire loops for more
    eg  ; FH for next page
        top-level-nextpage:\JY
        ; save the contents of NextContinuationToken tag at the top level 1
        fiphdr:JY level:1 tag:NextContinuationToken
        ; only add 'continuation-token=' and the Token if the Token HAS data
        option:VY JY
        fixed:P1 \VYcontinuation-token=\JY&\$odelimiter=/&encoding-type=url&list-type=2&max-keys=\JG&prefix=\JE
skip-links:
    Name of a file in /fip/fix/webwire holding names of links and forms already accessed; so that only new ones are tried.
    eg  skip-links:webwirelinks.\$d
    default: none
skip-details-tag: (tagname)
    extra details (such as a publishdate) to check whether existing links have been updated
    see below on the section for RSS feeds
    default: none
skip-purge-after: (hours)
    Number of hours to keep the skip entry
    default is 24.
    You might want to tune this :
    - make bigger if sites add/take off old material
    - reduce the time if the same link is used for different data
skip-save-data: (FipSeq field)
    Sometimes there is some data in the link which changes for every access - such as a Cookie or SessionId
    eg the first access might get
        search.do;jsessionid=A9823A4622A23C10C4EC7F1825BF9E26.node1?messageId=268482
    and the second
        search.do;jsessionid=FCC18E9582E77C2AD9EFE6C68CA0F0A2.node1?messageId=268482
    But they both happen to be the same file - messageId=268482
    Use FipSeq to just get the data that contains ONLY the information you want to save.
    Certain FipHdr fields hold relevant info:
        WX is the field marker '^'
        WS is the skip details tag (optional - see above)
        WT is the type - 'a'-anchor
        WL is the level no
        W$ is the actual link - anchor, form etc
        S$ is the actual hostname or IP address
        WH is the associated display text from an anchor tag
    In the above example :
        ; split on the '?' - get the second field
        repeat:Q1 W$,?.2
        ; skip string is now 'messageId=268482' - note the FipSeq needs a backslash
        skip-save-data:\Q1
skip-balance-group:
    name of a balance group (in tables/sys/BALANCE) to distribute the skip file when changed (see doc on 'ipbalan')
    This is often used where a second system could be used as a redundant server if the main system fails.
    (see also -B input switch)
ignorelinks:
    Of the Links found, skip any matching this mask.        default: all links
    Used only if 'maxlevel' is greater than 1.
    There can be many 'ignorelinks'.
    Use the '*' as a wild card string and '?' as a wild chr.
    eg  ; ignore any links pointing at any 'netscape' or 'microsoft' site
        ignorelinks:*microsoft*
        ignorelinks:*netscape*
        ; ignore any links requiring 'ftp:'
        ignorelinks:ftp:// *
        ; ignore any links to other sections
        ignorelinks:../ *
        ; ignore any links to any index
        ignorelinks:*index*
httphdr:
    Extra lines of HTTP header you may need.        default: none
    Remember to add a NL at the end of each line.
    There can be multiple httphdr lines but pls remember to add a '\n' at the end of each one.
    (or you can try to force all on one httphdr line!)
    eg  httphdr:Authorization: Basic AbGtGgbhpdOkOTE=\n
        httphdr:User-Agent: Mozilla/4.0\n
        httphdr:Host: wibble.wobble.com\n
    see below for 'useful, common header lines'
    ** ALL basic-authentication MUST BE HIGHER IN THE PARAMETER FILE THAN httphdr OR proxy-logon
httphdr-on-all-grabs:yes/no
    Normally the httphdr is only for a single host.
    So if the 2nd or subsequent level is to a different host, by default, nothing defined as 'httphdr' will be added.
    if 'yes', the option adds the httphdrs to all grabs
httphdr-on-proxy:yes/no
    Normally the httphdr is only for data grabs NOT for getting thru the Proxy.
    if 'yes', the option adds the httphdrs to the proxy call
basic-authentication: (fiphdr field) (logon:password)
    Build a FipHdr field with the BasicAuthentication formatted logon:password
    Pls remember to escape any funny chrs - like backslashes
    ** ALL basic-authentication MUST BE HIGHER IN THE PARAMETER FILE THAN httphdr OR proxy-logon
    eg  basic-authentication:BA DOMMY\\zipple:Ardvark99
        httphdr:Authorization: Basic \BA\n
method: POST/GET/DELETE/PUT etc
    default: GET unless 'post:' is specified
    normally this is a single UPPERCASE action - with NO spaces.
post:
    Post a Form         default: get url
    see below for processing a form using method=POST.
filename:
    Filename for the output file in FipSeq.
    default: WEB\$Z
newname:
    ditto
    If this does NOT start with a '/' it is left under the Output Queue as specified on startup (default spool/2go)
    eg  filename:AFP\$d.\$z
    eg  newname:#SN:\JC.\JK.\JZ#XX:\$u.z\$z.v\@v
    Note \@v is no of items in this file
    It is ignored if a -N (forcename) is specified as an input parameter
supercede:(FipSeq which should resolve to yes|no)
    default: no
    if supercede:yes, the contents of any existing file are overwritten
striptags:(yes|no)
    Strip tags and attributes       default: no
wild: (FipSeq)
    Character used as a Wild String for 'matchlinks/ignorelinks'.       default: '*'
    eg  wild:\377
singlewild: (FipSeq)
    Character used as a single Wild chr for 'matchlinks/ignorelinks'.   default: '?'
    eg  singlewild:!
number: (o|d|h)
    Number system for FipSeq        default: octal
    octal, decimal or hexadecimal
    The following are all equivalent :
        number:octal
        before:\040

        number:decimal
        before:\032

        number:hex
        before:\020
before:
    FipSeq String to add before any data.       default: none
after:
    FipSeq String to add after any data.        default: none
script:
    Script to run on the data of the incoming file.     default: none
outque:
    Output folder (in FipSeq)       default: spool/2go
    This overrides both the default and the '-o' output switch except for Testing/Tuning mode where the file is forced to spool/webtest.
log:
    FipSeq custom logging for the item log.     default:\SN \SU \EF : \EH,\EP
    This logs each Page grabbed
    Note that
        EH or ST        remote site host
        EP or SP        remote site port
        EN or SF or SG  remote site url
                        SG is the actual link, the others are the link used to grab
        EF              parameter file used
    The default is that no incoming files are logged by webwire
custom-log:
    FipSeq custom logging for the item log.
    default: none
    This can be used to log link details in a custom log /fip/log/webwire/(date)_(paramfile).fip
        custom-log:pnac.\YN|date.\YT|procdate.\T7|taketime.\T9|source.\TU|take.\TZ|head.\TH
log-errors:w
    (warn) - for all communication errors
log-https-errors: w
    (for warning) - for all HTTPS comms errors
    Any failure to go secure in https connections is flagged as a warning
    The transmission is always aborted. This parameter affects only the logging.
    default: !x for failures
extra:
extra-grab:
extra-pre:
    Extra FipHdr fields (in FipSeq) to be added to the output.      default: none
    To separate FipHdr fields, pls use a '#' or a newline.
    extra-pre is added as soon as the file is read - so may be used for information in the URL
    extra is only used for any output file and is not used at all for any other purpose.
    extra-grab is added before each grab
    eg  extra:ZH:NYNZ#DI:Headline News#QZ:333
        extra-grab:\nD1:\nD2:\nD3:\nD4:\$p\n
tag:
    FipSeq String to replace the start tag such as <H1>.       default: none
    There can be many 'tag's.
    eg  tag:P {Para}\n
endtag:
    FipSeq String to replace the End tag such as </P>, </TITLE>.       default: none
    There can be many 'endtag's.
    eg  endtag:TITLE \n
getimages:
    Also get all the images
    By default all images - *.gif or *.jpeg - are ignored.
keep-alive: yes/no
    Just that !         default: no
http-version: 1.0 or 1.1
    default: 1.0
only-get-if-modified: (FipSeq for yes or no or etag)
    default: NO for get data each time
    This will check the remote server for the time the page was last modified.
    This does not work with old servers and some set to HTTP/1.0.
    If remote data has been modified since, data is grabbed and processed normally
    If not - it is ignored (unless logging is on)
    If the parameter is 'etag' then any incoming ETag tag is saved and subsequent requests use 'If-None-match: (Etag)'
if-modified-suffix: (FipSeq)
    'only-get-if-modified' uses a save file named by the parameter file (or the poll name)
    Use this if there are several grabs using the same parameter file but each needs its own separate time.
    (Otherwise they would all use the one, latest time for all grabs ! - not good !)
    This adds a suffix to the saved time file
        combie:QJ AJ|BJ,json
        if-modified-suffix:\QJ
ignore-key:PHPSESSID
    When matching for skip files, ignore this key-value pair.
    see the section below on Repeat Offenders
max-items: (number)
    Max number of items to grab per session         default: 0 for all
    Some sites only allow you to read 5 or 10 items before blocking you.
    Use this to creep under that total.
    (from 6a20) The number can be in FipSeq.
    Note this is the number of files produced - ignoring Skipped files
    So if the number of linked grabs is 2 * a FipHdr field, use FipSeq 'sum' to adjust.
    eg if AL:7 is a FipHdr field and 2 files per link are generated - THUMBNAIL and HIRES
        sum:Q7 (\AL * 2)
        max-items:\Q7
    There can be a subparameter - level:(number) - where there are multiple levels and you want to grab all the items on the lowest level BUT need to track the previous level
pause-between-files: (secs)
    Gap/wait/pause between grabs
    default is 5 for standalone, 1 for iptimer
    This is overridden by the -w input switch
one-output-file:
    Put ALL data in a single output file.
    The default is one file per page/access
    Use this with 'values' to create a single output file.
    This ONLY uses the FipHdr of the first file if 'values' have been specified.
end-of-document:
    Where a site is sending really really crap HTML - or XML - use this to state what the last tag is.
    For no checking at all :
        end-of-document:
    Default:
        end-of-document:</HTML>
    See below for a standard-fingerpost-rant on crap HTML.....
end-of-cookie-page:
    end text which signifies the end of a logon or cookie page
    This is rarely changed.
    default is </HTML>
connection-retries: (number)
    Number of times to retry a failed or broken connection (ie one dropped before a response is received)
    Some slow sites are throttled and will kick the n+1 th connection off before servicing it. Use this to retry.
    Default is 1 connection - ie NO retries.
connection-timeout: (secs)
    Slow, busy sites may take a lot longer than normal to connect to.
    Use this to adjust the time to connect.
    Default is 90
wait-end-timeout: (secs)
    For slow, busy sites, data - especially big files - may take a lot longer than normal to be retrieved.
    Use this to expand that time.
    Default is 120 (it should be divisible by 5 for some arcane reason)
pretend-301: (3 digit number)
pretend-302: (3 digit number)
    Ignore redirects (HTTP return code 301/307 or 302/308) and assume they are this return code
        pretend-301:200
    this will take a 301 and save the data as though it were an incoming file.
dump-data:
    Save/Dump a copy of each request and response and data in a dump file in /fip/dump
    default: no
dump-filename: (FipSeq)
    Name to be appended to the dump filename in /fip/dump
    default: none
no-data: (FipSeq string in place of data)
    Do not get/send the data - just this string
data-is-binary:(yes/no/maybe - can be FipSeq)
    Data files at the lowest level are binary or not
    default is check for <?xml, Tiff, Jpeg, MsWord/Office, EPS and PDF automatically; otherwise it is treated as text
ignore-mime-if-binary: (yes/no - can be FipSeq)
    if yes = Strip the MimeHeader off binary files
    default is no to leave it on - so you know what the file really is !
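
As an illustration of the kind of automatic check 'data-is-binary' describes, here is a minimal Python sketch that guesses a file type from its leading bytes. It is NOT webwire's own code: the function name 'sniff' and the signature table are assumptions made for this example, and the byte signatures are the commonly published ones for each format rather than anything taken from the program.

```python
# Sketch only: guess the type of grabbed data from its first few bytes.
# The signatures are widely documented format markers, not webwire internals.
SIGNATURES = [
    (b"<?xml", "xml"),
    (b"II*\x00", "tiff"),               # little-endian TIFF
    (b"MM\x00*", "tiff"),               # big-endian TIFF
    (b"\xff\xd8\xff", "jpeg"),
    (b"\xd0\xcf\x11\xe0", "msoffice"),  # legacy OLE2 container (.doc, .xls)
    (b"PK\x03\x04", "msoffice"),        # zip container (.docx etc - but also any zip)
    (b"%!PS", "eps"),
    (b"%PDF", "pdf"),
]


def sniff(data: bytes) -> str:
    """Return a short type label, or 'text' if no signature matches."""
    for magic, kind in SIGNATURES:
        if data.startswith(magic):
            return kind
    return "text"


if __name__ == "__main__":
    print(sniff(b"%PDF-1.7\n..."))
    print(sniff(b"Just some plain words"))
```

If the remote site is known to send one type only, setting 'data-is-binary' explicitly avoids relying on any guess of this kind.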

For Socks 4/5 - use these parameters to control
use-socks:4/5 yes/no
    (yes is same as 5)
socks-host: (hostname of the socks proxy)
    no default
socks-port: (port number of the socks proxy)
    default: 1080
socks-user: (user name for the socks proxy)
    no default
    if nothing specified, assumed that there is none
socks-pwd: (password for the socks proxy)
    no default

For old-style HTTP Proxies :
proxy-server:
proxy-port:
    If using a proxy, these are the name and port to aim at.
proxy-logon:
    This is the logon and password to get thru the firewall if required.
    The format is (logon) (colon) (password) and is converted to base 64.
        proxy-logon:Y2hyaXMuaHVnaGpvbmVzOnBhbnRoZXIK=
    ** ALL basic-authentication MUST BE HIGHER IN THE PARAMETER FILE THAN httphdr OR proxy-logon
    To generate use basic-authentication or:
        echo -n "logon:password" | sffb64 -i
    eg  echo -n "chris:sleekpanther" | sffb64 -i
    gives Y2hyaXM6c2xlZWtwYW50aGVy
        proxy-logon:Y2hyaXM6c2xlZWtwYW50aGVy=
proxy-is-squid:yes/no
    Is the proxy a Squid ?      default: no
proxy-handshake:yes/no
    Does the proxy need to CONNECT first ?      default: no
    If the proxy is a Squid, this MUST be NO
logeachfile:(dest)
    Send a Success/failed msg to this destination for each file.
    There is no default.
    This log file is just a FipHdr with the following extra fields :
        DR - File Sent OK           DR:ok or DR:error
        DG - Will Retry later       DG:retrying, DG:stopped
        DT - Some message text      DT:No connection
    default: no log created.
    The text for the DR and DG can be in FipSeq and so can contain FipHdr and other variables.
    As they are FipHdr fields, please do NOT put NL, CR etc in the fields.
    Note that System Variable \$q holds the time taken for transmission.
DRgood:(text)
    Message for the FipHdr field DR on a successful tx
    default: ok
DRbad: (text)
    Message for the FipHdr field DR on an unsuccessful tx
    default: error
DGcont:(text)
    Message for the FipHdr field DG if, after an unsuccessful tx, another attempt will be made.
    default: retrying
DGstop:(text)
    Message for the FipHdr field DG if no further attempts will be made as the file was sent successfully or the maximum no of attempts has been tried.
    default: stopped
fiphdr-for-logeachfile: (FipSeq)
or
msgeachfile:(FipSeq)
    Additional information to add to the FipHdr of the 'logeachfile' or 'loglasterrfile' msg.
    This should be in FipHdr format and be in FipSeq.
    It can be used to pass FipHdr fields in the outgoing file into the log file.
    eg  msgeachfile: DF:logdial\nSS:\SS\n
    default: nothing added
convert-CDATA-sections:
        convert-CDATA-sections:no       - no dont ! (default)
        convert-CDATA-sections:zap      - no but zap the '<![CDATA[' and ']]>'
        convert-CDATA-sections:yes      - yes pls and zap the '<![CDATA[' and ']]>'
        convert-CDATA-sections:preserve - yes pls and leave the '<![CDATA[' and ']]>'
    Normally a CDATA section like :
        <![CDATA[ Vongerful Vondafool C&oe;penh&areing;gen <99thisIsAnon-compliant XMLtag> ]]>
    is considered a single, raw string of XML/SGML data.
    And all the tags and entities (like &lt;) are not changed either.
    Use this parameter to convert them.
    Note that you should use this option CAREFULLY if any tag in the CDATA is the same as a tag in the main envelope.
    See below for more comments.

To save the contents of a particular Tag or TagAttribute, use the 'fiphdr' keyword :
fiphdr:(FipHdr field) (optional subkeywords)
    Either
        tag:(name of tag)
            specify the tag name which contains the data required.
    Or
        data:(FipSeq)
            for adding FipHdrs with standing data.
                fiphdr:TT data:\$e\$y\$i\$d
            will create a FipHdr field TT with the current date in it
    Or
        tag:(name of tag)@(name of attribute)
            specify the tag name and the attribute name which contains the data required.
    Or
        there can also be a 'key' parameter for selecting the data ONLY if there is a Key attribute with its data equal to a certain string:
        eg: if the tag is <meta name="category" content="f"/>
                fiphdr:NC tag:meta@content key:meta@name=category
            Get the contents of the content attribute of 'meta' where another attribute called 'name' has the value 'category'
        or
                fiphdr:NC tag:meta key:meta@name=category
        or
                fiphdr:NC tag:meta@name=category
            Get the data for the 'meta' tag that has an att 'name' = 'category'
        Double quotes around the Key Data are optional unless there are embedded spaces.
        The Key Data can be in FipSeq.

    For any of the tag options, use 'dup' to flag duplicated fields.
        dup:(optional separator)
            This field may be duplicated. Duplicate fields are separated with a space unless a separator chr is also specified.
    Where there might be embedded tags inside the main tag, use 'repxml' to specify a replace string
        repxml:(FipSeq)
        eg  fiphdr:AL tag:TD repxml:+\s+
        and the data is <td>abc<br>efg<br>line3</td>
        will give AL:abc+ +efg+ +line3
    As some FipHdr fields have distinct meanings - SN, DU, DP etc - please use other 2 letter codes starting N or Q.
    In the current version of webwire, you CANNOT specify trees of tags ie fiphdr:AA tag:entry/id.
    eg  fiphdr:NA tag:itemid dup:+
        get the data from each <ITEMID> field. If there is more than one, they are separated by a '+'.
fiphdr-save:(FipSeq)
fiphdr-file:(Filename in /fip/fix/webwire/fiphdr)
    This allows data to be stored as FipHdrs at the end of the session - and read at the beginning of the next
    So items like Sequence numbers and time-of-access can be passed between attempts.
        ; default name
        combie:QA WA,default
        ; save and possibly reuse the FipHdrs ....
        repeat:JQ J1,+,1
        repeat:JD J2,+,1
        fiphdr-save:BQ:\JQ\nBD:\JD\nXX:some comment\n
        fiphdr-file:websave_\QA
    ** This must be lower down the parameter file than any FipSeq if you are using FipHdr fields as in the example above !
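
To make the idea behind 'fiphdr-save' and 'fiphdr-file' concrete, here is a small Python sketch of carrying FipHdr-style fields from one session to the next. It is only a model of the behaviour described above, not webwire's implementation: the helper names 'save_fiphdrs' and 'load_fiphdrs' are hypothetical, and the on-disk layout (one 'XX:value' pair per line, as in the fiphdr-save example) is an assumption.

```python
from pathlib import Path


def save_fiphdrs(path, fields):
    """Write each 2 letter field as a 'XX:value' line (assumed layout)."""
    lines = [f"{code}:{value}\n" for code, value in fields.items()]
    Path(path).write_text("".join(lines), encoding="utf-8")


def load_fiphdrs(paths):
    """Read one or more save files in order; later files override earlier ones."""
    fields = {}
    for path in paths:
        p = Path(path)
        if not p.exists():
            continue  # first ever session: nothing saved yet
        for line in p.read_text(encoding="utf-8").splitlines():
            code, sep, value = line.partition(":")
            if sep and len(code) == 2:
                fields[code] = value
    return fields


if __name__ == "__main__":
    # 'websave_default' mirrors the fiphdr-file:websave_\QA example above.
    save_fiphdrs("websave_default", {"BQ": "1042", "BD": "20240131", "XX": "some comment"})
    print(load_fiphdrs(["websave_default"]))
```

In this sketch a missing save file is simply skipped, so the first run starts with no carried-over fields.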
    There can be multiple 'fiphdr-file' - all of which are read as the parameter file is read.
    But if there is a fiphdr-save, ONLY the last 'fiphdr-file' is stored to.
fiphdr-on-all-levels:
    Add the FipHdr to each file on every level - default: no
fiphdr-hash: (single chr in FipSeq)
    This will replace a Hash '#' in a FipHdr field (as Hashes are normally end-of-fiphdr field)
meta-to-save:(FipSeq)
meta-save-file: (Filename)
meta-save-on-tag: (tag name)
    This meta file is appended to on the End-of-tag specified (or end-of-file if no tag specified)
        ; save these fields to the lookup file
        meta-to-save:\J3|\J5|\J6|\J1|\J4|\$h:\$n:\$b\n
        meta-save-file:/fip/data/blob/\$e\$y\$i\$d/\WA
        meta-save-on-tag:LINK
reset-fiphdr-on-tag: (tagName)
    Trim the FipHdr - and extra, added fields - on the end of this tag to the same position when the tag started
    This can be used in meta-save to make sure that FipHdr fields from one grab do NOT exist for the second or subsequent grabs
    default: not used.
grab-on-tag: (tagName)
grab-on-endtag: (tagName)
    Any links should be grabbed at the start or end of this Tag
    default: all links are grabbed at the end of the page
    An extra parameter may be specified on the same line for level
    eg  grab-on-endtag:VALUE level:3
        grab-on-endtag:params/param/value/struct/member
    NOTE that grab-on-endtag does not trim the FipHdr (as we might need the extra meta for a fiphdr-save).
    So use reset-fiphdr-on-tag with the same tag to trim (if there is NO fiphdr-save)
retry-404-max:3
retry-404-gap:1
retry-404-error:abort/ignore/move
retry-404-queue:2go
retry-404-fiphdr:#CE:300#DU:nextstage
    Retry links which return a 404 Not Found error.
    Max is the number of retries and Gap is the pause in seconds between the retries
    Use this for those sites which are a bit slow to add the data files the links point to.
    If the files really are not there - and you do NOT want to abort the transmission - use 'retry-404-error:ignore' to continue with the next grab
    OR you can use retry-404-error:move and retry-404-queue:(queue in spool) and retry-404-fiphdr:(FipSeq) to send an item
retry-500-code:505
retry-500-max:5
retry-500-gap:1
retry-500-error:abort/ignore/move
retry-500-fiphdr-file:delete/ignore
retry-500-queue:2go
retry-500-fiphdr:#CE:300#DU:nextstage
    Retry links which return this system error - code can be any 3 digit number above 400.
    Max is the number of retries and Gap is the pause in seconds between the retries
    Use this for those sites which are a bit slow to add the data files the links point to.
    If the errors continue - and you do NOT want to abort the transmission - use 'retry-500-error:ignore' to continue with the next grab
    OR you can use retry-500-error:move and retry-500-queue:(queue in spool) and retry-500-fiphdr:(FipSeq) to send an item
save-data-path: (FipSeq pathname for data)
    This puts the data of the incoming file into this folder and creates a FipHdr file that contains 2 FipHdrs containing the full path/filename :
    SX: and FTP_EXTERNAL_FILE:
    (ipbalan uses SX and ipftp uses FTP_EXTERNAL_FILE)
    eg  save-data-path:/fip/data/jpegs/\$e\$y\$i\$d/
    Use this for big files that you do not want to copy around the Fip Spool area.
save-data-filename: (FipSeq name)
    Use this to specify exactly what the 'save-data-path' name should be
    default is (incoming filename).(time).(seqno)
    eg  save-data-filename:\HR-\SU.raw
save-data-balance-group: (Balance group)
    Balance all save-data files to the following group.
    default: do not balance
save-data-balance-folder: (Balance folder)
    If balancing, put the token in this folder under spool.
    default: 2balance
max-children:(number in FipSeq)
    - same as the -E switch.
    default: none
forks-per-sec: (number in FipSeq)
    Throttle forks to this number per second
    default: no throttle

-- More Complex sites ------

-------- Oauth2, Oauth digest and AWS notes --------

-- For accessing Oauth2 protected assets - eg GCP Cloud Storage or G-Drive or Microsoft Azure features

; OAUTH2 authentication as per Google GCP or Microsoft Azure
use-oauth2:yes/no
    Use OAUTH2 to grab/use an access-token or Bearer token eg for Gmail access
    default is NO

    ; We need an access token
    use-oauth2:yes
    ; which flavour of Oauth2 ? - only the first letter is meaningful
    ; oauth-flavour: Google (Gmail) or Microsoft (Office365)
    ; 'microsoft' here is for office 365
    oauth-flavour:microsoft
    ; Current token file will be saved in /fip/fix/goauth2
    oauth-token-file:\OT
    ; Credentials file in /fip/tables/cert
    oauth-credentials-file:\OC
    ; sffoauth and imapwire
    oauth-scope:https://outlook.office365.com/.default
    ; Script to run when token expires - approximately every 12 hours
    ; oauth-refresh-script: (Script in FipSeq) script to generate the access_token using a refresh_token
    oauth-refresh-script:/fip/bin/sffoauth -z wire/IMAP.O365.OAUTH.SEA -c \OC -t \OT -H '#WN:\WN' -a

These 5 FipHdrs are used to generate, check, add/renew permissions to access the remote data - normally Gmail or Office365
oauth-client-fiphdr: (FipHdr)       default: IC
oauth-secret-fiphdr: (FipHdr)       default: IS
oauth-access-fiphdr: (FipHdr)       default: IA
oauth-refresh-fiphdr: (FipHdr)      default: IR
oauth-expiry-fiphdr: (FipHdr)       default: IX

-- For accessing other Oauth protected assets - like twitter

For OAUTH digest - as used by sites such as twitter and dropbox - these parameters need to be defined :
    ; this is the salt key
    oauth-signature-key:\JC&\JA
    ; string to use
    oauth-signature-data:\R5
    ; fiphdr to add to
    oauth-signature-fiphdr:RS
    ; type sha1
    oauth-signature-type:sha1

oauth-signature-key: (FipSeq)
    Key for Oauth; normally the ConsumerKey and the AccessToken
oauth-signature-data: (FipSeq)
    Data string to encode - see your remote api doc for what needs to be included
and how it should be formatted
    oauth-signature-fiphdr: (2 letter FipHdr field) FipHdr which will hold the signature
    oauth-signature-type: (type) Signature type
        valid types are md5, sha1, sha224, sha256, sha384 and sha512

-- For AWS grabs, use the same parameters as oauth (almost!)
    aws-signature-key: (FipSeq) Key for AWS; use | as a sep : secretkey|dateStamp|regionName|serviceName
    aws-signature-fiphdr: (2 letter FipHdr field) FipHdr which will hold the signature
    aws-signature-type: (type) Signature type
        valid types are md5, sha1, sha224, sha256, sha384 and sha512
    aws-request: (FipSeq) FipSeq string to hash using the signature key
        eg aws-request:GET\n/349445556777/fiptest1\nAction=ReceiveMessage&AttributeName=All&MaxNumberOfMessages=10&MessageAttributeName=All&Version=2012-11-05&VisibilityTimeout=1&WaitTimeSeconds=20\nhost:sqs.us-east-99.amazonaws.com\nx-amz-date:20180907T112658Z\n\nhost;x-amz-date\ne3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
    aws-data-fiphdr: (2 letter FipHdr field) FipHdr field which will hold the payload sha256 hash
        if POST, this is the hash of the data part;
        for GET, the payload is left blank (obviously, as there cannot be a payload)
        The hash is added to the last line of the 'aws-request', either as a fixed string (for GET)
        or as a FipHdr field (aws-data-fiphdr).
        If GET, this parameter can be ignored; for SHA256 the fixed string is the hash of an empty payload :
            'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'
        eg sffhmac -I "" -Z sha256 -S -D -H
    aws-data-md5-fiphdr: (2 letter FipHdr field) FipHdr field which will hold the payload md5 hash
        This is used for the S3 Content-MD5: mime header.
as an alternative to the SHA256 hash in x-amz-content-sha256: mime header which should be set to UNSIGNED-PAYLOAD if you use md5 EG : ; if split into FipHdr fields for : ; JA-accessKey, JB-secretKey, JH-sha of payload,JI-aws-id, JP-Url Params, JQ-sqsque, JS-service, JR-region, JZ-utc datetime combie:JA AA,-noAccessKey combie:JB AB,-noSecretKey combie:JI AI,349445556777 combie:JM AM,GET combie:JP AP,Action=ReceiveMessage&AttributeName=All&MaxNumberOfMessages=10&MessageAttributeName=All&Version=2012-11-05&VisibilityTimeout=1&WaitTimeSeconds=20 combie:JQ AQ,fiptest combie:JS AS,sqs combie:JR AR,us-east-99 newdate:JZ gmt unixdate=\$p "\ZZ\ZM\ZGT\ZH\ZF\ZEZ" ; For GET - choose either fixed:JH e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 ; or ; aws-data: ; aws-data-fiphdr:JH aws-signature-key:AWS4\JB|\JD|\JR|\JS|aws4_request aws-signature-type:sha256 aws-signature-fiphdr:RX aws-request:\JM\n/\JI/\JQ\n\JP\nhost:\JS.\JR.amazonaws.com\nx-amz-date:\JZ\n\nhost;x-amz-date\n\JH - In this case, the default request boils down to : GET /349714556777/fiptest Action=ReceiveMessage&AttributeName=All&MaxNumberOfMessages=10&MessageAttributeName=All&Version=2012-11-05&VisibilityTimeout=1&WaitTimeSeconds=20 host:sqs.us-east-99.amazonaws.com x-amz-date:20180907T112658Z host;x-amz-date e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 -- The links are not in the normal Anchor or Frame tags. If the Site returns an XML feed rather than HTML, you can specify what the contents of which tags you want to play with. There can be up to 20 tags specified. 
linktag:(tagname) or linktag:(tagname)@(attribute) (for version 05u onwards) linktag:TEXT linktag-2:Slavver linktag-3:Bone or to imitate the defaults : linktag-1:a@href linktag-2:frames@src - sites which return other data which is not xml - such as CSVs data-type:CSV (can be CSV for comma sep format, JSON, PSV for Pipe sep, TXT) data-type-sep:| data-type-eoln: data-link-idx:2 define the column containing the link to the data headline-link-idx:3 define the column containing the headline skipdetails-link-idx:1 define the column containing the skipdetails -- RSS feeds Sometimes a link can point to data which gets updated and there is a second tag which gives either a unique-id or a date/time which you need to track for any changes. Use the 'skip-details-tag' to specify the second tag - it is the combination of the 'linktag' and 'skip-details-tag' which should be unique. For general RSS 2.0 feeds, this can either be 'pubDate' or 'guid' : linktag:link skip-details-tag:pubDate In RSS feeds there is often a fake 'link' at the top which is the channel. Usually you do not want this one - often it is a URN not a real URL, so use 'matchlinks' or 'ignorelinks' to bypass it. if more than one skip details are needed, up to 9 skip-details-tag-X can be specified. -- If the data in the link is not complete.. Use templates to slot data from a link into another call. This is again used extensively for XML work - like soap. It uses either just a template (in FipSeq so you can add Header Fields etc) or a template AND a template file if there is a lot of data. level2template:/query.dll?src=\QD level3template:/getFile.dll?file=\W$ level3template-file:soap-getfile.xml There are 8 templates for levels 2 to 9. 'maxlevel:' and 'ignorelevel:' must always be used with these to specify which one you need the data from. A levelXtemplate on its own will generate a GET. 
To POST something, you will also have to specify a 'levelXdata: (FipSeq)'
eg
    ; level 3 - To get THAT file is always a POST of FileManager1%24gvwFiles%24ctl03%24gvlnkName
    ; .. using the different EVTVAL and VWSTAT
    level3template:/proximity/Admin/FileManager.aspx
    level3data:__EVENTTARGET=\A2&__EVENTARGUMENT=&__VIEWSTATE=\G8&__EVENTVALIDATION=\G9
will force a
    POST /proximity/Admin/FileManager.aspx
with the data filled in from FipHdrs A2, G8 and G9
eg
    __EVENTTARGET=FileManager1%24gvwFiles%24ctl03%24gvlnkName&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUJMjQyMzY1MzEX%3D&__EVENTVALIDATION=%2FwEWAwKVxPyBCgLc

Note there is no level1template as that is the same as the URL:.. BUT there is a 'level1template-file' version.
In this case the URL: should be just that.

The template-files are normally in /fip/tables/webwire. Their names are NOT forced to uppercase.

The default Content-Type for POSTing data or forms, 'application/x-www-form-urlencoded', sometimes needs to be changed for templates.
Use the little-used 'levelXmime' parameter to change the Content-type / Mime type just for that level.
For example, soap normally likes a content-type of 'application/soap+xml':
    level1mime: application/soap+xml
unless you are Microsoft of course who usually/sometimes want
    level1mime: text/soap

The 'W$' in the example is because each link is put into a temporary FipHdr field called W$ as it is being used.
If the link data is too much or too little, use FipSeq to chop/add/replace.
Eg if the data in the link is "nine:/rt/newsart/Id="z1jit4":text"
and you want a link like /searchDB?database=nine&link=/rt/newsart/Id="z1jit4"&format=text
use
    repeat:R1 W$,:,1
    repeat:R2 W$,:,2
    repeat:R3 W$,:,3
    ; if there is no 3rd field, use 'xml' instead
    combie:W4 R3,xml
    level2template:/searchDB?database=\R1&link=\R2&format=\W4

-- Values -------
Values can be -
- EITHER a file containing lines of values to be used to repeatedly grab data for a single file.
  using
    values-file:(filename in tables/webwire)
- OR a sequential number using
    values-seqno:(min value):(max value):(incremental value)
  plus
    values-seqno-fiphdr-from: (FipHdr field containing the From seqno - ie start grabbing from the NEXT id after this)
    values-seqno-fiphdr-to: (FipHdr field containing the To seqno - ie each seqno until and INCLUDING this one)

    values-get-url:
    values-post-url:
    values-post-data:
        FipSeq to POST a form or GET a link from a line in the values file.
        See below for a description.
    values-sep:
        Separator chr for splitting fields in the values file.
        default is a pipe - '|'
    values-leave-spaces:
        Normally leading spaces are trimmed from each field in the values file.
        Use this to preserve them.
    values-parallel: (Number of Simultaneous Hits)
        For 'values' the default is to run the hits serially, each one starting after the previous one has finished.
        Use this to send out a number of hits at the same time, which should reduce the total time by a large factor.
        However, you should check with the remote and test what the number should be.
        For Apache sites for example, 8 is a common default setting.
        eg values-parallel: 10
    values-fiphdr:
        Normally FipHdr W1 will contain the first field of the values file, W2 the second etc.
        So data can be specified by \W1
        Use this parameter to specify another field - ie if W1 is being used elsewhere.
        ** Note that if you are using iptimer to start webwire running a values file,
        the Wx fields will be zapped in the output file.
        So in this case, always use 'values-fiphdr:' with a different FipHdr if you want to
        use the Values in iproute or another downstream program.
        eg values-fiphdr:R1
    values-pause: (secs)
        Gap/wait/pause between Grabs using the next value
        default is 0 for none
    values-comment: (single FipSeq chr)
        comment - ignore any value line which has this chr as the first non-blank chr
        default: ';' - semicolon
        values-comment:;
    values-allow: (single FipSeq chr)
        allow - only process values lines which have this (case-insensitive) chr as the first non-blank chr
        default: all
        values-allow:E
    zap-values-file: (yes/no)
        Delete the values file after it has been used.
        default no
        Only files in /fip/x/ starting TMP.. can be deleted.

----
Note that in the FipHdr - unless the 'nofiphdr' keyword has been requested, the following fields will be filled in :
    Day and time in the normal HH,HD,HY etc fields
    ST  host
    SP  port
    SF  url - path/filename being grabbed
    SG  url - path/filename which is the link
Where webwire is sitting on a scrolled queue (using -i), the folder name is in EQ and the filename in EN
(with all '#' replaced by the chr changed by 'fiphdr-hash')

Extra FipHdr values are
    \@v is the no of items in this file
    \@i is the childId if spinning off children

Input Parameters (all optional) :
 either
    -i : scrolled queue                                         default: no default
        This checks the folder and, for each file, checks the FipHdr for 'DF',
        which is used for the name of the parameter file to run against.
        This allows a variety of parameter files to be run.
 or -1 : Run a single time and exit                             default: spool
        The parameter is the name of the individual parameter file in tables/webwire
        (ie NOT the top or main parameter file)
 or -T : Tuning mode                                            default: spool
        Display links and data for the page requested.
        Runs only that page and then exits.
        The parameter is the name of the individual parameter file in tables/webwire
        (ie NOT the top or main parameter file)
    -A : In Tuning mode, do NOT prompt before searching a link  default: prompt
    -a : log the actual link of each access in the FipLog       default: no
        This can be quite a lot of logging if you are grabbing lots of files !
But is quite useful when starting/adding a new feed. -B : default balance group for skip files default: none (see skip-balance-group parameter) -C : warm restart for cookies/api-keys default: always ask for new cookies/api-keys for logon ie do NOT re-logon if the previous session logged on and saved the cookie or api-key if any apikey is missing or has timed out, all cookies and api-keys are wiped and webwire needs to be re-run to logon and download. see note below -d : done folder for -i scrolled queue default: none This can be overwritten by the 'doneque:' parameter -D : display the Request and Response default: do not -e : exit with the Result Code of the last grab. default: normal program exit The Normal exit is 0 if ok, negative number if not With -e this will be 0 for ok, and -1 (timeout) but 4XX or 5XX for page errors. -E : maximum number of threads up to a max of 100 (not Win2k). default: 1 Note this is also a hardware limit in that small systems may not be able to run as many. -f : path and filename of the output file if a non-200 HTTP code is returned; default: fip standard use this to leave the file(s) in a non-std folder. ++ NOTE this was -E before version 6a47 -F : do NOT add a FipHdr to the output file default: do this can be overridden by the 'nofiphdr:no' parameter -h : extra FipHdr information default: none This is in FipSeq and should normally be quoted Note this is the means that 'iptimer' sends variable information to webwire eg : -h"SN:hello#TC:200401031" -H : display the Request and Response in fancy HTML default: do not -I : wire id default: 0 used to track which instance of a multi-webwire system a file arrived/logged -k : ignore the Skip list (used mainly in tuning) default: use skip-links: -K : Do NOT save or process any data, just build up a skip file. This can be used before putting sites into production so that all old links are ignored and only new links will be tracked. ie run 'webwire -1 (name) -K' once beforehand. 
    -l : no logging to the FipLog except for errors             default: log all
    -L : log new files and errors to the FipLog                 default: log all
    -m : (FipSeq) no of items                                   default: grab ALL items
        eg -m 3 or -m \A1
        Generally used in testing to reduce the number of files grabbed
        This is overridden by the 'max-items:...' parameter
    -N : path and filename of the output file                   default: fip standard
        use this to leave the file(s) in a non-std folder.
    -o : output queue in 'spool'                                default: spool/2go
        This can be overridden by the 'outque' parameter
        This is ignored in Tuning mode.
    -O : force ALL output to this queue in 'spool'              default: spool/2go
        This overrides the 'outque' parameter
        This is ignored in Tuning mode.
    -s : generate statistics for bandwidth usage                default: no
        using Hour_group files
    -S : generate statistics for bandwidth usage                default: no
        using name of group_client files
    -t : track status                                           default: no
        this can be overridden by the parameter track-status:no
    -V : if using spool-a-folder (-i) then stop when it is empty default: keep spooling
    -w : Wait in seconds between accessing links.               default: 5
    -x : Proxy server host or IP address                        default: none
    -X : Proxy server port                                      default: 80
    -y : Proxy logon                                            default: none
    -Y : Proxy server is Squid                                  default: no
    -z : parameter file in 'tables/webwire'.                    default: XWEB
    -v : Print the version number and exit

---- Other Notes ----

-- Netiquette --
Pls note if you are grabbing data off another site, then you should contact the webmaster of the remote and let them know.
Certainly if you are accessing every few seconds, then there is a good chance they will put you on some refuse list.
So it pays to be nice !

-- How to find out the actual url....
Sometimes it is quite difficult to find out the real path to use for the url.
Especially so for script-driven gets and puts.
NetScape or Iexploiter is invaluable in this case..
    - using either 'View Source' or 'History' normally gives the game away!

Snooping using tcpdump or windump
    0.
Open a Terminal/Cmd window and start your browser - without hitting the site yet
    1. Find out which interface
        tcpdump -D
    2. Leave tcpdump running in background
        On Mac OSX you will need to be sudo
        tcpdump -i1 -w remo.tdmp -X host www.remote.host
    3. On the browser, do the absolute minimum ..
        .. do a simple logon and grab one file using Firefox, Mozilla, IExp, Safari etc
    4. Ctrl-C to stop tcpdump
    5. run tcpdump to show data
        tcpdump -r remo.tdmp > remo.fip
    6. call up remo.fip in an editor.

-- Cookie Cookie Cookie Cookie Cookie Cookie Cookie Cookie Cookie Cookie
Cookies are neat but nasty.
If you already know the cookie you need, just make a file in /fip/fix/webwire with the name of the cookie (case is important on Unix boxes)
and slap in the whole of that cookie, which has the syntax (key)=(data)
    ie zumzum=hungryTummy

Before grabbing data pages we can attempt to logon to a box and get its cookies !!
This uses from 1 to 9 GETs or POSTs
    add-cookie:\C1; \C2     Add the Cookie on to the end of the HTTP headers in this form
    get-cookie-1:           Command to send to get a cookie or to logon.
    get-cookie-data-1:      Optional data usually required for a POST
    get-cookie-http-1:      more HTTP headers used ONLY for this GET/POST
    cookie-fiphdr-1:        name of the cookie to use as a FipHdr field C1 to C9
        ie if there are several cookies returned but only one is needed, put the key as the cookie-fiphdr
        ie  Set-Cookie: ABC=12345
            add-cookie:\C1; perm=yes
            cookie-fiphdr-1:ABC
        will result in a
            Cookie: ABC=12345; perm=yes
        If you want all the cookies to be saved, use '*'
            cookie-fiphdr-1:*
    follow-cookie-redirect: (yes/no)
        ie if you get a 302 Moved Temporarily status plus a Location from a cookie request,
        use that rather than the 'url:..' specified.
            HTTP/1.1 302 Moved Temporarily$
            Date: Fri, 29 Oct 2010 00:17:19 GMT$
            Cache-Control: max-age=3$
            Location: http://fippo.fip.fip/palio/html.run?_Instance=cms_csi&_PageID=1&_SessionID=1068051&_SessionKey=922432532&_CheckSum=328747502$
    cookie-form-1:
        find and save an input tag in a form and put the data in a FipHdr field starting F*
        use this to add hidden form zones into a reply for a logon for example.
            ; csrf_token will go into F1
            cookie-form-1:csrf_token
            ; then send it back
            get-cookie-2:POST /login/
            get-cookie-data-2:csrf_token=\F1&email=dot%40sniggerfrost.com&pwhash=somut&caform=1&submit=Login
    keep-cookie-fiphdrs:yes/no
        Normally the accesses for the cookies do NOT give any data you need to save in the FipHdr for use later on.
        But there are times - eg when the cookie (or api-key) is a logon code - when you DO want to save it.
        However if you do not want this (maybe there is some data which clashes), turn this OFF by specifying NO.

There can be up to 9 of these.
eg
    add-cookie:\C1
    get-cookie-1:GET /
    get-cookie-2:POST /logon.pl
    get-cookie-data-2:logon=helpme&password=iamswimming

Rarely are any get-cookie-http-1 fields needed as Host, Content-type, and Content-length are added automatically.
    Referer is added if you have specified a 'referer', which you should if running 'http-version:1.1'
    Keep-alive is added if you specify 'keep-alive:yes'
    Other 'httphdr' fields should be specified as normal..
As a general rule, some Microsoft IIS sites (who else!) have problems if your HTTP headers are in the wrong order.
Basically, make sure your CONTENT* lines are last.

Example 1
    ; ------------------------------------------------------
    ; we need to go and get a cookie for this service
    ; we will call it C1 - so the httphdr will be 'Cookie: (contents of C1)'
    add-cookie:\C1
    ; C1 will hold the contents of an incoming 'WebLogicSession=.....'
    cookie-fiphdr-1:WebLogicSession
    ; this is the URL to hit (with parameters) to trigger the Cookie
    get-cookie-1:GET /servlet/com.login.DispatchServlet?Login=&User=guest&Pwd=guest

Example 2
    ; ----------------------------------------------
    ; in this case we have 3 cookies C1, C2 and a fixed one 'b'
    ; C1 is SID=..
    ; C2 is ASP...=...
    ; add the fixed 'b=b' on the end
    add-cookie:\C1 ;\C2 ;b=b
    ; just one grab at a cookie - and Logon at the same time
    get-cookie-1:POST /login/Login.asp
    ; one logon string
    get-cookie-data-1:u=%2Findex.asp%3F&l=letmein&p=ohpleaseplease&x=0&y=0
    ; ignore the 302 return - it is only trying to send us to index.asp
    cookie-noredirect-1:
    ; Save the two cookies as C1 and C2
    cookie-fiphdr-1:SID
    cookie-fiphdr-2:ASPSESSIONIDASDQCAAD

This will POST the logon back - ie pretend to be a filled out html FORM.
Note that the cookie-data is 'URI escaped', ie if a special chr - like /?&+ - is in the data bit,
you must use the '%xx' notation (where xx is the HEX value).
But hopefully you would have seen that in your tcpdump/snoop anyway.

-- Proxies Proxies Proxies Proxies Proxies Proxies Proxies Proxies Proxies Proxies
When running through a proxy server, you will need :
    1. hostname of the proxy server
    2. port number on the proxy server if it is NOT port 80
    3. (optionally) a logon and password
    4. Is the proxy SQUID ? If so headers are slightly different.

If this information is NOT available, normally you can find it easily from any PC or Mac on the internal network using a browser like Netscape or IExplorer.
Start a NEW copy of either of these. - It must be a new copy to check on logons etc.
Under 'Preferences' or 'Internet Options' there should be a 'Connections' section and under that, the host name or ip address plus host name of any proxy used.
Note that often the main Fip server is NOT running DNS and will not be able to resolve external hostnames, so the IP address must be used in this case.
Enter these values in the Fip parameter file as :
    proxy-server:195.13.83.99   (no default)
    proxy-port:412              (this defaults to port 80)

Use the Browser to attempt to access a web site outside the firewall - like 'www.fingerpost.co.uk'.
If you are asked for a password to get through, you will probably need to add a 'proxy-logon' parameter too, unless the keeper of the Firewall has made a hole through just for you.
The data for 'proxy-logon' is in base64 in the format (logon) (colon) (password).
Use 'sffb64' to generate this string :
    On a Sparc   echo -n "chris:magicman" | sffb64 -i
    On Linux     echo "chris:magicman" | sffb64 -i
    On Winnt     type "chris:magicman" | sffb64 -i

    proxy-logon:Y2hyaXM6bWFnaWNtYW4=

The actual 'You need to Logon, Pal' message is a '407 Authentication Required' message.

-- Repeat Offenders -----------------
Some sites add a session-id into each and every link. And this Id changes on each access.
To 'webwire' this appears to be a new file and so it is grabbed every time - falsely.
There is an 'ignore-key' command to isolate and ignore the relevant parameter.
eg Take a site like :
    url:http://www.fingerdong.com/
    matchlinks:*&news=yes&newsid=*
    ignorelevel:1
which returns links like
    /en/pressrelease.php?date=20080910&news=yes&PHPSESSID=11bf21&newsid=7866
If the value of PHPSESSID changes on each access, then you will get a copy of newsid 7866 every time.
Use :
    ignore-key:PHPSESSID
Do NOT specify the '=' or '?' etc.

-- Others Others Others Others Others Others Others Others Others Others Others
--Where 'webwire' is used to drill down links, there is a wait of about 5 seconds between accesses which, hopefully, is enough time for other people to use that server.

--Where a logon and password is requested as part of the Browser - ie a pop-up from Netscape or IExplorer, NOT an HTML form - you will need to add an 'Authorization' line.
This will be true if you get a message like :
    HTTP/1.0 999 Authorization failure
    ... etc etc etc ...
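Both the 'proxy-logon' value and the Basic 'Authorization' header are built from the Base64 of (logon):(password). If 'sffb64' is not to hand, the same kind of string can be produced with a few lines of Python. This is only a sketch using the standard library: the function name 'basic_credentials' is made up for illustration and is not part of Fip. It encodes exactly the bytes of logon:password with no trailing newline, and standard Base64 pads the result to a multiple of four characters.

```python
import base64


def basic_credentials(logon: str, password: str) -> str:
    """Return the standard Base64 of 'logon:password' (no newline is encoded)."""
    raw = f"{logon}:{password}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


if __name__ == "__main__":
    token = basic_credentials("chris", "magicman")
    # Line for the webwire parameter file when going through a proxy
    print("proxy-logon:" + token)
    # Line for a site that pops up a browser logon (\n here is the FipSeq NL)
    print("httphdr:Authorization: Basic " + token + "\\n")
```

Paste the printed line into the parameter file as it stands; if the remote still answers with a 401 or 407, check first that no newline or space crept into the logon or password before encoding.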
Assuming you know your logon and password :
    1. Use uuencode or sffb64 to generate a Base64 string
        echo -n "logon:passwd" | sffb64 -i
    2. Add an extra line to the parameter file with the result of the sffb64 line using 'httphdr'.
        Syntax: Authorization (colon) (spc) Basic (spc) (Base64 of logon:password) (\n FipSeq for NL)
        Eg httphdr:Authorization: Basic AbGtGgbhpdOkOTE=\n

-- Valid links are :
    - The HREF tag attribute in A for Anchor        <a href="www.fingerpost.co.uk">
    - The SRC tag attribute in FRAME                <frame src="ax1000.html">
    - The URL in a META/Refresh                     <META HTTP-EQUIV="Refresh" CONTENT="0; url=go4thAndMulitply.com">

-- For 'matchlinks', the term LINK is the contents of the <a href="THISONE">, NOT the associated text
    ie matchlinks:*boonies*
    will find       <a href="/rubbo/boonies/tunies.html">This is a Wonderful Page</a>
    BUT not         <a href="/tunies.html">This is the boonies Wonderful Page</a>

-- Note that 'ignorelinks' refers to both Links and Forms.

-- If you want to ignore all links and only get forms, use a weirdo name in matchlinks
    matchlinks:gobbLedeGook9981

-- What are reasonable HTTP headers ?
    1. If you are using HTTP Version 1.1, you MUST add a line in the headers which specifies the actual host you are trying to access
       (ie the REMOTE hostname or IP address):
        httphdr:Host: www.theirsite.com\n
       or if DNS is a problem
        httphdr:Host: 123.456.789.012\n
    2. Most servers would like to know what you are and what you can do - so lie !
       Try this for starters :
        httphdr:Accept: \052/\052\n
        httphdr:Accept-Language: en\n
        httphdr:User-Agent: Mozilla/4.0 (compatible; MSIE 4.01)\n
       Note the syntax is
        httphdr:(Keyword) (colon) (space) (Parameter) (NL)
       Keyword is case-INsensitive
       There MUST be a Colon-Space between the Keyword and Parameter.
       The line MUST finish with a single NL (which webwire will handle correctly) as Double NLs mean end of header.
    3.
If the data on a lower level is being served from a different host and you need authentication or some other httphdr there,
       use the 'httphdr-on-all-grabs:yes' parameter to add them for that server too.

-- ValuesFile ValuesFile ValuesFile ValuesFile ValuesFile ValuesFile --
Take the case where you need to get the 10 foreign exchange rates every 20 minutes from a site like Yahoo.
The normal way would be to test using one forex rate and, when ready, just duplicate that parameter file another 9 times, just changing the forex name/search string in the 'url' or 'post'.
The classy way is to put all the search values (ie the bits that change) into a single 'values-file' and reference them using FipHdr fields W1 to W9.

To do this :
If the original url is :
    http://finance.yahoo.com/m5?a=1&s=USD&t=LAK

1. Create a values-file in /fip/tables/webwire - let's call it VALUES_4_FOREX
   This can have the normal Fip-style comments of ';' at the start of line
    ;
    ; Values file for Forex
    ;
    USD|LAK
    USD|YEN
    USD|MYR
    ; end of values file

2. In the WebWire parameter file - let's call it FOREX.
    ;
    ; FoREX
    ;
    port:8080
    url:http://finance.yahoo.com
    values-file:VALUES_4_FOREX
    values-get-url:/m5?a=1&s=\W1&t=\W2

... and let rip.....

Note that W1 is the first field, W2 the second etc.
If you are already using W1 for something else, specify another FipHdr field to start on with the 'values-fiphdr' parameter.
Note that the FipHdr fields are useable for filename and other Fippy things.
    filename:Forex-\W1-\W2.fip
will give filenames (and/or FipHdr SN) for our example of
    Forex-USD-LAK.fip
    Forex-USD-YEN.fip
    Forex-USD-MYR.fip

-- Standard-FingerPost-Rant on bad HTML ----------------------
-- Using Webwire to pull off other file formats
Sometimes, 'webwire' seems to only grab part of a page and never returns errors.
Well, if you use a browser to look at the page and then 'View Source' or 'View Frame Source', lo and behold there is probably a random </HTML> at that point.
</HTML> is of course the End Tag of an HTML document. So we SHOULD stop there really.
But a lot of web sites do not care how awful their stuff is - or maybe a conversion program has been set up wrongly
(a well-known news agency in New York uses </html> in place of </image> to end pictures for example)
So use the keyword 'end-of-document' to track either nothing - just timeout - or the REAL end of document.
If the data is NOT html - some XML variant for example - use 'end-of-document' to track that.
By the way, did you know you can immunise yourself from fingerpost-rants; pls contact the sales dept.

-- Wrinkles with Ports and RSS
Some RSS servers like to service the initial list from one port - but you have to grab the data from another
    port:8080
    url:http://finance.yahoo.com

-- using warmrestarts and cookies to keep a logon current
1. Add '-C'
   if using iptimer, add as ' switch:-C '
    client:abc type:w template:abc.fip days:X every:1s fiphdr:'#' switch:-C
2. run manually to see what the http response code is for a BAD (ie logon again pls)
   eg good is normally a 200 code :
    HTTP/1.1 200 OK
   bad is something like
    HTTP/1.1 303 See Other$
    Date: Fri, 24 Jul 2015 15:53:18 GMT$
    Server: Apache$
    Expires: Thu, 19 Nov 1981 08:52:00 GMT$
    Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0$
    Pragma: no-cache$
    Location: /login/$
3. Add
to the parameter file ; YES or FIPHDR= logon returns data; COOKIE=logn returns a cookie in the mimeheader need-logon-token:yes ; look for this in the incoming data matchlogon:/login/ add-cookie:\C1 cookie-fiphdr-1:* get-cookie-1:POST /login/ get-cookie-data-1:csrf_token=\F1&email=shouta%40bingbong.com&pwhash=N1&caform=1&submit=Login when it goes bad, you will get this message in the item log Fri Jul 24 16:58:22 webwiressl !z : Zap cookies for Logon NZXIS.APAC -- API keys in the data basic-authentication:JA whomee:never need-apikey:yes apikey-host-1:https://auth.weather.gods apikey-fiphdr-1:* apikey-url-1:POST /oauth/token apikey-httphdr-1:Authorization: Basic \JA\n apikey-postdata-1:grant_type=client_credentials&scope=ukmo-warning-read apikey-data-type-1:json apikey-save-1:\VZBZ:\JZ\nXX:as at \$h:\$n:\$b\nXX:expires \J1\nXX:scope \J2\nXX:domain \J3\n\$o ; READ last token fiphdr-file:\R2.\R3.\R4.BZ combie:RZ JZ|BZ matchlogon:/login/ ; ; .. {"timestamp":1482758750452,"status":401,"error":"Unauthorized","message":"Not authorized","path":"/active"} ; if status code =401-Unauthorized .. redo the logon .. fiphdr:J0 tag:status matchlogon-fiphdr:J0=401 ; .. and if the test is invalid, zap the FipHdr file - use this very carefully if you are using multiple FipHdr files ! 
matchlogon-invalid:zap ; this will zap any cookies (default) matchlogon-invalid:yes ------------------------------------------------------- Version Control ;6a54 26oct18 chgs for multiple aws grabs ;2-3 retry-404/500 parsed ;4-5 internal - trimming the FipHdr ;6-7 location can be same url but diff host/port and log-errors ;8-11 bugette with if-not-modified and method parsed, maxTree 100->300 ;12-14 9jul19 added socks proxy too ;15 1nov19 cater for when there are > 9 values fields plus added values-allow and allow-comment ;16 11nov19 added S2 S7 etc for system ;17 25nov19 redid lower levels and added fiphdr-ll ;18 12feb20 bug in Values buffers ;19 4mar20 minor dontuse ;20-24 minor ; 25 sizeof url and relocate tuning ;26 linktags 10->20 ;27-28 20jul20 redid apikey so params are parsed at runtime NOT on param file read ;29-30 minor json [] issuette ;31-32 7feb21 redid AWS STS token and retry-404 ;33-34 25feb21 woops - KAlive but diff host issue / FHlevel on last-TOPlevel issue ;35 17mar21 better zap of expired ApiKey ;36-37 5may21 preserve FH in copy_tmp as the new bits are needed for outque/script etc ;38 6oct21 always pull_apart_json/xml for cookies to get FipHdrs ;39 29oct21 added -m maxItems ;40 12apr22 better handling of JSON Arrays ;41 8sep22 added RAW type for EFE api ;42-44 23sep22 better VIEWSTATE ;45 tuned VALUES ;46abc bugette in waitTimeout and (slightly) better messages ;47abcd 1aug23 FORK and children and fiphdr S5: lnko idx ;b convert-CDATA-sections fipseq ;d BIOnotSSL tuned ;48a 7sep23 added oauth for gcp ;49 added poll-or-select and shorten-urn ;50a-d tuning FORK; localSeqno better ;51-53 19mar24 BUG in openBio (RC needs this mod - bad from 6a47d) ;a minor ; 53 no RESET on end of JSON (only on TAG end) ;54 29may24 timeouts map to SSL too ;5z99 10may05 hourly bandwidth stats files rather than per client ;a 13may05 balance skiplists if changed ;b-c 25jun05 added -M and -K ;d-g 05aug05 added fiphdr:XX data:abc\A3 and wait-end-timeout ;h-k 04sep05 changed 
-x-X to force not default ;l-m 07nov05 added 24hour+ skip files ;n-p 25sep06 added ssl at last ;q-t 17oct06 added skip-details-tag ;u 29apr07 major change to linktag, added matchkeys and match-case-sensitive ;v-w14 21may07 add rest of path if 3rd+ level and no starting '/' (w14 - tweaks to stuff_cookie) ;x1-6 8may08 added save-fiphdrs ;3 added -N newname ;6 bugette with VALUES file and port != 80 ;7 added -e and -E errname ;8 balance fiphdr fields ;9 meta-files ;10-12 minor ;13-14 note_balance_action ;15-16 spc in url ;17 added pretend-301:200 ;19 allow feed: ;20-23 finally added basic-authentication: and redid ssl ;24 bugette/modette - allow multiple spaces in mime headers ;25 allow intergap of zero ;26 bugette - save_metadata missing if one and only one found ;27-29 25jun10 bugette when proxy is a Squid and host changes ;y1-9 26jul10 added grab-on-tag/endtag (major release) ;10-11 6sep10 bugette with 302-move and http://... ;12-14 added matchlogon, bug (bg) with data-type:CSV, plus tom bug : retry-404-max:3 retry-404-gap:1 ;15-17 14oct10 added skip-save-data and days:Z for weekdays ;18 15nov10 added follow-cookie-redirect: ; 19 able to parse VALUES-FILE: ;20 added nofiphdr ;21-25 mess if too many 404 plus added -D and fiphdr-hash ;26-27 16mar11 added repxml for fiphdr: / include fiphdr-file in start of hdr.. 
;28-29 31mar11 added zap-values-file:yes ;30-32 poll.every secs bugette ;32 added need-proxy-cookie ;33 6jul11 better skips handling now allow 15000 skips and zap olds with different skipdetails ;34 29jul11 added need-logon-token and cookie-host-X for rconnect ;35-36 added dbl-dblqtes in links plus Bugette in Chunks and redid outque for speedy ;37-41 added CONNECT for proxy https plus started minitracking and sleep between polls for XWEB ;42 allow multiple spaces in custom tag link and added filter ;43 null_next_link added ;43-45 added retry-404-error ;z1-8 15mar12 added eventvalaidation and viewstate and level5* and json ;9-10 allow multiple grabs, added level to grab-on-tag and matchlinks etc ;11-12 redid 302 moved to handle full paths better ;13 ;14 bugettes - proxy/do NOT output file for cookies ;15 28feb13 tuning for level1template-file: ;16 4apr13 bug in skips if no headline ;17-23 11apr13 added trees, levels and keys to fiphdr:, grab-on*tab: and linktag: ;24-28 17may13 added retry-500 kwds and better proxy handling ;27 added level1mime and -I wireId ;29-31 17mar14 added 404/500action=move, que and FipHdr ;31 modette-repxml for all tags ;32 14apr14 for custom logging ;33 4aug14 added -Z force DF ;34 bugette with fiphdr.. key: ;35-36 12nov14 added httphdr-on-all-grabs ;37 17dec14 bugette WINNT Only, cookie=* ;38 ;39 fiphdr-hash for W$ too ;40 CDATA ;41 28dec15 new apache does not like 443 on the end of Host:.. 
;42 13jan16 bug with https and proxy
;43-45 22mar16 allow 302/301 with Values and Bug with skipDetailsTag
;46 httphdrs on proxy
;47 10jun16 pullapartJson better
;48-52 14jun16 proxy and TLS1_2 and httphdr on proxy
;53 7sep16 allow same tag or tag@att to be in multiple fiphdrs
;54-55 19sep16 added SX/save-data-pathname
;56-57 cleanups bugettes - SU/DU if in extra, proxy and 302 handling
;58-59 16oct16 for ANP_FOTO fiphdrs and keydepth
;60-64 30dec16 added apikey stuff
;65-68 maxLowerLevels 5->10
;69-72 bugette to newname/forcenewname and cookieForm
;73-76 28jul17 redid hmac and added recode
;77 JSON grab on endtag if in an array
;78 3nov17 do not attempt to drill down during cookies
;79 15jan18 better JSON handling
;80-81 18feb18 updated ssl and added mime-type-fiphdr and level-fiphdr
;82 redid convertCDATA slightly
;83-4 cookie-host in FipSeq so we can vary it
;85-86 issuette with data-type and cookies
;87 matchlogon-invalid:zap added to zap the FIPHDR file as well as any cookies
;88-90 added if-mod-suffix for filename if-modified
;91-92 better json support
;93 amz hmac support
;94-95 12sep18 added connection-retries and \@V
;96-98 6oct18 better handling of truncated data (aws-data added)
;004z 07jul04 tweaks...
;b 01aug04 added fiphdr:....
;c 10aug04 added levelXtemplate: where X is 2->4
;d-k 01sep04 -9 speedy and timing stats (f-maxlevel and values bugette)
;l-n 07oct04 added skps2, fixed one-file, fixed HTTP results with no messages
;o 28oct04 redid skps2
;p 01dec04 buglette with spaces in URLS - need to be stripped.
plus lvl1file-lvl5file added
;s 31dec04 added -x proxy-host, -X proxy-port plus -y/-Y
;t-u 01feb05 added bandwidth-stats
;v-w 19feb05 added -u testPid and -U singlelevel only and split into files plus bugette with Chunking
;x-z 18apr05 added -O for rpt-offenders/small-diffs flag
;003z 15dec00 added one output file, tracking sents, only-get-if-modified
;a 20dec00 added watch on XWEB
;b/c 22jan01 allow hrefs to be NOT in dbl quotes plus added end-of-document
;d/e 19mar01 started proxies
;f 17sep01 proxies again
;g 29oct01 proxies again
;h 13dec01 minor mods - allow http:name:port in url and proxy
;i 08jan02 values-fiphdr and bugette with values
;j 08apr02 bug with one output file - core dump
;k 01jan03 MACOSX
;l-p 21jan04 added -h and allows secs for 'every' and 'no-data:'
;q-u 08jun04 added matchlinks/ignorelinks/url and now FipSeq
;u 27jun04 added -H html, -k ignore skipfile
;w-z 30jun04 proxy-is-squid added
;002b 24oct00 added values-file
; 06nov00 added 'every' and Chunks

(copyright) 2024 and previous years FingerPost Ltd.