webwire
FOR HTTPS/port 443, please use the 'webwiressl' version of this program.
Webwire goes and gets pages of data from other people's web sites automatically
and then sends those pages to your destination - usually the editorial system -
in the normal Fip fashion.
These can be updates of weather, financial data, sports results, backup for
wire services if the satellite is down, graphics, software. In fact most
things.
It can be used either :
- on a timed basis to get regular known pages.
- on demand by sending a file into spool/webpoll with the FipHdr field DF set
to the parameter file required.
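For the on-demand case, the trigger file dropped into spool/webpoll needs
little more than a FipHdr field DF naming the parameter file - a minimal
sketch, assuming a Page Description file called AP exists in tables/webwire :
DF:AP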
What it can do -
- drill down links to several layers deep,
optionally ignoring the data on the top levels.
- select only certain links - either in XML, HTML, JSON or CSV -
  you set masks to filter which to get and which to ignore.
- logon automatically to protected sites
and save Cookie information for use in later accesses.
- fill in standard form data to make on-demand searches.
- strip or rework HTML tags to make the data more presentable.
This is meant for reasonably simple pages while more complicated ones
will be routed through 'ipsgml' and/or 'ipxchg'.
- Use an external list of values to make several grabs to the same
site/page/script
but varying the search data for each hit. eg to pull all the values of a
financial index. (This we call a 'values-file')
- Grab an 'id' from a web service and then sequentially call all pages using
intermediate ids from the last to the new one.
What it cannot do -
- play tunes.
- run javascripts or any other applet type affairs. (yet..)
- run FTP, GOPHER or whatever (for these and especially FTP, see program
'ipftp' and 'iptimer').
The current version is primarily for getting text data but can be used for
images etc if required.
There is a TUNING mode to be used for setting up a new link and trying to clean
up the relevant parameter file WITHOUT sending (possibly) live data to the
required destination.
- This shows the data with escaped unprintables and '$' at the end of a line.
- All links and forms are also displayed.
- Any pages saved in Tuning mode are NOT sent to the normal output queue
(spool/2go) but are left in spool/webtest for future perusal and/or deletion.
- To run, choose your parameter file in tables/webwire and run 'webwire'
manually in a window:
webwire -T AUS.STOX | more              for a prompt before each call
or webwire -A -T AUS.STOX | tee aussies    for no prompting
There are two (sometimes three) types of parameter file :
1. Main Parameter file which sets up the polling of certain pages at set times
(if any).
2. A Page Description file for each site/page accessed.
3. Optional lookup file of values where you want to repetitively hit a site
changing certain values each time. (eg a sport site for several divisions or a
list of stox to get)
----- Main Parameter file -----
The syntax of the Main Parameter File - by default tables/webwire/XWEB :
; comment line
poll:(pram file) day:(MTuWThFSaSu) time:20:30 mustget:
In detail, the 'poll' keyword :
The pram file is the name of the Page Description file - see below for its syntax.
day: Day of week to run the job :
M Monday
Tu Tuesday
W Wednesday
Th Thursday
F Friday
Sa Saturday
Su Sunday
X Every day.
Z Weekdays M-F.
Case is NOT important.
Commas (but NOT spaces) may be used to separate.
Default is every day.
either
time: Time of the day on 24 hour clock. Default is 18:00.
or
every: interval between grabs. Default: none
every:(mins) [optionally start:(starttime) end:(endtime)]
eg every:30 start:07:30 end:19:00
The minimum interval is 1 min and the maximum is 3 hours (ie every:180 mins).
You may also specify in seconds using 'secs' or 'seconds'
immediately after the number (with no spaces) :
every:10secs start:09:30 end:09:50
eg:
poll:AP day:ALL time:20:10
Get the Page file tables/webwire/AP every day at 20:10
poll:Forex day:MTuWThF time:16:30
poll:Forex day:MTuWThF time:16:40
Get the Page file tables/webwire/FOREX every week day at 16:30 and 16:40
There can be up to 200 polls in the main parameter file (or none at all).
Note that the page is grabbed ONLY if the program is running.
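Putting the 'poll' keywords together, a hedged sketch of a small main
parameter file (the page-file names are illustrative) :
; tables/webwire/XWEB
poll:AP day:X time:20:10
poll:FOREX day:Z every:30 start:07:30 end:19:00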
----- Page Description Parameter files -----
The individual Page description parameter files are also in tables/webwire. The
syntax of these is :
; comment start with a semi colon like this
MANDATORY
url: Full url of the page. default: none
There MUST be one and only one 'url:' specified.
You can also specify the page, cgi and any subparameters.
eg url:www.fingerpost.co.uk
url:www.big-press-org/sports/baseball/index.htm
url:www.marketlook.co.uk/scripts/Summary.dll?HandleSummary
dest: Fip Destination for the files default: WEBDATA
This is the 'DU' FipHdr field as per the USERS file.
eg dest:w3saves
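So the smallest useful Page Description file is just these two keywords - a
sketch, with an invented site name :
; tables/webwire/DEMO - minimal page file
url:www.example-news.com/headlines/index.htm
dest:WEBDATA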
OPTIONAL:
use-tls: no/yes
use-ssl: no/yes
use-https: no/yes
Use Secure Sockets Layer (TLS/SSL) - also called HTTPS default: no
If the url starts 'https://....' then this command is NOT needed.
ssl-method: (1,2,3,23,999)
Version number to use for TLS/SSL default: 999 for current default (2 or 3)
ssl-password: (password)
ssl-passwd: (password) default: none
Optional password if the handshake requires a shared secret
ssl-cert: (name of a PEM certificate file) default: none
ssl-root-cert: (name of a root PEM certificate file) default: none
Optional certificates - held in tables/ssl
port: Port number of the Remote Server. default: 80
This forces the port to be this if none is specified.
nofiphdr: Do NOT add a FipHdr to the file. default: a FipHdr is added
source: Fip Source of the files. (FipHdr 'SU'). default: XWEB
Unless 'noarchive' is specified, all data files will be archived under this
name in log/data.
This can be in FipSeq so that 'combie' can be used to set a default.
noarchive: Do NOT archive these files in log/data. default: archive
maxlevel:3 Maximum no of levels to drill down. default: 1
Normally the URL you have requested is the data you want.
However, if that is an index page with links that may change,
it may be these lower-level pages that are needed. 'maxlevel'
states how many levels of links down the actual data pages are.
Default is 1 = do NOT drill down any of the links.
Note that level 1 is the first page.
ignorelevel: Used with 'maxlevel' where the information def: no
required is on a linked page and NOT on the first page,
use 'ignorelevel' to ignore all those pages on intermediate
levels. Note that level 1 is the first page.
eg ; ignore levels 1, 2, 4 and 6
ignorelevel:1,2,4,6
matchlinks: Only follow links which match this mask. def: all links
Used only if 'maxlevel' is greater than 1.
There can be many 'matchlinks'.
Use the '*' as a wild card string and '?' as a wild chr.
eg ; get all links ENDING 'html'
matchlinks:*html
matchforms: Only process forms which match this mask. default:no forms
Used only if 'maxlevel' is greater than 1.
There can be many 'matchforms'.
Use the '*' as a wild card string and '?' as a wild chr.
eg ; process only the form 'getfile.asp'
matchforms:getfile.asp
matchframes: Only follow frames which match this mask. def: all frames
Used only if 'maxlevel' is greater than 1.
There can be many 'matchframes'.
Use the '*' as a wild card string and '?' as a wild chr.
eg ; follow only frames ENDING '.top'
matchframes:*.top
matchkeys: Only follow links which match this test. def: all links
Used only if 'maxlevel' is greater than 1.
Used with 'linktag' where an attribute MUST be set for the link to be
valid.
There can be many 'matchkeys'.
Use the '*' as a wild card string and '?' as a wild chr.
eg ; <hotel id=33 name="Fawlty Towers" url="http://www.ohnonotagain.com"
status="current" />
linktag:hotel@url
matchkeys:hotel@status=current
matchkeys:hotel@status=ready
match-case-sensitive: yes/no
All matches and ignores can be case sensitive or insensitive.
The default changed in version 05u to INsensitive - previously it was sensitive.
skip-links: Name of a file in /fip/fix/webwire holding names of links
and forms already accessed; so that only new ones are tried.
eg skip-links:webwirelinks.$d
default: none
skip-details-tag: (tagname) extra details (such as a publishdate) to check
whether existing links have been updated
see below on the section for RSS feeds
default: none
skip-purge-after: (hours) Number of hours to keep the skip entry
default is 24. You might want to tune this :
make it bigger if sites add/take off old material;
reduce the time if the same link is used for different data.
skip-save-data: (FipSeq field)
Sometimes there is some data in the link which changes for every access -
such as a Cookie or SessionId
eg the first access might get
search.do;jsessionid=A9823A4622A23C10C4EC7F1825BF9E26.node1?messageId=268482
and the second
search.do;jsessionid=FCC18E9582E77C2AD9EFE6C68CA0F0A2.node1?messageId=268482
But they both happen to be the same file - messageId=268482
Use FipSeq to just get the data that contains ONLY the information you want
to save.
Certain FipHdr fields hold relevant info:
WX is the field marker '^'
WS is the skip details tag (optional - see above)
WT is the type - 'a'-anchor
WL is the level no
W$ is the actual link - anchor, form etc
WH is the associated display text from an anchor tag
In the above example :
; split on the '?' - get the second field
repeat:Q1 W$,?,2
; skip string is now 'messageId=268482' - note the FipSeq needs a backslash
skip-save-data:\Q1
skip-balance-group: name of a balance group (in tables/sys/BALANCE) used to
distribute the skip file when it changes (see doc on 'ipbalan')
This is often used where a second system could be used as a redundant server
if the main system fails. (see also -B input switch)
ignorelinks: Of the Links found, skip any matching this mask. default: none ignored
Used only if 'maxlevel' is greater than 1.
There can be many 'ignorelinks'.
Use the '*' as a wild card string and '?' as a wild chr.
eg ; ignore any links pointing at any 'netscape' or 'microsoft' site
ignorelinks:*microsoft*
ignorelinks:*netscape*
; ignore any links requiring 'ftp:'
ignorelinks:ftp://*
; ignore any links to other sections
ignorelinks:../*
; ignore any links to any index
ignorelinks:*index*
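As a sketch, a typical drill-down setup combines these keywords (the masks
here are invented) :
; the first page is an index - drill one level down to the stories
maxlevel:2
ignorelevel:1
; only follow story links, never the site furniture
matchlinks:*story*.html
ignorelinks:*index*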
httphdr: Extra lines of HTTP header you may need. default: none
Remember to add a NL at the end of each line.
There can be multiple httphdr lines but pls remember to add a '\n' at the
end of each one. (or you can try to force all on one httphdr line!)
eg httphdr:Authorization: Basic AbGtGgbhpdOkOTE=\n
httphdr:User-Agent: Mozilla/4.0\n
httphdr:Host: wibble.wobble.com\n
see below for 'useful, common header lines'
** ALL basic-authentication MUST BE HIGHER IN THE PARAMETER FILE THAN httphdr
OR proxy-logon
basic-authentication: (fiphdr field) (logon:password)
Build a FipHdr field with the BasicAuthentication-formatted logon:password.
Pls remember to escape any funny chrs - like backslashes.
eg basic-authentication:BA DOMMY\\zipple:Ardvark99
httphdr:Authorization: Basic \BA\n
method: POST/GET/DELETE/PUT etc default: GET unless 'post:' is specified
normally this is a single UPPERCASE action - with NO spaces.
post: Post a Form default: get url
see below for processing a form using method=POST.
filename: Filename for the output file in FipSeq. default: WEB$Z
If this does NOT start with a '/' it is left under the
Output Queue as specified on startup (default spool/2go)
eg filename:AFP$d.$z
striptags:(yes|no) Strip tags and attributes default: no
wild: (FipSeq) Character used as a Wild String for default: '*'
'matchlinks/ignorelinks'.
eg wild:\377
singlewild: (FipSeq) Character used as a single default: '?'
Wild chr for 'matchlinks/ignorelinks'.
eg singlewild:!
number: (o|d|h) Number system for FipSeq default: octal
octal, decimal or hexadecimal
The following are all equivalent :
number:octal
before:\40
number:decimal
before:\32
number:hex
before:\20
before: FipSeq String to add before any data. default: none
after: FipSeq String to add after any data. default: none
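For example, to wrap each grabbed page in a hypothetical envelope tag
(the \n is FipSeq for NL) :
before:<grab>\n
after:</grab>\n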
script: Script to run on the data of the incoming file. default: none
outque: Output folder (in FipSeq) default: spool/2go
This overrides both the default and the '-o' output switch
except for Testing/Tuning mode where the file is forced
to spool/webtest.
log: FipSeq custom logging for the item log. default:\SN \SU \EF : \EH,\EP
This logs each Page grabbed.
Note that
EH or ST remote site host
EP or SP remote site port
EN or SF or SG remote site url - SG is the actual link, the others are the
link used to grab
EF parameter file used
The default is that no incoming files are logged by webwire.
custom-log: FipSeq custom logging for the item log. default: none
This can be used to log link details in a custom log
/fip/log/webwire/(date)_(paramfile).fip
custom-log:pnac.\YN|date.\YT|procdate.\T7|taketime.\T9|source.\TU|take.\TZ|head.\TH
log-https-errors:warn
Any failure to go secure in an https connection is flagged as a warning.
The transmission is always aborted - this parameter affects only the logging.
default: !x for failures
extra:
extra-pre: Extra FipHdr fields to be added to the output file. default: none
To separate FipHdr fields, pls use a '#'.
'extra-pre' is added as soon as the file is read - so it may be used for
information in the URL.
'extra' is only added to the output file and is not used at all for any other
purpose.
eg extra:ZH:NYNZ#DI:Headline News#QZ:333
tag: FipSeq String to replace the start tag default: none
such as <H1>. There can be many 'tag's.
eg tag:P {Para}\n
endtag: FipSeq String to replace the End tag default: none
such as </P>, </TITLE>. There can be many 'endtag's.
eg endtag:TITLE \n
getimages: Also get all the images
By default all images - *.gif or *.jpeg are ignored.
keep-alive: yes/no default: no
Just that !
http-version: 1.0 or 1.1 default:1.0
only-get-if-modified: (FipSeq message if not found) default: get
This will check the remote server for the time the page was
last modified. This does not work with old servers and some
set to HTTP/1.0.
If modified since, the page is read
If not, the optional message is sent
If there is no message, no data is sent - just a note in the item log
ignore-key:PHPSESSID
When matching for skip files, ignore this key-value pair.
see the section below on Repeat Offenders
max-items: (number) default: 0 for all
Max number of items to grab per session
Some sites only allow you to read 5 or 10 items before blocking you.
Use this to creep under that total.
pause-between-files: (secs)
Gap/wait/pause between grabs default is 5 for standalone, 1 for iptimer
This is overridden by the -w input switch
one-output-file: Put ALL data in a single output file.
The default is one file per page/access
Use this with 'values' to create a single output file.
This ONLY uses the FipHdr of the first file if 'values' have been specified.
end-of-document: Where a site is sending really really crap HTML - or XML -
use this to state what the last tag is.
For no checking at all : end-of-document:
Default: end-of-document:</HTML>
See below for a standard-fingerpost-rant on crap HTML.....
end-of-cookie-page: end text which signifies the end of a logon or cookie page
This is rarely changed.
default is </HTML>
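For instance, for an XML/RSS feed you might anchor on the feed's own closing
tag - a hedged sketch :
end-of-document:</rss>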
connection-timeout: (secs)
wait-end-timeout: (secs)
For slow, busy sites, data - especially big files - may take a lot longer
than normal to be retrieved. Use this to expand that time. Default is 120
(it should be divisible by 5 for some arcane reason).
pretend-301: (3 digit number)
pretend-302: (3 digit number)
Ignore redirects (HTTP return code 301 or 302) and assume they are this
return code instead, eg
pretend-301:200
will take a 301 and save the data as though it were an incoming file.
no-data: (FipSeq string in place of data)
Do not get/send the data - just this string
data-is-binary:(yes/no/maybe)
Whether data files at the lowest level are binary or not.
default is to check automatically for <?xml, Tiff, Jpeg, MsWord/Office, EPS
and PDF;
otherwise the data is treated as text
ignore-mime-if-binary: (yes/no)
if yes = Strip the MimeHeader off binary files
default is no to leave it on - so you know what the file really is !
proxy-server: If using a proxy, these are the name and port to aim at.
proxy-port:
proxy-logon: This is the logon and password to get thru the firewall
if required. The format is (logon) (colon) (password) and is
converted to base 64.
proxy-logon:Y2hyaXMuaHVnaGpvbmVzOnBhbnRoZXIK=
** ALL basic-authentication MUST BE HIGHER IN THE PARAMETER FILE THAN httphdr
OR proxy-logon
To generate use basic-authentication or:
echo -n "logon:password" | sffb64 -i
eg echo -n "chris:sleekpanther" | sffb64 -i
gives Y2hyaXM6c2xlZWtwYW50aGVy
proxy-logon:Y2hyaXM6c2xlZWtwYW50aGVy=
proxy-is-squid:yes/no Is the proxy a Squid ? default: no
proxy-handshake:yes/no Does the proxy need to say hello first ? default: no
If the proxy is a Squid, this MUST be NO
logeachfile:(dest) Send a Success/failed msg to this destination
for each file. There is no default. This log file is
just a FipHdr with the following extra fields :
DR - File Sent OK: DR:ok or DR:error
DG - Will Retry later: DG:retrying or DG:stopped
DT - Some message text: eg DT:No connection
default: no log created.
The text for the DR and DG can be in FipSeq and so can contain
FipHdr and other variables. As they are FipHdr fields, please
do NOT put NL, CR etc in the fields.
Note that System Variable $q holds the time taken for transmission.
DRgood:(text) Message for the FipHdr field DR on a successful tx
default: ok
DRbad: (text) Message for the FipHdr field DR on an unsuccessful tx
default: error
DGcont:(text) Message for the FipHdr field DG if, after an
unsuccessful tx, another attempt will be made.
default: retrying
DGstop:(text) Message for the FipHdr field DG if no further
attempts will be made as the file was sent successfully
or the maximum no of attempts has been tried.
default: stopped
fiphdr-for-logeachfile: (FipSeq) or
msgeachfile:(FipSeq) Additional information to add to the FipHdr of the
'logeachfile' or 'loglasterrfile' msg. This should be in FipHdr
format and be in FipSeq. It can be used to pass FipHdr fields
in the outgoing file into the log file.
eg msgeachfile:DF:logdial\nSS:\SS\n
default: nothing added
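A hedged sketch of a per-file logging block, assuming a destination WEBLOG
exists in the USERS file :
logeachfile:WEBLOG
DRgood:sent ok
DRbad:send failed
msgeachfile:DF:weblog\n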
To save the contents of a particular Tag or TagAttribute, use the 'fiphdr'
keyword :
fiphdr:(FipHdr field) (optional subkeywords)
Either tag:(name of tag)
specify the tag name which contains the data required.
Or data:(FipSeq)
for adding FipHdrs with standing data.
fiphdr:TT data:$e$y$i$d
will create a FipHdr field TT with the current date in it
Or tag:(name of tag)@(name of attribute)
specify the tag name and the attribute name which contains the data
required.
Or there can also be a 'key' parameter for selecting the data ONLY if there
is Key attribute with its data equal to a certain string:
eg: if the tag is <meta name="category" content="f"/>
fiphdr:NC tag:meta@content key:meta@name=category
Get the contents of the content attribute of 'meta' where another attribute
called 'name' has the value 'category'
or fiphdr:NC tag:meta key:meta@name=category
or fiphdr:NC tag:meta@name=category
Get the data for the 'meta' tag that has an att 'name' = 'category'
Double quotes around the Key Data are optional unless there are embedded
spaces. The Key Data can be in FipSeq.
For any of the tag options, use 'dup' to flag duplicated fields.
dup:(optional separator)
This field may be duplicated. Duplicate fields are separated
with a space unless a separator chr is also specified.
Where there might be embedded tags inside the main tag, use 'repxml' to
specify a replace string
repxml:(FipSeq)
eg fiphdr:AL tag:TD repxml:+\s+
and if the data is <td>abc<br>efg<br>line3</td>
this will give AL:abc+ +efg+ +line3
As some FipHdr fields have distinct meanings - SN, DU, DP etc - please use
other 2 letter codes starting N or Q.
In the current version of webwire, you CANNOT specify trees of tags ie
fiphdr:AA tag:entry/id.
eg fiphdr:NA tag:itemid dup:+
get the data from each <ITEMID> field. If there is more than one,
they are separated by a '+'.
fiphdr-save:(FipSeq)
fiphdr-file:(Filename in /fip/fix/webwire/fiphdr)
This allows data to be stored as FipHdrs at the end of the session - and read
at the beginning of the next.
So items like Sequence numbers and time-of-access can be passed between
attempts.
; default name
combie:QA WA,default
; save and possibly reuse the FipHdrs ....
repeat:JQ J1,+,1
repeat:JD J2,+,1
fiphdr-save:BQ:\JQ\nBD:\JD\nXX:some comment\n
fiphdr-file:websave_\QA
** This must be lower down the parameter file than any FipSeq if you are
using FipHdr fields as in the example above !
There can be multiple 'fiphdr-file' - all of which are read as the parameter
file is read.
But if there is a fiphdr-save, ONLY the last 'fiphdr-file' is stored to.
fiphdr-on-all-levels:
Add the FipHdr to each file on every level - default: no
fiphdr-hash: (single chr in FipSeq)
This will replace a Hash '#' in a FipHdr field (as Hashes are normally
end-of-fiphdr field)
meta-to-save:(FipSeq)
meta-save-file: (Filename)
meta-save-on-tag: (tag name)
This meta file is appended to on the End-of-tag specified (or end-of-file if
no tag specified)
; save these fields to the lookup file
meta-to-save:\J3|\J5|\J6|\J1|\J4|$h:$n:$b\n
meta-save-file:/fip/data/blob/$e$y$i$d/WA
meta-save-on-tag:LINK
reset-fiphdr-on-tag: (tagName)
Trim the FipHdr - and extra, added fields - at the end of this tag back to
the position when the tag started.
This can be used with meta-save to make sure that FipHdr fields from one
group of tags do not linger and are not used for the second or subsequent
groups. default: not used.
grab-on-tag: (tagName)
grab-on-endtag: (tagName)
Any links should be grabbed at the start or end of this Tag
default: all links are grabbed at the end of the page
An extra parameter may be specified on the same line for the level, eg
grab-on-endtag:VALUE level:3
grab-on-endtag:params/param/value/struct/member
retry-404-max:3
retry-404-gap:1
retry-404-error:abort/ignore/move
retry-404-queue:2go
retry-404-fiphdr:#CE:300#DU:nextstage
Retry links which return a 404 Not Found error. Max is the number of retries
and Gap is the pause in seconds between the retries
Use this for those sites which are a bit slow to add the data files the links
point to.
If the files really are not there - and you do NOT want to abort the
transmission - use 'retry-404-error:ignore' to continue with the next grab
OR you can use retry-404-error:move with retry-404-queue:(queue in spool) and
retry-404-fiphdr:(FipSeq) to send an item there instead.
retry-500-code:505
retry-500-max:5
retry-500-gap:1
retry-500-error:abort/ignore/move
retry-500-queue:2go
retry-500-fiphdr:#CE:300#DU:nextstage
Retry links which return this system error - code can be any 3 digit number
above 400.
Max is the number of retries and Gap is the pause in seconds between the
retries
Use this for those sites which are a bit slow to add the data files the links
point to.
If the errors continue - and you do NOT want to abort the transmission - use
'retry-500-error:ignore' to continue with the next grab.
OR you can use retry-500-error:move with retry-500-queue:(queue in spool) and
retry-500-fiphdr:(FipSeq) to send an item there instead.
More Complex sites ------
-- The links are not in the normal Anchor or Frame tags.
If the Site returns an XML feed rather than HTML, you can specify which
tags' contents you want to play with. There can be up to 10 tags
specified.
linktag:(tagname)
or linktag:(tagname)@(attribute) (for version 05u onwards)
linktag:TEXT
linktag-2:Slavver
linktag-3:Bone
or to imitate the defaults :
linktag-1:a@href
linktag-2:frames@src
- sites which return other data which is not xml - such as CSVs
data-type:CSV (can be CSV for comma sep format, JSON, PSV for Pipe sep, TXT)
data-type-sep:|
data-type-eoln:
data-link-idx:2
define the column containing the link to the data
headline-link-idx:3
define the column containing the headline
skipdetails-link-idx:1
define the column containing the skipdetails
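As a sketch, for a pipe-separated feed whose lines look like (all names
invented) :
20140401|/data/story123.txt|Big headline
the block above would read :
data-type:CSV
data-type-sep:|
skipdetails-link-idx:1
data-link-idx:2
headline-link-idx:3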
- RSS feeds
Sometimes a link can point to data which gets updated and there is a second tag
which gives either a unique-id or a date/time which you need to track for any
changes. Use the 'skip-details-tag' to specify the second tag - it is the
combination of the 'linktag' and 'skip-details-tag' which should be unique.
For general RSS 2.0 feeds, this can either be 'pubDate' or 'guid' :
linktag:link
skip-details-tag:pubDate
In RSS feeds there is often a fake 'link' at the top which is the channel.
Usually you do not want this one - often it is a URN not a real URL, so use
'matchlinks' or 'ignorelinks' to bypass it.
If more than one skip-details tag is needed, up to 9 'skip-details-tag-X'
can be specified.
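A hedged sketch of a complete RSS page file (the feed URL and masks are
invented) :
url:www.example.org/rss/news.xml
dest:WEBDATA
; the stories are one level below the feed itself
maxlevel:2
ignorelevel:1
linktag:link
skip-details-tag:pubDate
; bypass the channel-level link at the top of the feed
ignorelinks:*channel*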
-- If the data in the link is not complete..
Use templates to slot data from a link into another call. This is again used
extensively for XML work - like soap.
It uses either just a template (in FipSeq so you can add Header Fields etc) or
a template AND a template file if there is a lot of data.
level2template:/query.dll?src=\QD
level3template:/getFile.dll?file=\W$
level3template-file:soap-getfile.xml
There are 4 templates, for levels 2, 3, 4 and 5. 'maxlevel:' and 'ignorelevel:'
must always be used with these to specify which level you need the data from.
A levelXtemplate on its own will generate a GET.
To POST something, you will also have to specify a 'levelXdata: (FipSeq)' eg
; level 3 - to get THAT file is always a POST of
; FileManager1%24gvwFiles%24ctl03%24gvlnkName
; .. using the different EVTVAL and VWSTAT
level3template:/proximity/Admin/FileManager.aspx
level3data:__EVENTTARGET=\A2&__EVENTARGUMENT=&__VIEWSTATE=\G8&__EVENTVALIDATION=\G9
will force a
POST /proximity/Admin/FileManager.aspx
with data filled in from FipHdrs A2, G8 and G9 eg
_EVENTTARGET=FileManager1%24gvwFiles%24ctl03%24gvlnkName&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUJMjQyMzY1MzEX%3D&__EVENTVALIDATION=%2FwEWAwKVxPyBCgLc
Note there is no level1template as that is the same as the 'url:' - BUT there
is a 'level1template-file' version. In this case the 'url:' should be just
that.
The template-files are normally in /fip/tables/webwire. They are NOT forced
uppercase.
The default Content-Type for POSTing data or forms,
'application/x-www-form-urlencoded', sometimes needs to be changed for
templates.
It can be changed with the 'levelXmime' parameter. For example, soap normally
likes a content-type of 'application/soap+xml':
level1mime: application/soap+xml
unless you are Microsoft of course who usually/sometimes want
level1mime: text/soap
The 'W$' in the example is because each link is put into a temporary FipHdr
field called W$ as it is being used. If the link data is too much or too
little, use FipSeq to chop/add/replace.
Eg if the data in the link is "nine:/rt/newsart/Id="z1jit4":text"
And you want a link like
/searchDB?database=nine&link=/rt/newsart/Id="z1jit4"&format=text
use repeat:R1 W$,:,1
repeat:R2 W$,:,2
repeat:R3 W$,:,3
; if there is no 3rd field, use 'xml' instead
combie:W4 R3,xml
level2template:/searchDB?database=\R1&link=\R2&format=\W4
Values -------
Values can be -
- EITHER a file containing lines of values to be used to repeatedly grab data
for a single file.
using values-file:(filename in tables/webwire)
- OR a sequential number
using values-seqno:(min value):(max value):(incremental value)
plus values-seqno-fiphdr-from: (FipHdr field containing the From seqno - ie
start grabbing from the NEXT id after this)
values-seqno-fiphdr-to: (FipHdr field containing the To seqno - ie each
seqno until and INCLUDING this one)
values-get-url:
values-post-url:
values-post-data:
FipSeq to POST a form or GET a link from a line in the
values file. See below for a description.
values-sep: Separator chr for splitting fields in the values file.
default is a pipe - '|'
values-leave-spaces: Normally leading spaces are trimmed from each
field in the values file. Use this to preserve them.
values-parallel: (Number of Simultaneous Hits)
For 'values' the default is to run the hits serially, one after the other
has finished. Use this to send out a number of hits at the same time, which
should reduce the total time by a large factor. However, you should check
with the remote site and test what the number should be. For Apache sites,
for example, 8 is a common default setting.
eg values-parallel: 10
values-fiphdr: Normally FipHdr W1 will contain the first field of the values
file, W2 the second etc.
So data can be specified by W1.
Use this parameter to specify another starting field - ie if W1 is being
used elsewhere.
** Note that if you are using iptimer to start webwire running a values file,
the Wx fields will be zapped in the output file.
So in this case, always use 'values-fiphdr:' with a different FipHdr if you
want to use the Values in iproute or another downstream program.
eg values-fiphdr:R1
values-pause: (secs)
Gap/wait/pause between Grabs using the next value default is 0 for none
zap-values-file: (yes/no)
Delete the values file after it has been used. default no
Only files in /fip/x/ starting TMP.. can be deleted.
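As a sketch, the sequential-number flavour might be used like this, assuming
(as with values files) that the current number is presented in FipHdr W1 :
; grab ids from 100 up to and including 120, one at a time
values-seqno:100:120:1
values-get-url:/getstory?id=\W1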
Note that in the FipHdr - unless the 'nofiphdr' keyword has been requested - the
following fields will be filled in :
Day and time in the normal HH,HD,HY etc fields
ST host
SP port
SF url - path/filename being grabbed
SG url - path/filename which is the link
Where webwire is sitting on a scrolled queue (using -i), the folder name is in
EQ and the filename in EN (with all '#' replaced by the chr chosen by
'fiphdr-hash')
Input Parameters (all optional) :
either -i : scrolled queue default: no default
This checks the folder and, for each file, checks the FipHdr field 'DF' which
is used as the name of the parameter file to run against.
This allows a variety of parameter files to be run.
or -1 : Run a single time and exit default: spool
The parameter is the name of the individual parameter file in tables/webwire
(ie NOT the top or main parameter file)
or -T : Tuning mode default: spool
Display links and data for the page requested. Runs only that page and then
exits.
The parameter is the name of the individual parameter file in tables/webwire
(ie NOT the top or main parameter file)
-A : In Tuning mode, do NOT prompt before searching a link default: prompt
-a : log the actual link of each access in the FipLog default: no
This can be quite a lot of logging if you are grabbing lots of files !
But is quite useful when starting/adding a new feed.
-B : default balance group for skip files default: none
(see skip-balance-group parameter)
-C : warm restart for cookies default: always ask for new cookies/logon
ie do NOT re-logon if the previous session logged on and saved the cookie.
If any cookie is missing or has timed out, all cookies are wiped and webwire
needs to be re-run to logon and download.
-d : done folder for -i scrolled queue default: none
This can be overridden by the 'doneque:' parameter
-D : display the Request and Response default: do not
-e : exit with the Result Code of the last grab. default: normal program exit
The normal exit is 0 if ok, a negative number if not.
With -e this will be 0 for ok, -1 for a timeout, or 4XX/5XX for page
errors.
-F : do NOT add a FipHdr to the output file default: do
this can be overridden by the 'nofiphdr:no' parameter
-h : extra FipHdr information default: none
This is in FipSeq and should normally be quoted
Note this is the means by which 'iptimer' sends variable information to webwire
eg : -h"SN:hello#TC:200401031"
-H : display the Request and Response in fancy HTML default: do not
-I : wire id default: 0
used to track which instance of a multi-webwire system a file arrived on/was logged by
-k : ignore the Skip list (used mainly in tuning) default: use skip-links:
-K : Do NOT save or process any data, just build up a skip file.
This can be used before putting sites into production so that
all old links are ignored and only new links will be tracked.
ie run 'webwire -1 (name) -K' once beforehand.
-l : no logging to the FipLog except for errors default: log all
-L : log new files and errors to the FipLog default: log all
-N : path and filename of the output file default: fip standard
use this to leave the file(s) in a non-std folder.
-o : output queue in 'spool' default: spool/2go
This can be overridden by the 'outque' parameter.
This is ignored in Tuning mode.
-O : force ALL output to this queue in 'spool' default: spool/2go
This overrides the 'outque' parameter.
This is ignored in Tuning mode.
-s : generate statistics for bandwidth usage default: no
using Hour_group files
-S : generate statistics for bandwidth usage default: no
using name of group_client files
-t : track status default: no
this can be overridden by the parameter
track-status:no
-V : if using spool-a-folder (-i) then stop when it is empty default: keep
spooling
-w : Wait in seconds between accessing links. default: 5
-x : Proxy server host or IP address default: none
-X : Proxy server port default: 80
-y : Proxy logon default: none
-Y : Proxy server is Squid default: no
-z : parameter file in 'tables/webwire'. default: XWEB
-v : Print the version number and exit
---- Other Notes ----
-- Netiquette --
Pls note if you are grabbing data off another site, then you should contact the
webmaster of the remote and let them know. Certainly if you are accessing every
few seconds, then there is a good chance they will put you on some refuse list.
So it pays to be nice !
-- How to find out the actual url....
Sometimes it is quite difficult to find out the real path to use for the url,
especially so for script-driven gets and puts.
Netscape or Iexploiter is invaluable in this case -
using either 'View Source' or 'History' normally gives the game away!
Snooping using tcpdump or windump :
0. Open a Terminal/Cmd window and start your browser - without hitting the
site yet.
1. Find out which interface :
tcpdump -D
2. Leave tcpdump running in the background.
On Mac OSX you will need to use sudo.
tcpdump -i1 -w remo.tdmp -X host www.remote.host
3. On the browser, do the absolute minimum ..
.. do a simple logon and grab one file using Firefox, Mozilla, IExp, Safari
etc.
4. CntrlC to stop tcpdump.
5. Run tcpdump again to show the data :
tcpdump -r remo.tdmp > remo.fip
6. Call up remo.fip in an editor.
-- Cookie Cookie Cookie Cookie Cookie Cookie Cookie Cookie Cookie Cookie
Cookies are neat but nasty.
If you already know the cookie you need, just make a file in /fip/fix/webwire
with the name of the cookie (case is important on Unix boxes) and slap in the
whole of that cookie which has the syntax
(key)=(data)
ie zumzum=hungryTummy
Before grabbing data pages we can attempt to logon to a box and get its
cookies !!
This uses from 1 to 9 GETs or POSTs.
add-cookie:C1; C2 Add the Cookie on to the end of the HTTP headers in this
form
get-cookie-1: Command to send to get a cookie or to logon.
get-cookie-data-1: Optional data usually required for a POST
get-cookie-http-1: more HTTP headers used ONLY for this GET/POST
cookie-fiphdr-1: name of the cookie to use as a FipHdr field C1 to C9
ie if there are several cookies returned but only one
is needed, put the key as the cookie-fiphdr
ie Set-Cookie: ABC=12345
add-cookie:C1; perm=yes
cookie-fiphdr-1:ABC
will result in a Cookie: ABC=12345; perm=yes
If you want all the cookies to be saved, use '*'
cookie-fiphdr-1:*
cookie-ignore-redirect-1: just that
Ignore any redirects (like 302 Moved).
follow-cookie-redirect: just that !
ie if you get a 302 Moved Temporarily status plus a Location from a
cookie request,
use that rather than the 'url:..' specified.
HTTP/1.1 302 Moved Temporarily$
Date: Fri, 29 Oct 2010 00:17:19 GMT$
Cache-Control: max-age=3$
Location:
http://fippo.fip.fip/palio/html.run?_Instance=cms_csi&_PageID=1&_SessionID=1068051&_SessionKey=922432532&_CheckSum=328747502$
There can be up to 9 of these.
eg add-cookie:C1
get-cookie-1:GET /
get-cookie-2:POST /logon.pl
get-cookie-data-2:logon=helpme&password=iamswimming
Rarely are any get-cookie-http-1 fields needed as
Host, Content-type, and Content-length are added automatically.
Referer is added if you have specified a 'referer' -
which you should if running 'http-version:1.1'.
Keep-alive is added if you specify 'keep-alive:yes'.
Other 'httphdr' fields should be specified as normal.
As a general rule, some Microsoft IIS sites (who else!) have problems if your
HTTP headers are in the wrong order. Basically, make sure your CONTENT* lines
are last.
Example 1
; ------------------------------------------------------
; we need to go and get a cookie for this service
; we will call it C1 - so the httphdr will be 'Cookie: (contents of C1)'
add-cookie:C1
; C1 will hold the contents of an incoming 'WebLogicSession=.....'
cookie-fiphdr-1:WebLogicSession
; this is the URL to hit (with parameters) to trigger the Cookie
get-cookie-1:GET /servlet/com.login.DispatchServlet?Login=&User=guest&Pwd=guest
Example 2
; ----------------------------------------------
; in this case we have 3 cookies C1, C2 and a fixed one 'b'
; C1 is SID=..
; C2 is ASP...=...
; add the fixed 'b=b' on the end
add-cookie:C1 ;C2 ;b=b
; just one grab at a cookie - and Logon and the same time
get-cookie-1:POST /login/Login.asp
; one logon string
get-cookie-data-1:u=%2Findex.asp%3F&l=letmein&p=ohpleaseplease&x=0&y=0
; ignore the 302 return - it is only trying to send us to index.asp
cookie-ignore-redirect-1:
; Save the two cookies as C1 and C2
cookie-fiphdr-1:SID
cookie-fiphdr-2:ASPSESSIONIDASDQCAAD
This will POST - ie pretend to be a filled out html FORM - the logon back.
Note that the cookie-data is 'URI escaped' ie if a special chr - like
/?&+ - is in the data bit, you must use the '%xx' notation (where xx is
the HEX value). But hopefully you would have seen that in your tcpdump/snoop
anyway.
-- Proxies Proxies Proxies Proxies Proxies Proxies Proxies Proxies Proxies Proxies
When running through a proxy server, you will need :
1. hostname of the proxy server
2. port number on the proxy server if it is NOT port 80
3. (optionally) a logon and password
4. Is the proxy SQUID ?
If so headers are slightly different.
If this information is NOT available, normally you can find it easily from any
PC or Mac on the internal network using a browser like Netscape or IExplorer.
Start a NEW copy of either of these - it must be a new copy to check on
logons etc.
Under 'Preferences' or 'Internet Options' there should be a 'Connections'
section and, under that, the host name or IP address plus the port of any
proxy used.
Note that often the main Fip server is NOT running DNS and will not be able to
resolve external hostnames, so the IP address must be used in this case.
Enter these values in the Fip parameter file as :
proxy-server:195.13.83.99 (no default)
proxy-port:412 (this defaults to port 80)
Use the Browser to attempt to access a web site outside the firewall - like
'www.fingerpost.co.uk'.
If you are asked for a password to get through, you will probably need to add a
'proxy-logon' parameter too unless the keeper of the Firewall has made a hole
through just for you.
The data for 'proxy-logon' is in base64 in the format (logon) (colon)
(password).
Use 'sffb64' to generate this string :
On a Sparc echo -n "chris:magicman" | sffb64 -i
On Linux echo "chris:magicman" | sffb64 -i
On Winnt type "chris:magicman" | sffb64 -i
proxy-logon:Y2hyaXM6bWFnaWNtYW4=
The actual 'You need to Logon, Pal' message is a '407 Proxy Authentication
Required' message.
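Pulling the proxy keywords together, a hedged sketch of a complete proxy
block (reusing the values above) :
proxy-server:195.13.83.99
proxy-port:412
proxy-logon:Y2hyaXM6bWFnaWNtYW4=
proxy-is-squid:no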
-- Repeat Offenders -----------------
Some sites add a session-id into each and every link. And this Id changes on
each access.
To 'webwire' this appears to be a new file and so it is grabbed every time -
falsely.
There is an 'ignore-key' command to isolate and ignore the relevant parameter.
eg Take a site like :
url:http://www.fingerdong.com/
matchlinks:*&news=yes&newsid=*
ignorelevel:1
which returns links like
/en/pressrelease.php?date=20080910&news=yes&PHPSESSID=11bf21&newsid=7866
If the value of PHPSESSID changes on each access, then you will get a copy of
newsid 7866 every time.
Use :
ignore-key:PHPSESSID
Do NOT specify the '=' or '?' etc.
-- Others Others Others Others Others Others Others Others Others Others Others
--Where 'webwire' is used to drill down links, there is a wait of about 5
seconds between accesses which, hopefully, is enough time for other people to
use that server.
--Where a logon and password is requested as part of the Browser - ie a pop-up
from Netscape or IExplorer, NOT an HTML form - you will need to add an
'Authorization' line. This will be true if you get a message like :
HTTP/1.0 999 Authorization failure
... etc etc etc ...
Assuming you know your logon and password :
1. Use uuencode or sffb64 to generate a Base64 string
echo -n "logon:passwd" | sffb64 -i
2. Add an extra line to the parameter file with the result of the sffb64 line
using 'httphdr'.
Syntax: Authorization (colon) (spc) Basic (spc) (Base64 of logon:password)
(\n FipSeq for NL)
Eg httphdr:Authorization: Basic AbGtGgbhpdOkOTE=\n
-- Valid links are :
- The HREF tag attribute in A for Anchor <a href="www.fingerpost.co.uk">
- The SRC tag attribute in FRAME <frame src="ax1000.html">
- The URL in a META/Refresh <META HTTP-EQUIV="Refresh" CONTENT="0;
url=go4thAndMultiply.com">
-- For 'matchlinks', the term LINK is the contents of the <a href="THISONE">,
NOT the associated text
ie matchlinks:*boonies*
will find <a href="/rubbo/boonies/tunies.html">This is a Wonderful Page</a>
BUT not <a href="/tunies.html">This is the boonies Wonderful Page</a>
-- Note that 'ignorelinks' refers to both Links and Forms.
-- If you want to ignore all links and only get forms, use a weirdo name in
matchlinks
matchlinks:gobbLedeGook9981
-- What are reasonable HTTP headers ?
1. If you are using HTTP Version 1.1, you MUST add a line in the headers which
specifies the actual host you are trying to access (ie the REMOTE hostname or
IP address):
httphdr:Host: www.theirsite.comn
or if DNS is a problem
httphdr:Host: 123.456.789.012n
2. Most servers would like to know what you are and what you can do - so lie !
Try this for starters :
httphdr:Accept: \52/\52\n
httphdr:Accept-Language: en\n
httphdr:User-Agent: Mozilla/4.0 (compatible; MSIE 4.01)\n
Note the syntax is httphdr:(Keyword) (colon) (space) (Parameter) (NL)
Keyword is case-INsensitive
There MUST be a Colon-Space between the Keyword and Parameter.
The line MUST finish with a single NL (which webwire will handle correctly)
as Double NLs mean end of header.
-- ValuesFile ValuesFile ValuesFile ValuesFile ValuesFile ValuesFile --
Take the case where you need to get the 10 foreign exchange rates every 20
minutes from a site like Yahoo.
The normal way would be to test using one forex rate and, when ready, just
duplicate that parameter file another 9 times, just changing the forex
name/search string in the 'url' or 'post'.
The classy way is to put all the search values (ie the bits that change) into
a single 'values-file' and reference them using FipHdr fields W1 to W9.
To Do this :
If the original url is :
http://finance.yahoo.com/m5?a=1&s=USD&t=LAK
1. Create a values-file in /fip/tables/webwire - let's call it VALUES_4_FOREX.
This can have the normal Fip-style comments of ';' at the start of line
;
; Values file for Forex
;
USD|LAK
USD|YEN
USD|MYR
; end of values file
2. In the WebWire parameter file - let's call it FOREX :
;
; FoREX
;
port:8080
url:http://finance.yahoo.com
values-file:VALUES_4_FOREX
values-get-url:/m5?a=1&s=\W1&t=\W2
... and let rip.....
Note that W1 is the first field, W2 the second etc. If you are already using W1
for something else, specify another FipHdr field to start with, using the
'values-fiphdr' parameter.
Note that the FipHdr fields are useable for filename and other Fippy things.
filename:Forex-\W1-\W2.fip
will give filenames (and/or FipHdr SN) for our example of
Forex-USD-LAK.fip
Forex-USD-YEN.fip
Forex-USD-MYR.fip
-- Standard-FingerPost-Rant on bad HTML ----------------------
-- Using Webwire to pull off other file formats
Sometimes 'webwire' seems to grab only part of a page and never returns
errors. Well, if you use a browser to look at the page and then 'View Source'
or 'View Frame Source', lo and behold there is probably a random </HTML> at
that point.
</HTML> is of course the End Tag of an HTML document. So we SHOULD stop there
really.
But a lot of web sites do not care how awful their stuff is - or maybe a
conversion program has been set up wrongly (a well-known news agency in New
York uses </html> in place of </image> to end pictures for example)
So use the keyword 'end-of-document' to track either nothing - just timeout -
or the REAL end of document.
If the data is NOT html - some XML variant for example - use 'end-of-document'
to track that.
By the way, did you know you can immunise yourself from fingerpost-rants; pls
contact the sales dept.
-- Wrinkles with Ports and RSS
Some RSS servers like to serve the initial list from one port - but you have
to grab the data from another :
port:8080
url:http://finance.yahoo.com
-------------------------------------------------------
Version Control
;005z34 10may05 hourly bandwidth stats files rather than per client
;a 13may05 balance skiplists if changed
;b-c 25jun05 added -M and -K
;d-g 05aug05 added fiphdr:XX data:abcA3 and wait-end-timeout
;h-k 04sep05 changed -x-X to force not default
;l-m 07nov05 added 24hour+ skip files
;n-p 25sep06 added ssl at last
;q-t 17oct06 added skip-details-tag
;u 29apr07 major change to linktag, added matchkeys and match-case-sensitive
;v-w14 21may07 add rest of path if 3rd+ level and no starting '/' (w14 -
tweaks to stuff_cookie)
;x1-6 8may08 added save-fiphdrs ;3 added -N newname ;6 bugette with VALUES
file and port != 80
;7 added -e and -E errname ;8 balance fiphdr fields ;9 meta-files ;10-12
minor
;13-14 note_balance_action ;15-16 spc in url ;17 added pretend-301:200 ;19
allow feed:
;20-23 finally added basic-authentication: and redid ssl
;24 bugette/modette - allow multiple spaces in mime headers
;25 allow intergap of zero
;26 bugette - save_metadata missing if one and only one found
;27-29 25jun10 bugette when proxy is a Squid and host changes
;y1-9 26jul10 added grab-on-tag/endtag (major release) ;10-11 6sep10 bugette
with 302-move and http://...
;12-14 added matchlogon, bug (bg) with data-type:CSV, plus tom bug :
retry-404-max:3 retry-404-gap:1
;15-17 14oct10 added skip-save-data and days:Z for weekdays
;18 15nov10 added use-cookie-redirect: ; 19 able to parse VALUES-FILE: ;20
added nofiphdr
;21-25 mess if too many 404 plus added -D and fiphdr-hash
;26-27 16mar11 added repxml for fiphdr: / include fiphdr-file in start of
hdr..
;28-29 31mar11 added zap-values-file:yes
;30-32 poll.every secs bugette ;32 added need-proxy-cookie
;33 6jul11 better skips handling now allow 15000 skips and zap olds with
different skipdetails
;34 29jul11 added need-logon-token and cookie-host-X for rconnect
;35-36 added dbl-dblqtes in links plus Bugette in Chunks and redid outque for
speedy
;37-41 added CONNECT for proxy https plus started minitracking and sleep
between polls for XWEB
;42 allow multiple spaces in custom tag link and added filter ;43
null_next_link added
;43-45 added retry-404-error
;z1-8 15mar12 added eventvalaidation and viewstate and level5* and json
;9-10 allow multiple grabs, added level to grab-on-tag and matchlinks etc
;11-12 redid 302 moved to handle full paths better ;13 ;14 bugettes -
proxy/do NOT output file for cookies
;15 28feb13 tuning for level1template-file:
;16 4apr13 bug in skips if no headline
;17-23 11apr13 added trees, levels and keys to fiphdr:, grab-on*tab: and
linktag:
;24-28 17may13 added retry-500 kwds and better proxy handling ;27 added
level1mime and -I wireId
;29-31 17mar14 added 404/500action=move, que and FipHdr ;31 modette-repxml
for all tags
;32 14apr14 for custom logging ;33 4aug14 added -Z force DF ;34 bugette with
fiphdr.. key:
;004z 07jul04 tweaks...
;b 01aug04 added fiphdr:....
;c 10aug04 added levelXtemplate: where X is 2->4
;d-k 01sep04 -9 speedy and timing stats (f-maxlevel and values bugette)
;l-n 07oct04 added skps2, fixed one-file,
fixed HTTP results with no messages
;o 28oct04 redid skps2
;p 01dec04 buglette with spaces in URLS - need to be stripped.
plus lvl1file-lvl5file added
;s 31dec04 added -x proxy-host, -X proxy-port plus -y/-Y
;t-u 01feb05 added bandwidth-stats
;v-w 19feb05 added -u testPid and -U singlelevel only and split into files
plus bugette with Chunking
;x-z 18apr05 added -O for rpt-offenders/small-diffs flag
;003z 15dec00 added one output file, tracking sents, only-get-if-modified
;a 20dec00 added watch on XWEB
;b/c 22jan01 allow hrefs to be NOT in dbl quotes
plus added end-of-document
;d/e 19mar01 started proxies
;f 17sep01 proxies again
;g 29oct01 proxies again
;h 13dec01 minor mods - allow http:name:port in url and proxy
;i 08jan02 values-fiphdr and bugette with values
;j 08apr02 bug with one output file - core dump
;k 01jan03 MACOSX
;l-p 21jan04 added -h and allows secs for 'every'
and 'no-data:'
;q-u 08jun04 added matchlinks/ignorelinks/url and now FipSeq
;u 27jun04 added -H html, -k ignore skipfile
;w-z 30jun04 proxy-is-squid added
;002b 24oct00 added values-file
; 06nov00 added 'every' and Chunks
(copyright) 2014 and previous years FingerPost Ltd.