ipsgml (Sat Oct 25 2014 01:31:01)

ipxml/sgml/wml/html/newsml

This program is used to convert into, out of and between different tagged
format files such as XML or SGML or variants like NITF, NewsML, XHTML, WML or
HTML.

Generally it can be used to convert things like :
- NewsML <-> IPTC 7901 or ANPA 1312
- NITF <-> plain ascii
- XML <-> HTML
- SGML <-> NITF
- SGML <-> plain ascii
- WML <-> ascii
- HTML/NITF tables -> inline markup for Quark, InDesign or other Editorial
systems

Data can be extracted from the SGML tags or attributes and formatted into text
eg.
- convert and/or replace the data within a tag
- plain ascii files -> XML possibly using FipHdr fields to create tagged data

Definitions and Glossary
tag - something between '<' and '>' eg. <BODY>,
usually ending with tagend. eg. <LOCATION>Hollywood</LOCATION>
data - non-tag information eg. "Hollywood" in the above example
attribute - sub field/data within a tag eg. <LOCATION ID="996"
PLACE="Hollywood">
NITF - News Industry Text Format as put together by IPTC and NAA.
XML,HTML - Much simplified sub-set of SGML for WWW. - see www.w3.org

It scans its input directory and each file is processed according to a
parameter file specified either as the default or as the DY: FipHdr field.

Two types of processing are possible
- strip or modify tag, attributes and/or data
- extract data or attribute-data and stuff in a FipHdr field which can then be
used to replace the top of the file or used by a subsequent program.

There is also a question of where to send the output file as this, by default,
is put in spool/2go for IPWHEEL to distribute. So it needs a Destination(s) or
DU FipHdr field. This is added by either :
- If there is a DX FipHdr field in the input file, that is used.
- If not, the keyword 'dest' is used in the parameter file.
- If that is not specified either, it is sent to 'woops' the Intercept queue.
- You may also specify it from the incoming data or attribute-data using the
'fiphdr' keyword.
In this case the contents of DX, 'dest' or 'woops' will be the default if there
is no data.
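
The order in which the destination is chosen can be sketched as below. This is
only an illustration of the rules listed above, not the program's own code: the
function name, the dictionaries and the 'extracted' argument (standing for a
value picked up from the data via the 'fiphdr' keyword) are assumptions.

```python
def resolve_destination(input_fiphdr, params, extracted=None):
    """Pick a destination following the documented order of preference."""
    # DX from the input file wins, then 'dest' from the parameter file,
    # and finally 'woops', the Intercept queue.
    default = input_fiphdr.get("DX") or params.get("dest") or "woops"
    # A value taken from the incoming data or attribute-data is used when
    # present; otherwise the default worked out above applies.
    return extracted or default


print(resolve_destination({"DX": "newsdesk"}, {"dest": "logcopy"}))
print(resolve_destination({}, {"dest": "logcopy"}))
print(resolve_destination({}, {}))
print(resolve_destination({"DX": "newsdesk"}, {}, extracted="sports"))
```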

IPXML may be used to convert XML tables to plain formatted text or in-line
markup such as Quark.

The parameter file in tables/sgml defaults to SGML and has the keywords:
tag:(sgml tag name) (optional subkeywords)
Process a Start or End tag as follows :
start:(FipSeq)
optional string to replace the tag
end:(FipSeq)
optional string to replace the end tag ie. </location>
strip:(tag|attribute|zap|everything|data|end|none)
optional strip all or part of the tag and its associated data
tag All information between '<' and '>' is ignored.
This will also zap the end tag if there is one.
attribute all attributes are ignored; tag and data preserved.
zap All information - tag, attrib and data is zapped to
the next tag.
everything Same as 'zap' but lower tags are always zapped too.
data All data for this tag is ignored; tag and attrib preserved
end Zap everything, including all other tags until and including
the end tag : </NAME> unless any other tags are specified as
NOT being stripped.
none Preserve everything (default)
keepattribute: (optional FipSeq)
Used during strip to keep all the attribute data. Any
data after the keyword is added before and after the attribute :
tag:ds start:** end:-- strip:tag keepattribute:=
<ds num="1.5" ver="orig">oinky</ds>
gives **=1.5==orig=oinky--
As the optional data will be checked against the mapping tables,
please make sure they are what you want them to be.
endkeepattribute: (optional FipSeq)
Same as KeepAttribute: (above) except the data is ONLY added after the
attribute and before data.
att: (attribute name)
used with keepattribute: - use when only one attribute is required
tag:content strip:tag att:content-role start:[fip- keepattribute:|
endkeepattribute:-fip]
<content content-ref="c00000002"
content-role="urn:x-hoho:content-role:INTRO" auto-generated="false">
generates : [fip-urn:x-hoho:content-role:INTRO-fip]
which can then be mangled by ipxchg or other at a later stage

upper: force the field uppercase
lower: force the field lowercase
Note that these two conversions only change data up to the next
tag or end tag (ignoring <P>) which may not be the end of this tag.
list-fiphdr:P3 If converting ordered lists <ol> or unordered lists <ul>, this is
the FipHdr field containing the item number.
tag:ul strip:tag start:<FipUL> list-fiphdr:P6
tag:ol strip:tag start:<FipOL> list-fiphdr:P6
tag:li strip:tag start:"n P6"
The actual string used in the Unordered list can be changed from a '*' using
the parameter 'unordered-list-chr:+'

fiphdridx: use a link-Fiphdr (see below) to extract some FipHdr data
referenced by
tag:A strip:tag end:(R7) fiphdridx:a@href=R7

Note when specifying the tag, do NOT specify either the presy/endy ie the '<'
or '>'.
eg tag:location start:[ModeBold] end:[ql]n strip:tag
There is a special case for a comment <!-- This is a comment -->, where
the 'end' subkeyword specifies the end of the comment.
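
The 'keepattribute:' example given earlier (tag:ds start:** end:-- strip:tag
keepattribute:=) can be modelled with a few lines of Python. This is a rough
sketch of the documented result only; the function name and the regular
expressions are assumptions and it handles a single, simple element with
double-quoted attributes.

```python
import re


def apply_keepattribute(element, start, end, sep):
    """Replace the tags with start/end and keep each attribute value,
    wrapped before and after in the keepattribute string."""
    match = re.fullmatch(
        r'<([\w.-]+)((?:\s+[\w-]+="[^"]*")*)\s*>(.*?)</\1>', element, re.S
    )
    if match is None:
        raise ValueError("not a simple <tag attr=\"..\">data</tag> element")
    _tag, attrs, data = match.groups()
    # Attribute values in the order they appear in the tag.
    values = re.findall(r'[\w-]+="([^"]*)"', attrs)
    kept = "".join(f"{sep}{value}{sep}" for value in values)
    return f"{start}{kept}{data}{end}"


print(apply_keepattribute('<ds num="1.5" ver="orig">oinky</ds>', "**", "--", "="))
```

With the sample element this prints **=1.5==orig=oinky-- as in the example.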

fiphdr:(2-letter code) (optional subkeywords)
Either tagdata:(name of tag)
specify the tag name which contains the data required.
Or tagattrib:(name of tag),(name of attribute)
Or tagattribute:(name of tag),(name of attribute)
specify the tag name and the attribute name which
contains the data required.
Or data: (FipSeq)
general data to add to a FipHdr field.
Or text:
Stuff the first part of text into this hdr field
This searches for the <TEXT> tag. If not found, the top of data
is used.
default length is 100 chrs unless you change
with a 'max:1024' (see below)

For any of the fiphdr-tag* options, subkeywords are 'dup', 'max',
'upper', 'lower'

continue: allow this fiphdr to continue and include lower level tags

dup:(optional separator)
Flag that this field may be duplicated. Duplicate fields are separated
with a space unless a separator chr is also specified.
For 'dup' to work correctly, each tag or attribute to be accessed is
stuffed into one fiphdr line only.
Each occurrence of the duplicated tag MUST follow sequentially with
no other tags intervening
incdup:
A second method of handling duplicate tags or tag/attributes is to
create a new FipHdr field by incrementing the second letter of the FipHdr
name
eg fiphdr:J6 tag:DEST incdup:
the first FipHdr will be 'J6'
the second 'J7'
the third 'J8'
etc
So the idea is to start with 'J0' (zero) if under 10
duplicates are possible or 'JA' if 26.
maxdup: (max number of duplicates allowed for this field)
default: no limit for 'dup', 26 for 'incdup'
Use this to limit the number of entries in a duplicated field.

max: (max number of chrs in this FipHdr field)
limit the size of the data to a fixed amount
max:25
Note there is no default, but the absolute maximum is 1023

upper: force the field uppercase
lower: force the field lowercase
Normally these take the concept of lower and uppercase chrs
from the LOCALE of the system you are running on. These can
be supplemented by the 'locale' and 'extralocale:'
keywords below.
key: and key2: Some XML variants reuse structures and it is the contents of an
attribute which describes what the data really is.
In NewsML for example there can be multiple TopicSets with the attribute 'Scheme'
on the 'FormalName' tag which varies. Use 'key' to define which one.
eg
fiphdr:PP tag:FormalName dup: key:TopicSet/Topic/FormalName/Scheme="Internal
MetaCodes"

See below for more comments for use with multiple structures
you MUST specify at least the tag and attribute in the key.

There can be up to two 'key's for each 'fiphdr' - see below
for an example where 2 keys are necessary for NewsML Topics.

index: (Tag@attribute)
Create an internal FipHdr for use with this index for outputting with
tag/fiphdridx above
fiphdr:R7 tag:FormalName dup: key:FormalName@Scheme="Ticker"
index:Topic@Duid

For fiphdr/tagdata there is an additional keyword of 'attribute-is-data:'.
This forces any information in attributes in any lower tags to be treated
as data.

As some FipHdr fields have distinct meanings - SN, DU, DP etc - please use
2 letter codes starting N or Q.
eg fiphdr:NA tagdata:itemid dup:+
get the data from each <ITEMID> field. If there is more than one,
they are separated by a '+'.

general examples
fiphdr:PN data:SN max:6
fiphdr:HT data:"This is the old HS =HS="
fiphdr:DI tagdata:brodtext max:200
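
The two ways of handling duplicates, 'dup' and 'incdup', can be pictured as
follows. This is a sketch under stated assumptions: the function names are
invented, 'maxdup' is treated as simple truncation, and since the text does not
say what happens when 'incdup' runs past '9' or 'Z' the sketch just raises an
error there.

```python
def join_dup(values, separator=" ", maxdup=None):
    """'dup': all values go into one field, separated by a space or the
    given separator chr."""
    if maxdup is not None:
        values = values[:maxdup]
    return separator.join(values)


def incdup_fields(first_field, values, maxdup=26):
    """'incdup': one field per value, made by incrementing the second
    letter of the FipHdr name."""
    prefix, start = first_field[0], first_field[1]
    limit = "9" if start.isdigit() else "Z"
    fields = {}
    for offset, value in enumerate(values[:maxdup]):
        letter = chr(ord(start) + offset)
        if letter > limit:
            raise ValueError("ran out of FipHdr names after " + prefix + limit)
        fields[prefix + letter] = value
    return fields


# fiphdr:NA tagdata:itemid dup:+
print(join_dup(["a1", "b2", "c3"], "+"))
# fiphdr:J6 tag:DEST incdup:
print(incdup_fields("J6", ["lon", "par", "nyc"]))
```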
Other keywords :
start-text-tag: (tag)
Tag signifying the beginning of text data for 1st line (etc) of text ($1, $t
etc)
The default is 'TEXT' but is often defined as 'BODY' :
start-text-tag:BODY
or for NITF, the body.content tag
start-text-tag:body.content

pinhdr:
pindata:
The <P> Paragraph tag is handled separately from other tags as it is often
'neutral' and should not alter the current processing.
Use these two keywords to define what to do with the start and end 'P' in
either a FipHdr field or in the data part:
pinhdr: start:~ end:\s
pindata: start:\n end:\n
'start:' being the string output in place of a <P>
'end:' being the string output in place of </P>
Note that CR NL etc are not valid characters in the FipHdr - if you do need
them use another unique chr and use 'ipxchg' to convert at a later stage.
Defaults for pinhdr: start:\s end:\s
Defaults for pindata: start:\n end:\n

dest: (one or more Fip Destinations separated by space or '+')
This can be overridden by the DX: FipHdr field. Note that all
destinations MUST be in the tables/sys/USERS file. As per normal
case is important, so ZAPME and zapme are 2 different destinations.
eg. dest:logcopy+outsgml.
stripfiphdr: do NOT copy the existing FipHdr of the input file onto the
output.
Normally the FipHdr is stuck on top.
nofiphdr: do NOT add a FipHdr to the output file. Any new FipHdr keywords are
added without the tilde NL top and bottom.
zapfiphdrfields: (List of FipHdr fields to zap)
Delete all occurrences of the FipHdr fields specified. This is ONLY valid
where the FipHdr from the input file is retained for the output.
In this case it is normal to zap :
zapfiphdrfields:XZ,XS,CX,DC,SZ,CQ,CP,XP
addhdr-file: (fullpath/filename in FipSeq) default: none
Extra, optional FipHdr information held in an external file
addhdr-script: (script in FipSeq) default: none
Extra, optional FipHdr information generated by an external program or script
addhdr-script:/fip/local/find_iim.pl EP/EN > E3
Temporarily, 3 FipHdr fields are available for the script :
EP holds the input folder
EN holds the input filename
E3 holds the name of a TMP file to create that will be read for the list.
extra-fiphdr: (FipSeq) default: none
Extra, optional FipHdr information - note this overrides the -h switch

use-sx:
or use-external-file:
if there is an SX FipHdr field with a path to the data file, use that rather
than the data in the input file.

filename: (FipSeq) New filename for the output file name.
supercede:
or overwrite: Where 'filename' has been specified, if there is already a file
with that name in the output queue, it is deleted first.
script: (path and name) Script to run AFTER processing.
The output filename and path is added to the script before running.
Care must be taken NOT to run a script on a file that
normally is written to a spooled queue.
For example, the default output queue is 'spool/2go' where
program 'ipwheel' may have already processed the file (and
possibly deleted it) before the script has had time to
function. So it is normal to specify a holding queue, not
used by any other program as 'outque:'
The script must therefore delete the file after use OR
delete them all in the nightly maintenance - 'zapfiplog'
Note also that the script is called only once at the end of
the file. Use split-script: to run on each split (if using splits).
outque: Output Queue for the output file.
This defaults to the '-o' input switch which defaults to spool/2go.
If the first chr is NOT a '/', it is assumed under spool.
By default outque is used in preference to -o,
UNLESS the -V switch is on, where -o is used over outque.
doneque: Done Queue for the raw input file.
This defaults to the '-d' input switch which has no default.
If the first chr is NOT a '/', it is assumed under spool.

before: (FipSeq) String to parse and add at the top of the file.
after: (FipSeq) String to parse and add at the end of the file.
beffile: (Path/filename) Contents of a file in FipSeq to parse and add at the
top of the file (after 'before')
aftfile: (Path/filename) Contents of a file in FipSeq to parse and add at the
bottom of the file (before 'after')
number:octal|dec|hex In FipSeq, make all escaped numbers Octal, Dec or Hex.
default is octal
log: Custom log line for the Fip Item log in FipSeq
default is name of the parameter file (DF) and filename (SN)
archive: (Archive Name) Archive all incoming raw data using this
parameter file. The 'archive Name' can be FipSeq.
This adds the file to the normal Fip archives in /fip/log/data
It should be purged using 'ipmaint'.
eg archive: SU
or combie:QS SU|NS,rawdata
archive:QS
ie Use the contents of FipHdr SU, if not there, NS, if not there
just use the word 'rawdata'.

striptags: Strip all tags EXCEPT those specifically stated using the 'tag'
keyword.

default-strip: (tag|attribute|zap|everything|data|end|none)
default strip all or part of the tag and its associated data
(see strip: above for descriptions)

ignore-non-xml-data: If there is any text or data BEFORE the start of the XML
document or any after the end of the last End Tag, it is stripped.
Normally it is preserved and output.

locale:(valid locale)
Change the locale from the System Locale to this
The locale MUST be valid !
locale:dk
extralocale: (2chr combinations)
For changing uppercase to lower and vice versa, we can add to the
normal locale by specifying a series of 2-chr pairs.
The lowercase chr is 1st then the upper, then a separator or space.
eg extralocale:aA,bB,cC,dD,212232,213237
Normal a-z/A-Z are mapped by default : in the example above they are included
only to give an idea of syntax

chr:(octal/dec/hex number):(FipSeq string)
hdrchr:(octal/dec/hex number):(FipSeq string)
txtchr:(octal/dec/hex number):(FipSeq string)
Replace this character with the string - usually an Sgml escaped chr.
USE THIS TO REPLACE SINGLE CHRS WITH SGML CHRS (ie opposite of 'sgmlchr:'
below).
This can be a printable chr or an escaped number. The number is
octal/dec/hex depending on the preceding 'number' keyword (if any).
eg chr:313:&pound; chr:<:&lt;
Note that the ';' is part of the string and NOT a comment as it does NOT
start the line.
hdrchr works on new FipHdr fields only.
txtchr works on data and when data is taken from a FipHdr field and
added to the data part of a tag.
chr works on both data and new FipHdr fields.

eoln: Convert Line Ends (ie CR and/or NLs) from the outbound feed.
SGML should be terminated CR NL : eoln:\r\n
for HTML (default) the EndOfLine is NL only : eoln:\n
for NO eoln, specify NO subparameter : eoln:
The subparameter can be any valid FipSeq.
(SGML uses the term 'RE' (record end) for Carriage Return CR and
'RB' for LineFeed NL meaning record begin.)
Note that, unless using the 'preserve-multiple-eolns', you should map
eoln to something unique like eoln:<mypara> as normally CR NLs are reduced to
a single End Of Line.
preserve-multiple-eolns:
Normally multiple end-of-lines are stripped as they
are meaningless in the XML world. Use this to preserve them!
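
The effect of 'eoln:' with and without 'preserve-multiple-eolns:' can be
sketched like this. It is an illustration of the behaviour described above
only; the function name and arguments are assumptions and FipSeq parsing of
the replacement string is left out.

```python
import re

_EOL = re.compile(r"\r\n|\r|\n")            # one line end
_EOL_RUN = re.compile(r"(?:\r\n|\r|\n)+")   # a run of line ends


def convert_eoln(text, eoln="\n", preserve_multiple=False):
    """Replace line ends with the 'eoln' string. Runs of line ends are
    reduced to a single one unless preserve_multiple is set."""
    pattern = _EOL if preserve_multiple else _EOL_RUN
    # A function is used as the replacement so that backslashes in 'eoln'
    # are taken literally.
    return pattern.sub(lambda _match: eoln, text)


sample = "one\r\n\r\ntwo\nthree"
print(repr(convert_eoln(sample, "\r\n")))
print(repr(convert_eoln(sample, "\r\n", preserve_multiple=True)))
print(repr(convert_eoln(sample, "<mypara>")))
print(repr(convert_eoln(sample, "")))
```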
preserve-top-spaces:
Do NOT strip all spaces and blank lines at the top of the output file.
preserve-padding-spaces:
Do NOT strip all spaces and blank lines at the beginning of each tag.
strip-multiple-spaces:
Strip all multiple spaces and blank lines inside each tag.
allow-presy-in-tag: In XML/HTML etc, reserved chrs like '<' or '>' cannot appear inside
the attribute data of a tag - they must be encoded like &lt; etc.
Use this where there might be some non-conforming stuff. However the
drawback here is that they MUST be inside dbl qtes ie <meta ds="helle<p>ooo"
convert-CDATA-sections:
convert-CDATA-sections:no - no, don't ! (default)
convert-CDATA-sections:yes - yes pls and zap the '<![CDATA[' and ']]>'
convert-CDATA-sections:preserve - yes pls and leave the '<![CDATA[' and ']]>'
Normally a CDATA section like :
<![CDATA[ Vongerful Vondafool C&oe;penh&areing;gen <99thisIsAnon-compliant
XMLtag> ]]>
is considered a single, raw string of XML/SGML data. And all the tags and
entities (like &lt;) are not changed either. Use this parameter
to convert them.
Note that you should use this option CAREFULLY if any tag in the CDATA
is the same as a tag in the main envelope. See below for more comments.

sgmlhdrchr: (FipSeq string) : (FipSeq Chr or String)
sgmltxtchr: (FipSeq string) : (FipSeq Chr or String)
sgmlchr: (FipSeq string) : (FipSeq Chr or String)
Translate Sgml escaped chr back into a single chr or a string.
USE THIS TO REPLACE SGML CHRS WITH A CHR OR A STRING (ie opposite of 'chr:'
above)
Sgml escaped chrs always start with a '&' and end with a ';' : "&gt;",
"&copyright;"
Note that case of both parameters IS important - These two are different :
sgmlchr:Oring:<CapOring>
sgmlchr:oring:<smallOring>
This will take &XXXX; and translate it.
eg. sgmlchr:lt:<
sgmlchr:oumlaut:202
sgmlchr:Utilde:{tildeU}
sgmlhdrchr works on new FipHdr fields only.
sgmltxtchr works on data and when data is taken from a FipHdr field and
added to the data part of a tag.
sgmlchr works on both data and new FipHdr fields.
NOTE that if the input is any NITF, XML or HTML feed and the output
is just plain text, then you almost always need :
sgmlchr:lt:<
sgmlchr:gt:>
sgmlchr:amp:&
sgmlchr:apos:"
BUT you will want to preserve them /leave them alone if the output is
the same or another NITF, XML or HTML flavour.
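
As an illustration of what a set of 'sgmlchr:' lines does, the sketch below
translates &XXXX; sequences through a lookup table. The function name is
invented, and leaving unknown entities untouched is an assumption, not
something stated above. Names are matched case-sensitively, as the note about
Oring/oring requires.

```python
import re


def apply_sgmlchr(text, table):
    """Translate &name; using 'table'; names not in the table are left alone."""
    def replace(match):
        return table.get(match.group(1), match.group(0))

    # A single pass, so the result of one translation is not translated again.
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)


table = {
    "lt": "<",
    "gt": ">",
    "amp": "&",
    "Oring": "<CapOring>",
    "oring": "<smallOring>",
}
print(apply_sgmlchr("a &lt; b &amp; c &Oring;&oring; &unknown;", table))
```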

unicodelist: (dec or hex number) : (list of single FipSeq Chrs)
Starting at the number, fill in the map of SINGLE character replacements in
sequential order
For any map which is MORE than a single chr, use a '*' (or the value of the
convert-unmatched-unicodes: parameter)
and then use unicodechr: further down the parameter file.
eg :
; Map Unicode Latin2s chrs to plain Ascii ... use a star for unmatched (or
will match later)
unicodelist:x100:AaAaAaCcCcCcCcDd
unicodelist:x110:DdEeEeEeEeEeGgGg
unicodelist:x120:GgGgHhHhIiIiIiIi
unicodelist:x130:Ii**JjKkkLlLlLlL
unicodelist:x140:lLlNnNnNnnNnOoOo
unicodelist:x150:Oo**RrRrRrSsSsSs
unicodelist:x160:SsTtTtTtUuUuUuUu
unicodelist:x170:UuUuWwYyYZzZzZzf
; NOTE hex 132, 133, 152 and 153 are mapped to '*' as they need more than a
single chr
;; .. so then we replace them properly
unicodechr:x132:IJ
unicodechr:x133:ij
unicodechr:x152:OE
unicodechr:x153:oe
unicodechr: (dec or hex number ) : (FipSeq Chr or String)
For all unicode chrs which are >= 256 (x100), you can specify a map to a
single chr or a string.
The chr can also be specified as hex with a preceding 'x'
Commonly used ones are :
; trademark
unicodechr:x2122:(tm)
unicodechr:8194:\s
unicodechr:8195:\s
unicodechr:8201:\s
unicodechr:8211:-
unicodechr:8212:_
unicodechr:8216:'
unicodechr:8217:'
unicodechr:8220:"
unicodechr:8221:"
unicodechr:8249:<<
unicodechr:8250:>>
; euro in a table
unicodechr:8364:EUR
; fractions 1/3 .. 1/5 .. 1/6 .. 1/8 ... 7/8
unicodechr:x2153:\s1/3\s
unicodechr:x2154:\s2/3\s
unicodechr:x2155:\s1/5\s
unicodechr:x2156:\s2/5\s
unicodechr:x2157:\s3/5\s
unicodechr:x2158:\s4/5\s
unicodechr:x2159:\s1/6\s
unicodechr:x215A:\s5/6\s
unicodechr:x215B:\s1/8\s
unicodechr:x215C:\s3/8\s
unicodechr:x215D:\s5/8\s
unicodechr:x215E:\s7/8\s
; ByteOrder ?? x.feff d.65279 o.177377
unicodechr:65279:\s
convert-unmatched-unicodes: (FipSeq Chr)
Single chr to represent a unicode chr which is NOT latin1 and NOT matched in
'unicodechr'
default: '*'
Normally these will be mapped to '*'.
To pass-thru all unmatcheds, use : convert-unmatched-unicodes:passthru
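
How 'unicodelist:', 'unicodechr:' and 'convert-unmatched-unicodes:' fit
together can be sketched as below. The function names are made up for the
illustration, the replacement strings are taken literally (no FipSeq
parsing), and passing chrs below 256 straight through is an assumption based
on the "NOT latin1" wording above.

```python
def parse_number(text):
    """Decimal by default, hex when preceded by 'x'."""
    return int(text[1:], 16) if text.startswith("x") else int(text)


def build_unicode_map(lines):
    """Build a code point -> replacement table from parameter lines."""
    table = {}
    for line in lines:
        keyword, number, value = line.split(":", 2)
        base = parse_number(number)
        if keyword == "unicodelist":
            # Fill single-chr replacements sequentially from the start number.
            for offset, ch in enumerate(value):
                table[base + offset] = ch
        elif keyword == "unicodechr":
            # A later unicodechr line overrides a '*' placeholder in the list.
            table[base] = value
    return table


def convert(text, table, unmatched="*"):
    """Map chrs >= 256 through the table; unmatched ones become 'unmatched'."""
    return "".join(
        ch if ord(ch) < 256 else table.get(ord(ch), unmatched) for ch in text
    )


params = [
    "unicodelist:x150:Oo**RrRrRrSsSsSs",
    "unicodechr:x152:OE",
    "unicodechr:x153:oe",
    "unicodechr:8364:EUR",
]
table = build_unicode_map(params)
print(convert("\u0152uvre 5\u20ac \u2603", table))
```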

hdr-strip-between: start:(FipSeq Chr) end: (FipSeq Chr)
Where the 1st 9 lines of text are used in FipSeq using $1 etc,
use this to replace any tags with a space.
Normally the following would be used :
hdr-strip-between: start:< end:>
But if you have mapped the start/end tags to other chrs in