XML documents consist primarily of elements arranged in nested fashion. Elements may also contain text. xmlformat acts to rearrange elements by removing or adding line breaks and indentation, and to reformat text.
XML elements within input documents may be of three types:
block elements
This is the default element type. The DocBook <chapter>, <sect1>, and <para> elements are examples of block elements.
Typically a block element will begin a new line. (That is the default formatting behavior, although xmlformat allows you to override it.)
Spacing between sub-elements can be controlled, and sub-elements can be indented. Whitespace in block element text may be normalized. If normalization is in effect, line-wrapping may be applied as well. Normalization and line-wrapping may be appropriate for a block element with mixed content (such as <para>).
inline elements
These are elements that are contained within a block or within other inlines. The DocBook <emphasis> and <literal> elements are examples of inline elements.
Normalization and line-wrapping of inline element tags and content are handled the same way as for the enclosing block element. In essence, an inline element is treated as part of its parent's "text" content.
verbatim elements
No formatting is done for verbatim elements. The DocBook <programlisting> and <screen> elements are examples of verbatim elements.
Verbatim element content is written out exactly as it appears in the input document. This also applies to child elements. Any formatting that would otherwise be performed on them is suppressed when they occur within a verbatim element.
xmlformat never reformats element tags. In particular, it does not change whitespace between attributes or within attribute values. This is true even for inline tags within line-wrapped block elements.
xmlformat handles empty elements as follows:
If an element appears as <abc/> in the input document, it is written as <abc/>.
If an element appears as <abc></abc>, it is written as <abc></abc>. No line break is placed between the two tags.
XML documents may contain other constructs besides elements and text:
Processing instructions
Comments
DOCTYPE declaration
CDATA sections
xmlformat handles these constructs much the same way as verbatim elements. It does not reformat them.
Line breaks within block elements are controlled by the entry-break, element-break, and exit-break formatting options. A break value of n means n newlines. (This produces n-1 blank lines.)
Example. Suppose input text looks like this:
<elt> <subelt/> <subelt/> <subelt/> </elt>
Here, an <elt> element contains three nested <subelt> elements, which for simplicity are empty.
This input can be formatted several ways, depending on the configuration options. The following examples show how to do this.
To produce output with all sub-elements on the same line as the <elt> element, add a section to the configuration file that defines <elt> as a block element and sets all its break values to 0:
elt
  format block
  entry-break 0
  exit-break 0
  element-break 0
Result:
<elt><subelt/><subelt/><subelt/></elt>
To leave the sub-elements together on the same line, but on a separate line between the <elt> tags, leave the element-break value set to 0, but set the entry-break and exit-break values to 1. To suppress sub-element indentation, set subindent to 0.
elt
  format block
  entry-break 1
  exit-break 1
  element-break 0
  subindent 0
Result:
<elt>
<subelt/><subelt/><subelt/>
</elt>
To indent the sub-elements, make the subindent value greater than zero.
elt
  format block
  entry-break 1
  exit-break 1
  element-break 0
  subindent 2
Result:
<elt>
  <subelt/><subelt/><subelt/>
</elt>
To cause each sub-element to begin a new line, change the element-break value to 1.
elt
  format block
  entry-break 1
  exit-break 1
  element-break 1
  subindent 2
Result:
<elt>
  <subelt/>
  <subelt/>
  <subelt/>
</elt>
To add a blank line between sub-elements, increase the element-break value from 1 to 2.
elt
  format block
  entry-break 1
  exit-break 1
  element-break 2
  subindent 2
Result:
<elt>
  <subelt/>

  <subelt/>

  <subelt/>
</elt>
To also produce a blank line after the <elt> opening tag and before the closing tag, increase the entry-break and exit-break values from 1 to 2.
elt
  format block
  entry-break 2
  exit-break 2
  element-break 2
  subindent 2
Result:
<elt>

  <subelt/>

  <subelt/>

  <subelt/>

</elt>
To have blank lines only after the opening tag and before the closing tag, but not have blank lines between the sub-elements, decrease the element-break value from 2 to 1.
elt
  format block
  entry-break 2
  exit-break 2
  element-break 1
  subindent 2
Result:
<elt>

  <subelt/>
  <subelt/>
  <subelt/>

</elt>
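The interaction of the break and indent options in the examples above can be modeled with a few lines of Python. This is only an illustrative sketch, not xmlformat's actual implementation. The function name format_block and its parameters are hypothetical, and it assumes the sub-elements are already serialized and that indentation is applied only when a sub-element begins a new line.

```python
def format_block(name, children, entry_break=1, element_break=1,
                 exit_break=1, subindent=0):
    """Toy model of the break options for a single block element.

    `children` is a list of already-serialized sub-elements.
    Hypothetical helper for illustration; not xmlformat's own code.
    """
    indent = " " * subindent
    parts = ["<" + name + ">", "\n" * entry_break]
    for i, child in enumerate(children):
        # Newlines preceding this child: entry-break for the first
        # child, element-break for every later one.
        preceding = entry_break if i == 0 else element_break
        if i > 0:
            parts.append("\n" * element_break)
        # Assumption: indent only when the child starts a new line.
        if preceding > 0:
            parts.append(indent)
        parts.append(child)
    parts.append("\n" * exit_break)
    parts.append("</" + name + ">")
    return "".join(parts)


subs = ["<subelt/>"] * 3
# All break values 0: everything on one line.
print(format_block("elt", subs, entry_break=0, element_break=0, exit_break=0))
# Blank line after the opening tag and before the closing tag only.
print(format_block("elt", subs, entry_break=2, element_break=1,
                   exit_break=2, subindent=2))
```

A break value of n contributes n newline characters, which is why a value of 2 shows up as a single blank line in the output.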
Breaks within block elements are suppressed in certain cases:
Breaks apply to nested block or verbatim elements, but not to inline elements, which are, after all, inline. (If you really want an inline to begin a new line, define it as a block element.)
Breaks are not applied to text within non-normalized blocks. Non-normalized text should not be changed, and adding line breaks changes the text.
For example, if <x> elements are normalized, you might elect to format this:
<x>This is a sentence.</x>
Like this:
<x>
This is a sentence.
</x>
Here, breaks are added before and after the text to place it on a separate line. But if <x> is not normalized, the text content will be written as it appears in the input, to avoid changing it.
The XML standard considers whitespace nodes insignificant in elements that contain only other elements. In other words, for elements that have element content, sub-elements may optionally be separated by whitespace, but that whitespace is insignificant and may be ignored.
An element that has mixed content may have text (#PCDATA) content, optionally interspersed with sub-elements. In this case, whitespace-only nodes may be significant.
xmlformat treats only literal whitespace as whitespace. This includes the space, tab, newline (linefeed), and carriage return characters. xmlformat does not resolve entity references, so entities such as &nbsp; or &#32; that represent whitespace characters are seen as non-whitespace text, not as whitespace.
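The rule about literal whitespace can be illustrated with a small Python check. This is a sketch under stated assumptions, not xmlformat's code: the helper name is_all_whitespace is hypothetical, and the input is assumed to be the raw text exactly as it appears in the document, with no references resolved.

```python
import re

# Only the four literal characters named above count as whitespace:
# space, tab, newline (linefeed), and carriage return.
LITERAL_WS = re.compile(r"[ \t\n\r]*")


def is_all_whitespace(text):
    """Return True if `text` consists solely of literal whitespace."""
    return LITERAL_WS.fullmatch(text) is not None


print(is_all_whitespace(" \t\r\n"))   # True: literal whitespace only
print(is_all_whitespace("&#32;"))     # False: an unresolved reference is text
print(is_all_whitespace(" &nbsp; "))  # False for the same reason
```

An explicit character class is used here instead of Python's str.isspace(), because the latter also accepts other Unicode space characters that fall outside the four listed above.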
xmlformat doesn't know whether a block element has element content or mixed content. It handles text content as follows:
If an element has element content, it will have only sub-elements and possibly all-whitespace text nodes. In this case, it is assumed that you'll want to control line-break behavior between sub-elements, so that the (all-whitespace) text nodes can be discarded and replaced with the proper number of newlines, and possibly indentation.
If an element has mixed content, you may want to leave text nodes alone, or you may want to normalize (and possibly line-wrap) them. In xmlformat, normalization converts runs of whitespace characters to single spaces, and discards leading and trailing whitespace.
To achieve this kind of formatting, xmlformat recognizes normalize and wrap-length configuration options for block elements. They affect text formatting as follows:
You can enable or disable text normalization by setting the normalize option to yes or no.
Within a normalized block, runs of whitespace are converted to single spaces. Leading and trailing whitespace is discarded. Line-wrapping and indenting may be applied.
In a non-normalized block, text nodes are not changed as long as they contain any non-whitespace characters. No line-wrapping or indenting is applied. However, if a text node contains only whitespace (for example, a space or newline between sub-elements), it is assumed to be insignificant and is discarded. It may be replaced by line breaks and indentation when output formatting occurs.
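The rules just listed can be summarized as a small Python function that decides what happens to one text node. It is a simplified sketch rather than xmlformat's implementation. The name format_text_node is hypothetical, and the code assumes that an all-whitespace node is dropped in normalized blocks as well, since normalizing it would leave nothing behind.

```python
import re

# A run of literal whitespace: space, tab, newline, carriage return.
WS_RUN = re.compile(r"[ \t\n\r]+")


def format_text_node(text, normalize):
    """Return the text to keep, or None if the node is discarded.

    Hypothetical helper for illustration; not xmlformat's own code.
    """
    if text == "" or WS_RUN.fullmatch(text):
        # All-whitespace nodes are dispensable; the formatter may later
        # put line breaks and indentation in their place.
        return None
    if not normalize:
        # Non-normalized text that has real content is left untouched.
        return text
    # Normalized: collapse whitespace runs, trim both ends.
    return WS_RUN.sub(" ", text).strip(" ")


print(repr(format_text_node(" A ", normalize=False)))   # ' A '
print(repr(format_text_node(" A ", normalize=True)))    # 'A'
print(repr(format_text_node(" \n ", normalize=False)))  # None
```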
Consider the following input:
<row> <cell> A </cell> <cell> B </cell> </row>
Suppose that the <row> and <cell> elements both are to be treated as non-normalized. The contents of the <cell> elements are text nodes that contain non-whitespace characters, so they would not be reformatted. On the other hand, the spaces between tags are all-whitespace text nodes and are not significant. This means that you could reformat the input like this:
<row><cell> A </cell><cell> B </cell></row>
Or like this:
<row>
  <cell> A </cell><cell> B </cell>
</row>
Or like this:
<row>
  <cell> A </cell>
  <cell> B </cell>
</row>
In each of those cases, the whitespace between tags was subject to reformatting, but the text content of the <cell> elements was not.
The input would not be formatted like this:
<row><cell>A</cell><cell>B</cell></row>
Or like this:
<row>
  <cell>
    A
  </cell>
  <cell>
    B
  </cell>
</row>
In both of those cases, the text content of the <cell> elements has been modified, which is not allowed within non-normalized blocks. You would have to declare <cell> to have a normalize value of yes to achieve either of those output styles.
Now consider the following input:
<para> This is a sentence. </para>
Suppose that <para> is to be treated as a normalized element. It could be reformatted like this:
<para>This is a sentence.</para>
Or like this:
<para>
This is a sentence.
</para>
Or like this:
<para>
  This is a sentence.
</para>
Or even (with line-wrapping) like this:
<para>
  This is a
  sentence.
</para>
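The combination of normalization, line-wrapping, and indentation shown in the last variant can be approximated in Python with the standard textwrap module. This is a rough sketch, not how xmlformat itself wraps text. The function format_para is hypothetical, a wrap_length of 0 is assumed to mean "no wrapping", and the width is assumed to exclude the indentation. It also handles plain text only, whereas xmlformat must keep inline tags intact.

```python
import re
import textwrap


def format_para(text, wrap_length=0, subindent=2):
    """Normalize, optionally wrap, and indent the text of a <para>.

    Hypothetical helper for illustration; not xmlformat's own code.
    """
    # Normalize: collapse whitespace runs and trim both ends.
    normalized = re.sub(r"[ \t\n\r]+", " ", text).strip(" ")
    if wrap_length > 0:
        lines = textwrap.wrap(normalized, width=wrap_length)
    else:
        # Assumption: 0 disables wrapping in this sketch.
        lines = [normalized]
    indent = " " * subindent
    body = "\n".join(indent + line for line in lines)
    return "<para>\n" + body + "\n</para>"


print(format_para(" This is a   sentence. "))
print(format_para(" This is a   sentence. ", wrap_length=10))
```

With wrap_length=10 the normalized text is split after "This is a", which matches the line-wrapped variant above.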
The preceding description of normalization is a bit oversimplified. Normalization is complicated by the possibility that non-normalized elements may occur as sub-elements of a normalized block. In the following example, a verbatim block occurs in the middle of a normalized block:
<para>This is a paragraph that contains <programlisting> a code listing </programlisting> in the middle. </para>
In general, when this occurs, any whitespace in text nodes adjacent to non-reformatted nodes is discarded.
There is no "preserve all whitespace as is" mode for block elements. Even if normalization is disabled for a block, any all-whitespace text nodes are considered dispensable. If you really want all text within an element to be preserved intact, you should declare it as a verbatim element. (Within verbatim elements, nothing is ever reformatted, so whitespace is significant as a result.)
If you want to see how xmlformat handles whitespace nodes and text normalization, invoke it with the --canonized-output option. This option causes xmlformat to display the document after it has been canonized by removing whitespace nodes and performing text normalization, but before it has been reformatted in final form. By examining the canonized document, you can see what effect your configuration options have on treatment of the document before line-wrapping and indentation are performed and line breaks are added.