Paul DuBois
dubois@primate.wisc.edu
Wisconsin Regional Primate Research Center
Revision date: 9 March 1997
tc2html is a postprocessor for converting troffcvt
output to HTML. It's used by the troff2html front end.
This document describes how tc2html works and some of the
design issues involved in writing it.
In general, the goal of tc2html is that you should get
reasonable HTML output with no need for special treatment of the
troff input file. The most important thing is that you
use a standard macro package. However, there are some additional
principles you can follow that will improve the quality of the
HTML that tc2html generates. For example, it's possible
to embed hypertext links in your troff source with a little
prior planning. Techniques for such things are discussed in the
section "Generating Better HTML."
If you're not interested in implementation details, you can skip
directly to that section.
tc2html reads output from troffcvt and produces
an HTML document that has the following general form:
    <HTML>
    <HEAD>
    <TITLE>title text</TITLE>
    </HEAD>
    <BODY>
    <H1>title text</H1>
    body text
    </BODY>
    </HTML>

The document HEAD part may be missing if tc2html detects no title
in the input. In this case the initial heading at the beginning of
the document BODY part also will be missing. The entire document
BODY may be missing or empty if the input document is empty.
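The general form above can be sketched in a few lines of Python.
This is only an illustration, not tc2html's actual code:
render_document is a hypothetical helper, and the choice to omit
the BODY part entirely (rather than write an empty one) when there
is no input is an assumption.

```python
def render_document(title, body):
    """Assemble the overall document form (hypothetical helper).

    The HEAD part and the initial heading are written only when a
    title was detected. Assumption: the BODY part is omitted when
    there is neither a title nor any body text.
    """
    lines = ["<HTML>"]
    if title:
        lines += ["<HEAD>", f"<TITLE>{title}</TITLE>", "</HEAD>"]
    if title or body:
        lines.append("<BODY>")
        if title:
            lines.append(f"<H1>{title}</H1>")
        if body:
            lines.append(body)
        lines.append("</BODY>")
    lines.append("</HTML>")
    return "\n".join(lines)
```

Calling render_document("", "some text") yields a document with a
BODY part but no HEAD part and no initial heading.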
HTML documents typically are highly structured, being written
in terms of elements such as headers, paragraphs, lists, and displays
(preformatted text). But troffcvt output normally contains
very little structural information beyond markers like those for
inter-paragraph spacing and line breaks (in the form of \space
and \break control lines). The result when tc2html
reads such troffcvt output is that it produces HTML that
is relatively unstructured -- just a lot of text broken by occasional
<P> or <BR> markers.
However, if your document is marked up using macros from a macro
package such as -ms or -man, it's possible to get
output from troffcvt that's much more suitable for tc2html.
The trick is to map troff requests to HTML structure markers,
rather than trying to guess the structure from the low-level troffcvt
output that normally results from those requests. This is accomplished
using the following strategy:
The effect of the strategy outlined above is to remap the macros
in your macro package from their usual actions onto actions that
produce document structure information that tc2html can
recognize. For this to work well, all the important structure-related
macros in a macro package must be redefined, so the redefinition
files used for tc2html tend to be more extensive than those
used for other postprocessors. This is really the source of most
of the work involved in getting tc2html to function well.
Once a set of redefinitions is written for a given macro package,
translation from troff to HTML is a straightforward process
that usually generates fairly reasonable HTML.
Here's an example of how the strategy described above works in
practice. The .LP macro in the -ms macro package
means "begin paragraph." But .LP typically is
implemented by executing several other requests (restore font,
margins, adjustment, spacing, point size, etc.), and the troffcvt
output you'd get by processing those requests really contains
nothing that specifically indicates a paragraph. To work around
this, we use the fact that tc2html interprets \html
para as indicating a paragraph beginning, and define a
macro to generate that control:

    req H*para eol output-control "html para"

Then we can redefine the .LP macro in terms of the .H*para macro:

    req LP eol \
        break center 0 fill adjust b font R \
        push-string ".H*para\n"

The break, fill, adjust, and font actions cause troffcvt to adjust
its internal state to match the effect that the .LP macro normally
has. The call to .H*para results in \html para in the output, so
that tc2html can recognize the paragraph beginning.
The \html markers that tc2html recognizes are shown
below:

    \html title               Begin document title
    \html header N            Begin level N header
    \html header-end          End header (any level)
    \html para                Begin paragraph
    \html blockquote          Begin block quote
    \html blockquote-end      End block quote
    \html list                Begin list
    \html list-end            End list
    \html list-item           Begin list item
    \html display             Begin display (preformatted text)
    \html display-end         End display
    \html display-indent N    Set display indent to N spaces
    \html definition-term     Begin definition list term
    \html definition-desc     Begin definition list description
    \html shift-right         Shift left margin right
    \html shift-left          Shift left margin left
    \html anchor-href URL     Begin HREF anchor for link to URL
    \html anchor-name LABEL   Begin NAME anchor with label LABEL
    \html anchor-toc N        Begin NAME anchor for level N TOC entry
    \html anchor-end          End anchor (any kind)

The troff-level macros used to generate the \html controls are
shown below. These macros are defined in the action file
actions-html:

    .H*title                  Begin document title
    .H*header N               Begin level N header
    .H*header-end             End header (any level)
    .H*para                   Begin paragraph
    .H*bq                     Begin block quote
    .H*bq-end                 End block quote
    .H*list                   Begin list
    .H*list-end               End list
    .H*list-item              Begin list item
    .H*disp                   Begin display (preformatted text)
    .H*disp-end               End display
    .H*disp-indent N          Set display indent to N spaces
    .H*dterm                  Begin definition list term
    .H*ddesc                  Begin definition list description
    .H*shift-right            Shift left margin right
    .H*shift-left             Shift left margin left
    .H*ahref URL              Begin HREF anchor for link to URL
    .H*aname LABEL            Begin NAME anchor with label LABEL
    .H*atoc N                 Begin NAME anchor for level N TOC entry
    .H*aend                   End anchor (any kind)

Note that since these names are longer than two characters, they
cannot be used in compatibility mode.
The \html controls are defined in a file actions-html
that you can access on the troffcvt command line using
-a actions-html. If you use a macro package -mxx,
you specify it on the command line, along with the general and
HTML-specific troffcvt redefinitions for that macro package;
these are in the action files tc.mxx and tc.mxx-html.
Thus, to translate a file that you'd normally process using -ms,
the command would look like this:

    % troffcvt -a actions-html -ms -a tc.ms -a tc.ms-html myfile.ms \
        | tc2html > myfile.html

That's pretty ugly, of course; it's better to use a wrapper script
like troff2html that supplies the necessary options for you:

    % troff2html -ms myfile.ms > myfile.html
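
To make the option pattern concrete, here is a minimal Python
sketch of what a wrapper has to assemble. It is not the troff2html
script itself; troffcvt_pipeline is a hypothetical helper that
only builds the command string, following the -a actions-html,
tc.mxx, and tc.mxx-html naming described in the text.

```python
import shlex


def troffcvt_pipeline(macro_package, input_file, output_file):
    """Build the troffcvt | tc2html pipeline for a macro package.

    macro_package is the name without the leading dash, e.g. "ms"
    for -ms. Hypothetical helper: it returns the shell command as
    a string and does not run anything.
    """
    troffcvt = [
        "troffcvt",
        "-a", "actions-html",                # generic HTML macros
        f"-{macro_package}",                 # the macro package itself
        "-a", f"tc.{macro_package}",         # general redefinitions
        "-a", f"tc.{macro_package}-html",    # HTML-specific redefinitions
        input_file,
    ]
    left = " ".join(shlex.quote(arg) for arg in troffcvt)
    return f"{left} | tc2html > {shlex.quote(output_file)}"
```

For example, troffcvt_pipeline("ms", "myfile.ms", "myfile.html")
produces the long command for the -ms package on a single line.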
This section provides some specifics on how several troff
concepts are turned into HTML elements. It should be considered
illustrative rather than exhaustive.
Title macros are implemented in terms of .H*title, which
generates an \html title control. When tc2html
sees this control, it goes into document HEAD collection mode.
If the document contains a title, the \html title
line must be the first \html control that tc2html
sees. Should any other \html control or document text occur
first, tc2html assumes no title is present. Any leading
document whitespace (\space or \break lines) occurring
prior to the title is skipped.
The title is terminated by the next \html line with a structural
marker, such as \html para. The title text is used
to produce the TITLE in the document HEAD part and the initial
header in the document BODY part. \space and \break
lines within the title do not terminate title text collection;
instead, they are turned into spaces in the title and into <P>
and <BR> in the initial header. Consider the following troff
input (using -ms macros):

    .TL
    My
    .sp
    Title
    .LP
    This is a line.

This is converted by troffcvt into the following:

    \html title
    My
    \space
    Title
    \break
    \html para
    This is a line.

The output from troffcvt is converted in turn by tc2html into this
HTML:

    <HEAD>
    <TITLE>
    My Title
    </TITLE>
    </HEAD>
    <BODY>
    <H2>
    My
    <P>
    Title
    </H2>
    <P>
    This is a line.

A title can be given explicitly with -T title on the tc2html or
troff2html command line. It overrides the title in the document if
there is one.
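
The title rules can be sketched in Python as follows. This is an
illustration under stated assumptions, not tc2html's code:
collect_title is a hypothetical name, the input is assumed to be a
list of troffcvt output lines, and dropping trailing \space or
\break lines from the collected title is an assumption made so
that the result matches the example above.

```python
# Whitespace controls and what they become in the initial header.
WHITESPACE = {r"\space": "<P>", r"\break": "<BR>"}


def collect_title(lines):
    """Return (title_text, header_html), or None if there is no title."""
    i = 0
    # Leading document whitespace before the title is skipped.
    while i < len(lines) and lines[i] in WHITESPACE:
        i += 1
    # The title control must come before any other control or text.
    if i == len(lines) or lines[i] != r"\html title":
        return None
    tokens = []
    for line in lines[i + 1:]:
        if line.startswith("\\html "):
            break  # the next \html line ends the title
        tokens.append(line)
    # Assumption: trailing whitespace controls are dropped.
    while tokens and tokens[-1] in WHITESPACE:
        tokens.pop()
    title = " ".join(t for t in tokens if t not in WHITESPACE)
    header = " ".join(WHITESPACE.get(t, t) for t in tokens)
    return title, header
```

Given the troffcvt output shown above, the sketch returns the
title text "My Title" and the header text "My <P> Title".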
The "standard" paragraph is a paragraph with the first
line flush left. There is no mechanism for writing paragraphs
with an indented first line; they're treated simply as standard
paragraphs.
The standard paragraph is implemented in terms of .H*para,
which generates an \html para control. This is turned
by tc2html into <P>.
In the document BODY part, \space is also interpreted as
a paragraph marker, but during document title collection, \space
is treated as described above under "Document Titles."
Indented paragraphs (with or without a hanging tag) are implemented
using definition lists (<DL>...</DL>). The tag is
written as a definition term (<DT>...</DT>) and the
paragraph body is written as a definition description (<DD>...</DD>).
If there is no tag, the term part is empty.
Indented paragraph macros are implemented in terms of .H*dterm
and .H*ddesc, which generate \html definition-term
and \html definition-desc controls.
One problem with mapping indented paragraphs onto definition lists
is that it's not always clear from the troff input where
the list ends. In HTML, the definition list is a container for
which you must write both a beginning and ending tag, but in troff
only the beginnings of paragraphs are specified. This problem
is handled (perhaps poorly) by closing the list when other HTML
structural elements like a standard paragraph or a header are
seen. Suppose you write something like this:

    .IP (i)
    Para 1
    .IP (ii)
    Para 2
    .LP
    Para 3

This is converted by troffcvt into the following:

    \html definition-term
    (i)
    \html definition-desc
    Para 1
    \break
    \html definition-term
    (ii)
    \html definition-desc
    Para 2
    \break
    \html para
    Para 3
    \break

When tc2html sees the first \html definition-term, it begins a
definition list. The second \html definition-term continues the
same list. The \html para (resulting from the .LP) is part of a
different structural element, so tc2html closes the list and begins
a standard paragraph. The resulting HTML looks like this:

    <DL>
    <DT>
    (i)
    </DT>
    <DD>
    Para 1<BR>
    </DD>
    <DT>
    (ii)
    </DT>
    <DD>
    Para 2<BR>
    </DD>
    </DL>
    <P>
    Para 3<BR>
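
The list-closing rule amounts to a small state machine. The Python
sketch below is an illustration only, not tc2html's implementation:
convert_dl is a hypothetical name, only the three controls from the
example are handled, and attaching <BR> to the preceding text is an
assumption made to mirror the output shown above.

```python
def convert_dl(lines):
    """Convert troffcvt-style lines to HTML tokens joined by spaces."""
    out = []
    in_list = False    # inside <DL>...</DL>
    open_item = None   # "DT" or "DD" when a term or description is open

    def close_item():
        nonlocal open_item
        if open_item:
            out.append(f"</{open_item}>")
            open_item = None

    for line in lines:
        if line == r"\html definition-term":
            close_item()
            if not in_list:          # first term begins the list
                out.append("<DL>")
                in_list = True
            out.append("<DT>")
            open_item = "DT"
        elif line == r"\html definition-desc":
            close_item()
            out.append("<DD>")
            open_item = "DD"
        elif line == r"\html para":
            close_item()
            if in_list:              # a different element closes the list
                out.append("</DL>")
                in_list = False
            out.append("<P>")
        elif line == r"\break":
            if out:
                out[-1] += "<BR>"
            else:
                out.append("<BR>")
        else:
            out.append(line)
    close_item()
    if in_list:
        out.append("</DL>")
    return " ".join(out)
```

A header control would close the list in the same way as
\html para does here.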
In troff, the left margin can be shifted right and left,
e.g., as is done with the -ms and -man packages
using .RS and .RE. HTML has no good way of shifting
the margin, so shifts are performed using <UL> and </UL>.
This is admittedly a hack, but it works reasonably well. Shift
macros are redefined to be implemented in terms of .H*shift*right
and .H*shift*left, which generate \html shift-right
and \html shift-left controls. These in turn are
converted by tc2html to <UL> and </UL>.
Displays are implemented as preformatted text (<PRE>...</PRE>).
Tabstops are respected within displays, although they must be
approximated since character widths are unknown. tc2html
assumes 10 characters/inch for determining the width of tabstops.
Display macros are implemented in terms of .H*disp and
.H*disp*end. Preformatted text in HTML has no additional
indent relative to the left margin, but troff displays
often are indented a bit. To handle this, .H*disp*indent
N can be used to set the display indent to N spaces.
.H*disp, .H*disp*end, and .H*disp*indent
generate \html display, \html display-end,
and \html display-indent controls. The first two
of these are converted by tc2html into <PRE> and
</PRE>. \html display-indent generates no
output itself, but causes tc2html to add spaces to the
beginning of each line of a display.
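
As a rough illustration of the tabstop approximation and the
display indent, consider the Python sketch below. It is not
tc2html's code: tab_columns and expand_line are hypothetical names,
tab stops are assumed to be given in inches in ascending order, and
both the single-space fallback past the last stop and applying the
indent after tab expansion are assumptions.

```python
CHARS_PER_INCH = 10  # character width assumed by tc2html, per the text


def tab_columns(tab_stops_inches):
    """Convert tab stop positions in inches to character columns."""
    return [round(pos * CHARS_PER_INCH) for pos in tab_stops_inches]


def expand_line(line, columns, indent=0):
    """Expand tabs to the given columns, then add the display indent."""
    out = ""
    for ch in line:
        if ch == "\t":
            # Next stop to the right; assumption: one space if none is left.
            stop = next((c for c in columns if c > len(out)), len(out) + 1)
            out = out.ljust(stop)
        else:
            out += ch
    return " " * indent + out
```

With stops at 0.5 and 1.5 inches, tab_columns gives columns 5 and
15, and expand_line pads each tabbed field out to those columns
before prefixing the indent.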
Centered and right-justified displays are not implemented. They're
treated as regular displays.
If your input document has tables written in the tbl language,
preprocess the document with tblcvt rather than with tbl.
Your output will look better that way.
Table cell borders are hard to do well. In tbl you can
put a border on any cell boundary, but in HTML a table has either
no borders or borders around every cell. Currently, tc2html
puts borders around every cell.
Fonts are handled in tc2html by means of a table that associates
four tags with each font name. The first two tags are used to
turn the font on and off in normal text. The second two tags are
used to turn the font on and off in displays. This table is read
at runtime from the html-fonts file. Here's an example
of what the file might look like:

    R    ""             ""              ""       ""
    I    <I>            </I>            <I>      </I>
    B    <B>            </B>            <B>      </B>
    BI   <B><I>         </I></B>        <B><I>   </I></B>
    C    <TT>           </TT>           ""       ""
    CW   <TT>           </TT>           ""       ""
    CI   <TT><I>        </I></TT>       <I>      </I>
    CB   <TT><B>        </B></TT>       <B>      </B>
    CBI  <TT><B><I>     </I></B></TT>   <B><I>   </I></B>

The tags for display text differ from those for regular text
because browsers implicitly switch to a monospaced font in
displays, so the only thing a font change can do there is change
the style attributes.
The initial font when tc2html begins is R (roman).
When a font change occurs, the new font's begin tag is written
out after terminating the previous font by writing its end tag.
Using the font table just shown, this input:

    \font R
    abc
    \font I
    def
    \font CW
    ghi
    \font R
    jkl

becomes this output:

    abc<I>def</I><TT>ghi</TT>jkl
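
The font-switching rule is easy to sketch in Python. This is an
illustration rather than tc2html's code: apply_fonts is a
hypothetical name, FONT_TAGS holds only the normal-text tags for a
few of the fonts in the table, and closing the last font at end of
input is an assumption.

```python
# Normal-text begin/end tags for a few fonts from the table above.
FONT_TAGS = {
    "R": ("", ""),
    "I": ("<I>", "</I>"),
    "B": ("<B>", "</B>"),
    "CW": ("<TT>", "</TT>"),
}


def apply_fonts(lines, tags=FONT_TAGS):
    """Turn \\font controls into HTML tags around the text lines."""
    out = []
    current = "R"  # the initial font is roman
    for line in lines:
        if line.startswith("\\font "):
            new = line.split(None, 1)[1]
            out.append(tags[current][1])  # end the previous font
            out.append(tags[new][0])      # begin the new font
            current = new
        else:
            out.append(line)
    out.append(tags[current][1])  # assumption: close the font at end of input
    return "".join(out)
```

Fed the eight input lines above, the sketch returns
abc<I>def</I><TT>ghi</TT>jkl.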
Tabs are ignored except in displays. Adding extra space to tab
over has no effect in regular paragraphs anyway, because browsers
typically collapse runs of spaces.
Right-justified and centered tabs are treated as left-justified
tabs. That is, they're completely botched.
This section describes how you can embed hypertext links in your
troff source and how to produce a table of contents containing
clickable links to the main sections of your document.
The \html controls used to generate hypertext links are:

    \html anchor-href URL
    \html anchor-name LABEL
    \html anchor-end

The first two controls generate opening <A HREF=URL> and
<A NAME=LABEL> tags; the third generates a closing </A> tag.
To embed hypertext links in your troff source, you can
use the macros .H*ahref and .H*aend, or .H*aname
and .H*aend. To write an HREF link, the troff source
looks like this:

    .H*ahref http://www.some.host/some/path
    hypertext link
    .H*aend

The resulting HTML looks like this:

    <A HREF="http://www.some.host/some/path">
    hypertext link</A>

To write a NAME link, the troff source looks like this:

    .H*aname my-name
    name link
    .H*aend

The resulting HTML looks like this:

    <A NAME="my-name">
    name link</A>

Section-header macros are usually redefined to generate a NAME
anchor for the table of contents, so don't surround a section
header with anchor-generating macros. You'll end up with nested
anchors, which tc2html disallows. You can generate a NAME link for
a section (e.g., so that you can refer to it using a specific name)
as long as you don't write the link like this:

    .H*aname better-html
    .SH "Generating Better HTML"
    .H*aend

Instead, write it like this:

    .H*aname better-html
    .H*aend
    .SH "Generating Better HTML"

Unfortunately, some browsers don't seem able to jump to NAME
anchors unless there is some text between the <A NAME> and </A>
tags.
You can't make a section header a hypertext link. You'd have to
put the header (which generates a NAME link for the TOC) between
the .H*ahref and .H*aend macros, which would result
in nested anchors.
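
The nesting restriction can be illustrated with a short Python
sketch. It is not tc2html's implementation: render_anchors is a
hypothetical name, and raising an error on a nested anchor is an
assumption, since the text says only that nesting is disallowed.

```python
def render_anchors(lines):
    """Render anchor controls as <A> tags, rejecting nested anchors."""
    out = []
    in_anchor = False
    for line in lines:
        if line.startswith((r"\html anchor-href ", r"\html anchor-name ")):
            if in_anchor:
                # Assumption: nesting is reported as an error.
                raise ValueError("nested anchors are not allowed")
            kind, value = line.split()[1:3]
            attr = "HREF" if kind == "anchor-href" else "NAME"
            out.append(f'<A {attr}="{value}">')
            in_anchor = True
        elif line == r"\html anchor-end":
            out.append("</A>")
            in_anchor = False
        else:
            out.append(line)
    return "".join(out)
```

An HREF anchor opened around a section header would trip the
nesting check as soon as the header's own NAME anchor begins.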
Putting a table of contents (TOC) into an HTML document requires
some postprocessing of the tc2html output. The TOC entries
can't be written to the beginning of the document because they're
not all known until the input has been read entirely. The approach
adopted with tc2html is as follows:
The \html controls used to generate TOC entries are:

    \html anchor-toc N
    \html anchor-end

Text occurring between \html anchor-toc and \html anchor-end pairs
is written to the output, but it's also collected and remembered.
When tc2html encounters end of file on its input, it writes the TOC
entries to the output between two other HTML comments:

    <!-- TOC BEGIN -->
    TOC entries
    <!-- TOC END -->

If you want to generate a TOC entry explicitly in your troff
source, use .H*atoc and .H*aend. For example:

    .H*atoc 1
    My TOC Entry
    .H*aend

The argument to .H*atoc is the TOC entry level (1, 2, 3, ...).
It's unnecessary to invoke TOC macros directly if the section-header
macros in your macro package are redefined to invoke the TOC macros
for you. For example, the .SH macro for the -ms package
is redefined like this in the tc.ms-html action file:

    req SH parse-macro-args eol \
        break fill adjust b \
        push-string ".H*atoc 1\n" \
        push-string ".H*header 2\n" \
        push-string "$1\n" \
        push-string ".H*header*end\n" \
        push-string ".H*aend\n"

To specify the TOC title and generate the TOC position marker, use
the .H*toc*title macro. Invoke it as shown below, passing the title
of your TOC as the first argument:

    .H*toc*title "Table of Contents"

.H*toc*title writes the TOC title to the output followed by a
special HTML comment:

    Table of Contents
    <!-- INSERT TOC HERE -->

The INSERT TOC HERE comment is used by tc2html-toc, along with the
TOC BEGIN and TOC END comments, to find the TOC entries and move
them to the desired location.
Action files that provide macro package redefinitions for tc2html
can try to place an advisory TOC location marker in the document.
This is used if you don't specify a location marker explicitly
with .H*toc*title:

    <!-- INSERT TOC HERE, MAYBE -->

For instance, the -man redefinitions put out this marker when the
.TH macro has been seen. The marker causes a TOC to be placed after
the title line and the first man page section, unless one is
specified explicitly. No TOC title is written with the advisory
marker, however, so the TOC will be "title-less."
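
The relocation step can be sketched in Python as shown below. This
is not the tc2html-toc program: move_toc is a hypothetical name,
each comment is assumed to sit on a line of its own, and removing
the marker comments from the result (and leaving the entries at the
end when no marker exists) are assumptions.

```python
TOC_BEGIN = "<!-- TOC BEGIN -->"
TOC_END = "<!-- TOC END -->"
MARKER = "<!-- INSERT TOC HERE -->"
ADVISORY = "<!-- INSERT TOC HERE, MAYBE -->"


def move_toc(lines):
    """Move TOC entries to the explicit marker, else the advisory one."""
    if TOC_BEGIN not in lines or TOC_END not in lines:
        return list(lines)
    begin = lines.index(TOC_BEGIN)
    end = lines.index(TOC_END)
    entries = lines[begin + 1:end]
    rest = lines[:begin] + lines[end + 1:]
    # An explicit marker takes precedence over the advisory one.
    target = MARKER if MARKER in rest else ADVISORY
    out = []
    placed = False
    for line in rest:
        if line in (MARKER, ADVISORY):
            if line == target and not placed:
                out.extend(entries)
                placed = True
            continue  # assumption: marker comments are dropped
        out.append(line)
    if not placed:
        out.extend(entries)  # assumption: no marker, entries stay at the end
    return out
```

When both markers are present, the entries land at the explicit
INSERT TOC HERE position and the advisory marker is ignored.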