htmltotex tutorial

htmltotex is a utility, written in Perl 5, for converting HTML pages to TeX documents. It accepts only local files written according to the HTML 2 specification and outputs a TeX file, which you then compile with your TeX compiler. You need the following programs/packages to run it. htmltotex accepts all HTML 2 tags. In addition, it is HTML 3.2 compatible: it recognizes all HTML 3.2 tags, and when it encounters one it either handles it as expected or copes with it in a suitable manner.
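The convert-then-compile cycle can be sketched as a small shell helper. This is only an illustration: the function names tex_name and html2dvi are hypothetical, and the sketch assumes that both htmltotex and a plain TeX binary named tex are on your PATH.

```shell
# Derive the TeX file name from the HTML file name by replacing the
# last extension, e.g. myfile.html -> myfile.tex.
tex_name() {
    printf '%s\n' "${1%.*}.tex"
}

# Hypothetical helper: convert one HTML file, then compile the result.
# tex runs only if htmltotex exits successfully.
html2dvi() {
    out=$(tex_name "$1")
    htmltotex -o "$out" "$1" && tex "$out"
}

# Usage (assumes myfile.html exists in the current directory):
#   html2dvi myfile.html
```

The -o option is used instead of shell redirection so that a failed conversion is less likely to leave a half-written file behind under the name tex is about to read.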

Command-line options

Usage: htmltotex [Options] input_file
Table: Command-line options for htmltotex
Option | Meaning | Default
-a | Show details (code, parameter names and values) of APPLET tags. | Ignore APPLET tags.
-b BASEURL | Set base URL to BASEURL. If not set, overwritten by the BASE tag in the header of the input file. | file:<input file>
-c | Retain html comments in input file as comments in output TeX file. | Remove all html comments.
-d | Convert images in black-and-white dithered mode. | Retain original colors.
-f | Show details (buttons, checkboxes, textareas, textfields etc.) of the form components. | Don't show the details.
-F | Tell TeX not to load computer modern fonts at smaller sizes. Use scaled versions of the 10pt fonts instead. For more details look at the troubleshooting file. | If the size of normal text is n points, tell TeX to load computer modern fonts at point sizes n, n-3 and n-5.
-g | Convert images in gray scale. | Retain original colors.
-h | Print help message. |
-i | Do not put in-line images, show alternate text (ALT) only. | Put images.
-I | Do not put in-line images, do not show alternate text (ALT). | Put images.
-ip FILE | Input user-defined TeX macro/preamble file FILE. Only one file can be included. You may, for example, change text dimensions, paragraph skips etc. in the preamble file. You are, however, requested not to change \parindent to a non-zero value. | No user-defined files are included. Plain TeX macros and some other pre-defined macros are used only. These macros are read and processed from the library files fonts.mac and misc.mac. You may not directly use these macro files in your TeX documents.
-L LIBDIR | Append LIBDIR to the default list of directories for searching the library files. LIBDIR should be a colon-separated list of directories. | The default search path is /usr/local/lib/htmltotex:/usr/local/htmltotex:~/.htmltotex:.
-lf LFTR | Set left-footer to LFTR. | Null
-lh LHDR | Set left-header to LHDR. | Null
-m IMGDIR | Put postscript files for in-line images in IMGDIR. | Directory of the input file.
-o OUTFILE | Output TeX file to OUTFILE. | stdout
-p NUM | Set starting page number to NUM. | 1
-P | Suppress page numbering. | Show page numbers.
-pt NUM | Set size of default font to NUM points. | 10
-r NUM | Set TeX running mode to NUM for the output file. Admissible values for NUM are: 0 (stop-for-errors mode), 1 (batch mode), 2 (scroll mode), 3 (nonstop mode). | 0
-rf RFTR | Set right footer to RFTR. | Null
-rh RHDR | Set right header to RHDR. If you do not want a header even when the input contains the TITLE tag, give the option -rh "". | TEXT if input contains <TITLE>TEXT</TITLE> in the header; null otherwise.
-t TMPDIR | Set temp directory for intermediate files to TMPDIR. Intermediate files are used for image conversion. You must have permission to write in this directory. | /tmp
-u | Underline anchors. | Don't underline.
-v | Set up verbose mode. | Non-verbose mode
-V | Print version info and quit. |
-w NUM | Set cell width in tables to NUM centimeters. Each cell in the tables of the output file is set in a \vtop box of width NUM cms. | 3
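As an illustration of the -ip option, the following sketch writes a small preamble file that changes the text dimensions and the paragraph skip. The file name mypreamble.tex and the particular dimensions are assumptions chosen for the example; \hsize, \vsize and \parskip are standard plain TeX parameters. At which point htmltotex reads the preamble relative to its own macros is not stated here, so treat this as a starting point rather than a tested recipe.

```shell
# Write a hypothetical preamble file for use with -ip.
# The quoted 'EOF' keeps the shell from touching the backslashes.
cat > mypreamble.tex <<'EOF'
% mypreamble.tex: example preamble (name and values are illustrative)
\hsize=15cm              % text width
\vsize=22cm              % text height
\parskip=6pt plus 2pt    % vertical space between paragraphs
% The paragraph indent is deliberately left untouched, as requested.
EOF
```

The file would then be passed along with the other options, for example: htmltotex -ip mypreamble.tex -o myfile.tex myfile.html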

Examples:
htmltotex myfile.html # Use all defaults
htmltotex -o myfile.tex myfile.html # Output to myfile.tex
htmltotex myfile.html > myfile.tex # Same as the last example
htmltotex -t . myfile.html # Use . as temp directory
htmltotex -a -c -f myfile.html # Show applets and forms, retain comments
htmltotex -g myfile.html # Convert all in-line images to grey-scale
htmltotex -rh "My sweet home page" myfile.html # Set right header to "My sweet home page"
htmltotex -b "http://local.machine/~user/fun/index.html" myfile.html # Set base location
htmltotex -L /opt/htmltotex:~/lib myfile.html # Add directories /opt/htmltotex and ~/lib to the library search path
htmltotex -F -o myfile.tex myfile.html # Use this option if tex complains that it cannot find certain fonts when it tries to compile myfile.tex.
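To see which font sizes are involved when -F is not given, the rule from the options table (point sizes n, n-3 and n-5 for a normal text size of n points) can be worked out with a small helper. The function name font_sizes is hypothetical; only the arithmetic comes from the table.

```shell
# Print the three point sizes at which Computer Modern fonts are
# requested when -F is NOT given: n, n-3 and n-5.
font_sizes() {
    n=$1
    echo "$n $((n - 3)) $((n - 5))"
}

font_sizes 10    # default -pt value; prints: 10 7 5
font_sizes 12    # as with -pt 12;    prints: 12 9 7
```

If TeX cannot find a font at one of these sizes on your system, -F is the option to try, since it asks for scaled versions of the 10pt fonts instead.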

Tags supported

Anchors

<A> </A>

Blocks

<ADDRESS> </ADDRESS> <BLOCKQUOTE> </BLOCKQUOTE>
<BODY> </BODY> <HEAD> </HEAD> <HTML> </HTML> <TITLE> </TITLE>
<H1> ... <H6> </H1> ... </H6>
<BR> <HR> <P> </P> <CENTER> </CENTER> <DIV> </DIV>
<PRE> </PRE> <XMP> </XMP> <LISTING> </LISTING>
<SUB> </SUB> <SUP> </SUP>
<!-- ... -->

Lists

<DL> </DL> <DT> <DD> <DIR> </DIR> <MENU> </MENU> <OL> </OL> <UL> </UL> <LI>

Tables

<TABLE> </TABLE> <TR> <TD> </TD> <TH> </TH>

Highlighting

<BLINK> </BLINK> <B> </B> <CITE> </CITE> <CODE> </CODE> <DFN> </DFN> <EM> </EM> <I> </I> <KBD> </KBD> <SAMP> </SAMP> <STRONG> </STRONG> <TT> </TT> <U> </U> <VAR> </VAR>

Font changes

<FONT SIZE=...> <FONT SIZE=+...> <FONT SIZE=-...> </FONT> <BIG> <SMALL>

Images

<IMG>

Forms

<FORM> </FORM> <INPUT> </INPUT>
Input types: CHECKBOX, HIDDEN, IMAGE, RADIO, RESET, SUBMIT, TEXT.
<OPTION> </OPTION> <SELECT> </SELECT> <TEXTAREA> </TEXTAREA>

Special characters

HTML control characters: &lt; &gt; &quot; &amp;
ISO Latin characters: &copy; &Aacute; &Agrave; &Acirc; &Atilde; &Aring; &Auml; &AElig; &Ccedil; &Eacute; &Egrave; &Ecirc; &Euml; &Iacute; &Igrave; &Icirc; &Iuml; &Ntilde; &Oacute; &Ograve; &Ocirc; &Otilde; &Ouml; &Oslash; &Uacute; &Ugrave; &Ucirc; &Uuml; &Yacute; &szlig; &aacute; &agrave; &acirc; &atilde; &aring; &auml; &aelig; &ccedil; &eacute; &egrave; &ecirc; &euml; &iacute; &igrave; &icirc; &iuml; &ntilde; &oacute; &ograve; &ocirc; &otilde; &ouml; &oslash; &uacute; &ugrave; &ucirc; &uuml; &yacute; &yuml;
Numerical character references and direct inclusion of characters:
¡ ¦ § ¨ © ª « ¬ ­ ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼
½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý
ß à á â ã ä å æ ç è é ê ë ì í î ï ñ ò ó ô õ ö ÷ ø ù ú û ü ý ÿ
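To give a feel for what translating these entities involves, here is a minimal sketch that maps a handful of them to the standard plain TeX accent commands using sed. The function name entities_to_tex is hypothetical, and the mapping shown is an assumption based on plain TeX conventions; the output htmltotex actually produces may differ.

```shell
# Map a few ISO Latin entities to plain TeX accent commands.
# Assumption: standard plain TeX commands; not taken from htmltotex itself.
# In each sed replacement, \\ stands for one literal backslash.
entities_to_tex() {
    sed -e 's/&eacute;/\\'\''e/g' \
        -e 's/&egrave;/\\`e/g' \
        -e 's/&ecirc;/\\^e/g' \
        -e 's/&uuml;/\\"u/g' \
        -e 's/&ntilde;/\\~n/g' \
        -e 's/&ccedil;/\\c{c}/g' \
        -e 's/&szlig;/{\\ss}/g' \
        -e 's/&aelig;/{\\ae}/g' \
        -e 's/&oslash;/{\\o}/g' \
        -e 's/&aring;/{\\aa}/g'
}

# Usage: reads HTML text on stdin, writes the converted text to stdout.
printf '%s\n' 'caf&eacute; and Stra&szlig;e' | entities_to_tex
```

The first expression closes and reopens the single-quoted string (the '\'' idiom) because the acute accent command itself contains a single quote.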

Applets

<APPLET> </APPLET> <PARAM>

HTML 3.2 tags

Not all HTML 3.2 tags are supported. The previous section lists the HTML 3.2 tags that are handled. For the sake of completeness, here again are the HTML 3.2 features that htmltotex takes care of. Closing tags are not listed.

<H1 ALIGN=CENTER/LEFT/RIGHT> <OL START=number TYPE=1/a/A/i/I> <LI VALUE=number> <UL TYPE=DISC/SQUARE/CIRCLE> <LI TYPE=DISC/SQUARE/CIRCLE> <DIV ALIGN=CENTER/LEFT/RIGHT> <CENTER> <FONT SIZE=number> <FONT SIZE=+number> <FONT SIZE=-number> <BIG> <SMALL> <SUB> <SUP> <APPLET> <PARAM> &copy;

On the other hand, the following HTML features are not incorporated. (Not all of these are part of the HTML 3.2 specification, but various browsers support them.) Again, closing tags are not listed.

<STRIKE> <BASEFONT> <EMBED> <TAB> <LH> <FRAMESET> <FRAME> <SCRIPT> <FIG> <CAPTION> <CREDIT> <OVERLAY> &reg; text between <APPLET> and </APPLET>
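A quick way to check how the supported HTML 3.2 features come out is to feed htmltotex a small test page that uses them. The sketch below writes such a page; the file name sample32.html and the page contents are made up for illustration and use only attributes from the supported list.

```shell
# Write a hypothetical test page exercising the supported HTML 3.2 features.
cat > sample32.html <<'EOF'
<HTML>
<HEAD><TITLE>HTML 3.2 test page</TITLE></HEAD>
<BODY>
<H1 ALIGN=CENTER>Centered heading</H1>
<OL START=3 TYPE=1>
<LI>Item that starts the count at 3
<LI VALUE=7>Item with its value reset to 7
</OL>
<UL TYPE=SQUARE>
<LI>Item with the list's bullet type
<LI TYPE=CIRCLE>Item with its own bullet type
</UL>
<DIV ALIGN=RIGHT>Right-aligned text</DIV>
<P><FONT SIZE=+1>Larger</FONT>, <SMALL>smaller</SMALL>,
x<SUP>2</SUP>, H<SUB>2</SUB>O, &copy;</P>
</BODY>
</HTML>
EOF
```

The page can then be converted in the usual way, for example: htmltotex -o sample32.tex sample32.html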

What is ignored

Assumptions

Troubleshooting

What should you do if you run into an error? Look at the file trouble.html, which summarizes many possible errors and their remedies. If your error is not listed there, contact me.

Known bugs

Conclusion

Use htmltotex. With the exception of a few pathological cases, it should run okay on most files. If you have problems, contact me at abhij@csa.iisc.ernet.in. Have fun!

