2.0 : Getting the best results

A fledgling FAQ for JafSoft text conversion utilities

The latest version of these files is available online at http://www.jafsoft.com/doco/docindex.html

2.1 General

2.1.1 Three words: consistency, consistency, consistency

The software works by analysing your document to determine what "rules" you've used for laying out your file. On the output pass these "rules" (also known as "policies") are used to determine how to categorize each line, and inconsistencies can lead to lines being wrongly treated because they "fail to obey policy".

You can greatly help this analysis by being consistent in your formatting. Many of the decisions the software makes can be overridden by changing the "analysis policies" (see "using policy files"), but if this becomes necessary it can quickly become hard work (if only because you need to familiarize yourself with these policies), so it's better to avoid this if possible.

If you're writing a document with text conversion in mind, pay attention to the following:

- Use of white space (see "white space is your friend"). In general white space can be used to separate paragraphs, tables and diagrams from normal text, and columns of data from each other inside tables.

  The software likes white space :)

- Use of tabs. The software will convert all tabs to spaces on input, assuming that one tab = 8 spaces. This will work fine provided this tab size is correct, or your use of tabs and spaces is consistent. It may not work otherwise, in which case you'll need to tell the software what your tab size is via an analysis policy.

- Use of indentation. The software will calculate the pattern of indentation used in your file, and will output text accordingly. If your use of indentation is inconsistent, then paragraphs will be wrongly broken and headings may not be correctly recognized.

- Use of numbering. The software can spot numbered headings and numbered lists. To avoid confusing the two, the indentation of a given type of heading is tested (although you can disable this test), together with the numbering sequence. The software can tolerate small gaps in numbering, but large gaps will confuse it.

- Use of line lengths. The software will attempt to determine your "page width" and text justification. These are then used to spot short lines (which get a <BR> added) and centred text. The centred text algorithm has problems and so is disabled by default.

  Try to avoid really long lines, or highly variable line lengths. If you don't, the software is liable to insert <BR> where you don't want them, unless you set the "page width" and "short line length" analysis policies to correct this behaviour.

- Avoid confusing the program. Numbered lists inside numbered sections, all at the same level of indentation, are a good example. The numbers become ambiguous and errors start to occur. If you must have this, try to set the numbered list at a small offset to the heading, so that the indentation position will distinguish the two.

2.1.2 Make sure your files are "line-orientated"

The software reads files line-by-line. On the first pass it will analyse the distribution of line lengths to determine the "page width" of your file. This in turn is used to detect certain features such as centred text and "short lines".

Some files, especially those created on a PC, do not include line breaks within paragraphs; instead they only have a single break after each paragraph of text.

Whilst not a problem in itself, it does somewhat handicap the software's ability to analyse the file.

Where possible, you should attempt to save files "with line breaks" to give the software the best chance of understanding how your file is laid out.
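
If you're not sure whether a file has been saved "with line breaks", a quick check along the following lines may help. This is only a rough sketch in Python and has nothing to do with the software's own analysis; the function names, the 100-character limit and the 20% threshold are all my own assumptions.

```python
def long_line_fraction(text, limit=100):
    """Return the fraction of non-blank lines longer than `limit` characters."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    long_lines = sum(1 for line in lines if len(line) > limit)
    return long_lines / len(lines)


def looks_line_orientated(text, limit=100, threshold=0.2):
    """Guess whether the text has real line breaks inside its paragraphs."""
    return long_line_fraction(text, limit) < threshold


# A paragraph saved with line breaks, and two paragraphs saved without.
wrapped = "This paragraph has been saved\nwith line breaks, so each line\nis short.\n"
unwrapped = "word " * 60 + "\n\n" + "word " * 80 + "\n"

print(looks_line_orientated(wrapped))
print(looks_line_orientated(unwrapped))
```

If the check fails for your file, try re-saving it from your editor with line breaks before converting it.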

2.1.3 Make sure your use of tabs is consistent

The software converts all tabs in your source document to spaces on the assumption that one tab equals 8 spaces. In fact, the actual tab size is irrelevant provided your use of tabs and spaces is consistent. If it isn't, you may find tables aren't being analysed correctly.

You can set the actual Tab Size used in your documents via the policy line

Tab Size: n

where n is the number of spaces per tab.
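
To see why mixing tabs and spaces matters, here is a small Python illustration using the standard str.expandtabs method. It is not the software's own code, and the two table rows are made up: the first uses a tab, the second was lined up with spaces by someone whose editor showed tabs as 4 spaces.

```python
# Row 1 separates its columns with a tab; row 2 uses spaces that only
# line up with row 1 if a tab is taken to be 4 spaces wide.
row_with_tab = "ID\tQty"
row_with_spaces = "17  5"

for tab_size in (8, 4):
    expanded = row_with_tab.expandtabs(tab_size)
    print("tab size", tab_size)
    print(expanded)
    print(row_with_spaces)
    # Column at which the second field starts in each row.
    print(expanded.index("Qty"), row_with_spaces.index("5"))
```

With a tab size of 8 the second column starts in different places in the two rows, which is the sort of thing that upsets table analysis. With a tab size of 4 the columns line up again.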

2.1.4 White space is your friend

The software attempts to categorize each line into one of a number of types (e.g. heading, bullet point, part of a table etc).

Often this analysis is influenced by adjacent lines. For example a line of minus signs can be interpreted as "underlining" a heading, or perhaps as part of a table or diagram.

Confusion can occur where different features are close to each other (e.g. an underlined heading immediately followed by a table).

In most cases the ambiguity can be reduced or eliminated by adding 1 or 2 blank lines between the objects being confused.

The same argument applies to table columns. If two columns get merged together, try increasing the "white space" between them by moving them apart.

In almost all situations, adding white space to your document will help reduce the likelihood of analysis errors.

2.1.5 Use a simple numbering system

I've seen documents with section numbers like "Section II-3.b". I'm sorry, but at present the software can't recognise such an exotic numbering system. Equally it can't cope with appendices numbered like A-1 etc. [*]

If possible, change your section numbers to plain numbers (like those in this document), or "underline" all your headings with a row of dashes or equal signs on the next line. The software will understand that much better.

[*]: From version 4 onwards, there is the ability to recognise headings that start with the same word or phrase (such as Chapter, Appendix, Section etc), so this may offer a solution to you.
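
As a rough guide to what counts as "simple", the following Python sketch tests whether a line starts with a plain decimal number such as 2, 2.1 or 2.1.5. The pattern is my own guess at a reasonable definition, not the test the software actually applies.

```python
import re

# Hypothetical pattern: optional indentation, then digits separated by
# dots (with an optional trailing dot), then some text.
SIMPLE_NUMBER = re.compile(r"^\s*\d+(?:\.\d+)*\.?\s+\S")


def has_simple_number(line):
    """Return True if the line begins with a plain decimal section number."""
    return SIMPLE_NUMBER.match(line) is not None


for heading in ("2.1.5 Use a simple numbering system",
                "Section II-3.b Overview",
                "A-1 Appendix"):
    print(has_simple_number(heading), heading)
```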

2.1.6 Save policies into a policy file

The program offers a large number of "policies" to customize the conversion. These policies can be saved in a "policy file", which is simply an ordinary text file (which you may edit by hand if you like).

By saving policies into files, you can reload these files the next time you do a conversion, which means you won't need to adjust all the settings again. You can create multiple policy files for different conversions or conversion types.

Policy files are described at length in the "Policy Manual".

2.1.7 Add preprocessor commands to your source file

The program has its own built-in preprocessor. This allows you to add special "directives" and "tags" into your source file which tell the program to perform special functions. Examples include the addition of include files into the source, the insertion of contents lists, adding hyperlinks to sections and much much more.

An example is the following hyperlink, whereby

[[GOTO Using preprocessor commands]]

is used to provide the link to the named section, such as the one that appears in the next sentence. For more details see "using preprocessor commands".

The preprocessor is described at length in the "Tag Manual".

2.2 Using policy files

2.2.1 Saving "incremental" policies

When you choose to save your policies to file you will be asked whether you want to save "incremental" policies, or "all" policies.

"Incremental" means that only those policies loaded from file or manually adjusted will be written to file. This is recommended as it leaves the program free to make all other adjustments itself.

"All" means that all policies will be written to file. This is useful if you want to document or review the policies used, but it is less useful if you want to reload this policy file, as it will fully constrain the program's behaviour. While this may not be a problem when reconverting the same file, it may well be unsuitable when converting new files.

2.2.2 Editing policy files by hand

Policy files are just text files with a ".pol" extension. If you think of them like the old Windows .ini files you'll get the idea. This has been done deliberately so that these files can be manually edited in a normal text editor.

OpenVMS users actually have no other way of creating policy files, but Windows users can change most (but not all) policies via the GUI. However, I recommend that anyone who comes to regard themselves as a "power" user learns how to edit these files.

The policy file consists of one policy per line, usually in the form

<policy text> : <data value>

e.g.

Document title : Here's my favourite URLs

When entering policy lines you must use the exact <policy text> indicated in the documentation for the policy to be recognized. If I've misspelt anything then tough, you'll have to follow it (but tell me anyway). The one exception to this rule is that I've allowed both British and American spellings of colour/color.

The allowed <data value> will vary from policy to policy. Most policy lines accept a value of "(none)" effectively negating that policy.

The order of lines in the file is largely unimportant. If you're editing a .pol file generated by the program (see "generate a .pol file") then you'll notice section headings of the form

[Hyperlinks]

These are purely decorative. That is, they have no significance, and you can ignore them and move the policy lines around; there's no concept of having to place policy lines in the "right" section.

As new versions of the software are released, policies are moved from one section to another as different groupings expand and appear. As explained above, this usually has no effect on the validity of the .pol file.
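
If you want to review a policy file from a script of your own, the layout described above is easy to read. The following Python sketch is an assumption about how you might do it (it is not the program's parser): it skips blank lines and the decorative section headings, and splits each remaining line at the first colon.

```python
def parse_policy_text(text):
    """Return a dict of {policy text: data value} from the text of a policy file."""
    policies = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank line
        if line.startswith("[") and line.endswith("]"):
            continue  # decorative section heading such as [Hyperlinks]
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not in the usual "<policy text> : <data value>" form
        # Only the first colon separates name from value, so values that
        # contain colons themselves are kept whole.
        policies[name.strip()] = value.strip()
    return policies


sample = """\
Tab Size : 4

[Hyperlinks]
Document title : Here's my favourite URLs
"""

for name, value in parse_policy_text(sample).items():
    print(name, "=>", value)
```

Note that the sketch keeps the policy text exactly as written, since the program needs the exact text to recognize a policy.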

2.2.3 Using include files in policy files

Policy files may include other policy files as follows

include file : ..\policies\Other_policy_file.pol

This can be useful if you have multiple policy files but want certain features to be the same. For example I use this to introduce the same link dictionary commands into all my policy files. You could equally put all your colour policies into one file.

The "include file" line will have to be manually edited into the .pol file using a text editor; there is currently no support for setting this via the program itself.

NOTE: If you "save" a policy file that has been loaded, then the include file structure will be lost, and all the policies will be output into a single file.
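
The effect of loading and then saving such a file can be pictured as "flattening" the includes. The Python sketch below illustrates the idea only and is not taken from the program. The file names are made up, the files are held in memory rather than read from disk, and it makes no attempt to resolve relative paths.

```python
def flatten_policy_lines(name, read_lines, stack=()):
    """Return the lines of policy file `name` with "include file" lines expanded.

    `read_lines` is any callable that maps a file name to a list of lines.
    """
    if name in stack:
        raise ValueError("circular include: " + name)
    flat = []
    for line in read_lines(name):
        key, sep, value = line.partition(":")
        if sep and key.strip() == "include file":
            # Replace the include line with the contents of the named file.
            flat.extend(flatten_policy_lines(value.strip(), read_lines, stack + (name,)))
        else:
            flat.append(line)
    return flat


# Two made-up policy files held in memory, keyed by file name.
files = {
    "main.pol": ["Document title : My page", "include file : colours.pol"],
    "colours.pol": ["background colour : #FF0000", "text colour : White"],
}

for line in flatten_policy_lines("main.pol", files.__getitem__):
    print(line)
```

If you rely on include files, keep a copy of the hand-edited originals so that a "save" doesn't lose the structure.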

2.2.4 Using a default policy

You can make the program use the same policies by default each time it runs. To do this select the policies you want, and then save these to a policy file.

Next select the Settings->Use of Policy Files menu option. Check the "Use a default" flag, and select the file you just created.

Next time you run the program these policies will be loaded and used for your conversions. Note, you can still reset the policies or load a different file using the options on the Conversion options menu.

To stop using a default just clear the "Use a default" flag (you don't need to clear the policy file name).

2.3 Using preprocessor commands

2.3.1 What is the preprocessor?

The program has a built-in preprocessor. This will recognize special commands inserted into the source file. These commands can be used to correct analysis errors (e.g. to correctly delimit a table), or to add to the output. For example the TIMESTAMP tag can cause the text

"this document was converted on [[TIMESTAMP]]"

to be output as

"this document was converted on 22-Feb-2004".

Preprocessor commands are of two types:

Directives. These begin with "$_$_" and must be on a line by themselves with the "$_$_" being at the start of the line (i.e. there can be no leading spaces).

Tags. These take the form [[TAG <data...>]] and may occur anywhere within your text, but cannot be split over two lines.

Some commands may be expressed as either directives or tags. A "Tag Manual" is also available.
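
The two forms can be told apart mechanically. The following Python sketch shows one way of doing so, based purely on the rules stated above; the function name and the regular expression are my own assumptions and not the preprocessor's actual code.

```python
import re

DIRECTIVE_PREFIX = "$_$_"
# A tag is [[NAME optional data]] and must open and close on the same line.
TAG = re.compile(r"\[\[(\w+)(?:\s+(.*?))?\]\]")


def find_commands(line):
    """Return a list of (kind, name, data) tuples found on one source line."""
    if line.startswith(DIRECTIVE_PREFIX):
        # Directives must start in the first column, with no leading spaces.
        parts = line[len(DIRECTIVE_PREFIX):].split(None, 1)
        name = parts[0] if parts else ""
        data = parts[1].strip() if len(parts) > 1 else ""
        return [("directive", name, data)]
    return [("tag", m.group(1), m.group(2) or "") for m in TAG.finditer(line)]


print(find_commands("$_$_BEGIN_TABLE"))
print(find_commands("this document was converted on [[TIMESTAMP]]"))
print(find_commands("    $_$_BEGIN_TABLE"))
```

The last call finds nothing, because an indented "$_$_" line doesn't count as a directive.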

2.3.2 Delimiting tables, diagrams etc

The program will attempt to detect tables and diagrams, but sometimes it gets the wrong range for the table, and also diagrams may be interpreted as tables and vice versa.

To correct such mistakes, you can bracket the source lines as follows :-

        $_$_BEGIN_TABLE
        ...
        $_$_END_TABLE

        $_$_BEGIN_DIAGRAM
        ...
        $_$_END_DIAGRAM
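
It's easy to forget an END line or to indent a directive by accident, so you may want to check your source file before converting it. Here is a minimal Python sketch of such a check. It is entirely my own, and it assumes that blocks are not nested inside each other, which I haven't confirmed against the program.

```python
BEGIN = "$_$_BEGIN_"
END = "$_$_END_"


def check_blocks(lines):
    """Return a list of (line number, problem) pairs for BEGIN/END directives."""
    problems = []
    open_kind = None
    open_line = 0
    for number, line in enumerate(lines, start=1):
        if line.lstrip().startswith("$_$_") and not line.startswith("$_$_"):
            problems.append((number, "directive is indented"))
            continue
        if line.startswith(BEGIN):
            if open_kind is not None:
                problems.append((number, "previous block not closed"))
            open_kind = line.split()[0][len(BEGIN):]
            open_line = number
        elif line.startswith(END):
            kind = line.split()[0][len(END):]
            if kind != open_kind:
                problems.append((number, "END without matching BEGIN"))
            open_kind = None
    if open_kind is not None:
        problems.append((open_line, "block never closed"))
    return problems


source = ["$_$_BEGIN_TABLE", "Name   Qty", "Nuts     5", "$_$_END_TABLE"]
print(check_blocks(source))
```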

2.3.3 How do I add my own HTML to the file?

You can embed raw HTML in your text file in one of three ways using the preprocessor

Insert a single line of HTML as follows

$_$_HTML_LINE <whatever you want inserted>

The HTML_LINE and its arguments must all be on one line.

Insert an HTML tag as follows

[[HTML <whatever you want>]]

The HTML tag must all be on one line.

Insert a section of HTML between two directive lines

        $_$_BEGIN_HTML
        ...
        lines of HTML, e.g. custom artwork or tables
        ...
        $_$_END_HTML

For example, to enter an anchor point in your text so that you can link to it, try

      $_$_HTML_LINE <A NAME="whatever"> </A>

To embed an image with a hyperlink you might try

      $_$_BEGIN_HTML
      <A HREF="URL"><IMG SRC="../pics/a2hdoco2.jpg" WIDTH=160
      HEIGHT=160 BORDER=0 ALT="AscToHTM home page" ALIGN=RIGHT
      VALIGN=TOP></A>
      $_$_END_HTML

The "$_$_" has to be at the beginning of the line, i.e. not indented as I've shown above. If you look at the program's HTML documentation, and the text used to create it, you'll see examples of this and other preprocessor commands. Indeed if you look at the source file for this document you'll see that's exactly how the image on the right was added to this document.

Future versions of the software will introduce in-line tagging so you can place LINKPOINTs anywhere in your text. Check your program's documentation for details.

2.3.4 Using standard include files

The preprocessor command INCLUDE can be used to include standard pieces of text into your source files. For example

$_$_INCLUDE ..\data\footer.inc

will include the file "footer.inc" into your source file at this location. Note that the path given must be correct relative to the source file being converted.

The contents of the include file simply get "read into" the source. As such they get included in the analysis of the whole document.

Include files can be useful for adding standard disclaimers or navigation bars to all your pages. For example you could embed HTML to link back to your home page (see "how do I add my own HTML to the file?").

Of course the same effect could be achieved by using an HTML footer file (see "adding headers and footers") or by defining an "HTML fragment" called HTML_FOOTER (see "customizing the HTML created by the software").

2.3.5 Adding Title, keywords etc

If you want to add a title, keywords and a description to your HTML you can do this by embedding special commands in the source file as follows

        $_$_TITLE This is the title of my HTML page
        $_$_DESCRIPTION This page is a wonderful page that everyone should visit
        $_$_KEYWORDS wonderful, web, page, full, of keywords, that
        $_$_KEYWORDS everyone, will, want, to search, for

The "$_$_" must be the first characters on the line. You can spread the keywords and description over several lines by adding extra $_$_KEYWORDS and $_$_DESCRIPTION lines.

NOTE: Most of these commands have equivalent policies, allowing you to set title etc through an external policy file should you prefer.

2.3.6 Adjusting policies for individual files or parts of files

You can, if you wish, create one policy file for each file being converted; however, this is liable to become a maintenance nightmare.

If you don't want to maintain multiple policy files, or if you simply want to adjust a few policies for a given source file, you can use the $_$_CHANGE_POLICY command.

The effect will vary according to the type and position of the command. Some policies will affect the whole document, others will only affect the document from that point onwards. It depends on the nature of the particular policy. See the "Policy Manual" for details.

For example placing

$_$_CHANGE_POLICY background colour #FF0000

$_$_CHANGE_POLICY text colour White

will change the document background colour to be red, and the text to be white throughout the whole document.

2.4 Making the program run faster

You can make the program run faster in a number of ways by disabling features that you know you don't want.

2.4.1 Review the "look for" options

As of V3.1, AscToHTM has a number of "look for" options, stating what the program is looking for. Disable the ones you don't want, although most of them will not make a major difference to the program speed.

2.4.2 Don't convert URLs

Probably the single most expensive function is the search for URLs to convert into hyperlinks. Every word (and every word fragment) has to be checked individually. The problem isn't helped by having to distinguish URLs with commas in them from comma separated lists of URLs.

If you know your document has no URLs to be converted, disable this feature and watch the software run 10-20% faster. Bear in mind, though, that URL conversion is one of the features people like most.

2.4.3 Don't generate tables

The software will attempt to convert regions of pre-formatted text into tables. This can take a lot of analysis even if eventually it decides "it's not a table after all!".

This analysis only takes place if the program detects pre-formatted text, so you should only disable this feature if your pre-formatted text is largely non-tabular. If that's the case you probably want to disable it anyway, as the tables created may be inappropriate.
