Documentation for the AscToPDF conversion utility : Using policy files

The latest version of these files is available online at http://www.jafsoft.com/doco/docindex.html

Using policy files

Document policies have two main uses: to correct any failure of analysis that AscToPDF makes, and to tell the program how to produce better PDF in ways that couldn't possibly be inferred from the original text.

Examples of the former include specifying a nominal page width and stating whether or not underlined section headings are expected.

Examples of the latter include adding colour and titles to the page, as well as requesting that a large document is split into several pages.

Contents of this section

What are Policy files?
Analysis policies

'What to look for' policies
General analysis policies
Bullet policies
Contents policies
File Structure policies
Headings policies
Pre-formatted text policies

What are Policy files?

AscToPDF has a large number of options available to influence the analysis of your text files, and the output to PDF. These options are called "policies" as they govern how the source file should be interpreted and converted.

Policies may be saved in text files, known as policy files. These files have a ".pol" extension by default. The policy files are usually updated by changing the policies and saving the changes in a new file. Because they are text files you can also edit them directly in a text editor. The files contain one policy per line of text, in the form

PolicyText : <policy value>

The use of policy files allows a given set of options to be saved and reused for other conversions, or later conversions of the same file. See Using policy files for more information.
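As an illustration of the one-policy-per-line format described above, the following Python sketch reads policy text into a list of name and value pairs. It is not part of AscToPDF. The function name parse_policy_text is hypothetical, and the "Page width" spelling in the sample is an assumption; "Bullet Char" is the only policy file name confirmed by this documentation.

```python
def parse_policy_text(text):
    """Parse 'PolicyText : <policy value>' lines into (name, value) pairs.

    A list is used rather than a dictionary because some policies
    (for example "Bullet Char") may appear on more than one line.
    """
    policies = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or ":" not in line:
            continue  # skip blank or unrecognised lines
        # Split on the first colon only, so values may contain colons.
        name, _, value = line.partition(":")
        policies.append((name.strip(), value.strip()))
    return policies


# Sample policy text; the "Page width" spelling is an assumption.
sample = """\
Page width : 80
Bullet Char : *
Bullet Char : +
"""
print(parse_policy_text(sample))
```

Keeping the pairs in file order makes it easy to write the same policies back out unchanged after editing one value.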

Analysis policies

Analysis policies are usually calculated by AscToPDF by making a first pass through your document. The resulting policies are then used during the second, conversion pass to categorise all input lines so that they may be correctly converted to PDF.

You should only need to change these policies should the analysis fail.

'What to look for' policies

General Analysis

Bullets

File generation

Headings Policies

Pre-formatted text

'What to look for' policies

These policies act as "broad stroke" policies enabling or disabling areas of functionality within the software by telling it what to look for and to try to detect.

For example you can tell the program whether or not to bother looking for patterns of indentation, bullets, or numbered lists. In many cases if you enable a policy you can further fine tune the conversion details on other policy sheets.

Look for indentation

Look for paragraphs

Look for short lines

Look for bullets and numbered lists

Look for mail and USENET headers

Look for regions of preformatted text

NOTE: Some options on this screen are grayed out. These options are supported by other JafSoft conversion utilities and it is hoped to extend this support to AscToPDF as development allows.

Look for indentation

AscToPDF can attempt to detect the indentation pattern of your document and replicate it in the output file. If you choose to disable this policy, all your text will be output with no indentation at all.

If the program is wrongly indenting your files, you can try adjusting the pattern of indentation on the General Analysis tabbed policy sheet.

Look for paragraphs

By default AscToPDF will attempt to look for paragraphs in your source. Usually this is signaled by a blank line between paragraphs, a leading indent on the first line of each paragraph, or (in extreme cases) a short line at the end of a paragraph.

If you don't want AscToPDF to detect paragraphs, disable this policy.

If AscToPDF is wrongly detecting paragraphs, try adjusting the paragraph analysis policies on the General Analysis tabbed policy sheet.

Look for short lines

By default AscToPDF will attempt to detect short lines and preserve their structure by adding a line break. Disabling this will cause short lines to be merged into the surrounding paragraph's text.

If AscToPDF is wrongly handling your short lines, you can adjust the short line cutoff point or the page width (which is used in short line detection) in the Sizes section of the General Analysis tabbed policy sheet.

Look for bullets and numbered lists

By default AscToPDF will try to detect bullet points and numbered lists. This can sometimes go wrong if you have lines that look to the program like bullet points.

You can disable this behaviour should you wish. Alternatively you can fine tune the detection of bullets on the bullet analysis tabbed policy sheet.

Look for mail and USENET headers

AscToPDF will try to look for email and USENET headers. Where these are recognised they can be simplified so that only the To, From and Subject lines are shown in the output.

You can disable this behaviour should you wish.

Look for preformatted text

By default AscToPDF will try to identify regions of preformatted text. Once identified AscToPDF will try to decide if it's a diagram, table or some other form of preformatted text. If it thinks it's a table it will attempt to place the text in an appropriate table structure.

You can disable the search for preformatted text, or if you allow preformatted text, disable table generation. (This may be appropriate if you have a large number of ASCII diagrams in your text).

The search for preformatted text can be refined via the Pre-formatted text tabbed policy sheets.

General analysis policies

These policies aid AscToPDF's analysis by describing in detail the contents of the document being converted.

Sizes

Page Width

TAB Size

Short line length

Min Chapter Size

Paragraphs

Blank lines between paragraphs

New paragraph offset

Layout

Indentation levels

Page Width

This indicates the width (in characters) of your nominal output page. This width is calculated from the observed line lengths in the original document.

This width is used in short line calculation, and determining whether a given line contains a definition term or not (definition character near the start of the line).

In documents that contain line feeds this should be automatically detected.

In other documents you may need to set this manually.

TAB size

This indicates the size (in characters) of your tabs. AscToPDF converts all tabs to spaces before analysis. By default a tab size of 8 characters is assumed.

The tab size can influence the analysis of paragraph indentations and other layout. Provided they are used consistently there shouldn't be a problem. However where tabs and spaces are used in combination, mistakes can arise.

This is particularly true in tables of data. AscToPDF does not expect tab-separated table cells, instead converting the tabs to spaces and analysing the results.

If your source document was created in an editor that uses a different tab size, change this value if you start to experience strange layout conversion problems.
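The effect of the tab size on indentation analysis can be seen with Python's built-in str.expandtabs. This is only an illustration of the tab-to-space conversion described above, not AscToPDF's own code; the helper name indent_width is hypothetical.

```python
def indent_width(line, tab_size=8):
    """Return the indentation of a line after tabs are expanded to spaces."""
    expanded = line.expandtabs(tab_size)
    return len(expanded) - len(expanded.lstrip(" "))


# One line indented with a tab, one with eight spaces.
lines = ["\tfirst", "        second"]

print([indent_width(line, 8) for line in lines])  # [8, 8]
print([indent_width(line, 4) for line in lines])  # [4, 8]
```

With a tab size of 8 the two lines appear at the same indentation, but with a tab size of 4 they appear at two different levels, which is the kind of mismatch that can confuse layout analysis.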

Short Line Length

This policy is used to determine what is a "short line". Short lines are treated specially by AscToPDF by adding a paragraph marker on the end. They can also be used to detect ends of paragraphs in those documents that don't have blank lines between paragraphs.

Normally AscToPDF will determine whether or not a line is short by comparing it to the page width, given the current context.

The default value is 0 characters (indicating a comparison to Page Width should be used). Set this to any value you like. A value of 80 is likely to make every line in your original document have a paragraph marker on the end.
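The interaction between Short Line Length and Page Width might be sketched as follows. The exact comparison AscToPDF makes against the page width is not documented here, so the fraction parameter (half the page width) is purely an assumption, as is the function name is_short_line.

```python
def is_short_line(line, page_width, short_line_length=0, fraction=0.5):
    """Decide whether a line counts as "short".

    short_line_length == 0 means "compare against the page width".
    The fraction of the page width used in that case is an assumed
    value, not AscToPDF's real threshold.
    """
    length = len(line.rstrip())
    if length == 0:
        return False  # blank lines are not short lines
    if short_line_length > 0:
        return length < short_line_length
    return length < page_width * fraction


print(is_short_line("Dear Sir,", page_width=80))
print(is_short_line("x" * 70, page_width=80))
print(is_short_line("x" * 70, page_width=80, short_line_length=80))
```

The last call shows why a value of 80 tends to mark almost every line as short when the page itself is about 80 characters wide.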

Min Chapter Size

This policy tells AscToPDF what the smallest chapter size may be. This is used when trying to determine if a numbered line is a chapter heading. AscToPDF tries to avoid treating numbered lists as a series of small chapters using this policy.

The default value is 8 lines. Change this only if you suspect small chapters are being ignored, or large list items are being treated as chapter headings.
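One simple way to picture this policy is to reject any candidate heading that is followed too closely by the next candidate. The sketch below is an assumed simplification for illustration only; AscToPDF's real heading analysis is more involved, and the function name filter_chapter_headings is hypothetical.

```python
def filter_chapter_headings(candidate_lines, total_lines, min_chapter_size=8):
    """Keep only candidate headings that start a large enough "chapter".

    candidate_lines: sorted 0-based indexes of numbered lines.
    total_lines: number of lines in the document.
    """
    kept = []
    for i, start in enumerate(candidate_lines):
        if i + 1 < len(candidate_lines):
            end = candidate_lines[i + 1]
        else:
            end = total_lines
        # A chapter runs from its heading to the next candidate heading.
        if end - start >= min_chapter_size:
            kept.append(start)
    return kept


# Numbered lines at 0, 2 and 4 are close together (probably a list);
# those at 4 and 20 are followed by enough text to be chapters.
print(filter_chapter_headings([0, 2, 4, 20], total_lines=40))
```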

Blank Lines between paragraphs

AscToPDF can detect whether or not it should expect blank lines between paragraphs. Documents without blank lines between paragraphs will be harder to convert, and errors are more likely. Unfortunately text documents exported from Word for Windows often have this property.

Where there are no blank lines, AscToPDF relies on spotting the last line of a paragraph (usually shorter), and (in some documents) the presence of a hanging indent at the start of each new paragraph.

This should be automatically detected.

New Paragraph Offset

Some documents start the first line of a new paragraph with an offset of a number of characters. This is especially true in text files saved from Word for Windows documents.

AscToPDF can sometimes confuse such paragraphs as being two different levels of indentation. Use this policy to eliminate such confusion.

This should be automatically detected.

Indent position(s)

AscToPDF recognises multiple levels of indentation. This policy shows the character levels at which indentation has been detected.

AscToPDF converts all tab characters into multiple spaces on input. These indentation positions are the positions that result after that conversion. Depending on your tab settings these might not be exactly the positions you would expect.

Normally these levels are correctly detected automatically, but should you wish to set them manually you may need to experiment slightly to see how AscToPDF has handled your tabs.

Bullet policies

AscToPDF should be able to detect the use of bullets on a reasonably sized document. These policies describe the type of bullets expected.

Automatically detect bullets and numbered lists

Expected Bullet types

numbered bullets

alphabetic bullets

roman numeral bullets

Bullet characters

recognize hyphen character as a bullet point

recognize an "o" character as a bullet point

Other bullet point characters

Look for bullets

This policy states whether or not the program should attempt to automatically detect bullets and numbered lists. This should normally be left on unless your document has no such features, but the program (wrongly) thinks it has.

This policy appears on the Bullets dialog as "Automatically detect bullets and numbered lists", but is identical to the "Look for bullets" policy on the 'What to look for' policies tabbed property sheet.

Expect Numbered bullets

This policy states whether or not numbered bullet points are expected. The numbered bullets can be followed by any punctuation, thus 1., 2) and (3) will all be recognised, but PDF will not necessarily support this in the markup produced.

This should be automatically detected.

Expect alphabetic bullets

This policy states whether or not alphabetic bullet points are expected. The alphabetic bullets can be followed by any punctuation, thus a., b) and (c) will all be recognised, but PDF will not necessarily support this in the markup produced.

Both upper and lower case bullets are recognised (and supported in the markup).

This should be automatically detected.

Expect roman numeral bullets

This policy states whether or not roman numeral bullet points are expected. The roman numeral bullets can be followed by any punctuation, thus i., ii) and (iii) will all be recognised, but PDF will not necessarily support this in the markup produced.

Both upper and lower case bullets are recognised (and supported in the markup), although the range of roman numeral values supported is limited.

This should be automatically detected.
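The three bullet styles described above can be approximated with regular expressions. The patterns below are assumptions made for illustration (for example, only the numerals built from i, v and x are treated as roman); they are not the rules AscToPDF actually applies.

```python
import re

# Optional opening bracket, the bullet itself, then "." or ")" and a space.
NUMBERED = re.compile(r"^\s*\(?(\d+)[.)]\s+")
ROMAN = re.compile(r"^\s*\(?([ivxIVX]+)[.)]\s+")
ALPHABETIC = re.compile(r"^\s*\(?([A-Za-z])[.)]\s+")


def bullet_type(line):
    """Classify a line as a numbered, roman or alphabetic bullet, or None."""
    if NUMBERED.match(line):
        return "numbered"
    # Roman is tested before alphabetic, so "i." and "v." count as roman.
    if ROMAN.match(line):
        return "roman"
    if ALPHABETIC.match(line):
        return "alphabetic"
    return None


for text in ["1. First", "2) Second", "(3) Third", "a. Apples",
             "(c) Cherries", "ii) Second point", "Plain text"]:
    print(repr(text), bullet_type(text))
```

The overlap between single letters and roman numerals ("i.", "v.", "x.") shows why the expected bullet types matter: without knowing which style a document uses, such lines are ambiguous.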

Recognize '-' as a bullet

This policy states whether or not bullet points starting with the hyphen character '-' are expected.

This policy appears on-screen as "Recognize hyphen character as a bullet point".

This should be automatically detected.

Recognize 'o' as a bullet

This policy states whether or not bullet points starting with the lower case 'o' are expected.

This policy appears on-screen as "Recognize 'o' character as a bullet point".

This should be automatically detected.

Other bullet point characters

This policy lists any other characters that are to be recognised as bullet characters.

Each bullet character entered will appear in the policy file as its own "Bullet Char" line.

This should be automatically detected, but may sometimes need to be manually entered.

Contents policies

This dialog shows both analysis and output policies connected with contents list detection and generation.

Analysis

Expect contents list

Expect contents list

This policy specifies whether or not the document already contains a contents list. If it does, AscToPDF will attempt to convert the existing list into a series of hyperlinks.

This should be detected automatically, but occasionally you will need to set this policy manually.

See the discussion on contents list generation in the Documentation available

File Structure policies

These policies aid AscToPDF's analysis by describing some of the file structure that would affect the analysis.

Expect only a simple layout

Expected File contents

Expect "C"-code samples

Contains DOS characters

Contains PCL printer codes

Contains non-European (e.g. Japanese) characters

Contains mime-encoded quotable characters

File has change bars

File has Page markers

Page marker size (in lines)

Text Attributes

Text justification

File is double spaced

Text to ignore

Number of lines to ignore at start of document

Number of lines to ignore at end of document

Keep it simple

AscToPDF puts a lot of effort into detecting overall structure such as headings etc.

In documents that don't have any such structure, AscToPDF is liable to convert any line with a number at the start into a heading.

To prevent this, you can mark the document as simple, that is with no global structure. In a simple document AscToPDF will attempt far less analysis.

This policy appears on-screen as "Expect only a simple layout".

AscToPDF attempts to automatically identify simple documents, but you may still need to set this policy manually.

Expect Code samples

AscToPDF can mark up C-like code fragments in <PRE>...</PRE> tags to preserve the layout and readability of the quoted code.

This may be automatically detected, but occasionally needs to be manually corrected.

Input file contains DOS characters

AscToPDF can convert files that use the DOS (OEM) character set. By default the file is assumed to be in the ANSI character set, but some files may have originated under DOS.

This may be automatically detected, but usually needs to be manually set.

Input file contains PCL codes

Indicates that the input file contains PCL printer codes. When set, the program will make whatever sensible use it can of these codes, otherwise they will be removed.

Please note that the PCL printer codes offer a rich command language that may be used to drive graphical printers. As such the emulation possibilities in a text converter are limited, and it is quite likely that files that make heavy use of such codes will fail dramatically to convert.

That said, those codes that are not recognised will be eliminated from the output.

Input file contains Japanese characters

*** not implemented yet ***

Files using non-ASCII character sets (Japanese, Korean etc) will be incorrectly converted. This may be fixed (as far as possible) in later versions.

Appears on-screen as "Contains non-European (e.g. Japanese) characters"

Input file contains MIME encoding

AscToPDF can convert mime-encoded quotable characters. These will usually appear in files that were originally part of an email message. Such files use the "=" character to escape special characters. So for example "=20" should be interpreted as a space.

This appears on-screen as "Contains mime-encoded quotable characters"

This may be automatically detected in files where the "=" is used to break up long lines, but more usually you will need to manually set this.
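The quoted-printable escapes described above can be decoded with Python's standard quopri module, which gives a feel for what this policy undoes. This illustrates the encoding only and says nothing about how AscToPDF implements it; the sample text is invented.

```python
import quopri

# "=20" is an escaped space; "=" at the end of a line is a soft line
# break, used to split a long line without adding a real newline.
encoded = b"Hello=20World, this line is bro=\nken"

decoded = quopri.decodestring(encoded)
print(decoded.decode("ascii"))  # Hello World, this line is broken
```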

Input file has change bars

AscToPDF can strip out change bars in documents that contain them. Change bars are usually a vertical bar '|' placed in the leftmost or rightmost column.

Currently this is not automatically detected, and so will need to be manually switched on.
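A minimal sketch of change bar removal is shown below. It assumes that a bar in the rightmost column is simply the last non-blank character on the line, and it would also damage table borders drawn with '|', which may be one reason the policy has to be switched on manually. The function name strip_change_bars is hypothetical.

```python
def strip_change_bars(lines):
    """Remove '|' change bars from the leftmost or rightmost column."""
    cleaned = []
    for line in lines:
        if line.startswith("|"):
            # Replace the bar with a space so the layout is preserved.
            line = " " + line[1:]
        stripped = line.rstrip()
        if stripped.endswith("|"):
            # Drop the trailing bar and the padding in front of it.
            line = stripped[:-1].rstrip()
        cleaned.append(line)
    return cleaned


sample = ["| changed line", "unchanged line", "edited text            |"]
print(strip_change_bars(sample))
```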

Input file has page markers

AscToPDF has a limited ability to remove page markers. These are normally a few lines following a form feed (FF) character, containing page numbers etc. This will commonly occur with files generated from older software packages.

Page marker size (in lines)

The number of lines after each form feed (FF) that should be ignored. These lines will not be copied to the output.
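The combined effect of the two page marker policies could look something like the sketch below. It assumes the form feed appears at the start of a line and that this line counts as the first marker line; the documentation does not say which convention AscToPDF uses. The function name remove_page_markers is hypothetical.

```python
def remove_page_markers(lines, marker_size):
    """Drop marker_size lines starting at each line that begins with FF."""
    kept = []
    to_skip = 0
    for line in lines:
        if line.startswith("\f"):
            to_skip = marker_size  # start skipping with this line
        if to_skip > 0:
            to_skip -= 1
            continue
        kept.append(line)
    return kept


page_text = [
    "end of page one",
    "\fReport Title        Page 2",
    "",
    "start of page two",
]
print(remove_page_markers(page_text, marker_size=2))
```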

Text Justification

AscToPDF recognises documents that are left justified (default), right justified, centred or both left and right justified (confusingly known as "justified").

The program cannot currently mark up the text in a matching style, but this policy is important in the analysis. For example "justified" documents are padded with extra white space which could be interpreted as pre-formatted text were the document not recognised as being justified.

Normally this policy is correctly detected automatically.

Input file is double spaced

AscToPDF will normally treat a blank line as a break between paragraphs. Some files have extra CR/LF characters (usually if they've come from a different computer, or from a printer package). In such cases AscToPDF will see every second line as blank, and this will affect the analysis, usually by turning each line of data into a separate paragraph.

If you have such a file, use this policy to mark the file as double spaced to get better results.
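The idea of a double spaced file can be sketched as a test for "every second line is blank". This is a deliberately strict, assumed definition for illustration; real files may need a more tolerant check, and the function names are hypothetical.

```python
def is_double_spaced(lines):
    """Return True if every second line is blank and the others are not."""
    if len(lines) < 4:
        return False  # too little text to judge
    others_have_text = all(line.strip() for line in lines[0::2])
    every_second_blank = all(not line.strip() for line in lines[1::2])
    return others_have_text and every_second_blank


def collapse_double_spacing(lines):
    """Drop the extra blank lines from a double spaced file."""
    return lines[0::2] if is_double_spaced(lines) else lines


spaced = ["first line", "", "second line", "", "third line", ""]
print(collapse_double_spacing(spaced))
```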

Lines to ignore at start of file

This specifies how many lines from the input file should be ignored at the start of the file. These lines will be discarded from the output.

This can be useful when converting a file copied from a news feed or similar source that adds a small data header to the file.

Lines to ignore at end of file

This specifies how many lines from the input file should be ignored at the end of the file. Up to 40 lines may be ignored in this way. These lines will be discarded from the output.

This can be useful when converting a file copied from a news feed or similar source that adds a small data footer to the file.
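The two "lines to ignore" policies amount to trimming the list of input lines before conversion. The sketch below illustrates this, including the limit of 40 lines at the end of the file mentioned above; the function name trim_lines is hypothetical.

```python
MAX_IGNORE_AT_END = 40  # limit stated in this documentation


def trim_lines(lines, ignore_start=0, ignore_end=0):
    """Discard lines from the start and end of the input."""
    ignore_end = min(ignore_end, MAX_IGNORE_AT_END)
    end = len(lines) - ignore_end
    # max() guards against the two ranges overlapping in a short file.
    return lines[ignore_start:max(end, ignore_start)]


feed = ["feed header", "feed date", "body text", "feed footer"]
print(trim_lines(feed, ignore_start=2, ignore_end=1))  # ['body text']
```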

Headings policies

These policies determine the headings structure that the document is expected to have. Normally these are calculated correctly by AscToPDF, but due to the complexity of heading detection, you may sometimes need to correct the analysis.

At the top of the dialog you can specify what type of headings you expect to see. Any combination is allowed, although usually documents use just one type of heading.

Expect Numbered headings

Expect Underlined headings

Expect Capitalised headings

Expect Embedded headings

Heading Key phrases

Use first line as heading

Center first heading

Check indentations of headings are consistent

If numbered headings are expected, it may be possible to expect headings at multiple levels, and to also expect a contents list. Each level of heading will have its own set of policies which are shown on this dialog. The policies are shown in text form, but are edited via the heading details dialog.

Note: This area of functionality is continually under review.

See also the discussion in detecting headings and section titles.

Expect numbered headings

This policy specifies whether or not numbered headings are expected in the document.

Numbered headings may be found at multiple levels, and their details may be edited via the heading details dialog.

This should be calculated correctly by AscToPDF, but the analysis is prone to error, getting confused by numbered bullets and the like. In such cases you may need to set this policy manually.

Expect underlined headings

This policy specifies whether or not underlined headings are expected. Note, where the headings themselves are numbered, the underlining will be taken into account, and you should set the expect numbered headings policy instead.

AscToPDF uses the character in the underlining to determine the heading level, thus text underlined with equals signs is given prominence over text with single underline characters such as minus signs, tildes or underscores.
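The way an underline character can imply a heading level might be sketched as follows. The minimum underline length of three characters and the specific level numbers (1 for equals signs, 2 for the other characters) are assumptions made for illustration; the documentation only states that equals signs are given more prominence.

```python
UNDERLINE_CHARS = "=-~_"


def underlined_heading(text_line, next_line):
    """Return (heading_text, level) if next_line underlines text_line."""
    title = text_line.strip()
    underline = next_line.strip()
    if not title or len(underline) < 3:
        return None
    char = underline[0]
    # The underline must be one permitted character repeated throughout.
    if char not in UNDERLINE_CHARS or underline != char * len(underline):
        return None
    level = 1 if char == "=" else 2  # assumed level numbers
    return title, level


print(underlined_heading("Introduction", "============"))
print(underlined_heading("Details", "-------"))
print(underlined_heading("Some text", "more text"))
```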

Expect capitalised headings

This policy specifies whether or not CAPITALISED headings are expected. Note, where the headings themselves are numbered, this policy need not be set, and you should set the expect numbered headings policy instead.

Expect Embedded headings

This policy specifies whether or not "embedded" headings are expected, i.e. the heading is "embedded" in the first paragraph. Such headings are expected to be a complete sentence or phrase in UPPER CASE at the start of a paragraph.

At present such headings are not auto-detected, so you need to switch this policy on manually.

Heading Key phrases

If specified, then any line that begins with one of the key phrases will be regarded as a heading. The syntax is

<details>, <details>...

where each set of details is

<details> = <phrases>, [<heading_level>]

and

<phrases> = <phrase_1> [|<phrase_2>]

That is, each set of <details> can optionally specify a <heading_level>. If omitted this will default to 1,2,3 for the first, second, third set of details etc. Note, this is a logical heading level, and will be apparent in the contents list.

Each set of <details> must supply a set of <phrases>, and each set of phrases must have at least one phrase, with extra phrases added if wanted, separated by vertical bars.

So for example

Part, Chapter, Section

would treat lines beginning with the words "Part", "Chapter" and "Section" as level 1,2, and 3 headings.

The key phrases are case-sensitive in order to reduce the likelihood of false matches with lines that just happen to have these phrases at the start of the line. So

PART|Part, Chapter, Section

would allow either "PART" or "Part" to be matched. Similarly,

"PART|Part,1" , "Chapter,2" , "Section,2"

would make lines beginning with "PART" or "Part" level-1 headings, while both "Chapter" and "Section" would become level 2. This would be the same as

"PART|Part,1" , "Chapter|Section,2"

Note, spaces may form part of a match phrase, but because of their use in the syntax, commas and vertical bars may not.

If false matches occur, (e.g. the word "Part" appears in the body of the text) edit the source text so that the offending word is no longer at the start of the line.
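The key phrase syntax above can be illustrated with a small parser. This is a reading of the syntax based only on the examples given here (quoted items when levels are supplied, plain comma-separated items otherwise); it is not AscToPDF's own parser, and the function names are hypothetical.

```python
import re


def parse_key_phrases(spec):
    """Turn a key phrase policy value into a list of (phrases, level)."""
    quoted = re.findall(r'"([^"]*)"', spec)
    # With quotes, each quoted item is one set of details;
    # without them, each comma-separated item is a set of phrases.
    items = quoted if quoted else spec.split(",")
    result = []
    for position, item in enumerate(items, start=1):
        phrases_part, sep, level_part = item.rpartition(",")
        if sep and level_part.strip().isdigit():
            level = int(level_part)
        else:
            # No explicit level: default to 1, 2, 3 ... by position.
            phrases_part, level = item, position
        phrases = [phrase.strip() for phrase in phrases_part.split("|")]
        result.append((phrases, level))
    return result


def heading_level(line, parsed):
    """Return the level of the first key phrase the line begins with."""
    for phrases, level in parsed:
        # str.startswith is case-sensitive, as the policy requires.
        if any(line.startswith(phrase) for phrase in phrases):
            return level
    return None


simple = parse_key_phrases("PART|Part, Chapter, Section")
levelled = parse_key_phrases('"PART|Part,1" , "Chapter,2" , "Section,2"')
print(simple)
print(levelled)
print(heading_level("Section 4: Results", levelled))  # 2
print(heading_level("chapter one", simple))           # None
```

Note that a body line such as "Part of the problem is..." would also match, which is the kind of false match described above.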

Use first line as heading

When this option is selected, the first line in the document will be treated as a heading. This can be a useful option to select when the first line of your document is a document title line, but doesn't conform to the headings style used in the rest of the document.

Center first heading

When this option is selected, the first heading in the document is centred. This may be an appropriate choice when the first heading is in fact to be treated as a document title.

Check indentation for consistency

The program performs a number of consistency checks when detecting headings. Amongst these is a check that all headings of the same type occur at the same indentation. This check can help distinguish between numbered headings and numbered lists.

However, if you have numbered headings that are different indentations - e.g. because they are centred on the page - then this check will cause them to be rejected as headings. In such cases you can manually disable this check.
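One possible form of this consistency check is to keep only the candidate headings found at the most common indentation. The sketch below is an assumed simplification for illustration, with a flag that mirrors disabling the policy; the function name consistent_headings is hypothetical.

```python
from collections import Counter


def consistent_headings(candidates, check_indentation=True):
    """Keep candidate headings that share the most common indentation.

    candidates: list of (indent, text) pairs for one heading type.
    """
    if not check_indentation or not candidates:
        return list(candidates)
    counts = Counter(indent for indent, _ in candidates)
    common_indent, _ = counts.most_common(1)[0]
    return [item for item in candidates if item[0] == common_indent]


found = [
    (0, "1 Introduction"),
    (0, "2 Method"),
    (4, "1 first item in a numbered list"),
    (0, "3 Results"),
]
print(consistent_headings(found))
print(len(consistent_headings(found, check_indentation=False)))  # 4
```

With the check enabled the indented list item is rejected; with it disabled all four lines remain candidates, which is what centred headings would need.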

This policy appears on-screen as "Check indentations of headings are consistent"

The heading details dialog

This dialog is reached through one of the edit buttons on the main Headings Policies dialog. This allows you to edit details of a particular type or level of heading.

Position of section number on the line

Indentation of heading lines

Heading prefix words

Section number formatting

Heading numbering scheme

Heading separator characters

Heading trailing letters

Bracketing

Heading bracket characters

Indentation of heading lines

AscToPDF uses checks on indentation levels to reject numbered lines that could be confused with headings.

This is the indentation level (in characters) at which headings of this type are expected to be found.

Heading prefix words

Some documents put words like "chapter", "subject" and "section" in front of the section number. These are known as prefix words.

Heading numbering scheme

This is the numbering scheme expected for headings at this level. At present AscToPDF can't cope with mixed types like "II-2.b".

This may be addressed in later versions.

Heading separator characters

This shows the separator expected between parts of the heading number.

*** Not currently supported ***

Heading trailing letters

This shows whether we expect trailing letters after the section number, as in "1.1b".

*** Not currently supported ***

Heading bracket characters

This shows what bracket characters (if any) we expect before and after the section number as in "[2.2]" or "3.2.1)".

*** Not currently supported ***

Pre-formatted text policies

These policies specify how AscToPDF detects pre-formatted text.

Detecting pre-formatted regions

Minimum size of automatic <PRE> section

See the section on pre-formatted text for more details.

Minimum size of automatic <PRE> section

This policy specifies the minimum number of consecutive pre-formatted lines that must be detected before the text is placed in fixed width font.

AscToPDF detects heavily formatted lines, and then looks at their neighbours to see if they too could be part of a pre-formatted text.

Once a group of lines is identified, it will only be marked up as pre-formatted if the minimum is exceeded.

The default value is 0. Set this value larger if AscToPDF is marking text as pre-formatted when it shouldn't.
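The grouping step described above can be sketched as finding runs of consecutive lines flagged as pre-formatted and keeping only the runs longer than the minimum. How AscToPDF decides that a single line is "heavily formatted" is not covered here, so the sketch simply takes a list of true/false flags as input; the function name preformatted_regions is hypothetical.

```python
def preformatted_regions(flags, minimum=0):
    """Return (start, end) ranges of flagged runs longer than minimum.

    flags: one boolean per line, True if the line looks pre-formatted.
    The end index of each range is exclusive.
    """
    regions = []
    start = None
    # A trailing False closes any run that reaches the end of the file.
    for index, flag in enumerate(list(flags) + [False]):
        if flag and start is None:
            start = index
        elif not flag and start is not None:
            if index - start > minimum:  # the minimum must be exceeded
                regions.append((start, index))
            start = None
    return regions


flags = [False, True, True, False, True, True, True, True, False]
print(preformatted_regions(flags))             # [(1, 3), (4, 8)]
print(preformatted_regions(flags, minimum=3))  # [(4, 8)]
```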

Note: The <PRE> is a reference to the shared ancestry of this software with the text to HTML converter from which it evolved.
