Documentation for the Detagger HTML to text converter and markup removal utility

The latest version of these files is available online at


Running the software

The software is available as both a Windows program and a console program. The console version can be run from the command line, and is better suited for use in command files and batch conversions.

Contents of this section

Running as a Windows application
Main Dialog
Menu Bar
File menu
Conversion options menu
Selecting the Text Commands File
Selecting the Text fragments File
Settings menu
Language menu
View menu
Help menu
Update menu
Status window
Console version
The /CONCAT command line qualifier
The /CONSOLE command line qualifier
The /DETAG command line qualifier
The /HELP command line qualifier
The /LOG command line qualifier
The /OUTPUT command line qualifier
The /OVERWRITE command line qualifier
The /POLICY command line qualifier
The /SILENT command line qualifier
The /SUBFOLDERS command line qualifier
The /TREE command line qualifier
Running from the 'SendTo' menu
Working with Unicode
What is Unicode?
Unicode Byte Order Marks (BOMs)
Auto-detecting Unicode input
Creating Unicode output
Controlling Unicode handling through use of policies

Running as a Windows application

Detagger can be invoked as a normal Windows application. On start-up you will be presented with the main window. This consists of a menu bar across the top of the window, and some data entry fields in the main body of the window.

Main Dialog

To convert your files, take the following steps :-

  1. Select the files to be converted using the input selection options. You can type in a wildcard and - if you want - ask the software to search in sub-folders.
  2. Select the conversion type you want. You can convert to text, or selectively remove markup.
  3. Choose the type of output you want. You can output the results to files, or to the clipboard. If you've selected multiple files (e.g. by supplying a wildcard) you can choose to have all the results concatenated into one big results file.
  4. Choose the Output Directory you want. You can output to the same directory as the input files, or to a directory of your choosing. If you are converting files in sub-folders, then you can choose to replicate the folder structure under the output directory.
  5. Options allowing you to fine tune the conversion can be found on the conversion options menu. This has separate sub-menus for conversion to text and markup manipulation.
  6. Alternatively, if you've previously saved some conversion settings to a "Policy File", then you can re-load it. If you've loaded the policy file before, it will be accessible via the drop-down list. (For a fuller description see policy files)
  7. Once you've made all your selections, press the Convert File(s) button at the bottom. The status window will briefly appear whilst the conversion proceeds, displaying messages, and a results viewer may be launched to display the results. (You can control this behaviour using the settings menu.)
  8. If you've fine tuned your conversion and want to save the options for use next time, save them to a policy file by using the Save Policy File option on the Conversion Options menu.

Input selection

Normally you need simply select the input file(s) using the Browse button, and the rest of the fields will be set to default values. If you want to use wildcards, type the file specification directly into the file(s) box.

If you check the Search Sub-folders option, the program will look for matching files in sub-directories.

Conversion type selection

The program supports a number of conversion types. You should select the one that you want to perform.

Convert to text
The output will be a plain text version of the input file. You can fine tune the conversion to text by using the text conversion policies.

Selectively remove markup
The output will be HTML, with some of the tags selectively removed. You can control which by using the Markup removal policies.

If you're removing markup (as opposed to creating text files), then you will typically be creating HTML files from HTML files. This means you run the risk of asking the program to overwrite the input file with the results. If this is what you want, you will need to check the May overwrite existing files option.

Output selection

When you select your input, the output will default to being a file in the same directory. However, there are a number of options available to you.

You can select the output type. The default is to file(s), but you can also output to the clipboard. When converting wildcards the clipboard option is only useful if you have a clipboard manager in place, otherwise only the last file will be held there.

When converting to file, you can select the output filename. This is not a sensible choice when using wildcards to select the input file(s). If you don't change this field the output file will match the input filename, but may have a different extension.

When converting to file, you can select the output directory. If you are converting wildcard files and including sub-directories, this option will by default put all the files in the one directory. To replicate the input folder structure under the output directory instead, see the Output Directory section.

Output types

The program supports a number of output types that determine where the output should go. You should select the one that you want to perform.

Output text to file
The output will be a file. Depending on the type of conversion performed, the file will either be a HTML file or a plain text file

Output to the clipboard
You can choose to place the generated output onto the Windows clipboard, ready for use in other Windows applications.

Using Detagger in this way can be a very powerful technique which allows you to merge converted text with more traditionally authored content.

This approach becomes even more powerful if you use a clipboard extender like ClipMate to remember and organise everything copied to the clipboard. You could convert a few files, and then use ClipMate to recall the pasted text at your leisure for insertion into your other files.

Concatenate results into one file
When you select an output type of Concatenate results into one file, all your results will be added together in one big results file.

Converting to text
When you're converting to text, the output will be one big text file, with the results from converting each input file added after each other.

Removing markup
When selectively removing markup the output will be one big HTML file. This file will inherit the <HEAD> and <BODY> tags (which will include any <TITLE>) from the first input file. The <HEAD> tags of all other input files will be discarded. Because of this many of the results properties (e.g. style sheets and <META> tags) will be whatever was in the first file.

The validity of the resulting HTML file will depend entirely on how well the markup of the multiple files goes together. It's a classic case of garbage in, garbage out.

File separators
In the output file a separator can be added between the results of one input file and the next. These are defined using the text fragment feature.

When creating a text file, the separator fragment is called TEXT_SEPARATOR, and when creating a HTML file it's HTML_SEPARATOR.

In the registered version both separators are absent by default unless you choose to define these fragments. In the 30-day trial, both separators contain short messages.

Output Directory

When outputting to file, there are a number of options for choosing where the output files will go.

The default is to output the files to the same directory as the input files. When the conversion type is set to selectively remove markup this has the potential to overwrite the original files. For this reason you have to select the 'May overwrite the input files' option.

Alternatively you can choose to output files to a different directory. In this case the program will overwrite any files already there because these are not the input files.

Finally, if you have selected the 'Search sub-folders' option, you can elect to replicate the input directory structure under the output directory, rather than have all the files found placed in just the one directory.

Menu Bar

The main menu bar appears at the top of the main screen. It has the following options:-

File                 File options
Conversion Options   Options that affect the conversion
Settings             Edit the program's settings
Language             Select the language you'd like the program's user interface to be in
View                 View the created HTML files or the messages for the last conversion
Help                 Various help files and on-line resources
Update               Check for more recent updates

File menu

The file menu offers the following options:-

This will prompt you for a file to convert and will then
convert the selected file(s).
Load policy file
Loads settings previously saved to a policy file. See policy files.
Save policies to file
Saves program settings to a policy file. See policy files.
Exit
Exits the program.

Conversion options menu

The Conversion Options menu allows you to alter the parameters of the conversions that you do. These options can be saved in policy files for later use. The options available include :-

Markup manipulation

This option gives access to the dialogs used to adjust the options ("policies") that control markup removal:

Detag policies
Detag Tables policies
Tag manipulation policies

These options only apply when the conversion type selection has been set to "selectively remove markup"

Conversion to text

This option gives access to the dialogs used to adjust the options ("policies") that control the text conversion process.

These options only apply when the conversion type selection has been set to "convert to text"

Configuration Files menu

The Config File Location menu allows you to specify the location of additional configuration files. The locations you select will be stored in your policy file, so in a sense these files act as extensions of the policy file, but by being stored in separate files the same configuration files can be shared by multiple policy files.

The options on this menu allow you to locate the following :-

Selecting the Text Commands File

This option allows you to select the Text Commands File you wish to use.

Selecting the Text fragments File

This option allows you to select the Text Fragments File you wish to use.

Load policy file

Detagger has many program options known as "policies" to help you tailor the conversion process. These policies can be saved in a policy file for re-use in future conversions. This dialog screen is primarily intended to allow you to load a previously saved policy file.

For a fuller description see the section on policy files.

Options on this screen include: -

Load policies from an existing policy file
Save policies to a policy file for later re-use
Reset all policies to their default values

Save Policy File

This window is displayed whenever you wish to save your policies to a file, usually for use in later conversions.

To save the file, simply select the policy file name, usually with a .pol extension.

This window contains a radio button with two options:

Save only those policies that have changed

If this option is selected, then only those policies that have been loaded from an existing file and/or been edited during the current session will be saved.

This is the recommended option, as it will exclude all policies that have been set up correctly automatically.

Save all policies

If this option is selected, then all policies are written to file. This is a good way of documenting the policies used, but the resulting file is usually too restrictive to be loaded as input into conversions of other files.

The saved file is a text file designed so that it may be manually edited and reloaded. If you do so, take care not to change the key phrases at the start of each line.

Note: If you find that conversions that used to work "stop working" it's possibly because you're using a complete policy file. If you find this happens, try creating a new policy file from scratch, or manually removing options from your existing policy file.

Resetting policies to default values

This option will reset all conversion options ("policies") to their default values. If a policy file has been loaded, it will be unloaded.

Settings menu

The program settings menu allows you to customise the way Detagger executes each time it is invoked.

This menu has the following options: -

Diagnostic Settings
Set message filters and alter the error reporting level to control the number and type of messages generated during conversions.
Drag and drop settings
Set the program's properties when invoked by dragging files onto the icon on the desktop.
Results viewers settings
Specify the viewers to be used for viewing results files, and their method of invocation.
Use of policy file settings
Specify any default policy file to be used.

In addition to the above sub-menus, this menu allows you to toggle the following options, indicated by tick marks.

Show Tool Tips
If checked tool tips will be available to offer
help on the controls on each dialog screen
Show Status Dialog

If checked the Status Window will show
during the conversion, showing messages describing
how the conversion is going.
Automatically view results

If checked a file viewer will be launched
after the conversion to view the results. This
will either be a HTML browser or a text editor
depending on the type of conversion being done.
See results viewers settings
Remember settings on exit

If checked the program will remember
the selected files and conversions details
for next time
Tip of the Day

If selected the 'Tip of the Day' screen is shown
and you can choose whether or not this should
also be displayed on startup.

Diagnostic settings

These options allow you to set the level of error reporting, or to suppress messages of various types from being displayed during conversion.

The types of messages include :-

INFO messages
Informational messages. These convey information
telling you what has been done and why.

WARNING messages

Warning messages. These tell you that something
you have requested has not been done, or something
has been done which may not be correct. It's possible
you may be able to take corrective action.

TAG ERROR messages
Tagging errors. Only occur when you use the
preprocessor in-line tags and directives.


Program errors. The program has detected it
has done something wrong. The conversion may still
be successful, but there is nothing you can do about
such messages except report them to the program's
author at info<at>

Drag and drop settings

These options specify the behaviour of Detagger when invoked via drag and drop (i.e. by dropping a file icon on the program's icon).

Show the status screen
The status dialog, showing messages reporting how the conversion is going, should be shown.

View results in browser once complete
The selected viewer (browser) for the results files should be invoked on the last file converted once conversion is complete.

Start program after
The program should be launched in Windows mode once the conversion is completed.

Results viewers settings

This identifies the viewers to be used whenever Detagger launches an application to view a results or documentation file. Viewers may be required for both HTML (when detagging) and TEXT (when converting to text) files.

Automatically view results files

You can elect to have results viewed automatically after each conversion. This will normally result in the named application being launched to view the last file converted.

Command used to view HTML files

For HTML, you can elect to use Dynamic Data Exchange (DDE) to have the results displayed in a currently active browser. This can be quicker and more efficient than launching a new instance of the browser each time. You should ensure your DDE browser matches the program named as the default browser so that, if the browser is not already active, the program can start a fresh instance.

When DDE is used the results will vary from browser to browser. IE for example will come to the front, whereas Netscape will not, and if it is minimised you won't see the results until you maximise the browser again.

NOTE: On some systems problems can occur with DDE that will cause the program to hang whenever it attempts to display a HTML file. When this happens the program will need to be stopped via the task manager. The next time the program runs it will detect that this problem has occurred and disable the use of DDE.

Add "file://localhost/" prefix

For HTML files viewed from your local hard drive the prefix "file://localhost/" should be used in place of the "http://" used for Internet access.

Unfortunately some browsers (take a bow IE 3.0) do not support this, so the addition of this prefix may be disabled if you're using such a browser.

Command used to view TEXT files

For TEXT files, DDE is not currently available, so you simply provide the command to view TEXT files (usually just a text editor or NotePad).

Use of policy file settings

Using a default policy file

This determines which policy file, if any, is to be used by default when the program is first invoked. The actual policy file used can, of course, be changed via the policy dialogue.

The default policy file will also be used if the program is invoked via drag'n'drop. This avoids the need for creating batch files with the policy file name on the command line.

Always reload policy file during conversion

This specifies that the current policy file should be reloaded every time the conversion is done. If the file is large, and you are repeatedly converting using the same policy file, then this can slow you down. On the other hand if you are editing the policy file by hand outside the program between conversions then you will want this option enabled.

Tip of the day

The "Tip of the Day" screen is shown by default each time you start up the program. This behaviour can be disabled by clearing the checkbox on the screen.

The tip shown will change each time the screen is displayed, and in addition you can review all the tips available by using the buttons marked "<<" and ">>" to go to the previous and next tips. The number of each tip is shown in case you should want to revisit it at a later date.

The Tip of the Day screen can be shown at any time by selecting the option on the Settings menu.

At present all tips are only available in English.

Language menu

It is possible to change the user interface to the language of your choice. Translations are provided by a number of volunteers who help convert the menu, dialog, and ToolTips text. The message and documentation text remains in English for the time being, so these are not full translations, but they will hopefully be of some use to those whose first language isn't English.

At any given time you may still find English text, especially in the messages displayed and in the help and documentation files, but it is hoped that the efforts of these volunteers will make the program easier to use for non-English speakers.

Supported languages

At present work is under way on the following languages:

Spanish
Gonzalo San Martin is undertaking the Spanish translation.
Gonzalo operates a highly popular Real Madrid fan page (in
Spanish and English) which you can visit at
Gonzalo can be contacted at G.SanMartin<at>

Italian
The Italian translation is being undertaken by
Gianluigi Pizzuto who can be contacted at gibly<at> and
has a web page at

Swedish
The Swedish translation is being undertaken by Dan Svarreby
who can be contacted at dan.svarreby<at>

French
The French translation is being undertaken by Andre Martinez.

Russian
The Russian translation is being undertaken by
Alexander (aka J-34) at j34<at>

Dutch
The Dutch translation is being undertaken by Jurrien Dokter,
who can be contacted at info<at> and runs
the web site at

If you would like to volunteer to help with this effort, please email info<at> (replace "<at>" by "@") or visit the web page at

View menu

This menu contains the following options

Messages from last conversion

View the messages window with messages
generated in the last conversion by bringing
back the Status window
Results of last conversion
View the last file converted in your
preferred browser

Results of last conversion

Once you've converted a file, you can view the results in the browser of your choice. Detagger will detect the default browser used on your system. If you wish you can change this through the settings menu

You can view results in the selected browser by selecting the option on the view menu or by pressing the View results button on the main screen.

Detagger can also be configured to automatically review results when run from the command line or in drag'n'drop operation.

Help menu

The help menu has the following options:-


Brings up the contents page of this help file. Help
can be brought up anywhere in the program by
pressing F1
Register (online)

This option will take you to the registration page,
or - if you have already registered - to the updates
HTML doco (offline)
Brings up the local copy of the HTML
documentation in your preferred browser
HTML doco (online)
Brings up the Internet copy of the HTML
documentation in your preferred browser.
Other products
Links to web pages for JafSoft and their various
software products.

Shows the program version and other details.
Includes buttons to take you to the home page etc
on the web.

Update menu

The update menu has the following option

Check for newer versions

This option will take you to the web site,
where a check will be made to tell you if this
is still the latest version of the software.

Status window

The status window is displayed whenever a conversion is in progress, and shows messages describing how the conversion is progressing. You can prevent it from appearing by clearing the Show Status Dialog option on the Settings menu. You can also bring the window back after a conversion by selecting the "messages from last conversion" option on the View menu.

The messages displayed are usually just informational messages telling you what Detagger is doing. You should review these messages and check they don't indicate an error in conversion.

Once conversion is complete you can dismiss the window. You can automate this by ticking the "dismiss on completion" box.

Should you wish to you can use the save to file button to save the messages displayed to file. This can be useful for reviewing messages, extracting URLs reported by the software (if showing URLs is enabled), or for sending details when requesting support.

Console version

In addition to the Windows version of Detagger, there is a console version. This can be invoked from the command line, and is thus well suited to use in batch and automated conversions.

The console version is free to users who register the Windows version. A trial copy of the console version can be obtained by visiting

The console version is used from the command line. Most of these command options are also supported by the Windows version, but the console version is better suited to batch operation.

The console version is called h2acons. You can see a list of the commands by using the command

c:> h2acons /help

This gives

Usage : h2acons filespec1 filespec2 [policy_file.pol] [/qualifiers]

Recognised qualifiers include

/CONCAT       Concatenate the results into one file
/CONSOLE      Write output direct to console
/DETAG        Selectively remove HTML markup
/HELP         Display this useful list of commands ("/?" also works)
/LOG          Generate a .log file
/OUTPUT       Filespec for output file(s)
/OVERWRITE    May overwrite input files with the output
/POLICY       Document policies used in a .pol file
/SILENT       Suppress all output messages (except these :-)
/SUBFOLDERS   Process files that match the filespec in sub-folders as well
/TREE         Place output files in parallel folder structure to input files

Qualifiers are case insensitive and may be reduced to shortest unique name (e.g. "/lo" for "/log")
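The shortest-unique-name rule can be pictured as prefix matching against the list of qualifiers. The Python sketch below only illustrates that rule and is not Detagger's own parser. The function name resolve_qualifier is hypothetical, and the qualifier list is taken from the section headings of this document.

```python
# Qualifier names as listed in this document's section headings.
QUALIFIERS = [
    "CONCAT", "CONSOLE", "DETAG", "HELP", "LOG", "OUTPUT",
    "OVERWRITE", "POLICY", "SILENT", "SUBFOLDERS", "TREE",
]


def resolve_qualifier(text):
    """Return the full qualifier name for an abbreviation such as '/lo'.

    Raises ValueError if the abbreviation is unknown or ambiguous.
    """
    # Drop the leading slash and any "=value" part, and ignore case.
    name = text.lstrip("/").split("=", 1)[0].upper()
    if not name:
        raise ValueError("empty qualifier")
    # Keep every qualifier that starts with the abbreviation.
    matches = [q for q in QUALIFIERS if q.startswith(name)]
    if len(matches) != 1:
        raise ValueError(f"unknown or ambiguous qualifier: {text!r}")
    return matches[0]


print(resolve_qualifier("/lo"))       # LOG
print(resolve_qualifier("/OUT=x"))    # OUTPUT
```

Under this rule an abbreviation such as "/con" would be rejected, because it matches both CONCAT and CONSOLE.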

Most of the configuration options are passed using a "policy file". This is most easily created by running the Windows version, selecting the options you want and then saving those to a policy file.

The policy file itself is just a text file, with one policy per line (hard break). If you look at the list of policies in the documentation you can edit this by hand, but usually it's just simpler to use the Windows version.

The /CONCAT command line qualifier

When present this qualifier states that all the results should be output to a single file. This only makes a difference if you've supplied multiple filespecs on the command line, or used a wildcard.

The /CONSOLE command line qualifier

When present this specifies that the output should be written to the console window. This might be useful in piping operations.

If you use this, you will usually want to also use the /SILENT qualifier.

The /DETAG command line qualifier

When present this specifies that Detagger should selectively remove HTML markup and create a HTML output file. The default behaviour otherwise is to convert the file to text.

If you want to specify which removal options should apply you'll need to create a policy file and add that to the command line.

The /HELP command line qualifier

Displays the list of supported qualifiers

The /LOG command line qualifier

When present this specifies that Detagger will create a .log file listing all the actions it takes and any messages created.

The /OUTPUT command line qualifier

When present this will tell Detagger where the output should be placed. If omitted the default is to output the results in the same folder as the source file, with an extension (.txt or .html) appropriate to the type of conversion being attempted.

Examples :-

c:> h2acons input.html /out="c:\my files\output.txt"

File is output to "c:\my files\output.txt". Because there is a space in the directory path the filename needs to be in quotes

c:> h2acons in*.html /out=c:\output\

All the files in*.html will be converted and placed in the directory "c:\output\"

c:> h2acons in*.html /concat/detag/out=c:\output\bigfile.html

In this case the /concat/detag means that Detagger will selectively remove markup and concatenate the results in the single file "c:\output\bigfile.html"
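For batch work, commands like the ones above can also be assembled by a script. The Python sketch below builds an h2acons argument list from a filespec, an optional policy file and a set of qualifiers. The helper name build_command is hypothetical, the sketch assumes h2acons can be found on the PATH, and it uses only qualifiers documented in this section. How h2acons handles the quoting Python applies to arguments containing spaces has not been verified here.

```python
import subprocess


def build_command(filespec, policy_file=None, output=None, qualifiers=()):
    """Build an argument list for the h2acons console program."""
    command = ["h2acons", filespec]
    if policy_file:
        # A policy file is simply listed on the command line (.pol extension).
        command.append(policy_file)
    if output:
        command.append("/out=" + output)
    # Qualifiers are passed through unchanged, e.g. "/concat" or "/detag".
    command.extend(qualifiers)
    return command


command = build_command(
    "in*.html",
    output="c:\\output\\bigfile.html",
    qualifiers=["/concat", "/detag"],
)
print(command)

# Uncomment to run the conversion (requires h2acons to be installed):
# subprocess.run(command, check=True)
```

Keeping the command as a list rather than a single string leaves the quoting of each argument to Python.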

The /OVERWRITE command line qualifier

When the /DETAG qualifier is specified then by default the output file will be a HTML file in the same directory as the source file. In this case Detagger could end up replacing the original file by the output file. That is only allowed if the /OVERWRITE qualifier is present. If it isn't, an error message is generated.

An alternative to using the /OVERWRITE qualifier is to use the /OUTPUT qualifier to direct the output to a different folder, or to a different name in the same folder.

The /POLICY command line qualifier

When present Detagger will create a .pol policy file listing all the policies used in the conversion and their values. You should not normally need to do this unless you want to create a policy file to edit, or want to check that your policies are being used.

To pass in a set of policies, just list the policy file on the command line. It must have a .pol extension. For example the command

c:> h2acons in*.html input.pol /policy=output.pol

will read the policies in "input.pol", use those in the conversion, and then create a file "output.pol" listing the policies used, which will be a mixture of default values and those loaded from "input.pol".

The /SILENT command line qualifier

When present all the messages usually displayed to the console window are suppressed.

You'd want to use this if you were using the /CONSOLE qualifier.

The /SUBFOLDERS command line qualifier

When present the software will search the sub-folders of the input directory looking for other files that match the input filespec.

See also The /TREE command line qualifier

The /TREE command line qualifier

When present the software will place output files in a directory structure that matches the input structure. This will only apply when using the /SUBFOLDERS and /OUTPUT options as well. So for example the command

    c:> h2acons c:\input\a*.html /output=d:\new\ /subfolders/tree

This would look for all files a*.html in the folder c:\input\ and its sub-folders. The output files will be placed in d:\new\ and sub-folders of that, so for example c:\input\sub\answer.html would be converted to d:\new\sub\answer.txt. If it didn't already exist, the sub-folder d:\new\sub\ would be created.
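The mapping from an input file to its place in the parallel output structure can be sketched in a few lines of Python. This is an illustration of the path arithmetic only, not Detagger's implementation, and the function name tree_output_path is hypothetical. PureWindowsPath is used so the sketch behaves the same on any operating system.

```python
from pathlib import PureWindowsPath


def tree_output_path(input_file, input_root, output_root, extension=".txt"):
    """Map an input file to its place in a parallel output folder structure."""
    # Path of the input file relative to the folder being searched.
    relative = PureWindowsPath(input_file).relative_to(PureWindowsPath(input_root))
    # Re-root that relative path under the output folder and swap the extension.
    return (PureWindowsPath(output_root) / relative).with_suffix(extension)


result = tree_output_path(r"c:\input\sub\answer.html", r"c:\input", r"d:\new")
print(result)  # d:\new\sub\answer.txt
```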

See also The /SUBFOLDERS command line qualifier

Running from the 'SendTo' menu

Detagger can make a useful addition to your "Send to" menu (available when you right-click on a file in explorer).

To add Detagger to this menu, simply add a shortcut to your Send To shortcuts directory. Under Windows 9x this is


under Windows XP this is

/Documents and Settings/<Your_User_Name>/SendTo

If you want to use a standard policy file (e.g. with a particular colour scheme), then change the properties of the shortcut so that the command is

Detagger %1 standard.pol

Working with Unicode

Detagger was not originally designed with Unicode in mind. Support for Unicode text has been added gradually over time, so earlier versions of Detagger may not support all the features described in this manual. If in doubt, please contact JafSoft for details.

What is Unicode?

Traditional single-byte character sets interpret the 8-bit character values (128-255) as special characters, and each character set interprets them differently. So on a Russian machine these values would be interpreted as Cyrillic, but on a different machine they could be read (wrongly) as Arabic (and vice versa). On most English-based PCs, the 8-bit characters are used for the accented characters found in certain European languages, so a Russian text would appear to have lots of accented 'i's, 'e's and 'a's.

Unicode is a way of implementing text that supports multiple types of character sets at the same time so that - for example - it is possible to display Chinese and Cyrillic on the same page unambiguously. It does this by allocating each character in each language a unique code value, so that codes used for Cyrillic characters no longer overlap and conflict with those assigned to Arabic.

However, these code values are in most cases larger than can be represented in a single byte. As a result a way has to be chosen to represent each character by one or more bytes.

The following Unicode representations are commonly used:

UTF-8
Each character is represented by 1 to 4 bytes, depending on which range the Unicode code value falls into (most characters in everyday use need no more than 3). This has the advantage that all ASCII characters are a single byte, so for example every character of the HTML tags in a document is represented by a single byte. This also means there are no null bytes contained in the text, which can make programming software to work with this text easier.

UTF-16
Each character is represented by a 2-byte pair (characters outside the original 16-bit range require 2 such pairs). The 2-byte pair is just the numerical representation of the Unicode value of each character. This makes the files easier to interpret, but also means that the byte order depends on how the machine stores its bytes - i.e. is the machine big-endian or little-endian. Because ASCII characters have a Unicode value less than 128, they map onto byte pairs in which one of the bytes is null. Because each character requires two bytes, a single byte wrongly inserted into a UTF-16 stream will render all text that follows as gibberish.
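The difference between these representations is easy to see by encoding a few characters. The Python sketch below uses only the standard str.encode method. The helper name byte_forms and the choice of sample characters are purely illustrative.

```python
def byte_forms(char):
    """Return the bytes used for one character in each representation."""
    return {
        "UTF-8": char.encode("utf-8"),
        "UTF-16BE": char.encode("utf-16-be"),
        "UTF-16LE": char.encode("utf-16-le"),
    }


samples = {
    "A": "Latin capital A (ASCII)",
    "\u0416": "Cyrillic capital Zhe",
    "\u4e2d": "a Chinese character",
}

for char, description in samples.items():
    forms = byte_forms(char)
    # Show each encoding as hexadecimal byte values.
    details = ", ".join(f"{name} {data.hex()}" for name, data in forms.items())
    print(f"U+{ord(char):04X} {description}: {details}")
```

The ASCII letter takes one byte in UTF-8 but two in UTF-16, one of which is null, while the Cyrillic and Chinese characters take two and three bytes respectively in UTF-8.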

Unicode Byte Order Marks (BOMs)

Files that contain Unicode can identify themselves by including a "Byte Order Mark" (BOM) at the very start of the file. This is a two-byte marker for UTF-16 files and a three-byte marker for UTF-8 files. Modern applications will test for this byte marker and if present will then know how to interpret the contents of the file. For example Notepad as supplied with Windows XP can do this, whereas Notepad as supplied with Windows 98 could not.

In UTF-16 each character is represented by two bytes, and computers can store a two-byte value in different ways (known as "big-endian" and "little-endian"). Each operating system uses one method or another and it isn't usually an issue, but when Unicode files get passed from one machine to another, this becomes important. The BOM allows the two forms of UTF-16 (known as "UTF-16BE" and "UTF-16LE") to be distinguished.
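As an illustration of how an application might test for these markers, the Python sketch below compares the first bytes of a file's contents against the BOM constants in the standard codecs module. It shows the general idea only and makes no claim about how Detagger itself performs the check. The function name detect_bom is hypothetical.

```python
import codecs

# The three markers begin with different bytes, so the order of the
# checks does not matter.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def detect_bom(data):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbf<html>"))  # UTF-8
print(detect_bom(b"\xff\xfe<\x00"))       # UTF-16LE
print(detect_bom(b"<html>"))              # None
```

A file without a BOM may still contain Unicode, so a result of None only means that no marker was found.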

Auto-detecting Unicode input

The software has some ability to auto-detect Unicode text, and will generally do so under the following circumstances

Creating Unicode output

The software will create Unicode output whenever it detects that the input files were Unicode, or wherever Unicode characters have been detected in the HTML entities of the original.

At present all Unicode output files will be UTF-8.

Controlling Unicode handling through use of policies

The following policies can be used to control the handling of Unicode during the conversion :-

input text encoding

By default the software will attempt to auto-detect whether or not the input is Unicode, but if this fails you can explicitly tell the software the encoding using this policy.

May add Unicode marker to output file

When Unicode is detected in the source the software will output the text as UTF-8 and optionally add a file marker that will label the file as "Unicode" in a way that most applications that can cope with Unicode will recognize.

Allow ANSI alternatives (e.g. space for &nbsp;)

Certain common HTML entities don't have a single ANSI character but have common ASCII representations. If you enable this policy you tell the software to use ASCII/ANSI alternatives where possible, thereby reducing the chance of Unicode being necessary for the output file.
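The effect of this kind of policy can be sketched in Python. The substitution table below is an assumption made for illustration: only the pairing of &nbsp; with a space comes from the policy name above, and the other entries, along with the function name decode_entities, are hypothetical and do not describe Detagger's actual substitutions.

```python
import html

# Hypothetical substitution table; only the &nbsp; -> space pairing comes from
# the policy's own description, the rest are assumptions for illustration.
ASCII_ALTERNATIVES = {
    "&nbsp;": " ",
    "&ndash;": "-",
    "&mdash;": "--",
    "&hellip;": "...",
}


def decode_entities(text, allow_alternatives=True):
    """Decode HTML entities, optionally preferring plain ASCII stand-ins."""
    if allow_alternatives:
        for entity, replacement in ASCII_ALTERNATIVES.items():
            text = text.replace(entity, replacement)
    # Any entities left over are decoded to their real Unicode characters.
    return html.unescape(text)


with_alternatives = decode_entities("price&nbsp;list&hellip;")
without_alternatives = decode_entities("price&nbsp;list&hellip;", allow_alternatives=False)
print(with_alternatives.isascii())     # True
print(without_alternatives.isascii())  # False
```

With the alternatives enabled the result stays within ASCII, so there is no need for a Unicode output file. Without them the same input produces non-ASCII characters.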


Converted from a single text file by AscToHTM
© 1997-2005 John A Fotheringham