$_$_TITLE Documentation for the Detagger markup removal utility $_$_DESCRIPTION Detagger is a utility for converting html to text or selectively removing HTML markup from your HTML web pages. $_$_CHANGE_POLICY Indent Position(s) : 0 4 8 12 16 $_$_CHANGE_POLICY Create mailto links : no $_$_CHANGE_POLICY Default font : Arial, regular, 10 $_$_CHANGE_POLICY Could be blank line separated : yes $_$_TABLE_HEADER_COLS 1 $_$_TABLE_BORDER 0 $_$_SECTION MAKINGRTF Detagger Help Index ******************* $_$_SECTION MAKINGHTML [[HTML

Documentation for the Detagger markup remover and HTML-to-text conversion utility

]] $_$_SECTION ALL $_$_HELP_TOPIC_ID HID_ANAL_BULLETS $_$_HELP_TOPIC_ID HID_ANAL_CONTENTS $_$_HELP_TOPIC_ID HID_ANAL_FILESTRUCT $_$_HELP_TOPIC_ID HID_ANAL_HEADINGS $_$_HELP_TOPIC_ID HID_ANAL_LAYOUT $_$_HELP_TOPIC_ID HID_ANAL_LOOKFOR $_$_HELP_TOPIC_ID HID_ANAL_POLICIES $_$_HELP_TOPIC_ID HID_ANAL_PREFORM $_$_HELP_TOPIC_ID HID_ANAL_TABLE $_$_HELP_TOPIC_ID HID_DUMMY_MENU $_$_HELP_TOPIC_ID HID_FILE_CONVERT $_$_HELP_TOPIC_ID HID_FILE_EXIT $_$_HELP_TOPIC_ID HID_HELP_ABOUTXXXTOXXX $_$_HELP_TOPIC_ID HID_HELP_CONTENTS $_$_HELP_TOPIC_ID HID_HELP_HTML $_$_HELP_TOPIC_ID HID_HELP_HTML_NET $_$_HELP_TOPIC_ID HID_HELP_REGISTER $_$_HELP_TOPIC_ID HID_HELP_WEB_ASCTOHTM $_$_HELP_TOPIC_ID HID_HELP_WEB_ASCTORTF $_$_HELP_TOPIC_ID HID_MERRILL $_$_HELP_TOPIC_ID HID_OPTIONS_LOAD $_$_HELP_TOPIC_ID HID_OPTIONS_RESET $_$_HELP_TOPIC_ID HID_OPTIONS_SAVE $_$_HELP_TOPIC_ID HID_OUT_DETAG $_$_HELP_TOPIC_ID HID_OUT_DETAG_TABLES $_$_HELP_TOPIC_ID HID_OUT_FILE_CONTENTS $_$_HELP_TOPIC_ID HID_OUT_FILE_DIRECTORY $_$_HELP_TOPIC_ID HID_OUT_FILE_FILE $_$_HELP_TOPIC_ID HID_OUT_FILE_FRAMES $_$_HELP_TOPIC_ID HID_OUT_FILE_GENERATION $_$_HELP_TOPIC_ID HID_OUT_FILE_NAMES $_$_HELP_TOPIC_ID HID_OUT_FONTS $_$_HELP_TOPIC_ID HID_OUT_HTML_ADDED $_$_HELP_TOPIC_ID HID_OUT_HTML_HYPERLINKS $_$_HELP_TOPIC_ID HID_OUT_HTML_LINKDICT $_$_HELP_TOPIC_ID HID_OUT_HTML_STYLE $_$_HELP_TOPIC_ID HID_OUT_MISC_PREPRO $_$_HELP_TOPIC_ID HID_OUT_RTF_SETTINGS $_$_HELP_TOPIC_ID HID_OUT_STYLE_COLOURS $_$_HELP_TOPIC_ID HID_OUT_STYLE_CSS $_$_HELP_TOPIC_ID HID_OUT_STYLE_FONT $_$_HELP_TOPIC_ID HID_OUT_STYLE_HTML $_$_HELP_TOPIC_ID HID_OUT_TABLE_TABLE $_$_HELP_TOPIC_ID HID_OUT_TAG_MANIPULATION $_$_HELP_TOPIC_ID HID_OUT_TEXT_FORMAT $_$_HELP_TOPIC_ID HID_OUT_TEXT_HEADERS $_$_HELP_TOPIC_ID HID_OUT_TEXT_MARKERS $_$_HELP_TOPIC_ID HID_OUT_TEXT_PARAGRAPHS $_$_HELP_TOPIC_ID HID_OUT_TEXT_HYPERLINKS $_$_HELP_TOPIC_ID HID_OUT_TEXT_TABLES $_$_HELP_TOPIC_ID HID_POLICIES_ANALYSIS $_$_HELP_TOPIC_ID HID_POLICIES_CHANGE $_$_HELP_TOPIC_ID HID_POLICIES_LOAD 
$_$_HELP_TOPIC_ID HID_POLICIES_OUTPUTTOHTML $_$_HELP_TOPIC_ID HID_POLICIES_RELOAD $_$_HELP_TOPIC_ID HID_POLICIES_RESETTODEFAULTS $_$_HELP_TOPIC_ID HID_POLICIES_SAVE $_$_HELP_TOPIC_ID HID_PROD_ASCTOHTM $_$_HELP_TOPIC_ID HID_PROD_ASCTORTF $_$_HELP_TOPIC_ID HID_PROD_ASCTOTAB $_$_HELP_TOPIC_ID HID_PROD_JAFSOFT $_$_HELP_TOPIC_ID HID_REANALYSE $_$_HELP_TOPIC_ID HID_REMEMBER_SETTINGS $_$_HELP_TOPIC_ID HID_RTF_CONTENTS $_$_HELP_TOPIC_ID HID_RTF_DOCUMENT $_$_HELP_TOPIC_ID HID_RTF_HELP $_$_HELP_TOPIC_ID HID_RTF_LINKDICT $_$_HELP_TOPIC_ID HID_SETTINGS_DIAGNOSTICS $_$_HELP_TOPIC_ID HID_SETTINGS_DOCO $_$_HELP_TOPIC_ID HID_SETTINGS_DRAG $_$_HELP_TOPIC_ID HID_SETTINGS_LANGUAGE_ENGLISH $_$_HELP_TOPIC_ID HID_SETTINGS_LANGUAGE_FRENCH $_$_HELP_TOPIC_ID HID_SETTINGS_LANGUAGE_GERMAN $_$_HELP_TOPIC_ID HID_SETTINGS_LANGUAGE_ITALIAN $_$_HELP_TOPIC_ID HID_SETTINGS_LANGUAGE_OTHER $_$_HELP_TOPIC_ID HID_SETTINGS_LANGUAGE_PORTUGUESE $_$_HELP_TOPIC_ID HID_SETTINGS_LANGUAGE_RUSSIAN $_$_HELP_TOPIC_ID HID_SETTINGS_LANGUAGE_SPANISH $_$_HELP_TOPIC_ID HID_SETTINGS_LANGUAGE_SWEDISH $_$_HELP_TOPIC_ID HID_SETTINGS_LANGUAGE_DUTCH $_$_HELP_TOPIC_ID HID_SETTINGS_LOADLANG $_$_HELP_TOPIC_ID HID_SETTINGS_POLICY $_$_HELP_TOPIC_ID HID_SETTINGS_VIEWER $_$_HELP_TOPIC_ID HID_SHOW_RESULTS $_$_HELP_TOPIC_ID HID_SHOW_STATUS $_$_HELP_TOPIC_ID HID_SHOW_TOOLTIPS $_$_HELP_TOPIC_ID HID_UPGRADE $_$_HELP_TOPIC_ID HID_VIEW_LASTCONVERSION $_$_HELP_TOPIC_ID HID_VIEW_MESSAGES $_$_HELP_TOPIC_ID HIDD_A2HDLG_DIALOG $_$_HELP_TOPIC_ID HIDD_ABOUTBOX $_$_HELP_TOPIC_ID HIDD_ADDEDHTML $_$_HELP_TOPIC_ID HIDD_ADVANCED_HTML $_$_HELP_TOPIC_ID HIDD_ANALYSIS $_$_HELP_TOPIC_ID HIDD_ASCTOTAB_DIALOG $_$_HELP_TOPIC_ID HIDD_BULLETS $_$_HELP_TOPIC_ID HIDD_CHOOSE_LANGUAGE $_$_HELP_TOPIC_ID HIDD_COLOURS $_$_HELP_TOPIC_ID HIDD_CONTENTS $_$_HELP_TOPIC_ID HIDD_CSS $_$_HELP_TOPIC_ID HIDD_DIRECTORY $_$_HELP_TOPIC_ID HIDD_EXPIRED $_$_HELP_TOPIC_ID HIDD_FILESPLIT $_$_HELP_TOPIC_ID HIDD_FILESPLIT_RTF $_$_HELP_TOPIC_ID HIDD_FILESTRUCT $_$_HELP_TOPIC_ID 
HIDD_FONTS $_$_HELP_TOPIC_ID HIDD_FRAME_COLOUR $_$_HELP_TOPIC_ID HIDD_FRAMES $_$_HELP_TOPIC_ID HIDD_HEADDTLS $_$_HELP_TOPIC_ID HIDD_HEADING_LEVELS $_$_HELP_TOPIC_ID HIDD_HEADINGS $_$_HELP_TOPIC_ID HIDD_HYPERLINKS $_$_HELP_TOPIC_ID HIDD_LANGUAGES $_$_HELP_TOPIC_ID HIDD_LINK_DICTIONARY $_$_HELP_TOPIC_ID HIDD_LOOKFOR $_$_HELP_TOPIC_ID HIDD_MAIL_TIDY $_$_HELP_TOPIC_ID HIDD_MAILTOTEXT_DIALOG $_$_HELP_TOPIC_ID HIDD_MAKEHELP $_$_HELP_TOPIC_ID HIDD_MERRILL $_$_HELP_TOPIC_ID HIDD_OPTIONS $_$_HELP_TOPIC_ID HIDD_PREFORMAT $_$_HELP_TOPIC_ID HIDD_PREPROCESSOR $_$_HELP_TOPIC_ID HIDD_RTF_DOCUMENT $_$_HELP_TOPIC_ID HIDD_RTF_SETTINGS $_$_HELP_TOPIC_ID HIDD_SAVE_POLICY $_$_HELP_TOPIC_ID HIDD_SETTINGS_DOCO $_$_HELP_TOPIC_ID HIDD_SETTINGS_SINGLEVIEWER $_$_HELP_TOPIC_ID HIDD_SETTINGS_VIEWER_RTF $_$_HELP_TOPIC_ID HIDD_SPLASH $_$_HELP_TOPIC_ID HIDD_STATIC $_$_HELP_TOPIC_ID HIDD_STYLE $_$_HELP_TOPIC_ID HIDD_TABANAL $_$_HELP_TOPIC_ID HIDD_TABLE $_$_HELP_TOPIC_ID HIDD_Template $_$_HELP_TOPIC_ID HIDD_TRANSLATIONS $_$_HELP_TOPIC_ID HIDR_ACCELERATOR1 $_$_HELP_TOPIC_ID HIDR_COMMON $_$_HELP_TOPIC_ID HIDR_ICON_ASCTOTAB $_$_HELP_TOPIC_ID HIDR_ICON_DETAGGER $_$_HELP_TOPIC_ID HIDR_ICON_MAILTOTEXT $_$_HELP_TOPIC_ID HIDR_MENU_A2T $_$_HELP_TOPIC_ID HIDR_MENU_H2A $_$_HELP_TOPIC_ID HIDR_MENU_M2T $_$_HELP_TOPIC_ID HIDD_DETAGGER_DIALOG $_$_HELP_TOPIC_ID HIDD_RTF_FORMAT $_$_HELP_TOPIC_ID HID_OUT_TEXT_MISCELLANEOUS $_$_HELP_TOPIC_ID HIDD_HTML_FRAGMENTS $_$_HELP_TOPIC_ID HIDD_SDF_FILE $_$_HELP_TOPIC_ID HIDD_LINKDICT_EDIT $_$_HELP_TOPIC_ID HIDD_TDF_FILE *Detagger* is a utility designed to process files containing markup tags. The program has various options to selectively remove some or all of the tags in the file. The document describes Detagger version 2.4, which was released in June 2005. 
$_$_SECTION MAKINGHTML $_$_CONTENTS_LIST 2 $_$_SECTION MAKINGRTF $_$_CONTENTS_LIST ,,1 $_$_SECTION MAKINGRTFHELP [[goto "Running the software","Running Detagger"]][[BR]] [[goto "Using Policy files","Using policy files to save program options"]][[BR]] [[goto "Using a Text Fragments File","Using a Text Fragments File to add custom headers and footers"]][[BR]] [[goto "Using a Text Commands File","Using a Text Commands File to prepare the text for conversion"]][[BR]] [[goto "Ordering your copy", Purchasing Detagger]][[BR]] [[goto Detagger on the web, More information on the web]][[BR]] [[popup Upgrades]][[BR]] [[popup Change History]][[BR]] [[popup Acknowledgements]] There's a [[goto complete contents list for this document]] Complete contents list for this document ======================================== $_$_HELP_TOPIC_ID ID_CONTENTS_LIST This is a complete contents list for the .rtf file used to generate this help file. Sadly it isn't hyperlinked, but it may help you find what you're looking for. Ignore any page numbers. $_$_CONTENTS_LIST $_$_SECTION ALL $_$_HELP_CHAPTER 1,"Installation" Installation ************ The shareware version of *Detagger* is made available over the web from [det Download location]. Once you register you can download the full version (no nags, no limits), and are entitled to free upgrades for an arbitrary (equals "my decision is final") period of time. Installation will vary according to the type of install kit you've downloaded, but in each case you first download the .zip file appropriate to your system and unzip. $_$_SECTION MAKINGHTML *Contents of this section* $_$_CONTENTS_LIST 2,,2 $_$_SECTION ALL $_$_HELP_CHAPTER 2,"Windows Installation" Windows installation ==================== The current version of the software makes updates to your Registry. See the Install notes that come with the software for a description of the registry settings used. 
Installing the Windows GUI version ---------------------------------- The standard installations use InnoSetup to offer install and uninstall options. To use this version, unzip the file and then run the Setup program. This will move the files to a directory, and create all icons etc. Once installed, InnoSetup will also offer an uninstall option. You can access this via Control Panel | Add/remove software. Installing the console version ------------------------------ The [[goto console version]] simply comes in a .zip file. The documentation is not included as this is the same as the Windows version. Simply unzip the console version to the folder of your choice. $_$_HELP_CHAPTER 2,"OpenVMS version" $_$_HELP_SUBJECT "Installation" OpenVMS version of Detagger =========================== Sorry. No VMS distribution is planned at this time. That said, the software is largely developed under VMS, so if there is enough interest this may change. Email *info<at>jafsoft.com* if you're interested (replace the "<at>" by "@"). $_$_HELP_CHAPTER 1,"Running the software" Running the software ******************** The software is available as both a Windows program and a console program. The console version can be run from the command line, and is better suited for use in command files and batch conversions. $_$_SECTION MAKINGHTML *Contents of this section* $_$_CONTENTS_LIST 2,,2 $_$_SECTION MAKINGRTFHELP - [[goto Running as a Windows application]] - [[goto Console Version, Running as a command line program]] - [[goto "Running from the 'SendTo' menu"]] $_$_SECTION ALL $_$_HELP_CHAPTER 2,"Running as a Windows application" $_$_HELP_SUBJECT "Overview" Running as a Windows application ================================ $_$_HELP_TOPIC_ID ID_RUN_WINDOWS *Detagger* can be invoked as a normal Windows application. On start-up you will be presented with the main window. This consists of a menu bar across the top of the window, and some data entry fields in the main body of the window. 
$_$_SECTION MAKINGRTFHELP _The Main screen_ - [[goto "main dialog","Main dialog screen"]] _Menu options_ - [[goto File menu]] - [[goto Conversion options menu]] - [[goto Settings menu]] - [[goto Language menu]] - [[goto View menu]] - [[goto Help menu]] - [[goto Update menu]] _The status window_ - [[goto "Status Window","The Status window"]] $_$_SECTION ALL $_$_HELP_CHAPTER 3,"The Main screen" Main Dialog ----------- To convert your files, take the following steps :- 1) Select the files to be converted using the [[goto input selection]] options. You can type in a wildcard and - if you want - ask the software to search in sub-folders. 2) Select the [[goto Conversion type selection, conversion type]] you want. You can convert to text, or selectively remove markup. 3) Choose the [[goto Output types, type of output]] you want. You can output the results to files, or to the clipboard. If you've selected multiple files (e.g. by supplying a wildcard) you can choose to have all the results concatenated into one big results file. 4) Choose the [[goto Output Directory]] you want. You can output to the same directory as the input files, or to a directory of your choosing. If you are converting files in sub-folders, then you can choose to replicate the folder structure under the output directory. 5) Options allowing you to fine tune the conversion can be found on the [[goto conversion options menu]]. This has separate sub-menus for [[goto conversion to text]] and [[goto markup manipulation]] 6) Alternatively, if you've previously saved some conversion settings to a "Policy File", then you can re-load it. If you've loaded the policy file before, it will be accessible via the drop-down list. _(For a fuller description see [[goto using policy files, policy files]])_ 7) Once you've made all your selections, press the *Convert File(s)* button at the bottom. 
The [[goto status window]] will briefly appear whilst the conversion proceeds, displaying messages, and a results viewer may be launched to display the results. _(You can control this behaviour using the [[goto settings menu]])_. 8) If you've fine tuned your conversion and want to save the options for use next time, save them to a policy file by using the _Save Policy File_ option on the _Conversion Options_ menu. $_$_HELP_CHAPTER 4,"Selecting your options" Input selection ............... Normally you simply need to select the input file(s) using the *Browse* button, and the rest of the fields will be set to default values. If you want to use wildcards, type the file specification in the file(s) box directly. If you check the *Search Sub-folders* option, the program will look for matching files in sub-directories. Conversion type selection ......................... The program supports a number of conversion types. You should select the one that you want to perform. $_$_BEGIN_DELIMITED_TABLE *Convert to text* The output will be a plain text version of the input file. You can fine tune the conversion to text by using the [[goto text conversion policies]]. *Selectively remove markup* The output will be HTML, with some of the tags selectively removed. You can control which by using the [[goto Markup removal policies]]. $_$_END_DELIMITED_TABLE If you're removing markup (as opposed to creating text files), then you will typically be creating HTML files from HTML files. This means you run the risk of asking the program to overwrite the input file with the results. If this is what you want, you will need to check the _May overwrite existing files_ option. Output selection ................ When you select your input, the output will default to being a file in the same directory. However, there are a number of options available to you. You can select the [[goto "output types","output type"]]. The default is to file(s), but you can also output to the clipboard. 
When converting wildcards the clipboard option is only useful if you have a clipboard manager in place, otherwise only the last file will be held there. When converting to file, you can select the *output filename*. This is not a sensible choice when using wildcards to select the input file(s). If you don't change this field the output file will match the input filename, but may have a different extension. When converting to file, you can select the *output directory*. If you are converting wildcard files and including sub-directories, this option will put *all* the files in the one directory unless you choose to replicate the input directory structure (see [[goto Output Directory]]). Output types ............ The program supports a number of output types that determine where the output should go. You should select the one that you want to use. *Output text to file* The output will be a file. Depending on the type of conversion performed, the file will either be a HTML file or a plain text file. *Output to the clipboard* You can use the output type to select the option of placing the generated output onto the Windows clipboard, ready for use in other Windows applications. Using *Detagger* in this way can be a very powerful technique which allows you to merge converted text with more traditionally authored content. This approach becomes even more powerful if you use a Clipboard extender like *ClipMate* (see www.thornsoft.com) to remember and organise everything copied to the clipboard. You could convert a few files, and then use *ClipMate* to recall the pasted text at your leisure for insertion into your other files. *Concatenate results into one file* When you select an output type of _Concatenate results into one file_, all your results will be added together in one big results file. _Converting to text_ When you're converting to text, the output will be one big text file, with the results from converting each input file added one after another. 
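As an illustration only, the effect of concatenating text results can be sketched in a few lines of Python. This is not how *Detagger* itself is implemented; the function name and the separator argument are hypothetical, with the separator standing in for the optional TEXT_SEPARATOR fragment described under _File separators_.

```python
def concatenate_text_results(results, separator=""):
    """Join per-file text results in input order.

    `results` is a list of strings, one per converted input file.
    `separator` is an assumed stand-in for an optional separator
    fragment; by default nothing is placed between results.
    """
    return separator.join(results)


# Example: two converted files joined with a visible separator line.
combined = concatenate_text_results(
    ["Text from the first file.\n", "Text from the second file.\n"],
    separator="----------\n",
)
print(combined)
```

The results are kept in the order the input files were processed, which is why the order of the inputs determines the layout of the combined file.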
_Removing markup_ When selectively removing markup the output will be one big HTML file. This file will inherit the and tags (which will include any from the first input file). All other <HEAD> tags will be discarded. Because of this many of the results properties (e.g. style sheets and <META> tags) will be whatever was in the first file. The validity of the resulting HTML file will depend entirely on how well the markup of the multiple files goes together. It's a classic case of _garbage in, garbage out_ _File separators_ In the output file a separator can be added between the results of one input file and the next. These are defined using the [[goto Using a Text Fragments File, text fragment]] feature. When creating a text file, the separator fragment is called TEXT_SEPARATOR, and when creating a HTML file it's HTML_SEPARATOR. In the registered version both separators are absent by default unless you choose to define these fragments. In the 30-day trial, both separators contain short messages. Output Directory ................ When outputting to file, there are a number of options for choosing where the output files will go. The default is to output the files to the same directory as the input files. When the [[goto conversion type selection, conversion type]] is set to selectively remove markup this has the potential to overwrite the original files. For this reason you have to select the 'May overwrite the input files' option. Alternatively you can choose to output files to a different directory. In this case the program will overwrite any files already there because these are not the input files. Finally, if you have selected the 'Search sub-folders' option, you can elect to replicate the input directory structure under the output directory, rather than have all the files found placed in just the one directory. $_$_HELP_CHAPTER 3,"Menu options" Menu Bar -------- The main menu bar appears at the top of the main screen. 
It has the following options:- $_$_BEGIN_DELIMITED_TABLE [[goto File menu, File]] File options [[goto Conversion Options menu, Conversion Options]] Options that affect the conversion [[goto Settings menu, Settings]] Edit the program's settings [[goto Language menu, Language]] Select the language you'd like the program's user interface to be in [[goto View menu,View]] View the created HTML files or the messages for the last conversion [[goto Help menu, Help]] Various help files and on-line resources [[goto Update menu]] Check for more recent updates $_$_END_DELIMITED_TABLE $_$_HELP_CHAPTER 4,"File menu" File menu --------- The file menu offers the following options:- $_$_BEGIN_DELIMITED_TABLE *Convert* This will prompt you for a file to convert and will then convert the selected file(s). *Load policy file* Loads settings previously saved to a policy file. See [[goto using policy files, policy files]] *Save policies to file* Saves program settings to a policy file. See [[goto using policy files, policy files]] *Exit* Exit the program. $_$_END_DELIMITED_TABLE $_$_HELP_CHAPTER 4,"Conversion options menu" Conversion options menu ----------------------- The *Conversion Options* menu allows you to alter the parameters of the conversions that you do. These options can be saved in [[goto using policy files, policy files]] for later use. The options available include :- $_$_SECTION MAKINGRTFHELP [[goto Markup manipulation]][[br]] [[goto Conversion to text]][[br]] [[goto Configuration Files menu, Configuration Files]][[br]] [[goto Load policy file]][[br]] [[goto Save Policy File]][[br]] [[goto Resetting policies to default values]][[br]] $_$_SECTION ALL Markup manipulation ................... 
This option gives access to the dialogs used to adjust the options ("policies") that control markup removal. - [[goto Detag policies]] - [[goto Detag Tables policies]] - [[goto Tag manipulation policies]] Note: These options only apply when the [[goto conversion type selection]] has been set to *"selectively remove markup"* Conversion to text .................. This option gives access to the dialogs used to adjust the options ("policies") that control the text conversion process. - [[goto data extraction policies, "Data Extraction"]] - [[goto text format policies, "Text formatting"]] - [[goto text hyperlink policies, "Hyperlink and Image handling"]] - [[goto text marker policies, "Added markers"]] *new in V2.3* - [[goto text paragraph policies, "Paragraphs and sentences"]] *new in V2.3* - [[goto text table policies, "Table handling"]] - [[goto Miscellaneous formatting policies, "Miscellaneous policies"]] Note: These options only apply when the [[goto conversion type selection]] has been set to *"convert to text"* Configuration Files menu ........................ The Config File Location menu allows you to specify the location of additional configuration files. The locations you select will be stored in your policy file, so in a sense these files act as extensions of the policy file, but by being stored in separate files the same configuration files can be shared by multiple policy files. The options on this menu allow you to locate the following :- - [[goto Selecting the Text Commands File,Text Commands File]] - [[goto Selecting the Text Fragments File,Text Fragments File]] Selecting the Text Commands File ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This option allows you to select the [[goto Using a Text Commands File, Text Commands File]] you wish to use. Selecting the Text Fragments File ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This option allows you to select the [[goto Using a Text Fragments File, Text Fragments File]] you wish to use. 
Load policy file ................ *Detagger* has many program options known as "policies" to help you tailor the conversion process. These policies can be saved in a policy file for later re-use in future conversions. This dialog screen is primarily intended to allow you to load a previously saved policy file. For a fuller description see the section on [[goto using policy files, policy files]]. Options on this screen include: - *Load policies from an existing policy file* [[br]] *Save policies to a policy file for later re-use* [[br]] *Reset all policies to their default values* Save Policy File ................ This window is displayed whenever you wish to save your policies to a file, usually for use in later conversions. To save the file, simply select the policy file name, usually with a .pol extension. This window contains a radio button with two options: *Save only those policies that have changed* If this option is selected, then only those policies that have been loaded from an existing file and/or been edited during the current session will be saved. This is the recommended option, as it will exclude all policies that have been set up correctly automatically. *Save all policies* If this option is selected, then all policies are written to file. This is a good way of documenting the policies used, but is usually too restrictive to be loaded as input into conversions of other files. The saved file is a text file designed so that it may be manually edited and reloaded. If you do so, take care not to change the key phrases at the start of each line. _Note: If you find that conversions that used to work "stop working" it's possibly because you're using a complete policy file. If you find this happens, try creating a new policy file from scratch, or manually removing options from your existing policy file._ Resetting policies to default values .................................... This option will reset all conversion options ("policies") to their default values. 
If a policy file has been loaded, it will be unloaded. $_$_HELP_CHAPTER 4,"Settings menu" Settings menu ------------- $_$_HELP_TOPIC_ID HIDD_SETTINGS The program settings menu allows you to customise the way *Detagger* executes each time it is invoked. This menu has the following options: - $_$_BEGIN_DELIMITED_TABLE [[goto Diagnostic Settings]] Set message filters and alter the error reporting level to control the number and type of messages generated during conversions [[goto Drag and drop settings]] Set the program's properties when invoked by dragging files onto the icon on the desktop [[goto Results viewers settings]] Specify the viewers to be used for viewing results files, and their method of invocation [[goto Use of policy file settings]] Specify any default policy file to be used. $_$_END_DELIMITED_TABLE In addition to the above sub-menus, this menu allows you to toggle the following options, indicated by tick marks. *Show Tool Tips* If checked, tool tips will be available to offer help on the controls on each dialog screen. *Show Status Dialog* If checked, the [[goto Status Window]] will show during the conversion, showing messages describing how the conversion is going. *Automatically view results* If checked, a file viewer will be launched after the conversion to view the results. This will either be a HTML browser or a text editor depending on the type of conversion being done. See [[goto results viewers settings]] *Remember settings on exit* If checked, the program will remember the selected files and conversion details for next time. *[[goto Tip of the Day]]* If selected, the 'Tip of the Day' screen is shown and you can choose whether or not this should also be displayed on startup. Diagnostic settings ................... $_$_HELP_TOPIC_ID HIDD_SETTINGS_MESSAGES These options allow you to set the level of error reporting, or to suppress messages of various types from being displayed during conversion. 
The types of messages include :- $_$_BEGIN_DELIMITED_TABLE *INFO messages* Informational messages. These convey information telling you what has been done and why. *WARNING messages* Warning messages. These tell you that something you have requested has not been done, or something has been done which may not be correct. It's possible you may be able to take corrective action. *TAG ERROR messages* Tagging errors. These only occur when you use the preprocessor in-line tags and directives. *PROGRAM ERROR messages* Program errors. The program has detected it has done something wrong. The conversion may still be successful, but there is nothing you can do about such messages except report them to the program's author at *info<at>jafsoft.com* $_$_END_DELIMITED_TABLE Drag and drop settings ...................... $_$_HELP_TOPIC_ID HIDD_SETTINGS_DRAG These options specify the behaviour of *Detagger* when invoked via drag and drop (i.e. by dropping a file icon on the program's icon). $_$_BEGIN_TABLE *Show the status screen* The status dialog, showing messages reporting how the conversion is going, should be shown. *View results in browser once complete* The selected viewer (browser) for the results files should be invoked on the last file converted once conversion is complete. *Start program after conversion* The program should be launched in Windows mode once the conversion is completed. $_$_END_TABLE Results viewers settings ........................ $_$_HELP_TOPIC_ID HIDD_SETTINGS_VIEWER This identifies the viewers to be used whenever *Detagger* launches an application to view a results or documentation file. Viewers may be required for both HTML (when detagging) and TEXT (when converting to text) files. *Automatically view results files* You can elect to have results viewed automatically after each conversion. This will normally result in the named application being launched to view the last file converted. 
*Command used to view HTML files* For HTML, you can elect to use Dynamic Data Exchange (DDE) to have the results displayed in a currently active browser. This can be quicker and more efficient than launching a new instance of the browser each time. You should ensure your DDE browser matches the program named as the default browser so that if not already active, the program can start a fresh instance. When DDE is used the results will vary from browser to browser. IE for example will come to the front, whereas Netscape will not, and if it is minimised you won't see the results until you maximise the browser again. _NOTE: On some systems problems can occur with DDE that will cause the program to hang whenever it attempts to display a HTML file. When this happens the program will need to be stopped via the task manager. The next time the program runs it will detect that this problem has occurred and disable the use of DDE._ *Add "file://localhost/" prefix* For HTML files viewed from your local hard drive the prefix "file://localhost/" should be used in place of the "http://" used for Internet access. Unfortunately some browsers (take a bow IE 3.0) do not support this, so the addition of this prefix may be disabled if you're using such a browser. *Command used to view TEXT files* For TEXT files, DDE is not currently available, so you simply provide the command to view TEXT files (usually just a text editor or NotePad). Use of policy file settings ........................... $_$_HELP_TOPIC_ID HIDD_SETTINGS_POLICY *Using a default policy file* This determines which [[goto what are policy files?,"policy file"]], if any, is to be used by default when the program is first invoked. The actual policy file used can, of course, be changed via the policy dialogue. The default policy file will also be used if the program is invoked via drag'n'drop. This avoids the need for creating batch files with the policy file name on the command line. 
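For comparison, a batch-style invocation that names the policy file explicitly might be assembled as in the Python sketch below. It follows the usage line given in the console version section (_h2acons filespec1 filespec2 [policy_file.pol] [/qualifiers]_). The function name is hypothetical, and treating the second file specification as the output file is an assumption, not something this document states.

```python
def build_h2acons_command(input_spec, output_spec, policy_file=None, qualifiers=()):
    """Build an argument list matching the documented usage line:

        h2acons filespec1 filespec2 [policy_file.pol] [/qualifiers]

    Assumption: filespec2 names the output. The policy file and the
    qualifiers are optional, as the square brackets in the usage suggest.
    """
    command = ["h2acons", input_spec, output_spec]
    if policy_file is not None:
        command.append(policy_file)
    command.extend(qualifiers)
    return command


# Example: the argument list for one conversion using a saved policy file.
example = build_h2acons_command("index.html", "index.txt", "mysettings.pol")
print(example)
```

The resulting list could be handed to Python's standard _subprocess.run_ on a machine where the console version is installed. Setting a default policy file removes the need for this kind of wrapper when using drag'n'drop.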
*Always reload policy file during conversion* This specifies that the current policy file should be reloaded every time the conversion is done. If the file is large, and you are repeatedly converting using the same policy file, then this can slow you down. On the other hand if you are editing the policy file by hand outside the program between conversions then you will want this option enabled. Tip of the day .............. $_$_HELP_TOPIC_ID HIDD_TIPS The "Tip of the Day" screen is shown by default each time you start up the program. This behaviour can be disabled by clearing the checkbox on the screen. The tip shown will change each time the screen is displayed, and in addition you can review all the tips available by using the buttons marked "<<" and ">>" to go to the previous and next tips. The number of each tip is shown in case you should want to revisit it at a later date. The Tip of the Day screen can be shown at any time by selecting the option on the Settings menu. At present all tips are only available in English. $_$_HELP_CHAPTER 4,"Language menu" Language menu ------------- It is possible to change the user interface to the language of your choice. Translations are provided by a number of volunteers who help convert the menu, dialog, and ToolTips text. The message and documentation text remains in English for the time being. As such these don't offer a full translation, but will hopefully be of some use to those whose first language isn't English. At any given time you may still find English text, especially in the messages displayed, and in the help and documentation files, but it is hoped that the efforts of these volunteers will make the program easier to use for non-English speakers. _Supported languages_ At present work is under way on $_$_BEGIN_TABLE Spanish Gonzalo San Martin is undertaking the Spanish translation. 
Gonzalo operates a highly popular Real Madrid fan page (in Spanish and English) which you can visit at http://members.bigfoot.com/~G.SanMartin/ Gonzalo can be contacted at *G.SanMartin<at>bigfoot.com* Italian The Italian translation is being undertaken by Gianluigi Pizzuto who can be contacted at *gibly<at>libero.it* and has a web page at http://web.tiscalinet.it/fotone Swedish The Swedish translation is being undertaken by Dan Svarreby who can be contacted at *dan.svarreby<at>home.se*. French The French translation is being undertaken by Andre Martinez. Russian The Russian translation is being undertaken by Alexander (aka J-34) at *j34<at>mail.ru* Dutch The Dutch translation is being undertaken by Jurrien Dokter, who can be contacted at *info<at>axswebsolutions.nl* and runs the web site at http://www.axswebsolutions.nl/ $_$_END_TABLE If you would like to volunteer to help with this effort, please email info<at>jafsoft.com (replace "<at>" by "@") or visit the web page at http://www.jafsoft.com/products/translations.html $_$_HELP_CHAPTER 4,"View menu" View menu --------- This menu contains the following options $_$_BEGIN_TABLE Messages from last conversion View the messages window with messages generated in the last conversion by bringing back the [[goto Status window]] Results of last conversion View the last file converted in your preferred browser $_$_END_TABLE Results of last conversion .......................... Once you've converted a file, you can view the results in the browser of your choice. *Detagger* will detect the default browser used on your system. If you wish you can change this through the [[goto settings menu]] You can view results in the selected browser by selecting the option on the view menu or by pressing the View results button on the main screen. *Detagger* can also be configured to automatically review results when run from the command line or in drag'n'drop operation. 
$_$_HELP_CHAPTER 4,"Help menu"
Help menu
---------

The help menu has the following options:-

$_$_BEGIN_TABLE
*Contents*              Brings up the contents page of this help file. Help can be brought up anywhere in the program by pressing F1.
*Register (online)*     This option will take you to the registration page, or - if you have already registered - to the updates page.
*HTML doco (offline)*   Brings up the local copy of the HTML documentation in your preferred browser.
*HTML doco (online)*    Brings up the Internet copy of the HTML documentation in your preferred browser.
*Other products*        Links to web pages for JafSoft and their various software products.
*About*                 Shows the program version and other details. Includes buttons to take you to the home page etc on the web.
$_$_END_TABLE

$_$_HELP_CHAPTER 4,"Update menu"
Update menu
-----------

The update menu has the following option

$_$_BEGIN_DELIMITED_TABLE
*Check for newer versions* This option will take you to the web site, where a check will be made to tell you if this is still the latest version of the software.
$_$_END_DELIMITED_TABLE

$_$_HELP_CHAPTER 3,"Status window"
Status window
-------------
$_$_HELP_TOPIC_ID HIDD_STATUS_DIALOG

The status window is displayed whenever a conversion is in progress. It displays messages showing how the conversion is progressing. You can also bring up this window by selecting the "messages from last conversion" option on the [[goto View menu]]. You can prevent the window from being displayed during conversions by selecting the option from the [[goto Settings menu]].

The messages displayed are usually just informational messages telling you what *Detagger* is doing. You should review these messages and check they don't indicate an error in conversion.

Once conversion is complete you can dismiss the window. You can automate this by ticking the "dismiss on completion" box.

If you wish, you can use the "save to file" button to save the displayed messages to a file.
This can be useful for reviewing messages, extracting URLs reported by the software (if showing URLs is enabled), or for sending details when requesting support. $_$_HELP_CHAPTER 2,"Running from the command line" Console version =============== In addition to the Windows version of *Detagger*, there is a console version. This can be invoked from the command line, and is thus well suited to use in batch and automated conversions. The console version is free to users who register the Windows version. A trial copy of the console version can be obtained by visiting http://www.jafsoft.com/developers/console_demos.html The console version is used from the command line. Most of these command options are also supported by the Windows version, but the console version is better suited to batch operation. The console version is called *h2acons*. You can see a list of the commands by using the command _c:> h2acons /help_ This gives Usage : h2acons filespec1 filespec2 [policy_file.pol] [/qualifiers] Recognised qualifiers include $_$_BEGIN_DELIMITED_TABLE [[popup The /CONCAT command line qualifier, /CONCAT]] Concatenate the results into one file [[popup The /CONSOLE command line qualifier, /CONSOLE]] Write output direct to console [[popup The /DETAG command line qualifier, /DETAG]] Selectively remove HTML markup [[popup The /HELP command line qualifier, /HELP]] Display this useful list of commands ("/?" 
also works) [[popup The /LOG command line qualifier, "/LOG=filename"]] Generate a .log file [[popup The /OUTPUT command line qualifier, "/OUTPUT=filespec"]] Filespec for output file(s) [[popup The /OVERWRITE command line qualifier, /OVERWRITE]] May overwrite input files with the output [[popup The /POLICY command line qualifier, "/POLICY=filename"]] Document policies used in a .pol file [[popup The /SILENT command line qualifier, /SILENT]] Suppress all output messages (except these :-) [[popup The /SUBFOLDERS command line qualifier, /SUBFOLDERS]] Process files that match the filespec in sub-folders as well [[popup The /TREE command line qualifier, /TREE]] Place output files in parallel folder structure to input files $_$_END_TABLE Qualifiers are case insensitive and may be reduced to shortest unique name (e.g. "/lo" for "/log") Most of the configuration options are passed using a "policy file". This is most easily created by running the Windows version, selecting the options you want and then saving those to a policy file. The policy file itself is just a text file, with one policy per line (hard break). If you look at the list of policies in the documentation you can edit this by hand, but usually it's just simpler to use the Windows version. The /CONCAT command line qualifier ---------------------------------- When present this qualifier states that all the results should be output to a single file. This only makes a difference if you've supplied multiple _filespec_'s on the command line, or used a wildcard. The /CONSOLE command line qualifier ----------------------------------- When present this specifies that the output should be written to the console window. This might be useful in piping operations. If you use this, you will usually want to also use the [[popup the /silent command line qualifier,/SILENT qualifier]]. 
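As an illustration of the piping use mentioned above, the following is a minimal Python sketch that runs the console version with /CONSOLE and /SILENT and captures the converted text. It assumes that *h2acons* is on the search path and that /CONSOLE writes to standard output; the function names are hypothetical and are not part of *Detagger*. The result is returned as raw bytes because the output encoding depends on the input.

```python
import subprocess


def build_h2acons_command(filespec, policy_file=None, qualifiers=()):
    """Assemble an argument list for the h2acons console converter.

    Hypothetical helper: only the qualifiers documented in this manual
    should be passed in.
    """
    command = ["h2acons", filespec]
    if policy_file is not None:
        # A policy file is passed in simply by listing it on the command line.
        command.append(policy_file)
    command.extend(qualifiers)
    return command


def convert_to_bytes(filespec, policy_file=None):
    """Run h2acons with /CONSOLE and /SILENT and return what it writes.

    Assumes h2acons is installed and on the search path.
    """
    command = build_h2acons_command(filespec, policy_file, ("/CONSOLE", "/SILENT"))
    completed = subprocess.run(command, capture_output=True, check=True)
    return completed.stdout


# Show the command that would be run, without running it.
print(build_h2acons_command("input.html", "standard.pol", ("/CONSOLE", "/SILENT")))
```

Keeping the command as a list of arguments, rather than one string, avoids quoting problems when a path contains spaces.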
The /DETAG command line qualifier --------------------------------- When present this specifies that *Detagger* should selectively remove HTML markup and create a HTML output file. The default behaviour otherwise is to convert the file to text. If you want to specify which removal options should apply you'll need to create a policy file and add that to the command line. The /HELP command line qualifier -------------------------------- Displays the list of supported qualifiers The /LOG command line qualifier ------------------------------- When present this specifies that *Detagger* will create a .log file listing all the actions it takes and any messages created The /OUTPUT command line qualifier ---------------------------------- When present this will tell *Detagger* where the output should be placed. If omitted the default is to output the results in the same folder as the source file, with an extension (.txt or .html) appropriate to the type of conversion being attempted Examples :- _c:> h2acons input.html /out="c:\my files\output.txt"_ File is output to "c:\my files\output.txt". Because there is a space in the directory path the filename needs to be in quotes _c:> h2acons in*.html /out=c:\output\_ All the files in*.html will be converted and placed in the directory "c:\output\" _c:> h2acons in*.html /concat/detag/out=c:\output\bigfile.html_ In this case the /concat/detag means that *Detagger* will selectively remove markup and concatenate the results in the single file "c:\output\bigfile.html" The /OVERWRITE command line qualifier ------------------------------------- When the [[popup The /DETAG command line qualifier, /DETAG qualifier]] is specified then by default the output file will be a HTML file in the same directory as the source file. In this case *Detagger* could end up replacing the original file by the output file. That is only allowed if the /OVERWRITE qualifier is present. If it isn't, an error message is generated. 
An alternative to using the /OVERWRITE qualifier is to use the [[popup the /OUTPUT command line qualifier, /OUTPUT qualifier]] to direct the output to a different folder, or to a different name in the same folder.

The /POLICY command line qualifier
----------------------------------

When present *Detagger* will create a .pol policy file listing all the policies used in the conversion and their values. You should not normally need to do this unless you want to create a policy file to edit, or want to check that your policies are being used.

To pass *in* a set of policies, just list the policy file on the command line. It must have a .pol extension.

For example the command

_c:> h2acons in*.html input.pol /policy=output.pol_

will read the policies in "input.pol", use those in the conversion, and then create a file "output.pol" listing the policies used, which will be a mixture of default values and those loaded from "input.pol".

The /SILENT command line qualifier
----------------------------------

When present all the messages usually displayed to the console window are suppressed. You'd want to use this if you were using the [[popup the /console command line qualifier,/CONSOLE qualifier]].

The /SUBFOLDERS command line qualifier
--------------------------------------

When present the software will search the sub-folders of the input directory looking for other files that match the input filespec.

See also [[popup The /TREE command line qualifier]]

The /TREE command line qualifier
--------------------------------

When present the software will place output files in a directory structure that matches the input structure. This will only apply when using the /SUBFOLDERS and /OUTPUT options as well.

So for example the command

$_$_BEGIN_PRE
c:> h2acons c:\input\a*.html /output=d:\new\ /subfolders/tree
$_$_END_PRE

would look for all files matching a*.html in the folder c:\input\ and its sub-folders.
The output files will be placed in d:\new\ and sub-folders of that, so for example c:\input\sub\answer.html would be converted to d:\new\sub\answer.txt. If it doesn't already exist, the sub-folder d:\new\sub\ will be created.

See also [[popup The /SUBFOLDERS command line qualifier]]

$_$_HELP_CHAPTER 2,Running from the 'SendTo' menu
$_$_HELP_SUBJECT "Invoking the program from your right-click 'Send To' menu"
Running from the 'SendTo' menu
==============================
$_$_HELP_TOPIC_ID ID_RUN_SENDTO

*Detagger* can make a useful addition to your "Send to" menu (available when you right-click on a file in Explorer).

To add *Detagger* to this menu, simply add a shortcut to your Send To shortcuts directory.

Under Windows 9x this is

    /Windows/SendTo

Under Windows XP this is

    /Documents and Settings/<Your_User_Name>/SendTo

If you want to use a standard policy file (e.g. with a particular colour scheme), then change the properties of the shortcut so that the command is

    *Detagger %1 standard.pol*

$_$_HELP_CHAPTER 2,Working with Unicode
$_$_HELP_SUBJECT "Working with Unicode"
Working with Unicode
====================
$_$_HELP_TOPIC_ID ID_UNICODE

Detagger was not originally designed with Unicode in mind, so support for Unicode text has been added gradually over time. As a result, earlier versions of Detagger may not support all the features described in this manual. If in doubt, please contact JafSoft for details.

$_$_SECTION MAKINGRTFHELP
- [[goto What is Unicode?]]
- [[goto Unicode Byte Order Marks (BOMs)]]
- [[goto Auto-detecting Unicode input]]
- [[goto Creating Unicode output]]
- [[goto Controlling Unicode handling through use of policies]]
$_$_SECTION_ALL

What is Unicode?
----------------

Traditional single-byte character sets interpret the 8-bit character values (128-255) as special characters. So on a Russian machine these would be interpreted as Cyrillic, but on a different machine they could be read (wrongly) as Arabic (and vice versa).
On most English-based PCs, the 8-bit characters are used for the accented characters used in certain European languages, so a Russian text would appear to have lots of accented 'i's, 'e's and 'a's.

Unicode is a way of implementing text that supports multiple types of character sets at the same time so that - for example - it is possible to display Chinese and Cyrillic on the same page unambiguously.

It does this by allocating each character in each language a unique code value, so that codes used for Cyrillic characters no longer overlap and conflict with those assigned to Arabic.

However, these code values are in most cases larger than can be represented in a single byte. As a result a way has to be chosen to represent each character by one or more bytes.

The following Unicode representations are commonly used

_UTF-8_

Each character is represented by between 1 and 4 bytes, depending on which range the Unicode code value falls into. This has the advantage that all ASCII characters are a single byte, so for example every character of the HTML tags in a document is represented by a single byte. This also means there are no null bytes contained in the text, which can make programming software to work with this text easier.

_UTF-16_

Each character is represented by a 2-byte pair (characters outside the original 16-bit range require two such pairs). The 2-byte pair is just the numerical representation of the Unicode value of each character. This makes the files easier to interpret, but also means that the byte order depends on how the machine stores its bytes - i.e. is the machine big-endian or little-endian.

Because ASCII characters have a Unicode value less than 128, the ASCII characters map onto byte pairs in which one of the bytes is null.

Because each character requires two bytes, a single byte wrongly inserted into a UTF-16 stream will render all the text that follows as gibberish.
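The differences between these representations can be seen in a short Python sketch. It is purely illustrative and has nothing to do with how *Detagger* itself is implemented; the function name and the sample characters are arbitrary choices. It prints the bytes used for a Latin, a Cyrillic and a Chinese character, and then shows what happens when a stray byte is inserted into a UTF-16 stream.

```python
def encoded_forms(character):
    """Return the UTF-8, UTF-16LE and UTF-16BE byte sequences for one character."""
    return {
        "utf-8": character.encode("utf-8"),
        "utf-16-le": character.encode("utf-16-le"),
        "utf-16-be": character.encode("utf-16-be"),
    }


# Latin capital A, Cyrillic capital ZHE, and a common Chinese character.
for character in ["A", "\u0416", "\u4e2d"]:
    forms = encoded_forms(character)
    as_hex = {name: data.hex(" ") for name, data in forms.items()}
    print(f"U+{ord(character):04X}", as_hex)

# A stray byte at the front of a UTF-16 stream shifts every pair that follows,
# so the decoded text no longer resembles the original.
clean = "AB".encode("utf-16-le")
shifted = (b"\x20" + clean).decode("utf-16-le", errors="replace")
print(repr(shifted))
```

Note how the ASCII character needs one byte in UTF-8 but two in UTF-16, with one of the two being null, and how the two UTF-16 forms differ only in byte order.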
Unicode Byte Order Marks (BOMs)
-------------------------------

Files that contain Unicode usually identify themselves by inserting a "Byte Order Mark" (BOM) at the top of the file. This is a two-byte marker for UTF-16 files and a three-byte marker for UTF-8 files.

Modern applications will test for this byte marker and if present will then know how to interpret the contents of the file. For example Notepad as supplied with Windows XP can do this, whereas Notepad as supplied with Windows 98 could not.

In UTF-16 each character is represented by two bytes, and computers can store a two-byte value in different ways (known as "big-endian" and "little-endian"). Each operating system uses one method or the other and it isn't usually an issue, but when Unicode files get passed from one machine to another, this becomes important.

The BOM allows the two forms of UTF-16 (known as "UTF-16BE" and "UTF-16LE") to be distinguished.

Auto-detecting Unicode input
----------------------------

The software has some ability to auto-detect Unicode text, and will generally do so under the following circumstances

- a 3-byte Byte Order Mark (BOM) is detected at the top of a UTF-8 input file

- a 2-byte Byte Order Mark (BOM) is detected at the top of a UTF-16 input file

- the input HTML contains an HTML entity that maps onto a Unicode code value which can't be converted into an ANSI or ASCII equivalent. In this case, although the input HTML may not have been encoded as Unicode, the output will need to be in order to correctly display the Unicode character.

Creating Unicode output
-----------------------

The software will create Unicode output whenever it detects that the input files were Unicode, or wherever Unicode characters have been detected in the HTML entities of the original.

At present all Unicode output files will be UTF-8.
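For readers who want to check their own files, the BOM test described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the detection code used by *Detagger*. The function names are hypothetical, and the fallback code page "cp1252" is an assumption standing in for "ANSI" on a Western European Windows system.

```python
import codecs

# The three Byte Order Marks discussed above, paired with a decoder name.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def detect_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no known BOM is present."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def decode_with_bom(data, fallback="cp1252"):
    """Decode bytes using the BOM if there is one, else the fallback code page."""
    encoding, bom_length = detect_bom(data)
    if encoding is None:
        return data.decode(fallback, errors="replace")
    # Skip the marker itself so that it does not appear in the decoded text.
    return data[bom_length:].decode(encoding)


print(detect_bom(b"\xef\xbb\xbf<html>"))
print(detect_bom(b"<html>"))
```

A file without a BOM may still be Unicode, which is why a policy for stating the input encoding explicitly is useful.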
Controlling Unicode handling through use of policies
----------------------------------------------------

The following policies can be used to control the handling of Unicode during the conversion :-

*[[popup input text encoding]]*
By default the software will attempt to auto-detect whether or not the input is Unicode, but if this fails you can explicitly tell the software the encoding using this policy.

*[[popup May add Unicode marker to output file]]*
When Unicode is detected in the source the software will output the text as UTF-8 and optionally add a file marker that will label the file as "Unicode" in a way that most applications that can cope with Unicode will recognize.

*[[popup Allow ANSI alternatives (e.g. space for  )]]*
Certain common HTML entities don't have a single ANSI character but have common ASCII representations. If you enable this policy you tell the software to use ASCII/ANSI alternatives where possible, thereby reducing the chance of Unicode being necessary for the output file.

$_$_HELP_CHAPTER 1,"Using policy files"
Using policy files
******************
$_$_HELP_TOPIC_ID ID_POLICY_FILES

Options available from the [[goto conversion options menu]] are also known as "policies". All of the [JafSoft] conversion tools use the same policy file mechanism. Options can be saved to these policy files so that different sets of options can be saved and then easily reloaded during later conversions.

Policy files are just plain text files, with one policy per line. Each policy line takes the form

    <policy name> : <policy value>

You can edit these files in a text editor, but must be careful to use the correct policy name. In *Detagger* almost all policies can be set through the conversion options menu.
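Because the format is so simple, a policy file is easy to read from a script, for example to check which options a batch job will use. The following Python sketch assumes only the "name : value" layout described above. The function name is hypothetical, and the sample values ("yes", "80" and the file path) are illustrative guesses rather than documented settings.

```python
def parse_policy_text(text):
    """Parse 'policy name : policy value' lines into a dictionary.

    Lines without a colon, or with nothing before it, are ignored.
    """
    policies = {}
    for line in text.splitlines():
        # Split on the first colon only, so that values such as
        # Windows paths may themselves contain colons.
        name, separator, value = line.partition(":")
        if not separator or not name.strip():
            continue
        policies[name.strip()] = value.strip()
    return policies


# Policy names are taken from this manual; the values are illustrative only.
sample = r"""
Remove HTML <SCRIPT> section : yes
Target page width : 80
Text commands file : c:\my files\commands.txt
"""

for name, value in parse_policy_text(sample).items():
    print(f"{name!r} -> {value!r}")
```

If the same policy appears twice, this sketch keeps the last value; how *Detagger* itself treats duplicates is not covered here.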
$_$_SECTION MAKINGRTFHELP
There is a complete [[goto alphabetical list of *Detagger* policies]]

$_$_SECTION MAKINGHTML
*Contents of this section*
$_$_CONTENTS_LIST 2,,2

$_$_SECTION MAKINGRTF
See also :-

[[goto Markup removal policies]]
    - [[goto Detag policies]]
    - [[goto Detag Tables policies]]
    - [[goto Tag manipulation policies]]

[[goto Text conversion policies]]
    - [[goto Data Extraction policies]]
    - [[goto Text Format policies]]
    - [[goto Page Layout options]]
    - [[goto Bullet options]]
    - [[goto Heading options]]
    - [[goto Text Hyperlink policies]]
    - [[goto Text Marker policies]]
    - [[goto Text Paragraph policies]]
    - [[goto Text Table policies]]
    - [[goto Miscellaneous formatting policies]]
    - [[goto Dialogue options]]
    - [[goto Other Options]]
    - [[goto Unicode Options]]

[[goto Miscellaneous policies]]

$_$_SECTION ALL

What are Policy files?
======================
$_$_HELP_TOPIC_ID ID_DEFINEPOLICYFILES

*Detagger* has a large number of options available to influence the processing of your text files. These options are called "policies" as they govern how the source file should be interpreted and converted.

Policies may be saved in text files, known as policy files. These files have a ".pol" extension by default. The policy files are usually updated by changing the policies and saving the changes in a new file.

Because they are text files you can also edit them directly in a text editor. The files have one policy per line of text, in the form

    PolicyText : <policy value>

The use of policy files allows a given set of options to be saved and reused for other conversions, or later conversions of the same file.

See [[goto Using policy files]] for more information.

$_$_HELP_CHAPTER 2,"Complete alphabetical list of Detagger policies"
Alphabetical list of Detagger policies
======================================

Here is an alphabetical list of policy names. Where possible a link is supplied to the equivalent option in the user interface.
- [[popup Add border to all tables]] - [[popup Add delimited table markers]] - [[popup Add URL references at end of file]] - [[popup Adjust table to page width]] - [[popup Allow 8-bit ANSI values in output]] - [[popup Allow ANSI alternatives (e.g. space for  )]] - [[popup Allow blank row separator lines]] - [[popup Allow by-line to be used for Author field]] (not available in the GUI) - [[popup Allow headings inside tables]] - [[popup Apply extra dialogue checks]] - [[popup Attempt to parse tables]] - [[popup Break lines where dialogue starts in the middle]] - [[popup Bullet point characters]] - [[popup Concatenate results into one file]] - [[popup Convert only innermost tables]] - [[popup Convert tags to lower case]] - [[popup Convert tags to upper case]] - [[popup Default table indentation]] - [[popup Display link URLs]] - [[popup End list marker]] - [[popup End table marker]] - [[popup First line indent for paragraphs]] - [[popup Fragments file]] - [[popup Heading underline characters]] - [[popup Highlight bold and italic text]] - [[popup Ignore table WIDTH attributes]] *New in version 2.4* - [[popup Impose a page width on the output]] - [[popup Insert gap between sentences]] - [[popup Input text encoding]] - [[popup Keep deprecated tag attributes]] - [[popup Keep deprecated tags]] - [[popup Lines to ignore at end of file]] - [[popup Lines to ignore at start of file]] - [[popup List item templates]] *New in version 2.4* - [[popup Look for dialogue lines]] - [[popup Maximum table depth]] *New in version 2.4* - [[popup May add Unicode marker to output file]] - [[popup May break words to fit target width]] *New in version 2.4* - [[popup Minimum table depth]] *New in version 2.4* - [[popup Nested Table scaling factor (percent)]] *New in version 2.4* - [[popup Omit email hyperlinks from the output]] - [[popup Omit local hyperlinks from the output]] - [[popup Output each paragraph on a single line]] - [[popup Output indentation positions]] - [[popup Output table 
format]] - [[popup Preserve all white space from the original source]] - [[popup Preserve hyperlinks in text output]] - [[popup Preserve short lines]] - [[popup Remove <!DOCTYPE> tags]] - [[popup Remove all horizontal rules and lines]] - [[popup Remove all HTML tags]] - [[popup Remove all non-HTML tags]] - [[popup Remove all tags]] - [[popup Remove all x-tags used in mail messages]] - [[popup Remove emphasis tags]] - [[popup Remove HTML <DIV> tags]] *New in version 2.4* - [[popup Remove HTML <FONT> tags]] - [[popup Remove HTML <FORM>..</FORM> tags]] - [[popup Remove HTML <HEAD> section]] - [[popup Remove HTML <IMG> tags]] - [[popup Remove HTML <OBJECT> tags]] - [[popup Remove HTML <P> tags from tables]] - [[popup Remove HTML <SCRIPT> section]] - [[popup Remove HTML <SPAN> tags]] *New in version 2.4* - [[popup Remove HTML alignment attributes from tables]] - [[popup Remove HTML color attributes from tables]] - [[popup Remove HTML size attributes from tables]] - [[popup Remove HTML table tags]] - [[popup Remove HTML-style comments]] - [[popup Remove Microsoft Office tags]] - [[popup Remove non-standard tag attributes]] - [[popup Remove non-standard tags]] - [[popup Remove style sheet]] - [[popup Replace <IMG> tags by a text marker]] - [[popup Replace entities by text]] - [[popup Replace hyperlinks by the display value]] - [[popup Right justify the output text]] - [[popup Show only table data]] *New in version 2.4* - [[popup Start list marker]] - [[popup Start table marker]] - [[popup Suppress borders on nested tables]] - [[popup Target page width]] - [[popup Target table width]] - [[popup Text bullet characters]] - [[popup Text commands file]] - [[popup Text to replace omitted links by]] - [[popup Treat short lines as paragraph endings]] - [[popup Use the ALT attribute to replace <IMG> tags]] $_$_HELP_CHAPTER 2,"Markup removal policies" Markup removal policies ======================= These policies allow you to control the tag removal process. 
You can choose what tags are to be removed, or what manipulations you'd like performed on each tag.

[[goto Detag policies]]
[[goto Detag tables policies]]
[[goto tag manipulation policies]]

$_$_HELP_CHAPTER 3,"Detag policies"
Detag policies
--------------
$_$_HELP_TOPIC_ID HIDD_DETAG

These policies allow you to choose which tags are to be removed from the markup.

[[popup Remove all tags]]
- [[popup Remove all HTML tags]]
- [[popup Remove <!DOCTYPE> tags]]
- [[popup Remove HTML <HEAD> section, remove <HEAD>...</HEAD> section]]
- [[popup Remove HTML-style comments, remove all comments]]
- [[popup Remove HTML <SCRIPT> section, remove all <SCRIPT> sections]]
- [[popup Remove HTML <OBJECT> tags, remove all <OBJECT>..</OBJECT> sections]]
- [[popup Remove HTML <DIV> tags, remove all <DIV> tags]]
- [[popup Remove HTML <FORM>..</FORM> tags, remove all <FORM>..</FORM> tags]]
- [[popup Remove HTML <FONT> tags, remove all <FONT> tags]]
- [[popup Remove HTML <SPAN> tags, remove all <SPAN> tags]]
- [[popup Remove emphasis tags]]
- [[popup Replace hyperlinks by the display value, remove all hyperlinks]]
- [[popup Remove style sheet]]
- [[popup Remove HTML <IMG> tags, remove all <IMG> tags]]
- [[popup Remove non-standard tags]]
- [[popup Keep deprecated tags, leave in "deprecated" tags]]
- [[popup Remove non-standard tag attributes]]
- [[popup Keep deprecated tag attributes, leave in "deprecated" attributes]]
- [[popup Remove all non-HTML tags]]
- [[popup Remove all x-tags used in mail messages, remove tags and attributes added by email systems]]
- [[popup Remove Microsoft Office tags, remove tags and attributes added by MS Office]]

Remove All Tags
...............

_Menu location: Configuration Options -> Markup Manipulation ->Detagger options_

When this option is selected, all tags will be removed from the markup. This effectively turns the conversion into an HTML-to-text conversion.

Remove All HTML Tags
....................
_Menu location: Configuration Options -> Markup Manipulation ->Detagger options_ When this option is selected, all HTML tags will be removed from the markup. This will leave only the non-HTML tags. Remove <!DOCTYPE> tags ...................... _Menu location: Configuration Options -> Markup Manipulation ->Detagger options_ When this option is selected any <!DOCTYPE> tag in the document will be removed. This might be a useful precursor to remove a no longer valid DOCTYPE e.g. if you were concatenating the results or migrating the files to XML of some description. Remove HTML <HEAD> section .......................... _Menu location: Configuration Options -> Markup Manipulation ->Detagger options_ When this option is selected any <HEAD>...</HEAD> section in the document is removed. This might be a useful precursor to merging two HTML files together, although the <HTML> and <BODY> tags will be left in the output. Remove HTML-style comments .......................... _Menu location: Configuration Options -> Markup Manipulation ->Detagger options_ When this option is selected all <!-- ... --> style comments are removed. Remove HTML <SCRIPT> section ............................ _Menu location: Configuration Options -> Markup Manipulation ->Detagger options_ When this option is selected all <SCRIPT>..</SCRIPT> sections are removed, effectively removing all scripted content from the file. Remove HTML <OBJECT> tags ......................... _Menu location: Configuration Options -> Markup Manipulation ->Detagger options_ When this option is selected all <OBJECT>..</OBJECT> sections are removed, effectively removing embedded active content from the file. Remove HTML <DIV> tags ....................... *New in version 2.4* _Menu location: Configuration Options -> Markup Manipulation ->Detagger options_ When this option is selected all <DIV..> and </DIV> tags are removed. Remove HTML <SPAN> tags ....................... 
*New in version 2.4*

_Menu location: Configuration Options -> Markup Manipulation ->Detagger options_

When this option is selected all <SPAN..> and </SPAN> tags are removed.

Remove HTML <FORM>..</FORM> tags
................................

_Menu location: Configuration Options -> Markup Manipulation ->Detagger options_

When this option is selected all tags associated with forms will be removed. This includes the <FORM>, <INPUT>, <SELECT>, <OPTION>, <TEXTAREA>, <FIELDSET>, <LEGEND> and <LABEL> tags.

Any visible markup (such as tables) integrated with the <FORM> is left intact. This may not always be desirable, and this issue may be addressed in later releases.

Remove HTML <FONT> tags
.......................

_Menu location: Configuration Options -> Markup Manipulation ->Detagger options_

When this option is selected all <FONT..> and </FONT> tags are removed.

<FONT> tags are deprecated in later versions of HTML in favour of CSS style sheets. <FONT> tags can also make a file much larger than it need be, so removing them can be desirable for a number of reasons.

Remove emphasis tags
....................

_Menu location: Configuration Options -> Markup Manipulation ->Detagger options_

When this option is selected, all emphasis markup such as <B>, <I>, <STRONG> and <EM> tags is removed.

Replace hyperlinks by the display value
.......................................

_Menu location: Configuration Options -> Markup Manipulation ->Detagger options_

When this option is selected all hyperlinks are removed. Each hyperlink is replaced by its visible content; only the "link" part is removed.

Remove style sheet
..................

_Menu location: Configuration Options -> Markup Manipulation ->Detagger options_

When this option is selected all <STYLE>..</STYLE> sections are removed, along with all references to external style sheets.

Remove HTML <IMG> tags
......................
_Menu location: Configuration Options -> Markup Manipulation ->Detagger options_

When this option is selected all images defined as <IMG> tags are removed.

Remove non-standard tags
........................

_Menu location: Configuration Options -> Markup Manipulation ->Detagger options_

When this option is selected all tags not recognised as being part of HTML are removed. The standard used is currently HTML 4.0 Transitional.

See also [[popup Keep deprecated tags]]

Keep deprecated tags
....................

_Menu location: Configuration Options -> Markup Manipulation ->Detagger options_

When the option [[popup Remove non-standard tags]] is selected, this option determines that any "deprecated" tags may be left in. These are tags recognised in earlier versions of HTML, but which are no longer strictly supported.

Usually these tags are still supported in browsers, and sometimes removing these tags will adversely affect your page's appearance, as newer forms of tagging are often required to achieve the same effect.

Remove non-standard tag attributes
..................................

_Menu location: Configuration Options -> Markup Manipulation ->Detagger options_

When this option is selected all tag attributes not recognised as being part of HTML are removed. The standard used is currently HTML 4.0 Transitional.

_Note, in HTML 4.0 many tag attributes were deprecated, so that HTML code that relied heavily on tag attributes (e.g. to set paragraph alignment) may need to be heavily re-written to achieve the same effect under the later standard._

See also _[[popup Keep deprecated tag attributes]]_

Keep deprecated tag attributes
..............................

_Menu location: Configuration Options -> Markup Manipulation ->Detagger options_

When the option _[[popup Remove non-standard tag attributes]]_ is selected, this option determines that any "deprecated" tag attributes may be left in.
These are attributes that were recognised in earlier versions of HTML, but which are no longer strictly supported.

Usually these attributes are still supported in browsers, and sometimes removing these attributes will adversely affect your page's appearance, as newer forms of tagging are often required to achieve the same effect.

Remove all non-HTML tags
........................

_Menu location: Configuration Options -> Markup Manipulation ->Detagger options_

When selected all tags not recognised as HTML will be removed from the file. Depending on the source of the HTML these could be XML tags, or proprietary tags added by the software used to create the HTML.

Remove all x-tags used in mail messages
.......................................

_Menu location: Configuration Options -> Markup Manipulation ->Detagger options_

When selected, tags believed to have been added by an email package will be removed. For example Eudora adds tags in the form <x-whatever> to mark up key aspects of an HTML message. This option should remove such tags.

Remove Microsoft Office tags
............................

_Menu location: Configuration Options -> Markup Manipulation ->Detagger options_

When selected, tags believed to have been added by MS Office applications are removed. Some versions of MS Office add a lot of tagging - particularly to HTML exported from Excel - to describe a document's ownership and structure. This option will attempt to remove (and to a limited extent tidy up) such markup.

$_$_HELP_CHAPTER 3,"Detag tables policies"
Detag tables policies
---------------------
$_$_HELP_TOPIC_ID HIDD_DETAG_TABLES

These options allow a few limited tag manipulations to be performed on any <TABLE>s in the HTML during the markup removal.
- [[popup Remove HTML table tags, Remove all table tags]] _Paragraphs_ - [[popup Remove HTML <P> tags from tables, Replace <p>..</p> tags by <br> tags]] _Attributes_ - [[popup Remove HTML alignment attributes from tables, Remove all alignment attributes]] - [[popup Remove HTML color attributes from tables, Remove all colour attributes]] - [[popup Remove HTML size attributes from tables, Remove all size attributes]] Remove HTML table tags ...................... _Menu location: Configuration Options -> Markup Manipulation ->Tables_ When this option is selected all "table" tags are removed (i.e. <table>, <tr>,<th>,<td>,<thead>,<tbody> and <tfoot>). This effectively removes all the table structure in a document, which can be useful if you want to view the HTML on a device with a small display less suited to tables. Remove HTML <P> tags from tables ................................ _Menu location: Configuration Options -> Markup Manipulation ->Tables_ When selected this option will replace <p>..</p> tags by a suitable pattern of <br> tags. <br> tags will be inserted between each paragraph in each cell. If a cell has only one paragraph, nothing is inserted. This option can be useful when tidying up HTML created by certain Word processing packages which needlessly insert <p>..</p> markup. Remove HTML alignment attributes from tables ............................................ _Menu location: Configuration Options -> Markup Manipulation ->Tables_ When selected all alignment attributes ("align" and "valign") are discarded from the various TABLE, TBODY, THEAD, TFOOT, TR, TH and TD tags Remove HTML color attributes from tables ........................................ _Menu location: Configuration Options -> Markup Manipulation ->Tables_ When selected all colour attributes ("bgcolor") are discarded from the various TABLE, TBODY, THEAD, TFOOT, TR, TH and TD tags Remove HTML size attributes from tables ....................................... 
_Menu location: Configuration Options -> Markup Manipulation -> Tables_

When selected all sizing attributes ("cellpadding", "cellspacing", "height" and "width") are discarded from the various TABLE, TBODY, THEAD, TFOOT, TR, TH and TD tags.

Note that logical size attributes such as "colspan" and "rowspan" are left intact, as is "border".

$_$_HELP_CHAPTER 3,"Tag manipulation policies"
Tag manipulation policies
-------------------------
$_$_HELP_TOPIC_ID HIDD_TAG_MANIPULATION

These options allow a few limited tag manipulations to be performed during the markup removal.

*tag conversions*

- [[popup Convert tags to lower case]]
- [[popup Convert tags to UPPER case]]

*character replacements*

- [[goto Replace entities by text,Replace character entities by ANSI text]]
- [[popup Allow ANSI alternatives (e.g. space for  )]]
- [[popup Allow 8-bit ANSI values in output, Use 7-bit ASCII alternatives where possible]]

Convert tags to lower case
..........................

_Menu location: Configuration Options -> Markup Manipulation -> Tag manipulation_

When selected all tags and attributes will be converted to lower case. Any attribute values will be left unchanged.

Convert tags to UPPER case
..........................

_Menu location: Configuration Options -> Markup Manipulation -> Tag manipulation_

When selected all tags and attributes will be converted to UPPER CASE. Any attribute values will be left unchanged.

Replace entities by text
........................

_Menu location: Configuration Options -> Markup Manipulation -> Tag manipulation_

When selected any character entities such as _&copy;_ or _&pound;_ will be converted to their ANSI equivalents. Note, not all such entities have an ANSI equivalent, and those that don't will be left unchanged.

The ANSI character set is an "8-bit" character set, that is, each character is represented by a value in the range 0-255. Of these the first 128 characters are known as the "7-bit" characters.
Whilst almost all character sets support 7-bit characters, not all support the "upper" 8-bit characters, so you may not want to allow 8-bit characters. For that reason there are two more options.

See also [[popup Allow ANSI alternatives (e.g. space for  )]] and [[popup Allow 8-bit ANSI values in output]]

Allow ANSI alternatives (e.g. space for  )
...............................................

_Menu location: Configuration Options -> Markup Manipulation -> Tag manipulation_

When selected, any entity that maps onto an upper 8-bit character will be allowed (e.g. &copy; will be replaced by the copyright symbol).

Allow 8-bit ANSI values in output
.................................

_Menu location: Configuration Options -> Markup Manipulation -> Tag manipulation_

The GUI shows this option as "Use 7-bit ASCII alternatives where possible" (which, in fact, has the exact opposite meaning). When that is set, this policy is disabled (and vice versa).

When this policy is enabled (i.e. 'Use 7-bit...' is unchecked), any 8-bit characters are allowed to pass unchanged. The display of such characters will depend on the language and operating system of the computer used to view the results file.

When this policy is disabled (i.e. 'Use 7-bit...' is checked), any entity that can be approximated by using one or more 7-bit characters will be replaced by that approximation. For example _&nbsp;_ will become a single space, while _&mdash;_ will become two hyphens _"--"_.

$_$_HELP_CHAPTER 2,"Text conversion policies"
Text conversion policies
========================

The following screens allow access to options to control various aspects of the text conversion process.
- [[goto Data Extraction policies]]
- [[goto Text Format policies]]
- [[goto Text Hyperlink policies]]
- [[goto Text Marker policies]]
- [[goto Text Paragraph policies]]
- [[goto Text Table policies]]
- [[goto Miscellaneous formatting policies]]

$_$_HELP_CHAPTER 3,"Data Extraction policies"
Data Extraction policies
------------------------
$_$_HELP_TOPIC_ID HIDD_TEXT_EXTRACT

*New in version 2.4*

These policies help you refine the use of Detagger as a data extraction tool. Note, Detagger wasn't designed as a data mining tool, but with these options you can choose to focus only on data in tables at a certain level, and then turn the selected data into a delimited format that makes it easier to post-process the results file (e.g. to import it into a spreadsheet or database).

*Table extraction policies*

These options can be used to tell the program to extract data only from tables, or to extract only certain level(s) of nested table from the source document.

- [[goto Show only table data]]
- [[goto Maximum table depth, Maximum table depth to extract data from in nested tables]]
- [[goto Minimum table depth, Minimum table depth to extract data from in nested tables]]

*Table data handling*

These options specify how the data inside tables should be handled. Only one option may be selected. If comma- or tab-delimited data is selected, then the text will be output between table delimiters.

In the Policy File the policy [[popup Output table format]] takes the value 1, 2 or 3 according to which of the following is selected. In the user interface you need only select the desired option.
- [[popup Convert table to plain text]]
- [[popup Convert table to comma-delimited data]]
- [[popup Convert table to tab-delimited data]]

If you choose to convert the table into a delimited data format, the following two options become available :-

- [[popup Convert only innermost tables,Only convert the innermost TABLE tag]]
- [[popup Add delimited table markers, Add markers around the delimited text]]

Show only table data
....................

*New in version 2.4*

_Menu location: Configuration Options -> Conversion to text -> Data extraction_

When enabled, only that part of the input file contained in HTML <TABLE> markup will be included in the output. How that data is formatted will depend on the other [[goto text table policies]], and the remaining [[goto data extraction policies]] that have been chosen.

To further refine which table data is extracted, use the [[goto Minimum table depth]] and [[popup Maximum table depth]] policies.

Minimum table depth
...................

*New in version 2.4*

_Menu location: Configuration Options -> Conversion to text -> Data extraction_

This policy, together with [[popup Maximum table depth]], specifies a range of table depths from which data should be extracted.

Many HTML pages use tables to lay out a page as well as to mark up tabular data. These nested table structures can be quite complex and can get in the way of accessing the inner data. Similarly, menus and the like are often implemented by "mini" tables inside tables, inside tables etc.
With these policies you can elect to ignore all data placed in tables at a level higher or lower than that of interest.

Consider the HTML code

$_$_BEGIN_PRE
<TABLE WIDTH="100%">
<TR>
  <TD WIDTH="30%">
    <!--- left hand menu -->
    <a href="http://www.jafsoft.com/products.html">This is menu item 1</a><br>
    <a href="http://www.jafsoft.com/detagger/">This is menu item 2</a><br>
  </TD>
  <TD WIDTH="70%">
    <!-- main part of the page -->
    <TABLE>
    <TR> <TH>Date</TH> <TH>Total</TH> </TR>
    <TR> <TD>April 13</TD> <TD>1,001</TD> </TR>
    <TR> <TD>May 21</TD> <TD>908</TD> </TR>
    </TABLE>
  </TD>
</TR>
</TABLE>

<TABLE WIDTH="100%">
<TR>
  <TD COLSPAN="2">
    <TABLE WIDTH="100%">
    <TR>
      <TD>
        <TABLE WIDTH="100%">
        <TR>
          <TD><a href="http://www.jafsoft.com/asctohtm/">Text to HTML</a></TD>
          <TD><a href="http://www.jafsoft.com/asctotab/">Text to table</a></TD>
          <TD><a href="http://www.jafsoft.com/asctortf/">Text to RTF</a></TD>
          <TD><a href="http://www.jafsoft.com/asctopdf/">Text to PDF</a></TD>
          <TD><a href="http://www.jafsoft.com/detagger/">HTML to text</a></TD>
        </TR>
        </TABLE>
      </TD>
    </TR>
    </TABLE>
  </TD>
</TR>
</TABLE>
$_$_END_PRE

When converted to text this becomes

$_$_BEGIN_PRE
+-------------------+----------------+
|This is menu item 1|+--------+-----+|
|This is menu item 2||Date    |Total||
|                   |+--------+-----+|
|                   ||April 13|1,001||
|                   |+--------+-----+|
|                   ||May 21  |908  ||
|                   |+--------+-----+|
+-------------------+----------------+

+---------------------------------------------------------------------+
|+-----------------------------------------------------------------+  |
||+------------+-------------+-----------+-----------+------------+|  |
|||Text to HTML|Text to table|Text to RTF|Text to PDF|HTML to text||  |
||+------------+-------------+-----------+-----------+------------+|  |
|+-----------------------------------------------------------------+  |
+---------------------------------------------------------------------+
$_$_END_PRE

That is, the page consists of a table of data containing dates and totals, but in
addition this is placed in an outer table with a menu on the left. At the foot of the page is a navigation menu, implemented as a heavily nested table.

NOTE: The hyperlinks have been removed using the [[goto text hyperlink policies]].

To get rid of the outer table set the *Minimum table depth* to 2. Doing this gives the following

$_$_BEGIN_PRE
+--------+-----+
|Date    |Total|
+--------+-----+
|April 13|1,001|
+--------+-----+
|May 21  |908  |
+--------+-----+

+-----------------------------------------------------------------+
|+------------+-------------+-----------+-----------+------------+|
||Text to HTML|Text to table|Text to RTF|Text to PDF|HTML to text||
|+------------+-------------+-----------+-----------+------------+|
+-----------------------------------------------------------------+
$_$_END_PRE

Note that the text in the outer table (in this case the menu text) has been discarded, but we still have a doubly nested footer table. If we also set the *Maximum table depth* to 2, we get

$_$_BEGIN_PRE
+--------+-----+
|Date    |Total|
+--------+-----+
|April 13|1,001|
+--------+-----+
|May 21  |908  |
+--------+-----+

++
||
++
$_$_END_PRE

That is, the footer becomes an empty table (when borders are displayed).

Using the [[goto Data Extraction policies, table data handling]] options you can now convert this to, for example, CSV format.

$_$_BEGIN_PRE
"Date","Total"
"April 13","1,001"
"May 21","908"
$_$_END_PRE

Note, we've switched off borders and removed delimited table markers in this output.

Note, the *Minimum table depth* and *Maximum table depth* are fairly broad brush policies - if there had been multiple tables with nested content at level 2, then it would all be included in the output. That said, these policies do offer some prospect of focusing the output on the data you want.

Maximum table depth
...................
*New in version 2.4*

_Menu location: Configuration Options -> Conversion to text -> Data extraction_

See the discussion in [[goto Minimum table depth]]

Output table format
...................

_Menu location: Configuration Options -> Conversion to text -> Data extraction_

In the Policy File this policy takes the value 1, 2 or 3 according to which of the following is selected.

- [[popup Convert table to plain text]]
- [[popup Convert table to comma-delimited data]]
- [[popup Convert table to tab-delimited data]]

Any tables processed during the conversion will then be formatted according to the selection made.

Convert table to plain text
...........................

_Menu location: Configuration Options -> Conversion to text -> Data extraction_

When selected any tables will be converted into plain text. The software will look at the row and column structure of the HTML original, and attempt to lay this out in the current page width, although this may not always be possible.

See also [[popup Output table format]]

Convert table to comma-delimited data
.....................................

_Menu location: Configuration Options -> Conversion to text -> Data extraction_

When selected any tables will be converted into comma-delimited data. Each row is put out as a row of comma-delimited data, with the data values themselves in quotes. The resulting data is in a format suitable for importing into spreadsheets.

Where tables are nested, by default only the innermost table will be converted in this way.

Because there could be multiple tables in a file, each table is delimited as follows

$_$_BEGIN_PRE
$_$_BEGIN_COMMA_DELIMITED_TABLE
... comma-delimited data rows ...
$_$_END_TABLE
$_$_END_PRE

You can change this behaviour through the options:-

- [[popup Convert only innermost tables]]
- [[popup Add delimited table markers]]

See also [[popup Output table format]]

Convert table to tab-delimited data
...................................
_Menu location: Configuration Options -> Conversion to text -> Data extraction_

When selected any tables will be converted into tab-delimited data. Each row is put out as a row of tab-delimited data, with the data values themselves in quotes. The resulting data is in a format suitable for importing into spreadsheets.

Where tables are nested, by default only the innermost table will be converted in this way.

Because there could be multiple tables in a file, each table is delimited as follows

$_$_BEGIN_PRE
$_$_BEGIN_DELIMITED_TABLE
... tab-delimited data rows ...
$_$_END_TABLE
$_$_END_PRE

You can change this behaviour through the options:-

- [[popup Convert only innermost tables]]
- [[popup Add delimited table markers]]

See also [[popup Output table format]]

Convert only innermost tables
.............................

_Menu location: Configuration Options -> Conversion to text -> Data extraction_

For nested tables, conversion to a delimited format can become a bit of a nightmare. Usually it will be the innermost table which is data, with outer tables being used for page layout, and so by default the software will only convert the innermost table of a nested set into delimited format.

However this may not always be the case, and so this option can be switched on to convert all levels of table into delimited format. In such cases you may have to tidy up the text (to delete any unwanted portions) before importing it into a spreadsheet.

Note, from version 2.4 onwards, the policies [[popup maximum table depth]] and [[popup minimum table depth]] offer a bit more control than this policy.

Add delimited table markers
...........................

_Menu location: Configuration Options -> Conversion to text -> Data extraction_

By default markers are put round the delimited data to separate it from normal text, and from other sections of delimited data.
This makes sense for a file which contains non-table elements, or multiple tables, but probably isn't needed for files that contain just a single table.

When enabled each table (or sub-table) will appear like this

$_$_BEGIN_PRE
$_$_BEGIN_DELIMITED_TABLE
... delimited data rows ...
$_$_END_TABLE
$_$_END_PRE

You can use this option to control this behaviour.

$_$_HELP_CHAPTER 3,"Formatting policies"
Text Format policies
--------------------
$_$_HELP_TOPIC_ID HIDD_TEXTFORMAT

*Detagger* has a number of options to allow you to tailor the conversion to text. Some control what is copied across to the output, but most offer options to format the output above and beyond the formatting implicit in the original HTML.

- [[popup Preserve all white space from the original source, Preserve layout as it is in the source]]
- [[goto Page Layout options]]
- [[goto Bullet options]]
- [[goto Heading options]]

Page Layout options
-------------------

These options control how the text is laid out on the page.

- [[popup Impose a page width on the output, Apply line formatting to output text]]
- [[popup Target page width]]
- [[popup Right justify the output text]]
- [[popup Output indentation positions, Comma-separated list of output indentation positions]]

Preserve all white space from the original source
.................................................

_Menu location: Configuration Options -> Conversion to text -> Formatting_

When this option is selected the text is laid out as it was in the original HTML file, minus the actual tags. If this option is selected then all other text formatting options are ignored.

This option is suitable when the input file is only lightly tagged (e.g. a document with large <PRE> sections).

Impose a page width on the output
.................................

_Menu location: Configuration Options -> Conversion to text -> Formatting_

When selected this means that the lines of the text file should be formatted to match a target page width.
This will involve moving text around within a paragraph if necessary.

Target page width
..................

_Menu location: Configuration Options -> Conversion to text -> Formatting_

When line formatting is switched on, this is the target page width. If omitted, a default page width of 76 characters is applied.

See also [[popup target table width]]

Right justify the output text
.............................

_Menu location: Configuration Options -> Conversion to text -> Formatting_

When selected white space will be added to each line inside a paragraph so that the right margins are aligned as well as the left.

Output indentation positions
............................

_Menu location: Configuration Options -> Conversion to text -> Formatting_

This is a comma-separated list of up to 8 levels of indentation, specifying how the output text should be indented when laying out indented text such as nested lists.

By convention the first value should be a zero, so that the left hand margin starts at the beginning of the line.

Bullet options
--------------

These options control the presentation of bullets and lists in the output text document.

- [[popup Bullet point characters]]
- [[popup Text bullet characters]]
- [[popup List item templates, Comma-separated list of list item templates]]

Bullet point characters
.......................

_Menu location: Configuration Options -> Conversion to text -> Formatting_

When converting to text, this identifies characters which - if they end up at the start of a line by themselves - can be taken to be bullet points. The hyphen character '-' is implicitly regarded as a bullet point, but characters such as 'o', 'q' and '§' can sometimes appear in the output text as bullet points depending on how the original HTML was generated.

When a bullet point is found to match one of these characters, the first of the [[popup text bullet characters]] is used as a replacement.

Text bullet characters
......................
_Menu location: Configuration Options -> Conversion to text -> Formatting_

When converting to text, this is a comma-separated list that specifies which characters are to be used as the bullet at each level of list.

The special value "middot" will be taken to mean the "middot" character. At the same time, any "middot" characters occurring in the text will be replaced by the first "bullet" on this list.

For example the value

    (+),+,-

would convert any level 1 list bullets to "(+)", level 2 to "+" and level 3 to "-". At the same time any middot characters '·' will be converted into "(+)" (the first bullet on the list).

The default value is

    middot,middot,middot,middot,middot,middot

List item templates
...................

*New in version 2.4*

_Menu location: Configuration Options -> Conversion to text -> Formatting_

This policy is a comma-separated list of "templates" to be used when outputting ordered lists. The policy allows you to specify the format of each list item for up to 6 levels of list. The format should include an "x", which will be replaced by the item number.

For example a value of

    (x), x), x:

would lead to a list structure as follows

$_$_BEGIN_PRE
(1) item at list level 1
    a) item at list level 2
        i: item at list level 3
$_$_END_PRE

The default value is

    x),x),x),x),x),x)

Heading options
---------------

These options control how headings in the original document are presented.

- [[popup Heading underline characters]]

Heading underline characters
............................

_Menu location: Configuration Options -> Conversion to text -> Formatting_

This is a comma-separated list controlling the underlining character (or character pattern) at each heading level. A value of ",,,,," (or indeed blank) would suppress all heading underlining.

As an example the value

    =+ , -

would cause <H1> headings to be underlined with the pattern "=+=+=+=", <H2> headings to be underlined with "-------" and all other heading levels not to be underlined at all.
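As a rough illustration of how an underline pattern could be expanded to fit a heading, here is a short Python sketch. It is not Detagger's own code: the function names `underline_for_level` and `underline_heading` are hypothetical, and the rules used (patterns are trimmed of surrounding spaces, then repeated and cut to the length of the heading) are assumptions inferred from the "=+=+=+=" example above.

```python
def underline_for_level(level: int, policy: str) -> str:
    """Pick the underline pattern for a 1-based heading level.

    `policy` is a comma-separated value such as "=+ , -".
    Levels beyond the end of the list get no underline (assumption).
    """
    patterns = [p.strip() for p in policy.split(",")]
    return patterns[level - 1] if level <= len(patterns) else ""


def underline_heading(heading: str, pattern: str) -> str:
    """Return the heading with an underline of the same length.

    Assumption: the pattern is repeated, then truncated to fit.
    An empty pattern suppresses the underline.
    """
    if not pattern:
        return heading
    repeats = len(heading) // len(pattern) + 1
    underline = (pattern * repeats)[: len(heading)]
    return heading + "\n" + underline


policy = "=+ , -"
print(underline_heading("Heading", underline_for_level(1, policy)))
print(underline_heading("Heading", underline_for_level(2, policy)))
print(underline_heading("Heading", underline_for_level(3, policy)))
```

With the policy value "=+ , -" the first call underlines the heading with "=+=+=+=", the second with "-------", and the third prints the heading with no underline.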
$_$_HELP_CHAPTER 3,"Miscellaneous Formatting policies"
Miscellaneous Formatting policies
---------------------------------
$_$_HELP_TOPIC_ID HIDD_TEXT_MISC

*Detagger* has a number of options to allow you to tailor the conversion to text. Some control what is copied across to the output, but most offer options to format the output above and beyond the formatting implicit in the original HTML.

- [[popup Preserve all white space from the original source]]
- [[goto Dialogue options]]
- [[goto Other options]]
- [[goto Unicode options]]

Dialogue options
----------------

For those HTML files that represent works of fiction, *Detagger* has some sophisticated "dialogue" detection and formatting options.

Dialogue is deemed to be any words or phrases in double quotation marks. Where they occur at the start of a line this can often (but not always) signify the next line of dialogue (i.e. a new "character" speaking).

Care is taken - as far as possible - to distinguish text that happens to be in "quotes" from true dialogue. However this will never be a 100% accurate process.

- [[popup Look for dialogue lines, Apply special formatting to dialogue lines]]
- [[popup Apply extra dialogue checks, Apply extra tests for improved accuracy]]
- [[popup Break lines where dialogue starts in the middle, Attempt to start all dialogue on a new line]]

Look for dialogue lines
.......................

_Menu location: Configuration Options -> Conversion to text -> Miscellaneous Formatting_

When this option is switched on, the program will attempt to spot dialogue at the start of a line and format it accordingly, with each new speaker starting a new paragraph in the output.

Apply extra dialogue checks
...........................

_Menu location: Configuration Options -> Conversion to text -> Miscellaneous Formatting_

When this option is enabled, extra tests are made to check the validity of lines believed to be the start or end of a dialogue phrase.
Tests include looking for suitable use of capitalisation and punctuation inside and outside the quoted text. Some of these tests are biased towards the text being in English.

Break lines where dialogue starts in the middle
...............................................

_Menu location: Configuration Options -> Conversion to text -> Miscellaneous Formatting_

When this option is enabled, the software will attempt to spot new dialogue from a different character that appears deep inside a paragraph. When this is detected, the larger paragraph will be broken so that the dialogue of the new character starts in a new paragraph.

Other Options
-------------

- [[popup Remove all horizontal rules and lines]]
- [[popup Highlight *bold* and _italic_ text, Add bold (*) and italics (_) emphasis characters]]
- [[popup May add Unicode marker to output file]]

Remove all horizontal rules and lines
.....................................

_Menu location: Configuration Options -> Conversion to text -> Miscellaneous Formatting_

When selected all horizontal lines and rules in the input will be omitted in the output.

Highlight *bold* and _italic_ text
..................................

_Menu location: Configuration Options -> Conversion to text -> Miscellaneous Formatting_

When selected, HTML marked as <b>bold</b> or <i>italic</i> can be emphasised in the output text as *bold* and _italic_.

This can work well when the occasional word is emphasised, but in some HTML pages entire menus are placed in bold, and in such files this option is probably best switched off.

Unicode Options
---------------

*Detagger* has a limited ability to deal with Unicode in the HTML files. At present the following options are available

- [[popup May add Unicode marker to output file]]
- [[popup Input text encoding]]

May add Unicode marker to output file
.....................................
_Menu location: Configuration Options -> Conversion to text -> Miscellaneous Formatting_

When selected, files which are detected to contain Unicode characters will get a "Unicode file marker" output at the start of the file.

A Unicode file marker at the top of a file is recognised by some software applications (e.g. text editors) as marking the file as containing Unicode. The marker characters themselves don't get displayed.

Where possible, the software will create a UTF8 file.

Full Unicode support has not been tested, and it is not expected that *Detagger* will support all types of Unicode.

Input text encoding
...................

_Menu location: (none at present)_

The program can detect Unicode files on input if a Byte Order Mark (BOM) is present, or if - under some circumstances - Unicode HTML entities are present in the input text. In files without a BOM, however, the software may fail to detect that the input is Unicode.

In such circumstances this policy allows you to tell the software that the input should be treated as Unicode.

The possible values for this policy are

    auto        automatic detection (the default)
    UTF8        UTF-8
    UTF16-BE    UTF-16 "Big Endian"
    UTF16-LE    UTF-16 "Little Endian"

For a fuller discussion see [[goto Working with Unicode]]

$_$_HELP_CHAPTER 3,"Hyperlink policies"
Text Hyperlink policies
-----------------------
$_$_HELP_TOPIC_ID HIDD_TEXTLINK

These options control what happens to any hyperlinks in the original document. Since text files don't support hyperlinks, the options are to ignore the link entirely, to use only the display text, or to turn the link into a reference and add a reference table at the end, listing the URLs the links pointed to.
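To make the reference-table approach concrete, here is a minimal Python sketch. It is not Detagger's implementation: the function name `links_to_references` is hypothetical, the regular expression only copes with simple `<a href="...">` links, and the layout of the reference table is illustrative, since this document does not show the exact format Detagger produces.

```python
import re

# Matches simple <a href="...">text</a> links only.
# Real-world HTML would need a proper parser.
LINK_RE = re.compile(
    r'<a\s+href="([^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def links_to_references(html: str) -> str:
    """Replace each link by its display text plus [n], then append a URL table."""
    urls = []

    def replace(match: re.Match) -> str:
        urls.append(match.group(1))
        return f"{match.group(2)} [{len(urls)}]"

    text = LINK_RE.sub(replace, html)
    if urls:
        table = "\n".join(f"[{n}] {url}" for n, url in enumerate(urls, start=1))
        text = f"{text}\n\n{table}"
    return text


sample = 'See <a href="http://www.jafsoft.com/detagger/">HTML to text</a> for details.'
print(links_to_references(sample))
```

The display text stays in the running text with a numbered marker after it, and the URLs are collected at the end, which keeps the body readable while still recording where each link pointed.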
*Hyperlink removal*

- [[popup Preserve hyperlinks in text output]]
- [[popup Omit email hyperlinks from the output, Remove email hyperlinks]]
- [[popup Omit local hyperlinks from the output, Remove hyperlinks to local resources]]
- [[popup Text to replace omitted links by, Text to replace removed links by]]

*Display of URLs*

- [[popup Add URL references at end of file, Add URL reference table]]
- [[popup Display link URLs]]

*Images*

- [[popup Replace <IMG> tags by a text marker]]
- [[popup Use the ALT attribute to replace <IMG> tags, Use the ALT attribute as the marker]]

Preserve hyperlinks in text output
..................................

_Menu location: Configuration Options -> Conversion to text -> Hyperlinks_

This is a peculiar one added in response to a customer request.

When enabled during a conversion to text, all hyperlinks are left intact, so what you end up with is a text file with HTML hyperlinks in it. This may be of interest to those wishing to import text into a database for display on HTML pages.

It is expected that in the conversion any HTML entities (specifically &amp;) in the URL will get converted to their ASCII equivalent. This may cause usability problems with the link after conversion.

Omit email hyperlinks from the output
.....................................

_Menu location: Configuration Options -> Conversion to text -> Hyperlinks_

When this option is selected, all visible email addresses are omitted from the output. This can be a useful privacy option.

Omit local hyperlinks from the output
.....................................

_Menu location: Configuration Options -> Conversion to text -> Hyperlinks_

When this option is selected, any link to a local resource (a jump point, or a non http-qualified URL) is omitted from the output. This can be used to remove local navigation links from documents where "next", "previous", "top of page" links will mean nothing in the final text.
If this option isn't selected, the display part of such links will be copied to the output text.

Text to replace omitted links by
................................

_Menu location: Configuration Options -> Conversion to text -> Hyperlinks_

If either of the two previous options is selected, then this is the text that any deleted links will be replaced by. If set to blank the links are completely removed.

Add URL references at end of file
.................................

_Menu location: Configuration Options -> Conversion to text -> Hyperlinks_

If this option is selected then hyperlinks to resources are replaced by the display text, with a reference number [n] added after it. A full reference table, listing the original URL that matches each reference number, is then added at the end of the file.

When selected the option [[popup Display link URLs]] is disabled.

Display link URLs
.................

_Menu location: Configuration Options -> Conversion to text -> Hyperlinks_

When selected the URL for hyperlinks is displayed in brackets in the main text, after the display text.

When selected the option [[popup Add URL references at end of file]] is disabled.

Replace <IMG> tags by a text marker
...................................

_Menu location: Configuration Options -> Conversion to text -> Hyperlinks_

When selected any <IMG> tags in the original are replaced by an "[Image]" marker.

See also [[popup Use the ALT attribute to replace <IMG> tags]]

Use the ALT attribute to replace <IMG> tags
...........................................

_Menu location: Configuration Options -> Conversion to text -> Hyperlinks_

When selected any ALT attribute on an <IMG> tag will be used as the replacement text marker for the tag.
See also [[popup Replace <IMG> tags by a text marker]]

$_$_HELP_CHAPTER 3,"Text marker policies"
Text Marker policies
-------------------
$_$_HELP_TOPIC_ID HIDD_TEXT_MARKERS

*New in V2.3*

These policies allow you to specify special "markers" that should be added to the output to delimit tables and lists. This can be useful if you want to pass the output to some further software package for post-processing.

_Tables_

- [[popup Start table marker]]
- [[popup End table marker]]

_Lists_

- [[popup Start list marker]]
- [[popup End list marker]]

End list marker
...............

_Menu location: Configuration Options -> Conversion to text -> Markers_

See [[popup start list marker]]

End table marker
................

_Menu location: Configuration Options -> Conversion to text -> Markers_

See [[popup start table marker]]

Start list marker
.................

_Menu location: Configuration Options -> Conversion to text -> Markers_

When converting to text, this option identifies a marker that will be output on the line before the start of any marked-up list that is detected. This can be useful if you want to subsequently identify lists in the text.

See also [[popup End list marker]]

Start table marker
..................

_Menu location: Configuration Options -> Conversion to text -> Markers_

When converting to text, this option identifies a marker that will be output on the line before the start of any marked-up table that is detected. This can be useful if you want to subsequently identify tables in the text.

See also [[popup End table marker]]

$_$_HELP_CHAPTER 3,"Text paragraph policies"
Text paragraph policies
-----------------------
$_$_HELP_TOPIC_ID HIDD_TEXT_PARAGRAPHS

*New in V2.3*

These options control the layout of text into sentences and paragraphs. <P> tags in the original text are preserved, but some HTML files use other means to lay out text (e.g. multiple <BR> tags). In such cases *Detagger* applies extra intelligence to detect the paragraph structure.
- [[popup Output each paragraph on a single line]]
- [[popup Preserve short lines]]
- [[popup Treat short lines as paragraph endings]]
- [[popup Insert gap between sentences, Apply 2-character gap between sentences]]
- [[popup First line indent for paragraphs, Paragraph indentation]]

Output each paragraph on a single line
......................................

_Menu location: Configuration Options -> Conversion to text -> Paragraphs and sentences_

When selected this option specifies that each paragraph is put out as a single line (i.e. with a single hard break). This produces text that will display well in those environments that automatically wrap text.

This option won't work on text inside a table unless you switch off [[popup Attempt to parse tables]]

Preserve short lines
....................

_Menu location: Configuration Options -> Conversion to text -> Paragraphs and sentences_

When selected any line deemed to be "short" will keep its line break, even if text is rearranged to fit into a target page width.

First line indent for paragraphs
................................

_Menu location: Configuration Options -> Conversion to text -> Paragraphs and sentences_

Specifies the number of characters by which the first line in a new paragraph should be indented relative to those that follow.

Treat short lines as paragraph endings
......................................

_Menu location: Configuration Options -> Conversion to text -> Paragraphs and sentences_

Specifies that in files where there are no paragraph markers, a short line amongst longer lines should be taken as signalling a paragraph end. This can be useful when converting files that use multiple <BR> tags, but where you want a different page width in the output. Without this test there would be no paragraphs detected.

Insert gap between sentences
............................
_Menu location: Configuration Options -> Conversion to text -> Paragraphs and sentences_ When selected the software will impose a 2-character space between sentences, regardless of what spacing was in the original. This is a common style in written text. $_$_HELP_CHAPTER 3,"Table policies" Text Table policies ------------------- $_$_HELP_TOPIC_ID HIDD_TEXTTABLE These policies control how (if at all) *Detagger* will attempt to faithfully represent tables in the output. The software does a reasonable job of representing simple tables as text, and can even cope with nested tables to a limited extent. - [[goto Attempt to parse tables]] - [[goto Table border options]] - [[goto Table width options]] - [[goto Miscellaneous Table options]] Attempt to parse tables ....................... _Menu location: Configuration Options -> Conversion to text -> Tables_ When selected *Detagger* will attempt to correctly format any HTML tables in the text. In doing so *Detagger* will attempt to preserve the width, alignment etc of the original, but this process can only ever be approximate due to the quite different formats of HTML and text. Bear in mind that if the software is adding emphasis characters (see [[popup Highlight *bold* and _italic_ text]]) or URL references (see [[goto text hyperlink policies]]), then this will end up with more output text to fit in the table. Although you can try to [[popup adjust table to page width]], if the table is too wide or the page too narrow, then this will often fail - particularly for heavily nested tables. Table border options -------------------- These options specify whether any tables should have borders added in the text file. This only applies when converting to plain text (see [[popup Output table format]]). By default the software will replicate the border status of the outermost tables, but suppress borders in any inner, nested tables. This is because the space taken by borders in the ASCII file limits the room for data. 
- [[popup Add border to all tables]] - [[popup Suppress borders on nested tables]] - [[popup Allow blank row separator lines,Allow blank lines between rows]] Add border to all tables ........................ _Menu location: Configuration Options -> Conversion to text -> Tables_ When selected a border will be added round each table created. This will be regardless of whether the original table set a border or not. If omitted, then the border attribute of the original HTML should be honoured. Suppress borders on nested tables ................................. _Menu location: Configuration Options -> Conversion to text -> Tables_ When table borders are present (or switched on), this option suppresses any borders on tables that appear inside other tables. This is done because nesting tables is usually done to achieve layout, and in the text file white space usually looks better (less distracting). This option will not prevent a border being added to the outermost table (if one is requested). Allow blank row separator lines ............................... _Menu location: Configuration Options -> Conversion to text -> Tables_ This specifies whether or not blank lines are allowed in the output between rows. When allowed, a blank line is output between each table row (when there is no border). This can space out the table making it easier to read, especially if some of the cell contents are split over several lines. This option only works when the table is being converted to plain text as opposed to a data delimited format. Table width options ------------------- Controls the width the table will take on the output "page". The default behaviour is to regard 800 pixels as 80 characters wide. - [[popup Adjust table to page width, Fit table to page size]] - [[popup Target table width]] - [[popup Ignore table WIDTH Attributes]] - [[popup May break words to fit target width]] - [[popup Nested Table scaling factor (percent)]] Adjust table to page width ..........................
_Menu location: Configuration Options -> Conversion to text -> Tables_ When selected this specifies that any table should attempt to fit the current [[popup target table width]]. For a reasonable page size and simple table, *Detagger* can usually re-arrange the table's contents cell-by-cell to fit the table to the page. However for narrow pages and/or large tables or heavily nested tables it becomes virtually impossible to achieve this goal. Target table width .................. _Menu location: Configuration Options -> Conversion to text -> Tables_ When table formatting is being attempted, this is the target page width for tables. This is the maximum width that a table should be allowed to grow to. In some nested tables however this limit may on occasion be exceeded. If omitted the value will default to the [[popup target page width]] value, which in turn will default to 76 characters. Ignore table WIDTH attributes ............................. *New in version 2.4* _Menu location: Configuration Options -> Conversion to text -> Tables_ Whenever Detagger converts a table with explicit WIDTH attributes, it tries to honour this layout. However, sometimes the widths set in an HTML table don't give enough room to lay out the text when converting to text. This can happen for a number of reasons - the widths have been wrongly chosen to be too small in the HTML, but the browser has widened the table (thereby effectively ignoring the set width) - text in the table has been set to a small font size that fits the HTML width. On conversion to text small sizes can't be honoured, and so the allocated width may be too narrow for the text to be placed in the cell - sometimes the "nowrap" attribute has been set on a cell to stop it breaking over several lines. Again most browsers will honour this request by widening the table if need be, but this option isn't always available to Detagger, especially in heavily nested tables.
In such cases this policy allows all WIDTH attributes to be ignored by Detagger. When WIDTH attributes are ignored, Detagger is free to do the best it can to fit the data into the space available, subject to any limits suggested by the [[popup Target table width]] and [[popup Adjust table to page width]] policies. May break words to fit target width ................................... *New in version 2.4* _Menu location: Configuration Options -> Conversion to text -> Tables_ When Detagger converts tables, it will do a best-fit to the available page width, but there are occasions when it is difficult to fit a wide table into a narrow target page width. This is especially true of tables where small font sizes have been used in the HTML to achieve a fit; in plain ASCII text Detagger can't use an equivalent trick. When compressing a table to fit a small target width, Detagger will split text across multiple lines, moving words onto the next line to make the column narrower, but it won't break individual words in two in order to achieve a fit. This option tells Detagger that it can, if necessary, break up long words in order to narrow a column to fit. Detagger implements this in a fairly brutal manner - for example it doesn't hyphenate the broken text - and so this option should be used sparingly. Nested Table scaling factor (percent) ..................................... *New in version 2.4* _Menu location: Configuration Options -> Conversion to text -> Tables_ When table widths are not supplied (or the policy [[popup Ignore table width attributes]] is enabled), then Detagger sometimes struggles to fit heavily nested tables to a restrictive target page or table width. The reason for this is that Detagger first lays out the inner tables, and then embeds these in the outer tables when they are laid out. The problem is that the inner tables may expand to be too wide, making it impossible to get the outer tables to fit on the page.
This policy scales back the amount of space a nested table can take. It's expressed as a percentage. 100(%) implies no limit on the inner table, giving the behaviour of earlier versions of Detagger. 0 implies the maximum restriction on inner tables (although this isn't, of course, to zero width). The default value is 75. If you find that heavily nested tables aren't fitting into the desired page width, try reducing this value. If you find that inner tables are getting broken over several lines, try setting this policy to 100. If that still fails, you may need to consider disabling the policy [[popup Adjust table to page width]] and increasing the value of [[popup Target table width]]. NOTE: This policy is ignored if the [[popup Target table width]] is set larger than 100, since it is then assumed the user is allowing wide tables to preserve the layout. Miscellaneous Table options --------------------------- These are miscellaneous options related to how tables are converted and how the markup found inside tables is treated. - [[popup Allow headings inside tables]] - [[popup Default table indentation]] Allow headings inside tables ............................ _Menu location: Configuration Options -> Conversion to text -> Tables_ When this option is selected it prevents any heading markup inside tables being interpreted as such. Not only will this suppress any underlining the heading would have had added, but it will also prevent the width of the whole heading being used in calculating the "minimum width" required for each column. As a result this will actually help make some tables narrower, as the "minimum" width required drops as the "heading" text is now allowed to be split over two or more lines. Default table indentation ......................... _Menu location: Configuration Options -> Conversion to text -> Tables_ This option specifies the value - in spaces - of an indentation to be applied to all tables in the output.
This indentation will reduce the page width available to the table. $_$_HELP_CHAPTER 2,"Miscellaneous policies" Miscellaneous policies ====================== These policies are not set via the main options screens. - [[goto Configuration file policies]] - [[goto Other policies]] Configuration file policies --------------------------- These policies record the location of any additional configuration files that are to be used in the conversion. They are usually selected using the [[goto Configuration Files menu]] - [[popup Fragments File]] - [[popup Text Commands File]] Fragments File .............. $_$_HELP_TOPIC_ID HIDD_TEXT_FRAGMENTS _Menu location: Configuration Options -> Configuration Files -> Text fragments file_ In the full version you can add headers and footers to each text file created by using [[goto Using a Text Fragments File]]. This policy allows you to specify the file in which those fragments are defined. If you don't select a fragment file, then no headers or footers will be added. _Note: In the evaluation version, a standard header and footer are added, and this feature is not available. It *is* available in the registered version_ See also [[goto Using a Text Fragments File]] Text Commands File ................. _Menu location: Configuration Options -> Configuration Files -> Text commands file_ Specifies the location of any [[goto Using a Text Commands File, text commands]] file to be used to define text manipulations to be performed on the text as it is being read in, and prior to conversion. Other policies -------------- Some policies cannot be set through the normal options screens. These include - [[popup Allow by-line to be used for Author field]] - [[popup Concatenate results into one file]] - [[popup Lines to ignore at start of file]] - [[popup Lines to ignore at end of file]] Allow by-line to be used for Author field ......................................... _Menu location: (none at present. 
Edit the policy file)_ When selected, *Detagger* will search the first 40 lines of the document looking for a "By" line in the hope of identifying the author. Any value located this way will then be available for use in a generated [[goto using a text fragments file, text fragment]] via the AUTHOR [[goto fragment tags, fragment tag]]. Note: this option can be edited into a policy file, but cannot currently be set via the GUI. Concatenate results into one file ................................. _Menu location: On main dialog, set 'Output Type' to concatenate files_ When selected this specifies that when converting multiple files at once, all the results should be concatenated into a single results file. This has the same effect as selecting _"Concatenate results into one file"_ as the [[goto Output types, output type]] Lines to ignore at start of file ................................ _Menu location: (none at present. Edit the policy file)_ This specifies how many lines from the input files should be ignored at the start of the file. These lines will be discarded from the output. This can be useful when converting files copied from a news feed or similar source that adds a small data header to the file. Lines to ignore at end of file .............................. _Menu location: (none at present. Edit the policy file)_ This specifies how many lines from the input files should be ignored at the end of the file. Up to 40 lines may be ignored in this way. These lines will be discarded from the output. This can be useful when converting files copied from a news feed or similar source that adds a small data footer to the file. $_$_HELP_CHAPTER 1,"Using a Text Fragments File to customize your output" Using a Text Fragments File *************************** *Detagger* allows you to define your own text headers and footers when converting the file to text.
You do this by defining "text fragments" in an external "Text Fragments File" as follows $_$_BEGIN_PRE $_$_DEFINE_TEXT_FRAGMENT <fragment_name> .. ... fragment lines... ... $_$_END_BLOCK $_$_END_PRE Having placed your block definitions in an external text file, you should then use the Menu option _Conversion Options | Convert to text | Text headers_ to specify where this file can be located. This location will be saved in your Policy file, and may be lost if you load a new policy file. Using this approach you can define $_$_BEGIN_DELIMITED_TABLE [[goto header and footer fragments]] These will be placed at the top and bottom of each file when *Detagger* is converting files to text. This allows you to add standard copyright and contact information, and if you use the [[goto TEXT_HEADER tags]] you can create headers that are tailored to the contents of each file. [[goto separator fragments]] These will be placed between results when you choose to convert multiple files and concatenate the results into a single file. $_$_END_DELIMITED_TABLE $_$_SECTION MAKINGHTML *Contents of this section* $_$_CONTENTS_LIST 2,,2 $_$_SECTION MAKINGRTF $_$_SECTION ALL $_$_HELP_CHAPTER 2,"Adding custom headers and footers" Header and footer fragments =========================== *Detagger* recognises two fragment names TEXT_HEADER the text to be placed at the top of each output file TEXT_FOOTER the text to be placed at the end of each output file If either of these fragments is not defined in the [[goto Using a Text Fragments File, text fragments]] file, or if you don't supply a text fragment file, then the header and/or footer will be omitted.
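As an illustration of the two fragment names above, a minimal sketch of a fragments file that defines both might look like the following. The copyright wording and contact address are hypothetical placeholders; the fragment names, the LINERULE tag and the TEXT_HEADER tag types are those documented in this manual.

$_$_BEGIN_PRE
$_$_DEFINE_TEXT_FRAGMENT TEXT_HEADER
[[OT]]TEXT_HEADER BOX_TOP[[CT]]
[[OT]]TEXT_HEADER TITLE[[CT]]
[[OT]]TEXT_HEADER TIMESTAMP[[CT]]
[[OT]]TEXT_HEADER BOX_BOTTOM[[CT]]
$_$_END_BLOCK

$_$_DEFINE_TEXT_FRAGMENT TEXT_FOOTER
[[OT]]LINERULE[[CT]]
Copyright (c) Example Company (hypothetical). Contact: someone<at>example.com
[[OT]]LINERULE[[CT]]
$_$_END_BLOCK
$_$_END_PRE

With a file like this, the header box should vary from file to file (title and conversion date), while the footer text stays the same for every file converted.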
_Note: This feature is not available in the evaluation version of *Detagger*; in that version a [[goto default header and footer]] are used instead_ Default header and footer ------------------------- In the evaluation version of *Detagger* the header and footer are defined as follows:- $_$_BEGIN_PRE $_$_DEFINE_TEXT_FRAGMENT TEXT_HEADER [[OT]]TEXT_HEADER BOX_TOP[[CT]] [[OT]]TEXT_HEADER VERSION[[CT]] [[OT]]TEXT_HEADER TITLE[[CT]] [[OT]]TEXT_HEADER BOX_MIDDLE[[CT]] [[OT]]TEXT_HEADER OUT_FILENAME[[CT]] [[OT]]TEXT_HEADER OUT_FILESIZE[[CT]] [[OT]]TEXT_HEADER TIMESTAMP[[CT]] [[OT]]TEXT_HEADER BOX_BOTTOM[[CT]] $_$_END_BLOCK $_$_DEFINE_TEXT_FRAGMENT TEXT_FOOTER [[OT]]LINERULE[[CT]] Converted by an unregistered version of [[OT]]VERSION[[CT]] Visit http://www.jafsoft.com/detagger/ (this message is omitted in registered version) [[OT]]LINERULE[[CT]] $_$_END_BLOCK $_$_END_PRE This gives example results as follows $_$_BEGIN_PRE /----------------------------------------------------------------------\ | < This header can be omitted in the registered version > | | Converted by : Detagger 2.0 (unregistered) | | : www.jafsoft.com/detagger/ | | Title : The JafSoft text conversion FAQ | | | | File name : a2hfaq.txt | | File size : 8,914 bytes (approx) | | Create date : 7-Aug-2002 | \----------------------------------------------------------------------/ <main file contents> ======================================================================== Converted by an unregistered version of Detagger 2.0 Visit http://www.jafsoft.com/detagger/ (this message is omitted in registered version) ======================================================================== $_$_END_PRE TEXT_HEADER Tags ---------------- TEXT_HEADER tags are [[goto Fragment tags]] that can be placed inside text fragments and be replaced by a suitable box line in the output. The box lines will adjust to the current page width.
TEXT_HEADER tags have the form $_$_BEGIN_PRE [[OT]]TEXT_HEADER <type>[[CT]] $_$_END_PRE and each tag should be placed on a line by itself inside the fragment. For example the fragment :- $_$_BEGIN_PRE $_$_DEFINE_TEXT_FRAGMENT TEXT_HEADER [[OT]]TEXT_HEADER BOX_TOP[[CT]] [[OT]]TEXT_HEADER OUT_FILENAME[[CT]] [[OT]]TEXT_HEADER OUT_FILESIZE[[CT]] [[OT]]TEXT_HEADER BOX_BOTTOM[[CT]] $_$_END_BLOCK $_$_END_PRE Gives the output $_$_BEGIN_PRE /----------------------------------------------------------------------\ | File name : a2hfaq.txt | | File size : 8,914 bytes (approx) | \----------------------------------------------------------------------/ $_$_END_PRE Possible TEXT_HEADER tag types include $_$_BEGIN_TABLE *AUTHOR* This tag will add a box line identifying the document author (taken from an author line, or from a META tag in the original) *BOX_BOTTOM* Adds a bottom line to the box *BOX_MIDDLE* Adds a middle (blank) line to the box *BOX_TOP* Adds a top line to the box *LAST_EMAIL* Adds an email line to the box for the last observed email hyperlink (e.g. taken from a signature) *LAST_URL* Adds a URL line to the box based on the last observed hyperlink *IN_FILENAME* Input filename *IN_FILESIZE* Input file size *IN_FILEDATE* Input file date *OUT_FILENAME* Output file name *OUT_FILESIZE* Output file size (in bytes). Only approximate, as it estimates the header size *TIMESTAMP* Adds a "date" line for the date of the conversion *TITLE* Adds a title line.
Taken from the <TITLE> tag, or from the first heading *TOP_EMAIL* Adds an email line to the box based on the first email hyperlink in the source *TOP_URL* Adds a URL line to the box based on the first observed hyperlink *VERSION* Adds a line identifying that the file was converted by *Detagger* $_$_END_TABLE $_$_HELP_CHAPTER 2,"Adding custom results separators when concatenating files" Separator fragments =================== When converting multiple files at once and choosing to [[goto Output Types, concatenate results]], *Detagger* can be made to add a separator between the results for each file. The fragment names recognised are $_$_BEGIN_TABLE TEXT_SEPARATOR the text to be placed between each set of results in the output file when converting files to text HTML_SEPARATOR the HTML to be placed between each set of results in the output HTML file when selectively removing markup from the input files. Care should be taken to ensure the HTML in this fragment is compatible with that from the results files. $_$_END_TABLE If either of these fragments is not defined in the [[goto Using a Text Fragments File, text fragments]] file, or if you don't supply a text fragment file, then there will be no separators between results in the output file. _Note: This feature is not available in the evaluation version of *Detagger*; default results separators are used instead_ $_$_HELP_CHAPTER 2,"Fragment tags - customize your text fragments" Fragment tags ============= Within your fragment definitions you can supply any text you want, but this will be the same for each file converted. A number of _fragment tags_ are recognised in the form [[OT]]TAGNAME <details>[[CT]] Where tags of this form are recognised, *Detagger* will replace the tag by a suitable value. Of particular interest are the [[goto TEXT_HEADER tags]]. These tags produce a line of text suitable to be placed in a box at the top of the text file.
The box width will be adjusted (where possible) to fit the chosen [[goto target page width]]. For other fragment tags supported by JafSoft converters, please read the section on fragment tags in the *Tag Manual* available online at http://www.jafsoft.com/doco/tag_manual_3.html#Section_3.3 Note: Not all the tags described in that document are suitable for use inside *Detagger* files. The DATA fragment tag --------------------- The DATA fragment tag can be used to embed information about the file being converted into the output. In Detagger the main use of the DATA fragment tag is in a [[goto Using a Text Fragments File, text fragment]] or in replacement strings in the [[goto Text Command : replace_text, replace_text Text command]] (See [[goto An example use of a Text Commands File]]) $_$_BEGIN_PRE [[OT]]DATA <data_type>[[CT]] $_$_END_PRE where, $_$_BEGIN_TABLE <data_type> This is the type of data to be substituted in $_$_END_TABLE Supported data types include $_$_BEGIN_TABLE VERSION Indicates the program version of Detagger used in the conversion TITLE Document title (taken from the HTML header) IN_FILENAME Input filename OUT_FILENAME Output filename IN_FILESIZE Input file size (in bytes) OUT_FILESIZE Output file size IN_FILEDATE Timestamp of input file TIMESTAMP Timestamp of conversion COMMENT Free text comment $_$_END_TABLE Note, when used in the [[goto Text Command : replace_text, replace_text Text command]] only those data types known when the input file is opened will work, so for example TITLE won't work in that context. $_$_HELP_CHAPTER 1,"Using Text commands to modify the source text" $_$_HELP_SUBJECT "Overview of Text Commands File" Using a Text Commands File ************************** $_$_HELP_TOPIC_ID HIDD_TEXTSUBS_FILE As of version 2.3, *Detagger* allows the use of "Text Commands". These are commands that allow you to modify the text before it is converted. The commands should be placed in an external "Text Commands File".
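As an illustration only, a small Text Commands File might contain lines such as the following. The match strings are hypothetical; the command names, line match, match types and replacement types are the ones described in the sections that follow.

$_$_BEGIN_PRE
ignore_line starting_with string "Page "
remove_text exact_string "[DRAFT]"
replace_text string "|" by_character " "
$_$_END_PRE

In this sketch the first command discards page-marker lines, the second removes a marker wherever it occurs with that exact case, and the third blanks out a change bar character without altering line length. All three commands are read from the Text Commands File.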
This file can be chosen from the _Conversion Options -> Config File Locations_ menu option. $_$_SECTION MAKINGHTML *Contents of this section* $_$_CONTENTS_LIST 2,,2 $_$_SECTION MAKINGRTF Various commands are available as follows $_$_BEGIN_DELIMITED_TABLE [[goto Text Command : ignore_line,ignore_line]] Identifies lines to be discarded from the input [[goto Text Command : remove_text,remove_text]] Identifies text to be removed from the input [[goto Text Command : replace_text,replace_text]] Identifies text to be replaced by other text $_$_END_DELIMITED_TABLE $_$_SECTION ALL $_$_HELP_CHAPTER 2,"Text Commands" Text Commands available ======================= $_$_SECTION MAKINGRTFHELP Various commands are available as follows $_$_BEGIN_DELIMITED_TABLE [[goto Text Command : ignore_line,ignore_line]] Identifies lines to be discarded from the input [[goto Text Command : remove_text,remove_text]] Identifies text to be removed from the input [[goto Text Command : replace_text,replace_text]] Identifies text to be replaced by other text $_$_END_DELIMITED_TABLE $_$_SECTION ALL Text Command : ignore_line -------------------------- The *ignore_line* command identifies lines that should be ignored in the input. Syntax: $_$_BEGIN_PRE ignore_line <line_selection> $_$_END_PRE Any line matching the specified [[goto line_selection]] criteria will be omitted from the output. This can be a useful way of ignoring page markers in an input file, as these don't always transfer well under the conversion. Text Command : remove_text -------------------------- The *remove_text* command identifies text that should be removed from the input. Syntax: $_$_BEGIN_PRE remove_text <match_type> "match string" $_$_END_PRE Any line containing text that matches the specified [[goto match_type]] for the supplied "match string" will have the matching text removed. Text Command : replace_text --------------------------- The *replace_text* command identifies text in the input that should be replaced by other text.
Syntax: $_$_BEGIN_PRE replace_text <match_type> "match string" by_string "new string" or replace_text <match_type> "match string" by_character "<char>" $_$_END_PRE Any line containing text that matches the specified [[goto match_type]] for the supplied "match string" will have the matching text replaced. If the replacement is specified as by_string "new string" then the text is replaced by the new string. If the replacement is specified as by_character "<char>" then the string is replaced by a string of equal length consisting of this single character repeated. This can be useful for example to replace change bar characters by spaces in a document where the change bars have confused the program, or to replace other characters inside a table that are confusing the detection of the table's true layout. Note: The new string can contain [[goto The DATA fragment tag]] which allows details about the file (e.g. its name) to be included in the substitution. See [[goto An example use of a Text Commands File]] $_$_HELP_CHAPTER 2,"Text Command Line syntax elements" Text Command line elements ========================== $_$_SECTION MAKINGRTFHELP The following Text command line elements are used by several commands $_$_BEGIN_DELIMITED_TABLE [[goto line_selection]] Complete specification of how to match a line [[goto line_match]] Location of text within line [[goto match_type]] Type of text matching required [[goto replace_type]] Type of text replacement wanted $_$_END_DELIMITED_TABLE $_$_SECTION ALL line_selection -------------- The *line_selection* element is actually a combination of a number of simpler elements as follows Syntax: $_$_BEGIN_PRE <line_match> <match_type> "match string" $_$_END_PRE That is the *line_selection* consists of a [[goto line_match]], a [[goto match_type]], and then the actual "match string" to be matched. All three elements must be present in order for the *line_selection* to be valid.
The following are all valid examples $_$_BEGIN_PRE starting_with string "Chapter" starting_with exact_phrase "Author : " containing phrase "click here" containing string "http://" $_$_END_PRE line_match ---------- The *line_match* element specifies where on the input line the specified text should be located. The options are $_$_BEGIN_DELIMITED_TABLE starting_with Text should be at start of line (ignoring any white space) containing Text can be anywhere on the input line $_$_END_DELIMITED_TABLE Care should be taken when using the *containing* option, as false matches are more likely to occur. match_type ---------- The *match_type* element specifies how any supplied match string should be matched. The options are $_$_BEGIN_TABLE string This specifies that a string should be matched. This is, in fact, the most general of match types and is the one that would normally be used. This match type is case-insensitive. exact_string Same as "string", but case-sensitive. phrase A "phrase" is a string that is surrounded by white space and/or punctuation on either side (see below). This match type is case-insensitive. exact_phrase Same as "phrase", but case-sensitive. wildcard Not yet supported (*) $_$_END_TABLE The *match_type* _phrase_ is a special case. This is a _string_ that is surrounded by white space or punctuation on either side. So whereas the _string_ "the" would match "then", the _phrase_ "the" wouldn't because the "n" in "then" is neither white space nor punctuation. The start and end of a line count as white space, and any leading or trailing punctuation is allowed. _Phrase_ is therefore a more precise match - even for single words - than _string_. Consider the following example, concentrating on the letters "ten" in the word "tense" This is a tense situation.... The following would apply $_$_BEGIN_TABLE $_$_TABLE_LAYOUT 2,32,90 match_type Matches? ------------------------------------------------------- string "ten" Yes.
The "ten" matches the first three characters in "tense" in the middle exact_string "Ten" No. The "t" in "tense" is lower case, so the match fails phrase "ten" No. "ten" is not surrounded by white space or punctuation because it is followed by "se" exact_phrase "tense situation" Yes. The case matches, and there is a space before and punctuation (the "...") afterwards. $_$_END_TABLE replace_type ------------ The *replace_type* element is used in the [[goto Text Command : replace_text,replace_text]] command to specify what type of text replacement should be executed. The element should be immediately followed by the replacement text in quotes. There are two options:- $_$_BEGIN_TABLE by_string The matched text should simply be replaced by the replacement text. by_character The matched text should be replaced by an equal length string composed solely of the single character in the replacement text. $_$_END_TABLE The *by_character* option allows a string to be "blanked out" by the character of your choice, but without altering the line length or spacing etc. This can be useful, for example to replace all DOS line drawing characters by blanks in a table, so as to let the software make a better stab at detecting the table layout. $_$_HELP_CHAPTER 2,"An example Text Commands File" $_$_HELP_SUBJECT "Text Commands File" An example use of a Text Commands File ====================================== The following is a real-life example, sent to me by one of my users. They had files that consisted of a table of data that was to be imported into an Excel spreadsheet. By using the policies $_$_BEGIN_PRE [[popup output table format]] : 2 [[popup Add delimited table markers]] : yes $_$_END_PRE They were able to turn the table into delimited data which they could easily extract and then import into Excel. However the problem then was that they couldn't tell which data had come from which file.
As it happened the filename matched an instrument name, and they wanted the imported data to include the instrument/filename. I actually modified Detagger to make this work for them. The solution was to use a text command file as follows $_$_BEGIN_PRE replace_text string "<TR>" by_string "<TR><TD>[[OT]]DATA IN_FILENAME[[CT]]</TD>" $_$_END_PRE In this case the opening <TR> of each table was replaced by a <TR>, a <TD> a fragment tag and then a </TD>. The fragment tag [[OT]]DATA IN_FILENAME[[CT]] gets substituted by the input filename. The net effect of this substitution is to create the appearance of an extra column in each table consisting of the filename in the first cell of each row. Once the modified table is converted to delimited data, the filename is effectively inserted into each row of the table, so that once imported into Excel each data row can identify which file (and hence which instrument) the data came from. See [[goto The DATA fragment tag]] $_$_HELP_CHAPTER 1,"Ordering" $_$_HELP_SUBJECT "How to order your copy" Ordering your copy ****************** You can buy the [[goto windows version]] on-line. Registered users will also get a copy of the console version, which is better suited for automated conversions. An [[goto API version]] is also available. You should contact us with details of your needs to get a quote for that. Purchasers of the API version will get a complimentary license for the Windows version. Windows Version =============== *Detagger* costs $25 per copy. Anyone who buys a copy of *Detagger* may freely take backup copies, and install it on all their home machines, and one desktop and one portable at work. Organisations wanting multiple copies will be offered discounts, contact *info<at>jafsoft.com* (replace the "<at>" by "@") for details. Registration details can be found on the web page http://www.jafsoft.com/detagger/register.html This URL is also shown on the Help -> About window. 
For more information, visit the web page, or email *info<at>jafsoft.com* (replace the "<at>" by "@"). As well as the Windows version *Detagger* is available as a [[goto console version]], and there is also a programmer's [[goto API version]] available. API version =========== Software developers can purchase an API version of *Detagger* under separate license. Contact *info<at>jafsoft.com* (replace the "<at>" by "@") for details. The API offers access to the full functionality of *Detagger*. It is delivered as both a DLL and a link library. A demonstration package is available on the web that shows the API being called from both C++ and Visual Basic sample programs. You can get more information by visiting http://www.jafsoft.com/developers/api_demos.html where you will also be able to download an evaluation copy. This evaluation will add headers to output and convert some text to UPPER CASE, but is otherwise fully functional. The API software itself is written in C++, but the API interface is declared as a set of "C" routines, and a C header file is supplied to define the API. The API is supplied as both a link library and a DLL. The code can be called from C/C++ software or (in DLL form only) Visual Basic. The above link contains evaluation downloads that demonstrate both in sample programs. Some customers have also successfully called the API from: - C/C++ - Java (using JNI) - Visual Basic - C# - Lotus Notes - Delphi The API is supplied in digital form under Windows, but with access to the source code (not normally supplied), the API has been successfully built on the following platforms - Windows 98 onwards - OpenVMS - Linux - Solaris - Mac OS Access to non-Windows systems is currently only possible if you purchase a source code license. If you are interested in the API, send details of your intended use to *info<at>jafsoft.com*.
You should specify - whether this is a commercial or non-commercial project - whether the API is to be used in a server, a web server or end user application - the expected number of users of the application *Support* Developers who purchase the API get a complimentary copy of the Windows product, primarily to allow them to easily prototype any "Policy files" for the conversions they want to attempt. The functionality of the Windows version and the API may differ slightly. Developers will also be offered free of charge any updates that occur for up to one year from the date of purchase. During this time email support will be offered to a few named contacts. After one year, developers may choose to continue to receive upgrades and support by paying an annual fee in advance, usually set at around 20% of the initial license fee. Unfortunately we do not have the resources to offer support to your end users. $_$_HELP_CHAPTER 1,"Contact details" Detagger on the Web ******************* The *Detagger* home page can be found at http://www.jafsoft.com/detagger/ It can be registered on-line by visiting http://www.jafsoft.com/detagger/register.html Updates will be announced at http://www.jafsoft.com/detagger/updates.html Users wishing to try out the [[goto console version]] should visit http://www.jafsoft.com/developers/console_demos.html Software developers who wish to integrate the functionality of *Detagger* into their own applications can get a trial copy of the [[goto API version]] by visiting http://www.jafsoft.com/developers/api_demos.html Other products offered by *JafSoft Limited* can be found at http://www.jafsoft.com/products/ Most of the URLs are available from the program's About screen. The author can be contacted at *info<at>jafsoft.com* (replace the "<at>" by "@"). $_$_HELP_CHAPTER 1,"Upgrades" Upgrades ******** It is our intention to continue development of this product, and for as long as possible make upgrades available via the Internet.
Check the [[goto update menu]] for updates, or visit the web page http://www.jafsoft.com/detagger/updates.html $_$_HELP_CHAPTER 1,"Change History" Change History ************** Here is the change history of *Detagger* $_$_SECTION MAKINGHTML $_$_CONTENTS_LIST 2,,2 $_$_SECTION MAKINGRTF - [[goto Version 2.4 (June 2005)]] - [[goto Version 2.3.2 (September 2004)]] - [[goto Version 2.3 (April 2004)]] - [[goto Version 2.2 (May 2003)]] - [[goto Version 2.1 (March 2003)]] - [[goto Version 2.0 (December 2002)]] - [[goto Version 1.0 (August 2002)]] $_$_SECTION_ALL $_$_HELP_CHAPTER 2,"Version 2.4" $_$_HELP_SUBJECT "Changes in version 2.4" Version 2.4 (June 2005) ======================= Version 2.4 contains a small number of minor bug fixes and policy changes. $_$_SECTION MAKINGRTFHELP - [[goto Changes made in 2.4]] - [[goto Policies added in 2.4]] $_$_SECTION_ALL Changes made in 2.4 ------------------- - A new menu option [[goto Data extraction policies,Data Extraction]] appears under the _Conversion Options -> Conversion to Text_ menu. These options allow you to select which data should be extracted from the source document. Policies added in 2.4 --------------------- - [[popup Show only table data]]. This option is part of the new Data Extraction policies, and allows you to specify that only text contained in HTML <TABLE> markup should be included in the output document. - [[popup Maximum table depth]] and [[popup Minimum table depth]]. These options are also part of the new Data extraction policies, and allow you to specify which levels of a set of nested tables should be included in the output. - [[popup Ignore table WIDTH attributes]]. Sometimes the widths set in an HTML table don't give enough room to lay out the text when converting to text. In such cases this policy allows all WIDTH attributes to be ignored. - The policy [[popup preserve hyperlinks in text output]] added in version 2.3.2 is now accessible on the [[goto Text Hyperlink policies]] option panel.
- The policies [[popup remove HTML <DIV> tags]] and [[popup remove HTML <SPAN> tags]] are added. - The policy [[popup May break words to fit target width]] is added. This allows Detagger to break up long words in a table in order to allow the column to be made narrow enough to fit a small target width. - The policy [[popup List item templates]] is added. This allows you to specify the text "decoration" added to ordered list items. - The policy [[popup Nested Table scaling factor (percent)]] is added. This can be used to limit the space allowed to nested tables. This can assist when trying to squeeze heavily nested and wide tables into a limited fixed width. Bugs fixed :- - Text commands using match or replace expressions containing double quote characters weren't being accepted. - Tables nested inside other tables where the outer table contained no <tr>..</tr> tags were not being processed correctly. - Various policies were not being saved correctly after being changed, or when policies were saved to file. - When [[popup Preserve all white space from the original source]] is enabled, tabs in the original document are now preserved. Previously they were being converted to spaces. - A number of policies weren't being saved to policy file on exit, especially if users changed between tabs on the property sheet before pressing "Apply". - PRE sections inside tables were being handled poorly. Large PRE sections (several thousand lines) caused a performance hit which is now fixed, and in heavily nested tables the PRE text was compressed and lost its formatting. Now the presence of a PRE section is liable to cause the table to become wider, and if the PRE section won't fit each line will be broken over several lines, but the formatting is preserved. $_$_HELP_CHAPTER 2,"Version 2.3.2" Version 2.3.2 (September 2004) ============================== Version 2.3.2 contains a number of bug fixes and minor enhancements over version 2.3.
It also contains enhanced support for handling Unicode, especially UTF-16. $_$_SECTION MAKINGRTFHELP - [[goto Changes in version 2.3.2]] - [[goto New policies in version 2.3.2]] - [[goto Bugs fixed in version 2.3.2]] $_$_SECTION_ALL Changes in version 2.3.2 ------------------------ - Improved Unicode support, and added support for UTF-16. See [[goto Working with Unicode]] - New German translation of the user interface supplied by Jürgen Krane. - Added support for DATA tag and IN_FILENAME and OUT_FILENAME attributes in text fragments. See [[goto The DATA fragment tag]] - Changed the default page width to 76 from 72. For pages that add a <br> to each line this reduces the chances of unwanted additional line feeds in the output when converting to text. You can always override this value via the policy [[popup target page width]]. - Various formatting changes to improve table layout (especially nested tables). New policies in version 2.3.2 ----------------------------- Here are the policies added in version 2.3.2 :- - [[popup Input text encoding]] - [[popup Remove <!DOCTYPE> tags]] - [[popup Preserve hyperlinks in text output]] Bugs fixed in version 2.3.2 --------------------------- - Fixed a number of bugs when processing tables with missing <TR>, </TR> and <TD> tags. These could cause the table to not be processed or, in extreme cases, to be omitted from the output. - Fixed bug where lines starting with a quotation mark didn't get correctly set to the target page width. - Fixed bug whereby a nested table was indented when converted to delimited data. - Fixed a number of bugs in which the presence of tags in embedded JavaScript (such as <style>, <body>, <title> etc) was confusing the file parsing, and occasionally leaving unwanted JavaScript in the output. - Files containing Unicode entities were not getting the Unicode file marker added when converted to text. This prevented some files (e.g. Arabic) from being displayed properly.
- A bug meant that occasionally the last line of a file wouldn't be read properly. This might have caused some conversion issues but was actually spotted when the last line in a policy file was ignored. - Documents with a missing or misplaced </head> tag were converted to an empty file when converted to text. - When removing hyperlinks, the first link after a NAME anchor tag wasn't being removed. - When converting multiple files (as opposed to wildcards) the program would sometimes get confused calculating the output filename. $_$_HELP_CHAPTER 2,"Version 2.3" Version 2.3 (April 2004) ======================== Here are the changes and policies added in version 2.3 :- $_$_SECTION MAKINGRTFHELP - [[goto Changes in version 2.3]] - [[goto New policy options in Version 2.3]] $_$_SECTION_ALL Changes in Version 2.3 ---------------------- - When converting files in sub-folders of the main input directory, new options on the main screen allow you to have the output files placed in a parallel folder structure under the output folder. See [[goto Output Directory]] - For the console version the [[popup The /SUBFOLDERS command line qualifier, /SUBFOLDERS]] and [[popup The /SUBFOLDERS command line qualifier, /TREE]] qualifiers are introduced to allow you to search sub-folders to locate input files, and to direct output files into a parallel folder structure to the input files. - A new feature known as "Text Commands" allows you to perform certain edits and manipulations on the source text before it is passed to the conversion process. For files that are not straight HTML files, but have instead come from elsewhere, these manipulations can be used to remove certain features (e.g. a non-HTML header on the document), or to turn certain markers into something that makes the file more HTML-like. For example SEC filings use the HTML-like <TEXT> tag to mark up plain text.
By using a Text Command to change this into a <PRE> tag, the HTML converter is then tricked into leaving the format of the text in this section alone. See [[goto Using a Text Commands File]] - A new [[goto Tip of the Day]] feature has been added. The good news is that if you don't like it you can switch it off straight away and it will never bother you again :-) Alternatively, if you prefer, the tips can be read in sequence by using the next/last buttons, and the screen can be brought up from an option on the settings menu. If anyone has suggestions as to topics they would like tips on, please feel free to send them to info<at>jafsoft.com. - A Dutch translation has been supplied courtesy of *Jurrien Dokter*. New policy options in Version 2.3 --------------------------------- Several new policy options are added in version 2.3 _General_ - [[popup Lines to ignore at end of file]] - [[popup Lines to ignore at start of file]] _External configuration files_ - [[popup Fragments File]] - [[popup Text Commands File]] _Conversion to text_ - [[popup Allow headings inside tables]] - [[popup Default table indentation]] - [[popup Target table width]] - [[popup Heading underline characters]] - [[popup Bullet point characters]] - [[popup Text bullet characters]] - [[popup Start list marker]] and [[popup End list marker]] - [[popup Start table marker]] and [[popup End table marker]] - [[popup Output indentation positions]] - [[popup Allow by-line to be used for Author field]] _Markup removal_ - [[popup Remove HTML table tags]] - [[popup Remove HTML <OBJECT> tags]] $_$_HELP_CHAPTER 2,"Version 2.2" Version 2.2 (May 2003) ====================== Version 2.2 contains a small number of improvements and enhancements over version 2.1.
$_$_SECTION MAKINGRTFHELP - [[goto Changes in Version 2.2]] - [[goto New features in Version 2.2]] $_$_SECTION_ALL Changes in Version 2.2 ---------------------- On the [[goto Settings menu]] a new option allows you to _Remember settings on exit_. If selected the current file, output directory, policy file and conversion options are remembered and used as the starting values next time you run the program. New features in Version 2.2 --------------------------- A number of new options have been added to allow you to remove certain types of tags and attributes from inside tables only. A new "Tables" option has been added under "markup manipulation" on the [[goto Conversion options menu]]. This takes you to the _Detag Tables options_ tab which has the following options - [[goto Remove HTML <P> tags from tables]][[br]] - [[goto Remove HTML alignment attributes from tables]] - [[goto Remove HTML color attributes from tables]] - [[goto Remove HTML size attributes from tables]] $_$_HELP_CHAPTER 2,"Version 2.1" Version 2.1 (March 2003) ======================== Version 2.1 contains a small number of improvements and enhancements over version 2.0. $_$_SECTION MAKINGRTFHELP - [[goto Changes in Version 2.1]] - [[goto New features in Version 2.1]] $_$_SECTION_ALL Changes in Version 2.1 ---------------------- - Performance improvements. A number of small bugs have been fixed. Also a number of performance enhancements have been made. These will only really be noticeable when converting large files. New features in Version 2.1 --------------------------- - [[goto Output Types, Concatenate results]]. It is now possible to convert several files at once and have the results output to a single file. To do this select _"Concatenate results into a single file"_ for the Conversion Type. - Searching sub-folders. It is now possible to get the program to look in sub-folders of the one selected for files to convert. To do this just tick the _"Search Sub-folders"_ checkbox. 
Note, at present all results will only be put in a single output folder, and not into a structure that matches the original. See [[goto Input selection]] - New Table border option. A new option allows you to output blank lines between rows when converting tables to text. See [[popup Allow blank row separator lines]] - New table data handling options. New options allow you to fine tune the conversion of tables to delimited data. See [[popup Convert only innermost tables]] and [[popup Add delimited table markers]] $_$_HELP_CHAPTER 2,"Version 2.0" Version 2.0 (December 2002) =========================== Version 2.0 contains a number of changes suggested by users of *Detagger*, as well as a number of bug fixes and code enhancements. $_$_SECTION MAKINGRTFHELP - [[goto Changes in Version 2.0]] - [[goto New features in Version 2.0]] $_$_SECTION_ALL Changes in Version 2.0 ---------------------- - Name change. I've decided to lose the "-" in the name, so "De-tagger" becomes "*Detagger*" as of version 2.0 :-) - The version 2.0 code has been improved when dealing with large files. Previously the performance seriously degraded on files over 0.5 Mb in size. The new version is significantly faster when handling such files, and furthermore there is now no theoretical limit to file size, although if a whole file is placed in a table there may still be problems. - Several bugs with saving options ("policies") to file have been fixed. The main dialog now has a section for [[goto using policy files, policy files]] to be loaded to make their use much more convenient. Recently saved files can be quickly recalled from the drop down list. - A console version is now available to registered users. The console version can be run from the command line and is better suited to batch operation.
A demonstration of the console version can be downloaded from the website at http://www.jafsoft.com/developers/console_demos.html - For those wishing to integrate the functionality of *Detagger* into their own software, an API version is available. This is sold under a separate license, subject to negotiation based on the intended use of the API. You can find out more and download an evaluation version from http://www.jafsoft.com/developers/api_demos.html New features in Version 2.0 --------------------------- Several new features have been added to *Detagger* since version 1.0. *Markup removal* _Tag removal options_ [[popup Remove emphasis tags]][[br]] Removes all the *bold* and _italic_ markup from the HTML [[popup Remove style sheet]][[br]] Removes all the <STYLE> sections from the HTML, together with any reference to an external CSS style sheet. [[popup Remove HTML <IMG> tags]][[br]] Removes all the <IMG> tags from an HTML document. *HTML-to-Text conversion* _Paragraph formatting_ [[popup Output each paragraph on a single line]] [[br]] Each paragraph is output without hard line breaks (except at the end). This can be useful, depending on how and where the resulting text is to be used. _Miscellaneous text formatting_ [[popup May add Unicode marker to output file]] [[br]] When Unicode is detected in the source the software will output the text as UTF8 and optionally add a file marker that will label the file as "Unicode" in a way that most applications that can cope with Unicode will recognize. _Hyperlinks handling_ [[popup Display link URLs]] [[br]] An option to display hyperlink URLs immediately after the display text in the output. [[popup Replace <IMG> tags by a text marker]] [[br]] Option to place a marker in the text to show where an image has been removed. [[popup Use the ALT attribute to replace <IMG> tags]] [[br]] Option to use the ALT attribute of an <IMG> tag in its text marker. This can help give some sense of what was being shown on the original page.
_Tables conversion_ [[popup Convert table to plain text]] [[br]] [[popup Convert table to comma-delimited data]] [[br]] [[popup Convert table to tab-delimited data]] [[br]] These options (which are mutually exclusive) determine how any <TABLE>s in the HTML should be output in the text. The options allow the table to be converted to plain text, or to delimited text better suited for loading into a spreadsheet such as Excel. $_$_HELP_CHAPTER 2,"Version 1.0" Version 1.0 (August 2002) ========================= The initial release. $_$_HELP_CHAPTER 1,"Documentation available" $_$_HELP_SUBJECT "Documentation available" Documentation available *********************** $_$_HELP_TOPIC_ID ID_HTMLDOCO A set of HTML documentation is maintained for *Detagger*. This is identical in content to the Help file available. Both documents are generated by [AscToHTM] and [AscToRTF] from the same text file. You should have received a set of this documentation when you got your copy of *Detagger*, however you can get more recent versions by visiting the web site http://www.jafsoft.com/doco/docindex.html where you will also find instructions on how to get a .zip copy for your own use. $_$_HELP_CHAPTER 1,"Acknowledgements" Acknowledgements **************** I'd like to thank all the people who have helped me produce *Detagger* and its related products. Although *Detagger* is a one-man programming effort, I really wouldn't have come this far without the support and encouragement of friends and users. I'd like to thank my beta testers. For *Detagger* special thanks also go to *Greg Platt* for suggesting text conversion options that would never have occurred to me. A special thanks goes to the people who volunteered to translate aspects of JafSoft software into other languages for the benefit of users. They are *Andre Martinez* (French) *Gonzalo San Martin* (Spanish)
*Gianluigi Pizzuto* (Italian) *Dan Svarreby* (Swedish) *Alexander (aka J-34)* (Russian) *Jurrien Dokter* (Dutch) *Jürgen Krane* (German) Finally, my biggest thanks are to my registered customers who keep asking me for features that would never even have occurred to me.