Charlie's Tidy Add-Ons

By Charles Reitzel
Please send bug reports or comments to creitzel@rcn.com

This is a brief page showing a couple of additions to Tidy that I have written.

Latest: 20 February, 2003, Exposed TidyOutputBOM option to TidyATL (and through to Tidy.NET) and sped up syntax highlighting. Removed DLL build, which is now included in TidyLib proper.

3 February, 2003, Added required DLL to tidyui.zip

1 February, 2003, Removed stuff that you can now get from the Tidy Project Page. Added .NET wrapper and syntax highlighting and other goodies to Tidy UI.

Enjoy!

Table Of Contents

The following are current as of 20 February, 2003:

Tidy UI

This is a Windows executable that puts up a GUI for the HTML Tidy library.

Download Tidy UI

Tidy UI Screen Shot

Updates

20 February, 2003, Gave syntax highlighting a decent performance bump. Should only be noticeable on large files (>100KB). Thanks to Ken Wagnitz for nudging me on this one.

3 February, 2003, Added required DLL (RWUXThemeS.dll) to tidyui.zip. Just place it in the same directory as tidyui.exe.

1 February, 2003: Added syntax highlighting, Search/Replace and Copy/Cut/Paste hot keys (Ctrl-C, Ctrl-X and Ctrl-V, respectively). Also, misc bug fixes.

28 October, 2002: Fixed File Save problem reported by Hakon Haugnes.

15 October, 2002: Mostly fixes to various library bugs that affect all bindings.

13 August, 2002: Fixes to problems with recent pretty print updates. Avoid mangling whitespace for normal nodes.

9 August, 2002: Mostly incorporating TidyLib fixes and improvements. Plus, fixing a few bits of UI cruft. I am hoping this is a stable version.

28 July, 2002: Added edit for DOCTYPE option and view for Tidy messages. Double-click on a message and the location in the Original source will be selected. As before, you can create/open/save/save as both HTML and Tidy configuration files. Also, Tidy configuration editing is, hopefully, fairly intuitive.

One nice feature: if you hit the help button in the config pane, it brings up the Tidy Quick Reference entry for that item.

Also, you can preview your input or the tidied HTML in the browser. There is also a button to replace your input with Tidy's output.

To Do List

Fixed Bugs

Done List

Many thanks to Jelks Cabaniss for testing TidyUI.

Language Bindings

I have been exploring various language bindings for the HTML Tidy library.

C++

The first thing I did was to make a C++ wrapper for Tidy. It's pure syntax sugar!™

Have a look.

The C++ wrapper adds no data members and, for things like nodes and attributes, the objects never actually get created. Instead, the C opaque types simply get cast to objects of the appropriate class. Use of callbacks for I/O and error handling is simplified by transforming these calls to virtual methods. The upshot is you get the cleaner syntax and self-cleanup of C++ with no, or very little, added overhead.
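To make the idea concrete, here is a minimal C++ sketch of both tricks: viewing an opaque C handle as an object of a class with no data members, and turning a C callback into a virtual method. The C-style API shown (`DemoNode`, `demoNodeName`, `demoNodeNext`, `demoEmit`) is a made-up stand-in, not TidyLib's actual interface, and `Node`, `Sink`, `StringSink` and `joinNames` are hypothetical names chosen for illustration. Calling members through a cast handle like this works on mainstream compilers but is not strictly sanctioned by the C++ standard, so treat it as a sketch of the technique rather than the wrapper's real code.

```cpp
#include <cassert>
#include <string>

// ---- Hypothetical C-style library (NOT the real TidyLib API) ----
// A real library would hide this struct and declare the functions
// in an extern "C" header.
struct demo_node_impl {
    const char*           name;
    const demo_node_impl* next;
};
typedef const demo_node_impl* DemoNode;   // opaque handle as seen by callers

const char* demoNodeName(DemoNode n) { return n->name; }
DemoNode    demoNodeNext(DemoNode n) { return n->next; }

// C-style output callback: a function pointer plus a user data pointer.
typedef void (*DemoPutByte)(void* userData, unsigned char byte);

void demoEmit(const char* text, DemoPutByte put, void* userData) {
    for (; *text; ++text) put(userData, static_cast<unsigned char>(*text));
}

// ---- C++ sugar layered on top ----
// Node has no data members and is never constructed: a C handle is
// simply reinterpreted as a pointer to Node.
class Node {
public:
    Node() = delete;
    static const Node* from(DemoNode h) {
        return reinterpret_cast<const Node*>(h);
    }
    const char* name() const { return demoNodeName(handle()); }
    const Node* next() const { return from(demoNodeNext(handle())); }
private:
    DemoNode handle() const { return reinterpret_cast<DemoNode>(this); }
};

// The C callback becomes a virtual method: a static trampoline recovers
// the object from the user data pointer and forwards the call.
class Sink {
public:
    virtual ~Sink() {}
    virtual void putByte(unsigned char b) = 0;
    void emit(const char* text) { demoEmit(text, &Sink::trampoline, this); }
private:
    static void trampoline(void* self, unsigned char b) {
        assert(self != nullptr);
        static_cast<Sink*>(self)->putByte(b);
    }
};

// Example override: collect emitted bytes into a std::string.
class StringSink : public Sink {
public:
    std::string text;
    void putByte(unsigned char b) override {
        text.push_back(static_cast<char>(b));
    }
};

// Walk a node chain through the wrapper and join the names with spaces.
std::string joinNames(DemoNode first) {
    std::string out;
    for (const Node* n = Node::from(first); n != nullptr; n = n->next()) {
        if (!out.empty()) out += ' ';
        out += n->name();
    }
    return out;
}
```

Given a two-node chain built from `demo_node_impl` values, `joinNames` walks it through the `Node` view without allocating any wrapper objects, and a `StringSink` receives every byte passed to `emit` through its `putByte` override.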

SWIG

The next thing I did was run it through SWIG, which is an amazing tool for generating language bindings to C libraries. The generated code does a great job of capturing the C++ object model in the respective languages.

Mostly, SWIG just reads the C++ header and generates the appropriate Perl/Java/Python classes and, importantly, all of the environment-specific glue/wrapper code to call into Tidy from each scripting language.

An update on SWIG: implementing virtual overrides in the script language is, understandably, not straightforward, and the details depend on the script language.

HTML::Tidy

Here is a working Perl wrapper. It uses the normal Perl extension type of build/install arrangement. See the readme.txt file. I used a tar archive for distribution, as this is readable on all platforms (WinZip will handle it fine).

First have a look at the SWIG module definition file. The first thing you'll notice is how small and simple it is. The only complication is to make Tidy::Buffer objects appear as strings in Perl.

Note, the HTML::Tidy Perl extension itself is a combination of a shared library (.DLL) and thin Perl wrapper classes (Tidy.pm). TidyLib is linked statically, however. The upside is that DLL hell is avoided. The downside is that, when you update Tidy, you must also update your Perl wrapper separately (which you probably would have had to do anyway).

I am currently researching how to implement I/O virtual calls in Perl. Once the mechanism is in place, it should be possible to parse URLs via LWP, for example.

The current version adds support for Tidy option constants and SaveString(), which returns the document as a Perl string variable.

20 February, 2003, New build to use latest TidyLib.

24 November, 2002, Fixed line endings in Unix version of build.sh and Makefile.PL.unix. I now have a Linux system set up to test these on.

Python?

Java?

COM/ATL

The latest is a simple COM/ATL wrapper for the library. The simple operations are supported: parse file, parse from memory, cleanup, diagnostics, save file and save to memory. You can also set options in the usual ways. I got just a bit fancy and supported the I/O and error handling callbacks. Also, TidyLib fixes for Unicode/UTF-16 are included.

20 February, 2003: Fixed TidyOptionId enum so that all the conditionally compiled options are included, especially TidyOutputBOM. Note, for XHTML/XML output, the BOM is required. However, if you set TidyOutputBOM to false after the parse, then that setting will be respected.

1 February, 2003: Some IDL updates, but no interface/UUID changes. IDL changes are purely for the benefit of generating the .NET wrapper.

The previous fix for character conversion in the ATL wrapper now works fine with UTF16 due to fixes in the core library. Thanks to Moshe Plotkin for identifying the problem and testing updates. Previously, Parse/Save String worked OK only if the current code page and the desired encoding matched. Now, the "String" methods temporarily force the encoding to UTF16LE to work with COM/OLE Unicode strings. This didn't break the test on my Latin1 system. Feedback is still appreciated on non-Western European systems/content.

There is an example of redirecting Tidy output to a static control in the VB test driver. Note, this is still a rough draft. If there is demand, I may flesh it out a bit.

.NET

You can download a pre-generated .NET wrapper here.

20 February, 2003: Regenerated to use latest TidyATL.

I have also been examining how to call Tidy from .NET. So far, there are 3 different options. Which is best depends, as always, on your requirements.

Quick and Dirty

With a few simple declarations, you can call directly into a DLL build of TidyLib. See a simple example VB.NET program sent to me by Phil Weber.

Full Boat

The right way to do it is probably to write a C# class that wraps the DLL build of TidyLib. I tried valiantly to make my C++ wrapper do dual duty as a regular C++ class and, with a few judicious macro definitions, a Managed C++ class. No joy. Managed C++ is simply too different from real C++ to be useful, imo. Or, as I like to say, Managed C++ is neither. Better to write a new C# class from scratch - much like the TidyATL implementation.

Apparently, this is primarily a matter of mangling the declarations so that the .NET runtime will marshal the internal Unicode strings to the "Ansi" (i.e. MBCS) strings used by the DLL. There are still interop issues to be addressed. Predictably, things get complicated when calling out to a .NET I/O implementation or report filters. You can get there from here by using a helper function to marshal delegate objects to function pointers and assigning them within native C code. Arrangements must also be made to keep those delegate objects around for the lifetime of the encapsulating I/O object.
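The lifetime concern has a close analogue in plain C++, which may help picture it. The sketch below is an assumption-laden illustration, not .NET code and not TidyLib's real structures: `DemoSink`, `demoWrite`, `OutputBridge` and `collect` are hypothetical names. A callable object is stored inside the bridge, a plain function pointer is handed to the C-style side, and the bridge has to stay alive for as long as the native code may call back.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical C-style sink record (NOT TidyLib's actual structure):
// native code only sees a function pointer and an untyped context.
struct DemoSink {
    void* userData;
    void (*putByte)(void* userData, unsigned char byte);
};

// Stand-in for the native side: pushes each byte through the sink.
void demoWrite(const DemoSink* sink, const char* text) {
    for (; *text; ++text) {
        sink->putByte(sink->userData, static_cast<unsigned char>(*text));
    }
}

// Owns the callable for as long as the sink record is in use.
// If the bridge were destroyed while native code still held the sink,
// userData would dangle: the same hazard as a collected delegate.
class OutputBridge {
public:
    explicit OutputBridge(std::function<void(unsigned char)> cb)
        : callback_(std::move(cb)) {
        sink_.userData = this;
        sink_.putByte = &OutputBridge::thunk;
    }
    // Non-copyable so that userData always points at the live owner.
    OutputBridge(const OutputBridge&) = delete;
    OutputBridge& operator=(const OutputBridge&) = delete;

    const DemoSink* sink() const { return &sink_; }

private:
    // Plain function usable as a C callback; forwards to the callable.
    static void thunk(void* self, unsigned char b) {
        assert(self != nullptr);
        static_cast<OutputBridge*>(self)->callback_(b);
    }

    std::function<void(unsigned char)> callback_;
    DemoSink sink_;
};

// Usage: the bridge outlives the call into the native-style function.
std::string collect(const char* text) {
    std::string out;
    OutputBridge bridge([&out](unsigned char b) {
        out.push_back(static_cast<char>(b));
    });
    demoWrite(bridge.sink(), text);
    return out;
}
```

In the .NET case the garbage collector plays the role of the destructor here, which is why the delegate objects need to be held in a field of the I/O object rather than created as temporaries.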

Richard Cook has reported that he has got a working version. I'll update this page with a link when I hear he has gone public, so to speak.

Goldilocks

As in, "just right", for my taste anyway. Basically, using the .NET/COM interop layer with TidyATL seems to work well and provides fairly complete functionality. TidyATL is itself quite stable at this point, and COM interop seems to be a predictable and stable bridge into the .NET CLR (Common Language Runtime).

To get up and running, follow these steps, courtesy of Matthew Stanfield. Matt is the one who first told me about this approach and how to go about it. Have a look, for comparison

Register Tidy ATL on your system.

regsvr32 /c TidyATL.dll

Generate the .NET wrapper

TlbImp [\path\to\]TidyATL.dll /out:Tidy.dll

The TidyATL interface (IDL) file has been tweaked to place the resulting assembly in the "Tidy" namespace. The results, however, are uneven. The basic document object comes out right: Tidy.Document. Other types, notably the enumerations, still have their longer C names, e.g. TidyOptionId.TidyCharEncoding. Note, the output DLL must be named "Tidy.dll" to keep everything in the right namespace.

You can view the resulting assembly

IlDasm Tidy.dll

Call Tidy

To use Tidy.NET, first add the following line to your source file:

using Tidy;
  

See the test program for an example of creating a Tidy.Document object and calling its methods.

Compile Your Program

csc /o+ /out:TestTidy.exe /r:Tidy.dll TestTidy\TestIt.cs

Deployment

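This is less clear to me. As far as I can tell, .NET has no direct equivalent of the Java CLASSPATH; the simplest arrangement is to place Tidy.dll in the same directory as your executable. TidyATL.dll must still be registered on the target machine, since the wrapper reaches it through COM interop.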
