CW3 (CleanWWWW) v2.51
There is no German version available yet.
Contents of this document
1 - What is CW3
CW3 (the former WW2HTM program) is a converter. It converts the
(so-called) HTML output of various Windows programs like Winword,
Frontpage and Office 97 into true HTML that is independent of resolution,
hardware and software. The output of CW3 is in most cases smaller than the
input, which is good for loading times and server performance.
This is a BETA version and not quite ready. It is not a port, but a
complete rewrite from scratch.
Please read the Licence before using this program.
This BETA is fully functional, except that it doesn't do much cleaning/checking
yet. I concentrated on the core engine, which processes a tree of files and
splits HTML files into TAGS and TEXTS. This is ready, and plugging in some
checking rules is now easy, but has to wait for the final version.
There is not much documentation yet. Here is a list of all Warnings and
Errors encountered by CW3.
In the subdirectory TESTCASE are some files that
try to stress CW3 and to show what it can and cannot do yet.
The files in the CFG directory are not read or used yet.
The latest version can be found on hobbes or simtel-net.
2 - Input format
CW3 takes the HTML files generated by the various programs or existing
HTML code. It works great on entire sites or homepages.
Have a look at the testcase file and see what CW3 outputs.
2.1 - Changes made by CW3 to the HTML source code
CW3 changes the HTML source code of the input file, but
tries to preserve the final look of the document in the browser.
You can refer to the TESTCASE documents to see what CW3 really does.
Here is a general description of how CW3 works:
- CW3 reads the file in and splits it into TAGS and TEXTS.
- The TAGS are separated into NAME and PARAMETERS.
- CW3 then replaces/removes TAGS and TEXT as specified by a list of RULES.
- CW3 combines TAGS that are the same and have different PARAMETERS
if they are not separated by TEXT (for example, <FONT COLOR=RED><FONT SIZE=1>
</FONT></FONT> will become <FONT COLOR=RED SIZE=1></FONT>).
- CW3 then removes senseless TAG statements (like <B></B>).
- CW3 checks for the proper structure of the document and the places of some
important TAGS (like HTML, HEAD, BODY etc)
- CW3 extracts all LINKS from the file.
- CW3 rebuilds the TAGS from their name and list of parameters, and writes
the TAGS and the TEXT back to the output file.
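The steps above can be pictured with a small Python sketch. This is not
CW3's code: every function name is hypothetical, the RULES and structure
checks are left out, and two assumptions are made: an empty pair counts as
"senseless" only when its opening tag has no parameters (so named anchors
survive), and anything starting with "<!" is passed through untouched.

```python
import re

# All names here are hypothetical; CW3's real internals are not documented.
TOKEN_RE = re.compile(r"<[^>]*>|[^<]+")
ATTR_RE = re.compile(r"""([A-Za-z][\w-]*)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?""")

def parse_tag(raw):
    """Separate a raw tag such as '<FONT SIZE=1>' into (NAME, PARAMETERS)."""
    parts = raw[1:-1].strip().split(None, 1)
    name = parts[0].upper() if parts else ""
    params = {}
    if len(parts) > 1:
        for key, value in ATTR_RE.findall(parts[1]):
            params[key.upper()] = value  # quotes are kept as written
    return name, params

def tokenize(source):
    """Split source into ("text", raw) and ("tag", NAME, PARAMETERS) items."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        raw = match.group()
        # Assumption: comments and declarations ("<!...") pass through as text.
        if raw.startswith("<") and not raw.startswith("<!"):
            tokens.append(("tag", *parse_tag(raw)))
        else:
            tokens.append(("text", raw))
    return tokens

def find_close(tokens, start):
    """Return the index of the closing tag matching the opening tag at start."""
    name, depth = tokens[start][1], 0
    for j in range(start + 1, len(tokens)):
        tok = tokens[j]
        if tok[0] != "tag":
            continue
        if tok[1] == name:
            depth += 1
        elif tok[1] == "/" + name:
            if depth == 0:
                return j
            depth -= 1
    return None

def merge_adjacent(tokens):
    """Combine <X a><X b> into <X a b> when no TEXT separates the two tags."""
    out = list(tokens)
    i = 0
    while i + 1 < len(out):
        a, b = out[i], out[i + 1]
        if (a[0] == "tag" and b[0] == "tag" and a[1] == b[1]
                and not a[1].startswith("/")
                and not set(a[2]) & set(b[2])):  # parameters must differ
            close = find_close(out, i + 1)
            if close is not None:
                out[i] = ("tag", a[1], {**a[2], **b[2]})
                del out[close]   # drop the inner closing tag first
                del out[i + 1]   # then the inner opening tag
                continue         # re-check position i against its new neighbour
        i += 1
    return out

def remove_empty(tokens):
    """Drop parameterless pairs like <B></B> that enclose nothing."""
    out = []
    for tok in tokens:
        if (tok[0] == "tag" and out and out[-1][0] == "tag"
                and tok[1] == "/" + out[-1][1] and not out[-1][2]):
            out.pop()
        else:
            out.append(tok)
    return out

def extract_links(tokens):
    """Collect HREF and SRC values (assumed to be what counts as a LINK)."""
    return [tok[2][key].strip("\"'") for tok in tokens
            if tok[0] == "tag" for key in ("HREF", "SRC") if key in tok[2]]

def rebuild(tokens):
    """Write TAGS (from name and parameters) and TEXT back as one string."""
    parts = []
    for tok in tokens:
        if tok[0] == "text":
            parts.append(tok[1])
        else:
            attrs = "".join(f" {k}={v}" if v else f" {k}"
                            for k, v in tok[2].items())
            parts.append(f"<{tok[1]}{attrs}>")
    return "".join(parts)

def clean(source):
    return rebuild(remove_empty(merge_adjacent(tokenize(source))))
```

Calling clean('<FONT COLOR=RED><FONT SIZE=1>Hi</FONT></FONT><B></B>') in
this sketch gives '<FONT COLOR=RED SIZE=1>Hi</FONT>'. Tag and parameter
names are upper-cased on output, which is a choice of the sketch only.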
3 - Known limitations and bugs
- Text inside <PRE> and </PRE> is more or less destroyed, since
CW3 doesn't know that it mustn't reformat it. I hope no-one still uses PRE
anyway...
- Not much more is done than re-formatting (HTML) or copying (non-HTML)
the input file. However, if you encounter problems, drop me a note.
- Links that are generated by JavaScript are not resolved. I don't feel like
implementing an entire JavaScript engine, and I honestly don't know what benefits
one gets by creating a homepage with JavaScript links, except that it forces
the user to use a browser which understands JavaScript :-/
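To illustrate the PRE limitation: a reformatter has to remember whether it
is currently inside <PRE> and leave that text alone. The Python sketch
below shows the idea only. It assumes that "reformatting" means collapsing
runs of whitespace, which may not be exactly what CW3 does, and the
function name is hypothetical.

```python
import re

TOKEN_RE = re.compile(r"<[^>]*>|[^<]+")

def reformat(source):
    """Collapse whitespace in TEXT, but leave TEXT inside <PRE> untouched."""
    out, pre_depth = [], 0
    for match in TOKEN_RE.finditer(source):
        tok = match.group()
        if tok.startswith("<"):
            words = tok[1:-1].split()
            name = words[0].upper() if words else ""
            if name == "PRE":
                pre_depth += 1
            elif name == "/PRE" and pre_depth:
                pre_depth -= 1
            out.append(tok)
        elif pre_depth:
            out.append(tok)  # preformatted: copy verbatim
        else:
            out.append(re.sub(r"\s+", " ", tok))  # assumed reformatting step
    return "".join(out)
```

Without the pre_depth check, the line breaks and indentation inside PRE
would be collapsed like any other text.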
4 - ToDo list
- Implementing all the checks/things from WWW2HTM
- Showing an index of the HTML tree, letting the user change files in it.
- Checking structure of HTML documents (HTML, HEAD, BODY etc)
- Checking length of title, size of images
- Current version is too slow due to unnecessary updates of the screen (I know, I know...;)
- Converting all comments to inline text so that you can see what is hidden
in your documents.
- Removing all comments for keeping your privacy on the net.
So much needs to be done :-)
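The two comment-related items could look roughly like the Python sketch
below. Neither exists in CW3 yet; the function names and the
"[comment: ...]" output format are made up for illustration.

```python
import html
import re

# Non-greedy match so that each comment is handled on its own.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)

def strip_comments(source):
    """Remove all HTML comments."""
    return COMMENT_RE.sub("", source)

def reveal_comments(source):
    """Turn each comment into visible inline text (hypothetical format)."""
    return COMMENT_RE.sub(
        lambda m: "[comment: " + html.escape(m.group(1).strip()) + "]",
        source)
```

For example, strip_comments("a<!-- secret -->b") returns "ab".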
5 - Revision History
View the Revision History here.
6 - Contact the author
You can either visit my web page at www.pobox.com/~tels or
send me a mail (PGP 2.6.3 encrypted preferred).
Last update: [06/21/97] Tels.
End Of Document.