org.sandev.basics.util
Class XMLTextProcessing

java.lang.Object
  extended by org.sandev.basics.util.XMLTextProcessing

public class XMLTextProcessing
extends java.lang.Object

Provides raw text translation services for XML.

This class uses StringCharacterIterator combined with Character.isWhitespace to do its work. It does not use StringTokenizer or StreamTokenizer (which, for that matter, have nothing in common with each other).


Constructor Summary
XMLTextProcessing()
           
 
Method Summary
static java.lang.String convertFromXML(java.lang.String text)
          Performs the inverse of the convertToXML character escapes.
static java.lang.String convertToHTML(java.lang.String text, boolean linkHref, boolean linkEmail, boolean translateFormat)
          Like convertToXML, except less stringent about things like apostrophes, quotes and ampersands.
static java.lang.String convertToXML(java.lang.String text, boolean linkHref, boolean linkEmail, boolean translateFormat)
          Convert the given text to valid XML plaintext.
static void escapeCharacter(java.lang.StringBuffer buf, char currChar, boolean stringentEscape)
          Append the character or the equivalent XML escape string to the given buffer.
static java.lang.String getPrefix(java.lang.String token)
          Return the open parenthesis or other prefix this token starts with, or the empty string if it is unprefixed.
static java.lang.String getSuffix(java.lang.String token)
          Return the close parenthesis or other suffix this token ends with, or the empty string if it is unsuffixed.
static java.lang.String getXMLTagValue(java.lang.String tagname, java.lang.String input)
          Given some XML input, retrieve the value of the given tag.
static java.lang.String processToXML(java.lang.String text, boolean linkHref, boolean linkEmail, boolean translateFormat, boolean stringentEscape)
          Workhorse for convertToXML, convertToHTML methods.
static java.lang.String translateToken(java.lang.String token, boolean linkHref, boolean linkEmail)
          If the given token looks like an email address or a hyperlink then make it into one.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

XMLTextProcessing

public XMLTextProcessing()
Method Detail

convertToXML

public static java.lang.String convertToXML(java.lang.String text,
                                            boolean linkHref,
                                            boolean linkEmail,
                                            boolean translateFormat)
Convert the given text to valid XML plaintext. This method has three main functions:
  1. Escape any problematic characters that XML would otherwise try to process during subsequent parsing.
  2. Translate newlines into HTML breaks so they don't get lost.
  3. Trap things like email addresses or URLs and translate them into hyperlinks for display.


convertToHTML

public static java.lang.String convertToHTML(java.lang.String text,
                                             boolean linkHref,
                                             boolean linkEmail,
                                             boolean translateFormat)
Like convertToXML, except less stringent about things like apostrophes, quotes and ampersands.


processToXML

public static java.lang.String processToXML(java.lang.String text,
                                            boolean linkHref,
                                            boolean linkEmail,
                                            boolean translateFormat,
                                            boolean stringentEscape)
Workhorse for convertToXML, convertToHTML methods. If translateFormat is true, then newlines are converted into breaks. We also convert tab characters into four non-breaking spaces, but since tabs can't easily be entered into most interfaces (tabbing usually takes you to the next entry field), we also convert sequential hard spaces into "nbsp"s. The way this works is that every second space is replaced with an nbsp and not echoed.

The tough part about this is that in an HTML display, a space between two characters gets displayed, while a space at the beginning of a line is typically ignored. So "blah &nbsp;blah" is two spaces, whereas " &nbsp;blah" at the beginning of a line is one space. As a result, when creating an indented list in text, we lose the first space character, and cut-and-paste into an editor loses one level of indenting. To avoid this we would need to track whether we are at the beginning of a new line, which doesn't seem worth it. The relative positions look OK.

This also caused annoyances when a sentence ended with two spaces, since the HTML wraps the nbsp onto the next line, causing it to indent, which looks weird. To avoid that, we skip counting one hard space directly after the end of a sentence.
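
The space handling described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual processToXML code: the class and method names (HardSpaceSketch, translateSpaces, isSentenceEnd) are hypothetical, "&nbsp;" is assumed as the emitted entity, and a sentence end is assumed to be a period, exclamation mark or question mark.

```java
// Hypothetical sketch of the "every second space becomes an nbsp" rule.
final class HardSpaceSketch {

    // Assumption: these characters mark the end of a sentence.
    static boolean isSentenceEnd(char c) {
        return c == '.' || c == '!' || c == '?';
    }

    static String translateSpaces(String text) {
        StringBuilder out = new StringBuilder();
        int run = 0;        // spaces counted so far in the current run
        char prev = '\0';   // previous input character
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\t') {
                // A tab becomes four non-breaking spaces.
                out.append("&nbsp;&nbsp;&nbsp;&nbsp;");
                run = 0;
            } else if (c == ' ') {
                if (run == 0 && isSentenceEnd(prev)) {
                    // First space after a sentence end is echoed but not counted.
                    out.append(' ');
                } else {
                    run++;
                    // Every second counted space is replaced, not echoed.
                    out.append(run % 2 == 0 ? "&nbsp;" : " ");
                }
            } else {
                out.append(c);
                run = 0;
            }
            prev = c;
        }
        return out.toString();
    }
}
```

With this sketch, two spaces between words come out as a space followed by an nbsp, while the two spaces after a sentence end stay as plain spaces, so a wrapped line does not start with an nbsp.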

matching on newlines

We have to match on backslash n explicitly when recognizing newlines, or text values that are created programmatically don't always get formatted. In other words, if you explicitly set the value of a large text field to a string with an embedded backslash n, then it won't be translated (at least on Windows). So the upshot is that either the Unicode newline character or an explicit backslash n will be recognized as a newline. That said, a CRLF needs to be recognized as a single newline character or we end up double spaced.
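
A minimal sketch of this newline matching, assuming the break is emitted as "<br/>" (the exact markup is not stated here) and using hypothetical names (NewlineSketch, translateNewlines). A lone carriage return is passed through unchanged, which is also an assumption.

```java
// Hypothetical sketch of newline recognition; not the actual processToXML code.
final class NewlineSketch {

    static final String BREAK = "<br/>"; // assumed break markup

    static String translateNewlines(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            boolean hasNext = i + 1 < text.length();
            if (c == '\r' && hasNext && text.charAt(i + 1) == '\n') {
                // CRLF counts as one newline, avoiding double spacing.
                out.append(BREAK);
                i += 2;
            } else if (c == '\n') {
                // The newline character itself.
                out.append(BREAK);
                i += 1;
            } else if (c == '\\' && hasNext && text.charAt(i + 1) == 'n') {
                // A literal backslash followed by the letter n.
                out.append(BREAK);
                i += 2;
            } else {
                out.append(c);
                i += 1;
            }
        }
        return out.toString();
    }
}
```

One consequence of matching the two-character sequence is that any text containing a literal backslash followed by "n", such as a Windows path segment, would also be turned into a break.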


translateToken

public static java.lang.String translateToken(java.lang.String token,
                                              boolean linkHref,
                                              boolean linkEmail)
If the given token looks like an email address or a hyperlink then make it into one. Translations are toggled based on the parameter flags. Basically, if a token starts with http:// then we treat it as a hyperlink. Otherwise, if it contains an "@" character and a ".", we treat it as an email address.
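
The decision rule can be sketched as below. The names (TokenLinkSketch, linkToken) are hypothetical, and the anchor markup, including the mailto: form, is an assumption; the markup translateToken really emits is not documented here.

```java
// Hypothetical sketch of the hyperlink / email decision rule.
final class TokenLinkSketch {

    static String linkToken(String token, boolean linkHref, boolean linkEmail) {
        if (token.startsWith("http://")) {
            // Hyperlink candidate: link it only if the flag allows.
            return linkHref
                ? "<a href=\"" + token + "\">" + token + "</a>"
                : token;
        }
        if (linkEmail && token.indexOf('@') >= 0 && token.indexOf('.') >= 0) {
            // Contains both "@" and ".": treat it as an email address.
            return "<a href=\"mailto:" + token + "\">" + token + "</a>";
        }
        return token; // anything else passes through unchanged
    }
}
```

For example, linkToken("bob@example.com", true, false) returns the token unchanged because email linking is switched off.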


getPrefix

public static java.lang.String getPrefix(java.lang.String token)
Return the open parenthesis or other prefix this token starts with, or the empty string if it is unprefixed.


getSuffix

public static java.lang.String getSuffix(java.lang.String token)
Return the close parenthesis or other suffix this token ends with, or the empty string if it is unsuffixed.
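
As an illustration of what getPrefix and getSuffix might do, here is a small sketch. Which characters count as a prefix or suffix is not documented, so the two character sets below are assumptions, as are the names AffixSketch, leadingAffix and trailingAffix.

```java
// Hypothetical sketch of prefix / suffix detection on a token.
final class AffixSketch {

    // Assumed character sets; the real ones are not documented.
    static final String PREFIX_CHARS = "([{\"'";
    static final String SUFFIX_CHARS = ")]}\"'.,;:!?";

    // Leading run of prefix characters, or "" if there is none.
    static String leadingAffix(String token) {
        int end = 0;
        while (end < token.length()
                && PREFIX_CHARS.indexOf(token.charAt(end)) >= 0) {
            end++;
        }
        return token.substring(0, end);
    }

    // Trailing run of suffix characters, or "" if there is none.
    static String trailingAffix(String token) {
        int start = token.length();
        while (start > 0
                && SUFFIX_CHARS.indexOf(token.charAt(start - 1)) >= 0) {
            start--;
        }
        return token.substring(start);
    }
}
```

Presumably this kind of splitting lets a token such as "(http://example.com)." be linked without swallowing the surrounding punctuation: the sketch yields "(" as the prefix and ")." as the suffix.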


escapeCharacter

public static void escapeCharacter(java.lang.StringBuffer buf,
                                   char currChar,
                                   boolean stringentEscape)
Append the character or the equivalent XML escape string to the given buffer. This replaces things like apostrophes, ampersands and the like with their equivalent escape strings.
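
A rough sketch of the escaping, assuming that angle brackets are always escaped and that the stringentEscape flag controls ampersands, quotes and apostrophes (the characters convertToHTML is described as less stringent about). The names EscapeSketch, appendEscaped and escape are hypothetical.

```java
// Hypothetical sketch of per-character escaping into a buffer.
final class EscapeSketch {

    static void appendEscaped(StringBuffer buf, char c, boolean stringent) {
        switch (c) {
            case '<':  buf.append("&lt;"); break;
            case '>':  buf.append("&gt;"); break;
            // Assumption: these three are only escaped in stringent mode.
            case '&':  buf.append(stringent ? "&amp;" : "&"); break;
            case '"':  buf.append(stringent ? "&quot;" : "\""); break;
            case '\'': buf.append(stringent ? "&apos;" : "'"); break;
            default:   buf.append(c);
        }
    }

    // Convenience wrapper that escapes a whole string.
    static String escape(String text, boolean stringent) {
        StringBuffer buf = new StringBuffer();
        for (int i = 0; i < text.length(); i++) {
            appendEscaped(buf, text.charAt(i), stringent);
        }
        return buf.toString();
    }
}
```

In stringent mode the input a<b & 'c' becomes a&lt;b &amp; &apos;c&apos;, while in the relaxed mode only the angle bracket is replaced.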


convertFromXML

public static java.lang.String convertFromXML(java.lang.String text)
Performs the inverse of the convertToXML character escapes.
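
A minimal sketch of such an inverse, assuming the five predefined XML entities are the escapes being reversed. UnescapeSketch and unescape are hypothetical names. The ampersand is handled last so that doubly escaped text such as "&amp;lt;" is only unescaped one level.

```java
// Hypothetical sketch of reversing the character escapes.
final class UnescapeSketch {

    static String unescape(String text) {
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&apos;", "'")
                   .replace("&amp;", "&"); // last, to avoid double unescaping
    }
}
```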


getXMLTagValue

public static java.lang.String getXMLTagValue(java.lang.String tagname,
                                              java.lang.String input)
Given some XML input, retrieve the value of the given tag. Return null if for any reason the tag cannot be read. This is a simple string manipulation hack where we read forward to the first instance of a less-than sign followed by the tag name. Then we read forward to the first instance of a greater-than sign after that and return the contents. Useful when you just need to pull a value out and don't want to load an entire parser.
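
The string scan described above might look something like this. It is a sketch only: TagValueSketch and tagValue are hypothetical names, and treating the matching close tag as the end of the contents is an assumption, since the description does not say where the contents stop.

```java
// Hypothetical sketch of pulling a tag value out with plain string searches.
final class TagValueSketch {

    static String tagValue(String tagname, String input) {
        if (tagname == null || input == null) {
            return null;
        }
        // First "<" followed by the tag name.
        int open = input.indexOf("<" + tagname);
        if (open < 0) {
            return null;
        }
        // First ">" after that closes the start tag.
        int gt = input.indexOf('>', open);
        if (gt < 0) {
            return null;
        }
        // Assumption: the contents run up to the matching close tag.
        int close = input.indexOf("</" + tagname, gt + 1);
        if (close < 0) {
            return null;
        }
        return input.substring(gt + 1, close);
    }
}
```

Because the match is a simple prefix search, a tag name such as "name" would also match a tag called "namespace" if that appears first, which is the kind of trade-off to expect from skipping a real parser.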