NATURAL LANGUAGE CHARACTER USE IN E-MAIL
ARE THERE PROBLEMS?
Spoken natural languages are recorded using sequences of characters appropriate to the language. Each character
has an associated graphic rendering (or 'glyph'). In many cases, the glyph used to display the character will
modify the meaning or sense of the underlying character - such as a basic lower-case character and its
upper-case glyph (for example, 'a' and 'A').
ICANN's recent decisions have added 107,000+ characters to what you might find in an e-mail.
A 'codeset' is simply a named collection of all of the characters of a language, their glyphs, and, for each,
a unique numeric value which is assigned to no other character of any other language. It should be noted that
not all of the characters of a language's codeset represent verbal parts of the language - punctuation symbols,
for instance.
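As a rough illustration of the 'unique numeric value' idea, the sketch below uses Python's built-in Unicode tables as a stand-in for the codesets described here. The sample characters and the helper name describe are assumptions chosen only for illustration.

```python
import unicodedata

# Sample characters (illustrative): two Latin letters, a Cyrillic letter
# that looks like Latin 'a', and a punctuation symbol.
samples = ["a", "A", "\u0430", "!"]

def describe(ch):
    """Return (numeric value, official Unicode name) for one character."""
    return ord(ch), unicodedata.name(ch)

for ch in samples:
    value, name = describe(ch)
    print(f"{ch!r}: value {value} (U+{value:04X}), {name}")
```

Latin 'a' and Cyrillic 'а' can look identical on screen, yet they carry different numeric values (97 and 1072).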
The ASCII codeset of 128 characters is the foundation upon which all codesets are built and is included in every
codeset. This codeset is used to establish the structure of, and provide access to, the Internet and its components - including e-mail.
For instance, an e-mail begins with a Header containing various data fields. One field - 'Content-Type' - identifies
the other (non-ASCII) codeset appearing in the e-mail.
The Header functions as an envelope for the actual e-mail transmission body. It contains the fields:
___ 'TO:' - which identifies the mailbox to which the mail is to be delivered
___ 'FROM:' - which identifies the mailbox from which the mail was sent
___ 'SUBJECT:' - which briefly describes the e-mail content
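The Header fields listed above can be read with Python's standard email package. This is a minimal sketch: the message text, the addresses, and the helper name header_summary are made up for illustration.

```python
from email import message_from_string

# A made-up message; the addresses and subject are placeholders.
raw = (
    "To: reader@example.com\n"
    "From: writer@example.com\n"
    "Subject: Greetings\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "\n"
    "Body text goes here.\n"
)

msg = message_from_string(raw)

def header_summary(message):
    """Collect the Header fields discussed above into a dictionary."""
    return {
        "to": message["To"],
        "from": message["From"],
        "subject": message["Subject"],
        "content_type": message.get_content_type(),
        "charset": message.get_content_charset(),
    }

summary = header_summary(msg)
for field, value in summary.items():
    print(f"{field}: {value}")
```

Here the charset parameter of 'Content-Type' is what names the non-ASCII codeset used in the body.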
There is no prohibition against intermingling ASCII characters with the other codeset characters in
any e-mail field or component of the e-mail.
There is no prohibition against more than two codesets appearing in an e-mail. It's possible to include quotes in their
original codesets ... encrypt messages ... confuse and defeat spam detection mechanisms.
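Intermingling can be made visible with a small sketch. It assumes that the first word of a letter's Unicode name (for example LATIN or CYRILLIC) is a good enough hint of its codeset. That is a heuristic, not a formal script lookup. The function name scripts_used and the sample word are hypothetical.

```python
import unicodedata

def scripts_used(text):
    """Return the set of script hints (first word of each letter's Unicode name)."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # Fall back to "UNKNOWN" for characters that have no name.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

plain = "paypal"
mixed = "p\u0430yp\u0430l"  # Cyrillic 'а' (U+0430) in place of Latin 'a'

print(scripts_used(plain))
print(scripts_used(mixed))
print(plain == mixed)  # the two strings look alike but are not equal
```

The two words render almost identically, yet the second mixes two codesets and does not compare equal to the first.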
The use of natural language characters anywhere in an e-mail poses no barrier to delivery of the mail to the addressed mailbox.
However, just because an e-mail can be delivered doesn't mean the mailbox user wants to receive the e-mail. Such unwanted
e-mail is called "spam" and accounts for about 85% of the 243 BILLION e-mails currently sent DAILY.
To reduce the (human and computer) costs of processing that 85%, various mechanisms are used to filter and classify incoming e-mail as
'wanted' or 'unwanted'. The goal is, of course, to deliver only the 'wanted' mail and discard the 'unwanted' mail quickly.
The strategy of three of these mechanisms is to apply their criteria to identify 'unwanted' ('spam') e-mail. Then, any mail not
identified as 'unwanted' is classified as 'wanted'.
___ The 'black-list' mechanism's criteria consist of URLs which might appear in the Header 'FROM:' field and words/phrases which might appear in
the Header 'SUBJECT:' field or other text fields of the e-mail.
___ A second mechanism relies on linguistic analysis of the content of the 'SUBJECT:' and message text fields of the e-mail.
___ A third mechanism relies on structural analysis of the e-mail and its content.
These mechanisms are demonstrably ineffective. They will classify any e-mail using codesets and character sequences they do not expect as 'wanted'.
They can't exclude the vast majority of input e-mail from delivery with their criteria. (A single black-list criterion consisting only of the letters 'a' to 'z' that's 8
characters long has 208,827,064,575 other possible variations.) Intermingling of multiple codesets will decrease detected 'spam' percentages
dramatically.
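The variation count above is 26 to the 8th power, minus the one string the entry itself lists. The sketch below checks that arithmetic and shows how a single substituted character slips past a word-based black-list. The list contents, the function name black_list_classify, and the sample subjects are hypothetical.

```python
ALPHABET_SIZE = 26   # letters 'a' to 'z'
ENTRY_LENGTH = 8

# Every possible 8-letter string, minus the one string the entry lists.
other_variations = ALPHABET_SIZE ** ENTRY_LENGTH - 1
print(f"{other_variations:,}")

# Hypothetical black-list holding a single 8-letter entry.
BLACK_LIST = {"discount"}

def black_list_classify(subject):
    """Classify as 'unwanted' only if a listed word appears; otherwise 'wanted'."""
    words = subject.lower().split()
    return "unwanted" if any(w in BLACK_LIST for w in words) else "wanted"

print(black_list_classify("big discount today"))
# Same subject with Cyrillic 'о' (U+043E) substituted for Latin 'o'.
print(black_list_classify("big disc\u043eunt today"))
```

The second subject reads the same to a human, but the black-list has no matching entry, so it falls through to 'wanted'.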
The 'white-list' mechanism is a totally effective, 100%-accurate solution to spam detection. Its strategy is to identify 'wanted' mail and classify any mail
not so identified as 'unwanted'/'spam'. Again, criteria consist of URLs and words/phrases. But in this methodology, the criteria specify what's 'wanted'. If an 8-character
criterion consisting of the letters 'a' to 'z' is matched, the one 'wanted' mail is identified and the 208,827,064,575 other possible variations will be classified 'unwanted'
with no further effort.
White-list classification is not affected by character intermingling or any other text distortion techniques.
White-listing is totally insensitive to codesets and the meaning of the criteria specification.
Only an exact match between the white-list criteria entry representation and e-mail content results in a 'wanted' classification ... Everything else is 'spam'.
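The exact-match rule can be sketched in a few lines. This is not a description of any particular product: the list contents, the function name white_list_classify, and the sample addresses are assumptions, and the comparison is a plain character-by-character equality test with no normalization.

```python
# Hypothetical white-list of exact sender addresses.
WHITE_LIST = {"friend@example.com"}

def white_list_classify(sender):
    """Classify as 'wanted' only on an exact match; everything else is 'spam'."""
    return "wanted" if sender in WHITE_LIST else "spam"

print(white_list_classify("friend@example.com"))
# Cyrillic 'е' (U+0435) substituted for Latin 'e': no exact match.
print(white_list_classify("fri\u0435nd@example.com"))
# Different letter case: also not an exact match in this sketch.
print(white_list_classify("Friend@example.com"))
```

Because the test is plain equality, the classifier needs no knowledge of which codeset a character comes from or what the entry means.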
EPA uses ONLY white-listing to filter and manage e-mail. It's ready for all of those natural language codesets. It's 100% effective and 100% accurate.
If your software isn't, contact me - Ralph W. Seifert - at 561-762-7685 (live during US EDT business hours).