Date: 1999-12-16
NSA's Semantic Forests: Schneier's Analysis
-.-. --.- -.-. --.- -.-. --.- -.-. --.- -.-. --.- -.-. --.-
Bruce Schneier on the technical capabilities of the NSA's
"Semantic Forests" patent. His closing line: "I'm surprised
the NSA hasn't classified this work."
-.-. --.- -.-. --.- -.-. --.- -.-. --.- -.-. --.- -.-. --.-
The NSA has been patenting, and publishing, technology that
is relevant to ECHELON.
ECHELON is a code word for an automated global
interception system operated by the intelligence agencies of
the U.S., the UK, Canada, Australia and New Zealand. (The
NSA takes the lead.) According to reports, it is capable of
intercepting and processing many types of transmissions,
throughout the globe.
Over the past few months, the U.S. House of Representatives
has been investigating ECHELON. As part of these
investigations, the House Select Committee on Intelligence
requested documents from the NSA regarding its operating
standards for intelligence systems like ECHELON that may
intercept communications of Americans. To everyone's
surprise, NSA officials invoked attorney-client privilege and
refused to disclose the documents. EPIC has taken the
NSA to court.
I've seen estimates that ECHELON intercepts as many as 3
billion communications every day, including phone calls,
e-mail messages, Internet downloads, satellite transmissions,
and so on. The system gathers all of these transmissions
indiscriminately, then sorts and distills the information
through artificial intelligence programs. Some sources have
claimed that ECHELON sifts through 90% of the Internet's
traffic.
How does it do it? Read U.S. Patent 5,937,422,
"Automatically generating a topic description for text and
searching and sorting text by topic using the same,"
assigned to the NSA. Read two papers titled "Text Retrieval
via Semantic Forests," written by NSA employees.
Semantic Forests, patented by the NSA (the patent does not
use the name), was developed to retrieve information "on the
output of automatic speech-to-text (speech recognition)
systems" and to label text by topic. It is described as a
functional software program.
The researchers tested this program on numerous pools of
data, and improved the test results from one year to the next.
All this occurred in the window between when the NSA
applied for the patent, more than two years ago, and when
the patent was granted this year.
One of the major technological barriers to implementing
ECHELON is the lack of automatic search tools for voice
communications. Computers need to "think" like humans
when analyzing the often imperfect computer transcriptions of
voice conversations.
The patent claims that the NSA has solved this problem.
First, a computer automatically assigns a label, or topic
description, to raw data. This system is far more
sophisticated than previous systems because it labels data
based on meaning, not on keywords.
Second, the patent includes an optional pre-processing step
which cleans up text, much of which the agency appears to
expect will come from human conversations. This pre-
processing will remove what the patent calls "stutter
phrases." These phrases "frequently occurs [sic] in text
based on speech." The pre-processing step will also remove
"obvious stop words" such as the article "the."
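The pre-processing step can be sketched roughly as follows. This is
only an illustration, not the patent's procedure: the article does not
quote the patent's definition of a "stutter phrase," so the sketch
assumes it means a word repeated back to back, and the stop-word list
is hypothetical (the only example given above is "the").

```python
import re

# Hypothetical stop-word list; the article names only "the" as an example.
STOP_WORDS = {"the", "a", "an", "of", "to", "and"}


def preprocess(text):
    """Lowercase the text, collapse repeated words, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    deduped = []
    for tok in tokens:
        # Assumed meaning of "stutter": the same word repeated back to back.
        if deduped and deduped[-1] == tok:
            continue
        deduped.append(tok)
    return [t for t in deduped if t not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("I I I think the the shipment arrives arrives Tuesday"))
```

Cleaning happens before any labeling so that transcription noise does
not inflate word counts in later steps.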
The invention is designed to sift through foreign language
documents, either in text, or "where the text may be derived
from speech and where the text may be in any language," in
the words of the patent.
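To make the first step above more concrete, here is a toy sketch of
labeling text by meaning rather than by keyword. It is not the
patented method. It assumes a tiny, made-up dictionary
(TOY_DICTIONARY) that maps each word to words from its definition, and
it picks as the label whichever definition word is shared by the most
input words. Every name in it is hypothetical.

```python
from collections import Counter

# Hypothetical mini-dictionary: each word maps to words from its definition.
TOY_DICTIONARY = {
    "rifle": ["weapon", "gun"],
    "pistol": ["weapon", "gun"],
    "grenade": ["weapon", "explosive"],
    "truck": ["vehicle"],
}


def label_topic(tokens, dictionary=TOY_DICTIONARY, top_n=1):
    """Return the definition words shared by the most input tokens."""
    scores = Counter()
    for tok in tokens:
        # Words missing from the dictionary contribute nothing.
        for concept in dictionary.get(tok, []):
            scores[concept] += 1
    return [concept for concept, _ in scores.most_common(top_n)]


if __name__ == "__main__":
    print(label_topic(["rifle", "pistol", "grenade", "truck"]))
```

The label returned here is "weapon," a word that never appears in the
input. That is the practical difference between labeling by meaning
and matching keywords.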
The papers go into more detail on the implementation of this
technology. The NSA team ran the software over several
pools of documents, some of which were text from spoken
words (called SDR), and some regular documents. They ran
the tests over each pool separately. Some of the text
documents analyzed appear to include data from "Internet
discussion groups," though I can't quite determine if these
were used to train the software program, or illustrate results.
The "30-document average precision" (whatever that is) on
one test pool rose significantly in one year, from 19% in 1997
to 27% in 1998. This shows that they're getting better.
It appears that the tests on the pool of speech-to-text
documents came in at between 20% and 23% accuracy (see
Tables 5 and 6 of the "Semantic Forests TREC7" paper) at
the 30-document average. (A "document" in this definition
can mean a topic query. In other words, 30 documents can
actually mean 30 questions to the database.)
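One plausible reading of "30-document average precision," and it is
only an assumption since the article does not define the term, is the
cutoff measure common in retrieval evaluations: for each query, take
the top 30 documents returned, compute the fraction that are relevant,
and average that fraction over all queries. A minimal sketch under
that assumption, with hypothetical function names:

```python
def precision_at_k(ranked_ids, relevant_ids, k=30):
    """Fraction of the top-k retrieved documents that are relevant."""
    top = ranked_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    # Divide by k even if fewer than k documents were retrieved.
    return hits / k


def mean_precision_at_k(runs, k=30):
    """Average precision-at-k over queries.

    runs is a list of (ranked_ids, relevant_ids) pairs, one per query.
    """
    return sum(precision_at_k(r, rel, k) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    ranked = list(range(100))
    runs = [(ranked, set(range(0, 60, 5))), (ranked, {0, 1, 2})]
    print(mean_precision_at_k(runs))
```

Under this reading, a score of 27% would mean that, on average, about
8 of the first 30 documents returned for a query were relevant.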
It's pretty clear to me that this technology can be used to
support an ECHELON-like system. I'm surprised the NSA
hasn't classified this work.
The Semantic Forest papers:
http://trec.nist.gov/pubs/trec6/papers/nsa-rev.ps
http://trec.nist.gov/pubs/trec7/papers/nsa-rev.pdf
Source
http://www.counterpane.com
-.- -.-. --.-
-.-. --.- -.-. --.- -.-. --.- -.-. --.- -.-. --.- -.-. --.-
- -.-. --.- -.-. --.- -.-. --.- -.-. --.- -.-. --.- -.-. --.-
edited by Harkank
published on: 1999-12-16