The Mole Document Classifier
The automated classification of documents into predefined categories is
attracting growing interest, driven by the increased availability of documents
in electronic form and the ensuing need to organize them.
When do you need an Automated Document Classifier?
- Implementing a new Document Management System
- Business Merger or Acquisition and the combining of information assets
- Corporate Governance: understanding and managing the business Knowledge Assets
- Compliance: archiving and document retention policies
The current approach to this problem is based on machine learning techniques: a
general inductive process automatically builds a classifier by learning, from
a set of pre-classified documents, the characteristics of the categories. The
advantages of this approach over the knowledge engineering approach
(consisting of the manual definition of a classifier by domain experts) are a
high degree of effectiveness, considerable savings of expert resource costs,
and straightforward portability to different domains.
Our Approach
The Mole Document Classifier offers an effective solution to the automatic
recognition and classification of unstructured information. It analyses
collections of documents and assigns each document to a classification
structure on the basis of its properties and content. The classification
structures are defined according to the specific requirements of the
individual customer. The Mole Document Classifier is trained using sample
document sets.
The Mole Document Classifier is an end-to-end solution. It lets the user
configure and process source document repositories, extracts document
properties and content for analysis, calculates classification value rankings,
and moves documents into target document repositories.
The Mole Document Classifier is based on Naïve Bayesian probability theory
combined with Princeton’s WordNet® Thesaurus for deriving conceptual maps of
documents. Each conceptual map is transformed into a vector of key term
frequencies. Each document is then compared against the sample document sets
to compute a similarity score, or value ranking, which determines the
document category.
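The scoring step described above can be illustrated with a minimal sketch of a multinomial Naïve Bayes classifier over term-frequency vectors. This is not The Mole Document Classifier's actual implementation: the class name, function names, categories and sample texts are all hypothetical, add-one smoothing is an assumed choice, and the WordNet-based concept mapping is left out entirely.

```python
import math
import re
from collections import Counter, defaultdict


def term_frequencies(text):
    """Lower-case the text and count word occurrences (a term-frequency vector)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


class NaiveBayesClassifier:
    """Hypothetical multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.term_counts = defaultdict(Counter)  # category -> term -> count
        self.doc_counts = Counter()              # category -> number of samples
        self.vocabulary = set()

    def train(self, samples):
        """samples: iterable of (category, text) pairs from the sample sets."""
        for category, text in samples:
            tf = term_frequencies(text)
            self.term_counts[category].update(tf)
            self.doc_counts[category] += 1
            self.vocabulary.update(tf)

    def scores(self, text):
        """Return {category: log-probability score} for one document."""
        tf = term_frequencies(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        result = {}
        for category, counts in self.term_counts.items():
            total_terms = sum(counts.values())
            # Prior: share of sample documents in this category.
            score = math.log(self.doc_counts[category] / total_docs)
            for term, freq in tf.items():
                # Smoothed likelihood of the term given the category.
                likelihood = (counts[term] + 1) / (total_terms + vocab_size)
                score += freq * math.log(likelihood)
            result[category] = score
        return result

    def classify(self, text):
        """Return the category with the highest score."""
        scores = self.scores(text)
        return max(scores, key=scores.get)


# Illustrative sample document sets (invented for this sketch).
classifier = NaiveBayesClassifier()
classifier.train([
    ("invoice", "invoice number total amount due payment supplier"),
    ("invoice", "supplier invoice payment terms amount"),
    ("contract", "agreement between parties terms and conditions signed"),
    ("contract", "contract parties obligations termination clause"),
])
print(classifier.classify("please settle the amount due on this supplier invoice"))
```

The per-category scores play the role of the value ranking: the highest-scoring category is taken as the document's category, and the remaining scores show how close the alternatives were.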
Document properties introduced as parameters to the processing can influence
the classification decisions. For example, supplier invoices over 7 years old
are ‘trashed’, those between 3 and 7 years old go to archive, and those under
3 years old are transferred to and classified in a document management system.
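The supplier invoice example can be sketched as a small rule function. The function names, the document type label and the action names are assumptions made for illustration, as is the choice to send all other document types straight to classification; the product's real rule format is not described here.

```python
from datetime import date


def is_older_than(created, today, years):
    """True if `created` falls strictly before the date `years` years before `today`."""
    cutoff = (today.year - years, today.month, today.day)
    return (created.year, created.month, created.day) < cutoff


def retention_action(doc_type, created, today):
    """Return a hypothetical action name for a document under the example rule."""
    if doc_type != "supplier_invoice":
        return "classify"  # assumption: the rule only applies to supplier invoices
    if is_older_than(created, today, 7):
        return "trash"     # over 7 years old
    if not is_older_than(created, today, 3) and created > date(today.year - 3, 1, 1).replace(
        month=1, day=1
    ) and (created.year, created.month, created.day) > (today.year - 3, today.month, today.day):
        return "classify"  # under 3 years old
    return "archive"       # between 3 and 7 years old, inclusive


today = date(2024, 6, 15)
print(retention_action("supplier_invoice", date(2015, 1, 10), today))
print(retention_action("supplier_invoice", date(2019, 6, 15), today))
print(retention_action("supplier_invoice", date(2023, 1, 1), today))
```

Comparing year, month and day tuples avoids constructing a cutoff date that may not exist (for example 29 February), and documents exactly 3 or 7 years old fall into the archive band.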
Customer Requirements
- A Classification Schema
- Sample documents representative of the classification schema
- Classification Rules - e.g. Document Retention periods, Author blacklists, Language expletive blacklists
- Organisation/Industry Taxonomies (optional)
The Mole Document Classifier Components
- Classification Scheduler for configuring classification processing and rules.
- Source and Target Repository Connectors (e.g. file system, document management systems).
- The Mole Document Classifier.
- Princeton WordNet® thesaurus.
Technology
- Windows 2000/2003 Server
- WebSphere Application Server 5.1.1
- WebSphere Information Integrator Content Edition 8.2
WebSphere Information Integrator Content Edition enables
applications to access and work with a broad range of unstructured information
sources as if they were stored and managed in one unified system. Rather than
accessing each repository individually or coding to multiple APIs from
different vendors or replacing existing systems, you can utilize WebSphere
Information Integrator Content Edition's uniform superset API and real-time
views of content and workflow. WebSphere Information Integrator Content
Edition exposes the full, bi-directional interface of the underlying content
management and workflow systems, adheres to the security of those systems, and
adds federation services such as metadata mapping, federated search, and
single sign-on.

Solution Architecture

WebSphere Information Integrator Content Edition Connectors:
WebSphere Information Integrator Content Edition delivers full bi-directional
access to underlying content and workflow systems, including the unique
capabilities of the underlying repository. Out-of-the-box connectors quickly
unify a broad range of content sources and workflow systems without the cost,
complexity, and risk of custom programming efforts. Connectors are available
to the following systems:
- DB2® Content Manager and DB2 Content Manager OnDemand
- WebSphere® MQ Workflow
- Lotus® Notes®, Lotus Domino.Doc®
- Documentum
- FileNet Content Services, FileNet Image Services, FileNet Image Services Resource Adapter, FileNet P8 Content Manager, FileNet P8 Business Process Manager
- Open Text Livelink
- Microsoft® Index Server/NTFS
- Stellent Content Server
- Interwoven TeamSite
- Hummingbird Enterprise DM
Connectors and supported versions:
- Connector for Documentum: Documentum 4i and workflow; Documentum 5 and workflow
- Connector for Microsoft Index Server: Microsoft Index Server/NTFS
- Connector for FileNet: FileNet Image Services and Workflow; FileNet Content Services; FileNet Image Services Resource Adapter (ISRA); FileNet P8 Content Manager
- Connector for IBM: IBM DB2 Content Manager 8; IBM DB2 CM OnDemand; IBM WebSphere MQ Workflow; IBM Lotus Notes R5 (Domino); IBM Lotus Notes R6 (Domino); IBM Lotus Domino.Doc (3.1)
- Connector for OpenText: OpenText Livelink
- Connector for Stellent: Stellent Content Server
- Connector for Interwoven: Interwoven TeamSite
- Connector for Hummingbird: Hummingbird Enterprise 2004 DM 5.1