The Mole


 










Document Classifier

Interest in the automated classification of documents into predefined categories is growing rapidly, driven by the increased availability of documents in electronic form and the ensuing need to organize them.

When do you need an Automated Document Classifier?

  • Implementing a new Document Management System
  • Business Merger or Acquisition – Combining the information assets of both organisations
  • Corporate Governance – Understanding and managing the business Knowledge Assets
  • Compliance – Archiving and document retention policies

The current approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of pre-classified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting of the manual definition of a classifier by domain experts) are a high degree of effectiveness, considerable savings of expert resource costs, and straightforward portability to different domains.

Our Approach

The Mole Document Classifier offers an effective solution for the automatic recognition and classification of unstructured information. It analyses collections of documents and assigns each document to a classification structure on the basis of its properties and content. The classification structures are defined according to the specific requirements of the individual customer. The classifier is trained on sample document sets.

The Mole Document Classifier is an end-to-end solution: it lets the user configure and process source document repositories, extracts document properties and content for analysis, calculates classification value rankings, and moves documents into target document repositories.
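The end-to-end flow described above (read from a source repository, extract content, decide a category, move to a target repository) can be sketched as follows. This is a minimal illustration only, not the product's implementation: it assumes plain-text files in a file-system directory, and the names `classify_repository`, `keyword_classifier` and `demo` are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Dict


def classify_repository(source: Path, target: Path,
                        classify: Callable[[str], str]) -> Dict[str, str]:
    """Move each .txt file in `source` into `target/<category>/`.

    Returns a mapping of file name -> assigned category.
    """
    assignments: Dict[str, str] = {}
    # sorted() materialises the listing before any file is moved.
    for path in sorted(source.glob("*.txt")):
        text = path.read_text(encoding="utf-8")   # extract content
        category = classify(text)                  # decide the category
        destination = target / category
        destination.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(destination / path.name))
        assignments[path.name] = category
    return assignments


def keyword_classifier(text: str) -> str:
    """Hypothetical stand-in for a trained classifier."""
    return "invoices" if "invoice" in text.lower() else "other"


def demo() -> Dict[str, str]:
    """Run the sketch against two throwaway files in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        source, target = Path(tmp) / "source", Path(tmp) / "target"
        source.mkdir()
        (source / "a.txt").write_text("Invoice 123 from supplier", encoding="utf-8")
        (source / "b.txt").write_text("Meeting notes", encoding="utf-8")
        return classify_repository(source, target, keyword_classifier)
```

Passing the classifier in as a plain function keeps the repository handling separate from the classification logic, so any trained model could be substituted for the keyword stand-in.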

The Mole Document Classifier is based on Naïve Bayesian probability theory combined with Princeton’s WordNet® Thesaurus for deriving conceptual maps of documents. Each conceptual map is transformed into a vector of key-term frequencies, and each document is then compared against the sample document sets to produce a similarity score, or value ranking, that determines its category.
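To make the idea concrete, here is a minimal sketch of a Naïve Bayes classifier over term-frequency vectors. It is not the product's code. The small `CONCEPTS` dictionary is a hypothetical stand-in for a thesaurus lookup (no WordNet API is used), add-one smoothing is an assumption, and all class and function names are illustrative.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for a thesaurus: maps synonyms onto one concept.
CONCEPTS = {"bill": "invoice", "invoices": "invoice",
            "agreement": "contract", "contracts": "contract"}


def term_frequencies(text):
    """Return a term-frequency vector after mapping words to concepts."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(CONCEPTS.get(word, word) for word in words)


class NaiveBayesClassifier:
    def __init__(self):
        self.term_counts = defaultdict(Counter)  # category -> term counts
        self.doc_counts = Counter()               # category -> sample count
        self.vocabulary = set()

    def train(self, category, text):
        """Add one pre-classified sample document to a category."""
        frequencies = term_frequencies(text)
        self.term_counts[category].update(frequencies)
        self.doc_counts[category] += 1
        self.vocabulary.update(frequencies)

    def scores(self, text):
        """Log-probability score per category (higher is a better match)."""
        frequencies = term_frequencies(text)
        total_docs = sum(self.doc_counts.values())
        result = {}
        for category, counts in self.term_counts.items():
            total_terms = sum(counts.values())
            score = math.log(self.doc_counts[category] / total_docs)
            for term, freq in frequencies.items():
                if term not in self.vocabulary:
                    continue  # ignore terms never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                p = (counts[term] + 1) / (total_terms + len(self.vocabulary))
                score += freq * math.log(p)
            result[category] = score
        return result

    def classify(self, text):
        scores = self.scores(text)
        return max(scores, key=scores.get)


clf = NaiveBayesClassifier()
clf.train("invoice", "Invoice number 42 total amount due payment")
clf.train("invoice", "Bill for services amount payable")
clf.train("contract", "This agreement is between the parties")
clf.train("contract", "Contract terms and termination clause")
```

With these four samples, `clf.classify("Please pay the bill amount")` selects the `invoice` category, because "bill" is mapped onto the same concept as "invoice" before the frequencies are counted.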

Document properties introduced as parameters to the processing can influence the classification decisions. For example, supplier invoices more than 7 years old are ‘trashed’, those between 3 and 7 years old go to archive, and those under 3 years old are transferred and classified into a document management system.
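The invoice example can be written as a small rule function. This is a sketch under assumptions: the function name, the `"supplier_invoice"` label and the returned action strings are hypothetical, ages of exactly 3 or 7 years are treated as falling in the archive band, and other document types are assumed to pass straight through to classification.

```python
from datetime import date


def retention_action(doc_type: str, created: date, today: date) -> str:
    """Return 'trash', 'archive' or 'classify' for a document."""
    # Approximate age in years; good enough for an illustration.
    age_years = (today - created).days / 365.25
    if doc_type != "supplier_invoice":
        return "classify"   # assumption: the rule only covers supplier invoices
    if age_years > 7:
        return "trash"
    if age_years >= 3:
        return "archive"
    return "classify"       # under 3 years: goes on to the document management system
```

For instance, with `today = date(2007, 1, 1)`, an invoice created on `date(1998, 1, 1)` is trashed, one from `date(2002, 1, 1)` is archived, and one from `date(2006, 1, 1)` is classified.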



Customer Requirements
  • A Classification Schema
  • Sample documents representative of the classification schema
  • Classification Rules, e.g. document retention periods, author blacklists, expletive blacklists
  • Organisation/Industry Taxonomies (optional)

The Mole Document Classifier Components
  • Classification Scheduler for configuring classification processing and rules.
  • Source and Target Repository Connectors (e.g. file system, document management systems).
  • The Mole Document Classifier.
  • Princeton WordNet® thesaurus.

Technology
  • Windows 2000/2003 Server
  • WebSphere Application Server 5.1.1
  • WebSphere Information Integrator Content Edition 8.2

WebSphere Information Integrator Content Edition enables applications to access and work with a broad range of unstructured information sources as if they were stored and managed in one unified system. Rather than accessing each repository individually or coding to multiple APIs from different vendors or replacing existing systems, you can utilize WebSphere Information Integrator Content Edition's uniform superset API and real-time views of content and workflow. WebSphere Information Integrator Content Edition exposes the full, bi-directional interface of the underlying content management and workflow systems, adheres to the security of those systems, and adds federation services such as metadata mapping, federated search, and single sign-on.



Solution Architecture

 



WebSphere Information Integrator Content Edition Connectors:

WebSphere Information Integrator Content Edition delivers full bi-directional access to underlying content and workflow systems, including the unique capabilities of the underlying repository. Out-of-the-box connectors quickly unify a broad range of content sources and workflow systems without the cost, complexity, and risk of custom programming efforts. Connectors are available to the following systems:

  • DB2® Content Manager and DB2 Content Manager OnDemand
  • WebSphere® MQ Workflow
  • Lotus® Notes®, Lotus Domino.Doc®
  • Documentum
  • FileNet Content Services, FileNet Image Services, FileNet Image Services Resource Adapter, FileNet P8 Content Manager, FileNet P8 Business Process Manager
  • Open Text Livelink
  • Microsoft® Index Server/NTFS
  • Stellent Content Server
  • Interwoven TeamSite
  • Hummingbird Enterprise DM

 

Connector for Documentum
  • Documentum 4i and workflow
  • Documentum 5 and workflow

Connector for Microsoft Index Server
  • Microsoft Index Server/NTFS

Connector for FileNet
  • FileNet Image Services and Workflow
  • FileNet Content Services
  • FileNet Image Services Resource Adapter (ISRA)
  • FileNet P8 Content Manager

Connector for IBM
  • IBM DB2 Content Manager 8
  • IBM DB2 CM OnDemand
  • IBM WebSphere MQ Workflow
  • IBM Lotus Notes R5 (Domino)
  • IBM Lotus Notes R6 (Domino)
  • IBM Lotus Domino.Doc (3.1)

Connector for OpenText
  • OpenText Livelink

Connector for Stellent
  • Stellent Content Server

Connector for Interwoven
  • Interwoven TeamSite

Connector for Hummingbird
  • Hummingbird Enterprise 2004 DM 5.1

 

 




 

Downloads

Download The Mole Document Classifier datasheet (150KB).

Partners

IBM and Superstructure-Upshot have partnered to provide an end-to-end solution that significantly reduces the cost and time involved in classifying documents.

Copyright © 2007 Superstructure Group. All Rights Reserved.