The Information Infrastructure Technology and Applications (IITA)
Working Group, the highest level of the country's National Information
Infrastructure (NII) technical committee, held an invited workshop in
May 1995 to define a research agenda for digital libraries. The shared
vision is an entire net of distributed repositories in which objects
of any type and any size can be organized and searched within and
across different indexed collections.
The ultimate goal, as described in the IITA report, is the Grand
Challenge of Digital Libraries:
"deep semantic interoperability -- the ability of a user to access, consistently and coherently, similar (though autonomously defined and managed) classes of digital objects and services, distributed across heterogeneous repositories, with federating or mediating software compensating for site-by-site variations...Achieving this will require breakthroughs in description as well as retrieval, object interchange and object retrieval protocols. Issues here include the definition and use of metadata and its capture or computation from objects (both textual and multimedia), the use of computed descriptions of objects, federation and integration of heterogeneous repositories with disparate semantics, clustering and automatic hierarchical organization of information, and algorithms for automatic rating, ranking, and evaluation of information quality, genre, and other properties.""The use of computed descriptions of (multimedia) objects" and "clustering and automatic hierarchical organization of information" present pressing scientific and engineering problems that have a significant potential impact on the US society in this era of the Internet and distributed, multimedia computing.
The traditional approach to creating classification systems and
knowledge sources in library science and classical AI is often
considered top-down since knowledge representations and formats are
pre-defined by human experts or trained librarians and the process of
generating knowledge is structured and well-defined. A complementary
bottom-up approach to knowledge creation has been suggested by
researchers in machine learning, statistical analysis, and neural
networks.
General-purpose clustering algorithms have been in existence since the
1960s. Hierarchical and non-hierarchical clustering algorithms have
been developed primarily for numeric analysis purposes. Some of these
algorithms have then been adopted in digital library
applications. Several factor analysis based techniques such as latent
semantic indexing (LSI) and multi-dimensional scaling (MDS) have also
been adopted in textual analysis. More recently, neural network based
self-organizing map (SOM) textual classification systems have been
reported by Chen et al. in several digital library applications.
Clustering and classification techniques for real-life, large-scale
collections are critically needed for developing knowledge structures
for next-generation digital libraries.
Most of the more robust clustering algorithms (e.g., Voorhees method, Ward's
method) are computationally intensive, and they often are O(N^2) or O(N^3)
in complexity, where N is the number of objects to be clustered. Digital
library science is data-intensive, one in which performance measure is often
based on the amount of data (number of objects or size of collections).
The computational complexity of such algorithms has caused severe engineering
problems, especially for mid-to-large-scale applications (from millions
to billions of digital records, TBs collection). The largest calculation
that we were able to perform on a 32-node Covex Exemplar and a 32-node
SGI Origin2000 (2GBs RAM and 200 GBs of disk), respectively, was based
on the neural network SOM algorithm for an Internet collection of about
100K web pages.
Many researchers have attempted to optimize clustering algorithms,
especially for sparse textual analysis applications. Dmitri and Chen
developed a scaleable SOM algorithm that took advantage of the
sparseness of coordinates in document input vectors and thereby
reduced the SOM computational complexity by several orders of
magnitude. The resulting complexity of the algorithm is proportional
to the average number of non-zero coordinates in an input vector,
instead of the total number of input vector coordinates. We believe
the same ``keyword sparseness'' optimization principle observed in
textual applications could be applied in the optimization of other
conventional clustering algorithms as well.
In addition, hardware advancement in recent years, especially the
development of shared memory multi-processors, has made large-scale,
data-intensive, parallel analysis a promising approach. We have
recently reported our experiment in using parallel supercomputers to
analyze millions of engineering abstracts and automatically generate
engineering concept spaces (thesauri). Implementation of algorithms
developed in the 1960s and 1970s may be feasible because of algorithm
optimization and hardware advances.
Conventional approaches to addressing information overload and
interoperability problems are manual in nature, requiring human
experts as information intermediaries to create knowledge structures
and/or classification systems (e.g., the National Library of
Medicine's Unified Medical Language System, UMLS) in order to bridge
vocabulary differences. As information content and collections become
even larger and more dynamic (thus rendering manual knowledge
structures more difficult to create), we believe a system-aided,
algorithmic, bottom-up approach to organizing digital content is
needed.
One of the critical scientific objectives of the digital library community
for the next several years is to:
"Develop scalable and efficient statistical and AI clustering algorithms to create classification systems based on large-scale (millions and billions) domain-specific digital library collections."To be able to automatically analyze digital objects from the order of 10^5 (100K) to the order of 10^9 (billion), a proportional 1,000 to 10,000 fold increase in computing resources is needed -- in cycles, RAM, and disk space. An intermediate phase of 10 to 100 fold increase in computing resources will enable computation for 10^6 (million) to 10^7 (10 millions) of digital objects.