High-Performance Computing Needs of 
Digital Library Community: 
A Knowledge Management Perspective

Hsinchun Chen, University of Arizona
Recommendations to the NSF High-Performance Computing Initiative

A short description of the scientific or engineering problem being addressed, as well as an indication of its importance scientifically, technologically, economically, socially, etc.

The Information Infrastructure Technology and Applications (IITA) Working Group, the highest level of the country's National Information Infrastructure (NII) technical committee, held an invited workshop in May 1995 to define a research agenda for digital libraries. The shared vision is an entire net of distributed repositories in which objects of any type and any size can be organized and searched within and across different indexed collections.
The ultimate goal, as described in the IITA report, is the Grand Challenge of Digital Libraries:

"deep semantic interoperability -- the ability of a user to access, consistently and coherently, similar (though autonomously defined and managed) classes of digital objects and services, distributed across heterogeneous repositories, with federating or mediating software compensating for site-by-site variations...Achieving this will require breakthroughs in description as well as retrieval, object interchange and object retrieval protocols. Issues here include the definition and use of metadata and its capture or computation from objects (both textual and multimedia), the use of computed descriptions of objects, federation and integration of heterogeneous repositories with disparate semantics, clustering and automatic hierarchical organization of information, and algorithms for automatic rating, ranking, and evaluation of information quality, genre, and other properties."
"The use of computed descriptions of (multimedia) objects" and "clustering and automatic hierarchical organization of information" present pressing scientific and engineering problems that have a significant potential impact on the US society in this era of the Internet and distributed, multimedia computing.


A baseline from which to measure progress. Where does the work stand today? What are the largest calculations/simulations you are able to perform in terms of appropriate complexity measures, such as grid points, time steps, independent variables, etc. What is your computing environment in terms of hardware including architecture (vector, parallel, etc.), number of processors, amount of memory, and storage? What software/algorithms are you using?

The traditional approach to creating classification systems and knowledge sources in library science and classical AI is often considered top-down since knowledge representations and formats are pre-defined by human experts or trained librarians and the process of generating knowledge is structured and well-defined. A complementary bottom-up approach to knowledge creation has been suggested by researchers in machine learning, statistical analysis, and neural networks.
General-purpose clustering algorithms have been in existence since the 1960s. Hierarchical and non-hierarchical clustering algorithms have been developed primarily for numeric analysis purposes. Some of these algorithms have then been adopted in digital library applications. Several factor analysis based techniques such as latent semantic indexing (LSI) and multi-dimensional scaling (MDS) have also been adopted in textual analysis. More recently, neural network based self-organizing map (SOM) textual classification systems have been reported by Chen et al. in several digital library applications. Clustering and classification techniques for real-life, large-scale collections are critically needed for developing knowledge structures for next-generation digital libraries.
Most of the more robust clustering algorithms (e.g., Voorhees method, Ward's method) are computationally intensive, and they often are O(N^2) or O(N^3) in complexity, where N is the number of objects to be clustered. Digital library science is data-intensive, one in which performance measure is often based on the amount of data (number of objects or size of collections). The computational complexity of such algorithms has caused severe engineering problems, especially for mid-to-large-scale applications (from millions to billions of digital records, TBs collection). The largest calculation that we were able to perform on a 32-node Covex Exemplar and a 32-node SGI Origin2000 (2GBs RAM and 200 GBs of disk), respectively, was based on the neural network SOM algorithm for an Internet collection of about 100K web pages.


Have there been significant changes in the level of science practiced over the last several years? If there have, what factors are responsible? (System capability? New algorithms? New scientific understanding?)

Many researchers have attempted to optimize clustering algorithms, especially for sparse textual analysis applications. Dmitri and Chen developed a scaleable SOM algorithm that took advantage of the sparseness of coordinates in document input vectors and thereby reduced the SOM computational complexity by several orders of magnitude. The resulting complexity of the algorithm is proportional to the average number of non-zero coordinates in an input vector, instead of the total number of input vector coordinates. We believe the same ``keyword sparseness'' optimization principle observed in textual applications could be applied in the optimization of other conventional clustering algorithms as well.
In addition, hardware advancement in recent years, especially the development of shared memory multi-processors, has made large-scale, data-intensive, parallel analysis a promising approach. We have recently reported our experiment in using parallel supercomputers to analyze millions of engineering abstracts and automatically generate engineering concept spaces (thesauri). Implementation of algorithms developed in the 1960s and 1970s may be feasible because of algorithm optimization and hardware advances.


What are your scientific objectives for the next several years? In terms of the complexity measures listed above, of what size are the problems would you need to address to achieve your objectives? What would you need in terms of hardware, software and algorithm development? Can you anticipate what new scientific progress might be accomplished with a 10 to 100 fold increase of computing resources over current resources?

Conventional approaches to addressing information overload and interoperability problems are manual in nature, requiring human experts as information intermediaries to create knowledge structures and/or classification systems (e.g., the National Library of Medicine's Unified Medical Language System, UMLS) in order to bridge vocabulary differences. As information content and collections become even larger and more dynamic (thus rendering manual knowledge structures more difficult to create), we believe a system-aided, algorithmic, bottom-up approach to organizing digital content is needed.
One of the critical scientific objectives of the digital library community for the next several years is to:

"Develop scalable and efficient statistical and AI clustering algorithms to create classification systems based on large-scale (millions and billions) domain-specific digital library collections."
To be able to automatically analyze digital objects from the order of 10^5 (100K) to the order of 10^9 (billion), a proportional 1,000 to 10,000 fold increase in computing resources is needed -- in cycles, RAM, and disk space. An intermediate phase of 10 to 100 fold increase in computing resources will enable computation for 10^6 (million) to 10^7 (10 millions) of digital objects.
The rapidly proliferating numbers of data sources and volume of content have made digital library research an urgent priority, e.g., web pages are reaching 500M and growing, NASA's EOS collects TBs of earth data every week. With improved computing support, we believe we will be able to significantly accelerate our pace of achieving the ultimate research goal of:

ORGANIZING ALL THE DIGITAL INFORMATION IN THE WORLD

Hsinchun Chen
MIS Department
University of Arizona

1998 July 17