Large-scale indexing and search of mathematical expressions in massive corpora of printed documents

Large databases of digitized printed scientific documents now exist, and many of these documents contain mathematical expressions. Searching for textual information in such documents is already widely supported by the most widely used web search engines. However, searching massive collections of digitized printed scientific documents with queries that are themselves mathematical expressions remains a scarcely explored research area. The methods investigated so far for this problem are based on image comparison, and their high computational cost makes them unrealistic for massive collections.

In IBEM, we propose to research techniques for indexing and searching mathematical expressions in large collections of digitized images. The search indexes will be prepared off-line, while each search will be carried out with a mathematical expression acquired on-line as the query. Both the models used to build the indexes of the collection and the models used to represent the query will be stochastic structural models. These models will account for the ambiguity that can arise in the recognition process, whether from segmentation problems or from the ambiguity inherent in the mathematical expression itself. IBEM poses challenges that have not been studied before, so its feasibility is not fully guaranteed: index preparation that includes confidence measures, data structures based on syntactic parse trees for structural search, discriminative machine learning of structural models, and the development of a search engine for mathematical expressions that scales to massive data.
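
To make the idea of an off-line index with confidence measures more concrete, the following is a minimal sketch in Python. It is not the IBEM system. It assumes, purely for illustration, that a recognizer has already produced for each image region a short list of hypotheses (written here as LaTeX strings) with confidence values, and that the on-line query is represented in the same way. The names `build_index` and `search`, the region identifiers, and all numeric values are hypothetical. Matching is by exact string equality, which is a deliberate simplification of the structural, parse-tree-based search described above.

```python
from collections import defaultdict


def build_index(regions):
    """Off-line step: map each hypothesis string to the regions where it occurs.

    `regions` is an iterable of (region_id, [(hypothesis, confidence), ...]).
    All hypotheses of a region are kept, so recognition ambiguity is preserved.
    """
    index = defaultdict(list)
    for region_id, hypotheses in regions:
        for latex, confidence in hypotheses:
            index[latex].append((region_id, confidence))
    return index


def search(index, query_hypotheses, top_k=10):
    """On-line step: rank regions by accumulated query-weight * confidence.

    `query_hypotheses` is a list of (hypothesis, weight) pairs for the query.
    """
    scores = defaultdict(float)
    for latex, weight in query_hypotheses:
        for region_id, confidence in index.get(latex, []):
            scores[region_id] += weight * confidence
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical recognizer output for two regions (illustrative values only).
regions = [
    ("page1/expr3", [("x^2", 0.7), ("x_2", 0.3)]),
    ("page4/expr1", [("x_2", 0.9), ("x2", 0.1)]),
]
index = build_index(regions)

# Hypothetical ambiguous on-line query: most likely "x^2", possibly "x_2".
results = search(index, [("x^2", 0.8), ("x_2", 0.2)])
for region_id, score in results:
    print(region_id, round(score, 2))
```

In this toy example the region whose hypotheses best agree with the query is ranked first, even though neither the document nor the query was recognized unambiguously. A structural index would replace the string keys with keys derived from syntactic parse trees, so that subexpressions can also be matched.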