Tuesday, July 31, 2007

Related Works: Page Layout analysis for Bangla Document Image


A. Ray Chaudhuri, A. K. Mandal and B. B. Chaudhuri [1] present a page layout analyzer for multilingual Indian documents, more specifically for documents in Brahmi-based scripts. The headline (matra for Bangla, sirorekha for Hindi) is identified as the main feature of these documents. They apply a bottom-up approach to page layout analysis. Text regions are extracted as rectangular blocks containing moderate-size connected components that satisfy homogeneity, or compatibility, in terms of the headline and other textual features characteristic of the considered scripts. The degree of homogeneity is calculated from the compatibility criterion in terms of the selected features. Textual regions or blocks are generated from the optimal bounding boxes (BBs) of all connected components.

In their work, the likelihood of a headline in a word and of a vertical line in a word is first calculated from a statistical analysis of a large number of documents. The primary features are extracted from the BBs, viz. corners, headline (position and thickness), the vertical line in the middle zone of UMD, the lengthwise distribution of UMD, and the average number of object pixels per BB, i.e. the density of the components. From these primary features some secondary features are calculated: average pixel density of each component, average vertical span of M within a block, average horizontal gap among closest pairs of BBs, inter-headline distances between vertically adjacent BBs, inter-block distance (both horizontal and vertical), and the ratio of the number of components between adjacent blocks. The compatibility criterion for any two scalar quantities L1 and L2 is defined as:

Comp (L1, L2) = ½(|1 – L1/L2| + |1 – L2/L1|)
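As a quick illustration, the criterion above can be computed directly. This is only a minimal sketch: the function name comp and the check for positive inputs are my own assumptions, not something taken from [1]. The value is 0 when the two quantities are equal and grows as they diverge.

```python
def comp(l1: float, l2: float) -> float:
    """Compatibility of two scalars, as in the formula above.

    Returns 0.0 when l1 == l2; larger values mean less compatible.
    Restricting the inputs to positive values is an assumption made
    here to avoid division by zero; the source does not state it.
    """
    if l1 <= 0 or l2 <= 0:
        raise ValueError("comp() is only defined here for positive values")
    return 0.5 * (abs(1 - l1 / l2) + abs(1 - l2 / l1))


if __name__ == "__main__":
    # Example values are made up, e.g. headline thickness of two BBs.
    print(comp(4.0, 4.0))
    print(comp(4.0, 5.0))
```

The measure is symmetric, so comp(L1, L2) and comp(L2, L1) give the same value. How small the value must be for two components to count as compatible is not stated in this summary, so no threshold is shown.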

The entire process is completed in three modules. The first module constructs blocks using the primary features. The constructed blocks and the remaining BBs are labeled into seven categories (C1 – C7). The second module applies to cases where one block (Fin) resides completely or partially inside another (Fout). The third module performs block merging, and applies only to those blocks that have not yet been considered as text (although they are real text, as in a title).
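The situation handled by the second module, one block lying completely or partially inside another, can be detected with a simple rectangle test. The sketch below is only an illustration under my own assumptions: the Box type, the (left, top, right, bottom) convention with exclusive right and bottom edges, and the function name relation are hypothetical and do not come from [1], which may resolve such cases differently.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical bounding box; right and bottom are exclusive."""
    left: int
    top: int
    right: int
    bottom: int


def relation(f_in: Box, f_out: Box) -> str:
    """Classify how f_in lies relative to f_out.

    Returns "inside" if f_in is fully contained in f_out,
    "partial" if the two rectangles overlap without containment,
    and "disjoint" otherwise.
    """
    if (f_in.left >= f_out.left and f_in.top >= f_out.top
            and f_in.right <= f_out.right and f_in.bottom <= f_out.bottom):
        return "inside"
    overlap_w = min(f_in.right, f_out.right) - max(f_in.left, f_out.left)
    overlap_h = min(f_in.bottom, f_out.bottom) - max(f_in.top, f_out.top)
    if overlap_w > 0 and overlap_h > 0:
        return "partial"
    return "disjoint"


if __name__ == "__main__":
    outer = Box(0, 0, 10, 10)
    print(relation(Box(2, 2, 5, 5), outer))
    print(relation(Box(8, 8, 12, 12), outer))
    print(relation(Box(20, 20, 25, 25), outer))
```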

S. Khedekar, V. Ramanaprasad, S. Setlur and V. Govindaraju [2] present a top-down, projection-profile based algorithm to separate text blocks from image blocks in a Devanagari document. They use a distinctive feature called Shirorekha (header line) to analyze the pattern. The horizontal profile corresponding to a text block possesses a certain regularity in frequency and orientation, and shows spatial cohesion. Their algorithm uses these features to identify text blocks.

The algorithm first generates the horizontal histogram of the entire image. When the horizontal histogram of the document image is plotted, the rows where the Shirorekha is present have the maximum number of black pixels. The patterns (profiles) formed by text blocks are characterized by ranges of thick black peaks separated by white gaps. Graphics or images, by contrast, have a relatively uniform pattern with no prominent peaks. Another main feature of a text block is that its histogram possesses a certain frequency and orientation and shows spatial cohesion, i.e. adjacent lines are of similar height. Since text lines appear adjacent to each other, the algorithm looks for adjacent patterns that have identical shape and size distribution. If such a profile is found, that portion of the document is taken to contain Devanagari text. Any variation from this characteristic is taken to be an image, with or without surrounding text.
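The idea described above can be sketched in a few lines. This is a simplified illustration and not the algorithm of [2]: the function names, the binary image format (lists of 0/1 with 1 as a black pixel), the min_pixels cut-off and the 30% height tolerance are all assumptions of mine.

```python
def horizontal_profile(image):
    """Count black pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def find_bands(profile, min_pixels=1):
    """Return (start, end) row ranges with at least min_pixels black
    pixels per row; end is exclusive. Bands are separated by white gaps."""
    bands, start = [], None
    for y, count in enumerate(profile):
        if count >= min_pixels and start is None:
            start = y
        elif count < min_pixels and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, len(profile)))
    return bands


def headline_row(profile, band):
    """Row with the most black pixels in a band (candidate Shirorekha)."""
    start, end = band
    return max(range(start, end), key=lambda y: profile[y])


def similar_heights(bands, tolerance=0.3):
    """Assumed cohesion test: adjacent bands differ in height by at
    most `tolerance` (relative). Needs at least two bands."""
    heights = [end - start for start, end in bands]
    if len(heights) < 2:
        return False
    return all(abs(a - b) <= tolerance * max(a, b)
               for a, b in zip(heights, heights[1:]))


if __name__ == "__main__":
    # Tiny made-up image: two "text lines", each with a headline on top.
    image = [
        [0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1],
        [0, 1, 0, 1, 0, 1],
        [0, 1, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 0],
        [1, 0, 1, 0, 1, 0],
        [1, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],
    ]
    profile = horizontal_profile(image)
    bands = find_bands(profile)
    print(profile, bands, similar_heights(bands))
```

A real document would also need the comparison of shape and size distribution between adjacent patterns that the paper describes; the height check here only stands in for the "adjacent lines are of similar height" observation.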

References:

[1]. A. Ray Chaudhuri, A. K. Mandal, B. B. Chaudhuri, “Page Layout Analyzer for Multilingual Indian Documents”, Proc. of the Language Engineering Conference (LEC’02), 2002.

[2]. S. Khedekar, V. Ramanaprasad, S. Setlur, V. Govindaraju, “Text-image separation in Devanagari documents”, Proc. of Seventh International Conference on Document Analysis and Recognition, 2003.
