More on Discovering Semantics from Parts of Web Pages

13. February 2007 11:36 by CarlosLoria in General

Looking at Web pages, we may expect to discover that some patterns of semantics are encoded using very few sets of HTML “combinators” (HTML parts), so to speak; this may be due to the lack of abstraction capabilities inherent in HTML alone. In a previous post we compared this situation to the noisy-channel model and presented some figures and data illustrating the claim. Let us continue our journey by showing further instances of this phenomenon, whose formal analysis is crucial for intelligent refactoring tools of the kind we have been introducing in this series of posts. In other words, let us get to know other forms of “HTML noise”. As a word of warning, we recall that the data come from a particular muster of crawled pages, gathered in the way we explained before.

For this post, we are experimenting with tables that are potentially used for page layout or page structure. For those kinds of tables, we want to study the table shape or page surface, not the specific content; we may think of this as a way to filter potential candidates for deeper semantic analysis. (We briefly recall that our muster contains 819 pages and roughly 5000 table instances.)

The exercise is simple: we postulate an intuitive definition of a table as a surface and see how well it is supported by the data in our muster.

Let us try our shallow analysis by classifying a table as a page layout candidate if it meets two conditions. First, its container is the page body tag, possibly followed by a chain of div tags (assuming such div tags are intended to be organizers or formatters of the table). Second, it has at least two rows and at least two columns (two columns is the most interesting case; we consider it the base case).
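The definition can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the tool used to produce the figures in this post: the names `TableCollector`, `find_tables` and `is_layout_candidate` are hypothetical, colspan is ignored, tables left unclosed at the end of the page are not recorded, and "at least two columns" is read here as "every row has at least two cells", which is only one possible interpretation.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TableCollector(HTMLParser):
    """Records, for every table, its ancestor tags and the cells per row."""

    def __init__(self):
        super().__init__()
        self.stack = []        # names of the currently open tags
        self.open_tables = []  # tables whose closing tag is still pending
        self.tables = []       # finished tables, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "table":
            self.open_tables.append({"ancestors": list(self.stack), "rows": []})
        elif self.open_tables:
            current = self.open_tables[-1]
            if tag == "tr":
                current["rows"].append(0)      # start a new row
            elif tag in ("td", "th") and current["rows"]:
                current["rows"][-1] += 1       # one more cell in the last row
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: ignore it
        # Pop up to the matching tag, tolerating unclosed children.
        while True:
            closed = self.stack.pop()
            if closed == "table":
                self.tables.append(self.open_tables.pop())
            if closed == tag:
                break


def find_tables(html):
    collector = TableCollector()
    collector.feed(html)
    collector.close()
    return collector.tables


def is_layout_candidate(table):
    ancestors = table["ancestors"]
    if "body" not in ancestors:
        return False
    # Everything between body and the table must be a div.
    chain = ancestors[ancestors.index("body") + 1:]
    rows = table["rows"]
    return (all(tag == "div" for tag in chain)
            and len(rows) >= 2
            and all(cells >= 2 for cells in rows))
```

Calling `find_tables(page_source)` and filtering the result with `is_layout_candidate` gives the candidates for one page. A table nested inside another table's cell is rejected automatically, because its ancestor chain contains tags other than div.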

Such a pattern definition seems reasonable at first glance; however, we will see that its empirical support is not as high as one may expect, at least in our muster.

We find 261 such candidates; in number, that is about 32% of the 819 pages, which is an interesting amount. It is nevertheless unexpectedly small, because one might guess there should be at least one per page. Among these 261, there are 83 where the table hangs directly from the body tag (32% of the candidates; 10% of the whole muster). As a matter of fact, these 83 tables present irregular patterns, although we often find 2 columns (65%), with a high variance. For instance, we may find a pattern of the form 6.2.2.2.2.2.2.2, where we use our convention of showing a table of n rows as a sequence of n numbers, each one being the number of columns in the corresponding row (in the example, 8 rows, the first with 6 columns and the rest with 2 columns each). Even worse, we find the irregular pattern 2.2.7.2.7.7.6.5.5.4.4.5.2.3.2.7.2.7. And speaking of irregularity, let us take a look at this interesting one: 19.2.7.4.6.2.2.2.2.2.2.2.2.2.2.5.7.2.2.2.2.4.4.2, whatever it means.
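To make the row notation concrete, here is a small Python sketch that reads a shape string such as 6.2.2.2.2.2.2.2 and reports how regular it is. The helper names (`parse_shape`, `format_shape`, `summarize`) and the choice of summary fields are assumptions made for illustration; they are not the measures behind the percentages quoted above.

```python
from collections import Counter


def parse_shape(shape):
    """Turn '6.2.2' into [6, 2, 2]: one column count per row."""
    return [int(n) for n in shape.split(".")]


def format_shape(cols_per_row):
    """Inverse of parse_shape: [6, 2, 2] becomes '6.2.2'."""
    return ".".join(str(n) for n in cols_per_row)


def summarize(shape):
    cols = parse_shape(shape)
    counts = Counter(cols)
    # Most frequent column count and the number of rows that have it.
    dominant_cols, hits = counts.most_common(1)[0]
    return {
        "rows": len(cols),
        "dominant_cols": dominant_cols,
        "dominant_share": hits / len(cols),
        "regular": len(counts) == 1,  # every row has the same column count
    }


if __name__ == "__main__":
    for shape in ("6.2.2.2.2.2.2.2",
                  "2.2.7.2.7.7.6.5.5.4.4.5.2.3.2.7.2.7"):
        print(shape, summarize(shape))
```

A table counts as regular here only when every row has the same number of columns, so all three patterns quoted in this post come out as irregular; a softer threshold on `dominant_share` would be another reasonable choice.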

With this simple analysis, we learn that some intuitive definitions may occur less frequently in our muster than we might expect. Indeed, after looking at some of the irregular cases in detail, a sound conclusion might be that we first need to pre-classify some parts of the page before applying general patterns like the one we tried directly. In other words, some noise needs to be filtered out before this kind of pattern can be applied.

In a forthcoming post, we will continue studying this kind of pattern and its support.