ArtinSoft's Blogs

Software Migration Experts

Carlos Loría-Sáenz Blog

February 2007 - Posts

  • PaXQuAL: a silly language for analyzing and rewriting Web Pages

    Let us gently start meeting PaXQuAL, a prototype language that we are shaping and adjusting for the emergent purpose of symbolically expressing simple analysis and transformation tasks on Web Pages, all of this circumscribed to the context of refactoring, as we have been doing in previous posts.

    And as we have already done in earlier days, sometimes we just want to digress a little bit from practical and realistic issues, just to expose some theoretical ideas we find somehow interesting (and probably nobody else does). I can only promise that no Greek letter will be used at all (in part because that font is not allowed by the publishing tool, I confess).

    Anybody (if any) still reading this post is allowed to express right away the nowadays classical: “Yet another language? How many do we count by now?” The claim is justified because everybody knows that every problem in Computer Science is solved by proposing a (the) new one. Now it is my turn, why not. It’s a free world. For the interested reader, a technical paper with further details will hopefully be available at this site soon.

    Actually PaXQuAL (Path based Transformation and Querying Attribute Language is its real name; it is pronounced Pascual) is not that new or different from many other languages developed by real researchers in academia and industry. We wanted to imagine a language for querying and transforming structured data (e.g. XML, HTML), and of that sort there are many available, as we know. What new material can someone like us propose in this field? Actually, what we really want is to operationally relate CSS with a special sort of weird theoretical artifact we were exploring some years ago, which we may dare to call Object-Oriented Rewrite Systems, or Term-Rewriting Systems (TRS) with extra variables and state (the result of work developed by, and jointly with, actual researchers some years ago). Considering TRS is natural in this case because CSS is indeed a kind of TRS, and that field has a rich offering of tools for useful automated reasoning. We guess we can find them useful here.

    The question that pushed us back to the old days is: given a language as interesting, simple and practical as CSS, what kind of object-oriented rewriting logic can be used to describe its operational semantics? You may not believe it, but this is a very important issue if we are interested in reasoning about CSS and HTML for refactoring purposes, among others. And we are, aren’t we?

    CSS is rule-based, includes path-based pattern matching and is feature equipped (semantically attributed), which all together yields a nice combination. CSS can be considered “destructive” in a limited sense: it only allows adding or changing (styling) attributes of tags in place, while the remaining “proper content” is not rewritten. For that reason it is not generative (in contrast to XSLT and XQuery). And that leads to an interesting paradigm. For instance, the following is a typical simple CSS rule setting some properties of every tag of kind body.

    body {

         font-family: Arial, Helvetica, sans-serif;

         background-color: #423f43;

         text-align: center;

    }

    Of course, more explicit rules like this one can be declared; in addition, an inheritance (cascading) mechanism implicitly allows attributes to be pushed down or synthesized, as we know from attribute grammars.

    All that is nice, but we felt we had to be original and want to propose the crazy idea of using something similar to CSS for purposes beyond setting style attributes, for instance for expressing classification rules that recognize patterns like the ones we explained in previous posts: for example, that a table is actually a sort of layout object, a navigation bar or a menu, among others. Hence, we would have a human-readable querying and transformation language for Web Pages, a sort of CSS superset (keeping CSS as a metaphor, which we think might be a good idea).

    For now, let us just show some examples (we warn that the concrete syntax of PaXQuAL is not yet definitive). For instance, we may want to eliminate the bgcolor attribute of any table having it, because it is considered deprecated in XHTML. We use the symbol “:-” to denote execution of the query/transformation, as in Prolog.

     :- table[bgcolor]{bgcolor:null;}
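
    To make the intended effect of this rule concrete, here is a minimal Python sketch of the same transformation. It is not how PaXQuAL is implemented; the function name drop_bgcolor and the sample page are our own assumptions, and the input is assumed to be well-formed XHTML so that the standard library XML parser can read it.

```python
import xml.etree.ElementTree as ET

def drop_bgcolor(root):
    """Remove the bgcolor attribute from every table that has one.

    Returns the number of tables that were changed.
    """
    changed = 0
    for table in root.iter("table"):
        if "bgcolor" in table.attrib:      # matches the selector table[bgcolor]
            del table.attrib["bgcolor"]    # plays the role of bgcolor:null
            changed += 1
    return changed

# Hypothetical sample page: one table with bgcolor, one without.
page = ET.fromstring(
    "<html><body>"
    "<table bgcolor='#ffffff'><tr><td>a</td></tr></table>"
    "<table><tr><td>b</td></tr></table>"
    "</body></html>"
)
print(drop_bgcolor(page))  # 1
```

    Only attributes are touched; the content of the tables stays as it was, which is the CSS-like behavior described above.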

    We may want to add a special semantic attribute to every table directly hanging from a body, indicating that it may be a surface object for some later processing. We must first statically declare a kind of table, “sTable”, accepting a surface attribute, because we are attempting to use static typing as much as possible (“Yes, I am still a typeoholic”).

    @:- sTable::table[surface:boolean;]{surface:false}

    Symbol “@:-” is like “:-” but operates at the terminological level. And then we have the rule for classifying any table instance hanging directly from the body tag:

    :- body sTable{surface:true;}
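
    The combined effect of the declaration and the rule can be approximated in Python as below. This is only a sketch under our own assumptions: every table is treated as an sTable, the surface attribute is stored as the strings 'true' and 'false', and the name classify_surface is hypothetical.

```python
import xml.etree.ElementTree as ET

def classify_surface(root):
    """Give every table surface='false', then mark tables that are
    direct children of body with surface='true'."""
    for table in root.iter("table"):
        table.set("surface", "false")         # default from the sTable declaration
    for body in root.iter("body"):
        for child in body:                    # direct children only
            if child.tag == "table":
                child.set("surface", "true")  # table hanging directly from body
    return root

# Hypothetical sample page: an outer table under body and a nested one.
page = ET.fromstring(
    "<html><body>"
    "<table id='outer'><tr><td>"
    "<table id='inner'><tr><td>x</td></tr></table>"
    "</td></tr></table>"
    "</body></html>"
)
classify_surface(page)
print({t.get("id"): t.get("surface") for t in page.iter("table")})
# {'outer': 'true', 'inner': 'false'}
```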

    Many more "interesting" issues and features still need to be introduced; we will do that in forthcoming posts. Hence, stay tuned.

  • More on Discovering Semantics from Parts of Web Pages

    By taking a look at Web Pages, we may expect to discover that some patterns of semantics are encoded using very few sets of HTML, let us say, “combinators” (HTML parts); this may be due to the lack of abstraction capabilities inherent to HTML alone. We compared this situation to the Noisy-Channel model in a previous post, where we presented some interesting figures and data illustrating the claim. Let us continue our journey by showing further instances of this phenomenon, whose formal analysis is crucial for intelligent refactoring tools of the kind we have been trying to introduce through this series of posts. In other words, let us get to know other forms of “HTML noise”. As a word of warning, we recall that the data is the result of a particular muster of pages, crawled in the way we explained before.

    For this post, we are experimenting with tables that are potentially used as page layouts or page structure. For those kinds of tables, we want to study the table shape or page surface, not the specific content; we may think of that as a way to filter potential candidates for further, deeper semantic analysis. (We briefly recall that our muster contains 819 pages and about 5000 table instances, roughly speaking.)

    The exercise is simple: we postulate an intuitive definition of a table as surface and see how well it is supported by the data in our muster.

    Let us try our shallow analysis by classifying a table as a page layout candidate if its container is the page body tag, possibly through a chain of div tags (assuming such div tags are intended to be organizers or formatters of the table), and it has at least two rows and at least two columns (two columns is the most interesting case; we consider it as a base).
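
    The definition above can be sketched in Python as follows. This is not the tool used for the figures in this post, just an illustration under stated assumptions: pages are well-formed XHTML, "at least two columns" is read as "the widest row has at least two cells", colspan is ignored, and the function names are ours.

```python
import xml.etree.ElementTree as ET

def row_widths(table):
    """Number of td/th cells in each row of the table itself
    (rows of nested tables are not counted)."""
    rows = table.findall("tr") + table.findall("tbody/tr")
    return [len(r.findall("td")) + len(r.findall("th")) for r in rows]

def is_layout_candidate(table, parent_of):
    # Walk up through any chain of div containers; it must end at body.
    node = parent_of.get(table)
    while node is not None and node.tag == "div":
        node = parent_of.get(node)
    if node is None or node.tag != "body":
        return False
    widths = row_widths(table)
    return len(widths) >= 2 and max(widths) >= 2

def layout_candidates(root):
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    return [t for t in root.iter("table") if is_layout_candidate(t, parent_of)]

# Hypothetical page: 'a' sits under body via two divs, 'b' is too small,
# 'c' is wrapped in a p tag and therefore does not qualify.
page = ET.fromstring(
    "<html><body>"
    "<div><div><table id='a'><tr><td/><td/></tr><tr><td/><td/></tr></table></div></div>"
    "<table id='b'><tr><td/></tr></table>"
    "<p><table id='c'><tr><td/><td/></tr><tr><td/><td/></tr></table></p>"
    "</body></html>"
)
print([t.get("id") for t in layout_candidates(page)])  # ['a']
```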

    Such a pattern definition sounds reasonable in appearance; however, we will see that its empirical support is not as high as one may expect, at least in our muster.

    We find 261 such candidates; they represent 31% of all pages, which is quite an interesting amount. However, it is unexpectedly small, because one may guess there should be at least one per page. Among these 261, there are 83 where the table hangs directly from the body tag (32% of the candidates; 10% of the whole muster). As a matter of fact, these 83 tables present irregular patterns, although we often find 2 columns (65%), with a high variance. For instance, we may find a pattern of the form 6.2.2.2.2.2.2.2, where we use our convention of showing a table of n rows as a sequence of n numbers, each one being the number of columns in that row (in the example, 8 rows, the first with 6 columns and the rest with 2 columns). Even worse, we find the irregular pattern 2.2.7.2.7.7.6.5.5.4.4.5.2.3.2.7.2.7. And talking about irregularity, let us take a look at this interesting one: 19.2.7.4.6.2.2.2.2.2.2.2.2.2.2.5.7.2.2.2.2.4.4.2, whatever it means.
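
    The dotted notation used above can be produced mechanically. The following Python sketch shows one way to do it, assuming well-formed XHTML input and counting td and th cells without regard to colspan; the name shape is our own.

```python
import xml.etree.ElementTree as ET

def shape(table):
    """Return the row-shape string of a table: one number per row,
    each being the number of cells in that row, joined by dots."""
    rows = table.findall("tr") + table.findall("tbody/tr")
    return ".".join(
        str(len(r.findall("td")) + len(r.findall("th"))) for r in rows
    )

# Hypothetical table with rows of 3, 2 and 2 cells.
table = ET.fromstring(
    "<table>"
    "<tr><td/><td/><td/></tr>"
    "<tr><td/><td/></tr>"
    "<tr><th/><td/></tr>"
    "</table>"
)
print(shape(table))  # 3.2.2
```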

    With this simple analysis, we may learn that, perhaps, some intuitive definitions do not occur as frequently as we may expect in our muster. Actually, after seeing some of the irregular cases in detail, a sound conclusion might be that we first need to pre-classify some parts of the page before using general patterns like the one we tried directly. In other words, we see that some noise needs to be filtered out for this kind of pattern.

    In a forthcoming post, we will continue studying this kind of pattern and its support.


  • Semantics from Structural Parts of Web Pages: some figures and patterns

    We continue our regular series of posts talking about refactoring of Web Pages based on semantic approaches; we invite the interested new reader to take a look at the previous contributions to get a general picture of our intentions.

    In this particular and brief post, we just want to present and describe some simple but interesting empirical data related to the structural (syntactic) content of a given muster of pages we have been analyzing during the last days. The results are part of a white paper we are currently preparing; it will be available at this site shortly.

    We may remember from our first post that we want to recover semantics from structure, given particular clues and patterns we usually come across when analyzing pages. The approach is simpler to describe than to put into practice: once semantics has somehow been detected, refactoring steps can be applied at some places in the page and, by doing so, some expected benefits can be gained.

    However, syntactic structure is the result of encoding some specific semantics and intentions on a web page using HTML elements and functionality; the HTML language is (expressively speaking) rather limited (with too much emphasis on presentation issues, for instance), and some common programming “bad practices” increase the complexity of recovering semantics mainly from syntactic content as input. And since HTML is quite declarative, such complexity can make the discovery problem quite challenging in a pragmatic context, indeed. Solving it is our more general goal; however, we do not want to go that far in this post. We just want to keep this perspective in mind and give the reader some insight and data to think about. We will elaborate more on recovery in forthcoming posts.

    As is usual in the NLP field, it is interesting to use the so-called Noisy-Channel model as a point of reference and analogy. We may think of the initial semantics as the input message to the channel (the programmer); the web page is the output message. The programmer uses syntactic rules to encode semantics during coding, adding more or less noisy elements. Different encoding forms normally exist; the noise can be greater when too much structure is engaged for expressing some piece of the message.
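
    As a reminder of what the analogy refers to, the standard noisy-channel formulation can be written as below. The symbols are our own labels for illustration: s stands for the intended semantics (the input message) and p for the observed web page (the output message).

```latex
\hat{s} = \arg\max_{s} P(s \mid p) = \arg\max_{s} P(p \mid s)\, P(s)
```

    Here P(s) plays the role of a prior over plausible semantics and P(p | s) models how the programmer encodes them, noise included. We use the model only as an analogy and do not estimate these probabilities in this post.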

    A typical example of noisy encoding is the use of tables for style, presentation or layout purposes, beyond the hypothetically primary intention of such a table element: just to be an arrangement of data. Complex software maintenance and sometimes lower performance may be a consequence of too much noise, among other matters.

    Let us take a look at some data concerning questions like: how much noise is there in a page? What kind of noise? What kind of regular encodings can be found?

    As a warning, we do not claim anything on statistical significance because our muster is clearly too small and was based on biased selection criteria. Our results are very preliminary, in general. However, we feel they may be sound and believable, in some way consistent with the noisy model.

    Our “corpus” consists of 834 pages, which were crawled starting, for convenience, at a given root page in Costa Rica, namely: http://www.casapres.go.cr/. The size depended on a predetermined maximal quantity of 1000 nodes to visit; we never took more than 50 of the links found in a page, and we preferred visiting homepages to avoid traps.
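
    The two limits just mentioned can be illustrated with a small Python sketch. It is not the crawler we used: the breadth-first order is an assumption, the preference for homepages is not modeled, and links_of is a hypothetical callback that would fetch a page and return its links (here it reads from a toy in-memory graph, so nothing touches the network).

```python
from collections import deque

def crawl(root, links_of, max_nodes=1000, max_links_per_page=50):
    """Breadth-first crawl from root with two limits: at most max_nodes
    pages overall and at most max_links_per_page links followed per page."""
    visited = [root]
    seen = {root}
    queue = deque([root])
    while queue and len(visited) < max_nodes:
        page = queue.popleft()
        for link in links_of(page)[:max_links_per_page]:
            if link not in seen and len(visited) < max_nodes:
                seen.add(link)
                visited.append(link)
                queue.append(link)
    return visited

# Toy link graph standing in for real pages (hypothetical names).
graph = {"a": ["b", "c", "d"], "b": ["a", "e"], "c": [], "d": [], "e": []}
print(crawl("a", lambda url: graph.get(url, []), max_nodes=4, max_links_per_page=2))
# ['a', 'b', 'c', 'e']
```

    In the toy run, page d is never reached because only the first two links of a are followed, and the crawl stops as soon as four pages have been collected.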

    Let us see a descriptive profile of the data. Due to current limitations of the publishing tool, we are not presenting the charts that complement the raw numbers.

    Just 108 kinds of tags were detected, and we have 523,016 instances of them in the corpus. That means, very roughly, 6 kinds of tags per page and 627 instances per page. We feel this suggests that the same tags are used for saying probably different things (we remark that many pages are homepages by choice).

    The top 10 tags are: pure text, a, td, tr, br, div, li, img, p and font (according to absolute frequency). Together, text, a (anchor) and img correspond to more than 60% of all instances. Hence more than 60% of the page content is some form of data.

    We notice that ‘table’ is 1% and td 8.5% of all instances, against 42% for text and 15% for anchors. On average, we have 7 tables per page and 54 tds per page, 6 tds per table, roughly speaking.
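
    A profile of this kind can be computed with a few lines of Python. The sketch below is an illustration rather than the code behind our numbers; it assumes that pure text is counted as one instance per non-blank text run (labelled '#text', a name of our own), and the class and function names are ours as well.

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Count start tags and non-blank text runs ('#text')."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1

    def handle_data(self, data):
        if data.strip():
            self.counts["#text"] += 1

def tag_profile(html):
    """Return each tag's relative frequency, most frequent first."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values())
    return {tag: n / total for tag, n in parser.counts.most_common()}

# Hypothetical fragment: 5 tags and 2 text runs.
sample = "<body><p>Hi <a href='x'>link</a></p><br><img src='y'></body>"
for tag, share in tag_profile(sample).items():
    print(f"{tag}: {share:.0%}")
```

    Feeding every page of a corpus through the same counter and summing the results would give corpus-wide shares comparable to the percentages quoted above.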

    Likewise, we saw just 198 attributes and 545,585 instances of attributes. Among the 10 most popular are: href, shape, colspan, rowspan, class, width, clear and height, which is relatively consistent with the observed tag frequency (e.g. href for anchors, colspan and rowspan for td).

    We pay special attention to tables in the following lines. Our corpus has 5501 tables. It is worth mentioning that 65% of them are children of a td; in other words, they are nested in another table. Hence there is a high proportion of nesting, which suggests complexity in table design. We see that 77% of the data (text, a, img) in the muster is dominated by tds (most of the data is table dominated). In the case of anchors, 33% of them are td-dominated, which may suggest that tables are being used as navigational bars or similar semantic devices in an apparently very interesting proportion.
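
    The nesting measure can be stated precisely with a short Python sketch. It assumes well-formed XHTML and takes "child of td" literally (the table's parent element is a td); the function name nested_table_share is our own.

```python
import xml.etree.ElementTree as ET

def nested_table_share(root):
    """Fraction of table elements whose parent element is a td."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    tables = list(root.iter("table"))
    if not tables:
        return 0.0
    nested = sum(
        1 for t in tables if t in parent_of and parent_of[t].tag == "td"
    )
    return nested / len(tables)

# Hypothetical page: one outer table holding two nested tables.
page = ET.fromstring(
    "<html><body><table><tr>"
    "<td><table><tr><td/></tr></table></td>"
    "<td><table><tr><td/></tr></table></td>"
    "</tr></table></body></html>"
)
print(round(nested_table_share(page), 2))  # 0.67
```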

    We decided to explore semantic patterns on tables a little more precisely. For instance, we chose tables of nx1 dimension (n rows, 1 column), which are good candidates for navigational bars. A simple analysis shows that 618 tables (11%) have such a shape. The content pattern within that shape may vary, which is quite interesting. For instance, we see a 5x1 table where all tds contain anchors. We denote that by a sequence of 1s and 0s, where 1 means the corresponding td contains an anchor (a link to some URL): in this case ‘1.1.1.1.1’ is the sequence. But another table of the same 5x1 size presents the pattern ‘1.0.1.0.1’. This same pattern occurs several times, for instance in a 50x1 table. Another case is this: ‘0.0.0.0.1.1.1.1.1.0’, maybe suggesting that some links are not available. We mention that 212 patterns are 1x1, which would be a kind of navigation button. We will present a more elaborate analysis of these table patterns in the following post.
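
    The 1/0 notation can be derived from a table as in the Python sketch below. It assumes well-formed XHTML and treats a td as containing an anchor when an a element appears anywhere inside it; the name anchor_pattern is our own.

```python
import xml.etree.ElementTree as ET

def anchor_pattern(table):
    """For an n x 1 table, return a dotted string with 1 for each row
    whose single td contains an anchor and 0 otherwise.
    Returns None when the table is empty or not n x 1."""
    bits = []
    for row in table.findall("tr") + table.findall("tbody/tr"):
        cells = row.findall("td")
        if len(cells) != 1:
            return None  # some row does not have exactly one column
        bits.append("1" if cells[0].find(".//a") is not None else "0")
    return ".".join(bits) if bits else None

# Hypothetical 3x1 navigation table: link, plain text, link.
table = ET.fromstring(
    "<table>"
    "<tr><td><a href='/home'>Home</a></td></tr>"
    "<tr><td>News</td></tr>"
    "<tr><td><b><a href='/contact'>Contact</a></b></td></tr>"
    "</table>"
)
print(anchor_pattern(table))  # 1.0.1
```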

    To finish, we notice that 875 tables (16%) are not regular: some rows have a different size. Some of them are very unusual, like this 28x8 table, where each number in the following sequence denotes the size in tds of the row: 4.4.4.6.8.8.7.2.8.4.4.6.6.6.6.5.4.5.5.5.5.5.5.5.5.5.5.1.
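
    Going the other way, a sequence like this one can be read back into its dimensions. The Python sketch below parses the dotted string into the number of rows, the widest row and a regularity flag; the name describe_shape is our own.

```python
def describe_shape(shape):
    """Parse a dotted row-size string into (rows, widest_row, is_regular)."""
    widths = [int(part) for part in shape.split(".")]
    return len(widths), max(widths), len(set(widths)) == 1

# The sequence quoted above, copied as is.
print(describe_shape("4.4.4.6.8.8.7.2.8.4.4.6.6.6.6.5.4.5.5.5.5.5.5.5.5.5.5.1"))
```

    The first two values of the result can be compared with the 28x8 dimension stated above; the third tells whether all rows have the same size.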

    Noisy, isn’t it?
