ADTs, OOP, rules and Pattern-matching (1)

6. June 2008 12:57 by CarlosLoria in General  //  Tags: , , ,   //   Comments (0)

Abstract (or algebraic) Data Types (ADTs) are well-known modeling techniques for specifying data structures in an algebraic style.
Such specifications are useful because they allow formal reasoning about software at the specification level. That opens the possibility of automated verification and code generation, for instance, as we find within the field of program transformation (refactoring, modernization, etc.). However, ADTs are more naturally associated with a declarative programming style (functional (FP) and logic programming (LP)) than with an object-oriented one (OOP). A typical ADT for a tree data structure could look as follows, using some hypothetical formalism (we keep the model simple):

    adt TreeADT<Data> {

    import Integer, Boolean, String;
    signature:
        sort Tree;
        oper Empty    : -> Tree;
        oper Node     : Data, Tree, Tree -> Tree;
        oper isEmpty, isNode : Tree -> Boolean;
        oper mirror   : Tree -> Tree;
        oper toString : Tree -> String;
     semantics:
        isEmpty(Empty())     = true;
        isEmpty(Node(_,_,_)) = false;

        mirror(Empty())       = Empty();
        mirror(Node(i, l, r)) = Node(i, mirror(r), mirror(l));

    }
As we see in this example, ADTs strongly exploit the notion of rule- and pattern-matching-based programming as a concise way to specify requirements for functionality. The signature part is the interface of the type (also called a sort in this jargon) by means of operators; the semantics part indicates which equalities are invariant over operations for every implementation of this specification. The semantics also allows recognizing some special kinds of operations. For instance, Empty and Node are free, in the sense that no equation reduces them; they are called constructors.
If we read the equations from left to right, we can consider them declarations of functions whose formal parameters are allowed to be complex structures. We also use wildcards (‘_’) to discard those parts of the parameters that are irrelevant in context. It is evident that a normal declaration without patterns would probably require more code and consequently obscure the original intention expressed by the ADT. In general, we would need destructors (getters), that is, functions to access the parts of the ADT. Hence, the operation mirror might look as follows:
    Tree mirror(Tree t) {
        if (t.isEmpty()) return t;
        return new Node(t.getInfo(),
                        mirror(t.getRight()),
                        mirror(t.getLeft()));
    }
Thus, some clarity usually gets lost when generating OOP code from ADTs, among other things because navigation (term destruction, decomposition) needs to be explicit in the implementation.

Rules and pattern-matching have not yet become standard in mainstream OOP languages with respect to processing general data objects. The use of regular expressions over strings has been quite common, as has pattern-matching over semi-structured data, especially for querying and transforming XML; however, such processing tends to occur via APIs and external tools outside the language proper. Hence, the need for rules does exist in very practical applications, and during the last years several proposals have been developed to extend OOP languages with rules and pattern-matching. Some interesting attempts to integrate ADTs with OOP that are worth mentioning are Tom (http://tom.loria.fr/ ) and Scala (http://www.scala-lang.org/ ). An interesting place to look at, with quite illustrative comparative examples, is http://langexplr.blogspot.com/.

In this blog we will start a sequence of posts studying these two models of integrating rules, pattern-matching and OOP in some detail. We first start with a possible version of the ADT above using Scala. In this case the code generation has to be done manually by the implementer, but the language for modeling and for implementing is the same, which can be advantageous. Scala explicitly supports (generalized) ADTs, so we practically obtain a direct translation.
    object TreeADT {
        // Base trait of the sort Tree, with default implementations
        trait Tree {
           def isEmpty: Boolean = false
           override def toString() = ""
           def mirror: Tree = this
           def height: Int = 0
        }

        // Node plays the role of the free constructor Node of the ADT
        case class Node(info: Any, left: Tree, right: Tree) extends Tree {
           override def toString() =
              "(" + info + left.toString() + right.toString() + ")"
        }

        // Empty is a singleton, corresponding to the constant constructor Empty
        object Empty extends Tree {
           override def isEmpty = true
        }

        def isEmpty(t: Tree): Boolean =
           t match {
              case Empty => true
              case _     => false
           }

        def mirror(t: Tree): Tree =
           t match {
              case Empty         => t
              case Node(i, l, r) => Node(i, mirror(r), mirror(l))
           }
      }
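
To get a feel for the pattern-matching style in use, here is a minimal usage sketch (ours, not part of the original ADT), assuming the TreeADT object above:

    object TreeADTDemo {
       import TreeADT._

       def main(args: Array[String]): Unit = {
          // A root with two leaf children
          val t = Node(1, Node(2, Empty, Empty), Node(3, Empty, Empty))

          println(t)                   // prints (1(2)(3))
          println(TreeADT.mirror(t))   // prints (1(3)(2))
          println(TreeADT.isEmpty(t))  // prints false
       }
    }

Note how the case-class constructors play exactly the role of the free constructors Empty and Node of the ADT, and how mirror is a direct transcription of the two equations of the semantics part.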

Aiming at Migration 2.0 more in detail

11. June 2007 04:41 by CarlosLoria in General  //  Tags:   //   Comments (0)

By Migration 2.0 (coined in the context of McAfee’s Enterprise 2.0 vision) we want to consider additional ways of addressing Automated Software Modernization in a Web 2.0 context, and why and how this should (or should not) deserve particular attention. As a complement to a previous post, we have developed a presentation (albeit in Spanish) which may help to clarify (or not) some of the ideas around the notion. Especially relevant are a user-centric notion of migration and modernization as a sort of “SLATES-ification” of legacy code, where legacy needs to be understood inside the 2.0 context. We will revisit and refine this idea in a forthcoming post.
You will find the slides of the presentation here.

Migration 2.0 and Enterprise 2.0

24. May 2007 11:03 by CarlosLoria in General  //  Tags:   //   Comments (0)

Let us elaborate some loose ideas on a fresher notion of “software modernization”. In what follows, we review the notion of software migration as a modernization path, and we want to look at it through the lens of other contextual elements and events around the IT world which have evolved during the last 2-3 years. Such elements give us a reason to identify forces and opportunities for innovation and possible new developments. Let us assume that, until now, we have focused on an implementation-oriented modernization; namely, one that is mainly concerned with the programming languages, frameworks, platforms, architectures and the like that have been prevailing during the last years.

From a technical point of view, we are aware that this implementation-based notion is completely valid and will exist as such for a while, because still-useful legacy systems are forced to evolve at the implementation level while retaining as much as possible of their original functionality and corresponding business value, at a reasonable cost. However, we are also aware that the environment where ISs serve and survive is evolving so strongly that implementation details probably remain that important only at the traditional IT/IS level and vision. What kind of environment and forces are these, making pressure on that vision? Is there an opportunity there for migration?

One important phenomenon is definitely the Web 2.0 and, in analogy to how the Internet forced Intranets, Andrew McAfee has recently coined the Enterprise 2.0 concept. It embodies the well-known effects of the Web 2.0 as a ubiquitous trend and as a social movement, and how those are nowadays pushing on the inside of the IT enterprise; and, as a direct consequence, on the kind of tools employees might be willing to use and needing in their regular work environment, where new information requirements are born faster than they can be incorporated as new features into the traditional IS platform. Whether so-called Web 2.0 tools will improve employee productivity is probably still an open and debatable question. We still remember how, not too long ago, e-mail and Internet access at work were considered a disturbance source per se.

It is also true, however, that being able to search, analyze, discover, tag, publish, share and trace knowledge at the rhythm of the business and of one's own personal information needs has become more important than ever. Traditionally designed ISs could be becoming a contributing factor to a sort of impedance mismatch between the huge flexibility and freedom that people currently encounter on the Web (even in the private, personal milieu) and a rigid, traditional IS platform at the workplace. And, we emphasize, this concern arises independently of whether or not such an IS platform is “modern” at the implementation level, which is a different dimension of the matter.

As Dion Hinchcliffe puts it, Enterprise 2.0 is a cultural catalyst (as Web 2.0 is being); we believe and interpret it as a realistic vision where, sooner rather than later, ISs will be judged in terms of McAfee’s SLATES criteria, and this will entail a rather stronger force leading to modernization at a higher level than the technical one, because such criteria are more closely related to social, common-sense, better-understood forces not so directly tied to technical issues.

If such a vision is accepted as a legitimate opportunity, we might then be looking for spaces for innovation now, when we are considering moving to the next levels of automation-supported migration. In such a case we have to consider migration as a path enabling ISs (not just legacy ones) to evolve in the direction of Enterprise 2.0-like modernization as a well-defined strategy.


Migration as a path leading to BI

22. May 2007 07:05 by CarlosLoria in General  //  Tags:   //   Comments (0)

We speculate that software and data migration is frequently urged by IT infrastructure modernization needs (emergencies would perhaps be more precise to say) at the enterprise level, which are often closely related to platform obsolescence and the associated maintenance costs. A possibly even more appreciable benefit might be the integration of heterogeneous information systems and databases: the ability to easily connect information sources and to create access ports that facilitate the extraction and derivation of knowledge from such sources, helping at the management level and at the decision-support-system level in a more natural and easy way. In other words, the real value of modernization would be to support BI-like strata, pointing -for sure- to a better CRM platform, among others.

We feel migration is probably perceived not as a goal in itself but as a transitory means, so the added value might lie outside, possibly hidden with respect to the whole migration effort per se. It seems to me that the vertical road which leads from software migration and modernization all the way to the BI end-user is perceived as too long a road, one that is hard to appreciate at first sight at the management level. My question to myself would be whether that has to be so. Shouldn't migration tools, and the migration message, be positioned more directly as enablers of the BI layer? And if the answer is yes, we should be asking ourselves how such migration tools can be adapted to close the gap.

Is BI a major trend? (will BI be needing AI?)

18. May 2007 13:45 by CarlosLoria in General  //  Tags:   //   Comments (0)
I was taking a look at some sources showing trends in BI for 2007, especially related to legacy code. I have found the following quotes from a report by Knightsbridge Solutions LLC (Trends in Business Intelligence for 2007) interesting:
Trend #5: Service-Oriented Architecture: Information Management is Critical to Success (of BI)

The buzz around service-oriented architecture (SOA) continues, with some organizations viewing SOA as the solution to a wide range of business and technology problems, from improving enterprise agility to deriving more value from legacy systems (I think we are witnesses of that). In the BI arena, SOA has great potential for delivering enhanced capabilities to users. An SOA-enabled BI infrastructure could provide seamless access to both batch and real-time data integrated across operational and analytical sources. SOA also presents opportunities for innovation in areas such as real-time data collection and real-time analytic services. However, companies that approach SOA without a strong information management methodology will have difficulty achieving the benefits they seek. When implementing SOA on a large scale, companies will face the same barriers they do in large BI integration projects. For example, some early adopters of SOA found that semantic incompatibilities in the data between disparate systems are a significant impediment to SOA adoption. These organizations are discovering that master data management and data governance efforts must precede SOA adoption to provide a “common language” that supports integration. SOA has the potential to deliver benefits in BI and many other areas, but not without a solid information management foundation.

And also:

Trend #8: Influence of Large Vendors: Market Consolidation Expected in 2007

Speculation abounded in 2006 regarding potential acquisitions of pure-play vendors in the BI space. Business Objects, Cognos, Hyperion, and Informatica were seen as potential acquisition targets with likely acquirers being enterprise software and infrastructure vendors like IBM, Microsoft, Oracle, and SAP. Large vendors have moved aggressively into the BI space, building their capabilities through both acquisitions and internal development (as evidenced by Hewlett-Packard’s recent acquisition of Knightsbridge).
Strongly consistent with this:
http://www.intelligententerprise.com/channels/bi/showArticle.jhtml;jsessionid=ADNU0C0TP01BYQSNDLRSKH0CJUNN2JVN?articleID=199501674
This is also interesting to see in more detail:
http://www.spagoworld.org/ecm/faces/public/guest/home/solutions/spago
Well, let's see.

A digression on simple life matters

15. May 2007 12:29 by CarlosLoria in General  //  Tags:   //   Comments (0)

Sometimes negative experiences may turn out to be positive in the long run, something everybody should hope to believe whenever things do not happen the way we expected. Though already old, and maybe questionable and controversial, I must admit this speech made me reflect a lot; I found it very rich in philosophical material, and it changed me somehow. If you haven’t watched it yet, you may find it interesting:

Video:  http://www.youtube.com/watch?v=D1R-jKKp3NA

Text: http://news-service.stanford.edu/news/2005/june15/jobs-061505.html

PaXQuAL: a silly language for analyzing and rewriting Web Pages

23. February 2007 12:42 by CarlosLoria in General  //  Tags:   //   Comments (0)

Let us gently start meeting PaXQuAL, a prototype language that we are shaping and adjusting for the emerging purpose of symbolically expressing simple analysis and transformation tasks on Web pages, all this circumscribed to the context of refactoring, as we have done in previous posts.

And as we have also done days before, sometimes we just want to digress a little bit from practical and realistic issues, just to expose some theoretical ideas we find somewhat interesting (and probably nobody else does). I can only promise that no Greek letter will be used at all (in part because that font is not allowed by the publishing tool, I confess).

Anybody (if any) still reading this post is allowed to right away express the by now classical: “Yet another language? How many do we count by now?” The claim is justified because everybody knows that every problem in Computer Science is solved by proposing a (the) new one. Now it is my turn, why not; it’s a free world. For the interested reader, a technical paper with further details will hopefully be available at this site soon.

Actually PaXQuAL (Path-based Transformation and Querying Attribute Language is its real name; it is pronounced “Pascual”) is not that new and different from many other languages developed by real researchers in academia and industry. We wanted to imagine a language for querying and transforming structured data (e.g. XML, HTML), and of that sort we have many available, as we know. What new material can be proposed in this field by someone like us? Actually, what we really want is to operationally relate CSS with a special sort of theoretical, weird artifact we had been exploring some years ago, which we may dare to call object-oriented rewrite systems, or term-rewriting systems (TRS) with extra variables and state (the result of some work developed by, and jointly with, actual researchers some years ago). Considering TRS in this case is natural because CSS is indeed a kind of TRS, and that field has a rich offering of tools for useful automated reasoning. And we guess we can find them useful here.

The question that pushed us back to the old days is: given an interesting, simple and practical language, like CSS is, what kind of object-oriented rewriting logic can be used to describe its operational semantics? You may not believe it, but this is a very important issue if we are interested in reasoning about CSS and HTML for refactoring purposes, among others. And we are, aren’t we?

CSS is rule-based, includes path-based pattern matching and is feature-equipped (semantically attributed), which all together yields a nice combination. CSS can be considered “destructive” because it only allows adding or changing (styling) attributes of tags; the remaining “proper content” is not destructively rewritten. For that reason it is not generative (in contrast to XSLT and XQuery), and that leads to an interesting paradigm. For instance, the following is a typical simple CSS rule setting some properties of every tag of the kind body:

    body {
        font-family: Arial, Helvetica, sans-serif;
        background-color: #423f43;
        text-align: center;
    }

Of course, more explicit rules like this one can be declared; but furthermore, an inheritance (cascading) mechanism implicitly allows attributes to be pushed down or synthesized, as we know from attribute grammars.

That is all nice, but we feel we have to be original, so we want to propose the crazy idea of using something similar to CSS for purposes beyond setting style attributes, for instance for expressing classification rules that allow recognizing patterns like the ones we explained in previous posts: for instance, that a table is actually a sort of layout object, a navigation bar or a menu, among others. Hence, we would have a human-readable querying and transformation language for Web pages, a sort of CSS superset (keeping CSS as a metaphor, which we think might be a good idea):

Let us by now just expose some examples (where we note that the concrete syntax of PaXQuAL is not yet definitive). For instance, we may want to eliminate the bgcolor attribute of any table having it, because it is considered deprecated in XHTML. We use the symbol “:-“ for denoting execution of the query/transformation, as in Prolog.

 :- table[bgcolor]{bgcolor:null;}

We may want to add a special semantic attribute to every table hanging directly from a body, indicating that it may be a surface object for some later processing. We first must statically declare a kind of table, “sTable”, accepting a surface attribute, because we are attempting to use static typing as much as possible (“Yes, I am still a typeoholic”).

@:- sTable::table[surface:boolean;]{surface:false}

The symbol “@:-” is like “:-” but operates at the terminological level. And then we have the rule for classifying any table instance hanging directly from the body tag:

:- body sTable{surface:true;}
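
Although PaXQuAL’s concrete syntax is still tentative, the intended effect of this last rule can be modeled with ordinary pattern matching. The following Scala sketch is only ours (the Elem model and the surface attribute encoding are assumptions, not PaXQuAL output); it marks every table hanging directly from a body:

    object SurfaceRule {
       // A minimal element model: tag name, attributes and children
       case class Elem(tag: String, attrs: Map[String, String], children: List[Elem])

       // Rough analogue of ":- body sTable{surface:true;}":
       // every table that is a direct child of body gets surface=true
       def classify(e: Elem): Elem = e match {
          case Elem("body", attrs, children) =>
             Elem("body", attrs, children.map {
                case Elem("table", tAttrs, tChildren) =>
                   Elem("table", tAttrs + ("surface" -> "true"), tChildren.map(classify))
                case other => classify(other)
             })
          case Elem(tag, attrs, children) =>
             Elem(tag, attrs, children.map(classify))
       }
    }

The rule-based flavor is preserved: the left-hand side of the PaXQuAL rule corresponds to the nested patterns, and the attribute block corresponds to the functional update of the attribute map.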

Many more "interesting" issues and features still need to be introduced; we will do that in forthcoming posts. Hence, stay tuned.

More on Discovering Semantics from Parts of Web Pages

13. February 2007 11:36 by CarlosLoria in General  //  Tags:   //   Comments (0)

Taking a look at Web pages, we may expect to discover that some patterns of semantics are encoded using very few HTML "combinators" (HTML parts); this may be due to the lack of abstraction capabilities that is inherent to HTML alone. We compared this situation to the noisy-channel model in a previous post, where we presented some interesting figures and data illustrating the claim. Let us continue our journey by showing further instances of this phenomenon, whose formal analysis is crucial for intelligent refactoring tools of the kind we have been pursuing in this series of posts. In other words, let us look at other forms of “HTML noise”. As a word of warning, we recall that the data is the result of a particular sample of crawled pages, gathered the way we explained before.

For this post, we are experimenting with tables that are potentially used as page layouts or page structure. For those kinds of tables, we want to study the table shape or page surface, not the specific content; we may think of that as a way to filter potential candidates for further, deeper semantic analysis. (We briefly recall that our sample contains 819 pages and, roughly speaking, about 5000 table instances.)

The exercise is simple: we postulate an intuitive definition of a table as a surface and see how well it is supported by the data in our sample.

Let us try our shallow analysis by classifying a table as a page-layout candidate if its container is the page body tag, possibly followed by a chain of div tags (assuming such div tags are intended as organizers or formatters of the table), and if it has at least two rows and at least two columns (two columns being the most interesting case; we consider it as a base). A small sketch of this check appears below.
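
As a rough illustration only (the Elem model below is our assumption, not the actual analysis tooling), the candidate check and the row/column shape notation used next could be computed as follows:

    object LayoutCandidates {
       // Minimal DOM-like model: tag name and children (attributes and tbody omitted)
       case class Elem(tag: String, children: List[Elem] = Nil)

       def rows(table: Elem): List[Elem] = table.children.filter(_.tag == "tr")

       // Shape notation used below: one column count per row, e.g. "6.2.2.2"
       def shape(table: Elem): String =
          rows(table).map(_.children.count(_.tag == "td")).mkString(".")

       // ancestors holds the tag names from the table's parent up to the page body
       def isLayoutCandidate(ancestors: List[String], table: Elem): Boolean =
          table.tag == "table" &&
          ancestors.lastOption.contains("body") &&
          ancestors.dropRight(1).forall(_ == "div") &&
          rows(table).size >= 2 &&
          rows(table).map(_.children.count(_.tag == "td")).max >= 2
    }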

Such a pattern definition sounds reasonable on the surface; however, we will see that its empirical support is not as high as one may expect, at least in our sample.

We find 261 such candidates; they represent 31% of all pages, which is a quite interesting amount, yet unexpectedly small, because one might guess there should be at least one per page. Among these 261, we have 83 where the table hangs directly from the body tag (32% of the candidates; 10% of the whole sample). As a matter of fact, these 83 tables present irregular patterns, although we often find 2 columns (65%), with a high variance. For instance, we may find a pattern of the form 6.2.2.2.2.2.2.2, where we use our convention of showing a table of n rows as a sequence of n numbers, each of them being the number of columns of the corresponding row (in this example, 8 rows, the first with 6 columns and the rest with 2). But even worse, we find the irregular pattern 2.2.7.2.7.7.6.5.5.4.4.5.2.3.2.7.2.7. And talking about irregularity, let us take a look at this interesting one: 19.2.7.4.6.2.2.2.2.2.2.2.2.2.2.5.7.2.2.2.2.4.4.2, whatever it means.

With this simple analysis we may learn that, perhaps, some intuitive definitions do not occur as frequently as we might expect in our sample. Actually, after looking in detail at some of the irregular cases, a sound conclusion might be that we first need to pre-classify some parts of the page before using general patterns like the one we tried directly. In other words, we see that some noise needs to be filtered out for such a kind of pattern.

In a forthcoming post, we will continue studying that kind of patterns and their support.


Semantics from Structural Parts of Web Pages: some figures and patterns

1. February 2007 09:55 by CarlosLoria in General  //  Tags:   //   Comments (0)

We continue our regular series of posts talking about refactoring of Web Pages based on semantic approaches; we invite the interested new reader to take a look at the previous contributions to get a general picture of our intentions.

In this particular and brief post, we just want to present and describe some simple but interesting empirical data related to the structural (syntactic) content of a given sample of pages we have been analyzing during the last days. The results are part of a white paper we are currently preparing; it will be available at this site shortly.

We may remember from our first post that we want to recover semantics from structure, given particular clues and patterns we usually come across when analyzing pages. The approach is simpler to describe than to put into practice: once semantics has somehow been detected, refactoring steps can be applied at some places on the page and, by doing so, some expected benefits can be gained.

However, syntactic structure is the result of encoding some specific semantics and intentions on a web page using HTML elements and functionality; the HTML language is (expressively speaking) rather limited (too much emphasis on presentation issues, for instance), and some common programming “bad practices” increase the complexity of recovering semantics mainly from syntactic content as input. HTML being quite declarative, such complexity can indeed make the discovery problem quite challenging in a pragmatic context. That is our more general goal; however, we do not want to go that far in this post. We just want to keep this perspective in mind and give the reader some insight and data to think about. We will elaborate more on recovery in forthcoming posts.

As usual in the NLP field, it is interesting to use the so-called noisy-channel model as a point of reference and analogy. We may think of the initial semantics as the input message to the channel (the programmer); the web page is the output message. The programmer uses syntactic rules to encode semantics during coding, adding more or less noisy elements. Different encoding forms normally exist; noise can be greater when too much structure is engaged in expressing some piece of the message.

A typical example of noisy encoding is the use of tables for handling style, presentation or layout purposes beyond the hypothetically primary intention of such a table element: just to be an arrangement of data. Complex software maintenance and sometimes lower performance may be a consequence of too much noise, among other matters.

Let us take a look at some data concerning questions like: how much noise is there in a page? What kind of noise? What kind of regular encodings can be found?

As a warning, we do not claim anything about statistical significance, because our sample is clearly too small and was based on biased selection criteria. Our results are, in general, very preliminary. However, we feel they are sound and believable, and in some way consistent with the noisy-channel model.

Our “corpus” comes from 834 pages which were crawled starting, for convenience, at a given root page in Costa Rica, namely http://www.casapres.go.cr/. The size depended on a predetermined maximal quantity of 1000 nodes to visit; we never took more than 50 of the paths pointed to in a page, and we preferred visiting homepages to avoid traps.

Let us see a descriptive profile of the data. Due to current limitations of the publishing tool, we are not presenting the charts complementing the raw numbers.

Just 108 kinds of tags were detected, and we have 523,016 instances of them in the corpus. That means, very roughly, 6 kinds of tags per page and 627 instances per page. We feel that suggests the use of the same tags for saying probably different things (we remark that many of the pages are homepages by choice).

The top 10 tags are: pure text, a, td, tr, br, div, li, img, p and font (according to absolute frequency). Together, text, a (anchor) and img correspond to more than 60% of all instances. Hence about 60% of the instances are some form of data.

We notice that ‘table’ accounts for 1% and td for 8.5% of all instances, against 42% for text and 15% for anchors. On average, we have 7 tables per page and 54 tds per page, which is roughly 6 tds per table.

Likewise, we saw just 198 attributes and 545,585 attribute instances. The most popular ones are: href, shape, colspan, rowspan, class, width, clear and height, which is relatively consistent with the observed tag frequency (e.g. href for anchors, colspan and rowspan for td).

We pay some special attention to tables in the following lines. Our corpus has 5501 tables. It is worth mentioning that 65% of them are children of a td, in other words nested into another table; hence a high proportion of nesting, which suggests complexity in table design. We see that 77% of the data (text, a, img) in our sample is dominated by tds (most of the data is table-dominated). In the case of anchors, 33% of them are td-dominated, which may suggest tables being used as navigation bars or similar semantic devices in an apparently very interesting proportion.

We decided to explore semantic patterns on tables a little more exactly. For instance, we chose tables of nx1 dimension (n rows, 1 column), which are good candidates for navigation bars. A simple analysis shows that 618 tables (11%) have such a shape. The shapes may differ, which is quite interesting. For instance, we see a 5x1 table where all tds are anchors. We denote that by a sequence of 1s and 0s, where 1 means the corresponding td contains an anchor (a link to some URL): in this case ‘1.1.1.1.1’ is the sequence. But another table of the same 5x1 size presents the pattern ‘1.0.1.0.1’. This same pattern occurs several times, for instance in a 50x1 table. Another case is this: ‘0.0.0.0.1.1.1.1.1.0’, maybe suggesting that some links are not available. We mention that 212 patterns are 1x1, which would be a kind of navigation button. We will present a more elaborate analysis of these table patterns in a following post.

To finish, we notice that 875 tables (16%) are not regular: some rows have different sizes. Some of them are very unusual, like this 28x8 table, where each number in the following sequence denotes the number of tds of the corresponding row: 4.4.4.6.8.8.7.2.8.4.4.6.6.6.6.5.4.5.5.5.5.5.5.5.5.5.5.1.

Noisy, isn’t it?

Symbolically Reasoning on Web Pages and Refactoring

9. January 2007 09:50 by CarlosLoria in General  //  Tags:   //   Comments (0)

How do we appropriately represent Web pages for semantic analysis, and particularly for refactoring purposes? In a previous post at this same place, we introduced some of the main principles of what we have called semantically driven rational refactoring of Web pages, as a general framework of symbolic reasoning. In this post, we want to develop some preliminary but formally appealing refinement concerning page representation; we assume our previous post has been read, and we put ourselves in its context: the development of automated tools aimed at the task of refactoring.

We claim that many valuable derivations, both of conceptual and of implementation nature and interest, can be obtained by following a semantic approach to the handling of web pages; we believe that looking in more detail at such derivations may provide us with more evidence to sustain our claim on semantics in the context of refactoring. In this post, we want to consider one very specific case of conceptual derivation with respect to the representation model. Specifically, we want to present some intuitive criteria for analyzing, theoretically, a symbolic interpretation of the popular k-means clustering scheme, which is considered useful for the domain of web pages in more general contexts.

Let us first briefly recall why that is relevant to refactoring, to set up a general context. Refactoring is often related to code understanding; hence building a useful semantic classification of data is, in general, a first step in the right direction. In our case web pages are our data sources, and we want to be able to induce semantic patterns inside this kind of software element, in a similar way as search engines do to classify pages (however, we expect that more specific semantic criteria will be involved in our refactoring case, where, for instance, procedural and implementation features are especially relevant; we will expose such criteria in future posts).

In any case, it is natural to consider clustering of pages according to some given semantic criteria as an important piece in our refactoring context. Clustering is in essence about (dis)similarity between data, and is consequently very important in relation to common refactoring operations, especially those aiming at code redundancy. For instance, two similar code snippets in some set of web pages may become clustered together, allowing in this way the abstraction and refactoring of a "web control" or similar reusable element.

As several authors have done in other contexts, we may use a symbolic representation model of a page; in a general fashion, pages can naturally be represented by means of so-called symbolic objects. For instance, an initial approximation of a page (in an XML notation) can be something of the form given below (where relevant page attributes (or features) like "title" and "content" can be directly extracted from the HTML):

    <rawPage
       title="Artinsoft home page"
       content="ArtinSoft, the industry leader in legacy application migration and software upgrade technologies, provides the simple path to .NET and Java from a broad range of languages and platforms, such as Informix 4GL and Visual Basic."
    .../>

Because this representation is not as such effortless for calculation purposes, attribute values need to be scaled from plain text onto a more structured basis, for clustering and other similar purposes. One way is to translate each attribute value into numerical frequencies (e.g. a histogram) according to the relative use of normalized keywords taken from a given vocabulary (like "software", "legacy", "java", etc.) by using standard NLP techniques. After scaling, our page would probably be something similar to the following:

    <scaledPage
       title  ="artinsoft:0.95; software:0.00; java:0.00; "
       content="artinsoft:0.93; software:0.90; java:0.75; " .../>

The attribute-group values denote keywords and their relative frequency in the page (the data is hypothetical in this simple example), as usual. Notice the attribute group is another form of symbolic object whose values are already numbers. In fact, we may build another, completely flat symbolic object where attribute values are all scalar, in an obvious way:
 
    <flatPage
       title.artinsoft="0.95" title.software="0.00" title.java="0.00"
       content.artinsoft="0.93" content.software="0.90" content.java="0.75" .../>


Now pages that are in this way related to some common keyword set can be easily compared using some appropriate numerical distance. Of course, more complex and sophisticated representations are feasible, but numerically scaling the symbolic objects remains an important need at some final stage of the calculation. However, and this is a key point, many important aspects of the computational process can be modeled independently of such a final numerical representation (leaving aside the eventual explosion of low-level attributes, or the poorly dense long vectors of flat representations, which are not the point here).
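
As an aside, the scaling step itself can be sketched very roughly as plain relative keyword frequencies; the vocabulary and code below are ours and merely illustrative (a real scaling would use more serious NLP weighting than this):

    object Scaling {
       // Hypothetical vocabulary of normalized keywords
       val vocabulary = List("artinsoft", "software", "java", "legacy")

       // Relative frequency of each vocabulary keyword within a plain-text attribute value
       def scale(text: String): Map[String, Double] = {
          val words = text.toLowerCase.split("\\W+").filter(_.nonEmpty).toList
          val total = words.size.max(1)
          vocabulary.map(k => k -> words.count(_ == k).toDouble / total).toMap
       }
    }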

Let us develop more on that, considering the k-means clustering scheme but without committing it to any specific numerically valued scale, at least not too early. That is, we want to keep the essence of the algorithm so that it works for "pure" symbolic objects. In a simplified way, a (hard) k-means procedure works as follows:

   1. Given are a finite set of objects D and an integer parameter k.
   2. Choose k objects from D as the initial group of cluster leaders or centroids.
   3. Assign each object in D to the group that has the closest centroid.
   4. When all objects in D have been assigned, recalculate the k centroids.
   5. Repeat steps 3 and 4 until the centroids no longer change (move).

Essential parameters of the algorithm are (1) the distance function (of step 3) used to compute the closest centroid, and (2) some form of "object combination" operation used to produce the centroids (step 4); in the numerical case, "combination" calculates the mean vector of the vectors in the cluster. A generic skeleton is sketched below.
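
To make the abstraction concrete, here is a minimal generic sketch (ours, not from the original formulation) of hard k-means parameterized only by a distance and a combine operation over an arbitrary object type T:

    object SymbolicKMeans {
       // Hard k-means over arbitrary objects, given only a distance and a combine
       // operation; T is assumed to have structural equality (e.g. case classes),
       // and the first k objects are naively taken as initial leaders.
       def kMeans[T](objects: List[T], k: Int,
                     dist: (T, T) => Double,
                     combine: List[T] => T): List[List[T]] = {
          var centroids = objects.take(k)           // step 2: initial cluster leaders
          var clusters: List[List[T]] = Nil
          var changed = true
          while (changed) {
             // step 3: assign each object to the group with the closest centroid
             val assigned = objects.groupBy(o => centroids.minBy(c => dist(c, o)))
             clusters = centroids.map(c => assigned.getOrElse(c, Nil))
             // step 4: recalculate the k centroids (keep the old leader for empty groups)
             val recalculated = centroids.zip(clusters).map {
                case (old, Nil)   => old
                case (_, cluster) => combine(cluster)
             }
             changed = recalculated != centroids    // step 5: repeat until stable
             centroids = recalculated
          }
          clusters
       }
    }

With numerical vectors, dist would be the Euclidean distance and combine the component-wise average; the point of the discussion below is which abstract properties any such pair must satisfy.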

Such a combination needs to be "consistent" with the distance in order to assure eventual termination (not necessarily global optimality). We want to state a formal criterion of consistency in an abstract way. Fundamental for assuring termination is that every assignment of an object X to a new cluster occurs with a lower distance to the new cluster leader than the distances to the leaders of the previous clusters of X; in other words, the assignment of X must be a really new one. The combine operation performed after assignment (step 4) must guarantee that, in every old cluster where X was, the leaders remain more distant after future recalculations. So we need abstract definitions of distance and object combination that satisfy this invariant, which we call consistency.

Proposing a distance is usually easier (in analogy to the vector model): objects are supposed to belong to some class, and the feature values belong to some space or domain provided with a metric. For instance, an attribute like "title" takes values over a "String" domain, where several more or less useful numerical metrics can be defined. For more structured domains, built up from metric spaces, we may homomorphically compose the individual metrics of the parts to build the metric of the composed domain. As a very simplified example, suppose we have two "page" objects X and Y of the following form:

    X = <page>
           <title>DXT</title>
           <content>DXC</content>
        </page>

    Y = <page>
           <title>DYT</title>
           <content>DYC</content>
        </page>

where DXT and DYT are possibly complex elements representing symbolic objects over a metric domain DT provided with a distance dist_DT, and similarly DXC and DYC come from a metric space DC with metric dist_DC. We may induce a metric dist_page for the domain “page” as a linear combination (where weights for the dimensions could also be considered):

    dist_page(X, Y) = dist_DT(DXT, DYT) + dist_DC(DXC, DYC)

Using this simple illustration, we see that we may proceed in a similar fashion for the combine_page operation, assuming operations combine_DT and combine_DC. In such a case, we have:

    combine_page(X, Y) = <page>
                            <title>combine_DT(DXT, DYT)</title>
                            <content>combine_DC(DXC, DYC)</content>
                         </page>

It is straightforward to formulate the general case; the principle is simple enough. It is worth mentioning that "combine" is clearly a kind of abstract unification operation; in general, we claim, it has to be consistent with the "distance" function (at every domain). Further requirements rule this operation; in particular, the domain needs to be combination-complete or combination-dense.
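
The homomorphic composition can be written down quite directly; the sketch below is ours (keyword-frequency maps stand in for the domains DT and DC, and all names are illustrative):

    object PageDomain {
       // A symbolic "page" whose parts live in a metric domain of
       // keyword-frequency maps (a simplification of DT and DC)
       case class Page(title: Map[String, Double], content: Map[String, Double])

       // Metric and combination for the part domain
       def distPart(a: Map[String, Double], b: Map[String, Double]): Double =
          (a.keySet ++ b.keySet).toList
             .map(k => math.abs(a.getOrElse(k, 0.0) - b.getOrElse(k, 0.0))).sum

       def combinePart(a: Map[String, Double], b: Map[String, Double]): Map[String, Double] =
          (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0.0) + b.getOrElse(k, 0.0)) / 2).toMap

       // Homomorphic composition: the page metric and combination are built
       // from the metrics and combinations of its parts
       def distPage(x: Page, y: Page): Double =
          distPart(x.title, y.title) + distPart(x.content, y.content)

       def combinePage(x: Page, y: Page): Page =
          Page(combinePart(x.title, y.title), combinePart(x.content, y.content))
    }

To plug this into the k-means skeleton sketched earlier, one would fold combinePage over a cluster (e.g. cluster.reduceLeft(combinePage)) to recalculate its leader.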

Proving consistency strongly relies on abstract properties like the following:

     dist_D(X, Y) >= dist_D(combine_D(X,Y), Y)

In other words, "combine" actually brings objects closer according to the distance: the combination of X and Y is at least as close to Y as X was. Fortunately, by using homomorphic compositions of consistent domains, we recover consistency at the composed domain. At the scalar level, for instance with numbers, using the average as the "combination" operation easily assures consistency.
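
As a tiny sanity check of that last claim (our own illustration, not part of the original post), for scalars with the absolute-value distance and the average as combination, the consistency inequality holds by construction:

    object ScalarConsistency {
       def dist(x: Double, y: Double): Double = math.abs(x - y)
       def combine(x: Double, y: Double): Double = (x + y) / 2

       // dist(x, y) >= dist(combine(x, y), y), since |(x + y)/2 - y| = |x - y| / 2
       def consistent(x: Double, y: Double): Boolean =
          dist(x, y) >= dist(combine(x, y), y)
    }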