C# 4.0: Thinking about Software Evolution based on Migration as a Service

15. January 2009 10:57 by CarlosLoria in General  //  Tags:   //   Comments (0)

We have been watching with great interest the recent talks by Anders Hejlsberg on programming languages, and in particular on the future of C# 4.0, presented at PDC 2008 and at JAOO. Both talks offer really interesting perspectives that are useful when thinking ahead about new challenges and opportunities in software maintenance, and particularly in automated tools for supporting software evolution, such as software migration.


We see migration as a fundamental piece in fulfilling the maintenance requirements that derive from platform and framework evolution (as will be the case with C# 4.0), beyond legacy migration in the traditional sense (that is, very old languages). Here we are considering shorter evolution cycles, not necessarily critical cases (those that push a user application to the edge of obsolescence), but rather new platform features that offer better performance, scalability and maintainability. Thus, moving from VB6 to VB.NET is a first stage; moving to "a better" VB.NET is a second one. We will return to this point later on.

We first notice that Hejlsberg talked about C#, but it is safe to assume that VB.NET will provide similar features and orientation in its own flavor, as has been the case for these languages so far. Another very interesting player in this game is F#, which deserves a post of its own, so we leave it aside for the moment.

The orientation Hejlsberg describes is one where C# becomes more multi-paradigmatic, which for him essentially means covering three general aspects: more dynamism, concurrency, and declarative expressive power. These three aspects are clearly interrelated in his vision.

The dynamic (DLR) facet should provide alternatives to compete with the proliferation of scripting languages, a wave where strong static typing is avoided (considered harmful for productivity by some), even though such (rebel) languages will sooner or later turn reasonable and offer capabilities for static typing; otherwise testing and maintenance become even more problematic than they already are. In addition, the "dynamic" static type (no word play intended) will allow cleaner interoperability with other .NET languages (scripting languages especially) and with COM itself, as he explains.

Turning something that looks like a compiler directive into a type might be disputable, we personally think; but independently of the name, the approach sounds very flexible because it is open and controlled via an interface the programmer can use for his or her own needs. It also complements the "var" declaration very well, giving the impression of programming in a dynamic language. However, we can imagine potential misuse and abuse of such a feature, so Visual Studio might play an important role during development.
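
To make the distinction concrete, here is a minimal sketch in the C# 4.0 syntax as announced at PDC (details could still change before release); the Calculator class and GetCalculator helper are made-up stand-ins for a COM or DLR object:

using System;
using System.Collections.Generic;

class DynamicSketch
{
    // Hypothetical stand-in for an object coming from COM or a scripting language.
    class Calculator
    {
        public int Add(int a, int b) { return a + b; }
    }

    static dynamic GetCalculator()
    {
        return new Calculator();
    }

    static void Main()
    {
        // "dynamic": the call to Add is resolved at run time, no casts needed.
        dynamic calc = GetCalculator();
        int sum = calc.Add(10, 20);

        // "var": still statically typed; the compiler infers List<string> here.
        var names = new List<string>();
        names.Add("Anders");

        Console.WriteLine(sum + ", " + names.Count);
    }
}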

Avoiding JavaScript and using C# instead is a subtle suggestion we might (over)infer from the message; if accepted, it opens an interesting space in the Web domain.

The declarative aspect, which takes the form of adopting functional programming (FP) idioms, aims at explicitly avoiding side effects and mutability, giving the compiler better chances of optimizing for parallelism as well as improving readability and understanding. In addition, more intuitive forms of static typing are now possible concerning co- and contravariance, which is an added value.
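
As a sketch of how the variance part is expected to look once 4.0 ships (the annotations live on the updated framework interfaces and delegates, so this is the shape rather than something compilable today):

using System;
using System.Collections.Generic;

class VarianceSketch
{
    static void Main()
    {
        IEnumerable<string> strings = new List<string> { "a", "b" };

        // Covariance: IEnumerable<out T> lets an IEnumerable<string> be used
        // where an IEnumerable<object> is expected.
        IEnumerable<object> objects = strings;

        // Contravariance: an Action<object> can safely stand in for an Action<string>.
        Action<object> printAny = delegate(object o) { Console.WriteLine(o); };
        Action<string> printString = printAny;
        printString("hello");

        foreach (object o in objects) Console.WriteLine(o);
    }
}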

Concurrency should help take advantage of multi-core computing without affecting program intentionality, and FP is naturally an option for achieving that (see post1 and post2).

An exciting challenge and opportunity derived from this vision (which also applies to C# or VB.NET code that is not necessarily the product of a migration) is the ability to offer tools for platform-related evolution without changing the language: refactoring for platform adoption (RPA), in other words.

A quite interesting and rich part of the C# 4.0 package is likewise "meta-programming" and the "compiler as a service" orientation. In particular, there will be a much more complete code model for C#, with an API for compiling from C# itself (presumably from F#, too!), as we see from the presentation. In such a case, developing tools for RPA should become easier, which is an interesting point to analyze, we believe, in terms of how to exploit the compiler as a service to produce, by analogy, a "migration as a service".
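
The announced code model is not public yet, so any example is speculative; but as a rough taste of the direction, today's CodeDom already lets a C# program compile C# source at run time (a much poorer interface than the one Hejlsberg demoed; the Greeter type and source string below are of course made up):

using System;
using System.CodeDom.Compiler;
using Microsoft.CSharp;

class CompileAsAServiceSketch
{
    static void Main()
    {
        string source =
            "public class Greeter { public string Greet() { return \"Hello from generated code\"; } }";

        var provider = new CSharpCodeProvider();
        var options = new CompilerParameters { GenerateInMemory = true };

        CompilerResults results = provider.CompileAssemblyFromSource(options, source);
        if (results.Errors.HasErrors) { Console.WriteLine("compilation failed"); return; }

        // Instantiate the freshly compiled type through reflection.
        object greeter = results.CompiledAssembly.CreateInstance("Greeter");
        object message = greeter.GetType().GetMethod("Greet").Invoke(greeter, null);
        Console.WriteLine(message);
    }
}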

A Humble Post on Software

20. November 2008 13:53 by CarlosLoria in General  //  Tags: ,   //   Comments (0)

I really never expected to witness the kind of events we are seeing nowadays. Certainly, we are experiencing a huge impact on our conception of society, mainly driven by the adjustment of the economic system; probably an era is ending and a new one is about to be born, though I cannot possibly know that for sure. I just hope for a better one.
 
Normally, I am a pessimistic character, and more than ever we have reasons to be so. In this post, however, I want to be a little less pessimistic than usual, though with prudence. I just needed to reorder my thoughts for the sake of mental health: I have been looking at so much C++ and JavaScript and the like lately.

I am neither an economist nor a sociologist, nothing even close, just a regular old IT guy, needless to say. I just want to reflect here, quite informally, on the IT model, and more specifically the software model, and its role in our society under a context like the one we are currently experiencing. What is a "software model", by the way? Here I mean (overloading the term): in a society we have state, political, economic and legal models, among others. Good or bad, recognized as such or not, I think we have also invented a software model which, to an important extent, orchestrates the other, historically better known models; let us call them the (more) natural models.
   
I remember the 90s, as the new millennium approached and many conjectures were made about the Y2K problem, especially concerning legacy software systems and its potentially devastating costs and negative effects. For instance, from here, let us take some quotes (it is worth following the related articles pointed to):

"People have been sounding the alarm about the costs of the millennium bug--the software glitch that could paralyze computers come Jan. 1, 2000--for a couple of years. Now, the hard numbers are coming in and, if the pattern holds, they point to an even larger bill than many feared just a few months ago.
[...]
Outspoken Y2K-watcher Edward E. Yardeni, chief economist at Deutsche Bank Securities Inc., says the numbers show that some organizations are ''just starting'' to wake up to Y2K's potential for damage--but he believes the possible impact is enormous. In fact, Yardeni puts the chances of a recession in 2000 or 2001 at 70% because of ''a glitch in the flow of information"

Suddenly, after reading that again, the old-fashioned term "software crisis" (attributed to Bauer in 1968 and popularized by Dijkstra's seminal work in the seventies) that we taught in college classrooms seemed to make more sense than ever. Romanticism aside, we in IT know that software is still quite problematic, for the same reasons as back then; it is now also a matter of size. In Dijkstra's own words, almost forty years ago (The Humble Programmer):

"To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem."

But those were surely different times, weren't they? Exaggerated (as a source of business or even religious opportunities) or not, Y2K gave us a serious and global warning about the extent to which software had grown into many parts of our normal society model, even at a time when the Web was not yet wired into the global business model the way it has been in recent years.

I do not know whether the final Y2K costs were as big as, or even bigger than, predicted, but the problem certainly gave a huge impulse to investment in the software branch, I would guess; not always, I am afraid, leading to an improvement of the software model and its practices. Those were the golden times, probably ending now, when appearance was frequently valued more than content and Artificial Intelligence became a movie.

What will happen to the software model if the underlying economic model is now adjusting so dramatically, and with it the other, more natural models too? Will a shortage of capital for investment and consumption stop, or change (again) for the worse, the evolution and development of the software model?

I do believe our software model remains essentially as bad as it was in the Y2K epoch. This is a consequence of its own abstract character living in a more and more "concrete" business world. I would further guess it is even worse now, for it has proliferated widely and become more complicated structurally. Maintenance and formal understanding are now harder, among other reasons, because of dynamism and lack of standardization, and because external functionality has tended to be valued more than maintainability and soundness. And exactly for that reason, I also believe (no matter how exactly the new economic model ends up looking) that software will have to be structurally stronger and more reliable, because after economic stabilization and restart, whenever that comes, businesses will become stricter, to avoid mere appearance once again being able to generate wealth.

I would expect (or wish) that software for effectively manipulating software (legacy and fresher code alike, dynamic spaghetti included) should, as a consequence, be more in demand than ever. More effective testing and dynamic maintenance will be required during any adjustment phase of the new economic model. I also see standardization as a natural requirement and a driver of a higher level of software quality. Platform independence will be more important than ever, I guess. No matter which agents come out as winners during the adjustment and after the recovery of the economic system, a better software model will be critical for each of them, globally. Software, as an expression of substantial knowledge, will in any case be considered an asset under any circumstances. I think we should be seeing a less speculative and more consistent economic model, where precisely a better software model will become a discriminator for competing with real substance rather than with fancy emptiness. The statement should be demonstrable.

Sometimes things do not turn out as bad as they appeared; sometimes they turn out even worse, and now we absolutely have to be prepared for such a scenario. But I also want to believe there can be an opportunity out there. I would like to think that an improvement of the software model will be an essential piece of the economic transition, and that a transition to something better in software will take place in the end. I wish it, at least.

I recommend reading Dijkstra again, especially these days, if only as an interesting historical comparison; and honestly trying not to fool myself, I quote him referring to his vision of a better software model, as we have called it here:

"There seem to be three major conditions that must be fulfilled. The world at large must recognize the need for the change; secondly the economic need for it must be sufficiently strong; and, thirdly, the change must be technically feasible."

I think we might already have the first two of them. Concerning the third, and slowly returning to the C++ code I am looking at, I just reshape his words: I absolutely fail to see how I could keep programs firmly within my intellectual grip when this programming language escapes my intellectual control. But we have to, exactly.


Parsing with Oslo’s MGrammar (Mg)

7. November 2008 09:49 by CarlosLoria in General  //  Tags:   //   Comments (0)


As profiled in a previous post, we are building a tool for reasoning about CSS and related HTML styling using a logic framework (logic grammars in a human-readable style are usually highly ambiguous, which makes this an interesting case study). Since we need a textual language, a DSL, for the tool, I was asked to try MGrammar (Mg for short), a DSL generator that is part of the Microsoft Oslo SDK recently presented at PDC.

Actually, I am unaware of the details of the Oslo project, so I was a little skeptical about the whole thing, also because it would entail learning yet another new language (M). But I fought my personal preconceptions and decided just to take a look and try to have fun with it. I do not pretend to blog about the whole Oslo project here, of course; just to tell a simple story about using Mg in its current state, no more.

In general, I very much like the idea that data specification (models), data storage, querying, and the like can be separated from procedural languages in a declarative and yet more expressive way; hopefully giving me my own way to express my model (beyond standard diagrams and graphical perspectives). Oslo and M address this general vision of model-oriented software development, and Mg the expressivity part, as far as I understand the source material, which is not much yet.

What I have seen so far of M is indeed very interesting; I really liked it. It extends LINQ, which I consider a very interesting proposal of Microsoft's, and it also reminded me of OCL in some ways. In fact, there exists an open-source OCL library called Oslo which has nothing to do with Microsoft; it is just funny to mention.

Mg
I focused on tasting Mg; we do not consider M at all. We do not present a tutorial here; there is a nice one, for instance, here. If you are interested in my example and sources, please mail me and I will send them to you.

The SDK also contains a nice demo of a musical language (a song grammar) and its player in C#; moreover, M and Mg themselves are specified in Mg, and the corresponding sources are available in the SDK. Hence, you learn by reading grammars, because the documentation is not yet as rich as I would like.

Mg source files resemble, at least in principle, those of other parser generators (like ANTLR, Coco, JavaCC, YACC/Lex and relatives). I do not know the details of the parsing technique behind Mg; however, judging from the mg.exe reports, it appears to be LR parsing.

Mg offers really powerful features. Just as an example, the following are tokenizing rules:

token Digit = "0".."9";
token Digits = Digit+;
token Sign = ("+"|"-");
token Number = Sign? Digits ("." Digits)?;

Or you can constrain the repetition factor (here at least 1, at most 10), as in:

token ANotTooLongNumber = Digit#1..10;

You may even have parameters in scanner rules; thus you may write (taken from the nice tutorial here):

token QuotedText(Q) = Q (^('\n' | '\r') - Q)* Q;
token SingleQuotedText = QuotedText("'");
token DoubleQuotedText = QuotedText('"');

This handles different types of quotation depending on the parameter Q! Notice the subpattern ^('\n' | '\r') - Q: it means any character except newline or carriage return, also excluding the character Q.

The Oslo SDK provides an editor called Intellipad, which can be used for both M and Mg development. You need to experiment a little before getting used to it, but after a short time it reveals itself as a very nice tool. A command-line compiler for Mg is also included in the SDK, so you may edit your grammar with any text editor and compile it from a shell. But Intellipad is quite powerful: it lets you edit and debug your grammar simultaneously and quickly. Intellipad shows several panes to work with: in one you write your grammar; in another you may enter input, which is immediately checked against your rules; a third (tree view) pane shows you the AST your grammar projects from the input you are providing.

Finally, a fourth pane shows you the error messages, including any parsing ambiguities for the given input and grammar. Interestingly, only then do you notice a potential conflict; I did not see any form of static analysis in the editor that warns about such cases. Is there any?

That aside, I actually enjoyed using it. I am not sure I appreciate its whole functionality yet, but it was quite easy to create my grammar. Mg is modular, as M is, so you can organize and reuse your grammar parts, which is very nice.

To start defining a language, you write something like:

language LogicFormulae{
   //rules go here inside
}

For defining the language you mainly have "token" and "syntax" statements, among others. In this case, the name "LogicFormulae" will be used later on when we access the corresponding parser programmatically.

Case Study
Back to our case study: our simple language contains logic formulae like "p&(p->q) <-> p&q" (called well-formed formulae, or wff), where, as usual, a lot of ambiguity (conflicts) might occur. The result is frequently that you have to rewrite your grammar into a usually uglier one in order to recover determinism, and then the fun is gone. What I liked about Mg is its very readable style for handling such cases. For instance, I have a rule like:

syntax ComposedWff =
      precedence 1: ParenthesisWff
    | precedence 1: PossibilityWff
    | precedence 1: NecessityWff
    | precedence 1: QuantifiedWff
    | precedence 2: NotWff
    | precedence 3: AndWff
    | precedence 4: ImpliesWff
    | precedence 4: EquivWff
    | precedence 5: OrWff
    ;

Using the keyword precedence I can "reorder" the rule to avoid ambiguity. Thus, "all x.p & q" parses as "(all x.p) & q" under this rule because of the precedence I chose.

Another example shows how to deal with associativity and operator precedence:

syntax AndWff     = Wff left(4)  "&"  Wff;
syntax ImpliesWff = Wff right(3) "->" Wff;

Mg automatically projects syntax into its D-graphs using the rule name and token images; thus it always generates an AST without you having to specify one, which is very useful. The "AndWff" rule, for example, will produce a tree with three children (including the token "&"), labeled "AndWff". You can instead specify your own constructors to be produced:

syntax AndWff         =  x:Wff left(4)  "&"  y:Wff  => And[x, y];

Constructors like "And" need not be classes; they are simply labels on nodes.

Conflicts

Conflicts appeared to be solved this way, because the engine behind Intellipad did not complain anymore about my test cases. I do not know whether there is a way to verify the grammar from within Intellipad, so I just assumed it was conflict free. However, when I compiled it using mg.exe (with the -reportConflicts option), I got a list of warnings, as from any LR parser generator. That was disappointing. Are we seeing two different parsing engines?

Using C#
The generated parser and the graphs it produces can be accessed programmatically, something of special interest in my case beyond working with Intellipad. Because I was not able to create a parser image directly from Intellipad, I did the following: I compiled my grammar with the mg.exe command, producing a so-called image, a binary "mgx" file (other targets are possible, I think). This mgx image can be loaded into a C# project using some libraries of the SDK (it can also be executed using mgx.exe). You would load the image like this:

parser  = MGrammarCompiler.LoadParserFromMgx(stream, languageName);

where "stream" would be a stream referring to the mgx file I produced with the mg compiler. And "languageName" is a string indicating the language I want to use for parsing ("LogicFormulae" in my case). And I parse any file as indicated by string variable "input" as follows

rootNode = parser.ParseObject(input, ErrorReporter.Standard);

The variable rootNode (of type object, by the way!) refers to the root of the parsing graph. An object of class GraphBuilder gives you access to the nodes, node labels and node children, if any. Hence, using such an object, you visit the graph as usual, for instance with a pattern like:

void VisitAst(GraphBuilder builder, object node)
{
    // Recursively walk the children of the current node; leaf values
    // (token images) come back as plain strings.
    foreach (object childNode in builder.GetSequenceElements(node))
    {
        if (childNode is string)
            VisitValue(childNode);
        else
            VisitAst(builder, childNode);
    }
}

For some reason, the library uses object as the node type, as you have probably noticed.

Conclusions
I found Mg very powerful, and working with Intellipad for prototyping was actually fun, well, almost always. I only have some doubts concerning usability and performance. For instance, with Intellipad you are able to develop and test very fast, but there seems to be no static analysis inside Intellipad to warn about eventual conflicts in the grammar.

The other unclear thing I saw is the loading time of the engine in C#: it takes really too long to obtain the parser object from the mgx file. In fact, the mg compiler itself seems to work quite slowly, particularly compared with Intellipad. It seems as though mg.exe and Intellipad were not connected as a tool chain, or something like that. But surely it is just me, since this is my first contact with the SDK.

Aggiorno for CSS reasoning

8. October 2008 05:35 by CarlosLoria in General  //  Tags:   //   Comments (0)

As you might know, ArtinSoft recently released Aggiorno, a smart tool for fixing pages, improving page structure and enforcing compliance with web standards in a friendly and unobtrusive way. Some remarks on the tool's capabilities can be seen here. Some days ago I had the opportunity to use part of the framework on which the tool is built. That framework will be available in a forthcoming version, to allow users to develop extensions of the tool for their specific needs.

Those needs can vary, of course; in our case we are interested in analyzing style sheets (CSS) and style-affecting directives in order to help predict and explain potential differences among browsers. The idea is to include a feature of this nature in a future version of the tool. We note that the intended tool, though similar, should be quite a bit smarter than existing (and quite nice) tools like Firebug with respect to CSS understanding.

Our idea was to implement it using the Aggiorno components; we are still working on it, but so far it has been very easy to get a rich model of a page and its styles for our purposes. Our intention is to use a special form of temporal reasoning for explaining potentially conflicting behaviors at the CSS level. Using the framework, we are able to build an automaton (actually a transition system) modeling the CSS behavior (both intended and actual). We incorporate dynamic (rendered, browser-dependent) information in addition to the static information provided by the styles and the page under analysis; the dynamic information is available through Aggiorno's framework, which is naturally quite nice to have.

Just to give you a first impression of the potential, we show a simple, artificial example of how we model the styling information under our logic approach here. Our initial feelings about the potential of such a framework are quite encouraging, so we hope it becomes public very soon, for similar or better developments according to your own needs.

Parallel programming in PLINQ

1. September 2008 10:13 by CarlosLoria in General  //  Tags:   //   Comments (0)

In our previous post on parallel programming we mentioned some pointers to models of nested data parallelism (NDP), as offered in Intel's Ct and in Haskell derivatives, in the form of a parallel list or array (Post).

We want to complement our set of references by mentioning another quite interesting Microsoft project, namely Parallel Language Integrated Query, PLINQ (J. Duffy), developed as an extension of LINQ. PLINQ is a component of Parallel FX. We base this post on the indicated reference.

We find this development particularly interesting given the subtle relation with pattern matching (PM) that we might exploit in some particular schemas, as we suggested in our post; hence, it will be interesting to take that possibility into consideration. That would be the case if we eventually aim at PM-based code over (massive) data sources (i.e., by means of rule-based programs).

As you might know, LINQ offers uniform and natural access to (CLR-supported) heterogeneous data sources by means of the IEnumerable<T> interface, which abstracts the result of a query. Query results are (lazily) consumable by iteration, as in a foreach(T t in q) statement, or directly by forcing them into other data forms. A set of query operators (from, where, join, etc.), most of them with nice syntax support via extension methods, is offered; they can be considered phases for gathering and then filtering, joining, etc., the input data sources.
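
For instance, the usual sugared query below and the explicit operator chain are two spellings of the same thing; both produce a lazy IEnumerable<int>:

using System.Collections.Generic;
using System.Linq;

class LinqSketch
{
    static IEnumerable<int> EvenSquares(IEnumerable<int> data)
    {
        // Sugared query syntax...
        IEnumerable<int> q1 = from x in data
                              where x % 2 == 0
                              select x * x;

        // ...is just a chain of the underlying query operators.
        IEnumerable<int> q2 = data.Where(x => x % 2 == 0).Select(x => x * x);

        return q2; // nothing is computed until someone enumerates the result
    }
}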

For PLINQ, as we can see, the path followed is essentially a special data structure represented by the type IParallelEnumerable<T> (which extends IEnumerable<T>), as in the following snippet given by the author:

IEnumerable<T> data = ...;
var q = data.AsParallel().Where(x => p(x)).OrderBy(x => k(x)).Select(x => f(x));
foreach (var e in q) a(e);

A call to the extension method AsParallel is apparently the main requirement from the programmer's point of view. In addition, three models of parallelism are available for processing query results: pipelined, stop-and-go and inverted enumeration. Pipelined processing synchronizes the query-dedicated threads to (incrementally) feed a separate enumeration thread. This is normally the default behavior, for instance under the processing schema:

foreach (var e in q) {
    a(e);
}

Stop-and-go joins the enumeration with the query threads, so the consumer waits until they are done. This model is chosen in some cases, for instance when the query is forced completely.
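
A minimal sketch of that case (assuming the CTP's AsParallel entry point; exact namespaces and names may differ in the preview bits): forcing everything into a list means the consumer simply waits for the query threads to finish.

using System.Collections.Generic;
using System.Linq;

class StopAndGoSketch
{
    static void Main()
    {
        int[] data = { 1, 2, 3, 4, 5, 6 };

        // The whole result is materialized at once, so enumeration does not
        // overlap with query execution: the stop-and-go model.
        List<int> all = data.AsParallel()
                            .Where(x => x % 2 == 0)
                            .Select(x => x * x)
                            .ToList();
    }
}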

Inverted enumeration uses a lambda expression provided by the user, which is applied to each element of the query result. That avoids any synchronization issue but requires the special PLINQ ForAll extension method, not (yet?) covered by the sugared syntax. As indicated by the author, it takes the following form:

var q = ... some query ...;
q.ForAll(e => a(e));

As in other cases, however, side effects must be avoided under this model; but that is solely the programmer's responsibility.
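
To make the warning concrete, here is a sketch (hypothetical code, with the API names as described in the reference) of the kind of side effect that becomes a race once ForAll runs the lambda on several threads:

using System.Collections.Generic;
using System.Linq;

class ForAllSketch
{
    static void Main()
    {
        int[] data = { 1, 2, 3, 4, 5, 6 };
        var q = data.AsParallel().Select(x => x * x);

        var results = new List<int>();

        // Risky: the lambda runs concurrently on several threads, and List<int>
        // is not thread-safe, so this shared mutation is a data race. Keeping
        // the lambda side-effect free (or locking) is the programmer's job.
        q.ForAll(x => results.Add(x));
    }
}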
 

Parallel Programming as in Ct, BSGP and FP

26. August 2008 06:36 by CarlosLoria in General  //  Tags: ,   //   Comments (0)



Parallel programming has always been an interesting area, recognized as a challenging one; it is again quite full of life due to the proliferation of parallel architectures available at the "regular" PC level, such as multi-core processors and SIMD architectures, as well as co-processors like the stream processors we have in Graphics Processing Units (GPUs).


A key question derived from this trend is naturally how the development of parallel processors will influence programming paradigms, languages and patterns in the near future: how programming languages should allow regular programmers to take real advantage of parallelism without adding another huge dimension of complexity to the already intricate software development process, and without penalizing the performance of software engineering tasks or application performance, portability and scalability.

Highly motivated by these questions, we present in this post a modest overview based on our understanding of some interesting references and on personal points of view.

General Notions
Some months ago, someone working at Intel Costa Rica asked me about multi-core and programming languages during an informal interview, actually as a kind of test, which I am afraid I probably did not pass. However, the question remains very interesting (and is surprisingly related to some of our previous posts on ADTs and program understanding via functional programming, FP). So I want to give myself another chance and try to cover part of the question in a humble fashion. More precisely, I just want to recall some nice, well-known programming notions that lie behind the way the matter has been attacked by industry initiatives.

For that purpose, I have found a couple of interesting sources to build upon, as well as recent developments that address exactly the mentioned issue and serve as motivation. First, Intel announced last year the Ct programming model (Ghuloum et al.), and we also have a report on the BSGP programming language developed by people at Microsoft Research (Hou et al.). We refer to the sources for more specific details about both developments. We do not claim these are the only ones, of course; we just use them as interesting examples.

Ct is actually an API over C/C++ (its name stands for C/C++ for throughput) designed for portability and backward compatibility; it is intended for use on multi-core and tera-scale Intel architectures but is practicable on more specific ones like GPUs, at least in principle. On the other side, BSGP (based on Valiant's BSP model) addresses mainly GPU programming. The language has its own primitives (spawn and barrier among the most important) but again looks like C at the usual language level. Thus, on the surface, neither reveals any pointer to FP.

In the general case, the main concern with respect to parallel programming is clear: how can we program in a parallel language without compromising, constraining or deforming the algorithmic expression because of a particular architecture, at least not too much? What kind of uniform, high-level programming concepts can be used that behave scalably, portably and even predictably on different parallel architectures, in such a way that program structure can be derived, understood and maintained without enormous effort and cost, and in such a way that the compiler is still able to map them efficiently to the specific target, as much as possible?

Naturally, we already have general programming notions like threads, so parallel/concurrent programming usually gets expressed using such concepts, strongly supported by multi-core architectures. Likewise, SIMD offers data parallelism without demanding particular attention from the programmer, but data "locality" is in this case necessary for it to become effective. Such a requirement turns out to be unrealistic in (so-called irregular) algorithms using dynamic data structures, where data references (aliases and indirections) are quite normal (sparse matrices and trees, for instance). Thus low memory latency might turn out to be computationally useless in such circumstances, as is well known.

On the other hand, multithreading requires efficient (intra/inter-core) synchronization and coherent inter-core memory communication. Process/task decomposition should minimize the latency due to synchronization. Besides that, data-driven decomposition is in general easier to grasp than task-driven decomposition. Thus, such notions of parallelism still involve specific low-level programming considerations to strike a good balance between multi-core and SPMD models, and even MPMD.

Plenty of literature going back to the eighties and nineties already shows several interesting approaches to achieving data parallelism (DP) that take the mentioned issues into consideration uniformly; thus, it is not surprising that Ct is driven by the notion of nested data parallelism (NDP). We notice that BSGP addresses GPU architectures (stream based) in order to build a simpler programming model than, for instance, CUDA. NDP is not explicitly supported there, but similar principles can be recognized, as we explain later on.

Interestingly, the seminal works on NDP point directly to the functional programming (FP) paradigm, specifically prototyped in Blelloch's NESL language and in corresponding derivations based on Haskell (Nepal, by Chakravarty, Keller et al.).

FP brings us, as you might realize, back to our ADTs, so my detour is not too far from my usual biased stuff, as I promised above. As Peyton Jones predicted, FP will become more popular due to its intrinsically parallel nature. And even though developments like Ct and BSGP are not explicitly expressed as FP models (I am sorry to say), I only found the corresponding primitives and semantics easy to understand once I was able to relocate them in their FP/ADT origin. Thus, getting an idea of what an addReduce or a pack means in Ct, or similarly a reduce(op, x) in BSGP, was simpler by thinking in FP terms. But that is naturally just me.

Parallelism Requirements and FP combinators
Parallel programs can directly indicate, using primitives or the like, where parallel tasks take place and where tasks synchronize. But it would be better if the intended program structure were not hidden by low-level parallel constraints. In fact, we have to ensure that the program control structure remains intentionally sequential (deterministic), because otherwise reasoning about and predicting behavior can become hard; for instance, reasoning about whether, if we add more processors (cores), we can guarantee that program performance (essentially) scales. In addition, debugging is usually easier if the programmer thinks sequentially during program development and testing.

Hence, if we want to preserve natural programming structures during coding, and not use special (essentially compiler-targeting) statements or the like to specify parallel control, we have to write programs in a way that lets the compiler derive parallelism opportunities from the program structure. For that purpose we need some sort of regular programming structures, in other words, a kind of pattern for formulating algorithms. That is exactly the goal the Ct development pursues.

FP theory showed decades ago that a family of so-called combinators exists for expressing regular and frequent algorithmic solution patterns. Combinators are a kind of building block for more general algorithms; they are based on homomorphisms over (abstract) data types. The main members of this family are the map and reduce (a.k.a. fold) combinators. They are part of the theory of lists, or the famous Bird-Meertens formalism (BMF), as you might know. Symbolically, map can be defined as follows:

map(f, [x1,...,xn]) = [f(x1),..., f(xn)]  (for n>= 0)

where, as usual, brackets denote lists and f is an operation applied to each element of the list, collecting the results in a new list that preserves the original order (the map of an empty list is the empty list). As is easy to see, many algorithms that iterate over collections in this way are instances of map. My favorite example for my former FP students was: "raise the final grade of every student in the class by 5%". But more serious linear algebra algorithms are also closely related to the map pattern, so it is ubiquitous in graphics and simulation applications (pointing to GPUs).
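
For readers coming from .NET rather than FP, LINQ's Select is essentially map; the student example might look like this sketch (the Student type is hypothetical):

using System.Collections.Generic;
using System.Linq;

class MapSketch
{
    // Hypothetical record for the example.
    class Student
    {
        public string Name;
        public double FinalGrade;
    }

    static IEnumerable<double> RaiseGrades(IEnumerable<Student> course)
    {
        // map(f, xs): apply f to every element, preserving order, into a new sequence.
        return course.Select(s => s.FinalGrade * 1.05);
    }
}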

Map is naturally parallel: the n applications of the operation "f" could take place simultaneously in pure FP (no side effects). Hence, expressing tasks using map enables parallelism. Moreover, map can be thought of as a for-each iteration block able to run in parallel. List comprehensions in FP are another very natural way to realize map.

NESL's original approach, followed by Nepal (a Haskell variant supporting NDP) and by Ct, is to provide a special data type (a parallel list or array) that implicitly implements a map. In Nepal, such lists are denoted by the bracket-colon notation [: :] and are called parallel arrays. In Ct, such a type is provided by the API and is called TVEC. Simply using a data type to communicate parallelism to the compiler is quite easy to follow and does not obscure the proper algorithm structure. In Nepal notation, just as a simple example, using parallel array comprehensions we would write something like this:

[: x*x | x <- [: 1, 2, 3 :] :]

This denotes the map of the square function over the parallel list [: 1, 2, 3 :]. Notice that it is very similar to the corresponding expression over regular lists. It is interesting, however, that in contrast to normal lists, parallel arrays are not inductively definable; that would suggest processing them sequentially, which would make no sense. Using flat parallel arrays gives us DP. However, and this is the key issue, parallel arrays (TVECs too) can be nested; in that case a flattening process takes place internally in order to apply a parallel function, in parallel, over a nested structure (a parallel array of parallel arrays), which is not possible with plain DP. Standard, illustrative examples of NDP are divide-and-conquer algorithms like quicksort (qsort). Using NDP lets the flattening handle the recursive calls as parallel calls, too. For instance, as a key part of a qsort algorithm we calculate:

[: qsort(x) | x <- [: lessThan, moreThan :] :]

Assume lessThan and moreThan are parallel arrays resulting from the previous splitting phase. In this case, the flattening stage will "distribute" (replicate) the recursive calls over the data structure. The final result after flattening is a DP computation requiring a number of threads that depends on the number of array partitions (O(log n)), not on the number of recursive qsort calls (O(n)); this helps balance the number of required threads in the end, independently of the order of the original array. The example shows that NDP can avoid the potential degeneration of control (task) based parallelism.

Evidently, not every algorithm is an instance of map; many are sequential reductions from a data type into a "simpler" one. For instance, the length of a list reduces (collapses) a list into a number, and similarly the sum of all the elements of a list. In such cases, the threads computing independent results must be synchronized into a combined result. This corresponds to the reduce class of operations. Reduce (associative fold) is symbolically defined as follows:

reduce(e, @, [x1, ..., xn]) = e @ x1 @ ... @ xn

where "@" denotes an associative infix operator (a function of two arguments).

Reduce can be harder to realize in parallel than map, due to its own nature, in general; scan/tree-contraction techniques offer one option, providing reduce as a primitive. The bottom line is that reductions can be cleanly separated from big preceding map steps. In other words, many algorithms are naturally composed of map phases followed by reduce phases, so particular parallelization and optimization techniques can be employed at compilation level based on this separation.

We notice that map steps can be matched with the superstep notion, which is characteristic of the BSP model (which underlies the BSGP language). On the other hand, reduce phases are comparable with after-barrier code in that language, or points where threads must combine their outputs. However, BSGP does not have flattening, as we could observe in the report.

If the implemented algorithm naturally splits into well-defined map (superstep) and reduce (barrier) phases, then NDP thinking promotes better threading performance and more readable code intent. In addition, many reduce operations can be implemented very efficiently as operations on arrays driven by previously calculated indexing parallel arrays (used as masks), which are directly supported in Ct (for instance, partition, orReduce, etc.), according to the source. Standard scan algorithms are also supported in the library, so scan-based combinations can be used.

In a forthcoming (and hopefully shorter) second part of this post, we will briefly discuss some applications of NDP to pattern-matching schemas.
 

Understanding the complexity of programming exercises

23. July 2008 08:58 by CarlosLoria in General  //  Tags:   //   Comments (0)
When anyone attempts to support migration, refactoring or related automated software transformation tasks, some level of program understanding is required, as we might expect; the more complex the transformation, the deeper such an understanding may need to be. Tools like standard compilers stay at the language level, for instance translating an "if" statement into alternative sequences of low-level code, independently of what functionality it represents in the application domain. However, some migration-related tasks demand a deeper understanding, for instance when one has to decide that some branch of an "if" has to be abstracted as a method for some domain-specific reason. Thus, while a regular compiler translates from a higher into a lower level of semantics, migration tools may also require the opposite direction. Hence, a measure of understanding capability (complexity) is involved when dealing with software manipulation.

Naturally, the same is true for human beings, programmers, in their usual maintenance work or, more specifically, when beginners learn to program. In this last case, certain kinds of programming exercises help determine how much understanding the student possesses of the language, an algorithm and its implementation. In such a scenario we might ask ourselves how complex an exercise is, and how much effort, knowledge and how many tasks it might involve: not an easy question, and one we want to start investigating, initially in terms of an e-learning setting for pragmatic reasons. For that purpose, we have prepared a technical report explaining the context and some interesting angles of the problem in more detail. We invite the interested reader to take a look at the report.

On Extensible and Modular Pattern-Matching for ADTs

26. June 2008 07:31 by CarlosLoria in General  //  Tags:   //   Comments (0)

In our previous post on pattern matching and ADTs, we used Scala to illustrate the idea of pattern extension as a complement to the notion of extractors as defined in Scala (see Post). We want to go deeper and discuss some interesting points and derivations related to this requirement, which we consider a key one.

By pattern extension we mean the possibility of adding or changing a pattern, or combining it with other patterns in new ways; in other words, having more modularity at this level of the programming or specification language. We know patterns are available in several (non-mainstream) languages; however, they are not usually treated as first-class citizens of those languages. A pattern is usually a sort of notation that the compiler translates into normal code, but not a value of the language. We consider that an important limitation on the reuse and modularity of applications in cases where patterns are numerous, as in rule-based systems with large numbers of rules. A pattern is actually as important as a query in an information system: it relates to a use case and might deserve to be handled accordingly.

Scala provides a nice way to connect patterns with classes, as we know. However, Scala does not directly support extensibility at the pattern level, as we are finding out. In this post, we want to illustrate our ideas by developing a simple case in Scala. We reify patterns as objects to make the requirement more evident; we exploit many of the facilities Scala provides, but we also show a limitation we came across.

Suppose we want to define our pattern language as an idiom and to have more control over some operational facets of pattern matching, those we might want to change or customize. For instance, we would like to code a pattern that matches floating-point values but matches the template only approximately (within a certain error margin). We want such functionality to be cleanly part of a "new" pattern. That is, although things like pattern guards could be used for this purpose, we want to encapsulate this form of "fuzzy matching" and eventually parameterize it in a class of objects. Notice that for some classes of classification problems fuzzy matching can be helpful, so the example is not completely artificial. Other examples are evidently possible, too. Naturally, if we define our own atomic patterns, we should also be able to construct our own (Boolean) combinations of them.

For our purposes, we define a model of patterns by means of standard OO classes. We assume a very simple protocol (interface) to be satisfied by our family of patterns. We denote this protocol by the following Scala trait:

trait IPattern[T]{
  val value:T
  def transduce(e:Any) = e
  def matches(e:Any):Option[T] = None
  def handle(c:Option[T]) = c
  def `:>`(e:Any) = handle(matches(transduce(e)))
}
abstract class AnyPattern extends IPattern[Any]

The matching function is actually located in the method `:>`. The idea is that a pattern is allowed first to transduce the object before matching takes place; then the pattern matches the object, and finally the "handle" functionality can be added to cope with some particular condition after matching, transducing the result. As can be seen, we keep the design quite simple, in line with the scope of this post. We assume the transduction changes the type of neither the object nor the result. Results are given as instances of the Scala class Option[T]. The instance field "value" is used by constant patterns, such as numbers. We handle only patterns without binding, just for simplicity; the design can easily be extended for that purpose.

Now we can add a wildcard pattern (which matches every object, like _ in Scala):

object TruePattern extends AnyPattern{
  val value = null;
  override def matches(e:Any):Option[Any] = Some(e)
}

Similarly, a class for exact pattern matching of integers can be defined:

case class IntPattern(val value:Int) extends AnyPattern{
  override def matches(e:Any):Option[Int] =
                if(value==e) Some(value)  else None
}

Let us present our fuzzy pattern for floating point:

case class FloatPattern(val value:Double) extends AnyPattern{
  var epsilon:Double=0.01;
  private def dmatches(e:Double):Option[Double]={
    val diff = Math.abs(value - e );
    if(diff<epsilon) return Some(diff)
    else return None
  }
  private  def imatches(e:Int):Option[Double]=matches(e.asInstanceOf[Double])
  override def matches(a:Any):Option[Double] = a match{
    case a:Int    => imatches(a)
    case a:Double => dmatches(a)
    case _        => None
  }
}

The pattern accepts integers and values that are within 0.01 (the default for the variable epsilon) of the pattern value, which is a parameter of the pattern. As the value of the match, we return this difference in absolute value.

As further examples, we define pattern combinators (in our own pattern algebra) for "not", "and" and "or". We show the first two of them, nicely expressible in Scala. Notice that in our notation an expression (p `:>` e) produces the matching of pattern p against element e.

case class NotPattern(val value:AnyPattern) extends AnyPattern{
  override def matches(e:Any):Option[Any] = {
    val res:Option[Any] = (value `:>` e);
    res match {
      case Some(_) => return None
      case _       => return Some(e)
    }
  }
}
case class AndPattern(val value:AnyPattern*) extends AnyPattern{
  override def matches(e:Any):Option[Any] = {
    val res:Option[Any] = Some(e);
    for(v <- value){
      val res = (v `:>` e);
      res match {
        case Some(z) => ;
        case _       => return None
      }
    }
    return res;
  }
}

A guard pattern is also possible (in this case the value is a Boolean lambda representing the guard):

case class GuardPattern(val value:(Any => Boolean)) extends AnyPattern{
   override def matches(e:Any):Option[Unit]={
    if(value(e)) return Some()
    else return None
   }
}

Now, let us add some sugar:

object _Patterns{
  def _t                    = TruePattern;
  def _$(x:Int)             = IntPattern(x);
  def _$(x:Double)          = FloatPattern(x);
  def _not(x:AnyPattern)    = NotPattern(x);
  def _and(x:AnyPattern*)   = AndPattern(x:_*);
  def _or(x:AnyPattern*)    = OrPattern(x:_*);
  def _when(x:Any=>Boolean) = GuardPattern(x);
}
def even(x:Any) = x match{
     case x:Int => x%2==0
     case _     => false
  }

Using this, we might write: _and(_not(_when(even)), _or(_$(1), _$(2)))
to denote a silly pattern that checks for a non-even number that is 1 or 2; in other words, it matches just 1.

As a final remark, we might be tempted to model these pattern classes by adding extractors so they can be used as Scala patterns. However, Scala's limitations on "objects" (they have to be parameterless) and on "unapply" functions (they are unary) prevent that possibility. Once again, we identify an interesting potential feature for the language around this limitation.

Automated generation of pattern destructors for ADTs

19. June 2008 11:36 by CarlosLoria in General  //  Tags:   //   Comments (0)

As a continuation of our previous posts on ADTs, rules and pattern matching, we continue our work studying some alternatives for pattern extensibility and generation (see post 1 and post 2). For the extensibility case, we considered delegation by inheritance as an alternative, using the programming language Scala for illustration. In our previous post, we supported the claim that the Scala compiler could help in this respect. In this post we consider generating patterns from classes; once again, we identify a potential requirement for the programming languages under study.

In general, we recall that our goal is to discover features, or their absence, in programming languages that support an ADT modeling style within an OOP environment. Our motivation and justification come from the field of automated software transformation (as understood at www.artinsoft.com), and specifically from exploring language features that properly help to specify transformation rules in a modular and reusable way. The ability to write rules is certainly a central and distinctive element of software transformation tools, especially when the feature is available to the final user for customizing his or her own transformation tasks (as approached, for instance, in www.aggiorno.com).

We have already seen how a pattern can be associated with a class in Scala, which provides a very natural bridge between rule-based programming and OOP. Patterns are views of objects that allow easy access to their relevant subparts in a particular context, as we know. However, in the general case it is the programmer's responsibility to build the pattern manually, and knowledge about the set of relevant destructors of the object is obviously required. This is usually a matter of domain-specific semantics, so we face a task that is difficult to automate.

The notion of destructor is central to ADTs, and very particularly when compiling pattern matching. In OOP, destructors are usually implemented as so-called getters (or properties) but, interestingly, no particular syntax or semantics in mainstream OOP languages distinguishes them from other kinds of methods, as we do have for (class) constructors. In Scala, however, destructors can be generated automatically in some cases, which is a fine feature. For instance, consider a simple model of symbolic expressions, defined as follows:

trait ISexp
class Sexp(val car:ISexp, val cdr:ISexp) extends ISexp{
}
class Atom(val value:String) extends ISexp

In this case, Scala injects methods for accessing the car and cdr fields, so the destructors are easily recognized. If the class were a case class, the corresponding pattern would indeed be generated as indicated by the constructor declaration. But what happens when the class specification is not Scala code, or is not given in this explicit way, as in this example?

Thus, an interesting question arises if we want to support semi-automated generation of patterns from classes that are not available in a form that clearly exhibits their destructors.

In ADTs, a destructor is an operator (functor) f that gives access to one parameter of a constructor c. In other words, the following invariant should hold in the ADT theory:

Forall x:T: f(c(…, x, …)) = x

Or if we prefer using an OOP notation:

(new c(…,x,…)).f() = x

In terms of the type system, we should have f: () => T, where T is the declared type of the formal parameter position where x occurs in c.

This provides us with a criterion for guessing destructor candidates given a class. However, in order to generate code for a pattern, we would need to query the class implementing the ADT for such candidates; and in the case of Scala we only have reflection as a way to perform the required query, because there is no code model exposed as a class model. Fortunately, the reflection facilities of Java fill the gap; the nice thing is Scala's full interoperability with Java. The drawback is that we can only generate source code.

In order to evaluate the concept, we have developed a very simple Scala prototype that suggests destructors for Java and Scala classes and generates patterns for those candidates.

For instance, for the classes of the Sexp model given above, we automatically get, as expected:
object Atom {
    def unapply(x:sexps.Atom)=Some(x.value)
}
object Sexp {
    def unapply(x:sexps.Sexp)=Some(x.car, x.cdr)
}
For a class like java.lang.Integer we get something unexpected:
object Integer {
    def unapply(x:java.lang.Integer)=Some(x.intValue, x.toString, x.hashCode)
}
Likewise for the class java.lang.String:
object String {
 def unapply(x:java.lang.String)=
   Some(x.toCharArray, x.length, x.getBytes, x.hashCode)
}
Constructor overloading plays an important role in these results. Unfortunately, Scala's pattern abstraction does not work correctly if we overload the unapply method, so we have to pack all the possibilities into just one definition. What is essential is that we have identified additional requirements that deserve attention as potential language features.

ADTs, OOP, rules and Pattern-matching (2)

10. June 2008 09:44 by CarlosLoria in General  //  Tags:   //   Comments (0)

We are reviewing some programming languages that provide explicit support for the integration of OO and ADTs (i.e., rules and pattern matching). We refer to Part 1 for an introduction. We currently focus on Scala and TOM in our work. Scala provides the notion of extractor as a way to achieve pattern-matching extensibility. Extractors allow defining views of classes so that we can see them in the form of patterns; they behave like adapters that construct and destruct objects as required during pattern matching, according to a protocol assumed by Scala. In this way, it becomes possible to separate a pattern from the kind of objects the pattern denotes. We might consider pattern matching as an aspect of an object (in the sense of Aspect-Oriented Programming, AOP). The idea behind it is really interesting; it is clarified in more detail in Matching Objects in Scala. We want to perform some analysis in this post. Our main concern in this part, as expected, is pattern extensibility; we leave the ADT aside for a while. We will consider a situation where more support for extensibility is required and propose using delegation to cope with it, without leaving the elegant idea and scope of the extractors offered by Scala. Pattern matching by delegation could be more directly supported by the language.

Let us now see the example. Because of the complete integration between Scala and Java, extractors in Scala nicely interact with plain Java classes, too. However, as an illustration, let us use only plain Scala class declarations in our case study. In this hypothetical case, we deal with a model of "beings" ("persons", "robots", "societies" and the like). We want to write a very simple classifier of beings using pattern matching. We assume the classes of the model are sealed as far as the classifier is concerned (we are not allowed to modify them); hence, patterns need to be attached externally. We start with some classes and assume more classes can be added, so we will need to extend our initial classifier to cope with new kinds of beings.

Thus, initially the model is given by the following declarations (we only have “person” and “society” as kinds of beings):

trait Being

class Person extends Being {
  var name:String = _
  var age:Int = _
 
  def getName() = name
  def getAge()  = age
 
  def this(name:String, age:Int){this();this.name=name;this.age=age}
    
  override def toString() = "Person(" + name+ ", " +age+")"
}


class Society extends Being{
  var members:Array[Being] = _
  def getMembers = members
  def this(members:Array[Being]){this();this.members=members}
  override def toString() = "Society"
}

We have written the classes in a Java style to emphasize that they could be external to Scala. Now, let us introduce a first classifier. We encapsulate all its parts in a class for our particular purposes, as we explain later on.

trait BeingClassifier{

    def classify(p:Being) =  println("Being   classified(-):" + p)
}

class BasicClassifier extends BeingClassifier{

   object Person {
     def apply(name:String, age:Int) = new Person(name, age)
     def unapply(p:Person)           = Some(p.getName)
   }

   object Society {
     def apply(members:Array[Being]) = new Society(members)
     def unapply(s:Society)          = Some(s.getMembers)
   }

   override def classify(b:Being) = b match {
      case Person(n)  => println("Person  classified(+):" + n)
      case Society(m) => println("Society classified(+):" + b)
      case _          => super.classify(b)
   }

}
object SomeBeingClassifier extends BasicClassifier{
}

Method “classify(b:Being)”, based on pattern-matching, tries to say whether or not “b” can be recognized and displays a message correspondingly. Notice that the objects “Person” and “Society” within class “BasicClassifier” are extractors. In each case, we have specified both the “apply” and “unapply” methods of the protocol; however, just “unapply” is actually relevant here. Such methods use destructors to extract those parts of interest for the pattern being defined. For instance, in this case, for a “Person” instance we just take its field “name” in the destruction (“age” is ignored). Thus, the pattern with shape “Person(n:String)” differs from any constructor declared in class Person. That is, pattern and constructor can be, in that sense, independent expressions in Scala, which is a nice feature.
Now, let us suppose we know about further classes “Fetus” and “Robot” deriving from “Being”, defined as follows:


class Fetus(name:String) extends Person(name,0)
{
  override def toString() = "Fetus"
}

class Robot extends Being{
  var model:String = _
  def getModel = model
  def this(model:String){this();this.model=model}
  override def toString() = "Robot("+model+")"
}

Suppose we use the following object for testing our model:

object testing extends Application {
  val CL = SomeBeingClassifier
  val p  = new Person("john",20)
  val r  = new Robot("R2D2")
  val f  = new Fetus("baby")
  val s  = new Society(Array(p, r, f))
  CL.classify(p)
  CL.classify(f)
  CL.classify(r)
  CL.classify(s)
}

We would get:

Person  classified(+):john
Person  classified(+):baby
Being   classified(-):Robot(R2D2)
Society classified(+):Society

In this case, we observe that “Robot” instances are negatively classified while “Fetus” instances are recognized as “Person” objects. Now, we want to extend our classification rules to handle these new classes. Notice, however, that Scala unfortunately gives us no direct support for achieving that: “object” elements are final (they are instances) and consequently cannot be extended. In our implementation, we simply create a new classifier extending the “BasicClassifier” above. We notice, though, that the procedure we follow is systematic and amenable to automation.

class ExtendedClassifier extends BasicClassifier{
   object Fetus {
     def apply(name:String) = new Fetus(name)
     def unapply(p:Fetus)   = Some(p.getName())
   }

   object Robot {
     def apply(model:String) = new Robot(model)
     def unapply(r:Robot)    = Some(r.getModel)
   }

   override def classify(p:Being) = p match {
      case Robot(n) => println("Robot   classified(+):" + n)
      case Fetus(n) => println("Fetus   classified(+):" + n)
      case _        => super.classify(p)
   }

}
object SomeExtendedClassifier extends ExtendedClassifier{
}

As expected, “ExtendedClassifier” delegates to its parent “BasicClassifier” those classifications that are not handled by its own classify method. If we now change our testing object:

object testing extends Application {
  val XCL = SomeExtendedClassifier
  val p   = new Person("john",20)
  val r   = new Robot("R2D2")
  val f   = new Fetus("baby")
  val s   = new Society(Array(p, r, f))
  XCL.classify(p)
  XCL.classify(f)
  XCL.classify(r)
  XCL.classify(s)
}

We would get the correct classification:

Person  classified(+):john
Fetus   classified(+):baby
Robot   classified(+):R2D2
Society classified(+):Society

As already mentioned, this procedure could be more directly supported by the language, such that a higher level of extensibility could be offered.
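As a rough illustration of what such support might look like from the programmer's side, here is a minimal sketch in plain Scala that chains classification rules as partial functions; it uses type tests instead of the extractors above, and all the names (DelegatingClassifiers, basicRules, extendedRules, fallback) are ours, purely for illustration:

object DelegatingClassifiers {
  type Rule = PartialFunction[Being, String]

  val basicRules: Rule = {
    case p: Person  => "Person  classified(+):" + p.getName
    case s: Society => "Society classified(+):" + s
  }

  val extendedRules: Rule = {
    case r: Robot => "Robot   classified(+):" + r.getModel
    case f: Fetus => "Fetus   classified(+):" + f.getName
  }

  val fallback: Rule = {
    case b => "Being   classified(-):" + b
  }

  // More specific rules are tried first; anything unmatched falls through
  // to the fallback, which plays the role of super.classify above.
  val classify: Rule = extendedRules orElse basicRules orElse fallback
}

With this shape, extending the classifier is just composing one more partial function in front of the chain, which is essentially the kind of delegation we performed by hand with “ExtendedClassifier”.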

ADTs, OOP, rules and Pattern-matching (1)

6. June 2008 12:57 by CarlosLoria in General  //  Tags: , , ,   //   Comments (0)

Abstract (or algebraic) Data Types (ADTs) are a well-known modeling technique for specifying data structures in an algebraic style.
Such specifications are useful because they allow formal reasoning about software at the specification level. That opens the possibility of, for instance, automated verification and code generation, as we find in the field of program transformation (refactoring, modernization, etc.). However, ADTs are more naturally associated with a declarative programming style (functional (FP) and logic programming (LP)) than with an object-oriented one (OOP). A typical ADT for a tree data structure could look as follows, using some hypothetical formalism (we simplify the model to keep things simple):

    adt TreeADT<Data>{

    import Integer, Boolean, String;
    signature:
        sort Tree;
        oper Empty : -> Tree;
        oper Node : Data, Tree, Tree -> Tree;
        oper isEmpty, isNode : Tree -> Boolean;
        oper mirror : Tree -> Tree;
        oper toString : Tree -> String;
    semantics:
        isEmpty(Empty()) = true;
        isEmpty(Node(_,_,_)) = false;

        mirror(Empty()) = Empty();
        mirror(Node(i, l, r)) = Node(i, mirror(r), mirror(l));

    }
As we see in this example, ADTs strongly exploit the notion of rules and pattern-matching based programming as a concise way to specify requirements for functionality.  The signature part is the interface of the type (also called sort in this jargon) by means of operators; the semantics part indicates which equalities are invariant over operations for every implementation of this specification. Semantics allows recognizing some special kinds of operations. For instance, Empty and Node are free, in the sense that no equation reduces them. They are called constructors.
If we read the equations from left to right, we can consider them declarations of functions where the formal parameters are allowed to be complex structures. We also use wildcards (‘_’) to discard those parts of the parameters which are irrelevant in context. It is evident that a normal declaration without patterns would probably require more code and consequently obscure the original intention expressed by the ADT. In general, we would need destructors (getters), that is, functions to access the parts of the ADT. Hence, operation mirror might look as follows:
    Tree mirror(Tree t){
        if(t.isEmpty()) return t;
        return new Node( t.getInfo(),
                         t.getRight().mirror(),
                         t.getLeft().mirror() );
    }
Thus, some clarity usually gets lost when generating OOP code from ADTs, among other things because navigation (term destruction, decomposition) needs to be explicit in the implementation.

Rules and pattern-matching have not yet become standard in mainstream OOP languages when it comes to processing general data objects. The use of regular expressions over strings has been quite common, as has pattern-matching over semi-structured data, especially for querying and transforming XML; however, such processing tends to occur via APIs and external tools outside the language proper. The need for rules clearly exists in very practical applications, and during the last years several proposals have been developed to extend OOP languages with rules and pattern-matching. Some interesting attempts to integrate ADTs with OOP worth mentioning are Tom (http://tom.loria.fr/) and Scala (http://www.scala-lang.org/). An interesting place to look at, with quite illustrative comparative examples, is http://langexplr.blogspot.com/.

We will start in this blog a sequence of posts studying these two models of integrating rules, pattern-matching and OOP in some detail. We begin with a possible version of the ADT above using Scala. In this case the code generation has to be done manually by the implementer, but the language for modeling and for implementing is the same, which can be advantageous. Scala explicitly supports (generalized) ADTs, so we practically obtain a direct translation.
    object TreeADT{
        trait Tree{
           def isEmpty:Boolean = false
           override def toString() = ""
           def mirror : Tree = this
           def height : Int = 0
        }

        case class Node(info:Any, left:Tree, right:Tree) extends Tree{
           override def toString() =
              "(" + info + left.toString() + right.toString() + ")"
        }

        object Empty extends Tree{
           override def isEmpty = true
        }

        def isEmpty(t : Tree) : Boolean =
           t match {
             case Empty => true
             case _     => false
           }

        def mirror(t : Tree) : Tree =
           t match {
             case Empty         => t
             case Node(i, l, r) => Node(i, mirror(r), mirror(l))
           }
    }
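A minimal usage sketch of the object above (our own test snippet, not part of the original post):

    object TreeADTTest {
      import TreeADT._

      def main(args: Array[String]): Unit = {
        // Case-class constructors play the role of the ADT constructors.
        val t = Node(0, Node(1, Empty, Empty), Node(2, Empty, Empty))
        println(isEmpty(t))   // false
        println(mirror(t))    // prints (0(2)(1)), the mirrored tree
      }
    }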

Aiming at Migration 2.0 more in detail

11. June 2007 04:41 by CarlosLoria in General  //  Tags:   //   Comments (0)

By Migration 2.0 (coined in the context of McAfee’s Enterprise 2.0 vision) we want to consider additional ways of addressing Automated Software Modernization in a Web 2.0 context, and why or how this should or should not deserve particular attention. As a complement to a previous post, we have developed a presentation (albeit in Spanish) which may help to clarify (or not) some of the ideas around the notion. Especially relevant are a user-centric notion of migration and modernization as a sort of “SLATES-ification” of legacy code, where legacy needs to be understood inside the 2.0 context. We will revisit and refine this idea in a forthcoming post.
You will find the slides of the presentation here.

Migration 2.0 and Enterprise 2.0

24. May 2007 11:03 by CarlosLoria in General  //  Tags:   //   Comments (0)

Let us elaborate some loose ideas on a fresher notion of “software modernization”. In the sequel, we review the notion of software migration as a modernization path and look at it through the glass of other contextual elements and events in the IT world which have evolved during the last 2-3 years. Such elements give us a reason to identify forces and opportunities for innovation and possible new developments. Let us assume that, so far, we have focused mostly on implementation-oriented modernization, namely one mainly concerned with the programming languages, frameworks, platforms, architectures and the like that have been prevailing during the last years.

From a technical point of view, we are aware this implementation-based notion is completely valid and will exist as such for a while, because still-useful legacy systems are forced to evolve at the implementation level while retaining as much as possible of their original functionality and corresponding business value, at a reasonable cost. However, we are also aware that the environment where ISs serve and survive is evolving so strongly that implementation details probably remain that important only within a traditional IT/IS level and vision. What kind of environment and forces are these that put pressure on that vision? Is there an opportunity there for migration?

One important phenomenon is definitely the Web 2.0 and, in analogy to how the Internet forced Intranets, Andrew McAfee has recently coined the Enterprise 2.0 concept. It embodies the well-known effects of Web 2.0 as a ubiquitous trend and as a social movement, and how those might be pushing inside the IT enterprise nowadays; and, as a direct consequence, at the kind of tools employees might be wanting and needing in their regular work environment, where new information requirements are born faster than they can be incorporated as new features in the traditional IS platform. Whether so-called Web 2.0 tools will improve employee productivity is probably an open and debatable question, no doubt about it. We still remember, not too long ago, how e-mail and Internet at work were considered a source of disturbance per se.

It is also true, however, that being able to search, analyze, discover, tag, publish, share and trace knowledge at the rhythm of the business and of one's own personal information needs has become more important than ever. Traditionally designed ISs could be becoming a factor contributing to a sort of impedance mismatch between the huge flexibility and freedom that people currently encounter on the Web (even in their private, personal milieu) and a rigid traditional IS platform at the workplace. And, we emphasize, this concern arises independently of whether or not such an IS platform is “modern” at the implementation level, which is another, different dimension of the matter.

As Dion Hinchcliffe puts it, Enterprise 2.0 is a cultural catalyst (as Web 2.0 is being); we believe and interpret this as a realistic vision where, sooner rather than later, ISs will be judged in terms of McAfee’s SLATES criteria, and this will entail a stronger force leading to modernization at a higher level than the technical one, because such criteria are more closely related to social, common-sense, better-understood forces not so directly tied to technical issues.

If such a vision is accepted as a legitimate opportunity, we might then be looking for spaces for innovation now, as we consider moving to the next levels of automation-supported migration. In that case we have to consider migration as a path enabling ISs (not just legacy ones) to evolve in the direction of Enterprise 2.0-like modernization, as a well-defined strategy.


Migration as a path leading to BI

22. May 2007 07:05 by CarlosLoria in General  //  Tags:   //   Comments (0)

We speculate that software and data migration are frequently urged by IT infrastructure modernization needs (emergencies would perhaps be more precise to say) at the enterprise level, which are often closely related to platform obsolescence and the associated maintenance costs. A possibly even more appreciable benefit might be the integration of heterogeneous information systems and databases: the need to easily connect information sources and to create access ports that facilitate extracting and deriving knowledge from such sources, helping at the management level and at the decision-support-systems level in a more natural and easy way. In other words, the real value of modernization would be to support BI-like strata, pointing, for sure, to a better CRM platform among others. We feel migration is probably not perceived as a transitory goal but rather as a means, so the added value might lie outside, possibly hidden with respect to the whole migration effort per se. It seems to me that the vertical road which leads from software migration and modernization up to the BI end-user is perceived as too long a road, one that is hard to appreciate at first sight at the management level. My question to myself would be whether that has to be so. Shouldn't migration tools and their message be positioned as enablers for the BI layer, in a more direct way? And if the answer is yes, we should be asking ourselves how such migration tools can be adapted to close the gap.

Is BI a major trend? (will BI be needing AI?)

18. May 2007 13:45 by CarlosLoria in General  //  Tags:   //   Comments (0)
I was taking a look at some sources showing trends in BI for 2007, especially related to legacy code. I found interesting the following quotes from a report by Knightsbridge Solutions LLC (Trends in Business Intelligence for 2007):
Trend #5
Service-Oriented Architecture: Information Management is Critical to Success (of BI)

The buzz around service-oriented architecture (SOA) continues, with some organizations viewing SOA as the solution to a wide range of business and technology problems, from improving enterprise agility to deriving more value from legacy systems (I think we are witness of that). In the BI arena, SOA has great potential for delivering enhanced capabilities to users. An SOA-enabled BI infrastructure could provide seamless access to both batch and real-time data integrated across operational and analytical sources. SOA also presents opportunities for innovation in areas such as real-time data collection and real-time analytic services. However, companies that approach SOA without a strong information management methodology will have difficulty achieving the benefits they seek. When implementing SOA on a large scale, companies will face the same barriers they do in large BI integration projects. For example, some early adopters of SOA found that semantic incompatibilities in the data between disparate systems are a significant impediment to SOA adoption. These organizations are discovering that master data management and data governance efforts must precede SOA adoption to provide a “common language” that supports integration. SOA has the potential to deliver benefits in BI and many other areas, but not without a solid information management foundation.

And also:

Trend #8
Influence of Large Vendors: Market Consolidation Expected in 2007

Speculation abounded in 2006 regarding potential acquisitions of pure-play vendors in the BI space. Business Objects, Cognos, Hyperion, and Informatica were seen as potential acquisition targets with likely acquirers being enterprise software and infrastructure vendors like IBM, Microsoft, Oracle, and SAP. Large vendors have moved aggressively into the BI space, building their capabilities through both acquisitions and internal development (as evidenced by Hewlett-Packard’s recent acquisition of Knightsbridge).
Strongly consistent with this:
http://www.intelligententerprise.com/channels/bi/showArticle.jhtml;jsessionid=ADNU0C0TP01BYQSNDLRSKH0CJUNN2JVN?articleID=199501674

This is also interesting to see in more detail:
http://www.spagoworld.org/ecm/faces/public/guest/home/solutions/spago

"Naja", let's see.

A digression on simple life matters

15. May 2007 12:29 by CarlosLoria in General  //  Tags:   //   Comments (0)

Sometimes negative experiences may turn out positive in the long run, something everybody should hope to believe whenever things do not happen the way we expected. Though already old, and maybe questionable and controversial, I must admit this speech made me reflect a lot; I found it very rich in philosophical material, and it changed me somehow. If you haven’t yet, you might find it interesting to watch:

Video:  http://www.youtube.com/watch?v=D1R-jKKp3NA

Text: http://news-service.stanford.edu/news/2005/june15/jobs-061505.html

PaXQual: a silly language for analyzing and rewriting Web Pages

23. February 2007 12:42 by CarlosLoria in General  //  Tags:   //   Comments (0)

Let us gently start meeting PaXQuAL, a prototype language that we are shaping and adjusting for the emerging purpose of symbolically expressing simple analysis and transformation tasks over Web pages, all of this circumscribed to the context of refactoring, as we have done in previous posts.

And as we have also done before, sometimes we just want to digress a little from practical and realistic issues, to expose some theoretical ideas that we find somehow interesting (and probably nobody else does). I can only promise that no Greek letters will be used at all (in part because that font is not allowed by the publishing tool, I confess).

Anybody (if any) still reading this post is allowed to express right away the nowadays classical: “Yet another language? How many do we count by now?” The claim is justified, because everybody knows that every problem in Computer Science is solved by proposing a (the) new one. Now it is my turn, why not; it’s a free world. For the interested reader, a technical paper with further details will hopefully be available at this site soon.

Actually, PaXQuAL (Path-based Transformation and Querying Attribute Language is its real name; it is pronounced Pascual) is not that new or different from many other languages developed by real researchers in academia and industry. We wanted to imagine a language for querying and transforming structured data (e.g. XML, HTML), and of that sort we have many available, as we know. What new material can be proposed in this field by someone like us? Actually, what we really want is to operationally relate CSS with a special sort of theoretical, weird artifact we had been exploring some years ago, which we may dare to call Object-Oriented Rewrite Systems, or Term-Rewriting Systems (TRS) with extra variables and state (the result of work developed by, and jointly with, actual researchers some years ago). Considering TRS is natural in this case because CSS is indeed a kind of TRS, and that field has a rich offering of tools for useful automated reasoning. And we guess we can find them useful here.

The question that pushed us back to the old days is: given such an interesting, simple and practical language as CSS, what kind of object-oriented rewriting logic can be used to describe its operational semantics? You may not believe it, but this is a very important issue if we are interested in reasoning about CSS and HTML for refactoring purposes, among others. And we are, aren’t we?

CSS is rule-based, includes path-based pattern matching and is feature-equipped (semantically attributed), which all together yields a nice combination. CSS can be considered “destructive” because it only allows adding or changing (styling) attributes of tags, while the remaining “proper content” is not destructively rewritten. For that reason it is not generative (in contrast to XSLT and XQuery), and that leads to an interesting paradigm. For instance, the following is a typical simple CSS rule for setting some properties of every tag of kind body:

body {
     font-family: Arial, Helvetica, sans-serif;
     background-color: #423f43;
     text-align: center;
}

Of course, more explicit rules like this one can be declared; furthermore, an inheritance (cascading) mechanism implicitly allows attributes to be pushed down or synthesized, as we know from attribute grammars.

All that is nice, but we feel we have to be original, so we want to propose the crazy idea of using something similar to CSS for purposes beyond setting style attributes, for instance for expressing classification rules that recognize patterns like the ones we explained in previous posts: for example, that a table is actually a sort of layout object, navigation bar or menu, among others. Hence, we would have a human-readable querying and transformation language for Web pages, a sort of CSS superset (keeping CSS as a metaphor, which we think might be a good idea).

For now, let us just show some examples (noting that the concrete syntax of PaXQuAL is not yet definitive). For instance, we may want to eliminate the bgcolor attribute of any table having it, because it is considered deprecated in XHTML. We use the symbol “:-” to denote execution of the query/transformation, as in Prolog.

 :- table[bgcolor]{bgcolor:null;}

We may want to add a special semantic attribute to every table directly hanging from a body, indicating it may be a surface object for some later processing. We first must statically declare a kind of table, “sTable”, accepting a surface attribute, because we are attempting to use static typing as much as possible (“Yes, I am still a typeoholic”).

@:- sTable::table[surface:boolean;]{surface:false}

The symbol “@:-” is like “:-” but operates at the terminological level. Then we have the rule that classifies any table instance hanging directly from the body tag:

:- body sTable{surface:true;}
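To give an operational flavor of how a rule like the one above might be read as an attribute rewrite, here is a speculative Scala sketch over a toy tag tree; none of these names (Tag, Rule, surfaceRule, rewrite) belong to the actual PaXQuAL prototype:

case class Tag(name: String, attrs: Map[String, String], children: List[Tag])

object RuleSketch {
  // A rule inspects the ancestor chain and the tag, and optionally proposes
  // new attributes; children are never touched, mirroring the non-destructive,
  // attribute-only reading of CSS-like rules discussed above.
  type Rule = (List[Tag], Tag) => Option[Map[String, String]]

  // ":- body sTable{surface:true;}" read as: a table hanging directly from
  // body gets its (semantic) attribute surface set to true.
  val surfaceRule: Rule = (ancestors, tag) =>
    (tag.name, ancestors) match {
      case ("table", parent :: _) if parent.name == "body" =>
        Some(tag.attrs + ("surface" -> "true"))
      case _ => None
    }

  // Apply a rule top-down; only attributes change, the structure is preserved.
  def rewrite(rule: Rule, ancestors: List[Tag], tag: Tag): Tag = {
    val newAttrs = rule(ancestors, tag).getOrElse(tag.attrs)
    Tag(tag.name, newAttrs, tag.children.map(c => rewrite(rule, tag :: ancestors, c)))
  }
}

Something like RuleSketch.rewrite(RuleSketch.surfaceRule, Nil, page) would then decorate every qualifying table of a page tree without altering its content.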

Many more "interesting" issues and features still need to be introduced; we will do that in forthcoming posts. Hence, stay tuned.

More on Discovering Semantics from Parts of Web Pages

13. February 2007 11:36 by CarlosLoria in General  //  Tags:   //   Comments (0)

By taking a look at Web pages, we may expect to discover that some patterns of semantics are encoded using very few sets of HTML, let us say, “combinators” (HTML parts); this may be due to the lack of abstraction capabilities which is inherent to HTML alone. We have compared this situation to the Noisy-Channel model in a previous post, where we presented some interesting figures and data illustrating the claim. Let us continue our journey showing further instances of this phenomenon, whose formal analysis is crucial for intelligent refactoring tools of the kind we have been pursuing to introduce by means of this series of posts. In other words, let us get to know other forms of “HTML noise”. As a word of warning, we recall that the data results from a particular sample of crawled pages, obtained in the way we explained before.

For this post, we are experimenting with tables that are potentially used as page layouts or page structure. For those kinds of tables, we want to study the table shape or page surface, not the specific content; we may think of that as a way to filter potential candidates for further, deeper semantic analysis. (We briefly recall that our sample contains 819 pages and about 5000 table instances, roughly speaking.)

The exercise is simple: we postulate an intuitive definition of a table as surface and see how well it is supported by the data in our sample.

Let us try our shallow analysis by classifying a table as a page-layout candidate if its container is the page body tag, possibly followed by a chain of div tags (assuming such div tags are intended to be organizers or formatters of the table), and it has at least two rows and at least two columns (two columns being the most interesting case, which we consider as a base).
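A self-contained sketch of this heuristic on a minimal node model of our own (the names HtmlNode, bodyThroughDivs, bigEnough and isLayoutCandidate are purely illustrative and not part of our actual tooling):

case class HtmlNode(tag: String, children: List[HtmlNode])

object LayoutCandidates {
  // The container chain, from immediate parent upwards, must consist of
  // zero or more divs and then reach the body tag.
  def bodyThroughDivs(ancestors: List[HtmlNode]): Boolean =
    ancestors.dropWhile(_.tag == "div") match {
      case body :: _ => body.tag == "body"
      case Nil       => false
    }

  // At least two rows, and at least two columns (read here as: some row
  // has at least two cells).
  def bigEnough(table: HtmlNode): Boolean = {
    val rows = table.children.filter(_.tag == "tr")
    rows.length >= 2 && rows.exists(_.children.count(_.tag == "td") >= 2)
  }

  def isLayoutCandidate(ancestors: List[HtmlNode], node: HtmlNode): Boolean =
    node.tag == "table" && bodyThroughDivs(ancestors) && bigEnough(node)
}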

Such a pattern definition sounds reasonable in appearance; however, we will see that its empirical support is not as high as one may expect, at least in our sample.

We find 261 such candidates; they represent 31% of all pages, which is a quite interesting amount; however, it is unexpectedly small, because one might guess there should be at least one per page. Among these 261, we have 83 where the table hangs directly from the body tag (32% of the candidates; 10% of the whole sample). As a matter of fact, these 83 tables present irregular patterns, although we often find 2 columns (65%), with a high variance. For instance, we may find a pattern of the form 6.2.2.2.2.2.2.2, where we use our convention of showing a table of n rows as a sequence of n numbers, each being the number of columns of the corresponding row (in the example, 8 rows, the first with 6 columns, the rest with 2 columns). But even worse, we find the irregular pattern 2.2.7.2.7.7.6.5.5.4.4.5.2.3.2.7.2.7. And talking about irregularity, let us take a look at this interesting one: 19.2.7.4.6.2.2.2.2.2.2.2.2.2.2.5.7.2.2.2.2.4.4.2, whatever it means.

With this simple analysis, we may learn that, perhaps, some intuitive definitions do not occur as frequently as we might expect in our sample. Actually, after looking in detail at some of the irregular cases, a sound conclusion might be that we first need to pre-classify some parts of the page before using general patterns like the one we tried directly. In other words, we see that some noise needs to be filtered out for such a kind of pattern.

In a forthcoming post, we will continue studying this kind of pattern and its support.


Semantics from Structural Parts of Web Pages: some figures and patterns

1. February 2007 09:55 by CarlosLoria in General  //  Tags:   //   Comments (0)

We continue our regular series of posts talking about refactoring of Web Pages based on semantic approaches; we invite the interested new reader to take a look at the previous contributions to get a general picture of our intentions.

In this particular and brief post, we just want to present and describe some simple but interesting empirical data related to the structural (syntactic) content of a given sample of pages we have been analyzing during the last days. The results are part of a white paper we are currently preparing; it will be available at this site shortly.

We may remember from our first post that we want to recover semantics from structure, given particular clues and patterns we usually come across when analyzing pages. The approach is simpler to describe than to put into practice: once semantics has somehow been detected, refactoring steps can be applied at certain places in the page and, by doing so, some expected benefits can be gained.

However, syntactic structure is the result of encoding some specific semantics and intentions on a web page using HTML elements and functionality; the HTML language is (expressively speaking) rather limited (with too much emphasis on presentation issues, for instance), and some common programming “bad practices” increase the complexity of recovering semantics based mainly on syntactic content as input. HTML being quite declarative, such complexity can indeed make the discovery problem quite challenging in a pragmatic context. That is our more general goal; however, we do not want to go that far in this post. We just want to keep this perspective in mind and give the reader some insight and data to think about. We will elaborate more on recovery in forthcoming posts.

As usual in the NLP field, it is interesting to use the so-called Noisy-Channel model as a point of reference and analogy. We may think of the initial semantics as the input message to the channel (the programmer); the web page is the output message. The programmer uses syntactic rules to encode semantics during coding, adding more or less noisy elements. Different forms of encoding normally exist; the noise can be greater when too much structure is engaged in expressing some piece of the message.

A typical example of noisy encoding is the use of tables for style, presentation or layout purposes, beyond the hypothetical primary intention of the table element: just to be an arrangement of data. More complex software maintenance and sometimes lower performance may be consequences of too much noise, among other matters.

Let us take a look at some data concerning questions like: how much noise is in a page? What kind of noise? What kinds of regular encodings can be found?

As a warning, we do not claim anything about statistical significance, because our sample is clearly too small and was based on biased selection criteria. Our results are very preliminary, in general. However, we feel they may be sound and believable, and in some way consistent with the noisy-channel model.

Our “corpus” comes from 834 pages which were crawled starting, for convenience, at a given root page in Costa Rica, namely http://www.casapres.go.cr/. The size depended on a predetermined maximum of 1000 nodes to visit; we never took more than 50 of the paths pointed to in a page, and we preferred visiting homepages to avoid traps.

Let us see a descriptive profile of the data. Due to current limitations of the publishing tool, we are not presenting charts complementing the raw numbers.

Just 108 kinds of tags were detected, and we have 523,016 instances of them in the corpus. That means, very roughly, 6 kinds of tags per page and 627 instances per page. We feel that suggests the same tags are being used to say probably different things (we remark that many pages are homepages by choice).

The top 10 tags are: pure text, a, td, tr, br, div, li, img, p and font (by absolute frequency). Together, text, a (anchor) and img account for more than 60% of all instances. Hence, roughly 60% of page content is some form of data.

We notice that ‘table’ accounts for 1% and td for 8.5% of all instances, against 42% for text and 15% for anchors. On average, we have 7 tables and 54 tds per page; roughly speaking, 6 tds per table.

Likewise, we saw just 198 kinds of attributes and 545,585 attribute instances. The most popular include href, shape, colspan, rowspan, class, width, clear and height, which is relatively consistent with the observed tag frequencies (e.g. href for anchors, colspan and rowspan for td).

We pay special attention to tables in the following lines. Our corpus has 5501 tables. It is worth mentioning that 65% of them are children of a td, in other words nested into another table; hence a high proportion of nesting, which suggests complexity in table design. We see that 77% of the data (text, a, img) in the sample is dominated by tds (most of the data is table-dominated). In the case of anchors, 33% of them are td-dominated, which may suggest tables being used as navigation bars or similar semantic devices in an apparently very interesting proportion.

We decided to explore semantic patterns on tables a little more exactly. For instance, we chose tables of nx1 dimension (n rows, 1 column), which are good candidates for navigation bars. A simple analysis shows that 618 tables (11%) have such a shape. The shapes may still differ, which is quite interesting. For instance, we see a 5x1 table where all tds are anchors. We denote that by a sequence of 1s and 0s, where 1 means the corresponding td contains an anchor (a link to some URL): in this case ‘1.1.1.1.1’ is the sequence. But another table of the same 5x1 size presents the pattern ‘1.0.1.0.1’. This same pattern occurs several times, for instance in a 50x1 table. Another case is this one: ‘0.0.0.0.1.1.1.1.1.0’, maybe suggesting that some links are not available. We mention that 212 patterns are 1x1, which would be a kind of navigation button. We will present a more elaborate analysis of these table patterns in a following post.
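A tiny sketch of how such a signature can be computed, again on an ad-hoc node model of ours rather than the actual tooling (Elem, containsAnchor and signature are illustrative names):

case class Elem(tag: String, children: List[Elem])

object TableSignature {
  // True if the element contains an anchor anywhere below it.
  def containsAnchor(e: Elem): Boolean =
    e.tag == "a" || e.children.exists(c => containsAnchor(c))

  // Encode an n x 1 table as a dot-separated sequence of 1/0, where 1 means
  // the row has some cell containing an anchor, as in the patterns above.
  def signature(table: Elem): String =
    table.children.filter(_.tag == "tr")
      .map(row => if (row.children.filter(_.tag == "td").exists(c => containsAnchor(c))) "1" else "0")
      .mkString(".")
}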

To finish, we notice that 875 tables (16%) are not regular: some rows have a different size. Some of them are very unusual, like this 28x8 table, where each number in the following sequence denotes the number of tds in the corresponding row: 4.4.4.6.8.8.7.2.8.4.4.6.6.6.6.5.4.5.5.5.5.5.5.5.5.5.5.1.

Noisy, isn’t it?

Symbolically Reasoning on Web Pages and Refactoring

9. January 2007 09:50 by CarlosLoria in General  //  Tags:   //   Comments (0)

How do we appropriately represent Web pages for semantic analysis, and particularly for refactoring purposes? In a previous post in this same place, we introduced some of the main principles of what we have called semantically driven, rational refactoring of Web pages as a general framework of symbolic reasoning. In this post, we want to develop a preliminary but formally appealing refinement concerning page representation; we assume our previous post has been read and we place ourselves in its context: the development of automated tools aiming at the task of refactoring.

We claim that many valuable derivations, of both conceptual and implementation nature and interest, can be obtained by following a semantic approach to the handling of web pages; we believe that looking in more detail at such derivations may provide us with more evidence for sustaining our claim on semantics in the context of refactoring. In this post, we want to consider one very specific case of conceptual derivation with respect to the representation model. Specifically, we want to present some intuitive criteria for theoretically analyzing a symbolic interpretation of the popular k-means clustering scheme, which is considered useful for the domain of web pages in more general contexts.

Let us first briefly recall why this is relevant to refactoring, to set up a general context. Refactoring is often related to code understanding; hence, building a useful semantic classification of data is, in general, a first step in the right direction. In our case, web pages are our data sources, and we want to be able to induce semantic patterns inside this kind of software element, similarly to what search engines do to classify pages (however, we should expect that the specific semantic criteria are probably more involved in our refactoring case, where, for instance, procedural and implementation features are especially relevant; we will expose such criteria in future posts).

In any case, it is natural to consider clustering pages according to some given semantic criteria as an important piece in our refactoring context: clustering is in essence about (dis)similarity between data, and is consequently very important in relation to common refactoring operations, especially those aiming at code redundancy. For instance, two similar code snippets in some set of web pages may become clustered together, thereby allowing the abstraction and refactoring of a "web control" or similar reusable element.

As several authors have done in other contexts, we may use a symbolic representation model of a page; in a general fashion, pages can naturally be represented by means of so-called symbolic objects. For instance, an initial approximation of a page (in an XML notation) can be something of the form given below (where relevant page attributes (or features) like "title" and "content" can be extracted directly from the HTML):

<rawPage
   title="Artinsoft home page"
   content="ArtinSoft, the industry leader in legacy application migration and software upgrade technologies, provides the simple path to .NET and Java from a broad range of languages and platforms, such as Informix 4GL and Visual Basic."
.../>

Because this representation is not, as such, effortless for calculation purposes, attribute values need to be scaled from plain text into a more structured basis for clustering and similar purposes. One way is to translate each attribute value into numerical frequencies (e.g. a histogram) according to the relative uses of normalized keywords taken from a given vocabulary (like "software", "legacy", "java", etc.) using standard NLP techniques. After scaling, our page would probably be something similar to the following:

<scaledPage
      title  ="artinsoft:0.95; software:0.00; java:0.00; "
      content="artinsoft:0.93; software:0.90; java:0.75; " .../>

The attribute group values denote keywords and their relative frequency in the page (the data is hypothetical in this simple example), as usual. Notice the attribute group is another form of symbolic object whose values are already numbers. In fact, we may build another, completely flat symbolic object where attribute values are all scalar, in an obvious way:
 
   <flatPage
        title.artinsoft="0.95" title.software="0.00" title.java="0.00"
        content.artinsoft="0.93" content.software="0.90" content.java="0.75" .../>


Now pages that are related in this way to some common keyword set can easily be compared using an appropriate numerical distance. Of course, more complex and sophisticated representations are feasible, but numerically scaling the symbolic objects remains an important need at some final stage of the calculation. However, and this is a key point, many important aspects of the computational process can be modeled independently of such a final numerical representation (leaving aside the eventual explosion of low-level attributes or the poorly dense, long vectors of flat representations, which are not the point here).

Let us develop this further by considering the k-means clustering scheme, but without committing it to any specific numerically valued scale, at least not too early. That is, we want to keep the essence of the algorithm so that it works for "pure" symbolic objects. In a simplified way, a (hard) k-means procedure works as follows:

   1. Given are a finite set of objects D and an integer parameter k.
   2. Choose k objects from D as the initial group of cluster leaders or centroids.
   3. Assign each object in D to the group that has the closest centroid.
   4. When all objects in D have been assigned, recalculate the k centroids.
   5. Repeat Steps 3 and 4 until the centroids no longer change (move).

Essential parameters to the algorithm are 1) the distance function (step 3) used to compute the closest centroid, and 2) some form of "object combination" operation used to produce the centroids (step 4); in the numerical case, "combination" calculates the mean vector of the vectors in the cluster.
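To make these two parameters explicit, here is a schematic sketch of the procedure above, abstracted over the object domain D; this is our own rendering, not tied to any particular symbolic-object library:

object AbstractKMeans {
  // `dist` and `combine` are the two parameters discussed above; `combine`
  // folds a whole cluster into its new leader (centroid). Equality on D is
  // used only to detect that the leaders have stopped changing.
  def cluster[D](data: List[D], leaders: List[D],
                 dist: (D, D) => Double,
                 combine: List[D] => D): List[D] = {

    def closest(x: D): D = leaders.minBy(l => dist(x, l))

    // Step 3: assign every object to its closest leader.
    val groups = data.groupBy(closest)
    // Step 4: recompute each leader from its group (keep it if the group is empty).
    val next = leaders.map(l => groups.get(l).map(combine).getOrElse(l))

    // Step 5: iterate until the leaders no longer move.
    if (next == leaders) next else cluster(data, next, dist, combine)
  }
}

In this rendering, the consistency requirement discussed next is precisely what makes the stopping condition (leaders no longer changing) eventually hold.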

Such a combination needs to be "consistent" with the distance to assure eventual termination (not necessarily global optimality). We aim to state a formal criterion of consistency in an abstract way. Fundamental for assuring termination is that every assignment of an object X to a new cluster occurs at a lower distance to the new cluster leader than the distances to the previous cluster leaders of X; in other words, the assignment of X must really be a new one. The combine operation after assignment (step 4) must guarantee that, in every old cluster where X was, the leaders remain more distant after future recalculations. So we need abstract definitions of distance and object combination that satisfy this invariant, which we call consistency.

Proposing a distance is usually easier (in analogy to the vector model): objects are supposed to belong to some class, and the feature values belong to some space or domain provided with a metric. For instance, an attribute like "title" takes values over a "String" domain, where several more or less useful numerical metrics can be defined. For more structured domains, built up from metric spaces, we may homomorphically compose the individual metrics of the parts to build the metric of the composed domain. For a very simplified example, suppose we have two "page" objects X and Y of the following form:

 X= <page>
      <title>DXT</title>
      <content>DXC</content>
    </page>

 Y= <page>
      <title>DYT</title>
      <content>DYC</content>
    </page>

Here DXT and DYT are possibly complex elements representing symbolic objects over a metric domain DT provided with a distance dist_DT, and similarly DXC and DYC come from a metric space DC with metric dist_DC. We may induce a metric dist_page for the domain “page” as a linear combination (where weights for the dimensions could also be considered):

    dist_page(X, Y) = dist_DT(DXT, DYT) + dist_DC(DXC, DYC)

Using the simple illustration, we see that we may proceed in a similar fashion for the combine_page operation, assuming operations combine_DT and combine_DC. In such a case, we have:

combine_page(X,Y)= <page>
                     <title>combine_DT(DXT,DYT)</title>
                     <content>combine_DC(DXC,DYC)</content>
                   </page>

It is straightforward to formulate the general case; the principle is simple enough. It is worth mentioning that "combine" is clearly a kind of abstract unification operation; in general, we claim it has to be consistent with the "distance" function (at every domain). Further requirements rule this operation, in particular that the domain needs to be combination-complete or combination-dense.

Proving consistency strongly relies on abstract properties like the following:

     dist_D(X, Y) >= dist_D(combine_D(X,Y), Y)

In other words, "combine" actually brings objects closer according to the distance. Fortunately, by using homomorphic compositions of consistent domains, we recover consistency at the composed domain. At the scalar level, for instance with numbers, using the average as the "combination" operation easily assures consistency.
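For the scalar case the check is immediate: taking dist(x, y) = |x - y| and combine(x, y) = (x + y)/2, we have

     dist(combine(x,y), y) = |(x + y)/2 - y| = |x - y|/2 <= |x - y| = dist(x, y)

so averaging never moves the combined value farther away, which is exactly the property stated above.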
