More on Discovering Semantics from Parts of Web Pages

13. February 2007 11:36 by CarlosLoria in General  //  Tags:   //   Comments (0)

By taking a look at Web pages, we may expect to discover that many semantic patterns are encoded using very few HTML, let us say, “combinators” (HTML parts); this may be due to the lack of abstraction capabilities inherent to HTML alone. We compared this situation to the Noisy-Channel model in a previous post, where we presented some interesting figures and data illustrating the claim. Let us continue our journey by showing further instances of this phenomenon, whose formal analysis is crucial for intelligent refactoring tools of the kind we have been pursuing throughout this series of posts. In other words, let us look at other forms of “HTML noise”. As a word of warning, we recall that the data comes from a particular sample of crawled pages, collected in the way we explained before.

For this post, we are experimenting with tables that are potentially used as page layouts or page structure. For those kinds of tables, we want to study the table shape or page surface, not the specific content; we may think of this as a way to filter potential candidates for further, deeper semantic analysis. (We briefly recall that our sample contains 819 pages and, roughly speaking, about 5,000 table instances.)

The exercise is simple: we postulate an intuitive definition of a table as a surface and see how well it is supported by the data in our sample.

Let us try our shallow analysis by classifying a table as a page-layout candidate if its container is the page body tag, possibly followed by a chain of div tags (assuming such div tags are intended to organize or format the table), and it has at least two rows and at least two columns (two columns is the most interesting case; we consider it the base case).

Such a pattern definition sounds reasonable on its face; however, we will see that its empirical support is not as high as one might expect, at least in our sample.

We find 261 such candidates; they represent 31% of all pages, which is a quite interesting amount; however, it is unexpectedly small, because one might guess there should be at least one per page. Among these 261, we have 83 where the table hangs directly from the body tag (32% of the candidates; 10% of the whole sample). As a matter of fact, these 83 tables present irregular patterns, although we often find 2 columns (65%) with high variance. For instance, we may find a pattern of the form 6.2.2.2.2.2.2.2, where we use our convention of showing a table of n rows as a sequence of n numbers, each being the number of columns in that row (in this example, 8 rows, the first with 6 columns and the rest with 2). But even worse, we find the irregular pattern 2.2.7.2.7.7.6.5.5.4.4.5.2.3.2.7.2.7. And talking about irregularity, let us take a look at this interesting one: 19.2.7.4.6.2.2.2.2.2.2.2.2.2.2.5.7.2.2.2.2.4.4.2, whatever it means.
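To make the definition concrete, here is a minimal sketch of how such a shallow classification and the row/column signature might be computed. It assumes the jsoup HTML parser and deliberately ignores details a real analysis would need to handle (thead/tbody wrappers, colspan attributes, and rows belonging to nested tables):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LayoutTableSketch {

    // Layout candidate: ancestors up to <body> are only <div> tags,
    // and the table has at least two rows and at least two columns.
    static boolean isLayoutCandidate(Element table) {
        Element parent = table.parent();
        while (parent != null && parent.tagName().equals("div")) {
            parent = parent.parent();
        }
        boolean hangsFromBody = parent != null && parent.tagName().equals("body");
        int rows = table.select("tr").size();
        int maxCols = 0;
        for (Element tr : table.select("tr")) {
            maxCols = Math.max(maxCols, tr.select("td, th").size());
        }
        return hangsFromBody && rows >= 2 && maxCols >= 2;
    }

    // The shape signature used above: one number per row, each the number of cells in that row.
    static String shape(Element table) {
        StringBuilder sb = new StringBuilder();
        for (Element tr : table.select("tr")) {
            if (sb.length() > 0) sb.append('.');
            sb.append(tr.select("td, th").size());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<body><div><table>"
                + "<tr><td>a</td><td>b</td></tr>"
                + "<tr><td>c</td><td>d</td></tr>"
                + "</table></div></body>";
        Document doc = Jsoup.parse(html);
        for (Element table : doc.select("table")) {
            System.out.println(isLayoutCandidate(table) + " " + shape(table)); // prints: true 2.2
        }
    }
}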

With this simple analysis, we may learn that, perhaps, some intuitive definitions do not occur as frequently as we might expect in our sample. Actually, after looking in detail at some of the irregular cases, a sound conclusion might be that we first need to pre-classify some parts of the page before applying general patterns like the one we tried directly. In other words, we see that some noise needs to be filtered out before such a pattern can work.

In a forthcoming post, we will continue studying these kinds of patterns and their support.


Visual Basic Upgrade Companion, Code Advisor and Visual Basic 6.0 Upgrade Assessment Tool

9. February 2007 18:46 by jpena in General  //  Tags:   //   Comments (0)

Last week, a developer from a company that is evaluating a trial version of the Visual Basic Upgrade Companion sent us an email asking whether they should also use the Microsoft Visual Basic 6.0 Upgrade Assessment Tool and the Code Advisor.  Perhaps someone else has a similar doubt, so I thought it might be a good idea to share our response here.

First of all, let's remember that we are talking about three separate (and different) tools:

  • Visual Basic Upgrade Companion (VBUC): this is ArtinSoft’s Visual Basic 6.0 to VB.NET/C# migration tool.  Basically, you use this tool to convert your VB6 code to .NET.
  • Microsoft Visual Basic 6.0 Upgrade Assessment Tool: this tool was written for Microsoft by ArtinSoft, and can be downloaded free of charge from http://www.microsoft.com/downloads/details.aspx?FamilyID=10c491a2-fc67-4509-bc10-60c5c039a272&DisplayLang=en.  The purpose of this tool is to generate a detailed report of the characteristics of your VB6 code, giving you an idea of the size and complexity of the code from a migration standpoint.  The tool itself does not make any modification or conversion of the source code.
  • Code Advisor: this tool is also provided by Microsoft, free of charge, and can be downloaded from http://www.microsoft.com/downloads/details.aspx?familyid=a656371a-b5c0-4d40-b015-0caa02634fae&displaylang=en.  The Code Advisor analyzes your VB6 source code and looks for particular migration issues within the code.  Each issue is marked with a code comment that suggests how to modify the VB6 code to avoid the problem.

The purposes of the Microsoft Visual Basic 6.0 Upgrade Assessment Tool and the Code Advisor are different, so it is recommended that you use both of them.  However, it is important to note that the Code Advisor was designed for users who plan to migrate with the Visual Basic Upgrade Wizard (the conversion tool that comes with Visual Studio .NET), and since the VBUC has greater migration coverage, some of the issues flagged by the Code Advisor will be fixed automatically by the VBUC.  For a detailed discussion of those issues, please refer to my article “Visual Basic Upgrade Companion vs. Code Advisor”: http://www.artinsoft.com/VB-Upgrade-Companion-vs-CodeAdvisor.aspx

 

Virtual Server being accessed by a 32-bit or 64-bit binary

8. February 2007 07:07 by Csaborio in General  //  Tags: ,   //   Comments (0)
Yesterday, one of the attendees from the Virtualization events asked this question, which I thought would be worthwhile to share:

For a simple .NET application like this, would we need different applications when running on 64 vs. 32 bit hosts?

Before answering, please allow me to elaborate a bit on where the question is going.  Virtual Server has a COM API that allows it to be managed by applications and scripts.  Virtual Server R2 SP1 Beta 2 (phew) comes in two flavors: 32-bit and 64-bit.  The person asking wondered whether you could manipulate a 64-bit instance of Virtual Server from a 32-bit application (or vice versa).

OK, now that the question is (hopefully) a bit clearer, the answer is no: you do not need a different version of your application to access Virtual Server, regardless of its bit architecture.  Why?  Virtual Server's COM API is exposed by an out-of-process COM server, which means that everything is done by means of RPC.  When two applications communicate with each other by means of RPC, the first commandment of 64-bit is not broken (thou shalt not run 32-bit and 64-bit code within the same process space).

Windows Server Virtualization Calculator

8. February 2007 06:57 by Csaborio in General  //  Tags:   //   Comments (0)
Riddle me this: how many licenses of Windows Server Enterprise Edition would you need if you were planning on running 20 virtual machines on a server that has 2 processors?  Very easy: you would need only 5 licenses.  Too tough?  How about this one: what would the price difference be for 50 machines running Windows Server 2003 on a virtualization server with 2 processors if you chose to run the host machine with Windows Server Enterprise Edition vs. Windows Server Datacenter Edition?  Very easy: running Datacenter Edition would be $25,580 cheaper.

It is definitely tempting to say that I can pull this info right off the top of my head, but that would be a big, big lie.  The secret lies in a sweet web application Microsoft has published.  It is called the Windows Server Virtualization Calculator, and without a doubt it will clear up a lot of doubts and show you the best way to go (in terms of licensing) when consolidating your data center. Enjoy!
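In case you are curious about the arithmetic behind the first answer, here is a throwaway sketch. It assumes, as Windows Server 2003 R2 licensing did at the time, that a single Enterprise Edition license covers up to four running virtual instances, which is consistent with the 5-license answer above:

public class LicenseMath {
    // Assumption: one Windows Server Enterprise Edition license covers up to 4 running virtual instances.
    static final int VMS_PER_ENTERPRISE_LICENSE = 4;

    static int enterpriseLicensesNeeded(int vmCount) {
        // integer ceiling division
        return (vmCount + VMS_PER_ENTERPRISE_LICENSE - 1) / VMS_PER_ENTERPRISE_LICENSE;
    }

    public static void main(String[] args) {
        System.out.println(enterpriseLicensesNeeded(20)); // prints 5, matching the riddle's answer
    }
}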

Exit Row Cheat

8. February 2007 06:35 by Csaborio in General  //  Tags:   //   Comments (0)
Have you ever seen the exit rows in an airplane?  They have more legroom than coach, and after business or first class, they are the best seats in the place.  The bad news is that these seats are not available to just anyone, at least not on American Airlines; they are reserved for travelers who have some kind of status, such as Platinum or Gold.  This means that if you do not have status, you cannot choose them online (the seats will show up as unavailable), but fear not: I have found a workaround for some cases.

Say that you have no status at all with American Airlines, but you are traveling with a colleague or friend who does.  Before purchasing the tickets, tell your travel agent to place both tickets in the same record locator.  The person with the higher status will then be able to select these exit rows for you, and you will fly a lot more comfortably without having to hold status yourself.

Be warned that if 2 or more people are on the same American itinerary and one of them requests an upgrade to business, everyone on the itinerary will have a request to first.  If they do not have enough upgrade stickers, the consequences can be quite bad, such as losing the exit row that was pre-selected and having to fly (if lucky) in the worst seat on the plane :S

Creating a Project Knowledge Base

6. February 2007 15:50 by jpena in General  //  Tags:   //   Comments (0)

During a migration project, the issues that your team faces tend to become repetitive.  Because of this, it is important to have mechanisms that allow team members to share the knowledge they have gained in the process of migrating the application.  This way, you are less likely to have a team member struggling to fix a migration issue that someone else in the team already knows how to solve.

An ideal way of sharing team knowledge during the migration process is to create a Project Knowledge Base, where team members can post the solutions they have applied to previous migration issues.  With a good knowledge base, developers will be able to search for and retrieve information that helps them fix the issues they are facing, potentially increasing team productivity and reducing costs.

To be effective, your project knowledge base needs to have the following characteristics:

- Easy access: team members should be able to easily retrieve information as well as add new items to the knowledge base.
- Search capability: of course, you don’t want team members browsing the knowledge base for hours to find the solution to a problem.
- Periodic backup: place the knowledge base on a server that is backed up regularly.  In a later project, the stored information may be useful again.

It is common to implement these knowledge bases using a Wiki engine.  More information on Wikis can be found at http://www.wiki.org/wiki.cgi?WhatIsWiki.  Also, some examples of popular wiki sites are Wikipedia (http://www.wikipedia.org/) and Memory Alpha (http://memory-alpha.org/en/wiki/Portal:Main); this last one is a personal favorite :)

 

The big move

6. February 2007 13:41 by acyment in General  //  Tags:   //   Comments (0)
Yesterday we moved to the new offices in Curridabat. For (almost) everyone they are closer to home, and the truth is they are very nice, so joy reigns. It hardly matters that the building is only half-built. The new lunch spot is great, and we are about to sign up en masse for a gym with a pool and everything. The team room has plenty of walls, which satisfies my appetite for post-its. If I remember, tomorrow I will bring the camera and start posting pictures of the workspace.

All in good time

6. February 2007 13:15 by acyment in General  //  Tags:   //   Comments (0)
In the sprint #1 retrospective there was a comment from one team member that I found interesting: even though we had never done Scrum before, his feeling was that the meetings were a bit chaotic and that the ScrumMaster (i.e. me) needed to impose some order. My original intention was to start the project by being permissive (stretchable timeboxing, unmoderated opinions from external members, somewhat fuzzy roles), but I realized that the trick is precisely to start out orthodox. One of the first points on which I decided to be inflexible, from the very beginning of sprint #2, is timeboxing: meetings were starting late and often running far too long. Since I am more of a visual guy than anything else, I decided to communicate the idea of timeboxing in the most explicit way possible. The main thing, of course, was the change in my own attitude, but these two little props helped quite a bit:
  • The piggy bank ("el chanchito"): a cardboard box with a small hole in its lid. Whoever arrives late to a daily meeting or to the retrospective pays according to a table posted on the wall:
    • 0'<t<5': 200 colones ($0.40)
    • 5'≤t<10': 500 colones ($1)
    • 10'≤t: 1000 colones ($2)
  • The frog ("el sapo"): a cute little box with 4 possible timers; basically the only ones we use are the 15' and 60' ones. Everyone finds it charming, and it has also turned out to be very effective.
Note: so far we have collected about 7000 colones ($14); the idea is to use the money to buy snacks to nibble on during the meetings.

Sprint Review #3

6. February 2007 13:09 by acyment in General  //  Tags:   //   Comments (0)
I finally decided to give the aborted sprint a number, so the one that just ended was simply #3. Friday's review went quite well... or at least much better than I expected. Unlike the first sprint review, this time there was a lot of demo and not so much philosophical discussion. I think it helped that we made it clear we would be strict about the timeboxing, and also the simple fact that we had already had a review before, one which in my opinion had gone rather badly.
A few hours before the review, MC and MR told me we had to prepare a presentation (i.e. a PPT, or at least that is what I understood) to introduce what had been done during the sprint. I replied that it was not advisable to invest more than 1 hour in total preparing the review and that, besides, the Product Owner was the one who had chosen the user stories to be developed, so there was no point explaining to him what he already knew well. The counter-response was that people who knew little about the project would be attending the meeting (i.e. future team members and a company executive, as well as LC). My counter-counter-response, perhaps a bit harsh, was that it was not the team's responsibility to make up for the fact that not all stakeholders had done their homework. The c-c-c-response made a lot of sense: "they will think we work badly". And continuing in dialogue form:
- The review is not about looking good, it is about getting feedback
- But what good is feedback from someone who does not understand what they are seeing
- Good point, but let's not paper over the holes. If they need to know and they don't, let it show
Still, this conversation left a bitter taste... Who is supposed to bring the stakeholders up to speed? And what happens if those stakeholders are about to join the team?

The best asp.net blog. A personal perspective

6. February 2007 06:54 by Mquiros in General  //  Tags:   //   Comments (0)

Maybe I’m wrong, but after 8 years in web development, 4 years of classic ASP, 2 years of transition to the .NET world, and the last year doing heavy development on ASP.NET 2.0, I think I may have a good opinion on the best online resources for ASP.NET development.

I had been thinking for some time about giving credit to the great work of Mads Kristensen and his .NET SLAVE blog, to me the best blog around the blogosphere when it comes to ASP.NET development. But I’ve been kind of lazy and never did so; today I read a blog post from HIM asking about some stuff, which you should read here.

After playing around with all the free resources online for coding techniques and styles (forums, tutorials, blogs, Starter “piece of s***” kits), I can easily say that Mads’ blog is the best ASP.NET blog around. Why? Because if you look around and read a lot of ASP.NET and related-technology blogs and forums, you can find good code but NEVER, believe me, NEVER the complete solution, or not a quality solution; and to make it even better, Mads’ “KISS” approach makes his blog articles just about perfect. I understand that people shouldn’t have to give away everything they know; it is everybody’s own decision whether to share or not.

Small, concise, ready for deployment in most cases, and best of all, HE SHARES real solutions for real problems in real scenarios; his code snippets are pieces of gold when you have enough criteria to judge them. I don’t want to sound biased; I don’t know Mads personally, but I bet you he is a great person. Why? Because people who SHARE KNOWLEDGE, and not just simple knowledge but real knowledge, are great people. I invite you to read his blog every day, and if you find something useful I encourage you to donate (I should do that myself). Read all the posts Mads has written; I guarantee you will be amazed by all that valuable ASP.NET and C# material.

I will put together a list of the resources and blogs I read every day to keep on track with the latest news and trends relevant to an ASP.NET developer, but right now I just feel it is necessary to give Mads something small back in return for his great knowledge.

As I said, Mads’ code snippets and opinions rock, and here are my favorite ones.

Latest: http://madskristensen.dk/blog/Search+Engine+Positioner.aspx Search Engine Positioner. I saw this yesterday and it is now used in our marketing department; a very valuable tool for SEO (search engine optimization). Mads, if you read this, here is my "wish a song": proxy settings, to use 3rd-party proxies. This would be very useful when doing SEO from outside the US, because search engines give results depending on your IP's country, so if you do search engine marketing for a country other than your own (in my case, Costa Rica), that would be very valuable.

Some other favorites:


And many more; if you put together all the code Mads provides, you can build a great software library for a small general-purpose web shop.
Thanks for everything, Mads; keep sharing, keep rocking!
Visit the .NET SLAVE blog now!


Upgrading VB6 to .NET: migration guide FAQ

2. February 2007 12:53 by Fzoufaly in General  //  Tags:   //   Comments (0)

Microsoft and ArtinSoft have published the upgrade guide for Visual Basic 6 to .NET.  This guide is being re-purposed as a list of FAQs that is easy to search, allowing programmers and managers to find out about best practices when planning a migration project from VB to Visual Basic .NET 2005.

The first two chapters are out; more will come in the next few weeks.

 

"The purpose of these pages is to provide a comprehensive FAQ for the Upgrading Visual Basic 6.0 to Visual Basic .NET and Visual Basic 2005 guide. This VB migration material was developed jointly by Microsoft and ArtinSoft, a company with vast experience in Visual Basic conversions and the developer of the Visual Basic Upgrade Wizard, the Visual Basic 6.0 Upgrade Assessment Tool, the Visual Basic Upgrade Companion and the ASP to ASP Migration Assistant, among other software migration products."

Link to Upgrading VB6 to .NET – migration guide FAQ

A Project Management joke

2. February 2007 12:16 by jpena in General  //  Tags:   //   Comments (0)

The other day I heard a joke about project managers told by John Valera, one of the Project Management professors at Costa Rica's Universidad Nacional, so I wanted to share it in this space.  I’m not really good at telling jokes, but here it goes…

There was a big project with three key team members: a software architect, a QA leader and a project manager.  These three used to go for a walk together after lunch, to relax and talk about the project.  One day, they came across an old lamp, and when they picked it up, a genie appeared and said:

- “You have awakened me.  I’m supposed to grant you three wishes, but since there are three of you, I will grant one wish to each of you.”

First, the QA leader said:

- “I wish to flee to some place where I can have all the money I want, and spend it on whatever I want!”  Suddenly, he disappeared and became a rich man in Las Vegas.

Then came the software architect: 

- “I wish to flee to some place where I don’t have to worry about anything, and I can have all the fun in the world!”  Suddenly, he disappeared and found himself walking on the beautiful beaches of Rio de Janeiro.

Finally, it was the project manager’s turn.  With no need for extra thinking, he just said:

- “I wish to have those two guys back at work by 2:00 PM!!!!” :)

Printing in Java

2. February 2007 11:20 by Mrojas in General  //  Tags:   //   Comments (0)

Sample Code to Print in Java

import java.io.ByteArrayInputStream;
import javax.print.Doc;
import javax.print.DocFlavor;
import javax.print.DocPrintJob;
import javax.print.PrintService;
import javax.print.PrintServiceLookup;
import javax.print.SimpleDoc;
import javax.print.attribute.HashPrintRequestAttributeSet;
import javax.print.attribute.PrintRequestAttributeSet;

public class Class3 {

    static String textToPrint = "Richard North Patterson's masterful portrayals of law and politics at the apex of power have made him one of our most important\n" +
        "writers of popular fiction. Combining a compelling narrative, exhaustive research, and a sophisticated grasp of contemporary\n" +
        "society, his bestselling novels bring explosive social problems to vivid life through characters who are richly imagined and\n" +
        "intensely real. Now in Balance of Power Patterson confronts one of America's most inflammatory issues-the terrible toll of gun\n" +
        "violence.\n\n" +
        "President Kerry Kilcannon and his fiancée, television journalist Lara Costello, have at last decided to marry. But their wedding\n" +
        "is followed by a massacre of innocents in a lethal burst of gunfire, challenging their marriage and his presidency in ways so shattering\n" +
        "and indelibly personal that Kilcannon vows to eradicate gun violence and crush the most powerful lobby in Washington-the Sons of\n" +
        "the Second Amendment (SSA).\n\n" +
        "Allied with the President's most determined rival, the resourceful and relentless Senate Majority Leader Frank Fasano, the SSA\n" +
        "declares all-out war on Kerry Kilcannon, deploying its arsenal of money, intimidation, and secret dealings to eviscerate Kilcannon's\n" +
        "crusade and, it hopes, destroy his presidency. This ignites a high-stakes game of politics and legal maneuvering in the Senate,\n" +
        "the courtroom, and across the country, which the charismatic but untested young President is determined to win at any cost. But in\n" +
        "the incendiary clash over gun violence and gun rights, the cost to both Kilcannons may be even higher than he imagined.\n\n" +
        "And others in the crossfire may also pay the price: the idealistic lawyer who has taken on the gun industry; the embattled CEO\n" +
        "of America's leading gun maker; the war-hero senator caught between conflicting ambitions; the female senator whose career is at\n" +
        "risk; and the grief-stricken young woman fighting to emerge from the shadow of her sister, the First Lady.\n\n" +
        "The insidious ways money corrodes democracy and corrupts elected officials . . . the visceral debate between gun-rights and\n" +
        "gun-control advocates . . . the bitter legal conflict between gun companies and the victims of gun violence . . . a\n" +
        "ratings-driven media that both manipulates and is manipulated - Richard North Patterson weaves these engrossing themes into an\n" +
        "epic novel that moves us with its force, passion, and authority.";

    public static void main(String[] args) {
        // Plain text sent through an input stream; AUTOSENSE lets the print service detect the format.
        DocFlavor flavor = DocFlavor.INPUT_STREAM.AUTOSENSE;
        PrintRequestAttributeSet aset = new HashPrintRequestAttributeSet();

        /* locate the print services that can handle this flavor */
        PrintService[] pservices = PrintServiceLookup.lookupPrintServices(flavor, aset);
        if (pservices.length == 0) {
            System.err.println("No suitable print service found.");
            return;
        }

        /* create a print job for the chosen service (here simply the first one found) */
        DocPrintJob pj = pservices[0].createPrintJob();

        try {
            /* Create a Doc object to hold the print data.
             * The data is the plain text above, wrapped in an input stream;
             * SimpleDoc reads the stream when the job is printed. */
            ByteArrayInputStream fis = new ByteArrayInputStream(textToPrint.getBytes());
            Doc doc = new SimpleDoc(fis, flavor, null);

            /* print the doc as specified */
            pj.print(doc, aset);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

 

TIP: If you are just testing, create a dummy printer: go to Add Printers, select FILE: as the port, and choose “Generic” as the manufacturer and “Generic / Text Only” as the model.

NTBackup and Acronis TrueImage Server VSS troubles

1. February 2007 14:40 by Jaguilar in General  //  Tags:   //   Comments (0)

Today I decided to test the Volume Shadow Copy Service (VSS) support in Virtual Server 2005 R2. In theory, as I mentioned in an earlier post, with VSS, Virtual Server can create a consistent “snapshot” of a running virtual machine so that other applications, such as backup clients, can use that snapshot without interrupting the virtual machine itself.

The only VSS-aware backup application I had installed was Windows’ very own NTBackup. So I enabled VSS on the volumes, ran NTBackup, and proceeded to make a backup of my virtual machine. Everything started out OK, until NTBackup got stuck with the message “Waiting to retry shadow copy…”. Following my standard troubleshooting checklist, I checked the Event Viewer, and I found this message logged:

Volume Shadow Copy Service error: Error calling a routine on the Shadow Copy Provider {f5dbcc43-b847-494e-8083-f030501da611}. Routine details BeginPrepareSnapshot({f5dbcc43-b847-494e-8083-f030501da611},\\?\Volume{0cb1b616-8ea6-11db-88de-806e6f6e6963}\) [hr = 0x80070002].

We use Acronis’ imaging solution for deploying our servers, and it turns out that Acronis’ VSS provider has an issue with Microsoft’s VSS provider. Apparently the issue is well known and is documented in two forum posts. It is solved in the latest version of Acronis’ products, but I didn’t really have time to perform an upgrade (and Acronis’ products are notoriously stubborn when you try to uninstall them). So I applied the solution suggested in one of the forum posts and unregistered Acronis’ VSS provider using the command:

regsvr32 /u \windows\system32\snapapivss.dll

After that, the backup went through without problems.


Opening the log once the backup completed showed that all files from the virtual machine were backed up successfully.


This was all done without turning the virtual machine off, taking advantage of the VSS functionality in Virtual Server 2005 R2 SP1 Beta. I performed the same operation on a Windows XP box with NTBackup’s VSS support disabled, and the backup predictably failed.

Here’s some information on VSS: Volume Shadow Copy Service (VSS)

Semantics from Structural Parts of Web Pages: some figures and patterns

1. February 2007 09:55 by CarlosLoria in General  //  Tags:   //   Comments (0)

We continue our regular series of posts on refactoring Web pages based on semantic approaches; we invite the interested new reader to take a look at the previous contributions to get a general picture of our intentions.

In this brief post, we want to present and describe some simple but interesting empirical data related to the structural (syntactic) content of a sample of pages we have been analyzing over the last few days. The results are part of a white paper we are currently preparing; it will be available at this site shortly.

We may remember from our first post that we want to recover semantics from structure, given particular clues and patterns we usually come across when analyzing pages. The approach is simpler to describe than to put into practice: once semantics has somehow been detected, refactoring steps can be applied at certain places in the page and, by doing so, some expected benefits can be gained.

However, syntactic structure is the result of encoding specific semantics and intentions on a web page using HTML elements and functionality; the HTML language is (expressively speaking) rather limited (too much emphasis on presentation issues, for instance), and some common programming “bad practices” increase the complexity of recovering semantics mainly from syntactic content. And HTML being quite declarative, that complexity can make the discovery problem quite challenging in a pragmatic context. That is our more general goal; however, we do not want to go that far in this post. We just want to keep this perspective in mind and give the reader some insight and data to think about. We will elaborate more on recovery in forthcoming posts.

As is usual in the NLP field, it is interesting to use the so-called Noisy-Channel model as a point of reference and analogy. We may think of the initial semantics as the input message to the channel (the programmer); the web page is the output message. The programmer uses syntactic rules to encode the semantics during coding, adding more or less noise along the way. Different encodings normally exist; noise tends to be greater when too much structure is used to express some piece of the message.

A typical example of noisy encoding is the use of tables for handling style, presentation or layout purposes, beyond the hypothetically primary intention of that kind of element: simply to be an arrangement of data. More complex software maintenance and sometimes lower performance can be consequences of too much noise, among other things.

Let us take a look at some data concerning questions like: how much noise is in a page? What kind of noise? What kinds of regular encodings can be found?

As a warning, we do not claim anything about statistical significance, because our sample is clearly too small and was based on biased selection criteria. Our results are, in general, very preliminary. However, we feel they are sound and believable, and in some way consistent with the noisy-channel model.

Our “corpus” consists of 834 pages which were crawled starting, for convenience, at a given root page in Costa Rica, namely http://www.casapres.go.cr/. The size was bounded by a predetermined maximum of 1,000 nodes to visit; we never followed more than 50 of the links on any page, and we preferred visiting homepages, to avoid traps.
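For concreteness, here is a minimal sketch of such a bounded breadth-first crawl, assuming jsoup for both fetching and parsing; the homepage-preference heuristic and politeness concerns (robots.txt, delays) are left out:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CrawlSketch {
    public static void main(String[] args) {
        Deque<String> frontier = new ArrayDeque<String>();
        Set<String> visited = new HashSet<String>();
        frontier.add("http://www.casapres.go.cr/");

        // Visit at most 1000 nodes, following at most 50 links from each page.
        while (!frontier.isEmpty() && visited.size() < 1000) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;          // already seen
            try {
                Document doc = Jsoup.connect(url).get();
                int followed = 0;
                for (Element a : doc.select("a[href]")) {
                    if (followed++ >= 50) break;      // never take more than 50 paths per page
                    String next = a.absUrl("href");
                    if (!next.isEmpty() && !visited.contains(next)) frontier.add(next);
                }
                // ... hand 'doc' to the tag/attribute/table statistics code here ...
            } catch (Exception e) {
                // skip pages that fail to download or parse
            }
        }
    }
}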

Let us look at a descriptive profile of the data. Due to current limitations of the publishing tool, we are not presenting the charts that complement the raw numbers.

Just 108 kinds of tags were detected, and we have 523,016 instances of them in the corpus. That means, very roughly, 6 kinds of tags per page and 627 instances per page. We feel this suggests the same tags are being used to say probably very different things (we remark that many pages are homepages by choice).
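As a rough illustration of how these counts can be gathered for a single parsed page, here is a minimal sketch, again assuming the jsoup parser (the crawler itself and the counting of pure text nodes are omitted):

import java.util.HashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TagFrequencySketch {
    public static void main(String[] args) {
        String html = "<body><div><p>Hello <a href='#'>world</a></p><img src='x.gif'></div></body>";
        Document doc = Jsoup.parse(html);

        // Tally how many times each tag name occurs on the page.
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (Element el : doc.getAllElements()) {
            String tag = el.tagName();
            if (tag.startsWith("#")) continue;   // skip jsoup's synthetic root node
            Integer c = counts.get(tag);
            counts.put(tag, c == null ? 1 : c + 1);
        }
        System.out.println(counts);   // per-tag counts (html, head, body, div, p, a, img appear once each here)
    }
}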

The top 10 tags are: pure text, a, td, tr, br, div, li, img, p and font (by absolute frequency). Together, text, a (anchor) and img correspond to more than 60% of all instances. Hence roughly 60% of page content is some form of data.

We notice that table accounts for 1% and td for 8.5% of all instances, against 42% for text and 15% for anchors. On average we have 7 tables per page and 54 tds per page, about 6 tds per table, roughly speaking.

Likewise, we saw just 198 kinds of attributes and 545,585 instances of attributes. Among the most popular are href, shape, colspan, rowspan, class, width, clear and height, which is relatively consistent with the observed tag frequencies (e.g. href for anchors, colspan and rowspan for td).

We pay special attention to tables in the following lines. Our corpus has 5,501 tables. It is worth mentioning that 65% of them are children of a td; in other words, they are nested inside another table. That is a high proportion of nesting, which suggests complexity in table design. We see that 77% of the data (text, a, img) in the sample is dominated by tds (most of the data is table-dominated). In the case of anchors, 33% of them are td-dominated, which may suggest tables being used as navigation bars or similar semantic devices in an apparently very interesting proportion.

We decided to explore semantic patterns on tables a little more precisely. For instance, we chose tables of nx1 dimension (n rows, 1 column), which are good candidates for navigation bars. A simple analysis shows that 618 tables (11%) have such a shape. The content of the shape may differ, which is quite interesting. For instance, we see a 5x1 table where every td contains an anchor. We denote that by a sequence of 1s and 0s, where 1 means the corresponding td contains an anchor (a link to some URL): in this case ‘1.1.1.1.1’ is the sequence. But another table of the same 5x1 size presents the pattern ‘1.0.1.0.1’. This same pattern occurs several times, for instance in a 50x1 table. Another case is ‘0.0.0.0.1.1.1.1.1.0’, maybe suggesting that some links are not available. We mention that 212 patterns are 1x1, which would be a kind of navigation button. We will present a more elaborate analysis of these table patterns in the following post.
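A minimal sketch of how that 1/0 anchor pattern might be computed for an nx1 table, under the same jsoup assumption and ignoring nested tables and colspan cells:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AnchorPatternSketch {

    // For an n x 1 table, emit one digit per row: 1 if the cell contains an anchor, 0 otherwise.
    static String anchorPattern(Element table) {
        StringBuilder sb = new StringBuilder();
        for (Element tr : table.select("tr")) {
            Element td = tr.select("td").first();
            boolean hasAnchor = td != null && !td.select("a[href]").isEmpty();
            if (sb.length() > 0) sb.append('.');
            sb.append(hasAnchor ? '1' : '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<table>"
                + "<tr><td><a href='a.html'>A</a></td></tr>"
                + "<tr><td>plain text</td></tr>"
                + "<tr><td><a href='b.html'>B</a></td></tr>"
                + "</table>";
        Document doc = Jsoup.parse(html);
        System.out.println(anchorPattern(doc.select("table").first())); // prints: 1.0.1
    }
}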

To finish, we notice that 875 tables (16%) are not regular: some rows have different sizes. Some of them are very unusual, like this 28x8 table, where each number in the following sequence denotes the number of tds in the corresponding row: 4.4.4.6.8.8.7.2.8.4.4.6.6.6.6.5.4.5.5.5.5.5.5.5.5.5.5.1.

Noisy, isn’t it?

Key Stakeholders: End Users

31. January 2007 15:31 by jpena in General  //  Tags:   //   Comments (0)

End users are sometimes ignored when planning a migration project.  Traditional software development methodologies often lack an appropriate level of involvement from the end user, and this can limit end-user satisfaction with the final product.  Before you begin a migration, it is important that you understand the needs of the users of the original application: after all, they are the ones who will use the migrated application in their everyday activities.  Be sure to gather the following information about the end users’ perception of the original application:

  • Features that the users dislike: sometimes the users consider that certain features of the original application are not suited to their needs, or should be improved.  If this is the case, you will be migrating something that the users don’t like, so you can expect the same disapproval when you finish the migration.  Because of this, it’s a good idea to make the necessary improvements after you reach Functional Equivalence on the target platform.  In certain cases, rewriting those particular features or modules can be a good option too.
  • Features that the users depend on: in several applications you will find that there are features the users can’t live without, where even the slightest change in functionality could cause a problem.  For example, in a data-entry form designed for fast-typing users entering lots of information, something as simple as changing the TabOrder of the form controls could be disastrous.

Of course this list is not exhaustive, so be sure to involve the end users from the beginning of the project and gather enough information from them.  Whenever possible, make their needs part of the requirements for the migration or the post-migration phases.

Finishing the second sprint (which did not come after the first)

31. January 2007 11:47 by acyment in General  //  Tags:   //   Comments (0)
This Friday our second sprint ends, the one that started somewhat abruptly two Mondays ago. What happened? It was the Tuesday of the first week of what was originally the second sprint when a timid email from QM showed up in our Outlook, saying that a new release of PR had just come out. What is PR? A product that at first looked like a fierce competitor; then we changed course with a good differentiator... and suddenly they come out with a version that looks like a copy of our Product Backlog! Emergency, shouting, crying, and the crisis that is also an opportunity. Wednesday, a rushed meeting with ZF and AC. We shuffled alternative after alternative and decided to meet again on Thursday, which also turned into pure debate. Past noon we decided to press the red button: abnormal sprint termination. We used Friday to prepare some User Stories together, and the following Monday, off we went. Let's see what the 2/2 Review brings...

Evaluation version of Windows 2003

31. January 2007 10:17 by Jaguilar in General  //  Tags: ,   //   Comments (0)

You are probably all aware that you can download MSDN pre-configured virtual machine images, and of the configurations you can get with the VHD Test Drive program. There is another option, though, if you want to evaluate a Windows Server 2003 R2 installation by itself, on a virtual machine or as a host for Virtual Server 2005 R2: you can get a 180-day evaluation of Windows Server 2003 R2 at the trial software page over at Microsoft. This makes it easier to evaluate the performance of the server product, for virtualization or for any other task you may be considering it for.

Link: Windows Server 2003 R2: How to Get Trial Software

VMWare Releases FREE P2V Utility

30. January 2007 09:51 by Csaborio in General  //  Tags:   //   Comments (0)

There are many, many alternatives out there that will assist you in migrating a physical machine to a virtual one; heck, even NT Backup can be used to accomplish this.  The supported procedure for doing so is to use ADS (Automated Deployment Services) to create an image of the source machine and then dump it onto a virtual machine.  I am currently testing this procedure and, trust me, it is not a straightforward one.

Given the choice, I would recommend almost any other approach when carrying out a P2V migration.  VMware recently released their migration utility, which allows you to move physical machines to virtual ones.  It even goes the extra mile and imports virtual machines from other solutions, such as Microsoft's Virtual Server.

This is perfect for users of VMware, but what if you want to carry out a P2V migration to the Virtual Server format?  Well, you can still do it by using VMware's tool and then using this utility to convert from the VMware format to the Virtual Server format.  Not the cleanest solution, but I guess this is a perfect example in which the ends justify the means ;)


The project, the team, and those logistical details

30. January 2007 05:47 by acyment in General  //  Tags:   //   Comments (0)
As I said, I won't tell much, even if I tell a lot. Let's call the project we are embarked on CC. The "we" is, like reality itself, complex to describe: yours truly plays the ScrumMaster role, ZF is the Product Owner, and the Team is made up of, for now and only for now, AD, MC and MR. We are working on getting at least 5 more people. There is also AC, who participates on the Product Owner side, although he cannot devote much time to the project. And LC, who will join a separate team that will take on research tasks. And QM, who knows a lot about the domain. And SL, who handles marketing. But as far as I am concerned, all of these last few are just stakeholders. Important, crucial, but still chickens.
The goal is to have a beta "as soon as possible". The idea is to ship version 1.0 around the middle of the year. As I think I mentioned before, what we have on our hands is a mass-market product. What you would call shrink-wrapped software, even if the image evokes canned goods or peaches in syrup. The workplace for now is the Sabana Norte offices, although shortly we will be moving to the east side, near Curridabat. For now we have something pretty close to a "team room" (belligerently also known as a "war room"), of which I hope to post some photos soon. To track product backlog items and tasks we are basically using post-its and slips of paper. We are trying iterations of 2 weeks. The Daily Meeting happens every day at 1:30 PM and is taking about 8 minutes or so. The Sprint Planning Meeting is timeboxed at 2 hours (1 hour with the Product Owner and 1 hour for the team alone), the Sprint Review Meeting at 1 hour, and the Retrospective also at 1 hour. For late arrivals to the Daily Meeting we have a little box we use as a piggy bank, which we christened "el chanchito". I ordered a really cute stopwatch for the meetings.
(sigh)... I think with that information you can start to get oriented...
