Contents

Abstract

TreeMatcher is a subproject of ETE3. Target of TreeMatcher is to perform search on phylogenetic trees using tree patterns that allows relax connection between tree nodes in newick format mimicking regular expressions' functionality.

Targets:

Background

TreeMatcher is not a new project. The base code was already written when GSoC started and that code was used to extend it's functionality. The TreeMatcher was able to match simple patterns, without metacharacters, and could skip nodes using an "one or more" definition (as a “+” symbol) between two intermediate nodes .

Coding periods

First Coding Period

The first coding period was about understanding the logic and the approach of the problem and defining how TreeMatcher will adopt regular expressions' functionality.
That time was about implementing functionality to allow relax connection between nodes. The we implemented the basic symbols' functionality using RegEx patterns such as "*", "+", "^", "$").
At late first period the we approached differently the implementation and instead of treating each metacharacter independently, they were categorized in two cases. Those which were skipping nodes and those which were adding an attribute. The skipping metacharacters were defined according to the number of nodes to skip, either direct connection first (zero intermediate nodes conditions), or indirect connections, skipping any intermediate number of nodes, or a given number of nodes. The other type of metacharacters were adding a constraint at the node. For more details see the description about the code before refactoring at the links section (link number 4).

Second Coding Period

In the second coding period we implemented the rest of the symbols and comparisons between some predefined sets of nodes (an example of sets of nodes can be: the children of a given node and the rest of the tree). That was done quickly thanks to the previous work done on categorizing cases introduced.
The fact that TreeMatcher can use pure Python expressions to define nodes allowed the implementation of more complex patterns to be expressed, like numerical comparisons between sets or against the whole tree.
In this second coding period the ability to describe a pattern about a node's descendants using metacharacters was introduced. However the implementation of this new functionality revealed the limitations of the existing algorithm.
As describing node's descendants with metacharacters was a need for TreeMatcher, and after a long debugging and testing period, we made the decision to refactor the code. The refactoring of the core search algorithm started.

Third coding period

During the third coding period the core search algorithm was completely refactored thanks to the
effort of a mentor ( Dr. Jaime Huerta-Cepas ), to the experience gained and to the techniques learnt.
Before this refactoring the focus was on patterns that describe a tree's topology, and it was changed to focus on patterns of nodes' descendants. This change resulted in the loss of functionality. We lost the possibility to use more than one way to describe relaxed node connections and the possibility to compare a node's attribute with the rest of the tree or with the node’s children.
The new approach deeply affected both implementation and logic to describe patterns.
In order to access to all the new functionalities we updated and extended the command line toolkit .
Finally we prepared TreeMatcher for future merge with it's parent project (ETE3) and the documentation was updated. (link 3)

Interesting parts

Conclusion

What is done

What is left to do


TreeMatcher seems to satisfy most of the targets setted. Most of the student's code was not merged. This code was used to find the limitations of the approach or the techniques and was ultimately removed to meet the needs of the refactoring. Documentation on code before the refactoring can be found under the student's github repository (see Useful links number 4 and 5).

GSoC team:

Useful links