GSoC 2017 etetoolkit/treematcher community wiki
Treematcher is a side project of ete toolkit. It’s target is to allow ete toolkit use relax match search
using syntax similar to regular expressions.
Contents:
About the code before gsoc
Milestone 1: implement regular expression functionality
Milestone 1: Approach
Milestone 1: Features
Milestone 1: Examples of usage
Milestone 1: Set of all implemented features
About the code before gsoc
The treematcher base code was written before gsoc. It contained the main TreePattern class and helping
classes. Treematcher transforms a regular expression written in newick format to a Tree and runs a
search against a tree to find possible reuslts.
The main TreePattern class extends the etetoolkit/Tree class. It contains a recursive function used to
search the target tree (the “match” function), a function to compare individual nodes for “equality” (the
“is_local_match” function) and a function used as interface (the “find_match” function).
The “match” function is the main treematchers function. The “match” could use a completely
described tree (every TreePattern node has on corresponding node at target tree) and the “one or more”
intermediate nodes symbol’s functionality by considering as match every node until a match of next
level nodes is found. These features are considered as the basic functionality treematcher have (one to
one match and relax matches) and was used as helping guide to all other features implemented next.
Milestone 1
Most of the part of the first milestone was completed at the first coding period of gsoc. It’s target was
to allow treematcher describe more relaxed relationships between TreePattern tree and the target tree.
The features that was implemented are “one”, “zero or more”, “one or more”, “defined number”,
“defined numbers in a range” intermediate nodes between two nodes and “is leaf” and “is root”
shortcuts. Later the logical comparison between a node a set of nodes (of possible match) or the whole
tree ( used for maximum or minimum of a value) compared with an attribute was implemented.
Milestone 1: Approach
It was observed (not at the beginning, but soon enough) that many the most of the features, those which
are used to represent the existence of intermediate nodes, are very similar. Most of the features have a
corresponding metachearacter, other have periphrastic way be written.
To solve this problem a set of flags and values is being used to indicate what features are applied to that
(TreePattern) node. Every TreePattern node has it’s own set with default values. This set is named
controller and it has local (object) scope . Every feature turns on or changes some values to the
controller (local_match needed to test more than one constraint from now on). Some metacharacters
changes their TreePattern node name.
Milestone 1: Features
The features are grouped in three groups:
- Intermediate nodes
- Attribute shortcuts
- defined sets
Every intermediate node feateure can be described by:
- if it allows direct connection between two nodes ( zero intemediate)
- if it allows at least one intermediate node (that functionality was already implemented).
And:
- The the minimum and maximum number of intermediate nodes.
So whenever a feature allows the absence of intermediate nodes, that flag is turned on ( “zero or more”,
“ {0-4}” etc) and whenever one or more intermediate nodes can be skipped the other flag is turned on.
The minimum and the maximum number of intermediate nodes are used only to force false a match and
are defined by a high low and a high bound. These bounds are set to 0 and -1 respectively by default
(high = -1, means it’ s absence).
Attribute shortcuts are described by a flag.
The sets are divided too in groups. The sets that may have a more than one match and these that
requires a single match. In the first group belongs the “all children” and “any child” sets and at the
other the “all nodes” sets.
The first group is used to test a single node against it’s children so it is tested in the local match
comparison (node to node). The second to compare it to all the other nodes, so it uses it’s custom
parameterized find_match and one helping function.
Milestone 1: Examples of usage
Singleton orthologs To be completed…..
Paralogs (duplication) """(('@.species=="elf"')'*', ('@.species=="elf"')'*');""" (by Francua serra)
Co-orthologs To be completed…..
Intra-species paralogs To be completed…..
Inter-species paralogs To be completed…..
maximum of attribute “””@.dist > [:all_nodes:].dist’ ; “””
defined attribute relationship “”" ‘@.dist > [:any_child:].dist * 3’ ; “””
Milestone 1: Set of all implemented features
Intermediate node symbols.
Symbol Description Allow direct connection Allow intermediate presence Low high
@ Any node - - 0 -1
? Zero or one Yes Yes 0 1
{N} Defined number If N = 0 If N > 0 N N
{N-M} Defined range If N=0 If M > 0 N M
+ One or more - Yes 0 -1
* Zero or more Yes Yes 0 -1
Note that “ @ ” can be used as any node only if it’s written alone.
Attribute symbols
Symbol Description Self.is_root() Self.is_leaf()
^ Is root Yes -
$ If leaf - Yes
Defined sets
Symbol Description Single match
[:any_child:] At least one child meets requirements -
[:all_children:] All children meets requirements -
[:all_nodes:] The node that fits the best Yes
Not implemented features:
Symbol Description
! All but this node (logical not)
[node A, node B] / logical or List of possible matches