Thread

Posted on Fri Mar 23 16:40:10 2007 by xxqtony
find matching parenthesis
Hi, I have a dnd file in a format like ((A:0.33139,B:0.29208):0.04409,(C:0.28550,D:0.28440):0.03647,(E:0.35068,F:0.38974):0.03419); I want to find which left parenthesis matches which right parenthesis. Of course, in the example I listed it is simple to figure out, but it would be very complex when I get thousands of nodes. Can the Bio-Phylo module help with this? Or which Bioperl module should I use to solve this problem? Thanks a lot! Tony
Posted on Fri Mar 23 17:57:00 2007 by rvosa in response to 4676
Re: find matching parenthesis
Hi Tony,

I'm the author of Bio::Phylo. First of all: thanks for your interest. Just now I uploaded a new version release candidate, v.016RC1. There are still some goofs in the documentation, which means that the pod_coverage test won't pass, but in terms of the code it's all ready for release, so please try it out!

Now, on to your question. Parsing a well-formed newick string is fairly hairy business. The one you provide is doable, but the official specification notes that a single-quoted taxon name like 'my species (and something tricky [with comments] between parentheses)' would be legal too, which can easily throw off naive implementations, as I've had to learn.

The newick parser (Bio::Phylo::Parsers::Newick) in Bio::Phylo handles these cases. However, what you get out of it is more abstract than what it looks like you want: it returns a tree object, composed of node objects, which you can traverse to ask questions such as "what's the parent of node X?", "what's the balance of this tree?" etc. In other words, stuff to do with the abstract notion of a tree shape, not the specifics of the string you provided. Perhaps, though, that's what you really want? Are you actually trying to identify clades, for example? Below is an example to show what sort of things you'll be able to do using Bio::Phylo:
    # import the parse function from Bio::Phylo::IO
    use Bio::Phylo::IO 'parse';

    # always use strict
    use strict;

    # your string
    my $string = '((A:0.33139,B:0.29208):0.04409,(C:0.28550,D:0.28440):0.03647,(E:0.35068,F:0.38974):0.03419);';

    # the parse function returns a Bio::Phylo::Forest object (i.e. a set of trees)
    # we want the first (and only) object in that forest - i.e. your tree
    my $tree = parse( -format => 'newick', -string => $string )->first;

    # We can now query that tree object using the methods in Bio::Phylo::Forest::Tree,
    # and those defined in the superclasses, such as the example below from Bio::Phylo::Listable.
    # The get_by_regular_expression method iterates over all nodes in the tree, and in this case
    # calls the 'get_name' method on each node, then collects all nodes that match the regular
    # expression. $tips then becomes an array reference holding terminal node 'A' and terminal
    # node 'B' (i.e. the first two tips in your string).
    my $tips = $tree->get_by_regular_expression(
        -value => 'get_name',
        -match => qr/^(?:A|B)$/,
    );

    # We now query the tree to get the most recent common ancestor for tip A and tip B
    my $mrca = $tree->get_mrca($tips);

    # $mrca now is the parent of those two tips, and just to show that it worked we'll print
    # out the branch length, which is 0.04409
    print $mrca->get_branch_length; # prints 0.04409
I hope that helps. Alternatively, if you really want to go in and fiddle with the raw newick strings yourself you should have a look at the code in Bio::Phylo::Parsers::Newick - and steal whatever looks useful :)
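For the raw-string route, matching the parentheses themselves only needs a stack. What follows is a minimal sketch in plain Perl, not code taken from Bio::Phylo. The helper name match_parens is hypothetical, and the sketch assumes that single-quoted labels (with '' as an escaped quote) and [bracketed comments], possibly nested, are the only places where parentheses should be ignored:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::Phylo.
# Returns a list of [open_offset, close_offset] pairs (0-based offsets
# into $newick), in the order in which the parentheses are closed.
sub match_parens {
    my ($newick) = @_;
    my ( @stack, @pairs );
    my $in_quote   = 0;    # true while inside a 'single-quoted' label
    my $in_comment = 0;    # nesting depth of [comments]
    my @chars = split //, $newick;

    for my $i ( 0 .. $#chars ) {
        my $c = $chars[$i];

        # Inside a quoted label everything is literal. A doubled quote ('')
        # closes and immediately reopens the label, so a toggle is enough.
        if ($in_quote) {
            $in_quote = 0 if $c eq "'";
            next;
        }

        # Inside a comment only the brackets matter (assumes nesting is allowed).
        if ($in_comment) {
            $in_comment++ if $c eq '[';
            $in_comment-- if $c eq ']';
            next;
        }

        if    ( $c eq "'" ) { $in_quote   = 1 }
        elsif ( $c eq '[' ) { $in_comment = 1 }
        elsif ( $c eq '(' ) { push @stack, $i }
        elsif ( $c eq ')' ) {
            die "unmatched ')' at offset $i\n" unless @stack;
            push @pairs, [ pop @stack, $i ];
        }
    }
    die "unmatched '(' at offset $stack[-1]\n" if @stack;
    return @pairs;
}

my $string = '((A:0.33139,B:0.29208):0.04409,(C:0.28550,D:0.28440):0.03647,(E:0.35068,F:0.38974):0.03419);';
for my $pair ( match_parens($string) ) {
    my ( $open, $close ) = @$pair;
    print "$open matches $close\n";
}
```

For the example string this pairs offsets 1 and 21, 31 and 51, 61 and 81, and finally 0 and 90. The scan is a single pass, so thousands of nodes are not a problem, but it only tells you which characters pair up; for anything to do with tree shape the parsed tree object is still the better tool.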

Best wishes,

Rutger
Posted on Wed Mar 28 18:47:25 2007 by xxqtony in response to 4679
Re: find matching parenthesis
Rutger, thanks for the detailed information!