CompactTree  0.0.8
Cookbook

This page contains many examples of CompactTree usage. Rather than showing complete programs (which would require the boilerplate code of #include statements, function headers, and function return values), these examples will be short self-contained code snippets that can be incorporated into larger functions/programs (see the example directory of the CompactTree repository for complete example programs). In all of the examples below, tree is a compact_tree object.

Loading a Tree

A tree can be loaded from a Newick file as follows, where newick_filename is a C string (char*) or std::string variable containing the filename of the tree:

compact_tree tree(newick_filename);
Definition: compact_tree.h:61

The constructor has the following parameters, but only the first is required:

  1. input — The filename or C string of the tree to load
  2. is_fntrue if input is a filename (default), otherwise false if input is the Newick tree as a C string
  3. store_labelstrue to store node labels (default), otherwise false (saves memory)
  4. store_lengthstrue to store edge lengths (default), otherwise false (saves memory)
  5. reserve — How many nodes to reserve memory for up-front to avoid std::vector resizes (0 by default)

Thus, the above syntax for loading a tree from a Newick file is equivalent to the following:

compact_tree tree(newick_filename, true, true, true, 0);

A tree can also be loaded from a Newick string as follows, where newick_string is a variable containing the Newick string:

compact_tree tree(newick_string, false, true, true, 0);

Since the default value of is_fn is true, loading a Newick string requires specifying all constructor arguments.

Properties of a Tree

Number of Nodes

The number of nodes in a tree can be retrieved in constant time using the compact_tree::get_num_nodes function:

size_t num_nodes = tree.get_num_nodes();

Number of Leaves

The number of leaves in a tree can be calculated in linear time using the compact_tree::get_num_leaves function:

size_t num_leaves = tree.get_num_leaves();

After compact_tree::get_num_leaves is called the first time, the calculated number of leaves is saved internally, and future calls to compact_tree::get_num_leaves will be constant time.

Number of Internal Nodes

The number of internal nodes in a tree can be calculated in linear time using the compact_tree::get_num_internal function:

size_t num_internal = tree.get_num_internal();

The compact_tree::get_num_internal function calculates the number of internal nodes by first calling compact_tree::get_num_leaves to calculate the number of leaves, then calling compact_tree::get_num_nodes to get the total number of nodes, and finally subtracting the number of leaves from the number of nodes. Thus, the first call to compact_tree::get_num_internal or compact_tree::get_num_leaves will be linear time, and all subsequent calls to either function will be constant time.

Properties of a Node

Nodes of a tree are represented using type CT_NODE_T, which is either a 32-bit (default) or 64-bit unsigned integer (i.e., std::uint32_t or std::uint64_t). The value of a node is guaranteed to be smaller than the values of its children. By extension, ROOT_NODE (an alias for the root) is always 0. We also reserve a special alias, NULL_NODE, to represent a null node with value (CT_NODE_T)(-1).

Label

Node labels are represented as type std::string. The label of node can be retrieved using the compact_tree::get_label function:

std::string node_label = tree.get_label(node);

If the compact_tree object is not storing node labels (i.e., the constructor was called with store_labels=false), compact_tree::get_label will return the empty string ("").

The label of node can be changed to new_label using the compact_tree::set_label function:

tree.set_label(node, new_label);

If the compact_tree object was not storing node labels prior to calling compact_tree::set_label (i.e., the constructor was called with store_labels=false), node labels will be allocated upon the first call to compact_tree::set_label, and all nodes other than node will be assigned the empty string ("") as their label.

The labels of many nodes can be changed in bulk using the compact_tree::replace_labels function:

tree.replace_labels(label_map);

By default, compact_tree::replace_labels will replace all matching node labels (including internal nodes). However, in some contexts, you may not want to replace the labels of internal nodes (e.g. if they represent some type of special information). There is a second optional parameter, include_internal, that can be set to false in order to only replace leaf labels:

tree.replace_labels(label_map, false);

Length

Edge lengths are represented as type CT_LENGTH_T, which is either float (default) or double. The length of the incident edge of node (i.e., the edge between node and its parent) can be retrieved using the compact_tree::get_edge_length function:

CT_LENGTH_T node_edge_length = tree.get_edge_length(node);

If the compact_tree object is not storing edge lengths (i.e., the constructor was called with store_lengths=false), compact_tree::get_edge_length will return 0.

The length of the incident edge of node can be changed to new_length using the compact_tree::set_edge_length function:

tree.set_edge_length(node, new_length);

If the compact_tree object was not storing edge lengths prior to calling compact_tree::set_edge_length (i.e., the constructor was called with store_lengths=false), edge lengths will be allocated upon the first call to compact_tree::set_edge_length, and all edges other than the one incident to node will be assigned 0 as their length.

Parent

The parent of node can be retrieved using the compact_tree::get_parent function:

CT_NODE_T parent = tree.get_parent(node);

The root of a tree does not have a parent, so tree.get_parent(ROOT_NODE) will return NULL_NODE.

Children

The children of node can be retrieved using the compact_tree::get_children function:

const std::vector<CT_NODE_T> & node_children = tree.get_children(node);

Alternatively, the children of node can be iterated over using the compact_tree::children_iterator class via the compact_tree::children_begin and compact_tree::children_end functions:

for(auto it = tree.children_begin(node); it != tree.children_end(node); ++it) {
// visit `*it`, which is a `CT_NODE_T`
}

Currently, children will be visited in the order they appear in the original Newick string.

Adding a Child

While CompactTree does not support substantially restructuring the tree topology because of how it internally represents the tree structure, it does allow adding a child to node using the compact_tree::add_child function:

CT_NODE_T new_child_1 = tree.add_child(node);
CT_NODE_T new_child_2 = tree.add_child(node, "Niema", 123.456);

All Node Data

While you can retrieve the individual data members of a given node via the getter functions described above, you can also retrieve all of the data associated with node as an std::tuple using the square-brackets operator ([]):

std::tuple<const std::string*, CT_LENGTH_T, CT_NODE_T, const std::vector<CT_NODE_T>*> node_data = tree[node];
const std::string* node_label_ptr = std::get<0>(node_data);
const std::string node_label = *node_label_ptr;
CT_LENGTH_T node_length = std::get<1>(node_data);
CT_NODE_T node_parent = std::get<2>(node_data);
const std::vector<CT_NODE_T>* node_children_ptr = std::get<3>(node_data);
const std::vector<CT_NODE_T> node_children = *node_children_ptr;

However, we recommend only accessing the individual data members of a given node using the aforementioned getter functions, as the square-brackets operator creates an std::tuple object, which could add some time/memory overhead to your program.

Traversing a Tree

Because nodes are represented as unsigned integers, tree traversals can be performed extremely quickly.

Pre-Order

Because every node has a smaller value than its children, a preorder traversal of a tree with N nodes can be implemented trivially by simply iterating over the integers 0 through N–1:

for(CT_NODE_T node = 0; node < N; ++node) {
// visit `node`
}

A preorder traversal can also be performed using the compact_tree::preorder_iterator class via the compact_tree::preorder_begin and compact_tree::preorder_end functions:

for(auto it = tree.preorder_begin(); it != tree.preorder_end(); ++it) {
// visit `*it`, which is a `CT_NODE_T`
}

The only guarantee is that a node will be visited before its children. Currently, nodes will be visited in the order they appear in the original Newick string.

Post-Order

Because every node has a smaller value than its children, a postorder traversal of a tree with N nodes can be implemented trivially by simply iterating over the integers N–1 through 0. However, because CT_NODE_T is an unsigned type, you need to be careful with the loop bounds, as decrementing 0 will not yield a negative number (it will wrap around to NULL_NODE):

for(CT_NODE_T node = N-1; node != NULL_NODE; --node) {
// visit `node`
}

A postorder traversal can also be performed using the compact_tree::postorder_iterator class via the compact_tree::postorder_begin and compact_tree::postorder_end functions:

for(auto it = tree.postorder_begin(); it != tree.postorder_end(); ++it) {
// visit `*it`, which is a `CT_NODE_T`
}

The only guarantee is that a node will be visited before its parent. Currently, nodes will be visited in reverse of the order they appear in the original Newick string.

Level-Order

A level-order traversal can be performed using the compact_tree::levelorder_iterator class via the compact_tree::levelorder_begin and compact_tree::levelorder_end functions:

for(auto it = tree.levelorder_begin(); it != tree.levelorder_end(); ++it) {
// visit `*it`, which is a `CT_NODE_T`
}

The only guarantee is that nodes will be visited in increasing order of depth (i.e., number of edges from the root). Currently, nodes with the same depth will be visited in the order they appear in the original Newick string. Note that the copy constructor and the post-increment operator (it++) will copy the BFS queue, which is slow and uses extra memory, so avoid both when possible.

Leaves

A traversal over the leaves of a tree can be performed using the compact_tree::leaves_iterator class via the compact_tree::leaves_begin and compact_tree::leaves_end functions:

for(auto it = tree.leaves_begin(); it != tree.leaves_end(); ++it) {
// visit `*it`, which is a `CT_NODE_T`
}

Currently, leaves will be visited in the order they appear in the original Newick string.

Python Wrapper

While we strongly recommend using the native C++ compact_tree class for performance, we also provide a SWIG Python wrapper via the CompactTree package, which can be installed using pip:

pip install CompactTree

In this section of the cookbook, we will briefly summarize how to use the Python wrapper, and we will highlight key distinctions in usage with respect to the C++ compact_tree class. Please see the python_wrapper.py example script we provide as well.

Loading a Tree

A tree can be loaded from a Newick file or string as follows, which matches the 5-argument constructor of the compact_tree C++ class (see above):

from CompactTree import compact_tree
tree_from_file = compact_tree(tree_filename)
tree_from_string = compact_tree(tree_string, False) # specify `False` for the second argument (`is_fn`)

Traversing a Tree

Because SWIG does not support nested classes, we were not able to directly wrap around the iterators implemented in the compact_tree class. Instead, we have defined Python-specific generators for each type of tree traversal:

from CompactTree import compact_tree, traverse_leaves, traverse_levelorder, traverse_postorder, traverse_preorder
tree = compact_tree(tree_filename)
for node in traverse_preorder(tree):
pass # visit `node` (pre-order)
for node in traverse_postorder(tree):
pass # visit `node` (post-order)
for node in traverse_levelorder(tree):
pass # visit `node` (level-order)
for node in traverse_leaves(tree):
pass # visit `node` (which is a leaf)

Everything Else

All other functionality should perfectly match the member functions of the C++ compact_tree class. Variable types are automatically translated between Python and C++ by SWIG without any need for manual user intervention. Here are a few examples (not comprehensive; see compact_tree C++ member functions for complete list):

# load tree
from CompactTree import compact_tree
tree = compact_tree(tree_filename)
# number of nodes
num_nodes = tree.get_num_nodes()
num_leaves = tree.get_num_leaves()
num_internal = tree.get_num_internal()
# total branch length
total_bl = tree.calc_total_bl()
total_bl_leaves = tree.calc_total_bl(include_internal=False)
total_bl_internal = tree.calc_total_bl(include_leaves=False)
# average branch length
avg_bl = tree.calc_avg_bl()
avg_bl_leaves = tree.calc_avg_bl(include_internal=False)
avg_bl_internal = tree.calc_avg_bl(include_leaves=False)
# pairwise distance
distance_x_y = tree.calc_dist(x, y) # `x` and `y` are nodes
distance_matrix = tree.calc_distance_matrix()
# print Newick
print(tree.get_newick())