titus.producer.cart.TreeNode

class titus.producer.cart.TreeNode(dataset, predictand, maxSubsetSize=None)

Bases: object

Represents a tree node and applies the CART algorithm to build decision and regression trees.

The constructors are __init__ and fromWholeDataset.

Tree-building is initiated by calling splitUntil(condition), where condition(node, depth) is a user-supplied function that takes a node (titus.producer.cart.TreeNode) and depth (integer) and returns bool (True: continue splitting; False: stop splitting).
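A minimal sketch of such a condition function, not taken from the Titus sources: the factory below builds a condition that stops at a depth limit or when a node holds too few training rows. The names makeCondition, maxDepth, minSize, and the stand-in class FakeNode are hypothetical; datasetSize is the node attribute listed under splitUntil, and it is an assumption that it is already set when the condition is called.

```python
class FakeNode(object):
    """Hypothetical stand-in for a TreeNode; it only carries datasetSize."""
    def __init__(self, datasetSize):
        self.datasetSize = datasetSize


def makeCondition(maxDepth, minSize):
    """Build a condition(node, depth) that returns True to keep splitting."""
    def condition(node, depth):
        # Continue only while the tree is shallow enough and the node is large enough.
        return depth < maxDepth and node.datasetSize >= minSize
    return condition


condition = makeCondition(maxDepth=5, minSize=20)
keepGoing = condition(FakeNode(100), 1)  # large node near the root
tooDeep = condition(FakeNode(100), 5)    # depth limit reached
tooSmall = condition(FakeNode(10), 1)    # too few rows to be worth splitting

# With a real tree, the same function would be passed as tree.splitUntil(condition).
```

Wrapping the thresholds in a factory keeps the condition itself at the two-argument shape that splitUntil expects.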

__init__(dataset, predictand, maxSubsetSize=None)

Constructor for a tree from a dataset of regressors (that which we split) and a predictand (that which we try to purify in the leaves).

Parameters:
  • dataset (titus.producer.cart.Dataset) – dataset of regressors only
  • predictand (1-d Numpy array) – predictands in a separate array with the same number of rows as the dataset
  • maxSubsetSize (positive integer or None) – maximum size of subset splits of categorical regressors (approximation for optimization in categoricalEntropyGainTerm and categoricalNVarianceGainTerm)

canSplit()

Returns True if it is possible to split the predictand; False otherwise.

categoricalEntropyGainTerm(field, maxSubsetSize=None)

Split a categorical predictor in such a way that maximizes entropic gain inside and outside of a subset of predictor values.

Parameters:
  • field (titus.producer.cart.Dataset.Field) – the field to consider when calculating the entropy gain term
  • maxSubsetSize (positive integer or None) – maximum size of subset splits of categorical regressors (approximation for optimization in categoricalEntropyGainTerm and categoricalNVarianceGainTerm)
Return type:

(number, list of strings)

Returns:

(best gain term, best combination of regressor categories)
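The subset search described above can be sketched as an exhaustive scan over combinations of categories, capped by maxSubsetSize. This is an illustration under stated assumptions, not the Titus implementation: the function names entropy and bestCategoricalSplit are hypothetical, and the gain is assumed to be the parent entropy minus the size-weighted entropy inside and outside the subset, which may differ from the gain term Titus computes.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = float(len(labels))
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log(c / n, 2) for c in Counter(labels).values())


def bestCategoricalSplit(categories, labels, maxSubsetSize=None):
    """Return (best gain, best subset of categories) by exhaustive search."""
    n = float(len(labels))
    unique = sorted(set(categories))
    # A subset holding every category leaves nothing outside, so stop one short.
    limit = len(unique) - 1
    if maxSubsetSize is not None:
        limit = min(limit, maxSubsetSize)
    parent = entropy(labels)
    best = (None, None)
    for size in range(1, limit + 1):
        for subset in combinations(unique, size):
            inside = [y for x, y in zip(categories, labels) if x in subset]
            outside = [y for x, y in zip(categories, labels) if x not in subset]
            gain = (parent
                    - (len(inside) / n) * entropy(inside)
                    - (len(outside) / n) * entropy(outside))
            if best[0] is None or gain > best[0]:
                best = (gain, list(subset))
    return best


colors = ["red", "red", "green", "green", "blue", "blue"]
answers = ["yes", "yes", "yes", "yes", "no", "no"]
gain, subset = bestCategoricalSplit(colors, answers, maxSubsetSize=1)
```

The number of candidate subsets grows combinatorially with the number of categories, which is why capping the subset size is a useful approximation.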

categoricalNVarianceGainTerm(field, maxSubsetSize=None)

Split a categorical predictor in such a way that maximizes n-times-variance gain inside and outside of a subset of predictor values.

Parameters:
  • field (titus.producer.cart.Dataset.Field) – the field to consider when calculating the n-times-variance gain term
  • maxSubsetSize (positive integer or None) – maximum size of subset splits of categorical regressors (approximation for optimization in categoricalEntropyGainTerm and categoricalNVarianceGainTerm)
Return type:

(number, list of strings)

Returns:

(best gain term, best combination of regressor categories)

classmethod fromWholeDataset(wholeDataset, predictandName, maxSubsetSize=None)

Constructor for a tree from a dataset that includes the predictand (that which we try to purify in the leaves) as one of its fields.

Parameters:
  • wholeDataset (titus.producer.cart.Dataset) – dataset including the predictand
  • predictandName (string) – name of the predictand, to be taken out of the dataset
  • maxSubsetSize (positive integer or None) – maximum size of subset splits of categorical regressors (approximation for optimization in categoricalEntropyGainTerm and categoricalNVarianceGainTerm)
Return type:

titus.producer.cart.TreeNode

Returns:

an unsplit tree

numericalEntropyGainTerm(field)

Split a numerical predictor in such a way that maximizes entropic gain above and below the threshold of the split.

numericalNVarianceGainTerm(field)

Split a numerical predictor in such a way that maximizes n-times-variance gain above and below the threshold of the split.

Parameters:
  • field (titus.producer.cart.Dataset.Field) – the field to consider when calculating the n-times-variance gain term
Return type:

(number, number)

Returns:

(best gain term, best cut value)
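As a rough illustration of a threshold search on n-times-variance, the sketch below tries a cut between each pair of adjacent distinct regressor values and keeps the one with the largest gain. The names nTimesVariance and bestNumericalCut are hypothetical, and both the midpoint cut convention and the gain definition (parent n-times-variance minus the sum over the two sides) are assumptions rather than a description of what Titus does internally.

```python
def nTimesVariance(values):
    """Sum of squared deviations from the mean, i.e. n times the variance."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / float(n)
    return sum((v - mean) ** 2 for v in values)


def bestNumericalCut(regressor, predictand):
    """Return (best gain, best cut) over midpoints of adjacent distinct values."""
    parent = nTimesVariance(predictand)
    distinct = sorted(set(regressor))
    best = (None, None)
    for low, high in zip(distinct[:-1], distinct[1:]):
        cut = (low + high) / 2.0
        below = [y for x, y in zip(regressor, predictand) if x < cut]
        above = [y for x, y in zip(regressor, predictand) if x >= cut]
        gain = parent - nTimesVariance(below) - nTimesVariance(above)
        if best[0] is None or gain > best[0]:
            best = (gain, cut)
    return best


gain, cut = bestNumericalCut([1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
                             [5.0, 5.0, 5.0, 9.0, 9.0, 9.0])
```

In this toy data the predictand is constant on each side of the gap between 3.0 and 10.0, so a cut there removes all of the parent's n-times-variance.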

pfaDocument(inputType, treeTypeName, dataType=None, preprocess=None, nodeScores=False, datasetSize=False, predictandDistribution=False, predictandUnique=False, entropy=False, nTimesVariance=False, gain=False)

Create a PFA document to score with this tree.

Parameters:
  • inputType (Pythonized JSON) – Avro record schema of the input data
  • treeTypeName (string) – name of the tree node record (usually TreeNode)
  • dataType (Pythonized JSON) – Avro record schema of the data that goes to the tree, possibly preprocessed
  • preprocess (PrettyPFA substitution or None) – pre-processing expression
  • nodeScores (bool) – if True, include a field for intermediate node scores
  • datasetSize (bool) – if True, include a field for the size of the training dataset at each node
  • predictandDistribution (bool) – if True, include a field for the distribution of training predictand values (only for classification trees)
  • predictandUnique (bool) – if True, include a field for unique predictand values at each node
  • entropy (bool) – if True, include an entropy term at each node (only for classification trees)
  • nTimesVariance (bool) – if True, include an n-times-variance term at each node (only for regression trees)
  • gain (bool) – if True, include a gain term at each node
Return type:

Pythonized JSON

Returns:

complete PFA document for running tree classification or regression
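For orientation, here is a sketch of what an inputType argument might look like, reading "Pythonized JSON" as plain Python dicts, lists, and strings that map directly onto JSON. The record name Datum and its three fields are hypothetical; only the overall shape (an Avro record schema) comes from the parameter description above.

```python
import json

# Hypothetical input record: two numerical regressors and one categorical one.
inputType = {
    "type": "record",
    "name": "Datum",
    "fields": [
        {"name": "age", "type": "double"},
        {"name": "income", "type": "double"},
        {"name": "region", "type": "string"},
    ],
}

# A structure of this kind survives a round trip through JSON text unchanged.
asText = json.dumps(inputType)
roundTrip = json.loads(asText)
fieldNames = [f["name"] for f in roundTrip["fields"]]

# A built tree might then be exported with a call such as
# tree.pfaDocument(inputType, "TreeNode"), following the signature above.
```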

pfaScoreType()

Create an Avro schema representing the score type.

Return type:

Pythonized JSON

Returns:

score type (part of the pass and fail unions of the PFA TreeNode)

pfaType(dataType, treeTypeName, nodeScores=False, datasetSize=False, predictandDistribution=False, predictandUnique=False, entropy=False, nTimesVariance=False, gain=False)

Create a PFA type schema representing this tree.

Parameters:
  • dataType (Pythonized JSON) – Avro record schema of the input data
  • treeTypeName (string) – name of the tree node record (usually TreeNode)
  • nodeScores (bool) – if True, include a field for intermediate node scores
  • datasetSize (bool) – if True, include a field for the size of the training dataset at each node
  • predictandDistribution (bool) – if True, include a field for the distribution of training predictand values (only for classification trees)
  • predictandUnique (bool) – if True, include a field for unique predictand values at each node
  • entropy (bool) – if True, include an entropy term at each node (only for classification trees)
  • nTimesVariance (bool) – if True, include an n-times-variance term at each node (only for regression trees)
  • gain (bool) – if True, include a gain term at each node
Return type:

Pythonized JSON

Returns:

Avro schema for the tree node type

pfaValue(dataType, treeTypeName, nodeScores=False, datasetSize=False, predictandDistribution=False, predictandUnique=False, entropy=False, nTimesVariance=False, gain=False, valueType=None)

Create a PFA data structure representing this tree.

Parameters:
  • dataType (Pythonized JSON) – Avro record schema of the input data
  • treeTypeName (string) – name of the tree node record (usually TreeNode)
  • nodeScores (bool) – if True, include a field for intermediate node scores
  • datasetSize (bool) – if True, include a field for the size of the training dataset at each node
  • predictandDistribution (bool) – if True, include a field for the distribution of training predictand values (only for classification trees)
  • predictandUnique (bool) – if True, include a field for unique predictand values at each node
  • entropy (bool) – if True, include an entropy term at each node (only for classification trees)
  • nTimesVariance (bool) – if True, include an n-times-variance term at each node (only for regression trees)
  • gain (bool) – if True, include a gain term at each node
  • valueType (Pythonized JSON or None) – if None, call self.pfaValueType(dataType) to generate a value type; otherwise, take the given value
Return type:

Pythonized JSON

Returns:

PFA data structure for the tree, to be inserted into the cell or pool’s init field

pfaValueType(dataType)

Create an Avro schema representing the comparison value type.

Parameters:
  • dataType (Pythonized JSON) – Avro record schema of the input data
Return type:

Pythonized JSON

Returns:

value type (value field of the PFA TreeNode)

score()

Returns the best score at this TreeNode, which might or might not be a leaf.

splitComplete()

Convenience function for building up a tree until each leaf has only one unique value. Calls splitUntil.

splitField()

Return the name of the input field at this split or None if this is a leaf node.

splitMaxDepth(maxDepth)

Convenience function for building up a tree until each leaf has only one unique value or the depth reaches maxDepth. Calls splitUntil.

Parameters:
  • maxDepth (positive integer) – maximum allowed depth of the tree

splitOnce()

Compute an optimized split in one field, adding two new TreeNodes below this one.

If the predictand is numerical (numbers.Real), the split minimizes n-times-variance; if categorical (basestring), it minimizes entropy.

splitUntil(condition, depth=1)

Performs a recursive tree-split, calling the user-supplied condition(node, depth) at each new node.

If the predictand is numerical (numbers.Real), the node has attributes: datasetSize, predictandUnique, nTimesVariance, and gain.

If the predictand is categorical (basestring), the node has attributes: datasetSize, predictandDistribution, entropy, and gain.

Splits are performed in-place, changing this TreeNode.

Parameters:
  • condition (callable) – splitting condition function; takes node (titus.producer.cart.TreeNode) and depth (integer) and returns bool (True: continue splitting; False: stop splitting)
  • depth (positive integer) – current depth

walkLeaves()

Return a generator that walks over all leaves in the tree, yielding a 2-tuple of node and depth.

Return type:

generator of (titus.producer.cart.TreeNode, int)

Returns:

generator of (node, depth)

walkNodes(topDown=True, depth=0)

Return a generator that walks over all nodes in the tree, yielding a 2-tuple of node and depth.

Parameters:
  • topDown (bool) – if True, do a depth-first walk from root to leaf; if False, do a depth-first walk from leaf to root
  • depth (int) – starting value for reporting depth
Return type:

generator of (titus.producer.cart.TreeNode, int)

Returns:

generator of (node, depth)
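The two walk orders can be illustrated on a toy binary tree. This is a sketch only: ToyNode and its passBranch and failBranch attributes are hypothetical stand-ins, and the module-level walkNodes and walkLeaves functions below mimic the documented behavior (yielding (node, depth) pairs) rather than reproducing the Titus methods.

```python
class ToyNode(object):
    """Hypothetical binary node; passBranch and failBranch are None for a leaf."""
    def __init__(self, name, passBranch=None, failBranch=None):
        self.name = name
        self.passBranch = passBranch
        self.failBranch = failBranch


def walkNodes(node, topDown=True, depth=0):
    """Yield (node, depth) depth-first: parents first if topDown, else children first."""
    if topDown:
        yield node, depth
    for child in (node.passBranch, node.failBranch):
        if child is not None:
            for pair in walkNodes(child, topDown, depth + 1):
                yield pair
    if not topDown:
        yield node, depth


def walkLeaves(node):
    """Yield (node, depth) for leaf nodes only."""
    for n, d in walkNodes(node):
        if n.passBranch is None and n.failBranch is None:
            yield n, d


tree = ToyNode("root", ToyNode("a", ToyNode("a1"), ToyNode("a2")), ToyNode("b"))
topDownOrder = [(n.name, d) for n, d in walkNodes(tree)]
bottomUpOrder = [(n.name, d) for n, d in walkNodes(tree, topDown=False)]
leaves = [(n.name, d) for n, d in walkLeaves(tree)]
```

A bottom-up walk visits every child before its parent, which is convenient for operations such as pruning that need the children's results first.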