titus.producer.cart.TreeNode¶

class titus.producer.cart.TreeNode(dataset, predictand, maxSubsetSize=None)¶

Bases: object

Represents a tree node and applies the CART algorithm to build decision and regression trees.

The constructors are __init__ and fromWholeDataset.

Tree-building is initiated by calling splitUntil(condition), where condition(node, depth) is a user-supplied function that takes a node (titus.producer.cart.TreeNode) and depth (integer) and returns bool (True: continue splitting; False: stop splitting).

__init__(dataset, predictand, maxSubsetSize=None)¶

Constructor for a tree from a dataset of regressors (that which we split) and a predictand (that which we try to purify in the leaves).

Parameters:
dataset (titus.producer.cart.Dataset) – dataset of regressors only
predictand (1-d NumPy array) – predictands in a separate array with the same number of rows as the `dataset`
maxSubsetSize (positive integer or `None`) – maximum size of subset splits of categorical regressors (approximation for optimization in `categoricalEntropyGainTerm` and `categoricalNVarianceGainTerm`)

canSplit()¶: Returns True if it is possible to split the predictand; False otherwise.

categoricalEntropyGainTerm(field, maxSubsetSize=None)¶

Split a categorical predictor in such a way that maximizes entropic gain inside and outside of a subset of predictor values.

Parameters:
field (titus.producer.cart.Dataset.Field) – the field to consider when calculating the entropy gain term
maxSubsetSize (positive integer or `None`) – maximum size of subset splits of categorical regressors (approximation for optimization in `categoricalEntropyGainTerm` and `categoricalNVarianceGainTerm`)
Return type:	(number, list of strings)
Returns:	(best gain term, best combination of regressor categories)
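
The subset search described above can be illustrated with a small standalone sketch. It does not use Titus: `entropy` and `best_categorical_split` are hypothetical helpers, and the gain is assumed to be the parent entropy minus the size-weighted entropy of the rows inside and outside the subset, which may differ from the term Titus actually computes. The `max_subset_size` argument plays the role of `maxSubsetSize`, capping how many categories a candidate subset may contain.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_categorical_split(categories, labels, max_subset_size=None):
    """Return (best gain, best subset of categories) by exhaustive search.

    Assumption: gain = parent entropy minus the size-weighted entropy of
    the rows inside and outside the candidate subset.
    """
    distinct = sorted(set(categories))
    # A subset holding every category would leave nothing outside it.
    limit = len(distinct) - 1
    if max_subset_size is not None:
        limit = min(limit, max_subset_size)

    n = len(labels)
    parent = entropy(labels)
    best_gain, best_subset = None, None
    for size in range(1, limit + 1):
        for subset in combinations(distinct, size):
            chosen = set(subset)
            inside = [y for c, y in zip(categories, labels) if c in chosen]
            outside = [y for c, y in zip(categories, labels) if c not in chosen]
            children = (len(inside) * entropy(inside)
                        + len(outside) * entropy(outside)) / n
            gain = parent - children
            if best_gain is None or gain > best_gain:
                best_gain, best_subset = gain, list(subset)
    return best_gain, best_subset


# Toy data: "red" rows are all "yes", every other row is "no".
colors = ["red", "red", "blue", "blue", "green", "green"]
labels = ["yes", "yes", "no", "no", "no", "no"]
gain, subset = best_categorical_split(colors, labels, max_subset_size=1)
```

The number of candidate subsets grows combinatorially with the number of distinct categories, which is presumably why a cap such as `maxSubsetSize` is offered as an approximation.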

categoricalNVarianceGainTerm(field, maxSubsetSize=None)¶

Split a categorical predictor in such a way that maximizes n-times-variance gain inside and outside of a subset of predictor values.

Parameters:
field (titus.producer.cart.Dataset.Field) – the field to consider when calculating the n-times-variance gain term
maxSubsetSize (positive integer or `None`) – maximum size of subset splits of categorical regressors (approximation for optimization in `categoricalEntropyGainTerm` and `categoricalNVarianceGainTerm`)
Return type:	(number, list of strings)
Returns:	(best gain term, best combination of regressor categories)

classmethod fromWholeDataset(wholeDataset, predictandName, maxSubsetSize=None)¶

Constructor for a tree from a dataset that includes the predictand (that which we try to purify in the leaves) as one of its fields.

Parameters:
wholeDataset (titus.producer.cart.Dataset) – dataset including the predictand
predictandName (string) – name of the predictand, to be taken out of the dataset
maxSubsetSize (positive integer or `None`) – maximum size of subset splits of categorical regressors (approximation for optimization in `categoricalEntropyGainTerm` and `categoricalNVarianceGainTerm`)
Return type:	titus.producer.cart.TreeNode
Returns:	an unsplit tree

numericalEntropyGainTerm(field)¶: Split a numerical predictor in such a way that maximizes entropic gain above and below the threshold of the split.

numericalNVarianceGainTerm(field)¶

Split a numerical predictor in such a way that maximizes n-times-variance gain above and below the threshold of the split.

Parameters:	field (titus.producer.cart.Dataset.Field) – the field to consider when calculating the n-times-variance gain term
Return type:	(number, number)
Returns:	(best gain term, best cut value)
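
A threshold search of this kind can be sketched in plain Python as follows. This is not the Titus implementation: `n_times_variance` and `best_numerical_split` are hypothetical names, candidate cuts are assumed to be midpoints between adjacent distinct regressor values, and the gain is assumed to be the parent's n-times-variance minus the sum of the two children's.

```python
def n_times_variance(values):
    """Row count times population variance, i.e. the sum of squared deviations."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values)


def best_numerical_split(x, y):
    """Return (best gain, best cut) over midpoints of adjacent distinct x values.

    Assumption: gain = n-times-variance of all rows minus the sum of the
    n-times-variance below (x <= cut) and above (x > cut) the threshold.
    """
    parent = n_times_variance(y)
    distinct = sorted(set(x))
    best_gain, best_cut = None, None
    for lo, hi in zip(distinct[:-1], distinct[1:]):
        cut = (lo + hi) / 2.0
        below = [yi for xi, yi in zip(x, y) if xi <= cut]
        above = [yi for xi, yi in zip(x, y) if xi > cut]
        gain = parent - n_times_variance(below) - n_times_variance(above)
        if best_gain is None or gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_gain, best_cut


# Toy data with two well-separated clusters in both x and y.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 5.5, 4.5, 20.0, 21.0, 19.0]
gain, cut = best_numerical_split(x, y)
```

With this toy data the best cut falls between the two clusters, at 6.5, because that split leaves almost no spread in either child.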

pfaDocument(inputType, treeTypeName, dataType=None, preprocess=None, nodeScores=False, datasetSize=False, predictandDistribution=False, predictandUnique=False, entropy=False, nTimesVariance=False, gain=False)¶

Create a PFA document to score with this tree.

Parameters:
inputType (Pythonized JSON) – Avro record schema of the input data
treeTypeName (string) – name of the tree node record (usually `TreeNode`)
dataType (Pythonized JSON) – Avro record schema of the data that goes to the tree, possibly preprocessed
preprocess (PrettyPFA substitution or `None`) – pre-processing expression
nodeScores (bool) – if `True`, include a field for intermediate node scores
datasetSize (bool) – if `True`, include a field for the size of the training dataset at each node
predictandDistribution (bool) – if `True`, include a field for the distribution of training predictand values (only for classification trees)
predictandUnique (bool) – if `True`, include a field for unique predictand values at each node
entropy (bool) – if `True`, include an entropy term at each node (only for classification trees)
nTimesVariance (bool) – if `True`, include an n-times-variance term at each node (only for regression trees)
gain (bool) – if `True`, include a gain term at each node
Return type:	Pythonized JSON
Returns:	complete PFA document for running tree classification or regression
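
As a rough illustration of what an `inputType` argument might look like, the sketch below writes an Avro record schema as nested Python dicts and lists, which is the usual reading of "Pythonized JSON". The record name `Datum` and the field names are made up for the example; they are not taken from Titus.

```python
import json

# Hypothetical input record: two numerical regressors and one categorical one.
input_type = {
    "type": "record",
    "name": "Datum",
    "fields": [
        {"name": "age", "type": "double"},
        {"name": "income", "type": "double"},
        {"name": "region", "type": "string"},
    ],
}

# The same structure serializes to ordinary Avro schema JSON text.
schema_json = json.dumps(input_type, sort_keys=True)
field_names = [f["name"] for f in input_type["fields"]]
```

A structure like this could then be passed as `inputType`, with `treeTypeName` set to a string such as `"TreeNode"`.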

pfaScoreType()¶

Create an Avro schema representing the score type.

Return type:	Pythonized JSON
Returns:	score type (part of the `pass` and `fail` unions of the PFA `TreeNode`)

pfaType(dataType, treeTypeName, nodeScores=False, datasetSize=False, predictandDistribution=False, predictandUnique=False, entropy=False, nTimesVariance=False, gain=False)¶

Create a PFA type schema representing this tree.

Parameters:
dataType (Pythonized JSON) – Avro record schema of the input data
treeTypeName (string) – name of the tree node record (usually `TreeNode`)
nodeScores (bool) – if `True`, include a field for intermediate node scores
datasetSize (bool) – if `True`, include a field for the size of the training dataset at each node
predictandDistribution (bool) – if `True`, include a field for the distribution of training predictand values (only for classification trees)
predictandUnique (bool) – if `True`, include a field for unique predictand values at each node
entropy (bool) – if `True`, include an entropy term at each node (only for classification trees)
nTimesVariance (bool) – if `True`, include an n-times-variance term at each node (only for regression trees)
gain (bool) – if `True`, include a gain term at each node
Return type:	Pythonized JSON
Returns:	Avro schema for the tree node type

pfaValue(dataType, treeTypeName, nodeScores=False, datasetSize=False, predictandDistribution=False, predictandUnique=False, entropy=False, nTimesVariance=False, gain=False, valueType=None)¶

Create a PFA data structure representing this tree.

Parameters:
dataType (Pythonized JSON) – Avro record schema of the input data
treeTypeName (string) – name of the tree node record (usually `TreeNode`)
nodeScores (bool) – if `True`, include a field for intermediate node scores
datasetSize (bool) – if `True`, include a field for the size of the training dataset at each node
predictandDistribution (bool) – if `True`, include a field for the distribution of training predictand values (only for classification trees)
predictandUnique (bool) – if `True`, include a field for unique predictand values at each node
entropy (bool) – if `True`, include an entropy term at each node (only for classification trees)
nTimesVariance (bool) – if `True`, include an n-times-variance term at each node (only for regression trees)
gain (bool) – if `True`, include a gain term at each node
valueType (Pythonized JSON or `None`) – if `None`, call `self.pfaValueType(dataType)` to generate a value type; otherwise, take the given value
Return type:	Pythonized JSON
Returns:	PFA data structure for the tree, to be inserted into the cell or pool’s `init` field
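
To give a feel for how a nested tree structure with `value`, `pass`, and `fail` fields is scored, here is a deliberately simplified sketch. It is not the exact output of `pfaValue`: the `field` and `operator` keys and the two supported operators are assumptions, and the branches are written as bare nested dicts or scores, whereas real Avro unions may wrap each branch in a type tag.

```python
def walk(node, datum):
    """Follow pass/fail branches until a non-dict score is reached."""
    while isinstance(node, dict):
        x = datum[node["field"]]
        if node["operator"] == "<=":
            passed = x <= node["value"]
        elif node["operator"] == "in":
            passed = x in node["value"]
        else:
            raise ValueError("unsupported operator: %r" % node["operator"])
        node = node["pass"] if passed else node["fail"]
    return node


# Hypothetical two-level classification tree.
tree = {
    "field": "age", "operator": "<=", "value": 30.0,
    "pass": "young",
    "fail": {
        "field": "region", "operator": "in", "value": ["north", "east"],
        "pass": "older-ne",
        "fail": "older-sw",
    },
}

score = walk(tree, {"age": 42.0, "region": "east"})
```

Each step tests one input field against the node's `value` and descends into `pass` or `fail`, so scoring cost is proportional to the depth of the tree.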

pfaValueType(dataType)¶

Create an Avro schema representing the comparison value type.

Parameters:	dataType (Pythonized JSON) – Avro record schema of the input data
Return type:	Pythonized JSON
Returns:	value type (`value` field of the PFA `TreeNode`)

score()¶: Returns the best score at this TreeNode, which might or might not be a leaf.

splitComplete()¶: Convenience function for building up a tree until each leaf has only one unique value. Calls splitUntil.

splitField()¶: Return the name of the input field at this split or None if this is a leaf node.

splitMaxDepth(maxDepth)¶

Convenience function for building up trees until each leaf has only one unique value or the depth reaches maxDepth. Calls splitUntil.

Parameters:	maxDepth (positive integer) – maximum allowed depth of the tree

splitOnce()¶

Compute an optimized split in one field, adding two new TreeNodes below this one.

If the predictand is numerical (numbers.Real), the split minimizes n-times-variance; if categorical (basestring), it minimizes entropy.

splitUntil(condition, depth=1)¶

Performs a recursive tree-split, calling the user-supplied condition(node, depth) at each new node.

If the predictand is numerical (numbers.Real), the node has attributes: datasetSize, predictandUnique, nTimesVariance, and gain.

If the predictand is categorical (basestring), the node has attributes: datasetSize, predictandDistribution, entropy, and gain.

Splits are performed in-place, changing this TreeNode.

Parameters:
condition (callable) – splitting condition function; takes node (titus.producer.cart.TreeNode) and depth (integer) and returns bool (`True`: continue splitting; `False`: stop splitting)
depth (positive integer) – current depth
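
A condition function of the shape described above might look like the following sketch. `make_condition` is a hypothetical helper, and the `SimpleNamespace` objects merely stand in for tree nodes so the example runs without Titus. It also assumes that `datasetSize` is already populated on a node when the condition is called.

```python
from types import SimpleNamespace


def make_condition(max_depth, min_rows):
    """Build a condition(node, depth) callable for use with splitUntil."""
    def condition(node, depth):
        # True: keep splitting; False: stop at this node.
        return depth < max_depth and node.datasetSize >= min_rows
    return condition


condition = make_condition(max_depth=5, min_rows=20)

# Stand-ins for TreeNode objects; only the attribute the condition reads.
big_node = SimpleNamespace(datasetSize=100)
small_node = SimpleNamespace(datasetSize=8)

decisions = [
    condition(big_node, 1),    # shallow and well populated
    condition(big_node, 5),    # depth limit reached
    condition(small_node, 1),  # too few rows to split further
]
```

With a real tree, the callable would be passed as `tree.splitUntil(condition)`.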

walkLeaves()¶

Return a generator that walks over all leaves in the tree, yielding a 2-tuple of node and depth.

Return type:	generator of (titus.producer.cart.TreeNode, int)
Returns:	generator of (node, depth)

walkNodes(topDown=True, depth=0)¶

Return a generator that walks over all nodes in the tree, yielding a 2-tuple of node and depth.

Parameters:
topDown (bool) – if `True`, do a depth-first walk from root to leaf; if `False`, do a depth-first walk from leaf to root
depth (int) – starting value for reporting depth
Return type:	generator of (titus.producer.cart.TreeNode, int)
Returns:	generator of (node, depth)
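
The two walk orders can be illustrated with a toy tree. `ToyNode`, its `children` list, and the snake_case method names are hypothetical and only mimic the documented behavior of yielding (node, depth) pairs; they say nothing about how `TreeNode` stores its branches.

```python
class ToyNode:
    """Minimal stand-in for a tree node with any number of children."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def walk_nodes(self, top_down=True, depth=0):
        # Top-down yields a node before its descendants (pre-order);
        # bottom-up yields it after them (post-order).
        if top_down:
            yield self, depth
        for child in self.children:
            yield from child.walk_nodes(top_down, depth + 1)
        if not top_down:
            yield self, depth

    def walk_leaves(self):
        for node, depth in self.walk_nodes():
            if not node.children:
                yield node, depth


root = ToyNode("root", [ToyNode("a", [ToyNode("a1"), ToyNode("a2")]), ToyNode("b")])

pre = [(n.name, d) for n, d in root.walk_nodes(top_down=True)]
post = [(n.name, d) for n, d in root.walk_nodes(top_down=False)]
leaves = [(n.name, d) for n, d in root.walk_leaves()]
```

A bottom-up walk is the natural order for operations such as pruning, where a node can only be processed after everything beneath it.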
