Classification problems is when our output Y is always in categories like positive vs negative in terms of sentiment analysis, dog vs cat in terms of image classification and disease vs no disease in terms of medical diagnosis.
Decision tree works by successively splitting the dataset into small segments until the target variable are the same or until the dataset can no longer be split. It’s a greedy algorithm which make the best decision at the given time without concern for the global optimality .
The concept behind decision tree is straightforward. The following flowchart show a simple email classification system based on decision tree. If the address is “myEmployer.com”, it will classify it to “Email to read when bored”. Then if the email contains the word “hockey”, this email will be classified as “Email from friends”. Otherwise, it will be identified as “Spam: don’t read”. Image source .
There are various kinds of decision tree algorithms such as ID3 (Iterative Dichotomiser 3), C4.5 and CART (Classification and Regression Trees). The constructions of decision tree are similar :
- Assign all training instances to the root of the tree. Set current node to root node.
- Find the split feature and split value based on the split criterion such as information gain, information gain ratio or gini coefficient.
- Partition all data instances at the node based on the split feature and threshold value.
- Denote each partition as a child node of the current node.
- For each child node:
- If the child node is “pure” (has instances from only one class), tag it as a leaf and return.
- Else, set the child node as the current node and recurse to step 2.
ID3 creates a multiway tree. For each node, it trys to find the categorical feature that will yield the largest information gain for the target variable.
C4.5 is the successor of ID3 and remove the restriction that the feature must be categorical by dynamically define a discrete attribute that partitions the continuous attribute in the discrete set of intervals.
CART is similar to C4.5. But it differs in that it constructs binary tree and support regression problem .
The main differences are shown in the following table:
|Split Criterion||Information gain||Information gain ratio (Normalized information gain)||Gini coefficient for classification problems|
|Types of Features||Categorical feature||Categorical & numerical features||Categorical & numerical features|
|Type of Problem||Classification||Classification||Classification & regression|
|Type of Tree||Mltiway tree||Mltiway tree||Binary tree|
class TreeNode: def __init__(self, data_idx, depth, child_lst=): self.data_idx = data_idx self.depth = depth self.child = child_lst self.label = None self.split_col = None self.child_cate_order = None def set_attribute(self, split_col, child_cate_order=None): self.split_col = split_col self.child_cate_order = child_cate_order def set_label(self, label): self.label = label
class DecisionTree() def fit(self, X, y): """ X: train data, dimensition [num_sample, num_feature] y: label, dimension [num_sample, ] """ self.data = X self.labels = y num_sample, num_feature = X.shape self.feature_num = num_feature data_idx = list(range(num_sample)) # Set the root of the tree self.root = TreeNode(data_idx=data_idx, depth=0, child_lst=) queue = [self.root] while queue: node = queue.pop(0) # Check if the terminate criterion has been met if node.depth>self.max_depth or len(node.data_idx)==1: # Set the label for the leaf node self.set_label(node) else: # Split the node child_nodes = self.split_node(node) if not child_nodes: self.set_label(node) else: queue.extend(child_nodes)
The CART algorithm, when constructing the binary tree, will try searching for the feature and threshold that will yield the largest gain or the least impurity. The split criterion is a combination of the child nodes’ impurity. For the child nodes’ impurity, gini coefficient or information gain are adopted in classification. For regression problem, mean-square-error or mean-absolute-error are used. Example codes are showed below. For more details about the formulas, please refer to Mathematical formulation for decision tree in scikit-learn documentation
class CART(DecisionTree): def get_split_criterion(self, node, child_node_lst): total = len(node.data_idx) split_criterion = 0 for child_node in child_node_lst: impurity = self.get_impurity(child_node.data_idx) split_criterion += len(child_node.data_idx) / float(total) * impurity return split_criterion def get_impurity(self, data_ids): target_y = self.labels[data_ids] total = len(target_y) if self.tree_type == "regression": res = 0 mean_y = np.mean(target_y) for y in target_y: res += (y - mean_y) ** 2 / total elif self.tree_type == "classification": if self.split_criterion == "gini": res = 1 unique_y = np.unique(target_y) for y in unique_y: num = len(np.where(target_y==y)) res -= (num/float(total))**2 elif self.split_criterion == "entropy": unique, count = np.unique(target_y, return_counts=True) res = 0 for c in count: p = float(c) / total res -= p * np.log(p) return res
K-Nearest Neighbor is a supervised learning algorithm both for classification and regression. The principle is to find the predefined number of training samples closest to the new point, and predict the label from these training samples .
For example, when a new point comes, the algorithm will follow these steps:
- Calculate the Euclidean distance between the new point and all training data
- Pick the top-K closest training data
- For regression problem, take the average of the labels as the result; for classification problem, take the most common label of these labels as the result.
Below is the Numpy implementation of K-Nearest Neighbor function. Refer to code example for details.
def KNN(training_data, target, k, func): """ training_data: all training data point target: new point k: user-defined constant, number of closest training data func: functions used to get the the target label """ # Step one: calculate the Euclidean distance between the new point and all training data neighbors=  for index, data in enumerate(training_data): # distance between the target data and the current example from the data. distance = euclidean_distance(data[:-1], target) neighbors.append((distance, index)) # Step two: pick the top-K closest training data sorted_neighbors = sorted(neighbors) k_nearest = sorted_neighbors[:k] k_nearest_labels = [training_data[i] for distance, i in k_nearest] # Step three: For regression problem, take the average of the labels as the result; # for classification problem, take the most common label of these labels as the result. return k_nearest, func(k_nearest_labels)
Support Vector Machines¶
Be the first to contribute!
|||(1, 2) Machine Learning in Action by Peter Harrington|
|||Scikit-learn Documentations: Tree algorithms: ID3, C4.5, C5.0 and CART|
|||Scikit-learn Documentations: Ensemble Method|