Decision Trees in R

Decision trees (DT) are predictive models that classify an item, or a set of items, into one of several classes. Their distinctive feature is that every split compares a single variable against a threshold, so with two variables the decision boundaries are always lines parallel to the x or y axis.
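To make the axis-parallel idea concrete, here is a minimal sketch of what a small tree boils down to: a chain of threshold tests, one variable at a time. The function name, the class names and both thresholds are made up for illustration; they do not come from any trained model.

```r
# Hypothetical two-split tree on two features x and y.
# Each test corresponds to one axis-parallel line in the (x, y) plane.
classify_point <- function(x, y) {
  if (x < 14.5) {
    "A"                 # left of the vertical line x = 14.5
  } else if (y < 50) {
    "B"                 # right of it and below the horizontal line y = 50
  } else {
    "C"                 # right of it and on or above y = 50
  }
}

classify_point(10, 80)
classify_point(20, 30)
classify_point(20, 90)
```

No test ever combines x and y, which is why the resulting regions are always rectangles.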

In this post I’ll show you how to train, validate and use a DT to classify objects in R, using the “cars” data set that ships with every R installation.

The cars set records the speed of 50 cars (in mph) and the distance each one took to stop (in ft). Its first rows look as follows.

head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

To have something to classify, we’ll label every car with a speed below 15 as a “TRUE” object and every other car as a “FALSE” object. Because the set is sorted by speed, this marks roughly the first half of the rows as TRUE and the rest as FALSE.

To add the label we just need to add a new column that holds either TRUE or FALSE.

cars$class = cars$speed < 15
head(cars)
  speed dist class
1     4    2  TRUE
2     4   10  TRUE
3     7    4  TRUE
4     7   22  TRUE
5     8   16  TRUE
6     9   10  TRUE
tail(cars)
   speed dist class
45    23   54 FALSE
46    24   70 FALSE
47    24   92 FALSE
48    24   93 FALSE
49    24  120 FALSE
50    25   85 FALSE
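Before training it is worth checking how balanced the two classes are. The sketch below works on a copy named d (a name chosen here for illustration) so that it runs on its own.

```r
# Work on a copy of the built-in data set so the example is self-contained.
d <- datasets::cars
d$class <- d$speed < 15

# Count how many rows fall into each class.
table(d$class)
# 23 rows are TRUE and 27 rows are FALSE
```

The split is close to half and half, so neither class dominates the training data.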

Now that our data set is complete and labeled, we can go on to train the tree.

The first step is to load the rpart library, which lets us build a tree structure. rpart is usually bundled with R, so the install step is only needed if the package is missing.

install.packages('rpart')
library('rpart')

Then we need to select a subset to train our tree. The next instruction draws a random sample of 25 rows from our full set of 50 items.

train = cars[sample(1:50,25),]
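Since sample() draws different rows on every run, the tree and its output can change between runs. A possible variant, sketched here with an arbitrary seed and with the hypothetical names d, train_idx and test, fixes the seed and also keeps the rows that were not sampled, so they can be used for validation later.

```r
d <- datasets::cars
d$class <- d$speed < 15

set.seed(42)                       # arbitrary seed, only for reproducibility
train_idx <- sample(nrow(d), 25)   # 25 distinct row indices out of 50

train <- d[train_idx, ]            # rows used to fit the tree
test  <- d[-train_idx, ]           # remaining rows, never seen during training
```

Keeping a held-out set is a design choice rather than a requirement of rpart; it simply gives a more honest estimate of how the tree behaves on new data.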

Next we use this subset to train our tree.

our_tree = rpart(class ~ speed + dist, method="class", data=train)
our_tree
n= 25
node), split, n, loss, yval, (yprob)
      * denotes terminal node
1) root 25 12 TRUE (0.4800000 0.5200000)
  2) speed>=14.5 12 0 FALSE (1.0000000 0.0000000) *
  3) speed< 14.5 13 0 TRUE (0.0000000 1.0000000) *

As we can see, the tree learned a single threshold on speed (14.5), which is enough to separate our data into TRUE and FALSE objects.
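To get a feeling for where a threshold like 14.5 comes from, the sketch below scores every candidate split on speed with the Gini impurity, which is the default splitting criterion of rpart for classification. It is a simplified illustration on the full data set, not a reimplementation of rpart: the real algorithm also handles priors, minimum node sizes and pruning. The helper names gini and weighted_gini are made up for this example.

```r
d <- datasets::cars
d$class <- d$speed < 15

# Gini impurity of a set of logical labels: 2 * p * (1 - p) for two classes.
gini <- function(labels) {
  if (length(labels) == 0) return(0)
  p <- mean(labels)
  2 * p * (1 - p)
}

# Impurity after splitting at `threshold`, weighted by the size of each side.
weighted_gini <- function(threshold, x, labels) {
  left  <- labels[x <  threshold]
  right <- labels[x >= threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
}

# Candidate thresholds: midpoints between consecutive distinct speeds.
values     <- sort(unique(d$speed))
candidates <- (head(values, -1) + tail(values, -1)) / 2

scores <- sapply(candidates, weighted_gini, x = d$speed, labels = d$class)
best   <- candidates[which.min(scores)]
best   # 14.5, where both sides are pure and the impurity is 0
```

Because the label was defined as speed < 15 and the speeds are whole numbers, the midpoint between 14 and 15 separates the classes perfectly.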

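To validate our tree we can use the predict function, passing the two feature columns of each object. With type="class" it returns the predicted label for each row instead of a matrix of class probabilities.

pred = predict(our_tree, cars[1:2], type="class")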

Now we can inspect the pred variable: the tree has correctly classified 100% of our data with just one threshold on the speed value. We can visually check this prediction by drawing a vertical line at the threshold given by the tree, which is 14.5.

plot(cars$speed, cars$dist)
# rows 1-23 have speed < 15 (TRUE), rows 24-50 are the FALSE objects
points(cars$speed[1:23], cars$dist[1:23], col="blue")
points(cars$speed[24:50], cars$dist[24:50], col="red")
# draw the line at the tree's threshold
abline(v=14.5)
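Besides the visual check, the predictions can also be checked numerically. The following sketch is one way to do it, under some assumptions: it uses an arbitrary seed, evaluates only on rows that were not used for training (a choice made here, not something the steps above do), and uses the hypothetical names d, fit, test and accuracy. The exact numbers depend on which rows end up in the sample.

```r
library(rpart)

d <- datasets::cars
d$class <- d$speed < 15

set.seed(1)                              # arbitrary seed for reproducibility
train_idx <- sample(nrow(d), 25)
train <- d[train_idx, ]
test  <- d[-train_idx, ]                 # held-out rows

fit  <- rpart(class ~ speed + dist, method = "class", data = train)
pred <- predict(fit, test[, c("speed", "dist")], type = "class")

# Confusion matrix: predicted label against true label.
table(predicted = pred, actual = test$class)

# Share of held-out rows that were classified correctly.
accuracy <- mean(as.character(pred) == as.character(test$class))
accuracy
```

If the accuracy on the held-out rows is much lower than on the training rows, the tree has probably memorized the sample rather than learned the threshold.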

And that’s it: our DT is trained and ready to be used.
