Associate Professor of Information and Computer Science
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
fruits = pd.read_table('./fruit_data_with_colors.txt')
fruits.head()
fruits.shape
lookup_fruit_name = dict(zip(fruits.fruit_label.unique(), fruits.fruit_name.unique()))
lookup_fruit_name
X = fruits[['mass', 'width', 'height','color_score']]
y = fruits['fruit_label']
# In this example, we're using a 75%-25% split of training versus test data.
# This is a good rule of thumb when deciding what proportion of the data
# to use for training versus testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
We'll denote the data that we have using different flavors of the variable X (capital X), which is typically a two-dimensional array or data frame. The notation we'll use for labels will typically be based on lowercase y, which is usually a one-dimensional array or a scalar.
Note the use of the random_state parameter in the train_test_split() function. This parameter provides a seed value to the function's internal random number generator. Choosing different seed values results in different randomized splits into training and test sets.
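A quick way to see the effect of the seed is to split the same data several times. The sketch below uses a small synthetic array as a stand-in for the fruit data (the array and the variable names are assumptions made for illustration) and compares the resulting test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fruit data: 20 rows, 2 features (illustrative only).
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20) % 2

# Same seed twice, then a different seed.
_, X_test_a, _, _ = train_test_split(X_demo, y_demo, random_state=0)
_, X_test_b, _, _ = train_test_split(X_demo, y_demo, random_state=0)
_, X_test_c, _, _ = train_test_split(X_demo, y_demo, random_state=1)

same_seed_matches = np.array_equal(X_test_a, X_test_b)
different_seed_matches = np.array_equal(X_test_a, X_test_c)
test_fraction = len(X_test_a) / len(X_demo)

print(same_seed_matches)       # the two seed-0 splits are identical
print(different_seed_matches)  # a different seed will usually give a different split
print(test_fraction)           # fraction of rows held out with the default settings
```

Fixing the seed is what makes a notebook like this one reproducible from run to run.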
X_train.shape
X_test.shape
y_train.shape
y_test.shape
X_train
y_train
X_test
y_test
In general, if you're thinking about applying machine learning to a data set, it's a really good first step to actually look at the data set first, maybe using some simple visualization methods, or even simply scrolling through to look at the various values in the different rows before proceeding.
Now that we have a training set selected, let's create some simple visualizations to look at how the features of the objects in the training set, in our case different fruits, relate to each other and to the labels. These visualizations give us at least two major benefits:
First, we can get an idea of the range of values that each feature takes on, and we can immediately see any unusual outliers that are very different from other points, and that might indicate noise or a missing feature or other problem with the data set.
Second, we may get a better idea of how likely it is that a machine learning algorithm could do well at predicting the different classes, by seeing how well clustered and well separated the different types of objects are in feature space. Feature space refers to the representation of an object using specific features, which are stored in certain columns of the data that we have.
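Alongside plots, a numeric summary gives the per-feature ranges directly. This is a minimal sketch using a tiny made-up table: the values are hypothetical, not taken from the fruit data set, and the outlier threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical measurements; the last row has a deliberately odd mass.
toy = pd.DataFrame({
    'mass':  [150, 160, 155, 900],
    'width': [7.0, 7.2, 7.1, 7.3],
})

# Per-feature minimum and maximum.
ranges = toy.agg(['min', 'max'])
print(ranges)

# Flag rows whose mass is more than 3x the median mass (arbitrary threshold).
suspicious = toy[toy['mass'] > 3 * toy['mass'].median()]
print(suspicious)
```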
# Scatter matrix of the training features, with points colored by fruit label
cmap = plt.get_cmap('gnuplot')
scatter = pd.plotting.scatter_matrix(X_train, c=y_train, marker='o', s=40, hist_kwds={'bins': 15}, figsize=(12, 12), cmap=cmap)
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection = '3d')
ax.scatter(X_train['width'], X_train['height'], X_train['color_score'], c = y_train, marker = 'o', s=100)
ax.set_xlabel('width')
ax.set_ylabel('height')
ax.set_zlabel('color_score')
plt.show()
The k-Nearest Neighbors (k-NN) algorithm can be used for classification and regression. k-NN classifiers are an example of what's called instance-based or memory-based supervised learning. This means that they work by memorizing the labeled examples they see in the training set, and then use those memorized examples to classify new objects later.
Given a training set X_train with labels y_train, and given a new instance x_test to be classified:
1. Find the most similar instances (let's call them X_NN) to x_test that are in X_train.
2. Get the labels y_NN for the instances in X_NN.
3. Predict the label for x_test by combining the labels y_NN, e.g. by simple majority vote.
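The three steps can be sketched in a few lines of plain Python. This is an illustrative toy, not how scikit-learn implements k-NN; the function name knn_predict and the sample points are made up for the example.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify x_test by majority vote among its k nearest training points."""
    # Step 1: find the k training instances closest to x_test (Euclidean distance).
    distances = [math.dist(x, x_test) for x in X_train]
    nearest = sorted(range(len(X_train)), key=lambda i: distances[i])[:k]
    # Step 2: collect the labels of those neighbors.
    y_nn = [y_train[i] for i in nearest]
    # Step 3: combine the labels by simple majority vote.
    return Counter(y_nn).most_common(1)[0][0]

# Made-up 2-D points with two classes (illustrative only).
X_toy = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
y_toy = ['a', 'a', 'a', 'b', 'b', 'b']

print(knn_predict(X_toy, y_toy, (1.1, 1.0)))  # query near the 'a' cluster
print(knn_predict(X_toy, y_toy, (5.1, 5.0)))  # query near the 'b' cluster
```

Note that this sketch breaks ties in the vote arbitrarily (by whichever label was counted first); a real implementation needs an explicit tie-breaking rule.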
A nearest neighbor algorithm needs four things specified:
1. A distance metric, e.g. straight-line (Euclidean) distance between points (Minkowski with p = 2).
2. How many 'nearest' neighbors to look at, e.g. five.
3. An optional weighting function on the neighbor points, i.e. giving some neighbors more influence on the outcome.
4. A method for aggregating the classes of the neighbor points to produce a final prediction, e.g. the class with the most representatives among the nearest neighbors.
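In scikit-learn, the first three of these choices correspond to constructor arguments of KNeighborsClassifier. The sketch below writes out the example values from the list explicitly; they also happen to be the library's defaults. The variable name knn_explicit is made up for the example.

```python
from sklearn.neighbors import KNeighborsClassifier

knn_explicit = KNeighborsClassifier(
    metric='minkowski', p=2,   # 1. distance metric: Minkowski with p = 2 is Euclidean
    n_neighbors=5,             # 2. number of nearest neighbors
    weights='uniform',         # 3. weighting: every neighbor counts equally
)
# 4. Aggregation is not a separate parameter: the classifier predicts the class
#    with the largest (weighted) vote among the neighbors.
params = knn_explicit.get_params()
print(params['metric'], params['p'], params['n_neighbors'], params['weights'])
```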
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
fruits = pd.read_table('./fruit_data_with_colors.txt')
fruits.head()
lookup_fruit_name = dict(zip(fruits.fruit_label.unique(), fruits.fruit_name.unique()))
lookup_fruit_name
X = fruits[['mass', 'width', 'height']]
y = fruits['fruit_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)
The k-NN classifier that we are using in this case is an example of a more general class called an estimator in scikit-learn.
All estimators have a fit method that takes the training data and then changes the state of the classifier or estimator object to enable prediction once the training is finished. In other words, it updates the state of the knn variable here, which in the case of k-NN means that it will memorize the training set examples in some kind of internal storage for future use.
knn.fit(X_train, y_train)
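To make this change of state concrete, the sketch below checks for the learned classes_ attribute before and after fitting. The tiny data set and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny made-up training set: one feature, two classes.
X_small = np.array([[0.0], [1.0], [2.0], [3.0]])
y_small = np.array([0, 0, 1, 1])

clf = KNeighborsClassifier(n_neighbors=1)
fitted_before = hasattr(clf, 'classes_')   # learned attributes do not exist yet

returned = clf.fit(X_small, y_small)       # fit stores what it needs and returns the estimator
fitted_after = hasattr(clf, 'classes_')

print(fitted_before, fitted_after)
print(returned is clf)
print(clf.classes_)
```

Because fit returns the estimator itself, calls can be chained, as in KNeighborsClassifier(n_neighbors=1).fit(X_small, y_small).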
The first thing we can do with this newly trained classifier is to see how accurate it's likely to be on some new, previously unseen instances. To do this, we can apply the classifier to all the instances in the test set.
One simple way to assess whether the classifier is likely to be good at predicting the labels of future, previously unseen data instances is to compute the classifier's accuracy on the test set items.
Remember that the k-NN classifier did not see any of the fruits in the test set during the training phase.
knn.score(X_test, y_test)
This will take the test set points as input and compute the accuracy. The accuracy is defined as the fraction of test set items whose true label was correctly predicted by the classifier.
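That definition of accuracy can be written out directly. This is a minimal sketch with hypothetical labels; the helper name accuracy and the label lists are made up for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical true and predicted labels for five test items.
y_true_demo = [1, 3, 3, 4, 1]
y_pred_demo = [1, 3, 4, 4, 1]
print(accuracy(y_true_demo, y_pred_demo))  # 4 of 5 correct -> 0.8
```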
fruit_prediction = knn.predict([[20, 4.3, 5.5]])
lookup_fruit_name[fruit_prediction[0]]
fruit_prediction = knn.predict([[100, 6.3, 8.5]])
lookup_fruit_name[fruit_prediction[0]]
Let's use a utility function called plot_fruit_knn that's included in the adspy_shared_utilities module.
from adspy_shared_utilities import plot_fruit_knn
plot_fruit_knn(X_train, y_train, 5, 'uniform')
The 'uniform' argument that we pass as the last parameter is the weighting method to be used. In this case, uniform means to treat all neighbors equally when combining their labels. If you like, you can try changing this to 'distance', or you can pass your own function.
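The same three kinds of weighting can be passed to KNeighborsClassifier through its weights argument. The sketch below uses made-up one-dimensional data chosen so that the weighting changes the answer; inverse_square is a hypothetical custom function, not part of any library.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 1-D data: one class-0 point near the query, two class-1 points farther away.
X_w = np.array([[0.0], [2.0], [2.1]])
y_w = np.array([0, 1, 1])
query = [[0.1]]

def inverse_square(distances):
    # Hypothetical custom weighting: receives an array of distances and
    # returns an array of weights of the same shape.
    return 1.0 / (distances ** 2 + 1e-9)

predictions = {}
for name, weights in [('uniform', 'uniform'), ('distance', 'distance'), ('custom', inverse_square)]:
    clf = KNeighborsClassifier(n_neighbors=3, weights=weights).fit(X_w, y_w)
    predictions[name] = int(clf.predict(query)[0])

print(predictions)
```

With uniform weights the two distant class-1 points outvote the single nearby class-0 point; with distance-based weights the nearby point dominates.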
We can see that when k has a small value like 1, the classifier is good at learning the classes for individual points in the training set, but with a decision boundary that's fragmented with considerable variation. This is because when k = 1, the prediction is sensitive to noise, outliers, mislabeled data, and other sources of variation in individual data points.
For larger values of k, the areas assigned to different classes are smoother, less fragmented, and more robust to noise in the individual points, but possibly with more mistakes on individual points. This is an example of what's known as the bias-variance tradeoff.
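One way to see this numerically is to score the classifier on the very points it was trained on. The sketch below uses synthetic, overlapping two-class data (an assumption for illustration, not the fruit data) and compares a small and a larger value of k.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, overlapping two-class data (illustrative only).
rng = np.random.RandomState(0)
X_syn = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                   rng.normal(1.5, 1.0, size=(50, 2))])
y_syn = np.array([0] * 50 + [1] * 50)

train_acc = {}
for k in (1, 15):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_syn, y_syn)
    train_acc[k] = clf.score(X_syn, y_syn)   # accuracy on the training points themselves

print(train_acc)
```

With k = 1 each training point is its own nearest neighbor, so training accuracy is perfect as long as no two identical points carry different labels. That says little about accuracy on new data, which is why the test set is used for evaluation.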
k_range = range(1,20)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))
plt.figure()
plt.xlabel('k')
plt.ylabel('accuracy')
plt.scatter(k_range, scores)
plt.xticks([0,5,10,15,20])
Keep in mind that these results are only for this particular train/test split. To get a more reliable estimate of likely future accuracy for a particular value of k, we would want to look at results over multiple possible train/test splits.
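Averaging over several splits could look like the sketch below. It uses scikit-learn's bundled iris data purely as a readily available stand-in for the fruit data, and the helper name mean_accuracy and the choice of 20 splits are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Iris is used here only as a stand-in data set.
X_iris, y_iris = load_iris(return_X_y=True)

def mean_accuracy(k, n_splits=20):
    """Mean and standard deviation of k-NN test accuracy over several random splits."""
    scores = []
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, random_state=seed)
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))
    return np.mean(scores), np.std(scores)

for k in (1, 5, 15):
    mean, std = mean_accuracy(k)
    print(k, round(mean, 3), round(std, 3))
```

The standard deviation gives a rough sense of how much a single split's accuracy can move around by chance.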
In general, the best choice of the value of k, that is, the one that leads to the highest accuracy, can vary greatly depending on the data set. With k-nearest neighbors, using a larger k suppresses the effects of noisy individual labels but results in classification boundaries that are less detailed.