Author: Zizhun Guo
# location:
import pandas as pd # data analysis and manipulation package
# location:
def write_file(best_attribute, is_positive_correlated):
    """Write the trained program based on the selected best attribute.

    @best_attribute: a string naming the best attribute
    @is_positive_correlated: a boolean that determines the classification rule
    """
    # initialize a string that accumulates the code to dump into the trained program
    lines = ""
    .. implementations ..
    # create a file with given name "filepath"
    f = open('', "w")
    # write string to the filepath
# location:
import pandas as pd
|Attribute|Correlation with Target variable|
|---|---|
|Bread|-0.012|
|Vitamins|-0.448|
|Vegetable|0.016|
|Milk|-0.035|
|Banana|-0.070|
|PeanutButter|0.582|
|Chocolate|-0.030|
|Citrus|-0.057|
|Cookie|0.050|
|IceCream|0.013|
|Soda|-0.061|
|Fruit|-0.005|
PeanutButter has the largest absolute cross-correlation (0.582), which means it is the attribute most strongly cross-correlated with the target variable.
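The selection step can be sketched as follows. This is a minimal illustration, not the Mentor program's actual code: the correlation values are copied from the table above, and the helper name `pick_best_attribute` is hypothetical.

```python
# Correlation of each attribute with the target, copied from the table above.
correlations = {
    "Bread": -0.012, "Vitamins": -0.448, "Vegetable": 0.016, "Milk": -0.035,
    "Banana": -0.070, "PeanutButter": 0.582, "Chocolate": -0.030,
    "Citrus": -0.057, "Cookie": 0.050, "IceCream": 0.013, "Soda": -0.061,
    "Fruit": -0.005,
}


def pick_best_attribute(corr_by_attribute):
    """Return (name, is_positive_correlated) for the largest |correlation|.

    Hypothetical helper: the sign is kept separately because it later
    decides which classification rule is written to the trained program.
    """
    best = max(corr_by_attribute, key=lambda name: abs(corr_by_attribute[name]))
    return best, corr_by_attribute[best] > 0


best_attribute, is_positive_correlated = pick_best_attribute(correlations)
print(best_attribute, is_positive_correlated)
```

Ranking by absolute value matters here: a strongly negative attribute such as Vitamins is still a useful predictor, as long as the rule is inverted.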
The if-statement below chooses the structure of the generated code according to whether the cross-correlation is positive or negative. It sets the structure of the One Rule classifier in the trained program:
# location:
if is_positive_correlated:
    # positive correlation: predict 1 when the attribute is present
    lines += "\n\tfor val in data:"
    lines += "\n\t\tif val > 0:"
    lines += "\n\t\t\tprint('1')"
    lines += "\n\t\telse:"
    lines += "\n\t\t\tprint('0')"
else:
    # negative correlation: predict 1 when the attribute is absent
    lines += "\n\tfor val in data:"
    lines += "\n\t\tif val == 0:"
    lines += "\n\t\t\tprint('1')"
    lines += "\n\t\telse:"
    lines += "\n\t\t\tprint('0')"
Structure below (based on the given training dataset):
# location:
if attribute > 0:
The trained program prints its predictions as the homework requires; its accuracy can be computed from the frequency table created in the Mentor program:
# shell console
PeanutButter  Sickness
0             0           399
              1           108
1             0           101
              1           392
dtype: int64
$numberOfCorrect = numberOf(PeanutButter: 0, Sickness: 0) + numberOf(PeanutButter: 1, Sickness: 1) = 399 + 392 = 791$

$numberOfTotal = 1000$

$Accuracy = numberOfCorrect / numberOfTotal = 791 / 1000 = 0.791$

The accuracy is 0.791, or about 0.79.
*Reference: Another way to define Accuracy: $ACC= (TP+TN)/(TP+TN+FP+FN)$
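As a check, the PeanutButter frequency table can be mapped onto these four terms, assuming the rule predicts Sickness = PeanutButter (so a pair such as PeanutButter: 1, Sickness: 0 counts as a false positive):

$TP = 392,\quad TN = 399,\quad FP = 101,\quad FN = 108$

$ACC = (TP+TN)/(TP+TN+FP+FN) = (392+399)/(392+399+101+108) = 791/1000 = 0.791$

Both definitions give the same value, since $TP+TN$ is exactly the number of correct predictions.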
I also learned to use a scatter plot to present the correlation. By default, the image looks like this:
Image 1: The 2D Sickness based on Peanutbutter
Each dot actually hides many overlapping points. To see how dense the points are, we add jitter to spread them out a little, using random offsets generated with the NumPy package.
# location: HW_03_Guo_Zizhun_Mentor/
x = df_filtered[best_attribute] \
* (1 - scatter_fraction_rate) \
* scatter_scale \
+ np.random.ranf(size) \
* scatter_fraction_rate \
* scatter_scale
plt.scatter(x, y, alpha=0.5)  # alpha=0.5 makes each point semi-transparent
# areas with darker color have more overlapping points
After adding the Jitter:
Image 2: The 2D Sickness based on Peanutbutter with Jitter
Here jitter is noise added so that it is easy to see how the points are distributed. As can be seen, most points fall in the lower-left and upper-right corners, which means the x-variable and y-variable are strongly positively cross-correlated.
This tendency implies that the y-variable tends to take the same value as the x-variable, since in our case there are only two possible values (0 and 1).
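The jitter formula used above can be tried in isolation with a small sketch. It applies the same arithmetic as the Mentor snippet, but uses the standard library `random` module instead of NumPy so that it runs without dependencies. The function name `add_jitter` and the default values for `fraction`, `scale`, and `seed` are assumptions made for illustration.

```python
import random


def add_jitter(values, fraction=0.2, scale=1.0, seed=0):
    """Spread binary values out so overlapping points become visible.

    Same formula as the Mentor snippet: each value is shrunk toward its own
    band and a uniform random offset in [0, fraction * scale) is added.
    """
    rng = random.Random(seed)  # seeded so the result is reproducible
    return [
        v * (1 - fraction) * scale + rng.random() * fraction * scale
        for v in values
    ]


binary = [0, 1, 0, 1, 1, 0]
jittered = add_jitter(binary)
for original, moved in zip(binary, jittered):
    print(original, round(moved, 3))
```

With these assumed defaults, zeros stay in the band [0, 0.2) and ones stay in [0.8, 1.0), so the two classes never mix even though individual points no longer coincide.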
The sign of the cross-correlation defines whether the attribute value tends to move with or against the target variable. This determines the One Rule of the trained program: if the cross-correlation (CC) is less than 0, the prediction should be the opposite of the attribute value, whereas if CC is greater than 0, the prediction should be the same as the attribute value.
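The rule described above can be written as a small function. This is a sketch that returns the predictions as a list rather than printing them, and `one_rule_predict` is a hypothetical name, not a function from the homework code.

```python
def one_rule_predict(values, is_positive_correlated):
    """Predict the binary target from a single binary attribute.

    Positive correlation: predict the attribute value itself.
    Negative correlation: predict the opposite of the attribute value.
    """
    predictions = []
    for val in values:
        if is_positive_correlated:
            predictions.append(1 if val > 0 else 0)
        else:
            predictions.append(1 if val == 0 else 0)
    return predictions


print(one_rule_predict([0, 1, 1, 0], is_positive_correlated=True))
print(one_rule_predict([0, 1, 1, 0], is_positive_correlated=False))
```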
Yes. Vitamins can be this feature because it has the second-highest absolute cross-correlation, which is -0.448.
Image 3: The 2D Sickness based on Vitamin with Jitter
I can print the frequency table to calculate the Accuracy.
|Vitamins|Sickness|Count|
|---|---|---|
|0|0|305|
|0|1|487|
|1|0|195|
|1|1|13|
|Total| |1000|
Accuracy = (487 + 195) / 1000 = 0.682

*Since the correlation is less than 0, the correct predictions are the {0, 1} and {1, 0} pairs, so their counts are summed.
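Both accuracy calculations follow the same pattern, which can be sketched as one function driven by the sign of the correlation. The counts are copied from the two frequency tables in this post; the function name `one_rule_accuracy` and the dictionary layout keyed by (attribute, Sickness) are assumptions made for illustration.

```python
def one_rule_accuracy(counts, is_positive_correlated):
    """Compute One Rule accuracy from a 2x2 frequency table.

    counts maps (attribute_value, target_value) to the number of records.
    """
    total = sum(counts.values())
    if is_positive_correlated:
        # prediction equals the attribute, so matching pairs are correct
        correct = counts[(0, 0)] + counts[(1, 1)]
    else:
        # prediction is the opposite of the attribute, so mismatched pairs are correct
        correct = counts[(0, 1)] + counts[(1, 0)]
    return correct / total


peanut_butter = {(0, 0): 399, (0, 1): 108, (1, 0): 101, (1, 1): 392}
vitamins = {(0, 0): 305, (0, 1): 487, (1, 0): 195, (1, 1): 13}

print(one_rule_accuracy(peanut_butter, is_positive_correlated=True))
print(one_rule_accuracy(vitamins, is_positive_correlated=False))
```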
All rights reserved by Zizhun Guo