Data Science For Cats : PART 3
Understanding The Relations
With the help of hooman, you’ve fixed your dataset, and the two of you are planning to jump into some real action. You look at the data and find lots of rows and columns. How are you going to find any meaning in these numbers? Hooman understands that you are confused and starts showing you what to do.
Hooman says he wants to find out if there is any relationship among different types of information. Relationship? Among information? How? Hooman gives you an example: when he tries to work on his laptop, you tend to sit on his keyboard. You do not do that at other times. Or, you meow when you are hungry. Here, hooman’s attempt to work on the laptop encourages you to sit on the keyboard, and your increased hunger makes you meow. A mutual connection like this between two events is what hooman calls CORRELATION. These examples are positive correlations, because in both cases one thing increases along with the other: your sitting on the keyboard with hooman’s attempts to work, and your meowing with your hunger. Hooman says that correlation can be negative too, like you play more when you are less hungry. In this case, one increases when the other decreases.
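If hooman wanted to put numbers on these examples, a minimal sketch could look like the one below. The values for hunger, meows and play are made up purely for illustration (they are not from any real cat), and `Series.corr()` is the pandas call that returns a single correlation value for two columns of numbers.

```python
import pandas as pd

# Made-up numbers, only for illustration: five moments in a cat's day
hunger = pd.Series([1, 2, 3, 4, 5])   # how hungry you are
meows = pd.Series([2, 3, 5, 6, 9])    # meows go up as hunger goes up
play = pd.Series([9, 7, 6, 3, 1])     # play goes down as hunger goes up

# Positive correlation: both rise together
positive = hunger.corr(meows)

# Negative correlation: one rises while the other falls
negative = hunger.corr(play)

print("hunger vs meows:", positive)
print("hunger vs play:", negative)
```

The first value comes out positive and the second negative, which is exactly the difference hooman was describing.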
Now you understand that you need to find some clue about what makes people buy potato chips. In doing so, the hooman shows you an example. He randomly picks some attributes about different brands of chips and how much people liked them. These attributes are basically a few columns from a file consisting of features of different brands of chips. They look like this:
You would like to know why a specific brand of chips is loved by people. You can see here that brand #0 was loved by 90% of people and brand #4 by 55% of people. There must be a reason behind it.
Hooman picks some values from the columns to show you what he means. He converts them to a dataframe using the pandas library of Python, and calls the built-in dataframe.corr() function to find the correlations:
import pandas as pd

# A few attributes for five brands of chips (brands #0 to #4)
data = {
    'potato content': [45, 37, 42, 35, 39],
    'packaging quality': [38, 31, 26, 28, 33],
    'owner can say potato in how many languages': [1, 3, 7, 1, 7],
    'spiciness': [44, 44, 43, 43, 44],
    'liked by %': [90, 56, 88, 73, 55],
}
df = pd.DataFrame(data)

# Show every row and column instead of a truncated table
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

# Correlation of every column with every other column
corrMatrix = df.corr()
print(corrMatrix)
Then he shows you the output:
Whoa, more numbers! What do they even mean? You meow at hooman and he starts explaining. He calls the output a CORRELATION MATRIX. So what is this correlation matrix? You can see that it is a table with some numbers. Each of the numbers represents how strongly one column from your dataset is related to another column. These numbers are called CORRELATION COEFFICIENTS. A coefficient is always between -1 and 1. Of course there are mathematical equations behind this calculation. You can search and have a look at them on the internet.
How do they work? In the first row, the first number represents the relation between ‘potato content’ and ‘potato content’. The correlation of something with itself is always 1. As hooman emphasized knowing the reason for people liking a brand, he now explains the last number of the first row to you. It represents the relation between the potato content of a brand of chips and people liking it. The further the number is from 0, the stronger the relationship. Here, 0.685493 is pretty high. Similarly, the last number of the second row represents the relationship between packaging quality and people liking the chips. The last numbers of the other rows represent similar relationships too. You can see that some of them are negative. This means the relationship between those attributes and people liking a brand of potato chips runs in the opposite direction: a decrease in those attributes goes together with an increase in liking for that brand. Hooman says they are ‘negatively correlated’.
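For curious cats, here is a minimal sketch of one such equation, the Pearson correlation coefficient, which is what pandas uses by default in `corr()`. It divides how much two columns vary together by how much each varies on its own. The helper name `pearson` is made up for this illustration; the numbers are the ‘potato content’ and ‘liked by %’ values from hooman’s example.

```python
import math
import pandas as pd

potato = [45, 37, 42, 35, 39]   # 'potato content' column
liked = [90, 56, 88, 73, 55]    # 'liked by %' column

def pearson(xs, ys):
    """Pearson correlation computed by hand (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much the two columns move together around their means
    together = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each column moves on its own
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return together / math.sqrt(spread_x * spread_y)

r_manual = pearson(potato, liked)
r_pandas = pd.Series(potato).corr(pd.Series(liked))

print(round(r_manual, 6))   # the 0.685493 from the correlation matrix
print(round(r_pandas, 6))   # pandas agrees with the hand calculation
```

Because each column's own spread sits in the denominator, the result can never go below -1 or above 1.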
You now understand that a higher potato content in a brand of chips makes people like the brand more, and a lower amount of spiciness makes people love the chips… but wait, ‘owner can say potato in how many languages’?? How on earth can that make people love or hate a brand of chips? You point your paw at that number.
Hooman knows that you have become confused again. He now asks you when you eat chips the most. You think and reply that you eat them most while watching football on television. What else do you do while watching the matches? You wear the jersey of your favourite team and meow a lot. You suddenly realize that it kind of seems like you eat more potato chips when you wear a jersey. But in reality, is wearing a jersey a ‘cause’ of eating more chips? No, your chips intake doesn’t increase because you wear a jersey; the real reason for the chips consumption is watching the game. Hooman calls this ‘real reason’ CAUSATION.
So, correlation doesn’t always imply causation.
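The jersey story can be played out with pretend data. The sketch below is a toy simulation under stated assumptions: every number in it (how often you wear the jersey, how many chips you eat) is invented, and only ‘watching the game’ is allowed to change chips intake. The jersey never appears in the chips formula, yet it still ends up correlated with chips.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hidden real reason: is a football match on? (1 = yes, 0 = no)
watching = rng.integers(0, 2, size=n)

# Assumption: you wear the jersey 90% of the time during a match, 10% otherwise
jersey = (rng.random(n) < np.where(watching == 1, 0.9, 0.1)).astype(int)

# Assumption: chips depend ONLY on watching the game, plus some random noise
chips = 2 + 5 * watching + rng.normal(0, 1, size=n)

df = pd.DataFrame({'watching': watching, 'jersey': jersey, 'chips': chips})

# Looking at everything together, jersey and chips seem strongly related
overall = df['jersey'].corr(df['chips'])

# Looking separately at match days and non-match days, the relation fades
within = {w: g['jersey'].corr(g['chips']) for w, g in df.groupby('watching')}

print("overall:", overall)
print("within each group:", within)
```

Once match days and non-match days are looked at separately, the jersey tells you almost nothing about chips, which is what you would expect when the game is the real reason.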
Now that’s a problem. How can you determine which one is the real reason? Well, there is no straightforward way to find that, at least right now. You are still a young cat. You need to grow bigger to learn more complicated stuff. So what are you going to do? For now, as a rough rule of thumb and not a guarantee, you can treat attributes with a larger correlation coefficient as the more promising candidates. You can set a threshold value for the correlation coefficient and ignore the values closer to 0 for now, looking only at the size of the number and not its sign. For example, 0.027518 and -0.214263 are small if you decide to keep only coefficients whose size is greater than 0.4. Therefore, you can take ‘potato content’ and ‘spiciness’ into consideration while thinking about why someone liked or disliked a specific brand of potato chips. Here, our finding is that people like potato chips more if the potato content of the chips is higher, or we can say, there is a positive correlation between them. If the spiciness is high, people tend to dislike those chips; in other words, they are negatively correlated. You will need these relationship assumptions for all types of problems, whether classification, regression or time series analysis, to find out and predict something about the data.
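The threshold idea can be sketched in a few lines. This is only one possible way to do it: the variable names `target`, `threshold` and `strong` are made up for this illustration, and the data is the same small example hooman used earlier.

```python
import pandas as pd

data = {
    'potato content': [45, 37, 42, 35, 39],
    'packaging quality': [38, 31, 26, 28, 33],
    'owner can say potato in how many languages': [1, 3, 7, 1, 7],
    'spiciness': [44, 44, 43, 43, 44],
    'liked by %': [90, 56, 88, 73, 55],
}
df = pd.DataFrame(data)

target = 'liked by %'
threshold = 0.4   # an arbitrary cut-off chosen for this example

# Last column of the correlation matrix, without the target's 1.0 with itself
corr_with_target = df.corr()[target].drop(target)

# Keep attributes whose coefficient is far enough from 0, positive or negative
strong = corr_with_target[corr_with_target.abs() > threshold]

print(strong)
```

Using `.abs()` matters here: without it, a strongly negative attribute such as ‘spiciness’ would be thrown away along with the weak ones.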
Previous part: https://orthymarjan.medium.com/data-science-for-cats-part-2-6867fc5d9768
Next part: https://orthymarjan.medium.com/data-science-for-cats-part-4-839183e24643
Originally published at https://dev.to on October 30, 2020.