In this video, we'll be making a couple of adjustments to the regression model that we've been running so far this week. Specifically, we'll be replacing on-base percentage and slugging percentage with the key components that make up those statistics. For a brief summary of why we're doing this, let's take a look at our markdown.
You've probably noticed that on-base percentage and slugging percentage include many of the same events. For example, singles, doubles, triples, and home runs are all included in both on-base percentage and slugging percentage, and they all positively affect both statistics. But remember, the main reason on-base percentage gained popularity was that it includes walks in its calculation. That was really one of the key skills that the Oakland Athletics believed was being undervalued during the time of Moneyball. The question now is, why don't we just go ahead and take a look to see if the valuation for walks changed over our Moneyball period? We can do the same thing for singles, doubles, triples, and home runs.
If you're familiar with regression, you probably know that having variables that are correlated with each other can cause some issues at times. On-base percentage and slugging percentage share many of the same components in their calculations, so there's a good possibility that they are highly correlated with each other. That can cause issues in our regression when we try to determine the effect of each variable independently on our dependent variable, which is salary in this case.
What we're going to do is break down on-base percentage and slugging percentage into four key metrics. The first metric is walk percent, or walk rate, which is simply defined as walks divided by plate appearances. The second is single percent, or single rate, which is defined as singles divided by plate appearances.
Remember, singles are equal to hits minus doubles, minus triples, minus home runs. Then we have extra-base hit percent, or extra-base hit rate, which is defined as doubles plus triples divided by plate appearances. The reason we're including these two together is that triples are pretty rare and pretty random in baseball, so we're just going to combine triples and doubles into one metric called extra-base hit percent. Then finally we have home run rate, or home run percent, which is defined as home runs divided by plate appearances. Let's go ahead and run that. We can see now in our Master data that we have these four metrics, these four variables, created on the right-hand side of our data.
You can think about each of these metrics intuitively now. The components are mutually exclusive. For example, a walk can only possibly impact walk rate out of these four metrics. Similarly, a home run can only possibly affect home run rate out of these four metrics. We don't have that issue with overlapping components as we did with on-base percentage and slugging percentage. Intuitively, it would seem that these four metrics are going to be less correlated with each other than on-base percentage and slugging percentage are, which will allow us to really isolate the effects of each of these components in our regression.
Let's actually go ahead and take a look at the correlations between these metrics. If we scroll down, first we're going to take a look at the correlation between on-base percentage and slugging percentage, and we're going to do that using this np.corrcoef function here. The two variables that we pass in, the ones we're taking the correlation between, are on-base percentage and slugging percentage. Let's run that. Running that, we see the correlation between these two statistics is 0.67. Recall that correlations range from negative 1 to positive 1, with negative 1 being a perfect negative correlation and positive 1 being a perfect positive correlation.
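To make the steps above concrete, here is a minimal sketch of how the four rate metrics and the np.corrcoef call might look. The small `master` table, its column names (`PA`, `H`, `2B`, `3B`, `HR`, `BB`, `OBP`, `SLG`), the new rate column names, and the helper `add_rate_metrics` are all assumptions made up for illustration; they are not the course's actual Master data or notebook code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Master data. Column names and values are hypothetical.
master = pd.DataFrame({
    "PA":  [600, 550, 480, 620],
    "H":   [160, 140, 110, 175],
    "2B":  [30, 25, 20, 35],
    "3B":  [3, 2, 1, 5],
    "HR":  [25, 10, 30, 15],
    "BB":  [60, 40, 70, 50],
    "OBP": [0.370, 0.330, 0.380, 0.365],
    "SLG": [0.480, 0.400, 0.520, 0.450],
})


def add_rate_metrics(df):
    """Return a copy of df with the four rate metrics added (hypothetical helper)."""
    out = df.copy()
    # Singles = hits minus doubles, minus triples, minus home runs.
    singles = out["H"] - out["2B"] - out["3B"] - out["HR"]
    out["BB_rate"] = out["BB"] / out["PA"]                 # walk rate
    out["1B_rate"] = singles / out["PA"]                   # single rate
    out["XBH_rate"] = (out["2B"] + out["3B"]) / out["PA"]  # doubles plus triples
    out["HR_rate"] = out["HR"] / out["PA"]                 # home run rate
    return out


master = add_rate_metrics(master)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# correlation between the two inputs.
obp_slg_corr = np.corrcoef(master["OBP"], master["SLG"])[0, 1]
print(obp_slg_corr)
```

Because each event feeds exactly one of the four rates, the rates multiplied by plate appearances add back up to hits plus walks, which is a quick sanity check on the new columns.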
Our correlation between on-base percentage and slugging percentage is about 0.67, which is pretty high considering that the range goes from negative 1 to positive 1. Since our correlation is pretty high, this can potentially cause issues in our regression in really isolating the effects that each metric has separately on our dependent variable, which is salary.
Now let's compare that correlation with the correlations for the four new metrics that we created. What we're going to do is create a correlation matrix between walk rate, single rate, extra-base hit rate, and home run rate. This is coming from the Master dataset. We have a double bracket, and then we have each of the four variables that we would like to build the correlation matrix out of, and finally we have this .corr() after. That will tell our notebook to build a correlation matrix out of these four variables. Let's take a look at what that looks like. We run that. Now we see we have a correlation matrix for walk rate, single rate, extra-base hit rate, and home run rate.
One way we can think about these new statistics that we've calculated is as somewhat analogous to some of the traditional metrics that we have already looked at. Again, on-base percentage was valued because it took walks into account in its formula. In that regard, walk rate can be viewed as somewhat of a proxy for on-base percentage. Similarly, slugging percentage is generally thought to be a measure of a player's ability to hit for power, so home run rate can be viewed as a proxy for slugging percentage with our new calculations. If you recall, batting average was defined as hits divided by at bats. Since singles are by far the most common type of hit, single rate can be viewed as somewhat analogous to batting average. Let's take a look at what these actual correlations are.
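The double-bracket selection followed by .corr() described above might look something like the sketch below. The `master` table here is a tiny made-up stand-in, and the column names (`BB_rate`, `1B_rate`, `XBH_rate`, `HR_rate`) are assumptions for illustration, so the numbers it prints will not match the ones from the course's data.

```python
import pandas as pd

# Toy stand-in for the Master data with the four rate columns already built.
# Column names and values are hypothetical.
master = pd.DataFrame({
    "BB_rate":  [0.10, 0.07, 0.15, 0.08, 0.12],
    "1B_rate":  [0.17, 0.19, 0.12, 0.20, 0.15],
    "XBH_rate": [0.055, 0.049, 0.044, 0.065, 0.050],
    "HR_rate":  [0.042, 0.018, 0.063, 0.024, 0.035],
})

rate_cols = ["BB_rate", "1B_rate", "XBH_rate", "HR_rate"]

# The inner list picks the columns (the "double bracket"), and .corr()
# builds the pairwise Pearson correlation matrix for them.
corr_matrix = master[rate_cols].corr()
print(corr_matrix.round(2))
```

The result is a square table with one row and one column per metric, with 1.0 down the diagonal, so any pair can be read off directly, for example `corr_matrix.loc["BB_rate", "HR_rate"]`.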
If we look at the correlation between walk rate and home run rate, we see a correlation of 0.30, which is much lower than our correlation between on-base percentage and slugging percentage. Additionally, the highest correlation, just in terms of magnitude, is between home run rate and single rate, which is negative 0.48. This intuitively makes sense: a player who hits a lot of home runs is probably less likely to hit a lot of singles, and a player who hits a lot of singles is less likely to hit a lot of home runs. But again, this correlation is moderate at best and lower than the correlation between on-base percentage and slugging percentage.
It appears that these four metrics do a better job of really isolating the key components of on-base percentage and slugging percentage. Using these four metrics, we can get a better sense of exactly how the valuation of these components changed over time during our Moneyball era. What we're going to do in the next video is replace on-base percentage and slugging percentage with these four new rate statistics that we have created in this video, and we'll see how our Moneyball story holds up with our new regressions.