Monday, February 20, 2017

Azure Machine Learning: Model Evaluation and Threshold Manipulation

Today, we're going to continue looking at Sample 3: Cross Validation for Binary Classification Adult Dataset in Azure Machine Learning.  In the previous posts, we looked at the ROC, Precision/Recall and Lift tabs of the "Evaluate Model" module.  In this post, we'll be finishing up Sample 3 by looking at the "Threshold" and "Evaluation" tables of the Evaluate Model visualization.  Let's start by refreshing our memory on the data set.
Adult Census Income Binary Classification Dataset (Visualize)

Adult Census Income Binary Classification Dataset (Visualize) (Income)
This dataset contains demographic information about a group of individuals.  We see standard information such as Race, Education, Marital Status, etc.  Also, we see an "Income" variable at the end.  This variable takes two values, "<=50k" and ">50k", with the majority of the observations falling into the "<=50k" bucket.  The goal of this experiment is to predict "Income" by using the other variables.
Experiment
As a quick refresher, we are training and scoring four models using a 70/30 Training/Testing split, stratified on "Income".  Then, we are evaluating these models in pairs.  For a more comprehensive description, feel free to read the ROC post.  Let's move on to the "Evaluate Model" visualization.
Threshold and Evaluation Tables
At the bottom of all of the tabs, there are two tables.  We've taken to calling them the "Threshold" and "Evaluation" tables.  If these tables have legitimate names, please let us know in the comments.  Let's start by digging into the "Threshold" table.
Threshold Table
On this table, we can use the slider to change the "Threshold" value to see what effect it will have on our summary statistics.  Some of you might be asking "What is this threshold?"  To answer that, let's take a look at the output of the "Score Model" module that feeds into the "Evaluate Model" module.
Scores
At the end of every row in the output, we find two columns, "Scored Labels" and "Scored Probabilities".  The "Scored Probabilities" column is the algorithm's estimate of the probability that the record belongs to the TRUE category.  In our case, this would be the probability that the person has an Income of greater than $50k.  By default, "Scored Labels" predicts TRUE if the probability is greater than .5 and FALSE if it isn't.  This is where the "Threshold" comes into play.  We can use the slider on the Threshold table to see what would happen if we were to utilize a stricter threshold of .6 or a more lenient threshold of .4.  Let's look back at the Threshold table again.
Threshold Table
Looking at this table, we can see the total number of predictions in each category, as well as a list of Accuracy metrics (Accuracy, Precision, Recall and F1).  Threshold selection can be a very complicated process depending on the situation.  Let's assume that we're retailers trying to target potential customers for an advertisement in the mail.  In this case, a False Positive would mean that we sent an advertisement to someone who will not buy from us.  This wastes a few pennies because mailing the advertisement is cheap.  On the other hand, a False Negative would mean that we didn't send an advertisement to someone who would have bought from us.  This misses the opportunity for a sale, which is worth much more than a few pennies.  Therefore, we are more interested in minimizing False Negatives than False Positives.  Situations like this are why the built-in metrics may not always be the best decision criteria.  In our case, let's just assume that overall "Accuracy" is the metric we care about and look for a threshold that maximizes it.
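To make the mechanics concrete, here's a minimal sketch in Python with scikit-learn.  The labels and probabilities are made up for illustration; nothing here comes from the actual experiment or from Azure ML itself.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up data: 1 means Income ">50k" (the positive label), 0 means "<=50k".
actual = np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 0])
scored_probabilities = np.array([0.91, 0.67, 0.55, 0.48, 0.42, 0.30, 0.85, 0.12, 0.05, 0.38])

# Moving the threshold changes which records get a TRUE "Scored Label",
# which in turn changes all four summary statistics.
for threshold in (0.4, 0.5, 0.6):
    scored_labels = (scored_probabilities > threshold).astype(int)
    print(f"T={threshold:.1f}",
          f"Accuracy={accuracy_score(actual, scored_labels):.2f}",
          f"Precision={precision_score(actual, scored_labels):.2f}",
          f"Recall={recall_score(actual, scored_labels):.2f}",
          f"F1={f1_score(actual, scored_labels):.2f}")
```

In the retailer scenario above, Recall would deserve the most attention, since every False Negative is a missed sale.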

Also, there's another metric over to the side called "AUC".  This stands for "Area Under the Curve".  For those of you who read our ROC post, you may remember that our goal was to find the chart that was closer to the top left.  Mathematically, this is equivalent to finding the curve with the largest area underneath it.  Let's take a look at a couple of different thresholds.
Threshold Table (T = .41)

Threshold Table (T = .61)
The first thing we notice is that the "Positive Label", "Negative Label", and "AUC" do not change.  "Positive Label" and "Negative Label" are displayed for clarity purposes and "AUC" is used to compare different models to each other.  None of these are affected by the threshold.  Remember that earlier we decided that "Accuracy" was our metric of interest.  Using the default threshold of .5, as well as the altered thresholds of .41 and .61, we can see that there is less than a 1% difference between these values.  Therefore, we can say that there is virtually no difference in "Accuracy" between these three thresholds.  If this were a real-world experiment, we could reexamine our problem to determine whether another type of metric, such as "Precision", "Recall", "F1 Score" or even a custom metric, would be more meaningful.  Alas, that discussion is better suited for its own post.  Let's move on.

As you can imagine, this table is good for looking at individual thresholds, but it can get time-consuming quickly if you are trying to find the optimal threshold.  Fortunately, there's another way.  Let's take a look at the "Evaluation" table.
Evaluation Table
This table is a bit awkward and took us quite some time to break down.  In reality, it's two different tables.  We've recreated them in Excel to showcase what's actually happening.
Discretized Results
The "Score Bin", "Positive Examples" and "Negative Examples" columns make up a table that we call the "Discretized Results".  This shows a breakdown of the Actual Income Values for each bucket of Scored Probabilities.  In other words, this chart says that for records with a Scored Probability between .900 (90%) and 1.000 (100%), 50 of them had an Income value of ">50k" and 4 had an Income value of "<=50k".
Discretized Results (Augmented)
Since the algorithm is designed to identify Incomes of ">50k" with high probabilities and Incomes of "<=50k" with low probabilities, we would expect to see most of the Positive Examples at the top of the table and most of the Negative Examples at the bottom of the table.  This turns out to be true, indicating that our algorithm is "good".  Now, let's move on to the other part of the Evaluation Table.
Evaluation Table
Threshold Results
We can use the remaining columns to create a table that we're calling the "Threshold Results".  This table is confusing for one major reason: none of the values in it are based on the ranges defined in the "Score Bin" column.  In reality, each row is calculated by utilizing a single threshold equal to the value at the bottom of the range defined in the "Score Bin" column.  To clarify this, we've added a column called "Threshold" to showcase what's actually being calculated.  To verify, let's compare the Threshold table to the Threshold Results for a threshold of 0.
Threshold Table (T = 0)
We can see that the "Accuracy", "Precision", "Recall" and "F1 Score" match for a threshold of 0 using the last line of the Evaluation Table.  Initially, this was extremely confusing to us.  However, now that we know what's going on, it's not a big deal.  Perhaps they can alter this visualization in a future release to be more intuitive.  Let's move back to the Threshold Results.
Threshold Results
Let's talk about what these columns represent.  We've already talked about the "Accuracy", "Precision", "Recall" and "F1 Score" columns.  The "Negative Precision" and "Negative Recall" are similar to "Precision" and "Recall", except that they are looking for Negative values (Income = "<=50k") instead of Positive values (Income = ">50k").  Therefore, we're looking to maximize all six of these values.

The "Fraction Above Threshold" column tells us what percentage of our records have Scored Probabilities greater than the Threshold value.  Obviously, all of the records have a Scored Probability greater than 0.  However, it is interesting to note that 50% of our values have Scored Probabilities between 0 and .1.  This isn't surprising because, as we mentioned in a previous post, the Income column is an "Imbalanced Class".  This means that the values within the column are not evenly distributed.

The "Cumulative AUC" column is a bit more complicated.  Earlier, we talked about "AUC" being the area under the ROC curve.  Let's take a look at how a ROC curve is actually calculated.
ROC Curve
The ROC Curve is calculated by selecting many different threshold values, calculating the True Positive Rate and False Positive Rate at each one, then plotting all of the points as a curve.  In the above illustration, we show how you might use different threshold values to define points on the red curve.  It's important to note that our threshold lines above were not calculated; they were simply drawn to illustrate the methodology.  Looking at this, it's much easier to understand "Cumulative AUC".
Simply put, "Cumulative AUC" is the area under the curve up to a particular threshold value.
Cumulative AUC (T = .5)
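Here's one way to make that idea computable, sketched with scikit-learn on simulated scores: build the ROC points threshold by threshold, then take the area under only the portion of the curve produced by thresholds at or above the cutoff.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

rng = np.random.default_rng(1)
probabilities = rng.uniform(0, 1, 1000)
actual = (rng.uniform(0, 1, 1000) < probabilities).astype(int)

# roc_curve returns one (false positive rate, true positive rate) point per threshold.
fpr, tpr, thresholds = roc_curve(actual, probabilities)

def cumulative_auc(cutoff):
    keep = thresholds >= cutoff          # only the points generated by thresholds >= cutoff
    return auc(fpr[keep], tpr[keep])     # trapezoidal area under that partial curve

print("Cumulative AUC at T = .5:", round(cumulative_auc(0.5), 3))
print("Full AUC (T = 0):", round(cumulative_auc(0.0), 3))
```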
This opens up another interesting option.  In the previous posts (ROC and Precision/Recall and Lift), we evaluated multiple models by comparing them in pairs, then comparing the winners.  Using the Evaluation Tables, we could compare all of these values simultaneously in Excel.  For instance, we could use Cumulative AUC to compare all four models at once.
Cumulative AUC Comparison
Using this visualization, we can see that the Boosted Decision Tree is the best algorithm for any threshold greater than .3 (30%) or less than .1 (10%).  If we wanted to utilize a threshold between those values, we would be better off using the Averaged Perceptron or Logistic Regression algorithms.  Hopefully, we've sparked your imagination to explore all the capabilities that Azure ML has to offer.  It truly is a great tool that's opening the world of Data Science beyond just the hands of Data Scientists.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, February 6, 2017

Azure Machine Learning: Model Evaluation Using Precision, Recall and Lift

Today, we're going to continue looking at Sample 3: Cross Validation for Binary Classification Adult Dataset in Azure Machine Learning.  In the previous post, we looked at the ROC Tab of the "Evaluate Model" module.  In this post, we'll be looking at the Precision/Recall and Lift tabs of the "Evaluate Model" visualization.  Let's start by refreshing our memory on the data set.
Adult Census Income Binary Classification Dataset (Visualize)

Adult Census Income Binary Classification Dataset (Visualize) (Income)
This dataset contains demographic information about a group of individuals.  We see standard information such as Race, Education, Marital Status, etc.  Also, we see an "Income" variable at the end.  This variable takes two values, "<=50k" and ">50k", with the majority of the observations falling into the "<=50k" bucket.  The goal of this experiment is to predict "Income" by using the other variables.
Experiment
As a quick refresher, we are training and scoring four models using a 70/30 Training/Testing split, stratified on "Income".  Then, we are evaluating these models in pairs.  For a more comprehensive description, feel free to read the previous post.  Let's move on to the Precision/Recall tab of the "Evaluate Model" visualization.
Precision/Recall
In the top left corner of the visualization, we see a list of tabs.  We want to navigate to the Precision/Recall tab.
Precision/Recall Experiment View
Just as with the ROC tab, the Precision/Recall tab has a view of the experiment on the right side.  This will allow us to distinguish between the two models in the other charts/graphs.
Precision/Recall Curve
On the left side of the visualization, we can see the Precision/Recall Curve.  Simply put, Precision is the proportion of "True" predictions that are correct.  In our case, this would be the number of correct ">50k" predictions divided by the total number of ">50k" predictions.  Meanwhile, Recall is the proportion of actual "True" values that are correctly predicted.  In our case, this would be the number of correct ">50k" predictions divided by the number of actual ">50k" values.  Obviously, we want to maximize both of these values.  Therefore, we want to see curves that reach as far to the top right as possible.  Using this logic, we can say that the Boosted Decision Tree is the most accurate model by this metric.  A quick sketch of both calculations follows below; after that, let's switch over to the Lift tab.
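This sketch uses scikit-learn and simulated scores rather than the experiment's output; the last line shows where the curve itself comes from.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

rng = np.random.default_rng(2)
probabilities = rng.uniform(0, 1, 1000)
actual = (rng.uniform(0, 1, 1000) < probabilities).astype(int)   # 1 = ">50k"
predicted = (probabilities > 0.5).astype(int)

print("Precision:", precision_score(actual, predicted))  # correct ">50k" / predicted ">50k"
print("Recall:   ", recall_score(actual, predicted))      # correct ">50k" / actual ">50k"

# One (precision, recall) pair per candidate threshold; plotting them gives the curve.
precision, recall, thresholds = precision_recall_curve(actual, probabilities)
```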
Lift Curve

On the left side of the visualization is the Lift Curve.  This curve is designed to show what proportion of the sample you would have to run through in order to find a certain number of true positives.  Effectively, this tells you how "efficient" your model is.  A more "efficient" model can find the same number of true positives (aka successes) from a smaller sample.  In our case, this would mean that we would have to contact fewer people in order to find X number of people with "Income > 50k".  For analytic purposes, we are looking for curves that are closer to the top left corner.  We see that the more "efficient" model is the Boosted Decision Tree.  The sketch below shows one way to build this kind of cumulative view; after that, let's take a look at the curves from the other "Evaluate Model" module.
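This is an approximation of the idea with simulated scores, not Azure ML's exact Lift calculation: sort by Scored Probability and ask how much of the sample we must work through to capture a given number of true positives.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
scored = pd.DataFrame({"probability": rng.uniform(0, 1, 1000)})
scored["actual"] = (rng.uniform(0, 1, 1000) < scored["probability"]).astype(int)

# Contact people in order of predicted probability, best prospects first.
scored = scored.sort_values("probability", ascending=False).reset_index(drop=True)
scored["cumulative_positives"] = scored["actual"].cumsum()
scored["fraction_of_sample"] = (scored.index + 1) / len(scored)

# How much of the sample must we contact to find 200 people with Income ">50k"?
needed = scored.loc[scored["cumulative_positives"] >= 200, "fraction_of_sample"].iloc[0]
print(f"Top {needed:.0%} of the sample captures 200 true positives")
```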
Precision/Recall Curve 2

Lift Curve 2
Moving on to the next set of models, both of these charts show that the "better" model is the Logistic Regression model.  Finally, let's compare the Boosted Decision Tree model to the Logistic Regression model.
Precision/Recall Curve (Final)

Lift Curve (Final)
We can see on both of these charts that the Boosted Decision Tree is the "better" model.  Now that we've compared these models using four different techniques (Overall Accuracy, ROC, Precision/Recall, and Lift), we can definitively say that the Boosted Decision Tree is the best model out of the four.  Does that mean that we can't possibly do better?  Of course not!  There's always more we can do to tweak these models.  Some of the options for squeezing more power out of these models are to take a larger sample, add new variables, try different sampling techniques, and so much more.  The world of data science is nearly endless and there's always more we can do.  Stay tuned for our next post where we'll dig deeper into the Threshold and Evaluation tables.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, January 23, 2017

Azure Machine Learning: Model Evaluation using ROC (Receiver Operating Characteristic)

Today, we're going to continue looking at Sample 3: Cross Validation for Binary Classification Adult Dataset in Azure Machine Learning.  In the four previous posts, we looked at the Two-Class Averaged Perceptron, Two-Class Boosted Decision Tree, Two-Class Logistic Regression and Two-Class Support Vector Machine algorithms.

In all of these posts, we used a simple contingency table to determine the accuracy of the model.  However, accuracy is only one of a number of different ways to determine the "goodness" of a model.  Now, we need to expand our evaluation to include the Evaluate Model module.  Specifically, we'll be looking at the ROC (Receiver Operating Characteristic) tab.  Let's start by refreshing our memory on the data set.
Adult Census Income Binary Classification Dataset (Visualize)

Adult Census Income Binary Classification Dataset (Visualize) (Income)
This dataset contains demographic information about a group of individuals.  We see standard information such as Race, Education, Marital Status, etc.  Also, we see an "Income" variable at the end.  This variable takes two values, "<=50k" and ">50k", with the majority of the observations falling into the "<=50k" bucket.  The goal of this experiment is to predict "Income" by using the other variables.

Utilizing some of the techniques we learned in the previous posts, we'll start by using the "Tune Model Hyperparameters" module to select the best sets of parameters for each of the four models we're considering.
Experiment (Tune Model Hyperparameters)

Tune Model Hyperparameters
As you can see, we are doing a Random Sweep of 10 runs measuring F-scores.  One of the interesting things about the "Tune Model Hyperparameters" module is that it not only outputs the results from the Tuning, but also the Trained Model, which we can feed directly into the "Score Model" module.
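For readers who want to see the idea outside of Azure ML, here's a rough scikit-learn analogue (not the module's actual implementation): ten random parameter draws, each scored by F1, with the best trained model handed back at the end.  The model, dataset and parameter range are stand-ins.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Stand-in data with a roughly 75/25 class split, similar in spirit to "Income".
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.75], random_state=0)

sweep = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # hypothetical parameter range
    n_iter=10,        # a random sweep of 10 runs
    scoring="f1",     # measured by F-scores
    random_state=0,
)
sweep.fit(X, y)

print(sweep.best_params_, round(sweep.best_score_, 3))
best_model = sweep.best_estimator_   # a trained model, ready to score new records
```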

At this point, we have two options.  For simplicity's sake, we could simply train the models using the entire data set, then score those same records.  However, that's considered bad practice, as scoring the records we trained on rewards Overfitting and gives an overly optimistic picture of the model.  So, let's use the "Split Data" module to Train our models with 70% of the dataset and use the remaining 30% for our evaluation.
Experiment (Score Model)

Split Data
Looking back at the "Income" histogram at the beginning of the post, we can see that a large majority of the observations fall into the "<=50k" category.  This creates an issue known as an "imbalanced class".  This means that there is a possibility that one of our samples will contain a very small proportion of ">50k" while the other sample contains a very large proportion.  This could cause significant bias in our final model.  Therefore, it's safe to use a Stratified Sample.  In this case, the Stratification Key Column will be "Income".  Simply put, this will cause the algorithm to take a 70% sample from the "<=50k" category and a 70% sample from the ">50k" category.  Then, it will combine these together to make the complete 70% sample.  This guarantees that our samples have the same distribution as our complete dataset, but only as "Income" is concerned.  There may still be bias on other variables.  Alas, that's not the focus of this post.  Let's move on to the "Evaluate Model" module.
Experiment (Evaluate Model)
We chose not to show you the parameters for the "Score Model" and "Evaluate Model" modules because they are trivial for the former and non-existent for the latter.  What is important is recognizing what the inputs are for the "Evaluate Model" module.  The "Evaluate Model" module is designed to compare two sets of scored data.  This means that we need to consider how we're going to handle our four sets of scored data.  If we wanted to be extremely thorough, we could use six modules to connect every set of scored data to every other set of scored data.  This may be helpful if one model is strong in some areas and weak in others.  For our case, it's just as easy to compare them as pairs, then compare the winners from those pairs.  Let's take a look at the "ROC" tab of the "Evaluate Model" visualization.  Given the size of the results, we'll have to look at it piece-by-piece.
ROC Tab
In the top left corner of the visualization, you will see three labels for "ROC", "Precision/Recall", and "Lift".  For this post, we'll be covering the ROC tab, which you can find by clicking on the ROC button in the top left, although it should be highlighted by default.
ROC Experiment View
If you scroll down a little, you will see a view of the Experiment on the right side of the visualization.  This might not seem too handy at first.  However, you should take note of which dataset is coming in on the left and right sides.  In our case, this would be Averaged Perceptron on the left and Boosted Decision Tree on the right.
ROC Chart
On the left side of the visualization, you will find the ROC Curve.  This chart will tell you how "accurate" your model is at predicting.  We like to think of the ROC Curve as follows: "If we want a True Positive Rate of Y, we must be willing to accept a False Positive Rate of X".  Therefore, a "better" model would have a higher True Positive Rate for the same False Positive Rate.  Conversely, we could say that a "better" model would have a lower False Positive Rate for the same True Positive Rate.  In the end, True Positive predictions are a good thing and should be maximized, while False Positive predictions are a bad thing and should be minimized.  Therefore, we are looking for a curve that travels as close to the top-left as possible.  In this case, we can see that the "Scored Dataset To Compare" is the "better" model.  In order to find out which model that is, we have to look over at the ROC Experiment View on the right side of the visualization.  We can see that the "Scored Dataset" (i.e. the left input) is the Averaged Perceptron, while the "Scored Dataset To Compare" (i.e. the right input) is the Boosted Decision Tree.  Therefore, the Boosted Decision Tree is the more accurate model according to the ROC Curve.
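For anyone curious how curves like these are produced, here's a sketch with scikit-learn on simulated data, using a linear model and boosted trees as loose stand-ins for the two algorithms above (it is not the experiment itself).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=20, weights=[0.75], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0)

for name, model in [("linear model", LogisticRegression(max_iter=1000)),
                    ("boosted trees", GradientBoostingClassifier())]:
    model.fit(X_train, y_train)
    scored_probabilities = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scored_probabilities)   # points along the ROC curve
    print(name, "AUC =", round(roc_auc_score(y_test, scored_probabilities), 3))
```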

As an added note, there is a grey diagonal line that goes across this chart.  That's the "Random Guess" line.  It shows what we would get by guessing at random, like flipping a coin: the True Positive Rate equals the False Positive Rate.  If we find that our model dips below that line, then that means our model is worse than random guessing.  In that case, we should seriously reconsider a different model.  If the model is always significantly below that line, then we can simply swap our predictions (True becomes False, False becomes True) to create a good model.
Threshold and Evaluation Tables
If we scroll down to the bottom of the visualization, we can see some tables.  We're not sure what these tables are called, so we've taken to calling them the Threshold table (top table with slider) and the Evaluation table (bottom table).  These tables are interesting in their own right and will be covered in a later post.
ROC Curve 2
Looking at the ROC Curve for the other "Evaluate Model" visualization, we can see that the Logistic Regression model is slightly more accurate than the Support Vector Machine.  Now, let's create a final "Evaluate Model" module to compare the winner from the first ROC analysis (Boosted Decision Tree) to the winner from the second ROC analysis (Logistic Regression).
ROC Curve (Final)
We can see that the ROC Curve has determined that the Boosted Decision Tree is the most accurate model out of the four.  This wasn't a surprise to us because we did a very similar analysis using contingency tables in the previous four posts.  However, model evaluation is all about gathering an abundance of evidence in order to make the best decision possible.  Stay tuned for later posts where we'll go over more information in the "Evaluate Model" visualization.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com

Monday, January 2, 2017

Azure Machine Learning: Classification Using Two-Class Support Vector Machine

Today, we're going to continue looking at Sample 3: Cross Validation for Binary Classification Adult Dataset in Azure Machine Learning.  In the three previous posts, we looked at the Two-Class Averaged Perceptron, Two-Class Boosted Decision Tree and Two-Class Logistic Regression algorithms.  The final algorithm in the experiment is Two-Class Support Vector Machine.  Let's start by refreshing our memory on the data set.
Adult Census Income Binary Classification Dataset (Visualize)
Adult Census Income Binary Classification Dataset (Visualize) (Income)


This dataset contains demographic information about a group of individuals.  We see standard information such as Race, Education, Marital Status, etc.  Also, we see an "Income" variable at the end.  This variable takes two values, "<=50k" and ">50k", with the majority of the observations falling into the "<=50k" bucket.  The goal of this experiment is to predict "Income" by using the other variables.  Let's take a look at the Two-Class Support Vector Machine algorithm.
Two-Class Support Vector Machine
The Two-Class Support Vector Machine algorithm attempts to define a boundary between the two sets of points such that all of the points of one type fall on one side and all of the points of the other type fall on the other side.  More specifically, it attempts to place the boundary so that the margin, i.e. the distance between the boundary and the closest points of each class, is as large as possible.  This is a relatively simple concept to imagine in two dimensions, but gets complex as your number of factors increases and the relationship between the factors becomes more complex.  Here's a picture that tells the story pretty nicely.
Support Vector Machine
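To make the boundary idea concrete, here's a toy two-dimensional example using scikit-learn's LinearSVC as a stand-in (this is not Azure ML's implementation, and the points are made up).

```python
import numpy as np
from sklearn.svm import LinearSVC

# Two well-separated clouds of points, one per class.
rng = np.random.default_rng(4)
class_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
class_b = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2))
X = np.vstack([class_a, class_b])
y = np.array([0] * 50 + [1] * 50)

svm = LinearSVC().fit(X, y)

# The boundary is the line w . x + b = 0; each point is classified by which side it falls on.
w, b = svm.coef_[0], svm.intercept_[0]
print("boundary:", w, b)
print("predictions for (0, 0) and (3, 3):", svm.predict([[0.0, 0.0], [3.0, 3.0]]))
```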
Let's take a look at the parameters involved in this algorithm.  First, we need to define the "Number of Iterations".  Simply put, more iterations means that the algorithm is less likely to get stuck in an awkward portion of data.  Therefore, it increases the accuracy of your predictions.  Unfortunately, this also means that the algorithm will take longer to train.

The "Lambda" parameter allows us to tell Azure ML how complex we want our model to be.  The larger we make our "Lambda", the less complex our model will end up being.

The "Normalize Features" parameter will replace all of our values with "Normalized" values.  This is accomplished by taking each value, subtracting the mean of all the values in the column, then dividing the result by the standard deviation of all the values in the column.  This has the effect of making every column have a mean of 0 and a standard deviation of 1.  Since the algorithm chooses a boundary based on distance between points, it is imperative that your values be normalized.  Otherwise, you may have a single (or small subset) of factors that dominate the selection process because they have very large values, and therefore very large distances.  If we wanted to have certain factors play a larger role in the selection process for some type of technical or business reason, then we could forego this option.  However, that situation would be better handled by multiplying the normalized factors by our own custom sets of "weights" using a separate module.

The "Project to Unit Sphere" parameter allows us to normalize our set of output "Coefficients" as well.  In our testing, this didn't seem to have any impact on the predictability of the model.  However, it may be useful if we need to use the coefficients as inputs to some other type of model which would require them to be normalized.  If anyone knows of any other uses, let us know in the comments.

The "Allow Unknown Categorical Levels" parameter allows us to set whether we want to allow NULLs to be used in our model.  If we try to pass in data that has NULLs, we may get some errors.  If our data has NULLs, we should check this box.

If you want to learn more about the Two-Class Support Vector Machine algorithm, read this and this.  Let's use Tune Model Hyperparameters to find the best set of parameters for our Two-Class Support Vector Machine algorithm.  If you want to learn more about Tune Model Hyperparameters, check out our previous post.
Tune Model Hyperparameters
Tune Model Hyperparameters (Visualize)
As you can see, the best model has 25 iterations with a Lambda of .001274.  Let's plug that into our Two-Class Support Vector Machine algorithm and move on to Cross-Validation.
Cross Validate Model
Contingency Table (Two-Class Averaged Perceptron)

Contingency Table (Two-Class Boosted Decision Tree)

Contingency Table (Two-Class Logistic Regression)

Contingency Table (Two-Class Support Vector Machine)
As you can see, the Two-Class Support Vector Machine approach has about the same number of True Positives for Income = "<=50k" as the rest of the models.  However, the number of True Positives for Income = ">50k" is significantly less than that of the Two-Class Boosted Decision Tree.  Therefore, using accuracy alone, we can say that the Two-Class Boosted Decision Tree model is the best model for this data.

We've mentioned a couple of times that there are more ways to measure "goodness" of a model besides Accuracy.  In order to look at these, let's examine another module called "Evaluate Model".
Evaluate Model
There are no parameters to set for the "Evaluate Model" module.  All you do is provide it 1 or 2 scored datasets and it will provide a huge amount of information about the "goodness" of those models.  Here's a snippet of what you can find.
ROC Curve

Precision/Recall Curve

Lift Curve
The three charts shown above are the ROC Curve, Precision/Recall Curve, and Lift Curve.  We simply wanted to introduce these concepts to you in this post.  We'll spend a lot more time talking about these metrics in a later post.  Thanks for reading.  We hope you found this informative.

Brad Llewellyn
Data Scientist
Valorem Consulting
@BreakingBI
www.linkedin.com/in/bradllewellyn
llewellyn.wb@gmail.com