Machine learning: the art and science of giving computers the ability to learn to make decisions from data without being explicitly programmed!
The process is iterative, and the effort required at each stage varies with the project, but it generally follows the same sequence of steps.
In supervised learning, models are trained on labeled data to predict a target variable. PCA (principal component analysis) is a linear transformation that captures most of the variance in the existing dataset. Algorithms differ with respect to the nature of the new dataset they will produce.

LDA (linear discriminant analysis) assumes that the data is Gaussian, that each variable is shaped like a bell curve when plotted, and that each variable has similar variance, that values of each variable vary around the mean by the same amount on average.

Extensions to LDA:
Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or covariance when there are multiple input variables).
Flexible Discriminant Analysis (FDA): Uses non-linear combinations of inputs, such as splines.
Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of the variance (actually covariance), moderating the influence of different variables on LDA.
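A minimal sketch of the PCA and LDA transformations with scikit-learn; the iris data here is just a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised linear transformation that keeps the directions
# capturing the most variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of variance per component

# LDA: supervised linear projection that maximizes class separation
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
```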
A hypothesis test checks whether two population parameters are equal (the null hypothesis) or not equal (the alternative hypothesis). Critical values are tied to the significance level α, or alpha (a constant), so their values become fixed when you choose the test's α.

Critical values on the standard normal distribution for α = 0.05:
Results of a one-tailed Z-test are significant if the test statistic is equal to or greater than 1.64, the critical value in this case. The rejection region covers 5% (α) of the area under the curve.

Results of a two-tailed Z-test are significant if the absolute value of the test statistic is equal to or greater than 1.96, the critical value in this case. The two rejection regions together cover 5% (α) of the area under the curve.

A Z-test requires the data to be normally distributed. A z-score is calculated with population parameters such as the population mean and population standard deviation, and is used to test the hypothesis that the sample drawn belongs to the same population. A t-test also assumes a normal distribution of the sample; it is used when the population parameters (mean and standard deviation) are not known.
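These critical values, and a t-test for the unknown-parameter case, can be reproduced with scipy; the sample below is made up:

```python
from scipy import stats

alpha = 0.05
# Critical values on the standard normal distribution
print(stats.norm.ppf(1 - alpha))      # one-tailed:  ~1.64
print(stats.norm.ppf(1 - alpha / 2))  # two-tailed:  ~1.96

# One-sample t-test: population mean/std unknown, estimated from the sample
sample = [5.1, 4.9, 5.4, 5.0, 5.3, 4.8]
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(t_stat, p_value)
```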
Univariate feature selection examines each feature individually to determine the strength of the relationship between the feature and the response variable. Scikit-learn exposes feature selection routines like SelectKBest, SelectPercentile, or GenericUnivariateSelect as objects that implement a transform method, based on the scores of ANOVA or chi2, mutual information, or fpr/fdr/fwe (false positive rate, false discovery rate, family-wise error rate).
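For example, a minimal sketch of SelectKBest with the ANOVA F-score; the dataset is chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-score against the labels
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)   # per-feature F-scores
print(X_selected.shape)   # (150, 2)
```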
The chi-square test is used for categorical variables. The chi-square test of independence works by comparing the categorically coded data that you have collected (known as the observed frequencies) with the frequencies that you would expect to get in each cell of a contingency table by chance alone (known as the expected frequencies).

Two types of chi-square test. A chi-square fit test for two independent variables is used to compare two variables in a contingency table to check if the data fits: a small chi-square value means that the data fits; a high chi-square value means that the data doesn't fit.
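A small sketch of the test of independence with scipy; the observed frequencies below are invented:

```python
from scipy.stats import chi2_contingency

# Contingency table of observed frequencies for two categorical variables
observed = [[30, 10],
            [20, 40]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)  # high chi2 / low p: observed data differs from chance
print(expected)       # frequencies expected by chance alone
```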
ANOVA is used for continuous variables. Analysis of variance (ANOVA) is an analysis tool used in statistics that splits the observed aggregate variability found inside a data set into two parts: systematic factors and random factors. The systematic factors have a statistical influence on the given data set, while the random factors do not.

Mutual information (MI) quantifies the information obtained about one random variable through the other random variable. The concept of MI is closely related to the fundamental notion of entropy of a random variable; entropy quantifies the amount of information contained in a random variable.
Scoring functions for univariate feature selection:

Regression: f_regression (ANOVA), mutual_info_regression
Classification: chi2, f_classif (ANOVA), mutual_info_classif
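A sketch comparing the two families of scores on the same data; mutual information can pick up dependence that the linear ANOVA F-test misses:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif, mutual_info_classif

X, y = load_iris(return_X_y=True)

f_scores, p_values = f_classif(X, y)   # ANOVA F-test per feature
mi_scores = mutual_info_classif(X, y)  # information shared with the target

print(f_scores)
print(mi_scores)
```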
Regression models predict a continuous variable. The root-mean-square error (RMSE) is the most popular loss function and error metric. The loss is symmetric, but larger errors weigh more in the calculation. Using the square root has the advantage of measuring the error in the units of the target variable. The closely related explained variance score typically lies between 0 and 1. The R2 score, or coefficient of determination, yields the same outcome when the mean of the residuals is 0, but can differ otherwise.
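Both metrics are available in sklearn.metrics; the numbers below are toy values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.6, 5.1, 3.0, 8.0])

# The square root puts the error back in the units of the target variable
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
print(rmse, r2)
```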
With 0-1 predictions, there can be four outcomes, because each of the two existing classes can be either correctly or incorrectly predicted. With more than two classes, there can be more cases if you differentiate between the several potential mistakes.

Measuring model performance: the area under the ROC curve (AUC) can be interpreted as representing the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance, which is equivalent to the Wilcoxon rank-sum test. In addition, the AUC has the benefit of not being sensitive to class imbalance.
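For instance, with toy labels and predicted probabilities:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1]
y_pred   = [0, 1, 1, 1, 0, 0]               # hard 0-1 predictions
y_scores = [0.2, 0.6, 0.8, 0.9, 0.1, 0.4]   # predicted probabilities

# Rows = actual class, columns = predicted class: the four outcomes
print(confusion_matrix(y_true, y_pred))

# Probability that a random positive is ranked above a random negative
print(roc_auc_score(y_true, y_scores))
```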
When predictions for one of the classes are of particular interest, precision and recall curves visualize the trade-off between these error metrics for different thresholds. Both measures evaluate the quality of predictions for a particular class. Applied to the positive class, precision measures the share of instances predicted as positive that are actually positive, while recall measures the share of actually positive instances that the classifier identifies.

The F1 score is the harmonic mean of precision and recall for a given threshold and can be used to numerically optimize the threshold while taking into account the relative weights that these two metrics should assume.
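A sketch of the trade-off and of the F1 score at one fixed threshold; the scores are illustrative:

```python
from sklearn.metrics import precision_recall_curve, f1_score

y_true   = [0, 0, 1, 1, 0, 1]
y_scores = [0.2, 0.6, 0.8, 0.9, 0.1, 0.4]

# Precision and recall at every candidate decision threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
print(precision, recall)

# F1 at a fixed threshold of 0.5
y_pred = [1 if s >= 0.5 else 0 for s in y_scores]
print(f1_score(y_true, y_pred))
```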
Imbalanced Datasets:

A classifier that labels every email as real would be 99% accurate! But horrible at actually classifying emails as spam, therefore failing at its original purpose. Instead, use the f1-score, precision/recall, or the confusion matrix.

Class imbalance example: 99% of emails are real; 1% of emails are spam.

In over-sampling, the simplest technique duplicates random records from the minority class, which can cause overfitting. In under-sampling, the simplest technique involves removing random records from the majority class, which can cause loss of information.
.50/50
ratio of real and spam-emails. Meaning our sub-sample will have the same amount of real and spam-emails.df['Class'][df['Class'] == 1].sum()
spam_df = df.loc[df['Class'] == 1]
non_spam_df = df.loc[df['Class'] == 0][:492]
normal_distributed_df = pd.concat([spam_df, non_spam_df])
If y is a binary categorical variable with values 0 and 1, and there are 25% zeros and 75% ones, stratify=y will make sure that your random split has 25% of 0's and 75% of 1's:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
```
Use StratifiedKFold and StratifiedShuffleSplit to ensure that relative class frequencies are approximately preserved in each train and validation fold. The Python imbalanced-learn module provides resampling routines (over- and under-sampling) for imbalanced datasets.
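A minimal sketch with imbalanced-learn's random under-sampler (over-samplers such as SMOTE use the same fit_resample interface); the synthetic data stands in for the spam example:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Synthetic ~99%/1% imbalanced dataset
X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=0)
print(Counter(y))

# Remove random majority-class records until the classes are balanced
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_res))  # equal counts for both classes
```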