The art and science of: Giving computers the ability to learn to make decisions from data without being explicitly programmed!
The process is iterative throughout the sequence, and the effort required at different stages will vary according to the project, but this process should generally include the following steps:
captures most of the variance in the existing dataset. Algorithms such as PCA differ with respect to the nature of the new dataset they will produce.
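As a minimal sketch of this idea, the following uses scikit-learn's PCA on a synthetic dataset (the data and the choice of two components are illustrative, not from the text):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # 100 samples, 5 features
X[:, 0] *= 10                        # give one feature much larger variance

pca = PCA(n_components=2)            # keep the 2 strongest directions
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)               # reduced dataset: (100, 2)
print(pca.explained_variance_ratio_) # fraction of variance per component
```

Because the first feature dominates the total variance, the first component should capture most of it.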
Gaussian: each variable is shaped like a bell curve when plotted.
Similar variance: values of each variable vary around the mean by the same amount on average.
Extensions to LDA:
Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or covariance when there are multiple input variables).
Flexible Discriminant Analysis (FDA): Uses non-linear combinations of inputs, such as splines.
Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of the variance (actually covariance), moderating the influence of different variables on LDA.
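The LDA/QDA distinction above can be sketched with scikit-learn; the two-class synthetic data below (with deliberately different per-class spread, QDA's use case) is an illustrative assumption:

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(0)
# Two classes whose covariances differ: QDA estimates one per class,
# LDA assumes a single shared covariance.
X0 = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X1 = rng.normal(loc=3.0, scale=2.0, size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis().fit(X, y)     # shared covariance
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # per-class covariance
print(lda.score(X, y), qda.score(X, y))
```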
α, or alpha (a constant), so their values become fixed when you choose the test's significance level.
Critical values on the standard normal distribution for α = 0.05:
Results of a one-tailed Z-test are significant if the test statistic is equal to or greater than 1.64, the critical value in this case. The shaded area is 5% (α) of the area under the curve.
Results of a two-tailed Z-test are significant if the absolute value of the test statistic is equal to or greater than 1.96, the critical value in this case. The two shaded areas sum to 5% (α) of the area under the curve.
normally distributed. A z-score is calculated with population parameters such as “population mean” and “population standard deviation” and is used to validate a hypothesis that the sample drawn belongs to the same population.
assumes a normal distribution of the sample.
A t-test is used when the population parameters (mean and standard deviation) are not known.
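A minimal sketch of a one-sample t-test with SciPy; the sample data and the hypothesized mean of 5 are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Small sample with unknown population standard deviation: t-test territory.
sample = rng.normal(loc=5.2, scale=1.0, size=30)

# Does the sample mean differ significantly from a hypothesized mean of 5?
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(t_stat, p_value)
```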
examines each feature individually to determine the strength of the relationship of the feature with the response variable.
Scikit-learn exposes feature selection routines like GenericUnivariateSelect as objects that implement a transform method, selecting features based on the scores of univariate statistical tests with strategies such as fpr, fdr, and fwe (false positive rate, false discovery rate, family-wise error rate).
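A minimal sketch of GenericUnivariateSelect; the breast-cancer dataset and the fdr mode at 0.05 are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import GenericUnivariateSelect, f_classif

X, y = load_breast_cancer(return_X_y=True)

# mode='fdr' keeps features whose p-values pass a false-discovery-rate check;
# other modes include 'fpr', 'fwe', 'k_best', and 'percentile'.
selector = GenericUnivariateSelect(score_func=f_classif, mode='fdr', param=0.05)
X_new = selector.fit_transform(X, y)
print(X.shape, '->', X_new.shape)
```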
categorical variables. The chi-square test of independence works by comparing the categorically coded data that you have collected (known as the observed frequencies) with the frequencies that you would expect to get in each cell of a contingency table by chance alone (known as the expected frequencies).
Two types of chi-square test:
A chi-square goodness-of-fit test checks whether the observed frequencies of a single categorical variable fit a hypothesized distribution; a chi-square test of independence compares two variables in a contingency table to check whether they are related.
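A minimal sketch of the test of independence with SciPy; the 2x2 contingency counts are illustrative, not real data:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies in a 2x2 contingency table (illustrative counts).
observed = np.array([[30, 10],
                     [20, 40]])

# chi2_contingency compares observed counts with the frequencies expected
# under independence and returns the test statistic and p-value.
chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)
print(expected)   # expected frequencies by chance alone
```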
continuous. Analysis of variance (ANOVA) is an analysis tool used in statistics that splits an observed aggregate variability found inside a data set into two parts: systematic factors and random factors. The systematic factors have a statistical influence on the given data set, while the random factors do not.
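A minimal sketch of one-way ANOVA with SciPy; the three groups below are synthetic, with one group's mean deliberately shifted to act as a systematic factor:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Two groups share a mean; the third is shifted (a systematic factor).
group_a = rng.normal(10, 2, size=50)
group_b = rng.normal(10, 2, size=50)
group_c = rng.normal(13, 2, size=50)

# f_oneway splits variability into between-group and within-group parts.
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)
```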
quantifies the information obtained about one random variable through the other random variable. The concept of MI is closely related to the fundamental notion of entropy of a random variable.
Entropy quantifies the amount of information contained in a random variable.
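A minimal sketch of both quantities; the coin distribution and the toy label array are illustrative:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

# Entropy of a fair coin, in bits.
h = entropy([0.5, 0.5], base=2)
print(h)   # 1.0

# MI between a variable and itself equals its entropy (here in nats);
# a deterministic relation like x vs 1 - x is just as informative.
x = np.array([0, 0, 1, 1, 0, 1, 0, 1])
print(mutual_info_score(x, x))
print(mutual_info_score(x, 1 - x))
```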
predict a continuous variable. The root-mean-square error (RMSE) is the most popular loss function and error metric. The loss is symmetric, but larger errors weigh more in the calculation. Using the square root has the advantage of measuring the error in the units of the target variable.
R2 score, or coefficient of determination, yields the same outcome when the mean of the residuals is
0, but can differ otherwise.
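Both metrics can be computed with scikit-learn; the true/predicted values below are illustrative:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Square root puts the error back in the units of the target variable.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)   # 1.0 would be a perfect fit
print(rmse, r2)
```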
0-1 predictions, there can be four outcomes, because each of the two existing classes can be either correctly or incorrectly predicted. With more than two classes, there can be more cases if you differentiate between the several potential mistakes.
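The four outcomes are summarized by a confusion matrix; the labels below are illustrative:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns predicted: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
```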
Measuring model performance
representing the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance, which is equivalent to the Wilcoxon rank-sum test. In addition,
the AUC has the benefit of not being sensitive to class imbalances.
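The ranking interpretation can be checked directly with scikit-learn; the labels and scores below are illustrative:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # predicted probabilities

# AUC = probability that a random positive is ranked above a random negative;
# here 8 of the 9 positive/negative pairs are ordered correctly.
auc = roc_auc_score(y_true, scores)
print(auc)
```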
When predictions for one of the classes are of particular interest, precision and recall curves visualize the trade-off between these error metrics for different thresholds. Both measures evaluate the quality of predictions for a particular class. The following list shows how they are applied to the positive class:
harmonic mean of precision and recall for a given threshold and can be used to numerically optimize the threshold while taking into account the relative weights these two metrics should receive.
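A minimal sketch of all three metrics for the positive class; the labels are illustrative:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)         # harmonic mean of precision and recall
print(p, r, f1)
```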
Class imbalance example: 99% of emails are real; 1% of emails are spam. A classifier that labels every email as real is 99% accurate, but horrible at actually classifying emails as spam, therefore failing at its original purpose. Instead, use metrics such as precision, recall, or the F1 score.
overfitting. In under-sampling, the simplest technique involves removing random records from the majority class, which can cause loss of information.
50/50 ratio of real and spam emails, meaning our sub-sample will have the same number of real and spam emails.
# Count the spam (minority-class) records — 492 in this dataset
(df['Class'] == 1).sum()
spam_df = df.loc[df['Class'] == 1]
# Take the same number of non-spam records as there are spam records
non_spam_df = df.loc[df['Class'] == 0][:492]
balanced_df = pd.concat([spam_df, non_spam_df])
Suppose y is a binary categorical variable with values 0 and 1, and there are 25% of zeros and 75% of ones. Setting stratify=y will make sure that your random split has the same 25/75 class proportions:
train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
Use StratifiedShuffleSplit to ensure that relative class frequencies are approximately preserved in each train and validation fold.
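A minimal sketch with an imbalanced toy label vector (75/25, an illustrative assumption):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)      # imbalanced: 75% / 25%

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    # Each fold keeps roughly the original 75/25 class ratio.
    print(np.bincount(y[train_idx]), np.bincount(y[test_idx]))
```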