Tuesday, April 29, 2014

[keep updating] Deep Learning

This has been the buzzword in the machine learning community recently. Let's start with a 101 article: http://markus.com/deep-learning-101/

Begin with the tutorial:
http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial
Chinese version:
http://deeplearning.stanford.edu/wiki/index.php/UFLDL%E6%95%99%E7%A8%8B
and another one:
http://deeplearning.net/tutorial/gettingstarted.html#


Open-source library:
http://deeplearning.net/software/theano/
Introductions in Chinese:
http://www.52ml.net/6.html
http://blog.csdn.net/mysee1989/article/details/11992535


First, some intuitive impressions after using deep learning for a while.
The motivation is that the human perceptual system is hierarchical. Take the visual system as an example: low-level neurons detect local low-level features such as edges, corners, and textures, while higher-level neurons extract higher-level features on top of these, eventually forming high-level concepts.
Neural networks were hugely popular in the 1970s and 80s, but were gradually replaced by the simpler SVM model. The reasons: a. NNs have too many parameters, which makes optimization very hard. b. With many layers, they overfit easily.
The explosive growth of computing power has solved the first problem (along with better optimization algorithms); the second is overcome by adding some regularization to the optimization process (e.g. sparsity).
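
To make "sparsity as regularization" concrete, here is a toy sketch (my own, not from the tutorials above) of a squared loss with an L1 penalty added; all names are made up for illustration:

import numpy as np

def l1_regularized_loss(W, X, y, lam=0.01):
    # Toy squared loss for a linear model plus an L1 (sparsity) penalty.
    # W, X, y, lam are made-up names: weights, inputs, targets, strength.
    residual = X @ W - y
    data_loss = 0.5 * np.mean(residual ** 2)   # fit-the-data term
    sparsity = lam * np.sum(np.abs(W))         # pushes weights toward zero
    return data_loss + sparsity

# tiny demo with random data
rng = np.random.default_rng(0)
X, y, W = rng.normal(size=(20, 5)), rng.normal(size=20), rng.normal(size=5)
print(l1_regularized_loss(W, X, y))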

A convolutional neural network (CNN) takes spatial locality into account: when computing a higher layer from a lower one, only values within a local neighborhood are used, so the training result becomes a set of convolutional filters. The benefits are higher computational efficiency and lower model complexity during training. A companion concept is max-pooling, which can be seen as a nonlinear down-sampling: the maximum value within each neighborhood is passed up to the next layer, and the neighborhoods do not overlap.
The benefit of max-pooling is that it makes the algorithm more robust; the cost is a loss of information.
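
A minimal NumPy sketch of this non-overlapping max-pooling step (the 2x2 window size is my choice for illustration):

import numpy as np

def max_pool(feature_map, k=2):
    # Non-overlapping k x k max-pooling over a 2-D feature map:
    # each k x k neighborhood is reduced to its maximum value.
    h, w = feature_map.shape
    h, w = h - h % k, w - w % k                # crop so the map divides evenly
    blocks = feature_map[:h, :w].reshape(h // k, k, w // k, k)
    return blocks.max(axis=(1, 3))

fmap = np.arange(16.0).reshape(4, 4)
print(max_pool(fmap))                          # [[ 5.  7.] [13. 15.]]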

Monday, April 28, 2014

Face recognition again

These days, some interesting news from the face recognition field is being widely re-posted on social networks.

DeepFace: Closing the Gap to Human-Level Performance in Face Verification (Facebook AI lab)
Its performance is reported to be close to that of human beings.

Then, even more incredible, another group claimed their algorithm outperforms humans.
http://www.zhizhihu.com/html/y2014/4520.html
https://medium.com/the-physics-arxiv-blog/2c567adbf7fc
http://www.52ml.net/14704.html

Face++
http://www.faceplusplus.com/uc/app/home?app_id=14807

Most of these results are reported on LFW (Labeled Faces in the Wild):
http://vis-www.cs.umass.edu/lfw/index.html

My adviser wants me to do some face recognition work.



FRR, FAR, TPR, FPR, ROC curve, ACC, SPC, PPV, NPV, etc.

Consider a framework in which an algorithm is supposed to predict "positive" or "negative". Some of the related concepts are really confusing, so here is a summary. All of these concepts or metrics are widely used to measure the performance of an algorithm or machine learning model (which is essentially a computationally intensive algorithm).


ground truth \ prediction |    positive    |  negative  | rate
--------------------------+----------------+------------+------
positive                  |       A        |     B      | TPR
negative                  |       C        |     D      | TNR
rate                      | PPV, Precision |    NPV     |

A: true positive (TP)
B: false negative (FN)
C: false positive (FP)
D: true negative (TN)
A+B: positive (P)
C+D: negative (N)

False reject rate (FRR) = B/(A+B) = FN/(TP+FN) = FN/P = 1-TPR
False accept rate (FAR) = C/(C+D) = FP/(FP+TN) = FPR

True positive rate (TPR) = A/(A+B) = TP/(TP+FN) = TP/P
False positive rate (FPR) = fall out = C/(C+D) = FP/(FP+TN) = FAR = 1-SPC

Accuracy (ACC) = (A+D)/(A+B+C+D)
Sensitivity = TPR = A/(A+B) = TP/(TP+FN)
Specificity (SPC) = TNR = D/(C+D) = TN/(FP+TN) = 1-FPR
Positive predictive value (PPV) = precision = A/(A+C) = TP/(TP+FP)
Negative predictive value (NPV) = TN/(TN+FN)

False discovery rate (FDR) = C/(A+C) = FP/(TP+FP) = 1-PPV

*In the biomedical field, "positive" means diseased, while "negative" indicates healthy.

Furthermore,
F1 Score, harmonic mean of precision and sensitivity
F1 = 2TP/(2TP+FP+FN)

Matthews correlation coefficient (MCC)
MCC = (TP*TN - FP*FN) / ((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))^0.5
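
To make the relations above concrete, a small Python sketch (my own) that computes the main metrics directly from the four confusion-matrix counts:

import math

def metrics(TP, FN, FP, TN):
    # Basic rates from the confusion-matrix entries A=TP, B=FN, C=FP, D=TN.
    P, N = TP + FN, FP + TN               # actual positives / negatives
    TPR = TP / P                          # sensitivity, recall; FRR = 1 - TPR
    FPR = FP / N                          # fall-out, FAR; SPC = 1 - FPR
    PPV = TP / (TP + FP)                  # precision; FDR = 1 - PPV
    NPV = TN / (TN + FN)
    ACC = (TP + TN) / (P + N)
    F1 = 2 * TP / (2 * TP + FP + FN)      # harmonic mean of PPV and TPR
    MCC = (TP * TN - FP * FN) / math.sqrt(
        (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    return dict(TPR=TPR, FPR=FPR, PPV=PPV, NPV=NPV, ACC=ACC, F1=F1, MCC=MCC)

print(metrics(TP=40, FN=10, FP=5, TN=45))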

When a classification model is finalized, one may want to choose an operating point, i.e., a confidence threshold. This is a trade-off between precision (higher at a higher threshold) and recall (lower at a higher threshold). You then need a curve that shows the performance at all possible threshold levels.

1. The receiver operating characteristic (ROC) curve is a plot of the true positive rate vs. the false positive rate.
When comparing the performance of two models, it is hard to tell which one is better given two curves, so people use the area under the curve (AUC) as a single number (the larger the better). But as you can see, two different curves can have the same AUC; which model to choose then depends on whether high precision or high recall is more desirable.

The AUC can take any value between 0 and 1. A random-guess classifier produces a straight line segment from (0, 0) to (1, 1), i.e., AUC = 0.5. It can happen that AUC < 0.5, but then one can simply flip the output predictions to obtain a new classifier with AUC' = 1 - AUC. In this sense, the AUC can always be considered at least 0.5.

In statistical terms, the AUC equals the probability that the model outputs a higher score for a randomly chosen positive example than for a randomly chosen negative one. For a proof, see this very nice blog post: https://madrury.github.io/jekyll/update/statistics/2017/06/21/auc-proof.html. A small numerical check of both views appears after this list.

2. The precision-recall curve is, as the name says, a plot of precision vs. recall.
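
A minimal sketch of both ideas (my own, in NumPy): sweep the threshold over the predicted scores to trace the ROC curve, integrate it with the trapezoid rule, and compare the result with the pairwise-probability interpretation; the synthetic data here is made up for illustration:

import numpy as np

def roc_auc(scores, labels):
    # Sort by descending score; each positive adds a vertical ROC step,
    # each negative a horizontal one. Integrate with the trapezoid rule.
    order = np.argsort(-scores)
    labels = labels[order]
    tpr = np.concatenate(([0.0], np.cumsum(labels) / labels.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(1 - labels) / (1 - labels).sum()))
    return np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 1000)              # synthetic ground truth
scores = labels + rng.normal(0.0, 1.5, 1000)   # noisy model scores

# Pairwise view: P(score of a random positive > score of a random negative).
pos, neg = scores[labels == 1], scores[labels == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()
print(roc_auc(scores, labels), pairwise)       # the two values agree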


references:
http://en.wikipedia.org/wiki/Sensitivity_and_specificity