Monday, June 30, 2014

How does Matlab calculate the eccentricity of a region

In Matlab, there is a built-in function, regionprops, to calculate properties of a region:
http://www.mathworks.com/help/images/ref/regionprops.html#bqkf8jf

As the help message says:
'Eccentricity' — Scalar that specifies the eccentricity of the ellipse that has the same second-moments as the region. The eccentricity is the ratio of the distance between the foci of the ellipse and its major axis length. The value is between 0 and 1. (0 and 1 are degenerate cases; an ellipse whose eccentricity is 0 is actually a circle, while an ellipse whose eccentricity is 1 is a line segment.) This property is supported only for 2-D input label matrices.

So the idea is to fit an ellipse with the same second moments as the region.
What does it mean?
The answer is in this thread:
http://stackoverflow.com/questions/1532168/what-are-the-second-moments-of-a-region

To simplify: compute the covariance matrix of the region's pixel coordinates, then do an eigenvalue decomposition. The eigenvalues determine the lengths of the major and minor axes (they are proportional to the squared semi-axis lengths), while the eigenvectors give the directions of those axes.

If the length of the major axis is 2a, the length of the minor axis is 2b, and the distance from the center to each focus is c, then:
a^2 - b^2 = c^2
eccentricity E = c/a = sqrt(1 - (b/a)^2)
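
As a sanity check, here is a minimal numpy sketch of this computation. It is my own reconstruction, not Matlab's actual source; the region_eccentricity helper is hypothetical, and the 1/12 term reflects my understanding that regionprops adds 1/12 to the normalized second moments to account for each pixel's own extent.

    import numpy as np

    def region_eccentricity(mask):
        # mask: 2-D boolean array marking the region's pixels
        ys, xs = np.nonzero(mask)
        coords = np.stack([xs, ys], axis=1).astype(float)
        # covariance of the pixel coordinates, plus the per-pixel extent correction
        cov = np.cov(coords, rowvar=False) + np.eye(2) / 12.0
        minor, major = np.linalg.eigvalsh(cov)   # eigenvalues, ascending order
        # the eigenvalues are proportional to b^2 and a^2, so the scale cancels:
        return np.sqrt(1.0 - minor / major)      # E = sqrt(1 - (b/a)^2)

    # a filled disc should give eccentricity close to 0
    yy, xx = np.mgrid[:101, :101]
    disc = (xx - 50) ** 2 + (yy - 50) ** 2 <= 40 ** 2
    print(region_eccentricity(disc))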


Wednesday, June 11, 2014

ML_general_talk.md

why this article

I am not a newbie at machine learning any more, but I still sometimes wonder what I actually gained from studying "machine learning". By applying some classical algorithms, I have developed a real feel for this hot topic.

everything is about generalization

Generalization means the ability to make good predictions on novel data samples; in other words, how well you predict on testing data given the model you trained on training data.
You can easily get 100% accuracy on the training data, except for ambiguous data points (identical data points with different labels), but this is meaningless, since your decision boundary is then too complex; the terminology is over-fitting (over-training). Instead of memorizing every training sample as if taking a photo of it, you need to loosen the decision boundary. Some techniques play exactly this role, such as the margin in SVM and regularization in optimization in general.
Another point is that features come first. You cannot do magic on bad features, which means spending more time on feature extraction/selection/design is worthwhile. In some sense, deep learning and sparse coding/representation put the effort into the steps before classification.

Do Preprocessing

Some simple preprocessing, e.g. normalization, whitening, etc., can really benefit your classification.
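
For instance, with scikit-learn (a library choice of mine here, not mentioned in the original post), both are one-liners:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    X = np.random.rand(200, 3) * [1.0, 100.0, 10000.0]  # features on wildly different scales

    X_norm = StandardScaler().fit_transform(X)           # zero mean, unit variance per feature
    X_white = PCA(whiten=True).fit_transform(X_norm)     # decorrelated components, unit variance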

Select the correct classification algorithm

Linear or not? SVM or LDA?
I always try four algorithms to get a baseline (see the sketch after this list):
1. Support Vector Machine (SVM)
2. Linear Discriminant Analysis (LDA)
3. Random Forest (RF)
4. K Nearest Neighbor (KNN)
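
A quick way to run all four baselines, sketched with scikit-learn; the dataset and the default hyper-parameters are my own choices for illustration:

    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    baselines = [("SVM", SVC()),
                 ("LDA", LinearDiscriminantAnalysis()),
                 ("RF", RandomForestClassifier()),
                 ("KNN", KNeighborsClassifier())]
    for name, clf in baselines:
        scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validated accuracy
        print("%s: %.3f" % (name, scores.mean()))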



Tuesday, June 10, 2014

git_notes

Git Notes

Some basic concepts

repository: the unit that stores your code.
branch: a separate line of version control within the same code base. The default main line is master; you can branch off to develop and test new features, then merge back into master when done.
commit: record your changes.
push: push to the remote, by default to the origin repository.
pull: fetch code from the remote repository.
The default remote repository is usually named origin; another common name is upstream, used after you fork an existing repository: your own fork is origin and the original repository is upstream. upstream comes up often when participating in open source projects.

File states

untracked: not yet added to git's index
unmodified: a tracked file without modifications
modified: modified but not yet staged
staged: added to the staging area (on stage, so to speak), ready to be committed

Common commands

git status: check the state of all files
git diff: show the differences before and after a file was modified
git branch: create a new branch
git checkout: switch branches
Deleting differs from ignoring:
ignoring only stops tracking while the file is kept; deleting removes it completely. Delete the file, then run git rm xx.xx to remove it from git for good.

Going deeper

About configuration:
/etc/gitconfig: system-wide settings, git config --system
~/.gitconfig: settings for the current user, git config --global
.git/config: settings for the current repository
git config --list lists all current settings.
The most common configuration for committing code is your user name and email. When you commit and push, you will be asked for a username/password to verify your identity. An alternative is to generate a public/private key pair locally (ssh-keygen -t rsa -C "your_email_address").
On "distributed" version control: it means no single main line holds a dominant position. My understanding is that the repository on a remote server such as GitHub is just another repository like the local one, so if your machine is online, it can be cloned directly too.
The .git directory stores all the configuration (e.g., which files to ignore, the index, etc.).

References:

A detailed GitHub tutorial
collaborate using git
git user guide
A simple and clear tutorial: Liao Xuefeng's Git tutorial

Monday, June 9, 2014

java integer pool

In this post, I record a bug from one of my Java projects.
The problem: comparing two Integer objects with == works reliably only when the value fits in one byte.
This is caused by the "Integer constant pool": to avoid extra memory cost, Java returns an already-created Integer object when the value is between -128 and 127. This means:
Integer i = 10, j = 10;       // autoboxed; both values come from the cache
boolean flag  = (i == j);     // true: == compares references, and the cache hits
boolean flagE = i.equals(j);  // true: equals() compares values, which is what we expect here
Integer m = 1000, n = 1000;   // outside [-128, 127]: two distinct objects
boolean flag2 = (m == n);     // false: different references!
For the Object class, which is the root class of all Java classes, these two (the equals() method and the == operator) are the same. But a derived class does not necessarily keep that behavior; it depends on the class's logic.

So if the logic of your program is to check whether two Integers are equal in value, you need to use equals() instead of the == operator.

You may infer from this that Java overrides the equals() method for the Integer class. That brings up another topic: when you override equals(), you need to override hashCode() as well. A naturally arising question is "why always override hashCode() when equals() is overridden?" Here is a good article on this question:
http://www.xyzws.com/javafaq/why-always-override-hashcode-if-overriding-equals/20
To simplify, you have to guarantee that:
if equals(Object o) returns true, then hashCode() outputs the same integer for both objects. This means you should generate the hash code from exactly the attributes that equals() uses to decide whether two objects are equal.
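
The same contract exists in Python (the language used for examples elsewhere on this blog): if you define __eq__, you must define a consistent __hash__. A minimal sketch with a hypothetical Point class:

    class Point:
        def __init__(self, x, y):
            self.x, self.y = x, y

        def __eq__(self, other):
            return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

        def __hash__(self):
            # hash exactly the attributes that __eq__ compares, so that
            # equal objects always produce the same hash code
            return hash((self.x, self.y))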

Thursday, May 29, 2014

[Python] Fundamentals of Python

introduction

Python is an interpreted scripting language that has become very popular in recent years. This post summarizes some fundamentals of the language.
I installed python(x,y) as my Python environment. To run a Python script, type "python xx.py", or run it in the IDE, just like running a Matlab script in the Matlab IDE.
fundamental syntax:
http://wiki.woodpecker.org.cn/moin/PyAbsolutelyZipManual
import (same as Java)
string <=> int:
str(i)        # convert int to string
int(s, base)  # convert string to int; base is the base of the number: 10, 16, or 8 (string.atoi is the deprecated old form)
float(s)      # convert string to float (was string.atof)
change dir:
import os
os.chdir(strDir)
statement (loop/if/else):
if x < 0:
    x += 1   # note that all code-block structure is defined by indentation (Python has no ++/--)
else:
    x -= 1   # and there is no "end" keyword

for i in range(10, 0, -3):   # range(10, 0, -3) means 10:-3:0 in Matlab notation, with 0 excluded
    print("the output num is %d" % i)
read/write file:
http://www.cnblogs.com/allenblogs/archive/2010/09/13/1824842.html
fid = open("xx.txt", "r")
line = fid.readline()   # the line read keeps its trailing newline
fid.close()
fid = open("xx.txt", "w+")
fid.write(line)         # note: write() itself does not add a newline
fid.close()

Lists are mutable, while tuples are immutable.


How do you check the structure of a variable?
Call type(var) to get the type of the variable, then use dir() and getattr() to inspect its structure.
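For example:

    x = [1, 2, 3]
    print(type(x))                                        # <class 'list'>
    print([a for a in dir(x) if not a.startswith('_')])   # public methods of a list
    print(getattr(x, 'append'))                           # look up an attribute by name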

List comprehensions
x = [0, 1, 2, 3, 4, 5]
y = [x_i for x_i in x]
y = [x_i + 1 for x_i in x if x_i > 1]
y = [x_i + 1 if x_i % 2 == 0 else x_i for x_i in x if x_i > 1]

how to sum up a list?
sum(x)

pandas DataFrame
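
(This heading was left empty in the original post; below is a minimal sketch of what I assume was intended, namely basic DataFrame usage.)

    import pandas as pd

    df = pd.DataFrame({"name": ["a", "b", "c"], "score": [1.0, 2.5, 3.0]})
    print(df.head())            # first rows of the table
    print(df["score"].mean())   # column-wise statistics
    print(df[df.score > 1])     # boolean indexing, similar to logical indexing in Matlab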




Wednesday, May 7, 2014

Multi-kernel learning, run "SMO-MKL" on my laptop (windows 8 64 bit)

I am compiling an open-source code package for multiple kernel learning:
http://research.microsoft.com/en-us/um/people/manik/code/smo-mkl/download.html
This is not the latest software in this area, but it could be a good starting point.

Now it is working on my Windows 8 (64-bit) machine. Steps:
1. Add $Programfiles$\Microsoft Visual Studio 11.0\VC\bin to $path$,
  then run "vcvars32.bat" to set up all paths and environment variables for Visual C++.
http://msdn.microsoft.com/en-us/library/f2ccy3wt.aspx

2. type:
nmake -f Makefile.win clean all

Done~ The compiled exe files are located in the Windows sub-directory.
Have a try:
svm-train -s 0 -h 0 -m 400 -o 2.0 -a 26 -c 10.0 -l 1.0 -f 0 -j 1 -g 3 -k Example/Classification/PrecomputedKernels/kernelfile Example/Classification/PrecomputedKernels/y_train Example/Classification/PrecomputedKernels/model_file
This example trains a model from a precomputed kernel matrix. You may then test on the testing data:
svm-predict Example/Classification/PrecomputedKernels/y_test Example/Classification/PrecomputedKernels/model_file Example/Classification/PrecomputedKernels/prediction

I eventually found that nmake -f Makefile.win clean all does not actually recompile svm.cpp; it links against the pre-compiled svm.obj instead. If you make a change to svm.cpp, nmake will give you error messages. I have not found a solution for this so far.

3. Error/warning messages when running "svm-predict".
This is annoying, because I want to do a grid search to find the optimal parameter values. The result is actually correct, but an unhandled exception makes an error message pop up, which stops the pipeline until a confirm button is clicked manually.
I had to change svm-predict.c and comment out the code that destroys the svm_model.



Tuesday, April 29, 2014

[keep updating] Deep Learning

This has been the buzzword in the machine learning community recently. Let's start with a 101 article: http://markus.com/deep-learning-101/

Then begin reading the tutorial:
http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial
Chinese version:
http://deeplearning.stanford.edu/wiki/index.php/UFLDL%E6%95%99%E7%A8%8B
and another one:
http://deeplearning.net/tutorial/gettingstarted.html#


open-source code:
http://deeplearning.net/software/theano/
introductions in Chinese:
http://www.52ml.net/6.html
http://blog.csdn.net/mysee1989/article/details/11992535


First, some intuitive impressions after using deep learning for a while.
The motivation is that the human perception system is hierarchical. Take the visual system as an example: low-level neurons detect local low-level features such as edges, corners, and textures, while higher-level neurons extract higher-level features on top of these, finally forming a high-level concept.
Neural networks were hugely popular in the 1970s and 80s, but were gradually replaced by the simpler SVM model. The reasons: (a) an NN has too many parameters and is hard to optimize; (b) with many layers it easily overfits.
The explosive growth of computing power solved the first problem (along with better optimization algorithms); the second is tackled by adding regularization to the optimization (e.g., sparsity).

A convolutional neural network (CNN) takes spatial locality into account: when computing a higher layer from a lower one, only values within a certain neighborhood are considered, so the result of training becomes a set of convolutional filters. The benefits are higher computational efficiency and lower model complexity. A companion concept is max-pooling, which can be seen as a non-linear down-sampling: within each neighborhood the maximum value is passed up to the next layer, and the neighborhoods do not overlap.
The benefit of max-pooling is that it makes the algorithm more robust; the cost is losing information.
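
A minimal numpy sketch of non-overlapping max-pooling (my own illustration):

    import numpy as np

    def max_pool(x, k=2):
        # k x k non-overlapping max-pooling over a 2-D feature map
        h, w = x.shape
        x = x[:h - h % k, :w - w % k]                 # crop so both sides are multiples of k
        return x.reshape(h // k, k, w // k, k).max(axis=(1, 3))

    fmap = np.arange(16).reshape(4, 4)
    print(max_pool(fmap))   # [[ 5  7], [13 15]]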

Monday, April 28, 2014

Face recognition again

These days, some interesting news in the face recognition field is being widely re-posted on social networks.

DeepFace: Closing the Gap to Human-Level Performance in Face Verification (Facebook AI lab)
It is said that its performance is close to a human being's.

Then, even more incredible, someone claimed their algorithm outperforms humans.
http://www.zhizhihu.com/html/y2014/4520.html
https://medium.com/the-physics-arxiv-blog/2c567adbf7fc
http://www.52ml.net/14704.html

face++
http://www.faceplusplus.com/uc/app/home?app_id=14807

Most of these results are reported on LFW (Labeled Faces in the Wild):
http://vis-www.cs.umass.edu/lfw/index.html

My adviser wants me to do some face recognition work.



FRR, FAR, TPR, FPR, ROC curve, ACC, SPC, PPV, NPV, etc.

In a framework where an algorithm is supposed to predict "positive" or "negative", some concepts are really confusing, so here is a summary. All of these metrics are widely used to measure the performance of an algorithm or machine learning model (which is essentially a computation-intensive algorithm).


ground truth \ prediction |    positive     | negative | rate
positive                  |        A        |    B     | TPR
negative                  |        C        |    D     | TNR
rate                      | PPV (precision) |   NPV    |

A: true positive (TP)
B: false negative (FN)
C: false positive (FP)
D: true negative (TN)
A+B: positive (P)
C+D: negative(N)

False reject rate (FRR) = B/(A+B) = FN/(TP+FN) = FN/P = 1-TPR
False accept rate (FAR) = C/(C+D) = FP/(FP+TN) = FPR

True positive rate (TPR) = A/(A+B) = TP/(TP+FN) = TP/P
False positive rate (FPR) = fall out = C/(C+D) = FP/(FP+TN) = FAR = 1-SPC

Accuracy (ACC) = (A+D)/(A+B+C+D)
Sensitivity = TPR = A/(A+B) = TP/(TP+FN)
Specificity (SPC) = TNR = D/(C+D) = TN/(FP+TN) = 1-FPR
Positive predictive value (PPV) = precision = A/(A+C) = TP/(TP+FP)
Negative predictive value (NPV) = TN/(TN+FN)

False discovery rate (FDR) = C/(A+C) = FP/(TP+FP) = 1-PPV

*In the bio-medical field, positive means diseased, while negative means healthy.

Furthermore,
F1 score, the harmonic mean of precision and sensitivity (recall):
F1 = 2TP/(2TP+FP+FN)

Matthews correlation coefficient (MCC):
MCC = (TP*TN - FP*FN) / ((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))^0.5
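
To keep these formulas straight, here is a small Python helper of my own that computes the point metrics from the four confusion-matrix cells:

    def binary_metrics(tp, fn, fp, tn):
        # point metrics from the four confusion-matrix cells
        return {
            "TPR": tp / float(tp + fn),                    # sensitivity, recall
            "SPC": tn / float(tn + fp),                    # specificity
            "PPV": tp / float(tp + fp),                    # precision
            "ACC": (tp + tn) / float(tp + fn + fp + tn),
            "F1":  2 * tp / float(2 * tp + fp + fn),
            "MCC": (tp * tn - fp * fn) /
                   ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5,
        }

    print(binary_metrics(tp=40, fn=10, fp=5, tn=45))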

When a classification model is finalized, one may want to choose an operating point, i.e., a confidence threshold. This is a trade-off between precision (higher at a higher threshold) and recall (lower at a higher threshold). You then need a curve that shows the performance at all possible threshold levels.

1. The receiver operating characteristic (ROC) curve is a plot of the true positive rate vs. the false positive rate.
When comparing the performance of two models, it can be hard to tell which one is better from the two curves alone, so people use the area under the curve (AUC) to measure and compare performance (the larger the better). But as you can see, two different curves can have the same AUC; which one to choose then depends on whether high precision or high recall is more desirable.

The AUC can be any value between 0 and 1. A random-guess classifier produces the straight line segment from (0, 0) to (1, 1), i.e., AUC = 0.5. An AUC below 0.5 can happen, but then one can flip the output prediction to get a new classifier with AUC' = 1 - AUC. In this sense, the AUC can always be considered to be at or above 0.5.

In statistical learning, the AUC represents the probability that the model outputs a higher score for a randomly chosen positive example than for a randomly chosen negative one. For a proof, please refer to a very nice blog post: https://madrury.github.io/jekyll/update/statistics/2017/06/21/auc-proof.html.

2. The precision-recall curve is, as the name says, a plot of precision vs. recall.
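
Both curves are easy to obtain with scikit-learn (a library choice of mine; the toy labels and scores below are made up):

    import numpy as np
    from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve

    y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])
    y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])  # model confidences

    fpr, tpr, thresholds = roc_curve(y_true, y_score)                # points on the ROC curve
    print("AUC = %.3f" % roc_auc_score(y_true, y_score))
    prec, rec, thresholds = precision_recall_curve(y_true, y_score)  # points on the PR curve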


references:
http://en.wikipedia.org/wiki/Sensitivity_and_specificity


Monday, February 24, 2014

C++ thread-safe queue, two ways to implement

I just explored this topic a little: how to implement a thread-safe queue.
The most standard method is to introduce a mutex and a condition variable.

Take the problem of implementing a thread-safe queue as an example.

The idea is to lock the critical resource before the operation and release the lock afterwards. There is some terminology: to "own the lock" means lock(), and to "release" it means unlock().

So, for the thread-safe queue, the two operations to implement are:
pop()
{
    mutex.lock();
    while (queue.empty())
    {
        condition_v.wait(mutex);  // atomically releases the mutex while waiting
    }
    queue.pop();
    mutex.unlock();
}

push(T &item)
{
    mutex.lock();
    queue.push(item);
    mutex.unlock();
    condition_v.notify();  // wake up a consumer blocked in pop(); without this, pop() may wait forever
}

The above is just pseudo-code; for the implementation, there are two different ways.
1. Using pthreads, i.e., the POSIX threads API, a set of APIs in C style. For more details, please refer to:

The code for the safe queue can be found here:


One thing worth noting: the pthread_cond_wait(&cond_v, &mutex) function returns only when the condition variable has been signaled AND the calling thread owns the mutex again, which means you must both have cond_v satisfied and hold the lock. At the beginning of the wait it releases the mutex, which makes sure other threads can access the critical resource.


2. Another way to implement the idea is to use the new features of C++11.
A very good example is:

There, a mutex can be locked for the scope of a block (e.g., with std::lock_guard). I like this style, but I really don't like that there is only a lock() and no unlock() operation;
they should naturally show up as a pair.
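
For reference, here is the same two-operation pattern sketched in runnable Python (my own illustration, keeping this blog's new examples in one language). Python's threading.Condition bundles the mutex and the condition variable, and the with block holds the lock for exactly the scope of the block, which is the style discussed above:

    import threading
    from collections import deque

    class ThreadSafeQueue:
        def __init__(self):
            self._items = deque()
            self._cond = threading.Condition()   # a mutex plus a condition variable

        def push(self, item):
            with self._cond:                     # lock held for the scope of this block
                self._items.append(item)
                self._cond.notify()              # wake up one waiting consumer

        def pop(self):
            with self._cond:
                while not self._items:           # re-check after every wakeup
                    self._cond.wait()            # releases the lock while waiting
                return self._items.popleft()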


Friday, January 17, 2014

Cannot open any webpage after updating to Windows 8.1

After updating from Win 8 to Win 8.1, no browser works, while other apps can still access the public network. This is a bug in the update.
The solution is very easy:
1. Press the Win key + "f" to search "everywhere"; you will see "Command Prompt". Right-click it and choose "Run as administrator".
2. At the command line, type "netsh winsock reset" and press Enter.
3. That is actually all of it. It will say "you need to restart your PC", but you really don't; you only need to restart your browser.