Monday, June 30, 2014

How does Matlab calculate the eccentricity of a region

MATLAB has a built-in function, regionprops, that computes the properties of an image region:
http://www.mathworks.com/help/images/ref/regionprops.html#bqkf8jf

As its help text says:
'Eccentricity' — Scalar that specifies the eccentricity of the ellipse that has the same second-moments as the region. The eccentricity is the ratio of the distance between the foci of the ellipse and its major axis length. The value is between 0 and 1. (0 and 1 are degenerate cases; an ellipse whose eccentricity is 0 is actually a circle, while an ellipse whose eccentricity is 1 is a line segment.) This property is supported only for 2-D input label matrices.

So the idea is to fit the region with an ellipse that has the same second moments as the region.
What does that mean?
The answer is in this thread:
http://stackoverflow.com/questions/1532168/what-are-the-second-moments-of-a-region

In short, the idea is to compute the covariance matrix of the region's pixel coordinates and then take its eigenvalue decomposition. The square roots of the eigenvalues are proportional to the lengths of the major and minor axes, and the eigenvectors give the directions of those axes.

Let the major axis length be 2a, the minor axis length be 2b, and the distance from the center to each focus be c. Then:
c^2 = a^2 - b^2
eccentricity E = c/a = sqrt(1 - (b/a)^2)
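
The recipe above can be sketched in a few lines of Java. This is only an illustration under simple assumptions: the region is given as two arrays of pixel coordinates, each pixel is treated as a point, and the class and method names are made up for this post. regionprops may apply extra normalization internally, so its values can differ slightly from this sketch.

```java
// A minimal sketch, not MATLAB's actual implementation.
// RegionEccentricity and eccentricity() are hypothetical names.
class RegionEccentricity {

    /** Eccentricity of the ellipse whose second central moments match the pixel set. */
    static double eccentricity(int[] xs, int[] ys) {
        int n = xs.length;

        // Centroid of the region.
        double meanX = 0.0, meanY = 0.0;
        for (int k = 0; k < n; k++) {
            meanX += xs[k];
            meanY += ys[k];
        }
        meanX /= n;
        meanY /= n;

        // Entries of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
        double sxx = 0.0, syy = 0.0, sxy = 0.0;
        for (int k = 0; k < n; k++) {
            double dx = xs[k] - meanX;
            double dy = ys[k] - meanY;
            sxx += dx * dx;
            syy += dy * dy;
            sxy += dx * dy;
        }
        sxx /= n;
        syy /= n;
        sxy /= n;

        // Closed-form eigenvalues of a symmetric 2x2 matrix.
        double common = Math.sqrt((sxx - syy) * (sxx - syy) + 4.0 * sxy * sxy);
        double lambdaMax = (sxx + syy + common) / 2.0;
        double lambdaMin = (sxx + syy - common) / 2.0;

        // A single pixel has no spread; treat it as a circle.
        if (lambdaMax == 0.0) {
            return 0.0;
        }

        // a and b scale with sqrt(lambdaMax) and sqrt(lambdaMin),
        // so (b/a)^2 = lambdaMin / lambdaMax.
        return Math.sqrt(1.0 - lambdaMin / lambdaMax);
    }
}
```

For a one-pixel-high horizontal strip the smaller eigenvalue is zero, so the sketch returns 1 (a line segment); for a square block of pixels both eigenvalues are equal and it returns 0 (a circle).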


Wednesday, June 11, 2014

ML_general_talk.md

why this article

I am no longer a newcomer to machine learning, but I still sometimes wonder what I have actually gained from studying it. Applying some classical algorithms has given me a real feel for this hot topic.

everything is about generalization

Generalization means the ability to predict well on novel data samples; in other words, how well a model trained on the training data predicts on the test data.
You can easily reach 100% accuracy on the training data, except for ambiguous points (identical samples with different labels). That is meaningless, because the decision boundary is then too complex. The term for this is overfitting (over-training). Instead of memorizing every training sample like a photograph, you need to loosen the decision boundary. Several techniques essentially play this role, such as the margin in SVM and regularization in optimization generally.
Another point is that features come first. You cannot work magic on bad features, so time spent on feature extraction, selection, and design is well worth it. In a sense, deep learning and sparse coding/representation put the effort into the steps before classification.

Do Preprocessing

Some simple preprocessing, e.g. normalization, whitening, etc. can really benefit your classification.
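
As a rough illustration of the normalization step, here is a z-score standardizer in Java. The class name Standardizer and its methods are made up for this note, and the choice of the population standard deviation is an assumption. The point tied to generalization is that the mean and standard deviation are estimated on the training data only and then reused on the test data.

```java
// A minimal sketch of per-feature z-score normalization.
// Standardizer is a hypothetical helper, not a library class.
class Standardizer {
    private final double[] mean;
    private final double[] std;

    /** Estimates per-column mean and (population) standard deviation from training rows. */
    Standardizer(double[][] train) {
        int n = train.length;
        int d = train[0].length;
        mean = new double[d];
        std = new double[d];

        for (double[] row : train) {
            for (int j = 0; j < d; j++) {
                mean[j] += row[j] / n;
            }
        }
        for (double[] row : train) {
            for (int j = 0; j < d; j++) {
                double diff = row[j] - mean[j];
                std[j] += diff * diff / n;
            }
        }
        for (int j = 0; j < d; j++) {
            std[j] = Math.sqrt(std[j]);
            // A constant feature has zero spread; avoid dividing by zero.
            if (std[j] == 0.0) {
                std[j] = 1.0;
            }
        }
    }

    /** Applies the training statistics to any data set (training or test). */
    double[][] transform(double[][] data) {
        double[][] out = new double[data.length][];
        for (int i = 0; i < data.length; i++) {
            out[i] = new double[data[i].length];
            for (int j = 0; j < data[i].length; j++) {
                out[i][j] = (data[i][j] - mean[j]) / std[j];
            }
        }
        return out;
    }
}
```

Typical use would be to build the Standardizer from the training matrix and call transform on both the training and the test matrix before handing them to the classifier.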

select correct classification algorithm

Linear or not, SVM or LDA.
I always try four algorithms to get a baseline:
1. Support Vector Machine (SVM)
2. Linear Discriminant Analysis (LDA)
3. Random Forest (RF)
4. K Nearest Neighbor (KNN)


Written with StackEdit.

Tuesday, June 10, 2014

git_notes

Git notes

Some basic concepts

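repository: the unit that stores the code.
branch: a separate line of development within the same code base. The default main line is master; you can branch off to develop and test a new feature, then merge it back into master when it is done.
commit: record your changes.
push: send commits to a remote, by default the repository named origin.
pull: fetch code from the remote repository.
The remote repository is usually named origin by default. Another common name is upstream: after you fork an existing repository, your own copy is origin and the original repository is upstream. upstream comes up often when contributing to open source projects.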

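File states

untracked: not yet added to git's index
unmodified: tracked, with no changes
modified: changed but not yet staged
staged: added to the index and waiting for the next commit; it is "on stage", so to speak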

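Common commands

git status: check the state of all files
git diff: show the differences before and after a modification
git branch: create a new branch
git checkout: switch branches
Deleting is different from ignoring:
Ignoring only stops git from tracking a file, and the file itself stays. Deleting removes it completely: delete the file first, then run git rm xx.xx so that it is removed from git as well.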

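Going deeper

On configuration:
/etc/gitconfig: system-wide, set with git config --system
~/.gitconfig: current user, set with git config --global
.git/config: current repository
git config --list lists all current settings.
The settings most often needed for committing are the user name and email. When you commit and push, you are asked for a user name and password to verify your identity. The alternative is to generate a public/private key pair locally (ssh-keygen -t rsa -C "your_email_address").
On "distributed" version control: it really means that no single main line dominates. As I understand it, the copy on a remote server such as GitHub is just another repository like the local one. So if your machine is online, it can be cloned from directly too.
The .git directory stores all of the configuration (for example which files to ignore, the index, and so on).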


Monday, June 9, 2014

java integer pool

In this post I record a bug from one of my Java projects.
The problem: comparing two Integer objects with == worked only when the values fit in one byte.
This is caused by the Integer cache. To avoid extra memory cost, Java returns an already-created Integer object when boxing a value between -128 and 127. This means:
Integer i = 10; Integer j = 10; // autoboxing goes through Integer.valueOf(10)
boolean flag = (i == j); // true, but only because both names refer to the same cached object
boolean flagE = i.equals(j); // true; compares the values, which is what we expect here
For the Object class, the root of every class in Java, equals() and == do the same thing: both compare references. A derived class may override equals() with its own logic, so the two do not have to agree.

So if your program needs to check whether two Integer objects hold the same value, use equals() instead of the == operator.
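
To see where the bug appears, here is a small sketch that boxes the same int twice and compares the results both ways. IntegerCacheDemo and its two methods are names invented for this post.

```java
// A minimal sketch of the Integer cache effect; the names are hypothetical.
class IntegerCacheDemo {

    /** Boxes the same value twice and compares the two references with ==. */
    static boolean sameReference(int value) {
        Integer a = value; // autoboxing, equivalent to Integer.valueOf(value)
        Integer b = value;
        return a == b;     // reference comparison
    }

    /** Boxes the same value twice and compares the two objects with equals(). */
    static boolean sameValue(int value) {
        Integer a = value;
        Integer b = value;
        return a.equals(b); // value comparison
    }
}
```

sameReference(127) is true because values from -128 to 127 are always cached. sameReference(128) is typically false on a default JVM, since the two boxed objects are distinct, although the upper bound of the cache can be raised by JVM settings. sameValue returns true for any input.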

You may infer from this that Java overrides equals() for the Integer class. That brings up another topic: when you override equals(), you also need to override hashCode(). The natural question is why hashCode() must always be overridden once equals() is. Here is a good article on that question:
http://www.xyzws.com/javafaq/why-always-override-hashcode-if-overriding-equals/20
In short, you have to guarantee that:
if a.equals(b) is true, then a.hashCode() and b.hashCode() return the same integer. This means you should generate the hash code from the same attributes that equals() uses to decide whether two objects are equal.
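
A small sketch of that contract, using a made-up Point class: both equals() and hashCode() are derived from the same two fields, so two equal points always land in the same hash bucket.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Point and PointDemo are hypothetical classes, used only to illustrate the contract.
final class Point {
    private final int x;
    private final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof Point)) {
            return false;
        }
        Point other = (Point) o;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        // Built from exactly the fields that equals() compares.
        return Objects.hash(x, y);
    }
}

class PointDemo {
    /** Looks up a point in a HashSet using a different but equal instance. */
    static boolean foundInSet() {
        Set<Point> seen = new HashSet<>();
        seen.add(new Point(1, 2));
        return seen.contains(new Point(1, 2));
    }
}
```

If hashCode() were left as the default from Object, the two equal points would usually get different hash codes and the HashSet lookup would most likely miss, even though equals() says they are the same.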