Sklearn Split Train and Test

Posted on 2019-04-30 | In Machine Learning

There are several ways to split a data set into a training set and a test set.

In this post, I will talk about the differences between these approaches.

sklearn.model_selection.train_test_split

Doc: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

sklearn.model_selection.ShuffleSplit

Doc: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html
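As a minimal sketch of the difference between the two utilities listed above (the toy arrays, sizes and random seeds are illustrative assumptions, not from the original post): train_test_split returns the split arrays directly for one split, while ShuffleSplit yields index arrays and can produce several independent random splits.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, train_test_split

# Toy data: 10 samples with 2 features each (illustrative only).
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# train_test_split performs a single split and returns the arrays themselves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# ShuffleSplit yields (train_indices, test_indices) pairs, one per split.
splitter = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
splits = list(splitter.split(X))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Because ShuffleSplit draws each split independently, the same sample may appear in the test set of more than one split.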

Better style for Python Programming

Posted on 2019-04-29 | Edited on 2019-07-30 | In Others

How to write better code with good style

Python Naming Conventions

Type	Convention	Example
Packages & Modules	lower_with_under	from prefetch_generator import BackgroundGenerator
Classes	CapWords	class DataLoader
Constants	CAPS_WITH_UNDER	BATCH_SIZE = 16
Instances	lower_with_under	dataset = Dataset()
Methods & Functions	lower_with_under()	def visualize_tensor()
Variables	lower_with_under	background_colour = 'Blue'

Main

tip:
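Even a file that is meant to be used as a script should be importable, and a simple import should not cause the script's main functionality to be executed, which would be a side effect. The main functionality should live in a main() function.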

def main():
    ...

if __name__ == '__main__':
    main()

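All top-level code is executed when the module is imported. Be careful not to call functions, create objects, or perform other operations that should not run when the file is processed by pydoc.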

Reference

Google Python Style Guide (Chinese translation): https://zh-google-styleguide.readthedocs.io/en/latest/google-python-styleguide/python_style_rules/
Synced (机器之心): PyTorch best practices, how to write code with an elegant style

PyTorch CUDA experience

Posted on 2019-04-29 | In Deep Learning

While training models on the lab server, I ran into out-of-memory (OOM) errors. This post records some solutions and thoughts.

Consider this scenario:

The default GPU is full, so even torch.tensor([1,2,3]).cuda() raises an OOM error.

In that case, you should try to choose another GPU.

CUDA_VISIBLE_DEVICES

Code

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'

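Add this code at the top of your script, before any CUDA call; when you run the script, it will only use the corresponding GPUs.
(Note: in my experience this is not useful in Jupyter Notebook, presumably because the variable has to be set before CUDA is initialized in the running process.)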

CUDA_VISIBLE_DEVICES=2 python test.py

Alternatively, put CUDA_VISIBLE_DEVICES=2 at the beginning of the command when you run the script. The script will then run on that GPU.

Note

Even if you select GPU 2 or 3 this way, the output will still show the device as cuda:0, for example tensor([1, 2, 3], device='cuda:0').

From @pjavia's answer on the PyTorch forum:
@MrTuo This is how pytorch 0.4.1 convention works. If you say CUDA_VISIBLE_DEVICES=2, 3. Then for pytorch GPU - 2 is cuda:0 and GPU - 3 is cuda:1. Just check your code is consistent with this convention or not?

I also tested this on torch 1.0.1, and the behaviour is consistent with this answer.
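The renumbering described in the forum answer can be sketched in plain Python. The helper logical_to_physical below is a hypothetical name introduced only for illustration; it is not part of PyTorch or CUDA. It shows how the N in cuda:N is just the position of a GPU in the CUDA_VISIBLE_DEVICES list.

```python
import os


def logical_to_physical(visible_devices):
    """Map the logical index N (as in 'cuda:N') to the physical GPU id.

    Hypothetical helper for illustration: it only parses the string,
    it does not query the driver.
    """
    ids = [part.strip() for part in visible_devices.split(",") if part.strip()]
    return {logical: physical for logical, physical in enumerate(ids)}


os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"
mapping = logical_to_physical(os.environ["CUDA_VISIBLE_DEVICES"])
print(mapping)  # {0: '2', 1: '3'}
```

With this setting, a tensor reported on cuda:0 actually lives on physical GPU 2, which matches the output shown above.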

torch.cuda

import torch

torch.cuda.set_device(1)

torch.tensor([1,2,3]).cuda()
# output: tensor([1, 2, 3], device='cuda:1')

The first line is useful in Jupyter Notebook. Once you set a GPU device, the code that follows will use that GPU.

In effect, it sets the default GPU for the session.

torch.device

device = torch.device('cuda:3')
# X = X.to(device)

Create a device object for a specific GPU and, when running the code, move each variable to that device (the device can also be the CPU).

Reference

Set Default GPU in PyTorch
Pytorch forum: CUDA_VISIBLE_DEVICE is of no use

Linux Commands when using Server

Posted on 2019-04-28 | Edited on 2019-05-03 | In Linux

When I use the lab server, there are some commands I need in order to check the state of the server.

top

It serves the same purpose as a task manager.

You can see the CPU and memory usage with this command.

nvidia-smi

Shows the state of the GPUs, including GPU memory usage and other information.

ps

ps -u [username]

Lists the processes of one user on this server.

echo

To print the current working directory:

echo $PWD

Reference

One Linux command a day (44): the top command
CUDA: the nvidia-smi command explained in detail

Linux Server virtual env

Posted on 2019-04-28 | In Linux

How to create a virtual environment on the lab server.

In my group coursework for advanced ML, I wanted to run the code of the first-place solution in this competition. The server did not satisfy the requirements, so I wanted to create a virtual environment to build such an environment. This post records the process of building an environment to run deep learning tasks.

virtualenv

If your Linux server already has the virtualenv module, you can use it to create a virtual environment. You can check whether it is installed with pip list.

In my case, I first tried to install virtualenv. Permission was denied, since I don't have root access on this server, which is provided by the teachers.

I found the reason and a solution in this issue. Therefore, I tried another approach, described below.

python -m venv

python3 -m venv env
source ./env/bin/activate

This solution only requires Python 3 on your system (most Linux distributions ship with it). In this way, I can activate the virtual environment and pip install the specific modules.

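My solution

Goal: Python 3.6

Steps:

python -m venv
Install virtualenv inside the virtual environment created by Python
Use virtualenv to create a virtual environment for the required Python 3.6 version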

Reference

An issue on GitHub: Permission denied

Kaggle Competition: Humpback Whale Identification

Posted on 2019-04-27 | Edited on 2019-04-28 | In Kaggle

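A walkthrough of the first-place solution in a Kaggle competition

Description of Competition

Goal: build an algorithm that identifies individual whales

Challenges:

The training samples are severely imbalanced
Nearly one third of the data is unlabeled (new whale)

Some new terminology:

Few-shot learning: what is few-shot learning?
Fine-grained classification: this is why we need a mask. Mask-CNN: what is a mask?
Triplet loss: ???
SE-resneXt154: a new classification model
Pseudo-labels: ??

Pipeline:

Input of the models

RGB+mask
Data Augmentation:
- Someone pointed out that whale flukes are asymmetric, so a flipped image counts as a new class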

Reference

Kaggle competition: Humpback Whale Identification
Synced (机器之心): A walkthrough of the first-place solution in a Kaggle competition

Interview-1 QuantumBlack Data Science Intern

Posted on 2019-04-15 | In Others

Notes on my first interview at a UK company

introduction

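This was my first interview, at a company called QuantumBlack, and the interviewers were two young women. The people at the company were very nice: a guy passing by greeted me as I walked in, and later, while I was waiting in the canteen, another guy asked whether I wanted some chocolate. The overall atmosphere was great, and the interviewers were very friendly too.

The interview starts with a brief introduction so that the interviewers can get to know you. I had not prepared this brief introduction.

I did not do well in the interview. First, my English was not fluent enough, so at the beginning I could not quite get the interviewer's questions. Second, in the technical questions later on, I had forgotten the details of some models; I describe this below.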

case part

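The first part was a real-world case on fraud detection. Given a very large set of bank transactions, say 100 million records, of which 200 are fraudulent, we need to detect the fraudulent transactions.

Step one was to construct features: which features did I think could be built from the historical data? At first I did not quite understand the question, and I did not answer it well afterwards either; I really did not know what features could be built.

Step two was to build a model to solve the problem. I proposed logistic regression for prediction. A series of questions followed: why choose logistic regression? How would I train and test the model?

Loss function: I was asked how to construct it. I said cross-entropy, but I had forgotten the cross-entropy formula.
Metric to evaluate the model: I said a confusion matrix could be used. I was asked to draw one; I was nervous and could not draw it at first, but managed later. I was asked about precision, recall and the F-score. Then, tied to the case, I was asked which value I should focus on, and I did not answer this well.
How to split the training data set, given that the data is severely imbalanced: I am not sure whether my answer was good. I first said an 80/20 split, then said cross-validation could be used. The interviewer questioned this: some folds might contain no fraud point at all.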
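The interviewer's objection about folds without any fraud point can be illustrated with a small sketch. The data below is a scaled-down, made-up stand-in for the case (1000 transactions, 5 fraudulent), and stratified cross-validation is one common remedy that I am adding here as an assumption, not the answer given in the interview.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Illustrative imbalanced labels: 1000 transactions, 5 of them fraudulent.
y = np.zeros(1000, dtype=int)
y[:5] = 1
X = np.arange(1000).reshape(-1, 1)


def fraud_per_fold(splitter):
    """Count the fraudulent samples that land in each test fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]


# Plain k-fold cuts the data into contiguous blocks, ignoring the labels.
plain = fraud_per_fold(KFold(n_splits=5))
# Stratified k-fold keeps the class ratio roughly equal in every fold.
stratified = fraud_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

In this toy setup the plain split puts every fraud case into the first fold, while the stratified split places one in each fold.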

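Overall, this part went somewhat badly and also ran over time. I should have thought independently and adapted to the case, and it should have been a discussion with the interviewers. I was not mentally prepared, so I was led along by the interviewers and the result was poor. A model should not be fixed; it should be adapted to the case at hand.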

technical part

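This part was not a case; it was about theory. I was fairly confident at the start, because I felt I had a good grasp of the theory.

ROC curve
I had read about the ROC curve before, but this time I just could not recall it. It is widely used in industry; I had looked at it once, but forgot to review it before the interview, which was a real mistake.
Logistic regression
Here the interviewer asked questions combining linear regression and logistic regression. Luckily, when I looked through the case studies on the company website a few days earlier, I noticed that logistic regression was used a lot, so I had prepared for it. This mainly tested my understanding of linear and logistic regression. I was asked about the form of the linear regression equation, what the loss function is, what gradient descent is (I drew a picture to describe it), and how to use gradient descent for optimization (without having to derive the derivative formula).

Then I was asked about the connection between logistic regression and linear regression. With limited time, I wrote down a formula; the interviewer understood what I meant and moved on to the next question.

Other non-linear classification models
I answered SVM, and was then asked to describe SVM and how it works. I said that an SVM is basically a linear classifier, and that a kernel is needed to make it non-linear. I was then asked to describe what a kernel is. This took quite a lot of time, and at times I did not understand what she meant. In fact I cannot explain kernels very clearly myself, so this was a mistake.
tree - ensemble
Luckily I had also prepared ensemble learning in advance. Unfortunately I forgot the details of boosting, and was too nervous to answer.

Bagging: I described the idea of bagging. The interviewer asked follow-up questions about it: are these models all the same model, or are they different?

Boosting: I did not describe it clearly. I did not explain well how to train complementary models. Weights are assigned to the data set, with each data point having its own weight (I did not make this clear at first), and then a coefficient $\alpha$ is used to update the weights from one round to the next. (I was too nervous and had not prepared, so I did not answer well.)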
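The weight update with $\alpha$ can be sketched as follows, assuming the scheme meant here is the classic discrete AdaBoost with labels in {-1, +1}; the function name adaboost_reweight and the toy labels are illustrative assumptions.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: return (alpha, new_weights).

    Assumes labels in {-1, +1}, weights that sum to 1, and a weak learner
    whose weighted error is strictly between 0 and 1.
    """
    # Weighted error of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Learner coefficient: larger when the weak learner is more accurate.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified points (t * p == -1) are scaled up, correct ones down.
    updated = [
        w * math.exp(-alpha * t * p)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]  # only the last point is misclassified
weights = [0.2] * 5

alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print(alpha, new_weights)
```

The misclassified point ends up with a larger weight, so the next weak learner concentrates on it; this is how the models become complementary.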

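Then time was up and the interview ended.

Lessons learned

Prepare the short opening introduction. *Important
Practise English; be flexible in the case discussion and show my own thinking. ****Important
ROC curve. *Important; I forgot to prepare it
SVM kernel. *Important; learn how to explain it
Boosting. *Important; I forgot to prepare it, although I do know it
The technical questions were very detailed. They do not ask about deep learning, only about base models. Preparation should be focused; luckily I had looked at past projects on the company website beforehand, so I knew they would ask more about traditional machine learning models. But some specific details need a deeper understanding, to the point where I can explain them to others. ****Important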

Gaussian Mixture Model (GMM) and EM algorithm

Posted on 2019-04-14 | In Machine Learning

Introduction

Gaussian Distribution

Gaussian Mixture Distribution

Optimization Method

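Reference:

Zhihu: Gaussian mixture model (GMM)
Zhihu: A detailed explanation of the principles of Gaussian mixture models
Li Hang, 《统计学习方法》 (Statistical Learning Methods), Chapter 9: The EM algorithm and its extensions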

Note-2 Feature Engineering

Posted on 2019-04-14 | In Machine Learning

What’s Feature Engineering

In machine learning applications and in data science, to achieve better prediction or classification performance, we should not only choose the most suitable algorithm/model but also use suitable features.

Definition from Wikipedia:

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.

In short, feature engineering means manually designing what the input x should be so that our models work well.

Importance

The choice of features is important for our task.

Better features give the model more flexibility.
Suitable features allow simple models to be used
Better features lead to better performance

Sub-questions of Feature Engineering

There are three main kinds of tasks in feature engineering:

Feature Construction
- Given a problem and a raw data set, constructing features using domain knowledge is what I call feature construction. In this process, we should analyse the problem, convert it into a mathematical problem, and work out what data we need and how to tackle the problem.
Feature Extraction
- Extract features from the data set. For example, in a document filtering or clustering task, we use the TF-IDF method to construct the document/word vectors and extract the information behind the documents. Another example is in CNN applications, where the kernels/filters in the convolution layers are used to extract features from images.
Feature Selection
- Choose the most suitable features and feed them into our models. Ignore the irrelevant features.
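To make the TF-IDF example under Feature Extraction concrete, here is a minimal sketch using the plain tf * log(N / df) form. The function name tf_idf, the whitespace tokenizer and the toy documents are illustrative assumptions; library implementations typically use smoothed variants of this formula.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["the cat sat", "the dog barked", "the cat purred"]
vectors = tf_idf(docs)
print(vectors[0])
```

A word that appears in every document ("the") gets weight 0, while a word unique to one document ("sat") gets the largest weight, which is why TF-IDF highlights the terms that distinguish a document.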

These three tasks sometimes overlap, which can be confusing. This breakdown is simply the way I find easiest to understand; you can organise them in whatever way helps you understand better.

How to do?

A data science pipeline basically looks like this:

understand the given task
choose a data set
pre-process the data set
feature engineering (extract features)
model the data
analyse and evaluate

Feature engineering is one part of the work in a data science project.

There are some ways to do feature engineering:

Brainstorm: come up with ideas for features that may be useful for the project
Design features
Choose features

(… TO BE CONTINUED)

Reference:

The image and the ideas come from this blog

Note-1: Linear Regression and Logistic Regression

Posted on 2019-04-12 | In Machine Learning

Linear Regression

What’s Linear Regression

Linear regression is an approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables.

It is written as the linear formula: $f(x)=w^T x+b$

Given the feature values of $m$ data points, we can train a linear model that fits the data set well. When a new data point is fed into the model, we can predict its value.

We are going to find the optimal weights:

$(w^*,b^*)= \underset{(w,b)}{\operatorname{arg\,min}}\sum^m_{i=1}(f(x_i)-y_i)^2$

The closed-form solution can be obtained by setting the derivative to zero. Of course, you can also use gradient descent to find the optimal parameters, but it’s not necessary.
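As a minimal sketch of the closed-form route (the synthetic data, the seed and the true parameters are assumptions chosen for illustration), the least-squares problem above can be solved with NumPy by appending a column of ones so that $b$ is learned as an extra weight:

```python
import numpy as np

# Synthetic, noise-free data generated from known parameters (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w, true_b = np.array([2.0, -1.0]), 0.5
y = X @ true_w + true_b

# Append a column of ones so that the bias b becomes the last weight.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Least-squares solution of X_aug @ theta ~= y. This minimizes the same
# sum of squared errors as the normal equations, but is numerically safer
# than inverting X_aug.T @ X_aug explicitly.
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w, b = theta[:-1], theta[-1]
print(w, b)
```

Since the data is noise-free, the recovered w and b match the parameters used to generate it up to floating-point error.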

Advantage

The advantages of linear regression are that it’s simple and easy to implement, and that it is computationally cheap.

Logistic Regression

Why Logistic Regression

Though the name Logistic Regression includes "regression", it’s not really a regression model; it’s for classification tasks. In this respect, we can call it Logistic Regression Analysis.

We already have linear regression to predict a value for a new data point. It can actually also be used to predict the class of a data point: we just set a threshold, and if the predicted value is above the threshold, the point is classified as class 1; otherwise, it is classified as class 0.

However, there is a drawback when we use linear regression for classification: the output is unbounded, so a suitable threshold has to be chosen case by case. That is why logistic regression was introduced.

What’s Logistic Regression

Some key words in Logistic Regression:

Hypothesis: Data points are Bernoulli distributed
Maximum likelihood to get the cost function
Gradient descent or Newton method to find the optimal solution
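The first two key words fit together as follows. This is the standard derivation, written with $p_i$ for the model's predicted probability that $y_i=1$ (the symbol $p_i$ is introduced here for illustration): under the Bernoulli hypothesis, maximizing the likelihood is the same as minimizing the cross-entropy cost.

```latex
% Likelihood of m independent Bernoulli labels y_i \in \{0, 1\}
L(w,b) = \prod_{i=1}^{m} p_i^{\,y_i}\,(1-p_i)^{\,1-y_i}

% Negative log-likelihood = cross-entropy cost to be minimized
J(w,b) = -\ln L(w,b)
       = -\sum_{i=1}^{m} \bigl[\, y_i \ln p_i + (1-y_i)\ln(1-p_i) \,\bigr]
```

Gradient descent or Newton's method is then applied to $J(w,b)$.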

Given the generalized linear model $y=g^{-1}(w^Tx+b)$, the function $g(\cdot)$ is called the link function.

The function $g^{-1}$, whether the unit-step function or the sigmoid function, converts the predicted value $z=w^Tx+b$ into the corresponding class.

$y=\begin{cases} 1,&z>0 \\ 0.5,&z=0 \\ 0,&z<0 \end{cases}$

The unit-step function does not have good properties: it is discontinuous, so it cannot be differentiated. We therefore use the sigmoid function instead. It has the following form:

$y=\frac{1}{1+e^{-z}}$

The sigmoid function squashes the predicted value into the interval (0, 1), so we can now set a single threshold and do the classification task.
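A minimal sketch of this step, assuming a threshold of 0.5 (the function names and the threshold value are illustrative choices): the sigmoid is written in a numerically stable form so that large negative inputs do not overflow.

```python
import math


def sigmoid(z):
    """Sigmoid 1 / (1 + e^{-z}), computed without overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_class(z, threshold=0.5):
    """Classify as 1 when the sigmoid output reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0


for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), predict_class(z))
```

Here z stands for the linear output $w^Tx+b$; a threshold of 0.5 on the sigmoid output corresponds to a threshold of 0 on z.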

Log Odds - another way to interpret LR

Log odds is another way to interpret logistic regression. For more details, see Chapter 3 of 《机器学习》 (Machine Learning) by Zhou Zhihua.
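As a short worked derivation of that interpretation, start from the sigmoid with $z=w^Tx+b$ and read $y$ as the probability of class 1; the odds are then $y/(1-y)$:

```latex
y = \frac{1}{1+e^{-(w^T x+b)}}
\;\Longrightarrow\;
\frac{y}{1-y} = e^{\,w^T x+b}
\;\Longrightarrow\;
\ln\frac{y}{1-y} = w^T x+b
```

In other words, logistic regression fits a linear model to the log odds of the positive class.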

Reference

Zhou Zhihua, 《机器学习》 (Machine Learning)
Python data science and machine learning notes
What is the essence of the least squares method