We Need to Go Deeper


Linux Delete Files Bash Command

Posted on 2019-07-03 | In Linux

There are multiple ways to delete files on a Linux server.

rm bash command

rm [options] file...

I usually use rm -rf folder_name to remove a non-empty folder.

Options:

  • -f, --force

    Ignore nonexistent files, never prompt

  • -r, -R, --recursive

    Remove directories and their contents recursively.

There are some other options as well; see reference 1 for the full documentation.

rsync

rsync is an alternative way to delete files. It works well for folders containing a huge number of files, where rm -rf does not behave well.

mkdir empty
rsync -a --delete empty/ your_folder/

Reference

  1. Bash man doc: rm
  2. Bash man doc: rsync
  3. StackExchange: Efficiently delete large directory containing thousands of files

Linux tar for unzipping

Posted on 2019-07-02 | Edited on 2019-07-03
tar xvzf file.tar.gz

x: This option tells tar to extract the files.

v: The “v” stands for “verbose.” This option will list all of the files one by one in the archive.

z: The z option is very important and tells the tar command to uncompress the file (gzip).

f: This option tells tar that you are going to give it a file name to work with.

Note:
For compressing files, you can use

tar cvfz file.tar.gz dir/

The z tells tar that the archive is compressed with gzip, and we usually give the compressed file a .tar.gz extension.

Reference

  1. How To Extract .tar.gz Files using Linux Command Line
  2. 每天一个linux命令(61):tar命令

Linux wget for downloading

Posted on 2019-07-02 | In Linux

wget is used for downloading files in the terminal.

When I was doing my summer project, I needed to download datasets onto the server. I used wget for this, and I take the chance to record some notes on the command.

Introduction of wget

GNU Wget is a free utility for non-interactive download of files from the Web. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies.

Non-interactive means that it can work in the background.

To invoke:

wget [option]… [URL]…

More on usage

  1. Download a file in the background:

     wget -b [URL]

     Check the progress of the download:

     tail -f wget-log

  2. Download a file and save it under a different name:

     wget -O [name] [URL]

  3. Download multiple files:

     wget -i filelist.txt

The filelist.txt should contain the URLs of the target files.
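For example, a filelist.txt could simply contain one URL per line (these addresses are placeholders, not real dataset links):

https://example.com/dataset_part1.tar.gz
https://example.com/dataset_part2.tar.gz
https://example.com/dataset_part3.tar.gz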

To extract a .tar.gz file, use the tar command; see the blog Linux tar for unzipping.

Reference:

  1. GNU Wget 1.20 Manual
  2. 每天一个linux命令(61):wget命令

Can Python Interpret Itself?

Posted on 2019-07-02 | In Python

A Python interpreter can be written in Python itself; the most famous example is PyPy.

There are several advantages to using PyPy, such as speed, memory usage, compatibility, and stackless support.

In Allison Kaptur's tutorial, the Python interpreter is a stack machine: it manipulates several stacks to perform its operations (as contrasted with a register machine, which writes to and reads from particular memory locations).
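For instance, the standard dis module makes this stack-machine behaviour visible; this is just a small illustration of CPython's bytecode, not code from the tutorial:

import dis

def add(a, b):
    return a + b

# Each instruction works on the value stack: LOAD_FAST pushes a and b,
# the add instruction (BINARY_ADD, or BINARY_OP on newer CPython versions)
# pops both and pushes the sum, and RETURN_VALUE pops the result.
dis.dis(add)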

Reference:

  1. Allison Kaptur: A Python Interpreter Written in Python

Python Interpreter

Posted on 2019-07-02 | In Python

The Python interpreter

When you write Python code, you get a text file with a .py extension that contains the code. To run the code, you need a Python interpreter to execute the .py file.

In fact, there are many Python interpreters.

CPython

After downloading and installing Python 3.x from the official Python website, we directly get the official interpreter, CPython. It is written in C, hence the name CPython.

CPython is the most widely used Python interpreter. When a .py file is run, the interpreter first compiles it to bytecode and then executes that bytecode.

IPython

IPython is an interactive interpreter built on top of CPython; it only enhances the interactive experience, and otherwise executes Python code exactly as CPython does.

CPython uses >>> as its prompt, while IPython uses In [no.]: as its prompt.

PyPy

PyPy is another interpreter, and its goal is execution speed. PyPy uses a JIT to dynamically compile (rather than interpret) Python code, so it can significantly speed up execution.

Most Python code runs under PyPy, but PyPy and CPython differ in some respects, so the same code may behave differently under the two interpreters. Understand the differences before using it.

Jython

Jython is a Python interpreter running on the Java platform; it compiles Python code directly into Java bytecode.

IronPython

IronPython is similar to Jython, except that it runs on Microsoft's .NET platform and compiles Python code directly into .NET bytecode.
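To check which interpreter is actually running a script, the standard library can be used (a minimal example):

import platform
import sys

# Prints "CPython" under the official interpreter, "PyPy" under PyPy,
# "Jython" under Jython, and "IronPython" under IronPython.
print(platform.python_implementation())
print(sys.version)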

Reference

  1. 廖雪峰 python解释器

Web Crawler Get the User-Agent

Posted on 2019-06-13 | Edited on 2019-06-14 | In Web Crawler

How to Get User-Agent for your Crawler

There are several common ways to get a user-agent and use it when scraping a website.

  1. Open about:version in your browser and check the User-Agent field.

  2. Use the inspect tool within your browser. Network -> refresh your page -> find the current page -> Headers -> check the User-Agent

  3. Install the fake_useragent package and import it in Python:

    pip install fake_useragent

    When using it (a fuller request sketch follows after this list):

    from fake_useragent import UserAgent

    fake_ua = UserAgent()
    headers = {'User-Agent': fake_ua.random}
  4. Google to find some User-Agent strings, such as User-Agent 汇总 (a collection of User-Agent strings).
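A minimal sketch of putting the random User-Agent into an actual request (the URL below is a placeholder, and the requests package must be installed):

import requests
from fake_useragent import UserAgent

fake_ua = UserAgent()
headers = {'User-Agent': fake_ua.random}

# Placeholder URL; replace it with the page you want to scrape.
resp = requests.get('https://example.com/', headers=headers)
print(resp.status_code)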

Reference

  1. CSDN blog: 爬虫之UserAgent的获得方法
  2. fake_useragent doc

Error Met in First Crawler Demo

Posted on 2019-06-13 | Edited on 2019-06-14 | In Web Crawler

When writing the crawler program, I ran into several problems. This blog records the solutions I used in my program.

Access denied when using url directly

Some websites have protections against being scraped. In this case, we should add request headers.

The fields in headers:

  • User-agent: like Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
  • Referer: like https://www.google.com/

(See the introduction to crawlers: Web Crawler Basic)

Max retries exceeded with url

requests.exceptions.SSLError: HTTPSConnectionPool(host='www.mzitu.com', port=443): Max retries exceeded with url: /184325/8 (Caused by SSLError(SSLError("bad handshake: SysCallError(60, 'ETIMEDOUT')")))

First attempt at solving this problem:

  • Add sleep(1) in every iteration of the image download loop.
    • This alleviated the problem and let the program crawl for longer, but did not solve it completely.
    • It could download around 300 images before failing.
  • To better evade anti-scraping measures, the second version added two other features (see the sketch after this list):
    • Randomly sleep for some time between image requests.
    • Randomly choose the User-Agent using the fake_useragent Python module.
    • The performance was still not very good; I'm not sure whether that is due to my IP.
  • To scrape more aggressively, my program would need more IPs. (To be continued)
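A rough sketch of what the second version did; the name image_urls and the saving step are placeholders rather than the actual code of my program:

import random
import time

import requests
from fake_useragent import UserAgent

fake_ua = UserAgent()
image_urls = []  # placeholder: the image URLs collected earlier

for url in image_urls:
    # Pick a random User-Agent for every request.
    headers = {'User-Agent': fake_ua.random}
    resp = requests.get(url, headers=headers)
    # ... save resp.content to disk here ...
    # Sleep a random amount of time (1 to 3 seconds) before the next request.
    time.sleep(random.uniform(1, 3))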

Connection reset by peer

I'm not sure what causes this problem.

I added proxies to avoid it:

  1. Get some proxies.

  2. Use them in the requests.get function (see the sketch below).

Although the program becomes slower, it becomes much more robust during scraping.
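A minimal sketch of passing proxies to requests.get; the proxy addresses and URL are placeholders:

import requests

# Placeholder proxy addresses; fill in proxies you have obtained.
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

# A timeout keeps a dead proxy from hanging the program.
resp = requests.get('https://example.com/', proxies=proxies, timeout=10)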

Web Crawler Basic

Posted on 2019-06-13 | Edited on 2019-06-14 | In Web Crawler

There are several Python modules for fetching HTML.

urllib and requests are two different Python modules that can be used for crawling.

For beginners, requests is recommended.

Headers

An HTTP request header is the information, in the form of a text record, that a user's browser sends to a web server, containing the details of what the browser wants and will accept back from the server.

User-Agent

The User-Agent appears in an HTTP request header, not an HTTP response. In general, the request is sent from the browser to the web application, so the User-Agent is filled in by the browser. Different browsers fill this field with different values.

Blog: Web Crawler Get the User-Agent

Referer

The Referer is an optional HTTP header field that identifies the address of the webpage that linked to the resource being requested. By checking the referer, the new webpage can see where the request originated.

Some websites use this to block crawlers, so you may need to set your Referer accordingly.
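A small sketch of sending both fields with requests, reusing the example values above (the target URL is a placeholder):

import requests

headers = {
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/74.0.3729.169 Safari/537.36'),
    'Referer': 'https://www.google.com/',
}

# Placeholder URL; replace it with the page you want to crawl.
resp = requests.get('https://example.com/', headers=headers)
print(resp.status_code)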

Reference

  1. Python 爬虫基础之urllib和requests
  2. my demo of web crawler
  3. stackoverflow: HTTP request header

Association Rules Mining

Posted on 2019-05-29 | In Data Mining

It is also a kind of Market Basket Analysis.

Introduction

An association rule is an implication of the form $X \Longrightarrow Y$ ($X$ implies $Y$), where $X$ and $Y$ are itemsets.

An item set is a set of items. If it has $k$ items, it is a $k$-itemset.

Support $s$ of an itemset $X$ is the percentage of transactions in $D$ that contain $X$.

Support of the association rule $X \Longrightarrow Y$ is the support of the itemset $X \cup Y$.

Confidence of the rule $X \Longrightarrow Y$ is the ratio between the number of transactions that contain both $X$ and $Y$ and the number of transactions that contain $X$ in $D$: $\text{conf}(X \Longrightarrow Y) = \text{supp}(X \cup Y) / \text{supp}(X)$.


The problem is: Find association rules.

Given:

  • A set $I$ of items
  • database $D$ of transactions
  • minimum support $s$
  • minimum confidence $c$

Find: Association rules $X \Longrightarrow Y$ with a minimum support $s$ and minimum confidence $c$


Apriori Algorithm

There are two principles of the Apriori algorithm:

  • Any subset of a frequent itemset is also frequent
  • Any superset of an infrequent itemset is also infrequent

(For example, if {milk, bread} is frequent, then so are {milk} and {bread}; for a worked example, see reference 2.)

Improvements

The limitation of confidence:

If $Y$ is independent of $X$, then $\text{conf}(X \Longrightarrow Y) = p(Y \mid X) = p(Y)$.

This means that if $p(Y)$ is high, we get a rule with high confidence that associates independent itemsets.

Lift: a measure that indicates departure from independence of $X$ and $Y$.

The lift of $X \Longrightarrow Y$ is:

$$\text{lift}(X \Longrightarrow Y) = \frac{\text{conf}(X \Longrightarrow Y)}{\text{supp}(Y)} = \frac{p(X \wedge Y)}{p(X)\,p(Y)}$$

But lift is symmetric: it is the same for $X \Longrightarrow Y$ as for $Y \Longrightarrow X$.

Conviction: indicates that $X$ and $Y$ are not independent, and takes the direction of the implication into account.

Since $p \rightarrow q \equiv \neg p \vee q$, which can be rewritten as $\neg(p \wedge \neg q)$, conviction is based on this:

$$\text{conv}(X \Longrightarrow Y) = \frac{p(X)\,p(\neg Y)}{p(X \wedge \neg Y)} = \frac{1 - \text{supp}(Y)}{1 - \text{conf}(X \Longrightarrow Y)}$$

Conviction is a measure of the implication and has value 1 if the items are unrelated.
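As a sanity check on these definitions, here is a small sketch with a toy transaction database (the transactions are made up for illustration):

# Toy transaction database D; each transaction is a set of items.
D = [
    {'milk', 'bread'},
    {'milk', 'bread', 'butter'},
    {'bread', 'butter'},
    {'milk'},
]

def support(itemset):
    """Fraction of transactions in D that contain the itemset."""
    return sum(itemset <= t for t in D) / len(D)

X, Y = {'milk'}, {'bread'}

supp_rule = support(X | Y)      # support of X => Y: 0.5
conf = supp_rule / support(X)   # confidence of X => Y: ~0.667
lift = conf / support(Y)        # lift of X => Y: ~0.889 (< 1, so slightly negative association)
print(supp_rule, conf, lift)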

Reference

  1. Jo slide Market Basket
  2. Mining Association Rules

Decision Tree

Posted on 2019-05-29 | In Data Mining

A decision tree is like a flow chart.

One merit of decision trees is that they are interpretable: it is easy to see how they made a certain decision. Besides, decision trees can be "hand crafted" by experts, and they can also be built using machine learning techniques.

Criteria for Choosing the Split

For classification:

  • ID3: maximise the information gain (based on the information entropy). The information gain is essentially the mutual information; see the blog for more details.

  • CART: maximise the impurity decrease (based on the Gini impurity):

    $$\Delta I = I(\text{root}) - \frac{N_{\text{left}}}{N}\, I(\text{left}) - \frac{N_{\text{right}}}{N}\, I(\text{right})$$

    where root is the node to be split, and $I(\text{left})$ and $I(\text{right})$ are the impurities of the left and right branches.

For regression:

  • CART: use the variance instead of the Gini impurity or entropy. Choose the feature that decreases the variance the most.

Note:

When computing the information gain, impurity decrease, or variance gain, remember to multiply by the weights of the left and right subtrees, as in the sketch below.
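For classification with the Gini impurity, the weighted decrease can be written as a short sketch (illustrative only, not code from the slides):

from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Gini decrease of a split, weighting each child by its share of the samples."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)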

CART

CART stands for Classification And Regression Tree. In this part, regression using the variance is introduced.

The procedure of building the CART:

  • Find the best split for each feature (the one that minimises the impurity measure)
  • Find the feature whose best split reduces the impurity the most
  • Use the best split on that feature to split the node
  • Do the same for each of the leaf nodes

(First find the best split point for each feature, compare the best results across all features to pick the deciding feature, then split. Iterate this process; see the sketch below.)
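A minimal sketch of the per-feature split search for regression, assuming a single numeric feature xs and targets ys (illustrative only):

def variance(values):
    """Variance of a list of target values."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(xs, ys):
    """Find the threshold on one feature that minimises the
    weighted variance of the two child nodes."""
    n = len(ys)
    best_t, best_score = None, float('inf')
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        # Weight each child's variance by its share of the samples.
        score = len(left) / n * variance(left) + len(right) / n * variance(right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score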

Pruning

Decision trees tend to overfit. One solution is pruning, which can be done in a variety of ways, including:

  • Reduced Error Pruning
  • Entropy Based Merging

Reduced Error Pruning

Procedure:

  • Start at leaf nodes
  • Look up branches at last decision split
  • Replace with a leaf node predicting the majority class
  • If the classification accuracy on a validation set is not affected, then keep the change

This is a simple and fast algorithm that can simplify overly complex decision trees.

Entropy Based Pruning

Procedure:

  • Choose a pair of leaf nodes with the same parent
  • Compute the entropy gain from merging them
  • If it is lower than a threshold, merge the nodes.

This doesn’t require additional data.

Reference

  1. Jo Slide Decision Tree