<img src="https://juniorworld.github.io/python-workshop-2018/img/portfolio/week7.png" width="350px">

---

# Review of Data Collection

<img src="https://juniorworld.github.io/python-workshop-2018/img/data%20collection.png" width="400px" align='left'>

---

# Natural Language Processing

<img src="https://juniorworld.github.io/python-workshop-2018/img/NLP_.png" width="700px" height="400px" align='left'>

We will demonstrate how to go through these five steps for English and Chinese texts respectively.

## 1. Data Cleaning
- Main task: convert the case and remove punctuation and special elements such as hashtags and hyperlinks
- Use Regular Expression for Pattern Matching
- Convert the case: `.lower()`

### Regular Expression Cheat Sheet
- `.` matches any single character
- `[...]` character set, matches any one of the characters inside the square brackets
- `[^x]` matches one character that is not x
- `|` an “or” operator, matches patterns on either side of the |.
- `*` matches the preceding element 0 or more times.
- `+` matches the preceding element 1 or more times.
- `?` matches the preceding element 0 or 1 time.
- `{n}` matches the preceding element exactly n times
- `(...)` groups part of a pattern (capturing group)
- `\N` backreference to group N (written `\\N` in a non-raw Python string)
- `^` matches the start of the string.
- `$` matches the end of the string.
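The cheat sheet above can be tried out with a short sketch. It uses Python's built-in `re` module so that it runs without extra installs; the sample strings are made up for illustration.

```python
import re  # built-in module; these basic patterns behave the same in the regex package

text = 'cat cot coat cut c9t'
any_char = re.findall(r'c.t', text)       # . matches any single character
char_set = re.findall(r'c[ao]t', text)    # [ao] matches "a" or "o"
negated = re.findall(r'c[^a]t', text)     # [^a] matches one character that is not "a"
either = re.findall(r'cat|cut', text)     # | matches the pattern on either side

text2 = 'ct cot coot cooot'
star = re.findall(r'co*t', text2)         # * : zero or more "o"
plus = re.findall(r'co+t', text2)         # + : one or more "o"
optional = re.findall(r'co?t', text2)     # ? : zero or one "o"
exactly_two = re.findall(r'co{2}t', text2)  # {2} : exactly two "o"

doubled = re.search(r'(\w)\1', 'hello').group()  # \1 repeats whatever group 1 matched
at_start = re.sub(r'^c', 'C', 'cat cot')  # ^ anchors the match at the start
at_end = re.sub(r't$', 'T', 'cat cot')    # $ anchors the match at the end

print(any_char, char_set, negated, either)
print(star, plus, optional, exactly_two)
print(doubled, at_start, at_end)
```

Writing patterns as raw strings (`r'...'`) keeps Python from interpreting the backslashes before the regular expression engine sees them.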

<h3 style="color:red">1a. English</h3>

In [None]:
#install regular expression package for pattern matching
! pip3 install regex

In [None]:
import regex as re

In [None]:
#Use the sub() function to find a pattern and replace every match with new text
a='I only have 100 dollars in my pocket. What I can buy?'
re.sub('I','You',a) #substitute a word with another word

In [None]:
#Substitute a word with nothing, meaning removing the word
re.sub('I','',a)

In [None]:
#Match any one character from a set by using []. "A-Z" means all capital letters.
re.sub('[A-Z]','*',a)

In [None]:
#Remove all letters in lower case


In [None]:
#Hide the numbers
re.sub('[0-9]','*',a)

In [None]:
#Remove all alphanumeric characters


In [None]:
#Remove all characters that are not alphanumeric


In [None]:
#Shortcut to remove all punctuation
re.sub(r'\p{P}+',' ',a) #\p{P} is the Unicode property class for punctuation. It is supported by the regex package, not by the built-in re module.
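The `\p{P}` shortcut depends on the `regex` package. If only the built-in `re` module is available, one possible alternative is to build a character set from `string.punctuation`. This is a sketch under that assumption, and it covers ASCII punctuation only, so curly quotes such as “ ” and ’ would be left in place.

```python
import re
import string

a = 'I only have 100 dollars in my pocket. What I can buy?'

# string.punctuation lists the ASCII punctuation marks;
# re.escape makes them safe to place inside a [...] character set
punct_pattern = '[' + re.escape(string.punctuation) + ']+'
no_punct = re.sub(punct_pattern, ' ', a)
print(no_punct)
```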

In [None]:
#Remove hashtags
a='''@JerryNadler admits on #CNN they have no proof of Obstruction by @realDonaldTrump it's just his "personal opinion" Meet the new #WitchHunt Same as the old #WitchHunt cc @DonaldJTrumpJr'''
re.sub('#[^ ]+','',a)

In [None]:
#Extract hashtags by findall() function
re.findall('#[^ ]+',a)

In [None]:
#Extract all mentions


In [None]:
#Remove hyperlinks
a='Tesla’s abrupt shift to online-only car sales, after racing to open stores, battered its share price and raised questions about its future. https://goo.gl/rwGHTP'
re.sub('https://[^ ]+|http://[^ ]+','',a)
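The two alternatives in the hyperlink pattern can also be folded into one by using the `?` quantifier from the cheat sheet. The sketch below additionally uses `\S` (any non-whitespace character), which is not on the cheat sheet, and a shortened sample sentence.

```python
import re

a = 'Tesla shifts to online-only car sales. https://goo.gl/rwGHTP'
b = 'Old-style link: http://example.com/page and some text'

# s? makes the "s" optional, so both http:// and https:// are matched;
# \S+ matches the rest of the link up to the next whitespace
link_pattern = r'https?://\S+'
cleaned_a = re.sub(link_pattern, '', a).strip()
cleaned_b = re.sub(link_pattern, '', b)
print(cleaned_a)
print(cleaned_b)
```

Unlike `[^ ]+`, `\S+` also stops at tabs and line breaks, which may matter for multi-line posts.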

In [None]:
#transform all letters to lower case
a.lower()

<h3 style='color:blue'>Practice</h3>

Create a data_cleaning() function that converts the text to lower case and removes punctuation, numbers, mentions, hashtags and hyperlinks

In [None]:
def data_cleaning(text):
    
    #Write your code here
    
    return text

In [None]:
#test your function with a post from @realDonaldTrump
a='@seanhannity “We the people will now be subjected to the biggest display of modern day McCarthyism....which is the widest fishing net expedition....every aspect of the presidents life....all in order to get power back so they can institute Socialism.” https://t.co/izb2tTrINB'
data_cleaning(a)

---
## Break
---

## Tokenization
- Definition: tokenization is the process of splitting sentences/paragraphs/documents into a list of words (tokens).
- Differences in Languages:
    - English: **words** are naturally separated with spaces
    - Korean: **phrases** are naturally separated with spaces
        - konlpy (http://konlpy.org/)
    - Chinese/Japanese: **no spaces** in text
        - Chinese: jieba (https://github.com/fxsjy/jieba)
        - Japanese: jNlp (https://github.com/kevincobain2000/jProcessing)

## Tokenize English Text: Hunt for Spaces

In [None]:
#Split the following sentence into words
sentence='Mr. Zuckerberg, who runs Facebook, Instagram, WhatsApp and Messenger, on Wednesday expressed his intentions to change the essential nature of social media. Instead of encouraging public posts, he said he would focus on private and encrypted communications, in which users message mostly smaller groups of people they know. Unlike publicly shared posts that are kept as users’ permanent records, the communications could also be deleted after a certain period of time.'
sentence=data_cleaning(sentence)
words=

In [None]:
import pandas as pd

In [None]:
pd.Series(words).value_counts()
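`value_counts()` returns the frequency of each distinct word, sorted from most to least frequent. A comparable count can be sketched with the standard library's `collections.Counter`; the short word list here is invented for illustration.

```python
from collections import Counter

words = 'the cat sat on the mat the end'.split()  # split on whitespace

counts = Counter(words)           # maps each word to its number of occurrences
most_frequent = counts.most_common(1)
print(counts['the'], most_frequent)
```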

<div class="alert alert-block alert-success">
    <b>Extra Knowledge:</b> We can use the function <font style='color:red;font-weight:bold;'>gensim.parsing.preprocessing.stem_text(text)</font> to stem the words in a sentence.</div> 

## Tokenize Chinese Text

We will use the package "jieba" to tokenize Chinese text.<br>
<br>
**Why jieba?**
- It adopts a hybrid method that combines statistical/probabilistic inference with dictionary-based pattern matching.
    - capable of recognizing words that exist in the pre-defined dictionary
    - capable of finding new words.
- Two dictionaries:
    - System dictionary
        - Simplified Chinese
        - Simplified+Traditional Chinese
    - User dictionary
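To give a feel for the dictionary half of this hybrid approach, here is a deliberately simplified sketch of greedy "forward maximum matching". It is not jieba's actual algorithm and it leaves out the statistical part entirely; the function name `forward_max_match` and the toy dictionary are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # try the longest candidate first, then shrink it one character at a time
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # a single character is always accepted so the loop cannot get stuck
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

toy_dict = {'你好', '这是', '一个', '简单', '句子'}  # toy dictionary, not jieba's
tokens = forward_max_match('你好这是一个简单的句子', toy_dict)
print(tokens)
```

Words missing from the dictionary fall apart into single characters here, which is why a segmenter also needs a way to discover new words and why adding entries to a user dictionary can change the result.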

In [None]:
! pip3 install jieba

In [None]:
import jieba

In [None]:
list(jieba.cut('你好，这是一个简单的句子。'))

In [None]:
#it can segment traditional Chinese text by using statistical inference.
list(jieba.cut('你好，這是一個簡單的句子。'))

In [None]:
#however, statistical inference is not perfect.
list(jieba.cut('談判擱置，工會號召靜坐。'))

In [None]:
list(jieba.cut('谈判搁置，工会号召静坐。'))

## Configure Dictionaries

To segment traditional Chinese text more accurately, we need to replace the system dictionary with one that also includes traditional Chinese words.<br>
Download the system dictionary from this link: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big

In [None]:
#load another dictionary to support traditional Chinese
jieba.set_dictionary('PATH OF DICTIONARY')

In [None]:
#try tokenizing this sentence again
list(jieba.cut('談判擱置，工會號召靜坐。'))

In [None]:
#Some names and special terminologies cannot be properly identified.
print(list(jieba.cut('中央上周二向特首林鄭月娥發公函'))) #very long name
print(list(jieba.cut('台灣蔡英文總統日前表示希望與日本舉行安保對話'))) #names including frequently used words
print(list(jieba.cut('高雄市長韓國瑜本月稍後訪問港澳深圳廈門四市'))) #names including frequently used words
print(list(jieba.cut('汶萊的全稱為汶萊達魯薩蘭國。'))) #special terminologies

In [None]:
#Build your user dictionary (time-consuming)
file=open('user_dict.txt','w',encoding='utf-8')
file.write('林鄭月娥\n')
file.write('蔡英文\n')
file.write('韓國瑜\n')
file.write('汶萊達魯薩蘭國\n')
file.close()

In [None]:
#Use your user dictionary
jieba.load_userdict('user_dict.txt')

In [None]:
#After loading user dictionary:
print(list(jieba.cut('中央上周二向特首林鄭月娥發公函'))) #very long name
print(list(jieba.cut('台灣蔡英文總統日前表示希望與日本舉行安保對話'))) #names including frequently used words
print(list(jieba.cut('高雄市長韓國瑜本月稍後訪問港澳深圳廈門四市'))) #names including frequently used words
print(list(jieba.cut('请问：根據碳碳键键能能否否定定律一'))) #special terminologies
print(list(jieba.cut('汶萊的全稱為汶萊達魯薩蘭國。'))) #terminologies

## Remove stop words
Stop words are very common words that carry little meaning on their own, so they add little to understanding a text.<br>
In English: at, in, on, for, of, a, an, the...<br>
In Chinese: 的，地，得，了.<br>

However, 不得了 (roughly "amazing") is not a stop word, even though it contains the stop words 得 and 了: it is used to express strong praise.<br>
So stop words should be removed by exact match on whole tokens (√), not by pattern matching on characters (×).
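The difference between the two approaches can be seen in a small sketch. The token list is hand-written for illustration (it is not jieba output), and only four stop words are used.

```python
import re

stop_words = ['的', '地', '得', '了']
tokens = ['今天', '的', '天气', '好', '得', '不得了']  # hypothetical pre-segmented sentence

# Exact match: drop a token only if the whole token is a stop word
exact = [t for t in tokens if t not in stop_words]

# Pattern matching: delete stop-word characters wherever they occur
pattern = re.sub('[的地得了]', '', ''.join(tokens))

print(exact)
print(pattern)
```

Exact matching keeps 不得了 intact, while the character pattern cuts it down to 不 and changes the meaning of the sentence.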

In [None]:
'a' in ['a','b','c']

In [None]:
'a' in ['aa','b','c']

In [None]:
'a' not in ['aa','b','c']

Chinese stop words file: https://juniorworld.github.io/python-workshop-2018/doc/stop_words_chi.txt<br>
English stop words file: https://juniorworld.github.io/python-workshop-2018/doc/stop_words_eng.txt

In [None]:
file_chi=open('FILE PATH','r',encoding='utf-8')

In [None]:
stop_words_chi=[]
for line in file_chi.readlines():
    line=line.strip() #remove line break
    stop_words_chi.append(line) #update the list of stop words line by line
file_chi.close()

In [None]:
len(stop_words_chi)

In [None]:
file_eng=open('FILE PATH','r',encoding='utf-8')

In [None]:
stop_words_eng=[]
for line in file_eng.readlines():
    line=line.strip() #remove line break
    stop_words_eng.append(line) #update the list of stop words line by line
file_eng.close()

In [None]:
len(stop_words_eng)
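Both loading loops follow the same steps, so they could be wrapped in one helper. The sketch below is one possible version: `load_stop_words` is a hypothetical name, it returns a set rather than a list (membership tests with `in` are faster on sets), and unlike the loops above it skips blank lines. A temporary file stands in for the downloaded list.

```python
import os
import tempfile

def load_stop_words(path, encoding='utf-8'):
    """Read one stop word per line and return the words as a set."""
    with open(path, 'r', encoding=encoding) as f:  # the file is closed automatically
        return {line.strip() for line in f if line.strip()}

# Small temporary file used only for this demonstration
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write('the\nof\n\nand\n')
    demo_path = tmp.name

demo_stop_words = load_stop_words(demo_path)
os.remove(demo_path)  # clean up the temporary file
print(sorted(demo_stop_words))
```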

### Absolute Match of Stop words

In [None]:
sentence='Facebook将向加密通信转型，打造以隐私为中心的平台。'
words=list(jieba.cut(sentence))
words_new=[]
for word in words:
    if word not in stop_words_chi:
        words_new.append(word)

In [None]:
words_new

In [None]:
#for loop in the list
a=[1,2,3,4,5]
b=             #increase element by one

In [None]:
#for loop and if statement in the list
a=[1,2,3,4,5]
b=[i for i in a if i<4]
b
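In general, a list comprehension has the shape `[expression for item in iterable if condition]`, where the `if` part is optional. A short sketch with an invented word list:

```python
words = ['Facebook', 'will', 'focus', 'on', 'private', 'messaging']

# transform every element: the expression is applied to each item
lengths = [len(w) for w in words]

# filter and transform: keep words longer than 5 letters, then lower-case them
long_words = [w.lower() for w in words if len(w) > 5]

print(lengths)
print(long_words)
```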

In [None]:
words_new=[word for word in list(jieba.cut(sentence)) if word not in stop_words_chi]

In [None]:
words_new

In [None]:
#Clean and tokenize this sentence and remove the stop words
sentence='Mr. Zuckerberg, who runs Facebook, Instagram, WhatsApp and Messenger, on Wednesday expressed his intentions to change the essential nature of social media. Instead of encouraging public posts, he said he would focus on private and encrypted communications, in which users message mostly smaller groups of people they know. Unlike publicly shared posts that are kept as users’ permanent records, the communications could also be deleted after a certain period of time.'





<h3 style='color:blue'>Practice</h3>

Find the top 10 fade-in words (used more over time) and the top 10 fade-out words (used less over time) in speeches.<br>
The magnitude of difference is measured by the change in their relative frequencies:<br>
<p style='text-align:center;font-size:15px;'>Relative Freq (RF) = word frequency / max word frequency</p>
<p style='text-align:center;font-size:15px;'>Difference = RF<font size='2px'>2019</font> - RF<font size='2px'>2009</font></p>

Options:<br>
- Chinese: Annual government work reports, <a href="https://juniorworld.github.io/python-workshop-2018/doc/2019_Government_Work_Report.txt">2019</a> vs <a href="https://juniorworld.github.io/python-workshop-2018/doc/2009_Government_Work_Report.txt">2009</a>
- English: State of the Union address, <a href="https://juniorworld.github.io/python-workshop-2018/doc/2019_SoU.txt">2019</a> vs <a href="https://juniorworld.github.io/python-workshop-2018/doc/2009_SoU.txt">2009</a><br>

*Hint:*<br>
*1. You can use `pd.concat([df1,df2],axis=1)` to combine two data frames by columns*<br>
*2. You can use `df.fillna(0)` to replace NaN values with 0.*<br>
*3. You can use `df.sort_values(column_name)` to sort the data frame by a given column.* 
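To make the two formulas concrete, here is a small worked sketch on invented word lists (they are not taken from the speeches). It uses only the standard library instead of the pandas functions in the hints, and `relative_freq` is a hypothetical helper name.

```python
from collections import Counter

def relative_freq(words):
    """Relative frequency = word count / count of the most frequent word."""
    counts = Counter(words)
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}

# Toy token lists standing in for the cleaned, tokenized documents
words_2009 = ['economy', 'economy', 'jobs', 'jobs', 'jobs', 'jobs', 'recovery']
words_2019 = ['economy', 'economy', 'economy', 'economy', 'jobs', 'border', 'border']

rf_2009 = relative_freq(words_2009)
rf_2019 = relative_freq(words_2019)

# A word missing in one year gets RF 0, the same idea as fillna(0)
vocabulary = set(rf_2009) | set(rf_2019)
difference = {w: rf_2019.get(w, 0) - rf_2009.get(w, 0) for w in vocabulary}

# Most negative differences are fade-out words, most positive are fade-in words
ranked = sorted(difference.items(), key=lambda item: item[1])
print(ranked)
```

In this toy data "jobs" drops from an RF of 1.0 to 0.25, so it is the strongest fade-out word, while "border" appears only in the later list and fades in.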

In [None]:
file_2019=open('FILE PATH','r',encoding='utf-8')
file_2009=open('FILE PATH','r',encoding='utf-8')

In [None]:
#Write Your Code Here




