# WEEK 6: Web Crawling & Twitter API

---

## Web Crawling
We will introduce two methods to collect data: web crawling and calling an API.<br>
Web crawling means designing an automatic bot that imitates human browsing behavior.

### Understanding HTML
- HTML stands for **HyperText Markup Language**, which is used to define the structure of a web page.
- HTML content is hierarchical and structured.
    - Basic elements: `Tag` and `Text`
    - Text is the content shown on the screen. **Tags are not displayed; the browser uses them to decide how to render the text.**
    - Text is wrapped by start and end tags.
    - Tag: denoted by a pair of angle brackets <>
        - Start Tag
            - Tag Name
            - Attributes (optional): attributes provide additional information about the element
                - Attribute Name
                - Attribute Value
            - format: <...>
        - End Tag
            - format: </...>
        - Most tags are used in pairs, <font style="color:red">except void tags such as the line break tag <b>&lt;br&gt;</b> and the input box tag <b>&lt;input&gt;</b>, which have no end tag</font>.
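
To see the tag/text structure described above, here is a minimal sketch that uses Python's built-in `html.parser` module to list start tags (with their attributes), end tags, and text. The `TagLogger` class and the `snippet` string are made up for this illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record start tags (with attributes), end tags, and visible text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (attribute name, attribute value) pairs
        self.events.append(('start', tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        # skip whitespace-only text between tags
        if data.strip():
            self.events.append(('text', data.strip()))


# illustrative snippet: <p> is a paired tag, <input> and <br> have no end tag
snippet = '<p>Please input your user name:</p><input type="text" value="junior"><br>'

logger = TagLogger()
logger.feed(snippet)
logger.close()

for event in logger.events:
    print(event)
```

Only `<p>` produces an `end` event here; `<input>` and `<br>` appear as start tags with no matching end tag.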

---

### Input Types

```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <a href="https://juniorworld.github.io/python-workshop-2018/">Go to our Home Page</a>
    <p>Please input your user name:</p>
    <input type="text">
    <p>Please input your password:</p>
    <input type="password">
    <br>
    <input type='radio'> Do you like Python?
    <br>
    <input type='radio'> Do you like HTML?
    <br>
    <input type="submit">
    <input type="reset">
  </body>
</html>
```

To assign a default value, use the `value` attribute.

```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <a href="https://juniorworld.github.io/python-workshop-2018/">Go to our Home Page</a>
    <p>Please input your user name:</p>
    <input type="text" value="junior">
    <p>Please input your password:</p>
    <input type="password" value="123">
    <br>
    <input type="submit">
    <input type="reset">
  </body>
</html>
```

### Publish HTML page
Please save your HTML code as a file named "week5.html".
Double-click the file to render the page locally in your browser.
If you have a server, you can upload this file to it and publish it as an online web page.

#### <font style="color: blue">Practice:</font>
<font style="color: blue">Please create a page as the screen, save it as "week5_practice.html" and render it in your computer.</font>

## Using Selenium

We will use the `selenium` package to collect data. It works on both static and dynamic websites.<br>
Please download the Chrome driver that matches your Chrome version. This link is for version 73: https://chromedriver.storage.googleapis.com/index.html?path=73.0.3683.20/

In [None]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

In [None]:
driver=webdriver.Chrome(executable_path='C:\\Python27\\selenium\\webdriver\\chrome\\chromedriver.exe') #load the browser

In [None]:
driver.get('file:///C:/Users/yuner/Desktop/week5.html') #use absolute path to open local html file

In [None]:
driver.title #print the title

In [None]:
driver.current_url #get the url of the page

## Locate Element by Xpath

We can locate elements by their relative or absolute paths in the file, with additional hints about their tag name, attribute name, and attribute value.<br>
- XPath is an expression that describes the path to an HTML element
    - `/` is the sign of an **absolute path**:
        - if used at the beginning: the XPath starts from the root node
        - if used in the middle: it refers to the element **at the next level**
            - e.g. the XPath of &lt;body&gt; can be written as "html/body" or "/html/body".
            - If you write "/body", no element is matched and Selenium raises an error, because &lt;body&gt; is not the root node.
    - `//` is the sign of a **relative path**: it refers to any element that matches the pattern, no matter where it is.
        - e.g. the XPath of &lt;body&gt; can be written as "//body"
    - `[@attribute name=attribute value]`: we can include an attribute in the matching pattern
        - e.g. "//input[@type='reset']"
        - The most useful attribute is `id`, because an `id` is supposed to identify a single element on the page.
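
The path rules above can be tried without a browser. The sketch below uses Python's built-in `xml.etree.ElementTree`, which supports only a limited subset of XPath and needs well-formed XML, so the void tags are written as `<input ... />`. Paths are evaluated relative to the root `<html>` element: `body` plays the role of `/html/body`, and `.//input` plays the role of `//input`. The `PAGE` string is a shortened copy of the practice page, made up for this illustration.

```python
import xml.etree.ElementTree as ET

# A small, well-formed page (void tags are self-closed so the XML parser accepts them)
PAGE = """
<html>
  <head><title>This is a title</title></head>
  <body>
    <p>Please input your user name:</p>
    <input type="text" value="junior"/>
    <p>Please input your password:</p>
    <input type="password" value="123"/>
    <input type="submit"/>
    <input type="reset"/>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# "body" steps one level down from <html>, like /html/body in full XPath
body = root.find("body")

# ".//input" matches <input> elements at any depth, like //input
inputs = root.findall(".//input")

# an attribute predicate narrows the match, like //input[@type='reset']
reset = root.find(".//input[@type='reset']")

# texts of all <p> elements, in document order
paragraphs = [p.text for p in root.findall(".//p")]

print(body.tag, len(inputs), reset.get("type"))
print(paragraphs)
```

Selenium evaluates full XPath inside the browser, so expressions such as `/html/body` and `//input` work there as written.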

In [None]:
#you can use find_element_by_xpath function to find the element by relative xpath
body=driver.find_element_by_xpath('//body')

In [None]:
body.text #get the text of the matched element

In [None]:
#or by absolute xpath
body=driver.find_element_by_xpath('/html/body')
print(body.text)

In [None]:
#use find_elements_by_xpath function to find a list of elements with shared pattern
inputs=driver.find_elements_by_xpath('//input')

In [None]:
len(inputs)

In [None]:
first_input=inputs[0]
print(first_input.get_attribute('value'))

In [None]:
ps=driver.find_elements_by_xpath('//p')

In [None]:
print(len(ps)) #count how many <p> are in the html
print(ps[0].text) #first element's text
print(ps[1].text) #second element's text

## Imitate Browsing Behavior

Some frequently used behaviors:
1. Click: `element.click()`
2. Type: `element.send_keys('something')`
3. Clear existing content: `element.clear()`
4. Scroll: 
    - Scroll to bottom: `driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")`
    - Scroll to a specific location: e.g. scroll to the position 400px below the top of the page, `driver.execute_script("window.scrollTo(0, 400);")`

In [None]:
#clear the default name and fill in your name
name_box=inputs[0]
name_box.clear()
name_box.send_keys('your name')

In [None]:
#clear the default password and fill in any random keys





In [None]:
#click the link "Go to our Home Page"
link=driver.find_element_by_xpath('//a')
link.click()

In [None]:
#navigate to another online page and inspect the page
driver.get('https://juniorworld.github.io/python-workshop-2018/week5/1.html')

In [None]:
#copy the XPath from the inspect window and paste it between the quotes
Q1=driver.find_element_by_xpath('')
print(Q1.text)
Q2=driver.find_element_by_xpath('')
print(Q2.text)

In [None]:
#click the submit button
#option 1: paste the XPath copied from the inspect window (it does not use attributes other than id)
#submit=driver.find_element_by_xpath('')
#option 2: write the XPath yourself
submit=driver.find_element_by_xpath('//input[@type="submit"]')
submit.click()

#### <font style="color: blue">Practice:</font>
<font style="color: blue">Open Google page (https://www.google.com/), search for "JMSC" and click the "Google Search" button.</font>

In [None]:
#write your code here






In [None]:
#collect all results on the first page
results=driver.find_elements_by_xpath('//div[@class="rc"]')

In [None]:
#how many results are listed on the first page
len(results)

In [None]:
#print every result
for result in results:
    result_link=result.find_element_by_xpath('div[@class="r"]/a') #we can also find an element under the current node
    result_link_text=result_link.find_element_by_xpath('h3').text
    result_link_href=result_link.get_attribute('href')
    result_description=result.find_element_by_xpath('div[@class="s"]').text
    print(result_link_text,result_link_href,result_description)

In [None]:
#save results
with open('week5_google.txt','w',encoding='utf-8') as output_file: #the file is closed automatically at the end of the block
    for result in results:
        result_link=result.find_element_by_xpath('div[@class="r"]/a') #we can also find an element under the current node
        result_link_text=result_link.find_element_by_xpath('h3').text
        result_link_href=result_link.get_attribute('href')
        result_description=result.find_element_by_xpath('div[@class="s"]').text
        output_file.write(result_link_text+'\t'+result_link_href+'\t'+result_description+'\n')

---
# Break
---

## Twitter API
API stands for Application Programming Interface. It is provided and maintained by an IT company as the official way to fetch data from its servers automatically. Almost all IT giants, such as Twitter, Facebook and Google, have their own APIs. Knowing how to use an API is therefore a critical skill for anyone who wants to do social media analytics.
Please follow these instructions to apply for Twitter API access: https://juniorworld.github.io/python-workshop-2018/doc/Instructions_on_Twitter_API.pdf

In [None]:
import requests
import time
import base64
import pandas as pd

In [None]:
#Authorize your App

api_key = 'API KEY'
api_secret = 'API SECRET KEY'

key_secret = api_key+':'+api_secret
b64_encoded_key = base64.b64encode(key_secret.encode('ascii')).decode('ascii')

auth_url = 'https://api.twitter.com/oauth2/token'

auth_headers = {
    'Authorization': 'Basic '+b64_encoded_key,
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8'
}

auth_data = {
    'grant_type': 'client_credentials'
}

auth_resp = requests.post(auth_url, headers=auth_headers, data=auth_data)

In [None]:
auth_resp.status_code #status code "200" means authorization succeeds, "400" bad request, "401" unauthorized, "403" forbidden 

In [None]:
access_token=auth_resp.json()['access_token'] #get your bearer access token

In [None]:
headers = {'Authorization': 'Bearer '+access_token} #we will use this header throughout the course
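
The `Basic` credential sent in the authorization step is plain base64 encoding of `key:secret`, not encryption, so anyone who sees the header can recover both values. A minimal offline sketch, using made-up placeholder values for the key and secret:

```python
import base64

api_key = 'my-key'        # hypothetical placeholder, not a real key
api_secret = 'my-secret'  # hypothetical placeholder, not a real secret

# same construction as in the authorization cell
key_secret = api_key + ':' + api_secret
b64_encoded_key = base64.b64encode(key_secret.encode('ascii')).decode('ascii')
auth_header = 'Basic ' + b64_encoded_key

# decoding the token gives back the original key and secret
decoded = base64.b64decode(b64_encoded_key).decode('ascii')

print(auth_header)
print(decoded)
```

This is why the key, the secret, and the resulting bearer token should be kept out of notebooks you share.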

## Search API
We can use the Search API to search for posts or users on Twitter.

### 1. Search for Posts

Since we are using the free version of the API, we can only collect posts from the past 7 days. You can work around this limit by scheduling a program that collects data at least once every 7 days.<br>
The Search API works in a way similar to Twitter advanced search: https://twitter.com/search-advanced<br>
The key to searching is to create a query URL that contains the search parameters.
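
As a rough illustration of what such a query URL looks like, the sketch below builds one by hand with the standard library. When you pass a `params` dictionary to `requests.get`, `requests` does the equivalent encoding for you, so this is only meant to show the result.

```python
from urllib.parse import urlencode

search_url = 'https://api.twitter.com/1.1/search/tweets.json'
params = {
    'q': '"#hongkong"',       # search string
    'result_type': 'recent',  # mixed, recent, popular
    'count': 100              # up to 100
}

# special characters are percent-encoded: " becomes %22 and # becomes %23
query_url = search_url + '?' + urlencode(params)
print(query_url)
# https://api.twitter.com/1.1/search/tweets.json?q=%22%23hongkong%22&result_type=recent&count=100
```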

In [None]:
search_url = 'https://api.twitter.com/1.1/search/tweets.json'

In [None]:
params = {
    'q': '"#hongkong"', #search string
    'result_type': 'recent', #mixed,recent,popular
    'count': 100 #up to 100
}

search_resp = requests.get(search_url, headers=headers, params=params)

In [None]:
type(search_resp.json())

In [None]:
search_resp.json().keys()

In [None]:
print(len(search_resp.json()['statuses']))  #a list of tweet objects

In [None]:
search_resp.json()['statuses'][0].keys()

In [None]:
results=search_resp.json()['statuses'] #save first 100 results

For more information about tweet object, please refer to: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json

### 2. Navigate to next page of results (step-by-step breakdown)

In [None]:
#have a look at the metadata
search_resp.json()['search_metadata'] #the link of next_results is the one we need

In [None]:
#let's run the search for the next page
next_page=     #please extract the link from the dictionary and save it as "next_page" variable
search_resp=requests.get(search_url+next_page,headers=headers)

In [None]:
len(search_resp.json()['statuses']) #another 100 posts are in place

In [None]:
#update the results
results.extend(search_resp.json()['statuses'])
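
The `next_results` value is itself a query string, which is why it can simply be appended to `search_url`. The sketch below takes one apart to show what it carries. The `search_metadata` dictionary is a made-up example shaped like the real metadata; the `max_id` value is invented.

```python
from urllib.parse import urlsplit, parse_qs

search_url = 'https://api.twitter.com/1.1/search/tweets.json'

# hypothetical metadata, shaped like search_resp.json()['search_metadata']
search_metadata = {
    'count': 100,
    'next_results': '?max_id=1111111111111111110&q=%22%23hongkong%22&count=100&include_entities=1',
}

# .get returns None instead of raising an error when there is no next page
next_page = search_metadata.get('next_results')
next_url = search_url + next_page

# split the query string back into its parameters
next_params = parse_qs(urlsplit(next_url).query)
print(next_params['max_id'], next_params['q'])
```

The same search string comes back together with a `max_id`, which tells the API to return only posts at or below that ID.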

### 3. Navigate to next N pages of results (integrated)

In [None]:
#you can use a for loop to collect a specific number of pages
for page in range(5):
    metadata=search_resp.json()['search_metadata']
    if 'next_results' not in metadata: #stop early when there is no next page
        break
    search_resp=requests.get(search_url+metadata['next_results'],headers=headers)
    results.extend(search_resp.json()['statuses'])
    print(page+1,'pages have been collected')
    time.sleep(15)
print('DONE!')

In [None]:
#you can use a while loop to exhaust all posts
#Reminder: put some time delay so that you won't exceed the rate limit
page=0
while 'next_results' in search_resp.json()['search_metadata'].keys():
    page+=1
    next_page=search_resp.json()['search_metadata']['next_results']
    search_resp=requests.get(search_url+next_page,headers=headers)
    results.extend(search_resp.json()['statuses'])
    print(page,'pages have been collected')
    time.sleep(15)
print('DONE!')
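
The paging logic above can be wrapped in a function and tried offline. This is a sketch under assumptions, not part of the Twitter API: `collect_all`, `fetch_page`, and `FAKE_PAGES` are names invented here, and the fake three-page feed stands in for real responses.

```python
import time


def collect_all(fetch_page, first_response, delay_seconds=0.0, max_pages=None):
    """Follow 'next_results' links until none is left or max_pages is reached."""
    results = list(first_response['statuses'])
    response = first_response
    pages = 0
    while 'next_results' in response['search_metadata']:
        if max_pages is not None and pages >= max_pages:
            break
        response = fetch_page(response['search_metadata']['next_results'])
        results.extend(response['statuses'])
        pages += 1
        time.sleep(delay_seconds)  # pause between requests to respect the rate limit
    return results


# fake feed standing in for the API: page 1 links to page 2, which links to page 3
FAKE_PAGES = {
    '?page=2': {'statuses': [{'id': 3}, {'id': 4}],
                'search_metadata': {'next_results': '?page=3'}},
    '?page=3': {'statuses': [{'id': 5}],
                'search_metadata': {}},
}
first = {'statuses': [{'id': 1}, {'id': 2}],
         'search_metadata': {'next_results': '?page=2'}}

tweets = collect_all(FAKE_PAGES.__getitem__, first)
print([t['id'] for t in tweets])  # [1, 2, 3, 4, 5]
```

With the real API, `fetch_page` could be something like `lambda query: requests.get(search_url+query, headers=headers).json()`, with `delay_seconds=15` as in the loops above.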

### 4. Preliminary Analysis

In [None]:
#turn results into a dataframe
table=pd.DataFrame.from_records(results)

In [None]:
table.columns

### 4(a) Co-hashtag Analysis

In [None]:
table['entities'][0].keys() #entities is a dictionary of items parsed from the text, such as hashtags, mentions and urls

For more information about entities, please refer to official documentation: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/entities-object

In [None]:
hashtags=[]
for entity in table['entities']:
    for hashtag in entity['hashtags']:
        hashtags.append(hashtag['text'])

In [None]:
len(hashtags)

In [None]:
hashtag_freq=pd.Series(hashtags).value_counts() #frequency distribution of hashtags

In [None]:
hashtag_freq.head() #first 5 rows

In [None]:
#convert uppercase to lowercase
hashtags=[i.lower() for i in hashtags]
hashtag_freq=pd.Series(hashtags).value_counts() #data type: Series, index: hashtag
hashtag_freq.head()
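
The same counting idea can be sketched with the standard library alone. The three tweets below are invented and only mimic the `entities` structure. Unlike the cells above, this version also drops the seed hashtag itself and counts each hashtag at most once per tweet.

```python
from collections import Counter

# made-up tweets that mimic the 'entities' part of a tweet object
tweets = [
    {'entities': {'hashtags': [{'text': 'HongKong'}, {'text': 'Protest'}]}},
    {'entities': {'hashtags': [{'text': 'hongkong'}, {'text': 'protest'}, {'text': 'China'}]}},
    {'entities': {'hashtags': [{'text': 'HONGKONG'}]}},
]

seed = 'hongkong'
co_hashtags = Counter()
for tweet in tweets:
    # lowercase so that 'HongKong' and 'hongkong' are treated as the same hashtag
    tags = {h['text'].lower() for h in tweet['entities']['hashtags']}
    if seed in tags:
        co_hashtags.update(tags - {seed})  # count every hashtag except the seed

print(co_hashtags.most_common())  # [('protest', 2), ('china', 1)]
```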

In [None]:
pd.DataFrame(hashtags).to_csv('hashtags.txt',index=False,header=False) #one hashtag per line, without row numbers or a header

You can create a word cloud of the co-hashtags of #hongkong at https://wordcloud.timdream.org/

#### <font style="color: blue">Practice:</font>
---
<font style="color: blue">Please collect the 500 most recent tweets that use the hashtag #FinishTheWall and visualize its co-hashtags with a word cloud.<br>
   Please use a variable name other than "table" to store your results, because we will use "table" again later. 
</font>

In [None]:
#Write your code here






---
## Break
---

### 4(b) User Analysis
REMINDER: More and more people are concerned about privacy on social networking sites. To avoid privacy problems, carefully remove identifiers such as user ids, screen names and profile pictures before you present your findings to the public.
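
One way to act on this reminder is to drop the identifying fields before you share any data. The sketch below is only an illustration: `strip_identifiers` and `IDENTIFYING_FIELDS` are names invented here, the sample user is made up, and the field list is a starting point rather than a complete list of everything that could identify a person.

```python
# fields of a user object that directly identify an account (not exhaustive)
IDENTIFYING_FIELDS = {
    'id', 'id_str', 'name', 'screen_name',
    'profile_image_url', 'profile_image_url_https',
}


def strip_identifiers(user):
    """Return a copy of a user dictionary without the identifying fields."""
    return {key: value for key, value in user.items()
            if key not in IDENTIFYING_FIELDS}


# made-up user, shaped like table['user'][0]
sample_user = {
    'id': 123,
    'id_str': '123',
    'name': 'Some Name',
    'screen_name': 'some_name',
    'profile_image_url_https': 'https://example.com/avatar.png',
    'followers_count': 42,
    'statuses_count': 7,
}

print(strip_identifiers(sample_user))  # {'followers_count': 42, 'statuses_count': 7}
```

Free-text fields such as a profile description can also reveal who someone is, so check what remains before publishing.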

In [None]:
#the target column is 'user'
table['user'][0].keys()

In [None]:
#convert the list of user dictionaries into a data frame
users=pd.DataFrame.from_records(table['user'])

In [None]:
unique_users=users.drop_duplicates('id')
unique_users.sort_values('followers_count',ascending=False)['screen_name'].head() #display the five users with the most followers

In [None]:
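#use plotly for visualization
#from plotly version 4 onward, the online module below is provided by the chart_studio package (import chart_studio.plotly as py)
import plotly.plotly as py
import plotly.graph_objs as go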

py.sign_in('USER NAME', 'API TOKEN')

In [None]:
trace=go.Histogram(x=unique_users['followers_count'],xbins={'start':0,'end':10000,'size':100})
py.iplot([trace],filename='histogram')
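
To make the `xbins` setting above concrete: `start`, `end` and `size` describe bins of width 100 between 0 and 10000. The sketch below reproduces that binning in plain Python on invented follower counts. `bin_counts` is a name made up for this illustration, and it simply leaves out values outside the range.

```python
from collections import Counter


def bin_counts(values, start, end, size):
    """Count how many values fall into each bin [left, left+size) between start and end."""
    counts = Counter()
    for value in values:
        if start <= value < end:  # values outside the range are left out
            left_edge = start + ((value - start) // size) * size
            counts[left_edge] += 1
    return dict(sorted(counts.items()))


followers = [5, 42, 99, 100, 250, 9999, 15000]  # made-up follower counts
print(bin_counts(followers, 0, 10000, 100))  # {0: 3, 100: 1, 200: 1, 9900: 1}
```

Each key is the left edge of a bin, so `0: 3` means three users have between 0 and 99 followers.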

#### <font style="color: blue">Practice:</font>
---
<font style="color: blue">
    1. Find the screen names of the 5 most active users, i.e. the users who generated the most posts.<br>
    2. Create a histogram to visualize the frequency distribution of user activity. User activity is the number of posts generated by each user. So the first bar in the graph should represent the number of users with only 1 post, and the second bar the number of users with 2 posts.
</font>

Hint: you can use the `value_counts()` method of a pandas Series to get the frequencies of users.

In [None]:
#Write your code here




