

Section 1: Knowing Python

Python is a Programming Language

  • Element: English words, numbers, and special characters.
  • Syntax: sensitive to letter case and indentation
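To make the two syntax rules concrete, here is a minimal sketch. The variable names (score, Score, label) are made up for illustration only.

```python
# Case sensitivity: "score" and "Score" are two different names.
score = 90
Score = 60
print(score == Score)  # False, because 90 is not equal to 60

# Keywords are case-sensitive too: True is a keyword, but TRUE is just
# an undefined name, so evaluating it fails with NameError.
try:
    eval("TRUE")
    keyword_error = None
except NameError as err:
    keyword_error = type(err).__name__
print(keyword_error)  # NameError

# Indentation sensitivity: the indented line belongs to the "if" block.
label = "fail"
if score >= 60:
    label = "pass"  # runs only when the condition is True
print(label)  # pass

# Removing the indentation makes the code invalid.
bad_code = "if score >= 60:\nlabel = 'pass'"
try:
    compile(bad_code, "<example>", "exec")
    indent_error = None
except IndentationError as err:
    indent_error = type(err).__name__
print(indent_error)  # IndentationError
```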

Why is Python so popular?

  • Open source: free and fast-developing
  • General-purpose: suitable for computer scientists, statisticians, enterprises and generally everyone
  • Development efficiency: low-cost, fast to write and requiring fewer lines of code (clean and tidy)
  • Extensive libraries: statistical analytics, visualization, machine learning

If you only have time to learn one new programming language, Python would be your best choice.

Typical Workflow

  1. Problem Definition
    • Supervised Learning (SL): Answers are provided.
      • Regression
      • Classification
      • Classification can be seen as a special type of Regression whose outcome measure is a categorical variable, such as political affiliation (Democratic, Neutral or Republican)
    • Unsupervised Learning (UL): Answers are not provided.
      • Clustering
      • Dimension Reduction
    • SL dominates at present, whereas UL is believed to be the future of data science. In reality, people more often act without knowing the rules and without an omniscient _god_ who corrects every mistake they make. Besides, in the real world there is far more unlabeled data than labeled data.

| Major Category | Minor Category | RQ | Method | Application |
| --- | --- | --- | --- | --- |
| Supervised Learning | Regression | What is the score of A? | Pearson Regression, KNN, Neural Network | Stock price prediction, Machine translation |
| Supervised Learning | Classification | Which category does A belong to? | Naive Bayesian, KNN, Logistic Regression | Image Recognition, Sentiment Analysis |
| Unsupervised Learning | Clustering | How many clusters are there? | KMeans, Gaussian Mixtures, Agglomerative Clustering | Fraud Detection, User Segmentation |
| Unsupervised Learning | Dimension Reduction | How many dimensions are sufficient to represent data variance? | PCA/SVD, LDA, word2vec | Visualization, Vector Space, Image Recognition, Clustering based on PCA |
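
To illustrate the difference between "answers are provided" and "answers are not provided", here is a toy sketch in plain Python. The data (hours studied and exam results) are invented, and the two helper functions are hypothetical simplifications: the first is a 1-nearest-neighbor classifier, and the second is a simple largest-gap split, which stands in for a real clustering method such as KMeans and is not the same algorithm.

```python
# Supervised learning: every training example comes with an answer (label).
# Hypothetical data: hours studied -> exam result.
labeled_data = [(1, "fail"), (2, "fail"), (8, "pass"), (9, "pass")]

def predict_nearest(hours):
    """Classify by copying the label of the closest training example."""
    nearest = min(labeled_data, key=lambda pair: abs(pair[0] - hours))
    return nearest[1]

print(predict_nearest(7))  # pass
print(predict_nearest(3))  # fail

# Unsupervised learning: no labels are given, so we can only look for structure.
# Hypothetical data: the same hours, without any labels attached.
unlabeled_data = [1, 2, 8, 9]

def split_at_largest_gap(values):
    """Form two clusters by cutting the sorted values at their widest gap."""
    ordered = sorted(values)
    gaps = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]

print(split_at_largest_gap(unlabeled_data))  # ([1, 2], [8, 9])
```

In the first half the program is told which results are "pass" or "fail" and learns to copy them; in the second half it receives only numbers and has to discover the two groups on its own.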

Identifying meaningful research questions (RQs) is an important capacity, which requires comprehensive tacit knowledge on top of explicit knowledge. In this workshop series, we can only cover explicit knowledge such as know-what and know-how. To cultivate the capacity for identifying RQs, you need to practice in your spare time, which helps you internalize explicit knowledge and transform it into tacit knowledge.

  1. Association between Research Tasks and Python Libraries to learn
    • Data Collection
      • Selenium
    • Data Cleaning (Preprocessing)
      • Pandas
    • Data Exploration
      • Numpy
      • Pandas
    • Model Development
      • Scikit-learn
      • TensorFlow (optional)
    • Visualization
      • Plot.ly

Section 2: Installing Python

A. install python environment: Download Link

Python 3.X is recommended. The number following "Python", i.e. 3.X, is the 'version number', which is composed of a major version number (3) and a minor version number (X). The greater the number, the more recent the release.
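
Once Python is installed, you can ask it for its own version number. The sketch below uses the standard sys module; the variable names major and minor are just illustrative.

```python
import sys

major = sys.version_info.major  # the 3 in 3.X
minor = sys.version_info.minor  # the X in 3.X
print(f"Running Python {major}.{minor}")

# Version numbers compare part by part, so (3, 10) is newer than (3, 9),
# even though the text "3.10" sorts before "3.9".
print((3, 10) > (3, 9))  # True
print("3.10" > "3.9")    # False
```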

After 11 years of development, Python 3 has become very mature and stable. Almost all libraries and packages support Python 3.X, and some important libraries, like TensorFlow, are available only to Python 3.X users.

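**Tip:** Older versions of Mac OS come with Python 2.7 installed by default. Python 2.7 and 3.X can be installed side by side without interfering with each other, and you can switch between the two freely. Note, however, that code written for one version does not always run on the other.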

B. run python environment by CLIs:

  1. Open Command Line Interface (CLI)

    A CLI is an interface that lets you write textual instructions line by line to make the computer do something at your command.

    _There are several CLIs available in Mac OS and Windows._

    Mac OS:

    • Terminal (built-in CLI): Spotlight Search/Launchpad -> Type "Terminal" -> Type "python3" (the standard command; "py3" works only if you have set it up as an alias)
    • IDLE (python-specific CLI): Spotlight Search/Launchpad -> Type "IDLE"

    Windows:

    • PowerShell (built-in CLI): Start Menu -> Type "PowerShell" -> Type "py"
    • Command Prompt (built-in CLI): Start Menu -> Type "Command Prompt" -> Type "py"
    • Python Command Line (python-specific CLI): Start Menu -> Type "Python" -> Choose "Python (Command Line)"
**Tip:** Windows users can use shortcuts to open PowerShell/Command Prompt.

PowerShell: Windows+R -> Type "powershell"

Command Prompt: Windows+R -> Type "cmd"

Mac OS IDLE:

_number inside red box is the **version number**_

Mac OS Terminal:

_string inside the blue box indicates **current directory**_

Windows Command Prompt:


  1. Run some basic commands
  • math
    • add: 1+2
    • subtract: 1-2
    • multiply: 1*2
    • divide: 1/2
    • power: 2**2
  • string
    • echo: 'hi'
    • extend: 'hi'+'!'
    • duplicate: 'a'*20
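  • comparison: also called logical conditions, because the return value is a Boolean value, i.e. True or False.
    • equal: 1 == 0 (avoid 1 is 0, since "is" tests whether two names refer to the same object rather than whether two values are equal)
    • not equal: 1 != 0 (likewise, avoid 1 is not 0)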
    • less than: 1 < 0
    • less than or equal: 1 <= 0
    • greater than: 1 > 0
    • greater than or equal: 1 >= 0
  • boolean operation:
    • and: True and False, (1>0) and (1==0)
    • or: True or False, (1>0) or (1==0)
    • not: not True, not (1==0)
  • variable:
    • definition: Variables point to values. They are nicknames for values, chosen by the user.
    • assign values to variable: a = 1, here "a" is the variable name and 1 is its value.
    • increase variables by 1 unit: a = a + 1 or a += 1
    • decrease variables by 1 unit: a = a - 1 or a -= 1
**Warning**: Python is a case-sensitive language. So, if you try "1 IS 0", it will return an error message.
**Tip:** To quickly recall the last line, press the Up arrow key (↑). To recall the second-to-last line, press it twice, and so on.
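
The basic commands above can be collected into one short script. This is only a sketch: the variable names (greeting, line, x, y, a) are made up for illustration, and print is used so that every result is shown when the lines are run together.

```python
# Math
print(1 + 2)   # 3
print(1 / 2)   # 0.5 (in Python 3, "/" returns a float)
print(2 ** 2)  # 4

# Strings
greeting = 'hi' + '!'
line = 'a' * 5
print(greeting)  # hi!
print(line)      # aaaaa

# Comparison and boolean operations
print(1 != 0)                # True
print((1 > 0) and (1 == 0))  # False
print((1 > 0) or (1 == 0))   # True
print(not (1 == 0))          # True

# "==" compares values; "is" checks whether two names point to the same object.
x = [1, 2]
y = [1, 2]
print(x == y)  # True: same content
print(x is y)  # False: two separate lists

# Variables
a = 1
a += 1    # a is now 2
a -= 1    # a is back to 1
print(a)  # 1
```

The list example shows why "==" is the safer choice for comparing values: two lists with identical content are equal, yet they are not the same object.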

  1. Exit CLI
    • Use Exit Function: exit()
    • Shortcut: Ctrl+Z -> Press Enter (Windows), or Ctrl+D (Mac OS)

  1. Navigate CLI
    • Use the command cd [folder name] to move to a child folder of the current folder, or to any folder given by its absolute path
      • child folder: "cd desktop"
      • folder with absolute path: "cd C:\Users\yuner\Desktop"
    • Use the command cd .. (with a space before the two dots) to move to the parent folder of the current folder
**Tip**: To find the absolute path in Mac OS, right-click the target folder and either 1) select Get Info -> Where, then press Command (⌘) + C to copy it; or 2) hold the OPTION key and select Copy as Pathname.
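
The ideas of current folder, child folder, parent folder and absolute path can also be inspected from inside Python. The sketch below uses the standard pathlib module; it only builds paths and does not change folders, and "desktop" is a hypothetical folder name that need not exist.

```python
from pathlib import Path

current = Path.cwd()          # the current folder, as an absolute path
print(current.is_absolute())  # True

parent = current.parent       # the folder that "cd .." would move to
child = current / "desktop"   # the folder that "cd desktop" would move to

print(child.name)               # desktop
print(child.parent == current)  # True
```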

  1. Summary

    Generally speaking, every operating system is equipped with Linux-like **built-in CLIs** by default. They can be switched to the Python environment instantly with a single command, e.g. "py" on Windows or "python3" on Mac OS.

    Besides, a **Python-specific CLI** is provided after installation. You can find it by searching for its name in the Start Menu or Spotlight. Unlike the built-in CLIs, it runs in the Python environment by default, so it needs no extra command to switch into it.

    However, the biggest weakness of CLIs is that you can only write commands **line by line**, which is inefficient and even disruptive to integrative thinking. To overcome this, we will introduce two alternatives to CLIs, namely **Jupyter Notebook** and **Sublime Text (optional)**.


C. Alternatives to CLIs


Other than CLIs, we can choose to use external software to run the Python environment.

Here we will first learn an interactive python editor called **Jupyter Notebook**, which has been widely adopted as a norm in Silicon Valley.

  1. install jupyter via pip
    • pip:
      • pip is Python's package installer, which is installed along with Python.
      • usage: you can use pip by typing pip3 install [library name] in a built-in system CLI (Terminal/PowerShell/Command Prompt).
      • example: to install Jupyter Notebook, type pip3 install jupyter
**Tip**: Here we use "pip3" instead of "pip" because "pip3" is tied to the Python 3 installation, so the library is installed for Python 3.X rather than for any Python 2 that may also be present.
  1. run jupyter notebook
    • type "jupyter notebook" in system built-in CLIs
    • it will automatically open a new page in your default browser
    • the page provides a view of current folder

**Tip**: Even though it looks like a web page, Jupyter Notebook runs on your local computer. You can tell from the address "localhost:8888": "localhost" means the page is served from your own machine, and "8888" is the port number.
  1. create new notebook and run cells
    • Click New▾

    • Select "Notebook" -> "Python 3"
    • The unit of code here is the Cell, not the Line. You can write multiple lines in one cell and run all of them as a batch.
    • How to run Cell:
      • Click ▶ Run
      • Use shortcut: Ctrl+Enter (Ctrl+Return on Mac OS)
    • Repeat what we have learned for CLIs
      • maths
      • string
      • comparing values
      • boolean operation
      • assign values to variables
    • A new function called print, which prints out the given values
      • print(1==2)
  1. rename and save notebook
  1. relocate notebook and re-open it
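
As a sketch of what a single notebook cell might contain, the lines below can be typed into one cell and run in a batch. The variable names (price, quantity, total) are made up for illustration.

```python
# Everything below could be typed into a single notebook cell and run at once.
price = 20
quantity = 3
total = price * quantity

print(total)       # 60
print(total > 50)  # True
print('=' * 10)    # ==========
```

Without print, a notebook cell displays only the value of its last expression, so print is the way to show several results from the same cell.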

D. install requisite libraries

  • Data Collection
    • Selenium: pip3 install selenium
  • Data Cleaning (Preprocessing)
    • Pandas: pip3 install pandas
  • Data Exploration
    • Numpy: pip3 install numpy
  • Model Development
    • Scikit-learn: pip3 install scikit-learn
  • Visualization
    • Plot.ly: pip3 install plotly
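
After installing, you may want to confirm that Python can find each library. The following is a minimal sketch using the standard importlib module; the helper name is_installed is hypothetical. Note that a library's import name can differ from its install name: scikit-learn is imported as "sklearn".

```python
from importlib.util import find_spec

# Import names of the workshop libraries.
libraries = ["selenium", "pandas", "numpy", "sklearn", "plotly"]

def is_installed(import_name):
    """Return True if Python can locate a top-level module with this name."""
    return find_spec(import_name) is not None

for name in libraries:
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Any library reported as "missing" can be installed with the corresponding pip3 command and the check run again.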

Assignment

To help you better grasp the material, I will prepare an assignment for you every week. Each assignment will be carefully designed so that most people can finish it within two hours. You are encouraged to finish it before the next class, but it is not compulsory. I will grade the assignments, send you my feedback, and organize some discussion at the beginning of every class. Should you have any questions, feel free to contact me (yunerzhu@gmail.com).

**Submission Website:**

Google Classroom: https://classroom.google.com/c/MjcxNTczODExNDha

Course code: k8mi2l