Basically, there are two ways to collect data from the web: via an API and via scraping. Before collecting web data, it is also helpful to understand the basic principles of HTTP.
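To see those principles in action, here is a minimal sketch using httr (httpbin.org is a public echo service, used purely for illustration): every request comes back as a response object carrying a status code, headers, and a body.
library(httr)
resp = GET("https://httpbin.org/get")
status_code(resp) # 200 = success, 404 = not found, 429 = rate-limited
headers(resp)$`content-type` # tells us how to parse the body (e.g. JSON)
content(resp, as = "text") # the raw body as a string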
First, we need to find the API links (rules). For reddit.com it is very simple: append /.json to a URL. For example, https://www.reddit.com/ is the URL for the front page, and the corresponding API link is https://www.reddit.com/.json. You can visit the link in any web browser and you will see the data in JSON format.
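Because the rule is just a suffix, we can build the API link from any reddit URL with base R string functions; a minimal sketch (make_api_link is our own hypothetical helper):
make_api_link = function(url){
# drop a trailing slash if present, then append /.json
paste0(sub("/$", "", url), "/.json")
}
make_api_link("https://www.reddit.com/r/nba/") # "https://www.reddit.com/r/nba/.json"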
Second, we can visit the API links and convert the JSON data to tables with R. There are several R packages for dealing with JSON; see A biased comparison of JSON packages in R.
library(RCurl,quietly = T) # the package for HTTP requests
library(rjson) # to process JSON data
# page = getURL("https://www.reddit.com/.json")
# jsondata = fromJSON(page,unexpected.escape="skip")
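As an alternative to RCurl + rjson, the jsonlite package can fetch and parse in one step; a minimal sketch (with simplifyVector = FALSE it returns nested lists like rjson does, instead of auto-flattening to data frames):
library(jsonlite)
# note: if rjson is already loaded, jsonlite's fromJSON masks rjson's version
jsondata = fromJSON("https://www.reddit.com/.json", simplifyVector = FALSE)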
Request the data (we set a browser-like user agent because reddit throttles requests that use the default R user agent):
# alternatively we can use
library(httr)
page = GET("https://www.reddit.com/r/nba/.json", user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"))
page
## Response [https://www.reddit.com/r/nba/.json]
## Date: 2022-01-25 15:10
## Status: 200
## Content-Type: application/json; charset=UTF-8
## Size: 352 kB
Parse the response body into an R list (httr's content() detects the JSON content type and parses it for us). To check the data structure of the JSON, you can use JSON Editor Online.
jsondata = content(page)
names(jsondata$data)
## [1] "after" "dist" "modhash" "geo_filter" "children"
## [6] "before"
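You can also inspect the parsed structure directly in R, for example with str() (output truncated; it will vary with the front page):
str(jsondata$data, max.level = 1) # show only the top level of the nested list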
Format into a table (author, domain, title, URL):
items = jsondata$data$children
authors = sapply(items,function(x) x$data$author)
authors
## [1] "NBA_MOD" "NBA_MOD" "SQAD3"
## [4] "LauriFUCKINGLegend" "efranklin13" "urfaselol"
## [7] "GuyCarbonneauGOAT" "sidighjd" "leafieie"
## [10] "Barea_Clamped_Lebron" "HayOjay77" "gulfside13"
## [13] "GuyCarbonneauGOAT" "JackAttaq" "leafieie"
## [16] "GuyCarbonneauGOAT" "xXTheRacerXx" "JC_Frost"
## [19] "mxnoob983" "GuyCarbonneauGOAT" "urfaselol"
## [22] "sidighjd" "urfaselol" "PMMeUrBleedingPussy"
## [25] "jacktradesall" "curryybacon" "urfaselol"
Another way is to define the extraction function first:
extract_author = function(item){
return(item$data$author)
}
extract_author(items[[1]])
## [1] "NBA_MOD"
Apply the function to each element of items, either with a for loop or with sapply; the two give identical results:
authors = c()
for (item in items){
author = extract_author(item)
authors = c(authors,author)
}
authors == sapply(items,extract_author)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
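A stricter alternative is vapply, which checks that every result matches a declared template and fails loudly otherwise; a minimal sketch:
authors = vapply(items, extract_author, character(1)) # each result must be a single string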
domains = sapply(items,function(x) x$data$domain)
titles = sapply(items,function(x) x$data$title)
urls = sapply(items,function(x) x$data$url)
Combine into a data frame:
dp = data.frame(authors,domains,titles,urls,stringsAsFactors = F)
head(dp)
## authors domains
## 1 NBA_MOD self.nba
## 2 NBA_MOD self.nba
## 3 SQAD3 self.nba
## 4 LauriFUCKINGLegend streamable.com
## 5 efranklin13 self.nba
## 6 urfaselol streamable.com
## titles
## 1 Daily Discussion Thread + Game Thread Index
## 2 [SERIOUS NEXT DAY THREAD] Post-Game Discussion (January 24, 2022)
## 3 LeBron is only 137 points away from being the all-time leading scorer (regular season + playoffs)
## 4 [Highlight] Ayo Dosunmu rips it from half court to make it 10/11 from the field to close the 3rd!
## 5 Joel Embiid: “In the previous year, we had someone that was so good in transition that you had to get him the ball so he can attack or making plays and he was so good at it. His absence puts a hole in that category. That’s why I decided kinda to take my game to another level when it comes to that.”
## 6 [Highlight] Two minutes of Heat announcers expressing disbelief on how bad the Lakers defense is
## urls
## 1 https://www.reddit.com/r/nba/comments/scd64t/daily_discussion_thread_game_thread_index/
## 2 https://www.reddit.com/r/nba/comments/scc3o2/serious_next_day_thread_postgame_discussion/
## 3 https://www.reddit.com/r/nba/comments/scclxd/lebron_is_only_137_points_away_from_being_the/
## 4 https://streamable.com/8v7cmx
## 5 https://www.reddit.com/r/nba/comments/sc6znr/joel_embiid_in_the_previous_year_we_had_someone/
## 6 https://streamable.com/5dfbda
Find the next-page URL. Note that we paste the after token onto the front-page API link here; for a subreddit such as r/nba you would paste it onto https://www.reddit.com/r/nba/.json instead:
nextp = jsondata$data$after
nextp
## [1] "t3_sc78mi"
nextp = paste0("https://www.reddit.com/.json?count=25&after=",nextp)
nextp
## [1] "https://www.reddit.com/.json?count=25&after=t3_sc78mi"
Now, we write a function to extract the data and the next-page URL.
extract_page = function(api_link){
page = GET(api_link,user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"))
jsondata = content(page)
items = jsondata$data$children
authors = sapply(items,function(x) x$data$author)
domains = sapply(items,function(x) x$data$domain)
titles = sapply(items,function(x) x$data$title)
urls = sapply(items,function(x) x$data$url)
dp = data.frame(authors,domains,titles,urls,stringsAsFactors = F)
nextp = paste0("https://www.reddit.com/.json?count=25&after=",jsondata$data$after)
return(list(dp,nextp))
}
d = extract_page("https://www.reddit.com/.json")
head(d[[1]]) #data
## authors domains
## 1 kurtios rollingstone.com
## 2 SeaworthinessJumpy95 i.redd.it
## 3 HeraldWard i.redd.it
## 4 StevenTM i.redd.it
## 5 hsoj1006789 ihr.fm
## 6 KaamDeveloper i.redd.it
## titles
## 1 Neil Young Demands Spotify Remove His Music Over ‘False Information About Vaccines’
## 2 No shit?
## 3 this made me smile
## 4 Venn diagram
## 5 Biden Calls Fox Reporter 'Stupid Son of a Bitch' Over Inflation Question
## 6 Different gens, Amazing achievements
## urls
## 1 https://www.rollingstone.com/music/music-news/neil-young-demands-spotify-remove-music-vaccine-disinformation-1290020/
## 2 https://i.redd.it/3xzy2l4rntd81.jpg
## 3 https://i.redd.it/rpm4do87rtd81.jpg
## 4 https://i.redd.it/gzk2sr2vetd81.jpg
## 5 https://ihr.fm/3fU5A4A
## 6 https://i.redd.it/gy69trwm1ud81.jpg
d[[2]] #next
## [1] "https://www.reddit.com/.json?count=25&after=t3_scaz39"
Put them together:
library(dplyr,quietly = T)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
url = "https://www.reddit.com/.json"
data = data.frame()
n=0
while (!is.null(url)&n<=10){
d = extract_page(url)
data = bind_rows(data,d[[1]])
url = d[[2]]
n=n+1
print(n)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
dim(data)
## [1] 275 4
head(data)
## authors domains
## 1 kurtios rollingstone.com
## 2 SeaworthinessJumpy95 i.redd.it
## 3 HeraldWard i.redd.it
## 4 StevenTM i.redd.it
## 5 hsoj1006789 ihr.fm
## 6 KaamDeveloper i.redd.it
## titles
## 1 Neil Young Demands Spotify Remove His Music Over ‘False Information About Vaccines’
## 2 No shit?
## 3 this made me smile
## 4 Venn diagram
## 5 Biden Calls Fox Reporter 'Stupid Son of a Bitch' Over Inflation Question
## 6 Different gens, Amazing achievements
## urls
## 1 https://www.rollingstone.com/music/music-news/neil-young-demands-spotify-remove-music-vaccine-disinformation-1290020/
## 2 https://i.redd.it/3xzy2l4rntd81.jpg
## 3 https://i.redd.it/rpm4do87rtd81.jpg
## 4 https://i.redd.it/gzk2sr2vetd81.jpg
## 5 https://ihr.fm/3fU5A4A
## 6 https://i.redd.it/gy69trwm1ud81.jpg
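When looping over many pages, it is also good practice to pause between requests so we do not hammer the server (and risk being rate-limited); the same loop with a one-second delay:
url = "https://www.reddit.com/.json"
data = data.frame()
n = 0
while (!is.null(url) & n <= 10){
d = extract_page(url)
data = bind_rows(data, d[[1]])
url = d[[2]]
n = n + 1
Sys.sleep(1) # wait a second between requests
}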
Next, we scrape HTML pages directly, using the diabetes.co.uk forum as an example. If you run into encoding problems, setting the locale may help:
#Sys.setlocale(locale="Chinese") # for Windows
#Sys.setlocale("LC_ALL", 'en_US.UTF-8') # for macOS
library(RCurl)
library(XML) # to parse html docs and use xpath
url="https://www.diabetes.co.uk/forum/category/diabetes-discussions.1/?order=reply_count"
source=getURL(url,.encoding='UTF-8')
#source=iconv(source, "big5", "UTF-8",sub = 'byte')
page =htmlParse(source,encoding = 'UTF-8')
class(page)
## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument"
## [4] "XMLAbstractDocument"
Use XPath to extract the authors, titles, URLs, etc. Each thread in the listing is an li element whose class attribute is exactly 'discussionListItem visible ' (note the trailing space), and the author is stored in its data-author attribute:
authors = xpathSApply(page,"//li[@class='discussionListItem visible ']/@data-author")
authors
## data-author data-author data-author data-author data-author
## "NewdestinyX" "archersuz" "Prem51" "Lupf" "kimyeomans"
## data-author data-author data-author data-author data-author
## "Administrator" "jayney27" "Begonia" "CherryAA" "Spiker"
## data-author data-author data-author data-author data-author
## "SueMG" "999sugarbabe" "Giverny" "izzzi" "ladybird64"
## data-author
## "Administrator"
xpathSApply returns the attribute values with names attached; we can remove the names:
names(authors) = NULL
authors
## [1] "NewdestinyX" "archersuz" "Prem51" "Lupf"
## [5] "kimyeomans" "Administrator" "jayney27" "Begonia"
## [9] "CherryAA" "Spiker" "SueMG" "999sugarbabe"
## [13] "Giverny" "izzzi" "ladybird64" "Administrator"
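Equivalently, base R's unname() does the same in one call:
authors = unname(authors)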
To extract the titles and URLs:
titles = xpathSApply(page,"//li[@class='discussionListItem visible ']//h3[@class='title']/a",xmlValue)
titles
## [1] "What was your fasting blood glucose? (with some chat)"
## [2] "Diabetics R Us"
## [3] "What was your bg reading last night and this morning?"
## [4] "Covid/Coronavirus and diabetes - the numbers"
## [5] "'Newcastle diet' advice"
## [6] "Poll: Which diabetes course(s) have you attended?"
## [7] "The one show discussion"
## [8] "Prof Roy Taylor's work on reversing type 2 diabetes"
## [9] "My personal hypothesis - T2 - Low insulin Diet"
## [10] "Poll - side effects from statins?"
## [11] "Reversing Type 2 diabetes"
## [12] "NHS Direct doctor says... NO testing when taking Metformin"
## [13] "How old were you when you were diagnosed?"
## [14] "LCHF diet to help you lose weight, not diabetes"
## [15] "Sugar Tax"
## [16] "Have you found love on the forum?"
urls = xpathSApply(page,"//li[@class='discussionListItem visible ']//h3[@class='title']/a/@href")
urls
## href
## "threads/what-was-your-fasting-blood-glucose-with-some-chat.22272/"
## href
## "threads/diabetics-r-us.150682/"
## href
## "threads/what-was-your-bg-reading-last-night-and-this-morning.157291/"
## href
## "threads/covid-coronavirus-and-diabetes-the-numbers.174274/"
## href
## "threads/newcastle-diet-advice.55478/"
## href
## "threads/poll-which-diabetes-course-s-have-you-attended.70671/"
## href
## "threads/the-one-show-discussion.148418/"
## href
## "threads/prof-roy-taylors-work-on-reversing-type-2-diabetes.124863/"
## href
## "threads/my-personal-hypothesis-t2-low-insulin-diet.127156/"
## href
## "threads/poll-side-effects-from-statins.58409/"
## href
## "threads/reversing-type-2-diabetes.102415/"
## href
## "threads/nhs-direct-doctor-says-no-testing-when-taking-metformin.76791/"
## href
## "threads/how-old-were-you-when-you-were-diagnosed.39006/"
## href
## "threads/lchf-diet-to-help-you-lose-weight-not-diabetes.65746/"
## href
## "threads/sugar-tax.97464/"
## href
## "threads/have-you-found-love-on-the-forum.70861/"
full_urls = paste0("https://www.diabetes.co.uk/forum/",urls)
head(full_urls)
## [1] "https://www.diabetes.co.uk/forum/threads/what-was-your-fasting-blood-glucose-with-some-chat.22272/"
## [2] "https://www.diabetes.co.uk/forum/threads/diabetics-r-us.150682/"
## [3] "https://www.diabetes.co.uk/forum/threads/what-was-your-bg-reading-last-night-and-this-morning.157291/"
## [4] "https://www.diabetes.co.uk/forum/threads/covid-coronavirus-and-diabetes-the-numbers.174274/"
## [5] "https://www.diabetes.co.uk/forum/threads/newcastle-diet-advice.55478/"
## [6] "https://www.diabetes.co.uk/forum/threads/poll-which-diabetes-course-s-have-you-attended.70671/"
Put them together:
url="https://www.diabetes.co.uk/forum/category/diabetes-discussions.1/?order=reply_count"
scrape = function(url){
source=getURL(url,.encoding='utf-8')
page =htmlParse(source,encoding = 'UTF-8')
authors = xpathSApply(page,"//li[@class='discussionListItem visible ']/@data-author")
titles = xpathSApply(page,"//li[@class='discussionListItem visible ']//h3[@class='title']/a",xmlValue)
urls = xpathSApply(page,"//li[@class='discussionListItem visible ']//h3[@class='title']/a/@href")
full_urls = paste0("https://www.diabetes.co.uk/forum/",urls)
df = data.frame(authors,titles,full_urls,stringsAsFactors = F)
nextp = xpathSApply(page,"//nav/a[contains(text(),'Next >')]/@href")[1]
nextp = paste0("https://www.diabetes.co.uk/forum/",nextp)
return(list(df,nextp))
}
k=scrape(url)
k[[1]]
## authors titles
## 1 NewdestinyX What was your fasting blood glucose? (with some chat)
## 2 archersuz Diabetics R Us
## 3 Prem51 What was your bg reading last night and this morning?
## 4 Lupf Covid/Coronavirus and diabetes - the numbers
## 5 kimyeomans 'Newcastle diet' advice
## 6 Administrator Poll: Which diabetes course(s) have you attended?
## 7 jayney27 The one show discussion
## 8 Begonia Prof Roy Taylor's work on reversing type 2 diabetes
## 9 CherryAA My personal hypothesis - T2 - Low insulin Diet
## 10 Spiker Poll - side effects from statins?
## 11 SueMG Reversing Type 2 diabetes
## 12 999sugarbabe NHS Direct doctor says... NO testing when taking Metformin
## 13 Giverny How old were you when you were diagnosed?
## 14 izzzi LCHF diet to help you lose weight, not diabetes
## 15 ladybird64 Sugar Tax
## 16 Administrator Have you found love on the forum?
## full_urls
## 1 https://www.diabetes.co.uk/forum/threads/what-was-your-fasting-blood-glucose-with-some-chat.22272/
## 2 https://www.diabetes.co.uk/forum/threads/diabetics-r-us.150682/
## 3 https://www.diabetes.co.uk/forum/threads/what-was-your-bg-reading-last-night-and-this-morning.157291/
## 4 https://www.diabetes.co.uk/forum/threads/covid-coronavirus-and-diabetes-the-numbers.174274/
## 5 https://www.diabetes.co.uk/forum/threads/newcastle-diet-advice.55478/
## 6 https://www.diabetes.co.uk/forum/threads/poll-which-diabetes-course-s-have-you-attended.70671/
## 7 https://www.diabetes.co.uk/forum/threads/the-one-show-discussion.148418/
## 8 https://www.diabetes.co.uk/forum/threads/prof-roy-taylors-work-on-reversing-type-2-diabetes.124863/
## 9 https://www.diabetes.co.uk/forum/threads/my-personal-hypothesis-t2-low-insulin-diet.127156/
## 10 https://www.diabetes.co.uk/forum/threads/poll-side-effects-from-statins.58409/
## 11 https://www.diabetes.co.uk/forum/threads/reversing-type-2-diabetes.102415/
## 12 https://www.diabetes.co.uk/forum/threads/nhs-direct-doctor-says-no-testing-when-taking-metformin.76791/
## 13 https://www.diabetes.co.uk/forum/threads/how-old-were-you-when-you-were-diagnosed.39006/
## 14 https://www.diabetes.co.uk/forum/threads/lchf-diet-to-help-you-lose-weight-not-diabetes.65746/
## 15 https://www.diabetes.co.uk/forum/threads/sugar-tax.97464/
## 16 https://www.diabetes.co.uk/forum/threads/have-you-found-love-on-the-forum.70861/
head(k[[2]])
## [1] "https://www.diabetes.co.uk/forum/category/diabetes-discussions.1/page-2?order=reply_count"
Loop to get more pages:
frdata=data.frame()
n=0
while (!is.null(url)&n<=10){
k = scrape(url)
frdata = bind_rows(frdata,k[[1]])
url = k[[2]]
n=n+1
print(n)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
dim(frdata)
## [1] 193 3
head(frdata)
## authors titles
## 1 NewdestinyX What was your fasting blood glucose? (with some chat)
## 2 archersuz Diabetics R Us
## 3 Prem51 What was your bg reading last night and this morning?
## 4 Lupf Covid/Coronavirus and diabetes - the numbers
## 5 kimyeomans 'Newcastle diet' advice
## 6 Administrator Poll: Which diabetes course(s) have you attended?
## full_urls
## 1 https://www.diabetes.co.uk/forum/threads/what-was-your-fasting-blood-glucose-with-some-chat.22272/
## 2 https://www.diabetes.co.uk/forum/threads/diabetics-r-us.150682/
## 3 https://www.diabetes.co.uk/forum/threads/what-was-your-bg-reading-last-night-and-this-morning.157291/
## 4 https://www.diabetes.co.uk/forum/threads/covid-coronavirus-and-diabetes-the-numbers.174274/
## 5 https://www.diabetes.co.uk/forum/threads/newcastle-diet-advice.55478/
## 6 https://www.diabetes.co.uk/forum/threads/poll-which-diabetes-course-s-have-you-attended.70671/
library(RSelenium,quietly = T)
# Sys.which("java")
rD <- rsDriver(verbose = FALSE,port=4444L,browser="firefox") # if port 4444 is busy, try another, e.g. 4445L
remDr <- rD$client
remDr$navigate("https://twitter.com/search-advanced?lang=en")
Now, input the keywords:
Sys.sleep(20)
# hashtag
webElem <- remDr$findElement(using = "xpath","//input[@name='allOfTheseWords']")
webElem$sendKeysToElement(list("#covid19",key ="enter"))
Sys.sleep(20)
# click the latest button
webElem <- remDr$findElement(using = "xpath","//span[text()='Latest']")
webElem$clickElement()
# scroll down the page
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight)", list(""))
## list()
And you could scroll to the top like this:
webElem <- remDr$findElement("css", "body") #very important
webElem$sendKeysToElement(list(key = "home"))
And in case you want to scroll down just a bit, use
webElem$sendKeysToElement(list(key = "down_arrow"))
You could scroll to the bottom (of the current page) like this:
webElem$sendKeysToElement(list(key = "end"))
If you want to loop many steps:
n=0
while(n<10){
webElem$sendKeysToElement(list(key = "end"))
n=n+1
}
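On an infinite-scroll page like Twitter, new tweets load asynchronously, so pausing between key presses gives them time to render; a variant of the loop with a delay:
n = 0
while (n < 10){
webElem$sendKeysToElement(list(key = "end"))
Sys.sleep(2) # give newly loaded tweets time to render
n = n + 1
}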
Extract the information:
Sys.sleep(20)
source=remDr$getPageSource()[[1]]
page=htmlParse(source,encoding = 'UTF-8')
tweets = xpathSApply(page,"//div[@class='css-901oao r-18jsvk2 r-37j5jr r-a023e6 r-16dba41 r-rjixqe r-bcqeeo r-bnwqim r-qvutc0']",xmlValue)
head(tweets)
## [1] "Fui tomar hj a terceira dose da vacina contra o C19. Caraca! Além de apenas determinados locais aplicarem a dita cuja, haja saco. Fila pra senha, fila pra cadastro, fila pra esperar, fila pra vacinar. Por isso o povo ñ tá indo. Quase desisti! #curitiba #COVID19"
## [2] "China’s zero-Covid policy is a pandemic waiting to happen http://dlvr.it/SHmvGn #COVID19 #COVID19variants"
## [3] "Urgent : la pharmacie Oberkampf dispose de plusieurs doses de vaccin Moderna valables aujourd’hui sans rdv pour les plus de 30 ans \nOuverte jusqu'à 19h30 , 4 rue Pasteur / place de la Marne #JouyenJosas #Vaccinezvous #COVID19 #Moderna"
## [4] "Ayer #Querétaro reportó un pico histórico de #COVID19 con dos mil 759 contagios en un día. #NoBajesLaGuardia #UsaCubrebocas y por favor, #Vacúnate \n\nhttps://adninformativo.mx/queretaro-rompe-nuevamente-pico-historico-de-contagios-2-mil-759-casos-de-covid-19-este-lunes/…"
## [5] " La móvil de vacunación estará en #Caucasia desde este viernes 28 y hasta el domingo 30 de enero.\n\nSi estás en este municipio, aprovecha y vacúnate contra #COVID19: hay primeras, segundas y dosis de refuerzo Rostro boca arriba.\n\n Más información en la alcaldía local."
## [6] "#AsíLoDijo || Vicepdta. @delcyrodriguezv: Este Consejo también está representado por el Gobierno del Reino Unido que ha adoptado la ilegítima decisión de apoderarse del oro del pueblo venezolano, negado incluso para atender la situación humanitaria generada por la #COVID19."
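The long class string in the XPath above is tied to one particular build of twitter.com and breaks whenever the site is restyled. At the time of writing, the tweet text node also carried a stabler data-testid attribute; this is an assumption about Twitter's markup, so verify it in your browser's inspector before relying on it:
# data-testid attributes tend to survive restyling better than generated class names
# (assumption about twitter.com's markup -- verify in the inspector first)
tweets = xpathSApply(page, "//div[@data-testid='tweetText']", xmlValue)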
Close the client and stop the server:
remDr$close()
rD$server$stop()
## [1] TRUE