Basically, there are two ways to collect data from the web: via an API and via scraping. Before collecting web data, it is also helpful to understand the basic principles of HTTP.
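To see those principles in action, here is a minimal sketch using httr (httpbin.org is a public echo service, used purely for illustration): every request comes back as a response object carrying a status code, headers, and a body.
library(httr)
resp = GET("https://httpbin.org/get")
status_code(resp) # 200 = success, 404 = not found, 429 = rate-limited
headers(resp)$`content-type` # tells us how to parse the body (e.g. JSON)
content(resp, as = "text") # the raw body as a string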
First, we need to find the API links (rules). For reddit.com it is very simple: append /.json to a URL. For example, https://www.reddit.com/ is the URL for the front page, and the corresponding API link is https://www.reddit.com/.json. You can visit the link in any web browser and you will see the data in JSON format.
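Because the rule is just a suffix, we can build the API link from any reddit URL with base R string functions; a minimal sketch (make_api_link is our own hypothetical helper):
make_api_link = function(url){
# drop a trailing slash if present, then append /.json
paste0(sub("/$", "", url), "/.json")
}
make_api_link("https://www.reddit.com/r/nba/") # "https://www.reddit.com/r/nba/.json"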
Second, we can visit the API links and convert the JSON data to tables with R. There are several R packages for dealing with JSON; see A biased comparison of JSON packages in R.
library(RCurl,quietly = T) # the package for HTTP requests
library(rjson) # to process JSON data
# page = getURL("https://www.reddit.com/.json")
# jsondata = fromJSON(page,unexpected.escape="skip")
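As an alternative to RCurl + rjson, the jsonlite package can fetch and parse in one step; a minimal sketch (with simplifyVector = FALSE it returns nested lists like rjson does, instead of auto-flattening to data frames):
library(jsonlite)
# note: if rjson is already loaded, jsonlite's fromJSON masks rjson's version
jsondata = fromJSON("https://www.reddit.com/.json", simplifyVector = FALSE)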
Request the data (we set a browser-like user agent because reddit throttles requests that use the default R user agent):
# alternatively we can use
library(httr)
page = GET("https://www.reddit.com/r/nba/.json", user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"))
page
## Response [https://www.reddit.com/r/nba/.json]
## Date: 2022-01-25 15:10
## Status: 200
## Content-Type: application/json; charset=UTF-8
## Size: 352 kB
Parse the response body into an R list (httr's content() detects the JSON content type and parses it for us). To check the data structure of the JSON, you can use JSON Editor Online.
jsondata = content(page)
names(jsondata$data)
## [1] "after" "dist" "modhash" "geo_filter" "children"
## [6] "before"
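You can also inspect the parsed structure directly in R, for example with str() (output truncated; it will vary with the front page):
str(jsondata$data, max.level = 1) # show only the top level of the nested list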
Format into a table (author, domain, title, URL):
items = jsondata$data$children
authors = sapply(items,function(x) x$data$author)
authors
## [1] "NBA_MOD" "NBA_MOD" "SQAD3"
## [4] "LauriFUCKINGLegend" "efranklin13" "urfaselol"
## [7] "GuyCarbonneauGOAT" "sidighjd" "leafieie"
## [10] "Barea_Clamped_Lebron" "HayOjay77" "gulfside13"
## [13] "GuyCarbonneauGOAT" "JackAttaq" "leafieie"
## [16] "GuyCarbonneauGOAT" "xXTheRacerXx" "JC_Frost"
## [19] "mxnoob983" "GuyCarbonneauGOAT" "urfaselol"
## [22] "sidighjd" "urfaselol" "PMMeUrBleedingPussy"
## [25] "jacktradesall" "curryybacon" "urfaselol"
Another way is to define the extraction function first:
extract_author = function(item){
return(item$data$author)
}
extract_author(items[[1]])
## [1] "NBA_MOD"
Apply the function to each element of items, either with a for loop or with sapply; the two give identical results:
authors = c()
for (item in items){
author = extract_author(item)
authors = c(authors,author)
}
authors == sapply(items,extract_author)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
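A stricter alternative is vapply, which checks that every result matches a declared template and fails loudly otherwise; a minimal sketch:
authors = vapply(items, extract_author, character(1)) # each result must be a single string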
domains = sapply(items,function(x) x$data$domain)
titles = sapply(items,function(x) x$data$title)
urls = sapply(items,function(x) x$data$url)
Combine into a data frame:
dp = data.frame(authors,domains,titles,urls,stringsAsFactors = F)
head(dp)
## authors domains
## 1 NBA_MOD self.nba
## 2 NBA_MOD self.nba
## 3 SQAD3 self.nba
## 4 LauriFUCKINGLegend streamable.com
## 5 efranklin13 self.nba
## 6 urfaselol streamable.com
## titles
## 1 Daily Discussion Thread + Game Thread Index
## 2 [SERIOUS NEXT DAY THREAD] Post-Game Discussion (January 24, 2022)
## 3 LeBron is only 137 points away from being the all-time leading scorer (regular season + playoffs)
## 4 [Highlight] Ayo Dosunmu rips it from half court to make it 10/11 from the field to close the 3rd!
## 5 Joel Embiid: “In the previous year, we had someone that was so good in transition that you had to get him the ball so he can attack or making plays and he was so good at it. His absence puts a hole in that category. That’s why I decided kinda to take my game to another level when it comes to that.”
## 6 [Highlight] Two minutes of Heat announcers expressing disbelief on how bad the Lakers defense is
## urls
## 1 https://www.reddit.com/r/nba/comments/scd64t/daily_discussion_thread_game_thread_index/
## 2 https://www.reddit.com/r/nba/comments/scc3o2/serious_next_day_thread_postgame_discussion/
## 3 https://www.reddit.com/r/nba/comments/scclxd/lebron_is_only_137_points_away_from_being_the/
## 4 https://streamable.com/8v7cmx
## 5 https://www.reddit.com/r/nba/comments/sc6znr/joel_embiid_in_the_previous_year_we_had_someone/
## 6 https://streamable.com/5dfbda
Find the next-page URL. Note that we paste the after token onto the front-page API link here; for a subreddit such as r/nba you would paste it onto https://www.reddit.com/r/nba/.json instead:
nextp = jsondata$data$after
nextp
## [1] "t3_sc78mi"
nextp = paste0("https://www.reddit.com/.json?count=25&after=",nextp)
nextp
## [1] "https://www.reddit.com/.json?count=25&after=t3_sc78mi"
Now, we write a function to extract the data and the next-page URL.
extract_page = function(api_link){
page = GET(api_link,user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"))
jsondata = content(page)
items = jsondata$data$children
authors = sapply(items,function(x) x$data$author)
domains = sapply(items,function(x) x$data$domain)
titles = sapply(items,function(x) x$data$title)
urls = sapply(items,function(x) x$data$url)
dp = data.frame(authors,domains,titles,urls,stringsAsFactors = F)
nextp = paste0("https://www.reddit.com/.json?count=25&after=",jsondata$data$after)
return(list(dp,nextp))
}
d = extract_page("https://www.reddit.com/.json")
head(d[[1]]) #data
## authors domains
## 1 kurtios rollingstone.com
## 2 SeaworthinessJumpy95 i.redd.it
## 3 HeraldWard i.redd.it
## 4 StevenTM i.redd.it
## 5 hsoj1006789 ihr.fm
## 6 KaamDeveloper i.redd.it
## titles
## 1 Neil Young Demands Spotify Remove His Music Over ‘False Information About Vaccines’
## 2 No shit?
## 3 this made me smile
## 4 Venn diagram
## 5 Biden Calls Fox Reporter 'Stupid Son of a Bitch' Over Inflation Question
## 6 Different gens, Amazing achievements
## urls
## 1 https://www.rollingstone.com/music/music-news/neil-young-demands-spotify-remove-music-vaccine-disinformation-1290020/
## 2 https://i.redd.it/3xzy2l4rntd81.jpg
## 3 https://i.redd.it/rpm4do87rtd81.jpg
## 4 https://i.redd.it/gzk2sr2vetd81.jpg
## 5 https://ihr.fm/3fU5A4A
## 6 https://i.redd.it/gy69trwm1ud81.jpg
d[[2]] #next
## [1] "https://www.reddit.com/.json?count=25&after=t3_scaz39"
Put them together:
library(dplyr,quietly = T)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
url = "https://www.reddit.com/.json"
data = data.frame()
n=0
while (!is.null(url)&n<=10){
d = extract_page(url)
data = bind_rows(data,d[[1]])
url = d[[2]]
n=n+1
print(n)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
dim(data)
## [1] 275 4
head(data)
## authors domains
## 1 kurtios rollingstone.com
## 2 SeaworthinessJumpy95 i.redd.it
## 3 HeraldWard i.redd.it
## 4 StevenTM i.redd.it
## 5 hsoj1006789 ihr.fm
## 6 KaamDeveloper i.redd.it
## titles
## 1 Neil Young Demands Spotify Remove His Music Over ‘False Information About Vaccines’
## 2 No shit?
## 3 this made me smile
## 4 Venn diagram
## 5 Biden Calls Fox Reporter 'Stupid Son of a Bitch' Over Inflation Question
## 6 Different gens, Amazing achievements
## urls
## 1 https://www.rollingstone.com/music/music-news/neil-young-demands-spotify-remove-music-vaccine-disinformation-1290020/
## 2 https://i.redd.it/3xzy2l4rntd81.jpg
## 3 https://i.redd.it/rpm4do87rtd81.jpg
## 4 https://i.redd.it/gzk2sr2vetd81.jpg
## 5 https://ihr.fm/3fU5A4A
## 6 https://i.redd.it/gy69trwm1ud81.jpg
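When looping over many pages, it is also good practice to pause between requests so we do not hammer the server (and risk being rate-limited); the same loop with a one-second delay:
url = "https://www.reddit.com/.json"
data = data.frame()
n = 0
while (!is.null(url) & n <= 10){
d = extract_page(url)
data = bind_rows(data, d[[1]])
url = d[[2]]
n = n + 1
Sys.sleep(1) # wait a second between requests
}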
Next, we scrape HTML pages directly, using the diabetes.co.uk forum as an example. If you run into encoding problems, setting the locale may help:
#Sys.setlocale(locale="Chinese") # for Windows
#Sys.setlocale("LC_ALL", 'en_US.UTF-8') # for macOS
library(RCurl)
library(XML) # to parse html docs and use xpath
url="https://www.diabetes.co.uk/forum/category/diabetes-discussions.1/?order=reply_count"
source=getURL(url,.encoding='UTF-8')
#source=iconv(source, "big5", "UTF-8",sub = 'byte')
page =htmlParse(source,encoding = 'UTF-8')
class(page)
## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument"
## [4] "XMLAbstractDocument"
Use XPath to extract the authors, titles, URLs, etc. Each thread in the listing is an li element whose class attribute is exactly 'discussionListItem visible ' (note the trailing space), and the author is stored in its data-author attribute:
authors = xpathSApply(page,"//li[@class='discussionListItem visible ']/@data-author")
authors
## data-author data-author data-author data-author data-author
## "NewdestinyX" "archersuz" "Prem51" "Lupf" "kimyeomans"
## data-author data-author data-author data-author data-author
## "Administrator" "jayney27" "Begonia" "CherryAA" "Spiker"
## data-author data-author data-author data-author data-author
## "SueMG" "999sugarbabe" "Giverny" "izzzi" "ladybird64"
## data-author
## "Administrator"
xpathSApply returns the attribute values with names attached; we can remove the names:
names(authors) = NULL
authors
## [1] "NewdestinyX" "archersuz" "Prem51" "Lupf"
## [5] "kimyeomans" "Administrator" "jayney27" "Begonia"
## [9] "CherryAA" "Spiker" "SueMG" "999sugarbabe"
## [13] "Giverny" "izzzi" "ladybird64" "Administrator"
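Equivalently, base R's unname() does the same in one call:
authors = unname(authors)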
To extract the titles and URLs:
titles = xpathSApply(page,"//li[@class='discussionListItem visible ']//h3[@class='title']/a",xmlValue)
titles
## [1] "What was your fasting blood glucose? (with some chat)"
## [2] "Diabetics R Us"
## [3] "What was your bg reading last night and this morning?"
## [4] "Covid/Coronavirus and diabetes - the numbers"
## [5] "'Newcastle diet' advice"
## [6] "Poll: Which diabetes course(s) have you attended?"
## [7] "The one show discussion"
## [8] "Prof Roy Taylor's work on reversing type 2 diabetes"
## [9] "My personal hypothesis - T2 - Low insulin Diet"
## [10] "Poll - side effects from statins?"
## [11] "Reversing Type 2 diabetes"
## [12] "NHS Direct doctor says... NO testing when taking Metformin"
## [13] "How old were you when you were diagnosed?"
## [14] "LCHF diet to help you lose weight, not diabetes"
## [15] "Sugar Tax"
## [16] "Have you found love on the forum?"
urls = xpathSApply(page,"//li[@class='discussionListItem visible ']//h3[@class='title']/a/@href")
urls
## href
## "threads/what-was-your-fasting-blood-glucose-with-some-chat.22272/"
## href
## "threads/diabetics-r-us.150682/"
## href
## "threads/what-was-your-bg-reading-last-night-and-this-morning.157291/"
## href
## "threads/covid-coronavirus-and-diabetes-the-numbers.174274/"
## href
## "threads/newcastle-diet-advice.55478/"
## href
## "threads/poll-which-diabetes-course-s-have-you-attended.70671/"
## href
## "threads/the-one-show-discussion.148418/"
## href
## "threads/prof-roy-taylors-work-on-reversing-type-2-diabetes.124863/"
## href
## "threads/my-personal-hypothesis-t2-low-insulin-diet.127156/"
## href
## "threads/poll-side-effects-from-statins.58409/"
## href
## "threads/reversing-type-2-diabetes.102415/"
## href
## "threads/nhs-direct-doctor-says-no-testing-when-taking-metformin.76791/"
## href
## "threads/how-old-were-you-when-you-were-diagnosed.39006/"
## href
## "threads/lchf-diet-to-help-you-lose-weight-not-diabetes.65746/"
## href
## "threads/sugar-tax.97464/"
## href
## "threads/have-you-found-love-on-the-forum.70861/"
full_urls = paste0("https://www.diabetes.co.uk/forum/",urls)
head(full_urls)
## [1] "https://www.diabetes.co.uk/forum/threads/what-was-your-fasting-blood-glucose-with-some-chat.22272/"
## [2] "https://www.diabetes.co.uk/forum/threads/diabetics-r-us.150682/"
## [3] "https://www.diabetes.co.uk/forum/threads/what-was-your-bg-reading-last-night-and-this-morning.157291/"
## [4] "https://www.diabetes.co.uk/forum/threads/covid-coronavirus-and-diabetes-the-numbers.174274/"
## [5] "https://www.diabetes.co.uk/forum/threads/newcastle-diet-advice.55478/"
## [6] "https://www.diabetes.co.uk/forum/threads/poll-which-diabetes-course-s-have-you-attended.70671/"
Put them together:
url="https://www.diabetes.co.uk/forum/category/diabetes-discussions.1/?order=reply_count"
scrape = function(url){
source=getURL(url,.encoding='utf-8')
page =htmlParse(source,encoding = 'UTF-8')
authors = xpathSApply(page,"//li[@class='discussionListItem visible ']/@data-author")
titles = xpathSApply(page,"//li[@class='discussionListItem visible ']//h3[@class='title']/a",xmlValue)
urls = xpathSApply(page,"//li[@class='discussionListItem visible ']//h3[@class='title']/a/@href")
full_urls = paste0("https://www.diabetes.co.uk/forum/",urls)
df = data.frame(authors,titles,full_urls,stringsAsFactors = F)
nextp = xpathSApply(page,"//nav/a[contains(text(),'Next >')]/@href")[1]
nextp = paste0("https://www.diabetes.co.uk/forum/",nextp)
return(list(df,nextp))
}
k=scrape(url)
k[[1]]
## authors titles
## 1 NewdestinyX What was your fasting blood glucose? (with some chat)
## 2 archersuz Diabetics R Us
## 3 Prem51 What was your bg reading last night and this morning?
## 4 Lupf Covid/Coronavirus and diabetes - the numbers
## 5 kimyeomans 'Newcastle diet' advice
## 6 Administrator Poll: Which diabetes course(s) have you attended?
## 7 jayney27 The one show discussion
## 8 Begonia Prof Roy Taylor's work on reversing type 2 diabetes
## 9 CherryAA My personal hypothesis - T2 - Low insulin Diet
## 10 Spiker Poll - side effects from statins?
## 11 SueMG Reversing Type 2 diabetes
## 12 999sugarbabe NHS Direct doctor says... NO testing when taking Metformin
## 13 Giverny How old were you when you were diagnosed?
## 14 izzzi LCHF diet to help you lose weight, not diabetes
## 15 ladybird64 Sugar Tax
## 16 Administrator Have you found love on the forum?
## full_urls
## 1 https://www.diabetes.co.uk/forum/threads/what-was-your-fasting-blood-glucose-with-some-chat.22272/
## 2 https://www.diabetes.co.uk/forum/threads/diabetics-r-us.150682/
## 3 https://www.diabetes.co.uk/forum/threads/what-was-your-bg-reading-last-night-and-this-morning.157291/
## 4 https://www.diabetes.co.uk/forum/threads/covid-coronavirus-and-diabetes-the-numbers.174274/
## 5 https://www.diabetes.co.uk/forum/threads/newcastle-diet-advice.55478/
## 6 https://www.diabetes.co.uk/forum/threads/poll-which-diabetes-course-s-have-you-attended.70671/
## 7 https://www.diabetes.co.uk/forum/threads/the-one-show-discussion.148418/
## 8 https://www.diabetes.co.uk/forum/threads/prof-roy-taylors-work-on-reversing-type-2-diabetes.124863/
## 9 https://www.diabetes.co.uk/forum/threads/my-personal-hypothesis-t2-low-insulin-diet.127156/
## 10 https://www.diabetes.co.uk/forum/threads/poll-side-effects-from-statins.58409/
## 11 https://www.diabetes.co.uk/forum/threads/reversing-type-2-diabetes.102415/
## 12 https://www.diabetes.co.uk/forum/threads/nhs-direct-doctor-says-no-testing-when-taking-metformin.76791/
## 13 https://www.diabetes.co.uk/forum/threads/how-old-were-you-when-you-were-diagnosed.39006/
## 14 https://www.diabetes.co.uk/forum/threads/lchf-diet-to-help-you-lose-weight-not-diabetes.65746/
## 15 https://www.diabetes.co.uk/forum/threads/sugar-tax.97464/
## 16 https://www.diabetes.co.uk/forum/threads/have-you-found-love-on-the-forum.70861/
head(k[[2]])
## [1] "https://www.diabetes.co.uk/forum/category/diabetes-discussions.1/page-2?order=reply_count"
Loop to get more pages:
frdata=data.frame()
n=0
while (!is.null(url)&n<=10){
k = scrape(url)
frdata = bind_rows(frdata,k[[1]])
url = k[[2]]
n=n+1
print(n)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
dim(frdata)
## [1] 193 3
head(frdata)
## authors titles
## 1 NewdestinyX What was your fasting blood glucose? (with some chat)
## 2 archersuz Diabetics R Us
## 3 Prem51 What was your bg reading last night and this morning?
## 4 Lupf Covid/Coronavirus and diabetes - the numbers
## 5 kimyeomans 'Newcastle diet' advice
## 6 Administrator Poll: Which diabetes course(s) have you attended?
## full_urls
## 1 https://www.diabetes.co.uk/forum/threads/what-was-your-fasting-blood-glucose-with-some-chat.22272/
## 2 https://www.diabetes.co.uk/forum/threads/diabetics-r-us.150682/
## 3 https://www.diabetes.co.uk/forum/threads/what-was-your-bg-reading-last-night-and-this-morning.157291/
## 4 https://www.diabetes.co.uk/forum/threads/covid-coronavirus-and-diabetes-the-numbers.174274/
## 5 https://www.diabetes.co.uk/forum/threads/newcastle-diet-advice.55478/
## 6 https://www.diabetes.co.uk/forum/threads/poll-which-diabetes-course-s-have-you-attended.70671/
library(RSelenium,quietly = T)
# Sys.which("java")
rD <- rsDriver(verbose = FALSE,port=4444L,browser="firefox") # if port 4444 is busy, try another, e.g. 4445L
remDr <- rD$client
remDr$navigate("https://twitter.com/search-advanced?lang=en")
Now, input the keywords:
Sys.sleep(20)
# hashtag
webElem <- remDr$findElement(using = "xpath","//input[@name='allOfTheseWords']")
webElem$sendKeysToElement(list("#covid19",key ="enter"))
Sys.sleep(20)
# click the latest button
webElem <- remDr$findElement(using = "xpath","//span[text()='Latest']")
webElem$clickElement()
# scroll down the page
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight)", list(""))
## list()
And you could scroll to the top like this:
webElem <- remDr$findElement("css", "body") #very important
webElem$sendKeysToElement(list(key = "home"))
And in case you want to scroll down just a bit, use
webElem$sendKeysToElement(list(key = "down_arrow"))
You could scroll to the bottom (of the current page) like this:
webElem$sendKeysToElement(list(key = "end"))
If you want to loop many steps:
n=0
while(n<10){
webElem$sendKeysToElement(list(key = "end"))
n=n+1
}
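On an infinite-scroll page like Twitter, new tweets load asynchronously, so pausing between key presses gives them time to render; a variant of the loop with a delay:
n = 0
while (n < 10){
webElem$sendKeysToElement(list(key = "end"))
Sys.sleep(2) # give newly loaded tweets time to render
n = n + 1
}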
Extract the information:
Sys.sleep(20)
source=remDr$getPageSource()[[1]]
page=htmlParse(source,encoding = 'UTF-8')
tweets = xpathSApply(page,"//div[@class='css-901oao r-18jsvk2 r-37j5jr r-a023e6 r-16dba41 r-rjixqe r-bcqeeo r-bnwqim r-qvutc0']",xmlValue)
head(tweets)
## [1] "Fui tomar hj a terceira dose da vacina contra o C19. Caraca! Além de apenas determinados locais aplicarem a dita cuja, haja saco. Fila pra senha, fila pra cadastro, fila pra esperar, fila pra vacinar. Por isso o povo ñ tá indo. Quase desisti! #curitiba #COVID19"
## [2] "China’s zero-Covid policy is a pandemic waiting to happen http://dlvr.it/SHmvGn #COVID19 #COVID19variants"
## [3] "Urgent : la pharmacie Oberkampf dispose de plusieurs doses de vaccin Moderna valables aujourd’hui sans rdv pour les plus de 30 ans \nOuverte jusqu'à 19h30 , 4 rue Pasteur / place de la Marne #JouyenJosas #Vaccinezvous #COVID19 #Moderna"
## [4] "Ayer #Querétaro reportó un pico histórico de #COVID19 con dos mil 759 contagios en un día. #NoBajesLaGuardia #UsaCubrebocas y por favor, #Vacúnate \n\nhttps://adninformativo.mx/queretaro-rompe-nuevamente-pico-historico-de-contagios-2-mil-759-casos-de-covid-19-este-lunes/…"
## [5] " La móvil de vacunación estará en #Caucasia desde este viernes 28 y hasta el domingo 30 de enero.\n\nSi estás en este municipio, aprovecha y vacúnate contra #COVID19: hay primeras, segundas y dosis de refuerzo Rostro boca arriba.\n\n Más información en la alcaldía local."
## [6] "#AsíLoDijo || Vicepdta. @delcyrodriguezv: Este Consejo también está representado por el Gobierno del Reino Unido que ha adoptado la ilegítima decisión de apoderarse del oro del pueblo venezolano, negado incluso para atender la situación humanitaria generada por la #COVID19."
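The long class string in the XPath above is tied to one particular build of twitter.com and breaks whenever the site is restyled. At the time of writing, the tweet text node also carried a stabler data-testid attribute; this is an assumption about Twitter's markup, so verify it in your browser's inspector before relying on it:
# data-testid attributes tend to survive restyling better than generated class names
# (assumption about twitter.com's markup -- verify in the inspector first)
tweets = xpathSApply(page, "//div[@data-testid='tweetText']", xmlValue)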
Close the client and stop the server:
remDr$close()
rD$server$stop()
## [1] TRUE