Scrape Google Scholar in R
- What will be scraped
- Prerequisites
- Full Code
- Explanation
- Google Scholar Python scraper alternatives
Scraping Google Scholar profiles in R can be a powerful tool for academic researchers, librarians, and data analysts.
This blog post will show how to scrape profile data with pagination.
What will be scraped
Prerequisites
In your R console, install all the needed packages:
install.packages("httr")
install.packages("rvest")
install.packages("jsonlite")
install.packages("purrr")
install.packages("stringr")
install.packages("glue")
install.packages("dplyr")
Full Code
Please keep in mind that I'm not an experienced R user, and some of these techniques could probably be implemented better.
library(httr)
library(rvest)
library(jsonlite)
library(purrr)
library(stringr)
library(glue)
library(dplyr)
scrape_all_profiles_from_university <- function(label, university_name) {
  headers <- c("User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36")
  # remove trailing whitespaces/hidden characters
  label <- trimws(label)
  university_name <- trimws(university_name)
  params <- list(
    view_op = "search_authors",
    mauthors = glue("label:{label} '{university_name}'"),
    hl = "en",
    astart = 0
  )
  # empty list to store the future profile results
  all_profile_results <- list()
  profiles_is_present <- TRUE
  while (profiles_is_present) {
    response <- GET("https://scholar.google.com/citations", query = params, add_headers(.headers = headers))
    page <- read_html(content(response, "text"))
    print(paste0("extracting authors at page #", params$astart))
    profiles <- page %>% html_elements(".gs_ai_chpr")
    profile_results <- map(profiles, function(profile) {
      name <- profile %>% html_element(".gs_ai_name a") %>% html_text()
      link <- paste0("https://scholar.google.com", profile %>% html_element(".gs_ai_name a") %>% html_attr("href"))
      affiliations <- profile %>% html_element(".gs_ai_aff") %>% html_text(trim = TRUE)
      email <- profile %>% html_element(".gs_ai_eml") %>% html_text()
      cited_by <- profile %>% html_element(".gs_ai_cby") %>% html_text() %>% gsub(pattern = "[^0-9]", replacement = "") # Cited by 17143 -> 17143
      interests <- profile %>% html_elements(".gs_ai_one_int") %>% html_text()
      # scalar values instead of single-value vectors
      # or data.frame() could be used instead here
      list(
        profile_name = name[[1]],
        profile_link = link[[1]],
        profile_affiliations = affiliations[[1]],
        profile_email = email[[1]],
        profile_cited_by_count = cited_by[[1]],
        profile_interests = interests
      )
    })
    # append profile results to the list
    all_profile_results <- c(all_profile_results, profile_results)
    # pagination
    next_page_button <- page %>% html_element("button.gs_btnPR") %>% html_attr("onclick")
    if (!is.na(next_page_button)) {
      # extract the "after_author" parameter from the "onclick" attribute of the "Next" button using regex
      # and assign it to the "after_author" URL parameter, which is the next-page token,
      # used along with the "astart" URL param
      params$after_author <- str_match(next_page_button, "after_author\\\\x3d(.*)\\\\x26")[, 2]
      params$astart <- params$astart + 10
    } else {
      profiles_is_present <- FALSE
    }
  }
  # convert to data frame
  all_profile_results <- data.frame(do.call(rbind, all_profile_results), stringsAsFactors = FALSE)
  return(all_profile_results)
}
# Scrape the data
data <- scrape_all_profiles_from_university(label = "physics", university_name = "Harvard University")
# Select all columns of the data frame using dplyr
all_data <- select(data, everything())
# Extract the email addresses using dplyr
emails <- all_data %>% pull(profile_email)
for (email in emails) {
  cat("- ", email, "\n")
}
Explanation
Import all the needed packages:
library(httr)
library(rvest)
library(jsonlite)
library(purrr)
library(stringr)
library(glue)
library(dplyr)
Next, we create a function with 2 arguments, label and university_name:
scrape_all_profiles_from_university <- function(label, university_name) {
# ... code
}
The following step is to pass a browser user-agent so that we look like an actual user sending the request, not a bot. Check what your user-agent is.
You can read more about this topic in my reducing the chance of being blocked while web scraping blog post.
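If you're curious what user-agent your R session sends when you don't set one, a quick sketch is to ask the httpbin.org echo service (an external test service, used here only for illustration):
library(httr)
# httpbin.org/headers echoes back the request headers it received,
# including the default user-agent that httr/curl sends
resp <- GET("https://httpbin.org/headers")
cat(content(resp, "text"))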
The params list is used to create the URL parameters for the request and to dynamically pass the label and university_name data to the request.
headers <- c("User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36")
# remove trailing whitespaces/hidden characters
label <- trimws(label)
university_name <- trimws(university_name)
params <- list(
  view_op = "search_authors",
  mauthors = glue("label:{label} '{university_name}'"),
  hl = "en",
  astart = 0
)
# empty list to store the future profile results
all_profile_results <- list()
After that, we need to create a while loop that paginates through all pages dynamically: no matter how many pages there are, it will go through all of them. We could use a hardcoded approach (iterating from page X to page Y), but this is not reliable.
If you only want to iterate over a fixed number of pages, you can create a variable before the while loop that says how many iterations should be done, plus a condition that checks the current iteration.
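A minimal sketch of that idea (the max_pages and current_page names are my own additions, not part of the scraper in this post):
max_pages <- 3     # stop after 3 result pages
current_page <- 0
profiles_is_present <- TRUE
while (profiles_is_present && current_page < max_pages) {
  # ... same request/extraction/pagination code as below ...
  current_page <- current_page + 1
}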
While making the request, we pass the previously created params and headers, read the HTML content, and assign it to the page variable:
profiles_is_present <- TRUE
while (profiles_is_present) {
  # on each params update, 'response' will be different (new page) and 'page' accordingly
  response <- GET("https://scholar.google.com/citations", query = params, add_headers(.headers = headers))
  page <- read_html(content(response, "text"))
  # ...
}
At this step, we’re iterating (map()
) over HTML containers with .gs_ai_chpr
CSS selector and extracting data.
profiles <- page %>% html_elements(".gs_ai_chpr")
profile_results <- map(profiles, function(profile) {
  name <- profile %>% html_element(".gs_ai_name a") %>% html_text()
  link <- paste0("https://scholar.google.com", profile %>% html_element(".gs_ai_name a") %>% html_attr("href"))
  affiliations <- profile %>% html_element(".gs_ai_aff") %>% html_text(trim = TRUE)
  email <- profile %>% html_element(".gs_ai_eml") %>% html_text()
  cited_by <- profile %>% html_element(".gs_ai_cby") %>% html_text() %>% gsub(pattern = "[^0-9]", replacement = "") # Cited by 17143 -> 17143
  interests <- profile %>% html_elements(".gs_ai_one_int") %>% html_text()
Right after that, we take the data extracted in the current iteration and append it to all_profile_results by concatenating the two lists all_profile_results and profile_results together:
  # not really sure if [[1]] is needed
  list(
    profile_name = name[[1]],
    profile_link = link[[1]],
    profile_affiliations = affiliations[[1]],
    profile_email = email[[1]],
    profile_cited_by_count = cited_by[[1]],
    profile_interests = interests
  )
})
# append profile results to the list
all_profile_results <- c(all_profile_results, profile_results)
At this point, we get to the actual pagination.
- Firstly, we extract the onclick attribute from the button HTML element and assign it to next_page_button.
- Secondly, we check if (!is.na(next_page_button)) (whether the button is present); otherwise, we exit the while loop because no button is available.
- Thirdly, we extract the next-page token from the button's onclick attribute and pass it to params as a new key (a toy illustration of this regex follows the snippet below).
- Lastly, we add 10 to astart under params, which is used in combination with the after_author parameter to drive pagination: astart of 10 = 2nd page, 20 = 3rd page, and so on.
next_page_button <- page %>% html_element("button.gs_btnPR") %>% html_attr("onclick")
if (!is.na(next_page_button)) {
  params$after_author <- str_match(next_page_button, "after_author\\\\x3d(.*)\\\\x26")[, 2]
  params$astart <- params$astart + 10
} else {
  profiles_is_present <- FALSE
}
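To make the regex clearer, here is a toy illustration with a made-up onclick value; the exact shape of the string (the \x3d / \x26 escapes around after_author) is an assumption based on what the page returns, not something guaranteed by Google Scholar:
library(stringr)
# hypothetical onclick value, only for demonstrating the regex
onclick <- "window.location='/citations?view_op=search_authors\\x26after_author\\x3dABCD1234\\x26astart\\x3d10'"
str_match(onclick, "after_author\\\\x3d(.*)\\\\x26")[, 2] # "ABCD1234"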
Finally, we convert all_profile_results to a data frame:
all_profile_results <- data.frame(do.call(rbind, all_profile_results), stringsAsFactors = FALSE)
return(all_profile_results)
do.call(rbind, ...) stacks all of the profile lists into a single matrix. stringsAsFactors = FALSE keeps the character columns as plain strings instead of converting them to factors.
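A tiny sketch of what that stacking does, with made-up values just to show the shape of the result:
# two made-up "profile" lists
rows <- list(
  list(profile_name = "A. Author", profile_cited_by_count = "10"),
  list(profile_name = "B. Author", profile_cited_by_count = "20")
)
stacked <- do.call(rbind, rows)                     # 2x2 matrix, one row per profile
df <- data.frame(stacked, stringsAsFactors = FALSE) # same data as a data frame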
And as a final step, here’s how we can access the data:
data <- scrape_all_profiles_from_university(label = "physics", university_name = "Harvard University")
all_data <- select(data, everything())
emails <- all_data %>% pull(profile_email)
for (email in emails) {
  cat("- ", email, "\n")
}
Outputs:
[1] "extracting authors at page #0"
[1] "extracting authors at page #10"
[1] "extracting authors at page #20"
[1] "extracting authors at page #30"
[1] "extracting authors at page #40"
[1] "extracting authors at page #50"
> ...
- Verified email at neu.edu
- Verified email at seas.harvard.edu
- Verified email at physics.harvard.edu
- Verified email at cfa.harvard.edu
- Verified email at physics.harvard.edu
- Verified email at cfa.harvard.edu
- Verified email at seas.harvard.edu
- Verified email at mcb.harvard.edu
-
- Verified email at mcgill.ca
- Verified email at cfa.harvard.edu
- Verified email at bidmc.harvard.edu
- Verified email at physics.harvard.edu
- Verified email at physics.harvard.edu
- Verified email at hsph.harvard.edu
- Verified email at g.harvard.edu
- Verified email at fas.harvard.edu
- Verified email at iisc.ac.in
- Verified email at fas.harvard.edu
- Verified email at fas.harvard.edu
- Verified email at physics.harvard.edu
- Verified email at fas.harvard.edu
- Verified email at bwh.harvard.edu
- Verified email at g.harvard.edu
- Verified email at seas.harvard.edu
- Verified email at g.harvard.edu
- Verified email at hsph.harvard.edu
- Verified email at math.harvard.edu
- Verified email at fas.harvard.edu
- Verified email at fas.harvard.edu
- Verified email at cfa.harvard.edu
- Verified email at harvard.edu
- Verified email at polytechnique.edu
- Verified email at seas.harvard.edu
- Verified email at fas.harvard.edu
- Verified email at g.harvard.edu
- Verified email at fas.harvard.edu
- Verified email at g.harvard.edu
- Verified email at seas.harvard.edu
- Verified email at cfa.harvard.edu
- Verified email at fas.harvard.edu
- Verified email at fas.harvard.edu
- Verified email at college.harvard.edu
- Verified email at fas.harvard.edu
- Verified email at go.cambridgecollege.edu
- Verified email at mgh.harvard.edu
- Verified email at hsph.harvard.edu
- Verified email at g.harvard.edu
-
- Verified email at g.harvard.edu
- Verified email at college.harvard.edu
- Verified email at college.harvard.edu
- Verified email at physics.uoc.gr
Google Scholar Python scraper alternatives
If you want to extract more data from Google Scholar but haven't figured out how to do it in R, you can use a couple of Python alternatives if you're comfortable with Python:
- scrape-google-scholar-py is an open-source project of mine that aims to extract all the possible data from Google Scholar. In the future, I'll port it to R.
- scholarly is also an open-source project that extracts data from Google Scholar. The difference between it and my package is that mine aims to extract all possible pages, while scholarly does not.