[This article was first published on Mark H. White II, PhD, and kindly contributed to R-bloggers.]
Well, it looks like the time has finally come for me to join the club and write a large language model (LLM) blog post. I hope to do two things here:
Show how easy it is to seamlessly work with both R and Python code simultaneously
Use the OpenAI API to see how well it does extracting information from text
In my previous blog post, I discussed scraping film awards data to build a model predicting the Best Picture winner at the Academy Awards. One issue I run into, however, is that some HTML is understandably not written with scraping in mind. When I try to write a script that iterates through 601 movies, for example, the structure and naming of the data are inconsistent. The lack of standardization means writing modular functions for scraping data programmatically is difficult.
A recent Pew Research Center report showed how they used GPT-3.5 Turbo to collect data about podcast guests. My approach here is similar: I scrape what I can, give it to the OpenAI API along with a prompt, and then interpret the result.
I wanted to add two variables to my Oscar model:
Is the director of the film also a writer?
Is the director of the film also a producer?
The reasoning being that maybe directors who are famous for writing their own material (e.g., Paul Thomas Anderson, Sofia Coppola) are more or less likely for their films to win Best Picture. Similarly, perhaps being a producer as well as director means that the director has achieved some level of previous success that makes them more likely to take home Best Picture.
The difficulty of scraping this from Wikipedia is that the “infobox” (i.e., the light grey box at the top, right-hand side of the entry) does not follow the same structure, formatting, or naming conventions across pages.
Methodology
To get the data I want (a logical value for whether or not the director was also a writer and another logical value for whether they were a producer), I took the following steps:
1. Use the rvest package in R to pull down the “infobox” from the Wikipedia page, doing my best to limit it to the information relevant to the director, writer, and producer
2. Use the openai Python library to pass this information to GPT-3.5 Turbo or GPT-4
3. Parse this result in R using the tidyverse to arrange the data nicely and append it to my existing dataset for the Oscar model
Now, you could be asking: Why not use Python’s beautifulsoup4 in Step 1? Because I like rvest more and have more experience using it. And why not use R to access the OpenAI API? Because the official way in their documentation to access it is by using Python. Lastly, why not use pandas in Python to tidy the data afterward? Because I think the tidyverse in R is a much easier way to clean data.
The great news: Posit’s RStudio IDE can handle both R and Python (among many other languages). The use of the reticulate R package also means we can import Python functions directly into an R session (and vice versa with rpy2). These are all just tools at the end of the day, so why not use the ones I’m most comfortable, quickest, and most experienced with?
The Functions
I started with two files: funs.R and funs.py, which stored the functions I used.
funs.R is for pulling the data from the Wikipedia infobox, given the title and year of a film. I use this to search Wikipedia, get the URL of the first result from the search results, and then scrape the infobox from that page:
    #' Get the information box of a Wikipedia page
    #'
    #' Takes the title and year of a film, searches for it, gets the top result,
    #' and pulls the information box at the top right of the page.
    #'
    #' @param title Title of the film
    #' @param year Year the film was released
    get_wikitext <- function(title, year) {
      tryCatch({
        tmp_tbl <- paste0(
          "https://en.wikipedia.org/w/index.php?search=",
          str_replace_all(title, " ", "+"),
          "+",
          year,
          "+film"
        ) %>%
          rvest::read_html() %>%
          rvest::html_nodes(".mw-search-result-ns-0:nth-child(1) a") %>%
          rvest::html_attr("href") %>%
          paste0("https://en.wikipedia.org", .) %>%
          rvest::read_html() %>%
          rvest::html_node(".vevent") %>%
          rvest::html_table() %>%
          janitor::clean_names()

        # just relevant rows
        lgls <- grepl("Direct", tmp_tbl[[1]]) |
          grepl("Screen", tmp_tbl[[1]]) |
          grepl("Written", tmp_tbl[[1]]) |
          grepl("Produce", tmp_tbl[[1]])
        tmp_tbl <- tmp_tbl[lgls, ]

        # clean up random css
        # I have no idea how this works
        # I just got it online
        tmp_tbl[[2]] <- str_remove_all(tmp_tbl[[2]], "^.*?\\")
        tmp_tbl[[2]] <- str_remove_all(tmp_tbl[[2]], "^\\..*?(?=\n)")
        tmp_tbl[[2]] <- str_remove_all(tmp_tbl[[2]], "^.*?\\")
        tmp_tbl[[2]] <- str_remove_all(tmp_tbl[[2]], "^\\..*?(?=\n)")

        # print text
        apply(tmp_tbl, 1, \(x) paste0(x[[1]], ": ", x[[2]])) %>%
          paste(collapse = ", ") %>%
          str_replace_all("\n", " ")
      },
      error = \(x) NA
      )
    }
An example output:

    > get_wikitext("all that jazz", 1979)
    [1] "Directed by: Bob Fosse, Written by: Robert Alan AurthurBob Fosse, Produced by: Robert Alan Aurthur"

Not perfect, but should be close enough. Sometimes it is closer, with different formatting:

    > get_wikitext("la la land", 2016)
    [1] "Directed by: Damien Chazelle, Written by: Damien Chazelle, Produced by: Fred Berger Jordan Horowitz Gary Gilbert Marc Platt"
The result is then passed to the function defined in funs.py. That script is:
    from openai import OpenAI
    import ast

    client = OpenAI(api_key='API_KEY_GOES_HERE')


    def get_results(client, wikitext):
        chat_completion = client.chat.completions.create(
            messages=[
                {
                    'role': 'user',
                    'content': '''
                    Below is a list that includes people involved with making a movie.
                    Each part corresponds to a different role that one might have in
                    making the movie (such as director, writer, or producer).
                    Could you tell me two things about the director? First, did the
                    director also write the script/screenplay/story for the movie?
                    And second, did the director also serve as a producer for the movie?
                    Note that, in this list, names may not be separated by spaces even
                    when they should be. That is, names may run together at times.
                    You do not need to provide any explanation.
                    Please reply with a valid Python dictionary, where: 'writer' is
                    followed by True if the director also wrote the film and False if
                    they did not, and 'producer' is followed by True if they also
                    produced the film and False if they did not.
                    If you cannot determine, you can follow it with NA instead of
                    True or False.
                    The information is:
                    ''' + wikitext
                }
            ],
            model='gpt-3.5-turbo'
        )

        # tidy result to make readable dict
        out = chat_completion.choices[0].message.content
        out = out.replace('\n', '')
        out = out.replace(' ', '')
        out = out.replace('true', 'True')
        out = out.replace('false', 'False')

        return ast.literal_eval(out)
(I don’t have as good of documentation here because I’m not as familiar with writing Python functions.)
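The string clean-up at the end of get_results() assumes the model replies with nothing but a dictionary literal. Below is a minimal sketch of a more forgiving parser, assuming replies might carry extra prose, lowercase true/false, or a bare NA. The helper name parse_reply is hypothetical and not part of funs.py.

```python
import ast
import re


def parse_reply(reply):
    """Parse a model reply into {'writer': ..., 'producer': ...}.

    Hypothetical helper, not part of the original funs.py. Values are
    True, False, or None (None stands in for the NA allowed by the prompt).
    Returns None if no usable dictionary is found.
    """
    # grab the first {...} span, in case the model added extra text
    match = re.search(r"\{.*?\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    text = match.group(0)

    # normalise bare words to Python literals; the patterns are
    # case-sensitive, so an existing 'True' is left alone
    text = re.sub(r"\btrue\b", "True", text)
    text = re.sub(r"\bfalse\b", "False", text)
    text = re.sub(r"\bNA\b", "None", text)

    try:
        out = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None

    if not isinstance(out, dict) or set(out) != {"writer", "producer"}:
        return None

    # anything that is not a real boolean is treated as missing
    return {k: (v if isinstance(v, bool) else None) for k, v in out.items()}
```

One caveat if this were dropped into the pipeline: as far as I know, reticulate hands a Python None to R as NULL, so the assignment step on the R side would need a guard for that case.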
Bringing It Together
I used an R script to use these functions in the same session. We start off by loading the R packages, sourcing the R script, activating the Python virtual environment (the path is relative to my file structure in my drive), and sourcing the Python script. I read in the data from a Google Sheet of mine and do one step of cleaning, as the read_sheet() function was bringing the title variable in as a list of lists instead of a character vector.
    library(tidyverse)
    library(reticulate)

    source("funs.R")
    use_virtualenv("../../")
    source_python("funs.py")

    dat <- googlesheets4::read_sheet("SHEET_ID_GOES_HERE") %>%
      mutate(film = as.character(film))
I then initialize two new variables in the data: writer and producer. These will get populated with TRUE if the director also served as a writer or producer, respectively, and FALSE otherwise.

    res <- dat %>%
      select(year, film) %>%
      mutate(writer = NA, producer = NA)
I iterate through each row using a for loop (I know this isn’t a very tidyverse way of doing things, as map_*() statements are preferred usually, but I felt it was easiest for making sense of the code and catching errors).
    for (r in seq_len(nrow(res))) {
      cat(r, "\n")

      tmp_wikitext <- get_wikitext(res$film[r], res$year[r])

      # skip if get_wikitext returns nothing or fails
      # (check length first: is.na() on a zero-length value breaks if())
      if (length(tmp_wikitext) == 0) next
      if (is.na(tmp_wikitext)) next

      # give the text to openai
      tmp_chat <- tryCatch(
        get_results(client, tmp_wikitext),
        error = \(x) NA
      )

      # if openai returned a dict of 2
      if (length(tmp_chat) == 2) {
        res$writer[r] <- tmp_chat$writer
        res$producer[r] <- tmp_chat$producer
      }
    }
I use cat() to track progress. I use the function from funs.R to pull down the text I want GPT-3.5 to extract information from. You’ll note that that function had a tryCatch() in it, because I didn’t want everything to stop at an error. Upon an error, it’ll just return an NA. I also found that sometimes it would read a different page successfully but then just return a blank character string. So if either of those are true, I say next to skip to the next row. This means I’m not wasting OpenAI tokens feeding it blanks.
Then I use a Python function inside of an R session! I use get_results(), which was defined in funs.py, to take the text from Wikipedia and give it to OpenAI. If there was an error, I again use tryCatch() to give me an NA instead of shutting the whole thing down. If there wasn’t an error, I add the values to the res data that I initialized above. Notably, the reticulate package knows that a Python dictionary should be brought in as a named list (here, of logical values).
What we can see from this script is you can seamlessly use R and Python in one session, depending on the tools you have and what you’re comfortable with. A clickbait topic in data science for the last ten years or so has been “R or Python?” when really the answer is both: They play quite nicely with one another, thanks to the hard work of programmers who have developed packages like reticulate and Posit’s focus on languages beyond R.
Performance
Now that we’ve seen how one can use R and Python in harmony to access the OpenAI API, how well did GPT do? I compare both 3.5 Turbo and 4. The only change I had to make to funs.py to use GPT-4 was replacing 'gpt-3.5-turbo' with 'gpt-4'.
For each of the models, I ran the for loop above three times, as the GPT models aren’t reproducible: They can give different answers each time you give them the same prompt (one of my beefs with this methodology). On each pass, I only gave it the rows that were still NA, to save on tokens. Especially with GPT-3.5, this gave me more data to work with.
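The repeated passes can be sketched in Python as follows. This is an illustration under assumptions, not the code I ran: rows are plain dictionaries, None marks a row that is still uncoded, and code_film is a hypothetical stand-in for the scrape-then-ask step.

```python
def fill_missing(rows, code_film, passes=3):
    """Run up to `passes` passes, calling code_film only on uncoded rows.

    rows: list of dicts with keys 'film', 'year', 'writer', 'producer'.
    code_film: hypothetical callable (film, year) -> dict or None.
    Returns the number of calls made, a rough proxy for tokens spent.
    """
    calls = 0
    for _ in range(passes):
        # only rows that are still missing a value get sent again
        todo = [r for r in rows if r["writer"] is None or r["producer"] is None]
        if not todo:
            break  # everything is coded, so stop early
        for row in todo:
            calls += 1
            try:
                result = code_film(row["film"], row["year"])
            except Exception:
                result = None  # treat any failure as "still missing"
            if result is not None:
                row["writer"] = result["writer"]
                row["producer"] = result["producer"]
    return calls
```

Because rows that were coded on an earlier pass are filtered out, a film that succeeds on the first try is sent to the model only once, no matter how many passes are run.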
NAs
Using GPT-3.5, I was able to get a valid result for 447 of the 601 films. This was 444 for GPT-4. The three films that GPT-3.5 coded but GPT-4 did not were Pulp Fiction (1994), Chariots of Fire (1981), and Smilin’ Through (1933).
One note is that, before 1934, Academy Awards spanned multiple years. However, I code them all with the same year for ease of analysis. But that means I may be giving the Wikipedia search wrong information, so it isn’t a failure at the OpenAI stage but at the Wikipedia scraping stage.
If we remove the films before 1934, GPT-3.5 coded 77.2% of the films, while GPT-4 coded 76.8%. This may not be GPT’s fault, however, as getting the text from Wikipedia might still have been where the pipeline produced an NA.
Accuracy
But how often was each model giving us the correct answer? I hand-coded a random sample of 100 movies and counted how often each model was correct. The four rows in the table below represent different combinations of being correct/incorrect.
| Writer Correct | Producer Correct | n (GPT-3.5 Turbo) | n (GPT-4) |
|---|---|---|---|
| FALSE | FALSE | 4 | 0 |
| FALSE | TRUE | 10 | 2 |
| TRUE | FALSE | 38 | 1 |
| TRUE | TRUE | 48 | 97 |
The last row shows complete accuracy, where both coding for writer and producer were correct. The results are clearly in favor of GPT-4: It was fully correct 97% of the time, whereas GPT-3.5 Turbo was fully correct only 48% of the time. It was the coding of producer that sank GPT-3.5 Turbo: It was correct 86% of the time with writer, but only 58% of the time with producer. I feel confident using the GPT-4 data for my Oscar model; I told myself a priori I’d be good with anything >90% accurate (an arbitrary threshold, admittedly).
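The percentages quoted here follow directly from the table. A short Python check, using only the counts from the table (the variable and function names are mine, chosen for illustration):

```python
# counts from the table: (writer_correct, producer_correct) -> n
counts = {
    "gpt-3.5-turbo": {(False, False): 4, (False, True): 10,
                      (True, False): 38, (True, True): 48},
    "gpt-4": {(False, False): 0, (False, True): 2,
              (True, False): 1, (True, True): 97},
}


def accuracy(table):
    """Return percent correct for writer, producer, and both together."""
    total = sum(table.values())
    writer = sum(n for (w, _), n in table.items() if w)
    producer = sum(n for (_, p), n in table.items() if p)
    both = table[(True, True)]
    return {
        "writer": 100 * writer / total,
        "producer": 100 * producer / total,
        "both": 100 * both / total,
    }


for model, table in counts.items():
    print(model, accuracy(table))
```

For GPT-3.5 Turbo this reproduces the 86%, 58%, and 48% figures. The same counts put GPT-4 at 98% on writer and 99% on producer.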
So, not really a surprise that the newer model performed better. But I am somewhat surprised that GPT-3.5 Turbo couldn’t extract information even when I was giving it very specific instructions and a mostly clean piece of text to examine. Maybe I just do not know how to talk to the model correctly? I brought this up with a group of colleagues, to which one said, “No idea but this is why I expect prompt engineering to be a major like next year,” and they may very well be correct.
Conclusion
You can use R and Python together smoothly
You can use the OpenAI API to efficiently do content coding for your research and models
ALWAYS KEEP A HUMAN IN THE LOOP to check for accuracy and fairness