Using R and Python Together, Seamlessly: A Case Study Using OpenAI's GPT Models (2024)

[This article was first published on Mark H. White II, PhD, and kindly contributed to R-bloggers].


Well, it looks like the time has finally come for me to join the club and write a large language model (LLM) blog post. I hope to do two things here:

Show how easy it is to seamlessly work with both R and Python code simultaneously
Use the OpenAI API to see how well it does extracting information from text

In my previous blog post, I discussed scraping film awards data to build a model predicting the Best Picture winner at the Academy Awards. One issue I run into, however, is that some HTML is understandably not written with scraping in mind. When I try to write a script that iterates through 601 movies, for example, the structure and naming of the data are inconsistent. The lack of standardization means writing modular functions for scraping data programmatically is difficult.

A recent Pew Research Center report showed how they used GPT-3.5 Turbo to collect data about podcast guests. My approach here is similar: I scrape what I can, give it to the OpenAI API along with a prompt, and then interpret the result.

I wanted to add two variables to my Oscar model:

Is the director of the film also a writer?
Is the director of the film also a producer?

The reasoning being that maybe directors who are famous for writing their own material (e.g., Paul Thomas Anderson, Sofia Coppola) are more or less likely to see their films win Best Picture. Similarly, perhaps being a producer as well as director means that the director has achieved some level of previous success that makes them more likely to take home Best Picture.

The difficulty of scraping this from Wikipedia is that the “infobox” (i.e., the light grey box at the top, right-hand side of the entry) does not follow the same structure, formatting, or naming conventions across pages.

Methodology

To get the data I want (a logical value for whether or not the director was also a writer and another logical value for whether they were a producer), I took the following steps:

Use the rvest package in R to pull down the “infobox” from the Wikipedia page, doing my best to limit it to the information relevant to the director, writer, and producer
Use the openai Python library to pass this information to GPT-3.5 Turbo or GPT-4
Parse this result in R using the tidyverse to arrange the data nicely and append it to my existing dataset for the Oscar model

Now, you could be asking: Why not use Python’s beautifulsoup4 in Step 1? Because I like rvest more and have more experience using it. And why not use R to access the OpenAI API? Because the official way in their documentation to access it is by using Python. Lastly, why not use pandas in Python to tidy the data afterward? Because I think the tidyverse in R is a much easier way to clean data.

The great news: Posit’s RStudio IDE can handle both R and Python (among many other languages). The reticulate R package also means we can import Python functions directly into an R session (and rpy2 does the reverse, calling R from Python). These are all just tools at the end of the day, so why not use the ones I’m most comfortable, quickest, and most experienced with?

The Functions

I started with two files: funs.R and funs.py, which stored the functions I used.

funs.R is for pulling the data from the Wikipedia infobox, given the title and year of a film. I use this to search Wikipedia, get the URL of the first result from the search results, and then scrape the infobox from that page:

#' Get the information box of a Wikipedia page
#'
#' Takes the title and year of a film, searches for it, gets the top result,
#' and pulls the information box at the top right of the page.
#'
#' @param title Title of the film
#' @param year Year the film was released
get_wikitext <- function(title, year) {
  tryCatch({
    tmp_tbl <- paste0(
      "https://en.wikipedia.org/w/index.php?search=",
      str_replace_all(title, " ", "+"),
      "+",
      year,
      "+film"
    ) %>%
      rvest::read_html() %>%
      rvest::html_nodes(".mw-search-result-ns-0:nth-child(1) a") %>%
      rvest::html_attr("href") %>%
      paste0("https://en.wikipedia.org", .) %>%
      rvest::read_html() %>%
      rvest::html_node(".vevent") %>%
      rvest::html_table() %>%
      janitor::clean_names()

    # just relevant rows
    lgls <- grepl("Direct", tmp_tbl[[1]]) |
      grepl("Screen", tmp_tbl[[1]]) |
      grepl("Written", tmp_tbl[[1]]) |
      grepl("Produce", tmp_tbl[[1]])
    tmp_tbl <- tmp_tbl[lgls, ]

    # clean up random css
    # I have no idea how this works
    # I just got it online
    # note: the first pattern ends in a bare backslash as published,
    # so part of it appears to be missing
    tmp_tbl[[2]] <- str_remove_all(tmp_tbl[[2]], "^.*?\\")
    tmp_tbl[[2]] <- str_remove_all(tmp_tbl[[2]], "^\\..*?(?=\n)")
    tmp_tbl[[2]] <- str_remove_all(tmp_tbl[[2]], "^.*?\\")
    tmp_tbl[[2]] <- str_remove_all(tmp_tbl[[2]], "^\\..*?(?=\n)")

    # print text
    apply(tmp_tbl, 1, \(x) paste0(x[[1]], ": ", x[[2]])) %>%
      paste(collapse = ", ") %>%
      str_replace_all("\n", " ")
  },
  error = \(x) NA
  )
}
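For readers who work mainly in Python, the search URL that get_wikitext() builds can be sketched as below. This is only an illustration: build_search_url is a hypothetical helper that does not appear in funs.R or funs.py. It uses urlencode from the standard library, which turns spaces into "+" just like the replacement above and also escapes characters such as "&" that a plain space-to-plus swap leaves untouched.

```python
from urllib.parse import urlencode

WIKI_SEARCH = "https://en.wikipedia.org/w/index.php"  # same endpoint as funs.R


def build_search_url(title, year):
    """Hypothetical helper: build the Wikipedia search URL for a film."""
    # urlencode turns spaces into "+" and percent-escapes reserved characters
    query = urlencode({"search": f"{title} {year} film"})
    return f"{WIKI_SEARCH}?{query}"


print(build_search_url("la la land", 2016))
# https://en.wikipedia.org/w/index.php?search=la+la+land+2016+film
```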

An example output:

> get_wikitext("all that jazz", 1979)
[1] "Directed by: Bob Fosse, Written by: Robert Alan AurthurBob Fosse, Produced by: Robert Alan Aurthur"

Not perfect, but should be close enough. Sometimes it is closer, with different formatting:

> get_wikitext("la la land", 2016)
[1] "Directed by: Damien Chazelle, Written by: Damien Chazelle, Produced by: Fred Berger Jordan Horowitz Gary Gilbert Marc Platt"

The result is then passed to the function defined in funs.py. That script is:

from openai import OpenAI
import ast

client = OpenAI(api_key='API_KEY_GOES_HERE')

def get_results(client, wikitext):
    chat_completion = client.chat.completions.create(
        messages=[
            {
                'role': 'user',
                'content': '''
                Below is a list that includes people involved with making a
                movie. Each part corresponds to a different role that one
                might have in making the movie (such as director, writer, or
                producer). Could you tell me two things about the director?
                First, did the director also write the script/screenplay/story
                for the movie? And second, did the director also serve as a
                producer for the movie? Note that, in this list, names may not
                be separated by spaces even when they should be. That is,
                names may run together at times. You do not need to provide
                any explanation. Please reply with a valid Python dictionary,
                where: 'writer' is followed by True if the director also wrote
                the film and False if they did not, and 'producer' is followed
                by True if they also produced the film and False if they did
                not. If you cannot determine, you can follow it with NA
                instead of True or False. The information is:
                ''' + wikitext
            }
        ],
        model='gpt-3.5-turbo'
    )

    # tidy result to make readable dict
    out = chat_completion.choices[0].message.content
    out = out.replace('\n', '')
    out = out.replace(' ', '')
    out = out.replace('true', 'True')
    out = out.replace('false', 'False')

    return ast.literal_eval(out)

(I don’t have as good documentation here because I’m not as familiar with writing Python functions.)
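One fragile spot is the final ast.literal_eval() call. The prompt allows NA as an answer, but NA is not a Python literal, so a reply containing it raises an error and both fields are lost for that film. A minimal sketch of a more forgiving parser is below. It is not part of funs.py: parse_reply is a hypothetical name, and the sketch assumes the reply is a flat dictionary-like string with the keys 'writer' and 'producer'.

```python
import re


def parse_reply(text):
    """Hypothetical parser: map a model reply to {'writer': ..., 'producer': ...}.

    Values are True, False, or None (None stands in for NA or a missing key).
    """
    values = {"true": True, "false": False}
    result = {}
    for key in ("writer", "producer"):
        # look for the key (optionally quoted), a colon, then a bare word
        match = re.search(rf"['\"]?{key}['\"]?\s*:\s*['\"]?(\w+)", text, re.IGNORECASE)
        word = match.group(1).lower() if match else "na"
        result[key] = values.get(word)  # anything else (e.g. NA) becomes None
    return result


print(parse_reply("{'writer': True, 'producer': NA}"))
# {'writer': True, 'producer': None}
```

If something like this were called through reticulate, None would arrive in R as NULL, so the calling code would need to map NULL to NA before assigning it into a data frame column.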

Bringing It Together

I used an R script to use these functions in the same session. We start off by loading the R packages, sourcing the R script, activating the Python virtual environment (the path is relative to my file structure on my drive), and sourcing the Python script. I read in the data from a Google Sheet of mine and do one step of cleaning, as the read_sheet() function was bringing the title variable in as a list of lists instead of a character vector.

library(tidyverse)
library(reticulate)

source("funs.R")
use_virtualenv("../../")
source_python("funs.py")

dat <- googlesheets4::read_sheet("SHEET_ID_GOES_HERE") %>%
  mutate(film = as.character(film))

I then initialize two new variables in the data: writer and producer. These will get populated with TRUE if the director also served as a writer or producer, respectively, and FALSE otherwise.

res <- dat %>%
  select(year, film) %>%
  mutate(writer = NA, producer = NA)

I iterate through each row using a for loop (I know this isn’t a very tidyverse way of doing things, as map_*() statements are usually preferred, but I felt it was easiest for making sense of the code and catching errors).

for (r in 1:nrow(res)) {
  cat(r, "\n")

  tmp_wikitext <- get_wikitext(res$film[r], res$year[r])

  # skip if get_wikitext fails
  if (is.na(tmp_wikitext)) next
  if (length(tmp_wikitext) == 0) next

  # give the text to openai
  tmp_chat <- tryCatch(
    get_results(client, tmp_wikitext),
    error = \(x) NA
  )

  # if openai returned a dict of 2
  if (length(tmp_chat) == 2) {
    res$writer[r] <- tmp_chat$writer
    res$producer[r] <- tmp_chat$producer
  }
}

I use cat() to track progress. I use the function from funs.R to pull down the text I want GPT-3.5 to extract information from. You’ll note that the function had a tryCatch() in it, because I didn’t want everything to stop at an error. Upon an error, it’ll just return an NA. I also found that sometimes it would read a different page successfully but then just return a blank character string. So if either of those is true, I say next to skip to the next row. This means I’m not wasting OpenAI tokens feeding it blanks.

Then I use a Python function inside of an R session! I use get_results(), which was defined in funs.py, to take the text from Wikipedia and give it to OpenAI. If there was an error, I again use tryCatch() to give me an NA instead of shutting the whole thing down. If there wasn’t an error, I add the values to the res data that I initialized above. Notably, reticulate knows that a Python dictionary should be brought in as a named list (here, a list of logical values).

What we can see from this script is that you can seamlessly use R and Python in one session, depending on the tools you have and what you’re comfortable with. A clickbait topic in data science for the last ten years or so has been “R or Python?” when really the answer is both: They play quite nicely with one another, thanks to the hard work of programmers who have developed packages like reticulate and Posit’s focus on languages beyond R.

Performance

Now that we’ve seen how one can use R and Python in harmony to access the OpenAI API, how well did GPT do? I compare both 3.5 Turbo and 4. The only change I had to make to funs.py to use GPT-4 was replacing 'gpt-3.5-turbo' with 'gpt-4'.

For each of the models, I ran the for loop above three times, as the GPT models aren’t reproducible: They can give different answers each time you give them the same prompt (one of my beefs with this methodology). After each iteration, I only gave it rows that were still NA, to save on tokens. Especially with GPT-3.5, this gave me more data to work with.
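The idea of repeating the run only for rows that are still missing can be sketched in Python as follows. This illustrates the pattern and is not the R loop used above: fill_missing and the stand-in flaky_classify are hypothetical, None plays the role of NA, and the returned dictionary is a dummy value rather than a real model reply.

```python
def fill_missing(rows, classify, passes=3):
    """Run up to `passes` passes, re-querying only rows that are still None.

    rows: dict mapping a film key to a result (or None if not yet coded)
    classify: callable taking a film key, returning a result or raising
    """
    for _ in range(passes):
        pending = [film for film, result in rows.items() if result is None]
        if not pending:
            break  # everything coded; no tokens spent on further passes
        for film in pending:
            try:
                rows[film] = classify(film)
            except Exception:
                rows[film] = None  # leave it for the next pass
    return rows


# Stand-in for the API call: fails the first time it sees each film
seen = set()


def flaky_classify(film):
    if film not in seen:
        seen.add(film)
        raise ValueError("unparseable reply")
    return {"writer": True, "producer": False}  # dummy value


coded = fill_missing({"La La Land": None, "All That Jazz": None}, flaky_classify)
print(sum(result is not None for result in coded.values()))
# 2
```

Because rows that already have a result are never sent again, a later pass costs tokens only for the films that failed earlier.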

NAs

Using GPT-3.5, I was able to get a valid result for 447 of the 601 films. This was 444 for GPT-4. The three films that GPT-3.5 coded but GPT-4 did not were Pulp Fiction (1994), Chariots of Fire (1981), and Smilin’ Through (1933).

One note is that, before 1934, Academy Awards spanned multiple years. However, I code them all with the same year for ease of analysis. But that means I may be giving the Wikipedia search wrong information, so it isn’t a failure at the OpenAI stage but at the Wikipedia scraping stage.

If we remove the films before 1934, GPT-3.5 coded 77.2% of the films, while GPT-4 coded 76.8%. This may not be GPT’s fault, however, as getting the text from Wikipedia might still have been where the pipeline produced an NA.

Accuracy

But how often was each model giving us the correct answer? I hand-coded a random sample of 100 movies and counted how often each model was correct. The four rows in the table below represent the different combinations of being correct/incorrect; the last two columns count the films in each combination for GPT-3.5 Turbo and GPT-4.

Writer Correct	Producer Correct	n (GPT-3.5 Turbo)	n (GPT-4)
FALSE	FALSE	4	0
FALSE	TRUE	10	2
TRUE	FALSE	38	1
TRUE	TRUE	48	97

The last row shows complete accuracy, where the coding for both writer and producer was correct. The results clearly favor GPT-4: It was fully correct 97% of the time, whereas GPT-3.5 Turbo was fully correct only 48% of the time. It was the coding of producer that sank GPT-3.5 Turbo: It was correct 86% of the time with writer, but only 58% of the time with producer. I feel confident using the GPT-4 data for my Oscar model; I told myself a priori I’d be good with anything >90% accurate (an arbitrary threshold, admittedly).
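The percentages quoted above follow directly from the table. A short Python check, using only the counts shown there (the variable and function names are illustrative):

```python
# (writer_correct, producer_correct) -> (n for GPT-3.5 Turbo, n for GPT-4)
counts = {
    (False, False): (4, 0),
    (False, True): (10, 2),
    (True, False): (38, 1),
    (True, True): (48, 97),
}


def accuracy(model_index):
    """Return (writer, producer, both) accuracy as percentages for one model."""
    total = sum(n[model_index] for n in counts.values())
    writer = sum(n[model_index] for (w, _), n in counts.items() if w)
    producer = sum(n[model_index] for (_, p), n in counts.items() if p)
    both = counts[(True, True)][model_index]
    return tuple(100 * x / total for x in (writer, producer, both))


print("GPT-3.5 Turbo:", accuracy(0))  # (86.0, 58.0, 48.0)
print("GPT-4:", accuracy(1))          # (98.0, 99.0, 97.0)
```

By the same arithmetic, GPT-4 was correct on writer for 98 of the 100 films and on producer for 99.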

So, not really a surprise that the newer model performed better. But I am somewhat surprised that GPT-3.5 Turbo couldn’t extract information even when I was giving it very specific instructions and a mostly clean piece of text to examine. Maybe I just do not know how to talk to the model correctly? I brought this up with a group of colleagues, to which one said, “No idea but this is why I expect prompt engineering to be a major like next year,” and they may very well be correct.

Conclusion

You can use R and Python together smoothly
You can use the OpenAI API to efficiently do content coding for your research and models
ALWAYS KEEP A HUMAN IN THE LOOP to check for accuracy and fairness

