After not using a reference manager at first (2014–2016) and then growing very frustrated with Mendeley over a couple of years, I started using Zotero in 2018. I am extremely happy with the software and its features: it just works very well for everything I do. The browser plugin imports the full citation info and the PDF with one click, which is great (if you do not have institutional access through your IP, it will often even find an open access version automatically). Word integration is seamless, keyboard shortcuts can be customized, and switching citation styles is extremely easy. Libraries can also be hosted on Zotero’s servers and synced or shared with others; if you need a lot of storage for full texts, you’ll need premium. If you sync your library, you can also access it from any browser. You can import from and export to various formats. Zotero’s built-in PDF reader now also supports text highlighting in multiple colors and commenting, and you can save notes along with each entry. And it’s open source and free.
Currently, I have about 4000 entries in my main library and I was curious what kind of articles I’d been collecting, reading, and citing in the past couple of years.
You can export the entire library to CSV through the menu. So let’s jump into R and have a look. I used this mainly as an exercise for myself to get used to the whole dplyr and pipe (%>%) logic.
Import CSV file and setup
zotero <- read.csv("C:/testing/zoteroLibrary/zotero.csv") # import
names(zotero)[1] <- "Key" # rename the first variable
library(tidyverse) # load tidyverse for dplyr and ggplot2
zotero <- as_tibble(zotero) # convert df to tibble
zotero$Publication.Title <- as.factor(zotero$Publication.Title) # title as factor, not character
First look at the journals
This counts the number of entries per publication title (i.e., per journal, or in R terms, per factor level) after excluding the rows where the publication title is empty.
zotero %>%
  filter(Publication.Title != "") %>%
  count(Publication.Title) %>%
  arrange(desc(n))
## # A tibble: 1,057 x 2
##    Publication.Title                                   n
##    <fct>                                           <int>
##  1 New Media & Society                               142
##  2 Information, Communication & Society              111
##  3 Computers in Human Behavior                        76
##  4 Current Opinion in Psychology                      50
##  5 Social Science Computer Review                     49
##  6 International Journal of Communication             47
##  7 Journal of Communication                           44
##  8 Proceedings of the National Academy of Sciences    44
##  9 Big Data & Society                                 43
## 10 Social Media + Society                             41
## # ... with 1,047 more rows
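The count-then-sort pattern above can also be written in one step, because count() accepts a sort argument. A minimal sketch on made-up data (the toy tibble and its journal names are placeholders, not entries from the actual library):

```r
library(dplyr)

# Toy stand-in for the Zotero export; the titles are placeholders
toy <- tibble(
  Publication.Title = c("Journal A", "Journal B", "Journal A", "",
                        "Journal C", "Journal A", "Journal B")
)

# Long form, as used above: count, then sort in descending order
long_form <- toy %>%
  filter(Publication.Title != "") %>%
  count(Publication.Title) %>%
  arrange(desc(n))

# Short form: count() sorts by frequency itself when sort = TRUE
short_form <- toy %>%
  filter(Publication.Title != "") %>%
  count(Publication.Title, sort = TRUE)

print(short_form)
```

Both versions should give the same table here; the short form simply saves one line in the pipe.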
Constructing a new data frame with selected variables
I am only interested in the main publication types and have a bunch of untitled stuff or PDFs without metadata in my library, so I need to exclude those first. Journal articles, book sections, and other types have an entry in Publication.Title but books don’t, so to not lose books, I used an OR condition for the filter. There is also a field for when an entry was added to Zotero, and R needs to know the date format to deal correctly with that.
bib <- zotero %>%
  select(Publication.Year, Publication.Title, Item.Type, Author, Title, Date.Added, Url, DOI)
bib <- bib %>%
  filter(Publication.Title != "" | Item.Type == "book") # keep entries with a publication title, plus books
bib <- bib %>%
  filter(Publication.Year != "") # drop entries without a publication year
bib$Date.Added <- as.Date(bib$Date.Added, format = "%d/%m/%Y %H:%M") # parse the timestamp, keeping only the date
bib$monthyear <- format(bib$Date.Added, "%Y %m") # this is no longer date format, but can be handy
bib$monthyear <- as.factor(bib$monthyear)
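The format string passed to as.Date() has to match the layout of the timestamps in the CSV, otherwise the result is NA rather than an error. A minimal base R sketch with hypothetical timestamps (the three strings below are made up; the last one deliberately uses a different layout):

```r
# Hypothetical timestamps; the first two follow the day/month/year layout
# assumed by the format string above, the third does not
stamps <- c("03/02/2018 14:05", "17/03/2021 09:30", "2021-03-17 09:30:00")

# as.Date() silently returns NA for strings that do not match the format
parsed <- as.Date(stamps, format = "%d/%m/%Y %H:%M")
print(parsed)

# Year-month label as in the post; with the year first, the labels
# sort chronologically even though they are plain strings
monthyear <- format(parsed, "%Y %m")
print(monthyear)

# Sanity check: how many timestamps failed to parse?
n_failed <- sum(is.na(parsed))
print(n_failed)
```

Running a check like sum(is.na(bib$Date.Added)) after the conversion is a cheap way to confirm that the format string fits your own export.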
When was stuff added?
Now I can plot the number of new entries by month (the date variable has more detailed information, but monthly was a reasonable granularity here).
added <- bib %>%
  count(monthyear) # prepping a df with the frequencies, could also be done in the plot function only
ggplot(added, aes(x = monthyear, y = n, group = 1)) +
  geom_point(stat = "identity") +
  geom_line() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "", y = "")

mean(added$n) # monthly average
## [1] 85.12821
mean(added$n)/(365/12) # daily average
## [1] 2.798736
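The two averages are simple ratios and can be reproduced by hand. In the sketch below, the total of 3320 entries is the sum of the n column in the item type table further down; the 39 months is an assumption inferred from 3320 / 85.12821 and is not stated anywhere in the output. Note also that count() only returns months with at least one new entry, so months without additions do not pull the average down.

```r
total_entries <- 3320  # sum of n in the item type table
n_months <- 39         # assumption: inferred from 3320 / 85.12821

# Monthly average: entries divided by the number of months with entries
monthly <- total_entries / n_months
print(monthly)         # 85.12821

# Daily average: divide by the average month length of 365 / 12 days
days_per_month <- 365 / 12
daily <- monthly / days_per_month
print(daily)           # 2.798736
```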
I checked what happened in March 2021: that was when I merged in another library where we had collected about 250 papers for a literature review for a project proposal.
Using the whole data frame (bib), this can also be split up, e.g., by Item.Type.
ggplot(bib, aes(x = monthyear, group = Item.Type, color = Item.Type)) +
  geom_point(stat = "count") +
  geom_line(stat = "count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "", y = "")

Clearly, journal articles dominate the library with 84% of all entries:
type <- bib %>%
  count(Item.Type) %>%
  arrange(desc(n))
ggplot(type, aes(x = reorder(Item.Type, -n), y = n)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "", y = "") +
  scale_y_continuous(breaks = seq(200, 2800, 200))

bib %>%
  count(Item.Type) %>%
  arrange(desc(n)) %>%
  mutate(prop = prop.table(n))
## # A tibble: 9 x 3
##   Item.Type               n    prop
##   <chr>               <int>   <dbl>
## 1 journalArticle       2778 0.837
## 2 book                  197 0.0593
## 3 bookSection           189 0.0569
## 4 conferencePaper        48 0.0145
## 5 webpage                37 0.0111
## 6 blogPost               31 0.00934
## 7 newspaperArticle       26 0.00783
## 8 magazineArticle         8 0.00241
## 9 encyclopediaArticle     6 0.00181
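For a plain vector of counts, prop.table() is just each count divided by the total. A short base R check using the counts from the table above (the variable names are only for illustration):

```r
# Counts per item type, copied from the table above
n <- c(journalArticle = 2778, book = 197, bookSection = 189,
       conferencePaper = 48, webpage = 37, blogPost = 31,
       newspaperArticle = 26, magazineArticle = 8,
       encyclopediaArticle = 6)

# prop.table() on a vector divides every element by the sum
prop <- prop.table(n)

# The same thing written out by hand
manual <- n / sum(n)

print(sum(n))          # 3320 entries in total
print(round(prop, 3))  # journalArticle comes out at 0.837
```

So the 84% for journal articles is 2778 out of 3320 entries.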
Top 40 journals
Next, I looked into the journals whose articles I save most and thus probably read and cite most.
topj <- bib %>%
  filter(Item.Type == "journalArticle") %>%
  count(Publication.Title) %>%
  arrange(desc(n)) %>%
  slice(1:40)
ggplot(topj, aes(x = n, y = reorder(Publication.Title, n))) +
  geom_bar(stat = "identity") +
  labs(x = "", y = "") +
  scale_x_continuous(breaks = seq(10, 150, 10))

New Media & Society comes out on top by quite a margin, which didn’t surprise me. The high frequency of Current Opinion in Psychology papers is because I am writing a piece for a special issue in that journal and, never having heard of it before, downloaded a bunch of its papers. Information, Communication & Society, Social Science Computer Review, Social Media + Society, and International Journal of Communication are all journals I have published in and regularly read articles from.
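One caveat of cutting a ranking at a fixed position with slice(1:40): journals tied with the 40th place are dropped without notice. A small sketch on made-up counts (the journal names A to D are placeholders) of how dplyr's slice_max() treats ties, as one possible alternative:

```r
library(dplyr)

# Made-up counts; B and C are tied for second place
counts <- tibble(journal = c("A", "B", "C", "D"),
                 n = c(5L, 3L, 3L, 1L))

# Pattern from the post: sort, then keep a fixed number of rows
top2_slice <- counts %>%
  arrange(desc(n)) %>%
  slice(1:2)

# slice_max() keeps ties by default, so it may return more than 2 rows
top2_ties <- counts %>%
  slice_max(n, n = 2)

# with_ties = FALSE enforces the exact cutoff again
top2_strict <- counts %>%
  slice_max(n, n = 2, with_ties = FALSE)

print(top2_ties)
```

Whether ties matter is a judgment call; for a top 40 plot the fixed cutoff is usually fine, but it is worth knowing which journals sit right at the edge.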
When were things published?
Finally, I looked into when the articles, books, chapters, etc. I have in my library were published. The earliest paper is from 1890, but the distribution is of course very heavily skewed towards the past couple of years (more than I expected actually). So for the plot, the range is cut at 1990.
range(bib$Publication.Year)
## [1] 1890 2021
bib90 <- bib %>%
  filter(Publication.Year >= 1990)
ggplot(bib90, aes(x = as.factor(Publication.Year))) +
  geom_bar() + # counts entries per year; geom_histogram(stat = "count") does the same but triggers a warning
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "", y = "")

There is no real take-away here – I enjoyed learning some tidyverse code and confirmed my intuitions about which journals feature frequently in my research.