2 Methods

2.1 Book corpus

This study used 6,462 of the 10,000 most popular books listed on goodreads.com (Zając 2018).

Author gender was estimated by cross-referencing each author’s first name with baby name popularity from 20 to 60 years before the publication date (Social Security Administration 2019). When multiple authors were listed, the first author’s name was used. I manually annotated the most common authors whose first name was abbreviated or not listed in the database. This method does not account for ghost writers, pen names, multiple authors, writers born outside the United States, or name-gender discrepancies (for example, a person named “Russell” could be non-binary). To estimate the method’s accuracy, I manually reviewed 100 entries and found 2 incorrectly gendered authors, an error rate of about 2% in that sample.
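The name lookup described above can be sketched roughly as follows. This is an illustrative reconstruction, not the code used in the study: the function name `estimate_gender`, the layout of the `name_counts` table (keyed by birth year, lowercased name, and sex), and the toy counts are all assumptions made for the example.

```python
def estimate_gender(first_name, publication_year, name_counts,
                    min_age=20, max_age=60):
    """Guess a gender label from baby-name counts.

    name_counts is a hypothetical dict mapping
    (birth_year, lowercase_name, sex) -> number of births,
    with sex being "F" or "M".
    Returns (label, share), or (None, 0.0) when the name is
    unknown or evenly split.
    """
    name = first_name.strip().lower()
    counts = {"F": 0, "M": 0}
    # Sum births over the years in which the author was plausibly born.
    first_year = publication_year - max_age
    last_year = publication_year - min_age
    for birth_year in range(first_year, last_year + 1):
        for sex in counts:
            counts[sex] += name_counts.get((birth_year, name, sex), 0)
    total = counts["F"] + counts["M"]
    if total == 0 or counts["F"] == counts["M"]:
        return None, 0.0
    label = "F" if counts["F"] > counts["M"] else "M"
    return label, counts[label] / total


# Toy counts, invented purely for illustration.
toy_counts = {
    (1960, "leslie", "M"): 80,
    (1960, "leslie", "F"): 20,
    (1990, "leslie", "F"): 90,
    (1990, "leslie", "M"): 10,
}
print(estimate_gender("Leslie", 2000, toy_counts))
```

Restricting the sum to a window of birth years lets the same name resolve differently for books published decades apart, which is the reason for tying the lookup to the publication date.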

Gender is not binary, and binary classification of author gender is only a computationally convenient approximation.

2.2 Natural language processing

Lemmatization and part-of-speech tagging were performed with spaCy (Honnibal and Montani 2017), and character coreferences were resolved with NeuralCoref (Chaumond 2019).

Individual characters were defined as entities with 25 or more coreferenced pronouns. Character gender was determined by gendered pronoun usage. Only characters whose gender was estimated with at least 90% accuracy were included.
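One way to read the character filter above is sketched below. It assumes that the 90% criterion refers to the share of gendered pronouns agreeing with the assigned gender, which is an interpretation rather than something the text states. The pronoun sets and the function name `classify_character` are likewise hypothetical.

```python
from collections import Counter

# Assumed pronoun lists; the study's actual lists are not given.
FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}
MALE_PRONOUNS = {"he", "him", "his", "himself"}


def classify_character(pronouns, min_mentions=25, min_share=0.90):
    """Return "F", "M", or None for one coreference cluster.

    pronouns: list of pronoun strings coreferenced with the entity.
    min_share is interpreted here (an assumption) as the minimum
    fraction of gendered pronouns that must agree.
    """
    counts = Counter(p.lower() for p in pronouns)
    # Entities with too few coreferenced pronouns are not characters.
    if sum(counts.values()) < min_mentions:
        return None
    female = sum(counts[p] for p in FEMALE_PRONOUNS)
    male = sum(counts[p] for p in MALE_PRONOUNS)
    gendered = female + male
    if gendered == 0:
        return None
    if female / gendered >= min_share:
        return "F"
    if male / gendered >= min_share:
        return "M"
    # Mixed pronoun usage: exclude the character.
    return None
```

Under these assumptions, a cluster with 28 instances of "she" and 2 of "he" would be kept as female, while an evenly mixed cluster or one with fewer than 25 pronouns would be dropped.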

For each character, a bag-of-words was extracted from each sentence in which the character was referenced.
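The bag-of-words extraction can be illustrated with a short sketch. It assumes the text has already been split into sentences of lemmas and that coreference resolution has produced, for each character, the indices of the sentences that mention them. The names `character_bags`, `sentences`, and `mentions` are hypothetical.

```python
from collections import Counter


def character_bags(sentences, mentions):
    """Build one bag-of-words per (character, sentence) pair.

    sentences: list of sentences, each a list of lemma strings.
    mentions: dict mapping a character id to the indices of the
              sentences in which that character is referenced.
    Returns a dict mapping each character id to a list of Counters.
    """
    bags = {}
    for character, indices in mentions.items():
        # Deduplicate so a sentence with several mentions counts once.
        bags[character] = [Counter(sentences[i]) for i in sorted(set(indices))]
    return bags


# Toy input, invented for illustration.
sentences = [
    ["she", "open", "the", "door"],
    ["he", "laugh"],
    ["she", "laugh", "and", "she", "leave"],
]
mentions = {"anna": [0, 2, 2], "ben": [1]}
bags = character_bags(sentences, mentions)
total_for_anna = sum(bags["anna"], Counter())
```

Summing the per-sentence counters, as in the last line, gives a single word-count vector per character, which is the shape of input that a count-based model expects.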

2.3 Differential expression

The assumptions made when comparing bags-of-words are remarkably similar to those made when analyzing differential expression in next-generation RNA sequencing data. I took advantage of this by using DESeq2 (Love, Huber, and Anders 2014) to apply generalized linear models with a negative binomial distribution to estimate differences in word frequency between male and female characters and authors.
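For reference, the negative binomial model fitted by DESeq2 can be written as below, following Love, Huber, and Anders (2014). The mapping onto this study is an assumption made for illustration: i is taken to index words (in place of genes) and j to index samples (in place of sequencing libraries), and the exact design matrix used here is not stated in the text.

```latex
% K_{ij}: count of word i in sample j
% s_j: size factor for sample j; \alpha_i: dispersion for word i
% x_{jr}: design matrix entries (e.g. character gender, author gender)
\begin{align}
K_{ij} &\sim \mathrm{NB}(\mu_{ij}, \alpha_i), \\
\mu_{ij} &= s_j \, q_{ij}, \\
\log_2 q_{ij} &= \sum_r x_{jr} \, \beta_{ir}, \\
\operatorname{Var}(K_{ij}) &= \mu_{ij} + \alpha_i \, \mu_{ij}^2 .
\end{align}
```

Under this reading, a coefficient \beta_{ir} attached to a gender indicator is the estimated log2 fold change in the frequency of word i between the two groups, after the size factors s_j normalize for differences in total word count between samples.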