3 Features
Many of the raw data fields described in the previous section are also used as variables in the statistical models that follow in subsequent chapters. Additional variables (some predictors, some dependent, and some both) were derived from the raw data files. The complete set of variables (and their derivation) is described in this chapter.
The feature files are all in tidy CSV format [7], under the features/ sub-directory. The three primary files, confs.csv, persons.csv, and papers.csv, roughly correspond to (and aggregate) their counterparts described in the previous chapter. But they also contain blended features computed by combining data from multiple sources. These tables can be joined by their key field (typically the first column of each file).
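As an illustration of joining two feature tables on a key field, the following is a minimal sketch using only the Python standard library. The in-memory tables and their column names (conf, year, paper, title) are hypothetical stand-ins, not the actual headers of confs.csv or papers.csv, which should be checked in the files themselves.

```python
import csv
import io

# Hypothetical miniature stand-ins for features/confs.csv and features/papers.csv.
# The column names and values are assumptions made for illustration only.
CONFS_CSV = """conf,year
CONF_A,2017
CONF_B,2017
"""

PAPERS_CSV = """paper,conf,title
p1,CONF_A,First paper
p2,CONF_A,Second paper
p3,CONF_B,Third paper
"""


def join_on_key(left_rows, right_rows, key):
    """Inner-join two lists of dict rows on a shared key column."""
    right_index = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = right_index.get(row[key])
        if match is not None:
            # Fields from the left row win if a column name appears in both.
            joined.append({**match, **row})
    return joined


confs = list(csv.DictReader(io.StringIO(CONFS_CSV)))
papers = list(csv.DictReader(io.StringIO(PAPERS_CSV)))

# Attach each paper's conference-level fields to the paper row.
papers_with_conf = join_on_key(papers, confs, key="conf")
for row in papers_with_conf:
    print(row["paper"], row["conf"], row["year"])
```

With the real files, the same function could be applied to rows read through csv.DictReader from the features/ sub-directory, passing whichever column actually serves as the shared key.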
Some of these variables are so-called dummy variables. They convert a variable type from a categorical enumeration to a set of Boolean values. For example, the data field organization in the data/conf/ conference files can take on one or more of the values “IEEE”, “ACM”, or “USENIX”. In the data/features/all_confs.dat file, this variable is split into three Boolean variables: is_org_IEEE, is_org_ACM, and is_org_USENIX.
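The dummy-variable conversion for the organization field can be sketched as follows. This assumes the organization values have already been parsed into a Python list; how they are actually encoded in the raw conference files is not shown here, and the function name is hypothetical.

```python
# The three categorical values named in the text.
ORGANIZATIONS = ("IEEE", "ACM", "USENIX")


def organization_dummies(organizations):
    """Map a collection of organization names to is_org_* Boolean flags."""
    present = set(organizations)
    return {f"is_org_{org}": org in present for org in ORGANIZATIONS}


# A conference co-sponsored by IEEE and ACM, but not USENIX.
flags = organization_dummies(["IEEE", "ACM"])
print(flags)
# {'is_org_IEEE': True, 'is_org_ACM': True, 'is_org_USENIX': False}
```

Because a conference can have more than one organization, the flags are independent of each other rather than mutually exclusive.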
3.4 Textual-related variables
The full-text papers in PDF were first converted to textual format using pdftotext (v. 0.41.0), a utility included in the Poppler package. In rare instances, the paper’s text was embedded as an image, which required text extraction using the Tesseract optical character recognition package. The wrapper for this conversion can be found in src/pdfocr.py.
Each of these text files was in turn converted to “bag-of-words” format, which is simply a mapping from words to word counts. The output of this process is one CSV file per paper, each with two columns: one for normalized words, and one for the number of times each normalized word appeared in the paper. These data files are part of the accompanying data set, and can be found in the features/bow/ sub-directory.
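The bag-of-words mapping can be sketched with the standard library's Counter. The input below is a hypothetical list of already-normalized words, and the header row (word, count) and the alphabetical ordering are assumptions; the actual files under features/bow/ may be laid out differently.

```python
import csv
import io
from collections import Counter


def bag_of_words(normalized_words):
    """Count how many times each normalized word occurs."""
    return Counter(normalized_words)


def write_bow_csv(counts, out):
    """Write a two-column CSV of (word, count) rows, sorted by word."""
    writer = csv.writer(out)
    writer.writerow(["word", "count"])  # header names are an assumption
    for word, count in sorted(counts.items()):
        writer.writerow([word, count])


# Hypothetical normalized words from one paper.
words = ["cache", "miss", "cache", "latency", "cache"]
counts = bag_of_words(words)

buffer = io.StringIO()  # stands in for a per-paper CSV file
write_bow_csv(counts, buffer)
print(buffer.getvalue())
```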
The normalization of words is a process (coded in src/normalize_text.py) that includes the following stages:
- lower-case
- lemmatization
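The two stages above can be sketched as follows. This is not the code in src/normalize_text.py: the alphabetic tokenization is an assumption, and the lemmatization step uses a small hand-written lookup table purely for illustration, where a real pipeline would rely on a lemmatizer from an NLP library.

```python
import re

# Toy lemma table, for illustration only (an assumption, not the project's data).
TOY_LEMMAS = {
    "caches": "cache",
    "cached": "cache",
    "running": "run",
    "ran": "run",
}


def normalize(text):
    """Lower-case the text, split it into alphabetic tokens, and lemmatize each."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Words missing from the toy table are passed through unchanged.
    return [TOY_LEMMAS.get(token, token) for token in tokens]


print(normalize("Caches ran; the cached job is running."))
# ['cache', 'run', 'the', 'cache', 'job', 'is', 'run']
```

Lower-casing comes first so that the lemma lookup only needs lower-case keys.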
3.5 Country-related variables
Field description
- code: The two-letter international code of the country (also the top-level domain for the country).
- name: String of country name.
- region: String of geographical region or continent of the country.
- subregion: String of geographical subregion.
- timezone: Time zone as difference in hours from GMT (at the capital city, if the country has more than one time zone).
- speaks_english: Boolean of whether English is one of the official languages in the country.
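The fields above can be represented as a typed record, sketched below. The sample row is illustrative and not copied from the data set; the letter case of the code, the "True"/"False" spelling of the Boolean, and the numeric format of the time zone are all assumptions about how the file encodes these fields.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Country:
    code: str
    name: str
    region: str
    subregion: str
    timezone: float  # difference in hours from GMT
    speaks_english: bool


def parse_country(row):
    """Convert one CSV row (a dict of strings) into a typed Country record."""
    return Country(
        code=row["code"],
        name=row["name"],
        region=row["region"],
        subregion=row["subregion"],
        timezone=float(row["timezone"]),
        # Assumes Booleans are stored as the text "True" or "False".
        speaks_english=row["speaks_english"].strip().lower() == "true",
    )


# Illustrative row only; the encoding shown here is an assumption.
SAMPLE = (
    "code,name,region,subregion,timezone,speaks_english\n"
    "CA,Canada,Americas,Northern America,-5,True\n"
)

country = parse_country(next(csv.DictReader(io.StringIO(SAMPLE))))
print(country)
```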
Bibliography
[7] Wickham, H. 2014. Tidy data. Journal of Statistical Software. 59, 10 (2014), 1–23.