Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081
The data used here is derived from the palmerpenguins Python package, which is a port of the palmerpenguins R package. Use the load_penguins() function from either package to obtain the data as originally published via these packages. However, that is not the data you should use for lab in CS 307.
Horst, AM, Hill, AP, & Gorman, KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.
The Penguins
Meet the Palmer Penguins: Chinstrap, Gentoo1, and Adelie! Artwork by Allison Horst.
In lab, we will focus on the length and depth of the penguins’ bills. We’ll attempt to use these measurements to predict the species of each penguin.
Data Dictionary
The Palmer Penguins dataset includes size measurements for adult foraging penguins near Palmer Station, Antarctica. The data includes measurements for penguin species, island in Palmer Archipelago, size (flipper length, body mass, bill dimensions), and sex.
This dictionary adapts the original reference page from the R package, with changes appropriate for the data as presented in Python.
species
a string denoting penguin species (Adélie, Chinstrap and Gentoo)
island
a string denoting island in Palmer Archipelago, Antarctica (Biscoe, Dream or Torgersen)
bill_length_mm
a number denoting bill length (millimeters)
bill_depth_mm
a number denoting bill depth (millimeters)
flipper_length_mm
a number denoting flipper length (millimeters)
body_mass_g
a number denoting body mass (grams)
sex
a string denoting penguin sex (female, male)
year
an integer denoting the study year (2007, 2008, or 2009)
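The dictionary above can also be written down as a lightweight, checkable schema. The following is a minimal sketch using only the standard library. The names `check_record`, `CATEGORICAL`, and `NUMERIC_COLUMNS` are hypothetical helpers introduced for illustration, and the sketch assumes each row is a plain dict with None standing in for a missing value and with the species spelled "Adelie" (no accent).

```python
from numbers import Real

# Allowed values, taken from the data dictionary. The unaccented spelling
# "Adelie" is an assumption about how the packaged data stores the species.
CATEGORICAL = {
    "species": {"Adelie", "Chinstrap", "Gentoo"},
    "island": {"Biscoe", "Dream", "Torgersen"},
    "sex": {"female", "male"},
    "year": {2007, 2008, 2009},
}

# Columns the data dictionary describes as numbers.
NUMERIC_COLUMNS = (
    "bill_length_mm",
    "bill_depth_mm",
    "flipper_length_mm",
    "body_mass_g",
)


def check_record(record: dict) -> list[str]:
    """Return the problems found in one row; an empty list means it conforms.

    A value of None (or an absent key) is treated as missing and is allowed.
    """
    problems = []
    for column, allowed in CATEGORICAL.items():
        value = record.get(column)
        if value is not None and value not in allowed:
            problems.append(f"{column}: unexpected value {value!r}")
    for column in NUMERIC_COLUMNS:
        value = record.get(column)
        if value is None:
            continue
        # bool is a subclass of int, so rule it out explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            problems.append(f"{column}: expected a number, got {value!r}")
        elif value <= 0:
            problems.append(f"{column}: expected a positive value, got {value!r}")
    return problems
```

Calling `check_record` on a dict that uses an unlisted species or a non-numeric body mass returns one message per offending column, which makes it easy to spot rows that do not match the dictionary.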
Data in Python
The Palmer Penguins data can be accessed in Python using the palmerpenguins package.
from palmerpenguins import load_penguins
from palmerpenguins import load_penguins_raw
The load_penguins function returns a Pandas DataFrame containing the data described in the above data dictionary.
The original data can be loaded via the load_penguins_raw function, but this version of the data is used infrequently.
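As a rough sketch of how the loaded DataFrame might be reduced to the bill measurements, the helper below keeps only the species and bill columns and drops rows with missing values. `bill_features` and `load_bill_features` are hypothetical names introduced here for illustration, not part of the palmerpenguins package, and the sketch assumes the column names match the data dictionary.

```python
import pandas as pd

# Feature columns, named as in the data dictionary.
BILL_COLUMNS = ["bill_length_mm", "bill_depth_mm"]


def bill_features(penguins: pd.DataFrame) -> tuple[pd.DataFrame, pd.Series]:
    """Return (X, y): bill measurements and species, without incomplete rows.

    Hypothetical helper: works on any DataFrame that has a `species` column
    and the two bill columns.
    """
    # Select first, then drop, so missing values in other columns
    # (for example sex) do not remove otherwise usable rows.
    subset = penguins[["species", *BILL_COLUMNS]].dropna()
    return subset[BILL_COLUMNS], subset["species"]


def load_bill_features() -> tuple[pd.DataFrame, pd.Series]:
    """Load the packaged data and reduce it to the bill measurements."""
    # Imported here so bill_features can be used without the package installed.
    from palmerpenguins import load_penguins

    return bill_features(load_penguins())
```

Because `bill_features` accepts any DataFrame with these columns, the same helper could be applied to a different copy of the data rather than the one shipped with the package.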
History
As you move through your journey as a data scientist, you’ll likely encounter the Palmer Penguins dataset with some frequency. An interesting question is, why?
In many ways, the Palmer Penguins data is meant to replace the even more ubiquitous Iris flower data. This again leads to the question, why? Largely because it is a better dataset: it is useful for everything the Iris data was, and more.
The Palmer Penguins data…
uses a permissive CC0 license.
uses understandable variables.
has complete metadata and documentation.
is a manageable but less trivial size than the Iris data.
is real-world, not simulated, data.
includes additional feature variables, including categorical variables, as well as missing data, providing more opportunity for usage and learning.
This reasoning is expanded upon in the R Journal article introducing the dataset.
All that is wonderful, but there was another compelling reason to move away from the Iris data: eugenics. How on earth are flowers related to eugenics?
If you study statistics for any length of time, you’ll start to notice that many methods and theories can be traced to one person, Ronald Aylmer (R.A.) Fisher.2
Like many of his contemporaries, Fisher’s interest in statistics and biology cannot be separated from his interest in and study of eugenics. While this is a very unfortunate part of the history of statistics, it cannot simply be ignored. Many of the most foundational statistical methods were developed by researchers and scientists who were at least interested in, if not actively pursuing, eugenics. If you were not already of the mind that statistical methodology should be applied with great care and ethical consideration, hopefully this history will change your perspective.
But how does this all relate to flowers? Well, the Iris data often goes by another name: Fisher’s Iris data set.3 If we stopped using all of Fisher’s creations in statistics, we might not be left with much. But this particular artifact was also published in the Annals of Eugenics.