Data

Computational and Statistical Representations

Author

Modified

May 19, 2025

Setup and Objectives

import numpy as np
import pandas as pd

Computing Data Types

In computing, there are three main categories of data types:

Numeric
Text
Boolean (Logical)

Each of these types has (at least one) implementation in Python, numpy, and pandas.

Numeric

Within numeric data, there are two broad subcategories: floats and integers.

Floats

Floating point numbers (floats) are a method of representing real numbers.

In Python, some examples of floats include:

print(2.0)
print(42.0)
print(3.14)

2.0
42.0
3.14

type(1.1)

float

Real numbers, like \(\pi\), can be irrational and require a decimal representation with an infinite number of digits. Computers have finite memory, so they cannot store every digit of \(\pi\), much less perform computations with infinitely many digits.

Without getting into the details of floating point arithmetic, we will simply note that floats will sometimes behave oddly.

0.2 + 0.1 == 0.3

False

0.2 + 0.1

0.30000000000000004

This is normal, and in general will not cause issues.
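One practical consequence is that floats are usually compared within a tolerance rather than with ==. The following is a minimal sketch using math.isclose from the standard library and np.isclose from numpy, assuming their default tolerances are adequate for the problem at hand.

```python
import math

import numpy as np

total = 0.2 + 0.1

# Exact comparison fails because of rounding in the binary representation.
exact_equal = total == 0.3

# Comparison within a small tolerance succeeds.
close_python = math.isclose(total, 0.3)
close_numpy = np.isclose(total, 0.3)

print(exact_equal)
print(close_python)
print(close_numpy)
```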

There are technically different variants of floats, which use different numbers of bits (binary digits) to represent a number. However, we will generally not be concerned with this distinction.

Both numpy and pandas also utilize floats.

x = np.array([1.1, 2.2, 3.3])
print(x)

[1.1 2.2 3.3]

print(x.dtype)

float64

pd.Series([1.1, 2.2, 3.3])

0   1.10
1   2.20
2   3.30
dtype: float64

Here we see that both numpy and pandas default to using a 64-bit float.
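Other widths can be requested explicitly. As a rough illustration of the float variants mentioned earlier, the sketch below builds 32-bit and 64-bit arrays and uses np.finfo to inspect them. The variable names are illustrative and not part of the original notes.

```python
import numpy as np

x32 = np.array([1.1, 2.2, 3.3], dtype=np.float32)
x64 = np.array([1.1, 2.2, 3.3], dtype=np.float64)

# Number of bits used to store each value.
bits32 = np.finfo(x32.dtype).bits
bits64 = np.finfo(x64.dtype).bits
print(bits32, bits64)

# Approximate number of decimal digits that can be relied on.
digits32 = np.finfo(x32.dtype).precision
digits64 = np.finfo(x64.dtype).precision
print(digits32, digits64)
```

Fewer bits save memory at the cost of precision, which is why the 64-bit default is usually left alone.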

Integers

print(2)
print(42)
print(-20)

2
42
-20

type(-20)

int

y = np.array([1, 0, 1, 1, 1, 0])
print(y)

[1 0 1 1 1 0]

print(y.dtype)
print(y.dtype.type)

int64
<class 'numpy.int64'>

pd.Series([1, 0, 1, 1, 1, 0])

0    1
1    0
2    1
3    1
4    1
5    0
dtype: int64

Text

Text data does what the name suggests: it stores text. Values of this type are usually called strings.

"Hello, World!"

'Hello, World!'

type("Hello, World!")

str

z = np.array(["a", "b", "c", "d"])
print(z)

['a' 'b' 'c' 'd']

print(z.dtype)
print(z.dtype.type)

<U1
<class 'numpy.str_'>

pd.Series(["a", "b", "c", "d"])

0    a
1    b
2    c
3    d
dtype: object

Note that in pandas, series that store strings have dtype object.
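If a dedicated text type is preferred, pandas also offers an opt-in string dtype. This is a minimal sketch, assuming a pandas version that supports dtype="string" (1.0 or later). The default dtype reported for strings may also differ in newer pandas releases.

```python
import pandas as pd

letters = pd.Series(["a", "b", "c", "d"], dtype="string")

# The series now reports a dedicated string dtype instead of object.
print(letters.dtype)

# String methods are available through the .str accessor.
upper = letters.str.upper()
print(upper.tolist())
```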

Boolean

Boolean data, otherwise known as logical data, stores values that represent true and false.¹

In Python, the two possible values are True and False.

print(True)
print(False)

True
False

type(True)

bool
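As with the other types, numpy and pandas have boolean counterparts. The sketch below follows the pattern of the earlier sections. The names w, is_positive, and w_series are illustrative and not part of the original notes.

```python
import numpy as np
import pandas as pd

w = np.array([True, False, True])
print(w.dtype)

# Comparisons on numeric arrays also produce boolean arrays.
is_positive = np.array([-1.5, 0.0, 2.5]) > 0
print(is_positive)

w_series = pd.Series([True, False, True])
print(w_series.dtype)
```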

Summary

The following code provides a few more examples to summarize computing data types.

# python types
python_float = 3.14
python_int = 42
python_str = "hello"
python_bool = True
print(f"python_float is of type {type(python_float)}")
print(f"python_int   is of type {type(python_int)}")
print(f"python_str   is of type {type(python_str)}")
print(f"python_bool  is of type {type(python_bool)}")
print(f"")

python_float is of type <class 'float'>
python_int   is of type <class 'int'>
python_str   is of type <class 'str'>
python_bool  is of type <class 'bool'>

# numpy types
np_float = np.array([3.14, 42.0])
np_int = np.array([0, 1, 1, 0, 0, 0])
np_str = np.array(["hello", "world"])
np_bool = np.array([True, False, True])
print(f"np_float is an array with type {np_float.dtype}")
print(f"np_int   is an array with type {np_int.dtype}")
print(f"np_str   is an array with type {np_str.dtype}")
print(f"np_bool  is an array with type {np_bool.dtype}")
print(f"")

np_float is an array with type float64
np_int   is an array with type int64
np_str   is an array with type <U5
np_bool  is an array with type bool

# pandas types
pd_float = pd.Series([3.14, 42.0])
pd_int = pd.Series([42, 42])
pd_str = pd.Series(["hello", "world"])
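# all series have the same length, so no missing values are introduced
# when they are combined (missing values would turn the integer column into floats)
pd_bool = pd.Series([True, False])
df = pd.DataFrame(
    {
        "pd_float_column": pd_float,
        "pd_int_column": pd_int,
        "pd_str_column": pd_str,
        "pd_bool_column": pd_bool,
    }
)
print(df.dtypes)

pd_float_column    float64
pd_int_column        int64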
pd_str_column       object
pd_bool_column        bool
dtype: object

Statistical Data Types

In statistics, data types have a different categorization:

Numeric
- Continuous
- Discrete
Categorical
- Nominal
- Ordinal

For simplicity, we will focus on continuous numeric data, and nominal categorical data, essentially numbers and categories.²

Unfortunately, there is not a clean mapping from statistical data types to computing data types. Data will of course be stored using a computing data type. However, the statistical data types will guide modeling decisions.

Consider the following arrays:

a = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
b = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 25, 100])
c = np.array(["cat", "dog", "cat", "cat", "dog"])
d = np.array([True, True, True, False])
e = np.array([0.12, 0.35, 0.65, 0.89, 0.91])

These arrays have type:

print(a.dtype)
print(b.dtype)
print(c.dtype)
print(d.dtype)
print(e.dtype)

float64
int64
<U3
bool
float64

That is, each of the main computing types appears at least once:

Float: a
Integer: b
String: c
Boolean: d
Float: e

What is the statistical type of each? Are they numbers or categories? The answer is a common one: it depends.

In this case, without additional context, we would label them:

a is categorical
b is numeric
c is categorical
d is categorical
e is numeric

How are we making this determination?

The general heuristics are:

Categorical data represents a finite number of categories.
Numeric data represents measurements or counts.
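One rough way to start applying these heuristics is to count the distinct values in each array. This is only a sketch: a small number of distinct values hints at categories but does not prove it. The arrays are repeated here so the block runs on its own.

```python
import numpy as np

arrays = {
    "a": np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]),
    "b": np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 25, 100]),
    "c": np.array(["cat", "dog", "cat", "cat", "dog"]),
    "d": np.array([True, True, True, False]),
    "e": np.array([0.12, 0.35, 0.65, 0.89, 0.91]),
}

# Count distinct values; few distinct values relative to the length hints at categories.
n_unique = {name: int(np.unique(values).size) for name, values in arrays.items()}

for name, values in arrays.items():
    print(f"{name}: {n_unique[name]} distinct values out of {values.size}")
```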

Why is a categorical? Well, technically, with the information we have here, we don’t know for sure. However, notice that there are only two values contained in the array, 0.0 and 1.0. While it is certainly reasonable to think that floats are (continuous) numeric data, that is not necessarily the case.

The a variable could represent a binary indicator, something like 1.0 for graduate students and 0.0 for undergraduate students.
The a variable could instead represent a measure, something like 1.0 millimeters of precipitation and 0.0 millimeters of precipitation.

Which situation are you dealing with? You’ll have to pay close attention to what data you’re actually dealing with!
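Once the context is known, the choice can be made explicit in code. The sketch below assumes the hypothetical graduate and undergraduate coding described above and uses the pandas category dtype. Under the precipitation interpretation, the values would instead be left as floats and summarized numerically.

```python
import numpy as np
import pandas as pd

a = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])

# Interpretation 1 (hypothetical coding): a binary indicator stored as a category.
labels = {1.0: "graduate", 0.0: "undergraduate"}
status = pd.Series(a).map(labels).astype("category")
print(status.value_counts())

# Interpretation 2: a measurement in millimeters, summarized as a number.
mean_precipitation = a.mean()
print(mean_precipitation)
```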

Tabular Data

In CS 307, the vast majority of the data that we encounter will be tabular data. If and when we encounter non-tabular data, we will usually try our best to transform it into tabular form.

Generally speaking, tabular data refers to data that can be arranged in tables with rows and columns. This loose definition is useful, but there are extensions, such as tidy data, that further formalize this concept.³

Thinking statistically:

rows represent observations
columns represent attributes

In machine learning, we will usually refer to the rows as samples. Columns will also be referred to as variables.

In both statistics and machine learning, some analyses will give additional categorization to the columns, usually denoting one column as the target or response and the remaining columns as the features.

\(x_1\)	\(x_2\)	\(x_3\)	\(y\)
1.2	yes	2	1
2.1	yes	3	0
5.5	no	6	0
4.3	no	4	1
2.7	yes	3	1
5.6	no	5	0
5.7	yes	7	1
9.8	no	8	0
3.6	yes	9	1
3.7	no	10	0

Table 1: An example of tabular data.
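As a sketch of how Table 1 might look in code, the block below stores it as a pandas data frame and separates the features from the target. It assumes that \(y\) is the target and that the remaining columns are features. The names table, X, and y are illustrative.

```python
import pandas as pd

table = pd.DataFrame(
    {
        "x1": [1.2, 2.1, 5.5, 4.3, 2.7, 5.6, 5.7, 9.8, 3.6, 3.7],
        "x2": ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "no"],
        "x3": [2, 3, 6, 4, 3, 5, 7, 8, 9, 10],
        "y": [1, 0, 0, 1, 1, 0, 1, 0, 1, 0],
    }
)

# Features are every column except the target.
X = table.drop(columns="y")
# The target is the single remaining column.
y = table["y"]

print(X.shape)
print(y.shape)
```

Each of the ten rows is one sample, so X has ten rows and three columns, while y has ten entries.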

This section collects notes to be added later.

pandas versus numpy
- frame versus 2d array
- series versus 1d array
TODO: Within Python, we will encounter tabular data in both numpy, as two-dimensional arrays, and pandas, as data frames. But there are differences!

`numpy` Arrays

input = 2d array X (features)
output = 1d array y (target)

`pandas` Data Frames

data frames
- rows: samples (observations)
- columns: variables
  - output, y = target / label (response, dependent, etc)
  - input, x = features (predictors, independent, etc)

df = pd.DataFrame(
    {
        "integers": [1, 2, 3, 4, 5],
        "floats": [1.1, 2.2, 3.3, 4.4, 5.5],
        "strings": ["a", "b", "c", "d", "e"],
        "booleans": [True, False, True, False, True],
    }
)
df

	integers	floats	strings	booleans
0	1	1.10	a	True
1	2	2.20	b	False
2	3	3.30	c	True
3	4	4.40	d	False
4	5	5.50	e	True
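One difference between data frames and arrays can be sketched with this same data frame: each data frame column keeps its own dtype, while a numpy array has a single dtype for all of its elements. The block below rebuilds the data frame so it runs on its own and converts it with to_numpy. The names numeric_array and mixed_array are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "integers": [1, 2, 3, 4, 5],
        "floats": [1.1, 2.2, 3.3, 4.4, 5.5],
        "strings": ["a", "b", "c", "d", "e"],
        "booleans": [True, False, True, False, True],
    }
)

# Each data frame column keeps its own dtype.
print(df.dtypes)

# Numeric columns convert to a 2d array with a single shared dtype.
numeric_array = df[["integers", "floats"]].to_numpy()
print(numeric_array.shape, numeric_array.dtype)

# Mixed columns fall back to the generic object dtype.
mixed_array = df.to_numpy()
print(mixed_array.dtype)
```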

https://en.wikipedia.org/wiki/Computer_number_format
The usual floating point sites.
https://en.wikipedia.org/wiki/Floating-point_arithmetic
python data structures?
Data Representation
- tabular
  - data frame (pandas)
  - 2d array (numpy)
- graphical (plot)
- statistical (DGP as random variables)
- computational (DGP as code to simulate)
tabular versus not:
- if not tabular, make tabular (see images and text)
- how does TIME factor?

Footnotes

¹ Some languages, like R, provide for a third possibility that effectively represents an unknown status.
² Dealing with discrete numeric and ordinal categories requires care that we may return to later with specific examples.
³ Database normalization is a process that is deeply invested in how data is stored as rows and columns within a relational database.