Data

Computational and Statistical Representations

Modified: March 3, 2025

import numpy as np
import pandas as pd

Computing Data Types

In computing, there are three main categories of data types:

  • Numeric
  • Text
  • Boolean (Logical)

Each of these types has (at least one) implementation in Python, numpy, and pandas.

Numeric

Within numeric data, there are two broad subcategories: floats and integers.

Floats

Floating point numbers (floats) are a method of representing real numbers.

In Python, some examples of floats include:

print(2.0)
print(42.0)
print(3.14)
2.0
42.0
3.14
type(1.1)
float

Real numbers, like \(\pi\), can be irrational and require a decimal representation with an infinite number of digits. Computers have finite memory, so they cannot store every digit of \(\pi\), much less perform computations with infinitely many digits.

Without getting into the details of floating point arithmetic, we will simply note that floats will sometimes behave oddly.

0.2 + 0.1 == 0.3
False
0.2 + 0.1
0.30000000000000004

This is normal, and in general will not cause issues.
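
When two floats do need to be compared, a common approach is to check whether they are close within a tolerance rather than exactly equal. The following is a minimal sketch using `math.isclose` from the standard library and `np.isclose` from numpy, both with their default tolerances. The variable names are only for illustration.

```python
import math

import numpy as np

total = 0.2 + 0.1

# Exact equality fails because of rounding in the binary representation.
exact_match = total == 0.3

# Tolerance-based comparisons treat the two values as equal.
close_python = math.isclose(total, 0.3)
close_numpy = bool(np.isclose(total, 0.3))

print(exact_match)
print(close_python)
print(close_numpy)
```

The default tolerances are usually adequate, but both functions accept arguments to adjust them when values are very large or very close to zero.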

There are technically different variants of floats, which use different numbers of bits to represent a number. However, we will generally not be concerned with this distinction.
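
For the curious, the sketch below compares two of these variants in numpy, `np.float32` and `np.float64`, using `np.finfo` to report the size of each type and roughly how many decimal digits it can represent reliably. It is only an illustration of the distinction, not something we will need in practice.

```python
import numpy as np

# The same value stored with 32 and 64 bits.
x32 = np.float32(0.1)
x64 = np.float64(0.1)

# np.finfo reports properties of a float type, such as its size in bits
# and the approximate number of decimal digits it represents reliably.
info32 = np.finfo(np.float32)
info64 = np.finfo(np.float64)

print(info32.bits, info32.precision)
print(info64.bits, info64.precision)

# The two stored values differ slightly because float32 keeps fewer digits.
print(float(x32) == float(x64))
```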

Both numpy and pandas also utilize floats.

x = np.array([1.1, 2.2, 3.3])
print(x)
[1.1 2.2 3.3]
print(x.dtype)
float64
pd.Series([1.1, 2.2, 3.3])
0    1.1
1    2.2
2    3.3
dtype: float64

Here we see that both numpy and pandas default to using a 64-bit float.

Integers

print(2)
print(42)
print(-20)
2
42
-20
type(-20)
int
y = np.array([1, 0, 1, 1, 1, 0])
print(y)
[1 0 1 1 1 0]
print(y.dtype)
print(y.dtype.type)
int64
<class 'numpy.int64'>
pd.Series([1, 0, 1, 1, 1, 0])
0    1
1    0
2    1
3    1
4    1
5    0
dtype: int64

Text

Text data does what the name suggests: it stores text. Values of this type are usually called strings.

"Hello, World!"
'Hello, World!'
type("Hello, World!")
str
z = np.array(["a", "b", "c", "d"])
print(z)
['a' 'b' 'c' 'd']
print(z.dtype)
print(z.dtype.type)
<U1
<class 'numpy.str_'>
pd.Series(["a", "b", "c", "d"])
0    a
1    b
2    c
3    d
dtype: object

Note that in pandas, series that store strings have dtype object.
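
The object dtype is a general-purpose label, so it does not by itself say that a series holds strings. As a small sketch, we can look at the elements directly, or ask `pd.api.types.infer_dtype` to inspect the values. The dtype label that pandas reports for strings may depend on the pandas version, which is why this example checks the values instead of the label.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"])

# Each element is an ordinary Python string.
element_types = {type(value) for value in s}
print(element_types)

# infer_dtype looks at the values themselves rather than the dtype label.
inferred = pd.api.types.infer_dtype(s)
print(inferred)
```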

Boolean

Boolean data, otherwise known as logical data, stores values that represent true and false.1

In Python, the two possible values are True and False.

print(True)
print(False)
True
False
type(True)
bool
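
As with the other types, numpy and pandas both store booleans. The short sketch below creates a boolean array and a boolean series, and also shows one convenient property: `True` counts as 1 and `False` as 0 in arithmetic, so a sum counts the `True` values. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

flags_np = np.array([True, False, True])
flags_pd = pd.Series([True, False, True])

print(flags_np.dtype)
print(flags_pd.dtype)

# True counts as 1 and False as 0, so summing counts the True values.
n_true = int(flags_np.sum())
print(n_true)
```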

Summary

The following code provides a few more examples to summarize computing data types.

# python types
python_float = 3.14
python_int = 42
python_str = "hello"
python_bool = True
print(f"python_float is of type {type(python_float)}")
print(f"python_int   is of type {type(python_int)}")
print(f"python_str   is of type {type(python_str)}")
print(f"python_bool  is of type {type(python_bool)}")
print(f"")
python_float is of type <class 'float'>
python_int   is of type <class 'int'>
python_str   is of type <class 'str'>
python_bool  is of type <class 'bool'>
# numpy types
np_float = np.array([3.14, 42.0])
np_int = np.array([0, 1, 1, 0, 0, 0])
np_str = np.array(["hello", "world"])
np_bool = np.array([True, False, True])
print(f"np_float is an array with type {np_float.dtype}")
print(f"np_int   is an array with type {np_int.dtype}")
print(f"np_str   is an array with type {np_str.dtype}")
print(f"np_bool  is an array with type {np_bool.dtype}")
print(f"")
np_float is an array with type float64
np_int   is an array with type int64
np_str   is an array with type <U5
np_bool  is an array with type bool
# pandas types
pd_float = pd.Series([3.14, 42.0])
pd_int = pd.Series([42, 42])
pd_str = pd.Series(["hello", "world"])
pd_bool = pd.Series([True, False])
df = pd.DataFrame(
    {
        "pd_float_column": pd_float,
        "pd_int_column": pd_int,
        "pd_str_column": pd_str,
        "pd_bool_column": pd_bool,
    }
)
print(df.dtypes)
pd_float_column    float64
pd_int_column        int64
pd_str_column       object
pd_bool_column        bool
dtype: object

Statistical Data Types

In statistics, data types have a different categorization:

  • Numeric
    • Continuous
    • Discrete
  • Categorical
    • Nominal
    • Ordinal

For simplicity, we will focus on continuous numeric data, and nominal categorical data, essentially numbers and categories.2

Unfortunately, there is not a clean mapping from statistical data types to computing data types. Data will of course be stored using a computing data type. However, the statistical data types will guide modeling decisions.

Consider the following arrays:

a = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
b = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 25, 100])
c = np.array(["cat", "dog", "cat", "cat", "dog"])
d = np.array([True, True, True, False])
e = np.array([0.12, 0.35, 0.65, 0.89, 0.91])

These arrays have type:

print(a.dtype)
print(b.dtype)
print(c.dtype)
print(d.dtype)
print(e.dtype)
float64
int64
<U3
bool
float64

That is, we have at least one of each of the main computing types:

  • Float: a
  • Integer: b
  • String: c
  • Boolean: d
  • Float: e

What is the statistical type of each? Are they numbers or categories? The answer is a common one: it depends.

In this case, without additional context, we would label them:

  • a is categorical
  • b is numeric
  • c is categorical
  • d is categorical
  • e is numeric

How are we making this determination?

The general heuristics are:

  • Categorical data represents a finite number of categories.
  • Numeric data represents measurements or counts.
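
One rough way to start applying these heuristics is to look at how many distinct values an array contains. The sketch below uses `np.unique` on the `a` and `e` arrays from above. A small number of repeated values is only a hint that data may be categorical, not a rule, since the context of the data still decides.

```python
import numpy as np

a = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
e = np.array([0.12, 0.35, 0.65, 0.89, 0.91])

# np.unique returns the sorted distinct values and, optionally, their counts.
a_values, a_counts = np.unique(a, return_counts=True)
e_values = np.unique(e)

print(a_values, a_counts)
print(len(e_values), "distinct values out of", len(e))
```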

Why is a categorical? Well, technically, with the information we have here, we don’t know for sure. However, notice that there are only two values contained in the array, 0.0 and 1.0. While it is certainly reasonable to think that floats are (continuous) numeric data, that is not necessarily the case.

  • The a variable could represent a binary indicator, something like 1.0 for graduate students and 0.0 for undergraduate students.
  • The a variable could instead represent a measurement, something like 1.0 millimeters of precipitation and 0.0 millimeters of precipitation.

Which situation are you dealing with? You’ll have to pay close attention to what the data actually represents!
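
If context tells us that `a` is a binary indicator, one option is to make that explicit in how the data is stored. The sketch below assumes the graduate and undergraduate interpretation from the example above; the `labels` mapping and the `student_type` name are hypothetical. It relabels the values and converts the result to the pandas category dtype.

```python
import numpy as np
import pandas as pd

a = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])

# Hypothetical labels, following the graduate/undergraduate example.
labels = {1.0: "graduate", 0.0: "undergraduate"}

# Replace each number with its label, then store the result as categories.
student_type = pd.Series(a).map(labels).astype("category")

print(student_type.dtype)
print(list(student_type.cat.categories))
```

If `a` instead recorded precipitation, no conversion would be appropriate and the floats would be left as they are.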

Tabular Data

In CS 307, the vast majority of the data that we encounter will be tabular data. If and when we encounter non-tabular data, we will usually try our best to transform it into tabular form.

Generally speaking, tabular data refers to data that can be arranged in a table with rows and columns. This loose definition is useful, but there are extensions, such as tidy data, that further formalize this concept.

Thinking statistically:

  • rows represent observations
  • columns represent attributes

In machine learning, we will usually refer to the rows as samples. Columns will also be referred to as variables.

In both statistics and machine learning, some analyses will give additional categorization to the columns, usually denoting one column as the target or response.

\(x_1\)   \(x_2\)   \(x_3\)   \(y\)
1.2       yes       2         1
2.1       yes       3         0
5.5       no        6         0
4.3       no        4         1
2.7       yes       3         1
5.6       no        5         0
5.7       yes       7         1
9.8       no        8         0
3.6       yes       9         1
3.7       no        10        0
Table 1: An example of tabular data.
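
As a sketch of how Table 1 might look in code, the example below stores it as a pandas DataFrame and separates the target column from the others. The column names `x1`, `x2`, `x3`, and `y` are plain-text stand-ins for the symbols in the table header, and treating \(y\) as the target is an assumption based on the description above.

```python
import pandas as pd

# The values from Table 1, one list per column.
df = pd.DataFrame(
    {
        "x1": [1.2, 2.1, 5.5, 4.3, 2.7, 5.6, 5.7, 9.8, 3.6, 3.7],
        "x2": ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "no"],
        "x3": [2, 3, 6, 4, 3, 5, 7, 8, 9, 10],
        "y": [1, 0, 0, 1, 1, 0, 1, 0, 1, 0],
    }
)

# Separate the target column from the remaining columns.
X = df.drop(columns="y")
y = df["y"]

print(df.shape)
print(X.shape, y.shape)
```

Each row of `df` is one sample, and each column is one variable.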

Footnotes

  1. Some languages, like R, provide for a third possibility that effectively represents an unknown status.↩︎

  2. Dealing with discrete numeric and ordinal categories requires care that we may return to later with specific examples.↩︎