import numpy as np
import pandas as pd
Computing Data Types
In computing, there are three main categories of data types:
- Numeric
- Text
- Boolean (Logical)
Each of these types has at least one implementation in Python, numpy, and pandas.
Numeric
Within numeric data, there are two broad subcategories: floats and integers.
Floats
Floating point numbers (floats) are a method of representing real numbers.
In Python, some examples of floats include:
print(2.0)
print(42.0)
print(3.14)
2.0
42.0
3.14
type(1.1)
float
Real numbers, like \(\pi\), can be irrational, and require a decimal representation with an infinite number of digits. Computers have finite memory, so they cannot store every digit of \(\pi\), much less perform computations with infinitely many digits.
Without getting into the details of floating point arithmetic, we will simply note that floats will sometimes behave oddly.
0.2 + 0.1 == 0.3
False
0.2 + 0.1
0.30000000000000004
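This is normal, and in general will not cause issues. It happens because values like 0.1 and 0.2 cannot be represented exactly in binary, so tiny rounding errors appear.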
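When two floats do need to be compared, a tolerance-based check is the usual workaround. The following is a minimal sketch using math.isclose from the standard library and np.isclose from numpy with their default tolerances; the variable names total and close are illustrative only.

```python
import math

import numpy as np

total = 0.2 + 0.1

# Exact equality fails because of the tiny rounding error.
print(total == 0.3)

# A tolerance-based comparison treats the two values as equal.
print(math.isclose(total, 0.3))

# The numpy version works elementwise on arrays.
close = np.isclose(np.array([total, 1.0]), np.array([0.3, 1.0]))
print(close)
```

The default tolerances are a reasonable starting point, but the appropriate tolerance depends on the scale of the data.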
There are technically different variants of floats, which use different numbers of bits to represent a number. However, we will generally not be concerned with this distinction.
Both numpy and pandas also utilize floats.
x = np.array([1.1, 2.2, 3.3])
print(x)
[1.1 2.2 3.3]
print(x.dtype)
float64
pd.Series([1.1, 2.2, 3.3])
0 1.1
1 2.2
2 3.3
dtype: float64
Here we see that both numpy and pandas default to using a 64-bit float.
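To make the idea of float variants concrete, here is a small sketch that requests a 32-bit float explicitly through the dtype argument of np.array. The names x64 and x32 are illustrative, and nothing later depends on this distinction.

```python
import numpy as np

# Default: 64-bit floats.
x64 = np.array([1.1, 2.2, 3.3])

# Explicitly request 32-bit floats instead.
x32 = np.array([1.1, 2.2, 3.3], dtype=np.float32)

# itemsize reports the number of bytes used per element.
print(x64.dtype, x64.itemsize)
print(x32.dtype, x32.itemsize)

# finfo reports the approximate number of reliable decimal digits.
print(np.finfo(np.float64).precision)
print(np.finfo(np.float32).precision)
```

The smaller type uses half the memory per element at the cost of precision.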
Integers
print(2)
print(42)
print(-20)
2
42
-20
type(-20)
int
y = np.array([1, 0, 1, 1, 1, 0])
print(y)
[1 0 1 1 1 0]
print(y.dtype)
print(y.dtype.type)
int64
<class 'numpy.int64'>
pd.Series([1, 0, 1, 1, 1, 0])
0 1
1 0
2 1
3 1
4 1
5 0
dtype: int64
Text
Text data does what the name suggests: it stores text. Values of this type are usually called strings.
"Hello, World!"
'Hello, World!'
type("Hello, World!")
str
z = np.array(["a", "b", "c", "d"])
print(z)
['a' 'b' 'c' 'd']
print(z.dtype)
print(z.dtype.type)
<U1
<class 'numpy.str_'>
pd.Series(["a", "b", "c", "d"])
0 a
1 b
2 c
3 d
dtype: object
Note that in pandas, series that store strings have dtype object.
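As a hedged illustration of what a series of strings contains, the sketch below checks the type of an individual element and then converts the series to the dedicated "string" extension dtype. This assumes a pandas version that provides that dtype (1.0 or later); the names s and s_string are illustrative.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"])

# Each element is an ordinary Python string, whatever dtype the series reports.
print(type(s.iloc[0]))

# Assumption: the installed pandas supports the "string" extension dtype.
s_string = s.astype("string")
print(s_string.dtype)
```

The dtype reported for a series of strings can differ between pandas versions, so checking it on your own installation is worthwhile.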
Boolean
Boolean data, otherwise known as logical data, stores values that represent true and false.
In Python, the two possible values are True and False.
print(True)
print(False)
True
False
type(True)
bool
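Booleans also appear in numpy and pandas, most often as the result of a comparison. A minimal sketch, reusing the float values from earlier (the names mask and s_mask are illustrative choices):

```python
import numpy as np
import pandas as pd

x = np.array([1.1, 2.2, 3.3])

# Comparing an array to a number produces a boolean array.
mask = x > 2.0
print(mask)
print(mask.dtype)

# True counts as 1 and False as 0, so summing counts the True values.
print(mask.sum())

# The same comparison on a pandas series produces a boolean series.
s_mask = pd.Series([1.1, 2.2, 3.3]) > 2.0
print(s_mask.dtype)
```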
Summary
The following code provides a few more examples to summarize computing data types.
# python types
python_float = 3.14
python_int = 42
python_str = "hello"
python_bool = True
print(f"python_float is of type {type(python_float)}")
print(f"python_int is of type {type(python_int)}")
print(f"python_str is of type {type(python_str)}")
print(f"python_bool is of type {type(python_bool)}")
print()
python_float is of type <class 'float'>
python_int is of type <class 'int'>
python_str is of type <class 'str'>
python_bool is of type <class 'bool'>
# numpy types
np_float = np.array([3.14, 42.0])
np_int = np.array([0, 1, 1, 0, 0, 0])
np_str = np.array(["hello", "world"])
np_bool = np.array([True, False, True])
print(f"np_float is an array with type {np_float.dtype}")
print(f"np_int is an array with type {np_int.dtype}")
print(f"np_str is an array with type {np_str.dtype}")
print(f"np_bool is an array with type {np_bool.dtype}")
print()
np_float is an array with type float64
np_int is an array with type int64
np_str is an array with type <U5
np_bool is an array with type bool
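# pandas types
pd_float = pd.Series([3.14, 42.0])
pd_int = pd.Series([42, 42])
pd_str = pd.Series(["hello", "world"])
pd_bool = pd.Series([True, False, True])
df = pd.DataFrame(
    {
        "pd_float_column": pd_float,
        "pd_int_column": pd_int,
        "pd_str_column": pd_str,
        "pd_bool_column": pd_bool,
    }
)
print(df.dtypes)
pd_float_column float64
pd_int_column float64
pd_str_column object
pd_bool_column bool
dtype: object
Note that pd_int_column is reported as float64 rather than int64. The pd_bool series has three elements while the others have two, so pandas aligns the series on their index and fills the missing third row of the shorter columns with NaN, which forces the integer column to become a float.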
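As a small sketch of how series length interacts with the dtypes of a data frame, the example below builds one frame from series of different lengths and one from series of equal length. The names ints, bools_long, bools_short, mismatched, and matched are hypothetical and chosen only for this illustration.

```python
import pandas as pd

ints = pd.Series([42, 42])
bools_long = pd.Series([True, False, True])
bools_short = pd.Series([True, False])

# Different lengths: the integer column gets a NaN in the third row,
# so it can no longer be stored as an integer.
mismatched = pd.DataFrame({"int_column": ints, "bool_column": bools_long})

# Equal lengths: no missing values, so the integer dtype is kept.
matched = pd.DataFrame({"int_column": ints, "bool_column": bools_short})

print(mismatched.dtypes)
print(matched.dtypes)
```

Checking the dtypes after building a data frame is a cheap way to catch this kind of silent conversion.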
Statistical Data Types
In statistics, data types have a different categorization:
- Numeric
- Continuous
- Discrete
- Categorical
- Nominal
- Ordinal
For simplicity, we will focus on continuous numeric data and nominal categorical data: essentially, numbers and categories.
Unfortunately, there is not a clean mapping from statistical data types to computing data types. Data will of course be stored using a computing data type. However, the statistical data types will guide modeling decisions.
Consider the following arrays:
a = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
b = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 25, 100])
c = np.array(["cat", "dog", "cat", "cat", "dog"])
d = np.array([True, True, True, False])
e = np.array([0.12, 0.35, 0.65, 0.89, 0.91])
These arrays have type:
print(a.dtype)
print(b.dtype)
print(c.dtype)
print(d.dtype)
print(e.dtype)
float64
int64
<U3
bool
float64
That is, we have one of each of the main computing types:
- Float: a
- Integer: b
- String: c
- Boolean: d
- Float: e
What is the statistical type of each? Are they numbers or categories? The answer is a common one: it depends.
In this case, without additional context, we would label them:
- a is categorical
- b is numeric
- c is categorical
- d is categorical
- e is numeric
How are we making this determination?
The general heuristics are:
- Categorical data represents a finite number of categories.
- Numeric data represents measurements or counts.
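One rough way to apply these heuristics is to count how many distinct values an array contains. The sketch below does this for the a and b arrays using np.unique; the n_unique dictionary is an illustrative helper, and a small count is only a hint, not proof, that the data is categorical.

```python
import numpy as np

a = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
b = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 25, 100])

# Count the distinct values in each array.
n_unique = {}
for name, arr in [("a", a), ("b", b)]:
    n_unique[name] = np.unique(arr).size
    print(f"{name}: dtype={arr.dtype}, {n_unique[name]} distinct values out of {arr.size}")
```

Here a has only a couple of distinct values despite being stored as floats, while every value in b is different.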
Why is a categorical? Well, technically, with the information we have here, we don’t know for sure. However, notice that there are only two values contained in the array, 0.0 and 1.0. While it is certainly reasonable to think that floats are (continuous) numeric data, that is not necessarily the case.
- The a variable could represent a binary indicator, something like 1.0 for graduate students and 0.0 for undergraduate students.
- The a variable could instead represent a measure, something like 1.0 millimeters of precipitation and 0.0 millimeters of precipitation.
Which situation are you dealing with? You’ll have to pay close attention to what data you’re actually dealing with!
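Once the context is known, the choice can be made explicit in how the data is stored. The sketch below shows both readings of the a array in pandas. The labels "graduate" and "undergraduate", the series names, and the use of the category dtype are assumptions made for illustration, not a required approach.

```python
import numpy as np
import pandas as pd

a = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])

# Reading 1 (hypothetical): a is a measurement, so keep it as floats.
precipitation = pd.Series(a, name="precipitation_mm")
print(precipitation.mean())

# Reading 2 (hypothetical): a is a binary indicator, so map the values
# to labels and store them with the pandas category dtype.
student_level = (
    pd.Series(a, name="student_level")
    .map({1.0: "graduate", 0.0: "undergraduate"})
    .astype("category")
)
print(student_level.value_counts())
```

A mean is meaningful under the first reading, while counts per category are the natural summary under the second.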
Tabular Data
In CS 307, the vast majority of the data that we encounter will be tabular data. If and when we encounter non-tabular data, we will usually try our best to transform it into tabular data.
Generally speaking, tabular data refers to data that can be arranged in a table with rows and columns. This loose definition is useful, but there are extensions, such as tidy data, that further formalize this concept.
Thinking statistically:
- rows represent observations
- columns represent attributes
In machine learning, we will usually refer to the rows as samples. Columns will also be referred to as variables.
In both statistics and machine learning, some analyses will give additional categorization to the columns, usually denoting one column as the target or response.
\(x_1\) | \(x_2\) | \(x_3\) | \(y\) |
---|---|---|---|
1.2 | yes | 2 | 1 |
2.1 | yes | 3 | 0 |
5.5 | no | 6 | 0 |
4.3 | no | 4 | 1 |
2.7 | yes | 3 | 1 |
5.6 | no | 5 | 0 |
5.7 | yes | 7 | 1 |
9.8 | no | 8 | 0 |
3.6 | yes | 9 | 1 |
3.7 | no | 10 | 0 |
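A table like this maps naturally onto a pandas data frame. The sketch below rebuilds it and separates the columns, assuming that y is the target and that x1, x2, and x3 are the remaining variables. The column names and the names X and y are illustrative choices.

```python
import pandas as pd

# Each key is a column (variable); each position is a row (sample).
df = pd.DataFrame(
    {
        "x1": [1.2, 2.1, 5.5, 4.3, 2.7, 5.6, 5.7, 9.8, 3.6, 3.7],
        "x2": ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "no"],
        "x3": [2, 3, 6, 4, 3, 5, 7, 8, 9, 10],
        "y": [1, 0, 0, 1, 1, 0, 1, 0, 1, 0],
    }
)

# Assumption: y is the target; everything else is an input variable.
X = df.drop(columns="y")
y = df["y"]

print(df.shape)
print(df.dtypes)
```

Note that the computing types reported by df.dtypes do not settle the statistical types: y is stored as an integer, but with only the values 0 and 1 it may well be categorical.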