© 2022

Constructing Data Frames


0. Generating Synthetic Data

# libraries
import pandas as pd
import numpy as np

np.random.seed(0)
data = np.random.randint(low=0, high=100, size=(5, 4))
print(data)
# [[44 47 64 67]
#  [67  9 83 21]
#  [36 87 70 88]
#  [88 12 58 65]
#  [39 87 46 88]]

1. From a 2D np.array

This is probably the most straightforward method. It assumes that each inner array of the 2D array represents one row (an observation) across all features.

# generate col names
colnames = "kpi_1 kpi_2 kpi_3 kpi_n".split()
df1 = pd.DataFrame(data=data, columns=colnames)
print(df1)

#    kpi_1  kpi_2  kpi_3  kpi_n
# 0     44     47     64     67
# 1     67      9     83     21
# 2     36     87     70     88
# 3     88     12     58     65
# 4     39     87     46     88
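
If the inner arrays instead hold one feature each, the array has to be transposed before it is passed to pd.DataFrame. The following is a minimal sketch of that case; the `by_feature` array and the `df_rows` name are hypothetical and only reuse the values shown above in a column-wise layout.

```python
import numpy as np
import pandas as pd

colnames = ["kpi_1", "kpi_2", "kpi_3", "kpi_n"]

# Hypothetical layout: each inner array is one feature (4 features x 5 observations)
by_feature = np.array([
    [44, 67, 36, 88, 39],
    [47, 9, 87, 12, 87],
    [64, 83, 70, 58, 46],
    [67, 21, 88, 65, 88],
])

# Transpose so that rows become observations and columns become features
df_rows = pd.DataFrame(data=by_feature.T, columns=colnames)
print(df_rows.shape)  # (5, 4)
```

Checking `.shape` against the number of column names is a cheap way to catch an array that is oriented the wrong way.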

2. From a List of Lists

Same functionality as above, but with built-in lists instead of np arrays.

# tolist() converts the array into nested built-in lists of Python ints
list_data = data.tolist()
df2 = pd.DataFrame(data=list_data, columns=colnames)

# assert all elements from df1 and df2 are identical
assert np.sum(np.array(df1) != np.array(df2)) == 0

Note: an element-wise comparison between two dataframes, each converted to a 2D np.array, is a quick way to check whether they hold the same values. The != operator yields a boolean mask with the same dimensions as the original dataframes; a cell is False where the corresponding elements are identical and True where they differ. Summing the mask with np.sum therefore gives zero when the two arrays are identical. Keep in mind that this check covers the values only, not the index or column labels.
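
The comparison described in the note can be sketched on two small frames. The frames `a` and `b` are illustrative and not part of the data above; `DataFrame.equals` is included as a stricter alternative, on the understanding that it also takes labels and dtypes into account.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"x": [1, 2], "y": [3, 5]})  # differs from `a` in one cell

mask = a.to_numpy() != b.to_numpy()  # boolean mask, same shape as the frames
print(mask)
# [[False False]
#  [False  True]]
print(np.sum(mask))  # 1 -> exactly one cell differs

# Comparing a frame with itself gives an all-False mask, so the sum is zero
assert np.sum(a.to_numpy() != a.to_numpy()) == 0

# Stricter alternative that works on the frames directly
print(a.equals(b))  # False
```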

3. From a Dictionary of Feature Series

With this method, a dictionary is constructed with the column names as keys and each feature's data series as values. The values can be passed as a list, np.array, or pd.Series.

dic_data = {
    "kpi_1": np.array([44, 67, 36, 88, 39]),
    "kpi_2": np.array([47, 9, 87, 12, 87]),
    "kpi_3": np.array([64, 83, 70, 58, 46]),
    "kpi_n": np.array([67, 21, 88, 65, 88]),
}

# build the same dictionary programmatically from the 2D array:
# data.T yields one 1D array per feature, paired here with its column name
dic_data = dict(zip(colnames, data.T))
df3 = pd.DataFrame(dic_data)

# assert all elements from df1 and df3 are identical
assert np.sum(np.array(df1) != np.array(df3)) == 0
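
Since the values may be a list, an np.array, or a pd.Series, the three can also be mixed in one dictionary. A minimal sketch, assuming all containers have the same length and the Series carries the default integer index (the `mixed` and `df_mixed` names are hypothetical):

```python
import numpy as np
import pandas as pd

mixed = {
    "kpi_1": [44, 67, 36, 88, 39],             # built-in list
    "kpi_2": np.array([47, 9, 87, 12, 87]),    # NumPy array
    "kpi_3": pd.Series([64, 83, 70, 58, 46]),  # pandas Series
}
df_mixed = pd.DataFrame(mixed)
print(df_mixed.shape)  # (5, 3)
```

A Series brings its own index along, so a Series with a non-default index would be aligned on that index rather than by position.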

4. From Lists of Features Using zip

This method works best when each feature's data series already lives in its own container (for example, a list). zip then builds the records by combining the features position by position. For example, the first record from the data below would be (44, 47, 64, 67).

kpi1 = [44, 67, 36, 88, 39]
kpi2 = [47, 9, 87, 12, 87]
kpi3 = [64, 83, 70, 58, 46]
kpin = [67, 21, 88, 65, 88]
df4 = pd.DataFrame(data=zip(kpi1, kpi2, kpi3, kpin), columns=colnames)

# assert all elements from df1 and df4 are identical
assert np.sum(np.array(df1) != np.array(df4)) == 0
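
To see what zip hands over to pd.DataFrame, the records can be materialized into a list first. This is only an inspection sketch; the `records` name is introduced here for illustration.

```python
kpi1 = [44, 67, 36, 88, 39]
kpi2 = [47, 9, 87, 12, 87]
kpi3 = [64, 83, 70, 58, 46]
kpin = [67, 21, 88, 65, 88]

# Each tuple combines the i-th element of every feature list into one record
records = list(zip(kpi1, kpi2, kpi3, kpin))
print(records[0])    # (44, 47, 64, 67)
print(len(records))  # 5
```

Note that zip stops at the shortest input, so feature lists of unequal length would silently drop records.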

5. From a List of Dictionary Records

Note that one potential drawback of this method is the memory footprint created by repeating the column names in every record; depending on the situation, this may or may not be an issue. On the other hand, the order of the keys within a record does not matter, since each value is explicitly labeled with its column name.

list_of_dicts = [
    {"kpi_1": 44, "kpi_2": 47, "kpi_3": 64, "kpi_n": 67},
    {"kpi_1": 67, "kpi_2": 9, "kpi_3": 83, "kpi_n": 21},
    {"kpi_1": 36, "kpi_2": 87, "kpi_3": 70, "kpi_n": 88},
    {"kpi_1": 88, "kpi_2": 12, "kpi_3": 58, "kpi_n": 65},
    {"kpi_1": 39, "kpi_2": 87, "kpi_3": 46, "kpi_n": 88},
]
df5 = pd.DataFrame(list_of_dicts)

# assert all elements from df1 and df5 are identical
assert np.sum(np.array(df1) != np.array(df5)) == 0
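
The point about key order can be checked with a small sketch in which the second record lists its keys in reverse. The `records` and `df_records` names are hypothetical, and only the first two rows of the data are used.

```python
import pandas as pd

records = [
    {"kpi_1": 44, "kpi_2": 47, "kpi_3": 64, "kpi_n": 67},
    {"kpi_n": 21, "kpi_3": 83, "kpi_2": 9, "kpi_1": 67},  # same fields, reversed order
]
df_records = pd.DataFrame(records)

# Values are matched to columns by key, not by position within the record
print(df_records["kpi_1"].tolist())  # [44, 67]
print(df_records["kpi_n"].tolist())  # [67, 21]
```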