1. Boring Indices & Where to Find Them: The Auto-Incremented Integer Index in StaticFrame
This article is part of a series exploring the features and design of StaticFrame, a Python package that offers data structures similar to the Pandas DataFrame and Series, but with an immutable data model.
This article demonstrates how StaticFrame exposes functionality for creating the most boring index object: the auto-incremented integer index (AIII). An AIII makes an axis selectable with integers, just like a NumPy array; it makes loc selection equivalent to iloc selection; and it is closely related to the “auto increment” integer columns found in databases such as MySQL (the AUTO_INCREMENT keyword), SQLite (the AUTOINCREMENT keyword), or PostgreSQL.
While index objects that provide scrutable labels into data are a key feature of libraries like Pandas and StaticFrame, there are many situations where the simple, inscrutable AIII is needed, such as when data does not have a meaningful index, or in concatenation of data with redundant indices. Offering convenient and consistent approaches to creating these indices supports creating more maintainable code.
All examples use StaticFrame 0.4.0 or later (https://pypi.org/project/static-frame) and import with the following convention:
>>> import static_frame as sf
1.1. Reindexing & Relabeling
We will take a brief detour to consider how reindexing and relabeling work in Pandas and StaticFrame.
Changing an index on a Series or Frame can be done in at least two ways: (1) create a new container with a new index of any size, supplying labels with values from the old container if those labels are in the old index (i.e., alignment based on index labels), or (2) create a new container with a new index of the same size, reusing the same values in the same position (alignment based on position).
Following the precedent of Pandas, StaticFrame implements Series.reindex() and Frame.reindex() with the former interpretation: alignment based on index labels. As shown in the example below, the new index matches and retains only two of the four previous values:
>>> s1 = sf.Series((x * 100 for x in range(1, 5)), index=tuple('wxyz'))
>>> s1
<Series>
<Index>
w        100
x        200
y        300
z        400
<<U1>    <int64>
>>> s1.reindex(tuple('stwx'), fill_value=0)
<Series>
<Index>
s        0
t        0
w        100
x        200
<<U1>    <int64>
To handle the latter interpretation, alignment based on position, Pandas offers at least two approaches: the mutable index attribute can be directly assigned, or the set_axis() function can be used.
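The two Pandas approaches can be sketched as follows. This is a minimal illustration that assumes a recent Pandas release, in which set_axis() returns a new object rather than modifying in place; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([100, 200, 300, 400], index=list('wxyz'))

# Approach 1: set_axis() returns a new Series whose labels are
# assigned by position; the original Series keeps its labels.
relabeled = s.set_axis(list('abcd'))

# Approach 2: assign to the mutable index attribute. This changes
# the object it is called on, so we work on a copy here.
mutated = s.copy()
mutated.index = list('abcd')

print(relabeled.index.tolist())
print(mutated.index.tolist())
print(s.index.tolist())
```

In both cases the values keep their positions and only the labels change, which is the positional alignment described above.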
StaticFrame uses the name “relabel” for all methods that supply a new or transformed index of the same size, to be aligned by position. The Series.relabel() method can be used to create a new index by transforming old index labels (via a function or mapping), or by supplying an appropriately sized index initializer. As NumPy arrays in StaticFrame are immutable, relabeling is efficient: underlying data is never copied.
>>> s1.relabel(tuple('abcd'))
<Series>
<Index>
a        100
b        200
c        300
d        400
<<U1>    <int64>
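The difference between the two interpretations can also be sketched on plain Python data, independent of either library. The helper names below (reindex_by_label, relabel_by_position) are hypothetical and exist only for this illustration; they are not part of StaticFrame or Pandas.

```python
def reindex_by_label(labels, values, new_labels, fill_value=None):
    """Align on labels: a value survives only if its label is requested."""
    lookup = dict(zip(labels, values))
    return [(label, lookup.get(label, fill_value)) for label in new_labels]


def relabel_by_position(values, new_labels):
    """Align on position: same values, same order, new labels."""
    if len(new_labels) != len(values):
        raise ValueError('new labels must match the number of values')
    return list(zip(new_labels, values))


labels = tuple('wxyz')
values = (100, 200, 300, 400)

# Only 'w' and 'x' exist in the old labels; 's' and 't' get the fill value.
by_label = reindex_by_label(labels, values, tuple('stwx'), fill_value=0)

# All four values are retained, in order, under new labels.
by_position = relabel_by_position(values, tuple('abcd'))

print(by_label)
print(by_position)
```

Label alignment may drop or fill values and may change the size of the result, while positional alignment never does either.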
1.2. Setting an Auto-Incremented Integer Index
A common use of index assignment based on position is “resetting” the index: replacing an existing index with an auto-incremented integer index (AIII). AIIIs are given to any Series or Frame created without explicit index arguments; they are also useful when combining data that does not have a “natural” index along an axis.
While Pandas offers a discrete method for this operation, reset_index(), that function is made complex by its drop and inplace parameters. For example, reset_index() will produce, from a pd.Series, a new pd.Series or a pd.DataFrame depending on whether drop is True or False, and it exposes a conflicting parameter configuration when drop is False and inplace is True, raising “TypeError: Cannot reset_index inplace on a Series to create a DataFrame.”
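The behavior described above can be reproduced with a short sketch. It assumes a recent Pandas release and relies on the drop and inplace parameters of Series.reset_index(); the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([100, 200], index=['w', 'x'])

# drop=True discards the old index: the result is still a Series.
as_series = s.reset_index(drop=True)

# drop=False (the default) keeps the old index as a column,
# so the result has to become a DataFrame.
as_frame = s.reset_index()

# inplace=True with drop=False is contradictory: a Series cannot
# turn itself into a DataFrame in place.
try:
    s.reset_index(inplace=True)
    conflict = None
except TypeError as error:
    conflict = str(error)

print(type(as_series).__name__, type(as_frame).__name__)
print(conflict)
```

The return type thus depends on an argument value, and one combination of arguments is rejected only at run time.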
A goal in StaticFrame’s API design is to avoid, as much as possible, interfaces that permit conflicting, non-orthogonal arguments.
In addition to relabeling, another case where an AIII is frequently needed is in concatenating numerous Series or Frame. For example, when concatenating Frame, one axis might be aligned while the other, extended axis requires an AIII. Deviating in naming from the reset_index() method, Pandas supports this with a Boolean ignore_index parameter.
Another goal of StaticFrame’s API design is to support common interfaces wherever possible. Reusing, across diverse interfaces, the same mechanism for creating AIIIs is thus desirable.
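For comparison, the Pandas concatenation case can be sketched as below. The example assumes the ignore_index parameter of pd.concat() in a recent Pandas release; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([100, 200], index=['w', 'x'])

# By default the labels are carried over, producing a non-unique index.
kept = pd.concat((s, s))

# ignore_index=True replaces the labels with an auto-incremented integer index.
reset = pd.concat((s, s), ignore_index=True)

print(kept.index.tolist(), kept.index.is_unique)
print(reset.index.tolist())
```

The same outcome, an AIII, is requested here through a differently named Boolean flag than in reset_index(), which is the inconsistency a shared mechanism is meant to avoid.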
Rather than specialized functions or arguments, AIIIs in StaticFrame can be created on Series and Frame by passing a special value, an IndexAutoFactory object, to index initializer arguments. This is presently supported for the relabel() and from_concat() methods demonstrated in the sections that follow. Frame initializers similarly can take an IndexAutoFactory.
By using a special type that can be supplied to existing index and columns arguments, StaticFrame avoids non-orthogonal arguments and offers a consistent interface for producing AIIIs.
1.4. Resetting an Index when Relabeling
By accepting an IndexAutoFactory argument, a relabel() method can be used to cover the functionality of the Pandas reset_index() method.
For example, the IndexAutoFactory class can be given as the index argument to Series.relabel() to produce a new Series with an AIII. As mentioned above, because underlying NumPy arrays are immutable in StaticFrame, this is a no-copy operation.
>>> s1.relabel(sf.IndexAutoFactory)
<Series>
<Index>
0        100
1        200
2        300
3        400
<int64>  <int64>
The benefit of having a specific type, rather than using None, to signify application of an AIII becomes clearer in the context of Frame.relabel(), where both an index and a columns argument can be set independently. The example below demonstrates creating a Frame, setting an AIII on both axes, and setting an AIII on columns while relabeling the index.
>>> f1 = sf.Frame.from_dict(dict(a=(1,2), b=(True, False)), index=tuple('xy'))
>>> f1
<Frame>
<Index> a       b      <<U1>
<Index>
x       1       True
y       2       False
<<U1>   <int64> <bool>
>>> f1.relabel(index=sf.IndexAutoFactory, columns=sf.IndexAutoFactory)
<Frame>
<Index> 0       1      <int64>
<Index>
0       1       True
1       2       False
<int64> <int64> <bool>
>>> f1.relabel(index=tuple('ab'), columns=sf.IndexAutoFactory)
<Frame>
<Index> 0       1      <int64>
<Index>
a       1       True
b       2       False
<<U1>   <int64> <bool>
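Why a dedicated type works better than None here can be sketched in plain Python. Everything in this block is hypothetical: AutoLabels stands in for the role played by IndexAutoFactory, and resolve_labels() and relabel() are illustrative helpers, not StaticFrame's implementation.

```python
class AutoLabels:
    """Hypothetical sentinel type; passed as a class, never instantiated."""


def resolve_labels(current, requested=None):
    """Return the labels for one axis.

    None keeps the current labels, AutoLabels produces 0..n-1, and
    anything else is treated as explicit labels of the same length.
    """
    if requested is None:
        return tuple(current)
    if requested is AutoLabels:
        return tuple(range(len(current)))
    requested = tuple(requested)
    if len(requested) != len(current):
        raise ValueError('explicit labels must match the axis length')
    return requested


def relabel(index, columns, *, new_index=None, new_columns=None):
    """Resolve each axis independently with the same rule."""
    return (resolve_labels(index, new_index),
            resolve_labels(columns, new_columns))


index, columns = ('x', 'y'), ('a', 'b')

both_auto = relabel(index, columns, new_index=AutoLabels, new_columns=AutoLabels)
mixed = relabel(index, columns, new_index=('a', 'b'), new_columns=AutoLabels)
untouched = relabel(index, columns, new_columns=AutoLabels)

print(both_auto)
print(mixed)
print(untouched)
```

Because None already means "leave this axis alone", a separate sentinel is needed to mean "replace with integers"; with it, each axis argument takes one of three mutually exclusive values and no extra Boolean flags are required.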
1.5. Resetting an Index when Concatenating
Concatenating Series or Frame is a context where supplying a new index is often desirable along the extended axis. The IndexAutoFactory type can be used here to supply that index.
For example, when concatenating (vertically stacking) with Series.from_concat(), we must supply a new index if the resulting index is not unique. Unlike Pandas, StaticFrame requires all indices to have unique values.
>>> s1
<Series>
<Index>
w        100
x        200
y        300
z        400
<<U1>    <int64>
>>> sf.Series.from_concat((s1, s1), index=tuple('abcdefgh'))
<Series>
<Index>
a        100
b        200
c        300
d        400
e        100
f        200
g        300
h        400
<<U1>    <int64>
However, if an AIII is needed, the IndexAutoFactory type can be used with the same interface:
>>> sf.Series.from_concat((s1, s1), index=sf.IndexAutoFactory)
<Series>
<Index>
0        100
1        200
2        300
3        400
4        100
5        200
6        300
7        400
<int64>  <int64>
The same approach is used with Frame.from_concat(), where both index and columns arguments are exposed. For example, two Series can be horizontally “stacked” along axis 1 to produce a new Frame. If the Series.name attributes are unique, they can be used to create the columns; otherwise, new columns can be supplied or an IndexAutoFactory value can be provided.
>>> s2 = s1 * .5
>>> sf.Frame.from_concat((s1, s2), axis=1, columns=sf.IndexAutoFactory)
<Frame>
<Index> 0       1         <int64>
<Index>
w       100     50.0
x       200     100.0
y       300     150.0
z       400     200.0
<<U1>   <int64> <float64>
Similarly, concatenating along axis 1 (horizontally stacking) the same Frame multiple times results in non-unique columns, which raises an Exception in StaticFrame. To avoid this, the IndexAutoFactory can be supplied.
>>> sf.Frame.from_concat((f1, f1), axis=1, columns=sf.IndexAutoFactory)
<Frame>
<Index> 0       1      2       3      <int64>
<Index>
x       1       True   1       True
y       2       False  2       False
<<U1>   <int64> <bool> <int64> <bool>
1.6. Consistent Interfaces for More Maintainable Code
Resetting an index is not a complex operation. However, how to provide the option to create an AIII within diverse interfaces is not obvious. The approach taken with StaticFrame offers a consistent interface, leading to more maintainable code.