Skip to article frontmatterSkip to article content

NumPy Logo

Intermediate NumPy


Overview

  1. Working with multiple dimensions
  2. Subsetting of irregular arrays with booleans
  3. Sorting, or indexing with indices

Prerequisites

ConceptsImportanceNotes
NumPy BasicsNecessary
  • Time to learn: 20 minutes

Imports

We will be including Matplotlib to illustrate some of our examples, but you don’t need knowledge of it to complete this notebook.

import matplotlib.pyplot as plt
import numpy as np

Using axes to slice arrays

Here we introduce an important concept when working with NumPy: the axis. This indicates the particular dimension along which a function should operate (provided the function does something taking multiple values and converts to a single value).

Let’s look at a concrete example with sum:

a = np.arange(12).reshape(3, 4)
a

This calculates the total of all values in the array.

np.sum(a)

Info

Some of NumPy's functions can be accessed as `ndarray` methods!
a.sum()

Now, with a reminder about how our array is shaped,

a.shape

we can specify axis to get just the sum across each of our rows.

np.sum(a, axis=0)

Or do the same and take the sum across columns:

np.sum(a, axis=1)

After putting together some data and introducing some more advanced calculations, let’s demonstrate a multi-layered example: calculating temperature advection. If you’re not familiar with this (don’t worry!), we’ll be looking to calculate

advection=vT\text{advection} = -\vec{v} \cdot \nabla T

and to do so we’ll start with some random TT and v\vec{v} values,

temp = np.random.randn(100, 50)
u = np.random.randn(100, 50)
v = np.random.randn(100, 50)

We can calculate the np.gradient of our new T(100x50)T(100x50) field as two separate component gradients,

gradient_x, gradient_y = np.gradient(temp)

In order to calculate vT-\vec{v} \cdot \nabla T, we will use np.dstack to turn our two separate component gradient fields into one multidimensional field containing xx and yy gradients at each of our 100x50100x50 points,

grad_vectors = np.dstack([gradient_x, gradient_y])
print(grad_vectors.shape)

and then do the same for our separate uu and vv wind components,

wind_vectors = np.dstack([u, v])
print(wind_vectors.shape)

Finally, we can calculate the dot product of these two multidimensional fields of wind and temperature gradient components by hand as an element-wise multiplication, *, and then a sum of our separate components at each point (i.e., along the last axis),

advection = (wind_vectors * -grad_vectors).sum(axis=-1)
print(advection.shape)

Indexing arrays with boolean values

Array comparisons

NumPy can easily create arrays of boolean values and use those to select certain values to extract from an array

# Create some synthetic data representing temperature and wind speed data
np.random.seed(19990503)  # Make sure we all have the same data
temp = 20 * np.cos(np.linspace(0, 2 * np.pi, 100)) + 50 + 2 * np.random.randn(100)
speed = np.abs(
    10 * np.sin(np.linspace(0, 2 * np.pi, 100)) + 10 + 5 * np.random.randn(100)
)
plt.plot(temp, 'tab:red')
plt.plot(speed, 'tab:blue');

By doing a comparison between a NumPy array and a value, we get an array of values representing the results of the comparison between each element and the value

temp > 45

This, which is its own NumPy array of boolean values, can be used as an index to another array of the same size. We can even use it as an index within the original temp array we used to compare,

temp[temp > 45]

Info

This only returns the values from our original array meeting the indexing conditions, nothing more! Note the size,
temp[temp > 45].shape

Warning

Indexing arrays with arrays requires them to be the same size!

If we store this array somewhere new,

temp_45 = temp[temp > 45]
temp_45[temp < 45]

We find that our original (100,) shape array is too large to subset our new (60,) array.

If their sizes do match, the boolean array can come from a totally different array!

speed > 10
temp[speed > 10]

Replacing values

To extend this, we can use this conditional indexing to assign new values to certain positions within our array, somewhat like a masking operation.

# Make a copy so we don't modify the original data
temp2 = temp.copy()
speed2 = speed.copy()

# Replace all places where speed is <10 with NaN (not a number)
temp2[speed < 10] = np.nan
speed2[speed < 10] = np.nan
plt.plot(temp2, 'tab:red');

and to put this in context,

plt.plot(temp, 'r:')
plt.plot(temp2, 'r')
plt.plot(speed, 'b:')
plt.plot(speed2, 'b');

If we use parentheses to preserve the order of operations, we can combine these conditions with other bitwise operators like the & for bitwise_and,

multi_mask = (temp < 45) & (speed > 10)
multi_mask
temp[multi_mask]

Heat index is only defined for temperatures >= 80F and relative humidity values >= 40%. Using the data generated below, we can use boolean indexing to extract the data where heat index has a valid value.

# Here's the "data"
np.random.seed(19990503)
temp = 20 * np.cos(np.linspace(0, 2 * np.pi, 100)) + 80 + 2 * np.random.randn(100)
relative_humidity = np.abs(
    20 * np.cos(np.linspace(0, 4 * np.pi, 100)) + 50 + 5 * np.random.randn(100)
)

# Create a mask for the two conditions described above
good_heat_index = (temp >= 80) & (relative_humidity >= 0.4)

# Use this mask to grab the temperature and relative humidity values that together
# will give good heat index values
print(temp[good_heat_index])

Another bitwise operator we can find helpful is Python’s ~ complement operator, which can give us the inverse of our specific mask to let us assign np.nan to every value not satisfied in good_heat_index.

plot_temp = temp.copy()
plot_temp[~good_heat_index] = np.nan
plt.plot(plot_temp, 'tab:red');

Indexing using arrays of indices

You can also use a list or array of indices to extract particular values--this is a natural extension of the regular indexing. For instance, just as we can select the first element:

temp[0]

We can also extract the first, fifth, and tenth elements as a list:

temp[[0, 4, 9]]

One of the ways this comes into play is trying to sort NumPy arrays using argsort. This function returns the indices of the array that give the items in sorted order. So for our temp,

inds = np.argsort(temp)
inds

i.e., our lowest value is at index 52, next 57, and so on. We can use this array of indices as an index for temp,

temp[inds]

to get a sorted array back!

With some clever slicing, we can pull out the last 10, or 10 highest, values of temp,

ten_highest = inds[-10:]
print(temp[ten_highest])

There are other NumPy arg functions that return indices for operating; check out the NumPy docs on sorting your arrays!


Summary

In this notebook we introduced the power of understanding the dimensions of our data by specifying math along axis, used True and False values to subset our data according to conditions, and used lists of positions within our array to sort our data.

What’s Next

Taking some time to practice this is valuable to be able to quickly manipulate arrays of information in useful or scientific ways.

Resources and references

The NumPy Users Guide expands further on some of these topics, as well as suggests various Tutorials, lectures, and more at this stage.