9.2. Defining the STFT

The Short-Time Fourier Transform (STFT) does exactly what it says: it applies the Fourier transform to short fragments of time, that is, frames taken from a longer signal. At a conceptual level, there is not too much going on here: we just extract frames from the signal, and apply the DFT to each frame. However, there is much to discuss in the details.

9.2.1. A basic STFT algorithm

A basic STFT algorithm requires three things:

  • the input signal \(\blue{x}\),

  • the frame length \(N_F\), and

  • the hop length \(N_H\).

Typical STFT implementations assume a real-valued input signal, and keep only the non-negative frequencies by using rfft instead of fft. The result is a two-dimensional array, where one dimension indexes the frames, and the other indexes frequencies. Note that the frame length dictates the number of samples going into the DFT, so the DFT of each frame has \(N_F\) analysis frequencies, of which rfft keeps only the \(1 + \lfloor N_F/2 \rfloor\) non-negative ones.

import numpy as np

def basic_stft(x, n_frame, n_hop):
    '''Compute a basic Short-Time Fourier transform
    of a real-valued input signal.'''
    
    # Compute the number of frames
    frame_count = 1 + (len(x) - n_frame) // n_hop
    
    # Initialize the output array
    # We have frame_count frames 
    #     and (1 + n_frame//2) frequencies for each frame
    stft = np.zeros((frame_count, 1 + n_frame // 2), dtype=complex)
    
    # Populate each frame's DFT results
    for k in range(frame_count):
        # Slice the k'th frame
        x_frame = x[k * n_hop:k * n_hop + n_frame]
        
        # Take the DFT (non-negative frequencies only)
        stft[k, :] = np.fft.rfft(x_frame)
        
    return stft
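
As a quick sanity check, the sketch below applies basic_stft to a synthetic test tone rather than a real recording. The sampling rate, tone frequency, and duration are arbitrary choices made for illustration, and basic_stft is repeated in condensed form only so that the example runs on its own.

```python
import numpy as np

def basic_stft(x, n_frame, n_hop):
    '''Condensed copy of basic_stft, repeated so this example is self-contained.'''
    frame_count = 1 + (len(x) - n_frame) // n_hop
    stft = np.zeros((frame_count, 1 + n_frame // 2), dtype=complex)
    for k in range(frame_count):
        stft[k, :] = np.fft.rfft(x[k * n_hop:k * n_hop + n_frame])
    return stft

# Assumed test signal: one second of a 1000 Hz cosine at fs = 22050 Hz
fs = 22050
t = np.arange(fs) / fs
x = np.cos(2 * np.pi * 1000 * t)

n_frame, n_hop = 1024, 512
stft = basic_stft(x, n_frame, n_hop)

# One row per frame, one column per non-negative analysis frequency
print(stft.shape)  # (42, 513)

# Analysis frequencies (in Hz) for the columns of the STFT
freqs = np.fft.rfftfreq(n_frame, d=1 / fs)

# Start time (in seconds) of each frame
frame_times = np.arange(stft.shape[0]) * n_hop / fs

# The strongest bin in each frame should lie within one bin spacing of the tone
peak_freqs = freqs[np.argmax(np.abs(stft), axis=1)]
print(peak_freqs.min(), peak_freqs.max())
```

The frame index k and frequency index m can be converted to physical units as shown: frame k begins at k * n_hop / fs seconds, and column m corresponds to m * fs / n_frame Hz.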

Fig. 9.2 demonstrates the operation of this basic_stft method on a real audio recording.

Fig. 9.2 A signal \(\blue{x}\) is sampled at \(f_s=22050\) Hz and frames are taken with \(N_F=1024\) and \(N_H=512\). Each frame of \(\blue{x[n]}\) is plotted (left) along with its DFT magnitudes \(|\darkblue{X[m]}|\) as produced by the STFT (right).

The type of visualization used in Fig. 9.2 may look familiar to you, as it can be found on all kinds of commercially available devices (stereos, music software, etc.). Now you know how it works.

9.2.2. Spectrograms

Another way of representing the output of a Short-Time Fourier transform is by using spectrograms. Spectrograms are essentially an image representation of the STFT, constructed by stacking the frames horizontally, so that time can be read left-to-right, and frequency can be read bottom-to-top. Typically, when we refer to spectrograms, what we actually mean are magnitude spectrograms, where the phase component has been discarded and only the DFT magnitudes are retained. In Python code, we would say:

# Compute the STFT with frame length = 1024, hop length = 512
stft = basic_stft(x, 1024, 512)

# Take the absolute value to discard phase information
S = np.abs(stft)

This allows us to interpret magnitude (\(\darkblue{S=|X|}\)) visually as brightness under a suitable color mapping.

Fig. 9.3 (top) illustrates an example of a spectrogram display. Each column (vertical slice) of the image corresponds to one frame of Fig. 9.2 (right).
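
To make the layout concrete, the following sketch arranges the magnitudes so that columns index frames and rows index frequencies, which is the orientation described above. The test signal is a hypothetical stand-in for the recording, and the framing is written as a list comprehension equivalent to basic_stft so that the example runs by itself.

```python
import numpy as np

# Assumed stand-in signal: a tone whose frequency rises over two seconds
fs = 22050
t = np.arange(2 * fs) / fs
x = np.cos(2 * np.pi * (500 * t + 250 * t**2))

n_frame, n_hop = 1024, 512
frame_count = 1 + (len(x) - n_frame) // n_hop

# Same computation as basic_stft: one row per frame
stft = np.array([np.fft.rfft(x[k * n_hop:k * n_hop + n_frame])
                 for k in range(frame_count)])
S = np.abs(stft)

# Transpose so that each column is one frame and each row is one frequency
image = S.T

# Axis coordinates for labeling the picture
times = np.arange(frame_count) * n_hop / fs   # seconds, left to right
freqs = np.fft.rfftfreq(n_frame, d=1 / fs)    # Hz, low to high

print(image.shape)  # (frequencies, frames)
```

Most plotting tools draw row 0 at the top of an image by default, so a display routine typically needs to be told to place the origin at the bottom (for instance, origin='lower' in matplotlib's imshow) for frequency to read bottom-to-top.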


Fig. 9.3 The magnitude spectrogram representation of the slide whistle example of Fig. 9.2, using the same parameters \(N_F=1024\), \(N_H=512\). Top: visualization using linear magnitude scaling. Bottom: visualization using decibel scaling.

While some spectral content is visually perceptible in Fig. 9.3 (top), most of the image is dark, and it’s generally difficult to read. This goes back to our earlier discussion of decibels: human perception of amplitude is logarithmic, not linear, so we should account for this when visualizing spectral content.

The bottom plot of Fig. 9.3 shows the same data, but using a decibel scaling for amplitudes:

\[S_\text{dB} = 20\cdot \log_{10} S\]

The result of this mapping exposes far more structure in the input signal. The (frame-wise) fundamental frequency of the signal is visually salient as the bright contour at the bottom of the image, but harmonics are also visible, as is background noise.
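
A minimal sketch of the decibel mapping follows. Because the logarithm of zero is undefined, practical code usually clamps very small magnitudes before taking the logarithm. The floor value amin and the helper name magnitude_to_db are assumptions made for this illustration, not part of the definition above.

```python
import numpy as np

def magnitude_to_db(S, amin=1e-10):
    '''Convert magnitudes to decibels: 20 * log10(S).

    amin is an assumed floor that prevents log10(0); magnitudes
    below it are treated as amin (that is, -200 dB for amin=1e-10).
    '''
    S = np.asarray(S, dtype=float)
    return 20.0 * np.log10(np.maximum(S, amin))

# Each factor of 10 in magnitude corresponds to a 20 dB step
S = np.array([0.0, 0.01, 0.1, 1.0, 10.0])
print(magnitude_to_db(S))
```

A magnitude of 1 maps to 0 dB, and the zero entry is clamped to the floor rather than producing negative infinity. In a spectrogram display, this compression is what lifts quiet components such as harmonics and background noise out of the dark region of the image.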

9.2.3. Choosing parameters

The basic_stft algorithm above has two parameters that we are free to set however we see fit. There is no single “right” setting for these STFT parameters, but there are settings that will be better or worse for certain applications.

9.2.3.1. Frame length \(N_F\)

Unlike the standard DFT, where the number of analysis frequencies is dictated by the number of samples, the STFT allows us to control this parameter directly. This introduces a time-frequency trade-off.

Large values of \(N_F\) will provide a high frequency resolution, dividing the frequency range \([0, f_s/2]\) into smaller pieces as \(N_F\) increases. This comes at a cost of reduced time resolution: large values of \(N_F\) integrate over longer windows of time, so any changes in frequency content that are shorter than the frame length could be obscured. Intuitively, when the frame length is large (and the hop length is fixed), any given sample \(x[n]\) will be covered by more frames, and therefore contribute to more columns in the spectrogram, resulting in a blurring over time.

Conversely, small values of \(N_F\) provide good time localization—since each frame only sees a small amount of information—but poor frequency resolution, since the range \([0, f_s/2]\) is divided into relatively few pieces.
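
The trade-off can be put into numbers. The spacing between analysis frequencies is \(f_s / N_F\), the duration of a frame is \(N_F / f_s\), and with a fixed hop length each sample falls in roughly \(N_F / N_H\) frames (fewer near the ends of the signal). The helper name frame_tradeoff and the particular frame lengths below are illustrative assumptions; \(f_s = 22050\) and \(N_H = 512\) are carried over from the earlier example.

```python
def frame_tradeoff(n_frame, n_hop, fs):
    '''Summarize resolution for one frame length.

    Returns (bin spacing in Hz, frame duration in seconds,
             approximate number of frames covering each sample).
    '''
    bin_spacing = fs / n_frame
    frame_duration = n_frame / fs
    frames_per_sample = n_frame / n_hop
    return bin_spacing, frame_duration, frames_per_sample

fs, n_hop = 22050, 512
for n_frame in [512, 1024, 2048, 4096]:
    df, dur, cover = frame_tradeoff(n_frame, n_hop, fs)
    print(f"N_F={n_frame:5d}  spacing={df:6.2f} Hz  "
          f"duration={1000 * dur:6.1f} ms  coverage={cover:.0f} frames")
```

Doubling \(N_F\) halves the spacing between analysis frequencies but doubles the span of time summarized by each frame, so the product of the two stays fixed.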

Fig. 9.4 visualizes this trade-off for a fixed hop length \(N_H\) and varying frame length \(N_F\).