Abstract:
Modern pulse-Doppler radars use digital receivers with high speed ADCs and sophisticated radar signal
processors that necessitate high data rates, computationally intensive processing, and strict latency
requirements. Data-independent processing is performed as the first stage and requires the highest
data and computational rates of between 1 Gigaops and 1 Teraops, traditionally reserved for specialized
circuits that typically employ restrictive fixed-point arithmetic. The first stage generally requires FIR
filters, correlation, Fourier transforms, and matrix-vector algebra on multi-dimensional data, which
provide a range of demanding and interesting computational challenges and present ample opportunities
for parallel processing. Modern many-core GPUs provide general-purpose computation
on the GPU (GPGPU) for high-performance computing applications through fully programmable
pipelines, high memory bandwidths of up to hundreds of Gigabytes per second and high floating-point
computational performance of up to several Teraflops on a single chip. The massively-parallel
GPU architecture is well-suited for intrinsically parallel applications that require high dynamic range,
such as radar signal processing. However, numerous factors have to be considered in order to realize
the massive performance potential through a conventionally unfamiliar stream-programming
paradigm. Explicit control is also granted over a deep memory hierarchy and parallelism at various
granularities within an optimization space that is considered non-linear in many respects.
The aim of this research is to address and characterize the challenges and intricacies of using modern
GPUs with GPGPU capabilities for the computationally demanding software-defined pulse-Doppler
radar signal processing application. A single receiver-element, coherent pulse-Doppler system with
a two-dimensional data storage model was assumed, due to its widespread use and the interesting challenges
and opportunities that it provides for parallel implementation on the GPU architecture. The
NVIDIA Tesla C1060 GPU and CUDA were selected as a suitable GPGPU platform for the implementation
using single-precision floating-point arithmetic. A set of microbenchmarks was first
developed to isolate and highlight fundamental traits and relevant features of the GPU architecture, in
order to determine their impact in the radar application context. The common digital pulse compression
(DPC), corner turning (CT), Doppler filtering (DF), envelope (ENV) and constant false-alarm
rate (CFAR) processing functions were then implemented and optimized for the GPU architecture.
Multiple algorithmic variants were implemented, where appropriate, to evaluate the efficiency of different
algorithmic structures on the GPU architecture. These functions were then integrated to form
a radar signal processing chain, which allowed for further holistic optimization under realistic conditions.
An experimental framework and a simple analytical framework were developed and utilized for
analyzing low-level kernel performance and high-level system performance for individual functions
and the processing chain.
The microbenchmark results highlighted the severity of uncoalesced device memory access as well as
the importance of high arithmetic intensity to achieve high computational throughput, and an asymmetry
in performance for primitive math operations. Further, the microbenchmark results showed
that memory transfer performance for small buffers or effectively small radar bursts is fundamentally
poor, but also that memory transfer can be efficiently overlapped with computation, reducing the impact
of slow transfers in general. For the DPC and DF functions, the FFT-based variants using the
CUFFT library proved optimal. For the CT function, the use of shared memory is vital to achieve fully
coalesced transfers, and the lesser-known, but potentially highly detrimental, partition camping effect
needs to be addressed. For the CFAR function, the segmentation into separate processing stages for
rows and columns proved the most vital overall optimization. The ENV function along with several
simple GPU helper-kernels with low arithmetic intensity such as padding, scaling, and the window
function were found to be bandwidth-limited, as expected, and hence perform comparably to a pure
copy kernel. Based on the findings, pulse-Doppler radar signal processing on GPUs is highly feasible
for medium to large burst sizes, provided that the main performance contributors and detractors for
the target GPU architecture are well understood and adhered to.
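The row and column segmentation noted for the CFAR function can be illustrated with a minimal sketch. The abstract does not state which CFAR variant was used, so the sketch assumes a cell-averaging CFAR with rectangular reference and guard windows on a two-dimensional power map. It is written in plain Python on the CPU rather than as CUDA kernels, and all function and parameter names are hypothetical. The point shown is only that a rectangular window sum can be computed as a pass along rows followed by a pass along columns, giving the same result as a direct two-dimensional sum.

```python
def box_sum_direct(power, half_rows, half_cols):
    """Sum each cell's rectangular neighbourhood with one 2D loop nest."""
    n_rows, n_cols = len(power), len(power[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            total = 0.0
            for rr in range(max(0, r - half_rows), min(n_rows, r + half_rows + 1)):
                for cc in range(max(0, c - half_cols), min(n_cols, c + half_cols + 1)):
                    total += power[rr][cc]
            out[r][c] = total
    return out


def box_sum_separable(power, half_rows, half_cols):
    """Same window sum, split into a row stage and a column stage."""
    n_rows, n_cols = len(power), len(power[0])
    # Stage 1: sliding sum along each row (windows are truncated at the edges).
    row_pass = [
        [sum(row[max(0, c - half_cols):min(n_cols, c + half_cols + 1)])
         for c in range(n_cols)]
        for row in power
    ]
    # Stage 2: sliding sum down each column of the stage-1 result.
    return [
        [sum(row_pass[rr][c]
             for rr in range(max(0, r - half_rows), min(n_rows, r + half_rows + 1)))
         for c in range(n_cols)]
        for r in range(n_rows)
    ]


def ca_cfar(power, outer_half, guard_half, scale):
    """Cell-averaging CFAR (assumed variant): return (row, col) detections."""
    ones = [[1.0] * len(power[0]) for _ in power]
    # Reference region = outer window minus the guard window (which holds the cell under test).
    outer_sum = box_sum_separable(power, outer_half, outer_half)
    guard_sum = box_sum_separable(power, guard_half, guard_half)
    outer_cnt = box_sum_separable(ones, outer_half, outer_half)
    guard_cnt = box_sum_separable(ones, guard_half, guard_half)
    hits = []
    for r in range(len(power)):
        for c in range(len(power[0])):
            ref_count = outer_cnt[r][c] - guard_cnt[r][c]
            noise = (outer_sum[r][c] - guard_sum[r][c]) / ref_count
            if power[r][c] > scale * noise:
                hits.append((r, c))
    return hits


# Hypothetical 5x6 power map: unit noise floor with one strong cell.
grid = [[1.0] * 6 for _ in range(5)]
grid[2][3] = 100.0
detections = ca_cfar(grid, outer_half=2, guard_half=1, scale=5.0)
```

On a GPU, each stage would typically map to its own kernel so that each pass can read memory along a favourable direction; that mapping is not shown here.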