Particle tracking and identification on multi- and many-core hardware platforms

Author:
  1. Fernández Declara, Plácido
Supervised by:
  1. José Daniel García Sánchez (Supervisor)
  2. Niko Neufeld (Co-supervisor)

Defended at: Universidad Carlos III de Madrid

Date of defence: 30 September 2020

Examination committee:
  1. David Expósito Singh (Chair)
  2. Manuel Prieto Matías (Secretary)
  3. Marco Danelutto (Member)

Type: Doctoral thesis

Teseo: 630299

Abstract

Particle track reconstruction in high-energy physics is used for purposes that include simulation, particle analysis, and particle collision filtering and recording. The Large Hadron Collider (LHC) at CERN, Switzerland, is undergoing an upgrade. The four big experiments that use the LHC to record particle collisions, ATLAS, CMS, ALICE and LHCb, are also being upgraded before the next data-taking period. The LHCb experiment is specialized in studying charm and beauty quarks with high precision. Its main goal is to study particle interactions that help to explain the matter-antimatter asymmetry in the Universe. The experiment is being upgraded with improvements to its particle detectors and to the hardware and software components used to record particle collisions. One of the main changes for this upgrade is the removal of the “hardware filter” used in previous data-taking periods, composed of custom electronics and FPGAs. It will be substituted by a full “software filter” that uses commodity hardware to process a data rate of 40 Tbit per second in real time. This deluge of data is produced by the proton-proton collisions, where 30 million collisions per second occur, generating 10^9 particles. Estimations on the evolution of both hardware and software fell short in their predictions, and alternative hardware architectures are being considered to achieve the 40 Tbit per second goal. The software algorithms that run on this hardware to reconstruct the particle collisions, also known as events, need to be optimized and improved to compute the resulting trajectories of the particles, which are later used for physics analysis.

This thesis explores different opportunities with multi- and many-core hardware architectures to improve the throughput of particle collision processing, as well as the maintainability of the source code used for it. Opportunities with specific hardware architectures are explored in collaboration with Intel Corporation, targeting the Intel Xeon Phi Knights Landing (KNL) processor. Alternative hardware architectures, specifically GPUs, are explored to construct new software frameworks and algorithms. Finally, vectorization opportunities are explored with the newly developed algorithms to benefit from data-oriented designs that efficiently use the vector units of CPUs.

The Kalman filter algorithm is widely used in high-energy physics for particle reconstruction. It is a linear quadratic estimator that takes measurements from sensors and predicts where the next measurement should be. It then combines the prediction with the real measurement to produce a more accurate estimate, reducing the error associated with the measurements. At the LHCb experiment, the tracks produced by the proton-proton collisions are processed with the Kalman filter to reduce the error of each track. For the second stage of LHCb’s event filter, the Kalman filter is the largest time contributor of all the algorithms, making it the best candidate for optimizations that improve processing throughput.

The Intel Xeon Phi Knights Landing processor offers a many-core x86 architecture that is well suited for parallel workloads. It has a high core count of up to 72 cores, where each core provides four Simultaneous Multi-Threading (SMT) hardware threads, for a total of up to 288 hardware threads. It also includes two vector processing units per core, adding another level of hardware parallelism.
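To make the prediction and update steps described above concrete, the following is a minimal one-dimensional Kalman filter sketch in C++. It is illustrative only and not the LHCb implementation; the noise values and the step size are arbitrary assumptions.

    #include <cstdio>

    // Minimal one-dimensional Kalman filter (illustrative sketch only).
    // State: estimated position x with variance p.
    struct Kalman1D {
        double x = 0.0;   // current state estimate
        double p = 1.0;   // variance of the estimate
        double q = 0.01;  // process noise (assumed value)
        double r = 0.1;   // measurement noise (assumed value)

        // Prediction step: propagate the state and grow its uncertainty.
        void predict(double dx) {
            x += dx;      // predicted position at the next sensor plane
            p += q;
        }

        // Update step: blend the prediction with the real measurement z,
        // weighted by the Kalman gain k, which reduces the estimate's error.
        void update(double z) {
            const double k = p / (p + r);
            x += k * (z - x);
            p *= (1.0 - k);
        }
    };

    int main() {
        Kalman1D filter;
        const double hits[] = {1.02, 1.98, 3.05, 3.97};  // fake measurements
        for (double z : hits) {
            filter.predict(1.0);  // assume the track advances one unit per plane
            filter.update(z);
            std::printf("estimate %.3f (variance %.4f)\n", filter.x, filter.p);
        }
    }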
An LHCb Kalman filter implementation, already hand-tuned and vectorized, is used as the baseline to which parallel processing inside the computation of each track is added. Computing the Kalman filter at the LHCb experiment presents several dependencies between its different steps, which need to be processed one after the other. The classic prediction and update phases can be parallelized using a pipeline pattern thanks to the continuous input of particle tracks. At a higher level, different parts of the track are processed differently, allowing for an extra level of parallelization with another pipeline composed of tasks that each process a section of the track. At the highest level of the Kalman filter used at the LHCb experiment, tracks are fully processed first in one direction, then in the opposite direction, and finally these measurements are combined in a “smoother” phase. This processing reduces the error of the Kalman filter at the beginning and end of the track and produces an overall more accurate track. Parallel implementations of all these levels of parallelization are studied, from the finest-grained tasks to the coarsest-grained ones.

Software used at CERN has a long life cycle during which different programmers, scientists and users interact with, update and change the applications over the years. This software is often very data intensive and needs to process its tasks efficiently. Given these requirements, both computing performance and maintainability become key aspects of software used at CERN. Generic parallel patterns can meet these two conditions: they offer a simple programming interface with which to implement algorithms, and they can achieve a high degree of efficiency through optimized parallel backends that abstract the programmer from the complexity of parallelism. The Generic Reusable Parallel Pattern Interface (GRPPI) implements both data-parallel and streaming patterns while targeting different parallel backends. The Kalman filter application is adapted to be expressed in terms of generic parallel patterns using this library, with the Intel Xeon Phi KNL as the target. The library is extended to be thread and NUMA aware, allowing a more efficient use of the target processor. The hwloc library allows GRPPI to be topology aware, pinning threads and distributing the load according to the NUMA nodes available in the processor.

Two implementations with two streaming patterns are presented: a pipeline and a farm. The farm implementation achieves performance comparable to the baseline hand-tuned application, while delivering source code that is more maintainable and improves programmability. Both implementations are expressed as streaming patterns, which is closer to the real-world scenario of the real-time track processing that LHCb will implement for the first stage of the filter. The intra-track levels of parallelization of the Kalman filter contain tasks that are too small to compensate for the parallelization overhead, but they open the possibility for future LHCb upgrades in which track size and the amount of information may increase, making that parallelization feasible. Finally, the Intel Xeon Phi KNL is found to be very optimization and workload dependent, and the usage of vector units and SMT threads remains critical to achieve high throughput on this processor.
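As an illustration of how such a streaming composition can be expressed with GRPPI, the sketch below builds a pipeline whose middle stage is a farm. The stage bodies are placeholders rather than the actual LHCb Kalman filter code, and the header path and exact signatures may differ between GRPPI versions.

    #include <iostream>
    #include <optional>
    #include <vector>

    #include "grppi/grppi.h"  // assumed umbrella header; may differ by version

    struct Track { std::vector<double> hits; double chi2 = 0.0; };

    int main() {
        grppi::parallel_execution_native ex;  // native-threads backend
        int produced = 0;

        grppi::pipeline(ex,
            // Generator stage: emit tracks until the stream ends (empty optional).
            [&]() -> std::optional<Track> {
                if (produced++ >= 1000) return {};
                return Track{{1.0, 2.0, 3.0}};
            },
            // Farm stage: process each track independently on a pool of workers.
            grppi::farm(4, [](Track t) {
                for (double h : t.hits) t.chi2 += h;  // placeholder for the Kalman fit
                return t;
            }),
            // Sink stage: consume the fitted tracks.
            [](Track t) { std::cout << "chi2 = " << t.chi2 << "\n"; }
        );
    }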
Graphics Processing Units (GPUs) are a good fit for high-energy physics workloads, where particle collisions can be computed independently and the new particles generated in these collisions offer further opportunities for parallel processing. This high degree of parallelism can be exploited efficiently by GPUs, as this architecture combines lower clock speeds with thousands of cores. Difficulties in implementing the algorithms for the first stage of the filter were encountered with the baseline CPU framework, which motivated the creation of a framework for this first stage developed to be efficient on GPU architectures. This framework, named Allen, was developed during the course of this thesis and was designed to target many-core architectures following data-oriented design principles. The framework can run various streams in parallel, running algorithms asynchronously, and implements a custom memory manager to make memory accesses efficient and to reduce the memory footprint of the application, accommodating it on GPUs, which have less memory available than CPUs.

The algorithms needed to process the second tracking subdetector, the Upstream Tracker (UT), are implemented to be efficient on GPUs. The UT reconstruction receives as input the already reconstructed tracks from the previous subdetector, the VELO, and the raw data from the UT itself, which needs to be decoded. The UT algorithms have to match the tracks to the activation signals (hits) left by the charged particles that cross the subdetector. This subdetector is under the influence of the magnetic field, which bends the trajectories of the tracks proportionally to their momentum. This creates a combinatorial problem where a single VELO track could be matched to many different combinations of hits due to the curvature of the track. The best candidate found among the possible options is selected for the match.

The first implemented step is to decode the raw input from the detector into UT hits with useful information. This process is parallelized by selecting different pieces of the UT raw input and processing them independently, in parallel. When decoding the information, only the strictly necessary data is extracted to reduce the memory footprint of the algorithm. To store the hits, an efficient memory layout and access pattern is used: structures of arrays are found to be efficient when hits are accessed by multiple GPU threads in parallel. The decoded hits are stored into sector groups in which hits are only partially sorted. Each group of hits is guaranteed to lie between two X coordinates, but the hits within those coordinates are not sorted by X. Each group is then sorted by Y coordinate, avoiding a strict sorting, which would be too costly to compute. This storage scheme makes saving the hits fast, and makes searching for them fast by using two binary searches.

The second step of the algorithm is the search for windows of hits that delimit the area to be searched. The algorithm is configurable in both the number of sectors that can be searched and the number of hits that will be used to find the best candidate, which allows a balance between physics efficiency and computing performance. Only two pointers are stored to indicate each window, instead of saving all the hits it contains, and these are stored using 16-bit data types to save memory.
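The following simplified, serial C++ sketch illustrates this storage and lookup idea; the field names, the per-hit sector bound array and the toy data are illustrative assumptions and do not reproduce the actual Allen data structures.

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Structure-of-arrays hit container (illustrative): each quantity lives in
    // its own array so that consecutive threads read consecutive memory.
    struct UTHitsSoA {
        std::vector<float> x;         // hit X coordinates
        std::vector<float> y;         // hit Y coordinates (sorted within a group)
        std::vector<float> sector_x;  // lower X bound of each hit's sector group
    };

    // A search window is just two 16-bit indices into the hit arrays.
    struct Window { std::uint16_t first; std::uint16_t last; };

    // Hits are grouped by sector, with groups ordered by their X bound; two
    // binary searches recover a candidate window without a full sort by X.
    Window find_window(const UTHitsSoA& hits, float xmin, float xmax) {
        auto lo = std::lower_bound(hits.sector_x.begin(), hits.sector_x.end(), xmin);
        auto hi = std::upper_bound(lo, hits.sector_x.end(), xmax);
        return Window{static_cast<std::uint16_t>(lo - hits.sector_x.begin()),
                      static_cast<std::uint16_t>(hi - hits.sector_x.begin())};
    }

    int main() {
        UTHitsSoA hits;
        hits.sector_x = {0.f, 0.f, 10.f, 10.f, 20.f};  // three toy sector groups
        hits.x = {1.f, 2.f, 11.f, 12.f, 21.f};
        hits.y = {5.f, 6.f, 7.f, 8.f, 9.f};
        Window w = find_window(hits, 5.f, 15.f);
        std::printf("window covers hits [%u, %u)\n", (unsigned)w.first, (unsigned)w.last);
    }

Keeping each quantity in its own array lets parallel threads read contiguous memory, and storing only the two window indices, in 16-bit types, keeps the per-track bookkeeping small.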
Shared memory on the GPU is used to filter out invalid tracks early, increasing the thread utilization of the GPU. The third and last step is to find tracklets, formed by hits from different UT panels, that make up a piece of track which can be matched to the VELO track. The combinatorial problem is reduced in complexity from O(n⁴) to O(n³), and the overall branching of the algorithm is reduced to make it more efficient for GPU processing. The resulting algorithm delivers high physics efficiency in some of its configuration options while delivering high processing throughput, up to 7 times more than before. An analysis of price-performance and a comparison against the baseline CPU implementation show the more efficient use of the GPU for track processing.

The design principles applied to the Allen framework and its algorithms can be applied to other parallel architectures. In particular, CPUs can achieve a high degree of parallelization by leveraging their vector processing units: good parallel patterns, access patterns and data layouts are key ingredients to achieve good efficiency with both GPUs and the vector units of CPUs. Implementing algorithms that make good use of vectorization usually involves intrinsics, intermediate libraries or reliance on auto-vectorization, techniques that trade off between poor performance gains and high development and maintenance costs. The Intel Implicit SPMD Program Compiler (ISPC) offers a C language variant that is similar to programming for GPUs while allowing the programmer to be explicit about which loops should be vectorized. Among the ISPC concepts introduced are gangs, groups of program instances that execute together in the vector lanes, and the uniform and varying keywords, which indicate whether a value is the same for all vector lanes or different for each lane. This distinction, combined with the foreach() control construct, allows the programmer to be explicit about vectorization in a way that resembles programming with other SPMD languages such as OpenCL or CUDA.

The Allen framework is adapted to accommodate this compiler and to produce a sequence of algorithms that combines the CUDA compiler nvcc, the gcc compiler and the ISPC compiler. All the GPU memory management in Allen is adapted to perform memory allocations and accesses on the CPU, so that the CPU is viewed as a device by the algorithms and changes to their behaviour are minimized. The implemented kernels are compiled and linked separately and are identified by their own namespace. Data types are explicitly indicated for the ISPC compiler, casting ambiguous types and ISPC-specific structs. All the shared-memory usage implemented for GPUs is removed and replaced by uniform vectors of memory. The kernels corresponding to the VELO and UT subdetectors are implemented in the ISPC language, where all the parts written in C++, such as templates, class constructors and methods, need to be replaced by equivalent constructs in the C language. All the synchronization related to CUDA blocks is removed, as the loops vectorized with ISPC run in the vector lanes in lockstep. Functions are duplicated to accommodate different combinations of the uniform and varying keywords, depending on whether they are called from vectorized or non-vectorized loops.
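As a rough analogy, expressed in standard C++ rather than actual ISPC, the uniform/varying distinction and the resulting duplication of functions can be pictured as follows; the gang size of 8 and all names are assumptions for illustration only.

    #include <array>
    #include <cstdio>

    // C++ analogy of ISPC's uniform/varying distinction (not ISPC code):
    // a "uniform" value is one scalar shared by all lanes, while a "varying"
    // value holds one element per vector lane of the gang.
    constexpr int kLanes = 8;                   // assumed gang size / vector width
    using Varying = std::array<float, kLanes>;  // one value per lane
    using Uniform = float;                      // same value for every lane

    // "Varying" version: the lockstep loop over lanes plays the role of
    // ISPC's foreach over the vector lanes.
    Varying scale(const Varying& x, Uniform factor) {
        Varying out{};
        for (int lane = 0; lane < kLanes; ++lane) out[lane] = x[lane] * factor;
        return out;
    }

    // "Uniform" version: the same function duplicated for scalar callers,
    // mirroring how functions are duplicated depending on whether they are
    // called from vectorized or non-vectorized loops.
    Uniform scale(Uniform x, Uniform factor) { return x * factor; }

    int main() {
        Varying hits{1.f, 2.f, 3.f, 4.f, 5.f, 6.f, 7.f, 8.f};
        Varying scaled = scale(hits, 0.5f);
        std::printf("lane 0: %.1f, scalar: %.1f\n", scaled[0], scale(3.0f, 0.5f));
    }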
Corner cases that would raise arithmetic exceptions in the vector lanes are handled individually to avoid such exceptions, and half-precision types receive special handling, as arithmetic on these types is not implemented in the ISPC compiler. The resulting implementation delivers more than twice the speed of the non-vectorized implementation. Versions compiled for the SSE4, AVX1 and AVX2 ISAs are produced, with the AVX2 version achieving the highest throughput. A mix of vectorized and non-vectorized kernels is used, as some kernels implement algorithms that, because of their sequential nature, do not run efficiently on parallel architectures; these kernels achieve higher throughput with a non-vectorized implementation, so the combination of best-performing kernels is used.

This thesis explored opportunities in high-performance computing for real-time computing in high-energy physics. Multi-threading, accelerators and vectorization approaches are explored and discussed through new algorithms and modifications to existing ones, achieving improvements in both computing performance and maintainability. These improvements contributed to reaching the LHCb experiment's goal of 40 Tbit per second for the next data-taking period. The Allen framework has been selected as the baseline to process the first stage of the event filter in the coming years. The LHCb experiment selected a heterogeneous computing solution that combines various architectures and prepares it for future data-taking periods, which are estimated to reach a data rate of 400 Tbit per second. The Allen framework can accommodate CPU, CUDA, HIP, CPU SPMD and other algorithms for different parallel architectures.