Novel high performance techniques for high definition computer aided tomography

  1. Serrano López, Estefania
Supervised by:
  1. Jesús Carretero Pérez (Supervisor)
  2. Francisco Javier García Blas (Co-supervisor)

Defending university: Universidad Carlos III de Madrid

Date of defense: 30 November 2018

Committee:
  1. Jose Daniel Garcia Sanchez (Chair)
  2. Katzalin Olcoz (Secretary)
  3. Domenico Talia (Member)

Type: Thesis

Teseo: 572077

Abstract

The field of medical image processing spans several disciplines, from medicine and electrical engineering to computer science. In recent years, new methods and techniques for the acquisition of medical images have been developed, and the growing use of digital equipment and detectors in all imaging modalities has widened the opportunities for introducing digital image processing techniques at different stages. In addition, the innovations achieved through new image scanners and acquisition geometries demand new protocols and advances in reconstruction algorithms. The main objective of these advances is to obtain better images, in terms of quality, at faster rates.

This also applies to Computed Tomography (CT), an imaging modality based on X-ray acquisition that has been used for decades to obtain tomographic images of objects, animals, or patients. It is widely employed in non-destructive testing to recover a digital model of the inspected object, but its best-known application is Computed Axial Tomography (CAT) in health care. In this modality, X-ray images are acquired from the scanned object or patient from different positions, traditionally from different angles around the object of interest. These X-ray radiographs, also known as projections, are then digitally processed, or reconstructed, to recover the final tomographic image or volume.

In recent years, new research lines related to the creation and design of new algorithms and acquisition protocols have arisen. However, designing new acquisition protocols and configurations requires the ability to change the acquisition parameters flexibly. Although the positions of most X-ray scanners can be configured to meet different requirements, acquiring the images can be a long process unsuitable for research purposes, and exact positioning of these machines implies an additional calibration step. The advantages of emulating the operation of a flexible X-ray scanner are therefore clear, since there are no physical limitations on positioning or acquisition times. The development of new flexible X-ray systems benefits from computer simulation, an environment that allows performance to be checked before expensive real systems are built. Along with the new acquisition protocols, new reconstruction algorithms must be applied to obtain the final tomographic image.

Nevertheless, developing a platform for simulating the acquisition of X-ray images and reconstructing them poses several difficulties. First, in terms of emulating the physical effects, a fully realistic simulation of X-ray acquisition cannot be done efficiently; since the critical point for developing new configurations and acquisition protocols is the geometrical positioning of the elements, a simplified model is required. Second, digital detectors have grown in size and density in recent years, increasing the size of the data sets to be processed. Fortunately, memory and computational capabilities have also increased, and new hardware architectures have made it possible to obtain images in near real time. The final problem concerns the advanced reconstruction algorithms that have appeared recently: designed to work with limited-data configurations, they are also computationally more expensive because of their extensive use of iterative methods.
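
As an illustration of such a simplified geometric model (a toy sketch, not FUX-Sim's actual code; the function name and the nearest-neighbour ray sampling are illustrative choices), a ray-driven forward projector needs only the source and detector positions as free parameters, so arbitrary orbits can be emulated without any physical scanner:

```python
# Toy 2D ray-driven forward projection with fully configurable source and
# detector positions. Nearest-neighbour sampling; points outside the image
# are clipped to its border (zero in this example).
import numpy as np

def forward_project(image, src, det_center, det_dir, n_pixels, pixel_size,
                    n_samples=256):
    """Approximate line integrals from point source `src` to each detector pixel."""
    ny, nx = image.shape
    offsets = (np.arange(n_pixels) - (n_pixels - 1) / 2.0) * pixel_size
    pixels = det_center + offsets[:, None] * det_dir   # detector pixel centers
    ts = np.linspace(0.0, 1.0, n_samples)
    proj = np.zeros(n_pixels)
    for i, p in enumerate(pixels):
        pts = src + ts[:, None] * (p - src)            # samples along the ray
        ix = np.clip(pts[:, 0].astype(int), 0, nx - 1)
        iy = np.clip(pts[:, 1].astype(int), 0, ny - 1)
        proj[i] = image[iy, ix].sum() * np.linalg.norm(p - src) / n_samples
    return proj

# Example: one projection of a 128x128 phantom from an arbitrary angle.
phantom = np.zeros((128, 128)); phantom[48:80, 48:80] = 1.0
a = np.deg2rad(30.0)
src = np.array([64 + 200 * np.cos(a), 64 + 200 * np.sin(a)])
det_c = np.array([64 - 150 * np.cos(a), 64 - 150 * np.sin(a)])
det_d = np.array([-np.sin(a), np.cos(a)])                # detector direction
p = forward_project(phantom, src, det_c, det_d, n_pixels=192, pixel_size=1.0)
```
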
For these reasons, it is necessary to explore the development of these algorithms on non-traditional computing platforms, with programming models and paradigms that can provide the computational resources needed to obtain these images in less time than conventional systems. This thesis was developed to provide solutions to these problems. Its main objective is the development of new reconstruction techniques for situations where the data acquired from the scanner is not suitable for traditional reconstruction methods, on an optimized and flexible platform that meaningfully reduces the time needed to obtain the final reconstructed image.

Simulation and reconstruction tools for CT already exist in many frameworks and implementations, but all of them lack capabilities that would make them a good solution for diverse situations. Most focus on one type of architecture, supporting GPUs and multi-core CPUs through a single programming model, which limits portability and leads to compatibility issues. Those that do support different architectures either do not address problems such as the reconstruction of non-standard acquisition geometries, or focus on one specific algorithm rather than the easy construction of new simulation or reconstruction methods. Additionally, most of these applications are designed for shared-memory architectures, which are limited in terms of performance and memory, while distributed-memory implementations in this area have not demonstrated high scalability and performance, being built on platforms not optimized for distributed heterogeneous architectures.

This thesis covers the development of new reconstruction and simulation techniques in different environments, aiming at maximum portability of our algorithms and maximum performance from the chosen hardware. For this purpose, we cover different programming models and paradigms that fill the gaps identified in the review of the literature.

First, we present FUX-Sim [1], a new, highly flexible X-ray simulation/reconstruction framework that enables fully configurable positioning of the source and detector elements. The implementation is optimized for two families of GPUs (CUDA and OpenCL) and for multi-core CPUs, using a modularized approach based on a layered architecture and parallel implementations of the algorithms on both kinds of device. Consequently, FUX-Sim can be executed on most current hardware platforms, since OpenCL is supported by AMD and NVidia GPUs and by Intel and ARM processors, while CUDA is the most widely used programming model for GPUs. The modular architecture also facilitates maintenance and adaptation to current and future programming models. The execution times we measured were faster than other state-of-the-art implementations for different system configurations and hardware platforms: compared to ASTRA [2], a similar C++ framework, reconstruction is up to 25% faster. FUX-Sim can prove valuable for research on new configurations for X-ray systems with non-standard scanning orbits, new acquisition protocols, and advanced reconstruction algorithms. Moreover, the proposed framework makes it possible to obtain tomographic images from very few projections, enabling easy and inexpensive assessment before implementation in real systems.
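
To illustrate the layered, modular organization (class and method names here are hypothetical, not FUX-Sim's actual API), the front layer can expose projection and backprojection operations while interchangeable device layers plug in underneath; a minimal sketch:

```python
# Sketch of a layered backend dispatch: one abstract operator interface,
# one concrete layer per device family, and a selector that falls back
# to the CPU when no accelerator runtime is present.
from abc import ABC, abstractmethod

class ProjectorBackend(ABC):
    @abstractmethod
    def project(self, volume, geometry): ...
    @abstractmethod
    def backproject(self, projections, geometry): ...

class CudaBackend(ProjectorBackend):
    def project(self, volume, geometry):
        raise NotImplementedError("launch the CUDA projection kernel here")
    def backproject(self, projections, geometry):
        raise NotImplementedError("launch the CUDA backprojection kernel here")

class CpuBackend(ProjectorBackend):
    def project(self, volume, geometry):
        raise NotImplementedError("multi-threaded CPU projection here")
    def backproject(self, projections, geometry):
        raise NotImplementedError("multi-threaded CPU backprojection here")

def select_backend(preferred=("cuda", "opencl", "cpu")):
    """Pick the first available backend; an OpenCL layer would register alike."""
    registry = {"cuda": CudaBackend, "cpu": CpuBackend}
    for name in preferred:
        if name in registry:
            return registry[name]()
    return CpuBackend()
```
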
The main contributions of this framework are: first, FUX-Sim itself, a novel X-ray simulation/reconstruction framework designed to be flexible and fast; and second, an optimized implementation of the projection and backprojection algorithms for different families of GPUs (CUDA and OpenCL) and for multi-core CPUs.

The development of advanced reconstruction algorithms based on iterative methods is an essential feature of any reconstruction platform. From this initial design, FUX-Sim is extended to support iterative reconstruction algorithms, but first further hardware-specific optimizations are needed. The main components of the simulator are the backprojection and projection kernels, and the remainder of the work focuses on reducing their execution time. Since the CUDA kernels of both components performed better in general, they are the ones further optimized. With this objective, we present a full characterization of the two medical image processing kernels with an extended GPU Cache-Aware Roofline Model (CARM) [3][4]. These kernels, heavily used in modern reconstruction algorithms, have characteristics that make them suitable for an extensive characterization: both have been implemented in many versions and on many architectures, including GPUs, and many optimizations, such as texture mechanisms or cache optimizations, can be applied to increase their performance. With respect to our baseline implementation, we obtained a speedup of 2× for the projection kernel and 1.25× for the backprojection kernel.

The main contributions of this part of the work are: first, the extension of the roofs for constructing GPU CARMs, considering textures and special functions executed in GPU hardware; second, a study of the influence of parallelism on the GPU CARM in terms of block sizes, together with a study of the influence of the most commonly used CUDA compilation flags, fastmath and maxreg; and finally, new optimized backprojection and projection kernels for CUDA-compatible GPUs, evaluated on different GPU architectures and performing up to 2× faster than previous versions. Although some benchmark applications had already been characterized with the GPU CARM, in this thesis we characterize a real application and show how the characterization assists the optimization process in obtaining better execution times. We propose multiple versions of these kernels, with different optimization levels based on the data obtained from the NVidia profiler and the constructed CARM, and we compare them with the CARM obtained from the CPU versions of these kernels.

As demonstrated in this thesis, the most time-consuming components have been improved in terms of performance. Once optimized, it is feasible to incorporate them into an iterative reconstruction solution, which is mainly composed of repeated backprojection and projection stages. With these new optimized projection and backprojection kernels, we extend FUX-Sim to support 3D reconstruction of limited-data tomography with iterative reconstruction algorithms, solving the problem of iterative reconstruction efficiently by using GPUs for the most time-consuming operations.
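
To make the characterization concrete, the following is a minimal sketch of the roofline reasoning behind a CARM: attainable performance is bounded by the minimum of peak compute and arithmetic intensity times bandwidth, with one ceiling per level of the memory hierarchy (the cache-aware part). All peak and bandwidth numbers below are placeholders, not measurements from the thesis.

```python
# Roofline/CARM-style bound: a kernel is limited either by peak compute or
# by the bandwidth of the memory level actually serving its data.

def attainable_gflops(flops_per_elem, bytes_per_elem, peak_gflops, bw_gbs):
    ai = flops_per_elem / bytes_per_elem        # arithmetic intensity (FLOP/byte)
    return min(peak_gflops, ai * bw_gbs), ai

PEAK = 4000.0                                   # hypothetical peak GFLOP/s
CEILINGS = {"L1/texture": 2000.0, "L2": 750.0, "DRAM": 250.0}  # GB/s, hypothetical

# Example: a backprojection-like kernel doing ~10 FLOPs per 4-byte fetch.
for level, bw in CEILINGS.items():
    bound, ai = attainable_gflops(10, 4, PEAK, bw)
    print(f"{level}: AI = {ai:.2f} FLOP/B -> bound = {bound:.0f} GFLOP/s")
```
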
The main contributions of this extended FUX-Sim framework are: first, the design of an iterative reconstruction framework for Total Variation (TV) based iterative reconstruction methods with GPU acceleration; second, an evaluation of the performance of different TV-based iterative reconstruction methods; and finally, a matrix-free BiCGStab implementation compatible with GPU callback functions. We offer a solution that is 48× faster than CPU-only solutions and capable of reconstructing standard-resolution datasets in reasonable time. Nevertheless, when reconstructing high-resolution, large studies (greater than 1024³ voxels), the large memory requirements for holding all the necessary data prevent execution on standard machines.

Therefore, for large studies, a distributed-memory implementation that can take advantage of the larger resources present in clusters and clouds can solve the problem. To tackle it, we translated this type of iterative reconstruction method to distributed environments: from the extended iterative reconstruction framework, a new module for CT reconstruction in a distributed environment was added. For distributing the application over this new architecture we used PETSc [5], a mathematical library on top of MPI [6] that already implements some of the algorithms needed for our method. We created new partitioning strategies for the main output dataset to provide scalability and to distribute the computing load between nodes. In the evaluation, we observed that our implementation scales for the main components (the backprojection and projection operators) and can execute in parallel with negligible differences in the final result. However, when distributing over several nodes, the specific characteristics of the projection operation, together with the increasing cost of the projection reduction, impact the overall performance negatively and yield poor speedup.

The main contributions of this distributed reconstruction framework are: first, the adaptation of an iterative reconstruction algorithm to a distributed environment using the PETSc library; and second, a study of the scalability of the backprojection and projection kernels in an HPC environment. One of the main conclusions of this part of the work is that the reduction of the projection data must be optimized: in the distributed version, this phase represents up to 33% of the total time, reducing the possibility of scaling to high-resolution images. With the distributed platform presented here, the current solution approach unbalances the load on each node, preventing the algorithm from scaling better. MPI has no native mechanisms for load balancing, and managing irregularly partitioned data is difficult because MPI assumes that all hardware is similar and that the load will be balanced by the programmer. Given these restrictions and the requirements of the domain, a different kind of paradigm, closer to the data and to the programmer, needs to be studied. Such a platform must provide easier management of irregular data as well as a resource manager capable of distributing the partitioned tasks among the resources. To finalize this thesis, we present a novel approach based on Python and Apache Spark [7] for implementing the backprojection and projection components of an iterative reconstruction method for cone-beam geometry.
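
The matrix-free idea can be sketched as follows: BiCGStab never needs the system matrix explicitly, only callbacks that apply the projection and backprojection operators, which is exactly where GPU kernels plug in. The sketch below uses SciPy's solver with 1-D convolutions as toy stand-ins for the thesis' GPU-backed operators:

```python
# Matrix-free regularized normal equations (A^T A + lam*I) x = A^T b,
# solved by BiCGStab purely through operator callbacks.
import numpy as np
from scipy.sparse.linalg import LinearOperator, bicgstab

nvox = 1024                        # flattened toy "volume"
kernel = np.ones(5) / 5.0

def project(x):                    # stand-in for the GPU projection callback
    return np.convolve(x, kernel, mode="same")

def backproject(y):                # stand-in for the GPU backprojection callback
    return np.convolve(y, kernel, mode="same")

lam = 0.05                         # small Tikhonov term keeps the system well-posed
AtA = LinearOperator((nvox, nvox),
                     matvec=lambda x: backproject(project(x)) + lam * x)

x_true = np.zeros(nvox); x_true[400:600] = 1.0   # toy object
b = project(x_true)                              # synthetic measurements
x, info = bicgstab(AtA, backproject(b), maxiter=200)
print("converged" if info == 0 else f"info = {info}")
```
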
The solution offers two alternatives for different architectures: a GPU-based architecture supporting NVidia GPUs, and a multi-core CPU architecture relying on multi-core acceleration and compatibility with native C/C++ code. Although the performance results show that linear speedup is not reached, our approach is an adequate alternative for porting previous HPC applications already implemented in C/C++, or even with the CUDA or OpenCL programming models. Furthermore, it can automatically detect the GPU devices and execute CPU and GPU tasks at the same time under the same system, using all the available resources. This solution is based on Apache Spark, but it is applicable to other Big Data frameworks that support Python, enabling quick updates for new requirements. Moreover, since the alternatives presented in this thesis are built from separate components, they can be generalized to other types of architectures or acceleration devices; in the heterogeneous approach, OpenCL could also be used for compatibility with other accelerators, or even for better exploitation of the CPU.

The main contributions of this novel approach are: first, the efficient execution of accelerated iterative reconstruction algorithms in Big Data frameworks, based on offloading the most computationally expensive components; second, a complete view of a GPU-based architecture for the Apache Spark framework and its evaluation using two main medical imaging algorithms; third, a study of the partitioning problem in the projection and backprojection components in terms of performance; and finally, an evaluation of the proposed architecture on both CPU- and GPU-based clusters, combining HPC programming models and Big Data frameworks.

Among the approaches presented in this thesis, a different solution can be chosen depending on the availability of computing resources, the user's experience, and the type and size of the input data. Their advantages and disadvantages depend on the user's priority. If performance is the priority, the versions with GPU support (FUX-Sim, the extended FUX-Sim framework, and the distributed Big Data version for PySpark) provide much higher performance than the CPU-based versions; in all of them, accelerated execution can be up to 48× faster than non-accelerated execution. However, performance degrades severely on low-memory systems and GPUs when reconstructing high-resolution studies. When support for high-resolution images is needed, memory is the limiting factor for both reconstruction and simulation; such studies require distributed resources, especially on accelerator-based systems, where GPUs normally come with lower memory capacities. When performance is not a priority and distributed systems are not available, the standard FUX-Sim platform executing on an accelerated node can be used. Finally, if ease of programming is what matters, we must consider that, except for the Big Data option, all the alternatives presented in this thesis are built around native programming languages: extending the platform requires some knowledge of the C language, expertise that domain experts in medical imaging do not normally have.
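
A minimal PySpark sketch of the pattern described here, under stated assumptions: projections are partitioned across executors, each partition is backprojected locally, and the partial volumes are combined with a tree reduction. `backproject_partition` is a hypothetical stand-in for the wrapper around the native C/CUDA kernel, with a toy CPU body:

```python
# Partition-and-reduce backprojection on Spark. The per-partition function
# below is a placeholder; in the real system it would call into native or
# GPU code (e.g. via ctypes or a CUDA binding) on each executor.
import numpy as np
from pyspark.sql import SparkSession

VOL_SHAPE = (64, 64, 64)

def backproject_partition(items):
    partial = np.zeros(VOL_SHAPE, dtype=np.float32)
    for angle, proj in items:
        # Toy accumulation; a real kernel smears each projection through
        # the volume along its acquisition geometry.
        partial += np.float32(proj.sum() / proj.size)
    yield partial

spark = SparkSession.builder.appName("bp-sketch").getOrCreate()
sc = spark.sparkContext

# Synthetic (angle, projection) pairs, split into 8 partitions/tasks.
projections = [(a, np.ones((64, 64), dtype=np.float32)) for a in range(180)]
rdd = sc.parallelize(projections, numSlices=8)
volume = rdd.mapPartitions(backproject_partition).treeReduce(np.add)
```
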
For those cases, the Python-based solution for Apache Spark can be a good option for fast prototyping, with the additional advantage of increased performance due to the extended GPU support. Finally, in this thesis we have studied the unification of traditional HPC techniques with modern Big Data methods. The solution presented here combines the performance offered by heterogeneous devices with the leading data management mechanisms offered by the MapReduce programming model. With the extended FUX-Sim platform, the main objective of this thesis was accomplished: a framework for the simulation and reconstruction of non-regular geometries. We also demonstrated that FUX-Sim provides higher performance than other state-of-the-art platforms, with the possibility of reconstructing high-resolution studies in near real time.

References

[1] M. Abella et al., "FUX-Sim: Implementation of a fast universal simulation/reconstruction framework for X-ray systems", PLOS ONE, vol. 12, no. 7, p. e0180363, Jul. 2017.
[2] W. van Aarle et al., "The ASTRA Toolbox: A platform for advanced algorithm development in electron tomography", Ultramicroscopy, vol. 157, pp. 35-47, Oct. 2015.
[3] A. Ilic, F. Pratas, and L. Sousa, "CARM: Cache-aware performance, power and energy-efficiency roofline modeling", in Compiler, Architecture and Tools Conference (Intel CATC), 2015.
[4] A. Lopes, F. Pratas, L. Sousa, and A. Ilic, "Exploring GPU performance, power and energy-efficiency bounds with Cache-aware Roofline Modeling", in 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2017, pp. 259-268.
[5] S. Balay et al., PETSc Web page, 2001.
[6] W. Gropp, R. Thakur, and E. Lusk, Using MPI-2: Advanced Features of the Message Passing Interface. MIT Press, 1999.
[7] M. Zaharia et al., "Apache Spark: a unified engine for big data processing", Commun. ACM, vol. 59, no. 11, pp. 56-65, 2016.