Balancing performance and reliability in software components

  1. López Gómez, Javier
Supervised by:
  1. José Daniel García Sánchez (Supervisor)
  2. Javier Fernández Muñoz (Co-supervisor)

Defended at: Universidad Carlos III de Madrid

Date of defense: 8 October 2020

Committee:
  1. Jesús Carretero Pérez (Chair)
  2. Katzalin Olcoz (Secretary)
  3. Massimo Torquati (Member)

Type: Doctoral thesis

Abstract

Over the last few years, software complexity has grown exponentially. This growth has been partially driven by the use of software in control systems, ranging from data analysis in scientific applications to governing real-time systems such as medical devices or particle accelerators. In general, the cost of detecting and mitigating a defect increases non-linearly as it progresses through the lifecycle of a project: it can be up to thirty times higher when the defect is detected in the production environment rather than in the early stages of development, and it may cause substantial economic damage or, in life-critical systems, the loss of human life. Consequently, it is desirable to detect software defects as soon as possible in order to minimize the associated costs. This is especially important in the context of parallel applications, as they are considerably more complex to implement and may introduce new defects compared to a sequential version.

As observed, developing a parallel application is typically not a straightforward task. Many industrial and scientific applications start as a small prototype that is iteratively improved until it reaches full functionality. Frequently, these prototypes are written as sequential programs, since that generally follows the logic of the designed algorithms. The use of a sequential prototype has a twofold purpose: it reduces the effort required to implement the application, and it ensures that the defects introduced at this stage are not related to early optimization and/or parallelization. Therefore, using a prototype during the early stages of development is common practice. To that end, there is a growing variety of interpreted languages (e.g. Python or MATLAB) that aid in this task. These languages provide a set of features that ease the development of prototypes and that are not usually found in programming languages tailored for production environments. Nevertheless, most interpreted languages are not appropriate for production use due to their high execution overhead. Thus, in many cases, the only way forward to achieve the required performance is to rewrite the prototype in a more production-oriented language. However, this process not only has an economic impact, but may also be the origin of new defects. In general, two approaches may be leveraged to avoid this refactoring: (i) reusing the aforementioned prototype directly in the production environment, or (ii) developing the prototype in the same language that will be used in production, which generally slows down development because these languages lack many features included in interpreted languages. The problem can be better addressed by using a single language for both the prototype and the production version. In this sense, there are several related works regarding the use of C++, a well-known production language, as an interpreted language. However, state-of-the-art C++ interpreters still have some limitations; for example, they do not allow symbol redefinition.

Once the fully functional sequential version has been developed, performance issues are taken into consideration. Oftentimes, the first step is to leverage all the computing units in a single node. As previously noted, developing a parallel application for a shared-memory architecture introduces its own set of defects (e.g. data races or deadlocks) that are especially difficult to spot and debug.
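As a minimal illustration of this class of defect (a synthetic example, not taken from the thesis), the sketch below contains a textbook data race: two threads update a shared counter without any synchronization, so updates may be lost and the behavior is undefined. This is exactly the kind of access pattern that race detectors such as ThreadSanitizer, discussed later, are designed to flag.

    // Two threads perform an unsynchronized read-modify-write on `counter`,
    // producing a data race; the final value is usually less than 200000.
    #include <iostream>
    #include <thread>

    int counter = 0;  // shared, unprotected state

    void work() {
        for (int i = 0; i < 100000; ++i)
            ++counter;  // races with the increment performed by the other thread
    }

    int main() {
        std::thread t1{work};
        std::thread t2{work};
        t1.join();
        t2.join();
        std::cout << counter << '\n';
    }

Compiling this program with -fsanitize=thread makes ThreadSanitizer report the conflicting accesses at run time.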
Therefore, an active research area is to contribute new tools that aid in this respect. In this regard, contract-based programming helps to improve software reliability by reducing the presence of defects. It enables the specification of the assumptions made by a software component, both from the provider and consumer perspectives. Such a specification may be used not only to validate a software module but also to automatically generate unit tests or to trigger additional compiler optimizations. In general, contract-based programming allows writing predicates that describe the semantics of a software component, while data race detectors are generally rigid with regard to their usability. Consequently, a trade-off between both techniques may be desirable.

As the performance requirements increase, the need for more powerful platforms becomes evident, to the point where a single shared-memory node is no longer sufficient, thus requiring the program to be ported to run on distributed memory across several nodes. To that end, the de facto interfaces used for distributed- and shared-memory platforms are, respectively, MPI and OpenMP. While these interfaces are not considered low-level, they do not expose a set of primitives that is sufficiently close to the problem domain. Therefore, tuning applications for high performance often demands knowledge of both the application and system domains. Consequently, higher-level abstractions that encapsulate algorithmic aspects and/or hide inter-process communication help to reduce this burden. To that end, pattern-based models are an alternative that simplifies the implementation and maintenance of portable parallel applications. In this regard, using parallel patterns may reduce the required implementation effort and the number of defects in distributed parallel software. However, most parallel frameworks that allow writing shared-memory applications are not as performant (or cannot be used at all) in distributed environments. Refactoring parallel code to run on a distributed architecture brings back the same issues that appeared when the prototype was ported to the production environment. In this sense, few parallel pattern-based models target both shared-memory and distributed environments.

In general, software reliability can be improved either by detecting existing defects or by avoiding the introduction of new ones. Both strategies also reduce the development effort incurred by debugging large applications. In short, to tackle the aforementioned issues, this thesis makes the following contributions.

First, we provide a methodology to relax the One Definition Rule (ODR) in C++ interpreters, thus allowing users to give a new definition for a declaration. While most implementations of the C++ programming language generate architecture-specific executable code, interpreted execution of C++ sources has its own use cases, as the Cling interpreter from CERN's ROOT project has shown. Nevertheless, some limitations derive from the ODR, which rules out multiple definitions of an entity within a single translation unit. The ODR ensures a uniform view of a given C++ entity across translation units, which helps when producing ABI-compatible binaries. Interpreting C++, however, presumes a single ever-growing translation unit, which defines away some of the ODR use cases. Therefore, it may well be desirable to relax the ODR and, consequently, to support the ability of developers to override any existing definition for a given declaration. To this end, this thesis presents a technique that allows the user to redefine functions, types, and variables in C++, in a similar way to other interpreted languages such as Python. To achieve this, top-level declarations are nested in inline namespaces and the translation unit lookup table is adjusted to invalidate previous definitions that would otherwise result in ambiguities. Formally, this technique refactors the code into an equivalent form that does not violate the ODR, as each definition is nested in a different namespace. Furthermore, any previous definition that has been shadowed is still accessible by means of its fully-qualified name. A prototype implementation of the presented technique has been integrated into the Cling C++ interpreter and the ROOT project's master branch, showing that our technique is feasible and usable.
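The following schematic sketch (not taken from the thesis; the namespace names are purely illustrative) shows the shape of the transformation described above. In practice, the rewriting and the lookup adjustment are performed internally by the interpreter, not by the user:

    // What the user types at the interpreter prompt:
    //   int    x = 1;        // first definition
    //   double x = 3.14;     // redefinition: ill-formed under the ODR
    //
    // Conceptual equivalent produced by the technique: each top-level
    // declaration is wrapped in its own inline namespace, so the translation
    // unit never holds two conflicting definitions of ::x.
    inline namespace __def_1 { int    x = 1;    }
    inline namespace __def_2 { double x = 3.14; }
    // The interpreter's lookup table is adjusted so that an unqualified use of
    // `x` resolves to the newest definition instead of being ambiguous; the
    // shadowed definition remains reachable via its fully-qualified name,
    // __def_1::x.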
Second, we provide a framework that leverages C++ contracts to aid in the verification of concurrent shared-memory data structures. This contribution is built atop a custom prototype implementation of contract-based programming for the C++ language, aligned with the latest proposals from the C++ standards committee. As previously observed, multi-/many-core processors have become widely used in HPC platforms, and parallel programming models have evolved to allow applications to exploit their computational resources. Parallel frameworks are designed to provide abstraction layers to applications in the form of libraries that avoid the direct use of low-level concurrency mechanisms. To this extent, the use of building blocks implementing core functionalities has been the de facto engineering methodology in many of those frameworks. Such reusable blocks should be designed to ensure correctness and thread-safety in order to produce correct global results. However, parallel programming may entail unexpected concurrency errors, e.g. data races or deadlocks. Spotting concurrency problems is known to be a complex task, given that these errors may only occur in low-probability event orderings and may depend on external factors, e.g. machine load. These facts make data races extremely sensitive to timing, scheduling policies, compiler options, memory models, etc. Although data race detectors aid during application debugging, they may still be improved. In particular, some race detectors fail to properly track the use of lock-free data structures and generate false positives. This adds noise to the generated report and may hide harmful races, making the debugging process even harder when tracing back the root cause of the problem. Furthermore, current race detectors are not able to spot violations of the semantics of lock-free data structures. In this line, the semantics of the Single-Producer/Single-Consumer (SPSC) lock-free queue have already been embedded into the ThreadSanitizer race detector; however, the required rules were hard-coded in the detector, preventing more general use. This thesis contributes CSV, a framework that enables the detection of access-semantics violations of lock-free data structures; it is built on top of a modified ThreadSanitizer race detector and leverages C++ contracts as a carrier of user-specified concurrency constraints.
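For reference, the following is a minimal sketch of attribute-style contracts in the spirit of the C++ standards committee proposals that the prototype follows (e.g. P0542). The exact syntax differs between proposal revisions, the code only compiles with a contracts-enabled compiler, and CSV's concurrency-specific annotations are not shown:

    // Pre- and postconditions state the assumptions of provider and consumer:
    // the caller must not push into a full queue; after a push the queue is
    // not empty. A contracts-enabled compiler can check these predicates at
    // run time.
    #include <cstddef>

    class bounded_queue {
    public:
        bool empty() const;
        bool full()  const;
        std::size_t size() const;

        void push(int value)
            [[expects: !full()]]
            [[ensures: !empty()]];
    };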
The improvement in detection accuracy over the unmodified ThreadSanitizer is evaluated in terms of false positives and false negatives spotted on several synthetic benchmarks that make use of the SPSC and MPMC lock-free queue structures from the Boost C++ libraries. Note that we assume the data structure implementation to be correct; by using the proposed framework, we are able to detect misuses of its interface, thus reducing the number of false positives. We conclude that, while data race detectors aid in identifying races in parallel applications, few of them are aware of the semantics of lock-free data structures; thus, they may emit false-positive warnings that could hide real data races. Although this framework has been implemented within the ThreadSanitizer detector, it would be possible to port it to other state-of-the-art data race detectors, such as CDSChecker or RCMC. We also demonstrated the benefits of CSV with respect to the stand-alone ThreadSanitizer race detector and compared the detection accuracy using the Boost C++ SPSC and MPMC lock-free queues as use cases. Essentially, CSV provides two main features: (i) filtering misleading data race reports (false positives); and (ii) detecting violations of the semantics of lock-free data structures. Assuming that the implementation of the data structure is correct, CSV replaces ThreadSanitizer's low-level reports with more meaningful traces that include the violated condition. Moreover, given the high-level syntax used by CSV, it may be used to state concurrency constraints on parallel code, as long as those constraints can be formalized.
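As a concrete (synthetic) example of the kind of misuse CSV targets, the sketch below, assuming the Boost.Lockfree spsc_queue interface, breaks the single-producer contract by pushing from two threads. A conventional race detector may miss this or emit unrelated low-level reports, whereas a semantics-aware tool can report the violated access constraint directly:

    #include <boost/lockfree/spsc_queue.hpp>
    #include <thread>

    // Single-Producer/Single-Consumer queue: push() must be called from at
    // most one thread, and pop() from at most one (other) thread.
    boost::lockfree::spsc_queue<int, boost::lockfree::capacity<1024>> queue;

    void producer() {
        for (int i = 0; i < 1000; ++i)
            while (!queue.push(i)) { /* retry until there is room */ }
    }

    int main() {
        std::thread p1{producer};
        std::thread p2{producer};  // BUG: a second producer violates SPSC semantics
        std::thread consumer{[] {  // single consumer (simplified: may exit early)
            int v;
            while (queue.pop(v)) { /* consume v */ }
        }};
        p1.join(); p2.join(); consumer.join();
    }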
Third, we extend a generic parallel pattern interface to enable porting a shared-memory parallel application to distributed memory with minimal effort. Scientific experiments, ranging from particle accelerators to sensor networks, currently generate large volumes of streaming data that sometimes have to be processed in near real time. Many developers rely on C/C++ for performance reasons; however, several challenges have been identified in the high-performance computing (HPC) domain with regard to handling the high-throughput and low-latency requirements of data stream processing (DaSP) applications. These challenges include the development of new, scalable and efficient streaming algorithms, programming models, languages and runtime systems for this type of application, and they require considerable investment in the DaSP software stack to accommodate the increasing volume of streaming data. In general, the computing resource demands of these applications imply that they will execute on multiple shared-memory multi-core nodes. As observed, the de facto programming models for distributed- and shared-memory platforms are, respectively, the MPI and OpenMP interfaces, which can be used in conjunction to enable hybrid parallelism. Regardless of their efficiency, both expose low-level interfaces that demand considerable expertise in the application and system domains to fine-tune the target application. An attempt to provide higher-level abstractions is made by pattern-based programming models, which encapsulate algorithmic aspects following a "building blocks" approach. Parallel patterns are an alternative for implementing robust, readable and portable solutions, hiding the internal complexity of the underlying concurrency mechanisms, synchronization or data sharing. In this respect, some pattern-based interfaces and libraries from the state of the art (e.g. FastFlow, Muesli or SkePU 2) already support clusters of multi-core machines and make use of scheduling algorithms to improve load balancing among nodes. Unfortunately, few of these frameworks can target both shared-memory and distributed-memory environments.

To tackle this issue, GrPPI, an existing C++ parallel-pattern library, has been extended with a new MPI back-end that enables the execution of some streaming patterns on distributed platforms. The GrPPI layer exposes a unified interface for parallel patterns on top of existing pattern-based frameworks or scheduling primitives, e.g. threads. This allows users to write applications that are independent of the parallel programming framework used underneath, resulting in portable and readable source code. The MPI back-end introduced in this thesis supports the distributed and hybrid execution of the pipeline and farm patterns, two stream-oriented patterns that execute user-defined operators in series (pipeline) or in parallel (farm). In this case, operators only have data-item dependencies with the previous operators. Hybrid scenarios are supported using an intra-node shared-memory execution policy that, if needed, is used to run multiple pipeline or farm operators inside an MPI process using OpenMP, C++ threads, or Intel TBB. In general, the design of GrPPI contributes to improving the flexibility and portability of distributed applications, while exploiting both inter- and intra-node parallelism. As part of this contribution, we have provided two MPI-based MPMC queue designs, based on two-sided and one-sided MPI primitives, respectively, which are used as communication channels between stream operators. As demonstrated in the evaluation, our use cases achieve considerable speedups compared with the corresponding sequential version. We have also shown that the proposed communication mechanisms attain the same performance over an Ethernet interconnection network, while for high-throughput, low-latency networks that support RDMA, the one-sided mode outperforms the send/receive communication mode. In any case, it is always important to balance pipeline stages according to their workloads in order to better exploit the available resources and avoid bottlenecks. In general, some DaSP applications may benefit from the proposed GrPPI back-end and obtain performance results comparable to other frameworks. We also showed that leveraging parallel pattern libraries, in particular GrPPI, considerably reduces the number of required lines of code and the cyclomatic complexity compared to an MPI-only implementation. Additionally, based on a qualitative comparison of the two interfaces (i.e., GrPPI and stand-alone MPI), we conclude that GrPPI leads to more readable source code and thus improves its general maintainability. The implementation of a distributed back-end by means of MPI has been mainly motivated by the scaling needs of new DaSP scientific applications. Furthermore, given that MPI is widely used in today's supercomputers but lacks direct support for stream processing, the presented contribution can greatly aid in developing stream applications in C++.
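To illustrate the send/receive flavor of such a communication channel, the following minimal sketch (not the thesis' actual MPMC queue, which also has a one-sided RDMA-based variant) streams items from a producer rank to a consumer rank:

    #include <mpi.h>
    #include <iostream>

    // Rank 0 acts as the producer stage, rank 1 as the consumer stage; a
    // negative item is used here as an end-of-stream marker.
    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                                   // producer
            for (int item = 0; item < 10; ++item)
                MPI_Send(&item, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            int stop = -1;
            MPI_Send(&stop, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {                            // consumer
            for (;;) {
                int item = 0;
                MPI_Recv(&item, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (item < 0) break;
                std::cout << "consumed " << item << '\n';
            }
        }
        MPI_Finalize();
    }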
To conclude, we have defined a set of techniques that contribute to increasing software reliability by reducing the number of software defects. As mentioned, such a reduction may be attained either by employing tools that ease defect detection or by partially avoiding new defects (e.g. by providing programmers with high-level abstractions). All in all, this thesis makes the following contributions: (i) a methodology to relax the ODR in C++ interpreters, allowing users to provide a new definition for an already-defined symbol while not interfering with C++ overload resolution, i.e., correctly handling function or template overloads where applicable; (ii) a prototype implementation of the current C++ contracts proposal built into the Clang compiler, which enables design-by-contract in C++ and makes it possible to build other contributions on top of it; (iii) a framework that aids in the verification of concurrent data structures and that, leveraging C++ contracts and the ThreadSanitizer race detector, allows users to specify valid function call orderings by means of the happens-before temporal relation on shared-memory architectures; and (iv) a GrPPI back-end for distributed DaSP application execution, which makes it possible to port a shared-memory parallel application to run on distributed memory using MPI. Thus, single-node multithreaded applications that use the GrPPI interface can scale over several multi-core nodes by modifying a single line of code.
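For illustration, the sketch below shows a GrPPI streaming pipeline loosely following the library's published examples (the header path, optional type and exact signatures may vary between GrPPI versions, and the distributed execution policy referred to in the final comment is the thesis' contribution and is deliberately not named here):

    #include <grppi/grppi.h>
    #include <iostream>
    #include <optional>

    int main() {
        grppi::parallel_execution_native ex;  // intra-node, shared-memory policy

        int n = 0;
        grppi::pipeline(ex,
            // Generator stage: emit 0..99, then signal end of stream.
            [&n]() -> std::optional<int> {
                return n < 100 ? std::optional<int>{n++} : std::nullopt;
            },
            // Farm stage: apply the transformation in parallel with 4 workers.
            grppi::farm(4, [](int x) { return x * x; }),
            // Sink stage: consume the results.
            [](int x) { std::cout << x << '\n'; });

        // Per the thesis, porting this pipeline to distributed memory amounts to
        // swapping the execution policy declared above for the MPI back-end's
        // policy, leaving the pattern composition unchanged.
    }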