Efficient Resource Management for Data Centers: The ACTiCLOUD Approach [SAMOS XVIII]

Vasileios Karakostas, Georgios Goumas, Ewnetu Bayuh Lakew, Erik Elmroth, Stefanos Gerangelos, Simon Kolberg, Konstantinos Nikas, Stratos Psomadakis, Dimitrios Siakavaras, Petter Svärd, Nectarios Koziris

Despite their proliferation as a dominant computing paradigm, cloud computing systems lack effective mechanisms to manage their vast resources efficiently. Resources are stranded and fragmented, limiting cloud applicability only to classes of applications that pose moderate resource demands. In addition, the need for reduced cost through consolidation introduces performance interference, as multiple VMs are co-located on the same nodes. To avoid such issues, current providers follow a rather conservative approach regarding resource management that leads to significant underutilization. ACTiCLOUD is a three-year Horizon 2020 project that aims at creating a novel cloud architecture that breaks existing scale-up and share-nothing barriers and enables the holistic management of physical resources, at both local and distributed cloud site levels. This extended abstract provides a brief overview of the resource management part of ACTiCLOUD, focusing on the design principles and the components

Published at:
IEEE International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation

Finding the Pitfalls in Query Performance [DBTest'18]

M.L. Kersten, P. Koutsourakis, Y. Zhang

Despite their popularity, database benchmarks only highlight a small part of the capabilities of any given system. They do not necessarily highlight problematic components encountered in real life or provide hints for further research and engineering. In this paper we introduce discriminative performance benchmarking, which aids in exploring a larger search space to find performance outliers and their underlying cause. The approach is based on deriving a domain specific language from a sample query to identify a query workload. SQLscalpel subsequently explores the space using query morphing, and simulated annealing to find performance outliers, and the query components responsible. To speed-up the exploration for often time-consuming experiments SQLscalpel has been designed to run asynchronously on a large cluster of machines.

Published at:
Workshop on Testing Database Systems

On the future of research VMs: a hardware/software perspective [Programming'18]

Foivos S. Zakkak, Andy Nisbet, John Mawer, Tim Hartley, Nikos Foutris, Orion Papadakis, Andreas Andronikakis, Iain Apreotesei, Christos Kotselidis

In the recent years, we have witnessed an explosion of the usages of Virtual Machines (VMs) which are currently found in desktops, smartphones, and cloud deployments. These recent developments create new research opportunities in the VM domain extending from performance to energy efficiency, and scalability studies. Research into these directions necessitates research frameworks for VMs that provide full coverage of the execution domains and hardware platforms. Unfortunately, the state of the art on Research VMs does not live up to such expectations and lacks behind industrial-strength software, making it hard for the research community to provide valuable insights. This paper presents our work in attempting to tackle those shortcomings by introducing Beehive, our vision towards a modular and seamlessly extensible ecosystem for research on virtual machines. Beehive unifies a number of existing state-of-the-art tools and components with novel ones providing a complete platform for hardware/software co-design of Virtual Machines.

Published at:
Conference Companion of the 2nd International Conference on Art, Science, and Engineering of Programming

Type Information Elimination from Objects on Architectures with Tagged Pointers Support [IEEE Transactions on Computers ( Volume: 67 , Issue: 1 , Jan. 1 2018 )]

Andrey Rodchenko, Christos Kotselidis, Andy Nisbet, Antoniu Pop, Mikel Luján

mplementations of object-oriented programming languages associate type information with each object to perform various runtime tasks such as dynamic dispatch, type introspection, and reflection. A common means of storing such relation is by inserting a pointer to the associated type information into every object. Such an approach, however, introduces memory and performance overheads when compared with non-object-oriented languages. Recent 64-bit computer architectures have added support for tagged pointers by ignoring a number of bits - tag - of memory addresses during memory access operations and utilize them for other purposes; mainly security. This paper presents the first investigation into how this hardware support can be exploited by a Java Virtual Machine to remove type information from objects. Moreover, we propose novel hardware extensions to the address generation and load-store units to achieve low-overhead type information retrieval and tagged object pointers compression-decompression. The evaluation has been conducted after integrating the Maxine VM and the ZSim microarchitectural simulator. The results, across all the DaCapo benchmark suite, pseudo-SPECjbb2005, SLAMBench and GraphChi-PR executed to completion, show up to 26 and 10 percent geometric mean heap space savings, up to 50 and 12 percent geometric mean dynamic DRAM energy reduction, and up to 49 and 3 percent geometric mean execution time reduction with no significant performance regressions.

Published at:
IEEE Transactions on Computers ( Volume: 67 , Issue: 1 , Jan. 1 2018 )

Cross-ISA debugging in meta-circular VMs [VMIL'17]

Christos Kotselidis, Andy Nisbet, Foivos S. Zakkak, Nikos Foutris

Extending current Virtual Machine implementations to new Instruction Set Architectures entails a significant programming and debugging effort. Meta-circular VMs add another level of complexity towards this aim since they have to compile themselves with the same compiler that is being extended. Therefore, having low-level debugging tools is of vital importance in decreasing development time and bugs introduced. In this paper we describe our experiences in extending Maxine VM to the ARMv7 architecture. During that process, we developed a QEMU-based toolchain which enables us to debug a wide range of VM features in an automated way. The presented toolchain has been integrated with the JUNIT testing framework of Maxine VM and is capable of executing from simple assembly instructions to fully JIT compiled code. Furthermore, it is fully open-sourced and can be adapted to any other VMs seamlessly. Finally, we describe a compiler-assisted methodology that helps us identify, at runtime, faulty methods that generate no stack traces, in an automatic and fast manner.

Published at:
9th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages

Experiences with Building Domain-Specific Compilation Plugins in Graal [ManLang'17]

Colin Barrett, Christos Kotselidis, Foivos S. Zakkak, Nikos Foutris, Mikel Luján

In this paper, we describe our experiences in co-designing a domain-specific compilation stack. Our motivation stems from the missed optimization opportunities we observed while implementing a computer vision library in Java. To tackle the performance shortcomings, we developed Indigo, a computer vision API co-designed with a compilation plugin for optimizing computer vision applications. Indigo exploits the extensible nature of the Graal compiler which provides invocation plugins, that replace methods with dedicated nodes, and generates machine code compatible with both the Java Virtual Machine (JVM) and the SIMD hardware unit. Our approach improves performance by up to 66.75× when compared to pure Java implementations and by up to 2.75× when compared to the original C++ implementation. These performance improvements are the result of low-level concurrency, idiomatic implementation of algorithms, and by keeping temporary objects in the wider vector unit registers.

Published at:
14th International Conference on Managed Languages and Runtimes

RCU-HTM: Combining RCU with HTM to Implement Highly Efficient Concurrent Binary Search Trees [PACT'17]

Dimitrios Siakavaras, Konstantinos Nikas, Georgios Goumas, Nectarios Koziris

In this paper we introduce RCU-HTM, a technique that combines Read-Copy-Update (RCU) with Hardware Transactional Memory (HTM) to implement highly efficient concurrent Binary Search Trees (BSTs). Similarly to RCU-based algorithms, we perform the modifications of the tree structure in private copies of the affected parts of the tree rather than in-place. This allows threads that traverse the tree to proceed without any synchronization and without being affected by concurrent modifications. The novelty of RCU-HTM lies at leveraging HTM to permit multiple updating threads to execute concurrently. After appropriately modifying the private copy, we execute an HTM transaction, which atomically validates that all the affected parts of the tree have remained unchanged since they've been read and, only if this validation is successful, installs the copy in the tree structure.We apply RCU-HTM on AVL and Red-Black balanced BSTs and compare theirperformance to state-of-the-art lock-based, non-blocking, RCU- and HTM-basedBSTs. Our experimental evaluation reveals that BSTs implemented with RCU-HTMachieve high performance, not only for read-only operations, but also for update operations. More specifically, our evaluation includes a diverse range of tree sizes and operation workloads and reveals that BSTs based on RCU-HTM outperform other alternatives by more than 18%, on average, on a multi-core server with 44 hardware threads.

Published at:
26th International Conference on Parallel Architectures and Compilation Techniques

ACTiCLOUD: Enabling the Next Generation of Cloud Applications [ICDCS'17]

Georgios I. Goumas, Konstantinos Nikas, Ewnetu Bayuh Lakew, Christos Kotselidis, Andrew Attwood, Erik Elmroth, Michail Flouris, Nikos Foutris, John Goodacre, Davide Grohmann, Vasileios Karakostas, Panagiotis Koutsourakis, Martin L. Kersten, Mikel Luján, Einar Rustad, John Thomson, Luis Tomás, Atle Vesterkjaer, Jim Webber, Ying Zhang, Nectarios Koziris

Despite their proliferation as a dominant computing paradigm, cloud computing systems lack effective mechanisms to manage their vast amounts of resources efficiently. Resources are stranded and fragmented, ultimately limiting cloud systems' applicability to large classes of critical applications that pose non-moderate resource demands. Eliminating current technological barriers of actual fluidity and scalability of cloud resources is essential to strengthen cloud computing's role as a critical cornerstone for the digital economy. ACTiCLOUD proposes a novel cloud architecture that breaks the existing scale-up and share-nothing barriers and enables the holistic management of physical resources both at the local cloud site and at distributed levels. Specifically, it makes advancements in the cloud resource management stacks by extending state-of-the-art hypervisor technology beyond the physical server boundary and localized cloud management system to provide a holistic resource management within a rack, within a site, and across distributed cloud sites. On top of this, ACTiCLOUD will adapt and optimize system libraries and runtimes (e.g., JVM) as well as ACTiCLOUD-native applications, which are extremely demanding, and critical classes of applications that currently face severe difficulties in matching their resource requirements to state-of-the-art cloud offerings.

Published at:
37th IEEE International Conference on Distributed Computing Systems

MaxSim: A simulation platform for managed applications [ISPASS'17]

Andrey Rodchenko, Christos Kotselidis, Andy Nisbet, Antoniu Pop, Mikel Luján

Managed applications, written in programming languages such as Java, C# and others, represent a significant share of workloads in the mobile, desktop, and server domains. Microarchitectural timing simulation of such workloads is useful for characterization and performance analysis, of both hardware and software, as well as for research and development of novel hardware extensions. This paper introduces MaxSim, a simulation platform based on the Maxine VM, the ZSim simulator, and the McPAT modeling framework. MaxSim is able to simulate fast and accurately managed workloads running on top of Maxine VM and its capabilities are showcased with novel simulation techniques for: 1) low-intrusive microarchitectural profiling via pointer tagging on the x86-64 platforms, 2) modeling of hardware extensions related, but not limited to, tagged pointers, and 3) modeling of complex software changes via address-space morphing. Low-intrusive microarchitectural profiling is achieved by utilizing tagged pointers to collect type- and allocation-site-related hardware events. Furthermore, MaxSim allows, through a novel technique called address space morphing, the easy modeling of complex object layout transformations. Finally, through the codesigned capabilities of MaxSim, novel hardware extensions can be implemented and evaluated. We showcase MaxSim's capabilities by simulating the whole set of the DaCapo-9.12-bach benchmarks in less than a day while performing an up-to-date microarchitectural power and performance characterization. Furthermore, we demonstrate a hardware/software co-designed optimization that performs dynamic load elimination for array length retrieval achieving up to 14% L1 data cache loads reduction and up to 4% dynamic energy reduction. MaxSim is available at released as free software.

Published at:
IEEE International Symposium on Performance Analysis of Systems and Software

Heterogeneous Managed Runtime Systems: A Computer Vision Case Study [VEE'17]

Christos Kotselidis, James Clarkson, Andrey Rodchenko, Andy Nisbet, John Mawer, Mikel Luján

Real-time 3D space understanding is becoming prevalent across a wide range of applications and hardware platforms. To meet the desired Quality of Service (QoS), computer vision applications tend to be heavily parallelized and exploit any available hardware accelerators. Current approaches to achieving real-time computer vision, evolve around programming languages typically associated with High Performance Computing along with binding extensions for OpenCL or CUDA execution. Such implementations, although high performing, lack portability across the wide range of diverse hardware resources and accelerators. In this paper, we showcase how a complex computer vision application can be implemented within a managed runtime system. We discuss the complexities of achieving high-performing and portable execution across embedded and desktop configurations. Furthermore, we demonstrate that it is possible to achieve the QoS target of over 30 frames per second (FPS) by exploiting FPGA and GPGPU acceleration transparently through the managed runtime system.

Published at:
13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments

Improving QoS and Utilisation in modern multi-core servers with Dynamic Cache Partitioning [COSH'17]

Ioannis Papadakis,Konstantinos Nikas, Vasileios Karakostas, Georgios Goumas, Nectarios Koziris

Co-execution of multiple workloads in modern multi-core servers may create severe performance degradation and unpredictable execution behavior, impacting significantly their Quality of Service (QoS) levels. To safeguard the QoS levels of high priority workloads, current resource allocation policies are quite conservative, disallowing their co-execution with low priority ones, creating a wasteful tradeoff between QoS and aggregated system throughput. In this paper we utilise the cache monitoring and allocation facilities provided by modern processors and implement a dynamic cache partitioning scheme, where high-priority workloads are monitored and allocated the amount of shared cache that they actually need. This way, we are able to simultaneously maintain their QoS very close to the levels of full cache allocation and boost the system’s throughput by allocating the surplus cache space to co-executing, low-priority applications.

Published at:
Workshop on Co-Scheduling of HPC Applications