Profiling and Tracing Support for Java Applications [ICPE 2019]

Andrew Nisbet, Nuno Miguel Nobre, Graham Riley, Mikel Luján

We demonstrate the feasibility of undertaking performance evaluations for JVMs using: (1) a hybrid JVM/OS tool, such as async-profiler, (2) OS centric profiling and tracing tools based on Linux perf, and (3) the Extended Berkeley Packet Filter Tracing (eBPF) framework where we demonstrate the rationale behind the standard offwaketime tool, for analysing the causes of blocking latencies, and our own eBPF-based tool bcc-java, that relates changes in microarchitecture performance counter values to the execution of individual JVM and application threads at low overhead. The relative execution time overheads of the performance tools are illustrated for the DaCapo-bach-9.12 benchmarks with OpenJDK9 on an Intel Xeon E5-2690, running Ubuntu 16.04. Whereas sampling based tools can have up to 25% slowdown using 4 kHz frequency, our tool bcc-java has a geometric mean overhead of less than 5%. Only for the avrora benchmark, bcc-java has a significant overhead (37%) due to an unusually high number of futex system calls. Finally, we provide a discussion on the recommended approaches to solve specific performance use-case scenarios.

Published at:
10th ACM/SPEC International Conference on Performance Engineering
(ICPE 2019)

SQALPEL: A database performance platform [CIDR 2019]

M.L. Kersten, P. Koutsourakis, S. Manegold, Y. Zhang

Despite their popularity, database benchmarks only highlight a small fraction of the capabilities of any given DBMS. They often do not highlight problematic components encountered in real life database applications or provide hints for further research and engineering. To alleviate this problem we coined discriminative performance benchmarking as the way to go. It aids in exploring a larger query search space to find performance outliers and their underlying cause. The approach is based on deriving a domain specific language from a sample complex query to identify and execute a query workload. The demo illustrates sqalpel, a complete platform to collect, manage and selectively disseminate performance facts, that enables repeatability studies, and economy of scale by sharing performance experiences.

Published at:
9th biennial Conference on Innovative Data Systems Research
(CIDR 2019)

Database Resource Allocation Based on Resilient Intermediates [XtremeCLOUD 2018]

Martin Kersten, Ying Zhang, Pavlos Katsogridakis, Panagiotis Koutsourakis, Joeri van Ruth

Scale-out of big data analytics applications often does not pay off due to the poor performance in response time and the increasing bill due to a longer execution time on a resource limited machine. To enable a stable DBMS workload environment it helps to maintain several virtual machines with different resource configurations (CPU, memory, disk, etc.) hosting part of the database, so that users can send their tasks to those machines that have the best price/performance characteristics. This, however, requires a method to decide which VM should be used for a given query. When choosing the VM, the memory usage of a query is a particularly important factor, especially for the main-memory (optimised) DBMSs which are generally used for analytical queries today. In this paper, we introduce MALCOM, a memory footprint predictor for queries based on resilient intermediates in MonetDB. Unlike traditional cost-based approaches, MALCOM uses an empirical approach (i.e. using the memory usage information of queries executed in the past) to incrementally update its model to improve its predictions. Our preliminary experimental results show that this approach is robust against varying data distributions.

Published at:
1st International Workshop on Next Generation Clouds for Extreme Data
(XtremeCLOUD 2018)

Performance Prediction of NUMA Placement: A Machine-Learning Approach [XtremeCLOUD 2018]

Fanourios Arapidis, Vasileios Karakostas, Nikela Papadopoulou, Konstantinos Nikas, Georgios Goumas, Nectarios Koziris

In this paper we present a machine-learning approach to predict the impact on performance of core and memory placement in non-uniform memory access (NUMA) systems. The impact on performance depends on the architecture and the application’s characteristics. We focus our study on features that can be easily extracted with hardware performance counters that are found in commodity off-the-shelf systems. We run various single-threaded benchmarks from Spec2006 and Parsec under different placement scenarios, and we use this benchmarking data to train multiple regression models that could serve as performance predictors. Our experimental results show notable accuracy in predicting the impact on performance with relatively simple prediction models.

Published at:
1st International Workshop on Next Generation Clouds for Extreme Data
(XtremeCLOUD 2018)

Utility-based Allocation of Industrial IoT Applications in Mobile Edge Clouds [IPCCC2018]

Amardeep Mehta, Ewnetu Bayuh Lakew, Johan Tordsson, Erik Elmroth

Mobile Edge Clouds (MECs) create new opportunities and challenges in terms of scheduling and running applications that have a wide range of latency requirements, such as intelligent transportation systems, process automation, and smart grids. We propose a two-tier scheduler for allocating runtime resources to Industrial Internet of Things (IIoT) applications in MECs. The scheduler at the higher level runs periodically – monitors system state and the performance of applications – and decides whether to admit new applications and migrate existing applications. In contrast, the lower-level scheduler decides which application will get the runtime resource next. We use performance-based metrics that tell the extent to which the runtimes are meeting the Service Level Objectives (SLOs) of the hosted applications. The Application Happiness metric is based on a single application’s performance and SLOs. The Runtime Happiness metric is based on the Application Happiness of the applications the runtime is hosting. These metrics may be used for decision-making by the scheduler, rather than runtime utilization, for example. We evaluate four scheduling policies for the high-level scheduler and five for the low-level scheduler. The objective for the schedulers is to minimize cost while meeting the SLO of each application. The policies are evaluated with respect to the number of runtimes, the impact on the performance of applications and utilization of the runtimes. The results of our evaluation show that the high-level policy based on Runtime Happiness combined with the low-level policy based on Application Happiness outperforms other policies for the schedulers, including the bin packing and random strategies. In particular, our combined policy requires up to 30% fewer runtimes than the simple bin packing strategy and increases the runtime utilization up to 40% for the Edge Data Center (DC) in the scenarios we evaluated.

Published at:
37th IEEE International Performance Computing and Communications Conference

SmallTail: Scaling Cores and Probabilistic Cloning Requests for Web Systems [ICAC 2018]

E. B. Lakew, R. Birke, J. F. Perez, E. Elmroth, L. Y. Chen

Users' quality of experience on web systems is largely determined by the tail latency, e.g., the 95th percentile. Scaling resources along, e.g., the number of virtual cores per VM, is shown to be effective to meet the average latency but falls short in taming the latency tail in the cloud where the performance variability is higher. The prior art shows the prominence of increasing the request redundancy to curtail the latency either in the off-line setting or without scaling-in cores of virtual machines. In this paper, we propose an opportunistic scaler, termed SmallTail, which aims to achieve stringent targets of tail latency while provisioning a minimum amount of resources and keeping them well utilized. Against dynamic workloads, SmallTail simultaneously adjusts the core provisioning per VM and probabilistically replicates requests so as to achieve the tail latency target. The core of SmallTail is a two-level controller, where the outer loop controls the core provisioning of the distributed VMs and the inner loop controls the clones at a finer granularity. We also provide theoretical analysis on the steady-state latency for a given probabilistic replication that clones one out of N arriving requests. We extensively evaluate SmallTail on three different web systems, namely web commerce, web searching, and web bulletin board. Our testbed results show that SmallTail can keep the 95th-percentile latency below 1000 ms using up to 53% fewer cores compared to the strategy of constant cloning, whereas a scaling-core-only solution exceeds the latency target by up to 70%.
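
The probabilistic replication mentioned in the abstract, which clones one out of N arriving requests, can be illustrated with a minimal Python sketch. This is not SmallTail's controller. The function names, the cloning probability of 1/N, and the rule that the faster copy's response is the one the user sees are all assumptions made for illustration.

```python
import random
from typing import Optional


def should_clone(rng: random.Random, n: int) -> bool:
    """Clone a request with probability 1/n, i.e. one in n on average (assumed policy)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return rng.random() < 1.0 / n


def served_latency(primary_ms: float, clone_ms: Optional[float]) -> float:
    """Latency seen by the user: the faster copy wins when a clone exists (assumption)."""
    return primary_ms if clone_ms is None else min(primary_ms, clone_ms)


def clone_fraction(n: int, requests: int, seed: int = 0) -> float:
    """Fraction of requests that were cloned in a seeded run of `requests` arrivals."""
    rng = random.Random(seed)
    cloned = sum(should_clone(rng, n) for _ in range(requests))
    return cloned / requests
```

A larger N sends fewer duplicate requests and so costs fewer resources, while a smaller N gives more requests a second chance to avoid a slow server. The abstract describes the inner control loop as the part that controls the clones.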

Published at:
2018 IEEE International Conference on Autonomic Computing
(ICAC 2018)

Efficient Resource Management for Data Centers: The ACTiCLOUD Approach [SAMOS XVIII]

Vasileios Karakostas, Georgios Goumas, Ewnetu Bayuh Lakew, Erik Elmroth, Stefanos Gerangelos, Simon Kolberg, Konstantinos Nikas, Stratos Psomadakis, Dimitrios Siakavaras, Petter Svärd, Nectarios Koziris

Despite their proliferation as a dominant computing paradigm, cloud computing systems lack effective mechanisms to manage their vast resources efficiently. Resources are stranded and fragmented, limiting cloud applicability only to classes of applications that pose moderate resource demands. In addition, the need for reduced cost through consolidation introduces performance interference, as multiple VMs are co-located on the same nodes. To avoid such issues, current providers follow a rather conservative approach regarding resource management that leads to significant underutilization. ACTiCLOUD is a three-year Horizon 2020 project that aims at creating a novel cloud architecture that breaks existing scale-up and share-nothing barriers and enables the holistic management of physical resources, at both local and distributed cloud site levels. This extended abstract provides a brief overview of the resource management part of ACTiCLOUD, focusing on the design principles and the components.

Published at:
IEEE International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation

Finding the Pitfalls in Query Performance [DBTest'18]

M.L. Kersten, P. Koutsourakis, Y. Zhang

Despite their popularity, database benchmarks only highlight a small part of the capabilities of any given system. They do not necessarily highlight problematic components encountered in real life or provide hints for further research and engineering. In this paper we introduce discriminative performance benchmarking, which aids in exploring a larger search space to find performance outliers and their underlying cause. The approach is based on deriving a domain specific language from a sample query to identify a query workload. SQLscalpel subsequently explores the space using query morphing and simulated annealing to find performance outliers and the query components responsible. To speed up the exploration for often time-consuming experiments, SQLscalpel has been designed to run asynchronously on a large cluster of machines.
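
To make the combination of query morphing and simulated annealing concrete, here is a minimal Python sketch. It is not SQLscalpel's implementation. The representation of a query as a tuple of on/off components, the `morph` step that flips a single component, the cooling schedule, and the caller-supplied `runtime` function are all hypothetical choices for illustration.

```python
import math
import random
from typing import Callable, Tuple

# Assumed representation: which optional components of the sample query are kept.
Query = Tuple[bool, ...]


def morph(query: Query, rng: random.Random) -> Query:
    """Flip one component on or off (a simple stand-in for query morphing)."""
    i = rng.randrange(len(query))
    return query[:i] + (not query[i],) + query[i + 1:]


def find_outlier(start: Query, runtime: Callable[[Query], float],
                 steps: int = 500, temp: float = 1.0, cooling: float = 0.99,
                 seed: int = 0) -> Tuple[Query, float]:
    """Simulated annealing that searches for the slowest query variant."""
    rng = random.Random(seed)
    current, current_cost = start, runtime(start)
    best, best_cost = current, current_cost
    for _ in range(steps):
        candidate = morph(current, rng)
        cost = runtime(candidate)
        # Always accept a slower variant; accept a faster one with a
        # probability that shrinks as the temperature cools.
        if cost >= current_cost or rng.random() < math.exp((cost - current_cost) / temp):
            current, current_cost = candidate, cost
            if cost > best_cost:
                best, best_cost = candidate, cost
        temp *= cooling
    return best, best_cost
```

In a real setting `runtime` would execute the morphed query against a DBMS and return the measured time. The components that differ between the start query and the returned outlier are then candidates for the cause of the slowdown.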

Published at:
Workshop on Testing Database Systems

On the future of research VMs: a hardware/software perspective [Programming'18]

Foivos S. Zakkak, Andy Nisbet, John Mawer, Tim Hartley, Nikos Foutris, Orion Papadakis, Andreas Andronikakis, Iain Apreotesei, Christos Kotselidis

In recent years, we have witnessed an explosion in the use of Virtual Machines (VMs), which are currently found in desktops, smartphones, and cloud deployments. These recent developments create new research opportunities in the VM domain extending from performance to energy efficiency, and scalability studies. Research into these directions necessitates research frameworks for VMs that provide full coverage of the execution domains and hardware platforms. Unfortunately, the state of the art on Research VMs does not live up to such expectations and lags behind industrial-strength software, making it hard for the research community to provide valuable insights. This paper presents our work in attempting to tackle those shortcomings by introducing Beehive, our vision towards a modular and seamlessly extensible ecosystem for research on virtual machines. Beehive unifies a number of existing state-of-the-art tools and components with novel ones providing a complete platform for hardware/software co-design of Virtual Machines.

Published at:
Conference Companion of the 2nd International Conference on Art, Science, and Engineering of Programming

Type Information Elimination from Objects on Architectures with Tagged Pointers Support [IEEE Transactions on Computers (Volume: 67, Issue: 1, Jan. 1 2018)]

Andrey Rodchenko, Christos Kotselidis, Andy Nisbet, Antoniu Pop, Mikel Luján

Implementations of object-oriented programming languages associate type information with each object to perform various runtime tasks such as dynamic dispatch, type introspection, and reflection. A common means of storing such relation is by inserting a pointer to the associated type information into every object. Such an approach, however, introduces memory and performance overheads when compared with non-object-oriented languages. Recent 64-bit computer architectures have added support for tagged pointers by ignoring a number of bits (the tag) of memory addresses during memory access operations, so that these bits can be utilized for other purposes, mainly security. This paper presents the first investigation into how this hardware support can be exploited by a Java Virtual Machine to remove type information from objects. Moreover, we propose novel hardware extensions to the address generation and load-store units to achieve low-overhead type information retrieval and tagged object pointers compression-decompression. The evaluation has been conducted after integrating the Maxine VM and the ZSim microarchitectural simulator. The results, across all the DaCapo benchmark suite, pseudo-SPECjbb2005, SLAMBench and GraphChi-PR executed to completion, show up to 26 and 10 percent geometric mean heap space savings, up to 50 and 12 percent geometric mean dynamic DRAM energy reduction, and up to 49 and 3 percent geometric mean execution time reduction with no significant performance regressions.
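
The idea of carrying type information in the ignored bits of a pointer can be sketched with plain integer arithmetic. The following Python example only illustrates the bit layout. The 8-bit tag width, the placement of the tag in the topmost bits, and the function names are assumptions and are not taken from the paper.

```python
TAG_BITS = 8                       # assumed tag width (hypothetical)
ADDR_BITS = 64 - TAG_BITS          # remaining bits hold the real address
ADDR_MASK = (1 << ADDR_BITS) - 1


def tag_pointer(address: int, type_id: int) -> int:
    """Pack a type identifier into the upper bits of a 64-bit pointer."""
    if not 0 <= type_id < (1 << TAG_BITS):
        raise ValueError("type_id does not fit in the tag")
    if not 0 <= address <= ADDR_MASK:
        raise ValueError("address overlaps the tag bits")
    return (type_id << ADDR_BITS) | address


def type_of(pointer: int) -> int:
    """Recover the type identifier without touching the object itself."""
    return pointer >> ADDR_BITS


def address_of(pointer: int) -> int:
    """Strip the tag, as hardware that ignores the tag bits would on a memory access."""
    return pointer & ADDR_MASK
```

With this layout the type identifier travels with every reference, so the object no longer needs a header word pointing at its type information. Removing that per-object pointer is where the heap space savings reported in the abstract come from.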

Published at:
IEEE Transactions on Computers (Volume: 67, Issue: 1, Jan. 1 2018)

Cross-ISA debugging in meta-circular VMs [VMIL'17]

Christos Kotselidis, Andy Nisbet, Foivos S. Zakkak, Nikos Foutris

Extending current Virtual Machine implementations to new Instruction Set Architectures entails a significant programming and debugging effort. Meta-circular VMs add another level of complexity towards this aim since they have to compile themselves with the same compiler that is being extended. Therefore, having low-level debugging tools is of vital importance in decreasing development time and bugs introduced. In this paper we describe our experiences in extending Maxine VM to the ARMv7 architecture. During that process, we developed a QEMU-based toolchain which enables us to debug a wide range of VM features in an automated way. The presented toolchain has been integrated with the JUnit testing framework of Maxine VM and is capable of executing anything from simple assembly instructions to fully JIT compiled code. Furthermore, it is fully open-sourced and can be adapted to any other VMs seamlessly. Finally, we describe a compiler-assisted methodology that helps us identify, at runtime, faulty methods that generate no stack traces, in an automatic and fast manner.

Published at:
9th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages

Experiences with Building Domain-Specific Compilation Plugins in Graal [ManLang'17]

Colin Barrett, Christos Kotselidis, Foivos S. Zakkak, Nikos Foutris, Mikel Luján

In this paper, we describe our experiences in co-designing a domain-specific compilation stack. Our motivation stems from the missed optimization opportunities we observed while implementing a computer vision library in Java. To tackle the performance shortcomings, we developed Indigo, a computer vision API co-designed with a compilation plugin for optimizing computer vision applications. Indigo exploits the extensible nature of the Graal compiler which provides invocation plugins, that replace methods with dedicated nodes, and generates machine code compatible with both the Java Virtual Machine (JVM) and the SIMD hardware unit. Our approach improves performance by up to 66.75× when compared to pure Java implementations and by up to 2.75× when compared to the original C++ implementation. These performance improvements are the result of low-level concurrency, idiomatic implementation of algorithms, and keeping temporary objects in the wider vector unit registers.

Published at:
14th International Conference on Managed Languages and Runtimes

RCU-HTM: Combining RCU with HTM to Implement Highly Efficient Concurrent Binary Search Trees [PACT'17]

Dimitrios Siakavaras, Konstantinos Nikas, Georgios Goumas, Nectarios Koziris

In this paper we introduce RCU-HTM, a technique that combines Read-Copy-Update (RCU) with Hardware Transactional Memory (HTM) to implement highly efficient concurrent Binary Search Trees (BSTs). Similarly to RCU-based algorithms, we perform the modifications of the tree structure in private copies of the affected parts of the tree rather than in-place. This allows threads that traverse the tree to proceed without any synchronization and without being affected by concurrent modifications. The novelty of RCU-HTM lies in leveraging HTM to permit multiple updating threads to execute concurrently. After appropriately modifying the private copy, we execute an HTM transaction, which atomically validates that all the affected parts of the tree have remained unchanged since they've been read and, only if this validation is successful, installs the copy in the tree structure. We apply RCU-HTM on AVL and Red-Black balanced BSTs and compare their performance to state-of-the-art lock-based, non-blocking, RCU- and HTM-based BSTs. Our experimental evaluation reveals that BSTs implemented with RCU-HTM achieve high performance, not only for read-only operations, but also for update operations. More specifically, our evaluation includes a diverse range of tree sizes and operation workloads and reveals that BSTs based on RCU-HTM outperform other alternatives by more than 18%, on average, on a multi-core server with 44 hardware threads.
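
The copy, validate and install steps described in the abstract can be sketched in Python under strong simplifications. This is not the paper's algorithm. A `threading.Lock` stands in for the HTM transaction, the tree is unbalanced rather than AVL or Red-Black, and the private copy covers the whole search path from the root so that validation reduces to a single check on the root. All class and function names are hypothetical.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Immutable node: published nodes are never modified in place."""
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def copy_with_insert(node: Optional[Node], key: int) -> Node:
    """Build a private copy of the search path with `key` added."""
    if node is None:
        return Node(key)
    if key == node.key:
        return node
    if key < node.key:
        return Node(node.key, copy_with_insert(node.left, key), node.right)
    return Node(node.key, node.left, copy_with_insert(node.right, key))


class CopyValidateInstallTree:
    def __init__(self) -> None:
        self.root: Optional[Node] = None
        self._txn = threading.Lock()  # stand-in for an HTM transaction

    def contains(self, key: int) -> bool:
        """Readers traverse a snapshot without any synchronization."""
        node = self.root
        while node is not None:
            if key == node.key:
                return True
            node = node.left if key < node.key else node.right
        return False

    def insert(self, key: int) -> None:
        while True:
            seen = self.root                         # read
            candidate = copy_with_insert(seen, key)  # modify a private copy
            with self._txn:                          # "transaction" begins
                if self.root is seen:                # validate: unchanged since read
                    self.root = candidate            # install the copy
                    return
            # Validation failed: another updater installed first, so retry.
```

Because the sketch validates against the root, any two concurrent updates conflict and one of them retries. The abstract instead describes validating only the affected parts of the tree, which is what lets updaters working in different parts of the tree proceed concurrently.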

Published at:
26th International Conference on Parallel Architectures and Compilation Techniques

ACTiCLOUD: Enabling the Next Generation of Cloud Applications [ICDCS'17]

Georgios I. Goumas, Konstantinos Nikas, Ewnetu Bayuh Lakew, Christos Kotselidis, Andrew Attwood, Erik Elmroth, Michail Flouris, Nikos Foutris, John Goodacre, Davide Grohmann, Vasileios Karakostas, Panagiotis Koutsourakis, Martin L. Kersten, Mikel Luján, Einar Rustad, John Thomson, Luis Tomás, Atle Vesterkjaer, Jim Webber, Ying Zhang, Nectarios Koziris

Despite their proliferation as a dominant computing paradigm, cloud computing systems lack effective mechanisms to manage their vast amounts of resources efficiently. Resources are stranded and fragmented, ultimately limiting cloud systems' applicability to large classes of critical applications that pose non-moderate resource demands. Eliminating current technological barriers of actual fluidity and scalability of cloud resources is essential to strengthen cloud computing's role as a critical cornerstone for the digital economy. ACTiCLOUD proposes a novel cloud architecture that breaks the existing scale-up and share-nothing barriers and enables the holistic management of physical resources both at the local cloud site and at distributed levels. Specifically, it makes advancements in the cloud resource management stacks by extending state-of-the-art hypervisor technology beyond the physical server boundary and localized cloud management system to provide a holistic resource management within a rack, within a site, and across distributed cloud sites. On top of this, ACTiCLOUD will adapt and optimize system libraries and runtimes (e.g., JVM) as well as ACTiCLOUD-native applications, which are extremely demanding, and critical classes of applications that currently face severe difficulties in matching their resource requirements to state-of-the-art cloud offerings.

Published at:
37th IEEE International Conference on Distributed Computing Systems

MaxSim: A simulation platform for managed applications [ISPASS'17]

Andrey Rodchenko, Christos Kotselidis, Andy Nisbet, Antoniu Pop, Mikel Luján

Managed applications, written in programming languages such as Java, C# and others, represent a significant share of workloads in the mobile, desktop, and server domains. Microarchitectural timing simulation of such workloads is useful for characterization and performance analysis, of both hardware and software, as well as for research and development of novel hardware extensions. This paper introduces MaxSim, a simulation platform based on the Maxine VM, the ZSim simulator, and the McPAT modeling framework. MaxSim is able to simulate managed workloads running on top of Maxine VM quickly and accurately, and its capabilities are showcased with novel simulation techniques for: 1) low-intrusive microarchitectural profiling via pointer tagging on the x86-64 platforms, 2) modeling of hardware extensions related, but not limited to, tagged pointers, and 3) modeling of complex software changes via address-space morphing. Low-intrusive microarchitectural profiling is achieved by utilizing tagged pointers to collect type- and allocation-site-related hardware events. Furthermore, MaxSim allows, through a novel technique called address space morphing, the easy modeling of complex object layout transformations. Finally, through the codesigned capabilities of MaxSim, novel hardware extensions can be implemented and evaluated. We showcase MaxSim's capabilities by simulating the whole set of the DaCapo-9.12-bach benchmarks in less than a day while performing an up-to-date microarchitectural power and performance characterization. Furthermore, we demonstrate a hardware/software co-designed optimization that performs dynamic load elimination for array length retrieval achieving up to 14% L1 data cache loads reduction and up to 4% dynamic energy reduction. MaxSim is released as free software.

Published at:
IEEE International Symposium on Performance Analysis of Systems and Software

Heterogeneous Managed Runtime Systems: A Computer Vision Case Study [VEE'17]

Christos Kotselidis, James Clarkson, Andrey Rodchenko, Andy Nisbet, John Mawer, Mikel Luján

Real-time 3D space understanding is becoming prevalent across a wide range of applications and hardware platforms. To meet the desired Quality of Service (QoS), computer vision applications tend to be heavily parallelized and exploit any available hardware accelerators. Current approaches to achieving real-time computer vision revolve around programming languages typically associated with High Performance Computing along with binding extensions for OpenCL or CUDA execution. Such implementations, although high performing, lack portability across the wide range of diverse hardware resources and accelerators. In this paper, we showcase how a complex computer vision application can be implemented within a managed runtime system. We discuss the complexities of achieving high-performing and portable execution across embedded and desktop configurations. Furthermore, we demonstrate that it is possible to achieve the QoS target of over 30 frames per second (FPS) by exploiting FPGA and GPGPU acceleration transparently through the managed runtime system.

Published at:
13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments

Improving QoS and Utilisation in modern multi-core servers with Dynamic Cache Partitioning [COSH'17]

Ioannis Papadakis, Konstantinos Nikas, Vasileios Karakostas, Georgios Goumas, Nectarios Koziris

Co-execution of multiple workloads in modern multi-core servers may create severe performance degradation and unpredictable execution behavior, impacting significantly their Quality of Service (QoS) levels. To safeguard the QoS levels of high priority workloads, current resource allocation policies are quite conservative, disallowing their co-execution with low priority ones, creating a wasteful tradeoff between QoS and aggregated system throughput. In this paper we utilise the cache monitoring and allocation facilities provided by modern processors and implement a dynamic cache partitioning scheme, where high-priority workloads are monitored and allocated the amount of shared cache that they actually need. This way, we are able to simultaneously maintain their QoS very close to the levels of full cache allocation and boost the system’s throughput by allocating the surplus cache space to co-executing, low-priority applications.

Published at:
Workshop on Co-Scheduling of HPC Applications