Publications
MaxSim: A simulation platform for managed applications [ISPASS'17]
Andrey Rodchenko, Christos Kotselidis, Andy Nisbet, Antoniu Pop, Mikel Luján
Abstract
Managed applications, written in programming languages such as Java, C#, and others, represent a significant share of workloads in the mobile, desktop, and server domains. Microarchitectural timing simulation of such workloads is useful for characterization and performance analysis of both hardware and software, as well as for research and development of novel hardware extensions. This paper introduces MaxSim, a simulation platform based on the Maxine VM, the ZSim simulator, and the McPAT modeling framework. MaxSim can simulate managed workloads running on top of the Maxine VM quickly and accurately, and its capabilities are showcased with novel simulation techniques for: 1) low-intrusive microarchitectural profiling via pointer tagging on x86-64 platforms, 2) modeling of hardware extensions related, but not limited, to tagged pointers, and 3) modeling of complex software changes via address-space morphing. Low-intrusive microarchitectural profiling is achieved by utilizing tagged pointers to collect type- and allocation-site-related hardware events. Furthermore, MaxSim allows, through a novel technique called address-space morphing, the easy modeling of complex object-layout transformations. Finally, through the co-designed capabilities of MaxSim, novel hardware extensions can be implemented and evaluated. We showcase MaxSim's capabilities by simulating the whole set of the DaCapo-9.12-bach benchmarks in less than a day while performing an up-to-date microarchitectural power and performance characterization. Furthermore, we demonstrate a hardware/software co-designed optimization that performs dynamic load elimination for array-length retrieval, achieving up to a 14% reduction in L1 data cache loads and up to a 4% reduction in dynamic energy. MaxSim is released as free software and is available at https://github.com/arodchen/MaxSim.
Heterogeneous Managed Runtime Systems: A Computer Vision Case Study [VEE'17]
Christos Kotselidis, James Clarkson, Andrey Rodchenko, Andy Nisbet, John Mawer, Mikel Luján
Abstract
Real-time 3D space understanding is becoming prevalent across a wide range of applications and hardware platforms. To meet the desired Quality of Service (QoS), computer vision applications tend to be heavily parallelized and to exploit any available hardware accelerators. Current approaches to achieving real-time computer vision revolve around programming languages typically associated with High-Performance Computing, along with binding extensions for OpenCL or CUDA execution.
Such implementations, although high-performing, lack portability across the wide range of diverse hardware resources and accelerators. In this paper, we showcase how a complex computer vision application can be implemented within a managed runtime system. We discuss the complexities of achieving high-performing and portable execution across embedded and desktop configurations. Furthermore, we demonstrate that it is possible to achieve the QoS target of over 30 frames per second (FPS) by exploiting FPGA and GPGPU acceleration transparently through the managed runtime system.
Improving QoS and Utilisation in modern multi-core servers with Dynamic Cache Partitioning [COSH'17]
Ioannis Papadakis, Konstantinos Nikas, Vasileios Karakostas, Georgios Goumas, Nectarios Koziris
Abstract
Co-execution of multiple workloads in modern multi-core servers may create severe performance degradation and unpredictable execution behavior, significantly impacting their Quality of Service (QoS) levels. To safeguard the QoS levels of high-priority workloads, current resource allocation policies are quite conservative, disallowing their co-execution with low-priority ones and creating a wasteful tradeoff between QoS and aggregated system throughput. In this paper we utilise the cache monitoring and allocation facilities provided by modern processors and implement a dynamic cache partitioning scheme, where high-priority workloads are monitored and allocated the amount of shared cache that they actually need. This way, we are able to simultaneously maintain their QoS very close to the levels of full cache allocation and boost the system's throughput by allocating the surplus cache space to co-executing, low-priority applications.