BSR Winter school organized by TU/e, TUDelft and UTwente
Lectures
Lectures will be given by the invited speakers and the BSR senior researchers.
- Wil van der Aalst on Making Sense From Software Using Process Mining
Short abstract: Software-related problems have an incredible impact on society, organizations, and users that increasingly rely on information technology. Since software is evolving and operates in a changing environment, one cannot anticipate all problems at design-time. We propose to use process mining to analyze software in its natural habitat. Process mining aims to bridge the gap between model-based process analysis methods such as simulation and other business process management techniques on the one hand and data-centric analysis methods such as machine learning and data mining on the other. It provides tools and techniques for automated process model discovery, conformance checking, data-driven model repair and extension, bottleneck analysis, and prediction based on event log data. Process discovery techniques can be used to capture the real behavior of software. Conformance checking techniques can be used to spot deviations. The alignment of models and real software behavior can be used to predict problems related to performance or conformance. Recent developments in process mining and the instrumentation of software make this possible. This lecture provides pointers to the state-of-the-art in process mining and its application to software.
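To make the discovery idea concrete, here is a minimal, purely illustrative sketch: it derives a directly-follows relation from a toy event log and flags transitions the log has never shown as possible deviations. The log contents and function names are invented for the example and are not the tooling discussed in the lecture.

```python
# Minimal illustration of process discovery: build a directly-follows
# graph from an event log, then use it for a naive conformance check.
# The toy log and all names below are invented for this example.
from collections import Counter

# Each trace is the ordered list of activities observed for one case
# (e.g. one request handled by the software).
event_log = [
    ["receive", "validate", "process", "respond"],
    ["receive", "validate", "reject"],
    ["receive", "validate", "process", "respond"],
]

def directly_follows(log):
    """Count how often activity a is directly followed by activity b."""
    dfg = Counter()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg

def deviations(trace, dfg):
    """Naive conformance check: report transitions never seen in the model."""
    return [(a, b) for a, b in zip(trace, trace[1:]) if (a, b) not in dfg]

model = directly_follows(event_log)
for (a, b), count in sorted(model.items()):
    print(f"{a} -> {b}: {count}")

# A trace that skips validation is flagged as deviating from the model.
print(deviations(["receive", "process", "respond"], model))
```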
- Jack van Wijk on Introduction to Data Visualization
Short abstract: Data Visualization concerns the use of interactive computer graphics to obtain insight into large amounts of data. The aim is to exploit the unique capabilities of the human visual system to detect patterns, structures, and irregularities, and to enable experts to formulate new hypotheses, confirm the expected, and discover the unexpected. In this lecture an overview of the field is given, illustrated with examples of work from Eindhoven, covering a variety of data and application domains. The focus is on information visualization and visual analytics. We study how large amounts of abstract data, such as tables, hierarchies, and networks, can be represented and interacted with. In many cases, combinations of such data have to be dealt with, and the data is often dynamic, which brings another big challenge. Typical use cases are how to understand large software systems, how to analyze thousands of medicine prescriptions, and how to see patterns in huge telecom datasets. In visual analytics, the aim is to integrate methods from statistics, machine learning, and data mining, as well as to support data types such as text and multimedia, and to support the full process from data acquisition to presentation.
- Margaret-Anne Storey on Beyond Mixed Methods: Why Big Data Needs Thick Data
Short abstract: Software analytics and the use of computational methods on “big” data in software engineering are transforming the ways software is developed, used, improved and deployed. Software engineering researchers and practitioners are witnessing an increasing trend in the availability of diverse trace and operational data and of methods to analyze the data. This information is being used to paint a picture of how software is engineered and to suggest ways it may be improved.
Although software analytics shows great potential for improving software quality, user experience and developer productivity, it is important to remember that software engineering is inherently a socio-technical endeavour, with complex practices, activities and cultural aspects that cannot be externalized or captured by tools alone. Consequently, we need other methods to surface “thick data” that will provide rich explanations and narratives about the hidden aspects of software engineering.
In this tutorial, we will explore the following questions:
- What kinds of risks should be considered when using software analytics in automated software engineering?
- Are researchers and practitioners adequately considering the unanticipated impacts that software analytics can have on software engineering processes and stakeholders?
- Are there important questions that are not being asked because the answers do not lie in the data that are readily available?
- Can we improve the application of software analytics using other methods that collect insights directly from participants in software engineering (e.g., through observations)?
- How can we combine or develop new methods that bring together “big data” and “thick data” to yield more meaningful and actionable insights?
We will discuss these questions through specific examples and case studies.
- Frits Vaandrager on Active Learning of Automata
Short abstract: Active automata learning is emerging as a highly effective technique for obtaining state machine models of software components. In this talk, I will give a survey of recent progress in the field, highlight applications, and identify some remaining research challenges.
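As a rough illustration of the setting active learning operates in, the sketch below shows a learner interacting with a hidden toy state machine purely through output queries; the "system under learning", its states, and all names are made up for this example and do not reflect the tools covered in the lecture.

```python
# Sketch of the interaction at the heart of active automata learning:
# the learner only observes the system under learning (SUL) through
# output queries. The SUL below is an invented toy login component.

TRANSITIONS = {          # (state, input) -> (next_state, output)
    ("logged_out", "login"):  ("logged_in", "ok"),
    ("logged_out", "logout"): ("logged_out", "error"),
    ("logged_in", "login"):   ("logged_in", "error"),
    ("logged_in", "logout"):  ("logged_out", "ok"),
}

def output_query(inputs):
    """Run one input word on a fresh instance of the SUL and return its outputs."""
    state, outputs = "logged_out", []
    for symbol in inputs:
        state, out = TRANSITIONS[(state, symbol)]
        outputs.append(out)
    return outputs

# A learning algorithm would pose many such queries and organise the
# answers (e.g. in an observation table) before hypothesising a model.
print(output_query(["login", "logout", "logout"]))  # ['ok', 'ok', 'error']
```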
- Arie van Deursen on Exceptional Logging
Short abstract: Incorrect error handling is a major cause of software system crashes. Luckily, the majority of these crashes lead to useful log data that can help to analyze the root cause. In this presentation we explore exception handling practices from different perspectives, with the ultimate goal of making error handling less error prone, prioritizing error-handling fixes based on their occurrence in log data, and automating the fixing of error handling as far as possible. We cover a range of research methods used in our studies, including static analysis, repository mining, genetic algorithms, log file analytics, and qualitative analysis of surveys.
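As a small illustration of prioritising error handling by how often problems show up in log data, the sketch below counts exception types in a few hypothetical log lines; the log format and names are invented and not taken from the studies in the lecture.

```python
# Toy example: rank exception types by how often they appear in logs,
# as one input for prioritising error-handling fixes. The log lines and
# the regular expression are hypothetical.
import re
from collections import Counter

log_lines = [
    "2024-01-01 ERROR NullPointerException in OrderService.place",
    "2024-01-01 ERROR IOException in ReportExporter.write",
    "2024-01-02 ERROR NullPointerException in OrderService.place",
    "2024-01-02 INFO  request served in 12ms",
]

exception_counts = Counter(
    m.group(1)
    for line in log_lines
    if (m := re.search(r"ERROR (\w+Exception)", line))
)

# Most frequent exceptions first: candidates for improving their handlers.
for exc, n in exception_counts.most_common():
    print(exc, n)
```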
- Robert DeLine on Supporting Data-Centered Software Development
Short abstract: Modern software consists of both logic and data. Some use of the data is “back stage”, invisible to customers. For example, teams analyze service logs to make engineering decisions, like assigning bug and feature priorities, or use dashboards to monitor service performance. Other use of data is “on stage” (that is, part of the user experience), for example, the graph algorithms of the Facebook feed, the machine learning behind web search results, or the signal processing inside fitness bracelets. Today, working with data is typically assigned to the specialized role of the data scientist, a discipline with its own tools, skills and knowledge. In the first part of the talk, I'll describe some empirical studies of the emerging role of data scientists, to convey their current work practice.
Over time, working with data is likely to change from a role to a skill. As a precedent, software testing was originally the responsibility of the dedicated role of software testers. Eventually, the prevalence of test-driven development and unit tests turned testing into a skill that many developers practice. Will the same be true of the role of data scientists? Is it possible to create tools that allow a wide range of developers (or even end users) to analyze data and to create data-centered algorithms? The second part of the talk will demo emerging tools that take initial steps toward democratizing data science.
- Mark van den Brand on Challenges in Automotive Software Development Running on Big Software
Short abstract: The amount of software in vehicles has increased rapidly. The first lines of code in a vehicle were introduced in the 1970s; nowadays over 100 million lines of code is no exception in a premium car. More and more functionality is realised in software, and software is the main innovator in the automotive domain at this moment. The amount of software will increase further because of future innovations, think of adaptive cruise control, lane keeping, etc., which all lead to the ultimate goal of autonomous driving. Automotive systems can be categorized into vehicle-centric functional domains (including powertrain control, chassis control, and active/passive safety systems) and passenger-centric functional domains (covering multimedia/telematics, body/comfort, and Human Machine Interface). Of these domains, powertrain, connectivity, active safety and assisted driving are considered major areas of potential innovation. The ever increasing amount of software to enable innovation in vehicle-centric functional domains requires even more attention to the assessment and improvement of the quality of automotive software. This is because software-driven innovations can come with software defects, failures, and vulnerability to hackers’ attacks. This can be observed in the large number of recent recalls, quite a few of which are software related.
Functional safety is another important aspect of automotive software. Failures in the software may be costly because of recalls, but may even be life threatening: the failure or malfunctioning of an automotive system may result in serious injuries or the death of people. A number of functional safety standards have been developed for safety-critical systems; the ISO26262 standard is the functional safety standard for the automotive domain, geared towards passenger cars. A new version of the ISO26262 standard will cover trucks, buses and motorcycles as well. The automotive industry has started to apply these safety standards as guidelines in its development projects. However, compliance with these standards is still very costly and time-consuming due to the huge amount of manual work involved.
In the last six years we have been doing research in the domain of automotive software development. We have investigated how to evaluate the quality of automotive software architectures. Software in the automotive domain is mainly developed in Matlab/Simulink and, more recently, SysML; from the developed models C code is generated. Of course, some functionality is developed directly in C. We have used general software quality metrics frameworks to establish the quality of Matlab/Simulink and SysML models and of automotive software architectures in general. In the area of functional safety we have applied model-driven techniques to support the functional safety development process and safety assurance. We have done research on how to apply ISO26262 for functional safety improvement. Furthermore, we have developed a meta-model for the ISO26262 standard and, based on this meta-model, generated tooling for safety case construction and assessment.
Recent research focuses on the integration of functional safety standards in the development process. In the future we want to investigate how functional safety related requirements can be covered directly in the automotive architecture at an early stage. We will do this in relation to research in the field of autonomous driving.
- Georgios Gousios on Mining GitHub for fun and profit
Short abstract: Modern organizations use telemetry and process data to make software production more efficient. Consequently, software engineering is an increasingly data-centered scientific field. With over 30 million repositories and 10 million users, GitHub is currently the largest code hosting site in the world. Software engineering researchers have been drawn to GitHub due to this popularity, as well as its integrated social features and the metadata that can be accessed through its API. To make research with GitHub data approachable, we created the GHTorrent project, a scalable, off-line mirror of all data offered through the GitHub API. In our lecture, we will discuss the GHTorrent project in detail and present insights drawn from using this dataset in various research works.
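As a taste of the kind of metadata involved, the snippet below fetches basic repository information from the public GitHub API (which GHTorrent mirrors, together with much more, as an offline dataset). The repository name is only an example, and unauthenticated API calls are heavily rate-limited.

```python
# Fetch basic repository metadata from the GitHub REST API.
# GHTorrent offers this kind of data (and far more) as offline dumps,
# which is the recommended route for research-scale analysis.
import requests

resp = requests.get("https://api.github.com/repos/gousiosg/github-mirror")
resp.raise_for_status()
repo = resp.json()

print(repo["full_name"], "written mainly in", repo["language"])
print("stars:", repo["stargazers_count"], "forks:", repo["forks_count"])
```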
- Jaco van de Pol on Scalable Model Analysis
Short abstract: Software is a complex product. This already holds for its static structure, but even more so for its dynamic behaviour. When considering Big Software on the Run, the role of models is changing fast: instead of using them as a blueprint (like engineers), we now use models to understand running software (like biologists). More radically, we are now using machine learning to obtain models of complex software systems automatically. However, adapting a classic motto, we should “spend more time on the analysis of models, than on collecting logs, and learning and visualising models” (1).
We will discuss algorithms and tools for studying models of the dynamic behaviour of systems. Since their complex behaviour is essentially modeled as a giant graph, we will review various high performance graph algorithms. In particular, we will cover LTL as a logic to specify properties of system runs, and symbolic and multi-core model checking as the scalable means to analyse large models. We will illustrate this with Petri Nets modelling software systems, and timed automata modelling biological systems.
(1) A variation on “Spend more time working on code that analyzes the meaning of metrics, than code that collects, moves, stores and displays metrics” (Adrian Cockcroft), cited by H. Hartmann in Communications of the ACM 59(7), July 2016
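To give a feel for the underlying graph problem, the sketch below does a toy explicit-state exploration: a breadth-first search over a small transition system that checks whether a “bad” state is reachable. The system and property are invented; the symbolic and multi-core techniques in the lecture scale to vastly larger state spaces.

```python
# Toy explicit-state model checking of a safety property:
# breadth-first search for a reachable "bad" state in a small,
# invented transition system (a counter with an overflow flag).
from collections import deque

def successors(state):
    """Transition relation: states are (value, overflow_flag) pairs."""
    value, overflow = state
    if value < 3:
        yield (value + 1, overflow)   # increment
    else:
        yield (0, True)               # wrap-around sets the bad flag
    yield (0, overflow)               # reset

def find_bad_state(initial, is_bad):
    seen, queue = {initial}, deque([initial])
    while queue:
        state = queue.popleft()
        if is_bad(state):
            return state
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None   # safety property holds: no bad state is reachable

print(find_bad_state((0, False), lambda s: s[1]))  # (0, True) is reachable
```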
- Marieke Huisman on Reliable Concurrent Software
Short abstract: Concurrent software is inherently error-prone, due to the possible interactions and subtle interplays between parallel computations. As a result, error prediction and tracing the sources of errors are often difficult. In particular, rerunning an execution with exactly the same input might not lead to the same error. To improve this situation, we need techniques that can provide guarantees about the behaviour of a concurrent program. In this lecture, we discuss an approach based on program annotations. The program annotations describe locally which parts of memory are affected by a thread, and what the expected behaviour of a thread is. From the local program annotations, conclusions can be drawn about the global behaviour of a concurrent application. We discuss various techniques to verify such annotations. If a high correctness guarantee is needed, static program verification techniques can be used. However, in many cases, checking at run-time that the annotations are not violated is sufficient. We discuss both approaches, and show in particular what the challenges are of using them in a concurrent setting.
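The sketch below only hints at the run-time side of this idea: each worker thread is given the part of the shared state it may update, and a simple assertion plays the role of a (much simplified) annotation that is checked while the program runs. The annotation languages of real concurrent verifiers are far more expressive; everything here is invented for illustration.

```python
# Run-time checking of a (very simplified) "annotation": each worker owns
# one key of the shared dictionary, and an assertion checks an invariant
# on every update. Invented example, not the lecture's annotation language.
import threading

counters = {"a": 0, "b": 0}
lock = threading.Lock()

def worker(owned_key, increments):
    for _ in range(increments):
        with lock:
            counters[owned_key] += 1
            # The checked "annotation": a counter never exceeds the number
            # of increments its owning thread was asked to perform.
            assert counters[owned_key] <= increments

threads = [
    threading.Thread(target=worker, args=("a", 1000)),
    threading.Thread(target=worker, args=("b", 1000)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counters)  # {'a': 1000, 'b': 1000}
```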
- Bram Adams on How NOT to analyze your release process
Short abstract: The release engineering process is the process that brings high quality code changes from a developer’s workspace to the end user, encompassing code change integration, continuous integration, build system specifications, infrastructure-as-code, deployment and release. Recent practices of continuous delivery, which bring new content to the end user in days or hours rather than months or years, require companies to closely monitor the progress of their release engineering process by mining the repositories involved in each phase, such as their version control system, bug/reviewing repositories and deployment logs. This tutorial presents the six major phases of the release engineering pipeline, the main repositories that are available for analysis in each phase, and three families of mistakes that could invalidate empirical analysis of the release process. Even if you are not working on release engineering, the mistakes discussed in this tutorial can impact your research results as well!
For more background: http://mcis.polymtl.ca/publications/2016/fose.pdf
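As a very small example of mining the version-control phase of such a pipeline, the sketch below lists a repository's release tags together with the dates of the commits they point at; the repository path is hypothetical and the commands assume a standard git installation.

```python
# List release tags and the dates of the tagged commits, as a starting
# point for analysing release cadence. Run inside any cloned git repository.
import subprocess

def git(*args, repo="."):
    return subprocess.run(
        ["git", "-C", repo, *args],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

for tag in git("tag", "--list").splitlines():
    date = git("log", "-1", "--format=%ci", tag)  # committer date of the tag's commit
    print(tag, date)
```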
- Zekeriya Erkin on Software Analysis: Anonymity and Cryptography for Privacy
Short abstract: Validation in a big software system can be managed by dynamically analysing its behaviour. A software system in use occasionally reports information to developers about its status in the form of event logs. Developers use this information to detect flaws in the software and to improve its performance with the help of process mining techniques. Process mining generates process models from the collected events, or checks the conformance of these events with an existing process model, to identify flaws in the software. Process discovery algorithms rely on real event logs that capture software behaviour and are indeed very useful for software validation. However, the presence of sensitive information in the collected logs may become a threat to the privacy of users, as seen in practice. In this talk, we present privacy enhancing technologies (PETs) that enable privacy-preserving algorithms for software modelling. We focus on different approaches, namely anonymization techniques and the deployment of advanced cryptographic tools such as homomorphic encryption, for the protection of sensitive data in logs during software analysis. As this is a very new field of research, we introduce a number of challenges yet to be solved and discuss different aspects of these challenges in terms of the level of privacy, utility, and overhead introduced by deploying PETs.
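One very simple anonymization step, sketched below, is to replace user identifiers in event logs with keyed-hash pseudonyms before the logs leave the system; the field names are invented, and the lecture also covers much stronger protection such as homomorphic encryption.

```python
# Pseudonymize user identifiers in event logs with a keyed hash (HMAC)
# before handing the logs to process mining. Field names are invented;
# this is only one basic building block among the PETs discussed.
import hashlib
import hmac

SECRET_KEY = b"held-only-by-the-log-producer"

def pseudonymize(user_id):
    digest = hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:12]   # short, stable pseudonym

events = [
    {"case": "order-1", "activity": "login", "user": "alice@example.com"},
    {"case": "order-1", "activity": "pay",   "user": "alice@example.com"},
    {"case": "order-2", "activity": "login", "user": "bob@example.com"},
]

for event in events:
    event["user"] = pseudonymize(event["user"])

print(events)  # same pseudonym for the same user, no raw identifiers left
```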