Methods and Tools for Supporting (Semi-)Automated Evaluation in Long-Term In-the-Wild Deployment Studies

Michael Koch
michael.koch@unibw.de
University of the Bundeswehr Munich
Munich, Germany

Julian Fietkau
julian.fietkau@unibw.de
University of the Bundeswehr Munich
Munich, Germany

Susanne Draheim
susanne.draheim@haw-hamburg.de
Hamburg University of Applied Sciences
Hamburg, Germany

Jan Schwarzer
jan.schwarzer@haw-hamburg.de
Hamburg University of Applied Sciences
Hamburg, Germany

Kai von Luck
kai.vonluck@haw-hamburg.de
Hamburg University of Applied Sciences
Hamburg, Germany

Abstract

Human-computer interaction increasingly focuses on the long-term evaluation of in-the-wild deployments. With this trend, however, understanding usage behavior becomes more challenging. Due to the repetitive manual labor involved, existing methods such as in-situ observations and manual video analysis offer little promise on this avenue. Automated approaches (e.g., based on body tracking cameras) have recently been suggested to capture usage behavior in long-term evaluations more efficiently. Still, these approaches may not be the only ones worth considering to move the field forward from here. This workshop gathers and reflects on the current state of the art regarding this trend and outlines perspectives for future research. The contributions cover, among other topics: methods and tools for data collection, noise and errors in sensor data, the correlation of automated observations with ground truth data, and augmenting sensor data with field work (e.g., interviews) for the contextualization of findings.

Keywords

in-the-wild deployments, long-term evaluation, automated data processing, ambient displays, mixed methods, workshop


1. Background

When designing and evaluating technology in human-computer interaction (HCI) research, the increasing complexity and ubiquity of technological artifacts goes hand in hand with the emerging need to take entire socio-technical systems into account. Methodologies for collecting, combining, and analyzing data are also increasing in maturity. For example, the use of both quantitative and qualitative usage behavior data in tandem (i.e., mixed methods) is becoming more common throughout in-the-wild (ITW) deployment studies.

Nowadays, this development manifests itself in an increasingly practice-based perspective of HCI research [13], which has long been established in the field of Computer-Supported Cooperative Work (CSCW) [25]. Here, a practice describes collective patterns of interaction that are reproduced in specific contexts [25]. At the core of these approaches is the understanding of technology as a flexible entity in an equally flexible environment, from which concrete practices form over time [11].

One area where we can observe this development is ambient display research [1, 2, 9, 18, 20]. Here, research questions have started to arise that can only be meaningfully investigated in the field. These questions encompass topics such as user behavior (e.g., walking paths or interaction phases), user experience, acceptance (e.g., with respect to privacy or data protection), and the social impact of new technologies [1]. In investigating these questions, the ecological validity of the collected data is crucial, i.e., whether the data was collected in a realistic environment reflecting authentic usage behavior. There is a need to develop a better understanding of how the interplay of people, their physical environment, and the use of technology unfolds in such settings [16].

Unsurprisingly, a recent trend in this field is to increasingly augment and automate the processes of data collection (i.e., by using optical sensors such as 3D cameras) and analysis (e.g., by applying algorithms for pattern discovery) in longitudinal field deployment studies. Questions revolve around, for example, the impact of the presence of interactive systems on user walking paths, how different interaction techniques attract potential users, or how people engage with such systems. In essence, these studies find motivation in learning more about the spatial, temporal, and social behavior of users. The central assumption is that long-term sensor data, on the one hand, adequately complements touch interaction logs (i.e., in terms of cost-effectiveness and richness) and, on the other hand, makes both passive and active use explicit in a more holistic way. However, it remains to be seen whether this methodological choice will prove successful in the long run and, if so, how it affects the HCI community in a broader sense (e.g., regarding an overarching research design). Field deployment research is known for its continually changing environmental conditions, such as contextual variables (e.g., team structures and room layouts) or the information demand of the target audience (e.g., the introduction of new tools). While these dynamics do indeed point to significant research design challenges, they simultaneously underline the necessity to intensify the dialogue on methodological guidance for our community.

This workshop aims to answer two fundamental questions:

  1. What is the current state of the art in automated data processing for evaluation in HCI field deployment studies?
  2. How does this knowledge need to be advanced practically (e.g., development of new tools) and methodologically (e.g., introduction of new means for data analysis)?

The workshop is also intended to initiate more exchange and collaborative work in the field – contributions to tool chains, the use of tools from other groups, and the collaborative development of tools.


2. Related Work

Roughly a decade ago, Alt et al. [2] introduced different kinds of research questions and how to address them methodologically in ambient display research. To this day, obtaining insights in this field has mainly relied on two types of methods: first, short-term observations (i.e., the whole spectrum from participant observation to video analysis to surveys) and second, interaction logs (such as touch gestures). Interaction logs were long considered the only data sources that allow deducing statements regarding usage over a longer period of time [4]. Recently, however, we found that the field lacks rigorous procedures to enable a methodology-driven collection and analysis of data [21]. Studies were found to be more likely to use individual data collection methods and less likely to see them as part of an overarching research process (e.g., considering how different methods interconnect).

Recent developments increasingly target the challenge of (automatically) examining user behavior per se in greater detail (e.g., [9, 23]). Studies have criticized such systems for not being understood as part of a broader context (ibid.). Fundamentally, the study of user behavior is considered complex, which in the past often resulted in a reliance on manual observations and ethnographic research. Therefore, a discernible trend in these more recent efforts is to successively augment and automate the processes of data collection (i.e., by leveraging 3D cameras) and analysis (e.g., through algorithmic solutions). Such research finds motivation in gaining more in-depth knowledge about the spatial and temporal behavior of users in close proximity to a display installation. The goal is to gain complementary information about content transitions, presentation times, and interactions. To date, however, there are only a few studies that follow this path [9].

The study by Williamson and Williamson [23] identifies several questions to explore in this now emerging research focus. These questions revolve around, for instance, the impact of an ambient display's presence on user walking paths or how different interaction techniques attract potential users. In addition, these studies may identify data that could be of particular interest in future work:

In addition to the automatic collection of data, there is work envisioning other methodological aspects. For example, Claes et al. [7] compared findings from an ITW study and a controlled ITW study (i.e., a merge of the qualities of both lab-based and ITW studies) of an ambient display installation. For the latter, the authors proactively invited participants to an open study on interactive installations, while for the former they just observed what interaction naturally occurred in the field. In both cases, structured interviews were performed, and it was concluded that an ITW study was better suited to identify quantitative indications of actual user engagement, whereas a controlled ITW study yielded more valuable insights on why these trends were happening. Overall, when evaluating more complex interaction techniques, a controlled ITW study was found to offer a viable alternative.


3. Toward Automatic Evaluation of In-the-Wild Deployment Studies

In our work, we heavily build on quantitative data (i.e., body tracking and interaction data) as a foundation to guide our research and enrich incremental findings by thorough contextualization through qualitative insights. We believe that only using both kinds of methods in tandem can bring forth sound conclusions regarding how users really behave around display installations. Mäkelä et al. [18] recently provided a good overview of what data is usually available in ambient display deployment studies and how to process it: both body tracking data from a camera and interaction data from the display software itself are the pillars of this overview. Data is processed, combined, and fed into variables that are defined for particular research questions. We have implemented this view in our research. As part of that, we developed a new data format for storing body tracking data [10] as well as an application for Elastic Stack to store both interaction and body tracking data [19].
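To make this pipeline more concrete, the following minimal sketch shows how a body tracking frame and a co-occurring interaction could be merged into a single event record and serialized for storage; all field names and the index name are illustrative assumptions and do not reflect the actual data format from [10] or the Elastic Stack application from [19].

    import json
    from dataclasses import dataclass, asdict
    from datetime import datetime, timezone
    from typing import Optional, Tuple

    @dataclass
    class TrackingEvent:
        """One combined observation: a body tracking frame plus any co-occurring
        interaction (field names are illustrative, not our actual format)."""
        timestamp: str                    # ISO 8601, UTC
        person_id: int                    # tracking ID assigned by the camera SDK
        position_m: Tuple[float, float]   # (x, z) floor position relative to the display, meters
        interaction: Optional[str]        # e.g. "touch" or "pointing", None for passive presence

    def to_document(event: TrackingEvent) -> str:
        """Serialize an event as JSON, e.g. for bulk upload into a hypothetical
        'display-usage' index in an Elastic Stack instance."""
        return json.dumps(asdict(event))

    example = TrackingEvent(
        timestamp=datetime.now(timezone.utc).isoformat(),
        person_id=42,
        position_m=(0.8, 2.5),
        interaction=None,
    )
    print(to_document(example))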

Our methodological stance is motivated by the issue of lacking comparability. Without contextualization, it is still a challenge to compare two intervals of interaction data to, for instance, determine whether a new feature changes the usage of the display (e.g., by averaging interaction counts) and, if so, how. There is arguably a large variety of context factors that influence the overall interaction process. Examples are holidays, remote work, changing team structures, the current information demand, and so on. In contrast, including data about what is actually happening in front of an ambient display enables us to draw a more holistic picture of an interaction. We are able to be very specific about conversion rates of users, as Michelis and Müller [17] describe them, to distinguish between real users and simple passers-by, and to shed light on subtle and direct interaction.
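As a minimal sketch of how such conversion rates can be computed once every tracked person has been assigned the deepest audience funnel stage they reached: the stage names, their upstream assignment (e.g., from body orientation, dwell time, and interaction logs), and the input format are simplified assumptions rather than the exact operationalization of Michelis and Müller [17].

    from collections import Counter

    # Simplified audience funnel stages, ordered from shallow to deep engagement.
    STAGES = ["passing_by", "viewing", "subtle_interaction", "direct_interaction"]

    def conversion_rates(deepest_stage_per_person):
        """deepest_stage_per_person: dict mapping person_id -> deepest stage reached.
        Returns, for each pair of consecutive stages, the share of people reaching
        the earlier stage who also reach the later one."""
        counts = Counter(deepest_stage_per_person.values())
        reached = []                       # number of people reaching *at least* each stage
        remaining = len(deepest_stage_per_person)
        for stage in STAGES:
            reached.append(remaining)
            remaining -= counts.get(stage, 0)
        return {
            f"{STAGES[i]} -> {STAGES[i + 1]}":
                (reached[i + 1] / reached[i] if reached[i] else 0.0)
            for i in range(len(STAGES) - 1)
        }

    # Example with three tracked people: one passer-by, one viewer, one direct user.
    print(conversion_rates({1: "passing_by", 2: "viewing", 3: "direct_interaction"}))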

In our view, existing work such as the study by Mäkelä et al. [18] lacks some crucial parts needed to grasp interaction as a full concept and consequently falls short of providing answers on how to empower researchers toward this goal. To name a few aspects:

  1. The possibility to readily visualize body tracking data to identify relevant situations (e.g., people aggregating in front of a display as described by the honeypot effect [24]); see the visualization sketch after this list.
  2. Algorithmic means to easily search for patterns in huge amounts of collected data over time.
  3. Methodological suggestions on how to include insights from the context gathered through, for instance, interviews and observations.
  4. Answers on how to cope with the inherent dynamics of ITW studies in an overarching research design.
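To illustrate the first point, the following minimal sketch plots top-down walking paths in front of a display from recorded floor positions using matplotlib. The data layout assumed here (per-person lists of (x, z) positions in meters, with the display spanning the x axis at z = 0) is an illustrative simplification and not the format used by our logging tools.

    import matplotlib.pyplot as plt

    def plot_walking_paths(trajectories, display_width_m=2.0):
        """Plot top-down walking paths in front of a display.
        trajectories: dict person_id -> list of (x, z) floor positions in meters."""
        fig, ax = plt.subplots(figsize=(6, 6))
        for person_id, path in trajectories.items():
            xs, zs = zip(*path)
            ax.plot(xs, zs, alpha=0.7, label=f"person {person_id}")
        # Mark the display itself as a thick line at z = 0.
        ax.hlines(0.0, -display_width_m / 2, display_width_m / 2,
                  colors="black", linewidth=4)
        ax.set_xlabel("x (m, parallel to the display)")
        ax.set_ylabel("z (m, distance from the display)")
        ax.set_title("Top-down walking paths")
        ax.legend()
        plt.show()

    # Example: one person approaching the display, one passing by at a distance.
    plot_walking_paths({1: [(-3.0, 4.0), (-1.5, 3.0), (0.0, 1.2)],
                        2: [(2.5, 5.0), (2.0, 4.5), (1.8, 4.2)]})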

In the following, we provide some more in-depth elaborations on these ideas and summarize them in Figure 1.

[Figure 1 depicts a processing pipeline: Data Collection (log skeletal and interaction data during deployment; position and joints via ZED2/Kinect logging, pointing via application logging) → Preparation (combine and clean up skeletal data files with the HoPE tracking server; combine interaction data files in Elastic Stack; combine skeletal and interaction data files via PoseViz extraction scripts generating events for Elastic Stack) → Exploration (visually explore data through interactive visualization of PoseViz or Elastic Stack data) → Feature Extraction (define variables needed to answer research questions; generate variables from combined skeletal and interaction data using scripts working on PoseViz and/or Elastic Stack) → Analysis (statistical analyses using variable data in Elastic Stack (Kibana), Excel, …). Static and dynamic context (content displayed / functionality offered; interviews and additional observations; structured laboratory journals in Elastic Stack) influences and informs these steps.]
Figure 1: A preliminary methodological blueprint for long-term, semi-automatic, and real-world evaluations of ambient displays (based on Figure 5 in Mäkelä et al. [18]).

3.1. Exploration

In reality, we often find ourselves in the situation of having to determine the right data for addressing a particular research question. We regularly engage in weighing the pros and cons of individual data collection methods to unveil new insights. While in some instances we have clear ideas in mind throughout this exploration process, in other situations we find it useful to filter by certain parameters and, with these parameters in mind, look at specific situations and their underlying data. A practical example is one of our research projects in which we are investigating the honeypot effect in more detail. Here, we first filter the body tracking data for situations to be elaborated on. Filters can be, but are not limited to, aspects such as the ones described by Azad et al. [3]: How many people enter a scene from the left, the right, or the front? How many people slow down or start interacting? With regard to the honeypot effect, we look at situations where initially only one person was standing in front of a display installation and where others then join this person. Next, we try to identify patterns in the underlying body tracking data in order to, ultimately, find other occurrences algorithmically. While we can obtain one or many instances of the honeypot effect quantitatively this way, we are then required to provide some context for these instances to give them meaning.
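A minimal sketch of such an algorithmic filter is given below, assuming the body tracking data has already been reduced to a time-ordered series of per-frame people counts; the input format and the dwell threshold are illustrative assumptions.

    def find_honeypot_candidates(frames, min_solo_seconds=10.0):
        """Scan a time-ordered list of (timestamp_seconds, people_count) tuples and
        return the timestamps at which additional people joined a person who had
        been alone in front of the display for at least min_solo_seconds."""
        candidates = []
        solo_since = None
        for timestamp, count in frames:
            if count == 1:
                if solo_since is None:
                    solo_since = timestamp   # a solo phase begins
            else:
                if (solo_since is not None and count > 1
                        and timestamp - solo_since >= min_solo_seconds):
                    candidates.append(timestamp)   # others joined a long-standing solo user
                solo_since = None
        return candidates

    # Example: one person alone from t=0 to t=14 s, joined by others at t=15 s.
    frames = [(float(t), 1) for t in range(15)] + [(15.0, 2), (16.0, 3)]
    print(find_honeypot_candidates(frames))   # -> [15.0]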

3.2. Context data

Context data is required for interpreting what can really be seen in body tracking and interaction data. As noted before, it can make a difference whether we are looking at a data set collected during holidays or at a time when the information needs within a company are changing. Context data can be, but is not limited to:

This list of context information can be expanded to include more complex aspects such as organizational work processes, for instance, in agile software development teams. Here, questions arise such as: Which sprint are the teams currently working on? When is the next release scheduled? When is the next on-site team meeting? What is the status of the individual teams? Post-COVID questions also emerge, such as how the hybridity of work processes can be included in the understanding of the context and in data processing. In hybrid work situations, the actors are exposed to the duality of the work space (i.e., both the physical and the digital space exist simultaneously as communication and interaction spaces) [15].

Another type of context data is the location of an installation. If we collect data from several screens, it might be interesting to document factors relating to the location for every screen separately (e.g., to determine whether the data is complementary or comparable). Context data can also be obtained automatically from calendars or (historical) services such as weather services, but it can also be collected as part of research projects in the form of interviews or the documentation of additional observations. We generally try to adhere to a procedure of writing laboratory journals indicating special events and times that might be interesting for interpreting usage data later on. Last but not least, it is worth mentioning that, as Dourish [8] vividly describes, the meaning of a specific context is by definition flexible and in constant negotiation with its participants. We therefore have to regularly review the initial understanding of context during a study to tie it back to the initial goal definition or adapt it to the research process if necessary.
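As a minimal sketch of how such context information could be kept in a structured form and attached to an analysis interval, consider the following; the fields, categories, and dates are illustrative assumptions and not the schema of our laboratory journals or our Elastic Stack setup.

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class ContextEntry:
        """One structured context annotation, e.g. from a laboratory journal,
        a team calendar, or a weather service (fields are illustrative)."""
        start: date
        end: date
        category: str       # e.g. "holiday", "sprint", "journal_note", "layout_change"
        description: str

    def context_for_interval(entries, interval_start, interval_end):
        """Return all context entries overlapping a given analysis interval, so that
        usage data from that interval can be interpreted against them."""
        return [e for e in entries
                if e.start <= interval_end and e.end >= interval_start]

    journal = [
        ContextEntry(date(2023, 3, 27), date(2023, 4, 6), "sprint",
                     "Sprint n, release scheduled for April 6"),
        ContextEntry(date(2023, 4, 7), date(2023, 4, 10), "holiday",
                     "Easter break, most teams working remotely"),
    ]
    print(context_for_interval(journal, date(2023, 4, 1), date(2023, 4, 8)))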


4. Possible Future Work and Questions

The written contributions for this workshop cover what has been addressed in the previous sections: Rohde et al. [19] describe an infrastructure for interaction logging. Fietkau [10] showcases a toolset for logging and visualizing body tracking data. Koch et al. [12] document a long-term ITW deployment of multiple public screens. Cabalo et al. [6] and Lacher et al. [14] propose and test two different approaches for analyzing body tracking data for determining engagement or attention. Buhl et al. [5] report on a limited-time gamification study to check whether such a change in the application leads to different user behavior.

Below are some open questions that are raised in the workshop papers or that emerge from the bigger picture formed by the collective contributions:

We have attempted to address some of these pressing issues in our field in Figure 1. In the workshop, we aim to discuss this preliminary methodological blueprint and thereby revise it in a meaningful way.

A further important issue to be discussed in the workshop concerns research data management (e.g., how to manage the collection and storage of interaction logs and qualitative data such as interviews), including long-term data storage and making data accessible to others in future studies.

Finally, another interesting topic, which is closely related to long-term ITW deployments, is the “sustainability” of IT research in practice. Nowadays, research in applied computing requires researchers to engage deeply in the field (e.g., with practitioners) in order to design innovative IT artifacts and understand their appropriation. The problem that has not been solved so far is what happens when the research project is completed (see, for example, Simone et al. [22] for a broader discussion of this matter).


5. Organizers

As a research group, the workshop organizers are currently working on the DFG-funded research project “Investigation of the honeypot effect on (semi-)public interactive ambient displays in long-term field studies.” 1 They are eager to extend their internal discussions beyond the project's scope and to exchange insights with the broader community.

Michael Koch is a professor of HCI at the University of the Bundeswehr Munich, Germany. His main interests in research and education are cooperation systems, i.e., bringing collaboration technology to use in teams. In the past decades, he has worked on several projects in the field of public displays and has conducted multiple long-term field studies in this domain.

Julian Fietkau is a post-doc researcher in HCI at the University of the Bundeswehr Munich. His recently concluded doctoral project involved the design and evaluation of public displays of different kinds to support older adults in outdoor activities.

Susanne Draheim is a post-doc researcher and Managing Director of the Research and Transfer Centre “Smart Systems” at Hamburg University of Applied Sciences. She has an academic background in sociology, educational sciences, and cultural sciences. She works on datafication & qualitative social research methods, companion technology, and digital transformation.

Jan Schwarzer is a post-doc researcher in the Creative Space for Technical Innovations (CSTI) group at Hamburg University of Applied Sciences, working on long-term evaluations of user behavior around ambient displays deployed in authentic environments. Recently, he has concentrated on algorithmic approaches to distilling underlying patterns from quantitative usage behavior data.

Kai von Luck is a professor of computer science at Hamburg University of Applied Sciences and the Academic Director of the CSTI group. His background in artificial intelligence informs and enriches his work on ambient displays and tangible interfaces.


1: https://gepris.dfg.de/gepris/projekt/451069094


References

[1] Florian Alt, Daniel Buschek, David Heuss, and Jörg Müller. 2021. Orbuculum – Predicting When Users Intend to Leave Large Public Displays. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 5, 1, Article 46 (2021), 16 pages. https://doi.org/10.1145/3448075
[2] Florian Alt, Stefan Schneegaß, Albrecht Schmidt, Jörg Müller, and Nemanja Memarovic. 2012. How to evaluate public displays. In Proceedings of the 2012 International Symposium on Pervasive Displays. Association for Computing Machinery, New York, NY, USA, 6 pages. https://doi.org/10.1145/2307798.2307815
[3] Alec Azad, Jaime Ruiz, Daniel Vogel, Mark Hancock, and Edward Lank. 2012. Territoriality and Behaviour on and around Large Vertical Publicly-Shared Displays. In Proceedings of the Designing Interactive Systems Conference (Newcastle Upon Tyne, United Kingdom) (DIS ’12). Association for Computing Machinery, New York, NY, USA, 468–477. https://doi.org/10.1145/2317956.2318025
[4] Dirk Börner, Marco Kalz, and Marcus Specht. 2013. Closer to you: Reviewing the application, design, and evaluation of ambient displays. International Journal of Ambient Computing and Intelligence 5, 3 (2013), 16–31. https://doi.org/10.4018/ijaci.2013070102
[5] Wolfgang Buhl, Klaas-Frederik Engel, and Viktor Buller. 2023. Evaluation of a gamification approach for increasing interaction with public displays. In Mensch und Computer 2023 – Workshopband, Peter Fröhlich and Vanessa Cobus (Eds.). Gesellschaft für Informatik e.V., Bonn, Germany, 5 pages. https://doi.org/10.18420/muc2023-mci-ws13-344
[6] Coleen Cabalo, Lars Gatzemeyer, and Lukas Mathes. 2023. Evaluating the engagement of users from public displays. In Mensch und Computer 2023 – Workshopband, Peter Fröhlich and Vanessa Cobus (Eds.). Gesellschaft für Informatik e.V., Bonn, Germany, 7 pages. https://doi.org/10.18420/muc2023-mci-ws13-282
[7] Sandy Claes, Niels Wouters, Karin Slegers, and Andrew Vande Moere. 2015. Controlling In-the-Wild Evaluation Studies of Public Displays. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (Seoul, Republic of Korea) (CHI ’15). Association for Computing Machinery, New York, NY, USA, 81–84. https://doi.org/10.1145/2702123.2702353
[8] Paul Dourish. 2004. What we talk about when we talk about context. Personal and Ubiquitous Computing 8 (2004), 19–30. https://doi.org/10.1007/s00779-003-0253-8
[9] Ivan Elhart, Mateusz Mikusz, Cristian Gomez Mora, Marc Langheinrich, and Nigel Davies. 2017. Audience Monitor: An Open Source Tool for Tracking Audience Mobility in Front of Pervasive Displays. In Proceedings of the 6th ACM International Symposium on Pervasive Displays (Lugano, Switzerland) (PerDis ’17). Association for Computing Machinery, New York, NY, USA, 1–8. https://doi.org/10.1145/3078810.3078823
[10] Julian Fietkau. 2023. A New Software Toolset for Recording and Viewing Body Tracking Data. In Mensch und Computer 2023 – Workshopband, Peter Fröhlich and Vanessa Cobus (Eds.). Gesellschaft für Informatik e.V., Bonn, Germany, 4 pages. https://doi.org/10.18420/muc2023-mci-ws13-334
[11] Marko Jurmu, Leena Ventä-Olkkonen, Arto Lanamäki, Hannu Kukka, Netta Iivari, and Kari Kuutti. 2016. Emergent Practice as a Methodological Lens for Public Displays In-the-Wild. In Proceedings of the 5th ACM International Symposium on Pervasive Displays (Oulu, Finland) (PerDis ’16). Association for Computing Machinery, New York, NY, USA, 124–131. https://doi.org/10.1145/2914920.2915007
[12] Michael Koch, Julian Fietkau, and Laura Stojko. 2023. Setting up a Long-Term Evaluation Environment for interactive semi-public Information Displays. In Mensch und Computer 2023 – Workshopband, Peter Fröhlich and Vanessa Cobus (Eds.). Gesellschaft für Informatik e.V., Bonn, Germany, 5 pages. https://doi.org/10.18420/muc2023-mci-ws13-356
[13] Kari Kuutti and Liam J. Bannon. 2014. The Turn to Practice in HCI: Towards a Research Agenda. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Toronto, Ontario, Canada) (CHI ’14). Association for Computing Machinery, New York, NY, USA, 3543–3552. https://doi.org/10.1145/2556288.2557111
[14] Jonas Lacher, Laura Bieschke, Florian Michalowski, and Johannes Münch. 2023. Using machine learning to determine attention towards public displays from skeletal data. In Mensch und Computer 2023 – Workshopband, Peter Fröhlich and Vanessa Cobus (Eds.). Gesellschaft für Informatik e.V., Bonn, Germany, 4 pages. https://doi.org/10.18420/muc2023-mci-ws13-293
[15] Gesa Lindemann and David Schünemann. 2020. Presence in Digital Spaces. A Phenomenological Concept of Presence in Mediatized Communication. Human Studies 43 (2020), 627–651. https://doi.org/10.1007/s10746-020-09567-y
[16] Joseph F McCarthy, David H Nguyen, Al Mamunur Rashid, and Suzanne Soroczak. 2004. Proactive Displays: Enhancing Awareness and Interactions in a Conference Context. Technical Report (IRS-TR-04-015). Intel Research. https://www.interrelativity.com/joe/publications/ProactiveDisplays-IRS-TR-04-015.pdf
[17] Daniel Michelis and Jörg Müller. 2011. The Audience Funnel: Observations of Gesture Based Interaction With Multiple Large Displays in a City Center. International Journal of Human-Computer Interaction 27, 6 (2011), 562–579. https://doi.org/10.1080/10447318.2011.555299
[18] Ville Mäkelä, Tomi Heimonen, and Markku Turunen. 2018. Semi-Automated, Large-Scale Evaluation of Public Displays. International Journal of Human-Computer Interaction 34, 6 (2018), 491–505. https://doi.org/10.1080/10447318.2017.1367905
[19] Christopher Rohde, Michael Koch, and Laura Stojko. 2023. Using an Elastic Stack as a Base for Logging and Evaluation of Public Displays. In Mensch und Computer 2023 – Workshopband, Peter Fröhlich and Vanessa Cobus (Eds.). Gesellschaft für Informatik e.V., Bonn, Germany, 6 pages. https://doi.org/10.18420/muc2023-mci-ws13-303
[20] Jan Schwarzer, Susanne Draheim, and Kai von Luck. 2022. Spatial and Temporal Audience Behavior of Scrum Practitioners Around Semi-Public Ambient Displays. International Journal of Human-Computer Interaction (2022), 19 pages. https://doi.org/10.1080/10447318.2022.2099238
[21] Jan Schwarzer, Kai von Luck, Susanne Draheim, and Michael Koch. 2019. Towards Methodological Guidance for Longitudinal Ambient Display In Situ Research. In Proceedings of the 17th European Conference on Computer Supported Cooperative Work. EUSSET, Siegen, Germany, 20 pages. https://doi.org/10.18420/ecscw2019_ep07
[22] Carla Simone, Ina Wagner, Claudia Müller, Anne Weibert, and Volker Wulf. 2022. Future-proofing: Making Practice-Based IT Design Sustainable. Oxford Academic, Oxford, UK. https://doi.org/10.1093/oso/9780198862505.001.0001
[23] Julie R. Williamson and John Williamson. 2014. Analysing Pedestrian Traffic Around Public Displays. In Proceedings of The International Symposium on Pervasive Displays (Copenhagen, Denmark) (PerDis ’14). Association for Computing Machinery, New York, NY, USA, 13–18. https://doi.org/10.1145/2611009.2611022
[24] Niels Wouters, John Downs, Mitchell Harrop, Travis Cox, Eduardo Oliveira, Sarah Webber, Frank Vetere, and Andrew Vande Moere. 2016. Uncovering the Honeypot Effect: How Audiences Engage with Public Interactive Systems. In Proceedings of the 2016 ACM Conference on Designing Interactive Systems (Brisbane, QLD, Australia) (DIS ’16). Association for Computing Machinery, New York, NY, USA, 5–16. https://doi.org/10.1145/2901790.2901796
[25] Volker Wulf, Markus Rohde, Volkmar Pipek, and Gunnar Stevens. 2011. Engaging with Practices: Design Case Studies as a Research Framework in CSCW. In Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work (Hangzhou, China) (CSCW ’11). Association for Computing Machinery, New York, NY, USA, 505–512. https://doi.org/10.1145/1958824.1958902