Tobias Hey

My research is in the domain of Natural Language Processing (NLP) for Software Engineering (SE), with a focus on traceability link recovery and requirements engineering.

In my dissertation, I developed FTLR, an automated traceability link recovery approach that relates requirements to their corresponding source code entities by utilizing fine-grained word embedding-based relations. Furthermore, I’ve developed the requirements classification approach NoRBERT and integrated its results as a filter into FTLR.

Previously, I’ve done research on programming in natural language, mapping natural language instructions to their corresponding API calls.

selected publications

ICSE
LiSSA: Toward Generic Traceability Link Recovery through Retrieval-Augmented Generation

Dominik Fuchß, Tobias Hey, Jan Keim, Haoyu Liu, Niklas Ewald, Tobias Thirolf, and Anne Koziolek

In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering, 2025


There are a multitude of software artifacts which need to be handled during the development and maintenance of a software system. These artifacts interrelate in multiple, complex ways. Therefore, many software engineering tasks are enabled — and even empowered — by a clear understanding of artifact interrelationships and also by the continued advancement of techniques for automated artifact linking. However, current approaches in automatic Traceability Link Recovery (TLR) target mostly the links between specific sets of artifacts, such as those between requirements and code. Fortunately, recent advancements in Large Language Models (LLMs) can enable TLR approaches to achieve broad applicability. Still, it is a nontrivial problem how to provide the LLMs with the specific information needed to perform TLR. In this paper, we present LiSSA, a framework that harnesses LLM performance and enhances them through Retrieval-Augmented Generation (RAG). We empirically evaluate LiSSA on three different TLR tasks, requirements to code, documentation to code, and architecture documentation to architecture models, and we compare our approach to state-of-the-art approaches. Our results show that the RAG-based approach can significantly outperform the state-of-the-art on the code-related tasks. However, further research is required to improve the performance of RAG-based approaches to be applicable in practice.
@inproceedings{fuchss_lissa_2025, author = {Fuchß, Dominik and Hey, Tobias and Keim, Jan and Liu, Haoyu and Ewald, Niklas and Thirolf, Tobias and Koziolek, Anne}, year = {2025}, title = {{LiSSA: Toward Generic Traceability Link Recovery through Retrieval-Augmented Generation}}, booktitle = {Proceedings of the IEEE/ACM 47th International Conference on Software Engineering}, publisher = {{Institute of Electrical and Electronics Engineers (IEEE)}}, location = {Ottawa, Canada}, series = {ICSE '25}, }
RE
Requirements Classification for Traceability Link Recovery

Tobias Hey, Jan Keim, and Sophie Corallo

In 2024 IEEE 32nd International Requirements Engineering Conference (RE), 2024


Being aware of and understanding the relations between the requirements of a software system to its other artifacts is crucial for their successful development, maintenance and evolution. There are approaches to automatically recover this traceability information, but they fail to identify the actual relevant parts of the requirements. Recent large language model-based requirements classification approaches have shown to be able to identify aspects and concerns of requirements with promising accuracy. Therefore, we investigate the potential of those classification approaches for identifying irrelevant requirement parts for traceability link recovery between requirements and code. We train the large language model-based requirements classification approach NoRBERT on a new dataset of requirements and their entailed aspects and concerns. We use the results of the classification to filter irrelevant parts of the requirements before recovering trace links with the fine-grained word embedding-based FTLR approach. Two empirical studies show promising results regarding the quality of classification and the impact on traceability link recovery. NoRBERT can identify functional and user-related aspects in the requirements with an F1-score of 84%. With the classification and requirements filtering, the performance of FTLR could be improved significantly and FTLR performs better than state-of-the-art unsupervised traceability link recovery approaches.
@inproceedings{hey_requirements_2024, title = {{Requirements Classification for Traceability Link Recovery}}, booktitle = {{2024 IEEE 32nd International Requirements Engineering Conference (RE)}}, author = {Hey, Tobias and Keim, Jan and Corallo, Sophie}, year = {2024}, pages = {155-167}, doi = {10.1109/RE59067.2024.00024}, }
ICSE
Recovering Trace Links Between Software Documentation And Code

Jan Keim, Sophie Corallo, Dominik Fuchß, Tobias Hey, Tobias Telge, and Anne Koziolek

In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE), Apr 2024


Introduction: Software development involves creating various artifacts at different levels of abstraction and establishing relationships between them is essential. Traceability link recovery (TLR) automates this process, enhancing software quality by aiding tasks like maintenance and evolution. However, automating TLR is challenging due to semantic gaps resulting from different levels of abstraction. While automated TLR approaches exist for requirements and code, architecture documentation lacks tailored solutions, hindering the preservation of architecture knowledge and design decisions. Methods: This paper presents our approach TransArC for TLR between architecture documentation and code, using component-based architecture models as intermediate artifacts to bridge the semantic gap. We create transitive trace links by combining the existing approach ArDoCo for linking architecture documentation to models with our novel approach ArCoTL for linking architecture models to code. Results: We evaluate our approaches with five open-source projects, comparing our results to baseline approaches. The model-to-code TLR approach achieves an average F1-score of 0.98, while the documentation-to-code TLR approach achieves a promising average F1-score of 0.82, significantly outperforming baselines. Conclusion: Combining two specialized approaches with an intermediate artifact shows promise for bridging the semantic gap. In future research, we will explore further possibilities for such transitive approaches.
@inproceedings{keim_recovering_2024, title = {Recovering {{Trace Links Between Software Documentation And Code}}}, booktitle = {2024 {{IEEE}}/{{ACM}} 46th {{International Conference}} on {{Software Engineering}} ({{ICSE}})}, author = {Keim, Jan and Corallo, Sophie and Fuch{\ss}, Dominik and Hey, Tobias and Telge, Tobias and Koziolek, Anne}, year = {2024}, month = apr, pages = {2655--2667}, issn = {1558-1225}, doi = {10.1145/3597503.3639130}, urldate = {2024-01-17}, }
ICSME
Improving Traceability Link Recovery Using Fine-grained Requirements-to-Code Relations

Tobias Hey, Fei Chen, Sebastian Weigelt, and Walter F. Tichy

In 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME), Sep 2021


Traceability information is a fundamental prerequisite for many essential software maintenance and evolution tasks, such as change impact and software reusability analyses. However, manually generating traceability information is costly and error-prone. Therefore, researchers have developed automated approaches that utilize textual similarities between artifacts to establish trace links. These approaches tend to achieve low precision at reasonable recall levels, as they are not able to bridge the semantic gap between high-level natural language requirements and code. We propose to overcome this limitation by leveraging fine-grained, method and sentence level, similarities between the artifacts for traceability link recovery. Our approach uses word embeddings and a Word Mover’s Distance-based similarity to bridge the semantic gap. The fine-grained similarities are aggregated according to the artifacts structure and participate in a majority vote to retrieve coarse-grained, requirement-to-class, trace links. In a comprehensive empirical evaluation, we show that our approach is able to outperform state-of-the-art unsupervised traceability link recovery approaches. Additionally, we illustrate the benefits of fine-grained structural analyses to word embedding-based trace link generation.
@inproceedings{hey_improving_2021, title = {Improving {{Traceability Link Recovery Using Fine-grained Requirements-to-Code Relations}}}, booktitle = {2021 {{IEEE International Conference}} on {{Software Maintenance}} and {{Evolution}} ({{ICSME}})}, author = {Hey, Tobias and Chen, Fei and Weigelt, Sebastian and Tichy, Walter F.}, date = {2021-09}, pages = {12--22}, issn = {2576-3148}, doi = {10.1109/ICSME52107.2021.00008}, eventtitle = {2021 {{IEEE International Conference}} on {{Software Maintenance}} and {{Evolution}} ({{ICSME}})}, }
RE
NoRBERT: Transfer Learning for Requirements Classification

Tobias Hey, Jan Keim, Anne Koziolek, and Walter F. Tichy

In 2020 IEEE 28th International Requirements Engineering Conference (RE), Aug 2020


Classifying requirements is crucial for automatically handling natural language requirements. The performance of existing automatic classification approaches diminishes when applied to unseen projects because requirements usually vary in wording and style. The main problem is poor generalization. We propose NoRBERT that fine-tunes BERT, a language model that has proven useful for transfer learning. We apply our approach to different tasks in the domain of requirements classification. We achieve similar or better results (F1-scores of up to 94%) on both seen and unseen projects for classifying functional and non-functional requirements on the PROMISE NFR dataset. NoRBERT outperforms recent approaches at classifying non-functional requirements subclasses. The most frequent classes are classified with an average F1-score of 87%. In an unseen project setup on a relabeled PROMISE NFR dataset, our approach achieves an improvement of 15 percentage points in average F1-score compared to recent approaches. Additionally, we propose to classify functional requirements according to the included concerns, i.e., function, data, and behavior. We labeled the functional requirements in the PROMISE NFR dataset and applied our approach. NoRBERT achieves an F1-score of up to 92%. Overall, NoRBERT improves requirements classification and can be applied to unseen projects with convincing results.
@inproceedings{hey_norbert_2020, title = {{{NoRBERT}}: {{Transfer Learning}} for {{Requirements Classification}}}, shorttitle = {{{NoRBERT}}}, booktitle = {2020 {{IEEE}} 28th {{International Requirements Engineering Conference}} ({{RE}})}, author = {Hey, Tobias and Keim, Jan and Koziolek, Anne and Tichy, Walter F.}, date = {2020-08}, pages = {169--179}, issn = {2332-6441}, doi = {10.1109/RE48521.2020.00028}, eventtitle = {2020 {{IEEE}} 28th {{International Requirements Engineering Conference}} ({{RE}})}, }