1.1. Time and Place
The April defense session will take place on Friday, April 25, 2025, at 10:00 AM. You can join the Zoom meeting using the information below.
Join Zoom Meeting
https://tum-conf.zoom-x.de/j/61786165009?pwd=Zkg0R2VvbWJHd28vbmJCa2RHdkZEQT09
Passcode: i6defense
Hosted by Genghang Zhuang (zhuang@in.tum.de)
1.2. Schedule
10:00 - 10:20 Kaan Durmaz (BA Thesis)
Title: Towards Robust Dense Prediction using Multimodal Fusion
Advisor: Hu Cao
Keywords: Multimodal sensor fusion, Vision language model
Abstract: Multimodal sensor fusion plays a crucial role in enhancing the perception of autonomous vehicles in challenging environments such as low light or adverse weather conditions. Text-guided image fusion has recently emerged as a promising approach for combining these modalities while adapting to complex degradation conditions. Building upon the Text-IF framework, this thesis explores how to more effectively leverage textual prompts throughout the fusion process of visible and thermal images. We identify key limitations in the original model, including shallow prompts and text fusion, unguided refinement, and static loss weighting. To address these, we propose several improvements: a prompt engineering strategy that enhances the semantic richness of textual inputs, an attention-based text fusion module for deeper image-text alignment, a text-conditioned loss formulation that adapts to task semantics, and a Prompt-Guided Residual Refinement Module (PG-RRM) that integrates text features throughout the refinement pipeline. Extensive experiments on the EMS dataset demonstrate that our model improves both visual quality and degradation handling across a range of conditions. This work highlights the potential of fully using prompts as a dynamic source of semantic guidance, laying the groundwork for more adaptable and context-aware multimodal fusion systems.
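As a rough illustration of how an attention-based text fusion module of this kind can work, the sketch below lets flattened image features attend to prompt embeddings via cross-attention; the module name, dimensions, and structure are illustrative assumptions, not the thesis implementation.

```python
# Hypothetical cross-attention text fusion: image tokens (queries) attend to
# text-prompt embeddings (keys/values). Illustrative only.
import torch
import torch.nn as nn

class CrossAttentionTextFusion(nn.Module):
    def __init__(self, img_dim=256, text_dim=512, num_heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, img_dim)        # align text to image channels
        self.attn = nn.MultiheadAttention(img_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(img_dim)

    def forward(self, img_tokens, text_tokens):
        # img_tokens:  (B, N_img, img_dim)  flattened visible/thermal fusion features
        # text_tokens: (B, N_txt, text_dim) prompt embeddings from a text encoder
        text = self.text_proj(text_tokens)
        fused, _ = self.attn(query=img_tokens, key=text, value=text)
        return self.norm(img_tokens + fused)                 # residual, text-guided update

# toy usage
module = CrossAttentionTextFusion()
out = module(torch.randn(2, 64, 256), torch.randn(2, 8, 512))
print(out.shape)  # torch.Size([2, 64, 256])
```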
10:20 - 10:45 Muhammad Reza Ar Raz (MA Thesis)
Title: Leveraging LLM Interpretability for Privacy-focused Model Enhancement
Advisor: Mohammadhossein Malmir, Dr. Ahmed Frikha
Keywords: Large Language Models, Privacy Preservation, Sparse Autoencoders, Feature Disentanglement, Neural Networks, Machine Learning, Interpretability
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks but face significant privacy challenges due to their tendency to memorize and potentially leak sensitive information from training data. While existing privacy-preserving approaches such as differential privacy and neuron-level interventions offer some protection, they often result in substantial utility degradation or are limited by the polysemantic nature of neural representations. This thesis introduces PrivacyScalpel, a novel framework for privacy preservation in LLMs using Sparse Autoencoders (SAEs). By leveraging the monosemantic properties of SAEs, our approach enables precise feature-level interventions to mitigate privacy leakage while maintaining model utility. Through extensive experimentation on the Gemma2-2b and Llama2-7b models, we demonstrate that SAE-based interventions achieve superior privacy-utility trade-offs compared to existing approaches. Our framework reduces privacy leakage rates to as low as 0.01% while maintaining over 58% utility on downstream tasks. We propose a methodology to identify the layer that captures the most privacy-related features, enabling targeted and effective interventions. Comparative analysis reveals that our feature-level approach outperforms traditional neuron-level methods and circuit breaker techniques, particularly in terms of intervention precision and scalability. This thesis makes several key contributions: (1) introduces a framework for precise privacy preservation using SAEs, (2) proposes a systematic method to localize layers concentrating privacy-sensitive features in transformer models, (3) demonstrates the effectiveness of feature-level interventions across different model scales, and (4) establishes the robustness of SAE-based methods under significant reductions in training data. These findings advance the state of the art in privacy-preserving machine learning, offering practical solutions for safeguarding sensitive information while preserving model functionality.
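A minimal sketch of a feature-level SAE intervention of the kind described above, assuming a trained sparse autoencoder over one transformer layer's activations and a list of feature indices identified as privacy-related; names and sizes are illustrative assumptions, not the PrivacyScalpel code.

```python
# Encode activations into sparse features, zero the privacy-related ones,
# and decode back into the residual stream. Illustrative only.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=2048, d_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts, ablate_idx=None):
        feats = torch.relu(self.encoder(acts))      # sparse, roughly monosemantic features
        if ablate_idx is not None:
            feats[..., ablate_idx] = 0.0            # ablate privacy-related features
        return self.decoder(feats)                  # reconstructed activations fed back

sae = SparseAutoencoder()
acts = torch.randn(1, 16, 2048)                     # (batch, tokens, d_model)
privacy_features = [12, 987, 4501]                  # hypothetical indices found by probing
cleaned = sae(acts, ablate_idx=privacy_features)
print(cleaned.shape)  # torch.Size([1, 16, 2048])
```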
10:45 - 11:10 Margarita Shibarshina (MA Thesis)
Title: Using Vision-Language Models as a Reward Mechanism in Reinforcement Learning
Advisor: Josip Josifovski, Karan Sharma, Burak Demirbilek
Keywords: Reinforcement Learning, Robotic Manipulation, Vision-Language Models, Simulation, Reward Engineering
Abstract: Modern reinforcement learning for robotic manipulation traditionally relies on handcrafted numeric rewards and extensive domain-specific tuning. This thesis explores the innovative use of Vision–Language Models as semantic reward mechanisms, leveraging their zero-shot capabilities to reduce manual reward design and enhance the overall guidance provided to learning agents. Evaluations across a broad range of scenarios, from standard benchmark control tasks to complex robotic manipulation challenges, demonstrate that VLM-derived rewards can deliver rich semantic feedback. Notably, image-based goal representations excel in tasks that require precise spatial alignment, while textual prompts provide robust support in addressing simpler or more abstract objectives.
Findings from both offline evaluations and real-world experiments, such as robotic bottle-lifting tasks, indicate that the variations in reward signal consistency offer valuable insights into the dynamics of reinforcement learning. These signal fluctuations, which influence the learning process, also present opportunities for further refinement of the approach. By analyzing these effects, the study lays the groundwork for future advancements, including domain-specific fine-tuning, the integration of hybrid reward frameworks that blend minimal numeric shaping with VLM feedback, and the development of temporally aware model architectures. Collectively, these insights highlight the potential of VLM-based rewards to enhance reinforcement learning strategies in robotic manipulation while paving the way for continued research and innovation.
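A minimal sketch of one common way to turn a vision-language model into a reward signal, scoring the current camera frame against a goal description by embedding similarity; the random stand-in embeddings below would come from real VLM encoders, and the exact reward formulation in the thesis may differ.

```python
# Reward as cosine similarity between a frame embedding and a goal embedding
# (goal given as a text prompt or a goal image). Illustrative only.
import torch
import torch.nn.functional as F

def vlm_reward(frame_emb: torch.Tensor, goal_emb: torch.Tensor) -> float:
    """Reward in [-1, 1] from embedding similarity; rescale or shape as needed."""
    return F.cosine_similarity(frame_emb, goal_emb, dim=-1).item()

# toy usage with random stand-ins for the VLM encoder outputs
frame_emb = torch.randn(512)
goal_emb = torch.randn(512)
print(vlm_reward(frame_emb, goal_emb))
```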
11:10 - 11:35 Mu He (MA Thesis)
Title: Semantic Place Recognition Based on Biologically-Inspired Sequence Memory
Advisor: Genghang Zhuang
Keywords:
Abstract: Visual place recognition refers to the capability of a robot to identify previously visited locations using visual sensor data. It plays an essential role in Simultaneous Localization and Mapping (SLAM), where a robot builds a map of an unknown environment while simultaneously estimating its own position within it. In this thesis, we propose a biologically-inspired semantic place recognition approach based on sequence memory. Unlike traditional methods relying solely on image features, our method leverages sequences of detected semantic objects combined with odometry data, encoded as Sparse Distributed Representations (SDR). Inspired by the functioning of cortical columns in the mammalian neocortex, the proposed sequence memory utilizes Hierarchical Temporal Memory (HTM) to learn and recognize sequences of semantic objects observed during navigation. We integrate odometry information via a grid-cell model inspired by rodents, allowing robust place recognition even when visual conditions change or become ambiguous. Experiments conducted in a simulated CARLA environment indicate that our method achieves reasonable precision and recall in visual place recognition tasks, enabling effective map construction with acceptable errors.
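For readers unfamiliar with Sparse Distributed Representations, the toy sketch below shows the representation and the overlap score typically used to compare SDRs; it does not reproduce the HTM sequence memory or the grid-cell odometry model used in the thesis.

```python
# An SDR is a large, mostly-zero binary vector; here it is stored as the set of
# active bit indices. Similar inputs share many active bits (high overlap).
import random

def random_sdr(size=2048, active_bits=40, seed=None):
    """Return an SDR as a set of active bit indices."""
    rng = random.Random(seed)
    return set(rng.sample(range(size), active_bits))

def overlap(sdr_a, sdr_b):
    """Number of shared active bits; high overlap suggests the same place/object."""
    return len(sdr_a & sdr_b)

place_a = random_sdr(seed=1)
place_b = random_sdr(seed=2)
print(overlap(place_a, place_a), overlap(place_a, place_b))  # self-overlap vs. unrelated
```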
11:35 - 12:00 Christian Kellinger (MA Thesis)
Title: Spike-Based Broad Learning System: A Study towards Autonomous Navigation
Advisor: Genghang Zhuang
Keywords:
Abstract: The Broad Learning System (BLS) is a recent concept that builds upon the fast, single-iteration training convergence of Random Vector Functional Link Neural Networks (RVFLNN). It aims to add incremental optimization to already trained networks, saving time and resources compared to re-training the network from scratch. This is achieved by designing algorithms that can extend the training data and certain model hyperparameters of an already trained network as desired. The core element of these algorithms is Greville's method, which provides a way to update the Moore-Penrose pseudoinverse A† incrementally when the base matrix A is extended horizontally or vertically, without re-computing it from scratch. Building on BLS, this thesis introduces a first version of the Spiking Broad Learning System (SBLS). The goal is to lay the groundwork for examining whether the advantages of BLS, namely fast training time and the ability for incremental optimization, can be combined with the advantages of spiking neurons, namely exploiting event-based data and saving power on special, so-called neuromorphic hardware compared to conventional artificial neurons, with the end goal of being used in autonomous driving tasks. To this end, SBLS was implemented as a Python class in PyTorch style. To enable future use, the focus was on making it universal and simple to understand, so that further experiments, changes, and extensions to the code are easy to set up. Using this code, two series of experiments were conducted to test the capabilities of SBLS and lay the groundwork for future improvements. The first series uses SBLS as a classifier on the MNIST data set, while the second series tests the capabilities of SBLS in control, using an autonomous driving task as an example. SBLS easily reached 85% to 90% accuracy on MNIST without extensive tuning of the hyperparameters, suggesting there is still room for improvement. The control experiments showed SBLS's ability to adapt to an end-to-end control training set, lowering the mean squared error continuously with more training data. At the same time, they also revealed problems with basic concepts of SBLS and its predecessor BLS, which are critical to solve for the advancement of SBLS.
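For reference, the column-append case of Greville's method mentioned above can be sketched in a few lines of NumPy; variable names are illustrative, and the SBLS implementation itself is written in PyTorch style.

```python
# Greville's column update: given A and its pseudoinverse A_pinv, compute the
# pseudoinverse of [A a] without recomputing it from scratch. Illustrative only.
import numpy as np

def greville_append_column(A, A_pinv, a, tol=1e-12):
    a = a.reshape(-1, 1)
    d = A_pinv @ a                        # projection of the new column onto range(A)
    c = a - A @ d                         # component of a outside range(A)
    if np.linalg.norm(c) > tol:
        b = c.T / (c.T @ c)               # c nonzero: b is the pseudoinverse of c
    else:
        b = (d.T @ A_pinv) / (1.0 + d.T @ d)
    return np.vstack([A_pinv - d @ b, b])

# verify against a full recomputation
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))
a = rng.standard_normal(6)
B = np.hstack([A, a.reshape(-1, 1)])
incremental = greville_append_column(A, np.linalg.pinv(A), a)
print(np.allclose(incremental, np.linalg.pinv(B)))  # True
```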
12:00 - 12:25 Kaan Kalaycıoğlu (MA Thesis)
Title: Referring Traffic Object Detection and Multi-Object Tracking
Advisor: Xingcheng Zhou
Keywords:
Abstract: This thesis addresses the task of Referring Multi-Object Tracking (RMOT) from fixed infrastructure camera viewpoints, a critical yet underexplored setting for intelligent transportation systems (ITS). We introduce a scalable, modular annotation pipeline that combines Vision-Language Models (LlavaNext, Qwen-VL, Llama3.2-Vision) and Large Language Models (Qwen-7B-Instruct, GPT-4o-mini) to automatically generate structured RMOT annotations from raw urban traffic videos. Applying this pipeline to 100 sequences from the TUMTraffic-VideoQA dataset, we construct Refer-TUMTraffic Mini (RTT-Mini), a novel RMOT benchmark featuring 43,679 annotated frames, 395 object tracks, and 2,247 filtered referring expressions. Fine-tuning the TempRMOT model on RTT-Mini yields significant improvements over zero-shot baselines, demonstrating the value of domain-specific adaptation. Comprehensive evaluation highlights the strengths and current limitations of both the model and the automated annotation framework, offering insights into future directions for language-grounded tracking in realistic ITS contexts.
The I6 defense day is held monthly, usually on the last Friday of each month. The standard talk formats are:
| Type | Presentation time | Q&A time |
| --- | --- | --- |
| Initial topic presentation | 5 min | 5 min |
| BA thesis | 15 min | 5 min |
| Guided Research | 10 min | 5 min |
| Interdisciplinary Project | 15 min | 5 min |
| MA thesis | 20 min | 5 min |
More information on preparing presentations can be found in the Thesis Submission Guidelines.