Impressions and Links from
ICML 2024 (online)
and
KI 2024.

I had the great pleasure of taking part in ICML 2024 (the 41st International Conference on Machine Learning), online,
and KI 2024 (the 47th German Conference on Artificial Intelligence).

See ICML & KI.

Tried to follow as many talks as possible. But, well, these notes are, of course, in no way, shape or form complete...
Rather, these notes were written on conference nights, as my way of keeping track of the events that I attended. And as a way of storing links and references for future reference.

Below you will find impressions from the conferences, and links for further reading.

1. ICML 2024.

Followed ICML 2024 online (July 23rd - July 27th).

1.1. Presentations Wednesday, July 24th.

1.1.1. LLMs: Code and Arithmetic.
Chain of Code: Reasoning with a Language Model-Augmented Code Emulator.

Chengshu Li talked about ''Chain of Code: Reasoning with a Language Model-Augmented Code Emulator''.

In this work, we propose Chain of Code (CoC)...
The key idea is to encourage LLMs to format semantic sub-tasks in a program, as flexible pseudocode, so that the interpreter can explicitly catch undefined behaviors, also to be handled by an LLM (as an ''LMulator'').

''Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks''.
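The interplay between interpreter and ''LMulator'' can be sketched roughly as below. This is only a minimal illustration of the idea as described in the talk, not the authors' implementation: the names `run_chain_of_code` and `fake_lmulator` are hypothetical, and the ''LMulator'' is a hard-coded stub standing in for a real language model call.

```python
def fake_lmulator(line, state):
    """Hypothetical stand-in for a language model call.

    A real LMulator would ask an LLM to simulate the line. Here one answer
    is hard-coded so that the sketch runs offline.
    """
    if line.startswith("is_fruit = "):
        return {"is_fruit": state["item"] in {"apple", "banana"}}
    raise ValueError(f"stub cannot simulate: {line}")


def run_chain_of_code(lines, lmulator):
    """Run each line with Python; hand lines that raise to the lmulator."""
    state = {}
    trace = []  # which executor handled each line
    for line in lines:
        try:
            exec(line, {}, state)
            trace.append("python")
        except Exception:
            # The interpreter caught undefined behaviour: let the LM fill in.
            state.update(lmulator(line, state))
            trace.append("lm")
    return state, trace


program = [
    "item = 'apple'",
    "is_fruit = is_a_fruit(item)",  # undefined function: a semantic sub-task
    "count = 1 if is_fruit else 0",
]
state, trace = run_chain_of_code(program, fake_lmulator)
print(state["count"], trace)
```

The point of the design is that ordinary lines keep the precision of real execution, while only the lines the interpreter cannot run are delegated to the model.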

1.1.2. Reinforcement Learning.
Position: Automatic Environment Shaping is the Next Frontier in RL.

Younghyo Park talked about ''Automatic Environment Shaping is the Next Frontier in RL''.

Many roboticists dream of presenting a robot with a task in the evening, and returning the next morning to find the robot capable of solving the task. What is preventing us from achieving this? Sim-to-real reinforcement learning (RL) has achieved impressive performance on challenging robotics tasks, but requires substantial human effort to set up the task in a way that is amenable to RL. It's our position that algorithmic improvements in policy optimization and other ideas should be guided towards resolving the primary bottleneck of shaping the training environment, i.e., designing observations, actions, rewards and simulation dynamics. Most practitioners don't tune the RL algorithm, but other environment parameters to obtain a desirable controller. We posit that scaling RL to diverse robotic tasks will only be achieved if the community focuses on automating environment shaping procedures.

Automatic Environment Shaping is the Next Frontier in RL. ICML 2024.

''We hope this will motivate an increased focus in RL research on communicating and evaluating environment-shaping measures that impact performance''.

1.1.3. Invited Talk.
Machine Learning Opportunities for the Next Generation of Particle Physics.

Javier Duarte talked about ''Machine Learning Opportunities for the Next Generation of Particle Physics''.

At the CERN Large Hadron Collider, protons collide 40 million times per second at the highest energies achievable in the lab, probing the microscopic nature of subatomic particles on the smallest length scales. These proton-proton collisions give rise to thousands of particles per collision... This avalanche of data will continue to grow in the next generation of experiments, posing tremendous challenges. Machine learning (ML) methods are increasingly essential to analyze this data while overcoming these challenges.

Among many other interesting topics, the talk introduced us to (particle physics) use cases for
''Accelerated AI Algorithms for Data-Driven Discovery'' [1].

Where the Accelerated AI Algorithms for Data-Driven Discovery (A3D3) Institute is a multi-disciplinary and geographically distributed entity, with the primary mission to lead a paradigm shift in the application of real-time artificial intelligence (AI) at scale to advance scientific knowledge and accelerate discovery [1].

Where, among those helping in the efforts to use ML in particle physics, we find institutions like the European Center of Excellence in Exascale Computing ''Research on AI- and Simulation-Based Engineering at Exascale'' (CoE RAISE) [2].

But it is of course possible for everyone to give a helping hand...
E.g. by starting on the TrackML Particle Tracking Challenge [3].

A team of Machine Learning experts and physics scientists working at CERN, has partnered with Kaggle to answer the question: Can machine learning assist high energy physics in discovering and characterizing new particles?

Specifically, in this competition, you’re challenged to build an algorithm that quickly reconstructs particle tracks from 3D points left in the silicon detectors.

Indeed, no time to waste, it is all about getting started here...!

In the process one might even end up constructing an ML ''foundation model'' for particle physics. That is, a multi-dataset and multi-task machine learning method that, once pre-trained, can be fine-tuned for a large variety of downstream applications...
(Just) Like ''Omni-Jet-α'' [4].

Important work as ML has the promise to help us get better results from future particle colliders.

AutoEncoders that detect anomalies, when they can't quite reproduce (such) inputs based on what they have been trained on. ICML 2024.

   Many ML techniques are helpful when it comes to potentially
   discovering new physics.
   E.g. autoencoders can also help detect particle physics anomalies
   (as the autoencoders can't quite reproduce such inputs,
   based on what they were trained on earlier).
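The reconstruction-error idea can be illustrated with a toy example. The sketch below is my own assumption-laden stand-in, not anything shown in the talk: instead of a trained neural autoencoder it uses a linear 2-D to 1-D ''autoencoder'' (the mean plus the principal direction, i.e. PCA), and the data, the threshold rule and all names are made up for illustration.

```python
import math
import random


def fit_linear_autoencoder(points):
    """Fit a 1-D bottleneck for 2-D points: the mean plus the principal direction."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # angle of the principal axis
    return (mx, my), (math.cos(theta), math.sin(theta))


def reconstruction_error(point, model):
    """Encode to one number, decode back, and measure what was lost."""
    (mx, my), (ux, uy) = model
    dx, dy = point[0] - mx, point[1] - my
    code = dx * ux + dy * uy       # encode: 2-D -> 1-D
    rx, ry = code * ux, code * uy  # decode: 1-D -> 2-D (centred)
    return math.hypot(dx - rx, dy - ry)


rng = random.Random(0)
# ''Known physics'': training points close to the line y = 2x.
xs = [rng.uniform(-1, 1) for _ in range(200)]
train = [(x, 2 * x + rng.gauss(0, 0.05)) for x in xs]

model = fit_linear_autoencoder(train)
# Simple rule (an assumption): anything worse than the worst training error is flagged.
threshold = max(reconstruction_error(p, model) for p in train)

typical = (0.5, 1.0)   # follows the pattern seen in training
anomaly = (0.5, -1.0)  # does not
for name, point in (("typical", typical), ("anomaly", anomaly)):
    print(name, reconstruction_error(point, model) > threshold)
```

A real analysis would of course use a deep autoencoder and a statistically motivated threshold, but the mechanism is the same: inputs unlike the training data reconstruct badly.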

Indeed, a very interesting talk.

1.2. Presentations Thursday, July 25th.

1.2.1. Invited talk: What robots have taught me about machine learning.

Chelsea Finn talked about ''What robots have taught me about machine learning''.

''Moravec's Paradox'' states that while:

''A computer can beat the world's most brilliant chess player, can perform any number of sophisticated tasks of logic, the ability of technology to perform even the most basic physical tasks is significantly limited'' [5].

Or:

That it is easy to train computers to do things that humans find hard, e.g. play chess, but it is hard to train them to do things humans find easy, like walking and image recognition [6].

Indeed, machine learning has made many real advances in recent years:

Supervised learning works well.
And there have been significant advances when it comes to architectures, setting learning objectives, using optimizers and having reliable engineering practices for debugging, in order to improve performance.

Still.
Training robots is a hard thing.
It is not easy to:

Generalize broadly and beyond training distribution.
And there aren't that many (ML-like) datasets around for robotics
that can help the robot learn how to e.g. ''tear off tape and put it on a box''.
And it is not easy to get good (robot) training data cheaply.

Training general-purpose household robots, that can do things such as vacuuming, doing laundry, and watering plants - is difficult.

But, meet Stanford's ''Mobile ALOHA'' [7]:

The researchers strap themselves into a teleoperation system directly behind the robot’s arms and puppeteer the robot through the desired actions...
Once Mobile ALOHA is operated through a task in a set environment about 50 times, powerful imitation learning algorithms help it make the leap to doing that task independently (These algorithms are similar to the large language models behind popular chatbots, but for physical action instead of words) [7].

(And) Using (human) language corrections to help update
a robot's high-level policies would, of course, be even better.

Enter ''Yell At Your Robot'':

The longer the task is, the more likely it is that some stage will fail. Can humans help the robot to continuously improve its long-horizon task performance through intuitive and natural feedback? In this work, we make the following observation: High-level policies that index into sufficiently rich and expressive low-level language-conditioned skills can be readily supervised with human feedback in the form of language corrections. We show that even fine-grained corrections, such as small movements (''move a bit to the left''), can be effectively incorporated into high level policies, and that such corrections can be readily obtained from humans observing the robot and making occasional suggestions [8].

I.e.

You can productively yell at your robot now.
The robot can improve from language feedback, by fine-tuning its high-level instruction policy.

More at the IRIS lab website, here.

An awesome talk!

1.2.2. Continuous learning.
LLM-Style Sequence Compression for Learning Temporal Action Abstractions in Control.

Andrey Kolobov talked about ''LLM-Style Sequence Compression for Learning Temporal Action Abstractions in Control''.

On X Kolobov writes:

LLMs are models for text generation. To a computer, text is a sequence of byte values -- let's call these values ''codes''. Thus, the problem LLMs are trying to solve can be thought of as a sequence of ''which code to emit next'' decisions (Say, over a codebook of size 256) [9].

Now let's turn to continuous control. Like text generation, control involves making a sequence of highly granular decisions (''which control inputs to send to the robot's motors in the current state'', every 30-40ms in the case of robotic manipulation) [9].

Behavior cloning (BC), the most common method for pretraining large models for control domains such as robotics, learns from trajectory data in a way similar to LLM training via teacher forcing using text data [9].

So, can we come up with a notion of tokens that would correspond to blocks of state-action pairs, analogous to LLMs' text tokens, and treat control as a problem of sequentially choosing tokens decodable into sequences of several low-level decisions? [9].

The PRISE method builds on this exact intuition by transplanting LLMs' mechanism for constructing tokens to BC-based control model pretraining. LLM training computes tokens by sweeping over the training text corpus, identifying frequent code sequences in it, and assigning a token (basically, an ID) to each [9].

A robotic manipulation policy for virtually any task consists of switching among a handful of low-level primitives: free-space moves (FSM), lowering and raising the end-effector (LEEF/REEF), closing the gripper around an object (CGR), etc.

Say:
FSM-FSM-FSM-FSM-LEEF-LEEF-CGR.
...
[9].

And, then, from this, learn powerful action abstractions.
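The token-construction step can be sketched with a byte-pair-encoding style merge over action codes. Note the assumptions: this is not the PRISE code, the quantization of continuous actions into discrete codes is skipped entirely, and the primitive labels from the example above are simply used as the ''codes''. The function names are hypothetical.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Count adjacent code pairs across all sequences; return the most common one."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair, token):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_action_tokens(sequences, num_merges):
    """Byte-pair-encoding style merges over discrete action codes."""
    vocab = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        token = pair[0] + "+" + pair[1]  # the new token is just an ID for the pair
        vocab.append(token)
        sequences = [merge_pair(s, pair, token) for s in sequences]
    return vocab, sequences


demos = [
    ["FSM", "FSM", "FSM", "FSM", "LEEF", "LEEF", "CGR"],
    ["FSM", "FSM", "LEEF", "LEEF", "CGR"],
]
vocab, compressed = learn_action_tokens(demos, num_merges=2)
print(vocab)
print(compressed)
```

After a few merges, each token stands for a short run of low-level decisions, which is the kind of temporal action abstraction a higher-level policy can then choose between.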

Clever, indeed.

1.2.3. Test of Time Award.
DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition.

Trevor Darrell talked about ''DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition''.

Before deep learning redefined computer vision (AlexNet, 2012), things like Haar features and Haar cascades were the kind of tools you would use for (visual) object detection [10] (If an image contains certain features in the right places, then it might be a ''face'', a ''car'' - or whatever we might be looking for).

Now, Haar classifiers can still be found as part of the OpenCV package. Along with other (classic) techniques such as HOG (Histogram of Oriented Gradients):

Where, the HOG algorithm divides an image into small cells, computes each cell's gradient orientation and magnitude, and then aggregates the gradient information into a histogram of oriented gradients. These histograms describe the image features and detect objects within an image [11].
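The steps in that description can be written out in a few lines. This is a simplified sketch under my own assumptions (4x4 pixel cells, 9 unsigned orientation bins, no block normalisation and no interpolation between bins), not the OpenCV implementation:

```python
import math


def hog_cell_histograms(image, cell=4, bins=9):
    """Per-cell histograms of unsigned gradient orientation, weighted by magnitude.

    `image` is a list of equal-length rows of grey values. Block normalisation,
    which full HOG pipelines add, is left out to keep the sketch short.
    """
    h, w = len(image), len(image[0])
    rows, cols = h // cell, w // cell
    hists = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(rows * cell):
        for x in range(cols * cell):
            # Central differences, clamped at the image border.
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned orientation
            b = min(int(angle / (180.0 / bins)), bins - 1)
            hists[y // cell][x // cell][b] += magnitude
    return hists


# A vertical edge: dark left half, bright right half.
image = [[0] * 4 + [255] * 4 for _ in range(8)]
hists = hog_cell_histograms(image)
print(hists[0][0])
```

For the vertical edge all gradient energy lands in the first (horizontal-gradient) bin, which is exactly the kind of signature a classifier on top of HOG features picks up.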

See my 2018 experiments here.

But after AlexNet we then, of course, moved to new worlds, with pretrained deep neural models
that you could finetune (for your own specific needs).

And along came frameworks like Caffe:

Caffe supports many different types of deep learning architectures geared towards image classification and image segmentation. It supports CNN, RCNN, LSTM and fully-connected neural network designs [11], [12].

Where, in April 2017, Facebook announced Caffe2, which, at the end of March 2018, was merged into PyTorch.

DeCAF showed the surprising effectiveness of transfer learning using relatively ''frozen'' AlexNet features.
DeCAF was a precursor to Caffe, which became the de facto standard for deep learning in academia and industry for a period of time.

All worth remembering as we move on to models like LLARVA,
where vision is connected to action spaces for robot learning.

A great talk.

1.2.4. NExT-GPT: Any-to-Any Multimodal LLM.

Shengqiong Wu talked about ''NExT-GPT: Any-to-Any Multimodal LLM''.

While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides, they mostly fall prey to the limitation of only input-side multimodal understanding, without the ability to produce content in multiple modalities.
...
NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, image, video, and audio.

I.e. as humans, we continuously engage in the intricate process of receiving and producing cross-modal content.
Now:

With multimodal LLMs it becomes possible to model world knowledge in a more human-like way.
Behave, interact and reason like humans.

Where (it is proposed that) external encoders and decoders (e.g. for vision) are added to existing LLM architectures in order to perceive and generate visual information.
Building on the rich ''zoo'' of existing LLMs.

A timeline of existing large language models (having a size larger than 10B) in recent years.

Indeed, going multi-modal doesn't make the LLM world any less complex...

1.2.5. Multimodal Learning.
MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark.

Dongping Chen talked about ''MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark''.

Assessing the utility of MLLMs presents considerable challenges, primarily due to the absence of multimodal benchmarks that align with human preferences. Drawing inspiration from the concept of LLM-as-a-Judge within LLMs, this paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs in assisting judges across diverse modalities.

A closer examination reveals persistent challenges in the evaluative capacities of LLMs, including diverse biases, hallucinatory responses, and inconsistencies in judgment.
...
We advocate for additional efforts dedicated to supporting the continuous development within the domain of MLLM functioning as judges [13].

Human preference goes beyond ''merely'' accuracy in evaluating a response.
And the work here aims to assess MLLM performance from a comprehensive standpoint,
rather than evaluating individual MLLM performance on specific datasets.

Again, a complex world we are entering here.
Still, surely, this is a good first step towards evaluating MLLM responses!

1.3. Workshop.
Foundation Models in the Wild. Friday, July 26th.

Friday, I followed the ICML 2024 Workshop on Foundation Models in the Wild.

The workshop was dealing with research questions such as:

How can we leverage the comprehensive knowledge in FMs to adapt them for specific domains,
such as drug discovery, education and clinical health?
How can FMs work reliably outside their training distribution?
And how can we address issues like hallucination and privacy?
How can we ensure that the deployment of FMs preserves safety,
ethics and fairness within society, safeguarding against biases and unethical use?
How can FMs tackle challenges in practical application,
such as system constraints, computational costs, data acquisition barriers?

AI's in the loop.

In the opening remarks for the workshop, and in the first talk of the day, it was noted
that in many ways LLMs give (even) more focus to data in ML,
and less focus to the models (LLMs come pretrained),
which leads to a more ''data-centric view'' in ML.

Which might also change our perspective on what it means to have a ''human in the loop'',
labelling data (output from the LLM) and helping out with the reasoning part of the output...
This might work well in crowdsourcing scenarios,
but it might not work so well when the humans, in the loop, are doctors etc.
Which (then, in the future) might lead to more ''AI's in the loop''.

Indeed, interesting times ahead.

1.3.1. Foundation model paradigm lends itself to a data-centric view.

David Alvarez-Melis, Harvard SEAS
(School of Engineering and Applied Sciences), GitHub,
talked about how foundation models imply a new learning paradigm.

I.e. foundation models imply a new learning paradigm:

Data: Heterogeneous, ''fluid'', modifiable.
Models: ''Rigid'', multi-stage-training, often fixed.

Paradigm lends itself to a data-centric view:

Models as ''data-modifiers''.
Data can and should be optimization variables.

Moving from ''classic'' ML to Foundation Models ML (in the ''wild'') gives a more ''data-centric'' view:

''Classic'' Machine Learning:

Data is fixed or subject to extrinsic change.
Single model universe.
Model is trained from scratch, malleable.
Adaptation mediated by model.
Model is the ''secret sauce''.

Foundation Model learning in the ''wild'':

Data is ''fluid'', modifiable.
Model zoos.
Models are ''rigid'', can't always be modified.
Adaptation mediated by data.
Data is the ''secret sauce''.

Meaning that ML models become ''data modifiers''...
Where data can be, and should be, thought of as an optimization variable...

An interesting change of perspective for sure!

1.3.2. Towards Foundation Models for Vehicle Routing Problems.

Federico Berto, KAIST, talked about ''Towards Foundation Models for Vehicle Routing Problems''.

Ai4CO is an ''Open research group in Artificial Intelligence (AI) for Combinatorial Optimization (CO)''.
And here we were given an introduction to RouteFinder
(''Towards Foundation Models for Vehicle Routing Problems'').

I.e. ''Foundation models for text (LLMs) and images have been on the rise,
but few explore generalized VRP problems''.

Enter ''RouteFinder'':

RouteFinder, a framework for developing foundation models for Vehicle Routing Problems (VRPs).
Our key idea is that a foundation model for VRPs should be able to model variants,
by treating each variant as a subset of a larger VRP problem, equipped with different attributes [14].
...

For more, see also the article ''Large Language Models as Hyper-Heuristics for Combinatorial Optimization'':

The omnipresence of NP-hard combinatorial optimization problems (COPs) compels domain experts to engage in trial-and-error heuristic design. The long-standing endeavor of design automation has gained new momentum with the rise of large language models (LLMs) [15].

Clever.

Code is here.

1.3.3. Lightweight, Pre-trained Transformers for Remote Sensing Timeseries.

Hannah Kerner, Arizona State University,
talked about ''Lightweight, Pre-trained Transformers for Remote Sensing Timeseries''.

As part of the NASA Harvest & NASA Acres programs,
her focus is to:

Use foundation models that:

Generalize to a wide diversity of tasks.

Are flexible enough to be able to use diverse input sensors and shapes.

Are efficient.

Are easy to get started with.

Where such Foundational Models will also be relevant for projects like:

Skylight (Protect oceans).
Satlas (Allen Institute for AI).
Global Plastic Watch.
Global Forest Watch.
Open buildings
(Support social good applications. Where building footprints are useful for a range of important applications, from population estimation, urban planning and humanitarian response).

In the end, helping us to understand ''What impact an extreme event X has on cultivation in region Y'' etc.

Where, in many of these projects, they are still using older, more standard,
ML techniques like random forests.
So, it is expected that Foundational Models can be (generally) useful.

Enter: Presto (Lightweight, Pre-trained Transformers for Remote Sensing Timeseries):

Presto excels at a wide variety of globally distributed remote sensing tasks and performs competitively with much larger models while requiring far less compute. Presto can be used for transfer learning or as a feature extractor for simple models, enabling efficient deployment at scale.

And, as Presto is pretty lightweight,
it took under 6 minutes on a 2017 MacBook Pro's CPU to fine-tune Presto.

All pretty cool.

1.3.4. Data is driving progress in AI.

Pang Wei Koh talked about
''how we can make machine learning systems more useful to society,
and more reliable in real-world application contexts''.

E.g.

Today's foundation models can access the sum total of human knowledge through natural language. But how do we really harness this knowledge, and adapt these models to particular domains and applications?

About Data:
As data has such a central role we might wonder,
if we can use synthetic data to train and improve the best models?

Self-Instruct (Wang et al).
SynCLR (Tian et al).

(Perhaps, systems can) Generate their own input data, and self-improve?
Like SELF (self evolution with language feedback):

They autonomously generate responses to unlabeled instructions, refine these responses interactively, and use the refined and filtered data for iterative self-training, thereby progressively boosting their capabilities [16].

Certainly:

Synthetic data allows controllability (e.g. sampling more images from target classes).

For image classification, currently: Synthetic data < retrieved data.

Take away:

Definitely not ruling out using synthetic data.
(But) Open question: When and why (how much) should we use synthetic data...

About (making) ML systems more reliable in real-world application contexts:
Also, see RAG (MlCon, Berlin 2023. Section 1.5.1).

We want language models that we can trust, so they must be able to give:

Attribution (ascribing a work, with good sources, links).
Up-to-date information (new links).
Use reliable sources (links).

Here, it is suggested that, RAG techniques will be helpful.
And:

Retrieval based models will allow us to reason about data, as first-class citizens.
Future directions: Supporting attribution, prioritizing sources, updating knowledge.
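To make the attribution point concrete, here is a toy retrieve-then-cite sketch. Everything in it is an assumption for illustration: the two-passage ''corpus'' is paraphrased from these notes, retrieval is plain word overlap rather than a learned retriever, and the final language model call is left out altogether.

```python
def tokenize(text):
    """Lowercased word set with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}


def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query; keep the top k with any overlap."""
    q = tokenize(query)
    scored = [(len(q & tokenize(d["text"])), d) for d in documents]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda sd: (-sd[0], sd[1]["source"]))
    return [d for _, d in scored[:k]]


def build_prompt(query, passages):
    """Number each passage so that an answer can cite its sources as [1], [2], ..."""
    context = "\n".join(
        f"[{i}] ({p['source']}) {p['text']}" for i, p in enumerate(passages, 1)
    )
    return f"Answer using only the sources below and cite them.\n{context}\nQuestion: {query}"


# Hypothetical mini corpus, paraphrased from these notes.
documents = [
    {"source": "section 1.3.3",
     "text": "Presto is a lightweight, pre-trained transformer for remote sensing timeseries."},
    {"source": "section 1.3.2",
     "text": "RouteFinder is a framework for developing foundation models for vehicle routing problems."},
]

query = "What is Presto?"
passages = retrieve(query, documents, k=1)
prompt = build_prompt(query, passages)
print(prompt)
```

Because the sources travel with the text into the prompt, the answer can point back to them, and updating knowledge becomes a matter of updating the document store rather than retraining the model.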

Indeed, all in all, super interesting, and certainly thoughts and material to consider for future classes in Deep Learning...
and beyond...

1.4. Workshop.
Large Language Models and Cognition. Saturday, July 27th.

Workshop (LLMs and Cognition) Program.

Questions to consider in this workshop:

Where do LLMs stand in terms of performance in cognitive tasks, such as reasoning, navigation, planning and theory of mind?
What are the fundamental limits of language models with respect to cognitive abilities?
How do LLMs fine-tuned on specific tasks end-to-end compare with augmented LLMs coupled with external modules?
What are the similarities between mechanistic interpretability approaches in AI and in neuroscience? What do they tell us about similarities and differences between LLMs and human brains?
How can we improve existing benchmarks and evaluation methods to rigorously assess cognitive abilities in LLMs?
Can multimodal and multiagent approaches address some of the current limits of LLMs (in cognitive tests)?

Challenges:

Evaluations.
Hallucinations.
Reasoning and planning.
Finetuning + Synthetic data.
Scaling and emergence.

1.4.1. LLMs are system 1's for humanity, and that ought to be enough...

Subbarao Kambhampati talked about ''LLMs are system 1's for humanity, and that ought to be enough...''.

About ''System 1'' and ''System 2'':

System 1 is fast, automatic, and intuitive, operating with little to no effort. This mode of thinking allows us to make quick decisions and judgements based on patterns and experiences. In contrast, System 2 is slow, deliberate, and conscious, requiring intentional effort [17].
...
To survive physically or psychologically, we sometimes need to react automatically to a speeding taxi as we step off the curb or to the subtle facial cues of an angry boss. That automatic mode of thinking, not under voluntary control, contrasts with the need to slow down and deliberately fiddle with pencil and paper when working through an algebra problem [18].

According to Kambhampati: ''LLMs can't plan'':

LLMs can't plan in autonomous modes.
Claims to the contrary are questionable.
CoT, ReACT, fine-tuning etc. don't help that much, as they don't generalize enough.
They can't improve by self-verification (since they can't self-verify).
Having humans iteratively prompt is an invitation for ''Clever Hans'' effects.

(But) LLMs can support planning.

LLMs can be used in conjunction with external verifiers and solvers.
Guess plans.
Help elaborate problem specifications.
Translate formats.

Using a ''blocks world'' example, Kambhampati talked about an (impossible) task:

If block C is on top of block A, and block B is separately on the table,
can you tell me how I can make a stack of blocks, with block A on top of B, and B on top of block C, without moving C?

Where LLMs typically will begin hallucinating (About specs, physics or goal), in order to ''solve'' a problem, without understanding that the problem is indeed unsolvable.
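This is where an external solver or verifier earns its keep. As a small illustration (my own sketch, not something shown in the talk), a plain breadth-first search over block configurations can establish that the task has no plan when C must stay put, and find the three-move plan when C may be moved. The state encoding and function names are assumptions.

```python
from collections import deque


def successors(state, frozen):
    """Yield (move, next_state) pairs; `state` maps each block to what it rests on."""
    clear = [b for b in state if b not in state.values()]  # nothing on top
    for block in clear:
        if block in frozen:
            continue  # this block may not be moved
        for target in clear + ["table"]:
            if target != block and state[block] != target:
                nxt = dict(state)
                nxt[block] = target
                yield (block, target), nxt


def state_key(state):
    """Hashable form of a state, used to avoid revisiting configurations."""
    return tuple(sorted(state.items()))


def plan(start, goal, frozen=()):
    """Breadth-first search; returns a shortest list of moves, or None if no plan exists."""
    queue, seen = deque([(start, [])]), {state_key(start)}
    while queue:
        state, moves = queue.popleft()
        if all(state[b] == on for b, on in goal.items()):
            return moves
        for move, nxt in successors(state, frozen):
            if state_key(nxt) not in seen:
                seen.add(state_key(nxt))
                queue.append((nxt, moves + [move]))
    return None  # the whole reachable state space was searched


start = {"A": "table", "B": "table", "C": "A"}  # C on A, B on the table
goal = {"A": "B", "B": "C"}                     # A on B, B on C

without_moving_c = plan(start, goal, frozen={"C"})
with_moving_c = plan(start, goal)
print(without_moving_c)
print(with_moving_c)
```

The search returns None for the constrained task because A sits under C and can never be picked up, which is precisely the check an LLM guessing plans does not perform on its own.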

LLMs (and generative AI in general) capture the distribution of the data they are trained on.

Style is a distributional property.
(That LLMs are able to learn).
Correctness/factuality is an instance level property
(That LLMs can't guarantee).
Where people tend to think that good style implies good content.
- Which is not necessarily the case.

- Whether or not a prompt results in a ''factual completion'', depends on the prompter knowing enough to tell that the given answer is accurate.

LLMs can't self-critique as they are essentially doing ''approximate retrieval''.
(And, btw, in style based/qualitative tasks, such as writing a good essay, there are no formal notions of correctness).

GOFAI:

Get the domain model.
Have a planner solve the problem.

LLM-AI:

Get the domain model.
Make a trillion Blocksworld problems.
Finetune the LLM with the problems and solutions
(ensure correctness of guess by external validator).

LLMs are trained on everything on the internet, and therefore have ''approximate omniscience''.
But they lack the ability to stitch the recipes together in a way that ensures correct plans.

A great talk, that helps clarify what LLMs can and can't do
(July 2024).

1.4.1.1. ARC Prize for artificial general intelligence research.

Where recent AI-competitions also illustrate that there are many things that LLMs can't do.

The ARC Prize for artificial general intelligence research.

The ARC-AGI benchmark measures the efficiency of AI systems to acquire new skills outside of their training data. Chollet considers the ability to efficiently acquire new skills a mark of AGI. Since its inception, however, the ARC-AGI’s success rate has been lagging, from an already low 21 percent in 2020 to just 30 percent in 2023.

1.4.2. Identifying and exploiting pseudo-cognitive processes in LLMs.

Antoine Bosselut talked about ''Identifying and exploiting pseudo-cognitive processes in LLMs''.

We can draw on connections to human cognition to both study LLM behaviours and improve their conclusions.

E.g. it might be possible to discover LLM subnetworks that suppress
sequences, memories and behaviours, when removed.

Certainly:

Human memories are reconstructed, rather than accessed.
- LLMs also reconstruct memories when outputting sequences.
- Localising these memories enables targeted interventions.
Human reasoning allows for changes to our world model.
- LLMs can be augmented,
in a way where they acquire additional human-like cognitive abilities?

- Good points, & many things to consider.

1.4.3. Evaluating the Robustness of LLMs on Abstract Reasoning and Analogy.

Melanie Mitchell, Santa Fe Institute, talked about ''Evaluating the Robustness of LLMs on Abstract Reasoning and Analogy''.

Webb et al. write in Nature Human Behaviour:

Our results indicate that large language models such as GPT-3 have acquired an emergent ability to find zero-shot solutions to a broad range of analogy problems [19].

Well, well:

We take one set of analogy problems used to evaluate LLMs and create a set of ''counterfactual'' variants, versions that test the same abstract reasoning abilities but that are likely dissimilar from any pre-training data. We test humans and three GPT models on both the original and counterfactual problems, and show that, while the performance of humans remains high for all the problems, the GPT models' performance declines sharply on the counterfactual set [20].

- ''Evidence that, despite previously reported successes of LLMs on analogical reasoning, these models lack the robustness and generality of human analogy-making'' [20].
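One way to picture such a ''counterfactual'' variant is a letter-string analogy posed over a shuffled alphabet. The sketch below is only my illustration of the general idea, assuming a simple successor problem and a randomly permuted alphabet. It is not the benchmark code from [20], and the names are made up.

```python
import random
import string


def successor_problem(alphabet, start, other_start, length=4):
    """Letter-string analogy: the last letter of a run is replaced by its successor."""
    def run(i):
        return list(alphabet[i:i + length])

    def transformed(i):
        return list(alphabet[i:i + length - 1]) + [alphabet[i + length]]

    return {
        "source": run(start),
        "source_changed": transformed(start),
        "target": run(other_start),
        "answer": transformed(other_start),
    }


standard = list(string.ascii_lowercase)
permuted = standard[:]
random.Random(0).shuffle(permuted)  # a fictional alphabet order, given to the solver

original = successor_problem(standard, start=0, other_start=8)
counterfactual = successor_problem(permuted, start=0, other_start=8)

for name, p in (("original", original), ("counterfactual", counterfactual)):
    print(name, " ".join(p["source"]), "->", " ".join(p["source_changed"]),
          ";", " ".join(p["target"]), "-> ?")
```

The abstract rule (''replace the last letter with its successor'') is identical in both problems. Only the familiarity of the surface form changes, which is what makes the comparison informative.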

Indeed, sometimes people draw far-reaching conclusions from their findings.
But one should, of course, always think carefully about what kind of conclusions can actually be drawn from data.

An excellent presentation.

1.5. Workshop.
Multi-modal Foundation Model meets Embodied AI.

(Watched the video-recording of this workshop ''post conference'')

Workshop (Multi-modal Foundation Model meets Embodied AI) Program.

Questions to consider in this workshop. E.g.

Training and evaluation of MFM in open-ended scenarios.
Data collection for training embodied Agents.
Perception and high-level planning,
in embodied agents empowered by MFM.
Decision-making and low-level control,
in embodied agents empowered by MFM.
Evaluation of capability of an embodied agent.

1.5.1. General-Purpose Embodied AI.

Sergey Levine (UCB) talked about ''General-Purpose Embodied AI''.

Getting started with robot learning,
we need ''A Dataset for Robot Learning at Scale'':

Bridge. A Dataset for Robot Learning at Scale.

Bridge is a large and diverse dataset of robotic manipulation behaviors designed to facilitate research in scalable robot learning [21].

In the dataset we have ''observations'', where we know that an action leads to a new observation.
And therefore a link between observations and actions that can be used in training.

Leading to General Navigation Models:

General Navigation Models are general-purpose goal-conditioned visual navigation policies trained on diverse, cross-embodiment training data, that can control many different robots in zero-shot [22].

Indeed, ''Open X-Embodiment: Robotic Learning Datasets and RT-X Models'',
where a high-capacity model trained on this data (RT-X), can improve the capabilities (''generalist'' robot policy) of multiple robots:

Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Here we instead train a ''generalist'' X-robot policy that can be adapted efficiently to new robots, tasks, and environments [23].

There are also vision-language-action (VLA) models.

Where we co-fine-tune (a combination of fine-tuning and co-training, where we keep some of the old vision & text data around) an existing vision-language model with robot data. The robot data includes the current image, language command and the robot action at the particular time step [24].

And: OpenVLA (An Open-Source Vision-Language-Action Model):

That supports controlling multiple robots out of the box, and can be quickly adapted to new robot setups via parameter-efficient fine-tuning.

And, the company Physical Intelligence might build even larger and better models in the not so distant future.

Developing foundation models and learning algorithms to power the robots of today and the physically-actuated devices of the future.

Interesting, indeed.

1.5.2. On Building General-Purpose Robots.

Lerrel Pinto (NYU) talked about ''On Building General-Purpose Robots''.

When we are dealing with large language models we train on data (tokens) we find on the internet.
But training robots is different. Here the data to train on is created by human demonstrators
(Which is not scalable).

So, what to do? How can we build useful robots?
In this talk 3 approaches were discussed:

Build on prior knowledge.
Construct new knowledge from human data.
Assimilate and accommodate new information from interaction.

Build on prior knowledge.

Existing foundation models can be very helpful.
They can help the robots understand what is in the world,
and where it is in the world.
E.g.:

First, existing software packages can help us map a room.
Next, we can use a foundational model to name everything in the room.
And then it will be possible to use e.g. the A* algorithm to move from the robot's current position to a new position, where the particular item we are going for is located.
Finally, grasping the object can be done without any additional training with the help of e.g. the software package anygrasp (Or perhaps, with the help of a transformer package like e.g. ''Robotic View Transformer for 3D Object Manipulation'', 3D manipulation that is both scalable and accurate, [25]).
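
To make the navigation step above a bit more concrete, here is a minimal sketch (my own, not from the talk) of A* on a small occupancy grid. The grid, the function name `astar` and the coordinates are assumptions for illustration; a real system would plan on the map built in the first step.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on an occupancy grid (0 = free, 1 = blocked)."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance never overestimates for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {}
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            # Walk the parent links back to the start.
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry, a cheaper route was found later
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    came_from[nxt] = cell
                    heapq.heappush(
                        open_heap, (new_cost + heuristic(nxt), new_cost, nxt)
                    )
    return None  # goal unreachable


# Hypothetical room map: the robot starts top-left, the item is bottom-right.
room = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
path = astar(room, (0, 0), (3, 3))
print(path)
```

The returned path is a list of grid cells from the start to the goal, or `None` when the goal cannot be reached.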

Putting it together, we here get something like the Ok, Robot system (An open, modular framework for zero-shot, language conditioned pick-and-drop tasks in arbitrary homes).

Ok Robot. An open, modular framework for zero-shot, language conditioned pick-and-drop tasks in arbitrary homes.

As the foundational model might not know all possible words that can be used for a given object, the model will be ''blind'', and not recognize a certain object in the scene, if we don't use words for the object that the model is familiar with.
So, humans will have to know the right words to prompt these models (and make the robot go to the right place).

Construct new knowledge from human data.

If we want the robot to do something else, like opening a drawer, we will then have to train this behavior specifically.
I.e. get (human) data for this behaviour (open drawer, open door etc.), and then finetune the robot's model for this new task.
See: Dobb E (An open-source, general framework for learning household robotic manipulation).

Assimilate and accommodate new information from interaction.

Finally, the robot needs to understand whether it has done a good job...
I.e. when the robot tries to do the thing that it has been shown by the human expert,
it takes an ''encoding'' of the outcome of the action, to see if it matches the outcome the human expert got.
See: Watch and Match (Supercharging imitation with regularized optimal transport).
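
As a rough illustration of such matching, here is a sketch (my own, under assumptions) of an optimal-transport style reward: encoded robot observations are compared with encoded expert observations, and a low transport cost means a good match. The encodings, the cosine cost, the Sinkhorn settings and all function names are assumptions for illustration, not the Watch and Match implementation.

```python
import numpy as np


def cosine_cost(agent, expert):
    """Pairwise cosine distance between two sets of encodings."""
    a = agent / np.linalg.norm(agent, axis=1, keepdims=True)
    e = expert / np.linalg.norm(expert, axis=1, keepdims=True)
    return 1.0 - a @ e.T


def sinkhorn_plan(cost, epsilon=0.05, n_iters=200):
    """Entropy-regularized transport plan between two uniform distributions."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    kernel = np.exp(-cost / epsilon)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def matching_rewards(agent, expert):
    """One reward per agent step: minus the transport cost assigned to it."""
    cost = cosine_cost(agent, expert)
    plan = sinkhorn_plan(cost)
    return -(plan * cost).sum(axis=1)


# Toy encodings (hypothetical): one rollout copies the expert, one does not.
expert = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
good_rollout = expert.copy()
bad_rollout = -expert
print(matching_rewards(good_rollout, expert).sum())
print(matching_rewards(bad_rollout, expert).sum())
```

The rollout that reproduces the expert's encodings gets a total reward close to zero, while the mismatched rollout is penalized.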

See also, the VQ-Bet system, for behavior generation that handles multimodal action prediction, conditional generation, and partial observations.

A great talk, indeed.

1.5.3. Foundation models for robotics.

Chelsea Finn (Stanford) talked about ''Foundation models for robotics''.

A part of the talk was about ''Humanoid Shadowing and Imitation from Humans'', and the HumanPlus project (a full-stack system for humanoids to learn motion and autonomous skills from human data).

With the takeaway:
''If you can teleoperate a task, a robot can very likely learn it''.

Interesting, indeed.

1.6. Tutorial.
Understanding the Role of Large Language Models in Planning.

(Watched the video-recording of this Tutorial ''post conference'')

Subbarao Kambhampati talked about ''Understanding the Role of Large Language Models in Planning''.
See also: 1.4.1. (LLMs are system 1's for humanity, and that ought to be enough).

LLMs are not useless for planning tasks.

LLMs are good at idea generation.
They can tell you what the domain model is.
etc.

But LLMs cannot guarantee robustness of a plan.

In other words:

LLMs always hallucinate
(In the sense that the distribution of their completions is the same as that of the texts they have been trained on).
Whether or not the LLM gives the ''Factual'' completion
depends on the user knowing what the facts are.

Complexity of the underlying task has no bearing on LLM guesses
(LLMs do not think longer or harder if the problem they are considering is hard, or impossible, compared to the time it takes to come up with an answer for an easy problem.
The mechanism for coming up with an answer is exactly the same).

So, what about planning?

Planning: Given a set of objectives, come up with a set of actions (to achieve the objectives).
Scheduling: Given a set of tasks, make sure that there are no undesired interactions.
Model-based reinforcement learning: Agents act in an environment, and learn action models there.

Where we want plans to a) be robust (achieve their objectives) and b) have a certain quality (desired style of the solution).

It should also be possible to verify the plan without executing it, according to some model of the world
(If you have your own world model, you can critique other world views).
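
To illustrate what verifying a plan against a world model can look like, here is a minimal sketch (my own, not from the tutorial) of a STRIPS-style plan checker for a tiny Blocks World. The fact names, the action set and the function `verify_plan` are assumptions for illustration. The point is that a candidate plan (e.g. one guessed by an LLM) can be simulated step by step against the model, without executing anything in the real world.

```python
from typing import NamedTuple


class Action(NamedTuple):
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset


def verify_plan(initial_state, goal, plan, actions):
    """Simulate the plan on the model; return (is_valid, explanation)."""
    state = set(initial_state)
    for step, name in enumerate(plan, start=1):
        action = actions[name]
        missing = action.preconditions - state
        if missing:
            return False, f"step {step} ({name}): unmet preconditions {sorted(missing)}"
        state = (state - action.delete_effects) | action.add_effects
    unmet = set(goal) - state
    if unmet:
        return False, f"goal not reached, missing {sorted(unmet)}"
    return True, "plan is valid"


# Hypothetical two-block domain: A sits on B, and we want B on A.
ACTIONS = {
    "unstack_A_B": Action(
        frozenset({"on(A,B)", "clear(A)", "handempty"}),
        frozenset({"holding(A)", "clear(B)"}),
        frozenset({"on(A,B)", "clear(A)", "handempty"}),
    ),
    "putdown_A": Action(
        frozenset({"holding(A)"}),
        frozenset({"ontable(A)", "clear(A)", "handempty"}),
        frozenset({"holding(A)"}),
    ),
    "pickup_B": Action(
        frozenset({"ontable(B)", "clear(B)", "handempty"}),
        frozenset({"holding(B)"}),
        frozenset({"ontable(B)", "clear(B)", "handempty"}),
    ),
    "stack_B_A": Action(
        frozenset({"holding(B)", "clear(A)"}),
        frozenset({"on(B,A)", "clear(B)", "handempty"}),
        frozenset({"holding(B)", "clear(A)"}),
    ),
}
INITIAL = {"on(A,B)", "ontable(B)", "clear(A)", "handempty"}
GOAL = {"on(B,A)"}

print(verify_plan(INITIAL, GOAL, ["unstack_A_B", "putdown_A", "pickup_B", "stack_B_A"], ACTIONS))
print(verify_plan(INITIAL, GOAL, ["pickup_B", "stack_B_A"], ACTIONS))
```

The second plan is rejected at its first step, since B is not clear while A is still sitting on it, and the checker can say exactly why.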

So what about LLMs?
- LLMs can fake reasoning with pattern finding
(I.e. a lot of memory reduces the need to reason from first principles).

LLMs can capture the distribution of data (in next word prediction).
Still, this is not the same as making sure that each individual output of the LLM is correct.

Indeed, as Kambhampati writes on X:

Afraid of #GPT4 going rogue and killing y'all? Worry not. Planning has got your back. You can ask it to solve any simple few step classical planning problem and snuff that - AGI spark- well and good [26].

On ArXiv:

Our studies also show that on many critical capabilities-including plan generation-LLM performance falls quite short, even with the SOTA models [27].

On X:

Claude 3 Opus does no better at planning (1.3% in Mystery Blocksworld), but apparently a tad better at approximate retrieval on standard Blocks World [29].

And at NeurIPS:

Our findings reveal that LLMs’ ability to generate executable plans autonomously is rather limited, with the best model (GPT-4) having an average success rate of ∼12% across the domains [30].

Great points, indeed!

2. KI 2024.

Künstliche Intelligenz 2024 [31].
- 47th German Conference on Artificial Intelligence.
Venue:
Faculty of Mathematics and Computer Science
at the Julius-Maximilians University Würzburg [32].

Attended KI 2024 (September 25th - September 27th, 2024).

Würzburg Universität.

Getting started.
Wednesday, Institut für Informatik, Julius Maximilians Universität, Würzburg. September 2024.

2.1. Presentations Wednesday, September 25th.

The 47th German Conference on Artificial Intelligence, KI 2024,
took place in the Computer Science building (Faculty of Mathematics and Computer Science) at the Julius Maximilians University Würzburg.

2.1.1. Amplifying Human-Human and Human-Agent interaction with AI.

Elisabeth André, University of Augsburg, gave the keynote Wednesday about ''Amplifying Human-Human and Human-Agent interaction with AI''.

Among many other interesting topics the keynote introduced us to Nova, a tool for annotating and analyzing behaviours in social interactions [33].

Nova. KI 2024 - 47th German Conference on Artificial Intelligence.

Nova gives us a tool for annotating and analyzing behaviours in social interactions.
Psychotherapy often uses questionnaires to examine session processes.
Here, ''additional or alternative data sources are highly valuable, because humans (themselves) often tend to have a distorted recall of emotions and mood'' and ''(bodily) expressions are also of great importance to better understand interpersonal processes and communication'' [34].

For example, the therapist could use Nova to reveal discrepancies between self-perception of the patient
and external perception, and discuss it with the patient...
A potential scenario would be, that a patient is convinced that his grief is clearly observable for others
and, in contrast, there is no sign of negative valence during the sessions.

The therapist could also use it for self-reflection.
For example if he has very low arousal and negative valence with one patient, compared to others, he could use such indicators from Nova as a starting point for self-reflection [34].

At AAMAS 2018, a computational model of user emotions for empathic agents (MARSSI: Model of Appraisal, Regulation, and Social Signal Interpretation) was introduced. A model that can help with e.g. ''social signal interpretation (by) taking directions of expressions into account''.

Model of Appraisal, Regulation, and Social Signal Interpretation. KI 2024 - 47th German Conference on Artificial Intelligence.

MARSSI employs an extended theory of emotions, which makes a more precise description of emotions possible [35].

Conclusion:
AI (tools), like those described here, can help with:

Training and enhancing communication skills.
Analyzing communication errors, and thereby helping people connect more easily.
Creating personalized expressions, and communication, in ways that resonate with one's own identity and preferences.

A very interesting talk, indeed.

2.1.2. Session 2. Probabilistic and Predictive Models.

A Note on Linear Time Series Prediction.

Christopher Bonenberger et al., Ravensburg-Weingarten University of Applied Sciences (Institut für Künstliche Intelligenz), Weingarten, Germany, talked about ''A Note on Linear Time Series Prediction''.

We consider the problem of univariate time series prediction from an elementary machine learning point of view. Beginning with the question of whether and how Principal Component Analysis (PCA) can be used for time series prediction, we describe a simple methodology and attempt to classify PCA-based prediction in terms of statistics, signal processing and dynamical systems theory [36].

Idea:

One of the main benefits of PCA for time series data is that it can help you simplify your data and reduce the noise. By keeping only the most important components, you can focus on the essential features of the data and ignore the irrelevant ones. This can make your analysis easier and faster, as well as improve the performance and accuracy of your models. Another advantage of PCA is that it can help you discover hidden relationships and structures in your data, such as seasonality, cycles, or outliers. By visualizing the principal components, you can gain insights into the dynamics and behavior of your data over time [37].
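
To get a feeling for how PCA can be used for prediction, here is a minimal sketch (my own, and not necessarily the method of the paper): the series is cut into overlapping windows, the windows are projected onto a few principal components, and a linear least-squares fit maps those components to the next value. The function names, the window length and the number of components are assumptions for illustration.

```python
import numpy as np


def fit_pca_predictor(series, window, n_components):
    """Fit a one-step-ahead predictor on PCA scores of lagged windows."""
    series = np.asarray(series, dtype=float)
    # Row i holds series[i : i + window]; its target is series[i + window].
    X = np.lib.stride_tricks.sliding_window_view(series[:-1], window)
    y = series[window:]
    mean = X.mean(axis=0)
    # Principal directions are the right singular vectors of the centred windows.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    components = vt[:n_components]
    scores = (X - mean) @ components.T
    # Linear regression (with intercept) from the scores to the next value.
    design = np.column_stack([scores, np.ones(len(scores))])
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    return mean, components, weights


def predict_next(history, model):
    """Predict the value following the last `window` entries of `history`."""
    mean, components, weights = model
    window = mean.shape[0]
    latest = np.asarray(history[-window:], dtype=float)
    scores = (latest - mean) @ components.T
    return float(scores @ weights[:-1] + weights[-1])


# Toy example: a noise-free sine wave with period 20.
t = np.arange(200)
wave = np.sin(2 * np.pi * t / 20)
model = fit_pca_predictor(wave[:180], window=10, n_components=2)
print(predict_next(wave[:180], model), wave[180])
```

For a clean sine wave two components are enough, since every window is a combination of one sine and one cosine. Real data, such as sunspot numbers, would need more care (noise, choice of window, validation).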

Many interesting notes in the talk, indeed.
So, now, we could all get started predicting sunspot numbers [38], or likewise.

Turing. For more, see my trip to Bletchley Park, here.

2.1.3. Session 3. Visual and Acoustic Approaches.

Leveraging YOLO for Real-Time video Analysis of animal welfare in a pig slaughtering process.

Christian Beecks et al., FernUniversität in Hagen, talked about ''Leveraging YOLO for Real-Time video Analysis of animal welfare in a pig slaughtering process''.

In this project, the authors used YOLO (''You Only Look Once''), an effective real-time object recognition algorithm.

Yolo. KI 2024 - 47th German Conference on Artificial Intelligence.

YOLO (You Only Look Once) is a popular object detection model known for its speed and accuracy. It was first introduced by Joseph Redmon et al. in 2016 and has since undergone several iterations [39].

First they had to identify situations with a risk of violations (mostly, incorrect use of tools),
and then they trained YOLO to identify these situations
(In order not to commit these violations themselves, they created similar situations for training purposes, but not with real animals).

We investigate the domain of animal welfare and present our latest findings in relation to the automated detection of animal welfare violations. To this end, we introduce three different situations of increased animal welfare risk occurring in a pig slaughtering process and elucidate YOLO-based approaches to detect these situations based on video data [40].

Clever, indeed:

Though the reported results are considered to be preliminary, our solution already detects most of the situations of increased animal welfare risk with high accuracy [40].

In the future they expect to be able to utilize audio as well, to identify animal welfare violations.

Early Explorations of Lightweight Models for Wound Segmentation on Mobile Devices.

Vanessa Borst et al., Julius Maximilians University Würzburg, talked about ''Early Explorations of Lightweight Models for Wound Segmentation on Mobile Devices''.

''The aging population poses numerous challenges to healthcare, including the increase in chronic wounds in the elderly''.
So, there is a need for computer-aided wound recognition from smartphone photos (Do-it-at-home apps).

(But) Despite research in mobile image segmentation, there is a lack of focus on mobile wound segmentation.
To address this gap, we conduct initial research on three lightweight architectures to investigate their suitability for smartphone-based wound segmentation.
...
We deploy the models into a smartphone app for visual assessment of live segmentation, where results demonstrate the effectiveness of TopFormer in distinguishing wounds from wound-coloured objects [41].

Needed, and very useful, indeed.

Active Learning in Multi-label Classification of Bioacoustic Data.

Hannes Kath et al., German Research Center for Artificial Intelligence (DFKI), Oldenburg, Germany, talked about ''Active Learning in Multi-label Classification of Bioacoustic Data''.

''Passive Acoustic Monitoring (PAM) has become a key technology in wildlife monitoring, providing vast amounts of acoustic data''.
...
''The recording process naturally generates multi-label datasets (More animal calls at the same time). However, due to the significant annotation time required, most available datasets use exclusive labels''.

Active learning to the rescue:

There are situations in which unlabeled data is abundant but manual labeling is expensive. In such a scenario, learning algorithms can actively query the user/teacher for labels. This type of iterative supervised learning is called active learning. Since the learner chooses the examples, the number of examples to learn a concept can often be much lower than the number required in normal supervised learning [42].

Results:

We investigate the effects of class sparsity, ceiling performance, number of classes, and different AL strategies on AL performance. Our results show that AL performance is superior on datasets with sparser classes, lower ceiling performance, fewer classes, and when using uncertainty sampling strategies [43].
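
As a small illustration of uncertainty sampling in a multi-label setting, here is a sketch (my own, not the authors' code): the model gives one probability per class, and the samples whose probabilities are closest to 0.5 are sent to the annotator first. The scoring rule, the mean aggregation over labels and the function name are assumptions for illustration.

```python
import numpy as np


def select_most_uncertain(probabilities, batch_size):
    """Pick the samples to label next.

    probabilities: array of shape (n_samples, n_classes) with independent
    per-class probabilities (e.g. sigmoid outputs of a multi-label model).
    """
    probabilities = np.asarray(probabilities, dtype=float)
    # Per-label uncertainty: 1 at p = 0.5, falling to 0 at p = 0 or p = 1.
    per_label = 1.0 - 2.0 * np.abs(probabilities - 0.5)
    # Aggregate over labels; the mean is one choice, the maximum is another.
    scores = per_label.mean(axis=1)
    # Sort by descending score, so the most uncertain samples come first.
    return np.argsort(-scores, kind="stable")[:batch_size]


# Hypothetical predictions for four unlabeled recordings and three species.
predictions = [
    [0.99, 0.01, 0.02],  # confident
    [0.55, 0.45, 0.50],  # very uncertain
    [0.90, 0.60, 0.10],  # somewhat uncertain
    [0.50, 0.98, 0.03],  # uncertain about one species only
]
print(select_most_uncertain(predictions, batch_size=2))  # [1 2]
```

In an active learning loop, the selected recordings would be annotated, added to the training set, and the model retrained before the next round of selection.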

Active learning works, indeed.

Würzburg.

Zuse. KI 2024 - 47th German Conference on Artificial Intelligence.

Poster, Institut für Informatik, Julius Maximilians University, Würzburg.

Zuse invented the world's first programmable computer.

The functional program-controlled Turing-complete Z3 became operational in May 1941.

Thanks to this machine and its predecessors,
Zuse is regarded by many as one of the inventors of the modern computer [44].

Julius Maximilians University Würzburg.

2.2. Presentations Thursday, September 26th.

2.2.1. Session 4. Explainability.

LaFAM: Unsupervised Feature Attribution with Label-free Activation Maps.

Aray Karjauv et al. talked about ''LaFAM: Unsupervised Feature Attribution with Label-free Activation Maps''.

LaFAM. KI 2024 - 47th German Conference on Artificial Intelligence.

Image saliency maps (A saliency map is an image that highlights either the region on which people's eyes focus first or the most relevant regions for machine learning models).

Where: ''Saliency maps are a great tool to understand what convolutional layers are seeing in computer vision, allowing us to use these models in production in an informed manner. Meaning that they can also be used to troubleshoot models in cases where models are not performing as expected [45]''.

The authors write:
''While Vision Transformers show impressive results, convolutional neural networks (CNNs) continue to be extensively used due to their well-established effectiveness in extracting spatially coherent features and their interpretability ''.

''Evidence suggests that CNNs can inherently develop detectors for semantic objects without explicit supervision, making Activation Maps (AMs) in deeper layers a key factor in the success of current XAI methods''.

Convolutional Neural Networks (CNNs) are known for their ability to learn hierarchical structures, naturally developing detectors for objects, and semantic concepts within their deeper layers. Activation maps (AMs) reveal these saliency regions, which are crucial for many Explainable AI (XAI) methods [46].

Where the study evaluates XAI (explainable AI) methods in the context of CNNs, comparing the performance of LaFAM with other methods.
And indicating that LaFAM's ability to highlight salient regions is better than that of other (comparable) methods.

Concluding: Results highlight that LaFAM emerges as a robust and flexible method, contributing to the diversification of the XAI toolbox.
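
To show what a label-free saliency map built from activation maps can look like, here is a minimal sketch. It reflects my reading of the idea (average the activation maps over the channels, normalize, and upsample to the image size), and is an assumption, not the authors' implementation. The function name and the random stand-in activations are hypothetical.

```python
import numpy as np


def label_free_saliency(activations, output_size):
    """Turn conv-layer activations of shape (channels, h, w) into a saliency map.

    output_size is the (height, width) of the input image.
    """
    activations = np.asarray(activations, dtype=float)
    # No labels or gradients involved: simply average over the channels.
    saliency = activations.mean(axis=0)
    # Min-max normalization to the range [0, 1].
    lo, hi = saliency.min(), saliency.max()
    if hi > lo:
        saliency = (saliency - lo) / (hi - lo)
    else:
        saliency = np.zeros_like(saliency)
    # Nearest-neighbour upsampling to the input resolution.
    rows = np.arange(output_size[0]) * saliency.shape[0] // output_size[0]
    cols = np.arange(output_size[1]) * saliency.shape[1] // output_size[1]
    return saliency[np.ix_(rows, cols)]


# Stand-in for the activations of a deep conv layer (8 channels, 4x4 grid).
rng = np.random.default_rng(0)
fake_activations = rng.random((8, 4, 4))
saliency_map = label_free_saliency(fake_activations, (16, 16))
print(saliency_map.shape)
```

In practice the activations would come from a trained CNN, and a smoother interpolation than nearest-neighbour would typically be used for the upsampling.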

A Brief Systematization of Explanation-Aware Attacks.

Maximilian Noppel et al., Kastel Security Research Labs, Karlsruhe, talked about ''A Brief Systematization of Explanation-Aware Attacks''.

''Many machine learning models are largely considered black boxes. Where explanation methods aim to shed light on the inner working of such models, and, thus can serve as debugging tools.

However, recent research has demonstrated that carefully crafted manipulations at the input or the model can successfully fool the model and the explanation method [47]''.

Still:

The classifier and the explanation method can be fooled through attacks at inference time or at training time, and in a vast number of different attack scenarios. These adversarial corner cases raise concerns on the trustworthiness of explainable machine learning in general [47].

Where the following attack types were described in the talk:

Explanation-Preserving Attack.
The aim of an explanation-preserving attack is to change the prediction, but to preserve the explanation in comparison to a benign case. Note that the benign case might be specified by a clean in-distribution input and/or a benignly trained model, depending on the threat model.
Prediction-Preserving Attack.
In a prediction-preserving attack the prediction should be correct, but the explanation should deviate from the benign case.
Dual Attack.
Lastly, in a dual attack the predictions and the explanations differ from the benign case. This type of attack allows the greatest flexibility.
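
The three attack types above can be summarized in a few lines of code. This is only a sketch of the taxonomy (my own, not from the paper): it compares an observed prediction and explanation with the benign case. The function name, the tolerance and the toy attribution vectors are assumptions for illustration, and ''prediction preserved'' is simplified here to ''same label as in the benign case''.

```python
def classify_attack(benign_label, observed_label,
                    benign_explanation, observed_explanation,
                    tolerance=0.1):
    """Name the attack type by comparing against the benign case.

    Explanations are equally long lists of attribution scores; they count as
    changed when any score moves by more than `tolerance` (hypothetical rule).
    """
    prediction_changed = observed_label != benign_label
    largest_shift = max(
        abs(b - o) for b, o in zip(benign_explanation, observed_explanation)
    )
    explanation_changed = largest_shift > tolerance

    if prediction_changed and explanation_changed:
        return "dual"
    if prediction_changed:
        return "explanation-preserving"
    if explanation_changed:
        return "prediction-preserving"
    return "benign"


print(classify_attack("cat", "dog", [0.1, 0.9], [0.1, 0.9]))
print(classify_attack("cat", "cat", [0.1, 0.9], [0.9, 0.1]))
print(classify_attack("cat", "dog", [0.1, 0.9], [0.9, 0.1]))
```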

Where the aim of the talk was to systematize attack types against explainable systems, according to these three primary dimensions. Indeed, this systematization makes it easier to identify potential attacks in the concrete application scenario at hand [47].

Canteen. Würzburg Universität.

2.2.2. Session 5. AI in Practice.

Saxony-Anhalt is the Worst: Bias Towards German Federal States in Large Language Models.

Anna Kruspe et al., Munich University of Applied Sciences, talked about ''Saxony-Anhalt is the Worst: Bias Towards German Federal States in Large Language Models''.

Recent research demonstrates geographic biases in various Large Language Models that reflects common human biases, which are presumably present in the training data.
...
We evaluate the responses of ChatGPT-3.5, ChatGPT-4, and LeoLM for various ratings and estimations by state.
...
In particular, we demonstrate that Eastern states are consistently rated lower (or worse, depending on task) [48].

Interesting, and not good (that this is so).
In the talk the authors mentioned that similar findings are possible when it comes to LLMs and e.g. northern and southern Italy.
Well, not all that surprising. But good to have it addressed here.

Evaluating AI-Based Components in Autonomous Railway Systems: A Methodology.

Jan Roßbach et al., Institut für Informatik, Heinrich Heine Universität, Düsseldorf, talked about ''Evaluating AI-Based Components in Autonomous Railway Systems: A Methodology''.

Recent breakthroughs in Artificial Intelligence (AI) are poised to transform many domains, including autonomous railway transportation systems.
...
However, safety is essential in this high-stake, safety-critical domain.
To ensure compliance with current safety certification standards, we propose a comprehensive methodology for evaluating AI-based components in railway applications [49].

Evaluating AI-Based Components in Autonomous Railway Systems.

Indeed, safety is critical here. An automated AI railway system must perform better than humans, with a very low hazard rate (how often an accident occurs) of less than 10^-9.

AI systems that run trains cannot hallucinate (like LLM models)...

Generative AI can provide images (e.g. of rainy conditions) that can help test other systems,
here (AI) systems for driving trains etc.
Still, in all cases, for all of these railway systems, there is a general requirement that the AI system has to perform better than a human driver.

Indeed, great work and a great talk.

Context-Specific Selection of Commonsense Knowledge Using Large Language Models.

Oliver Jacobs et al., Hochschule Trier, Trier, Germany, talked about ''Context-Specific Selection of Commonsense Knowledge Using Large Language Models''.

In the field of automated reasoning, practical applications often face a significant challenge: Knowledge bases are typically too large to be fully processed by theorem provers.
...
To still be able to prove that a given goal follows from a large knowledge base, selection techniques are used to determine the parts of the knowledge base that are relevant to the goal.
...
Traditional selection techniques used for this task are usually syntax-based and often overlook a crucial aspect
- The meaning of symbol names and axioms.
...
Especially in commonsense reasoning scenarios, the meaning embedded in the symbol names provides invaluable insights [50].

Here, a selection technique is introduced that uses the capabilities of large language models to closely align the selected part of the knowledge base with the context of the goal.

Clever, indeed!

And, as the authors report, ''The approach is implemented, and we present a series of experiments that show promising results''. All good, indeed.

But there are, of course, still many problems ahead, when it comes to dealing with commonsense knowledge.

E.g. many have noted, that ''common-sense'' can be unspoken, and unwritten, and can have a cultural component.

Commonsense knowledge is one of the fundamental aspects of human cognition and reasoning. A large fraction of this knowledge consists of general common sense, which refers to a broad and fundamental understanding of the world that is shared by most people worldwide [51].

But:

Commonsense is often unspoken and unwritten, with the assumption that the other party holds the same understanding. Hence, unlike factual knowledge, it is acquired over time through exploration and cultural learning. This often entails shared societal norms and expectations, and a shared understanding of the world to navigate diverse situations, which leads to cultural commonsense – a specific set of values, beliefs, norms, and behaviors that are accepted and practiced within a particular culture or community. Cultural commonsense is a form of commonsense knowledge, and while agreed upon by a group of people, it may not necessarily be commonsensical to others outside that group [51].

For instance, ''wedding dresses are typically red'' is a cultural norm shared in China, India, and Vietnam, but not shared in Italy or France [51].

Authors Siqi Shen et al., University of Michigan, conclude:

Our findings indicate that LLMs tend to associate general commonsense with cultures that are well-represented in the training data, and that LLMs have uneven performance on cultural commonsense, where they underperform for less-represented cultures [51].

Indeed, navigating the world of common-sense, and common-sense reasoning is a tricky thing...

Institut für Informatik, Würzburg Universität.

2.2.3. Session 6. AI and Games.

A Framework for General Trick-Taking Card Games.

Stefan Edelkamp et al., Charles University, Prague, Czech Republic, talked about ''A Framework for General Trick-Taking Card Games''.

Skat is a three-player trick-taking card game of the ace–ten family, devised around 1810 in Altenburg (in the Duchy of Saxe-Gotha-Altenburg). It is the national card game in Germany and one of the most popular card games in Poland.
...
In order to play the game well, one must be good at hand evaluation, counting, cooperation and bidding intelligence [51].

Skat. KI 2024 - 47th German Conference on Artificial Intelligence.

Here, in this talk, the focus was on bidding, team building and general and specialized card recommenders:

Inspired by recent advances in Computer Skat and Bridge, here we look into automated play for several other trick-taking card games.
...
We have included bidding, team building, and game selection, as well as general and specialized card recommenders applicable for the different stages of trick-taking.
...
The AIs are evaluated in different variants and against a general card player that lacks expert rules [52].

An interesting topic. See e.g. what Michael Schofield et al. write in ''Journal of Artificial Intelligence Research, 66'':

Game playing presents a controlled environment in which to evaluate AI techniques, and so we have seen an increase in interest in this field of research. Games of imperfect information offer the researcher an additional challenge in terms of complexity over games with perfect information [General Game Playing with Imperfect Information, ''Journal of Artificial Intelligence Research, 66''].

Indeed, this talk might have started with the ''Skat'' card game, but, clearly, topics like multi agent game tree search, how to deal with incomplete information, coalition and cooperation building etc. are not far away, when playing card games, like Skat.

Indeed, super interesting, and a great talk.

Institut für Informatik, Würzburg Universität.

2.3. Presentations Friday, September 27th.

2.3.1. Quantum AI/ML - Hype or Hope?

Christian Bauckhage, University of Bonn, gave the keynote Friday about ''Quantum AI/ML - Hype or Hope?''.

Many useful quotes, links and references in this awesome talk.
Among others, I noted:

Scott Aaronson writes in ''Quantum Machine Learning Algorithms: Read the Fine Print'':

...
Quantum computers could solve certain problems exponentially faster than we know how to solve them with any existing computer
...
But there has always been a catch, and I'm not even talking about the difficulty of building practical quantum computers. Supposing we had a quantum computer, what would we use it for?
The ''killer apps'' - the problems for which a quantum computer would promise huge speed advantages over classical computers - have struck some people as inconveniently narrow. By using a quantum computer, one could dramatically accelerate the simulation of quantum physics and chemistry, break almost all of the public-key cryptography currently used on the Internet (for example, by quickly factoring large numbers, with the famous Shor's algorithm), and maybe achieve a modest speedup for solving optimization problems in the infamous ''NP-hard'' class
...
Alas, as interesting as that list might be, it's hard to argue that it would transform civilization in anything like the way classical computing did in the previous century...[53].

New quantum algorithms promise an exponential speed-up for machine learning, clustering and finding patterns in big data. But to achieve a real speed-up, we need to delve into the details [54].

So, how do quantum computers work?
Well, quantum computers do ''bit level computing''.
I.e.

A digital computer both stores and processes information using bits, which can be either 0 or 1.
Physically, a bit can be anything that has two distinct configurations: one represented by ''0'', and the other represented by ''1''. It could be a light bulb that is on or off, a coin that is heads or tails, or any other system with two distinct and distinguishable possibilities.
In modern computing and communications, bits are represented by the absence or presence of an electrical signal, encoding ''0'' and ''1'' respectively.
...
A quantum bit is any bit made out of a quantum system, like an electron or photon.
Just like classical bits, a quantum bit must have two distinct states: One representing ''0'' and one representing ''1''.
Unlike a classical bit, a quantum bit can also exist in superposition states, be subjected to incompatible measurements, and even be entangled with other quantum bits. Having the ability to harness the powers of superposition, interference and entanglement makes qubits fundamentally different and much more powerful than classical bits [55].
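
Superposition and entanglement can be made concrete with a few lines of linear algebra. The sketch below (my own, not from the talk) simulates state vectors with NumPy: a Hadamard gate puts one qubit into an equal superposition, and a CNOT gate then entangles it with a second qubit. The variable and function names are my own choices for illustration.

```python
import numpy as np

# Basis state of a single qubit, as a vector of amplitudes.
ZERO = np.array([1.0, 0.0])

# Hadamard gate: maps |0> to an equal superposition of |0> and |1>.
HADAMARD = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)

# CNOT gate on two qubits: flips the second qubit when the first is |1>.
# Basis order: |00>, |01>, |10>, |11>.
CNOT = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])


def measurement_probabilities(state):
    """Born rule: the probability of each outcome is |amplitude|^2."""
    return np.abs(state) ** 2


# Superposition: measuring gives 0 or 1 with (about) equal probability.
plus = HADAMARD @ ZERO
print(measurement_probabilities(plus))

# Entanglement: a Bell state, where only the outcomes 00 and 11 occur.
bell = CNOT @ np.kron(plus, ZERO)
print(measurement_probabilities(bell))
```

Simulating n qubits this way needs a vector with 2^n amplitudes, which is why classical simulation quickly becomes infeasible as the number of qubits grows.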

The first quantum computer with 2 qubits was built by IBM in 1997 [56].
In 2019 there was a quantum computer with 53 qubits [57], [58].
And, recently, IBM unveiled the first quantum computer with more than 1,000 qubits.
Where:

Quantum computers promise to perform certain computations that are beyond the reach of classical computers. They will do so by exploiting uniquely quantum phenomena such as entanglement and superposition, which allow multiple qubits to exist in multiple collective states at once
...
Researchers have generally said that state-of-the-art error-correction techniques will require more than 1,000 physical qubits for each logical qubit. A machine that can do useful computations would then need to have millions of physical qubits [59].

Indeed, ''getting enough qubits to work together to run all kinds of quantum algorithm - in what is known as a universal quantum computer - has proved extremely challenging (D-Wave's machines are not ''universal computers'', and can only run a limited range of quantum algorithms)'' [60].

Jack Krupansky, writing on Medium, continues: A ''Universal Quantum Computer'' must have:

An assembly of independent qubits, each of which functions according to the probabilistic nature of quantum mechanics, including superposition and entanglement of quantum states, as well as a universal quantum logic gate set which supports all of the operations possible at the quantum mechanical level for each qubit.

Can simulate or implement any and all operations of a classical computer. A classic Turing machine.

Capable of simulating physics, especially quantum mechanics, but the rest of physics as well [60].

Where Bauckhage succinctly noted that we do not have Universal Quantum Computers yet...[60], [61], [62].

Nonetheless, Bauckhage also noted that there is still hope for quantum AI, as long as we are looking for appropriate use cases.
As an example, he discussed how quantum algorithms can accelerate Bayesian network inference:

Quantum Bayesian Computation (QBC) is an emerging field that levers the computational gains available from quantum computers to provide an exponential speed-up in Bayesian computation [63].

Still, the quantum world is a very, very strange world.
E.g. Bauckhage reminded us that Feynman noted:

Where did we get that [Schrödinger's equation] from? Nowhere. It is not possible to derive from anything you know. It came out of the mind of Schrödinger, invented in his struggle to find an understanding of the experimental observation of the real world [64].

Schroedinger and the Schroedinger equation.

Which makes you wonder how a future version of ChatGPT (with reasoning) should be able to come up with a derivation of Schrödinger's equation [65]...
Well...

Certainly, we (humans) shouldn't expect to understand it all in one go...

An absolutely awesome talk!

2.3.2. Session 7. Symbolic Approaches.

Out-of-Distribution Detection with Logical Reasoning.

Konstantin Kirchheim et al., Otto von Guericke University, talked about ''Out-of-Distribution Detection with Logical Reasoning''.

Machine Learning models often only generalize reliably to samples from the training distribution. Consequentially, detecting when input data is out-of-distribution (OOD) is crucial, especially in safety-critical applications
...
A logical reasoning system uses this knowledge base at run-time to infer whether inputs are consistent with prior knowledge about the training distribution
...
We demonstrate the effectiveness of our method through experiments on several datasets [66].

In contrast to DNNs, the logical reasoning system provides our system with a certain degree of interpretability in the sense that we are able to provide meaningful and intuitive explanations, based on human-interpretable concepts for many of the decisions of the system [66].

Very relevant, and useful indeed.
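A rough sketch of how I picture the idea: concept detectors describe the input, and a small knowledge base checks at run-time whether that description is consistent with the predicted class. The knowledge base, the concepts and the function name below are all hypothetical, and this is a plain dictionary lookup rather than the reasoning system used in [66].

```python
# Hypothetical knowledge base: for each class, the concept values that
# training samples of that class are assumed to satisfy.
KNOWLEDGE_BASE = {
    "stop_sign": {"shape": "octagon", "color": "red"},
    "yield_sign": {"shape": "triangle", "color": "red"},
}

def check_consistency(predicted_class, detected_concepts):
    """Return (is_consistent, explanations) for one input.

    An input counts as out-of-distribution when the detected concepts
    contradict what the knowledge base expects for the predicted class.
    """
    rules = KNOWLEDGE_BASE.get(predicted_class)
    if rules is None:
        return False, [f"unknown class '{predicted_class}'"]
    violations = [
        f"{concept} is '{detected_concepts.get(concept)}', expected '{expected}'"
        for concept, expected in rules.items()
        if detected_concepts.get(concept) != expected
    ]
    return not violations, violations

ok, reasons = check_consistency("stop_sign", {"shape": "octagon", "color": "blue"})
```

The list of violations doubles as a human-readable explanation, which is the interpretability benefit the authors point to.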

2.3.3. Session 8. Reinforcement Learning.

Data Augmentation in Latent Space with Variational Autoencoder and Pretrained Image Model for Visual Reinforcement Learning.

Xuzhe Dang et al., Czech Technical University, Prague, talked about ''Data Augmentation in Latent Space with Variational Autoencoder and Pretrained Image Model for Visual Reinforcement Learning''.

The authors write:

In this paper we investigate alternative data augmentation strategies for Visual Reinforcement Learning.
... We propose an innovative approach that applies data augmentation in the latent space, rather than directly manipulating pixel values [67].

Augmenting directly in the latent space sounded both effective and efficient.
Clever, indeed.
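A minimal sketch of the general idea, under loud assumptions: the "encoder" here is a fixed linear map standing in for a trained VAE encoder, and Gaussian noise is my own choice of augmentation. The notes above do not say which latent-space operations the authors actually use [67].

```python
import random

def encode(pixels, weights):
    """Stand-in encoder: a fixed linear map from pixels to a latent vector."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in weights]

def augment_latent(latent, rng, noise_scale=0.1):
    """Perturb the latent code with Gaussian noise instead of editing pixels."""
    return [z + rng.gauss(0.0, noise_scale) for z in latent]

rng = random.Random(0)  # seeded for reproducibility
weights = [[0.5, -0.25, 0.0, 0.25], [0.0, 0.5, 0.5, -0.5]]  # 4 pixels -> 2 latents
observation = [0.0, 1.0, 0.5, 0.25]

latent = encode(observation, weights)
augmented = [augment_latent(latent, rng) for _ in range(3)]
```

The observation is encoded once, and each augmented sample is then just a cheap operation on a short vector, which is presumably where the efficiency comes from.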

2.3.4. Impressions. Campus & Institut für Informatik.

Cool projects.

Novels are one of the types of text that, despite huge advances in recent years, still pose challenges for NLP. Even the latest LLMs are not capable of reading and processing complete novel texts in one go.

This project sets out to investigate a possible solution.

Perhaps by dividing the text into building blocks (scenes), with graphs connecting the scenes, characters, constellations and other information.
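Just to picture it, a tiny sketch of such a scene graph. The scenes, the character names and the structure are entirely made up and not from the project poster.

```python
from collections import defaultdict

# Hypothetical scenes: each one lists the characters that appear in it.
scenes = [
    {"id": 1, "characters": ["Anna", "Ben"]},
    {"id": 2, "characters": ["Ben", "Clara"]},
    {"id": 3, "characters": ["Anna", "Clara"]},
]

def build_graph(scenes):
    """Link characters to their scenes, and each scene to the one that follows it."""
    appears_in = defaultdict(list)  # character -> scene ids
    follows = {}                    # scene id -> next scene id
    for current, nxt in zip(scenes, scenes[1:]):
        follows[current["id"]] = nxt["id"]
    for scene in scenes:
        for name in scene["characters"]:
            appears_in[name].append(scene["id"])
    return dict(appears_in), follows

appears_in, follows = build_graph(scenes)
```

A model would then only need to read one scene at a time, and could look up the rest of the novel through the graph.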

Hallways.

Wondering what to do with that dated computer tower? Well, here's an easy solution, grow plants!

Hand-ins. ''Algorithms and Data Structures''.

Problems: Goldbach's conjecture.
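Goldbach's conjecture says that every even integer greater than 2 is the sum of two primes. I don't know what the actual hand-in asked for, but a brute-force check for small numbers might look something like this sketch (the function names are my own):

```python
def is_prime(n):
    """Trial division; fine for small n."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def goldbach_pair(n):
    """Return primes (p, q) with p + q == n and p <= q, or None if none exists."""
    if n <= 2 or n % 2 != 0:
        raise ValueError("n must be an even integer greater than 2")
    for p in range(2, n // 2 + 1):
        if is_prime(p) and is_prime(n - p):
            return p, n - p
    return None
```

For example, `goldbach_pair(28)` returns `(5, 23)`. Checking a range of numbers this way is of course no proof, the conjecture is still open.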

Campus.

The building where X-Rays were discovered.

On 8 November 1895, Röntgen produced and detected electromagnetic radiation in a wavelength range (now) known as X-rays.

3. Trip impressions.

For more trip impressions, see here.

4. Conclusion.

Indeed, the end of two wunderbar conferences. With many memorable talks.
