In this work, we propose Chain of Code (CoC)...
The key idea is to encourage LLMs to format semantic sub-tasks in a program, as flexible pseudocode, so that the interpreter can explicitly catch undefined behaviors, also to be handled by an LLM (as an ''LMulator'').
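The interplay between interpreter and ''LMulator'' can be illustrated with a minimal sketch. This is not the CoC authors' implementation: `run_with_lmulator`, `fake_lmulate` and the intentionally undefined `is_a_fruit` are hypothetical names, and the LLM call is replaced by a hard-coded stub.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


def run_with_lmulator(lines: List[str], lmulate: Callable[[str, State], State]) -> State:
    """Run each line in the Python interpreter; hand failing lines to `lmulate`."""
    state: State = {}
    for line in lines:
        try:
            exec(line, {}, state)  # the real interpreter goes first
        except Exception:
            # The interpreter could not run this line (e.g. undefined function),
            # so a language model is asked to simulate its effect on the state.
            state.update(lmulate(line, state))
    return state


def fake_lmulate(line: str, state: State) -> State:
    """Hypothetical stand-in for an LLM call; it only knows one semantic sub-task."""
    if "is_a_fruit" in line:
        return {"is_fruit": state["item"] in {"apple", "pear"}}
    raise ValueError(f"no simulated answer for: {line}")


program = [
    "item = 'apple'",
    "count = 0",
    "is_fruit = is_a_fruit(item)",    # undefined on purpose: a semantic sub-task
    "count = count + int(is_fruit)",  # ordinary code, run by the interpreter
]
result = run_with_lmulator(program, fake_lmulate)
print(result["count"])
```

The design point is that exact operations (counting, arithmetic) stay with the interpreter, while only the lines it cannot execute fall through to the model.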

Many roboticists dream of presenting a robot with a task in the evening, and returning the next morning to find the robot capable of solving the task. What is preventing us from achieving this? Sim-to-real reinforcement learning (RL) has achieved impressive performance on challenging robotics tasks, but requires substantial human effort to set up the task in a way that is amenable to RL. It's our position that algorithmic improvements in policy optimization and other ideas should be guided towards resolving the primary bottleneck of shaping the training environment, i.e., designing observations, actions, rewards and simulation dynamics. Most practitioners don't tune the RL algorithm, but other environment parameters to obtain a desirable controller. We posit that scaling RL to diverse robotic tasks will only be achieved if the community focuses on automating environment shaping procedures.

At the CERN Large Hadron Collider, protons collide 40 million times per second at the highest energies achievable in the lab, probing the microscopic nature of subatomic particles on the smallest length scales. These proton-proton collisions give rise to thousands of particles per collision... This avalanche of data will continue to grow in the next generation of experiments, posing tremendous challenges. Machine learning (ML) methods are increasingly essential to analyze this data while overcoming these challenges.
Among many other interesting topics, the talk introduced us to (particle physics) use cases for
Where the Accelerated AI Algorithms for Data-Driven Discovery (A3D3) Institute is a multi-disciplinary and geographically distributed entity, with the primary mission to lead a paradigm shift in the application of real-time artificial intelligence (AI) at scale to advance scientific knowledge and accelerate discovery [1].
Among those helping in the efforts to use ML in particle physics, we also find institutions like the European Center of Excellence in Exascale Computing (''Research on AI- and Simulation-Based Engineering at Exascale'', CoE RAISE) [2].
A team of Machine Learning experts and physics scientists working at CERN has partnered with Kaggle to answer the question: Can machine learning assist high energy physics in discovering and characterizing new particles?
Indeed, no time to waste; it is all about getting started here!
Specifically, in this competition, you’re challenged to build an algorithm that quickly reconstructs particle tracks from 3D points left in the silicon detectors.

''A computer can beat the world's most brilliant chess player and can perform any number of sophisticated tasks of logic, but the ability of technology to perform even the most basic physical tasks is significantly limited'' [5].
Or:
That it is easy to train computers to do things that humans find hard, e.g. play chess, but it is hard to train them to do things humans find easy, like walking and image recognition [6].
The researchers strap themselves into a teleoperation system directly behind the robot’s arms and puppeteer the robot through the desired actions...
Once Mobile ALOHA is operated through a task in a set environment about 50 times, powerful imitation learning algorithms help it make the leap to doing that task independently (These algorithms are similar to the large language models behind popular chatbots, but for physical action instead of words) [7].
The longer the task is, the more likely it is that some stage will fail. Can humans help the robot to continuously improve its long-horizon task performance through intuitive and natural feedback? In this work, we make the following observation: High-level policies that index into sufficiently rich and expressive low-level language-conditioned skills can be readily supervised with human feedback in the form of language corrections. We show that even fine-grained corrections, such as small movements (''move a bit to the left''), can be effectively incorporated into high level policies, and that such corrections can be readily obtained from humans observing the robot and making occasional suggestions [8].
LLMs are models for text generation. To a computer, text is a sequence of byte values -- let's call these values ''codes''. Thus, the problem LLMs are trying to solve can be thought of as a sequence of ''which code to emit next'' decisions (say, over a codebook of size 256) [9].
Now let's turn to continuous control. Like text generation, control involves making a sequence of highly granular decisions (''which control inputs to send to the robot's motors in the current state'', every 30-40ms in the case of robotic manipulation) [9].
Behavior cloning (BC), the most common method for pretraining large models for control domains such as robotics, learns from trajectory data in a way similar to LLM training via teacher forcing using text data [9].
So, can we come up with a notion of tokens that would correspond to blocks of state-action pairs, analogous to LLMs' text tokens, and treat control as a problem of sequentially choosing tokens decodable into sequences of several low-level decisions? [9]
The PRISE method builds on this exact intuition by transplanting LLMs' mechanism for constructing tokens to BC-based control model pretraining. LLM training computes tokens by sweeping over the training text corpus, identifying frequent code sequences in it, and assigning a token (basically, an ID) to each [9].
A robotic manipulation policy for virtually any task consists of switching among a handful of low-level primitives: free-space moves (FSM), lowering and raising the end-effector (LEEF/REEF), closing the gripper around an object (CGR), etc.
And then, from this, learn powerful action abstractions.
Say:
FSM-FSM-FSM-FSM-LEEF-LEEF-CGR.
...
[9].
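The token-construction idea (sweep the corpus, find frequent code sequences, assign each an ID) can be sketched as byte-pair-encoding-style merging over primitive names. This is an illustrative sketch only, not the PRISE implementation, which first has to turn continuous state-action data into discrete codes; the function names and the `+` naming of merged tokens are assumptions.

```python
from collections import Counter
from typing import List, Optional, Tuple

Pair = Tuple[str, str]


def most_frequent_pair(seqs: List[List[str]]) -> Optional[Pair]:
    """Count adjacent code pairs over all sequences and return the most common one."""
    counts: Counter = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq: List[str], pair: Pair, new_token: str) -> List[str]:
    """Replace every non-overlapping occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(seqs: List[List[str]], num_merges: int):
    """Greedily merge the most frequent adjacent pair, `num_merges` times."""
    merges: List[Pair] = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        seqs = [merge_pair(s, pair, pair[0] + "+" + pair[1]) for s in seqs]
        merges.append(pair)
    return seqs, merges


trajectory = "FSM-FSM-FSM-FSM-LEEF-LEEF-CGR".split("-")
tokens, merges = learn_merges([trajectory], num_merges=1)
print(tokens[0], merges)
```

After one merge, the four consecutive free-space moves collapse into two `FSM+FSM` tokens, so a policy choosing tokens makes fewer decisions per trajectory.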
Where, the HOG algorithm divides an image into small cells, computes each cell's gradient orientation and magnitude, and then aggregates the gradient information into a histogram of oriented gradients. These histograms describe the image features and detect objects within an image [11].
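The cell-histogram step described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: unsigned orientations in 9 bins, 8x8 cells, hard bin assignment, and no block normalization or bin interpolation as used in full HOG pipelines; `hog_cells` is a hypothetical name.

```python
import numpy as np


def hog_cells(image: np.ndarray, cell: int = 8, bins: int = 9) -> np.ndarray:
    """Return one magnitude-weighted orientation histogram per cell."""
    img = image.astype(float)
    gy, gx = np.gradient(img)                       # derivatives along rows and columns
    mag = np.hypot(gx, gy)                          # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0    # unsigned orientation in degrees
    # Hard-assign each pixel to an orientation bin (clamped to guard rounding).
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)

    rows, cols = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            window = (slice(r * cell, (r + 1) * cell), slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(
                bin_idx[window].ravel(), weights=mag[window].ravel(), minlength=bins
            )
    return hist


# A horizontal intensity ramp: every gradient points along the x-axis.
ramp = np.tile(np.arange(16, dtype=float), (16, 1))
features = hog_cells(ramp)
print(features.shape)
```

For the ramp image all gradient energy lands in the first orientation bin of each cell, which is the kind of signature a downstream detector would match against.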
Caffe supports many different types of deep learning architectures geared towards image classification and image segmentation. It supports CNN, RCNN, LSTM and fully-connected neural network designs [11], [12].
In April 2017, Facebook announced Caffe2, which was merged into PyTorch at the end of March 2018.
While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides, they mostly fall prey to the limitation of only input-side multimodal understanding, without the ability to produce content in multiple modalities.
I.e., as humans, we continuously engage in the intricate process of receiving and producing cross-modal content.
...
NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, image, video, and audio.
Assessing the utility of MLLMs presents considerable challenges, primarily due to the absence of multimodal benchmarks that align with human preferences. Drawing inspiration from the concept of LLM-as-a-Judge within LLMs, this paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs in assisting judges across diverse modalities.
A closer examination reveals persistent challenges in the evaluative capacities of LLMs, including diverse biases, hallucinatory responses, and inconsistencies in judgment.
Human preference goes beyond ''merely'' accuracy in evaluating a response.
...
We advocate for additional efforts dedicated to supporting the continuous development within the domain of MLLM functioning as judges [13].
RouteFinder, a framework for developing foundation models for Vehicle Routing Problems (VRPs):
Our key idea is that a foundation model for VRPs should be able to model variants,
by treating each variant as a subset of a larger VRP problem, equipped with different attributes [14].
...
For more, see also the article ''Large Language Models as Hyper-Heuristics for Combinatorial Optimization'':
The omnipresence of NP-hard combinatorial optimization problems (COPs) compels domain experts to engage in trial-and-error heuristic design. The long-standing endeavor of design automation has gained new momentum with the rise of large language models (LLMs) [15].
Clever.
Use FM models that:
- Generalize to a wide diversity of tasks.
- Are flexible enough to be able to use diverse input sensors and shapes.
- Are efficient.
- Are easy to get started with.
Such foundation models will also be relevant for projects like:
Presto excels at a wide variety of globally distributed remote sensing tasks and performs competitively with much larger models while requiring far less compute. Presto can be used for transfer learning or as a feature extractor for simple models, enabling efficient deployment at scale.
And, as Presto is pretty lightweight,
Today's foundation models can access the sum total of human knowledge through natural language. But how do we really harness this knowledge, and adapt these models to particular domains and applications?
About data:
They autonomously generate responses to unlabeled instructions, refine these responses interactively, and use the refined and filtered data for iterative self-training, thereby progressively boosting their capabilities [16].
Certainly:
Take away:
- Synthetic data allows controllability (e.g. sampling more images from target classes).
- For image classification, currently: Synthetic data < retrieved data.
System 1 is fast, automatic, and intuitive, operating with little to no effort. This mode of thinking allows us to make quick decisions and judgements based on patterns and experiences. In contrast, System 2 is slow, deliberate, and conscious, requiring intentional effort [17].
According to Kambhampati: ''LLMs can't plan'':
...
To survive physically or psychologically, we sometimes need to react automatically to a speeding taxi as we step off the curb or to the subtle facial cues of an angry boss. That automatic mode of thinking, not under voluntary control, contrasts with the need to slow down and deliberately fiddle with pencil and paper when working through an algebra problem [18].
If block C is on top of block A, and block B is separately on the table, can you tell me how I can make a stack of blocks, with block A on top of B, and B on top of block C, without moving C?
Here, LLMs will typically begin hallucinating (about specs, physics or goal) in order to ''solve'' the problem, without understanding that the problem is indeed unsolvable.
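That the puzzle has no solution can be checked mechanically with an exhaustive search over block configurations. The sketch below is an illustration, not taken from the cited work; the state encoding (a dict mapping each block to what it rests on) and the `frozen` parameter for blocks that may not be moved are assumptions.

```python
from collections import deque

BLOCKS = ("A", "B", "C")


def clear_blocks(state):
    """Blocks with nothing on top of them."""
    supports = set(state.values())
    return [b for b in BLOCKS if b not in supports]


def successors(state, frozen):
    """All states reachable by moving one clear, non-frozen block."""
    clear = clear_blocks(state)
    for block in clear:
        if block in frozen:
            continue
        for target in ["table"] + [t for t in clear if t != block]:
            if state[block] != target:
                nxt = dict(state)
                nxt[block] = target
                yield nxt


def plan_exists(start, goal, frozen=frozenset()):
    """Breadth-first search: is any state satisfying `goal` reachable?"""
    key = lambda s: tuple(sorted(s.items()))
    seen, queue = {key(start)}, deque([start])
    while queue:
        state = queue.popleft()
        if all(state[b] == on for b, on in goal.items()):
            return True
        for nxt in successors(state, frozen):
            if key(nxt) not in seen:
                seen.add(key(nxt))
                queue.append(nxt)
    return False


start = {"A": "table", "B": "table", "C": "A"}  # C on A, B on the table
goal = {"A": "B", "B": "C"}                     # A on B, B on C
print(plan_exists(start, goal, frozen={"C"}))   # C may not be moved
print(plan_exists(start, goal))                 # C may be moved
```

With C frozen, only B can ever move, so A stays buried and the search space is exhausted after two states; once C may be moved, a three-step plan exists.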
The ARC-AGI benchmark measures the efficiency of AI systems to acquire new skills outside of their training data. Chollet considers the ability to efficiently acquire new skills a mark of AGI. Since its inception, however, the ARC-AGI’s success rate has been lagging, from an already low 21 percent in 2020 to just 30 percent in 2023.
Our results indicate that large language models such as GPT-3 have acquired an emergent ability to find zero-shot solutions to a broad range of analogy problems [19].
Well, well:
We take one set of analogy problems used to evaluate LLMs and create a set of "counterfactual" variants, i.e., versions that test the same abstract reasoning abilities but that are likely dissimilar from any pre-training data. We test humans and three GPT models on both the original and counterfactual problems, and show that, while the performance of humans remains high for all the problems, the GPT models' performance declines sharply on the counterfactual set [20].
- ''Evidence that, despite previously reported successes of LLMs on analogical reasoning, these models lack the robustness and generality of human analogy-making'' [20].
Bridge is a large and diverse dataset of robotic manipulation behaviors designed to facilitate research in scalable robot learning [21].
In the dataset we have ''observations'', where we know that an action leads to a new observation.
General Navigation Models are general-purpose goal-conditioned visual navigation policies trained on diverse, cross-embodiment training data, that can control many different robots in zero-shot [22].
Indeed, ''Open X-Embodiment: Robotic Learning Datasets and RT-X Models'':
Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Here we instead train a ''generalist'' X-robot policy that can be adapted efficiently to new robots, tasks, and environments [23].
There are also vision-language-action (VLA) models.
Where we co-fine-tune (a combination of fine-tuning and co-training, where we keep some of the old vision & text data around) an existing vision-language model with robot data. The robot data includes the current image, language command and the robot action at the particular time step [24].
And: OpenVLA (An Open-Source Vision-Language-Action Model):
That supports controlling multiple robots out of the box, and can be quickly adapted to new robot setups via parameter-efficient fine-tuning.
And the company Physical Intelligence might build even larger and better models in the not so distant future.
Developing foundation models and learning algorithms to power the robots of today and the physically-actuated devices of the future.
Interesting, indeed.
Afraid of #GPT4 going rogue and killing y'all? Worry not. Planning has got your back. You can ask it to solve any simple few step classical planning problem and snuff that - AGI spark- well and good [26].
On arXiv:
Our studies also show that on many critical capabilities-including plan generation-LLM performance falls quite short, even with the SOTA models [27].
On X:
Claude 3 Opus does no better at planning (1.3% in Mystery Blocksworld), but apparently a tad better at approximate retrieval on standard Blocks World [29].
And at NeurIPS:
Our findings reveal that LLMs’ ability to generate executable plans autonomously is rather limited, with the best model (GPT-4) having an average success rate of ∼12% across the domains [30].
Great points, indeed!

At AAMAS 2018, a computational model of user emotions for empathic agents (MARSSI: Model of Appraisal, Regulation, and Social Signal Interpretation) was introduced. The model can help with, e.g., ''social signal interpretation (by) taking directions of expressions into account''.
For example, the therapist could use Nova to reveal discrepancies between self-perception of the patient
and external perception, and discuss it with the patient...
A potential scenario would be, that a patient is convinced that his grief is clearly observable for others
and, in contrast, there is no sign of negative valence during the sessions.
The therapist could also use it for self-reflection.
For example if he has very low arousal and negative valence with one patient, compared to others, he could use such indicators from Nova as a starting point for self-reflection [34].
We consider the problem of univariate time series prediction from an elementary machine learning point of view. Beginning with the question of whether and how Principal Component Analysis (PCA) can be used for time series prediction, we describe a simple methodology and attempt to classify PCA-based prediction in terms of statistics, signal processing and dynamical systems theory [36].
Idea:
One of the main benefits of PCA for time series data is that it can help you simplify your data and reduce the noise. By keeping only the most important components, you can focus on the essential features of the data and ignore the irrelevant ones. This can make your analysis easier and faster, as well as improve the performance and accuracy of your models. Another advantage of PCA is that it can help you discover hidden relationships and structures in your data, such as seasonality, cycles, or outliers. By visualizing the principal components, you can gain insights into the dynamics and behavior of your data over time [37].
Many interesting notes in the talk, indeed.
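The noise-reduction claim can be made concrete with a small sketch. It assumes one common way of applying PCA to a univariate series, namely stacking sliding windows into a lag matrix, keeping the top `k` components and averaging the overlapping reconstructions; this is not necessarily the methodology of [36], and the function names, window length and `k` are illustrative choices.

```python
import numpy as np


def lag_matrix(series: np.ndarray, window: int) -> np.ndarray:
    """Stack sliding windows of the series as rows."""
    n = len(series) - window + 1
    return np.stack([series[i:i + window] for i in range(n)])


def pca_denoise(series, window: int, k: int) -> np.ndarray:
    """Project the lag matrix onto its top-k principal components and rebuild the series."""
    series = np.asarray(series, dtype=float)
    X = lag_matrix(series, window)
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    Xk = (U[:, :k] * S[:k]) @ Vt[:k] + mean   # rank-k reconstruction

    # Each time step appears in several windows; average those estimates.
    total = np.zeros(len(series))
    counts = np.zeros(len(series))
    for i, row in enumerate(Xk):
        total[i:i + window] += row
        counts[i:i + window] += 1
    return total / counts


rng = np.random.default_rng(0)
t = np.arange(200)
clean = np.sin(2 * np.pi * t / 25)
noisy = clean + 0.3 * rng.normal(size=t.size)
denoised = pca_denoise(noisy, window=25, k=2)

print(np.mean((noisy - clean) ** 2), np.mean((denoised - clean) ** 2))
```

Two components suffice here because windows of a pure sinusoid span a two-dimensional subspace; real series with trend or several cycles would need a larger `k`.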
We investigate the domain of animal welfare and present our latest findings in relation to the automated detection of animal welfare violations. To this end, we introduce three different situations of increased animal welfare risk occurring in a pig slaughtering process and elucidate YOLO-based approaches to detect these situations based on video data [40].
Clever, indeed:
Though the reported results are considered to be preliminary, our solution already detects most of the situations of increased animal welfare risk with high accuracy [40].
In the future they expect to be able to utilize audio as well, to identify animal welfare violations.
(But) Despite research in mobile image segmentation, there is a lack of focus on mobile wound segmentation.
To address this gap, we conduct initial research on three lightweight architectures to investigate their suitability for smartphone-based wound segmentation.
...
We deploy the models into a smartphone app for visual assessment of live segmentation, where results demonstrate the effectiveness of TopFormer in distinguishing wounds from wound-coloured objects [41].
Needed, and very useful, indeed.
There are situations in which unlabeled data is abundant but manual labeling is expensive. In such a scenario, learning algorithms can actively query the user/teacher for labels. This type of iterative supervised learning is called active learning. Since the learner chooses the examples, the number of examples to learn a concept can often be much lower than the number required in normal supervised learning [42].
Results:
We investigate the effects of class sparsity, ceiling performance, number of classes, and different AL strategies on AL performance. Our results show that AL performance is superior on datasets with sparser classes, lower ceiling performance, fewer classes, and when using uncertainty sampling strategies [43].
Active learning works, indeed.
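Uncertainty sampling, the strategy mentioned above, can be sketched as a query-selection step. The snippet assumes the common ''least confident'' variant and a hypothetical array of class probabilities predicted for an unlabeled pool; it is not the setup used in [43].

```python
import numpy as np


def least_confident(probs: np.ndarray, n_queries: int) -> np.ndarray:
    """Indices of the pool samples whose top class probability is lowest."""
    confidence = probs.max(axis=1)            # model's confidence per sample
    return np.argsort(confidence)[:n_queries]  # least confident first


# Hypothetical predicted probabilities for four unlabeled samples, two classes.
pool_probs = np.array([
    [0.90, 0.10],
    [0.55, 0.45],
    [0.60, 0.40],
    [0.99, 0.01],
])
query_idx = least_confident(pool_probs, n_queries=2)
print(query_idx)
```

In a full loop, the selected samples would be labeled by the user/teacher, added to the training set, and the model retrained before the next round of queries.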
Zuse invented the world's first programmable computer.
The functional program-controlled Turing-complete Z3 became operational in May 1941.
Thanks to this machine and its predecessors,
Zuse is regarded by many as one of the inventors of the modern computer [44].
Convolutional Neural Networks (CNNs) are known for their ability to learn hierarchical structures, naturally developing detectors for objects, and semantic concepts within their deeper layers. Activation maps (AMs) reveal these saliency regions, which are crucial for many Explainable AI (XAI) methods [46].
The study evaluates XAI methods in the context of CNNs, comparing the performance of LaFAM with other methods.
The classifier and the explanation method can be fooled through attacks at inference time or at training time, and in a vast number of different attack scenarios. These adversarial corner cases raise concerns on the trustworthiness of explainable machine learning in general [47].
The following attack types were described in the talk:
Recent research demonstrates geographic biases in various Large Language Models that reflect common human biases, which are presumably present in the training data.
...
We evaluate the responses of ChatGPT-3.5, ChatGPT-4, and LeoLM for various ratings and estimations by state.
...
In particular, we demonstrate that Eastern states are consistently rated lower (or worse, depending on task) [48].
Interesting, and not good (that this is so).
Recent breakthroughs in Artificial Intelligence (AI) are poised to transform many domains, including autonomous railway transportation systems.
...
However, safety is essential in this high-stake, safety-critical domain.
To ensure compliance with current safety certification standards, we propose a comprehensive methodology for evaluating AI-based components in railway applications [49].
Here, a selection technique is introduced that uses the capabilities of large language models to closely align the selected part of the knowledge base with the context of the goal:
In the field of automated reasoning, practical applications often face a significant challenge: Knowledge bases are typically too large to be fully processed by theorem provers.
...
To still be able to prove that a given goal follows from a large knowledge base, selection techniques are used to determine the parts of the knowledge base that are relevant to the goal.
...
Traditional selection techniques used for this task are usually syntax-based and often overlook a crucial aspect
- The meaning of symbol names and axioms.
...
Especially in commonsense reasoning scenarios, the meaning embedded in the symbol names provides invaluable insights [50].
Commonsense knowledge is one of the fundamental aspects of human cognition and reasoning. A large fraction of this knowledge consists of general common sense, which refers to a broad and fundamental understanding of the world that is shared by most people worldwide [51].
But:
Commonsense is often unspoken and unwritten, with the assumption that the other party holds the same understanding. Hence, unlike factual knowledge, it is acquired over time through exploration and cultural learning. This often entails shared societal norms and expectations, and a shared understanding of the world to navigate diverse situations, which leads to cultural commonsense – a specific set of values, beliefs, norms, and behaviors that are accepted and practiced within a particular culture or community. Cultural commonsense is a form of commonsense knowledge, and while agreed upon by a group of people, it may not necessarily be commonsensical to others outside that group [51].
For instance, ''wedding dresses are typically red'' is a cultural norm shared in China, India, and Vietnam, but not shared in Italy or France [51].
Authors Siqi Shen et al., University of Michigan, conclude:
Our findings indicate that LLMs tend to associate general commonsense with cultures that are well-represented in the training data, and that LLMs have uneven performance on cultural commonsense, where they underperform for less-represented cultures [51].
Indeed, navigating the world of common sense and common-sense reasoning is a tricky thing...
Skat is a three-player trick-taking card game of the ace–ten family, devised around 1810 in Altenburg (in the Duchy of Saxe-Gotha-Altenburg). It is the national card game in Germany and one of the most popular card games in Poland.
...
In order to play the game well, one must be good at hand evaluation, counting, cooperation and bidding intelligence [51].
Inspired by recent advances in Computer Skat and Bridge, here we look into automated play for several other trick-taking card games.
...
We have included bidding, team building, and game selection, as well as general and specialized card recommenders applicable for the different stages of trick-taking.
...
The AIs are evaluated in different variants and against a general card player that lacks expert rules [52].
An interesting topic; see e.g. what Michael Schofield et al. write in the ''Journal of Artificial Intelligence Research, 66'':
Game playing presents a controlled environment in which to evaluate AI techniques, and so we have seen an increase in interest in this field of research. Games of imperfect information offer the researcher an additional challenge in terms of complexity over games with perfect information [General Game Playing with Imperfect Information, ''Journal of Artificial Intelligence Research, 66''].
Indeed, this talk might have started with the ''Skat'' card game, but, clearly, topics like multi-agent game tree search, how to deal with incomplete information, coalition and cooperation building etc. are not far away when playing card games like Skat.
...
Quantum computers could solve certain problems exponentially faster than we know how to solve them with any existing computer
...
But there has always been a catch, and I'm not even talking about the difficulty of building practical quantum computers. Supposing we had a quantum computer, what would we use it for?
The ''killer apps'' - the problems for which a quantum computer would promise huge speed advantages over classical computers - have struck some people as inconveniently narrow. By using a quantum computer, one could dramatically accelerate the simulation of quantum physics and chemistry, break almost all of the public-key cryptography currently used on the Internet (for example, by quickly factoring large numbers, with the famous Shor's algorithm). And maybe achieve a modest speedup for solving optimization problems in the infamous ''NP-hard'' class
...
Alas, as interesting as that list might be, it's hard to argue that it would transform civilization in anything like the way classical computing did in the previous century...[53].
New quantum algorithms promise an exponential speed-up for machine learning, clustering and finding patterns in big data. But to achieve a real speed-up, we need to delve into the details [54].
The first quantum computer with 2 qubits was built by IBM in 1997 [56].
A digital computer both stores and processes information using bits, which can be either 0 or 1.
Physically, a bit can be anything that has two distinct configurations: one represented by ''0'', and the other represented by ''1''. It could be a light bulb that is on or off, a coin that is heads or tails, or any other system with two distinct and distinguishable possibilities.
In modern computing and communications, bits are represented by the absence or presence of an electrical signal, encoding ''0'' and ''1'' respectively.
...
A quantum bit is any bit made out of a quantum system, like an electron or photon.
Just like classical bits, a quantum bit must have two distinct states: One representing ''0'' and one representing ''1''.
Unlike a classical bit, a quantum bit can also exist in superposition states, be subjected to incompatible measurements, and even be entangled with other quantum bits. Having the ability to harness the powers of superposition, interference and entanglement makes qubits fundamentally different and much more powerful than classical bits [55].
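Superposition, interference and entanglement can be made tangible with a tiny state-vector simulation. This is a classical NumPy sketch for illustration only (it shows the linear algebra, not any quantum speed-up); the variable names are assumptions, and the two-qubit basis is ordered 00, 01, 10, 11 with the first qubit as control.

```python
import numpy as np

ZERO = np.array([1.0, 0.0])                       # the state representing ''0''
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)  # Hadamard gate
CNOT = np.array([                                  # flips qubit 2 if qubit 1 is ''1''
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

# Superposition: the Hadamard gate puts a single qubit "between" 0 and 1.
plus = H @ ZERO
probs_single = np.abs(plus) ** 2      # measurement probabilities for 0 and 1

# Interference: applying the gate again makes the ''1'' amplitudes cancel.
back = H @ plus

# Entanglement: Hadamard on qubit 1, then CNOT, gives a Bell state.
bell = CNOT @ np.kron(plus, ZERO)
probs_pair = np.abs(bell) ** 2        # probabilities for 00, 01, 10, 11

print(probs_single, back, probs_pair)
```

In the Bell state only the outcomes 00 and 11 occur, each with probability one half, so measuring one qubit fixes the other; no assignment of independent single-bit probabilities reproduces that.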
Quantum computers promise to perform certain computations that are beyond the reach of classical computers. They will do so by exploiting uniquely quantum phenomena such as entanglement and superposition, which allow multiple qubits to exist in multiple collective states at once.
...
Researchers have generally said that state-of-the-art error-correction techniques will require more than 1,000 physical qubits for each logical qubit. A machine that can do useful computations would then need to have millions of physical qubits [59].
Indeed, ''getting enough qubits to work together to run all kinds of quantum algorithm - in what is known as a universal quantum computer - has proved extremely challenging (D-Wave's machines are not ''universal computers'', and can only run a limited range of quantum algorithms)'' [60].
Baukhage succinctly noted that we do not have Universal Quantum Computers yet... [60], [61], [62].
- An assembly of independent qubits, each of which functions according to the probabilistic nature of quantum mechanics, including superposition and entanglement of quantum states, as well as a universal quantum logic gate set which supports all of the operations possible at the quantum mechanical level for each qubit.
- Can simulate or implement any and all operations of a classical computer. A classic Turing machine.
- Capable of simulating physics, especially quantum mechanics, but the rest of physics as well [60].
Quantum Bayesian Computation (QBC) is an emerging field that levers the computational gains available from quantum computers to provide an exponential speed-up in Bayesian computation [63].
Still, the quantum world is a very, very strange world.
Where did we get that [Schrödinger's equation] from? Nowhere. It is not possible to derive from anything you know. It came out of the mind of Schrödinger, invented in his struggle to find an understanding of the experimental observation of the real world [64].
Machine Learning models often only generalize reliably to samples from the training distribution. Consequentially, detecting when input data is out-of-distribution (OOD) is crucial, especially in safety-critical applications
...
A logical reasoning system uses this knowledge base at run-time to infer whether inputs are consistent with prior knowledge about the training distribution
...
We demonstrate the effectiveness of our method through experiments on several datasets [66].
In contrast to DNNs, the logical reasoning system provides our system with a certain degree of interpretability in the sense that we are able to provide meaningful and intuitive explanations, based on human-interpretable concepts for many of the decisions of the system [66].
Very relevant, and useful indeed.
In this paper we investigate alternative data augmentation strategies for Visual Reinforcement Learning.
... We propose an innovative approach that applies data augmentation in the latent space, rather than directly manipulating pixel values [67].
Augmentation directly in a latent space sounded very effective and efficient.
Novels are one of the types of text that, despite huge advances in recent years, still pose challenges for NLP. Even the latest LLMs are not capable of reading and processing complete novel texts in one go.
A possible solution for this should be investigated in this project.
Perhaps by dividing the text into building blocks (scenes), where graphs can connect the scenes, characters, constellations and other information.