
Deep Inference

Session 5: Foundational Aspects of General Intelligence and AI

2:15 pm – 2:45 pm GMT

Karl Friston (University College London) – Deep Inference

To watch the video on YouTube, click the ‘YouTube’ button above.

Transcript:

Sheri Markose

What I would say is that Karl is one of the leading neuroscientists. He's one of the great exponents of the successes of the current AI frameworks. But on top of that, going back to the unreasonable effectiveness of mathematics, the line of mathematics that he wants to propound as having been very effective is the free energy principle: that, at the heart of the matter, there is some element, which he calls surprise, that you want to minimize. And on general intelligence, going back to our previous webinar on AI and general intelligence on November 5, I completely agree with him that the basis of general intelligence would be to keep life's vitals, as it were, in a narrow range, and you wouldn't want any surprises there. So to think that the free energy principle would give us a unified framework for intelligence itself is Karl's way of addressing the unreasonable effectiveness of mathematics. Karl, on to you.

Karl Friston

Thank you very much. It's a great pleasure to be here, and a privilege to talk to such a wide range of expertise and investment. I am effectively going to tell exactly the same story that Hector told, but using a slightly different language, so forgive me for any repetition. My talk is going to be about how to address general intelligence from the point of view of sentient behavior. And I use 'sentient behavior' deliberately, because it is about sentience and sense-making under the hood, but it's also enactive, in the sense that it is about behavior and making choices and actively, as we will see, sampling the data that you're making sense of. So I'm going to frame that in terms of what philosophers call self-evidencing, motivated from the point of view of normative models of sentient behavior and generalized intelligence, with a particular focus on what this kind of perspective gives you in relation to AI research. Then I'll rush through some practical aspects using simulations, just to unpack the basic ideas; but they're going to be very trivial simulations in relation to what we have been hearing about during the previous talk. It's just going to be about trying to understand how a little animal chooses to look over there or look over here. So, the notion of self-evidencing rests upon a very simple assumption that is underwritten by, as we'll see at the end, a physics of sentience that comes from trying to understand self-organization from the point of view of non-equilibrium steady-state physics. But for the moment, I'm just going to assume that all of action and perception can be summarized in terms of optimizing beliefs. And by beliefs, I mean conditional probability distributions that are physically represented, or encoded, or parameterised by some state of an artifact, a sentient artifact. And I'm denoting those beliefs about the things that the artifact or the agent doesn't know: basically, states of affairs in the world causing its observations, and what it is actually doing in terms of control variables or policies. And the argument is that a sufficient explanation for all behavior, and maybe an imperative from an existential point of view, is that you just want to maximize the marginal likelihood of your observables, the things that you measure, under a cause-effect model, a generative model, a statistical model that could be cast in terms of symbols or in terms of continuous variables: a model of how those observable consequences were caused. And I'm going to motivate this simple specification of normative, optimal behavior in terms of this optimization just by looking at what other general formulations of sentient behavior would look like through the lens of this formulation. So if I take the log of this marginal likelihood, also known, as we'll see in a second, as model evidence in Bayesian statistics, then I've got a nice proxy for value, in the sense that everything
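For concreteness, the claim being made here can be written as the following standard bound; the notation is a sketch in conventional symbols (o for outcomes, s for hidden states, q for the variational posterior, m for the model) rather than a quotation from the slides.

```latex
% Self-evidencing: maximize the (log) marginal likelihood of observations under a generative model
\ln p(o \mid m) \;=\; \ln \!\int\! p(o, s \mid m)\, ds
\;\;\ge\;\;
\underbrace{\mathbb{E}_{q(s)}\!\big[\ln p(o, s \mid m) - \ln q(s)\big]}_{\text{evidence lower bound} \;=\; -F[q,\,o]}
```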

I choose to do, or my beliefs about what I'm going to do, will be in the service of maximizing value, if I afford value to these observations here. And what we'll see in a moment is that we can estimate or evaluate a lower bound on this log marginal likelihood that, if you treat it as value, leads to formulations in terms of reinforcement learning, optimal control theory in engineering, and expected utility theory in economics. If you were an information theorist, you would take the negative of this and treat it as self-information or surprisal, or more simply surprise, which underwrites formulations of good behavior in terms of things like the principle of maximum mutual information, or likewise the minimum redundancy or maximum efficiency principle, and indeed the free energy principle that inherits from this evidence lower bound here. That's nice, because the average of self-information is entropy, which means that this will look as if it's trying to minimize entropy, or the dispersion of its observable exchanges with the outside world, its eco-niche. And of course, that is the holy grail of things like self-organization in synergetics. And indeed, for a physiologist, it's just a statement of homeostasis: that my beliefs about the state of the world, and my beliefs about what I'm doing, are in the service of keeping my observable interactions with the world, my physiology, within viable bounds, and stopping their dispersion and tendency to disorder. If I were a statistician, then I would simply call this marginal likelihood 'model evidence': the evidence inherent in these outcomes, these data, under a given model, where I simply am the model of how those outcomes were generated. So that's my motivation. This free energy inherits from two sources, and I say this in relation to the previous discussion with Hector and Sheri. You can either trace its roots to some of the work of Richard Feynman and the path integral formulation, as a device to make an impossible marginalization or integration problem tractable by converting it into an optimization problem, by constructing a bound on something you want to optimize. Or you can see its heritage, its lineage, as deriving from Kolmogorov complexity, minimum description length and minimum message length. So depending upon your favorite kind of maths, you get to the same quantity. So what is this quantity? It's essentially a quantity that articulates Jaynes's maximum entropy principle, but making the point very clearly that the entropy we are talking about is the entropy of our beliefs about the states of the world generating our observable outcomes, under constraints; and the constraints in this sense are supplied by a (negative) energy, usually decomposed into a log likelihood and a log prior. And that's key, in the sense that these two things together provide a probabilistic description of the cause-effect relationships, the joint distribution of causes and consequences. That, in my world, is referred to as a generative model; exactly the same kind of concept underlies generative adversarial networks and the like. So everything I'm going to say really rests upon committing to a generative model, and then everything else is about optimizing that model and making inferences under that model. Just to illustrate what this brings to the table, in a very broad-brush sense:
What I've done here is just rearranged that expression and divided it into something that a statistician would recognize immediately. First of all, making the point that I can rearrange this into an accuracy term: the expected log likelihood of some observations, expected under my beliefs about the causes of those observations. And then we have this other term, a complexity term, which is the KL divergence that basically equips the free energy with its role as a bound upon the thing that I want to optimize, the log marginal likelihood here. This complexity term is really interesting from my perspective; it has cropped up on a number of occasions throughout, certainly, the last session or two.
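Written out, the rearrangement being described is the usual decomposition of variational free energy into complexity minus accuracy (again in conventional, illustrative notation):

```latex
F[q, o]
\;=\;
\underbrace{D_{\mathrm{KL}}\!\big[\,q(s)\,\big\|\,p(s)\,\big]}_{\text{complexity: how much I change my mind}}
\;-\;
\underbrace{\mathbb{E}_{q(s)}\!\big[\ln p(o \mid s)\big]}_{\text{accuracy}}
\;\;\ge\;\;
-\ln p(o \mid m)
```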

Technically, it's just the KL divergence between my beliefs having observed some data and my prior beliefs, so it's the degree to which I have changed my mind. If I were a computer scientist or a statistician, it's the quantification of my belief updating in terms of bits or nats, as scored by the relative entropy between my posterior beliefs after seeing the data and my prior beliefs: the degrees of freedom I'm using up in order to provide an accurate explanation of what's going on. And slightly contrary to Hector, who saw statistics as being subsumed under algorithms, my perspective, which I'm willing to change, is that it's the other way around: in the limiting case that your states are quantized, and in the limiting case of large numbers, this becomes Kolmogorov-like complexity, and from that you can develop Solomonoff induction and universal computation, all in the service of finding the simplest explanation for some observations. The first key point, then, is that at the root of all this is a generative model that entails a cause-effect relationship. The second key point is that if you want this calculus to speak to choices and actions and decisions, then you have to think about what this kind of self-evidencing means for the consequences of a choice or a decision or an act. So what we're going to see, basically, is that we're going to choose those behaviors that optimize the accuracy and complexity expected following a particular move or choice. And what we'll see is that this translates into optimal Bayesian design principles; there was some discussion of this in terms of little sensors, I think, during Rosella's talk: finding the right data, designing the right kinds of experiments, to solicit the data that's going to be maximally informative in reducing your uncertainty about the causes of your observations. This complexity term, when you take its expectation under the posterior predictive density over the future, we're going to see, underwrites Bayes optimality in a Bayesian decision-theoretic sense. And the two together, literally added together, are going to cast self-evidencing in this setting as acting to minimize your expected free energy. And the way I normally motivate that is to ask the audience: imagine you're an owl, and you're hungry. So what are you going to do? Let me ask somebody. Sheri, what are you going to do?

You're very hungry. She's got her microphone off right now.

Sheri Markose

I'd obviously look for some prey. But yeah, how does that figure in your statistical model?

Karl Friston

We're going to find out now, because that was a perfect answer: looking for some prey. So, in looking for some prey, and let's just see the owl looking for its prey, that answer, I think, reveals some deep truths about a question that you might pose yourself. If you had to commit to some maths or some principle in order to identify the best thing to do next, you can go one of two ways. So I'm deliberately introducing this dialectic; I'm going to repair it later, and one is going to become a special case of the other, but I think this dialectic is really useful and is informed by Sheri's answer. We can either assume the existence of some value function of the states that would ensue if I took this action u_t. And then, if I could evaluate that value function, I could identify the thing that I should do, whenever I was in this particular state, to generate the next most valuable state. That view, as we'll see in a second, predicated on Bellman's optimality principle, assumes that optimal action depends upon states of the world. The problem with that, though, is inherent in Sheri's answer. If I cast 'I am going to look for food' as resolving uncertainty about the location of my prey, then immediately that tells me that the uncertainty, the thing that I'm optimizing, is an attribute of a belief. So uncertainty is an attribute of a belief; what I'm actually optimizing is not a state of the world: optimal action depends upon beliefs about states of the world. So your action depends upon your current beliefs, and that, roughly speaking, is where your brain comes in. So I'm denoting that with a very different kind of construction, where we're talking about a functional of beliefs about what will happen if I did that. Furthermore, the order in which I do things matters. It matters whether I look for my food and then eat it, or try to eat it and then look for my food. And that tells you immediately that optimal action has to be sensitive to the sequence of actions in the world. I've written that down here as a sum: this function here is going to be an energy functional, and this becomes an action, a path integral over a time interval of an energy. So this gives you a very different perspective on what the best things to do are, in the sense that now we're going to be context sensitive, sensitive to our current beliefs about the world, and we're going to be minimizing this path integral, or optimizing this path integral of an energy function. And I phrase it like that because you've got these two very distinct ways of writing down a first-principles account of normative, optimal behavior: this Bellman-optimality-like approach, which would be applicable in the domain of optimal control theory, deep reinforcement learning, Bayesian decision theory and so on, contrasted with this belief-based scheme, which looks much more like a variational principle of stationary action, including things like the free energy principle or, in a biological context, active inference, also subsuming things like artificial curiosity and intrinsic motivation, optimal Bayesian design, and information seeking of an optimal sort in the context of uncertainty, in things like partially observed Markov decision processes. So, what I want to do now... Sheri, you must tell me when I'm five minutes from the end. I'm not going to get to the end of this presentation, but I don't think it matters, because I imagine there are going to be more interesting discussions than me doing the show and tell.
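The dialectic being set up here can be sketched as follows; the notation is illustrative (V a value function of states, u_t an action, π a policy, G the expected free energy, a functional of beliefs q about future states and outcomes under that policy):

```latex
% State-based (Bellman-style): optimal action is a function of the current state
u_t^{*} \;=\; \arg\max_{u}\; \mathbb{E}\big[\,V(s_{t+1}) \mid s_t, u\,\big]

% Belief-based: optimal policies minimize a path integral of expected free energy,
% a functional of beliefs about the future under each candidate policy
\pi^{*} \;=\; \arg\min_{\pi}\; \sum_{\tau > t} G\big(q(s_\tau, o_\tau \mid \pi)\big)
```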
But this is minutes before we get to some questions. Okay.

Karl Friston

So this is just a rehearsal of what I've just said, with reference to the functional form of this variational free energy. For those people in machine learning, this is just the evidence lower bound. So I've written out the free energy here, and the action, or the expected free energy, which is a functional of my beliefs consequent upon a particular sequence of actions, or some policy, here. And I've done that just to show the beautiful symmetry between the functional forms of the free energy, which is the evidence lower bound, and the expected free energy. So here, again, we have our free energy written in terms of complexity and accuracy. I can move the terms around to carve up its interpretation in a different way. Just to reiterate, it is a bound on log evidence, and it's exactly the same bound you use in a variational autoencoder; if you know that, this is exactly the same quantity. If we now take this quantity and take its expectation under the outcomes that, on this story, I predict I would encounter should I make that choice or commit to that policy, then we end up with a formally similar expression, but now conditioned upon the policy. So I just want to go through what that expression means, or how it could be interpreted, from different perspectives. First of all, let's forget about this log of the marginal likelihood and just focus on the expected bound here, on the divergence here. So what is this? Well, it's another KL divergence. It's a KL divergence that scores the difference between my beliefs after seeing these fictive outcomes in the future, given my policy, relative to what I believe in the absence of those observations. So it scores the information gain. In visual search, this is known as Bayesian surprise; it's called the resolution of uncertainty. And of course, it is basically an expected information gain, or a mutual information. If I now remove a particular sort of uncertainty from the world, let's remove ambiguity. So how might I do that? Let's imagine that I had a sufficient number of sensors, or data points, and they were virtually noiseless, so there's no real difference between my observations and the things that were causing my observations. There's no ambiguity about the world, no uncertainty about observations given states of the world. These two can now be merged, and I'm just left with this, which is the expected complexity cost that we were talking about before, the thing that underwrites efficient explanations, meaning the simplest explanations. So what's this KL divergence? Well, it's simply the difference between my anticipated outcomes or observations in the future, if I pursued this policy, relative to my prior preferences, the kind of outcomes or underlying causes that a priori I expect to encounter. In engineering, this is KL control; in economics, it could be read as risk-sensitive control. It just scores the divergence of outcomes from one's prior preferences. And if I make the final move, to remove the kind of reducible uncertainty that I could resolve by active sensing or active learning, sampling the right kind of data that resolves all the uncertainty that is reducible or resolvable, I'm just left with this. And what is this? Well, we started with this on the first slide: it's just the expected value.
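The two readings just described can be written out per future time step τ, again in conventional, illustrative notation (p(o_τ) denotes prior preferences over outcomes); the two lines are algebraic rearrangements of one another:

```latex
G(\pi, \tau)
\;=\;
\underbrace{D_{\mathrm{KL}}\!\big[\,q(o_\tau \mid \pi)\,\big\|\,p(o_\tau)\,\big]}_{\text{risk (KL control)}}
\;+\;
\underbrace{\mathbb{E}_{q(s_\tau \mid \pi)}\,H\big[p(o_\tau \mid s_\tau)\big]}_{\text{ambiguity}}

\phantom{G(\pi, \tau)}
\;=\;
-\underbrace{\mathbb{E}_{q(o_\tau \mid \pi)}\,D_{\mathrm{KL}}\!\big[\,q(s_\tau \mid o_\tau, \pi)\,\big\|\,q(s_\tau \mid \pi)\,\big]}_{\text{expected information gain (intrinsic, epistemic value)}}
\;-\;
\underbrace{\mathbb{E}_{q(o_\tau \mid \pi)}\!\big[\ln p(o_\tau)\big]}_{\text{expected (extrinsic) value}}
```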
So what we are talking about, then, is a calculus which suggests that you will always appear to choose, or will choose, policies that either minimize risk, or minimize a mixture of risk and ambiguity; or, equivalently, policies that maximize intrinsic and extrinsic value. The intrinsic part is sometimes known as intrinsic motivation, and the extrinsic value would just be the expected value from an economic or utilitarian perspective. So this summarizes what we've been talking about, here framed from the point of view of a Bayesian statistician: in optimizing expected free energy with respect to a policy, we are implicitly choosing those policies that result in the greatest expected value. That is, of course, the objective function behind Bayesian decision theory, where value now plays the role of a negative cost or loss function. At the same time, you're going to be maximizing the information gain. So you've got Bayes-optimal decisions plus optimal Bayesian design, which basically equals active inference. And then if I had time, and I know I don't,

Sheri Markose

Can I ask you: remember, in one of your graphs, which you then edited out, you said that searching under the lamppost is not what you would be recommending? Because, you know, there are some economists here, and some of them would see this as naked optimization within a given choice set, where you can't do anything new. But I know you say that you can get more out of this framework. There's an additional term, which I don't see, or which you haven't emphasized, from which you would be searching for novel outcomes.

Karl Friston

Yeah, that's a great point, which I suspect we'll come back to in the general discussion, but you're absolutely right. I've just been talking about inference about states that are time-dependent. This whole story is hierarchically embedded, so that you then have uncertainty about the parameters of a generative model, and then uncertainty about the structure of a generative model, and the same maths applies. So the intrinsic value, that which underwrites curious behavior, also applies to your posteriors over the parameters of your generative model. If you're in machine learning, these would be your posteriors over the weights, for example; they're never actually articulated in most deep learning I know about, but if you had a fully comprehensive deep learning machine, where you also had posteriors over weights, then you would make moves that disclose or resolve uncertainty about the contingencies and the associations that are parameterised by the weight matrix. That would look like novelty-seeking behavior. And then you take it to the next level, which is the very structure of the model itself, and you would make moves to try to find the right structure. At that point, we come to what you were talking about with Hector, which is: where do you explore, in that imperative to search for novel structures and novel explanations? And now you are in a truly categorical, symbolic world, because we're talking about the structures of models: does it have an extra factor? Is there an extra layer? Does this relationship exist or not? And that, you know, is a really deep, really important problem, which one could argue has been solved by natural selection, in the sense of Bayesian model selection. So that would be one way of casting structure learning, or, if you like, radical constructivism. So yeah, you've distracted me, so I can't even fast-forward through the rest. But what I would have done is take you through the mechanics of this. It's nice, because it's actually exactly the same belief propagation that I think Tim Rogers was talking about, I can't quite remember, and certainly it has come up: you can just take off-the-shelf stuff that does this belief propagation, given the generative model. And crucially, and interestingly in relation to Hector's talk, almost universally nowadays, when we apply this to real-world scenarios, we use discrete state-space generative models that have an easy reading in terms of symbolic manipulations, if-this-then-that; you can write it down very much in terms of things like modal logic, simply because you're dealing with, in this instance, Markov decision processes or partially observed MDPs. I would have taken you through Forney or normal-style factor graphs and the nature of belief propagation, and how that looks very much like the brain. And then I would have given you a very simple example of a little mouse searching for food, given instructional cues that tell it where its food is, and shown a deterministic, Bayes-optimal progression from explorative, active-sensing, uncertainty-reducing behavior through to exploitative, preference-seeking goals. And I would have then shown,
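As an illustration only, here is a minimal numerical sketch of the kind of discrete, POMDP-style expected-free-energy calculation being described: a two-state world, two candidate policies, and made-up likelihoods and preferences. It is not the simulation referred to in the talk, just the risk-plus-ambiguity arithmetic in a few lines of Python.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy(p):
    p = np.clip(p, 1e-16, 1.0)
    return float(-(p * np.log(p)).sum())

def kl(q, p):
    q = np.clip(q, 1e-16, 1.0)
    p = np.clip(p, 1e-16, 1.0)
    return float((q * (np.log(q) - np.log(p))).sum())

# Hypothetical two-state world: the prey is to the LEFT or RIGHT, equally likely a priori.
qs = np.array([0.5, 0.5])                      # current beliefs q(s)

# Likelihood matrices A[o, s] for two candidate policies (made-up numbers):
#   "look"  - an informative observation that reliably reports the prey's location
#   "guess" - act without looking: the observation carries no information about location
A = {
    "look":  np.array([[0.9, 0.1],
                       [0.1, 0.9]]),
    "guess": np.array([[0.5, 0.5],
                       [0.5, 0.5]]),
}

# Prior preferences over outcomes, p(o): a mild preference for the first outcome
p_o = softmax(np.array([2.0, 0.0]))

def expected_free_energy(A_pi, qs, p_o):
    """G(pi) = risk + ambiguity, for a single future time step."""
    q_o = A_pi @ qs                            # predicted outcomes q(o | pi)
    risk = kl(q_o, p_o)                        # KL[ q(o|pi) || p(o) ]
    ambiguity = sum(qs[s] * entropy(A_pi[:, s]) for s in range(len(qs)))
    return risk + ambiguity

G = np.array([expected_free_energy(A[pi], qs, p_o) for pi in ("look", "guess")])
q_pi = softmax(-G)                              # posterior over policies ~ softmax(-G)
print("G =", dict(zip(("look", "guess"), G.round(3))), "q(pi) =", q_pi.round(3))
```

With these particular numbers, "look" wins purely because it has lower ambiguity, which is the sense in which uncertainty-resolving (epistemic) behavior falls out of the same objective as preference-seeking behavior.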

Sheri Markose

I’ll have to stop you here.

Karl Friston

Right.

Sheri Markose

And then, yes, because I think we need to get some questions from our audience. Shyam, I mean, Shyam is well known for zero-intelligence agents; he has argued vehemently against this maximalist optimization. I've got him here from Yale. So, do you want to open the questions to Karl?

Shyam Sundar

Oh, I'm not sure if it's a question for Karl or for Hector. So, I imagine a world which will be dominated by these artificial agents, by AI agents. What do we know about how well these artificial-agent theories will work in that world, where data itself is being generated by a population which is predominantly made up of such agents?

Sheri Markose

So that sounds like something that Vincent might take up. It's about self-reflexivity, isn't it: agents produce information which, of course, can be manipulative, always or sometimes. And Hector, do you want to answer that?

Hector Zenil

I think Karl wanted to say something

Sheri Markose

you are passing it back to Karl

Karl Friston

I would like to hear what Hector had to say before I.

Hector Zenil

So let me understand the question. So the question is if agents themselves are generating the data, and then they are also making the inference calls, is that the question?

Shyam Sundar

We are living in a world now where the role of AI is on the increase, is that fair? Okay. Now, let's imagine a world where it is artificial intelligence which dominates. So the data being generated is being gathered from a world where the agents themselves are artificial agents, not human agents. We have heard about many models and theories in these two talks about the properties of AI systems, and I am trying to understand how well the AI systems will work where data itself is being generated mostly, not entirely, but mostly, by AI agents.

Hector Zenil

Yeah, I see. There are so many angles, but one that I can perhaps explore a little bit is that at some point, and that's what we have to be careful about, I think there could be a disconnection between how we perform science and how they perform science, because then things like, for example, asking them to communicate with humans would no longer be necessary, right? And in some cases, we may or may not need humans to understand what is going on. So in some way, I think there's an opportunity for AI to accelerate science, but we are also going to lose an understanding of how that science was performed. That is what we are going to lose, eventually, perhaps; but it is humans who have to decide whether that science should continue to be performed in such a way, or should be taken back so as to be interpretable by human scientists.

Shyam Sundar

Let me, if I may, pursue this a little bit more. Will it be the case that any AI system will become self-fulfilling in this world?

Sheri Markose

So, Vincent's talk will be on these issues of singularity. Okay, since we're pressed for time, can we put it off till then? Another hand was up, Anindya's. Do you want to ask a question?

Anindya S. Chakrabarti

Yeah, so this question is for Karl. This framework is very fascinating; I have basically one very specific query. In epistemic game theory, it's actually quite common, you know, to pursue this Bayesian approach, to say that it's not so much about the fact that there are given probabilities which people are already aware of. So, given the particular example that you gave about an owl and a rat, you have set it up, if I may use the word, in a fully decision-theoretic framework. I was just wondering how that translates into a strategic framework, like a game-theoretic framework, where, I mean, the standard example is that there is a predator and a prey, and whatever the prey does will also change my status. So does it directly translate into incorporating strategic behavior, or are there more modifications required to the inequalities that you are showing us?

Karl Friston

That's a great question, and in a sense it actually speaks to the previous question. So what you're asking now is: you've got two sentient systems, or possibly an ensemble or a population of systems, that are all trying to actively infer each other. This is not something that is so developed in the literature, but it really is a tremendously important focus. When you put two of these self-evidencing systems together, so that their actions become the observations for the other, and vice versa, there is now a dyadic, or generalized, coupling amongst a whole bunch of these things. What tends to happen, because that variational free energy is an extensive quantity, is that the free energy solution of the ensemble, or the population, or the dyad, is that which is the sum of the variational free energies of each of the elements. And just heuristically, what tends to happen is, if you read that free energy in the spirit Sheri was talking about, which is basically predictability (so that, you know, if I can predict what's going on, I have an accurate model, and it is a simple model, and therefore it will generalize and it will elude sharp minima), then that means I aspire to make the world very predictable. And the simplest way for an ensemble, or two agents, to make the world as predictable as possible, when the world is constituted by the other agent, is for them to do the same thing. So you get something called generalized synchronization: everything sort of converges towards the same generative model, or, more poetically, a shared narrative about how to exchange, so everything starts to sing from the same sheet.
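The extensivity claim here can be sketched, again in illustrative notation, as:

```latex
% Free energy of a coupled ensemble: the sum over its members,
% where each member's actions furnish the others' observations
F_{\mathrm{ensemble}} \;=\; \sum_{i} F_i\big[q_i, o_i\big],
\qquad o_i \;=\; \text{actions of the other agents}
```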

Sheri Markose

So let me jump in there. Remember, Ken Binmore is not with us; this is exactly the problem that he says has created a problem in game theory. In game theory you don't have this agent, which I will call Gödel's Liar, the adversarial agent. So in the natural sort of predator-prey setting you just get the standard nonlinearity; but in the actual algorithmic game, the digital game which is involved with the virus and the host and that sort of thing, certainty is going to be punished. That's where your free energy principle, in substance, gets unstuck. Because it's not in your gift to minimize surprises: you'll be forced into radically novelty-producing outcomes, and that will only accelerate, so you are forced into more and more complex systems, into what I would call Gödel incompleteness. And this is the point on which Matteo Colombo was criticizing you. On one hand, I agree that we would like to remove surprises with regard to all of life's vitals, that is, homeostasis, or whatever, you know, some people call it allostasis. But in order to achieve that, in multicellular life, we are propelled into open-ended search for novelty, because there is a virus, or, you know, an oppositional, adversarial agent, in that system.

Karl Friston

Yes, well, that would be an emergent property of an ensemble where you had initial symmetry breaking. So any heterogeneity will be manifest, and you're absolutely right that, by virtue of trying to optimize your expected free energy, you will become novelty-hungry. But novelty about what? What will you be exploring? Well, you're exploring an increasingly complex world that is actually generated by other things like you that are becoming increasingly complex. So again, it speaks to the very first... Yeah.

Sheri Markose

Like you said, two sentient agents, or two such opposing agents, I call them digital agents, capable of novelty production in the world. So this is where, you know, on one hand we know that these are arms races, these are Red Queen races: on one hand everything remains the same, but on the other you are propelled into complex behaviors and unforeseen novelty, you know. So this is a problem. I think the free energy principle is fine, except unless you fess up to this problem, which actually you do; I think I caught you saying that. You know, a lot of people think that free energy is all about being in the dark room, right? You don't need to do very much. You keep saying: just turn on the light. But I'm going to argue that turning on the light is a novelty exercise, right? Turning on the light to do something, novelty searching and so on and so forth. It is not a given; it's something very unique to life itself, right?

Karl Friston

Yes, well, switching on the light is a nice example of making a move, or committing to a policy, that minimizes ambiguity, for example. So again, switching on the light is something that is prescribed by, and is characteristic of, an expected-free-energy-optimizing agent. And there is some interesting work by John Danzo, who put a whole bunch of different free-energy-minimizing agents together and played out economics games, looking specifically at the degree of sophistication. So in the generative model, I have a model of you, and in my model of you, you have a model of me; and the question is, to what level of sophistication do I embed my generative models? He actually showed, using numerical analyses, that there was an ESS, an evolutionarily stable strategy, where half the people were really, really unsophisticated and the other half were very, very sophisticated. So if you break the symmetry in the beginning, you can get, well, it wasn't quite Red Queen dynamics, but it was certainly an interesting dissociation, where some people actually dumbed down but other people became much, much more complex. And that was, if you like, the steady-state solution, born of simply minimizing, or in a machine learning context maximizing, the free energy bound. So all of these features should be emergent properties.

Sheri Markose

Yeah, so this is where, maybe, logic comes in, right? I begin to see that there is no function that you're actually capable of maximizing or minimizing; I challenge that there is a function. There is only one principle, I would argue, and it's not necessarily one of free energy but of consistency, if you go to logic, the logic of how these things are going to be organized within, you know, individual beings.
