Sparsistent Filtering of High Dimensional Networks

Session 4: Data Science for Interconnected Systems and Epidemics

12:30 pm – 1:00 pm GMT

Anindya S. Chakrabarti (Indian Institute of Management Ahmedabad) – Sparsistent Filtering of High Dimensional Networks

To watch the video on YouTube, click the ‘YouTube’ button above.

Transcript: –

Anindya Chakrabarti
Here I will talk about something that we are describing as persistent futuring of movement networks from high dimensional data, there is a joint work with Anindya Chakrabarti , who is a post doc working with me, our names look very similar, but we are not related in any fashion. So, what is the problem that we are sort of, you know, interested in and what is it that we would like to analyze here. So, let me just, you know, give you a few very basic definitions just to set up the stage and for the whole audience, I guess it’s not a very important to describe networks in great details, everyone knows about it. But in particular, what we will be dealing with, in this presentation is something called co movement networks, which is basically, you know, which we will, you know, deal with, that is a large literature in sort of statistical finance, which deals with these kinds of networks. So, I would like to just, you know, maybe visually describe what kind of networks I’m talking about, and then I’ll get into the problem. So, this is what you’re seeing here on the screen. This is an S&P 500 index from 2002 to 2017. So, it’s 16 years’ worth of data. And these are daily closing prices of the index. So, whatever is reported as the S&P you know, closing price for a given day, so the market opens in the morning, then it closes, maybe around 5pm or something like that. So, whatever the price that is quoted at the very end, that is the closing price, and we see that it fluctuates a lot, etc, etc. So, what we are plotting on the y axis is literally that S&P index. So, for those of you who are not from finance, S&P 500 is basically a combination of 500 different assets, which are all registered in the New York Stock Exchange, and it’s a weighted average of all of their different prices, generally speaking, it can be thought of to represent as you know, it can be used as a kind of a barometer of what the traders are doing in the market. So, what we see is that, just to give you a sense of the x axis every year, roughly speaking, there are 250 trading days, because the weekends are closed, and then there’ll be some other days where the trading would be off, so 250 into 16, you get 4000, roughly that is the you know, dimension on the x axis. So, what we see is that, I mean, one obvious thing is there, which is the 2007-09 financial crash, but generally speaking, what people have thought about is that this kind of process clearly shows something called a it has a random more components or from a financial economic point of view, this can be described as having a unit road. So, which has this explosive tendency, so, these are non-stationary processes. So, one of the very easy ways to convert it into a stationary process is simply to take the first difference, which can be described like this, that you take the log difference and then that becomes a return series that is also called the log return series, the moment we convert it into return, then the same data which was plotted here, this dataset basically looks like this. So, on average, the return becomes close to zero and then there is a bunch of fluctuations around it. Now, one of the key observations and this is very well known in the financial data is something called volatility clustering, that when the market gets volatile, it tends to remain volatile for some time and when the market gets you know, becomes relatively more calm, then it tends to remain calm for quite some time. So, as we can see that during the dotcom bubble, the market got very volatile and 2007 to 09 financial crisis, an obvious example, where again, it caught into a long spiel of volatility. So, what the, the financial networks what it started looking at is that, there just to clarify, there are two types of financial reports one can think about one is that literally you know, form to form or you know, financial entity to einancial entity connection through let’s say asset cross holding or other types of connectivity, which can be where the linkage can be real in some sense. These are not statistically inferred, but what I would be looking at is more likely inferred connectivity. So, what do I mean by that the idea is as follows that you if you if there are N assets so, for example, in the earlier days, if you take S&P 500, when this N was more like 500. Now, this is what represents the average behavior. But technically speaking, what one can do is that you, no one can look at all of these 500 different assets separately, and each of them will give rise to a time series. So we will have 500 times of let’s say some given length. Now, once we do that, then from that a pretty standard, you know, idea is there, which is that, let’s say I denote the, you know, return data, which may look something like this, I’m abusing notation a little bit. So what I mean is that N is, let’s say, also a set of all the stocks that are there, all the assets, and the capital D is the number of time points for which we have the data. Now from this one can construct something like a you know, covariance matrix, or let’s say eight by eight, or let’s see our correlation matrix. Now, there is a lot of literature on analyzing these disaggregated kind of network where all the nodes now I can I guess, go back to the first slide, where these kind of financial reports are there, but the stocks are the nodes, and some functions of correlations represents the age bets. So and one very intuitive way to think about it is that suppose there is absolutely no correlation whatsoever, you know, in terms of returns between two stocks, then there will be no age or think. So if you do that, then this is a very simple, you know, example, that if you have 50 stocks, if you’re plotting the the network as a whole, then this is what you get. Now, one can, you know, there is a little more nuance to it, because if someone just takes a correlation value, and says that suppose this is the age weight and there is a problem, because population groups can integrate as well. So can we treat it as a matrix? The answer is no. So you have to take some kind of transformation of the correlation blob in order to come up with a reasonable description of that. But the key point that I’d like to summarize is that that can be done very easily with some very simple matrix and the whole network as the market as a whole can be described as a network.

So this is a network inferred from what can be described as the sample covariance matrix. Now, this is the object that we are interested in. Now, what is the question that we are interested in the quiz, the point is as follows that if you look at this whole thing, it’s not very, you know, meaningful in the sense that there can be, you know, two types of problems or otherwise three types of problem let me give you the first problem, which is actually the entry point of our algorithm to begin with. The first point is that there has been actually a lot of debate about this, that if the markets are, you know, evolving in such a volatile manner over time, then is it even fair to take time series, which covers volatile and non volatile period. So in the sense that, you know, in order to find out the correlation between, let’s say, a pair of stocks, you need a reasonable length of time series data. So let’s say you take four years window, which is a very standard, you know, window in this literature. Now, suppose that four years starts from 2006 and ending ends in 2009, or even more extreme case, it starts in 2008 times in 2011. The problem is, it’s sort of contaminated, there is some contribution coming from, you know, very volatile period, some contributions are made from not so volatile period. And potentially one can actually argue that it’s not as if like, if you just want more volatility, then things will be over, because then there is a big the point that one can make is that maybe the market structure itself shapes by that, so, four years is not a short time. Now, then one can ask why four years, why not take one year, when the problem is that suppose you have 500 assets, you have only 100 you know, time observations 100 days worth of observation, then the sample covariance matrix that one can estimate will be a very bad estimator of the true covariance matrix. This is the basic problem of high dimensional statistics that if the number of observations number of entities that one is observing is very high, but the number of data points that is being observed is very low, then there is a problem you cannot estimate the covariance matrix in a critical way. So, I will show you what is the problem with that, I will give me a very specific example. Now, the second this is the, you know first problem that we want to address that suppose you really chop it down to a very small data set in terms of timeline, then what is the most credible way to infer network from the covariance matrix. The second problem is that, is there a way to they if you just look at this picture, it’s just a blob. There is a bunch of points and which are connected to each other. Is there a way to get rid of much of this You know, many of these linkages, which are potentially non informative, and not only that, we can actually ask, you know, two questions here. The first one is that suppose these are just some correlations or statistical in nature. So, this is more like in the motivation comes more from our point of view or filtering, that suppose, there are just some spurious connections, how do we get rid of them, that is something that the literature has talked a lot about. So, in that way, we are not later on being the first one here we are, you know, late and right here, but, the bigger question that one can now ask Is that what if we could remove even some true edges, but it just that they are not particularly strong and maybe they are there, but if we remove them,

Is it possible to infer a more clearer picture of the true underlying and strong connections, you know, that describe the network. So, I’ll give you a visual as to what does it look like so, what we are saying is that, at the end of the you know, talk maybe another 15 minutes, what I would like to convince you is that if you start from here, which is there a given network, which looks very messy and entangled and so, on, we want to propose an algorithm that will give you something like this, where it’s basically the exact same you know, set of stocks which are there and all of these connections are subsets of what you find here, there is no new connection is that is being added it’s just that all the sort of what we will describe as unnecessary connections now, what is unnecessary is something that we will you know, define later, but whatever we just describe as unnecessary connections that will be taken out and only some of the, you know, key connections will remain. So, we will ,the purpose of this paper is just to propose this, this you know, this thing called sparsistent filtering, which will allow us to do that, the good point about this whole thing is that, this uh, let me also say it clearly to differentiate it from the existing algorithms, there is a lot of algorithm which allows you to do filtering on this kind of net or even covariance networks that can be there the difference here and that is, I guess, our you know, probably the key contribution which is that they do not have good small sample property, here we have that. So, this is actually you know, useful that way okay. So, here is something that I do not want to spend a lot of time on. So, there is a you know, the spectral structure has been studied a lot and especially, what people do is that they use this Marcenko-Pastur Theorem under matrix theory to filter correlation matrices and so, on. Given that I don’t have a lot of time I will just move on and get to our exit point. The eventual goal of technically speaking, but we will we would like to generate exists isn’t the adjacency matrix of the network has to be sparse and the spectrum of the adjacency matrix has to be close to the true covariance matrix. Now, there is a problem the problem is as follows that, if I am simply observing financial data, how do I know what is the true data generating process? So, if it is purely and we cannot really I mean, we do not know the philosophically whatever is the true data generating process right. So, we have to have an analogue for that. So, the way we you know proceed about it is that, so, here is the basic problem and that you will find for high dimensional data that here there is in a little bit of notational problem. So, think of it this let’s ignore this notation T and N Let’s just consider N stocks maybe and you know T time points. So, the idea is that if this T is substantially greater than N then whatever you know covariance matrix that you can estimate that will be that will have all the good properties, but if it so, happens that this N is actually much smaller than T sorry if the other way around

N is substantially greater than T then this covariance matrix that we will estimate simply from you know you that you can calculate from the sample that will be biased and that will have a lot of problems in terms of convergence to the true covariance matrix. So, here is a simulated example, that, you know, if this T is almost, you know 10 times of N, in that case, let’s say the true Eigen values that you can, you know, decompose and that should be one, then if you just recorded all the eigenvalues then you know that they are by and large around one there are some fluctuations. Ideally speaking, this would just like just be a horizontal line like this, but they’re still somewhat close but the moment it goes to 0.5 Or you know these kind of things. So, here N & T are exactly equal, if you construct the covariance matrix, then the eigenvalues are completely off from what should be just a horizontal line it just completely off. So, this is the main statistical problem now, how do you how do we deal with it, this has been recognized in the literature and the key to that is that something that is known as you know, Stein’s approach, what it does is that it takes the covariance matrix and you know, decomposes it finds out all of these Eigen values and by hand it compresses it towards one. So, this is a very mechanical, you know, idea that, you know that it will be biased. So, what you do is that construct the covariance matrix, do the identity composition, find out the Eigen values, and then by hand push them closer to one, but then it has a lot of problems. So, then there was a very famous, you know, idea that came up something called Ledoit-Wolf estimator. It has all sorts of good problems and a set of very good optimality properties as well, I’ll not get into that, roughly speaking, the idea is that it takes the sample covariance matrix and some other positive definite matrix. You know, what they showed is that that positive definite matrix can simply be the identity matrix. And you can choose this in a weighted average of some optimally chosen alpha, that gives you the closest to the, you know, known data generating process. So it has all sorts of, you know, good statistical properties about this high dimensional problem. The third thing that was proposed is something called thresholding. Again, I will not spend a lot of time in a lifetime on it, intuitively speaking, what it does is that it takes the covariance matrix and then puts a threshold, if some values are above the threshold, keep it if those values are less than threshold, make it equal to zero. Now, this thresholding is actually very intuitive. And it’s very common in the network’s literature, because this is the easiest way to filter, right? Because if you say that, suppose I go back to this picture, and I say that one, so, if the covariance between sum I and J is less than some threshold, let’s call it out, I will make it equal to zero, then immediately that gives you a filter, but then that is a very ad hoc estimate. Right. So, what we are doing here, as this is our algorithm that we are being a bit greedy, what we want to do is to return the property of this, you know, threshold estimator, because it’s very easy, very intuitive, but it has its drawbacks in terms of being very ad hoc. So, what if we do not want that adhocism? So, what do we do, what we say is that our proposed estimator is basically that from a given sample covariance matrix, you first find out whatever is the Ledoit-Wolf ?, because that is something that we can very easily calculate. And it’s all of its good properties are you know, there and then at every point of time, let us try to get close to it by thresholding it one after another, I mean, ages one after another. So, mathematically what we mean is that, this is the optimization problem that, you know, we described, let me just quickly get here that, what we want to do is to reduce this is this will be the argument of the spectral distance between the true covariance matrix sorry, between the sample covariance matrix and the Ledoit-Wolf ? , we want to reduce the distance between them, because we know that the Ledoit-Wolf give the currency ,it has all the good properties. So, that is something that we would like to target.

So, we would like to minimize the distance between them, how do we minimize it, we just fill the network one after another, till the edges of the network one after another, based on first remove the weakest link, then see whether the speaker together gets closer to So, it’s like, you know, reduce the, you know, age, let’s say some age which is, which is the minimum, we’ll find out whatever is the spectrum and find out the difference between the letter pool and the threshold M addresses. If you are in getting closer, then remove it once more and keep on doing that, but then the more I am building of ages, technically speaking, we can give a cost because this is also loss of information. So, for that we are attaching a Cost parameter associated with it. Let me just you know, visually diagrammatically explain what is going on. The idea is that this blue curve, let’s say represents the cost of escalation. Now, this is parametric if you don’t want let’s say that the cost of escalation is zero costlessly. Doesn’t matter. The key point is that start from the network as a whole keep on peeling off the first you know the weakest link the The second weakest link ,the third weakest link and so on. And this red line This one represents the spectral distance between the thresholded covariance matrix and the target square later at full analog. So, this represents the distance between these two spectra, the idea is that you keep on going and after a while this is empirically observed we don’t have a mathematical proof for it, but after a while the distance you know goes up, wherever you have the minimum distance. So, then essentially the thresholded metrics or spectra is the closest to the letter at full spectrum that is something that we are calling maximal filtering. So, this is the maximum amount of filtering that you should do, if you do if you go any further then the spectral distance again starts increasing. So, you just stop that is the maximum amount of you know, so, as you are getting the closest now, if you add up the cost of escalation, then if you just take the sum of these two and that gives you this black dotted line. So, here for example, if you believe that you know escalation is costly then if you have that then you get the minimum here which we call human filtering the intuition is very simple, if there is a cost of escalation then you have to stop before maximal filtering . Maximal filtering is if there was no cost whatsoever then how far would you go like that is cost then you stop before maximal filtering. So, we get this is roughly speaking visual of you know, maximal and what we will discuss today.

Now, I will just give you one simulation example, I guess I have only two more minutes left. So, let’s say this is the true network. So, what does it mean this is something that we are imposing by hand that let’s say four and five are not connected, we are putting the in the covariance matrix we are putting the covariance between the fourth node and the five node if not as literally zero and the age which represents how strongly they co vary. So, five and six for example, very strongly co vary and so on. So, two to one and a strongly co vary similarly, seven nine ten – they create and acquire. So, again this is just imposed by hand. Now, obviously, if you can simulate it then due to sampling fluctuations everything will be connected, because in small sample you will be covered some amount of correlations. So, here I know the true data generating process. And this is the simulated analog, okay. Now, this is not unique obviously, because if I run another simulation, then there will be something else but which will be very close to it. So, once we apply our the maximum filtering, then this is what we recover. This is the maximum filter. And once we apply some parametric tune filtering them, this is what we recover. The takeaway is that, for example, the maximal filtering recovers, you know, these 2165 and also this seven nine ten. Now, if you go back to the original, you know, network, there’ll be seven nine ten Is there and two nine six five , two one six five is here. So, there’s the takeaway, I guess, is simply that it does a reasonable job, at least in terms of recovering only the strongest connections. So this is a simulated example. Now, let’s see some, you know, real data, I would just get into that. So here we took NASDAQ data, in our paper, we did a whole bunch of other exercises, there is just one example to you know, convey the basic idea. We took NASDAQ data for 50 stocks for 70 consecutive days, and anyone who has the usual return thing that I have just described. So if you this is the you know, sample covariance matrix, so it’s basically a giant block with where nothing is visible, and it’s not clear which one is more important or the other and so, what we are plotting here, if you look at panel C is that so there are 50 You know, nodes, so there’ll be 50 C two possible number of connections right? So, we are encoding them from here to here on this x axis right. So one, two, three, four rank ordering in the sense that the smallest weight then the second smallest way, then the third smallest word and so on. So, what and then what what this line represents is that you keep on you know, removing one of these you know, edges

You remove one edge,find out what is the newest new Eigen spectrum, find out what is the distance from the layout hold analog of the original matrix and then plot it here, whatever is the distance between these two spectra. So, as we can see that the minimum is here, so, maximum number would be 50 C two which would be probably something like 150 or whatever, something like that. So, my you know, maximum filtering removes a massive number of edges. So, where you are left with basically you know this one Were some of the nodes some of the stocks basically do not have any more connections left. But obviously, you could have stopped you know, here as well or here based on you know, here we did not assign a cost curve like this, if we did then we could have stopped here and then a few more connections maybe you would have bought. So, this is an example on the on the data. And, so, let me now just quickly summarize because, I guess, close to the finishing time, so, here what we have proposed is an algorithm for you know, maximal Antion filtering, the idea is that, we can apply it to complex networks and more importantly, this applies to the high level to high dimensional data. Now, if you really do not have high dimensional data, so, suppose you have let’s say the number of observations is significantly larger than the number of entities in that case, you would not face this problem and in that case, the data analysis would be very close to the true to the realized you know, metrics to begin with. So, when these algorithms would not be very useful, unless you have you know, a high dimensionality problem. So, this is sort of geared specifically towards financial markets having said that, the my co author in this paper, who is a statistician by training told me that in case of genetic data, again, there is a very similar problem that people face because the number of possible entities can be very large, whereas, for each of them, they can observe only a very few realizations. So, the same high dimensionality problem pops up there as well. I didn’t add that example here, because I do not know much about genetic data. So, we skip it, but the point is that the application of this algorithm goes much beyond that much beyond you know, only financial network and here we have you know, sort of use some, you know, local geometric property and experiment structure. And the, I guess, the important feature of this, you know, filter is that it it has very nice statistical properties along with retaining your very good spectral properties and this is something that I would like to differentiate from some of the other loan filters. So, if you look at the literature, at least on the, in the, in the case of financial markets, there are some algorithms like what we’re calling the planet XML filter or minimum spanning tree and so on, but, none of them have good statistical properties, those are kind of optimization based approach, they take whatever network that is given as representing the crude interest generating process, there is no question of a false positive or false negative in terms of realization of ages, there is no statistical concept, there is another filtering approach, we what it does is that it takes you know, individual ages to represent, you know, it takes a statistical approach to validate whether someone should exist or not, but they do not have full spectrum of what is there purely statistical. So, here, we think that, probably this kind of nicely, you know, combines statistical properties along with spectral properties. So, currently, this paper is, you know, we have circulating this paper and it’s under review, so, all sorts of sorts of comments would be very useful. Sheri, especially your, you have been doing this kind of work for quite some time. So, and also from the rest of you, I guess, I will stop here. Thank you very much for listening to me

Sheri Markose

Anindya that was very interesting. I have some experiences you know, and filtering correlation matrices right using random variance. So, my problem is with the correlation, price price based volatility matrices or price based correlation matrices, they suffer from this paradoxical was mesh the problem is that during a boom volatilities low correlation is low during a bust correlation jumps up right. So, when when we started then did redemptions, switching estimation of, in other words, trying to overcome this problem of boom, this paradox paradoxical way in which these risk measures behave this may be of relevance even to resellers measurement of all stochastic volatility. stochastic volatility models really don’t overcome this problem during boom volatility shown to be very, very low, and then During the bust, it jumps up but then you don’t get any full warning of this switch from low volatility during a boom and then a big jump when the market start falling. So, when we when we tried to filter the correlation matrices, we first first sorted out this problem and then the strata matrix sort of methods of magic maths and pester muster Marchenko magic, Marchenko pasta, and it did it. And then, of course, to sort of see which whether or not this sort of so we did a prior correction for this correlation problem. And then, when we did the random matrix filtering, we got a much better outcome than if we didn’t. So that’s a comment that I want to make just from experience of what we’ve done.

Anindya Chakrabarti

So one thing that I would like to just, you know, add that, I mean, basically, we are, you know, proposing a solution pretty much to account for that. But here, the solution is as follows that, let’s say you have data from here to here, then there is a problem, because as you’re saying that it’s contaminated by high volatility and low volatility . Now, one obvious solution is why don’t we take only you know, from here to here, let’s say, then the problem is that the number of entities can be much larger than the number of time points I mean, that is the Cohen’s metrics estimation would not be statistically legitimate, I mean, statistically credible and correct to begin, it will not it will add the problem that I showed the diagonals that you will estimate from the sample covariance matrix will be all over the place, they are not correct. So in that case, also, one cannot even even go to the realm of application of random matrix theory. And so, so here, we are proposing basically a theory around that, that, suppose you really chop off and just try to work with only this much, is there even a credible way to do that, and our algorithm basically tells that you can do that, just apply an old adult pool and then try to these are on shrinkage estimators. So then try to strain the eigenvalues corresponding, and then use it as a filter, it will work just fine, you will recover most of the you know, the true you know, covariance values, which are there, the one that is also what we show by simulation as well. It has good small sample properties. So, yeah, right.

Sheri Markose

So you make an interesting point that we shouldn’t over filter, right? You’re saying there is a maximal man by way of filtering that you should do, that’s very interesting. Because normally, because how much to filter, right?

Anindya Chakrabarti

If you simply go by the spectral structure, then you can filter a lot, but then that might be just too much, because filtering is also costly. I mean, you’re throwing away information, so then one should actually account for that. And so what we are doing is that from zero filtering, which gives you the original report to begin with, it gives you an upper bound in terms of maximal data. And it gives you a way to sort of interpolate between these two, you know, limits.

Sheri Markose

Okay, thank you. Thanks,

Karl Friston

This is fascinating, particularly because we face a very, very similar problem in the neurosciences trying to estimate the underlying directed connectivity in neural networks from usually quite sparse data, say, from EEG, electrodes and the like. So one thing which preoccupies our world is the distinction between a filter network that’s composed of directed versus undirected edges. So is this particular kind of filtering, the sparse filtering doesn’t give you an underlying adjacency matrix or what we would call effective connectivity, which is usually much sparser than the correlations that we will call a functional connectivity is it directed or undirected?

Anindya Chakrabarti

This will be so given that I’m working with a covariance matrix after I filter it will still remain basically a subset of all the original edges. So therefore, it will still be undirected. But so, this one, I can convert it into a filtered sort of adjacency matrix. So if I start from a known adjacency matrix, I filtered it this way. Whatever is about what I get, I can represent it basically, as an thresholded adjacency matrix. But it will not be directed, it will still be undirected because the starting point is an undirected covariance matrix

Karl Friston

So yeah, if it were if you wanted to stop, I’m just mindful of I know Tim’s gone now. But remember, Tim’s sort of no backward causality or No, no, no Jump back, which seems to be a very important sort of aspect of asymmetry. That would necessarily, certainly not. So time series analysis require an asymmetry in the adjacency matrix. And of course, that’s going to induce complex eigenvalues. So I’m just wondering how you would handle how you’d handle complex spectral characterizations.

Anindya Chakrabarti

that’s actually a very good point in this framework, it will not be visible because here when the target you know the letter it will analog, which gives you all the consistency properties and so on. That is only for covariance matrices. So, if you take for example, partial correlation or something, which then introduces direction, then we don’t have a letter analog as far as I know. Now, if so, because we are creating Liverpool as the target, you know, spectrum then I have to find out a different target spectrum to begin with, which which is actually a very important point. I never thought about it. But yeah, I don’t know the solution.