**Session 4: Data Science for Interconnected Systems and Epidemics**

**12:00 pm – 12:30 pm GMT**

Rosella Arcucci (Imperial College) – **Data Learning: Integrating Data Assimilation and Machine Learning. An Application to the COVID-19 Pandemic**

Transcript:

**Rosella Arcucci**

Thank you for attending this event, and thank you to Sheri for inviting me. My name is Rosella Arcucci. I'm a lecturer in data science and machine learning at Imperial College, where I head the Data Learning working group. I'm mainly a data scientist: a mathematician by my master's degree, and then a PhD in computer science. I've been working with a lot of institutions on real-world scenarios. My first postdoc was with the Mediterranean Center on Climate Change, and then I worked with the DSI, the Data Science Institute at Imperial College, for a few years before becoming a lecturer in data science and machine learning. Behind all the questions that I try to answer in my research there are data. Just to make sure that we are on the same page for the next slides: when I say data, I don't mean just what we call observed data, for example from social media or from satellite sensors, but also data coming from models, for example computational fluid dynamics simulations or other dynamic simulations. So: data provided by both observations and forecasting models. Why are we interested in data, and what can we do with it? There are a few main aspects we focus on. The first is learning the dynamics behind the data: when we try to understand a phenomenon and we don't have any prior information about it, we just try to learn its dynamics from the data. Also, most of the time in real-world scenarios we have to deal with big data, so to build forecasting models from a huge amount of data we have to deal with compression and reduction of the data.
And finally, when we have these data-driven models, we work a lot on making them stable: data-driven models are famously unstable if the data you have used to train them is not of good quality. So we also work on technologies that help data-driven models improve their stability and produce good forecasts. I will tell you more about this over the next slides. Why do we talk about data learning, and not just machine learning? Because when you try to build data-driven models you have to deal with a few problems. As I said, there is the dimensionality constraint: data can be very, very big. We also have noisy data, almost all the time (I could say all the time, not almost all the time). And finally, another very important problem is low-quality data. We say data is low quality when, for example, you have data representative of one of the variables of your system, but you want information on another variable: you don't have very representative data, just data somehow related to it. In order to face all these problems, together with the reduction of uncertainty, we combine machine learning models with something called data assimilation. This is the reason why we talk about data learning models: models that are a combination of data assimilation, which is a data science technique, with machine learning models that we train for our forecasting problems. Within the Data Learning group we have been working on a lot of different applications, thanks to the grants that we have for the group: social science, medical applications (we will see COVID now), but also geoscience, biology and finance. I will show you just a few examples, and then we will focus on the application to modelling the pandemic.
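The talk does not say which compression method the group uses, so as a generic sketch of the reduction idea mentioned above, here is principal-component reduction of a set of synthetic simulation snapshots via a truncated SVD (all data below is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# 100 synthetic "snapshots" of a 50-dimensional field that lie near a
# 3-dimensional subspace, plus a little noise.
modes = rng.normal(size=(50, 3))
snapshots = modes @ rng.normal(size=(3, 100)) + 0.01 * rng.normal(size=(50, 100))

# Truncated SVD: keep only the r leading spatial modes.
U, s, Vt = np.linalg.svd(snapshots, full_matrices=False)
r = 3
compressed = U[:, :r].T @ snapshots      # 3 numbers per snapshot instead of 50
reconstructed = U[:, :r] @ compressed

rel_err = np.linalg.norm(snapshots - reconstructed) / np.linalg.norm(snapshots)
print(rel_err)  # small: three modes capture almost all the variance
```

The compressed representation is what a downstream forecasting model would be trained on, which is what makes big-data scenarios tractable.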

Um, so all these applications can be put together under one umbrella, something called the Sustainable Development Goals. All companies now have to implement sustainable development goals, and the technologies that we implement (these digital tools, these forecasting models, the methods needed to clean the data) can be used towards some of these goals.

**Sheri Markose**

Rosella, just a quick note: the Sustainable Development Goals were actually discussed by Omar in the previous session. I don't know whether you attended that. But yeah.

**Rosella Arcucci**

Yes, okay, thank you. Sorry, I was in another meeting, so I had to jump straight onto this one and I missed that; I will watch the recording. Thanks for letting me know. So, I have a few collaborators within the Data Learning working group, and also lots of students. We are open to projects with students all the time; I am reporting here just the students who have finished projects and published their results, but we have a huge number of students coming to work with us in data learning. Just a quick overview of the kinds of models we develop. We distinguish offline and online, depending on whether you are training the model or using it at run time, and depending on whether you want more accuracy. What I mean is: do you really care about the accuracy of your forecast, or do you not mind an approximate solution as long as it is really fast, or do you want both? So, depending on whether the focus is accuracy or efficiency, we develop several types of models; I will tell you more over the next slides. All these models are developed with a pre-processing step: we start from the data, we run an analysis of the error, we study the distribution of the error, and we always try to focus on the questions behind the data, to make the models usable in decision making. Let's focus on, let's say, the offline part. We have mainly two kinds of models, based on optimal data selection and parameter estimation. Optimal data selection (this is also important for COVID; I will show you how and why later) works like this: imagine you have a room, a classroom for example, and you want to collect information from some sensors.
One of the questions may be: okay, where should I put the sensors to capture most of the information in the room? You are probably already aware that, given the airflow within a room, a sensor will or will not be able to capture most of the information about, for example, the air quality, depending on its position. So how can you decide in advance? Should you just place sensors in the room randomly, or can you work out the optimal placement? This is what we have done with a technology based on Gaussian processes, mutual information and data assimilation. I'm a mathematician, so I could be very detailed and give you all the information, but I will just give you an overview of the models; I'm very happy to go into the details in case you have questions, or to share the papers with you. All the models I am going to show you have already been published, so we have the papers and also the code on GitHub, ready to be run. As you can see, if you compare a random sensor placement within the room with the optimal placement, you get a reduction of the mean squared error (MSE) of, let's say, three orders of magnitude. The forecast with the sensor placed randomly gives you an error of around 0.17; if you put the sensors in the optimal location, you have very high forecasting accuracy. This is very important in some scenarios, for example air quality in a room. And just to show you that the technologies we are developing are completely general: the same technology that we use in an indoor environment can be used in an outdoor environment. This is Elephant and Castle in London; this is where the university is, just on the corner.
And these are the results of the optimal sensor placement versus a random placement. It is a simulation of air pollution in the inner city.

Parameter estimation is a completely different scenario. Let's assume, for example, that you want to estimate the parameters of an economic model, and you want to use data from the cryptocurrency market; you have these time series, and all these points are essentially time series exploded in time. If you want to estimate these parameters, beta and alpha for example, how can you learn their optimal values from the data? We have done that using data assimilation, and we have published quite a few papers about it. The most important point is this: if you just set up the parameters from historical data, even in a proper way, and then use the model to forecast, you get an error of around 1.4, roughly on the first order of magnitude. If instead you use data assimilation to estimate the parameters, and then use those parameters within your model, you get an error reduction of up to six orders of magnitude: a big gain in terms of accuracy. Then there are surrogate models. Imagine you have your CFD, your computational fluid dynamics model, which can for example be used to model pollution. It can be used to make forecasts, but it is very slow. So we develop technologies able to emulate what a computational fluid dynamics solver predicts, but in real time: four orders of magnitude faster than a standard CFD simulation. And as I was saying at the beginning, we worked a lot on making these technologies stable, so that, unlike a standard data-driven model that breaks down after a few time steps of forecasting, our surrogate models are able to forecast for a longer time. We know that all these technologies are computationally very expensive, so sometimes we are not able to run them on standard hardware; we have to use a supercomputer.
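Stepping back to the parameter-estimation idea above: the common thread is choosing the parameters that minimise a misfit (cost) between model output and noisy observations. A toy version, with a hypothetical growth model and made-up true values (2.0, 0.4), not the talk's actual economic model:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 50)
# Synthetic noisy observations of y = alpha * exp(beta * t),
# with (alpha, beta) = (2.0, 0.4) as the hidden "truth".
y_obs = 2.0 * np.exp(0.4 * t) * np.exp(rng.normal(0.0, 0.01, t.size))

# Variational flavour: minimise the misfit J(alpha, beta) between model and
# observations. For this toy model a log-transform makes J quadratic, so the
# minimiser is an ordinary least-squares line fit.
beta_hat, log_alpha_hat = np.polyfit(t, np.log(y_obs), deg=1)
alpha_hat = np.exp(log_alpha_hat)
print(alpha_hat, beta_hat)  # close to the true (2.0, 0.4)
```

In the assimilation setting the cost also contains a background term penalising distance from the previous parameter estimate, which is what makes the estimates evolve smoothly as new market data arrives.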
To make that feasible, we use domain decomposition: we decompose the domain into subdomains, and these give some gains in terms of energy reduction, so power reduction, when you run the computation on a supercomputer. Finally, data assimilation. I will be quick on this, because I will give you more details about data assimilation for COVID. Data assimilation is a technology that merges two pieces of information into only one output. Imagine you have something that has been predicted by a model and something that has been observed, from sensors or from any other kind of information. You merge these two together and you obtain the assimilated data, which is more accurate and reliable than either of the first two. This can be done not just for geoscience or medical applications, but also, for example, for the sentiment of people; I will not focus too much on this because of the lack of time. And then finally, data learning and surrogate models. The main question is: we all rely on real data, but sometimes you don't have data all the time, especially when you use sensors or imaging. Sometimes you have gaps in your observations. What can you do when you have these gaps? This question is answered by a technology which learns how your forecasting model misfits the observed data; having learnt this mistake, it can be used to correct future forecasts even when you don't have observations in time. Just to show you: with this approach, even when you don't have observations, the model works quite well in terms of accuracy. The blue line is the model running without real observations, but with this network behind it.
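The merging step described above has a classic closed form, the best linear unbiased estimate: analysis = forecast + gain × (observation − predicted observation), with the gain weighting each source by its error covariance. A minimal sketch with made-up numbers:

```python
import numpy as np

def assimilate(x_b, B, y, H, R):
    """Return the analysis x_a = x_b + K (y - H x_b), K the Kalman gain."""
    K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
    return x_b + K @ (y - H @ x_b)

x_b = np.array([1.0, 2.0])   # background: the model forecast
B = np.diag([0.5, 0.5])      # background error covariance
H = np.array([[1.0, 0.0]])   # we only observe the first state variable
y = np.array([1.4])          # the observation
R = np.array([[0.1]])        # observation error covariance

print(assimilate(x_b, B, y, H, R))  # pulled toward y in the observed component
```

Because the observation is trusted more than the forecast here (0.1 vs 0.5 variance), the analysis moves most of the way from 1.0 toward 1.4, while the unobserved component is left unchanged.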

This can also be used for the integration of geoscience and social media data, and we are doing this, for example, in the context of wildfires. Now, about modelling the pandemic. All the models I have been showing over the past slides are very modular and very general; they can be used in a lot of different scenarios. As I said, I have been working with the Data Science Institute for quite a long time. In December 2019 the pandemic was still not a pandemic: it was mainly just in China, and at the DSI, the Data Science Institute, we were trying to collect some funding to send masks and other supplies, which, as we all know now, are very important when you are trying to fight a pandemic. The director of the institute at the time said: no, guys, okay, we can do that, but what we really need now is to use your brains to help face the pandemic. So it all started that December, December 2019, and in February 2020 we attended the first meeting at the Royal Society, where the first results from Imperial College research were also being presented. We were all presenting what we were starting to do, together with other groups. Since that moment we have been working mainly on several different approaches: epidemic models, so forecasting models using data; airflow and air quality in indoor environments; and finally vaccination strategy. I will give you a quick overview of these three main problems we have been facing. About the epidemic model: as I was saying before, data assimilation is a technology able to merge a forecasting model with new information, ingesting it all the time.
Using this kind of technology, we have been adjusting the parameters of the SEIR model, and also of a different version of the SEIR model where we introduced treated people as an extra compartment. Treated people were implemented just for China, because for China we had information about the number of hospitalisations (I am talking about the beginning of 2020). At that time we didn't have this information worldwide; we just had lots of information from China, plus the number of infections worldwide. So we separated, let's say, China and the rest of the world: the SEIR model for the rest of the world, and the SEIR model plus treated compartment for China. Essentially, what we have been doing with this is helping the SEIR model to be more accurate in forecasting the peak of the infections, learning the new parameters from the newly available data all the time. One very important point here (in case you're interested, I'm happy to give more details, and the details are provided in this paper) is that we assume all the time that the data provided by governments are noisy. Not because the governments didn't want to provide the real data, but because, especially at the beginning, it was impossible to track the number of infections. Obviously, learning entirely from such data means learning something really noisy; but assuming that the SEIR model is very accurate is also a strong approximation. So at some point we implemented something in the context of physics-informed machine learning, and we balanced information from the SEIR model and from the data in one integrated model, which put these two pieces of information together: the SEIR model plus data-driven models, models developed from the data. And we have seen

in the long-term forecasting an improvement in terms of accuracy when using both, so this physics-informed approach. Why, at the beginning, were we so interested in understanding the peak? Why was that so important for us? Because this was integrated into a system able to understand the impact of mitigation or suppression measures. By mitigation I mean, for example, the rule of keeping a two-metre distance (if you remember, at the beginning of 2020 it was crazy, people didn't really know how to implement that), and by suppression, for example, the lockdown. So when we developed this tool, a year and a half ago, it was used to understand the impact of mitigation and suppression measures on the number of infections, and we checked this impact in both China and European countries like the UK. Now, when you want to understand how the infection spreads within the population of a city, the number of variables in the SEIR model, the state vectors of the model, becomes very big. This means it can be computationally extremely expensive. Imagine that you distinguish, for example, key workers, non-workers, male, female, children: the number of variables becomes very high. When we started working on this aspect, we saw that it was impossible to run this kind of simulation in an acceptable time without stable data-driven models, without machine learning models.
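For readers unfamiliar with the compartmental model being discussed, here is a minimal forward-Euler SEIR integration, including the infection peak that the talk focuses on forecasting. The parameter values are illustrative only, not the assimilated ones from the talk:

```python
# Minimal SEIR model: susceptible -> exposed -> infectious -> recovered.
# Forward-Euler integration with illustrative (not assimilated) parameters.
def seir_step(s, e, i, r, beta, sigma, gamma, dt=1.0):
    n = s + e + i + r
    new_e = beta * s * i / n * dt   # new exposures
    new_i = sigma * e * dt          # exposed becoming infectious
    new_r = gamma * i * dt          # infectious recovering
    return s - new_e, e + new_e - new_i, i + new_i - new_r, r + new_r

state = (0.99, 0.0, 0.01, 0.0)      # population fractions (S, E, I, R)
peak_day, peak_i = 0, state[2]
for day in range(1, 201):
    state = seir_step(*state, beta=0.5, sigma=0.2, gamma=0.1)
    if state[2] > peak_i:
        peak_day, peak_i = day, state[2]
print(peak_day, peak_i)             # when and how high the infections peak
```

In the assimilation setup described in the talk, `beta` (and the other rates) would be re-estimated from incoming case data rather than fixed, which is exactly what shifts the predicted peak.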
So we have been working on a generative adversarial network model and a recurrent neural network model. We published this in a paper on digital twins for COVID, which appeared in the journal Neurocomputing, but let me show you a few results. What is very interesting is the execution time in seconds: the SEIR model takes, let's say, 0.45 seconds, while the other models, the machine learning models, are up to four orders of magnitude faster, which is very important when you want to forecast months, or at least weeks, ahead. And this is in terms of accuracy: you can see the movement of the infection within the city, how the SEIR model predicts it and what the AI model predicts. This was the epidemiological part. Then, on airflow and air quality in indoor environments, we have been working with the Applied Modelling and Computation Group. Let me show you why. Imagine, for example, a classroom with students in it: what happens if a student sneezes and doesn't have a mask? To study that, of course, we can run fluid flow simulations; but to make them more realistic and reliable, we have been merging these simulations with real images, real observations of people sneezing and coughing.
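The surrogate-model speedup quoted above can be illustrated in miniature: train a cheap emulator on input/output pairs of an expensive simulator, then query the emulator instead. The group's actual surrogates are GANs and RNNs over CFD outputs; the toy below swaps in a polynomial fit of a stand-in function purely to show the principle:

```python
import numpy as np

def expensive_simulator(x):
    # Stand-in for a slow CFD run: in reality this would take minutes or hours.
    return np.sin(3 * x) + 0.5 * x

# Offline phase: run the simulator to generate training pairs.
x_train = np.linspace(-1.0, 1.0, 200)
y_train = expensive_simulator(x_train)

# Fit a cheap polynomial emulator to the simulator's outputs.
emulator = np.poly1d(np.polyfit(x_train, y_train, deg=9))

# Online phase: the emulator answers in microseconds instead of re-simulating.
x_test = 0.37
print(emulator(x_test), expensive_simulator(x_test))  # should be close
```

The stability problem mentioned in the talk shows up when such emulators are iterated on their own outputs for many time steps, which is where the group's assimilation-based corrections come in.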

And this is how the integration happens: the background is what a computational fluid dynamics solver says when somebody is sneezing, then your observation says something different, and then you can see the integration. Let me give you a bit of information about why. When you have a fluid dynamics simulation, you have your input data and then your solution. What we are trying to do here is improve the accuracy of the input data, because errors in the input data propagate through the model to the solution. This is the reason why we integrate information into the input data, into the initial condition, in order to have better forecasts. And this, for example, is what happens in a train: a simulation my group has been working on together with a group in Earth Science and Engineering, for the RAMP (Rapid Assistance in Modelling the Pandemic) task force of the Royal Society's COVID response. That was the part on airflow and aerosols, for a person coughing and sneezing. But what happens if you just breathe? The University of Cambridge provided data of people just breathing, or breathing and laughing, with or without a mask. All this information can be included in the computational fluid dynamics simulation, so you have a better understanding of what is happening in an indoor environment. Integrating...

**Ben Etheridge**

Sorry, sorry to jump in. Can I just say that time is pressing? So if you can maybe move to the conclusions; if there are any questions, take your time.

**Rosella Arcucci**

Okay. Then there is the vaccination strategy; for that we have been using graph neural networks. The main point in the models we have developed is the integration of data assimilation into models based on centrality, and we can see the impact, in terms of accuracy, of ingesting real information into the system. Yeah, so I was concluding. Just to say that we are happy to share: our codes are available, in case you are interested, and we are happy to provide any extra information. We are also organising a workshop here on machine learning and data assimilation for dynamical systems; you are very welcome to join us. And that's it. Thank you very much for your attention.
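The centrality idea behind such vaccination strategies can be sketched without any graph library: rank people by degree centrality on a toy contact network and target the most-connected first. The group's actual models are graph neural networks with assimilated data; this only illustrates the underlying intuition.

```python
def degree_centrality(adj):
    """Fraction of the other nodes each node is connected to."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

def top_k_targets(adj, k):
    """The k highest-centrality nodes, ties broken by node id."""
    c = degree_centrality(adj)
    return sorted(adj, key=lambda v: (-c[v], v))[:k]

# Toy contact network: node 0 is a hub that everyone meets.
adj = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}}
print(top_k_targets(adj, 2))  # the hub first, then a well-connected neighbour
```

Vaccinating hubs first removes the most transmission paths per dose; assimilating real contact or case data, as in the talk, updates which nodes count as hubs over time.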

**Ben Etheridge**

I think Tim has a question.

**Tim Rogers**

Hello, yeah. Thanks for a really interesting talk. I'm curious to know: do you think these approaches will replace the traditional models, you know, the epidemic compartment models and things like this, in the future?

**Rosella Arcucci**

So, not replacing: integrated in.

**Tim Rogers**

Okay, so in the sense that people won't be doing just dynamical systems modeling anymore?

**Rosella Arcucci**

Yes. Yeah. I think that the use of information from the data will definitely grow. So the answer is: not replacing, but more and more information from the data will be used and ingested into the standard models in the future.

**Tim Rogers**

I shall educate myself about this a bit more. Thank you.

**Sheri Markose**

Yeah, so Rosella, that was very interesting. Following on from Tim's question: you're not trying to replace SEIRs; you're saying that data assimilation addresses a problem, which I agree with, you know, the problems some underlying models have with regard to the data on the networks, or, let's say, as you said, various behavioural inputs, like with or without masks, who was inoculated and who's not. We have problems finding that very granular data on the ground, and you're saying we've got to do things to clean that up before we feed it into the models. Am I right? And then you have a number of models: you mentioned a deep learning one, and you also mentioned how we could use GANs. So just tell me, which of these worked best? Did you run a contest between the different methods on the same problem?

**Rosella Arcucci**

So actually, we have just published a paper on it, in a computational science journal. The name of the paper is "Data Learning: Integrating Data Assimilation and Machine Learning", and in the paper we say exactly this: depending on the problem, which usually means depending on the question that you want to answer, there is one model which works better than another. So, yeah.

**Sheri Markose**

So, Rosella, for COVID you obviously have a number of different models for the data assimilation. Which of these works best? Is there any way of judging that?

**Rosella Arcucci**

So, I wouldn't say that one works best overall, but the one that in my heart feels most useful, because it's very important, is data assimilation for parameter estimation in the SEIR model. In the SEIR model the parameters, beta and the others, have been modelled to capture, you know, how the compartments are connected. If you make those parameters dynamic and change them in time (because in reality the parameters are dynamic: the environment changes, the governments change the rules, and so on), then the parameters are not static, they are definitely dynamic. So learning those parameters from the data and adjusting the SEIR model using that new information is, I think, one of the most useful parts.

**Sheri Markose**

So how do you learn it? That's the thing that should be talked about: what would do a better job?

**Rosella Arcucci**

Yeah. Essentially, you solve an optimization problem. You learn the parameters from the information provided by the government: the number of infections, the number of hospitalisations, the number of people in that city, the number of students and whether the schools are open or not. All this information can be included in the optimization problem, and by minimizing it you obtain the optimal parameters that you can use in the SEIR model, changing them every day.

**Sheri Markose**

Thank you

**Anindya Chakrabarti**

Can I ask a quick question?

**Rosella Arcucci**

Of course

**Anindya Chakrabarti**

So, Rosella, you mentioned something about an economic model which looked like a state-space model, where you are estimating an unobserved variable. What was the application of data assimilation there? Because the usual Kalman filters and so on can be applied, you know, to infer that beta coefficient that you were showing, which evolves over time. So what was it? I somehow missed it. What was your point there?

**Rosella Arcucci**

So, the model was a VAR model and the application was the cryptocurrency market. In the paper I was showing, we have experiments for one cryptocurrency, and the approach was the same for the others. We have been using a variational approach, related to both the Kalman filter and variational data assimilation, which is very efficient in terms of both speed and accuracy, to estimate the optimal parameters. In that case too, instead of having static parameters, we have a dynamic change of the parameters: as soon as we have new information from the observations, in that case the currency market, we ingest this information into the optimization problem, estimate beta and alpha, feed beta and alpha into the model, and make the prediction.

**Anindya Chakrabarti**

Okay.