A world of data

Interview with Jordi Vitrià, Professor of Applied Mathematics and Analysis at the University of Barcelona

What do we mean by big data?

Big data are quantities of data you have to process as a whole – you have to see all the information together to be able to draw conclusions, but you cannot analyse it using conventional methods. Big data have always existed. Physicists tell us that when experiments are run at CERN, data on this scale are generated. But clearly what is novel today is that we have data sources which incessantly produce big data, and which are growing very fast. This process is known as datafication.

The really new phenomenon, then, is the massive scale of the sources of information, not of the data?

Exactly – this is datafication. And we’re just at the beginning. There are many aspects of our lives which used to be ephemeral. In the old days you might phone someone, and your call wasn’t recorded. Now, when instead of making a phone call you send an email, a record is kept. More and more aspects of our lives are being digitized, and the growth is exponential. Now we are able to accumulate series of data which, when taken as a whole, do indeed qualify as big data. And they are becoming ever cheaper to digitize and save. And if something can be done, then inevitably it is done.

Is the main problem the difficulty of systematizing information from heterogeneous and complex sources?

The gurus speak of the three “v”s of big data. The first is volume; this isn’t usually the worst problem, because if the set of data is large but homogeneous it is easy to process. The second is velocity. A patient in the ICU, for example, generates an enormous amount of data per hour. And it makes little sense to store them all: data should be stored only if they’re likely to be consulted. And if you generate so many data so fast, you will never look at them. So we have cases of big data where the difficulty is not the volume but the velocity at which they are produced. The third “v” is variety, when the data you have are basically of the same kind but presented in different ways: images, sounds, physiological signals, etc. Analysing data that are highly heterogeneous, and not structured, adds another factor of complexity.

So the potential of big data lies in knowing how to integrate structured and unstructured data. What tools are available to achieve this?

This is what we are working on at the moment. The emphasis is not so much on big data as on data science – two phenomena that act in parallel. The big data concept is normally associated with machinery and infrastructure. Data science, despite its name, is not strictly a science, but a discipline that integrates various sciences. Data scientists are professionals who analyse data that are not easy to handle, and so they need a range of skills: computing, statistics, and mathematics, since the analysis concludes with the formulation of mathematical and statistical models. They must also be familiar with the subject matter in question. If the data are related to health, you might be a good statistician or computer scientist, but if you don’t understand the information you’re dealing with … This profession is well paid because this multidisciplinary profile is hard to find and because, in the end, it is the data scientist who analyses whether the results produced in the study make sense or not and can be used as the basis for taking decisions. One thing we must stress is that the data alone never say anything: they have to be interpreted.

So it seems that the key is to separate the wheat from the chaff – to distinguish between data that are significant and data that are irrelevant?

One of the dangers of data science and big data is when you think that you have found something significant … and it later turns out to be a coincidence. You have to be meticulous. If you find a mathematical model, you have to be sure that it makes sense, because spurious findings are a risk you expose yourself to when you process many types of data of different dimensions. And your analysis is never conclusive: there’s always the possibility of a chance result, or you may have left out certain factors that you should have included.
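
The risk of a chance finding can be illustrated with a small simulation. What follows is a minimal sketch in Python, not a method taken from the interview: it assumes purely random, unrelated variables, and the function names (`pearson`, `strongest_chance_correlation`) are hypothetical. It generates one random "target" series, screens a growing number of random candidate series against it, and reports the strongest correlation found.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_samples, n_variables, seed=0):
    """Largest |r| between a random target and n_variables random candidates.

    Every series is independent Gaussian noise (an assumption of this
    sketch), so any correlation that shows up is a coincidence.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_variables):
        candidate = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(candidate, target)))
    return best


if __name__ == "__main__":
    # Screen more and more unrelated variables against the same target.
    for n_variables in (1, 10, 100, 1000):
        r = strongest_chance_correlation(n_samples=30, n_variables=n_variables)
        print(f"{n_variables:>5} unrelated variables -> strongest |r| = {r:.2f}")
```

Because every variable here is noise, any strong correlation the search turns up is a coincidence by construction. The more candidates are screened, the stronger the best coincidence tends to look, which is why a result found by sifting many types of data still has to be checked for whether it makes sense.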

How can data change the health sector?

Big medical data have always existed. What has changed is datafication. One possible project would be to carry out a longitudinal follow-up of thousands of patients based on data such as weight, physical activity, eating habits, sleep, and so on. Parameters of this kind are being used in the promotion of healthy lifestyles. If data are compiled over a long period, they can be used to determine which lifestyles are truly healthy because they provide objectively verified results. Lifestyle is a concept that lends itself naturally to big data.

The health system generates a great deal of data that could be of use in controlling costs, improving the effectiveness of medical procedures, or even reducing mortality rates. Are these data being sufficiently exploited?

Data management raises three main questions. First, how to store the data, because a person’s health data may be held in different places: at the outpatient clinic, at one hospital, at another hospital, and so on. The second issue is privacy. Do individuals have the right to go to a hospital and request that their data be erased? Or to ask for a copy of an external report? The third issue is governance. That is, who is responsible for your data? Who decides what will be done with your information? Responsibilities like this can’t be left to the head of the computing service at the hospital, who is the person who stores patients’ records in the physical sense (i.e., making sure that nobody can access or steal them). Until these problems are resolved, there will be a huge gap in health data management.

Indeed, this new situation also poses new problems such as the legal framework that regulates users’ privacy and the commercial exploitation of personal data. What measures should be taken to resolve them?

Privacy is the big problem facing big data. From the legal point of view, the concept of privacy does not exist: the concept of data protection exists, but that is another matter. Individuals may make their medical records available for use in health studies, but they must also be entitled to refuse permission or withdraw a part of the data if they so wish. At the end of the day, the data belong to them, but this is an area without specific legislation at present. It is easy to imagine disturbing situations in relation to health: for example, if an insurance company has access to data and refuses to insure you because there is a possibility that you may have a certain disease. This is a question for politicians, because what is at stake is the model of society that we want to have. Do we want people with a certain genetic make-up to pay more, and other people to pay less?

The expression “too big to predict” is being increasingly used today. The point is whether we should allow a single company to accumulate a large amount of information. Probably the information will be digitized anyway, but it should not be transferable. That is, if you give someone access to your health information, you should be sure that you are not sharing it with someone else who has information on your leisure activities, because cross-referencing these data may create a major risk. This does not happen if data are compartmentalized. But firms of excessive size represent a great danger. In the financial domain, these limitations already exist: there are states that do not allow banks to grow indefinitely. The way to solve this situation is to pass very clear legislation that leaves no one in any doubt about what can and what can’t be done. Then, there should be a system of public audits. Companies must be periodically audited to check for any sharing or leaking of information, or any illicit storing. We should make a New Deal on data, a social contract to decide the boundaries, because the issue of limits is as important as it was in the world of finance sixty or seventy years ago.
