
The role of data in prediction systems

Posted on: March 21, 2022

A prediction system is created using data mining techniques that result in predictive modelling, which helps to anticipate future trends in areas as diverse as weather, stock markets, health outcomes, and business intelligence. Forecasting in this way supports decision making and reduces risk, because it becomes possible to weigh up the odds of something happening or not happening.

Predictive modelling can create complex simulations that go beyond simple statistics, such as standard deviation, to uncover deeper insights. Because prediction systems rely on amassing data associated with the problem, it’s important that the data is cleaned and formatted so that it’s ready to be combed for accurate predictions.
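As an illustration, here is a minimal data-cleaning sketch in Python using pandas. The column names, values, and the choice to fill a missing reading with the column mean are all hypothetical, chosen only to show the kind of preparation involved.

```python
import pandas as pd

# Hypothetical raw readings with typical problems: a duplicated row,
# a missing value, and numbers stored as text.
raw = pd.DataFrame({
    "date": ["2022-03-01", "2022-03-01", "2022-03-02", "2022-03-03"],
    "temperature": [12.1, 12.1, None, 14.0],
    "rainfall_mm": ["3.2", "3.2", "0.0", "1.5"],
})

cleaned = raw.drop_duplicates().copy()                          # remove repeated rows
cleaned["date"] = pd.to_datetime(cleaned["date"])               # parse dates properly
cleaned["rainfall_mm"] = pd.to_numeric(cleaned["rainfall_mm"])  # text -> numbers
# Fill the missing temperature with the column mean -- one simple choice
cleaned["temperature"] = cleaned["temperature"].fillna(cleaned["temperature"].mean())
print(cleaned)
```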

Machine learning is the methodology used to train computers to look for patterns within vast amounts of data, often accumulated over decades and stored in repositories known as data lakes. Once computers have been trained on what to look for using learning algorithms, such as gradient boosting for regression and classification problems, they can go further through unsupervised learning, noticing additional patterns while data mining. These may be patterns that researchers had never considered because they were unknowns. Neural networks can uncover them far more quickly and efficiently than human brains can. Artificial neural networks (ANNs) simulate the learning behaviour of the human brain using three or more layers of artificial neurons; the more layers a neural net uses, the more complex its “thinking”, and the deeper its insights.
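To make the gradient boosting idea concrete, here is a minimal sketch using scikit-learn; the noisy sine-wave data and the parameter choices are invented for illustration, not drawn from any real prediction system.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic, illustrative data: a noisy sine wave standing in for
# a real-world signal such as temperature varying over time.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=500)

# Gradient boosting builds an ensemble of shallow trees, each one
# fitted to the residual errors of the trees before it.
model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, max_depth=2)
model.fit(X, y)

print(model.predict([[2.5]]))  # prediction for an unseen input
```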

Ethics in language processing

The deep learning that ANNs facilitate means that ever larger datasets can be scoured for insights. But just as data needs to be cleansed, it’s also important that it does not inherently contain biases. This is increasingly recognised as a problem in data mining and analysis, and especially in natural language processing (NLP). NLP relies on an element of prediction, built as computers scan through and learn from writing and speech samples from all over the internet that contain keywords. NLP already powers predictive texting and autosuggest functions and may be used more in the future in automated journalism.
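The prediction behind predictive texting can be illustrated with a toy bigram model: count which word follows which in a corpus, then suggest the most frequent continuation. This is a deliberately simplified sketch in pure Python; real systems learn from vastly larger corpora using neural language models.

```python
from collections import Counter, defaultdict

# A tiny stand-in corpus; real predictive-text systems learn from
# enormous collections of writing and speech samples.
corpus = "good morning to you good morning everyone good night all".split()

# Count how often each word follows each other word (a bigram model).
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def suggest(word):
    """Return the most frequent continuation seen after `word`."""
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(suggest("good"))  # -> 'morning' (seen twice, vs 'night' once)
```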

Timnit Gebru is a widely respected leader in artificial intelligence (AI) ethics research and was co-lead of Google’s ethical AI team when she wrote a controversial paper with her co-authors. She was forced out of the company after raising various concerns, including the environmental and financial costs of training large AI models and the way large language models become problematic when sexist and racist language ends up in their training sets. The paper also highlighted how research effort is often directed away from achieving understanding through curated datasets because sheer volume of data is seen as more valuable.

Ultimately, large language models mimic real human language without fully understanding it, which can lead to errors, misunderstandings, and offensive content. This is made abundantly clear by the way bots manipulate and distribute misinformation during political elections, wars, and crises such as the global pandemic. The paper by Gebru et al. gives the example of Facebook mistranslating a Palestinian man’s post as saying “attack them” in Hebrew, when it actually said “good morning” in Arabic, leading to his arrest.

Prediction and forecasting systems can do incredible things, offering information we could not glean without the processing power of computers. However, rigour in defining datasets and metrics, as well as in establishing the algorithms that train computers, is vital.

What is a supercomputer?

Supercomputers play an important role in the advancement of computer science and are used for intensive tasks in a wide variety of fields. These include quantum mechanics, weather forecasts and climate models, oil and gas exploration, and molecular modelling. Scientists can also use supercomputers to create physical simulations, showing us what the Big Bang looked like, how effective aeroplane and spacecraft aerodynamics are, and what impact the detonation of nuclear weapons would have.

Fugaku is currently the fastest supercomputer, performing more than 415 quadrillion computations a second. Developed by Japanese tech giant Fujitsu and the research institute Riken, Fugaku can model the impact of an earthquake and tsunami and map out escape routes. Such sophisticated modelling reduces the unpredictability of the impact that natural disasters and extreme weather events may have.

Ensemble forecasting

A popular predictive modelling technique is regression analysis, which estimates the relationship between two or more variables. Each explanatory variable is multiplied by a coefficient; in a simple linear equation (containing only one x variable) plotted in graphical form, the slope of the line is the coefficient. For more complex prediction systems, such as those used for weather forecasting, which involve many interacting variables as well as time series forecasting, ensemble models are used.
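For instance, a straight-line fit can be computed with numpy’s polyfit; the data here is synthetic, invented only to show the slope (the coefficient of x) being estimated.

```python
import numpy as np

# Synthetic observations scattered around the line y = 2x + 1.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=50)

# polyfit with degree 1 performs a least-squares straight-line fit;
# the first value returned is the slope, i.e. the coefficient of x.
slope, intercept = np.polyfit(x, y, 1)
print(f"slope ~= {slope:.2f}, intercept ~= {intercept:.2f}")
```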

Ensemble forecasting is a form of Monte Carlo analysis in which multiple simulations are run from slightly different initial conditions. There are two main sources of uncertainty in forecasting systems. The first is errors introduced by imperfect initial conditions, which are amplified by the chaotic nature of the equations describing changes in the atmosphere. The second is errors that come into play because of imperfections in the model formulation, such as the approximate mathematical methods used to solve the equations. Every forecast is to some extent uncertain: a band of precipitation can be blown off course by a change in wind direction that affects sea temperature, which in turn alters other variables. Ensemble forecasting offers as close an approximation as possible.
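A minimal sketch of the idea: integrate a chaotic system (here the classic Lorenz equations, standing in for atmospheric dynamics) from several slightly perturbed initial conditions and watch the ensemble members diverge. The step size, perturbation scale, and member count are arbitrary choices for illustration, and real forecast models are vastly more sophisticated.

```python
import numpy as np

def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Advance the chaotic Lorenz equations by one forward-Euler step."""
    x, y, z = state
    return state + dt * np.array([
        sigma * (y - x),
        x * (rho - z) - y,
        x * y - beta * z,
    ])

rng = np.random.default_rng(42)
base = np.array([1.0, 1.0, 1.0])

# 20 ensemble members: the same model started from initial conditions
# perturbed by tiny random noise, mimicking imperfect observations.
members = [base + rng.normal(scale=1e-4, size=3) for _ in range(20)]

for step in range(1, 2001):
    members = [lorenz_step(m) for m in members]
    if step % 500 == 0:
        # The spread across members grows as chaos amplifies the tiny
        # initial differences -- a measure of forecast uncertainty.
        spread = np.std([m[0] for m in members])
        print(f"t = {step * 0.01:4.1f}  ensemble spread in x: {spread:.4f}")
```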

The European Centre for Medium-Range Weather Forecasts (ECMWF) uses a system called the Ensemble Prediction System (EPS). This has been merged with ECMWF’s monthly prediction system and coupled with a dynamical ocean model to produce 15-day probabilistic forecasts. Other major weather prediction facilities, such as the National Earth System Prediction Capability in the USA and the Met Office in the UK, also use ensemble predictions. When it comes to climate change modelling, tests on short-term timescales offer validation of long-term estimates, but real accuracy requires climate-prediction systems to be unified across countries.

Could a master’s help you predict your future career?

Data is all around us just waiting to be made meaningful through the wonders of data science. From ensuring that datasets represent the many different voices and perspectives of people around the world to helping society cope with the effects of climate change by using prediction systems, data analytics is helping us make sense of the world. 

Find out how studying a 100% online MSc Computer Science with Data Analytics could help you contribute to solving real-world problems today.

