Wednesday, January 22, 2025

Attacks on Deep Learning Models


By Sumeet Saini, Head of Research of the Artificial Intelligence Society at King’s College London

Deep Learning (DL) models are becoming an integral part of our society, with applications in many problem spaces, and their use will only become more pervasive over time. We generally trust these models and the predictions they make.

One interesting characteristic of DL models is that we are often unable to see why they have come to the conclusions that they have. Because they learn directly from the data fed to them, they develop their own internal rules for prediction, which are not stated anywhere and are hard to derive from the final model. As a result, the only part of the model we have direct control over, or understanding of, is the data we feed into it. If the data is correct, we anticipate that the conclusions drawn will generally be correct, with some room for error. The model usually reports a confidence rating alongside each prediction, which is its way of telling us how sure it is of that prediction.
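To make the confidence rating mentioned above concrete, here is a minimal sketch of how a classifier's raw scores are commonly turned into probabilities with a softmax. The scores and class names are made up for illustration, and real models may report confidence differently.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_confidence(scores, class_names):
    """Return the most likely class and the probability assigned to it."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]

# Hypothetical raw scores for a single image (not taken from any real model)
label, confidence = predict_with_confidence(
    [2.0, 0.5, -1.0], ["car", "aeroplane", "ostrich"]
)
print(f"{label}: {confidence:.2f}")
```

Note that a high confidence only says the model is sure, not that it is right, which is exactly what the attacks discussed in this article exploit.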

Let us take the example of image recognition. The model is trained on a large set of images, each with a label indicating which class the image belongs to, for example, a car or an aeroplane. After training on this image set, the model should be able to correctly predict which class an unseen image belongs to. An example of this is Google Lens, which scans items in the real world and predicts what object they are.

What if there is a problem with the data? Images may be incorrectly labelled, but this can be caught by having several people check the data independently. But what if there are problems with the data that human verifiers cannot easily check?


Figure 1: Artefacts applied to images [arXiv:1312.6199]

A landmark paper on this topic identified several ways of attacking neural networks, which are discussed below.

In the image above, artefacts have been applied to images which have then been fed into the DL model. The left column shows the original image, the centre column shows only the artefact, and the right column shows the image with the artefact applied. If the original images were given to the model, it would classify them correctly. However, once given the images in the right-hand column, the model incorrectly classified all of them as an ostrich! Manipulating images to force an incorrect classification is known as an adversarial attack. In this case, the attack was a targeted attack, as the attacker explicitly wanted the model to incorrectly classify the images as an ostrich. This is also an example of a white-box attack, in which the attacker knows the model's internals, such as the algorithms and data that the model used.
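The idea of a targeted, white-box attack can be sketched on a toy model. The example below is an illustration only: it uses a hand-built linear classifier with hypothetical weights and a single gradient-sign step, which is a simpler technique than the optimisation used in the paper behind Figure 1. Because the attacker can read the weights (the white-box assumption), they know exactly which direction to nudge each pixel.

```python
CLASSES = ["car", "aeroplane", "ostrich"]
N_PIXELS = 64  # a flattened 8x8 greyscale image with values in [0, 1]

# Hypothetical hand-built weights, one row per class (not a trained model)
WEIGHTS = [
    [0.1 if i % 2 == 0 else -0.1 for i in range(N_PIXELS)],  # car
    [0.0] * N_PIXELS,                                         # aeroplane
    [-0.1 if i % 2 == 0 else 0.1 for i in range(N_PIXELS)],  # ostrich
]

def scores(image):
    """One linear score per class."""
    return [sum(w * p for w, p in zip(row, image)) for row in WEIGHTS]

def classify(image):
    s = scores(image)
    return CLASSES[s.index(max(s))]

def sign(value):
    return (value > 0) - (value < 0)

def targeted_attack(image, target, epsilon):
    """Move every pixel by at most epsilon towards the target class."""
    source = CLASSES.index(classify(image))
    goal = CLASSES.index(target)
    # For a linear model, the gradient of (target score - source score)
    # with respect to the pixels is simply the difference of the weight rows.
    grad = [WEIGHTS[goal][i] - WEIGHTS[source][i] for i in range(N_PIXELS)]
    # Clip so the result is still a valid image.
    return [min(1.0, max(0.0, p + epsilon * sign(g))) for p, g in zip(image, grad)]

original = [0.55 if i % 2 == 0 else 0.45 for i in range(N_PIXELS)]
adversarial = targeted_attack(original, "ostrich", epsilon=0.1)
print(classify(original), "->", classify(adversarial))
```

No pixel changes by more than 0.1 on a 0 to 1 scale, yet the many small changes add up across the image and flip the prediction. Real images have far more pixels, so the per-pixel change needed can be much smaller.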

To a human, the images in the right column are indistinguishable from the originals. Even if someone were to notice a difference, in the absence of an original image to compare against, there would be no way to tell whether an image has been maliciously tampered with. If models are trained on this malicious data, the result could be a very broken classification system. This brings an alarming issue to the forefront. How do we protect models from such attacks when they are imperceptible to humans? This is where adversarial robustness comes into play.

Adversarial robustness is concerned with strengthening a model’s defences against these kinds of attacks. There are multiple ways to do this, one example being training the model to identify when images have been manipulated. This is easier said than done, as there are many different types of attacks that can be used. Another example, besides applying artefacts as above, is the one-pixel attack, introduced in the paper “One pixel attack for fooling deep neural networks”. In this kind of attack, only one pixel in an image is manipulated to fool the network. Training against known attacks is feasible, and many models now include it. However, we must know an attack exists before we can train against it, which leaves loopholes for attacks that have not yet been discovered.
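As a rough illustration of how a one-pixel attack can be found, the sketch below simply tries every pixel with a couple of extreme values and stops at the first change that flips the label. This brute-force search is an assumption made for brevity and is not the search method used in the paper; the toy model and its weights are likewise hypothetical.

```python
def one_pixel_search(image, classify, candidate_values=(0.0, 1.0)):
    """Try every single-pixel change and return (index, value, new_label)
    for the first one that flips the label, or None if none does."""
    original_label = classify(image)
    for index in range(len(image)):
        for value in candidate_values:
            trial = list(image)   # copy so the input is left untouched
            trial[index] = value  # change exactly one pixel
            new_label = classify(trial)
            if new_label != original_label:
                return index, value, new_label
    return None

# Hypothetical toy model: 16 pixels, where pixel 5 carries far more
# weight than the others, so the model is fragile at that one position.
TOY_WEIGHTS = [0.05] * 16
TOY_WEIGHTS[5] = 1.0
TOY_BIAS = -0.6

def toy_classify(image):
    score = sum(w * p for w, p in zip(TOY_WEIGHTS, image)) + TOY_BIAS
    return "car" if score > 0 else "aeroplane"

image = [0.5] * 16
result = one_pixel_search(image, toy_classify)
print(result)
```

The search only needs the model's predictions, not its internals, so unlike the white-box setting it also works when the attacker can do nothing more than query the model.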

Once a model has been trained to withstand some of these kinds of attacks, we can test its strength with a variety of benchmarking tools available online. One such tool, the Python package Foolbox, was created with this exact goal in mind. In simple terms, it can run many different types of attacks on a DL model and then report how many images were still correctly classified despite being tampered with.
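The number such a benchmark reports is often called robust accuracy. The sketch below computes it by hand for a toy classifier; it deliberately does not use the Foolbox API, and the classifier, the attack, and the data are all hypothetical stand-ins chosen to keep the example self-contained.

```python
def classify(image):
    """Toy classifier: label an image by its average brightness."""
    return "bright" if sum(image) / len(image) > 0.5 else "dark"

def shift_attack(image, true_label, epsilon):
    """Toy attack: push every pixel by epsilon towards the wrong class."""
    direction = -1.0 if true_label == "bright" else 1.0
    return [min(1.0, max(0.0, p + direction * epsilon)) for p in image]

def robust_accuracy(images, labels, attack, epsilon):
    """Fraction of images still classified correctly after the attack."""
    survived = 0
    for image, label in zip(images, labels):
        tampered = attack(image, label, epsilon)
        if classify(tampered) == label:
            survived += 1
    return survived / len(images)

# Hypothetical two-pixel "images" with their true labels
images = [[0.9, 0.9], [0.55, 0.55], [0.1, 0.1], [0.45, 0.45]]
labels = ["bright", "bright", "dark", "dark"]

for epsilon in (0.0, 0.1, 0.5):
    accuracy = robust_accuracy(images, labels, shift_attack, epsilon)
    print(f"epsilon={epsilon}: robust accuracy {accuracy:.2f}")
```

Reporting the figure for several attack strengths, as the loop does, shows how quickly a model degrades as the attacker is allowed larger changes.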

Even though we can train models to reduce the effectiveness of different adversarial attacks, creators of DL models should still be careful. As has been shown, these models are vulnerable, and if we allow such attacks to go unnoticed, many future models might have flaws in their predictions. This is not only an issue for image recognition, but for all types of machine learning. If these models are to have an important role in society, it is of the utmost importance that everything is done to mitigate the impact of these attacks.
