Adversarial attacks: A detailed review

Deep learning has proven to be a very effective tool for solving challenging problems across fields such as healthcare (computer-aided assessment, drug discovery), financial services (fraud detection), automotive (self-driving cars, robotics), communications (news aggregation and misinformation detection), and everyday utility services (virtual assistants, language translation, information extraction).

Deep learning has, however, also been shown to be vulnerable to adversarial attacks, which can alter a model’s predictions by introducing practically imperceptible perturbations into audio, images, or video. In this series of blogs, we will learn about adversarial attacks and how they manipulate deep learning models in order to obtain the attacker’s desired result.

Picture 1: A few examples of adversarial attacks

  • Adding a small amount of noise caused a panda to be classified as a gibbon.
  • A stop sign was mistaken for a speed limit sign.
  • A person wearing a certain pattern went undetected.

Outline

Before turning to adversarial attacks themselves and the formal problem statement, we first define several technical terms that appear frequently in publications. After that, we divide the attacks into groups based on a variety of characteristics.

Common terminologies and definitions:

Here, we’ll attempt to clarify some of the terms that are used often in publications about adversarial attacks and throughout this article.

  • An adversarial example or image is an input that has been purposefully altered to cause erroneous model predictions. The model receives this example (or a number of such examples) as input.
  • An adversarial perturbation is the component of an adversarial example or image that causes the incorrect prediction. It frequently takes the form of low-magnitude additive noise.
  • The adversary is the agent (the attacker) who creates an adversarial example. Less frequently, the adversarial signal/perturbation itself is referred to as the adversary.
  • Defence/adversarial defence is a general term covering any method that increases a model’s robustness, external or internal mechanisms that detect adversarial signals, and image processing that counteracts the effects of adversarial input manipulations.
  • Target image: This is the clean example that the adversary manipulates.
  • Target label: This is the (desired) incorrect label of the adversarial example. The term is most relevant to classification problems.

What is an Adversarial attack?

An adversarial attack uses one of several strategies either to extract important information from a deep learning model or to manipulate the input as little as possible so that the network is fooled into producing the attacker’s desired output. The attacker may have varying degrees of access to the target model, its weights, and the training dataset. Depending on this level of access, attacks can be divided into many types, some of which are described later in this article.

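The problem can be formalised as follows: given a model and a clean input, find a perturbation that changes the model’s prediction while its ℓp norm stays below a pre-defined scalar threshold.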
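A common way to write this constraint is sketched below. The symbol names are assumptions made here for illustration: f for the model, x for the clean input, δ for the perturbation, ε for the scalar threshold, and p for the order of the norm.

```latex
\text{find } \delta
\quad \text{such that} \quad f(x + \delta) \neq f(x)
\quad \text{subject to} \quad \lVert \delta \rVert_p \le \epsilon
```

For an attack that aims at a specific target label t, the condition f(x + δ) ≠ f(x) would be replaced by f(x + δ) = t.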

The pre-defined scalar threshold is usually kept small, so that the difference between the clean and the perturbed input appears negligible to a human observer. The norm order p is most often 1 or 2, although this is not a restriction.
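To make the constraint concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the tiny linear classifier (W, B), the helper names, and the choice of a sign-based step (the idea behind the fast gradient sign method), which is just one simple way to build a perturbation of bounded size.

```python
# Hypothetical linear classifier: predicts 1 if w.x + b > 0, otherwise 0.
W = [2.0, -1.0, 0.5]
B = -0.25


def score(x):
    return sum(wi * xi for wi, xi in zip(W, x)) + B


def predict(x):
    return 1 if score(x) > 0 else 0


def sign(v):
    return (v > 0) - (v < 0)


def lp_norm(v, p):
    """Size of a perturbation under the l_p norm (p may be float('inf'))."""
    if p == float("inf"):
        return max(abs(vi) for vi in v)
    return sum(abs(vi) ** p for vi in v) ** (1.0 / p)


def sign_perturbation(x, eps):
    """Shift every feature by eps in the direction that opposes the prediction.

    For a linear model the gradient of the score with respect to x is W,
    so the step is simply +/- eps * sign(W).
    """
    direction = -1 if predict(x) == 1 else 1
    return [direction * eps * sign(wi) for wi in W]


x_clean = [0.3, 0.2, 0.1]
eps = 0.1
delta = sign_perturbation(x_clean, eps)
x_adv = [xi + di for xi, di in zip(x_clean, delta)]

print("clean prediction:      ", predict(x_clean))  # 1
print("adversarial prediction:", predict(x_adv))    # 0
print("l_inf norm of delta:   ", lp_norm(delta, float("inf")))
print("l_2 norm of delta:     ", lp_norm(delta, 2))
```

The perturbation never moves a feature by more than eps, yet the predicted label flips. On a real image classifier the same idea would need the gradient of a loss with respect to the pixels rather than a hand-written weight vector.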

Classification of adversarial attacks

A machine learning system can be viewed as a generic data processing pipeline (see Picture 2). At inference time, (a) input features are collected from sensors or data sources, (b) processed in the digital domain, (c) used by the model to produce an output, and (d) the output is communicated to an external system or user and acted upon. Picture 2 shows the general pipeline along with two examples, an autonomous car and a network intrusion detection system (middle and bottom). Because the scope of an attack varies with the adversary’s goals and their ability to access the model and data, we will examine the different attacks in terms of their impact on this general pipeline.

Picture 2: ML system pipeline in general (with examples)

Attacks are categorised along three dimensions: attack surface, adversarial capabilities, and adversarial goals.

Attack surface

Given a pipeline of stages, an attacker can choose which stage (or surface) to target in order to achieve their objective. The main attack scenarios identified by the attack surface are sketched below:

  1. Evasion attack: This is the most common attack in an adversarial setting. During the testing phase, the adversary modifies malicious samples in an effort to evade detection by the system. This setting assumes no influence over the training data.
  2. Poisoning attack: Also known as contamination of the training data, this attack is carried out during the training phase by injecting carefully crafted samples into the training data in order to compromise the entire learning process.
  3. Exploratory attack: These attacks do not influence the training dataset. Given black-box access to the model, they attempt to learn as much as possible about the underlying system’s learning algorithm and the patterns in the training data.
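As a rough illustration of an exploratory attack, the sketch below treats a one-dimensional threshold classifier as a black box and recovers its decision boundary purely from input/output queries. The victim model, its hidden threshold, and the function names are all hypothetical. Real systems are far higher-dimensional, and binary search is only one possible probing strategy.

```python
def make_black_box(secret_threshold):
    """Return a query-only model; the threshold stays hidden from the caller."""
    def model(x):
        return 1 if x >= secret_threshold else 0
    return model


def probe_boundary(model, low, high, tol=1e-3):
    """Locate the decision boundary using only input/output queries."""
    if model(low) != 0 or model(high) != 1:
        raise ValueError("low and high must bracket the decision boundary")
    queries = 2  # the two bracketing queries above
    while high - low > tol:
        mid = (low + high) / 2
        queries += 1
        if model(mid) == 1:
            high = mid  # boundary is at or below mid
        else:
            low = mid   # boundary is above mid
    return (low + high) / 2, queries


victim = make_black_box(secret_threshold=0.618)
estimate, n_queries = probe_boundary(victim, 0.0, 1.0)
print(f"estimated boundary {estimate:.4f} after {n_queries} queries")
```

The attacker never touches the training data or the model internals. A handful of queries is enough to learn where this toy model changes its mind.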

Adversarial capabilities:

This refers to the amount of knowledge an adversary has about the system. Dividing these capabilities into training-phase and testing-phase capabilities gives a clearer picture of what an attacker can do.

Training phase capabilities:

Most attacks in the training phase work by directly altering the dataset in order to learn, influence, or corrupt the model. Based on the adversarial capabilities, the attack strategies are broadly divided into the following three categories:

Data injection: The adversary has no access to the training data or the learning algorithm but is still able to add new data to the training set. By inserting adversarial samples into the training data, the adversary can corrupt the target model.

Data modification: The adversary has no access to the learning algorithm but has full access to the training data. By altering the data before it is used to train the model, the adversary poisons the training data directly.

Logic corruption: The adversary is able to tamper with the learning algorithm itself. Designing a counter-strategy against such attacks becomes exceedingly difficult.
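The first two capabilities can be contrasted with a toy example. The sketch below uses a one-dimensional nearest-centroid classifier. The dataset, the probe point, and the helper names are made up for illustration, and logic corruption is not modelled because it would mean changing the training code itself.

```python
def centroid(values):
    return sum(values) / len(values)


def train(dataset):
    """dataset: list of (feature, label) pairs with labels 0 or 1.

    Returns a dict mapping each label to the mean feature value of its class.
    """
    return {label: centroid([x for x, y in dataset if y == label])
            for label in (0, 1)}


def predict(model, x):
    """Assign x to the class whose centroid is closest."""
    return min(model, key=lambda label: abs(x - model[label]))


clean = [(0.0, 0), (1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1), (10.0, 1)]
probe = 6.0

# Data injection: the attacker can only append new, crafted samples.
injected = clean + [(12.0, 0), (13.0, 0), (14.0, 0)]

# Data modification: the attacker edits existing records (here, one label flip).
modified = [(x, 0 if x == 8.0 else y) for x, y in clean]

print("clean model:    ", predict(train(clean), probe))     # 1
print("after injection:", predict(train(injected), probe))  # 0
print("after label flip:", predict(train(modified), probe)) # 0
```

In both cases the learning algorithm is untouched. Only the data it sees has changed, which is enough to move the class centroids and flip the prediction for the probe point.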

Testing phase capabilities:

At test time, adversarial attacks do not tamper with the targeted model itself; instead, they force it to produce incorrect outputs. Such attacks are classed as either white-box or black-box.

White-box attack: An adversary mounting a white-box attack on a machine learning model has complete knowledge of the model being used (for example, the type of neural network and the number of layers, details of the training procedure, and the parameters of the fully trained model). The adversary uses this information to identify regions of the feature space where the model is vulnerable, i.e., where it has a high error rate. Access to the internal model weights makes a white-box attack an extremely powerful adversarial attack.

Black-box attack: A black-box attack assumes no knowledge of the model’s internals; instead, it uses information about the settings or past inputs to probe the model for vulnerabilities. Black-box attacks fall into three kinds: non-adaptive, adaptive, and strict black-box attacks.
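The practical difference between the two settings can be sketched on a toy linear model. In the white-box case the attacker reads the gradient of the score straight from the weights. In the black-box case, assuming the model at least returns a score for each query, the attacker can approximate the same gradient with finite differences. The weights, the query interface, and the function names below are hypothetical.

```python
W_SECRET = [2.0, -1.0, 0.5]  # hypothetical model weights
B_SECRET = -0.25


def query(x):
    """Black-box interface: the caller only ever sees the returned score."""
    return sum(w * xi for w, xi in zip(W_SECRET, x)) + B_SECRET


def white_box_gradient():
    """White-box: the gradient of a linear score w.r.t. the input is the weights."""
    return list(W_SECRET)


def black_box_gradient(x, h=1e-4):
    """Black-box: central finite differences, two queries per input feature."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        grad.append((query(up) - query(down)) / (2 * h))
    return grad


x = [0.3, 0.2, 0.1]
exact = white_box_gradient()
estimated = black_box_gradient(x)
print("white-box gradient:", exact)
print("black-box estimate:", [round(g, 6) for g in estimated])
```

The estimate costs two queries per feature, which is why query budgets matter so much for black-box attacks on high-dimensional inputs such as images.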

Adversarial goals

Attacks may be categorised into the following four groups according to the adversary’s goal:

  1. Confidence reduction: The adversary seeks to lower the target model’s prediction confidence. For instance, a legitimate image of a “stop” sign may be predicted with lower confidence, i.e., with a lower probability of class membership.
  2. Misclassification: The adversary tries to change the output classification of an input example to any class other than its original class. For instance, a legitimate image of a “stop” sign will be predicted as any class other than the stop sign.
  3. Targeted misclassification: The adversary attempts to craft inputs so that the model outputs a specific target class.
  4. Source/target misclassification: The adversary attempts to force a specific input (the source) to be classified as a predetermined target class. For instance, the classification model will predict that the input image of a “stop” sign is a “go” sign.
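The four goals can be read as increasingly strict checks on a model’s output. The sketch below is one hedged way to express them in code. The function name, its parameters, and the probability values are assumptions made for illustration, and the distinction between goals 3 and 4 is simplified to whether a specific source class was also required.

```python
def achieved_goals(clean_probs, adv_probs, true_label,
                   target_label=None, source_label=None):
    """Return the adversarial goals met by an attack on a single input.

    clean_probs / adv_probs map class names to predicted probabilities
    before and after the attack.
    """
    clean_pred = max(clean_probs, key=clean_probs.get)
    adv_pred = max(adv_probs, key=adv_probs.get)
    goals = []
    if adv_probs[true_label] < clean_probs[true_label]:
        goals.append("confidence reduction")
    if clean_pred == true_label and adv_pred != true_label:
        goals.append("misclassification")
    hit_target = target_label is not None and adv_pred == target_label
    if hit_target:
        goals.append("targeted misclassification")
    if hit_target and source_label is not None and true_label == source_label:
        goals.append("source/target misclassification")
    return goals


clean = {"stop": 0.95, "go": 0.03, "yield": 0.02}
weak_attack = {"stop": 0.60, "go": 0.25, "yield": 0.15}
strong_attack = {"stop": 0.10, "go": 0.85, "yield": 0.05}

print(achieved_goals(clean, weak_attack, "stop"))
print(achieved_goals(clean, strong_attack, "stop",
                     target_label="go", source_label="stop"))
```

The weak attack only lowers the confidence in “stop”, while the strong attack satisfies all four checks for the stop-to-go example used above.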

The flowchart below summarises all the types and subcategories of adversarial attacks:

Picture 3: Flowchart of the different types of adversarial attacks

We now know what an adversarial attack is and how it can be categorised along several dimensions. In the following sections, we’ll look at some of the most common attack types and at how adversarial attacks apply, beyond image classification, to tasks such as object detection, object tracking, NLP, and audio.
