Genetic algorithms as a mathematical apparatus

Genetic algorithms aim to solve optimization and modeling tasks by iterating, variation and combining target parameters with the help of mechanisms similar to biological evolution.

Genetic algorithms aim to solve optimization problems. Neural network training can serve as an example of such a task. It consists in selecting such values of weights at which the minimum error is achieved. The genetic algorithm is based on the random search method. The main disadvantage of random search is that we do not know how long it will take to solve the problem.

To avoid such a waste of time in solving the problem, professionals use the methods discovered in the study of the evolution and origin of species. In the evolution process, the fittest specimens survive. This leads to an increase in the fitness level of the populations, which allows it to survive better in changing conditions.

This algorithm was first proposed by John Holland at the University of Michigan in 1975. It was called the Holland Reproductive Plan and formed the basis for almost all varieties of genetic algorithms. Before we examine it in detail, we should focus on how real-world objects can be encoded for use in genetic algorithms.

Representation of objects

We know from biology that any organism can be represented by its own phenotype, which actually defines what an object is in the real world, and a genotype that contains all the information about the object at the level of the chromosome set. Each gene (that is, an element of genotype information) has its own reflection in the phenotype.

To solve the problems, we need to represent each trait of the object in a form suitable for use in the genetic algorithm. All further functioning of the mechanisms of the genetic algorithm occurs at the genotype level. It allows you to do without information about the internal structure of the object, which causes its wide application in a variety of tasks.

In the most common variation of the genetic algorithm, bit strings are used to represent the genotype of an object. Each trait of an object in the phenotype corresponds to one gene in the genotype of the object. A gene is a bit string, most often of fixed length, that represents the value of this trait.

Encoding of traits (integers)

To encode traits, you can use the simplest option — the bit value of a trait. Then it will be easy for you to use a gene of a certain length, long enough to represent all possible values of such a trait. But, unfortunately, this kind of encoding has its disadvantages.

The main drawback is that neighboring numbers differ in the values of several bits. For example, the numbers 7 and 8 in the bit representation differ in 4 positions, which makes it difficult for the genetic algorithm to function and increases the time required for its convergence. To avoid this problem, it is better to use encoding, in which neighboring numbers differ in a smaller number of positions, ideally in the value of one bit.
Gray code can be used to work out the genetic algorithm. The values of the Gray code are indicated in the table below:
 

Decimal Code Binary Value Hexadecimal value Decimal Code Binary Value Hexadecimal value
0 0000 0h 0 0000 0h
1 0001 1h 1 0001 1h
2 0010 2h 3 0011 3h
3 0011 3h 2 0010 2h
4 0100 4h 6 0110 6h
5 0101 5h 7 0111 7h
6 0110 6h 5 0101 5h
7 0111 7h 4 0100 4h
8 1000 8h 12 1100 Ch
9 1001 9h 13 1101 Dh
10 1010 Ah 15 1111 Fh
11 1011 Bh 14 1110 Eh
12 1100 Ch 10 1010 Ah
13 1101 Dh 11 1011 Bh
14 1110 Eh 9 1001 9h
15 1111 Fh 8 1000 8h

Table 1. Correlation of decimal codes (left) and Gray codes (right)

When encoding an integer trait, we divide it into tetrads and transform each tetrad according to the Gray code.

In practical implementations of genetic algorithms, there is usually no need to convert the values of a trait to the value of a gene. In practice, there is an inverse problem, when we need to determine the value of a trait, based on the value of the gene.

Thus, the task of decoding the values of genes that correspond to integer traits is trivial.

Encoding of traits (real numbers)

The simplest and the most obvious encoding method is to use bit representation. However, this option has the same disadvantages as the method that is applied to integers. Therefore, in practice, we would usually resort to the following sequence of actions:

  1. Divide the entire range of acceptable values of the trait into sections with the required accuracy.
  2. Take the gene value as an integer that defines the interval number (using Gray code).
  3. Take the number that is the middle of this interval as the parameter value.

Let's analyze the above-described sequence of actions using an example.

Let's assume that the values of the trait belong to the range of [0, 1]. To encode the section, we divided it into 256 intervals. To encode their number, we will need 8 bits. Let the gene value be 00100101bG (the capital G letter indicates that Gray code is used). To get started, let's use Gray code to find the corresponding interval number:

25hG->36h->54d.

Now let's see which interval corresponds to it. By using simple calculations, we get the [0.20703125, 0.2109375] interval. So the value of the parameter will be (0.20703125+0.2109375)/2=0.208984375.

Encoding non-numeric data

When encoding non-numeric data, you must first convert it to numerals.

How to determine the phenotype by genotype

To determine the phenotype of an object (that is, the values of the traits that describe the object), we only need to know the values of the genes that correspond to these traits (that is, the genotype of the object). The set of genes that describes the genotype of the object is a chromosome. In some implementations, it is also called a specimen.

In the implementation of the genetic algorithm, the chromosome is a bit string with a fixed length. Each section of the string corresponds to a certain gen. The length of the genes within the chromosome can be identical or different. Most often, same-length genes are used.

Let's consider an example of a chromosome and the interpretation of its meaning. Let's say that an object has 5 traits, each encoded by a gene with a length of 4 elements. In this case, the length of the chromosome will be 5*4=20 bits:

0010 1010 1001 0100 1101

Now we can determine the values of the traits:

Trait Gene value Binary trait value Decimal trait value
1 0010 0011 3
2 1010 1100 12
3 1001 1110 14
4 0100 0111 7
5 1101 1001 9

Basic genetic operators

In the evolution theory, the way in which the characteristics of parents are transmitted to descendants plays an important role. In genetic algorithms, an operator called crossover is responsible for transmitting the characteristics of parents to descendants. This operator defines the transfer of parent traits to descendants.

It works as follows:

  1. Two specimens are selected from the population to become parents.
  2. The point of discontinuity is determined (usually randomly).
  3. The descendant is defined as a concatenation of a part of the first and second parent.

Let's consider the functioning of this operator:

Chromosome_1 0000000000
Chromosome_2 1111111111

Let's imagine that the discontinuity occurs after the 3rd bit of the chromosome. So, in this case the following values will apply:

Chromosome_1 0000000000 >> 000 1111111 Resulting_chromosome_1
Chromosome_2 1111111111 >> 111 0000000 Resulting_chromosome_2

Then, with a probability of 0.5, one of the resulting chromosomes is determined as a descendant.

The next genetic operator aims to maintain the diversity of specimens within the population. It is called the mutation operator. When using this operator, each bit in the chromosome is inverted with a certain probability.

Plus, the so-called inversion operator is also used. It divides the chromosome into two parts, and then they are swapped. This can be represented as the following scheme:

000 1111111 >> 1111111 000

Technically, these two genetic operators are sufficient for the functioning of a genetic algorithm. Yet in practice, some additional operators or modifications of these operators are also used.

For example, a crossover may not be single-point (as described above), but multi-point, when several points of discontinuity are formed (most often, two). Besides, in some implementations of the algorithm, the mutation operator is the inversion of only one randomly selected bit of the chromosome.

Operating scheme

ТNow that you know how to interpret the values of genes, we will proceed to the description of the functioning of the genetic algorithm. Let's consider the scheme of functioning of the genetic algorithm in its classical version:

  1. Set the start time as t=0. Randomly generate an initial population consisting of k specimens B_0={A_1,A_2,…,A_k}
  2. Calculate the fitness of each specimen F_{A_i}=fit(A_i), i=1…k and the population as a whole F_t=fit(B_t).The value of this function determines how well the specimen described by this chromosome is suitable for solving the problem.
  3. Select the A_c specimen from the population A_c=Get(B_t).
  4. With a certain probability (crossover probability P_c) select the second specimen from the population A_{c_1}=Get(B_t) and produce the crossover operator A_c=crossing(A_c,A_{c_1}).
  5. With a certain probability (probability of mutation P_m) execute the mutation operator A_c=mutation(A_c).
  6. With a certain probability (probability of inversion P_i) execute the inversion operator A_c=inversion(A_c).
  7. Place the resulting chromosome in a new population insert(B_{t+1},A_c).
  8. Perform the operations starting with step 3, k times.
  9. Increase the number of the current epoch t=t+1.
  10. If the stopping condition is met, then exit. Otherwise, move to step 2.

Now, let's take a closer look at certain stages of the algorithm.

The selection of the parent chromosomes in steps 3 and 4 plays the most important role in the successful functioning of the algorithm. Various options are possible here. The selection method that is used the most often is called the roulette. When using this method, the probability of choosing a chromosome is determined by its fitness, that is P_{Get{A_i}}=Fit(A_i)/Fit(B_t). Using this method leads to the fact that the specimen with greater fitness levels will be more likely to transmit their traits to their descendants.

Another commonly used method is tournament selection. It consists in several specimens (usually 2) being randomly selected from the population. The specimen with the greatest fitness is selected as the winner.

Besides, some implementations of the algorithm rely on the so-called elitism strategy. Specimens with the best fitness levels are guaranteed to move to a new population. The use of elitism usually accelerates the convergence of the genetic algorithm. The disadvantage of using the elitism strategy is that it increases the probability of the algorithm hitting the local minimum.

Another important point is the definition of the stopping criteria. Usually, these are either a restriction on the maximum number of epochs of the algorithm's functioning, or a determination of its convergence, usually by comparing the fitness of the population at several epochs and stopping when this parameter is stabilized.

See also

Demystifying Data Science Careers
In this article, we will provide an overview of three different roles and job titles associated with analytics and data science. We will first look at the evolution of the role of data scientist.
The Limitations of Spreadsheets in a Data-Driven Financial World
The Finance sector is a data behemoth, with an estimated 150 zettabytes of data to be analyzed by 2025 according to IBM.
Statistics: The Foundation of Data Science
Statistics are a powerful tool, but interpretation is key. Don't just look at the numbers – understand their meaning: uncover hidden insights from data, compare groups and make informed decisions, replicate...

About Megaladata

Megaladata is a low code platform for advanced analytics

A solution for a wide range of business problems that require processing large volumes of data, implementing complex logic, and applying machine learning methods.
GET STARTED!
It's free