Offspring Population Size Matters when Comparing Evolutionary Algorithms with Self-Adjusting Mutation Rates

We analyze the performance of the 2-rate $(1+\lambda)$ Evolutionary Algorithm (EA) with self-adjusting mutation rate control, its 3-rate counterpart, and a $(1+\lambda)$~EA variant using multiplicative update rules on the OneMax problem. We compare their efficiency for offspring population sizes ranging up to $\lambda=3,200$ and problem sizes up to $n=100,000$. Our empirical results show that the ranking of the algorithms is very consistent across all tested dimensions, but strongly depends on the population size. While for small values of $\lambda$ the 2-rate EA performs best, the multiplicative updates become superior starting from some threshold value of $\lambda$ between 50 and 100. Interestingly, for population sizes around 50, the $(1+\lambda)$~EA with static mutation rates performs on par with the best of the self-adjusting algorithms. We also consider how the lower bound $p_{\min}$ for the mutation rate influences the efficiency of the algorithms. We observe that for the 2-rate EA and the EA with multiplicative update rules the more generous bound $p_{\min}=1/n^2$ gives better results than $p_{\min}=1/n$ when $\lambda$ is small. For both algorithms the situation reverses for large~$\lambda$.


Introduction
A key driver for the success of evolutionary algorithms (EAs) is their global search behavior, i.e., their capability of searching the whole decision space without getting stuck in local optima. This feature distinguishes EAs from other well-known heuristics such as Simulated Annealing and other local hill climbers, which search the decision space only within a small neighborhood. The global search behavior of EAs, however, comes at the cost of a less focused search when the optimization converges, which can result in performance losses in the final parts of the optimization process. The question of how to most effectively combine the best of both worlds is the driving force behind research on parameter control [KHE15, AM16, EHM99], adaptive operator selection [MLS10, FCSS10], and hyper-heuristics [BGH+13], which are the most prominent umbrella terms for adjusting the structure of the search behavior to the current needs of an iterative optimization process.
Interestingly, research on parameter control has shown that in many applications the search behavior can be adjusted very efficiently by quite simple update rules; see the above-cited surveys for details. The considerable performance gains observed in practice have inspired a whole series of theoretical works on non-static parameter choices. In recent years, an increasing number of results have appeared that rigorously quantify the advantages of parameter control; see [DD18b] for a summary and classification of the mechanisms.
We will focus in this work on two different self-adjusting approaches previously shown to yield excellent performances on the OneMax benchmark problem: a 2-rate success control and a generalized one-fifth success rule. We analyze their efficiency when implemented in a (1 + λ) EA framework. Our key research question concerns the scalability of performance with respect to the offspring population size λ and with respect to the problem dimension n. A summary of findings will be presented in Section 1.2.

Self-Adjusting Algorithms
The One-Fifth Success Rule Applied to the (1 + λ) EA. The one-fifth success rule originally stems from theoretical observations made by Rechenberg for the optimal step-size adaptation in the (1+1) Evolution Strategy [Rec73]. An interpretation of this rule which is also suitable for the adaptation of other parameters, such as the mutation rate in discrete EAs, was presented in [KMH+04]. The rule itself is simple: if after one iteration of an algorithm a strictly better offspring has been found, the parameter under consideration is multiplied by some constant $F$, and it is multiplied by $F^{-1/4}$ otherwise. With this setting, after 5 iterations the parameter is the same as in the first iteration if exactly one out of the five iterations was "successful" (i.e., found a better solution), and it is increased or decreased otherwise, depending on the sign of $1-F$ and the number of successful iterations. The one-fifth success rule was shown to yield very efficient optimization times for the (1 + (λ, λ)) Genetic Algorithm (GA), first by empirical means [DDE15] and later by rigorous mathematical running time analysis [DD18a].
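The balancing property of the rule can be checked with a few lines of code. The following is a minimal sketch, not taken from the cited works; the function name `one_fifth_update` and the value F = 1.5 are illustrative assumptions.

```python
def one_fifth_update(value: float, success: bool, F: float = 1.5) -> float:
    """Multiply by F after a successful iteration, by F**(-1/4) otherwise.

    F = 1.5 is an arbitrary illustrative choice, not a recommended setting.
    """
    return value * F if success else value * F ** (-0.25)


# Exactly one success in five iterations: F * (F**(-1/4))**4 = 1,
# so the parameter returns to its starting value (up to rounding).
value = 1.0
for success in [True, False, False, False, False]:
    value = one_fifth_update(value, success)

# Two successes in five iterations: the parameter has grown (for F > 1).
grown = 1.0
for success in [True, True, False, False, False]:
    grown = one_fifth_update(grown, success)
```

With more than one success in five iterations the parameter ends above its starting value when F > 1, and with no success it ends below.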
Since the one-fifth success rule had been derived only for the (1+1) ES and only for particular benchmark problems (the sphere and a corridor function [Rec73]), it seems natural to generalize the update mechanisms to other success rules. In fact, multiplying the parameters by 2 and 1/2 in the case of a successful and an unsuccessful iteration, respectively, has been experimented with in various contexts. It has also been rigorously proven to be efficient for the optimization of different benchmark functions with a (1 + λ) EA variant that uses a self-adjusting offspring population size λ [LS11]. In [DW18] an empirical study analyzed the impact of the update factors on the performance of a (1+1) EA variant. This algorithm samples in each iteration one offspring y from the current best solution x and updates the mutation rate p to Ap (with A > 1 being some constant) if y is at least as good as its parent (i.e., in our setting with maximization as objective, if and only if f (y) ≥ f (x)). In this situation x is replaced by y. If y is strictly worse than x (i.e., if f (y) < f (x)) the mutation rate is decreased to bp, where b < 1 is again some constant. The results presented in [DW18] show that this self-adjusting (1 + 1) EA(A, b) is very efficient on the two classic benchmark problems OneMax and LeadingOnes, and this holds for broad ranges of update strengths A and b. In the very recent work [DDL19] it was rigorously proven that this algorithm, for suitably chosen hyper-parameters A and b, achieves optimal optimization time on LeadingOnes, up to lower order terms.
In this work, we extend the (1 + 1) EA(A, b) to the (1 + λ) EA, which samples λ offspring in each iteration. Since the probability of creating individuals which are equally good as, but not strictly better than, the parent increases considerably for large λ, we have to refine the interpretation of "success" for this context. Since a criterion based on the fraction of offspring y satisfying f (y) ≥ f (x) showed promising performance in initial experiments, our (1 + λ) EA(A, b) discriminates with respect to whether or not at least 5% of the offspring are at least as good as their parent.
2-Rate and 3-Rate Update Rules. An alternative way to control the mutation rate was proposed and analyzed in [DGWY19]. Their 2-rate (1 + λ) EA, which we refer to as the 2-rate (1 + λ) EA$_{r/2,2r}$ in the following, uses two different mutation rates in each iteration. Half of the offspring are created with mutation rate p/2 and the other λ/2 offspring are sampled with mutation rate 2p. The mutation rate is parametrized as p = r/n in the 2-rate (1 + λ) EA$_{r/2,2r}$. The value of r is updated after each iteration by a random decision which favors the rate by which the best offspring has been created (details will be presented in Section 2). It was proven in [DGWY19] that the 2-rate (1 + λ) EA$_{r/2,2r}$ with λ ≥ 45 and $\lambda = n^{O(1)}$ is more efficient for the optimization of OneMax than any (1 + λ) EA with static mutation rates. Moreover, a comparison with a lower bound proven in [BLS14] shows that the 2-rate (1 + λ) EA$_{r/2,2r}$ even achieves asymptotically optimal expected optimization time, which is $\Theta\left(\frac{n}{\log \lambda} + \frac{n \log n}{\lambda}\right)$ when measured in terms of generations. We also study a 3-rate variant of the 2-rate (1 + λ) EA$_{r/2,2r}$, which creates one third of the offspring with each of the mutation rates $c_1 r/n$, $r/n$, and $c_2 r/n$, where $0 < c_1 < 1$ and $1 < c_2$ are hyper-parameters set by the user. In the original work [DGWY19] a similar 3-rate variant has been studied by empirical means. We extend their preliminary study by considering 100 different pairs $(c_1, c_2)$, and also by considering its performance for a broader range of population sizes λ and much larger dimensions n.

Summary of Results
We investigate the efficiency of the three above-mentioned algorithms (i.e., the (1 + λ) EA(A, b), the 2-rate (1 + λ) EA$_{r/2,2r}$, and the 3-rate (1 + λ) EA$_{r/2,r,2r}$) on OneMax, for various population sizes λ up to 3,200 and problem sizes up to n = 100,000. Their performances are compared among each other and to the traditional (1 + λ) EA using static mutation rates.
Interestingly, the ranking of the algorithms is very consistent across all tested dimensions, and only depends on the offspring population size λ. More precisely, we show that for small population sizes λ ≤ 50, the 2-rate (1 + λ) EA$_{r/2,2r}$ performs best, while for λ ≥ 100 the (1 + λ) EA(A, b) is the most efficient among the four algorithms. It is also worth mentioning that for λ of about 50, the conventional (1 + λ) EA with static mutation rates performs on par with the best performing self-adjusting algorithm, which is the 2-rate (1 + λ) EA$_{r/2,2r}$.
All our algorithms are implemented with the shift mutation operator discussed in [CD18b]. This operator, unlike the unconditional standard bit mutation operator traditionally studied in the theory of evolutionary computation, ensures that offspring differ from their parent by at least one bit. This is achieved as follows. The operator first draws the mutation strength $\ell$ (i.e., the number of bits that are flipped to create the offspring) from the binomial distribution Bin(n, p), where the parameter p denotes the mutation rate. While the traditional standard bit mutation operator allows $\ell = 0$, it is easily seen that, for (1 + λ) EAs, which only evolve a single solution, an offspring that is identical to its parent cannot advance the search. The shift mutation operator therefore interprets $\ell = 0$ as a vote for a small search radius, and flips exactly one bit, i.e., effectively using $\ell = 1$. Put differently, this operator always flips at least one bit, and its probability of flipping exactly one bit equals Bin(n, p)(0) + Bin(n, p)(1).
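As a short worked calculation (our own illustration, writing $\ell$ for the number of bits flipped by the shift operator), the probability of flipping exactly one bit can be written out explicitly:

\[
\Pr[\ell = 1] \;=\; \mathrm{Bin}(n,p)(0) + \mathrm{Bin}(n,p)(1) \;=\; (1-p)^n + n p (1-p)^{n-1}.
\]

For $p = 1/n$ both summands tend to $1/e$, so $\Pr[\ell = 1] \to 2/e \approx 0.736$ as $n \to \infty$. For $p = 1/n^2$, a union bound over all pairs of positions gives $\Pr[\ell \ge 2] \le \binom{n}{2} p^2 \le 1/(2n^2)$, so that exactly one bit is flipped with probability at least $1 - 1/(2n^2)$.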
When the mutation rate p converges to zero, the shift mutation operator converges to the operator that deterministically uses mutation strength one, thus effectively reducing the global search property of standard bit mutation to a purely local search. As explained above, we would like to avoid performing a purely local search only, and therefore cap the mutation rate at a lower bound $p_{\min}$. This lower bound can have a significant impact on the performance, as our further experimental results demonstrate. More precisely, we observe that for smaller population sizes a lower bound of $1/n^2$ seems to work better than a lower bound of $1/n$. Interestingly, the situation reverses for larger population sizes. These observations hold for all considered self-adjusting algorithms, but the transition occurs at different values of λ. For the 2-rate (1 + λ) EA$_{r/2,2r}$, the transition population size is between λ = 100 and λ = 200, while for the (1 + λ) EA(A, b) it is between λ = 400 and λ = 800.
In light of the consistent behavior across all tested dimensions, we are confident that our findings will inspire future rigorous theoretical analyses for the self-adjusting algorithms investigated herein.

Related Work
Our work can, to some extent, be seen as an extension of [DYvR+18], where different (1+λ) EA variants have been studied on the two benchmark problems OneMax and LeadingOnes. Note though that we consider here much larger offspring population sizes λ (up to 3,200) and much larger dimensions (up to 100,000), whereas the results presented in [DYvR+18] are restricted to settings with n ≤ 4,000 and λ ∈ {1, 2, 5, 10, 50}. As discussed above, the ranking of the algorithms strongly depends on the size of λ, so that our work gives a much more complete picture than the results presented in [DYvR+18].

Structure of the Paper
The rest of the paper is structured as follows. In Section 2 we describe the considered algorithms in more detail. Section 3 gives a general overview of how each algorithm performs across different dimensions and population sizes. In Section 4 the influence of the lower bound for the mutation probability is studied, and conclusions about the ranking of the self-adjusting algorithms are drawn. Section 5 provides insights into the anytime performance of the algorithms from the fixed-budget perspective. In Section 6 we compare the 2-rate (1 + λ) EA$_{r/2,2r}$ with its 3-rate counterpart. Conclusions and avenues for future work are presented in Section 7.

OneMax and Three (1 + λ) EA Variants
In this section we describe the baseline (1 + λ) EA$_{0\to 1}$ algorithm and its two self-adjusting variants. The description of the algorithms assumes the maximization of a pseudo-Boolean function $f: \{0,1\}^n \to \mathbb{R}$ as the optimization task. All empirical results in the following sections are obtained for the OneMax benchmark problem $\mathrm{Om}: \{0,1\}^n \to \mathbb{R}, x \mapsto \sum_{i=1}^{n} x_i$. OneMax is the most widely studied benchmark problem in the theory of EAs [Jan13, AD11], but also serves as a recurring benchmark problem in empirical studies, and in particular in the context of parameter control [FCSS08, Bäc92, Bäc93, JDJW05, DW18]. A discussion of properties that make OneMax a suitable test problem for adaptive algorithms was offered by Thierens in [Thi09]. Apart from the better comparison with existing results, the availability of theoretical performance limits of adaptive and static evolutionary algorithms makes OneMax a particularly appealing benchmark problem; see [Doe18] for a survey of such black-box complexity bounds. In the context of our work, the asymptotic $\Theta\left(\frac{n}{\log \lambda} + \frac{n \log n}{\lambda}\right)$ bound for all λ-parallel EAs proven in [BLS14] and the precise $n \ln(n) - cn \pm o(n)$ bound for any unary unbiased black-box algorithm from [DDY16] are probably the most relevant results. As mentioned at the end of Section 1.2, we are also confident that further advances in running time analysis will allow us to convert our empirical findings into rigorous mathematical statements.
Before we describe our algorithms, we recall from [JDJW05] that for the optimization of OneMax the best static offspring population size is λ = 1. All larger values result in worse expected optimization times. However, the parallel optimization time, measured by the number of generations needed to find the optimum, of a (1 + λ) EA with λ > 1 can be (and typically is) smaller than that of the (1 + 1) EA.
We implemented sampling from Bin$_{0\to 1}(n, p)$ as follows. First, each bit of a string of length n is inverted with probability p. Then it is checked whether at least one bit was inverted. If not, one bit position is chosen uniformly at random and the corresponding bit is inverted.
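The sampling procedure just described can be sketched in a few lines of Python, together with a plain (1 + λ) loop on OneMax that uses it. This is an illustrative sketch under our own assumptions, not the implementation used for the experiments: the function names, the uniform random initialization, the acceptance of ties, and the generation cap are all hypothetical choices.

```python
import random


def onemax(x):
    """OneMax: the number of one-bits in x."""
    return sum(x)


def shift_mutation(parent, p, rng):
    """Flip each bit independently with probability p.

    If no bit was flipped, flip one position chosen uniformly at random,
    so the offspring always differs from its parent.
    """
    child = list(parent)
    flipped = False
    for i in range(len(child)):
        if rng.random() < p:
            child[i] ^= 1
            flipped = True
    if not flipped:
        child[rng.randrange(len(child))] ^= 1
    return child


def one_plus_lambda_ea(n, lam, p, rng, max_generations=100_000):
    """Static-rate (1 + lambda) EA with shift mutation on OneMax.

    Assumptions of this sketch: random initial solution, and the best
    offspring replaces the parent if it is at least as good.
    Returns (number of generations used, final fitness).
    """
    x = [rng.randrange(2) for _ in range(n)]
    fx = onemax(x)
    generations = 0
    while fx < n and generations < max_generations:
        offspring = [shift_mutation(x, p, rng) for _ in range(lam)]
        best = max(offspring, key=onemax)
        fbest = onemax(best)
        if fbest >= fx:
            x, fx = best, fbest
        generations += 1
    return generations, fx


# Example: n = 30, lambda = 10, static mutation rate p = 1/n.
gens, fitness = one_plus_lambda_ea(30, 10, 1 / 30, random.Random(0))
```

The number of fitness evaluations of such a run is λ times the returned number of generations.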
We note without going into much detail that apart from the "shift" strategy 0 → 1, which assigns the probability mass of sampling $\ell = 0$ to $\ell = 1$, a second approach to deal with 0-bit flips was suggested in [CD18b]. This alternative approach, coined "resampling strategy" and denoted by a subscript > 0 in [CD18b] and its follow-up works [DYvR+18, DW18], distributes the probability mass Bin(n, p)(0) proportionally to all integers $1 \le \ell \le n$. Initial experiments with this resampling strategy indicate similar findings as those presented for the shift strategy used in this work. A detailed examination is left for future work.
Next, we consider two self-adjusting variants of the (1 + λ) EA. The first one is the 2-rate (1 + λ) EA$_{r/2,2r}$ proposed in [DGWY19]. This algorithm is summarized in Algorithm 2. It creates one half of the offspring with mutation rate 2r/n and the other half of the offspring with mutation rate r/(2n). At the end of each iteration, the 2-rate (1 + λ) EA$_{r/2,2r}$ updates the mutation rate based on which value was used when the best offspring was obtained. Note here that with probability 1/2 the new rate is instead chosen uniformly at random between 2r/n and r/(2n), so that overall the value by which the best offspring has been created is selected with probability 1/2 + 1/4 = 3/4. The reason not to update to this rate deterministically was explained in [DGWY19] by the fact that the constant probability to update the mutation rate in the seemingly unfavorable direction can be useful in the optimization of non-unimodal functions; see [DGWY19, Section 3.1] for a more detailed discussion.
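The randomized update of r can be sketched as follows. This is a minimal illustration of the rule as described in the text, not a transcription of Algorithm 2: the function name `two_rate_update` and the capping bounds `r_min` and `r_max` are hypothetical parameters of this sketch, and the actual bounds used in the experiments are not reproduced here.

```python
import random


def two_rate_update(r, best_used_low_rate, rng, r_min, r_max):
    """One update step of the rate parameter r (mutation rate is r/n).

    best_used_low_rate: True if the best offspring was created with
    rate r/(2n), False if it was created with rate 2r/n.
    r_min, r_max: capping bounds (assumptions of this sketch).
    """
    if rng.random() < 0.5:
        # Follow the rate that produced the best offspring.
        new_r = r / 2 if best_used_low_rate else 2 * r
    else:
        # Otherwise pick one of the two rates uniformly at random.
        new_r = r / 2 if rng.random() < 0.5 else 2 * r
    return min(max(new_r, r_min), r_max)


# Empirical check of the 3/4 probability of following the best offspring.
rng = random.Random(42)
trials = 20_000
followed = sum(
    two_rate_update(8.0, True, rng, 1.0, 64.0) == 4.0 for _ in range(trials)
)
follow_frequency = followed / trials
```

In this example the observed frequency of moving to the rate of the best offspring should be close to 3/4.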
We complement our study by a performance analysis for an extension of the (1+1) EA(A, b) studied in [DW18]. In this extension we adapt the success-based multiplicative update rule used in the (1 + 1) EA(A, b) to the case that more than one offspring is generated in each iteration. We call our extension the (1+λ) EA(A, b). Algorithm 3 provides its pseudo-code. In the original algorithm from [DW18], the mutation rate is updated based on whether or not the offspring replaces its parent. In our case, we have λ offspring, and we refine the update rule as follows. First, we compute the number of "good" offspring, i.e., those which are at least as good as the parent (see line 6 of Algorithm 3). If the share of good offspring is at least 5%, the mutation rate is multiplied by the factor A > 1. It is decreased to bp otherwise, where the factor b is some constant satisfying 0 < b < 1. In this work we set A = 2 and b = 1/2.
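The refined update rule can be sketched as follows. The defaults A = 2, b = 1/2, and the 5% threshold are taken from the text; the function name, the argument layout, and in particular the upper bound `p_max = 0.5` are assumptions of this sketch and are not taken from Algorithm 3.

```python
def update_mutation_rate(p, parent_fitness, offspring_fitness, p_min,
                         p_max=0.5, A=2.0, b=0.5, threshold=0.05):
    """Multiplicative update of the mutation rate p for lambda offspring.

    p is multiplied by A if at least a `threshold` share of the offspring
    is at least as good as the parent, and by b otherwise. The result is
    capped to [p_min, p_max]; p_max = 0.5 is an assumption of this sketch.
    """
    good = sum(1 for f in offspring_fitness if f >= parent_fitness)
    if good / len(offspring_fitness) >= threshold:
        p = A * p
    else:
        p = b * p
    return min(max(p, p_min), p_max)


# Example with lambda = 100 offspring and parent fitness 50.
many_good = [50] * 10 + [49] * 90   # 10% good offspring: rate is doubled
few_good = [50] * 2 + [49] * 98     # 2% good offspring: rate is halved
increased = update_mutation_rate(0.01, 50, many_good, p_min=0.0001)
decreased = update_mutation_rate(0.01, 50, few_good, p_min=0.0001)
capped = update_mutation_rate(0.01, 50, few_good, p_min=0.01)
```

With the lower bound set to the current rate, as in the last call, the decrease has no effect, which is how a larger value of the lower bound keeps the algorithm away from very small mutation rates.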
One may wonder why we use the 5% threshold. This is based on some preliminary experiments with λ = 1,600, where we observed good performances for this value. It is likely that this threshold depends on the population size λ. A detailed study is left for future work.

Algorithms' Performance by Dimension
In a first step, we analyze how each algorithm performs across different problem dimensions. To this end, we compute the average optimization time (i.e., the number of function evaluations needed to find an optimal solution) of each of the algorithms described in Section 2 for 100 independent runs. In Figure 1 we show these values for the algorithm (1 + λ) EA(A, b), for different values of λ. We also show the average parallel optimization time, measured by the number of generations needed to find an optimal solution, for the (1 + λ) EA(A, b) in Figure 2. The parallel optimization time equals the average optimization time divided by the offspring population size λ. This performance measure is useful when the fitness evaluations within one generation are made in parallel.
We note without details that the plots for the 2-rate (1 + λ) EA$_{r/2,2r}$ and the (1 + λ) EA$_{0\to 1}$ look very similar, but with different values. The relative standard deviation is small for all algorithms: it starts at about 10-15% for λ = 5 and decreases to just 0.5-1.5% for λ = 3,200, so we can draw significant conclusions from the average optimization time and the average parallel optimization time.
As expected, we see that for each algorithm and each dimension the average optimization times strictly increase with increasing λ. Therefore, small values of λ are optimal in terms of optimization time. However, in terms of parallel optimization time, the situation is reversed: the algorithms with larger population sizes λ require fewer iterations to find the optimum. Therefore, both small and large values of λ are worth using under different circumstances.

The Impact of the Lower Bound
One of the key advantages of sampling the mutation strengths from the distribution Bin$_{0\to 1}(n, p)$ is that it allows a transition from a classical (1+λ) EA to a (1+λ) variant of RLS that creates each offspring by flipping exactly one uniformly chosen bit. This transition is achieved by reducing p below 1/n; see [CD18a] for a discussion. In the algorithms studied above, we had enforced a lower bound of 1/n for p. We now relax this lower bound and only require that $p \ge 1/n^2$. We study the impact of this relaxed lower bound on the performance of the (1 + λ) EA(A, b) and the 2-rate (1 + λ) EA$_{r/2,2r}$ algorithm. The algorithms with these relaxed lower bounds are denoted by (1 + λ) EA(A, b, $1/n^2$) and 2-rate (1 + λ) EA$_{r/2,2r}(1/n^2)$, respectively. In Figures 3a and 3b we show the average parallel optimization times of the different algorithms for n = 10,000 and n = 100,000, respectively. Recall that the average parallel optimization time equals the average number of generations. We use a logarithmic scale to ease the comparison. The standard deviation is still relatively small for the algorithms using $p_{\min} = 1/n^2$ as well. The only difference is that for the 2-rate (1 + λ) EA$_{r/2,2r}(1/n^2)$ it does not decrease with increasing λ and is always around 5-10%.
The picture for n = 10,000 is quite similar to that for n = 100,000, and, in fact, it is similar for all tested dimensions. To demonstrate this, we plot in Figures 4a and 4b the average parallel optimization times for fixed λ = 10 and λ = 3,200, respectively, which confirm a stable ranking of the different algorithms. We again use the parallel optimization time as performance measure, to ease the comparison with Figures 3a and 3b.
From Figures 3a and 3b we can derive threshold values for λ at which the ranking of the algorithms changes.
We see that the ranking of the algorithms strongly depends on the population size λ. For small values of λ, the 2-rate (1 + λ) EA$_{r/2,2r}(1/n^2)$ performs best, while for medium values the (1 + λ) EA(A, b, $1/n^2$) wins, and for large values of λ the (1 + λ) EA(A, b) is superior. Interestingly, for λ = 50, the (1 + λ) EA$_{0\to 1}$, which uses a static mutation rate, performs on par with the best self-adjusting algorithm.
Generally, for both self-adjusting algorithms, it is preferable to use $1/n^2$ as lower bound for the mutation probability when the population size is small, while for large population sizes a lower bound of $1/n$ seems to work better. For the 2-rate (1 + λ) EA$_{r/2,2r}$, the transition population size is some value $\lambda_3$ satisfying $100 \le \lambda_3 < 200$, while for the (1 + λ) EA(A, b) it is some $\lambda_2$ with $400 \le \lambda_2 < 800$. At the same time, while the performance of the 2-rate (1 + λ) EA$_{r/2,2r}$ and the 2-rate (1 + λ) EA$_{r/2,2r}(1/n^2)$ depends strongly on the population size and the lower bound for the mutation probability, the (1 + λ) EA(A, b, $1/n^2$) is more stable and performs rather well for all considered population sizes. Note that the (1 + λ) EA(A, b, $1/n^2$) is the only self-adjusting algorithm which is never significantly worse than the baseline (1+λ) EA.

Fixed-Budget Performance
In the sections above, we have only regarded the average optimization times. This measure, albeit very commonly used in the runtime analysis community, does not give any information about how the algorithms perform in an anytime sense. This criticism was addressed in [JZ14, CD18b], where the benefits of fixed-budget and fixed-target analyses are advertised, respectively. We note that, of course, both performance measures have played an important role in empirical evaluations ever since; the contribution of the mentioned papers is rather to be seen in discussing why these measures provide important insights for theoreticians. Figures 5, 6a, and 6b show anytime performance plots for selected (1 + 10) EAs, (1 + 400) EAs, and (1 + 1600) EAs, respectively, on the n = 10,000-dimensional OneMax instance.
We see that for λ = 10 (Figure 5) the algorithms demonstrate quite different performance over time, while for larger values of λ (Figures 6a, 6b) the performance of the most efficient algorithms seems to be rather stable.
More precisely, for λ = 10 the situation is as follows. Note that the budget in terms of fitness evaluations is simply λ times larger than the generational budget, since we are dealing with constant offspring population sizes in this paper. For generational budgets up to around 2,590 the 2-rate (1 + λ) EA$_{r/2,2r}$ is the most efficient algorithm; from 2,950 to 5,615 generations the (1 + λ) EA$_{0\to 1}$ is the best one; after that the 2-rate (1 + λ) EA$_{r/2,2r}(1/n^2)$ takes the lead and remains the best performing algorithm for all budgets up to 11,800, at which point it hits the optimum. The runner-up is the (1 + λ) EA(A, b, $1/n^2$), the third is the (1 + λ) EA(A, b), followed by the (1 + λ) EA$_{0\to 1}$, and the worst among the five algorithms is the 2-rate (1+λ) EA$_{r/2,2r}$ (with lower bound $p_{\min} = 1/n$). Therefore, for small population sizes, the ranking of the algorithms depends on the budget.
For λ = 400 and λ = 1,600, we see that for all budgets the two algorithms (1 + λ) EA(A, b, $1/n^2$) and (1 + λ) EA(A, b) show similar performance, and are the best among the five algorithms. For λ = 400 the 2-rate (1 + λ) EA$_{r/2,2r}$ shows fine performance for budgets up to around 1,700, at which point its performance deteriorates (because too many offspring are sampled with a mutation rate that is four times the rate which is optimal in this case). For λ = 1,600 this effect disappears, and the 2-rate (1 + λ) EA$_{r/2,2r}$ shows fine performance. The performance of the 2-rate (1 + λ) EA$_{r/2,2r}(1/n^2)$ is seen to be the worst of all algorithms for budgets greater than 1,400 for λ = 400 and budgets larger than 1,000 for λ = 1,600.

2-Rate vs. 3-Rate Adaptation
We next analyze an idea previously communicated in [DGWY19], a 3-rate success-based variant of the 2-rate (1 + λ) EA$_{r/2,2r}$, which we call the 3-rate (1 + λ) EA$_{r/2,r,2r}$. This algorithm creates one third of the offspring with each of the mutation rates $c_1 r/n$, $r/n$, and $c_2 r/n$, where $0 < c_1 < 1$ and $1 < c_2$ are hyper-parameters set by the user. The detailed description of the 3-rate (1 + λ) EA$_{r/2,r,2r}$ is given in Algorithm 4. In the original work [DGWY19] a similar 3-rate variant has been studied. However, no pseudocode of this variant is given there, so it is hard to compare the update rule used in the original work with the one we use.
Following [DGWY19], we use the standard bit mutation operator for all algorithms in this section. It inverts each bit of a string independently with probability equal to the corresponding mutation rate. So, in contrast to the shift mutation operator used above, it may happen that no bits are flipped in the mutation phase. However, for moderately large λ, the shift mutation operator performs only slightly better than the standard one (this empirical observation is also confirmed by the findings made in [DYvR+18] for the resampling strategy, and can be confirmed theoretically; see the full version of [CD18a] for a similar theoretical justification in the context of the (1 + λ) EA with static mutation rates), so we believe that the key observations made in this section do not depend on the choice of the mutation operator.
As we have seen in the previous sections, the 2-rate (1 + λ) EA$_{r/2,2r}$ performs worse than the (1 + λ) EA(A, b) for large population sizes. To ensure that this is not caused by a poor tuning of the 2-rate (1 + λ) EA$_{r/2,2r}$, we consider the 3-rate (1 + λ) EA$_{c_1 r,r,c_2 r}$ with λ = 1,600 and check 100 different configurations of the $(c_1, c_2)$ hyper-parameters for this version. The tuning procedure and its results are illustrated in Figure 7. First, some random values for $c_1$ and $c_2$ were taken. Then we investigated the areas around values which gave better performance. These results indicated that the best configurations are close to $(c_1 = 0.7, c_2 = 1.4)$. It is also worth mentioning that the performance of the 3-rate (1 + λ) EA$_{c_1 r,r,c_2 r}$ obtained with the different hyper-parameters differs by up to around 16%. The relative standard deviation of the tested algorithms is about 1%.
Note that we also plot in Figure 7 the line for which $c_1 = 1/c_2$, because such a functional dependence was used in [DGWY19]. According to the results of our procedure, the corresponding line does not fit the whole area of good configurations, although it does seem to cross some of the best configurations.

Conclusions
We have seen in this work that the ranking of different self-adjusting (1+λ) EA variants strongly depends on the population size. Both small population sizes and large ones are of interest, as the former provide the best performance in terms of the number of fitness evaluations, while the latter are more efficient in terms of the number of generations, which is relevant when fitness evaluations are performed in parallel.
Based on our results, the following conclusions for the OneMax benchmark problem may be drawn: for small offspring population sizes $5 \le \lambda \le \lambda_1$, with $\lambda_1$ satisfying $50 \le \lambda_1 < 100$, the 2-rate (1+λ) EA$_{r/2,2r}(1/n^2)$ is the most efficient among the tested algorithms, while for medium population sizes $\lambda_1 \le \lambda \le \lambda_2$ with $400 \le \lambda_2 < 800$ the (1 + λ) EA(A, b, $1/n^2$) performs best, and for large offspring population sizes $\lambda_2 \le \lambda \le 3{,}200$ the (1 + λ) EA(A, b) with $p_{\min} = 1/n$ as the lower bound for the mutation probability should be used. Furthermore, if the evaluation budget is limited and the population size is small, one should pick the algorithm very carefully, as the ranking strongly depends on the budget in this case. We also confirmed that although a careful selection of the hyper-parameters can improve the performance of the 2-rate (1 + λ) EA$_{r/2,2r}$ and the 3-rate (1 + λ) EA$_{r/2,r,2r}$, the default parameters seem to be quite suitably chosen, so that the overall gain of such hyper-parameter tuning seems to be moderate at best. We plan on extending our studies by investigating whether similar conclusions can be drawn for the (1 + λ) EA(A, b) with different values of A and b. The results in [DW18] suggest a rather flat dependence of performance on these hyper-parameters for the (1 + 1) EA, but note that we have considered in our work much larger dimensions and offspring population sizes, so that it is a priori not clear whether a similar conclusion holds. We recall from [Len18] that small changes in the parameter setting of an EA can result in drastic performance differences.
As mentioned in the main part, the consistent ranking of the algorithms makes us feel confident about an extension of our work to a rigorous theoretical running time analysis. Such results would offer more detailed insights into the working principles underlying the diverse rankings, and may help to identify the cut-off points where performances of two algorithms intersect.
In another important line of research we plan on investigating whether our conclusions for the OneMax problem can be extended to other, more challenging benchmark problems. A promising direction of such extensions is offered by the W-model [WW18], which allows one to calibrate different features of the optimization problem, such as its ruggedness, the fraction of effective and "dummy" variables, the epistasis, neutrality, separability, etc.
Table: … (1 + λ) EA$_{r/2,2r}$, (1 + λ) EA(A, b), 2-rate (1 + λ) EA$_{r/2,2r}(1/n^2)$, (1 + λ) EA(A, b, $1/n^2$). The shift mutation operator is used in all algorithms; avg. = average, r.dev. = relative standard deviation.