Side-Channel Attacks on BLISS Lattice-Based Signatures

In this paper, we investigate the security of the BLISS lattice-based signature scheme, one of the most promising candidates for post-quantum-secure signatures, against side-channel attacks. Several works have been devoted to its efficient implementation on various platforms, from desktop CPUs to microcontrollers and FPGAs, and more recent papers have also considered its security against certain types of physical attacks, notably fault injection and cache attacks. We turn to more traditional side-channel analysis, and describe several attacks that can yield a full key recovery. We first identify a serious source of leakage in the rejection sampling algorithm used during signature generation. Existing implementations of that rejection sampling step, which is essential for security, actually leak the "relative norm" of the secret key. We show how an extension of an algorithm due to Howgrave-Graham and Szydlo can be used to recover the key from that relative norm, at least when the absolute norm is easy to factor (which happens for a significant fraction of secret keys). We describe how this leakage can be exploited in practice both on an embedded device (an 8-bit AVR microcontroller) using electromagnetic analysis (EMA), and on a desktop computer (a recent Intel CPU running Linux) using branch tracing. The latter attack has been mounted against the open source VPN software strongSwan. We also show that other parts of the BLISS signing algorithm can leak secrets not just for a subset of secret keys, but for 100% of them. The BLISS Gaussian sampling algorithm in strongSwan is intrinsically variable-time. This would be hard to exploit using a noisy source of leakage like EMA, but branch tracing allows us to recover the entire randomness and hence the key: we show that a single execution of the strongSwan signature algorithm is actually sufficient for full key recovery.
We also describe a more traditional side-channel attack on the sparse polynomial multiplications carried out in BLISS: classically, such multiplications can be attacked using DPA; however, our 8-bit AVR target implementation uses repeated shifted additions instead. Surprisingly, we manage to obtain a full key recovery in that setting using integer linear programming from a single EMA trace.


INTRODUCTION
As possibly the most promising candidate to replace classical RSA and ECC-based cryptography in the post-quantum setting, lattice-based cryptography has been the subject of increasing interest in recent years from an implementation standpoint, including on constrained and embedded devices. In particular, in the last five years or so, numerous papers have been devoted to the implementation of lattice-based signature schemes on various such platforms, such as FPGAs and microcontrollers [11,28,33,34,45,46]. Concomitantly, industry-backed open-source libraries implementing lattice-based schemes have been developed, such as Microsoft's Lattice Cryptography Library [41], Google Chrome Canary's TLS 1.2 [8] or even OpenSSL 1.0.2g [47], implementing Peikert's R-LWE key exchange. This has provided a better understanding of how practical these schemes are at concrete security levels.
More recently, researchers have started investigating the security of these implementations against physical attacks. In particular, Bruinderink et al. [26] have demonstrated a cache attack against BLISS at CHES 2016, and two papers by Bindel et al. [6] and Espitau et al. [20], at FDTC 2016 and SAC 2016 respectively, have presented fault attacks against BLISS and several other lattice-based signature schemes. Those attacks mainly rely on the idea that lattice signatures contain some "noise", and learning partial information about that noise (either through cache side-channels or because fault injection allows an attacker to fix some of it to a known value) makes it possible to reduce the dimension of the underlying lattice problems, and hence the security of the schemes, often allowing recovery of the secret key.
Lattice-based signatures and BLISS. In the early days of lattice-based cryptography, several signature schemes with heuristic security were proposed, most notably GGH [25] and NTRUSign [31], but despite several attempts to patch them, they turned out to be insecure: it was found that the distribution of generated signatures leaks statistical information about the secret key, which can be exploited to break these schemes and their variants [22,24,42]. The most common approach to obtain efficient, provably secure lattice-based signatures in the random oracle model is the "Fiat-Shamir with aborts" paradigm introduced by Lyubashevsky [38] (it coexists with the GPV hash-and-sign paradigm relying on lattice trapdoors [23], which has some theoretical benefits compared to Fiat-Shamir, but tends to result in less efficient implementations [16]). Lyubashevsky's approach is an extension of the usual Fiat-Shamir transformation which uses rejection sampling to make sure that generated signatures have a distribution independent of the secret key, and thus avoids the statistical pitfalls of schemes like NTRUSign. More precisely, the underlying identification protocol achieves its honest-verifier zero-knowledge property by aborting some of the time, and signatures are produced by re-running that protocol with random challenges until it succeeds.
Several instantiations of this paradigm have been proposed [4,28,32,39], targeting various output distributions for signatures, but the most popular among them is certainly the BLISS signature scheme proposed by Ducas et al. [14]. It is possibly the most efficient lattice-based signature scheme so far, boasting performance comparable to common implementations of RSA and ECC-based signatures, such as the one in OpenSSL. Signature and public-key sizes are a few times larger than RSA (and about one order of magnitude bigger than ECC); signature generation is comparable to ECC and beats RSA by an order of magnitude; and signature verification is similar to RSA and faster than ECC by an order of magnitude.
This efficiency is achieved in particular through the use of Gaussian noise, and a target distribution for signatures that has a bimodal Gaussian shape. This makes the rejection sampling step for BLISS somewhat tricky to implement, particularly on platforms where evaluating transcendental functions to a high precision is impractical. However, the authors of [14] proposed an efficient technique to carry out this rejection sampling based on iterated Bernoulli trials. This technique is used, in particular, in the embedded implementations of BLISS described in [33,46].

Our contributions. Our goal is to look at the security of BLISS against side-channel analysis. Most of the attacks we describe apply in particular to the original proof-of-concept implementation of Ducas et al. [15], but we specifically target two implementations of a less academic nature: the 8-bit AVR microcontroller implementation of Pöppelmann et al. [46], as well as the production-grade implementation included in the open source VPN software strongSwan [51].
The first source of side-channel leakage that we consider is the clever algorithm proposed in the original BLISS paper [14] to perform the rejection sampling, which intervenes in a crucial way in those embedded implementations. To achieve the correct output distribution, the signature generation algorithm has to be restarted, except with probability

1 / ( M · exp(−∥Sc∥²/(2σ²)) · cosh(⟨z, Sc⟩/σ²) ),

where (z, c) is the signature generated so far, S the secret key, σ the Gaussian standard deviation and M a scaling factor ensuring that this probability is always at most 1.
It turns out that the clever algorithm for rejection sampling, based on iterated Bernoulli trials, traverses the bits of the two values ⟨z, Sc⟩ and K − ∥Sc∥² (where K is defined such that M = exp(K/(2σ²))) in much the same way as a square-and-multiply algorithm traverses the bits of its exponent: one can basically read those bits off a power or electromagnetic trace! This makes it possible to mount a simple power analysis (SPA) or simple electromagnetic analysis (SEMA) attack on the rejection sampling using either of these values. Similarly, on a desktop platform using a recent Intel CPU, one can read out that sequence of bits from the list of branching instructions executed within the corresponding function, which is recorded in the CPU branch trace store. On Linux, this is accessible using perf_events, which are often available to all userland processes with the same user ID as the program running the BLISS computation (in our case, strongSwan).
The attack using the scalar product is conceptually quite simple: given the value ⟨z, Sc⟩ for many known signatures (z, c), one can recover the secret key S using basic linear algebra.However, in real BLISS signatures, the component z is not output in full, but in a shorter, compressed form which loses part of the information.This makes the attack inapplicable in practice.
On the other hand, in real BLISS implementations, one is actually able to retrieve the value ∥Sc∥² using SPA/SEMA, and collecting sufficiently many such values allows us to compute the relative norms s1·s̄1 and s2·s̄2 of the two cyclotomic integers s1, s2 forming the secret key S. Recovering the secret key from those relative norms is a problem analogous to the one addressed in a 2004 paper of Howgrave-Graham and Szydlo [35], except for the fact that the cyclotomic field Q(ζm) of interest in our case has a conductor m equal to a power of two, instead of an odd prime as in the original paper. We are able to extend the Howgrave-Graham-Szydlo algorithm to this power-of-two conductor case, and use it to complete the key recovery attack.
There is a technical hurdle to overcome, however. Like the original Howgrave-Graham-Szydlo algorithm, our method is only efficient when one knows the factorization of the absolute norm of the cyclotomic integer of interest. But for BLISS parameters, this absolute norm is between 1000 and 2000 bits long, so it is typically not feasible to factor in general. However, for a significant fraction of all keys, the absolute norm is prime, or at least the product of a large prime with very small prime powers that can be factored out using trial division; and for those "weak" keys (forming over 6% of all keys for typical BLISS parameters) our generalized Howgrave-Graham-Szydlo algorithm runs in fully polynomial time.
In addition, we also consider two other sources of side-channel leakage: the Gaussian sampling algorithm used to generate the random masks used in BLISS signatures on the one hand, and the polynomial multiplication s1 · c between the secret key and the public, variable hash value c on the other hand.
Regarding the Gaussian sampling, the original BLISS paper proposes several techniques to carry it out, and the strongSwan implementation chooses one which intrinsically runs in variable time. Like the rejection sampling, it is based on repeated Bernoulli trials. It consists of an a priori unbounded number of iterations, with a complex collection of functions calling one another hundreds of times, and it is carried out to generate each of the 512 coefficients of the random Gaussian polynomial y1. The power or EM trace of the execution of this algorithm on an embedded device would likely look like gibberish; however, branch tracing allows us to take full advantage of it: since branch tracing records the full list of branching instructions carried out during this computation, it can be used to reconstruct y1 entirely. That polynomial, together with the signature elements c and z1 = y1 + (−1)^b s1 · c, is enough to recover the entire secret key (up to sign, which is enough): branch tracing of a single execution of the signature generation algorithm is sufficient for a complete break of all keys!

As for the computation of the product s1 · c, it is relatively classical that such a multiplication, when implemented in a naive way, can be attacked using differential power analysis (DPA), correlation power analysis (CPA) and related techniques. The implementation of the multiplication in our 8-bit AVR target [46], however, takes advantage of the special form of c (which is a very sparse polynomial with coefficients in {0, 1}). The product is simply computed as a sum of signed shifts of the secret key. As a result, a somewhat different attack approach is required. Surprisingly, using integer linear programming techniques, we describe a method for recovering the secret key using a single power or EM trace of this multiplication algorithm. This not only breaks the unprotected implementation described in [46], but also defeats various blinding countermeasures that could be designed to protect it, such as the one recently proposed by Saarinen [49].
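The key-recovery step from a leaked y1 can be illustrated concretely: multiplication by c in Z[x]/(x^n + 1) is a linear map, so given z1 = y1 + (−1)^b s1 · c with y1 known, s1 can be read off by solving a small linear system. The sketch below is a toy model in a reduced dimension; the polynomial values are illustrative, not BLISS parameters.

```python
import numpy as np

def negacyclic_matrix(c):
    """Matrix of multiplication by c(x) in Z[x]/(x^n + 1)."""
    n = len(c)
    M = np.zeros((n, n), dtype=np.int64)
    for j in range(n):
        for i in range(n):
            k = i + j
            M[k % n, j] += c[i] * (1 if k < n else -1)  # x^n = -1 flips the sign
    return M

def recover_s1(z1, y1, c):
    """Solve (z1 - y1) = ±s1*c for s1 (the bit b only flips the global sign)."""
    t = np.asarray(z1, dtype=float) - np.asarray(y1, dtype=float)
    s = np.linalg.solve(negacyclic_matrix(c).astype(float), t)
    return np.rint(s).astype(np.int64)
```

Over the rationals the system is solvable whenever c is invertible modulo x^n + 1, which holds for any nonzero challenge when n is a power of two (x^n + 1 is then irreducible over Q).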
Finally, we conclude the paper by discussing limitations of our attacks and possible countermeasures. These attacks illustrate a major problem cryptography engineers often face, namely the difficulty of quantifying the threat posed by a given leakage. A security-aware programmer would surely have noticed that the norm was easily readable due to the non-constant execution time, but may also have concluded that it was not an exploitable leakage. Our opinion is that, as far as side channels are concerned, simpler is often better. In that spirit, we consider the possible merits of using a lattice-based signature scheme with a simpler mathematical structure, such as the "ancestor" of BLISS due to Güneysu, Lyubashevsky and Pöppelmann [28]. Our analysis suggests that such a scheme may be preferable to BLISS when physical attacks are a concern, not only because its simpler structure thwarts some of the attacks described in this paper, but also because side-channel countermeasures for it seem easier to design and implement (keeping in mind that an unprotected implementation is unlikely to achieve good levels of side-channel security in any case).

DESCRIPTION OF THE BLISS SCHEME
Notation. For any integer q, the ring Zq is represented by the integers in [−q/2, q/2) ∩ Z. Vectors are considered as column vectors and will be written in bold lower case letters, and matrices with upper case letters. By default, we will use the L2 Euclidean norm.

Description of BLISS. The BLISS signature scheme [14] is possibly the most efficient lattice-based signature scheme so far. It has been implemented in both software [15] and hardware [45], and boasts performance numbers comparable to classical factoring and discrete-logarithm based schemes. The signature generation algorithm Sign is given in Figure 1; in particular, its rejection sampling step restarts signing except with probability 1/(M exp(−∥Sc∥²/(2σ²)) cosh(⟨z, Sc⟩/σ²)). Figure 2 presents the sampling algorithms SampleBernExp and SampleBernCosh for the distributions B_exp(−x/f) and B_1/cosh(x/f); there, the values c_i = exp(−2^i/f) are precomputed, and the x_i's are the bits in the binary expansion of the input x.

BLISS can be seen as a ring-based optimization of the earlier lattice-based scheme of Lyubashevsky [39], sharing the same "Fiat-Shamir with aborts" structure [38]. One can give a simplified description of the scheme as follows: the public key is an NTRU-like ratio of the form a_q = s2/s1 mod q, where the signing key polynomials s1, s2 ∈ R = Z[X]/(X^n + 1) are small and sparse. To sign a message µ, one first generates commitment values y1, y2 ∈ R with normally distributed coefficients, and then computes a hash c of the message µ together with u = −a_q·y1 + y2 mod q. The signature is then the triple (c, z1, z2), with z_i = y_i + s_i·c, and rejection sampling is applied to ensure that the distribution of the z_i is independent of the secret key. Verification is possible because u = −a_q·z1 + z2 mod q.
The real BLISS scheme, described in Figure 1, includes several optimizations on top of the above description. In particular, to improve the repetition rate, it targets a bimodal Gaussian distribution for the z_i's, so there is a random sign flip in their definition. In addition, to reduce key size, the signature element z2 is actually transmitted in compressed form z2†, and accordingly the hash input includes only a compressed version of u. See also the full version of this paper [21] for a description of key generation and verification.
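The simplified scheme above can be checked end to end in a toy model. The sketch below verifies the identity u = −a_q z1 + z2 mod q in a reduced dimension; it deliberately ignores Gaussian sampling, rejection, sign flips and compression, and its parameter values are illustrative, not BLISS parameters.

```python
import numpy as np

def polymul(a, b, n, q):
    """Product of a and b in Z_q[x]/(x^n + 1)."""
    res = np.zeros(2 * n, dtype=np.int64)
    for i in range(n):
        for j in range(n):
            res[i + j] += a[i] * b[j]
    return (res[:n] - res[n:]) % q          # reduce using x^n = -1

rng = np.random.default_rng(0)
n, q = 8, 12289
s1 = rng.integers(-1, 2, size=n)            # small secret polynomial
a_q = rng.integers(0, q, size=n)            # toy public key, drawn at random
s2 = polymul(a_q, s1, n, q)                 # so that a_q = s2/s1 holds mod q

y1, y2 = rng.integers(0, q, size=n), rng.integers(0, q, size=n)
c = rng.integers(0, 2, size=n)              # sparse 0/1 challenge
u = (-polymul(a_q, y1, n, q) + y2) % q      # hashed together with the message
z1 = (y1 + polymul(s1, c, n, q)) % q        # z_i = y_i + s_i * c
z2 = (y2 + polymul(s2, c, n, q)) % q
```

The verification identity then holds by construction, since −a_q z1 + z2 = u + c(s2 − a_q s1) ≡ u mod q.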
Implementation of the BLISS rejection sampling. It is essential for the security of the scheme that the distribution of signatures is statistically independent of the secret signing key. This is achieved using the rejection sampling step 8 of algorithm Sign, as described in Figure 1.
To implement this rejection sampling in practice, one needs to be able to efficiently sample from Bernoulli distributions of the form B_exp(−x/f) and B_1/cosh(x/f) for some fixed constant f and variable integers x (where B_p denotes the Bernoulli distribution of parameter p, which outputs 1 with probability p and 0 otherwise).
This can in principle be done by computing the rejection probability every time with sufficient precision and comparing it to uniformly sampled randomness in a suitable interval, but such an approach is quite costly, especially on constrained devices, as it relies on the evaluation of transcendental functions to arbitrary precision. Therefore, BLISS relies on an alternative approach, which is described in [14, §6] and can be implemented based on sampling Bernoulli distributions B_{c_i} for a few precomputed constants c_i.
The idea is as follows. To sample from B_exp(−x/f), one can consider the binary expansion x = Σ_i x_i·2^i of x, and let c_i = exp(−2^i/f). Then one has exp(−x/f) = ∏_{i : x_i = 1} c_i. As a result, sampling from B_exp(−x/f) can be done by sampling from each of the B_{c_i} with x_i = 1; if all the resulting samples are 1, return 1, and 0 otherwise. This can even be done in a lazy manner, as described in algorithm SampleBernExp in Figure 2.
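The bitwise sampler can be sketched as follows; the early-abort, bit-skipping structure is exactly what creates the bit-by-bit leakage exploited later in the paper (a Python sketch; the loop length ell is an illustrative bound on the bit length of x).

```python
import math, random

def sample_bern(p):
    """One Bernoulli trial B_p."""
    return 1 if random.random() < p else 0

def sample_bern_exp(x, f, ell=16):
    """Sample from B_{exp(-x/f)}: one trial B_{c_i} with c_i = exp(-2^i/f)
    per set bit of x; zero bits are skipped, and the loop aborts at the
    first failed trial -- both behaviours are visible on a trace."""
    for i in range(ell):
        if (x >> i) & 1:
            if sample_bern(math.exp(-(1 << i) / f)) == 0:
                return 0
    return 1
```

When the function returns 1, every set bit of x triggered one Bernoulli trial, so the whole bit pattern of x was traversed.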
In addition, one can show that sampling from B_1/cosh(x/f) can be done by repeated sampling from B_exp(−x/f) and B_1/2, as described in algorithm SampleBernCosh in Figure 2 (the correctness of that method is proved as [14, Lemma 6.3]). The algorithm has an a priori unbounded number of iterations, but the expected number of calls to B_exp(−x/f) is less than 3.
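The cosh sampler can be sketched likewise. For brevity, this sketch draws each B_{exp(−x/f)} trial directly from math.exp rather than via the bitwise SampleBernExp of Figure 2, which does not change the output distribution; the restart logic is the point being illustrated.

```python
import math, random

def sample_bern_cosh(x, f):
    """Sample from B_{1/cosh(x/f)} by repeated B_{exp(-x/f)} and B_{1/2}
    trials (cf. [14, Lemma 6.3]); the loop is a priori unbounded, which
    makes the execution time intrinsically variable."""
    p = math.exp(-abs(x) / f)            # parameter of the B_{exp} trials
    while True:
        if random.random() < p:          # B_{exp(-x/f)} succeeded: accept
            return 1
        if random.random() < 0.5:        # B_{1/2} returned 1: restart
            continue
        if random.random() < p:          # second B_{exp(-x/f)} = 1: restart
            continue
        return 0                         # otherwise reject
```

A short calculation confirms correctness: per round, the acceptance probability is p and the rejection probability (1 − p)²/2, so overall P(1) = 2p/(1 + p²) = 1/cosh(x/f).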
Concretely, the BLISS rejection sampling is thus implemented as follows. The denominator f in SampleBernExp and SampleBernCosh is set to 2σ², and the scaling factor M for the rejection sampling is taken of the form exp(K/f) for some integer K. Then, the rejection step is carried out as:

1: x ← K − ∥Sc∥²
2: Sample a ← SampleBernExp(x)
3: if a = 0 then restart the signing algorithm
4: x ← 2 · ⟨z, Sc⟩
5: if x < 0 then x ← −x
6: Sample a ← SampleBernCosh(x)
7: if a = 0 then restart the signing algorithm

ATTACKS ON THE REJECTION SAMPLING
As discussed in §2, the original BLISS paper describes an efficient technique to carry out the rejection sampling during signature generation [14, §6], based on iterated Bernoulli trials. This technique is used in particular in embedded implementations such as [46].
From the description of algorithm SampleBernExp in Figure 2, one can easily infer that its input x will leak in full on a power or EMA trace whenever the entire for loop it contains executes in full (which is always the case when the algorithm returns 1). This leakage is validated in the practical experiments of §3.3. Now, the signature generation algorithm Sign of BLISS only passes the rejection sampling step if the calls to both SampleBernExp and SampleBernCosh return 1. This means that a side-channel attacker can recover the entire inputs to both of these functions, which are K − ∥Sc∥² and 2 · ⟨z, Sc⟩ respectively. Thus, one can obtain the squared norm ∥Sc∥² on the one hand, and the scalar product ⟨z, Sc⟩ up to sign on the other hand. Moreover, since the computation of the absolute value also leaks the sign of that value on a power/EMA trace due to the conditional branch, one actually obtains the full value ⟨z, Sc⟩ in practice. In this section, we describe how these leaked values can be exploited for key recovery.

Exploiting the scalar product leakage
We first describe how to exploit the leakage of the scalar product values ⟨z, Sc⟩ = ⟨z1, s1c⟩ + ⟨z2, s2c⟩, assuming that the adversary can somehow recover the whole uncompressed signature (c, z1, z2). In that case, the attack is essentially straightforward: indeed, each such leaked value is the evaluation of a linear form with known coefficients on the vector S = (s1, s2) seen as an element of Q^2n. And it is clear that those linear forms generate the entire dual vector space of Q^2n as (c, z1, z2) vary. This means that if we collect the leakage values associated with sufficiently many signatures (slightly more than 2n is enough in practice), we obtain a full-rank linear system in the coefficients of S, and hence we can recover the entire secret key by simply solving that linear system. This attack does not apply to real implementations of BLISS, however, due to signature compression: real BLISS signatures do not contain the entire element z2 but only a compressed version z2† that depends only on the higher-order bits of the former. As a result, the previous approach fails. We cannot even use it to reduce the dimension of the underlying lattice problem, because a leaked scalar product reveals less information about the secret key than the number of bits of z2 left unknown given the signature.
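The uncompressed-signature attack is simple enough to simulate in a toy dimension. Each leaked ⟨z, Sc⟩ equals (Cᵀz1, Cᵀz2) · S, where C is the skew-circulant matrix of c, so slightly more than 2n leaked values give a solvable linear system. The sketch below uses illustrative toy parameters, not BLISS parameters.

```python
import numpy as np

def negacyclic(c):
    """Skew-circulant (negacyclic) multiplication matrix of c."""
    n = len(c)
    M = np.zeros((n, n))
    for j in range(n):
        for i in range(n):
            k = i + j
            M[k % n, j] += c[i] * (1 if k < n else -1)
    return M

rng = np.random.default_rng(2)
n = 16
S = rng.integers(-1, 2, size=2 * n)              # toy secret (s1, s2)
rows, vals = [], []
for _ in range(2 * n + 8):                       # slightly more than 2n leaks
    C = negacyclic(rng.integers(0, 2, size=n))   # sparse 0/1 challenge c
    z = rng.integers(-50, 51, size=2 * n)        # stand-in for (z1, z2)
    row = np.concatenate([C.T @ z[:n], C.T @ z[n:]])
    rows.append(row)
    vals.append(row @ S)                         # the leaked <z, Sc>
S_rec, *_ = np.linalg.lstsq(np.array(rows), np.array(vals), rcond=None)
```

Rounding the least-squares solution recovers the secret exactly once the collected rows span the full 2n-dimensional space.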
Thus, a more sophisticated approach is necessary, based on the leaked Euclidean norms ∥Sc∥².

Exploiting the norm leakage
Suppose we have access via SPA to the bits of ∥Sc∥² in the final computation of the rejection sampling. Recalling that Sc = (s1c, s2c)^T, we have ∥Sc∥² = ⟨s1c, s1c⟩ + ⟨s2c, s2c⟩, and thus this norm can be written as

∥Sc∥² = c^T·S1^T S1·c + c^T·S2^T S2·c = (c, c)·Σ^T Σ·(c, c)^T,

for c the vector encoding of the polynomial c, S1 (resp. S2) the skew-circulant matrix encoding the polynomial s1 (resp. s2), and Σ the block diagonal matrix Diag(S1, S2). Let X be the matrix Σ^T·Σ. Then, recovering the value ∥Sc∥² yields an equation of the shape:

(c, c)·X·(c, c)^T = ∥Sc∥².

This equation can be viewed as a row of a linear system whose unknowns are the coefficients of the secret-dependent matrix X.
Now X is a block diagonal matrix Diag(X^(1), X^(2)), where each X^(i) = S_i^T S_i is, as a product of two conjugate skew-circulant matrices, a circulant matrix determined by its first row. Thus, only 2 × m/2 = m distinct unknowns are actually present in X. As a consequence, we only need m linearly independent equations to fully recover the matrix X. Once it is recovered, we get access to the submatrices S1^T S1 and S2^T S2. By definition, these matrices correspond to the encodings of the polynomials s1·s̄1 and s2·s̄2, that is, the relative norms of the secrets s1 and s2 in the totally real subfield. Using techniques from algebraic number theory, we can then retrieve both parts of the secret up to multiplication by a root of unity. Precisely, this is performed using our generalization of the Howgrave-Graham-Szydlo algorithm [35], presented in Appendix A. In fact, once one of the values s1·u1 or s2·u2 (with u1, u2 roots of unity) is recovered, we can use our knowledge of the public key s2/s1 mod q to recover candidates for the other part of the secret (again up to a root of unity). Once candidate secrets are determined, we can discriminate valid keys among them by checking their sparsity and polynomial height, so as to satisfy the conditions imposed by the key generation procedure. The whole attack is described in Figure 4.
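The linear-algebra phase (collecting leaked norms and solving for the secret-dependent quadratic form) can be simulated in a toy dimension. For simplicity, the sketch below solves for the combined symmetric Gram matrix S1^T S1 + S2^T S2, which is the quantity directly determined by the equations c^T X c = ∥Sc∥²; the full attack additionally separates the two circulant blocks, as described above, and uses only m unknowns. Parameters are illustrative.

```python
import numpy as np

def negacyclic(s):
    """Skew-circulant multiplication matrix of s in Z[x]/(x^n + 1)."""
    n = len(s)
    M = np.zeros((n, n))
    for j in range(n):
        for i in range(n):
            k = i + j
            M[k % n, j] += s[i] * (1 if k < n else -1)
    return M

rng = np.random.default_rng(3)
n = 8
S1 = negacyclic(rng.integers(-1, 2, size=n))
S2 = negacyclic(rng.integers(-1, 2, size=n))
X = S1.T @ S1 + S2.T @ S2                 # secret-dependent Gram matrix

idx = [(i, j) for i in range(n) for j in range(i, n)]   # symmetric unknowns
rows, vals = [], []
for _ in range(3 * len(idx)):             # comfortably many leaked norms
    c = rng.integers(0, 2, size=n).astype(float)
    # each leaked ||Sc||^2 = c^T X c is linear in the entries of X
    rows.append([c[i] * c[j] * (1 if i == j else 2) for i, j in idx])
    vals.append(c @ X @ c)
sol, *_ = np.linalg.lstsq(np.array(rows), np.array(vals), rcond=None)
X_rec = np.zeros((n, n))
for k, (i, j) in enumerate(idx):
    X_rec[i, j] = X_rec[j, i] = sol[k]
```

Once enough independent challenges have been collected, the system has a unique solution and rounding recovers X exactly.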
The most costly part of the attack is the generalized Howgrave-Graham-Szydlo algorithm from Appendix A, and in particular the step of norm factorization over the integers. In order to estimate the cost of this factorization step, one can bound the algebraic norm of the secret element: a classical computation on resultants (see Appendix A.5 for a detailed argument) shows that this norm is bounded, somewhat crudely, by a quantity exponential in n. Table 1 compiles the theoretical bound and the average practical results for the various proposed security parameters. We can see that these integers are typically too large to be factored in practice.

The attack itself, described in Figure 4, proceeds as follows:
1: Collect traces (c^(k), ∥Sc^(k)∥²)_k until the matrix C of the corresponding linear system is full-rank.
2: Solve the linear system C·X = (∥Sc^(1)∥², ..., ∥Sc^(k)∥²)^T.
3: Call Algorithm 12 on either s1·s̄1 or s2·s̄2 to recover s1 and s2 up to a root of unity.

Since the success of the attack depends on the ability to factor the norm, we are only able to attack the fraction of the space of private keys for which this factorization is easy. A particular class of such keys is the set of keys whose norm is a B-semi-smooth integer, that is, a composite number p·b where p is prime and b is B-smooth, for a non-negative integer B. As already remarked, the recovery of either s1 or s2 is sufficient to recover the full secret. Hence the above-described attack becomes tractable as soon as one of the norms N(s1), N(s2) is semi-smooth. This means that the probability of getting a weak key is roughly twice the probability that a given constituent of the private key has a semi-smooth norm. Practical estimates of the fraction of keys with semi-smooth norms are presented in Table 2.
Note that the entire attack is actually known to run on average in polynomial time, except, classically, for the factorization of the norm. Amusingly, this means that the attack becomes fully polynomial on a quantum computer: an interesting feature for an attack targeting a post-quantum scheme!
The entire attack was implemented in PARI/GP, including the generalized Howgrave-Graham-Szydlo algorithm and the Gentry-Szydlo algorithm. To the best of our knowledge, this was the first full implementation of the latter algorithm. It allows us to tackle the problem of solving norm equations in dimension up to 512. Experiments were conducted with this implementation to obtain the running times of the attack, presented in Table 3, on a single core of a Xeon E5-2697 2.6 GHz CPU.

SEMA experiments against the microcontroller
Experimental setup. Experiments were conducted on the same target as the one used for development in [46], that is, an XMEGA-A1 Xplained board with an ATxmega128A1 microcontroller, running at a frequency of approximately 20 MHz. Traces were obtained by measuring the electromagnetic radiations using a Langer EM H-field probe (30 MHz-3 GHz) and a MITEQ amplifier (up to 500 MHz, with gain 63 dB). Acquisition was performed using a LeCroy WaveMaster 8300A oscilloscope. For the norm recovery, we used a sampling frequency of 5 MHz to obtain a clean picture, but good results may also be obtained with smaller sampling rates. The most relevant position for the probe (i.e. the one providing the clearest patterns) was over the ground capacitance rather than over the chip itself (which is not that surprising, since the chip was packaged).
Table 2: Estimation of the proportion of weak BLISS secret keys, namely those for which at least one of s1 or s2 has B-semi-smooth norm, for the security parameters of [14] and several choices of B. Estimates were obtained by sampling 2000 secret keys per parameter set and testing their norms for semi-smoothness by trial division.

Experimental results. Section 3.2 shows that from a leaked set of norms ∥Sc∥², we are able to fully recover the secret value S. We show here how such norms can easily be recovered in the implementation proposed in [46] (we recall that this implementation was never claimed to be secure, since it is optimized for performance).
We can see in Figure 5 an extract of the code performing the rejection sampling procedure. It corresponds to the probability 1/(M exp(−∥Sc∥²/(2σ²))). The main point to notice here is the and-condition in the samplerBerExp function. Indeed, it looks like the call to samplerBer will only be executed if the least significant bit of the intermediate variable x is one. And indeed, after compiling and scrutinizing the ASM code, it turns out that this is precisely what the compiler did. This is an obvious SPA leakage source, which can easily be exploited, as we now show.
In Figure 6, we plot the EM radiations corresponding to a norm ∥Sc∥² = 14404. The Bernoulli exponential sampler is thus called with argument 46539 − 14404 = 32135 = 0x7D87, since the paraM value is 46539 in the (standard) parameter set used. The loop on the norm proceeds from the least to the most significant bit, so we expect the series of bits {1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1} to be visible on the trace in Figure 6, which is indeed the case.
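The expected bit pattern can be checked directly from the numbers above:

```python
x = 46539 - 14404        # paraM minus the leaked norm ||Sc||^2
assert x == 32135 == 0x7D87
# bits as traversed by the sampler, least significant first
bits = [(x >> i) & 1 for i in range(x.bit_length())]
# bits == [1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1]
```

This matches the series of peaks annotated on the trace of Figure 6.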

Branch tracing attack against strongSwan
Experimental setup. As discussed in the introduction, our other target is the open source VPN software strongSwan [51], which includes a production-grade implementation of BLISS for use in the IKEv2 authentication protocol, as well as for the post-quantum-secure signing of X.509 certificates. The architecture of the BLISS implementation in strongSwan is described in [50].
We attack the latest stable version of strongSwan as of this writing (5.5.2) when compiled and run on a Linux x86-64 desktop platform. We ran our experiments on an Intel Core i7-3770 CPU from the Ivy Bridge family, but they should apply to all Intel CPUs from the past ten years at least, with very recent architectures like Skylake providing even greater control. We used Linux kernel version 4.8.0, but the relevant interfaces have been available since late iterations of the 2.6.x branch.
Using the CPU branch trace store.Our attack vector is branch tracing.Modern CPUs like the ones mentioned above provide extensive, low-overhead instrumentation for performance profiling, debugging and other applications.On Intel CPUs, this includes the Precise Event-Based Sampling (PEBS) interface, which provides accurate counters for numerous events related to program execution (CPU cycles, context switches, page faults, branch misses, etc.), as well as the Branch Trace Store (BTS), which records various information (including origin and destination addresses) of all branching instructions actually taken during the execution of a process or process family in a special area of memory.When PEBS performance counters overflow, or when the BTS memory area fills up, the CPU generates interrupts that can be captured by the operating system to record complete information separately.
Access to those hardware functionalities is normally privileged, but can be exposed to userland processes by the operating system. This is in particular the case under Linux, through the perf_events interface and the perf command. Branch tracing of a process (belonging to the same user), in particular, can be carried out using the following command:

perf record -e branches:u -c 1 -d -p ⟨pid⟩

which produces a large binary file (perf.data) containing information about all branching instructions executed in the process with PID ⟨pid⟩ between user space addresses, in the correct order (even if the CPU may execute out-of-order or mispredict branches, the data in perf.data corresponds to the logical execution, so it is in the right order and does not include branch misses). The binary file can then be analyzed using the perf report and perf script commands. Specifically, perf script -F ip,addr prints a list of lines of the form:

7fd0a2a48884 => 7fd0a2a484a8

indicating the source and destination address of each branching instruction. The use of perf_events by non-privileged users can be restricted, according to the kernel.perf_event_paranoid sysctl setting: if that parameter is set to 2 (most paranoid), the perf record command above will fail without root privileges, which makes our attack rather meaningless. However, at any level below that (1, 0 or −1), the command succeeds, and so does our attack. And the parameter is often not set to the most paranoid level by default. Moreover, there can actually be security reasons not to disallow user space perf_events for ordinary users: for example, the use of PEBS counters has been recommended as a possible way of detecting cache attacks [27] and other attacks causing large numbers of cache misses (such as rowhammer [36]); see for example the discussion in [30]. Since the detection requires high-frequency polling of performance counters, applying it system-wide could cause significant slowdowns and lead to false positives, so it should ideally be run by users themselves on the sensitive processes they want to monitor against attacks (strongSwan would be a natural candidate!).
An additional issue worthy of mention is that, perhaps because perf_events are not always perceived as a possible attack vector, the relevant part of the Linux kernel does not seem to be the most scrutinized from a security standpoint. In particular, until May 2016, a race condition in perf_event_open made it possible to bypass the privilege verification entirely [5], which means that our attack could be conducted from any userland process (and probably still can on many live systems).
Recovering the BLISS norm in strongSwan with branch tracing. The strongSwan implementation of BLISS signature generation uses a direct implementation of the Bernoulli-based rejection sampling algorithm described in Figure 2. In particular, the function that implements SampleBernExp, called bernoulli_exp in strongSwan, iterates over each bit of the input value x, and skips to the next iteration whenever the bit is zero. More precisely, branches are taken in bernoulli_exp: (1) to enter the function; (2) to iterate over the bits of the input x; (3) each time the corresponding bit is 0; (4) to return from the function. If we can find the addresses corresponding to all of these branching instructions, the information provided by perf script will immediately reveal every bit of x, which lets us reconstruct x and hence mount the attack described in §3.2.
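To make the leak concrete, here is a minimal sketch (our own function and variable names, not the strongSwan source) of the bit-skipping control flow of a bernoulli_exp-style loop, together with a log of which conditional branches are taken; the branch log alone determines every bit of x:

```c
#include <assert.h>
#include <stdint.h>

/* Log of taken conditional branches: 1 = "bit was zero, Bernoulli trial
 * skipped", 0 = "bit was one, trial executed".  In the attack this log
 * is read off the perf.data branch trace. */
static int trace[32];
static int trace_len;

/* Sketch of the bit-skipping structure of bernoulli_exp: iterate over
 * the bits of x and perform a Bernoulli trial B_{c_i} only when x_i = 1. */
int bernoulli_exp_sketch(uint32_t x, int nbits) {
    trace_len = 0;
    for (int i = 0; i < nbits; i++) {
        if (((x >> i) & 1) == 0) {   /* this taken branch is visible in the trace */
            trace[trace_len++] = 1;
            continue;                /* skip the Bernoulli trial entirely */
        }
        trace[trace_len++] = 0;
        /* ... sample B_{c_i} here; omitted ... */
    }
    return 1;  /* the return value is irrelevant to the leak */
}

/* The attacker reconstructs x from the branch log alone. */
uint32_t reconstruct_x(int nbits) {
    uint32_t x = 0;
    for (int i = 0; i < nbits; i++)
        if (trace[i] == 0) x |= (uint32_t)1 << i;
    return x;
}
```

Running the sketch on any input and then applying reconstruct_x to the log returns that input exactly, which is the essence of the attack.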
The strongSwan software (and more generally any program linked against the libstrongswan library) accesses the BLISS-related functions from the libstrongswan-bliss.so object, which is dynamically loaded using dlopen(3). So, to find the addresses of the relevant branching instructions in the memory space of the attacked process, it suffices to find the offsets of those instructions within libstrongswan-bliss.so, as well as the address at which that library is mapped in the memory space of the process once it is loaded. The latter can be done by parsing the line corresponding to the libstrongswan-bliss.so executable segment in the file /proc/⟨pid⟩/maps.
As for the former, it can be done using a disassembler, such as the disassemble/m command in gdb, as shown in Figure 7. We can see from that figure that, in our compiled version of the libstrongswan-bliss.so object, the four offsets we are interested in are respectively 0x66f0 (entry point), 0x67c6 (while loop iteration), 0x674d (conditional branch on the bits of x) and 0x67ad (return). The same method allows us to find the address in the BLISS signing function sign_bliss at which the bernoulli_exp method is called (it can be either 0x2de3 or 0x2f51). Then, it suffices to find the corresponding branching event in the output of perf script and follow from there the branching instructions carried out inside bernoulli_exp to recover x and hence ‖Sc‖².
The entire attack has been mounted against a short program linked against libstrongswan-bliss.so that generates a BLISS key, computes a BLISS signature with it and verifies it. A short shell script runs that program, launches perf record on it, and calls a perl script to parse the output of perf script afterwards. An example output of the shell script implementing the attack is shown in Figure 8. We are able to recover the value ‖Sc‖² correctly every time. The source code of the entire attack is available for download from https://github.com/mti/bliss-sidechannel.
Features and limitations of the attack. The adversarial use of perf_events-based branch tracing does not seem to have received much consideration in the literature, but it is quite similar to the branch prediction attacks of Acıiçmez et al. [1][2][3]. The branch tracing side-channel is more robust, however: on most CPUs, branch prediction is randomized or at least non-deterministic, so the information one can get from branch mispredictions and similar events is noisy. Our attack, on the other hand, produces a complete and faithful execution trace, and thus succeeds every time.
It does share some of the limitations of branch prediction attacks, however. In particular, the attack model is quite strong, as it requires the ability to run another process on the same platform, usually with the same UID as the target process (unless privilege checks can be bypassed, as in the perf_event_open security bug mentioned earlier). As a result, the threat posed by this kind of attack is normally limited. It cannot be ignored entirely, however, as modern Linux systems, through security measures like ptrace protection, will not allow a process to e.g. read from the memory space of another process of the same user: userland interprocess spying is usually regarded as a significant security issue. Similarly, secret keys stored on disk are usually encrypted and passphrase-protected, so one cannot assume that a process can learn all of a user's secrets just by acquiring his UID.
Another limitation of the attack, which is again shared with branch prediction attacks and many other attacks in the spy-process paradigm [3, §2.3], is that it has a noticeable effect on overall system performance, and hence is unlikely to remain undetected when the target software is used interactively. The main performance hit in our attack is due to the perf record command writing to disk an exhaustive record of all branching events in the execution of the target: this amounts to hundreds of megabytes of data, of which we use only a few dozen bytes. Modifying perf to only write the events we need would reduce the overhead of the attack considerably. Moreover, Skylake and newer Intel CPUs allow this filtering to be carried out directly on-chip using hardware address filters, reducing the attack overhead to practically zero.

ATTACK ON THE GAUSSIAN SAMPLING
The second source of side-channel leakage that we consider is the Gaussian sampling algorithm used to generate the random masks y1 and y2 during signature generation. This algorithm samples from a fixed, centered discrete Gaussian distribution. Numerous techniques have been proposed to carry out that operation. One standard approach [18] involves the use of a cumulative distribution table, as suggested by Peikert [44], combined with the Knuth-Yao algorithm. This is mentioned in the original BLISS paper [14, §6], but the authors note that it requires storing large tables in memory.
Instead, they propose an alternate iterative approach based on the repeated sampling of Bernoulli and uniform distributions. Their algorithm SampleGaussian is described in Figure 9, and relies on the Bernoulli sampling function SampleBernExp as well as a simple function SamplePosGaussian that samples from the positive part of a discrete Gaussian distribution, picking the integer i with probability proportional to 2^(−i²). This approach is the one implemented in strongSwan. It has the advantage of requiring only a small amount of storage (the tables containing the constants used in SampleBernExp). A drawback, however, is that it intrinsically runs in variable time: the number of iterations is a priori unknown, and even though one could choose to bound the number of iterations in SamplePosGaussian by some sufficiently large value, carrying out the entire loop every time would be very costly. This timing variability is a source of side-channel leakage. Due to the contrived structure of the algorithm (which starts over multiple times, etc.), it seems difficult to extract much information from that leakage on a noisy trace, as provided by power or EM analysis techniques, especially as a given Gaussian sample is used in only one signature generation (so that DPA/CPA-style statistical tools are not applicable). Nevertheless, one cannot rule out that this leakage can effectively lead to a catastrophic security failure.
The branch tracing technique described in §3.4, in particular, provides an ideal way of taking advantage of that leakage. Indeed, a branch trace of the execution of SamplePosGaussian directly reveals the output of that function: it suffices to count the number of iterations of the for loop (reverting to 0 when the algorithm restarts). And we have already seen that a branch trace of SampleBernExp reveals the input of that function. Therefore, a branch trace of SampleGaussian reveals the value of x (as the output of SamplePosGaussian) as well as the value y(y + 2kx), which is equal to z² − k²x² (as the input of SampleBernExp). Since k is a fixed constant for a given parameter set, one obtains z², as well as the sign of z from the final conditional branch, and hence the complete output z! We have carried out this attack in the same setting as in §3.4, and verified that it recovers the samples correctly: see the GitHub repository for the corresponding parser.
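The recombination step can be checked with a toy calculation (a sketch with our own function names; the value of k below is illustrative, not an actual BLISS parameter): given the leaked x and the leaked SampleBernExp input t = y(y + 2kx), the attacker recomputes z = ±(kx + y) as follows.

```c
#include <assert.h>

/* Integer square root (z^2 is a perfect square here by construction). */
static long isqrt(long n) {
    long r = 0;
    while ((r + 1) * (r + 1) <= n) r++;
    return r;
}

/* Recombine the two leaked quantities into the Gaussian sample z:
 *   x : output of SamplePosGaussian, read off the iteration count,
 *   t : input of SampleBernExp, t = y(y + 2kx) = z^2 - k^2 x^2,
 *   sign_branch : whether the final sign-flip branch was taken.
 * Since z = k*x + y, we have z^2 = k^2 x^2 + t. */
static long recover_z(long k, long x, long t, int sign_branch) {
    long z = isqrt(k * k * x * x + t);
    return sign_branch ? -z : z;
}
```

For example, with k = 254, x = 2 and y = 17 (so z = 525), the leaked SampleBernExp input is t = 17·(17 + 2·254·2) = 17561, and recover_z returns 525.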
This means that a branch trace of a BLISS signature generation in strongSwan leaks the entire value of the random masks y1 and y2. But the signature itself contains the values c and z1 = y1 + (−1)^b s1·c. Hence, whenever c is invertible (which happens with probability (1 − 1/q)^n > 95% for each signature), we can recover the secret key element s1 as c^(−1)·(z1 − y1). And the secret key element s2 is easily deduced from s1 using the public key.
In other words, a branch trace of a single BLISS signature generation in strongSwan will, with > 95% probability, reveal the entire secret key! And unlike the attack on the rejection sampling, this works for 100% of secret keys. The attack is subject to the caveats mentioned at the end of §3.4: it assumes a relatively powerful attacker and is not very stealthy. Nevertheless, it shows that this variable-time approach to Gaussian sampling has significant security implications.

ATTACKS ON THE MULTIPLICATION
We now turn to attacks targeting the multiplication between the signature element c and the secret key (s1, s2). Those attacks apply to all secret keys (in contrast with the one from §3.2, which only recovered a subset of all keys).
If no protection is used, traditional polynomial multiplication can be attacked using classical DPA-like techniques. If an NTT-based multiplication is used, we can target n one-to-one products between a known varying value and a secret. However, since BLISS uses sparse polynomials with small coefficients, other algorithms are sometimes used to compute the product. Specifically, the 8-bit AVR implementation of Pöppelmann et al. [46] that we target² relies on repeated shifted additions instead. The polynomial c (which we recall has exactly κ coefficients equal to 1 and all others equal to 0) is represented as a vector of κ indices corresponding to the ones, and multiplication by c is an iterated sum over those indices. We show that this unusual implementation of polynomial multiplication not only remains vulnerable to side-channel analysis, but can in fact be broken with a single trace. Thus, our proposed attack applies even when blinding countermeasures are used, such as the ones proposed by Saarinen [49]. Indeed, the blinding randomizes polynomials before multiplication using shifts and multiplications by constants, which corresponds to a search space of affordable size in the case of BLISS.
In the following, we first describe the sparse multiplication of [46], and then propose an attack on this implementation, both in the unprotected case and when Saarinen's blinding countermeasure is used.

Implementation details
First, let us recall that we target the multiplications of the polynomial c with the two components s1, s2 of the secret key. These polynomials have the following properties:
• s1 is of degree n with coefficients in {−2, −1, 0, 1, 2},
• s2 is of degree n with coefficients in {−3, −1, 1, 3, 5},
• c is of degree n with κ coefficients equal to 1 and all others equal to 0.
² This implementation does not claim any particular SCA resistance.
Figure 10 describes the core of the targeted sparse multiplication. Before performing this multiplication, the vector c is randomly generated in such a way that the indices in the vector are not ordered. Thus, even if one knows the non-zero coefficients of c (as part of the signature), one cannot infer the order in which they are accessed in the j-loop (i.e. one cannot deduce i from q and j).

Attacking the unprotected multiplication
The attack proceeds in two steps. First, the attacker recovers the order in which the non-zero indices of c are stored in memory. Then, based on this information, he recovers the coefficients of the second multiplicand.
Recovering the order of indices in c. For the first step, two approaches are possible: (1) using SPA information from the if condition on i, or (2) performing a CPA on the computation of q − c[j].

The first technique involves exploiting the timing difference induced by the if condition using pattern matching techniques. There is indeed an exploitable timing difference between the patterns corresponding to the conditional code being executed or not. Finding the positions where the additional code is executed in a j-loop allows one to recover the index positions. More precisely, the values of j for which the if condition evaluates to true correspond to the cells of the array c that contain a value larger than q. This knowledge directly translates into knowing the order of non-zero indices in c.

    for (q = 0; q < N; q++) {              /* loop on res coefficients */
        res[q] = 0;
        for (j = 0; j < Kappa; j++) {      /* loop on c coefficients */
            int8_t val = 1;
            int16_t i = q - c[j];          /* the corresponding s coefficient index */
            if (i < 0) {
                i += N;
                val = -val;
            }
            val *= s[i];
            res[q] += val;                 /* update res coefficient at position q */
        }
    }

Figure 10: Code snippet of the mulSparse function from the BLISS implementation [46].
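Under the idealized, noise-free assumption that each execution of the if body is visible on the trace, this order recovery can be simulated as follows (toy sizes and our own names; real BLISS uses n = 512):

```c
#include <assert.h>

#define N 16      /* toy ring dimension */
#define KAPPA 4   /* toy sparsity */

/* taken[q][j] = 1 iff the "if (i < 0)" branch fired for this (q, j),
 * i.e. iff c[j] > q.  In the attack this matrix is read off the
 * SPA/EMA trace; here we simulate it from a known c. */
static void simulate(const int c[KAPPA], int taken[N][KAPPA]) {
    for (int q = 0; q < N; q++)
        for (int j = 0; j < KAPPA; j++)
            taken[q][j] = (q - c[j] < 0);
}

/* Recovery: c[j] > q for exactly c[j] values of q in {0,...,N-1}, so
 * summing column j of the pattern yields c[j] itself, and in particular
 * the order in which the indices are stored. */
static void recover(int taken[N][KAPPA], int c_rec[KAPPA]) {
    for (int j = 0; j < KAPPA; j++) {
        c_rec[j] = 0;
        for (int q = 0; q < N; q++)
            c_rec[j] += taken[q][j];
    }
}
```

Running simulate and then recover on any index vector returns that vector in storage order, which is exactly the information needed for the second step of the attack.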
In the second approach, one exploits data-dependent leakage. For a targeted value c[j], the samples corresponding to the computations of i for the n different values of q are recorded. Then, one can perform a CPA attack to distinguish the correct value of c[j] from the others, since q is known.
Attacking the polynomial multiplication. Now that the attacker knows the order in which the coefficients of c are processed, he can compute the value of i for each inner-loop iteration. This allows him to target the accumulation operation res[q] += val, since he knows which secret coefficient of s is contained in val. Our hypothesis is that he obtains the Hamming weights of val and res[q] (both before and after the addition), as is generally the case on microcontrollers. This hypothesis is actually pessimistic, since one may obtain additional leakage (e.g. Hamming distances) or more informative leakage (e.g. polynomials in the register bits).
Let us first look at what an attacker can learn from the leakage of val. The values of this variable are directly linked to coefficients of s, up to a (known) sign inversion. Nevertheless, these coefficients cannot be directly recovered from the leakage: obtaining the Hamming weight of a variable taking values in {−2, −1, 0, 1, 2} will reveal its value if it is zero, but only its sign if it is non-zero (for s2, whose coefficients are taken in {−3, −1, 1, 3, 5}, exploitation is a bit easier, but still not sufficient). The parameters of the scheme make the remaining exhaustive search intractable.
Let us now consider the second leakage source, namely the Hamming weight of res[q]. In that case, the problem comes from the fact that a classical divide-and-conquer strategy cannot be applied. Exploiting a leakage on res[q] would require knowledge of its previous value to derive information on val. This previous value is actually a sum of (unknown) coefficients of s, which prevents such a divide-and-conquer strategy.
To overcome this difficulty, a first idea would be to use a Viterbi-like algorithm to avoid testing all possible combinations. Indeed, accumulating into res[q] can be seen as a Markov process, the state being the current sum and the transitions being the possible values of s[i]. The probability of a sequence could be evaluated using templates, or derived from a correlation coefficient between the expected Hamming weights of the different states and the trace chunks. We experimented with this approach on simulated traces and recovered the key. However, we did not manage to obtain good results at high noise levels.
In highly noisy settings, the Hamming weight of a 16-bit variable of small absolute value (as is the case for res[q]) reveals its sign, since the binary representation contains many leading zeros if the value is non-negative and many leading ones in the opposite case. Based on this simple observation, the attacker obtains high-confidence constraints of the form ±s[i0] ± s[i1] ± · · · ± s[it] ≥ 0 (or < 0), one for each partial sum accumulated in res[q]. Recovering the s[i]'s from those constraints is an Integer Linear Programming (ILP) problem with no objective function (the parameters are large enough to guarantee uniqueness of the solution).
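The sign-from-Hamming-weight observation can be made concrete as follows (a minimal sketch; the cut-off value 8 is simply the midpoint of the 16-bit weight range):

```c
#include <assert.h>
#include <stdint.h>

/* Hamming weight of the 16-bit two's-complement representation. */
static int hw16(int16_t v) {
    uint16_t u = (uint16_t)v;
    int w = 0;
    while (u) { w += u & 1; u >>= 1; }
    return w;
}

/* For accumulator values of small magnitude, a negative value has many
 * leading ones (high weight) and a non-negative one many leading zeros
 * (low weight), so a simple cut-off on the measured, possibly noisy,
 * Hamming weight classifies the sign. */
static int looks_negative(int hamming_weight) {
    return hamming_weight > 8;
}
```

For instance, −3 is 0xFFFD (weight 15) and 5 is 0x0005 (weight 2), so the classifier separates them with a wide margin; leakage near the cut-off is exactly what the discarding strategy below throws away.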
We performed some simulations based on this second idea. Simulating the Hamming weight of 16-bit variables, correct constraint systems were obtained up to a noise standard deviation of 1.5 (for Gaussian noise, as expected). Using such error-free systems and the Gurobi ILP solver [29], the secret vectors were recovered within a few dozen milliseconds. To handle higher noise levels, the attacker should discard the constraints in which he has less confidence (i.e. those corresponding to leakage close to the cut-off).
Simulations have been performed up to a noise standard deviation of 3.5 on a desktop computer. The results are given in Table 4. They were obtained on a set of 50 systems for each noise level, and they show that attacks are easily carried out in this range. We observe that in some settings the solver should be asked for more than a single solution; timings are therefore provided both when the solver is asked for a single solution and when it is asked for 10 solutions. Systems were also produced at noise standard deviation 4.0, where only 10% had a single solution; the others were not solved after tens of minutes. Ongoing experiments on a real device are expected to provide insights about relevant parameters for the discarding strategy, and to confirm the applicability of this attack in real-world settings.
Note that we presented results for an attack using a single trace. Without SCA protection, however, different executions can be combined to construct the system. Thus, the attack may work at higher noise levels using more traces.

Attacking a blinded multiplication
Even when Saarinen's blinding countermeasure is used, c remains sparse. Indeed, the blinding consists in shifting and multiplying by a constant, so that the same mulSparse function can still be used (with val being initialized to the blinding constant). However, the indices of the non-zero positions of the blinded c are unknown, because the signature contains the actual c and not the blinded one. Thus, these indices have to be recovered in addition to their positions in the array c. Note that the signature provides us with the non-zero positions of the blinded c up to a shift; this information may help with the index recovery.
Recovering the c indices. Similarly to the index order recovery step above, we exploit the timing leakage due to the if condition. More precisely, the if branch is taken in a q-loop as many times as there are indices in c that are larger than q. Hence, the duration of the q-loop decreases (non-strictly) as q grows. The steps at which an actual decrease occurs correspond to the values of q that match an index in c, which trivially reveals the non-zero positions of the vector. As previously mentioned, we may additionally use the knowledge of the actual c to help (since we know the indices up to a shift).
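This decrease-detection step can likewise be simulated (toy sizes and our own names; the per-loop branch counts stand in for the measured q-loop durations):

```c
#include <assert.h>

#define N 16      /* toy ring dimension */
#define KAPPA 4   /* toy sparsity */

/* count[q] = number of executions of the extra if-body code in the q-th
 * loop, i.e. #{ j : c[j] > q }; in the attack this is inferred from the
 * duration of each q-loop. */
static void durations(const int c[KAPPA], int count[N]) {
    for (int q = 0; q < N; q++) {
        count[q] = 0;
        for (int j = 0; j < KAPPA; j++)
            if (c[j] > q) count[q]++;
    }
}

/* The per-loop count decreases exactly at the q's that occur in c:
 * count[q-1] - count[q] = #{ j : c[j] == q }, with count[-1] = KAPPA. */
static void nonzero_positions(const int count[N], int is_index[N]) {
    int prev = KAPPA;
    for (int q = 0; q < N; q++) {
        is_index[q] = (count[q] < prev);
        prev = count[q];
    }
}
```

The positions flagged by nonzero_positions are precisely the non-zero positions of the (blinded) vector, independently of the order in which they are stored.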
Unblinding c. Finally, we have to recover the actual value of s.
Here, we just have to go through the n possible shift values and the invertible constants, which is of reasonable complexity for all BLISS parameters.

DISCUSSION
We have presented side-channel attacks against three parts of the BLISS signing algorithm, namely the rejection sampling step, the Gaussian sampling, and the multiplication of the secret key by the hash value c. We have found them to yield full key recoveries, either on the embedded 8-bit AVR implementation of Pöppelmann et al. [46] through EM emanations, or in the strongSwan software [51] on a Linux desktop machine under branch tracing. In this section, we discuss possible countermeasures and conclusions that one could draw from these observations.

Attack against the rejection sampling
As we have seen, the implementation of BLISS rejection sampling, which relies on iterated Bernoulli trials, leaks the values ‖Sc‖² and ⟨z, Sc⟩. If we ignore the compression of the signature element z2, exploiting the leakage of the scalar product to recover the secret key is a simple matter of linear algebra; however, real implementations do include compression, which seems to thwart that attack.
However, the leakage of the norm ‖Sc‖² does suffice to recover the secret key using our variant of the Howgrave-Graham–Szydlo algorithm, provided that the algebraic norm of s1 (or s2) is easy to factor, which happens in a noticeable fraction of all cases (over 3% for the 128-bit secure parameter sets BLISS-I and BLISS-II, for example, according to Table 2). And it seems difficult to protect against this attack.
Hard-to-factor algebraic norms. A first possible countermeasure would be to try to ensure that the algebraic norms of s1 and s2 are hard to factor, but it isn't clear how that could be done in practice: just generating these values using the existing key generation algorithm and eliminating "easy to factor" values seems hopeless, as there is no such thing as an efficient test for "hard to factor" composites. Alternatively, one could try to construct these values as products of two small, sparse elements with large prime norms (so as to obtain RSA-like cyclotomic integers), but this would require a significant increase in all parameters (worsening the efficiency of BLISS to a considerable extent). Moreover, relying on the hardness of factoring in a scheme whose main selling point is post-quantum security is quite unsatisfactory.
Constant-time implementation. A much easier countermeasure is to implement the iterated Bernoulli rejection sampling algorithm in constant time. Our simple power analysis (or rather, SEMA) attack is made particularly easy, in the case of the implementation of Pöppelmann et al. [46], by the fact that algorithm SampleBernExp(x) as described in Figure 2 carries out the samplings B_{c_i} only for indices i such that the corresponding bit x_i of x is 1. This produces a trace very similar to the 1990s SPA attacks on RSA [37, §3.1]: one can simply read the bits of x off the trace directly (where, in our case, x = K − ‖Sc‖²). The same observation of course applies to strongSwan with respect to branch tracing. One can make things more difficult for the side-channel attacker by rewriting the algorithm in such a way that the B_{c_i} samplings are carried out every time, regardless of the value of x_i. We can also eliminate data-dependent branches completely. A possible such algorithm is described in Figure 11.
That countermeasure is probably sufficient, at least in the case of strongSwan, provided that one takes good care to precompute the bits x_i before starting the loop, and to check that the compiler does not introduce spurious branching instructions in that computation. Strictly speaking, however, the countermeasure does not eliminate all leakage related to the x_i's. For example, it involves a multiplication a·x_i whose operands are just bits, so one can reasonably expect to be able to distinguish the cases x_i = 0 and x_i = 1 on a power or EM trace, according to the Hamming weight leakage model.
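A branch-free structure along these lines might look as follows (a sketch with our own names, not the algorithm of Figure 11 verbatim; we model the acceptance as the conjunction of the trials B_{c_i} for the bits x_i = 1, and make the per-bit trial injectable so the masking logic can be tested deterministically):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the per-bit Bernoulli trial B_{c_i}; in the real scheme
 * it returns 1 with a bit-dependent probability, here it is injectable. */
typedef int (*bern_trial)(int i);

static int trial_one(int i)       { (void)i; return 1; }      /* always succeeds */
static int trial_zero_at_2(int i) { return i != 2; }          /* fails at bit 2 */

/* Branch-free SampleBernExp sketch: every trial executes regardless of
 * the bits of x, and the result of trial i is folded in with masks only
 * when x_i = 1, so there is no data-dependent branch on x. */
int sample_bern_exp_ct(uint32_t x, int nbits, bern_trial trial) {
    uint32_t acc = 1;
    for (int i = 0; i < nbits; i++) {
        uint32_t xi = (x >> i) & 1;              /* bit of the sensitive input */
        uint32_t b  = (uint32_t)(trial(i) & 1);  /* trial always runs */
        acc &= b | (xi ^ 1);   /* keep acc when xi = 0, AND in b when xi = 1 */
    }
    return (int)acc;
}
```

With x = 0x0B (bit 2 is 0), a failing trial at bit 2 is masked out and the result is 1; with x = 0x0F it is folded in and the result is 0, matching the skip-on-zero semantics without any x-dependent branch.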
Rejection sampling using transcendental functions. As mentioned in §2, one could also avoid iterated Bernoulli sampling entirely, and carry out the rejection sampling in a single step by computing the rejection probability each time with sufficient precision and comparing it to uniformly sampled randomness in a suitable interval. However, this involves computing the transcendental functions exp and cosh to high precision if one doesn't want to lose too much accuracy at the tails of the distribution (which could in principle jeopardize the security of the scheme via statistical attacks of the form considered by Ducas and Nguyen against NTRUSign [17]). Such a computation, however, would be quite inefficient, especially on constrained devices like the 8-bit AVR microcontroller targeted in our experiments. Moreover, it is also highly non-linear and has a high circuit complexity, making it particularly inconvenient if one wants to introduce more theoretically sound countermeasures against SPA and DPA, like masking.
One could also conceivably precompute all possible values of the probabilities involved in rejection sampling and tabulate them, in an approach similar to CDT-based techniques for Gaussian sampling. The rejection sampling step would then be fast and easy to implement in constant time. The obvious drawback, however, is that the storage requirement is very large: tens of thousands of high-precision values for each parameter set, amounting to megabytes of storage overall. This is again unsuitable for constrained devices. It may be acceptable on desktop computers, but in that setting, cache attacks become a source of concern.
Using a scheme with a simpler rejection sampling? A different approach would be to use an alternate scheme with a simpler rejection sampling algorithm. An obvious candidate is the "ancestor" of BLISS: the lattice-based signature scheme described by Güneysu, Lyubashevsky and Pöppelmann (GLP) in [28], which targets a uniform distribution in a suitable interval for the coefficients of the signature elements z1, z2, instead of the bimodal Gaussian distribution of BLISS. Due to that difference, the GLP scheme has slightly less compact signatures: BLISS signatures at the 128-bit security level are about 5000 bits long, while GLP signatures are about 9000 bits long (and the bit security claim is also a bit weaker).
However, the rejection sampling step also becomes considerably simpler: it simply involves checking whether the coefficients of z1 and z2 fall in the expected interval. In other words, it boils down to a collection of simple integer comparisons, which are typically fast, constant-time operations even on the most modest platforms. Moreover, this simple rejection sampling can easily be combined with arithmetic masking: although it is not a linear operation, masking it amounts to masking a shallow Boolean circuit with a single-bit output, which can be done efficiently (it can be seen as a simpler variant of the arithmetic-to-Boolean masking conversion of Coron et al. [10]).
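For illustration, a branch-free version of such an interval check might look as follows (a sketch; the bound and array size are placeholders, not actual GLP parameters). Instead of early-exiting on the first out-of-range coefficient, it ORs together out-of-range flags, so the running time is independent of the secret-dependent coefficients:

```c
#include <assert.h>
#include <stdint.h>

/* GLP-style rejection: reject iff some coefficient of z lies outside
 * [-bound, bound].  The sign bit of (bound - z[i]) and (z[i] + bound),
 * extracted with a shift rather than a comparison, flags each violation
 * without any data-dependent branch. */
int glp_reject_ct(const int32_t *z, int n, int32_t bound) {
    uint32_t bad = 0;
    for (int i = 0; i < n; i++) {
        uint32_t over  = ((uint32_t)(bound - z[i])) >> 31;  /* z[i] >  bound */
        uint32_t under = ((uint32_t)(z[i] + bound)) >> 31;  /* z[i] < -bound */
        bad |= over | under;
    }
    return bad != 0;  /* 1 = reject, 0 = accept */
}
```

Every iteration performs the same instruction sequence whether or not the coefficient is in range, which is exactly the property the iterated-Bernoulli sampler lacks.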

Attack on the Gaussian sampling
We also showed in §4 that the variable-time algorithm for discrete Gaussian sampling proposed by Ducas et al. [14] and used in strongSwan is a potential source of side-channel leakage. Mounting a concrete key recovery attack only seems practical given access to a mostly "noise-free" attack vector like branch tracing, but in that case, the attack is very powerful: it recovers the entire secret key with high probability from the branch trace of a single signature, and works for all keys.
Gaussian sampling in constant time. As we have seen above, the function SampleBernExp can be implemented in constant time relatively painlessly and with a moderate performance penalty. The same cannot be said of SamplePosGaussian, however: that function carries out an a priori unbounded number of iterations, and even though it is feasible to fix a large bound instead, the performance penalty incurred by executing the entire for loop instead of returning early would be tremendous. Therefore, converting this Gaussian sampling algorithm to constant time as a whole seems impractical.
One could observe that implementing SampleBernExp alone in constant time would make the attack significantly more difficult. This is true, but even though the attacker would no longer be able to recover the entire output of SampleGaussian, he would still be able to compute x, and hence would get the value of z up to the small error y located in the least significant bits. In other words, the attacker would be able to replace the large Gaussian masks y_i by much smaller noise values, which clearly affects the security of the scheme.
Therefore, in any setting where a side-channel attack vector similar to branch tracing is plausible, it seems preferable to avoid the Gaussian sampling algorithm of Ducas et al. altogether. Other algorithms for discrete Gaussian sampling are much more practical to implement in constant time. This includes the approach proposed by Dwarakanath and Galbraith [18] combining Knuth-Yao with cumulative distribution tables; it suffers from large storage requirements, however. The most convenient option so far seems to be the algorithm of Micciancio and Walter [40], which can be implemented entirely in constant time with only integer arithmetic, and relies on an amount of storage that can be tailored to specific applications: the algorithm runs faster with more storage, but can use less storage on more constrained devices.
Alternatively, switching to a simpler signature scheme such as GLP [28] solves the problem for this source of leakage as well: the masks in GLP are sampled with uniform coefficients in short intervals, and this is again straightforward to implement in a constant-time and branch-free manner.

Attack on the multiplication by c
As seen in §5, we can also recover the secret key from power analysis/EMA traces of the computation of the products s1·c and s2·c of the secret key elements with the varying hash element c computed during signature generation (which is a sparse polynomial with coefficients in {0, 1}).
We have described this key recovery in the context of the Pöppelmann et al. implementation [46], in which c is simply represented as the list of the positions of its non-zero coefficients, and the products s1·c and s2·c are computed as simple sums of signed shifts of s1 and s2 respectively.
We note that even if c is instead represented as a ring element, and the products s_i·c are computed as generic products in the ring (using the number-theoretic transform in Z_q[x]/(x^n + 1)), a DPA attack will still easily recover the s_i's. This is because the NTT-based product operation is a simple component-wise product of vectors in Z_q^n, where the vectors corresponding to s1 and s2 are constant, whereas the other operand of the multiplication varies with each signature generation. Since the size of the integers involved is quite small (q is about 14 bits long), the key recovery is feasible with a reasonable number of traces on an arbitrary platform, and is of course even easier on an 8-bit microcontroller, where the multiplication is carried out on 8-bit operands.
Again, protecting against such a DPA attack seems tricky. We can suggest the following two possible approaches, but both of them have drawbacks.
A heuristic countermeasure. What makes the DPA approach so effective against this multiplication is the fact that it combines the secret key with a variable operand c that is known to the adversary, since it is part of the signature. One possible approach to make the leakage harder to exploit is thus to avoid multiplying the secret by a known value. This can be done by computing the element z_i = y_i + (−1)^b s_i · c as:

z_i = (y_i · c^{-1} + (−1)^b s_i) · c.

With that formula, the known value c^{-1} is multiplied by a secret value y_i, but since that secret value changes with each signature generation, it is much more difficult to exploit the corresponding leakage with DPA (although a single-trace template attack may still apply when the adversary has access to an identical device). As for the product of c with w_i = y_i · c^{-1} + (−1)^b s_i, the corresponding leakage should not be an issue, since w_i can be recovered from the available signature elements anyway (at least when the iteration corresponds to a signature that is actually output; for values eliminated in rejection sampling, the argument does not hold, but c is not known in that case either). This provides a heuristic countermeasure that may help against DPA attacks (although a quantitative evaluation of its effectiveness is left for future work). It can be seen as an alternative to Saarinen's countermeasure [49], which does thwart the stronger attacks of §5. It does have a number of limitations, however. As we have noted, it may still be defeated by template attacks. Moreover, it involves the computation of the inverse c^{-1}, which would typically be carried out in the Fourier domain (so it amounts to a series of inversions in Z_q). This is a somewhat costly operation on constrained devices, and has the additional drawback of requiring c to be invertible (which happens with probability (1 − 1/q)^n ≈ 0.96: high, but still less than 1). And since c^{-1} has full-size coefficients modulo q, one cannot use a naive multiplication taking advantage of the sparsity of c: relying on a full-blown number-theoretic transform seems necessary to achieve satisfactory performance.
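The algebraic identity behind this countermeasure, z_i = (y_i · c^{-1} + (−1)^b s_i) · c, can be checked on a toy instance of the ring Z_q[X]/(X^n + 1). The sketch below (hypothetical toy parameters, schoolbook arithmetic, not the strongSwan or AVR code) sidesteps the costly inversion by generating y as r · c, so that y · c^{-1} = r by construction; the identity then holds even when c happens to be non-invertible.

```python
import random

def mul_negacyclic(a, b, q):
    """Schoolbook product in Z_q[X]/(X^n + 1)."""
    n = len(a)
    res = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                res[i + j] = (res[i + j] + a[i] * b[j]) % q
            else:
                res[i + j - n] = (res[i + j - n] - a[i] * b[j]) % q
    return res

def add(a, b, q):
    return [(x + y) % q for x, y in zip(a, b)]

random.seed(1)
q, n = 12289, 16                                   # toy parameters
s = [random.randrange(q) for _ in range(n)]        # plays (-1)^b * s_i
c = [random.randrange(2) for _ in range(n)]        # sparse 0/1 challenge
r = [random.randrange(q) for _ in range(n)]        # plays y_i * c^{-1}
y = mul_negacyclic(r, c, q)                        # so that y = r * c

z_direct = add(y, mul_negacyclic(s, c, q), q)      # z = y + s*c
z_rearranged = mul_negacyclic(add(r, s, q), c, q)  # z = (y*c^{-1} + s) * c
assert z_direct == z_rearranged
```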
Arithmetic masking. It is also possible to achieve provable security against DPA (at least in the so-called t-probing model, which is known to capture realistic leakage scenarios [48]) using arithmetic masking (with the caveat that, again, template attacks still apply [43]). The linearity of the multiplication by c makes it particularly easy to mask. However, the entire signing algorithm needs to be masked for the security argument to be valid. In particular, one also needs to mask the generation of the random Gaussian values y_1, y_2, which seems difficult, and the rejection sampling, which, as we have already noted, seems even harder. These highly non-linear operations constitute a major stumbling block for applying a masking countermeasure.
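Since multiplication by c is linear, it commutes with share-wise processing. The following minimal sketch illustrates additive masking of just this multiplication (hypothetical toy parameters, illustrative only; a real countermeasure must mask the whole signing algorithm, as discussed above):

```python
import random

def mul_negacyclic(a, b, q):
    """Schoolbook product in Z_q[X]/(X^n + 1)."""
    n = len(a)
    res = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                res[i + j] = (res[i + j] + a[i] * b[j]) % q
            else:
                res[i + j - n] = (res[i + j - n] - a[i] * b[j]) % q
    return res

random.seed(2)
q, n, t = 12289, 16, 3                        # toy parameters, t shares
s = [random.randrange(q) for _ in range(n)]
c = [random.randrange(2) for _ in range(n)]

# additive sharing: s = s^(1) + ... + s^(t) mod q
shares = [[random.randrange(q) for _ in range(n)] for _ in range(t - 1)]
shares.append([(s[i] - sum(sh[i] for sh in shares)) % q for i in range(n)])

# multiply each share by c independently, then recombine share-wise
prods = [mul_negacyclic(sh, c, q) for sh in shares]
recombined = [sum(p[i] for p in prods) % q for i in range(n)]
assert recombined == mul_negacyclic(s, c, q)
```

The linearity is exactly what makes this operation cheap to mask; the Gaussian sampling and rejection steps have no such structure.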
Thus, one may again want to consider an alternate scheme involving simpler operations, like the GLP scheme mentioned earlier [28]. In that scheme, y_1 and y_2 are sampled from uniform distributions in suitable intervals, so that masked sampling is reasonably easy to implement. The generation of u, z_1, z_2 remains linear, and as noted above, the rejection sampling can be masked easily as well. This may make the larger signature size an acceptable trade-off for good security against physical attacks.

Related schemes
BLISS-B. In [13], Ducas proposed an optimized variant of BLISS called BLISS-B, which features significantly faster key generation and signing algorithms while maintaining the same security level. This variant is also implemented in strongSwan [51].
The main functional difference between BLISS and BLISS-B is the way the "challenge element" c is used in the signature generation algorithm. In BLISS-B, c is computed in a similar way as in BLISS (as the output of a hash function mapping to sparse polynomials with 0/1 coefficients), but instead of computing the signature elements z_i directly with c as z_i = y_i + (−1)^b s_i c, the algorithm first obtains an intermediate ring element c′ by flipping the signs of some of the coefficients of c, and uses c′ instead of c in the computation of the z_i's. The sign flips are deterministic, and are chosen in such a way as to make the norm ∥Sc′∥² = ∥s_1 c′∥² + ∥s_2 c′∥² smaller, which reduces the chance that a candidate signature is thrown out in rejection sampling, and hence improves the overall signing speed. The change does not affect signature verification, since verification only depends on the value of c′ modulo 2, which is equal to the original c.
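As a rough illustration of the sign-flip idea, here is a hedged sketch of a greedy sign selection (our simplified reconstruction, not necessarily Ducas's exact algorithm from [13]): for each monomial of c, pick the sign that keeps the running value of ∥Sc′∥² small. Choosing the sign that makes the cross term non-positive provably keeps ∥Sc′∥² ≤ κ(∥s_1∥² + ∥s_2∥²).

```python
import random

def shift(a, k):
    """Compute a(X) * X^k in Z[X]/(X^n + 1) for 0 <= k < n."""
    n = len(a)
    res = [0] * n
    for i, ai in enumerate(a):
        if i + k < n:
            res[i + k] += ai
        else:
            res[i + k - n] -= ai     # wrap-around picks up a minus sign
    return res

def sq_norm(vecs):
    return sum(x * x for v in vecs for x in v)

def greedy_signs(s_rows, positions):
    """Greedily pick the sign of each monomial of c to keep ||S c'||^2 small."""
    n = len(s_rows[0])
    acc = [[0] * n for _ in s_rows]          # running value of each s_i * c'
    signs = []
    for k in positions:
        cols = [shift(s, k) for s in s_rows]
        # adding with sign sgn changes the norm by ||cols||^2 + 2*sgn*<acc, cols>
        inner = sum(a * c for av, cv in zip(acc, cols) for a, c in zip(av, cv))
        sgn = 1 if inner <= 0 else -1        # make the cross term non-positive
        signs.append(sgn)
        acc = [[a + sgn * c for a, c in zip(av, cv)]
               for av, cv in zip(acc, cols)]
    return signs, acc

random.seed(0)
n, kappa = 32, 8
s1 = [random.choice([-1, 0, 0, 1]) for _ in range(n)]
s2 = [random.choice([-1, 0, 0, 1]) for _ in range(n)]
positions = random.sample(range(n), kappa)   # indices of the 1-coefficients of c

signs, acc = greedy_signs([s1, s2], positions)
# the greedy rule guarantees ||S c'||^2 <= kappa * (||s1||^2 + ||s2||^2)
assert sq_norm(acc) <= kappa * sq_norm([s1, s2])
```

Note that each chosen sign depends on the secret key, which is precisely why straightforward implementations branch on secret data.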
In terms of side-channel security, the change introduced by BLISS-B looks rather dangerous, as the sign flips are typically implemented in terms of secret-dependent branches (this is in particular the case in strongSwan). We have not mounted a specific attack against this modification, but at least in a strong attack model like branch tracing, it seems relatively easy to exploit. BLISS-B is also vulnerable to the attack of §4 (since the Gaussian sampling is exactly the same as in BLISS), and the attack of §5 adapts relatively easily as well (since the multiplication s_i c′ is still carried out as a sequence of shifted additions and subtractions, and the values of the shifts are known). On the other hand, the attack of §3 seems difficult to adapt, because the leakage reveals the norm ∥Sc′∥² whereas the attacker only knows c. Since the attack requires many signatures, simply guessing the sign flips is not feasible either. Generalizing this attack to BLISS-B is an interesting open problem.
GPV-style hash-and-sign signatures. Besides Fiat-Shamir style signatures, the other major type of lattice-based signature scheme in the random oracle model consists of hash-and-sign signatures based on GPV lattice trapdoors and Gaussian sampling in lattices [23]. Such schemes are considered less efficient than their Fiat-Shamir counterparts, but offer some advantages, like clearer security guarantees in the quantum random oracle model.
Not many implementations of GPV-style lattice signatures have been published, but a few efficient variants have publicly available code, including the NTRU-based scheme of Ducas, Lyubashevsky and Prest [16]. A complete side-channel evaluation of these implementations is beyond the scope of this paper, but generally speaking, a major issue with all of these schemes from a side-channel standpoint is that their main operation in signature generation, namely Gaussian sampling in a lattice, is hard to carry out in constant time and without secret-dependent branches (in contrast with a scheme like BLISS, which only requires Gaussian sampling over Z). In fact, to the best of our knowledge, this is still an open problem at the time of this writing, and until it is solved, even an SPA-resistant implementation of GPV-style signatures appears difficult to achieve.
(II-2) Lift of ideal. Consider one of the ideals (a ± ib) of Z[i] resulting from the splitting over the quadratic subfield, and lift it to the whole cyclotomic field, that is, view it as an ideal of Z[ζ_m]. By notational abuse, we also denote by (a + ib) (resp. (a − ib)) the ideal lifted from (a + ib) (resp. (a − ib)).
From these two ideals, we can construct two candidate ideals a₊ and a₋, respectively defined from the lifted ideals (a + ib) and (a − ib). At least one of these candidate ideals is principal, and by construction its generator is a solution of the norm equation. The schematic representation of the algorithm in this case is presented in Figure 13.

Case p ≡ 1 mod 4, α > 1. For any prime ideal P dividing (f · f̄), appearing with multiplicity t in its decomposition, P can appear in the decomposition of (f) with multiplicity 0 ≤ i ≤ t, implying that P appears with multiplicity t − i in the decomposition of (f̄). As such, one of the 1 + t such ideals appears in the decomposition of (f).
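The splitting of a prime p ≡ 1 mod 4 in Z[i] used here comes from writing p = a² + b², so that (p) = (a + ib)(a − ib). A hedged sketch of the classical Cornacchia-style descent computing such a decomposition (illustrative only, not the paper's implementation):

```python
import math
import random

def sqrt_minus_one(p):
    """A square root of -1 modulo a prime p = 1 mod 4."""
    while True:
        g = random.randrange(2, p)
        x = pow(g, (p - 1) // 4, p)    # if g is a non-residue, x^2 = -1 mod p
        if x * x % p == p - 1:
            return x

def two_squares(p):
    """Write a prime p = 1 mod 4 as a^2 + b^2 (Cornacchia-style descent)."""
    r0, r1 = p, sqrt_minus_one(p)
    bound = math.isqrt(p)
    while r1 > bound:                  # Euclidean steps until r1 <= sqrt(p)
        r0, r1 = r1, r0 % r1
    return r1, math.isqrt(p - r1 * r1)

a, b = two_squares(12289)              # 12289 = 1 mod 4 (the BLISS modulus)
assert a * a + b * b == 12289
```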
In order to compute (f), we thus need to compute the prime decomposition of (f · f̄) and test the possible multiplicities of each prime ideal in that decomposition. Since we know that the algebraic norm of f · f̄ is p^α, we also know that its prime divisors are all primes above p. Then, after first enumerating the prime ideals above p, with Berlekamp's algorithm for instance, we test the divisibility of (f · f̄) by each of their powers. By multiplicativity of the norm, each prime ideal can only appear with multiplicity at most α.
Once the prime decomposition is computed, we can construct the candidate ideals by taking products of exactly one ideal of the form p_i^k · p̄_i^{t_i − k} among the 1 + t_i possibilities for each pair (p_i, p̄_i) of conjugate primes dividing (f · f̄).
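The enumeration of candidate exponent splits can be sketched as follows (toy multiplicities; mults is a hypothetical stand-in for the computed decomposition of (f · f̄)):

```python
from itertools import product

# hypothetical multiplicities t_i of each conjugate pair (p_i, pbar_i) in the
# decomposition of (f * fbar)
mults = {"p1": 2, "p2": 1}

# (f) may contain p_i^k * pbar_i^(t_i - k) for any k in 0..t_i, so there are
# prod(1 + t_i) candidate ideals overall
candidates = [
    {p: (k, t - k) for (p, t), k in zip(mults.items(), exps)}
    for exps in product(*(range(t + 1) for t in mults.values()))
]
assert len(candidates) == (1 + 2) * (1 + 1)   # prod(1 + t_i) = 6
```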
Case p ≡ 3 mod 4. In this case p is inert in Z[i], and we cannot construct a list of candidates from the decomposition in the quadratic field as in step II-1 of the previous case. Nonetheless, this case is somewhat simpler: the ideal (f) in Z[ζ_m] is actually invariant under the conjugation map. Indeed, if we decompose (f) into prime ideals, (f) = ∏_i p_i^{e_i}, each ideal p_i is necessarily fixed by conjugation, being a prime above the inert prime p. As a consequence, we only need to compute the square root of the principal ideal generated by the norm (f · f̄) to recover (f). This can be done easily by first decomposing (f · f̄) into prime ideals and then dividing the valuation of each prime by two. For notational simplicity and consistency with the ideals generated in the case p ≡ 1 mod 4, we also denote the recovered ideal (f) by I and call it a candidate ideal. The schematic representation of the algorithm in this case is presented in Figure 14.
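In this case the square root of the ideal is obtained by simply halving every valuation; a minimal sketch over a hypothetical factorization:

```python
# hypothetical prime decomposition of (f * fbar), as {prime ideal: valuation};
# in the p = 3 mod 4 case every prime ideal is fixed by conjugation, so
# each valuation in (f * fbar) is twice its valuation in (f)
norm_factorization = {"P1": 4, "P2": 2}

assert all(e % 2 == 0 for e in norm_factorization.values())
f_ideal = {P: e // 2 for P, e in norm_factorization.items()}   # halve valuations
assert f_ideal == {"P1": 2, "P2": 1}
```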

A.2.3 (Step III) Generator recovery. This final step is common to the two cases. Given one (or the unique) candidate I, which is supposed to be principal, as well as the element f · f̄, generator of the ideal I · Ī by construction, the Gentry-Szydlo [24] algorithm can be called to recover f up to a root of unity. In the case where I is not principal, the latter algorithm returns an error, hence giving a method to distinguish principal candidates from the others. For completeness, we recall the basic aspects of this final tool in the case where I = (f).
Surprisingly enough, the data given by only the relative norm and the ideal suffices to recover the generator in polynomial time. Recall that given only a basis of an ideal I, computing a generator is computationally hard: this is the so-called Principal Ideal Problem (PIP), for which state-of-the-art algorithms still run in subexponential time [19].
The main idea of the Gentry-Szydlo algorithm, described more precisely in [24], is to combine algebra in the number field with lattice reduction techniques. Using Fermat's little theorem, for a prime q, we have f^q = f over R_q = (Z/qZ)[X]/(X^m + 1), the cyclotomic ring of integers reduced mod q. Unless f is a zero divisor in R_q, we have f^{q−1} = 1 over R_q. If we compute an LLL-reduced basis B of the ideal (f^{q−1}), in time polynomial in m, q and the bit-length of f, we find a shortest element w = f^{q−1} · a for some a. If ∥a∥_∞ < q/2, we get a = [w]_q exactly, and thus f^{q−1}. Then, we can recover f up to a root of unity in polynomial time.
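The final recovery of a from w = f^{q−1} · a amounts to a coefficientwise centered reduction modulo q. A minimal sketch of that step alone (hypothetical toy coefficients; the lattice reduction itself is not shown):

```python
q = 12289

def centered_mod(x, q):
    """Representative of x mod q in the centered range (-q/2, q/2)."""
    r = x % q
    return r - q if r > q // 2 else r

# hypothetical small vector a with ||a||_inf < q/2; since f^{q-1} = 1 in R_q,
# w = f^{q-1} * a is congruent to a mod q, so a centered coefficientwise
# reduction of w recovers a exactly
a = [3, -7, 0, 5]
w = [ai + q * t for ai, t in zip(a, (2, -1, 0, 3))]   # w = a (mod q)
assert [centered_mod(x, q) for x in w] == a
```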

A.3 The case p = 2
The last case is p = 2, which is very specific since 2 is the only prime that ramifies in Z[ζ_m]. In particular, the only prime above 2 is the principal ideal (1 + ζ_m), so that (f) = (1 + ζ_m)^α, directly giving f up to a unit of the field.
Before seeing how these algebraic considerations can be applied to exploit side-channel traces, let us see how to generalize the algorithm described above to the case of a composite norm N(f).

A.4 Composite case
Due to the inherent multiplicative structure of the problem, knowing how to solve it for every element whose norm is a prime power is actually sufficient to solve any instance. Indeed, we reason à la Chinese remainder theorem: we deal separately with every prime power factor using the cases studied above, and multiplicatively recombine these partial solutions.
Thus, the algorithm starts exactly as before, by computing the algebraic norm of the element f as the square root of the norm of f · f̄. In order to deal with every prime factor, we then factor this norm as

N(f) = 2^{α₂} · ∏_i p_i^{α_i} · ∏_j q_j^{α_j},

where the (p_i)_i are the prime factors congruent to 1 modulo 4 and the (q_j)_j are the prime factors congruent to 3 modulo 4.
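This factorization step can be sketched with sympy (hypothetical toy norm; the partition by congruence class mod 4 matches the case analysis above):

```python
from sympy import factorint

def split_norm(N):
    """Partition the prime factorization of N by congruence class mod 4."""
    fac = factorint(N)                    # {prime: exponent}
    alpha2 = fac.pop(2, 0)                # exponent of 2 (alpha_2 in the text)
    ones = {p: e for p, e in fac.items() if p % 4 == 1}    # the p_i^{alpha_i}
    threes = {p: e for p, e in fac.items() if p % 4 == 3}  # the q_j^{alpha_j}
    return alpha2, ones, threes

# hypothetical toy norm
N = 2**3 * 5**2 * 13 * 3**4 * 7
alpha2, ones, threes = split_norm(N)
assert alpha2 == 3
assert ones == {5: 2, 13: 1}
assert threes == {3: 4, 7: 1}
```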
We first take care of the primes (p_i)_i congruent to 1 modulo 4. For primes appearing with multiplicity one, applying the technique described in Section A.2.2, each such prime p_i yields exactly two ideals above itself dividing (f), along with the guarantee that at least one of them is principal. As such, since the (p_i)_i are coprime, we can construct 2^T possible products, for T the number of primes appearing with multiplicity 1, obtained by taking exactly one ideal above each p_i. Let C₁ be this set of ideals.
We now treat the primes appearing with multiplicity greater than one. In order to fall back on the cases described in Section A.2.2, we need to construct an ideal of norm p_i^{α_i} dividing the principal ideal generated by the relative norm (f · f̄). This is simply the sum of the latter ideal with the principal ideal generated by the element p_i^{α_i} ∈ Z[ζ]. Then, applying the technique described in Section A.2.2, each such prime p_i yields a certain number c_i of candidate ideals above p_i^{α_i} dividing (f), along with the guarantee that at least one of them is principal. As such, since the (p_i)_i are coprime, the 2^T · ∏_i c_i possible products obtained by taking exactly one ideal above each p_i^{α_i} and one ideal from the set C₁ are divisors of (f) above ∏_i p_i^{α_i}. At least one of them is principal.
We then treat the case of the primes (q_j)_j congruent to 3 modulo 4. Let q_j be one of those primes appearing in the factorization of N(f).
In order to fall back on the cases described in Section A.2.2, we need to construct a real ideal N of norm q_j^{α_j} dividing the principal ideal generated by the relative norm (f · f̄). This is simply the sum of the latter ideal with the principal ideal generated by the element q_j^{α_j} ∈ Z[ζ]. As in Section A.2.2, we construct a principal ideal I_{q_j} of norm q_j^{α_j} dividing (f). Performing this construction for every prime q_j, and denoting by R the product of the freshly obtained I_{q_j}, ensures that the principal ideal R is a divisor of the ideal (f) above ∏_j q_j^{α_j}. Finally, we deal with the power of two appearing in the norm. The reasoning is similar to that of Section A.3: the prime power principal ideal D = (1 + ζ_m)^{α₂} is a divisor of (f) above 2^{α₂}.
It is now time to reconstruct candidate ideals from these three parts. Multiplying each of the ∏_i (1 + α_i) candidates obtained from the (p_i)_i with the principal ideal R · D yields a candidate ideal of norm N(f). Taking the sum with the ideal (f · f̄) then gives a list of ∏_i (1 + α_i) ideals satisfying the norm equation, with the guarantee that at least one of them is principal. The final step of the algorithm is then unchanged: finding the generator of the principal ideal by use of Gentry-Szydlo. A way to reduce the running time of this final phase is to process all candidate ideals in parallel and stop as soon as one of the processes returns a generator. The full outline of the algorithm is given in Figure 12.
A.4.1 Remarks on complexity. Performing operations on ideals in an n-dimensional number field is polynomial in n: when working with the HNF representation of ideals, the computation of sums, products or intersections of ideals boils down to basic linear algebra and calls to an HNF oracle, which is known to be polynomial in the dimension (see for instance Chapters 3 to 5 of [9] for a complete introduction to computations with ideals). As noted as early as Dedekind [12], computing the decomposition of a given prime into prime ideals boils down to factoring the defining polynomial Φ_m of the field modulo p, which can be efficiently performed by the Cantor-Zassenhaus or Berlekamp algorithm [52].
As pointed out in [24], the Gentry-Szydlo algorithm runs in time polynomial in the dimension of the field.

A.5 Estimation of the algebraic norm of secrets
Let s be one of the secret elements s_1, s_2. Note that N(s) is equal to the resultant of s (explicitly, the lift of s to Z[X] from its representation in Z[ζ] ≃ Z[X]/(Φ_n(X))) and Φ_n(X) = X^n + 1, the 2n-th cyclotomic polynomial. A classical bound on univariate resultants (see [7] for instance), coming from the Hadamard inequality applied to the Sylvester matrix, ensures that:

N(s) = |Res(s, Φ_n)| ≤ ∥s∥₂^n · ∥Φ_n∥₂^{deg s} ≤ 2^{(n−1)/2} · ∥s∥₂^n.
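As a numerical sanity check of a Hadamard-type resultant bound of this kind, |Res(s, X^n + 1)| ≤ ∥s∥₂^n · (√2)^{n−1} (our reading of the bound, checked here on a hypothetical small secret with n = 16, using sympy for the resultant):

```python
from sympy import resultant, symbols

x = symbols("x")
n = 16
s_coeffs = [3, -1, 0, 2, -2, 1, 0, -3, 1, 1, 0, -1, 2, 0, 1, -2]
# s as a polynomial of degree n-1 (leading coefficient first)
s = sum(c * x**(n - 1 - i) for i, c in enumerate(s_coeffs))
phi = x**n + 1

norm = abs(resultant(s, phi, x))
l2_sq = sum(c * c for c in s_coeffs)        # ||s||_2^2 = 40 here
# integer over-bound: ||s||_2^n * sqrt(2)^(n-1) <= l2_sq^(n/2) * 2^((n+1)//2)
bound = l2_sq ** (n // 2) * 2 ** ((n + 1) // 2)
assert 0 < norm <= bound
```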

Figure 1 :
Figure 1: Description of the BLISS signature algorithm. The random oracle H takes its values in the set of polynomials in R with 0/1 coefficients and Hamming weight exactly κ, for some small constant κ.

Figure 3 :
Figure 3: BLISS rejection sampling implemented in terms of Bernoulli trials.

Step 8 of Sign in Figure 1 actually consists of the instructions described in Figure 3.

Figure 4 :
Figure 4: Exploiting the norm leakage in BLISS.

Figure 7 :
Figure 7: Disassembled code snippet of the bernoulli_exp method, as output by the gdb disassemble/m command.

Figure 13 :
Figure 13: Recovery of the generator f from its relative norm: Case p ≡ 1 mod 4.

Figure 14 :
Figure 14: Recovery of the generator f from its relative norm: Case p ≡ 3 mod 4.

Table 1 :
Estimation of the absolute norms of BLISS secret keys for the security parameters of [14] (experimental averages over 2000 keys per set).

Table 3 :
Average running time of the attack for various field sizes n. The BLISS parameters correspond to n = 256 and n = 512.

Table 4 :
Results of simulated attacks on the multiplication for several levels of noise.