IOS Press

A Complete Generalization of Atkin’s Square Root Algorithm

Armand Stefan Rotaru

Institute of Computer Science, Romanian Academy, Carol I no. 8, 700505 Iasi, Romania

[email protected]

Sorin Iftene

Department of Computer Science, Alexandru Ioan Cuza University, General Berthelot no. 16, 700483 Iasi, Romania

[email protected]

Abstract. Atkin's algorithm [2] for computing square roots in $\mathbb{Z}_p$, where $p$ is a prime such that $p \equiv 5 \bmod 8$, has been extended by Müller [15] for the case $p \equiv 9 \bmod 16$. In this paper we extend Atkin's algorithm to the general case $p \equiv 2^s + 1 \bmod 2^{s+1}$, for any $s \geq 2$, thus providing a complete solution for the case $p \equiv 1 \bmod 4$. Complexity analysis and comparisons with other methods are also provided.

Keywords: Square Roots, Efficient Computation, Complexity

1. Introduction

Computing square roots in finite fields is a fundamental problem in number theory, with major applications related to primality testing [3], factorization [17] or elliptic point compression [10]. In this paper we consider the problem of finding square roots in $\mathbb{Z}_p$, where $p$ is an odd prime. We have to remark that, using Hensel's lemma and the Chinese remainder theorem, the problem of finding square roots modulo any composite number can be reduced to the case of a prime modulus, by considering its prime factorization (for more details, see [4]).

Address for correspondence: Institute of Computer Science, Romanian Academy, Carol I no. 8, 700505 Iasi, Romania


According to Bach and Shallit [4, Notes on Chapter 7, page 194] and Lemmermeyer [13, Exercise 1.16, page 29], Lagrange was the first to derive an explicit formula for the case $p \equiv 3 \bmod 4$, in 1769. According to the same sources ([4, Exercise 1, page 188] and [13, Exercise 1.17, page 29]), the case $p \equiv 5 \bmod 8$ was solved by Legendre in 1785. Atkin [2] also found a simple solution for the case $p \equiv 5 \bmod 8$ in 1992. In 2004, Müller [15] extended Atkin's algorithm to the case $p \equiv 9 \bmod 16$ and left further development of Atkin's algorithm as an open problem. In this paper we extend Atkin's algorithm to the case $p \equiv 2^s + 1 \bmod 2^{s+1}$, for any $s \geq 2$, thus providing a complete solution for the case $p \equiv 1 \bmod 4$. Müller's algorithm and our generalization use quadratic non-residues, and thus they are probabilistic algorithms. We remark that several deterministic approaches for computing square roots modulo a prime $p$ have also been presented in the literature. Schoof [19] proposed an impractical deterministic algorithm of complexity $O((\log_2 p)^9)$. Sze [21] has recently developed a deterministic algorithm for computing square roots which is efficient (its complexity is $\tilde{O}((\log_2 p)^2)$) only for certain primes $p$.

The paper is structured as follows. Section 2 is dedicated to some mathematical preliminaries on quadratic residues and square roots. Section 3 presents Atkin's algorithm and its extension (Müller's algorithm), both based on computing square roots of $-1$ modulo $p$. We present our generalization in Section 4. Its performance, efficient implementation and comparisons with other methods are presented in Section 5. In the last section we briefly discuss the conclusions of our paper and the possibility of adapting our algorithm for other finite fields.

2. Mathematical Background

In this section we will present some basic facts on quadratic residues and square roots. For simplicity of notation, from this point forward we will omit the modular reduction, but the reader must be aware that all computations are performed modulo $p$, if not explicitly stated otherwise.

Let $p$ be a prime and $a \in \mathbb{Z}_p$. We say that $a$ is a quadratic residue modulo $p$ if there exists $b \in \mathbb{Z}_p$ with the property $a = b^2$. Otherwise, $a$ is a quadratic non-residue modulo $p$. It is easy to see that the product of two residues is a residue and that the product of a residue with a non-residue is a non-residue.

If $b^2 = a$ then $b$ will be referred to as a square root of $a$ (modulo $p$) and we will simply denote this fact by $b = \sqrt{a}$. We have to remark that if $a$ is a quadratic residue modulo $p$, $p$ prime, then $a$ has exactly two square roots: if $b$ is a square root of $a$, then $p - b$ is the other one. In particular, $1$ has the square roots $1$ and $-1$ (in this case, $-1$ will be regarded as being $p - 1$) or, equivalently, $a^2 = 1 \Leftrightarrow (a = 1 \lor a = -1)$.

The Legendre symbol of $a$ modulo $p$, denoted as $\left(\frac{a}{p}\right)$, is defined to be equal to $\pm 1$ depending on whether $a$ is a quadratic residue modulo $p$. More exactly,

$$\left(\frac{a}{p}\right) = \begin{cases} 1, & \text{if } a \text{ is a quadratic residue modulo } p;\\ -1, & \text{otherwise.} \end{cases}$$

Euler's criterion states that, for any prime $p$ and $a \in \mathbb{Z}_p$, the following relation holds:

$$a^{\frac{p-1}{2}} = \left(\frac{a}{p}\right).$$


Euler's criterion provides a method of computing the Legendre symbol of $a$ modulo $p$ using an exponentiation modulo $p$, whose complexity is $O((\log_2 p)^3)$. There are faster methods for evaluating the Legendre symbol; see, for example, [8], which presents algorithms of complexity $O\left(\frac{(\log_2 p)^2}{\log_2 \log_2 p}\right)$ for computing the Jacobi symbol (the Jacobi symbol is a generalization of the Legendre symbol to arbitrary moduli).

Another useful property is that $\left(\frac{2}{p}\right) = (-1)^{\frac{p^2-1}{8}}$, which implies that $2$ is a quadratic residue modulo $p$ if and only if $p \equiv \pm 1 \bmod 8$.

If $p$ is prime, $p \equiv 3 \bmod 4$, and $a \in \mathbb{Z}_p$ is a quadratic residue modulo $p$, then $b = a^{\frac{p+1}{4}}$ is a square root of $a$ modulo $p$. Indeed, in this case, $b^2 = a^{\frac{p+1}{2}} = a \cdot a^{\frac{p-1}{2}} = a \cdot \left(\frac{a}{p}\right) = a \cdot 1 = a$. Thus, in this case, finding square roots modulo $p$ requires only a single exponentiation modulo $p$. In the next sections we will focus on the case $p$ prime, $p \equiv 1 \bmod 4$.
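The two facts above translate directly into code. The following is a minimal Python sketch (ours, for illustration only, not taken from the paper) of Euler's criterion and of the single-exponentiation square root for $p \equiv 3 \bmod 4$; the function names are our own.

```python
def legendre(a, p):
    """Legendre symbol via Euler's criterion: a^((p-1)/2) mod p is 1 or p-1."""
    r = pow(a, (p - 1) // 2, p)
    return -1 if r == p - 1 else r

def sqrt_3mod4(a, p):
    """Square root of a quadratic residue a modulo a prime p with p = 3 (mod 4)."""
    assert p % 4 == 3 and legendre(a, p) == 1
    return pow(a, (p + 1) // 4, p)

# Example: 5 = 4^2 (mod 11), and 11 = 3 (mod 4).
b = sqrt_3mod4(5, 11)
assert b * b % 11 == 5
```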

3. Square Root Algorithms based on Computing $\sqrt{-1}$

In this section we present two methods for computing square roots for the cases $p \equiv 5 \bmod 8$ and $p \equiv 9 \bmod 16$, both based on computing square roots of $-1$ modulo $p$.

3.1. Atkin’s Algorithm

Let $p$ be a prime such that $p \equiv 5 \bmod 8$ and $a$ a quadratic residue modulo $p$. Atkin's idea [2] is to express $\sqrt{a}$ as $\sqrt{a} = \alpha a(\beta - 1)$, where $\beta^2 = -1$ and $2a\alpha^2 = \beta$. Indeed, in this case, $(\alpha a(\beta - 1))^2 = a(-2a\alpha^2\beta) = a(-\beta^2) = a$. Moreover, in order to easily determine $\alpha$, it will be convenient that $\beta$ has the form $\beta = (2a)^k$, with $k$ odd. Thus, the major challenge is to find $\sqrt{-1}$ of the mentioned form.

By Euler's criterion, the relation $(2a)^{\frac{p-1}{2}} = -1$ holds ($a$ is a quadratic residue, but $2$ is a quadratic non-residue, therefore $2a$ is a quadratic non-residue), so we can choose $\beta$ as $\beta = (2a)^{\frac{p-1}{4}}$ and $\alpha$ as $\alpha = (2a)^{\frac{\frac{p-1}{4}-1}{2}} = (2a)^{\frac{p-5}{8}}$.

The resulting algorithm is presented in Figure 1.

Atkin's Algorithm(p, a)

input: p prime such that p ≡ 5 mod 8, a ∈ Z_p a quadratic residue;
output: b, a square root of a modulo p;

begin
1. α := (2a)^((p-5)/8);
2. β := 2aα^2;
3. b := αa(β - 1);
4. return b
end.

Figure 1: Atkin's algorithm


Atkin's algorithm requires one exponentiation (in Step 1) and four multiplications (two multiplications in Step 2 and two multiplications in Step 3).
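As a concrete illustration (our own sketch, not the authors' code), Figure 1 translates into a few lines of Python:

```python
def atkin_sqrt(a, p):
    """Square root of a quadratic residue a modulo a prime p with p = 5 (mod 8)."""
    assert p % 8 == 5
    alpha = pow(2 * a, (p - 5) // 8, p)       # Step 1: alpha = (2a)^((p-5)/8)
    beta = 2 * a * alpha * alpha % p          # Step 2: beta = 2a*alpha^2, a square root of -1
    return alpha * a * (beta - 1) % p         # Step 3: b = alpha*a*(beta - 1)

p, a = 13, 10                                 # 13 = 5 (mod 8) and 10 = 6^2 (mod 13)
b = atkin_sqrt(a, p)
assert b * b % p == a
```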

3.2. Müller's Algorithm

Let $p$ be a prime such that $p \equiv 9 \bmod 16$ and $a$ a quadratic residue modulo $p$. Müller [15] has extended Atkin's algorithm by expressing $\sqrt{a}$ as $\sqrt{a} = \alpha a d(\beta - 1)$, where $\beta^2 = -1$ and $2ad^2\alpha^2 = \beta$. Indeed, in this case, $(\alpha a d(\beta - 1))^2 = a(-2ad^2\alpha^2\beta) = a(-\beta^2) = a$. Moreover, in order to easily determine $\alpha$, it will be convenient that $\beta$ has the form $\beta = (2ad^2)^k$, with $k$ odd.

By Euler's criterion, the relation $(2a)^{\frac{p-1}{2}} = 1$ holds ($a$ and $2$ are quadratic residues, therefore $2a$ is a quadratic residue). We have two cases:

(I) $(2a)^{\frac{p-1}{4}} = -1$: in this case we can choose $\beta$ as $\beta = (2a)^{\frac{p-1}{8}}$ and $\alpha$ as $\alpha = (2a)^{\frac{\frac{p-1}{8}-1}{2}} = (2a)^{\frac{p-9}{16}}$ ($d = 1$);

(II) $(2a)^{\frac{p-1}{4}} = 1$: in this case we need a quadratic non-residue $d$; by Euler's criterion, $d^{\frac{p-1}{2}} = -1$ and, thus, $(2ad^2)^{\frac{p-1}{4}} = -1$, so we can choose $\beta$ as $\beta = (2ad^2)^{\frac{p-1}{8}}$ and $\alpha$ as $\alpha = (2ad^2)^{\frac{\frac{p-1}{8}-1}{2}} = (2ad^2)^{\frac{p-9}{16}}$.

The above presentation is in fact a slightly modified variant of the original one: for Case (I), Müller used an arbitrary residue $d$. Kong et al. [11] have remarked that using $d = 1$ in this case leads to an important improvement of the performance of Müller's original algorithm, by requiring only one exponentiation for half of the squares in $\mathbb{Z}_p$ (Case (I)) and two for the rest (Case (II)).

The resulting algorithm is presented in Figure 2.

In case $(2a)^{\frac{p-1}{4}} = -1$, Müller's algorithm requires one exponentiation (Step 1) and five multiplications (two multiplications in Step 2, one multiplication in Step 3 and two multiplications in Step 4). In case $(2a)^{\frac{p-1}{4}} = 1$, Müller's algorithm, besides the operations in Steps 1-3, requires one more exponentiation (Step 8) and eight more multiplications (one multiplication in Step 8, four multiplications in Step 9 and three multiplications in Step 10). Additionally, Step 7 requires, on average, two quadratic character evaluations (generate randomly $d \in \mathbb{Z}_p$ until $\left(\frac{d}{p}\right) = -1$; because half of the elements are quadratic non-residues, two generations are required on average). It is interesting to remark that Ankeny [1] has proven that, by assuming the Extended Riemann Hypothesis (ERH), the least quadratic non-residue modulo $p$ is in $O((\log_2 p)^2)$. As a consequence, in this case, the presented probabilistic algorithm for finding a quadratic non-residue can be transformed into a deterministic polynomial time algorithm of complexity $O((\log_2 p)^4)$.

4. A Complete Generalization of Atkin’s Square Root Algorithm

In this section we extend Atkin's algorithm to the case $p \equiv 2^s + 1 \bmod 2^{s+1}$, for any $s \geq 2$, thus providing a complete solution for the case $p \equiv 1 \bmod 4$. For any prime $p$, with $p \equiv 1 \bmod 4$, we can express $p - 1$ as $p - 1 = 2^s t$, where $s \geq 2$ and $t$ is odd. If we write $t$ as $t = 2t' + 1$, we obtain that $p = 2^{s+1}t' + 2^s + 1$, which implies that $p \equiv 2^s + 1 \bmod 2^{s+1}$.


Müller's Algorithm(p, a)

input: p prime such that p ≡ 9 mod 16, a ∈ Z_p a quadratic residue;
output: b, a square root of a modulo p;

begin
1. α := (2a)^((p-9)/16);
2. β := 2aα^2;
3. if β^2 = -1
4.   then b := αa(β - 1);
5. else
6. begin
7.   generate d, a quadratic non-residue modulo p;
8.   α := αd^((p-9)/8);
9.   β := 2ad^2α^2;
10.  b := αad(β - 1);
11. end
12. return b
end.

Figure 2: Müller's algorithm
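Similarly, Figure 2 can be transcribed into Python as follows (again our own sketch; the non-residue $d$ is found here by trial search instead of random generation):

```python
def mueller_sqrt(a, p):
    """Square root of a quadratic residue a modulo a prime p with p = 9 (mod 16)."""
    assert p % 16 == 9
    alpha = pow(2 * a, (p - 9) // 16, p)             # Step 1
    beta = 2 * a * alpha * alpha % p                 # Step 2
    if beta * beta % p == p - 1:                     # Case (I): beta^2 = -1, so d = 1
        return alpha * a * (beta - 1) % p            # Step 4
    d = 2                                            # Step 7: a quadratic non-residue
    while pow(d, (p - 1) // 2, p) != p - 1:
        d += 1
    alpha = alpha * pow(d, (p - 9) // 8, p) % p      # Step 8: alpha := alpha * d^((p-9)/8)
    beta = 2 * a * d * d * alpha * alpha % p         # Step 9: beta := 2*a*d^2*alpha^2
    return alpha * a * d * (beta - 1) % p            # Step 10: b := alpha*a*d*(beta - 1)

p = 41                                               # 41 = 9 (mod 16)
for a in (23, 9):                                    # 23 = 8^2 hits Case (I), 9 = 3^2 hits Case (II)
    b = mueller_sqrt(a, p)
    assert b * b % p == a
```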

We will express $\sqrt{a}$ as $\sqrt{a} = \alpha a(\beta - 1)d^{norm}$, where $\beta^2 = -1$, $d$ is a quadratic non-residue modulo $p$, $norm \geq 0$, and $2ad^{2\cdot norm}\alpha^2 = \beta$. Indeed, in this case, $(\alpha a(\beta - 1)d^{norm})^2 = a(-2ad^{2\cdot norm}\alpha^2\beta) = a(-\beta^2) = a$. Moreover, in order to easily determine $\alpha$, it will be convenient that $\beta$ has the form $\beta = (2ad^{2\cdot norm})^k$, with $k$ odd.

The key point of our generalization is

Base Case: $(2ad^{2\cdot norm})^{\frac{p-1}{2^{s-1}}} = -1$, for some $norm \geq 0$.

In this case, because $\frac{p-1}{2^s}$ is odd, we can choose $\beta$ as $\beta = (2ad^{2\cdot norm})^{\frac{p-1}{2^s}}$ and $\alpha$ as

$$\alpha = (2ad^{2\cdot norm})^{\frac{\frac{p-1}{2^s}-1}{2}} = (2ad^{2\cdot norm})^{\frac{p-(2^s+1)}{2^{s+1}}} = (2ad^{2\cdot norm})^{\frac{t-1}{2}}.$$

In contrast to Müller's impractical attempt of further generalizing Atkin's approach ([15, Remark 2]), we focus on finding an adequate value for $norm$, the exponent of $d$ such that the Base Case is satisfied.

In order to derive the value of $norm$, we use the following results:

Theorem 4.1. Let $p$ be an odd prime, $p - 1 = 2^s t$ ($s \geq 3$, $t$ odd), $a$ a quadratic residue modulo $p$, and $d$ a quadratic non-residue modulo $p$. Then, for all $1 \leq i \leq s-1$, the following statement holds:

$$(\exists\, norm \in \mathbb{N})\left((2ad^{2\cdot norm})^{\frac{p-1}{2^i}} = 1\right) \Rightarrow (\exists\, norm' \in \mathbb{N})\left((2ad^{2\cdot norm'})^{\frac{p-1}{2^{s-1}}} = -1\right)$$


Proof:
We use induction on $i$.

Initial Case. For $i = s-1$ the reasoning is very simple. If there is a natural number $norm$ such that $(2ad^{2\cdot norm})^{\frac{p-1}{2^{s-1}}} = 1$ then, using that $d^{\frac{p-1}{2}} = -1$ (or, $(d^{2^{s-2}})^{\frac{p-1}{2^{s-1}}} = -1$), we obtain that $(2ad^{2\cdot norm}d^{2^{s-2}})^{\frac{p-1}{2^{s-1}}} = -1$, and, furthermore, $(2ad^{2\cdot(norm+2^{s-3})})^{\frac{p-1}{2^{s-1}}} = -1$. Thus, we may choose $norm' = norm + 2^{s-3}$.

Inductive Case. Let us consider an arbitrary number $i$, $1 \leq i < s-1$. We assume that the statement holds for the case $i+1$ and we will prove it for the case $i$.

If there is a natural number $norm$ such that $(2ad^{2\cdot norm})^{\frac{p-1}{2^i}} = 1$, or, $\left((2ad^{2\cdot norm})^{\frac{p-1}{2^{i+1}}}\right)^2 = 1$, then $(2ad^{2\cdot norm})^{\frac{p-1}{2^{i+1}}} = \pm 1$. We have two cases:

– If $(2ad^{2\cdot norm})^{\frac{p-1}{2^{i+1}}} = 1$ then, using the inductive hypothesis, we directly obtain that $(\exists\, norm' \in \mathbb{N})((2ad^{2\cdot norm'})^{\frac{p-1}{2^{s-1}}} = -1)$;

– If $(2ad^{2\cdot norm})^{\frac{p-1}{2^{i+1}}} = -1$ then, using that $d^{\frac{p-1}{2}} = -1$ (or, equivalently, $(d^{2^i})^{\frac{p-1}{2^{i+1}}} = -1$), we obtain that $(2ad^{2\cdot norm}d^{2^i})^{\frac{p-1}{2^{i+1}}} = 1$, and, furthermore, $(2ad^{2\cdot(norm+2^{i-1})})^{\frac{p-1}{2^{i+1}}} = 1$.

Finally, using the inductive hypothesis, we obtain that the required statement holds. ⊓⊔

The previous theorem leads to the following:

Corollary 4.2. Let $p$ be an odd prime, $p - 1 = 2^s t$ ($s \geq 2$, $t$ odd), $a$ a quadratic residue modulo $p$, and $d$ a quadratic non-residue modulo $p$. Then there exists $norm \in \mathbb{N}$ such that

$$(2ad^{2\cdot norm})^{\frac{p-1}{2^{s-1}}} = -1.$$

Proof:
For $s = 2$, we obtain directly $norm = 0$, because in this case $2$ is a quadratic non-residue modulo $p$ and the relation $(2a)^{\frac{p-1}{2}} = -1$ holds.

For $s \geq 3$, $2$ is a quadratic residue and thus we have $(2a)^{\frac{p-1}{2}} = 1$. Using Theorem 4.1 for $i = 1$ (with $norm = 0$), we obtain that there is $norm' \in \mathbb{N}$ such that $(2ad^{2\cdot norm'})^{\frac{p-1}{2^{s-1}}} = -1$. ⊓⊔

Therefore, all other possible cases can be recursively reduced to the Base Case as presented above.

In order to further clarify the points made so far, we will now give an algorithmic description of our generalization. We will use a special subroutine named FindPlace (presented in Figure 3), in which, starting with certain values for $a$ and $norm$ that satisfy $(2ad^{2\cdot norm})^{\frac{p-1}{2^i}} = 1$, for some $i$, we will search for a place $j$ as close as possible to $s-1$ such that $temp = (2ad^{2\cdot norm})^{\frac{p-1}{2^j}} = \pm 1$.

Furthermore, we will also formulate the Base Case as a subroutine in Figure 4.

Finally, the main part of our algorithm is presented in Figure 5.


FindPlace(a, norm)

begin
1. if norm = 0 then temp := (2a)^t
2. else temp := (2ad^(2·norm))^t;
3. j := s;
4. repeat
5.   j := j - 1;
6.   temp := temp^2;
7. until (temp = 1 ∨ temp = -1)
8. return (j, temp)
end.

Figure 3: FindPlace Subroutine

BaseCase(a, norm)

begin
1. α := (2ad^(2·norm))^((t-1)/2);
2. β := (2ad^(2·norm))α^2;
3. b := αa(β - 1)d^norm;
4. return b
end.

Figure 4: BaseCase Subroutine

Remark 4.3. For the clarity of the presentation, we believe it is also necessary to make some comments and prove some statements on the Generalized Atkin Algorithm and its subroutines:

1. The variable $norm$ contains the current value of the normalization exponent.

2. Some useful properties of the subroutine FindPlace are presented next:

(a) If the outputted value $j$ of the subroutine FindPlace is not equal to $s-1$, then the corresponding value $temp$ will be $-1$.

Proof:
Because $j < s-1$, at least two iterations of the repeat-until loop have been performed (because initially $j = s$ and then $j$ is decremented in each iteration). If we assume by contradiction that the final value of $temp$ is $1$, then the previous value $temp'$ satisfies $temp' = \pm 1$ (because $temp = temp'^2$ in Step 6), and, thus, the algorithm had to terminate at the previous iteration. ⊓⊔

(b) Let $(j, temp)$ and $(j', temp')$ be the outputs of two consecutive calls of the subroutine FindPlace. Then $j < j'$.


Generalized Atkin Algorithm(p, a)

input: p prime such that p ≡ 1 mod 4, a ∈ Z_p a quadratic residue;
output: b, a square root of a modulo p;

begin
1. determine s ≥ 2 and t odd such that p - 1 = 2^s t;
2. generate d, a quadratic non-residue modulo p;
3. norm := 0;
4. (j, temp) := FindPlace(a, norm);
5. while (j < s - 1)
6. begin
7.   norm := norm + 2^(j-2);
8.   (j, temp) := FindPlace(a, norm);
9. end
10. if (temp = -1) then BaseCase(a, norm)
11. if (temp = 1) then
12. begin
13.   norm := norm + 2^(s-3);
14.   BaseCase(a, norm);
15. end
end.

Figure 5: Generalized Atkin Algorithm

Proof:
Let us first point out that $j < s-1$ (otherwise, if $j = s-1$, there will not be another call of FindPlace, since the algorithm will end with a call of BaseCase), which implies that $j + 1 \leq s-1$. Therefore, we obtain $temp = (2ad^{2\cdot norm})^{\frac{p-1}{2^j}} = -1$. Furthermore, we have $(2ad^{2\cdot norm'})^{\frac{p-1}{2^j}} = 1$ (since $norm' = norm + 2^{j-2}$ and $(d^{2^{j-1}})^{\frac{p-1}{2^j}} = d^{\frac{p-1}{2}} = -1$), which implies that $(2ad^{2\cdot norm'})^{\frac{p-1}{2^{j+1}}} = \pm 1$, leading to $j + 1 \leq j'$ (because $j'$ is the greatest element less than or equal to $s-1$ such that $(2ad^{2\cdot norm'})^{\frac{p-1}{2^{j'}}} = \pm 1$). ⊓⊔

3. If $p \equiv 5 \bmod 8$, i.e., $s = 2$, then FindPlace will be called exactly once (with $a$ and $norm = 0$) and it will output $j = s-1 = 1$ and $temp = -1$; in this case, the subroutine BaseCase will directly lead to the final result (no normalization is required). Thus, we have obtained Atkin's algorithm as a particular case of our algorithm.

4. If $p \equiv 9 \bmod 16$, i.e., $s = 3$, then FindPlace will be called exactly once (with $a$ and $norm = 0$) and it will output $j = s-1 = 2$ and $temp = \pm 1$. Two subcases are possible:

• In case $temp = -1$, the subroutine BaseCase will lead directly to the final result (no normalization is required);


• In case $temp = 1$, the normalization exponent will be updated as $norm = 0 + 2^{3-3} = 1$ and the subroutine BaseCase will be called. Consequently, the final result will be computed as $b := \alpha a(\beta - 1)d^1$ (Step 3 of BaseCase).

Thus, we have obtained Müller's algorithm as a particular case of our algorithm.
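To make the preceding description concrete, the following is a compact Python transcription of Figures 3-5. This is our own sketch (the paper gives pseudocode only); for simplicity the quadratic non-residue $d$ is found by trial search rather than random generation, and the function names are ours.

```python
def generalized_atkin_sqrt(a, p):
    """Square root of a quadratic residue a modulo a prime p with p = 1 (mod 4)."""
    s, t = 0, p - 1                              # Step 1: p - 1 = 2^s * t, t odd
    while t % 2 == 0:
        s, t = s + 1, t // 2
    assert s >= 2
    d = 2                                        # Step 2: a quadratic non-residue d
    while pow(d, (p - 1) // 2, p) != p - 1:
        d += 1

    def find_place(norm):                        # Figure 3: FindPlace(a, norm)
        temp = pow(2 * a * pow(d, 2 * norm, p) % p, t, p)
        j = s
        while True:                              # repeat ... until temp = +/-1
            j, temp = j - 1, temp * temp % p
            if temp == 1 or temp == p - 1:
                return j, temp

    def base_case(norm):                         # Figure 4: BaseCase(a, norm)
        base = 2 * a * pow(d, 2 * norm, p) % p
        alpha = pow(base, (t - 1) // 2, p)
        beta = base * alpha * alpha % p
        return alpha * a * (beta - 1) * pow(d, norm, p) % p

    norm = 0                                     # Figure 5: main loop
    j, temp = find_place(norm)
    while j < s - 1:
        norm += 1 << (j - 2)
        j, temp = find_place(norm)
    if temp == 1:                                # otherwise temp = -1 and no adjustment is needed
        norm += 1 << (s - 3)
    return base_case(norm)

p, a = 12289, 2564                               # the pair used later in Example 5.1 (s = 12, t = 3)
b = generalized_atkin_sqrt(a, p)
assert b * b % p == a
```

For $p \equiv 5 \bmod 8$ the while loop is never entered and $norm$ stays $0$, which is exactly the Atkin special case noted in item 3 above.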

5. Efficient Implementation and Performance Analysis

We start with the average-case and worst-case complexity analysis of our initial algorithm and then we discuss several improvements for efficient implementation. Finally, we present several comparisons with the most important generic square root computing methods, namely Tonelli-Shanks and Cipolla-Lehmer.

5.1. Average-Case and Worst-Case Complexity Analysis

We will consider the cases $s \geq 4$ (for $s = 2$ and $s = 3$ we obtain Atkin's algorithm and Müller's algorithm, respectively, whose complexities have been discussed in Section 3). Our algorithm determines the value of $norm$ by calling the subroutine FindPlace for each 1 digit in the binary expression of $norm$. Therefore, the algorithm makes $Hw(norm)$ calls to FindPlace, where $Hw(x)$ denotes the Hamming weight of $x$ (i.e., the number of 1's in $x$).

Let E denote one exponentiation, M one multiplication, and S one squaring (all these operations are performed modulo $p$). Our subroutines will involve:

• FindPlace: if the output is $(j, temp)$, then at most $2E + 1M + (s-j)S$;

• BaseCase: at most $3E + 6M + 1S$.

We exclude the complexity of generating a quadratic non-residue $d$. All the other computations can be considered negligible (if $norm$ is represented in base 2 then the step $norm := norm + 2^{j-2}$ implies only setting a certain bit to 1). In the average case, we have $Hw(norm) = \frac{s-2}{2}$, which means that our algorithm will include $\frac{s-2}{2}$ calls to FindPlace and a call to BaseCase. Thus, the total number of operations is, on average, the following:

$$\frac{s-2}{2}(2E + 1M) + \frac{\sum_{j=2}^{s-1}(s-j)}{2}S + 3E + 6M + 1S = (s-2)E + \frac{s-2}{2}M + \frac{(s-1)(s-2)}{4}S + 3E + 6M + 1S = (s+1)E + \frac{s+10}{2}M + \frac{s^2-3s+6}{4}S$$

In contrast, in the worst case, $norm$ will have $Hw(norm) = s-2$ (i.e., all the bits of $norm$ will be equal to 1), resulting in $(s-2)$ calls to FindPlace and a call to BaseCase. The total number of operations now becomes the following:

$$(s-2)(2E + 1M) + \sum_{j=2}^{s-1}(s-j)S + 3E + 6M + 1S = 2(s-2)E + (s-2)M + \frac{(s-1)(s-2)}{2}S + 3E + 6M + 1S = (2s-1)E + (s+4)M + \frac{s^2-3s+4}{2}S$$

Once more, we do not count the generation of a quadratic non-residue $d$. Consequently, both the average-case and the worst-case complexity of our initial algorithm are in $O((\log_2 p)^4)$.
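For quick reference, the two closed-form operation counts just derived can be evaluated programmatically; the small helper below (ours) simply transcribes them.

```python
def initial_algorithm_cost(s, worst=False):
    """(exponentiations, multiplications, squarings) of the initial algorithm."""
    if worst:
        return 2 * s - 1, s + 4, (s * s - 3 * s + 4) / 2
    return s + 1, (s + 10) / 2, (s * s - 3 * s + 6) / 4

print(initial_algorithm_cost(12))        # average case for s = 12: (13, 11.0, 28.5)
print(initial_algorithm_cost(12, True))  # worst case for s = 12:   (23, 16, 56.0)
```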


5.2. More Efficient Implementation

It is obvious that several steps (especially the steps that involve exponentiations) of our algorithm can be performed much more efficiently compared to their raw implementation. A first solution is to precompute several powers of $d$, to keep track of $d^{norm}$ and $(d^{norm})^{\frac{t-1}{2}}$ as $norm$ is updated, and to efficiently recompute the value of $temp$ from Step 2 of FindPlace by using the previous values. Moreover, in this case, the final exponentiations (from BaseCase) can also be performed efficiently.

We will now examine the computations behind our algorithm more closely, in order to point out possible improvements, if precomputations can be afforded. We begin by defining the elements $D_j$, $0 \leq j \leq s-1$, $D_j = D^{2^j}$, $D = d^t$, and $A_j$, $0 \leq j \leq s-2$, $A_j = A^{2^j}$, $A = (2a)^t$. Let us denote $A_j \cdot D^{temp\_norm}$ by $\langle j, temp\_norm \rangle$, where $temp\_norm$ is in binary form. In our algorithm, we determine the value of $norm = (f_{s-3} \ldots f_0)_2$ by successively computing the digits $f_0, f_1, \ldots, f_{s-3}$, so that:

$$T_{s-2} = \langle s-2,\ f_0\underbrace{0\ldots0}_{s-1\ \text{0s}} \rangle = 1$$
$$T_{s-3} = \langle s-3,\ f_1 f_0\underbrace{0\ldots0}_{s-2\ \text{0s}} \rangle = 1$$
$$\vdots$$
$$T_2 = \langle 2,\ f_{s-4}\ldots f_1 f_0\, 000 \rangle = 1$$
$$T_1 = \langle 1,\ f_{s-3}\ldots f_1 f_0\, 00 \rangle = -1$$

We remark that the last element $T_1$ is exactly the element from the Base Case: $(2ad^{2\cdot norm})^{\frac{p-1}{2^{s-1}}}$. The reader will notice that for any $0 \leq j \leq s-2$, we would compute $T_j = A_j \cdot D^{(f_{s-2-j}\ldots f_1 f_0 0\ldots 0)_2}$ in a naive manner, by multiplying $A_j$ with all the $(2^i)$th powers of $D$ corresponding to the 1 bits from $f_{s-2-j}\ldots f_1 f_0 0\ldots 0$. In order to reduce the number of modular multiplications, let us choose a fixed, small integer $k$, with $k \geq 1$, and consider the $k$ terms $T_j, T_{j-1}, \ldots, T_{j-k+1}$, where $j - k + 1 > 1$. We obtain the following sequence:

$$T_j = \langle j,\ f_{s-2-j} f_{s-3-j}\ldots f_1 f_0\underbrace{0\ldots0}_{j+1\ \text{0s}} \rangle = 1$$
$$T_{j-1} = \langle j-1,\ f_{s-2-j+1} f_{s-2-j} f_{s-3-j}\ldots f_1 f_0\underbrace{0\ldots0}_{j\ \text{0s}} \rangle = 1$$
$$\vdots$$
$$T_{j-k+2} = \langle j-k+2,\ f_{s-2-j+k-2}\ldots f_{s-2-j} f_{s-3-j}\ldots f_1 f_0\underbrace{0\ldots0}_{j-k+3\ \text{0s}} \rangle = 1$$
$$T_{j-k+1} = \langle j-k+1,\ f_{s-2-j+k-1}\ldots f_{s-2-j} f_{s-3-j}\ldots f_1 f_0\underbrace{0\ldots0}_{j-k+2\ \text{0s}} \rangle = 1$$


Importantly, the sequence $f_{s-3-j}\ldots f_1 f_0$ appears in all of the above terms. Let us denote the term $D^{(f_{s-3-j}\ldots f_1 f_0\overbrace{0\ldots0}^{j-k+2\ \text{0s}})_2}$ by $aux$.

We notice that $T_j, T_{j-1}, \ldots, T_{j-k+1}$ can be computed in the following way, once we have determined $T_{j+1}$ (and, implicitly, $f_{s-3-j}, \ldots, f_1, f_0$):

$$T_j = A_j \cdot \underbrace{D_{s-1}^{f_{s-2-j}}}_{S_0} \cdot aux^{2^{k-1}}$$
$$T_{j-1} = A_{j-1} \cdot \underbrace{D_{s-1}^{f_{s-2-j+1}}}_{S_1} \cdot \underbrace{D_{s-2}^{f_{s-2-j}}}_{S_0} \cdot aux^{2^{k-2}}$$
$$\vdots$$
$$T_{j-k+2} = A_{j-k+2} \cdot \underbrace{D_{s-1}^{f_{s-2-j+k-2}}}_{S_{k-2}} \cdots \underbrace{D_{s-k+2}^{f_{s-2-j+1}}}_{S_1} \cdot \underbrace{D_{s-k+1}^{f_{s-2-j}}}_{S_0} \cdot aux^{2}$$
$$T_{j-k+1} = A_{j-k+1} \cdot \underbrace{D_{s-1}^{f_{s-2-j+k-1}}}_{S_{k-1}} \cdot \underbrace{D_{s-2}^{f_{s-2-j+k-2}}}_{S_{k-2}} \cdots \underbrace{D_{s-k+1}^{f_{s-2-j+1}}}_{S_1} \cdot \underbrace{D_{s-k}^{f_{s-2-j}}}_{S_0} \cdot aux$$

We added underbraces with subscripts to the terms in order to highlight the fact that it is useful to see terms which have the same exponent as being part of a larger set. We will show how we can efficiently generate the powers of $aux$ and the sets $S_w$, for $0 \leq w \leq k-1$.

Firstly, we compute $aux$ in the regular manner, and then $aux^{2^i}$, for $1 \leq i \leq k-1$, through $k-1$ modular squarings. This way, we use only $k-1$ modular squarings, instead of $(k-1) \cdot Hw(f_{s-3-j}\ldots f_1 f_0)$ regular modular multiplications.

Secondly, we still have to compute the sets of terms $S_w = \{D_{s-k+w+z}^{f_{s-2-j+w}} \mid 0 \leq z \leq k-1-w\}$, for $0 \leq w \leq k-1$. Intuitively, the set $S_w$ contains all the terms located $w$ positions below the main diagonal, for $0 \leq w \leq k-1$. For each $w$, if $f_{s-2-j+w} = 0$ (as $T_{s-2-j+w} = 1$), we do not have to compute anything, because $S_w = \{1\}$, while if $f_{s-2-j+w} = 1$ (as $T_{s-2-j+w} = -1$), each set can be easily generated by taking $D_{s-k+w}$ and applying $k-w-1$ modular squarings.

Inner Loop

1. compute aux
2. set T_(j-k+1) := A_(j-k+1) · aux;
3. for w = 0 to k-1 do:
4.   determine f_(s-2-j+w)
5.   update T_(j-k+1) by setting T_(j-k+1) := T_(j-k+1) · D_(s-k+w)
6.   for i = 2 to k-w-1 do:
7.     update T_(j-k+i) by setting T_(j-k+i) := T_(j-k+i-1)^2

Figure 6: Inner Loop


Thirdly, by storing the terms $T_j, T_{j-1}, \ldots, T_{j-k+1}$ we can efficiently combine the two aforementioned improvements, as presented in Figure 6.

For each value of $w$ between $0$ and $k-1$, running the inner loop generates both the necessary powers of $aux$ and the set $S_w$. Once the outer loop is completed, we have determined $k$ new digits from the binary representation of $norm$. We repeat this procedure until we know all the bits of $norm$.

Finally, we can also simplify the last computations of our initial algorithm. The standard procedure would be to first calculate the terms $\alpha = (2ad^{2\cdot norm})^{\frac{t-1}{2}}$ and $\beta = (2ad^{2\cdot norm})\alpha^2$, and then to generate the square root $b = \alpha a(\beta - 1)d^{norm}$. However, if we elaborate the expression of $b$ we obtain

$$b = a\alpha d^{norm}(\beta - 1) = a(2a)^{\frac{t-1}{2}}d^{2\cdot norm\cdot\frac{t-1}{2}}d^{norm}\left((2a)^t(d^{2\cdot norm})^t - 1\right) = a(2a)^{\frac{t-1}{2}}d^{t\cdot norm}\left((2a)^t(d^{t\cdot norm})^2 - 1\right)$$

Once we have computed $(2a)^{\frac{t-1}{2}}$, we can then easily modify the final run of the inner loop in order to generate $D^{norm} = (d^t)^{norm}$ and to compute the value of $b$. When combined, our suggestions lead to a significantly improved version of our initial algorithm. The precomputation stage is as follows:

Precomputation(p, a, k)

input: p prime such that p ≡ 1 mod 4, p - 1 = 2^s t, t odd;
       k, 1 ≤ k ≤ s, a precomputation parameter;
output: D_j, 0 ≤ j ≤ s-1, D_j = D^(2^j), D = d^t, d a quadratic non-residue;
        aux_A = (2a)^((t-1)/2), A = (2a)^t and A_(s-1-k·i), 1 ≤ i ≤ q,
        where q = ⌊(s-2)/k⌋, A_(s-1-k·i) = (2a)^(t·2^(s-1-k·i))

begin
1. generate and store d (by any means available);
2. compute and store D and D_j, 0 ≤ j ≤ s-1 (by square-and-multiply exponentiation);
3. compute and store aux_A, A and A_(s-1-k·i), 1 ≤ i ≤ q (by square-and-multiply exponentiation);
end

Figure 7: Precomputation Subroutine

We have precomputed all the required powers of $D$, but only certain powers of $A$. It is not necessary to keep all the powers of $A$, since the missing powers can be generated as they are needed. This is because the algorithm behind the third improvement uses only one stored power of $A$ and implicitly employs $k-1$ other powers of $A$, which are kept only for the duration of the outer loop. We have also computed the term $aux_A = (2a)^{\frac{t-1}{2}}$, which is part of the final improvement. Moreover, we assume that we have enough memory capacity to store $k$ numbers, namely $ACC_h$, where $1 \leq h \leq k$. These numbers are exactly $T_j, T_{j-1}, \ldots, T_{j-k+1}$, as used in the description of the Inner Loop.
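As an illustration of the precomputation stage, here is a small Python sketch of our reading of Figure 7; the paper leaves the generation of $d$ open, and we simply use trial search here.

```python
def precompute(p, s, t, a, k):
    d = 2                                         # Step 1: a quadratic non-residue (trial search)
    while pow(d, (p - 1) // 2, p) != p - 1:
        d += 1
    D = pow(d, t, p)
    Dj = [pow(D, 1 << j, p) for j in range(s)]    # Step 2: D_j = D^(2^j), 0 <= j <= s-1
    A = pow(2 * a, t, p)                          # Step 3: A = (2a)^t
    auxA = pow(2 * a, (t - 1) // 2, p)            #         aux_A = (2a)^((t-1)/2)
    q = (s - 2) // k
    Ai = {s - 1 - k * i: pow(A, 1 << (s - 1 - k * i), p) for i in range(1, q + 1)}
    return d, Dj, A, auxA, Ai

# For p = 12289, s = 12, t = 3, a = 2564, k = 3 this reproduces the D_i and A_i values
# of Example 5.1 provided d = 19 is used; the trial search above may pick a different d.
```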


The main part of our improved algorithm is presented in Figure 8. Before running the actual algorithm, the Precomputation subroutine must be called. Note, however, that if $p$ is a priori known, Steps 1 and 2 from Precomputation need to be performed only once (and this may be done in advance), while Step 3 must be repeated for each $a$.

Improved Generalized Atkin Algorithm(p, a, k)

input: p prime such that p ≡ 1 mod 4, p - 1 = 2^s t, t odd;
       a ∈ Z_p a quadratic residue;
output: b, a square root of a modulo p;

begin
1. step := s - 2;
2. aux_norm := 0 = (e_(s-1) ... e_0)_2;   (the final aux_norm is 4·norm)
3. q := ⌊(s-2)/k⌋;
4. rem := (s-2) mod k + 1;
5. for i = 1 to q do
6. begin
7.   Complete Accumulator Update(step, k);
8.   Complete Inner Loop(step, k);
9. end
10. Final Accumulator Update and Inner Loop(step, k, rem);
11. b := a · aux_A · aux_ACC · (A · aux_ACC^2 - 1);
12. return b
end.

Figure 8: Improved Generalized Atkin Algorithm

The first two subroutines correspond to the Inner Loop (described in Figure 6) in the following manner:

• Complete Accumulator Update (presented in Figure 9) implements Steps 1 and 2, computing $aux$ and $T_{j-k+1}$.

• Complete Inner Loop (presented in Figure 10) implements the loop in Steps 3 through 7, computing the bits $f_{s-2-j}, \ldots, f_{s-3-j+k}$.

Final Accumulator Update and Inner Loop (presented in Figures 11 and 12) is an incomplete combination of a Complete Accumulator Update and a Complete Inner Loop, for determining the remaining bits of $norm$, since $s-2$ may not be an exact multiple of $k$. Moreover, a slight adjustment is made in order to obtain the term $aux_{ACC} = D^{norm} = (d^t)^{norm}$. We consider the case $s = 2$ separately and set $aux_{ACC} := 1$, since this case does not fit in the general framework. For $s > 2$, the last part of this subroutine (Steps 15-34) computes the term $T_1$, which must be treated individually, as $T_1 = -1$ while all other $T_i$'s are equal to 1, for $2 \leq i \leq s-2$.


Complete Accumulator Update(step, k)

begin
1. ACC_1 := A_(step-k+1) · ∏_{j=step+2}^{s-1} (D_(j-k))^(e_j);
2. for j = 2 to k do ACC_j := ACC_(j-1)^2;
end

Figure 9: Complete Accumulator Update Subroutine

Complete Inner Loop(step, k)

begin
1. for j = k downto 1 do
2. begin
3.   aux_norm := aux_norm / 2;
4.   if ACC_j = -1 then
5.   begin
6.     ACC_1 := ACC_1 · D_(s-j);
7.     for h = 2 to j-1 do ACC_h := ACC_(h-1)^2;
8.     e_(s-1) := 1;
9.   end
10.  step := step - 1;
11. end
end

Figure 10: Complete Inner Loop Subroutine


Final Accumulator Update and Inner Loop(step, k, rem)

begin
1. aux_ACC := ∏_{j=step+2}^{s-1} (D_(j-step-2))^(e_j);
2. ACC_1 := A · aux_ACC^2;
3. for j = 2 to rem do ACC_j := ACC_(j-1)^2;
4. for j = rem-1 downto 2 do
5. begin
6.   aux_norm := aux_norm / 2;
7.   if ACC_j = -1 then
8.   begin
9.     aux_ACC := aux_ACC · D_(s-1-j);
10.    ACC_1 := A · aux_ACC^2;
11.    for h = 2 to j-1 do ACC_h := ACC_(h-1)^2;
12.    e_(s-1) := 1;
13.  end
14. end
15. if rem = 1 then
16.   if s = 2 then aux_ACC := 1;
17.   else
18.     if e_(s-1) = 0 then
19.     begin
20.       aux_ACC := aux_ACC · D_(s-3);
21.       e_(s-1) := 1;
22.     end
23.     else begin
24.       aux_ACC := aux_ACC · D_(s-3) · D_(s-2) · D_(s-1);
25.       e_(s-1) := 0;
26.     end

Figure 11: Final Accumulator Update and Inner Loop Subroutine


27. else begin
28.   aux_norm := aux_norm / 2;
29.   if ACC_2 = 1 then
30.   begin
31.     aux_ACC := aux_ACC · D_(s-3);
32.     e_(s-1) := 1;
33.   end
34. end
end

Figure 12: Final Accumulator Update and Inner Loop Subroutine (continued)

Example 5.1 illustrates the application of our improved algorithm.

Example 5.1. Let us consider $p = 12289$ ($s = 12$, $t = 3$) and $a = 2564$ ($2564$ is a quadratic residue modulo $12289$). We choose $d = 19$, $k = 3$ (therefore, $q = 3$), and obtain the following values:

i      0      1      2      3      4      5
D_i    6859   3589   2049   7852   12280  81
A_i    8835   9786   9908   3932   1062   9545

i      6      7      8      9      10     11
D_i    6561   10643  5736   4043   1479   12288
A_i    8668   11567  5146   10810  12288  -

However, we will only store the $D_i$'s, for $0 \leq i \leq 11$, as well as $A_0$, $A_2$, $A_5$, $A_8$ and $aux_A = 5128$.

We obtain $step = 10$, $aux_{norm} = 0$ and $rem = 2$.

For $i = 1$, we update the accumulators so that $ACC_1 = A_8 = 5146$, $ACC_2 = A_9 = 10810$ and $ACC_3 = A_{10} = -1$.

Entering the Complete Inner Loop, we have:

• since $ACC_3 = -1$, we have $ACC_1 = ACC_1 \cdot D_9 = 1$ and $ACC_2 = ACC_1^2 = 1$. Moreover, $e_{11} = 1$, $aux_{norm} = 0/2 + 2048 = 2048$ and $step = 9$;

• since $ACC_2 = 1$, we get $aux_{norm} = 2048/2 = 1024$ and $step = 8$;

• since $ACC_1 = 1$, we get $aux_{norm} = 1024/2 = 512$ and $step = 7$;

For $i = 2$, we update the accumulators so that $ACC_1 = 1$, $ACC_2 = 1$ and $ACC_3 = 1$.

Entering the Complete Inner Loop, we have:


• since $ACC_3 = 1$, we get $aux_{norm} = 512/2 = 256$ and $step = 6$;

• since $ACC_2 = 1$, we get $aux_{norm} = 256/2 = 128$ and $step = 5$;

• since $ACC_1 = 1$, we get $aux_{norm} = 128/2 = 64$ and $step = 4$;

For $i = 3$, we update the accumulators so that $ACC_1 = 8246$, $ACC_2 = 1479$ and $ACC_3 = -1$.

Entering the Inner Loop, we have:

• since $ACC_3 = -1$, we have $ACC_1 = ACC_1 \cdot D_9 = 10810$ and $ACC_2 = ACC_1^2 = -1$. Furthermore, $e_{11} = 1$, $aux_{norm} = 64/2 + 2048 = 2080$ and $step = 3$;

• since $ACC_2 = -1$, we have $ACC_1 = ACC_1 \cdot D_{10} = 1$. Furthermore, $e_{11} = 1$, $aux_{norm} = 2080/2 + 2048 = 3088$ and $step = 2$;

• since $ACC_1 = 1$, we get $aux_{norm} = 3088/2 = 1544$ and $step = 1$;

We now perform the Final Accumulator Update and Inner Loop. Thus, we obtain $aux_{ACC} = D_0 \cdot D_6 \cdot D_7 = 1490$, $ACC_1 = A \cdot aux_{ACC}^2 = -1$ and $ACC_2 = ACC_1^2 = 1$. Since $ACC_2 = 1$, we obtain $e_{11} = 1$, $aux_{norm} = 1544/2 + 2048 = 2820$ (thus, $norm = 2820/4 = 705$) and $aux_{ACC} = aux_{ACC} \cdot D_9 = 2460$. The final computation gives us $b = a \cdot aux_A \cdot aux_{ACC} \cdot (A \cdot aux_{ACC}^2 - 1) = 2564 \cdot 5128 \cdot 2460 \cdot (10810 - 1) = 253$.
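The numbers above are easy to re-check mechanically; the following few lines (ours) verify the key claims of Example 5.1.

```python
p, a, d, norm, b = 12289, 2564, 19, 705, 253
assert p - 1 == 2**12 * 3                                 # s = 12, t = 3
assert pow(d, (p - 1) // 2, p) == p - 1                   # d = 19 is a quadratic non-residue
assert b * b % p == a                                     # 253^2 = 2564 (mod 12289)
# Base Case condition: (2*a*d^(2*norm))^((p-1)/2^(s-1)) = -1 for norm = 705
assert pow(2 * a * pow(d, 2 * norm, p) % p, (p - 1) // 2**11, p) == p - 1
```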

5.3. Average-Case and Worst-Case Complexity Analysis for the Improved Algorithm

In the average case, we obtain the following complexities, based on the fact that $norm$ has around $s/2$ bits equal to 1 in its representation:

• If $p$ is a priori known, Precomputation takes 1E for the terms involving $A$ (the computation of the terms involving $D$ can be performed in advance). If $p$ is not a priori known, Precomputation takes 2E, which means 1E for the terms involving $D$ and 1E for the terms involving $A$.

• Complete Accumulator Update takes $\frac{s}{4}$M and $(k-1)$S (since, on average, we use $s/2$ bits from $norm$, each of which can be 0 or 1, with equal probability).

• Complete Inner Loop takes $\frac{k}{2}$M $+ \frac{k(k-1)}{4}$S (since we use $k$ bits from $norm$, each of which can be 0 or 1, with equal probability).

• Final Accumulator Update and Inner Loop takes $2\cdot\frac{k}{2}$M $+ \frac{s}{4}$M $+ (k-1)$S $+ \frac{k}{2}$M $+ \frac{k(k-1)}{4}$S.

• The final computation of $b$ takes 4M + 1S.

The estimate does not include the generation of a quadratic non-residue $d$. In general, the computation takes about $2E + \frac{s}{k}\left(\frac{s}{4}M + kS + \frac{k}{2}M + \frac{k(k-1)}{4}S\right) + (k+4)M + 1S$. This value is around $2E + \frac{3s}{4}S + \frac{s}{2}M + \frac{1}{4}\left(\frac{s^2}{k}\right)M + \frac{1}{4}(sk)S + kM$. Taking $k = \lceil\sqrt{s}\rceil$ (the optimal choice) leaves us with $2E + \frac{3s}{4}S + \frac{s}{2}M + \frac{1}{4}(s\lceil\sqrt{s}\rceil)M + \frac{1}{4}(s\lceil\sqrt{s}\rceil)S + \lceil\sqrt{s}\rceil M$.


If $p$ is a priori known, we obtain $1E + \frac{3s}{4}S + \frac{s}{2}M + \frac{1}{4}(s\lceil\sqrt{s}\rceil)M + \frac{1}{4}(s\lceil\sqrt{s}\rceil)S + \lceil\sqrt{s}\rceil M$. In this case, we need $s$ precomputed elements and memory for just $2\lceil\sqrt{s}\rceil$ additional elements.

Moving on to the worst case, we consider the fact that $norm$'s binary representation has roughly $s$ bits which are equal to 1. This results in the following complexities:

• Precomputation: same as for the average case.

• Complete Accumulator Update takes $\frac{s}{2}$M and $(k-1)$S (since, on average, we use $s/2$ bits from $norm$, and all of $norm$'s bits are equal to 1).

• Complete Inner Loop takes $k$M $+ \frac{k(k-1)}{2}$S (since we use $k$ bits from $norm$, and all of $norm$'s bits are equal to 1).

• Final Accumulator Update and Inner Loop takes $2k$M $+ \frac{s}{2}$M $+ (k-1)$S $+ k$M $+ \frac{k(k-1)}{2}$S.

• The final computation of $b$ takes 4M + 1S, the same as for the average case.

Again, we exclude the generation of a quadratic non-residue $d$. The computation takes at most about $2E + \frac{s}{k}\left(\frac{s}{2}M + kS + kM + \frac{k(k-1)}{2}S\right) + (2k+4)M + 1S$. This value is approximately $2E + \frac{s}{2}S + sM + \frac{1}{2}\left(\frac{s^2}{k}\right)M + \frac{1}{2}(sk)S + 2kM$. If we set $k = \lceil\sqrt{s}\rceil$ (the optimal choice), we have $2E + \frac{s}{2}S + sM + \frac{1}{2}(s\lceil\sqrt{s}\rceil)M + \frac{1}{2}(s\lceil\sqrt{s}\rceil)S + 2\lceil\sqrt{s}\rceil M$. If $p$ is a priori known, we obtain $1E + \frac{s}{2}S + sM + \frac{1}{2}(s\lceil\sqrt{s}\rceil)M + \frac{1}{2}(s\lceil\sqrt{s}\rceil)S + 2\lceil\sqrt{s}\rceil M$. Like in the average case, we will need $s$ precomputed elements and memory for just $2\lceil\sqrt{s}\rceil$ additional elements. Consequently, both the average-case and the worst-case complexity of our improved algorithm are in $O((\log_2 p)^{3.5})$.

5.4. Comparisons with Other Methods

In this section we will compare our algorithm with the most important square root algorithms, namely Tonelli-Shanks and Cipolla-Lehmer. After a short overview of these algorithms, we will put forward a computational comparison of the three algorithms.

5.4.1. Tonelli-Shanks Algorithm

The Tonelli-Shanks algorithm ([22], [20]) reduces the problem of computing a square root to another famous problem, namely the discrete logarithm problem: given a finite cyclic group $G$, a generator $\alpha$ of it, and an arbitrary element $\beta \in G$, determine the unique $k$, $0 \leq k \leq |G| - 1$, such that $\beta = \alpha^k$. The element $k$ will be referred to as the discrete logarithm of $\beta$ in base $\alpha$, denoted by $k = \log_\alpha \beta$. Although this problem is intractable in general, if the order of the group is smooth, i.e., its prime factors do not exceed a given bound, there is an efficient algorithm due to Pohlig and Hellman [16].

Let us consider an odd prime $p$, $p = 2^s t + 1$, with $s \geq 2$ and $t$ odd, $a$ a quadratic residue and $d$ a quadratic non-residue (modulo $p$). The Tonelli-Shanks algorithm is based on the following simple facts:

1. Let $\alpha = d^t$. Then $|\langle\alpha\rangle| = 2^s$, or, equivalently, $ord(\alpha) = 2^s$, where $\langle\alpha\rangle$ denotes the subgroup induced by $\alpha$, and $ord(\alpha)$ represents the order of $\alpha$ (in $\mathbb{Z}_p$).

2. Let $\beta = a^t$. Then $\beta \in \langle\alpha\rangle$ and $\log_\alpha \beta$ is even (this discrete logarithm is considered with respect to the subgroup induced by $\alpha$).


Thus, if we can determine $k$ such that $\beta = \alpha^k$, then $\sqrt{a}$ can be computed as $\sqrt{a} = a^{\frac{t+1}{2}}(d^{-1})^{\frac{k}{2}t}$. Indeed, $\left(a^{\frac{t+1}{2}}(d^{-1})^{\frac{k}{2}t}\right)^2 = a^{t+1}(d^{kt})^{-1} = a^{t+1}a^{-t} = a$.

Thus, the difficult part is finding $k$, the discrete logarithm of $\beta$ in base $\alpha$ (in the subgroup $\langle\alpha\rangle$ of order $2^s$). Tonelli and Shanks compute the element $k$ bit by bit. Lindhurst [14] has proven that the Tonelli-Shanks algorithm requires on average two exponentiations, $\frac{s^2}{4}$ multiplications, and two quadratic character evaluations, with worst-case complexity $O((\log_2 p)^4)$.
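For comparison purposes, a minimal Python sketch of the bit-by-bit computation just described is given below (our own illustration, not Bernstein's windowed variant; the names are ours and $d$ is found by trial search).

```python
def tonelli_shanks_sqrt(a, p):
    """Square root of a quadratic residue a modulo an odd prime p."""
    s, t = 0, p - 1                                  # p - 1 = 2^s * t, t odd
    while t % 2 == 0:
        s, t = s + 1, t // 2
    d = 2                                            # a quadratic non-residue
    while pow(d, (p - 1) // 2, p) != p - 1:
        d += 1
    alpha, beta = pow(d, t, p), pow(a, t, p)         # ord(alpha) = 2^s, beta in <alpha>
    inv_alpha = pow(alpha, p - 2, p)
    k = 0                                            # k = log_alpha(beta), recovered bit by bit
    for i in range(1, s):
        if pow(beta * pow(inv_alpha, k, p) % p, 2 ** (s - 1 - i), p) != 1:
            k += 2 ** i
    inv_d = pow(d, p - 2, p)
    return pow(a, (t + 1) // 2, p) * pow(inv_d, k * t // 2, p) % p

b = tonelli_shanks_sqrt(10, 13)
assert b * b % 13 == 10
```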

Bernstein [5] has proposed a method of computing $w$ bits of $k$ at a time. His algorithm involves an exponentiation and $\frac{s^2}{2w^2}$ multiplications, with a precomputation phase that additionally requires two quadratic character evaluations on average, an exponentiation, and about $\frac{2^w s}{w}$ multiplications, producing a table with $\frac{2^w s}{w}$ precomputed powers of $\alpha$.

5.4.2. Cipolla-Lehmer Algorithm

The following square root algorithm is due to Cipolla [6] and Lehmer [12]. Cipolla's method is based on arithmetic in quadratic extension fields, which is briefly recalled below.

Let us consider an odd prime $p$ and $a$ a quadratic residue modulo $p$. We first generate an element $z \in \mathbb{Z}_p$ such that $z^2 - a$ is a quadratic non-residue. The extension field $\mathbb{Z}_p(\sqrt{z^2-a})$ is constructed as follows:

• its elements are pairs $(x, y) \in \mathbb{Z}_p^2$;

• the addition is defined as $(x, y) + (x', y') = (x + x', y + y')$;

• the multiplication is defined as $(x, y) \cdot (x', y') = (xx' + yy'(z^2-a), xy' + x'y)$;

• the additive identity is $(0, 0)$, and the multiplicative identity is $(1, 0)$;

• the additive inverse of $(x, y)$ is $(-x, -y)$ and its multiplicative inverse is $(x(x^2 - y^2(z^2-a))^{-1}, -y(x^2 - y^2(z^2-a))^{-1})$.

Cipolla has remarked that a square root of $a$ can be computed using the fact that $(z, 1)^{\frac{p+1}{2}} = (\sqrt{a}, 0)$, and his method requires two quadratic character evaluations on average and at most $6\log_2 p$ multiplications ([7, page 96]).

Lehmer's method is based on evaluating Lucas sequences. Let us consider the sequence $(V_k)_{k\geq 0}$ defined by $V_0 = 2$, $V_1 = z$, and $V_k = zV_{k-1} - aV_{k-2}$, for all $k \geq 2$, where $z \in \mathbb{Z}_p$ is generated such that $z^2 - 4a$ is a quadratic non-residue. Lehmer has proved that

$$\sqrt{a} = \frac{1}{2}V_{\frac{p+1}{2}},$$

and his method requires two quadratic character evaluations on average and about $4.5\log_2 p$ multiplications ([18]). Müller [15] has proposed an improved variant that requires only $2\log_2 p$ multiplications, which will be referred to as the Improved Cipolla-Lehmer.
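The extension-field arithmetic is short to write down; the following Python sketch (ours) implements Cipolla's method exactly as described above, with $z$ found by trial search. Lehmer's Lucas-sequence variant and Müller's improvement are not shown.

```python
def cipolla_sqrt(a, p):
    """Square root of a quadratic residue a modulo an odd prime p (Cipolla's method)."""
    z = 0
    while pow((z * z - a) % p, (p - 1) // 2, p) != p - 1:
        z += 1                                   # find z with z^2 - a a non-residue
    w = (z * z - a) % p                          # the "radicand" of the extension field

    def mul(u, v):                               # (x, y)*(x', y') in Z_p(sqrt(w))
        return ((u[0] * v[0] + u[1] * v[1] * w) % p,
                (u[0] * v[1] + u[1] * v[0]) % p)

    r, base, e = (1, 0), (z, 1), (p + 1) // 2    # compute (z, 1)^((p+1)/2)
    while e:
        if e & 1:
            r = mul(r, base)
        base = mul(base, base)
        e >>= 1
    return r[0]                                  # equals (sqrt(a), 0) by Cipolla's remark

b = cipolla_sqrt(10, 13)
assert b * b % 13 == 10
```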


5.4.3. Test Results

We have implemented Improved Generalized Atkin (Imp-Gen-Atk) and the fastest known algorithms, namely Tonelli-Shanks-Bernstein (Ton-Sha-Ber) and Improved Cipolla-Lehmer (Imp-Cip-Leh). For all pairs $(\log_2 p, s)$, $\log_2 p \in \{128, 256, 512, 1024\}$, $s \in \{4, 8, 16, 32, \frac{\log_2 p}{2}\}$, we have generated 32 pairs $(p, a)$, where $a$ is a quadratic residue modulo $p$, and we have counted the average number of modular squarings and regular modular multiplications. We have considered two cases, depending on whether $p$ is known a priori or not. We have not included the computation required for finding a quadratic non-residue modulo $p$. For exponentiation we have considered the simplest method, namely square-and-multiply exponentiation. In the case of an exponent $x$, this method requires $\log_2 x$ squarings and $Hw(x)$ regular multiplications.

For Improved Generalized Atkin we choose the optimal $k = \lceil\sqrt{s}\rceil$, requiring $s + 2\lceil\sqrt{s}\rceil$ stored values. For Tonelli-Shanks-Bernstein, given that the number of needed precomputed values is $\frac{s 2^w}{w}$, in order to reach a number of elements comparable with ours, we choose the parameter $w = 2$ (which leads to $2s$ elements). We have to remark that the performance of Improved Cipolla-Lehmer does not depend on $s$.

We present the results for the case that $p$ is not known a priori in Tables 1-5. In each column the first value indicates the average number of squarings and the second one denotes the average number of regular multiplications.

Method         log2p = 128    log2p = 256    log2p = 512    log2p = 1024
Imp-Gen-Atk    256 / 138      512 / 262      1024 / 515     2048 / 1036
Ton-Sha-Ber    255 / 146      511 / 300      1023 / 562     2047 / 1076
Imp-Cip-Leh    126 / 124      254 / 252      510 / 508      1022 / 1020

Table 1. Comparison between methods for s = 4, where p is unknown

Method         log2p = 128    log2p = 256    log2p = 512    log2p = 1024
Imp-Gen-Atk    260 / 136      515 / 267      1027 / 521     2052 / 1031
Ton-Sha-Ber    255 / 179      511 / 300      1023 / 562     2047 / 1076
Imp-Cip-Leh    126 / 124      254 / 252      510 / 508      1022 / 1020

Table 2. Comparison between methods for s = 8, where p is unknown

Method         log2p = 128    log2p = 256    log2p = 512    log2p = 1024
Imp-Gen-Atk    270 / 138      527 / 272      1039 / 527     2062 / 1041
Ton-Sha-Ber    255 / 284      511 / 412      1023 / 668     2047 / 1178
Imp-Cip-Leh    126 / 124      254 / 252      510 / 508      1022 / 1020

Table 3. Comparison between methods for s = 16, where p is unknown


Method         log2p = 128    log2p = 256    log2p = 512    log2p = 1024
Imp-Gen-Atk    303 / 171      559 / 297      1070 / 551     2096 / 1059
Ton-Sha-Ber    253 / 301      509 / 426      1021 / 686     2045 / 1195
Imp-Cip-Leh    126 / 124      254 / 252      510 / 508      1022 / 1020

Table 4. Comparison between methods for s = 32, where p is unknown

Method         log2p = 128    log2p = 256    log2p = 512    log2p = 1024
Imp-Gen-Atk    392 / 217      904 / 526      2084 / 1334    5096 / 3721
Ton-Sha-Ber    253 / 723      509 / 2468     1021 / 9021    2045 / 34431
Imp-Cip-Leh    126 / 124      254 / 252      510 / 508      1022 / 1020

Table 5. Comparison between methods for s = log2(p)/2, where p is unknown

In the case that $p$ is not known a priori, Improved Cipolla-Lehmer is clearly the best, while our algorithm is comparable with Tonelli-Shanks-Bernstein.

We are interested in determining the values of $s$ for which our algorithm is more efficient than Improved Cipolla-Lehmer and/or Tonelli-Shanks-Bernstein, considering the case that $p$ is known a priori. We express 1E as $\log_2 p\,$S $+ \frac{\log_2 p - s}{2}\,$M. To simplify the comparisons we no longer distinguish between squarings and regular multiplications.

More precisely, let us first determine $s$ such that our algorithm is more efficient than Improved Cipolla-Lehmer in terms of total computation:

$$\log_2 p + \frac{\log_2 p - s}{2} + \frac{3s}{4} + \frac{s}{2} + \frac{1}{4}(s\lceil\sqrt{s}\rceil) + \frac{1}{4}(s\lceil\sqrt{s}\rceil) + \lceil\sqrt{s}\rceil < 2\log_2 p$$

We obtain the following sequence of equivalent inequalities:

$$\log_2 p + \frac{\log_2 p - s}{2} + \frac{s\lceil\sqrt{s}\rceil}{2} + \frac{5s}{4} + \lceil\sqrt{s}\rceil < 2\log_2 p$$
$$\frac{s\lceil\sqrt{s}\rceil}{2} + \frac{3s}{4} + \lceil\sqrt{s}\rceil < \frac{\log_2 p}{2}$$
$$s\lceil\sqrt{s}\rceil + \frac{3s}{2} + 2\lceil\sqrt{s}\rceil < \log_2 p$$

We now turn our attention to Tonelli-Shanks-Bernstein with the parameter $w = 2$. A more thorough analysis of this algorithm gives us $\log_2 p + \frac{\log_2 p - s}{2} + \frac{s^2}{8} + \frac{3s}{2}$ multiplications.

We obtain the following inequality:

$$\frac{s\lceil\sqrt{s}\rceil}{2} + \frac{5s}{4} + \lceil\sqrt{s}\rceil < \frac{s^2}{8} + \frac{3s}{2}$$

which leads to $s > 20$.
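The two thresholds can be checked numerically; the snippet below (ours) simply evaluates both inequalities. With $\log_2 p = 256$ the first one holds up to $s = 32$, and the second one first holds at $s = 21$, consistent with the $s > 20$ threshold above.

```python
from math import isqrt

def ceil_sqrt(s):
    r = isqrt(s)
    return r if r * r == s else r + 1

def beats_improved_cipolla_lehmer(s, log2p):
    # s*ceil(sqrt(s)) + 3s/2 + 2*ceil(sqrt(s)) < log2(p)
    return s * ceil_sqrt(s) + 1.5 * s + 2 * ceil_sqrt(s) < log2p

def beats_tonelli_shanks_bernstein(s):
    # s*ceil(sqrt(s))/2 + 5s/4 + ceil(sqrt(s)) < s^2/8 + 3s/2
    c = ceil_sqrt(s)
    return s * c / 2 + 1.25 * s + c < s * s / 8 + 1.5 * s

print(max(s for s in range(2, 200) if beats_improved_cipolla_lehmer(s, 256)))  # 32
print(min(s for s in range(2, 200) if beats_tonelli_shanks_bernstein(s)))      # 21
```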


We present the results for the case that $p$ is known a priori in Tables 6-10. We remind the reader that in each column the first value indicates the average number of squarings and the second one denotes the average number of regular multiplications.

Method         log2p = 128    log2p = 256    log2p = 512    log2p = 1024
Imp-Gen-Atk    128 / 72       256 / 133      512 / 269      1024 / 520
Ton-Sha-Ber    126 / 71       254 / 136      510 / 265      1022 / 522
Imp-Cip-Leh    126 / 124      254 / 252      510 / 508      1022 / 1020

Table 6. Comparison between methods for s = 4, where p is known a priori

Method         log2p = 128    log2p = 256    log2p = 512    log2p = 1024
Imp-Gen-Atk    131 / 80       260 / 136      515 / 276      1028 / 522
Ton-Sha-Ber    127 / 112      255 / 176      511 / 305      1023 / 559
Imp-Cip-Leh    126 / 124      254 / 252      510 / 508      1022 / 1020

Table 7. Comparison between methods for s = 8, where p is known a priori

Method         log2p = 128    log2p = 256    log2p = 512    log2p = 1024
Imp-Gen-Atk    141 / 79       270 / 155      526 / 262      1038 / 524
Ton-Sha-Ber    126 / 125      254 / 187      510 / 316      1022 / 574
Imp-Cip-Leh    126 / 124      254 / 252      510 / 508      1022 / 1020

Table 8. Comparison between methods for s = 16, where p is known a priori

Method         log2p = 128    log2p = 256    log2p = 512    log2p = 1024
Imp-Gen-Atk    176 / 116      308 / 182      560 / 317      1072 / 560
Ton-Sha-Ber    126 / 246      254 / 310      510 / 441      1020 / 697
Imp-Cip-Leh    126 / 124      254 / 252      510 / 508      1022 / 1020

Table 9. Comparison between methods for s = 32, where p is known a priori


Method         log2p = 128    log2p = 256    log2p = 512    log2p = 1024
Imp-Gen-Atk    268 / 186      649 / 453      1597 / 1257    4043 / 3425
Ton-Sha-Ber    126 / 686      254 / 2399     510 / 8895     1022 / 34172
Imp-Cip-Leh    126 / 124      254 / 252      510 / 508      1022 / 1020

Table 10. Comparison between methods for s = log2(p)/2, where p is known a priori

6. Conclusions and Future Work

In this paper we have extended Atkin's algorithm to the general case $p \equiv 2^s + 1 \bmod 2^{s+1}$, for any $s \geq 2$, thus providing a complete solution for the case $p \equiv 1 \bmod 4$. Complexity analysis and comparisons with other methods have also been provided.

An interesting problem is extending our algorithm to arbitrary finite fields. In the case of the finite fields $GF(p^k)$, for $k$ odd, the efficient techniques described in [11], [9] can be adapted to our case in a straightforward manner, but, to the best of our knowledge, there are no similar techniques for the case $GF(p^k)$, for $k$ even. We will focus on this topic in our future work.

Acknowledgements

We would like to thank the two anonymous reviewers for their helpful suggestions.

References

[1] Ankeny, N. C.: The Least Quadratic Non Residue, Annals of Mathematics, 55(1), 1952, 65–72.

[2] Atkin, A.: Probabilistic primality testing (summary by F. Morain), Technical Report 1779, INRIA, 1992, URL: http://algo.inria.fr/seminars/sem91-92/atkin.pdf.

[3] Atkin, A., Morain, F.: Elliptic Curves and Primality Proving, Mathematics of Computation, 61(203), 1993, 29–68.

[4] Bach, E., Shallit, J.: Algorithmic Number Theory, Volume I: Efficient Algorithms, MIT Press, 1996.

[5] Bernstein, D. J.: Faster square roots in annoying finite fields (preprint), 2001, URL: http://cr.yp.to/papers/sqroot.pdf.

[6] Cipolla, M.: Un metodo per la risoluzione della congruenza di secondo grado, Rendiconto dell’Accademia delle Scienze Fisiche e Matematiche, Napoli, 9, 1903, 154–163.

[7] Crandall, R., Pomerance, C.: Prime Numbers. A Computational Perspective, Springer-Verlag, 2001.

[8] Eikenberry, S., Sorenson, J.: Efficient Algorithms for Computing the Jacobi Symbol, Journal of Symbolic Computation, 26(4), 1998, 509–523.

[9] Han, D.-G., Choi, D., Kim, H.: Improved Computation of Square Roots in Specific Finite Fields, IEEE Transactions on Computers, 58(2), 2009, 188–196.
