
Efficient Scheduling of Arbitrary Task Graphs to Multiprocessors using A Parallel Genetic Algorithm

Yu-Kwong Kwok and Ishfaq Ahmad Department of Computer Science

The Hong Kong University of Science and Technology Clear Water Bay, Hong Kong

Email: {csricky, iahmad}@cs.ust.hk

January 1997

Revised: July 1997

To appear in the Journal of Parallel and Distributed Computing Special Issue on Parallel Evolutionary Computing

Abstract

Given a parallel program represented by a task graph, the objective of a scheduling algorithm is to minimize the overall execution time of the program by properly assigning the nodes of the graph to the processors. This multiprocessor scheduling problem is NP-complete even with simplifying assumptions, and becomes more complex under relaxed assumptions such as arbitrary precedence constraints, and arbitrary task execution and communication times. The present literature on this topic is a large repertoire of heuristics that produce good solutions in a reasonable amount of time. These heuristics, however, have restricted applicability in a practical environment because they have a number of fundamental problems including high time complexity, lack of scalability, and no performance guarantee with respect to optimal solutions. Recently, genetic algorithms (GAs) have been widely reckoned as a useful vehicle for obtaining high quality or even optimal solutions for a broad range of combinatorial optimization problems. While a few GAs for scheduling have already been suggested, in this paper we propose a novel GA-based algorithm with an objective to simultaneously meet the goals of high performance, scalability, and fast running time. The proposed Parallel Genetic Scheduling (PGS) algorithm itself is a parallel algorithm which generates high quality solutions in a short time. By encoding the scheduling list as a chromosome, the PGS algorithm can potentially generate an optimal scheduling list which in turn leads to an optimal schedule. The major strength of the PGS algorithm lies in its two efficient genetic operators: the order crossover and mutation. These operators effectively combine the building-blocks of good scheduling lists to construct better lists. The proposed algorithm is evaluated through a robust comparison with two heuristics best known in terms of performance and time complexity. It outperforms both heuristics while taking considerably less running time. When evaluated with random task graphs for which optimal solutions are known, the PGS algorithm generates optimal solutions for more than half of the test cases and close-to-optimal for the other half.

†. This research was supported by the Hong Kong Research Grants Council under contract number HKUST 734/96E.


1 Introduction

To effectively harness the computing power of high-performance parallel computing systems, it is crucial to employ a judicious scheduling algorithm for proper allocation and sequencing of tasks on the processors. Given a parallel program modeled by a node- and edge-weighted directed acyclic graph (DAG), finding an optimal schedule with a minimum turnaround time without violating precedence constraints among the tasks is well known to be an NP-complete problem [10], [13], [39]. Only for a few simplified cases [4], [10], [13], [29], [32] can the problem be solved by a polynomial-time algorithm. If the simplifying assumptions of these cases are relaxed, the problem becomes NP-hard in the strong sense. Thus, it is unlikely that the general scheduling problem can be solved in polynomial time, unless P = NP. Therefore, state-space search techniques are considered the only resort for finding optimal solutions [8], [9]. However, most of these techniques are still designed to work under restricted environments and are usually not applicable to general situations. Furthermore, a state-space search incurs exponential time in the worst case. As such, with these techniques even moderately large problems cannot be solved in a reasonable amount of time.

Due to the intractability of the scheduling problem and the ineffectiveness of state-space search techniques, many polynomial-time heuristics are reported to tackle the problem under more pragmatic situations [1], [3], [14], [26]. The rationale of these heuristics is to sacrifice optimality for the sake of reduced time complexity. While these heuristics are shown to be effective in experimental studies, they usually cannot generate optimal solutions, and there is no guarantee in their performance in general. The major weakness of these heuristics is that they usually employ a deterministic greedy strategy which can be sensitive to the scheduling parameters such as the number of target processors available and the structure of the input task graphs. Improving the performance of a heuristic generally increases its complexity.

Furthermore, these heuristics are usually not scalable in terms of their running time and solution quality for large problem sizes.

In view of the drawbacks of the existing sequential scheduling heuristics, we aim at designing a new scheduling scheme which has a high capability to generate optimal solutions and is also fast and scalable. To obtain high quality solutions, we devise a genetic formulation of the scheduling problem, in which scheduling lists (ordering of tasks for scheduling) are systematically combined by using computationally efficient operators so as to determine an optimal scheduling list. To achieve a reduced time complexity, the proposed algorithm is parallelized. The algorithm not only scales well with the number of processors but can also handle general DAGs without making simplifying assumptions.

Inspired by the Darwinian concept of evolution, genetic algorithms [6], [11], [16], [18], [36], [40] are global search techniques which explore different regions of the search space simultaneously by keeping track of a set of potential solutions called a population. According to the Building-block Hypothesis [16] and the Schema Theorem [16], a genetic algorithm systematically combines the good building-blocks of some selected individuals in the population to generate better individuals for survival in a new generation through employing genetic operators such as crossover and mutation. Another attractive merit of genetic search is that the parallelization of the algorithm is possible. With these distinctive algorithmic merits, genetic algorithms are becoming more widely used in many areas to tackle the quest for optimal solutions in optimization problems. Indeed, genetic algorithms have been applied to the data partitioning problem [7], the graph partitioning problem [2], the robotic control problem [15], the standard cell placement problem [33], etc.

We formulate the scheduling problem in a genetic search framework based on the observation that if the tasks of a parallel program are arranged properly in a list, an optimal schedule may be obtained by scheduling the tasks one by one according to their order in the list.

With this concept, we encode each chromosome to be a valid scheduling list, one in which the precedence constraints among tasks are preserved. We also design two genetic operators: the order crossover and mutation. These operators effectively combine the good features of existing scheduling lists to form better lists. Using random task graphs for which optimal schedules are known, we have found that the proposed algorithm can generate optimal solutions for a majority of the test cases. Furthermore, when compared with two efficient scheduling heuristics, the proposed algorithm outperforms them while taking much less computation time due to its effective parallelization.

The remainder of the paper is organized as follows. In the next section we provide the problem statement. In Section 3 we give a background of genetic search by presenting a brief survey of genetic techniques. In Section 4 we present the proposed parallel genetic scheduling algorithm. Examples are used to illustrate the functionality of the proposed technique. In Section 5 we describe our experimental study and its results. We also describe some related work on using genetic algorithms for scheduling in Section 6. Finally, we provide some concluding remarks and future research directions in the last section.

2 Problem Statement

In static scheduling, a parallel program can be modeled by a directed acyclic graph (DAG) G = (V, E), where V is a set of v nodes and E is a set of e directed edges. A node in the DAG represents a task, which in turn is a set of instructions that must be executed sequentially without preemption on the same processor. The weight associated with a node n_i, which represents the amount of time needed for a processor to execute the task, is called the computation cost of the node and is denoted by w(n_i). The edges in the DAG, each of which is denoted by (n_i, n_j), correspond to the communication messages and precedence constraints among the nodes. The weight associated with an edge, which represents the amount of time needed to communicate the data, is called the communication cost of the edge and is denoted by c(n_i, n_j). The communication-to-computation-ratio (CCR) of a parallel program is defined as its average communication cost divided by its average computation cost on a given system.

The source node of an edge incident on a node is called a parent of that node. Likewise, the destination node of an edge emerging from a node is called a child of that node. A node with no parent is called an entry node and a node with no child is called an exit node. The precedence constraints of a DAG dictate that a node cannot start execution before it gathers all of the messages from its parent nodes. The communication cost between two nodes assigned to the same processor is assumed to be zero. Thus, the data available time (DAT) of a node depends heavily on the processor to which the node is scheduled. If node n_i is scheduled, ST(n_i) and FT(n_i) denote the start-time and finish-time of n_i, respectively. After all nodes have been scheduled, the schedule length is defined as max_i{FT(n_i)} across all nodes. The objective of scheduling is to minimize the schedule length by proper allocation of the nodes to the processors and arrangement of the execution sequencing of the nodes without violating the precedence constraints. We summarize in Table 1 the notations used in the paper. An example DAG, shown in Figure 1(a), will be used in the subsequent discussion.
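To make the model above concrete, the sketch below shows one possible in-memory representation of such a weighted DAG together with the CCR computation. The paper does not prescribe any data structure; the type and field names are our own, and C is used only because the authors later report a C implementation of PGS.

/* One directed edge (n_i, n_j) with communication cost c(n_i, n_j). */
typedef struct {
    int src;   /* index of the parent node n_i   */
    int dst;   /* index of the child node n_j    */
    int comm;  /* communication cost c(n_i, n_j) */
} Edge;

/* A node- and edge-weighted DAG G = (V, E) with v nodes and e edges. */
typedef struct {
    int   v;       /* number of nodes */
    int   e;       /* number of edges */
    int  *weight;  /* weight[i] = computation cost w(n_i) */
    Edge *edge;    /* array of e directed edges */
} Dag;

/* CCR: average communication cost divided by average computation cost. */
double ccr(const Dag *g)
{
    long wsum = 0, csum = 0;
    for (int i = 0; i < g->v; i++) wsum += g->weight[i];
    for (int i = 0; i < g->e; i++) csum += g->edge[i].comm;
    return ((double)csum / g->e) / ((double)wsum / g->v);
}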

The processing elements (PEs) in the target system may be heterogeneous or homogeneous. Heterogeneity of PEs means that the PEs have different speeds or processing capabilities; we assume the communication links are homogeneous, that is, a message is transmitted with the same speed on all links. However, we assume every module of a parallel program can be executed on any PE, though the computation time needed on different PEs may be different. The PEs are connected by an interconnection network based on a certain topology. The topology may be fully-connected or of a particular structure such as a hypercube or mesh. Using this model, a multiprocessor network can be represented by an undirected graph. An example processor graph is shown in Figure 1(b).

Figure 1: (a) An example DAG; (b) A 4-processor fully-connected target system.

Table 1: Definitions of some notations.

Symbol        Definition
n_i           Node number of a node in the parallel program task graph
w(n_i)        Computation cost of node n_i
c(n_i, n_j)   Communication cost of the directed edge from node n_i to n_j
v             Number of nodes in the task graph
e             Number of edges in the task graph
p             Number of processing elements (PEs) in the target system
b-level       Bottom level of a node
t-level       Top level of a node
ALAP          As Late As Possible start-time of a node
DAT           Data available time of a node on a particular PE
ST(n_i)       Start-time of node n_i
FT(n_i)       Finish-time of node n_i
CCR           Communication-to-computation Ratio
SL            Schedule Length
PPE           Physical Processing Element (on which the PGS algorithm is executed)
µ_c           Crossover Rate
µ_m           Mutation Rate
N_p           Population Size
N_g           Number of Generations
f             Fitness value of a chromosome

The problem of optimally scheduling a DAG in polynomial time has been solved for only three simple cases. The first case is to schedule a free-tree with uniform node weights to an arbitrary number of processors. Hu [20] suggested a linear-time algorithm to solve the problem. The second case is to schedule an arbitrarily structured DAG with uniform node weights to two processors. Coffman and Graham [10] devised a quadratic-time algorithm for solving this problem. The third case is to schedule an interval-ordered DAG with uniform node weights to an arbitrary number of processors. Papadimitriou and Yannakakis [29] designed a linear-time algorithm to tackle the problem. In all these cases, communication among the tasks is ignored. Recently, Ali and El-Rewini [4], [12] have shown that an interval-ordered DAG with uniform edge weights, which are equal to the node weights, can also be optimally scheduled in polynomial time. Ullman [39] and Garey et al. [13] have shown that for more general cases, the scheduling problem is NP-complete.

3 Overview of Genetic Search Techniques

In this section we present a brief review of standard genetic algorithms (SGA). This will be followed by a discussion of different models of parallel genetic algorithms (PGA).

3.1 Standard Genetic Algorithms

Genetic algorithms (GAs), introduced by Holland in the 1970s [18], are search techniques that are designed based on the concept of evolution [6], [11], [16], [36]. In simple terms, given a well-defined search space in which each point is represented by a bit string, called a chromosome, a GA is applied with its three genetic search operators—selection, crossover, and mutation—to transform a population of chromosomes with the objective of improving the quality of the chromosomes. A GA is usually employed to determine the optimal solution of a specific objective function. The search space, therefore, is defined as the solution space so that each feasible solution is represented by a distinct chromosome. Before the search starts, a set of chromosomes is randomly chosen from the search space to form the initial population. The three genetic search operations are then applied one after the other to obtain a new generation of chromosomes in which the expected quality over all the chromosomes is better than that of the previous generation. This process is repeated until the stopping criterion is met, and the best chromosome of the last generation is reported as the final solution. An outline of a generic GA is as follows; the detailed mechanism of the three operators is discussed afterwards.

Standard Genetic Algorithm (SGA):

(1) Generate initial population;
(2) while number of generations not exhausted do
(3)   for i = 1 to PopulationSize do
(4)     Randomly select two chromosomes and apply the crossover operator;
(5)     Randomly select one chromosome and apply the mutation operator;
(6)   endfor
(7)   Evaluate all the chromosomes in the population and perform selection;
(8) endwhile
(9) Report the best chromosome as the final solution.

In nature we observe that stronger individuals survive, reproduce, and hence transmit their good characteristics to subsequent generations. This natural selection process inspires the design of the selection operator in the GAs. Given a generation of chromosomes, each of them is evaluated by measuring its fitness which is, in fact, the quality of the solution the chromosome represents. The fitness value is usually normalized to a real number between 0 and 1, and the higher the value, the fitter the chromosome. Usually a proportionate selection scheme is used.

With this scheme, a chromosome with fitness value f is allocated f/f_avg offspring in the subsequent generation, where f_avg is the average fitness value of the population. Thus, a chromosome with a larger fitness value is allocated more offspring, while a chromosome with a smaller fitness value, for example, one below the average, may be discarded in the next generation.
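As an illustration, proportionate (roulette-wheel) selection can be realized as in the sketch below: each chromosome's chance of being picked is proportional to its fitness, which matches the f/f_avg allocation described above in expectation. This is only a sketch with our own names; the paper does not give an implementation.

#include <stdlib.h>

/* Return the index of one chromosome, chosen with probability
 * proportional to its fitness. fitness[] holds n values in (0, 1]. */
int roulette_select(const double *fitness, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; i++) total += fitness[i];

    double r = ((double)rand() / RAND_MAX) * total;  /* a point on the wheel */
    double acc = 0.0;
    for (int i = 0; i < n; i++) {
        acc += fitness[i];
        if (r <= acc) return i;
    }
    return n - 1;  /* guard against floating-point rounding at the wheel's end */
}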

GAs are least affected by the continuity properties of the search space, unlike many other heuristic approaches. For instance, many researchers have found that GAs are better than simulated annealing [16] as well as tabu search [11], both of which operate on a single solution only.

For an efficient GA search, in addition to a proper solution structure, it is necessary that the initial population of solutions be a diverse representative of the search space. Furthermore, the solution encoding should permit:

• a large diversity in a small population;


• easy crossover and mutation operations; and

• an easy computation of the objective function.

3.2 Genetic Search Operators

In this section we review the mechanism and characteristics of two important standard genetic operators: crossover and mutation. A less commonly used operator, called inversion, is also discussed.

Crossover is a crucial operator of GAs and is applied after selection. While selection is used to improve the overall quality of the population, crossover is used to explore the search space to find better individuals. Pairs of chromosomes are selected randomly from the population for application of the crossover operator. In the simplest approach, a point is chosen randomly as the crossover point. The two chromosomes then exchange the portions beyond the crossover point to generate two new chromosomes. A simple example of the standard crossover is given in Figure 2(a). The rationale is that after the exchange the newly generated chromosomes may contain the good characteristics from both the parent chromosomes and hence, possess a higher fitness value. Nevertheless the newly generated chromosomes may be worse than their parents.

With respect to this, the crossover operator is not always applied to the selected pair of chromosomes. It is applied with a certain pre-specified probability called the crossover rate, denoted by µ_c. There are a number of variants of the standard crossover operator. These include the order crossover, the partially-mapped crossover (PMX), and the cycle crossover [16], [33]. The characteristics of these variants will be described later in the paper.

Mutation is a genetic operator for recovering the good characteristics lost during crossover.

Figure 2: Examples of the standard (a) crossover operator; (b) mutation operator; and (c) inversion operator on a binary coded chromosome.

Mutation of a chromosome is achieved by simply flipping a randomly selected bit of the chromosome. A simple example of standard mutation is given in Figure 2(b). Like crossover, mutation is applied with a certain probability, called the mutation rate and denoted by µ_m. Although mutation is a secondary search operator, it is useful for escaping from local minima. For instance, suppose all the chromosomes have converged to 0 at a certain bit position while the optimal solution has a 1 at that position. Crossover cannot regenerate a 1 at that position but mutation may be able to. If crossover is the only operator used, then as new patterns evolve out of the old ones, there is invariably a loss in pattern diversity and hence in the breadth of the search domain.

An inversion operator takes a random segment in a chromosome and reverses it. A simple example of standard inversion is given in Figure 2(c). The advantage of the inversion operator is as follows. There are some groups of properties, or genes, which it would be advantageous for offspring to inherit together from one parent. Such groups of genes, which interact to increase the fitness of the offspring that inherits them, are said to be co-adapted. If two genes are close to each other in the chromosome, the probability of their being split up when the crossover operator divides the chromosome into two segments will be smaller.

3.3 Control Parameters

A GA is governed by a number of parameters: the population size N_p, the number of generations N_g, the crossover rate µ_c, and the mutation rate µ_m. Finding appropriate values for these parameters requires extensive experimentation [6], [16], [36]. Even with appropriate parameters, optimal solutions cannot be guaranteed due to the probabilistic nature of GAs. Grefenstette [17] proposed using the scaling method for preventing premature convergence, a scenario in which the chromosomes of a population become homogeneous and converge to a sub-optimal chromosome. Scaling involves re-adjusting the fitness values of solutions in order to sustain a steady selective pressure in the population so that premature convergence may be avoided. To tune the control parameters on-the-fly, Srinivas and Patnaik [35] proposed an adaptive method which is driven by the idea of sustaining diversity in a population without affecting its convergence properties. Their algorithm [35] protects the best solutions in each generation from being disrupted by crossover and mutation. Extensive experiments have shown that this adaptive strategy can help prevent GAs from getting stuck at a local minimum.

3.4 Parallel Genetic Algorithms

The inherent parallelism in GAs can be exploited to enhance their search efficiency. In contrast to simulated annealing, which is intrinsically sequential and thus hard to parallelize [28], parallelism in GAs is easier to exploit [11], [16], [22], [27]. One approach to parallelizing a GA is to divide the population into q partitions, where q is the number of physical processing elements (PPEs) on which the parallel GA is executed. Each subpopulation in a PPE contains more than one chromosome. Each PPE then runs a separate copy of the original GA with its own partition of the population.

†. A PPE should be distinguished from a PE of the target system to which a DAG is scheduled.


Such a parallel GA (PGA) is called a coarse-grained PGA. Two less common types of PGAs are the fine-grained PGA and the micro-grained PGA. In a fine-grained PGA exactly one chromosome is assigned to each PPE. In this type of PGA, it is the topology of the PPE network that determines the degree of population isolation, and hence the diversity, of the individuals in the whole population. In a micro-grained PGA, only a single population is maintained, and parallelism comes from the use of multiple PPEs to evaluate the individual fitness function. We employ the coarse-grained parallelization approach.

In a coarse-grained PGA, there are many possible schemes for the PPEs to communicate and exchange information about solutions [22], [27], [30]. The objective of communication is to transfer fitter chromosomes from their local populations to other PPEs for parallel exploitation.

There are two major models of communication: the isolated island model and the connected island model.

In an isolated island model, PPEs work independently and communicate only at the end of the whole search process to select the best solution among all the PPEs. No migration of fitter chromosomes is performed. Obviously linear speedup can be achieved. Nevertheless, a PPE may waste all its processing cycles when it gets stuck at a poor population without the knowledge that other PPEs are searching in more promising regions of the search space. This in turn is a result of partitioning the population, which reduces the diversity of the chromosomes in the search space.

In a connected island model, PPEs communicate periodically to exchange the information about their solutions found thus far. It is common for the best chromosome found to be broadcast to all PPEs so that every PPE may devote the processing power towards the most promising direction. There are two variants of the connected island model. The first is that PPEs communicate in a synchronous fashion, that is, all PPEs participate in a communication phase simultaneously. While this scheme may be easy to implement, its drawback is that the communication cost paid for by the information exchange can be a significant overhead, limiting the achievable speedup. The second variant is that PPEs communicate in an event-driven manner, that is, PPEs communicate only when necessary. Also, not all PPEs participate in a communication phase. The merit of this scheme is that the communication overhead is lower.

However, a drawback is that this scheme is more difficult to implement. Thus, most PGAs employ the synchronous connected island model.

Finally, some researchers have found that in a PGA, the PPEs can use different sets of control parameters to further increase the population diversity [37], [38]. This can help avoid premature convergence.

4 The Proposed Parallel Genetic Algorithm for Scheduling

In this section we describe the proposed parallel genetic scheduling algorithm. We first present a scrutiny of the list scheduling method, which is necessary for the proposed genetic formulation of the scheduling problem. We then proceed to describe the chromosome encoding scheme, the design of the genetic operators, the selection of the control parameters, and finally, the parallelization of the algorithm.

4.1 A Scrutiny of List Scheduling

Classical optimal scheduling algorithms, like Hu's algorithm [20] and Coffman and Graham's algorithm [10], are based on the list scheduling approach, in which the nodes of the DAG are first arranged as a list such that the ordering of the nodes in the list preserves the precedence constraints. In the second step, beginning from the first node in the list, each node is removed and scheduled to a PE that allows the earliest start-time. Hereafter we refer to this second step as the start-time minimization step, which is outlined as follows.

Start-time Minimization:

(1) ∀ j, ReadyTime(PE_j) = 0;
(2) while the scheduling list L is not empty do
(3)   remove the first node n_i from L;
(4)   Min_ST = ∞;
(5)   for j = 0 to p−1 do
(6)     This_ST = max{ReadyTime(PE_j), DAT(n_i, PE_j)};
(7)     if This_ST < Min_ST then Min_ST = This_ST; Candidate = PE_j; endif
(8)   endfor
(9)   schedule n_i to Candidate; ReadyTime(Candidate) = Min_ST + w(n_i);
(10) endwhile
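The start-time minimization step translates almost line by line into C. The sketch below assumes the DAG is stored as per-node parent lists with the corresponding edge costs; the array names, the flat layout, and the fixed bound on the number of PEs are our own choices rather than anything prescribed by the paper.

#include <limits.h>

#define MAXPE 64   /* sketch: at most 64 target PEs */

/* Schedule the nodes of list[0..v-1] (a valid topological order) one at a
 * time, always choosing the PE that allows the earliest start-time.
 * parent[i][k] and pcomm[i][k] give the k-th parent of node i and the cost
 * of that edge; nparent[i] is the number of parents; w[i] = w(n_i).
 * On return proc[], st[] and ft[] hold the assignment, start-times and
 * finish-times; the schedule length is returned.                          */
int schedule_list(const int *list, int v, int p,
                  int **parent, int **pcomm, const int *nparent,
                  const int *w, int *proc, int *st, int *ft)
{
    int ready[MAXPE] = {0};            /* ReadyTime(PE_j) */
    int sl = 0;

    for (int idx = 0; idx < v; idx++) {
        int ni = list[idx];
        int min_st = INT_MAX, candidate = 0;

        for (int j = 0; j < p; j++) {
            /* DAT(n_i, PE_j): latest arrival of data from all parents */
            int dat = 0;
            for (int k = 0; k < nparent[ni]; k++) {
                int pa = parent[ni][k];
                int arrive = ft[pa] + (proc[pa] == j ? 0 : pcomm[ni][k]);
                if (arrive > dat) dat = arrive;
            }
            int this_st = ready[j] > dat ? ready[j] : dat;
            if (this_st < min_st) { min_st = this_st; candidate = j; }
        }
        proc[ni] = candidate;
        st[ni]   = min_st;
        ft[ni]   = min_st + w[ni];
        ready[candidate] = ft[ni];
        if (ft[ni] > sl) sl = ft[ni];
    }
    return sl;
}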

An optimal ordering of the nodes in the list is required to generate an optimal schedule using the list scheduling approach. For instance, in Hu's algorithm [20], the scheduling list is constructed by using a node labeling process which proceeds from the top-level leaf nodes of the free-tree down to the root node. Such labeling leads to an optimal ordering of the nodes in that the nodes in the list, when scheduled, will occupy the earliest possible time slots in the processors.

Unfortunately, while optimal scheduling lists can be easily constructed for certain restricted cases (e.g., a unit-weight free-tree as in the case of Hu’s algorithm), such lists cannot be determined for arbitrary DAGs. Indeed, there are an exponential number of legitimate lists for a DAG that can be used for scheduling. An exhaustive search for an optimal list is clearly not a feasible approach.

Recent heuristics use node priorities to construct scheduling lists. Node priorities can be assigned using various attributes. Two frequently used attributes for assigning priority are the t-level (top level) and the b-level (bottom level). The t-level of a node n_i is the length of the longest path from an entry node to n_i (excluding n_i). Here, the length of a path is the sum of all the node and edge weights along the path. The t-level of n_i highly correlates with n_i's start-time, which is determined after n_i is scheduled to a processor. The b-level of a node n_i is the length of the longest path from n_i to an exit node. The b-level of a node is bounded by the length of the critical path (CP). A CP of a DAG is a path with the longest length; clearly, a DAG can have more than one CP. Some scheduling algorithms do not take into account the edge weights in computing the b-level; to distinguish this definition of the b-level from the one described above, we call it the static b-level or simply the static level (sl). Some other DAG scheduling algorithms employ an attribute called the ALAP (As-Late-As-Possible start-time). To compute these attributes, only two traversals of the DAG are needed. Thus, these attributes can be computed in O(e + v) time. For example, the t-level's, b-level's, and sl's of the DAG depicted in Figure 1(a) are shown in Figure 3.

Note that the nodes of the CP are marked by an asterisk.
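Both attributes can indeed be obtained with one forward and one backward pass over a topological order of the nodes, which is where the O(e + v) bound comes from. The following sketch (our own naming; children are stored as adjacency lists) computes the t-level and b-level values shown in Figure 3; dropping the edge costs in the backward pass would give the static level sl instead.

/* Compute t-level and b-level for every node. topo[] lists node indices in
 * topological order; child[i][k] and ccost[i][k] give the k-th child of
 * node i and the cost of that edge; w[i] = w(n_i).                        */
void compute_levels(const int *topo, int v,
                    int **child, int **ccost, const int *nchild,
                    const int *w, int *tlevel, int *blevel)
{
    for (int i = 0; i < v; i++) tlevel[i] = 0;

    /* forward pass: t-level(n) = max over parents m of t-level(m) + w(m) + c(m, n) */
    for (int idx = 0; idx < v; idx++) {
        int m = topo[idx];
        for (int k = 0; k < nchild[m]; k++) {
            int n = child[m][k];
            int cand = tlevel[m] + w[m] + ccost[m][k];
            if (cand > tlevel[n]) tlevel[n] = cand;
        }
    }

    /* backward pass: b-level(n) = w(n) + max over children m of c(n, m) + b-level(m) */
    for (int idx = v - 1; idx >= 0; idx--) {
        int n = topo[idx];
        int best = 0;
        for (int k = 0; k < nchild[n]; k++) {
            int cand = ccost[n][k] + blevel[child[n][k]];
            if (cand > best) best = cand;
        }
        blevel[n] = w[n] + best;
    }
}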

Using different combinations of the above attributes, some algorithms have demonstrated better performance than others. For example, let us consider the schedules for the task graph shown in Figure 1(a) produced by the DCP algorithm [23] and the MCP algorithm [41], which are shown in Figure 4(a) and Figure 4(b), respectively. Note that the schedule generated by the DCP algorithm is an optimal schedule (schedule length = 16 time units). The scheduling order of the MCP algorithm is {n1, n4, n2, n3, n7, n6, n8, n5, n9}, which is an increasing ALAP ordering. However, this order is sub-optimal. On the other hand, the DCP algorithm does not schedule nodes following a static topological order, and it optimizes the use of available time slots in the processors by scheduling some more important descendants first. The DCP algorithm is described in detail in [23].

Let us analyze the schedule produced by the DCP algorithm from another perspective: if we are given the list {n1, n2, n7, n4, n3, n8, n6, n9, n5} and we schedule the nodes on the list one by one using the start-time minimization strategy, we will get a slightly different schedule with the same length (shown in Figure 5), assuming four processors are available. Another list, {n1, n2, n4, n3, n7, n6, n8, n5, n9}, can also result in the same schedule. Thus, we can view the start-time minimization method as a mapping M: Π → S which maps the set Π of topologically ordered lists to the set S of valid schedules. However, such a mapping is not surjective. That is, when we are given an optimal schedule, it is not always possible to find a corresponding list which can lead to that schedule by the start-time minimization strategy. For example, the optimal schedule generated by the DCP algorithm, shown in Figure 4(a), cannot be generated from any list using start-time minimization. The reason is that the node n6 does not start at the earliest possible time, the time right after n7 finishes. On the other hand, such a mapping is not injective either, because distinct lists can be mapped to the same schedule.

†. Note that the DCP algorithm assumes the availability of an unlimited number of processors [23], albeit it uses only three.

node   sl   t-level   b-level   ALAP
n1 *   11      0         23       0
n2      8      6         15       8
n3      8      3         14       9
n4      9      3         15       8
n5      5      3          5      18
n6      5     10         10      13
n7 *    5     12         11      12
n8      5      8         10      13
n9 *    1     22          1      22

Figure 3: The static levels (sl's), t-level's, b-level's and ALAP's of the nodes.


The scheduling problem is then reduced to the problem of finding a list which can be mapped to an optimal schedule. In fact, most of the list scheduling algorithms can be analyzed using this framework. The major differences among these algorithms are: (i) the method of implementing a different mapping M (i.e., a different space and time assignment strategy, which may not be start-time minimization); and (ii) the method of selecting scheduling lists from the set Π. Some algorithms optimize the former while constructing lists by a simple method. Other algorithms, such as the MCP algorithm, optimize the latter while using the start-time minimization strategy as the mapping M. A few algorithms, such as the DCP algorithm, optimize both.

4.2 A Genetic Formulation of the Scheduling Problem

The likelihood of the existence of lists leading to optimal schedules using the start-time minimization technique is very high, albeit it has not been proven that such a list always exists.

Since an optimal schedule is not unique, the list which can lead to an optimal schedule is, therefore, not unique either. We call such lists optimal lists. There are a number of members in the set Π which qualify as optimal lists. A solution neighborhood can then be defined for genetic search. Specifically, we can start from an initial list from which we obtain an initial schedule. We can then systematically modify the ordering within the list in such a way that the nodes are still in topological order (i.e., the precedence constraints are still satisfied). From the new list we obtain a new schedule. If the schedule is better, we adopt it; otherwise we test another modified list.

Based on the above analysis, we give the proposed genetic formulation of the scheduling problem as follows.

Figure 4: The schedules for the example DAG shown in Figure 1(a) generated by (a) the DCP algorithm (schedule length = 16 time units); (b) the MCP algorithm (schedule length = 20 time units).

Encoding: We make a valid scheduling list a chromosome. A valid scheduling list is one in which the nodes of the DAG are in a topological order. For example, the list {n1, n2, n4, n3, n7, n6, n8, n5, n9} is a valid chromosome for the DAG shown in Figure 1(a).

Fitness of a Chromosome: The fitness value of a chromosome is defined as f = (Σ_i w(n_i) − SL) / Σ_i w(n_i), where the schedule length SL is determined by using the start-time minimization method. The fitness of a chromosome is therefore always bounded between 0 and 1. For example, the list used by the MCP algorithm for the DAG shown in Figure 1(a) has a fitness value of (30 − 20)/30 = 10/30 ≈ 0.3333.

Generation of the Initial Population: An initial population is generated from a set of scheduling lists which are constructed by ALAP ordering, b-level ordering, t-level ordering, sl ordering and a random topological ordering, etc. These different orderings not only provide the necessary diversity but also represent a population with a higher fitness than a set of totally random topological orderings. A whole population is then generated from these orderings by performing random valid swapping of nodes in the lists.
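One simple way to realize the random valid swapping used to fill out the initial population is to repeatedly swap adjacent nodes of a seed ordering whenever no edge joins them; an adjacent swap of two unrelated nodes always yields another valid topological order. The helper has_edge() below is a placeholder for whatever edge lookup the implementation provides; the sketch is ours, not the paper's code.

#include <stdlib.h>

/* Placeholder: must return nonzero iff the DAG has a directed edge a -> b. */
extern int has_edge(int a, int b);

/* Perturb a valid scheduling list in place with nswap random adjacent swaps.
 * A swap is applied only when no edge joins the two nodes, so the list
 * remains a valid topological order of the DAG.                            */
void perturb_list(int *list, int v, int nswap)
{
    for (int s = 0; s < nswap; s++) {
        int i = rand() % (v - 1);
        if (!has_edge(list[i], list[i + 1])) {
            int tmp = list[i];
            list[i] = list[i + 1];
            list[i + 1] = tmp;
        }
    }
}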

In the next section we describe the design of the mutation and crossover operators. Since the selection mechanism is related to the migration process, we will discuss this aspect when we present the parallel genetic algorithm.

4.3 Genetic Operators

As the standard crossover and mutation operators may violate the precedence constraints, we need to use other well-defined genetic operators. The inversion operator is not considered because it can obviously generate invalid scheduling lists. We consider three kinds of crossover operators: the order crossover [16], [33], [40], the partially-mapped crossover (PMX) [16], [33], and the cycle crossover [16], [33]. By using small counter-examples, we show that the PMX and cycle crossover operators may produce invalid lists. Therefore, in the proposed algorithm, only the order crossover operator is used. We also describe a mutation operator based on node-swapping.

Figure 5: The schedule generated by the start-time minimization method (schedule length = 16 time units).

Order Crossover Operator: We consider a single-point order crossover operator. That is, given two parents, we first pass the left segment (i.e., the segment on the left of the crossover point) from the first parent, called parent 1, to the child. Then we construct the right fragment of the child by taking the remaining nodes from the other parent, called parent 2, in the same order. An example of the crossover operator is given in Figure 6(a). Note that the chromosomes shown in the figure are all valid topological orderings of the DAG in Figure 1(a). The left segment of parent 1 is passed directly to the child. The nodes in the right segment of parent 1 are then appended to the child according to their order in parent 2.

This order crossover operator is easy to implement and permits fast processing. Its most important merit is that it never violates the precedence constraints, as dictated by the following theorem.

Theorem 1: The order crossover operator always produces a valid scheduling list from two valid parent chromosomes.

Proof: Suppose we are given two parent chromosomes {n_i1, n_i2, ..., n_iv} and {n_j1, n_j2, ..., n_jv}. Let the crossover point be chosen at the kth position. Then, after applying the order crossover, the child will be {n_i1, n_i2, ..., n_ik, n_jx, ..., n_jy} for some indices x and y. Obviously the precedence constraints among any two nodes at or before the kth position will be respected. For a node at or before the kth position and a node after the kth position, the precedence constraint (if any) will also be respected because their relative positions are the same as in parent 1. Finally, for any two nodes after the kth position, the precedence constraint (if any) will also be respected because their relative positions are the same as in parent 2, by the definition of the order crossover operator.

(Q.E.D.)

The order crossover operator as defined above has the potential to properly combine the accurate task orderings of the two parent chromosomes so as to generate a scheduling list which can lead to a shorter schedule. This is because a "good" portion of a parent chromosome is a subsequence of the list which is an optimal scheduling ordering of the nodes in that subsequence.

These good portions are essentially the building-blocks of an optimal list, and an order crossover operation can potentially pass such building-blocks to an offspring chromosome from which a shorter schedule may be obtained.
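In code, the single-point order crossover amounts to copying the left segment of parent 1 and then scanning parent 2 for the nodes not yet used. A minimal sketch follows; node indices are assumed to be 0..v−1 and the names are ours.

/* Single-point order crossover: build child[] from p1[] and p2[] of length v,
 * with crossover point cut (0 < cut < v).                                   */
void order_crossover(const int *p1, const int *p2, int *child, int v, int cut)
{
    char taken[1024] = {0};              /* sketch assumes v <= 1024 */

    for (int i = 0; i < cut; i++) {      /* left segment: verbatim from parent 1 */
        child[i] = p1[i];
        taken[p1[i]] = 1;
    }
    int pos = cut;                       /* right segment: remaining nodes in parent 2's order */
    for (int i = 0; i < v; i++)
        if (!taken[p2[i]])
            child[pos++] = p2[i];
}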

PMX Crossover Operator: A single-point partially-mapped crossover can be implemented as follows. First, a random crossover point is chosen. Then we consider the segments following the crossover point in both parents as the partial mapping of the genes to be exchanged in the first parent to generate the child. To do this, we first take corresponding genes from the two right segments of both parents, and then locate both of these genes in the first parent and exchange them. Thus, a gene in the right segment of the first parent and the gene at the same position in the second parent define which genes in the first parent have to be swapped to generate the child. An example of the crossover operator is given in Figure 6(b). The pairs (n3, n7), (n8, n6), and (n9, n5) are situated at the same locations in both parents (note that the pairs (n6, n8) and (n5, n9) are implicitly handled as well). Their corresponding positions in parent 1 are swapped to generate the child. Then the remaining nodes in parent 1 are copied to the child to finish the crossover.

Unfortunately, unlike the order crossover operator, the PMX crossover may produce invalid scheduling lists. An example of such a scenario is shown in Figure 7. As can be seen, in the resulting chromosome, the positions of n2 and n4 violate their precedence relationship. This is also true for n3 and n5.

Figure 6: Examples of the (a) order crossover, (b) PMX crossover, (c) cycle crossover, and (d) mutation operators.

Figure 7: (a) A simple task graph; (b) An example of generating an invalid ordering of the graph by using the PMX crossover operator; (c) An example of generating an invalid ordering of the graph by using the cycle crossover operator.

Cycle Crossover Operator: A single-point cycle crossover operator can be implemented as follows. We start at position 1 in parent 1 and copy the gene to location 1 of the child. Then we examine the gene at position 1 in parent 2. This gene cannot be copied to the child since the child's corresponding position has been occupied. We then locate this gene in parent 1 and suppose it is found at position i. We copy this gene to position i of the child. Similarly, the gene at position i in parent 2 cannot be copied. We again locate this gene in parent 1 and copy it to the child. This process is repeated until we encounter a gene in parent 2 which has already been copied to the child from parent 1. This completes one cycle. Another cycle is then initiated at the earliest position of the child that has not been occupied, and the copying is performed from parent 2 to the child. Thus, in alternate cycles, the child inherits genes from both parents and the genes are copied at the same locations.

An example of the crossover operator is given in Figure 6(c). As can be seen in the figure, n1 is first copied to the child from parent 1. But the corresponding node in parent 2 is also n1 and, therefore, a cycle is completed. We start over again at position 2 of parent 2. We first copy n4 from parent 2 to the child. Then we find that the node in position 2 of parent 1 is n2, so we copy n2 from parent 2 to position 3 of the child. This time the corresponding node is n7, and so we copy n7 from parent 2 to position 5 of the child. Since the node in position 5 of parent 1 is n3, we copy n3 from parent 2 to position 4 of the child. Now we encounter n4 in parent 1, which has already been copied from parent 2 to the child. Thus, the second cycle is completed. The third cycle starts from parent 1 again, and nodes n8 and n6 are copied to the child at positions 6 and 7, respectively. The last cycle starts from parent 2 again, and nodes n5 and n9 are copied to the child at positions 8 and 9, respectively. Unfortunately, the cycle crossover operator may also generate invalid scheduling lists. An example of such a situation is shown in Figure 7. As can be seen, after the crossover, the positions of n2 and n4 violate their precedence relationship.

Since neither the PMX crossover operator nor the cycle crossover operator guarantees valid scheduling lists, they are not employed in the proposed algorithm.

Mutation Operator: A valid topological order can be transformed into another topological order by swapping some nodes. For example, the scheduling list used by the MCP algorithm—{n1, n4, n2, n3, n7, n6, n8, n5, n9}—can be transformed into an optimal list, {n1, n2, n4, n3, n7, n6, n8, n5, n9}, by swapping n2 and n4. Not every pair of nodes can be swapped without violating the precedence constraints. Two nodes are interchangeable if they do not lie on the same path in the DAG. Using a pre-processing depth-first traversal of the DAG, we can check whether two nodes are interchangeable in constant time during the search. This implies that we can efficiently test whether two randomly selected nodes are interchangeable and, if so, swap them and check the new schedule length. Such swapping actually defines a random search neighborhood. The size of the neighborhood is O(v^2) since there are at most C(v, 2) pairs of interchangeable nodes. We define the mutation operator as a swap of two interchangeable nodes in a given chromosome. This operator captures the major characteristic of mutation, which is to randomly perturb the chromosome in such a way that a lost genetic feature can be recovered when the population is becoming homogeneous. An example is given in Figure 6(d), where two interchangeable nodes n4 and n7 are randomly chosen and their positions in the list are swapped.
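The constant-time interchangeability test can be supported by precomputing reachability with a depth-first traversal from every node; two nodes lie on the same path exactly when one can reach the other. The sketch below follows that idea with a simple boolean matrix (adequate for the graph sizes used in the experiments); all names and the fixed size bound are ours.

#include <stdlib.h>

#define MAXV 512                     /* sketch: at most 512 nodes */

static char reach[MAXV][MAXV];       /* reach[a][b] = 1 iff a path a -> ... -> b exists */

static void mark_descendants(int root, int u, int **child, const int *nchild)
{
    for (int k = 0; k < nchild[u]; k++) {
        int c = child[u][k];
        if (!reach[root][c]) {
            reach[root][c] = 1;
            mark_descendants(root, c, child, nchild);
        }
    }
}

/* Pre-processing traversal: afterwards, "lying on the same path" is a
 * constant-time test: reach[a][b] || reach[b][a].                        */
void build_reachability(int v, int **child, const int *nchild)
{
    for (int a = 0; a < v; a++)
        mark_descendants(a, a, child, nchild);
}

/* Swap mutation: pick two random positions and exchange their nodes
 * only if the nodes are interchangeable (not on the same path).          */
void mutate(int *list, int v)
{
    int i = rand() % v, j = rand() % v;
    int a = list[i], b = list[j];
    if (i != j && !reach[a][b] && !reach[b][a]) {
        list[i] = b;
        list[j] = a;
    }
}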

4.4 Control Parameters

As Tanese [37], [38] has suggested, if the parallel processors executing a parallel genetic algorithm use heterogeneous control parameters, the diversity of the global population can be more effectively sustained. To implement this strategy, we use adaptive control parameters as suggested by Srinivas et al. [35]. The adaptive crossover rate is defined as follows:

µ_c = k_c (f_max − f′) / (f_max − f_avg),

where f_max is the maximum fitness value in the local population, f_avg is the average fitness value, f′ is the fitness value of the fitter parent for the crossover, and k_c is a positive real constant less than 1.

The adaptive mutation rate is defined as follows:

µ_m = k_m (f_max − f) / (f_max − f_avg),

where f is the fitness value of the chromosome to be mutated and k_m is a positive real constant less than 1.

Using the above adaptive crossover and mutation rates, the best chromosome is protected from disruption by crossover and mutation. On the other hand, when the population tends to become more homogeneous, both rates increase because f_avg will be about the same as f_max. Thus, under such a situation, chromosomes are more likely to be perturbed. This helps to prevent a premature convergence to a sub-optimal solution. Note that even though the initial setting of the crossover rate and mutation rate is the same for all the parallel processors, the adaptive strategy gradually leads to the desired heterogeneity of the parameters among the processors.

Two other control parameters which are critical to the performance of a GA are the population size N_p and the number of generations N_g. Usually N_p and N_g are fixed for all problem sizes. However, this is not appropriate because larger problems require more time for exploration and exploitation. We therefore vary these two parameters linearly with the problem size. Specifically, we set N_p = k_p v and N_g = k_g v, where k_p and k_g are real constants.
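The two adaptive rates are direct one-liners in C; the small guard against a homogeneous population (f_max ≈ f_avg) below is our own addition and is not specified in the paper.

/* mu_c = k_c (f_max - f') / (f_max - f_avg), clamped to at most k_c. */
double adaptive_crossover_rate(double f_max, double f_avg, double f_fitter, double k_c)
{
    double denom = f_max - f_avg;
    if (denom <= 0.0) return k_c;                 /* homogeneous population: our guard */
    double mu = k_c * (f_max - f_fitter) / denom;
    return mu > k_c ? k_c : mu;
}

/* mu_m = k_m (f_max - f) / (f_max - f_avg), clamped to at most k_m. */
double adaptive_mutation_rate(double f_max, double f_avg, double f, double k_m)
{
    double denom = f_max - f_avg;
    if (denom <= 0.0) return k_m;
    double mu = k_m * (f_max - f) / denom;
    return mu > k_m ? k_m : mu;
}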

4.5 Parallelization

The global population is partitioned into q sub-populations, where q is the number of PPEs.

For efficiency we use a synchronous connected island model, in which the PPEs communicate periodically to exchange the fittest individual; the communication is a synchronous voting such that the fittest individual is broadcast to all the PPEs. In other words, the network topology connecting the sub-populations is not considered. The reason is that the communication delay is insensitive to the distance between the PPEs on the Intel Paragon.

For the migration and selection of chromosomes, we adopt the following strategy. When the PPEs communicate, only the best chromosome migrates. When the fittest chromosome is imported, the worst chromosome in the local population is discarded, while the fittest chromosome and the locally best chromosome are protected from the rank-based selection process. That is, in addition to having a higher expected share of offspring, they are guaranteed to be retained in the new generation.

The period of communication for the PPEs is set to T generations, where T follows an exponentially decreasing sequence: initially N_g/2, then N_g/4, N_g/8, and so on. The rationale is that at the beginning of the search, the diversity of the global population is high. At such early stages, exploration is more important than exploitation; therefore, the PPEs should work on their local sub-populations independently for a longer period of time. When the search reaches the later stages, it is likely that the global population converges to a number of different fittest chromosomes. Thus, exploitation of more promising chromosomes is needed to avoid unnecessary work on optimizing locally best chromosomes that may have smaller fitness values than the globally best chromosomes.

With the above design considerations, the Parallel Genetic Scheduling (PGS) algorithm is outlined below.


Parallel Genetic Scheduling (PGS):

(1) Generate a local population with size equal to N_p/q by perturbing pre-defined topological orderings of the DAG (e.g., ALAP ordering, b-level ordering, etc.);
(2) i = 2;
(3) repeat
(4)   T = 0;
(5)   repeat
(6)     for j = 1 to N_p/q do
(7)       Using the current crossover and mutation rates, apply the two operators to randomly chosen chromosomes;
(8)     endfor
(9)     Evaluate the local population;
(10)   until ++T = N_g/i;
(11)   Accept the best chromosome from a remote PPE and discard the worst local chromosome accordingly;
(12)   Explicitly protect the best local and remote chromosomes and adapt new crossover and mutation rates;
(13)   i = i × 2;
(14) until the total number of generations elapsed equals N_g;
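Steps (11) and (12) of the outline correspond to a synchronous exchange of the globally fittest chromosome among the PPEs. The original implementation used the Intel Paragon's native message passing; the MPI-based sketch below is only a modern stand-in for the same exchange pattern, and all names are ours.

#include <mpi.h>

/* Synchronous migration: every PPE learns the globally fittest chromosome.
 * local_best is this PPE's best fitness; best_list (v node indices) is the
 * corresponding scheduling list and is overwritten with the global winner. */
void migrate_best(double local_best, int *best_list, int v)
{
    struct { double fit; int rank; } in, out;
    int myrank;

    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    in.fit  = local_best;
    in.rank = myrank;

    /* find which PPE owns the fittest chromosome */
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, MPI_COMM_WORLD);

    /* the owner broadcasts its best scheduling list to all PPEs */
    MPI_Bcast(best_list, v, MPI_INT, out.rank, MPI_COMM_WORLD);

    /* each PPE would now replace its worst local chromosome with best_list */
}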

5 Performance Results

To examine the efficacy of the PGS algorithm, we have implemented it on the Intel Paragon using the C language and tested it using different suites of task graphs. In the first two experiments, we aimed at investigating the absolute solution quality of the algorithm by applying it to two different sets of random task graphs for which the optimal solutions are known. As no widely accepted benchmark graphs exist for the DAG scheduling problem, we believe using random graphs with diverse parameters is appropriate for testing the performance of the algorithm. To compare the PGS algorithm with the existing techniques, we chose two extreme examples. The first is the DCP algorithm [23], which has been compared with six of the best known heuristic algorithms (DLS [34], MCP [41], MD [41], ETF [21], DSC [42] and EZ [31]) and is known to be considerably better in terms of performance while having a slightly higher complexity. The other is the DSC algorithm [42], which is widely known and has been considered by many studies to be one of the best in terms of time complexity with reasonable performance.

In all the experiments, we assumed the target processors are fully-connected and homogeneous.

5.1 Workload

The first suite of random task graphs consists of three sets of graphs with different CCRs: 0.1, 1.0, and 10.0. Each set consists of graphs in which the number of nodes varies from 10 to 32 with increments of 2; thus, each set contains 12 graphs. The graphs within the same set have the same value of CCR. The graphs were randomly generated as follows. First, the computation cost of each node in the graph was randomly selected from a uniform distribution with mean equal to 40 (minimum = 2 and maximum = 78). Then, beginning from the first node, a random number indicating the number of children was chosen from a uniform distribution with mean equal to v/10. Thus, the connectivity of the graph increases with the size of the graph. The communication cost of each edge was also randomly selected from a uniform distribution with mean equal to 40 times the specified value of CCR. Hereafter we call this suite of graphs the type-1 random task graphs.

To obtain optimal solutions for the task graphs, we applied a parallel A* algorithm to the graphs. For details of the A* algorithm, the reader is referred to [24]. Since generating optimal solutions for arbitrarily structured task graphs takes exponential time, it is not feasible to obtain optimal solutions for large graphs. On the other hand, to investigate the scalability of the PGS algorithm, it is desirable to test it with larger task graphs for which optimal solutions are known.

To resolve this problem, we employed a different strategy to generate the second suite of random task graphs. Rather than trying to find out the optimal solutions after the graphs are randomly generated, we set out to generate task graphs with given optimal schedule lengths and number of processors used in the optimal schedules.

The method to generate task graphs with known optimal schedules is as follows. Suppose that the optimal schedule length of a graph and the number of processors used are specified as SL_opt and p, respectively. Then for each PE i, we randomly generate a number x_i from a uniform distribution with mean v/p. The time interval between 0 and SL_opt on PE i is then randomly partitioned into x_i sections. Each section represents the execution span of one task. Thus, x_i tasks are "scheduled" to PE i with no idle time slot. In this manner, v tasks are generated so that every processor has the same schedule length. To generate an edge, two tasks n_a and n_b are randomly chosen such that FT(n_a) < ST(n_b). The edge is made to emerge from n_a to n_b. As to the edge weight, there are two cases to consider: (i) the two tasks are scheduled to different processors; and (ii) the two tasks are scheduled to the same processor. In the first case, the edge weight is randomly chosen from a uniform distribution with maximum equal to (ST(n_b) − FT(n_a)) (the mean is adjusted according to the given CCR value). In the second case, the edge weight can be an arbitrary positive integer because the edge does not affect the start and finish times of the tasks, which are scheduled to the same processor. We randomly chose the edge weight for this case according to the given CCR value. Using this method, we generated three sets of task graphs with three CCRs: 0.1, 1.0, and 10.0. Each set consists of graphs in which the number of nodes varies from 50 to 500 with increments of 50; thus, each set contains 10 graphs.

The graphs within the same set have the same value of CCR. Hereafter we call this suite of graphs the type-2 random task graphs.
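A sketch of this construction is given below: each PE's interval [0, SL_opt] is cut into random back-to-back sections that become tasks with no idle time, and edges are added only between tasks n_a and n_b with FT(n_a) < ST(n_b), with cross-processor edge weights bounded by ST(n_b) − FT(n_a) so that the embedded schedule remains feasible. The helper names, the simple uniform sampling, and the assumption SL_opt ≥ x are our own simplifications, not the paper's code.

#include <stdlib.h>

typedef struct { int st, ft, proc; } Task;   /* one task of the embedded schedule */

/* Cut the interval [0, sl_opt] of processor `proc` into x back-to-back
 * sections (requires sl_opt >= x); each section becomes one task.
 * Tasks are written starting at tasks[pos]; the new position is returned. */
int cut_processor(Task *tasks, int pos, int proc, int x, int sl_opt)
{
    int start = 0;
    for (int k = 0; k < x; k++) {
        int remaining = x - k - 1;                    /* sections still to place */
        int span = (k == x - 1) ? sl_opt - start
                                : 1 + rand() % (sl_opt - start - remaining);
        tasks[pos].st = start;
        tasks[pos].ft = start + span;
        tasks[pos].proc = proc;
        start += span;
        pos++;
    }
    return pos;
}

/* Pick one edge n_a -> n_b with FT(n_a) < ST(n_b); for tasks on different
 * PEs the weight is bounded by ST(n_b) - FT(n_a) so the embedded schedule
 * stays feasible. Assumes at least one such pair of tasks exists.          */
void add_edge(const Task *tasks, int v, int *a, int *b, int *weight)
{
    do {
        *a = rand() % v;
        *b = rand() % v;
    } while (tasks[*a].ft >= tasks[*b].st);

    if (tasks[*a].proc != tasks[*b].proc)
        *weight = 1 + rand() % (tasks[*b].st - tasks[*a].ft);
    else
        *weight = 1 + rand() % 40;   /* same PE: weight does not affect the schedule */
}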

We also used a suite of regularly structured task graphs which represent a number of parallel numerical algorithms. The suite comprises graphs representing the parallel Gaussian elimination algorithm [41], the parallel Laplace equation solver [41], and the parallel LU-decomposition algorithm [25]. As all these algorithms operate on matrices, the sizes of the task graphs vary with the matrix size N in that v = O(N^2). We used matrix sizes from 9 to 18 with increments of 1.

5.2 Comparison against Optimal Solutions

Since the performance and efficiency of the PGS algorithm critically depend on the values of

