Evolutionary computing in search-based software engineering

(1)

LAPPEENRANTA UNIVERSITY OF TECHNOLOGY Department of Information Technology

Evolutionary computing in search-based software engineering

The topic of master’s thesis has been confirmed in the Departmental Council of Department of Information Technology on 3rd December 2003.

Supervisors: D.Sc. (Econ.), Professor Jouni Lampinen and D.Sc. (Econ.), Lecturer Timo Mantere.

Lappeenranta 23rd March 2004

Leo Rela

Korpraalinkuja 3 As. 1 53800 LAPPEENRANTA Finland

(2)

Abstract

Lappeenranta University of Technology Department of Information Technology Leo Rela

Evolutionary computing in search-based software engineering Master’s thesis

2004

125 pages, 30 figures and 2 tables.

Supervisors: Professor D.Sc. (Econ.) Jouni Lampinen and Lecturer, D.Sc. (Econ.) Timo Mantere.

Keywords: Evolutionary algorithms, genetic algorithm, search-based software- engineering, software production.

This master’s thesis aims to study and represent from literature how evolutionary algorithms are used to solve different search and optimisation problems in the area of software engineering. Evolutionary algorithms are methods, which imitate the natural evolution process. An artificial evolution process evaluates fitness of each individual, which are solution candidates. The next population of candidate solutions is formed by using the good properties of the current population by applying different mutation and crossover operations.

Different kinds of evolutionary algorithm applications related to software engineering were searched in the literature. Applications were classified and represented. Also the necessary basics about evolutionary algorithms were presented.

It was concluded, that majority of evolutionary algorithm applications related to software engineering were about software design or testing. For example, there were applications about classifying software production data, project scheduling, static task scheduling related to parallel computing, allocating modules to subsystems, N-version programming, test data generation and generating an integration test order. Many applications were experimental testing rather than ready for real production use. There were also some Computer Aided Software Engineering tools based on evolutionary algorithms.

(3)

Tiivistelmä

Lappeenrannan teknillinen yliopisto Tietotekniikan osasto

Leo Rela

Evoluutiolaskennan sovellutukset ohjelmistotekniikassa Diplomityö

2004

125 sivua, 30 kuvaa ja 2 taulukkoa.

Tarkastajat: Professori, KTT Jouni Lampinen ja Lehtori, KTT Timo Mantere

Hakusanat: Evoluutioalgoritmit, geneettinen algoritmi, ohjelmistotekniikka, ohjelmis- totuotanto

Tämän diplomityön tavoitteena on kartoittaa ja esitellä kirjallisuudessa raportoituja evoluutioalgoritmien sovellutuksia ohjelmistotekniikan alueelle liittyvien haku- ja op- timointiongelmien ratkaisuun. Evoluutioalgoritmit ovat erilaisia luonnon evoluu- tioprosessia jäljitteleviä menetelmiä, joissa keinotekoinen evoluutioprosessi arvioi ratkaisun hyvyyttä mittaamalla tavoitefunktiolla populaation yksilöt eli ratkaistavan ongelman eri ratkaisuehdokkaat. Seuraavan sukupolven populaatio yriteratkaisuja muodostetaan yleensä käyttäen nykyisen sukupolven hyvien yksilöiden ominaisuuk- sia soveltaen erilaisia mutaatio- ja risteytysoperaatioita.

Työssä etsittiin, luokiteltiin ja esiteltiin erilaisia kirjallisuudessa julkaistuja evoluutioalgoritmien ohjelmistotekniikkaan liittyviä käyttötapoja ja sovellutuksia sekä käytiin läpi tarpeellisin osin itse evoluutioalgoritmien teoriaa käsitteineen.

Työssä havaittiin, että suurin osa evoluutioalgoritmien sovellutuksista ohjelmistotekniikassa liittyy ohjelmistojen suunnittelu- ja testausvaiheisiin. Sovellutuksia oli mm. ohjelmistotuotantoon liittyvän datan luokittelu, projektin aikataulutus, rin- nakkaislaskentaan liittyvä staattinen tehtävien jako, moduulien jako alijärjestelmiin, N-versio ohjelmointi, testitapauksien generointi ja integrointitestausjärjestyksen lu- onti. Suurin osa sovellutuksista oli luonteeltaan enemmänkin kokeellisia kuin soveltu- via suoraan todelliseen käyttöön. Toisaalta valmiita evoluutioalgoritmeihin perustuvia tietokoneavusteisen ohjelmistosuunnittelun välineitä on olemassa.

(4)

Symbols and Abbreviations

Artificial Neural Network

Burrows, Abadi and Needham logic (BAN, A logic of authentication) Computer Aided Software Engineering

Csharp, an object-oriented language

Control-Dependence Graph

Condition-Decision Coverage

Control-Flow Graph

Conjugate Gradient method

Computational Intelligence

Constructive Cost Model

Central Processing Unit

Directed Acyclic Graph

Distributed Computing System

Duplication Scheduling Heuristic

Evolutionary Algorithm

Evolutionary Programming Extended Module Graph

!

Evolution Strategy

Function Point

Genetic Algorithm

Grammar Guided Genetic Programming

Genetic Programming

"

Lines of Code

#

Module Dependency Graph

"

Machine Learning

#%$

Mean Magnitude Relative Error (mean of absolute percentage errors)

(7)

Modularization Quality function Mean Square Error

'

Non-deterministic polynomial

)()

N-Variant Programming, N-Version Programming

*)

Ordered-Deme Genetic Algorithm

*

Object-Oriented

+$,.-0/

The Parallel Degree

1)

Parallel Genetic Algorithm, Partitioned Genetic Algorithm

Program Net

$

Rapid Application Development

2

Stongly Typed GP

2!

Task Precedence Graph, Test Precedence Graph

) "

Unified Modeling Language

(8)

1 Introduction

This Chapter represents briefly the overall structure of this document, research objectives, and the background concepts related to this research: software engineering, computational intellicenge and search-based software engineering.

1.1 Research objectives

The objective in this master’s thesis is to study and represent wide selection of applications from literature, where evolutionary computing methods (evolutionary algorithms, EA) are used to solve different search and optimisation problems in the area of software engineering. This thesis will concentrate only on the applications related to software production rather than cases, where evolutionary algorithms are used only in the software. It was to be expected beforehand that the majority of evolutionary algorithm applications related to software engineering were about software testing.

Publications reporting the use of EA methods to solve software engineering problems are classified depending on their position in software development process and their application area. The objective is to find promising application areas for EA, to find areas where EA will be probably used in the practical softare engineering and to discuss the possibilities of more research.

(9)

1.2 Structure of this document

The Chapter 1 is an introduction and describes the background concepts and research objectives for this master’s thesis. The Chapter 2 provides the basic information and theory about evolutionary algorithms, which are needed to understand the contents of this thesis.

This thesis is reporting, how evolutionary algorithms are used for solving software engineering problems. In Chapter 3, is it described how those applications found in the literature are being represented and classified to classes corresponding four major phases of the software development process. The applications itself are described in Chapters 4, 5, 6, and 7.

The Chapter 8 is the summary and describes the number and type of publications publications related to the subject of this thesis. In addition, the overall findings are also described here. In Chapter 9 there are the final conclusions of this research.

1.3 Software engineering

IEEE standard 610.12-1990 defines the term "Software engineering" as:

"The application of a systematic, disciplined, quantifiable approach to the develop- ment, operation, and maintenance of software." [IEE90]

In this thesis that term means work or single process that results in a complete, tested software product that behaves much as possible like the original model that was

(10)

The history of software engineering is relatively short when compared to other engineering sciences. The software technology has developed rapidly and the size of a usual software product can be doubled in a few years. Thus, there exists a rapidly growing demand for new software having increased complexity, but still the productivity of the software development work grows slow, as it is said to be only about four percents per year. [HM98, PP98]

Software engineering methods are usually products of interaction between science and engineering. New problems are usually solved with ad hoc solutions, which will continue their life as a folklore. Folklore gets a systematic form and soon we will have models and theories, which can be applied commonly to real world problems [Sha90]. After that, we have also new problems.

1.4 Computational intelligence in software engineering

In the past, the research of intelligent and adaptive systems has organized to the research area that is called computational intelligence and soft computing. The idea is to replace the lack of explicit models and human knowledge with numerical models that are computed to fit the solution of the problem. [Ala98] Most famous techniques on the area are neural networks, fuzzy logic, and evolutionary algorithms.

Search methods based on techniques such as evolutionary algorithms, simulated annealing, and tabu search are also referred as metaheuristic search and optimisation methods. According Portland Pattern Repository [PPR02], heuristic approach to problem is empirical method that often solves problem but cannot proof that given solution is the best possible. Metaheuristic methods are used to find good heuristics for different problems.

(11)

Portland Pattern Repository [PPR02] gives following example about the metaheuristic approach to the problem:

"What parameter do I use to get good results when applying heuristic method X to problem Y?"

"How do I adjust the parameters of heuristic X so I get better results on problem Y?"

"Which is better, heuristic X or heuristic Y?"

Different computational intelligence methods separately and in various combinations have been used to solve software engineering problems. [PP98]

1.5 Search-based software engineering

Harman and Jones [HJ01b] say that:

"Search-based software engineering is a reformulation of software engineering as a search problem, in which the solution to a problem is found by sampling a large search space of possible solutions."

According to Harman and Jones [HJ01a, HJ01b], metaheuristic search techniques, such as evolutionary algorithms, are applicable to many software engineering related optimisation and search problems because there are usually need to balance competing constraints and cope with inconsistency without any precise rules to compute the best solution. There are usually too many solutions without a perfect one, but still good solutions can be recognised from the bad ones.

Search methods with metaheuristic nature such an evolutionary algorithms, tabu

(12)

search techniques in search-based software engineering are evolutionary algorithms including techniques such a genetic algorihms (GA) [Gol89], genetic programming (GP) [Koz92], evolution strategy (ES) [Rec73, Sch81], and evolution programming (EP) [FOW66]. Overview on various evolutionary algorithms techniques is represented in Chapter 2.

One challenge of search-based software engineering is to reformulate software engineering problems as search problems. Different methods based on evolutionary algorithms have been successfully applied to software engineering problems. For example, GP to automatic programming, GA to project scheduling, and test data generation.

There exists a research network called Software Engineering using Metaheuristic Innovative Algorithms (SEMINAL) [SEM02]. SEMINAL have the objective to demonstrate applicability of miscellaneous metaheuristic techniques to software engineering problems. This thesis will concentrate only to the use of evolutionary algorithms in software engineering.

(13)

2 Overview on evolutionary algorithms

This Chapter provides basic information and theory about evolutionary algorithms needed to understand the contents of this work. Three most widely used techniques - Genetic Algorithm (GA), Genetic Programming (GP), Evolution Strategy (ES), and Evolutionary Programming (EP) - are described briefly.

2.1 Evolutionary algorithms

Evolutionary algorithms are soft computing techniques inspired by Charles Darwins’

“survival of the fittest” theory. The general idea is to have a population of solution candidates (individuals) for the problem. A problem specified evaluation function is used to measure the fitness of each individual. The first population can be initialized with random parameters, but usually all other populations after that contain individuals that derive “good” parameter values from their parent populations. Different kinds of crossover, mutation and stochastic operations are performed when creating new populations.

Evolutionary algorithms can be effective for finding global optimum of complex and non-continuous problems that are too difficult to solve using derivative-based methods.

When using large and n-dimensional search space, evolutionary algorithm cannot always find the best possible solution in reasonable amount of time. However, solutions can still be “good enough” when compared to use of the brute force approach, which means the evaluation of all possible solutions. When the search space is large, brute force cannot be used because of so called combinatorical explosion that means too

(14)

Evolutionary algorithms have been applied to many practical problems successfully.

For example, mechanical shape optimisation [AL97], communications network design [KC93, KC97], finding potential customers from marketing database [Sun92], and optimisation related on electrical power systems [Yin93].

Next Sections (2.2, 2.4, 2.3, 2.5) present the most commonly applied techniques based on evolutionary algoritms. However, for each technique, there is a large number of different approaches and modifications and only the basics are described here.

2.2 Genetic algorithm

Genetic algorithms (GA), originally developed by John Holland [Hol75] are search algorithms based on the natural selection and natural genetics. The original goal for the research was to explain the adaptation of natural systems and to design artificial system that retains mechanism of natural systems. [Gol89]

The simulated evolution process in the genetic algorithm is done with a population of individuals represented by chromosomes. In practice, chromosomes are individual’s parameters encoded to the string, bit or byte presentation.

Because often the first population does not have the final or “good enough” solution, there is need for keeping an artificial diversity in the population. Diversity can be maintained by using the crossover and mutation operations. The mutation is performed by applying a random change to the individual’s chromosomes. A mutation can be complete or affect only few genes. In Figure 1 there is an example, where the mutation was performed for one gene. When using binary encoding, one practical way to perform mutation is to flip one bit in a chromosome. For example, a chromosome

3546373

after the mutation can be ^3847463 .

(15)

Area chosen for mutation After

Before

Figure 1: Mutation in Genetic Algorithm [Rya00]

The crossover in the natural evolutionary process means that child will inherit its properties (genes) from its parents. In genetic algorithms, the crossover operation is needed to mix and inherit good gene combinations from the current population to the new population. In Figure 3 there is an example of a crossover, where childs inherit their genes from the both parents.

Implementation of the genetic algorithm usually does following cycle [HB01]:

1. Evaluate the fitness value for all the individuals in current population.

2. Create new population by using crossover, mutation and reproduction operations.

3. Discard the old population and continue iteration.

(16)

Child 1

Child 2 Parent 1

Parent 2

Crossover point

Figure 2: Example of crossover in the Genetic Algorithm [Rya00]

2.3 Genetic programming

Genetic programming (GP), introduced by John Koza [Koz92], has the objective to let a computer solve problems without being explicitly programmed to do so. Genetic programming can be seen as an extension of the conventional genetic algorithm. While the GA will generate a string that represents the solution, the GP generates a computer program as the solution. [Neg02]

High-level programming language called LISP was chosen as the main programming language for GP. Individuals in the population are variable length strings and each string will encode a candidate solution. Each solution is a parse tree that represents

(17)

a computer program. Like GA, GP uses a fitness function to evaluate solutions.

[HB01, Neg02]

for crossover Subtrees selected

3 +

2 *

4 +

2 6

3 4

7 *

−

7 6

Figure 3: Crossover for two parent parse trees [Rya00]

GP uses the crossover operation, which will select two programs and produces two offspring programs. Offspring programs are recombinations of their parents with randomly chosen parts swapped between them. An example of crossover operation in GP is shown in Figure 3: there are two parse trees representing parent functions

9;:=<>+?

and^@BADC . The crossover produces two new parse trees with functions^9;: ^C

and^@A ^<>? . [Neg02, HB01]

(18)

GP programs are usually composed of elements like fuctions and terminals. Mutation in GP can randomly change a terminal symbol to different terminal symbol or a function to another function. [Neg02]

2.4 Evolution strategy

Evolution strategy (ES) was developed to solve technical and mechanical optimisation problems and was previously only known in the civil engineering area. Evolution strategy can be used when no suitable objective (fitness) function exists or it is too difficult to use and no other suitable optimisation method exists. [HB01]

In the original ES, an individual (parent) in population generates one new individual (offspring) per generation. The process causes small mutations to occur more likely than larger mutations, until the offspring performs better than its parent. For example, the mutation operation can be based on normal probability distribution. Later on, ES was generalized to the form, which imitates the natural evolution process by using recombination, mutation, and selection. [Gol89, HB01]

2.5 Evolutionary programming

Evolutionary Programming (EP), introduced by Fogel et al. [FOW66] allows the use of different length individuals in the population. There is no crossover in EP, but instead the individuals selected as parents are subjected to the mutation to produce children. It has been said, that any structure, which can be mutated, can be evolved by using EP.

[Rya00]

(19)

3 Classification of EA applications for solving software engineering problems

The software development process is usually divided into four major phases which are derived from the linear sequential model (Figure 4), also known as the waterfall model.

[Pre01] The phases in chronological order are analysis, design, implementation, and testing. In this thesis, those four phases are used as rough classes, into which different EA based applications for solving software engineering problems are classified. Each EA based approach that is used to solve a particular software engineering problem is called as an EA application or a case. Classes will contain subclasses, which are also called application areas. For example, the testing phase contains separated subclasses for the structural testing and the functional testing.

Analysis Design Code Test

System/information engineering

Figure 4: Linear sequential model, also known as waterfall model [Pre01]

The first half of the classes, the analysis and the design, represent the area of

(20)

rest of the classes, the implementation and the testing are more concentrated on the software under development. The following four Sections (3.1, 3.2, 3.3, 3.4) will briefly describe each class.

3.1 Analysis

In a software development project, the analysis phase contains the feasibility study and the requirements analysis phases. In the feasibility study phase, highly generalised system level requirements are defined. Those requirements are only customer requirements and usually have no technical details about the system. Customer requirements are analysed and software requirements, needed in the requirements analysis phase, are derived from them. The requirements analysis is a gathering process that is intensified and focused specifically on the software itself. To understand the software being implemented later, the information about the required functionality, behaviour, performance, interfaces, and the application domain must be understood.

[HM98, Pre01]

In this thesis, the analysis phase corresponds to the first class and is meant to contain the applications related to early software project tasks. Those tasks could be about the project management and design, project effort, and software quality prediction. The analysis class is about software project design rather than the software design, with which the next class is concerned.

(21)

3.2 Design

The second class is called design and concentrates on the software design, which is used to translate software requirements to a representation of the software. The software design process focuses on four attributes of a program: the data structure, the architecture, interface presentations, and algorithmic details. [Pre01]

This class will contain the applications, which are used to search for the software structure or the internal functionality. Also applications about the searching of optimal runtime organisation of the software are concerned, e.g. the resource and task allocation in the distributed system.

3.3 Implementation

Third class is called implementation and refers to implementation or code generation phase in the linear sequential model. In this phase, the previously produced software design is translated in to machine readable form - e.g. the computer program. EA applications which can be used to produce computer programs or support work of the human programmers, are classified in to this class.

Because genetic programming (GP) technique is meant only to generate programs, almost every use of the GP goes into this class. However, it is not appropriate, within the scope of this work, to report every use of the GP in the literature as an EA based application in the implementation phase of the waterfall model. That is why this class contains only cases which are closely related to the well known software engineering problems.

(22)

3.4 Testing

The testing phase refers to testing process that focuses on logical internals of implemented software. The objective is to uncover errors and to ensure that the program behaviour is as expected. Test results must coincide with the requirements.

[Pre01]

All applications related to software testing are classified in to this class. Applications can be about test case generation or searching for program inputs which will cause failures or too long response times.

(23)

4 Analysis

This Chapter represents applications related to the analysis phase of the waterfall model. The related application areas are: prediction of software failures, exploring difficulty of the problem, software project effort prediction, and project management.

4.1 Prediction of software failures

Software failures usually occur after some implementation tasks have already been completed. If failures are found in testing phase, there will be need for extra implementation tasks, which will try to eliminate those failures. The extra work will usually cause development costs to increase. If failures are found after software is in use, the costs could go up even more. Because of this, it is important to find software failures in early phases of development.

Using special reliability techniques in the software development will require more resources, time and money. Because of this, it is important not to use them in every part of the system development. Quality prediction methods could be used to determine parts of the system to, which realibility techniques should be applied [EKCA98].

Evett et al. have described in the paper GP-based software quality prediction [EKCA98] the model based on GP, which can be used to find modules with high propability of failure and should be implemented with using reliability improvement techniques. The GP system will predict a number of faults the module is likely to produce. Number of faults will be used to rank the modules. Thus, final evaluation of quality will be based on ordinal criteria rather than number of failures.

(24)

There exists a software quality data from software development projects. Data set has attributes for unique and total number of operators and operands, lines of source code, lines of executable code, and two values for cyclomatic complexity (McCabe’s cyclomatic complexity and extended cyclomatic complexity, number of linearly-independent paths throught a program [MB94]). Modules are classified as fault-prone and not fault-prone. [EKCA98]

The data was splitted to training and validation data sets. The GP system was trained by using training data. For every run, the best program was returned as the result.

The ability of the trained program to generalise the data was measured by using the result program to predict faults also for validation set. After that, best-of-run programs were ordered accordingly. The usefulness of the model is evaluated by its ability to approximately order modules from the most fault-prone to the least fault-prone. The order was compared to random order of modules. In this case, random order results were really different than GP results. [EKCA98]

Even best GP result programs are not expected to predict failure module perfectly.

According to Pareto’s Law, 20% modules have 80% of total number of failures (in this case, over 70% of modules has near zero failures) and developers should be interested in those top 20% of modules in order. [EKCA98]

In the paper Automated Knowledge Acquisition and Application for Software Development Projects Baisch et al. [BL98] represented how to express tailored fuzzy expert system capable of classifying software modules by their propability of having errors. Quality evaluation was interpreted as the transformation of different software metrics to a quality factor, which means correctness or maintainability of the software.

Because rules consist of a complex combination of different metrics, Baisch et al.

propose to use GA to assist human experts to define rule-base for classification.

(25)

Every individual in to population represents one fuzzy expert system encoded as a binary string. Fitness function was based on the contingency-table like weighting according to false and correct classification data. Modules were classified to two different classes: Modules with fault amount of 0 to 5 and more than 20 faults. Data was splitted in training and test sets. Extracted expert system was able to classify correctly 87.8% of training data (409 modules) and 82.4% of test data (301 modules).

System was also tested with real-world application. When applied to top 10% modules with high amount of faults, the prediction of failures reduced total number of late faults in top-10 down with 14%. [BL98]

Liu et al. in the paper Genetic programming model for software quality classification [LK01] have used GP to classify software modules. Liu et al. have used real world datasets from large software written in C++. The used dataset was about 807 software modules classified as fault-prone or not fault-prone. Dataset has attributes for a number of times the source code was inspected, number of LOC for different production phases and final number of commented code. System was made to run faster than interpreted LISP by having pointers to C functions in nodes of parse trees.

Data was splitted into two sets (training and validation). At first, GP model tries to predict number of faults for each module. Then it classifies modules. Fitness was calculated from number of hits and raw fitness, which comes from pre-defined cost of misclassification values. Built GP model has misclassification rate of about 20% and it was compared against regression model, which has a rate of over 30%. In conclusions of the paper, the GP was said to be more accurate in this case. [LK01]

Bouktif et al. discuss in the paper Combining Software Quality Predictive Models:

An Evolutionary Approach [BKS02] the problem caused by different software quality

(26)

published. For a solution to this problem, Bouktif et al. propose the use of existing models as independent experts. [BKS02]

In an approach to this problem, it was tried to combine decision trees into one final classifier system with GA. When the input vector was given to the system, decision making starts from the root of the tree and corresponding edges are followed.

Questions in nodes can be something like Is^EGFIH ?. [BKS02]

NPPM LCOM

stable

<= 16 >16

>10

<= 10

unstable stable

Figure 5: Example of a decision tree for the software stability prediction [BKS02]

Chromosome representation of trees was made with set of boxes with sides parallel to the axes. In Figure 5 there is a simple decision tree for software stability prediction with OO related variables LCOM (lack of cohesion methods) and NPPM (Number of Public and Protected Methods in a class). Same tree in a box representation is in Figure 6. [BKS02]

(27)

unstable stable

16

10 NPPM

LCOM

Figure 6: Example of output regions for decision tree for software stability prediction [BKS02]

JLKNMOLP

3

Q R

ST

UWV

X

TYT

Z R

[

UWV X T[

(1)

Fitness was calculated by using Youden’s J-index, which calculates the average correctness per attribute. Function is represented in Equation 1 where^X

T[

is the number of training vectors with real attribute^\

T

classified as^\ ^[ . [BKS02]

According to Bouktif et al. [BKS02] The system was tested by predicting the stability of classes written in Java. 22 structural software metrics about coupling, cohesion, inheritance, and complexity were selected for attributes. Data was randomly split into 10 datasets with equal size and then the classifier is trained with 9 sets. The last one dataset was used as test data. This was repeated with all 10 possible combinations.

After test experiments, it was concluded, that resulting meta-expert model can perform significantly better than individual models.

(28)

GA is used to train ANN to predict software quality in the papers Using the genetic algorithm to build optimal neural networks for fault-prone module detection [HKAH96] and Evolutionary neural networks: a robust approach to software reliability problems [HKAH97].

Other related papers are Software quality knowledge discovery: a rough set approach [RPA02] and Using genetic programming to determine software quality [EKA99].

4.2 Exploring difficulty of the problem

During early phases of a software development project, developers usually do not have an exact awareness about the problem, which should be solved using the software.

Developers increased uncertainty about the real nature of the problem could increase cost and duration of the project. It is commonly said, that bad decisions occuring in early phases of project will lead to problems, which will be expensive to eliminate later.

One way to avoid increased development cost is to collect more knowledge about the problem to be solved with the software.

Feldt in the paper Genetic Programming as an Explorative Tool in Early Software Development Phases [Fel99] (available also in Biomimetic Software Engineering Techniques for Dependability [Fel02]) has presented the idea about using genetic programming to explore the difficulty from input data. The term software problem exploration using genetic programming (SPE-GP) is proposed to mean this approach.

The target system is software, which controls arresting of landing aircraft on a runway.

This is commonly used in military aircraft carriers, where the length of the runway is limited. There is a cable on the runway, which will catch the incoming aircraft. The controlling system has to apply suitable pressure to drums of tape attached to cable for

(29)

smooth stop. The system has to stop the aircraft close as possible to target distance without exceeding physical limits (e.g. length of the cable or tape and maximum retarding force applied to the pilot and the aircraft) [Fel99, Fel02]

Figure 7: Difficulty of test cases as function of mass and velocity [Fel99]

Values from the simulation are used to assign penalty values on the four fitness criteria.

All penalty values are summed to the total fitness of the test case, therefore this is a

(30)

tree was evaluated with 10000 test cases with different aircraft mass and velocity values. If any physical limit was exceeded, failure was recorded. [Fel99, Fel02]

Difficulty of the test case is a proportion of programs with failures. Test case difficulty with different masses and velocities is shown in Figure 7 where dark area means higher difficulty. In original requirements, there was defined maximum hook force for certain points with specified mass and velocity. Because of this, there is separated areas with higher difficulty in the middle of the Figure 7. In addition, there was discussion on the paper, which critised the usage of the low-dimensional input space and lack of research about effect of altered requirements to difficulty. Feldt says also, that SPE- GP technique includes the risk, that result data will be GP-specific rather than human- specific. [Fel99, Fel02]

4.3 Software project effort prediction

In the early days of computing, the cost of the software itself was only a small part of the total cost of an information system. Nowadays software is usually the most complex part of the system and developing it may be expensive when compared to other costs related. [Pre01]

Cost and effort estimation for a software development is not very exact science. Larger number of variables affects the total cost of software [Pre01]. Because of this, many computational intelligence methods have been applied to estimate cost.

In the paper On the problem on the software cost function [Dol01] Dolado has tried to use classical regression and GP for publicly available data sets to find software cost functions. Both methods were used because classical regression makes assumptions about the distribution of data and GP can explore data without assumptions. There

(31)

were 12 different sets with software product size (e.g., LOC or FP values) as independent variable. As result, significant good predictions attained solely by the product size was not found. Dolado has also compared GP, ANN and regression methods for software cost estimation in the paper Limits to the Methods in Software Cost Estimation [Dol99].

Burgess et al. in the paper Can genetic programming improve software effort estimation? A comparative evaluation [BL01] have trained GP system to predict effort of software projects and compared results to usage of other methods. Used data was taken from 81 projects from a Canadian software house and had attributes about developers/managers experience, development environment, year of completion, and 5 other attributes about software size and complexity. The dependent attribute was effort (person-hours). The data was splitted into training and query (test) sets with ratio 63/18.

Different methods were used to predict efforts: random, linear LSR, 2/5 nearest neighbours, ANN and GP. Errors were measured with different functions. ANN has superior performance and GP seems to be able to provide accurate estimates. Authors believe, that after their results, further investigation is needed. In conclusions it was said, that ANNs and GPs give good accuracy but they will need more effort to setup (training). [BL01]

(32)

Shan et al. have used almost same approach in the paper Software Project Effort Estimation Using Genetic Programming [SMLE02]. The Grammar Guided GP (GGGP) was used to fit a model. In GGGP, the objective is to find good production rules of grammar, which are used to produce the actual program.

effort = size * (if application in{dss,missing} then 5.34 else 1.68) + 18.5 * team_size * log(size) + 92.7 * log(size)

Figure 8: The example of GP program [SMLE02]

According to Shan et al. [SMLE02], Data set of 423 software projects was splitted to training and test sets (211/212). For each project, there was high number of different attributes: 32. There were numerical and non-numerical attributes, e.g. team size, size (in function points), business area type, usage of object-oriented techniques and usage of different project management methods. In the Figure 8 there is an example of GP program.

As a result, it was observed that in best programs there were no used non-numerical attributes in best-run programs, only numerical. Shan Y. et al. suggests, that non- numerical attributes are too complex to be discovered by the GP or not closely related to the project effort. [SMLE02]

Neural networks trained with GA is also used to estimate efforts in the papers Neuro- genetic prediction of software development efforts [Shu00].

(33)

Other related papers are An evolutionary approach to estimating software development projects [ARRRT01] and A validation of the Component-Based Method for Software Size Estimation [Dol00].

4.4 Project management

Project manager has time, resources, and a goal for the project. The goal is reached when all the project tasks are completed. Each task needs resources (like employees) and time to be completed. There is a limited number of resources for use and project must be completed before the pre-defined date.

Usually resource usage and task completion order can be planned by using timeline chart. In Figure 9 there is an example of timeline chart (also known as Gantt chart) for the project of six tasks and three employees. Horisontal bars are tasks and lines between them are dependencies: e.g., tasks 2 and 3 cannot be started before task 1 is completed.

1 2 3 4 5 6

emp2 emp1

emp2

emp3; emp1

emp3; emp1 emp2 15 Dec ’03

Task

Figure 9: Gantt chart for the project with three employees and six tasks.

(34)

Chang et al. [CCNC98] have developed a formal model called Software Project Management Net (SPMNet) to model software development projects. Model has an automatic resource allocating and a scheduling based on GA. Projects and tasks needed to complete are representated by using Task Precedence Graph (TPG). In Figure 10 there is an example of TPG with ten tasks. Each task has values for man month (MM) and skill required (SR).

MM:3 SR:1,2,3

MM:7 SR:4,5 MM:6

SR:1,6,7

MM:6 SR:1,2,4

MM:9 SR:5,6 MM:10

SR:1,2,3

MM:11 SR:2,3,5 SR:1,4,5

MM:4

SR:2,3 MM:7

MM:5 SR:2,4

T10 T8

T7

T9 T6

T5

T4

T3 T2

T1

Figure 10: Example of Task Precedence Graph. (MM: man month, SR: skill required) [CCNC98]

The GA system takes TPG and employee/skill database as an input and outputs a near- optimal schedule. The project schedule itself is used as a chromosome and it is a string, which tells the order how employeers will participate to tasks. GA-based scheduling was compared against exhaustive search (which generates a set of all possible solutions and returns the best one) and GA was able to find optimal solution in reasonably shorter time. [CCNC98]

(35)

Later in the paper Genetic Algorithms for Project Management [CCZ01] Chang et al.

have represented improved GA-based project scheduling method. New method uses 2D array, which allows many to many and one to many relations between tasks and employees. A C++ library of GA components called GAlib was used. The approach have similarities to previous research [CCNC98], where GA takes TPG and employee database (skills and salary) as a input. Individuals are 2D arrays with employeers enumerated along the rows and tasks along the collumns. The population diversity is increased by using several operations, like flip, destruction, and swap in mutation. In Figure 11 there is an example of used 2D array crossover. Also the support for multiple projects and partial commitment (e.g., employees A and B can take part to the task with ratios^{4^]Y<} and^46]^@ ) is added. [CCZ01]

=>

Figure 11: Example of the crossover of two 2D arrays [CCNC98]

(36)

The objective function is composition of four objectives:

- Validity of job assignments (Validity). If all skill requirements for the tasks are satisfied then _`badcfegcNhji

P 3

, otherwise⁴ .

- Minimum level of overtime (Overload). Amount of over time for all employees.

- Minimum cost (CostMoney). Labor costs of the project. Calculated by using labor rates of each resource and the hours used for the tasks.

- Minimum of time span (CostTime). Maximum time span, in which the project must be completed.

The used composite objective function needed to maximise is:

M

ckh Xlnmnm

P

_!`0adcoebckhji

> Kop

Vrqtsu l5vxwzy

`be

: p|{

qt}

yxm

h~

ynXl

i : p|

qt}

yxm

hcN

l O

(2)

The objective function (2) allows usage of weigts ^; for component objectives. In this case, the_`badcfegcNhji component with value ⁴ will cause large penalty for the whole function. GA system was tested with test problems and was superior when compared to exhaustive search. Also the ability to set weights for the components of the objective function was said to be very usefull. [CCZ01]

Chang et al. are going to continue research on this area. They will suggests integration of some cost model (like COCOMO) with the constrains for better relationship with time to complete the task and number of people involved in it.

(37)

In the paper Projecting Risks in a Software Project through Kepner-Tregoe Program and Schedule Re-Planning for Avoiding the Risks. Komiya et al. [KH00] have proposed the method for avoiding risks in software projects through Kepner-Tregoe program. Kepner-Tregoe is logical method for problem solving and decision making.

The used method itself was project task scheduling with the GA.

(38)

5 Design

This Chapter represents applications related to the design phase of the waterfall model. The related application areas are: multiprocessor scheduling, task and resource allocation in distributed systems, hardware/software co-design in embedded systems, protocol construction, and architecture design.

5.1 Multiprocessor scheduling

A multiprocessor system is a set of two or more processors, which communicate with each other. A parallel program is set of tasks to be executed under a number of constrains. In static multiprocessor scheduling, tasks can are scheduled to the processors before the execution or during runtime. If tasks, which had been scheduled for different processors, communicate during their execution, it will cause message passing and waiting, which will slow down the execution. Usually the goal in static multiprocessor scheduling is to minimise the total expected runtime of tasks.

[CFR99, JPP00]

A dynamic multiprocessor scheduling will occur on runtime and is widely used in modern operating systems. All scheduling, dynamic or static, is also referred as load balancing.

A directed acyclic graph (DAG) can be used to represent computer program. Nodes are tasks and arcs between them are data dependencies. Weights in arcs means cost caused by message passing and weights in nodes representing computational cost. [DAA95]

(39)

Dhodhi et al. say that efficient assignment and scheduling of parallel program tasks are important in the effective utilisation of multiprocessor system [DAA95]. In the paper A Multiprocessor Scheduling Scheme Using Problem-Space Genetic Algorithms [DAA95] they represent a GA based technique, which can be used to increase resource utilisation and therefore to reduce completion time of the parallel program.

1

2 3 4 5 6 7 8 9

10 11 12 13

14 15 16 17

18

3 3 3 3 3 3 3 3

10 10 10 10

5 5 5 5

Cost = 1

Figure 12: Directed acyclic graph (DAG). Nodes are tasks and arcs are data dependencies between them. [RL90]

(40)

7 4 13 11 17 14 18

8 5 2 10

1 9 6 5 12

P3 P2 P1

Time

15

16

Figure 13: Task allocation for three processors ( ^X ) [DAA95]

According to Dhodhi et al. [DAA95], GA was used to search good assignment of tasks of the program to different processors. DAG (Figure 12) is used to represent a program.

The task execution can start after all the data it needs have been received from previous tasks for, which task has dependency. In this case, the communication cost is always zero between tasks assigned to the same processor. If tasks are in different processors, the weight of the arc between their nodes means communication cost. In good task assignment, every processor has a task to execute and total execution time of program is minimal.

Individual (Figure 14) is a list, which index corresponds with DAG node numbers. The list itself contains task priority values. The tasks without previous dependencies are placed to the new task list. Then a task with highest priority is allocated to the available processor on, which the start time of the task with dependencies is the earliest. This is repeated until there are no tasks to allocate or no idle processor. A task cannot be allocated to processor until tasks it depends are completed and communication from them has occured. Figure 13 shows complete task allocation for three processors ( ^X ).

[DAA95]

(41)

Figure 14: GA individual, with list having node (task) priorities [DAA95]

Fitness value to be minimised is calculated by taking MAX value from the set of completion time of processors. The completion time includes computational, communication and waiting times. [DAA95]

Experimental results for this technique were compared against examples from literature and results were very promising. It was concluded, that proposed technique was able to reduce completition time and increase processor utilisation. [DAA95]

In the paper Scheduling using Genetic Algorithms Ursula Fissgus [Fis00] has used GA to find good execution orders for tasks. DAG was replaced by extended module graph (EMG), which has separated structural and data dependencies between modules. In EMG, the structural dependency between modules A and B means that module A must be executed before B. Data dependency means, that module A will produce data to module B as a parameter. [Fis00]

Each individual represents an execution order of tasks and also versions of tasks.

Each task has one or more versions implemented using different functions. Fitness is measured by using total execution time of the program. In conclusions part of the

(42)

to choose between different implementations of tasks can lead to faster programs.

[Fis00]

Qi-Wei Ge in the paper PARAdeg-processor scheduling for acyclic SWITCH-less program nets [Ge99a] has used GA for multiprocessor scheduling problem for data-flow program nets (PN). Data flow program is run by the idea of data-driven processing, which can be used to avoiding memory access bottlenecks. PN differs from DAG, e.g. by allowing to visit in one node several times.

Tsuchiya et al. in the paper Genetics-based multiprocessor scheduling using task duplication [TOK98] represents a GA approach to multiprocessor scheduling problem, which uses task duplication. If tasks are executed on different processors, a message passing between the processors will cause delay. Task duplication means scheduling with copies of the task in different processors, which can eventually reduce delay caused by communication.

According to Tsuchiya et al. [TOK98], the individual is set of lists. Each list represents a processor and had tasks assigned to it. Order of tasks in list is also execution order.

An example of representation is in Figure 15. Each individual must satisfy following three conditions [TOK98]:

1. Every task is allocated to at least one processor.

2. No task is allocated to one processor more than once.

3. For any task, none of its predecessors are executed after the task on the same processor.

(43)

P

1

P

2

t

2

t

1

t

3

t

5

t

4

t

6

t

3

t

1

Figure 15: Tasks scheduled for two processors ( ) with duplicated tasks ^h ^V and ^h [TOK98]

Fitness value ^M is defined in Equation 3 where ^wzj is the maximum length from all schedules in population and ^a^l5XW ^h

KoLO

is the length of the currently evaluated schedule. [TOK98]

M'P

wj

Aa l5X

h

KNzO

354

:38

{

(3)

Method was compared against Duplication Scheduling Heuristic (DSH) and as result it was observed that when the communication delays are small, the GA was able to find better schedules than DSH. Otherwise, both methods had almost the same performance. [TOK98]

Jung et al. had used ordered-deme GA (OGA) for multiprocessor scheduling in the paper An Ordered-Deme Genetic Algorithm for Multiprocessor Scheduling [JPP00].

In OGA, a globally sorted population is divided to the sorted array of subpopulations.

(44)

The average fitness of each subpopulation is maintained in a stepwise manner and converted to one value. This method can improve search capability because search takes place independently in each subpopulation. Scheduling with OGA was compared to GA and was performed better in 13.4% of all the task graphs.

Yi-Hsuan et al. in the paper A Modified Genetic Algorithm for Task Scheduling in Multiprocessor Systems [LC03] has used partitioned genetic algorithm (PGA) for multiprocessor scheduling. PGA integrates the idea of divide and conquer to partition the entire problem space into subgroups, solve them individually, and merge them to the solution of the problem.

At first, task graph is partitioned in to subgroups, which are scheduled using GA. The schedule itself is represented as list of tasks. After subgroups are scheduled, they are combined to the final solution. A single schedule is represented as a list with execution order of tasks and number of processor to which task is scheduled. [LC03]

T₅ T₁₀ T₈ T₆ T₇

T₉ T₁₁ T₃ T₁ T₄ T₂ T₀

1 0 2 1 1 2 0 1 2 0 1 2

Figure 16: A single schedule with task numbers ( ) and number of processor to, which the task is allocated [LC03]

According to Yi-Hsuan et al. [LC03], in experiments, it was noticed that PGA was usually able to find better solutions with lesser time than the original GA.

(45)

In the paper Efficient Scheduling of Arbitrary Task Graphs to Multiprocessors using A Parallel Genetic Algorithm, Yu-Kwong and Ahmad [KA97] represents a Parallel Genetic Scheduling algorithm (PGS), which can be used for multiprocessor DAG scheduling. The PGS itself is a parallel algorithm, which can cause scheduling to be faster. In experimental study, PGS has found optimal solution for half of the cases.

There is more research about using the GA in (static) multiprocessor scheduling: An implementation of the linear scheduling algorithm in multiprocessor systems using genetic algorithms [BC00], Genetic algorithm approach towards scheduling DAG on multiprocessor [YQN01] and A two-processor scheduling method for a class of program nets with unity node firing time [Ge99b], Pareto-based soft real-time task scheduling in multiprocessor systems [OBWK00] Fast scheduling and partitioning algorithm in the multiprocessor system with redudant communication resources [Las01], and Scheduling Multiprocessor Tasks with Genetic Algorithm [CFR99]. In addition, a cellular automata [Ser98] was also used to design multiprocessor task schedules.

5.2 Task and resource allocation in distributed systems

In distributed systems, all the information and functionality is usually allocated to different nodes of the system. Usually one node needs access to its own resources but also resources in other nodes. Communication with other nodes causes delay in execution and can decrease overall reliability of the system.

Sanjay Ahuja in A genetic algorithm perspective to distributed systems design [Ahu00]

has developed a GA based framework for allocating data files to the nodes of the

(46)

the nodes are connection links, which can have reliabilities and capacities. For every program, it is known how much traffic is required to be transferred during execution.

The GA takes DCS graph, values for link reliables and capacities, program allocation in nodes, files needed for each program, and traffic required to be transferred when the program is executed. The goal is to maximise^B ^K ^R ^O , which means Average Distributed Program Throughput of program ^R . The function is defined in Equation 4 where^H is the traffic delivered across the system and ^H

T

is traffic requirement of the system for program ^!c. [Ahu00]

*

K R

OzP H

H

T

(4)

The file allocation (individuals) is coded to binary numbers. In DCS with four nodes, the single file to allocate had ^V ^{ ^! where binary number for ^[ tells if file

T

is present in node ^[ . GA run is terminated when the average and maximum^B ^K ^R ^O equals, number of generations reach the maximum or all individuals in the population are same. The GA was compared to exhaustive search and it was concluded that GA was usually able to find optimal solutions using less computational time. [Ahu00]

Grajcar has used genetic list scheduling algorithm, based on list scheduling and GA, to assigning tasks to processors in coupled and heterogeneous system. Research is represented in the paper Genetic List Scheduling Algorithm for Scheduling and Allocation on a Loosely Coupled Heterogeneous Multiprocessor System. [Gra99]

Kumar et al. in the paper Genetic algorithm based approach for file allocation on distributed systems [KPG95] have done earlier research on GA based file allocation in DCS where approach was very similar to [Ahu00].

(47)

Ahuja and Kumar have used GA to allocate programs and data into nodes of distributed systems in A genetic algorithm approach for performance based reliability enhancement of distributed systems [AK94].

Jaewon Oh et al. have represented a GA based model for allocation of objects into distributed system in the paper A formal model for allocation of objects into heterogeneous distributed environments [OCW00].

Kim and Hong in the paper A task allocation using a genetic algorithm in multicomputer systems [KH93] had used GA to allocate program modules for processors in multicomputer systems. Method and results are very similar to other research where GA is used for task scheduling in multiprocessor systems (Section 5.1 Multiprocessor scheduling).

There are also other papers about task scheduling (in multiprocessor systems), which are related mainly to multicomputer/distributed systems: Object-oriented simulation and GA [Ram01], Genetic algorithm based data and program partitioning [SNYF00], Genetic scheduling algorithms in distributed computing systems [WYKH97], Efficient allocation of program modules on multicomputers [BA94], Building a retargetable local instruction scheduler, and Improved Static Multiprocessor Scheduling using Cyclic Task Graphs: A Genetic Approach [SM98].

5.3 Hardware/software co-design in embedded systems

There is need for real-time embedded system, which operate reliably and predictably.

To achieve these requirements, usually hardware and software designers will co-

(48)

Korousic-Seljak and Cooling have proposed a GA based co-design technique in the paper Optimization of multiprocessor real-time embedded system structures [SC94].

When a multiprocessor real-time embedded system is being designed, this technique can be used to search optimal node structure and task allocation.

The system is a set of computing nodes. Nodes will perform tasks and are connected to each other with I/O devices. The system kernel has scheduler, which is responsible for allocating ready-to-run tasks to processors. The GA is used find schedule where all tasks are allocated and completed before their deadline time is expired. If there is no schedule where no deadlines are expired, some tasks must be rejected. In real-time system design, decisions must be taken between the criticality and timeliness. [SC94]

P_m P_i P₂ P₁

t₁ t₂ t_j t_n

0 0

0

0 0

0

1

1 1

. . .

Figure 17: An encoded individual with processors and tasks^h [SC94]

The individual (Figure 17) is encoded to an array where first dimension corrensponds to the processor number and second to ready tasks^hj . Numbers 0 and 1 indicates if the task is allocated to the processor or not. In this approach, there was no dependencies or communication between the tasks. [SC94]

(49)

Grajcar in the paper Conditional scheduling for embedded systems using genetic list [Gra00b] has used GA based genetic list scheduling algorithm to search for good task schedules for the real-time system. Task execution is assumed to be conditional, which means that executed tasks and their start times depends on previous task executions.

The idea of using genetic list scheduling algorithm on multiprocessor task scheduling was represented in the paper Genetic List Scheduling Algorithm for Scheduling and Allocation on a Loosely Coupled Heterogeneous Multiprocessor System [Gra99], which was introduced in Section 5.2.

Semeraro has used GA to find near-optimal system configurations for real-time operating system in poster paper Evolutionary approach to real-time analysis [Sem98].

Other articles about using GA in embedded systems hardware/software co-design are Optimization of multiprocessor real-time embedded system structures [YG02a], System level software/hardware partitioning by genetic algorithm [YG02b], Heuristics to optimize the speed-up of parallel programs [Agu96], and A hierarchical genetic algorithm for hardware software co-synthesis with a stochastic approach [CRC02].

5.4 Protocol construction

Generally a protocol stack has a fixed number of layers. Every layer is implementation of some protocol functionality. Construction of static end-to-end protocol to meet all communication requirements could be difficult and this complexity could reduce protocol performance. Dynamically configurable protocol stacks have used to construct protocols, which will meet its requirements with minimal overhead. [Gra00a]

(50)

decompose protocols to micro-protocols and then with GP to compose them to form a complete protocol, which is light as possible, and satisfies the requirements. Used GP system was GPsys and protocols were constructed by using JavaGroups, a Java-based protocol framework toolkit.

Because of the used toolkit, each protocol stack was represented in string format:

In the string, there are numbered properties (prop1,prop2...) separated with colons, each property represents one protocol layer. Arguments can be passed to the each layer: arg1=val1 means, that value (val1) is assigned as argument number 1 (arg1). In Figure 18 there is protocol stackUDP:NAKACK:UNICAST:FLUSH:GMS represented as GP program tree, where each node represents one layer and only operator is colon (:), which separates layers. [Gra00a]

:

: :

NAKACK :

UDP UNICAST

FLUSH GMS

Figure 18: Protocol stackUDP:NAKACK:UNICAST:FLUSH:GMSrepresented as GP program tree [Gra00a]

(51)

Fitness for a single protocol layer was measured with three tests: testing semantic viability, how stack meets the communication requirements and quality of service provided by the protocol. After experiments, it was concluded, that the system was able to generate correct protocol stacks in hours and results were highly depending on the size of the population. Future work was proposed to improve performance of GP system and to allow system to be tailored each particular search. [Gra00a]

In the paper Protocols are programs too: the meta-heuristic search for security protocols [CJ01] Clark and Jacob have used GA and simulated annealing to generate correct and efficient Burrows, Abadi and Needham (BAN) protocols.

5.5 Architecture design

Software design is usually splitted in to two parts: Architecture design and module design. In architecture design, system is being clustered to modules and subsystems and their interfaces are defined. In module design, each module and its internal functionality is being designed. In architecture design, it is usually tried to develop system’s parts as independent as possible and keep number of connections and dependencies between subsystems small as possible. [HM98]

Doval et al. in the paper Automatic clustering of software systems using a genetic algorithm [DMM99] represents software clustering GA, which can be used to good partition of the module dependency graph (MDG). MDG is a directed graph which is used to describe the modules of the system and their relationships. In Figure 19 there are an example of simple MDG. Figure 20 shows a partitioned MDG and in Figure 21 there is MDG with best possible partitioning: there is many intra-connections inside

Evolutionary computing in search-based software engineering