
Satisfaction for Software Reverse Engineering

by

Steven Woods

A thesis

presented to the University of Waterloo in fulfilment of the

thesis requirement for the degree of Doctor of Philosophy

in

Computer Science

Waterloo, Ontario, Canada, 1996

© Steven Woods 1996

I authorize the University of Waterloo to lend this thesis to other institutions or individuals for the purpose of scholarly research.

I further authorize the University of Waterloo to reproduce this thesis by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.

The University of Waterloo requires the signatures of all persons using or photocopying this thesis. Please sign below, and give address and date.

Abstract

The process of understanding source code written in a high-level programming language is a complex cognitive task. The provision of helpful decision-aid subsystems would be of great benefit to software maintainers. Given a library of program plan templates, generating a partial understanding of a piece of software source code can be shown to correspond to the construction of mappings between segments of the source code and particular program plans represented in a library of domain source programs (plans). These mappings can be used as part of the larger task of reverse engineering source code, to facilitate many software engineering tasks such as software reuse and program maintenance.

We present a novel model of program understanding using constraint satisfaction.

The model composes a partial global picture of source program code by transforming knowledge about the problem domain and the program structure into constraints. These constraints facilitate the efficient construction of mappings between code and library knowledge. Under this representational framework, earlier heuristic approaches to program understanding may be unified, contrasted, and compared.

We describe an empirical study which demonstrates the feasibility of our model in the program understanding subtask of partial local explanation. In addition, we incorporate this local model into the larger task of combining these local explanations into a coherent global picture of the code. While many heuristic global models are possible, we describe an encompassing structure and demonstrate, through a careful depiction of the algorithm and several domain examples, how constraint satisfaction offers a rich paradigm for exploiting both library and program structural constraint information. One primary advantage of the constraint satisfaction paradigm (CSP) is its generality: many previous program understanding efforts can be more easily compared. Another advantage is the improvement in search efficiency using various heuristic techniques in CSP.


Acknowledgements

As with any work of this size, many people must be acknowledged for their concern, interest and efforts on my behalf. Much of the work presented in this thesis has resulted from collaborations with Dr Qiang Yang (University of Waterloo) and Dr Alex Quilici (University of Hawaii). In particular I would like to thank Qiang for literally hundreds of hours of conversations on many topics, several of which actually included program understanding! As well, thank you Qiang for your continuous mentoring and encouragement, without which this thesis would never have been completed. Thanks Alex for your interest in the application of some of my ideas to your previous work, and for encouraging a very successful, ongoing collaborative effort to extend the possibilities of program understanding.

To my Waterloo Ph.D. committee members, Drs Robin Cohen, Grant Weddell and Rick Kazman, thank you for frequently finding time to discuss related aspects of your work and for many helpful suggestions and criticisms of my wanderings. Thank you to my external examiners, Drs Rudolph Seviora (University of Waterloo, Computer Engineering) and Hausi Müller (University of Victoria), for taking the time to read my thesis and offer specific and valuable observations. Drs Peter van Beek (University of Alberta), Josh Tenenberg (University of Indiana), Jim Ning (Arthur Andersen Consulting) and Premkumar Devanbu (AT&T Bell Labs) all receive my thanks for their discerning comments, pointers, and references which I have undoubtedly incorporated without citation into this thesis. Thank you to Drs Martha Pollack (University of Pittsburgh), Sandra Carberry (University of Delaware), and Robert Holte (University of Ottawa) for offering their valuable insight and encouragement on early aspects of my work.

During the course of my (second) tenure as a graduate student I have received substantial financial and equipment support from a variety of groups, all of whom I would like to thank. Thank you to the Natural Sciences and Engineering Research Council of Canada (NSERC) for two years of valuable support. In addition, thank you for continuous monetary and administrative support from the University of Waterloo Department of Computer Science and the Information Technology Research Centre (ITRC, Waterloo). Special mention to the Intelligent Software Group (ISG) of Simon Fraser University, and the Department of Electrical Engineering of the University of Hawaii for providing me with both computer support and space to work during my visits.

Portions of the work in this thesis evolved out of related work I undertook while employed with the Commonwealth Scientific and Industrial Research Organization (CSIRO), Division of Information Technology in Canberra, Australia and the Department of National Defence, Defence Research Establishment Valcartier (DREV) in Quebec, Canada.

Thank you to the Spatial Information System and Tactical Information Fusion research groups respectively for providing interesting and dynamic work environments.

The University of Waterloo is blessed with tremendous administrative staff who take an unprecedented interest in helping graduate students with a never-ending stream of difficulties. I owe debts to them all; however, I would like especially to thank Wendy Rush for going far above and beyond in helping me with the complicated task of submitting and getting approvals for various phases of this thesis, and Jane Prime for helping me untangle the web of administration actually required to graduate.

Several members of the Logic Programming and Artificial Intelligence Group (LPAIG) from the University of Waterloo merit special mention for their willingness to talk at all hours of the day and help with the inevitable, arcane and unsolvable problems - technical and otherwise. Thanks for all your help Stephanie and Toby - I guess all of LPAIG owes a special thank you to the folks at the Grand China for so many wonderful meals! Several fantastic friends put up with me (and in fact, put me up) during my often hectic Ph.D. pursuit. Thank you for everything ... especially Ruth, Megan, Kelly, Wendy and Bart.


I can't begin to thank them enough for all of their patience, love, and encouragement over all the years. Last and most importantly, I owe a tremendous debt to my closest friend and co-conspirator - my fiancée, Kirsten Wehner. Thank you for not only reading, editing, listening, and offering insightful suggestions in response to hours of mostly unintelligible and always confusing techno-babble, but also for retaining both your sanity and mine during what has been an exciting but also difficult and trying adventure!

It should be clearly noted that portions of this work have been published by the author and collaborators prior to the delivery of this dissertation. While it is impossible to localize this material specifically within the thesis, the following publications cover most of Chapters 6, 7, and preliminary work extended in Chapters 2, 3, and 8: [Woods and Yang, 1995b], [Woods and Yang, 1995a], [Woods and Yang, 1996a], [Woods and Quilici, 1996a], [Woods and Quilici, 1996c], [Quilici et al., 1996] and [Woods and Quilici, 1996b].


... Software entities are more complex for their size than perhaps any other human construct, because no two parts are alike ... [Brooks, 1995, p. 182]

Man with Cuboid, M.C. Escher, 1958.

... Yesterday's complexity is tomorrow's order. The complexity of molecular disorder gave way to the kinetic theory of gases and the three laws of thermodynamics. Now software may not ever reveal those kinds of ordering principles, but the burden is on you to explain why not. I believe that someday the "complexity" of software will be understood in terms of some higher order notations ... Steve Lukasik, [Brooks, 1995, p. 211]

For Kirsten, in all your complexity ...


Contents

1 Introduction 1

1.1 Software Engineering and Program Understanding Goals . . . 1

1.2 A Brief Look at Program Understanding . . . 5

1.3 My Approach to Modeling Program Understanding . . . 7

1.4 Program Understanding and Artificial Intelligence . . . 9

1.5 Primary Research Assumptions and Context . . . 10

1.5.1 A Tool-based Vision of Program Understanding . . . 10

1.5.2 Software Repository . . . 10

1.5.3 Procedural Software and Plans . . . 11

1.5.4 Program Plan Definition . . . 11

1.5.5 Software Structure-Analysis Tools . . . 12

1.5.6 Phases of Analysis . . . 13

1.5.6.1 Empirical Analysis . . . 13

1.5.7 Global Analysis . . . 13

1.6 Thesis Outline . . . 14


2 The Understanding Process 17

2.1 Software Engineering and Program Understanding . . . 17

2.1.1 Software Engineering: an attempt at definition . . . 17

2.1.2 Reverse and re-engineering . . . 21

2.1.3 Understanding . . . 28

2.2 Program Understanding Methodologies . . . 29

2.2.1 An illustrative example of program understanding . . . 30

2.2.2 A Review of Past Program Understanding Work . . . 33

2.2.2.1 The Programmer's Apprentice . . . 35

2.2.2.2 Wills' Graph Parsing Method : Graspr . . . 35

2.2.2.3 Concept Recognizer . . . 37

2.2.2.4 Decode: Quilici's Memory-based Method . . . 39

2.2.2.5 UnProg: Hartman . . . 41

2.2.3 Visualization of Software Structure . . . 42

2.2.4 My Two-Phased View of Program Understanding: Local and Global 46

3 Plan Recognition 48

3.1 The Relation between Plan Recognition and Program Understanding . . . 50

3.1.1 Software Engineering and Planning . . . 50

3.1.2 Motivation and Introduction . . . 52

3.1.3 The Plan Recognition Paradigm . . . 53

3.1.4 Program Understanding Recalled . . . 60

3.1.5 Program Understanding as Plan Recognition . . . 65

3.1.6 Comparing PR and PU . . . 69

3.1.7 Dimensions of Comparison . . . 70

3.1.9 Looking Ahead: Adapting PR for PU . . . 75

II Modeling Framework 80

4 The Constraint Satisfaction Paradigm 81

4.1 Motivation and Background . . . 81

4.2 A Simple Example . . . 83

4.3 CSP Solution Approaches . . . 85

4.3.1 A naive solution: Generate-and-Test . . . 86

4.3.2 Local Consistency . . . 87

4.3.2.1 Simple or Node Consistency . . . 88

4.3.2.2 Arc Consistency . . . 89

4.3.3 Combining Generation and Constraint Propagation . . . 90

4.3.4 Backtrack-based Algorithms . . . 92

4.3.5 Hybrids of Backtracking and Propagation . . . 97

4.3.5.1 Intelligent Backtracking . . . 100

4.3.5.2 Heuristic extensions to the search process . . . 100

4.3.6 Local Search . . . 103

4.3.6.1 Locality Heuristics in Spatial Problems . . . 105

5 Understanding as Constraint Satisfaction 109

5.1 Introduction . . . 109

5.2 CSPs for Two Phases of Program Understanding . . . 112

5.2.1 Partial Local Explanation as MAP-CSP . . . 116

5.3 Program Understanding as a CSP . . . 118

5.3.1 Concept Recognizer Program Understanding . . . 118

5.4 Heuristic Program Understanding as a CSP . . . 126

5.4.1 Decode's Heuristic Approach to Program Understanding . . . 127

5.4.1.1 Representation . . . 127

5.4.1.2 Control . . . 131

5.4.2 Decode's Approach to Program Understanding as a CSP . . . 132

5.5 An Example of MAP-CSP In Action . . . 134

5.6 Some Comparative Experiments . . . 136

5.6.1 Experimental Description . . . 136

5.6.2 Methodologies Tested . . . 137

5.6.2.1 MAP-CSP . . . 138

5.6.2.2 Memory-CSP . . . 139

5.6.3 Decode and Concept Recognizer Experimental Results and Discussion . . . 139

5.6.4 Summary of Results and Analysis . . . 141

5.7 Conclusions . . . 141

III Partial Local Explanation 146

6 Partial Local Explanations (MAP-CSP) 147

6.1 Program Template Recognition Model . . . 149

6.2 Complexity Issues . . . 150

6.2.0.1 Program Template Matching is NP-hard . . . 150

6.2.0.2 The Program Template Matching Problem . . . 151

6.2.0.3 Program Template Matching is NP-hard . . . 152

6.2.1 MAP-CSP and Search . . . 153

7.1 Source Data and Program Plans . . . 156

7.1.1 Program Plan Templates . . . 156

7.1.2 Generated Examples . . . 157

7.1.3 Problem Instances . . . 159

7.1.4 Experimental Results . . . 160

7.1.4.1 Detailed Individual Results . . . 160

7.1.4.2 Comparative Results . . . 181

7.1.5 Implications for Program Understanding Research . . . 185

7.1.6 Conclusions . . . 187

IV Global Explanation 189

8 Managing Global Explanations (PU-CSP) 190

8.1 Overall Understanding Model . . . 191

8.2 PU-CSP Complexity Issues . . . 191

8.2.1 Simple Program Understanding Problem . . . 193

8.2.1.1 The Modeling Process . . . 193

8.2.2 NP-hardness Proof . . . 198

8.2.3 Applicability of Local and Global Strategies . . . 198

8.2.4 Applying Local Constraint Propagation . . . 199

8.2.4.1 A Simple PU-CSP Example using Local Constraint Propagation . . . 200

8.3 The Modeling Process . . . 205

8.3.1 General Hierarchical Constraint Satisfaction Model . . . 209

8.3.1.1 Hierarchical Domain Representation . . . 212

8.3.2 Hierarchical CSP and Program Understanding : An Example . . . 220

8.3.2.1 Downward Hierarchical Revision . . . 221

8.3.2.2 Upward Hierarchical Revision . . . 223

8.3.3 One Unified Algorithm for Program Understanding . . . 225

8.3.3.1 Algorithm Understand Explanation . . . 227

8.3.3.2 Algorithm MergeRevise Explanation . . . 231

9 Hierarchical CSP: A Detailed Solution 234

9.1 A Generic Hierarchical Example . . . 236

9.2 My Hierarchical Arc-consistency Algorithm . . . 245

9.2.1 Algorithm Apply . . . 245

9.2.1.1 Informal Description . . . 245

9.2.2 Algorithm Revise . . . 257

9.2.2.1 Informal Description . . . 257

9.2.2.2 Aggressive Revision Description . . . 258

9.2.2.3 Stepped Revision Description . . . 261

9.2.3 Hierarchical Arc-consistency . . . 266

9.2.3.1 Generic Hierarchical Examples . . . 267

9.3 Conclusion . . . 277

9.3.1 Variations of Hierarchical CSP . . . 277

9.3.2 Novelty . . . 278

9.3.3 Correctness . . . 278

V Conclusions 280

10 Conclusions 281


10.2 Artificial Intelligence . . . 289

10.3 Research Extensions and Future Work . . . 295

Bibliography 303

A Constraint Satisfaction Algorithms 322

A.1 Path and K-consistency . . . 322

A.2 Utility of constraint propagation . . . 323

A.3 Partial Arc Consistency . . . 323

A.4 Intelligent Backtracking . . . 325

A.4.1 BackJumping . . . 325

A.4.2 BackMarking . . . 326

A.4.2.1 Sharing AC work in hybrid search . . . 327

A.4.2.2 Upward Sharing . . . 328

A.4.2.3 Implications . . . 333

A.5 Partitioning and Hierarchical Methods . . . 335

A.5.1 Partitioning CSP . . . 338

A.5.1.1 Simple partitions . . . 338

A.5.1.2 Embedded CSPs using partitions . . . 339

A.5.2 Abstraction and CSP . . . 339

A.5.2.1 What abstraction means in this context . . . 339

A.5.2.2 Abstraction as partial solution of CSP . . . 344

A.5.3 Combining partitions and abstraction . . . 347

A.5.4 Decomposition and user interaction . . . 347

B Mechanism Matching 350


C.1 Algorithm DeleteSourcePropagateAggressive . . . 353

C.2 Algorithm KeepSourcePropagateAggressive . . . 355

C.3 Algorithm DeleteSourcePropagateStepped . . . 357

C.4 Algorithm KeepSourcePropagateStepped . . . 359


List of Tables

2.1 Example abstract data type . . . 32

3.1 The Kautz Non-dichronic Program Understanding algorithm . . . 61

3.2 PU versus PR Comparison of assumptions . . . 71

4.1 Generic CSP Search Algorithm . . . 94

5.1 Program statement type distribution . . . 140

7.1 Equal program statement type distribution . . . 158

7.2 Skewed program statement type distribution . . . 158

8.1 The overall understanding algorithm . . . 230

8.2 Merging partial local explanations to global view . . . 233

9.1 Example hierarchic constraint f'n between V0 and V1 . . . 241

9.2 Example hierarchic constraint f'n between V2 and V0 . . . 242

9.3 Example hierarchic constraint f'n between V1 and V2 . . . 243

9.4 OR3 logical operator . . . 245

9.5 AND3 logical operator . . . 245

9.6 NOT3 logical operator . . . 245

9.7 The ApplyR algorithm . . . 247

9.9 The ApplyUp algorithm, part 2 of 2 . . . 250

9.10 The ApplyDown algorithm, part 1 of 3 . . . 254

9.11 The ApplyDown algorithm, part 2 of 3 . . . 255

9.12 The ApplyDown algorithm, part 3 of 3 . . . 256

9.13 The AggressiveRevise algorithm . . . 260

9.14 The SteppedRevise algorithm . . . 262

9.15 The Simplify hierarchical reduction algorithm . . . 263

9.16 The SimplifyUp reduction algorithm . . . 264

9.17 The SimplifyDown reduction algorithm . . . 265

9.18 The AO-HAC arc-consistency algorithm . . . 268

9.19 The AO-HAC-New arc-consistency algorithm . . . 269

9.20 Hierarchical arc-consistency algorithm results . . . 277

C.1 The DeleteSourcePropagateAggr propagation algorithm . . . 354

C.2 The KeepSourcePropagateAggr propagation algorithm . . . 356

C.3 The DeleteSourcePropagateStep propagation algorithm . . . 358

C.4 The KeepSourcePropagateStep propagation algorithm . . . 360


List of Figures

1.1 Conceptualizing source with expert knowledge . . . 2

1.2 Conceptualizing source with a plan library . . . 6

2.1 Sommerville's software engineering world . . . 22

2.2 Program understanding in software engineering . . . 26

2.3 C source code mapped through a String ADT instance to C++ code . . . 31

2.4 String ADT within a hierarchical program plan library . . . 32

3.1 Action hierarchy for the cooking domain . . . 54

3.2 An Example Action Hierarchy . . . 55

3.3 Another Example Action Hierarchy . . . 66

4.1 A Map Coloring Problem . . . 83

4.2 Map-Coloring CSP . . . 84

4.3 A search tree for a backtrack-based algorithm . . . 93

4.4 SCH Example with level 1 solution . . . 106

4.5 SCH Example limiting range of level 2 instances . . . 107

5.1 An example code pattern . . . 119

5.2 MAP-CSP representation of TRAVERSE-STRING plan (index shaded) . . . 122

5.3 Spatial situation with 300 objects of four types . . . 125

5.5 An example code pattern . . . 128

5.6 Decode's algorithm for automatically recognizing plan instances in code . . . 143

5.7 MAP-CSP representation of code patterns . . . 143

5.8 CSP-based internal representation for plans . . . 144

5.9 The representation for a plan index . . . 145

5.10 The median results for each of 5 algorithms . . . 145

6.1 The String ADT in MAP-CSP . . . 150

6.2 Program Template Matching . . . 152

7.1 Extended program plan quilici-large . . . 164

7.2 Instance of quilici-t1-index plan . . . 165

7.3 Instance of quilici-t1 plan . . . 165

7.4 Instance of quilici-t1-large plan . . . 165

7.5 Instance of quilici-t1 plan with 10 inserted statements . . . 166

7.6 Standard BackTrack (95% conf. interval) . . . 166

7.7 BackTrack, variable order (95% conf. interval) . . . 167

7.8 BackTrack CPU-time, variable order (95% conf. interval) . . . 167

7.9 Forward Checking, DR (95% conf. interval) . . . 168

7.10 Forward Checking, DR CPU-time (95% conf. interval) . . . 168

7.11 AC-3 with FCDR (95% conf. interval) . . . 169

7.12 Memory-CSP with FCDR (95% conf. interval) . . . 169

7.13 A range of strategies (medians graphed) . . . 170

7.14 BT adv Constraints vs Time, Standard distribution . . . 170

7.15 FCDR adv Constraints vs Time, Standard distribution . . . 171

7.16 FCDR (Random), Standard Template, Standard code distribution . . . 173

7.18 FCDR (Random), Standard Template, Skewed code distribution . . . 174

7.19 FCDR (Random), Standard Template, three distributions . . . 174

7.20 FCDR Standard Template, Standard code distribution . . . 175

7.21 FCDR Standard Template, Equal code distribution . . . 175

7.22 FCDR Standard Template, Skewed code distribution . . . 176

7.23 FCDR, Standard Template, three distributions . . . 176

7.24 Index Template, Standard code distribution . . . 177

7.25 Index Template, Equal code distribution . . . 177

7.26 Index Template, Skewed code distribution . . . 178

7.27 FCDR, Index Template, three distributions . . . 178

7.28 Large Template, Standard code distribution . . . 179

7.29 Large Template, Equal code distribution . . . 179

7.30 Large Template, Skewed code distribution . . . 180

7.31 FCDR, Large Template, three distributions . . . 180

7.32 Extended results: strategy range . . . 181

8.1 C source code mapped through a String ADT instance to C++ code . . . 192

8.2 String ADT within a hierarchical program plan library . . . 192

8.3 Simple Program Understanding . . . 196

8.4 One "Blocking" of a Source Fragment . . . 201

8.5 Library Fragment . . . 202

8.6 Initial PU-CSP . . . 203

8.7 PUCSP Formulation CSP Graph exploded in Figure 8.8 . . . 205

8.8 PUCSP Graph . . . 207

8.9 Library knowledge constraints . . . 210


8.11 Criticality-based action hierarchy for PickupBlock . . . 213

8.12 A simple decomposition hierarchy . . . 214

8.13 Specialization and decomposition represented . . . 216

8.14 Image Processing Plan Library Fragment . . . 221

8.15 Example 1 PU-CSP formulation . . . 223

8.16 Example 2 Area Management Plan Library Fragment . . . 224

8.17 Example 2 PU-CSP formulation . . . 226

9.1 An example (flattened) CSP structure . . . 237

9.2 An example hierarchical domain value structure . . . 238

9.3 Close-up of CSP justification linkage . . . 239

9.4 Example complete justification linkage . . . 240

9.5 Upward cases for source, target structure . . . 251

9.6 Downward cases for source, target structure . . . 253

9.7 Justification of V1 domain values w.r.t. V2 . . . 273

9.8 Justification V1 w.r.t. V2 and V0 w.r.t. V1 . . . 274

9.9 Final example justification structure . . . 274

9.10 Final example hierarchic structure . . . 276

10.1 The recent Program Understanding world . . . 286

A.1 Example of BackJumping Behaviour . . . 326

A.2 Partially Instantiated Constraint Graph . . . 329

A.3 Contamination trickling through graph . . . 330

A.4 Constraint propagation between unbound variables . . . 331

A.5 Problem space before upward propagation . . . 332


A.7 Problem space after upward constraint . . . 334

A.8 Embedded Constraint Satisfaction . . . 340

A.9 One abstraction hierarchy in a PCSP space . . . 345


1 Introduction

1.1 Software Engineering and Program Understanding Goals

Software is an artifact created by human experts as a means to encode knowledge about a specific domain. Once created, software will be repeatedly accessed by other experts as part of routine software upgrades, maintenance and debugging. Those constructing software are, however, seldom those who maintain it afterwards, with the consequence that the software itself constitutes a highly specific mode of communication between the constructors of software and those accessing it throughout its lifespan. The encoding of knowledge of various domains in software and the lack of direct communication between constructors and maintainers mean that the processes by which maintainers understand the knowledge encoded in software are central to the continued use of legacy software. Legacy software is typically thought of as a constantly evolving corporate asset, critical to corporate goals, which needs to be maintained against depreciation. Indeed, studies of software maintenance indicate that as much as 80% of software maintenance costs are directly attributable to the time software maintainers spend attempting to understand what is being conveyed in particular software.


The process by which software maintainers understand a software system may be best viewed as a primarily conceptual process, wherein the software expert develops a mapping between his/her knowledge of both the conventions of software construction and the problem domain, and the given source. A successful understanding is thus the successful construction of a mapping between some portion of the expert's store of relevant knowledge and the structures and components inherent in the source code. Figure 1.1 illustrates such a mapping.

[Figure content omitted: legacy source code artifacts (code fragments, outdated comments, and references to non-existent documentation) shown against expert programmer knowledge (general algorithms and plans, general data structures, specific language syntax, programming design and style, existing program libraries) and expert domain knowledge (domain-specific algorithms, domain-specific data structures, domain terminology, specific documentation), posing the question "How can I explain these artifacts?"]

Figure 1.1: Conceptualizing source with expert knowledge

The task of program understanding, that is, the process of successful mapping between the expert's knowledge and the source, may be assisted by both physical representations of relevant knowledge and by automated decision support tools. Software systems are inherently extremely complex [Brooks, 1995]. Software is highly multi-dimensional in terms of causation and effect, and there is little necessary correlation between physical representation and program execution behaviour in large-scale software constructs. Consequently, visualization and physical representation of software is a difficult task. Nevertheless, commercial software developers are seeking to create packages of software in the form of software libraries and object sets that provide both general computational and specific domain functionality. Software development firms themselves are seeking to extend the efficiency of their software divisions through highly organized programs of software reuse.

Although humans are adept at interpreting material representations of knowledge created by others, the sheer complexity and volume of software often hinders this process. The usefulness of such visualization tools for understanding software systems depends upon the creators and perceivers of these representations sharing an understanding of the context of both their production and reception. In order to be useful to a programmer, any visualization tool must deal with complicated issues in separating and integrating various code views, and with trade-offs in information hiding and complexity brought on by viewing code abstractly or in greater detail.

In addition to visualization tools, other decision support tools have been proposed to assist in the program understanding process. These include configurable pattern matching systems and case-based reasoning systems for retrieval of software appropriate to a particular task. Both of these approaches are directed at increasing the effectiveness of constructing mappings between software and existing bases of software knowledge or libraries. The central focus of this thesis is the development of automated tools that can assist the expert software maintainer in the task of program understanding. In particular, a tool that can help in constructing mappings between existing knowledge sources and highly complex but also highly structured source code is developed. This interactive tool should be seen as a part of a larger expert-driven toolset which includes effective software visualization tools.

The interactive tool described in this thesis is based on algorithmic methods for identifying portions of code structure that correspond to known program plans or sets of program plans. These recognized portions are intended to supplement and complement the knowledge of the reverse engineer during the process of working with the code. In particular, the algorithms discussed are designed with the intention of supporting interactive interpretation of the code. Consequently, expert-supplied knowledge can be utilized during recognition to reduce the overall complexity of the problem, and to generate "views" on the code that the expert has had a role in constructing. The power of the algorithms lies in their ability to propagate small pieces of understanding through the space of all interpretations in order to limit the set of possible interpretations. In this way the software expert can take advantage of the recognition capability of such algorithms in very noisy and complicated situations, and yet not be overly limited by the inevitable inability of an understanding algorithm to explain an entire program.

The construction of mappings between existing knowledge, either embodied in an expert or represented physically, and source code permits the software expert to make inferences about the source program's possible higher-level goals. This process of inference permits, through abstraction, perceptions of the source as actual code statements to be re-conceptualized at the more general level of the existing representation (or language of expression) of the domain knowledge. This abstract understanding may be exploited in many ways, including: (1) as part of a process of translating the program into the source code of another programming language, (2) recognizing errors in the code (or design/requirements) and assisting in debugging the system at the more abstract level, and (3) replacing understood code portions with generic application code or calls to other code libraries. In addition, in many real-world circumstances, a reduction in the size of an existing source code through adoption of standard code libraries or reduction of redundancy by only a small percentage can result in a substantial reduction of the ongoing maintenance cost, simplify future extensions to code, and reduce the probability of introducing errors through such modifications and extensions. Consequently, creating a mapping (even a partial one) between existing domain knowledge and a particular source code offers a possible lever for the software expert to employ.

Within the broader context of software reverse and re-engineering, program understanding is a sub-task based upon one of the primary goals of software engineers. This goal is to provide a solid and clear shared context for communication between a software creator and a software maintainer. The construction of this solid medium of readable, understandable software is based upon standard software engineering principles such as information hiding, which is embedded in object-oriented methodologies. Another such principle for shared context standardization is represented in focused efforts at standardizing software re-use. The attempt to implement rigid, shared standards for system and code design, documentation and coding is based upon the need to maintain a coherent shared context of presentation for the entire group of future expert maintainers.

In essence, this shared standardization forms the basis for any partial automatic understanding of code. As a medium of communication, software has the potential to be highly structured and regular, especially when compared with more general and flexible communication forms such as natural language.

1.2 A Brief Look at Program Understanding

In Artificial Intelligence research, the problem of program understanding has been approached indirectly from the perspective of plan recognition. In Section 2.2 I discuss in some detail how much of this plan recognition work fails to meet the requirements of the program understanding task. In other research more closely related to the software engineering community, a more direct approach to program understanding has been undertaken in which an explicit library of program plan templates and concepts is constructed, and various top-down and bottom-up search strategies are utilized to implement a mapping process between them. These approaches are introduced in detail in Section 2.2.2.

As I have introduced, program understanding methodologies typically attempt to construct formal mappings between knowledge sources and code. For example, in Figure 1.2 a subset of expert knowledge about a particular application domain is represented in a fragment of a hierarchical library of program templates. One possible mapping is shown between a plan template from the library and a specific source fragment, in this case a single source statement. The existence of such a mapping essentially explains the presence of the low-level source statement at a higher level of abstraction, in this case as an instance of the plan template copy-character specified in the library.

[Figure content omitted: an excerpt of a hierarchical program plan library (a String ADT plan decomposed via AND/OR links into plans such as initialize-string, loop-through character array, builtin-char* copy, loop-initialize string and copy-character, with index and specialization conditions) mapped onto a fragment of legacy C source code, with one plan instance highlighted.]

Figure 1.2: Conceptualizing source with a plan library


Much of the previous program understanding work has failed to demonstrate heuristic adequacy in even partially generating "understanding" of large problems. Specifically, many recognition algorithms presented may be viewed as partially disjoint collections of heuristic tricks. Some of the landmark efforts are described in Section 2.2.2. Such heuristic construction makes it difficult for one to perform a systematic analysis of different search methods, or to understand how the addition or deletion of certain types of domain-specific knowledge may affect performance. I am unaware of concrete examples or experiments which might suggest that these approaches might scale up for specific uses in large sources. However, both [Wills, 1992] and [Quilici, 1994] present empirical results that are promising in identifying partial mappings from sources of up to 1,000 lines to a small library of program plans.

1.3 My Approach to Modeling Program Understanding

This work is part of a research effort structured towards: (1) unifying previous heuristic approaches to program understanding with a single model capable of representing both structural knowledge and control knowledge in a single framework, and (2) demonstrating that an effective approach to automatic partial program understanding is possible with large code examples. Specifically, I intend to clearly categorize the circumstances in which this use is possible, and the preconditions which must first be met in terms of representation and application of domain knowledge. In response to these two primary goals, I present a generalized representation of program understanding as a Constraint Satisfaction Problem (CSP) [Mackworth, 1977], and represent previous program understanding algorithms within the CSP framework.

In this CSP approach, a large source code is decomposed into a series of functionally related source "blocks" or "components" which may then be partially explained independently. These independent explanations may be used to locally reduce the range of explanation of logical neighbour components. This divide-and-conquer approach to program understanding relies on several new algorithms which I present in this thesis: program plan recognition algorithms, hierarchical constraint satisfaction algorithms and local constraint propagation methods.

I represent program understanding in two parts, and in two corresponding CSPs. The goal of the first part is to identify local instances of program plan templates in the source. These local instances may be thought of as partial local explanations of the code in which they reside. The CSP representation of the first part is known as MAP-CSP.¹ The goal of the second part, or PU-CSP, is to integrate these local partial solutions into a coherent global view. Ongoing work, such as that reported in [Quilici, 1994], is motivated by cognitive studies [Pennington, 1987a] which support this two-phased approach.
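
To make the MAP-CSP idea concrete, the sketch below casts a toy plan template as a CSP in Python. The template, the statement types, and the shared-variable test standing in for data-flow constraints are all simplified placeholders invented for this illustration; they are not the representation used by the thesis prototype.

    # Illustrative MAP-CSP formulation (simplified, assumed names throughout).
    # Variables   = components of one program plan template
    # Domains     = source statements whose type matches the component
    # Constraints = structural relations (approximated here by shared variables)

    # A toy "copy-string" plan template: three abstract components.
    template = {
        "loop-init": "assign",
        "loop-test": "loop",
        "copy-char": "array-assign",
    }

    # Pairs of components that must be structurally related in the source.
    template_constraints = [("loop-init", "loop-test"),
                            ("loop-test", "copy-char")]

    # A few source statements, each with a syntactic type and the variables it uses.
    source = [
        {"id": 1, "type": "assign",       "vars": {"j", "sz"}},
        {"id": 2, "type": "loop",         "vars": {"j"}},
        {"id": 3, "type": "array-assign", "vars": {"C", "B", "sz", "j"}},
        {"id": 4, "type": "assign",       "vars": {"sz"}},
    ]

    # Domains: every statement whose type matches the component's abstract type.
    domains = {c: [s for s in source if s["type"] == t] for c, t in template.items()}

    def related(s1, s2):
        """Approximate a data-flow constraint check by shared variable use."""
        return bool(s1["vars"] & s2["vars"]) and s1["id"] != s2["id"]

    def consistent(mapping):
        """True if every template constraint holds between the mapped statements."""
        return all(related(mapping[a], mapping[b]) for a, b in template_constraints)

    # One candidate partial local explanation: statements 1, 2 and 3 play the roles.
    candidate = {"loop-init": source[0], "loop-test": source[1], "copy-char": source[2]}
    print(consistent(candidate))   # True: variable j links the loop parts and the copy

A MAP-CSP solution is then any assignment of one statement to each component for which all template constraints hold; the PU-CSP stage reasons over collections of such local solutions.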

There are at least two advantages to the constraint-based approach. The first is its generality: most of the previous recognition methods and heuristics can now be unified under the constraint-based view. Another advantage is an increased ability to address heuristic adequacy, or scalability: by casting program understanding as a CSP, previously known constraint propagation and search algorithms may be applied. It is now possible to perform a systematic study of different search heuristics, including top-down and bottom-up approaches as well as many hybrids, in order to discover their applicability to a particular source code.

¹ The term "MAP" derives from the process of constructing a MAPping between a known library plan and an arbitrary piece of source code.
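
The standard CSP machinery referred to above includes chronological backtracking augmented with forward checking. The following is a minimal, generic sketch of that combination over an arbitrary binary CSP, exercised on a small map-colouring instance of the kind used as an example in Chapter 4; it is illustrative only and not the instrumented implementation evaluated later in the thesis.

    # Minimal backtracking search with forward checking for a binary CSP
    # (a generic sketch, not the thesis implementation).

    def solve(variables, domains, constraint):
        """variables: list of names; domains: {var: list of values};
        constraint(x, vx, y, vy) -> True if x=vx and y=vy are compatible."""

        def backtrack(assignment, domains):
            if len(assignment) == len(variables):
                return assignment
            var = next(v for v in variables if v not in assignment)
            for value in domains[var]:
                assignment[var] = value
                # Forward checking: prune the domains of unassigned variables.
                pruned, ok = {}, True
                for other in variables:
                    if other in assignment:
                        continue
                    keep = [w for w in domains[other]
                            if constraint(var, value, other, w)]
                    if not keep:
                        ok = False
                        break
                    pruned[other] = keep
                if ok:
                    result = backtrack(assignment, {**domains, **pruned})
                    if result is not None:
                        return result
                del assignment[var]
            return None

        return backtrack({}, domains)

    # Toy use: 3-colour three mutually adjacent regions.
    regions = ["A", "B", "C"]
    doms = {r: ["red", "green", "blue"] for r in regions}
    adjacent = {("A", "B"), ("B", "A"), ("B", "C"), ("C", "B"), ("A", "C"), ("C", "A")}
    different = lambda x, vx, y, vy: (x, y) not in adjacent or vx != vy
    print(solve(regions, doms, different))  # e.g. {'A': 'red', 'B': 'green', 'C': 'blue'}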


1.4 Program Understanding and Artificial Intelligence

I have identified AI problem-solving approaches as applicable to the program understanding problem domain in at least three primary ways.

1. First, plan recognition is a sub-domain of AI in which the plans and goals of an agent are interpreted based on a set of perceptions of that agent's actions and a library of (hierarchical) actions of which that agent is capable. In Chapter 3 I identify the program understanding problem as a special case of plan recognition in which software reverse engineering algorithms have been designed to address the restricted plan recognition domain. In particular, these algorithms are able to exploit specific restrictive problem features to empirical advantage.

2. The constraint satisfaction problem (CSP) sub-domain of artificial intelligence has recently received a lot of attention as a standard approach to modeling and solving hard problem instances. The application domain of software reverse engineering and program understanding is identified in this thesis as a rich testbed for CSP representation and solution schemes. Through a "bridging" thesis such as this, I have provided the opportunity for application of local, global and hierarchical work in constraint satisfaction to the program understanding domain. Similarly, program understanding researchers are provided with the opportunity to see the value of formally representing problems in the CSP framework in terms of increased scalability and standardization of heuristic representations.

3. Finally, through working in the software engineering world with the CSP modeling paradigm, a novel algorithm for propagating consistency in a constraint graph is presented in Chapter 8 which significantly advances the current state of the art in CSP. In particular, this algorithm is intended to accommodate domain values situated in a hierarchical structure consisting of both is-a (inheritance) and is-part-of (composition) relationships, whereas previous work accommodated only is-a relations. This work is easily generalizable to other problems in which domain values can be hierarchically structured (a small illustrative sketch of such a structure follows this list).
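
As a rough picture of what such hierarchically structured domain values look like, consider the sketch below. The plan names and relations are invented for this example and are not taken from the thesis library, and the two closure helpers only illustrate the shape of the structure, not the propagation algorithm itself.

    # Illustrative only: domain values arranged in a hierarchy mixing
    # is-a (specialization) and is-part-of (composition) links.

    class PlanNode:
        def __init__(self, name, isa=(), parts=()):
            self.name = name
            self.isa = list(isa)      # more abstract plans this plan specializes
            self.parts = list(parts)  # sub-plans this plan is composed of

    string_copy = PlanNode("string-copy")
    loop_init   = PlanNode("loop-initialize")
    copy_char   = PlanNode("copy-character")
    loop_copy   = PlanNode("loop-based-copy", isa=[string_copy],
                           parts=[loop_init, copy_char])

    def components(plan):
        """Transitive is-part-of closure: sub-plans a full match must account for."""
        found, stack = [], list(plan.parts)
        while stack:
            p = stack.pop()
            found.append(p.name)
            stack.extend(p.parts)
        return found

    def generalizations(plan):
        """Transitive is-a closure: more abstract plans implied by this one."""
        found, stack = [], list(plan.isa)
        while stack:
            p = stack.pop()
            found.append(p.name)
            stack.extend(p.isa)
        return found

    print(components(loop_copy))       # ['copy-character', 'loop-initialize']
    print(generalizations(loop_copy))  # ['string-copy']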

1.5 Primary Research Assumptions and Context

1.5.1 A Tool-based Vision of Program Understanding

Program understanding may be considered as an integral sub-task of software reverse and re-engineering. Within this context, large-scale program understanding is typically most profitably undertaken with the assistance of a software visualization toolset such as that provided by Rigi [Müller et al., 1994]. I understand a visualization tool of this kind to form the basis of a code understanding decision support system in which automated program understanding tools may be embedded.

1.5.2 Software Repository

In accordance with the work upon which I build [Wills, 1992, Quilici, 1994, Kozaczynski and Ning, 1994], the existence of a software repository is assumed in which program plans of a domain-specific or domain-independent nature are situated.

Such a repository could be populated through the use of existing commercial class and template libraries in languages such as C++, or possibly through the utilization of generic application libraries constructed by many companies as part of software reuse. While other work is concerned with the population or abstraction of such a repository, I restrict myself to its application.


1.5.3 Procedural Software and Plans

During the course of this research I have focused on procedural languages and plans which correspond to algorithms in the procedural sense. This choice was motivated by the constant desire to be relevant to the community sporting the largest set of deployed code - COBOL. That said, I regard the relationship between procedural plans and functional implementations (as might be evidenced in PROLOG, for example) as an open research issue. It may well be that abstracted procedural plan representations provide a poor canonicalization of functional implementations.

1.5.4 Program Plan Denition

The abstract program plans (clichés or idioms) utilized for program understanding have been primarily defined by understanding researchers attempting to generate canonical representations of particular algorithms. The language of representation tends to be pseudo-code-like and capable of mapping to many programming languages. Standard algorithmic devices such as "loops" are represented abstractly in terms of fundamental concepts such as "loop-begin", "loop-end" and "loop-until", and matching methodologies are devised to instantiate these abstract terms to particular concrete instances.
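
The flavour of such pseudo-code-like plan representations can be suggested with a small sketch. The component names, the loop description, and the trivial matching rule below are invented for illustration and are far simpler than the constraint-rich templates used in actual plan libraries; only the concrete C loop is taken from the example code shown earlier.

    # A toy rendering of an abstract plan and its instantiation (assumed names).

    # Abstract "traverse-array" plan: fundamental loop concepts, not syntax.
    traverse_array_plan = [
        ("loop-begin", {"induction": "?i", "start": 0}),
        ("loop-until", {"condition": "end-of(?array, ?i)"}),
        ("loop-body",  {"uses": ["?array", "?i"]}),
        ("loop-end",   {"increment": "?i"}),
    ]

    # One concrete statement, as a language-neutral description of
    #   for (int k = 0; A[k]; k++) { outchar(A[k]); }
    concrete_loop = {
        "kind": "for-loop",
        "induction": "k",
        "start": 0,
        "array": "A",
        "body-uses": ["A", "k"],
    }

    def instantiate(plan, stmt):
        """Bind the plan's ?variables against one concrete loop description."""
        if stmt["kind"] != "for-loop":
            return None
        bindings = {"?i": stmt["induction"], "?array": stmt["array"]}
        # Check the abstract loop-begin starting value against the concrete loop.
        for concept, attrs in plan:
            if concept == "loop-begin" and attrs.get("start") != stmt["start"]:
                return None
        return bindings

    print(instantiate(traverse_array_plan, concrete_loop))  # {'?i': 'k', '?array': 'A'}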

The acquisition of domain and generic program knowledge is clearly a labor-intensive task, roughly analogous to the work performed by systems analysts in learning about a particular domain. Wills [Wills, 1992, pp. 19-58] devotes an entire chapter of her Ph.D. thesis to describing a cliché library and her experiences in obtaining it. In Wills' work, effort was expended both bottom-up, in finding more than one implementation of a particular cliché before an attempt was made to generate an "abstract" representation, and top-down, in mapping deliberately abstract descriptions of algorithms (i.e. from textbooks) to the cliché representation. Other subsequent work [Quilici, 1994] is pushing forward this definitional phase through an attempt to create an interactive, visual tool which will allow an expert to select a piece of code, abstract it partly through as yet undetermined methods of automated canonicalization, and support the expert in editing these representations.

As a result of the impossibility of defining "perfect" algorithmic abstractions, any automated understanding support tool which attempts to assist an expert in understanding source code must be able to accommodate partial and local explanations, and must be able to interact successfully with the expert throughout the process.

1.5.5 Software Structure-Analysis Tools

It is a recognized advantage in software understanding to have a range of possible source code views with which to examine a given source code. In particular, code visualizers discussed in Section 2.2.3 provide ways of both abstracting out and elaborating detail depending on the desire of the analyst. Views of software based on data-flow or control-flow tend to be related, but each provides different information, and only together does their individual power become apparent to an expert attempting to understand software. Views based purely on the use of given data structures and executable traces provide examples of a range of other possible code examination techniques. Each of these software views provides a unique set of constraint information about the construction of a given source code. Throughout this work I assume the existence of the tools which provide this constraint information. In particular, I assume the existence of control and data-flow analyzers such as Gen++ [Devanbu and Eaves, 1994], which annotate an abstract syntax tree view of software. In the absence of such a tool², I approximate such constraint information through direct reasoning with source code examples. In verifying a constraint without explicit data and control-flow information, it is necessary to analyze the source program directly for the possible existence of such a data-flow. This type of as-required checking is more expensive computationally than a simple check for a pre-identified data-flow; however, since I am primarily measuring computation in terms of constraint-checks required, this assumption does not skew the check-counting results.

² Due to licensing difficulties, an annotating parser could not be obtained for nearly a year. In 1996 this license was obtained, and current research is underway incorporating the annotated ASTs.

1.5.6 Phases of Analysis

1.5.6.1 Empirical Analysis

While I examine the specific structure of experiments performed in Chapter 7, it is important to note the assumptions that lead us to examine the MAP-CSP closely. I assume that local partial explanations are a crucial sub-task in program understanding and, further, one which could benefit from a constraint-based tool assisting an expert in very large program analysis. Partial explanations of this type are utilized by at least Quilici, Wills, Ning and Kozaczynski, albeit in varying representational forms. All of these works have attempted to work towards models that will have the potential to scale to usefully sized code instances.

1.5.7 Global Analysis

It is difficult to conduct an empirical study of the effectiveness of structure-identification tools in the absence of an integrated program analysis tool. Such a tool would utilize a combination of program visualization, program structure extraction, a hierarchically structured program plan library and the constraint-based local and global explanation assistance tools. I anticipate that such combination and study will occur in the future, and the goal of this aspect of my research is primarily to ascertain that constraint technology is adequate for the representation and solution of this problem in this context. The discussion in this thesis of specifically extended constraint algorithms is a step towards this combination.

1.6 Thesis Outline

The remainder of this thesis is divided into five primary parts.

Part I, Foundations, presents necessary background material to acquaint or re-acquaint the reader with the tools that I will use to build my thesis arguments. I present a short introduction to the problem of program understanding in software engineering, outline how semi-automatic analysis of structured software typically proceeds, and provide an overview of crucial previous work in program understanding and plan recognition.

Part II, Framework, describes the modeling paradigm I have selected to represent the program understanding problem, and outlines related algorithms used in the solution of problems represented in this way. I introduce the conception of the two primary stages of program understanding: partial local and global explanation. The description of these two sub-tasks within the context of my chosen modeling framework forms the background for the next two thesis parts.

Part III, Local Explanation, details the first stage of my program understanding model. Analysis of problem complexity and a description of experimental empirical results are presented, along with an analysis of how local explanations are mirrored in earlier program understanding work.

Part IV, Global Explanation, completes my program understanding representation with a complete description of an integrative model for partial local explanations. This model does not represent the only possible way in which local explanations may be integrated into a more comprehensive view. Like other integrative approaches, it exploits both the presence of inter-component relationships and partial knowledge about program plan content. Unlike other approaches, my approach makes clear the explicit use and value of program plan knowledge and structural constraint information in program understanding. In addition to a second complexity analysis of the integration problem, I present a set of illustrative examples and carefully describe the algorithmic approach of the global strategy in terms of a cooperative interactive system. The integration of constraint propagation (limited inferencing) with expert interaction is a crucial component of any useful decision support tool.

Part V, Conclusions, summarizes the findings of my research. In addition, I position myself with respect to other closely related work in program understanding, and outline the future direction of this project, which in many ways is only now commencing.


I Foundations

2 The Understanding Process

2.1 Software Engineering and Program Understanding

The main task of program understanding fits within the context of both software engineering and reverse engineering. Several important questions need to be posed and answered before one can discuss program understanding cogently. First, what do the often-used but seldom defined terms software engineering and reverse engineering mean? What underlying process of software development requires that a software expert approach an existing piece of software and attempt to understand its function? How does program understanding fit into the larger world of software development?

2.1.1 Software Engineering: an attempt at definition

A great range of definitions of software engineering has been presented in a variety of texts and articles. Software embodies the controlling logic governing the application of a computer to solving a given domain problem. Software encodes domain knowledge, domain methodologies, and domain data into a particular language designed to be compiled into machine-executable instructions. As an embodiment of logic, software is a very precise specification of a domain process. The specification itself, however, is only as accurate a representation of the true system as the particular specifier's understanding of the process itself allows. In addition, errors can be made during specification in the encoding process itself. Consequently it is more accurate to say that any piece of software is only an approximation of the process it represents and models. This approximation has two dimensions. First, the software is based on an approximate understanding of the process itself. Second, the software is only an approximately-correct representation of this understanding.

The process of constructing this model or software has traditionally been referred to as "writing" software, reflecting the perception that software construction is more about the manipulation of concepts than of materials. However, as attempts have been made to instill software with a higher degree of both accuracy to the modeled process and intended logic, the process of software development has become known as software engineering.

Software engineering is concerned with the theories, methods, and tools which are needed to develop the software for these computers. In most cases, the software systems which must be developed are large and complex systems. They are also abstract in that they do not have any physical form. Software engineering is therefore different from other engineering disciplines. It is not constrained by materials, governed by physical laws, or by manufacturing processes.

Software engineers model parts of the real world in software. These models are large, abstract, and complex so they must be made visible in documents such as system designs, user manuals, and so on. Producing these documents is as much part of the software engineering process as programming. As the real world which is modeled changes, so too must the software. Therefore, software engineering is also concerned with evolving these models to meet changing needs and requirements. [Sommerville, 1996, p. 4]

As Sommerville points out, the correctness and reliability of software is of more than critical importance.

In all industrialized countries, ... computer systems are economically critical.

More and more products incorporate computers in some form. Educational, administrative and health care systems are dependent on computer systems.

The software in these systems represents a large and increasing proportion of the total system costs. The effective functioning of modern economic and political systems therefore depends on our ability to produce software in a cost-effective way. [Sommerville, 1996, p. 4]

Not only is software only an approximate specification of a real process, but it must necessarily be an evolving specification as well. These software models are "large" in the sense that they can easily total several million lines of coded instructions, and "complex" in the sense that software entities are "...more complex for their size than perhaps any other human construct..." [Brooks, 1995]. This complexity is attributable at least partly to the fact that no two parts of a software model are alike in a semantic sense. The engineering of computers differs from the engineering of buildings, automobiles and almost any physical construction in this important respect.

Complexity hinders software in significant ways. Complex software has more opportunity to vary from the modeled process. Complex software is by definition hard for a third party to understand after it has been written. Complex software is dangerous to change for fear of adverse or unintended effects of a given change. Large, complex software is difficult to write and maintain since it will need to be segmented across members of a team, introducing problems of communication, coordination and mutual understanding.

Brooks identifies three other primary inherent difficulties with the engineering of software in addition to size and complexity. First, software must conform to pre-existing specifications and interfaces that are arbitrary when viewed with respect to the task at hand. The result of forcing a given software model to conform to this set of arbitrary constraints is to increase software complexity even further. Second, software is highly changeable since it is "embedded in a cultural matrix of applications, users, laws, and machine vehicles" which themselves change continuously. Third, software is difficult to visualize since it is not embedded in space and consequently has no ready geometric representation in the vein of other engineering applications such as, say, circuit diagrams. Graph-based representation models of software fail in their dimensionality since software has a multiplicity of parallel "views" such as control-flow, data-flow, dependency graphs, time sequence charts, name-space relationships and the like.

Software is very large, incredibly complex, changeable (and frequently changed), forced to conform to often arbitrary structures, and highly difficult to visualize spatio-temporally. One is tempted then to re-define software engineering as simply the attempt to define control logic with a maximal reduction in complexity of software. However, Brooks' further observation that "The complexity of software is an essential property, not an accidental one." helps dissuade us from this simple definition.

What then is software engineering? The answer is that there is no single answer.

Many techniques have been proposed as "software engineering" over the years. Software design and development methodologies such as structured analysis and design, rapid prototyping, waterfall design and a myriad of other approaches have been and continue to be hailed as the next and best method for software development. The quest for Brooks' famed but mythical "silver bullet" for instituting a massive jump in software development productivity continues. Software engineering is an accumulation of the processes of analyzing an existing system and a set of user requirements, designing an appropriate and exacting model of the existing system with careful emphasis on how an automated set of tasks would fit within the existing system, design of software logic for these automated tasks, implementation of the software required to perform these automated tasks while minimizing complexity and still satisfying conformability constraints, and attempting to minimize the number and effect of changes required after delivery while still satisfying functional requirements.

2.1.2 Reverse and re-engineering

If software engineering covers such a wide range of tasks, it is necessary to segment it in order to focus the scope of interest more precisely! Sommerville [Sommerville, 1996] proposes a convenient breakdown as follows:

Forward engineering is the conventional development of a software product from domain analysis through design to software production. This process typically includes the delivery of a software product and accompanying documentation and manuals.

Re-engineering is the systematic re-implementation of an existing software system so as to make it more maintainable.

Reverse engineering is the derivation of a design or specification of a system from its source code.

Sommerville further summarizes each of these methodologies in terms of its primary tasks and sub-products, as shown in Figure 2.1.

[Figure 2.1: Sommerville's software engineering world. The diagram relates forward engineering, re-engineering and reverse engineering, with stages and products including system specification, design and implementation, understanding and transformation, automated analysis, manual annotation, a system information store, document generation, and program structure diagrams, data structure diagrams and traceability materials.]

During the "automated analysis" phase of reverse engineering, automated (or partially automated) tools collect information about the structure of the system. As Sommerville points out, the structural information collected is insufficient to recover the system design itself. Rather, engineers "understand" information from the system and add it to the source code and the structural model(s) to produce an "information store", typically represented as some kind of directed graph or set of directed graphs of varying types. Further analysis can be undertaken using various combinations of source and graphical browsers in the subsequent annotation of the model of the source system. While these tools are aids for program understanding, they are primarily navigational and viewing tools intended to aid the expert in the correlative task between source code and internalized, expected, abstracted programming concept or plan representations. It is important to note that the understanding tools cited as relevant by Sommerville [Sommerville, 1996, p. 712] do not include those proposed and discussed elsewhere [Wills, 1992, Quilici, 1994, Kozaczynski and Ning, 1994], which take a more correlative focus between source code and some type of abstract conceptual representation.
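As a concrete (and deliberately minimal) illustration of such an information store, the Python sketch below records structural and annotated facts as labelled, directed edges; all relation names, program entities and the annotation itself are invented for the example and are not taken from Sommerville or from any particular tool:

```python
# Hedged sketch of an "information store": labelled, directed facts about a
# system, gathered by automated analysis and extended by manual annotation.

from collections import defaultdict

class InformationStore:
    def __init__(self):
        # relation name -> list of directed (source, target) edges
        self.relations = defaultdict(list)

    def add_fact(self, relation, source, target):
        self.relations[relation].append((source, target))

    def targets(self, relation, source):
        return [t for s, t in self.relations[relation] if s == source]

store = InformationStore()
# Structural facts an automated analysis might collect ...
store.add_fact("calls",      "main",        "parse_input")
store.add_fact("calls",      "parse_input", "read_token")
store.add_fact("references", "read_token",  "input_buffer")
# ... and a semantic fact appended by the engineer during annotation.
store.add_fact("implements", "parse_input", "tokenise the command line")

print(store.targets("calls", "main"))               # ['parse_input']
print(store.targets("implements", "parse_input"))   # ['tokenise the command line']
```

Automated analysis would populate the structural relations; manual annotation adds semantic facts alongside them in the same store.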

Sommerville importantly notes that the process of analysis and document generation is necessarily an iterative one. That is, the system in question is analyzed not only to determine its structure; the structure itself is then used as a framework on which the expert may append knowledge about what semantic tasks the system achieves. This structure may be viewed as evidence of a programmer's (or programmers') plan for achieving the task encoded in the software.

Gamma [Gamma et al., 1993] and others support this plan-based view of software:

"Studies of expert programmers for conventional languages ... have shown that knowledge is not organized simply around syntax, but in larger conceptual structures such as algorithms, data structures and idioms, and also plans that indicate steps necessary to fulfill a particular goal." [Gamma et al., 1993, p. 407]
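Purely to make the notion of a "plan" tangible, the following Python sketch writes one hypothetical plan down as named components plus constraints among them; the plan, its component kinds and its constraint names are invented for this illustration and are not drawn from Gamma et al. [Gamma et al., 1993] or from any particular plan library:

```python
# Illustrative sketch only: one hypothetical way a library "plan" might be
# written down -- named components plus constraints among them.

READ_PROCESS_LOOP_PLAN = {
    "name": "read-process-loop",
    "components": {
        "init":    {"kind": "statement", "role": "open or initialise the input"},
        "test":    {"kind": "condition", "role": "check for end of input"},
        "body":    {"kind": "statement", "role": "process one input item"},
        "advance": {"kind": "statement", "role": "fetch the next input item"},
    },
    "constraints": [
        ("control-precedes", "init", "test"),
        ("loop-controls",    "test", "body"),
        ("data-flows-to",    "advance", "test"),
    ],
}

print(sorted(READ_PROCESS_LOOP_PLAN["components"]))
# ['advance', 'body', 'init', 'test']
```

The components correspond to the "steps necessary to fulfill a particular goal" mentioned in the quotation.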


While reverse engineering seeks to delineate the original system structure from source code, re-engineering intends to "improve" the system structure, generate new system documentation, and make the system easier to understand. The benefits of this task are seen directly during subsequent software maintenance and extension. The re-engineering task mirrors the reverse engineering task, however, in at least one very important area: understanding. In both cases, an expert or a team of experts must develop a clear understanding of an existing software system.

Sommerville outlines the factors that directly affect the cost (and success) of a re-engineering process as follows:

Existing quality of the software and documentation.

Tool support for re-engineering. In particular, "re-engineering software is not normally cost-effective unless some automated tool support can be deployed to support the process" [Sommerville, 1996, p. 702].

Extent of Data Conversion required.

Availability of Expert Staff. In particular, the availability of the staff who maintain the system is critical, as these people can greatly reduce the amount of time spent by system re-engineers in understanding the software system of focus.

Software re-engineering is instantiated in both the task of software translation and that of software restructuring. Translation is typically a largely (although not frequently totally) automated process involving the mapping of structured representations of logic in one language to identical representations in another language. Much of this process is potentially one-to-one if the target language has functionality both mirroring and extending that of the source. In addition, tools such as Refine (Software Refinery) [Burn, 1992] and TXL [Cordy et al., 1991] support basic syntactic pattern matching and program transformations for the translation process.
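To convey the flavour of rule-based, syntax-directed translation without reproducing actual TXL or Refine notation, the tiny Python sketch below applies a single rewrite rule; the source form, the target form and the rule itself are all hypothetical:

```python
# Toy rewrite rule for source-to-source translation.  This is illustrative
# Python, not TXL or Refine notation; both language forms are hypothetical.

import re

# Hypothetical source form:  let <name> := <expr>;
# Hypothetical target form:  <name> = <expr>
LET_RULE = (re.compile(r"let\s+(\w+)\s*:=\s*(.+);"), r"\1 = \2")

def translate_line(line):
    """Apply the single rewrite rule to one line of source text."""
    pattern, replacement = LET_RULE
    return pattern.sub(replacement, line)

print(translate_line("let total := price * quantity;"))
# -> total = price * quantity
```

A real translator works over a parsed representation rather than raw text and applies many such rules exhaustively, which is why the process is largely, though not totally, automatable.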

Ongoing maintenance of software systems tends to destroy program structure and further complicate any understanding attempt. Control restructuring is intended to simplify the overall program structure so that conditionals and loop constructs are "parsed" into a more readable form, such as that offered by a series of nested if-then-else statements. Other restructuring, such as program modularization, is typically undertaken manually. During this process, constraints and relationships between program components are identified and the function of components deduced. Sommerville [Sommerville, 1996] states that "it is therefore impossible to automate this process, although some experimental systems have been produced to provide some computer-aided assistance for modularization." These systems rely on an analysis of references to data and procedure call structures to infer which program components may be part of the same module. Once well- or tightly-coupled modules are identified, it may be possible to totally or partially explain their interconnection structure and data-flow with reference to other templates of similar structure. In fact, in well-written programs that already provide tightly-coupled modules with shared data structures, the understanding task is much simplified.
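The kind of computer-aided modularization assistance just described can be caricatured with the short Python sketch below; the procedures, the data items they reference, and the grouping heuristic (connected components of a shared-data coupling graph) are all invented for illustration and are much cruder than what the experimental systems actually do:

```python
# Hedged sketch: infer candidate modules by grouping procedures that
# reference common data items.  All program facts below are invented.

from collections import defaultdict

# procedure -> data items it references
references = {
    "open_account":  {"account_table"},
    "close_account": {"account_table"},
    "post_entry":    {"ledger", "journal"},
    "print_ledger":  {"ledger"},
}

# Undirected coupling graph: two procedures are coupled if they share at
# least one referenced data item.
graph = defaultdict(set)
procs = list(references)
for i, p in enumerate(procs):
    for q in procs[i + 1:]:
        if references[p] & references[q]:
            graph[p].add(q)
            graph[q].add(p)

def components(nodes, graph):
    """Connected components of the coupling graph: the candidate modules."""
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        result.append(group)
    return result

print(components(procs, graph))
# two candidate modules: {open_account, close_account} and
# {post_entry, print_ledger}
```

A tool of this kind would also weigh procedure-call structure and would leave the final module boundaries to the expert, consistent with Sommerville's observation that the process cannot be fully automated.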

In Figure 2.2, Sommerville's view of the understanding process is extended somewhat to focus more on the interactive aspect of understanding, and on the tool support Sommerville identifies as critical. This extended model accommodates both the reverse and re-engineering subtasks of expert understanding. The automated analytical tools are utilized by a software expert to the degree desired, in conjunction with manual explanatory techniques and domain and programming knowledge. Most importantly, this new model acknowledges the fact that a particular expert is dealing with a specific subset of knowledge about domain programs and structure, about general programming techniques, and about cliches [Wills, 1992]. If one considers that some subset of this domain-specific and domain-independent knowledge may be formally representable (or represented) in the form of in-company or generic software re-use libraries (for instance), then portions of the perceived structure in the software system can be mapped to these existing software structures.

[Figure 2.2: Program understanding in software engineering. The extended model links the system to be re-engineered, automated analysis, manual annotation and interactive understanding to a system information store, a domain-specific software library and a generic software information store, producing program structure diagrams, data structure diagrams and traceability materials.]

This extended representation supports the view that complexity management can be addressed, at least partially, through information hiding paradigms such as that offered
