Elias N. Houstis, Ann Christine Catlin, John R. Rice,
Vassilios S. Verykios, Naren Ramakrishnan, Catherine E. Houstis
Often scientists need to locate appropriate software for their problems and then select from among many alternatives. We have previously proposed an approach for dealing with this task by processing performance data of the targeted software. This approach has been tested using a customized implementation referred to as PYTHIA. This experience made us realize the complexity of the algorithmic discovery of knowledge from performance data and of the management of these data together with the discovered knowledge. To address this issue, we created PYTHIA-II --- a modular framework and system which combines a general knowledge discovery in databases (KDD) methodology and recommender system technologies to provide advice about scientific software/hardware artifacts. The functionality and effectiveness of the system is demonstrated for two existing performance studies using sets of software for solving partial differential equations. From the end-user perspective, PYTHIA-II allows users to specify the problem to be solved and their computational objectives. In turn, PYTHIA-II (i) selects the software available for the user's problem, (ii) suggests parameter values, and (iii) assesses the recommendation provided. PYTHIA-II provides all the necessary facilities to set up database schemas for testing suites and associated performance data in order to test sets of software. Moreover, it allows easy interfacing of alternative data mining and recommendation facilities. PYTHIA-II is an open-ended system implemented on public domain software and has been used for performance evaluation in several different problem domains.
Numerical Analysis: Partial Differential Equations; Information Systems: Types of Systems; Database Management: Database Applications; Artificial Intelligence: Applications and Expert Systems
Keywords: data mining, inductive logic programming, knowledge-based systems, knowledge discovery in data bases, performance evaluation, recommender systems, scientific software.
Complex scientific, engineering, or societal problems are often solved today by utilizing libraries or some form of problem solving environments (PSEs). Most software modules are characterized by a significant number of parameters, affecting efficiency and applicability, that must be specified by the user. This complexity increases further with the number of parameters associated with the execution environment. Furthermore, one can create many alternative solutions of the same problem by selecting different software for the various phases of the computation. Thus, the task of selecting the best software and the associated algorithmic/hardware parameters for a particular computation is often difficult and sometimes even impossible. In [1] we proposed an approach for dealing with this task by processing performance data obtained from testing software. The testing of this approach is described in [2] using the PYTHIA implementation for a specific performance evaluation study. The approach has also been tested for numerical quadrature software [3] and is being tested for parallel computer performance [4, 5]. This experience made us realize the high level of complexity involved in the algorithmic discovery of knowledge from performance data and in the management of these data together with the discovered knowledge. To address this complexity, together with the scalability and portability of the approach, we present a knowledge discovery in databases (KDD) methodology [6] for testing and recommending scientific software. PYTHIA-II is a system with an open software architecture implementing the KDD methodology, which can be used to build a Recommender System (RS) for many domains of scientific software/hardware artifacts [2, 3, 7, 8]. In this paper, we describe the PYTHIA-II architecture and its use as a RS for PDE software.
Given a problem from a known class of problems and given some performance criteria, PYTHIA-II selects the best performing software/machine pair and estimates values for the associated parameters. It makes recommendations by eliciting the attributes of the specified problem and matching them against those of a predefined dense population of similar problems. Dense here means that there are enough data available that it is reasonable to expect a good recommendation can be made; the denser the population, the more reliable the recommendation. We describe case studies for two sets of elliptic partial differential equation software found in PELLPACK [9].
The paper is organized as follows. Section 1 describes a general methodology for selecting and recommending scientific software implemented in PYTHIA-II. The architecture for a RS based on the PYTHIA-II approach is presented in Section 2. A description of the data management subsystem of PYTHIA-II is presented in Section 3, where we include a database schema appropriate for building a RS for elliptic PDE software from the PELLPACK library to illustrate its use. Section 4 outlines the knowledge discovery components of PYTHIA-II. The data flow in PYTHIA-II is illustrated in Section 5. The results of applying PYTHIA-II to two case studies, and a comparison with earlier results from the 1980s, can be found in Section 6 and Section 7.
A RS for software/hardware artifacts uses stored performance data on a population of previously encountered problems and machines to locate and suggest efficient artifacts for solving previously unseen problems. Recommendation becomes necessary when user requests or objectives cannot be properly represented as ordinary database queries. In this section, we describe the complexity of this problem, the research issues to address, and a methodology for resolving them.
| Phases | Description |
|---|---|
| Determine evaluation objectives | Identify the computational objectives for which the performance evaluation of the selected scientific software is carried out. |
| Data preparation and selection | |
| Data mining | |
| Analysis of results | This is a post-processing phase done by knowledge engineers and domain experts to ensure correctness of the results. |
| Assimilation of knowledge | Create a user-friendly interface to utilize the knowledge and to identify the scientific software (with parameters) for the user's problems and computational objectives. |
The algorithm or software selection problem originated in an early paper by Rice [13]. Even for routine tasks in computational science, this problem is ill-posed and quite complicated, and its difficulty is due to several factors.
The methodology for building PYTHIA-II uses the knowledge discovery in databases (KDD) process shown in Table 1. Assuming a dense population of benchmark problems from the targeted application domain, this RS methodology uses a three-pronged strategy: feature determination of problem instances, performance evaluation of scientific software, and the automatic generation of relevant knowledge. Note that the dense population assumption can be quite challenging for many application domains. We now address each of these aspects.
| FEATURE | ||
| name | text | feature record name |
| nfeatures | integer | no. of attributes identifying this feature |
| features | text [ ] | numeric/symbolic/textual identification |
| RELATION | ||
| name | text | relation record name |
| equation | text | name of equation with these features |
| feature | text | name of record identifying features |
| FEATURE | |
| name | opLaplace |
| nfeatures | 1 |
| features | "Uxx + Uyy (+ Uzz) = f" |
| RELATION | |
| name | opLaplace pde #3 |
| equation | pde #3 |
| feature | opLaplace |
The database schema and two instances from the feature and relation
tables are given above. The instance records illustrate the
correspondence between equation
pde#3 and its feature opLaplace (the PDE
is the Laplacian).
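For illustration, these tables can be realized in any relational store. The following sketch is hypothetical: it uses SQLite to stay self-contained, whereas PYTHIA-II is actually built on POSTGRES [29], and it flattens the array-valued attributes into plain text.

```python
import sqlite3

# Hypothetical sketch of the FEATURE and RELATION tables shown above.
# PYTHIA-II itself is built on POSTGRES; SQLite is used here only to keep
# the example self-contained. Array-valued attributes are stored as text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    name      TEXT PRIMARY KEY,   -- feature record name
    nfeatures INTEGER,            -- no. of attributes identifying this feature
    features  TEXT                -- numeric/symbolic/textual identification
);
CREATE TABLE relation (
    name     TEXT PRIMARY KEY,    -- relation record name
    equation TEXT,                -- name of equation with these features
    feature  TEXT REFERENCES feature(name)
);
""")

# The two instance records from the text: pde #3 has the opLaplace feature.
conn.execute("INSERT INTO feature VALUES (?, ?, ?)",
             ("opLaplace", 1, "Uxx + Uyy (+ Uzz) = f"))
conn.execute("INSERT INTO relation VALUES (?, ?, ?)",
             ("opLaplace pde #3", "pde #3", "opLaplace"))

# A join recovers the correspondence between equations and their features.
for row in conn.execute("""SELECT r.equation, f.name, f.features
                           FROM relation r JOIN feature f ON r.feature = f.name"""):
    print(row)   # ('pde #3', 'opLaplace', 'Uxx + Uyy (+ Uzz) = f')
```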
Accuracy is measured by the norm of the difference between the computed and the true solutions or by a guaranteed error estimate. Speed is measured by the time required to execute the software in a standard execution environment. PYTHIA-II ensures that all performance evaluations are made consistently; their outputs are automatically coded into predicate logic formulas. We resort to attribute-value encodings when the situation demands it; for instance, using straight line approximations to performance profiles (e.g., accuracy vs. grid size) for solvers is useful to obtain interpolated values of grid parameters for PDE problems.
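As a concrete illustration of the attribute-value encoding just mentioned, the sketch below fits a straight line to a solver's error-vs-grid-size profile in log-log space and interpolates the grid needed for a target accuracy. The profile values are invented; this is not PYTHIA-II's actual encoding code.

```python
import math

# Hypothetical error-vs-grid-size profile for one solver on one PDE problem
# (values invented): a second-order method whose error drops by about 4x
# each time the grid is refined by 2x.
grid_sizes = [25, 50, 100, 200]                     # n x n grids
max_errors = [4.1e-2, 1.0e-2, 2.6e-3, 6.4e-4]       # max abs error

# Straight-line (least squares) fit of log(error) against log(n).
xs = [math.log(n) for n in grid_sizes]
ys = [math.log(e) for e in max_errors]
m = len(xs)
mean_x, mean_y = sum(xs) / m, sum(ys) / m
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

def grid_for_accuracy(target_error: float) -> int:
    """Interpolate the grid size expected to achieve the target error."""
    log_n = (math.log(target_error) - intercept) / slope
    return math.ceil(math.exp(log_n))

print(grid_for_accuracy(1e-3))   # about 160 for this synthetic profile
```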
CBR systems obey a lazy-learning paradigm in that learning consists solely of recording data from past experiments to help in future problem solving sessions. (This gain in simplicity of learning is offset by a more complicated process in the actual recommendation stage.) Evidence from psychology suggests that people make judgements this way, using the experience gained in solving `similar' problems to devise a strategy for solving the present one. In addition, CBR systems can exploit a priori domain knowledge to perform more sophisticated analyses even if pertinent data is not present. The original PYTHIA system utilized a rudimentary form of case-based reasoning employing a characteristic-vector representation for the problem population [2].
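A minimal sketch of this characteristic-vector style of case-based matching follows; the feature names, stored cases, and similarity measure are invented for illustration and do not reproduce PYTHIA's actual representation.

```python
# Minimal case-based reasoning sketch in the spirit of the original PYTHIA:
# each stored case pairs a characteristic vector of problem features with the
# solver that performed best on it. Feature names and cases are invented.
cases = [
    ({"operator_laplace": 1, "rhs_singular": 1, "domain_rect": 1}, "dyakanov-cg4"),
    ({"operator_helmholtz": 1, "pde_peaked": 0, "domain_rect": 1}, "fft_9_point_order_4"),
    ({"operator_general": 1, "pde_peaked": 1, "domain_rect": 0}, "hermite_collocation"),
]

def similarity(a: dict, b: dict) -> float:
    """Simple overlap measure between two characteristic vectors."""
    keys = set(a) | set(b)
    return sum(a.get(k, 0) == b.get(k, 0) for k in keys) / len(keys)

def recommend(problem: dict) -> str:
    """Lazy learning: reuse the recommendation of the most similar stored case."""
    best_case = max(cases, key=lambda case: similarity(case[0], problem))
    return best_case[1]

print(recommend({"operator_laplace": 1, "rhs_singular": 1, "domain_rect": 0}))
# -> 'dyakanov-cg4'
```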
ILP systems, on the other hand, use an eager learning paradigm in that they attempt to construct a predicate logic formula so that all positive examples of good recommendations can be logically derived from the background knowledge together with the constructed formula, and no negative example can be so derived. The advantages of this approach lie in the generality of the representation of background knowledge. Formally, the task in algorithm selection is: given a set of positive exemplars and negative exemplars of the selection mapping and a set of background knowledge, induce a definition of the selection mapping so that every positive example can be derived and no negative example can be derived. While the strict use of this definition is impractical, an approximate characterization, called the cover, is utilized which places greater emphasis on not representing the negative exemplars as opposed to representing the positive exemplars. Techniques such as relative least general generalization and inverse resolution [23] can then be applied to induce clausal definitions of the algorithm selection methodology. This forms the basis for building RS procedures using banks of selection rules.
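In the standard ILP notation, the induction task described above and its approximate cover can be restated as follows (our paraphrase, not a formula from the original systems):

```latex
% ILP formulation of the selection-mapping induction task (our paraphrase):
% B  = background knowledge,
% E+ = positive exemplars (observed good recommendations),
% E- = negative exemplars,
% H  = the induced definition of the selection mapping.
\begin{align*}
  \text{find } H \text{ such that}\quad
    & B \wedge H \models e      \quad\text{for every } e \in E^{+},\\
    & B \wedge H \not\models e  \quad\text{for every } e \in E^{-}.
\end{align*}
% The approximate characterization (the ``cover'') keeps the second condition
% strict while allowing some positive exemplars to remain uncovered:
\[
  \operatorname{cover}(H) \;=\; \{\, e \in E^{+} : B \wedge H \models e \,\}
  \;\subseteq\; E^{+},
  \qquad
  B \wedge H \not\models e \ \text{ for all } e \in E^{-}.
\]
```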
ILP is often prohibitively expensive, and the standard practice is to restrict the hypothesis space to a proper subset of first order predicate logic. Most practical systems (such as GOLEM and PROGOL [25]) require that background knowledge be ground, meaning that only base facts can be provided as opposed to intensional information. Even so, the overall complexity remains exponential. In PYTHIA-II, we investigate the use of domain specific restrictions on the induction of hypotheses and analyze several strategies. First, we make syntactic and semantic restrictions on the nature of the induced methodology. For example, we require that a PDE solver activate a discretizer before a linear system solver (a different order of PDE solver parts does not make sense); an example of a semantic restriction is consistency checking between algorithms and their inputs. Second, we incorporate a generality ordering to guide the induction of rules and prune the search space for generating plausible hypotheses. Finally, since the software architecture of the domain specific RS has a natural database query interface, we utilize it to provide meta-level patterns for rule generation.
PYTHIA-II also employs more restricted forms of eager learning such as the ID3 (Induction of Decision Trees) system [26]. ID3 is a supervised learning system for top-down induction of decision trees from a set of examples and uses a greedy divide-and-conquer approach. The decision tree is a structure where: (a) every internal node is labeled with the name of one of the predicting attributes; (b) the branches from an internal node are labeled with values of the node attribute; and (c) every leaf node is labeled with a class (i.e., the value of the goal attribute). The training examples are tuples, where the domain of each attribute is limited to a small number of values, either symbolic or numerical. The ID3 system uses a top-down irrevocable strategy that searches only part of the search space, guaranteeing that a simple -- but not necessarily the simplest -- tree is found.
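The following sketch shows the greedy, entropy-based attribute selection at the heart of ID3; it is an illustrative reimplementation, not PYTHIA-II's integration of the published ID3 system [26], and the training tuples are invented.

```python
import math
from collections import Counter

# Greedy ID3-style induction: pick the attribute with the highest information
# gain, split on its values, and recurse. Training tuples are invented; the
# goal attribute is the recommended method.
examples = [
    ({"operator": "laplace",   "rhs": "singular"}, "dyakanov-cg4"),
    ({"operator": "laplace",   "rhs": "smooth"},   "fft_9_point_order_6"),
    ({"operator": "helmholtz", "rhs": "smooth"},   "fft_9_point_order_4"),
    ({"operator": "general",   "rhs": "smooth"},   "hermite_collocation"),
    ({"operator": "general",   "rhs": "singular"}, "hermite_collocation"),
]

def entropy(labels):
    counts, total = Counter(labels), len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    labels = [label for _, label in examples]
    remainder = 0.0
    for value in {feats[attribute] for feats, _ in examples}:
        subset = [label for feats, label in examples if feats[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, attributes):
    labels = [label for _, label in examples]
    if len(set(labels)) == 1 or not attributes:          # leaf: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))
    tree = {best: {}}
    for value in {feats[best] for feats, _ in examples}: # one branch per value
        subset = [(f, l) for f, l in examples if f[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best])
    return tree

print(id3(examples, ["operator", "rhs"]))
```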
The graphical user interface in the top layer allows the knowledge engineer to use the system to generate knowledge as well as to query the system for facts stored in the database layer. The recommender is the end-user interface and includes the inference engine; it acts as an expert system, using the knowledge generated by the lower layers to answer domain-specific questions posed by end users. The architecture of PYTHIA-II is extensible, with well defined interfaces among the components of the various layers.
| EQUATION | ||
| name | text | record name |
| system | text | software to solve equation |
| nequations | integer | number of equations |
| equations | text [ ] | text describing equations to solve |
| forfile | text | source code file (used in definition) |
| EQUATION | |
| name | pde #39 |
| system | pellpack |
| nequations | 1 |
| equations | {"uxx + uyy + ((1.-h(x)**2*w(x,y)**2)/(&b))u = 0"} |
| forfile | /p/pses/projects/kbas/data-files/fortran/pde39.eq |
| SEQUENCES | ||
| name | text | record name |
| system | text | software that provides the solver modules |
| nmod | integer | number of modules in the solution scheme |
| types | text [ ] | array of record types (e.g., grid, solver) |
| names | text [ ] | array of module names |
| parms | text [ ] | array of module parameters |
| SEQUENCES | |
| name | uniform 950x950 proc 2 jacobi cg |
| system | pellpack |
| nmod | 6 |
| types | {"grid","machine","dec","discr","indx","solver"} |
| names | {"950x950 rect","machine_2","runtime grid 1x2",
"5-point star","red black","itpack-jacobi cg"} |
| parms | {"","","","","","itmax 20000"} |
&b in the specification defines a location for
parameter replacement and
the forfile attribute provides for additional source
code to be attached to the equation definition. The sequences
record shows an ordered listing of the module calls used to solve
a particular PDE problem. For each module call in the list, the
sequence identifies the module type, name and parameters.
| EXPERIMENT | ||
| name | text | record name (primary key) |
| system | text | software identification used for program generation |
| nopt | integer | number of options |
| options | text [ ] | array of option record names (foreign key) |
| noptparm | integer | number of parameter specific options |
| optparm | text [ ] | array of option record names |
| equation | text | equation record which defines the equation |
| neqnparm | integer | number of equation parameters |
| eqnparm | text [ ] | array of equation parameter names |
| domain | text | domain record on which the equation is defined |
| ndomparm | integer | number of domain parameters |
| domparm | text [ ] | array of domain parameter names |
| bcond | text | boundary condition record |
| nbcparm | integer | number of bcond parameters |
| bcparm | text [ ] | array of bcond parameter names |
| nparm | integer | number of parameters applied across all definitions |
| parm | text [ ] | array of problem-wide parameters (number of programs) |
| sequences | text [ ] | names of the sequence records containing the solution schemes |
| nout | integer | number of output records |
| output | text [ ] | array of output record names |
| nfor | integer | number of source code files to include |
| fortran | text [ ] | names of the files to include |
| name | pde54 dom02 fd-itpack-rscg SP2-17 |
| system | pellpack |
| comp_db | linearalgebra |
| composite_id | pde54 domain 02 fd-itpack-rscg |
| perfind_set | pellpack-std-par-grd |
| pid | 1432 |
| sequence_no | 17 |
| eqparms | pde #54 parameter set 5 |
| solverseq | 950x950 proc 4 reduced system cg |
| rundata | IBM SP2 with 18 compute node |
| nfeature | 5 |
| featurenames | {"matrix symmetric", "domain type", "boundary pieces", "problem type"} |
| featurevals | {"no", "non-rectangular","8", "FiniteDifference"} |
| nperf | 1 |
| perfnames | {"number of iterations"} |
| pervals | {"830"} |
| nproc | 4 |
| nmod | 5 |
| modnames | {"domain processor","decomposer","discretizer","indexer","solver"} |
| ntimeslice | 2 |
| timeslice | {"elapsed","communication"} |
| time | {{{"3.16","0"},{"2.3","0"},{"4.2","0"},{"0.11","0"},{"135.4","1.24"}},
{{"3.1","0"},{"2.46","0"},{"3.89","0"}, {"0.09","0"},{"135.4500024","36.74049"}}, {{"3.1","0"},{"2.47","0"},{"3.91","0"}, {"0.08","0"},{"135.5499933","37.1304893"}}, {{"3.1","0"},{"2.03","0"},{"4.139","0"}, {"0.04","0"},{"136.1499939","88.7300339"}}} |
| ntotal | 4 |
| total | {"150.1600037","149.9700012","150.0200043","149.6300049"} |
| nmemory | 4 |
| memorynames | {"number of equations", "x grid size","y grid size", "problem size"} |
| memoryvals | {"224676","950","950","902500"} |
| nerror | 3 |
| errornames | {"max abs error","L1 error","L2 error"} |
| errorvals | {"0.0022063255","0.00011032778","0.00022281437"} |
In this sample PYTHIA-II instantiation, the problem population has 13 problem specification tables (equation, domain, bcond, grid, mesh, dec, discr, indx, solver, triple, output, parameter, option) and 21 relationship tables (equation-discr, mesh-domain, parameter-solver, etc.). Additional tables define problem features and execution related information (machine and rundata tables). In all, 44 table definitions are used for the PYTHIA-II database. Section 6 and Section 7 give some examples of these tables.
First, it must be possible to define the input (i.e., the problem definition) using only information in an experiment record. The translation of an experiment into an executable program is handled by a script written for the software, which extracts the necessary information from the experiment record and generates the files or drivers for the software. For PELLPACK, the experiment record is translated to a .e file, which is the PELLPACK language definition of the PDE problem, the solution scheme, and the output requirements. The script is written in Tcl and consists of about 250 lines of code. The standard PELLPACK preprocessing programs convert the .e file to a Fortran 77 driver and link the appropriate libraries to produce an executable program. The second requirement is that the software must be able to operate in batch mode. In the PELLPACK case, Perl scripts are used to execute PELLPACK programs, both sequential and parallel, on any number of platforms; the programs are created and executed without manual intervention. Finally, the software must produce performance measures as output. A post-processing program must be written specifically to convert the generated output into PYTHIA-II performance records; each program execution should insert one record into the performance database. The PELLPACK post-processing program is written in Tcl (350 lines of code) and Perl (300 lines of code).
Data generation (program generation, program execution, data collection) may take place inside or outside of PYTHIA-II. This process is domain dependent since problem definition records, software, and output files depend on the domain.
| | Algorithm 1 | Algorithm 2 | ... | Algorithm k |
|---|---|---|---|---|
| Problem 1 | X11 | X12 | ... | X1k |
| Problem 2 | X21 | X22 | ... | X2k |
| ... | ... | ... | ... | ... |
| Problem n | Xn1 | Xn2 | ... | Xnk |
| Rank | R1 | R2 | ... | Rk |
| Average Rank | R*1 | R*2 | ... | R*k |
PYTHIA-II currently ranks the performance of algorithms with Friedman rank sums [27]. This distribution-free ranking assumes n·k data values, one from each of the k algorithms for each of the n problems. The analyzer can ``fill in'' missing values using various methods. The Friedman ranking proceeds as follows:
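In outline (our restatement of the standard rank-sum computation [27], using the notation of the ranking matrix above), each problem's observations are ranked across the k algorithms, the ranks are summed per algorithm, and the summed ranks are then tested for significant differences:

```latex
% Friedman rank sums (our restatement of the standard procedure [27]):
% X_{ij} is the performance of algorithm j on problem i, r_{ij} its rank
% within row i (rank 1 = best), R_j the rank sum and R^*_j the average rank.
\begin{align*}
r_{ij} &= \operatorname{rank}\bigl(X_{ij}\ \text{among}\ X_{i1},\dots,X_{ik}\bigr),\\
R_j &= \sum_{i=1}^{n} r_{ij}, \qquad R^{*}_j = \frac{R_j}{n},\\
S &= \frac{12}{n\,k\,(k+1)} \sum_{j=1}^{k} R_j^{2} \;-\; 3\,n\,(k+1).
\end{align*}
% S is compared against the chi-square distribution with k-1 degrees of
% freedom; algorithms whose rank sums do not differ significantly are
% reported as tied.
```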
The pattern-extraction module provides automatic knowledge acquisition (patterns/models) from the data to be used by a RS. This process is independent of the problem domain. PYTHIA-II extends the PYTHIA methodology for the algorithm selection problem by applying various neuro-fuzzy, instance-based learning, and clustering techniques. The relational model of PYTHIA-II automatically handles all manipulation of the raw data. The pattern extraction process expects data in a specific format, and filters transform this format (on the fly) to the format required by the various data mining tools integrated into PYTHIA-II. The goal is to accumulate tools that generate knowledge in the form of logic rules, if-then-else rules, or decision trees.
PYTHIA-II first used GOLEM [30], an empirical single-predicate Inductive Logic Programming (ILP) learning system. It is a batch system that implements the relative least general generalization principle. We have experimented with other learning methods, e.g., fuzzy logic and neural networks, and have not found large differences in their learning abilities. We chose ILP because it seemed to be the easiest to use in PYTHIA-II; its selection is not the result of a systematic study of the effectiveness of learning methods, and PYTHIA-II is designed so the learning component can be replaced if necessary. GOLEM generates knowledge in the form of logical rules expressed in first order predicate logic, which can then be easily utilized as the rule base of an expert system. We have also integrated PROGOL [25], CN2, PEBLS, and OC1 (the latter three are available in the MLC++ library [28]).
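To make the use of induced rules as a rule base concrete, the sketch below matches rules of the kind reported in Section 6 against a problem's feature facts. It is written in Python only for compactness; the actual recommender is built on CLIPS [31], and the rules and facts shown are abbreviated examples.

```python
# Minimal sketch of using induced rules as an expert-system rule base.
# Each rule pairs a recommended method with the feature predicates that must
# hold. The rules are abbreviated versions of those reported in the case study.
rules = [
    ("dyakanov-cg4",        {"opLaplace_yes", "rhsSingular_yes"}),
    ("fft_9_point_order_4", {"opHelmholtz_yes", "pdePeaked_no"}),
    ("hermite_collocation", {"opGeneral_yes"}),
]

def recommend(problem_facts: set) -> list:
    """Return every method whose rule body is satisfied by the problem facts."""
    return [method for method, body in rules if body <= problem_facts]

# Facts elicited for a hypothetical problem instance.
facts = {"opLaplace_yes", "rhsSingular_yes", "pdePeaked_no"}
print(recommend(facts))    # -> ['dyakanov-cg4']
```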
The fact base of the recommender is then processed to provide values for the parameters about which the user needs advice. The fact base consists of all the raw performance data stored in the database; this information is accessed by queries generated on the fly, based on the user's objectives and selections. If the user's objectives cannot be met, the recommender decides what ``best'' answer to give, using weights specified by the user for each performance criterion. For the case studies in Section 6 and Section 7, the final step is the recommendation of the best PDE solver to use, together with solver parameters such as the grid needed to achieve the required solution accuracy within the given time limitations.
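One way such a weighted ``best'' answer could be computed is sketched below; the criteria, weights, and candidate numbers are invented, and the actual weighting scheme inside the recommender may differ.

```python
# Hypothetical sketch of the weighted ``best'' answer: when no candidate meets
# every objective exactly, score each candidate by how far it misses each
# performance criterion, weighted by the user's stated importance.
# Candidate numbers and weights are invented.
candidates = {
    "fft_9_point_order_4": {"elapsed_time": 2.1, "max_error": 3.0e-4},
    "dyakanov-cg4":        {"elapsed_time": 4.8, "max_error": 9.0e-5},
    "hermite_collocation": {"elapsed_time": 9.5, "max_error": 2.0e-5},
}
objectives = {"elapsed_time": 3.0, "max_error": 1.0e-4}   # user's targets
weights    = {"elapsed_time": 0.3, "max_error": 0.7}      # user's priorities

def penalty(measured: float, target: float) -> float:
    """Relative amount by which a criterion is missed (0 if met)."""
    return max(0.0, (measured - target) / target)

def best_answer(candidates, objectives, weights) -> str:
    score = lambda perf: sum(weights[c] * penalty(perf[c], objectives[c])
                             for c in objectives)
    return min(candidates, key=lambda name: score(candidates[name]))

print(best_answer(candidates, objectives, weights))   # -> 'dyakanov-cg4'
```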
An experiment combines problem records into groups, and a high level problem specification is generated by a program-based transformation of the experiment record into an input file for execution. The problem execution environment invokes the appropriate scientific software to generate data. For the example instantiation referred to in Section 3 and Section 4, the execution environment consists of PELLPACK. The execution generates a number of output files, each containing performance and other information related to solving the problem. The input follows the specific schema of the problem record, and the output format is specified by a system-specific, user-selected file template; the template lists the program used to collect the required output data. These data records keep logical references (foreign keys) to the problem definition records, so that performance can be matched with problem features by executing n-way joins during pattern extraction.
The statistical analyzer uses
the performance data for ranking based on the
parameter(s) selected by the user. The ranking
produces an ordering of these parameters which is statistically
significant (i.e., if the performance data shows no significant
difference between parameters, they are shown as tied in
rank). A predicate record defines the collection of profile
records to be used in pattern extraction and allows a
knowledge engineer to change the set of input profile records as
easily as updating a database record.
A filter program converts data to the
input format
required by the pattern extraction programs. PYTHIA-II
currently supports GOLEM/PROGOL, the MLC++ (Machine Learning Library in C++) library, and others.
These programs generate output in
the form of logic rules, if-then rules or decision
trees/graphs for categorization purposes. This process is
open-ended, and tools like
neural networks, genetic algorithms, fuzzy logic tool-boxes, and rough
set systems can be used.
Defining the population of 35 PDEs and the experiments required 21 equation records with up to 10 parameter sets each, 3 rectangle domain records, 5 sets of boundary conditions records, 10 grid records, several discretizer, indexing, linear solver and triple records with corresponding parameters, and a set of 40 solver sequence records. Using these components, 37 experiments were specified, each defining a collection of PDE programs involving up to 35 solver sequences for a given PDE problem. Examples of these records are given in Section 3. The 37 experiments were executed on a SPARCstation20 with 32MB memory running Solaris 2.5.1 from within PYTHIA-II's execution environment. Over 500 performance records were created.
| Phases | Description | Implementation |
|---|---|---|
| Determine evaluation objectives | Evaluate the efficiency and accuracy of a set of solution methods and their associated parameters with respect to elapsed time, error and problem size. | Manual |
| Data preparation | | |
| Data mining | | |
| Analysis of results | Domain experts ensure correctness of the results. | Manual |
| Assimilation of knowledge | Create an intelligent interface to utilize the knowledge to identify the ``best method'' with associated parameters for user's problems and computational objectives. | CLIPS |
| Problem Component | Features |
|---|---|
| Equation | first tier operator: Laplace, Poisson, Helmholtz, self-adjoint, general; second tier operator: analytic, entire, constant coefficients; operator smoothness tier: constant, entire, analytic; right-hand-side tier: entire, analytic, singular (infinite), singular derivatives, constant coefficients, nearly singular, peaked, oscillatory, homogeneous, computationally complex; right-hand-side smoothness tier: constant, entire, analytic, computationally complex, singular, oscillatory, peaked |
| Domain | unit square; [a,b] * [a+x,b+x], where x can vary; [a,b] * [a+c,b+c], where c is a constant |
| Boundary Conditions | U = 0 on all boundaries; AU = f on all boundaries; BUn = f on some boundaries; AU + BUn = f on some boundaries; constant coefficients, non-constant coefficients |
1. FFT6
2. FFT4
3. DCG4
4. FFT2
5. COLL
6. DCG2
7. 5PT
FFT4: 27.0 %
COLL: 21.6 %
5PT : 18.9 %
DCG4 : 13.5 %
FFT6 : 10.8 %
DCG2 : 5.4 %
FFT2 : 2.7 %
| Best Method | For problems with these features |
|---|---|
| best_method(A,dyakanov-cg4) | :- opLaplace_yes(A), rhsSingular_yes(A) |
| best_method(A,fft_9_point_order_4) | :- opHelmholtz_yes(A), pdePeaked_no(A) |
| best_method(A,fft_9_point_order_4) | :- solVarSmooth_yes(A), solSmoSingular_no(A) |
| best_method(A,fft_9_point_order_2) | :- solSingular_no(A), solSmoSingDeriv_yes(A) |
| best_method(A,fft_9_point_order_6) | :- opLaplace_yes(A), rhsSingular_no(A), rhsConstCoeff_no(A), rhsNearlySingular_no(A), rhsPeaked_no(A) |
| best_method(A,fft_9_point_order_6) | :- pdeSmoConst_yes(A), rhsSmoDiscDeriv_yes(A) |
| best_method(A,dyakanov-cg4) | :- opSelfAdjoint_yes(A), rhsConstCoeff_no(A) |
| best_method(A,dyakanov-cg4) | :- pdeJump_yes(A) |
| best_method(A,dyakanov-cg) | :- pdeSmoConst_yes(A), rhsSmoDiscDeriv_yes(A) |
| best_method(A,hermite_collocation) | :- opGeneral_yes(A) |
| best_method(A,hermite_collocation) | :- pdePeaked_yes(A) |
| PDE | 5PT | COLL | DCG2 | DCG4 | FFT2 | FFT4 | FFT6 |
|---|---|---|---|---|---|---|---|
| 3-1 | 7 (7) | 6 (4) | 4 (6) | 5 (5) | 2 (2) | 1 (3) | 3 (1) |
| 3-2 | 6 (6) | 7 (7) | 1 (5) | 3 (3) | 4 (4) | 2 (2) | 5 (1) |
| 7-1 | 7 (7) | 6 (3) | 2 (5) | 3 (5) | 1 (4) | 4 (1) | 5 (1) |
| 8-2 | 7 (7) | 6 (5) | 1 (4) | 5 (6) | 2 (2) | 4 (3) | 3 (1) |
| 9-1 | 6 (6) | 5 (5) | 3 (4) | 2 (3) | 4 (2) | 1 (1) | - |
| 9-2 | 6 (6) | 5 (5) | 3 (3) | 4 (4) | 2 (2) | 1 (1) | - |
| 9-3 | 6 (6) | 4 (5) | 5 (3) | 3 (3) | 2 (2) | 1 (1) | - |
| 10-2 | 6 (6) | 7 (7) | 5 (5) | 2 (4) | 3 (3) | 1 (2) | 4 (1) |
| 10-3 | 6 (6) | 7 (7) | 5 (5) | 3 (4) | 4 (3) | 2 (2) | 1 (1) |
| 10-4 | 7 (5) | 5 (7) | 6 (4) | 3 (6) | 4 (3) | 2 (2) | 1 (1) |
| 10-7 | 6 (6) | 5 (7) | 4 (5) | 3 (3) | 7 (3) | 1 (2) | 2 (1) |
| 11-2 | 7 (7) | 6 (6) | 5 (5) | 1 (3) | 2 (3) | 3 (2) | 4 (1) |
| 11-3 | 7 (6) | 6 (6) | 4 (5) | 3 (4) | 5 (3) | 1 (2) | 2 (1) |
| 11-4 | 6 (6) | 7 (7) | 5 (5) | 3 (4) | 4 (3) | 2 (2) | 1 (1) |
| 11-5 | 6 (6) | 7 (7) | 5 (4) | 3 (4) | 4 (3) | 2 (2) | 1 (1) |
| 13-1 | 3 (3) | 4 (4) | 2 (1) | 1 (1) | - | - | - |
| 15-1 | 2 (2) | 1 (1) | - | - | - | - | - |
| 15-2 | 2 (2) | 1 (1) | - | - | - | - | - |
| 17-1 | 7 (7) | 6 (6) | 3 (5) | 1 (3) | 2 (4) | 4 (2) | 5 (1) |
| 17-2 | 6 (6) | 7 (7) | 4 (4) | 2 (5) | 5 (3) | 1 (2) | 3 (1) |
| 17-3 | 6 (6) | 7 (7) | 4 (4) | 5 (5) | 3 (3) | 2 (2) | 1 (1) |
| 20-1 | 2 (2) | 1 (1) | - | - | - | - | - |
| 20-2 | 1 (1) | 2 (2) | - | - | - | - | - |
| 28-2 | 3 (2) | - | 1 (1) | 2 (2) | - | - | - |
| 30-4 | 1 (1) | 2 (2) | - | - (2) | - | - | - |
| 30-8 | 2 (2) | 1 (1) | - | - | - | - | - |
| 34-1 | 4 (4) | 3 (2) | 2 (2) | 1 (1) | - | - | - |
| 35-1 | 4 (4) | 3 (2) | 2 (2) | 1 (1) | - | - | - |
| 36-2 | 2 (1) | 1 (1) | - | - | - | - | - |
| 39-2 | 1 (1) | 2 (2) | - | - | - | - | - |
| 39-4 | 2 (2) | 1 (1) | - | - | - | - | - |
| 44-2 | 2 (2) | 1 (1) | - | - | - | - | - |
| 44-3 | 2 (2) | 1 (1) | - | - | - | - | - |
| 47-2 | 6 (6) | 6 (6) | 3 (5) | 2 (4) | 4 (3) | 1 (2) | 5 (1) |
| 49-3 | 2 (2) | 1 (1) | - | - | - | - | - |
| 51-1 | 1 (1) | 2 (2) | - | - | - | - | - |
| 54-1 | 1 (1) | 2 (2) | - | - | - | - | - |
The predicate names a matrix of
profile records that identify the number and type of analyzer
invocations. Then it identifies the boundary condition features used.
The analyzer rankings and the predicate feature
specifications are handed over to the rules generation process.
Table 9
lists, in part, the required predicate information.
The predicate controls the overall analysis
and the details are handled by the profile
records. Each profile
record identifies which fields of performance data are
extracted, how they are manipulated, and how the experiment
profiles for the analyzer are built. The result of the
analysis is a ranking of method performance for the selected experiments.
The query posed to the database by the
profile extracts exactly the information (a partial listing of the
profile record information is given in Table 9)
needed by the analyzer to answer this question.
The complex query used for building the analyzer's input data is
determined by profile field entries for x-axis, y-axis and field
matching. In this case, the profile record builds sets of (x,y)
points for each numerical method, where the x values are grid
points and the y values are relative elapsed time changes for
mixed boundary conditions with respect to Dirichlet conditions,
changes in elapsed time for Neumann conditions with respect to Dirichlet
conditions, and relative changes in error for derivative
conditions with respect to Dirichlet conditions. In all, 6
predicates and more than a hundred profiles were used.
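As a sketch of the profile construction just described, the (x, y) points for the Dirichlet-vs-mixed comparison could be built as follows. The performance records are invented; in PYTHIA-II the matching is done through an SQL query template (dir.vs.mix) against the database rather than in application code.

```python
# Sketch of building one profile graph for the Dirichlet-vs-mixed comparison:
# x values are grid sizes, y values the relative increase in elapsed time,
# (Tmix - Tdir) / Tdir, per numerical method. Performance records are invented.
records = [
    # (method, boundary condition class, grid size, elapsed time)
    ("5PT",  "dirichlet", 100, 1.9),  ("5PT",  "mixed", 100, 2.4),
    ("5PT",  "dirichlet", 200, 8.1),  ("5PT",  "mixed", 200, 10.6),
    ("COLL", "dirichlet", 100, 3.4),  ("COLL", "mixed", 100, 3.6),
    ("COLL", "dirichlet", 200, 14.0), ("COLL", "mixed", 200, 14.9),
]

def profile_points(records, method):
    """(grid size, (Tmix - Tdir)/Tdir) pairs for one numerical method."""
    times = {(bc, grid): t for m, bc, grid, t in records if m == method}
    grids = sorted({grid for m, bc, grid, _ in records if m == method})
    return [(g, (times[("mixed", g)] - times[("dirichlet", g)])
                / times[("dirichlet", g)]) for g in grids]

for method in ("5PT", "COLL"):
    print(method, profile_points(records, method))
# 5PT  [(100, 0.263...), (200, 0.308...)]
# COLL [(100, 0.058...), (200, 0.064...)]
```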
| Record | Controlling information | Field data |
|---|---|---|
| Predicate | How many invocations of the analyzer? | 24 |
| | Profiles to be used for each invocation. | pde01_Dir-vs-Mix, pde01_Dir-vs-Neu, pde01_Mix-vs-Neu, pde02_Dir-vs-Mix, ... |
| | Items to rank. | numerical methods: DCG, DCG4, MG-00, 5PT, COLL |
| | Features to base rules on. | ElapsedTimeEffect-Dir2Mix, ElapsedTimeEffect-Dir2Neu, ... |
| Profile | Experiments used in a single analyzer run? | pde01-dirichlet, pde01-mixed, ... |
| | Matching record identifier for profile graph building. | use perfdata record and match fields: classparms = dir vs. mix, and select on numerical methods |
| | Profile graph x-axis values? | grid sizes |
| | Profile graph y-axis values? | relative increase in mixed execution elapsed time vs. Dirichlet execution elapsed time: (Tmix - Tdir) / Tdir |
| | Name of SQL query template. | dir.vs.mix |
COLL: 57.1 %
DCG4: 28.6 %
DCG2: 14.3 %
5PT : 0 %
MG-00: 0 %
COLL: 42.9 %
DCG4: 21.4 %
5PT: 14.3 %
DCG2: 14.3 %
MG-00: 7.1 %
best_method(A,hermite_collocation) :- dir2mix(A).
best_method(A,hermite_collocation) :- dir2neu(A).
Next, we consider ranking the methods for all PDE-boundary condition pairs using profile graphs involving problem size vs. elapsed time. The analysis does not consider the relative increase in execution time for different boundary condition types; it ranks all methods over all PDE problems as in Case Study 1. The analysis ranks MG-00 as the best method; it was selected 72% of the time as the fastest method over all PDE problems. The analysis also showed that all methods had the same best-to-worst ranking for a fixed PDE equation and all possible boundary conditions. In addition, these results show that some of these methods differ significantly when ranked with respect to execution times across the collection of PDE problems.
[2] S. Weerawarana, E. N. Houstis, J. R. Rice, A. Joshi and C. Houstis, (1997), PYTHIA: A knowledge based system to select scientific algorithms, ACM Trans. Math. Soft., Vol. 23, pp. 447-468.
[3] N. Ramakrishnan, J. Rice and E. N. Houstis, (2000), GAUSS: An on-line algorithm recommender system for one-dimensional numerical quadrature, ACM Trans. Math. Soft., to appear.
[4] V. S. Adve, R. Bagrodia, J. C. Browne, E. Deelman, A. Dube, E. N. Houstis, J. R. Rice, R. Sakellariou, D. Sundaram-Stukel, P. J. Teller and M. K. Vernon, (2000), POEMS: End-to-end performance of large parallel adaptive computational systems, IEEE Trans. Soft. Eng., to appear.
[5] V. S. Verykios, E. N. Houstis and J. R. Rice, (1999), Mining the Performance Of Complex Systems, ICIIS' 99, IEEE International Conference on Information, Intelligence and Systems, IEEE Computer Society Press, pp. 606-612.
[6] U.M. Fayyad, G. Piatetsky-Shapiro and P. Smyth, (1996), From data mining to knowledge discovery: an overview, Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (Eds), AAAI Press/MIT Press, pp. 1-34.
[7] V. S. Verykios, (1999), Knowledge Discovery in Scientific Databases, Ph. D. Thesis, Computer Science Department, Purdue University.
[8] V. S. Verykios, E. N. Houstis and J. R. Rice, (2000), A Knowledge Discovery Methodology for the Performance Evaluation of Scientific Software, Neural, Parallel & Scientific Computations, to appear.
[9] E. N. Houstis, J. R. Rice, S. Weerawarana, A. C. Catlin, M. Gaitatzes, P. Papachiou, and K. Wang, (1998), Parallel ELLPACK: a problem solving environment for PDE based applications on multicomputer platforms, ACM Trans. Math. Soft., Vol. 24, No. 1, pp. 30-73.
[10] N. Ramakrishnan, (1997), Recommender systems for problem solving environments. Ph. D. thesis, Dept. of Computer Sciences, Purdue University.
[11] N. Ramakrishnan, E. N. Houstis, and J. R. Rice, (1998), Recommender Systems for Problem Solving Environments, Working notes of the AAAI-98 workshop on recommender systems, H. Kautz (Ed), AAAI/MIT Press.
[12] U. Fayyad, D. Haussler and P. Stolorz, (1996), Mining scientific data, Comm. ACM, Vol. 39, No. 11, pp. 51-57.
[13] J.R. Rice, (1976), The algorithm selection problem, Advances in Computers, Vol. 15, pp. 65-118.
[14] E.N. Houstis, R.E. Lynch and J.R. Rice, (1978), Evaluation of numerical methods for elliptic partial differential equations, Journal of Comp. Physics, Vol. 27, pp. 323-350.
[15] R. F. Boisvert, J. R. Rice and E. N. Houstis, (1979), A system for performance evaluation of partial differential equations software, IEEE Trans. Soft. Eng., Vol. SE-5, No. 4, pp. 418-425.
[16] E. N. Houstis, W. F. Mitchell and T. S. Papatheodorou, (1983), Performance evaluation of algorithms for mildly nonlinear elliptic partial differential equations, Inter. J. Numer. Meth. Engin., Vol. 19, pp. 665-709.
[17] J.R. Rice, (1983), Performance analysis of 13 methods to solve the Galerkin method equations, Lin. Alg. Appl., Vol. 53, pp. 533-546.
[18] W.R. Dyksen, E.N. Houstis, R.E. Lynch and J. R. Rice, (1984), The performance of the collocation and Galerkin methods with hermite bicubics, SIAM Journal of Numerical Analysis, Vol. 21, pp. 695-715.
[19] P.K. Moore, C. Ozturan and J.E. Flaherty, (1990), Towards the automatic numerical solution of partial differential equations, Intelligent Mathematical Software Systems, North-Holland, pp. 15-22.
[20] J.R. Rice, (1990), Software performance evaluation papers in TOMS, CSD-TR-1026, Dept. Comp. Sci., Purdue University.
[21] A. Joshi, S. Weerawarana, N. Ramakrishnan, E. N. Houstis and J. R. Rice, (1996), Neuro-fuzzy support for PSEs: a step toward the automated solution of PDEs, Special Joint Issue of IEEE Computer & IEEE Computational Science and Engineering, Vol. 3, No. 1, pp. 44-56.
[22] I. Bratko and S. Muggleton, (1995), Applications of inductive logic programming, Comm. ACM, Vol. 38, No. 11, pp. 65-70.
[23] S. Dzeroski, (1996), Inductive logic programming and knowledge discovery in databases, Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (Eds), AAAI Press/MIT Press, pp. 117-152.
[24] S. Muggleton and L. De Raedt, (1994), Inductive logic programming: theory and methods, Journal of Logic Programming, Vol. 19/20, pp. 629-679.
[25] S. Muggleton, (1995), Inverse entailment and PROGOL, New Generation Computing, Vol. 13, pp. 245-286.
[26] J. R. Quinlan, (1986), Induction of decision trees, Machine Learning , Vol. 1, No. 1, pp. 81-106.
[27] M. Hollander and D. Wolfe, (1973), Non-parametric Statistical Methods, John Wiley and Sons.
[28] R. Kohavi, (1996), MLC++ developments: data mining using MLC++, Working Notes of the AAAI-96 Fall Symposium on `Learning Complex Behaviors in Adaptive Intelligent Systems', S. Kasif et al. (Eds), AAAI Press, pp. 112-123.
[29] M. Stonebraker and L. A. Rowe, (1986), The design of POSTGRES, Proceedings of the ACM-SIGMOD Conference on Management of Data, pp. 340-355.
[30] S. Muggleton and C. Feng, (1990), Efficient induction of logic programs, Proceedings of the First International Conference on Algorithmic Learning Theory, S. Arikawa, S. Goto, S. Ohsuga and T. Yokomori (Eds), Japanese Society for Artificial Intelligence, Tokyo, pp. 368-381.
[31] J. C. Giarratano, (1991), CLIPS user's guide, version 5.1, NASA Lyndon B. Johnson Space Center.
[32] C. Olston, A. Woodruff, A. Aiken, M. Chu, V. Ercegovac, M. Lin, M. Spalding and M. Stonebraker, (1998), DataSplash, Proceedings of the ACM-SIGMOD Conference on Management of Data, Seattle, Washington, pp. 550-552.