PYTHIA-II

A Knowledge/Database System
for Managing Performance Data and
Recommending Scientific Software

Elias N. Houstis, Ann Christine Catlin, John R. Rice,
Vassilios S. Verykios, Naren Ramakrishnan, Catherine E. Houstis

Abstract

Often scientists need to locate appropriate software for their problems and then select from among many alternatives. We have previously proposed an approach for dealing with this task by processing performance data of the targeted software. This approach has been tested using a customized implementation referred to as PYTHIA. This experience made us realize the complexity of the algorithmic discovery of knowledge from performance data and of the management of these data together with the discovered knowledge. To address this issue, we created PYTHIA-II --- a modular framework and system which combines a general knowledge discovery in databases (KDD) methodology and recommender system technologies to provide advice about scientific software/hardware artifacts. The functionality and effectiveness of the system are demonstrated for two existing performance studies using sets of software for solving partial differential equations. From the end-user perspective, PYTHIA-II allows users to specify the problem to be solved and their computational objectives. In turn, PYTHIA-II (i) selects the software available for the user's problem, (ii) suggests parameter values, and (iii) assesses the recommendation provided. PYTHIA-II provides all the necessary facilities to set up database schemas for testing suites and associated performance data in order to test sets of software. Moreover, it allows easy interfacing of alternative data mining and recommendation facilities. PYTHIA-II is an open-ended system implemented on public domain software and has been used for performance evaluation in several different problem domains.
Numerical Analysis: Partial Differential Equations
Information Systems: Types of Systems
Database Management: Database Applications
Artificial Intelligence: Applications and Expert Systems
Keywords: data mining, inductive logic programming, knowledge-based systems, knowledge discovery in databases, performance evaluation, recommender systems, scientific software.

Contents

Introduction
A Recommender Methodology for Scientific Software
  • Problem Features
  • Performance Evaluation
  • Reasoning and Learning Techniques for Generating Software Recommendations
    A Recommender System for Scientific Software
    Data Modeling and Management Components of PYTHIA-II
    Knowledge Discovery Components of PYTHIA-II
  • Data Generation
  • Data Mining
  • Inference Engine
  • User Interface
    Data Flow in PYTHIA-II
  • Knowledge Engineer Perspective
  • End User Perspective
    Case Study 1: Performance Effects of Singularities for Elliptic PDE Solvers
  • Performance Database Description
  • Data Mining and Knowledge Discovery Process
  • Knowledge Discovery Outcomes
    Case Study 2: The Effect of Mixed Boundary Conditions on the Performance of Numerical Methods
  • Performance Data Generation, Collection and Analysis
  • Knowledge Discovery Outcomes
    Introduction

    Complex scientific, engineering or societal problems are often solved today by utilizing libraries or some form of problem solving environments (PSEs). Most software modules are characterized by a significant number of parameters affecting efficiency and applicability that must be specified by the user. This complexity is significantly increased by the number of parameters associated with the execution environment. Furthermore, one can create many alternative solutions of the same problem by selecting different software for the various phases of the computation. Thus, the task of selecting the best software and the associated algorithmic/hardware parameters for a particular computation is often difficult and sometimes even impossible. In [1] we proposed an approach for dealing with this task by processing performance data obtained from testing software. The testing of this approach is described in [2] using the PYTHIA implementation for a specific performance evaluation study. The approach has also been tested for numerical quadrature software [3] and is being tested for parallel computer performance [4, 5]. This experience made us realize the high level of complexity involved in the algorithmic discovery of knowledge from performance data and the management of these data together with the discovered knowledge. To address the complexity issue together with scalability and portability of this approach, we present a knowledge discovery in databases (KDD) methodology [6] for testing and recommending scientific software. PYTHIA-II is a system with an open software architecture implementing the KDD methodology, which can be used to build a Recommender System (RS) for many domains of scientific software/hardware artifacts [2, 3, 7, 8]. In this paper, we describe the PYTHIA-II architecture and its use as a RS for PDE software.

    Given a problem from a known class of problems and given some performance criteria, PYTHIA-II selects the best performing software/machine pair and estimates values for the associated parameters involved. It makes recommendations by combining attribute-based elicitation of specified problems and matching them against those of a predefined dense population of similar types of problems. Dense here means that there are enough data available so that it is reasonable to expect that a good recommendation can be made. The more dense the population is, the more reliable the recommendation. We describe case studies for two sets of elliptic partial differential equations software found in PELLPACK [9].

    Figure 1. The recommender component of PYTHIA-II implemented as a web server providing advice to users.

    We now describe a sample PYTHIA-II session. Suppose that a scientist or engineer uses PYTHIA-II to find software that solves an elliptic partial differential equation (PDE). The system uses this broad categorization to direct the user to a form-based interface that requests more specific information about features of the problem and the user's performance constraints. Figure 1 illustrates a portion of this scenario where the user provides features about the operator, right side, domain, and boundary conditions - integral parts of a PDE - and specifies an execution time constraint (measured on a Sun SPARCstation 20, for instance) and an error requirement to be satisfied. Thus the user wants software that is fast and accurate; it is possible that no such software exists. The RS contacts the PYTHIA-II (web) server on the user's behalf and uses the knowledge acquired by the learning methodology presented in this paper to perform a selection from a software repository. Then the RS consults databases of performance data to determine the solver parameters, such as grid lines to use with a PDE discretizer, and estimates the time and accuracy using the recommended solver. Note that the RS does not involve the larger databases used in the KDD process; it only accesses specialized, smaller databases of knowledge distilled from the KDD process.

    The paper is organized as follows. Section 1. describes a general methodology for selecting and recommending scientific software implemented in PYTHIA-II. The architecture for a RS based on the PYTHIA-II approach is presented in Section 2. A description of the data management subsystem of PYTHIA-II is presented in Section 3. We include a database schema appropriate for building a RS for elliptic PDE software from the PELLPACK library to illustrate its use. Section 4. outlines the knowledge discovery components of PYTHIA-II. The data flow in PYTHIA-II is illustrated in Section 5. The results of applying PYTHIA-II to two case studies and comparing with earlier results from the 1980s can be found in Section 6 and Section 7.

    Section 1. A Recommender Methodology for Scientific Software

    A RS uses stored information (user preferences, performance data, artifact characteristics, cost, size, ...) about a given class of artifacts (software, music, can openers, ...) to locate and suggest artifacts of interest [10, 11, 12].

    A RS for software/hardware artifacts uses stored performance data on a population of previously encountered problems and machines to locate and suggest efficient artifacts for solving previously unseen problems. Recommendation becomes necessary when user requests or objectives cannot be properly represented as ordinary database queries. In this section, we describe the complexity of this problem, the research issues to address, and a methodology for resolving them.

    Table 1. A methodology for building a RS. This methodology is very similar to previous procedures adopted in the performance evaluation of scientific software.

    Determine evaluation objectives: Identify the computational objectives for which the performance evaluation of the selected scientific software is carried out.

    Data preparation and selection:
  • Identify the evaluation benchmark, its problem features, and experiments (i.e., the population of scientific problems used to generate performance data).
  • Identify the performance indicators to be measured.
  • Identify the actual software to be tested, along with the numerical values of their parameters.
  • Generate performance data.

    Data mining:
  • Transform the data into an analytic or summary form.
  • Model the data to suit the intended analysis and the data format required by the data mining algorithms.
  • Mine the transformed data to identify patterns or fit models to the data; this is the heart of the process.

    Analysis of results: A post-processing phase done by knowledge engineers and domain experts to ensure correctness of the results.

    Assimilation of knowledge: Create a user-friendly interface to utilize the knowledge and to identify the scientific software (with parameters) for the user's problems and computational objectives.

    The algorithm or software selection problem originated in an early paper by Rice [13]. Even for routine tasks in computational science, this problem is ill-posed and quite complicated; its difficulty stems from several factors.

    The methodology for building PYTHIA-II uses the knowledge discovery in databases (KDD) process shown in Table 1. Assuming a dense population of benchmark problems from the targeted application domain, this RS methodology uses a three-pronged strategy: feature determination of problem instances, performance evaluation of scientific software, and the automatic generation of relevant knowledge. Note that the dense population assumption can be quite challenging for many application domains. We now address each of these aspects.

    Section 1.1. Problem Features

    The applicability and efficiency of software depends significantly on the features of the targeted problem domain. Identifying appropriate problem features of the problem domain is a fundamental problem in software selection. The way problem features affect software is complex, and algorithm selection might depend in an unstable way on the features. Thus selections and performance for solving uxx + uyy = 1 and uxx + (1 + xy/10,000)uyy = 1 can be completely different. Even when a simple structure exists, the actual features specified might not properly reflect the simplicity. For example, if a good structure is based on a simple linear combination of two features f1 and f2, the use of features such as f1*cos(f2) and f2*cos(f1) might be ineffective. Furthermore, a good selection methodology might fail because the features are given inappropriate measurements or attribute-value meanings. Many attribute-value approaches (such as neural networks) routinely assign value-interpretations to numeric features (such as 1 and 5), when such values can only be interpreted in an ordinal/symbolic sense. PYTHIA-II assumes features are defined by the knowledge engineer.

    FEATURE
      name       text     feature record name
      nfeatures  integer  no. of attributes identifying this feature
      features   text[]   numeric/symbolic/textual identification
    Database schema defining a feature.

    RELATION
      name      text  relation record name
      equation  text  name of equation with these features
      feature   text  name of record identifying features
    Database schema defining a relation between a PDE equation and one of its features.

    FEATURE
      name       opLaplace
      nfeatures  1
      features   "Uxx + Uyy (+ Uzz) = f"
    Instance of a feature record.

    RELATION
      name      opLaplace pde #3
      equation  pde #3
      feature   opLaplace
    Instance of a relation record.

    The database schema and two instances from the feature and relation tables are given above. The instance records illustrate the correspondence between equation pde#3 and its feature opLaplace (the PDE is the Laplacian).
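    The correspondence above can be sketched in any relational store. The following is a minimal illustration using Python's built-in sqlite3 as a stand-in for POSTGRES95; the table and column names follow the schemas above, while the array-valued features column is flattened to plain text for the sketch.

```python
import sqlite3

# In-memory stand-in for the POSTGRES95 tables described above.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE feature (
    name      TEXT PRIMARY KEY,   -- feature record name
    nfeatures INTEGER,            -- no. of attributes identifying this feature
    features  TEXT                -- text[] in POSTGRES95; flattened here
)""")
con.execute("""CREATE TABLE relation (
    name     TEXT PRIMARY KEY,    -- relation record name
    equation TEXT,                -- name of equation with these features
    feature  TEXT REFERENCES feature(name)
)""")

# The instance records from the text: equation pde #3 is the Laplacian.
con.execute("INSERT INTO feature VALUES ('opLaplace', 1, 'Uxx + Uyy (+ Uzz) = f')")
con.execute("INSERT INTO relation VALUES ('opLaplace pde #3', 'pde #3', 'opLaplace')")

# Query: which features does pde #3 have?
rows = con.execute("""SELECT f.name, f.features
                      FROM relation r JOIN feature f ON r.feature = f.name
                      WHERE r.equation = 'pde #3'""").fetchall()
print(rows)
```

    The relationship table lets the same feature record (opLaplace) be shared by every equation having that operator.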


    Section 1.2. Performance Evaluation.

    There exist well established performance evaluation methodologies for scientific software [14, 15, 16, 17, 18, 19, 20]. While there are many important factors that contribute to the quality of numerical software, we illustrate our ideas using speed and accuracy. PYTHIA-II can handle other attributes (reliability, portability, documentation, etc.) in its data storage scheme. A similar performance evaluation methodology and attributes are needed for each application domain.

    Accuracy is measured by the norm of the difference between the computed and the true solutions or by a guaranteed error estimate. Speed is measured by the time required to execute the software in a standard execution environment. PYTHIA-II ensures that all performance evaluations are made consistently; their outputs are automatically coded into predicate logic formulas. We resort to attribute-value encodings when the situation demands it; for instance, using straight line approximations to performance profiles (e.g., accuracy vs. grid size) for solvers is useful to obtain interpolated values of grid parameters for PDE problems.
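    The straight-line profile idea can be sketched as follows. The measurements are hypothetical: for a second-order method the error behaves like C/n^2, so accuracy vs. grid size is a straight line in log-log space, and the fitted line yields interpolated grid parameters.

```python
import math

# Hypothetical measurements: grid size n vs. max error of a solver.
ns   = [10, 20, 40, 80]
errs = [1.2e-2, 3.1e-3, 7.8e-4, 1.9e-4]

# Least-squares fit of log(err) = a*log(n) + b -- the straight-line
# profile stored in place of the raw measurements.
xs = [math.log(n) for n in ns]
ys = [math.log(e) for e in errs]
xm, ym = sum(xs) / len(xs), sum(ys) / len(ys)
a = sum((x - xm) * (y - ym) for x, y in zip(xs, ys)) / sum((x - xm) ** 2 for x in xs)
b = ym - a * xm

# Interpolated grid parameter: the n needed to reach a target error.
target = 1e-3
n_needed = math.exp((math.log(target) - b) / a)
print(f"slope ~ {a:.2f}, grid for err<=1e-3 ~ {n_needed:.0f}")
```

    The fitted slope recovers the order of the method, and the inverted line answers queries such as "what grid achieves a given accuracy," which is exactly the kind of parameter advice the recommender needs.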

    Section 1.3. Reasoning and Learning Techniques for Generating Software Recommendations.

    PYTHIA-II uses a multi-modal approach by integrating different learning methods to leverage their individual strengths. We have explored and implemented two such strategies: Case-Based Reasoning (CBR) [21] and inductive logic programming (ILP) [22, 23, 24], which we describe in this section.

    CBR systems obey a lazy-learning paradigm in that learning consists solely of recording data from past experiments to help in future problem solving sessions. (This gain in simplicity of learning is offset by a more complicated process that occurs in the actual recommendation stage.) Evidence from psychology suggests that people use this approach to make judgements, using the experience gained in solving `similar' problems to devise a strategy for solving the present one. In addition, CBR systems can exploit a priori domain knowledge to perform more sophisticated analyses even if pertinent data is not present. The original PYTHIA system utilized a rudimentary form of case-based reasoning employing a characteristic-vector representation for the problem population [2].

    ILP systems, on the other hand, use an eager learning paradigm in that they attempt to construct a predicate logic formula so that all positive examples of good recommendations provided can be logically derived from the background knowledge, and no negative example can be logically derived. The advantages of this approach lie in the generality of the representation of background knowledge. Formally, the task in algorithm selection is: given a set of positive exemplars and negative exemplars of the selection mapping and a set of background knowledge, induce a definition of the selection mapping so that every positive example can be derived and no negative example can be derived. While the strict use of this definition is impractical, an approximate characterization, called the cover, is utilized which places greater emphasis on not representing the negative exemplars as opposed to representing the positive exemplars. Techniques such as relative least general generalization and inverse resolution [23] can then be applied to induce clausal definitions of the algorithm selection methodology. This forms the basis for building RS procedures using banks of selection rules.

    ILP is often prohibitively expensive and the standard practice is to restrict the hypothesis space to a proper subset of first order predicate logic. Most commercial systems (like GOLEM and PROGOL [25]) require that background knowledge be ground, meaning that only base facts can be provided as opposed to intensional information. This still renders the overall complexity exponential. In PYTHIA-II, we investigate the use of domain specific restrictions on the induction of hypotheses and analyze several strategies. First, we make syntactic and semantic restrictions on the nature of the induced methodology. For example, we require that a PDE solver should first activate a discretizer before a linear system solver (a different order of PDE solver parts does not make sense). An example of a semantic restriction is consistency checks between algorithms and their inputs. Second, we incorporate a generality ordering to guide the induction of rules and prune the search space for generating plausible hypotheses. Finally, since the software architecture of the domain specific RS has a natural database query interface, we utilize it to provide meta-level patterns for rule generation.

    PYTHIA-II also employs more restricted forms of eager learning such as the ID3 (Induction of Decision Trees) [26] system. It is a supervised learning system for top-down induction of decision trees from a set of examples and uses a greedy divide-and-conquer approach. The decision tree is a structure where: (a) every internal node is labeled with the name of one of the predicting attributes; (b) the branches from an internal node are labeled with values of the node attribute; (c) every leaf node is labeled with a class (i.e., the value of the goal attribute). The training examples are tuples, where the domain of each attribute is limited to a small number of values, either symbolic or numerical. The ID3 system uses a top-down irrevocable strategy that searches only part of the search space, guaranteeing that a simple -- but not necessarily the simplest -- tree is found.
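    The greedy attribute choice at the heart of ID3 can be sketched in a few lines. The toy training tuples and attribute names below are our own illustration, not data from the case studies: each tuple records problem features and the class of solver that performed best.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, goal="class"):
    """Expected entropy reduction from splitting on one attribute."""
    total = entropy([r[goal] for r in rows])
    split = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r[goal] for r in rows if r[attr] == v]
        split += len(subset) / len(rows) * entropy(subset)
    return total - split

# Hypothetical training tuples: which solver class wins for which problem?
rows = [
    {"operator": "laplace",   "domain": "rect",    "class": "fast-direct"},
    {"operator": "laplace",   "domain": "general", "class": "iterative"},
    {"operator": "helmholtz", "domain": "rect",    "class": "fast-direct"},
    {"operator": "general",   "domain": "general", "class": "iterative"},
]

# ID3 greedily labels the root with the attribute of maximal information gain.
best = max(["operator", "domain"], key=lambda a: info_gain(rows, a))
print(best)
```

    Here splitting on "domain" separates the classes perfectly (gain 1 bit), so it becomes the root; ID3 then recurses on each branch, which is the irrevocable, partial search mentioned above.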

    Section 2. PYTHIA-II: A Recommender System for Scientific Software.

    In this section we detail the software architecture of a domain specific RS, PYTHIA-II (see Figure 2) based on the methodology discussed above. Its design objectives include (i) modeling domain specific data into a structured representation using a database schema, (ii) providing facilities to generate specific performance data using simulation techniques, (iii) automatically collecting and storing this data, (iv) summarizing, generalizing, and discovering patterns/rules that capture the behavior of the scientific software system, and (v) incorporating them into the selected inference engine system. The system architecture has four layers:
    Figure 2. The system architecture of PYTHIA-II. The recommender component consists of the recommender system interface and the inference engine. The KDD component is the rest.



    The database layer provides permanent storage for the problem population, the performance data and problem features, and the computed statistical data. The next layer is the relational engine which supports an extended version of the SQL database query language and provides access for the upper layers. The third layer consists of three subsystems: the data generation system, the data mining system, and the inference engine. The data generation system accesses the records defining the problem population and processes them within the problem execution environment to generate performance data. The statistical data analysis and pattern extraction modules comprise the data mining subsystem. The statistical analysis module uses a non-parametric statistical method to rank the generated performance data [27]. PYTHIA-II integrates a variety of publicly available pattern extraction tools such as relational learning, attribute-value-based learning, and instance-based learning techniques [22, 28]. These tools and our integration methods are discussed in Section 4.2. Our design allows for pattern finding over diverse feature domains (nominal, ordinal, numerical, etc.).

    The graphical user interface in the top layer allows the knowledge engineer to use the system to generate knowledge as well as to query the system for facts stored in the database layer. The recommender is the end-user interface, and includes the inference engine. It uses the knowledge generated by the lower layers as an expert system to answer domain specific questions posed by end users. The architecture of PYTHIA-II is extensible, with well defined interfaces among the components of the various layers.

    Section 3. Data Modeling and Management Components of PYTHIA-II.

    PYTHIA-II needs a powerful, adaptable database and management system with an open architecture to support its data generation, data analysis, automatic knowledge acquisition and inference processes. To meet these requirements, PYTHIA-II uses POSTGRES95 [29], an object-oriented, relational DBMS (Database Management System) which supports complex objects and which can easily be extended to new application domains by providing new data types, new operators, and new access methods. It also provides facilities for active databases and inferencing capabilities, including forward and backward chaining. It supports the standard SQL language and has interfaces for C, Perl, Python, and Tcl.

    PYTHIA-II's relational data model offers an abstraction of the structure of the problem population, which is necessarily domain dependent. For example, the abstraction of a standard PDE problem includes the PDE system, the boundary conditions, the physical domain and its approximation in a grid or mesh format, etc. Each of the PDE problem specification components constitutes a separate entity set which is mapped into a separate table or relation. Interactions among entities can also be modeled by tables representing relationships. At a higher level of abstraction, we use tables for batch execution of experiments and performance data collection, aggregate statistical analysis, and data mining. The experiment table represents a large number of problems as sequences of problem components to be executed one at a time. A profile table collects sets of performance data records and profile specification information required by the analyzer. A predicate table identifies a collection of profile and feature records needed for data mining. To illustrate the data modeling and management of PYTHIA-II, we now describe an example database schema specification for a RS for elliptic PDE software from the PELLPACK library.
Throughout the remainder of this paper, we use this example to describe some aspects of the components of PYTHIA-II. The overall design of the system, however, is independent of the particular case study, and the elements of the system that are case study dependent will always be clearly indicated. In the data modeling component of PYTHIA-II, the schema specification must be modified for each domain of scientific software. The PYTHIA-II database mechanisms are independent of the application domain, but the problem population, performance measures and features do depend on the domain.

    In this sample PYTHIA-II instantiation, the problem population has 13 problem specification tables (equation, domain, bcond, grid, mesh, dec, discr, indx, solver, triple, output, parameter, option) and 21 relationship tables (equation-discr, mesh-domain, parameter-solver, etc.). Additional tables define problem features and execution related information (machine and rundata tables). In all, 44 table definitions are used for the PYTHIA-II database. Section 6 and Section 7 give some examples of these tables.

    Section 4. Knowledge Discovery Components of PYTHIA-II.

    We now describe the PYTHIA-II components in the top two layers of Figure 2.

    Section 4.1. Data Generation.

    The PYTHIA-II performance database may contain pre-existing performance measures or the data may be produced by executing scientific software using PYTHIA-II. The scientific software operates entirely as a black box except for three I/O requirements that must be met for integration into PYTHIA-II. This section describes these requirements and illustrates how the PELLPACK software satisfies them.

    First, it must be possible to define the input (i.e., the problem definition) using only information in an experiment record. The translation of an experiment into an executable program is handled by a script written for the software which extracts the necessary information from the experiment record and generates the files or drivers for the software. For PELLPACK, the experiment record is translated to a .e file, which is the PELLPACK language definition of the PDE problem, the solution scheme, and the output requirements. The script is written in Tcl and consists of about 250 lines of code. The standard PELLPACK preprocessing programs convert the .e file to a Fortran 77 driver and link the appropriate libraries to produce an executable program. The second requirement is that the software is able to operate in batch mode. In the PELLPACK case, Perl scripts are used to execute PELLPACK programs, both sequential and parallel, on any number of platforms. The programs are created and executed without manual intervention. Finally, the software must produce performance measures as output. A post-processing program must be written specifically to convert the generated output into PYTHIA-II performance records. Each program execution should insert one record into the performance database. The PELLPACK post-processing program is written in Tcl (350 lines of code) and Perl (300 lines of code).

    Data generation (program generation, program execution, data collection) may take place inside or outside of PYTHIA-II. This process is domain dependent since problem definition records, software, and output files depend on the domain.
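    The three integration requirements can be sketched as a pipeline. Everything below is a schematic stand-in, not PELLPACK code: translate() plays the role of the Tcl experiment-to-.e script, execute() the batch Perl driver, and postprocess() the output converter; the record fields are invented for illustration.

```python
def translate(experiment):
    """Requirement 1: experiment record -> program input (a '.e'-like text)."""
    return f"problem={experiment['equation']} grid={experiment['grid']}"

def execute(program_input):
    """Requirement 2: batch execution without manual intervention.
    Here a stub returning fake raw output instead of running a solver."""
    return {"time_ms": 42, "max_error": 1.3e-4, "input": program_input}

def postprocess(raw):
    """Requirement 3: raw output -> one performance record per execution."""
    return {"time_ms": raw["time_ms"], "max_error": raw["max_error"]}

# Two hypothetical experiment records drawn from the experiment table.
experiments = [
    {"equation": "pde #3", "grid": "20x20"},
    {"equation": "pde #3", "grid": "40x40"},
]

# The data generation loop: each execution inserts one performance record.
performance_db = [postprocess(execute(translate(e))) for e in experiments]
print(len(performance_db))
```

    The point of the black-box design is that only these three functions are domain specific; the loop itself belongs to PYTHIA-II.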

    Section 4.2. Data Mining.

    Data mining in PYTHIA-II is the process of extracting and filtering performance data for analysis, generating solver profiles and ranks, selecting and filtering data for pattern extraction, and generating the knowledge base. Its principal components are the statistical analysis module (analyzer) and the pattern extraction module.

      Algorithm 1 Algorithm 2 ... Algorithm k
    Problem 1 X11 X12 ... X1k
    Problem 2 X21 X22 ... X2k
          ...       ... ... ... ...
    Problem n Xn1 Xn2 ... Xnk
    Rank R1 R2 ... Rk
    Average Rank R*1 R*2 ... R*k
    Table 4. Algorithm ranking table based on Friedman rank sums using the two-way layout. The value Xij is the performance of algorithm j on problem i, and Rj and R*j are the rank sum and average rank of algorithm j.
    PYTHIA-II runs the analyzer as a separate process with a configurable input call, so various data analyzers can easily be integrated. The statistical analyzer is problem domain independent as it operates on the fixed schema of the performance records. All the problem domain information is distilled to one number measuring the performance of a program for a problem. The analyzer assigns a performance ranking to a set of algorithms applied to a problem population. It accesses the performance data using a selected predicate record which defines the complete set of analyzer results used as input for a single invocation of the rules generator. The predicate contains (1) the list of algorithms to rank, and (2) a profile matrix, where each row represents a single analyzer run and the columns identify the profile records to be accessed for that run. Table 4 illustrates the predicate's profile matrix; its columns represent algorithms and its rows represent problems as specified by a profile record. The Xij are performance values (described below) computed by the analyzer.

    PYTHIA-II currently ranks the performance of algorithms with Friedman rank sums [27]. This distribution-free ranking assumes nk data values from each of k algorithms for n problems. The analyzer can ``fill in'' missing values using various methods. The Friedman ranking proceeds as follows: within each problem (a row of Table 4), the k performance values are ranked from best to worst, with ties assigned the average of their positions; the ranks of each algorithm are summed over the n problems to give Rj; and the average ranks R*j = Rj/n order the algorithms.
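    A sketch of the standard Friedman rank-sum computation over the two-way layout of Table 4 follows; the performance matrix is hypothetical, and smaller values are taken to be better (e.g., execution times).

```python
def friedman_average_ranks(X):
    """X[i][j] = performance of algorithm j on problem i (lower is better).
    Returns the average rank R*_j of each algorithm over all n problems."""
    n, k = len(X), len(X[0])
    sums = [0.0] * k
    for row in X:
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            # Tie handling: equal values share the average of their positions.
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1          # ranks are 1-based
            for p in range(i, j + 1):
                ranks[order[p]] = avg
            i = j + 1
        for jj in range(k):                # accumulate rank sums R_j
            sums[jj] += ranks[jj]
    return [s / n for s in sums]           # average ranks R*_j

# Hypothetical execution times of 3 solvers on 4 problems.
X = [[1.0, 2.0, 3.0],
     [1.5, 1.5, 2.0],
     [0.9, 2.2, 1.1],
     [1.2, 2.5, 2.4]]
print(friedman_average_ranks(X))
```

    Because only within-problem ranks are used, the procedure is distribution-free: no assumption is made about how the raw performance values are distributed across problems.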



    The assignment of a single value, Xij, to represent the performance of an algorithm is not a simple matter. Even when comparing execution times, there are many parameters which should be varied for a serious evaluation: problem size, execution platform, number of processors (for parallel code), etc. The analyzer uses least squares approximation of the observed data to accommodate variations of problem executions. Thus, the relations between pairs of variables (e.g., time and grid size, time and number of processors) are represented linearly as seen in Figure 5 for Case Study 2. These profiles allow a query to obtain data of one variable for any value of another.

    The pattern-extraction module provides automatic knowledge acquisition (patterns/models) from the data to be used by a RS. This process is independent of the problem domain. PYTHIA-II extends the PYTHIA methodology to address the algorithm selection problem by applying various neuro-fuzzy, instance-based learning and clustering techniques. The relational model of PYTHIA-II automatically handles any amount of raw-data manipulation. It has a specific format for the data used by the pattern extraction process, and filters transform this format (on the fly) to the format required by the various data mining tools integrated into PYTHIA-II. The goal is to accumulate tools that generate knowledge in the form of logic rules, if-then-else rules or decision trees.

    PYTHIA-II first used GOLEM [30], an empirical single predicate Inductive Logic Programming (ILP) learning system. It is a batch system that implements the relative least general generalization principle. We have experimented with other learning methods, e.g., fuzzy logic or neural networks, and have not found large differences in their learning abilities. We chose ILP because it seemed to be the easiest to use in PYTHIA-II; its selection is not the result of a systematic study of the effectiveness of learning methods. PYTHIA-II is designed so the learning component can be replaced if necessary. GOLEM generates knowledge in the form of logical rules which one can model in a language like first order predicate logic. These rules can then be easily utilized as the rule base of an expert system. We have also integrated PROGOL [25], CN2, PEBLS, and OC1 (the latter three are available in the MLC++ library [28]).

    Section 4.3 Inference Engine.

    The recommender component of PYTHIA-II answers the user's questions using an inference engine and facts generated by the knowledge discovery process. It is both domain dependent and case study dependent. We describe the recommender that uses knowledge generated by GOLEM. Each GOLEM logical rule has an information compression factor f measuring its generalization accuracy. Its simple formula is f = p - (c + n + h), where p and n are the numbers of positive and negative examples covered, respectively, while c and h are related to the form of the rule. The information compression factor is used to sort the rules in decreasing order. The rules and the set of positive examples covered by each rule are passed to the recommender, which then asks the user to specify the problem features. It uses the CLIPS [31] inference engine to check for rules that match the specified features. Every rule found in this way is placed into the agenda. Rules are sorted in decreasing order based on the number of examples they cover, so the first rule covers the most examples; it fires at the end of the inference process and determines the best algorithm. The recommender then goes through the list of positive examples associated with the fired rule and retrieves the example that has the most features in common with the user's problem.
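    The final retrieval step can be sketched as a feature-overlap maximization. The feature names and example records below are hypothetical, echoing the FEATURE/RELATION records of Section 1.1; the CLIPS rule firing that precedes this step is not modeled.

```python
def best_example(user_features, covered_examples):
    """Among the positive examples covered by the fired rule, pick the
    one sharing the most features with the user's problem."""
    return max(covered_examples,
               key=lambda ex: len(user_features & ex["features"]))

# Hypothetical user problem and examples covered by the fired rule.
user = {"opLaplace", "bcDirichlet", "domRectangle"}
covered = [
    {"name": "pde #3",  "features": {"opLaplace", "bcDirichlet", "domRectangle"}},
    {"name": "pde #7",  "features": {"opLaplace", "bcMixed", "domRectangle"}},
    {"name": "pde #12", "features": {"opHelmholtz", "bcDirichlet", "domGeneral"}},
]
print(best_example(user, covered)["name"])
```

    The retrieved example then anchors the fact-base queries described next, which supply the solver parameters recorded for that example.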

    The fact base of the recommender is then processed for this example to provide the parameters for which the user needs advice. The fact base consists of all the raw performance data stored in the database. This information is accessed by queries generated on the fly, based on the user's objectives and selections. If the user's objectives cannot be met, the recommender decides what ``best'' answer to give, using weights specified by the user for each performance criterion. For the case studies in Section 6 and Section 7, the final step is the recommendation of the best PDE solver to use, along with solver parameters such as the grid needed to achieve the required solution accuracy within the given time limitations.

    Section 4.4. User Interface.

    Much of the knowledge discovery work in PYTHIA-II can be accomplished without a graphical interface. Graphical interfaces that assist in these tasks are, however, useful for knowledge engineers unfamiliar with the structure of PYTHIA-II or the POSTGRES95 SQL language. These interfaces are provided by PYTHIA-II, and the top level window used to access them is shown in Figure 3.

    Figure 3. PYTHIA-II's top level window.


    The graphical interface to the POSTGRES95 database is dbEdit. Each PYTHIA-II record type has a form that is presented when records of that type are selected for editing. Similarly, dataGEN facilitates the tasks involved in the data generation process and frees the user from worrying about details such as where the generated programs are stored, which scripts are available, where the raw output data is located, and so on. DataMINE encompasses the data analysis and knowledge discovery; these tasks must be performed inside PYTHIA-II even by experienced users. A template query is used to extract the performance data for the statistical analyzer. The query uses a profile record and may access hundreds of performance records to build the analyzer input file. The pattern matching input specification is equally difficult to build. DataMINE presents a simple menu system that walks the user through all these steps. It is integrated with DataSplash [32], an easy-to-use visual environment built on top of POSTGRES95, and therefore interacts naturally with PYTHIA-II's database.

    Section 5. Data Flow in PYTHIA-II.

    PYTHIA-II has one interface for the knowledge engineer and another for end users. We describe the data flow and I/O interfaces between the main components of PYTHIA-II from the perspective of these two interfaces.

    Section 5.1. Knowledge Engineer Perspective.

    The data flow in PYTHIA-II is shown in Figure 4, where boxes represent stored data, edges represent operations on the database, and self-edges represent external programs. The knowledge engineer begins by populating the problem database, specifying the domain in terms of the relational data model to match PYTHIA-II's database schema. Extensible and dynamic schemas are possible: POSTGRES95 does not impose the traditional relational model's restriction that the attributes of a relation be atomic (sometimes referred to as the First Normal Form (1NF) of database systems).
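    As a minimal illustration of the 1NF relaxation, the sketch below contrasts an array-valued experiment record with its strict-1NF decomposition. All field names and values are hypothetical.

```python
# Illustration of the 1NF relaxation (hypothetical field names).
# POSTGRES95 lets one record hold list-valued attributes; a strict-1NF
# design would need one flat row per (solver, grid) combination.

experiment_non1nf = {
    "name": "singular-elliptic-10",
    "solvers": ["5PT", "COLL", "DCG4"],    # non-atomic attribute
    "grids": [(5, 5), (9, 9), (17, 17)],   # non-atomic attribute
}

# Equivalent strict-1NF decomposition: one row per (solver, grid) pair.
rows_1nf = [
    {"name": experiment_non1nf["name"], "solver": s, "nx": nx, "ny": ny}
    for s in experiment_non1nf["solvers"]
    for (nx, ny) in experiment_non1nf["grids"]
]
print(len(rows_1nf))  # 3 solvers x 3 grids -> 9 rows
```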

    An experiment combines problem records into groups, and a high level problem specification is generated by a program-based transformation of the experiment record into an input file for execution. The problem execution environment invokes the appropriate scientific software to generate data. For the example instantiation referred to in Section 3 and Section 4, the execution environment consists of PELLPACK. The execution generates a number of output files, each containing performance and other information related to solving the problem. The input uses the specific schema of the problem record, and the output format is specified by a system-specific, user-selected file template. The template lists the program used to collect the required output data. These data records keep logical references (called foreign keys) to the problem definition records, so that performance can be matched with problem features by executing n-way joins during pattern extraction.
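    A sketch of this foreign-key matching, with hypothetical in-memory records standing in for the SQL n-way joins:

```python
# Sketch of matching performance data with problem features via foreign
# keys (hypothetical records; the real system issues SQL joins).

problems = {
    "pde-10-4": {"features": {"opLaplace", "rhsSingular"}},
    "pde-15-1": {"features": {"opGeneral"}},
}

performance = [
    {"problem_id": "pde-10-4", "solver": "FFT6", "time": 0.42, "error": 1.3e-5},
    {"problem_id": "pde-15-1", "solver": "COLL", "time": 0.91, "error": 2.0e-4},
]

# Join: each performance record carries a logical reference (foreign key)
# to its problem record, so measurements and features can be combined.
joined = [
    {**rec, "features": problems[rec["problem_id"]]["features"]}
    for rec in performance
]
print(joined[0]["solver"], sorted(joined[0]["features"]))
```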

    The statistical analyzer uses the performance data for ranking based on the parameter(s) selected by the user. The ranking produces an ordering of these parameters that is statistically significant (i.e., if the performance data show no significant difference between parameters, they are shown as tied in rank). A predicate record defines the collection of profile records to be used in pattern extraction and allows a knowledge engineer to change the set of input profile records as easily as updating a database record. A filter program converts data to the input format required by the pattern extraction programs. PYTHIA-II currently supports GOLEM/PROGOL, the MLC++ (Machine Learning Library in C++) library, and others. These programs generate output in the form of logic rules, if-then rules, or decision trees/graphs for categorization purposes. This process is open-ended, and tools like neural networks, genetic algorithms, fuzzy logic tool-boxes, and rough set systems can be used.
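    The tie-aware ranking can be sketched as follows; a fixed relative tolerance stands in here for the statistical significance test used by the in-house analyzer.

```python
# Sketch of rank assignment with ties (a fixed relative tolerance stands
# in for the significance test applied by the real analyzer).

def rank_with_ties(times, rel_tol=0.05):
    """Return {method: rank}; methods within rel_tol of each other tie."""
    ordered = sorted(times.items(), key=lambda kv: kv[1])
    ranks, rank, prev = {}, 0, None
    for i, (method, t) in enumerate(ordered):
        # Start a new rank only if this time is significantly worse.
        if prev is None or t > prev * (1 + rel_tol):
            rank = i + 1
        ranks[method] = rank
        prev = t
    return ranks

print(rank_with_ties({"FFT4": 0.40, "FFT6": 0.41, "COLL": 0.90}))
# FFT4 and FFT6 are statistically indistinguishable, so they tie at rank 1
```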

    Figure 4. Data flow and I/O for the knowledge engineer user interface.


    Section 5.2. End User Perspective.

    The Recommender interface must adapt to a variety of user needs. Users of a recommender system for scientific computing are most interested in questions regarding the accuracy of a solution method, the performance of a hardware system, the optimal number of processors to use on a parallel machine, how to achieve a certain accuracy while keeping the execution time under some limit, and so on. PYTHIA-II allows users to specify problem characteristics plus performance objectives or constraints. The system uses its facts to provide the best inferred solution to the problem presented. If the user's objectives cannot all be satisfied, the system tries to satisfy them in the order implied by the user's performance weights (e.g., accuracy first, then memory constraints).
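    One way to sketch this weighted fall-back is shown below. The scoring scheme is our own illustration, not the system's actual logic; the real recommender queries its fact base and applies the user-supplied weights per criterion.

```python
# Sketch of a weighted fall-back when hard objectives cannot all be met
# (hypothetical candidates and weights).

def best_candidate(candidates, objectives, weights):
    def violation(c):
        # Weighted amount by which each objective limit is exceeded.
        return sum(weights[k] * max(0.0, c[k] - limit)
                   for k, limit in objectives.items())
    # Feasible candidates score 0; otherwise minimize total violation,
    # breaking ties by elapsed time.
    return min(candidates, key=lambda c: (violation(c), c["time"]))["solver"]

candidates = [
    {"solver": "FFT4", "error": 1e-4, "time": 2.0},
    {"solver": "COLL", "error": 5e-3, "time": 0.5},
]
# Accuracy weighted much more heavily than time.
weights = {"error": 1000.0, "time": 1.0}
print(best_candidate(candidates, {"error": 1e-3, "time": 1.0}, weights))
# -> FFT4 (misses the time limit, but COLL's accuracy violation costs more)
```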

    Section 6. Case Study 1: Performance Effects of Singularities for Elliptic PDE Solvers.

    To validate PYTHIA-II and its underlying KDD process, we reconsider a performance evaluation for a population of 2-dimensional, singular, elliptic PDE problems [41]. The algorithm selection problem for this domain is

        Lu = f  in Omega,    Bu = g  on the boundary of Omega,
        subject to  error <= theta  and  elapsed time <= T,

    where L is a second order, linear elliptic operator, B is a differential operator with up to first order derivatives, Omega is a rectangle, and theta, T are performance criteria constraints.

    Section 6.1. Performance Database Description.

    In this study, PYTHIA-II collects tables of execution times and errors for each of the given solvers using various grid sizes. The error is the maximum absolute error on the grid divided by the maximum absolute value of the PDE solution. The grids considered are 5 x 5, 9 x 9, 17 x 17, 33 x 33, and 65 x 65. The PDE solvers are from PELLPACK: 5PT (5-point star), COLL (Hermite collocation), DCG2 and DCG4 (Dyakanov CG of orders 2 and 4), and FFT2, FFT4, FFT6 (FFT 9-point of orders 2, 4 and 6).

    Defining the population of 35 PDEs and the experiments required 21 equation records with up to 10 parameter sets each, 3 rectangle domain records, 5 sets of boundary conditions records, 10 grid records, several discretizer, indexing, linear solver and triple records with corresponding parameters, and a set of 40 solver sequence records. Using these components, 37 experiments were specified, each defining a collection of PDE programs involving up to 35 solver sequences for a given PDE problem. Examples of these records are given in Section 3. The 37 experiments were executed on a SPARCstation20 with 32MB memory running Solaris 2.5.1 from within PYTHIA-II's execution environment. Over 500 performance records were created.
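    The error measure used above (maximum absolute grid error normalized by the maximum absolute value of the solution) can be sketched as:

```python
# The study's error measure: maximum absolute error on the grid,
# normalized by the maximum absolute value of the true solution.
# The sample values below are made up for illustration.

def normalized_max_error(computed, exact):
    """computed, exact: solution values on the same grid (flat lists)."""
    max_err = max(abs(c - e) for c, e in zip(computed, exact))
    max_sol = max(abs(e) for e in exact)
    return max_err / max_sol

u_exact = [0.0, 0.5, 1.0, 0.5, 0.0]
u_h     = [0.0, 0.49, 1.02, 0.5, 0.01]
print(round(normalized_max_error(u_h, u_exact), 6))  # 0.02
```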

    Table 5. The PYTHIA-II process applied to Case Study 1.

    Phase: Determine evaluation objectives
    Description: Evaluate the efficiency and accuracy of a set of solution methods and their associated parameters with respect to elapsed time, error and problem size.
    Implementation: Manual

    Phase: Data preparation
    Description: selection; pre-processing; problem population; measures (elapsed time, solver time, discretization error); methods; generate performance data.
    Implementation: POSTGRES95, SQL, Tcl/Tk, PERL

    Phase: Data mining
    Description: Collect the data for error and time across all solvers and grid sizes. Use the method of least squares to develop linear approximations of time vs. error across all grid sizes. Develop profiles of the methods for all problems, and rank the methods. Use the rankings and the problem features to identify patterns and generate rules.
    Implementation: Tcl/Tk, PERL, in-house statistical software, PROGOL

    Phase: Analysis of results
    Description: Domain experts ensure correctness of the results.
    Implementation: Manual

    Phase: Assimilation of knowledge
    Description: Create an intelligent interface to utilize the knowledge to identify the ``best method'' with associated parameters for the user's problems and computational objectives.
    Implementation: CLIPS


    Table 6. Features for the problem population of the benchmark case study.

    Equation:
      first tier operator: Laplace, Poisson, Helmholtz, self-adjoint, general
      second tier operator: analytic, entire, constant coefficients
      operator smoothness tier: constant, entire, analytic
      right-hand-side tier: entire, analytic, singular (infinite), singular derivatives, constant coefficients, nearly singular, peaked, oscillatory, homogeneous, computationally complex
      right-hand-side smoothness tier: constant, entire, analytic, computationally complex, singular, oscillatory, peaked

    Domain:
      unit square
      [a,b] x [a+x,b+x], where x can vary
      [a,b] x [a+c,b+c], where c is a constant

    Boundary Conditions:
      U = 0 on all boundaries
      AU = f on all boundaries
      BUn = f on some boundaries
      AU + BUn = f on some boundaries
      constant coefficients, non-constant coefficients

    Section 6.2. Data Mining and Knowledge Discovery Process.

    When the executions finished, the performance database was complete. The dataMINE interface was used to access it, with the predicate and profile records created for the case study. For example, the rankings produced by the analyzer for PDE problem 10-4 are:

    1. FFT6       2. FFT4       3. DCG4       4. FFT2       5. COLL       6. DCG2       7. 5PT      


    The frequency with which each solver was ranked best for these 35 PDEs is

    FFT4: 27.0 %
    COLL: 21.6 %
    5PT : 18.9 %
    DCG4 : 13.5 %
    FFT6 : 10.8 %
    DCG2 : 5.4 %
    FFT2 : 2.7 %


    Note that some solvers are not applicable to many of the PDEs. These rankings over all PDE problems, together with their associated features (see Table 6), were then used to mine rules. Examples of these rules are shown below. The first rule indicates that the method Dyakanov CG4 is best if the problem has a Laplace operator and the right-hand side is singular.
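    The best-solver frequencies above are simple counts over the per-problem rankings. The sketch below shows the computation on a toy subset of the rankings, not the study's full table:

```python
# Recomputing best-solver frequencies from per-problem rankings
# (the three rankings below are a toy subset for illustration).
from collections import Counter

rankings = {
    "10-4": ["FFT6", "FFT4", "DCG4", "FFT2", "COLL", "DCG2", "5PT"],
    "15-1": ["COLL", "5PT"],   # only two solvers applicable here
    "20-2": ["5PT", "COLL"],
}

best = Counter(order[0] for order in rankings.values())
total = len(rankings)
for solver, wins in best.most_common():
    print(f"{solver}: {100 * wins / total:.1f} %")
```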

    Table 7. Example rules from Case Study 1
    Best Method For problems with these features
    best_method(A,dyakanov-cg4) :- opLaplace_yes(A), rhsSingular_yes(A)
    best_method(A,fft_9_point_order_4) :- opHelmholtz_yes(A), pdePeaked_no(A)
    best_method(A,fft_9_point_order_4) :- solVarSmooth_yes(A), solSmoSingular_no(A)
    best_method(A,fft_9_point_order_2) :- solSingular_no(A), solSmoSingDeriv_yes(A)
    best_method(A,fft_9_point_order_6) :- opLaplace_yes(A), rhsSingular_no(A), rhsConstCoeff_no(A), rhsNearlySingular_no(A), rhsPeaked_no(A)
    best_method(A,fft_9_point_order_6) :- pdeSmoConst_yes(A), rhsSmoDiscDeriv_yes(A)
    best_method(A,dyakanov-cg4) :- opSelfAdjoint_yes(A), rhsConstCoeff_no(A)
    best_method(A,dyakanov-cg4) :- pdeJump_yes(A)
    best_method(A,dyakanov-cg) :- pdeSmoConst_yes(A), rhsSmoDiscDeriv_yes(A)
    best_method(A,hermite_collocation) :- opGeneral_yes(A)
    best_method(A,hermite_collocation) :- pdePeaked_yes(A)



    Table 8. A listing of the rankings generated by PYTHIA-II and, in parentheses, the subjective rankings reported in [45].
    PDE 5PT COLL DCG2 DCG4 FFT2 FFT4 FFT6
    3-1 7 (7) 6 (4) 4 (6) 5 (5) 2 (2) 1 (3) 3 (1)
    3-2 6 (6) 7 (7) 1 (5) 3 (3) 4 (4) 2 (2) 5 (1)
    7-1 7 (7) 6 (3) 2 (5) 3 (5) 1 (4) 4 (1) 5 (1)
    8-2 7 (7) 6 (5) 1 (4) 5 (6) 2 (2) 4 (3) 3 (1)
    9-1 6 (6) 5 (5) 3 (4) 2 (3) 4 (2) 1 (1) -
    9-2 6 (6) 5 (5) 3 (3) 4 (4) 2 (2) 1 (1) -
    9-3 6 (6) 4 (5) 5 (3) 3 (3) 2 (2) 1 (1) -
    10-2 6 (6) 7 (7) 5 (5) 2 (4) 3 (3) 1 (2) 4 (1)
    10-3 6 (6) 7 (7) 5 (5) 3 (4) 4 (3) 2 (2) 1 (1)
    10-4 7 (5) 5 (7) 6 (4) 3 (6) 4 (3) 2 (2) 1 (1)
    10-7 6 (6) 5 (7) 4 (5) 3 (3) 7 (3) 1 (2) 2 (1)
    11-2 7 (7) 6 (6) 5 (5) 1 (3) 2 (3) 3 (2) 4 (1)
    11-3 7 (6) 6 (6) 4 (5) 3 (4) 5 (3) 1 (2) 2 (1)
    11-4 6 (6) 7 (7) 5 (5) 3 (4) 4 (3) 2 (2) 1 (1)
    11-5 6 (6) 7 (7) 5 (4) 3 (4) 4 (3) 2 (2) 1 (1)
    13-1 3 (3) 4 (4) 2 (1) 1 (1) - - -
    15-1 2 (2) 1 (1) - - - - -
    15-2 2 (2) 1 (1) - - - - -
    17-1 7 (7) 6 (6) 3 (5) 1 (3) 2 (4) 4 (2) 5 (1)
    17-2 6 (6) 7 (7) 4 (4) 2 (5) 5 (3) 1 (2) 3 (1)
    17-3 6 (6) 7 (7) 4 (4) 5 (5) 3 (3) 2 (2) 1 (1)
    20-1 2 (2) 1 (1) - - - - -
    20-2 1 (1) 2 (2) - - - - -
    28-2 3 (2) - 1 (1) 2 (2) - - -
    30-4 1 (1) 2 (2) - - (2) - - -
    30-8 2 (2) 1 (1) - - - - -
    34-1 4 (4) 3 (2) 2 (2) 1 (1) - - -
    35-1 4 (4) 3 (2) 2 (2) 1 (1) - - -
    36-2 2 (1) 1 (1) - - - - -
    39-2 1 (1) 2 (2) - - - - -
    39-4 2 (2) 1 (1) - - - - -
    44-2 2 (2) 1 (1) - - - - -
    44-3 2 (2) 1 (1) - - - - -
    47-2 6 (6) 6 (6) 3 (5) 2 (4) 4 (3) 1 (2) 5 (1)
    49-3 2 (2) 1 (1) - - - - -
    51-1 1 (1) 2 (2) - - - - -
    54-1 1 (1) 2 (2) - - - - -

    Section 6.3. Knowledge Discovery Outcomes.

    The rules discovered confirm the assertion (established by statistical methods) in [41] that higher order methods are better for elliptic PDEs with singularities. They also confirm the general hypothesis that there is a strong correlation between the order of a method and its efficiency. More importantly, the rules impose an ordering of the various solvers for each of the problems considered in this study. Interestingly, this ranking corresponds closely with the subjective rankings published earlier (see Table 8). This shows that these simple rules capture much of the complexity of algorithm selection in this domain.
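    One way to quantify how closely the generated rankings track the subjective ones is Spearman's rank correlation. The sketch below applies it to row 11-4 of Table 8; the choice of correlation measure is our illustration, not part of the original analysis.

```python
# Spearman rank correlation between PYTHIA-II's ranking and the
# subjective ranking for one row of Table 8 (row 11-4).

def spearman(r1, r2):
    """Spearman correlation for two rankings of the same n items
    (no tie correction)."""
    n = len(r1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Row 11-4: ranks for (5PT, COLL, DCG2, DCG4, FFT2, FFT4, FFT6).
pythia     = [6, 7, 5, 3, 4, 2, 1]
subjective = [6, 7, 5, 4, 3, 2, 1]
print(spearman(pythia, subjective))  # about 0.964: very close agreement
```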

    Section 7. Case Study 2: The Effect of Mixed Boundary Conditions on the Performance of Numerical Methods.

    We apply PYTHIA-II to analyze the effect of different boundary condition types on the performance of elliptic PDE solvers considered in the study of [42]. The PDEs for this performance evaluation are of the form



    The parameters alpha and beta determine the strength of the derivative term. The coefficients and right hand sides a, c, d, e, f, g, s and t are functions of x and y, and Omega is a rectangle. The numerical methods considered are the modules (5PT, COLL, DCG2, DCG4) listed in Section 6 plus MG-00 (Multigrid mg00). The boundary condition types considered are Dirichlet, Neumann, and mixed. Every PDE equation is paired with all three boundary condition types and is thus associated with three experiments. Each experiment consists of a problem defined by the PDE equation and a boundary condition, which is solved by the five methods using five uniform grids, so there are 75 program executions for each PDE. Performance data on elapsed solver time and various error measures are collected.

    Section 7.1. Performance Data Generation, Collection and Analysis.

    The PYTHIA-II database records (equations, domains, boundary-conditions, parameters, modules, solver-sequences and experiments) are defined using dbEdit, and the PDE programs are built and executed with PYTHIA-II's dataGEN and the PELLPACK problem execution environment. All experiments were executed on a SPARCstation 20 running SunOS 5.5.1 with 32 MB memory. About 600 records were inserted into the performance database. The statistical analysis and rule generation are handled by dataMINE using the appropriate predicate and profile records, which identify all parameters controlling these tasks.

    The predicate names a matrix of profile records that identifies the number and type of analyzer invocations; it then identifies the boundary condition features used. The analyzer rankings and the predicate feature specifications are handed over to the rules generation process. Table 9 lists, in part, the required predicate information. The predicate controls the overall analysis, and the details are handled by the profile records. Each profile record identifies which fields of performance data are extracted, how they are manipulated, and how the experiment profiles for the analyzer are built. The result of the analysis is a ranking of method performance for the selected experiments. The query posed to the database by the profile extracts exactly the information needed by the analyzer to answer this question (a partial listing of the profile record information is given in Table 9). The complex query used to build the analyzer's input data is determined by the profile field entries for x-axis, y-axis and field matching. In this case, the profile records build sets of (x, y) points for each numerical method, where the x values are grid sizes and the y values are, respectively, the relative changes in elapsed time for mixed boundary conditions with respect to Dirichlet conditions, the relative changes in elapsed time for Neumann conditions with respect to Dirichlet conditions, and the relative changes in error for derivative conditions with respect to Dirichlet conditions. In all, 6 predicates and more than a hundred profiles were used.
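    Building one such profile can be sketched as follows. The timings below are hypothetical; the real values come from the SQL query template named in the profile record.

```python
# Sketch of one (x, y) profile: relative elapsed-time increase of mixed
# over Dirichlet executions as a function of grid size, i.e.
# y = (Tmix - Tdir) / Tdir  (hypothetical timings).

t_dirichlet = {5: 0.10, 9: 0.30, 17: 1.10, 33: 4.50}
t_mixed     = {5: 0.12, 9: 0.36, 17: 1.43, 33: 6.30}

profile = [(grid, (t_mixed[grid] - t_dir) / t_dir)
           for grid, t_dir in sorted(t_dirichlet.items())]
for grid, rel in profile:
    print(f"{grid:>3} x {grid:<3} {rel:+.2f}")
```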


    Table 9. Sample predicate and profile information for the relative elapsed times analysis for mixed vs. Dirichlet problem executions.
    Record Controlling information Field data
    Predicate How many invocations of the analyzer? 24
    Profiles to be used for each invocation. pde01_Dir-vs-Mix,
    pde01_Dir-vs-Neu,
    pde01_Mix-vs-Neu,
    pde02_Dir-vs-Mix,
    ...
    Items to rank. numerical methods : DCG2, DCG4, MG-00, 5PT, COLL
    Features to base rules on. ElapsedTimeEffect-Dir2Mix,
    ElapsedTimeEffect-Dir2Neu,
    ...
    Profile Experiments used in a single analyzer run? pde01-dirichlet,
    pde01-mixed,
    ...
    Matching record identifier for profile graph building. use perfdata record and match fields: classparms = dir vs. mix
    and select on numerical methods
    Profile graph x-axis values? grid sizes
    Profile graph y-axis values? relative increase in mixed
    execution elapsed time vs Dirichlet
    execution elapsed time : (Tmix - Tdir)   /   Tdir
    Name of SQL query template. dir.vs.mix

    Section 7.2. Knowledge Discovery Outcomes.

    The rules derived in Case Study 2 are consistent with the hypothesis and conclusions stated in [42]. For the analysis, we use rankings based on the relative elapsed time profiles described above.

    Figure 5. Profile graph depicting the relative change of execution times between the Dirichlet and Mixed problems as a function of grid size for the five PDE solvers considered.

    Next, we consider ranking the methods for all PDE-boundary condition pairs using profile graphs of problem size vs. elapsed time. This analysis does not consider the relative increase in execution time across boundary condition types; it ranks all methods over all PDE problems, as in Case Study 1. The analysis ranks MG-00 as the best method: it was selected as the fastest 72% of the time over all PDE problems. The analysis also showed that the methods retain the same best-to-worst ranking for a fixed PDE equation across all boundary condition types. In addition, the results show that some of these methods differ significantly when ranked by execution time across the collection of PDE problems.


    Bibliography


    [1] C. E. Houstis, E. N. Houstis, J. R. Rice, P. Varadaglou and T. S. Papatheodorou, (1991), Athena: a knowledge based system for //ELLPACK, Symbolic--Numeric Data Analysis and Learning, Eds, Diday, E. and Lechavallier, Y., Nova Science, pp.459-467.

    [2] S. Weerawarana, E. N. Houstis, J. R. Rice, A. Joshi and C. Houstis, (1997), PYTHIA: A knowledge based system to select scientific algorithms, ACM Trans. Math. Soft., Vol. 23, pp. 447-468.

    [3] N. Ramakrishnan, J. Rice and E. N. Houstis, (2000), GAUSS: An on-line algorithm recommender system for one-dimensional numerical quadrature, ACM Trans. Math. Soft., to appear.

    [4] V. S. Adve, R. Bagrodia, J. C. Brown, E. Deelman, A. Dube, E. N. Houstis, J. R. Rice, R. Sakellariou and D. Surdaram-Stukel, P. J. Teller and M. K. Vernon, (2000), POEMS: End-to-end performance of large parallel adaptive computational systems, IEEE Trans. Soft. Eng., to appear.

    [5] V. S. Verykios, E. N. Houstis and J. R. Rice, (1999), Mining the Performance Of Complex Systems, ICIIS' 99, IEEE International Conference on Information, Intelligence and Systems, IEEE Computer Society Press, pp. 606-612.

    [6] U.M. Fayyad, G. Piatetsky-Shapiro and P. Smyth, (1996), From data mining to knowledge discovery: an overview, Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (Eds), AAI Press/MIT Press, pp. 1-34.

    [7] V. S. Verykios, (1999), Knowledge Discovery in Scientific Databases, Ph. D. Thesis, Computer Science Department, Purdue University.

    [8] V. S. Verykios, E. N. Houstis and J. R. Rice, (2000), A Knowledge Discovery Methodology for the Performance Evaluation of Scientific Software, Neural, Parallel \& Scientific Computations, to appear.

    [9] E. N. Houstis, J. R. Rice, S. Weerawarana, A. C. Catlin, M. Gaitatzes, P. Papachiou, and K. Wang, (1998), Parallel ELLPACK: a problem solving environment for PDE based applications on multicomputer platforms, ACM Trans. Math. Soft., Vol. 24, No. 1, pp. 30-73.

    [10] N. Ramakrishnan, (1997), Recommender systems for problem solving environments. Ph. D. thesis, Dept. of Computer Sciences, Purdue University.

    [11] N. Ramakrishnan, E. N. Houstis, and J. R. Rice, (1998), Recommender Systems for Problem Solving Environments, Working notes of the AAAI-98 workshop on recommender systems, H. Kautz (Ed), AAAI/MIT Press.

    [12] U. Fayyad, D. Haussler and P. Stolorz, (1996), Mining scientific data, Comm. ACM, Vol. 39, No. 11, pp. 51-57.

    [13] J.R. Rice, (1976), The algorithm selection problem, Advances in Computers, Vol. 15, pp. 65-118.

    [14] E.N. Houstis, R.E. Lynch and J.R. Rice, (1978), Evaluation of numerical methods for elliptic partial differential equations, Journal of Comp. Physics, Vol. 27, pp. 323-350.

    [15] Ronald F. Boisvert, John R. Rice and Elias N. Houstis, (1979), A system for performance evaluation of partial differential equations software, IEEE Trans. Soft. Eng., Vol. SE-5, No. 4, pp. 418-425.

    [16] E. N. Houstis, W.F. Mitchell and T.S. Papatheodorou, (1983), Performance evaluation of algorithms for mildly nonlinear elliptic partial differential equations, Inter. J. Numer. Meth. Engin., Vol. 19, pp. 665-709.

    [17] J.R. Rice, (1983), Performance analysis of 13 methods to solve the Galerkin method equations, Lin. Alg. Appl., Vol. 53, pp. 533-546.

    [18] W.R. Dyksen, E.N. Houstis, R.E. Lynch and J. R. Rice, (1984), The performance of the collocation and Galerkin methods with hermite bicubics, SIAM Journal of Numerical Analysis, Vol. 21, pp. 695-715.

    [19] P.K. Moore, C. Ozturan and J.E. Flaherty, (1990), Towards the automatic numerical solution of partial differential equations, Intelligent Mathematical Software Systems, North-Holland, pp. 15-22.

    [20] J.R. Rice, (1990), Software performance evaluation papers in TOMS CSD-TR-1026. Dept. Comp. Sci., Purdue University.

    [21] A. Joshi, S. Weerawarana, N. Ramakrishnan, E. N. Houstis, and J. R. Rice, (1996), Neuro--fuzzy support for PSEs: a step toward the automated solution of PDEs, Special Joint Issue of IEEE Computer & IEEE Computational Science and Engineering, Vol. 3, No. 1, pp. 44--56.

    [22] I. Bratko and S. Muggleton, (1995), Applications of inductive logic programming, Comm. ACM, Vol. 38, No. 11, pp. 65-70.

    [23] Saso Dzeroski, (1996), Inductive logic programming and knowledge discovery in databases , Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (Eds), AAAI Press/MIT Press, pp. 117-152.

    [24] S. Muggleton and L. De Raedt, (1994), Inductive logic programming: theory and methods, Journal of Logic Programming, Vol. 19, No. 20, pp. 629-679.

    [25] S. Muggleton, (1995), Inverse entailment and PROGOL, New Generation Computing, Vol. 13, pp. 245-286.

    [26] J. R. Quinlan, (1986), Induction of decision trees, Machine Learning , Vol. 1, No. 1, pp. 81-106.

    [27] M. Hollander and D. Wolfe, (1973), Non-parametric Statistical Methods, John Wiley and Sons.

    [28] R. Kohavi, (1996), MLC++ developments: data mining using MLC++, Working Notes of the AAAI-96 Fall Symposia on `Learning Complex Behaviors in Adaptive Intelligent Systems', Kasif, S. et al. (Eds), AAAI Press, pp. 112-123.

    [29] M. Stonebraker and L. A. Rowe, (1986), The design of POSTGRES, Proceedings of the ACM-SIGMOD Conference on Management of Data, pp. 340-355.

    [30] S. Muggleton and C. Feng, (1990), Efficient induction of logic programs, Proceedings of the First International Conference on Algorithmic Learning Theory, S. Arikawa, S. Goto, S. Ohsuga, and T. Yokomori (Eds), Japanese Society for Artificial Intelligence, Tokyo, pp. 368-381.

    [31] J. C. Giarratano, (1991), CLIPS user's guide, version 5.1, NASA Lyndon B. Johnson Space Center.

    [32] C. Olston, A. Woodruff, A. Aiken, M. Chu, V. Ercegovac, M. Lin, M. Spalding and M. Stonebraker, (1998), DataSplash, Proceedings of the ACM-SIGMOD Conference on Management of Data, Seattle, Washington, pp. 550-552.


    This work was supported in part by NSF grant CDA 91-23502, PRF 6902851, DARPA grant N66001-97-C-8533 (Navy), DOE LG-6982, DARPA under ARO grant DAAH04-94-G-0010, and the Purdue Research Foundation.