Paper consolidates the findings of three previous publications [11-13].
Made data integration from multiple sources difficult if not impossible.
Because of their value and the need for them to be easily accessible, integrated information systems are usually realised as websites.
Characteristics of data-intensive sites
- integrate information from multiple data sources
- complex structure
- increasingly detailed views of data: summary at the top, detail at lower levels
Data-intensive sites are typically hard to specify and implement. Commercial vendors and academic researchers are actively developing methods and tools for building sites.
While goals differ, most attempt to isolate the common tasks of web site development:
- choosing and accessing the data that will be displayed at the site
- designing the site's content
- designing the site's structure
- designing the visual presentation of the pages
In common practice, data-intensive sites are usually not implemented by automated tools but by groups of loosely related programs written in imperative scripting languages such as Perl, owing to their appropriateness for "gluing" together software components [24].
The scripts of many site implementations, however, interleave the code for data access and integration, page construction and HTML generation. As a result, important site-management tasks such as
- automatically updating or restructuring a site
- optimizing a site's performance based on common page-access patterns
- enforcing integrity constraints on a site's structure
are tedious to perform and difficult to automate.
The argument is that
- implementing data-intensive web sites is primarily a data-management problem
- the solution consists of three main programming tasks:
  - accessing and integrating the data available in the site
  - building the site's content and structure
  - generating the HTML representation of the pages
STRUDEL provides STRUQL for specifying the content and structure of a website and a simple template language for specifying the site's HTML representation.
Two years' experience with STRUDEL has initiated three distinct but complementary areas of research:
- STRUDEL-R [18] investigates strategies for optimizing the run-time generation of a website
- Fun-STRUDEL [14] focuses on the software-engineering problem of producing site implementations that are extensible, reusable, analyzable and optimizable.
- TIRAMISU [3] provides a declarative site-specification language that is decoupled from specific implementation tools; any implementation tools that support the common API can be used together to implement a site.
We emphasise that STRUDEL is a site-implementation tool, not an environment for Web-site design, nor is it intended for non-technical users or for development of Web-based applications. Experience has shown that STRUDEL is best suited for data-intensive, non-transactional web sites.
- rectangles depict processes
- bold terms specify inputs and outputs
Data Model
The foundation of STRUDEL is a semistructured data model. Semistructured data is characterized as
- having few type constraints
- rapidly evolving schema
- missing schema [1]
- typically modeled as a labeled, directed graph
STRUDEL's data model is a variation of the OEM data model.
A STRUDEL graph is a set of nodes/objects in which each object is either complex or atomic. A complex object is a set of (attribute, object) pairs, and an atomic object has an atomic value. Hence edges in a data graph are labeled by attributes, and leaves are labeled with atomic values.
STRUDEL atomic types: integer, float, string, date, mime-content types (URL, image, HTML and postscript).
Internal nodes have unique object identifiers (OIDs). Objects are grouped into named collections, which are referenced in queries.
Objects may belong to multiple collections; objects in the same collection may have different representations.
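The data model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Node`, `Graph`, `add`, `get` and `collect` are hypothetical, Python's `int`, `float` and `str` stand in for STRUDEL's atomic types, and the objects are made-up examples rather than data from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A complex object: an OID plus a list of (attribute, target) edges.

    A target is either another Node or an atomic Python value.
    """
    oid: str
    edges: list = field(default_factory=list)

    def add(self, label, target):
        """Attach an edge labeled `label` and return self for chaining."""
        self.edges.append((label, target))
        return self

    def get(self, label):
        """Return every target reachable over an edge with this label."""
        return [target for (lbl, target) in self.edges if lbl == label]


class Graph:
    """Named collections of objects; one object may sit in several."""

    def __init__(self):
        self.collections = {}

    def collect(self, name, node):
        self.collections.setdefault(name, []).append(node)


# Made-up example objects (assumptions, not from the paper).
graph = Graph()
paper = Node("pub1").add("title", "A sample paper").add("year", 1999)
home = Node("home1").add("name", "Ann Example").add("Paper", paper)

graph.collect("HomePages", home)
graph.collect("Publications", paper)
graph.collect("Recent", paper)  # same object, second collection
```

Note that `home` and `paper` carry different attributes even though both are ordinary objects; nothing in the sketch enforces a schema, which mirrors the "few type constraints" property.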
Graphs are stored in the STRUDEL data repository. STRUDEL provides wrappers for common data sources: relational databases, BibTeX bibliographies, flat text files, XML documents.
STRUDEL initially used a home-grown syntax for data exchange but migrated to XML. Data is exchanged between the repository and the external sources in XML.
Data mediator
The mediator supports data integration by providing a uniform view of all underlying data, irrespective of where it is stored. The mediated view, called a data graph, is specified as a STRUQL query over the data sources. Designing the mediator addressed two problems:
- whether to warehouse data from external sources or to access the external sources on demand at query time (see [22] for a comparison)
- how to specify the relationship between attributes and collections in the mediated schema and those in the data sources (see [28] for possible approaches)
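The two design questions can be made concrete with a small Python sketch. Everything here is an assumption for illustration: the wrapper functions, the `SOURCES` mapping table, the `Mediator` class and the records are hypothetical and are not STRUDEL's actual mediator, which is specified as a STRUQL query.

```python
fetch_log = []  # records each access to an external source


def bibtex_wrapper():
    # Hypothetical wrapper output for a BibTeX file (made-up record).
    fetch_log.append("bibtex")
    return [{"author": "Ann Example", "title": "On Graphs", "year": "1999"}]


def relational_wrapper():
    # Hypothetical wrapper output for a relational table (made-up record).
    fetch_log.append("relational")
    return [{"emp_name": "Ann Example", "dept": "Research"}]


# Mediated collection -> (wrapper, {mediated attribute: source attribute}).
SOURCES = {
    "Publications": (bibtex_wrapper, {"writer": "author", "title": "title"}),
    "People": (relational_wrapper, {"name": "emp_name", "unit": "dept"}),
}


class Mediator:
    """Uniform view over the sources, warehoused or fetched on demand."""

    def __init__(self, sources, warehouse):
        self.sources = sources
        # Warehousing copies every source up front; otherwise nothing is loaded.
        self.store = {n: self._load(n) for n in sources} if warehouse else None

    def _load(self, name):
        fetch, mapping = self.sources[name]
        # Rename source attributes to their mediated-schema names.
        return [{m: row[s] for m, s in mapping.items()} for row in fetch()]

    def collection(self, name):
        if self.store is not None:
            return self.store[name]   # warehoused copy, may be stale
        return self._load(name)       # on demand, fresh at query time
```

With `warehouse=True` each source is read once when the mediator is created; with `warehouse=False` the source is read again on every call to `collection`, which trades query latency for freshness.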
Query processor
STRUQL is a declarative language for querying and restructuring semistructured data. A query is applied to a data graph and gives a site graph.
Site generator
To get a browsable site, an HTML template is associated with each object in the site graph. Objects in the site graph may represent complete pages or page components. A template is usually associated with a collection of related objects. A template interleaves HTML text with STRUDEL-specific tagged expressions that access an object's attributes and format the attributes' values.
Template language is similar to other languages that separate presentation from contents [5, 29, 7].
This technique simplifies the site programmer's task: he writes plain HTML extended with simple programmatic constructs, instead of a more complex scripting program that generates HTML.
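The idea of a per-collection template can be illustrated in Python. This sketch does not use STRUDEL's template syntax; it assumes Python's standard `string.Template` placeholders (`$title`, `$year`) as a stand-in, and the template, the `render` helpers and the sample object are hypothetical.

```python
from html import escape
from string import Template

# Assumed template shared by all objects in a hypothetical collection.
PUBLICATION_TEMPLATE = Template("<li><b>$title</b> ($year)</li>")


def render(template, obj):
    """Fill the template from the object's attributes, escaping each value."""
    values = {attr: escape(str(value)) for attr, value in obj.items()}
    return template.substitute(values)


def render_collection(template, objects):
    """Apply one template to every object in a collection."""
    items = "\n".join(render(template, o) for o in objects)
    return f"<ul>\n{items}\n</ul>"


html = render(PUBLICATION_TEMPLATE, {"title": "Trees & graphs", "year": 1999})
```

The template itself stays plain HTML with a few placeholders, so changing the presentation does not touch the code that selects or restructures the data.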
where HomePages{p}, p->"Paper"->q, typeOf(q,"postscript")
collect PostscriptPages{q}
- take each object p in the collection called HomePages
- follow all edges labeled "Paper" from p to objects q
- keep those q which are of type postscript
- collect them in a collection called PostscriptPages
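The same steps can be written out imperatively in Python to show what the declarative query computes. This is a sketch under assumptions: the graph encoding (a collection dictionary, a list of labeled edges, a type table) and all object names are made up for illustration and say nothing about how STRUDEL evaluates STRUQL.

```python
# Made-up data graph: collections, labeled edges, and node types.
collections = {"HomePages": ["home1", "home2"]}
edges = [
    ("home1", "Paper", "p1.ps"),
    ("home1", "Paper", "p2.html"),
    ("home1", "Photo", "me.gif"),
    ("home2", "Paper", "p3.ps"),
]
types = {
    "p1.ps": "postscript",
    "p2.html": "html",
    "me.gif": "image",
    "p3.ps": "postscript",
}


def postscript_pages(collections, edges, types):
    """Evaluate the where clause and return the PostscriptPages collection."""
    result = []
    for p in collections["HomePages"]:            # HomePages{p}
        for src, label, q in edges:
            if src != p or label != "Paper":      # p -> "Paper" -> q
                continue
            if types.get(q) != "postscript":      # typeOf(q, "postscript")
                continue
            if q not in result:                   # collect PostscriptPages{q}
                result.append(q)
    return result


collections["PostscriptPages"] = postscript_pages(collections, edges, types)
```

The two-line query replaces the nested loops and conditionals, which is the point of using a declarative language for this step.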
- model-driven design systems
Problems with website design (modelling the site's content, specifying navigational structure, and customizing visual presentation) have been studied in the context of hypermedia systems. Autoweb [25], OOHDM [27] and Araneus [6] subscribe to a top-down methodology of website design, the purpose of which is to isolate the orthogonal tasks of site design and codify each in a meta schema:
- abstract model of site - ER/OO models
- navigation design
- presentation design
- application/physical design specifies relationship between higher level designs and the underlying applications databases
The general methodology is the same, but each system provides different tools with varying levels of automation:
- Autoweb: one tool automates each step and requires strict adherence to the design methodology
- Araneus: the Araneus data model (ADM) supports intensional description of a Web site as a graph of strictly typed page schemes. A query language defines a relational view; multiple data sources are integrated by relational queries.
- OOHDM only partially automates translation of design schemas into the scripting language CGI-Lua.
- Server side scripting languages
Include Embperl, PHP, JavaScript (Netscape), JSP, ASP, Cold Fusion. The common goal is to eliminate the details of CGI scripting and simplify the tedious development of web applications in languages like Perl, which provide few high-level programming constructs and result in code that is hard to modify and reuse. Scripts are typically plain HTML text interleaved with segments of code that are interpreted by the server. The languages are imperative, and most provide high-level features to simplify development, including: session tracking and management, access to stored objects, read-only transactional access to databases.
These languages increase a Web developer's stickiness to a particular vendor because scripts must be interpreted by the vendor's web server. Some tools include "wizard" or RAD environments.
Although they have significantly improved the process of website development, a site definition is still composed of disparate scripts that interleave presentation with content. Extracting a holistic definition of the site's content and structure from scripts would be difficult, and therefore any analysis or optimization of the implementation would be equally difficult. Is this something Webfuse's structure deals with?
XSLT provides functionality similar to STRUQL.
YAT [8] is a semistructured database-management system intended primarily for translation and integration of data in heterogeneous data sources.
- materialize the site completely before browsing
- precompute the roots and issue queries when pages are requested
[18] is an experimental study examining the optimal tradeoff between precomputation and dynamic evaluation; the authors propose several techniques for optimizing the run-time behaviour of sites and describe a framework for automatically compiling site specifications into run-time policies.
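The two generation strategies listed above can be contrasted with a short Python sketch. It is an illustration only: the link table `LINKS`, the `build_page` stand-in, and the choice to cache pages after their first request are assumptions, not the policies studied in [18].

```python
# Made-up site structure: page id -> ids of the pages it links to.
LINKS = {"root": ["a", "b"], "a": ["c"], "b": [], "c": []}
computed = []  # records which pages were actually generated


def build_page(page_id):
    """Stand-in for running the site query and template for one page."""
    computed.append(page_id)
    links = "".join(f'<a href="{t}.html">{t}</a>' for t in LINKS[page_id])
    return f"<h1>{page_id}</h1>{links}"


def materialize_all(roots):
    """Strategy 1: generate every reachable page before browsing."""
    pages, todo = {}, list(roots)
    while todo:
        page_id = todo.pop()
        if page_id not in pages:
            pages[page_id] = build_page(page_id)
            todo.extend(LINKS[page_id])
    return pages


class OnDemandSite:
    """Strategy 2: precompute only the roots, build other pages on request."""

    def __init__(self, roots):
        self.pages = {r: build_page(r) for r in roots}

    def request(self, page_id):
        # Assumed policy: compute at first request, then reuse the result.
        if page_id not in self.pages:
            self.pages[page_id] = build_page(page_id)
        return self.pages[page_id]
```

Full materialization pays the whole cost up front, including pages nobody visits; the on-demand variant defers that cost to request time, which is the tradeoff the experimental study examines.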
[3] C. Anderson, A. Levy, D. Weld, Declarative web-site management with Tiramisu, ACM SIGMOD Workshop on the Web and Databases (WebDB'99), Philadelphia, PA, June 1999
[10] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, D. Suciu, A query language for XML, Proceedings of the 8th International WWW Conference, Toronto, 1999
[16] D. Florescu, A. Levy, A. Mendelzon, Database techniques for the World-Wide Web: A survey, SIGMOD Record, 27(3), Sept 1998
P. Paolini, P. Fraternali, A conceptual model and a tool environment for developing more scalable, dynamic, and customizable web applications, Proceedings of the Conference on Extending Database Technology (EDBT), 1998