Databases on the Web
1.1 What are the benefits of databases on the Web?
- dynamic content
- large amount of data
- separation of content from layout
- efficient data collection and maintenance
- interfaces to legacy systems
1.2 Types of Web Database Tools
- Extensions to existing database tools
- HTML editors with database capabilities
- Web Database Application languages (eg. ASP, ColdFusion)
- Server-side web languages with database capabilities (eg. PHP, Perl, Java)
1.3 More details about Web Database Tools
Open Source
- (PHP or Perl or Python) + (MySQL or Postgres)
- available on most operating systems
- scripting languages with database interfaces (DBIs)
- MySQL does not support full SQL
- Postgres is object-relational
- for larger databases: performance tuning may be a challenge
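As a rough illustration of the "scripting language + database interface" combination listed above, the following sketch uses Python's standard DB-API with SQLite standing in for MySQL or Postgres. The table, its contents and the function names are assumptions made up for this example.

```python
import sqlite3
from html import escape


def make_demo_db():
    """Create an in-memory database with a hypothetical 'articles' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (title TEXT, author TEXT)")
    conn.executemany(
        "INSERT INTO articles VALUES (?, ?)",
        [("Deep Web", "Lee"), ("Caching <intro>", "Kim"), ("State", "Lee")],
    )
    conn.commit()
    return conn


def articles_by_author(conn, author):
    """Query the database and build an HTML list from the result rows."""
    # The '?' placeholder lets the driver quote the value safely.
    cur = conn.execute(
        "SELECT title FROM articles WHERE author = ? ORDER BY title", (author,)
    )
    # escape() keeps database content from being interpreted as markup.
    items = "".join(f"<li>{escape(title)}</li>" for (title,) in cur)
    return f"<ul>{items}</ul>"


if __name__ == "__main__":
    print(articles_by_author(make_demo_db(), "Lee"))
```

With a MySQL or Postgres driver the calls would look much the same, although the placeholder syntax and connection arguments differ between drivers.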
Microsoft
- ASP
- connection to ODBC databases via ActiveX Data Objects (ADO)
- authoring tools: Drumbeat, Microsoft Visual InterDev
- SQL server
- enterprise level database product
- easily integrated with ASP and other MS tools
ColdFusion
- uses CFML which integrates with HTML and XML
- built-in support for Oracle
- claims to have "scalable deployment and high performance"
Reading:
Comparison of Dynamic Web Content Technologies
(contains code examples for database connectivity.)
1.4 Typical Applications of databases on the web
- Dynamic publishing (eg. newspapers)
- template driven
- database stores content
- consistent layout
- Web Portals/Search Engines/On-line catalogs
- form interface
- ranked list of results
- Boolean query language with wildcards
- But not SQL!!!
- interfaces to legacy databases
- E-commerce
- shopping carts
- credit card transactions
- customer databases
- On-line collaborative environments (mostly unstructured text)
- storage of large amounts of user-generated content
- eg. guestbooks, discussion groups, Wikis
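The portal and search-engine item above mentions a Boolean query language with wildcards rather than SQL. A minimal sketch of such a language is given below. The query syntax (AND, OR, NOT and `*`), the function names and the sample documents are all assumptions for illustration, and no ranking of results is attempted.

```python
from fnmatch import fnmatchcase


def term_matches(pattern, words):
    """True if any word matches the pattern; '*' acts as a wildcard."""
    return any(fnmatchcase(word, pattern) for word in words)


def matches(query, text):
    """Evaluate a query of the form 'a AND NOT b* OR c' against a text.

    OR separates alternative groups; inside a group all AND terms must hold.
    """
    words = text.lower().split()
    for group in query.lower().split(" or "):
        group_ok = True
        for term in group.split(" and "):
            term = term.strip()
            negated = term.startswith("not ")
            if negated:
                term = term[4:].strip()
            # A plain term must match; a negated term must not match.
            if term_matches(term, words) == negated:
                group_ok = False
                break
        if group_ok:
            return True
    return False


def search(query, docs):
    """Return the names of all documents that satisfy the query."""
    return [name for name, text in docs.items() if matches(query, text)]


if __name__ == "__main__":
    docs = {
        "a": "web databases store data",
        "b": "static web pages",
        "c": "wiki discussion groups",
    }
    print(search("web AND data*", docs))
    print(search("web AND NOT data* OR wiki*", docs))
```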
2.1 Challenges: Dynamic Content
Dynamic content can be generated ...
- when page is requested or
- at regular intervals, with the result stored as temporary static pages
Dynamic content causes problems for
- indexes (search engines)
- caching (by browser or proxy server)
- HTTP headers allow an expiration time to be set for caching
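The caching headers mentioned above could be produced along the following lines. This is only a sketch: `cache_headers` is a hypothetical helper, and treating a non-positive lifetime as "do not cache" is an assumption of this example. `Cache-Control` and `Expires` are the standard HTTP header names.

```python
import time
from email.utils import formatdate


def cache_headers(max_age_seconds, now):
    """Build response headers telling browsers and proxies how long to cache.

    'now' is a Unix timestamp; it is passed in so the result is reproducible.
    """
    if max_age_seconds <= 0:
        # Content that changes on every request should not be stored at all.
        return {"Cache-Control": "no-store"}
    return {
        "Cache-Control": f"max-age={max_age_seconds}",
        # Expires carries the same information as an absolute HTTP date.
        "Expires": formatdate(now + max_age_seconds, usegmt=True),
    }


if __name__ == "__main__":
    # A page regenerated every ten minutes can be cached for that long.
    print(cache_headers(600, time.time()))
```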
2.1.1 Semantic Caching
- each new query is compared to cache
- queries are evaluated with respect to semantic similarity
- in client-server database architectures:
- components are full-fledged DBs
- example: a user asks for "A and B", the cache contains "A and B and C",
so the system only searches for "(A and B) and not C".
- in web database architectures:
- components are not full-fledged DBs because they do not support full SQL
- new solutions are required!
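The "A and B" example above can be sketched with sets. Here a query is a set of terms that must all be present, `DOCS` is a made-up document collection, and `run_remote` stands in for the expensive search on the server. All of these names are assumptions for illustration.

```python
# Hypothetical collection: document id -> terms contained in the document.
DOCS = {
    1: {"a", "b", "c"},
    2: {"a", "b"},
    3: {"a", "c"},
    4: {"a", "b", "d"},
}


def run_remote(required, not_all_of=frozenset()):
    """Stand-in for the server search: documents containing every required
    term, excluding those that also contain all terms in 'not_all_of'."""
    return {
        doc_id
        for doc_id, terms in DOCS.items()
        if required <= terms and not (not_all_of and not_all_of <= terms)
    }


def answer(query, cache):
    """Answer a conjunctive query, reusing a cached narrower query if any."""
    query = frozenset(query)
    for cached_query, cached_ids in cache.items():
        if query <= cached_query:
            extra = cached_query - query
            if not extra:
                return set(cached_ids)  # exact hit, no search needed
            # Cached documents already satisfy the query; only the
            # remainder "(query) and not (extra terms)" must be searched.
            return cached_ids | run_remote(query, not_all_of=extra)
    result = run_remote(query)
    cache[query] = result
    return result


if __name__ == "__main__":
    cache = {frozenset({"a", "b", "c"}): {1}}
    print(answer({"a", "b"}, cache))
```

The sketch only handles the case where the cached query is narrower than the new one. Comparing arbitrary queries for overlap is the harder part of semantic caching.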
2.2 The "Deep Web" ("hidden" or "invisible")
consists of
- the content of databases accessible on the web
- content is accessible "only by query"
- possibly 500 times more content than on the normal (surface) web
- content is not spidered by search engines
- but links to deep content are spidered
- deep content of newspaper sites is usually spidered
- some search engines specialise in deep content
2.2.1 Searchable Web Sources
- site search: searches surface pages of a site. Not part of Deep Web.
- text database: searches unstructured text documents of a text collection.
- structured database: Deep Web content that could be spidered/mined.
- Examples: shopping catalogs, on-line bookstores, airline booking sites.
- Categories: business, computers, shopping, travel
- browsable interfaces for deep content
- possible for static content (eg. Amazon, Yahoo)
- impossible for highly dynamic content (eg. airline booking sites)
2.2.2 Complexity of Structured Web Databases
Research shows that schema vocabularies of structured web databases
tend to cluster and interlink, revealing hidden structures. (I.e.,
similar databases tend to use similar database schemas.)
Constraint patterns of structured web databases are fairly uniform
across domains. (I.e., similar types of queries can be asked in all
structured web databases.)
Reading: Structured Databases on the Web.
2.3 The "Semantic Web"
- metadata facilitates description of context/meaning/semantics of data
- XML
- RDF: resource description framework
- automated manipulation/interpretation of meaning
- ontologies: formal description of concepts and relationships
- OWL Web Ontology Language
- mathematical logic
- inference engines
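To make the combination of metadata triples, ontologies and inference a little more concrete, here is a toy sketch. Facts are (subject, predicate, object) tuples in the spirit of RDF, and the predicate names `type` and `subClassOf` are simplified stand-ins for the RDF/RDFS terms. The sample facts and the `infer` function are assumptions for illustration, not a real inference engine.

```python
def infer(triples):
    """Apply two rules until no new facts appear (forward chaining):

    1. subClassOf is transitive.
    2. An instance of a class is also an instance of its superclasses.
    """
    facts = set(triples)
    while True:
        new = set()
        for (s, p, o) in facts:
            if p not in ("subClassOf", "type"):
                continue
            for (s2, p2, o2) in facts:
                # Follow one subClassOf link upwards from the object.
                if p2 == "subClassOf" and s2 == o:
                    new.add((s, p, o2))
        if new <= facts:
            return facts
        facts |= new


if __name__ == "__main__":
    facts = {
        ("Newspaper", "subClassOf", "Publication"),
        ("Publication", "subClassOf", "Document"),
        ("Times", "type", "Newspaper"),
    }
    for fact in sorted(infer(facts) - facts):
        print(fact)
```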
Readings:
Tim Berners-Lee's article
See figure 1 in this paper.
2.4 Challenges: Maintaining State
Web-based state management
- hidden form fields
- query strings
- URL rewriting can be used to make URLs more readable
- cookies
All of these can be manipulated by a malicious user!
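Since hidden fields, query strings and cookies all travel through the client, a server cannot trust them as received. One common precaution, not covered in these notes and shown here only as a sketch, is to attach a keyed hash (HMAC) so that tampering can at least be detected. `SECRET_KEY`, `sign` and `verify` are hypothetical names chosen for this example.

```python
import hashlib
import hmac

# Hypothetical key; a real application would keep this secret on the server.
SECRET_KEY = b"demo-secret"


def _mac(value):
    """Keyed SHA-256 hash of the value, as a hex string."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def sign(value):
    """Return the value with its MAC appended, ready to send to the client."""
    return f"{value}|{_mac(value)}"


def verify(signed):
    """Return the original value, or None if it was altered or unsigned."""
    value, sep, mac = signed.rpartition("|")
    if not sep:
        return None
    # compare_digest avoids leaking information through timing differences.
    if hmac.compare_digest(mac.encode("utf-8"), _mac(value).encode("utf-8")):
        return value
    return None


if __name__ == "__main__":
    token = sign("user=42")
    print(verify(token))
    print(verify("user=43" + token[len("user=42"):]))
```

Signing detects modification only. It does not hide the value from the user, and it does not stop an old, validly signed value from being replayed.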
Server-based state management:
- automatically implemented by languages such as PHP, ASP
- application states versus session states
- internally these are also IMPLEMENTED as hidden fields, query strings
or cookies!
2.4.1 Maintaining state in web databases
The preferred method is:
- store detailed data in database on server
- user IDs are stored in cookies during a session
- user needs to log in with username/password to start a new session
- for better security: use SSL authentication
Without authentication, there is no guarantee that state is maintained
correctly!
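The method described above might look roughly as follows, with SQLite standing in for the server database. The table layout, the `USERS` dictionary (plaintext passwords, for demonstration only) and the function names are assumptions. As a variation on the notes, the cookie here would carry a random session ID rather than the user ID itself, so that it cannot simply be guessed.

```python
import secrets
import sqlite3

# Hypothetical account list; real systems store password hashes, not text.
USERS = {"alice": "wonderland"}


def open_store():
    """Create the server-side tables that hold all detailed state."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sessions (sid TEXT PRIMARY KEY, username TEXT)")
    conn.execute("CREATE TABLE cart (username TEXT, item TEXT)")
    return conn


def login(conn, username, password):
    """Start a session; the returned ID is what would be sent as a cookie."""
    if username not in USERS or USERS[username] != password:
        return None
    sid = secrets.token_hex(16)
    conn.execute("INSERT INTO sessions VALUES (?, ?)", (sid, username))
    return sid


def add_to_cart(conn, sid, item):
    """Store state under the user that the session ID belongs to."""
    row = conn.execute(
        "SELECT username FROM sessions WHERE sid = ?", (sid,)
    ).fetchone()
    if row is None:
        return False  # unknown or forged session ID
    conn.execute("INSERT INTO cart VALUES (?, ?)", (row[0], item))
    return True


def cart_items(conn, username):
    """Read the stored state back from the database."""
    cur = conn.execute(
        "SELECT item FROM cart WHERE username = ? ORDER BY rowid", (username,)
    )
    return [item for (item,) in cur]


if __name__ == "__main__":
    conn = open_store()
    sid = login(conn, "alice", "wonderland")
    add_to_cart(conn, sid, "book")
    print(cart_items(conn, "alice"))
```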