| |
We
develop novel techniques for performing query optimization
in databases, such as multimedia and web databases, which
rely on tresholding and top-k predicates. We propose an optimization
model that (1) takes into account different binding patterns
associated with query predicates and (2) considers the variations
in the query result size (or coverage), depending on the execution
order. We address the additional complexity and the well-known
NP-complete nature of the query optimization problem by adaptively
reducing the granularity of the search space.
Let us consider an SQL-like multimedia query that retrieves
surveillance images that contain an alert pattern:
select
location, time, image
from images
where surveillance_scans(location, time, image) and
location = "entrance_gate" and
extract_pattern(image, pattern) and
match_pattern(pattern,"alert.gif").
In this application, surveillance images are continuously
fed from the cameras into a database and the above query is
executed repeatedly to identify the alert condition. Therefore,
due to the nature of the application, it is essential that
the query is evaluated as quickly as possible. Furthermore,
since pattern extraction and matching are fuzzy operations,
it is essential that the results of the query have associated
confidence levels that can be used for deciding when to trigger
the alert condition. Note that, in this application, we would
like to identify the best alert candidates in the shortest
amount of time.
While
processing a query, we need to consider the following three
issues characteristic to multimedia databases. Note that some
of these, such as overloaded implementation of query predicates,
are also seen in other database applications. However, altogether,
thefollowing three issues define the characteristics of multimedia
databases:
Overloaded
implementations of query predicates.
Media-related predicates can be implemented by various user-defined
functions or indexes corresponding to different ways the predicate
can be invoked. For instance, extract_pattern(image,pattern),
can have three different overloaded implementations:
- given
an image, one implementation extracts a predetermined number
of patterns using a pattern recognition function,
- given
a pair of an image and a pattern, another one checks for
the presence of the pattern in the image using a neural-net
engine, and
- given
a pattern, the last implementation retrieves all matching
images using a cache of pre-extracted of pattern/image pairs
maintained in the form of a multi-dimensional index structure
that maps all known patterns into a feature space for quick
access.
Each
of these implementations can have drastically different behaviors
in terms of the number of results returned.
Thresholding
and ``top-k'' predicates. The solution space in most
media-related predicates is very large. For example, in the
above example, every given pattern matches every other one
to some degree. Therefore, providing all answers to a query
is neither feasible nor desirable. Consequently, multimedia
functions are usually implemented as thresholding or top-k
retrieval functions. This means that a sub-query or a user
defined function returns only a small, but most relevant,
portion of all possible results. However, if two overloaded
implementations of the same predicate define relevance differently,
then they can return different portions of the result space.
This
may cause certain complications in query optimization. One
complication is that a query can result in different numbers
of tuples depending on the execution plan chosen by the query
optimizer.
Multimedia queries are generally processed in two ways: (1)
using regular query processing techniques within database
management systems and (2) using special merge techniques
that benefit from the fact that results returned by most multimedia
user-defined functions are ranked based on their scores. Therefore,
- we study semantics of various fuzzy conjunction, disjunction,
and negation operators, as they apply to multtimedia and
web applications,
- we develop ranking-based query processing algorithmsthat
can handle non-progressive predicates; this requires maintenance
of additional statistics about rank and data distribution
and use of approximation techniques,
- we develop query optimization techniques that use a query
model where both selections and joins are modeled as predicates.
Consequently, we are able to state the query optimization
problem as a join-ordering problem.
An optimization model assigns a value (to be minimized) to
all partial or complete plans in the search space. It also
determines the output size of the data stream for every operator
and predicate in the plan. This plan must respect the available
binding patterns of each query predicate. A binding pattern
of a predicate specifies which
attributes of a relation must be bound to access it. Each
binding pattern abstracts an overloaded implementation of
a predicate in terms of an external function or an index
structure. Therefore, given a multimedia query predicate,
each binding pattern of predicate has different cost, size,
and quality parameters associated with it. Therefore, we need
an optimization model which captures all these dimensions.
We have shown that under these conditions, we can not use
dynamic programming-based optimization solutions. Therefore,
we develop optimization algorithms that capture these three
dimensions of optimality. The number of sub-plans to be considered
can be significantly reduced by approximating the costs of
the sub-plans and choosing only the best of them. The desired
level of approximation needed is described as the granularity
of the distribution. If the granularity is small, there
potentially is a considerable reduction in optimization time,
but this introduces an error in the optimization.
We divide the whole search space into buckets, whose sizes
are determined by the granularity desired. All sub-plans that
belong to a particular interval of values are placed in the
same bucket. In each step of query-optimization, instead of
using all sub-plans, only one sub-plan from every bucket is
used. Due to this approximation, thesearch-space size is controlled.
For this purpose, unlike the data histograms which
capture the data distribution, we introduce opt-histograms
that capture the distribution of sub-query-plan values over
many query optimization tasks.
[back]
|
|
|