K. Selçuk Candan
General CV Research
     

Query Optimization


Multimedia and web database queries have different characteristics from queries in traditional databases. For instance, due to the use of thresholding and top-k predicates and sub-queries the number of query results can also change depending on the query execution order. Furthermore, finding exact matches is not required. Due to these inherent differences, we develop novel query processing mechanisms.

[back]

 

We develop novel techniques for performing query optimization in databases, such as multimedia and web databases, which rely on tresholding and top-k predicates. We propose an optimization model that (1) takes into account different binding patterns associated with query predicates and (2) considers the variations in the query result size (or coverage), depending on the execution order. We address the additional complexity and the well-known NP-complete nature of the query optimization problem by adaptively reducing the granularity of the search space.

Let us consider an SQL-like multimedia query that retrieves surveillance images that contain an alert pattern:

select location, time, image
from images
where surveillance_scans(location, time, image) and
location = "entrance_gate" and
extract_pattern(image, pattern) and
match_pattern(pattern,"alert.gif").

In this application, surveillance images are continuously fed from the cameras into a database and the above query is executed repeatedly to identify the alert condition. Therefore, due to the nature of the application, it is essential that the query is evaluated as quickly as possible. Furthermore, since pattern extraction and matching are fuzzy operations, it is essential that the results of the query have associated confidence levels that can be used for deciding when to trigger the alert condition. Note that, in this application, we would like to identify the best alert candidates in the shortest amount of time.

While processing a query, we need to consider the following three issues characteristic to multimedia databases. Note that some of these, such as overloaded implementation of query predicates, are also seen in other database applications. However, altogether, thefollowing three issues define the characteristics of multimedia databases:

Overloaded implementations of query predicates. Media-related predicates can be implemented by various user-defined functions or indexes corresponding to different ways the predicate can be invoked. For instance, extract_pattern(image,pattern), can have three different overloaded implementations:

  • given an image, one implementation extracts a predetermined number of patterns using a pattern recognition function,
  • given a pair of an image and a pattern, another one checks for the presence of the pattern in the image using a neural-net engine, and
  • given a pattern, the last implementation retrieves all matching images using a cache of pre-extracted of pattern/image pairs maintained in the form of a multi-dimensional index structure that maps all known patterns into a feature space for quick access.

Each of these implementations can have drastically different behaviors in terms of the number of results returned.

Thresholding and ``top-k'' predicates. The solution space in most media-related predicates is very large. For example, in the above example, every given pattern matches every other one to some degree. Therefore, providing all answers to a query is neither feasible nor desirable. Consequently, multimedia functions are usually implemented as thresholding or top-k retrieval functions. This means that a sub-query or a user defined function returns only a small, but most relevant, portion of all possible results. However, if two overloaded implementations of the same predicate define relevance differently, then they can return different portions of the result space.

This may cause certain complications in query optimization. One complication is that a query can result in different numbers of tuples depending on the execution plan chosen by the query optimizer.

Multimedia queries are generally processed in two ways: (1) using regular query processing techniques within database management systems and (2) using special merge techniques that benefit from the fact that results returned by most multimedia user-defined functions are ranked based on their scores. Therefore,

  • we study semantics of various fuzzy conjunction, disjunction, and negation operators, as they apply to multtimedia and web applications,
  • we develop ranking-based query processing algorithmsthat can handle non-progressive predicates; this requires maintenance of additional statistics about rank and data distribution and use of approximation techniques,
  • we develop query optimization techniques that use a query model where both selections and joins are modeled as predicates. Consequently, we are able to state the query optimization problem as a join-ordering problem.

An optimization model assigns a value (to be minimized) to all partial or complete plans in the search space. It also determines the output size of the data stream for every operator and predicate in the plan. This plan must respect the available binding patterns of each query predicate. A binding pattern of a predicate specifies which attributes of a relation must be bound to access it. Each binding pattern abstracts an overloaded implementation of a predicate in terms of an external function or an index structure. Therefore, given a multimedia query predicate, each binding pattern of predicate has different cost, size, and quality parameters associated with it. Therefore, we need an optimization model which captures all these dimensions.

We have shown that under these conditions, we can not use dynamic programming-based optimization solutions. Therefore, we develop optimization algorithms that capture these three dimensions of optimality. The number of sub-plans to be considered can be significantly reduced by approximating the costs of the sub-plans and choosing only the best of them. The desired level of approximation needed is described as the granularity of the distribution. If the granularity is small, there potentially is a considerable reduction in optimization time, but this introduces an error in the optimization.

We divide the whole search space into buckets, whose sizes are determined by the granularity desired. All sub-plans that belong to a particular interval of values are placed in the same bucket. In each step of query-optimization, instead of using all sub-plans, only one sub-plan from every bucket is used. Due to this approximation, thesearch-space size is controlled. For this purpose, unlike the data histograms which capture the data distribution, we introduce opt-histograms that capture the distribution of sub-query-plan values over many query optimization tasks.

 

[back]

 




Conferences

MIS 2002

ISCAS 2002

Web3D 2002

ICDCS2002

MIR 2001

MIS 2001

Pictures

I enjoy taking pictures...

Okay, I am not very good, but hey why should that stop me from making some of them available online?