Intractable Problems is a column designed to challenge and stimulate your thinking skills. In this issue I want to challenge you to look at some of the difficulties inherent in accessing the data in our computer systems.
Now, I know that each and every one of you always gets all the information out of the computer that you need, without difficulty. Why, we all use SQL, the structured query language designed for easy access into relational databases!
Let's take a close look at SQL. [1] Did you know that SQL statements are relational calculus expressions? I am not making this up. Relational calculus is a way of expressing what our result is to be, without figuring out how to get it. Of course, we're all calculus experts, right? Guess what happens if we want some data which must be selected on more than one condition. For example, we would like to find all 65 year old males in the province of New Brunswick.
Now, logically, we know that there are two ways to find the answer. First, we could find all the males in New Brunswick and then look to see which of them are 65 years old. Or, we could find all the 65 year old people in New Brunswick and then see which of them are male. Does it make a difference which way we use?
Yes, it does. We would like to know the selectivity of each condition. Selectivity is the ratio between the amount of data that satisfies the condition to the total amount of data that must be searched. The selectivity is the probability that a specific data item is found. If the selectivity is small, then only a few data items are selected by the condition. We want to use this condition first in any compound expression so that we minimize the amount of work necessary to satisfy the next condition. In general, for multiple conjunctive conditions, the best order of application is from smallest selectivity to largest.
In many cases neither you nor I nor the computer know the precise selectivities. So, the best assumption we can make is that the data is uniformly distributed across its range. For example, assuming uniform distributions, if gender is either male or female and age ranges from 1 to 100, then we can predict a 50 fold improvement if we select on age before gender.
Now, let's raise the stakes. What happens if the distribution is not uniform? Let's introduce a new area attribute and consider the case where 95% of the people live in the province of New Brunswick and the remaining 5% live in 199 different areas of the world. In this case there are 200 different area values. The area selectivity, assuming a uniform distribution, is much better than the age selectivity. According to our rule, the area attribute should be accessed first given any query with a conjunctive clause relating area and age. This is a good strategy if we are looking for 65 year old males outside of New Brunswick. However, this is a very poor plan if we wanted a New Brunswick resident. How is a person to choose, without understanding calculus?
SQL, a language founded on relational calculus, does not help us choose the most effective query formulation. The query optimization task is often an intractable problem, frequently left to some foolish computer to figure out.
Next issue, I want to take a critical look at the serious problems involved in processing information. Do you think that there are any limits as to what we can do?
Codd, E.F. A Data Base Sublanguage Founded on Relational Calculus, Proc. 1971 ACM SIGFIDET Workshop on Data Description, Access and Control.

Copyright © 1995-2004 WSM Information Systems Inc.