Friday 5 August 2011

Set Operations

Set Operations

There are situations when we need to combine the results from two or more SELECT statements. SQL enables us to handle these requirements by using set operations. The result of each SELECT statement can be treated as a set, and SQL set operations can be applied on those sets to arrive at a final result. Oracle SQL supports the following four set operations:
  • UNION ALL
  • UNION
  • MINUS
  • INTERSECT
SQL statements containing these set operators are referred to as compound queries, and each SELECT statement in a compound query is referred to as a component query. Two SELECTs can be combined into a compound query by a set operation only if they satisfy the following two conditions:
  1. The result sets of both the queries must have the same number of columns.
  2. The datatype of each column in the second result set must match the datatype of its corresponding column in the first result set.
  3. TIP:   The datatypes do not need to be the same if those in the second result set can be automatically converted by Oracle (using implicit casting) to types compatible with those in the first result set.
These conditions are also referred to as union compatibility conditions. The term union compatibility is used even though these conditions apply to other set operations as well. Set operations are often called vertical joins, because the result combines data from two or more SELECTS based on columns instead of rows. The generic syntax of a query involving a set operation is:
<component query>
{UNION | UNION ALL | MINUS | INTERSECT}
<component query>
The keywords UNION, UNION ALL, MINUS, and INTERSECT are set operators. We can have more than two component queries in a composite query; we will always use one less set operator than the number of component queries.
The following sections discuss syntax, examples, rules, and restrictions for the four set operations.

Set Operators

The following list briefly describes the four set operations supported by Oracle SQL:
UNION ALL
Combines the results of two SELECT statements into one result set.
UNION
Combines the results of two SELECT statements into one result set, and then eliminates any duplicate rows from that result set.
MINUS
Takes the result set of one SELECT statement, and removes those rows that are also returned by a second SELECT statement.
INTERSECT
Returns only those rows that are returned by each of two SELECT statements.
Before moving on to the details on these set operators, let's look at the following two queries, which we'll use as component queries in our subsequent examples. The first query retrieves all the customers in region 5.
SELECT CUST_NBR, NAME
FROM CUSTOMER
WHERE REGION_ID = 5;
 
  CUST_NBR NAME
---------- ------------------------------
         1 Cooper Industries
         2 Emblazon Corp.
         3 Ditech Corp.
         4 Flowtech Inc.
         5 Gentech Industries
The second query retrieves all the customers with the sales representative is 'MARTIN'.
SELECT C.CUST_NBR, C.NAME
FROM CUSTOMER C
WHERE C.CUST_NBR IN (SELECT O.CUST_NBR
                     FROM CUST_ORDER O, EMPLOYEE E
                     WHERE O.SALES_EMP_ID = E.EMP_ID
                     AND E.LNAME = 'MARTIN');
 
  CUST_NBR NAME
---------- ------------------------------
         4 Flowtech Inc.
         8 Zantech Inc.
If we look at the results returned by these two queries, we will notice that there is one common row (for Flowtech Inc.). The following sections discuss the effects of the various set operations between these two result sets.

UNION ALL

The UNION ALL operator merges the result sets of two component queries. This operation returns rows retrieved by either of the component queries. The following example illustrates the UNION ALL operation:
SELECT CUST_NBR, NAME
FROM CUSTOMER
WHERE REGION_ID = 5
UNION ALL
SELECT C.CUST_NBR, C.NAME
FROM CUSTOMER C
WHERE C.CUST_NBR IN (SELECT O.CUST_NBR
                     FROM CUST_ORDER O, EMPLOYEE E
                     WHERE O.SALES_EMP_ID = E.EMP_ID
                     AND E.LNAME = 'MARTIN');
 
  CUST_NBR NAME
---------- ------------------------------
         1 Cooper Industries
         2 Emblazon Corp.
         3 Ditech Corp.
         4 Flowtech Inc.
         5 Gentech Industries
         4 Flowtech Inc.
         8 Zantech Inc.
 
7 rows selected.
As we can see from the result set, there is one customer, which is retrieved by both the SELECTs, and therefore appears twice in the result set. The UNION ALL operator simply merges the output of its component queries, without caring about any duplicates in the final result set.

UNION

The UNION operator returns all distinct rows retrieved by two component queries. The UNION operation eliminates duplicates while merging rows retrieved by either of the component queries. The following example illustrates the UNION operation:
SELECT CUST_NBR, NAME
FROM CUSTOMER
WHERE REGION_ID = 5
UNION
SELECT C.CUST_NBR, C.NAME
FROM CUSTOMER C
WHERE C.CUST_NBR IN (SELECT O.CUST_NBR
                     FROM CUST_ORDER O, EMPLOYEE E
                     WHERE O.SALES_EMP_ID = E.EMP_ID
                     AND E.LNAME = 'MARTIN');
 
  CUST_NBR NAME
---------- ------------------------------
         1 Cooper Industries
         2 Emblazon Corp.
         3 Ditech Corp.
         4 Flowtech Inc.
         5 Gentech Industries
         8 Zantech Inc.
 
6 rows selected.
This query is a modification of the previous query; the keywords UNION ALL have been replaced with UNION. Notice that the result set contains only distinct rows (no duplicates). To eliminate duplicate rows, a UNION operation needs to do some extra tasks as compared to the UNION ALL operation. These extra tasks include sorting and filtering the result set. If we observe carefully, we will notice that the result set of the UNION ALL operation is not sorted, whereas the result set of the UNION operation is sorted. These extra tasks introduce a performance overhead to the UNION operation. A query involving UNION will take extra time compared to the same query with UNION ALL, even if there are no duplicates to remove. Therefore, unless we have a valid need to retrieve only distinct rows, we should use UNION ALL instead of UNION for better performance.

INTERSECT

INTERSECT returns only the rows retrieved by both component queries. Compare this with UNION, which returns the rows retrieved by any of the component queries. If UNION acts like 'OR', INTERSECT acts like 'AND'. For example:
SELECT CUST_NBR, NAME
FROM CUSTOMER
WHERE REGION_ID = 5
INTERSECT
SELECT C.CUST_NBR, C.NAME
FROM CUSTOMER C
WHERE C.CUST_NBR IN (SELECT O.CUST_NBR
                     FROM CUST_ORDER O, EMPLOYEE E
                     WHERE O.SALES_EMP_ID = E.EMP_ID
                     AND E.LNAME = 'MARTIN');
 
  CUST_NBR NAME
---------- ------------------------------
         4 Flowtech Inc.
As we saw earlier, "Flowtech Inc." was the only customer retrieved by both SELECT statements. Therefore, the INTERSECT operator returns just that one row.

MINUS

MINUS returns all rows from the first SELECT that are not also returned by the second SELECT. For example:
SELECT CUST_NBR, NAME
FROM CUSTOMER
WHERE REGION_ID = 5
MINUS
SELECT C.CUST_NBR, C.NAME
FROM CUSTOMER C
WHERE C.CUST_NBR IN (SELECT O.CUST_NBR
                     FROM CUST_ORDER O, EMPLOYEE E
                     WHERE O.SALES_EMP_ID = E.EMP_ID
                     AND E.LNAME = 'MARTIN');
 
  CUST_NBR NAME
---------- ------------------------------
         1 Cooper Industries
         2 Emblazon Corp.
         3 Ditech Corp.
         5 Gentech Industries
You might wonder why we don't see "Zantech Inc." in the output. An important thing to note here is that the execution order of component queries in a set operation is from top to bottom. The results of UNION, UNION ALL, and INTERSECT will not change if we alter the ordering of component queries. However, the result of MINUS will be different if we alter the order of the component queries. If we rewrite the previous query by switching the positions of the two SELECTs, we get a completely different result:
SELECT C.CUST_NBR, C.NAME
FROM CUSTOMER C
WHERE C.CUST_NBR IN (SELECT O.CUST_NBR
                     FROM CUST_ORDER O, EMPLOYEE E
                     WHERE O.SALES_EMP_ID = E.EMP_ID
                     AND E.LNAME = 'MARTIN')
MINUS
SELECT CUST_NBR, NAME
FROM CUSTOMER
WHERE REGION_ID = 5;
 
  CUST_NBR NAME
---------- ------------------------------
         8 Zantech Inc.
The row for "Flowtech Inc." is returned by both queries, so in our first MINUS example the first component query adds "Flowtech Inc." to the result set while the second component query removes it. The second example turns the MINUS operation around. The first component query adds "Flowtech Inc." and "Zantech Inc." to the result set. The second component query specifies rows to subtract. One of the rows to subtract is "Flowtech Inc.", leaving "Zantech Inc." as the sole remaining row.
  In a MINUS operation, rows may be returned by the second SELECT that are not also returned by the first. These rows are not included in the output.

Using Set Operations to Compare Two Tables

Developers, and even DBAs, occasionally need to compare the contents of two tables to determine whether the tables contain the same data. The need to do this is especially common in test environments, as developers may want to compare a set of data generated by a program under test with a set of "known good" data. Comparison of tables is also useful for automated testing purposes, when we have to compare actual results with a given set of expected results. SQL's set operations provide an interesting solution to this problem of comparing two tables.
The following query uses both MINUS and UNION ALL to compare two tables for equality. The query depends on each table having either a primary key or at least one unique index.
(SELECT * FROM CUSTOMER_KNOWN_GOOD
MINUS
SELECT * FROM CUSTOMER_TEST)
UNION ALL
(SELECT * FROM CUSTOMER_TEST
MINUS
SELECT * FROM CUSTOMER_KNOWN_GOOD);
Let's talk a bit about how this query works. We can look at it as the union of two compound queries. The parentheses ensure that both MINUS operations take place first before the UNION ALL operation is performed. The result of the first MINUS query will be those rows in CUSTOMER_KNOWN_GOOD that are not also in CUSTOMER_TEST. The result of the second MINUS query will be those rows in CUSTOMER_TEST that are not also in CUSTOMER_KNOWN_GOOD. The UNION ALL operator simply combines these two result sets for convenience. If no rows are returned by this query, then we know that both tables have identical rows. Any rows returned by this query represent differences between the CUSTOMER_TEST and CUSTOMER_KNOWN_GOOD tables.
If the possibility exists for one or both tables to contain duplicate rows, we must use a more general form of this query in order to test two tables for equality. This more general form uses row counts to detect duplicates:
(SELECT C1.*,COUNT(*) 
 FROM CUSTOMER_KNOWN_GOOD
 GROUP BY C1.CUST_NBR, C1.NAME...
MINUS
 SELECT C2.*, COUNT(*)
 FROM CUSTOMER_TEST C2
 GROUP BY C2.CUST_NBR, C2.NAME...)
UNION ALL
(SELECT C3.*,COUNT(*) 
 FROM CUSTOMER_TEST C3
 GROUP BY C3.CUST_NBR, C3.NAME...
MINUS
 SELECT C4.*, COUNT(*)
 FROM CUSTOMER_KNOWN_GOOD C4
 GROUP BY C4.CUST_NBR, C4.NAME...)
This query is getting complex! The GROUP BY clause (see Chapter 4) for each SELECT must list all columns for the table being selected. Any duplicate rows will be grouped together, and the count will reflect the number of duplicates. If the number of duplicates is the same in both tables, the MINUS operations will cancel those rows out. If any rows are different, or if any occurrence counts are different, the resulting rows will be reported by the query.
Let's look at an example to illustrate how this query works. We'll start with the following tables and data:
DESC CUSTOMER_KNOWN_GOOD
 Name                         Null?    Type
 ---------------------------- -------- ----------------
 CUST_NBR                     NOT NULL NUMBER(5)
 NAME                         NOT NULL VARCHAR2(30)
 
SELECT * FROM CUSTOMER_KNOWN_GOOD;
 
   CUST_NBR NAME
----------- ------------------------------
          1 Sony
          1 Sony
          2 Samsung
          3 Panasonic
          3 Panasonic
          3 Panasonic
 
6 rows selected.
 
DESC CUSTOMER_TEST
Name                         Null?    Type
 ---------------------------- -------- ----------------
 CUST_NBR                     NOT NULL NUMBER(5)
 NAME                         NOT NULL VARCHAR2(30)
 
SELECT * FROM CUSTOMER_TEST;
   CUST_NBR NAME
----------- ------------------------------
          1 Sony
          1 Sony
          2 Samsung
          2 Samsung
          3 Panasonic
As we can see the CUSTOMER_KNOWN_GOOD and CUSTOMER_TEST tables have the same structure, but different data. Also notice that none of these tables has a primary or unique key; there are duplicate records in both. The following SQL will compare these two tables effectively:
(SELECT C1.*, COUNT(*)
FROM CUSTOMER_KNOWN_GOOD C1
GROUP BY C1.CUST_NBR, C1.NAME
MINUS
SELECT C2.*, COUNT(*)
FROM CUSTOMER_TEST C2
GROUP BY C2.CUST_NBR, C2.NAME)
UNION ALL
(SELECT C3.*, COUNT(*)
FROM CUSTOMER_TEST C3
GROUP BY C3.CUST_NBR, C3.NAME
MINUS
SELECT C4.*, COUNT(*)
FROM CUSTOMER_KNOWN_GOOD C4
GROUP BY C4.CUST_NBR, C4.NAME);
 
   CUST_NBR NAME                             COUNT(*)
----------- ------------------------------ ----------
          2 Samsung                                 1
          3 Panasonic                               3
          2 Samsung                                 2
          3 Panasonic                               1
These results indicate that one table (CUSTOMER_KNOWN_GOOD) has one record for "Samsung", whereas the second table (CUSTOMER_TEST) has two records for the same customer. Also, one table (CUSTOMER_KNOWN_GOOD) has three records for "Panasonic", whereas the second table (CUSTOMER_TEST) has one record for the same customer. Both the tables have the same number of rows (two) for "Sony", and therefore "Sony" doesn't appear in the output.
TIP:   Duplicate rows are not possible in tables that have a primary key or at least one unique index. Use the short form of the table comparison query for such tables.

Using NULLs in Compound Queries

We discussed union compatibility conditions at the beginning of this chapter. The union compatibility issue gets interesting when NULLs are involved. As we know, NULL doesn't have a datatype, and NULL can be used in place of a value of any datatype. If we purposely select NULL as a column value in a component query, Oracle no longer has two datatypes to compare in order to see whether the two component queries are compatible. For character columns, this is no problem. For example:
SELECT 1 NUM, 'DEFINITE' STRING FROM DUAL
UNION
SELECT 2 NUM, NULL STRING FROM DUAL;
 
       NUM STRING
---------- --------
         1 DEFINITE
         2
Notice that Oracle considers the character string 'DEFINITE' from the first component query to be compatible with the NULL value supplied for the corresponding column in the second component qery. However, if a NUMBER or a DATE column of a component query is set to NULL, we must explicitly tell Oracle what "flavor" of NULL to use. Otherwise, we'll encounter errors. For example:
SELECT 1 NUM, 'DEFINITE' STRING FROM DUAL
UNION
SELECT NULL NUM, 'UNKNOWN' STRING FROM DUAL;
 
SELECT 1 NUM, 'DEFINITE' STRING FROM DUAL
       *
ERROR at line 1:
ORA-01790: expression must have same datatype as corresponding expression
Note that the use of NULL in the second component query causes a datatype mismatch between the first column of the first component query, and the first column of the second component query. Using NULL for a DATE column causes the same problem, as in the following example:
SELECT 1 NUM, SYSDATE DATES FROM DUAL
UNION
SELECT 2 NUM, NULL DATES FROM DUAL;
SELECT 1 NUM, SYSDATE DATES FROM DUAL
              *
ERROR at line 1:
ORA-01790: expression must have same datatype as corresponding expression 
In these cases, we need to cast the NULL to a suitable datatype to fix the problem, as in the following examples:
SELECT 1 NUM, 'DEFINITE' STRING FROM DUAL
UNION
SELECT TO_NUMBER(NULL) NUM, 'UNKNOWN' STRING FROM DUAL;
 
       NUM STRING
---------- --------
         1 DEFINITE
           UNKNOWN
 
SELECT 1 NUM, SYSDATE DATES FROM DUAL
UNION
SELECT 2 NUM, TO_DATE(NULL) DATES FROM DUAL;
 
       NUM DATES
---------- ---------
         1 06-JAN-02
         2
This problem of union compatibility when using NULLs is encountered in Oracle8i. However, there is no such problem in Oracle9i, as we can see in the following examples generated from an Oracle9i database:
SELECT 1 NUM, 'DEFINITE' STRING FROM DUAL
UNION
SELECT NULL NUM, 'UNKNOWN' STRING FROM DUAL;
 
       NUM STRING
---------- --------
         1 DEFINITE
           UNKNOWN
 
SELECT 1 NUM, SYSDATE DATES FROM DUAL
UNION
SELECT 2 NUM, NULL DATES FROM DUAL;
 
       NUM DATES
---------- ---------
         1 06-JAN-02
         2
Oracle9i is smart enough to know which flavor of NULL to use in a compound query.

Rules and Restrictions on Set Operations

Other than the union compatibility conditions discussed at the beginning of the chapter, there are some other rules and restrictions that apply to the set operations. These rules and restrictions are as follows:
Column names for the result set are derived from the first SELECT:
SELECT CUST_NBR "Customer ID", NAME "Customer Name"
FROM CUSTOMER
WHERE REGION_ID = 5
UNION
SELECT C.CUST_NBR, C.NAME
FROM CUSTOMER C
WHERE C.CUST_NBR IN (SELECT O.CUST_NBR
                     FROM CUST_ORDER O, EMPLOYEE E
                     WHERE O.SALES_EMP_ID = E.EMP_ID
                     AND E.LNAME = 'MARTIN');
 
Customer ID Customer Name
----------- ----------------------
          1 Cooper Industries
          2 Emblazon Corp.
          3 Ditech Corp.
          4 Flowtech Inc.
          5 Gentech Industries
          8 Zantech Inc.
 
6 rows selected.
Although both SELECTs use column aliases, the result set takes the column names from the first SELECT. The same thing happens when we create a view based on a set operation. The column names in the view are taken from the first SELECT:
CREATE VIEW V_TEST_CUST AS
SELECT CUST_NBR "Customer ID", NAME "Customer Name"
FROM CUSTOMER
WHERE REGION_ID = 5
UNION
SELECT C.CUST_NBR, C.NAME
FROM CUSTOMER C
WHERE C.CUST_NBR IN (SELECT O.CUST_NBR
                     FROM CUST_ORDER O, EMPLOYEE E
                     WHERE O.SALES_EMP_ID = E.EMP_ID
                     AND E.LNAME = 'MARTIN');
 
View created.
 
DESC V_TEST_CUST
 Name                            Null?    Type
 ------------------------------- -------- ----
 Customer_ID                              NUMBER
 Customer_Name                            VARCHAR2(45)
If we want to use ORDER BY in a query involving set operations, we must place the ORDER BY at the end of the entire statement. The ORDER BY clause can appear only once at the end of the compound query. The component queries can't have individual ORDER BY clauses. For example:
SELECT CUST_NBR, NAME
FROM CUSTOMER
WHERE REGION_ID = 5
UNION
SELECT EMP_ID, LNAME
FROM EMPLOYEE
WHERE LNAME = 'MARTIN'
ORDER BY CUST_NBR;
 
  CUST_NBR NAME
---------- ---------------------
         1 Cooper Industries
         2 Emblazon Corp.
         3 Ditech Corp.
         4 Flowtech Inc.
         5 Gentech Industries
      7654 MARTIN
 
6 rows selected.
Note that the column name used in the ORDER BY clause of this query is taken from the first SELECT. We couldn't order these results by EMP_ID. If we attempt to ORDER BY EMP_ID, we will get an error, as in the following example:
SELECT CUST_NBR, NAME
FROM CUSTOMER
WHERE REGION_ID = 5
UNION
SELECT EMP_ID, LNAME
FROM EMPLOYEE
WHERE LNAME = 'MARTIN'ORDER BY EMP_ID;
ORDER BY EMP_ID
         *
ERROR at line 8:
ORA-00904: invalid column name
The ORDER BY clause doesn't recognize the column names of the second SELECT. To avoid confusion over column names, it is a common practice to ORDER BY column positions:
SELECT CUST_NBR, NAME
FROM CUSTOMER
WHERE REGION_ID = 5
UNION
SELECT EMP_ID, LNAME
FROM EMPLOYEE
WHERE LNAME = 'MARTIN'
ORDER BY 1;
 
  CUST_NBR NAME
---------- ---------------------
         1 Cooper Industries
         2 Emblazon Corp.
         3 Ditech Corp.
         4 Flowtech Inc.
         5 Gentech Industries
      7654 MARTIN
 
6 rows selected.
TIP:   Unlike ORDER BY, we can use GROUP BY and HAVING clauses in component queries.
Component queries are executed from top to bottom. If we want to alter the sequence of execution, use parentheses appropriately. For example:
SELECT * FROM SUPPLIER_GOOD
UNION
SELECT * FROM SUPPLIER_TEST
MINUS
SELECT * FROM SUPPLIER;
 
SUPPLIER_ID NAME
----------- --------------------------
          4 Toshiba
Oracle performs the UNION between SUPPLIER_GOOD and SUPPLIER_TEST first, and then performs the MINUS between the result of the UNION and the SUPPLIER table. If we want the MINUS between SUPPLIER_TEST and SUPPLIER to be performed first, and then the UNION between SUPPLIER_GOOD and the result of MINUS, we must use parentheses to indicate so:
SELECT * FROM SUPPLIER_GOOD
UNION
(SELECT * FROM SUPPLIER_TEST
MINUS
SELECT * FROM SUPPLIER);
 
SUPPLIER_ID NAME
----------- -------------------------
          1 Sony
          2 Samsung
          3 Panasonic
          4 Toshiba
The parentheses in this query forces the MINUS to be performed before the UNION. Notice the difference in the result as compared to the previous example.
The following list summarizes some simple rules, restrictions, and notes that don't require examples:
  • Set operations are not permitted on columns of type BLOB, CLOB, BFILE, and VARRAY, nor are set operations permitted on nested table columns.
  • Since UNION, INTERSECT, and MINUS operators involve sort operations, they are not allowed on LONG columns. However, UNION ALL is allowed on LONG columns.
  • Set operations are not allowed on SELECT statements containing TABLE collection expressions.
  • SELECT statements involved in set operations can't use the FOR UPDATE clause.
  • The number and size of columns in the SELECT list of component queries are limited by the block size of the database. The total bytes of the columns SELECTed can't exceed one database block.

No comments:

Post a Comment