T-SQL Set Operators INTERSECT, EXCEPT, and UNION

The T-SQL sample can be downloaded here:
http://www.sqlsafety.com/TSQLCode/SETOperators.sql

INTERSECT, EXCEPT, and UNION are set operators that perform operations between two or more input sets.  UNION has been available in t-sql for some time, while INTERSECT and EXCEPT were introduced in SQL 2005.  All three operators have similar requirements:

  1. Require a minimum of 2 input sets
  2. Each input set must have the same number of columns returned
  3. Each relative column must have compatible data types
  4. The ORDER BY clause can only be used on the final set

I was recently approached by a colleague at a conference asking why my blog posts are not more specific to law enforcement, my full time job.  I explained that public safety is such a small and specific sector that it is difficult to tailor examples towards this area while still making them germane to other areas of business. He requested that I occasionally dedicate post(s) specifically pointed towards public safety. As I began working on the examples for this post I struggled to find meaningful examples using sample databases, AdventureWorks, Nortwhind, etc., that illustrated real life usefulness of all the outlined set operators. I was able to kill two birds with one stone by looking through scripts I use daily that deal directly with public safety.
These examples will utilize 3 tables, incidents, accidents, and arrests.

  1. The incidents table records all reported crimes
    1. An incident can spawn an arrest which is then recorded in the arrests table using the incident’s case number, i.e. a burglary is reported and some days later the suspect is arrested.
    2. An incident cannot spawn an accident
  2. The accidents table records all traffic crashes
    1. An accident can spawn an arrest and is then recorded in the arrests table using the accident’s case number, i.e. a crash caused by a drunk driver.
    2. An accident cannot spawn an incident
  3. The arrests table record all arrests
    1. Arrests can come from incidents or accidents in which case the related incidents/accidents case number is used for the arrest
    2. Arrests can be made without a related incident or accident

A non-clustered index is placed on the case number column of all tables; there can be multiple identical case numbers in each table which makes a clustered index less attractive.
The below statements create and populate the sample tables and indexes:
USE tempdb;
CREATE TABLE incidents(

pkey           INT IDENTITY PRIMARY KEY,

casenumber     CHAR(8),

nature         VARCHAR(35),

reported       DATETIME );

GO
CREATE TABLE accidents( pkey           INT IDENTITY PRIMARY KEY,

casenumber     CHAR(8),

location       VARCHAR(75),

occurred       DATETIME );

GO
CREATE TABLE arrests( pkey           INT IDENTITY PRIMARY KEY,

casenumber     CHAR(8),

primarycharge  VARCHAR(40),

arrested       DATETIME );

GO
INSERT incidents VALUES('1001234',
'Burglary',
'1/2/2010'),
(
'1001244',
'Theft',
'2/5/2010'),
(
'1001264',
'Assault',
'4/6/2010'); GO
INSERT accidents VALUES('1001284',
'123 Main St.',
'11/14/2010'),
(
'1001294',
'754 SW 56th Ave',
'7/9/2010'),
(
'1001304',
'815 Nicholas Pkwy',
'4/6/2010'); GO
INSERT arrests VALUES('1001234',
'Burglary',
'1/2/2010'),
(
'1001254',
'Theft',
'2/5/2010'),
(
'1001264',
'Aggravated Assault',
'4/6/2010'),
(
'1001304',
'DUI Property Damage',
'4/6/2010');
GO
CREATE INDEX inc_casenumber ON incidents(casenumber); GO
CREATE INDEX acc_casenumber ON accidents(casenumber); GO
CREATE INDEX arr_casenumber ON arrests(casenumber); GO

UNION

As UNION was introduced to t-sql first let’s start there. A common request that would be ideal for the UNION operator would be to write a query that returns all reports from the three tables, incidents, accidents, and arrests. The below query utilizes the UNION operator and returns the reports, but does not fully meet the requirements:
SELECT casenumber,
nature,
reported FROM incidents UNION
SELECT
casenumber,
'Traffic Crash',
occurred FROM accidents UNION
SELECT
casenumber,
primarycharge,
arrested FROM arrests ORDER BY casenumber; GO

The results are as follows:
clip_image002
Observe that all columns in each input set are of similar; if not identical, data types and that the ORDER BY clause is utilized, but only in the last statement. The results returned are almost complete except that a row is missing. The UNION operator, as well as INTERSECT and EXCEPT, only return distinct rows to the final result set, which means that a the above result set is missing a single row. An incident and arrest report with the same case number, nature/primary charge and date has been removed from the result set as the selected columns are identical. In order to assure that this row is included in the result set then UNION ALL could be used, which returns all rows regardless of duplicates. However, in order to return all rows and assure that the results are more insightful a string literal can added:
SELECT casenumber,
'Incident Report ' + nature AS Nature,
reported FROM incidents UNION
SELECT
casenumber,
'Traffic Crash',
occurred FROM accidents UNION
SELECT
casenumber,
'Arrest Report ' + primarycharge AS CHARge,
arrested FROM arrests ORDER BY casenumber

clip_image002[4]
The above results now include all rows and, with the use of the string literal, the Nature column includes the type of report.

INTERSECT

Another common request that is made is how many incidents resulted in arrests.  This can be accomplished a couple of different ways, but as this article is on set operators, let’s look at INTERSECT.  The INTERSECT operator returns all rows that occur between the left and right query so the below query will return the distinct case number, nature/primary charge, and date that occurs in both queries.
SELECT casenumber,
nature,
reported FROM incidents INTERSECT
SELECT
casenumber,
primarycharge AS CHARge,
arrested FROM arrests

A single row is returned:
image
**ANSI SQL outline INTERSECT ALL and EXCEPT ALL, which are similar to UNION ALL in that duplicate rows will be returned, but this is not currently supported in SQL 2008.
Similar results can be returned using an INNER JOIN between the incidents and arrests table.  The below query  returns an extra row, thereby giving a more accurate representation of incidents resulting in arrests:
SELECT i.casenumber,
nature,
reported FROM incidents i JOIN arrests a ON i.casenumber = a.casenumber
image
You will also notice that the estimated execution plan between the two queries are similar, but that the estimated overall query cost for the INTERSECT is greater than that of the INNER JOIN.  The bulk of the query cost for the INSERT query is the SORT (DISTINCT Sort) operator, 63%.
image
By adding DISTINCT to the query utilizing the INNER JOIN then the estimated query costs between the two queries become equal, although the execution plans differ slightly.
image
Results identical to the INTERSECT statement can be returned using an inner join by adding another join criteria on the incidents nature and the arrests primary charge.  The result set is identical, but the additional comparison causes a much greater variance between the two query plans and results in the INTERSECT having a greater over all estimated query cost:
SELECT i.casenumber,
nature,
reported FROM incidents i JOIN arrests a ON i.casenumber = a.casenumber AND i.nature = a.primarycharge

image Another way of retrieving all intersecting incidents and arrests is to use a sub-query using IN:
SELECT casenumber,
nature,
reported FROM incidents WHERE casenumber IN(SELECT DISTINCT casenumber
FROM arrests);

The above query will show an identical estimated query cost compared to using an INNER JOIN although the execution plan differs slightly:
image

EXCEPT

The EXCEPT operator acts as would be expected, contrary to the INTERSECT.  EXCEPT will show all results that appear in the left query that do not exist in the right query.  This can be used to display incidents that do not have an arrest associated with them:
SELECT casenumber,
nature,
reported FROM incidents EXCEPT
SELECT
casenumber,
primarycharge,
arrested FROM arrests; GO

The results are two rows as follows:
image Notice again that this is not truly accurate since the case number 1001264 occurs in the incidents table as an assault and the arrests table reflects the charge as an aggravated assault.  This query can also be converted using an outer join and filtering based on the arrest case number being NULL:
SELECT i.casenumber,
nature,
reported FROM incidents i LEFT JOIN arrests a ON i.casenumber = a.casenumber WHERE a.casenumber IS NULL

image
Again since DISTINCT is place on all columns in the result set of EXCEPT the outer join proves to be a better choice and the estimated query costs and plans are comparable to those documented above for INTERSECT and INNER JOIN.
Just as illustrated above identical results can be achieved using NOT IN in the WHERE clause:
SELECT casenumber,
nature,
reported FROM incidents  WHERE casenumber NOT IN(SELECT casenumber FROM arrests)

Estimated query costs and plans are again comparable to those outlined above for INTERSECT.

SET OPERATOR PRECEDENCE
All 3 set operators can be used together to provide a single result set, but each operator has its own precedence.
  1. INTERSECT
  2. EXCEPT/UNION
    1. Both are treated equally and evaluated based on appearance in order
Consider now how to return the case number and date of all incidents and accidents that are present in the arrests table.  As the INTERSECT operator is evaluated first this is rather easy:
SELECT casenumber,
reported FROM incidents INTERSECT
SELECT
casenumber,
arrested FROM arrests UNION
SELECT
casenumber,
occurred FROM accidents INTERSECT
SELECT
casenumber,
arrested FROM arrests

The above query will first return all case numbers and reported incident dates that occur both within incidents table and the arrests table.  Next all case numbers and accident dates will be returned that are also within the arrests table.  Finally the result sets will be UNION’ed and returned as a single result set:
image Precedence can be specified by placing parenthesis around queries utilizing a set operator.  In the query above the UNION operator can be given precedence over the INTERSECT by placing the two queries utilizing UNION in parenthesis:
SELECT casenumber,
reported FROM incidents INTERSECT (SELECT casenumber,
arrested FROM arrests UNION
SELECT
casenumber,
occurred FROM accidents INTERSECT
SELECT
casenumber,
arrested FROM arrests)

image
In the above query the result set from the UNION is returned, and precedence is given to the UNION over the INTERSECT due to this, the INTERSECT query is then evaluated against the UNION’ed results and as such one less row is returned.
CONCLUSION
INTERSECT, EXCEPT, and UNION are all set operators that act on a minimum of two queries.  INTERSECT and EXCEPT can be re-written in several different ways, using joins or utilizing IN/NOT IN within the WHERE clause.  The result set returned by all three operators assures distinctness of each row by default and only UNION supports ANSI defined ALL, the book T-SQL Fundamentals by Itzik Ben Gan outlines ways of mimicking ANSI INTERSECT ALL and EXCEPT ALL using t-sql.  MSDN and books online also document these set operators,http://msdn.microsoft.com/en-us/library/ms188055.aspx

Comment ( 1 )

Leave a Reply


Hit Counter provided by Sign Holders