Monday, March 19, 2012

Nested Loops vs. Hash Match

I have two tables, Users and UserFile. Users.UserID is a PK, and
UserFile.UserID is a FK to Users. The relationship between Users and
UserFile is one-to-many.
I have the following two queries:
--Query 1
select count(*)
from Users
join UserFile
on UserFile.UserID = Users.UserID
where Users.UserLastActive > getdate() - 14
--Query 2
select count(*)
from Users
join UserFile
on UserFile.UserID = Users.UserID
where Users.UserLastActive > getdate() - 14
and UserFile.UserFileBlockTrades = 0
The only difference is the last line of Query 2, which is omitted in Query 1.
The selectivity of the Users filter is quite high -- about 6%. The
selectivity of the UserFile filter is quite low -- about 96%.
Here's the problem. Query 1 results in Index Seeks on both tables, with a
Nested Loops join joining the two tables. However, Query 2 results in an
Index Seek on Users, but an Index Scan on UserFile, as well as the more
costly Hash Match join joining the two tables. The result is, Query 2 has
about 4x the CPU cost of Query 1.
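For reference, the plans and operator costs described above can be captured as text with standard SQL Server commands; a minimal sketch:

```sql
-- Show the estimated plan as text instead of executing the query
set showplan_text on
go
select count(*)
from Users
join UserFile on UserFile.UserID = Users.UserID
where Users.UserLastActive > getdate() - 14
go
set showplan_text off
go
```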
Here's my theory.
Query 1 is saying (in English), take all Users (55,100), filter by
UserLastActive (3,226), and join the remaining records on UserFile (8,307).
Query 2 is saying, take all Users (55,100), filter by UserLastActive
(3,226), then take all UserFiles (68,617), filter by UserFileBlockTrades
(65,814), and join the two results (7,520).
Running a filter (UserFileBlockTrades) with such a low selectivity on such a
large set of records is obviously going to be costly. So, my question is,
is there a way to restructure my query so that SQL will do this instead:
Take all Users (55,100), filter by UserLastActive (3,226), join the
remaining records on UserFile (8,307), and filter by UserFileBlockTrades
(7,520).
The selectivity of the UserFileBlockTrades filter will still be low, but it
will be dealing with a much smaller set of data. I don't know if this is
possible, but I would think it would be, considering the fact that this
seems like something that would happen quite a bit.
The following gave me the desired execution plan and query cost (similar to
plan and cost of Query 1), but it uses a temp table which I don't want to
(and shouldn't have to) do:
declare @Table table(UserFileID int)
insert into @Table
select UserFile.UserFileID
from Users
join UserFile
on UserFile.UserID = Users.UserID
where Users.UserLastActive > getdate() - 14
select count(*)
from @Table t
join UserFile
on UserFile.UserFileID = t.UserFileID
where UserFileBlockTrades = 0
I tried to use a derived table to force the low selectivity filter after
everything else has been done:
select count(*)
from (
select UserFile.*
from Users
join UserFile
on UserFile.UserID = Users.UserID
where Users.UserLastActive > getdate() - 14
) UserFile
where UserFile.UserFileBlockTrades = 0
But the execution plan was exactly the same as the one from Query 2, so this
had zero effect.
Basically, I'm just trying to force it to filter UserFileBlockTrades *after*
the join, instead of before. Surely this is possible.
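One heavy-handed way to get that shape -- a last resort, since it overrides the optimizer -- is a join hint; a sketch, untested against this schema:

```sql
-- Force a nested-loops join; FORCE ORDER also pins the written join order.
-- Re-evaluate hints like this whenever the data distribution changes.
select count(*)
from Users
inner loop join UserFile
    on UserFile.UserID = Users.UserID
where Users.UserLastActive > getdate() - 14
    and UserFile.UserFileBlockTrades = 0
option (force order)
```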
Any ideas or suggestions? Let me know if you need any other info from me.
Thanks in advance for your help.
Jerad
|||Hi Jerad
The answer to this most likely lies in indexing, not re-structuring the
query. What indexes are available on these tables? Running sp_helpindex
[tablename] against both tables would give us a fuller picture.
As a starter, I'd suggest that the following two indexes should be there,
but the recommendation could change depending on whether either table has
clustered indexes & on which columns:
Users (UserLastActive, UserID)
UserFile (UserID, UserFileBlockTrades)
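In DDL form, those would look something like this (index names are placeholders; adjust to suit the existing clustered index, if any):

```sql
create index IX_Users_LastActive_UserID
    on Users (UserLastActive, UserID)

create index IX_UserFile_UserID_BlockTrades
    on UserFile (UserID, UserFileBlockTrades)
```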
HTH
Regards,
Greg Linwood
SQL Server MVP
|||Oh wow, you're exactly right, Greg. I didn't have an index set up for
UserFileBlockTrades, and once I added it, both queries generated the exact
same plan.
Here's my question now (just so I understand this). The reason I didn't
have an index on UserFileBlockTrades in the first place, is because it is a
bit field, and the selectivity on it is extremely low (as I said in my
original post). So I thought SQL would ignore this index anyway.
So can you help me understand why this made a difference (why SQL did, in
fact, use my index), even considering the low selectivity of that filter? I
generally (if not always) avoid indexing my bit fields for this reason, but
now I see it is needed in some cases.
Thanks so much for your help.
Jerad
|||It seems the optimiser found that bit column helpful in this case, but
keep in mind that including columns in indexes isn't always about
selectivity & row identification. It usually helps to have all columns
included in indexes in situations like this (whether in the WHERE, JOIN or
SELECT part of the query), at least where not too many columns are involved.
This allows SQL Server to find all the information in indexes, without
having to go back to the underlying table structures. This is important
because indexes are packed far more densely than tables (many more rows per
page) & can therefore usually give far higher performance.
HTH
Regards,
Greg Linwood
SQL Server MVP
|||On Sun, 6 Nov 2005 01:04:53 -0500, Jerad Rose wrote:

Hi Jerad,
You didn't mention whether the new index was actually used in the
execution plans. Since you're saying that both queries are now using the
exact same plan, I suspect that the index was NOT used.
My assumption on why the index made a difference - I guess that there
were no statistics yet for the UserFileBlockTrades column. Since it's a
bit column, the optimizer would probably have estimated a selectivity of
50%. After adding the index, there were statistics; the optimizer now
knows that the selectivity will actually be 96% and decides on another
plan.
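That guess can be checked directly; a sketch using standard commands (the index name here is hypothetical):

```sql
-- List the statistics that exist on the table;
-- auto-created column statistics show up with _WA_Sys_ names
exec sp_helpstats 'UserFile'

-- Inspect the histogram behind the new index's statistics
dbcc show_statistics ('UserFile', 'IX_UserFile_UserID_BlockTrades')
```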
Best, Hugo
--
(Remove _NO_ and _SPAM_ to get my e-mail address)
|||Jerad Rose (no@.spam.com) writes:
> Oh wow, you're exactly right Greg. I didn't have an index set up for
> UserFileBlockTrades, and once I added it, both queries generated the exact
> same plan.
I think Hugo's remark about statistics is right on the money, but as an
addendum, permit me to point out a gotcha with index on bit columns. This
was the query:
select count(*)
from Users
join UserFile
on UserFile.UserID = Users.UserID
where Users.UserLastActive > getdate() - 14
and UserFile.UserFileBlockTrades = 0
Say now that UserFileBlockTrades = 0 would be very selective, and give a
mere handful of rows. Would the index be used now? Probably not, because
you need to do this:
and UserFile.UserFileBlockTrades = convert(bit, 0)
This is because of the rules for implicit conversion in SQL Server, which
say that these conversions are always performed according to a data-type
precedence order (which is in Books Online). Bit is converted to integer,
and 0 is an integer.
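Applied to Query 2, the rewritten predicate would look like this:

```sql
select count(*)
from Users
join UserFile
    on UserFile.UserID = Users.UserID
where Users.UserLastActive > getdate() - 14
    -- compare bit to bit, so the indexed column isn't implicitly converted
    and UserFile.UserFileBlockTrades = convert(bit, 0)
```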
I'm adding this because I made exactly this mistake recently when I added
indexes on some bit columns, and then was surprised that the performance
boost was nowhere near to what I expected it to be.
--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se
Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp
|||Hugo's point about statistics is probably relevant, but the fact that the
new indexes precisely cover the query probably has more to do with the
consistency in the plans, because the optimiser will take their i/o
efficiency into account. Obviously it's better to pull the
UserFileBlockTrades value out of the covering index than do a bookmark or
heap lookup to get the value during a seek.
Of course it would be far clearer if Jerad would post the table structures,
index structures & execution plans.
Regards,
Greg Linwood
SQL Server MVP
|||On Sun, 6 Nov 2005 22:24:47 +1100, Greg Linwood wrote:

Hi Greg,
Ah, I now see that I've been guilty of sloppy reading. I didn't check
the indexes you suggested carefully enough; I missed that you included
the joining column as well as the searched column in the index.
That brings me to yet another theory: maybe the optimizer decided to use
the new index because there was at first not a single index on the
joining column. That could be verified by changing the index to include
only the joining column.
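In other words (hypothetical index name, SQL Server 2000 syntax):

```sql
-- Rebuild the index keyed on the join column alone,
-- then compare the two query plans again
drop index UserFile.IX_UserFile_UserID_BlockTrades
create index IX_UserFile_UserID on UserFile (UserID)
```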

>Of course it would be far clearer if Jerad would post the table structures,
>index structures & execution plans.
Yes, indeed!
Best, Hugo
--
(Remove _NO_ and _SPAM_ to get my e-mail address)
|||The new theory is probably correct as well, but I wasn't sure what indexes
he had on those tables, so I just suggested new ones.
If Jerad sends through his existing indexes & the showplan, it'd make things
nice & clear! (c:
Regards,
Greg Linwood
SQL Server MVP
