
Wednesday, March 21, 2012

Read Consistency

What is the mechanism used by select statements to return point in time data?

I have a test setup: table t1 has 1,000,000 rows. A query (call it q1) that selects all the rows (using the NOLOCK hint) and processes them takes 10 minutes. At the same time, another process inserts another 1,000,000 rows into the same table t1. As expected, the client that issued query q1 sees just the original 1,000,000 rows.
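The setup can be sketched as follows (table and column names here are placeholders for illustration, not the actual schema):

```sql
-- Session 1: long-running read with the NOLOCK hint (takes no shared locks)
SELECT *
FROM t1 WITH (NOLOCK);   -- NOLOCK is equivalent to READUNCOMMITTED

-- Session 2: concurrent load running while the query above is still scanning
INSERT INTO t1 (col1, col2)
SELECT col1, col2
FROM staging_t1;         -- another 1,000,000 rows
```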

My understanding is that NOLOCK does not take any locks. So how did SQL Server know that it should not return the rows that were inserted after I issued query q1? An explanation, or a link to a whitepaper, would be helpful.

Thanks

Unless you use the row-level versioning introduced in SQL Server 2005, a locking-based database engine has no notion of 'point-in-time' data for some fixed time in the past. Consistency is guaranteed only until the transaction commits, and only for the precise subset of the data the transaction read or wrote, plus the data it intended to read but which didn't exist when the transaction tried to access it.

Instead, the engine provides the illusion of serialized execution: while a transaction is being processed, the view of the data it accesses and touches appears 'frozen'. Whatever other transactions did either occurred in the past, will occur in the future, or doesn't matter because those transactions never read or wrote the data our transaction touched. As a result, the data seen by a single transaction can ultimately be viewed as a consistent slice of the relevant subset of the database: as of 'now' while the transaction is active, and as of commit time after it commits.

Having said that, I realize it might sound cryptic and confusing, but this is how things are done in locking-based transaction processing systems. If you come from the Oracle world, it will take some time to adjust to the different paradigm.

|||

Thanks Tengiz but ...

In my scenario, let us say that query q1 started at 10 AM and finished at 10:10 AM. At 10 AM the table had 1,000,000 rows, and by 10:05 AM other transactions had inserted 1,000,000 more, for a total of 2,000,000 rows. Why didn't SQL Server return 2,000,000 rows to the q1 client? Somehow SQL Server knew to return the rows that existed at 10 AM (what I call point-in-time data; maybe that's the wrong terminology?) and to ignore the rows added after that point. What internal mechanism does the engine use to give that illusion? I am inclined to think that even though it does not take locks, it might create some sort of semaphores, latches, or in-memory tables to keep track of which rows need to be returned to the q1 client. As you guessed, I come from an Oracle background; maybe you have already answered my question and it will just take me a little longer to get it.

|||

Could you please be more specific in describing your scenario? The fact that the query returned only the initial set of rows and didn't see the rows inserted after it started does not mean that the server somehow knows, or takes into account, when the rows were inserted.

Again, things are different if you use row-level versioning: you get it either by switching to snapshot isolation (after enabling it for the database) or by enabling versioning-based read committed isolation.
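For reference, row-level versioning is enabled per database; a minimal sketch (the database name is a placeholder):

```sql
-- Allow transactions to request SNAPSHOT isolation explicitly
ALTER DATABASE MyDw SET ALLOW_SNAPSHOT_ISOLATION ON;

-- Or make the default READ COMMITTED level use row versions instead of locks
ALTER DATABASE MyDw SET READ_COMMITTED_SNAPSHOT ON;

-- A session can then opt in to snapshot isolation:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
```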

But assuming that you don't use row-level versioning, then depending on the existing indexes, the actual query plans, and the key values that existed in the table before the insert versus the key values inserted, a select query with the right timing could easily skip the newly inserted records. That, however, has nothing to do with the read consistency provided by row-level versioning.

|||

Thanks to Tengiz for your interest and perseverance in helping me out.

The database is not set up to use row versioning. Also, this is a data warehouse system, so DML statements can come only from ETL; no other concurrent user touches this table while this program is running. However, my SSIS package that maintains this table opens two sessions (a reader and a writer), as I explain in step 4. My concern is that these two sessions might step on each other.

-

Step 0:

-- display isolation level

dbcc useroptions

isolation level = read committed

-

Step 1:

CREATE TABLE Table1 (
Column1 [int] NOT NULL,
Column2 [int] NOT NULL,
Column3 [bit] NOT NULL,
Column4 [datetime] NOT NULL,
Column5 [int] NOT NULL,
Column6 [int] NOT NULL,
Column7 [varchar](255) NULL,
Column8 [int] NULL,
Column9 [int] NULL,
Column10 [int] NULL,
CONSTRAINT [PKC_REALDB_StatusHistory] PRIMARY KEY CLUSTERED (
[Column1] ASC,
[Column2] ASC,
[Column4] ASC,
[Column5] ASC,
[Column6] ASC )
)

-

Step 2:

INSERT INTO Table1
SELECT *
FROM Source_Table1

10,556,214 rows inserted.

-

Step 3:

-- delete the rows to simulate unexpected results due to dirty reads

delete
from SourceStage.RealDB_StatusHistory
where Column1 % 2 = 0

5,288,119 rows deleted

-

Step 4:

Now, I have an SSIS package that does an update-else-insert operation. It uses an SSIS left outer merge join to decide update versus insert. Since I use a table lock in the destination component, the source component (the one that feeds the data from the target table into the merge join) uses the NOLOCK hint. The destination component uses fast load with a batch size of 1000 rows. The rows to be updated are saved to an empty intermediate table, and a Transact-SQL UPDATE statement is run after the data flow is done to apply them to Table1.

5,288,119 rows inserted

5,268,095 rows updated
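The post-data-flow update described above might look roughly like this (a hypothetical sketch; the staging table name and the updated column are assumptions, and the join mirrors the clustered key):

```sql
-- Apply staged changes to the target after the data flow finishes,
-- matching on the same key columns as the clustered primary key
UPDATE t
SET t.Column7 = s.Column7
FROM Table1 t
JOIN UpdateStage s
  ON  t.Column1 = s.Column1
  AND t.Column2 = s.Column2
  AND t.Column4 = s.Column4
  AND t.Column5 = s.Column5
  AND t.Column6 = s.Column6;
```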

-

I repeated the above steps without the clustered index and got the same row counts. The counts show no surprises, which leads me to believe there is some kind of lock. I have to make sure this works 100% of the time before I put this code into production, but the documentation leads me to believe that it does not. If that is the case, I should be able to simulate a scenario where I get wacky row counts.

Here is what I think is the reason I am not getting wacky row counts in my setup: (a) when the clustered index is in place, the tree is rebalanced during the delete operation, so the inserted rows go into new pages and SQL Server somehow knows to ignore them; (b) when there is no clustered index, the data is sorted in tempdb before being fed to SSIS, so during the SSIS pipeline the data is essentially sourced from tempdb. Am I thinking in the right direction?

So the big question is, can I use NOLOCK without any bad effects in this scenario? If you believe this will lead to some dirty read scenario, how can I simulate it?

|||

A quick answer to your question "can I use NOLOCK without any bad effects in this scenario?" is NO. The NOLOCK hint relaxes certain concurrency-related guarantees in the engine and essentially nullifies the notion of transactional consistency. It doesn’t mean that you will never get any consistency if you use NOLOCK, but you will not in general have predictable results.
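One way to simulate the unpredictability is with an uncommitted transaction in a second session (a sketch using a throwaway table name):

```sql
-- Session 1: insert rows but do not commit yet
BEGIN TRANSACTION;
INSERT INTO DirtyDemo (Id) VALUES (1), (2), (3);
-- ...leave the transaction open...

-- Session 2: a NOLOCK read sees the uncommitted rows
SELECT COUNT(*) FROM DirtyDemo WITH (NOLOCK);

-- Session 1: roll back; the rows session 2 just counted never
-- logically existed, which is exactly the dirty-read hazard
ROLLBACK TRANSACTION;
```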

I'm not an SSIS expert, so I'm not sure how SSIS really performs the 'insert else update' operation, and I still don't quite get what you actually do in this scenario. But from the description it looks like the plan does include spooling in tempdb for a sort: the fast load option normally feeds data in through the BCP API, which, if the destination table has a clustered index, assumes the data needs to be sorted before it is delivered to the destination. Hence, if the input provided by SSIS is not sorted (there is a special hint SSIS can specify to avoid the extra sort), the query optimizer adds a sort operator.

Spooling can certainly make it look like there indeed was some kind of 'read consistency' provided, but, again, depending on the actual query and specific conditions the optimizer is free to choose other options as well.

|||Thanks for the reply. Thinking about it further, when I don't use the NOLOCK hint, my SSIS package just waits forever; when I do use it, things seem to work fine. But what if SQL Server does not honour the NOLOCK hint? My package might wait forever. So I decided to use the Lookup Transformation of SSIS rather than the Merge Join Transformation. The Lookup Transformation can cache the data up front, before the Data Flow Task starts processing source rows, so there is no contention.

Tuesday, March 20, 2012

Re-"pointing" Stored Procedures

How do I "point" a set of stored procedures to operate on different linked servers?

In other words, I have the following linked servers:

DatabaseA
DatabaseB
DatabaseC
DatabaseD
...and in the future, there may be added additional linked servers.

All the linked servers have identical schema, but they contain unique data--each linked server represents a company.

I have a database which will contain stored procedures which I will want to operate against these linked servers. How can I "redirect" my stored procedures to operate against a chosen linked server?

If these were not linked servers, but SQL Server databases, I'd be able to replicate the same stored procedures in each database. Then, when I called a stored procedure, it would act against the data in that database. But these aren't SQL Server databases, so that idea is out.

Unfortunately, the USE command cannot be used within a stored procedure, and even if you could, I can't get it to respond to a database name given as a variable. That idea is out.

The only alternative I have left is to use the string-concatenation facility of the EXECUTE command. Unfortunately, with hundreds of complex queries, setting that up is going to be a nightmare.
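For what it's worth, the string-concatenation approach looks roughly like this (the procedure name, catalog, schema, and table are placeholders):

```sql
CREATE PROCEDURE GetCompanyOrders
    @LinkedServer sysname
AS
BEGIN
    DECLARE @Sql nvarchar(4000);
    -- QUOTENAME guards against a malformed or malicious server name
    SET @Sql = N'SELECT * FROM '
             + QUOTENAME(@LinkedServer)
             + N'.Catalog.Schema.Orders';
    EXECUTE (@Sql);
END
```

The pain point is that every one of the hundreds of queries would need this treatment.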

Does anyone have any ideas?|||With all due respect, having separate companies in separate databases is a bad idea. You've set a new standard - separate companies in separate databases on linked servers that are not SQL Server. Congrats.

Assuming this design is something over which you have no control (sounds like a consulting gig), how 'bout another new standard - separate SQL Server databases with stored procedures that act on separate linked servers (which are really linked databases) that are not SQL Server? You maintain 1:1 relationship b/w SQL Server databases and the linked servers their stored procedures act on.

Then, you write a stored procedure generator that replaces all linked server references with the four-part name to which the database corresponds (using syscomments, PatIndex, and cursors) and sticks 'em in the right database. Ugly, eh? Remember, we're reaching for new heights.

You will need a separate database that is your (source) control database in which your stored procedures originate - sounds like you already have that - and a table that maps your SQL Server databases to your linked servers.
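A rough sketch of the generator's core step, reading procedure text from the control database and retargeting the linked server reference (all names hypothetical):

```sql
-- Pull each source procedure's text and swap in the target linked server.
-- Note: syscomments splits long definitions across multiple rows, so a real
-- generator must reassemble the text in colid order before replacing.
SELECT o.name,
       REPLACE(c.text, 'SourceLinkedServer', 'TargetLinkedServer') AS newtext
FROM syscomments c
JOIN sysobjects o ON o.id = c.id
WHERE o.type = 'P'
ORDER BY o.name, c.colid;
```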

Good luck.|||Wow, that sounds like one heck of a way to get hurt really, really bad!

Could you build views that combine the data from all of the various servers, or DTS packages to scoop them into a single container to make management simpler?

If you really, truly want to continue to process the data on N different servers (with N being a variable, not a constant), then I'd suggest that you convert your stored procedures to DTS packages... These could operate against DSN or udl names, which would at least confine the chaos to a much smaller area.

Good luck!
-PatP|||Pat and Max, you guys both came up with good ideas.

Yes, this is a consulting gig. Yes, these are pre-existing PervasiveSQL databases which are the backend to someone else's product. The separate database for each company is their concept and there is nothing I can do about that.

We have an existing reporting package that operates against an MSSQL backend. Our client has asked us to port our product to work with this PervasiveSQL-multi-company arrangement.

I like both your ideas. Here's another I came up with while waiting for responses:

Write my stored procedures using a token for the database name. Store those stored procedures as text in a table, much like syscomments stores stored procedures. Then write a "master" execution stored procedure which loads the stored procedure text from that table, uses the REPLACE command to substitute the correct database name for the token, and then uses the EXECUTE command to run it. The problem I see with my idea, compared to your ideas, is that I am not actually executing stored procedures, so I lose any precompilation advantage of a true stored procedure.

Thanks for your ideas, guys.|||The problem with objects being directly accessed from a linked server is the long naming convention. I guess you guys would agree that most bugs related to objects from linked servers are due to typos in the full object name.

In one of my projects here in the Philippines, a GIS application needs to pull data from a central data repository. The problem is the GIS server (with its own database server) is not allowed to directly access the data repository.

As a work around, I defined both the GIS database server and the central repository in another server as linked servers. Thus, the two servers are linked via the third server. then I did the following:

1. I defined the details of the data needed by the GIS apps and put them in a view in the central repository.
CREATE VIEW vw_PROPERTY_RE_GIS
AS
SELECT B.s_prop_no, B.s_re_no, A.s_PLOP, A.s_loc_id
FROM T_PROPERTY A, T_RE_PROPERTY B
WHERE A.s_prop_no = B.s_prop_no AND A.s_loc_id = b.s_loc_id

2. I defined a view in the third server to access the view in the data repository.
CREATE VIEW vw_PROPERTY_RE_GIS
as
SELECT * FROM CNTRL_DB.REMS.REMS.vw_PROPERTY_RE_GIS

3. I defined a view in the GIS DB to access the view in the third server.
CREATE VIEW vw_PROPERTY_RE_GIS
as
SELECT * FROM ALPS.REMS.ALPS.vw_PROPERTY_RE_GIS

Thus, whenever my GIS app needs information about a certain real estate record, it simply issues the query SELECT * FROM vw_PROPERTY_RE_GIS.

And if by chance there is a need to add more fields to the extracted data, I just have to modify the view on the central repository.

This approach works if there is restriction on how servers are linked and performance is not so much an issue.|||I would NEVER recommend this, but you can use the USE statement in a stored procedure and pass the linked server name as a variable. But you have to use...wait for it...DYNAMIC SQL.

(Pause for shrieks of horror from all competent DBAs...)

I've only implemented this once, for a database schema snapshot application that had to run against any and all databases on a server. It was a mess to program, but it has run very smoothly and reliably since then. The code hasn't changed much in four years and two SQL Server upgrades.|||Blindman,

How were you able to accomplish this ?

I've attempted to use the USE statement in stored procedures and, when compiled, they've always come back with an error stating that the USE statement is not permitted in stored procedures. The USE page in Books Online also states this.

Additionally, the use of a variable for the database name with the USE statement is also prohibited. I've tested this several times and have not been successful. Apparently, the USE statement requires a literal for the database name.

If you've found some way to get around these problems, let me know, 'cause that would make tackling them a whole lot easier.

Thanks.

Ken|||create procedure GetDataFromPubs
as
begin
declare @SQLString varchar(4000)
set @SQLString = 'Use Pubs Select * from Authors'
execute (@SQLString)
end

The executed statement runs in its own scope, so after completion focus returns to the calling database.
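A parameterized variant of the same trick, taking the database name as an argument (an untested sketch; the procedure name is made up, and the USE only affects the dynamic batch's own scope):

```sql
CREATE PROCEDURE GetAuthorsFrom
    @DbName sysname
AS
BEGIN
    DECLARE @Sql varchar(4000);
    -- QUOTENAME protects against a database name that breaks the batch
    SET @Sql = 'USE ' + QUOTENAME(@DbName) + ' SELECT * FROM Authors';
    EXECUTE (@Sql);
END

-- Usage:
-- EXEC GetAuthorsFrom 'Pubs'
```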