Wednesday, November 24, 2010

MSSQL: Sample Random Data

Recently I was faced by an interesting request. Someone wanted to get sample data from the database for Quality Assurance.

Problem:
Multiple users enter data into the database
A certain percentage must be sampled for QA's validation/verification, i.e. "Sample 10% daily for each user."

Assumptions:
We have the following database tables: Users, Documents, and DocumentsHistory.

Solution:
After some research I was able to find different approaches, like the use of TABLESAMPLE, and the use of NEWID(). I went for the latter, since TABLESAMPLE does not return a consistent number of records.



DECLARE @UserID BIGINT, @PercentageToSample INT 

SET @PercentageToSample = 10

CREATE TABLE #TEMP
(
    DocumentName VARCHAR(50)
)

DECLARE USER_CURSOR CURSOR FOR
SELECT  DISTINCT tUser.UserID FROM tUser 

OPEN USER_CURSOR
FETCH NEXT FROM USER_CURSOR INTO @UserID

WHILE @@FETCH_STATUS = 0
BEGIN

    INSERT INTO #TEMP SELECT TOP (@PercentageToSample) PERCENT Documents.DocumentName 
    FROM Documents (NOLOCK) 
    INNER JOIN DocumentsHistory (NOLOCK) ON Documents.DocumentID = DocumentsHistory.DocumentID
    WHERE DocumentsHistory.UserID = @UserID 
    AND DocumentsHistory.ProcessDate = '2010-11-24' 
    ORDER BY NEWID()

FETCH NEXT FROM USER_CURSOR INTO @UserID
END

SELECT #TEMP.DocumentName FROM #TEMP ORDER BY #TEMP.DocumentName 

CLOSE USER_CURSOR
DEALLOCATE USER_CURSOR
DROP TABLE #TEMP

What's key in this query: TOP n PERCENT and NEWID().

This will sample 10% for each user for any given day, and return the result in a temporal table. If you wish to return data for each user, then eliminate the CURSOR and focus on the internal query. Which will look like this:

SELECT TOP 10 PERCENT Documents.DocumentName 
FROM Documents (NOLOCK) 
INNER JOIN DocumentsHistory (NOLOCK) ON Documents.DocumentID = DocumentsHistory.DocumentID
WHERE DocumentsHistory.UserID = @UserID 
AND DocumentsHistory.ProcessDate = '2010-11-24' 
ORDER BY NEWID()

Modify as needed.

0 comments: