-
Notifications
You must be signed in to change notification settings - Fork 182
Description
Using slice_sample()
on an MSSQL backend always return the same rows:
df <- tbl(con, "mtcars")
df |> slice_sample(n = 3)
#> # Source: SQL [3 x 12]
#> # Database: Microsoft SQL Server 14.00.2052[***]
#> car mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 W… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
df |> slice_sample(n = 3)
#> # Source: SQL [3 x 12]
#> # Database: Microsoft SQL Server 14.00.2052[***]
#> car mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 W… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
df |> slice_sample(n = 3)
#> # Source: SQL [3 x 12]
#> # Database: Microsoft SQL Server 14.00.2052[***]
#> car mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 W… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
Created on 2024-05-16 with reprex v2.1.0
The issue is with the RAND()
function. RAND()
on an MSSQL backend will always return the same value for each row in the result set. As such, ordering the rows by the return value of RAND()
and selecting the top n
rows will just return the data in its initial order:
dbGetQuery(con, "SELECT TOP(3) *, RAND() as rand FROM mtcars")
#> car mpg cyl disp hp drat wt qsec vs am gear carb rand
#> 1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0.1899416
#> 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0.1899416
#> 3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 0.1899416
Created on 2024-05-16 with reprex v2.1.0
An alternative approach to generating random numbers on an MSSQL backend is given here. It works by using NEWID()
to return a globally unique identifer for each row, hashing the GUID with CHECKSUM()
and using this as a seed for RAND()
:
dbGetQuery(con, "SELECT TOP(3) *, RAND(CHECKSUM(NEWID())) as rand FROM mtcars")
#> car mpg cyl disp hp drat wt qsec vs am gear carb rand
#> 1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0.78626466
#> 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0.31999640
#> 3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 0.03604049
Created on 2024-05-16 with reprex v2.1.0
Changing the runif()
function translation in the MSSQL backend should be straightforward:
runif = function(n = n(), min = 0, max = 1) {
sql_runif(RAND(CHECKSUM(NEWID())), n = {{ n }}, min = min, max = max)
}