-
Notifications
You must be signed in to change notification settings - Fork 182
Description
Description
When using dbplyr_uncount()
on a table within a Postgres or AWS Redshift database, we always fail with an error which mentions the seq2()
function. It's not immediately obvious why this happens.
Minimal example
The first few lines will need to be adjusted to create a valid connection/table.
# Establish a connection to a Redshift database
con <- RPostgres::dbConnect(...)
# Create a simple demo query - I'm using a "calendar" table I happen to have available here
q <- dplyr::tbl(con, I("lookups.calendar")) |>
dplyr::select(date) |>
dplyr::filter(date >= "2025-04-01", date <= "2025-04-03")
# Show current results
dplyr::collect(q)
#> # A tibble: 3 × 1
#> date
#> <date>
#> 1 2025-04-01
#> 2 2025-04-02
#> 3 2025-04-03
# All fine so far!
# Now suppose we want to duplicate every row...
dbplyr::dbplyr_uncount(q, 2)
#> Error in `seq2()`:
#> ! `to` must be length one.
Created on 2025-04-01 with reprex v2.1.1
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.3.2 (2023-10-31)
#> os Amazon Linux 2023.6.20250123
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate C.UTF-8
#> ctype C.UTF-8
#> tz Europe/London
#> date 2025-04-01
#> pandoc 3.2 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> bit 4.5.0.1 2024-12-03 [2] CRAN (R 4.3.2)
#> bit64 4.6.0-1 2025-01-16 [2] CRAN (R 4.3.2)
#> blob 1.2.4 2023-03-17 [2] CRAN (R 4.3.2)
#> box 1.2.0 2024-02-06 [2] CRAN (R 4.3.2)
#> class 7.3-23 2025-01-01 [3] CRAN (R 4.3.2)
#> classInt 0.4-11 2025-01-08 [2] CRAN (R 4.3.2)
#> cli 3.6.3 2024-06-21 [2] CRAN (R 4.3.2)
#> DBI 1.2.3 2024-06-02 [2] CRAN (R 4.3.2)
#> dbplyr 2.5.0 2024-03-19 [2] CRAN (R 4.3.2)
#> digest 0.6.37 2024-08-19 [2] CRAN (R 4.3.2)
#> dplyr 1.1.4 2023-11-17 [2] CRAN (R 4.3.2)
#> e1071 1.7-16 2024-09-16 [2] CRAN (R 4.3.2)
#> evaluate 1.0.3 2025-01-10 [2] CRAN (R 4.3.2)
#> fastmap 1.2.0 2024-05-15 [2] CRAN (R 4.3.2)
#> fs 1.6.5 2024-10-30 [2] CRAN (R 4.3.2)
#> generics 0.1.3 2022-07-05 [2] CRAN (R 4.3.2)
#> glue 1.8.0 2024-09-30 [2] CRAN (R 4.3.2)
#> hms 1.1.3 2023-03-21 [2] CRAN (R 4.3.2)
#> htmltools 0.5.8.1 2024-04-04 [2] CRAN (R 4.3.2)
#> KernSmooth 2.23-26 2025-01-01 [3] CRAN (R 4.3.2)
#> knitr 1.49 2024-11-08 [2] CRAN (R 4.3.2)
#> lifecycle 1.0.4 2023-11-07 [2] CRAN (R 4.3.2)
#> lubridate 1.9.4 2024-12-08 [2] CRAN (R 4.3.2)
#> magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.3.2)
#> pillar 1.10.1 2025-01-07 [2] CRAN (R 4.3.2)
#> pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.3.2)
#> proxy 0.4-27 2022-06-09 [2] CRAN (R 4.3.2)
#> purrr 1.0.2 2023-08-10 [2] CRAN (R 4.3.2)
#> R6 2.5.1 2021-08-19 [2] CRAN (R 4.3.2)
#> Rcpp 1.0.14 2025-01-12 [2] CRAN (R 4.3.2)
#> reprex 2.1.1 2024-07-06 [2] CRAN (R 4.3.2)
#> rlang 1.1.5 2025-01-17 [2] CRAN (R 4.3.2)
#> rmarkdown 2.29 2024-11-04 [2] CRAN (R 4.3.2)
#> RPostgres 1.4.7 2024-05-27 [2] CRAN (R 4.3.2)
#> rstudioapi 0.17.1 2024-10-22 [2] CRAN (R 4.3.2)
#> sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.3.2)
#> sf 1.0-19 2024-11-05 [2] CRAN (R 4.3.2)
#> tibble 3.2.1 2023-03-20 [2] CRAN (R 4.3.2)
#> tidyselect 1.2.1 2024-03-11 [2] CRAN (R 4.3.2)
#> timechange 0.3.0 2024-01-18 [2] CRAN (R 4.3.2)
#> units 0.8-5 2023-11-28 [2] CRAN (R 4.3.2)
#> vctrs 0.6.5 2023-12-01 [2] CRAN (R 4.3.2)
#> withr 3.0.2 2024-10-28 [2] CRAN (R 4.3.2)
#> xfun 0.50 2025-01-07 [2] CRAN (R 4.3.2)
#> yaml 2.3.10 2024-07-26 [2] CRAN (R 4.3.2)
#>
#> [1] /home/owen10004/efs/R/x86_64-amazon-linux-gnu-library/4.3
#> [2] /usr/lib64/R/site-library
#> [3] /usr/lib64/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
This example uses a constant weights
value, but we see the same error if we pass a column name for weights
.
Cause
Postgres converts all (unquoted) column names to lowercase. Redshift converts all column names in results, always, to lowercase. Within the query that's executed in order to calculate n_max
, this conversion leads to a name mismatch, resulting in an unexpected NULL
result.
Details
Early on in dbplyr::dbplyr_uncount()
, a query is constructed calculate the max number of repetitions, and is then executed implicitly via dplyr::pull()
:
Line 50 in 5726930
n_max <- pull(summarise(ungroup(data), max(!!sym(weights_col), na.rm = TRUE))) |
Note that the argument provided to the summarise()
query isn't named, so it's given a name using the expression itself (I think via set_names()
?) - which looks like:
max(..., na.rm = TRUE)
The call is to pull()
is dispatched to dbplyr::pull.tbl_sql()
, which:
-
Derives the name of the variable to extract:
Line 20 in 5726930
var <- tidyselect::vars_pull(vars, !!enquo(var), error_arg = "var") As we just saw, that variable is called something like
max(..., na.rm = TRUE)
. -
Collects query results from the database:
Line 29 in 5726930
.data <- collect(.data) However,
Postgres converts all (unquoted) column names to lowercaseRedshift converts all column names, always, in results to lowercase:There are two types of identifiers, standard identifiers and quoted or delimited identifiers. Identifiers must consist of only UTF-8 printable characters. ASCII letters in standard and delimited identifiers are case-insensitive and are folded to lowercase in the database. In query results, column names are returned as lowercase by default. -- Redshift docs
This means that the collected results contain a column called something like
max(..., na.rm = true)
. -
Uses the name to extract the appropriate column from the collected results:
Line 30 in 5726930
out <- .data[[var]] Due to the mismatch in names, we end up with
out <- NULL
So we end up with n_max <- NULL
, hence the error in the seq2()
call a little later:
Line 63 in 5726930
helper_table <- copy_inline(con, tibble(!!row_id_col := seq2(1, n_max))) |
Output from `debug()`
> debugonce(dbplyr:::pull.tbl_sql)
> dbplyr::dbplyr_uncount(q, 2)
debugging in: pull.tbl_sql(summarise(ungroup(data), max(!!sym(weights_col),
na.rm = TRUE)))
debug: {
vars <- tbl_vars(.data)
var <- tidyselect::vars_pull(vars, !!enquo(var), error_arg = "var")
name_quo <- enquo(name)
if (!quo_is_null(name_quo)) {
name <- tidyselect::vars_pull(vars, !!name_quo, error_arg = "name")
}
.data <- ungroup(.data)
.data <- select(.data, all_of(c(var, name)))
.data <- collect(.data)
out <- .data[[var]]
if (!is.null(name)) {
out <- set_names(out, nm = .data[[name]])
}
out
}
Browse[2]> n
debug: vars <- tbl_vars(.data)
Browse[2]> n
debug: var <- tidyselect::vars_pull(vars, !!enquo(var), error_arg = "var")
Browse[2]> n
debug: name_quo <- enquo(name)
Browse[2]> n
debug: if (!quo_is_null(name_quo)) {
name <- tidyselect::vars_pull(vars, !!name_quo, error_arg = "name")
}
Browse[2]> n
debug: .data <- ungroup(.data)
Browse[2]> n
debug: .data <- select(.data, all_of(c(var, name)))
Browse[2]> n
debug: .data <- collect(.data)
Browse[2]> n
debug: out <- .data[[var]]
Browse[2]> n
debug: if (!is.null(name)) {
out <- set_names(out, nm = .data[[name]])
}
Browse[2]> .data
# A tibble: 1 × 1
`max(..dbplyr_weight_col, na.rm = true)`
<dbl>
1 2
Browse[2]> var
[1] "max(..dbplyr_weight_col, na.rm = TRUE)"
Browse[2]> out
NULL