Skip to content

dbplyr_uncount() on Redshift tables fails with an opaque error #1601

@owenjonesuob

Description

@owenjonesuob

Description

When using dbplyr_uncount() on a table within a Postgres or AWS Redshift database, we always fail with an error which mentions the seq2() function. It's not immediately obvious why this happens.

Minimal example

The first few lines will need to be adjusted to create a valid connection/table.

# Establish a connection to a Redshift database
con <- RPostgres::dbConnect(...)

# Create a simple demo query - I'm using a "calendar" table I happen to have available here
q <- dplyr::tbl(con, I("lookups.calendar")) |> 
  dplyr::select(date) |> 
  dplyr::filter(date >= "2025-04-01", date <= "2025-04-03")

# Show current results
dplyr::collect(q)
#> # A tibble: 3 × 1
#>   date      
#>   <date>    
#> 1 2025-04-01
#> 2 2025-04-02
#> 3 2025-04-03

# All fine so far!

# Now suppose we want to duplicate every row...
dbplyr::dbplyr_uncount(q, 2)
#> Error in `seq2()`:
#> ! `to` must be length one.

Created on 2025-04-01 with reprex v2.1.1

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.2 (2023-10-31)
#>  os       Amazon Linux 2023.6.20250123
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  C.UTF-8
#>  ctype    C.UTF-8
#>  tz       Europe/London
#>  date     2025-04-01
#>  pandoc   3.2 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  bit           4.5.0.1 2024-12-03 [2] CRAN (R 4.3.2)
#>  bit64         4.6.0-1 2025-01-16 [2] CRAN (R 4.3.2)
#>  blob          1.2.4   2023-03-17 [2] CRAN (R 4.3.2)
#>  box           1.2.0   2024-02-06 [2] CRAN (R 4.3.2)
#>  class         7.3-23  2025-01-01 [3] CRAN (R 4.3.2)
#>  classInt      0.4-11  2025-01-08 [2] CRAN (R 4.3.2)
#>  cli           3.6.3   2024-06-21 [2] CRAN (R 4.3.2)
#>  DBI           1.2.3   2024-06-02 [2] CRAN (R 4.3.2)
#>  dbplyr        2.5.0   2024-03-19 [2] CRAN (R 4.3.2)
#>  digest        0.6.37  2024-08-19 [2] CRAN (R 4.3.2)
#>  dplyr         1.1.4   2023-11-17 [2] CRAN (R 4.3.2)
#>  e1071         1.7-16  2024-09-16 [2] CRAN (R 4.3.2)
#>  evaluate      1.0.3   2025-01-10 [2] CRAN (R 4.3.2)
#>  fastmap       1.2.0   2024-05-15 [2] CRAN (R 4.3.2)
#>  fs            1.6.5   2024-10-30 [2] CRAN (R 4.3.2)
#>  generics      0.1.3   2022-07-05 [2] CRAN (R 4.3.2)
#>  glue          1.8.0   2024-09-30 [2] CRAN (R 4.3.2)
#>  hms           1.1.3   2023-03-21 [2] CRAN (R 4.3.2)
#>  htmltools     0.5.8.1 2024-04-04 [2] CRAN (R 4.3.2)
#>  KernSmooth    2.23-26 2025-01-01 [3] CRAN (R 4.3.2)
#>  knitr         1.49    2024-11-08 [2] CRAN (R 4.3.2)
#>  lifecycle     1.0.4   2023-11-07 [2] CRAN (R 4.3.2)
#>  lubridate     1.9.4   2024-12-08 [2] CRAN (R 4.3.2)
#>  magrittr      2.0.3   2022-03-30 [2] CRAN (R 4.3.2)
#>  pillar        1.10.1  2025-01-07 [2] CRAN (R 4.3.2)
#>  pkgconfig     2.0.3   2019-09-22 [2] CRAN (R 4.3.2)
#>  proxy         0.4-27  2022-06-09 [2] CRAN (R 4.3.2)
#>  purrr         1.0.2   2023-08-10 [2] CRAN (R 4.3.2)
#>  R6            2.5.1   2021-08-19 [2] CRAN (R 4.3.2)
#>  Rcpp          1.0.14  2025-01-12 [2] CRAN (R 4.3.2)
#>  reprex        2.1.1   2024-07-06 [2] CRAN (R 4.3.2)
#>  rlang         1.1.5   2025-01-17 [2] CRAN (R 4.3.2)
#>  rmarkdown     2.29    2024-11-04 [2] CRAN (R 4.3.2)
#>  RPostgres     1.4.7   2024-05-27 [2] CRAN (R 4.3.2)
#>  rstudioapi    0.17.1  2024-10-22 [2] CRAN (R 4.3.2)
#>  sessioninfo   1.2.2   2021-12-06 [2] CRAN (R 4.3.2)
#>  sf            1.0-19  2024-11-05 [2] CRAN (R 4.3.2)
#>  tibble        3.2.1   2023-03-20 [2] CRAN (R 4.3.2)
#>  tidyselect    1.2.1   2024-03-11 [2] CRAN (R 4.3.2)
#>  timechange    0.3.0   2024-01-18 [2] CRAN (R 4.3.2)
#>  units         0.8-5   2023-11-28 [2] CRAN (R 4.3.2)
#>  vctrs         0.6.5   2023-12-01 [2] CRAN (R 4.3.2)
#>  withr         3.0.2   2024-10-28 [2] CRAN (R 4.3.2)
#>  xfun          0.50    2025-01-07 [2] CRAN (R 4.3.2)
#>  yaml          2.3.10  2024-07-26 [2] CRAN (R 4.3.2)
#> 
#>  [1] /home/owen10004/efs/R/x86_64-amazon-linux-gnu-library/4.3
#>  [2] /usr/lib64/R/site-library
#>  [3] /usr/lib64/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

This example uses a constant weights value, but we see the same error if we pass a column name for weights.

Cause

Postgres converts all (unquoted) column names to lowercase. Redshift converts all column names in results, always, to lowercase. Within the query that's executed in order to calculate n_max, this conversion leads to a name mismatch, resulting in an unexpected NULL result.

Details

Early on in dbplyr::dbplyr_uncount(), a query is constructed calculate the max number of repetitions, and is then executed implicitly via dplyr::pull():

n_max <- pull(summarise(ungroup(data), max(!!sym(weights_col), na.rm = TRUE)))

Note that the argument provided to the summarise() query isn't named, so it's given a name using the expression itself (I think via set_names()?) - which looks like:

max(..., na.rm = TRUE)

The call is to pull() is dispatched to dbplyr::pull.tbl_sql(), which:

  1. Derives the name of the variable to extract:

    var <- tidyselect::vars_pull(vars, !!enquo(var), error_arg = "var")

    As we just saw, that variable is called something like max(..., na.rm = TRUE).

  2. Collects query results from the database:

    .data <- collect(.data)

    However, Postgres converts all (unquoted) column names to lowercase Redshift converts all column names, always, in results to lowercase:

    There are two types of identifiers, standard identifiers and quoted or delimited identifiers. Identifiers must consist of only UTF-8 printable characters. ASCII letters in standard and delimited identifiers are case-insensitive and are folded to lowercase in the database. In query results, column names are returned as lowercase by default. -- Redshift docs

    This means that the collected results contain a column called something like max(..., na.rm = true).

  3. Uses the name to extract the appropriate column from the collected results:

    out <- .data[[var]]

    Due to the mismatch in names, we end up with out <- NULL

So we end up with n_max <- NULL, hence the error in the seq2() call a little later:

helper_table <- copy_inline(con, tibble(!!row_id_col := seq2(1, n_max)))

Output from `debug()`
> debugonce(dbplyr:::pull.tbl_sql)
> dbplyr::dbplyr_uncount(q, 2)
debugging in: pull.tbl_sql(summarise(ungroup(data), max(!!sym(weights_col), 
    na.rm = TRUE)))
debug: {
    vars <- tbl_vars(.data)
    var <- tidyselect::vars_pull(vars, !!enquo(var), error_arg = "var")
    name_quo <- enquo(name)
    if (!quo_is_null(name_quo)) {
        name <- tidyselect::vars_pull(vars, !!name_quo, error_arg = "name")
    }
    .data <- ungroup(.data)
    .data <- select(.data, all_of(c(var, name)))
    .data <- collect(.data)
    out <- .data[[var]]
    if (!is.null(name)) {
        out <- set_names(out, nm = .data[[name]])
    }
    out
}
Browse[2]> n
debug: vars <- tbl_vars(.data)
Browse[2]> n
debug: var <- tidyselect::vars_pull(vars, !!enquo(var), error_arg = "var")
Browse[2]> n
debug: name_quo <- enquo(name)
Browse[2]> n
debug: if (!quo_is_null(name_quo)) {
    name <- tidyselect::vars_pull(vars, !!name_quo, error_arg = "name")
}
Browse[2]> n
debug: .data <- ungroup(.data)
Browse[2]> n
debug: .data <- select(.data, all_of(c(var, name)))
Browse[2]> n
debug: .data <- collect(.data)
Browse[2]> n
debug: out <- .data[[var]]
Browse[2]> n
debug: if (!is.null(name)) {
    out <- set_names(out, nm = .data[[name]])
}
Browse[2]> .data
# A tibble: 1 × 1
  `max(..dbplyr_weight_col, na.rm = true)`
                                     <dbl>
1                                        2
Browse[2]> var
[1] "max(..dbplyr_weight_col, na.rm = TRUE)"
Browse[2]> out
NULL

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions