Title: | R Bindings for the 'polars' Rust Library |
---|---|
Description: | Lightning-fast 'DataFrame' library written in 'Rust'. Convert R data to 'Polars' data and vice versa. Perform fast, lazy, larger-than-memory and optimized data queries. 'Polars' is interoperable with the package 'arrow', as both are based on the 'Apache Arrow' Columnar Format. |
Authors: | Tatsuya Shima [aut, cre], Authors of the dependency Rust crates [aut] |
Maintainer: | Tatsuya Shima <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.0.9000 |
Built: | 2024-11-22 03:37:48 UTC |
Source: | https://github.com/pola-rs/r-polars |
The as_polars_df() function creates a polars DataFrame from various R objects.
A Polars DataFrame is based on a sequence of Polars Series, so the input object is first converted to a list of Polars Series by as_polars_series(), and then a Polars DataFrame is created from that list.
as_polars_df(x, ...)
## Default S3 method:
as_polars_df(x, ...)
## S3 method for class 'polars_series'
as_polars_df(x, ..., column_name = NULL, from_struct = TRUE)
## S3 method for class 'polars_data_frame'
as_polars_df(x, ...)
## S3 method for class 'polars_group_by'
as_polars_df(x, ...)
## S3 method for class 'polars_lazy_frame'
as_polars_df(
  x,
  ...,
  type_coercion = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  comm_subplan_elim = TRUE,
  comm_subexpr_elim = TRUE,
  cluster_with_columns = TRUE,
  no_optimization = FALSE,
  streaming = FALSE
)
## S3 method for class 'list'
as_polars_df(x, ...)
## S3 method for class 'data.frame'
as_polars_df(x, ...)
## S3 method for class 'NULL'
as_polars_df(x, ...)
x |
An R object. |
... |
Additional arguments passed to the methods. |
column_name |
A character or |
from_struct |
A logical. If |
type_coercion |
A logical indicating whether to apply the type coercion optimization. |
predicate_pushdown |
A logical indicating whether to apply the predicate pushdown optimization. |
projection_pushdown |
A logical indicating whether to apply the projection pushdown optimization. |
simplify_expression |
A logical indicating whether to apply the simplify expression optimization. |
slice_pushdown |
A logical indicating whether to apply the slice pushdown optimization. |
comm_subplan_elim |
A logical indicating whether to try to cache branching subplans that occur on self-joins or unions. |
comm_subexpr_elim |
A logical indicating whether to try to cache common subexpressions. |
cluster_with_columns |
A logical indicating whether to combine sequential independent calls to with_columns. |
no_optimization |
A logical. If |
streaming |
A logical. If |
The default method of as_polars_df() throws an error, so we need to define methods for the classes we want to support.
For a list, the argument ... (except name) is passed to as_polars_series() for each element of the list.
All elements of the list must be converted to Series of the same length by as_polars_series().
The name of each element is used as the column name of the DataFrame.
For unnamed elements, the column name will be an empty string "", or, if the element is a Series, the name of the Series.
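The naming rule for list elements can be sketched as follows. This is a minimal example, assuming the polars package is attached with library(polars); the object names (s, df) and the Series name "from_series" are illustrative only.

```r
library(polars)

# A Series that carries its own name
s <- as_polars_series(c("x", "y"), "from_series")

# One named element and one unnamed element that is already a Series
df <- as_polars_df(list(a = 1:2, s))

# Inspect the resulting column names through the documented
# as.data.frame() method: the named element keeps its name and the
# unnamed Series contributes the name of the Series
names(as.data.frame(df))
```

Both elements have length 2 here; elements of different lengths would violate the same-length requirement stated above.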
For a data frame, the argument ... (except name) is passed to as_polars_series() for each column.
All columns must be converted to Series of the same length by as_polars_series().
For a Series, this is a shortcut for <Series>$to_frame() or <Series>$struct$unnest(), depending on the from_struct argument and the Series data type.
The column_name argument is passed to the name argument of the $to_frame() method.
For a LazyFrame, this is a shortcut for <LazyFrame>$collect().
A polars DataFrame
as.list(<polars_data_frame>): Export the DataFrame as an R list.
as.data.frame(<polars_data_frame>): Export the DataFrame as an R data frame.
# list
as_polars_df(list(a = 1:2, b = c("foo", "bar")))
# data.frame
as_polars_df(data.frame(a = 1:2, b = c("foo", "bar")))
# polars_series
s_int <- as_polars_series(1:2, "a")
s_struct <- as_polars_series(
  data.frame(a = 1:2, b = c("foo", "bar")),
  "struct"
)
## Use the Series as a column
as_polars_df(s_int)
as_polars_df(s_struct, column_name = "values", from_struct = FALSE)
## Unnest the struct data
as_polars_df(s_struct)
The as_polars_expr() function creates a polars expression from various R objects.
This function is used internally by various polars functions that accept expressions.
In most cases, users should call pl$lit() instead; it is a shorthand for as_polars_expr(x, as_lit = TRUE).
(In other words, this function can be considered an internal implementation of the lit function found in the Polars API of other languages.)
as_polars_expr(x, ...)
## Default S3 method:
as_polars_expr(x, ...)
## S3 method for class 'polars_expr'
as_polars_expr(x, ..., structify = FALSE)
## S3 method for class 'polars_series'
as_polars_expr(x, ...)
## S3 method for class 'character'
as_polars_expr(x, ..., as_lit = FALSE)
## S3 method for class 'logical'
as_polars_expr(x, ...)
## S3 method for class 'integer'
as_polars_expr(x, ...)
## S3 method for class 'double'
as_polars_expr(x, ...)
## S3 method for class 'raw'
as_polars_expr(x, ...)
## S3 method for class 'NULL'
as_polars_expr(x, ...)
x |
An R object. |
... |
Additional arguments passed to the methods. |
structify |
A logical. If |
as_lit |
A logical value indicating whether to treat the vector as literal values.
This argument is always set to |
Because R objects are typically mapped to Series, this function often calls as_polars_series() internally.
However, unlike R, Polars has scalars, so if an R object is converted to a Series of length 1, this function gets the first value of the Series and converts it to a scalar literal.
If you want to implement your own conversion from an R class to a Polars object, define an S3 method for as_polars_series() instead of this function.
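A custom conversion of this kind might look like the sketch below. The class "celsius" and the constructor new_celsius() are hypothetical and exist only for illustration; the example assumes the polars package is attached. The explicit registerS3method() call is base R and is included so that dispatch also works when the code is not evaluated at top level; inside a package, the method would normally be registered through the NAMESPACE file instead.

```r
library(polars)

# Hypothetical user-defined class wrapping a double vector (not part of polars)
new_celsius <- function(x) structure(x, class = "celsius")

# S3 method: strip the class and delegate to the method for plain doubles
as_polars_series.celsius <- function(x, name = NULL, ...) {
  as_polars_series(unclass(x), name = name, ...)
}

# Register the method explicitly so it is found from inside polars as well
registerS3method("as_polars_series", "celsius", as_polars_series.celsius)

temps <- new_celsius(c(21.5, 23.0))

# Direct conversion to a Series
s <- as_polars_series(temps, "temp")

# No as_polars_expr() method is defined for "celsius", so the default
# method is used, which goes through as_polars_series()
e <- as_polars_expr(temps)
```

Defining only the as_polars_series() method keeps the type mapping in one place, so every function that relies on as_polars_series() internally picks up the new class.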
The default method creates a Series by calling as_polars_series() and then converts that Series to an Expr.
If the length of the Series is 1, it will be converted to a scalar value.
Additional arguments ... are passed to as_polars_series().
For a character vector, if the as_lit argument is FALSE (default), this function will call pl$col() and the character vector is treated as column names.
A polars expression
Since R has no scalar class, a length-1 vector of each of the following types is specially converted to a scalar literal.
character: String
logical: Boolean
integer: Int32
double: Float64
The NA of these types is converted to a null literal cast to the corresponding Polars type.
A raw vector is converted to a Binary scalar.
raw: Binary
NULL is converted to a null literal of the Null type.
NULL: Null
For other R classes, the default S3 method is called and the R object is converted via as_polars_series(), so the type mapping is defined by as_polars_series().
as_polars_series(): R -> Polars type mapping is mostly defined by this function.
# character
## as_lit = FALSE (default)
as_polars_expr("a") # Same as `pl$col("a")`
as_polars_expr(c("a", "b")) # Same as `pl$col("a", "b")`
## as_lit = TRUE
as_polars_expr(character(0), as_lit = TRUE)
as_polars_expr("a", as_lit = TRUE)
as_polars_expr(NA_character_, as_lit = TRUE)
as_polars_expr(c("a", "b"), as_lit = TRUE)
# logical
as_polars_expr(logical(0))
as_polars_expr(TRUE)
as_polars_expr(NA)
as_polars_expr(c(TRUE, FALSE))
# integer
as_polars_expr(integer(0))
as_polars_expr(1L)
as_polars_expr(NA_integer_)
as_polars_expr(c(1L, 2L))
# double
as_polars_expr(double(0))
as_polars_expr(1)
as_polars_expr(NA_real_)
as_polars_expr(c(1, 2))
# raw
as_polars_expr(raw(0))
as_polars_expr(charToRaw("foo"))
# NULL
as_polars_expr(NULL)
# default method (for list)
as_polars_expr(list())
as_polars_expr(list(1))
as_polars_expr(list(1, 2))
# default method (for Date)
as_polars_expr(as.Date(integer(0)))
as_polars_expr(as.Date("2021-01-01"))
as_polars_expr(as.Date(c("2021-01-01", "2021-01-02")))
# polars_series
## Unlike the default method, this method does not extract the first value
as_polars_series(1) |>
  as_polars_expr()
# polars_expr
as_polars_expr(pl$col("a", "b"))
as_polars_expr(pl$col("a", "b"), structify = TRUE)
The as_polars_lf() function creates a LazyFrame from various R objects.
It is basically a shortcut for as_polars_df(x, ...) followed by the $lazy() method.
as_polars_lf(x, ...)
## Default S3 method:
as_polars_lf(x, ...)
## S3 method for class 'polars_lazy_frame'
as_polars_lf(x, ...)
x |
An R object. |
... |
Additional arguments passed to the methods. |
The default method creates a DataFrame by calling as_polars_df() and then creates a LazyFrame from the DataFrame.
Additional arguments ... are passed to as_polars_df().
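As a rough illustration of this round trip, the sketch below builds a LazyFrame from a plain R data frame, adds a filter step, and collects the result. It assumes the polars package is attached and that the `>` operator is available for polars expressions; the column names and the filter condition are illustrative only.

```r
library(polars)

# Build a LazyFrame from a plain R data frame; nothing is computed yet
lf <- as_polars_lf(data.frame(a = 1:3, b = c("x", "y", "z")))
is_polars_lf(lf)

# Add a lazy filter step to the query
lf_filtered <- lf$filter(pl$col("a") > 1)

# Materialize the query: as_polars_df() on a LazyFrame collects it
df <- as_polars_df(lf_filtered)
is_polars_df(df)

# Export to a base R data frame for inspection
out <- as.data.frame(df)
```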
A polars LazyFrame
The as_polars_series() function creates a polars Series from various R objects.
The Data Type of the Series is determined by the class of the input object.
as_polars_series(x, name = NULL, ...)
## Default S3 method:
as_polars_series(x, name = NULL, ...)
## S3 method for class 'polars_series'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'polars_data_frame'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'double'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'integer'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'character'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'logical'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'raw'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'factor'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'Date'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'POSIXct'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'POSIXlt'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'difftime'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'hms'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'blob'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'array'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'NULL'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'list'
as_polars_series(x, name = NULL, ..., strict = FALSE)
## S3 method for class 'AsIs'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'data.frame'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'integer64'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'ITime'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'vctrs_unspecified'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'vctrs_rcrd'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'clock_time_point'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'clock_sys_time'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'clock_zoned_time'
as_polars_series(x, name = NULL, ...)
## S3 method for class 'clock_duration'
as_polars_series(x, name = NULL, ...)
x |
An R object. |
name |
A single string or |
... |
Additional arguments passed to the methods. |
strict |
A logical value indicating whether to throw an error when
the input list's elements have different data types.
If |
The default method of as_polars_series() throws an error, so we need to define S3 methods for the classes we want to support.
In R, a list can contain elements of different types, but in Polars (Apache Arrow), all elements must have the same type.
So the as_polars_series() function automatically casts all elements to the same type or throws an error, depending on the strict argument.
If you want to create a list with all elements of the same type in R, consider using the vctrs::list_of() function.
Since a list can contain another list, the strict argument is also used when creating Series from the inner list in the case of classes constructed on top of a list, such as data.frame or vctrs_rcrd.
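The effect of the strict argument can be sketched with a small mixed-type list. This is a minimal example, assuming the polars package is attached; the list contents are illustrative, and the error in the strict case is captured with try() rather than shown.

```r
library(polars)

# A list whose elements have different types (integer and character)
mixed <- list(1L, "foo")

# strict = FALSE (the default): the elements are cast to a common type
s <- as_polars_series(mixed)

# strict = TRUE: elements with different data types are rejected with an error
res <- try(as_polars_series(mixed, strict = TRUE), silent = TRUE)
inherits(res, "try-error")
```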
For Date objects, sub-day values will be ignored (floored to the day).
For POSIXct objects, sub-millisecond values will be ignored (floored to the millisecond).
If the tzone attribute is not present or is an empty string (""), the Series' dtype will be Datetime without timezone.
For POSIXlt objects, sub-nanosecond values will be ignored (floored to the nanosecond).
For difftime objects, sub-millisecond values will be rounded to milliseconds.
For hms objects, sub-nanosecond values will be ignored (floored to the nanosecond).
If the hms vector contains values greater than or equal to 24:00:00 or less than 00:00:00, an error will be thrown.
For clock_duration objects, calendrical durations (years, quarters, months) are treated as chronological durations, using their internal representation in seconds. Please check the clock_duration documentation for more details.
For a polars DataFrame, this method is a shortcut for <DataFrame>$to_struct().
<Series>$to_r_vector(): Export the Series as an R vector.
as_polars_df(): Create a Polars DataFrame from an R object.
# double
as_polars_series(c(NA, 1, 2))
# integer
as_polars_series(c(NA, 1:2))
# character
as_polars_series(c(NA, "foo", "bar"))
# logical
as_polars_series(c(NA, TRUE, FALSE))
# raw
as_polars_series(charToRaw("foo"))
# factor
as_polars_series(factor(c(NA, "a", "b")))
# Date
as_polars_series(as.Date(c(NA, "2021-01-01")))
## Sub-day precision will be ignored
as.Date(c(-0.5, 0, 0.5)) |>
  as_polars_series()
# POSIXct with timezone
as_polars_series(as.POSIXct(c(NA, "2021-01-01 00:00:00.123456789"), "UTC"))
# POSIXct without timezone
as_polars_series(as.POSIXct(c(NA, "2021-01-01 00:00:00.123456789")))
# POSIXlt
as_polars_series(as.POSIXlt(c(NA, "2021-01-01 00:00:00.123456789"), "UTC"))
# difftime
as_polars_series(as.difftime(c(NA, 1), units = "days"))
## Sub-millisecond values will be rounded to milliseconds
as.difftime(c(0.0005, 0.0010, 0.0015, 0.0020), units = "secs") |>
  as_polars_series()
as.difftime(c(0.0005, 0.0010, 0.0015, 0.0020), units = "weeks") |>
  as_polars_series()
# NULL
as_polars_series(NULL)
# list
as_polars_series(list(NA, NULL, list(), 1, "foo", TRUE))
## 1st element will be `null` due to the casting failure
as_polars_series(list(list("bar"), "foo"))
# data.frame
as_polars_series(
  data.frame(x = 1:2, y = c("foo", "bar"), z = I(list(1, 2)))
)
# vctrs_unspecified
if (requireNamespace("vctrs", quietly = TRUE)) {
  as_polars_series(vctrs::unspecified(3L))
}
# hms
if (requireNamespace("hms", quietly = TRUE)) {
  as_polars_series(hms::as_hms(c(NA, "01:00:00")))
}
# blob
if (requireNamespace("blob", quietly = TRUE)) {
  as_polars_series(blob::as_blob(c(NA, "foo", "bar")))
}
# integer64
if (requireNamespace("bit64", quietly = TRUE)) {
  as_polars_series(bit64::as.integer64(c(NA, "9223372036854775807")))
}
# clock_naive_time
if (requireNamespace("clock", quietly = TRUE)) {
  as_polars_series(clock::naive_time_parse(c(
    NA,
    "1900-01-01T12:34:56.123456789",
    "2020-01-01T12:34:56.123456789"
  ), precision = "nanosecond"))
}
# clock_duration
if (requireNamespace("clock", quietly = TRUE)) {
  as_polars_series(clock::duration_nanoseconds(c(NA, 1)))
}
## Calendrical durations are treated as chronologically
if (requireNamespace("clock", quietly = TRUE)) {
  as_polars_series(clock::duration_years(c(NA, 1)))
}
This S3 method is basically a shortcut for as_polars_df(x, ...)$to_struct()$to_r_vector(ensure_vector = FALSE, struct = "tibble").
Additionally, you can check or repair the column names by specifying the .name_repair argument, because a polars DataFrame allows an empty column name, which is generally not a valid column name in an R data frame.
## S3 method for class 'polars_data_frame'
as_tibble(
  x,
  ...,
  .name_repair = c("check_unique", "unique", "universal", "minimal"),
  int64 = c("double", "character", "integer", "integer64"),
  date = c("Date", "IDate"),
  time = c("hms", "ITime"),
  decimal = c("double", "character"),
  as_clock_class = FALSE,
  ambiguous = c("raise", "earliest", "latest", "null"),
  non_existent = c("raise", "null")
)
## S3 method for class 'polars_lazy_frame'
as_tibble(
  x,
  ...,
  .name_repair = c("check_unique", "unique", "universal", "minimal"),
  int64 = c("double", "character", "integer", "integer64"),
  date = c("Date", "IDate"),
  time = c("hms", "ITime"),
  decimal = c("double", "character"),
  as_clock_class = FALSE,
  ambiguous = c("raise", "earliest", "latest", "null"),
  non_existent = c("raise", "null")
)
x |
A polars object |
... |
Passed to |
.name_repair |
Treatment of problematic column names:
This argument is passed on as |
int64 |
Determine how to convert Polars' Int64, UInt32, or UInt64 type values to R type. One of the following:
|
date |
Determine how to convert Polars' Date type values to R class. One of the following:
|
time |
Determine how to convert Polars' Time type values to R class. One of the following:
|
decimal |
Determine how to convert Polars' Decimal type values to R type. One of the following: |
as_clock_class |
A logical value indicating whether to export datetimes and duration as the clock package's classes.
|
ambiguous |
Determine how to deal with ambiguous datetimes.
Only applicable when
|
non_existent |
Determine how to deal with non-existent datetimes.
Only applicable when
|
A tibble
as.data.frame(<polars_object>): Export the polars object as a basic data frame.
# Polars DataFrame may have empty column name
df <- pl$DataFrame(x = 1:2, c("a", "b"))
df
# Without checking or repairing the column names
tibble::as_tibble(df, .name_repair = "minimal")
tibble::as_tibble(df$lazy(), .name_repair = "minimal")
# You can make that unique
tibble::as_tibble(df, .name_repair = "unique")
tibble::as_tibble(df$lazy(), .name_repair = "unique")
This S3 method is a shortcut for as_polars_df(x, ...)$to_struct()$to_r_vector(ensure_vector = FALSE, struct = "dataframe").
## S3 method for class 'polars_data_frame'
as.data.frame(
  x,
  ...,
  int64 = c("double", "character", "integer", "integer64"),
  date = c("Date", "IDate"),
  time = c("hms", "ITime"),
  decimal = c("double", "character"),
  as_clock_class = FALSE,
  ambiguous = c("raise", "earliest", "latest", "null"),
  non_existent = c("raise", "null")
)
## S3 method for class 'polars_lazy_frame'
as.data.frame(
  x,
  ...,
  int64 = c("double", "character", "integer", "integer64"),
  date = c("Date", "IDate"),
  time = c("hms", "ITime"),
  decimal = c("double", "character"),
  as_clock_class = FALSE,
  ambiguous = c("raise", "earliest", "latest", "null"),
  non_existent = c("raise", "null")
)
x |
A polars object |
... |
Passed to |
int64 |
Determine how to convert Polars' Int64, UInt32, or UInt64 type values to R type. One of the following:
|
date |
Determine how to convert Polars' Date type values to R class. One of the following:
|
time |
Determine how to convert Polars' Time type values to R class. One of the following:
|
decimal |
Determine how to convert Polars' Decimal type values to R type. One of the following: |
as_clock_class |
A logical value indicating whether to export datetimes and duration as the clock package's classes.
|
ambiguous |
Determine how to deal with ambiguous datetimes.
Only applicable when
|
non_existent |
Determine how to deal with non-existent datetimes.
Only applicable when
|
An R data frame
df <- as_polars_df(list(a = 1:3, b = 4:6))
as.data.frame(df)
as.data.frame(df$lazy())
This S3 method calls as_polars_df(x, ...)$get_columns() or as_polars_df(x, ...)$to_struct()$to_r_vector(ensure_vector = TRUE), depending on the as_series argument.
## S3 method for class 'polars_data_frame'
as.list(
  x,
  ...,
  as_series = FALSE,
  int64 = c("double", "character", "integer", "integer64"),
  date = c("Date", "IDate"),
  time = c("hms", "ITime"),
  struct = c("dataframe", "tibble"),
  decimal = c("double", "character"),
  as_clock_class = FALSE,
  ambiguous = c("raise", "earliest", "latest", "null"),
  non_existent = c("raise", "null")
)
## S3 method for class 'polars_lazy_frame'
as.list(
  x,
  ...,
  as_series = FALSE,
  int64 = c("double", "character", "integer", "integer64"),
  date = c("Date", "IDate"),
  time = c("hms", "ITime"),
  struct = c("dataframe", "tibble"),
  decimal = c("double", "character"),
  as_clock_class = FALSE,
  ambiguous = c("raise", "earliest", "latest", "null"),
  non_existent = c("raise", "null")
)
x |
A polars object |
... |
Passed to |
as_series |
Whether to convert each column to an R vector or a Series.
If |
int64 |
Determine how to convert Polars' Int64, UInt32, or UInt64 type values to R type. One of the following:
|
date |
Determine how to convert Polars' Date type values to R class. One of the following:
|
time |
Determine how to convert Polars' Time type values to R class. One of the following:
|
struct |
Determine how to convert Polars' Struct type values to R class. One of the following:
|
decimal |
Determine how to convert Polars' Decimal type values to R type. One of the following: |
as_clock_class |
A logical value indicating whether to export datetimes and duration as the clock package's classes.
|
ambiguous |
Determine how to deal with ambiguous datetimes.
Only applicable when
|
non_existent |
Determine how to deal with non-existent datetimes.
Only applicable when
|
Arguments other than x and as_series are passed to <Series>$to_r_vector(), so they are ignored when as_series = TRUE.
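The difference between the two modes can be checked directly on the returned list. This is a small sketch, assuming the polars package is attached; elements are accessed by position so that no assumption is made about how the returned list is named.

```r
library(polars)

df <- as_polars_df(list(a = 1:3, b = 4:6))

# as_series = TRUE: each column stays a polars Series, so conversion
# options such as int64 or date have nothing to act on
cols_series <- as.list(df, as_series = TRUE)
is_polars_series(cols_series[[1]])

# as_series = FALSE (the default): each column is exported as an R vector
cols_r <- as.list(df, as_series = FALSE)
is.integer(cols_r[[1]])
```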
A list
df <- as_polars_df(list(a = 1:3, b = 4:6))
as.list(df, as_series = TRUE)
as.list(df, as_series = FALSE)
as.list(df$lazy(), as_series = TRUE)
as.list(df$lazy(), as_series = FALSE)
Functions to check if the object is a polars object.
is_* functions return TRUE or FALSE depending on the class of the object.
check_* functions throw an informative error if the object is not of the correct class.
The suffixes correspond to the polars object classes:
*_dtype: For polars data types.
*_df: For polars data frames.
*_expr: For polars expressions.
*_lf: For polars lazy frames.
*_selector: For polars selectors.
*_series: For polars series.
is_polars_dtype(x)
is_polars_df(x)
is_polars_expr(x, ...)
is_polars_lf(x)
is_polars_selector(x, ...)
is_polars_series(x)
is_list_of_polars_dtype(x, n = NULL)
check_polars_dtype(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)
check_polars_df(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)
check_polars_expr(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)
check_polars_lf(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)
check_polars_selector(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)
check_polars_series(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)
check_list_of_polars_dtype(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)
is_polars_dtype(x) is_polars_df(x) is_polars_expr(x, ...) is_polars_lf(x) is_polars_selector(x, ...) is_polars_series(x) is_list_of_polars_dtype(x, n = NULL) check_polars_dtype( x, ..., allow_null = FALSE, arg = caller_arg(x), call = caller_env() ) check_polars_df( x, ..., allow_null = FALSE, arg = caller_arg(x), call = caller_env() ) check_polars_expr( x, ..., allow_null = FALSE, arg = caller_arg(x), call = caller_env() ) check_polars_lf( x, ..., allow_null = FALSE, arg = caller_arg(x), call = caller_env() ) check_polars_selector( x, ..., allow_null = FALSE, arg = caller_arg(x), call = caller_env() ) check_polars_series( x, ..., allow_null = FALSE, arg = caller_arg(x), call = caller_env() ) check_list_of_polars_dtype( x, ..., allow_null = FALSE, arg = caller_arg(x), call = caller_env() )
x |
An object to check. |
... |
Arguments passed to |
n |
Expected length of a vector. |
allow_null |
If |
arg |
An argument name as a string. This argument will be mentioned in error messages as the input that is at the origin of a problem. |
call |
The execution environment of a currently
running function, e.g. |
check_polars_* functions are derived from the standalone-types-check functions from the rlang package (can be installed with usethis::use_standalone("r-lib/rlang", file = "types-check")).
is_polars_* functions return TRUE or FALSE.
check_polars_* functions return NULL invisibly if the input is valid.
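The return values described above can be checked directly. The following is a minimal sketch, assuming the polars package is attached and using only functions listed in this section; withVisible() and try() are base R helpers used here to inspect the result.

```r
library(polars)

df <- as_polars_df(mtcars)

# is_* functions return a single logical value
is_polars_df(df)
is_polars_lf(df)
is_polars_lf(df$lazy())

# check_* functions return NULL invisibly for valid input;
# withVisible() exposes both the value and its visibility
res <- withVisible(check_polars_df(df))
is.null(res$value)
res$visible

# For invalid input, check_* functions throw an error instead
failed <- try(check_polars_df(mtcars), silent = TRUE)
inherits(failed, "try-error")
```

Because the check_* functions return nothing useful on success, they are typically called only for their side effect at the top of a function body.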
is_polars_df(as_polars_df(mtcars))
is_polars_df(mtcars)

# Use `check_polars_*` functions in a function
# to ensure the input is a polars object
sample_func <- function(x) {
  check_polars_df(x)
  TRUE
}

sample_func(as_polars_df(mtcars))
try(sample_func(mtcars))
cs is an environment class object that stores all selector functions of the R Polars API, which mimics the Python Polars API. It is intended to work the same way as in Python, as if you had imported Python Polars Selectors with import polars.selectors as cs.
cs
An object of class polars_object of length 29.
There are 4 supported operators for selectors:
& to combine conditions with AND, e.g. select columns that contain "oo" and end with "t" with cs$contains("oo") & cs$ends_with("t");
| to combine conditions with OR, e.g. select columns that contain "oo" or end with "t" with cs$contains("oo") | cs$ends_with("t");
- to subtract conditions, e.g. select all columns that have alphanumeric names except those that contain "a" with cs$alphanumeric() - cs$contains("a");
! to invert the selection, e.g. select all columns that are not of data type String with !cs$string().
Note that Python Polars uses ~ instead of ! to invert selectors.
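The four operators can be tried side by side on a small frame. This is a minimal sketch, assuming the polars package is attached; the column names (foot, boot, food, bar) are hypothetical and chosen only so that each operator keeps a different set of columns.

```r
library(polars)

# Hypothetical columns: three names contain "oo", two of those end with "t"
df <- pl$DataFrame(
  foot = 1:2,
  boot = c("a", "b"),
  food = c(1.5, 2.5),
  bar = c(TRUE, FALSE)
)

# AND: names that contain "oo" and also end with "t" (foot, boot)
df$select(cs$contains("oo") & cs$ends_with("t"))

# OR: names that contain "oo" or end with "t" (foot, boot, food)
df$select(cs$contains("oo") | cs$ends_with("t"))

# Subtraction: alphanumeric names, minus those containing "a" (drops bar)
df$select(cs$alphanumeric() - cs$contains("a"))

# Inversion: every column that is not of type String (drops boot)
df$select(!cs$string())
```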
cs

# How many members are in the `cs` environment?
length(cs)
Select all columns
cs__all()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(dt = as.Date(c("2000-1-1")), value = 10)

# Select all columns, casting them to string:
df$select(cs$all()$cast(pl$String))

# Select all columns except for those matching the given dtypes:
df$select(cs$all() - cs$numeric())
Select all columns with alphabetic names (i.e. only letters)
cs__alpha(ascii_only = FALSE, ..., ignore_spaces = FALSE)
ascii_only |
Indicate whether to consider only ASCII alphabetic characters, or the full Unicode range of valid letters (accented, ideographic, etc). |
... |
These dots are for future extensions and must be empty. |
ignore_spaces |
Indicate whether to ignore the presence of spaces in column names; if so, only the other (non-space) characters are considered. |
Matching column names cannot contain any non-alphabetic characters. Note that the definition of “alphabetic” consists of all valid Unicode alphabetic characters (\p{Alphabetic}) by default; this can be changed by setting ascii_only = TRUE.
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  no1 = c(100, 200, 300),
  café = c("espresso", "latte", "mocha"),
  `t or f` = c(TRUE, FALSE, NA),
  hmm = c("aaa", "bbb", "ccc"),
  都市 = c("東京", "大阪", "京都")
)

# Select columns with alphabetic names; note that accented characters and
# kanji are recognised as alphabetic here:
df$select(cs$alpha())

# Constrain the definition of “alphabetic” to ASCII characters only:
df$select(cs$alpha(ascii_only = TRUE))
df$select(cs$alpha(ascii_only = TRUE, ignore_spaces = TRUE))

# Select all columns except for those with alphabetic names:
df$select(!cs$alpha())
df$select(!cs$alpha(ignore_spaces = TRUE))
Select all columns with alphanumeric names (i.e. only letters and the digits 0-9)
cs__alphanumeric(ascii_only = FALSE, ..., ignore_spaces = FALSE)
ascii_only |
Indicate whether to consider only ASCII alphabetic characters, or the full Unicode range of valid letters (accented, ideographic, etc). |
... |
These dots are for future extensions and must be empty. |
ignore_spaces |
Indicate whether to ignore the presence of spaces in column names; if so, only the other (non-space) characters are considered. |
Matching column names cannot contain any non-alphanumeric characters. Note that the definition of “alphanumeric” consists of all valid Unicode alphabetic characters (\p{Alphabetic}) and digit characters (\d) by default; this can be changed by setting ascii_only = TRUE.
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  `1st_col` = c(100, 200, 300),
  flagged = c(TRUE, FALSE, TRUE),
  `00prefix` = c("01:aa", "02:bb", "03:cc"),
  `last col` = c("x", "y", "z")
)

# Select columns with alphanumeric names:
df$select(cs$alphanumeric())
df$select(cs$alphanumeric(ignore_spaces = TRUE))

# Select all columns except for those with alphanumeric names:
df$select(!cs$alphanumeric())
df$select(!cs$alphanumeric(ignore_spaces = TRUE))
Select all binary columns
cs__binary()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  a = charToRaw("hello"),
  b = "world",
  c = charToRaw("!"),
  d = ":"
)

# Select binary columns:
df$select(cs$binary())

# Select all columns except for those that are binary:
df$select(!cs$binary())
Select all boolean columns
cs__boolean()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  a = 1:4,
  b = c(FALSE, TRUE, FALSE, TRUE)
)

# Select and invert boolean columns:
df$with_columns(inverted = cs$boolean()$not())

# Select all columns except for those that are boolean:
df$select(!cs$boolean())
Select all columns matching the given dtypes
cs__by_dtype(...)
... |
< |
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  dt = as.Date(c("1999-12-31", "2024-1-1", "2010-7-5")),
  value = c(1234500, 5000555, -4500000),
  other = c("foo", "bar", "foo")
)

# Select all columns with date or string dtypes:
df$select(cs$by_dtype(pl$Date, pl$String))

# Select all columns that are not of date or string dtype:
df$select(!cs$by_dtype(pl$Date, pl$String))

# Group by string columns and sum the numeric columns:
df$group_by(cs$string())$agg(cs$numeric()$sum())$sort("other")
Select all columns matching the given indices (or range objects)
cs__by_index(indices)
indices |
One or more column indices (or ranges). Negative indexing is supported. |
Matching columns are returned in the order in which their indexes appear in the selector, not the underlying schema order.
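The ordering rule can be seen on a three-column frame. This is a small sketch, assuming the polars package is attached; the frame and its column names a, b, c are hypothetical. Per the note above, listing index 2 before index 0 should return column c before column a.

```r
library(polars)

# Schema order is a, b, c (zero-based indices 0, 1, 2)
df <- pl$DataFrame(a = 1:2, b = 3:4, c = 5:6)

# The selector lists index 2 first, then index 0
reordered <- df$select(cs$by_index(c(2, 0)))

# Inspect the resulting column order via the names of the converted list
names(as.list(reordered, as_series = FALSE))
```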
A Polars selector
cs for the documentation on operators supported by Polars selectors.
vals <- as.list(0.5 * 0:100)
names(vals) <- paste0("c", 0:100)
df <- pl$DataFrame(!!!vals)
df

# Select columns by index (the two first/last columns):
df$select(cs$by_index(c(0, 1, -2, -1)))

# Use seq()
df$select(cs$by_index(c(0, seq(1, 101, 20))))
df$select(cs$by_index(c(0, seq(101, 0, -25))))

# Select only odd-indexed columns:
df$select(!cs$by_index(seq(0, 100, 2)))
Select all columns matching the given names
cs__by_name(..., require_all = TRUE)
... |
< |
require_all |
Whether to match all names (the default) or any of the names. |
Matching columns are returned in the order in which their names appear in the selector, not the underlying schema order.
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123, 456),
  baz = c(2.0, 5.5),
  zap = c(FALSE, TRUE)
)

# Select columns by name:
df$select(cs$by_name("foo", "bar"))

# Match any of the given columns by name:
df$select(cs$by_name("baz", "moose", "foo", "bear", require_all = FALSE))

# Match all columns except for those given:
df$select(!cs$by_name("foo", "bar"))
Select all categorical columns
cs__categorical()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("xx", "yy"),
  bar = c(123, 456),
  baz = c(2.0, 5.5),
  .schema_overrides = list(foo = pl$Categorical()),
)

# Select categorical columns:
df$select(cs$categorical())

# Select all columns except for those that are categorical:
df$select(!cs$categorical())
Select columns whose names contain the given literal substring(s)
cs__contains(...)
... |
< |
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123, 456),
  baz = c(2.0, 5.5),
  zap = c(FALSE, TRUE)
)

# Select columns that contain the substring "ba":
df$select(cs$contains("ba"))

# Select columns that contain the substring "ba" or the letter "z":
df$select(cs$contains("ba", "z"))

# Select all columns except for those that contain the substring "ba":
df$select(!cs$contains("ba"))
Select all date columns
cs__date()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  dtm = as.POSIXct(c("2001-5-7 10:25", "2031-12-31 00:30")),
  dt = as.Date(c("1999-12-31", "2024-8-9"))
)

# Select date columns:
df$select(cs$date())

# Select all columns except for those that are dates:
df$select(!cs$date())
Select all datetime columns
cs__datetime(time_unit = c("ms", "us", "ns"), time_zone = list("*", NULL))
time_unit |
One (or more) of the allowed time unit precision strings,
|
time_zone |
One of the following. The value or each element of the vector
will be passed to the
|
A Polars selector
cs for the documentation on operators supported by Polars selectors.
chr_vec <- c("1999-07-21 05:20:16.987654", "2000-05-16 06:21:21.123456")
df <- pl$DataFrame(
  tstamp_tokyo = as.POSIXlt(chr_vec, tz = "Asia/Tokyo"),
  tstamp_utc = as.POSIXct(chr_vec, tz = "UTC"),
  tstamp = as.POSIXct(chr_vec),
  dt = as.Date(chr_vec),
)

# Select all datetime columns:
df$select(cs$datetime())

# Select all datetime columns that have "ms" precision:
df$select(cs$datetime("ms"))

# Select all datetime columns that have any timezone:
df$select(cs$datetime(time_zone = "*"))

# Select all datetime columns that have a specific timezone:
df$select(cs$datetime(time_zone = "UTC"))

# Select all datetime columns that have NO timezone:
df$select(cs$datetime(time_zone = NULL))

# Select all columns except for datetime columns:
df$select(!cs$datetime())
Select all decimal columns
cs__decimal()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123, 456),
  baz = c("2.0005", "-50.5555"),
  .schema_overrides = list(
    bar = pl$Decimal(),
    baz = pl$Decimal(scale = 5, precision = 10)
  )
)

# Select decimal columns:
df$select(cs$decimal())

# Select all columns except for those that are decimal:
df$select(!cs$decimal())
Select all columns having names consisting only of digits
cs__digit(ascii_only = FALSE)
ascii_only |
Indicate whether to consider only ASCII digits (0-9), or the full Unicode range of valid digit characters. |
Matching column names cannot contain any non-digit characters. Note that the definition of "digit" consists of all valid Unicode digit characters (\d) by default; this can be changed by setting ascii_only = TRUE.
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  key = c("aaa", "bbb"),
  `2001` = 1:2,
  `2025` = 3:4
)

# Select columns with digit names:
df$select(cs$digit())

# Select all columns except for those with digit names:
df$select(!cs$digit())

# Demonstrate use of ascii_only flag (by default all valid unicode digits
# are considered, but this can be constrained to ascii 0-9):
df <- pl$DataFrame(`१९९९` = 1999, `२०७७` = 2077, `3000` = 3000)
df$select(cs$digit())
df$select(cs$digit(ascii_only = TRUE))
Select all duration columns, optionally filtering by time unit
cs__duration(time_unit = c("ms", "us", "ns"))
time_unit |
One (or more) of the allowed time unit precision strings,
|
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  dtm = as.POSIXct(c("2001-5-7 10:25", "2031-12-31 00:30")),
  dur_ms = clock::duration_milliseconds(1:2),
  dur_us = clock::duration_microseconds(1:2),
  dur_ns = clock::duration_nanoseconds(1:2),
)

# Select duration columns:
df$select(cs$duration())

# Select all duration columns that have "ms" precision:
df$select(cs$duration("ms"))

# Select all duration columns that have "ms" OR "ns" precision:
df$select(cs$duration(c("ms", "ns")))

# Select all columns except for those that are duration:
df$select(!cs$duration())
Select columns that end with the given substring(s)
cs__ends_with(...)
... |
< |
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123, 456),
  baz = c(2.0, 5.5),
  zap = c(FALSE, TRUE)
)

# Select columns that end with the substring "z":
df$select(cs$ends_with("z"))

# Select columns that end with either the letter "z" or "r":
df$select(cs$ends_with("z", "r"))

# Select all columns except for those that end with the substring "z":
df$select(!cs$ends_with("z"))
Select all columns except those matching the given columns, datatypes, or selectors
cs__exclude(...)
... |
< |
If excluding a single selector, it is simpler to write !selector instead.
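The equivalence between the two spellings can be checked on a small frame. This is a minimal sketch, assuming the polars package is attached; it reuses the column layout from this function's own example, and the comparison through as.list() is only one possible way to compare the results.

```r
library(polars)

df <- pl$DataFrame(
  aa = 1:3,
  ba = c("a", "b", NA),
  cc = c(NA, 2.5, 1.5)
)

# Excluding a single selector with cs$exclude() ...
with_exclude <- df$select(cs$exclude(cs$string()))

# ... and the shorter inverted form
with_invert <- df$select(!cs$string())

# Both keep the non-string columns, so the converted lists should match
identical(
  as.list(with_exclude, as_series = FALSE),
  as.list(with_invert, as_series = FALSE)
)
```

cs$exclude() remains the more convenient form when mixing column names, dtypes, and selectors in a single call.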
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  aa = 1:3,
  ba = c("a", "b", NA),
  cc = c(NA, 2.5, 1.5)
)

# Exclude by column name(s):
df$select(cs$exclude("ba", "xx"))

# Exclude using a column name, a selector, and a dtype:
df$select(cs$exclude("aa", cs$string(), pl$Int32))
Select the first column in the current scope
cs__first()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123L, 456L),
  baz = c(2.0, 5.5),
  zap = c(FALSE, TRUE)
)

# Select the first column:
df$select(cs$first())

# Select everything except for the first column:
df$select(!cs$first())
Select all float columns.
cs__float()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123L, 456L),
  baz = c(2.0, 5.5),
  zap = c(FALSE, TRUE),
  .schema_overrides = list(baz = pl$Float32, zap = pl$Float64),
)

# Select all float columns:
df$select(cs$float())

# Select all columns except for those that are float:
df$select(!cs$float())
Select all integer columns.
cs__integer()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123L, 456L),
  baz = c(2.0, 5.5),
  zap = 0:1
)

# Select all integer columns:
df$select(cs$integer())

# Select all columns except for those that are integer:
df$select(!cs$integer())
Select the last column in the current scope
cs__last()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123L, 456L),
  baz = c(2.0, 5.5),
  zap = c(FALSE, TRUE)
)

# Select the last column:
df$select(cs$last())

# Select everything except for the last column:
df$select(!cs$last())
Select all columns that match the given regex pattern
cs__matches(pattern)
pattern |
A valid regular expression pattern, compatible with the
|
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123, 456),
  baz = c(2.0, 5.5),
  zap = c(0, 1)
)

# Match column names containing an "a", preceded by a character that is not
# "z":
df$select(cs$matches("[^z]a"))

# Do not match column names ending in "R" or "z" (case-insensitively):
df$select(!cs$matches(r"((?i)R|z$)"))
Select all numeric columns.
cs__numeric()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123L, 456L),
  baz = c(2.0, 5.5),
  zap = 0:1,
  .schema_overrides = list(bar = pl$Int16, baz = pl$Float32, zap = pl$UInt8),
)

# Select all numeric columns:
df$select(cs$numeric())

# Select all columns except for those that are numeric:
df$select(!cs$numeric())
Select all signed integer columns
cs__signed_integer()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c(-123L, -456L),
  bar = c(3456L, 6789L),
  baz = c(7654L, 4321L),
  zap = c("ab", "cd"),
  .schema_overrides = list(bar = pl$UInt32, baz = pl$UInt64),
)

# Select signed integer columns:
df$select(cs$signed_integer())

# Select all columns except for those that are signed integer:
df$select(!cs$signed_integer())

# Select all integer columns (both signed and unsigned):
df$select(cs$integer())
Select columns that start with the given substring(s)
cs__starts_with(...)
... |
< |
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123, 456),
  baz = c(2.0, 5.5),
  zap = c(FALSE, TRUE)
)

# Select columns that start with the substring "b":
df$select(cs$starts_with("b"))

# Select columns that start with either the letter "b" or "z":
df$select(cs$starts_with("b", "z"))

# Select all columns except for those that start with the substring "b":
df$select(!cs$starts_with("b"))
Select all String (and, optionally, Categorical) string columns.
cs__string(..., include_categorical = FALSE)
... |
These dots are for future extensions and must be empty. |
include_categorical |
If |
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  w = c("xx", "yy", "xx", "yy", "xx"),
  x = c(1, 2, 1, 4, -2),
  y = c(3.0, 4.5, 1.0, 2.5, -2.0),
  z = c("a", "b", "a", "b", "b")
)$with_columns(
  z = pl$col("z")$cast(pl$Categorical())
)

# Group by all string columns, sum the numeric columns, then sort by the
# string cols:
df$group_by(cs$string())$agg(cs$numeric()$sum())$sort(cs$string())

# Group by all string and categorical columns:
df$
  group_by(cs$string(include_categorical = TRUE))$
  agg(cs$numeric()$sum())$
  sort(cs$string(include_categorical = TRUE))
Select all temporal columns
cs__temporal()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  dtm = as.POSIXct(c("2001-5-7 10:25", "2031-12-31 00:30")),
  dt = as.Date(c("1999-12-31", "2024-8-9")),
  value = 1:2
)

# Match all temporal columns:
df$select(cs$temporal())

# Match all temporal columns except for datetime columns:
df$select(cs$temporal() - cs$datetime())

# Match all columns except for temporal columns:
df$select(!cs$temporal())
Select all time columns
cs__time()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  dtm = as.POSIXct(c("2001-5-7 10:25", "2031-12-31 00:30")),
  dt = as.Date(c("1999-12-31", "2024-8-9")),
  tm = hms::parse_hms(c("0:0:0", "23:59:59"))
)

# Select time columns:
df$select(cs$time())

# Select all columns except for those that are time:
df$select(!cs$time())
Select all unsigned integer columns
cs__unsigned_integer()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c(-123L, -456L),
  bar = c(3456L, 6789L),
  baz = c(7654L, 4321L),
  zap = c("ab", "cd"),
  .schema_overrides = list(bar = pl$UInt32, baz = pl$UInt64),
)

# Select unsigned integer columns:
df$select(cs$unsigned_integer())

# Select all columns except for those that are unsigned integer:
df$select(!cs$unsigned_integer())

# Select all integer columns (both signed and unsigned):
df$select(cs$integer())
Cast DataFrame column(s) to the specified dtype
dataframe__cast(..., .strict = TRUE)
A polars DataFrame
df <- pl$DataFrame(
  foo = 1:3,
  bar = c(6, 7, 8),
  ham = as.Date(c("2020-01-02", "2020-03-04", "2020-05-06"))
)

# Cast only some columns
df$cast(foo = pl$Float32, bar = pl$UInt8)

# Cast all columns to the same type
df$cast(pl$String)
This is a cheap operation that does not copy data. Assigning does not copy the DataFrame (environment object). This is because environment objects have reference semantics. Calling $clone() creates a new environment, which can be useful when dealing with attributes (see examples).
dataframe__clone()
A polars DataFrame
df1 <- as_polars_df(iris)

# Assigning does not copy the DataFrame (environment object), calling
# $clone() creates a new environment.
df2 <- df1
df3 <- df1$clone()
rlang::env_label(df1)
rlang::env_label(df2)
rlang::env_label(df3)

# Cloning can be useful to add attributes to data used in a function without
# adding those attributes to the original object.

# Make a function to take a DataFrame, add an attribute, and return a
# DataFrame:
give_attr <- function(data) {
  attr(data, "created_on") <- "2024-01-29"
  data
}
df2 <- give_attr(df1)

# Problem: the original DataFrame also gets the attribute while it shouldn't
attributes(df1)

# Use $clone() inside the function to avoid that
give_attr <- function(data) {
  data <- data$clone()
  attr(data, "created_on") <- "2024-01-29"
  data
}
df1 <- as_polars_df(iris)
df2 <- give_attr(df1)

# now, the original DataFrame doesn't get this attribute
attributes(df1)
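The behaviour described above is a property of R environments in general, so it can be reproduced without polars. The following base-R sketch is only an illustration: the names e1, e2, and e3 are hypothetical, and the copy made with list2env() merely stands in for the idea of creating a new environment; it is not how $clone() is implemented.

```r
# Environments have reference semantics: assignment creates an alias.
e1 <- new.env()
e1$value <- 1
e2 <- e1

# Setting an attribute through the alias also changes the original.
attr(e2, "created_on") <- "2024-01-29"
attr(e1, "created_on")

# A new environment with the same contents is a distinct object
# (illustrative stand-in for cloning, not the polars implementation).
e3 <- list2env(as.list(e1, all.names = TRUE), envir = new.env())
attr(e3, "tag") <- "copy"

identical(e1, e2) # e1 and e2 are the same environment
identical(e1, e3) # e3 is a different environment
attr(e1, "tag")   # the original is unaffected by the change to e3
```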
Drop columns of a DataFrame
dataframe__drop(..., strict = TRUE)
... |
< |
strict |
Validate that all column names exist in the schema and throw an exception if a column name does not exist in the schema. |
A polars DataFrame
as_polars_df(mtcars)$drop(c("mpg", "hp"))

# equivalent
as_polars_df(mtcars)$drop("mpg", "hp")
Check whether the DataFrame is equal to another DataFrame
dataframe__equals(other, ..., null_equal = TRUE)
other |
DataFrame to compare with. |
A logical value
dat1 <- as_polars_df(iris)
dat2 <- as_polars_df(iris)
dat3 <- as_polars_df(mtcars)

dat1$equals(dat2)
dat1$equals(dat3)
Filter rows of a DataFrame
dataframe__filter(...)
A polars DataFrame
df <- as_polars_df(iris)

df$filter(pl$col("Sepal.Length") > 5)

# This is equivalent to
# df$filter(pl$col("Sepal.Length") > 5 & pl$col("Petal.Width") < 1)
df$filter(pl$col("Sepal.Length") > 5, pl$col("Petal.Width") < 1)

# rows where condition is NA are dropped
iris2 <- iris
iris2[c(1, 3, 5), "Species"] <- NA
df <- as_polars_df(iris2)

df$filter(pl$col("Species") == "setosa")
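The note that rows with an NA condition are dropped differs from plain logical indexing in base R, which keeps such rows as NA entries. The base-R sketch below illustrates the contrast; the vector and variable names are hypothetical, and which() is used only to mimic the "keep rows where the condition is TRUE" rule.

```r
species <- c(NA, "setosa", "virginica", NA, "setosa")
cond <- species == "setosa" # the condition is NA wherever species is NA

# Base R logical indexing keeps NA conditions as NA entries.
kept_base <- species[cond]

# Keeping only positions where the condition is TRUE drops them,
# which matches the behaviour described for $filter().
kept_filter_like <- species[which(cond)]

kept_base
kept_filter_like
```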
Get the DataFrame as a list of Series
dataframe__get_columns()
A list of Series
df <- pl$DataFrame(foo = c(1, 2, 3), bar = c(4, 5, 6))
df$get_columns()

df <- pl$DataFrame(
  a = 1:4,
  b = c(0.5, 4, 10, 13),
  c = c(TRUE, TRUE, FALSE, TRUE)
)
df$get_columns()
Group a DataFrame
dataframe__group_by(..., .maintain_order = FALSE)
Within each group, the order of the rows is always preserved, regardless of the .maintain_order argument.
GroupBy (a DataFrame with special groupby methods like $agg())
<DataFrame>$partition_by()
df <- pl$DataFrame(
  a = c("a", "b", "a", "b", "c"),
  b = c(1, 2, 1, 3, 3),
  c = c(5, 4, 3, 2, 1)
)

df$group_by("a")$agg(pl$col("b")$sum())

# Set `.maintain_order = TRUE` to ensure the order of the groups is
# consistent with the input.
df$group_by("a", .maintain_order = TRUE)$agg(pl$col("c"))

# Group by multiple columns by passing a vector of column names.
df$group_by(c("a", "b"))$agg(pl$max("c"))

# Or pass some arguments to group by multiple columns in the same way.
# Expressions are also accepted.
df$group_by("a", pl$col("b") %/% 2)$agg(
  pl$col("c")$mean()
)

# The columns will be renamed to the argument names.
df$group_by(d = "a", e = pl$col("b") %/% 2)$agg(
  pl$col("c")$mean()
)
Start a new lazy query from a DataFrame.
dataframe__lazy()
A polars LazyFrame
pl$DataFrame(a = 1:2, b = c(NA, "a"))$lazy()
Get number of chunks used by the ChunkedArrays of this DataFrame
dataframe__n_chunks(strategy = c("first", "all"))
strategy |
Return the number of chunks of the |
An integer vector.
df <- pl$DataFrame(
  a = c(1, 2, 3, 4),
  b = c(0.5, 4, 10, 13),
  c = c(TRUE, TRUE, FALSE, TRUE)
)

df$n_chunks()
df$n_chunks(strategy = "all")
This will make sure all subsequent operations have optimal and predictable performance.
dataframe__rechunk()
A polars DataFrame
Select and perform operations on a subset of columns only. This discards unmentioned columns (like .() in data.table and unlike dplyr::mutate()).
One cannot use new variables in subsequent expressions in the same $select() call. For instance, if you create a variable x, you will only be able to use it in another $select() or $with_columns() call.
dataframe__select(...)
... |
< |
A polars DataFrame
as_polars_df(iris)$select(
  abs_SL = pl$col("Sepal.Length")$abs(),
  add_2_SL = pl$col("Sepal.Length") + 2
)
Get a slice of the DataFrame.
dataframe__slice(offset, length = NULL)
offset |
Start index, can be a negative value. This is 0-indexed, so
|
length |
Length of the slice. If |
A polars DataFrame
# skip the first 2 rows and take the 4 following rows
as_polars_df(mtcars)$slice(2, 4)

# this is equivalent to:
mtcars[3:6, ]
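The 0-indexed offset and length arguments can be related to base R's 1-indexed row positions with a small helper. This is a sketch rather than the polars implementation: slice_rows() is a hypothetical name, and it assumes that a negative offset counts from the end and that length = NULL means "up to the last row". Out-of-range offsets are simply clamped here, which may not match polars in every edge case.

```r
# Map a 0-indexed (offset, length) pair to 1-indexed R row positions.
# Hypothetical helper for illustration only.
slice_rows <- function(n_rows, offset, length = NULL) {
  # A negative offset is counted from the end of the table.
  start0 <- if (offset < 0) max(n_rows + offset, 0) else min(offset, n_rows)
  # Without a length, the slice runs to the last row (assumption).
  end0 <- if (is.null(length)) n_rows else min(start0 + length, n_rows)
  if (end0 <= start0) integer(0) else seq(start0 + 1, end0)
}

# Rows selected by slice(2, 4) on a 32-row table such as mtcars
slice_rows(32, 2, 4)

# Rows selected by a negative offset: the last three rows
slice_rows(32, -3)
```

Under these assumptions, mtcars[slice_rows(nrow(mtcars), 2, 4), ] selects the same rows as mtcars[3:6, ].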
Sort a DataFrame
dataframe__sort( ..., descending = FALSE, nulls_last = FALSE, multithreaded = TRUE, maintain_order = FALSE )
A polars DataFrame
df <- mtcars
df$mpg[1] <- NA
df <- as_polars_df(df)

df$sort("mpg")
df$sort("mpg", nulls_last = TRUE)
df$sort("cyl", "mpg")
df$sort(c("cyl", "mpg"))
df$sort(c("cyl", "mpg"), descending = TRUE)
df$sort(c("cyl", "mpg"), descending = c(TRUE, FALSE))
df$sort(pl$col("cyl"), pl$col("mpg"))
Select column as Series at index location
dataframe__to_series(index = 0)
index |
Index of the column to return as Series. Defaults to 0, which is the first column. |
Series or NULL
df <- as_polars_df(iris[1:10, ])

# default is to extract the first column
df$to_series()

# Polars is 0-indexed, so we use index = 1 to extract the *2nd* column
df$to_series(index = 1)

# doesn't error if the column isn't there
df$to_series(index = 8)
Convert a DataFrame to a Series of type Struct
dataframe__to_struct(name = "")
name |
A character. Name for the struct Series. |
A Series of the struct type
df <- pl$DataFrame(
  a = 1:5,
  b = c("one", "two", "three", "four", "five"),
)
df$to_struct("nums")
Add columns or modify existing ones with expressions. This is similar to dplyr::mutate() as it keeps unmentioned columns (unlike $select()).
However, unlike dplyr::mutate(), one cannot use new variables in subsequent expressions in the same $with_columns() call. For instance, if you create a variable x, you will only be able to use it in another $with_columns() or $select() call.
dataframe__with_columns(...)
... |
< |
A polars DataFrame
as_polars_df(iris)$with_columns(
  abs_SL = pl$col("Sepal.Length")$abs(),
  add_2_SL = pl$col("Sepal.Length") + 2
)

# same query
l_expr <- list(
  pl$col("Sepal.Length")$abs()$alias("abs_SL"),
  (pl$col("Sepal.Length") + 2)$alias("add_2_SL")
)
as_polars_df(iris)$with_columns(l_expr)

as_polars_df(iris)$with_columns(
  SW_add_2 = (pl$col("Sepal.Width") + 2),
  # unnamed expr will keep name "Sepal.Length"
  pl$col("Sepal.Length")$abs()
)
Evaluate whether all boolean values are true for every sub-array
expr_arr_all()
A polars expression
df <- pl$DataFrame(
  values = list(c(TRUE, TRUE), c(FALSE, TRUE), c(FALSE, FALSE), c(NA, NA)),
)$cast(pl$Array(pl$Boolean, 2))

df$with_columns(all = pl$col("values")$arr$all())
Evaluate whether any boolean value is true for every sub-array
expr_arr_any()
A polars expression
df <- pl$DataFrame(
  values = list(c(TRUE, TRUE), c(FALSE, TRUE), c(FALSE, FALSE), c(NA, NA)),
)$cast(pl$Array(pl$Boolean, 2))

df$with_columns(any = pl$col("values")$arr$any())
Retrieve the index of the maximum value in every sub-array
expr_arr_arg_max()
A polars expression
df <- pl$DataFrame(
  values = list(1:2, 2:1)
)$cast(pl$Array(pl$Int32, 2))

df$with_columns(
  arg_max = pl$col("values")$arr$arg_max()
)
Retrieve the index of the minimum value in every sub-array
expr_arr_arg_min()
A polars expression
df <- pl$DataFrame(
  values = list(1:2, 2:1)
)$cast(pl$Array(pl$Int32, 2))

df$with_columns(
  arg_min = pl$col("values")$arr$arg_min()
)
Check if sub-arrays contain the given item
expr_arr_contains(item)
item |
Expr or something coercible to an Expr. Strings are not parsed as columns. |
A polars expression
df <- pl$DataFrame(
  values = list(0:2, 4:6, c(NA, NA, NA)),
  item = c(0L, 4L, 2L),
)$cast(values = pl$Array(pl$Float64, 3))

df$with_columns(
  with_expr = pl$col("values")$arr$contains(pl$col("item")),
  with_lit = pl$col("values")$arr$contains(1)
)
Count how often a value occurs in every sub-array
expr_arr_count_matches(element)
element |
An Expr or something coercible to an Expr that produces a single value. |
A polars expression
df <- pl$DataFrame(
  values = list(c(1, 2), c(1, 1), c(2, 2))
)$cast(pl$Array(pl$Int64, 2))

df$with_columns(number_of_twos = pl$col("values")$arr$count_matches(2))
Returns a column with a separate row for every array element.
expr_arr_explode()
A polars expression
df <- pl$DataFrame(
  a = list(c(1, 2, 3), c(4, 5, 6))
)$cast(pl$Array(pl$Int64, 3))

df$select(pl$col("a")$arr$explode())
Get the first value of the sub-arrays
expr_arr_first()
A polars expression
df <- pl$DataFrame(
  a = list(c(1, 2, 3), c(4, 5, 6))
)$cast(pl$Array(pl$Int64, 3))

df$with_columns(first = pl$col("a")$arr$first())
This extracts only one value per array. Values are 0-indexed (so index 0 would return the first item of every sub-array) and negative values start from the end (so index -1 returns the last item).
expr_arr_get(index, ..., null_on_oob = TRUE)
index |
An Expr or something coercible to an Expr, that must return a single index. |
... |
These dots are for future extensions and must be empty. |
null_on_oob |
If |
Expr
df <- pl$DataFrame(
  values = list(c(1, 2), c(3, 4), c(NA, 6)),
  idx = c(1, NA, 3)
)$cast(values = pl$Array(pl$Float64, 2))

df$with_columns(
  using_expr = pl$col("values")$arr$get("idx"),
  val_0 = pl$col("values")$arr$get(0),
  val_minus_1 = pl$col("values")$arr$get(-1),
  val_oob = pl$col("values")$arr$get(10)
)
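The indexing rules described for $arr$get() can be mirrored on a single base-R vector. This is only a sketch: arr_get_one() is a hypothetical helper, NA stands in for a polars null, and it assumes that null_on_oob = TRUE yields a null for an out-of-bounds index while FALSE raises an error.

```r
# Fetch one element using 0-indexed positions; negative positions count
# from the end. Hypothetical helper, not part of polars.
arr_get_one <- function(x, index, null_on_oob = TRUE) {
  n <- length(x)
  # Translate the 0-indexed (possibly negative) index to R's 1-indexed position.
  pos <- if (index < 0) n + index + 1 else index + 1
  if (pos < 1 || pos > n) {
    if (null_on_oob) {
      return(NA) # NA plays the role of a polars null here
    }
    stop("index is out of bounds")
  }
  x[[pos]]
}

arr_get_one(c(3, 4), 0)  # first item
arr_get_one(c(3, 4), -1) # last item
arr_get_one(c(3, 4), 10) # out of bounds, so NA under the assumption above
```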
Join all string items in a sub-array and place a separator between them. This only works if the inner type of the array is String.
expr_arr_join(separator, ..., ignore_nulls = FALSE)
separator |
String to separate the items with. Can be an Expr. Strings are not parsed as columns. |
... |
These dots are for future extensions and must be empty. |
A polars expression
df <- pl$DataFrame(
  values = list(c("a", "b", "c"), c("x", "y", "z"), c("e", NA, NA)),
  separator = c("-", "+", "/"),
)$cast(values = pl$Array(pl$String, 3))

df$with_columns(
  join_with_expr = pl$col("values")$arr$join(pl$col("separator")),
  join_with_lit = pl$col("values")$arr$join(" "),
  join_ignore_null = pl$col("values")$arr$join(" ", ignore_nulls = TRUE)
)
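The ignore_nulls argument is not described in the argument table above. The base-R sketch below shows the behaviour it is assumed to have: with ignore_nulls = FALSE a sub-array containing any null produces a null, and with ignore_nulls = TRUE the nulls are skipped before joining. arr_join_one() is a hypothetical helper working on one character vector, with NA standing in for null.

```r
# Join the items of one sub-array; hypothetical helper, not part of polars.
arr_join_one <- function(x, separator, ignore_nulls = FALSE) {
  if (anyNA(x)) {
    if (!ignore_nulls) {
      return(NA_character_) # nulls propagate to the result (assumption)
    }
    x <- x[!is.na(x)] # otherwise drop nulls before joining
  }
  paste(x, collapse = separator)
}

arr_join_one(c("a", "b", "c"), "-")
arr_join_one(c("e", NA, NA), " ")
arr_join_one(c("e", NA, NA), " ", ignore_nulls = TRUE)
```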
Get the last value of the sub-arrays
expr_arr_last()
A polars expression
df <- pl$DataFrame(
  a = list(c(1, 2, 3), c(4, 5, 6))
)$cast(pl$Array(pl$Int64, 3))

df$with_columns(last = pl$col("a")$arr$last())
Compute the max value of the sub-arrays
expr_arr_max()
A polars expression
df <- pl$DataFrame(
  values = list(c(1, 2), c(3, 4), c(NA, NA))
)$cast(pl$Array(pl$Float64, 2))

df$with_columns(max = pl$col("values")$arr$max())
Compute the median value of the sub-arrays
expr_arr_median()
A polars expression
df <- pl$DataFrame(
  values = list(c(2, 1, 4), c(8.4, 3.2, 1)),
)$cast(pl$Array(pl$Float64, 3))

df$with_columns(median = pl$col("values")$arr$median())
Compute the min value of the sub-arrays
expr_arr_min()
A polars expression
df <- pl$DataFrame(
  values = list(c(1, 2), c(3, 4), c(NA, NA))
)$cast(pl$Array(pl$Float64, 2))

df$with_columns(min = pl$col("values")$arr$min())
Count the number of unique values in every sub-array
expr_arr_n_unique()
A polars expression
df <- pl$DataFrame(
  a = list(c(1, 1, 2), c(2, 3, 4))
)$cast(pl$Array(pl$Int64, 3))

df$with_columns(n_unique = pl$col("a")$arr$n_unique())
Reverse values in every sub-array
expr_arr_reverse()
A polars expression
df <- pl$DataFrame(
  values = list(c(1, 2), c(3, 4), c(NA, 6))
)$cast(pl$Array(pl$Float64, 2))

df$with_columns(reverse = pl$col("values")$arr$reverse())
Shift values in every sub-array by the given number of indices
expr_arr_shift(n = 1)
A polars expression
df <- pl$DataFrame(
  values = list(1:3, c(2L, NA, 5L)),
  idx = 1:2,
)$cast(values = pl$Array(pl$Int32, 3))

df$with_columns(
  shift_by_expr = pl$col("values")$arr$shift(pl$col("idx")),
  shift_by_lit = pl$col("values")$arr$shift(2)
)
Sort values in every sub-array
expr_arr_sort(..., descending = FALSE, nulls_last = FALSE)
... |
These dots are for future extensions and must be empty. |
df <- pl$DataFrame(
  values = list(c(2, 1), c(3, 4), c(NA, 6))
)$cast(pl$Array(pl$Float64, 2))

df$with_columns(sort = pl$col("values")$arr$sort(nulls_last = TRUE))
Compute the standard deviation of the sub-arrays
expr_arr_std(ddof = 1)
A polars expression
df <- pl$DataFrame(
  values = list(c(2, 1, 4), c(8.4, 3.2, 1)),
)$cast(pl$Array(pl$Float64, 3))

df$with_columns(std = pl$col("values")$arr$std())
Compute the sum of the sub-arrays
expr_arr_sum()
A polars expression
df <- pl$DataFrame(
  values = list(c(1, 2), c(3, 4), c(NA, 6))
)$cast(pl$Array(pl$Float64, 2))

df$with_columns(sum = pl$col("values")$arr$sum())
Convert an Array column into a List column with the same inner data type
expr_arr_to_list()
A polars expression
df <- pl$DataFrame(
  a = list(c(1, 2), c(3, 4))
)$cast(pl$Array(pl$Int8, 2))

df$with_columns(
  list = pl$col("a")$arr$to_list()
)
Get the unique values in every sub-array
expr_arr_unique(..., maintain_order = FALSE)
... |
These dots are for future extensions and must be empty. |
A polars expression
df <- pl$DataFrame(
  values = list(c(1, 1, 2), c(4, 4, 4), c(NA, 6, 7)),
)$cast(pl$Array(pl$Float64, 3))

df$with_columns(unique = pl$col("values")$arr$unique())
Compute the variance of the sub-arrays
expr_arr_var(ddof = 1)
A polars expression
df <- pl$DataFrame(
  values = list(c(2, 1, 4), c(8.4, 3.2, 1)),
)$cast(pl$Array(pl$Float64, 3))

df$with_columns(var = pl$col("values")$arr$var())
Check if binaries contain a binary substring
expr_bin_contains(literal)
literal |
The binary substring to look for. |
A polars expression
colors <- pl$DataFrame(
  name = c("black", "yellow", "blue"),
  code = as_polars_series(c("x00x00x00", "xffxffx00", "x00x00xff"))$cast(pl$Binary),
  lit = as_polars_series(c("x00", "xffx00", "xffxff"))$cast(pl$Binary)
)

colors$select(
  "name",
  contains_with_lit = pl$col("code")$bin$contains("xff"),
  contains_with_expr = pl$col("code")$bin$contains(pl$col("lit"))
)
Decode values using the provided encoding
expr_bin_decode(encoding, ..., strict = TRUE)
encoding |
A character, |
... |
These dots are for future extensions and must be empty. |
strict |
Raise an error if the underlying value cannot be decoded,
otherwise mask out with a |
A polars expression
df <- pl$DataFrame(
  name = c("black", "yellow", "blue"),
  code_hex = as_polars_series(c("000000", "ffff00", "0000ff"))$cast(pl$Binary),
  code_base64 = as_polars_series(c("AAAA", "//8A", "AAD/"))$cast(pl$Binary)
)

df$with_columns(
  decoded_hex = pl$col("code_hex")$bin$decode("hex"),
  decoded_base64 = pl$col("code_base64")$bin$decode("base64")
)

# Set `strict = FALSE` to set invalid values to `null` instead of raising an error.
df <- pl$DataFrame(
  colors = as_polars_series(c("000000", "ffff00", "invalid_value"))$cast(pl$Binary)
)
df$select(pl$col("colors")$bin$decode("hex", strict = FALSE))
Encode a value using the provided encoding
expr_bin_encode(encoding)
encoding |
A character, |
A polars expression
df <- pl$DataFrame(
  name = c("black", "yellow", "blue"),
  code = as_polars_series(
    c("000000", "ffff00", "0000ff")
  )$cast(pl$Binary)$bin$decode("hex")
)

df$with_columns(encoded = pl$col("code")$bin$encode("hex"))
Check if values end with a binary substring
expr_bin_ends_with(suffix)
suffix |
Suffix substring. |
A polars expression
colors <- pl$DataFrame(
  name = c("black", "yellow", "blue"),
  code = as_polars_series(c("x00x00x00", "xffxffx00", "x00x00xff"))$cast(pl$Binary),
  suffix = as_polars_series(c("x00", "xffx00", "xffxff"))$cast(pl$Binary)
)

colors$select(
  "name",
  ends_with_lit = pl$col("code")$bin$ends_with("xff"),
  ends_with_expr = pl$col("code")$bin$ends_with(pl$col("suffix"))
)
Get the size of binary values in the given unit
expr_bin_size(unit = c("b", "kb", "mb", "gb", "tb"))
unit |
Scale the returned size to the given unit. Can be |
A polars expression
df <- pl$DataFrame(
  name = c("black", "yellow", "blue"),
  code_hex = as_polars_series(c("000000", "ffff00", "0000ff"))$cast(pl$Binary)
)

df$with_columns(
  n_bytes = pl$col("code_hex")$bin$size(),
  n_kilobytes = pl$col("code_hex")$bin$size("kb")
)
Check if values start with a binary substring
expr_bin_starts_with(prefix)
prefix |
Prefix substring. |
A polars expression
colors <- pl$DataFrame(
  name = c("black", "yellow", "blue"),
  code = as_polars_series(c("x00x00x00", "xffxffx00", "x00x00xff"))$cast(pl$Binary),
  prefix = as_polars_series(c("x00", "xffx00", "xffxff"))$cast(pl$Binary)
)

colors$select(
  "name",
  starts_with_lit = pl$col("code")$bin$starts_with("xff"),
  starts_with_expr = pl$col("code")$bin$starts_with(pl$col("prefix"))
)
Get the categories stored in this data type
expr_cat_get_categories()
A polars expression
df <- pl$DataFrame(
  cats = factor(c("z", "z", "k", "a", "b")),
  vals = factor(c(3, 1, 2, 2, 3))
)
df

df$select(
  pl$col("cats")$cat$get_categories()
)
df$select(
  pl$col("vals")$cat$get_categories()
)
Determine how this categorical series should be sorted.
expr_cat_set_ordering(ordering)
ordering |
string either 'physical' or 'lexical'
|
A polars expression
df <- pl$DataFrame(
  cats = factor(c("z", "z", "k", "a", "b")),
  vals = c(3, 1, 2, 2, 3)
)

# sort by the string value of categories
df$with_columns(
  pl$col("cats")$cat$set_ordering("lexical")
)$sort("cats", "vals")

# sort by the underlying value of categories
df$with_columns(
  pl$col("cats")$cat$set_ordering("physical")
)$sort("cats", "vals")
Offset by n business days.
expr_dt_add_business_days( n, ..., week_mask = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE), holidays = as.Date(integer(0)), roll = c("raise", "backward", "forward") )
n |
An integer value or a polars expression representing the number of business days to offset by. |
... |
These dots are for future extensions and must be empty. |
week_mask |
Non-NA logical vector of length 7, representing the days of
the week to count. The default is Monday to Friday ( |
holidays |
A Date class vector, representing the holidays to exclude from the count. |
roll |
What to do when the start date lands on a non-business day. Options are:
|
A polars expression
df <- pl$DataFrame(start = as.Date(c("2020-1-1", "2020-1-2")))
df$with_columns(result = pl$col("start")$dt$add_business_days(5))

# You can pass a custom weekend - for example, if you only take Sunday off:
week_mask <- c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)
df$with_columns(
  result = pl$col("start")$dt$add_business_days(5, week_mask = week_mask)
)

# You can also pass a list of holidays:
holidays <- as.Date(c("2020-1-3", "2020-1-6"))
df$with_columns(
  result = pl$col("start")$dt$add_business_days(5, holidays = holidays)
)

# Roll all dates forwards to the next business day:
df <- pl$DataFrame(start = as.Date(c("2020-1-5", "2020-1-6")))
df$with_columns(
  rolled_forwards = pl$col("start")$dt$add_business_days(0, roll = "forward")
)
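How week_mask and holidays interact can be sketched in base R for a single date. This is an illustration under stated assumptions, not the polars algorithm: add_business_days_one() is a hypothetical helper, it handles only non-negative n, it treats week_mask as starting on Monday (consistent with the default Monday-to-Friday mask), and it stops if the start date is not a business day, loosely mimicking roll = "raise".

```r
# Offset one Date by n business days; hypothetical helper for illustration.
add_business_days_one <- function(start, n,
                                  week_mask = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
                                  holidays = as.Date(character(0))) {
  is_business_day <- function(d) {
    # as.POSIXlt()$wday is 0 for Sunday; convert to 1 = Monday ... 7 = Sunday.
    monday_first <- (as.POSIXlt(d)$wday + 6) %% 7 + 1
    week_mask[monday_first] && !(d %in% holidays)
  }
  stopifnot(n >= 0, is_business_day(start))
  d <- start
  while (n > 0) {
    d <- d + 1
    # Only days that pass the mask and are not holidays are counted.
    if (is_business_day(d)) n <- n - 1
  }
  d
}

# Wednesday 2020-01-01 plus 5 business days, skipping the weekend
add_business_days_one(as.Date("2020-01-01"), 5)

# The same offset when 2020-01-03 and 2020-01-06 are holidays
add_business_days_one(
  as.Date("2020-01-01"), 5,
  holidays = as.Date(c("2020-01-03", "2020-01-06"))
)
```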
This computes the offset between a time zone and UTC. This is usually constant for all datetimes in a given time zone, but may vary in the rare case that a country switches time zone, like Samoa (Apia) did at the end of 2011. Use $dt$dst_offset() to take daylight saving time into account.
expr_dt_base_utc_offset()
A polars expression
df <- pl$DataFrame( x = as.POSIXct(c("2011-12-29", "2012-01-01"), tz = "Pacific/Apia") ) df$with_columns(base_utc_offset = pl$col("x")$dt$base_utc_offset())
Cast the underlying data to another time unit. This may lose precision.
expr_dt_cast_time_unit(time_unit)
time_unit |
One of "ns", "us", or "ms". |
A polars expression
df <- pl$select( date = pl$datetime_range( start = as.Date("2001-1-1"), end = as.Date("2001-1-3"), interval = "1d1s" ) ) df$with_columns( cast_time_unit_ms = pl$col("date")$dt$cast_time_unit("ms"), cast_time_unit_ns = pl$col("date")$dt$cast_time_unit("ns"), )
Returns the century number in the calendar date.
expr_dt_century()
A polars expression
df <- pl$DataFrame( date = as.Date( c("999-12-31", "1897-05-07", "2000-01-01", "2001-07-05", "3002-10-20") ) ) df$with_columns( century = pl$col("date")$dt$century() )
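Assuming the usual convention that centuries are counted from year 1 (so 2000 still belongs to the 20th century and 2001 opens the 21st), the computation can be sketched in base R as follows. The helper name is hypothetical, and the formula is an assumption about the convention rather than a quotation from the polars source.

```r
# Assumed rule: century = (year - 1) %/% 100 + 1
century_sketch <- function(year) (year - 1L) %/% 100L + 1L

centuries <- century_sketch(c(999L, 1897L, 2000L, 2001L, 3002L))
centuries
# 10 19 20 21 31
```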
If the underlying expression is a Datetime then its time component is replaced, and if it is a Date then a new Datetime is created by combining the two values.
expr_dt_combine(time, time_unit = c("us", "ns", "ms"))
time |
The number of epoch since or before (if negative) the Date. Can be an Expr or a PTime. |
time_unit |
One of "us", "ns", or "ms". |
A polars expression
df <- pl$DataFrame( dtm = c( ISOdatetime(2022, 12, 31, 10, 30, 45), ISOdatetime(2023, 7, 5, 23, 59, 59) ), dt = c(ISOdate(2022, 10, 10), ISOdate(2022, 7, 5)), tm = hms::parse_hms(c("1:2:3.456000", "7:8:9.101000")) ) df df$select( d1 = pl$col("dtm")$dt$combine(pl$col("tm")), s2 = pl$col("dt")$dt$combine(pl$col("tm")), d3 = pl$col("dt")$dt$combine(hms::parse_hms("4:5:6")) )
If converting from a time-zone-naive datetime, then conversion will happen as if converting from UTC, regardless of your system’s time zone.
expr_dt_convert_time_zone(time_zone)
time_zone |
A character time zone from |
A polars expression
df <- pl$select( date = pl$datetime_range( as.POSIXct("2020-03-01", tz = "UTC"), as.POSIXct("2020-05-01", tz = "UTC"), "1mo" ) ) df$with_columns( London = pl$col("date")$dt$convert_time_zone("Europe/London") )
Extract date from date(time)
expr_dt_date()
A polars expression
df <- pl$DataFrame( datetime = as.POSIXct(c("1978-1-1 1:1:1", "1897-5-7 00:00:00"), tz = "UTC") ) df$with_columns( date = pl$col("datetime")$dt$date() )
Returns the day of month starting from 1. The return value ranges from 1 to 31 (the last day of month differs across months).
expr_dt_day()
A polars expression
df <- pl$DataFrame( date = pl$date_range( as.Date("2020-12-25"), as.Date("2021-1-05"), interval = "1d", time_zone = "GMT" ) ) df$with_columns( pl$col("date")$dt$day()$alias("day") )
This computes the offset between a time zone and UTC, taking into account daylight saving time. Use $dt$base_utc_offset() to avoid counting DST.
expr_dt_dst_offset()
A polars expression
df <- pl$DataFrame( x = as.POSIXct(c("2020-10-25", "2020-10-26"), tz = "Europe/London") ) df$with_columns(dst_offset = pl$col("x")$dt$dst_offset())
Get the time passed since the Unix epoch in the given time unit.
expr_dt_epoch(time_unit = c("us", "ns", "ms", "s", "d"))
time_unit |
Time unit, one of 'us', 'ns', 'ms', 's', or 'd'. |
A polars expression
df <- pl$DataFrame(date = pl$date_range(as.Date("2001-1-1"), as.Date("2001-1-3")))
# The default time unit is "us", so the first column holds microseconds
df$with_columns(
  epoch_us = pl$col("date")$dt$epoch(),
  epoch_s = pl$col("date")$dt$epoch(time_unit = "s")
)
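Since a date is a day count from the Unix epoch (1970-01-01), the different time units are rescalings of the same number. The following base-R sketch shows the arithmetic for the dates used above; it is an illustration of the units only and does not call polars.

```r
dates <- as.Date(c("2001-01-01", "2001-01-02", "2001-01-03"))

epoch_d <- as.numeric(dates)  # days since 1970-01-01
epoch_s <- epoch_d * 86400    # seconds
epoch_ms <- epoch_s * 1e3     # milliseconds
epoch_us <- epoch_s * 1e6     # microseconds

data.frame(dates, epoch_d, epoch_s, epoch_ms, epoch_us)
```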
Returns the hour number from 0 to 23.
expr_dt_hour()
A polars expression
df <- pl$DataFrame( date = pl$datetime_range( as.Date("2020-12-25"), as.Date("2021-1-05"), interval = "1d2h", time_zone = "GMT" ) ) df$with_columns( pl$col("date")$dt$hour()$alias("hour") )
Determine whether the year of the underlying date is a leap year
expr_dt_is_leap_year()
A polars expression
df <- pl$DataFrame(date = as.Date(c("2000-01-01", "2001-01-01", "2002-01-01"))) df$with_columns( leap_year = pl$col("date")$dt$is_leap_year() )
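The Gregorian leap-year rule behind this check can be written in base R as follows. The helper name is hypothetical, and the sketch works on plain year numbers rather than on polars columns.

```r
# A year is a leap year if it is divisible by 4 but not by 100,
# or if it is divisible by 400.
is_leap_year_sketch <- function(year) {
  (year %% 4 == 0 & year %% 100 != 0) | year %% 400 == 0
}

leap <- is_leap_year_sketch(c(1900, 2000, 2001, 2002, 2004))
leap
# FALSE TRUE FALSE FALSE TRUE
```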
Returns the year number in the ISO standard. This may not correspond with the calendar year.
expr_dt_iso_year()
A polars expression
df <- pl$DataFrame( date = as.Date(c("1977-01-01", "1978-01-01", "1979-01-01")) ) df$with_columns( year = pl$col("date")$dt$year(), iso_year = pl$col("date")$dt$iso_year() )
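The ISO year differs from the calendar year when a date at the start or end of a year falls into an ISO week that belongs to the neighbouring year. Base R exposes the same ISO 8601 fields through the "%G" (week-based year) and "%V" (week number) format specifiers, which gives an independent way to see the effect for the dates above:

```r
dates <- as.Date(c("1977-01-01", "1978-01-01", "1979-01-01"))

iso <- data.frame(
  date = dates,
  weekday = format(dates, "%A"),
  year = as.integer(format(dates, "%Y")),
  iso_year = as.integer(format(dates, "%G")),
  iso_week = as.integer(format(dates, "%V"))
)
iso
```

1977-01-01 is a Saturday and 1978-01-01 is a Sunday, so both still belong to the last ISO week of the previous year; 1979-01-01 is a Monday and starts ISO week 1 of 1979.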
Extract microseconds from underlying Datetime representation
expr_dt_microsecond()
A polars expression
df <- pl$DataFrame( datetime = as.POSIXct( c( "1978-01-01 01:01:01", "2024-10-13 05:30:14.500", "2065-01-01 10:20:30.06" ), "UTC" ) ) df$with_columns( microsecond = pl$col("datetime")$dt$microsecond() )
Extract milliseconds from underlying Datetime representation
expr_dt_millisecond()
A polars expression
df <- pl$DataFrame( datetime = as.POSIXct( c( "1978-01-01 01:01:01", "2024-10-13 05:30:14.500", "2065-01-01 10:20:30.06" ), "UTC" ) ) df$with_columns( millisecond = pl$col("datetime")$dt$millisecond() )
Returns the minute number from 0 to 59.
expr_dt_minute()
A polars expression
df <- pl$DataFrame( datetime = as.POSIXct( c( "1978-01-01 01:01:01", "2024-10-13 05:30:14.500", "2065-01-01 10:20:30.06" ), "UTC" ) ) df$with_columns( pl$col("datetime")$dt$minute()$alias("minute") )
Returns the month number between 1 and 12.
expr_dt_month()
A polars expression
df <- pl$DataFrame( date = as.Date(c("2001-01-01", "2001-06-30", "2001-12-27")) ) df$with_columns( month = pl$col("date")$dt$month() )
For datetimes, the time of day is preserved.
expr_dt_month_end()
A polars expression
df <- pl$DataFrame(date = as.Date(c("2000-01-23", "2001-01-12", "2002-01-01"))) df$with_columns( month_end = pl$col("date")$dt$month_end() )
For datetimes, the time of day is preserved.
expr_dt_month_start()
A polars expression
df <- pl$DataFrame(date = as.Date(c("2000-01-23", "2001-01-12", "2002-01-01"))) df$with_columns( month_start = pl$col("date")$dt$month_start() )
Extract nanoseconds from underlying Datetime representation
expr_dt_nanosecond()
A polars expression
df <- pl$DataFrame( datetime = as.POSIXct( c( "1978-01-01 01:01:01", "2024-10-13 05:30:14.500", "2065-01-01 10:20:30.06" ), "UTC" ) ) df$with_columns( nanosecond = pl$col("datetime")$dt$nanosecond() )
This differs from pl$col("foo") + Duration in that it can take months and leap years into account. Note that only a single minus sign is allowed in the by string, as the first character.
expr_dt_offset_by(by)
by |
A string encoding a duration (see Details), or an expression that produces such strings. |
The by argument is created with the following string language:
1ns # 1 nanosecond
1us # 1 microsecond
1ms # 1 millisecond
1s # 1 second
1m # 1 minute
1h # 1 hour
1d # 1 day
1w # 1 calendar week
1mo # 1 calendar month
1y # 1 calendar year
1i # 1 index count
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
These strings can be combined:
3d12h4m25s # 3 days, 12 hours, 4 minutes, and 25 seconds
A polars expression
df <- pl$select( dates = pl$date_range( as.Date("2000-1-1"), as.Date("2005-1-1"), "1y" ) ) df$with_columns( date_plus_1y = pl$col("dates")$dt$offset_by("1y"), date_negative_offset = pl$col("dates")$dt$offset_by("-1y2mo") ) # the "by" argument also accepts expressions df <- pl$select( dates = pl$datetime_range( as.POSIXct("2022-01-01", tz = "GMT"), as.POSIXct("2022-01-02", tz = "GMT"), interval = "6h", time_unit = "ms", time_zone = "GMT" ), offset = pl$Series(values = c("1d", "-2d", "1mo", NA, "1y")) ) df$with_columns(new_dates = pl$col("dates")$dt$offset_by(pl$col("offset")))
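The duration string language described in the details of this function can be read as a sequence of (count, unit) pairs with an optional leading minus sign. The following base-R sketch splits such a string into its components. parse_duration_sketch() is a hypothetical helper written for illustration; it is not the parser polars uses, and it assumes the sign applies to every component.

```r
# Hypothetical helper: split a duration string such as "-1y2mo" or "3d12h4m25s"
# into a named integer vector of unit counts.
parse_duration_sketch <- function(by) {
  negative <- startsWith(by, "-")
  body <- sub("^-", "", by)
  # Two-letter units are listed first so "mo" and "ms" are not read as "m"
  pattern <- "([0-9]+)(ns|us|ms|mo|s|m|h|d|w|y|i)"
  tokens <- regmatches(body, gregexpr(pattern, body))[[1]]
  # The tokens must cover the whole string; anything else
  # (for example a second minus sign) is rejected.
  stopifnot(length(tokens) > 0, identical(paste(tokens, collapse = ""), body))
  counts <- as.integer(sub("[a-z]+$", "", tokens))
  units <- sub("^[0-9]+", "", tokens)
  setNames(if (negative) -counts else counts, units)
}

combined <- parse_duration_sketch("3d12h4m25s")
negative_offset <- parse_duration_sketch("-1y2mo")
```

Here combined holds 3 days, 12 hours, 4 minutes, and 25 seconds, while negative_offset holds minus 1 year and minus 2 months. A string such as "1d-2h" fails the coverage check, which mirrors the rule that a minus sign is only allowed as the first character.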
Returns the day of year starting from 1. The return value ranges from 1 to 366 (the last day of year differs across years).
expr_dt_ordinal_day()
A polars expression
df <- pl$select( date = pl$date_range( as.Date("2020-12-25"), as.Date("2021-1-05"), interval = "1d" ) ) df$with_columns( ordinal_day = pl$col("date")$dt$ordinal_day() )
Returns the quarter ranging from 1 to 4.
expr_dt_quarter()
A polars expression
df <- pl$select( date = pl$date_range( as.Date("2020-12-25"), as.Date("2021-1-05"), interval = "1d" ) ) df$with_columns( quarter = pl$col("date")$dt$quarter() )
Different from $dt$convert_time_zone(), this will also modify the underlying timestamp and will ignore the original time zone.
expr_dt_replace_time_zone( time_zone, ..., ambiguous = c("raise", "earliest", "latest", "null"), non_existent = c("raise", "null") )
time_zone |
|
... |
These dots are for future extensions and must be empty. |
ambiguous |
Determine how to deal with ambiguous datetimes. Character vector or expression containing one of the following: "raise", "earliest", "latest", or "null". |
non_existent |
Determine how to deal with non-existent datetimes. One of the following: "raise" or "null". |
A polars expression
df <- pl$select( london_timezone = pl$datetime_range( as.Date("2020-03-01"), as.Date("2020-07-01"), "1mo", time_zone = "UTC" )$dt$convert_time_zone(time_zone = "Europe/London") ) df$with_columns( London_to_Amsterdam = pl$col("london_timezone")$dt$replace_time_zone(time_zone="Europe/Amsterdam") ) # You can use `ambiguous` to deal with ambiguous datetimes: dates <- c( "2018-10-28 01:30", "2018-10-28 02:00", "2018-10-28 02:30", "2018-10-28 02:00" ) |> as.POSIXct("UTC") df2 <- pl$DataFrame( ts = as_polars_series(dates), ambiguous = c("earliest", "earliest", "latest", "latest"), ) df2$with_columns( ts_localized = pl$col("ts")$dt$replace_time_zone( "Europe/Brussels", ambiguous = pl$col("ambiguous") ) )
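The distinction between converting and replacing a time zone can be reproduced with base-R POSIXct values, which may help when reasoning about the two methods. This is an analogy in base R, not a description of how polars implements either operation.

```r
x <- as.POSIXct("2020-07-01 12:00:00", tz = "UTC")

# "Convert": the same instant, shown on a different clock
converted <- x
attr(converted, "tzone") <- "Europe/Amsterdam"

# "Replace": the same wall-clock reading, reinterpreted in a different zone
replaced <- as.POSIXct(format(x, "%Y-%m-%d %H:%M:%S"), tz = "Europe/Amsterdam")

format(converted, "%Y-%m-%d %H:%M %Z")
format(replaced, "%Y-%m-%d %H:%M %Z")

shift_converted <- as.numeric(converted) - as.numeric(x) # timestamp unchanged
shift_replaced <- as.numeric(replaced) - as.numeric(x)   # timestamp moved
```

Amsterdam is two hours ahead of UTC in July, so the converted value reads 14:00 with an unchanged timestamp, while the replaced value still reads 12:00 but its underlying timestamp is 7200 seconds earlier.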
Divide the date/datetime range into buckets. Each date/datetime in the first half of the interval is mapped to the start of its bucket. Each date/datetime in the second half of the interval is mapped to the end of its bucket. Ambiguous results are localised using the DST offset of the original timestamp. For example, rounding '2022-11-06 01:20:00 CST' by '1h' results in '2022-11-06 01:00:00 CST', whereas rounding '2022-11-06 01:20:00 CDT' by '1h' results in '2022-11-06 01:00:00 CDT'.
expr_dt_round(every)
every |
Either an Expr or a string indicating a column name or a duration (see Details). |
The every argument is created with the following string language:
1ns # 1 nanosecond
1us # 1 microsecond
1ms # 1 millisecond
1s # 1 second
1m # 1 minute
1h # 1 hour
1d # 1 day
1w # 1 calendar week
1mo # 1 calendar month
1y # 1 calendar year
These strings can be combined:
3d12h4m25s # 3 days, 12 hours, 4 minutes, and 25 seconds
A polars expression
df <- pl$select( datetime = pl$datetime_range( as.Date("2001-01-01"), as.Date("2001-01-02"), as.difftime("0:25:0") ) ) df$with_columns(round = pl$col("datetime")$dt$round("1h")) df <- pl$select( datetime = pl$datetime_range( as.POSIXct("2001-01-01 00:00"), as.POSIXct("2001-01-01 01:00"), as.difftime("0:10:0") ) ) df$with_columns(round = pl$col("datetime")$dt$round("1h"))
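For fixed-length units such as hours, the difference between rounding and truncating can be sketched with plain arithmetic on the number of seconds since the epoch. This base-R sketch assumes UTC, so no DST ambiguity arises, and it does not cover calendar units such as 1mo or 1y. Sending an exact midpoint to the end of the bucket, as the formula below does, is an assumption.

```r
x <- as.POSIXct(c("2001-01-01 00:20:00", "2001-01-01 00:50:00"), tz = "UTC")
every <- 3600 # bucket width in seconds, i.e. "1h"
secs <- as.numeric(x)

# Truncate: always go back to the start of the bucket
truncated <- as.POSIXct(floor(secs / every) * every, origin = "1970-01-01", tz = "UTC")

# Round: go to whichever bucket boundary is nearer
rounded <- as.POSIXct(floor(secs / every + 0.5) * every, origin = "1970-01-01", tz = "UTC")

data.frame(x, truncated, rounded)
```

Both inputs truncate to 00:00, whereas 00:50 rounds up to 01:00 because it lies in the second half of its bucket.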
Returns the integer second number from 0 to 59, or a floating point number from 0 to 60 if fractional = TRUE that includes any milli/micro/nanosecond component.
expr_dt_second(fractional = FALSE)
fractional |
If TRUE, include the fractional component of the second. |
A polars expression
df <- pl$DataFrame( datetime = as.POSIXct( c( "1978-01-01 01:01:01", "2024-10-13 05:30:14.500", "2065-01-01 10:20:30.06" ), "UTC" ) ) df$with_columns( second = pl$col("datetime")$dt$second(), second_fractional = pl$col("datetime")$dt$second(fractional = TRUE) )
Similar to $cast(pl$String), but this method allows you to customize the formatting of the resulting string. This is an alias for $dt$to_string().
expr_dt_strftime(format)
format |
Single string of format to use, or
|
A polars expression
pl$DataFrame( datetime = c(as.POSIXct(c("2021-01-02 00:00:00", "2021-01-03 00:00:00"))) )$ with_columns( datetime_string = pl$col("datetime")$dt$strftime("%Y/%m/%d %H:%M:%S") )
This only works on Datetime columns; it will error on Date columns.
expr_dt_time()
A polars expression
df <- pl$select(dates = pl$datetime_range( as.Date("2000-1-1"), as.Date("2000-1-2"), "1h" )) df$with_columns(times = pl$col("dates")$dt$time())
Get timestamp in the given time unit
expr_dt_timestamp(time_unit = c("us", "ns", "ms"))
time_unit |
Time unit, one of 'ns', 'us', or 'ms'. |
A polars expression
df <- pl$select(
  date = pl$datetime_range(
    start = as.Date("2001-1-1"),
    end = as.Date("2001-1-3"),
    interval = "1d1s"
  )
)
# The default time unit is "us", so the first timestamp column holds microseconds
df$select(
  pl$col("date"),
  pl$col("date")$dt$timestamp()$alias("timestamp_us"),
  pl$col("date")$dt$timestamp(time_unit = "ms")$alias("timestamp_ms")
)
Similar to $cast(pl$String), but this method allows you to customize the formatting of the resulting string; if no format is provided, the appropriate ISO format for the underlying data type is used.
expr_dt_to_string(format = NULL)
format |
Single string of format to use, or NULL (the default), in which case the ISO format appropriate for the underlying data type is used. |
A polars expression
df <- pl$DataFrame(
  dt = as.Date(c("1990-03-01", "2020-05-03", "2077-07-05")),
  dtm = as.POSIXct(c("1980-08-10 00:10:20", "2010-10-20 08:25:35", "2040-12-30 16:40:50")),
  tm = hms::as_hms(c("01:02:03.456789", "23:59:09.101", "00:00:00.000100")),
  dur = clock::duration_days(c(-1, 14, 0)) + clock::duration_hours(c(0, -10, 0)) +
    clock::duration_seconds(c(-42, 0, 0)) + clock::duration_microseconds(c(0, 1001, 0)),
)

# Default format for temporal dtypes is ISO8601:
df$select((cs$date() | cs$datetime())$dt$to_string()$name$prefix("s_"))
df$select((cs$time() | cs$duration())$dt$to_string()$name$prefix("s_"))

# All temporal types (aside from Duration) support strftime formatting:
df$select(
  pl$col("dtm"),
  s_dtm = pl$col("dtm")$dt$to_string("%Y/%m/%d (%H.%M.%S)"),
)

# The Polars Duration string format is also available:
df$select(pl$col("dur"), s_dur = pl$col("dur")$dt$to_string("polars"))

# If you're interested in extracting the day or month names,
# you can use the '%A' and '%B' strftime specifiers:
df$select(
  pl$col("dtm"),
  day_name = pl$col("dtm")$dt$to_string("%A"),
  month_name = pl$col("dtm")$dt$to_string("%B"),
)
Extract the days from a Duration type
expr_dt_total_days()
A polars expression
df <- pl$select( date = pl$datetime_range( start = as.Date("2020-3-1"), end = as.Date("2020-5-1"), interval = "1mo1s" ) ) df$with_columns( diff_days = pl$col("date")$diff()$dt$total_days() )
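The total_*() family expresses one and the same duration in different units. A base-R sketch of the arithmetic, assuming that only whole units are reported and any remainder is discarded (an assumption about the semantics, shown here for positive durations only):

```r
# Timestamps roughly one month apart, similar to the example above
x <- as.POSIXct(
  c("2020-03-01 00:00:00", "2020-04-01 00:00:01", "2020-05-01 00:00:02"),
  tz = "UTC"
)
elapsed_seconds <- diff(as.numeric(x))

# Whole units only: integer division drops the extra second
total_days <- elapsed_seconds %/% 86400
total_hours <- elapsed_seconds %/% 3600
total_minutes <- elapsed_seconds %/% 60

data.frame(elapsed_seconds, total_days, total_hours, total_minutes)
```

March has 31 days and April has 30, so the two durations come out as 31 and 30 whole days, or 744 and 720 whole hours.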
Extract the hours from a Duration type
expr_dt_total_hours()
A polars expression
df <- pl$select( date = pl$date_range( start = as.Date("2020-1-1"), end = as.Date("2020-1-4"), interval = "1d" ) ) df$with_columns( diff_hours = pl$col("date")$diff()$dt$total_hours() )
Extract the microseconds from a Duration type
expr_dt_total_microseconds()
A polars expression
df <- pl$select(date = pl$datetime_range( start = as.POSIXct("2020-1-1", tz = "GMT"), end = as.POSIXct("2020-1-1 00:00:01", tz = "GMT"), interval = "1ms" )) df$with_columns( diff_microsec = pl$col("date")$diff()$dt$total_microseconds() )
Extract the milliseconds from a Duration type
expr_dt_total_milliseconds()
A polars expression
df <- pl$select(date = pl$datetime_range( start = as.POSIXct("2020-1-1", tz = "GMT"), end = as.POSIXct("2020-1-1 00:00:01", tz = "GMT"), interval = "1ms" )) df$with_columns( diff_millisec = pl$col("date")$diff()$dt$total_milliseconds() )
Extract the minutes from a Duration type
expr_dt_total_minutes()
A polars expression
df <- pl$select( date = pl$date_range( start = as.Date("2020-1-1"), end = as.Date("2020-1-4"), interval = "1d" ) ) df$with_columns( diff_minutes = pl$col("date")$diff()$dt$total_minutes() )
Extract the nanoseconds from a Duration type
expr_dt_total_nanoseconds()
A polars expression
df <- pl$select(date = pl$datetime_range( start = as.POSIXct("2020-1-1", tz = "GMT"), end = as.POSIXct("2020-1-1 00:00:01", tz = "GMT"), interval = "1ms" )) df$with_columns( diff_nanosec = pl$col("date")$diff()$dt$total_nanoseconds() )
Extract the seconds from a Duration type
expr_dt_total_seconds()
A polars expression
df <- pl$select(date = pl$datetime_range( start = as.POSIXct("2020-1-1", tz = "GMT"), end = as.POSIXct("2020-1-1 00:04:00", tz = "GMT"), interval = "1m" )) df$with_columns( diff_sec = pl$col("date")$diff()$dt$total_seconds() )
Divide the date/datetime range into buckets. Each date/datetime is mapped to the start of its bucket using the corresponding local datetime. Note that weekly buckets start on Monday. Ambiguous results are localised using the DST offset of the original timestamp. For example, truncating '2022-11-06 01:30:00 CST' by '1h' results in '2022-11-06 01:00:00 CST', whereas truncating '2022-11-06 01:30:00 CDT' by '1h' results in '2022-11-06 01:00:00 CDT'.
expr_dt_truncate(every)
every |
Either an Expr or a string indicating a column name or a duration (see Details). |
The every argument is created with the following string language:
1ns # 1 nanosecond
1us # 1 microsecond
1ms # 1 millisecond
1s # 1 second
1m # 1 minute
1h # 1 hour
1d # 1 day
1w # 1 calendar week
1mo # 1 calendar month
1y # 1 calendar year
These strings can be combined:
3d12h4m25s # 3 days, 12 hours, 4 minutes, and 25 seconds
A polars expression
df <- pl$select( datetime = pl$datetime_range( as.Date("2001-01-01"), as.Date("2001-01-02"), as.difftime("0:25:0") ) ) df$with_columns(truncated = pl$col("datetime")$dt$truncate("1h")) df <- pl$select( datetime = pl$datetime_range( as.POSIXct("2001-01-01 00:00"), as.POSIXct("2001-01-01 01:00"), as.difftime("0:10:0") ) ) df$with_columns(truncated = pl$col("datetime")$dt$truncate("30m"))
Returns the ISO week number starting from 1. The return value ranges from 1 to 53 (the last week of year differs across years).
expr_dt_week()
A polars expression
df <- pl$select( date = pl$date_range( as.Date("2020-12-25"), as.Date("2021-1-05"), interval = "1d" ) ) df$with_columns( week = pl$col("date")$dt$week() )
Returns the ISO weekday number where Monday = 1 and Sunday = 7.
expr_dt_weekday()
A polars expression
df <- pl$select( date = pl$date_range( as.Date("2020-12-25"), as.Date("2021-1-05"), interval = "1d" ) ) df$with_columns( weekday = pl$col("date")$dt$weekday() )
This is deprecated. Cast to Int64 and then to Datetime instead.
expr_dt_with_time_unit(time_unit = c("ns", "us", "ms"))
time_unit |
Time unit, one of 'ns', 'us', or 'ms'. |
A polars expression
df <- pl$select( date = pl$datetime_range( start = as.Date("2001-1-1"), end = as.Date("2001-1-3"), interval = "1d1s" ) ) df$with_columns( with_time_unit_ns = pl$col("date")$dt$with_time_unit(), with_time_unit_ms = pl$col("date")$dt$with_time_unit(time_unit = "ms") )
Returns the year number in the calendar date.
expr_dt_year()
A polars expression
df <- pl$DataFrame( date = as.Date(c("1977-01-01", "1978-01-01", "1979-01-01")) ) df$with_columns( year = pl$col("date")$dt$year(), iso_year = pl$col("date")$dt$iso_year() )
Evaluate whether all boolean values in a sub-list are true
expr_list_all()
A polars expression
df <- pl$DataFrame( a = list(c(TRUE, TRUE), c(FALSE, TRUE), c(FALSE, FALSE), NA, c()) ) df$with_columns(all = pl$col("a")$list$all())
Evaluate whether any boolean value in a sub-list is true
expr_list_any()
A polars expression
df <- pl$DataFrame( a = list(c(TRUE, TRUE), c(FALSE, TRUE), c(FALSE, FALSE), NA, c()) ) df$with_columns(any = pl$col("a")$list$any())
Retrieve the index of the maximum value in every sub-list
expr_list_arg_max()
A polars expression
df <- pl$DataFrame(s = list(1:2, 2:1)) df$with_columns( arg_max = pl$col("s")$list$arg_max() )
Retrieve the index of the minimum value in every sub-list
expr_list_arg_min()
A polars expression
df <- pl$DataFrame(s = list(1:2, 2:1)) df$with_columns( arg_min = pl$col("s")$list$arg_min() )
Concat the lists into a new list
expr_list_concat(other)
other |
Values to concat with. Can be an Expr or something coercible to an Expr. |
A polars expression
df <- pl$DataFrame( a = list("a", "x"), b = list(c("b", "c"), c("y", "z")) ) df$with_columns( conc_to_b = pl$col("a")$list$concat(pl$col("b")), conc_to_lit_str = pl$col("a")$list$concat(pl$lit("some string")), conc_to_lit_list = pl$col("a")$list$concat(pl$lit(list("hello", c("hello", "world")))) )
Check if sub-lists contain a given value
expr_list_contains(item)
item |
Item that will be checked for membership. Can be an Expr or something coercible to an Expr. Strings are not parsed as columns. |
A polars expression
df <- pl$DataFrame(
  a = list(3:1, NULL, 1:2),
  item = 0:2
)
df$with_columns(
  with_expr = pl$col("a")$list$contains(pl$col("item")),
  with_lit = pl$col("a")$list$contains(1)
)
Count how often the value produced by element occurs in every sub-list
expr_list_count_matches(element)
element |
An expression that produces a single value. |
A polars expression
df <- pl$DataFrame(a = list(0, 1, c(1, 2, 3, 2), c(1, 2, 1), c(4, 4)))
df$with_columns(
  number_of_twos = pl$col("a")$list$count_matches(2)
)
This computes the first discrete difference between shifted items of every
list. The parameter n gives the interval between items to subtract, e.g.
if n = 2 the output will be the difference between the 1st and the 3rd
value, the 2nd and 4th value, etc.
expr_list_diff(n = 1, null_behavior = c("ignore", "drop"))
n |
Number of slots to shift. If negative, then it starts from the end. |
null_behavior |
How to handle null values. Must be one of "ignore" (default) or "drop". |
A polars expression
df <- pl$DataFrame(s = list(1:4, c(10L, 2L, 1L)))
df$with_columns(diff = pl$col("s")$list$diff(2))

# negative value starts shifting from the end
df$with_columns(diff = pl$col("s")$list$diff(-2))
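To make the role of n concrete, the lagged difference can be sketched in plain base R, one list element at a time. This does not call polars: lagged_diff is a hypothetical helper, it covers positive n only, and it assumes that positions with no partner n steps back come out as missing.

```r
# Base R sketch of a lagged difference; `lagged_diff` is a hypothetical
# helper, not part of polars.
lagged_diff <- function(x, n = 1L) {
  stopifnot(n >= 1L) # this sketch covers positive lags only
  len <- length(x)
  # No element has a partner n steps back: everything is missing
  if (n >= len) return(rep(NA_integer_, len))
  # Subtract the value n positions earlier from each remaining value
  c(rep(NA_integer_, n), x[(n + 1L):len] - x[seq_len(len - n)])
}

s <- list(1:4, c(10L, 2L, 1L))
lapply(s, lagged_diff, n = 2L)
# first element:  NA NA 2 2   (3 - 1, 4 - 2)
# second element: NA NA -9    (1 - 10)
```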
Drop all null values in every sub-list
expr_list_drop_nulls()
A polars expression
df <- pl$DataFrame(values = list(c(NA, 0, NA), c(1, NaN), NA))
df$with_columns(
  without_nulls = pl$col("values")$list$drop_nulls()
)
Run any polars expression on the sub-lists' values
expr_list_eval(expr, ..., parallel = FALSE)
expr |
Expression to run. Note that you can select an element with
pl$element(). |
parallel |
Run all expressions in parallel. Don't activate this blindly.
Parallelism is worth it if there is enough work to do per thread. This
likely should not be used in the group_by context, because groups are
already executed in parallel. |
A polars expression
df <- pl$DataFrame(
  a = list(c(1, 8, 3), c(3, 2), c(NA, NA, 1)),
  b = list(c("R", "is", "amazing"), c("foo", "bar"), "text")
)
df

# standardize each value inside a list, using only the values in this list
df$select(
  a_stand = pl$col("a")$list$eval(
    (pl$element() - pl$element()$mean()) / pl$element()$std()
  )
)

# count characters for each element in list. Since column "b" is list[str],
# we can apply all `$str` functions on elements in the list:
df$select(
  b_len_chars = pl$col("b")$list$eval(
    pl$element()$str$len_chars()
  )
)

# concat strings in each list
df$select(
  pl$col("b")$list$eval(pl$element()$str$join(" "))$list$first()
)
Returns a column with a separate row for every list element
expr_list_explode()
A polars expression
df <- pl$DataFrame(a = list(c(1, 2, 3), c(4, 5, 6)))
df$select(pl$col("a")$list$explode())
Get the first value of the sub-lists
expr_list_first()
A polars expression
df <- pl$DataFrame(a = list(3:1, NULL, 1:2))
df$with_columns(
  first = pl$col("a")$list$first()
)
This extracts several values per list. To extract a single value by
index, use $list$get(). The indices may be defined in a
single column, or by sub-lists in another column of dtype List.
expr_list_gather(index, ..., null_on_oob = FALSE)
index |
An Expr or something coercible to an Expr, that can return
several indices. Values are 0-indexed (so index 0 would return the
first item of every sub-list) and negative values start from the end (index
-1 returns the last item). |
... |
These dots are for future extensions and must be empty. |
null_on_oob |
If TRUE, return null for an index that is out of bounds. If FALSE, raise an error instead. |
A polars expression
df <- pl$DataFrame(
  a = list(c(3, 2, 1), 1, c(1, 2)),
  idx = list(0:1, integer(), c(1L, 999L))
)
df$with_columns(
  gathered = pl$col("a")$list$gather("idx", null_on_oob = TRUE)
)

df$with_columns(
  gathered = pl$col("a")$list$gather(2, null_on_oob = TRUE)
)

# by some column name, must cast to an Int/Uint type to work
df$with_columns(
  gathered = pl$col("a")$list$gather(pl$col("a")$cast(pl$List(pl$UInt64)), null_on_oob = TRUE)
)
Take every n-th value starting from offset in sub-lists
expr_list_gather_every(n, offset = 0)
A polars expression
df <- pl$DataFrame(
  a = list(1:5, 6:8, 9:12),
  n = c(2, 1, 3),
  offset = c(0, 1, 0)
)
df$with_columns(
  gather_every = pl$col("a")$list$gather_every(pl$col("n"), offset = pl$col("offset"))
)
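As a rough illustration of what "every n-th value starting from offset" selects, here is a base R sketch that mirrors the three rows of the example above. It does not use polars: every_nth is a hypothetical helper, and it assumes offset is 0-based like the other list indices in this reference.

```r
# Base R sketch; `every_nth` is a hypothetical helper, not part of polars.
every_nth <- function(x, n, offset = 0) {
  start <- offset + 1 # convert the 0-based offset to R's 1-based position
  if (start > length(x)) return(x[0]) # offset past the end: nothing to take
  x[seq(start, length(x), by = n)]
}

every_nth(1:5, n = 2)             # 1 3 5
every_nth(6:8, n = 1, offset = 1) # 7 8
every_nth(9:12, n = 3)            # 9 12
```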
This extracts only one value per list. To extract several values by
index, use $list$gather().
expr_list_get(index, ..., null_on_oob = TRUE)
index |
An Expr or something coercible to an Expr, that must return a
single index. Values are 0-indexed (so index 0 would return the first item
of every sub-list) and negative values start from the end (index -1 returns the last item). |
... |
These dots are for future extensions and must be empty. |
null_on_oob |
If TRUE, return null for an index that is out of bounds. If FALSE, raise an error instead. |
A polars expression
df <- pl$DataFrame(
  values = list(c(2, 2, NA), c(1, 2, 3), NA, NULL),
  idx = c(1, 2, NA, 3)
)
df$with_columns(
  using_expr = pl$col("values")$list$get("idx"),
  val_0 = pl$col("values")$list$get(0),
  val_minus_1 = pl$col("values")$list$get(-1),
  val_oob = pl$col("values")$list$get(10)
)
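Because R itself is 1-indexed, the 0-based and negative indexing used here can be surprising. The base R sketch below shows one way to read it. get_at is a hypothetical helper, not a polars function, and it assumes the behaviour of null_on_oob = TRUE (an out-of-bounds index gives a missing value rather than an error).

```r
# Base R sketch of 0-based / negative indexing; `get_at` is hypothetical.
get_at <- function(x, index) {
  len <- length(x)
  # Negative indices count from the end; non-negative ones are shifted by 1
  pos <- if (index < 0) len + index + 1 else index + 1
  if (pos < 1 || pos > len) return(NA) # out of bounds
  x[[pos]]
}

values <- c(1, 2, 3)
get_at(values, 0)  # 1  (first item)
get_at(values, -1) # 3  (last item)
get_at(values, 10) # NA (out of bounds)
```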
Slice the first n values of every sub-list
expr_list_head(n = 5L)
n |
Number of values to return for each sub-list. Can be an Expr. Strings are parsed as column names. |
A polars expression
df <- pl$DataFrame(
  s = list(1:4, c(10L, 2L, 1L)),
  n = 1:2
)
df$with_columns(
  head_by_expr = pl$col("s")$list$head("n"),
  head_by_lit = pl$col("s")$list$head(2)
)
Join all string items in a sub-list and place a separator between them. This
only works if the inner dtype is String.
expr_list_join(separator, ..., ignore_nulls = FALSE)
separator |
String to separate the items with. Can be an Expr. Strings are not parsed as columns. |
... |
These dots are for future extensions and must be empty. |
A polars expression
df <- pl$DataFrame(
  s = list(c("a", "b", "c"), c("x", "y"), c("e", NA)),
  separator = c("-", "+", "/")
)
df$with_columns(
  join_with_expr = pl$col("s")$list$join(pl$col("separator")),
  join_with_lit = pl$col("s")$list$join(" "),
  join_ignore_null = pl$col("s")$list$join(" ", ignore_nulls = TRUE)
)
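The effect of ignore_nulls can be sketched in base R as follows. This is only an illustration under an assumption: that with ignore_nulls = FALSE a missing item makes the whole joined result missing, while ignore_nulls = TRUE skips missing items. join_items is a hypothetical helper and does not call polars.

```r
# Base R sketch of joining with optional null skipping; `join_items` is
# a hypothetical helper, not part of polars.
join_items <- function(x, separator, ignore_nulls = FALSE) {
  if (anyNA(x)) {
    # Assumed behaviour: a missing item propagates unless it is ignored
    if (!ignore_nulls) return(NA_character_)
    x <- x[!is.na(x)]
  }
  paste(x, collapse = separator)
}

join_items(c("a", "b", "c"), "-")                # "a-b-c"
join_items(c("e", NA), "/")                      # NA
join_items(c("e", NA), " ", ignore_nulls = TRUE) # "e"
```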
Get the last value of the sub-lists
expr_list_last()
A polars expression
df <- pl$DataFrame(a = list(3:1, NULL, 1:2))
df$with_columns(
  last = pl$col("a")$list$last()
)
Null values are counted in the total.
expr_list_len()
A polars expression
df <- pl$DataFrame(list_of_strs = list(c("a", "b", NA), "c"))
df$with_columns(len_list = pl$col("list_of_strs")$list$len())
Compute the maximum value in every sub-list
expr_list_max()
A polars expression
df <- pl$DataFrame(values = list(c(1, 2, 3, NA), c(2, 3), NA))
df$with_columns(max = pl$col("values")$list$max())
Compute the mean value in every sub-list
expr_list_mean()
A polars expression
df <- pl$DataFrame(values = list(c(1, 2, 3, NA), c(2, 3), NA))
df$with_columns(mean = pl$col("values")$list$mean())
Compute the median in every sub-list
expr_list_median()
A polars expression
df <- pl$DataFrame(values = list(c(-1, 0, 1), c(1, 10)))
df$with_columns(
  median = pl$col("values")$list$median()
)
Compute the minimum value in every sub-list
expr_list_min()
A polars expression
df <- pl$DataFrame(values = list(c(1, 2, 3, NA), c(2, 3), NA))
df$with_columns(min = pl$col("values")$list$min())
Count the number of unique values in every sub-list
expr_list_n_unique()
A polars expression
df <- pl$DataFrame(values = list(c(2, 2, NA), c(1, 2, 3), NA))
df$with_columns(unique = pl$col("values")$list$n_unique())
Reverse values in every sub-list
expr_list_reverse()
A polars expression
df <- pl$DataFrame(values = list(c(1, 2, 3, NA), c(2, 3), NA))
df$with_columns(reverse = pl$col("values")$list$reverse())
Sample values from every sub-list
expr_list_sample(
  n = NULL,
  ...,
  fraction = NULL,
  with_replacement = FALSE,
  shuffle = FALSE,
  seed = NULL
)
A polars expression
df <- pl$DataFrame(
  values = list(1:3, NA, c(NA, 3L), 5:7),
  n = c(1, 1, 1, 2)
)
df$with_columns(
  sample = pl$col("values")$list$sample(n = pl$col("n"), seed = 1)
)
This returns the "asymmetric difference", meaning only the elements of the
first list that are not in the second list. To get all elements that are in
only one of the two lists, use $set_symmetric_difference().
expr_list_set_difference(other)
other |
Other list variable. Can be an Expr or something coercible to an Expr. |
Note that the datatypes inside the list must have a common supertype. For
example, the first column can be list[i32] and the second one can be
list[i8] because it can be cast to list[i32]. However, the second column
cannot be e.g. list[f32].
A polars expression
df <- pl$DataFrame(
  a = list(1:3, NA, c(NA, 3L), 5:7),
  b = list(2:4, 3L, c(3L, 4L, NA), c(6L, 8L))
)
df$with_columns(difference = pl$col("a")$list$set_difference("b"))
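The distinction between the asymmetric and the symmetric difference is easy to check on two plain vectors with base R's set functions. This is only an analogy for a single pair of sub-lists, not the polars implementation; the variable names are illustrative.

```r
# Base R analogy for one pair of sub-lists (not polars code).
a <- c(1L, 2L, 3L)
b <- c(2L, 3L, 4L)

# Asymmetric difference: elements of `a` that are not in `b`
asym <- setdiff(a, b)
asym # 1

# Symmetric difference: elements found in exactly one of the two vectors
sym <- union(setdiff(a, b), setdiff(b, a))
sym # 1 4
```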
Compute the intersection between elements of a list and other elements
expr_list_set_intersection(other)
other |
Other list variable. Can be an Expr or something coercible to an Expr. |
Note that the datatypes inside the list must have a common supertype. For
example, the first column can be list[i32] and the second one can be
list[i8] because it can be cast to list[i32]. However, the second column
cannot be e.g. list[f32].
A polars expression
df <- pl$DataFrame(
  a = list(1:3, NA, c(NA, 3L), 5:7),
  b = list(2:4, 3L, c(3L, 4L, NA), c(6L, 8L))
)
df$with_columns(intersection = pl$col("a")$list$set_intersection("b"))
This returns all elements that are in only one of the two lists. To get only
elements that are in the first list but not in the second one, use
$set_difference().
expr_list_set_symmetric_difference(other)
other |
Other list variable. Can be an Expr or something coercible to an Expr. |
Note that the datatypes inside the list must have a common supertype. For
example, the first column can be list[i32] and the second one can be
list[i8] because it can be cast to list[i32]. However, the second column
cannot be e.g. list[f32].
A polars expression
df <- pl$DataFrame(
  a = list(1:3, NA, c(NA, 3L), 5:7),
  b = list(2:4, 3L, c(3L, 4L, NA), c(6L, 8L))
)
df$with_columns(
  symmetric_difference = pl$col("a")$list$set_symmetric_difference("b")
)
Compute the union of elements of a list and other elements
expr_list_set_union(other)
other |
Other list variable. Can be an Expr or something coercible to an Expr. |
Note that the datatypes inside the list must have a common supertype. For
example, the first column can be list[i32] and the second one can be
list[i8] because it can be cast to list[i32]. However, the second column
cannot be e.g. list[f32].
A polars expression
df <- pl$DataFrame(
  a = list(1:3, NA, c(NA, 3L), 5:7),
  b = list(2:4, 3L, c(3L, 4L, NA), c(6L, 8L))
)
df$with_columns(union = pl$col("a")$list$set_union("b"))
Shift list values by the given number of indices
expr_list_shift(n = 1)
A polars expression
df <- pl$DataFrame(
  s = list(1:4, c(10L, 2L, 1L)),
  idx = 1:2
)
df$with_columns(
  shift_by_expr = pl$col("s")$list$shift(pl$col("idx")),
  shift_by_lit = pl$col("s")$list$shift(2),
  shift_by_negative_lit = pl$col("s")$list$shift(-2)
)
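What shifting does to a single sub-list can be sketched in base R. The helper shift_values is hypothetical and does not call polars. It assumes that a positive n moves values towards the end, a negative n moves them towards the start, and vacated positions become missing.

```r
# Base R sketch of shifting one vector; `shift_values` is hypothetical.
shift_values <- function(x, n = 1) {
  len <- length(x)
  idx <- seq_len(len) - n          # source position for each output slot
  idx[idx < 1 | idx > len] <- NA   # slots with no source become missing
  x[idx]
}

shift_values(1:4, 2)  # NA NA 1 2
shift_values(1:4, -2) # 3 4 NA NA
```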
This extracts at most length values, starting at index offset. It can
return fewer than length values if length is larger than the number of
values.
expr_list_slice(offset, length = NULL)
offset |
Start index. Negative indexing is supported. Can be an Expr. Strings are parsed as column names. |
length |
Length of the slice. If NULL (default), the slice is taken to the end of the list. |
A polars expression
df <- pl$DataFrame(
  s = list(1:4, c(10L, 2L, 1L)),
  idx_off = 1:2,
  len = c(4, 1)
)
df$with_columns(
  slice_by_expr = pl$col("s")$list$slice("idx_off", "len"),
  slice_by_lit = pl$col("s")$list$slice(2, 3)
)
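The interaction between offset and length can be sketched in base R for one sub-list. slice_values and its n_values argument are hypothetical names, and nothing here calls polars. The sketch assumes a 0-based offset, negative offsets counted from the end, and a slice that is silently truncated at the end of the vector.

```r
# Base R sketch of slicing one vector; `slice_values` is hypothetical.
slice_values <- function(x, offset, n_values = NULL) {
  len <- length(x)
  # Convert the 0-based (possibly negative) offset to a 1-based start
  start <- if (offset < 0) max(len + offset, 0) + 1 else offset + 1
  if (start > len) return(x[0])
  # Without a length, run to the end; otherwise stop at the end at the latest
  end <- if (is.null(n_values)) len else min(start + n_values - 1, len)
  if (end < start) return(x[0])
  x[start:end]
}

slice_values(1:4, 2, 3)              # 3 4  (only two values are available)
slice_values(c(10L, 2L, 1L), 2, 3)   # 1
slice_values(1:4, -2)                # 3 4
```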
Sort values in every sub-list
expr_list_sort(..., descending = FALSE, nulls_last = FALSE)
... |
These dots are for future extensions and must be empty. |
descending |
Sort values in descending order. |
nulls_last |
Place null values last. |
A polars expression
df <- pl$DataFrame(values = list(c(NA, 2, 1, 3), c(Inf, 2, 3, NaN), NA))
df$with_columns(sort = pl$col("values")$list$sort())
Compute the standard deviation in every sub-list
expr_list_std(ddof = 1)
ddof |
"Delta Degrees of Freedom": the divisor used in the calculation is
N - ddof, where N represents the number of elements. By default ddof is 1. |
A polars expression
df <- pl$DataFrame(values = list(c(-1, 0, 1), c(1, 10)))
df$with_columns(
  std = pl$col("values")$list$std()
)
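The role of ddof is easiest to see by writing the formula out. In the base R sketch below, the sum of squared deviations is divided by N - ddof, so ddof = 1 gives the sample standard deviation (what base R's sd() computes) and ddof = 0 the population one. std_ddof is a hypothetical helper, not a polars function.

```r
# Base R sketch of a standard deviation with a configurable divisor;
# `std_ddof` is a hypothetical helper, not part of polars.
std_ddof <- function(x, ddof = 1) {
  n <- length(x)
  sqrt(sum((x - mean(x))^2) / (n - ddof))
}

std_ddof(c(-1, 0, 1))           # 1, same as sd(c(-1, 0, 1))
std_ddof(c(-1, 0, 1), ddof = 0) # sqrt(2 / 3), the population version
```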
Sum all elements in every sub-list
expr_list_sum()
A polars expression
df <- pl$DataFrame(values = list(c(1, 2, 3, NA), c(2, 3), NA))
df$with_columns(sum = pl$col("values")$list$sum())
Slice the last n values of every sub-list
expr_list_tail(n = 5L)
n |
Number of values to return for each sub-list. Can be an Expr. Strings are parsed as column names. |
A polars expression
df <- pl$DataFrame(
  s = list(1:4, c(10L, 2L, 1L)),
  n = 1:2
)
df$with_columns(
  tail_by_expr = pl$col("s")$list$tail("n"),
  tail_by_lit = pl$col("s")$list$tail(2)
)
Convert a List column into an Array column with the same inner data type
expr_list_to_array(width)
width |
Width of the resulting Array column. |
A polars expression
df <- pl$DataFrame(values = list(c(-1, 0), c(1, 10)))
df$with_columns(
  array = pl$col("values")$list$to_array(2)
)
Get unique values in a list
expr_list_unique(..., maintain_order = FALSE)
... |
These dots are for future extensions and must be empty. |
maintain_order |
Maintain order of data. This requires more work. |
A polars expression
df <- pl$DataFrame(values = list(c(2, 2, NA), c(1, 2, 3), NA))
df$with_columns(unique = pl$col("values")$list$unique())
Compute the variance in every sub-list
expr_list_var(ddof = 1)
A polars expression
df <- pl$DataFrame(values = list(c(-1, 0, 1), c(1, 10)))
df$with_columns(
  var = pl$col("values")$list$var()
)
Indicate if this expression is the same as another expression
expr_meta_eq(other)
A logical value.
foo_bar <- pl$col("foo")$alias("bar")
foo <- pl$col("foo")
foo_bar$meta$eq(foo)

foo_bar2 <- pl$col("foo")$alias("bar")
foo_bar$meta$eq(foo_bar2)
Indicate if this expression expands into multiple expressions
expr_meta_has_multiple_outputs()
A logical value.
e <- pl$col(c("a", "b"))$name$suffix("_foo")
e$meta$has_multiple_outputs()
Indicate if this expression is a basic (non-regex) unaliased column
expr_meta_is_column()
A logical value.
e <- pl$col("foo")
e$meta$is_column()

e <- pl$col("foo") * pl$col("bar")
e$meta$is_column()

e <- pl$col(r"(^col.*\d+$)")
e$meta$is_column()
This can include bare columns, column matches by regex or dtype, selectors and exclude ops, and (optionally) column/expression aliasing.
expr_meta_is_column_selection(..., allow_aliasing = FALSE)
... |
These dots are for future extensions and must be empty. |
allow_aliasing |
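If FALSE (default), any aliasing is not considered to be column selection. Set to TRUE to allow column selection that also includes aliasing. |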
A logical value.
e <- pl$col("foo")
e$meta$is_column_selection()

e <- pl$col("foo")$alias("bar")
e$meta$is_column_selection()

e$meta$is_column_selection(allow_aliasing = TRUE)

e <- pl$col("foo") * pl$col("bar")
e$meta$is_column_selection()

e <- cs$starts_with("foo")
e$meta$is_column_selection()
Indicate if this expression expands to columns that match a regex pattern
expr_meta_is_regex_projection()
A logical value.
e <- pl$col("^.*$")$name$prefix("foo_")
e$meta$is_regex_projection()
Indicate if this expression is not the same as another expression
expr_meta_ne(other)
A logical value.
foo_bar <- pl$col("foo")$alias("bar")
foo <- pl$col("foo")
foo_bar$meta$ne(foo)

foo_bar2 <- pl$col("foo")$alias("bar")
foo_bar$meta$ne(foo_bar2)
It may not always be possible to determine the output name as that can
depend on the schema of the context; in that case this will raise an error
if raise_if_undetermined = TRUE (the default), and return NA otherwise.
expr_meta_output_name(..., raise_if_undetermined = TRUE)
... |
These dots are for future extensions and must be empty. |
raise_if_undetermined |
If TRUE (default), raise an error if the output name cannot be determined. Otherwise, return NA. |
A character value.
e <- pl$col("foo") * pl$col("bar")
e$meta$output_name()

e_filter <- pl$col("foo")$filter(pl$col("bar") == 13)
e_filter$meta$output_name()

e_sum_over <- pl$col("foo")$sum()$over("groups")
e_sum_over$meta$output_name()

e_sum_slice <- pl$col("foo")$sum()$slice(pl$len() - 10, pl$col("bar"))
e_sum_slice$meta$output_name()

pl$len()$meta$output_name()
Pop the latest expression and return the input(s) of the popped expression
expr_meta_pop()
A list of polars expressions
e <- pl$col("foo")$alias("bar")
pop <- e$meta$pop()
pop

pop[[1]]$meta$eq(pl$col("foo"))
Get a list with the root column name
expr_meta_root_names()
A character vector
e <- pl$col("foo") * pl$col("bar")
e$meta$root_names()

e_filter <- pl$col("foo")$filter(pl$col("bar") == 13)
e_filter$meta$root_names()

e_sum_over <- pl$sum("foo")$over("groups")
e_sum_over$meta$root_names()

e_sum_slice <- pl$sum("foo")$slice(pl$len() - 10, pl$col("bar"))
e_sum_slice$meta$root_names()
Serialize this expression to a string in binary or JSON format
expr_meta_serialize(..., format = c("binary", "json"))
... |
These dots are for future extensions and must be empty. |
format |
The format in which to serialize. Must be one of:
"binary" (default): serialize to binary format (bytes).
"json": serialize to JSON format (string). |
Serialization is not stable across Polars versions: an expression serialized in one Polars version may not be deserializable in another Polars version.
A raw vector (binary format) or a string (JSON format)
# Serialize the expression into a binary representation.
expr <- pl$col("foo")$sum()$over("bar")
bytes <- expr$meta$serialize()
rawToChar(bytes)

pl$deserialize_expr(bytes)

# Serialize into json
expr$meta$serialize(format = "json") |>
  jsonlite::prettify()
Format the expression as a tree
expr_meta_tree_format()
A character vector
my_expr <- (pl$col("foo") * pl$col("bar"))$sum()$over(pl$col("ham")) / 2
my_expr$meta$tree_format() |>
  cat()
Undo any renaming operation like alias or name$keep
expr_meta_undo_aliases()
A polars expression
e <- pl$col("foo")$alias("bar")
e$meta$undo_aliases()$meta$eq(pl$col("foo"))

e <- pl$col("foo")$sum()$over("bar")
e$name$keep()$meta$undo_aliases()$meta$eq(e)
Check if string contains a substring that matches a pattern
expr_str_contains(pattern, ..., literal = FALSE, strict = TRUE)
pattern |
A character value, or something that can be coerced to a string Expr, containing a valid regex pattern compatible with the regex crate. |
... |
These dots are for future extensions and must be empty. |
literal |
Logical. If TRUE, treat pattern as a literal string rather than as a regular expression. |
strict |
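Logical. If TRUE (default), raise an error if the underlying pattern is not a valid regex. Otherwise, mask out with a null value. |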
To modify regular expression behaviour (such as case-sensitivity)
with flags, use the inline (?iLmsuxU) syntax. See the regex crate’s section
on grouping and flags for additional information about the use of inline
expression modifiers.
A polars expression
$str$starts_with(): Check if string values start with a substring.
$str$ends_with(): Check if string values end with a substring.
$str$find(): Return the index position of the first substring matching a pattern.
# The inline `(?i)` syntax example
pl$DataFrame(s = c("AAA", "aAa", "aaa"))$with_columns(
  default_match = pl$col("s")$str$contains("AA"),
  insensitive_match = pl$col("s")$str$contains("(?i)AA")
)

df <- pl$DataFrame(txt = c("Crab", "cat and dog", "rab$bit", NA))
df$with_columns(
  regex = pl$col("txt")$str$contains("cat|bit"),
  literal = pl$col("txt")$str$contains("rab$", literal = TRUE)
)
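For readers more familiar with base R, the same three ideas (default regex matching, an inline (?i) flag, and literal matching) can be tried with grepl(). This is only an analogy: base R uses its own regex engines (PCRE when perl = TRUE), not the Rust regex crate, so flag support and edge cases may differ.

```r
# Base R analogy using grepl(); not polars code and not the regex crate.
s <- c("AAA", "aAa", "aaa")
default_match <- grepl("AA", s, perl = TRUE)
insensitive_match <- grepl("(?i)AA", s, perl = TRUE) # inline flag

txt <- c("Crab", "cat and dog", "rab$bit")
as_regex <- grepl("rab$", txt, perl = TRUE)    # `$` anchors the end of string
as_literal <- grepl("rab$", txt, fixed = TRUE) # `$` is an ordinary character

default_match     # TRUE FALSE FALSE
insensitive_match # TRUE TRUE TRUE
as_regex          # TRUE FALSE FALSE
as_literal        # FALSE FALSE TRUE
```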
This function determines if any of the patterns find a match.
expr_str_contains_any(patterns, ..., ascii_case_insensitive = FALSE)
patterns |
Character vector, or something that can be coerced to a string Expr, containing valid regex patterns compatible with the regex crate. |
... |
These dots are for future extensions and must be empty. |
ascii_case_insensitive |
Enable ASCII-aware case insensitive matching. When this option is enabled, searching will be performed without respect to case for ASCII letters (a-z and A-Z) only. |
A polars expression
df <- pl$DataFrame(
  lyrics = c(
    "Everybody wants to rule the world",
    "Tell me what you want, what you really really want",
    "Can you feel the love tonight"
  )
)
df$with_columns(
  contains_any = pl$col("lyrics")$str$contains_any(c("you", "me"))
)
Count all successive non-overlapping regex matches
expr_str_count_matches(pattern, ..., literal = FALSE)
pattern |
A character value, or something that can be coerced to a string Expr, containing a valid regex pattern compatible with the regex crate. |
... |
These dots are for future extensions and must be empty. |
literal |
Logical. If TRUE, treat pattern as a literal string rather than as a regular expression. |
A polars expression
df <- pl$DataFrame(foo = c("12 dbc 3xy", "cat\\w", "1zy3\\d\\d", NA))
df$with_columns(
  count_digits = pl$col("foo")$str$count_matches(r"(\d)"),
  count_slash_d = pl$col("foo")$str$count_matches(r"(\d)", literal = TRUE)
)
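The difference between counting regex matches and counting literal occurrences can be reproduced in base R with gregexpr() and regmatches(). This is an analogy only: count_matches below is a hypothetical helper, it uses R's regex engine rather than the regex crate, and it is applied to the non-missing strings from the example.

```r
# Base R analogy for counting non-overlapping matches; `count_matches`
# here is a hypothetical helper, not the polars method.
count_matches <- function(x, pattern, literal = FALSE) {
  hits <- gregexpr(pattern, x, fixed = literal, perl = !literal)
  lengths(regmatches(x, hits)) # number of matches per string
}

foo <- c("12 dbc 3xy", "cat\\w", "1zy3\\d\\d")
count_matches(foo, "\\d")                 # 3 0 2  (digits)
count_matches(foo, "\\d", literal = TRUE) # 0 0 2  (backslash followed by d)
```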
Decode a value using the provided encoding
expr_str_decode(encoding, ..., strict = TRUE)
encoding |
Either 'hex' or 'base64'. |
... |
These dots are for future extensions and must be empty. |
strict |
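If TRUE (default), raise an error if the underlying value cannot be decoded. Otherwise, mask out with a null value. |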
A polars expression
df <- pl$DataFrame(strings = c("foo", "bar", NA))
df$select(pl$col("strings")$str$encode("hex"))
df$with_columns(
  pl$col("strings")$str$encode("base64")$alias("base64"), # notice DataType is not encoded
  pl$col("strings")$str$encode("hex")$alias("hex") # ... and must be restored with cast
)$with_columns(
  pl$col("base64")$str$decode("base64")$alias("base64_decoded")$cast(pl$String),
  pl$col("hex")$str$decode("hex")$alias("hex_decoded")$cast(pl$String)
)
Encode a value using the provided encoding
expr_str_encode(encoding)
encoding |
Either 'hex' or 'base64'. |
A polars expression
df <- pl$DataFrame(strings = c("foo", "bar", NA))
df$select(pl$col("strings")$str$encode("hex"))
df$with_columns(
  pl$col("strings")$str$encode("base64")$alias("base64"), # notice DataType is not encoded
  pl$col("strings")$str$encode("hex")$alias("hex") # ... and must be restored with cast
)$with_columns(
  pl$col("base64")$str$decode("base64")$alias("base64_decoded")$cast(pl$String),
  pl$col("hex")$str$decode("hex")$alias("hex_decoded")$cast(pl$String)
)
Check if string values end with a substring.
expr_str_ends_with(sub)
sub |
Suffix substring or Expr. |
See also $str$starts_with() and $str$contains().
A polars expression
df <- pl$DataFrame(fruits = c("apple", "mango", NA))
df$select(
  pl$col("fruits"),
  pl$col("fruits")$str$ends_with("go")$alias("has_suffix")
)
Extract the target capture group from provided patterns
expr_str_extract(pattern, group_index)
pattern |
A valid regex pattern. Can be an Expr or something coercible to an Expr. Strings are parsed as column names. |
group_index |
Index of the targeted capture group. Group 0 means the whole pattern; the first group begins at index 1 (default). |
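The group numbering can be illustrated outside polars. The following base R sketch is only an analogy, not the polars implementation: regexec() and regmatches() return the whole match first, followed by each capture group, so group 0 and group 1 correspond to the first and second elements of the result. The variable names are illustrative.

```r
# Base R analogy for capture-group numbering (not the polars implementation).
url <- "http://vote.com/ballon_dor?candidate=messi&ref=polars"

# regexec() reports the whole match first, then each capture group.
m <- regmatches(url, regexec("candidate=(\\w+)", url))[[1]]

whole_match <- m[1] # plays the role of group_index = 0
first_group <- m[2] # plays the role of group_index = 1
print(whole_match)
print(first_group)
```

Because R vectors are 1-based, the offsets in this sketch are shifted by one relative to group_index.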
A polars expression
df <- pl$DataFrame(
  a = c(
    "http://vote.com/ballon_dor?candidate=messi&ref=polars",
    "http://vote.com/ballon_dor?candidat=jorginho&ref=polars",
    "http://vote.com/ballon_dor?candidate=ronaldo&ref=polars"
  )
)
df$with_columns(
  extracted = pl$col("a")$str$extract(pl$lit(r"(candidate=(\w+))"), 1)
)
Extracts all matches for the given regex pattern. Extracts each successive non-overlapping regex match in an individual string as an array.
expr_str_extract_all(pattern)
pattern |
A valid regex pattern |
A polars expression
df <- pl$DataFrame(foo = c("123 bla 45 asd", "xyz 678 910t"))
df$select(
  pl$col("foo")$str$extract_all(r"((\d+))")$alias("extracted_nrs")
)
Extract all capture groups for the given regex pattern
expr_str_extract_groups(pattern)
pattern |
A character of a valid regular expression pattern containing at least one capture group, compatible with the regex crate. |
All group names are strings. If your pattern contains unnamed groups, their numerical position is converted to a string. See examples.
A polars expression
df <- pl$DataFrame(
  url = c(
    "http://vote.com/ballon_dor?candidate=messi&ref=python",
    "http://vote.com/ballon_dor?candidate=weghorst&ref=polars",
    "http://vote.com/ballon_dor?error=404&ref=rust"
  )
)
pattern <- r"(candidate=(?<candidate>\w+)&ref=(?<ref>\w+))"
df$with_columns(
  captures = pl$col("url")$str$extract_groups(pattern)
)$unnest("captures")

# If the groups are unnamed, their numerical position (as a string) is used:
pattern <- r"(candidate=(\w+)&ref=(\w+))"
df$with_columns(
  captures = pl$col("url")$str$extract_groups(pattern)
)$unnest("captures")
Use the Aho-Corasick algorithm to extract matches
expr_str_extract_many(
  patterns,
  ...,
  ascii_case_insensitive = FALSE,
  overlapping = FALSE
)
patterns |
String patterns to search. This can be an Expr or something coercible to an Expr. Strings are parsed as column names. |
... |
These dots are for future extensions and must be empty. |
ascii_case_insensitive |
Enable ASCII-aware case insensitive matching. When this option is enabled, searching will be performed without respect to case for ASCII letters (a-z and A-Z) only. |
overlapping |
Whether matches can overlap. |
A polars expression
df <- pl$DataFrame(values = "discontent")
patterns <- pl$lit(c("winter", "disco", "onte", "discontent"))
df$with_columns(
  matches = pl$col("values")$str$extract_many(patterns),
  matches_overlap = pl$col("values")$str$extract_many(patterns, overlapping = TRUE)
)

df <- pl$DataFrame(
  values = c("discontent", "rhapsody"),
  patterns = list(c("winter", "disco", "onte", "discontent"), c("rhap", "ody", "coalesce"))
)
df$select(pl$col("values")$str$extract_many("patterns"))
Return the index position of the first substring matching a pattern
expr_str_find(pattern, ..., literal = FALSE, strict = TRUE)
pattern |
A character string, or something that can be coerced to a string Expr, containing a valid regex pattern compatible with the regex crate. |
... |
These dots are for future extensions and must be empty. |
literal |
Logical. If |
strict |
Logical. If |
To modify regular expression behaviour (such as case-sensitivity) with flags, use the inline (?iLmsuxU) syntax. See the regex crate’s section on grouping and flags for additional information about the use of inline expression modifiers.
A polars expression
$str$starts_with(): Check if string values start with a substring.
$str$ends_with(): Check if string values end with a substring.
$str$contains(): Check if string contains a substring that matches a pattern.
pl$DataFrame(s = c("AAA", "aAa", "aaa"))$with_columns(
  default_match = pl$col("s")$str$find("Aa"),
  insensitive_match = pl$col("s")$str$find("(?i)Aa")
)
Return the first n characters of each string
expr_str_head(n)
n |
Length of the slice (integer or expression). Strings are parsed as column names. Negative indexing is supported. |
The n input is defined in terms of the number of characters in the (UTF-8) string. A character is defined as a Unicode scalar value. A single character is represented by a single byte when working with ASCII text, and a maximum of 4 bytes otherwise.
When the n input is negative, head() returns characters up to the nth from the end of the string. For example, if n = -3, then all characters except the last three are returned.
If the string has fewer than n characters, the full string is returned.
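The rules above can be sketched in base R for a single string. This is an illustration of the documented behaviour, not the polars implementation, and str_head_sketch is a hypothetical helper name.

```r
# Base R sketch of the documented head() rules for a single string.
# `str_head_sketch` is a hypothetical helper, not part of polars.
str_head_sketch <- function(s, n) {
  len <- nchar(s, type = "chars")
  # A negative n keeps everything except the last |n| characters.
  end <- if (n >= 0) min(n, len) else max(len + n, 0)
  substr(s, 1, end)
}

print(str_head_sketch("pear", 3))
print(str_head_sketch("pear", 5))
print(str_head_sketch("papaya", -2))
print(str_head_sketch("dragonfruit", -5))
```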
A polars expression
df <- pl$DataFrame(
  s = c("pear", NA, "papaya", "dragonfruit"),
  n = c(3, 4, -2, -5)
)
df$with_columns(
  s_head_5 = pl$col("s")$str$head(5),
  s_head_n = pl$col("s")$str$head("n")
)
Vertically concatenate the string values in the column to a single string value.
expr_str_join(delimiter = "", ..., ignore_nulls = TRUE)
delimiter |
The delimiter to insert between consecutive string values. |
... |
These dots are for future extensions and must be empty. |
ignore_nulls |
Ignore null values (default). If |
A polars expression
# concatenate a Series of strings to a single string
df <- pl$DataFrame(foo = c(1, NA, 2))
df$select(pl$col("foo")$str$join("-"))
df$select(pl$col("foo")$str$join("-", ignore_nulls = FALSE))
Parse string values as JSON.
expr_str_json_decode(dtype, infer_schema_length = 100)
dtype |
The dtype to cast the extracted value to. If |
infer_schema_length |
How many rows to parse to determine the schema.
If |
Throws an error if an invalid JSON string is encountered.
A polars expression
df <- pl$DataFrame(
  json_val = c('{"a":1, "b": true}', NA, '{"a":2, "b": false}')
)
dtype <- pl$Struct(pl$Field("a", pl$Int64), pl$Field("b", pl$Boolean))
df$select(pl$col("json_val")$str$json_decode(dtype))
Extract the first match of JSON string with the provided JSONPath expression
expr_str_json_path_match(json_path)
json_path |
A valid JSON path query string. |
Throws an error if an invalid JSON string is encountered. All return values are cast to String regardless of the original value.
Documentation on JSONPath standard can be found here: https://goessner.net/articles/JsonPath/.
A polars expression
df <- pl$DataFrame(
  json_val = c('{"a":"1"}', NA, '{"a":2}', '{"a":2.1}', '{"a":true}')
)
df$select(pl$col("json_val")$str$json_path_match("$.a"))
Get length of the strings as UInt32 (as number of bytes). Use $str$len_chars() to get the number of characters.
expr_str_len_bytes()
If you know that you are working with ASCII text, the number of bytes and the number of characters are equivalent, and $str$len_bytes() is faster.
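The difference between the two lengths can be seen with base R's nchar(), used here purely as an analogy to $str$len_bytes() and $str$len_chars(). The sketch writes the accented character as a Unicode escape so that the string is UTF-8 encoded regardless of the file encoding.

```r
# Base R analogy: byte length versus character length of UTF-8 strings.
x <- c("Caf\u00e9", "345") # "\u00e9" is e-acute, stored as 2 bytes in UTF-8

n_bytes <- nchar(x, type = "bytes")
n_chars <- nchar(x, type = "chars")

print(n_bytes)
print(n_chars)
```

For the pure ASCII string both counts agree, which is the case the note above refers to.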
A polars expression
pl$DataFrame(
  s = c("Café", NA, "345", "æøå")
)$select(
  pl$col("s"),
  pl$col("s")$str$len_bytes()$alias("lengths"),
  pl$col("s")$str$len_chars()$alias("n_chars")
)
Get length of the strings as UInt32 (as number of characters). Use $str$len_bytes() to get the number of bytes.
expr_str_len_chars()
If you know that you are working with ASCII text, the number of bytes and the number of characters are equivalent, and $str$len_bytes() is faster.
A polars expression
pl$DataFrame(
  s = c("Café", NA, "345", "æøå")
)$select(
  pl$col("s"),
  pl$col("s")$str$len_bytes()$alias("lengths"),
  pl$col("s")$str$len_chars()$alias("n_chars")
)
Return the string left justified in a string of length width.
expr_str_pad_end(width, fillchar = " ")
width |
Justify left to this length. |
fillchar |
Fill with this ASCII character. |
Padding is done using the specified fillchar. The original string is returned if width is less than or equal to the length of the string.
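As a rough illustration of this rule, the following base R sketch pads a single string on the right. It is not the polars implementation, and pad_end_sketch is a hypothetical helper name.

```r
# Base R sketch of right-padding a single string to a target width.
# `pad_end_sketch` is a hypothetical helper, not part of polars.
pad_end_sketch <- function(s, width, fillchar = " ") {
  len <- nchar(s, type = "chars")
  # Strings that are already long enough are returned unchanged.
  if (len >= width) {
    return(s)
  }
  paste0(s, strrep(fillchar, width - len))
}

print(pad_end_sketch("cow", 8, "*"))
print(pad_end_sketch("hippopotamus", 8, "*"))
```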
A polars expression
df <- pl$DataFrame(a = c("cow", "monkey", NA, "hippopotamus"))
df$select(pl$col("a")$str$pad_end(8, "*"))
Return the string right justified in a string of length width.
expr_str_pad_start(width, fillchar = " ")
width |
Justify right to this length. |
fillchar |
Fill with this ASCII character. |
Padding is done using the specified fillchar. The original string is returned if width is less than or equal to the length of the string.
A polars expression
df <- pl$DataFrame(a = c("cow", "monkey", NA, "hippopotamus"))
df$select(pl$col("a")$str$pad_start(8, "*"))
Replace first matching regex/literal substring with a new string value
expr_str_replace(pattern, value, ..., literal = FALSE, n = 1L)
pattern |
A character string, or something that can be coerced to a string Expr, containing a valid regex pattern compatible with the regex crate. |
value |
A character or an Expr of string that will replace the matched substring. |
... |
These dots are for future extensions and must be empty. |
literal |
Logical. If |
n |
A number of matches to replace.
Note that regex replacement with |
To modify regular expression behaviour (such as case-sensitivity) with flags, use the inline (?iLmsuxU) syntax. See the regex crate’s section on grouping and flags for additional information about the use of inline expression modifiers.
A polars expression
The dollar sign ($) is a special character related to capture groups. To refer to a literal dollar sign, use $$ instead or set literal to TRUE.
df <- pl$DataFrame(id = 1L:2L, text = c("123abc", "abc456"))
df$with_columns(pl$col("text")$str$replace(r"(abc\b)", "ABC"))

# Capture groups are supported.
# Use `${1}` in the value string to refer to the first capture group in the pattern,
# `${2}` to refer to the second capture group, and so on.
# You can also use named capture groups.
df <- pl$DataFrame(word = c("hat", "hut"))
df$with_columns(
  positional = pl$col("word")$str$replace("h(.)t", "b${1}d"),
  named = pl$col("word")$str$replace("h(?<vowel>.)t", "b${vowel}d")
)

# Apply case-insensitive string replacement using the `(?i)` flag.
df <- pl$DataFrame(
  city = "Philadelphia",
  season = c("Spring", "Summer", "Autumn", "Winter"),
  weather = c("Rainy", "Sunny", "Cloudy", "Snowy")
)
df$with_columns(
  pl$col("weather")$str$replace("(?i)foggy|rainy|cloudy|snowy", "Sunny")
)
Replace all matching regex/literal substrings with a new string value
expr_str_replace_all(pattern, value, ..., literal = FALSE)
pattern |
A character string, or something that can be coerced to a string Expr, containing a valid regex pattern compatible with the regex crate. |
value |
A character or an Expr of string that will replace the matched substring. |
... |
These dots are for future extensions and must be empty. |
literal |
Logical. If |
To modify regular expression behaviour (such as case-sensitivity) with flags, use the inline (?iLmsuxU) syntax. See the regex crate’s section on grouping and flags for additional information about the use of inline expression modifiers.
A polars expression
The dollar sign ($) is a special character related to capture groups. To refer to a literal dollar sign, use $$ instead or set literal to TRUE.
df <- pl$DataFrame(id = 1L:2L, text = c("abcabc", "123a123"))
df$with_columns(pl$col("text")$str$replace_all("a", "-"))

# Capture groups are supported.
# Use `${1}` in the value string to refer to the first capture group in the pattern,
# `${2}` to refer to the second capture group, and so on.
# You can also use named capture groups.
df <- pl$DataFrame(word = c("hat", "hut"))
df$with_columns(
  positional = pl$col("word")$str$replace_all("h(.)t", "b${1}d"),
  named = pl$col("word")$str$replace_all("h(?<vowel>.)t", "b${vowel}d")
)

# Apply case-insensitive string replacement using the `(?i)` flag.
df <- pl$DataFrame(
  city = "Philadelphia",
  season = c("Spring", "Summer", "Autumn", "Winter"),
  weather = c("Rainy", "Sunny", "Cloudy", "Snowy")
)
df$with_columns(
  pl$col("weather")$str$replace_all(
    "(?i)foggy|rainy|cloudy|snowy", "Sunny"
  )
)
This function replaces several matches at once.
expr_str_replace_many(patterns, replace_with, ascii_case_insensitive = FALSE)
patterns |
String patterns to search. Can be an Expr. |
replace_with |
A vector of strings used as replacements. If this is of
length 1, then it is applied to all matches. Otherwise, it must be of same
length as the |
ascii_case_insensitive |
Enable ASCII-aware case insensitive matching. When this option is enabled, searching will be performed without respect to case for ASCII letters (a-z and A-Z) only. |
A polars expression
df <- pl$DataFrame(
  lyrics = c(
    "Everybody wants to rule the world",
    "Tell me what you want, what you really really want",
    "Can you feel the love tonight"
  )
)

# a replacement of length 1 is applied to all matches
df$with_columns(
  remove_pronouns = pl$col("lyrics")$str$replace_many(c("you", "me"), "")
)

# if there is more than one replacement, the patterns and replacements are
# matched
df$with_columns(
  fake_pronouns = pl$col("lyrics")$str$replace_many(c("you", "me"), c("foo", "bar"))
)
Returns string values in reversed order
expr_str_reverse()
A polars expression
df <- pl$DataFrame(text = c("foo", "bar", NA))
df$with_columns(reversed = pl$col("text")$str$reverse())
Create subslices of the string values of a String Series
expr_str_slice(offset, length = NULL)
offset |
Start index. Negative indexing is supported. |
length |
Length of the slice. If |
A polars expression
df <- pl$DataFrame(s = c("pear", NA, "papaya", "dragonfruit"))
df$with_columns(
  pl$col("s")$str$slice(-3)$alias("s_sliced")
)
Split the string by a substring
expr_str_split(by, ..., inclusive = FALSE)
by |
Substring to split by. Can be an Expr. |
... |
These dots are for future extensions and must be empty. |
inclusive |
If |
A polars expression
df <- pl$DataFrame(s = c("foo bar", "foo-bar", "foo bar baz"))
df$select(pl$col("s")$str$split(by = " "))

df <- pl$DataFrame(
  s = c("foo^bar", "foo_bar", "foo*bar*baz"),
  by = c("_", "_", "*")
)
df
df$select(split = pl$col("s")$str$split(by = pl$col("by")))
Split the string by a substring using n splits
This results in a struct of n+1 fields. If it cannot make n splits, the remaining field elements will be null.
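The padding behaviour can be sketched in base R for a single string. This is an illustration under the assumption that only the first n+1 pieces are of interest; it is not the polars implementation, it returns a character vector rather than a struct, and split_exact_sketch is a hypothetical helper name.

```r
# Base R sketch: split one string and always return n + 1 pieces.
# `split_exact_sketch` is a hypothetical helper, not part of polars.
split_exact_sketch <- function(s, by, n) {
  parts <- strsplit(s, by, fixed = TRUE)[[1]]
  # Keep the first n + 1 pieces; `length<-` pads missing pieces with NA.
  length(parts) <- n + 1
  parts
}

print(split_exact_sketch("a_1", "_", 1))
print(split_exact_sketch("c", "_", 1))
```

In the sketch NA stands in for the null field elements described above.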
expr_str_split_exact(by, n, ..., inclusive = FALSE)
by |
Substring to split by. Can be an Expr. |
n |
Number of splits to make. |
... |
These dots are for future extensions and must be empty. |
inclusive |
If |
A polars expression
df <- pl$DataFrame(s = c("a_1", NA, "c", "d_4"))
df$with_columns(
  split = pl$col("s")$str$split_exact(by = "_", 1),
  split_inclusive = pl$col("s")$str$split_exact(by = "_", 1, inclusive = TRUE)
)
Split the string by a substring, restricted to returning at most n items
If the number of possible splits is less than n-1, the remaining field elements will be null. If the number of possible splits is n-1 or greater, the last (nth) substring will contain the remainder of the string.
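A base R sketch of this rule for a single string follows. It illustrates the documented behaviour only, is not the polars implementation, and returns a character vector where polars returns a struct; splitn_sketch is a hypothetical helper name.

```r
# Base R sketch: split one string into at most n items.
# `splitn_sketch` is a hypothetical helper, not part of polars.
splitn_sketch <- function(s, by, n) {
  out <- rep(NA_character_, n)
  rest <- s
  for (i in seq_len(n)) {
    # The last item always takes the remainder, so stop searching there.
    pos <- if (i < n) as.integer(regexpr(by, rest, fixed = TRUE)) else -1L
    if (pos == -1L) {
      out[i] <- rest
      break
    }
    out[i] <- substr(rest, 1, pos - 1)
    rest <- substr(rest, pos + nchar(by), nchar(rest))
  }
  out
}

print(splitn_sketch("d_4_e", "_", 2))
print(splitn_sketch("c", "_", 2))
```

Items that cannot be filled stay NA, mirroring the null field elements described above.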
expr_str_splitn(by, n)
by |
Substring to split by. Can be an Expr. |
n |
Maximum number of items to return. |
A polars expression
df <- pl$DataFrame(s = c("a_1", NA, "c", "d_4_e"))
df$with_columns(
  s1 = pl$col("s")$str$splitn(by = "_", 1),
  s2 = pl$col("s")$str$splitn(by = "_", 2),
  s3 = pl$col("s")$str$splitn(by = "_", 3)
)
Check if string values start with a substring.
expr_str_starts_with(sub)
sub |
Prefix substring or Expr. |
See also $str$contains() and $str$ends_with().
A polars expression
df <- pl$DataFrame(fruits = c("apple", "mango", NA))
df$select(
  pl$col("fruits"),
  pl$col("fruits")$str$starts_with("app")$alias("has_prefix")
)
Remove leading and trailing characters.
expr_str_strip_chars(matches = NULL)
matches |
The set of characters to be removed. All combinations of this
set of characters will be stripped. If |
This function will not strip any chars beyond the first char not matched. strip_chars() removes characters at the beginning and the end of the string. Use strip_chars_start() and strip_chars_end() to remove characters only from left and right respectively.
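The point that matches is treated as a set of characters, and that stripping stops at the first character outside that set, can be sketched in base R for a single string. This is an illustration only, not the polars implementation, and strip_chars_sketch is a hypothetical helper name.

```r
# Base R sketch: strip leading and trailing characters drawn from a set.
# `strip_chars_sketch` is a hypothetical helper, not part of polars.
strip_chars_sketch <- function(s, matches) {
  chars <- strsplit(s, "")[[1]]
  set <- strsplit(matches, "")[[1]]
  # Positions of characters that are NOT in the set bound the result.
  keep <- which(!(chars %in% set))
  if (length(keep) == 0) {
    return("")
  }
  paste(chars[min(keep):max(keep)], collapse = "")
}

print(strip_chars_sketch(" hello", " hel rld"))
print(strip_chars_sketch("\tworld", " hel rld"))
```

In the second call the tab is not part of the set, so nothing is removed from the left even though later characters are in the set.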
A polars expression
df <- pl$DataFrame(foo = c(" hello", "\tworld"))
df$select(pl$col("foo")$str$strip_chars())
df$select(pl$col("foo")$str$strip_chars(" hel rld"))
Remove trailing characters.
expr_str_strip_chars_end(matches = NULL)
matches |
The set of characters to be removed. All combinations of this
set of characters will be stripped. If |
This function will not strip any chars beyond the first char not matched. strip_chars_end() removes characters at the end of the string only. Use strip_chars() and strip_chars_start() to remove characters from the left and right or only from the left respectively.
A polars expression
df <- pl$DataFrame(foo = c(" hello", "\tworld"))
df$select(pl$col("foo")$str$strip_chars_end(" hel\trld"))
df$select(pl$col("foo")$str$strip_chars_end("rldhel\t "))
Remove leading characters.
expr_str_strip_chars_start(matches = NULL)
matches |
The set of characters to be removed. All combinations of this
set of characters will be stripped. If |
This function will not strip any chars beyond the first char not matched. strip_chars_start() removes characters at the beginning of the string only. Use strip_chars() and strip_chars_end() to remove characters from the left and right or only from the right respectively.
A polars expression
df <- pl$DataFrame(foo = c(" hello", "\tworld"))
df$select(pl$col("foo")$str$strip_chars_start(" hel rld"))
Similar to the strptime() function.
expr_str_strptime(
  dtype,
  format = NULL,
  ...,
  strict = TRUE,
  exact = TRUE,
  cache = TRUE,
  ambiguous = c("raise", "earliest", "latest", "null")
)
dtype |
The data type to convert into. Can be either |
format |
Format to use for conversion. Refer to
the chrono crate documentation
for the full specification. Example: |
... |
These dots are for future extensions and must be empty. |
strict |
If |
exact |
If |
cache |
Use a cache of unique, converted dates to apply the datetime conversion. |
ambiguous |
Determine how to deal with ambiguous datetimes. Character vector or expression containing the following:
|
When parsing a Datetime, the column precision will be inferred from the format string, if given, e.g.: "%F %T%.3f" => pl$Datetime("ms"). If no fractional second component is found, then the default is "us" (microsecond).
A polars expression
# Dealing with a consistent format
df <- pl$DataFrame(x = c("2020-01-01 01:00Z", "2020-01-01 02:00Z"))
df$select(pl$col("x")$str$strptime(pl$Datetime(), "%Y-%m-%d %H:%M%#z"))

# Auto infer format
df$select(pl$col("x")$str$strptime(pl$Datetime()))

# Datetime with timezone is interpreted as UTC timezone
df <- pl$DataFrame(x = c("2020-01-01T01:00:00+09:00"))
df$select(pl$col("x")$str$strptime(pl$Datetime()))

# Dealing with different formats.
df <- pl$DataFrame(
  date = c(
    "2021-04-22",
    "2022-01-04 00:00:00",
    "01/31/22",
    "Sun Jul 8 00:34:60 2001"
  )
)
df$select(
  pl$coalesce(
    pl$col("date")$str$strptime(pl$Date, "%F", strict = FALSE),
    pl$col("date")$str$strptime(pl$Date, "%F %T", strict = FALSE),
    pl$col("date")$str$strptime(pl$Date, "%D", strict = FALSE),
    pl$col("date")$str$strptime(pl$Date, "%c", strict = FALSE)
  )
)

# Ignore invalid time
df <- pl$DataFrame(
  x = c(
    "2023-01-01 11:22:33 -0100",
    "2023-01-01 11:22:33 +0300",
    "invalid time"
  )
)
df$select(pl$col("x")$str$strptime(
  pl$Datetime("ns"),
  format = "%Y-%m-%d %H:%M:%S %z",
  strict = FALSE
))
Return the last n characters of each string
expr_str_tail(n)
n |
Length of the slice (integer or expression). Strings are parsed as column names. Negative indexing is supported. |
The n input is defined in terms of the number of characters in the (UTF-8) string. A character is defined as a Unicode scalar value. A single character is represented by a single byte when working with ASCII text, and a maximum of 4 bytes otherwise.
When the n input is negative, tail() returns characters starting from the nth from the beginning of the string. For example, if n = -3, then all characters except the first three are returned.
If the string has fewer than n characters, the full string is returned.
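The rules above can be sketched in base R for a single string. This is an illustration of the documented behaviour, not the polars implementation, and str_tail_sketch is a hypothetical helper name.

```r
# Base R sketch of the documented tail() rules for a single string.
# `str_tail_sketch` is a hypothetical helper, not part of polars.
str_tail_sketch <- function(s, n) {
  len <- nchar(s, type = "chars")
  # A negative n drops the first |n| characters.
  start <- if (n >= 0) max(len - n + 1, 1) else -n + 1
  substr(s, start, len)
}

print(str_tail_sketch("pear", 3))
print(str_tail_sketch("pear", 5))
print(str_tail_sketch("papaya", -2))
print(str_tail_sketch("dragonfruit", -5))
```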
A polars expression
df <- pl$DataFrame(
  s = c("pear", NA, "papaya", "dragonfruit"),
  n = c(3, 4, -2, -5)
)
df$with_columns(
  s_tail_5 = pl$col("s")$str$tail(5),
  s_tail_n = pl$col("s")$str$tail("n")
)
Convert a String column into a Date column
expr_str_to_date(format = NULL, ..., strict = TRUE, exact = TRUE, cache = TRUE)
format |
Format to use for conversion. Refer to
the chrono crate documentation
for the full specification. Example: |
... |
These dots are for future extensions and must be empty. |
strict |
If |
exact |
If |
cache |
Use a cache of unique, converted dates to apply the datetime conversion. |
A polars expression
df <- pl$DataFrame(x = c("2020/01/01", "2020/02/01", "2020/03/01"))
df$select(pl$col("x")$str$to_date())

# by default, this errors if some values cannot be converted
df <- pl$DataFrame(x = c("2020/01/01", "2020 02 01", "2020-03-01"))
try(df$select(pl$col("x")$str$to_date()))
df$select(pl$col("x")$str$to_date(strict = FALSE))
Convert a String column into a Datetime column
expr_str_to_datetime(
  format = NULL,
  ...,
  time_unit = NULL,
  time_zone = NULL,
  strict = TRUE,
  exact = TRUE,
  cache = TRUE,
  ambiguous = c("raise", "earliest", "latest", "null")
)
format |
Format to use for conversion. Refer to
the chrono crate documentation
for the full specification. Example: |
... |
These dots are for future extensions and must be empty. |
time_unit |
Unit of time for the resulting Datetime column. If |
time_zone |
Time zone for the resulting Datetime column. |
strict |
If |
exact |
If |
cache |
Use a cache of unique, converted dates to apply the datetime conversion. |
ambiguous |
Determine how to deal with ambiguous datetimes. Character vector or expression containing one of the following:
|
A polars expression
df <- pl$DataFrame(x = c("2020-01-01 01:00Z", "2020-01-01 02:00Z"))

df$select(pl$col("x")$str$to_datetime("%Y-%m-%d %H:%M%#z"))
df$select(pl$col("x")$str$to_datetime(time_unit = "ms"))
Convert a String column into an Int64 column with base radix
expr_str_to_integer(..., base = 10L, strict = TRUE)
... |
These dots are for future extensions and must be empty. |
base |
A positive integer or expression which is the base of the string
we are parsing. Characters are parsed as column names. Default: |
strict |
A logical. If |
A polars expression
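For intuition about the `base` argument, base R's `strtoi()` performs a similar radix conversion. This is only an analogue, not the polars implementation: it returns a 32-bit R integer rather than Int64, and it always yields `NA` for unparseable input, which resembles the `strict = FALSE` call in the example for this function.

```r
# Base R analogue only; strtoi() is not the polars implementation.
bin <- c("110", "101", "010", "invalid")
parsed_bin <- strtoi(bin, base = 2L)   # unparseable input yields NA

hex <- c("fa1e", "ff00", "cafe", NA)
parsed_hex <- strtoi(hex, base = 16L)  # missing input stays NA

parsed_bin
parsed_hex
```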
df <- pl$DataFrame(bin = c("110", "101", "010", "invalid"))
df$with_columns(
  parsed = pl$col("bin")$str$to_integer(base = 2, strict = FALSE)
)

df <- pl$DataFrame(hex = c("fa1e", "ff00", "cafe", NA))
df$with_columns(
  parsed = pl$col("hex")$str$to_integer(base = 16, strict = TRUE)
)
Transform to lowercase variant.
expr_str_to_lowercase()
A polars expression
pl$lit(c("A", "b", "c", "1", NA))$str$to_lowercase()$to_series()
Convert a String column into a Time column
expr_str_to_time(format = NULL, ..., strict = TRUE, cache = TRUE)
format |
Format to use for conversion. Refer to
the chrono crate documentation
for the full specification. Example: |
... |
These dots are for future extensions and must be empty. |
strict |
If |
cache |
Use a cache of unique, converted dates to apply the datetime conversion. |
A polars expression
df <- pl$DataFrame(x = c("01:00", "02:00", "03:00"))
df$select(pl$col("x")$str$to_time("%H:%M"))
Transform to uppercase variant.
expr_str_to_uppercase()
A polars expression
pl$lit(c("A", "b", "c", "1", NA))$str$to_uppercase()$to_series()
Add zeroes to a string until it reaches n characters. If the number of characters is already greater than n, the string is not modified.
expr_str_zfill(alignment)
alignment |
Fill the value up to this length. This can be an Expr or something coercible to an Expr. Strings are parsed as column names. |
Return a copy of the string left filled with ASCII '0' digits to make a string of length width.
A leading sign prefix ('+'/'-') is handled by inserting the padding after the sign character rather than before. The original string is returned if width is less than or equal to len(s).
A polars expression
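The padding rule described above can be sketched in base R. This illustrates the stated behaviour only and is not how polars implements it; `zfill_sketch` is a hypothetical helper that assumes non-missing character input.

```r
# Hypothetical base R sketch of the zero-fill rule; assumes no NA inputs.
zfill_sketch <- function(s, width) {
  has_sign <- substr(s, 1, 1) %in% c("+", "-")
  sign <- ifelse(has_sign, substr(s, 1, 1), "")
  body <- ifelse(has_sign, substring(s, 2), s)
  # Number of zeroes needed; never negative, so long strings stay unchanged.
  n_pad <- pmax(width - nchar(s), 0)
  # Padding goes after the sign character, before the remaining characters.
  paste0(sign, strrep("0", n_pad), body)
}

padded <- zfill_sketch(c("0", "10", "-5", "12345678"), 5)
padded
```

The sign stays in front of the padding for `"-5"`, and `"12345678"` is returned unchanged because it is already longer than the requested width.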
some_floats_expr <- pl$lit(c(0, 10, -5, 5))

# cast to String, zero-fill to a width of 5, and view as an R character vector
some_floats_expr$cast(pl$String)$str$zfill(5)$to_r()

# cast to Int64, then to String, zero-fill to a width of 5, and view as an
# R character vector
some_floats_expr$cast(pl$Int64)$cast(pl$String)$str$zfill(5)$to_r()
Retrieve one or multiple Struct field(s) as a new Series
expr_struct_field(...)
... |
< |
A polars expression
df <- pl$DataFrame(
  aaa = c(1, 2),
  bbb = c("ab", "cd"),
  ccc = c(TRUE, NA),
  ddd = list(1:2, 3)
)$select(struct_col = pl$struct("aaa", "bbb", "ccc", "ddd"))
df

# Retrieve struct field(s) as Series:
df$select(pl$col("struct_col")$struct$field("bbb"))

df$select(
  pl$col("struct_col")$struct$field("bbb"),
  pl$col("struct_col")$struct$field("ddd")
)

# Use wildcard expansion:
df$select(pl$col("struct_col")$struct$field("*"))

# Retrieve multiple fields by name:
df$select(pl$col("struct_col")$struct$field("aaa", "bbb"))

# Retrieve multiple fields by regex expansion:
df$select(pl$col("struct_col")$struct$field("^a.*|b.*$"))
Convert this struct to a string column with json values
expr_struct_json_encode()
A polars expression
df <- pl$DataFrame(
  a = list(1:2, c(9, 1, 3)),
  b = list(45, NA)
)$select(a = pl$struct("a", "b"))
df

df$with_columns(encoded = pl$col("a")$struct$json_encode())
Rename the fields of the struct
expr_struct_rename_fields(names)
names |
New names, given in the same order as the struct's fields. |
A polars expression
df <- pl$DataFrame(
  aaa = c(1, 2),
  bbb = c("ab", "cd"),
  ccc = c(TRUE, NA),
  ddd = list(1:2, 3)
)$select(struct_col = pl$struct("aaa", "bbb", "ccc", "ddd"))
df

df <- df$select(
  pl$col("struct_col")$struct$rename_fields(c("www", "xxx", "yyy", "zzz"))
)
df$select(pl$col("struct_col")$struct$field("*"))

# Following a rename, the previous field names cannot be referenced:
tryCatch(
  {
    df$select(pl$col("struct_col")$struct$field("aaa"))
  },
  error = function(e) print(e)
)
This is an alias for Expr$struct$field("*").
expr_struct_unnest()
A polars expression
df <- pl$DataFrame(
  aaa = c(1, 2),
  bbb = c("ab", "cd"),
  ccc = c(TRUE, NA),
  ddd = list(1:2, 3)
)$select(struct_col = pl$struct("aaa", "bbb", "ccc", "ddd"))
df

df$select(pl$col("struct_col")$struct$unnest())
This is similar to with_columns() on DataFrame and LazyFrame.
expr_struct_with_fields(...)
... |
< |
A polars expression
df <- pl$DataFrame(
  x = c(1, 4, 9),
  y = c(4, 9, 16),
  multiply = c(10, 2, 3)
)$select(coords = pl$struct("x", "y"), "multiply")
df

df <- df$with_columns(
  pl$col("coords")$struct$with_fields(
    pl$field("x")$sqrt(),
    y_mul = pl$field("y") * pl$col("multiply")
  )
)
df

df$select(pl$col("coords")$struct$field("*"))
By default, all query optimizations are enabled. Individual optimizations may be disabled by setting the corresponding parameter to FALSE.
lazyframe__collect(
  ...,
  type_coercion = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  comm_subplan_elim = TRUE,
  comm_subexpr_elim = TRUE,
  cluster_with_columns = TRUE,
  no_optimization = FALSE,
  streaming = FALSE,
  `_eager` = FALSE
)
... |
These dots are for future extensions and must be empty. |
type_coercion |
A logical indicating whether to apply the type coercion optimization. |
predicate_pushdown |
A logical indicating whether to apply the predicate pushdown optimization. |
projection_pushdown |
A logical indicating whether to apply the projection pushdown optimization. |
simplify_expression |
A logical indicating whether to apply the simplify expression optimization. |
slice_pushdown |
A logical indicating whether to apply the slice pushdown optimization. |
comm_subplan_elim |
A logical indicating whether to try to cache branching subplans that occur on self-joins or unions. |
comm_subexpr_elim |
A logical indicating whether to try to cache common subexpressions. |
cluster_with_columns |
A logical indicating whether to combine sequential independent calls to with_columns. |
no_optimization |
A logical. If |
streaming |
A logical. If |
_eager |
A logical indicating whether to turn off multi-node optimizations and the other optimizations. This option is intended for internal use only. |
A polars DataFrame
lf <- pl$LazyFrame(
  a = c("a", "b", "a", "b", "b", "c"),
  b = 1:6,
  c = 6:1,
)
lf$group_by("a")$agg(pl$all()$sum())$collect()

# Collect in streaming mode
lf$group_by("a")$agg(pl$all()$sum())$collect(
  streaming = TRUE
)
Select and perform operations on a subset of columns only. This discards unmentioned columns (like .() in data.table and unlike dplyr::mutate()).
One cannot use new variables in subsequent expressions in the same $select() call. For instance, if you create a variable x, you will only be able to use it in another $select() or $with_columns() call.
lazyframe__select(...)
... |
< |
A polars LazyFrame
# Pass the name of a column to select that column.
lf <- pl$LazyFrame(
  foo = 1:3,
  bar = 6:8,
  ham = letters[1:3]
)
lf$select("foo")$collect()

# Multiple columns can be selected by passing multiple column names.
lf$select("foo", "bar")$collect()

# Expressions are also accepted.
lf$select(pl$col("foo"), pl$col("bar") + 1)$collect()

# Name expression (used as the column name of the output DataFrame)
lf$select(
  threshold = pl$when(pl$col("foo") > 2)$then(10)$otherwise(0)
)$collect()

# Expressions with multiple outputs can be automatically instantiated
# as Structs by setting the `POLARS_AUTO_STRUCTIFY` environment variable.
# (Experimental)
if (requireNamespace("withr", quietly = TRUE)) {
  withr::with_envvar(c(POLARS_AUTO_STRUCTIFY = "1"), {
    lf$select(
      is_odd = ((pl$col(pl$Int32) %% 2) == 1)$name$suffix("_is_odd"),
    )$collect()
  })
}
Add columns or modify existing ones with expressions. This is similar to dplyr::mutate() as it keeps unmentioned columns (unlike $select()).
However, unlike dplyr::mutate(), one cannot use new variables in subsequent expressions in the same $with_columns() call. For instance, if you create a variable x, you will only be able to use it in another $with_columns() or $select() call.
lazyframe__with_columns(...)
... |
< |
A polars LazyFrame
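Base R's `transform()` has a comparable restriction, which may help build intuition: every expression in one call is evaluated against the original columns. The sketch below is a base R analogue only and does not use polars; the column names `a2` and `a4` are made up for illustration.

```r
# Base R analogue only; this does not call polars.
df <- data.frame(a = 1:3)

# Within a single call, `a2` does not exist yet when `a4` is evaluated,
# so this fails (assuming no object named `a2` exists in the calling scope).
one_step <- try(transform(df, a2 = a * 2, a4 = a2 * 2), silent = TRUE)

# Splitting the work into two calls makes the new column available.
two_steps <- transform(transform(df, a2 = a * 2), a4 = a2 * 2)
two_steps
```

Chaining two calls is the same workaround the description above suggests for `$with_columns()` and `$select()`.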
# Pass an expression to add it as a new column.
lf <- pl$LazyFrame(
  a = 1:4,
  b = c(0.5, 4, 10, 13),
  c = c(TRUE, TRUE, FALSE, TRUE),
)
lf$with_columns((pl$col("a")^2)$alias("a^2"))$collect()

# Added columns will replace existing columns with the same name.
lf$with_columns(a = pl$col("a")$cast(pl$Float64))$collect()

# Multiple columns can be added
lf$with_columns(
  (pl$col("a")^2)$alias("a^2"),
  (pl$col("b") / 2)$alias("b/2"),
  (pl$col("c")$not())$alias("not c"),
)$collect()

# Name expression instead of `$alias()`
lf$with_columns(
  `a^2` = pl$col("a")^2,
  `b/2` = pl$col("b") / 2,
  `not c` = pl$col("c")$not(),
)$collect()

# Expressions with multiple outputs can automatically be instantiated
# as Structs by enabling the experimental setting `POLARS_AUTO_STRUCTIFY`:
if (requireNamespace("withr", quietly = TRUE)) {
  withr::with_envvar(c(POLARS_AUTO_STRUCTIFY = "1"), {
    lf$drop("c")$with_columns(
      diffs = pl$col("a", "b")$diff()$name$suffix("_diff"),
    )$collect()
  })
}
pl is an environment class object that stores all the top-level functions of the R Polars API, which mimics the Python Polars API. It is intended to work the same way as in Python, as if you had imported Python Polars with import polars as pl.
pl
An object of class polars_object of length 65.
pl

# How many members are in the `pl` environment?
length(pl)

# Create a polars DataFrame
# In Python:
# ```python
# >>> import polars as pl
# >>> df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
# ```
# In R:
df <- pl$DataFrame(a = c(1, 2, 3), b = c(4, 5, 6))
df
Apply the AND logical horizontally across columns
pl__all_horizontal(...)
... |
< |
Kleene logic is used to deal with nulls: if the column contains any null values and no FALSE values, the output is null.
A polars expression
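R's own `&` operator also follows Kleene logic with `NA`, so the null rule can be previewed in base R. This is a base R analogue only and does not call polars; it folds two logical vectors row by row.

```r
# Base R analogue only; NA plays the role of a polars null.
a <- c(FALSE, FALSE, TRUE, TRUE, FALSE, NA)
b <- c(FALSE, TRUE, TRUE, NA, NA, NA)

# Fold the columns with `&`: any FALSE forces FALSE, otherwise NA propagates.
all_h <- Reduce(`&`, list(a, b))
all_h
```

The fifth row shows the key case: `FALSE & NA` is `FALSE`, because a single `FALSE` decides the result regardless of the unknown value.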
df <- pl$DataFrame(
  a = c(FALSE, FALSE, TRUE, TRUE, FALSE, NA),
  b = c(FALSE, TRUE, TRUE, NA, NA, NA),
  c = c("u", "v", "w", "x", "y", "z")
)

df$with_columns(
  all = pl$all_horizontal("a", "b", "c")
)
Apply the OR logical horizontally across columns
pl__any_horizontal(...)
... |
< |
Kleene logic is used to deal with nulls: if the column contains any null values and no TRUE values, the output is null.
A polars expression
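The Kleene truth table for OR can be tabulated in base R, whose `|` operator treats `NA` the same way. This is an illustrative base R sketch, not polars code.

```r
# Base R analogue only; NA plays the role of a polars null.
vals <- c(TRUE, FALSE, NA)

# Three-valued OR for every pair of inputs.
or_table <- outer(vals, vals, "|")
dimnames(or_table) <- list(c("TRUE", "FALSE", "NA"), c("TRUE", "FALSE", "NA"))
or_table
```

A `TRUE` on either side always wins, while `FALSE | NA` and `NA | NA` stay `NA` because the unknown value could still change the outcome.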
df <- pl$DataFrame(
  a = c(FALSE, FALSE, TRUE, TRUE, FALSE, NA),
  b = c(FALSE, TRUE, TRUE, NA, NA, NA),
  c = c("u", "v", "w", "x", "y", "z")
)

df$with_columns(
  any = pl$any_horizontal("a", "b", "c")
)
Folds the columns from left to right, keeping the first non-null value
pl__coalesce(...)
... |
< |
A polars expression
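The left-to-right fold can be written out in base R to show what "keeping the first non-null value" means. This is a sketch of the idea only, with `NA` standing in for null; `coalesce_sketch` is a hypothetical helper, not a polars function.

```r
# Hypothetical base R sketch; NA plays the role of a polars null.
coalesce_sketch <- function(...) {
  # Walk the inputs left to right, filling gaps from the next input.
  Reduce(function(x, y) ifelse(is.na(x), y, x), list(...))
}

a <- c(1, NA, NA, NA)
b <- c(1, 2, NA, NA)
c_col <- c(5, NA, 3, NA)

# The trailing scalar acts as a default for rows that are missing everywhere.
d <- coalesce_sketch(a, b, c_col, 10)
d
```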
df <- pl$DataFrame(
  a = c(1, NA, NA, NA),
  b = c(1, 2, NA, NA),
  c = c(5, NA, 3, NA)
)

df$with_columns(d = pl$coalesce("a", "b", "c", 10))

df$with_columns(d = pl$coalesce(pl$col("a", "b", "c"), 10))
Create an expression representing column(s) in a DataFrame
pl__col(...)
... |
<
|
A polars expression
# a single column by a character
pl$col("foo")

# multiple columns by characters
pl$col("foo", "bar")

# multiple columns by polars data types
pl$col(pl$Float64, pl$String)

# Single `"*"` is converted to a wildcard expression
pl$col("*")

# Character vectors with length > 1 should be used with `!!!`
pl$col(!!!c("foo", "bar"), "baz")
pl$col("foo", !!!c("bar", "baz"))

# there are some special notations for selecting columns
df <- pl$DataFrame(foo = 1:3, bar = 4:6, baz = 7:9)

## select all columns with a wildcard `"*"`
df$select(pl$col("*"))

## select multiple columns by a regular expression
## starts with `^` and ends with `$`
df$select(pl$col("^ba.*$"))
Combine multiple DataFrames, LazyFrames, or Series into a single object
pl__concat(
  ...,
  how = c("vertical", "vertical_relaxed", "diagonal", "diagonal_relaxed", "horizontal", "align"),
  rechunk = FALSE,
  parallel = TRUE
)
... |
< |
how |
Strategy to concatenate items. Must be one of:
Series only support the |
rechunk |
Make sure that the result data is in contiguous memory. |
parallel |
Only relevant for LazyFrames. This determines if the concatenated lazy computations may be executed in parallel. |
The same class (polars_data_frame, polars_lazy_frame or polars_series) as the input.
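To make the "diagonal" strategy concrete, the sketch below imitates it with base R data frames: columns are unioned and gaps are filled with `NA` before stacking. This is an assumption-laden analogue, not the polars implementation, and `diagonal_sketch` is a hypothetical helper.

```r
# Hypothetical base R sketch of a diagonal concatenation; NA stands in for null.
diagonal_sketch <- function(x, y) {
  cols <- union(names(x), names(y))
  # Add any column missing from one side, filled with NA.
  for (nm in setdiff(cols, names(x))) x[[nm]] <- NA
  for (nm in setdiff(cols, names(y))) y[[nm]] <- NA
  # Stack the rows with columns in a common order.
  rbind(x[cols], y[cols])
}

df1 <- data.frame(a = 1L, b = 3L)
df2 <- data.frame(a = 2L, c = 4L)
out <- diagonal_sketch(df1, df2)
out
```

The result has columns `a`, `b` and `c`, with `NA` in `b` for the second row and in `c` for the first row.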
# default is 'vertical' strategy
df1 <- pl$DataFrame(a = 1L, b = 3L)
df2 <- pl$DataFrame(a = 2L, b = 4L)
pl$concat(df1, df2)

# 'a' is coerced to float64
df1 <- pl$DataFrame(a = 1L, b = 3L)
df2 <- pl$DataFrame(a = 2, b = 4L)
pl$concat(df1, df2, how = "vertical_relaxed")

df_h1 <- pl$DataFrame(l1 = 1:2, l2 = 3:4)
df_h2 <- pl$DataFrame(r1 = 5:6, r2 = 7:8, r3 = 9:10)
pl$concat(df_h1, df_h2, how = "horizontal")

# use 'diagonal' strategy to fill empty column values with nulls
df1 <- pl$DataFrame(a = 1L, b = 3L)
df2 <- pl$DataFrame(a = 2L, c = 4L)
pl$concat(df1, df2, how = "diagonal")

df_a1 <- pl$DataFrame(id = 1:2, x = 3:4)
df_a2 <- pl$DataFrame(id = 2:3, y = 5:6)
df_a3 <- pl$DataFrame(id = c(1L, 3L), z = 7:8)
pl$concat(df_a1, df_a2, df_a3, how = "align")
Horizontally concatenate columns into a single list column
pl__concat_list(...)
... |
< |
A polars expression
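Row-wise list concatenation can be pictured with base R lists. The sketch below is an analogue only: it uses `NULL` to stand in for a null row and assumes, as the example comments for this function state, that a null on either side makes the whole result row null. `concat_list_sketch` is a hypothetical helper.

```r
# Hypothetical base R sketch; NULL stands in for a null list row.
concat_list_sketch <- function(a, b) {
  Map(function(x, y) {
    # A null on either side propagates to the result.
    if (is.null(x) || is.null(y)) NULL else c(x, y)
  }, a, b)
}

a <- list(1:2, 3, 4:5)
b <- list(4, integer(0), NULL)
res <- concat_list_sketch(a, b)
res
```

An empty list row (`integer(0)`) contributes nothing but does not nullify the row, whereas the `NULL` in the third position does.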
df <- pl$DataFrame(a = list(1:2, 3, 4:5), b = list(4, integer(0), NULL))

# Concatenate two existing list columns. Null values are propagated.
df$with_columns(concat_list = pl$concat_list("a", "b"))

# Non-list columns are cast to a list before concatenation. The output data
# type is the supertype of the concatenated columns.
df$select("a", concat_list = pl$concat_list("a", pl$lit("x")))

# Create lagged columns and collect them into a list. This mimics a rolling
# window.
df <- pl$DataFrame(A = c(1, 2, 9, 2, 13))
df <- df$select(
  A_lag_1 = pl$col("A")$shift(1),
  A_lag_2 = pl$col("A")$shift(2),
  A_lag_3 = pl$col("A")$shift(3)
)
df$select(A_rolling = pl$concat_list("A_lag_1", "A_lag_2", "A_lag_3"))
Polars DataFrame class (polars_data_frame)
DataFrames are two-dimensional data structures representing data as a table with rows and columns. Polars DataFrames are similar to R data frames. The columns of an R data frame are R vectors, while the columns of a Polars DataFrame are Polars Series.
pl__DataFrame(..., .schema_overrides = NULL, .strict = TRUE)
... |
< |
.schema_overrides |
A list of polars data types or |
.strict |
A logical value. Passed to the |
The pl$DataFrame() function mimics the constructor of the DataFrame class of Python Polars. This function is basically a shortcut for as_polars_df(list(...))$cast(!!!.schema_overrides, .strict = .strict), so each argument in ... is converted to a Polars Series by as_polars_series() and then passed to as_polars_df().
A polars DataFrame
columns: $columns returns a character vector with the names of the columns.
dtypes: $dtypes returns an unnamed list of the data type of each column.
schema: $schema returns a named list with the column names as names and the data types as values.
shape: $shape returns an integer vector of length two with the number of rows and columns of the DataFrame.
height: $height returns an integer with the number of rows of the DataFrame.
width: $width returns an integer with the number of columns of the DataFrame.
# Constructing a DataFrame from vectors:
pl$DataFrame(a = 1:2, b = 3:4)

# Constructing a DataFrame from Series:
pl$DataFrame(pl$Series("a", 1:2), pl$Series("b", 3:4))

# Constructing a DataFrame from a list:
data <- list(a = 1:2, b = 3:4)

## Using the as_polars_df function (recommended)
as_polars_df(data)

## Using dynamic dots feature
pl$DataFrame(!!!data)

# Active bindings:
df <- pl$DataFrame(a = 1:3, b = c("foo", "bar", "baz"))

df$columns
df$dtypes
df$schema
df$shape
df$height
df$width
If both start and end are passed as Date types (not Datetime), and the interval granularity is no finer than "1d", the returned range is also of type Date. All other permutations return a Datetime.
pl__date_range(
  start,
  end,
  interval = "1d",
  ...,
  closed = c("both", "left", "none", "right")
)
start |
Lower bound of the date range. Something that can be coerced to a Date or a Datetime expression. See examples for details. |
end |
Upper bound of the date range. Something that can be coerced to a Date or a Datetime expression. See examples for details. |
interval |
Interval of the range periods, specified as a difftime
object or using the Polars duration string language. See the |
... |
These dots are for future extensions and must be empty. |
closed |
Define which sides of the range are closed (inclusive).
One of the following: |
A polars expression
Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.
It has the following format:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: "3d12h4m25s"
# 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
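To show how a combined duration string breaks down, here is a small base R sketch that splits such a string into its count and unit parts. It is illustrative only and not how polars parses durations; `parse_duration_sketch` is a hypothetical helper that handles only unsigned strings built from the units listed above.

```r
# Hypothetical base R sketch; not the polars parser.
parse_duration_sketch <- function(x) {
  # Two-letter units come first so that "ms" and "mo" are tried before "m".
  pattern <- "[0-9]+(ns|us|ms|mo|s|m|h|d|w|q|y)"
  tokens <- regmatches(x, gregexpr(pattern, x))[[1]]
  # Reject input that is not made up entirely of count-unit pairs.
  stopifnot(identical(paste(tokens, collapse = ""), x))
  counts <- as.integer(sub("[a-z]+$", "", tokens))
  names(counts) <- sub("^[0-9]+", "", tokens)
  counts
}

parts <- parse_duration_sketch("3d12h4m25s")
parts
```

The result is a named integer vector with one entry per unit, here 3 days, 12 hours, 4 minutes and 25 seconds.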
pl$date_ranges() to create a simple Series of data type list(Date) based on column values.
# Using Polars duration string to specify the interval:
pl$select(
  date = pl$date_range(as.Date("2022-01-01"), as.Date("2022-03-01"), "1mo")
)

# Using `difftime` object to specify the interval:
pl$select(
  date = pl$date_range(
    as.Date("1985-01-01"),
    as.Date("1985-01-10"),
    as.difftime(2, units = "days")
  )
)
If both start and end are passed as Date types (not Datetime), and the interval granularity is no finer than "1d", the returned range is also of type Date. All other permutations return a Datetime.
pl__date_ranges(
  start,
  end,
  interval = "1d",
  ...,
  closed = c("both", "left", "none", "right")
)
start |
Lower bound of the date range. Something that can be coerced to a Date or a Datetime expression. See examples for details. |
end |
Upper bound of the date range. Something that can be coerced to a Date or a Datetime expression. See examples for details. |
interval |
Interval of the range periods, specified as a difftime
object or using the Polars duration string language. See the |
... |
These dots are for future extensions and must be empty. |
closed |
Define which sides of the range are closed (inclusive).
One of the following: |
A polars expression
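The effect of the `closed` argument can be approximated in base R with daily `Date` sequences. The sketch below assumes that "left" and "right" name the side that stays inclusive; it is an analogue for building intuition, not the polars implementation, and `range_closed_sketch` is a hypothetical helper.

```r
# Hypothetical base R sketch of the `closed` options on a daily range.
range_closed_sketch <- function(start, end,
                                closed = c("both", "left", "none", "right")) {
  closed <- match.arg(closed)
  x <- seq(start, end, by = "day")
  # Drop the lower bound unless the left side is closed.
  if (closed %in% c("right", "none")) x <- x[x != start]
  # Drop the upper bound unless the right side is closed.
  if (closed %in% c("left", "none")) x <- x[x != end]
  x
}

start <- as.Date("2022-01-01")
end <- as.Date("2022-01-03")

both <- range_closed_sketch(start, end, "both")
right <- range_closed_sketch(start, end, "right")
none <- range_closed_sketch(start, end, "none")
```

With these inputs, `both` holds three dates, `right` drops the first day, and `none` keeps only the middle day.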
Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.
It has the following format:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: "3d12h4m25s"
# 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
pl$date_range() to create a simple Series of data type Date.
df <- pl$DataFrame(
  start = as.Date(c("2022-01-01", "2022-01-02", NA)),
  end = rep(as.Date("2022-01-03"), 3)
)

df$with_columns(
  date_range = pl$date_ranges("start", "end"),
  date_range_cr = pl$date_ranges("start", "end", closed = "right")
)

# provide a custom "end" value
df$with_columns(
  date_range_lit = pl$date_ranges("start", pl$lit(as.Date("2022-01-02")))
)
Create a Polars literal expression of type Datetime
pl__datetime(
  year,
  month,
  day,
  hour = NULL,
  minute = NULL,
  second = NULL,
  microsecond = NULL,
  ...,
  time_unit = c("us", "ns", "ms"),
  time_zone = NULL,
  ambiguous = c("raise", "earliest", "latest", "null")
)
year |
A polars expression or something that can be coerced to
a polars expression by |
month |
A polars expression or something that can be coerced to
a polars expression by |
day |
A polars expression or something that can be coerced to
a polars expression by |
hour |
A polars expression or something that can be coerced to
a polars expression by |
minute |
A polars expression or something that can be coerced to
a polars expression by |
second |
A polars expression or something that can be coerced to
a polars expression by |
microsecond |
A polars expression or something that can be coerced to
a polars expression by |
... |
These dots are for future extensions and must be empty. |
time_unit |
One of |
time_zone |
A string or |
ambiguous |
Determine how to deal with ambiguous datetimes. Character vector or expression containing one of the following:
|
A polars expression
df <- pl$DataFrame(
  month = c(1, 2, 3),
  day = c(4, 5, 6),
  hour = c(12, 13, 14),
  minute = c(15, 30, 45)
)

df$with_columns(
  pl$datetime(
    2024,
    pl$col("month"),
    pl$col("day"),
    pl$col("hour"),
    pl$col("minute"),
    time_zone = "Australia/Sydney"
  )
)

# We can also use `pl$datetime()` for filtering:
df <- pl$select(
  start = ISOdatetime(2024, 1, 1, 0, 0, 0),
  end = c(
    ISOdatetime(2024, 5, 1, 20, 15, 10),
    ISOdatetime(2024, 7, 1, 21, 25, 20),
    ISOdatetime(2024, 9, 1, 22, 35, 30)
  )
)

df$filter(pl$col("end") > pl$datetime(2024, 6, 1))
Generate a datetime range
pl__datetime_range( start, end, interval = "1d", ..., closed = c("both", "left", "none", "right"), time_unit = NULL, time_zone = NULL )
start |
Lower bound of the date range. Something that can be coerced to a Date or a Datetime expression. See examples for details. |
end |
Upper bound of the date range. Something that can be coerced to a Date or a Datetime expression. See examples for details. |
interval |
Interval of the range periods, specified as a difftime
object or using the Polars duration string language. See the |
... |
These dots are for future extensions and must be empty. |
closed |
Define which sides of the range are closed (inclusive).
One of the following: |
time_unit |
Time unit of the resulting Datetime
data type. One of |
time_zone |
Time zone of the resulting Datetime data type. |
A polars expression
Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.
It has the following format:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: "3d12h4m25s"
# 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
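To make the format above concrete, here is a minimal base R sketch that splits a duration string into its count and unit parts. The helper `parse_duration_string()` is hypothetical and is not part of polars; it only mirrors the unit list above and ignores anything else the real parser may accept.

```r
# Hypothetical helper (not a polars function): split a duration string such as
# "3d12h4m25s" into named integer counts, one per unit.
parse_duration_string <- function(x) {
  # Two-letter units are listed first so that "mo", "ms", "ns" and "us"
  # are not mistaken for "m" or "s".
  pattern <- "([0-9]+)(ns|us|ms|mo|s|m|h|d|w|q|y)"
  pieces <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
  # Reject input that contains characters the pattern did not consume.
  if (length(pieces) == 0L || paste(pieces, collapse = "") != x) {
    stop("not a valid duration string: ", x)
  }
  counts <- as.integer(sub(pattern, "\\1", pieces, perl = TRUE))
  units <- sub(pattern, "\\2", pieces, perl = TRUE)
  stats::setNames(counts, units)
}

parse_duration_string("3d12h4m25s")
parse_duration_string("1mo")
```

The result is a named integer vector, so `parse_duration_string("3d12h4m25s")[["h"]]` picks out the hours component.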
pl$datetime_ranges() to create a simple Series of data type list(Datetime) based on column values.
# Using Polars duration string to specify the interval:
pl$select(
  datetime = pl$datetime_range(as.Date("2022-01-01"), as.Date("2022-03-01"), "1mo")
)

# Using `difftime` object to specify the interval:
pl$select(
  datetime = pl$datetime_range(
    as.Date("1985-01-01"),
    as.Date("1985-01-10"),
    as.difftime(1, units = "days") + as.difftime(12, units = "hours")
  )
)

# Specifying a time zone:
pl$select(
  datetime = pl$datetime_range(
    as.Date("2022-01-01"),
    as.Date("2022-03-01"),
    "1mo",
    time_zone = "America/New_York"
  )
)
Generate a list containing a datetime range
pl__datetime_ranges( start, end, interval = "1d", ..., closed = c("both", "left", "none", "right"), time_unit = NULL, time_zone = NULL )
start |
Lower bound of the date range. Something that can be coerced to a Date or a Datetime expression. See examples for details. |
end |
Upper bound of the date range. Something that can be coerced to a Date or a Datetime expression. See examples for details. |
interval |
Interval of the range periods, specified as a difftime
object or using the Polars duration string language. See the |
... |
These dots are for future extensions and must be empty. |
closed |
Define which sides of the range are closed (inclusive).
One of the following: |
time_unit |
Time unit of the resulting Datetime
data type. One of |
time_zone |
Time zone of the resulting Datetime data type. |
A polars expression
Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.
It has the following format:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: "3d12h4m25s"
# 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
pl$datetime_range() to create a simple Series of data type Datetime.
df <- pl$DataFrame(
  start = as.POSIXct(c("2022-01-01 10:00", "2022-01-01 11:00", NA)),
  end = rep(as.POSIXct("2022-01-01 12:00"), 3)
)

df$with_columns(
  dt_range = pl$datetime_ranges("start", "end", interval = "1h"),
  dt_range_cr = pl$datetime_ranges("start", "end", closed = "right", interval = "1h")
)

# provide a custom "end" value
df$with_columns(
  dt_range_lit = pl$datetime_ranges(
    "start",
    pl$lit(as.POSIXct("2022-01-01 11:00")),
    interval = "1h"
  )
)
A Duration represents a fixed amount of time. For example, pl$duration(days = 1) means "exactly 24 hours". By contrast, <expr>$dt$offset_by("1d") means "1 calendar day", which could sometimes be 23 hours or 25 hours depending on Daylight Savings Time. For non-fixed durations such as "calendar month" or "calendar day", please use <expr>$dt$offset_by() instead.
pl__duration( ..., weeks = NULL, days = NULL, hours = NULL, minutes = NULL, seconds = NULL, milliseconds = NULL, microseconds = NULL, nanoseconds = NULL, time_unit = NULL )
... |
These dots are for future extensions and must be empty. |
weeks |
Something that can be coerced to a polars expression by |
days |
Something that can be coerced to a polars expression by |
hours |
Something that can be coerced to a polars expression by |
minutes |
Something that can be coerced to a polars expression by |
seconds |
Something that can be coerced to a polars expression by |
milliseconds |
Something that can be coerced to a polars expression by |
microseconds |
Something that can be coerced to a polars expression by |
nanoseconds |
Something that can be coerced to a polars expression by |
time_unit |
One of |
A polars expression
df <- pl$DataFrame(
  dt = as.POSIXct(c("2022-01-01", "2022-01-02")),
  add = c(1, 2)
)
df

df$select(
  add_weeks = pl$col("dt") + pl$duration(weeks = pl$col("add")),
  add_days = pl$col("dt") + pl$duration(days = pl$col("add")),
  add_seconds = pl$col("dt") + pl$duration(seconds = pl$col("add")),
  add_millis = pl$col("dt") + pl$duration(milliseconds = pl$col("add")),
  add_hours = pl$col("dt") + pl$duration(hours = pl$col("add"))
)
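The difference between a fixed Duration and a calendar offset, as described for pl$duration(), only shows up around a clock change. The following is a minimal sketch rather than an official example: the time zone and date are chosen purely for illustration (US clocks moved forward on 2024-03-10), and it assumes the `polars` package documented here is installed.

```r
library(polars)

# Noon on the day before the clocks move forward in New York.
# (Time zone and date are illustrative assumptions.)
df <- pl$DataFrame(
  dt = as.POSIXct("2024-03-09 12:00:00", tz = "America/New_York")
)

out <- df$select(
  "dt",
  # Fixed amount of time: exactly 24 hours later.
  plus_24h = pl$col("dt") + pl$duration(days = 1),
  # Calendar offset: the corresponding wall-clock time on the next day.
  plus_1d = pl$col("dt")$dt$offset_by("1d")
)
out
```

Comparing the `plus_24h` and `plus_1d` columns in the printed result shows whether the two approaches agree for a given date.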
Alias for an element being evaluated in an eval expression
pl__element()
A polars expression
# A horizontal rank computation by taking the elements of a list:
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, 2)
)
df$with_columns(
  rank = pl$concat_list(c("a", "b"))$list$eval(pl$element()$rank())
)

# A mathematical operation on array elements:
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, 2)
)
df$with_columns(
  a_b_doubled = pl$concat_list(c("a", "b"))$list$eval(pl$element() * 2)
)
Get the first column of the context
pl__first()
A polars expression
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, 2),
  c = c("foo", "bar", "baz")
)
df$select(pl$first())
Generate a range of integers
pl__int_range(start = 0, end = NULL, step = 1, ..., dtype = pl$Int64)
start |
Start of the range (inclusive). Defaults to 0. |
end |
End of the range (exclusive). If |
step |
Step size of the range. |
... |
These dots are for future extensions and must be empty. |
dtype |
Data type of the range. |
A polars expression
pl$select(int = pl$int_range(0, 3))

# end can be omitted for a shorter syntax.
pl$select(int = pl$int_range(3))

# Generate an index column by using int_range in conjunction with len().
df <- pl$DataFrame(a = c(1, 3, 5), b = c(2, 4, 6))
df$select(
  index = pl$int_range(pl$len(), dtype = pl$UInt32),
  pl$all()
)
Generate a range of integers for each row of the input columns
pl__int_ranges(start = 0, end = NULL, step = 1, ..., dtype = pl$Int64)
start |
Start of the range (inclusive). Defaults to 0. |
end |
End of the range (exclusive). If |
step |
Step size of the range. |
... |
These dots are for future extensions and must be empty. |
dtype |
Data type of the range. |
A polars expression
df <- pl$DataFrame(start = c(1, -1), end = c(3, 2))
df$with_columns(int_range = pl$int_ranges("start", "end"))

# end can be omitted for a shorter syntax.
df$select("end", int_range = pl$int_ranges("end"))
Get the last column of the context
pl__last()
A polars expression
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, 2),
  c = c("foo", "bar", "baz")
)
df$select(pl$last())
Polars LazyFrame class (polars_lazy_frame)
Representation of a Lazy computation graph/query against a DataFrame. This allows for whole-query optimisation in addition to parallelism, and is the preferred (and highest-performance) mode of operation for polars.
pl__LazyFrame(..., .schema_overrides = NULL, .strict = TRUE)
... |
< |
.schema_overrides |
A list of polars data types or |
.strict |
A logical value. Passed to the |
The pl$LazyFrame(...) function is a shortcut for pl$DataFrame(...)$lazy().
A polars LazyFrame
<LazyFrame>$collect(): Materialize a LazyFrame into a DataFrame.
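As a rough illustration of the lazy workflow described for LazyFrame, the sketch below builds a small query and only materializes it at the end. The column names and values are made up for the example, and it assumes the `polars` package documented here is installed.

```r
library(polars)

lf <- pl$LazyFrame(a = 1:4, b = c("w", "x", "y", "z"))

# Nothing is computed yet; these calls only extend the query plan.
query <- lf$filter(pl$col("a") > 2)$select("b")

# collect() runs the query and returns a DataFrame.
result <- query$collect()
result
```

Keeping the whole pipeline lazy until the final `collect()` is what gives the optimizer the chance to look at the query as a whole.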
# Constructing a LazyFrame from vectors:
pl$LazyFrame(a = 1:2, b = 3:4)

# Constructing a LazyFrame from Series:
pl$LazyFrame(pl$Series("a", 1:2), pl$Series("b", 3:4))

# Constructing a LazyFrame from a list:
data <- list(a = 1:2, b = 3:4)

## Using dynamic dots feature
pl$LazyFrame(!!!data)
This function is a shorthand for as_polars_expr(x, as_lit = TRUE) and in most cases, the actual conversion is done by as_polars_series().
pl__lit(value, dtype = NULL)
value |
An R object. Passed as the |
dtype |
A polars data type or |
A polars expression
Since R has no scalar class, each of the following types is specially converted to a scalar literal when its length is 1.
character: String
logical: Boolean
integer: Int32
double: Float64
An NA of these types is converted to a null literal cast to the corresponding Polars type.
A raw vector is converted to a Binary scalar.
raw: Binary
NULL is converted to a null literal of the Null type.
NULL: Null
For other R classes, the default S3 method is called and the R object is converted via as_polars_series(), so the type mapping is defined by as_polars_series().
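One way to check the scalar mapping described for pl$lit() is to select one literal per R type and inspect the resulting schema. This is a small sketch, not an official example: the column names are arbitrary, and it assumes the `polars` package documented here is installed.

```r
library(polars)

# One length-1 value per R type; column names are illustrative only.
schema <- pl$select(
  chr = pl$lit("a"),
  lgl = pl$lit(TRUE),
  int = pl$lit(1L),
  dbl = pl$lit(1.5),
  raw = pl$lit(as.raw(1)),
  int_na = pl$lit(NA_integer_),
  null = pl$lit(NULL)
)$schema

schema
```

Printing `schema` lists the Polars data type chosen for each literal, which can be compared against the mapping above.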
as_polars_series(): R -> Polars type mapping is mostly defined by this function.
as_polars_expr(): Internal implementation of pl$lit().
# Literal scalar values
pl$lit(1L)
pl$lit(5.5)
pl$lit(NULL)
pl$lit("foo_bar")

## Generally, a vector (an R object) that becomes a Series of length 1
## is converted to a Series, and its first value becomes the scalar literal.
pl$lit(as.Date("2021-01-20"))
pl$lit(as.POSIXct("2023-03-31 10:30:45"))
pl$lit(data.frame(a = 1, b = "foo"))

# Literal Series data
pl$lit(1:3)
pl$lit(pl$Series("x", 1:3))
Get the maximum value horizontally across columns
pl__max_horizontal(...)
... |
< |
A polars expression
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, NA),
  c = c(1, 2, Inf)
)
df$with_columns(
  max = pl$max_horizontal("a", "b")
)
Compute the mean horizontally across columns
pl__mean_horizontal(...)
... |
< |
A polars expression
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, NA),
  c = c("x", "y", "z")
)
df$with_columns(
  mean = pl$mean_horizontal("a", "b")
)
Get the minimum value horizontally across columns
pl__min_horizontal(...)
... |
< |
A polars expression
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, NA),
  c = c("x", "y", "z")
)
df$with_columns(
  min = pl$min_horizontal("a", "b")
)
Get the nth column(s) of the context
pl__nth(indices)
indices |
One or more indices representing the columns to retrieve. |
A polars expression
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, 2),
  c = c("foo", "bar", "baz")
)
df$select(pl$nth(1))
df$select(pl$nth(c(2, 0)))
Read into a DataFrame from Arrow IPC (Feather v2) file
pl__read_ipc( source, ..., n_rows = NULL, cache = TRUE, rechunk = FALSE, row_index_name = NULL, row_index_offset = 0L, storage_options = NULL, retries = 2, file_cache_ttl = NULL, hive_partitioning = NULL, hive_schema = NULL, try_parse_hive_dates = TRUE, include_file_paths = NULL )
source |
Path(s) to a file or directory. When needing to authenticate
for scanning cloud locations, see the |
... |
These dots are for future extensions and must be empty. |
n_rows |
Stop reading from the IPC file after reading |
cache |
Cache the result after reading. |
rechunk |
When reading multiple files via a glob pattern, rechunk the final DataFrame into contiguous memory chunks. |
row_index_name |
If not |
row_index_offset |
Offset to start the row index column (only used if the name is set). |
storage_options |
Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
If |
retries |
Number of retries if accessing a cloud instance fails. |
hive_partitioning |
Infer statistics and schema from Hive partitioned
sources and use them to prune reads. If |
hive_schema |
A list containing the column names and data types of the
columns by which the data is partitioned, e.g.
|
try_parse_hive_dates |
Whether to try parsing hive values as date / datetime types. |
include_file_paths |
Character value indicating the column name that will include the path of the source file(s). |
A polars DataFrame
temp_dir <- tempfile()

# Write a hive-style partitioned arrow file dataset
arrow::write_dataset(
  mtcars,
  temp_dir,
  partitioning = c("cyl", "gear"),
  format = "arrow",
  hive_style = TRUE
)
list.files(temp_dir, recursive = TRUE)

# If the path is a folder, Polars automatically tries to detect partitions
# and includes them in the output
pl$read_ipc(temp_dir)

# We can also impose a schema to the partition
pl$read_ipc(temp_dir, hive_schema = list(cyl = pl$String, gear = pl$Int32))
Read into a DataFrame from Parquet file
pl__read_parquet( source, ..., n_rows = NULL, row_index_name = NULL, row_index_offset = 0L, parallel = c("auto", "columns", "row_groups", "prefiltered", "none"), use_statistics = TRUE, hive_partitioning = NULL, glob = TRUE, schema = NULL, hive_schema = NULL, try_parse_hive_dates = TRUE, rechunk = FALSE, low_memory = FALSE, cache = TRUE, storage_options = NULL, retries = 2, include_file_paths = NULL, allow_missing_columns = FALSE )
source |
Path(s) to a file or directory. When needing to authenticate
for scanning cloud locations, see the |
... |
These dots are for future extensions and must be empty. |
n_rows |
Stop reading from parquet file after reading |
row_index_name |
If not |
row_index_offset |
Offset to start the row index column (only used if the name is set). |
parallel |
This determines the direction and strategy of parallelism.
The prefiltered setting falls back to auto if no predicate is given. |
use_statistics |
Use statistics in the parquet to determine if pages can be skipped from reading. |
hive_partitioning |
Infer statistics and schema from Hive partitioned sources and use them to prune reads. |
glob |
Expand path given via globbing rules. |
schema |
Specify the datatypes of the columns. The datatypes must match
the datatypes in the file(s). If there are extra columns that are not in the
file(s), consider also enabling |
hive_schema |
The column names and data types of the columns by which
the data is partitioned. If |
try_parse_hive_dates |
Whether to try parsing hive values as date / datetime types. |
rechunk |
When reading multiple files via a glob pattern, rechunk the final DataFrame into contiguous memory chunks. |
low_memory |
Reduce memory pressure at the expense of performance |
cache |
Cache the result after reading. |
storage_options |
Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
If |
retries |
Number of retries if accessing a cloud instance fails. |
include_file_paths |
Character value indicating the column name that will include the path of the source file(s). |
allow_missing_columns |
When reading a list of parquet files, if a
column existing in the first file cannot be found in subsequent files, the
default behavior is to raise an error. However, if |
A polars DataFrame
# Write a Parquet file that we can then import as a DataFrame
temp_file <- withr::local_tempfile(fileext = ".parquet")
as_polars_df(mtcars)$write_parquet(temp_file)

pl$read_parquet(temp_file)

# Write a hive-style partitioned parquet dataset
temp_dir <- withr::local_tempdir()
as_polars_df(mtcars)$write_parquet(temp_dir, partition_by = c("cyl", "gear"))
list.files(temp_dir, recursive = TRUE)

# If the path is a folder, Polars automatically tries to detect partitions
# and includes them in the output
pl$read_parquet(temp_dir)
This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.
pl__scan_ipc( source, ..., n_rows = NULL, cache = TRUE, rechunk = FALSE, row_index_name = NULL, row_index_offset = 0L, storage_options = NULL, retries = 2, file_cache_ttl = NULL, hive_partitioning = NULL, hive_schema = NULL, try_parse_hive_dates = TRUE, include_file_paths = NULL )
source |
Path(s) to a file or directory. When needing to authenticate
for scanning cloud locations, see the |
... |
These dots are for future extensions and must be empty. |
n_rows |
Stop reading from the IPC file after reading |
cache |
Cache the result after reading. |
rechunk |
When reading multiple files via a glob pattern, rechunk the final DataFrame into contiguous memory chunks. |
row_index_name |
If not |
row_index_offset |
Offset to start the row index column (only used if the name is set). |
storage_options |
Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
If |
retries |
Number of retries if accessing a cloud instance fails. |
hive_partitioning |
Infer statistics and schema from Hive partitioned
sources and use them to prune reads. If |
hive_schema |
A list containing the column names and data types of the
columns by which the data is partitioned, e.g.
|
try_parse_hive_dates |
Whether to try parsing hive values as date / datetime types. |
include_file_paths |
Character value indicating the column name that will include the path of the source file(s). |
A polars LazyFrame
temp_dir <- tempfile()

# Write a hive-style partitioned arrow file dataset
arrow::write_dataset(
  mtcars,
  temp_dir,
  partitioning = c("cyl", "gear"),
  format = "arrow",
  hive_style = TRUE
)
list.files(temp_dir, recursive = TRUE)

# If the path is a folder, Polars automatically tries to detect partitions
# and includes them in the output
pl$scan_ipc(temp_dir)$collect()

# We can also impose a schema to the partition
pl$scan_ipc(temp_dir, hive_schema = list(cyl = pl$String, gear = pl$Int32))$collect()
This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.
pl__scan_parquet( source, ..., n_rows = NULL, row_index_name = NULL, row_index_offset = 0L, parallel = c("auto", "columns", "row_groups", "prefiltered", "none"), use_statistics = TRUE, hive_partitioning = NULL, glob = TRUE, schema = NULL, hive_schema = NULL, try_parse_hive_dates = TRUE, rechunk = FALSE, low_memory = FALSE, cache = TRUE, storage_options = NULL, retries = 2, include_file_paths = NULL, allow_missing_columns = FALSE )
source |
Path(s) to a file or directory. When needing to authenticate
for scanning cloud locations, see the |
... |
These dots are for future extensions and must be empty. |
n_rows |
Stop reading from parquet file after reading |
row_index_name |
If not |
row_index_offset |
Offset to start the row index column (only used if the name is set). |
parallel |
This determines the direction and strategy of parallelism.
The prefiltered setting falls back to auto if no predicate is given. |
use_statistics |
Use statistics in the parquet to determine if pages can be skipped from reading. |
hive_partitioning |
Infer statistics and schema from Hive partitioned sources and use them to prune reads. |
glob |
Expand path given via globbing rules. |
schema |
Specify the datatypes of the columns. The datatypes must match
the datatypes in the file(s). If there are extra columns that are not in the
file(s), consider also enabling |
hive_schema |
The column names and data types of the columns by which
the data is partitioned. If |
try_parse_hive_dates |
Whether to try parsing hive values as date / datetime types. |
rechunk |
When reading multiple files via a glob pattern, rechunk the final DataFrame into contiguous memory chunks. |
low_memory |
Reduce memory pressure at the expense of performance |
cache |
Cache the result after reading. |
storage_options |
Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
If |
retries |
Number of retries if accessing a cloud instance fails. |
include_file_paths |
Character value indicating the column name that will include the path of the source file(s). |
allow_missing_columns |
When reading a list of parquet files, if a
column existing in the first file cannot be found in subsequent files, the
default behavior is to raise an error. However, if |
A polars LazyFrame
# Write a Parquet file that we can then import as a DataFrame
temp_file <- withr::local_tempfile(fileext = ".parquet")
as_polars_df(mtcars)$write_parquet(temp_file)

pl$scan_parquet(temp_file)$collect()

# Write a hive-style partitioned parquet dataset
temp_dir <- withr::local_tempdir()
as_polars_df(mtcars)$write_parquet(temp_dir, partition_by = c("cyl", "gear"))
list.files(temp_dir, recursive = TRUE)

# If the path is a folder, Polars automatically tries to detect partitions
# and includes them in the output
pl$scan_parquet(temp_dir)$collect()
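The point of scanning rather than reading is that filters and column selections can be attached before any data is loaded, so the optimizer may push them down to the scan. A minimal sketch of that pattern follows; the file path, filter value, and selected columns are illustrative choices, and it assumes the `polars` package documented here is installed.

```r
library(polars)

# Write a small Parquet file to scan (temporary path, for illustration).
path <- tempfile(fileext = ".parquet")
as_polars_df(mtcars)$write_parquet(path)

# Build the query lazily: the filter (predicate) and the column selection
# (projection) are declared before anything is read from disk.
lf <- pl$scan_parquet(path)$
  filter(pl$col("cyl") == 6)$
  select("mpg", "cyl")

# Only now is the file read and the query executed.
out <- lf$collect()
out
```

Calling `pl$read_parquet(path)` followed by the same filter and selection should give the same rows, but it loads the full file first.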
Polars Series class (polars_series)
A Series is a 1-dimensional data structure, similar to an R vector. Within a Series, all elements have the same data type.
pl__Series(name = NULL, values = NULL)
name |
A single string or |
values |
An R object. Passed as the |
The pl$Series() function mimics the constructor of the Series class of Python Polars. This function calls as_polars_series() internally to convert the input object to a Polars Series.
dtype: $dtype returns the data type of the Series.
name: $name returns the name of the Series.
shape: $shape returns an integer vector of length two containing the length of the Series and its width (always 1).
# Constructing a Series by specifying name and values positionally:
s <- pl$Series("a", 1:3)
s

# Active bindings:
s$dtype
s$name
s$shape
Print out the version of Polars and its optional dependencies.
pl__show_versions()
cli enhances the terminal output, especially error messages.
These packages may be used for exporting Series to R.
See <Series>$to_r_vector() for details.
NULL invisibly.
pl$show_versions()
Collect columns into a struct column
pl__struct(...)
... |
< |
A polars expression
# Collect all columns of a dataframe into a struct by passing pl$all().
df <- pl$DataFrame(
  int = 1:2,
  str = c("a", "b"),
  bool = c(TRUE, NA),
  list = list(1:2, 3L)
)
df$select(pl$struct(pl$all())$alias("my_struct"))

# Name each struct field.
df$select(pl$struct(p = "int", q = "bool")$alias("my_struct"))$schema
Compute the sum horizontally across columns
pl__sum_horizontal(...)
... |
< |
A polars expression
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, NA),
  c = c("x", "y", "z")
)
df$with_columns(
  sum = pl$sum_horizontal("a", "b")
)
Generate a time range
pl__time_range( start = NULL, end = NULL, interval = "1h", ..., closed = c("both", "left", "none", "right") )
start |
Lower bound of the time range. If omitted, defaults to
|
end |
Upper bound of the time range. If omitted, defaults to
|
interval |
Interval of the range periods, specified as a difftime or using the Polars duration string language (see details). |
... |
These dots are for future extensions and must be empty. |
closed |
Define which sides of the range are closed (inclusive).
One of the following: |
A polars expression
Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.
It has the following format:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: "3d12h4m25s"
# 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
pl$select(
  time = pl$time_range(
    start = hms::parse_hms("14:00:00"),
    interval = as.difftime("3:15:00")
  )
)
Create a column of time ranges
pl__time_ranges(
  start = NULL,
  end = NULL,
  interval = "1h",
  ...,
  closed = c("both", "left", "none", "right")
)
start |
Lower bound of the time range. If omitted, defaults to
|
end |
Upper bound of the time range. If omitted, defaults to
|
interval |
Interval of the range periods, specified as a difftime or using the Polars duration string language (see details). |
... |
These dots are for future extensions and must be empty. |
closed |
Define which sides of the range are closed (inclusive).
One of the following: |
A polars expression
Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.
It has the following format:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: "3d12h4m25s"
# 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
df <- pl$DataFrame(
  start = hms::parse_hms(c("09:00:00", "10:00:00")),
  end = hms::parse_hms(c("11:00:00", "11:00:00"))
)
df$with_columns(time_range = pl$time_ranges("start", "end"))
Registering custom functionality with a polars Series
pl_api_register_series_namespace(name, ns_fn)
name |
Name under which the functionality will be accessed. |
ns_fn |
A function that returns a new environment with the custom functionality. See examples for details. |
NULL invisibly.
# s: polars series
math_shortcuts <- function(s) {
  # Create a new environment to store the methods
  self <- new.env(parent = emptyenv())

  # Store the series
  self$`_s` <- s

  # Add methods
  self$square <- function() self$`_s` * self$`_s`
  self$cube <- function() self$`_s` * self$`_s` * self$`_s`

  # Set the class
  class(self) <- c("polars_namespace_series", "polars_object")

  # Return the environment
  self
}

pl$api$register_series_namespace("math", math_shortcuts)

s <- as_polars_series(c(1.5, 31, 42, 64.5))
s$math$square()$rename("s^2")

s <- as_polars_series(1:5)
s$math$cube()$rename("s^3")
polars_dtype
Polars supports a variety of data types that fall broadly under the following categories:
Numeric data types: signed integers, unsigned integers, floating point numbers, and decimals.
Nested data types: lists, structs, and arrays.
Temporal: dates, datetimes, times, and time deltas.
Miscellaneous: strings, binary data, Booleans, categoricals, and enums.
All types support missing values represented by the special value null. This is not to be conflated with the special value NaN in floating point data types; see the section about floating point numbers for more information.
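On the R side, the nearest counterparts of null and NaN are NA and NaN. The following base R sketch does not use polars at all; it only illustrates how the two values can be told apart in R before or after conversion. The variable names are illustrative.

```r
# A double vector holding an ordinary value, a missing value, and NaN.
x <- c(1, NA, NaN)

# is.na() flags both the missing value and NaN.
flag_na <- is.na(x)

# is.nan() flags only NaN.
flag_nan <- is.nan(x)

# Elements that are missing but not NaN.
missing_only <- flag_na & !flag_nan

flag_na
flag_nan
missing_only
```

Because is.na() is TRUE for both values, checking is.nan() as well is what separates a missing value from a floating point NaN.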
pl__Decimal(precision = NULL, scale = 0L)
pl__Datetime(time_unit = c("us", "ns", "ms"), time_zone = NULL)
pl__Duration(time_unit = c("us", "ns", "ms"))
pl__Categorical(ordering = c("physical", "lexical"))
pl__Enum(categories)
pl__Array(inner, shape)
pl__List(inner)
pl__Struct(...)
precision |
An integer or |
scale |
An integer. Number of digits to the right of the decimal point in each number. |
time_unit |
One of |
time_zone |
A string or |
ordering |
One of |
categories |
A character vector.
Should not contain |
inner |
A polars data type object. |
shape |
An integer-ish vector, representing the shape of the Array. |
... |
< |
pl$Int8
pl$Int16
pl$Int32
pl$Int64
pl$UInt8
pl$UInt16
pl$UInt32
pl$UInt64
pl$Float32
pl$Float64
pl$Decimal(scale = 2)
pl$String
pl$Binary
pl$Date
pl$Time
pl$Datetime()
pl$Duration()
pl$Array(pl$Int32, c(2, 3))
pl$List(pl$Int32)
pl$Categorical()
pl$Enum(c("a", "b", "c"))
pl$Struct(a = pl$Int32, b = pl$String)
pl$Null
The Polars duration string language
Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.
It has the following format:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: "3d12h4m25s"
# 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
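To make the format concrete, here is a minimal base R sketch that splits a duration string into its amounts and units. The helper `parse_duration_string()` is hypothetical and not part of polars; it only covers unsigned integer amounts followed by the unit suffixes listed above, and it does not reproduce how Polars itself parses or interprets durations.

```r
# Hypothetical helper (not a polars function): split a duration string such as
# "3d12h4m25s" into a named integer vector of amounts, keyed by unit suffix.
parse_duration_string <- function(x) {
  # Two-letter units are listed before one-letter units so that "ms" and "mo"
  # are tried before "m" when the alternation is evaluated left to right.
  pattern <- "([0-9]+)(ns|us|ms|mo|s|m|h|d|w|q|y)"
  pieces <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

  # The matched pieces must reassemble into the input exactly; otherwise some
  # part of the string was not a valid amount/unit pair.
  if (length(pieces) == 0L || paste(pieces, collapse = "") != x) {
    stop("not a valid duration string: ", x)
  }

  amounts <- as.integer(sub(pattern, "\\1", pieces, perl = TRUE))
  units <- sub(pattern, "\\2", pieces, perl = TRUE)
  stats::setNames(amounts, units)
}

parse_duration_string("3d12h4m25s")
parse_duration_string("1mo")
```

The result keeps the units in the order they were written, so "3d12h4m25s" yields the amounts 3, 12, 4, and 25 named d, h, m, and s.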
polars_expr
An expression is a tree of operations that describe how to construct one or more Series. As the outputs are Series, it is straightforward to apply a sequence of expressions each of which transforms the output from the previous step. See examples for details.
pl$lit(): Create a literal expression.
pl$col(): Create an expression representing column(s) in a DataFrame.
# An expression:
# 1. Select column `foo`,
# 2. Then sort the column (not in reversed order)
# 3. Then take the first two values of the sorted output
pl$col("foo")$sort()$head(2)

# Expressions will be evaluated inside a context, such as `<DataFrame>$select()`
df <- pl$DataFrame(
  foo = c(1, 2, 1, 2, 3),
  bar = c(5, 4, 3, 2, 1),
)

df$select(
  pl$col("foo")$sort()$head(3), # Return 3 values
  pl$col("bar")$filter(pl$col("foo") == 1)$sum(), # Return a single value
)
Get the number of chunks that this Series contains
series__n_chunks()
An integer value
s <- pl$Series("a", c(1, 2, 3))
s$n_chunks()

s2 <- pl$Series("a", c(4, 5, 6))

# Concatenate Series with rechunk = TRUE
pl$concat(c(s, s2), rechunk = TRUE)$n_chunks()

# Concatenate Series with rechunk = FALSE
pl$concat(c(s, s2), rechunk = FALSE)$n_chunks()
Cast this Series to a DataFrame
series__to_frame(name = NULL)
name |
A character or |
A polars DataFrame
s <- pl$Series("a", c(123, 456))
df <- s$to_frame()
df

df <- s$to_frame("xyz")
df
Export the Series as an R vector. But note that the Struct data type is exported as a data.frame by default for consistency, and a data.frame is not a vector. If you want to ensure the return value is a vector, please set ensure_vector = TRUE, or use the as.vector() function instead.
series__to_r_vector(
  ...,
  ensure_vector = FALSE,
  int64 = c("double", "character", "integer", "integer64"),
  date = c("Date", "IDate"),
  time = c("hms", "ITime"),
  struct = c("dataframe", "tibble"),
  decimal = c("double", "character"),
  as_clock_class = FALSE,
  ambiguous = c("raise", "earliest", "latest", "null"),
  non_existent = c("raise", "null")
)
... |
These dots are for future extensions and must be empty. |
ensure_vector |
A logical value indicating whether to ensure the return value is a vector.
When the Series has the Struct data type and this argument is |
int64 |
Determine how to convert Polars' Int64, UInt32, or UInt64 type values to R type. One of the following:
|
date |
Determine how to convert Polars' Date type values to R class. One of the following:
|
time |
Determine how to convert Polars' Time type values to R class. One of the following:
|
struct |
Determine how to convert Polars' Struct type values to R class. One of the following:
|
decimal |
Determine how to convert Polars' Decimal type values to R type. One of the following: |
as_clock_class |
A logical value indicating whether to export datetimes and durations as the clock package's classes.
|
ambiguous |
Determine how to deal with ambiguous datetimes.
Only applicable when
|
non_existent |
Determine how to deal with non-existent datetimes.
Only applicable when
|
The class/type of the exported object depends on the data type of the Series as follows:
Boolean: logical.
UInt8, UInt16, Int8, Int16, Int32: integer.
Int64, UInt32, UInt64: double, character, integer, or bit64::integer64, depending on the int64 argument.
Float32, Float64: double.
Decimal: double.
String: character.
Categorical: factor.
Date: Date or data.table::IDate, depending on the date argument.
Time: hms::hms or data.table::ITime, depending on the time argument.
Datetime (without timezone): POSIXct or clock_naive_time, depending on the as_clock_class argument.
Datetime (with timezone): POSIXct or clock_zoned_time, depending on the as_clock_class argument.
Duration: difftime or clock_duration, depending on the as_clock_class argument.
Binary: blob::blob.
Null: vctrs::unspecified.
List, Array: vctrs::list_of.
Struct: data.frame or tibble, depending on the struct argument. If ensure_vector = TRUE, the top-level Struct is exported as a named list to ensure the return value is a vector.
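One way to see why Int64, UInt32, and UInt64 are handled by a dedicated argument is to look at the limits of R's own numeric types. The sketch below uses base R only and does not call polars; the variable names are illustrative.

```r
# R's integer type is 32-bit signed, so its maximum is smaller than the
# largest UInt32 value (4294967295), which is stored here as a double.
r_integer_max <- .Machine$integer.max
uint32_max <- 4294967295
uint32_fits_in_integer <- uint32_max <= r_integer_max

# Doubles represent every integer exactly only up to 2^53. Beyond that point,
# neighbouring integers collapse to the same double value.
collapses_above_2_53 <- (2^53 + 1) == 2^53

r_integer_max
uint32_fits_in_integer
collapses_above_2_53
```

Values outside these limits cannot be held exactly by the "integer" or "double" options, which is the situation the "character" and "integer64" choices address.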
A vector
# Struct values handling
series_struct <- as_polars_series(
  data.frame(
    a = 1:2,
    b = I(list(data.frame(c = "foo"), data.frame(c = "bar")))
  )
)
series_struct

## Export Struct as data.frame
series_struct$to_r_vector()

## Export Struct as data.frame,
## but the top-level Struct is exported as a named list
series_struct$to_r_vector(ensure_vector = TRUE)

## Export Struct as tibble
series_struct$to_r_vector(struct = "tibble")

## Export Struct as tibble,
## but the top-level Struct is exported as a named list
series_struct$to_r_vector(struct = "tibble", ensure_vector = TRUE)

# Integer values handling
series_uint64 <- as_polars_series(
  c(NA, "0", "4294967295", "18446744073709551615")
)$cast(pl$UInt64)
series_uint64

## Export UInt64 as double
series_uint64$to_r_vector(int64 = "double")

## Export UInt64 as character
series_uint64$to_r_vector(int64 = "character")

## Export UInt64 as integer (overflow occurs)
series_uint64$to_r_vector(int64 = "integer")

## Export UInt64 as bit64::integer64 (overflow occurs)
if (requireNamespace("bit64", quietly = TRUE)) {
  series_uint64$to_r_vector(int64 = "integer64")
}

# Duration values handling
series_duration <- as_polars_series(
  c(NA, -1000000000, -10, -1, 1000000000)
)$cast(pl$Duration("ns"))
series_duration

## Export Duration as difftime
series_duration$to_r_vector(as_clock_class = FALSE)

## Export Duration as clock_duration
if (requireNamespace("clock", quietly = TRUE)) {
  series_duration$to_r_vector(as_clock_class = TRUE)
}

# Datetime values handling
series_datetime <- as_polars_series(
  as.POSIXct(
    c(NA, "1920-01-01 00:00:00", "1970-01-01 00:00:00", "2020-01-01 00:00:00"),
    tz = "UTC"
  )
)$cast(pl$Datetime("ns", "UTC"))
series_datetime

## Export zoned datetime as POSIXct
series_datetime$to_r_vector(as_clock_class = FALSE)

## Export zoned datetime as clock_zoned_time
if (requireNamespace("clock", quietly = TRUE)) {
  series_datetime$to_r_vector(as_clock_class = TRUE)
}
Convert this struct Series to a DataFrame with a separate column for each field
series_struct_unnest()
A polars DataFrame
s <- as_polars_series(data.frame(a = c(1, 3), b = c(2, 4)))
s$struct$unnest()