Title: | R Bindings for the 'polars' Rust Library |
---|---|
Description: | Lightning-fast 'DataFrame' library written in 'Rust'. Convert R data to 'Polars' data and vice versa. Perform fast, lazy, larger-than-memory and optimized data queries. 'Polars' is interoperable with the package 'arrow', as both are based on the 'Apache Arrow' Columnar Format. |
Authors: | Tatsuya Shima [aut, cre], Authors of the dependency Rust crates [aut] |
Maintainer: | Tatsuya Shima <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.0.9000 |
Built: | 2025-01-11 16:40:25 UTC |
Source: | https://github.com/pola-rs/r-polars |
The as_polars_df() function creates a polars DataFrame from various R objects.
A Polars DataFrame is based on a sequence of Polars Series, so the input object is first converted to a list of Polars Series by as_polars_series(), and then a Polars DataFrame is created from that list.
as_polars_df(x, ...)

## Default S3 method:
as_polars_df(x, ...)

## S3 method for class 'polars_series'
as_polars_df(x, ..., column_name = NULL, from_struct = TRUE)

## S3 method for class 'polars_data_frame'
as_polars_df(x, ...)

## S3 method for class 'polars_group_by'
as_polars_df(x, ...)

## S3 method for class 'polars_lazy_frame'
as_polars_df(
  x,
  ...,
  type_coercion = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  comm_subplan_elim = TRUE,
  comm_subexpr_elim = TRUE,
  cluster_with_columns = TRUE,
  no_optimization = FALSE,
  streaming = FALSE
)

## S3 method for class 'list'
as_polars_df(x, ...)

## S3 method for class 'data.frame'
as_polars_df(x, ...)

## S3 method for class 'NULL'
as_polars_df(x, ...)
x |
An R object. |
... |
Additional arguments passed to the methods. |
column_name |
A character or |
from_struct |
A logical. If |
type_coercion |
A logical indicating whether to apply the type coercion optimization. |
predicate_pushdown |
A logical indicating whether to apply the predicate pushdown optimization. |
projection_pushdown |
A logical indicating whether to apply the projection pushdown optimization. |
simplify_expression |
A logical indicating whether to apply the simplify expression optimization. |
slice_pushdown |
A logical indicating whether to apply the slice pushdown optimization. |
comm_subplan_elim |
A logical indicating whether to try to cache branching subplans that occur on self-joins or unions. |
comm_subexpr_elim |
A logical indicating whether to try to cache common subexpressions. |
cluster_with_columns |
A logical indicating whether to combine sequential independent calls to with_columns. |
no_optimization |
A logical. If |
streaming |
A logical. If |
The default method of as_polars_df() throws an error, so we need to define methods for the classes we want to support.
For a list, the argument ... (except name) is passed to as_polars_series() for each element of the list.
All elements of the list must be converted to Series of the same length by as_polars_series().
The name of each element is used as the column name of the DataFrame.
For unnamed elements, the column name will be an empty string "" or, if the element is a Series, the name of the Series.
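The naming rule for list elements can be illustrated with a short sketch. It assumes the polars package is attached and uses only functions documented in this manual; the Series name from_series is an arbitrary choice made for illustration.

```r
library(polars)

# One named element, and one unnamed element that is already a named Series
df <- as_polars_df(list(
  a = 1:2,
  as_polars_series(c("x", "y"), "from_series")
))

# Inspect the resulting column names through the exported R list
col_names <- names(as.list(df))
col_names
```

Exporting with as.list() is used here only as a convenient way to read the column names back into R.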
For a data frame, the argument ... (except name) is passed to as_polars_series() for each column.
All columns must be converted to Series of the same length by as_polars_series().
For a Series, this method is a shortcut for <Series>$to_frame() or <Series>$struct$unnest(), depending on the from_struct argument and the Series data type.
The column_name argument is passed to the name argument of the $to_frame() method.
For a LazyFrame, this method is a shortcut for <LazyFrame>$collect().
A polars DataFrame
as.list(<polars_data_frame>): Export the DataFrame as an R list.
as.data.frame(<polars_data_frame>): Export the DataFrame as an R data frame.
# list
as_polars_df(list(a = 1:2, b = c("foo", "bar")))

# data.frame
as_polars_df(data.frame(a = 1:2, b = c("foo", "bar")))

# polars_series
s_int <- as_polars_series(1:2, "a")
s_struct <- as_polars_series(
  data.frame(a = 1:2, b = c("foo", "bar")),
  "struct"
)

## Use the Series as a column
as_polars_df(s_int)
as_polars_df(s_struct, column_name = "values", from_struct = FALSE)

## Unnest the struct data
as_polars_df(s_struct)
The as_polars_expr() function creates a polars expression from various R objects.
This function is used internally by various polars functions that accept expressions.
In most cases, users should use pl$lit(), which is a shorthand for as_polars_expr(x, as_lit = TRUE), instead of this function.
(In other words, this function can be considered an internal implementation of the lit function found in the Polars API of other languages.)
as_polars_expr(x, ...)

## Default S3 method:
as_polars_expr(x, ...)

## S3 method for class 'polars_expr'
as_polars_expr(x, ..., structify = FALSE)

## S3 method for class 'polars_series'
as_polars_expr(x, ...)

## S3 method for class 'character'
as_polars_expr(x, ..., as_lit = FALSE)

## S3 method for class 'logical'
as_polars_expr(x, ...)

## S3 method for class 'integer'
as_polars_expr(x, ...)

## S3 method for class 'double'
as_polars_expr(x, ...)

## S3 method for class 'raw'
as_polars_expr(x, ...)

## S3 method for class 'NULL'
as_polars_expr(x, ...)
x |
An R object. |
... |
Additional arguments passed to the methods. |
structify |
A logical. If |
as_lit |
A logical value indicating whether to treat the vector as literal values.
This argument is always set to |
Because R objects are typically mapped to Series, this function often calls as_polars_series() internally.
However, unlike R, Polars has scalars of length 1, so if an R object is converted to a Series of length 1, this function gets the first value of the Series and converts it to a scalar literal.
If you want to implement your own conversion from an R class to a Polars object, define an S3 method for as_polars_series() instead of this function.
Create a Series by calling as_polars_series() and then convert that Series to an Expr.
If the length of the Series is 1, it will be converted to a scalar value.
Additional arguments ... are passed to as_polars_series().
If the as_lit argument is FALSE (default), this function will call pl$col() and the character vector is treated as column names.
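The difference between the two as_lit settings is easiest to see when the expression is evaluated against a DataFrame. The sketch below assumes the polars package is attached; the $select() method is not described in this section and is assumed here to evaluate the given expression against the DataFrame.

```r
library(polars)

df <- as_polars_df(list(a = 1:2, b = c("foo", "bar")))

# Default (as_lit = FALSE): the string is treated as a column name
out_col <- df$select(as_polars_expr("a"))

# as_lit = TRUE: the same string is treated as a literal value
out_lit <- df$select(as_polars_expr("a", as_lit = TRUE))

out_col
out_lit
```

In everyday code, pl$col("a") and pl$lit("a") express the same two intents more directly.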
A polars expression
Since R has no scalar class, length-1 vectors of the following types are specially converted to scalar literals.
character: String
logical: Boolean
integer: Int32
double: Float64
An NA of these types is converted to a null literal cast to the corresponding Polars type.
A raw vector is converted to a Binary scalar.
raw: Binary
NULL is converted to a null literal of the Null type.
NULL: Null
For other R classes, the default S3 method is called and the R object is converted via as_polars_series(), so the type mapping is defined by as_polars_series().
as_polars_series(): R -> Polars type mapping is mostly defined by this function.
# character
## as_lit = FALSE (default)
as_polars_expr("a") # Same as `pl$col("a")`
as_polars_expr(c("a", "b")) # Same as `pl$col("a", "b")`

## as_lit = TRUE
as_polars_expr(character(0), as_lit = TRUE)
as_polars_expr("a", as_lit = TRUE)
as_polars_expr(NA_character_, as_lit = TRUE)
as_polars_expr(c("a", "b"), as_lit = TRUE)

# logical
as_polars_expr(logical(0))
as_polars_expr(TRUE)
as_polars_expr(NA)
as_polars_expr(c(TRUE, FALSE))

# integer
as_polars_expr(integer(0))
as_polars_expr(1L)
as_polars_expr(NA_integer_)
as_polars_expr(c(1L, 2L))

# double
as_polars_expr(double(0))
as_polars_expr(1)
as_polars_expr(NA_real_)
as_polars_expr(c(1, 2))

# raw
as_polars_expr(raw(0))
as_polars_expr(charToRaw("foo"))

# NULL
as_polars_expr(NULL)

# default method (for list)
as_polars_expr(list())
as_polars_expr(list(1))
as_polars_expr(list(1, 2))

# default method (for Date)
as_polars_expr(as.Date(integer(0)))
as_polars_expr(as.Date("2021-01-01"))
as_polars_expr(as.Date(c("2021-01-01", "2021-01-02")))

# polars_series
## Unlike the default method, this method does not extract the first value
as_polars_series(1) |>
  as_polars_expr()

# polars_expr
as_polars_expr(pl$col("a", "b"))
as_polars_expr(pl$col("a", "b"), structify = TRUE)
The as_polars_lf() function creates a LazyFrame from various R objects.
It is basically a shortcut for as_polars_df(x, ...) with the $lazy() method.
as_polars_lf(x, ...)

## Default S3 method:
as_polars_lf(x, ...)

## S3 method for class 'polars_lazy_frame'
as_polars_lf(x, ...)
x |
An R object. |
... |
Additional arguments passed to the methods. |
Create a DataFrame by calling as_polars_df() and then create a LazyFrame from the DataFrame.
Additional arguments ... are passed to as_polars_df().
A polars LazyFrame
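A minimal round trip, sketched under the assumption that the polars package is attached, shows how as_polars_lf() relates to as_polars_df(). It uses only functions documented in this manual; the input data frame is made up for illustration.

```r
library(polars)

# Build a LazyFrame from a plain R data frame
lf <- as_polars_lf(data.frame(a = 1:3, b = c("x", "y", "z")))
is_polars_lf(lf)

# Materialize it again; as_polars_df() on a LazyFrame is documented
# as a shortcut for <LazyFrame>$collect()
df <- as_polars_df(lf)
is_polars_df(df)

# Export the result back to a base R data frame
r_df <- as.data.frame(df)
r_df
```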
The as_polars_series() function creates a polars Series from various R objects.
The data type of the Series is determined by the class of the input object.
as_polars_series(x, name = NULL, ...)

## Default S3 method:
as_polars_series(x, name = NULL, ...)

## S3 method for class 'polars_series'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'polars_data_frame'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'double'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'integer'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'character'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'logical'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'raw'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'factor'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'Date'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'POSIXct'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'POSIXlt'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'difftime'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'hms'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'blob'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'array'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'NULL'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'list'
as_polars_series(x, name = NULL, ..., strict = FALSE)

## S3 method for class 'AsIs'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'data.frame'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'integer64'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'ITime'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'vctrs_unspecified'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'vctrs_rcrd'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'clock_time_point'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'clock_sys_time'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'clock_zoned_time'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'clock_duration'
as_polars_series(x, name = NULL, ...)
x |
An R object. |
name |
A single string or |
... |
Additional arguments passed to the methods. |
strict |
A logical value indicating whether to throw an error when the input list's elements have different data types.
If |
The default method of as_polars_series() throws an error, so we need to define S3 methods for the classes we want to support.
In R, a list can contain elements of different types, but in Polars (Apache Arrow), all elements must have the same type.
So the as_polars_series() function automatically casts all elements to the same type or throws an error, depending on the strict argument.
If you want to create a list with all elements of the same type in R, consider using the vctrs::list_of() function.
Since a list can contain another list, the strict argument is also used when creating Series from the inner list in the case of classes constructed on top of a list, such as data.frame or vctrs_rcrd.
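The effect of the strict argument can be sketched as follows, assuming the polars package is attached. The input lists are made up for illustration, and try() is used so that the expected error does not stop the script.

```r
library(polars)

# Mixed element types: with strict = FALSE (default), the elements
# are cast to a common type
s_cast <- as_polars_series(list(1, "foo", TRUE))
s_cast

# The same input with strict = TRUE is expected to throw an error
res <- try(
  as_polars_series(list(1, "foo", TRUE), strict = TRUE),
  silent = TRUE
)
inherits(res, "try-error")

# A list whose elements all share one type converts under strict = TRUE
s_same <- as_polars_series(list(1, 2, 3), strict = TRUE)
s_same
```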
Date: Sub-day values will be ignored (floored to the day).
POSIXct: Sub-millisecond values will be ignored (floored to the millisecond). If the tzone attribute is not present or is an empty string (""), the Series' dtype will be Datetime without timezone.
POSIXlt: Sub-nanosecond values will be ignored (floored to the nanosecond).
difftime: Sub-millisecond values will be rounded to milliseconds.
hms: Sub-nanosecond values will be ignored (floored to the nanosecond). If the hms vector contains values greater than or equal to 24:00:00 or less than 00:00:00, an error will be thrown.
clock_duration: Calendrical durations (years, quarters, months) are treated chronologically, using an internal representation in seconds. Please check the clock_duration documentation for more details.
polars_data_frame: This method is a shortcut for <DataFrame>$to_struct().
<Series>$to_r_vector(): Export the Series as an R vector.
as_polars_df(): Create a Polars DataFrame from an R object.
# double
as_polars_series(c(NA, 1, 2))

# integer
as_polars_series(c(NA, 1:2))

# character
as_polars_series(c(NA, "foo", "bar"))

# logical
as_polars_series(c(NA, TRUE, FALSE))

# raw
as_polars_series(charToRaw("foo"))

# factor
as_polars_series(factor(c(NA, "a", "b")))

# Date
as_polars_series(as.Date(c(NA, "2021-01-01")))

## Sub-day precision will be ignored
as.Date(c(-0.5, 0, 0.5)) |>
  as_polars_series()

# POSIXct with timezone
as_polars_series(as.POSIXct(c(NA, "2021-01-01 00:00:00.123456789"), "UTC"))

# POSIXct without timezone
as_polars_series(as.POSIXct(c(NA, "2021-01-01 00:00:00.123456789")))

# POSIXlt
as_polars_series(as.POSIXlt(c(NA, "2021-01-01 00:00:00.123456789"), "UTC"))

# difftime
as_polars_series(as.difftime(c(NA, 1), units = "days"))

## Sub-millisecond values will be rounded to milliseconds
as.difftime(c(0.0005, 0.0010, 0.0015, 0.0020), units = "secs") |>
  as_polars_series()
as.difftime(c(0.0005, 0.0010, 0.0015, 0.0020), units = "weeks") |>
  as_polars_series()

# NULL
as_polars_series(NULL)

# list
as_polars_series(list(NA, NULL, list(), 1, "foo", TRUE))

## 1st element will be `null` due to the casting failure
as_polars_series(list(list("bar"), "foo"))

# data.frame
as_polars_series(
  data.frame(x = 1:2, y = c("foo", "bar"), z = I(list(1, 2)))
)

# vctrs_unspecified
if (requireNamespace("vctrs", quietly = TRUE)) {
  as_polars_series(vctrs::unspecified(3L))
}

# hms
if (requireNamespace("hms", quietly = TRUE)) {
  as_polars_series(hms::as_hms(c(NA, "01:00:00")))
}

# blob
if (requireNamespace("blob", quietly = TRUE)) {
  as_polars_series(blob::as_blob(c(NA, "foo", "bar")))
}

# integer64
if (requireNamespace("bit64", quietly = TRUE)) {
  as_polars_series(bit64::as.integer64(c(NA, "9223372036854775807")))
}

# clock_naive_time
if (requireNamespace("clock", quietly = TRUE)) {
  as_polars_series(clock::naive_time_parse(c(
    NA,
    "1900-01-01T12:34:56.123456789",
    "2020-01-01T12:34:56.123456789"
  ), precision = "nanosecond"))
}

# clock_duration
if (requireNamespace("clock", quietly = TRUE)) {
  as_polars_series(clock::duration_nanoseconds(c(NA, 1)))
}

## Calendrical durations are treated chronologically
if (requireNamespace("clock", quietly = TRUE)) {
  as_polars_series(clock::duration_years(c(NA, 1)))
}
This S3 method is basically a shortcut for as_polars_df(x, ...)$to_struct()$to_r_vector(ensure_vector = FALSE, struct = "tibble").
Additionally, you can check or repair the column names by specifying the .name_repair argument.
This is useful because a polars DataFrame allows empty column names, which are not generally valid column names in an R data frame.
## S3 method for class 'polars_data_frame'
as_tibble(
  x,
  ...,
  .name_repair = c("check_unique", "unique", "universal", "minimal"),
  int64 = c("double", "character", "integer", "integer64"),
  date = c("Date", "IDate"),
  time = c("hms", "ITime"),
  decimal = c("double", "character"),
  as_clock_class = FALSE,
  ambiguous = c("raise", "earliest", "latest", "null"),
  non_existent = c("raise", "null")
)

## S3 method for class 'polars_lazy_frame'
as_tibble(
  x,
  ...,
  .name_repair = c("check_unique", "unique", "universal", "minimal"),
  int64 = c("double", "character", "integer", "integer64"),
  date = c("Date", "IDate"),
  time = c("hms", "ITime"),
  decimal = c("double", "character"),
  as_clock_class = FALSE,
  ambiguous = c("raise", "earliest", "latest", "null"),
  non_existent = c("raise", "null")
)
x |
A polars object |
... |
Passed to |
.name_repair |
Treatment of problematic column names:
This argument is passed on as |
int64 |
Determine how to convert Polars' Int64, UInt32, or UInt64 type values to R type. One of the following:
|
date |
Determine how to convert Polars' Date type values to R class. One of the following:
|
time |
Determine how to convert Polars' Time type values to R class. One of the following:
|
decimal |
Determine how to convert Polars' Decimal type values to R type. One of the following: |
as_clock_class |
A logical value indicating whether to export datetimes and durations as the clock package's classes.
|
ambiguous |
Determine how to deal with ambiguous datetimes.
Only applicable when
|
non_existent |
Determine how to deal with non-existent datetimes.
Only applicable when
|
A tibble
as.data.frame(<polars_object>): Export the polars object as a basic data frame.
# Polars DataFrame may have empty column name
df <- pl$DataFrame(x = 1:2, c("a", "b"))
df

# Without checking or repairing the column names
tibble::as_tibble(df, .name_repair = "minimal")
tibble::as_tibble(df$lazy(), .name_repair = "minimal")

# You can make that unique
tibble::as_tibble(df, .name_repair = "unique")
tibble::as_tibble(df$lazy(), .name_repair = "unique")
This S3 method is a shortcut for as_polars_df(x, ...)$to_struct()$to_r_vector(ensure_vector = FALSE, struct = "dataframe").
## S3 method for class 'polars_data_frame'
as.data.frame(
  x,
  ...,
  int64 = c("double", "character", "integer", "integer64"),
  date = c("Date", "IDate"),
  time = c("hms", "ITime"),
  decimal = c("double", "character"),
  as_clock_class = FALSE,
  ambiguous = c("raise", "earliest", "latest", "null"),
  non_existent = c("raise", "null")
)

## S3 method for class 'polars_lazy_frame'
as.data.frame(
  x,
  ...,
  int64 = c("double", "character", "integer", "integer64"),
  date = c("Date", "IDate"),
  time = c("hms", "ITime"),
  decimal = c("double", "character"),
  as_clock_class = FALSE,
  ambiguous = c("raise", "earliest", "latest", "null"),
  non_existent = c("raise", "null")
)
x |
A polars object |
... |
Passed to |
int64 |
Determine how to convert Polars' Int64, UInt32, or UInt64 type values to R type. One of the following:
|
date |
Determine how to convert Polars' Date type values to R class. One of the following:
|
time |
Determine how to convert Polars' Time type values to R class. One of the following:
|
decimal |
Determine how to convert Polars' Decimal type values to R type. One of the following: |
as_clock_class |
A logical value indicating whether to export datetimes and durations as the clock package's classes.
|
ambiguous |
Determine how to deal with ambiguous datetimes.
Only applicable when
|
non_existent |
Determine how to deal with non-existent datetimes.
Only applicable when
|
An R data frame
df <- as_polars_df(list(a = 1:3, b = 4:6))

as.data.frame(df)
as.data.frame(df$lazy())
This S3 method calls as_polars_df(x, ...)$get_columns() or as_polars_df(x, ...)$to_struct()$to_r_vector(ensure_vector = TRUE), depending on the as_series argument.
## S3 method for class 'polars_data_frame'
as.list(
  x,
  ...,
  as_series = FALSE,
  int64 = c("double", "character", "integer", "integer64"),
  date = c("Date", "IDate"),
  time = c("hms", "ITime"),
  struct = c("dataframe", "tibble"),
  decimal = c("double", "character"),
  as_clock_class = FALSE,
  ambiguous = c("raise", "earliest", "latest", "null"),
  non_existent = c("raise", "null")
)

## S3 method for class 'polars_lazy_frame'
as.list(
  x,
  ...,
  as_series = FALSE,
  int64 = c("double", "character", "integer", "integer64"),
  date = c("Date", "IDate"),
  time = c("hms", "ITime"),
  struct = c("dataframe", "tibble"),
  decimal = c("double", "character"),
  as_clock_class = FALSE,
  ambiguous = c("raise", "earliest", "latest", "null"),
  non_existent = c("raise", "null")
)
x |
A polars object |
... |
Passed to |
as_series |
Whether to convert each column to an R vector or a Series.
If |
int64 |
Determine how to convert Polars' Int64, UInt32, or UInt64 type values to R type. One of the following:
|
date |
Determine how to convert Polars' Date type values to R class. One of the following:
|
time |
Determine how to convert Polars' Time type values to R class. One of the following:
|
struct |
Determine how to convert Polars' Struct type values to R class. One of the following:
|
decimal |
Determine how to convert Polars' Decimal type values to R type. One of the following: |
as_clock_class |
A logical value indicating whether to export datetimes and durations as the clock package's classes.
|
ambiguous |
Determine how to deal with ambiguous datetimes.
Only applicable when
|
non_existent |
Determine how to deal with non-existent datetimes.
Only applicable when
|
Arguments other than x and as_series are passed to <Series>$to_r_vector(), so they are ignored when as_series = TRUE.
A list
df <- as_polars_df(list(a = 1:3, b = 4:6))

as.list(df, as_series = TRUE)
as.list(df, as_series = FALSE)

as.list(df$lazy(), as_series = TRUE)
as.list(df$lazy(), as_series = FALSE)
Functions to check if the object is a polars object.
is_* functions return TRUE or FALSE depending on the class of the object.
check_* functions throw an informative error if the object is not the correct class.
Suffixes correspond to the polars object classes:
*_dtype: For polars data types.
*_df: For polars data frames.
*_expr: For polars expressions.
*_lf: For polars lazy frames.
*_selector: For polars selectors.
*_series: For polars series.
is_polars_dtype(x)

is_polars_df(x)

is_polars_expr(x, ...)

is_polars_lf(x)

is_polars_selector(x, ...)

is_polars_series(x)

is_list_of_polars_dtype(x, n = NULL)

check_polars_dtype(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)

check_polars_df(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)

check_polars_expr(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)

check_polars_lf(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)

check_polars_selector(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)

check_polars_series(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)

check_list_of_polars_dtype(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)
is_polars_dtype(x) is_polars_df(x) is_polars_expr(x, ...) is_polars_lf(x) is_polars_selector(x, ...) is_polars_series(x) is_list_of_polars_dtype(x, n = NULL) check_polars_dtype( x, ..., allow_null = FALSE, arg = caller_arg(x), call = caller_env() ) check_polars_df( x, ..., allow_null = FALSE, arg = caller_arg(x), call = caller_env() ) check_polars_expr( x, ..., allow_null = FALSE, arg = caller_arg(x), call = caller_env() ) check_polars_lf( x, ..., allow_null = FALSE, arg = caller_arg(x), call = caller_env() ) check_polars_selector( x, ..., allow_null = FALSE, arg = caller_arg(x), call = caller_env() ) check_polars_series( x, ..., allow_null = FALSE, arg = caller_arg(x), call = caller_env() ) check_list_of_polars_dtype( x, ..., allow_null = FALSE, arg = caller_arg(x), call = caller_env() )
x |
An object to check. |
... |
Arguments passed to |
n |
Expected length of a vector. |
allow_null |
If |
arg |
An argument name as a string. This argument will be mentioned in error messages as the input that is at the origin of a problem. |
call |
The execution environment of a currently
running function, e.g. |
check_polars_* functions are derived from the standalone-types-check functions from the rlang package (which can be installed with usethis::use_standalone("r-lib/rlang", file = "types-check")).
is_polars_* functions return TRUE or FALSE.
check_polars_* functions return NULL invisibly if the input is valid.
is_polars_df(as_polars_df(mtcars))
is_polars_df(mtcars)
# Use `check_polars_*` functions in a function
# to ensure the input is a polars object
sample_func <- function(x) {
  check_polars_df(x)
  TRUE
}
sample_func(as_polars_df(mtcars))
try(sample_func(mtcars))
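As a further sketch of the check_polars_* helpers, the example below validates an optional argument with allow_null = TRUE. It assumes allow_null behaves as in rlang's standalone type checks, where a NULL input is accepted instead of raising an error. describe_input() is a hypothetical helper introduced only for illustration.

```r
library(polars)

# Hypothetical helper: `data` may be a polars DataFrame or NULL.
describe_input <- function(data = NULL) {
  # Assumption: with allow_null = TRUE, NULL passes the check silently.
  check_polars_df(data, allow_null = TRUE)
  if (is.null(data)) "no data supplied" else "polars DataFrame supplied"
}

describe_input(NULL)
describe_input(as_polars_df(mtcars))

# Anything that is neither NULL nor a polars DataFrame still raises an error.
res <- try(describe_input(mtcars), silent = TRUE)
inherits(res, "try-error")
```

Keeping the check at the top of the function means the error message names the caller's argument, which is what the arg and call parameters are for.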
cs is an environment class object that stores all selector functions of the R Polars API, which mimics the Python Polars API.
It is intended to work the same way as in Python, as if you had imported Python Polars Selectors with import polars.selectors as cs.
cs
An object of class polars_object of length 29.
There are 4 supported operators for selectors:
& to combine conditions with AND, e.g. select columns that contain "oo" and end with "t" with cs$contains("oo") & cs$ends_with("t");
| to combine conditions with OR, e.g. select columns that contain "oo" or end with "t" with cs$contains("oo") | cs$ends_with("t");
- to subtract conditions, e.g. select all columns that have alphanumeric names except those that contain "a" with cs$alphanumeric() - cs$contains("a");
! to invert the selection, e.g. select all columns that are not of data type String with !cs$string().
Note that Python Polars uses ~ instead of ! to invert selectors.
cs
# How many members are in the `cs` environment?
length(cs)
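The four selector operators described above can be sketched on a small frame. The column names below are made up for illustration, and $columns is used only to list the names of the selected columns.

```r
library(polars)

df <- pl$DataFrame(
  foo = 1:2,
  boot = c("a", "b"),
  root = c(1.5, 2.5),
  bar = c(TRUE, FALSE)
)

# AND: names that contain "oo" and also end with "t"
both <- df$select(cs$contains("oo") & cs$ends_with("t"))$columns

# OR: names that contain "oo" or end with "r"
either <- df$select(cs$contains("oo") | cs$ends_with("r"))$columns

# Subtraction: alphanumeric names, minus those that contain "a"
without_a <- df$select(cs$alphanumeric() - cs$contains("a"))$columns

# Inversion: every column that is not of data type String
not_string <- df$select(!cs$string())$columns

both
either
without_a
not_string
```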
Select all columns
cs__all()
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(dt = as.Date(c("2000-1-1")), value = 10)
# Select all columns, casting them to string:
df$select(cs$all()$cast(pl$String))
# Select all columns except for those matching the given dtypes:
df$select(cs$all() - cs$numeric())
Select all columns with alphabetic names (e.g. only letters)
cs__alpha(ascii_only = FALSE, ..., ignore_spaces = FALSE)
ascii_only |
Indicate whether to consider only ASCII alphabetic characters, or the full Unicode range of valid letters (accented, ideographic, etc). |
... |
These dots are for future extensions and must be empty. |
ignore_spaces |
Indicate whether to ignore the presence of spaces in column names; if so, only the other (non-space) characters are considered. |
Matching column names cannot contain any non-alphabetic characters. Note that the definition of “alphabetic” consists of all valid Unicode alphabetic characters (\p{Alphabetic}) by default; this can be changed by setting ascii_only = TRUE.
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  no1 = c(100, 200, 300),
  café = c("espresso", "latte", "mocha"),
  `t or f` = c(TRUE, FALSE, NA),
  hmm = c("aaa", "bbb", "ccc"),
  都市 = c("東京", "大阪", "京都")
)
# Select columns with alphabetic names; note that accented characters and
# kanji are recognised as alphabetic here:
df$select(cs$alpha())
# Constrain the definition of “alphabetic” to ASCII characters only:
df$select(cs$alpha(ascii_only = TRUE))
df$select(cs$alpha(ascii_only = TRUE, ignore_spaces = TRUE))
# Select all columns except for those with alphabetic names:
df$select(!cs$alpha())
df$select(!cs$alpha(ignore_spaces = TRUE))
Select all columns with alphanumeric names (e.g. only letters and the digits 0-9)
cs__alphanumeric(ascii_only = FALSE, ..., ignore_spaces = FALSE)
ascii_only |
Indicate whether to consider only ASCII alphabetic characters, or the full Unicode range of valid letters (accented, ideographic, etc). |
... |
These dots are for future extensions and must be empty. |
ignore_spaces |
Indicate whether to ignore the presence of spaces in column names; if so, only the other (non-space) characters are considered. |
Matching column names cannot contain any non-alphanumeric characters. Note that, by default, “alphanumeric” covers all valid Unicode alphabetic characters (\p{Alphabetic}) and digit characters (\d); this can be changed by setting ascii_only = TRUE.
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  `1st_col` = c(100, 200, 300),
  flagged = c(TRUE, FALSE, TRUE),
  `00prefix` = c("01:aa", "02:bb", "03:cc"),
  `last col` = c("x", "y", "z")
)
# Select columns with alphanumeric names:
df$select(cs$alphanumeric())
df$select(cs$alphanumeric(ignore_spaces = TRUE))
# Select all columns except for those with alphanumeric names:
df$select(!cs$alphanumeric())
df$select(!cs$alphanumeric(ignore_spaces = TRUE))
Select all binary columns
cs__binary()
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  a = charToRaw("hello"),
  b = "world",
  c = charToRaw("!"),
  d = ":"
)
# Select binary columns:
df$select(cs$binary())
# Select all columns except for those that are binary:
df$select(!cs$binary())
Select all boolean columns
cs__boolean()
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  a = 1:4,
  b = c(FALSE, TRUE, FALSE, TRUE)
)
# Select and invert boolean columns:
df$with_columns(inverted = cs$boolean()$not())
# Select all columns except for those that are boolean:
df$select(!cs$boolean())
Select all columns matching the given dtypes
cs__by_dtype(...)
... |
< |
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  dt = as.Date(c("1999-12-31", "2024-1-1", "2010-7-5")),
  value = c(1234500, 5000555, -4500000),
  other = c("foo", "bar", "foo")
)
# Select all columns with date or string dtypes:
df$select(cs$by_dtype(pl$Date, pl$String))
# Select all columns that are not of date or string dtype:
df$select(!cs$by_dtype(pl$Date, pl$String))
# Group by string columns and sum the numeric columns:
df$group_by(cs$string())$agg(cs$numeric()$sum())$sort("other")
Select all columns matching the given indices (or range objects)
cs__by_index(indices)
indices |
One or more column indices (or ranges). Negative indexing is supported. |
Matching columns are returned in the order in which their indexes appear in the selector, not the underlying schema order.
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
vals <- as.list(0.5 * 0:100)
names(vals) <- paste0("c", 0:100)
df <- pl$DataFrame(!!!vals)
df
# Select columns by index (the two first/last columns):
df$select(cs$by_index(c(0, 1, -2, -1)))
# Use seq()
df$select(cs$by_index(c(0, seq(1, 101, 20))))
df$select(cs$by_index(c(0, seq(101, 0, -25))))
# Select only odd-indexed columns:
df$select(!cs$by_index(seq(0, 100, 2)))
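The ordering note for cs$by_index() is easier to see on a three-column frame. This is a minimal sketch with made-up column names. Indices are zero-based, as in the examples above, and the result is expected to follow the order of the indices rather than the schema order.

```r
library(polars)

df <- pl$DataFrame(a = 1:2, b = 3:4, c = 5:6)

# Ask for the last column first, then the first column.
picked <- df$select(cs$by_index(c(-1, 0)))

# Column names of the result; compare with the schema order a, b, c.
picked$columns
```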
Select all columns matching the given names
cs__by_name(..., require_all = TRUE)
... |
< |
require_all |
Whether to match all names (the default) or any of the names. |
Matching columns are returned in the order in which their names appear in the selector, not the underlying schema order.
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123, 456),
  baz = c(2.0, 5.5),
  zap = c(FALSE, TRUE)
)
# Select columns by name:
df$select(cs$by_name("foo", "bar"))
# Match any of the given columns by name:
df$select(cs$by_name("baz", "moose", "foo", "bear", require_all = FALSE))
# Match all columns except for those given:
df$select(!cs$by_name("foo", "bar"))
Select all categorical columns
cs__categorical()
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("xx", "yy"),
  bar = c(123, 456),
  baz = c(2.0, 5.5),
  .schema_overrides = list(foo = pl$Categorical()),
)
# Select categorical columns:
df$select(cs$categorical())
# Select all columns except for those that are categorical:
df$select(!cs$categorical())
Select columns whose names contain the given literal substring(s)
cs__contains(...)
... |
< |
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123, 456),
  baz = c(2.0, 5.5),
  zap = c(FALSE, TRUE)
)
# Select columns that contain the substring "ba":
df$select(cs$contains("ba"))
# Select columns that contain the substring "ba" or the letter "z":
df$select(cs$contains("ba", "z"))
# Select all columns except for those that contain the substring "ba":
df$select(!cs$contains("ba"))
Select all date columns
cs__date()
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  dtm = as.POSIXct(c("2001-5-7 10:25", "2031-12-31 00:30")),
  dt = as.Date(c("1999-12-31", "2024-8-9"))
)
# Select date columns:
df$select(cs$date())
# Select all columns except for those that are dates:
df$select(!cs$date())
Select all datetime columns
cs__datetime(time_unit = c("ms", "us", "ns"), time_zone = list("*", NULL))
time_unit |
One (or more) of the allowed time unit precision strings,
|
time_zone |
One of the following. The value or each element of the vector
will be passed to the
|
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
chr_vec <- c("1999-07-21 05:20:16.987654", "2000-05-16 06:21:21.123456")
df <- pl$DataFrame(
  tstamp_tokyo = as.POSIXlt(chr_vec, tz = "Asia/Tokyo"),
  tstamp_utc = as.POSIXct(chr_vec, tz = "UTC"),
  tstamp = as.POSIXct(chr_vec),
  dt = as.Date(chr_vec),
)
# Select all datetime columns:
df$select(cs$datetime())
# Select all datetime columns that have "ms" precision:
df$select(cs$datetime("ms"))
# Select all datetime columns that have any timezone:
df$select(cs$datetime(time_zone = "*"))
# Select all datetime columns that have a specific timezone:
df$select(cs$datetime(time_zone = "UTC"))
# Select all datetime columns that have NO timezone:
df$select(cs$datetime(time_zone = NULL))
# Select all columns except for datetime columns:
df$select(!cs$datetime())
Select all decimal columns
cs__decimal()
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123, 456),
  baz = c("2.0005", "-50.5555"),
  .schema_overrides = list(
    bar = pl$Decimal(),
    baz = pl$Decimal(scale = 5, precision = 10)
  )
)
# Select decimal columns:
df$select(cs$decimal())
# Select all columns except for those that are decimal:
df$select(!cs$decimal())
Select all columns having names consisting only of digits
cs__digit(ascii_only = FALSE)
ascii_only |
Indicate whether to consider only ASCII digits (0-9), or the full Unicode range of valid digit characters. |
Matching column names cannot contain any non-digit characters. Note that the definition of "digit" consists of all valid Unicode digit characters (\d) by default; this can be changed by setting ascii_only = TRUE.
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  key = c("aaa", "bbb"),
  `2001` = 1:2,
  `2025` = 3:4
)
# Select columns with digit names:
df$select(cs$digit())
# Select all columns except for those with digit names:
df$select(!cs$digit())
# Demonstrate use of ascii_only flag (by default all valid unicode digits
# are considered, but this can be constrained to ascii 0-9):
df <- pl$DataFrame(`१९९९` = 1999, `२०७७` = 2077, `3000` = 3000)
df$select(cs$digit())
df$select(cs$digit(ascii_only = TRUE))
Select all duration columns, optionally filtering by time unit
cs__duration(time_unit = c("ms", "us", "ns"))
time_unit |
One (or more) of the allowed time unit precision strings,
|
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  dtm = as.POSIXct(c("2001-5-7 10:25", "2031-12-31 00:30")),
  dur_ms = clock::duration_milliseconds(1:2),
  dur_us = clock::duration_microseconds(1:2),
  dur_ns = clock::duration_nanoseconds(1:2),
)
# Select duration columns:
df$select(cs$duration())
# Select all duration columns that have "ms" precision:
df$select(cs$duration("ms"))
# Select all duration columns that have "ms" OR "ns" precision:
df$select(cs$duration(c("ms", "ns")))
# Select all columns except for those that are duration:
df$select(!cs$duration())
Select columns that end with the given substring(s)
cs__ends_with(...)
... |
< |
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123, 456),
  baz = c(2.0, 5.5),
  zap = c(FALSE, TRUE)
)
# Select columns that end with the substring "z":
df$select(cs$ends_with("z"))
# Select columns that end with either the letter "z" or "r":
df$select(cs$ends_with("z", "r"))
# Select all columns except for those that end with the substring "z":
df$select(!cs$ends_with("z"))
Select all columns except those matching the given columns, datatypes, or selectors
cs__exclude(...)
... |
< |
When excluding a single selector, it is simpler to write !selector instead.
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  aa = 1:3,
  ba = c("a", "b", NA),
  cc = c(NA, 2.5, 1.5)
)
# Exclude by column name(s):
df$select(cs$exclude("ba", "xx"))
# Exclude using a column name, a selector, and a dtype:
df$select(cs$exclude("aa", cs$string(), pl$Int32))
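The remark about single selectors can be checked directly. The sketch below compares cs$exclude() applied to one selector with the ! operator on the same selector. It reuses the frame from the example above and only compares the names of the selected columns.

```r
library(polars)

df <- pl$DataFrame(
  aa = 1:3,
  ba = c("a", "b", NA),
  cc = c(NA, 2.5, 1.5)
)

# Two spellings of "everything except string columns"
via_exclude <- df$select(cs$exclude(cs$string()))$columns
via_not <- df$select(!cs$string())$columns

identical(via_exclude, via_not)
```

cs$exclude() is mainly useful when mixing names, dtypes, and selectors in one call, as in the last example above.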
Select the first column in the current scope
cs__first()
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123L, 456L),
  baz = c(2.0, 5.5),
  zap = c(FALSE, TRUE)
)
# Select the first column:
df$select(cs$first())
# Select everything except for the first column:
df$select(!cs$first())
Select all float columns.
cs__float()
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123L, 456L),
  baz = c(2.0, 5.5),
  zap = c(FALSE, TRUE),
  .schema_overrides = list(baz = pl$Float32, zap = pl$Float64),
)
# Select all float columns:
df$select(cs$float())
# Select all columns except for those that are float:
df$select(!cs$float())
Select all integer columns.
cs__integer()
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123L, 456L),
  baz = c(2.0, 5.5),
  zap = 0:1
)
# Select all integer columns:
df$select(cs$integer())
# Select all columns except for those that are integer:
df$select(!cs$integer())
Select the last column in the current scope
cs__last()
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123L, 456L),
  baz = c(2.0, 5.5),
  zap = c(FALSE, TRUE)
)
# Select the last column:
df$select(cs$last())
# Select everything except for the last column:
df$select(!cs$last())
Select all columns that match the given regex pattern
cs__matches(pattern)
pattern |
A valid regular expression pattern, compatible with the
|
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123, 456),
  baz = c(2.0, 5.5),
  zap = c(0, 1)
)
# Match column names containing an "a", preceded by a character that is not
# "z":
df$select(cs$matches("[^z]a"))
# Do not match column names ending in "R" or "z" (case-insensitively):
df$select(!cs$matches(r"((?i)R|z$)"))
Select all numeric columns.
cs__numeric()
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123L, 456L),
  baz = c(2.0, 5.5),
  zap = 0:1,
  .schema_overrides = list(bar = pl$Int16, baz = pl$Float32, zap = pl$UInt8),
)
# Select all numeric columns:
df$select(cs$numeric())
# Select all columns except for those that are numeric:
df$select(!cs$numeric())
Select all signed integer columns
cs__signed_integer()
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c(-123L, -456L),
  bar = c(3456L, 6789L),
  baz = c(7654L, 4321L),
  zap = c("ab", "cd"),
  .schema_overrides = list(bar = pl$UInt32, baz = pl$UInt64),
)
# Select signed integer columns:
df$select(cs$signed_integer())
# Select all columns except for those that are signed integer:
df$select(!cs$signed_integer())
# Select all integer columns (both signed and unsigned):
df$select(cs$integer())
Select columns that start with the given substring(s)
cs__starts_with(...)
... |
< |
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123, 456),
  baz = c(2.0, 5.5),
  zap = c(FALSE, TRUE)
)
# Select columns that start with the substring "b":
df$select(cs$starts_with("b"))
# Select columns that start with either the letter "b" or "z":
df$select(cs$starts_with("b", "z"))
# Select all columns except for those that start with the substring "b":
df$select(!cs$starts_with("b"))
Select all String (and, optionally, Categorical) string columns.
cs__string(..., include_categorical = FALSE)
... |
These dots are for future extensions and must be empty. |
include_categorical |
If |
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  w = c("xx", "yy", "xx", "yy", "xx"),
  x = c(1, 2, 1, 4, -2),
  y = c(3.0, 4.5, 1.0, 2.5, -2.0),
  z = c("a", "b", "a", "b", "b")
)$with_columns(
  z = pl$col("z")$cast(pl$Categorical())
)
# Group by all string columns, sum the numeric columns, then sort by the
# string cols:
df$group_by(cs$string())$agg(cs$numeric()$sum())$sort(cs$string())
# Group by all string and categorical columns:
df$
  group_by(cs$string(include_categorical = TRUE))$
  agg(cs$numeric()$sum())$
  sort(cs$string(include_categorical = TRUE))
Select all temporal columns
cs__temporal()
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  dtm = as.POSIXct(c("2001-5-7 10:25", "2031-12-31 00:30")),
  dt = as.Date(c("1999-12-31", "2024-8-9")),
  value = 1:2
)
# Match all temporal columns:
df$select(cs$temporal())
# Match all temporal columns except for datetime columns:
df$select(cs$temporal() - cs$datetime())
# Match all columns except for temporal columns:
df$select(!cs$temporal())
Select all time columns
cs__time()
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  dtm = as.POSIXct(c("2001-5-7 10:25", "2031-12-31 00:30")),
  dt = as.Date(c("1999-12-31", "2024-8-9")),
  tm = hms::parse_hms(c("0:0:0", "23:59:59"))
)
# Select time columns:
df$select(cs$time())
# Select all columns except for those that are time:
df$select(!cs$time())
Select all unsigned integer columns
cs__unsigned_integer()
A Polars selector
See cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c(-123L, -456L),
  bar = c(3456L, 6789L),
  baz = c(7654L, 4321L),
  zap = c("ab", "cd"),
  .schema_overrides = list(bar = pl$UInt32, baz = pl$UInt64),
)
# Select unsigned integer columns:
df$select(cs$unsigned_integer())
# Select all columns except for those that are unsigned integer:
df$select(!cs$unsigned_integer())
# Select all integer columns (both signed and unsigned):
df$select(cs$integer())
Cast DataFrame column(s) to the specified dtype
dataframe__cast(..., .strict = TRUE)
A polars DataFrame
df <- pl$DataFrame( foo = 1:3, bar = c(6, 7, 8), ham = as.Date(c("2020-01-02", "2020-03-04", "2020-05-06")) ) # Cast only some columns df$cast(foo = pl$Float32, bar = pl$UInt8) # Cast all columns to the same type df$cast(pl$String)
This is a cheap operation that does not copy data. Assigning does not copy the DataFrame (environment object). This is because environment objects have reference semantics. Calling $clone() creates a new environment, which can be useful when dealing with attributes (see examples).
dataframe__clone()
A polars DataFrame
df1 <- as_polars_df(iris) # Assigning does not copy the DataFrame (environment object), calling # $clone() creates a new environment. df2 <- df1 df3 <- df1$clone() rlang::env_label(df1) rlang::env_label(df2) rlang::env_label(df3) # Cloning can be useful to add attributes to data used in a function without # adding those attributes to the original object. # Make a function to take a DataFrame, add an attribute, and return a # DataFrame: give_attr <- function(data) { attr(data, "created_on") <- "2024-01-29" data } df2 <- give_attr(df1) # Problem: the original DataFrame also gets the attribute while it shouldn't attributes(df1) # Use $clone() inside the function to avoid that give_attr <- function(data) { data <- data$clone() attr(data, "created_on") <- "2024-01-29" data } df1 <- as_polars_df(iris) df2 <- give_attr(df1) # now, the original DataFrame doesn't get this attribute attributes(df1)
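The reference semantics described above come from base R environments and can be seen without polars. The following is a minimal sketch under that assumption; the names `original`, `alias`, `untouched`, and `copy` are illustrative, and the copy step is only a stand-in for what `$clone()` achieves, not its implementation.

```r
# Base R sketch (not polars code): environments are not copied on assignment.
original <- new.env()
alias <- original                        # a second name for the same environment
attr(alias, "created_on") <- "2024-01-29"
seen_via_original <- attr(original, "created_on")

# Hypothetical stand-in for $clone(): build a fresh environment holding the
# same bindings, then attach the attribute to the copy only.
untouched <- new.env()
untouched$x <- 1
copy <- list2env(as.list(untouched, all.names = TRUE), envir = new.env())
attr(copy, "created_on") <- "2024-01-29"
seen_via_untouched <- attr(untouched, "created_on")
```

Here `seen_via_original` holds the attribute even though it was set through `alias`, while `seen_via_untouched` stays `NULL` because the attribute was set on a separate environment.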
Drop columns of a DataFrame
dataframe__drop(..., strict = TRUE)
... |
< |
strict |
Validate that all column names exist in the schema and throw an exception if a column name does not exist in the schema. |
A polars DataFrame
as_polars_df(mtcars)$drop(c("mpg", "hp")) # equivalent as_polars_df(mtcars)$drop("mpg", "hp")
Check whether the DataFrame is equal to another DataFrame
dataframe__equals(other, ..., null_equal = TRUE)
other |
DataFrame to compare with. |
A logical value
dat1 <- as_polars_df(iris) dat2 <- as_polars_df(iris) dat3 <- as_polars_df(mtcars) dat1$equals(dat2) dat1$equals(dat3)
Filter rows of a DataFrame
dataframe__filter(...)
A polars DataFrame
df <- as_polars_df(iris) df$filter(pl$col("Sepal.Length") > 5) # This is equivalent to # df$filter(pl$col("Sepal.Length") > 5 & pl$col("Petal.Width") < 1) df$filter(pl$col("Sepal.Length") > 5, pl$col("Petal.Width") < 1) # rows where condition is NA are dropped iris2 <- iris iris2[c(1, 3, 5), "Species"] <- NA df <- as_polars_df(iris2) df$filter(pl$col("Species") == "setosa")
Get the DataFrame as a list of Series
dataframe__get_columns()
A list of Series
df <- pl$DataFrame(foo = c(1, 2, 3), bar = c(4, 5, 6)) df$get_columns() df <- pl$DataFrame( a = 1:4, b = c(0.5, 4, 10, 13), c = c(TRUE, TRUE, FALSE, TRUE) ) df$get_columns()
Group a DataFrame
dataframe__group_by(..., .maintain_order = FALSE)
Within each group, the order of the rows is always preserved, regardless of the .maintain_order argument.
GroupBy (a DataFrame with special groupby methods like $agg())
<DataFrame>$partition_by()
df <- pl$DataFrame(
  a = c("a", "b", "a", "b", "c"),
  b = c(1, 2, 1, 3, 3),
  c = c(5, 4, 3, 2, 1)
)

df$group_by("a")$agg(pl$col("b")$sum())

# Set `.maintain_order = TRUE` to ensure the order of the groups is
# consistent with the input.
df$group_by("a", .maintain_order = TRUE)$agg(pl$col("c"))

# Group by multiple columns by passing a character vector of column names.
df$group_by(c("a", "b"))$agg(pl$max("c"))

# Or pass some arguments to group by multiple columns in the same way.
# Expressions are also accepted.
df$group_by("a", pl$col("b") %/% 2)$agg(
  pl$col("c")$mean()
)

# The columns will be renamed to the argument names.
df$group_by(d = "a", e = pl$col("b") %/% 2)$agg(
  pl$col("c")$mean()
)
Start a new lazy query from a DataFrame.
dataframe__lazy()
A polars LazyFrame
pl$DataFrame(a = 1:2, b = c(NA, "a"))$lazy()
Get number of chunks used by the ChunkedArrays of this DataFrame
dataframe__n_chunks(strategy = c("first", "all"))
strategy |
Return the number of chunks of the |
An integer vector.
df <- pl$DataFrame( a = c(1, 2, 3, 4), b = c(0.5, 4, 10, 13), c = c(TRUE, TRUE, FALSE, TRUE) ) df$n_chunks() df$n_chunks(strategy = "all")
This will make sure all subsequent operations have optimal and predictable performance.
dataframe__rechunk()
A polars DataFrame
Select and perform operations on a subset of columns only. This discards unmentioned columns (like .() in data.table and unlike dplyr::mutate()).
One cannot use new variables in subsequent expressions in the same $select() call. For instance, if you create a variable x, you will only be able to use it in another $select() or $with_columns() call.
dataframe__select(...)
... |
< |
A polars DataFrame
as_polars_df(iris)$select( abs_SL = pl$col("Sepal.Length")$abs(), add_2_SL = pl$col("Sepal.Length") + 2 )
Get a slice of the DataFrame.
dataframe__slice(offset, length = NULL)
offset |
Start index, can be a negative value. This is 0-indexed, so
|
length |
Length of the slice. If |
A polars DataFrame
# skip the first 2 rows and take the 4 following rows as_polars_df(mtcars)$slice(2, 4) # this is equivalent to: mtcars[3:6, ]
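The 0-based offset can be mapped onto R's 1-based row indices. The base R sketch below uses a hypothetical helper, `slice_rows()`, which is not part of polars. It assumes that a negative offset counts back from the last row and that `length = NULL` means "up to the last row".

```r
# Hypothetical helper (not a polars function): emulate a 0-indexed
# slice(offset, length) on a base R data.frame.
slice_rows <- function(data, offset, length = NULL) {
  n <- nrow(data)
  # Assumption: a negative offset counts back from the last row.
  start0 <- if (offset < 0) max(n + offset, 0) else min(offset, n)
  # Assumption: length = NULL means "until the end".
  end0 <- if (is.null(length)) n else min(start0 + length, n)
  # Convert the 0-based half-open range [start0, end0) to 1-based indices.
  data[seq_len(end0 - start0) + start0, , drop = FALSE]
}

first_four_after_two <- slice_rows(mtcars, 2, 4)
last_three <- slice_rows(mtcars, -3)
```

With these assumptions, `slice_rows(mtcars, 2, 4)` selects the same rows as `mtcars[3:6, ]`.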
Sort a DataFrame
dataframe__sort( ..., descending = FALSE, nulls_last = FALSE, multithreaded = TRUE, maintain_order = FALSE )
A polars DataFrame
df <- mtcars df$mpg[1] <- NA df <- as_polars_df(df) df$sort("mpg") df$sort("mpg", nulls_last = TRUE) df$sort("cyl", "mpg") df$sort(c("cyl", "mpg")) df$sort(c("cyl", "mpg"), descending = TRUE) df$sort(c("cyl", "mpg"), descending = c(TRUE, FALSE)) df$sort(pl$col("cyl"), pl$col("mpg"))
Select column as Series at index location
dataframe__to_series(index = 0)
index |
Index of the column to return as Series. Defaults to 0, which is the first column. |
Series or NULL
df <- as_polars_df(iris[1:10, ]) # default is to extract the first column df$to_series() # Polars is 0-indexed, so we use index = 1 to extract the *2nd* column df$to_series(index = 1) # doesn't error if the column isn't there df$to_series(index = 8)
Convert a DataFrame to a Series of type Struct
dataframe__to_struct(name = "")
name |
A character. Name for the struct Series. |
A Series of the struct type
df <- pl$DataFrame( a = 1:5, b = c("one", "two", "three", "four", "five"), ) df$to_struct("nums")
Add columns or modify existing ones with expressions. This is similar to dplyr::mutate() as it keeps unmentioned columns (unlike $select()).
However, unlike dplyr::mutate(), one cannot use new variables in subsequent expressions in the same $with_columns() call. For instance, if you create a variable x, you will only be able to use it in another $with_columns() or $select() call.
dataframe__with_columns(...)
... |
< |
A polars DataFrame
as_polars_df(iris)$with_columns( abs_SL = pl$col("Sepal.Length")$abs(), add_2_SL = pl$col("Sepal.Length") + 2 ) # same query l_expr <- list( pl$col("Sepal.Length")$abs()$alias("abs_SL"), (pl$col("Sepal.Length") + 2)$alias("add_2_SL") ) as_polars_df(iris)$with_columns(l_expr) as_polars_df(iris)$with_columns( SW_add_2 = (pl$col("Sepal.Width") + 2), # unnamed expr will keep name "Sepal.Length" pl$col("Sepal.Length")$abs() )
Compute absolute values
expr__abs()
A polars expression
df <- pl$DataFrame(a = -1:2) df$with_columns(abs = pl$col("a")$abs())
Method equivalent of addition operator expr + other.
expr__add(other)
other |
Element to add. Can be a string (only if |
A polars expression
Arithmetic operators
df <- pl$DataFrame(x = 1:5) df$with_columns( `x+int` = pl$col("x")$add(2L), `x+expr` = pl$col("x")$add(pl$col("x")$cum_prod()) ) df <- pl$DataFrame( x = c("a", "d", "g"), y = c("b", "e", "h"), z = c("c", "f", "i") ) df$with_columns( pl$col("x")$add(pl$col("y"))$add(pl$col("z"))$alias("xyz") )
Should be used in aggregation context only.
expr__agg_groups()
A polars expression
df <- pl$DataFrame(
  group = rep(c("one", "two"), each = 3),
  value = c(94, 95, 96, 97, 97, 99)
)

df$group_by("group", .maintain_order = TRUE)$agg(pl$col("value")$agg_groups())
Rename the expression
expr__alias(name)
name |
The new name. |
A polars expression
# Rename an expression to avoid overwriting an existing column df <- pl$DataFrame(a = 1:3, b = c("x", "y", "z")) df$with_columns( pl$col("a") + 10, pl$col("b")$str$to_uppercase()$alias("c") ) # Overwrite the default name of literal columns to prevent errors due to # duplicate column names. df$with_columns( pl$lit(TRUE)$alias("c"), pl$lit(4)$alias("d") )
This method is an expression, not to be confused with pl$all(), which is a function to select all columns.
expr__all(..., ignore_nulls = TRUE)
... |
These dots are for future extensions and must be empty. |
ignore_nulls |
If |
A polars expression
df <- pl$DataFrame( a = c(TRUE, TRUE), b = c(TRUE, FALSE), c = c(NA, TRUE), d = c(NA, NA) ) # By default, ignore null values. If there are only nulls, then all() returns # TRUE. df$select(pl$col("*")$all()) # If we set ignore_nulls = FALSE, then we don't know if all values in column # "c" are TRUE, so it returns null df$select(pl$col("*")$all(ignore_nulls = FALSE))
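The null handling shown in the example mirrors how base R's `all()` treats `NA`. The sketch below is an analogy in base R, not polars code. It assumes that `ignore_nulls = TRUE` corresponds to `na.rm = TRUE`, and that a polars null corresponds to `NA`.

```r
# Base R analogy (not polars code): NA plays the role of a polars null.
cols <- list(
  a = c(TRUE, TRUE),
  b = c(TRUE, FALSE),
  c = c(NA, TRUE),
  d = c(NA, NA)
)

# Like ignore_nulls = TRUE: nulls are dropped before the check, so a column
# holding only nulls yields TRUE.
ignoring_nulls <- vapply(cols, all, logical(1), na.rm = TRUE)

# Like ignore_nulls = FALSE: a null leaves the answer unknown unless a
# FALSE value already decides it.
keeping_nulls <- vapply(cols, all, logical(1))
```

Column `b` is `FALSE` in both cases because a single `FALSE` settles the result regardless of any nulls.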
Combine two boolean expressions with AND.
expr__and(other)
other |
Element to add. Can be a string (only if |
A polars expression
pl$lit(TRUE) & TRUE
pl$lit(TRUE)$and(pl$lit(TRUE))
Check if any boolean value in a column is true
expr__any(..., ignore_nulls = TRUE)
... |
These dots are for future extensions and must be empty. |
ignore_nulls |
If |
A polars expression
df <- pl$DataFrame(
  a = c(TRUE, FALSE),
  b = c(FALSE, FALSE),
  c = c(NA, FALSE)
)

df$select(pl$col("*")$any())

# If we set ignore_nulls = FALSE, then we don't know if any value in column
# "c" is TRUE, so it returns null
df$select(pl$col("*")$any(ignore_nulls = FALSE))
Append expressions
expr__append(other, ..., upcast = TRUE)
other |
Expression to append. |
... |
These dots are for future extensions and must be empty. |
upcast |
If |
A polars expression
df <- pl$DataFrame(a = 8:10, b = c(NA, 4, 4)) df$select(pl$all()$head(1)$append(pl$all()$tail(1)))
This is done using the HyperLogLog++ algorithm for cardinality estimation.
expr__approx_n_unique()
A polars expression
df <- pl$DataFrame(n = c(1, 1, 2)) df$select(pl$col("n")$approx_n_unique()) df <- pl$DataFrame(n = 0:1000) df$select( exact = pl$col("n")$n_unique(), approx = pl$col("n")$approx_n_unique() )
Compute inverse cosine
expr__arccos()
A polars expression
pl$DataFrame(a = c(-1, cos(0.5), 0, 1, NA))$ with_columns(arccos = pl$col("a")$arccos())
Compute inverse hyperbolic cosine
expr__arccosh()
A polars expression
pl$DataFrame(a = c(-1, cosh(0.5), 0, 1, NA))$ with_columns(arccosh = pl$col("a")$arccosh())
Compute inverse sine
expr__arcsin()
A polars expression
pl$DataFrame(a = c(-1, sin(0.5), 0, 1, NA))$ with_columns(arcsin = pl$col("a")$arcsin())
Compute inverse hyperbolic sine
expr__arcsinh()
A polars expression
pl$DataFrame(a = c(-1, sinh(0.5), 0, 1, NA))$ with_columns(arcsinh = pl$col("a")$arcsinh())
Compute inverse tangent
expr__arctan()
A polars expression
pl$DataFrame(a = c(-1, tan(0.5), 0, 1, NA_real_))$ with_columns(arctan = pl$col("a")$arctan())
Compute inverse hyperbolic tangent
expr__arctanh()
A polars expression
pl$DataFrame(a = c(-1, tanh(0.5), 0, 1, NA))$ with_columns(arctanh = pl$col("a")$arctanh())
Get the index of the maximal value
expr__arg_max()
A polars expression
df <- pl$DataFrame(a = c(20, 10, 30)) df$select(pl$col("a")$arg_max())
Get the index of the minimal value
expr__arg_min()
A polars expression
df <- pl$DataFrame(a = c(20, 10, 30)) df$select(pl$col("a")$arg_min())
Get the index values that would sort this column.
expr__arg_sort(..., descending = FALSE, nulls_last = FALSE)
... |
These dots are for future extensions and must be empty. |
descending |
Sort in descending order. |
nulls_last |
Place null values last. |
A polars expression
pl$arg_sort_by() to find the row indices that would sort multiple columns.
pl$DataFrame( a = c(6, 1, 0, NA, Inf, NaN) )$with_columns(arg_sorted = pl$col("a")$arg_sort())
Get the index of the first unique value
expr__arg_unique()
A polars expression
df <- pl$DataFrame(a = 1:3, b = c(NA, 4, 4)) df$select(pl$col("a")$arg_unique()) df$select(pl$col("b")$arg_unique())
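The idea of "index of the first unique value" can be illustrated in base R. The sketch below is an analogy rather than polars code. The helper name `first_unique_index()` is hypothetical, and the result is shifted by one because polars indices are 0-based.

```r
# Base R analogy (not polars code): 0-based positions of first occurrences.
# NA is treated as a value of its own, like a polars null.
first_unique_index <- function(x) which(!duplicated(x)) - 1L

idx_a <- first_unique_index(1:3)
idx_b <- first_unique_index(c(NA, 4, 4))
```

For `c(NA, 4, 4)` only positions 0 and 1 are kept, since the second `4` repeats an earlier value.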
Fill missing values with the next non-null value
expr__backward_fill(limit = NULL)
limit |
The number of consecutive null values to backward fill. |
A polars expression
df <- pl$DataFrame( a = c(1, 2, NA), b = c(4, NA, 6), c = c(NA, NA, 2) ) df$select(pl$all()$backward_fill()) df$select(pl$all()$backward_fill(limit = 1))
k smallest elements
Non-null elements are always preferred over null elements. The output is not
guaranteed to be in any particular order; call $sort() after
this function if you wish the output to be sorted. This has time complexity
.
expr__bottom_k(k = 5)
k |
Number of elements to return. |
A polars expression
df <- pl$DataFrame(value = c(1, 98, 2, 3, 99, 4)) df$select( top_k = pl$col("value")$top_k(k = 3), bottom_k = pl$col("value")$bottom_k(k = 3) )
k smallest elements of the by column(s)
Non-null elements are always preferred over null elements. The output is not
guaranteed to be in any particular order; call $sort() after
this function if you wish the output to be sorted. This has time complexity
.
expr__bottom_k_by(by, k = 5, ..., reverse = FALSE)
by |
Column(s) used to determine the smallest elements. Accepts expression input. Strings are parsed as column names. |
k |
Number of elements to return. |
A polars expression
df <- pl$DataFrame(
  a = 1:6,
  b = 6:1,
  c = c("Apple", "Orange", "Apple", "Apple", "Banana", "Banana")
)

# Get the bottom 2 rows by column a or b:
df$select(
  pl$all()$bottom_k_by("a", 2)$name$suffix("_btm_by_a"),
  pl$all()$bottom_k_by("b", 2)$name$suffix("_btm_by_b")
)

# Get the bottom 2 rows by multiple columns with given order.
df$select(
  pl$all()$
    bottom_k_by(c("c", "a"), 2, reverse = c(FALSE, TRUE))$
    name$suffix("_btm_by_ca"),
  pl$all()$
    bottom_k_by(c("c", "b"), 2, reverse = c(FALSE, TRUE))$
    name$suffix("_btm_by_cb")
)

# Get the bottom 2 rows by column a in each group
df$group_by("c", .maintain_order = TRUE)$agg(
  pl$all()$bottom_k_by("a", 2)
)$explode(pl$all()$exclude("c"))
Cast between DataType
expr__cast(dtype, ..., strict = TRUE, wrap_numerical = FALSE)
dtype |
DataType to cast to. |
... |
These dots are for future extensions and must be empty. |
strict |
If |
wrap_numerical |
If |
A polars expression
df <- pl$DataFrame(a = 1:3, b = c(1, 2, 3)) df$with_columns( pl$col("a")$cast(pl$dtypes$Float64), pl$col("b")$cast(pl$dtypes$Int32) ) # strict FALSE, inserts null for any cast failure pl$lit(c(100, 200, 300))$cast(pl$dtypes$UInt8, strict = FALSE)$to_series() # strict TRUE, raise any failure as an error when query is executed. tryCatch( { pl$lit("a")$cast(pl$dtypes$Float64, strict = TRUE)$to_series() }, error = function(e) e )
Compute cube root
expr__cbrt()
A polars expression
pl$DataFrame(a = c(1, 2, 4))$ with_columns(cbrt = pl$col("a")$cbrt())
This only works on floating point Series.
expr__ceil()
A polars expression
df <- pl$DataFrame(a = c(0.3, 0.5, 1.0, 1.1)) df$with_columns( ceil = pl$col("a")$ceil() )
This method only works for numeric and temporal columns. To clip other data types, consider writing a when-then-otherwise expression.
expr__clip(lower_bound = NULL, upper_bound = NULL)
lower_bound |
Lower bound. Accepts expression input. Non-expression inputs are parsed as literals. |
upper_bound |
Upper bound. Accepts expression input. Non-expression inputs are parsed as literals. |
A polars expression
df <- pl$DataFrame(a = c(-50, 5, 50, NA)) # Specifying both a lower and upper bound: df$with_columns( clip = pl$col("a")$clip(1, 10) ) # Specifying only a single bound: df$with_columns( clip = pl$col("a")$clip(upper_bound = 10) )
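Clipping amounts to an element-wise maximum with the lower bound followed by an element-wise minimum with the upper bound. The base R sketch below is an analogy, not polars code. The helper `clip_values()` is hypothetical, and it assumes that a `NULL` bound is skipped and that missing values stay missing.

```r
# Base R analogy (not polars code): clamp values to [lower, upper].
# A bound left as NULL is skipped; NA (null) inputs stay NA.
clip_values <- function(x, lower = NULL, upper = NULL) {
  if (!is.null(lower)) x <- pmax(x, lower)
  if (!is.null(upper)) x <- pmin(x, upper)
  x
}

a <- c(-50, 5, 50, NA)
both_bounds <- clip_values(a, 1, 10)
upper_only <- clip_values(a, upper = 10)
```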
Compute cosine
expr__cos()
A polars expression
pl$DataFrame(a = c(0, pi / 2, pi, NA))$ with_columns(cosine = pl$col("a")$cos())
Compute hyperbolic cosine
expr__cosh()
A polars expression
pl$DataFrame(a = c(-1, acosh(2), 0, 1, NA))$ with_columns(cosh = pl$col("a")$cosh())
Compute cotangent
expr__cot()
A polars expression
pl$DataFrame(a = c(0, pi / 2, -5, NA))$ with_columns(cotangent = pl$col("a")$cot())
Get the number of non-null elements in the column
expr__count()
A polars expression
df <- pl$DataFrame(a = 1:3, b = c(NA, 4, 4)) df$select(pl$all()$count())
Return the cumulative count of the non-null values in the column
expr__cum_count(..., reverse = FALSE)
... |
These dots are for future extensions and must be empty. |
reverse |
If |
A polars expression
pl$DataFrame(a = 1:4)$with_columns( cum_count = pl$col("a")$cum_count(), cum_count_reversed = pl$col("a")$cum_count(reverse = TRUE) )
Return the cumulative max computed at every element.
expr__cum_max(..., reverse = FALSE)
... |
These dots are for future extensions and must be empty. |
reverse |
If |
The Dtypes Int8, UInt8, Int16 and UInt16 are cast to Int64 before summing to prevent overflow issues.
A polars expression
pl$DataFrame(a = c(1:4, 2L))$with_columns( cum_max = pl$col("a")$cum_max(), cum_max_reversed = pl$col("a")$cum_max(reverse = TRUE) )
Return the cumulative min computed at every element.
expr__cum_min(..., reverse = FALSE)
... |
These dots are for future extensions and must be empty. |
reverse |
If |
The Dtypes Int8, UInt8, Int16 and UInt16 are cast to Int64 before summing to prevent overflow issues.
A polars expression
pl$DataFrame(a = c(1:4, 2L))$with_columns( cum_min = pl$col("a")$cum_min(), cum_min_reversed = pl$col("a")$cum_min(reverse = TRUE) )
Return the cumulative product computed at every element.
expr__cum_prod(..., reverse = FALSE)
... |
These dots are for future extensions and must be empty. |
reverse |
If |
The Dtypes Int8, UInt8, Int16 and UInt16 are cast to Int64 before summing to prevent overflow issues.
A polars expression
pl$DataFrame(a = 1:4)$with_columns( cum_prod = pl$col("a")$cum_prod(), cum_prod_reversed = pl$col("a")$cum_prod(reverse = TRUE) )
Return the cumulative sum computed at every element.
expr__cum_sum(..., reverse = FALSE)
... |
These dots are for future extensions and must be empty. |
reverse |
If |
The Dtypes Int8, UInt8, Int16 and UInt16 are cast to Int64 before summing to prevent overflow issues.
A polars expression
pl$DataFrame(a = 1:4)$with_columns( cum_sum = pl$col("a")$cum_sum(), cum_sum_reversed = pl$col("a")$cum_sum(reverse = TRUE) )
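The effect of `reverse = TRUE` can be pictured in base R as accumulating from the last element towards the first. The sketch below is an analogy using `cumsum()` and `rev()`, not polars code.

```r
# Base R analogy (not polars code): a reversed cumulative sum accumulates
# from the last element towards the first.
a <- 1:4
cum_sum <- cumsum(a)
cum_sum_reversed <- rev(cumsum(rev(a)))
```

The same `rev()` wrapping applies to `cummax()`, `cummin()`, and `cumprod()` when picturing the other cumulative methods.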
Run an expression over a sliding window that grows by one slot at every iteration
expr__cumulative_eval(expr, ..., min_periods = 1, parallel = FALSE)
expr |
Expression to evaluate. |
... |
These dots are for future extensions and must be empty. |
min_periods |
Number of valid values (i.e. |
parallel |
Run in parallel. Don’t do this in a group by or another operation that already has much parallelization. |
This can be really slow as it can have O(n^2) complexity. Don’t use this for operations that visit all elements.
A polars expression
df <- pl$DataFrame(values = 1:5) df$with_columns( pl$col("values")$cumulative_eval( pl$element()$first() - pl$element()$last()**2 ) )
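The quadratic cost mentioned above follows from evaluating the expression on every growing prefix of the column. The base R sketch below illustrates that idea. It is not polars code, the helper `cumulative_eval_sketch()` is hypothetical, and `min_periods` and `parallel` are left out.

```r
# Base R sketch (not polars code): evaluate `f` on the growing prefixes
# x[1], x[1:2], ..., x[1:n]. Each step revisits the whole prefix, which is
# where the quadratic cost comes from.
cumulative_eval_sketch <- function(x, f) {
  vapply(seq_along(x), function(i) f(x[seq_len(i)]), numeric(1))
}

values <- 1:5
# Same computation as first() - last()^2, applied to each prefix.
result <- cumulative_eval_sketch(values, function(w) w[1] - w[length(w)]^2)
```

A function that only looks at the first and last element, as here, is cheap per step. One that scans the whole prefix pays the full quadratic cost.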
expr__cut( breaks, ..., labels = NULL, left_closed = FALSE, include_breaks = FALSE )
breaks |
List of unique cut points. |
... |
These dots are for future extensions and must be empty. |
labels |
Names of the categories. The number of labels must be equal to the number of cut points plus one. |
left_closed |
Set the intervals to be left-closed instead of right-closed. |
include_breaks |
Include a column with the right endpoint of the bin each observation falls in. This will change the data type of the output from a Categorical to a Struct. |
A polars expression
# Divide a column into three categories. df <- pl$DataFrame(foo = -2:2) df$with_columns( cut = pl$col("foo")$cut(c(-1, 1), labels = c("a", "b", "c")) ) # Add both the category and the breakpoint. df$with_columns( cut = pl$col("foo")$cut(c(-1, 1), include_breaks = TRUE) )$unnest()
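The relationship between cut points, labels, and interval closure can be compared with base R's `cut()`. The sketch below is an analogy, not polars code. It assumes the outer bins extend to `-Inf` and `Inf`, which is why two cut points need three labels.

```r
# Base R analogy (not polars code): two cut points define three bins.
foo <- -2:2

# Right-closed intervals (-Inf, -1], (-1, 1], (1, Inf], like the default.
right_closed <- cut(
  foo,
  breaks = c(-Inf, -1, 1, Inf), labels = c("a", "b", "c")
)

# Left-closed intervals [-Inf, -1), [-1, 1), [1, Inf), like left_closed = TRUE.
left_closed <- cut(
  foo,
  breaks = c(-Inf, -1, 1, Inf), labels = c("a", "b", "c"), right = FALSE
)
```

The values that sit exactly on a cut point (`-1` and `1`) are the only ones whose category changes between the two variants.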
Convert from radians to degrees
expr__degrees()
A polars expression
pl$DataFrame(a = c(1, 2, 4) * pi)$ with_columns(degrees = pl$col("a")$degrees())
Calculate the n-th discrete difference between elements
expr__diff(n = 1, null_behavior = c("ignore", "drop"))
n |
Integer indicating the number of slots to shift. |
null_behavior |
How to handle null values. Must be |
A polars expression
pl$DataFrame(a = c(20, 10, 30, 25, 35))$with_columns( diff_default = pl$col("a")$diff(), diff_2_ignore = pl$col("a")$diff(2, "ignore") )
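Since `n` is the number of slots to shift, the result corresponds to base R's `diff()` with `lag = n` rather than to a repeated difference. The sketch below is a base R analogy, not polars code. The helper `diff_sketch()` is hypothetical and assumes that `"ignore"` keeps the first `n` positions as missing while `"drop"` removes them.

```r
# Base R analogy (not polars code): difference against the value n slots back.
diff_sketch <- function(x, n = 1, null_behavior = c("ignore", "drop")) {
  null_behavior <- match.arg(null_behavior)
  d <- diff(x, lag = n)
  # Assumption: "ignore" keeps the n leading positions as NA (null),
  # while "drop" removes them.
  if (null_behavior == "ignore") c(rep(NA_real_, n), d) else d
}

a <- c(20, 10, 30, 25, 35)
diff_default <- diff_sketch(a)
diff_2_ignore <- diff_sketch(a, 2, "ignore")
diff_2_drop <- diff_sketch(a, 2, "drop")
```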
Compute the dot/inner product between two Expressions
expr__dot(expr)
expr |
Expression to compute dot product with. |
A polars expression
df <- pl$DataFrame(a = c(1, 3, 5), b = c(2, 4, 6))
df$select(pl$col("a")$dot(pl$col("b")))
The original order of the remaining elements is preserved. A NaN value is not the same as a null value. To drop null values, use $drop_nulls().
expr__drop_nans()
A polars expression
df <- pl$DataFrame(a = c(1, NA, 3, NaN))
df$select(pl$col("a")$drop_nans())
The original order of the remaining elements is preserved. A null value is not the same as a NaN value. To drop NaN values, use $drop_nans().
expr__drop_nulls()
A polars expression
df <- pl$DataFrame(a = c(1, NA, 3, NaN))
df$select(pl$col("a")$drop_nulls())
Uses the formula -sum(pk * log(pk)) where pk are discrete probabilities.
expr__entropy(base = exp(1), ..., normalize = TRUE)
base |
Numeric value used as base, defaults to exp(1). |
... |
These dots are for future extensions and must be empty. |
normalize |
Normalize |
A polars expression
df <- pl$DataFrame(a = 1:3)
df$select(pl$col("a")$entropy(base = 2))
df$select(pl$col("a")$entropy(base = 2, normalize = FALSE))
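The formula -sum(pk * log(pk)) can be written out in base R as follows. This is only a sketch: entropy_sketch is a hypothetical helper, and it assumes that normalize = TRUE rescales the input so that it sums to 1 before the formula is applied.

```r
# Hypothetical helper mirroring the formula -sum(pk * log(pk)).
# Assumption: normalize = TRUE rescales x so that it sums to 1.
entropy_sketch <- function(x, base = exp(1), normalize = TRUE) {
  pk <- if (normalize) x / sum(x) else x
  -sum(pk * log(pk, base = base))
}

entropy_sketch(1:3, base = 2)          # probabilities 1/6, 2/6, 3/6
entropy_sketch(c(0.5, 0.5), base = 2)  # a fair coin carries 1 bit
```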
This propagates null values, i.e. any comparison involving null will return null. Use $eq_missing() to consider null values as equal.
expr__eq(other)
other |
A literal or expression value to compare with. |
A polars expression
df <- pl$DataFrame(x = c(NA, FALSE, TRUE), y = c(TRUE, TRUE, TRUE))
df$with_columns(
  eq = pl$col("x")$eq(pl$col("y")),
  eq_missing = pl$col("x")$eq_missing(pl$col("y"))
)
Check equality without null propagation. This considers that null values are equal. It differs from $eq() where null values are propagated.
expr__eq_missing(other)
other |
A literal or expression value to compare with. |
A polars expression
df <- pl$DataFrame(x = c(NA, FALSE, TRUE), y = c(TRUE, TRUE, TRUE))
df$with_columns(
  eq = pl$col("x")$eq("y"),
  eq_missing = pl$col("x")$eq_missing("y")
)
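The difference between a propagating and a non-propagating comparison can be mimicked in base R, with NA standing in for a polars null. This is an analogy only: eq_missing_sketch is a hypothetical helper, not part of the package.

```r
x <- c(NA, FALSE, TRUE)
y <- c(TRUE, TRUE, TRUE)

# Propagating comparison: any missing operand yields a missing result.
eq <- x == y

# Hypothetical missing-aware comparison: two missing values count as equal,
# and a missing value never equals a non-missing one.
eq_missing_sketch <- function(x, y) {
  both_na <- is.na(x) & is.na(y)
  both_ok <- !is.na(x) & !is.na(y)
  both_na | (both_ok & x == y)
}

eq                       # NA FALSE TRUE
eq_missing_sketch(x, y)  # FALSE FALSE TRUE
```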
Compute exponentially-weighted moving mean
expr__ewm_mean( ..., com, span, half_life, alpha, adjust = TRUE, min_periods = 1, ignore_nulls = FALSE )
... |
These dots are for future extensions and must be empty. |
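com |
Specify decay in terms of center of mass, \(\gamma\), with \(\alpha = \frac{1}{1 + \gamma}\) for all \(\gamma \geq 0\). |
span |
Specify decay in terms of span, \(\theta\), with \(\alpha = \frac{2}{\theta + 1}\) for all \(\theta \geq 1\). |
half_life |
Specify decay in terms of half-life, \(\tau\), with \(\alpha = 1 - \exp\left(\frac{-\ln(2)}{\tau}\right)\) for all \(\tau > 0\). |
alpha |
Specify smoothing factor alpha directly, \(0 < \alpha \leq 1\). |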
adjust |
Divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings:
|
min_periods |
The number of values in the window that should be
non-null before computing a result. If |
ignore_nulls |
Ignore missing values when calculating weights.
|
A polars expression
df <- pl$DataFrame(a = 1:3)
df$select(pl$col("a")$ewm_mean(com = 1, ignore_nulls = FALSE))
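To make the role of com and adjust concrete, here is a base R sketch for a vector without missing values. ewm_mean_sketch is a hypothetical helper. It assumes the conventional definitions: alpha = 1 / (1 + com), weights (1 - alpha)^i when adjust = TRUE, and the recursion y_t = (1 - alpha) * y_(t-1) + alpha * x_t when adjust = FALSE. It is not the polars implementation.

```r
# Hypothetical helper: exponentially weighted mean for a complete vector,
# with decay given as center of mass (assumed: alpha = 1 / (1 + com)).
ewm_mean_sketch <- function(x, com, adjust = TRUE) {
  alpha <- 1 / (1 + com)
  out <- numeric(length(x))
  for (t in seq_along(x)) {
    if (adjust) {
      # Weight (1 - alpha)^i for the i-th most recent observation.
      w <- (1 - alpha)^((t - 1):0)
      out[t] <- sum(w * x[1:t]) / sum(w)
    } else if (t == 1) {
      out[t] <- x[1]
    } else {
      out[t] <- (1 - alpha) * out[t - 1] + alpha * x[t]
    }
  }
  out
}

ewm_mean_sketch(1:3, com = 1)
ewm_mean_sketch(1:3, com = 1, adjust = FALSE)
```

With adjust = TRUE the early values are divided by the sum of the weights seen so far, which is what corrects the imbalance in the first periods.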
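Given observations \(x_0, x_1, \ldots, x_{n-1}\) at times \(t_0, t_1, \ldots, t_{n-1}\), the EWMA is calculated as
\[ y_0 = x_0, \qquad \alpha_i = 1 - \exp\left(-\frac{\ln(2)\,(t_i - t_{i-1})}{\tau}\right), \qquad y_i = \alpha_i x_i + (1 - \alpha_i)\, y_{i-1} \quad (i > 0), \]
where \(\tau\) is the half_life.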
expr__ewm_mean_by(by, ..., half_life)
by |
Times to calculate average by. Should be DateTime, Date, UInt64, UInt32, Int64, or Int32 data type. |
half_life |
Unit over which observation decays to half its value. Can be created either from a timedelta, or by using the following string language:
Or combine them: By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year". |
A polars expression
df <- pl$DataFrame(
  values = c(0, 1, 2, NA, 4),
  times = as.Date(
    c("2020-01-01", "2020-01-03", "2020-01-10", "2020-01-15", "2020-01-17")
  )
)
df$with_columns(
  result = pl$col("values")$ewm_mean_by("times", half_life = "4d")
)
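A time-aware EWMA of this kind can be sketched in base R with the recursion y_i = alpha_i * x_i + (1 - alpha_i) * y_(i-1), where alpha_i = 1 - exp(-log(2) * gap_i / half_life) and gap_i is the time elapsed since the previous observation. ewm_mean_by_sketch is a hypothetical helper, the half-life is given in days for simplicity, and the handling of missing values (skip them and measure the gap from the last observed value) is an assumption rather than documented behavior.

```r
# Hypothetical helper: time-aware EWMA where the smoothing factor depends on
# the gap (in days) since the previous non-missing observation.
ewm_mean_by_sketch <- function(x, times, half_life_days) {
  out <- rep(NA_real_, length(x))
  prev <- NA_integer_  # index of the previous non-missing observation
  for (i in seq_along(x)) {
    if (is.na(x[i])) next  # assumption: missing values are skipped
    if (is.na(prev)) {
      out[i] <- x[i]
    } else {
      gap <- as.numeric(difftime(times[i], times[prev], units = "days"))
      alpha <- 1 - exp(-log(2) * gap / half_life_days)
      out[i] <- alpha * x[i] + (1 - alpha) * out[prev]
    }
    prev <- i
  }
  out
}

values <- c(0, 1, 2, NA, 4)
times <- as.Date(
  c("2020-01-01", "2020-01-03", "2020-01-10", "2020-01-15", "2020-01-17")
)
res <- ewm_mean_by_sketch(values, times, half_life_days = 4)
res
```

A gap equal to the half-life gives alpha = 0.5, so the new observation and the running average are weighted equally.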
Compute exponentially-weighted moving standard deviation
expr__ewm_std( ..., com, span, half_life, alpha, adjust = TRUE, bias = FALSE, min_periods = 1, ignore_nulls = FALSE )
... |
These dots are for future extensions and must be empty. |
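com |
Specify decay in terms of center of mass, \(\gamma\), with \(\alpha = \frac{1}{1 + \gamma}\) for all \(\gamma \geq 0\). |
span |
Specify decay in terms of span, \(\theta\), with \(\alpha = \frac{2}{\theta + 1}\) for all \(\theta \geq 1\). |
half_life |
Specify decay in terms of half-life, \(\tau\), with \(\alpha = 1 - \exp\left(\frac{-\ln(2)}{\tau}\right)\) for all \(\tau > 0\). |
alpha |
Specify smoothing factor alpha directly, \(0 < \alpha \leq 1\). |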
adjust |
Divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings:
|
bias |
If |
min_periods |
The number of values in the window that should be
non-null before computing a result. If |
ignore_nulls |
Ignore missing values when calculating weights.
|
A polars expression
df <- pl$DataFrame(a = 1:3)
df$select(pl$col("a")$ewm_std(com = 1, ignore_nulls = FALSE))
Compute exponentially-weighted moving variance
expr__ewm_var( ..., com, span, half_life, alpha, adjust = TRUE, bias = FALSE, min_periods = 1, ignore_nulls = FALSE )
... |
These dots are for future extensions and must be empty. |
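com |
Specify decay in terms of center of mass, \(\gamma\), with \(\alpha = \frac{1}{1 + \gamma}\) for all \(\gamma \geq 0\). |
span |
Specify decay in terms of span, \(\theta\), with \(\alpha = \frac{2}{\theta + 1}\) for all \(\theta \geq 1\). |
half_life |
Specify decay in terms of half-life, \(\tau\), with \(\alpha = 1 - \exp\left(\frac{-\ln(2)}{\tau}\right)\) for all \(\tau > 0\). |
alpha |
Specify smoothing factor alpha directly, \(0 < \alpha \leq 1\). |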
adjust |
Divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings:
|
bias |
If |
min_periods |
The number of values in the window that should be
non-null before computing a result. If |
ignore_nulls |
Ignore missing values when calculating weights.
|
A polars expression
df <- pl$DataFrame(a = 1:3)
df$select(pl$col("a")$ewm_var(com = 1, ignore_nulls = FALSE))
Exclude columns from a multi-column expression.
expr__exclude(...)
... |
The name or datatype of the column(s) to exclude. Accepts regular
expression input. Regular expressions should start with |
A polars expression
df <- pl$DataFrame(aa = 1:2, ba = c("a", NA), cc = c(NA, 2.5))
df

# Exclude by column name(s):
df$select(pl$all()$exclude("ba"))

# Exclude by regex, e.g. removing all columns whose names end with the
# letter "a":
df$select(pl$all()$exclude("^.*a$"))

# Exclude by dtype(s), e.g. removing all columns of type Int64 or Float64:
df$select(pl$all()$exclude(pl$Int64, pl$Float64))
Compute the exponential
expr__exp()
A polars expression
pl$DataFrame(a = c(1, 2, 4))$
  with_columns(exp = pl$col("a")$exp())
This means that every item is expanded to a new row.
expr__explode()
A polars expression
df <- pl$DataFrame(
  groups = c("a", "b"),
  values = list(1:2, 3:4)
)
df$select(pl$col("values")$explode())
Extend the Series with n copies of a value
expr__extend_constant(value, n)
value |
A constant literal value or a unit expression with which to
extend the expression result Series. This can be |
n |
The number of additional values that will be added. |
A polars expression
df <- pl$DataFrame(values = 1:3)
df$select(pl$col("values")$extend_constant(99, n = 2))
Fill floating point NaN value with a fill value
expr__fill_nan(value)
value |
Value used to fill |
A polars expression
df <- pl$DataFrame(a = c(1, NA, 2, NaN))
df$with_columns(
  filled_nan = pl$col("a")$fill_nan(99)
)
Fill null values with a fill value or a fill strategy
expr__fill_null(value, strategy = NULL, limit = NULL)
value |
Value used to fill null values. Can be missing if |
strategy |
Strategy used to fill null values. Must be one of
|
limit |
Number of consecutive null values to fill when using the
|
A polars expression
df <- pl$DataFrame(a = c(1, NA, 2, NaN))
df$with_columns(
  filled_null_zero = pl$col("a")$fill_null(strategy = "zero"),
  filled_null_99 = pl$col("a")$fill_null(99),
  filled_null_forward = pl$col("a")$fill_null(strategy = "forward"),
  filled_null_expr = pl$col("a")$fill_null(pl$col("a")$median())
)
Elements where the filter does not evaluate to TRUE are discarded, including nulls. This is mostly useful in an aggregation context. If you want to filter on a DataFrame level, use DataFrame$filter() or LazyFrame$filter().
expr__filter(...)
... |
< |
A polars expression
df <- pl$DataFrame(
  group_col = c("g1", "g1", "g2"),
  b = c(1, 2, 3)
)
df

df$group_by("group_col")$agg(
  lt = pl$col("b")$filter(pl$col("b") < 2),
  gte = pl$col("b")$filter(pl$col("b") >= 2)
)
Get the first value
expr__first()
A polars expression
pl$DataFrame(x = 3:1)$with_columns(first = pl$col("x")$first())
This is an alias for $explode().
expr__flatten()
A polars expression
df <- pl$DataFrame(
  group = c("a", "b", "b"),
  values = list(1:2, 2:3, 4)
)
df$group_by("group")$agg(pl$col("values")$flatten())
This only works on floating point Series.
expr__floor()
A polars expression
df <- pl$DataFrame(a = c(0.3, 0.5, 1.0, 1.1))
df$with_columns(
  floor = pl$col("a")$floor()
)
Method equivalent of floor division operator expr %/% other. $floordiv() is an alias for $floor_div(), which exists for compatibility with Python Polars.
expr__floor_div(other)
expr__floordiv(other)
other |
Numeric literal or expression value. |
A polars expression
Arithmetic operators
df <- pl$DataFrame(x = 1:5)
df$with_columns(
  `x/2` = pl$col("x")$true_div(2),
  `x%/%2` = pl$col("x")$floor_div(2)
)
Fill missing values with the last non-null value
expr__forward_fill(limit = NULL)
limit |
The number of consecutive null values to forward fill. |
A polars expression
df <- pl$DataFrame(
  a = c(1, 2, NA),
  b = c(4, NA, 6),
  c = c(2, NA, NA)
)
df$select(pl$all()$forward_fill())
df$select(pl$all()$forward_fill(limit = 1))
Take values by index
expr__gather(indices)
indices |
An expression that leads to a UInt32 dtyped Series. |
A polars expression
df <- pl$DataFrame(
  group = c("one", "one", "one", "two", "two", "two"),
  value = c(1, 98, 2, 3, 99, 4)
)
df$group_by("group", maintain_order = TRUE)$agg(
  pl$col("value")$gather(c(2, 1))
)
Take every n-th value in the Series and return as a new Series
expr__gather_every(n, offset = 0)
n |
Gather every n-th row. |
offset |
Starting index. |
A polars expression
df <- pl$DataFrame(foo = 1:9)
df$select(pl$col("foo")$gather_every(3))
df$select(pl$col("foo")$gather_every(3, offset = 1))
Check greater or equal inequality
expr__ge(other)
other |
A literal or expression value to compare with. |
A polars expression
df <- pl$DataFrame(x = 1:3)
df$with_columns(
  with_ge = pl$col("x")$ge(pl$lit(2)),
  with_symbol = pl$col("x") >= pl$lit(2)
)
Return a single value by index
expr__get(index)
index |
An expression that leads to a UInt32 dtyped Series. |
A polars expression
df <- pl$DataFrame(
  group = c("one", "one", "one", "two", "two", "two"),
  value = c(1, 98, 2, 3, 99, 4)
)
df$group_by("group", maintain_order = TRUE)$agg(
  pl$col("value")$get(1)
)
Check strictly greater inequality
expr__gt(other)
other |
A literal or expression value to compare with. |
A polars expression
df <- pl$DataFrame(x = 1:3)
df$with_columns(
  with_gt = pl$col("x")$gt(pl$lit(2)),
  with_symbol = pl$col("x") > pl$lit(2)
)
Check whether the expression contains one or more null values
expr__has_nulls()
A polars expression
df <- pl$DataFrame(
  a = c(NA, 1, NA),
  b = c(10, NA, 300),
  c = c(350, 650, 850)
)
df$select(pl$all()$has_nulls())
Hash elements
expr__hash(seed = 0, seed_1 = NULL, seed_2 = NULL, seed_3 = NULL)
seed |
Integer, random seed parameter. Defaults to 0. |
seed_1, seed_2, seed_3 |
Integer, random seed parameters. Default to
|
This implementation of hash does not guarantee stable results across different Polars versions. Its stability is only guaranteed within a single version.
A polars expression
df <- pl$DataFrame(a = c(1, 2, NA), b = c("x", NA, "z"))
df$with_columns(pl$all()$hash(10, 20, 30, 40))
Get the first n elements
expr__head(n = 10)
n |
Number of elements to take. |
A polars expression
pl$DataFrame(x = 1:11)$select(pl$col("x")$head(3))
expr__hist( bins = NULL, ..., bin_count = NULL, include_category = FALSE, include_breakpoint = FALSE )
bins |
Discretizations to make. If |
... |
These dots are for future extensions and must be empty. |
bin_count |
If no bins provided, this will be used to determine the distance of the bins. |
include_category |
Include a column that shows the intervals as categories. |
include_breakpoint |
Include a column that indicates the upper breakpoint. |
A polars expression
df <- pl$DataFrame(a = c(1, 3, 8, 8, 2, 1, 3))
df$select(pl$col("a")$hist(bins = 1:3))
df$select(
  pl$col("a")$hist(
    bins = 1:3, include_category = TRUE, include_breakpoint = TRUE
  )
)
Aggregate values into a list
expr__implode()
A polars expression
df <- pl$DataFrame(a = 1:3, b = 4:6)
df$with_columns(pl$col("a")$implode())
Fill null values using interpolation
expr__interpolate(method = c("linear", "nearest"))
method |
Interpolation method. Must be one of "linear" (default) or "nearest". |
A polars expression
df <- pl$DataFrame(a = c(1, NA, 3), b = c(1, NaN, 3))
df$with_columns(
  a_interpolated = pl$col("a")$interpolate(),
  b_interpolated = pl$col("b")$interpolate()
)
Fill null values using interpolation based on another column
expr__interpolate_by(by)
by |
Column to interpolate values based on. |
A polars expression
df <- pl$DataFrame(a = c(1, NA, NA, 3), b = c(1, 2, 7, 8))
df$with_columns(
  a_interpolated = pl$col("a")$interpolate_by("b")
)
Check if an expression is between the given lower and upper bounds
expr__is_between( lower_bound, upper_bound, closed = c("both", "left", "right", "none") )
lower_bound |
Lower bound value. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals. |
upper_bound |
Upper bound value. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals. |
closed |
Define which sides of the interval are closed (inclusive). Must be one of "both" (default), "left", "right", or "none". |
If the value of the lower_bound is greater than that of the upper_bound then the result will be FALSE, as no value can satisfy the condition.
A polars expression
df <- pl$DataFrame(num = 1:5)
df$with_columns(
  is_between = pl$col("num")$is_between(2, 4)
)

# Use the closed argument to include or exclude the values at the bounds:
df$with_columns(
  is_between = pl$col("num")$is_between(2, 4, closed = "left")
)

# You can also use strings as well as numeric/temporal values (note: ensure
# that string literals are wrapped with lit so as not to conflate them with
# column names):
df <- pl$DataFrame(a = letters[1:5])
df$with_columns(
  is_between = pl$col("a")$is_between(pl$lit("a"), pl$lit("c"))
)

# Use column expressions as lower/upper bounds, comparing to a literal value:
df <- pl$DataFrame(a = 1:5, b = 5:1)
df$with_columns(
  between_ab = pl$lit(3)$is_between(pl$col("a"), pl$col("b"))
)
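The effect of the closed argument can be spelled out with plain comparisons. The base R sketch below uses a hypothetical helper, is_between_sketch, and only illustrates the interval logic for ordinary vectors. It is not how the package evaluates expressions.

```r
# Hypothetical helper: interval membership with a selectable closed side.
is_between_sketch <- function(x, lower, upper,
                              closed = c("both", "left", "right", "none")) {
  closed <- match.arg(closed)
  above <- if (closed %in% c("both", "left")) x >= lower else x > lower
  below <- if (closed %in% c("both", "right")) x <= upper else x < upper
  above & below
}

is_between_sketch(1:5, 2, 4)                   # FALSE TRUE TRUE TRUE FALSE
is_between_sketch(1:5, 2, 4, closed = "left")  # FALSE TRUE TRUE FALSE FALSE
is_between_sketch(1:5, 4, 2)                   # all FALSE: lower > upper
```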
Return a boolean mask indicating duplicated values
expr__is_duplicated()
A polars expression
df <- pl$DataFrame(a = c(1, 1, 2, 3, 2))
df$select(pl$col("a")$is_duplicated())
Check if elements are finite
expr__is_finite()
A polars expression
df <- pl$DataFrame(a = c(1, 2), b = c(3, Inf))
df$with_columns(
  a_finite = pl$col("a")$is_finite(),
  b_finite = pl$col("b")$is_finite()
)
Return a boolean mask indicating the first occurrence of each distinct value
expr__is_first_distinct()
A polars expression
df <- pl$DataFrame(a = c(1, 1, 2, 3, 2))
df$with_columns(
  is_first_distinct = pl$col("a")$is_first_distinct()
)
Check if elements of an expression are present in another expression
expr__is_in(other)
other |
Accepts expression input. Strings are parsed as column names. |
A polars expression
df <- pl$DataFrame(
  sets = list(1:3, 1:2, 9:10),
  optional_members = 1:3
)
df$with_columns(
  contains = pl$col("optional_members")$is_in("sets")
)
Check if elements are infinite
expr__is_infinite()
A polars expression
df <- pl$DataFrame(a = c(1, 2), b = c(3, Inf))
df$with_columns(
  a_infinite = pl$col("a")$is_infinite(),
  b_infinite = pl$col("b")$is_infinite()
)
Return a boolean mask indicating the last occurrence of each distinct value
expr__is_last_distinct()
A polars expression
df <- pl$DataFrame(a = c(1, 1, 2, 3, 2))
df$with_columns(
  is_last_distinct = pl$col("a")$is_last_distinct()
)
Floating point NaN (Not A Number) should not be confused with missing data represented as NA (in R) or null (in Polars).
expr__is_nan()
A polars expression
df <- pl$DataFrame(
  a = c(1, 2, NA, 1, 5),
  b = c(1, 2, NaN, 1, 5)
)
df$with_columns(
  a_nan = pl$col("a")$is_nan(),
  b_nan = pl$col("b")$is_nan()
)
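Base R draws a similar line between NaN and NA, which may help when preparing data before conversion. The snippet below uses base R predicates only. It is an analogy, and the polars methods may report missing values differently from these functions.

```r
a <- c(1, 2, NA, 1, 5)
b <- c(1, 2, NaN, 1, 5)

is.nan(a)  # all FALSE: NA is missing data, not NaN
is.nan(b)  # TRUE only at the third position
is.na(b)   # also TRUE at the third position: is.na() flags NaN as well

# Missing data only (NA but not NaN):
is.na(a) & !is.nan(a)
is.na(b) & !is.nan(b)
```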
Floating point NaN (Not A Number) should not be confused with missing data represented as NA (in R) or null (in Polars).
expr__is_not_nan()
A polars expression
df <- pl$DataFrame(
  a = c(1, 2, NA, 1, 5),
  b = c(1, 2, NaN, 1, 5)
)
df$with_columns(
  a_not_nan = pl$col("a")$is_not_nan(),
  b_not_nan = pl$col("b")$is_not_nan()
)
Check if elements are not null
expr__is_not_null()
A polars expression
df <- pl$DataFrame(
  a = c(1, 2, NA, 1, 5),
  b = c(1, 2, NaN, 1, 5)
)
df$with_columns(
  a_not_null = pl$col("a")$is_not_null(),
  b_not_null = pl$col("b")$is_not_null()
)
Check if elements are null
expr__is_null()
A polars expression
df <- pl$DataFrame(
  a = c(1, 2, NA, 1, 5),
  b = c(1, 2, NaN, 1, 5)
)
df$with_columns(
  a_null = pl$col("a")$is_null(),
  b_null = pl$col("b")$is_null()
)
Return a boolean mask indicating unique values
expr__is_unique()
A polars expression
df <- pl$DataFrame(a = c(1, 1, 2, 3, 2))
df$select(pl$col("a")$is_unique())
Kurtosis is the fourth central moment divided by the square of the variance. If Fisher's definition is used, then 3.0 is subtracted from the result to give 0.0 for a normal distribution. If bias is FALSE then the kurtosis is calculated using k statistics to eliminate bias coming from biased moment estimators.
expr__kurtosis(..., fisher = TRUE, bias = TRUE)
... |
These dots are for future extensions and must be empty. |
fisher |
If |
bias |
If |
A polars expression
df <- pl$DataFrame(x = c(1, 2, 3, 2, 1))
df$select(pl$col("x")$kurtosis())
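The definition above (fourth central moment divided by the squared variance, minus 3 under Fisher's definition) can be checked by hand in base R. kurtosis_sketch is a hypothetical helper that covers only the biased case (bias = TRUE); the k-statistics correction is not reproduced here.

```r
# Hypothetical helper: kurtosis from biased (population) central moments.
kurtosis_sketch <- function(x, fisher = TRUE) {
  d <- x - mean(x)
  m2 <- mean(d^2)  # biased variance
  m4 <- mean(d^4)  # fourth central moment
  k <- m4 / m2^2
  if (fisher) k - 3 else k
}

x <- c(1, 2, 3, 2, 1)
kurtosis_sketch(x)                  # Fisher's definition (normal gives 0)
kurtosis_sketch(x, fisher = FALSE)  # Pearson's definition (normal gives 3)
```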
Get the last value
expr__last()
A polars expression
pl$DataFrame(x = 3:1)$with_columns(last = pl$col("x")$last())
Check lower or equal inequality
expr__le(other)
other |
A literal or expression value to compare with. |
A polars expression
df <- pl$DataFrame(x = 1:3)
df$with_columns(
  with_le = pl$col("x")$le(pl$lit(2)),
  with_symbol = pl$col("x") <= pl$lit(2)
)
Null values are counted in the total.
expr__len()
A polars expression
df <- pl$DataFrame(a = 1:3, b = c(NA, 4, 4))
df$select(pl$all()$len())
This is an alias for $head().
expr__limit(n = 10)
n |
Number of rows to return. |
A polars expression
df <- pl$DataFrame(a = 1:9)
df$select(pl$col("a")$limit(3))
Compute the logarithm
expr__log(base = exp(1))
base |
Numeric value used as base, defaults to exp(1). |
A polars expression
pl$DataFrame(a = c(1, 2, 4))$
  with_columns(
    log = pl$col("a")$log(),
    log_base_2 = pl$col("a")$log(base = 2)
  )
Compute the base-10 logarithm
expr__log10()
A polars expression
pl$DataFrame(a = c(1, 2, 4))$
  with_columns(log10 = pl$col("a")$log10())
This computes log(1 + x) but is more numerically stable for x close to zero.
expr__log1p()
A polars expression
pl$DataFrame(a = c(1, 2, 4))$
  with_columns(log1p = pl$col("a")$log1p())
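The stability claim is easiest to see with a very small input. The base R comparison below illustrates the general floating point effect using R's own log() and log1p(), not the polars method itself.

```r
x <- 1e-20
log(1 + x)  # 0: the sum 1 + x rounds to exactly 1 in double precision
log1p(x)    # about 1e-20: computed without forming 1 + x first
```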
Returns a unit Series with the lowest value possible for the dtype of this expression.
expr__lower_bound()
A polars expression
df <- pl$DataFrame(a = 1:3)
df$select(pl$col("a")$lower_bound())
Check strictly lower inequality
expr__lt(other)
other |
A literal or expression value to compare with. |
A polars expression
df <- pl$DataFrame(x = 1:3)
df$with_columns(
  with_lt = pl$col("x")$lt(pl$lit(2)),
  with_symbol = pl$col("x") < pl$lit(2)
)
Get the maximum value
expr__max()
A polars expression
pl$DataFrame(x = c(1, NaN, 3))$
  with_columns(max = pl$col("x")$max())
Get mean value
expr__mean()
A polars expression
pl$DataFrame(x = c(1, 3, 4, NA))$
  with_columns(mean = pl$col("x")$mean())
Get median value
expr__median()
A polars expression
pl$DataFrame(x = c(1, 3, 4, NA))$
  with_columns(median = pl$col("x")$median())
Get the minimum value
expr__min()
A polars expression
pl$DataFrame(x = c(1, NaN, 3))$
  with_columns(min = pl$col("x")$min())
Method equivalent of modulus operator expr %% other.
expr__mod(other)
other |
Numeric literal or expression value. |
A polars expression
Arithmetic operators
df <- pl$DataFrame(x = -5L:5L)
df$with_columns(
  `x%%2` = pl$col("x")$mod(2)
)
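For reference, base R's own %/% and %% operators on plain vectors satisfy the identity x == (x %/% y) * y + x %% y, with the remainder taking the sign of the divisor. The snippet below shows base R only. Whether the polars methods follow the same convention for negative inputs is not stated here, so treat that as an assumption and compare with the output of the example above.

```r
x <- -5:5
q <- x %/% 2  # floor division rounds toward negative infinity
r <- x %% 2   # remainder is 0 or 1 for a positive divisor

q[1]                 # -3
r[1]                 # 1
all(q * 2 + r == x)  # TRUE
```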
Compute the most occurring value(s)
expr__mode()
A polars expression
df <- pl$DataFrame(a = c(1, 1, 2, 3), b = c(1, 1, 2, 2))
df$select(pl$col("a")$mode())
df$select(pl$col("b")$mode())
Method equivalent of multiplication operator expr * other.
expr__mul(other)
other |
Numeric literal or expression value. |
A polars expression
Arithmetic operators
df <- pl$DataFrame(x = c(1, 2, 4, 8, 16))
df$with_columns(
  `x*2` = pl$col("x")$mul(2),
  `x * xlog2` = pl$col("x")$mul(pl$col("x")$log(2))
)
null is considered to be a unique value for the purposes of this operation.
expr__n_unique()
A polars expression
df <- pl$DataFrame(
  x = c(1, 1, 2, 2, 3),
  y = c(1, 1, 1, NA, NA)
)
df$select(
  x_unique = pl$col("x")$n_unique(),
  y_unique = pl$col("y")$n_unique()
)
This returns NaN if there are any.
expr__nan_max()
A polars expression
pl$DataFrame(x = c(1, NA, 3, NaN, Inf))$
  with_columns(nan_max = pl$col("x")$nan_max())
This returns NaN if there are any.
expr__nan_min()
A polars expression
pl$DataFrame(x = c(1, NA, 3, NaN, Inf))$
  with_columns(nan_min = pl$col("x")$nan_min())
This propagates null values, i.e. any comparison involving null will return null. Use $ne_missing() to consider null values as equal.
expr__ne(other)
other |
A literal or expression value to compare with. |
A polars expression
df <- pl$DataFrame(x = c(NA, FALSE, TRUE), y = c(TRUE, TRUE, TRUE))
df$with_columns(
  ne = pl$col("x")$ne(pl$col("y")),
  ne_missing = pl$col("x")$ne_missing(pl$col("y"))
)
Check inequality without null propagation. This considers that null values are equal. It differs from $ne() where null values are propagated.
expr__ne_missing(other)
other |
A literal or expression value to compare with. |
A polars expression
df <- pl$DataFrame(x = c(NA, FALSE, TRUE), y = c(TRUE, TRUE, TRUE))
df$with_columns(
  ne = pl$col("x")$ne(pl$col("y")),
  ne_missing = pl$col("x")$ne_missing(pl$col("y"))
)
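The difference between $ne() and $ne_missing() can be illustrated with plain R vectors. This is only a base R sketch under the assumption that R's NA plays the role of a polars null; it does not call polars.

```r
x <- c(NA, FALSE, TRUE)
y <- c(TRUE, TRUE, TRUE)

# Like $ne(): a missing value on either side propagates
ne <- x != y

# Like $ne_missing(): exactly one side missing counts as "different",
# both sides missing counts as "equal"
ne_missing <- xor(is.na(x), is.na(y)) |
  (!is.na(x) & !is.na(y) & x != y)

print(data.frame(x, y, ne, ne_missing))
```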
Negate a boolean expression
expr__not()
A polars expression
df <- pl$DataFrame(a = c(TRUE, FALSE, FALSE, NA))
df$with_columns(a_not = pl$col("a")$not())

# Same result with "!"
df$with_columns(a_not = !pl$col("a"))
Count null values
expr__null_count()
A polars expression
df <- pl$DataFrame(
  a = c(NA, 1, NA),
  b = c(10, NA, 300),
  c = c(1, 2, 2)
)
df$select(pl$all()$null_count())
Combine two boolean expressions with OR.
expr__or(other)
other |
A boolean literal or expression to combine with. |
A polars expression
pl$lit(TRUE) | FALSE
pl$lit(TRUE)$or(pl$lit(TRUE))
This expression is similar to performing a group by aggregation and joining the result back into the original DataFrame. The outcome is similar to how window functions work in PostgreSQL.
expr__over( ..., order_by = NULL, mapping_strategy = c("group_to_rows", "join", "explode") )
... |
|
order_by |
Order the window functions/aggregations with the partitioned
groups by the result of the expression passed to |
mapping_strategy |
One of the following:
|
A polars expression
# Pass the name of a column to compute the expression over that column.
df <- pl$DataFrame(
  a = c("a", "a", "b", "b", "b"),
  b = c(1, 2, 3, 5, 3),
  c = c(5, 4, 2, 1, 3)
)
df$with_columns(
  pl$col("c")$max()$over("a")$name$suffix("_max")
)

# Expression input is supported.
df$with_columns(
  pl$col("c")$max()$over(pl$col("b") %/% 2)$name$suffix("_max")
)

# Group by multiple columns by passing several column names or a list of
# expressions.
df$with_columns(
  pl$col("c")$min()$over("a", "b")$name$suffix("_min")
)

group_vars <- list(pl$col("a"), pl$col("b"))
df$with_columns(
  pl$col("c")$min()$over(!!!group_vars)$name$suffix("_min")
)

# Or use positional arguments to group by multiple columns in the same way.
df$with_columns(
  pl$col("c")$min()$over("a", pl$col("b") %% 2)$name$suffix("_min")
)

# Alternative mapping strategy: join values in a list output
df$with_columns(
  top_2 = pl$col("c")$top_k(2)$over("a", mapping_strategy = "join")
)

# order_by specifies how values are sorted within a group, which is
# essential when the operation depends on the order of values
df <- pl$DataFrame(
  g = c(1, 1, 1, 1, 2, 2, 2, 2),
  t = c(1, 2, 3, 4, 4, 1, 2, 3),
  x = c(10, 20, 30, 40, 10, 20, 30, 40)
)

# without order_by, the first and second values in the second group would
# be inverted, which would be wrong
df$with_columns(
  x_lag = pl$col("x")$shift(1)$over("g", order_by = "t")
)
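For readers more familiar with base R, the "aggregate per group, then put the result back on every row" idea behind $over() is loosely comparable to stats::ave(). The following sketch uses a plain data.frame and is only an analogy, not a description of how polars evaluates window expressions.

```r
# Base R analogy for pl$col("c")$max()$over("a")
df <- data.frame(
  a = c("a", "a", "b", "b", "b"),
  c = c(5, 4, 2, 1, 3)
)

# ave() computes max() within each group of `a` and returns one value per row
df$c_max <- ave(df$c, df$a, FUN = max)

print(df)
```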
Computes the percentage change (as a fraction) between the current element and the most recent non-null element at least n period(s) before the current element. By default it computes the change from the previous row.
expr__pct_change(n = 1)
n |
Integer or Expr indicating the number of periods to shift for forming percent change. |
A polars expression
df <- pl$DataFrame(a = c(10:12, NA, 12))
df$with_columns(
  pct_change = pl$col("a")$pct_change()
)
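The computation can be sketched in base R as "forward-fill missing values, then divide by the value n rows earlier and subtract 1". The helper pct_change_sketch() is hypothetical, and reading "most recent non-null element" as a forward fill is an assumption of this sketch; it requires n >= 1.

```r
# Hypothetical helper; a base R sketch, not the polars implementation.
pct_change_sketch <- function(x, n = 1) {
  # Carry the last non-missing value forward
  filled <- x
  for (i in seq_along(filled)[-1]) {
    if (is.na(filled[i])) filled[i] <- filled[i - 1]
  }
  # Value n rows earlier (missing for the first n rows)
  prev <- c(rep(NA, n), head(filled, -n))
  filled / prev - 1
}

a <- c(10:12, NA, 12)
res <- pct_change_sketch(a)
print(res)
```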
Get a boolean mask of the local maximum peaks
expr__peak_max()
A polars expression
df <- pl$DataFrame(x = c(1, 2, 3, 2, 3, 4, 5, 2))
df$with_columns(peak_max = pl$col("x")$peak_max())
Get a boolean mask of the local minimum peaks
expr__peak_min()
A polars expression
df <- pl$DataFrame(x = c(1, 2, 3, 2, 3, 4, 5, 2))
df$with_columns(peak_min = pl$col("x")$peak_min())
Method equivalent of exponentiation operator expr ^ exponent.
expr__pow(exponent)
exponent |
Numeric literal or expression value. |
A polars expression
Arithmetic operators
df <- pl$DataFrame(x = c(1, 2, 4, 8))
df$with_columns(
  cube = pl$col("x")$pow(3),
  `x^xlog2` = pl$col("x")$pow(pl$col("x")$log(2))
)
Compute the product of an expression.
expr__product()
A polars expression
pl$DataFrame(a = 1:3, b = c(NA, 4, 4))$
  select(pl$all()$product())
expr__qcut( quantiles, ..., labels = NULL, left_closed = FALSE, allow_duplicates = FALSE, include_breaks = FALSE )
quantiles |
Either a vector of quantile probabilities between 0 and 1 or a positive integer determining the number of bins with uniform probability. |
... |
These dots are for future extensions and must be empty. |
labels |
Names of the categories. The number of labels must be equal to the number of categories. |
left_closed |
Set the intervals to be left-closed instead of right-closed. |
allow_duplicates |
If |
include_breaks |
Include a column with the right endpoint of the bin each observation falls in. This will change the data type of the output from a Categorical to a Struct. |
A polars expression
# Divide a column into three categories according to pre-defined quantile
# probabilities.
df <- pl$DataFrame(foo = -2:2)
df$with_columns(
  qcut = pl$col("foo")$qcut(c(0.25, 0.75), labels = c("a", "b", "c"))
)

# Divide a column into two categories using uniform quantile probabilities.
df$with_columns(
  qcut = pl$col("foo")$qcut(2, labels = c("low", "high"), left_closed = TRUE)
)

# Add both the category and the breakpoint.
df$with_columns(
  qcut = pl$col("foo")$qcut(c(0.25, 0.75), include_breaks = TRUE)
)$unnest()
Get quantile value(s)
expr__quantile( quantile, interpolation = c("nearest", "higher", "lower", "midpoint", "linear") )
quantile |
Quantile between 0.0 and 1.0. |
interpolation |
Interpolation method. Must be one of |
A polars expression
df <- pl$DataFrame(a = 0:5)
df$select(pl$col("a")$quantile(0.3))
df$select(pl$col("a")$quantile(0.3, interpolation = "higher"))
df$select(pl$col("a")$quantile(0.3, interpolation = "lower"))
df$select(pl$col("a")$quantile(0.3, interpolation = "midpoint"))
df$select(pl$col("a")$quantile(0.3, interpolation = "linear"))
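To see how the interpolation methods differ, the following base R sketch works through the same data by hand. It assumes the quantile sits at the fractional 0-based position quantile * (n - 1) in the sorted values; that position formula is an assumption of this sketch, and the "nearest" method is left out because its tie-breaking rule is not described here.

```r
# Hand computation in base R; not the polars implementation.
a <- 0:5
q <- 0.3

s <- sort(a)
h <- q * (length(s) - 1)   # assumed fractional 0-based position (1.5 here)
lo <- s[floor(h) + 1]      # neighbour below the position
hi <- s[ceiling(h) + 1]    # neighbour above the position

res <- c(
  lower = lo,
  higher = hi,
  midpoint = (lo + hi) / 2,
  linear = lo + (h - floor(h)) * (hi - lo)
)
print(res)
```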
Convert from degrees to radians
expr__radians()
A polars expression
pl$DataFrame(a = c(-720, -540, -360, -180, 0, 180, 360, 540, 720))$
  with_columns(radians = pl$col("a")$radians())
Assign ranks to data, dealing with ties appropriately
expr__rank( method = c("average", "min", "max", "dense", "ordinal", "random"), ..., descending = FALSE, seed = NULL )
method |
The method used to assign ranks to tied elements. Must be one of the following:
|
... |
These dots are for future extensions and must be empty. |
descending |
Rank in descending order. |
seed |
Integer. Only used if |
A polars expression
# Default is to use the "average" method to break ties
df <- pl$DataFrame(a = c(3, 6, 1, 1, 6))
df$with_columns(rank = pl$col("a")$rank())

# Ordinal method
df$with_columns(rank = pl$col("a")$rank("ordinal"))

# Use "rank" with "over" to rank within groups:
df <- pl$DataFrame(
  a = c(1, 1, 2, 2, 2),
  b = c(6, 7, 5, 14, 11)
)
df$with_columns(
  rank = pl$col("b")$rank()$over("a")
)
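The tie-handling methods have close counterparts in base R's rank(), which can help build intuition. The mapping below is an assumption of this sketch (in particular, "ordinal" is matched to ties.method = "first" and "dense" is emulated with match()); it is not taken from the polars implementation.

```r
# Base R counterparts of the tie-handling methods (assumed mapping).
a <- c(3, 6, 1, 1, 6)

rank_average <- rank(a, ties.method = "average")
rank_min <- rank(a, ties.method = "min")
rank_max <- rank(a, ties.method = "max")
rank_ordinal <- rank(a, ties.method = "first")   # ties broken by position
rank_dense <- match(a, sort(unique(a)))          # no gaps between ranks

print(rbind(rank_average, rank_min, rank_max, rank_ordinal, rank_dense))
```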
Create a single chunk of memory for this Series
expr__rechunk()
A polars expression
df <- pl$DataFrame(a = c(1, 1, 2))

# Create a Series with 3 nulls, append column a then rechunk
df$select(pl$repeat(NA, 3)$append(pl$col("a"))$rechunk())
This operation is only allowed for 64-bit integers. For integers with fewer bits, you can safely use the $cast() operation.
expr__reinterpret(..., signed = TRUE)
... |
These dots are for future extensions and must be empty. |
signed |
If |
A polars expression
df <- pl$DataFrame(a = c(1, 1, 2))$cast(pl$UInt64)

# Reinterpret the UInt64 column (signed = TRUE is the default)
df$with_columns(
  reinterpreted = pl$col("a")$reinterpret()
)
The repeated elements are expanded into a List dtype.
expr__repeat_by(by)
by |
Numeric column that determines how often the values will be repeated. The column will be coerced to UInt32. Give this dtype to make the coercion a no-op. Accepts expression input, strings are parsed as column names. |
A polars expression
df <- pl$DataFrame(a = c("x", "y", "z"), n = 1:3)
df$with_columns(
  repeated = pl$col("a")$repeat_by("n")
)
This allows one to recode values in a column, leaving all other values unchanged. See $replace_strict() to give a default value to all other values and to specify the output datatype.
expr__replace(old, new)
old |
Value or vector of values to replace. Accepts expression input.
Vectors are parsed as Series, other non-expression inputs are parsed as
literals. Also accepts a list of values like |
new |
Value or vector of values to replace by. Accepts expression
input. Vectors are parsed as Series, other non-expression inputs are parsed
as literals. Length must match the length of |
The global string cache must be enabled when replacing categorical values.
A polars expression
df <- pl$DataFrame(a = c(1, 2, 2, 3))

# "old" and "new" can take vectors of length 1 or of same length
df$with_columns(replaced = pl$col("a")$replace(2, 100))
df$with_columns(replaced = pl$col("a")$replace(c(2, 3), c(100, 200)))

# "old" can be a named list where names are values to replace, and values are
# the replacements
mapping <- list(`2` = 100, `3` = 200)
df$with_columns(replaced = pl$col("a")$replace(mapping))

# The original data type is preserved when replacing by values of a
# different data type. Use $replace_strict() to replace and change the
# return data type.
df <- pl$DataFrame(a = c("x", "y", "z"))
mapping <- list(x = 1, y = 2, z = 3)
df$with_columns(replaced = pl$col("a")$replace(mapping))

# "old" and "new" can take Expr
df <- pl$DataFrame(a = c(1, 2, 2, 3), b = c(1.5, 2.5, 5, 1))
df$with_columns(
  replaced = pl$col("a")$replace(
    old = pl$col("a")$max(),
    new = pl$col("b")$sum()
  )
)
This changes all the values in a column, either using a specific replacement or a default one. See $replace() to replace only a subset of values.
expr__replace_strict(old, new, ..., default = NULL, return_dtype = NULL)
old |
Value or vector of values to replace. Accepts expression input.
Vectors are parsed as Series, other non-expression inputs are parsed as
literals. Also accepts a list of values like |
new |
Value or vector of values to replace by. Accepts expression
input. Vectors are parsed as Series, other non-expression inputs are parsed
as literals. Length must match the length of |
... |
These dots are for future extensions and must be empty. |
default |
Set values that were not replaced to this value. If |
return_dtype |
The data type of the resulting expression. If |
The global string cache must be enabled when replacing categorical values.
A polars expression
df <- pl$DataFrame(a = c(1, 2, 2, 3))

# "old" and "new" can take vectors of length 1 or of same length
df$with_columns(replaced = pl$col("a")$replace_strict(2, 100, default = 1))
df$with_columns(
  replaced = pl$col("a")$replace_strict(c(2, 3), c(100, 200), default = 1)
)

# "old" can be a named list where names are values to replace, and values are
# the replacements
mapping <- list(`2` = 100, `3` = 200)
df$with_columns(replaced = pl$col("a")$replace_strict(mapping, default = -1))

# By default, an error is raised if any non-null values were not replaced.
# Specify a default to set all values that were not matched.
tryCatch(
  df$with_columns(replaced = pl$col("a")$replace_strict(mapping)),
  error = function(e) print(e)
)

# one can specify the data type to return instead of automatically
# inferring it
df$with_columns(
  replaced = pl$col("a")$replace_strict(
    mapping,
    default = 1,
    return_dtype = pl$Int32
  )
)

# "old", "new", and "default" can take Expr
df <- pl$DataFrame(a = c(1, 2, 2, 3), b = c(1.5, 2.5, 5, 1))
df$with_columns(
  replaced = pl$col("a")$replace_strict(
    old = pl$col("a")$max(),
    new = pl$col("b")$sum(),
    default = pl$col("b"),
  )
)
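The contrast between $replace() (unmatched values stay as they are) and $replace_strict() with a default (unmatched values take the default) can be sketched with match() in base R. This is an illustration of the lookup idea only and does not reflect how polars implements it.

```r
a <- c(1, 2, 2, 3)
old <- c(2, 3)
new <- c(100, 200)

# Position of each value of `a` in `old`; NA where no mapping exists
idx <- match(a, old)

# Like $replace(): keep the original value when nothing matches
replaced <- ifelse(is.na(idx), a, new[idx])

# Like $replace_strict(..., default = -1): use the default when nothing matches
replaced_strict <- ifelse(is.na(idx), -1, new[idx])

print(data.frame(a, replaced, replaced_strict))
```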
Reshape this Expr to a flat Series or a Series of Lists
expr__reshape(dimensions)
dimensions |
An integer vector giving the size of each dimension.
If |
nested_type |
The nested data type to create. List only supports 2 dimensions, whereas Array supports an arbitrary number of dimensions. |
If a single dimension is given, results in an expression of the original data type. If multiple dimensions are given, results in an expression of data type List with shape equal to the dimensions.
A polars expression
df <- pl$DataFrame(foo = 1:9)

df$select(pl$col("foo")$reshape(9))
df$select(pl$col("foo")$reshape(c(3, 3)))

# Use `-1` to infer the other dimension
df$select(pl$col("foo")$reshape(c(-1, 3)))
df$select(pl$col("foo")$reshape(c(3, -1)))

# One can specify more than 2 dimensions by using the Array type
df <- pl$DataFrame(foo = 1:12)
df$select(
  pl$col("foo")$reshape(c(3, 2, 2), nested_type = pl$Array(pl$Float32, 2))
)
Reverse an expression
expr__reverse()
A polars expression
df <- pl$DataFrame(
  a = 1:5,
  fruits = c("banana", "banana", "apple", "apple", "banana"),
  b = 5:1
)
df$with_columns(
  pl$all()$reverse()$name$suffix("_reverse")
)
Run-length encoding (RLE) encodes data by storing each run of identical values as a single value and its length.
expr__rle()
A polars expression
df <- pl$DataFrame(a = c(1, 1, 2, 1, NA, 1, 3, 3))
df$select(pl$col("a")$rle())$unnest("a")
The ID starts at 0 and increases by one each time the value of the column changes.
expr__rle_id()
This functionality is especially useful for defining a new group for every time a column’s value changes, rather than for every distinct value of that column.
A polars expression
df <- pl$DataFrame(
  a = c(1, 2, 1, 1, 1),
  b = c("x", "x", NA, "y", "y")
)
df$with_columns(
  rle_id_a = pl$col("a")$rle_id(),
  rle_id_ab = pl$struct("a", "b")$rle_id()
)
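The run ID described above (starting at 0 and increasing by one whenever the value changes) can be reproduced in base R for a vector without missing values. This is a sketch for intuition only; base R's rle() is shown alongside as the run-length counterpart.

```r
a <- c(1, 2, 1, 1, 1)

# TRUE wherever the value differs from the previous one (first row never changes)
changed <- c(FALSE, a[-1] != a[-length(a)])

# Cumulative count of changes gives a 0-based run ID
run_id <- cumsum(changed)

# Base R run-length encoding of the same vector
runs <- rle(a)

print(run_id)
print(runs)
```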
If you have a time series <t_0, t_1, ..., t_n>, then by default the windows created will be:
(t_0 - period, t_0]
(t_1 - period, t_1]
…
(t_n - period, t_n]
whereas if you pass a non-default offset, then the windows will be:
(t_0 + offset, t_0 + offset + period]
(t_1 + offset, t_1 + offset + period]
…
(t_n + offset, t_n + offset + period]
expr__rolling(index_column, ..., period, offset = NULL, closed = "right")
index_column |
Character. Name of the column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order. In case of a rolling group by on indices, dtype needs to be one of UInt32, UInt64, Int32, Int64. Note that the first three get cast to Int64, so if performance matters use an Int64 column. |
... |
These dots are for future extensions and must be empty. |
period |
Length of the window - must be non-negative. |
offset |
Offset of the window. Default is |
closed |
Define which sides of the range are closed (inclusive).
One of the following: |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling() - this method can cache the window size computation.
A polars expression
dates <- as.POSIXct(
  c(
    "2020-01-01 13:45:48", "2020-01-01 16:42:13", "2020-01-01 16:45:09",
    "2020-01-02 18:12:48", "2020-01-03 19:45:32", "2020-01-08 23:16:43"
  )
)
df <- pl$DataFrame(dt = dates, a = c(3, 7, 5, 9, 2, 1))

df$with_columns(
  sum_a = pl$col("a")$sum()$rolling(index_column = "dt", period = "2d"),
  min_a = pl$col("a")$min()$rolling(index_column = "dt", period = "2d"),
  max_a = pl$col("a")$max()$rolling(index_column = "dt", period = "2d")
)
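The default window definition (t_i - period, t_i] can be spelled out in base R for the same timestamps and a two-day period. This sketch only illustrates which rows fall into each window; it fixes the time zone to UTC for reproducibility, which is a choice made here and not part of the example above.

```r
dt <- as.POSIXct(
  c(
    "2020-01-01 13:45:48", "2020-01-01 16:42:13", "2020-01-01 16:45:09",
    "2020-01-02 18:12:48", "2020-01-03 19:45:32", "2020-01-08 23:16:43"
  ),
  tz = "UTC"
)
a <- c(3, 7, 5, 9, 2, 1)
period <- as.difftime(2, units = "days")

# For each row, sum the values whose timestamp lies in (t_i - period, t_i]
sum_a <- vapply(seq_along(dt), function(i) {
  in_window <- dt > dt[i] - period & dt <= dt[i]
  sum(a[in_window])
}, numeric(1))

print(sum_a)
```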
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated.
The window at a given row will include the row itself, and the window_size - 1 elements before it.
expr__rolling_max( window_size, weights = NULL, ..., min_periods = NULL, center = FALSE )
window_size |
The length of the window in number of elements. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
The number of values in the window that should be
non-null before computing a result. If |
center |
If |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling() - this method can cache the window size computation.
A polars expression
df <- pl$DataFrame(a = 1:6)
df$with_columns(
  rolling_max = pl$col("a")$rolling_max(window_size = 2)
)

# Specify weights to multiply the values in the window with:
df$with_columns(
  rolling_max = pl$col("a")$rolling_max(
    window_size = 2, weights = c(0.25, 0.75)
  )
)

# Center the values in the window
df$with_columns(
  rolling_max = pl$col("a")$rolling_max(window_size = 3, center = TRUE)
)
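The fixed-size window described above (the row itself plus the window_size - 1 elements before it, optionally multiplied elementwise by weights) can be sketched in base R. The helper rolling_max_sketch() is hypothetical, and returning a missing value until the window is full is an assumption of this sketch.

```r
# Hypothetical helper; a base R sketch, not the polars implementation.
rolling_max_sketch <- function(x, window_size, weights = NULL) {
  if (is.null(weights)) weights <- rep(1, window_size)
  vapply(seq_along(x), function(i) {
    if (i < window_size) return(NA_real_)   # assumption: window not yet full
    window <- x[(i - window_size + 1):i]    # current row and the rows before it
    max(window * weights)                   # weight elementwise, then aggregate
  }, numeric(1))
}

a <- 1:6
plain <- rolling_max_sketch(a, window_size = 2)
weighted <- rolling_max_sketch(a, window_size = 2, weights = c(0.25, 0.75))

print(plain)
print(weighted)
```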
Given a by column <t_0, t_1, ..., t_n>, then closed = "right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
expr__rolling_max_by( by, window_size, ..., min_periods = 1, closed = c("right", "both", "left", "none") )
by |
This column must be of dtype Datetime or Date. Accepts expression input. Strings are parsed as column names. |
window_size |
The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language:
Or combine them: By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year". |
min_periods |
The number of values in the window that should be
non-null before computing a result. If |
closed |
Define which sides of the interval are closed (inclusive).
Default is |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling() - this method can cache the window size computation.
A polars expression
df_temporal <- pl$select(
  index = 0:24,
  date = pl$datetime_range(
    as.POSIXct("2001-01-01"),
    as.POSIXct("2001-01-02"),
    "1h"
  )
)

# Compute the rolling max with the temporal windows closed on the right
# (default)
df_temporal$with_columns(
  rolling_row_max = pl$col("index")$rolling_max_by(
    "date",
    window_size = "2h"
  )
)

# Compute the rolling max with the closure of windows on both sides
df_temporal$with_columns(
  rolling_row_max = pl$col("index")$rolling_max_by(
    "date",
    window_size = "2h", closed = "both"
  )
)
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated.
The window at a given row will include the row itself, and the window_size - 1 elements before it.
expr__rolling_mean( window_size, weights = NULL, ..., min_periods = NULL, center = FALSE )
window_size |
The length of the window in number of elements. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
The number of values in the window that should be
non-null before computing a result. If |
center |
If |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling() - this method can cache the window size computation.
A polars expression
df <- pl$DataFrame(a = 1:6)
df$with_columns(
  rolling_mean = pl$col("a")$rolling_mean(window_size = 2)
)

# Specify weights to multiply the values in the window with:
df$with_columns(
  rolling_mean = pl$col("a")$rolling_mean(
    window_size = 2, weights = c(0.25, 0.75)
  )
)

# Center the values in the window
df$with_columns(
  rolling_mean = pl$col("a")$rolling_mean(window_size = 3, center = TRUE)
)
Given a by column <t_0, t_1, ..., t_n>, then closed = "right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
expr__rolling_mean_by( by, window_size, ..., min_periods = 1, closed = c("right", "both", "left", "none") )
by |
This column must be of dtype Datetime or Date. Accepts expression input. Strings are parsed as column names. |
window_size |
The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language:
Or combine them: By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year". |
min_periods |
The number of values in the window that should be
non-null before computing a result. If |
closed |
Define which sides of the interval are closed (inclusive).
Default is |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling() - this method can cache the window size computation.
A polars expression
df_temporal <- pl$select(
  index = 0:24,
  date = pl$datetime_range(
    as.POSIXct("2001-01-01"),
    as.POSIXct("2001-01-02"),
    "1h"
  )
)

# Compute the rolling mean with the temporal windows closed on the right
# (default)
df_temporal$with_columns(
  rolling_row_mean = pl$col("index")$rolling_mean_by(
    "date",
    window_size = "2h"
  )
)

# Compute the rolling mean with the closure of windows on both sides
df_temporal$with_columns(
  rolling_row_mean = pl$col("index")$rolling_mean_by(
    "date",
    window_size = "2h", closed = "both"
  )
)
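The effect of the closed argument on temporal windows can be made concrete with a base R sketch over the same hourly data. The helper rolling_mean_by_sketch() is hypothetical and only handles closed = "right", i.e. (t - window, t], and closed = "both", i.e. [t - window, t]; the UTC time zone is a choice made here for reproducibility.

```r
# Hypothetical helper; a base R sketch, not the polars implementation.
rolling_mean_by_sketch <- function(x, by, window, closed = "right") {
  vapply(seq_along(by), function(i) {
    lower <- by[i] - window
    # The upper bound is always inclusive; the lower bound depends on `closed`
    above_lower <- if (closed == "both") by >= lower else by > lower
    mean(x[above_lower & by <= by[i]])
  }, numeric(1))
}

times <- seq(as.POSIXct("2001-01-01", tz = "UTC"), by = "1 hour", length.out = 25)
index <- 0:24
window <- as.difftime(2, units = "hours")

right <- rolling_mean_by_sketch(index, times, window)
both <- rolling_mean_by_sketch(index, times, window, closed = "both")

print(head(right, 4))
print(head(both, 4))
```

With closed = "both" the row exactly two hours earlier is included as well, so each window holds up to three rows instead of two.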
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated.
The window at a given row will include the row itself, and the window_size - 1 elements before it.
expr__rolling_median( window_size, weights = NULL, ..., min_periods = NULL, center = FALSE )
window_size |
The length of the window in number of elements. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
The number of values in the window that should be
non-null before computing a result. If |
center |
If |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling() - this method can cache the window size computation.
A polars expression
df <- pl$DataFrame(a = 1:6) df$with_columns( rolling_median = pl$col("a")$rolling_median(window_size = 2) ) # Specify weights to multiply the values in the window with: df$with_columns( rolling_median = pl$col("a")$rolling_median( window_size = 2, weights = c(0.25, 0.75) ) ) # Center the values in the window df$with_columns( rolling_median = pl$col("a")$rolling_median(window_size = 3, center = TRUE) )
df <- pl$DataFrame(a = 1:6) df$with_columns( rolling_median = pl$col("a")$rolling_median(window_size = 2) ) # Specify weights to multiply the values in the window with: df$with_columns( rolling_median = pl$col("a")$rolling_median( window_size = 2, weights = c(0.25, 0.75) ) ) # Center the values in the window df$with_columns( rolling_median = pl$col("a")$rolling_median(window_size = 3, center = TRUE) )
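The trailing-window behaviour can be illustrated in base R, without Polars. This is only a sketch of the semantics, not how Polars computes the result: rolling_apply() is a hypothetical helper, and it assumes that min_periods defaults to window_size and that missing values are dropped before aggregating.

```r
# Hypothetical helper: apply `f` to a trailing window ending at each row.
rolling_apply <- function(x, window_size, f, min_periods = window_size) {
  x <- as.numeric(x)
  vapply(seq_along(x), function(i) {
    # The window holds row i and up to (window_size - 1) rows before it.
    window <- x[max(1, i - window_size + 1):i]
    window <- window[!is.na(window)]
    # Too few non-missing values: the result for this row is missing.
    if (length(window) >= min_periods) f(window) else NA_real_
  }, numeric(1))
}

# First row has only one value in its window, so it is NA by default.
rolling_apply(1:6, window_size = 2, f = median)
# Lowering min_periods lets the incomplete first window produce a value.
rolling_apply(1:6, window_size = 2, f = median, min_periods = 1)
```

Under these assumptions the first call leaves the first row missing, while the second call fills it from the single available value.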
Given a by column <t_0, t_1, ..., t_n>, then closed = "right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
expr__rolling_median_by( by, window_size, ..., min_periods = 1, closed = c("right", "both", "left", "none") )
by |
This column must be of dtype Datetime or Date. Accepts expression input. Strings are parsed as column names. |
window_size |
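The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language: 1ns (1 nanosecond), 1us (1 microsecond), 1ms (1 millisecond), 1s (1 second), 1m (1 minute), 1h (1 hour), 1d (1 calendar day), 1w (1 calendar week), 1mo (1 calendar month), 1q (1 calendar quarter), 1y (1 calendar year).
Or combine them: "3d12h4m25s" (3 days, 12 hours, 4 minutes, and 25 seconds). By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight saving time). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year". |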
min_periods |
The number of values in the window that should be
non-null before computing a result. If |
closed |
Define which sides of the interval are closed (inclusive). Default is "right". |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), which can cache the window size computation.
A polars expression
df_temporal <- pl$select(
  index = 0:24,
  date = pl$datetime_range(
    as.POSIXct("2001-01-01"),
    as.POSIXct("2001-01-02"),
    "1h"
  )
)
# Compute the rolling median with the temporal windows closed on the right
# (default)
df_temporal$with_columns(
  rolling_row_median = pl$col("index")$rolling_median_by(
    "date",
    window_size = "2h"
  )
)
# Compute the rolling median with the closure of windows on both sides
df_temporal$with_columns(
  rolling_row_median = pl$col("index")$rolling_median_by(
    "date",
    window_size = "2h", closed = "both"
  )
)
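The effect of closed on which rows fall into each window can be sketched in base R. This is an illustration only: in_window() and rolling_median_by_sketch() are hypothetical helpers, times are plain numbers (hours) rather than Datetime values, and the interval for each closed value is assumed to follow the usual open/closed endpoint notation.

```r
# Hypothetical helper: which `times` fall into the window ending at `t`?
in_window <- function(times, t, window_size, closed = "right") {
  lower <- t - window_size
  switch(closed,
    right = times > lower & times <= t,   # (t - w, t]
    both  = times >= lower & times <= t,  # [t - w, t]
    left  = times >= lower & times < t,   # [t - w, t)
    none  = times > lower & times < t,    # (t - w, t)
    stop("unknown `closed` value: ", closed)
  )
}

# Hypothetical helper: median of the values whose time is inside each window.
rolling_median_by_sketch <- function(values, times, window_size, closed = "right") {
  vapply(times, function(t) {
    median(as.numeric(values[in_window(times, t, window_size, closed)]))
  }, numeric(1))
}

hours <- 0:24  # one observation per hour
index <- 0:24

right <- rolling_median_by_sketch(index, hours, window_size = 2)
both <- rolling_median_by_sketch(index, hours, window_size = 2, closed = "both")
right[1:4]
both[1:4]
```

With a two-hour window on hourly data, closed = "right" covers two rows per window while closed = "both" covers three, because the row exactly two hours earlier sits on the left boundary.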
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated.
The window at a given row will include the row itself, and the window_size - 1 elements before it.
expr__rolling_min( window_size, weights = NULL, ..., min_periods = NULL, center = FALSE )
window_size |
The length of the window in number of elements. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
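The number of values in the window that should be non-null before computing a result. If NULL (default), it will be set equal to window_size. |
center |
If TRUE, set the labels at the center of the window. |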
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), which can cache the window size computation.
A polars expression
df <- pl$DataFrame(a = 1:6)
df$with_columns(
  rolling_min = pl$col("a")$rolling_min(window_size = 2)
)
# Specify weights to multiply the values in the window with:
df$with_columns(
  rolling_min = pl$col("a")$rolling_min(
    window_size = 2, weights = c(0.25, 0.75)
  )
)
# Center the values in the window
df$with_columns(
  rolling_min = pl$col("a")$rolling_min(window_size = 3, center = TRUE)
)
Given a by column <t_0, t_1, ..., t_n>, then closed = "right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
expr__rolling_min_by( by, window_size, ..., min_periods = 1, closed = c("right", "both", "left", "none") )
by |
This column must be of dtype Datetime or Date. Accepts expression input. Strings are parsed as column names. |
window_size |
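The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language: 1ns (1 nanosecond), 1us (1 microsecond), 1ms (1 millisecond), 1s (1 second), 1m (1 minute), 1h (1 hour), 1d (1 calendar day), 1w (1 calendar week), 1mo (1 calendar month), 1q (1 calendar quarter), 1y (1 calendar year).
Or combine them: "3d12h4m25s" (3 days, 12 hours, 4 minutes, and 25 seconds). By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight saving time). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year". |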
min_periods |
The number of values in the window that should be
non-null before computing a result. If |
closed |
Define which sides of the interval are closed (inclusive). Default is "right". |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), which can cache the window size computation.
A polars expression
df_temporal <- pl$select(
  index = 0:24,
  date = pl$datetime_range(
    as.POSIXct("2001-01-01"),
    as.POSIXct("2001-01-02"),
    "1h"
  )
)
# Compute the rolling min with the temporal windows closed on the right
# (default)
df_temporal$with_columns(
  rolling_row_min = pl$col("index")$rolling_min_by(
    "date",
    window_size = "2h"
  )
)
# Compute the rolling min with the closure of windows on both sides
df_temporal$with_columns(
  rolling_row_min = pl$col("index")$rolling_min_by(
    "date",
    window_size = "2h", closed = "both"
  )
)
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated.
The window at a given row will include the row itself, and the window_size - 1 elements before it.
expr__rolling_quantile( quantile, interpolation = c("nearest", "higher", "lower", "midpoint", "linear"), window_size, weights = NULL, ..., min_periods = NULL, center = FALSE )
quantile |
Quantile between 0.0 and 1.0. |
interpolation |
Interpolation method. Must be one of "nearest", "higher", "lower", "midpoint", or "linear". |
window_size |
The length of the window in number of elements. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
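The number of values in the window that should be non-null before computing a result. If NULL (default), it will be set equal to window_size. |
center |
If TRUE, set the labels at the center of the window. |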
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), which can cache the window size computation.
A polars expression
df <- pl$DataFrame(a = 1:6)
df$with_columns(
  rolling_quantile = pl$col("a")$rolling_quantile(
    quantile = 0.25, window_size = 4
  )
)
# Specify weights to multiply the values in the window with:
df$with_columns(
  rolling_quantile = pl$col("a")$rolling_quantile(
    quantile = 0.25, window_size = 4, weights = c(0.2, 0.4, 0.4, 0.2)
  )
)
# Specify weights and interpolation method:
df$with_columns(
  rolling_quantile = pl$col("a")$rolling_quantile(
    quantile = 0.25, window_size = 4, weights = c(0.2, 0.4, 0.4, 0.2),
    interpolation = "linear"
  )
)
# Center the values in the window
df$with_columns(
  rolling_quantile = pl$col("a")$rolling_quantile(
    quantile = 0.25, window_size = 5, center = TRUE
  )
)
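To make the interpolation options concrete, here is a base R sketch of one common convention for a single window. It is an assumption, not the Polars implementation: quantile_sketch() is a hypothetical helper, the fractional position is taken as quantile * (n - 1) on the sorted values, and tie-breaking for "nearest" follows R's round().

```r
# Hypothetical helper: quantile of one window under each interpolation rule.
quantile_sketch <- function(x, quantile, interpolation = "nearest") {
  x <- sort(x)
  pos <- quantile * (length(x) - 1)  # zero-based fractional position
  lo <- x[floor(pos) + 1]            # sorted value just below the position
  hi <- x[ceiling(pos) + 1]          # sorted value just above the position
  switch(interpolation,
    lower    = lo,
    higher   = hi,
    nearest  = x[round(pos) + 1],
    midpoint = (lo + hi) / 2,
    linear   = lo + (hi - lo) * (pos - floor(pos)),
    stop("unknown `interpolation` value: ", interpolation)
  )
}

window <- c(1, 2, 3, 4)
methods <- c("lower", "higher", "nearest", "midpoint", "linear")
# One result per interpolation method for the same window and quantile.
results <- vapply(
  methods,
  function(m) quantile_sketch(window, 0.25, m),
  numeric(1)
)
results
```

The methods only differ when the requested quantile falls between two sorted values, as it does here.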
Given a by column <t_0, t_1, ..., t_n>, then closed = "right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
expr__rolling_quantile_by( by, window_size, ..., quantile, interpolation = c("nearest", "higher", "lower", "midpoint", "linear"), min_periods = 1, closed = c("right", "both", "left", "none") )
by |
This column must be of dtype Datetime or Date. Accepts expression input. Strings are parsed as column names. |
window_size |
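The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language: 1ns (1 nanosecond), 1us (1 microsecond), 1ms (1 millisecond), 1s (1 second), 1m (1 minute), 1h (1 hour), 1d (1 calendar day), 1w (1 calendar week), 1mo (1 calendar month), 1q (1 calendar quarter), 1y (1 calendar year).
Or combine them: "3d12h4m25s" (3 days, 12 hours, 4 minutes, and 25 seconds). By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight saving time). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year". |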
quantile |
Quantile between 0.0 and 1.0. |
interpolation |
Interpolation method. Must be one of "nearest", "higher", "lower", "midpoint", or "linear". |
min_periods |
The number of values in the window that should be
non-null before computing a result. If |
closed |
Define which sides of the interval are closed (inclusive). Default is "right". |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), which can cache the window size computation.
A polars expression
df_temporal <- pl$select(
  index = 0:24,
  date = pl$datetime_range(
    as.POSIXct("2001-01-01"),
    as.POSIXct("2001-01-02"),
    "1h"
  )
)
# Compute the rolling quantile with the temporal windows closed on the right
# (default). `quantile` has no default, so it must be supplied.
df_temporal$with_columns(
  rolling_row_quantile = pl$col("index")$rolling_quantile_by(
    "date",
    window_size = "2h", quantile = 0.3
  )
)
# Compute the rolling quantile with the closure of windows on both sides
df_temporal$with_columns(
  rolling_row_quantile = pl$col("index")$rolling_quantile_by(
    "date",
    window_size = "2h", quantile = 0.3, closed = "both"
  )
)
A window of length window_size will traverse the array, and the skewness of the values that fill each window is computed.
The window at a given row will include the row itself, and the window_size - 1 elements before it.
expr__rolling_skew(window_size, ..., bias = TRUE)
window_size |
The length of the window in number of elements. |
... |
These dots are for future extensions and must be empty. |
bias |
If FALSE, the calculations are corrected for statistical bias. |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), which can cache the window size computation.
A polars expression
df <- pl$DataFrame(a = c(1, 4, 2, 9))
df$with_columns(
  rolling_skew = pl$col("a")$rolling_skew(3)
)
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated.
The window at a given row will include the row itself, and the window_size - 1 elements before it.
expr__rolling_std( window_size, weights = NULL, ..., min_periods = NULL, center = FALSE, ddof = 1 )
window_size |
The length of the window in number of elements. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
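The number of values in the window that should be non-null before computing a result. If NULL (default), it will be set equal to window_size. |
center |
If TRUE, set the labels at the center of the window. |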
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), which can cache the window size computation.
A polars expression
df <- pl$DataFrame(a = 1:6)
df$with_columns(
  rolling_std = pl$col("a")$rolling_std(window_size = 2)
)
# Specify weights to multiply the values in the window with:
df$with_columns(
  rolling_std = pl$col("a")$rolling_std(
    window_size = 2, weights = c(0.25, 0.75)
  )
)
# Center the values in the window
df$with_columns(
  rolling_std = pl$col("a")$rolling_std(window_size = 3, center = TRUE)
)
Given a by column <t_0, t_1, ..., t_n>, then closed = "right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
expr__rolling_std_by( by, window_size, ..., min_periods = 1, closed = c("right", "both", "left", "none"), ddof = 1 )
by |
This column must be of dtype Datetime or Date. Accepts expression input. Strings are parsed as column names. |
window_size |
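The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language: 1ns (1 nanosecond), 1us (1 microsecond), 1ms (1 millisecond), 1s (1 second), 1m (1 minute), 1h (1 hour), 1d (1 calendar day), 1w (1 calendar week), 1mo (1 calendar month), 1q (1 calendar quarter), 1y (1 calendar year).
Or combine them: "3d12h4m25s" (3 days, 12 hours, 4 minutes, and 25 seconds). By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight saving time). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year". |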
min_periods |
The number of values in the window that should be
non-null before computing a result. If |
closed |
Define which sides of the interval are closed (inclusive). Default is "right". |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), which can cache the window size computation.
A polars expression
df_temporal <- pl$select(
  index = 0:24,
  date = pl$datetime_range(
    as.POSIXct("2001-01-01"),
    as.POSIXct("2001-01-02"),
    "1h"
  )
)
# Compute the rolling std with the temporal windows closed on the right
# (default)
df_temporal$with_columns(
  rolling_row_std = pl$col("index")$rolling_std_by(
    "date",
    window_size = "2h"
  )
)
# Compute the rolling std with the closure of windows on both sides
df_temporal$with_columns(
  rolling_row_std = pl$col("index")$rolling_std_by(
    "date",
    window_size = "2h", closed = "both"
  )
)
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated.
The window at a given row will include the row itself, and the window_size - 1 elements before it.
expr__rolling_sum( window_size, weights = NULL, ..., min_periods = NULL, center = FALSE )
window_size |
The length of the window in number of elements. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
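The number of values in the window that should be non-null before computing a result. If NULL (default), it will be set equal to window_size. |
center |
If TRUE, set the labels at the center of the window. |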
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), which can cache the window size computation.
A polars expression
df <- pl$DataFrame(a = 1:6)
df$with_columns(
  rolling_sum = pl$col("a")$rolling_sum(window_size = 2)
)
# Specify weights to multiply the values in the window with:
df$with_columns(
  rolling_sum = pl$col("a")$rolling_sum(
    window_size = 2, weights = c(0.25, 0.75)
  )
)
# Center the values in the window
df$with_columns(
  rolling_sum = pl$col("a")$rolling_sum(window_size = 3, center = TRUE)
)
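How weights and center change the result is easiest to see on the sum. The base R sketch below is an illustration under assumptions, not the Polars implementation: rolling_sum_sketch() is a hypothetical helper, incomplete windows yield NA (as if min_periods equals window_size), and centering shifts the window forward by window_size %/% 2 rows, which is unambiguous only for odd window sizes.

```r
# Hypothetical helper: weighted rolling sum with optional centering.
rolling_sum_sketch <- function(x, window_size, weights = NULL, center = FALSE) {
  n <- length(x)
  if (is.null(weights)) weights <- rep(1, window_size)
  # With center = TRUE the window is moved so the row sits in its middle.
  shift <- if (center) window_size %/% 2 else 0
  out <- rep(NA_real_, n)
  for (i in seq_len(n)) {
    end <- i + shift
    start <- end - window_size + 1
    # Only complete windows produce a value in this sketch.
    if (start >= 1 && end <= n) {
      # Multiply elementwise by the weights, then aggregate.
      out[i] <- sum(x[start:end] * weights)
    }
  }
  out
}

weighted <- rolling_sum_sketch(1:6, window_size = 2, weights = c(0.25, 0.75))
centered <- rolling_sum_sketch(1:6, window_size = 3, center = TRUE)
weighted
centered
```

In the weighted call the second row is 1 * 0.25 + 2 * 0.75; in the centered call the first and last rows are missing because their windows would extend past the data.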
Given a by column <t_0, t_1, ..., t_n>, then closed = "right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
expr__rolling_sum_by( by, window_size, ..., min_periods = 1, closed = c("right", "both", "left", "none") )
by |
This column must be of dtype Datetime or Date. Accepts expression input. Strings are parsed as column names. |
window_size |
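The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language: 1ns (1 nanosecond), 1us (1 microsecond), 1ms (1 millisecond), 1s (1 second), 1m (1 minute), 1h (1 hour), 1d (1 calendar day), 1w (1 calendar week), 1mo (1 calendar month), 1q (1 calendar quarter), 1y (1 calendar year).
Or combine them: "3d12h4m25s" (3 days, 12 hours, 4 minutes, and 25 seconds). By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight saving time). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year". |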
min_periods |
The number of values in the window that should be
non-null before computing a result. If |
closed |
Define which sides of the interval are closed (inclusive). Default is "right". |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), which can cache the window size computation.
A polars expression
df_temporal <- pl$select(
  index = 0:24,
  date = pl$datetime_range(
    as.POSIXct("2001-01-01"),
    as.POSIXct("2001-01-02"),
    "1h"
  )
)
# Compute the rolling sum with the temporal windows closed on the right
# (default)
df_temporal$with_columns(
  rolling_row_sum = pl$col("index")$rolling_sum_by(
    "date",
    window_size = "2h"
  )
)
# Compute the rolling sum with the closure of windows on both sides
df_temporal$with_columns(
  rolling_row_sum = pl$col("index")$rolling_sum_by(
    "date",
    window_size = "2h", closed = "both"
  )
)
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated.
The window at a given row will include the row itself, and the window_size - 1 elements before it.
expr__rolling_var( window_size, weights = NULL, ..., min_periods = NULL, center = FALSE, ddof = 1 )
window_size |
The length of the window in number of elements. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
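The number of values in the window that should be non-null before computing a result. If NULL (default), it will be set equal to window_size. |
center |
If TRUE, set the labels at the center of the window. |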
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), which can cache the window size computation.
A polars expression
df <- pl$DataFrame(a = 1:6)
df$with_columns(
  rolling_var = pl$col("a")$rolling_var(window_size = 2)
)
# Specify weights to multiply the values in the window with:
df$with_columns(
  rolling_var = pl$col("a")$rolling_var(
    window_size = 2, weights = c(0.25, 0.75)
  )
)
# Center the values in the window
df$with_columns(
  rolling_var = pl$col("a")$rolling_var(window_size = 3, center = TRUE)
)
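The ddof argument appears in the signature but has no row in the argument table above. Assuming it means "delta degrees of freedom", so that the divisor is N - ddof for a window of N values, the variance of a single window can be sketched in base R. var_ddof() is a hypothetical helper written for this illustration.

```r
# Hypothetical helper: variance of one window with divisor (N - ddof).
var_ddof <- function(x, ddof = 1) {
  sum((x - mean(x))^2) / (length(x) - ddof)
}

window <- c(1, 2)
sample_var <- var_ddof(window)                # ddof = 1, matches stats::var()
population_var <- var_ddof(window, ddof = 0)  # divides by N instead of N - 1
sample_var
population_var
```

Under this assumption the default ddof = 1 gives the usual sample variance, and the standard deviation is its square root.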
Given a by column <t_0, t_1, ..., t_n>, then closed = "right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
expr__rolling_var_by( by, window_size, ..., min_periods = 1, closed = c("right", "both", "left", "none"), ddof = 1 )
by |
This column must be of dtype Datetime or Date. Accepts expression input. Strings are parsed as column names. |
window_size |
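The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language: 1ns (1 nanosecond), 1us (1 microsecond), 1ms (1 millisecond), 1s (1 second), 1m (1 minute), 1h (1 hour), 1d (1 calendar day), 1w (1 calendar week), 1mo (1 calendar month), 1q (1 calendar quarter), 1y (1 calendar year).
Or combine them: "3d12h4m25s" (3 days, 12 hours, 4 minutes, and 25 seconds). By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight saving time). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year". |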
min_periods |
The number of values in the window that should be
non-null before computing a result. If |
closed |
Define which sides of the interval are closed (inclusive). Default is "right". |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), which can cache the window size computation.
A polars expression
df_temporal <- pl$select(
  index = 0:24,
  date = pl$datetime_range(
    as.POSIXct("2001-01-01"),
    as.POSIXct("2001-01-02"),
    "1h"
  )
)
# Compute the rolling var with the temporal windows closed on the right
# (default)
df_temporal$with_columns(
  rolling_row_var = pl$col("index")$rolling_var_by(
    "date",
    window_size = "2h"
  )
)
# Compute the rolling var with the closure of windows on both sides
df_temporal$with_columns(
  rolling_row_var = pl$col("index")$rolling_var_by(
    "date",
    window_size = "2h", closed = "both"
  )
)
Round underlying floating-point data to the number of decimal places given by decimals.
expr__round(decimals)
decimals |
Number of decimals to round by. |
A polars expression
df <- pl$DataFrame(a = c(0.33, 0.52, 1.02, 1.17))
df$with_columns(
  rounded = pl$col("a")$round(1)
)
Round to a number of significant figures
expr__round_sig_figs(digits)
digits |
Number of significant figures to round to. |
A polars expression
df <- pl$DataFrame(a = c(0.01234, 3.333, 1234))
df$with_columns(
  rounded = pl$col("a")$round_sig_figs(2)
)
Sample from this expression
expr__sample( n = NULL, ..., fraction = NULL, with_replacement = FALSE, shuffle = FALSE, seed = NULL )
n |
Number of items to return. Cannot be used with fraction. |
... |
These dots are for future extensions and must be empty. |
fraction |
Fraction of items to return. Cannot be used with n. |
with_replacement |
Allow values to be sampled more than once. |
shuffle |
Shuffle the order of sampled data points. |
seed |
Seed for the random number generator. If NULL (default), a random seed is generated for each sample operation. |
A polars expression
df <- pl$DataFrame(a = 1:3)
df$select(pl$col("a")$sample(
  fraction = 1, with_replacement = TRUE, seed = 1
))
Find the indices where elements should be inserted to maintain order.
expr__search_sorted(element, side = c("any", "left", "right"))
element |
Expression or scalar value. |
side |
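Must be one of the following: "any" (the index of the first suitable location found is given), "left" (the index of the leftmost suitable location found is given), or "right" (the index of the rightmost suitable location found is given).
|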
A polars expression
df <- pl$DataFrame(values = c(1, 2, 3, 5))
df$select(
  zero = pl$col("values")$search_sorted(0),
  three = pl$col("values")$search_sorted(3),
  six = pl$col("values")$search_sorted(6),
)
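The insertion index can be illustrated in base R. This is a sketch of the idea only: search_sorted_sketch() is a hypothetical helper, it returns zero-based positions, it assumes the input is sorted in ascending order, and it treats side = "any" the same as "left" since any valid position is acceptable in that case.

```r
# Hypothetical helper: zero-based index at which `element` could be inserted
# into the ascending vector `sorted` while keeping it sorted.
search_sorted_sketch <- function(sorted, element, side = "left") {
  if (side == "right") {
    # Insert after any equal values: count values <= element.
    sum(sorted <= element)
  } else {
    # Insert before any equal values: count values < element.
    sum(sorted < element)
  }
}

values <- c(1, 2, 3, 5)
search_sorted_sketch(values, 0)
search_sorted_sketch(values, 3)
search_sorted_sketch(values, 3, side = "right")
search_sorted_sketch(values, 6)
```

The "left" and "right" results differ only when the element already occurs in the data, as with 3 here.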
Enables downstream code to use fast paths for sorted arrays.
Warning: This can lead to incorrect results if the data is NOT sorted. Use with care!
expr__set_sorted(..., descending = FALSE)
... |
These dots are for future extensions and must be empty. |
descending |
Whether the Series order is descending. |
A polars expression
df <- pl$DataFrame(a = 1:3)
df$select(pl$col("a")$set_sorted()$max())
Shift values by the given number of indices
expr__shift(n = 1, ..., fill_value = NULL)
n |
Number of indices to shift forward. If a negative value is passed, values are shifted in the opposite direction instead. |
... |
These dots are for future extensions and must be empty. |
fill_value |
Fill the resulting null values with this value. |
A polars expression
# By default, values are shifted forward by one index.
df <- pl$DataFrame(a = 1:4)
df$with_columns(shift = pl$col("a")$shift())
# Pass a negative value to shift in the opposite direction instead.
df$with_columns(shift = pl$col("a")$shift(-2))
# Specify fill_value to fill the resulting null values.
df$with_columns(shift = pl$col("a")$shift(-2, fill_value = 100))
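The shifting behaviour can be mimicked on a plain vector. This base R sketch is an illustration, not the Polars implementation; shift_values() is a hypothetical helper in which NA plays the role of a Polars null.

```r
# Hypothetical helper: shift a vector by `n` positions, padding with `fill_value`.
shift_values <- function(x, n = 1, fill_value = NA) {
  len <- length(x)
  if (n == 0 || len == 0) {
    return(x)
  }
  k <- min(abs(n), len)      # number of positions that become empty
  pad <- rep(fill_value, k)
  if (n > 0) {
    # Positive n: values move towards the end, padding goes in front.
    c(pad, head(x, len - k))
  } else {
    # Negative n: values move towards the start, padding goes at the end.
    c(tail(x, len - k), pad)
  }
}

shift_values(1:4)
shift_values(1:4, -2)
shift_values(1:4, -2, fill_value = 100)
```

The vector keeps its length in every case; only the positions vacated by the shift receive the padding value.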
Shrink to the dtype needed to fit the extrema of this Series. This can be used to reduce memory pressure.
expr__shrink_dtype()
A polars expression
df <- pl$DataFrame(a = c(-112, 2, 112))$cast(pl$Int64)
df$with_columns(
  shrunk = pl$col("a")$shrink_dtype()
)
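The idea of fitting the extrema can be sketched for signed integers in base R. This is a simplified assumption about the selection rule, not the actual Polars logic: smallest_int_type() is a hypothetical helper that only considers signed integer widths and ignores unsigned and floating-point types.

```r
# Hypothetical helper: name of the narrowest signed integer type whose range
# contains both the minimum and the maximum of `x`.
smallest_int_type <- function(x) {
  lo <- min(x)
  hi <- max(x)
  for (bits in c(8, 16, 32, 64)) {
    # A signed type with `bits` bits covers [-2^(bits - 1), 2^(bits - 1) - 1].
    if (lo >= -2^(bits - 1) && hi <= 2^(bits - 1) - 1) {
      return(paste0("Int", bits))
    }
  }
  NA_character_
}

smallest_int_type(c(-112, 2, 112))
smallest_int_type(c(-129, 2, 112))
smallest_int_type(c(0, 40000))
```

Only the two extreme values matter, so a single out-of-range value is enough to force a wider type.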
Note this is shuffled independently of any other column or Expression. If you want each row to stay the same use df$sample(shuffle = TRUE).
expr__shuffle(seed = NULL)
seed |
Integer indicating the seed for the random number generator. If NULL (default), a random seed is generated each time the shuffle is called.
|
A polars expression
df <- pl$DataFrame(a = 1:3)
df$with_columns(
  shuffled = pl$col("a")$shuffle(seed = 1)
)
This returns -1 if x is lower than 0, 0 if x == 0, and 1 if x is greater than 0.
expr__sign()
A polars expression
df <- pl$DataFrame(a = c(-9, 0, 0, 4, NA))
df$with_columns(sign = pl$col("a")$sign())
Compute sine
expr__sin()
A polars expression
pl$DataFrame(a = c(0, pi / 2, pi, NA))$
  with_columns(sine = pl$col("a")$sin())
Compute hyperbolic sine
expr__sinh()
A polars expression
pl$DataFrame(a = c(-1, asinh(0.5), 0, 1, NA))$
  with_columns(sinh = pl$col("a")$sinh())
For normally distributed data, the skewness should be about zero. For unimodal continuous distributions, a skewness value greater than zero means that there is more weight in the right tail of the distribution.
expr__skew(..., bias = TRUE)
... |
These dots are for future extensions and must be empty. |
bias |
If FALSE, the calculations are corrected for statistical bias. |
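The sample skewness is computed as the Fisher-Pearson coefficient of skewness, i.e.
$$g_1 = \frac{m_3}{m_2^{3/2}}$$
where
$$m_i = \frac{1}{N} \sum_{n=1}^{N} (x[n] - \bar{x})^i$$
is the biased sample $i$-th central moment, and $\bar{x}$ is the sample mean. If bias = FALSE, the calculations are corrected for bias and the value computed is the adjusted Fisher-Pearson standardized moment coefficient, i.e.
$$G_1 = \frac{k_3}{k_2^{3/2}} = \frac{\sqrt{N(N-1)}}{N-2} \frac{m_3}{m_2^{3/2}}$$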
A polars expression
df <- pl$DataFrame(x = c(1, 2, 3, 2, 1))
df$select(pl$col("x")$skew())
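As a cross-check of the definition, the Fisher-Pearson coefficient can be computed by hand in base R. This is a sketch rather than the Polars implementation: skew_sketch() is a hypothetical helper, and the bias correction factor sqrt(N(N - 1)) / (N - 2) is the standard adjusted Fisher-Pearson form, assumed here to be what bias = FALSE applies.

```r
# Hypothetical helper: sample skewness from the biased central moments.
skew_sketch <- function(x, bias = TRUE) {
  n <- length(x)
  m2 <- mean((x - mean(x))^2)  # biased second central moment
  m3 <- mean((x - mean(x))^3)  # biased third central moment
  g1 <- m3 / m2^1.5            # Fisher-Pearson coefficient of skewness
  if (bias) g1 else sqrt(n * (n - 1)) / (n - 2) * g1
}

x <- c(1, 2, 3, 2, 1)
biased <- skew_sketch(x)
adjusted <- skew_sketch(x, bias = FALSE)
biased
adjusted
```

For this input the mean is 1.8, so m2 = 0.56 and m3 = 0.144, giving a small positive skewness; the adjusted value is larger in magnitude because the correction factor exceeds 1.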
Get a slice of this expression
expr__slice(offset, length = NULL)
offset |
Numeric or expression, zero-indexed. Indicates where to start the slice. A negative value is one-indexed and starts from the end. |
length |
Maximum number of elements contained in the slice. If NULL (default), all rows starting at the offset are selected. |
A polars expression
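The offset convention (zero-indexed from the start, negative values counting back from the end) can be mimicked on a plain R vector. This is a minimal sketch, not the polars implementation: `slice_vec` and its `len` argument are hypothetical names, and the clamping of out-of-range offsets is an assumption.

```r
# Hypothetical base R helper mimicking the offset/length convention.
slice_vec <- function(x, offset, len = NULL) {
  n <- length(x)
  # Zero-based start; a negative offset counts back from the end.
  start <- if (offset < 0) max(n + offset, 0) else min(offset, n)
  if (is.null(len)) len <- n - start
  len <- min(len, n - start)
  x[start + seq_len(len)]  # shift to R's one-based indexing
}

x <- 0:100
slice_vec(x, 0, 6)   # first six elements
slice_vec(x, -6, 6)  # last six elements
slice_vec(x, 80)     # everything from zero-based position 80 onward
```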
# as head
pl$DataFrame(a = 0:100)$select(
  pl$all()$slice(0, 6)
)

# as tail
pl$DataFrame(a = 0:100)$select(
  pl$all()$slice(-6, 6)
)

pl$DataFrame(a = 0:100)$select(
  pl$all()$slice(80)
)
If used in a groupby context, values within each group are sorted.
expr__sort(..., descending = FALSE, nulls_last = FALSE)
... |
These dots are for future extensions and must be empty. |
descending |
Sort in descending order. |
nulls_last |
Place null values last. |
A polars expression
df <- pl$DataFrame(a = c(6, 1, 0, NA, Inf, NaN))
df$with_columns(
  sorted = pl$col("a")$sort(),
  sorted_desc = pl$col("a")$sort(descending = TRUE),
  sorted_nulls_last = pl$col("a")$sort(nulls_last = TRUE)
)

# When sorting in a group by context, values in each group are sorted.
df <- pl$DataFrame(
  group = c("one", "one", "one", "two", "two", "two"),
  value = c(1, 98, 2, 3, 99, 4)
)
df$group_by("group")$agg(pl$col("value")$sort())
If used in a groupby context, values within each group are sorted.
expr__sort_by( ..., descending = FALSE, nulls_last = FALSE, multithreaded = TRUE, maintain_order = FALSE )
... |
< |
descending |
Sort in descending order. When sorting by multiple columns, can be specified per column by passing a sequence of booleans. |
nulls_last |
Place null values last; can specify a single boolean applying to all columns or a sequence of booleans for per-column control. |
multithreaded |
Sort using multiple threads. |
maintain_order |
Whether the order should be maintained if elements are equal. |
A polars expression
df <- pl$DataFrame(
  group = c("a", "a", "b", "b"),
  value1 = c(1, 3, 4, 2),
  value2 = c(8, 7, 6, 5)
)

# by one column/expression
df$with_columns(
  sorted = pl$col("group")$sort_by("value1")
)

# by two columns/expressions
df$with_columns(
  sorted = pl$col("group")$sort_by(
    "value2", pl$col("value1"),
    descending = c(TRUE, FALSE)
  )
)

# by some expression
df$with_columns(
  sorted = pl$col("group")$sort_by(pl$col("value1") + pl$col("value2"))
)

# in an aggregation context, values are sorted within groups
df$group_by("group")$agg(
  pl$col("value1")$sort_by("value2")
)
Compute square root
expr__sqrt()
A polars expression
pl$DataFrame(a = c(1, 2, 4))$ with_columns(sqrt = pl$col("a")$sqrt())
Compute the standard deviation
expr__std(ddof = 1)
A polars expression
pl$DataFrame(a = c(1, 3, 5, 6))$ select(pl$all()$std())
Method equivalent of subtraction operator expr - other.
expr__sub(other)
other |
Numeric literal or expression value. |
A polars expression
Arithmetic operators
df <- pl$DataFrame(x = 0:4) df$with_columns( `x-2` = pl$col("x")$sub(2), `x-expr` = pl$col("x")$sub(pl$col("x")$cum_sum()) )
Get sum value
expr__sum()
The dtypes Int8, UInt8, Int16 and UInt16 are cast to Int64 before summing to prevent overflow issues.
A polars expression
pl$DataFrame(x = c(1L, NA, 2L))$ with_columns(sum = pl$col("x")$sum())
Get the last n elements
expr__tail(n = 10)
n |
Number of elements to take. |
A polars expression
pl$DataFrame(x = 1:11)$select(pl$col("x")$tail(3))
Compute tangent
expr__tan()
A polars expression
pl$DataFrame(a = c(0, pi / 2, pi, NA))$ with_columns(tangent = pl$col("a")$tan())
Compute hyperbolic tangent
expr__tanh()
A polars expression
pl$DataFrame(a = c(-1, atanh(0.5), 0, 1, NA))$ with_columns(tanh = pl$col("a")$tanh())
The following data types will be changed:
Date -> Int32
Datetime -> Int64
Time -> Int64
Duration -> Int64
Categorical -> UInt32
List(inner) -> List(physical of inner)
Other data types will be left unchanged.
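For intuition only, base R keeps a similar split between a logical type and its stored representation: a Date is a count of days since 1970-01-01, and a factor is a vector of integer codes plus labels. The sketch below does not call polars, and R's factor codes start at 1, so the concrete numbers need not match the physical values polars reports.

```r
# Base R analogue only; this does not call polars.
d <- as.Date("2020-01-01")
as.integer(d)  # days since 1970-01-01

f <- factor(c("a", "x", NA, "a"))
as.integer(f)  # integer codes stored behind the labels (1-based in R)
levels(f)      # the labels those codes point to
```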
expr__to_physical()
A polars expression
df <- pl$DataFrame(a = factor(c("a", "x", NA, "a"))) df$with_columns( phys = pl$col("a")$to_physical() )
Return the k largest elements. Non-null elements are always preferred over null elements. The output is not
guaranteed to be in any particular order; call $sort() after
this function if you wish the output to be sorted. This has time complexity
.
expr__top_k(k = 5)
k |
Number of elements to return. |
A polars expression
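To see which elements a top-k selection keeps, the same result set can be produced in base R by sorting and truncating. This is only an illustration of the selected values: it performs a full sort, returns them in sorted order (which the polars method does not promise), and relies on `sort()` dropping `NA` by default.

```r
value <- c(1, 98, 2, 3, 99, 4)
k <- 3

# Full sort, then keep the first k values.
head(sort(value, decreasing = TRUE), k)  # the k largest
head(sort(value), k)                     # the k smallest, for comparison
```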
df <- pl$DataFrame(value = c(1, 98, 2, 3, 99, 4)) df$select( top_k = pl$col("value")$top_k(k = 3), bottom_k = pl$col("value")$bottom_k(k = 3) )
Return the k largest elements. Non-null elements are always preferred over null elements. The output is not
guaranteed to be in any particular order; call $sort() after
this function if you wish the output to be sorted. This has time complexity
.
expr__top_k_by(by, k = 5, ..., reverse = FALSE)
by |
Column(s) used to determine the largest elements. Accepts expression input. Strings are parsed as column names. |
k |
Number of elements to return. |
A polars expression
df <- pl$DataFrame(
  a = 1:6,
  b = 6:1,
  c = c("Apple", "Orange", "Apple", "Apple", "Banana", "Banana")
)

# Get the top 2 rows by column a or b:
df$select(
  pl$all()$top_k_by("a", 2)$name$suffix("_top_by_a"),
  pl$all()$top_k_by("b", 2)$name$suffix("_top_by_b")
)

# Get the top 2 rows by multiple columns with given order.
df$select(
  pl$all()$
    top_k_by(c("c", "a"), 2, reverse = c(FALSE, TRUE))$
    name$suffix("_top_by_ca"),
  pl$all()$
    top_k_by(c("c", "b"), 2, reverse = c(FALSE, TRUE))$
    name$suffix("_top_by_cb")
)

# Get the top 2 rows by column a in each group
df$group_by("c", maintain_order = TRUE)$agg(
  pl$all()$top_k_by("a", 2)
)$explode(pl$all()$exclude("c"))
Method equivalent of float division operator expr / other. $truediv() is an alias for $true_div(), which exists for compatibility with Python Polars.
expr__true_div(other) expr__truediv(other)
other |
Numeric literal or expression value. |
Zero-division behaviour follows IEEE-754:
0/0: Invalid operation - mathematically undefined, returns NaN.
n/0: On finite operands gives an exact infinite result, e.g.: ±infinity.
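R's own double arithmetic follows the same IEEE-754 rules, so the two zero-division cases can be checked without polars. The vectors below are arbitrary illustrative values; that the polars result matches element for element is an assumption based on the description above.

```r
0 / 0   # NaN: mathematically undefined
1 / 0   # Inf
-1 / 0  # -Inf

# Element-wise division of two small vectors, including both zero cases.
x <- -2:2
y <- c(0.5, 0, 0, -4, -0.5)
r <- x / y
r
```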
A polars expression
Arithmetic operators
<Expr>$floor_div()
df <- pl$DataFrame( x = -2:2, y = c(0.5, 0, 0, -4, -0.5) ) df$with_columns( `x/2` = pl$col("x")$true_div(2), `x/y` = pl$col("x")$true_div(pl$col("y")) )
Get the unique values of this expression.
expr__unique(..., maintain_order = FALSE)
maintain_order |
Maintain order of data. This requires more work. |
A polars expression
df <- pl$DataFrame(a = c(1, 1, 2)) df$select(pl$col("a")$unique())
This method differs from $value_counts() in that it does not return the values, only the counts, and might be faster.
expr__unique_counts()
A polars expression
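The difference between returning counts only and returning values with their counts can be sketched in base R. This is an analogue rather than the polars implementation, and it assumes the counts are reported in order of first appearance.

```r
id <- c("a", "b", "b", "c", "c", "c")

# Counts only, in order of first appearance.
counts <- as.vector(table(factor(id, levels = unique(id))))
counts

# Values together with their counts, for comparison.
data.frame(value = unique(id), count = counts)
```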
df <- pl$DataFrame(id = c("a", "b", "b", "c", "c", "c")) df$select(pl$col("id")$unique_counts())
Returns a unit Series with the highest value possible for the dtype of this expression.
expr__upper_bound()
A polars expression
df <- pl$DataFrame(a = 1:3) df$select(pl$col("a")$upper_bound())
Count the occurrences of unique values
expr__value_counts( ..., sort = FALSE, parallel = FALSE, name = "count", normalize = FALSE )
... |
These dots are for future extensions and must be empty. |
sort |
Sort the output by count in descending order. If |
parallel |
Execute the computation in parallel. This option should likely not be enabled in a group by context, as the computation is already parallelized per group. |
name |
Give the resulting count field a specific name. Default is "count". |
normalize |
If |
A polars expression
df <- pl$DataFrame(color = c("red", "blue", "red", "green", "blue", "blue"))
df$select(pl$col("color")$value_counts())

# Sort the output by (descending) count and customize the count field name.
df <- df$select(pl$col("color")$value_counts(sort = TRUE, name = "n"))
df

df$unnest()
Compute the variance
expr__var(ddof = 1)
A polars expression
pl$DataFrame(a = c(1, 3, 5, 6))$ select(pl$all()$var())
Combine two boolean expressions with XOR.
expr__xor(other)
other |
Element to add. Can be a string (only if |
A polars expression
pl$lit(TRUE)$xor(pl$lit(FALSE))
Evaluate whether all boolean values are true for every sub-array
expr_arr_all()
A polars expression
df <- pl$DataFrame( values = list(c(TRUE, TRUE), c(FALSE, TRUE), c(FALSE, FALSE), c(NA, NA)), )$cast(pl$Array(pl$Boolean, 2)) df$with_columns(all = pl$col("values")$arr$all())
Evaluate whether any boolean value is true for every sub-array
expr_arr_any()
A polars expression
df <- pl$DataFrame( values = list(c(TRUE, TRUE), c(FALSE, TRUE), c(FALSE, FALSE), c(NA, NA)), )$cast(pl$Array(pl$Boolean, 2)) df$with_columns(any = pl$col("values")$arr$any())
Retrieve the index of the maximum value in every sub-array
expr_arr_arg_max()
A polars expression
df <- pl$DataFrame( values = list(1:2, 2:1) )$cast(pl$Array(pl$Int32, 2)) df$with_columns( arg_max = pl$col("values")$arr$arg_max() )
Retrieve the index of the minimum value in every sub-array
expr_arr_arg_min()
A polars expression
df <- pl$DataFrame( values = list(1:2, 2:1) )$cast(pl$Array(pl$Int32, 2)) df$with_columns( arg_min = pl$col("values")$arr$arg_min() )
Check if sub-arrays contain the given item
expr_arr_contains(item)
item |
Expr or something coercible to an Expr. Strings are not parsed as columns. |
A polars expression
df <- pl$DataFrame( values = list(0:2, 4:6, c(NA, NA, NA)), item = c(0L, 4L, 2L), )$cast(values = pl$Array(pl$Float64, 3)) df$with_columns( with_expr = pl$col("values")$arr$contains(pl$col("item")), with_lit = pl$col("values")$arr$contains(1) )
Count how often a value occurs in every sub-array
expr_arr_count_matches(element)
element |
An Expr or something coercible to an Expr that produces a single value. |
A polars expression
df <- pl$DataFrame( values = list(c(1, 2), c(1, 1), c(2, 2)) )$cast(pl$Array(pl$Int64, 2)) df$with_columns(number_of_twos = pl$col("values")$arr$count_matches(2))
Returns a column with a separate row for every array element.
expr_arr_explode()
A polars expression
df <- pl$DataFrame( a = list(c(1, 2, 3), c(4, 5, 6)) )$cast(pl$Array(pl$Int64, 3)) df$select(pl$col("a")$arr$explode())
Get the first value of the sub-arrays
expr_arr_first()
A polars expression
df <- pl$DataFrame( a = list(c(1, 2, 3), c(4, 5, 6)) )$cast(pl$Array(pl$Int64, 3)) df$with_columns(first = pl$col("a")$arr$first())
This extracts a single value per array. Values are 0-indexed (so index 0 returns the first item of every sub-array) and negative values start from the end (so index -1 returns the last item).
expr_arr_get(index, ..., null_on_oob = TRUE)
index |
An Expr or something coercible to an Expr, that must return a single index. |
... |
These dots are for future extensions and must be empty. |
null_on_oob |
If |
A polars expression
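The index convention can be mirrored on a plain R vector, which is one-indexed. In this minimal sketch the helper `get0` is a hypothetical name, and returning `NA` for an out-of-bounds index is an assumption meant to resemble the default `null_on_oob = TRUE`.

```r
# Hypothetical helper: zero-based lookup with negative indices from the end.
get0 <- function(v, index) {
  n <- length(v)
  pos <- if (index < 0) n + index + 1 else index + 1  # R's 1-based position
  if (pos < 1 || pos > n) NA else v[pos]
}

v <- c(3, 4)
get0(v, 0)   # first item
get0(v, -1)  # last item
get0(v, 10)  # out of bounds
```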
df <- pl$DataFrame( values = list(c(1, 2), c(3, 4), c(NA, 6)), idx = c(1, NA, 3) )$cast(values = pl$Array(pl$Float64, 2)) df$with_columns( using_expr = pl$col("values")$arr$get("idx"), val_0 = pl$col("values")$arr$get(0), val_minus_1 = pl$col("values")$arr$get(-1), val_oob = pl$col("values")$arr$get(10) )
Join all string items in a sub-array and place a separator between them. This only works if the inner type of the array is String.
expr_arr_join(separator, ..., ignore_nulls = FALSE)
separator |
String to separate the items with. Can be an Expr. Strings are not parsed as columns. |
... |
These dots are for future extensions and must be empty. |
A polars expression
df <- pl$DataFrame( values = list(c("a", "b", "c"), c("x", "y", "z"), c("e", NA, NA)), separator = c("-", "+", "/"), )$cast(values = pl$Array(pl$String, 3)) df$with_columns( join_with_expr = pl$col("values")$arr$join(pl$col("separator")), join_with_lit = pl$col("values")$arr$join(" "), join_ignore_null = pl$col("values")$arr$join(" ", ignore_nulls = TRUE) )
Get the last value of the sub-arrays
expr_arr_last()
A polars expression
df <- pl$DataFrame( a = list(c(1, 2, 3), c(4, 5, 6)) )$cast(pl$Array(pl$Int64, 3)) df$with_columns(last = pl$col("a")$arr$last())
Compute the max value of the sub-arrays
expr_arr_max()
A polars expression
df <- pl$DataFrame( values = list(c(1, 2), c(3, 4), c(NA, NA)) )$cast(pl$Array(pl$Float64, 2)) df$with_columns(max = pl$col("values")$arr$max())
Compute the median value of the sub-arrays
expr_arr_median()
A polars expression
df <- pl$DataFrame( values = list(c(2, 1, 4), c(8.4, 3.2, 1)), )$cast(pl$Array(pl$Float64, 3)) df$with_columns(median = pl$col("values")$arr$median())
Compute the min value of the sub-arrays
expr_arr_min()
A polars expression
df <- pl$DataFrame( values = list(c(1, 2), c(3, 4), c(NA, NA)) )$cast(pl$Array(pl$Float64, 2)) df$with_columns(min = pl$col("values")$arr$min())
Count the number of unique values in every sub-array
expr_arr_n_unique()
A polars expression
df <- pl$DataFrame( a = list(c(1, 1, 2), c(2, 3, 4)) )$cast(pl$Array(pl$Int64, 3)) df$with_columns(n_unique = pl$col("a")$arr$n_unique())
Reverse values in every sub-array
expr_arr_reverse()
A polars expression
df <- pl$DataFrame( values = list(c(1, 2), c(3, 4), c(NA, 6)) )$cast(pl$Array(pl$Float64, 2)) df$with_columns(reverse = pl$col("values")$arr$reverse())
Shift values in every sub-array by the given number of indices
expr_arr_shift(n = 1)
A polars expression
df <- pl$DataFrame( values = list(1:3, c(2L, NA, 5L)), idx = 1:2, )$cast(values = pl$Array(pl$Int32, 3)) df$with_columns( shift_by_expr = pl$col("values")$arr$shift(pl$col("idx")), shift_by_lit = pl$col("values")$arr$shift(2) )
Sort values in every sub-array
expr_arr_sort(..., descending = FALSE, nulls_last = FALSE)
... |
These dots are for future extensions and must be empty. |
df <- pl$DataFrame( values = list(c(2, 1), c(3, 4), c(NA, 6)) )$cast(pl$Array(pl$Float64, 2)) df$with_columns(sort = pl$col("values")$arr$sort(nulls_last = TRUE))
Compute the standard deviation of the sub-arrays
expr_arr_std(ddof = 1)
A polars expression
df <- pl$DataFrame( values = list(c(2, 1, 4), c(8.4, 3.2, 1)), )$cast(pl$Array(pl$Float64, 3)) df$with_columns(std = pl$col("values")$arr$std())
Compute the sum of the sub-arrays
expr_arr_sum()
A polars expression
df <- pl$DataFrame( values = list(c(1, 2), c(3, 4), c(NA, 6)) )$cast(pl$Array(pl$Float64, 2)) df$with_columns(sum = pl$col("values")$arr$sum())
Convert an Array column into a List column with the same inner data type
expr_arr_to_list()
A polars expression
df <- pl$DataFrame( a = list(c(1, 2), c(3, 4)) )$cast(pl$Array(pl$Int8, 2)) df$with_columns( list = pl$col("a")$arr$to_list() )
Get the unique values in every sub-array
expr_arr_unique(..., maintain_order = FALSE)
... |
These dots are for future extensions and must be empty. |
A polars expression
df <- pl$DataFrame( values = list(c(1, 1, 2), c(4, 4, 4), c(NA, 6, 7)), )$cast(pl$Array(pl$Float64, 3)) df$with_columns(unique = pl$col("values")$arr$unique())
Compute the variance of the sub-arrays
expr_arr_var(ddof = 1)
A polars expression
df <- pl$DataFrame( values = list(c(2, 1, 4), c(8.4, 3.2, 1)), )$cast(pl$Array(pl$Float64, 3)) df$with_columns(var = pl$col("values")$arr$var())
Check if binaries contain a binary substring
expr_bin_contains(literal)
literal |
The binary substring to look for. |
A polars expression
colors <- pl$DataFrame( name = c("black", "yellow", "blue"), code = as_polars_series(c("x00x00x00", "xffxffx00", "x00x00xff"))$cast(pl$Binary), lit = as_polars_series(c("x00", "xffx00", "xffxff"))$cast(pl$Binary) ) colors$select( "name", contains_with_lit = pl$col("code")$bin$contains("xff"), contains_with_expr = pl$col("code")$bin$contains(pl$col("lit")) )
Decode values using the provided encoding
expr_bin_decode(encoding, ..., strict = TRUE)
encoding |
A character, |
... |
These dots are for future extensions and must be empty. |
strict |
Raise an error if the underlying value cannot be decoded; otherwise mask out with a null value. |
A polars expression
df <- pl$DataFrame(
  name = c("black", "yellow", "blue"),
  code_hex = as_polars_series(c("000000", "ffff00", "0000ff"))$cast(pl$Binary),
  code_base64 = as_polars_series(c("AAAA", "//8A", "AAD/"))$cast(pl$Binary)
)
df$with_columns(
  decoded_hex = pl$col("code_hex")$bin$decode("hex"),
  decoded_base64 = pl$col("code_base64")$bin$decode("base64")
)

# Set `strict = FALSE` to set invalid values to `null` instead of raising an error.
df <- pl$DataFrame(
  colors = as_polars_series(c("000000", "ffff00", "invalid_value"))$cast(pl$Binary)
)
df$select(pl$col("colors")$bin$decode("hex", strict = FALSE))
Encode a value using the provided encoding
expr_bin_encode(encoding)
encoding |
A character, |
A polars expression
df <- pl$DataFrame( name = c("black", "yellow", "blue"), code = as_polars_series( c("000000", "ffff00", "0000ff") )$cast(pl$Binary)$bin$decode("hex") ) df$with_columns(encoded = pl$col("code")$bin$encode("hex"))
Check if values end with a binary substring
expr_bin_ends_with(suffix)
suffix |
Suffix substring. |
A polars expression
colors <- pl$DataFrame( name = c("black", "yellow", "blue"), code = as_polars_series(c("x00x00x00", "xffxffx00", "x00x00xff"))$cast(pl$Binary), suffix = as_polars_series(c("x00", "xffx00", "xffxff"))$cast(pl$Binary) ) colors$select( "name", ends_with_lit = pl$col("code")$bin$ends_with("xff"), ends_with_expr = pl$col("code")$bin$ends_with(pl$col("suffix")) )
Get the size of binary values in the given unit
expr_bin_size(unit = c("b", "kb", "mb", "gb", "tb"))
unit |
Scale the returned size to the given unit. Can be |
A polars expression
df <- pl$DataFrame( name = c("black", "yellow", "blue"), code_hex = as_polars_series(c("000000", "ffff00", "0000ff"))$cast(pl$Binary) ) df$with_columns( n_bytes = pl$col("code_hex")$bin$size(), n_kilobytes = pl$col("code_hex")$bin$size("kb") )
Check if values start with a binary substring
expr_bin_starts_with(prefix)
prefix |
Prefix substring. |
A polars expression
colors <- pl$DataFrame( name = c("black", "yellow", "blue"), code = as_polars_series(c("x00x00x00", "xffxffx00", "x00x00xff"))$cast(pl$Binary), prefix = as_polars_series(c("x00", "xffx00", "xffxff"))$cast(pl$Binary) ) colors$select( "name", starts_with_lit = pl$col("code")$bin$starts_with("xff"), starts_with_expr = pl$col("code")$bin$starts_with(pl$col("prefix")) )
Get the categories stored in this data type
expr_cat_get_categories()
A polars expression
df <- pl$DataFrame( cats = factor(c("z", "z", "k", "a", "b")), vals = factor(c(3, 1, 2, 2, 3)) ) df df$select( pl$col("cats")$cat$get_categories() ) df$select( pl$col("vals")$cat$get_categories() )
Determine how this categorical series should be sorted.
expr_cat_set_ordering(ordering)
ordering |
string either 'physical' or 'lexical'
|
A polars expression
df <- pl$DataFrame(
  cats = factor(c("z", "z", "k", "a", "b")),
  vals = c(3, 1, 2, 2, 3)
)

# sort by the string value of categories
df$with_columns(
  pl$col("cats")$cat$set_ordering("lexical")
)$sort("cats", "vals")

# sort by the underlying value of categories
df$with_columns(
  pl$col("cats")$cat$set_ordering("physical")
)$sort("cats", "vals")
Offset by n business days.
expr_dt_add_business_days( n, ..., week_mask = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE), holidays = as.Date(integer(0)), roll = c("raise", "backward", "forward") )
n |
An integer value or a polars expression representing the number of business days to offset by. |
... |
These dots are for future extensions and must be empty. |
week_mask |
Non-NA logical vector of length 7, representing the days of
the week to count. The default is Monday to Friday ( |
holidays |
A Date class vector, representing the holidays to exclude from the count. |
roll |
What to do when the start date lands on a non-business day. Options are:
|
A polars expression
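How `week_mask` and `holidays` interact can be sketched with a day-by-day walk in base R. This is an illustrative re-implementation under stated assumptions, not the polars algorithm: the helper `add_bdays` is a hypothetical name, it handles only non-negative `n`, and it stops with an error when the start date is not a business day (resembling `roll = "raise"`).

```r
# Hypothetical helper: step forward one calendar day at a time and count
# only days that are enabled in week_mask and are not holidays.
add_bdays <- function(start, n,
                      week_mask = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
                      holidays = as.Date(character(0))) {
  is_business <- function(d) {
    wday <- as.POSIXlt(d)$wday         # 0 = Sunday, ..., 6 = Saturday
    idx <- if (wday == 0) 7 else wday  # 1 = Monday, ..., 7 = Sunday
    week_mask[idx] && !(d %in% holidays)
  }
  stopifnot(n >= 0, is_business(start))
  d <- start
  while (n > 0) {
    d <- d + 1
    if (is_business(d)) n <- n - 1
  }
  d
}

start <- as.Date("2020-01-01")  # a Wednesday
add_bdays(start, 5)
add_bdays(start, 5, week_mask = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE))
add_bdays(start, 5, holidays = as.Date(c("2020-01-03", "2020-01-06")))
```

Walking one day at a time keeps the logic easy to follow; a production implementation would typically compute whole weeks arithmetically instead.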
df <- pl$DataFrame(start = as.Date(c("2020-1-1", "2020-1-2")))
df$with_columns(result = pl$col("start")$dt$add_business_days(5))

# You can pass a custom weekend - for example, if you only take Sunday off:
week_mask <- c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)
df$with_columns(
  result = pl$col("start")$dt$add_business_days(5, week_mask = week_mask)
)

# You can also pass a list of holidays:
holidays <- as.Date(c("2020-1-3", "2020-1-6"))
df$with_columns(
  result = pl$col("start")$dt$add_business_days(5, holidays = holidays)
)

# Roll all dates forwards to the next business day:
df <- pl$DataFrame(start = as.Date(c("2020-1-5", "2020-1-6")))
df$with_columns(
  rolled_forwards = pl$col("start")$dt$add_business_days(0, roll = "forward")
)
This computes the offset between a time zone and UTC. This is usually constant for all datetimes in a given time zone, but may vary in the rare case that a country switches time zone, like Samoa (Apia) did at the end of 2011. Use $dt$dst_offset() to take daylight saving time into account.
expr_dt_base_utc_offset()
A polars expression
df <- pl$DataFrame(
  x = as.POSIXct(c("2011-12-29", "2012-01-01"), tz = "Pacific/Apia")
)
df$with_columns(base_utc_offset = pl$col("x")$dt$base_utc_offset())
Cast the underlying data to another time unit. This may lose precision.
expr_dt_cast_time_unit(time_unit)
time_unit |
One of "ns", "us", or "ms". |
A polars expression
df <- pl$select(
  date = pl$datetime_range(
    start = as.Date("2001-1-1"),
    end = as.Date("2001-1-3"),
    interval = "1d1s"
  )
)
df$with_columns(
  cast_time_unit_ms = pl$col("date")$dt$cast_time_unit("ms"),
  cast_time_unit_ns = pl$col("date")$dt$cast_time_unit("ns"),
)
Returns the century number in the calendar date.
expr_dt_century()
A polars expression
df <- pl$DataFrame(
  date = as.Date(
    c("999-12-31", "1897-05-07", "2000-01-01", "2001-07-05", "3002-10-20")
  )
)
df$with_columns(
  century = pl$col("date")$dt$century()
)
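As a rough illustration of what a century number means, the following base-R sketch assumes the usual convention that the 21st century starts in 2001 (so 2000 still belongs to the 20th). The helper century_sketch() is hypothetical, and the convention is an assumption rather than something stated above.

```r
# Hypothetical helper: century under the "year 1 starts century 1" convention.
century_sketch <- function(date) {
  year <- as.integer(format(date, "%Y"))
  (year - 1L) %/% 100L + 1L
}

century_sketch(as.Date(c("1897-05-07", "2000-01-01", "2001-07-05")))
```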
If the underlying expression is a Datetime then its time component is replaced, and if it is a Date then a new Datetime is created by combining the two values.
expr_dt_combine(time, time_unit = c("us", "ns", "ms"))
time |
The time component to combine with the date or datetime. Can be an Expr or a PTime. |
time_unit |
One of "us" (the default), "ns", or "ms". |
A polars expression
df <- pl$DataFrame(
  dtm = c(
    ISOdatetime(2022, 12, 31, 10, 30, 45),
    ISOdatetime(2023, 7, 5, 23, 59, 59)
  ),
  dt = c(ISOdate(2022, 10, 10), ISOdate(2022, 7, 5)),
  tm = hms::parse_hms(c("1:2:3.456000", "7:8:9.101000"))
)
df

df$select(
  d1 = pl$col("dtm")$dt$combine(pl$col("tm")),
  d2 = pl$col("dt")$dt$combine(pl$col("tm")),
  d3 = pl$col("dt")$dt$combine(hms::parse_hms("4:5:6"))
)
If converting from a time-zone-naive datetime, then conversion will happen as if converting from UTC, regardless of your system’s time zone.
expr_dt_convert_time_zone(time_zone)
time_zone |
A character time zone from base::OlsonNames(). |
A polars expression
df <- pl$select(
  date = pl$datetime_range(
    as.POSIXct("2020-03-01", tz = "UTC"),
    as.POSIXct("2020-05-01", tz = "UTC"),
    "1mo"
  )
)
df$with_columns(
  London = pl$col("date")$dt$convert_time_zone("Europe/London")
)
Extract the date from a Date or Datetime.
expr_dt_date()
A polars expression
df <- pl$DataFrame(
  datetime = as.POSIXct(c("1978-1-1 1:1:1", "1897-5-7 00:00:00"), tz = "UTC")
)
df$with_columns(
  date = pl$col("datetime")$dt$date()
)
Returns the day of month starting from 1. The return value ranges from 1 to 31 (the last day of month differs across months).
expr_dt_day()
A polars expression
df <- pl$DataFrame(
  date = pl$date_range(
    as.Date("2020-12-25"),
    as.Date("2021-1-05"),
    interval = "1d",
    time_zone = "GMT"
  )
)
df$with_columns(
  pl$col("date")$dt$day()$alias("day")
)
This computes the offset between a time zone and UTC, taking into account daylight saving time. Use $dt$base_utc_offset() to avoid counting DST.
expr_dt_dst_offset()
A polars expression
df <- pl$DataFrame(
  x = as.POSIXct(c("2020-10-25", "2020-10-26"), tz = "Europe/London")
)
df$with_columns(dst_offset = pl$col("x")$dt$dst_offset())
Get the time passed since the Unix epoch in the given time unit.
expr_dt_epoch(time_unit = c("us", "ns", "ms", "s", "d"))
time_unit |
Time unit, one of "us" (the default), "ns", "ms", "s", or "d". |
A polars expression
df <- pl$DataFrame(date = pl$date_range(as.Date("2001-1-1"), as.Date("2001-1-3")))
df$with_columns(
  epoch_us = pl$col("date")$dt$epoch(),
  epoch_s = pl$col("date")$dt$epoch(time_unit = "s")
)
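For intuition about how the time units relate, this base-R sketch computes the same kind of epoch values by hand. It relies only on the fact that R stores a Date as the number of days since 1970-01-01; the variable names are illustrative.

```r
dates <- as.Date(c("2001-01-01", "2001-01-02", "2001-01-03"))

epoch_d <- as.numeric(dates)   # days since 1970-01-01
epoch_s <- epoch_d * 86400     # seconds: 24 * 60 * 60 per day
epoch_us <- epoch_s * 1e6      # microseconds

epoch_d
epoch_s
```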
Returns the hour number from 0 to 23.
expr_dt_hour()
A polars expression
df <- pl$DataFrame(
  date = pl$datetime_range(
    as.Date("2020-12-25"),
    as.Date("2021-1-05"),
    interval = "1d2h",
    time_zone = "GMT"
  )
)
df$with_columns(
  pl$col("date")$dt$hour()$alias("hour")
)
Determine whether the year of the underlying date is a leap year
expr_dt_is_leap_year()
A polars expression
df <- pl$DataFrame(date = as.Date(c("2000-01-01", "2001-01-01", "2002-01-01")))
df$with_columns(
  leap_year = pl$col("date")$dt$is_leap_year()
)
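The Gregorian leap-year rule behind this check can be written in a few lines of base R. The helper is_leap_year_sketch() is hypothetical and takes plain year numbers rather than a polars expression.

```r
# Hypothetical helper: Gregorian leap-year rule on plain year numbers.
is_leap_year_sketch <- function(year) {
  # Divisible by 4, except century years, which must be divisible by 400
  (year %% 4 == 0 & year %% 100 != 0) | year %% 400 == 0
}

is_leap_year_sketch(c(2000, 2001, 2002, 1900, 2024))
```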
Returns the year number in the ISO standard. This may not correspond with the calendar year.
expr_dt_iso_year()
A polars expression
df <- pl$DataFrame(
  date = as.Date(c("1977-01-01", "1978-01-01", "1979-01-01"))
)
df$with_columns(
  year = pl$col("date")$dt$year(),
  iso_year = pl$col("date")$dt$iso_year()
)
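To see where the ISO year and the calendar year diverge, base R's "%G" format (the ISO 8601 week-based year) can be compared with "%Y" on the same dates. This is only a cross-check in base R, not the polars code path.

```r
dates <- as.Date(c("1977-01-01", "1978-01-01", "1979-01-01"))

calendar_year <- as.integer(format(dates, "%Y"))
# January 1st belongs to the previous ISO year when it falls in that year's last ISO week
iso_year <- as.integer(format(dates, "%G"))

data.frame(date = dates, calendar_year, iso_year)
```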
Extract microseconds from underlying Datetime representation
expr_dt_microsecond()
A polars expression
df <- pl$DataFrame(
  datetime = as.POSIXct(
    c(
      "1978-01-01 01:01:01",
      "2024-10-13 05:30:14.500",
      "2065-01-01 10:20:30.06"
    ),
    "UTC"
  )
)
df$with_columns(
  microsecond = pl$col("datetime")$dt$microsecond()
)
Extract milliseconds from underlying Datetime representation
expr_dt_millisecond()
A polars expression
df <- pl$DataFrame(
  datetime = as.POSIXct(
    c(
      "1978-01-01 01:01:01",
      "2024-10-13 05:30:14.500",
      "2065-01-01 10:20:30.06"
    ),
    "UTC"
  )
)
df$with_columns(
  millisecond = pl$col("datetime")$dt$millisecond()
)
Returns the minute number from 0 to 59.
expr_dt_minute()
A polars expression
df <- pl$DataFrame(
  datetime = as.POSIXct(
    c(
      "1978-01-01 01:01:01",
      "2024-10-13 05:30:14.500",
      "2065-01-01 10:20:30.06"
    ),
    "UTC"
  )
)
df$with_columns(
  pl$col("datetime")$dt$minute()$alias("minute")
)
Returns the month number between 1 and 12.
expr_dt_month()
A polars expression
df <- pl$DataFrame(
  date = as.Date(c("2001-01-01", "2001-06-30", "2001-12-27"))
)
df$with_columns(
  month = pl$col("date")$dt$month()
)
For datetimes, the time of day is preserved.
expr_dt_month_end()
A polars expression
df <- pl$DataFrame(date = as.Date(c("2000-01-23", "2001-01-12", "2002-01-01")))
df$with_columns(
  month_end = pl$col("date")$dt$month_end()
)
For datetimes, the time of day is preserved.
expr_dt_month_start()
A polars expression
df <- pl$DataFrame(date = as.Date(c("2000-01-23", "2001-01-12", "2002-01-01")))
df$with_columns(
  month_start = pl$col("date")$dt$month_start()
)
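For comparison, rolling a date to the first or last day of its month can be sketched in base R as follows. Both helpers are hypothetical, work on Date values only, and do not cover the time-of-day preservation mentioned for datetimes.

```r
# Hypothetical helpers for Date vectors.
month_start_sketch <- function(date) {
  as.Date(format(date, "%Y-%m-01"))
}

month_end_sketch <- function(date) {
  start <- month_start_sketch(date)
  # Day 1 plus 31 days always lands in the next month; go to that month's
  # first day and step back one day.
  month_start_sketch(start + 31) - 1
}

dates <- as.Date(c("2000-01-23", "2000-02-10", "2001-02-10"))
month_start_sketch(dates)
month_end_sketch(dates)
```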
Extract nanoseconds from underlying Datetime representation
expr_dt_nanosecond()
A polars expression
df <- pl$DataFrame(
  datetime = as.POSIXct(
    c(
      "1978-01-01 01:01:01",
      "2024-10-13 05:30:14.500",
      "2065-01-01 10:20:30.06"
    ),
    "UTC"
  )
)
df$with_columns(
  nanosecond = pl$col("datetime")$dt$nanosecond()
)
This differs from pl$col("foo") + Duration in that it can take months and leap years into account. Note that only a single minus sign is allowed in the by string, as the first character.
expr_dt_offset_by(by)
by |
A string encoding the duration to offset by (see Details), or an expression that produces such strings. |
The by argument is created with the following string language:
1ns # 1 nanosecond
1us # 1 microsecond
1ms # 1 millisecond
1s # 1 second
1m # 1 minute
1h # 1 hour
1d # 1 day
1w # 1 calendar week
1mo # 1 calendar month
1y # 1 calendar year
1i # 1 index count
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
These strings can be combined:
3d12h4m25s # 3 days, 12 hours, 4 minutes, and 25 seconds
A polars expression
df <- pl$select(
  dates = pl$date_range(
    as.Date("2000-1-1"),
    as.Date("2005-1-1"),
    "1y"
  )
)
df$with_columns(
  date_plus_1y = pl$col("dates")$dt$offset_by("1y"),
  date_negative_offset = pl$col("dates")$dt$offset_by("-1y2mo")
)

# the "by" argument also accepts expressions
df <- pl$select(
  dates = pl$datetime_range(
    as.POSIXct("2022-01-01", tz = "GMT"),
    as.POSIXct("2022-01-02", tz = "GMT"),
    interval = "6h",
    time_unit = "ms",
    time_zone = "GMT"
  ),
  offset = pl$Series(values = c("1d", "-2d", "1mo", NA, "1y"))
)
df$with_columns(new_dates = pl$col("dates")$dt$offset_by(pl$col("offset")))
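The duration string language is compact, so it can help to see a string split into its parts. The following base-R sketch tokenizes a duration string into counts and units; parse_duration_sketch() is a hypothetical helper written for illustration and is not the parser polars uses.

```r
# Hypothetical helper: split a duration string into (count, unit) pairs.
parse_duration_sketch <- function(by) {
  # Only a single leading minus sign is allowed; it negates every component
  negative <- startsWith(by, "-")
  body <- sub("^-", "", by)
  # Two-letter units come first so that "ms" and "mo" are not read as "m"
  pattern <- "([0-9]+)(ns|us|ms|mo|s|m|h|d|w|y|i)"
  pieces <- regmatches(body, gregexpr(pattern, body, perl = TRUE))[[1]]
  if (paste(pieces, collapse = "") != body) {
    stop("invalid duration string: ", by)
  }
  counts <- as.integer(sub("[a-z]+$", "", pieces))
  units <- sub("^[0-9]+", "", pieces)
  if (negative) counts <- -counts
  data.frame(count = counts, unit = units, stringsAsFactors = FALSE)
}

parse_duration_sketch("3d12h4m25s")
parse_duration_sketch("-1y2mo")
```

A string with a second minus sign, such as "1y-2mo", is rejected by this sketch because the matched pieces no longer cover the whole input.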
Returns the day of year starting from 1. The return value ranges from 1 to 366 (the last day of year differs across years).
expr_dt_ordinal_day()
A polars expression
df <- pl$select(
  date = pl$date_range(
    as.Date("2020-12-25"),
    as.Date("2021-1-05"),
    interval = "1d"
  )
)
df$with_columns(
  ordinal_day = pl$col("date")$dt$ordinal_day()
)
Returns the quarter ranging from 1 to 4.
expr_dt_quarter()
A polars expression
df <- pl$select(
  date = pl$date_range(
    as.Date("2020-12-25"),
    as.Date("2021-1-05"),
    interval = "1d"
  )
)
df$with_columns(
  quarter = pl$col("date")$dt$quarter()
)
Different from $dt$convert_time_zone(), this will also modify the underlying timestamp and will ignore the original time zone.
expr_dt_replace_time_zone( time_zone, ..., ambiguous = c("raise", "earliest", "latest", "null"), non_existent = c("raise", "null") )
time_zone |
|
... |
These dots are for future extensions and must be empty. |
ambiguous |
Determine how to deal with ambiguous datetimes. Character vector or expression containing one of the following: "raise" (the default), "earliest", "latest", or "null". |
non_existent |
Determine how to deal with non-existent datetimes. One of the following: "raise" (the default) or "null". |
A polars expression
df <- pl$select(
  london_timezone = pl$datetime_range(
    as.Date("2020-03-01"),
    as.Date("2020-07-01"),
    "1mo",
    time_zone = "UTC"
  )$dt$convert_time_zone(time_zone = "Europe/London")
)
df$with_columns(
  London_to_Amsterdam = pl$col("london_timezone")$dt$replace_time_zone(time_zone = "Europe/Amsterdam")
)

# You can use `ambiguous` to deal with ambiguous datetimes:
dates <- c(
  "2018-10-28 01:30",
  "2018-10-28 02:00",
  "2018-10-28 02:30",
  "2018-10-28 02:00"
) |>
  as.POSIXct("UTC")

df2 <- pl$DataFrame(
  ts = as_polars_series(dates),
  ambiguous = c("earliest", "earliest", "latest", "latest"),
)
df2$with_columns(
  ts_localized = pl$col("ts")$dt$replace_time_zone(
    "Europe/Brussels",
    ambiguous = pl$col("ambiguous")
  )
)
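The difference between converting and replacing a time zone is easiest to see on the underlying timestamp. This base-R sketch mimics both operations on a POSIXct value; it assumes the system time zone database knows "Europe/London" and "Europe/Amsterdam", and it is an analogy rather than the polars implementation.

```r
x <- as.POSIXct("2020-03-01 12:00:00", tz = "UTC")

# Analogous to $dt$convert_time_zone(): same instant, different wall-clock display
converted <- x
attr(converted, "tzone") <- "Europe/London"

# Analogous to $dt$replace_time_zone(): keep the wall-clock reading and
# reinterpret it in the new zone, which changes the underlying timestamp
wall_clock <- format(converted, "%Y-%m-%d %H:%M:%S")
replaced <- as.POSIXct(wall_clock, tz = "Europe/Amsterdam")

as.numeric(converted) - as.numeric(x) # 0: the instant is unchanged
as.numeric(replaced) - as.numeric(x)  # -3600: noon in Amsterdam is 11:00 UTC
```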
Divide the date/datetime range into buckets. Each date/datetime in the first half of the interval is mapped to the start of its bucket. Each date/datetime in the second half of the interval is mapped to the end of its bucket. Ambiguous results are localised using the DST offset of the original timestamp: for example, rounding '2022-11-06 01:20:00 CST' by '1h' results in '2022-11-06 01:00:00 CST', whereas rounding '2022-11-06 01:20:00 CDT' by '1h' results in '2022-11-06 01:00:00 CDT'.
expr_dt_round(every)
every |
Either an Expr or a string indicating a column name or a duration (see Details). |
The every argument is created with the following string language:
1ns # 1 nanosecond
1us # 1 microsecond
1ms # 1 millisecond
1s # 1 second
1m # 1 minute
1h # 1 hour
1d # 1 day
1w # 1 calendar week
1mo # 1 calendar month
1y # 1 calendar year
These strings can be combined:
3d12h4m25s # 3 days, 12 hours, 4 minutes, and 25 seconds
A polars expression
df <- pl$select(
  datetime = pl$datetime_range(
    as.Date("2001-01-01"),
    as.Date("2001-01-02"),
    as.difftime("0:25:0")
  )
)
df$with_columns(round = pl$col("datetime")$dt$round("1h"))

df <- pl$select(
  datetime = pl$datetime_range(
    as.POSIXct("2001-01-01 00:00"),
    as.POSIXct("2001-01-01 01:00"),
    as.difftime("0:10:0")
  )
)
df$with_columns(round = pl$col("datetime")$dt$round("1h"))
Returns the integer second number from 0 to 59, or a floating point number from 0 to 60 if fractional = TRUE that includes any milli/micro/nanosecond component.
expr_dt_second(fractional = FALSE)
fractional |
If TRUE, include the fractional component of the second. |
A polars expression
df <- pl$DataFrame(
  datetime = as.POSIXct(
    c(
      "1978-01-01 01:01:01",
      "2024-10-13 05:30:14.500",
      "2065-01-01 10:20:30.06"
    ),
    "UTC"
  )
)
df$with_columns(
  second = pl$col("datetime")$dt$second(),
  second_fractional = pl$col("datetime")$dt$second(fractional = TRUE)
)
Similar to $cast(pl$String), but this method allows you to customize the formatting of the resulting string. This is an alias for $dt$to_string().
expr_dt_strftime(format)
format |
Single string of format to use, or
|
A polars expression
pl$DataFrame(
  datetime = c(as.POSIXct(c("2021-01-02 00:00:00", "2021-01-03 00:00:00")))
)$with_columns(
  datetime_string = pl$col("datetime")$dt$strftime("%Y/%m/%d %H:%M:%S")
)
This only works on Datetime columns; it raises an error on Date columns.
expr_dt_time()
A polars expression
df <- pl$select(dates = pl$datetime_range(
  as.Date("2000-1-1"),
  as.Date("2000-1-2"),
  "1h"
))
df$with_columns(times = pl$col("dates")$dt$time())
Get timestamp in the given time unit
expr_dt_timestamp(time_unit = c("us", "ns", "ms"))
time_unit |
Time unit, one of 'ns', 'us', or 'ms'. |
A polars expression
df <- pl$select(
  date = pl$datetime_range(
    start = as.Date("2001-1-1"),
    end = as.Date("2001-1-3"),
    interval = "1d1s"
  )
)
df$select(
  pl$col("date"),
  pl$col("date")$dt$timestamp()$alias("timestamp_us"),
  pl$col("date")$dt$timestamp(time_unit = "ms")$alias("timestamp_ms")
)
Similar to $cast(pl$String), but this method allows you to customize the formatting of the resulting string; if no format is provided, the appropriate ISO format for the underlying data type is used.
expr_dt_to_string(format = NULL)
format |
Single string of format to use, or NULL (the default), in which case the ISO format for the underlying data type is used. |
A polars expression
df <- pl$DataFrame(
  dt = as.Date(c("1990-03-01", "2020-05-03", "2077-07-05")),
  dtm = as.POSIXct(c("1980-08-10 00:10:20", "2010-10-20 08:25:35", "2040-12-30 16:40:50")),
  tm = hms::as_hms(c("01:02:03.456789", "23:59:09.101", "00:00:00.000100")),
  dur = clock::duration_days(c(-1, 14, 0)) + clock::duration_hours(c(0, -10, 0)) +
    clock::duration_seconds(c(-42, 0, 0)) + clock::duration_microseconds(c(0, 1001, 0)),
)

# Default format for temporal dtypes is ISO8601:
df$select((cs$date() | cs$datetime())$dt$to_string()$name$prefix("s_"))
df$select((cs$time() | cs$duration())$dt$to_string()$name$prefix("s_"))

# All temporal types (aside from Duration) support strftime formatting:
df$select(
  pl$col("dtm"),
  s_dtm = pl$col("dtm")$dt$to_string("%Y/%m/%d (%H.%M.%S)"),
)

# The Polars Duration string format is also available:
df$select(pl$col("dur"), s_dur = pl$col("dur")$dt$to_string("polars"))

# If you’re interested in extracting the day or month names,
# you can use the '%A' and '%B' strftime specifiers:
df$select(
  pl$col("dt"),
  day_name = pl$col("dtm")$dt$to_string("%A"),
  month_name = pl$col("dtm")$dt$to_string("%B"),
)
Extract the days from a Duration type
expr_dt_total_days()
A polars expression
df <- pl$select(
  date = pl$datetime_range(
    start = as.Date("2020-3-1"),
    end = as.Date("2020-5-1"),
    interval = "1mo1s"
  )
)
df$with_columns(
  diff_days = pl$col("date")$diff()$dt$total_days()
)
Extract the hours from a Duration type
expr_dt_total_hours()
A polars expression
df <- pl$select(
  date = pl$date_range(
    start = as.Date("2020-1-1"),
    end = as.Date("2020-1-4"),
    interval = "1d"
  )
)
df$with_columns(
  diff_hours = pl$col("date")$diff()$dt$total_hours()
)
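As a point of comparison, base R expresses the same idea with difftime objects, where one duration can be read out in several units. This sketch is an analogy only; note that base R's diff() drops the first element instead of returning a missing value for it.

```r
dates <- as.Date(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"))

step <- diff(dates) # difftime values, one day each

total_hours <- as.numeric(step, units = "hours")
total_minutes <- as.numeric(step, units = "mins")

total_hours
total_minutes
```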
Extract the microseconds from a Duration type
expr_dt_total_microseconds()
A polars expression
df <- pl$select(date = pl$datetime_range(
  start = as.POSIXct("2020-1-1", tz = "GMT"),
  end = as.POSIXct("2020-1-1 00:00:01", tz = "GMT"),
  interval = "1ms"
))
df$with_columns(
  diff_microsec = pl$col("date")$diff()$dt$total_microseconds()
)
Extract the milliseconds from a Duration type
expr_dt_total_milliseconds()
A polars expression
df <- pl$select(date = pl$datetime_range(
  start = as.POSIXct("2020-1-1", tz = "GMT"),
  end = as.POSIXct("2020-1-1 00:00:01", tz = "GMT"),
  interval = "1ms"
))
df$with_columns(
  diff_millisec = pl$col("date")$diff()$dt$total_milliseconds()
)
Extract the minutes from a Duration type
expr_dt_total_minutes()
A polars expression
df <- pl$select(
  date = pl$date_range(
    start = as.Date("2020-1-1"),
    end = as.Date("2020-1-4"),
    interval = "1d"
  )
)
df$with_columns(
  diff_minutes = pl$col("date")$diff()$dt$total_minutes()
)
Extract the nanoseconds from a Duration type
expr_dt_total_nanoseconds()
A polars expression
df <- pl$select(date = pl$datetime_range(
  start = as.POSIXct("2020-1-1", tz = "GMT"),
  end = as.POSIXct("2020-1-1 00:00:01", tz = "GMT"),
  interval = "1ms"
))
df$with_columns(
  diff_nanosec = pl$col("date")$diff()$dt$total_nanoseconds()
)
Extract the seconds from a Duration type
expr_dt_total_seconds()
A polars expression
df <- pl$select(date = pl$datetime_range(
  start = as.POSIXct("2020-1-1", tz = "GMT"),
  end = as.POSIXct("2020-1-1 00:04:00", tz = "GMT"),
  interval = "1m"
))
df$with_columns(
  diff_sec = pl$col("date")$diff()$dt$total_seconds()
)
Divide the date/datetime range into buckets. Each date/datetime is mapped to the start of its bucket using the corresponding local datetime. Note that weekly buckets start on Monday. Ambiguous results are localised using the DST offset of the original timestamp: for example, truncating '2022-11-06 01:30:00 CST' by '1h' results in '2022-11-06 01:00:00 CST', whereas truncating '2022-11-06 01:30:00 CDT' by '1h' results in '2022-11-06 01:00:00 CDT'.
expr_dt_truncate(every)
every |
Either an Expr or a string indicating a column name or a duration (see Details). |
The every argument is created with the following string language:
1ns # 1 nanosecond
1us # 1 microsecond
1ms # 1 millisecond
1s # 1 second
1m # 1 minute
1h # 1 hour
1d # 1 day
1w # 1 calendar week
1mo # 1 calendar month
1y # 1 calendar year
These strings can be combined:
3d12h4m25s # 3 days, 12 hours, 4 minutes, and 25 seconds
A polars expression
df <- pl$select(
  datetime = pl$datetime_range(
    as.Date("2001-01-01"),
    as.Date("2001-01-02"),
    as.difftime("0:25:0")
  )
)
df$with_columns(truncated = pl$col("datetime")$dt$truncate("1h"))

df <- pl$select(
  datetime = pl$datetime_range(
    as.POSIXct("2001-01-01 00:00"),
    as.POSIXct("2001-01-01 01:00"),
    as.difftime("0:10:0")
  )
)
df$with_columns(truncated = pl$col("datetime")$dt$truncate("30m"))
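The bucket logic behind truncating, and how it differs from rounding, can be sketched with plain arithmetic on seconds since the epoch. This base-R sketch assumes a fixed-length bucket ("1h") and UTC timestamps, so it sidesteps calendar units and DST; sending an exact midpoint to the later bucket is an assumption of the sketch.

```r
every <- 3600 # bucket width in seconds, i.e. "1h"
x <- as.POSIXct(c("2001-01-01 00:20:00", "2001-01-01 00:40:00"), tz = "UTC")
secs <- as.numeric(x)

# Truncate: always go to the start of the bucket
truncated <- as.POSIXct(
  floor(secs / every) * every,
  origin = "1970-01-01", tz = "UTC"
)

# Round: first half of the bucket goes to its start, second half to its end
rounded <- as.POSIXct(
  floor((secs + every / 2) / every) * every,
  origin = "1970-01-01", tz = "UTC"
)

format(truncated, "%H:%M")
format(rounded, "%H:%M")
```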
Returns the ISO week number starting from 1. The return value ranges from 1 to 53 (the last week of year differs across years).
expr_dt_week()
A polars expression
df <- pl$select(
  date = pl$date_range(
    as.Date("2020-12-25"),
    as.Date("2021-1-05"),
    interval = "1d"
  )
)
df$with_columns(
  week = pl$col("date")$dt$week()
)
Returns the ISO weekday number where Monday = 1 and Sunday = 7.
expr_dt_weekday()
A polars expression
df <- pl$select(
  date = pl$date_range(
    as.Date("2020-12-25"),
    as.Date("2021-1-05"),
    interval = "1d"
  )
)
df$with_columns(
  weekday = pl$col("date")$dt$weekday()
)
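The ISO numbering can be cross-checked in base R, where the "%u" format gives the ISO weekday (Monday = 1, Sunday = 7) and "%V" gives the ISO week number. This is a comparison sketch in base R, not the polars code path.

```r
dates <- seq(as.Date("2020-12-25"), as.Date("2020-12-28"), by = "day")

iso_weekday <- as.integer(format(dates, "%u")) # Friday, Saturday, Sunday, Monday
iso_week <- as.integer(format(dates, "%V"))    # the week changes on Monday

data.frame(date = dates, iso_weekday, iso_week)
```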
This is deprecated. Cast to Int64 and then to Datetime instead.
expr_dt_with_time_unit(time_unit = c("ns", "us", "ms"))
time_unit |
Time unit, one of 'ns', 'us', or 'ms'. |
A polars expression
df <- pl$select(
  date = pl$datetime_range(
    start = as.Date("2001-1-1"),
    end = as.Date("2001-1-3"),
    interval = "1d1s"
  )
)
df$with_columns(
  with_time_unit_ns = pl$col("date")$dt$with_time_unit(),
  with_time_unit_ms = pl$col("date")$dt$with_time_unit(time_unit = "ms")
)
Returns the year number in the calendar date.
expr_dt_year()
A polars expression
df <- pl$DataFrame(
  date = as.Date(c("1977-01-01", "1978-01-01", "1979-01-01"))
)
df$with_columns(
  year = pl$col("date")$dt$year(),
  iso_year = pl$col("date")$dt$iso_year()
)
Evaluate whether all boolean values in a sub-list are true
expr_list_all()
A polars expression
df <- pl$DataFrame(
  a = list(c(TRUE, TRUE), c(FALSE, TRUE), c(FALSE, FALSE), NA, c())
)
df$with_columns(all = pl$col("a")$list$all())
Evaluate whether any boolean value in a sub-list is true
expr_list_any()
A polars expression
df <- pl$DataFrame(
  a = list(c(TRUE, TRUE), c(FALSE, TRUE), c(FALSE, FALSE), NA, c())
)
df$with_columns(any = pl$col("a")$list$any())
Retrieve the index of the maximum value in every sub-list
expr_list_arg_max()
A polars expression
df <- pl$DataFrame(s = list(1:2, 2:1))
df$with_columns(
  arg_max = pl$col("s")$list$arg_max()
)
Retrieve the index of the minimum value in every sub-list
expr_list_arg_min()
A polars expression
df <- pl$DataFrame(s = list(1:2, 2:1))
df$with_columns(
  arg_min = pl$col("s")$list$arg_min()
)
Concat the lists into a new list
expr_list_concat(other)
other |
Values to concat with. Can be an Expr or something coercible to an Expr. |
A polars expression
df <- pl$DataFrame(
  a = list("a", "x"),
  b = list(c("b", "c"), c("y", "z"))
)
df$with_columns(
  conc_to_b = pl$col("a")$list$concat(pl$col("b")),
  conc_to_lit_str = pl$col("a")$list$concat(pl$lit("some string")),
  conc_to_lit_list = pl$col("a")$list$concat(pl$lit(list("hello", c("hello", "world"))))
)
Check if sub-lists contain a given value
expr_list_contains(item)
item |
Item that will be checked for membership. Can be an Expr or something coercible to an Expr. Strings are not parsed as columns. |
A polars expression
df <- pl$DataFrame(
  a = list(3:1, NULL, 1:2),
  item = 0:2
)
df$with_columns(
  with_expr = pl$col("a")$list$contains(pl$col("item")),
  with_lit = pl$col("a")$list$contains(1)
)
Count how often the value produced by element occurs in every sub-list
expr_list_count_matches(element)
element |
An expression that produces a single value. |
A polars expression
df <- pl$DataFrame(a = list(0, 1, c(1, 2, 3, 2), c(1, 2, 1), c(4, 4)))
df$with_columns(
  number_of_twos = pl$col("a")$list$count_matches(2)
)
This computes the first discrete difference between shifted items of every list. The parameter n gives the interval between items to subtract, e.g. if n = 2 the output will be the difference between the 1st and the 3rd value, the 2nd and 4th value, etc.
expr_list_diff(n = 1, null_behavior = c("ignore", "drop"))
n |
Number of slots to shift. If negative, then it starts from the end. |
null_behavior |
How to handle |
A polars expression
df <- pl$DataFrame(s = list(1:4, c(10L, 2L, 1L)))
df$with_columns(diff = pl$col("s")$list$diff(2))
# negative value starts shifting from the end
df$with_columns(diff = pl$col("s")$list$diff(-2))
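The interval rule for n can be pictured with a small base R sketch that does not call polars. lagged_diff is a hypothetical helper, and padding the first n positions with NA is an assumption about how positions without an earlier partner are reported; negative n is not covered here.

```r
# Hypothetical helper: lagged difference that keeps the input length,
# padding the first `n` positions with NA (assumed behaviour, see lead-in).
lagged_diff <- function(x, n = 1L) {
  stopifnot(n >= 1L)
  if (length(x) <= n) {
    return(rep(NA_real_, length(x)))
  }
  c(rep(NA_real_, n), diff(x, lag = n))
}

d1 <- lagged_diff(c(1, 2, 3, 4), n = 2L) # 3rd - 1st, then 4th - 2nd
d2 <- lagged_diff(c(10, 2, 1), n = 2L)   # only 3rd - 1st is defined
```

With n = 2, each defined entry is the value at a position minus the value two positions earlier.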
Drop all null values in every sub-list
expr_list_drop_nulls()
A polars expression
df <- pl$DataFrame(values = list(c(NA, 0, NA), c(1, NaN), NA))
df$with_columns(
  without_nulls = pl$col("values")$list$drop_nulls()
)
Run any polars expression on the sub-lists' values
expr_list_eval(expr, ..., parallel = FALSE)
expr |
Expression to run. Note that you can select an element with
|
parallel |
Run all expressions in parallel. Don't activate this blindly.
Parallelism is worth it if there is enough work to do per thread. This
likely should not be used in the |
A polars expression
df <- pl$DataFrame(
  a = list(c(1, 8, 3), c(3, 2), c(NA, NA, 1)),
  b = list(c("R", "is", "amazing"), c("foo", "bar"), "text")
)
df
# standardize each value inside a list, using only the values in this list
df$select(
  a_stand = pl$col("a")$list$eval(
    (pl$element() - pl$element()$mean()) / pl$element()$std()
  )
)
# count characters for each element in list. Since column "b" is list[str],
# we can apply all `$str` functions on elements in the list:
df$select(
  b_len_chars = pl$col("b")$list$eval(
    pl$element()$str$len_chars()
  )
)
# concat strings in each list
df$select(
  pl$col("b")$list$eval(pl$element()$str$join(" "))$list$first()
)
Returns a column with a separate row for every list element
expr_list_explode()
A polars expression
df <- pl$DataFrame(a = list(c(1, 2, 3), c(4, 5, 6)))
df$select(pl$col("a")$list$explode())
Get the first value of the sub-lists
expr_list_first()
A polars expression
df <- pl$DataFrame(a = list(3:1, NULL, 1:2))
df$with_columns(
  first = pl$col("a")$list$first()
)
This allows extracting several values per list. To extract a single value by index, use $list$get(). The indices may be defined in a single column, or by sub-lists in another column of dtype List.
expr_list_gather(index, ..., null_on_oob = FALSE)
index |
An Expr or something coercible to an Expr, that can return
several indices. Values are 0-indexed (so index 0 would return the
first item of every sub-list) and negative values start from the end (index
|
... |
These dots are for future extensions and must be empty. |
null_on_oob |
If |
A polars expression
df <- pl$DataFrame(
  a = list(c(3, 2, 1), 1, c(1, 2)),
  idx = list(0:1, integer(), c(1L, 999L))
)
df$with_columns(
  gathered = pl$col("a")$list$gather("idx", null_on_oob = TRUE)
)
df$with_columns(
  gathered = pl$col("a")$list$gather(2, null_on_oob = TRUE)
)
# by some column name, must cast to an Int/Uint type to work
df$with_columns(
  gathered = pl$col("a")$list$gather(pl$col("a")$cast(pl$List(pl$UInt64)), null_on_oob = TRUE)
)
Take every n-th value starting from offset in sub-lists
expr_list_gather_every(n, offset = 0)
A polars expression
df <- pl$DataFrame(
  a = list(1:5, 6:8, 9:12),
  n = c(2, 1, 3),
  offset = c(0, 1, 0)
)
df$with_columns(
  gather_every = pl$col("a")$list$gather_every(pl$col("n"), offset = pl$col("offset"))
)
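As a rough illustration of the selection rule, the base R sketch below mirrors the three rows of the example without calling polars. take_every is a hypothetical helper, and treating offset as a 0-based start position is an assumption based on the example data.

```r
# Hypothetical helper: keep every n-th element, starting at a 0-based offset.
take_every <- function(x, n, offset = 0L) {
  if (offset >= length(x)) {
    return(x[0]) # nothing left to take
  }
  x[seq(from = offset + 1L, to = length(x), by = n)]
}

g1 <- take_every(1:5, n = 2)             # 0-based positions 0, 2, 4
g2 <- take_every(6:8, n = 1, offset = 1) # 0-based positions 1, 2
g3 <- take_every(9:12, n = 3)            # 0-based positions 0, 3
```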
This allows extracting only one value per list. To extract several values by index, use $list$gather().
expr_list_get(index, ..., null_on_oob = TRUE)
index |
An Expr or something coercible to an Expr, that must return a
single index. Values are 0-indexed (so index 0 would return the first item
of every sub-list) and negative values start from the end (index |
... |
These dots are for future extensions and must be empty. |
null_on_oob |
If |
Expr
df <- pl$DataFrame(
  values = list(c(2, 2, NA), c(1, 2, 3), NA, NULL),
  idx = c(1, 2, NA, 3)
)
df$with_columns(
  using_expr = pl$col("values")$list$get("idx"),
  val_0 = pl$col("values")$list$get(0),
  val_minus_1 = pl$col("values")$list$get(-1),
  val_oob = pl$col("values")$list$get(10)
)
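Because R vectors are 1-indexed, the 0-based convention used here can be easy to misread. The base R sketch below does not call polars; get0 is a hypothetical helper, and returning NA for an out-of-bounds index is an assumption standing in for a null result when null_on_oob = TRUE.

```r
# Hypothetical helper mimicking a 0-based lookup with negative indices.
get0 <- function(x, index, null_on_oob = TRUE) {
  n <- length(x)
  # Convert the 0-based (or negative) index to R's 1-based position.
  pos <- if (index < 0) n + index + 1 else index + 1
  if (pos < 1 || pos > n) {
    if (null_on_oob) {
      return(NA)
    }
    stop("index out of bounds")
  }
  x[[pos]]
}

first_value <- get0(c(1, 2, 3), 0)  # index 0 is the first item
last_value <- get0(c(1, 2, 3), -1)  # index -1 is the last item
oob_value <- get0(c(1, 2, 3), 10)   # out of bounds, reported as NA here
```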
Slice the first n values of every sub-list
expr_list_head(n = 5L)
n |
Number of values to return for each sub-list. Can be an Expr. Strings are parsed as column names. |
A polars expression
df <- pl$DataFrame(
  s = list(1:4, c(10L, 2L, 1L)),
  n = 1:2
)
df$with_columns(
  head_by_expr = pl$col("s")$list$head("n"),
  head_by_lit = pl$col("s")$list$head(2)
)
Join all string items in a sub-list and place a separator between them. This only works if the inner dtype is String.
expr_list_join(separator, ..., ignore_nulls = FALSE)
separator |
String to separate the items with. Can be an Expr. Strings are not parsed as columns. |
... |
These dots are for future extensions and must be empty. |
A polars expression
df <- pl$DataFrame(
  s = list(c("a", "b", "c"), c("x", "y"), c("e", NA)),
  separator = c("-", "+", "/")
)
df$with_columns(
  join_with_expr = pl$col("s")$list$join(pl$col("separator")),
  join_with_lit = pl$col("s")$list$join(" "),
  join_ignore_null = pl$col("s")$list$join(" ", ignore_nulls = TRUE)
)
Get the last value of the sub-lists
expr_list_last()
A polars expression
df <- pl$DataFrame(a = list(3:1, NULL, 1:2))
df$with_columns(
  last = pl$col("a")$list$last()
)
Null values are counted in the total.
expr_list_len()
A polars expression
df <- pl$DataFrame(list_of_strs = list(c("a", "b", NA), "c"))
df$with_columns(len_list = pl$col("list_of_strs")$list$len())
Compute the maximum value in every sub-list
expr_list_max()
A polars expression
df <- pl$DataFrame(values = list(c(1, 2, 3, NA), c(2, 3), NA))
df$with_columns(max = pl$col("values")$list$max())
Compute the mean value in every sub-list
expr_list_mean()
A polars expression
df <- pl$DataFrame(values = list(c(1, 2, 3, NA), c(2, 3), NA))
df$with_columns(mean = pl$col("values")$list$mean())
Compute the median in every sub-list
expr_list_median()
A polars expression
df <- pl$DataFrame(values = list(c(-1, 0, 1), c(1, 10)))
df$with_columns(
  median = pl$col("values")$list$median()
)
Compute the minimum value in every sub-list
expr_list_min()
A polars expression
df <- pl$DataFrame(values = list(c(1, 2, 3, NA), c(2, 3), NA))
df$with_columns(min = pl$col("values")$list$min())
Count the number of unique values in every sub-list
expr_list_n_unique()
A polars expression
df <- pl$DataFrame(values = list(c(2, 2, NA), c(1, 2, 3), NA))
df$with_columns(unique = pl$col("values")$list$n_unique())
Reverse values in every sub-list
expr_list_reverse()
A polars expression
df <- pl$DataFrame(values = list(c(1, 2, 3, NA), c(2, 3), NA))
df$with_columns(reverse = pl$col("values")$list$reverse())
Sample values from every sub-list
expr_list_sample(
  n = NULL,
  ...,
  fraction = NULL,
  with_replacement = FALSE,
  shuffle = FALSE,
  seed = NULL
)
n |
Number of items to return. Cannot be used with |
... |
These dots are for future extensions and must be empty. |
fraction |
Fraction of items to return. Cannot be used with |
with_replacement |
Allow values to be sampled more than once. |
shuffle |
Shuffle the order of sampled data points. |
seed |
Seed for the random number generator. If |
A polars expression
df <- pl$DataFrame(
  values = list(1:3, NA, c(NA, 3L), 5:7),
  n = c(1, 1, 1, 2)
)
df$with_columns(
  sample = pl$col("values")$list$sample(n = pl$col("n"), seed = 1)
)
This returns the "asymmetric difference", meaning only the elements of the first list that are not in the second list. To get all elements that are in only one of the two lists, use $set_symmetric_difference().
expr_list_set_difference(other)
other |
Other list variable. Can be an Expr or something coercible to an Expr. |
Note that the datatypes inside the list must have a common supertype. For example, the first column can be list[i32] and the second one can be list[i8] because it can be cast to list[i32]. However, the second column cannot be e.g. list[f32].
A polars expression
df <- pl$DataFrame(
  a = list(1:3, NA, c(NA, 3L), 5:7),
  b = list(2:4, 3L, c(3L, 4L, NA), c(6L, 8L))
)
df$with_columns(difference = pl$col("a")$list$set_difference("b"))
Compute the intersection between elements of a list and other elements
expr_list_set_intersection(other)
other |
Other list variable. Can be an Expr or something coercible to an Expr. |
Note that the datatypes inside the list must have a common supertype. For example, the first column can be list[i32] and the second one can be list[i8] because it can be cast to list[i32]. However, the second column cannot be e.g. list[f32].
A polars expression
df <- pl$DataFrame(
  a = list(1:3, NA, c(NA, 3L), 5:7),
  b = list(2:4, 3L, c(3L, 4L, NA), c(6L, 8L))
)
df$with_columns(intersection = pl$col("a")$list$set_intersection("b"))
This returns all elements that are in only one of the two lists. To get only elements that are in the first list but not in the second one, use $set_difference().
expr_list_set_symmetric_difference(other)
other |
Other list variable. Can be an Expr or something coercible to an Expr. |
Note that the datatypes inside the list must have a common supertype. For example, the first column can be list[i32] and the second one can be list[i8] because it can be cast to list[i32]. However, the second column cannot be e.g. list[f32].
A polars expression
df <- pl$DataFrame(
  a = list(1:3, NA, c(NA, 3L), 5:7),
  b = list(2:4, 3L, c(3L, 4L, NA), c(6L, 8L))
)
df$with_columns(
  symmetric_difference = pl$col("a")$list$set_symmetric_difference("b")
)
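The distinction between the asymmetric and the symmetric difference can be checked on a single pair of vectors with base R set functions. This is only a sketch of the set semantics; it does not call polars, and the order of elements in a polars result may differ.

```r
a <- c(1L, 2L, 3L)
b <- c(2L, 3L, 4L)

# Asymmetric difference: elements of `a` that are not in `b`.
asym_diff <- setdiff(a, b)

# Symmetric difference: elements that appear in exactly one of the two.
sym_diff <- union(setdiff(a, b), setdiff(b, a))
```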
Compute the union of elements of a list and other elements
expr_list_set_union(other)
other |
Other list variable. Can be an Expr or something coercible to an Expr. |
Note that the datatypes inside the list must have a common supertype. For example, the first column can be list[i32] and the second one can be list[i8] because it can be cast to list[i32]. However, the second column cannot be e.g. list[f32].
A polars expression
df <- pl$DataFrame(
  a = list(1:3, NA, c(NA, 3L), 5:7),
  b = list(2:4, 3L, c(3L, 4L, NA), c(6L, 8L))
)
df$with_columns(union = pl$col("a")$list$set_union("b"))
Shift list values by the given number of indices
expr_list_shift(n = 1)
A polars expression
df <- pl$DataFrame(
  s = list(1:4, c(10L, 2L, 1L)),
  idx = 1:2
)
df$with_columns(
  shift_by_expr = pl$col("s")$list$shift(pl$col("idx")),
  shift_by_lit = pl$col("s")$list$shift(2),
  shift_by_negative_lit = pl$col("s")$list$shift(-2)
)
This extracts at most length values, starting at index offset. This can return fewer than length values if length is larger than the number of values.
expr_list_slice(offset, length = NULL)
offset |
Start index. Negative indexing is supported. Can be an Expr. Strings are parsed as column names. |
length |
Length of the slice. If |
A polars expression
df <- pl$DataFrame(
  s = list(1:4, c(10L, 2L, 1L)),
  idx_off = 1:2,
  len = c(4, 1)
)
df$with_columns(
  slice_by_expr = pl$col("s")$list$slice("idx_off", "len"),
  slice_by_lit = pl$col("s")$list$slice(2, 3)
)
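The clamping behaviour of a slice can be sketched in base R without calling polars. slice0 is a hypothetical helper; the 0-based offset, the treatment of negative offsets as counting from the end, and the reading of a NULL length as "up to the end" are all assumptions made for illustration.

```r
# Hypothetical helper: 0-based slice that never reads past the end of `x`.
slice0 <- function(x, offset, len = NULL) {
  n <- length(x)
  start <- if (offset < 0) max(n + offset, 0) else min(offset, n)
  end <- if (is.null(len)) n else min(start + len, n)
  x[start + seq_len(end - start)]
}

s1 <- slice0(1:4, offset = 2, len = 3)  # only two values remain after offset 2
s2 <- slice0(1:4, offset = -3, len = 2) # negative offset counts from the end
s3 <- slice0(1:4, offset = 1)           # no length given: slice to the end
```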
Sort values in every sub-list
expr_list_sort(..., descending = FALSE, nulls_last = FALSE)
... |
These dots are for future extensions and must be empty. |
descending |
Sort values in descending order. |
nulls_last |
Place null values last. |
A polars expression
df <- pl$DataFrame(values = list(c(NA, 2, 1, 3), c(Inf, 2, 3, NaN), NA))
df$with_columns(sort = pl$col("values")$list$sort())
Compute the standard deviation in every sub-list
expr_list_std(ddof = 1)
ddof |
"Delta Degrees of Freedom": the divisor used in the calculation is
|
A polars expression
df <- pl$DataFrame(values = list(c(-1, 0, 1), c(1, 10)))
df$with_columns(
  std = pl$col("values")$list$std()
)
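The effect of ddof can be illustrated with a short base R sketch that does not call polars. sd_ddof is a hypothetical helper, and it assumes the usual convention that the divisor is the number of elements minus ddof.

```r
# Hypothetical helper: standard deviation with an explicit ddof,
# assuming the divisor is length(x) - ddof.
sd_ddof <- function(x, ddof = 1) {
  n <- length(x)
  sqrt(sum((x - mean(x))^2) / (n - ddof))
}

s1 <- sd_ddof(c(-1, 0, 1))           # ddof = 1: sample standard deviation
s0 <- sd_ddof(c(-1, 0, 1), ddof = 0) # ddof = 0: population standard deviation
```

Under that assumption, ddof = 1 agrees with base R's sd(), while ddof = 0 divides by the full element count.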
Sum all elements in every sub-list
expr_list_sum()
A polars expression
df <- pl$DataFrame(values = list(c(1, 2, 3, NA), c(2, 3), NA))
df$with_columns(sum = pl$col("values")$list$sum())
Slice the last n values of every sub-list
expr_list_tail(n = 5L)
n |
Number of values to return for each sub-list. Can be an Expr. Strings are parsed as column names. |
A polars expression
df <- pl$DataFrame(
  s = list(1:4, c(10L, 2L, 1L)),
  n = 1:2
)
df$with_columns(
  tail_by_expr = pl$col("s")$list$tail("n"),
  tail_by_lit = pl$col("s")$list$tail(2)
)
Convert a List column into an Array column with the same inner data type
expr_list_to_array(width)
width |
Width of the resulting Array column. |
A polars expression
df <- pl$DataFrame(values = list(c(-1, 0), c(1, 10)))
df$with_columns(
  array = pl$col("values")$list$to_array(2)
)
Get unique values in a list
expr_list_unique(..., maintain_order = FALSE)
... |
These dots are for future extensions and must be empty. |
maintain_order |
Maintain order of data. This requires more work. |
A polars expression
df <- pl$DataFrame(values = list(c(2, 2, NA), c(1, 2, 3), NA))
df$with_columns(unique = pl$col("values")$list$unique())
Compute the variance in every sub-list
expr_list_var(ddof = 1)
A polars expression
df <- pl$DataFrame(values = list(c(-1, 0, 1), c(1, 10)))
df$with_columns(
  var = pl$col("values")$list$var()
)
Indicate if this expression is the same as another expression
expr_meta_eq(other)
A polars expression
foo_bar <- pl$col("foo")$alias("bar")
foo <- pl$col("foo")
foo_bar$meta$eq(foo)
foo_bar2 <- pl$col("foo")$alias("bar")
foo_bar$meta$eq(foo_bar2)
Indicate if this expression expands into multiple expressions
expr_meta_has_multiple_outputs()
A polars expression
e <- pl$col(c("a", "b"))$name$suffix("_foo")
e$meta$has_multiple_outputs()
Indicate if this expression is a basic (non-regex) unaliased column
expr_meta_is_column()
A logical value.
e <- pl$col("foo")
e$meta$is_column()
e <- pl$col("foo") * pl$col("bar")
e$meta$is_column()
e <- pl$col(r"(^col.*\d+$)")
e$meta$is_column()
This can include bare columns, column matches by regex or dtype, selectors and exclude ops, and (optionally) column/expression aliasing.
expr_meta_is_column_selection(..., allow_aliasing = FALSE)
... |
These dots are for future extensions and must be empty. |
allow_aliasing |
If |
A logical value.
e <- pl$col("foo")
e$meta$is_column_selection()
e <- pl$col("foo")$alias("bar")
e$meta$is_column_selection()
e$meta$is_column_selection(allow_aliasing = TRUE)
e <- pl$col("foo") * pl$col("bar")
e$meta$is_column_selection()
e <- cs$starts_with("foo")
e$meta$is_column_selection()
Indicate if this expression expands to columns that match a regex pattern
expr_meta_is_regex_projection()
A logical value.
e <- pl$col("^.*$")$name$prefix("foo_")
e$meta$is_regex_projection()
Indicate if this expression is not the same as another expression
expr_meta_ne(other)
A polars expression
foo_bar <- pl$col("foo")$alias("bar")
foo <- pl$col("foo")
foo_bar$meta$ne(foo)
foo_bar2 <- pl$col("foo")$alias("bar")
foo_bar$meta$ne(foo_bar2)
It may not always be possible to determine the output name as that can depend on the schema of the context; in that case this will raise an error if raise_if_undetermined = TRUE (the default), and return NA otherwise.
expr_meta_output_name(..., raise_if_undetermined = TRUE)
... |
These dots are for future extensions and must be empty. |
raise_if_undetermined |
If |
A polars expression
e <- pl$col("foo") * pl$col("bar")
e$meta$output_name()
e_filter <- pl$col("foo")$filter(pl$col("bar") == 13)
e_filter$meta$output_name()
e_sum_over <- pl$col("foo")$sum()$over("groups")
e_sum_over$meta$output_name()
e_sum_slice <- pl$col("foo")$sum()$slice(pl$len() - 10, pl$col("bar"))
e_sum_slice$meta$output_name()
pl$len()$meta$output_name()
Pop the latest expression and return the input(s) of the popped expression
expr_meta_pop()
A polars expression
e <- pl$col("foo")$alias("bar")
pop <- e$meta$pop()
pop
pop[[1]]$meta$eq(pl$col("foo"))
Get a list with the root column name
expr_meta_root_names()
A polars expression
e <- pl$col("foo") * pl$col("bar")
e$meta$root_names()
e_filter <- pl$col("foo")$filter(pl$col("bar") == 13)
e_filter$meta$root_names()
e_sum_over <- pl$sum("foo")$over("groups")
e_sum_over$meta$root_names()
e_sum_slice <- pl$sum("foo")$slice(pl$len() - 10, pl$col("bar"))
e_sum_slice$meta$root_names()
Serialize this expression to a string in binary or JSON format
expr_meta_serialize(..., format = c("binary", "json"))
... |
These dots are for future extensions and must be empty. |
format |
The format in which to serialize. Must be one of:
|
Serialization is not stable across Polars versions: an expression serialized in one Polars version may not be deserializable in another Polars version.
A polars expression
# Serialize the expression into a binary representation.
expr <- pl$col("foo")$sum()$over("bar")
bytes <- expr$meta$serialize()
rawToChar(bytes)
pl$deserialize_expr(bytes)
# Serialize into json
expr$meta$serialize(format = "json") |>
  jsonlite::prettify()
Format the expression as a tree
expr_meta_tree_format()
A character vector
my_expr <- (pl$col("foo") * pl$col("bar"))$sum()$over(pl$col("ham")) / 2
my_expr$meta$tree_format() |>
  cat()
Undo any renaming operation like alias or name$keep
expr_meta_undo_aliases()
A polars expression
e <- pl$col("foo")$alias("bar")
e$meta$undo_aliases()$meta$eq(pl$col("foo"))
e <- pl$col("foo")$sum()$over("bar")
e$name$keep()$meta$undo_aliases()$meta$eq(e)
Check if string contains a substring that matches a pattern
expr_str_contains(pattern, ..., literal = FALSE, strict = TRUE)
pattern |
A character or something that can be coerced to a string Expr of a valid regex pattern, compatible with the regex crate. |
... |
These dots are for future extensions and must be empty. |
literal |
Logical. If |
strict |
Logical. If |
To modify regular expression behaviour (such as case-sensitivity) with flags, use the inline (?iLmsuxU) syntax. See the regex crate’s section on grouping and flags for additional information about the use of inline expression modifiers.
A polars expression
$str$starts_with(): Check if string values start with a substring.
$str$ends_with(): Check if string values end with a substring.
$str$find(): Return the index position of the first substring matching a pattern.
# The inline `(?i)` syntax example
pl$DataFrame(s = c("AAA", "aAa", "aaa"))$with_columns(
  default_match = pl$col("s")$str$contains("AA"),
  insensitive_match = pl$col("s")$str$contains("(?i)AA")
)
df <- pl$DataFrame(txt = c("Crab", "cat and dog", "rab$bit", NA))
df$with_columns(
  regex = pl$col("txt")$str$contains("cat|bit"),
  literal = pl$col("txt")$str$contains("rab$", literal = TRUE)
)
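The difference between regex and literal matching, and the effect of an inline flag, can also be seen with base R's grepl(). This sketch does not call polars, and R's regex engines are not the Rust regex crate, so it only illustrates the general idea on patterns where the engines are assumed to agree.

```r
txt <- c("Crab", "cat and dog", "rab$bit")

# Regex mode: `$` anchors the match at the end of the string.
as_regex <- grepl("rab$", txt)

# Literal mode: `$` is an ordinary character to search for.
as_literal <- grepl("rab$", txt, fixed = TRUE)

# Inline flag: (?i) switches on case-insensitive matching (PCRE engine here).
insensitive <- grepl("(?i)AA", c("AAA", "aAa", "aaa"), perl = TRUE)
```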
This function determines if any of the patterns find a match.
expr_str_contains_any(patterns, ..., ascii_case_insensitive = FALSE)
patterns |
Character vector or something that can be coerced to a string Expr of a valid regex pattern, compatible with the regex crate. |
... |
These dots are for future extensions and must be empty. |
ascii_case_insensitive |
Enable ASCII-aware case insensitive matching. When this option is enabled, searching will be performed without respect to case for ASCII letters (a-z and A-Z) only. |
A polars expression
df <- pl$DataFrame(
  lyrics = c(
    "Everybody wants to rule the world",
    "Tell me what you want, what you really really want",
    "Can you feel the love tonight"
  )
)
df$with_columns(
  contains_any = pl$col("lyrics")$str$contains_any(c("you", "me"))
)
Count all successive non-overlapping regex matches
expr_str_count_matches(pattern, ..., literal = FALSE)
pattern |
A character or something that can be coerced to a string Expr of a valid regex pattern, compatible with the regex crate. |
... |
These dots are for future extensions and must be empty. |
literal |
Logical. If |
A polars expression
df <- pl$DataFrame(foo = c("12 dbc 3xy", "cat\\w", "1zy3\\d\\d", NA))
df$with_columns(
  count_digits = pl$col("foo")$str$count_matches(r"(\d)"),
  count_slash_d = pl$col("foo")$str$count_matches(r"(\d)", literal = TRUE)
)
Decode a value using the provided encoding
expr_str_decode(encoding, ..., strict = TRUE)
encoding |
Either 'hex' or 'base64'. |
... |
These dots are for future extensions and must be empty. |
strict |
If |
A polars expression
df <- pl$DataFrame(strings = c("foo", "bar", NA))
df$select(pl$col("strings")$str$encode("hex"))
df$with_columns(
  pl$col("strings")$str$encode("base64")$alias("base64"), # notice DataType is not encoded
  pl$col("strings")$str$encode("hex")$alias("hex") # ... and must be restored with cast
)$with_columns(
  pl$col("base64")$str$decode("base64")$alias("base64_decoded")$cast(pl$String),
  pl$col("hex")$str$decode("hex")$alias("hex_decoded")$cast(pl$String)
)
Encode a value using the provided encoding
expr_str_encode(encoding)
encoding |
Either 'hex' or 'base64'. |
A polars expression
df <- pl$DataFrame(strings = c("foo", "bar", NA))
df$select(pl$col("strings")$str$encode("hex"))
df$with_columns(
  pl$col("strings")$str$encode("base64")$alias("base64"), # notice DataType is not encoded
  pl$col("strings")$str$encode("hex")$alias("hex") # ... and must be restored with cast
)$with_columns(
  pl$col("base64")$str$decode("base64")$alias("base64_decoded")$cast(pl$String),
  pl$col("hex")$str$decode("hex")$alias("hex_decoded")$cast(pl$String)
)
Check if string values end with a substring.
expr_str_ends_with(suffix)
suffix |
Suffix substring or Expr. |
See also $str$starts_with() and $str$contains().
A polars expression
df <- pl$DataFrame(fruits = c("apple", "mango", NA))
df$select(
  pl$col("fruits"),
  pl$col("fruits")$str$ends_with("go")$alias("has_suffix")
)
Extract the target capture group from provided patterns
expr_str_extract(pattern, group_index)
pattern |
A valid regex pattern. Can be an Expr or something coercible to an Expr. Strings are parsed as column names. |
group_index |
Index of the targeted capture group. Group 0 means the whole pattern; the first group begins at index 1 (default). |
A polars expression
df <- pl$DataFrame(
  a = c(
    "http://vote.com/ballon_dor?candidate=messi&ref=polars",
    "http://vote.com/ballon_dor?candidat=jorginho&ref=polars",
    "http://vote.com/ballon_dor?candidate=ronaldo&ref=polars"
  )
)
df$with_columns(
  extracted = pl$col("a")$str$extract(pl$lit(r"(candidate=(\w+))"), 1)
)
Extracts all matches for the given regex pattern. Extracts each successive non-overlapping regex match in an individual string as an array.
expr_str_extract_all(pattern)
pattern |
A valid regex pattern |
A polars expression
df <- pl$DataFrame(foo = c("123 bla 45 asd", "xyz 678 910t"))
df$select(
  pl$col("foo")$str$extract_all(r"((\d+))")$alias("extracted_nrs")
)
Extract all capture groups for the given regex pattern
expr_str_extract_groups(pattern)
pattern |
A character of a valid regular expression pattern containing at least one capture group, compatible with the regex crate. |
All group names are strings. If your pattern contains unnamed groups, their numerical position is converted to a string. See examples.
A polars expression
df <- pl$DataFrame(
  url = c(
    "http://vote.com/ballon_dor?candidate=messi&ref=python",
    "http://vote.com/ballon_dor?candidate=weghorst&ref=polars",
    "http://vote.com/ballon_dor?error=404&ref=rust"
  )
)
pattern <- r"(candidate=(?<candidate>\w+)&ref=(?<ref>\w+))"
df$with_columns(
  captures = pl$col("url")$str$extract_groups(pattern)
)$unnest("captures")
# If the groups are unnamed, their numerical position (as a string) is used:
pattern <- r"(candidate=(\w+)&ref=(\w+))"
df$with_columns(
  captures = pl$col("url")$str$extract_groups(pattern)
)$unnest("captures")
Use the Aho-Corasick algorithm to extract matches
expr_str_extract_many(
  patterns,
  ...,
  ascii_case_insensitive = FALSE,
  overlapping = FALSE
)
patterns |
String patterns to search. This can be an Expr or something coercible to an Expr. Strings are parsed as column names. |
... |
These dots are for future extensions and must be empty. |
ascii_case_insensitive |
Enable ASCII-aware case insensitive matching. When this option is enabled, searching will be performed without respect to case for ASCII letters (a-z and A-Z) only. |
overlapping |
Whether matches can overlap. |
A polars expression
df <- pl$DataFrame(values = "discontent")
patterns <- pl$lit(c("winter", "disco", "onte", "discontent"))
df$with_columns(
  matches = pl$col("values")$str$extract_many(patterns),
  matches_overlap = pl$col("values")$str$extract_many(patterns, overlapping = TRUE)
)
df <- pl$DataFrame(
  values = c("discontent", "rhapsody"),
  patterns = list(c("winter", "disco", "onte", "discontent"), c("rhap", "ody", "coalesce"))
)
df$select(pl$col("values")$str$extract_many("patterns"))
Return the index position of the first substring matching a pattern
expr_str_find(pattern, ..., literal = FALSE, strict = TRUE)
pattern |
A character, or something that can be coerced to a string Expr, of a valid regex pattern compatible with the regex crate. |
... |
These dots are for future extensions and must be empty. |
literal |
Logical. If |
strict |
Logical. If |
To modify regular expression behaviour (such as case-sensitivity) with flags, use the inline (?iLmsuxU) syntax. See the regex crate’s section on grouping and flags for additional information about the use of inline expression modifiers.
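As a rough illustration of what an inline flag changes, the sketch below mimics the lookup in base R. It is not the polars implementation: `find_first()` is a hypothetical helper, base R's PCRE engine stands in for the Rust regex crate, and the 0-based positions are an assumption taken from upstream Polars behaviour.

```r
# Hypothetical base R stand-in for $str$find(): position of the first match,
# reported 0-based (assumption), or NA when nothing matches.
# perl = TRUE selects PCRE, which also accepts inline flags such as (?i).
find_first <- function(s, pattern) {
  pos <- as.integer(regexpr(pattern, s, perl = TRUE)) # 1-based, -1 if no match
  ifelse(pos < 0L, NA_integer_, pos - 1L)
}

s <- c("AAA", "aAa", "aaa")
default_match <- find_first(s, "Aa")         # case-sensitive
insensitive_match <- find_first(s, "(?i)Aa") # (?i) switches case sensitivity off
```

With the flag, every string matches at its first character; without it, only "aAa" contains the exact sequence "Aa".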
A polars expression
$str$starts_with(): Check if string values start with a substring.
$str$ends_with(): Check if string values end with a substring.
$str$contains(): Check if string contains a substring that matches a pattern.
pl$DataFrame(s = c("AAA", "aAa", "aaa"))$with_columns(
  default_match = pl$col("s")$str$find("Aa"),
  insensitive_match = pl$col("s")$str$find("(?i)Aa")
)
Return the first n characters of each string
expr_str_head(n)
n |
Length of the slice (integer or expression). Strings are parsed as column names. Negative indexing is supported. |
The n input is defined in terms of the number of characters in the (UTF-8) string. A character is defined as a Unicode scalar value. A single character is represented by a single byte when working with ASCII text, and a maximum of 4 bytes otherwise.
When the n input is negative, head() returns characters up to the nth from the end of the string. For example, if n = -3, then all characters except the last three are returned.
If the string has fewer than n characters, the full string is returned.
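The rules for positive and negative n can be sketched in base R. `head_chars()` is a hypothetical helper written only to illustrate the semantics described above; it handles a single scalar n and is not how polars implements `$str$head()`.

```r
# Hypothetical base R sketch of the head() semantics (scalar n only).
head_chars <- function(s, n) {
  len <- nchar(s, type = "chars")
  # n >= 0: keep the first n characters (or the whole string if it is shorter).
  # n <  0: keep everything except the last |n| characters.
  stop_at <- if (n >= 0) pmin(n, len) else pmax(len + n, 0)
  substr(s, 1, stop_at)
}

first_five <- head_chars("papaya", 5)   # "papay"
all_but_two <- head_chars("papaya", -2) # "papa"
short_input <- head_chars("pear", 5)    # "pear": fewer than n characters
```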
A polars expression
df <- pl$DataFrame(
  s = c("pear", NA, "papaya", "dragonfruit"),
  n = c(3, 4, -2, -5)
)
df$with_columns(
  s_head_5 = pl$col("s")$str$head(5),
  s_head_n = pl$col("s")$str$head("n")
)
Vertically concatenate the string values in the column to a single string value.
expr_str_join(delimiter = "", ..., ignore_nulls = TRUE)
delimiter |
The delimiter to insert between consecutive string values. |
... |
These dots are for future extensions and must be empty. |
ignore_nulls |
Ignore null values (default). If |
A polars expression
# concatenate a Series of strings to a single string
df <- pl$DataFrame(foo = c(1, NA, 2))
df$select(pl$col("foo")$str$join("-"))
df$select(pl$col("foo")$str$join("-", ignore_nulls = FALSE))
Parse string values as JSON.
expr_str_json_decode(dtype, infer_schema_length = 100)
dtype |
The dtype to cast the extracted value to. If |
infer_schema_length |
How many rows to parse to determine the schema.
If |
Throws an error if invalid JSON strings are encountered.
A polars expression
df <- pl$DataFrame(
  json_val = c('{"a":1, "b": true}', NA, '{"a":2, "b": false}')
)
dtype <- pl$Struct(pl$Field("a", pl$Int64), pl$Field("b", pl$Boolean))
df$select(pl$col("json_val")$str$json_decode(dtype))
Extract the first match of JSON string with the provided JSONPath expression
expr_str_json_path_match(json_path)
json_path |
A valid JSON path query string. |
Throws an error if invalid JSON strings are encountered. All return values are cast to String, regardless of the original value.
Documentation on the JSONPath standard can be found here: https://goessner.net/articles/JsonPath/.
A polars expression
df <- pl$DataFrame(
  json_val = c('{"a":"1"}', NA, '{"a":2}', '{"a":2.1}', '{"a":true}')
)
df$select(pl$col("json_val")$str$json_path_match("$.a"))
Get length of the strings as UInt32 (as number of bytes). Use $str$len_chars() to get the number of characters.
expr_str_len_bytes()
If you know that you are working with ASCII text, $str$len_bytes() gives equivalent results and is faster (it returns the length in terms of the number of bytes).
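The difference between byte length and character length can be checked in base R with `nchar()`. This is only an analogy for `$str$len_bytes()` and `$str$len_chars()`, not a call into polars; the strings are written with `\u` escapes so that they are UTF-8 regardless of the session locale.

```r
# Byte length vs character length for UTF-8 strings, using base R only.
s <- c("Caf\u00e9", "345", "\u00e6\u00f8\u00e5") # "Café", "345", "æøå"

n_bytes <- nchar(s, type = "bytes") # analogous to $str$len_bytes()
n_chars <- nchar(s, type = "chars") # analogous to $str$len_chars()

# Each non-ASCII letter here takes 2 bytes in UTF-8, so the two counts differ
# for "Café" and "æøå" but agree for the pure-ASCII "345".
data.frame(s = s, n_bytes = n_bytes, n_chars = n_chars)
```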
A polars expression
pl$DataFrame(
  s = c("Café", NA, "345", "æøå")
)$select(
  pl$col("s"),
  pl$col("s")$str$len_bytes()$alias("lengths"),
  pl$col("s")$str$len_chars()$alias("n_chars")
)
Get length of the strings as UInt32 (as number of characters). Use $str$len_bytes() to get the number of bytes.
expr_str_len_chars()
If you know that you are working with ASCII text, $str$len_bytes() gives equivalent results and is faster (it returns the length in terms of the number of bytes).
A polars expression
pl$DataFrame(
  s = c("Café", NA, "345", "æøå")
)$select(
  pl$col("s"),
  pl$col("s")$str$len_bytes()$alias("lengths"),
  pl$col("s")$str$len_chars()$alias("n_chars")
)
Return the string left justified in a string of length length.
expr_str_pad_end(length, fill_char = " ")
length |
Justify left to this length. |
fill_char |
Fill with this ASCII character. |
Padding is done using the specified fill_char. The original string is returned if length is less than or equal to the length of the string.
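The padding rule can be written out in a few lines of base R. `pad_end()` below is a hypothetical stand-in used to show the rule on non-missing input; it is not the polars implementation and does not handle NA.

```r
# Hypothetical base R sketch of right-padding to a target length.
pad_end <- function(s, length, fill_char = " ") {
  # Number of fill characters still needed; never negative, so strings that
  # are already long enough come back unchanged.
  missing_chars <- pmax(length - nchar(s, type = "chars"), 0)
  paste0(s, strrep(fill_char, missing_chars))
}

padded <- pad_end(c("cow", "monkey", "hippopotamus"), 8, "*")
```

Left-padding follows the same rule with the fill characters pasted in front of the string instead of after it.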
A polars expression
df <- pl$DataFrame(a = c("cow", "monkey", NA, "hippopotamus"))
df$select(pl$col("a")$str$pad_end(8, "*"))
Return the string right justified in a string of length length.
expr_str_pad_start(length, fill_char = " ")
length |
Justify right to this length. |
fill_char |
Fill with this ASCII character. |
Padding is done using the specified fill_char. The original string is returned if length is less than or equal to the length of the string.
A polars expression
df <- pl$DataFrame(a = c("cow", "monkey", NA, "hippopotamus"))
df$select(pl$col("a")$str$pad_start(8, "*"))
Replace first matching regex/literal substring with a new string value
expr_str_replace(pattern, value, ..., literal = FALSE, n = 1L)
pattern |
A character, or something that can be coerced to a string Expr, of a valid regex pattern compatible with the regex crate. |
value |
A character or an Expr of string that will replace the matched substring. |
... |
These dots are for future extensions and must be empty. |
literal |
Logical. If |
n |
A number of matches to replace.
Note that regex replacement with |
To modify regular expression behaviour (such as case-sensitivity) with flags, use the inline (?iLmsuxU) syntax. See the regex crate’s section on grouping and flags for additional information about the use of inline expression modifiers.
A polars expression
The dollar sign ($) is a special character related to capture groups. To refer to a literal dollar sign, use $$ instead or set literal to TRUE.
df <- pl$DataFrame(id = 1L:2L, text = c("123abc", "abc456"))
df$with_columns(pl$col("text")$str$replace(r"(abc\b)", "ABC"))
# Capture groups are supported.
# Use `${1}` in the value string to refer to the first capture group in the pattern,
# `${2}` to refer to the second capture group, and so on.
# You can also use named capture groups.
df <- pl$DataFrame(word = c("hat", "hut"))
df$with_columns(
  positional = pl$col("word")$str$replace("h(.)t", "b${1}d"),
  named = pl$col("word")$str$replace("h(?<vowel>.)t", "b${vowel}d")
)
# Apply case-insensitive string replacement using the `(?i)` flag.
df <- pl$DataFrame(
  city = "Philadelphia",
  season = c("Spring", "Summer", "Autumn", "Winter"),
  weather = c("Rainy", "Sunny", "Cloudy", "Snowy")
)
df$with_columns(
  pl$col("weather")$str$replace("(?i)foggy|rainy|cloudy|snowy", "Sunny")
)
Replace all matching regex/literal substrings with a new string value
expr_str_replace_all(pattern, value, ..., literal = FALSE)
pattern |
A character, or something that can be coerced to a string Expr, of a valid regex pattern compatible with the regex crate. |
value |
A character or an Expr of string that will replace the matched substring. |
... |
These dots are for future extensions and must be empty. |
literal |
Logical. If |
To modify regular expression behaviour (such as case-sensitivity) with flags, use the inline (?iLmsuxU) syntax. See the regex crate’s section on grouping and flags for additional information about the use of inline expression modifiers.
A polars expression
The dollar sign ($) is a special character related to capture groups. To refer to a literal dollar sign, use $$ instead or set literal to TRUE.
df <- pl$DataFrame(id = 1L:2L, text = c("abcabc", "123a123"))
df$with_columns(pl$col("text")$str$replace_all("a", "-"))
# Capture groups are supported.
# Use `${1}` in the value string to refer to the first capture group in the pattern,
# `${2}` to refer to the second capture group, and so on.
# You can also use named capture groups.
df <- pl$DataFrame(word = c("hat", "hut"))
df$with_columns(
  positional = pl$col("word")$str$replace_all("h(.)t", "b${1}d"),
  named = pl$col("word")$str$replace_all("h(?<vowel>.)t", "b${vowel}d")
)
# Apply case-insensitive string replacement using the `(?i)` flag.
df <- pl$DataFrame(
  city = "Philadelphia",
  season = c("Spring", "Summer", "Autumn", "Winter"),
  weather = c("Rainy", "Sunny", "Cloudy", "Snowy")
)
df$with_columns(
  pl$col("weather")$str$replace_all(
    "(?i)foggy|rainy|cloudy|snowy", "Sunny"
  )
)
This function replaces several matches at once.
expr_str_replace_many(patterns, replace_with, ascii_case_insensitive = FALSE)
patterns |
String patterns to search. Can be an Expr. |
replace_with |
A vector of strings used as replacements. If this is of
length 1, then it is applied to all matches. Otherwise, it must be of same
length as the |
ascii_case_insensitive |
Enable ASCII-aware case insensitive matching. When this option is enabled, searching will be performed without respect to case for ASCII letters (a-z and A-Z) only. |
A polars expression
df <- pl$DataFrame(
  lyrics = c(
    "Everybody wants to rule the world",
    "Tell me what you want, what you really really want",
    "Can you feel the love tonight"
  )
)
# a replacement of length 1 is applied to all matches
df$with_columns(
  remove_pronouns = pl$col("lyrics")$str$replace_many(c("you", "me"), "")
)
# if there is more than one replacement, the patterns and replacements are
# matched
df$with_columns(
  fake_pronouns = pl$col("lyrics")$str$replace_many(c("you", "me"), c("foo", "bar"))
)
Returns string values in reversed order
expr_str_reverse()
A polars expression
df <- pl$DataFrame(text = c("foo", "bar", NA))
df$with_columns(reversed = pl$col("text")$str$reverse())
Create subslices of the string values of a String Series
expr_str_slice(offset, length = NULL)
offset |
Start index. Negative indexing is supported. |
length |
Length of the slice. If |
A polars expression
df <- pl$DataFrame(s = c("pear", NA, "papaya", "dragonfruit"))
df$with_columns(
  pl$col("s")$str$slice(-3)$alias("s_sliced")
)
Split the string by a substring
expr_str_split(by, ..., inclusive = FALSE)
by |
Substring to split by. Can be an Expr. |
... |
These dots are for future extensions and must be empty. |
inclusive |
If |
A polars expression
df <- pl$DataFrame(s = c("foo bar", "foo-bar", "foo bar baz"))
df$select(pl$col("s")$str$split(by = " "))
df <- pl$DataFrame(
  s = c("foo^bar", "foo_bar", "foo*bar*baz"),
  by = c("_", "_", "*")
)
df
df$select(split = pl$col("s")$str$split(by = pl$col("by")))
Split the string by a substring using n splits
This results in a struct of n+1 fields. If it cannot make n splits, the remaining field elements will be null.
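The "n splits give n+1 fields, padded with nulls" rule can be sketched for a single string in base R. `split_exact()` here is a hypothetical helper, not the polars method: it returns a character vector instead of a struct, uses NA where polars uses null, and its dropping of pieces beyond n+1 is an assumption of this sketch.

```r
# Hypothetical base R sketch: split one string and force exactly n + 1 pieces.
split_exact <- function(s, by, n) {
  parts <- strsplit(s, by, fixed = TRUE)[[1]]
  # Assigning the length pads with NA when there are too few pieces
  # (and truncates when there are too many; truncation is an assumption).
  length(parts) <- n + 1
  parts
}

full_split <- split_exact("a_1", "_", 1) # both fields filled
short_split <- split_exact("c", "_", 1)  # second field is NA
```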
expr_str_split_exact(by, n, ..., inclusive = FALSE)
by |
Substring to split by. Can be an Expr. |
n |
Number of splits to make. |
... |
These dots are for future extensions and must be empty. |
inclusive |
If |
A polars expression
df <- pl$DataFrame(s = c("a_1", NA, "c", "d_4"))
df$with_columns(
  split = pl$col("s")$str$split_exact(by = "_", 1),
  split_inclusive = pl$col("s")$str$split_exact(by = "_", 1, inclusive = TRUE)
)
Split the string by a substring, restricted to returning at most n items
If the number of possible splits is less than n-1, the remaining field elements will be null. If the number of possible splits is n-1 or greater, the last (nth) substring will contain the remainder of the string.
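The "at most n items, with the remainder kept in the last one" rule can be sketched for a single string in base R. `splitn()` below is a hypothetical helper for illustration, not the polars implementation; it returns a character vector rather than a struct and uses NA where polars uses null.

```r
# Hypothetical base R sketch: split one string into exactly n items.
splitn <- function(s, by, n) {
  parts <- strsplit(s, by, fixed = TRUE)[[1]]
  if (length(parts) > n) {
    # More pieces than requested: glue everything from the nth piece onwards
    # back together so the last item holds the remainder of the string.
    rest <- paste(parts[n:length(parts)], collapse = by)
    parts <- c(parts[seq_len(n - 1)], rest)
  }
  length(parts) <- n # pad with NA when there were fewer than n pieces
  parts
}

one_item <- splitn("d_4_e", "_", 1)  # "d_4_e"
two_items <- splitn("d_4_e", "_", 2) # "d", "4_e"
padded_items <- splitn("c", "_", 2)  # "c", NA
```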
expr_str_splitn(by, n)
by |
Substring to split by. Can be an Expr. |
n |
Number of splits to make. |
A polars expression
df <- pl$DataFrame(s = c("a_1", NA, "c", "d_4_e"))
df$with_columns(
  s1 = pl$col("s")$str$splitn(by = "_", 1),
  s2 = pl$col("s")$str$splitn(by = "_", 2),
  s3 = pl$col("s")$str$splitn(by = "_", 3)
)
Check if string values start with a substring.
expr_str_starts_with(prefix)
prefix |
Prefix substring or Expr. |
See also $str$contains() and $str$ends_with().
A polars expression
df <- pl$DataFrame(fruits = c("apple", "mango", NA))
df$select(
  pl$col("fruits"),
  pl$col("fruits")$str$starts_with("app")$alias("has_prefix")
)
Remove leading and trailing characters.
expr_str_strip_chars(characters = NULL)
characters |
The set of characters to be removed. All combinations of this
set of characters will be stripped. If |
This function will not strip any chars beyond the first char not matched.
strip_chars() removes characters at the beginning and the end of the string. Use strip_chars_start() and strip_chars_end() to remove characters only from left and right respectively.
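The set-based behaviour (characters are treated as an unordered set, and stripping stops at the first character outside it) can be sketched in base R. This `strip_chars()` is a hypothetical single-string helper written for illustration, not the polars implementation.

```r
# Hypothetical base R sketch of set-based stripping from both ends.
strip_chars <- function(s, characters) {
  chars <- strsplit(s, "")[[1]]
  in_set <- chars %in% strsplit(characters, "")[[1]]
  keep <- which(!in_set) # positions of characters outside the set
  if (length(keep) == 0) {
    return("") # every character belongs to the set
  }
  # Keep everything between the first and last character outside the set;
  # characters from the set that sit in between are left untouched.
  paste(chars[min(keep):max(keep)], collapse = "")
}

stripped_hello <- strip_chars(" hello", " hel rld")  # "o"
stripped_world <- strip_chars("\tworld", " hel rld") # "\two"
```

In the second call the tab is not in the set, so nothing is removed from the left, while "rld" is stripped from the right.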
A polars expression
df <- pl$DataFrame(foo = c(" hello", "\tworld"))
df$select(pl$col("foo")$str$strip_chars())
df$select(pl$col("foo")$str$strip_chars(" hel rld"))
Remove trailing characters.
expr_str_strip_chars_end(characters = NULL)
characters |
The set of characters to be removed. All combinations of this
set of characters will be stripped. If |
This function will not strip any chars beyond the first char not matched.
strip_chars_end() removes characters at the end of the string only. Use strip_chars() and strip_chars_start() to remove characters from the left and right or only from the left respectively.
A polars expression
df <- pl$DataFrame(foo = c(" hello", "\tworld"))
df$select(pl$col("foo")$str$strip_chars_end(" hel\trld"))
df$select(pl$col("foo")$str$strip_chars_end("rldhel\t "))
Remove leading characters.
expr_str_strip_chars_start(characters = NULL)
characters |
The set of characters to be removed. All combinations of this
set of characters will be stripped. If |
This function will not strip any chars beyond the first char not matched.
strip_chars_start() removes characters at the beginning of the string only. Use strip_chars() and strip_chars_end() to remove characters from the left and right or only from the right respectively.
A polars expression
df <- pl$DataFrame(foo = c(" hello", "\tworld"))
df$select(pl$col("foo")$str$strip_chars_start(" hel rld"))
The prefix will be removed from the string exactly once, if found.
expr_str_strip_prefix(prefix = NULL)
prefix |
The prefix to be removed. |
This method strips the exact character sequence provided in prefix from the start of the input. To strip a set of characters in any order, use $strip_chars_start() instead.
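The "exact sequence, removed at most once" behaviour contrasts with set-based stripping and can be sketched in base R. `strip_prefix()` below is a hypothetical helper for illustration, not the polars implementation, and it does not handle NA.

```r
# Hypothetical base R sketch: drop one exact leading occurrence of `prefix`.
strip_prefix <- function(s, prefix) {
  # Only strings that really start with the prefix are shortened, and only
  # by one occurrence; everything else is returned unchanged.
  ifelse(startsWith(s, prefix), substring(s, nchar(prefix) + 1L), s)
}

stripped <- strip_prefix(c("foobar", "foofoobar", "foo", "bar"), "foo")
```

Note that "foofoobar" becomes "foobar" rather than "bar", because the prefix is removed only once.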
A polars expression
df <- pl$DataFrame(a = c("foobar", "foofoobar", "foo", "bar"))
df$with_columns(
  stripped = pl$col("a")$str$strip_prefix("foo")
)
The suffix will be removed from the string exactly once, if found.
expr_str_strip_suffix(suffix = NULL)
suffix |
The suffix to be removed. |
This method strips the exact character sequence provided in suffix from the end of the input. To strip a set of characters in any order, use $strip_chars_end() instead.
A polars expression
df <- pl$DataFrame(a = c("foobar", "foobarbar", "foo", "bar"))
df$with_columns(
  stripped = pl$col("a")$str$strip_suffix("bar")
)
Similar to the strptime() function.
expr_str_strptime(
  dtype,
  format = NULL,
  ...,
  strict = TRUE,
  exact = TRUE,
  cache = TRUE,
  ambiguous = c("raise", "earliest", "latest", "null")
)
dtype |
The data type to convert into. Can be either |
format |
Format to use for conversion. Refer to
the chrono crate documentation
for the full specification. Example: |
... |
These dots are for future extensions and must be empty. |
strict |
If |
exact |
If |
cache |
Use a cache of unique, converted dates to apply the datetime conversion. |
ambiguous |
Determine how to deal with ambiguous datetimes. Character vector or expression containing the following:
|
When parsing a Datetime the column precision will be inferred from the format string, if given, e.g.: "%F %T%.3f" => pl$Datetime("ms"). If no fractional second component is found then the default is "us" (microsecond).
A polars expression
# Dealing with a consistent format
df <- pl$DataFrame(x = c("2020-01-01 01:00Z", "2020-01-01 02:00Z"))
df$select(pl$col("x")$str$strptime(pl$Datetime(), "%Y-%m-%d %H:%M%#z"))
# Auto infer format
df$select(pl$col("x")$str$strptime(pl$Datetime()))
# Datetime with timezone is interpreted as UTC timezone
df <- pl$DataFrame(x = c("2020-01-01T01:00:00+09:00"))
df$select(pl$col("x")$str$strptime(pl$Datetime()))
# Dealing with different formats.
df <- pl$DataFrame(
  date = c(
    "2021-04-22",
    "2022-01-04 00:00:00",
    "01/31/22",
    "Sun Jul 8 00:34:60 2001"
  )
)
df$select(
  pl$coalesce(
    pl$col("date")$str$strptime(pl$Date, "%F", strict = FALSE),
    pl$col("date")$str$strptime(pl$Date, "%F %T", strict = FALSE),
    pl$col("date")$str$strptime(pl$Date, "%D", strict = FALSE),
    pl$col("date")$str$strptime(pl$Date, "%c", strict = FALSE)
  )
)
# Ignore invalid time
df <- pl$DataFrame(
  x = c(
    "2023-01-01 11:22:33 -0100",
    "2023-01-01 11:22:33 +0300",
    "invalid time"
  )
)
df$select(pl$col("x")$str$strptime(
  pl$Datetime("ns"),
  format = "%Y-%m-%d %H:%M:%S %z",
  strict = FALSE
))
Return the last n characters of each string
expr_str_tail(n)
n |
Length of the slice (integer or expression). Strings are parsed as column names. Negative indexing is supported. |
The n input is defined in terms of the number of characters in the (UTF-8) string. A character is defined as a Unicode scalar value. A single character is represented by a single byte when working with ASCII text, and a maximum of 4 bytes otherwise.
When the n input is negative, tail() returns characters starting from the nth from the beginning of the string. For example, if n = -3, then all characters except the first three are returned.
If the string has fewer than n characters, the full string is returned.
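The same rules, seen from the end of the string, can be sketched in base R. `tail_chars()` is a hypothetical helper for a single scalar n, written only to illustrate the semantics described above; it is not the polars implementation.

```r
# Hypothetical base R sketch of the tail() semantics (scalar n only).
tail_chars <- function(s, n) {
  len <- nchar(s, type = "chars")
  # n >= 0: start n characters before the end (or at the first character).
  # n <  0: skip the first |n| characters.
  start <- if (n >= 0) pmax(len - n + 1, 1) else pmin(-n + 1, len + 1)
  substr(s, start, len)
}

last_five <- tail_chars("papaya", 5)    # "apaya"
all_but_two <- tail_chars("papaya", -2) # "paya"
short_input <- tail_chars("pear", 5)    # "pear": fewer than n characters
```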
A polars expression
df <- pl$DataFrame(
  s = c("pear", NA, "papaya", "dragonfruit"),
  n = c(3, 4, -2, -5)
)
df$with_columns(
  s_tail_5 = pl$col("s")$str$tail(5),
  s_tail_n = pl$col("s")$str$tail("n")
)
Convert a String column into a Date column
expr_str_to_date(format = NULL, ..., strict = TRUE, exact = TRUE, cache = TRUE)
format |
Format to use for conversion. Refer to
the chrono crate documentation
for the full specification. Example: |
... |
These dots are for future extensions and must be empty. |
strict |
If |
exact |
If |
cache |
Use a cache of unique, converted dates to apply the datetime conversion. |
A polars expression
df <- pl$DataFrame(x = c("2020/01/01", "2020/02/01", "2020/03/01"))
df$select(pl$col("x")$str$to_date())
# by default, this errors if some values cannot be converted
df <- pl$DataFrame(x = c("2020/01/01", "2020 02 01", "2020-03-01"))
try(df$select(pl$col("x")$str$to_date()))
df$select(pl$col("x")$str$to_date(strict = FALSE))
Convert a String column into a Datetime column
expr_str_to_datetime( format = NULL, ..., time_unit = NULL, time_zone = NULL, strict = TRUE, exact = TRUE, cache = TRUE, ambiguous = c("raise", "earliest", "latest", "null") )
format |
Format to use for conversion. Refer to the chrono crate documentation for the full specification. Example: |
... |
These dots are for future extensions and must be empty. |
time_unit |
Unit of time for the resulting Datetime column. If |
time_zone |
Time zone for the resulting Datetime column. |
strict |
If |
exact |
If |
cache |
Use a cache of unique, converted dates to apply the datetime conversion. |
ambiguous |
Determine how to deal with ambiguous datetimes. Character vector or expression containing the following:
|
A polars expression
df <- pl$DataFrame(x = c("2020-01-01 01:00Z", "2020-01-01 02:00Z"))
df$select(pl$col("x")$str$to_datetime("%Y-%m-%d %H:%M%#z"))
df$select(pl$col("x")$str$to_datetime(time_unit = "ms"))
This method infers the needed parameters precision and scale.
expr_str_to_decimal(..., inference_length = 100)
... |
These dots are for future extensions and must be empty. |
inference_length |
Number of elements to parse to determine the
|
A polars expression
df <- pl$DataFrame(
  numbers = c(
    "40.12", "3420.13", "120134.19", "3212.98",
    "12.90", "143.09", "143.9"
  )
)
df$with_columns(numbers_decimal = pl$col("numbers")$str$to_decimal())
Convert a String column into an Int64 column with base radix
expr_str_to_integer(..., base = 10L, strict = TRUE)
... |
These dots are for future extensions and must be empty. |
base |
A positive integer or expression which is the base of the string
we are parsing. Characters are parsed as column names. Default: |
strict |
A logical. If |
A polars expression
df <- pl$DataFrame(bin = c("110", "101", "010", "invalid"))
df$with_columns(
  parsed = pl$col("bin")$str$to_integer(base = 2, strict = FALSE)
)
df <- pl$DataFrame(hex = c("fa1e", "ff00", "cafe", NA))
df$with_columns(
  parsed = pl$col("hex")$str$to_integer(base = 16, strict = TRUE)
)
Transform to lowercase variant.
expr_str_to_lowercase()
A polars expression
pl$lit(c("A", "b", "c", "1", NA))$str$to_lowercase()$to_series()
Convert a String column into a Time column
expr_str_to_time(format = NULL, ..., strict = TRUE, cache = TRUE)
format |
Format to use for conversion. Refer to the chrono crate documentation for the full specification. Example: |
... |
These dots are for future extensions and must be empty. |
strict |
If |
cache |
Use a cache of unique, converted dates to apply the datetime conversion. |
A polars expression
df <- pl$DataFrame(x = c("01:00", "02:00", "03:00"))
df$select(pl$col("x")$str$to_time("%H:%M"))
Transform to uppercase variant.
expr_str_to_uppercase()
A polars expression
pl$lit(c("A", "b", "c", "1", NA))$str$to_uppercase()$to_series()
Add zeroes to the start of a string until it reaches length characters. If the number of characters is already equal to or greater than length, the string is not modified.
expr_str_zfill(length)
length |
Pad the string until it reaches this length. Strings with length equal to or greater than this value are returned as-is. This can be an Expr or something coercible to an Expr. Strings are parsed as column names. |
Return a copy of the string left-filled with ASCII '0' digits to make a string of the requested length. A leading sign prefix ('+'/'-') is handled by inserting the padding after the sign character rather than before. The original string is returned if length is less than or equal to the number of characters in the string.
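The sign handling can be sketched in base R as follows. zfill_sketch() is a hypothetical helper written only to mirror the description above; it is not the polars implementation and its argument name is chosen to match the documented length parameter.

```r
# Hypothetical helper mirroring the documented padding rule (not a polars function).
zfill_sketch <- function(s, length) {
  has_sign <- grepl("^[+-]", s)
  sign <- ifelse(has_sign, substr(s, 1, 1), "")
  body <- ifelse(has_sign, substring(s, 2), s)
  # Number of zeroes needed; never negative, so long strings are left as-is.
  n_pad <- pmax(length - nchar(s), 0)
  # The padding goes after the sign character, not before it.
  paste0(sign, strrep("0", n_pad), body)
}

zfill_sketch(c("-1", "123", "999999"), 4)
# "-001" "0123" "999999"
```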
A polars expression
df <- pl$DataFrame(a = c(-1L, 123L, 999999L, NA))
df$with_columns(zfill = pl$col("a")$cast(pl$String)$str$zfill(4))
Retrieve one or multiple Struct field(s) as a new Series
expr_struct_field(...)
... |
< |
A polars expression
df <- pl$DataFrame(
  aaa = c(1, 2),
  bbb = c("ab", "cd"),
  ccc = c(TRUE, NA),
  ddd = list(1:2, 3)
)$select(struct_col = pl$struct("aaa", "bbb", "ccc", "ddd"))
df
# Retrieve struct field(s) as Series:
df$select(pl$col("struct_col")$struct$field("bbb"))
df$select(
  pl$col("struct_col")$struct$field("bbb"),
  pl$col("struct_col")$struct$field("ddd")
)
# Use wildcard expansion:
df$select(pl$col("struct_col")$struct$field("*"))
# Retrieve multiple fields by name:
df$select(pl$col("struct_col")$struct$field("aaa", "bbb"))
# Retrieve multiple fields by regex expansion:
df$select(pl$col("struct_col")$struct$field("^a.*|b.*$"))
Convert this struct to a string column with json values
expr_struct_json_encode()
A polars expression
df <- pl$DataFrame(
  a = list(1:2, c(9, 1, 3)),
  b = list(45, NA)
)$select(a = pl$struct("a", "b"))
df
df$with_columns(encoded = pl$col("a")$struct$json_encode())
Rename the fields of the struct
expr_struct_rename_fields(names)
names |
New names, given in the same order as the struct's fields. |
A polars expression
df <- pl$DataFrame(
  aaa = c(1, 2),
  bbb = c("ab", "cd"),
  ccc = c(TRUE, NA),
  ddd = list(1:2, 3)
)$select(struct_col = pl$struct("aaa", "bbb", "ccc", "ddd"))
df
df <- df$select(
  pl$col("struct_col")$struct$rename_fields(c("www", "xxx", "yyy", "zzz"))
)
df$select(pl$col("struct_col")$struct$field("*"))
# Following a rename, the previous field names cannot be referenced:
tryCatch(
  {
    df$select(pl$col("struct_col")$struct$field("aaa"))
  },
  error = function(e) print(e)
)
This is an alias for Expr$struct$field("*").
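A minimal sketch of the alias relationship, using only calls that appear in the examples of this manual. It assumes the polars package is installed; the two-field struct is illustrative.

```r
library(polars)

df <- pl$DataFrame(
  aaa = c(1, 2),
  bbb = c("ab", "cd")
)$select(struct_col = pl$struct("aaa", "bbb"))

# Both spellings expand the struct into one column per field.
via_unnest <- df$select(pl$col("struct_col")$struct$unnest())
via_field <- df$select(pl$col("struct_col")$struct$field("*"))

via_unnest$columns
via_field$columns
```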
expr_struct_unnest()
A polars expression
df <- pl$DataFrame(
  aaa = c(1, 2),
  bbb = c("ab", "cd"),
  ccc = c(TRUE, NA),
  ddd = list(1:2, 3)
)$select(struct_col = pl$struct("aaa", "bbb", "ccc", "ddd"))
df
df$select(pl$col("struct_col")$struct$unnest())
This is similar to with_columns() on DataFrame and LazyFrame.
expr_struct_with_fields(...)
... |
< |
A polars expression
df <- pl$DataFrame(
  x = c(1, 4, 9),
  y = c(4, 9, 16),
  multiply = c(10, 2, 3)
)$select(coords = pl$struct("x", "y"), "multiply")
df
df <- df$with_columns(
  pl$col("coords")$struct$with_fields(
    pl$field("x")$sqrt(),
    y_mul = pl$field("y") * pl$col("multiply")
  )
)
df
df$select(pl$col("coords")$struct$field("*"))
By default, all query optimizations are enabled. Individual optimizations may be disabled by setting the corresponding parameter to FALSE.
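As a small sketch of toggling optimizations (assuming the polars package is installed), the same query can be collected with the defaults, with one optimization switched off, or with no_optimization = TRUE. The optimizations change how the query is executed, not what it returns.

```r
library(polars)

lf <- pl$LazyFrame(
  a = c("a", "b", "a", "b", "b", "c"),
  b = 1:6,
  c = 6:1
)
query <- lf$group_by("a")$agg(pl$all()$sum())

# All optimizations enabled (the default).
res_default <- query$collect()
# Disable a single optimization.
res_no_pushdown <- query$collect(predicate_pushdown = FALSE)
# Disable the optimizations altogether.
res_unoptimized <- query$collect(no_optimization = TRUE)

res_default$shape
res_unoptimized$shape
```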
lazyframe__collect( ..., type_coercion = TRUE, predicate_pushdown = TRUE, projection_pushdown = TRUE, simplify_expression = TRUE, slice_pushdown = TRUE, comm_subplan_elim = TRUE, comm_subexpr_elim = TRUE, cluster_with_columns = TRUE, no_optimization = FALSE, streaming = FALSE, `_eager` = FALSE )
... |
These dots are for future extensions and must be empty. |
type_coercion |
A logical, indicating whether to apply the type coercion optimization. |
predicate_pushdown |
A logical, indicating whether to apply the predicate pushdown optimization. |
projection_pushdown |
A logical, indicating whether to apply the projection pushdown optimization. |
simplify_expression |
A logical, indicating whether to apply the simplify expression optimization. |
slice_pushdown |
A logical, indicating whether to apply the slice pushdown optimization. |
comm_subplan_elim |
A logical, indicating whether to try to cache branching subplans that occur on self-joins or unions. |
comm_subexpr_elim |
A logical, indicating whether to try to cache common subexpressions. |
cluster_with_columns |
A logical, indicating whether to combine sequential independent calls to with_columns. |
no_optimization |
A logical. If |
streaming |
A logical. If |
_eager |
A logical, indicating whether to turn off multi-node optimizations and the other optimizations. This option is intended for internal use only. |
A polars DataFrame
lf <- pl$LazyFrame(
  a = c("a", "b", "a", "b", "b", "c"),
  b = 1:6,
  c = 6:1,
)
lf$group_by("a")$agg(pl$all()$sum())$collect()
# Collect in streaming mode
lf$group_by("a")$agg(pl$all()$sum())$collect(
  streaming = TRUE
)
Select and perform operations on a subset of columns only. This discards unmentioned columns (like .() in data.table and unlike dplyr::mutate()).
One cannot use new variables in subsequent expressions in the same $select() call. For instance, if you create a variable x, you will only be able to use it in another $select() or $with_columns() call.
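A minimal sketch of this restriction, assuming the polars package is installed. The column names x and y are illustrative: x is created in one $select() call and can only be referenced in a later call.

```r
library(polars)

lf <- pl$LazyFrame(foo = 1:3, bar = 6:8)

# `x` is created in the first $select() and used in the second one.
out <- lf$select(x = pl$col("foo") * 2)$select(y = pl$col("x") + 1)$collect()

# Only the column from the last $select() remains.
out$columns
```

Referencing x inside the same $select() call that creates it is the case the paragraph above rules out.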
lazyframe__select(...)
... |
< |
A polars LazyFrame
# Pass the name of a column to select that column.
lf <- pl$LazyFrame(
  foo = 1:3,
  bar = 6:8,
  ham = letters[1:3]
)
lf$select("foo")$collect()
# Multiple columns can be selected by passing several column names.
lf$select("foo", "bar")$collect()
# Expressions are also accepted.
lf$select(pl$col("foo"), pl$col("bar") + 1)$collect()
# Name expression (used as the column name of the output DataFrame)
lf$select(
  threshold = pl$when(pl$col("foo") > 2)$then(10)$otherwise(0)
)$collect()
# Expressions with multiple outputs can be automatically instantiated
# as Structs by setting the `POLARS_AUTO_STRUCTIFY` environment variable.
# (Experimental)
if (requireNamespace("withr", quietly = TRUE)) {
  withr::with_envvar(c(POLARS_AUTO_STRUCTIFY = "1"), {
    lf$select(
      is_odd = ((pl$col(pl$Int32) %% 2) == 1)$name$suffix("_is_odd"),
    )$collect()
  })
}
Add columns or modify existing ones with expressions. This is similar to dplyr::mutate() as it keeps unmentioned columns (unlike $select()).
However, unlike dplyr::mutate(), one cannot use new variables in subsequent expressions in the same $with_columns() call. For instance, if you create a variable x, you will only be able to use it in another $with_columns() or $select() call.
lazyframe__with_columns(...)
... |
< |
A polars LazyFrame
# Pass an expression to add it as a new column.
lf <- pl$LazyFrame(
  a = 1:4,
  b = c(0.5, 4, 10, 13),
  c = c(TRUE, TRUE, FALSE, TRUE),
)
lf$with_columns((pl$col("a")^2)$alias("a^2"))$collect()
# Added columns will replace existing columns with the same name.
lf$with_columns(a = pl$col("a")$cast(pl$Float64))$collect()
# Multiple columns can be added
lf$with_columns(
  (pl$col("a")^2)$alias("a^2"),
  (pl$col("b") / 2)$alias("b/2"),
  (pl$col("c")$not())$alias("not c"),
)$collect()
# Name expression instead of `$alias()`
lf$with_columns(
  `a^2` = pl$col("a")^2,
  `b/2` = pl$col("b") / 2,
  `not c` = pl$col("c")$not(),
)$collect()
# Expressions with multiple outputs can automatically be instantiated
# as Structs by enabling the experimental setting `POLARS_AUTO_STRUCTIFY`:
if (requireNamespace("withr", quietly = TRUE)) {
  withr::with_envvar(c(POLARS_AUTO_STRUCTIFY = "1"), {
    lf$drop("c")$with_columns(
      diffs = pl$col("a", "b")$diff()$name$suffix("_diff"),
    )$collect()
  })
}
pl is an environment class object that stores all the top-level functions of the R Polars API, which mimics the Python Polars API. It is intended to work the same way as pl does in Python after importing Python Polars with import polars as pl.
pl
An object of class polars_object of length 75.
pl
# How many members are in the `pl` environment?
length(pl)
# Create a polars DataFrame
# In Python:
# ```python
# >>> import polars as pl
# >>> df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
# ```
# In R:
df <- pl$DataFrame(a = c(1, 2, 3), b = c(4, 5, 6))
df
If no arguments are passed, this function is syntactic sugar for col("*"). Otherwise, this function is syntactic sugar for col(names)$all().
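A short sketch comparing the shorthand with its longer spelling, assuming the polars package is installed and that the expression method $all() is available as the description implies.

```r
library(polars)

df <- pl$DataFrame(
  a = c(TRUE, FALSE, TRUE),
  b = c(FALSE, FALSE, FALSE)
)

# With a column name, pl$all("a") aggregates column `a` ...
sugar <- df$select(pl$all("a"))
# ... in the same way as the explicit expression.
explicit <- df$select(pl$col("a")$all())

sugar$shape
explicit$shape
```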
pl__all(..., ignore_nulls = TRUE)
... |
Name(s) of the columns to use in the aggregation. |
ignore_nulls |
If |
A polars expression
df <- pl$DataFrame(
  a = c(TRUE, FALSE, TRUE),
  b = c(FALSE, FALSE, FALSE)
)
# Selecting all columns
df$select(pl$all()$sum())
# Evaluate bitwise AND for a column.
df$select(pl$all("a"))
Apply the AND logical horizontally across columns
pl__all_horizontal(...)
... |
< |
Kleene logic is used to deal with nulls: if the column contains any null values and no FALSE values, the output is null.
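Base R's & operator follows the same Kleene logic for NA, so the rule can be checked without polars. This is an analogy in base R, not the polars implementation; the vectors are the a and b columns from the example for this function.

```r
a <- c(FALSE, FALSE, TRUE, TRUE, FALSE, NA)
b <- c(FALSE, TRUE, TRUE, NA, NA, NA)

# Row-wise AND: any FALSE decides the result; otherwise a missing value makes it NA.
all_ab <- Reduce(`&`, list(a, b))
all_ab
# FALSE FALSE TRUE NA FALSE NA
```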
A polars expression
df <- pl$DataFrame(
  a = c(FALSE, FALSE, TRUE, TRUE, FALSE, NA),
  b = c(FALSE, TRUE, TRUE, NA, NA, NA),
  c = c("u", "v", "w", "x", "y", "z")
)
df$with_columns(
  all = pl$all_horizontal("a", "b", "c")
)
This function is syntactic sugar for col(names)$any().
pl__any(..., ignore_nulls = TRUE)
... |
Name(s) of the columns to use in the aggregation. |
ignore_nulls |
If |
A polars expression
df <- pl$DataFrame(
  a = c(TRUE, FALSE, TRUE),
  b = c(FALSE, FALSE, FALSE)
)
df$select(pl$any("a"))
Apply the OR logical horizontally across columns
pl__any_horizontal(...)
... |
< |
Kleene logic is used to deal with nulls: if the column contains any null values and no TRUE values, the output is null.
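For OR, Kleene logic means a single TRUE decides the result, and a missing value only matters when no input is TRUE. Base R's | operator behaves the same way for NA, which gives a quick check that does not depend on polars (an analogy, not the polars implementation).

```r
a <- c(FALSE, FALSE, TRUE, TRUE, FALSE, NA)
b <- c(FALSE, TRUE, TRUE, NA, NA, NA)

# Row-wise OR: any TRUE decides the result; otherwise a missing value makes it NA.
any_ab <- Reduce(`|`, list(a, b))
any_ab
# FALSE TRUE TRUE TRUE NA NA
```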
A polars expression
df <- pl$DataFrame(
  a = c(FALSE, FALSE, TRUE, TRUE, FALSE, NA),
  b = c(FALSE, TRUE, TRUE, NA, NA, NA),
  c = c("u", "v", "w", "x", "y", "z")
)
df$with_columns(
  any = pl$any_horizontal("a", "b", "c")
)
Folds the columns from left to right, keeping the first non-null value
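The left-to-right fold can be written out in base R to show what "keeping the first non-null value" means. first_non_null() is a hypothetical helper used for illustration only; it is not how polars implements coalesce. The vectors match the example data for this function.

```r
a <- c(1, NA, NA, NA)
b <- c(1, 2, NA, NA)
c_col <- c(5, NA, 3, NA)

# Hypothetical helper: keep x where it is present, otherwise fall back to y.
first_non_null <- function(x, y) ifelse(is.na(x), y, x)

# Fold from left to right; the final 10 acts as a default value.
Reduce(first_non_null, list(a, b, c_col, 10))
# 1 2 3 10
```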
pl__coalesce(...)
... |
< |
A polars expression
df <- pl$DataFrame(
  a = c(1, NA, NA, NA),
  b = c(1, 2, NA, NA),
  c = c(5, NA, 3, NA)
)
df$with_columns(d = pl$coalesce("a", "b", "c", 10))
df$with_columns(d = pl$coalesce(pl$col("a", "b", "c"), 10))
Create an expression representing column(s) in a DataFrame
pl__col(...)
... |
<
|
A polars expression
# a single column by a character
pl$col("foo")
# multiple columns by characters
pl$col("foo", "bar")
# multiple columns by polars data types
pl$col(pl$Float64, pl$String)
# Single `"*"` is converted to a wildcard expression
pl$col("*")
# Character vectors with length > 1 should be used with `!!!`
pl$col(!!!c("foo", "bar"), "baz")
pl$col("foo", !!!c("bar", "baz"))
# there are some special notations for selecting columns
df <- pl$DataFrame(foo = 1:3, bar = 4:6, baz = 7:9)
## select all columns with a wildcard `"*"`
df$select(pl$col("*"))
## select multiple columns by a regular expression
## starts with `^` and ends with `$`
df$select(pl$col("^ba.*$"))
Combine multiple DataFrames, LazyFrames, or Series into a single object
pl__concat( ..., how = c("vertical", "vertical_relaxed", "diagonal", "diagonal_relaxed", "horizontal", "align"), rechunk = FALSE, parallel = TRUE )
... |
< |
how |
Strategy to concatenate items. Must be one of:
Series only support the |
rechunk |
Make sure that the result data is in contiguous memory. |
parallel |
Only relevant for LazyFrames. This determines if the concatenated lazy computations may be executed in parallel. |
The same class (polars_data_frame, polars_lazy_frame or polars_series) as the input.
# default is 'vertical' strategy
df1 <- pl$DataFrame(a = 1L, b = 3L)
df2 <- pl$DataFrame(a = 2L, b = 4L)
pl$concat(df1, df2)
# 'a' is coerced to float64
df1 <- pl$DataFrame(a = 1L, b = 3L)
df2 <- pl$DataFrame(a = 2, b = 4L)
pl$concat(df1, df2, how = "vertical_relaxed")
df_h1 <- pl$DataFrame(l1 = 1:2, l2 = 3:4)
df_h2 <- pl$DataFrame(r1 = 5:6, r2 = 7:8, r3 = 9:10)
pl$concat(df_h1, df_h2, how = "horizontal")
# use 'diagonal' strategy to fill empty column values with nulls
df1 <- pl$DataFrame(a = 1L, b = 3L)
df2 <- pl$DataFrame(a = 2L, c = 4L)
pl$concat(df1, df2, how = "diagonal")
df_a1 <- pl$DataFrame(id = 1:2, x = 3:4)
df_a2 <- pl$DataFrame(id = 2:3, y = 5:6)
df_a3 <- pl$DataFrame(id = c(1L, 3L), z = 7:8)
pl$concat(df_a1, df_a2, df_a3, how = "align")
Horizontally concatenate columns into a single list column
pl__concat_list(...)
... |
< |
A polars expression
df <- pl$DataFrame(a = list(1:2, 3, 4:5), b = list(4, integer(0), NULL))
# Concatenate two existing list columns. Null values are propagated.
df$with_columns(concat_list = pl$concat_list("a", "b"))
# Non-list columns are cast to a list before concatenation. The output data
# type is the supertype of the concatenated columns.
df$select("a", concat_list = pl$concat_list("a", pl$lit("x")))
# Create lagged columns and collect them into a list. This mimics a rolling
# window.
df <- pl$DataFrame(A = c(1, 2, 9, 2, 13))
df <- df$select(
  A_lag_1 = pl$col("A")$shift(1),
  A_lag_2 = pl$col("A")$shift(2),
  A_lag_3 = pl$col("A")$shift(3)
)
df$select(A_rolling = pl$concat_list("A_lag_1", "A_lag_2", "A_lag_3"))
This function is syntactic sugar for col(names)$cum_sum().
pl__cum_sum(...)
... |
Name(s) of the columns to use in the aggregation. |
A polars expression
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, 2),
  c = c("foo", "bar", "foo")
)
# Get the cum_sum of a column
df$select(pl$cum_sum("a"))
# Get the cum_sum of multiple columns
df$select(pl$cum_sum("a", "b"))
Polars DataFrame class (polars_data_frame)
DataFrames are two-dimensional data structures representing data as a table with rows and columns. Polars DataFrames are similar to R data frames. The columns of an R data frame are R vectors, while the columns of a Polars DataFrame are Polars Series.
pl__DataFrame(..., .schema_overrides = NULL, .strict = TRUE)
... |
< |
.schema_overrides |
A list of polars data types or |
.strict |
A logical value. Passed to the |
The pl$DataFrame() function mimics the constructor of the DataFrame class of Python Polars. This function is basically a shortcut for as_polars_df(list(...))$cast(!!!.schema_overrides, .strict = .strict), so each argument in ... is converted to a Polars Series by as_polars_series() and then passed to as_polars_df().
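A minimal sketch of that equivalence when no schema overrides are given, assuming the polars package is installed. It only compares column names and shape, which is enough to show that both routes build the same table layout.

```r
library(polars)

# Constructor-style call ...
df1 <- pl$DataFrame(a = 1:2, b = c("x", "y"))
# ... and the conversion it is described as a shortcut for.
df2 <- as_polars_df(list(a = 1:2, b = c("x", "y")))

identical(df1$columns, df2$columns)
identical(df1$shape, df2$shape)
```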
A polars DataFrame
columns: $columns returns a character vector with the names of the columns.
dtypes: $dtypes returns an unnamed list with the data type of each column.
schema: $schema returns a named list with the column names as names and the data types as values.
shape: $shape returns an integer vector of length two with the number of rows and columns of the DataFrame.
height: $height returns an integer with the number of rows of the DataFrame.
width: $width returns an integer with the number of columns of the DataFrame.
flags: $flags returns a list with column names as names and a named logical vector with the flags as values.
Flags are used internally to avoid unnecessary computations, such as sorting a variable that is already known to be sorted. The number of flags varies depending on the column type: columns of type array and list have the flags SORTED_ASC, SORTED_DESC, and FAST_EXPLODE, while other column types only have the former two.
SORTED_ASC is set to TRUE when a column is sorted in increasing order, so that this information can be used later to avoid re-sorting it.
SORTED_DESC is similar but applies to sorting in decreasing order.
# Constructing a DataFrame from vectors:
pl$DataFrame(a = 1:2, b = 3:4)
# Constructing a DataFrame from Series:
pl$DataFrame(pl$Series("a", 1:2), pl$Series("b", 3:4))
# Constructing a DataFrame from a list:
data <- list(a = 1:2, b = 3:4)
## Using the as_polars_df function (recommended)
as_polars_df(data)
## Using dynamic dots feature
pl$DataFrame(!!!data)
# Active bindings:
df <- pl$DataFrame(a = 1:3, b = c("foo", "bar", "baz"))
df$columns
df$dtypes
df$schema
df$shape
df$height
df$width
If both start and end are passed as Date types (not Datetime), and the interval granularity is no finer than "1d", the returned range is also of type Date. All other permutations return a Datetime.
pl__date_range( start, end, interval = "1d", ..., closed = c("both", "left", "none", "right") )
start |
Lower bound of the date range. Something that can be coerced to a Date or a Datetime expression. See examples for details. |
end |
Upper bound of the date range. Something that can be coerced to a Date or a Datetime expression. See examples for details. |
interval |
Interval of the range periods, specified as a difftime
object or using the Polars duration string language. See the |
... |
These dots are for future extensions and must be empty. |
closed |
Define which sides of the range are closed (inclusive).
One of the following: |
A polars expression
Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.
It has the following format:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
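As a small sketch of how the interval string affects a range (assuming the polars package is installed), the same start and end dates produce a different number of rows for a calendar-month step and a two-week step.

```r
library(polars)

start <- as.Date("2022-01-01")
end <- as.Date("2022-03-01")

# "1mo" steps by calendar month: Jan 1, Feb 1, Mar 1.
monthly <- pl$select(date = pl$date_range(start, end, "1mo"))
# "2w" steps by two calendar weeks: Jan 1, Jan 15, Jan 29, Feb 12, Feb 26.
biweekly <- pl$select(date = pl$date_range(start, end, "2w"))

monthly$height   # 3
biweekly$height  # 5
```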
See also: pl$date_ranges() to create a simple Series of data type list(Date) based on column values.
# Using Polars duration string to specify the interval:
pl$select(
  date = pl$date_range(as.Date("2022-01-01"), as.Date("2022-03-01"), "1mo")
)
# Using `difftime` object to specify the interval:
pl$select(
  date = pl$date_range(
    as.Date("1985-01-01"),
    as.Date("1985-01-10"),
    as.difftime(2, units = "days")
  )
)
If both start and end are passed as Date types (not Datetime), and the interval granularity is no finer than "1d", the returned range is also of type Date. All other permutations return a Datetime.
pl__date_ranges( start, end, interval = "1d", ..., closed = c("both", "left", "none", "right") )
start |
Lower bound of the date range. Something that can be coerced to a Date or a Datetime expression. See examples for details. |
end |
Upper bound of the date range. Something that can be coerced to a Date or a Datetime expression. See examples for details. |
interval |
Interval of the range periods, specified as a difftime object or using the Polars duration string language. See the |
... |
These dots are for future extensions and must be empty. |
closed |
Define which sides of the range are closed (inclusive).
One of the following: |
A polars expression
Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.
It has the following format:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: "3d12h4m25s"
# 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
pl$date_range() to create a simple Series of data type Date.
df <- pl$DataFrame(
  start = as.Date(c("2022-01-01", "2022-01-02", NA)),
  end = rep(as.Date("2022-01-03"), 3)
)
df$with_columns(
  date_range = pl$date_ranges("start", "end"),
  date_range_cr = pl$date_ranges("start", "end", closed = "right")
)
# provide a custom "end" value
df$with_columns(
  date_range_lit = pl$date_ranges("start", pl$lit(as.Date("2022-01-02")))
)
Create a Polars literal expression of type Datetime
pl__datetime( year, month, day, hour = NULL, minute = NULL, second = NULL, microsecond = NULL, ..., time_unit = c("us", "ns", "ms"), time_zone = NULL, ambiguous = c("raise", "earliest", "latest", "null") )
year |
A polars expression or something that can be coerced to a polars expression by |
month |
A polars expression or something that can be coerced to a polars expression by |
day |
A polars expression or something that can be coerced to a polars expression by |
hour |
A polars expression or something that can be coerced to a polars expression by |
minute |
A polars expression or something that can be coerced to a polars expression by |
second |
A polars expression or something that can be coerced to a polars expression by |
microsecond |
A polars expression or something that can be coerced to a polars expression by |
... |
These dots are for future extensions and must be empty. |
time_unit |
One of |
time_zone |
A string or |
ambiguous |
Determine how to deal with ambiguous datetimes. A character vector or expression containing one of the following:
|
A polars expression
df <- pl$DataFrame(
  month = c(1, 2, 3),
  day = c(4, 5, 6),
  hour = c(12, 13, 14),
  minute = c(15, 30, 45)
)
df$with_columns(
  pl$datetime(
    2024,
    pl$col("month"),
    pl$col("day"),
    pl$col("hour"),
    pl$col("minute"),
    time_zone = "Australia/Sydney"
  )
)
# We can also use `pl$datetime()` for filtering:
df <- pl$select(
  start = ISOdatetime(2024, 1, 1, 0, 0, 0),
  end = c(
    ISOdatetime(2024, 5, 1, 20, 15, 10),
    ISOdatetime(2024, 7, 1, 21, 25, 20),
    ISOdatetime(2024, 9, 1, 22, 35, 30)
  )
)
df$filter(pl$col("end") > pl$datetime(2024, 6, 1))
Generate a datetime range
pl__datetime_range( start, end, interval = "1d", ..., closed = c("both", "left", "none", "right"), time_unit = NULL, time_zone = NULL )
start |
Lower bound of the date range. Something that can be coerced to a Date or a Datetime expression. See examples for details. |
end |
Upper bound of the date range. Something that can be coerced to a Date or a Datetime expression. See examples for details. |
interval |
Interval of the range periods, specified as a difftime object or using the Polars duration string language. See the |
... |
These dots are for future extensions and must be empty. |
closed |
Define which sides of the range are closed (inclusive).
One of the following: |
time_unit |
Time unit of the resulting Datetime data type. One of |
time_zone |
Time zone of the resulting Datetime data type. |
A polars expression
Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.
It has the following format:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: "3d12h4m25s"
# 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
pl$datetime_ranges() to create a simple Series of data type list(Datetime) based on column values.
# Using Polars duration string to specify the interval:
pl$select(
  datetime = pl$datetime_range(as.Date("2022-01-01"), as.Date("2022-03-01"), "1mo")
)
# Using `difftime` object to specify the interval:
pl$select(
  datetime = pl$datetime_range(
    as.Date("1985-01-01"),
    as.Date("1985-01-10"),
    as.difftime(1, units = "days") + as.difftime(12, units = "hours")
  )
)
# Specifying a time zone:
pl$select(
  datetime = pl$datetime_range(
    as.Date("2022-01-01"),
    as.Date("2022-03-01"),
    "1mo",
    time_zone = "America/New_York"
  )
)
Generate a list containing a datetime range
pl__datetime_ranges( start, end, interval = "1d", ..., closed = c("both", "left", "none", "right"), time_unit = NULL, time_zone = NULL )
start |
Lower bound of the date range. Something that can be coerced to a Date or a Datetime expression. See examples for details. |
end |
Upper bound of the date range. Something that can be coerced to a Date or a Datetime expression. See examples for details. |
interval |
Interval of the range periods, specified as a difftime object or using the Polars duration string language. See the |
... |
These dots are for future extensions and must be empty. |
closed |
Define which sides of the range are closed (inclusive).
One of the following: |
time_unit |
Time unit of the resulting Datetime data type. One of |
time_zone |
Time zone of the resulting Datetime data type. |
A polars expression
Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.
It has the following format:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: "3d12h4m25s"
# 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
pl$datetime_range() to create a simple Series of data type Datetime.
df <- pl$DataFrame(
  start = as.POSIXct(c("2022-01-01 10:00", "2022-01-01 11:00", NA)),
  end = rep(as.POSIXct("2022-01-01 12:00"), 3)
)
df$with_columns(
  dt_range = pl$datetime_ranges("start", "end", interval = "1h"),
  dt_range_cr = pl$datetime_ranges("start", "end", closed = "right", interval = "1h")
)
# provide a custom "end" value
df$with_columns(
  dt_range_lit = pl$datetime_ranges(
    "start",
    pl$lit(as.POSIXct("2022-01-01 11:00")),
    interval = "1h"
  )
)
A Duration represents a fixed amount of time. For example, pl$duration(days = 1) means "exactly 24 hours". By contrast, <expr>$dt$offset_by("1d") means "1 calendar day", which could sometimes be 23 hours or 25 hours depending on Daylight Saving Time. For non-fixed durations such as "calendar month" or "calendar day", please use <expr>$dt$offset_by() instead.
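The difference between a fixed 24-hour step and a calendar-day step can be illustrated with a small base R analogy. This sketch does not call Polars at all; it assumes the "America/New_York" time zone is available on the system and uses the 2022 switch to daylight saving time (13 March) as the example date.

```r
# Base R analogy only; none of this calls Polars.
start <- as.POSIXct("2022-03-12 12:00:00", tz = "America/New_York")

# "Calendar day" steps keep the wall-clock time (seq() calls this "DSTday").
calendar_steps <- seq(start, by = "DSTday", length.out = 3)

# Fixed steps always add exactly 24 hours (86400 seconds).
fixed_steps <- start + 86400 * 0:2

# Elapsed hours between consecutive calendar steps: 23, then 24.
calendar_gaps <- as.numeric(diff(calendar_steps), units = "hours")

# Wall-clock times of the fixed steps: "12:00" "13:00" "13:00".
fixed_clock <- format(fixed_steps, "%H:%M")
```

The calendar steps stay at 12:00 local time but the first one spans only 23 hours, whereas the fixed steps always span 24 hours and therefore drift to 13:00 once the clocks change.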
pl__duration( ..., weeks = NULL, days = NULL, hours = NULL, minutes = NULL, seconds = NULL, milliseconds = NULL, microseconds = NULL, nanoseconds = NULL, time_unit = NULL )
... |
These dots are for future extensions and must be empty. |
weeks |
Something that can be coerced to a polars expression by |
days |
Something that can be coerced to a polars expression by |
hours |
Something that can be coerced to a polars expression by |
minutes |
Something that can be coerced to a polars expression by |
seconds |
Something that can be coerced to a polars expression by |
milliseconds |
Something that can be coerced to a polars expression by |
microseconds |
Something that can be coerced to a polars expression by |
nanoseconds |
Something that can be coerced to a polars expression by |
time_unit |
One of |
A polars expression
df <- pl$DataFrame(
  dt = as.POSIXct(c("2022-01-01", "2022-01-02")),
  add = c(1, 2)
)
df
df$select(
  add_weeks = pl$col("dt") + pl$duration(weeks = pl$col("add")),
  add_days = pl$col("dt") + pl$duration(days = pl$col("add")),
  add_seconds = pl$col("dt") + pl$duration(seconds = pl$col("add")),
  add_millis = pl$col("dt") + pl$duration(milliseconds = pl$col("add")),
  add_hours = pl$col("dt") + pl$duration(hours = pl$col("add"))
)
Alias for an element being evaluated in an eval expression
pl__element()
A polars expression
# A horizontal rank computation by taking the elements of a list:
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, 2)
)
df$with_columns(
  rank = pl$concat_list(c("a", "b"))$list$eval(pl$element()$rank())
)
# A mathematical operation on list elements:
df$with_columns(
  a_b_doubled = pl$concat_list(c("a", "b"))$list$eval(pl$element() * 2)
)
Get the first column of the context
pl__first()
A polars expression
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, 2),
  c = c("foo", "bar", "baz")
)
df$select(pl$first())
Generate a range of integers
pl__int_range(start = 0, end = NULL, step = 1, ..., dtype = pl$Int64)
start |
Start of the range (inclusive). Defaults to 0. |
end |
End of the range (exclusive). If |
step |
Step size of the range. |
... |
These dots are for future extensions and must be empty. |
dtype |
Data type of the range. |
A polars expression
pl$select(int = pl$int_range(0, 3))
# end can be omitted for a shorter syntax.
pl$select(int = pl$int_range(3))
# Generate an index column by using int_range in conjunction with len().
df <- pl$DataFrame(a = c(1, 3, 5), b = c(2, 4, 6))
df$select(
  index = pl$int_range(pl$len(), dtype = pl$UInt32),
  pl$all()
)
Generate a range of integers for each row of the input columns
pl__int_ranges(start = 0, end = NULL, step = 1, ..., dtype = pl$Int64)
start |
Start of the range (inclusive). Defaults to 0. |
end |
End of the range (exclusive). If |
step |
Step size of the range. |
... |
These dots are for future extensions and must be empty. |
dtype |
Data type of the range. |
A polars expression
df <- pl$DataFrame(start = c(1, -1), end = c(3, 2))
df$with_columns(int_range = pl$int_ranges("start", "end"))
# end can be omitted for a shorter syntax.
df$select("end", int_range = pl$int_ranges("end"))
Get the last column of the context
pl__last()
A polars expression
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, 2),
  c = c("foo", "bar", "baz")
)
df$select(pl$last())
Polars LazyFrame class (polars_lazy_frame)
Representation of a lazy computation graph/query against a DataFrame. This allows for whole-query optimisation in addition to parallelism, and is the preferred (and highest-performance) mode of operation for polars.
pl__LazyFrame(..., .schema_overrides = NULL, .strict = TRUE)
... |
< |
.schema_overrides |
A list of polars data types or |
.strict |
A logical value. Passed to the |
The pl$LazyFrame(...) function is a shortcut for pl$DataFrame(...)$lazy().
A polars LazyFrame
<LazyFrame>$collect(): Materialize a LazyFrame into a DataFrame.
# Constructing a LazyFrame from vectors:
pl$LazyFrame(a = 1:2, b = 3:4)
# Constructing a LazyFrame from Series:
pl$LazyFrame(pl$Series("a", 1:2), pl$Series("b", 3:4))
# Constructing a LazyFrame from a list:
data <- list(a = 1:2, b = 3:4)
## Using dynamic dots feature
pl$LazyFrame(!!!data)
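The shortcut relationship and the lazy-then-collect workflow described above can be sketched as follows. This is a minimal illustration that assumes the polars package is installed and uses only the calls named in this page (pl$LazyFrame(), pl$DataFrame(), $lazy(), and $collect()).

```r
library(polars)

# Build a lazy query directly; nothing is materialized at this point.
lf <- pl$LazyFrame(a = 1:2, b = 3:4)

# The same query built the long way, via an eager DataFrame.
lf_long <- pl$DataFrame(a = 1:2, b = 3:4)$lazy()

# Materialize both LazyFrames into DataFrames.
df <- lf$collect()
df_long <- lf_long$collect()

df
df_long
```

Both printed DataFrames should contain the same two columns, since the two construction routes are documented as equivalent.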
This function is a shorthand for as_polars_expr(x, as_lit = TRUE) and in most cases, the actual conversion is done by as_polars_series().
pl__lit(value, dtype = NULL)
value |
An R object. Passed as the |
dtype |
A polars data type or |
A polars expression
Since R has no scalar class, a length-1 vector of each of the following types is specially converted to a scalar literal.
character: String
logical: Boolean
integer: Int32
double: Float64
These types' NA is converted to a null literal with casting to the corresponding Polars type.
The raw type vector is converted to a Binary scalar.
raw: Binary
NULL is converted to a Null type null literal.
NULL: Null
For other R classes, the default S3 method is called and the R object is converted via as_polars_series(), so the type mapping is defined by as_polars_series().
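One way to inspect the scalar mapping listed above is to place one literal of each type into a single pl$select() call and read the data types from the printed header. This is a minimal sketch that assumes the polars package is installed; the column names are arbitrary labels chosen for the illustration.

```r
library(polars)

# One length-1 R value per type; each becomes a scalar literal.
out <- pl$select(
  chr = pl$lit("foo"),             # character -> String
  lgl = pl$lit(TRUE),              # logical   -> Boolean
  int = pl$lit(1L),                # integer   -> Int32
  dbl = pl$lit(5.5),               # double    -> Float64
  na_int = pl$lit(NA_integer_),    # NA keeps the type of its R vector
  bin = pl$lit(as.raw(c(1, 2)))    # raw vector -> Binary scalar
)

# The printed header lists the Polars data type of each column.
out
```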
as_polars_series(): R -> Polars type mapping is mostly defined by this function.
as_polars_expr(): Internal implementation of pl$lit().
# Literal scalar values
pl$lit(1L)
pl$lit(5.5)
pl$lit(NULL)
pl$lit("foo_bar")
## Generally, when a vector (an R object) becomes a Series of length 1,
## it is converted to a Series and its first value is taken as a scalar literal.
pl$lit(as.Date("2021-01-20"))
pl$lit(as.POSIXct("2023-03-31 10:30:45"))
pl$lit(data.frame(a = 1, b = "foo"))
# Literal Series data
pl$lit(1:3)
pl$lit(pl$Series("x", 1:3))
This function is syntactic sugar for col(names)$max().
pl__max(...)
... |
Name(s) of the columns to use in the aggregation. |
A polars expression
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, 2),
  c = c("foo", "bar", "foo")
)
# Get the maximum value of a column
df$select(pl$max("a"))
# Get the maximum value of multiple columns
df$select(pl$max("a", "b"))
Get the maximum value horizontally across columns
pl__max_horizontal(...)
... |
< |
A polars expression
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, NA),
  c = c("x", "y", "z")
)
df$with_columns(
  max = pl$max_horizontal("a", "b")
)
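The contrast between the column-wise pl$max() and the row-wise pl$max_horizontal() may be easier to see side by side. The sketch below assumes the polars package is installed and uses data without missing values; the base R lines at the end are only an analogy for the two directions of aggregation.

```r
library(polars)

a <- c(1, 8, 3)
b <- c(4, 5, 2)
df <- pl$DataFrame(a = a, b = b)

# Column-wise: one maximum per column, so the result has a single row.
col_max <- df$select(pl$max("a", "b"))

# Row-wise: one maximum per row, added as a new column.
row_max <- df$with_columns(max = pl$max_horizontal("a", "b"))

col_max
row_max

# Base R analogy for the two directions:
base_col_max <- c(a = max(a), b = max(b))  # a = 8, b = 5
base_row_max <- pmax(a, b)                 # 4 8 3
```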
Compute the mean horizontally across columns
pl__mean_horizontal(..., ignore_nulls = TRUE)
... |
< |
ignore_nulls |
A logical.
If |
A polars expression
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, NA),
  c = c("x", "y", "z")
)
df$with_columns(
  mean = pl$mean_horizontal("a", "b")
)
This function is syntactic sugar for col(names)$min().
pl__min(...)
... |
Name(s) of the columns to use in the aggregation. |
A polars expression
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, 2),
  c = c("foo", "bar", "foo")
)
# Get the minimum value of a column
df$select(pl$min("a"))
# Get the minimum value of multiple columns
df$select(pl$min("a", "b"))
Get the minimum value horizontally across columns
pl__min_horizontal(...)
... |
< |
A polars expression
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, NA),
  c = c("x", "y", "z")
)
df$with_columns(
  min = pl$min_horizontal("a", "b")
)
Get the nth column(s) of the context
pl__nth(indices)
indices |
One or more indices representing the columns to retrieve. |
A polars expression
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, 2),
  c = c("foo", "bar", "baz")
)
df$select(pl$nth(1))
df$select(pl$nth(c(2, 0)))
New DataFrame from CSV
pl__read_csv( source, ..., has_header = TRUE, separator = ",", comment_prefix = NULL, quote_char = "\"", skip_rows = 0, schema = NULL, schema_overrides = NULL, null_values = NULL, missing_utf8_is_empty_string = FALSE, ignore_errors = FALSE, cache = FALSE, infer_schema = TRUE, infer_schema_length = 100, n_rows = NULL, encoding = c("utf8", "utf8-lossy"), low_memory = FALSE, rechunk = FALSE, skip_rows_after_header = 0, row_index_name = NULL, row_index_offset = 0, try_parse_dates = FALSE, eol_char = "\n", raise_if_empty = TRUE, truncate_ragged_lines = FALSE, decimal_comma = FALSE, glob = TRUE, storage_options = NULL, retries = 2, file_cache_ttl = NULL, include_file_paths = NULL )
source |
Path to a file or URL. It is possible to provide multiple paths provided that all CSV files have the same schema. It is not possible to provide several URLs. |
... |
Dots which should be empty. |
has_header |
Indicate whether the first row of the dataset is a header. If
|
separator |
Single byte character to use as separator in the file. |
comment_prefix |
A string, which can be up to 5 symbols in length, used
to indicate the start of a comment line. For instance, it can be set to |
quote_char |
Single byte character used for quoting. Set to |
skip_rows |
Start reading after a particular number of rows. The header will be parsed at this offset. |
schema |
Provide the schema. This means that polars doesn't do schema
inference. This argument expects the complete schema, whereas
|
schema_overrides |
Overwrite dtypes during inference. This must be a list. Names of list elements are used to match to inferred columns. |
null_values |
Character vector specifying the values to interpret as
|
missing_utf8_is_empty_string |
By default, a missing value is considered
to be |
ignore_errors |
Keep reading the file even if some lines yield errors.
You can also use |
cache |
Cache the result after reading. |
infer_schema |
If |
infer_schema_length |
The maximum number of rows to scan for schema
inference. If |
n_rows |
Stop reading from CSV file after reading |
encoding |
Either |
low_memory |
Reduce memory pressure at the expense of performance. |
rechunk |
Reallocate to contiguous memory when all chunks / files are parsed. |
skip_rows_after_header |
Skip this number of rows when the header is parsed. |
row_index_name |
If not |
row_index_offset |
Offset to start the row index column (only used if the name is set). |
try_parse_dates |
Try to automatically parse dates. Most ISO8601-like
formats can be inferred, as well as a handful of others. If this does not
succeed, the column remains of data type |
eol_char |
Single byte end of line character (default: |
raise_if_empty |
If |
truncate_ragged_lines |
Truncate lines that are longer than the schema. |
decimal_comma |
Parse floats using a comma as the decimal separator instead of a period. |
glob |
Expand path given via globbing rules. |
storage_options |
Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
If |
retries |
Number of retries if accessing a cloud instance fails. |
file_cache_ttl |
Amount of time to keep downloaded cloud files since
their last access time, in seconds. Uses the |
include_file_paths |
Include the path of the source file(s) as a column with this name. |
A polars DataFrame
my_file <- tempfile()
write.csv(iris, my_file)
pl$read_csv(my_file)
unlink(my_file)
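A few of the arguments above can be combined to read a file that does not follow the defaults. The following is a minimal sketch, assuming the polars package is installed; the file contents and column names are made up for the illustration, and only arguments listed in the usage above (separator, null_values, n_rows) are used.

```r
library(polars)

# Write a small semicolon-separated file with a textual missing-value marker.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "id;score;label",
    "1;2.5;a",
    "2;NA;b",
    "3;4.0;c"
  ),
  csv_path
)

df <- pl$read_csv(
  csv_path,
  separator = ";",      # fields are split on ";" instead of ","
  null_values = "NA",   # treat the string "NA" as a missing value
  n_rows = 2            # stop after the first two data rows
)
df

unlink(csv_path)
```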
Read into a DataFrame from Arrow IPC (Feather v2) file
pl__read_ipc( source, ..., n_rows = NULL, cache = TRUE, rechunk = FALSE, row_index_name = NULL, row_index_offset = 0L, storage_options = NULL, retries = 2, file_cache_ttl = NULL, hive_partitioning = NULL, hive_schema = NULL, try_parse_hive_dates = TRUE, include_file_paths = NULL )
source |
Path(s) to a file or directory. When needing to authenticate
for scanning cloud locations, see the |
... |
These dots are for future extensions and must be empty. |
n_rows |
Stop reading from the IPC file after reading |
cache |
Cache the result after reading. |
rechunk |
When reading multiple files via a glob pattern, rechunk the final DataFrame into contiguous memory chunks. |
row_index_name |
If not |
row_index_offset |
Offset to start the row index column (only used if the name is set). |
storage_options |
Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
If |
retries |
Number of retries if accessing a cloud instance fails. |
hive_partitioning |
Infer statistics and schema from Hive partitioned
sources and use them to prune reads. If |
hive_schema |
A list containing the column names and data types of the
columns by which the data is partitioned, e.g.
|
try_parse_hive_dates |
Whether to try parsing hive values as date / datetime types. |
include_file_paths |
Character value indicating the column name that will include the path of the source file(s). |
A polars DataFrame
temp_dir <- tempfile()
# Write a hive-style partitioned arrow file dataset
arrow::write_dataset(
  mtcars,
  temp_dir,
  partitioning = c("cyl", "gear"),
  format = "arrow",
  hive_style = TRUE
)
list.files(temp_dir, recursive = TRUE)
# If the path is a folder, Polars automatically tries to detect partitions
# and includes them in the output
pl$read_ipc(temp_dir)
# We can also impose a schema to the partition
pl$read_ipc(temp_dir, hive_schema = list(cyl = pl$String, gear = pl$Int32))
Read into a DataFrame from Arrow IPC stream format
pl__read_ipc_stream( source, ..., columns = NULL, n_rows = NULL, row_index_name = NULL, row_index_offset = 0L, rechunk = TRUE )
source |
A character of the path to an Arrow IPC stream file. |
... |
These dots are for future extensions and must be empty. |
columns |
A character vector of column names to read. |
n_rows |
Stop reading from the IPC stream file after reading |
row_index_name |
If not |
row_index_offset |
Offset to start the row index column (only used if the name is set). |
rechunk |
A logical value to indicate whether to make sure that all data is contiguous. |
A polars DataFrame
temp_file <- tempfile(fileext = ".arrows")
mtcars |>
  nanoarrow::write_nanoarrow(temp_file)
pl$read_ipc_stream(temp_file, columns = c("cyl", "am"))
Read into a DataFrame from NDJSON file
pl__read_ndjson( source, ..., schema = NULL, schema_overrides = NULL, infer_schema_length = 100, batch_size = 1024, n_rows = NULL, low_memory = FALSE, rechunk = FALSE, row_index_name = NULL, row_index_offset = 0L, ignore_errors = FALSE, storage_options = NULL, retries = 2, file_cache_ttl = NULL, include_file_paths = NULL )
source |
Path(s) to a file or directory. When needing to authenticate
for scanning cloud locations, see the |
... |
These dots are for future extensions and must be empty. |
schema |
Specify the datatypes of the columns. The datatypes must match
the datatypes in the file(s). If there are extra columns that are not in the
file(s), consider also enabling |
schema_overrides |
Overwrite dtypes during inference. This must be a list. Names of list elements are used to match to inferred columns. |
infer_schema_length |
The maximum number of rows to scan for schema
inference. If |
n_rows |
Stop reading from the NDJSON file after reading |
low_memory |
Reduce memory pressure at the expense of performance. |
rechunk |
When reading multiple files via a glob pattern, rechunk the final DataFrame into contiguous memory chunks. |
row_index_name |
If not |
row_index_offset |
Offset to start the row index column (only used if the name is set). |
ignore_errors |
Keep reading the file even if some lines yield errors.
You can also use |
storage_options |
Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
If |
retries |
Number of retries if accessing a cloud instance fails. |
file_cache_ttl |
Amount of time to keep downloaded cloud files since
their last access time, in seconds. Uses the |
include_file_paths |
Character value indicating the column name that will include the path of the source file(s). |
A polars DataFrame
ndjson_filename <- tempfile()
jsonlite::stream_out(iris, file(ndjson_filename), verbose = FALSE)
pl$read_ndjson(ndjson_filename)
Read into a DataFrame from Parquet file
pl__read_parquet( source, ..., n_rows = NULL, row_index_name = NULL, row_index_offset = 0L, parallel = c("auto", "columns", "row_groups", "prefiltered", "none"), use_statistics = TRUE, hive_partitioning = NULL, glob = TRUE, schema = NULL, hive_schema = NULL, try_parse_hive_dates = TRUE, rechunk = FALSE, low_memory = FALSE, cache = TRUE, storage_options = NULL, retries = 2, include_file_paths = NULL, allow_missing_columns = FALSE )
source |
Path(s) to a file or directory. When needing to authenticate
for scanning cloud locations, see the |
... |
These dots are for future extensions and must be empty. |
n_rows |
Stop reading from parquet file after reading |
row_index_name |
If not |
row_index_offset |
Offset to start the row index column (only used if the name is set). |
parallel |
This determines the direction and strategy of parallelism.
The prefiltered setting falls back to auto if no predicate is given. |
use_statistics |
Use statistics in the parquet to determine if pages can be skipped from reading. |
hive_partitioning |
Infer statistics and schema from Hive partitioned sources and use them to prune reads. |
glob |
Expand path given via globbing rules. |
schema |
Specify the datatypes of the columns. The datatypes must match
the datatypes in the file(s). If there are extra columns that are not in the
file(s), consider also enabling |
hive_schema |
The column names and data types of the columns by which
the data is partitioned. If |
try_parse_hive_dates |
Whether to try parsing hive values as date / datetime types. |
rechunk |
In case of reading multiple files via a glob pattern rechunk the final DataFrame into contiguous memory chunks. |
low_memory |
Reduce memory pressure at the expense of performance |
cache |
Cache the result after reading. |
storage_options |
Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
If |
retries |
Number of retries if accessing a cloud instance fails. |
include_file_paths |
Character value indicating the column name that will include the path of the source file(s). |
allow_missing_columns |
When reading a list of parquet files, if a
column existing in the first file cannot be found in subsequent files, the
default behavior is to raise an error. However, if |
A polars DataFrame
# Write a Parquet file that we can then import as a DataFrame
temp_file <- withr::local_tempfile(fileext = ".parquet")
as_polars_df(mtcars)$write_parquet(temp_file)

pl$read_parquet(temp_file)

# Write a hive-style partitioned parquet dataset
temp_dir <- withr::local_tempdir()
as_polars_df(mtcars)$write_parquet(temp_dir, partition_by = c("cyl", "gear"))
list.files(temp_dir, recursive = TRUE)

# If the path is a folder, Polars automatically tries to detect partitions
# and includes them in the output
pl$read_parquet(temp_dir)
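The `n_rows`, `row_index_name`, and `row_index_offset` arguments listed above can be combined in one call. The following is a minimal sketch, assuming they behave as their names suggest (limit the number of rows read, and prepend a row index column starting at the given offset). The column name `"row_nr"` is an arbitrary choice for illustration.

```r
library(polars)

# Write a small Parquet file to read back.
temp_file <- tempfile(fileext = ".parquet")
as_polars_df(mtcars)$write_parquet(temp_file)

# Read only the first five rows and add a row index column.
# "row_nr" is a hypothetical column name chosen for this sketch.
head_df <- pl$read_parquet(
  temp_file,
  n_rows = 5,
  row_index_name = "row_nr",
  row_index_offset = 1L
)
head_df

unlink(temp_file)
```

Limiting `n_rows` is a cheap way to inspect the schema of a large file before reading it in full.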
This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.
pl__scan_csv( source, ..., has_header = TRUE, separator = ",", comment_prefix = NULL, quote_char = "\"", skip_rows = 0, schema = NULL, schema_overrides = NULL, null_values = NULL, missing_utf8_is_empty_string = FALSE, ignore_errors = FALSE, cache = FALSE, infer_schema = TRUE, infer_schema_length = 100, n_rows = NULL, encoding = c("utf8", "utf8-lossy"), low_memory = FALSE, rechunk = FALSE, skip_rows_after_header = 0, row_index_name = NULL, row_index_offset = 0, try_parse_dates = FALSE, eol_char = "\n", raise_if_empty = TRUE, truncate_ragged_lines = FALSE, decimal_comma = FALSE, glob = TRUE, storage_options = NULL, retries = 2, file_cache_ttl = NULL, include_file_paths = NULL )
source |
Path to a file or URL. It is possible to provide multiple paths provided that all CSV files have the same schema. It is not possible to provide several URLs. |
... |
Dots which should be empty. |
has_header |
Indicate if the first row of the dataset is a header or not. If
|
separator |
Single byte character to use as separator in the file. |
comment_prefix |
A string, which can be up to 5 symbols in length, used
to indicate the start of a comment line. For instance, it can be set to |
quote_char |
Single byte character used for quoting. Set to |
skip_rows |
Start reading after a particular number of rows. The header will be parsed at this offset. |
schema |
Provide the schema. This means that polars doesn't do schema
inference. This argument expects the complete schema, whereas
|
schema_overrides |
Overwrite dtypes during inference. This must be a list. Names of list elements are used to match to inferred columns. |
null_values |
Character vector specifying the values to interpret as
|
missing_utf8_is_empty_string |
By default, a missing value is considered
to be |
ignore_errors |
Keep reading the file even if some lines yield errors.
You can also use |
cache |
Cache the result after reading. |
infer_schema |
If |
infer_schema_length |
The maximum number of rows to scan for schema
inference. If |
n_rows |
Stop reading from CSV file after reading |
encoding |
Either |
low_memory |
Reduce memory pressure at the expense of performance. |
rechunk |
Reallocate to contiguous memory when all chunks / files are parsed. |
skip_rows_after_header |
Skip this number of rows when the header is parsed. |
row_index_name |
If not |
row_index_offset |
Offset to start the row index column (only used if the name is set). |
try_parse_dates |
Try to automatically parse dates. Most ISO8601-like
formats can be inferred, as well as a handful of others. If this does not
succeed, the column remains of data type |
eol_char |
Single byte end of line character (default: |
raise_if_empty |
If |
truncate_ragged_lines |
Truncate lines that are longer than the schema. |
decimal_comma |
Parse floats using a comma as the decimal separator instead of a period. |
glob |
Expand path given via globbing rules. |
storage_options |
Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
If |
retries |
Number of retries if accessing a cloud instance fails. |
file_cache_ttl |
Amount of time to keep downloaded cloud files since
their last access time, in seconds. Uses the |
include_file_paths |
Include the path of the source file(s) as a column with this name. |
credential_provider |
Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time. |
A polars LazyFrame
my_file <- tempfile()
write.csv(iris, my_file)
lazy_frame <- pl$scan_csv(my_file)
lazy_frame$collect()
unlink(my_file)
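A lazy scan pays off when a filter and a column selection are chained before `$collect()`, because the optimizer can then push both down to the scan. The following is a minimal sketch of such a query on the `iris` data. The file name variable and the threshold are chosen only for illustration, and how much work is actually skipped depends on the optimizer.

```r
library(polars)

csv_file <- tempfile(fileext = ".csv")
write.csv(iris, csv_file, row.names = FALSE)

# Nothing is read yet: this only builds a query plan.
query <- pl$scan_csv(csv_file)$
  filter(pl$col("Sepal.Length") > 5)$
  select("Species", "Sepal.Length")

# The file is read when the query is collected.
result <- query$collect()
result

unlink(csv_file)
```

The result is a polars DataFrame holding only the two selected columns and the rows that pass the filter.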
This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.
pl__scan_ipc( source, ..., n_rows = NULL, cache = TRUE, rechunk = FALSE, row_index_name = NULL, row_index_offset = 0L, storage_options = NULL, retries = 2, file_cache_ttl = NULL, hive_partitioning = NULL, hive_schema = NULL, try_parse_hive_dates = TRUE, include_file_paths = NULL )
source |
Path(s) to a file or directory. When needing to authenticate
for scanning cloud locations, see the |
... |
These dots are for future extensions and must be empty. |
n_rows |
Stop reading from the file after reading |
cache |
Cache the result after reading. |
rechunk |
In case of reading multiple files via a glob pattern rechunk the final DataFrame into contiguous memory chunks. |
row_index_name |
If not |
row_index_offset |
Offset to start the row index column (only used if the name is set). |
storage_options |
Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
If |
retries |
Number of retries if accessing a cloud instance fails. |
hive_partitioning |
Infer statistics and schema from Hive partitioned
sources and use them to prune reads. If |
hive_schema |
A list containing the column names and data types of the
columns by which the data is partitioned, e.g.
|
try_parse_hive_dates |
Whether to try parsing hive values as date / datetime types. |
include_file_paths |
Character value indicating the column name that will include the path of the source file(s). |
A polars LazyFrame
temp_dir <- tempfile()
# Write a hive-style partitioned arrow file dataset
arrow::write_dataset(
  mtcars,
  temp_dir,
  partitioning = c("cyl", "gear"),
  format = "arrow",
  hive_style = TRUE
)
list.files(temp_dir, recursive = TRUE)

# If the path is a folder, Polars automatically tries to detect partitions
# and includes them in the output
pl$scan_ipc(temp_dir)$collect()

# We can also impose a schema on the partition columns
pl$scan_ipc(temp_dir, hive_schema = list(cyl = pl$String, gear = pl$Int32))$collect()
This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.
pl__scan_ndjson( source, ..., schema = NULL, schema_overrides = NULL, infer_schema_length = 100, batch_size = 1024, n_rows = NULL, low_memory = FALSE, rechunk = FALSE, row_index_name = NULL, row_index_offset = 0L, ignore_errors = FALSE, storage_options = NULL, retries = 2, file_cache_ttl = NULL, include_file_paths = NULL )
source |
Path(s) to a file or directory. When needing to authenticate
for scanning cloud locations, see the |
... |
These dots are for future extensions and must be empty. |
schema |
Specify the datatypes of the columns. The datatypes must match
the datatypes in the file(s). If there are extra columns that are not in the
file(s), consider also enabling |
schema_overrides |
Overwrite dtypes during inference. This must be a list. Names of list elements are used to match to inferred columns. |
infer_schema_length |
The maximum number of rows to scan for schema
inference. If |
n_rows |
Stop reading from the file after reading |
low_memory |
Reduce memory pressure at the expense of performance |
rechunk |
In case of reading multiple files via a glob pattern rechunk the final DataFrame into contiguous memory chunks. |
row_index_name |
If not |
row_index_offset |
Offset to start the row index column (only used if the name is set). |
ignore_errors |
Keep reading the file even if some lines yield errors.
You can also use |
storage_options |
Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
If |
retries |
Number of retries if accessing a cloud instance fails. |
file_cache_ttl |
Amount of time to keep downloaded cloud files since
their last access time, in seconds. Uses the |
include_file_paths |
Character value indicating the column name that will include the path of the source file(s). |
A polars LazyFrame
ndjson_filename <- tempfile()
jsonlite::stream_out(iris, file(ndjson_filename), verbose = FALSE)
pl$scan_ndjson(ndjson_filename)$collect()
This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.
pl__scan_parquet( source, ..., n_rows = NULL, row_index_name = NULL, row_index_offset = 0L, parallel = c("auto", "columns", "row_groups", "prefiltered", "none"), use_statistics = TRUE, hive_partitioning = NULL, glob = TRUE, schema = NULL, hive_schema = NULL, try_parse_hive_dates = TRUE, rechunk = FALSE, low_memory = FALSE, cache = TRUE, storage_options = NULL, retries = 2, include_file_paths = NULL, allow_missing_columns = FALSE )
source |
Path(s) to a file or directory. When needing to authenticate
for scanning cloud locations, see the |
... |
These dots are for future extensions and must be empty. |
n_rows |
Stop reading from parquet file after reading |
row_index_name |
If not |
row_index_offset |
Offset to start the row index column (only used if the name is set). |
parallel |
This determines the direction and strategy of parallelism.
The prefiltered setting falls back to auto if no predicate is given. |
use_statistics |
Use statistics in the parquet to determine if pages can be skipped from reading. |
hive_partitioning |
Infer statistics and schema from Hive partitioned sources and use them to prune reads. |
glob |
Expand path given via globbing rules. |
schema |
Specify the datatypes of the columns. The datatypes must match
the datatypes in the file(s). If there are extra columns that are not in the
file(s), consider also enabling |
hive_schema |
The column names and data types of the columns by which
the data is partitioned. If |
try_parse_hive_dates |
Whether to try parsing hive values as date / datetime types. |
rechunk |
In case of reading multiple files via a glob pattern rechunk the final DataFrame into contiguous memory chunks. |
low_memory |
Reduce memory pressure at the expense of performance |
cache |
Cache the result after reading. |
storage_options |
Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
If |
retries |
Number of retries if accessing a cloud instance fails. |
include_file_paths |
Character value indicating the column name that will include the path of the source file(s). |
allow_missing_columns |
When reading a list of parquet files, if a
column existing in the first file cannot be found in subsequent files, the
default behavior is to raise an error. However, if |
A polars LazyFrame
# Write a Parquet file that we can then import as a DataFrame
temp_file <- withr::local_tempfile(fileext = ".parquet")
as_polars_df(mtcars)$write_parquet(temp_file)

pl$scan_parquet(temp_file)$collect()

# Write a hive-style partitioned parquet dataset
temp_dir <- withr::local_tempdir()
as_polars_df(mtcars)$write_parquet(temp_dir, partition_by = c("cyl", "gear"))
list.files(temp_dir, recursive = TRUE)

# If the path is a folder, Polars automatically tries to detect partitions
# and includes them in the output
pl$scan_parquet(temp_dir)$collect()
polars_series
Series are a 1-dimensional data structure similar to R vectors. Within a Series, all elements have the same data type.
pl__Series(name = NULL, values = NULL)
name |
A single string or |
values |
An R object. Passed as the |
The pl$Series() function mimics the constructor of the Series class of Python Polars. This function calls as_polars_series() internally to convert the input object to a Polars Series.
dtype: $dtype returns the data type of the Series.
name: $name returns the name of the Series.
shape: $shape returns an integer vector of length two containing the length of the Series and its width (always 1).
# Constructing a Series by specifying name and values positionally:
s <- pl$Series("a", 1:3)
s

# Active bindings:
s$dtype
s$name
s$shape
Print out the version of Polars and its optional dependencies.
pl__show_versions()
cli enhances the terminal output, especially error messages.
These packages may be used for exporting Series to R.
See <Series>$to_r_vector() for details.
NULL invisibly.
pl$show_versions()
Collect columns into a struct column
pl__struct(...)
... |
< |
A polars expression
# Collect all columns of a dataframe into a struct by passing pl$all().
df <- pl$DataFrame(
  int = 1:2,
  str = c("a", "b"),
  bool = c(TRUE, NA),
  list = list(1:2, 3L),
)
df$select(pl$struct(pl$all())$alias("my_struct"))

# Name each struct field.
df$select(pl$struct(p = "int", q = "bool")$alias("my_struct"))$schema
This function is syntactic sugar for col(names)$sum().
pl__sum(...)
... |
Name(s) of the columns to use in the aggregation. |
A polars expression
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, 2),
  c = c("foo", "bar", "foo")
)

# Get the sum of a column
df$select(pl$sum("a"))

# Get the sum of multiple columns
df$select(pl$sum("a", "b"))
Compute the sum horizontally across columns
pl__sum_horizontal(..., ignore_nulls = TRUE)
... |
< |
ignore_nulls |
A logical.
If |
A polars expression
df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, NA),
  c = c("x", "y", "z")
)

df$with_columns(
  sum = pl$sum_horizontal("a", "b")
)
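The effect of `ignore_nulls` is easiest to see side by side. The description of this argument is cut off above, so the sketch below assumes the behaviour documented for upstream Polars: with `ignore_nulls = TRUE` missing values are skipped, and with `ignore_nulls = FALSE` any missing input makes the row's result missing. The output column names are chosen for illustration only.

```r
library(polars)

df <- pl$DataFrame(
  a = c(1, 8, 3),
  b = c(4, 5, NA)
)

out <- df$with_columns(
  # Default: the NA in `b` is skipped, so the last row sums to 3.
  sum_skip = pl$sum_horizontal("a", "b"),
  # Assumed upstream behaviour: the NA propagates to the result.
  sum_strict = pl$sum_horizontal("a", "b", ignore_nulls = FALSE)
)
out
```

This mirrors the difference between `na.rm = TRUE` and `na.rm = FALSE` in base R's `rowSums()`.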
Generate a time range
pl__time_range( start = NULL, end = NULL, interval = "1h", ..., closed = c("both", "left", "none", "right") )
start |
Lower bound of the time range. If omitted, defaults to
|
end |
Upper bound of the time range. If omitted, defaults to
|
interval |
Interval of the range periods, specified as a difftime or using the Polars duration string language (see details). |
... |
These dots are for future extensions and must be empty. |
closed |
Define which sides of the range are closed (inclusive).
One of the following: |
A polars expression
Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.
It has the following format:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: "3d12h4m25s"
# 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
pl$select(
  time = pl$time_range(
    start = hms::parse_hms("14:00:00"),
    interval = as.difftime("3:15:00")
  )
)
Create a column of time ranges
pl__time_ranges( start = NULL, end = NULL, interval = "1h", ..., closed = c("both", "left", "none", "right") )
start |
Lower bound of the time range. If omitted, defaults to
|
end |
Upper bound of the time range. If omitted, defaults to
|
interval |
Interval of the range periods, specified as a difftime or using the Polars duration string language (see details). |
... |
These dots are for future extensions and must be empty. |
closed |
Define which sides of the range are closed (inclusive).
One of the following: |
A polars expression
Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.
It has the following format:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: "3d12h4m25s"
# 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
df <- pl$DataFrame(
  start = hms::parse_hms(c("09:00:00", "10:00:00")),
  end = hms::parse_hms(c("11:00:00", "11:00:00"))
)

df$with_columns(time_range = pl$time_ranges("start", "end"))
Registering custom functionality with a polars Series
pl_api_register_series_namespace(name, ns_fn)
name |
Name under which the functionality will be accessed. |
ns_fn |
A function that returns a new environment with the custom functionality. See examples for details. |
NULL invisibly.
# s: polars series
math_shortcuts <- function(s) {
  # Create a new environment to store the methods
  self <- new.env(parent = emptyenv())

  # Store the series
  self$`_s` <- s

  # Add methods
  self$square <- function() self$`_s` * self$`_s`
  self$cube <- function() self$`_s` * self$`_s` * self$`_s`

  # Set the class
  class(self) <- c("polars_namespace_series", "polars_object")

  # Return the environment
  self
}

pl$api$register_series_namespace("math", math_shortcuts)

s <- as_polars_series(c(1.5, 31, 42, 64.5))
s$math$square()$rename("s^2")

s <- as_polars_series(1:5)
s$math$cube()$rename("s^3")
polars_dtype
Polars supports a variety of data types that fall broadly under the following categories:
Numeric data types: signed integers, unsigned integers, floating point numbers, and decimals.
Nested data types: lists, structs, and arrays.
Temporal: dates, datetimes, times, and time deltas.
Miscellaneous: strings, binary data, Booleans, categoricals, and enums.
All types support missing values represented by the special value null. This is not to be conflated with the special value NaN in floating number data types; see the section about floating point numbers for more information.
pl__Decimal(precision = NULL, scale = 0L) pl__Datetime(time_unit = c("us", "ns", "ms"), time_zone = NULL) pl__Duration(time_unit = c("us", "ns", "ms")) pl__Categorical(ordering = c("physical", "lexical")) pl__Enum(categories) pl__Array(inner, shape) pl__List(inner) pl__Struct(...)
precision |
Single integer or |
scale |
Single integer or |
time_unit |
One of |
time_zone |
A string or |
ordering |
One of |
categories |
A character vector.
Should not contain |
inner |
A polars data type object. |
shape |
An integer-ish vector, representing the shape of the Array. |
... |
< |
pl$Int8
pl$Int16
pl$Int32
pl$Int64
pl$UInt8
pl$UInt16
pl$UInt32
pl$UInt64
pl$Float32
pl$Float64
pl$Decimal(scale = 2)
pl$String
pl$Binary
pl$Date
pl$Time
pl$Datetime()
pl$Duration()
pl$Array(pl$Int32, c(2, 3))
pl$List(pl$Int32)
pl$Categorical()
pl$Enum(c("a", "b", "c"))
pl$Struct(a = pl$Int32, b = pl$String)
pl$Null
The Polars duration string language
Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.
It has the following format:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: "3d12h4m25s"
# 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
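To make the format concrete, the following base R sketch splits a combined duration string into its count and unit parts. It does not call polars, and `parse_duration_string()` is a hypothetical helper written for this illustration only. It handles just the unsigned forms listed above and makes no attempt to resolve calendar units into fixed lengths.

```r
# Hypothetical helper: split a Polars-style duration string into its parts.
parse_duration_string <- function(x) {
  stopifnot(is.character(x), length(x) == 1L, nzchar(x))
  # Two-letter units come first so that "ms" and "mo" are not read as "m".
  pattern <- "([0-9]+)(ns|us|ms|mo|s|m|h|d|w|q|y)"
  pieces <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
  # Reject the input if anything is left over once the pieces are removed.
  if (!identical(paste(pieces, collapse = ""), x)) {
    stop("not a valid duration string: ", x)
  }
  counts <- as.integer(sub(pattern, "\\1", pieces, perl = TRUE))
  units <- sub(pattern, "\\2", pieces, perl = TRUE)
  stats::setNames(counts, units)
}

parts <- parse_duration_string("3d12h4m25s")
parts
```

The result is a named integer vector with one entry per unit, in the order the units appear in the string.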
polars_expr
An expression is a tree of operations that describes how to construct one or more Series. As the outputs are Series, it is straightforward to apply a sequence of expressions, each of which transforms the output of the previous step. See examples for details.
pl$lit(): Create a literal expression.
pl$col(): Create an expression representing column(s) in a DataFrame.
# An expression:
# 1. Select column `foo`,
# 2. Then sort the column (not in reversed order)
# 3. Then take the first two values of the sorted output
pl$col("foo")$sort()$head(2)

# Expressions will be evaluated inside a context, such as `<DataFrame>$select()`
df <- pl$DataFrame(
  foo = c(1, 2, 1, 2, 3),
  bar = c(5, 4, 3, 2, 1),
)

df$select(
  pl$col("foo")$sort()$head(3), # Return 3 values
  pl$col("bar")$filter(pl$col("foo") == 1)$sum(), # Return a single value
)
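Expressions built with `pl$col()` and `pl$lit()` can also be combined with arithmetic operators and evaluated in other contexts. The following is a minimal sketch using `<DataFrame>$with_columns()`. The new column names are arbitrary, and the sketch assumes that a single aggregated value is repeated to the height of the frame.

```r
library(polars)

df <- pl$DataFrame(
  foo = c(1, 2, 1, 2, 3),
  bar = c(5, 4, 3, 2, 1)
)

out <- df$with_columns(
  # Element-wise arithmetic between a column and a literal
  foo_plus_ten = pl$col("foo") + pl$lit(10),
  # An aggregation returns one value, which is repeated on every row
  bar_total = pl$col("bar")$sum()
)
out
```

Unlike `$select()`, `$with_columns()` keeps the existing columns and appends the new ones.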
Get the number of chunks that this Series contains
series__n_chunks()
An integer value
s <- pl$Series("a", c(1, 2, 3))
s$n_chunks()

s2 <- pl$Series("a", c(4, 5, 6))

# Concatenate Series with rechunk = TRUE
pl$concat(c(s, s2), rechunk = TRUE)$n_chunks()

# Concatenate Series with rechunk = FALSE
pl$concat(c(s, s2), rechunk = FALSE)$n_chunks()
Cast this Series to a DataFrame
series__to_frame(name = NULL)
name |
A character or |
A polars DataFrame
s <- pl$Series("a", c(123, 456))
df <- s$to_frame()
df

df <- s$to_frame("xyz")
df
Export the Series as an R vector.
Note that the Struct data type is exported as a data.frame by default for consistency, and a data.frame is not a vector. To ensure the return value is a vector, set ensure_vector = TRUE or use the as.vector() function instead.
series__to_r_vector( ..., ensure_vector = FALSE, int64 = c("double", "character", "integer", "integer64"), date = c("Date", "IDate"), time = c("hms", "ITime"), struct = c("dataframe", "tibble"), decimal = c("double", "character"), as_clock_class = FALSE, ambiguous = c("raise", "earliest", "latest", "null"), non_existent = c("raise", "null") )
... |
These dots are for future extensions and must be empty. |
ensure_vector |
A logical value indicating whether to ensure the return value is a vector.
When the Series has the Struct data type and this argument is |
int64 |
Determine how to convert Polars' Int64, UInt32, or UInt64 type values to R type. One of the following:
|
date |
Determine how to convert Polars' Date type values to R class. One of the following:
|
time |
Determine how to convert Polars' Time type values to R class. One of the following:
|
struct |
Determine how to convert Polars' Struct type values to R class. One of the following:
|
decimal |
Determine how to convert Polars' Decimal type values to R type. One of the followings: |
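as_clock_class |
A logical value indicating whether to export datetimes and durations as the clock package's classes. If FALSE (default), Datetime is exported as POSIXct and Duration as difftime; if TRUE, they are exported as clock_naive_time or clock_zoned_time, and clock_duration, respectively.
|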
ambiguous |
Determine how to deal with ambiguous datetimes.
Only applicable when
|
non_existent |
Determine how to deal with non-existent datetimes.
Only applicable when
|
The class/type of the exported object depends on the data type of the Series as follows:
Boolean: logical.
UInt8, UInt16, Int8, Int16, Int32: integer.
Int64, UInt32, UInt64: double, character, integer, or bit64::integer64, depending on the int64 argument.
Float32, Float64: double.
Decimal: double or character, depending on the decimal argument.
String: character.
Categorical: factor.
Date: Date or data.table::IDate, depending on the date argument.
Time: hms::hms or data.table::ITime, depending on the time argument.
Datetime (without timezone): POSIXct or clock_naive_time, depending on the as_clock_class argument.
Datetime (with timezone): POSIXct or clock_zoned_time, depending on the as_clock_class argument.
Duration: difftime or clock_duration, depending on the as_clock_class argument.
Binary: blob::blob.
Null: vctrs::unspecified.
List, Array: vctrs::list_of.
Struct: data.frame or tibble, depending on the struct argument.
If ensure_vector = TRUE, the top-level Struct is exported as a named list to ensure the return value is a vector.
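The type mapping above groups UInt32 with the 64-bit integer types and offers several export targets for them. A short base R sketch, which does not call polars at all, may help show why: R's integer type is 32-bit signed, and a double represents every integer exactly only up to 2^53. The variable names are illustrative, and the last step shows base R's own coercion behaviour, not necessarily what to_r_vector(int64 = "integer") does on overflow.

```r
# Largest value an R integer can hold (32-bit signed)
r_int_max <- .Machine$integer.max

# Maximum UInt32 value, written as a double: 2^32 - 1
uint32_max <- 4294967295

# UInt32 values can exceed the R integer range
uint32_fits <- uint32_max <= r_int_max

# Above 2^53, neighbouring integers collapse to the same double
exact_limit <- 2^53
collides <- (exact_limit + 1) == exact_limit

# Base R coercion of an out-of-range double to integer gives NA (with a warning)
as_int <- suppressWarnings(as.integer(uint32_max))

uint32_fits
collides
is.na(as_int)
```

When exact values matter for large Int64 or UInt64 data, this suggests preferring int64 = "character" or, within the signed 64-bit range, int64 = "integer64" over the default "double".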
A vector
# Struct values handling
series_struct <- as_polars_series(
  data.frame(
    a = 1:2,
    b = I(list(data.frame(c = "foo"), data.frame(c = "bar")))
  )
)
series_struct

## Export Struct as data.frame
series_struct$to_r_vector()

## Export Struct as data.frame,
## but the top-level Struct is exported as a named list
series_struct$to_r_vector(ensure_vector = TRUE)

## Export Struct as tibble
series_struct$to_r_vector(struct = "tibble")

## Export Struct as tibble,
## but the top-level Struct is exported as a named list
series_struct$to_r_vector(struct = "tibble", ensure_vector = TRUE)

# Integer values handling
series_uint64 <- as_polars_series(
  c(NA, "0", "4294967295", "18446744073709551615")
)$cast(pl$UInt64)
series_uint64

## Export UInt64 as double
series_uint64$to_r_vector(int64 = "double")

## Export UInt64 as character
series_uint64$to_r_vector(int64 = "character")

## Export UInt64 as integer (overflow occurs)
series_uint64$to_r_vector(int64 = "integer")

## Export UInt64 as bit64::integer64 (overflow occurs)
if (requireNamespace("bit64", quietly = TRUE)) {
  series_uint64$to_r_vector(int64 = "integer64")
}

# Duration values handling
series_duration <- as_polars_series(
  c(NA, -1000000000, -10, -1, 1000000000)
)$cast(pl$Duration("ns"))
series_duration

## Export Duration as difftime
series_duration$to_r_vector(as_clock_class = FALSE)

## Export Duration as clock_duration
if (requireNamespace("clock", quietly = TRUE)) {
  series_duration$to_r_vector(as_clock_class = TRUE)
}

# Datetime values handling
series_datetime <- as_polars_series(
  as.POSIXct(
    c(NA, "1920-01-01 00:00:00", "1970-01-01 00:00:00", "2020-01-01 00:00:00"),
    tz = "UTC"
  )
)$cast(pl$Datetime("ns", "UTC"))
series_datetime

## Export zoned datetime as POSIXct
series_datetime$to_r_vector(as_clock_class = FALSE)

## Export zoned datetime as clock_zoned_time
if (requireNamespace("clock", quietly = TRUE)) {
  series_datetime$to_r_vector(as_clock_class = TRUE)
}
Convert this struct Series to a DataFrame with a separate column for each field
series_struct_unnest()
A polars DataFrame
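To make the "separate column for each field" behaviour concrete, here is a small round-trip sketch: a data.frame is turned into a Struct Series, unnested into a DataFrame, and converted back to an R data.frame so the column names can be inspected. The input data and variable names are illustrative, and the sketch assumes the polars package is installed and that as.data.frame() is available for polars DataFrames.

```r
library(polars)

# Illustrative input: a plain data.frame with two columns
original <- data.frame(a = c(1, 3), b = c(2, 4))

# Converting a data.frame with as_polars_series() yields a Struct Series
s <- as_polars_series(original)

# One DataFrame column per struct field
unnested <- s$struct$unnest()

# Back to an R data.frame for inspection
round_trip <- as.data.frame(unnested)
names(round_trip)
nrow(round_trip)
```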
s <- as_polars_series(data.frame(a = c(1, 3), b = c(2, 4)))
s$struct$unnest()