Title: | Lightning-Fast 'DataFrame' Library |
---|---|
Description: | Lightning-fast 'DataFrame' library written in 'Rust'. Convert R data to 'Polars' data and vice versa. Perform fast, lazy, larger-than-memory and optimized data queries. 'Polars' is interoperable with the package 'arrow', as both are based on the 'Apache Arrow' Columnar Format. |
Authors: | Ritchie Vink [aut], Soren Welling [aut, cre], Tatsuya Shima [aut], Etienne Bacher [aut] |
Maintainer: | Soren Welling <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.19.1 |
Built: | 2024-08-31 10:16:28 UTC |
Source: | https://github.com/pola-rs/r-polars |
Mimics the behavior of x[i, j, drop = TRUE] (see Extract) for a data.frame or an R vector.
## S3 method for class 'RPolarsDataFrame'
x[i, j, drop = TRUE]

## S3 method for class 'RPolarsLazyFrame'
x[i, j, drop = TRUE]

## S3 method for class 'RPolarsSeries'
x[i]
x |
|
i |
Rows to select. Integer vector, logical vector, or an Expression. |
j |
Columns to select. Integer vector, logical vector, character vector, or an Expression. For LazyFrames, only an Expression can be used. |
drop |
Convert to a Polars Series if only one column is selected.
For LazyFrames, if the result has one column and |
<Series>[i] is equivalent to pl$select(<Series>)[i, , drop = TRUE].
<DataFrame>$select(), <LazyFrame>$select(), <DataFrame>$filter(), <LazyFrame>$filter()
df = pl$DataFrame(data.frame(a = 1:3, b = letters[1:3]))
lf = df$lazy()

# Select a row
df[1, ]

# If only `i` is specified, it is treated as `j`
# Select a column
df[1]

# Select a column by name (and convert to a Series)
df[, "b"]

# Can use Expression for filtering and column selection
lf[pl$col("a") >= 2, pl$col("b")$alias("new"), drop = FALSE] |>
  as.data.frame()
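The stated equivalence between <Series>[i] and pl$select(<Series>)[i, , drop = TRUE] can be checked with a short sketch. The values and the Series name "a" are illustrative only, and the comparison assumes both forms return a Series that as.vector() can convert.

```r
library(polars)

# A small integer Series; the name "a" is arbitrary
s = as_polars_series(c(10L, 20L, 30L), name = "a")

# Direct extraction on the Series
direct = s[2:3]

# The longer form from the documentation: wrap the Series in a
# one-column DataFrame, subset the rows, then drop back to a Series
via_select = pl$select(s)[2:3, , drop = TRUE]

# Both should hold the same values
identical(as.vector(direct), as.vector(via_select))
```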
Create an arrow Table from a Polars object
## S3 method for class 'RPolarsDataFrame'
as_arrow_table(x, ..., compat_level = FALSE)
x |
|
... |
Ignored |
compat_level |
Use a specific compatibility level when exporting Polars’ internal data structures. This can be:
|
library(arrow)

pl_df = as_polars_df(mtcars)
as_arrow_table(pl_df)
Create a nanoarrow_array_stream from a Polars object
## S3 method for class 'RPolarsDataFrame'
as_nanoarrow_array_stream(x, ..., schema = NULL, compat_level = FALSE)

## S3 method for class 'RPolarsSeries'
as_nanoarrow_array_stream(x, ..., schema = NULL, compat_level = FALSE)
x |
A polars object |
... |
Ignored |
schema |
Must stay at the default value NULL. |
compat_level |
Use a specific compatibility level when exporting Polars’ internal data structures. This can be:
|
library(nanoarrow)

pl_df = as_polars_df(mtcars)$head(5)
pl_s = as_polars_series(letters[1:5])

as.data.frame(as_nanoarrow_array_stream(pl_df))
as.vector(as_nanoarrow_array_stream(pl_s))
as_polars_df() is a generic function that converts an R object to a polars DataFrame.
as_polars_df(x, ...)

## Default S3 method:
as_polars_df(x, ...)

## S3 method for class 'data.frame'
as_polars_df(
  x,
  ...,
  rownames = NULL,
  make_names_unique = TRUE,
  schema = NULL,
  schema_overrides = NULL
)

## S3 method for class 'RPolarsDataFrame'
as_polars_df(x, ...)

## S3 method for class 'RPolarsGroupBy'
as_polars_df(x, ...)

## S3 method for class 'RPolarsRollingGroupBy'
as_polars_df(x, ...)

## S3 method for class 'RPolarsDynamicGroupBy'
as_polars_df(x, ...)

## S3 method for class 'RPolarsSeries'
as_polars_df(x, ...)

## S3 method for class 'RPolarsLazyFrame'
as_polars_df(
  x,
  n_rows = Inf,
  ...,
  type_coercion = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  comm_subplan_elim = TRUE,
  comm_subexpr_elim = TRUE,
  cluster_with_columns = TRUE,
  streaming = FALSE,
  no_optimization = FALSE,
  collect_in_background = FALSE
)

## S3 method for class 'RPolarsLazyGroupBy'
as_polars_df(x, ...)

## S3 method for class 'ArrowTabular'
as_polars_df(
  x,
  ...,
  rechunk = TRUE,
  schema = NULL,
  schema_overrides = NULL,
  experimental = FALSE
)

## S3 method for class 'RecordBatchReader'
as_polars_df(x, ..., experimental = FALSE)

## S3 method for class 'nanoarrow_array'
as_polars_df(x, ...)

## S3 method for class 'nanoarrow_array_stream'
as_polars_df(x, ..., experimental = FALSE)
x |
Object to convert to a polars DataFrame. |
... |
Additional arguments passed to methods. |
rownames |
How to treat existing row names of a data frame:
|
make_names_unique |
A logical flag to replace duplicated column names
with unique names. If |
schema |
named list of DataTypes, or character vector of column names.
Should match the number of columns in |
schema_overrides |
named list of DataTypes. Cast some columns to the DataType. |
n_rows |
Number of rows to fetch. Defaults to |
type_coercion |
Logical. Coerce types such that operations succeed and run on minimal required memory. |
predicate_pushdown |
Logical. Applies filters as early as possible at scan level. |
projection_pushdown |
Logical. Select only the columns that are needed at the scan level. |
simplify_expression |
Logical. Various optimizations, such as constant folding and replacing expensive operations with faster alternatives. |
slice_pushdown |
Logical. Only load the required slice from the scan
level. Don't materialize sliced outputs (e.g. |
comm_subplan_elim |
Logical. Will try to cache branching subplans that occur on self-joins or unions. |
comm_subexpr_elim |
Logical. Common subexpressions will be cached and reused. |
cluster_with_columns |
Combine sequential independent calls to
|
streaming |
Logical. Run parts of the query in a streaming fashion (this is in an alpha state). |
no_optimization |
Logical. Sets the following parameters to |
collect_in_background |
Logical. Detach this query from R session. Computation will start in background. Get a handle which later can be converted into the resulting DataFrame. Useful in interactive mode to not lock R session. |
rechunk |
A logical flag (default |
experimental |
If |
For LazyFrame objects, this function is a shortcut for $collect() or $fetch(), depending on whether the number of rows to fetch is infinite or not.
# Convert the row names of a data frame to a column
as_polars_df(mtcars, rownames = "car")

# Convert a data frame, with renaming all columns
as_polars_df(
  data.frame(x = 1, y = 2),
  schema = c("a", "b")
)

# Convert a data frame, with renaming and casting all columns
as_polars_df(
  data.frame(x = 1, y = 2),
  schema = list(b = pl$Int64, a = pl$String)
)

# Convert a data frame, with casting some columns
as_polars_df(
  data.frame(x = 1, y = 2),
  schema_overrides = list(y = pl$String) # cast some columns
)

# Convert an arrow Table to a polars DataFrame
at = arrow::arrow_table(x = 1:5, y = 6:10)
as_polars_df(at)

# Create a polars LazyFrame from a data.frame
lf = as_polars_df(mtcars)$lazy()

# Collect all rows from the LazyFrame
as_polars_df(lf)

# Fetch 5 rows from the LazyFrame
as_polars_df(lf, 5)
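The LazyFrame behavior described in the details (collect when n_rows is infinite, fetch otherwise) can be sketched by comparing row counts. This is only an illustration on mtcars; the variable names are hypothetical, and it assumes a plain in-memory LazyFrame with no filters, so the fetched result has exactly the requested number of rows.

```r
library(polars)

lf = as_polars_df(mtcars)$lazy()

# n_rows defaults to Inf, so every row is collected
all_rows = as_polars_df(lf)

# A finite n_rows limits how many rows are fetched
some_rows = as_polars_df(lf, n_rows = 5)

all_rows$height
some_rows$height
```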
as_polars_lf() is a generic function that converts an R object to a polars LazyFrame. It is essentially a shortcut for calling as_polars_df(x, ...) followed by the $lazy() method.
as_polars_lf(x, ...)

## Default S3 method:
as_polars_lf(x, ...)

## S3 method for class 'RPolarsLazyFrame'
as_polars_lf(x, ...)

## S3 method for class 'RPolarsLazyGroupBy'
as_polars_lf(x, ...)
x |
Object to convert to a polars LazyFrame. |
... |
Additional arguments passed to methods. |
as_polars_lf(mtcars)
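To illustrate the description of as_polars_lf() as a shortcut, the sketch below builds a LazyFrame both ways and compares the materialized results. The variable names are illustrative, and the comparison assumes as.data.frame() on a LazyFrame collects all rows, as its default n_rows = Inf suggests.

```r
library(polars)

# Direct conversion to a LazyFrame
lf_direct = as_polars_lf(mtcars)

# The longer form: convert to a DataFrame, then make it lazy
lf_manual = as_polars_df(mtcars)$lazy()

# Materialize both and compare the resulting R data frames
identical(as.data.frame(lf_direct), as.data.frame(lf_manual))
```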
as_polars_series() is a generic function that converts an R object to a polars Series.
as_polars_series(x, name = NULL, ...)

## Default S3 method:
as_polars_series(x, name = NULL, ...)

## S3 method for class 'RPolarsSeries'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'RPolarsExpr'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'RPolarsThen'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'RPolarsChainedThen'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'POSIXlt'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'data.frame'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'vctrs_rcrd'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'Array'
as_polars_series(x, name = NULL, ..., rechunk = TRUE)

## S3 method for class 'ChunkedArray'
as_polars_series(x, name = NULL, ..., rechunk = TRUE)

## S3 method for class 'RecordBatchReader'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'nanoarrow_array'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'nanoarrow_array_stream'
as_polars_series(x, name = NULL, ..., experimental = FALSE)

## S3 method for class 'clock_time_point'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'clock_sys_time'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'clock_zoned_time'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'list'
as_polars_series(x, name = NULL, ...)
x |
Object to convert into a polars Series. |
name |
A character to use as the name of the Series.
If |
... |
Additional arguments passed to methods. |
rechunk |
A logical flag (default |
experimental |
If |
a Series
as_polars_series(1:4)

as_polars_series(list(1:4))

as_polars_series(data.frame(a = 1:4))

as_polars_series(as_polars_series(1:4, name = "foo"))

as_polars_series(pl$lit(1:4))

# Nested type support
as_polars_series(list(data.frame(a = I(list(1:4)))))
Create an arrow RecordBatchReader from a Polars object
## S3 method for class 'RPolarsDataFrame'
as_record_batch_reader(x, ..., compat_level = FALSE)
x |
|
... |
Ignored |
compat_level |
Use a specific compatibility level when exporting Polars’ internal data structures. This can be:
|
library(arrow)

pl_df = as_polars_df(mtcars)
as_record_batch_reader(pl_df)
Convert to a character vector
## S3 method for class 'RPolarsSeries'
as.character(x, ..., str_length = NULL)
x |
A Polars Series |
... |
Not used. |
str_length |
An integer. If specified, utf8 or categorical type Series will be formatted to a string of this length. |
s = as_polars_series(c("foo", "barbaz"))
as.character(s)
as.character(s, str_length = 3)
Equivalent to as_polars_df(x, ...)$to_data_frame(...).
## S3 method for class 'RPolarsDataFrame'
as.data.frame(x, ..., int64_conversion = polars_options()$int64_conversion)

## S3 method for class 'RPolarsLazyFrame'
as.data.frame(
  x,
  ...,
  n_rows = Inf,
  type_coercion = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  comm_subplan_elim = TRUE,
  comm_subexpr_elim = TRUE,
  cluster_with_columns = TRUE,
  streaming = FALSE,
  no_optimization = FALSE,
  collect_in_background = FALSE
)
x |
An object to convert to a data.frame. |
... |
Additional arguments passed to methods. |
int64_conversion |
How should Int64 values be handled when converting a polars object to R?
|
n_rows |
Number of rows to fetch. Defaults to |
type_coercion |
Logical. Coerce types such that operations succeed and run on minimal required memory. |
predicate_pushdown |
Logical. Applies filters as early as possible at scan level. |
projection_pushdown |
Logical. Select only the columns that are needed at the scan level. |
simplify_expression |
Logical. Various optimizations, such as constant folding and replacing expensive operations with faster alternatives. |
slice_pushdown |
Logical. Only load the required slice from the scan
level. Don't materialize sliced outputs (e.g. |
comm_subplan_elim |
Logical. Will try to cache branching subplans that occur on self-joins or unions. |
comm_subexpr_elim |
Logical. Common subexpressions will be cached and reused. |
cluster_with_columns |
Combine sequential independent calls to
|
streaming |
Logical. Run parts of the query in a streaming fashion (this is in an alpha state). |
no_optimization |
Logical. Sets the following parameters to |
collect_in_background |
Logical. Detach this query from R session. Computation will start in background. Get a handle which later can be converted into the resulting DataFrame. Useful in interactive mode to not lock R session. |
When converting Polars objects, such as DataFrames, to R objects, for example via the as.data.frame() generic function, each type in the Polars object is converted to an R type. In some cases, an error may occur because the conversion is not appropriate. In particular, an error is likely when converting a Datetime type without a time zone.
A Datetime type without a time zone in Polars is converted to the POSIXct type in R, which takes into account the time zone in which the R session is running (which can be checked with the Sys.timezone() function). In this case, if ambiguous or non-existent times are included, a conversion error will occur. To avoid this, either change the session time zone using Sys.setenv(TZ = "UTC") and then perform the conversion, or use the $dt$replace_time_zone() method on the Datetime type column to explicitly specify the time zone before conversion.
# Due to daylight savings, clocks were turned forward 1 hour on Sunday, March 8, 2020, 2:00:00 am
# so this particular date-time doesn't exist
non_existent_time = as_polars_series("2020-03-08 02:00:00")$str$strptime(pl$Datetime(), "%F %T")

withr::with_timezone(
  "America/New_York",
  {
    tryCatch(
      # This causes an error due to the time zone (the `TZ` env var is affected).
      as.vector(non_existent_time),
      error = function(e) e
    )
  }
)
#> <error: in to_r: ComputeError(ErrString("datetime '2020-03-08 02:00:00' is non-existent in time zone 'America/New_York'. You may be able to use `non_existent='null'` to return `null` in this case."))
#>  When calling: devtools::document()>

withr::with_timezone(
  "America/New_York",
  {
    # This is safe.
    as.vector(non_existent_time$dt$replace_time_zone("UTC"))
  }
)
#> [1] "2020-03-08 02:00:00 UTC"
Equivalent to as.data.frame(x, ...) |> as.matrix().
## S3 method for class 'RPolarsDataFrame'
as.matrix(x, ...)

## S3 method for class 'RPolarsLazyFrame'
as.matrix(x, ...)
x |
An object to convert to a matrix. |
... |
Additional arguments passed to methods. |
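The equivalence stated for as.matrix() can be illustrated with a minimal sketch. The two-column data frame is made up for the example, and the check relies only on the documented equivalence with as.data.frame(x, ...) |> as.matrix().

```r
library(polars)

# A small DataFrame with an integer and a double column (illustrative data)
df = as_polars_df(data.frame(a = 1:3, b = c(1.5, 2.5, 3.5)))

# Direct conversion to a matrix
m_direct = as.matrix(df)

# The documented equivalent: go through an R data.frame first
m_manual = as.data.frame(df) |> as.matrix()

identical(m_direct, m_manual)
```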
Convert to a vector
## S3 method for class 'RPolarsSeries'
as.vector(x, mode)
x |
A Polars Series |
mode |
Not used. |
Combine to a Series
## S3 method for class 'RPolarsSeries'
c(x, ...)
x |
A Polars Series |
... |
Series(s) or any object that can be converted to a Series. |
All objects must have the same datatype. Combining does not rechunk. Read more about R vectors, Series and chunks in docs_translations.
a combined Series
s = c(as_polars_series(1:5), 3:1, NA_integer_)
s$chunk_lengths() # the Series contains three unmerged chunks
The DataFrame-class is simply two environments holding, respectively, the public and private methods/function calls to the polars Rust side. The instantiated DataFrame-object is an externalptr to a low-level Rust polars DataFrame object.

The S3 method .DollarNames.RPolarsDataFrame exposes all public $foobar()-methods which are callable on the object. Most methods return another DataFrame-class instance or similar, which allows for method chaining. This class system could be called "environment classes" (for lack of a better name) and is the same class system extendr provides, except that here there are both a public and a private set of methods. For implementation reasons, the private methods are external and must be called from .pr$DataFrame$methodname(). Also, all private methods must take any self as an argument, thus they are pure functions. Having the private methods as pure functions solved/simplified self-referential complications.

Check out the source code in R/dataframe_frame.R to see how public methods are derived from private methods. Check out extendr-wrappers.R to see the extendr-auto-generated methods. These are moved to .pr and converted into pure external functions in after-wrappers.R. In zzz.R (named zzz to be the last file sourced) the extendr-methods are removed and replaced by any function prefixed DataFrame_.
$columns returns a character vector with the column names.

$dtypes returns an unnamed list with the data type of each column.

$flags returns a nested list with column names at the top level and column flags in each sublist.
Flags are used internally to avoid doing unnecessary computations, such as sorting a variable that we know is already sorted. The number of flags varies depending on the column type: columns of type array and list have the flags SORTED_ASC, SORTED_DESC, and FAST_EXPLODE, while other column types only have the former two.
SORTED_ASC is set to TRUE when we sort a column in increasing order, so that we can use this information later on to avoid re-sorting it.
SORTED_DESC is similar but applies to sorting in decreasing order.

$height returns the number of rows in the DataFrame.

$schema returns a named list with the data type of each column.

$shape returns a numeric vector of length two with the number of rows and the number of columns.

$width returns the number of columns in the DataFrame.
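The active bindings can be tried out together in one short sketch. The use of mtcars is illustrative only; the bindings themselves are the ones listed in this section.

```r
library(polars)

df = as_polars_df(mtcars)

# Dimensions: rows, columns, and both at once
df$height
df$width
df$shape

# Column names, directly and as the names of the schema list
df$columns
names(df$schema)

# $dtypes carries the same data types as $schema, but without names
length(df$dtypes)
```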
# see all public exported method names (normally accessed via a class
# instance with $)
ls(.pr$env$RPolarsDataFrame)

# see all private methods (not intended for regular use)
ls(.pr$DataFrame)

# make an object
df = as_polars_df(iris)

# call an active binding
df$shape

# use a private method, which has mutability
result = .pr$DataFrame$set_column_from_robj(df, 150:1, "some_ints")

# Column exists in both dataframes-objects now, as they are just pointers to
# the same object
# There are no public methods with mutability.
df2 = df
df$columns
df2$columns

# Show flags
df$sort("Sepal.Length")$flags

# set_column_from_robj-method is fallible and returned a result which could
# be "ok" or an error.
# No public method or function will ever return a result.
# The `result` is very close to the same as output from functions decorated
# with purrr::safely.
# To use results on the R side, these must be unwrapped first such that
# potentially errors can be thrown. `unwrap(result)` is a way to communicate
# errors happening on the Rust side to the R side. `Extendr` default behavior
# is to use `panic!`(s) which would cause some unnecessarily confusing and
# some very verbose error messages on the inner workings of rust.
# `unwrap(result)` in this case no error, just a NULL because this mutable
# method does not return any ok-value.

# Try unwrapping an error from polars due to unmatching column lengths
err_result = .pr$DataFrame$set_column_from_robj(df, 1:10000, "wrong_length")
tryCatch(unwrap(err_result, call = NULL), error = \(e) cat(as.character(e)))
Returns an n-row null-filled DataFrame with an identical schema. n can be greater than the current number of rows in the DataFrame.
DataFrame_clear(n = 0)
n |
Number of (null-filled) rows to return in the cleared frame. |
An n-row null-filled DataFrame with an identical schema
df = pl$DataFrame(
  a = c(NA, 2, 3, 4),
  b = c(0.5, NA, 2.5, 13),
  c = c(TRUE, TRUE, FALSE, NA)
)

df$clear()

df$clear(n = 5)
This makes a very cheap deep copy/clone of an existing DataFrame. It is rarely useful, as DataFrames are nearly 100% immutable. Any modification of a DataFrame should lead to a clone anyway, but this can be useful when dealing with attributes (see examples).
DataFrame_clone()
A DataFrame
df1 = pl$DataFrame(iris)

# Make a function to take a DataFrame, add an attribute, and return a DataFrame
give_attr = function(data) {
  attr(data, "created_on") = "2024-01-29"
  data
}
df2 = give_attr(df1)

# Problem: the original DataFrame also gets the attribute while it shouldn't!
attributes(df1)

# Use $clone() inside the function to avoid that
give_attr = function(data) {
  data = data$clone()
  attr(data, "created_on") = "2024-01-29"
  data
}
df1 = pl$DataFrame(iris)
df2 = give_attr(df1)

# now, the original DataFrame doesn't get this attribute
attributes(df1)
This returns the total number of rows, the number of missing values, the mean, standard deviation, min, max, median and the percentiles specified in the argument percentiles.
DataFrame_describe(percentiles = c(0.25, 0.75), interpolation = "nearest")
percentiles |
One or more percentiles to include in the summary statistics.
All values must be in the range |
interpolation |
Interpolation method for computing quantiles. One of
|
DataFrame
pl$DataFrame(iris)$describe()

# string, date, boolean columns are also supported:
df = pl$DataFrame(
  int = 1:3,
  string = c(letters[1:2], NA),
  date = c(as.Date("2024-01-20"), as.Date("2024-01-21"), NA),
  cat = factor(c(letters[1:2], NA)),
  bool = c(TRUE, FALSE, NA)
)
df

df$describe()
Drop columns of a DataFrame
DataFrame_drop(...)
... |
Characters of column names to drop. Passed to |
DataFrame
pl$DataFrame(mtcars)$drop(c("mpg", "hp"))

# equivalent
pl$DataFrame(mtcars)$drop("mpg", "hp")
Drop a single column in-place and return the dropped column.
DataFrame_drop_in_place(name)
name |
string Name of the column to drop. |
Series
dat = pl$DataFrame(iris)
x = dat$drop_in_place("Species")
x
dat$columns
Drop all rows that contain nulls (which correspond to NA in R).
DataFrame_drop_nulls(subset = NULL)
subset |
A character vector with the names of the column(s) for which
nulls are considered. If |
DataFrame
tmp = mtcars
tmp[1:3, "mpg"] = NA
tmp[4, "hp"] = NA
tmp = pl$DataFrame(tmp)

# number of rows in `tmp` before dropping nulls
tmp$height

tmp$drop_nulls()$height
tmp$drop_nulls("mpg")$height
tmp$drop_nulls(c("mpg", "hp"))$height
Get the data type of all columns as strings. You can see all
available types with names(pl$dtypes). The data type of each column is also
shown when printing the DataFrame.
DataFrame_dtype_strings()
A character vector with the data type of each column
pl$DataFrame(iris)$dtype_strings()
Check if two DataFrames are equal.
DataFrame_equals(other)
other |
DataFrame to compare with. |
A logical value
dat1 = pl$DataFrame(iris)
dat2 = pl$DataFrame(iris)
dat3 = pl$DataFrame(mtcars)
dat1$equals(dat2)
dat1$equals(dat3)
Return an estimation of the total (heap) allocated size of the DataFrame.
DataFrame_estimated_size()
Estimated size in bytes
pl$DataFrame(mtcars)$estimated_size()
Explode columns containing a list of values
DataFrame_explode(...)
... |
Column(s) to be exploded as individual |
DataFrame
df = pl$DataFrame(
  letters = letters[1:4],
  numbers = list(1, c(2, 3), c(4, 5), c(6, 7, 8)),
  numbers_2 = list(0, c(1, 2), c(3, 4), c(5, 6, 7)) # same structure as numbers
)
df

# explode a single column, append others
df$explode("numbers")

# explode two columns of same nesting structure, by names or the common dtype
# "List(Float64)"
df$explode("numbers", "numbers_2")
df$explode(pl$col(pl$List(pl$Float64)))
Fill floating point NaN value with a fill value
DataFrame_fill_nan(value)
value |
Value used to fill |
DataFrame
df = pl$DataFrame(
  a = c(1.5, 2, NaN, 4),
  b = c(1.5, NaN, NaN, 4)
)
df$fill_nan(99)
Fill null values (which correspond to NA in R) using the
specified value or strategy.
DataFrame_fill_null(fill_value)
fill_value |
Value to fill nulls with. |
DataFrame
df = pl$DataFrame(
  a = c(1.5, 2, NA, 4),
  b = c(1.5, NA, NA, 4)
)

df$fill_null(99)

df$fill_null(pl$col("a")$mean())
Filter rows with an Expression defining a boolean column.
Multiple expressions are combined with & (AND).
This is equivalent to dplyr::filter().
DataFrame_filter(...)
... |
Polars expressions which will evaluate to a boolean. |
Rows where the condition returns NA are dropped.
A DataFrame with only the rows where the conditions are TRUE.
df = pl$DataFrame(iris)

df$filter(pl$col("Sepal.Length") > 5)

# This is equivalent to
# df$filter(pl$col("Sepal.Length") > 5 & pl$col("Petal.Width") < 1)
df$filter(pl$col("Sepal.Length") > 5, pl$col("Petal.Width") < 1)

# rows where condition is NA are dropped
iris2 = iris
iris2[c(1, 3, 5), "Species"] = NA
df = pl$DataFrame(iris2)

df$filter(pl$col("Species") == "setosa")
Get the first row of the DataFrame.
DataFrame_first()
A DataFrame with one row.
pl$DataFrame(mtcars)$first()
Take every nth row in the DataFrame
DataFrame_gather_every(n, offset = 0)
n |
Gather every |
offset |
Starting index. |
A DataFrame
df = pl$DataFrame(a = 1:4, b = 5:8)

df$gather_every(2)

df$gather_every(2, offset = 1)
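As a rough mental model, `$gather_every(n, offset)` keeps the rows at 0-based positions `offset`, `offset + n`, `offset + 2 * n`, and so on. The base R sketch below only illustrates that arithmetic. `gathered_rows()` is a hypothetical helper written for this note, not part of polars, and it assumes `offset` is counted from 0.

```r
# Hypothetical helper (not part of polars): the 0-based row positions that
# gather_every(n, offset) is expected to keep for a frame with `height` rows.
gathered_rows = function(height, n, offset = 0) {
  # Nothing to keep when the starting position is past the last row.
  if (offset >= height) {
    return(numeric(0))
  }
  seq(from = offset, to = height - 1, by = n)
}

# A 4-row frame, as in the example above.
gathered_rows(4, 2)
gathered_rows(4, 2, offset = 1)
```

With four rows and `n = 2`, the first call yields positions 0 and 2 and the second yields positions 1 and 3, that is, the first and third rows and then the second and fourth rows.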
Extract a DataFrame column as a Polars series.
DataFrame_get_column(name)
name |
Name of the column to extract. |
Series
df = pl$DataFrame(iris[1:2, ])
df$get_column("Species")
Get the DataFrame as a List of Series
DataFrame_get_columns()
A list of Series
<DataFrame>$to_list(): Similar to this method but returns a list of vectors instead of Series.
df = pl$DataFrame(foo = 1L:3L, bar = 4L:6L)
df$get_columns()

df = pl$DataFrame(
  a = 1:4,
  b = c(0.5, 4, 10, 13),
  c = c(TRUE, TRUE, FALSE, TRUE)
)
df$get_columns()
The formatting shows one line per column so that wide DataFrames display cleanly. Each line shows the column name, the data type, and the first few values.
DataFrame_glimpse( ..., max_items_per_column = 10, max_colname_length = 50, return_as_string = FALSE )
... |
Ignored. |
max_items_per_column |
Maximum number of items to show per column. |
max_colname_length |
Maximum length of the displayed column names. Values that exceed this value are truncated with a trailing ellipsis. |
return_as_string |
Logical (default |
DataFrame
pl$DataFrame(iris)$glimpse()
This doesn't modify the data but only stores information about
the group structure. This structure can then be used by several functions
($agg(), $filter(), etc.).
DataFrame_group_by(..., maintain_order = polars_options()$maintain_order)
... |
Column(s) to group by. Accepts expression input. Characters are parsed as column names. |
maintain_order |
Ensure that the order of the groups is consistent with the input data.
This is slower than a default group by.
Setting this to |
Within each group, the order of the rows is always preserved,
regardless of the maintain_order argument.
GroupBy (a DataFrame with special groupby methods like $agg())
df = pl$DataFrame(
  a = c("a", "b", "a", "b", "c"),
  b = c(1, 2, 1, 3, 3),
  c = c(5, 4, 3, 2, 1)
)

df$group_by("a")$agg(pl$col("b")$sum())

# Set `maintain_order = TRUE` to ensure the order of the groups is consistent with the input.
df$group_by("a", maintain_order = TRUE)$agg(pl$col("c"))

# Group by multiple columns by passing a list of column names.
df$group_by(c("a", "b"))$agg(pl$max("c"))

# Or pass some arguments to group by multiple columns in the same way.
# Expressions are also accepted.
df$group_by("a", pl$col("b") %/% 2)$agg(
  pl$col("c")$mean()
)

# The columns will be renamed to the argument names.
df$group_by(d = "a", e = pl$col("b") %/% 2)$agg(
  pl$col("c")$mean()
)
If you have a time series <t_0, t_1, ..., t_n>, then by default the windows
created will be:
(t_0 - period, t_0]
(t_1 - period, t_1]
…
(t_n - period, t_n]
whereas if you pass a non-default offset, then the windows will be:
(t_0 + offset, t_0 + offset + period]
(t_1 + offset, t_1 + offset + period]
…
(t_n + offset, t_n + offset + period]
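The interval notation above can be checked by hand on a small index. The base R sketch below is only an illustration of the two formulas as written; it does not call polars, and the index values, `period`, and `offset` are made up for the example.

```r
# Illustrative index values and window length (made up for this sketch).
t = c(1, 2, 4, 7, 8)
period = 3

# Default windows: (t_i - period, t_i], open on the left, closed on the right.
default_windows = lapply(t, function(ti) t[t > ti - period & t <= ti])

# Windows with a non-default offset: (t_i + offset, t_i + offset + period].
offset = 0
offset_windows = lapply(t, function(ti) {
  t[t > ti + offset & t <= ti + offset + period]
})

default_windows
offset_windows
```

For `t_i = 4`, the default window is (1, 4] and contains 2 and 4. With `offset = 0`, the window for `t_i = 1` is (1, 4], which also contains 2 and 4 but no longer includes the anchor value itself.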
DataFrame_group_by_dynamic( index_column, ..., every, period = NULL, offset = NULL, include_boundaries = FALSE, closed = "left", label = "left", group_by = NULL, start_by = "window" )
index_column |
Column used to group based on the time window. Often of
type Date/Datetime. This column must be sorted in ascending order (or, if |
... |
Ignored. |
every |
Interval of the window. |
period |
A character representing the length of the window,
must be non-negative. See the |
offset |
A character representing the offset of the window,
or |
include_boundaries |
Add two columns |
closed |
Define which sides of the temporal interval are closed
(inclusive). This can be either |
label |
Define which label to use for the window:
|
group_by |
Also group by this column/these columns. |
start_by |
The strategy to determine the start of the first window by:
|
In case of a rolling operation on an integer column, the windows are defined by:
"1i" # length 1
"10i" # length 10
A GroupBy object
df = pl$DataFrame(
  time = pl$datetime_range(
    start = strptime("2021-12-16 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    end = strptime("2021-12-16 03:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    interval = "30m"
  ),
  n = 0:6
)

# get the sum in the following hour relative to the "time" column
df$group_by_dynamic("time", every = "1h")$agg(
  vals = pl$col("n"),
  sum = pl$col("n")$sum()
)

# using "include_boundaries = TRUE" is helpful to see the period considered
df$group_by_dynamic("time", every = "1h", include_boundaries = TRUE)$agg(
  vals = pl$col("n")
)

# in the example above, the values didn't include the one *exactly* 1h after
# the start because "closed = 'left'" by default.
# Changing it to "right" includes values that are exactly 1h after. Note that
# the value at 00:00:00 now becomes included in the interval [23:00:00 - 00:00:00],
# even if this interval wasn't there originally
df$group_by_dynamic("time", every = "1h", closed = "right")$agg(
  vals = pl$col("n")
)

# To keep both boundaries, we use "closed = 'both'". Some values now belong to
# several groups:
df$group_by_dynamic("time", every = "1h", closed = "both")$agg(
  vals = pl$col("n")
)

# Dynamic group bys can also be combined with grouping on normal keys
df = df$with_columns(
  groups = as_polars_series(c("a", "a", "a", "b", "b", "a", "a"))
)
df

df$group_by_dynamic(
  "time",
  every = "1h",
  closed = "both",
  group_by = "groups",
  include_boundaries = TRUE
)$agg(pl$col("n"))

# We can also create a dynamic group by based on an index column
df = pl$LazyFrame(
  idx = 0:5,
  A = c("A", "A", "B", "B", "B", "C")
)$with_columns(pl$col("idx")$set_sorted())
df

df$group_by_dynamic(
  "idx",
  every = "2i",
  period = "3i",
  include_boundaries = TRUE,
  closed = "right"
)$agg(A_agg_list = pl$col("A"))
Get the first n rows.
DataFrame_head(n = 5L)
DataFrame_limit(n = 5L)
n |
Number of rows to return. If a negative value is passed,
return all rows except the last |
$limit() is an alias for $head().
df = pl$DataFrame(foo = 1:5, bar = 6:10, ham = letters[1:5])

df$head(3)

# Pass a negative value to get all rows except the last `abs(n)`.
df$head(-3)
If row and column location are not specified, the DataFrame must have dimensions (1, 1).
DataFrame_item(row = NULL, column = NULL)
row |
Optional row index (0-indexed). |
column |
Optional column index (0-indexed) or name. |
A value of length 1
df = pl$DataFrame(a = c(1, 2, 3), b = c(4, 5, 6))

df$select((pl$col("a") * pl$col("b"))$sum())$item()
df$item(1, 1)
df$item(2, "b")
This function can do both mutating joins (adding columns based on matching
observations, for example with how = "left") and filtering joins (keeping
observations based on matching observations, for example with how = "semi").
DataFrame_join( other, on = NULL, how = "inner", ..., left_on = NULL, right_on = NULL, suffix = "_right", validate = "m:m", join_nulls = FALSE, allow_parallel = TRUE, force_parallel = FALSE, coalesce = NULL )
other |
DataFrame to join with. |
on |
Either a vector of column names or a list of expressions and/or
strings. Use |
how |
One of the following methods: "inner", "left", "right", "full", "semi", "anti", "cross". |
... |
Ignored. |
left_on , right_on
|
Same as |
suffix |
Suffix to add to duplicated column names. |
validate |
Checks if join is of specified type:
Note that this is currently not supported by the streaming engine, and is only supported when joining by single columns. |
join_nulls |
Join on null values. By default null values will never produce matches. |
allow_parallel |
Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel. |
force_parallel |
Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel. |
coalesce |
Coalescing behavior (merging of join columns).
|
DataFrame
# inner join by default
df1 = pl$DataFrame(list(key = 1:3, payload = c("f", "i", NA)))
df2 = pl$DataFrame(list(key = c(3L, 4L, 5L, NA_integer_)))
df1$join(other = df2, on = "key")

# cross join
df1 = pl$DataFrame(x = letters[1:3])
df2 = pl$DataFrame(y = 1:4)
df1$join(other = df2, how = "cross")
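The `how` argument also lists "semi" and "anti", which act as filters on the left DataFrame and add no columns from `other`. The sketch below is a minimal illustration, assuming the polars package is installed and attached; the data and the names `left`, `right`, `semi`, and `anti` are made up for the example.

```r
library(polars)

left = pl$DataFrame(key = 1:3, payload = c("f", "i", "x"))
right = pl$DataFrame(key = c(3L, 4L, 5L))

# semi join: keep the rows of `left` whose key has a match in `right`
semi = left$join(other = right, on = "key", how = "semi")
semi

# anti join: keep the rows of `left` whose key has no match in `right`
anti = left$join(other = right, on = "key", how = "anti")
anti
```

Here only `key = 3` appears in both inputs, so the semi join keeps one row and the anti join keeps the other two. Both results have the same columns as `left`.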
This is similar to a left-join except that we match on nearest key rather than equal keys.
DataFrame_join_asof( other, ..., left_on = NULL, right_on = NULL, on = NULL, by_left = NULL, by_right = NULL, by = NULL, strategy = c("backward", "forward", "nearest"), suffix = "_right", tolerance = NULL, allow_parallel = TRUE, force_parallel = FALSE, coalesce = TRUE )
other |
DataFrame or LazyFrame |
... |
Not used, blocks use of further positional arguments |
left_on , right_on
|
Same as |
on |
Either a vector of column names or a list of expressions and/or
strings. Use |
by_left , by_right
|
Same as |
by |
Join on these columns before performing asof join. Either a vector
of column names or a list of expressions and/or strings. Use |
strategy |
Strategy for where to find match:
|
suffix |
Suffix to add to duplicated column names. |
tolerance |
Numeric tolerance. By setting this the join will only be done if the near
keys are within this distance. If an asof join is done on columns of dtype
"Date", "Datetime", "Duration" or "Time", use the Polars duration string language.
About the language, see the Polars duration string language section below.
There may be a circumstance where R types are not sufficient to express a
numeric tolerance. In that case, you can use the expression syntax like
|
allow_parallel |
Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel. |
force_parallel |
Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel. |
coalesce |
Coalescing behavior (merging of
|
Both tables (DataFrames or LazyFrames) must be sorted by the asof_join key.
New joined DataFrame
Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.
It has the following format:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: "3d12h4m25s"
# 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
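To make the combined form concrete, the base R sketch below splits a duration string into its count and unit parts. It is purely illustrative: `parse_duration()` is a hypothetical helper written for this note, not a polars function, and it makes no claim about how polars itself parses these strings.

```r
# Hypothetical helper (not part of polars): split a duration string such as
# "3d12h4m25s" into (count, unit) pairs using a regular expression.
parse_duration = function(x) {
  # Two-letter units are listed before the one-letter units they start with.
  pattern = "([0-9]+)(ns|us|ms|mo|s|m|h|d|w|q|y)"
  tokens = regmatches(x, gregexpr(pattern, x))[[1]]
  data.frame(
    count = as.integer(sub(pattern, "\\1", tokens)),
    unit = sub(pattern, "\\2", tokens),
    stringsAsFactors = FALSE
  )
}

parse_duration("3d12h4m25s")
parse_duration("2w2d")
```

The first call returns four rows (3 d, 12 h, 4 m, 25 s) and the second returns two rows (2 w, 2 d), matching the units listed above.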
# create two DataFrames to join asof
gdp = pl$DataFrame(
  date = as.Date(c("2015-1-1", "2016-1-1", "2017-5-1", "2018-1-1", "2019-1-1")),
  gdp = c(4321, 4164, 4411, 4566, 4696),
  group = c("b", "a", "a", "b", "b")
)

pop = pl$DataFrame(
  date = as.Date(c("2016-5-12", "2017-5-12", "2018-5-12", "2019-5-12")),
  population = c(82.19, 82.66, 83.12, 83.52),
  group = c("b", "b", "a", "a")
)

# optional: make sure tables are already sorted by the "on" join key
gdp = gdp$sort("date")
pop = pop$sort("date")

# Left-join_asof DataFrame pop with gdp on "date"
# Look backward in gdp to find closest matching date
pop$join_asof(gdp, on = "date", strategy = "backward")

# .... and forward
pop$join_asof(gdp, on = "date", strategy = "forward")

# join by a group: "only look within group"
pop$join_asof(gdp, on = "date", by = "group", strategy = "backward")

# only look 2 weeks and 2 days back
pop$join_asof(gdp, on = "date", strategy = "backward", tolerance = "2w2d")

# only look 11 days back (numeric tolerance depends on polars type, <date> is in days)
pop$join_asof(gdp, on = "date", strategy = "backward", tolerance = 11)
Get the last row of the DataFrame.
DataFrame_last()
A DataFrame with one row.
pl$DataFrame(mtcars)$last()
Start a new lazy query from a DataFrame.
DataFrame_lazy()
A LazyFrame
pl$DataFrame(iris)$lazy()
Aggregate the columns in the DataFrame to their maximum value.
DataFrame_max()
A DataFrame with one row.
pl$DataFrame(mtcars)$max()
Aggregate the columns in the DataFrame to their mean value.
DataFrame_mean()
A DataFrame with one row.
pl$DataFrame(mtcars)$mean()
Aggregate the columns in the DataFrame to their median value.
DataFrame_median()
A DataFrame with one row.
pl$DataFrame(mtcars)$median()
Aggregate the columns in the DataFrame to their minimum value.
DataFrame_min()
A DataFrame with one row.
pl$DataFrame(mtcars)$min()
Number of chunks (memory allocations) for all or first Series in a DataFrame.
DataFrame_n_chunks(strategy = "first")
strategy |
Either |
A DataFrame is a vector of Series. Each Series in rust-polars is a wrapper around a ChunkedArray, which is like a virtual contiguous vector physically backed by an ordered set of chunks. Each chunk of values has a contiguous memory layout and is an arrow array. Arrow arrays are a fast, thread-safe and cross-platform memory layout.
In R, combining with c() or rbind() requires immediate vector re-allocation
to place vector values in contiguous memory. This is slow and memory consuming,
and it is why repeatedly appending to a vector in R is discouraged.
In polars, when we concatenate or append to Series or DataFrame, the re-allocation can be avoided or delayed by simply appending chunks to each individual Series. However, if chunks become many and small or are misaligned across Series, this can hurt the performance of subsequent operations.
In most places in the polars API where chunking could occur, the user
typically has to actively opt out by setting an argument rechunk = FALSE.
A real vector of chunk counts per Series.
# create DataFrame with misaligned chunks
df = pl$concat(
  1:10, # single chunk
  pl$concat(1:5, 1:5, rechunk = FALSE, how = "vertical")$rename("b"), # two chunks
  how = "horizontal"
)
df
df$n_chunks()

# rechunk a chunked DataFrame
df$rechunk()$n_chunks()

# rechunk is not an in-place operation
df$n_chunks()

# The following toy example emulates the Series "chunkyness" in R. Here it is an
# S3-classed list of vectors of the same type, with all relevant S3 generics
# implemented to make it behave as if it were a regular vector.
"+.chunked_vector" = \(x, y) structure(list(unlist(x) + unlist(y)), class = "chunked_vector")
print.chunked_vector = \(x, ...) print(unlist(x), ...)
c.chunked_vector = \(...) {
  structure(do.call(c, lapply(list(...), unclass)), class = "chunked_vector")
}
rechunk = \(x) structure(unlist(x), class = "chunked_vector")
x = structure(list(1:4, 5L), class = "chunked_vector")
x
x + 5:1
lapply(x, tracemem) # trace chunks to verify no re-allocation
z = c(x, x)
z # looks like a plain vector
lapply(z, tracemem) # memory allocations in z are the same as in x
str(z)
z = rechunk(z)
str(z)
Create a new DataFrame that shows the null (which correspond
to NA in R) counts per column.
DataFrame_null_count()
DataFrame
x = mtcars
x[1, 2:3] = NA
pl$DataFrame(x)$null_count()
Similar to $group_by().
Group by the given columns and return the groups as separate DataFrames.
It is useful to use this in combination with functions like lapply()
or purrr::map().
DataFrame_partition_by( ..., maintain_order = TRUE, include_key = TRUE, as_nested_list = FALSE )
... |
Characters of column names to group by. Passed to |
maintain_order |
If |
include_key |
If |
as_nested_list |
This affects the format of the output.
If |
A list of DataFrames. See the examples for details.
df = pl$DataFrame(
  a = c("a", "b", "a", "b", "c"),
  b = c(1, 2, 1, 3, 3),
  c = c(5, 4, 3, 2, 1)
)
df

# Pass a single column name to partition by that column.
df$partition_by("a")

# Partition by multiple columns.
df$partition_by("a", "b")

# Partition by column data type
df$partition_by(pl$String)

# If `as_nested_list = TRUE`, the output is a list whose elements have a `key` and a `data` field.
# The `key` is a named list of the key values, and the `data` is the DataFrame.
df$partition_by("a", "b", as_nested_list = TRUE)

# `as_nested_list = TRUE` should be used with `maintain_order = TRUE` or `include_key = TRUE`.
tryCatch(
  df$partition_by("a", "b", maintain_order = FALSE, include_key = FALSE, as_nested_list = TRUE),
  warning = function(w) w
)

# Example of using with lapply(), and printing the key and the data summary
df$partition_by("a", "b", maintain_order = FALSE, as_nested_list = TRUE) |>
  lapply(\(x) {
    sprintf("\nThe key value of `a` is %s and the key value of `b` is %s\n", x$key$a, x$key$b) |>
      cat()
    x$data$drop(names(x$key))$describe() |>
      print()
    invisible(NULL)
  }) |>
  invisible()
Pivot data from long to wide
DataFrame_pivot( on, ..., index, values, aggregate_function = NULL, maintain_order = TRUE, sort_columns = FALSE, separator = "_" )
on |
Name of the column(s) whose values will be used as the header of the output DataFrame. |
... |
Not used. |
index |
One or multiple keys to group by. |
values |
Column values to aggregate. Can be multiple columns if the
|
aggregate_function |
One of:
|
maintain_order |
Sort the grouped keys so that the output order is predictable. |
sort_columns |
Sort the transposed columns by name. Default is by order of discovery. |
separator |
Used as separator/delimiter in generated column names. |
DataFrame
df = pl$DataFrame(
  foo = c("one", "one", "one", "two", "two", "two"),
  bar = c("A", "B", "C", "A", "B", "C"),
  baz = c(1, 2, 3, 4, 5, 6)
)
df

df$pivot(
  values = "baz", index = "foo", on = "bar"
)

# Run an expression as aggregation function
df = pl$DataFrame(
  col1 = c("a", "a", "a", "b", "b", "b"),
  col2 = c("x", "x", "x", "x", "y", "y"),
  col3 = c(6, 7, 3, 2, 5, 7)
)
df

df$pivot(
  index = "col1",
  on = "col2",
  values = "col3",
  aggregate_function = pl$element()$tanh()$mean()
)
Aggregate the columns in the DataFrame to a unique quantile
value. Use $describe() to specify several quantiles.
DataFrame_quantile(quantile, interpolation = "nearest")
quantile |
Numeric of length 1 between 0 and 1. |
interpolation |
One of |
DataFrame
pl$DataFrame(mtcars)$quantile(.4)
Rechunking re-allocates any "chunked" memory allocations to speed up subsequent work, e.g. vectorized operations.
DataFrame_rechunk()
A DataFrame is a vector of Series. Each Series in rust-polars is a wrapper around a ChunkedArray, which is like a virtual contiguous vector physically backed by an ordered set of chunks. Each chunk of values has a contiguous memory layout and is an arrow array. Arrow arrays are a fast, thread-safe and cross-platform memory layout.
In R, combining with c() or rbind() requires immediate vector re-allocation to place vector values in contiguous memory. This is slow and memory consuming, and it is why repeatedly appending to a vector in R is discouraged.
In polars, when we concatenate or append to Series or DataFrame, the re-allocation can be avoided or delayed by simply appending chunks to each individual Series. However, if chunks become many and small or are misaligned across Series, this can hurt the performance of subsequent operations.
In most places in the polars API where chunking could occur, the user typically has to actively opt out by setting the argument rechunk = FALSE.
A DataFrame
# create DataFrame with misaligned chunks
df = pl$concat(
  1:10, # single chunk
  pl$concat(1:5, 1:5, rechunk = FALSE, how = "vertical")$rename("b"), # two chunks
  how = "horizontal"
)
df
df$n_chunks()

# rechunk a chunked DataFrame
df$rechunk()$n_chunks()

# rechunk is not an in-place operation
df$n_chunks()

# The following toy example emulates the Series "chunkiness" in R. Here it is an
# S3-classed list of vectors of the same type, with all relevant S3 generics
# implemented so that it behaves as if it were a regular vector.
"+.chunked_vector" = \(x, y) structure(list(unlist(x) + unlist(y)), class = "chunked_vector")
print.chunked_vector = \(x, ...) print(unlist(x), ...)
c.chunked_vector = \(...) {
  structure(do.call(c, lapply(list(...), unclass)), class = "chunked_vector")
}
rechunk = \(x) structure(unlist(x), class = "chunked_vector")
x = structure(list(1:4, 5L), class = "chunked_vector")
x
x + 5:1
lapply(x, tracemem) # trace chunks to verify no re-allocation
z = c(x, x)
z # looks like a plain vector
lapply(z, tracemem) # memory allocations in z are the same as in x
str(z)
z = rechunk(z)
str(z)
Rename column names of a DataFrame
DataFrame_rename(...)
... |
One of the following:
|
If existing names are swapped (e.g. A points to B and B points to A), polars will block projection and predicate pushdowns at this node.
df = pl$DataFrame(
  foo = 1:3,
  bar = 6:8,
  ham = letters[1:3]
)

df$rename(foo = "apple")

df$rename(
  \(column_name) paste0("c", substr(column_name, 2, 100))
)
Reverse the DataFrame (the last row becomes the first one, etc.).
DataFrame_reverse()
DataFrame
pl$DataFrame(mtcars)$reverse()
If you have a time series <t_0, t_1, ..., t_n>, then by default the windows created will be:
(t_0 - period, t_0]
(t_1 - period, t_1]
…
(t_n - period, t_n]
whereas if you pass a non-default offset, then the windows will be:
(t_0 + offset, t_0 + offset + period]
(t_1 + offset, t_1 + offset + period]
…
(t_n + offset, t_n + offset + period]
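The window rule above can be sketched in base R for a plain numeric index. The helper `rolling_sum()` below is hypothetical and not part of polars. It assumes right-closed windows and that leaving out `offset` behaves like `offset = -period`, which is what the two window listings above imply.

```r
# Hypothetical helper, not part of polars: sums `values` over the windows
# (t_i + offset, t_i + offset + period] for every index value t_i.
# Assumption: omitting `offset` is treated as offset = -period, which
# reproduces the default windows (t_i - period, t_i].
rolling_sum = function(index, values, period, offset = -period) {
  stopifnot(length(index) == length(values), period >= 0)
  vapply(index, function(t_i) {
    lower = t_i + offset          # excluded from the window
    upper = t_i + offset + period # included in the window
    sum(values[index > lower & index <= upper])
  }, numeric(1))
}

index = c(1, 2, 3, 5)
values = c(1, 2, 3, 4)

# Default windows: (-1, 1], (0, 2], (1, 3], (3, 5]
rolling_sum(index, values, period = 2)

# Offset of 0: the windows look forward instead, (t_i, t_i + 2]
rolling_sum(index, values, period = 2, offset = 0)
```

With the default offset each window ends at its own index value, so a row always contributes to its own aggregate. With an offset of 0 it never does.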
DataFrame_rolling( index_column, ..., period, offset = NULL, closed = "right", group_by = NULL )
index_column |
Column used to group based on the time window. Often of
type Date/Datetime. This column must be sorted in ascending order (or, if |
... |
Ignored. |
period |
A character representing the length of the window,
must be non-negative. See the |
offset |
A character representing the offset of the window,
or |
closed |
Define which sides of the temporal interval are closed
(inclusive). This can be either |
group_by |
Also group by this column/these columns. |
In case of a rolling operation on an integer column, the windows are defined by:
"1i" # length 1
"10i" # length 10
A RollingGroupBy object
Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.
It has the following format:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
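To make the format concrete, here is a small base R sketch that splits a duration string into its count and unit parts. `parse_duration()` is a hypothetical helper written for illustration only. Polars parses these strings internally and may accept or reject inputs differently.

```r
# Hypothetical helper, not part of polars: split a duration string such as
# "3d12h4m25s" into a named integer vector of counts, named by unit.
parse_duration = function(x) {
  # Two-letter units are listed before the one-letter ones so that "ms" or
  # "mo" is not read as "m" followed by stray text. "i" is the unit used for
  # integer index columns.
  pattern = "[0-9]+(ns|us|ms|mo|[smhdwqyi])"
  tokens = regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
  # Every character must belong to some token, otherwise the string is invalid.
  stopifnot(identical(paste(tokens, collapse = ""), x))
  counts = as.integer(sub("[a-z]+$", "", tokens))
  names(counts) = sub("^[0-9]+", "", tokens)
  counts
}

parse_duration("3d12h4m25s")
parse_duration("1mo")
parse_duration("10i")
```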
date = c(
  "2020-01-01 13:45:48",
  "2020-01-01 16:42:13",
  "2020-01-01 16:45:09",
  "2020-01-02 18:12:48",
  "2020-01-03 19:45:32",
  "2020-01-08 23:16:43"
)
df = pl$DataFrame(dt = date, a = c(3, 7, 5, 9, 2, 1))$with_columns(
  pl$col("dt")$str$strptime(pl$Datetime())$set_sorted()
)

df$rolling(index_column = "dt", period = "2d")$agg(
  sum_a = pl$sum("a"),
  min_a = pl$min("a"),
  max_a = pl$max("a")
)
Take a sample of rows from a DataFrame
DataFrame_sample( n = NULL, ..., fraction = NULL, with_replacement = FALSE, shuffle = FALSE, seed = NULL )
n |
Number of rows to return. Cannot be used with |
... |
Ignored. |
fraction |
Fraction of rows to return. Cannot be used with |
with_replacement |
Allow values to be sampled more than once. |
shuffle |
If |
seed |
Seed for the random number generator. If set to |
DataFrame
df = pl$DataFrame(iris)
df$sample(n = 20)
df$sample(fraction = 0.1)
Similar to dplyr::mutate(). However, it discards unmentioned columns (like .() in data.table).
DataFrame_select(...)
... |
Columns to keep. Those can be expressions (e.g |
DataFrame
pl$DataFrame(iris)$select(
  pl$col("Sepal.Length")$abs()$alias("abs_SL"),
  (pl$col("Sepal.Length") + 2)$alias("add_2_SL")
)
Similar to dplyr::mutate(). However, it discards unmentioned columns (like .() in data.table).
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap. Otherwise, $select() should be preferred.
DataFrame_select_seq(...)
... |
Columns to keep. Those can be expressions (e.g |
DataFrame
pl$DataFrame(iris)$select_seq(
  pl$col("Sepal.Length")$abs()$alias("abs_SL"),
  (pl$col("Sepal.Length") + 2)$alias("add_2_SL")
)
Shift the values by a given period. If the period (n) is positive, then n rows will be inserted at the top of the DataFrame and the last n rows will be discarded. Vice-versa if the period is negative. In the end, the total number of rows of the DataFrame doesn't change.
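The same rule can be written out for a single base R vector. `shift_vec()` is a hypothetical helper used only to illustrate the semantics described above. It is not how polars implements the operation.

```r
# Hypothetical helper, not part of polars: shift a vector by `n` positions,
# padding with `fill_value` so that the length never changes.
shift_vec = function(x, n = 1L, fill_value = NA) {
  len = length(x)
  k = min(abs(n), len)       # cannot shift by more than the vector length
  pad = rep(fill_value, k)
  if (n >= 0) {
    c(pad, x[seq_len(len - k)])     # insert at the top, drop the last k values
  } else {
    c(x[seq_len(len - k) + k], pad) # drop the first k values, pad at the bottom
  }
}

shift_vec(1:4, 2)
shift_vec(1:4, -2)
shift_vec(1:4, -2, fill_value = 100)
```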
DataFrame_shift(n = 1, fill_value = NULL)
n |
Number of indices to shift forward. If a negative value is passed, values are shifted in the opposite direction instead. |
fill_value |
Fill the resulting null values with this value. Accepts expression input. Non-expression inputs are parsed as literals. |
DataFrame
df = pl$DataFrame(a = 1:4, b = 5:8)

df$shift(2)

df$shift(-2)

df$shift(-2, fill_value = 100)
Get a slice of the DataFrame.
DataFrame_slice(offset, length = NULL)
offset |
Start index, can be a negative value. This is 0-indexed, so
|
length |
Length of the slice. If |
DataFrame
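Because the offset is 0-indexed while R indexes rows from 1, it can help to see the translation spelled out. The sketch below uses a hypothetical helper, `slice_rows()`, that maps an offset and length to the equivalent 1-based row numbers. It assumes that a negative offset counts from the end of the data and it only handles offsets that fall inside the data.

```r
# Hypothetical helper, not part of polars: 1-based row numbers selected by a
# 0-based (offset, length) slice. Assumption: a negative offset counts back
# from the end. Offsets outside the data are rejected rather than clamped.
slice_rows = function(n_rows, offset, length = NULL) {
  start = if (offset < 0) n_rows + offset else offset # 0-based start
  stopifnot(start >= 0, start <= n_rows)
  end = if (is.null(length)) n_rows else min(start + length, n_rows)
  as.integer(seq_len(end - start) + start)
}

# skip the first 2 rows and take the 4 following rows
slice_rows(nrow(mtcars), 2, 4)

# start 3 rows from the end and take 2 rows
slice_rows(nrow(mtcars), -3, 2)

# no length given: take everything from the offset to the end
slice_rows(nrow(mtcars), 30)
```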
# skip the first 2 rows and take the 4 following rows
pl$DataFrame(mtcars)$slice(2, 4)

# this is equivalent to:
mtcars[3:6, ]
Sort a DataFrame
DataFrame_sort( by, ..., descending = FALSE, nulls_last = FALSE, maintain_order = FALSE )
by |
Column(s) to sort by. Can be character vector of column names, a list of Expr(s) or a list with a mix of Expr(s) and column names. |
... |
More columns to sort by as above but provided one Expr per argument. |
descending |
Logical. Sort in descending order (default is |
nulls_last |
A logical or logical vector of the same length as the number of columns.
If |
maintain_order |
Whether the order should be maintained if elements are
equal. If |
DataFrame
df = mtcars
df$mpg[1] = NA
df = pl$DataFrame(df)
df$sort("mpg")
df$sort("mpg", nulls_last = TRUE)
df$sort("cyl", "mpg")
df$sort(c("cyl", "mpg"))
df$sort(c("cyl", "mpg"), descending = TRUE)
df$sort(c("cyl", "mpg"), descending = c(TRUE, FALSE))
df$sort(pl$col("cyl"), pl$col("mpg"))
The calling frame is automatically registered as a table in the SQL context under the name "self". All DataFrames and LazyFrames found in the envir are also registered, using their variable name.
More control over registration and execution behaviour is available via the SQLContext object.
DataFrame_sql(query, ..., table_name = NULL, envir = parent.frame())
query |
A character of the SQL query to execute. |
... |
Ignored. |
table_name |
|
envir |
The environment to search for polars DataFrames/LazyFrames. |
This functionality is considered unstable, although it is close to being considered stable. It may be changed at any point without it being considered a breaking change.
df1 = pl$DataFrame(
  a = 1:3,
  b = c("zz", "yy", "xx"),
  c = as.Date(c("1999-12-31", "2010-10-10", "2077-08-08"))
)

# Query the DataFrame using SQL:
df1$sql("SELECT c, b FROM self WHERE a > 1")

# Join two DataFrames using SQL.
df2 = pl$DataFrame(a = 3:1, d = c(125, -654, 888))
df1$sql(
  "
  SELECT self.*, d
  FROM self
  INNER JOIN df2 USING (a)
  WHERE a > 1 AND EXTRACT(year FROM c) < 2050
  "
)

# Apply transformations to a DataFrame using SQL, aliasing "self" to "frame".
df1$sql(
  query = r"(
    SELECT
    a,
    MOD(a, 2) == 0 AS a_is_even,
    CONCAT_WS(':', b, b) AS b_b,
    EXTRACT(year FROM c) AS year,
    0::float AS 'zero'
    FROM frame
  )",
  table_name = "frame"
)
Aggregate the columns of this DataFrame to their standard deviation values.
DataFrame_std(ddof = 1)
ddof |
Delta Degrees of Freedom: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1. |
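The role of ddof is easiest to see when the formula is written out. The base R sketch below uses a hypothetical helper, `std_ddof()`, that applies the N - ddof divisor directly. It is an illustration of the definition above, not the polars implementation.

```r
# Hypothetical helper, not part of polars: standard deviation with an
# explicit ddof, i.e. the sum of squared deviations divided by (N - ddof).
std_ddof = function(x, ddof = 1) {
  n = length(x)
  sqrt(sum((x - mean(x))^2) / (n - ddof))
}

x = mtcars$mpg
std_ddof(x)           # ddof = 1: sample standard deviation, same as sd(x)
std_ddof(x, ddof = 0) # ddof = 0: population standard deviation
```

Squaring either result gives the corresponding variance, so the same divisor applies to a variance computed with a ddof argument.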
A DataFrame with one row.
pl$DataFrame(mtcars)$std()
Aggregate the columns of this DataFrame to their sum values.
DataFrame_sum()
A DataFrame with one row.
pl$DataFrame(mtcars)$sum()
Get the last n rows.
DataFrame_tail(n = 5L)
n |
Number of rows to return. If a negative value is passed,
return all rows except the first |
df = pl$DataFrame(foo = 1:5, bar = 6:10, ham = letters[1:5])

df$tail(3)

# Pass a negative value to get all rows except the first `abs(n)`.
df$tail(-3)
Return Polars DataFrame as R data.frame
DataFrame_to_data_frame( ..., int64_conversion = polars_options()$int64_conversion )
... |
Any args passed to |
int64_conversion |
How should Int64 values be handled when converting a polars object to R?
|
An R data.frame
When converting Polars objects, such as DataFrames, to R objects, for example via the as.data.frame() generic function, each type in the Polars object is converted to an R type. In some cases, an error may occur because the conversion is not appropriate. In particular, there is a high possibility of an error when converting a Datetime type without a time zone.
A Datetime type without a time zone in Polars is converted to the POSIXct type in R, which takes into account the time zone in which the R session is running (which can be checked with the Sys.timezone() function). In this case, if ambiguous or non-existent times are included, a conversion error will occur. In such cases, change the session time zone using Sys.setenv(TZ = "UTC") and then perform the conversion, or use the $dt$replace_time_zone() method on the Datetime type column to explicitly specify the time zone before conversion.
# Due to daylight savings, clocks were turned forward 1 hour on Sunday, March 8, 2020, 2:00:00 am
# so this particular date-time doesn't exist
non_existent_time = as_polars_series("2020-03-08 02:00:00")$str$strptime(pl$Datetime(), "%F %T")

withr::with_timezone(
  "America/New_York",
  {
    tryCatch(
      # This causes an error due to the time zone (the `TZ` env var is affected).
      as.vector(non_existent_time),
      error = function(e) e
    )
  }
)
#> <error: in to_r: ComputeError(ErrString("datetime '2020-03-08 02:00:00' is non-existent in time zone 'America/New_York'. You may be able to use `non_existent='null'` to return `null` in this case.")) When calling: devtools::document()>

withr::with_timezone(
  "America/New_York",
  {
    # This is safe.
    as.vector(non_existent_time$dt$replace_time_zone("UTC"))
  }
)
#> [1] "2020-03-08 02:00:00 UTC"
df = pl$DataFrame(iris[1:3, ])
df$to_data_frame()
Return Polars DataFrame as a list of vectors
DataFrame_to_list( unnest_structs = TRUE, ..., int64_conversion = polars_options()$int64_conversion )
unnest_structs |
Logical. If |
... |
Any args passed to |
int64_conversion |
How should Int64 values be handled when converting a polars object to R?
|
For simplicity reasons, this implementation relies on unnesting all structs before exporting to R. If unnest_structs = FALSE, then struct columns will be returned as nested lists, where each row is a list of values. Such a structure is not very typical or efficient in R.
R list of vectors
When converting Polars objects, such as DataFrames, to R objects, for example via the as.data.frame() generic function, each type in the Polars object is converted to an R type. In some cases, an error may occur because the conversion is not appropriate. In particular, there is a high possibility of an error when converting a Datetime type without a time zone.
A Datetime type without a time zone in Polars is converted to the POSIXct type in R, which takes into account the time zone in which the R session is running (which can be checked with the Sys.timezone() function). In this case, if ambiguous or non-existent times are included, a conversion error will occur. In such cases, change the session time zone using Sys.setenv(TZ = "UTC") and then perform the conversion, or use the $dt$replace_time_zone() method on the Datetime type column to explicitly specify the time zone before conversion.
# Due to daylight savings, clocks were turned forward 1 hour on Sunday, March 8, 2020, 2:00:00 am
# so this particular date-time doesn't exist
non_existent_time = as_polars_series("2020-03-08 02:00:00")$str$strptime(pl$Datetime(), "%F %T")

withr::with_timezone(
  "America/New_York",
  {
    tryCatch(
      # This causes an error due to the time zone (the `TZ` env var is affected).
      as.vector(non_existent_time),
      error = function(e) e
    )
  }
)
#> <error: in to_r: ComputeError(ErrString("datetime '2020-03-08 02:00:00' is non-existent in time zone 'America/New_York'. You may be able to use `non_existent='null'` to return `null` in this case.")) When calling: devtools::document()>

withr::with_timezone(
  "America/New_York",
  {
    # This is safe.
    as.vector(non_existent_time$dt$replace_time_zone("UTC"))
  }
)
#> [1] "2020-03-08 02:00:00 UTC"
<DataFrame>$get_columns(): Similar to this method but returns a list of Series instead of vectors.
pl$DataFrame(iris)$to_list()
Write Arrow IPC data to a raw vector
DataFrame_to_raw_ipc( compression = c("uncompressed", "zstd", "lz4"), ..., compat_level = FALSE )
compression |
|
... |
Ignored. |
compat_level |
Use a specific compatibility level when exporting Polars’ internal data structures. This can be:
|
A raw vector
df = pl$DataFrame(
  foo = 1:5,
  bar = 6:10,
  ham = letters[1:5]
)

raw_ipc = df$to_raw_ipc()

pl$read_ipc(raw_ipc)

if (require("arrow", quietly = TRUE)) {
  arrow::read_ipc_file(raw_ipc, as_data_frame = FALSE)
}
Extract a DataFrame column (by index) as a Polars series. Unlike get_column(), this method will not fail but will return a NULL if the index doesn't exist in the DataFrame. Keep in mind that Polars is 0-indexed so "0" is the first column.
DataFrame_to_series(idx = 0)
idx |
Index of the column to return as Series. Defaults to 0, which is the first column. |
Series or NULL
df = pl$DataFrame(iris[1:10, ])

# default is to extract the first column
df$to_series()

# Polars is 0-indexed, so we use idx = 1 to extract the *2nd* column
df$to_series(idx = 1)

# doesn't error if the column isn't there
df$to_series(idx = 8)
Convert DataFrame to a Series of type "struct"
DataFrame_to_struct(name = "")
name |
Name given to the new Series |
A Series of type "struct"
# round-trip conversion from DataFrame with two columns
df = pl$DataFrame(a = 1:5, b = c("one", "two", "three", "four", "five"))
s = df$to_struct()
s

# convert to an R list
s$to_r()

# Convert back to a DataFrame
df_s = s$to_frame()
df_s
Transpose a DataFrame over the diagonal.
DataFrame_transpose( include_header = FALSE, header_name = "column", column_names = NULL )
include_header |
If |
header_name |
If |
column_names |
Character vector indicating the new column names. If |
This is a very expensive operation.
Transpose may be the fastest option to perform non-foldable (see fold() or reduce()) row operations like median.
Polars transpose is currently eager only, likely because it is not trivial to deduce the schema.
DataFrame
# simple use-case
pl$DataFrame(mtcars)$transpose(include_header = TRUE, column_names = rownames(mtcars))

# All rows must have one shared supertype, recast Categorical to String which is a supertype
# of f64, and then dataset "Iris" can be transposed
pl$DataFrame(iris)$with_columns(pl$col("Species")$cast(pl$String))$transpose()
Drop duplicated rows
DataFrame_unique(subset = NULL, ..., keep = "any", maintain_order = FALSE)
subset |
A character vector with the names of the column(s) to use to
identify duplicates. If |
... |
Not used. |
keep |
Which of the duplicate rows to keep:
|
maintain_order |
Keep the same order as the original data. Setting this
to |
DataFrame
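For readers coming from base R, the "first", "last" and "none" strategies used in the examples for this method correspond roughly to combinations of duplicated(). The sketch below is a base R analogy on whole rows (as with subset = NULL), not the polars implementation. Unlike polars with maintain_order = FALSE, base R always preserves the original row order.

```r
df = data.frame(
  x = c(1:3, 1:3, 3:1, 1L),
  y = c(1:3, 1:3, 1:3, 1L)
)

is_dup_forward = duplicated(df)                   # TRUE for every copy after the first
is_dup_backward = duplicated(df, fromLast = TRUE) # TRUE for every copy before the last

keep_first = df[!is_dup_forward, ]                    # like keep = "first"
keep_last = df[!is_dup_backward, ]                    # like keep = "last"
keep_none = df[!(is_dup_forward | is_dup_backward), ] # like keep = "none"

nrow(df)
nrow(keep_first)
nrow(keep_none)
```

"first" and "last" return the same number of rows and differ only in which physical copy of each duplicate survives. "none" drops every row that has a duplicate.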
df = pl$DataFrame(
  x = c(1:3, 1:3, 3:1, 1L),
  y = c(1:3, 1:3, 1:3, 1L)
)
df$height

df$unique()$height

# subset to define unique, keep only last or first
df$unique(subset = "x", keep = "last")
df$unique(subset = "x", keep = "first")

# only keep unique rows
df$unique(keep = "none")
Unnest the Struct columns of a DataFrame
DataFrame_unnest(...)
... |
Names of the struct columns to unnest. This doesn't accept Expr. If nothing is provided, then all columns of datatype Struct are unnested. |
A DataFrame where some or all columns of datatype Struct are unnested.
df = pl$DataFrame(
  a = 1:5,
  b = c("one", "two", "three", "four", "five"),
  c = 6:10
)$
  select(
  pl$struct("b"),
  pl$struct(c("a", "c"))$alias("a_and_c")
)
df

# by default, all struct columns are unnested
df$unnest()

# we can specify specific columns to unnest
df$unnest("a_and_c")
Unpivot a Frame from wide to long format
DataFrame_unpivot( on = NULL, ..., index = NULL, variable_name = NULL, value_name = NULL )
on |
Columns to use as value variables (the columns to unpivot). If |
... |
Not used. |
index |
Columns to use as identifier variables. |
variable_name |
Name to give to the new column containing the names of the melted columns. Defaults to "variable". |
value_name |
Name to give to the new column containing the values of
the melted columns. Defaults to |
Optionally leaves identifiers set.
This function is useful to massage a Frame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are "unpivoted" to the row axis, leaving just two non-identifier columns, 'variable' and 'value'.
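The reshaping described above can be sketched in base R. `unpivot_df()` is a hypothetical helper written for illustration. It assumes all `on` columns share one type, and the row order it produces (all values of the first `on` column, then the second, and so on) is an assumption rather than a documented guarantee.

```r
# Hypothetical helper, not part of polars: stack the `on` columns into a
# (variable, value) pair of columns, repeating the `index` columns alongside.
unpivot_df = function(df, index, on) {
  n = nrow(df)
  data.frame(
    df[rep(seq_len(n), times = length(on)), index, drop = FALSE],
    variable = rep(on, each = n),             # name of the source column
    value = unlist(df[on], use.names = FALSE), # its values, column by column
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

wide = data.frame(
  a = c("x", "y", "z"),
  b = c(1, 3, 5),
  c = c(2, 4, 6),
  d = c(7, 8, 9),
  stringsAsFactors = FALSE
)

unpivot_df(wide, index = "a", on = c("b", "c", "d"))
```

The result has one row per combination of original row and unpivoted column, so 3 rows and 3 value columns become 9 rows.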
A new DataFrame
df = pl$DataFrame(
  a = c("x", "y", "z"),
  b = c(1, 3, 5),
  c = c(2, 4, 6),
  d = c(7, 8, 9)
)
df$unpivot(index = "a", on = c("b", "c", "d"))
Aggregate the columns of this DataFrame to their variance values.
DataFrame_var(ddof = 1)
ddof |
Delta Degrees of Freedom: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1. |
A DataFrame with one row.
pl$DataFrame(mtcars)$var()
Add columns or modify existing ones with expressions. This is the equivalent of dplyr::mutate() as it keeps unmentioned columns (unlike $select()).
DataFrame_with_columns(...)
... |
Any expressions or string column name, or same wrapped in a list. If first and only element is a list, it is unwrapped as a list of args. |
A DataFrame
pl$DataFrame(iris)$with_columns(
  pl$col("Sepal.Length")$abs()$alias("abs_SL"),
  (pl$col("Sepal.Length") + 2)$alias("add_2_SL")
)

# same query
l_expr = list(
  pl$col("Sepal.Length")$abs()$alias("abs_SL"),
  (pl$col("Sepal.Length") + 2)$alias("add_2_SL")
)
pl$DataFrame(iris)$with_columns(l_expr)

pl$DataFrame(iris)$with_columns(
  pl$col("Sepal.Length")$abs(), # not named expr will keep name "Sepal.Length"
  SW_add_2 = (pl$col("Sepal.Width") + 2)
)
Add columns or modify existing ones with expressions. This is the equivalent of dplyr::mutate() as it keeps unmentioned columns (unlike $select()).
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap. Otherwise, $with_columns() should be preferred.
DataFrame_with_columns_seq(...)
... |
Any expressions or string column name, or same wrapped in a list. If first and only element is a list, it is unwrapped as a list of args. |
A DataFrame
pl$DataFrame(iris)$with_columns_seq(
  pl$col("Sepal.Length")$abs()$alias("abs_SL"),
  (pl$col("Sepal.Length") + 2)$alias("add_2_SL")
)

# same query
l_expr = list(
  pl$col("Sepal.Length")$abs()$alias("abs_SL"),
  (pl$col("Sepal.Length") + 2)$alias("add_2_SL")
)
pl$DataFrame(iris)$with_columns_seq(l_expr)

pl$DataFrame(iris)$with_columns_seq(
  pl$col("Sepal.Length")$abs(), # not named expr will keep name "Sepal.Length"
  SW_add_2 = (pl$col("Sepal.Width") + 2)
)
Add a new column at index 0 that counts the rows
DataFrame_with_row_index(name, offset = NULL)
name |
string name of the created column |
offset |
positive integer offset for the start of the counter |
A new DataFrame object with a counter column in front
df = pl$DataFrame(mtcars)

# by default, the index starts at 0 (to mimic the behavior of Python Polars)
df$with_row_index("idx")

# but in R, we use a 1-index
df$with_row_index("idx", offset = 1)
Write to comma-separated values (CSV) file
DataFrame_write_csv( file, ..., include_bom = FALSE, include_header = TRUE, separator = ",", line_terminator = "\n", quote_char = "\"", batch_size = 1024, datetime_format = NULL, date_format = NULL, time_format = NULL, float_precision = NULL, null_values = "", quote_style = "necessary" )
file |
File path to which the result should be written. |
... |
Ignored. |
include_bom |
Whether to include UTF-8 BOM (byte order mark) in the CSV output. |
include_header |
Whether to include header in the CSV output. |
separator |
Separate CSV fields with this symbol. |
line_terminator |
String used to end each row. |
quote_char |
Byte to use as quoting character. |
batch_size |
Number of rows that will be processed per thread. |
datetime_format |
A format string, with the specifiers defined by the chrono Rust crate. If no format specified, the default fractional-second precision is inferred from the maximum timeunit found in the frame’s Datetime cols (if any). |
date_format |
A format string, with the specifiers defined by the chrono Rust crate. |
time_format |
A format string, with the specifiers defined by the chrono Rust crate. |
float_precision |
Number of decimal places to write, applied to both Float32 and Float64 datatypes. |
null_values |
A string representing null values (defaulting to the empty string). |
quote_style |
Determines the quoting strategy used.
|
Invisibly returns the input DataFrame.
dat = pl$DataFrame(mtcars)

destination = tempfile(fileext = ".csv")
dat$select(pl$col("drat", "mpg"))$write_csv(destination)

pl$read_csv(destination)
Write to Arrow IPC file (a.k.a. Feather file)
DataFrame_write_ipc( file, compression = c("uncompressed", "zstd", "lz4"), ..., compat_level = TRUE )
file |
File path to which the result should be written. |
compression |
|
... |
Ignored. |
compat_level |
Use a specific compatibility level when exporting Polars’ internal data structures. This can be:
|
Invisibly returns the input DataFrame.
dat = pl$DataFrame(mtcars)

destination = tempfile(fileext = ".arrow")
dat$write_ipc(destination)

if (require("arrow", quietly = TRUE)) {
  arrow::read_ipc_file(destination, as_data_frame = FALSE)
}
Write to JSON file
DataFrame_write_json(file, ..., pretty = FALSE, row_oriented = FALSE)
file |
File path to which the result should be written. |
... |
Ignored. |
pretty |
Pretty serialize JSON. |
row_oriented |
Write to row-oriented JSON. This is slower, but more common. |
Invisibly returns the input DataFrame.
if (require("jsonlite", quiet = TRUE)) {
  dat = pl$DataFrame(head(mtcars))
  destination = tempfile()
  dat$select(pl$col("drat", "mpg"))$write_json(destination)
  jsonlite::fromJSON(destination)
  dat$select(pl$col("drat", "mpg"))$write_json(destination, row_oriented = TRUE)
  jsonlite::fromJSON(destination)
}
Write to NDJSON file
DataFrame_write_ndjson(file)
file |
File path to which the result should be written. |
Invisibly returns the input DataFrame.
dat = pl$DataFrame(head(mtcars))
destination = tempfile()
dat$select(pl$col("drat", "mpg"))$write_ndjson(destination)
pl$read_ndjson(destination)
Write to parquet file
DataFrame_write_parquet( file, ..., compression = "zstd", compression_level = 3, statistics = TRUE, row_group_size = NULL, data_page_size = NULL, partition_by = NULL, partition_chunk_size_bytes = 4294967296 )
file |
File path to which the result should be written. This should be a path to a directory if writing a partitioned dataset. |
... |
Ignored. |
compression |
String. The compression method. One of:
|
compression_level |
|
statistics |
Whether statistics should be written to the Parquet headers. Possible values:
|
row_group_size |
|
data_page_size |
Size of the data page in bytes. If |
partition_by |
Column(s) to partition by. A partitioned dataset will be written if this is specified. |
partition_chunk_size_bytes |
Approximate size to split DataFrames within a single partition when writing. Note this is calculated using the size of the DataFrame in memory - the size of the output file may differ depending on the file format / compression. |
Invisibly returns the input DataFrame.
dat = pl$DataFrame(mtcars)
# write data to a single parquet file
destination = withr::local_tempfile(fileext = ".parquet")
dat$write_parquet(destination)
# write data to folder with a hive-partitioned structure
dest_folder = withr::local_tempdir()
dat$write_parquet(dest_folder, partition_by = c("gear", "cyl"))
list.files(dest_folder, recursive = TRUE)
The Array and List datatypes are very similar. The only difference is that sub-arrays all have the same length while sublists can have different lengths. Array methods can be accessed via the $arr subnamespace.
DataType_Array(datatype = "unknown", width)
datatype |
An inner DataType. The default is |
width |
The length of the arrays. |
An array DataType with an inner DataType
# basic Array
pl$Array(pl$Int32, 4)
# some nested Array
pl$Array(pl$Array(pl$Boolean, 3), 2)
Create Categorical DataType
DataType_Categorical(ordering = "physical")
ordering |
Either |
When a categorical variable is created, its string values (or "lexical" values) are stored and encoded as integers ("physical" values) by order of appearance. Therefore, sorting a categorical value can be done either on the lexical or on the physical values. See Examples.
A Categorical DataType
# default is to order by physical values
df = pl$DataFrame(x = c("z", "z", "k", "a", "z"), schema = list(x = pl$Categorical()))
df$sort("x")
# when setting ordering = "lexical", sorting will be based on the strings
df_lex = pl$DataFrame(
  x = c("z", "z", "k", "a", "z"),
  schema = list(x = pl$Categorical("lexical"))
)
df_lex$sort("x")
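The physical/lexical distinction described for Categorical can be mimicked in base R with a factor whose levels follow order of appearance. This is only an analogy, assuming an R factor as a stand-in; it does not show how polars stores categoricals internally, and all variable names are illustrative.

```r
x = c("z", "z", "k", "a", "z")
# Levels in order of appearance, so the integer codes play the role of
# the "physical" values (R codes start at 1)
f = factor(x, levels = unique(x))
physical = as.integer(f)
# Sorting on the codes keeps the order of first appearance
by_physical = x[order(physical)]
# Sorting on the strings themselves gives the "lexical" order
by_lexical = sort(x)
```

Here `by_physical` starts with the value seen first ("z"), while `by_lexical` starts with "a".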
Check whether the data type contains categoricals
DataType_contains_categoricals()
A logical value
pl$List(pl$Categorical())$contains_categoricals()
pl$List(pl$Enum(c("a", "b")))$contains_categoricals()
pl$List(pl$Float32)$contains_categoricals()
pl$List(pl$List(pl$Categorical()))$contains_categoricals()
Check whether the data type contains views
DataType_contains_views()
A logical value
pl$List(pl$String)$contains_views()
pl$List(pl$Binary)$contains_views()
pl$List(pl$Float32)$contains_views()
pl$List(pl$List(pl$Binary))$contains_views()
The underlying representation of this type is a 64-bit signed integer. The integer indicates the number of time units since the Unix epoch (1970-01-01 00:00:00). The number can be negative to indicate datetimes before the epoch.
DataType_Datetime(time_unit = "us", time_zone = NULL)
time_unit |
Unit of time. One of |
time_zone |
Time zone string, as defined in |
Datetime DataType
pl$Datetime("ns", "Pacific/Samoa")
df = pl$DataFrame(
  naive_time = as.POSIXct("1900-01-01"),
  zoned_time = as.POSIXct("1900-01-01", "UTC")
)
df
df$select(pl$col(pl$Datetime("us", "*")))
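The "number of time units since the Unix epoch" representation can be illustrated with base R alone. The sketch below assumes the default "us" (microsecond) time unit and uses POSIXct, which counts seconds since the epoch; the variable names are illustrative and no polars call is involved.

```r
# One day after and one day before the epoch, in UTC
t_after = as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
t_before = as.POSIXct("1969-12-31 00:00:00", tz = "UTC")
# POSIXct holds seconds since 1970-01-01 00:00:00; scale to microseconds
us_after = as.numeric(t_after) * 1e6
us_before = as.numeric(t_before) * 1e6
```

`us_before` is negative, matching the note that datetimes before the epoch are stored as negative numbers.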
Data type representing a time duration
DataType_Duration(time_unit = "us")
time_unit |
Unit of time. One of |
Duration DataType
test = pl$DataFrame(
  a = 1:2,
  b = c("a", "b"),
  c = pl$duration(weeks = c(1, 2), days = c(0, 2))
)
# select all columns of type "duration"
test$select(pl$col(pl$Duration()))
An Enum is a fixed set categorical encoding of a set of strings. It is similar to the Categorical data type, but the categories are explicitly provided by the user and cannot be modified.
DataType_Enum(categories)
categories |
A character vector specifying the categories of the variable. |
This functionality is unstable. It is a work-in-progress feature and may not always work as expected. It may be changed at any point without it being considered a breaking change.
An Enum DataType
pl$DataFrame(
  x = c("Polar", "Panda", "Brown", "Brown", "Polar"),
  schema = list(x = pl$Enum(c("Polar", "Panda", "Brown")))
)
# All values of the variable have to be in the categories
dtype = pl$Enum(c("Polar", "Panda", "Brown"))
tryCatch(
  pl$DataFrame(
    x = c("Polar", "Panda", "Brown", "Brown", "Polar", "Black"),
    schema = list(x = dtype)
  ),
  error = function(e) e
)
# Comparing two Enum is only valid if they have the same categories
df = pl$DataFrame(
  x = c("Polar", "Panda", "Brown", "Brown", "Polar"),
  y = c("Polar", "Polar", "Polar", "Brown", "Brown"),
  z = c("Polar", "Polar", "Polar", "Brown", "Brown"),
  schema = list(
    x = pl$Enum(c("Polar", "Panda", "Brown")),
    y = pl$Enum(c("Polar", "Panda", "Brown")),
    z = pl$Enum(c("Polar", "Black", "Brown"))
  )
)
# Same categories
df$with_columns(x_eq_y = pl$col("x") == pl$col("y"))
# Different categories
tryCatch(
  df$with_columns(x_eq_z = pl$col("x") == pl$col("z")),
  error = function(e) e
)
Check whether the data type is an array type
DataType_is_array()
A logical value
pl$Array(width = 2)$is_array()
pl$Float32$is_array()
Check whether the data type is a binary type
DataType_is_binary()
A logical value
pl$Binary$is_binary()
pl$Float32$is_binary()
Check whether the data type is a boolean type
DataType_is_bool()
A logical value
pl$Boolean$is_bool()
pl$Float32$is_bool()
Check whether the data type is a Categorical type
DataType_is_categorical()
A logical value
pl$Categorical()$is_categorical()
pl$Enum(c("a", "b"))$is_categorical()
Check whether the data type is an Enum type
DataType_is_enum()
A logical value
pl$Enum(c("a", "b"))$is_enum()
pl$Categorical()$is_enum()
Check whether the data type is a float type
DataType_is_float()
A logical value
pl$Float32$is_float()
pl$Int32$is_float()
Check whether the data type is an integer type
DataType_is_integer()
A logical value
pl$Int32$is_integer()
pl$Float32$is_integer()
Check whether the data type is known
DataType_is_known()
A logical value
pl$String$is_known()
pl$Unknown$is_known()
Check whether the data type is a list type
DataType_is_list()
A logical value
pl$List()$is_list()
pl$Float32$is_list()
Check whether the data type is a logical type
DataType_is_logical()
A logical value
Check whether the data type is a nested type
DataType_is_nested()
A logical value
pl$List()$is_nested()
pl$Array(width = 2)$is_nested()
pl$Float32$is_nested()
Check whether the data type is a null type
DataType_is_null()
A logical value
pl$Null$is_null()
pl$Float32$is_null()
Check whether the data type is a numeric type
DataType_is_numeric()
A logical value
pl$Float32$is_numeric()
pl$Int32$is_numeric()
pl$String$is_numeric()
Check whether the data type is an ordinal type
DataType_is_ord()
A logical value
pl$String$is_ord()
pl$Categorical()$is_ord()
Check whether the data type is a primitive type
DataType_is_primitive()
A logical value
pl$Float32$is_primitive()
pl$List()$is_primitive()
Check whether the data type is a signed integer type
DataType_is_signed_integer()
A logical value
pl$Int32$is_signed_integer()
pl$UInt32$is_signed_integer()
Check whether the data type is a String type
DataType_is_string()
A logical value
pl$String$is_string()
pl$Float32$is_string()
Check whether the data type is a struct type
DataType_is_struct()
A logical value
pl$Struct()$is_struct()
pl$Float32$is_struct()
Check whether the data type is a temporal type
DataType_is_temporal()
A logical value
pl$Date$is_temporal()
pl$Float32$is_temporal()
Check whether the data type is an unsigned integer type
DataType_is_unsigned_integer()
A logical value
pl$UInt32$is_unsigned_integer()
pl$Int32$is_unsigned_integer()
Create List DataType
DataType_List(datatype = "unknown")
datatype |
The inner DataType. |
A list DataType with an inner DataType
# some nested List
pl$List(pl$List(pl$Boolean))
# check if some maybe_list is a List DataType
maybe_List = pl$List(pl$UInt64)
pl$same_outer_dt(maybe_List, pl$List())
One can create a Struct data type with pl$Struct(). There are also multiple ways to create columns of data type Struct in a DataFrame or a Series, see the examples.
DataType_Struct(...)
... |
Either named inputs of the form |
A Struct DataType containing a list of Fields
# create a Struct-DataType
pl$Struct(foo = pl$Int32, pl$Field("bar", pl$Boolean))
# check if an element is any kind of Struct()
test = pl$Struct(a = pl$UInt64)
pl$same_outer_dt(test, pl$Struct())
# `test` is a type of Struct, but it doesn't mean it is equal to an empty Struct
test == pl$Struct()
# The way to create a `Series` of type `Struct` is a bit convoluted as it involves
# `data.frame()`, `list()`, and `I()`:
as_polars_series(
  data.frame(a = 1:2, b = I(list(c("x", "y"), "z")))
)
# A slightly simpler way would be via `tibble::tibble()` or
# `data.table::data.table()`:
if (requireNamespace("tibble", quietly = TRUE)) {
  as_polars_series(
    tibble::tibble(a = 1:2, b = list(c("x", "y"), "z"))
  )
}
# Finally, one can use `pl$struct()` to convert existing columns or `Series`
# to a `Struct`:
x = pl$DataFrame(
  a = 1:2,
  b = list(c("x", "y"), "z")
)
out = x$select(pl$struct(c("a", "b")))
out
out$schema
Get the dimensions
## S3 method for class 'RPolarsDataFrame'
dim(x)
## S3 method for class 'RPolarsLazyFrame'
dim(x)
x |
Get the row and column names
## S3 method for class 'RPolarsDataFrame'
dimnames(x)
## S3 method for class 'RPolarsLazyFrame'
dimnames(x)
x |
Comments on how the R and Python worlds translate into polars:
R and Python are both high-level glue languages that are great for data science. Rust is a pedantic low-level language with use cases similar to those of C and C++. Polars is written in ~100k lines of Rust and has a Rust API. py-polars, the Python API for polars, is implemented as an interface to the Rust API. r-polars closely parallels py-polars, except that it interfaces with R. The performance and behavior are quite similar, as the 'engine' is the exact same Rust code and data structures.
info
Not applicable
R only has a native Int32 type; there are no UInt32, Int64, UInt64, ... types. These days Int32 is getting a bit small, as it cannot refer to more than ~2^31-1 rows. There are packages that provide int64, but the most common hack is to just use floats as 'integerish' values. There is a unique float64 value for every integer up to 2^53, which is plenty for all practical concerns. Some polars methods may accept or return floats even though an integer would ideally be more accurate. Most R functions intermix Int32 (integer) and Float64 (double) seamlessly.
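The limits mentioned above can be checked in base R. This is a small sketch with illustrative variable names, relying only on `.Machine` and ordinary double arithmetic; no polars call is involved.

```r
# R's native integer is 32-bit: the largest value is 2^31 - 1
int_max = .Machine$integer.max
# Doubles represent integers exactly up to 2^53 ...
exact = (2^53 - 1) + 1 == 2^53
# ... but just beyond that, neighbouring integers collapse onto one double
collapsed = 2^53 + 1 == 2^53
# Adding a double to the largest integer switches to double arithmetic,
# so a count beyond the 32-bit range can still be held as 'integerish'
big_count = int_max + 1
is_double = is.double(big_count)
```

Both `exact` and `collapsed` are TRUE, which is why doubles work as row counts far beyond the 32-bit range but not indefinitely.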
R has allocated a value in every vector type to signal missingness; these are collectively called NAs. Polars uses a bool bitmask to signal NA-like missing values; such a value is called Null (Nulls in plural). It is not to be confused with R NULL (see the paragraph below). Polars supports missingness for any possible type, as it is kept separately in the bitmask. In Python lists the symbol None can carry a similar meaning. R NA ~ polars Null ~ py-polars [None] (in a py list)
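On the R side, the per-type NA values and a mask derived from them look as follows. This sketch uses base R only; treating `is.na()` as a counterpart of the polars bitmask is an analogy, not a description of how either system stores the data.

```r
# Each atomic vector type reserves its own NA value
ints = c(1L, NA, 3L)
chars = c("a", NA)
# is.na() yields one logical per element, much like a missingness mask
mask_ints = is.na(ints)
mask_chars = is.na(chars)
# NA propagates through computations unless explicitly dropped
total = sum(ints)
total_no_na = sum(ints, na.rm = TRUE)
```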
From writing a lot of tests for all implementations, it appears polars does not have a fully consistent nor well-documented behavior when it comes to comparisons and sorting of floats. Some general rules of thumb do apply, though:
Polars has chosen to define in sorting that Null is a value lower than -Inf, as in Expr.arg_min(). However, except when Null is ignored, as in Expr.min(), there is an Expr.nan_min() but no Expr.nan_min().
NaN is sometimes a value higher than Inf and sometimes regarded as a Null.
Polars conventions: NaN > Inf > 99 > -99 > -Inf > Null
Null == Null often yields false, sometimes true, sometimes Null.
The documentation and examples do not reveal these variations. The best thing to do, when in doubt, is to do a test sort on a small Series/column of all values.
R NaN ~ polars NaN ~ python [float("NaN")] (only floats have NaNs)
R Inf ~ polars inf ~ python [float("inf")] (only floats have Inf)
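A test sort of the kind suggested above could look like the sketch below. It assumes the polars package is installed (the polars part is skipped otherwise) and uses only `pl$DataFrame()` and `$sort()`, both of which appear in other examples of this manual; no expected polars output is given, since checking it is the point of the exercise.

```r
vals = c(NaN, Inf, 99, -99, -Inf, NA)
# Base R for comparison: sort() drops NA and NaN unless na.last is set
base_sorted = sort(vals, na.last = TRUE)
if (requireNamespace("polars", quietly = TRUE)) {
  library(polars)
  # Inspect where polars places Null (from NA) and NaN after sorting
  print(pl$DataFrame(x = vals)$sort("x"))
}
```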
The R NULL does not exist inside polars frames, series and so on. It resembles Option::None in the hidden Rust code and None in Python. In all three languages, NULL/None/None are used in this context as a function argument to signal default behavior or perhaps a deactivated feature. R NULL does NOT translate into the polars bitmask Null; that is NA. R NULL ~ rust-polars Option::None ~ pypolars None (typically used for function arguments)
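The difference between NULL and NA is easy to see in base R. In the sketch below, `head_rows()` is a hypothetical helper written only to show NULL used as a "feature off" function argument; it is not part of polars.

```r
# NULL is the absence of an object: length 0, and it vanishes inside c()
len_null = length(NULL)
dropped = c(1, NULL, 2)
# NA is a missing value that still occupies a slot in the vector
len_na = length(NA)
kept = c(1, NA, 2)
# NULL as an argument: the row limit is deactivated unless n is given
head_rows = function(x, n = NULL) {
  if (is.null(n)) x else x[seq_len(min(n, length(x)))]
}
all_rows = head_rows(1:5)
two_rows = head_rows(1:5, n = 2)
```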
The following translations are relevant when loading data into polars. The R list appears similar to a Python dictionary (hashmap), but its implementation is more similar to the Python list (array of pointers). R lists do support naming elements via a string vector. In polars, both lists (of vectors or Series) and data.frames can be used to construct a polars DataFrame, just as dictionaries would be used in Python. In terms of loading data in and out, the following translation holds: R data.frame/list ~ polars DataFrame ~ python dictionary
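A short sketch of both construction routes follows. The polars part assumes the package is installed and is skipped otherwise; it uses `pl$DataFrame()` on a named list and on a data.frame, and `as.data.frame()` for the way back, as in other examples of this manual. The variable names are illustrative.

```r
cols = list(a = 1:3, b = c("x", "y", "z"))
# In base R, a data.frame is itself a named list of equal-length columns
as_df = as.data.frame(cols)
same_names = identical(names(as_df), names(cols))
if (requireNamespace("polars", quietly = TRUE)) {
  library(polars)
  from_list = pl$DataFrame(cols)   # named list of vectors
  from_df = pl$DataFrame(as_df)    # data.frame
  print(from_list)
  # and back out to R again
  print(as.data.frame(from_df))
}
```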
The R vector (Integer, Double, Character, ...) resembles the Series, as both exist outside of any frame and can be of any length. The implementation is quite different. E.g. appending to an R vector in a for-loop is considered quite bad for performance: the vector will be fully rewritten in memory for every append. The polars Series has chunked memory allocation, which allows only the appended data to be written. However, fragmented memory is not great for fast computations, and polars objects have a rechunk() method to reallocate the chunks into one. Rechunk might be called implicitly by polars. In the context of constructing Series and extracting data, the following translation holds: R vector ~ polars Series/column ~ python list
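The append pattern discussed above, next to the usual base R remedy, can be sketched as follows. This is base R only; the loop size is arbitrary and the sketch shows the two coding patterns rather than measuring anything.

```r
n = 1000L
# Growing a vector in a loop: each c() call builds a new, longer vector
grown = integer(0)
for (i in seq_len(n)) {
  grown = c(grown, i)
}
# Preallocating once and filling in place avoids the repeated rewriting
prealloc = integer(n)
for (i in seq_len(n)) {
  prealloc[i] = i
}
same_result = identical(grown, prealloc)
```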
The polars Expr does not have any base R counterpart. Expr are analogous to how ggplot splits plotting instructions from the rendering. Base R plot immediately executes any instruction by adding e.g. pixels to a .png canvas. ggplot collects instructions, and when they are finally executed the rendering can be performed with optimization across all instructions. By the way, the ggplot command syntax is a monoid, meaning the order does not matter; that is not the case for polars Expr. Polars Exprs can be understood as a DSL (domain-specific language) that expresses syntax trees of instructions. R expressions evaluate to syntax trees also, but it is difficult to optimize the execution order automatically without rewriting the code. A great selling point of polars is that any query will be optimized. Expr are very lightweight symbols chained together.
Aggregate a DataFrame over a time or integer window created with $group_by_dynamic().
DynamicGroupBy_agg(...)
... |
Exprs to aggregate over. Those can also be passed wrapped in a
list, e.g |
An aggregated DataFrame
df = pl$DataFrame(
  time = pl$datetime_range(
    start = strptime("2021-12-16 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    end = strptime("2021-12-16 03:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    interval = "30m"
  ),
  n = 0:6
)
# get the sum in the following hour relative to the "time" column
df$group_by_dynamic("time", every = "1h")$agg(
  vals = pl$col("n"),
  sum = pl$col("n")$sum()
)
# using "include_boundaries = TRUE" is helpful to see the period considered
df$group_by_dynamic("time", every = "1h", include_boundaries = TRUE)$agg(
  vals = pl$col("n")
)
# in the example above, the values didn't include the one *exactly* 1h after
# the start because "closed = 'left'" by default.
# Changing it to "right" includes values that are exactly 1h after. Note that
# the value at 00:00:00 now becomes included in the interval [23:00:00 - 00:00:00],
# even if this interval wasn't there originally
df$group_by_dynamic("time", every = "1h", closed = "right")$agg(
  vals = pl$col("n")
)
# To keep both boundaries, we use "closed = 'both'". Some values now belong to
# several groups:
df$group_by_dynamic("time", every = "1h", closed = "both")$agg(
  vals = pl$col("n")
)
# Dynamic group bys can also be combined with grouping on normal keys
df = df$with_columns(
  groups = as_polars_series(c("a", "a", "a", "b", "b", "a", "a"))
)
df
df$group_by_dynamic(
  "time",
  every = "1h",
  closed = "both",
  group_by = "groups",
  include_boundaries = TRUE
)$agg(pl$col("n"))
# We can also create a dynamic group by based on an index column
df = pl$LazyFrame(
  idx = 0:5,
  A = c("A", "A", "B", "B", "B", "C")
)$with_columns(pl$col("idx")$set_sorted())
df
df$group_by_dynamic(
  "idx",
  every = "2i",
  period = "3i",
  include_boundaries = TRUE,
  closed = "right"
)$agg(A_agg_list = pl$col("A"))
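The left-closed hourly windows of the first example can be emulated in base R to see which rows fall into which window. This sketch re-creates the same seven half-hourly observations as plain minute offsets; it is an illustration of the windowing idea under that simplification, not of the polars implementation.

```r
# Seven observations, one every 30 minutes, as minute offsets from the start
minutes = seq(0, 180, by = 30)
n = 0:6
# Left-closed windows [start, start + 60): integer-divide by the width
window_start = (minutes %/% 60) * 60
sums = tapply(n, window_start, sum)
```

The last window (starting at minute 180) contains only the final observation, because the value exactly one hour after a window's start belongs to the next window.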
This class comes from <DataFrame>$group_by_dynamic().
df = pl$DataFrame(
  time = pl$datetime_range(
    start = strptime("2021-12-16 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    end = strptime("2021-12-16 03:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    interval = "30m"
  ),
  n = 0:6
)
# get the sum in the following hour relative to the "time" column
df$group_by_dynamic("time", every = "1h")
Revert the $group_by_dynamic() operation. Doing <DataFrame>$group_by_dynamic(...)$ungroup() returns the original DataFrame.
DynamicGroupBy_ungroup()
df = pl$DataFrame(
  time = pl$datetime_range(
    start = strptime("2021-12-16 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    end = strptime("2021-12-16 03:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    interval = "30m"
  ),
  n = 0:6
)
df
df$group_by_dynamic("time", every = "1h")$ungroup()
Compute the absolute values
Expr_abs()
Expr
pl$DataFrame(a = -1:1)$
  with_columns(abs = pl$col("a")$abs())
Method equivalent of addition operator expr + other.
Expr_add(other)
other |
numeric or string value; accepts expression input. |
df = pl$DataFrame(x = 1:5)
df$with_columns(
  `x+int` = pl$col("x")$add(2L),
  `x+expr` = pl$col("x")$add(pl$col("x")$cum_prod())
)
df = pl$DataFrame(
  x = c("a", "d", "g"),
  y = c("b", "e", "h"),
  z = c("c", "f", "i")
)
df$with_columns(
  pl$col("x")$add(pl$col("y"))$add(pl$col("z"))$alias("xyz")
)
Get the group indexes of the group by operation. Should be used in aggregation context only.
Expr_agg_groups()
Expr
df = pl$DataFrame(list(
  group = c("one", "one", "one", "two", "two", "two"),
  value = c(94, 95, 96, 97, 97, 99)
))
df$group_by("group", maintain_order = TRUE)$agg(pl$col("value")$agg_groups())
Rename the output of an expression.
Expr_alias(name)
name |
New name of output |
Expr
pl$col("bob")$alias("alice")
Check if all values in a Boolean column are TRUE. This method is an expression - not to be confused with pl$all() which is a function to select all columns.
Expr_all(..., ignore_nulls = TRUE)
... |
Ignored. |
ignore_nulls |
If |
A logical value
df = pl$DataFrame(
  a = c(TRUE, TRUE),
  b = c(TRUE, FALSE),
  c = c(NA, TRUE),
  d = c(NA, NA)
)
# By default, ignore null values. If there are only nulls, then all() returns
# TRUE.
df$select(pl$col("*")$all())
# If we set ignore_nulls = FALSE, then we don't know if all values in column
# "c" are TRUE, so it returns null
df$select(pl$col("*")$all(ignore_nulls = FALSE))
Combine two boolean expressions with AND.
Expr_and(other)
other |
numeric or string value; accepts expression input. |
pl$lit(TRUE) & TRUE
pl$lit(TRUE)$and(pl$lit(TRUE))
Check if any boolean value in a Boolean column is TRUE.
Expr_any(..., ignore_nulls = TRUE)
... |
Ignored. |
ignore_nulls |
If |
A logical value
df = pl$DataFrame(
  a = c(TRUE, FALSE),
  b = c(FALSE, FALSE),
  c = c(NA, FALSE)
)
df$select(pl$col("*")$any())
# If we set ignore_nulls = FALSE, then we don't know if any values in column
# "c" is TRUE, so it returns null
df$select(pl$col("*")$any(ignore_nulls = FALSE))
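The effect of ignore_nulls on $all() and $any() has a close base R counterpart in the na.rm argument of all() and any(). The sketch below is base R only and is offered as an analogy, on the assumption that R's NA handling follows the same three-valued logic that the comments in the examples describe.

```r
with_null = c(NA, TRUE)
# Dropping missing values first corresponds to ignore_nulls = TRUE
all_ignore = all(with_null, na.rm = TRUE)
# Keeping them gives an unknown result, like ignore_nulls = FALSE
all_keep = all(with_null)
# A known FALSE decides all() regardless of missing values
all_false = all(c(NA, FALSE))
# any(): no TRUE seen and one value unknown, so the answer is unknown
any_keep = any(c(NA, FALSE))
any_ignore = any(c(NA, FALSE), na.rm = TRUE)
# With only missing values, dropping them leaves an empty input
all_empty = all(c(NA, NA), na.rm = TRUE)
```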
This is done by adding the chunks of other to this output.
Expr_append(other, upcast = TRUE)
other |
Expr or something coercible to an Expr. |
upcast |
Cast both Expr to a common supertype if they have one. |
Expr
# append the bottom row to the top row
df = pl$DataFrame(list(a = 1:3, b = c(NA_real_, 4, 5)))
df$select(pl$all()$head(1)$append(pl$all()$tail(1)))
# implicit upcast, when upcast = TRUE (the default)
pl$DataFrame(list())$select(pl$lit(42)$append(42L))
pl$DataFrame(list())$select(pl$lit(42)$append(FALSE))
pl$DataFrame(list())$select(pl$lit("Bob")$append(FALSE))
This is done using the HyperLogLog++ algorithm for cardinality estimation.
Expr_approx_n_unique()
Expr
pl$DataFrame(iris[, 4:5])$
  with_columns(count = pl$col("Species")$approx_n_unique())
Compute inverse cosine
Expr_arccos()
Expr
pl$DataFrame(a = c(-1, cos(0.5), 0, 1, NA_real_))$ with_columns(arccos = pl$col("a")$arccos())
Compute inverse hyperbolic cosine
Expr_arccosh()
Expr
pl$DataFrame(a = c(-1, cosh(0.5), 0, 1, NA_real_))$ with_columns(arccosh = pl$col("a")$arccosh())
Compute inverse sine
Expr_arcsin()
Expr
pl$DataFrame(a = c(-1, sin(0.5), 0, 1, NA_real_))$ with_columns(arcsin = pl$col("a")$arcsin())
Compute inverse hyperbolic sine
Expr_arcsinh()
Expr
pl$DataFrame(a = c(-1, sinh(0.5), 0, 1, NA_real_))$ with_columns(arcsinh = pl$col("a")$arcsinh())
Compute inverse tangent
Expr_arctan()
Expr
pl$DataFrame(a = c(-1, tan(0.5), 0, 1, NA_real_))$ with_columns(arctan = pl$col("a")$arctan())
Compute inverse hyperbolic tangent
Expr_arctanh()
Expr
pl$DataFrame(a = c(-1, tanh(0.5), 0, 1, NA_real_))$ with_columns(arctanh = pl$col("a")$arctanh())
Get the index of the maximal value.
Expr_arg_max()
Expr
pl$DataFrame( a = c(6, 1, 0, NA, Inf, NaN) )$with_columns(arg_max = pl$col("a")$arg_max())
Get the index of the minimal value.
Expr_arg_min()
Expr
pl$DataFrame( a = c(6, 1, 0, NA, Inf, NaN) )$with_columns(arg_min = pl$col("a")$arg_min())
Get the index values that would sort this column.
Expr_arg_sort(descending = FALSE, nulls_last = FALSE)
descending |
A logical. If |
nulls_last |
A logical. If |
Expr
pl$arg_sort_by() to find the row indices that would sort multiple columns.
pl$DataFrame( a = c(6, 1, 0, NA, Inf, NaN) )$with_columns(arg_sorted = pl$col("a")$arg_sort())
This finds the position of first occurrence of each unique value.
Expr_arg_unique()
Expr
pl$select(pl$lit(c(1:2, 1:3))$arg_unique())
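For readers more used to base R, the same "position of the first occurrence" idea can be sketched with duplicated(). The subtraction of 1 assumes that polars reports zero-based positions, as the rest of this reference suggests for index-like results; the variable names are illustrative.

```r
# Same input as the example above: 1 2 1 2 3
x <- c(1:2, 1:3)

# duplicated() flags every repeat, so its negation marks first occurrences.
first_occurrence <- !duplicated(x)

# which() is one-based in R; shift to zero-based positions (assumption).
first_positions <- which(first_occurrence) - 1L

print(first_positions)
```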
Fill missing values with the next to be seen values. Syntactic sugar for $fill_null(strategy = "backward").
Expr_backward_fill(limit = NULL)
limit |
Number of consecutive null values to fill when using the
|
Expr
pl$DataFrame(a = c(NA, 1, NA, 2, NA))$ with_columns( backward = pl$col("a")$backward_fill() )
Return the k smallest elements. This has time complexity:
Expr_bottom_k(k)
k |
Number of smallest values to get. |
Expr
pl$DataFrame(a = c(6, 1, 0, NA, Inf, NaN))$select(pl$col("a")$bottom_k(5))
Cast between DataType
Expr_cast(dtype, strict = TRUE)
dtype |
DataType to cast to. |
strict |
If |
Expr
df = pl$DataFrame(a = 1:3, b = c(1, 2, 3))
df$with_columns(
  pl$col("a")$cast(pl$dtypes$Float64),
  pl$col("b")$cast(pl$dtypes$Int32)
)
# strict FALSE, inserts null for any cast failure
pl$lit(c(100, 200, 300))$cast(pl$dtypes$UInt8, strict = FALSE)$to_series()
# strict TRUE, raise any failure as an error when query is executed.
tryCatch(
  {
    pl$lit("a")$cast(pl$dtypes$Float64, strict = TRUE)$to_series()
  },
  error = function(e) e
)
Rounds up to the nearest integer value. Only works on floating point Series.
Expr_ceil()
Expr
pl$DataFrame(a = c(0.33, 0.5, 1.02, 1.5, NaN, NA, Inf, -Inf))$with_columns( ceiling = pl$col("a")$ceil() )
Expressions are all the functions and methods that are applicable to a Polars DataFrame or LazyFrame object. Some methods are under the sub-namespaces.
$arr stores all array related methods.
$bin stores all binary related methods.
$cat stores all categorical related methods.
$dt stores all temporal related methods.
$list stores all list related methods.
$meta stores all methods for working with the meta data.
$name stores all name related methods.
$str stores all string related methods.
$struct stores all struct related methods.
df = pl$DataFrame(
  a = 1:2,
  b = list(1:2, 3:4),
  schema = list(a = pl$Int64, b = pl$Array(pl$Int64, 2))
)
df$select(pl$col("a")$first())
df$select(pl$col("b")$arr$sum())
Set values outside the given boundaries to the boundary value. This only works for numeric and temporal values.
Expr_clip(lower_bound = NULL, upper_bound = NULL)
lower_bound |
Lower bound. Accepts expression input. Strings are parsed as column names and other non-expression inputs are parsed as literals. |
upper_bound |
Upper bound. Accepts expression input. Strings are parsed as column names and other non-expression inputs are parsed as literals. |
Expr
df = pl$DataFrame(foo = c(-50L, 5L, NA_integer_, 50L), bound = c(1, 10, 1, 1))
# With the two bounds
df$with_columns(clipped = pl$col("foo")$clip(1, 10))
# Without lower bound
df$with_columns(clipped = pl$col("foo")$clip(upper_bound = 10))
# Using another column as lower bound
df$with_columns(clipped = pl$col("foo")$clip(lower_bound = "bound"))
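The clipping rule itself can be sketched in base R with pmax() and pmin(). The helper clip_sketch() below is hypothetical and only mirrors the documented behaviour on the example data; it assumes, as is usual for polars, that missing values stay missing.

```r
# Hypothetical base R helper: clamp x to [lower, upper]; NULL means "no bound".
clip_sketch <- function(x, lower = NULL, upper = NULL) {
  if (!is.null(lower)) x <- pmax(x, lower)  # raise values below the lower bound
  if (!is.null(upper)) x <- pmin(x, upper)  # cap values above the upper bound
  x
}

foo <- c(-50, 5, NA, 50)
bound <- c(1, 10, 1, 1)

both_bounds <- clip_sketch(foo, 1, 10)             # 1 5 NA 10
upper_only <- clip_sketch(foo, upper = 10)         # -50 5 NA 10
column_as_lower <- clip_sketch(foo, lower = bound) # 1 10 NA 50

print(both_bounds)
```

Passing a vector as a bound, as in the last call, corresponds to using another column as the bound in the polars example.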
Compute cosine
Expr_cos()
Expr
pl$DataFrame(a = c(0, pi / 2, pi, NA_real_))$ with_columns(cosine = pl$col("a")$cos())
Compute hyperbolic cosine
Expr_cosh()
Expr
pl$DataFrame(a = c(-1, acosh(2), 0, 1, NA_real_))$ with_columns(cosh = pl$col("a")$cosh())
Count the number of elements in this expression. Note that NULL values are also counted. $len() is an alias.
Expr_count()
Expr_len()
Expr
pl$DataFrame( all = c(TRUE, TRUE), any = c(TRUE, FALSE), none = c(FALSE, FALSE) )$select( pl$all()$count() )
Get an array with the cumulative count (zero-indexed) computed at every element.
Expr_cum_count(reverse = FALSE)
reverse |
If |
The Dtypes Int8, UInt8, Int16 and UInt16 are cast to Int64 before summing to prevent overflow issues.
$cum_count() does not seem to count within lists.
Expr
pl$DataFrame(a = 1:4)$with_columns( pl$col("a")$cum_count()$alias("cum_count"), pl$col("a")$cum_count(reverse = TRUE)$alias("cum_count_reversed") )
Get an array with the cumulative max computed at every element.
Expr_cum_max(reverse = FALSE)
reverse |
If |
The Dtypes Int8, UInt8, Int16 and UInt16 are cast to Int64 before summing to prevent overflow issues.
Expr
pl$DataFrame(a = c(1:4, 2L))$with_columns(
  pl$col("a")$cum_max()$alias("cum_max"),
  pl$col("a")$cum_max(reverse = TRUE)$alias("cum_max_reversed")
)
Get an array with the cumulative min computed at every element.
Expr_cum_min(reverse = FALSE)
reverse |
If |
The Dtypes Int8, UInt8, Int16 and UInt16 are cast to Int64 before summing to prevent overflow issues.
Expr
pl$DataFrame(a = c(1:4, 2L))$with_columns( pl$col("a")$cum_min()$alias("cum_min"), pl$col("a")$cum_min(reverse = TRUE)$alias("cum_min_reversed") )
Get an array with the cumulative product computed at every element.
Expr_cum_prod(reverse = FALSE)
reverse |
If |
The Dtypes Int8, UInt8, Int16 and UInt16 are cast to Int64 before summing to prevent overflow issues.
Expr
pl$DataFrame(a = 1:4)$with_columns( pl$col("a")$cum_prod()$alias("cum_prod"), pl$col("a")$cum_prod(reverse = TRUE)$alias("cum_prod_reversed") )
Get an array with the cumulative sum computed at every element.
Expr_cum_sum(reverse = FALSE)
reverse |
If |
The Dtypes Int8, UInt8, Int16 and UInt16 are cast to Int64 before summing to prevent overflow issues.
Expr
pl$DataFrame(a = 1:4)$with_columns( pl$col("a")$cum_sum()$alias("cum_sum"), pl$col("a")$cum_sum(reverse = TRUE)$alias("cum_sum_reversed") )
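The reverse argument shared by the cumulative methods can be pictured in base R as "reverse, accumulate, reverse back". The helper rev_cum() is hypothetical, and the sketch assumes reverse = TRUE accumulates from the last element towards the first.

```r
# Hypothetical helper: apply a cumulative function f from the end of x.
rev_cum <- function(x, f) rev(f(rev(x)))

a <- 1:4
cum_sum <- cumsum(a)                    # 1 3 6 10
cum_sum_reversed <- rev_cum(a, cumsum)  # 10 9 7 4

b <- c(1:4, 2L)
cum_max <- cummax(b)                    # 1 2 3 4 4
cum_max_reversed <- rev_cum(b, cummax)  # 4 4 4 4 2

print(cum_sum_reversed)
print(cum_max_reversed)
```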
Run an expression over a sliding window that increases by 1 slot every iteration.
Expr_cumulative_eval(expr, min_periods = 1L, parallel = FALSE)
expr |
Expression to evaluate. |
min_periods |
Number of valid (non-null) values there should be in the window before the expression is evaluated. |
parallel |
Run in parallel. Don't do this in a groupby or another operation that already has much parallelization. |
This can be really slow as it can have O(n^2) complexity. Don't use this for operations that visit all elements.
Expr
pl$lit(1:5)$cumulative_eval( pl$element()$first() - pl$element()$last()^2 )$to_r()
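The growing window can be made explicit with a base R loop. This sketch re-evaluates "first element minus last element squared" on the prefix x[1:i] for every i, which also shows where the quadratic cost comes from; it is an illustration of the semantics, not how polars evaluates the expression.

```r
x <- 1:5

# For each position i, evaluate the expression on the window x[1], ..., x[i].
cumulative_result <- vapply(seq_along(x), function(i) {
  window <- x[seq_len(i)]
  window[1] - window[length(window)]^2  # first() - last()^2
}, numeric(1))

print(cumulative_result)
```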
Bin continuous values into discrete categories
Expr_cut( breaks, ..., labels = NULL, left_closed = FALSE, include_breaks = FALSE )
breaks |
Unique cut points. |
... |
Ignored. |
labels |
Names of the categories. The number of labels must be equal to the number of cut points plus one. |
left_closed |
Set the intervals to be left-closed instead of right-closed. |
include_breaks |
Include a column with the right endpoint of the bin each
observation falls in. This will change the data type of the output from a
|
Expr of data type Categorical if include_breaks is FALSE and of data type Struct if include_breaks is TRUE.
df = pl$DataFrame(foo = c(-2, -1, 0, 1, 2))
df$with_columns(
  cut = pl$col("foo")$cut(c(-1, 1), labels = c("a", "b", "c"))
)
# Add both the category and the breakpoint
df$with_columns(
  cut = pl$col("foo")$cut(c(-1, 1), include_breaks = TRUE)
)$unnest("cut")
Calculate the n-th discrete difference.
Expr_diff(n = 1, null_behavior = c("ignore", "drop"))
n |
Number of slots to shift. |
null_behavior |
String, either |
Expr
pl$DataFrame(a = c(20L, 10L, 30L, 40L))$with_columns( diff_default = pl$col("a")$diff(), diff_2_ignore = pl$col("a")$diff(2, "ignore") )
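A base R sketch of the n-th discrete difference: each value minus the value n slots earlier. The hypothetical helper diff_sketch() pads the front with NA so the result keeps the input length, which is assumed here to correspond to null_behavior = "ignore".

```r
# Hypothetical helper: lagged difference padded with n leading NAs.
diff_sketch <- function(x, n = 1L) c(rep(NA, n), diff(x, lag = n))

a <- c(20L, 10L, 30L, 40L)
diff_default <- diff_sketch(a)       # NA -10 20 10
diff_lag_2 <- diff_sketch(a, 2L)     # NA NA 10 30

print(diff_default)
print(diff_lag_2)
```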
Method equivalent of float division operator expr / other.
Expr_div(other)
other |
Numeric literal or expression value. |
Zero-division behaviour follows IEEE-754:
0/0: Invalid operation - mathematically undefined, returns NaN.
n/0: On finite operands gives an exact infinite result, e.g.: ±infinity.
df = pl$DataFrame(
  x = -2:2,
  y = c(0.5, 0, 0, -4, -0.5)
)
df$with_columns(
  `x/2` = pl$col("x")$div(2),
  `x/y` = pl$col("x")$div(pl$col("y"))
)
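Base R doubles follow the same IEEE-754 rules, so the zero-division cases listed above can be checked without polars. The vectors below reuse the x and y values of the example; the variable name quotient is illustrative.

```r
x <- -2:2
y <- c(0.5, 0, 0, -4, -0.5)

# R promotes the integers to doubles before dividing.
quotient <- x / y

# -1 / 0 gives -Inf (n/0 case), 0 / 0 gives NaN (invalid operation).
print(quotient)
```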
Compute the dot/inner product between two Expressions.
Expr_dot(other)
other |
numeric or string value; accepts expression input. |
pl$DataFrame( a = 1:4, b = c(1, 2, 3, 4) )$with_columns( pl$col("a")$dot(pl$col("b"))$alias("a dot b"), pl$col("a")$dot(pl$col("a"))$alias("a dot a") )
Drop NaN
Expr_drop_nans()
Note that NaN values are not null values. Null values correspond to NA in R.
Expr
drop_nulls()
pl$DataFrame(list(x = c(1, 2, NaN, NA)))$select(pl$col("x")$drop_nans())
Drop missing values
Expr_drop_nulls()
Expr
drop_nans()
pl$DataFrame(list(x = c(1, 2, NaN, NA)))$select(pl$col("x")$drop_nulls())
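The NaN versus null distinction is easy to blur in base R, because is.na() is TRUE for both NA and NaN while is.nan() is TRUE only for NaN. The sketch below mimics the two drop methods on the example vector under that mapping (null corresponds to NA); it is a plain R illustration, not the polars implementation.

```r
x <- c(1, 2, NaN, NA)

# Like drop_nans(): remove NaN only, the NA (null) is kept.
without_nan <- x[!is.nan(x)]

# Like drop_nulls(): remove NA only, so NaN has to be let through explicitly.
without_null <- x[!is.na(x) | is.nan(x)]

print(without_nan)   # 1 2 NA
print(without_null)  # 1 2 NaN
```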
The entropy is measured with the formula -sum(pk * log(pk)) where pk are discrete probabilities.
Expr_entropy(base = base::exp(1), normalize = TRUE)
base |
Given exponential base, defaults to |
normalize |
Normalize |
Expr
pl$DataFrame(x = c(1, 2, 3, 2))$ with_columns(entropy = pl$col("x")$entropy(base = 2))
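The formula can be written out directly in base R. The hypothetical helper entropy_sketch() assumes that normalize = TRUE rescales the input so that it sums to 1 before the formula is applied; the argument description above is truncated, so treat that reading as an assumption.

```r
# Hypothetical helper implementing -sum(pk * log(pk)) in a given base.
entropy_sketch <- function(x, base = exp(1), normalize = TRUE) {
  pk <- if (normalize) x / sum(x) else x  # assumed meaning of normalize
  -sum(pk * log(pk, base = base))
}

# Same data as the example above, measured in bits (base 2).
entropy_bits <- entropy_sketch(c(1, 2, 3, 2), base = 2)

# Four equally likely outcomes carry exactly 2 bits.
entropy_uniform <- entropy_sketch(rep(1, 4), base = 2)

print(entropy_bits)
print(entropy_uniform)
```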
Method equivalent of equality operator expr == other.
Expr_eq(other)
other |
numeric or string value; accepts expression input. |
pl$lit(2) == 2
pl$lit(2) == pl$lit(2)
pl$lit(2)$eq(pl$lit(2))
Method equivalent of equality operator expr == other, without null propagation: null values are compared like any other value instead of producing null.
Expr_eq_missing(other)
other |
numeric or string value; accepts expression input. |
df = pl$DataFrame(x = c(NA, FALSE, TRUE), y = c(TRUE, TRUE, TRUE))
df$with_columns(
  eq = pl$col("x")$eq("y"),
  eq_missing = pl$col("x")$eq_missing("y")
)
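The difference between the two comparisons can be sketched in base R, where == propagates NA much like $eq() propagates null. The hypothetical helper eq_missing_sketch() assumes that two missing values compare as equal and that a missing value never equals a non-missing one.

```r
# Hypothetical helper: equality where NA is treated as a regular value.
eq_missing_sketch <- function(x, y) {
  ifelse(is.na(x) | is.na(y), is.na(x) & is.na(y), x == y)
}

x <- c(NA, FALSE, TRUE)
y <- c(TRUE, TRUE, TRUE)

eq <- x == y                            # NA FALSE TRUE
eq_missing <- eq_missing_sketch(x, y)   # FALSE FALSE TRUE
both_missing <- eq_missing_sketch(NA, NA)

print(eq)
print(eq_missing)
```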
Exponentially-weighted moving average
Expr_ewm_mean( com = NULL, span = NULL, half_life = NULL, alpha = NULL, adjust = TRUE, min_periods = 1L, ignore_nulls = TRUE )
com |
Specify decay in terms of center of mass, |
span |
Specify decay in terms of span, |
half_life |
Specify decay in terms of half-life, |
alpha |
Specify smoothing factor alpha directly, |
adjust |
Divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings:
|
min_periods |
Minimum number of observations in window required to have a value (otherwise result is null). |
ignore_nulls |
Ignore missing values when calculating weights:
|
Expr
pl$DataFrame(a = 1:3)$ with_columns(ewm_mean = pl$col("a")$ewm_mean(com = 1))
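As a rough guide to what the example computes, here is a base R sketch of an adjusted exponentially-weighted mean. It assumes the conventional definitions (alpha = 1 / (1 + com), and with adjust = TRUE each output is a weighted average with weights (1 - alpha)^i going back in time); the argument descriptions above are truncated, so these formulas are assumptions rather than quotations. The helper ewm_mean_sketch() is hypothetical and ignores missing values and min_periods.

```r
# Hypothetical helper: adjusted EWM mean for a vector without missing values.
ewm_mean_sketch <- function(x, com) {
  alpha <- 1 / (1 + com)  # assumed conversion from center of mass
  vapply(seq_along(x), function(t) {
    # Weights decay with age: the newest observation gets weight 1.
    w <- (1 - alpha)^((t - 1):0)
    sum(w * x[seq_len(t)]) / sum(w)
  }, numeric(1))
}

ewm <- ewm_mean_sketch(1:3, com = 1)
print(ewm)
```

With com = 1 the weights halve at each step back, so the third value is (3 + 0.5 * 2 + 0.25 * 1) / 1.75.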
Exponentially-weighted moving standard deviation
Expr_ewm_std( com = NULL, span = NULL, half_life = NULL, alpha = NULL, adjust = TRUE, bias = FALSE, min_periods = 1L, ignore_nulls = TRUE )
com |
Specify decay in terms of center of mass, |
span |
Specify decay in terms of span, |
half_life |
Specify decay in terms of half-life, |
alpha |
Specify smoothing factor alpha directly, |
adjust |
Divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings:
|
bias |
If |
min_periods |
Minimum number of observations in window required to have a value (otherwise result is null). |
ignore_nulls |
Ignore missing values when calculating weights:
|
Expr
pl$DataFrame(a = 1:3)$ with_columns(ewm_std = pl$col("a")$ewm_std(com = 1))
Exponentially-weighted moving variance
Expr_ewm_var( com = NULL, span = NULL, half_life = NULL, alpha = NULL, adjust = TRUE, bias = FALSE, min_periods = 1L, ignore_nulls = TRUE )
com |
Specify decay in terms of center of mass, |
span |
Specify decay in terms of span, |
half_life |
Specify decay in terms of half-life, |
alpha |
Specify smoothing factor alpha directly, |
adjust |
Divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings:
|
bias |
If |
min_periods |
Minimum number of observations in window required to have a value (otherwise result is null). |
ignore_nulls |
Ignore missing values when calculating weights:
|
Expr
pl$DataFrame(a = 1:3)$ with_columns(ewm_var = pl$col("a")$ewm_var(com = 1))
Exclude certain columns from selection
Expr_exclude(columns)
columns |
Given param type:
|
Expr
# make DataFrame
df = pl$DataFrame(iris)
# by name(s)
df$select(pl$all()$exclude("Species"))
# by type
df$select(pl$all()$exclude(pl$Categorical()))
df$select(pl$all()$exclude(list(pl$Categorical(), pl$Float64)))
# by regex
df$select(pl$all()$exclude("^Sepal.*$"))
Compute the exponential of the elements
Expr_exp()
Expr
pl$DataFrame(a = -1:3)$with_columns(a_exp = pl$col("a")$exp())
This means that every item is expanded to a new row.
Expr_explode()
Categorical values are not supported.
Expr
df = pl$DataFrame(x = c("abc", "ab"), y = c(list(1:3), list(3:5)))
df
df$select(pl$col("y")$explode())
Extend the Series with given number of values.
Expr_extend_constant(value, n)
value |
The value to extend the Series with. This value may be |
n |
The number of values to extend. |
Expr
pl$select(pl$lit(1:4)$extend_constant(10.1, 2))
pl$select(pl$lit(1:4)$extend_constant(NULL, 2))
Fill floating point NaN value with a fill value
Expr_fill_nan(value = NULL)
value |
Value used to fill |
Expr
pl$DataFrame(a = c(NaN, 1, NaN, 2, NA))$
  with_columns(
    literal = pl$col("a")$fill_nan(999),
    # implicit coercion to string
    string = pl$col("a")$fill_nan("invalid")
  )
Fill null values with a value or strategy
Expr_fill_null(value = NULL, strategy = NULL, limit = NULL)
value |
Expr or something coercible to an Expr |
strategy |
Possible choices are |
limit |
Number of consecutive null values to fill when using the
|
Expr
pl$DataFrame(a = c(NA, 1, NA, 2, NA))$ with_columns( value = pl$col("a")$fill_null(999), backward = pl$col("a")$fill_null(strategy = "backward"), mean = pl$col("a")$fill_null(strategy = "mean") )
Mostly useful in an aggregation context. If you want to filter on a DataFrame level, use DataFrame$filter() (or LazyFrame$filter()).
Expr_filter(predicate)
predicate |
An Expr or something coercible to an Expr. Must return a boolean. |
Expr
df = pl$DataFrame(
  group_col = c("g1", "g1", "g2"),
  b = c(1, 2, 3)
)
df
df$group_by("group_col")$agg(
  lt = pl$col("b")$filter(pl$col("b") < 2),
  gte = pl$col("b")$filter(pl$col("b") >= 2)
)
Get the first value.
Expr_first()
Expr
pl$DataFrame(x = 3:1)$with_columns(first = pl$col("x")$first())
This is an alias for <Expr>$explode().
Expr_flatten()
Expr
df = pl$DataFrame(x = c("abc", "ab"), y = c(list(1:3), list(3:5)))
df
df$select(pl$col("y")$flatten())
Rounds down to the nearest integer value. Only works on floating point Series.
Expr_floor()
Expr
pl$DataFrame(a = c(0.33, 0.5, 1.02, 1.5, NaN, NA, Inf, -Inf))$with_columns( floor = pl$col("a")$floor() )
Method equivalent of floor division operator expr %/% other.
Expr_floor_div(other)
other |
Numeric literal or expression value. |
df = pl$DataFrame(x = 1:5)
df$with_columns(
  `x/2` = pl$col("x")$div(2),
  `x%/%2` = pl$col("x")$floor_div(2)
)
Fill missing values with the last seen values. Syntactic sugar for $fill_null(strategy = "forward").
Expr_forward_fill(limit = NULL)
limit |
Number of consecutive null values to fill when using the
|
Expr
pl$DataFrame(a = c(NA, 1, NA, 2, NA))$
  with_columns(
    forward = pl$col("a")$forward_fill()
  )
Gather values by index
Expr_gather(indices)
indices |
R vector or Series, or Expr that leads to a Series of dtype Int64. (0-indexed) |
Expr
df = pl$DataFrame(a = 1:10)
df$select(pl$col("a")$gather(c(0, 2, 4, -1)))
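Because R itself is one-based, the zero-based indices used here are easy to get wrong. The hypothetical helper gather_sketch() translates them into base R subsetting; it assumes that a negative index counts from the end of the column (so -1 is the last element), which is how the example above appears to use it.

```r
# Hypothetical helper: zero-based gather with negative indices from the end.
gather_sketch <- function(x, idx0) {
  idx0 <- ifelse(idx0 < 0, length(x) + idx0, idx0)  # -1 becomes length(x) - 1
  x[idx0 + 1]                                       # shift to R's one-based indexing
}

gathered <- gather_sketch(1:10, c(0, 2, 4, -1))
print(gathered)
```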
Gather every nth value in the Series and return as a new Series.
Expr_gather_every(n, offset = 0)
n |
Positive integer. |
offset |
Starting index. |
Expr
pl$DataFrame(a = 0:24)$select(pl$col("a")$gather_every(6))
Method equivalent of greater than operator expr > other.
Expr_gt(other)
other |
numeric or string value; accepts expression input. |
pl$lit(2) > 1
pl$lit(2) > pl$lit(1)
pl$lit(2)$gt(pl$lit(1))
Method equivalent of greater than or equal operator expr >= other.
Expr_gt_eq(other)
other |
numeric or string value; accepts expression input. |
pl$lit(2) >= 2
pl$lit(2) >= pl$lit(2)
pl$lit(2)$gt_eq(pl$lit(2))
Check whether the expression contains one or more null values
Expr_has_nulls()
Expr
df = pl$DataFrame(
  a = c(NA, 1, NA),
  b = c(1, NA, 2),
  c = c(1, 2, 3)
)
df$select(pl$all()$has_nulls())
The hash value is of type UInt64.
Expr_hash(seed = 0, seed_1 = NULL, seed_2 = NULL, seed_3 = NULL)
seed |
Random seed parameter. Defaults to 0. Doesn't have any effect for now. |
seed_1, seed_2, seed_3 |
Random seed parameter. Defaults to arg seed. The column will be coerced to UInt32. |
Expr
df = pl$DataFrame(iris[1:3, c(1, 2)])
df$with_columns(pl$all()$hash(1234)$name$suffix("_hash"))
Get the first n elements
Expr_head(n = 10)
n |
Number of elements to take. |
Expr
pl$DataFrame(x = 1:11)$select(pl$col("x")$head(3))
Aggregate values into a list.
Expr_implode()
Use $to_struct() to wrap a DataFrame.
Expr
df = pl$DataFrame(
  a = 1:3,
  b = 4:6
)
df$select(pl$all()$implode())
Print the value that this expression evaluates to and pass on the value. The printing will happen when the expression evaluates, not when it is formed.
Expr_inspect(fmt = "{}")
fmt |
format string, should contain one set of |
Expr
pl$select(pl$lit(1:5)$inspect( "Here's what the Series looked like before keeping the first two values: {}" )$head(2))
Fill nulls with linear interpolation using non-missing values. Can also be used to regrid data to a new grid - see examples below.
Expr_interpolate(method = "linear")
method |
String, either |
Expr
pl$DataFrame(x = c(1, NA, 4, NA, 100, NaN, 150))$
  with_columns(
    interp_lin = pl$col("x")$interpolate(),
    interp_near = pl$col("x")$interpolate("nearest")
  )
# x, y interpolation over a grid
df_original_grid = pl$DataFrame(
  grid_points = c(1, 3, 10),
  values = c(2.0, 6.0, 20.0)
)
df_original_grid
df_new_grid = pl$DataFrame(grid_points = (1:10) * 1.0)
df_new_grid
# Interpolate from this to the new grid
df_new_grid$join(
  df_original_grid,
  on = "grid_points", how = "left"
)$with_columns(pl$col("values")$interpolate())
Check if an expression is between the given lower and upper bounds
Expr_is_between(lower_bound, upper_bound, closed = "both")
lower_bound |
Lower bound, can be an Expr. Strings are parsed as column names. |
upper_bound |
Upper bound, can be an Expr. Strings are parsed as column names. |
closed |
Define which sides of the interval are closed (inclusive). This
can be either |
Note that in polars, NaN are equal to other NaNs, and greater than any non-NaN value.
Expr
df = pl$DataFrame(num = 1:5)
df$with_columns(
  is_between = pl$col("num")$is_between(2, 4),
  is_between_excl_upper = pl$col("num")$is_between(2, 4, closed = "left"),
  is_between_excl_both = pl$col("num")$is_between(2, 4, closed = "none")
)
# lower and upper bounds can also be column names or expr
df = pl$DataFrame(
  num = 1:5,
  lower = c(0, 2, 3, 3, 3),
  upper = c(6, 4, 4, 8, 3.5)
)
df$with_columns(
  is_between_cols = pl$col("num")$is_between("lower", "upper"),
  is_between_expr = pl$col("num")$is_between(pl$col("lower") / 2, "upper")
)
This is syntactic sugar for $is_unique()$not().
Expr_is_duplicated()
Expr
pl$DataFrame(head(mtcars[, 1:2]))$ with_columns(is_duplicated = pl$col("mpg")$is_duplicated())
Returns a boolean Series indicating which values are finite.
Expr_is_finite()
Expr
pl$DataFrame(list(alice = c(0, NaN, NA, Inf, -Inf)))$ with_columns(finite = pl$col("alice")$is_finite())
Check whether each value is the first occurrence
Expr_is_first_distinct()
Expr
pl$DataFrame(head(mtcars[, 1:2]))$ with_columns(is_ufirst = pl$col("mpg")$is_first_distinct())
Notice that to check whether a factor value is in a vector of strings, you need to use the string cache, either with pl$enable_string_cache() or with pl$with_string_cache(). See examples.
Expr_is_in(other)
other |
numeric or string value; accepts expression input. |
Expr
pl$DataFrame(a = c(1:4, NA_integer_))$with_columns(
  in_1_3 = pl$col("a")$is_in(c(1, 3)),
  in_NA = pl$col("a")$is_in(pl$lit(NA_real_))
)
# this fails because we can't compare factors to strings
# pl$DataFrame(a = factor(letters[1:5]))$with_columns(
#   in_abc = pl$col("a")$is_in(c("a", "b", "c"))
# )
# need to use the string cache for this
pl$with_string_cache({
  pl$DataFrame(a = factor(letters[1:5]))$with_columns(
    in_abc = pl$col("a")$is_in(c("a", "b", "c"))
  )
})
Returns a boolean Series indicating which values are infinite.
Expr_is_infinite()
Expr
pl$DataFrame(list(alice = c(0, NaN, NA, Inf, -Inf)))$ with_columns(infinite = pl$col("alice")$is_infinite())
Check whether each value is the last occurrence
Expr_is_last_distinct()
Expr
pl$DataFrame(head(mtcars[, 1:2]))$ with_columns(is_ulast = pl$col("mpg")$is_last_distinct())
Returns a boolean Series indicating which values are NaN.
Expr_is_nan()
Expr
pl$DataFrame(list(alice = c(0, NaN, NA, Inf, -Inf)))$
  with_columns(nan = pl$col("alice")$is_nan())
Returns a boolean Series indicating which values are not NaN. Syntactic sugar for $is_nan()$not().
Expr_is_not_nan()
Expr
pl$DataFrame(list(alice = c(0, NaN, NA, Inf, -Inf)))$
  with_columns(not_nan = pl$col("alice")$is_not_nan())
Returns a boolean Series indicating which values are not null. Syntactic sugar for $is_null()$not().
Expr_is_not_null()
Expr
pl$DataFrame(list(x = c(1, NA, 3)))$select(pl$col("x")$is_not_null())
Returns a boolean Series indicating which values are null.
Expr_is_null()
Expr
pl$DataFrame(list(x = c(1, NA, 3)))$select(pl$col("x")$is_null())
Check whether each value is unique
Expr_is_unique()
Expr
pl$DataFrame(head(mtcars[, 1:2]))$
  with_columns(is_unique = pl$col("mpg")$is_unique())
Compute the kurtosis (Fisher or Pearson) of a dataset.
Expr_kurtosis(fisher = TRUE, bias = TRUE)
fisher |
If |
bias |
If |
Kurtosis is the fourth central moment divided by the square of the variance. If Fisher's definition is used, then 3 is subtracted from the result to give 0 for a normal distribution.
If bias is FALSE, then the kurtosis is calculated using k statistics to eliminate bias coming from biased moment estimators.
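As a worked illustration of the definition above, the following sketch uses the biased (population) moments, which is assumed here to correspond to the defaults fisher = TRUE and bias = TRUE. The symbols m_k and g_2 are notation introduced for this illustration only and do not appear in the package documentation.

```latex
m_k = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^{k}, \qquad
g_2 = \frac{m_4}{m_2^{2}} - 3

% Worked example: x = (1, 2, 3, 2, 1), n = 5, \bar{x} = 1.8
m_2 = \frac{2.8}{5} = 0.56, \qquad
m_4 = \frac{2.896}{5} = 0.5792

g_2 = \frac{0.5792}{0.56^{2}} - 3 \approx 1.8469 - 3 = -1.1531
```

Under Pearson's definition (fisher = FALSE) the final subtraction of 3 is omitted, which would give roughly 1.8469 for the same data.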
Expr
pl$DataFrame(a = c(1:3, 2:1))$
  with_columns(kurt = pl$col("a")$kurtosis())
Get the last value
Expr_last()
Expr
pl$DataFrame(x = 3:1)$with_columns(last = pl$col("x")$last())
This is an alias for <Expr>$head().
Expr_limit(n = 10)
n |
Number of elements to take. |
Expr
pl$DataFrame(x = 1:11)$select(pl$col("x")$limit(3))
Compute the logarithm of elements
Expr_log(base = base::exp(1))
base |
Numeric base value for logarithm, default is |
Expr
pl$DataFrame(a = c(1, 2, 3, exp(1)))$
  with_columns(log = pl$col("a")$log())
Compute the base-10 logarithm of elements
Expr_log10()
Expr
pl$DataFrame(a = c(1, 2, 3, exp(1)))$
  with_columns(log10 = pl$col("a")$log10())
Find the lower bound of a DataType
Expr_lower_bound()
Expr
pl$DataFrame(
  x = 1:3, y = 1:3,
  schema = list(x = pl$UInt32, y = pl$Int32)
)$
  select(pl$all()$lower_bound())
Method equivalent of less than operator expr < other.
Expr_lt(other)
other |
numeric or string value; accepts expression input. |
pl$lit(5) < 10
pl$lit(5) < pl$lit(10)
pl$lit(5)$lt(pl$lit(10))
Method equivalent of less than or equal operator expr <= other.
Expr_lt_eq(other)
other |
numeric or string value; accepts expression input. |
pl$lit(2) <= 2
pl$lit(2) <= pl$lit(2)
pl$lit(2)$lt_eq(pl$lit(2))
Map an expression with an R function
Expr_map_batches(
  f,
  output_type = NULL,
  agg_list = FALSE,
  in_background = FALSE
)
f |
a function to map with |
output_type |
|
agg_list |
Aggregate list. Map from vector to group in group_by context. |
in_background |
Logical. Whether to execute the map in a background R
process. Combined with setting e.g. |
It is sometimes necessary to apply a specific R function on one or several columns. However, note that using R code in $map_batches() is slower than native polars. The user function must take one polars Series as input and the return should be a Series or any Robj convertible into a Series (e.g. vectors). Map fully supports browser().
If in_background = FALSE the function can access any global variable of the R session. However, note that several calls to $map_batches() will sequentially share the same main R session, so the global environment might change between the start of the query and the moment a $map_batches() call is evaluated. Any native polars computations can still be executed meanwhile. If in_background = TRUE, the map will run in one or more other R sessions and will not have access to global variables. Use options(polars.rpool_cap = 4) and polars_options()$rpool_cap to set and view number of parallel R sessions.
Expr
pl$DataFrame(iris)$
  select(
    pl$col("Sepal.Length")$map_batches(\(x) {
      paste("cheese", as.character(x$to_vector()))
    }, pl$dtypes$String)
  )
# R parallel process example, use Sys.sleep() to imitate some CPU expensive
# computation.
# map a,b,c,d sequentially
pl$LazyFrame(a = 1, b = 2, c = 3, d = 4)$select(
  pl$all()$map_batches(\(s) {
    Sys.sleep(.1)
    s * 2
  })
)$collect() |> system.time()
# map in parallel 1: Overhead to start up extra R processes / sessions
options(polars.rpool_cap = 0) # drop any previous processes, just to show start-up overhead
options(polars.rpool_cap = 4) # set back to 4, the default
polars_options()$rpool_cap
pl$LazyFrame(a = 1, b = 2, c = 3, d = 4)$select(
  pl$all()$map_batches(\(s) {
    Sys.sleep(.1)
    s * 2
  }, in_background = TRUE)
)$collect() |> system.time()
# map in parallel 2: Reuse R processes in "polars global_rpool".
polars_options()$rpool_cap
pl$LazyFrame(a = 1, b = 2, c = 3, d = 4)$select(
  pl$all()$map_batches(\(s) {
    Sys.sleep(.1)
    s * 2
  }, in_background = TRUE)
)$collect() |> system.time()
The UDF is applied to each element of a column. See Details for more information on specificities related to the context.
Expr_map_elements(
  f,
  return_type = NULL,
  strict_return_type = TRUE,
  allow_fail_eval = FALSE,
  in_background = FALSE
)
f |
Function to map |
return_type |
DataType of the output Series. If |
strict_return_type |
If |
allow_fail_eval |
If |
in_background |
Whether to run the function in a background R process,
default is |
Note that, in a GroupBy context, the column will have been pre-aggregated and so each element will itself be a Series. Therefore, depending on the context, the requirements for the function differ:
In $select() or $with_columns() (selection context), the function must operate on R values of length 1. Polars will convert each element into an R value and pass it to the function. The output of the user function will be converted back into a polars type (the return type must match, see argument return_type). Using $map_elements() in this context should be avoided as a lapply() has half the overhead.
In $agg() (GroupBy context), the function must take a Series and return a Series or an R object convertible to Series, e.g. a vector. In this context, it is much faster if the number of groups is much lower than the number of rows, as the iteration is only across the groups. The R user function could e.g. convert the Series to a vector with $to_r() and perform some vectorized operations.
Note that it is preferred to express your function in polars syntax, which will almost always be significantly faster and more memory efficient because:
The native expression engine runs in Rust; functions run in R.
Use of R functions forces the DataFrame to be materialized in memory.
Polars-native expressions can be parallelized (R functions cannot).
Polars-native expressions can be logically optimized (R functions cannot).
Wherever possible you should strongly prefer the native expression API to achieve the best performance and avoid using $map_elements().
Expr
# apply over groups: here, the input must be a Series
# prepare two expressions, one to compute the sum of each variable, one to
# get the first two values of each variable and store them in a list
e_sum = pl$all()$map_elements(\(s) sum(s$to_r()))$name$suffix("_sum")
e_head = pl$all()$map_elements(\(s) head(s$to_r(), 2))$name$suffix("_head")
pl$DataFrame(iris)$group_by("Species")$agg(e_sum, e_head)
# apply a function on each value (should be avoided): here the input is an R
# value of length 1
# select only Float64 columns
my_selection = pl$col(pl$dtypes$Float64)
# prepare two expressions, the first one only adds 10 to each element, the
# second returns the letter whose index matches the element
e_add10 = my_selection$map_elements(\(x) {
  x + 10
})$name$suffix("_sum")
e_letter = my_selection$map_elements(\(x) {
  letters[ceiling(x)]
}, return_type = pl$dtypes$String)$name$suffix("_letter")
pl$DataFrame(iris)$select(e_add10, e_letter)
# Small benchmark --------------------------------
# Using `$map_elements()` is much slower than a more polars-native approach.
# First we multiply each element of a Series of 1M elements by 2.
n = 1000000L
set.seed(1)
df = pl$DataFrame(list(
  a = 1:n,
  b = sample(letters, n, replace = TRUE)
))
system.time({
  df$with_columns(
    bob = pl$col("a")$map_elements(\(x) {
      x * 2L
    })
  )
})
# Comparing this to the standard polars syntax:
system.time({
  df$with_columns(
    bob = pl$col("a") * 2L
  )
})
# Running in parallel --------------------------------
# here, we use Sys.sleep() to imitate some CPU expensive computation.
# use apply over each Species-group in each column equal to 12 sequential
# runs ~1.2 sec.
system.time({
  pl$LazyFrame(iris)$group_by("Species")$agg(
    pl$all()$map_elements(\(s) {
      Sys.sleep(.1)
      s$sum()
    })
  )$collect()
})
# first run in parallel: there is some overhead to start up extra R processes
# drop any previous processes, just to show start-up overhead here
options(polars.rpool_cap = 0)
# set back to 4, the default
options(polars.rpool_cap = 4)
polars_options()$rpool_cap
system.time({
  pl$LazyFrame(iris)$group_by("Species")$agg(
    pl$all()$map_elements(\(s) {
      Sys.sleep(.1)
      s$sum()
    }, in_background = TRUE)
  )$collect()
})
# second run in parallel: this reuses R processes in "polars global_rpool".
polars_options()$rpool_cap
system.time({
  pl$LazyFrame(iris)$group_by("Species")$agg(
    pl$all()$map_elements(\(s) {
      Sys.sleep(.1)
      s$sum()
    }, in_background = TRUE)
  )$collect()
})
Get maximum value
Expr_max()
Expr
pl$DataFrame(x = c(1, NA, 3))$
  with_columns(max = pl$col("x")$max())
Get mean value
Expr_mean()
Expr
pl$DataFrame(x = c(1L, NA, 2L))$
  with_columns(mean = pl$col("x")$mean())
Get median value
Expr_median()
Expr
pl$DataFrame(x = c(1L, NA, 2L))$
  with_columns(median = pl$col("x")$median())
Get minimum value
Expr_min()
Expr
pl$DataFrame(x = c(1, NA, 3))$
  with_columns(min = pl$col("x")$min())
Method equivalent of modulus operator expr %% other.
Expr_mod(other)
other |
Numeric literal or expression value. |
df = pl$DataFrame(x = -5L:5L)
df$with_columns(
  `x%%2` = pl$col("x")$mod(2)
)
Compute the most occurring value(s). Can return multiple values if there are ties.
Expr_mode()
Expr
df = pl$DataFrame(a = 1:6, b = c(1L, 1L, 3L, 3L, 5L, 6L), c = c(1L, 1L, 2L, 2L, 3L, 3L))
df$select(pl$col("a")$mode())
df$select(pl$col("b")$mode())
df$select(pl$col("c")$mode())
Method equivalent of multiplication operator expr * other.
Expr_mul(other)
other |
Numeric literal or expression value. |
df = pl$DataFrame(x = c(1, 2, 4, 8, 16))
df$with_columns(
  `x*2` = pl$col("x")$mul(2),
  `x * xlog2` = pl$col("x")$mul(pl$col("x")$log(2))
)
Count number of unique values
Expr_n_unique()
Expr
pl$DataFrame(iris[, 4:5])$with_columns(count = pl$col("Species")$n_unique())
Get maximum value, but returns NaN if there are any.
Expr_nan_max()
Expr
pl$DataFrame(x = c(1, NA, 3, NaN, Inf))$
  with_columns(nan_max = pl$col("x")$nan_max())
Get minimum value, but returns NaN if there are any.
Expr_nan_min()
Expr
pl$DataFrame(x = c(1, NA, 3, NaN, Inf))$
  with_columns(nan_min = pl$col("x")$nan_min())
Method equivalent of inequality operator expr != other.
Expr_neq(other)
other |
numeric or string value; accepts expression input. |
pl$lit(1) != 2
pl$lit(1) != pl$lit(2)
pl$lit(1)$neq(pl$lit(2))
Check inequality without null propagation. Method equivalent of inequality operator expr != other, except that null values are compared as ordinary values instead of propagating to the result.
Expr_neq_missing(other)
other |
numeric or string value; accepts expression input. |
df = pl$DataFrame(x = c(NA, FALSE, TRUE), y = c(TRUE, TRUE, TRUE))
df$with_columns(
  neq = pl$col("x")$neq("y"),
  neq_missing = pl$col("x")$neq_missing("y")
)
Method equivalent of negation operator !expr.
Expr_not()
# two syntaxes same result
pl$lit(TRUE)$not()
!pl$lit(TRUE)
Count missing values
Expr_null_count()
Expr
pl$DataFrame(x = c(NA, "a", NA, "b"))$
  with_columns(n_missing = pl$col("x")$null_count())
Combine two boolean expressions with OR.
Expr_or(other)
other |
numeric or string value; accepts expression input. |
pl$lit(TRUE) | FALSE
pl$lit(TRUE)$or(pl$lit(TRUE))
This expression is similar to performing a group by aggregation and joining the result back into the original DataFrame. The outcome is similar to how window functions work in PostgreSQL.
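The "aggregate per group, then join back" idea can be sketched in base R, without polars. This is only an analogue to make the semantics concrete, not how polars implements $over(); the data frame and the column name c_max are made up for the illustration.

```r
# Base-R analogue of a window aggregation (illustrative, not polars code).
df <- data.frame(
  a = c("a", "a", "b", "b", "b"),
  c = c(5, 4, 2, 1, 3)
)

# ave() applies FUN within each group of `a` and returns a vector aligned
# with the original rows, so every row receives its group's maximum.
df$c_max <- ave(df$c, df$a, FUN = max)

print(df)
```

Each row keeps its position and gains the maximum of its own group, which is the behaviour the default mapping_strategy = "group_to_rows" describes.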
Expr_over(..., order_by = NULL, mapping_strategy = "group_to_rows")
... |
Column(s) to group by. Accepts expression input. Characters are parsed as column names. |
order_by |
Order the window functions/aggregations with the partitioned
groups by the result of the expression passed to |
mapping_strategy |
One of the following:
|
Expr
# Pass the name of a column to compute the expression over that column.
df = pl$DataFrame(
  a = c("a", "a", "b", "b", "b"),
  b = c(1, 2, 3, 5, 3),
  c = c(5, 4, 2, 1, 3)
)
df$with_columns(
  pl$col("c")$max()$over("a")$name$suffix("_max")
)
# Expression input is supported.
df$with_columns(
  pl$col("c")$max()$over(pl$col("b") %/% 2)$name$suffix("_max")
)
# Group by multiple columns by passing a character vector of column names
# or list of expressions.
df$with_columns(
  pl$col("c")$min()$over(c("a", "b"))$name$suffix("_min")
)
df$with_columns(
  pl$col("c")$min()$over(list(pl$col("a"), pl$col("b")))$name$suffix("_min")
)
# Or use positional arguments to group by multiple columns in the same way.
df$with_columns(
  pl$col("c")$min()$over("a", pl$col("b") %% 2)$name$suffix("_min")
)
# Alternative mapping strategy: join values in a list output
df$with_columns(
  top_2 = pl$col("c")$top_k(2)$over("a", mapping_strategy = "join")
)
# order_by specifies how values are sorted within a group, which is
# essential when the operation depends on the order of values
df = pl$DataFrame(
  g = c(1, 1, 1, 1, 2, 2, 2, 2),
  t = c(1, 2, 3, 4, 4, 1, 2, 3),
  x = c(10, 20, 30, 40, 10, 20, 30, 40)
)
# without order_by, the first and second values in the second group would
# be inverted, which would be wrong
df$with_columns(
  x_lag = pl$col("x")$shift(1)$over("g", order_by = "t")
)
Computes percentage change (as fraction) between current element and most-recent non-null element at least n period(s) before the current element. Computes the change from the previous row by default.
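Written as a formula, and assuming (as the description suggests) that a null is replaced by the most recent non-null value before the change is computed, the result for position t can be sketched as follows. The symbol \tilde{x} is notation introduced here for illustration.

```latex
\mathrm{pct\_change}_t = \frac{\tilde{x}_t - \tilde{x}_{t-n}}{\tilde{x}_{t-n}}
% \tilde{x}: x with each null replaced by the most recent non-null value

% Worked example with n = 1 and x = (10, 11, 12, null, 12):
% t = 2: (11 - 10) / 10 = 0.1
% t = 3: (12 - 11) / 11 \approx 0.0909
% t = 5: (12 - 12) / 12 = 0
```

The first n elements have no reference value, so their result is null.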
Expr_pct_change(n = 1)
n |
Periods to shift for computing percent change. |
Expr
pl$DataFrame(a = c(10L, 11L, 12L, NA_integer_, 12L))$
  with_columns(pct_change = pl$col("a")$pct_change())
A local maximum is the point that marks the transition between an increase and a decrease in a Series. The first and last values of the Series can never be a peak.
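The definition can be made concrete with a small base-R sketch. It is an illustration of the idea only, not the polars implementation, and it assumes strict inequalities on both sides (how plateaus are treated is not stated in this documentation). The name is_peak_max is hypothetical.

```r
# Base-R sketch of a local-maximum test (illustrative, not polars code).
x <- c(1, 2, 3, 2, 3, 4, 5, 2)
n <- length(x)

# Interior points are compared with both neighbours; the first and last
# values are never peaks, so they are fixed to FALSE.
interior <- x[2:(n - 1)] > x[1:(n - 2)] & x[2:(n - 1)] > x[3:n]
is_peak_max <- c(FALSE, interior, FALSE)

print(is_peak_max)
```

For this input the values 3 (third position) and 5 (seventh position) are flagged, since each is larger than both of its neighbours.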
Expr_peak_max()
Expr
$peak_min()
df = pl$DataFrame(x = c(1, 2, 3, 2, 3, 4, 5, 2))
df
df$with_columns(peak_max = pl$col("x")$peak_max())
A local minimum is the point that marks the transition between a decrease and an increase in a Series. The first and last values of the Series can never be a peak.
Expr_peak_min()
Expr
$peak_max()
df = pl$DataFrame(x = c(1, 2, 3, 2, 3, 4, 5, 2))
df
df$with_columns(peak_min = pl$col("x")$peak_min())
Method equivalent of exponentiation operator expr ^ exponent.
Expr_pow(exponent)
exponent |
Numeric literal or expression value. |
df = pl$DataFrame(x = c(1, 2, 4, 8))
df$with_columns(
  cube = pl$col("x")$pow(3),
  `x^xlog2` = pl$col("x")$pow(pl$col("x")$log(2))
)
Compute the product of an expression.
Expr_product()
Expr
pl$DataFrame(x = c(2L, NA, 2L))$
  with_columns(product = pl$col("x")$product())
Bin continuous values into discrete categories based on their quantiles
Expr_qcut(
  quantiles,
  ...,
  labels = NULL,
  left_closed = FALSE,
  allow_duplicates = FALSE,
  include_breaks = FALSE
)
quantiles |
Either a vector of quantile probabilities between 0 and 1 or a positive integer determining the number of bins with uniform probability. |
... |
Ignored. |
labels |
Names of the categories. The number of labels must be equal to the number of cut points plus one. |
left_closed |
Set the intervals to be left-closed instead of right-closed. |
allow_duplicates |
If set to |
include_breaks |
Include a column with the right endpoint of the bin each
observation falls in. This will change the data type of the output from a
|
Expr of data type Categorical if include_breaks is FALSE and of data type Struct if include_breaks is TRUE.
df = pl$DataFrame(foo = c(-2, -1, 0, 1, 2))
# Divide a column into three categories according to pre-defined quantile
# probabilities
df$with_columns(
  qcut = pl$col("foo")$qcut(c(0.25, 0.75), labels = c("a", "b", "c"))
)
# Divide a column into two categories using uniform quantile probabilities.
df$with_columns(
  qcut = pl$col("foo")$qcut(2, labels = c("low", "high"), left_closed = TRUE)
)
# Add both the category and the breakpoint
df$with_columns(
  qcut = pl$col("foo")$qcut(c(0.25, 0.75), include_breaks = TRUE)
)$unnest("qcut")
Get quantile value.
Expr_quantile(quantile, interpolation = "nearest")
quantile |
Either a numeric value or an Expr whose value must be between 0 and 1. |
interpolation |
One of |
Null values are ignored and NaNs are ranked as the largest value. For linear interpolation, NaN poisons Inf, which in turn poisons any other value.
Expr
pl$DataFrame(x = -5:5)$
  select(pl$col("x")$quantile(0.5))
Assign ranks to data, dealing with ties appropriately.
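To see how tie handling changes the result, the sketch below uses base R's rank() on the same values as the examples for this method. It is an analogue only: the mapping from the polars method names to base-R ties.method values (in particular "ordinal" to "first", and "dense" via match()) is an assumption made for illustration, and the "random" method is not covered.

```r
# Base-R illustration of tie-handling strategies (not polars code).
a <- c(3, 6, 1, 1, 6)

rank_average <- rank(a, ties.method = "average") # ties share the mean rank
rank_min     <- rank(a, ties.method = "min")     # ties share the lowest rank
rank_max     <- rank(a, ties.method = "max")     # ties share the highest rank
rank_ordinal <- rank(a, ties.method = "first")   # ties broken by position
rank_dense   <- match(a, sort(unique(a)))        # like "min", without gaps

print(rbind(rank_average, rank_min, rank_max, rank_ordinal, rank_dense))
```

The two tied pairs (the 1s and the 6s) are where the rows of the printed matrix differ; the untied value 3 gets rank 3 in every row except the dense one, where it gets rank 2.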
Expr_rank(
  method = c("average", "min", "max", "dense", "ordinal", "random"),
  descending = FALSE,
  seed = NULL
)
method |
String, one of
|
descending |
Rank in descending order. |
seed |
string parsed or number converted into uint64. Used if method="random". |
Expr
# The 'average' method:
pl$DataFrame(a = c(3, 6, 1, 1, 6))$
  with_columns(rank = pl$col("a")$rank())
# The 'ordinal' method:
pl$DataFrame(a = c(3, 6, 1, 1, 6))$
  with_columns(rank = pl$col("a")$rank("ordinal"))
Create a single chunk of memory for this Series.
Expr_rechunk()
See rechunk() explained in docs_translations.
Expr
# get chunked lengths with/without rechunk
series_list = pl$DataFrame(list(a = 1:3, b = 4:6))$select(
  pl$col("a")$append(pl$col("b"))$alias("a_chunked"),
  pl$col("a")$append(pl$col("b"))$rechunk()$alias("a_rechunked")
)$get_columns()
lapply(series_list, \(x) x$chunk_lengths())
Reinterpret the underlying bits as a signed/unsigned integer. This operation is only allowed for Int64. For integers with fewer bits, you can safely use the cast operation.
Expr_reinterpret(signed = TRUE)
signed |
If |
Expr
df = pl$DataFrame(x = 1:5, schema = list(x = pl$Int64))
df$select(pl$all()$reinterpret())
This expression takes its input, repeats it n times, and appends the resulting chunks.
Expr_rep(n, rechunk = TRUE)
n |
The number of times to repeat, must be non-negative and finite. |
rechunk |
If |
If the input has length 1, this uses a special faster implementation that doesn't require rechunking (so rechunk = TRUE has no effect).
Expr
pl$select(pl$lit("alice")$rep(n = 3))
pl$select(pl$lit(1:3)$rep(n = 2))
Repeat the elements in this Series as specified in the given expression. The repeated elements are expanded into a List.
Expr_repeat_by(by)
by |
Expr that determines how often the values will be repeated. The column will be coerced to UInt32. |
Expr
df = pl$DataFrame(a = c("w", "x", "y", "z"), n = c(-1, 0, 1, 2))
df$with_columns(repeated = pl$col("a")$repeat_by("n"))
This allows one to recode values in a column, leaving all other values unchanged. See $replace_strict() to give a default value to all other values and to specify the output datatype.
Expr_replace(old, new)
old |
Can be several things:
|
new |
Either a vector of length 1, a vector of same length as |
Expr
df = pl$DataFrame(a = c(1, 2, 2, 3))
# "old" and "new" can take vectors of length 1 or of same length
df$with_columns(replaced = pl$col("a")$replace(2, 100))
df$with_columns(replaced = pl$col("a")$replace(c(2, 3), c(100, 200)))
# "old" can be a named list where names are values to replace, and values are
# the replacements
mapping = list(`2` = 100, `3` = 200)
df$with_columns(replaced = pl$col("a")$replace(mapping))
df = pl$DataFrame(a = c("x", "y", "z"))
mapping = list(x = 1, y = 2, z = 3)
df$with_columns(replaced = pl$col("a")$replace(mapping))
# "old" and "new" can take Expr
df = pl$DataFrame(a = c(1, 2, 2, 3), b = c(1.5, 2.5, 5, 1))
df$with_columns(
  replaced = pl$col("a")$replace(
    old = pl$col("a")$max(),
    new = pl$col("b")$sum()
  )
)
This changes all the values in a column, either using a specific replacement or a default one. See $replace() to replace only a subset of values.
Expr_replace_strict(old, new, default = NULL, return_dtype = NULL)
old |
Can be several things:
|
new |
Either a vector of length 1, a vector of same length as |
default |
The default replacement if the value is not in |
return_dtype |
The data type of the resulting expression. If set to
|
Expr
df = pl$DataFrame(a = c(1, 2, 2, 3))

# "old" and "new" can take vectors of length 1 or of same length
df$with_columns(replaced = pl$col("a")$replace_strict(2, 100, default = 1))
df$with_columns(
  replaced = pl$col("a")$replace_strict(c(2, 3), c(100, 200), default = 1)
)

# "old" can be a named list where names are values to replace, and values are
# the replacements
mapping = list(`2` = 100, `3` = 200)
df$with_columns(replaced = pl$col("a")$replace_strict(mapping, default = -1))

# one can specify the data type to return instead of automatically
# inferring it
df$with_columns(
  replaced = pl$col("a")$replace_strict(mapping, default = 1, return_dtype = pl$Int32)
)

# "old", "new", and "default" can take Expr
df = pl$DataFrame(a = c(1, 2, 2, 3), b = c(1.5, 2.5, 5, 1))
df$with_columns(
  replaced = pl$col("a")$replace_strict(
    old = pl$col("a")$max(),
    new = pl$col("b")$sum(),
    default = pl$col("b"),
  )
)
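The mapping behaviour described for $replace_strict() can be illustrated without Polars. The following is a minimal base R sketch, not the package's implementation: replace_strict_sketch() is a hypothetical helper, and raising an error for unmapped values when no default is supplied is an assumption about the strict behaviour rather than something stated in the text above.

```r
# Hypothetical helper: a base R sketch of the mapping semantics, not Polars code.
replace_strict_sketch <- function(x, old, new, default = NULL) {
  new <- rep_len(new, length(old))  # allow `new` of length 1
  idx <- match(x, old)              # position of each value in `old`, NA if absent
  out <- new[idx]
  unmapped <- is.na(idx)
  if (any(unmapped)) {
    # Assumption: with no default, unmapped values are an error.
    if (is.null(default)) {
      stop("some values are not in `old` and no `default` was given")
    }
    out[unmapped] <- default
  }
  out
}

a <- c(1, 2, 2, 3)
res_default <- replace_strict_sketch(a, old = c(2, 3), new = c(100, 200), default = 1)
print(res_default)
# 1 100 100 200

# Without a default, the unmapped value 1 triggers an error in this sketch.
res_error <- tryCatch(
  replace_strict_sketch(a, old = c(2, 3), new = c(100, 200)),
  error = function(e) "error"
)
print(res_error)
```

Every input value ends up changed: mapped values take their replacement and all others take the default.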
Reshape this Expr to a flat Series or a Series of Lists
Expr_reshape(dimensions, nested_type = pl$List())
dimensions |
An integer vector giving the size of each dimension.
If |
nested_type |
The nested data type to create. List only supports 2 dimensions, whereas Array supports an arbitrary number of dimensions. |
Expr. If a single dimension is given, results in an expression of the original data type. If multiple dimensions are given, results in an expression of data type List with shape equal to the dimensions.
df = pl$DataFrame(foo = 1:9)

df$select(pl$col("foo")$reshape(9))
df$select(pl$col("foo")$reshape(c(3, 3)))

# Use `-1` to infer the other dimension
df$select(pl$col("foo")$reshape(c(-1, 3)))
df$select(pl$col("foo")$reshape(c(3, -1)))

# One can specify more than 2 dimensions by using the Array type
df = pl$DataFrame(foo = 1:12)
df$select(
  pl$col("foo")$reshape(c(3, 2, 2), nested_type = pl$Array(pl$Float32, 2))
)
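The inference of a `-1` dimension is plain arithmetic: the unknown size is the number of values divided by the product of the known sizes. Below is a minimal base R sketch of that rule; infer_dims() is a hypothetical helper, not part of the package, and the row-wise fill shown with matrix() is an assumption about how the values are laid out.

```r
# Hypothetical helper: resolve a single -1 in `dimensions` for `n` values.
infer_dims <- function(n, dimensions) {
  unknown <- dimensions == -1
  stopifnot(sum(unknown) <= 1)
  if (any(unknown)) {
    known <- prod(dimensions[!unknown])
    stopifnot(n %% known == 0)  # the values must divide evenly
    dimensions[unknown] <- n / known
  }
  stopifnot(prod(dimensions) == n)
  dimensions
}

dims_a <- infer_dims(9, c(-1, 3))
dims_b <- infer_dims(9, c(3, -1))
print(dims_a)  # 3 3
print(dims_b)  # 3 3

# Assumption: values fill each row in order, like matrix(..., byrow = TRUE).
m <- matrix(1:9, nrow = dims_a[1], ncol = dims_a[2], byrow = TRUE)
print(m)
```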
Reverse a variable
Expr_reverse()
Expr
pl$DataFrame(list(a = 1:5))$select(pl$col("a")$reverse())
Get the lengths of runs of identical values
Expr_rle()
Expr
df = pl$DataFrame(s = c(1, 1, 2, 1, NA, 1, 3, 3))
df$select(pl$col("s")$rle())$unnest("s")
Similar to $rle(), but it maps each value to an ID corresponding to the run into which it falls. This is especially useful when you want to define groups by runs of identical values rather than the values themselves. Note that the ID is 0-indexed.
Expr_rle_id()
Expr
df = pl$DataFrame(a = c(1, 2, 1, 1, 1, 4))
df$with_columns(a_r = pl$col("a")$rle_id())
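The run ID can be pictured as a running count of value changes. The following base R sketch reproduces the 0-indexed IDs for the vector used above; rle_id_sketch() is a hypothetical helper that ignores missing values, not the package's implementation.

```r
# Hypothetical helper: 0-indexed run IDs for a vector without missing values.
rle_id_sketch <- function(x) {
  n <- length(x)
  if (n == 0) return(integer(0))
  # A new run starts wherever a value differs from its predecessor.
  starts_new_run <- c(FALSE, x[-1] != x[-n])
  cumsum(starts_new_run)
}

a <- c(1, 2, 1, 1, 1, 4)
ids <- rle_id_sketch(a)
print(ids)
# 0 1 2 2 2 3

# Grouping by run ID keeps the two separate runs of 1 apart.
run_sums <- tapply(a, ids, sum)
print(run_sums)
```

Grouping by the value itself would merge the first `1` with the later run of three; grouping by the ID keeps them as separate groups.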
If you have a time series <t_0, t_1, ..., t_n>, then by default the windows created will be:
(t_0 - period, t_0]
(t_1 - period, t_1]
…
(t_n - period, t_n]
whereas if you pass a non-default offset, then the windows will be:
(t_0 + offset, t_0 + offset + period]
(t_1 + offset, t_1 + offset + period]
…
(t_n + offset, t_n + offset + period]
Expr_rolling(index_column, ..., period, offset = NULL, closed = "right")
index_column |
Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order. If this column represents an index, it has to be either Int32 or Int64. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column. |
... |
Ignored. |
period |
A character representing the length of the window,
must be non-negative. See the |
offset |
A character representing the offset of the window,
or |
closed |
Define which sides of the temporal interval are closed
(inclusive). This can be either |
In case of a rolling operation on an integer column, the windows are defined by:
"1i" # length 1
"10i" # length 10
Expr
Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.
It has the following format:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
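To make the format concrete, here is a minimal base R sketch that splits a duration string into its count and unit parts. parse_duration() is a hypothetical helper written for illustration only; it is not how Polars parses durations, it does not handle a leading minus sign, and the "i" unit is included on the assumption that index counts such as "10i" follow the same pattern.

```r
# Hypothetical helper: split a duration string into (count, unit) pairs.
# Two-letter units come first so that "mo" and "ms" are not read as "m".
parse_duration <- function(x) {
  pattern <- "([0-9]+)(ns|us|ms|mo|s|m|h|d|w|q|y|i)"
  parts <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
  # Reject strings that contain anything besides count/unit pairs.
  stopifnot(length(parts) > 0, identical(paste(parts, collapse = ""), x))
  counts <- as.numeric(sub(pattern, "\\1", parts, perl = TRUE))
  names(counts) <- sub(pattern, "\\2", parts, perl = TRUE)
  counts
}

d <- parse_duration("3d12h4m25s")
print(d)

one_month <- parse_duration("1mo")
print(one_month)
```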
# create a DataFrame with a Datetime column and an f64 column
dates = c(
  "2020-01-01 13:45:48", "2020-01-01 16:42:13", "2020-01-01 16:45:09",
  "2020-01-02 18:12:48", "2020-01-03 19:45:32", "2020-01-08 23:16:43"
)

df = pl$DataFrame(dt = dates, a = c(3, 7, 5, 9, 2, 1))$
  with_columns(
    pl$col("dt")$str$strptime(pl$Datetime("us"), format = "%Y-%m-%d %H:%M:%S")$set_sorted()
  )

df$with_columns(
  sum_a = pl$sum("a")$rolling(index_column = "dt", period = "2d"),
  min_a = pl$min("a")$rolling(index_column = "dt", period = "2d"),
  max_a = pl$max("a")$rolling(index_column = "dt", period = "2d")
)

# we can use "offset" to change the start of the window period. Here, with
# offset = "1d", we start the window one day after the value in "dt", and
# then we add a 2-day window relative to the window start.
df$with_columns(
  sum_a_offset1 = pl$sum("a")$rolling(index_column = "dt", period = "2d", offset = "1d")
)
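The window definitions (t + offset, t + offset + period] can be checked by hand in base R. The sketch below recomputes the rolling sum for the same data; rolling_sum_sketch() is a hypothetical helper, the default offset of -period is taken from the window formulas in the description, and durations are given in seconds under the assumption of UTC timestamps, so a day is always 24 hours here.

```r
dates <- c(
  "2020-01-01 13:45:48", "2020-01-01 16:42:13", "2020-01-01 16:45:09",
  "2020-01-02 18:12:48", "2020-01-03 19:45:32", "2020-01-08 23:16:43"
)
dt <- as.POSIXct(dates, tz = "UTC")
a <- c(3, 7, 5, 9, 2, 1)

# Hypothetical helper: sum `values` over (t + offset, t + offset + period]
# for every t in `times`. `period` and `offset` are in seconds.
rolling_sum_sketch <- function(times, values, period, offset = -period) {
  secs <- as.numeric(times)
  vapply(secs, function(t) {
    lower <- t + offset
    upper <- t + offset + period
    sum(values[secs > lower & secs <= upper])
  }, numeric(1))
}

one_day <- 24 * 60 * 60
sum_a <- rolling_sum_sketch(dt, a, period = 2 * one_day)
print(sum_a)
# 3 10 15 24 11 1

# With a one-day offset the window starts one day after each timestamp.
# Empty windows sum to 0 in this sketch.
sum_a_offset1 <- rolling_sum_sketch(dt, a, period = 2 * one_day, offset = one_day)
print(sum_a_offset1)
```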
Compute the rolling (= moving) max over the values in this array. A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied by the weights given in the weights vector.
Expr_rolling_max( window_size, weights = NULL, min_periods = NULL, ..., center = FALSE )
window_size |
Integer specifying the length of the window. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
The number of values in the window that should be non-null
before computing a result. If |
... |
Ignored. |
center |
Set the labels at the center of the window |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), since this method can cache the window size computation.
Expr
pl$DataFrame(a = c(1, 3, 2, 4, 5, 6))$
  with_columns(roll_max = pl$col("a")$rolling_max(window_size = 2))
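The interaction between window_size and min_periods is easiest to see on a small vector. The following is a base R sketch of an unweighted, trailing (center = FALSE) window; rolling_max_sketch() is a hypothetical helper, and defaulting min_periods to the window size is an assumption about what happens when it is left as NULL.

```r
# Hypothetical helper: trailing fixed-size window, unweighted, center = FALSE.
rolling_max_sketch <- function(x, window_size, min_periods = window_size) {
  vapply(seq_along(x), function(i) {
    window <- x[max(1, i - window_size + 1):i]
    window <- window[!is.na(window)]  # only non-null values count
    if (length(window) < min_periods) NA_real_ else max(window)
  }, numeric(1))
}

a <- c(1, 3, 2, 4, 5, 6)
full <- rolling_max_sketch(a, window_size = 2)
partial <- rolling_max_sketch(a, window_size = 2, min_periods = 1)
print(full)     # NA 3 3 4 5 6
print(partial)  # 1 3 3 4 5 6
```

The first position has only one value available, so it yields a result only when min_periods is lowered to 1.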
Apply a rolling max based on another column.
Expr_rolling_max_by(by, window_size, ..., min_periods = 1, closed = "right")
by |
|
window_size |
The length of the window. Can be a fixed integer size, or a dynamic temporal size indicated by the following string language:
|
... |
Ignored. |
min_periods |
The number of values in the window that should be non-null
before computing a result. If |
closed |
Define which sides of the temporal interval are closed
(inclusive). This can be either |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), since this method can cache the window size computation.
Expr
df_temporal = pl$DataFrame(
  date = pl$datetime_range(as.Date("2001-1-1"), as.Date("2001-1-2"), "1h")
)$with_row_index("index")

df_temporal

df_temporal$with_columns(
  rolling_row_max = pl$col("index")$rolling_max_by("date", window_size = "3h")
)
Compute the rolling (= moving) mean over the values in this array. A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied by the weights given in the weights vector.
Expr_rolling_mean( window_size, weights = NULL, min_periods = NULL, ..., center = FALSE )
window_size |
Integer specifying the length of the window. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
The number of values in the window that should be non-null
before computing a result. If |
... |
Ignored. |
center |
Set the labels at the center of the window |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), since this method can cache the window size computation.
Expr
pl$DataFrame(a = c(1, 3, 2, 4, 5, 6))$
  with_columns(roll_mean = pl$col("a")$rolling_mean(window_size = 2))
Apply a rolling mean based on another column.
Expr_rolling_mean_by(by, window_size, ..., min_periods = 1, closed = "right")
by |
|
window_size |
The length of the window. Can be a fixed integer size, or a dynamic temporal size indicated by the following string language:
|
... |
Ignored. |
min_periods |
The number of values in the window that should be non-null
before computing a result. If |
closed |
Define which sides of the temporal interval are closed
(inclusive). This can be either |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), since this method can cache the window size computation.
Expr
df_temporal = pl$DataFrame(
  date = pl$datetime_range(as.Date("2001-1-1"), as.Date("2001-1-2"), "1h")
)$with_row_index("index")

df_temporal

df_temporal$with_columns(
  rolling_row_mean = pl$col("index")$rolling_mean_by("date", window_size = "3h")
)
Compute the rolling (= moving) median over the values in this array. A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied by the weights given in the weights vector.
Expr_rolling_median( window_size, weights = NULL, min_periods = NULL, center = FALSE )
window_size |
Integer specifying the length of the window. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
The number of values in the window that should be non-null
before computing a result. If |
center |
Set the labels at the center of the window |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), since this method can cache the window size computation.
Expr
pl$DataFrame(a = c(1, 3, 2, 4, 5, 6))$
  with_columns(roll_median = pl$col("a")$rolling_median(window_size = 2))
Apply a rolling median based on another column.
Expr_rolling_median_by(by, window_size, ..., min_periods = 1, closed = "right")
by |
|
window_size |
The length of the window. Can be a fixed integer size, or a dynamic temporal size indicated by the following string language:
|
... |
Ignored. |
min_periods |
The number of values in the window that should be non-null
before computing a result. If |
closed |
Define which sides of the temporal interval are closed
(inclusive). This can be either |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), since this method can cache the window size computation.
Expr
df_temporal = pl$DataFrame(
  date = pl$datetime_range(as.Date("2001-1-1"), as.Date("2001-1-2"), "1h")
)$with_row_index("index")

df_temporal

df_temporal$with_columns(
  rolling_row_median = pl$col("index")$rolling_median_by("date", window_size = "3h")
)
Compute the rolling (= moving) min over the values in this array. A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied by the weights given in the weights vector.
Expr_rolling_min( window_size, weights = NULL, min_periods = NULL, ..., center = FALSE )
window_size |
Integer specifying the length of the window. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
The number of values in the window that should be non-null
before computing a result. If |
... |
Ignored. |
center |
Set the labels at the center of the window |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), since this method can cache the window size computation.
Expr
pl$DataFrame(a = c(1, 3, 2, 4, 5, 6))$
  with_columns(roll_min = pl$col("a")$rolling_min(window_size = 2))
Apply a rolling min based on another column.
Expr_rolling_min_by(by, window_size, ..., min_periods = 1, closed = "right")
by |
|
window_size |
The length of the window. Can be a fixed integer size, or a dynamic temporal size indicated by the following string language:
|
... |
Ignored. |
min_periods |
The number of values in the window that should be non-null
before computing a result. If |
closed |
Define which sides of the temporal interval are closed
(inclusive). This can be either |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), since this method can cache the window size computation.
Expr
df_temporal = pl$DataFrame(
  date = pl$datetime_range(as.Date("2001-1-1"), as.Date("2001-1-2"), "1h")
)$with_row_index("index")

df_temporal

df_temporal$with_columns(
  rolling_row_min = pl$col("index")$rolling_min_by("date", window_size = "3h")
)
Compute the rolling (= moving) quantile over the values in this array. A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied by the weights given in the weights vector.
Expr_rolling_quantile( quantile, interpolation = "nearest", window_size, weights = NULL, min_periods = NULL, ..., center = FALSE )
quantile |
Quantile between 0 and 1. |
interpolation |
String, one of |
window_size |
Integer specifying the length of the window. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
The number of values in the window that should be non-null
before computing a result. If |
... |
Ignored. |
center |
Set the labels at the center of the window |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), since this method can cache the window size computation.
Expr
pl$DataFrame(a = c(1, 3, 2, 4, 5, 6))$
  with_columns(roll_quant = pl$col("a")$rolling_quantile(0.3, window_size = 2))
Compute a rolling quantile based on another column
Expr_rolling_quantile_by( by, window_size, ..., quantile, interpolation = "nearest", min_periods = 1, closed = "right" )
by |
|
window_size |
The length of the window. Can be a fixed integer size, or a dynamic temporal size indicated by the following string language:
|
... |
Ignored. |
quantile |
Either a numeric value or an Expr whose value must be between 0 and 1. |
interpolation |
One of |
min_periods |
The number of values in the window that should be non-null
before computing a result. If |
closed |
Define which sides of the temporal interval are closed
(inclusive). This can be either |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), since this method can cache the window size computation.
Expr
df_temporal = pl$DataFrame(
  date = pl$datetime_range(as.Date("2001-1-1"), as.Date("2001-1-2"), "1h")
)$with_row_index("index")

df_temporal

df_temporal$with_columns(
  rolling_row_quantile = pl$col("index")$rolling_quantile_by(
    "date",
    window_size = "2h", quantile = 0.3
  )
)
Compute the rolling (= moving) skewness over the values in this array. A window of length window_size will traverse the array.
Expr_rolling_skew(window_size, bias = TRUE)
window_size |
Integer specifying the length of the window. |
bias |
If |
For normally distributed data, the skewness should be about zero. For uni-modal continuous distributions, a skewness value greater than zero means that there is more weight in the right tail of the distribution.
Expr
pl$DataFrame(a = c(1, 3, 2, 4, 5, 6))$
  with_columns(roll_skew = pl$col("a")$rolling_skew(window_size = 2))
Compute the rolling (= moving) standard deviation over the values in this array. A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied by the weights given in the weights vector.
Expr_rolling_std( window_size, weights = NULL, min_periods = NULL, ..., center = FALSE, ddof = 1 )
window_size |
Integer specifying the length of the window. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
The number of values in the window that should be non-null
before computing a result. If |
... |
Ignored. |
center |
Set the labels at the center of the window |
ddof |
An integer representing "Delta Degrees of Freedom":
the divisor used in the calculation is |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), since this method can cache the window size computation.
Expr
pl$DataFrame(a = c(1, 3, 2, 4, 5, 6))$
  with_columns(roll_std = pl$col("a")$rolling_std(window_size = 2))
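The effect of ddof can be seen on a single window. The sketch below assumes the conventional definition in which the divisor is N - ddof, with N the number of values in the window; sd_ddof() is a hypothetical helper, not a Polars function.

```r
# Hypothetical helper: standard deviation with divisor N - ddof.
sd_ddof <- function(x, ddof = 1) {
  n <- length(x)
  sqrt(sum((x - mean(x))^2) / (n - ddof))
}

window <- c(1, 3)  # the first complete window of the example above
sample_sd <- sd_ddof(window, ddof = 1)
population_sd <- sd_ddof(window, ddof = 0)
print(sample_sd)      # sqrt(2), about 1.414
print(population_sd)  # 1
```

With ddof = 1 the result agrees with base R's sd(); with ddof = 0 it is the population standard deviation.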
Compute a rolling standard deviation based on another column
Expr_rolling_std_by( by, window_size, ..., min_periods = 1, closed = "right", ddof = 1 )
by |
|
window_size |
The length of the window. Can be a fixed integer size, or a dynamic temporal size indicated by the following string language:
|
... |
Ignored. |
min_periods |
The number of values in the window that should be non-null
before computing a result. If |
closed |
Define which sides of the temporal interval are closed
(inclusive). This can be either |
ddof |
An integer representing "Delta Degrees of Freedom":
the divisor used in the calculation is |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), since this method can cache the window size computation.
Expr
df_temporal = pl$DataFrame(
  date = pl$datetime_range(as.Date("2001-1-1"), as.Date("2001-1-2"), "1h")
)$with_row_index("index")

df_temporal

# Compute the rolling std with the temporal windows closed on the right (default)
df_temporal$with_columns(
  rolling_row_std = pl$col("index")$rolling_std_by("date", window_size = "2h")
)

# Compute the rolling std with the closure of windows on both sides
df_temporal$with_columns(
  rolling_row_std = pl$col("index")$rolling_std_by("date", window_size = "2h", closed = "both")
)
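The two results differ because the closed argument changes how many observations each window contains. The base R sketch below counts window members for hourly data and a 2-hour window; in_window() is a hypothetical helper, and the four values "right", "left", "both" and "none", together with their interval definitions, are assumptions about the accepted options.

```r
# Hypothetical helper: which observations fall in the window ending at `t`?
in_window <- function(times, t, period, closed = "right") {
  lower <- t - period
  switch(closed,
    right = times > lower & times <= t,   # (t - period, t]
    left  = times >= lower & times < t,   # [t - period, t)
    both  = times >= lower & times <= t,  # [t - period, t]
    none  = times > lower & times < t,    # (t - period, t)
    stop("unknown value for `closed`")
  )
}

hours <- 0:4  # hourly observations, expressed in hours
counts <- sapply(c("right", "left", "both", "none"), function(side) {
  sapply(hours, function(t) sum(in_window(hours, t, period = 2, closed = side)))
})
print(counts)
```

In this sketch a right-closed 2-hour window holds two hourly observations once enough data is available, whereas a window closed on both sides holds three.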
Compute the rolling (= moving) sum over the values in this array. A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied by the weights given in the weights vector.
Expr_rolling_sum( window_size, weights = NULL, min_periods = NULL, center = FALSE )
window_size |
Integer specifying the length of the window. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
The number of values in the window that should be non-null
before computing a result. If |
center |
Set the labels at the center of the window |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), since this method can cache the window size computation.
Expr
pl$DataFrame(a = c(1, 3, 2, 4, 5, 6))$
  with_columns(roll_sum = pl$col("a")$rolling_sum(window_size = 2))
Apply a rolling sum based on another column.
Expr_rolling_sum_by(by, window_size, ..., min_periods = 1, closed = "right")
by |
This column must be of dtype |