Package 'polars'

Title: Lightning-Fast 'DataFrame' Library
Description: Lightning-fast 'DataFrame' library written in 'Rust'. Convert R data to 'Polars' data and vice versa. Perform fast, lazy, larger-than-memory and optimized data queries. 'Polars' is interoperable with the package 'arrow', as both are based on the 'Apache Arrow' Columnar Format.
Authors: Ritchie Vink [aut], Soren Welling [aut, cre], Tatsuya Shima [aut], Etienne Bacher [aut]
Maintainer: Soren Welling
License: MIT + file LICENSE
Version: 0.19.1
Built: 2024-08-31 10:16:28 UTC
Source: https://github.com/pola-rs/r-polars

Extract Parts of a Polars Object

Description

Mimics the behavior of x[i, j, drop = TRUE] (see Extract) for a data.frame or an R vector.

Usage

## S3 method for class 'RPolarsDataFrame'
x[i, j, drop = TRUE]

## S3 method for class 'RPolarsLazyFrame'
x[i, j, drop = TRUE]

## S3 method for class 'RPolarsSeries'
x[i]

Arguments

x

A DataFrame, LazyFrame, or Series

i

Rows to select. Integer vector, logical vector, or an Expression.

j

Columns to select. Integer vector, logical vector, character vector, or an Expression. For LazyFrames, only an Expression can be used.

drop

Convert to a Polars Series if only one column is selected. For LazyFrames, if the result has one column and drop = TRUE, an error will occur.

Details

⁠<Series>[i]⁠ is equivalent to ⁠pl$select(<Series>)[i, , drop = TRUE]⁠.

See Also

<DataFrame>$select(), <LazyFrame>$select(), <DataFrame>$filter(), <LazyFrame>$filter()

Examples

df = pl$DataFrame(data.frame(a = 1:3, b = letters[1:3]))
lf = df$lazy()

# Select a row
df[1, ]

# If only `i` is specified, it is treated as `j`
# Select a column
df[1]

# Select a column by name (and convert to a Series)
df[, "b"]

# Can use Expression for filtering and column selection
lf[pl$col("a") >= 2, pl$col("b")$alias("new"), drop = FALSE] |>
  as.data.frame()
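
The following is a minimal sketch of two points made above: the equivalence stated in Details for Series, and the effect of drop on a single-column selection. The object names (s, via_series, via_select, col_series, col_frame) are illustrative only.

```r
library(polars)

# <Series>[i] versus the equivalent pl$select(<Series>)[i, , drop = TRUE]
s = as_polars_series(c(10, 20, 30), name = "x")
via_series = s[2:3]
via_select = pl$select(s)[2:3, , drop = TRUE]
as.vector(via_series)
as.vector(via_select)

# With a DataFrame, `drop` decides whether one selected column becomes a Series
df = pl$DataFrame(data.frame(a = 1:3, b = letters[1:3]))
col_series = df[, "b"]               # drop = TRUE is the default
col_frame = df[, "b", drop = FALSE]  # stays a one-column DataFrame
class(col_series)
class(col_frame)
```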

Create an arrow Table from a Polars object

Description

Create an arrow Table from a Polars object

Usage

## S3 method for class 'RPolarsDataFrame'
as_arrow_table(x, ..., compat_level = FALSE)

Arguments

x

A Polars DataFrame

...

Ignored

compat_level

Use a specific compatibility level when exporting Polars’ internal data structures. This can be:

  • an integer indicating the compatibility version (currently only 0 for oldest and 1 for newest);

  • a logical value with TRUE for the newest version and FALSE for the oldest version.

Examples

library(arrow)

pl_df = as_polars_df(mtcars)
as_arrow_table(pl_df)

Create a nanoarrow_array_stream from a Polars object

Description

Create a nanoarrow_array_stream from a Polars object

Usage

## S3 method for class 'RPolarsDataFrame'
as_nanoarrow_array_stream(x, ..., schema = NULL, compat_level = FALSE)

## S3 method for class 'RPolarsSeries'
as_nanoarrow_array_stream(x, ..., schema = NULL, compat_level = FALSE)

Arguments

x

A polars object

...

Ignored

schema

Must stay at the default value NULL.

compat_level

Use a specific compatibility level when exporting Polars’ internal data structures. This can be:

  • an integer indicating the compatibility version (currently only 0 for oldest and 1 for newest);

  • a logical value with TRUE for the newest version and FALSE for the oldest version.

Examples

library(nanoarrow)

pl_df = as_polars_df(mtcars)$head(5)
pl_s = as_polars_series(letters[1:5])

as.data.frame(as_nanoarrow_array_stream(pl_df))
as.vector(as_nanoarrow_array_stream(pl_s))

To polars DataFrame

Description

as_polars_df() is a generic function that converts an R object to a polars DataFrame.

Usage

as_polars_df(x, ...)

## Default S3 method:
as_polars_df(x, ...)

## S3 method for class 'data.frame'
as_polars_df(
  x,
  ...,
  rownames = NULL,
  make_names_unique = TRUE,
  schema = NULL,
  schema_overrides = NULL
)

## S3 method for class 'RPolarsDataFrame'
as_polars_df(x, ...)

## S3 method for class 'RPolarsGroupBy'
as_polars_df(x, ...)

## S3 method for class 'RPolarsRollingGroupBy'
as_polars_df(x, ...)

## S3 method for class 'RPolarsDynamicGroupBy'
as_polars_df(x, ...)

## S3 method for class 'RPolarsSeries'
as_polars_df(x, ...)

## S3 method for class 'RPolarsLazyFrame'
as_polars_df(
  x,
  n_rows = Inf,
  ...,
  type_coercion = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  comm_subplan_elim = TRUE,
  comm_subexpr_elim = TRUE,
  cluster_with_columns = TRUE,
  streaming = FALSE,
  no_optimization = FALSE,
  collect_in_background = FALSE
)

## S3 method for class 'RPolarsLazyGroupBy'
as_polars_df(x, ...)

## S3 method for class 'ArrowTabular'
as_polars_df(
  x,
  ...,
  rechunk = TRUE,
  schema = NULL,
  schema_overrides = NULL,
  experimental = FALSE
)

## S3 method for class 'RecordBatchReader'
as_polars_df(x, ..., experimental = FALSE)

## S3 method for class 'nanoarrow_array'
as_polars_df(x, ...)

## S3 method for class 'nanoarrow_array_stream'
as_polars_df(x, ..., experimental = FALSE)

Arguments

x

Object to convert to a polars DataFrame.

...

Additional arguments passed to methods.

rownames

How to treat existing row names of a data frame:

  • NULL: Remove row names. This is the default.

  • A string: The name of a new column, which will contain the row names. If x already has a column with that name, an error is thrown.

make_names_unique

A logical flag to replace duplicated column names with unique names. If FALSE and there are duplicated column names, an error is thrown.

schema

Named list of DataTypes, or character vector of column names. It should have one entry per column of x, matched to the columns of x by position. If a column in x does not match the name or type at the same position, it will be renamed/recast. If NULL (default), columns are converted as is.

schema_overrides

Named list of DataTypes. Casts the named columns to the given DataTypes.

n_rows

Number of rows to fetch. Defaults to Inf, meaning all rows.

type_coercion

Logical. Coerce types such that operations succeed and run on minimal required memory.

predicate_pushdown

Logical. Applies filters as early as possible at scan level.

projection_pushdown

Logical. Select only the columns that are needed at the scan level.

simplify_expression

Logical. Various optimizations, such as constant folding and replacing expensive operations with faster alternatives.

slice_pushdown

Logical. Only load the required slice from the scan level. Don't materialize sliced outputs (e.g. join$head(10)).

comm_subplan_elim

Logical. Will try to cache branching subplans that occur on self-joins or unions.

comm_subexpr_elim

Logical. Common subexpressions will be cached and reused.

cluster_with_columns

Combine sequential independent calls to with_columns().

streaming

Logical. Run parts of the query in a streaming fashion (this is in an alpha state).

no_optimization

Logical. Sets the following parameters to FALSE: predicate_pushdown, projection_pushdown, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns.

collect_in_background

Logical. Detach this query from the R session. Computation will start in the background, and a handle is returned which can later be converted into the resulting DataFrame. Useful in interactive mode to avoid locking the R session.

rechunk

A logical flag (default TRUE). Make sure that all data of each column is in contiguous memory.

experimental

If TRUE, use experimental Arrow C stream interface inside the function. This argument is experimental and may be removed in the future.

Details

For LazyFrame objects, this function is a shortcut for $collect() or $fetch(), depending on whether the number of rows to fetch is infinite or not.

Value

a DataFrame

Examples

# Convert the row names of a data frame to a column
as_polars_df(mtcars, rownames = "car")

# Convert a data frame, with renaming all columns
as_polars_df(
  data.frame(x = 1, y = 2),
  schema = c("a", "b")
)

# Convert a data frame, with renaming and casting all columns
as_polars_df(
  data.frame(x = 1, y = 2),
  schema = list(b = pl$Int64, a = pl$String)
)

# Convert a data frame, with casting some columns
as_polars_df(
  data.frame(x = 1, y = 2),
  schema_overrides = list(y = pl$String) # cast some columns
)

# Convert an arrow Table to a polars DataFrame
at = arrow::arrow_table(x = 1:5, y = 6:10)
as_polars_df(at)

# Create a polars LazyFrame from a data.frame
lf = as_polars_df(mtcars)$lazy()

# Collect all rows from the LazyFrame
as_polars_df(lf)

# Fetch 5 rows from the LazyFrame
as_polars_df(lf, 5)
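
As noted in Details, the LazyFrame method collects all rows when n_rows is infinite and fetches a limited number of rows otherwise. A minimal sketch of that behavior, using illustrative object names (full, first5):

```r
library(polars)

lf = as_polars_df(mtcars)$lazy()

# n_rows = Inf (the default): same result as collecting the LazyFrame
full = as_polars_df(lf)
full$equals(lf$collect())

# A finite n_rows limits the number of rows read from the source
first5 = as_polars_df(lf, n_rows = 5)
first5$height
```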

To polars LazyFrame

Description

as_polars_lf() is a generic function that converts an R object to a polars LazyFrame. It is basically a shortcut for as_polars_df(x, ...) with the $lazy() method.

Usage

as_polars_lf(x, ...)

## Default S3 method:
as_polars_lf(x, ...)

## S3 method for class 'RPolarsLazyFrame'
as_polars_lf(x, ...)

## S3 method for class 'RPolarsLazyGroupBy'
as_polars_lf(x, ...)

Arguments

x

Object to convert to a polars LazyFrame.

...

Additional arguments passed to methods.

Value

a LazyFrame

Examples

as_polars_lf(mtcars)
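
The Description says that as_polars_lf() is basically a shortcut for as_polars_df(x, ...) followed by $lazy(). A small sketch that checks this on mtcars (lf1 and lf2 are illustrative names):

```r
library(polars)

lf1 = as_polars_lf(mtcars)
lf2 = as_polars_df(mtcars)$lazy()

# Both are LazyFrames and collect to the same DataFrame
class(lf1)
lf1$collect()$equals(lf2$collect())
```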

To polars Series

Description

as_polars_series() is a generic function that converts an R object to a polars Series.

Usage

as_polars_series(x, name = NULL, ...)

## Default S3 method:
as_polars_series(x, name = NULL, ...)

## S3 method for class 'RPolarsSeries'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'RPolarsExpr'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'RPolarsThen'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'RPolarsChainedThen'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'POSIXlt'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'data.frame'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'vctrs_rcrd'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'Array'
as_polars_series(x, name = NULL, ..., rechunk = TRUE)

## S3 method for class 'ChunkedArray'
as_polars_series(x, name = NULL, ..., rechunk = TRUE)

## S3 method for class 'RecordBatchReader'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'nanoarrow_array'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'nanoarrow_array_stream'
as_polars_series(x, name = NULL, ..., experimental = FALSE)

## S3 method for class 'clock_time_point'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'clock_sys_time'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'clock_zoned_time'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'list'
as_polars_series(x, name = NULL, ...)

Arguments

x

Object to convert into a polars Series.

name

A string to use as the name of the Series. If NULL (default), the name of x is used, or an empty string "" if x has no name.

...

Additional arguments passed to methods.

rechunk

A logical flag (default TRUE). Make sure that all data is in contiguous memory.

experimental

If TRUE, use experimental Arrow C stream interface inside the function. This argument is experimental and may be removed in the future.

Value

a Series

Examples

as_polars_series(1:4)

as_polars_series(list(1:4))

as_polars_series(data.frame(a = 1:4))

as_polars_series(as_polars_series(1:4, name = "foo"))

as_polars_series(pl$lit(1:4))

# Nested type support
as_polars_series(list(data.frame(a = I(list(1:4)))))
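
The effect of the name argument can be sketched as follows. This assumes the $name active binding of a Series, which is not described in this section; the object names are illustrative.

```r
library(polars)

# A plain R vector has no name, so the Series name defaults to ""
unnamed = as_polars_series(1:3)
unnamed$name

# An explicit name is used as is
named = as_polars_series(1:3, name = "foo")
named$name

# Converting an existing Series keeps its name when `name` is NULL
kept = as_polars_series(named)
kept$name
```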

Create an arrow RecordBatchReader from a Polars object

Description

Create an arrow RecordBatchReader from a Polars object

Usage

## S3 method for class 'RPolarsDataFrame'
as_record_batch_reader(x, ..., compat_level = FALSE)

Arguments

x

A Polars DataFrame

...

Ignored

compat_level

Use a specific compatibility level when exporting Polars’ internal data structures. This can be:

  • an integer indicating the compatibility version (currently only 0 for oldest and 1 for newest);

  • a logical value with TRUE for the newest version and FALSE for the oldest version.

Examples

library(arrow)

pl_df = as_polars_df(mtcars)
as_record_batch_reader(pl_df)

Convert to a character vector

Description

Convert to a character vector

Usage

## S3 method for class 'RPolarsSeries'
as.character(x, ..., str_length = NULL)

Arguments

x

A Polars Series

...

Not used.

str_length

An integer. If specified, utf8 or categorical type Series will be formatted to a string of this length.

Examples

s = as_polars_series(c("foo", "barbaz"))
as.character(s)
as.character(s, str_length = 3)

Convert to a data.frame

Description

Equivalent to as_polars_df(x, ...)$to_data_frame(...).

Usage

## S3 method for class 'RPolarsDataFrame'
as.data.frame(x, ..., int64_conversion = polars_options()$int64_conversion)

## S3 method for class 'RPolarsLazyFrame'
as.data.frame(
  x,
  ...,
  n_rows = Inf,
  type_coercion = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  comm_subplan_elim = TRUE,
  comm_subexpr_elim = TRUE,
  cluster_with_columns = TRUE,
  streaming = FALSE,
  no_optimization = FALSE,
  collect_in_background = FALSE
)

Arguments

x

An object to convert to a data.frame.

...

Additional arguments passed to methods.

int64_conversion

How should Int64 values be handled when converting a polars object to R?

  • "double" (default) converts the integer values to double.

  • "bit64" uses bit64::as.integer64() to do the conversion (requires the package bit64 to be attached).

  • "string" converts Int64 values to character.

n_rows

Number of rows to fetch. Defaults to Inf, meaning all rows.

type_coercion

Logical. Coerce types such that operations succeed and run on minimal required memory.

predicate_pushdown

Logical. Applies filters as early as possible at scan level.

projection_pushdown

Logical. Select only the columns that are needed at the scan level.

simplify_expression

Logical. Various optimizations, such as constant folding and replacing expensive operations with faster alternatives.

slice_pushdown

Logical. Only load the required slice from the scan level. Don't materialize sliced outputs (e.g. join$head(10)).

comm_subplan_elim

Logical. Will try to cache branching subplans that occur on self-joins or unions.

comm_subexpr_elim

Logical. Common subexpressions will be cached and reused.

cluster_with_columns

Combine sequential independent calls to with_columns().

streaming

Logical. Run parts of the query in a streaming fashion (this is in an alpha state).

no_optimization

Logical. Sets the following parameters to FALSE: predicate_pushdown, projection_pushdown, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns.

collect_in_background

Logical. Detach this query from the R session. Computation will start in the background, and a handle is returned which can later be converted into the resulting DataFrame. Useful in interactive mode to avoid locking the R session.
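
A minimal sketch of the int64_conversion argument described above, comparing the "double" and "string" options on a small Int64 column. The cast to Int64 is only there to produce such a column for illustration; the "bit64" option is left out because it needs the bit64 package.

```r
library(polars)

# Build a DataFrame with an Int64 column
df = pl$DataFrame(x = 1:2)$with_columns(pl$col("x")$cast(pl$Int64))

as_double = as.data.frame(df, int64_conversion = "double")
as_string = as.data.frame(df, int64_conversion = "string")

class(as_double$x)
class(as_string$x)
```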

Conversion to R data types considerations

When converting Polars objects such as DataFrames to R objects, for example via the as.data.frame() generic function, each type in the Polars object is converted to an R type. In some cases, the conversion is not possible and an error occurs. This is particularly likely when converting a Datetime type without a time zone. A Datetime without a time zone in Polars is converted to the POSIXct type in R, which takes into account the time zone in which the R session is running (which can be checked with the Sys.timezone() function). If the data contains times that are ambiguous or non-existent in that time zone, a conversion error will occur. In such cases, either change the session time zone using Sys.setenv(TZ = "UTC") and then perform the conversion, or use the $dt$replace_time_zone() method on the Datetime column to explicitly specify the time zone before conversion.

# Due to daylight savings, clocks were turned forward 1 hour on Sunday, March 8, 2020, 2:00:00 am
# so this particular date-time doesn't exist
non_existent_time = as_polars_series("2020-03-08 02:00:00")$str$strptime(pl$Datetime(), "%F %T")

withr::with_timezone(
  "America/New_York",
  {
    tryCatch(
      # This causes an error due to the time zone (the `TZ` env var is affected).
      as.vector(non_existent_time),
      error = function(e) e
    )
  }
)
#> <error: in to_r: ComputeError(ErrString("datetime '2020-03-08 02:00:00' is non-existent in time zone 'America/New_York'. You may be able to use `non_existent='null'` to return `null` in this case.")) When calling: devtools::document()>

withr::with_timezone(
  "America/New_York",
  {
    # This is safe.
    as.vector(non_existent_time$dt$replace_time_zone("UTC"))
  }
)
#> [1] "2020-03-08 02:00:00 UTC"

Convert to a matrix

Description

Equivalent to as.data.frame(x, ...) |> as.matrix().

Usage

## S3 method for class 'RPolarsDataFrame'
as.matrix(x, ...)

## S3 method for class 'RPolarsLazyFrame'
as.matrix(x, ...)

Arguments

x

An object to convert to a matrix.

...

Additional arguments passed to methods.
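
Examples

A short sketch of the equivalence stated in the Description, using mtcars (m1 and m2 are illustrative names):

```r
library(polars)

df = as_polars_df(mtcars)

m1 = as.matrix(df)
m2 = as.data.frame(df) |> as.matrix()

# Both routes give the same matrix
identical(m1, m2)
dim(m1)
```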


Convert to a vector

Description

Convert to a vector

Usage

## S3 method for class 'RPolarsSeries'
as.vector(x, mode)

Arguments

x

A Polars Series

mode

Not used.

Conversion to R data types considerations

When converting Polars objects such as DataFrames to R objects, for example via the as.data.frame() generic function, each type in the Polars object is converted to an R type. In some cases, the conversion is not possible and an error occurs. This is particularly likely when converting a Datetime type without a time zone. A Datetime without a time zone in Polars is converted to the POSIXct type in R, which takes into account the time zone in which the R session is running (which can be checked with the Sys.timezone() function). If the data contains times that are ambiguous or non-existent in that time zone, a conversion error will occur. In such cases, either change the session time zone using Sys.setenv(TZ = "UTC") and then perform the conversion, or use the $dt$replace_time_zone() method on the Datetime column to explicitly specify the time zone before conversion.

# Due to daylight savings, clocks were turned forward 1 hour on Sunday, March 8, 2020, 2:00:00 am
# so this particular date-time doesn't exist
non_existent_time = as_polars_series("2020-03-08 02:00:00")$str$strptime(pl$Datetime(), "%F %T")

withr::with_timezone(
  "America/New_York",
  {
    tryCatch(
      # This causes an error due to the time zone (the `TZ` env var is affected).
      as.vector(non_existent_time),
      error = function(e) e
    )
  }
)
#> <error: in to_r: ComputeError(ErrString("datetime '2020-03-08 02:00:00' is non-existent in time zone 'America/New_York'. You may be able to use `non_existent='null'` to return `null` in this case.")) When calling: devtools::document()>

withr::with_timezone(
  "America/New_York",
  {
    # This is safe.
    as.vector(non_existent_time$dt$replace_time_zone("UTC"))
  }
)
#> [1] "2020-03-08 02:00:00 UTC"

Combine to a Series

Description

Combine to a Series

Usage

## S3 method for class 'RPolarsSeries'
c(x, ...)

Arguments

x

A Polars Series

...

Series(s) or any object that can be converted to a Series.

Details

All objects must have the same datatype. Combining does not rechunk. Read more about R vectors, Series and chunks in docs_translations.

Value

a combined Series

Examples

s = c(as_polars_series(1:5), 3:1, NA_integer_)
s$chunk_lengths() # the Series contains three unmerged chunks

Inner workings of the DataFrame-class

Description

The DataFrame-class consists of two environments, holding the public and the private methods/function calls to the polars Rust side, respectively. The instantiated DataFrame-object is an externalptr to a low-level Rust polars DataFrame object.

The S3 method .DollarNames.RPolarsDataFrame exposes all public $foobar()-methods which are callable on the object. Most methods return another DataFrame-class instance or similar, which allows for method chaining. This class system could be called "environment classes" (for lack of a better name) and is the same class system extendr provides, except that here there are both a public and a private set of methods. For implementation reasons, the private methods are external and must be called via .pr$DataFrame$methodname(). Also, all private methods must take self as an argument, and thus they are pure functions. Having the private methods as pure functions solved/simplified self-referential complications.

Details

Check out the source code in R/dataframe_frame.R to see how public methods are derived from private methods. Check out extendr-wrappers.R to see the extendr-auto-generated methods. These are moved to .pr and converted into pure external functions in after-wrappers.R. In zzz.R (named zzz to be the last file sourced), the extendr-methods are removed and replaced by any function prefixed with DataFrame_.

Active bindings

columns

⁠$columns⁠ returns a character vector with the column names.

dtypes

$dtypes returns an unnamed list with the data type of each column.

flags

⁠$flags⁠ returns a nested list with column names at the top level and column flags in each sublist.

Flags are used internally to avoid doing unnecessary computations, such as sorting a variable that we know is already sorted. The number of flags varies depending on the column type: columns of type array and list have the flags SORTED_ASC, SORTED_DESC, and FAST_EXPLODE, while other column types only have the former two.

  • SORTED_ASC is set to TRUE when we sort a column in increasing order, so that we can use this information later on to avoid re-sorting it.

  • SORTED_DESC is similar but applies to sort in decreasing order.

height

⁠$height⁠ returns the number of rows in the DataFrame.

schema

⁠$schema⁠ returns a named list with the data type of each column.

shape

⁠$shape⁠ returns a numeric vector of length two with the number of rows and the number of columns.

width

⁠$width⁠ returns the number of columns in the DataFrame.

Conversion to R data types considerations

When converting Polars objects such as DataFrames to R objects, for example via the as.data.frame() generic function, each type in the Polars object is converted to an R type. In some cases, the conversion is not possible and an error occurs. This is particularly likely when converting a Datetime type without a time zone. A Datetime without a time zone in Polars is converted to the POSIXct type in R, which takes into account the time zone in which the R session is running (which can be checked with the Sys.timezone() function). If the data contains times that are ambiguous or non-existent in that time zone, a conversion error will occur. In such cases, either change the session time zone using Sys.setenv(TZ = "UTC") and then perform the conversion, or use the $dt$replace_time_zone() method on the Datetime column to explicitly specify the time zone before conversion.

# Due to daylight savings, clocks were turned forward 1 hour on Sunday, March 8, 2020, 2:00:00 am
# so this particular date-time doesn't exist
non_existent_time = as_polars_series("2020-03-08 02:00:00")$str$strptime(pl$Datetime(), "%F %T")

withr::with_timezone(
  "America/New_York",
  {
    tryCatch(
      # This causes an error due to the time zone (the `TZ` env var is affected).
      as.vector(non_existent_time),
      error = function(e) e
    )
  }
)
#> <error: in to_r: ComputeError(ErrString("datetime '2020-03-08 02:00:00' is non-existent in time zone 'America/New_York'. You may be able to use `non_existent='null'` to return `null` in this case.")) When calling: devtools::document()>

withr::with_timezone(
  "America/New_York",
  {
    # This is safe.
    as.vector(non_existent_time$dt$replace_time_zone("UTC"))
  }
)
#> [1] "2020-03-08 02:00:00 UTC"

Examples

# see all public exported method names (normally accessed via a class
# instance with $)
ls(.pr$env$RPolarsDataFrame)

# see all private methods (not intended for regular use)
ls(.pr$DataFrame)

# make an object
df = as_polars_df(iris)

# call an active binding
df$shape

# use a private method, which has mutability
result = .pr$DataFrame$set_column_from_robj(df, 150:1, "some_ints")

# The column now exists in both DataFrame objects, as they are just pointers
# to the same object.
# There are no public methods with mutability.
df2 = df

df$columns
df2$columns

# Show flags
df$sort("Sepal.Length")$flags

# The set_column_from_robj method is fallible and returns a result, which
# can be "ok" or an error.
# No public method or function will ever return a result.
# The `result` is very similar to the output of functions decorated with
# purrr::safely.
# To use results on the R side, they must be unwrapped first so that
# potential errors can be thrown. `unwrap(result)` is a way to communicate
# errors happening on the Rust side to the R side. The default behavior of
# `Extendr` is to use `panic!`(s), which would cause unnecessarily confusing
# and very verbose error messages about the inner workings of Rust.
# In this case, `unwrap(result)` raises no error and just returns NULL,
# because this mutable method does not return any ok-value.

# Try unwrapping an error from polars due to unmatching column lengths
err_result = .pr$DataFrame$set_column_from_robj(df, 1:10000, "wrong_length")
tryCatch(unwrap(err_result, call = NULL), error = \(e) cat(as.character(e)))
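
The active bindings listed above can be checked against the underlying R data. A minimal sketch using the built-in iris data set (150 rows, 5 columns):

```r
library(polars)

df = as_polars_df(iris)

df$height          # number of rows
df$width           # number of columns
df$shape           # rows and columns together
df$columns         # column names as a character vector
names(df$schema)   # the schema is a named list, one entry per column
length(df$dtypes)  # one data type per column
```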

Create an empty or n-row null-filled copy of the DataFrame

Description

Returns an n-row null-filled DataFrame with an identical schema. n can be greater than the current number of rows in the DataFrame.

Usage

DataFrame_clear(n = 0)

Arguments

n

Number of (null-filled) rows to return in the cleared frame.

Value

An n-row null-filled DataFrame with an identical schema

Examples

df = pl$DataFrame(
  a = c(NA, 2, 3, 4),
  b = c(0.5, NA, 2.5, 13),
  c = c(TRUE, TRUE, FALSE, NA)
)

df$clear()

df$clear(n = 5)

Clone a DataFrame

Description

This makes a very cheap deep copy/clone of an existing DataFrame. It is rarely useful, as DataFrames are nearly 100% immutable. Any modification of a DataFrame should lead to a clone anyway, but this can be useful when dealing with attributes (see examples).

Usage

DataFrame_clone()

Value

A DataFrame

Examples

df1 = pl$DataFrame(iris)

# Make a function to take a DataFrame, add an attribute, and return a DataFrame
give_attr = function(data) {
  attr(data, "created_on") = "2024-01-29"
  data
}
df2 = give_attr(df1)

# Problem: the original DataFrame also gets the attribute while it shouldn't!
attributes(df1)

# Use $clone() inside the function to avoid that
give_attr = function(data) {
  data = data$clone()
  attr(data, "created_on") = "2024-01-29"
  data
}
df1 = pl$DataFrame(iris)
df2 = give_attr(df1)

# now, the original DataFrame doesn't get this attribute
attributes(df1)

Summary statistics for a DataFrame

Description

This returns the total number of rows, the number of missing values, the mean, standard deviation, min, max, median and the percentiles specified in the argument percentiles.

Usage

DataFrame_describe(percentiles = c(0.25, 0.75), interpolation = "nearest")

Arguments

percentiles

One or more percentiles to include in the summary statistics. All values must be in the range [0, 1].

interpolation

Interpolation method for computing quantiles. One of "nearest", "higher", "lower", "midpoint", or "linear".

Value

DataFrame

Examples

pl$DataFrame(iris)$describe()

# string, date, boolean columns are also supported:
df = pl$DataFrame(
  int = 1:3,
  string = c(letters[1:2], NA),
  date = c(as.Date("2024-01-20"), as.Date("2024-01-21"), NA),
  cat = factor(c(letters[1:2], NA)),
  bool = c(TRUE, FALSE, NA)
)
df

df$describe()

Drop columns of a DataFrame

Description

Drop columns of a DataFrame

Usage

DataFrame_drop(...)

Arguments

...

Character vectors of column names to drop. Passed to pl$col().

Value

DataFrame

Examples

pl$DataFrame(mtcars)$drop(c("mpg", "hp"))

# equivalent
pl$DataFrame(mtcars)$drop("mpg", "hp")

Drop in place

Description

Drop a single column in-place and return the dropped column.

Usage

DataFrame_drop_in_place(name)

Arguments

name

String. Name of the column to drop.

Value

Series

Examples

dat = pl$DataFrame(iris)
x = dat$drop_in_place("Species")
x
dat$columns

Drop nulls (missing values)

Description

Drop all rows that contain nulls (which correspond to NA in R).

Usage

DataFrame_drop_nulls(subset = NULL)

Arguments

subset

A character vector with the names of the column(s) for which nulls are considered. If NULL (default), use all columns.

Value

DataFrame

Examples

tmp = mtcars
tmp[1:3, "mpg"] = NA
tmp[4, "hp"] = NA
tmp = pl$DataFrame(tmp)

# number of rows in `tmp` before dropping nulls
tmp$height

tmp$drop_nulls()$height
tmp$drop_nulls("mpg")$height
tmp$drop_nulls(c("mpg", "hp"))$height

Data types information

Description

Get the data type of all columns as strings. You can see all available types with names(pl$dtypes). The data type of each column is also shown when printing the DataFrame.

Usage

DataFrame_dtype_strings()

Value

A character vector with the data type of each column

Examples

pl$DataFrame(iris)$dtype_strings()

Compare two DataFrames

Description

Check if two DataFrames are equal.

Usage

DataFrame_equals(other)

Arguments

other

DataFrame to compare with.

Value

A logical value

Examples

dat1 = pl$DataFrame(iris)
dat2 = pl$DataFrame(iris)
dat3 = pl$DataFrame(mtcars)
dat1$equals(dat2)
dat1$equals(dat3)

Estimated size

Description

Return an estimation of the total (heap) allocated size of the DataFrame.

Usage

DataFrame_estimated_size()

Value

Estimated size in bytes

Examples

pl$DataFrame(mtcars)$estimated_size()

Explode columns containing a list of values

Description

Explode columns containing a list of values

Usage

DataFrame_explode(...)

Arguments

...

Column(s) to be exploded as individual Into<Expr> or list/vector of Into<Expr>. In a handful of places in rust-polars, only the plain variant Expr::Column is accepted. This is currently one of those places. Therefore pl$col("name") and pl$all() are allowed, but not pl$col("name")$alias("newname"). "name" is implicitly converted to pl$col("name").

Value

DataFrame

Examples

df = pl$DataFrame(
  letters = letters[1:4],
  numbers = list(1, c(2, 3), c(4, 5), c(6, 7, 8)),
  numbers_2 = list(0, c(1, 2), c(3, 4), c(5, 6, 7)) # same structure as numbers
)
df

# explode a single column, append others
df$explode("numbers")

# explode two columns of same nesting structure, by names or the common dtype
# "List(Float64)"
df$explode("numbers", "numbers_2")
df$explode(pl$col(pl$List(pl$Float64)))

Fill floating-point NaN values with a fill value

Description

Fill floating-point NaN values with a fill value

Usage

DataFrame_fill_nan(value)

Arguments

value

Value used to fill NaN values.

Value

DataFrame

Examples

df = pl$DataFrame(
  a = c(1.5, 2, NaN, 4),
  b = c(1.5, NaN, NaN, 4)
)
df$fill_nan(99)
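
NaN and null (NA in R) are distinct in Polars, so $fill_nan() leaves nulls untouched and $fill_null() leaves NaN values untouched. A minimal sketch on a column that contains both (the object names are illustrative):

```r
library(polars)

df = pl$DataFrame(a = c(1, NaN, NA))

# Only the NaN is replaced; the null stays null (NA in R)
nan_filled = as.data.frame(df$fill_nan(0))$a
nan_filled

# Only the null is replaced; the NaN stays NaN
null_filled = as.data.frame(df$fill_null(0))$a
null_filled
```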

Fill nulls

Description

Fill null values (which correspond to NA in R) with the specified value, which can be a literal value or an Expression (see examples).

Usage

DataFrame_fill_null(fill_value)

Arguments

fill_value

Value to fill nulls with.

Value

DataFrame

Examples

df = pl$DataFrame(
  a = c(1.5, 2, NA, 4),
  b = c(1.5, NA, NA, 4)
)

df$fill_null(99)

df$fill_null(pl$col("a")$mean())

Filter rows of a DataFrame

Description

Filter rows with an Expression defining a boolean column. Multiple expressions are combined with & (AND). This is equivalent to dplyr::filter().

Usage

DataFrame_filter(...)

Arguments

...

Polars expressions which will evaluate to a boolean.

Details

Rows where the condition returns NA are dropped.

Value

A DataFrame with only the rows where the conditions are TRUE.

Examples

df = pl$DataFrame(iris)

df$filter(pl$col("Sepal.Length") > 5)

# This is equivalent to
# df$filter(pl$col("Sepal.Length") > 5 & pl$col("Petal.Width") < 1)
df$filter(pl$col("Sepal.Length") > 5, pl$col("Petal.Width") < 1)

# rows where condition is NA are dropped
iris2 = iris
iris2[c(1, 3, 5), "Species"] = NA
df = pl$DataFrame(iris2)

df$filter(pl$col("Species") == "setosa")
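
If rows with a missing condition should be kept, one option is to combine the condition with ⁠$is_null()⁠. The sketch below uses made-up data and is only one way to do this.

```r
library(polars)

df = pl$DataFrame(x = c(1, NA, 3), y = c("a", "b", "c"))

# NA > 2 evaluates to null, so the second row is dropped
kept = df$filter(pl$col("x") > 2)

# keep null rows explicitly by OR-ing the condition with is_null()
kept_with_null = df$filter((pl$col("x") > 2) | pl$col("x")$is_null())

n_kept = nrow(as.data.frame(kept))
n_kept_with_null = nrow(as.data.frame(kept_with_null))
```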

Get the first row of the DataFrame.

Description

Get the first row of the DataFrame.

Usage

DataFrame_first()

Value

A DataFrame with one row.

Examples

pl$DataFrame(mtcars)$first()

Take every nth row in the DataFrame

Description

Take every nth row in the DataFrame

Usage

DataFrame_gather_every(n, offset = 0)

Arguments

n

Gather every n-th row.

offset

Starting index.

Value

A DataFrame

Examples

df = pl$DataFrame(a = 1:4, b = 5:8)
df$gather_every(2)

df$gather_every(2, offset = 1)
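
For comparison with base R, the sketch below assumes that offset is a 0-based starting index, so it is shifted by one when building the equivalent 1-based row sequence. The data and variable names are illustrative.

```r
library(polars)

d = data.frame(a = 1:6, b = letters[1:6])
n = 2
offset = 1

# polars: offset = 1 starts at the second row
res_polars = as.data.frame(pl$DataFrame(d)$gather_every(n, offset = offset))

# base R: convert the 0-based offset to a 1-based start index
res_base = d[seq(offset + 1, nrow(d), by = n), ]
```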

Get column (as one Series)

Description

Extract a DataFrame column as a Polars Series.

Usage

DataFrame_get_column(name)

Arguments

name

Name of the column to extract.

Value

Series

Examples

df = pl$DataFrame(iris[1:2, ])
df$get_column("Species")

Get the DataFrame as a List of Series

Description

Get the DataFrame as a List of Series

Usage

DataFrame_get_columns()

Value

A list of Series

Examples

df = pl$DataFrame(foo = 1L:3L, bar = 4L:6L)
df$get_columns()

df = pl$DataFrame(
  a = 1:4,
  b = c(0.5, 4, 10, 13),
  c = c(TRUE, TRUE, FALSE, TRUE)
)
df$get_columns()

Show a dense preview of the DataFrame

Description

The formatting shows one line per column so that wide DataFrames display cleanly. Each line shows the column name, the data type, and the first few values.

Usage

DataFrame_glimpse(
  ...,
  max_items_per_column = 10,
  max_colname_length = 50,
  return_as_string = FALSE
)

Arguments

...

Ignored.

max_items_per_column

Maximum number of items to show per column.

max_colname_length

Maximum length of the displayed column names. Values that exceed this value are truncated with a trailing ellipsis.

return_as_string

Logical (default FALSE). If TRUE, return the output as a string.

Value

DataFrame

Examples

pl$DataFrame(iris)$glimpse()

Group a DataFrame

Description

This doesn't modify the data but only stores information about the group structure. This structure can then be used by several functions (⁠$agg()⁠, ⁠$filter()⁠, etc.).

Usage

DataFrame_group_by(..., maintain_order = polars_options()$maintain_order)

Arguments

...

Column(s) to group by. Accepts expression input. Characters are parsed as column names.

maintain_order

Ensure that the order of the groups is consistent with the input data. This is slower than a default group by. Setting this to TRUE blocks the possibility to run on the streaming engine. The default value can be changed with options(polars.maintain_order = TRUE).

Details

Within each group, the order of the rows is always preserved, regardless of the maintain_order argument.

Value

GroupBy (a DataFrame with special groupby methods like ⁠$agg()⁠)

Examples

df = pl$DataFrame(
  a = c("a", "b", "a", "b", "c"),
  b = c(1, 2, 1, 3, 3),
  c = c(5, 4, 3, 2, 1)
)

df$group_by("a")$agg(pl$col("b")$sum())

# Set `maintain_order = TRUE` to ensure the order of the groups is consistent with the input.
df$group_by("a", maintain_order = TRUE)$agg(pl$col("c"))

# Group by multiple columns by passing a vector of column names.
df$group_by(c("a", "b"))$agg(pl$max("c"))

# Or pass some arguments to group by multiple columns in the same way.
# Expressions are also accepted.
df$group_by("a", pl$col("b") %/% 2)$agg(
  pl$col("c")$mean()
)

# The columns will be renamed to the argument names.
df$group_by(d = "a", e = pl$col("b") %/% 2)$agg(
  pl$col("c")$mean()
)
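
Because the group order is otherwise not guaranteed, code that depends on row positions in the result may want maintain_order = TRUE. A small sketch reusing the data above (the output column name b_sum is arbitrary):

```r
library(polars)

df = pl$DataFrame(
  a = c("a", "b", "a", "b", "c"),
  b = c(1, 2, 1, 3, 3)
)

# with maintain_order = TRUE, groups appear in order of first occurrence
res = df$group_by("a", maintain_order = TRUE)$agg(
  b_sum = pl$col("b")$sum()
) |>
  as.data.frame()
res
```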

Group based on a date/time or integer column

Description

If you have a time series ⁠<t_0, t_1, ..., t_n>⁠, then by default the windows created will be:

  • (t_0 - period, t_0]

  • (t_1 - period, t_1]

  • (t_n - period, t_n]

whereas if you pass a non-default offset, then the windows will be:

  • (t_0 + offset, t_0 + offset + period]

  • (t_1 + offset, t_1 + offset + period]

  • (t_n + offset, t_n + offset + period]

Usage

DataFrame_group_by_dynamic(
  index_column,
  ...,
  every,
  period = NULL,
  offset = NULL,
  include_boundaries = FALSE,
  closed = "left",
  label = "left",
  group_by = NULL,
  start_by = "window"
)

Arguments

index_column

Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if group_by is specified, then it must be sorted in ascending order within each group). In case of a rolling group by on indices, dtype needs to be either Int32 or Int64. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.

...

Ignored.

every

Interval of the window.

period

A character representing the length of the window, must be non-negative. See the ⁠Polars duration string language⁠ section for details.

offset

A character representing the offset of the window, or NULL (default). If NULL, -period is used. See the ⁠Polars duration string language⁠ section for details.

include_boundaries

Add two columns, "_lower_boundary" and "_upper_boundary", that show the boundaries of the window. This will impact performance because it’s harder to parallelize.

closed

Define which sides of the temporal interval are closed (inclusive). This can be either "left", "right", "both" or "none".

label

Define which label to use for the window:

  • "left": lower boundary of the window

  • "right": upper boundary of the window

  • "datapoint": the first value of the index column in the given window. If you don’t need the label to be at one of the boundaries, choose this option for maximum performance.

group_by

Also group by this column/these columns.

start_by

The strategy to determine the start of the first window by:

  • "window": start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.

  • "datapoint": start from the first encountered data point.

  • a day of the week (only takes effect if every contains "w"): "monday" starts the window on the Monday before the first data point, etc.

Details

In case of a rolling operation on an integer column, the windows are defined by:

  • "1i" # length 1

  • "10i" # length 10

Value

A GroupBy object

Examples

df = pl$DataFrame(
  time = pl$datetime_range(
    start = strptime("2021-12-16 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    end = strptime("2021-12-16 03:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    interval = "30m"
  ),
  n = 0:6
)

# get the sum in the following hour relative to the "time" column
df$group_by_dynamic("time", every = "1h")$agg(
  vals = pl$col("n"),
  sum = pl$col("n")$sum()
)

# using "include_boundaries = TRUE" is helpful to see the period considered
df$group_by_dynamic("time", every = "1h", include_boundaries = TRUE)$agg(
  vals = pl$col("n")
)

# in the example above, the values didn't include the one *exactly* 1h after
# the start because "closed = 'left'" by default.
# Changing it to "right" includes values that are exactly 1h after. Note that
# the value at 00:00:00 now becomes included in the interval [23:00:00 - 00:00:00],
# even if this interval wasn't there originally
df$group_by_dynamic("time", every = "1h", closed = "right")$agg(
  vals = pl$col("n")
)
# To keep both boundaries, we use "closed = 'both'". Some values now belong to
# several groups:
df$group_by_dynamic("time", every = "1h", closed = "both")$agg(
  vals = pl$col("n")
)

# Dynamic group bys can also be combined with grouping on normal keys
df = df$with_columns(
  groups = as_polars_series(c("a", "a", "a", "b", "b", "a", "a"))
)
df

df$group_by_dynamic(
  "time",
  every = "1h",
  closed = "both",
  group_by = "groups",
  include_boundaries = TRUE
)$agg(pl$col("n"))

# We can also create a dynamic group by based on an index column
df = pl$LazyFrame(
  idx = 0:5,
  A = c("A", "A", "B", "B", "B", "C")
)$with_columns(pl$col("idx")$set_sorted())
df

df$group_by_dynamic(
  "idx",
  every = "2i",
  period = "3i",
  include_boundaries = TRUE,
  closed = "right"
)$agg(A_agg_list = pl$col("A"))

Get the first n rows.

Description

Get the first n rows.

Usage

DataFrame_head(n = 5L)

DataFrame_limit(n = 5L)

Arguments

n

Number of rows to return. If a negative value is passed, return all rows except the last abs(n).

Details

⁠$limit()⁠ is an alias for ⁠$head()⁠.

Value

A DataFrame

Examples

df = pl$DataFrame(foo = 1:5, bar = 6:10, ham = letters[1:5])

df$head(3)

# Pass a negative value to get all rows except the last `abs(n)`.
df$head(-3)

Return the element at the given row/column.

Description

If row and column location are not specified, the DataFrame must have dimensions (1, 1).

Usage

DataFrame_item(row = NULL, column = NULL)

Arguments

row

Optional row index (0-indexed).

column

Optional column index (0-indexed) or name.

Value

A value of length 1

Examples

df = pl$DataFrame(a = c(1, 2, 3), b = c(4, 5, 6))

df$select((pl$col("a") * pl$col("b"))$sum())$item()

df$item(1, 1)

df$item(2, "b")

Join DataFrames

Description

This function can do both mutating joins (adding columns based on matching observations, for example with how = "left") and filtering joins (keeping observations based on matching observations, for example with how = "semi").

Usage

DataFrame_join(
  other,
  on = NULL,
  how = "inner",
  ...,
  left_on = NULL,
  right_on = NULL,
  suffix = "_right",
  validate = "m:m",
  join_nulls = FALSE,
  allow_parallel = TRUE,
  force_parallel = FALSE,
  coalesce = NULL
)

Arguments

other

DataFrame to join with.

on

Either a vector of column names or a list of expressions and/or strings. Use left_on and right_on if the column names to match on are different between the two DataFrames.

how

One of the following methods: "inner", "left", "right", "full", "semi", "anti", "cross".

...

Ignored.

left_on, right_on

Same as on but only for the left or the right DataFrame. They must have the same length.

suffix

Suffix to add to duplicated column names.

validate

Checks if join is of specified type:

  • "m:m" (default): many-to-many, doesn't perform any checks;

  • "1:1": one-to-one, check if join keys are unique in both left and right datasets;

  • "1:m": one-to-many, check if join keys are unique in left dataset

  • "m:1": many-to-one, check if join keys are unique in right dataset

Note that this is currently not supported by the streaming engine, and is only supported when joining by single columns.

join_nulls

Join on null values. By default null values will never produce matches.

allow_parallel

Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.

force_parallel

Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

coalesce

Coalescing behavior (merging of join columns).

  • NULL: join specific.

  • TRUE: Always coalesce join columns.

  • FALSE: Never coalesce join columns.

Value

DataFrame

Examples

# inner join by default
df1 = pl$DataFrame(list(key = 1:3, payload = c("f", "i", NA)))
df2 = pl$DataFrame(list(key = c(3L, 4L, 5L, NA_integer_)))
df1$join(other = df2, on = "key")

# cross join
df1 = pl$DataFrame(x = letters[1:3])
df2 = pl$DataFrame(y = 1:4)
df1$join(other = df2, how = "cross")
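
The effect of join_nulls can be seen by joining two small frames that both contain a null key. This is a sketch with made-up data.

```r
library(polars)

left = pl$DataFrame(key = c(1L, NA), l = c("x", "y"))
right = pl$DataFrame(key = c(1L, NA), r = c("p", "q"))

# default: null keys never match, so only key = 1 survives
n_default = nrow(as.data.frame(left$join(right, on = "key")))

# join_nulls = TRUE: the null key on the left matches the null key on the right
n_nulls = nrow(as.data.frame(left$join(right, on = "key", join_nulls = TRUE)))
```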

Perform joins on nearest keys

Description

This is similar to a left-join except that we match on nearest key rather than equal keys.

Usage

DataFrame_join_asof(
  other,
  ...,
  left_on = NULL,
  right_on = NULL,
  on = NULL,
  by_left = NULL,
  by_right = NULL,
  by = NULL,
  strategy = c("backward", "forward", "nearest"),
  suffix = "_right",
  tolerance = NULL,
  allow_parallel = TRUE,
  force_parallel = FALSE,
  coalesce = TRUE
)

Arguments

other

DataFrame or LazyFrame

...

Not used, blocks use of further positional arguments

left_on, right_on

Same as on but only for the left or the right DataFrame. They must have the same length.

on

Either a vector of column names or a list of expressions and/or strings. Use left_on and right_on if the column names to match on are different between the two DataFrames.

by_left, by_right

Same as by but only for the left or the right table. They must have the same length.

by

Join on these columns before performing asof join. Either a vector of column names or a list of expressions and/or strings. Use by_left and by_right if the column names to match on are different between the two tables.

strategy

Strategy for where to find match:

  • "backward" (default): search for the last row in the right table whose on key is less than or equal to the left key.

  • "forward": search for the first row in the right table whose on key is greater than or equal to the left key.

  • "nearest": search for the last row in the right table whose value is nearest to the left key. String keys are not currently supported for a nearest search.

suffix

Suffix to add to duplicated column names.

tolerance

Numeric tolerance. If set, the join is only performed when the nearest keys are within this distance. If an asof join is done on columns of dtype "Date", "Datetime", "Duration" or "Time", use the Polars duration string language. See the ⁠Polars duration string language⁠ section for details.

There may be circumstances where R types are not sufficient to express a numeric tolerance. In that case, you can use the expression syntax, like tolerance = pl$lit(42)$cast(pl$UInt64).

allow_parallel

Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.

force_parallel

Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

coalesce

Coalescing behavior (merging of on / left_on / right_on columns):

  • TRUE: Always coalesce join columns;

  • FALSE: Never coalesce join columns. Note that joining on any other expressions than col will turn off coalescing.

Details

Both tables (DataFrames or LazyFrames) must be sorted by the asof_join key.

Value

New joined DataFrame

Polars duration string language

Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.

It has the following format:

  • 1ns (1 nanosecond)

  • 1us (1 microsecond)

  • 1ms (1 millisecond)

  • 1s (1 second)

  • 1m (1 minute)

  • 1h (1 hour)

  • 1d (1 calendar day)

  • 1w (1 calendar week)

  • 1mo (1 calendar month)

  • 1q (1 calendar quarter)

  • 1y (1 calendar year)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".

Examples

# create two DataFrames to join asof
gdp = pl$DataFrame(
  date = as.Date(c("2015-1-1", "2016-1-1", "2017-5-1", "2018-1-1", "2019-1-1")),
  gdp = c(4321, 4164, 4411, 4566, 4696),
  group = c("b", "a", "a", "b", "b")
)

pop = pl$DataFrame(
  date = as.Date(c("2016-5-12", "2017-5-12", "2018-5-12", "2019-5-12")),
  population = c(82.19, 82.66, 83.12, 83.52),
  group = c("b", "b", "a", "a")
)

# optionally, make sure both tables are sorted by the "on" join key
gdp = gdp$sort("date")
pop = pop$sort("date")

# Left-join_asof DataFrame pop with gdp on "date"
# Look backward in gdp to find closest matching date
pop$join_asof(gdp, on = "date", strategy = "backward")

# .... and forward
pop$join_asof(gdp, on = "date", strategy = "forward")

# join by a group: "only look within the same group"
pop$join_asof(gdp, on = "date", by = "group", strategy = "backward")

# only look 2 weeks and 2 days back
pop$join_asof(gdp, on = "date", strategy = "backward", tolerance = "2w2d")

# only look 11 days back (numeric tolerance depends on polars type, <date> is in days)
pop$join_asof(gdp, on = "date", strategy = "backward", tolerance = 11)
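
The difference between the "backward" and "forward" strategies is easier to see on small integer keys. The following sketch uses made-up data; both frames are sorted by the key first.

```r
library(polars)

left = pl$DataFrame(t = c(5L, 10L))$sort("t")
right = pl$DataFrame(t = c(1L, 6L, 9L), val = c("a", "b", "c"))$sort("t")

# backward: last right key <= left key (5 -> 1, 10 -> 9)
back = as.data.frame(left$join_asof(right, on = "t", strategy = "backward"))

# forward: first right key >= left key (5 -> 6, 10 -> no match)
fwd = as.data.frame(left$join_asof(right, on = "t", strategy = "forward"))

back
fwd
```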

Get the last row of the DataFrame.

Description

Get the last row of the DataFrame.

Usage

DataFrame_last()

Value

A DataFrame with one row.

Examples

pl$DataFrame(mtcars)$last()

Convert an existing DataFrame to a LazyFrame

Description

Start a new lazy query from a DataFrame.

Usage

DataFrame_lazy()

Value

A LazyFrame

Examples

pl$DataFrame(iris)$lazy()

Max

Description

Aggregate the columns in the DataFrame to their maximum value.

Usage

DataFrame_max()

Value

A DataFrame with one row.

Examples

pl$DataFrame(mtcars)$max()

Mean

Description

Aggregate the columns in the DataFrame to their mean value.

Usage

DataFrame_mean()

Value

A DataFrame with one row.

Examples

pl$DataFrame(mtcars)$mean()

Median

Description

Aggregate the columns in the DataFrame to their median value.

Usage

DataFrame_median()

Value

A DataFrame with one row.

Examples

pl$DataFrame(mtcars)$median()

Min

Description

Aggregate the columns in the DataFrame to their minimum value.

Usage

DataFrame_min()

Value

A DataFrame with one row.

Examples

pl$DataFrame(mtcars)$min()

Number of chunks of the Series in a DataFrame

Description

Number of chunks (memory allocations) for all or first Series in a DataFrame.

Usage

DataFrame_n_chunks(strategy = "first")

Arguments

strategy

Either "all" or "first". "first" only returns chunks for the first Series.

Details

A DataFrame is a vector of Series. Each Series in rust-polars is a wrapper around a ChunkedArray, which is like a virtual contiguous vector physically backed by an ordered set of chunks. Each chunk of values has a contiguous memory layout and is an arrow array. Arrow arrays are a fast, thread-safe and cross-platform memory layout.

In R, combining with c() or rbind() requires immediate vector re-allocation to place vector values in contiguous memory. This is slow and memory consuming, and it is why repeatedly appending to a vector in R is discouraged.

In polars, when we concatenate or append to Series or DataFrame, the re-allocation can be avoided or delayed by simply appending chunks to each individual Series. However, if chunks become many and small or are misaligned across Series, this can hurt the performance of subsequent operations.

In most places in the polars API where chunking could occur, the user typically has to actively opt out of rechunking by setting the argument rechunk = FALSE.

Value

A real vector of chunk counts per Series.

See Also

<DataFrame>$rechunk()

Examples

# create DataFrame with misaligned chunks
df = pl$concat(
  1:10, # single chunk
  pl$concat(1:5, 1:5, rechunk = FALSE, how = "vertical")$rename("b"), # two chunks
  how = "horizontal"
)
df
df$n_chunks()

# rechunk a chunked DataFrame
df$rechunk()$n_chunks()

# rechunk is not an in-place operation
df$n_chunks()

# The following toy example emulates the Series "chunkyness" in R. Here it is an
# S3-classed list of vectors of the same type, with all relevant S3 generics
# implemented so that it behaves as if it were a regular vector.
"+.chunked_vector" = \(x, y) structure(list(unlist(x) + unlist(y)), class = "chunked_vector")
print.chunked_vector = \(x, ...) print(unlist(x), ...)
c.chunked_vector = \(...) {
  structure(do.call(c, lapply(list(...), unclass)), class = "chunked_vector")
}
rechunk = \(x) structure(unlist(x), class = "chunked_vector")
x = structure(list(1:4, 5L), class = "chunked_vector")
x
x + 5:1
lapply(x, tracemem) # trace chunks to verify no re-allocation
z = c(x, x)
z # looks like a plain vector
lapply(z, tracemem) # memory allocations in z are the same as in x
str(z)
z = rechunk(z)
str(z)

Count null values

Description

Create a new DataFrame that shows the null (which correspond to NA in R) counts per column.

Usage

DataFrame_null_count()

Value

DataFrame

Examples

x = mtcars
x[1, 2:3] = NA
pl$DataFrame(x)$null_count()

Split a DataFrame into multiple DataFrames

Description

Similar to $group_by(). Group by the given columns and return the groups as separate DataFrames. It is useful to use this in combination with functions like lapply() or purrr::map().

Usage

DataFrame_partition_by(
  ...,
  maintain_order = TRUE,
  include_key = TRUE,
  as_nested_list = FALSE
)

Arguments

...

Characters of column names to group by. Passed to pl$col().

maintain_order

If TRUE, ensure that the order of the groups is consistent with the input data. This is slower than a default partition by operation.

include_key

If TRUE, include the columns used to partition the DataFrame in the output.

as_nested_list

This affects the format of the output. If FALSE (default), the output is a flat list of DataFrames. If TRUE and at least one of the maintain_order or include_key arguments is TRUE, then each element of the output has two children: key and data. See the examples for more details.

Value

A list of DataFrames. See the examples for details.

Examples

df = pl$DataFrame(
  a = c("a", "b", "a", "b", "c"),
  b = c(1, 2, 1, 3, 3),
  c = c(5, 4, 3, 2, 1)
)
df

# Pass a single column name to partition by that column.
df$partition_by("a")

# Partition by multiple columns.
df$partition_by("a", "b")

# Partition by column data type
df$partition_by(pl$String)

# If `as_nested_list = TRUE`, the output is a list whose elements have a `key` and a `data` field.
# The `key` is a named list of the key values, and the `data` is the DataFrame.
df$partition_by("a", "b", as_nested_list = TRUE)

# `as_nested_list = TRUE` should be used with `maintain_order = TRUE` or `include_key = TRUE`.
tryCatch(
  df$partition_by("a", "b", maintain_order = FALSE, include_key = FALSE, as_nested_list = TRUE),
  warning = function(w) w
)

# Example of using with lapply(), and printing the key and the data summary
df$partition_by("a", "b", maintain_order = FALSE, as_nested_list = TRUE) |>
  lapply(\(x) {
    sprintf("\nThe key value of `a` is %s and the key value of `b` is %s\n", x$key$a, x$key$b) |>
      cat()
    x$data$drop(names(x$key))$describe() |>
      print()
    invisible(NULL)
  }) |>
  invisible()

Pivot data from long to wide

Description

Pivot data from long to wide

Usage

DataFrame_pivot(
  on,
  ...,
  index,
  values,
  aggregate_function = NULL,
  maintain_order = TRUE,
  sort_columns = FALSE,
  separator = "_"
)

Arguments

on

Name of the column(s) whose values will be used as the header of the output DataFrame.

...

Not used.

index

One or multiple keys to group by.

values

Column values to aggregate. Can be multiple columns if the on argument contains multiple columns as well.

aggregate_function

One of:

  • a string indicating the expression to aggregate with, such as 'first', 'sum', 'max', 'min', 'mean', 'median', 'last', 'count',

  • an Expr e.g. pl$element()$sum()

maintain_order

Sort the grouped keys so that the output order is predictable.

sort_columns

Sort the transposed columns by name. Default is by order of discovery.

separator

Used as separator/delimiter in generated column names.

Value

DataFrame

Examples

df = pl$DataFrame(
  foo = c("one", "one", "one", "two", "two", "two"),
  bar = c("A", "B", "C", "A", "B", "C"),
  baz = c(1, 2, 3, 4, 5, 6)
)
df

df$pivot(
  values = "baz", index = "foo", on = "bar"
)

# Run an expression as aggregation function
df = pl$DataFrame(
  col1 = c("a", "a", "a", "b", "b", "b"),
  col2 = c("x", "x", "x", "x", "y", "y"),
  col3 = c(6, 7, 3, 2, 5, 7)
)
df

df$pivot(
  index = "col1",
  on = "col2",
  values = "col3",
  aggregate_function = pl$element()$tanh()$mean()
)
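
A string aggregation works the same way. In the sketch below, which reuses the data above, the three "a"/"x" values are summed, and the "a"/"y" cell is null because that combination has no rows.

```r
library(polars)

df = pl$DataFrame(
  col1 = c("a", "a", "a", "b", "b", "b"),
  col2 = c("x", "x", "x", "x", "y", "y"),
  col3 = c(6, 7, 3, 2, 5, 7)
)

wide = df$pivot(
  on = "col2",
  index = "col1",
  values = "col3",
  aggregate_function = "sum"
) |>
  as.data.frame()
wide
```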

Quantile

Description

Aggregate the columns in the DataFrame to a unique quantile value. Use ⁠$describe()⁠ to specify several quantiles.

Usage

DataFrame_quantile(quantile, interpolation = "nearest")

Arguments

quantile

Numeric of length 1 between 0 and 1.

interpolation

One of "nearest", "higher", "lower", "midpoint", or "linear".

Value

DataFrame

Examples

pl$DataFrame(mtcars)$quantile(.4)
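
The interpolation argument matters when the requested quantile falls between two observations. A minimal sketch on four made-up values, where the median position lies between 2 and 3:

```r
library(polars)

df = pl$DataFrame(x = c(1, 2, 3, 4))

# take the lower neighbour, the higher neighbour, or interpolate between them
q_lower = as.data.frame(df$quantile(0.5, interpolation = "lower"))$x
q_higher = as.data.frame(df$quantile(0.5, interpolation = "higher"))$x
q_linear = as.data.frame(df$quantile(0.5, interpolation = "linear"))$x

c(lower = q_lower, higher = q_higher, linear = q_linear)
```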

Rechunk a DataFrame

Description

Rechunking re-allocates any "chunked" memory allocations to speed up, for example, vectorized operations.

Usage

DataFrame_rechunk()

Details

A DataFrame is a vector of Series. Each Series in rust-polars is a wrapper around a ChunkedArray, which is like a virtual contiguous vector physically backed by an ordered set of chunks. Each chunk of values has a contiguous memory layout and is an arrow array. Arrow arrays are a fast, thread-safe and cross-platform memory layout.

In R, combining with c() or rbind() requires immediate vector re-allocation to place vector values in contiguous memory. This is slow and memory consuming, and it is why repeatedly appending to a vector in R is discouraged.

In polars, when we concatenate or append to Series or DataFrame, the re-allocation can be avoided or delayed by simply appending chunks to each individual Series. However, if chunks become many and small or are misaligned across Series, this can hurt the performance of subsequent operations.

In most places in the polars API where chunking could occur, the user typically has to actively opt out of rechunking by setting the argument rechunk = FALSE.

Value

A DataFrame

See Also

<DataFrame>$n_chunks()

Examples

# create DataFrame with misaligned chunks
df = pl$concat(
  1:10, # single chunk
  pl$concat(1:5, 1:5, rechunk = FALSE, how = "vertical")$rename("b"), # two chunks
  how = "horizontal"
)
df
df$n_chunks()

# rechunk a chunked DataFrame
df$rechunk()$n_chunks()

# rechunk is not an in-place operation
df$n_chunks()

# The following toy example emulates the Series "chunkyness" in R. Here it is an
# S3-classed list of vectors of the same type, with all relevant S3 generics
# implemented so that it behaves as if it were a regular vector.
"+.chunked_vector" = \(x, y) structure(list(unlist(x) + unlist(y)), class = "chunked_vector")
print.chunked_vector = \(x, ...) print(unlist(x), ...)
c.chunked_vector = \(...) {
  structure(do.call(c, lapply(list(...), unclass)), class = "chunked_vector")
}
rechunk = \(x) structure(unlist(x), class = "chunked_vector")
x = structure(list(1:4, 5L), class = "chunked_vector")
x
x + 5:1
lapply(x, tracemem) # trace chunks to verify no re-allocation
z = c(x, x)
z # looks like a plain vector
lapply(z, tracemem) # memory allocations in z are the same as in x
str(z)
z = rechunk(z)
str(z)

Rename column names of a DataFrame

Description

Rename column names of a DataFrame

Usage

DataFrame_rename(...)

Arguments

...

One of the following:

  • Key value pairs that map from old name to new name, like old_name = "new_name".

  • As above but with params wrapped in a list

  • An R function that takes the old names character vector as input and returns the new names character vector.

Details

If existing names are swapped (e.g. A points to B and B points to A), polars will block projection and predicate pushdowns at this node.

Value

DataFrame

Examples

df = pl$DataFrame(
  foo = 1:3,
  bar = 6:8,
  ham = letters[1:3]
)

df$rename(foo = "apple")

df$rename(
  \(column_name) paste0("c", substr(column_name, 2, 100))
)

Reverse

Description

Reverse the DataFrame (the last row becomes the first one, etc.).

Usage

DataFrame_reverse()

Value

DataFrame

Examples

pl$DataFrame(mtcars)$reverse()

Create rolling groups based on a date/time or integer column

Description

If you have a time series ⁠<t_0, t_1, ..., t_n>⁠, then by default the windows created will be:

  • (t_0 - period, t_0]

  • (t_1 - period, t_1]

  • (t_n - period, t_n]

whereas if you pass a non-default offset, then the windows will be:

  • (t_0 + offset, t_0 + offset + period]

  • (t_1 + offset, t_1 + offset + period]

  • (t_n + offset, t_n + offset + period]

Usage

DataFrame_rolling(
  index_column,
  ...,
  period,
  offset = NULL,
  closed = "right",
  group_by = NULL
)

Arguments

index_column

Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if group_by is specified, then it must be sorted in ascending order within each group). In case of a rolling group by on indices, dtype needs to be either Int32 or Int64. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.

...

Ignored.

period

A character representing the length of the window, must be non-negative. See the ⁠Polars duration string language⁠ section for details.

offset

A character representing the offset of the window, or NULL (default). If NULL, -period is used. See the ⁠Polars duration string language⁠ section for details.

closed

Define which sides of the temporal interval are closed (inclusive). This can be either "left", "right", "both" or "none".

group_by

Also group by this column/these columns.

Details

In case of a rolling operation on an integer column, the windows are defined by:

  • "1i" # length 1

  • "10i" # length 10

Value

A RollingGroupBy object

Polars duration string language

Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.

It has the following format:

  • 1ns (1 nanosecond)

  • 1us (1 microsecond)

  • 1ms (1 millisecond)

  • 1s (1 second)

  • 1m (1 minute)

  • 1h (1 hour)

  • 1d (1 calendar day)

  • 1w (1 calendar week)

  • 1mo (1 calendar month)

  • 1q (1 calendar quarter)

  • 1y (1 calendar year)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".

Examples

date = c(
  "2020-01-01 13:45:48",
  "2020-01-01 16:42:13",
  "2020-01-01 16:45:09",
  "2020-01-02 18:12:48",
  "2020-01-03 19:45:32",
  "2020-01-08 23:16:43"
)
df = pl$DataFrame(dt = date, a = c(3, 7, 5, 9, 2, 1))$with_columns(
  pl$col("dt")$str$strptime(pl$Datetime())$set_sorted()
)

df$rolling(index_column = "dt", period = "2d")$agg(
  sum_a = pl$sum("a"),
  min_a = pl$min("a"),
  max_a = pl$max("a")
)
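
The window definition (t - period, t] can be checked on an integer index, where the expected sums are easy to recompute by hand in base R. This is a sketch with made-up data; by_hand is only a cross-check.

```r
library(polars)

idx = 1:5
a = c(1, 2, 3, 4, 5)

df = pl$DataFrame(idx = idx, a = a)$with_columns(
  pl$col("idx")$set_sorted()
)

# polars: with closed = "right" (default), each window is (t - 2, t]
res = df$rolling(index_column = "idx", period = "2i")$agg(
  sum_a = pl$col("a")$sum()
) |>
  as.data.frame()

# base R: recompute the same windows by hand
by_hand = sapply(idx, \(t) sum(a[idx > t - 2 & idx <= t]))

res
by_hand
```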

Take a sample of rows from a DataFrame

Description

Take a sample of rows from a DataFrame

Usage

DataFrame_sample(
  n = NULL,
  ...,
  fraction = NULL,
  with_replacement = FALSE,
  shuffle = FALSE,
  seed = NULL
)

Arguments

n

Number of rows to return. Cannot be used with fraction.

...

Ignored.

fraction

Fraction of rows to return. Cannot be used with n. Can be larger than 1 if with_replacement is TRUE.

with_replacement

Allow values to be sampled more than once.

shuffle

If TRUE, the order of the sampled rows will be shuffled. If FALSE (default), the order of the returned rows will be neither stable nor fully random.

seed

Seed for the random number generator. If set to NULL (default), a random seed is generated for each sample operation.

Value

DataFrame

Examples

df = pl$DataFrame(iris)
df$sample(n = 20)
df$sample(fraction = 0.1)

Select and modify columns of a DataFrame

Description

Similar to dplyr::mutate(). However, it discards unmentioned columns (like .() in data.table).

Usage

DataFrame_select(...)

Arguments

...

Columns to keep. These can be expressions (e.g. pl$col("a")), column names (e.g. "a"), or a list containing expressions or column names (e.g. list(pl$col("a"))).

Value

DataFrame

Examples

pl$DataFrame(iris)$select(
  pl$col("Sepal.Length")$abs()$alias("abs_SL"),
  (pl$col("Sepal.Length") + 2)$alias("add_2_SL")
)
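
To make the "discards unmentioned columns" behavior concrete, the following sketch (made-up data, arbitrary output name a_doubled) lists the columns that remain after a select.

```r
library(polars)

df = pl$DataFrame(a = 1:3, b = 4:6, c = letters[1:3])

# only the mentioned columns survive; "b" and "c" are dropped
out = df$select(
  pl$col("a"),
  (pl$col("a") * 2)$alias("a_doubled")
)
cols = names(as.data.frame(out))
cols
```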

Select and modify columns of a DataFrame

Description

Similar to dplyr::mutate(). However, it discards unmentioned columns (like .() in data.table).

This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap. Otherwise, ⁠$select()⁠ should be preferred.

Usage

DataFrame_select_seq(...)

Arguments

...

Columns to keep. These can be expressions (e.g. pl$col("a")), column names (e.g. "a"), or a list containing expressions or column names (e.g. list(pl$col("a"))).

Value

DataFrame

Examples

pl$DataFrame(iris)$select_seq(
  pl$col("Sepal.Length")$abs()$alias("abs_SL"),
  (pl$col("Sepal.Length") + 2)$alias("add_2_SL")
)

Shift a DataFrame

Description

Shift the values by a given period. If the period (n) is positive, then n rows will be inserted at the top of the DataFrame and the last n rows will be discarded. Vice-versa if the period is negative. In the end, the total number of rows of the DataFrame doesn't change.

Usage

DataFrame_shift(n = 1, fill_value = NULL)

Arguments

n

Number of indices to shift forward. If a negative value is passed, values are shifted in the opposite direction instead.

fill_value

Fill the resulting null values with this value. Accepts expression input. Non-expression inputs are parsed as literals.

Value

DataFrame

Examples

df = pl$DataFrame(a = 1:4, b = 5:8)

df$shift(2)

df$shift(-2)

df$shift(-2, fill_value = 100)
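
The shifting rule from the description can be cross-checked against a base R equivalent. This is only a sketch on a single toy column; the names x, n, shifted, base_shifted and filled are made up for the illustration.

```r
library(polars)

x = 1:4
n = 2

# Positive n: n nulls are inserted at the top, the last n values are dropped
shifted = pl$DataFrame(a = x)$shift(n)$to_list()$a

# Base R equivalent: pad with n missing values, drop the last n values
base_shifted = c(rep(NA, n), head(x, -n))

shifted
base_shifted

# Negative n with a fill value: the gap at the bottom is filled with 100
filled = pl$DataFrame(a = x)$shift(-n, fill_value = 100)$to_list()$a
filled
```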

Slice

Description

Get a slice of the DataFrame.

Usage

DataFrame_slice(offset, length = NULL)

Arguments

offset

Start index, can be a negative value. This is 0-indexed, so offset = 1 doesn't include the first row.

length

Length of the slice. If NULL (default), all rows starting at the offset will be selected.

Value

DataFrame

Examples

# skip the first 2 rows and take the 4 following rows
pl$DataFrame(mtcars)$slice(2, 4)

# this is equivalent to:
mtcars[3:6, ]
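
The 0-indexed and negative offsets are easier to follow on a small frame where the values equal the 1-based row numbers. A minimal sketch (the column x and the result names are illustrative):

```r
library(polars)

df = pl$DataFrame(x = 1:10)

# offset = 2 skips two rows; length = 4 then takes R rows 3 to 6
mid = df$slice(2, 4)$to_list()$x
mid

# a negative offset counts from the end: the last three rows
last3 = df$slice(-3)$to_list()$x
last3

# negative offset combined with a length: two rows starting three from the end
two = df$slice(-3, 2)$to_list()$x
two
```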

Sort a DataFrame

Description

Sort a DataFrame

Usage

DataFrame_sort(
  by,
  ...,
  descending = FALSE,
  nulls_last = FALSE,
  maintain_order = FALSE
)

Arguments

by

Column(s) to sort by. Can be a character vector of column names, a list of Expr(s), or a list with a mix of Expr(s) and column names.

...

More columns to sort by as above but provided one Expr per argument.

descending

Logical. Sort in descending order (default is FALSE). This must be either of length 1 or a logical vector of the same length as the number of Expr(s) specified in by and ....

nulls_last

A logical or logical vector of the same length as the number of columns. If TRUE, place null values last instead of first.

maintain_order

Whether the order should be maintained if elements are equal. If TRUE, streaming is not possible and performance might be worse since this requires a stable sort.

Value

DataFrame

Examples

df = mtcars
df$mpg[1] = NA
df = pl$DataFrame(df)
df$sort("mpg")
df$sort("mpg", nulls_last = TRUE)
df$sort("cyl", "mpg")
df$sort(c("cyl", "mpg"))
df$sort(c("cyl", "mpg"), descending = TRUE)
df$sort(c("cyl", "mpg"), descending = c(TRUE, FALSE))
df$sort(pl$col("cyl"), pl$col("mpg"))
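
On a frame small enough to check by eye, the effect of a per-column descending vector and of nulls_last looks as follows. This is a sketch with made-up data; the names mixed, nulls, nulls_first and nulls_at_end are illustrative.

```r
library(polars)

df = pl$DataFrame(g = c("b", "a", "b", "a"), v = c(1, 2, 3, 4))

# g ascending, v descending within each value of g
mixed = df$sort(c("g", "v"), descending = c(FALSE, TRUE))$to_list()
mixed

# R's NA becomes a Polars null; nulls come first unless nulls_last = TRUE
nulls = pl$DataFrame(v = c(3, NA, 1))
nulls_first = nulls$sort("v")$to_list()$v
nulls_at_end = nulls$sort("v", nulls_last = TRUE)$to_list()$v
nulls_first
nulls_at_end
```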

Execute a SQL query against the DataFrame

Description

The calling frame is automatically registered as a table in the SQL context under the name "self". All DataFrames and LazyFrames found in the envir are also registered, using their variable name. More control over registration and execution behaviour is available via the SQLContext object.

Usage

DataFrame_sql(query, ..., table_name = NULL, envir = parent.frame())

Arguments

query

A character of the SQL query to execute.

...

Ignored.

table_name

NULL (default) or a character of an explicit name for the table that represents the calling frame (the alias "self" will always be registered/available).

envir

The environment to search for polars DataFrames/LazyFrames.

Details

This functionality is considered unstable, although it is close to being considered stable. It may be changed at any point without it being considered a breaking change.

Value

DataFrame

Examples

df1 = pl$DataFrame(
  a = 1:3,
  b = c("zz", "yy", "xx"),
  c = as.Date(c("1999-12-31", "2010-10-10", "2077-08-08"))
)

# Query the DataFrame using SQL:
df1$sql("SELECT c, b FROM self WHERE a > 1")

# Join two DataFrames using SQL.
df2 = pl$DataFrame(a = 3:1, d = c(125, -654, 888))
df1$sql(
  "
SELECT self.*, d
FROM self
INNER JOIN df2 USING (a)
WHERE a > 1 AND EXTRACT(year FROM c) < 2050
"
)

# Apply transformations to a DataFrame using SQL, aliasing "self" to "frame".
df1$sql(
  query = r"(
SELECT
a,
MOD(a, 2) == 0 AS a_is_even,
CONCAT_WS(':', b, b) AS b_b,
EXTRACT(year FROM c) AS year,
0::float AS 'zero'
FROM frame
)",
  table_name = "frame"
)

Std

Description

Aggregate the columns of this DataFrame to their standard deviation values.

Usage

DataFrame_std(ddof = 1)

Arguments

ddof

Delta Degrees of Freedom: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1.

Value

A DataFrame with one row.

Examples

pl$DataFrame(mtcars)$std()
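
The role of ddof can be worked through by hand on a small vector. The sketch below computes the sum of squared deviations in base R and compares both divisors with the Polars result; the vector x and the helper names are made up for the illustration.

```r
library(polars)

x = c(2, 4, 4, 6)
n = length(x)
ss = sum((x - mean(x))^2) # sum of squared deviations, here 8

sd_sample = sqrt(ss / (n - 1)) # ddof = 1, the same divisor as stats::sd()
sd_population = sqrt(ss / n)   # ddof = 0

df = pl$DataFrame(x = x)
pl_sample = df$std()$to_list()$x
pl_population = df$std(ddof = 0)$to_list()$x

c(sd_sample, pl_sample)
c(sd_population, pl_population)
```

With ddof = 1 the result matches stats::sd(); ddof = 0 gives the population standard deviation, which is smaller for any non-constant sample.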

Sum

Description

Aggregate the columns of this DataFrame to their sum values.

Usage

DataFrame_sum()

Value

A DataFrame with one row.

Examples

pl$DataFrame(mtcars)$sum()

Get the last n rows.

Description

Get the last n rows.

Usage

DataFrame_tail(n = 5L)

Arguments

n

Number of rows to return. If a negative value is passed, return all rows except the first abs(n).

Value

A DataFrame

Examples

df = pl$DataFrame(foo = 1:5, bar = 6:10, ham = letters[1:5])

df$tail(3)

# Pass a negative value to get all rows except the first `abs(n)`.
df$tail(-3)

Return Polars DataFrame as R data.frame

Description

Return Polars DataFrame as R data.frame

Usage

DataFrame_to_data_frame(
  ...,
  int64_conversion = polars_options()$int64_conversion
)

Arguments

...

Any args passed to as.data.frame().

int64_conversion

How should Int64 values be handled when converting a polars object to R?

  • "double" (default) converts the integer values to double.

  • "bit64" uses bit64::as.integer64() to do the conversion (requires the package bit64 to be attached).

  • "string" converts Int64 values to character.

Value

An R data.frame

Conversion to R data types considerations

When converting Polars objects, such as DataFrames, to R objects (for example via the as.data.frame() generic function), each type in the Polars object is converted to an R type. In some cases, an error may occur because the conversion is not appropriate. In particular, errors are likely when converting a Datetime type without a time zone. A Datetime type without a time zone in Polars is converted to the POSIXct type in R, which takes into account the time zone in which the R session is running (which can be checked with the Sys.timezone() function). In this case, if ambiguous or non-existent times are included, a conversion error will occur. In such cases, change the session time zone using Sys.setenv(TZ = "UTC") and then perform the conversion, or use the $dt$replace_time_zone() method on the Datetime type column to explicitly specify the time zone before conversion.

# Due to daylight savings, clocks were turned forward 1 hour on Sunday, March 8, 2020, 2:00:00 am
# so this particular date-time doesn't exist
non_existent_time = as_polars_series("2020-03-08 02:00:00")$str$strptime(pl$Datetime(), "%F %T")

withr::with_timezone(
  "America/New_York",
  {
    tryCatch(
      # This causes an error due to the time zone (the `TZ` env var is affected).
      as.vector(non_existent_time),
      error = function(e) e
    )
  }
)
#> <error: in to_r: ComputeError(ErrString("datetime '2020-03-08 02:00:00' is non-existent in time zone 'America/New_York'. You may be able to use `non_existent='null'` to return `null` in this case.")) When calling: devtools::document()>

withr::with_timezone(
  "America/New_York",
  {
    # This is safe.
    as.vector(non_existent_time$dt$replace_time_zone("UTC"))
  }
)
#> [1] "2020-03-08 02:00:00 UTC"

Examples

df = pl$DataFrame(iris[1:3, ])
df$to_data_frame()
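
The int64_conversion options can be compared on a column that is explicitly cast to Int64. This is a minimal sketch with made-up data; it only covers "double" and "string", since "bit64" additionally requires the bit64 package.

```r
library(polars)

# R integers are 32-bit, so cast explicitly to get an Int64 column
df = pl$DataFrame(x = 1:3)$with_columns(pl$col("x")$cast(pl$Int64))

as_double = df$to_data_frame(int64_conversion = "double")
as_string = df$to_data_frame(int64_conversion = "string")

class(as_double$x)
class(as_string$x)
```

"double" is convenient but cannot represent every integer above 2^53 exactly; "string" preserves all digits at the cost of a character column.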

Return Polars DataFrame as a list of vectors

Description

Return Polars DataFrame as a list of vectors

Usage

DataFrame_to_list(
  unnest_structs = TRUE,
  ...,
  int64_conversion = polars_options()$int64_conversion
)

Arguments

unnest_structs

Logical. If TRUE (default), then ⁠$unnest()⁠ is applied on any struct column.

...

Any args passed to as.data.frame().

int64_conversion

How should Int64 values be handled when converting a polars object to R?

  • "double" (default) converts the integer values to double.

  • "bit64" uses bit64::as.integer64() to do the conversion (requires the package bit64 to be attached).

  • "string" converts Int64 values to character.

Details

For simplicity reasons, this implementation relies on unnesting all structs before exporting to R. If unnest_structs = FALSE, then struct columns will be returned as nested lists, where each row is a list of values. Such a structure is not very typical or efficient in R.

Value

R list of vectors

Conversion to R data types considerations

When converting Polars objects, such as DataFrames, to R objects (for example via the as.data.frame() generic function), each type in the Polars object is converted to an R type. In some cases, an error may occur because the conversion is not appropriate. In particular, errors are likely when converting a Datetime type without a time zone. A Datetime type without a time zone in Polars is converted to the POSIXct type in R, which takes into account the time zone in which the R session is running (which can be checked with the Sys.timezone() function). In this case, if ambiguous or non-existent times are included, a conversion error will occur. In such cases, change the session time zone using Sys.setenv(TZ = "UTC") and then perform the conversion, or use the $dt$replace_time_zone() method on the Datetime type column to explicitly specify the time zone before conversion.

# Due to daylight savings, clocks were turned forward 1 hour on Sunday, March 8, 2020, 2:00:00 am
# so this particular date-time doesn't exist
non_existent_time = as_polars_series("2020-03-08 02:00:00")$str$strptime(pl$Datetime(), "%F %T")

withr::with_timezone(
  "America/New_York",
  {
    tryCatch(
      # This causes an error due to the time zone (the `TZ` env var is affected).
      as.vector(non_existent_time),
      error = function(e) e
    )
  }
)
#> <error: in to_r: ComputeError(ErrString("datetime '2020-03-08 02:00:00' is non-existent in time zone 'America/New_York'. You may be able to use `non_existent='null'` to return `null` in this case.")) When calling: devtools::document()>

withr::with_timezone(
  "America/New_York",
  {
    # This is safe.
    as.vector(non_existent_time$dt$replace_time_zone("UTC"))
  }
)
#> [1] "2020-03-08 02:00:00 UTC"

Examples

pl$DataFrame(iris)$to_list()
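
The effect of unnest_structs is visible in the names of the returned list. The sketch below builds a single struct column named s (an illustrative name) from two toy columns and compares both settings.

```r
library(polars)

df = pl$DataFrame(a = 1:2, b = c("x", "y"))$select(
  pl$struct(c("a", "b"))$alias("s")
)

# Default: the struct is unnested, so its fields become list elements
flat = df$to_list()
names(flat)

# unnest_structs = FALSE: one element per column, holding nested lists
nested = df$to_list(unnest_structs = FALSE)
names(nested)
```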

Write Arrow IPC data to a raw vector

Description

Write Arrow IPC data to a raw vector

Usage

DataFrame_to_raw_ipc(
  compression = c("uncompressed", "zstd", "lz4"),
  ...,
  compat_level = FALSE
)

Arguments

compression

NULL or a character of the compression method, "uncompressed" or "lz4" or "zstd". NULL is equivalent to "uncompressed". Choose "zstd" for good compression performance. Choose "lz4" for fast compression/decompression.

...

Ignored.

compat_level

Use a specific compatibility level when exporting Polars’ internal data structures. This can be:

  • an integer indicating the compatibility version (currently only 0 for oldest and 1 for newest);

  • a logical value with TRUE for the newest version and FALSE for the oldest version.

Value

A raw vector

Examples

df = pl$DataFrame(
  foo = 1:5,
  bar = 6:10,
  ham = letters[1:5]
)

raw_ipc = df$to_raw_ipc()

pl$read_ipc(raw_ipc)

if (require("arrow", quietly = TRUE)) {
  arrow::read_ipc_file(raw_ipc, as_data_frame = FALSE)
}

Get column by index

Description

Extract a DataFrame column (by index) as a Polars series. Unlike get_column(), this method will not fail but will return a NULL if the index doesn't exist in the DataFrame. Keep in mind that Polars is 0-indexed so "0" is the first column.

Usage

DataFrame_to_series(idx = 0)

Arguments

idx

Index of the column to return as Series. Defaults to 0, which is the first column.

Value

Series or NULL

Examples

df = pl$DataFrame(iris[1:10, ])

# default is to extract the first column
df$to_series()

# Polars is 0-indexed, so we use idx = 1 to extract the *2nd* column
df$to_series(idx = 1)

# doesn't error if the column isn't there
df$to_series(idx = 8)

Convert DataFrame to a Series of type "struct"

Description

Convert DataFrame to a Series of type "struct"

Usage

DataFrame_to_struct(name = "")

Arguments

name

Name given to the new Series

Value

A Series of type "struct"

Examples

# round-trip conversion from DataFrame with two columns
df = pl$DataFrame(a = 1:5, b = c("one", "two", "three", "four", "five"))
s = df$to_struct()
s

# convert to an R list
s$to_r()

# Convert back to a DataFrame
df_s = s$to_frame()
df_s

Transpose a DataFrame over the diagonal.

Description

Transpose a DataFrame over the diagonal.

Usage

DataFrame_transpose(
  include_header = FALSE,
  header_name = "column",
  column_names = NULL
)

Arguments

include_header

If TRUE, the column names will be added as first column.

header_name

If include_header is TRUE, this determines the name of the column that will be inserted.

column_names

Character vector indicating the new column names. If NULL (default), the columns will be named as "column_1", "column_2", etc. The length of this vector must match the number of rows of the original input.

Details

This is a very expensive operation.

Transpose may be the fastest option to perform non-foldable (see fold() or reduce()) row operations like median.

Polars transpose is currently eager only, likely because it is not trivial to deduce the schema.

Value

DataFrame

Examples

# simple use-case
pl$DataFrame(mtcars)$transpose(include_header = TRUE, column_names = rownames(mtcars))

# All rows must have one shared supertype, recast Categorical to String which is a supertype
# of f64, and then dataset "Iris" can be transposed
pl$DataFrame(iris)$with_columns(pl$col("Species")$cast(pl$String))$transpose()
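
The row-wise median mentioned in Details can be sketched as follows. It assumes the DataFrame $median() aggregation and uses made-up data; after transposing, each original row is a column, so a column-wise aggregate yields one value per original row.

```r
library(polars)

df = pl$DataFrame(a = c(1, 2), b = c(5, 3), c = c(9, 10))

# Transpose, aggregate column-wise, and flatten the one-row result
row_medians = unlist(df$transpose()$median()$to_list(), use.names = FALSE)
row_medians

# Base R cross-check on the converted data.frame
apply(as.data.frame(df), 1, median)
```

Because transposing is expensive, this is only worth doing when the row operation cannot be expressed with fold() or reduce().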

Drop duplicated rows

Description

Drop duplicated rows

Usage

DataFrame_unique(subset = NULL, ..., keep = "any", maintain_order = FALSE)

Arguments

subset

A character vector with the names of the column(s) to use to identify duplicates. If NULL (default), use all columns.

...

Not used.

keep

Which of the duplicate rows to keep:

  • "any" (default): Does not give any guarantee of which row is kept. This allows more optimizations.

  • "first": Keep first unique row.

  • "last": Keep last unique row.

  • "none": Don’t keep duplicate rows.

maintain_order

Keep the same order as the original data. Setting this to TRUE makes it more expensive to compute and blocks the possibility to run on the streaming engine.

Value

DataFrame

Examples

df = pl$DataFrame(
  x = c(1:3, 1:3, 3:1, 1L),
  y = c(1:3, 1:3, 1:3, 1L)
)
df$height

df$unique()$height

# subset to define unique, keep only last or first
df$unique(subset = "x", keep = "last")
df$unique(subset = "x", keep = "first")

# only keep unique rows
df$unique(keep = "none")
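
The difference between keep = "first" and keep = "none" is easiest to see on a single column. A minimal sketch with made-up data; maintain_order = TRUE is set only to make the output order predictable.

```r
library(polars)

df = pl$DataFrame(x = c(1L, 1L, 2L, 3L, 3L, 3L))

# "first": one row per distinct value
first_rows = df$unique(keep = "first", maintain_order = TRUE)$to_list()$x
first_rows

# "none": only values that occur exactly once survive
singletons = df$unique(keep = "none", maintain_order = TRUE)$to_list()$x
singletons
```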

Unnest the Struct columns of a DataFrame

Description

Unnest the Struct columns of a DataFrame

Usage

DataFrame_unnest(...)

Arguments

...

Names of the struct columns to unnest. This doesn't accept Expr. If nothing is provided, then all columns of datatype Struct are unnested.

Value

A DataFrame where some or all columns of datatype Struct are unnested.

Examples

df = pl$DataFrame(
  a = 1:5,
  b = c("one", "two", "three", "four", "five"),
  c = 6:10
)$
  select(
  pl$struct("b"),
  pl$struct(c("a", "c"))$alias("a_and_c")
)
df

# by default, all struct columns are unnested
df$unnest()

# we can specify specific columns to unnest
df$unnest("a_and_c")

Unpivot a Frame from wide to long format

Description

Unpivot a Frame from wide to long format

Usage

DataFrame_unpivot(
  on = NULL,
  ...,
  index = NULL,
  variable_name = NULL,
  value_name = NULL
)

Arguments

on

Columns to use as value variables. If on is empty, all columns that are not in index will be used.

...

Not used.

index

Columns to use as identifier variables.

variable_name

Name to give to the new column containing the names of the melted columns. Defaults to "variable".

value_name

Name to give to the new column containing the values of the melted columns. Defaults to "value".

Details

Optionally leaves identifiers set.

This function is useful to massage a Frame into a format where one or more columns are identifier variables (index), while all other columns, considered measured variables (on), are "unpivoted" to the row axis, leaving just two non-identifier columns, 'variable' and 'value'.

Value

A new DataFrame

Examples

df = pl$DataFrame(
  a = c("x", "y", "z"),
  b = c(1, 3, 5),
  c = c(2, 4, 6),
  d = c(7, 8, 9)
)
df$unpivot(index = "a", on = c("b", "c", "d"))
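
The shape of the result can be checked on a small frame: each index row is repeated once per unpivoted column. The sketch below uses made-up data and illustrative output names (measure, reading), and assumes that leaving on at NULL unpivots every non-index column.

```r
library(polars)

df = pl$DataFrame(
  a = c("x", "y", "z"),
  b = c(1, 3, 5),
  c = c(2, 4, 6)
)

# 3 index rows times 2 unpivoted columns gives 6 rows
long = df$unpivot(index = "a", on = c("b", "c"))
long$columns
long$height

# on = NULL: all non-index columns are unpivoted; output columns renamed
renamed = df$unpivot(
  index = "a",
  variable_name = "measure",
  value_name = "reading"
)
renamed$columns
```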

Var

Description

Aggregate the columns of this DataFrame to their variance values.

Usage

DataFrame_var(ddof = 1)

Arguments

ddof

Delta Degrees of Freedom: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1.

Value

A DataFrame with one row.

Examples

pl$DataFrame(mtcars)$var()

Modify/append column(s)

Description

Add columns or modify existing ones with expressions. This is the equivalent of dplyr::mutate() as it keeps unmentioned columns (unlike ⁠$select()⁠).

Usage

DataFrame_with_columns(...)

Arguments

...

Expressions or column names (strings), optionally wrapped in a list. If the first and only argument is a list, it is unwrapped and its elements are used as the arguments.

Value

A DataFrame

Examples

pl$DataFrame(iris)$with_columns(
  pl$col("Sepal.Length")$abs()$alias("abs_SL"),
  (pl$col("Sepal.Length") + 2)$alias("add_2_SL")
)

# same query
l_expr = list(
  pl$col("Sepal.Length")$abs()$alias("abs_SL"),
  (pl$col("Sepal.Length") + 2)$alias("add_2_SL")
)
pl$DataFrame(iris)$with_columns(l_expr)

pl$DataFrame(iris)$with_columns(
  pl$col("Sepal.Length")$abs(), # not named expr will keep name "Sepal.Length"
  SW_add_2 = (pl$col("Sepal.Width") + 2)
)

Modify/append column(s)

Description

Add columns or modify existing ones with expressions. This is the equivalent of dplyr::mutate() as it keeps unmentioned columns (unlike ⁠$select()⁠).

This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap. Otherwise, $with_columns() should be preferred.

Usage

DataFrame_with_columns_seq(...)

Arguments

...

Expressions or column names (strings), optionally wrapped in a list. If the first and only argument is a list, it is unwrapped and its elements are used as the arguments.

Value

A DataFrame

Examples

pl$DataFrame(iris)$with_columns_seq(
  pl$col("Sepal.Length")$abs()$alias("abs_SL"),
  (pl$col("Sepal.Length") + 2)$alias("add_2_SL")
)

# same query
l_expr = list(
  pl$col("Sepal.Length")$abs()$alias("abs_SL"),
  (pl$col("Sepal.Length") + 2)$alias("add_2_SL")
)
pl$DataFrame(iris)$with_columns_seq(l_expr)

pl$DataFrame(iris)$with_columns_seq(
  pl$col("Sepal.Length")$abs(), # not named expr will keep name "Sepal.Length"
  SW_add_2 = (pl$col("Sepal.Width") + 2)
)

Add a column for row indices

Description

Add a new column at index 0 that counts the rows

Usage

DataFrame_with_row_index(name, offset = NULL)

Arguments

name

string name of the created column

offset

positive integer offset for the start of the counter

Value

A new DataFrame object with a counter column in front

Examples

df = pl$DataFrame(mtcars)

# by default, the index starts at 0 (to mimic the behavior of Python Polars)
df$with_row_index("idx")

# but in R, we use a 1-index
df$with_row_index("idx", offset = 1)

Write to comma-separated values (CSV) file

Description

Write to comma-separated values (CSV) file

Usage

DataFrame_write_csv(
  file,
  ...,
  include_bom = FALSE,
  include_header = TRUE,
  separator = ",",
  line_terminator = "\n",
  quote_char = "\"",
  batch_size = 1024,
  datetime_format = NULL,
  date_format = NULL,
  time_format = NULL,
  float_precision = NULL,
  null_values = "",
  quote_style = "necessary"
)

Arguments

file

File path to which the result should be written.

...

Ignored.

include_bom

Whether to include UTF-8 BOM (byte order mark) in the CSV output.

include_header

Whether to include header in the CSV output.

separator

Separate CSV fields with this symbol.

line_terminator

String used to end each row.

quote_char

Byte to use as quoting character.

batch_size

Number of rows that will be processed per thread.

datetime_format

A format string, with the specifiers defined by the chrono Rust crate. If no format specified, the default fractional-second precision is inferred from the maximum timeunit found in the frame’s Datetime cols (if any).

date_format

A format string, with the specifiers defined by the chrono Rust crate.

time_format

A format string, with the specifiers defined by the chrono Rust crate.

float_precision

Number of decimal places to write, applied to both Float32 and Float64 datatypes.

null_values

A string representing null values (defaulting to the empty string).

quote_style

Determines the quoting strategy used.

  • "necessary" (default): This puts quotes around fields only when necessary. They are necessary when fields contain a quote, delimiter or record terminator. Quotes are also necessary when writing an empty record (which is indistinguishable from a record with one empty field).

  • "always": This puts quotes around every field.

  • "non_numeric": This puts quotes around all fields that are non-numeric. Namely, when writing a field that does not parse as a valid float or integer, then quotes will be used even if they aren't strictly necessary.

  • "never": This never puts quotes around fields, even if that results in invalid CSV data (e.g. by not quoting strings containing the separator).

Value

Invisibly returns the input DataFrame.

Examples

dat = pl$DataFrame(mtcars)

destination = tempfile(fileext = ".csv")
dat$select(pl$col("drat", "mpg"))$write_csv(destination)

pl$read_csv(destination)
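
Reading the written file back as plain text shows what separator and null_values do. This is a small sketch with made-up data; the names path and lines are illustrative.

```r
library(polars)

# R's NA values become Polars nulls
df = pl$DataFrame(a = c(1L, NA, 3L), b = c("x", "y", NA))

path = tempfile(fileext = ".csv")
df$write_csv(path, separator = ";", null_values = "NA")

# Inspect the raw text rather than parsing it again
lines = readLines(path)
lines
```

Writing nulls as "NA" keeps the file readable by base R's read.csv2(), whereas the default empty string is the more common convention for other CSV readers.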

Write to Arrow IPC file (a.k.a Feather file)

Description

Write to Arrow IPC file (a.k.a Feather file)

Usage

DataFrame_write_ipc(
  file,
  compression = c("uncompressed", "zstd", "lz4"),
  ...,
  compat_level = TRUE
)

Arguments

file

File path to which the result should be written.

compression

NULL or a character of the compression method, "uncompressed" or "lz4" or "zstd". NULL is equivalent to "uncompressed". Choose "zstd" for good compression performance. Choose "lz4" for fast compression/decompression.

...

Ignored.

compat_level

Use a specific compatibility level when exporting Polars’ internal data structures. This can be:

  • an integer indicating the compatibility version (currently only 0 for oldest and 1 for newest);

  • a logical value with TRUE for the newest version and FALSE for the oldest version.

Value

Invisibly returns the input DataFrame.

Examples

dat = pl$DataFrame(mtcars)

destination = tempfile(fileext = ".arrow")
dat$write_ipc(destination)

if (require("arrow", quietly = TRUE)) {
  arrow::read_ipc_file(destination, as_data_frame = FALSE)
}

Write to JSON file

Description

Write to JSON file

Usage

DataFrame_write_json(file, ..., pretty = FALSE, row_oriented = FALSE)

Arguments

file

File path to which the result should be written.

...

Ignored.

pretty

Pretty serialize JSON.

row_oriented

Write to row-oriented JSON. This is slower, but more common.

Value

Invisibly returns the input DataFrame.

Examples

if (require("jsonlite", quietly = TRUE)) {
  dat = pl$DataFrame(head(mtcars))
  destination = tempfile()

  dat$select(pl$col("drat", "mpg"))$write_json(destination)
  jsonlite::fromJSON(destination)

  dat$select(pl$col("drat", "mpg"))$write_json(destination, row_oriented = TRUE)
  jsonlite::fromJSON(destination)
}

Write to NDJSON file

Description

Write to NDJSON file

Usage

DataFrame_write_ndjson(file)

Arguments

file

File path to which the result should be written.

Value

Invisibly returns the input DataFrame.

Examples

dat = pl$DataFrame(head(mtcars))

destination = tempfile()
dat$select(pl$col("drat", "mpg"))$write_ndjson(destination)

pl$read_ndjson(destination)

Write to parquet file

Description

Write to parquet file

Usage

DataFrame_write_parquet(
  file,
  ...,
  compression = "zstd",
  compression_level = 3,
  statistics = TRUE,
  row_group_size = NULL,
  data_page_size = NULL,
  partition_by = NULL,
  partition_chunk_size_bytes = 4294967296
)

Arguments

file

File path to which the result should be written. This should be a path to a directory if writing a partitioned dataset.

...

Ignored.

compression

String. The compression method. One of:

  • "lz4": fast compression/decompression.

  • "uncompressed"

  • "snappy": this guarantees that the parquet file will be compatible with older parquet readers.

  • "gzip"

  • "lzo"

  • "brotli"

  • "zstd": good compression performance.

compression_level

NULL or Integer. The level of compression to use. Only used if compression is one of "gzip", "brotli", or "zstd". Higher compression means smaller files on disk:

  • "gzip": min-level: 0, max-level: 10.

  • "brotli": min-level: 0, max-level: 11.

  • "zstd": min-level: 1, max-level: 22.

statistics

Whether statistics should be written to the Parquet headers. Possible values:

  • TRUE: enable default set of statistics (default)

  • FALSE: disable all statistics

  • "full": calculate and write all available statistics.

  • A named list where all values must be TRUE or FALSE, e.g. list(min = TRUE, max = FALSE). Statistics available are "min", "max", "distinct_count", "null_count".

row_group_size

NULL or Integer. Size of the row groups in number of rows. If NULL (default), the chunks of the DataFrame are used. Writing in smaller chunks may reduce memory pressure and improve writing speeds.

data_page_size

Size of the data page in bytes. If NULL (default), it is set to 1024^2 bytes (about 1 MB).

partition_by

Column(s) to partition by. A partitioned dataset will be written if this is specified.

partition_chunk_size_bytes

Approximate size to split DataFrames within a single partition when writing. Note this is calculated using the size of the DataFrame in memory - the size of the output file may differ depending on the file format / compression.

Value

Invisibly returns the input DataFrame.

Examples

dat = pl$DataFrame(mtcars)

# write data to a single parquet file
destination = withr::local_tempfile(fileext = ".parquet")
dat$write_parquet(destination)

# write data to folder with a hive-partitioned structure
dest_folder = withr::local_tempdir()
dat$write_parquet(dest_folder, partition_by = c("gear", "cyl"))
list.files(dest_folder, recursive = TRUE)
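
A round trip through a single file is a quick way to confirm that a chosen compression codec preserves the data. The sketch below uses made-up data, a plain tempfile() instead of withr, and pl$read_parquet() to read the file back.

```r
library(polars)

df = pl$DataFrame(id = 1:3, label = c("a", "b", "c"))

path = tempfile(fileext = ".parquet")

# "snappy" trades compression ratio for compatibility with older readers
df$write_parquet(path, compression = "snappy")

back = pl$read_parquet(path)
all.equal(as.data.frame(back), as.data.frame(df))
```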

Create Array DataType

Description

The Array and List datatypes are very similar. The only difference is that sub-arrays all have the same length while sublists can have different lengths. Array methods can be accessed via the ⁠$arr⁠ subnamespace.

Usage

DataType_Array(datatype = "unknown", width)

Arguments

datatype

An inner DataType. The default is "unknown", which is only a placeholder for cases where the inner DataType does not matter, as in the examples.

width

The length of the arrays.

Value

An array DataType with an inner DataType

Examples

# basic Array
pl$Array(pl$Int32, 4)
# some nested Array
pl$Array(pl$Array(pl$Boolean, 3), 2)

Create Categorical DataType

Description

Create Categorical DataType

Usage

DataType_Categorical(ordering = "physical")

Arguments

ordering

Either "physical" (default) or "lexical".

Details

When a categorical variable is created, its string values (or "lexical" values) are stored and encoded as integers ("physical" values) by order of appearance. Therefore, sorting a categorical value can be done either on the lexical or on the physical values. See Examples.

Value

A Categorical DataType

Examples

# default is to order by physical values
df = pl$DataFrame(x = c("z", "z", "k", "a", "z"), schema = list(x = pl$Categorical()))
df$sort("x")

# when setting ordering = "lexical", sorting will be based on the strings
df_lex = pl$DataFrame(
  x = c("z", "z", "k", "a", "z"),
  schema = list(x = pl$Categorical("lexical"))
)
df_lex$sort("x")

Check whether the data type contains categoricals

Description

Check whether the data type contains categoricals

Usage

DataType_contains_categoricals()

Value

A logical value

Examples

pl$List(pl$Categorical())$contains_categoricals()
pl$List(pl$Enum(c("a", "b")))$contains_categoricals()
pl$List(pl$Float32)$contains_categoricals()
pl$List(pl$List(pl$Categorical()))$contains_categoricals()

Check whether the data type contains views

Description

Check whether the data type contains views

Usage

DataType_contains_views()

Value

A logical value

Examples

pl$List(pl$String)$contains_views()
pl$List(pl$Binary)$contains_views()
pl$List(pl$Float32)$contains_views()
pl$List(pl$List(pl$Binary))$contains_views()

Data type representing a calendar date and time of day.

Description

The underlying representation of this type is a 64-bit signed integer. The integer indicates the number of time units since the Unix epoch (1970-01-01 00:00:00). The number can be negative to indicate datetimes before the epoch.

Usage

DataType_Datetime(time_unit = "us", time_zone = NULL)

Arguments

time_unit

Unit of time. One of "ms", "us" (default) or "ns".

time_zone

Time zone string, as defined in OlsonNames(). Setting time_zone = "*" will match any time zone, which can be useful to select all Datetime columns containing a time zone.

Value

Datetime DataType

Examples

pl$Datetime("ns", "Pacific/Samoa")

df = pl$DataFrame(
  naive_time = as.POSIXct("1900-01-01"),
  zoned_time = as.POSIXct("1900-01-01", "UTC")
)
df

df$select(pl$col(pl$Datetime("us", "*")))

Data type representing a time duration

Description

Data type representing a time duration

Usage

DataType_Duration(time_unit = "us")

Arguments

time_unit

Unit of time. One of "ms", "us" (default) or "ns".

Value

Duration DataType

Examples

test = pl$DataFrame(
  a = 1:2,
  b = c("a", "b"),
  c = pl$duration(weeks = c(1, 2), days = c(0, 2))
)

# select all columns of type "duration"
test$select(pl$col(pl$Duration()))

Create Enum DataType

Description

An Enum is a fixed-set categorical encoding of a set of strings. It is similar to the Categorical data type, but the categories are explicitly provided by the user and cannot be modified.

Usage

DataType_Enum(categories)

Arguments

categories

A character vector specifying the categories of the variable.

Details

This functionality is unstable. It is a work-in-progress feature and may not always work as expected. It may be changed at any point without it being considered a breaking change.

Value

An Enum DataType

Examples

pl$DataFrame(
  x = c("Polar", "Panda", "Brown", "Brown", "Polar"),
  schema = list(x = pl$Enum(c("Polar", "Panda", "Brown")))
)

# All values of the variable have to be in the categories
dtype = pl$Enum(c("Polar", "Panda", "Brown"))
tryCatch(
  pl$DataFrame(
    x = c("Polar", "Panda", "Brown", "Brown", "Polar", "Black"),
    schema = list(x = dtype)
  ),
  error = function(e) e
)

# Comparing two Enum is only valid if they have the same categories
df = pl$DataFrame(
  x = c("Polar", "Panda", "Brown", "Brown", "Polar"),
  y = c("Polar", "Polar", "Polar", "Brown", "Brown"),
  z = c("Polar", "Polar", "Polar", "Brown", "Brown"),
  schema = list(
    x = pl$Enum(c("Polar", "Panda", "Brown")),
    y = pl$Enum(c("Polar", "Panda", "Brown")),
    z = pl$Enum(c("Polar", "Black", "Brown"))
  )
)

# Same categories
df$with_columns(x_eq_y = pl$col("x") == pl$col("y"))

# Different categories
tryCatch(
  df$with_columns(x_eq_z = pl$col("x") == pl$col("z")),
  error = function(e) e
)

Check whether the data type is an array type

Description

Check whether the data type is an array type

Usage

DataType_is_array()

Value

A logical value

Examples

pl$Array(width = 2)$is_array()
pl$Float32$is_array()

Check whether the data type is a binary type

Description

Check whether the data type is a binary type

Usage

DataType_is_binary()

Value

A logical value

Examples

pl$Binary$is_binary()
pl$Float32$is_binary()

Check whether the data type is a boolean type

Description

Check whether the data type is a boolean type

Usage

DataType_is_bool()

Value

A logical value

Examples

pl$Boolean$is_bool()
pl$Float32$is_bool()

Check whether the data type is a Categorical type

Description

Check whether the data type is a Categorical type

Usage

DataType_is_categorical()

Value

A logical value

Examples

pl$Categorical()$is_categorical()
pl$Enum(c("a", "b"))$is_categorical()

Check whether the data type is an Enum type

Description

Check whether the data type is an Enum type

Usage

DataType_is_enum()

Value

A logical value

Examples

pl$Enum(c("a", "b"))$is_enum()
pl$Categorical()$is_enum()

Check whether the data type is a float type

Description

Check whether the data type is a float type

Usage

DataType_is_float()

Value

A logical value

Examples

pl$Float32$is_float()
pl$Int32$is_float()

Check whether the data type is an integer type

Description

Check whether the data type is an integer type

Usage

DataType_is_integer()

Value

A logical value

Examples

pl$Int32$is_integer()
pl$Float32$is_integer()

Check whether the data type is known

Description

Check whether the data type is known

Usage

DataType_is_known()

Value

A logical value

Examples

pl$String$is_known()
pl$Unknown$is_known()

Check whether the data type is a list type

Description

Check whether the data type is a list type

Usage

DataType_is_list()

Value

A logical value

Examples

pl$List()$is_list()
pl$Float32$is_list()

Check whether the data type is a logical type

Description

Check whether the data type is a logical type

Usage

DataType_is_logical()

Value

A logical value


Check whether the data type is a nested type

Description

Check whether the data type is a nested type

Usage

DataType_is_nested()

Value

A logical value

Examples

pl$List()$is_nested()
pl$Array(width = 2)$is_nested()
pl$Float32$is_nested()

Check whether the data type is a null type

Description

Check whether the data type is a null type

Usage

DataType_is_null()

Value

A logical value

Examples

pl$Null$is_null()
pl$Float32$is_null()

Check whether the data type is a numeric type

Description

Check whether the data type is a numeric type

Usage

DataType_is_numeric()

Value

A logical value

Examples

pl$Float32$is_numeric()
pl$Int32$is_numeric()
pl$String$is_numeric()

Check whether the data type is an ordinal type

Description

Check whether the data type is an ordinal type

Usage

DataType_is_ord()

Value

A logical value

Examples

pl$String$is_ord()
pl$Categorical()$is_ord()

Check whether the data type is a primitive type

Description

Check whether the data type is a primitive type

Usage

DataType_is_primitive()

Value

A logical value

Examples

pl$Float32$is_primitive()
pl$List()$is_primitive()

Check whether the data type is a signed integer type

Description

Check whether the data type is a signed integer type

Usage

DataType_is_signed_integer()

Value

A logical value

Examples

pl$Int32$is_signed_integer()
pl$UInt32$is_signed_integer()

Check whether the data type is a String type

Description

Check whether the data type is a String type

Usage

DataType_is_string()

Value

A logical value

Examples

pl$String$is_string()
pl$Float32$is_string()

Check whether the data type is a struct type

Description

Check whether the data type is a struct type

Usage

DataType_is_struct()

Value

A logical value

Examples

pl$Struct()$is_struct()
pl$Float32$is_struct()

Check whether the data type is a temporal type

Description

Check whether the data type is a temporal type

Usage

DataType_is_temporal()

Value

A logical value

Examples

pl$Date$is_temporal()
pl$Float32$is_temporal()

Check whether the data type is an unsigned integer type

Description

Check whether the data type is an unsigned integer type

Usage

DataType_is_unsigned_integer()

Value

A logical value

Examples

pl$UInt32$is_unsigned_integer()
pl$Int32$is_unsigned_integer()

Create List DataType

Description

Create List DataType

Usage

DataType_List(datatype = "unknown")

Arguments

datatype

The inner DataType.

Value

A list DataType with an inner DataType

Examples

# some nested List
pl$List(pl$List(pl$Boolean))

# check if some maybe_list is a List DataType
maybe_List = pl$List(pl$UInt64)
pl$same_outer_dt(maybe_List, pl$List())

Create Struct DataType

Description

One can create a Struct data type with pl$Struct(). There are also multiple ways to create columns of data type Struct in a DataFrame or a Series, see the examples.

Usage

DataType_Struct(...)

Arguments

...

Either named inputs of the form field_name = datatype or objects of class RPolarsField created by pl$Field().

Value

A Struct DataType containing a list of Fields

Examples

# create a Struct-DataType
pl$Struct(foo = pl$Int32, pl$Field("bar", pl$Boolean))

# check if an element is any kind of Struct()
test = pl$Struct(a = pl$UInt64)
pl$same_outer_dt(test, pl$Struct())

# `test` is a type of Struct, but it doesn't mean it is equal to an empty Struct
test == pl$Struct()

# The way to create a `Series` of type `Struct` is a bit convoluted as it involves
# `data.frame()`, `list()`, and `I()`:
as_polars_series(
  data.frame(a = 1:2, b = I(list(c("x", "y"), "z")))
)

# A slightly simpler way would be via `tibble::tibble()` or
# `data.table::data.table()`:
if (requireNamespace("tibble", quietly = TRUE)) {
  as_polars_series(
    tibble::tibble(a = 1:2, b = list(c("x", "y"), "z"))
  )
}

# Finally, one can use `pl$struct()` to convert existing columns or `Series`
# to a `Struct`:
x = pl$DataFrame(
  a = 1:2,
  b = list(c("x", "y"), "z")
)

out = x$select(pl$struct(c("a", "b")))
out

out$schema

Get the dimensions

Description

Get the dimensions

Usage

## S3 method for class 'RPolarsDataFrame'
dim(x)

## S3 method for class 'RPolarsLazyFrame'
dim(x)

Arguments

x

A DataFrame or LazyFrame


Get the row and column names

Description

Get the row and column names

Usage

## S3 method for class 'RPolarsDataFrame'
dimnames(x)

## S3 method for class 'RPolarsLazyFrame'
dimnames(x)

Arguments

x

A DataFrame or LazyFrame


Translation definitions across python, R and polars.

Description

Comments on how the R and Python worlds translate into polars:

R and Python are both high-level glue languages that are great for data science. Rust is a pedantic low-level language with similar use cases to C and C++. Polars is written in ~100k lines of Rust and has a Rust API. py-polars, the Python API for polars, is implemented as an interface to the Rust API. r-polars closely parallels py-polars, except that it interfaces with R. The performance and behavior are quite similar, as the 'engine' is the exact same Rust code and data structures.

Format

info

Value

Not applicable

Translation details

R and the integerish

R only has a native Int32 type; there are no UInt32, Int64, UInt64, ... types. These days Int32 is getting a bit small, as it cannot refer to more than ~2^31-1 rows. There are packages which provide int64, but the most common 'hack' is to just use floats as 'integerish'. There is a unique float64 value for every integer up to 2^53, which is plenty for all practical concerns. Some polars methods may accept or return a float even though an integer would ideally be more accurate. Most R functions intermix Int32 (integer) and Float64 (double) seamlessly.
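The limits described above can be checked directly in base R. The following is a small sketch that uses only base R (no polars calls); the variable names are illustrative.

```r
# Largest value an R integer (Int32) can hold: 2^31 - 1
int_max = .Machine$integer.max
int_max == 2^31 - 1

# Going past it overflows to NA (R also emits a warning)
overflowed = suppressWarnings(int_max + 1L)
is.na(overflowed)

# Doubles (Float64) represent every integer exactly up to 2^53 ...
big = 2^53
(big - 1) + 1 == big

# ... but beyond that, neighbouring integers collapse onto the same double
big + 1 == big
```

All four comparisons evaluate to TRUE, which is why doubles work well as 'integerish' row counts and indices far beyond the Int32 range.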

Missingness

R has allocated a value in every vector type to signal missingness; these are collectively called NAs. Polars uses a bool bitmask to signal NA-like missing values, which are called Null (plural: Nulls). This is not to be confused with R NULL (see the paragraph below). Polars supports missingness for any possible type, as it is kept separately in the bitmask. In Python lists the symbol None can carry a similar meaning. R NA ~ polars Null ~ py-polars [None] (in a py list)

Sorting and comparisons

From writing a lot of tests for all implementations, it appears polars does not have fully consistent or well-documented behavior when it comes to comparisons and sorting of floats. Some general rules of thumb do apply, though. In sorting, polars defines Null as a value lower than -Inf, as in Expr.arg_min(), except where Null is ignored, as in Expr.min(); there is an Expr.nan_min() but no Expr.nan_min(). NaN is sometimes a value higher than Inf and sometimes regarded as a Null. Polars conventions: NaN > Inf > 99 > -99 > -Inf > Null. Null == Null often yields false, sometimes true, and sometimes Null. The documentation and examples do not reveal these variations. When in doubt, the best approach is to test-sort a small Series/column containing all the values of interest.

R NaN ~ polars NaN ~ python [float("NaN")] #only floats have NaNs

R Inf ~ polars inf ~ python ⁠[float("inf")]⁠ #only floats have Inf

NULL IS NOT Null is not NULL

The R NULL does not exist inside polars frames, series and so on. It resembles Option::None in the hidden Rust code and None in Python. In all three languages, NULL/None/None is used in this context as a function argument to signal default behavior or perhaps a deactivated feature. R NULL does NOT translate into the polars bitmask Null; that is NA. R NULL ~ rust-polars Option::None ~ py-polars None #typically used for function arguments
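The difference between a missing value (NA) and the absence of a value (NULL) can be seen in base R alone. This is a minimal sketch; the helper `describe()` is hypothetical and only illustrates the "default argument" use of NULL.

```r
# NA is a value: it occupies a slot in the vector
with_na = c(1, NA, 3)
length(with_na)
is.na(with_na[2])

# NULL is the absence of a value: it vanishes when combined
with_null = c(1, NULL, 3)
length(with_null)

# NULL is typically used as a "not supplied" default for arguments
describe = function(limit = NULL) {
  if (is.null(limit)) "no limit" else paste("limit:", limit)
}
describe()
describe(10)
```

`with_na` keeps three slots while `with_null` only has two, which is why only NA has a counterpart (Null) inside a polars Series.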

LISTS, FRAMES AND DICTS

The following translations are relevant when loading data into polars. The R list appears similar to a Python dictionary (hashmap), but its implementation is closer to the Python list (array of pointers). R lists do support naming elements via a string vector. In polars, both lists (of vectors or Series) and data.frames can be used to construct a polars DataFrame, just as dictionaries would be used in Python. In terms of loading data in and out, the following translation holds: R data.frame/list ~ polars DataFrame ~ python dictionary

Series and Vectors

The R vector (Integer, Double, Character, ...) resembles the Series, as both exist independently of any frame and can be of any length. The implementation is quite different. For example, appending to an R vector in a for-loop is considered quite bad for performance, because the vector is fully rewritten in memory for every append. The polars Series has chunked memory allocation, so only the appended data needs to be written. However, fragmented memory is not great for fast computations, and polars objects have a rechunk() method to reallocate the chunks into one. Rechunk might be called implicitly by polars. In the context of constructing Series and extracting data, the following translation holds: R vector ~ polars Series/column ~ python list

Expressions

The polars Expr does not have any base R counterpart. Expr are analogous to how ggplot splits plotting instructions from the rendering. Base R plot immediately executes any instruction by adding e.g. pixels to a .png canvas. ggplot collects instructions, and when they are finally executed the rendering can be optimized across all instructions. Incidentally, ggplot command syntax is a monoid, meaning the order does not matter; that is not the case for polars Expr. Polars Expr can be understood as a DSL (domain-specific language) that expresses syntax trees of instructions. R expressions also evaluate to syntax trees, but it is difficult to optimize their execution order automatically without rewriting the code. A great selling point of polars is that any query will be optimized. Expr are very lightweight symbols chained together.


Aggregate over a DynamicGroupBy

Description

Aggregate a DataFrame over a time or integer window created with ⁠$group_by_dynamic()⁠.

Usage

DynamicGroupBy_agg(...)

Arguments

...

Exprs to aggregate over. Those can also be passed wrapped in a list, e.g ⁠$agg(list(e1,e2,e3))⁠.

Value

An aggregated DataFrame

Examples

df = pl$DataFrame(
  time = pl$datetime_range(
    start = strptime("2021-12-16 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    end = strptime("2021-12-16 03:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    interval = "30m"
  ),
  n = 0:6
)

# get the sum in the following hour relative to the "time" column
df$group_by_dynamic("time", every = "1h")$agg(
  vals = pl$col("n"),
  sum = pl$col("n")$sum()
)

# using "include_boundaries = TRUE" is helpful to see the period considered
df$group_by_dynamic("time", every = "1h", include_boundaries = TRUE)$agg(
  vals = pl$col("n")
)

# in the example above, the values didn't include the one *exactly* 1h after
# the start because "closed = 'left'" by default.
# Changing it to "right" includes values that are exactly 1h after. Note that
# the value at 00:00:00 now becomes included in the interval [23:00:00 - 00:00:00],
# even if this interval wasn't there originally
df$group_by_dynamic("time", every = "1h", closed = "right")$agg(
  vals = pl$col("n")
)
# To keep both boundaries, we use "closed = 'both'". Some values now belong to
# several groups:
df$group_by_dynamic("time", every = "1h", closed = "both")$agg(
  vals = pl$col("n")
)

# Dynamic group bys can also be combined with grouping on normal keys
df = df$with_columns(
  groups = as_polars_series(c("a", "a", "a", "b", "b", "a", "a"))
)
df

df$group_by_dynamic(
  "time",
  every = "1h",
  closed = "both",
  group_by = "groups",
  include_boundaries = TRUE
)$agg(pl$col("n"))

# We can also create a dynamic group by based on an index column
df = pl$LazyFrame(
  idx = 0:5,
  A = c("A", "A", "B", "B", "B", "C")
)$with_columns(pl$col("idx")$set_sorted())
df

df$group_by_dynamic(
  "idx",
  every = "2i",
  period = "3i",
  include_boundaries = TRUE,
  closed = "right"
)$agg(A_agg_list = pl$col("A"))

Operations on Polars DataFrame grouped on time or integer values

Description

This class comes from <DataFrame>$group_by_dynamic().

Examples

df = pl$DataFrame(
  time = pl$datetime_range(
    start = strptime("2021-12-16 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    end = strptime("2021-12-16 03:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    interval = "30m"
  ),
  n = 0:6
)

# create a DynamicGroupBy object with 1-hour windows on the "time" column
df$group_by_dynamic("time", every = "1h")

Ungroup a DynamicGroupBy object

Description

Revert the ⁠$group_by_dynamic()⁠ operation. Doing ⁠<DataFrame>$group_by_dynamic(...)$ungroup()⁠ returns the original DataFrame.

Usage

DynamicGroupBy_ungroup()

Value

DataFrame

Examples

df = pl$DataFrame(
  time = pl$datetime_range(
    start = strptime("2021-12-16 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    end = strptime("2021-12-16 03:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    interval = "30m"
  ),
  n = 0:6
)
df

df$group_by_dynamic("time", every = "1h")$ungroup()

Compute the absolute values

Description

Compute the absolute values

Usage

Expr_abs()

Value

Expr

Examples

pl$DataFrame(a = -1:1)$
  with_columns(abs = pl$col("a")$abs())

Add two expressions

Description

Method equivalent of addition operator expr + other.

Usage

Expr_add(other)

Arguments

other

numeric or string value; accepts expression input.

Value

Expr

Examples

df = pl$DataFrame(x = 1:5)

df$with_columns(
  `x+int` = pl$col("x")$add(2L),
  `x+expr` = pl$col("x")$add(pl$col("x")$cum_prod())
)

df = pl$DataFrame(
  x = c("a", "d", "g"),
  y = c("b", "e", "h"),
  z = c("c", "f", "i")
)

df$with_columns(
  pl$col("x")$add(pl$col("y"))$add(pl$col("z"))$alias("xyz")
)

Aggregate groups

Description

Get the group indexes of the group by operation. Should be used in aggregation context only.

Usage

Expr_agg_groups()

Value

Expr

Examples

df = pl$DataFrame(list(
  group = c("one", "one", "one", "two", "two", "two"),
  value = c(94, 95, 96, 97, 97, 99)
))
df$group_by("group", maintain_order = TRUE)$agg(pl$col("value")$agg_groups())

Rename Expr output

Description

Rename the output of an expression.

Usage

Expr_alias(name)

Arguments

name

New name of output

Value

Expr

Examples

pl$col("bob")$alias("alice")

Apply logical AND on a column

Description

Check if all values in a Boolean column are TRUE. This method is an expression - not to be confused with pl$all() which is a function to select all columns.

Usage

Expr_all(..., ignore_nulls = TRUE)

Arguments

...

Ignored.

ignore_nulls

If TRUE (default), ignore null values. If FALSE, Kleene logic is used to deal with nulls: if the column contains any null values and no FALSE values, the output is null.

Value

A logical value

Examples

df = pl$DataFrame(
  a = c(TRUE, TRUE),
  b = c(TRUE, FALSE),
  c = c(NA, TRUE),
  d = c(NA, NA)
)

# By default, ignore null values. If there are only nulls, then all() returns
# TRUE.
df$select(pl$col("*")$all())

# If we set ignore_nulls = FALSE, then we don't know if all values in column
# "c" are TRUE, so it returns null
df$select(pl$col("*")$all(ignore_nulls = FALSE))
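Base R's all() and any() follow the same Kleene logic for NA, so the two settings of ignore_nulls can be sketched without polars. This is only an analogy under that assumption, with na.rm = TRUE standing in for ignore_nulls = TRUE; the variable names are illustrative.

```r
# Nulls ignored: only the known values are considered
ignored_mixed = all(c(NA, TRUE), na.rm = TRUE)  # TRUE
ignored_only_na = all(c(NA, NA), na.rm = TRUE)  # TRUE: nothing left to contradict it

# Kleene logic: an unknown value may or may not be TRUE
all_unknown = all(c(NA, TRUE))   # NA: the result depends on the unknown value
all_settled = all(c(NA, FALSE))  # FALSE: one known FALSE settles it
any_unknown = any(c(NA, FALSE))  # NA: the unknown value could still be TRUE
any_settled = any(c(NA, TRUE))   # TRUE: one known TRUE settles it
```

For all(), a null result therefore requires at least one null and no FALSE; for any(), it requires at least one null and no TRUE.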

Apply logical AND on two expressions

Description

Combine two boolean expressions with AND.

Usage

Expr_and(other)

Arguments

other

numeric or string value; accepts expression input.

Value

Expr

Examples

pl$lit(TRUE) & TRUE
pl$lit(TRUE)$and(pl$lit(TRUE))

Apply logical OR on a column

Description

Check if any boolean value in a Boolean column is TRUE.

Usage

Expr_any(..., ignore_nulls = TRUE)

Arguments

...

Ignored.

ignore_nulls

If TRUE (default), ignore null values. If FALSE, Kleene logic is used to deal with nulls: if the column contains any null values and no TRUE values, the output is null.

Value

A logical value

Examples

df = pl$DataFrame(
  a = c(TRUE, FALSE),
  b = c(FALSE, FALSE),
  c = c(NA, FALSE)
)

df$select(pl$col("*")$any())

# If we set ignore_nulls = FALSE, then we don't know if any value in column
# "c" is TRUE, so it returns null
df$select(pl$col("*")$any(ignore_nulls = FALSE))

Append expressions

Description

This is done by adding the chunks of other to this output.

Usage

Expr_append(other, upcast = TRUE)

Arguments

other

Expr or something coercible to an Expr.

upcast

Cast both Expr to a common supertype if they have one.

Value

Expr

Examples

# append bottom row to top row
df = pl$DataFrame(list(a = 1:3, b = c(NA_real_, 4, 5)))
df$select(pl$all()$head(1)$append(pl$all()$tail(1)))

# implicit upcast, when upcast = TRUE (the default)
pl$DataFrame(list())$select(pl$lit(42)$append(42L))
pl$DataFrame(list())$select(pl$lit(42)$append(FALSE))
pl$DataFrame(list())$select(pl$lit("Bob")$append(FALSE))

Approx count unique values

Description

This is done using the HyperLogLog++ algorithm for cardinality estimation.

Usage

Expr_approx_n_unique()

Value

Expr

Examples

pl$DataFrame(iris[, 4:5])$
  with_columns(count = pl$col("Species")$approx_n_unique())

Compute inverse cosine

Description

Compute inverse cosine

Usage

Expr_arccos()

Value

Expr

Examples

pl$DataFrame(a = c(-1, cos(0.5), 0, 1, NA_real_))$
  with_columns(arccos = pl$col("a")$arccos())

Compute inverse hyperbolic cosine

Description

Compute inverse hyperbolic cosine

Usage

Expr_arccosh()

Value

Expr

Examples

pl$DataFrame(a = c(-1, cosh(0.5), 0, 1, NA_real_))$
  with_columns(arccosh = pl$col("a")$arccosh())

Compute inverse sine

Description

Compute inverse sine

Usage

Expr_arcsin()

Value

Expr

Examples

pl$DataFrame(a = c(-1, sin(0.5), 0, 1, NA_real_))$
  with_columns(arcsin = pl$col("a")$arcsin())

Compute inverse hyperbolic sine

Description

Compute inverse hyperbolic sine

Usage

Expr_arcsinh()

Value

Expr

Examples

pl$DataFrame(a = c(-1, sinh(0.5), 0, 1, NA_real_))$
  with_columns(arcsinh = pl$col("a")$arcsinh())

Compute inverse tangent

Description

Compute inverse tangent

Usage

Expr_arctan()

Value

Expr

Examples

pl$DataFrame(a = c(-1, tan(0.5), 0, 1, NA_real_))$
  with_columns(arctan = pl$col("a")$arctan())

Compute inverse hyperbolic tangent

Description

Compute inverse hyperbolic tangent

Usage

Expr_arctanh()

Value

Expr

Examples

pl$DataFrame(a = c(-1, tanh(0.5), 0, 1, NA_real_))$
  with_columns(arctanh = pl$col("a")$arctanh())

Index of max value

Description

Get the index of the maximal value.

Usage

Expr_arg_max()

Value

Expr

Examples

pl$DataFrame(
  a = c(6, 1, 0, NA, Inf, NaN)
)$with_columns(arg_max = pl$col("a")$arg_max())

Index of min value

Description

Get the index of the minimal value.

Usage

Expr_arg_min()

Value

Expr

Examples

pl$DataFrame(
  a = c(6, 1, 0, NA, Inf, NaN)
)$with_columns(arg_min = pl$col("a")$arg_min())

Index of a sort

Description

Get the index values that would sort this column.

Usage

Expr_arg_sort(descending = FALSE, nulls_last = FALSE)

Arguments

descending

A logical. If TRUE, sort in descending order.

nulls_last

A logical. If TRUE, place null values last instead of first.

Value

Expr

See Also

pl$arg_sort_by() to find the row indices that would sort multiple columns.

Examples

pl$DataFrame(
  a = c(6, 1, 0, NA, Inf, NaN)
)$with_columns(arg_sorted = pl$col("a")$arg_sort())

Index of first unique values

Description

This finds the position of first occurrence of each unique value.

Usage

Expr_arg_unique()

Value

Expr

Examples

pl$select(pl$lit(c(1:2, 1:3))$arg_unique())
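The idea of "position of the first occurrence of each unique value" can be sketched in base R with duplicated(). This sketch assumes that polars reports zero-based positions, so the one-based R positions are shifted by one; `first_pos` is an illustrative name.

```r
x = c(1:2, 1:3)  # 1 2 1 2 3

# Keep the positions whose value has not been seen before,
# then shift R's 1-based positions to 0-based ones
first_pos = which(!duplicated(x)) - 1L
first_pos
```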

Fill null values backward

Description

Fill missing values with the next to be seen values. Syntactic sugar for ⁠$fill_null(strategy = "backward")⁠.

Usage

Expr_backward_fill(limit = NULL)

Arguments

limit

Number of consecutive null values to fill when using the "forward" or "backward" strategy.

Value

Expr

Examples

pl$DataFrame(a = c(NA, 1, NA, 2, NA))$
  with_columns(
    backward = pl$col("a")$backward_fill()
  )

Bottom k values

Description

Return the k smallest elements. This has time complexity: $O(n + k \log n - \frac{k}{2})$

Usage

Expr_bottom_k(k)

Arguments

k

Number of smallest values to get.

Value

Expr

Examples

pl$DataFrame(a = c(6, 1, 0, NA, Inf, NaN))$select(pl$col("a")$bottom_k(5))

Cast between DataType

Description

Cast between DataType

Usage

Expr_cast(dtype, strict = TRUE)

Arguments

dtype

DataType to cast to.

strict

If TRUE (default), an error will be thrown if cast failed at resolve time.

Value

Expr

Examples

df = pl$DataFrame(a = 1:3, b = c(1, 2, 3))
df$with_columns(
  pl$col("a")$cast(pl$dtypes$Float64),
  pl$col("b")$cast(pl$dtypes$Int32)
)

# strict FALSE, inserts null for any cast failure
pl$lit(c(100, 200, 300))$cast(pl$dtypes$UInt8, strict = FALSE)$to_series()

# strict TRUE, raise any failure as an error when query is executed.
tryCatch(
  {
    pl$lit("a")$cast(pl$dtypes$Float64, strict = TRUE)$to_series()
  },
  error = function(e) e
)

Ceiling

Description

Rounds up to the nearest integer value. Only works on floating point Series.

Usage

Expr_ceil()

Value

Expr

Examples

pl$DataFrame(a = c(0.33, 0.5, 1.02, 1.5, NaN, NA, Inf, -Inf))$with_columns(
  ceiling = pl$col("a")$ceil()
)

Polars Expressions

Description

Expressions are all the functions and methods that are applicable to a Polars DataFrame or LazyFrame object. Some methods are under the sub-namespaces.

Sub-namespaces

arr

⁠$arr⁠ stores all array related methods.

bin

⁠$bin⁠ stores all binary related methods.

cat

⁠$cat⁠ stores all categorical related methods.

dt

⁠$dt⁠ stores all temporal related methods.

list

⁠$list⁠ stores all list related methods.

meta

⁠$meta⁠ stores all methods for working with the meta data.

name

⁠$name⁠ stores all name related methods.

str

⁠$str⁠ stores all string related methods.

struct

⁠$struct⁠ stores all struct related methods.

Examples

df = pl$DataFrame(
  a = 1:2,
  b = list(1:2, 3:4),
  schema = list(a = pl$Int64, b = pl$Array(pl$Int64, 2))
)

df$select(pl$col("a")$first())

df$select(pl$col("b")$arr$sum())

Clip elements

Description

Set values outside the given boundaries to the boundary value. This only works for numeric and temporal values.

Usage

Expr_clip(lower_bound = NULL, upper_bound = NULL)

Arguments

lower_bound

Lower bound. Accepts expression input. Strings are parsed as column names and other non-expression inputs are parsed as literals.

upper_bound

Upper bound. Accepts expression input. Strings are parsed as column names and other non-expression inputs are parsed as literals.

Value

Expr

Examples

df = pl$DataFrame(foo = c(-50L, 5L, NA_integer_, 50L), bound = c(1, 10, 1, 1))

# With the two bounds
df$with_columns(clipped = pl$col("foo")$clip(1, 10))

# Without lower bound
df$with_columns(clipped = pl$col("foo")$clip(upper_bound = 10))

# Using another column as lower bound
df$with_columns(clipped = pl$col("foo")$clip(lower_bound = "bound"))
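Clipping with literal bounds corresponds to an element-wise minimum followed by an element-wise maximum. The following base-R sketch reproduces the first two examples with pmin() and pmax(); it is an analogy rather than the polars implementation, and missing values simply stay missing in it.

```r
foo = c(-50L, 5L, NA_integer_, 50L)

# Both bounds: cap at 10 from above, then at 1 from below
clipped_both = pmax(pmin(foo, 10), 1)
clipped_both

# Upper bound only
clipped_upper = pmin(foo, 10)
clipped_upper
```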

Compute cosine

Description

Compute cosine

Usage

Expr_cos()

Value

Expr

Examples

pl$DataFrame(a = c(0, pi / 2, pi, NA_real_))$
  with_columns(cosine = pl$col("a")$cos())

Compute hyperbolic cosine

Description

Compute hyperbolic cosine

Usage

Expr_cosh()

Value

Expr

Examples

pl$DataFrame(a = c(-1, acosh(2), 0, 1, NA_real_))$
  with_columns(cosh = pl$col("a")$cosh())

Count elements

Description

Count the number of elements in this expression. Note that NULL values are also counted. ⁠$len()⁠ is an alias.

Usage

Expr_count()

Expr_len()

Value

Expr

Examples

pl$DataFrame(
  all = c(TRUE, TRUE),
  any = c(TRUE, FALSE),
  none = c(FALSE, FALSE)
)$select(
  pl$all()$count()
)

Cumulative count

Description

Get an array with the cumulative count (zero-indexed) computed at every element.

Usage

Expr_cum_count(reverse = FALSE)

Arguments

reverse

If TRUE, reverse the count.

Details

The Dtypes Int8, UInt8, Int16 and UInt16 are cast to Int64 before summing to prevent overflow issues.

⁠$cum_count()⁠ does not seem to count within lists.

Value

Expr

Examples

pl$DataFrame(a = 1:4)$with_columns(
  pl$col("a")$cum_count()$alias("cum_count"),
  pl$col("a")$cum_count(reverse = TRUE)$alias("cum_count_reversed")
)

Cumulative maximum

Description

Get an array with the cumulative max computed at every element.

Usage

Expr_cum_max(reverse = FALSE)

Arguments

reverse

If TRUE, start from the last value.

Details

The Dtypes Int8, UInt8, Int16 and UInt16 are cast to Int64 before summing to prevent overflow issues.

Value

Expr

Examples

pl$DataFrame(a = c(1:4, 2L))$with_columns(
  pl$col("a")$cum_max()$alias("cum_max"),
  pl$col("a")$cum_max(reverse = TRUE)$alias("cum_max_reversed")
)

Cumulative minimum

Description

Get an array with the cumulative min computed at every element.

Usage

Expr_cum_min(reverse = FALSE)

Arguments

reverse

If TRUE, start from the last value.

Details

The Dtypes Int8, UInt8, Int16 and UInt16 are cast to Int64 before summing to prevent overflow issues.

Value

Expr

Examples

pl$DataFrame(a = c(1:4, 2L))$with_columns(
  pl$col("a")$cum_min()$alias("cum_min"),
  pl$col("a")$cum_min(reverse = TRUE)$alias("cum_min_reversed")
)

Cumulative product

Description

Get an array with the cumulative product computed at every element.

Usage

Expr_cum_prod(reverse = FALSE)

Arguments

reverse

If TRUE, start with the total product of elements and divide each row one by one.

Details

The Dtypes Int8, UInt8, Int16 and UInt16 are cast to Int64 before summing to prevent overflow issues.

Value

Expr

Examples

pl$DataFrame(a = 1:4)$with_columns(
  pl$col("a")$cum_prod()$alias("cum_prod"),
  pl$col("a")$cum_prod(reverse = TRUE)$alias("cum_prod_reversed")
)

Cumulative sum

Description

Get an array with the cumulative sum computed at every element.

Usage

Expr_cum_sum(reverse = FALSE)

Arguments

reverse

If TRUE, start with the total sum of elements and subtract each row one by one.

Details

The Dtypes Int8, UInt8, Int16 and UInt16 are cast to Int64 before summing to prevent overflow issues.

Value

Expr

Examples

pl$DataFrame(a = 1:4)$with_columns(
  pl$col("a")$cum_sum()$alias("cum_sum"),
  pl$col("a")$cum_sum(reverse = TRUE)$alias("cum_sum_reversed")
)
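The effect of reverse = TRUE can be sketched in base R in two equivalent ways: accumulate from the last element, or start from the total and subtract the preceding rows. The variable names are illustrative and the sketch assumes there are no missing values.

```r
a = 1:4

# Forward cumulative sum
cum_sum = cumsum(a)

# Reverse: accumulate from the last element backwards
cum_sum_reversed = rev(cumsum(rev(a)))

# Same result: total sum minus the sum of all preceding rows
total_minus_previous = sum(a) - c(0, head(cumsum(a), -1))

cum_sum
cum_sum_reversed
total_minus_previous
```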

Cumulative evaluation of expressions

Description

Run an expression over a sliding window that increases by 1 slot every iteration.

Usage

Expr_cumulative_eval(expr, min_periods = 1L, parallel = FALSE)

Arguments

expr

Expression to evaluate.

min_periods

Number of valid (non-null) values there should be in the window before the expression is evaluated.

parallel

Run in parallel. Don't do this in a groupby or another operation that already has much parallelization.

Details

This can be really slow as it can have O(n^2) complexity. Don't use this for operations that visit all elements.

Value

Expr

Examples

pl$lit(1:5)$cumulative_eval(
  pl$element()$first() - pl$element()$last()^2
)$to_r()
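To make the growing window explicit, the same computation can be written as a base-R loop over prefixes of the input. This is a sketch of the idea only (it ignores min_periods and parallel); each prefix is re-evaluated from scratch, which is also where the O(n^2) cost comes from.

```r
x = 1:5

expanding = sapply(seq_along(x), function(i) {
  window = x[seq_len(i)]      # elements 1..i, growing by one slot per iteration
  window[1] - window[i]^2     # first element minus the squared last element
})
expanding
```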

Bin continuous values into discrete categories

Description

Bin continuous values into discrete categories

Usage

Expr_cut(
  breaks,
  ...,
  labels = NULL,
  left_closed = FALSE,
  include_breaks = FALSE
)

Arguments

breaks

Unique cut points.

...

Ignored.

labels

Names of the categories. The number of labels must be equal to the number of cut points plus one.

left_closed

Set the intervals to be left-closed instead of right-closed.

include_breaks

Include a column with the right endpoint of the bin each observation falls in. This will change the data type of the output from a Categorical to a Struct.

Value

Expr of data type Categorical if include_breaks is FALSE and of data type Struct if include_breaks is TRUE.

See Also

$qcut()

Examples

df = pl$DataFrame(foo = c(-2, -1, 0, 1, 2))

df$with_columns(
  cut = pl$col("foo")$cut(c(-1, 1), labels = c("a", "b", "c"))
)

# Add both the category and the breakpoint
df$with_columns(
  cut = pl$col("foo")$cut(c(-1, 1), include_breaks = TRUE)
)$unnest("cut")
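For comparison, the first example can be sketched with base R's cut(). Note one difference in this analogy: base R expects the outer boundaries to be listed explicitly (here -Inf and Inf), whereas the polars method takes only the interior cut points, so two cut points give three bins and need three labels.

```r
foo = c(-2, -1, 0, 1, 2)

# Right-closed intervals: (-Inf, -1], (-1, 1], (1, Inf]
binned = cut(
  foo,
  breaks = c(-Inf, -1, 1, Inf),
  labels = c("a", "b", "c"),
  right = TRUE
)
as.character(binned)
```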

Difference

Description

Calculate the n-th discrete difference.

Usage

Expr_diff(n = 1, null_behavior = c("ignore", "drop"))

Arguments

n

Number of slots to shift.

null_behavior

String, either "ignore" (default) or "drop".

Value

Expr

Examples

pl$DataFrame(a = c(20L, 10L, 30L, 40L))$with_columns(
  diff_default = pl$col("a")$diff(),
  diff_2_ignore = pl$col("a")$diff(2, "ignore")
)
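A base-R sketch of the same differences is shown below. It assumes, based on the argument description, that n is the number of slots to shift (a lag), and that with "ignore" the leading positions without a predecessor stay missing; base R's diff() drops them, so they are padded back with NA here.

```r
a = c(20L, 10L, 30L, 40L)

# Difference with the previous element
diff_1 = c(NA, diff(a))

# Difference with the element two slots earlier
diff_2 = c(NA, NA, diff(a, lag = 2))

diff_1
diff_2
```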

Divide two expressions

Description

Method equivalent of float division operator expr / other.

Usage

Expr_div(other)

Arguments

other

Numeric literal or expression value.

Details

Zero-division behaviour follows IEEE-754:

  • 0/0: Invalid operation - mathematically undefined, returns NaN.

  • n/0: On finite operands gives an exact infinite result, e.g.: ±infinity.

Value

Expr

Examples

df = pl$DataFrame(
  x = -2:2,
  y = c(0.5, 0, 0, -4, -0.5)
)

df$with_columns(
  `x/2` = pl$col("x")$div(2),
  `x/y` = pl$col("x")$div(pl$col("y"))
)
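R doubles follow the same IEEE-754 rules for division by zero, so the behaviour described in the Details can be checked in base R. The variable names are illustrative.

```r
zero_by_zero = 0 / 0   # invalid operation: NaN
pos_by_zero = 1 / 0    # finite positive operand: Inf
neg_by_zero = -1 / 0   # finite negative operand: -Inf

c(zero_by_zero, pos_by_zero, neg_by_zero)
```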

Dot product

Description

Compute the dot/inner product between two Expressions.

Usage

Expr_dot(other)

Arguments

other

numeric or string value; accepts expression input.

Value

Expr

Examples

pl$DataFrame(
  a = 1:4, b = c(1, 2, 3, 4)
)$with_columns(
  pl$col("a")$dot(pl$col("b"))$alias("a dot b"),
  pl$col("a")$dot(pl$col("a"))$alias("a dot a")
)

Drop NaN

Description

Drop NaN

Usage

Expr_drop_nans()

Details

Note that NaN values are not null values. Null values correspond to NA in R.

Value

Expr

See Also

drop_nulls()

Examples

pl$DataFrame(list(x = c(1, 2, NaN, NA)))$select(pl$col("x")$drop_nans())

Drop missing values

Description

Drop missing values

Usage

Expr_drop_nulls()

Value

Expr

See Also

drop_nans()

Examples

pl$DataFrame(list(x = c(1, 2, NaN, NA)))$select(pl$col("x")$drop_nulls())

Entropy

Description

The entropy is measured with the formula -sum(pk * log(pk)) where pk are discrete probabilities.

Usage

Expr_entropy(base = base::exp(1), normalize = TRUE)

Arguments

base

Given exponential base, defaults to exp(1).

normalize

Normalize pk if it doesn't sum to 1.

Value

Expr

Examples

pl$DataFrame(x = c(1, 2, 3, 2))$
  with_columns(entropy = pl$col("x")$entropy(base = 2))
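The formula -sum(pk * log(pk)) can be written out in base R as follows. `entropy_sketch()` is a hypothetical helper, not the polars implementation, and it assumes strictly positive inputs (a zero would produce NaN here).

```r
entropy_sketch = function(x, base = exp(1), normalize = TRUE) {
  # Turn the values into probabilities that sum to 1, if requested
  pk = if (normalize) x / sum(x) else x
  -sum(pk * log(pk, base = base))
}

# Same input and base as the example above
h = entropy_sketch(c(1, 2, 3, 2), base = 2)
h
```

With normalize = TRUE, the input c(1, 2, 3, 2) is first rescaled to the probabilities 1/8, 2/8, 3/8 and 2/8.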

Check equality

Description

Method equivalent of equality operator expr == other.

Usage

Expr_eq(other)

Arguments

other

numeric or string value; accepts expression input.

Value

Expr

See Also

Expr_eq_missing

Examples

pl$lit(2) == 2
pl$lit(2) == pl$lit(2)
pl$lit(2)$eq(pl$lit(2))

Check equality without null propagation

Description

Method equivalent of equality operator expr == other where null == null. This differs from $eq(), where null values are propagated.

Usage

Expr_eq_missing(other)

Arguments

other

numeric or string value; accepts expression input.

Value

Expr

See Also

Expr_eq

Examples

df = pl$DataFrame(x = c(NA, FALSE, TRUE), y = c(TRUE, TRUE, TRUE))
df$with_columns(
  eq = pl$col("x")$eq("y"),
  eq_missing = pl$col("x")$eq_missing("y")
)
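The difference between the two comparisons can be sketched in base R, where == propagates NA in the same way. The `eq_missing` expression below is an illustrative reimplementation that treats two missing values as equal and a missing value as different from any known value.

```r
x = c(NA, FALSE, TRUE)
y = c(TRUE, TRUE, TRUE)

# Ordinary equality: a missing operand gives a missing result
eq = x == y

# Equality without propagation: compare the missingness itself when needed
eq_missing = ifelse(is.na(x) | is.na(y), is.na(x) & is.na(y), x == y)

eq
eq_missing
```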

Exponentially-weighted moving average

Description

Exponentially-weighted moving average

Usage

Expr_ewm_mean(
  com = NULL,
  span = NULL,
  half_life = NULL,
  alpha = NULL,
  adjust = TRUE,
  min_periods = 1L,
  ignore_nulls = TRUE
)

Arguments

com

Specify decay in terms of center of mass, \(\gamma\), with \(\alpha = \frac{1}{1 + \gamma} \; \forall \; \gamma \geq 0\).

span

Specify decay in terms of span, \(\theta\), with \(\alpha = \frac{2}{\theta + 1} \; \forall \; \theta \geq 1\).

half_life

Specify decay in terms of half-life, \(\lambda\), with \(\alpha = 1 - \exp \left\{ \frac{-\ln(2)}{\lambda} \right\} \; \forall \; \lambda > 0\).

alpha

Specify smoothing factor alpha directly, \(0 < \alpha \leq 1\).

adjust

Divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings:

  • When adjust=TRUE the EW function is calculated using weights \(w_i = (1 - \alpha)^i\).

  • When adjust=FALSE the EW function is calculated recursively by \(y_0 = x_0\) and \(y_t = (1 - \alpha) y_{t - 1} + \alpha x_t\).

min_periods

Minimum number of observations in window required to have a value (otherwise result is null).

ignore_nulls

Ignore missing values when calculating weights:

  • When TRUE (default), weights are based on relative positions. For example, the weights of \(x_0\) and \(x_2\) used in calculating the final weighted average of [\(x_0\), null, \(x_2\)] are \(1 - \alpha\) and \(1\) if adjust=TRUE, and \(1 - \alpha\) and \(\alpha\) if adjust=FALSE.

  • When FALSE, weights are based on absolute positions. For example, the weights of \(x_0\) and \(x_2\) used in calculating the final weighted average of [\(x_0\), null, \(x_2\)] are \((1 - \alpha)^2\) and \(1\) if adjust=TRUE, and \((1 - \alpha)^2\) and \(\alpha\) if adjust=FALSE.

Value

Expr

Examples

pl$DataFrame(a = 1:3)$
  with_columns(ewm_mean = pl$col("a")$ewm_mean(com = 1))
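
The two modes of the adjust argument can be worked through by hand. The following base R sketch is not the polars implementation; ewm_mean_sketch is a hypothetical helper that assumes the input has no missing values and ignores min_periods and ignore_nulls.

```r
# illustrative helper, not part of polars
ewm_mean_sketch = function(x, alpha, adjust = TRUE) {
  out = numeric(length(x))
  for (t in seq_along(x)) {
    if (adjust) {
      # weights (1 - alpha)^i, with i = 0 for the most recent value
      w = (1 - alpha)^((t - 1):0)
      out[t] = sum(w * x[1:t]) / sum(w)
    } else if (t == 1) {
      # the recursion starts at the first observation
      out[t] = x[1]
    } else {
      # recursive form: y_t = (1 - alpha) * y_{t-1} + alpha * x_t
      out[t] = (1 - alpha) * out[t - 1] + alpha * x[t]
    }
  }
  out
}

alpha = 1 / (1 + 1) # com = 1 gives alpha = 0.5
ewm_mean_sketch(1:3, alpha)
ewm_mean_sketch(1:3, alpha, adjust = FALSE)
```

With adjust = TRUE the third value is (0.25 * 1 + 0.5 * 2 + 1 * 3) / (0.25 + 0.5 + 1); with adjust = FALSE it is 0.5 * 1.5 + 0.5 * 3.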

Exponentially-weighted moving standard deviation

Description

Exponentially-weighted moving standard deviation

Usage

Expr_ewm_std(
  com = NULL,
  span = NULL,
  half_life = NULL,
  alpha = NULL,
  adjust = TRUE,
  bias = FALSE,
  min_periods = 1L,
  ignore_nulls = TRUE
)

Arguments

com

Specify decay in terms of center of mass, \(\gamma\), with \(\alpha = \frac{1}{1 + \gamma} \; \forall \; \gamma \geq 0\).

span

Specify decay in terms of span, \(\theta\), with \(\alpha = \frac{2}{\theta + 1} \; \forall \; \theta \geq 1\).

half_life

Specify decay in terms of half-life, \(\lambda\), with \(\alpha = 1 - \exp \left\{ \frac{-\ln(2)}{\lambda} \right\} \; \forall \; \lambda > 0\).

alpha

Specify smoothing factor alpha directly, \(0 < \alpha \leq 1\).

adjust

Divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings:

  • When adjust=TRUE the EW function is calculated using weights \(w_i = (1 - \alpha)^i\).

  • When adjust=FALSE the EW function is calculated recursively by \(y_0 = x_0\) and \(y_t = (1 - \alpha) y_{t - 1} + \alpha x_t\).

bias

If FALSE, the calculations are corrected for statistical bias.

min_periods

Minimum number of observations in window required to have a value (otherwise result is null).

ignore_nulls

Ignore missing values when calculating weights:

  • When TRUE (default), weights are based on relative positions. For example, the weights of \(x_0\) and \(x_2\) used in calculating the final weighted average of [\(x_0\), null, \(x_2\)] are \(1 - \alpha\) and \(1\) if adjust=TRUE, and \(1 - \alpha\) and \(\alpha\) if adjust=FALSE.

  • When FALSE, weights are based on absolute positions. For example, the weights of \(x_0\) and \(x_2\) used in calculating the final weighted average of [\(x_0\), null, \(x_2\)] are \((1 - \alpha)^2\) and \(1\) if adjust=TRUE, and \((1 - \alpha)^2\) and \(\alpha\) if adjust=FALSE.

Value

Expr

Examples

pl$DataFrame(a = 1:3)$
  with_columns(ewm_std = pl$col("a")$ewm_std(com = 1))
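
The com, span, and half_life arguments are alternative ways of specifying the same smoothing factor alpha. The base R sketch below simply evaluates the three formulas from the Arguments section; the helper names are hypothetical and not part of polars.

```r
# illustrative helpers, not part of polars
alpha_from_com = function(com) 1 / (1 + com)         # valid for com >= 0
alpha_from_span = function(span) 2 / (span + 1)      # valid for span >= 1
alpha_from_half_life = function(half_life) {
  1 - exp(-log(2) / half_life)                       # valid for half_life > 0
}

# these three settings all describe the same decay, alpha = 0.5
alpha_from_com(1)
alpha_from_span(3)
alpha_from_half_life(1)
```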

Exponentially-weighted moving variance

Description

Exponentially-weighted moving variance

Usage

Expr_ewm_var(
  com = NULL,
  span = NULL,
  half_life = NULL,
  alpha = NULL,
  adjust = TRUE,
  bias = FALSE,
  min_periods = 1L,
  ignore_nulls = TRUE
)

Arguments

com

Specify decay in terms of center of mass, \(\gamma\), with \(\alpha = \frac{1}{1 + \gamma} \; \forall \; \gamma \geq 0\).

span

Specify decay in terms of span, \(\theta\), with \(\alpha = \frac{2}{\theta + 1} \; \forall \; \theta \geq 1\).

half_life

Specify decay in terms of half-life, \(\lambda\), with \(\alpha = 1 - \exp \left\{ \frac{-\ln(2)}{\lambda} \right\} \; \forall \; \lambda > 0\).

alpha

Specify smoothing factor alpha directly, \(0 < \alpha \leq 1\).

adjust

Divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings:

  • When adjust=TRUE the EW function is calculated using weights \(w_i = (1 - \alpha)^i\).

  • When adjust=FALSE the EW function is calculated recursively by \(y_0 = x_0\) and \(y_t = (1 - \alpha) y_{t - 1} + \alpha x_t\).

bias

If FALSE, the calculations are corrected for statistical bias.

min_periods

Minimum number of observations in window required to have a value (otherwise result is null).

ignore_nulls

Ignore missing values when calculating weights:

  • When TRUE (default), weights are based on relative positions. For example, the weights of \(x_0\) and \(x_2\) used in calculating the final weighted average of [\(x_0\), null, \(x_2\)] are \(1 - \alpha\) and \(1\) if adjust=TRUE, and \(1 - \alpha\) and \(\alpha\) if adjust=FALSE.

  • When FALSE, weights are based on absolute positions. For example, the weights of \(x_0\) and \(x_2\) used in calculating the final weighted average of [\(x_0\), null, \(x_2\)] are \((1 - \alpha)^2\) and \(1\) if adjust=TRUE, and \((1 - \alpha)^2\) and \(\alpha\) if adjust=FALSE.

Value

Expr

Examples

pl$DataFrame(a = 1:3)$
  with_columns(ewm_var = pl$col("a")$ewm_var(com = 1))

Exclude certain columns from selection

Description

Exclude certain columns from selection

Usage

Expr_exclude(columns)

Arguments

columns

The behavior depends on the type of the input:

  • string: single column name or regex starting with ^ and ending with $

  • character vector: exclude all these column names, no regex allowed

  • DataType: Exclude any of this DataType

  • List(DataType): Exclude any of these DataType(s)

Value

Expr

Examples

# make DataFrame
df = pl$DataFrame(iris)

# by name(s)
df$select(pl$all()$exclude("Species"))

# by type
df$select(pl$all()$exclude(pl$Categorical()))
df$select(pl$all()$exclude(list(pl$Categorical(), pl$Float64)))

# by regex
df$select(pl$all()$exclude("^Sepal.*$"))

Compute the exponential of the elements

Description

Compute the exponential of the elements

Usage

Expr_exp()

Value

Expr

Examples

pl$DataFrame(a = -1:3)$with_columns(a_exp = pl$col("a")$exp())

Explode a list or String Series

Description

This means that every item is expanded to a new row.

Usage

Expr_explode()

Details

Categorical values are not supported.

Value

Expr

Examples

df = pl$DataFrame(x = c("abc", "ab"), y = c(list(1:3), list(3:5)))
df

df$select(pl$col("y")$explode())

Extend Series with a constant

Description

Extend the Series with given number of values.

Usage

Expr_extend_constant(value, n)

Arguments

value

The value to extend the Series with. This value may be NULL to fill with nulls.

n

The number of values to extend.

Value

Expr

Examples

pl$select(pl$lit(1:4)$extend_constant(10.1, 2))
pl$select(pl$lit(1:4)$extend_constant(NULL, 2))

Fill floating point NaN value with a fill value

Description

Fill floating point NaN value with a fill value

Usage

Expr_fill_nan(value = NULL)

Arguments

value

Value used to fill NaN values.

Value

Expr

Examples

pl$DataFrame(a = c(NaN, 1, NaN, 2, NA))$
  with_columns(
  literal = pl$col("a")$fill_nan(999),
  # implicit coercion to string
  string = pl$col("a")$fill_nan("invalid")
)

Fill null values with a value or strategy

Description

Fill null values with a value or strategy

Usage

Expr_fill_null(value = NULL, strategy = NULL, limit = NULL)

Arguments

value

Expr or something coercible in an Expr

strategy

Possible choices are NULL (default, requires a non-null value), "forward", "backward", "min", "max", "mean", "zero", "one".

limit

Number of consecutive null values to fill when using the "forward" or "backward" strategy.

Value

Expr

Examples

pl$DataFrame(a = c(NA, 1, NA, 2, NA))$
  with_columns(
  value = pl$col("a")$fill_null(999),
  backward = pl$col("a")$fill_null(strategy = "backward"),
  mean = pl$col("a")$fill_null(strategy = "mean")
)
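
To make the "forward" strategy and the limit argument concrete, here is a base R sketch on a plain vector. forward_fill_sketch is a hypothetical helper, not the polars implementation; it assumes that limit caps how many consecutive nulls are filled within one gap, and it treats NaN like NA because is.na() flags both.

```r
# illustrative helper, not part of polars
forward_fill_sketch = function(x, limit = Inf) {
  filled = 0 # consecutive nulls filled in the current gap
  for (i in seq_along(x)) {
    if (!is.na(x[i])) {
      # a known value ends the gap
      filled = 0
    } else if (i > 1 && !is.na(x[i - 1]) && filled < limit) {
      # carry the previous value forward
      x[i] = x[i - 1]
      filled = filled + 1
    }
  }
  x
}

a = c(NA, 1, NA, 2, NA)
forward_fill_sketch(a)
forward_fill_sketch(c(1, NA, NA, NA, 5), limit = 2)

# "mean" strategy: replace every null with the mean of the known values
replace(a, is.na(a), mean(a, na.rm = TRUE))
```

A leading null stays null under forward filling because there is no earlier value to carry forward.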

Filter a single column.

Description

Mostly useful in an aggregation context. If you want to filter on a DataFrame level, use DataFrame$filter() (or LazyFrame$filter()).

Usage

Expr_filter(predicate)

Arguments

predicate

An Expr or something coercible to an Expr. Must return a boolean.

Value

Expr

Examples

df = pl$DataFrame(
  group_col = c("g1", "g1", "g2"),
  b = c(1, 2, 3)
)
df

df$group_by("group_col")$agg(
  lt = pl$col("b")$filter(pl$col("b") < 2),
  gte = pl$col("b")$filter(pl$col("b") >= 2)
)

Get the first value.

Description

Get the first value.

Usage

Expr_first()

Value

Expr

Examples

pl$DataFrame(x = 3:1)$with_columns(first = pl$col("x")$first())

Explode a list or String Series

Description

This is an alias for ⁠<Expr>$explode()⁠.

Usage

Expr_flatten()

Value

Expr

Examples

df = pl$DataFrame(x = c("abc", "ab"), y = c(list(1:3), list(3:5)))
df

df$select(pl$col("y")$flatten())

Floor

Description

Rounds down to the nearest integer value. Only works on floating point Series.

Usage

Expr_floor()

Value

Expr

Examples

pl$DataFrame(a = c(0.33, 0.5, 1.02, 1.5, NaN, NA, Inf, -Inf))$with_columns(
  floor = pl$col("a")$floor()
)

Floor divide two expressions

Description

Method equivalent of floor division operator expr %/% other.

Usage

Expr_floor_div(other)

Arguments

other

Numeric literal or expression value.

Value

Expr

Examples

df = pl$DataFrame(x = 1:5)

df$with_columns(
  `x/2` = pl$col("x")$div(2),
  `x%/%2` = pl$col("x")$floor_div(2)
)

Fill null values forward

Description

Fill missing values with the last seen values. Syntactic sugar for ⁠$fill_null(strategy = "forward")⁠.

Usage

Expr_forward_fill(limit = NULL)

Arguments

limit

Number of consecutive null values to fill.

Value

Expr

Examples

pl$DataFrame(a = c(NA, 1, NA, 2, NA))$
  with_columns(
  forward = pl$col("a")$forward_fill()
)

Gather values by index

Description

Gather values by index

Usage

Expr_gather(indices)

Arguments

indices

R vector or Series, or Expr that leads to a Series of dtype Int64. (0-indexed)

Value

Expr

Examples

df = pl$DataFrame(a = 1:10)

df$select(pl$col("a")$gather(c(0, 2, 4, -1)))
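
Because R itself is 1-indexed, the 0-indexed positions can be surprising. The base R sketch below translates the indices from the example into R positions; it assumes that negative indices count from the end, as the -1 in the example suggests, and it uses illustrative variable names only.

```r
x = 1:10
idx0 = c(0, 2, 4, -1) # 0-based; negative values count from the end

# translate to R's 1-based positions
idx1 = ifelse(idx0 < 0, length(x) + idx0 + 1, idx0 + 1)

x[idx1]
```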

Gather every nth element

Description

Gather every nth value in the Series and return as a new Series.

Usage

Expr_gather_every(n, offset = 0)

Arguments

n

Positive integer.

offset

Starting index.

Value

Expr

Examples

pl$DataFrame(a = 0:24)$select(pl$col("a")$gather_every(6))

Check strictly greater inequality

Description

Method equivalent of "greater than" operator expr > other.

Usage

Expr_gt(other)

Arguments

other

numeric or string value; accepts expression input.

Value

Expr

Examples

pl$lit(2) > 1
pl$lit(2) > pl$lit(1)
pl$lit(2)$gt(pl$lit(1))

Check greater or equal inequality

Description

Method equivalent of "greater than or equal" operator expr >= other.

Usage

Expr_gt_eq(other)

Arguments

other

numeric or string value; accepts expression input.

Value

Expr

Examples

pl$lit(2) >= 2
pl$lit(2) >= pl$lit(2)
pl$lit(2)$gt_eq(pl$lit(2))

Check whether the expression contains one or more null values

Description

Check whether the expression contains one or more null values

Usage

Expr_has_nulls()

Value

Expr

Examples

df = pl$DataFrame(
  a = c(NA, 1, NA),
  b = c(1, NA, 2),
  c = c(1, 2, 3)
)

df$select(pl$all()$has_nulls())

Hash elements

Description

The hash value is of type UInt64.

Usage

Expr_hash(seed = 0, seed_1 = NULL, seed_2 = NULL, seed_3 = NULL)

Arguments

seed

Random seed parameter. Defaults to 0. Doesn't have any effect for now.

seed_1, seed_2, seed_3

Random seed parameter. Defaults to arg seed. The column will be coerced to UInt32.

Value

Expr

Examples

df = pl$DataFrame(iris[1:3, c(1, 2)])
df$with_columns(pl$all()$hash(1234)$name$suffix("_hash"))

Get the first n elements

Description

Get the first n elements

Usage

Expr_head(n = 10)

Arguments

n

Number of elements to take.

Value

Expr

Examples

pl$DataFrame(x = 1:11)$select(pl$col("x")$head(3))

Wrap column in list

Description

Aggregate values into a list.

Usage

Expr_implode()

Details

Use ⁠$to_struct()⁠ to wrap a DataFrame.

Value

Expr

Examples

df = pl$DataFrame(
  a = 1:3,
  b = 4:6
)
df$select(pl$all()$implode())

Inspect evaluated Series

Description

Print the value that this expression evaluates to and pass on the value. The printing will happen when the expression evaluates, not when it is formed.

Usage

Expr_inspect(fmt = "{}")

Arguments

fmt

Format string; it should contain one set of {} where the object will be printed. This mimics Python's "string".format() as used in py-polars.

Value

Expr

Examples

pl$select(pl$lit(1:5)$inspect(
  "Here's what the Series looked like before keeping the first two values: {}"
)$head(2))

Interpolate null values

Description

Fill nulls with linear interpolation using non-missing values. Can also be used to regrid data to a new grid - see examples below.

Usage

Expr_interpolate(method = "linear")

Arguments

method

String, either "linear" (default) or "nearest".

Value

Expr

Examples

pl$DataFrame(x = c(1, NA, 4, NA, 100, NaN, 150))$
  with_columns(
  interp_lin = pl$col("x")$interpolate(),
  interp_near = pl$col("x")$interpolate("nearest")
)

# x, y interpolation over a grid
df_original_grid = pl$DataFrame(
  grid_points = c(1, 3, 10),
  values = c(2.0, 6.0, 20.0)
)
df_original_grid
df_new_grid = pl$DataFrame(grid_points = (1:10) * 1.0)
df_new_grid

# Interpolate from this to the new grid
df_new_grid$join(
  df_original_grid,
  on = "grid_points", how = "left"
)$with_columns(pl$col("values")$interpolate())
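
For intuition about the "linear" method, the same idea can be sketched with base R's approx() on a plain vector. This is only an analogue under the assumption that values are equally spaced by position; it is not the polars implementation, and the variable names are illustrative.

```r
x = c(1, NA, 4, NA, 100)
pos = seq_along(x)
known = !is.na(x)

# linear interpolation between the surrounding known values
lin = approx(pos[known], x[known], xout = pos)$y
lin
```

Each missing value here sits midway between two known neighbours, so it becomes their average: 2.5 between 1 and 4, and 52 between 4 and 100.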

Check if an expression is between the given lower and upper bounds

Description

Check if an expression is between the given lower and upper bounds

Usage

Expr_is_between(lower_bound, upper_bound, closed = "both")

Arguments

lower_bound

Lower bound, can be an Expr. Strings are parsed as column names.

upper_bound

Upper bound, can be an Expr. Strings are parsed as column names.

closed

Define which sides of the interval are closed (inclusive). This can be either "left", "right", "both" or "none".

Details

Note that in polars, NaN values are equal to other NaN values, and greater than any non-NaN value.

Value

Expr

Examples

df = pl$DataFrame(num = 1:5)
df$with_columns(
  is_between = pl$col("num")$is_between(2, 4),
  is_between_excl_upper = pl$col("num")$is_between(2, 4, closed = "left"),
  is_between_excl_both = pl$col("num")$is_between(2, 4, closed = "none")
)

# lower and upper bounds can also be column names or expr
df = pl$DataFrame(
  num = 1:5,
  lower = c(0, 2, 3, 3, 3),
  upper = c(6, 4, 4, 8, 3.5)
)
df$with_columns(
  is_between_cols = pl$col("num")$is_between("lower", "upper"),
  is_between_expr = pl$col("num")$is_between(pl$col("lower") / 2, "upper")
)
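
The four values of closed differ only in which comparison operators are strict. A base R sketch of that logic, using a hypothetical helper that is not part of polars and that does not reproduce the NaN ordering described in Details:

```r
# illustrative helper, not part of polars
is_between_sketch = function(x, lower, upper, closed = "both") {
  switch(closed,
    both  = x >= lower & x <= upper,
    left  = x >= lower & x < upper,
    right = x > lower & x <= upper,
    none  = x > lower & x < upper,
    stop("closed must be one of 'both', 'left', 'right', 'none'")
  )
}

num = 1:5
is_between_sketch(num, 2, 4)
is_between_sketch(num, 2, 4, closed = "left")
is_between_sketch(num, 2, 4, closed = "none")
```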

Check whether each value is duplicated

Description

This is syntactic sugar for ⁠$is_unique()$not()⁠.

Usage

Expr_is_duplicated()

Value

Expr

Examples

pl$DataFrame(head(mtcars[, 1:2]))$
  with_columns(is_duplicated = pl$col("mpg")$is_duplicated())
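
The uniqueness checks in this family are closely related. As a rough base R analogue on the same mpg values (built on duplicated(); the variable names are illustrative and this is not how polars computes the results):

```r
mpg = head(mtcars$mpg) # 21.0 21.0 22.8 21.4 18.7 18.1

# first / last occurrence of each value
is_first_distinct = !duplicated(mpg)
is_last_distinct = !duplicated(mpg, fromLast = TRUE)

# a value is duplicated if it occurs more than once anywhere
is_duplicated = duplicated(mpg) | duplicated(mpg, fromLast = TRUE)

# and unique if it is not duplicated
is_unique = !is_duplicated
```

Note that base R's duplicated() alone marks only the second and later occurrences, whereas is_duplicated here also marks the first one.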

Check if elements are finite

Description

Returns a boolean Series indicating which values are finite.

Usage

Expr_is_finite()

Value

Expr

Examples

pl$DataFrame(list(alice = c(0, NaN, NA, Inf, -Inf)))$
  with_columns(finite = pl$col("alice")$is_finite())

Check whether each value is the first occurrence

Description

Check whether each value is the first occurrence

Usage

Expr_is_first_distinct()

Value

Expr

Examples

pl$DataFrame(head(mtcars[, 1:2]))$
  with_columns(is_ufirst = pl$col("mpg")$is_first_distinct())

Check whether a value is in a vector

Description

Notice that to check whether a factor value is in a vector of strings, you need to use the string cache, either with pl$enable_string_cache() or with pl$with_string_cache(). See examples.

Usage

Expr_is_in(other)

Arguments

other

numeric or string value; accepts expression input.

Value

Expr

Examples

pl$DataFrame(a = c(1:4, NA_integer_))$with_columns(
  in_1_3 = pl$col("a")$is_in(c(1, 3)),
  in_NA = pl$col("a")$is_in(pl$lit(NA_real_))
)

# this fails because we can't compare factors to strings
# pl$DataFrame(a = factor(letters[1:5]))$with_columns(
#   in_abc = pl$col("a")$is_in(c("a", "b", "c"))
# )

# need to use the string cache for this
pl$with_string_cache({
  pl$DataFrame(a = factor(letters[1:5]))$with_columns(
    in_abc = pl$col("a")$is_in(c("a", "b", "c"))
  )
})

Check if elements are infinite

Description

Returns a boolean Series indicating which values are infinite.

Usage

Expr_is_infinite()

Value

Expr

Examples

pl$DataFrame(list(alice = c(0, NaN, NA, Inf, -Inf)))$
  with_columns(infinite = pl$col("alice")$is_infinite())

Check whether each value is the last occurrence

Description

Check whether each value is the last occurrence

Usage

Expr_is_last_distinct()

Value

Expr

Examples

pl$DataFrame(head(mtcars[, 1:2]))$
  with_columns(is_ulast = pl$col("mpg")$is_last_distinct())

Check if elements are NaN

Description

Returns a boolean Series indicating which values are NaN.

Usage

Expr_is_nan()

Value

Expr

Examples

pl$DataFrame(list(alice = c(0, NaN, NA, Inf, -Inf)))$
  with_columns(nan = pl$col("alice")$is_nan())

Check if elements are not NaN

Description

Returns a boolean Series indicating which values are not NaN. Syntactic sugar for ⁠$is_nan()$not()⁠.

Usage

Expr_is_not_nan()

Value

Expr

Examples

pl$DataFrame(list(alice = c(0, NaN, NA, Inf, -Inf)))$
  with_columns(not_nan = pl$col("alice")$is_not_nan())

Check if elements are not NULL

Description

Returns a boolean Series indicating which values are not null. Syntactic sugar for ⁠$is_null()$not()⁠.

Usage

Expr_is_not_null()

Value

Expr

Examples

pl$DataFrame(list(x = c(1, NA, 3)))$select(pl$col("x")$is_not_null())

Check if elements are NULL

Description

Returns a boolean Series indicating which values are null.

Usage

Expr_is_null()

Value

Expr

Examples

pl$DataFrame(list(x = c(1, NA, 3)))$select(pl$col("x")$is_null())

Check whether each value is unique

Description

Check whether each value is unique

Usage

Expr_is_unique()

Value

Expr

Examples

pl$DataFrame(head(mtcars[, 1:2]))$
  with_columns(is_unique = pl$col("mpg")$is_unique())

Kurtosis

Description

Compute the kurtosis (Fisher or Pearson) of a dataset.

Usage

Expr_kurtosis(fisher = TRUE, bias = TRUE)

Arguments

fisher

If TRUE (default), Fisher’s definition is used (normal, centered at 0). Otherwise, Pearson’s definition is used (normal, centered at 3).

bias

If FALSE, the calculations are corrected for statistical bias.

Details

Kurtosis is the fourth central moment divided by the square of the variance. If Fisher's definition is used, then 3 is subtracted from the result to give 0 for a normal distribution.

If bias is FALSE, then the kurtosis is calculated using k statistics to eliminate bias coming from biased moment estimators.

Value

Expr

Examples

pl$DataFrame(a = c(1:3, 2:1))$
  with_columns(kurt = pl$col("a")$kurtosis())
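
The definition in Details can be evaluated directly. The base R sketch below covers only the biased case (bias = TRUE) and uses a hypothetical helper that is not part of polars:

```r
# illustrative helper, not part of polars; biased estimator only
kurtosis_sketch = function(x, fisher = TRUE) {
  d = x - mean(x)
  m2 = mean(d^2) # biased variance (second central moment)
  m4 = mean(d^4) # fourth central moment
  k = m4 / m2^2
  # Fisher's definition subtracts 3 so that a normal distribution gives 0
  if (fisher) k - 3 else k
}

a = c(1:3, 2:1)
kurtosis_sketch(a)
kurtosis_sketch(a, fisher = FALSE)
```

The two results differ by exactly 3, which is the only difference between Fisher's and Pearson's definitions.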

Get the last value

Description

Get the last value

Usage

Expr_last()

Value

Expr

Examples

pl$DataFrame(x = 3:1)$with_columns(last = pl$col("x")$last())

Get the first n elements

Description

This is an alias for ⁠<Expr>$head()⁠.

Usage

Expr_limit(n = 10)

Arguments

n

Number of elements to take.

Value

Expr

Examples

pl$DataFrame(x = 1:11)$select(pl$col("x")$limit(3))

Compute the logarithm of elements

Description

Compute the logarithm of elements

Usage

Expr_log(base = base::exp(1))

Arguments

base

Numeric base value for logarithm, default is exp(1).

Value

Expr

Examples

pl$DataFrame(a = c(1, 2, 3, exp(1)))$
  with_columns(log = pl$col("a")$log())

Compute the base-10 logarithm of elements

Description

Compute the base-10 logarithm of elements

Usage

Expr_log10()

Value

Expr

Examples

pl$DataFrame(a = c(1, 2, 3, exp(1)))$
  with_columns(log10 = pl$col("a")$log10())

Find the lower bound of a DataType

Description

Find the lower bound of a DataType

Usage

Expr_lower_bound()

Value

Expr

Examples

pl$DataFrame(
  x = 1:3, y = 1:3,
  schema = list(x = pl$UInt32, y = pl$Int32)
)$
  select(pl$all()$lower_bound())

Check strictly lower inequality

Description

Method equivalent of "less than" operator expr < other.

Usage

Expr_lt(other)

Arguments

other

numeric or string value; accepts expression input.

Value

Expr

Examples

pl$lit(5) < 10
pl$lit(5) < pl$lit(10)
pl$lit(5)$lt(pl$lit(10))

Check lower or equal inequality

Description

Method equivalent of "less than or equal" operator expr <= other.

Usage

Expr_lt_eq(other)

Arguments

other

numeric or string value; accepts expression input.

Value

Expr

Examples

pl$lit(2) <= 2
pl$lit(2) <= pl$lit(2)
pl$lit(2)$lt_eq(pl$lit(2))

Map an expression with an R function

Description

Map an expression with an R function

Usage

Expr_map_batches(
  f,
  output_type = NULL,
  agg_list = FALSE,
  in_background = FALSE
)

Arguments

f

a function to map with

output_type

NULL or a type available in names(pl$dtypes). If NULL (default), the output datatype will match the input datatype. This is used to inform schema of the actual return type of the R function. Setting this wrong could theoretically have some downstream implications to the query.

agg_list

Aggregate list. Map from vector to group in group_by context.

in_background

Logical. Whether to execute the map in a background R process. Combined with setting e.g. options(polars.rpool_cap = 4) it can speed up some slow R functions as they can run in parallel R sessions. Communication between processes is considerably slower than between threads, so this will likely only give a speed-up in a "low IO - high CPU" use case. If there are multiple $map_batches(in_background = TRUE) calls in the query, they will be run in parallel.

Details

It is sometimes necessary to apply a specific R function on one or several columns. However, note that using R code in $map_batches() is slower than native polars. The user function must take one polars Series as input and the return should be a Series or any Robj convertible into a Series (e.g. vectors). Map fully supports browser().

If in_background = FALSE the function can access any global variable of the R session. However, note that several calls to $map_batches() will sequentially share the same main R session, so the global environment might change between the start of the query and the moment a $map_batches() call is evaluated. Any native polars computations can still be executed meanwhile. If in_background = TRUE, the map will run in one or more other R sessions and will not have access to global variables. Use options(polars.rpool_cap = 4) and polars_options()$rpool_cap to set and view number of parallel R sessions.

Value

Expr

Examples

pl$DataFrame(iris)$
  select(
  pl$col("Sepal.Length")$map_batches(\(x) {
    paste("cheese", as.character(x$to_vector()))
  }, pl$dtypes$String)
)

# R parallel process example, use Sys.sleep() to imitate some CPU expensive
# computation.

# map a,b,c,d sequentially
pl$LazyFrame(a = 1, b = 2, c = 3, d = 4)$select(
  pl$all()$map_batches(\(s) {
    Sys.sleep(.1)
    s * 2
  })
)$collect() |> system.time()

# map in parallel 1: Overhead to start up extra R processes / sessions
options(polars.rpool_cap = 0) # drop any previous processes, just to show start-up overhead
options(polars.rpool_cap = 4) # set back to 4, the default
polars_options()$rpool_cap
pl$LazyFrame(a = 1, b = 2, c = 3, d = 4)$select(
  pl$all()$map_batches(\(s) {
    Sys.sleep(.1)
    s * 2
  }, in_background = TRUE)
)$collect() |> system.time()

# map in parallel 2: Reuse R processes in "polars global_rpool".
polars_options()$rpool_cap
pl$LazyFrame(a = 1, b = 2, c = 3, d = 4)$select(
  pl$all()$map_batches(\(s) {
    Sys.sleep(.1)
    s * 2
  }, in_background = TRUE)
)$collect() |> system.time()

Map a custom/user-defined function (UDF) to each element of a column

Description

The UDF is applied to each element of a column. See Details for more information on specificities related to the context.

Usage

Expr_map_elements(
  f,
  return_type = NULL,
  strict_return_type = TRUE,
  allow_fail_eval = FALSE,
  in_background = FALSE
)

Arguments

f

Function to map

return_type

DataType of the output Series. If NULL, the dtype will be pl$Unknown.

strict_return_type

If TRUE (default), raise an error if the R function returns a value of the wrong datatype. If FALSE, the output will be converted to a polars null value.

allow_fail_eval

If FALSE (default), raise an error if the function fails. If TRUE, the result will be converted to a polars null value.

in_background

Whether to run the function in a background R process, default is FALSE. Combined with setting e.g. options(polars.rpool_cap = 4), this can speed up some slow R functions as they can run in parallel R sessions. Communication between processes is considerably slower than between threads, so this will likely only give a speed-up in a "low IO - high CPU" use case. A single map will not be parallelized; only multiple $map_elements() calls in the same query can run in parallel.

Details

Note that, in a GroupBy context, the column will have been pre-aggregated and so each element will itself be a Series. Therefore, depending on the context, requirements for function differ:

  • in $select() or $with_columns() (selection context), the function must operate on R values of length 1. Polars will convert each element into an R value and pass it to the function. The output of the user function will be converted back into a polars type (the return type must match, see argument return_type). Using $map_elements() in this context should be avoided, as lapply() has half the overhead.

  • in $agg() (GroupBy context), the function must take a Series and return a Series or an R object convertible to Series, e.g. a vector. In this context, it is much faster if the number of groups is much lower than the number of rows, as the iteration is only across the groups. The R user function could e.g. convert the Series to a vector with $to_r() and perform some vectorized operations.

Note that it is preferred to express your function in polars syntax, which will almost always be significantly faster and more memory efficient because:

  • the native expression engine runs in Rust; functions run in R.

  • use of R functions forces the DataFrame to be materialized in memory.

  • Polars-native expressions can be parallelized (R functions cannot).

  • Polars-native expressions can be logically optimized (R functions cannot).

Wherever possible you should strongly prefer the native expression API to achieve the best performance and avoid using ⁠$map_elements()⁠.

Value

Expr

Examples

# apply over groups: here, the input must be a Series
# prepare two expressions, one to compute the sum of each variable, one to
# get the first two values of each variable and store them in a list
e_sum = pl$all()$map_elements(\(s) sum(s$to_r()))$name$suffix("_sum")
e_head = pl$all()$map_elements(\(s) head(s$to_r(), 2))$name$suffix("_head")
pl$DataFrame(iris)$group_by("Species")$agg(e_sum, e_head)

# apply a function on each value (should be avoided): here the input is an R
# value of length 1
# select only Float64 columns
my_selection = pl$col(pl$dtypes$Float64)

# prepare two expressions, the first one only adds 10 to each element, the
# second returns the letter whose index matches the element
e_add10 = my_selection$map_elements(\(x) {
  x + 10
})$name$suffix("_add10")

e_letter = my_selection$map_elements(\(x) {
  letters[ceiling(x)]
}, return_type = pl$dtypes$String)$name$suffix("_letter")
pl$DataFrame(iris)$select(e_add10, e_letter)


# Small benchmark --------------------------------

# Using `$map_elements()` is much slower than a more polars-native approach.
# First we multiply each element of a Series of 1M elements by 2.
n = 1000000L
set.seed(1)
df = pl$DataFrame(list(
  a = 1:n,
  b = sample(letters, n, replace = TRUE)
))

system.time({
  df$with_columns(
    bob = pl$col("a")$map_elements(\(x) {
      x * 2L
    })
  )
})

# Comparing this to the standard polars syntax:
system.time({
  df$with_columns(
    bob = pl$col("a") * 2L
  )
})


# Running in parallel --------------------------------

# here, we use Sys.sleep() to imitate some CPU expensive computation.

# applying the function over each Species group in each column amounts to
# 12 sequential runs, ~1.2 sec.
system.time({
  pl$LazyFrame(iris)$group_by("Species")$agg(
    pl$all()$map_elements(\(s) {
      Sys.sleep(.1)
      s$sum()
    })
  )$collect()
})

# first run in parallel: there is some overhead to start up extra R processes
# drop any previous processes, just to show start-up overhead here
options(polars.rpool_cap = 0)
# set back to 4, the default
options(polars.rpool_cap = 4)
polars_options()$rpool_cap

system.time({
  pl$LazyFrame(iris)$group_by("Species")$agg(
    pl$all()$map_elements(\(s) {
      Sys.sleep(.1)
      s$sum()
    }, in_background = TRUE)
  )$collect()
})

# second run in parallel: this reuses R processes in "polars global_rpool".
polars_options()$rpool_cap
system.time({
  pl$LazyFrame(iris)$group_by("Species")$agg(
    pl$all()$map_elements(\(s) {
      Sys.sleep(.1)
      s$sum()
    }, in_background = TRUE)
  )$collect()
})

Get maximum value

Description

Get maximum value

Usage

Expr_max()

Value

Expr

Examples

pl$DataFrame(x = c(1, NA, 3))$
  with_columns(max = pl$col("x")$max())

Get mean value

Description

Get mean value

Usage

Expr_mean()

Value

Expr

Examples

pl$DataFrame(x = c(1L, NA, 2L))$
  with_columns(mean = pl$col("x")$mean())

Get median value

Description

Get median value

Usage

Expr_median()

Value

Expr

Examples

pl$DataFrame(x = c(1L, NA, 2L))$
  with_columns(median = pl$col("x")$median())

Get minimum value

Description

Get minimum value

Usage

Expr_min()

Value

Expr

Examples

pl$DataFrame(x = c(1, NA, 3))$
  with_columns(min = pl$col("x")$min())

Modulo two expressions

Description

Method equivalent of modulus operator expr %% other.

Usage

Expr_mod(other)

Arguments

other

Numeric literal or expression value.

Value

Expr

Examples

df = pl$DataFrame(x = -5L:5L)

df$with_columns(
  `x%%2` = pl$col("x")$mod(2)
)

Mode

Description

Compute the most frequently occurring value(s). Can return multiple values if there are ties.

Usage

Expr_mode()

Value

Expr

Examples

df = pl$DataFrame(a = 1:6, b = c(1L, 1L, 3L, 3L, 5L, 6L), c = c(1L, 1L, 2L, 2L, 3L, 3L))
df$select(pl$col("a")$mode())
df$select(pl$col("b")$mode())
df$select(pl$col("c")$mode())
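
The tie behaviour in the three columns above can be reproduced with a small base R sketch. mode_sketch is a hypothetical helper for integer input, not part of polars; it returns the tied values in sorted order, which is not necessarily the order polars returns them in.

```r
# illustrative helper, not part of polars; integer input only
mode_sketch = function(x) {
  counts = table(x)
  # keep every value whose count equals the maximum count
  as.integer(names(counts)[counts == max(counts)])
}

mode_sketch(1:6)                          # every value occurs once: all tie
mode_sketch(c(1L, 1L, 3L, 3L, 5L, 6L))    # 1 and 3 tie
mode_sketch(c(1L, 1L, 2L, 2L, 3L, 3L))    # 1, 2 and 3 tie
```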

Multiply two expressions

Description

Method equivalent of multiplication operator expr * other.

Usage

Expr_mul(other)

Arguments

other

Numeric literal or expression value.

Value

Expr

Examples

df = pl$DataFrame(x = c(1, 2, 4, 8, 16))

df$with_columns(
  `x*2` = pl$col("x")$mul(2),
  `x * xlog2` = pl$col("x")$mul(pl$col("x")$log(2))
)

Count number of unique values

Description

Count number of unique values

Usage

Expr_n_unique()

Value

Expr

Examples

pl$DataFrame(iris[, 4:5])$with_columns(count = pl$col("Species")$n_unique())

Get maximum value with NaN

Description

Get the maximum value, but return NaN if any value is NaN.

Usage

Expr_nan_max()

Value

Expr

Examples

pl$DataFrame(x = c(1, NA, 3, NaN, Inf))$
  with_columns(nan_max = pl$col("x")$nan_max())

Get minimum value with NaN

Description

Get the minimum value, but return NaN if any value is NaN.

Usage

Expr_nan_min()

Value

Expr

Examples

pl$DataFrame(x = c(1, NA, 3, NaN, Inf))$
  with_columns(nan_min = pl$col("x")$nan_min())

Check inequality

Description

Method equivalent of inequality operator expr != other.

Usage

Expr_neq(other)

Arguments

other

numeric or string value; accepts expression input.

Value

Expr

See Also

Expr_neq_missing

Examples

pl$lit(1) != 2
pl$lit(1) != pl$lit(2)
pl$lit(1)$neq(pl$lit(2))

Check inequality without null propagation

Description

Method equivalent of inequality operator expr != other, except that null values are compared like any other value instead of propagating.

Usage

Expr_neq_missing(other)

Arguments

other

numeric or string value; accepts expression input.

Value

Expr

See Also

Expr_neq

Examples

df = pl$DataFrame(x = c(NA, FALSE, TRUE), y = c(TRUE, TRUE, TRUE))
df$with_columns(
  neq = pl$col("x")$neq("y"),
  neq_missing = pl$col("x")$neq_missing("y")
)
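
The difference between the two comparisons can be sketched in base R, where NA plays the role of a null. This is only an illustration of the semantics; the helper `neq_missing` below is hypothetical and is not the polars implementation.

```r
x = c(NA, FALSE, TRUE)
y = c(TRUE, TRUE, TRUE)

# Ordinary inequality: a missing operand makes the result missing.
neq = x != y

# Missing-aware inequality (hypothetical helper): two missing values count as
# equal, and a missing value differs from any non-missing one.
neq_missing = function(a, b) {
  both_na = is.na(a) & is.na(b)
  one_na = xor(is.na(a), is.na(b))
  ifelse(both_na, FALSE, ifelse(one_na, TRUE, a != b))
}

neq_m = neq_missing(x, y)
neq
neq_m
```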

Negate a boolean expression

Description

Method equivalent of negation operator !expr.

Usage

Expr_not()

Value

Expr

Examples

# two syntaxes same result
pl$lit(TRUE)$not()
!pl$lit(TRUE)

Count missing values

Description

Count missing values

Usage

Expr_null_count()

Value

Expr

Examples

pl$DataFrame(x = c(NA, "a", NA, "b"))$
  with_columns(n_missing = pl$col("x")$null_count())

Apply logical OR on two expressions

Description

Combine two boolean expressions with OR.

Usage

Expr_or(other)

Arguments

other

numeric or string value; accepts expression input.

Value

Expr

Examples

pl$lit(TRUE) | FALSE
pl$lit(TRUE)$or(pl$lit(TRUE))

Compute expressions over the given groups

Description

This expression is similar to performing a group by aggregation and joining the result back into the original DataFrame. The outcome is similar to how window functions work in PostgreSQL.

Usage

Expr_over(..., order_by = NULL, mapping_strategy = "group_to_rows")

Arguments

...

Column(s) to group by. Accepts expression input. Characters are parsed as column names.

order_by

Order the window functions/aggregations with the partitioned groups by the result of the expression passed to order_by. Can be an Expr. Strings are parsed as column names.

mapping_strategy

One of the following:

  • "group_to_rows" (default): if the aggregation results in multiple values, assign them back to their position in the DataFrame. This can only be done if the group yields the same elements before aggregation as after.

  • "join": join the groups as ⁠List<group_dtype>⁠ to the row positions. Note that this can be memory intensive.

  • "explode": don’t do any mapping, but simply flatten the group. This only makes sense if the input data is sorted.

Value

Expr

Examples

# Pass the name of a column to compute the expression over that column.
df = pl$DataFrame(
  a = c("a", "a", "b", "b", "b"),
  b = c(1, 2, 3, 5, 3),
  c = c(5, 4, 2, 1, 3)
)

df$with_columns(
  pl$col("c")$max()$over("a")$name$suffix("_max")
)

# Expression input is supported.
df$with_columns(
  pl$col("c")$max()$over(pl$col("b") %/% 2)$name$suffix("_max")
)

# Group by multiple columns by passing a character vector of column names
# or list of expressions.
df$with_columns(
  pl$col("c")$min()$over(c("a", "b"))$name$suffix("_min")
)

df$with_columns(
  pl$col("c")$min()$over(list(pl$col("a"), pl$col("b")))$name$suffix("_min")
)

# Or use positional arguments to group by multiple columns in the same way.
df$with_columns(
  pl$col("c")$min()$over("a", pl$col("b") %% 2)$name$suffix("_min")
)

# Alternative mapping strategy: join values in a list output
df$with_columns(
  top_2 = pl$col("c")$top_k(2)$over("a", mapping_strategy = "join")
)

# order_by specifies how values are sorted within a group, which is
# essential when the operation depends on the order of values
df = pl$DataFrame(
  g = c(1, 1, 1, 1, 2, 2, 2, 2),
  t = c(1, 2, 3, 4, 4, 1, 2, 3),
  x = c(10, 20, 30, 40, 10, 20, 30, 40)
)

# without order_by, the first and second values in the second group would
# be inverted, which would be wrong
df$with_columns(
  x_lag = pl$col("x")$shift(1)$over("g", order_by = "t")
)
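
The "aggregate per group, then join back" idea from the description has a close base R analogue in `ave()`. The sketch below reuses the values of the first example; it illustrates the concept only and says nothing about how polars evaluates the window.

```r
a = c("a", "a", "b", "b", "b")
c_col = c(5, 4, 2, 1, 3)

# ave() applies FUN within each group and returns a vector aligned with the
# original rows, so every row receives its group's maximum.
c_max = ave(c_col, a, FUN = max)
c_max
```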

Percentage change

Description

Computes percentage change (as fraction) between current element and most-recent non-null element at least n period(s) before the current element. Computes the change from the previous row by default.

Usage

Expr_pct_change(n = 1)

Arguments

n

Periods to shift for computing percent change.

Value

Expr

Examples

pl$DataFrame(a = c(10L, 11L, 12L, NA_integer_, 12L))$
  with_columns(pct_change = pl$col("a")$pct_change())
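
A minimal base R sketch of the computation described above, assuming that "most-recent non-null element" amounts to a forward fill followed by a lag of n. The helper `pct_change_sketch` is hypothetical and only meant to make the arithmetic explicit.

```r
# Hypothetical helper: fraction change relative to the most recent non-missing
# value at least `n` positions earlier (forward fill, then lag by `n`).
pct_change_sketch = function(x, n = 1) {
  filled = x
  for (i in seq_along(filled)[-1]) {
    if (is.na(filled[i])) filled[i] = filled[i - 1]
  }
  lagged = c(rep(NA, n), filled)[seq_along(filled)]
  (filled - lagged) / lagged
}

res = pct_change_sketch(c(10, 11, 12, NA, 12))
res
```

The first element has nothing to compare against, so it stays missing.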

Find local maxima

Description

A local maximum is the point that marks the transition between an increase and a decrease in a Series. The first and last values of the Series can never be a peak.

Usage

Expr_peak_max()

Value

Expr

See Also

⁠$peak_min()⁠

Examples

df = pl$DataFrame(x = c(1, 2, 3, 2, 3, 4, 5, 2))
df

df$with_columns(peak_max = pl$col("x")$peak_max())
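
The definition can be written out in base R as follows. This is a sketch that assumes a peak must be strictly greater than both neighbours; it is not taken from the polars source.

```r
x = c(1, 2, 3, 2, 3, 4, 5, 2)
n = length(x)

# Interior points only: strictly greater than both neighbours.
interior = x[2:(n - 1)] > x[1:(n - 2)] & x[2:(n - 1)] > x[3:n]

# The first and last values can never be a peak.
peak_max = c(FALSE, interior, FALSE)
peak_max
```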

Find local minima

Description

A local minimum is the point that marks the transition between a decrease and an increase in a Series. The first and last values of the Series can never be a peak.

Usage

Expr_peak_min()

Value

Expr

See Also

⁠$peak_max()⁠

Examples

df = pl$DataFrame(x = c(1, 2, 3, 2, 3, 4, 5, 2))
df

df$with_columns(peak_min = pl$col("x")$peak_min())

Exponentiate two expressions

Description

Method equivalent of exponentiation operator expr ^ exponent.

Usage

Expr_pow(exponent)

Arguments

exponent

Numeric literal or expression value.

Value

Expr

Examples

df = pl$DataFrame(x = c(1, 2, 4, 8))

df$with_columns(
  cube = pl$col("x")$pow(3),
  `x^xlog2` = pl$col("x")$pow(pl$col("x")$log(2))
)

Product

Description

Compute the product of an expression.

Usage

Expr_product()

Value

Expr

Examples

pl$DataFrame(x = c(2L, NA, 2L))$
  with_columns(product = pl$col("x")$product())

Bin continuous values into discrete categories based on their quantiles

Description

Bin continuous values into discrete categories based on their quantiles

Usage

Expr_qcut(
  quantiles,
  ...,
  labels = NULL,
  left_closed = FALSE,
  allow_duplicates = FALSE,
  include_breaks = FALSE
)

Arguments

quantiles

Either a vector of quantile probabilities between 0 and 1 or a positive integer determining the number of bins with uniform probability.

...

Ignored.

labels

Names of the categories. The number of labels must be equal to the number of cut points plus one.

left_closed

Set the intervals to be left-closed instead of right-closed.

allow_duplicates

If set to TRUE, duplicates in the resulting quantiles are dropped, rather than raising an error. This can happen even with unique probabilities, depending on the data.

include_breaks

Include a column with the right endpoint of the bin each observation falls in. This will change the data type of the output from a Categorical to a Struct.

Value

Expr of data type Categorical if include_breaks is FALSE and of data type Struct if include_breaks is TRUE.

See Also

$cut()

Examples

df = pl$DataFrame(foo = c(-2, -1, 0, 1, 2))

# Divide a column into three categories according to pre-defined quantile
# probabilities
df$with_columns(
  qcut = pl$col("foo")$qcut(c(0.25, 0.75), labels = c("a", "b", "c"))
)

# Divide a column into two categories using uniform quantile probabilities.
df$with_columns(
  qcut = pl$col("foo")$qcut(2, labels = c("low", "high"), left_closed = TRUE)
)

# Add both the category and the breakpoint
df$with_columns(
  qcut = pl$col("foo")$qcut(c(0.25, 0.75), include_breaks = TRUE)
)$unnest("qcut")

Get quantile value.

Description

Get quantile value.

Usage

Expr_quantile(quantile, interpolation = "nearest")

Arguments

quantile

Either a numeric value or an Expr whose value must be between 0 and 1.

interpolation

One of "nearest", "higher", "lower", "midpoint", or "linear".

Details

Null values are ignored and NaNs are ranked as the largest value. With linear interpolation, NaN poisons Inf, which in turn poisons any other value.

Value

Expr

Examples

pl$DataFrame(x = -5:5)$
  select(pl$col("x")$quantile(0.5))

Rank elements

Description

Assign ranks to data, dealing with ties appropriately.

Usage

Expr_rank(
  method = c("average", "min", "max", "dense", "ordinal", "random"),
  descending = FALSE,
  seed = NULL
)

Arguments

method

String, one of "average" (default), "min", "max", "dense", "ordinal", "random". The method used to assign ranks to tied elements:

  • "average": The average of the ranks that would have been assigned to all the tied values is assigned to each value.

  • "min": The minimum of the ranks that would have been assigned to all the tied values is assigned to each value. (This is also referred to as "competition" ranking.)

  • "max" : The maximum of the ranks that would have been assigned to all the tied values is assigned to each value.

  • "dense": Like 'min', but the rank of the next highest element is assigned the rank immediately after those assigned to the tied elements.

  • "ordinal" : All values are given a distinct rank, corresponding to the order that the values occur in the Series.

  • "random" : Like 'ordinal', but the rank for ties is not dependent on the order that the values occur in the Series.

descending

Rank in descending order.

seed

A string (parsed) or a number (converted) giving a uint64 seed. Only used if method = "random".

Value

Expr

Examples

#  The 'average' method:
pl$DataFrame(a = c(3, 6, 1, 1, 6))$
  with_columns(rank = pl$col("a")$rank())

#  The 'ordinal' method:
pl$DataFrame(a = c(3, 6, 1, 1, 6))$
  with_columns(rank = pl$col("a")$rank("ordinal"))
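
For comparison, most of the tie-breaking methods listed above have a base R counterpart in `rank()`. The mapping below is an illustration in base R rather than a statement about polars internals; "dense" has no `ties.method` equivalent and is emulated with `match()`.

```r
a = c(3, 6, 1, 1, 6)

r_average = rank(a, ties.method = "average")
r_min = rank(a, ties.method = "min")
r_max = rank(a, ties.method = "max")
r_ordinal = rank(a, ties.method = "first") # ties broken by order of appearance
r_dense = match(a, sort(unique(a)))        # consecutive ranks without gaps

rbind(r_average, r_min, r_max, r_ordinal, r_dense)
```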

Rechunk memory layout

Description

Create a single chunk of memory for this Series.

Usage

Expr_rechunk()

Details

See the docs_translations help topic for an explanation of rechunk().

Value

Expr

Examples

# get chunked lengths with/without rechunk
series_list = pl$DataFrame(list(a = 1:3, b = 4:6))$select(
  pl$col("a")$append(pl$col("b"))$alias("a_chunked"),
  pl$col("a")$append(pl$col("b"))$rechunk()$alias("a_rechunked")
)$get_columns()
lapply(series_list, \(x) x$chunk_lengths())

Reinterpret bits

Description

Reinterpret the underlying bits as a signed/unsigned integer. This operation is only allowed for Int64. For integers with fewer bits, you can safely use the cast operation.

Usage

Expr_reinterpret(signed = TRUE)

Arguments

signed

If TRUE (default), reinterpret as Int64. Otherwise, reinterpret as UInt64.

Value

Expr

Examples

df = pl$DataFrame(x = 1:5, schema = list(x = pl$Int64))
df$select(pl$all()$reinterpret())

Repeat a Series

Description

This expression repeats its input n times, appending each repetition as a chunk.

Usage

Expr_rep(n, rechunk = TRUE)

Arguments

n

The number of times to repeat, must be non-negative and finite.

rechunk

If TRUE (default), memory layout will be rewritten.

Details

If the input has length 1, this uses a special faster implementation that doesn't require rechunking (so rechunk = TRUE has no effect).

Value

Expr

Examples

pl$select(pl$lit("alice")$rep(n = 3))
pl$select(pl$lit(1:3)$rep(n = 2))

Repeat values

Description

Repeat the elements in this Series as specified in the given expression. The repeated elements are expanded into a List.

Usage

Expr_repeat_by(by)

Arguments

by

Expr that determines how often the values will be repeated. The column will be coerced to UInt32.

Value

Expr

Examples

df = pl$DataFrame(a = c("w", "x", "y", "z"), n = c(-1, 0, 1, 2))
df$with_columns(repeated = pl$col("a")$repeat_by("n"))

Replace the given values by different values of the same data type.

Description

This allows one to recode values in a column, leaving all other values unchanged. See $replace_strict() to give a default value to all other values and to specify the output datatype.

Usage

Expr_replace(old, new)

Arguments

old

Can be several things:

  • a vector indicating the values to recode;

  • if new is missing, this can be a named list, e.g. list(old = "new"), where the names are the old values and the values are the replacements. Note that if old values are numeric, the names must be wrapped in backticks;

  • an Expr

new

Either a vector of length 1, a vector of same length as old or an Expr. If missing, old must be a named list.

Value

Expr

Examples

df = pl$DataFrame(a = c(1, 2, 2, 3))

# "old" and "new" can take vectors of length 1 or of same length
df$with_columns(replaced = pl$col("a")$replace(2, 100))
df$with_columns(replaced = pl$col("a")$replace(c(2, 3), c(100, 200)))

# "old" can be a named list where names are values to replace, and values are
# the replacements
mapping = list(`2` = 100, `3` = 200)
df$with_columns(replaced = pl$col("a")$replace(mapping))

df = pl$DataFrame(a = c("x", "y", "z"))
mapping = list(x = 1, y = 2, z = 3)
df$with_columns(replaced = pl$col("a")$replace(mapping))

# "old" and "new" can take Expr
df = pl$DataFrame(a = c(1, 2, 2, 3), b = c(1.5, 2.5, 5, 1))
df$with_columns(
  replaced = pl$col("a")$replace(
    old = pl$col("a")$max(),
    new = pl$col("b")$sum()
  )
)
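
The recoding logic can be sketched in base R with `match()`. The helper `recode_sketch` is hypothetical and not the polars implementation; its `default` argument only mimics the difference between leaving unmatched values untouched, as described here, and assigning them a default value, as in $replace_strict() with default set.

```r
# Hypothetical helper: look up each value in `old` and substitute the
# corresponding entry of `new`; unmatched values fall back to `default`.
recode_sketch = function(x, old, new, default = x) {
  new = rep_len(new, length(old))       # allow a length-1 replacement
  default = rep_len(default, length(x)) # allow a length-1 default
  idx = match(x, old)                   # position in `old`, NA if absent
  ifelse(is.na(idx), default, new[idx])
}

a = c(1, 2, 2, 3)
kept = recode_sketch(a, c(2, 3), c(100, 200))                    # 1 is left unchanged
defaulted = recode_sketch(a, c(2, 3), c(100, 200), default = -1) # 1 becomes -1
kept
defaulted
```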

Replace all values by different values.

Description

This changes all the values in a column, either using a specific replacement or a default one. See $replace() to replace only a subset of values.

Usage

Expr_replace_strict(old, new, default = NULL, return_dtype = NULL)

Arguments

old

Can be several things:

  • a vector indicating the values to recode;

  • if new is missing, this can be a named list, e.g. list(old = "new"), where the names are the old values and the values are the replacements. Note that if old values are numeric, the names must be wrapped in backticks;

  • an Expr

new

Either a vector of length 1, a vector of same length as old or an Expr. If missing, old must be a named list.

default

The default replacement if the value is not in old. Can be an Expr. If NULL (default), then the value doesn't change.

return_dtype

The data type of the resulting expression. If set to NULL (default), the data type is determined automatically based on the other inputs.

Value

Expr

Examples

df = pl$DataFrame(a = c(1, 2, 2, 3))

# "old" and "new" can take vectors of length 1 or of same length
df$with_columns(replaced = pl$col("a")$replace_strict(2, 100, default = 1))
df$with_columns(
  replaced = pl$col("a")$replace_strict(c(2, 3), c(100, 200), default = 1)
)

# "old" can be a named list where names are values to replace, and values are
# the replacements
mapping = list(`2` = 100, `3` = 200)
df$with_columns(replaced = pl$col("a")$replace_strict(mapping, default = -1))

# one can specify the data type to return instead of automatically
# inferring it
df$with_columns(
  replaced = pl$col("a")$replace_strict(mapping, default = 1, return_dtype = pl$Int32)
)

# "old", "new", and "default" can take Expr
df = pl$DataFrame(a = c(1, 2, 2, 3), b = c(1.5, 2.5, 5, 1))
df$with_columns(
  replaced = pl$col("a")$replace_strict(
    old = pl$col("a")$max(),
    new = pl$col("b")$sum(),
    default = pl$col("b")
  )
)

Reshape this Expr to a flat Series or a Series of Lists

Description

Reshape this Expr to a flat Series or a Series of Lists

Usage

Expr_reshape(dimensions, nested_type = pl$List())

Arguments

dimensions

An integer vector of dimension sizes. If -1 is used in any of the dimensions, that dimension is inferred. Currently, more than two dimensions are not supported.

nested_type

The nested data type to create. List only supports 2 dimensions, whereas Array supports an arbitrary number of dimensions.

Value

Expr. If a single dimension is given, results in an expression of the original data type. If multiple dimensions are given, results in an expression of data type List with shape equal to the dimensions.

Examples

df = pl$DataFrame(foo = 1:9)

df$select(pl$col("foo")$reshape(9))
df$select(pl$col("foo")$reshape(c(3, 3)))

# Use `-1` to infer the other dimension
df$select(pl$col("foo")$reshape(c(-1, 3)))
df$select(pl$col("foo")$reshape(c(3, -1)))

# One can specify more than 2 dimensions by using the Array type
df = pl$DataFrame(foo = 1:12)
df$select(
  pl$col("foo")$reshape(c(3, 2, 2), nested_type = pl$Array(pl$Float32, 2))
)
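
How a -1 entry is resolved can be sketched in base R: the missing size is the total length divided by the product of the known sizes. The helper `infer_dims` is hypothetical and assumes at most one -1, which is an assumption of this sketch rather than something the text above states.

```r
# Hypothetical helper: resolve a single -1 in `dimensions` for `len` values.
infer_dims = function(len, dimensions) {
  unknown = dimensions == -1
  stopifnot(sum(unknown) <= 1)
  if (any(unknown)) {
    known = prod(dimensions[!unknown])
    stopifnot(len %% known == 0)
    dimensions[unknown] = len / known
  }
  stopifnot(prod(dimensions) == len)
  dimensions
}

d1 = infer_dims(9, c(-1, 3))
d2 = infer_dims(12, c(3, -1))
d1
d2
```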

Reverse a variable

Description

Reverse a variable

Usage

Expr_reverse()

Value

Expr

Examples

pl$DataFrame(list(a = 1:5))$select(pl$col("a")$reverse())

Get the lengths of runs of identical values

Description

Get the lengths of runs of identical values

Usage

Expr_rle()

Value

Expr

Examples

df = pl$DataFrame(s = c(1, 1, 2, 1, NA, 1, 3, 3))
df$select(pl$col("s")$rle())$unnest("s")

Map values to run IDs

Description

Similar to $rle(), but it maps each value to an ID corresponding to the run into which it falls. This is especially useful when you want to define groups by runs of identical values rather than the values themselves. Note that the ID is 0-indexed.

Usage

Expr_rle_id()

Value

Expr

Examples

df = pl$DataFrame(a = c(1, 2, 1, 1, 1, 4))
df$with_columns(a_r = pl$col("a")$rle_id())
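
The 0-indexed run IDs can be reproduced in base R by flagging where a new run starts and taking a cumulative sum. This is a sketch for input without missing values, shown alongside base R's `rle()` for the run lengths.

```r
a = c(1, 2, 1, 1, 1, 4)

# A new run starts at the first element and wherever a value differs from
# its predecessor.
starts = c(TRUE, a[-1] != a[-length(a)])
run_id = cumsum(starts) - 1
run_id

# Lengths of the same runs, via base R
run_lengths = rle(a)$lengths
run_lengths
```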

Create rolling groups based on a time or numeric column

Description

If you have a time series ⁠<t_0, t_1, ..., t_n>⁠, then by default the windows created will be:

  • (t_0 - period, t_0]

  • (t_1 - period, t_1]

  • (t_n - period, t_n]

whereas if you pass a non-default offset, then the windows will be:

  • (t_0 + offset, t_0 + offset + period]

  • (t_1 + offset, t_1 + offset + period]

  • (t_n + offset, t_n + offset + period]

Usage

Expr_rolling(index_column, ..., period, offset = NULL, closed = "right")

Arguments

index_column

Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order. If this column represents an index, it has to be either Int32 or Int64. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.

...

Ignored.

period

A character representing the length of the window, must be non-negative. See the ⁠Polars duration string language⁠ section for details.

offset

A character representing the offset of the window, or NULL (default). If NULL, -period is used. See the ⁠Polars duration string language⁠ section for details.

closed

Define which sides of the temporal interval are closed (inclusive). This can be either "left", "right", "both" or "none".

Details

In case of a rolling operation on an integer column, the windows are defined by:

  • "1i" # length 1

  • "10i" # length 10

Value

Expr

Polars duration string language

Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.

It has the following format:

  • 1ns (1 nanosecond)

  • 1us (1 microsecond)

  • 1ms (1 millisecond)

  • 1s (1 second)

  • 1m (1 minute)

  • 1h (1 hour)

  • 1d (1 calendar day)

  • 1w (1 calendar week)

  • 1mo (1 calendar month)

  • 1q (1 calendar quarter)

  • 1y (1 calendar year)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
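
To make the format concrete, the following base R sketch splits a duration string into its components. The helper `parse_duration` is hypothetical and not part of polars; it only covers the units listed above, does not handle negative durations, and does not convert the result into an actual time span.

```r
# Hypothetical helper: split a duration string into (amount, unit) parts.
parse_duration = function(s) {
  # Two-letter units come first so that "mo" and "ms" are not read as "m".
  pattern = "[0-9]+(ns|us|ms|mo|s|m|h|d|w|q|y)"
  parts = regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]
  # Reject input with anything left over between the matched parts.
  stopifnot(identical(paste(parts, collapse = ""), s))
  amounts = as.integer(sub("[a-z]+$", "", parts))
  names(amounts) = sub("^[0-9]+", "", parts)
  amounts
}

p1 = parse_duration("3d12h4m25s")
p2 = parse_duration("1mo2w")
p1
p2
```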

Examples

# create a DataFrame with a Datetime column and an f64 column
dates = c(
  "2020-01-01 13:45:48", "2020-01-01 16:42:13", "2020-01-01 16:45:09",
  "2020-01-02 18:12:48", "2020-01-03 19:45:32", "2020-01-08 23:16:43"
)

df = pl$DataFrame(dt = dates, a = c(3, 7, 5, 9, 2, 1))$
  with_columns(
  pl$col("dt")$str$strptime(pl$Datetime("us"), format = "%Y-%m-%d %H:%M:%S")$set_sorted()
)

df$with_columns(
  sum_a = pl$sum("a")$rolling(index_column = "dt", period = "2d"),
  min_a = pl$min("a")$rolling(index_column = "dt", period = "2d"),
  max_a = pl$max("a")$rolling(index_column = "dt", period = "2d")
)

# we can use "offset" to change the start of the window period. Here, with
# offset = "1d", we start the window one day after the value in "dt", and
# then we add a 2-day window relative to the window start.
df$with_columns(
  sum_a_offset1 = pl$sum("a")$rolling(index_column = "dt", period = "2d", offset = "1d")
)
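
The default windows (t_i - period, t_i] from the description can be checked by hand in base R. The sketch below recomputes the rolling sum for the example data; it assumes UTC timestamps, so that "2d" is exactly 48 hours, and is an illustration of the window definition rather than of the polars implementation.

```r
dt = as.POSIXct(c(
  "2020-01-01 13:45:48", "2020-01-01 16:42:13", "2020-01-01 16:45:09",
  "2020-01-02 18:12:48", "2020-01-03 19:45:32", "2020-01-08 23:16:43"
), tz = "UTC")
a = c(3, 7, 5, 9, 2, 1)
period = as.difftime(2, units = "days")

# For each row i, sum the values whose timestamp lies in (dt[i] - period, dt[i]].
sum_a = sapply(seq_along(dt), function(i) {
  in_window = dt > dt[i] - period & dt <= dt[i]
  sum(a[in_window])
})
sum_a
```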

Rolling maximum

Description

Compute the rolling (= moving) max over the values in this array. A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weight vector.

Usage

Expr_rolling_max(
  window_size,
  weights = NULL,
  min_periods = NULL,
  ...,
  center = FALSE
)

Arguments

window_size

Integer specifying the length of the window.

weights

An optional slice with the same length as the window that will be multiplied elementwise with the values in the window.

min_periods

The number of values in the window that should be non-null before computing a result. If NULL, it will be set equal to window size.

...

Ignored.

center

Set the labels at the center of the window

Details

If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), as this method can cache the window size computation.

Value

Expr

Examples

pl$DataFrame(a = c(1, 3, 2, 4, 5, 6))$
  with_columns(roll_max = pl$col("a")$rolling_max(window_size = 2))
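
The interplay of window_size and min_periods can be sketched in base R. The helper `roll_apply` is hypothetical; it uses trailing windows only (center = FALSE), ignores weights, and does not treat missing values inside a window specially.

```r
# Hypothetical helper: apply `f` to trailing windows of `window_size` values.
roll_apply = function(x, window_size, f, min_periods = window_size) {
  sapply(seq_along(x), function(i) {
    window = x[max(1, i - window_size + 1):i]
    if (sum(!is.na(window)) < min_periods) NA_real_ else f(window)
  })
}

a = c(1, 3, 2, 4, 5, 6)
full_only = roll_apply(a, 2, max)                # first window is incomplete
partial_ok = roll_apply(a, 2, max, min_periods = 1)
full_only
partial_ok
```

With min_periods equal to the window size, the first result is missing because only one value is available; lowering min_periods to 1 lets the incomplete window produce a value.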

Apply a rolling max based on another column.

Description

Apply a rolling max based on another column.

Usage

Expr_rolling_max_by(by, window_size, ..., min_periods = 1, closed = "right")

Arguments

by

This column must be of dtype Date or Datetime.

window_size

The length of the window. Can be a fixed integer size, or a dynamic temporal size indicated by the following string language:

  • 1ns (1 nanosecond)

  • 1us (1 microsecond)

  • 1ms (1 millisecond)

  • 1s (1 second)

  • 1m (1 minute)

  • 1h (1 hour)

  • 1d (1 day)

  • 1w (1 week)

  • 1mo (1 calendar month)

  • 1y (1 calendar year)

  • 1i (1 index count)

If the dynamic string language is used, the by and closed arguments must also be set.

...

Ignored.

min_periods

The number of values in the window that should be non-null before computing a result. If NULL, it will be set equal to window size.

closed

Define which sides of the temporal interval are closed (inclusive). This can be either "left", "right", "both" or "none".

Details

If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), as this method can cache the window size computation.

Value

Expr

Examples

df_temporal = pl$DataFrame(
  date = pl$datetime_range(as.Date("2001-1-1"), as.Date("2001-1-2"), "1h")
)$with_row_index("index")

df_temporal

df_temporal$with_columns(
  rolling_row_max = pl$col("index")$rolling_max_by("date", window_size = "3h")
)

Rolling mean

Description

Compute the rolling (= moving) mean over the values in this array. A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weight vector.

Usage

Expr_rolling_mean(
  window_size,
  weights = NULL,
  min_periods = NULL,
  ...,
  center = FALSE
)

Arguments

window_size

Integer specifying the length of the window.

weights

An optional slice with the same length as the window that will be multiplied elementwise with the values in the window.

min_periods

The number of values in the window that should be non-null before computing a result. If NULL, it will be set equal to window size.

...

Ignored.

center

Set the labels at the center of the window

Details

If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), as this method can cache the window size computation.

Value

Expr

Examples

pl$DataFrame(a = c(1, 3, 2, 4, 5, 6))$
  with_columns(roll_mean = pl$col("a")$rolling_mean(window_size = 2))

Apply a rolling mean based on another column.

Description

Apply a rolling mean based on another column.

Usage

Expr_rolling_mean_by(by, window_size, ..., min_periods = 1, closed = "right")

Arguments

by

This column must be of dtype Date or Datetime.

window_size

The length of the window. Can be a fixed integer size, or a dynamic temporal size indicated by the following string language:

  • 1ns (1 nanosecond)

  • 1us (1 microsecond)

  • 1ms (1 millisecond)

  • 1s (1 second)

  • 1m (1 minute)

  • 1h (1 hour)

  • 1d (1 day)

  • 1w (1 week)

  • 1mo (1 calendar month)

  • 1y (1 calendar year)

  • 1i (1 index count)

If the dynamic string language is used, the by and closed arguments must also be set.

...

Ignored.

min_periods

The number of values in the window that should be non-null before computing a result. If NULL, it will be set equal to window size.

closed

Define which sides of the temporal interval are closed (inclusive). This can be either "left", "right", "both" or "none".

Details

If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), as this method can cache the window size computation.

Value

Expr

Examples

df_temporal = pl$DataFrame(
  date = pl$datetime_range(as.Date("2001-1-1"), as.Date("2001-1-2"), "1h")
)$with_row_index("index")

df_temporal

df_temporal$with_columns(
  rolling_row_mean = pl$col("index")$rolling_mean_by("date", window_size = "3h")
)
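
A temporal window with closed = "right" covers (t - window_size, t]. The base R sketch below recomputes the rolling mean of the row index for the hourly data of the example under that assumption; it is an illustration, not the polars implementation.

```r
date = seq(
  as.POSIXct("2001-01-01", tz = "UTC"),
  as.POSIXct("2001-01-02", tz = "UTC"),
  by = "1 hour"
)
index = seq_along(date) - 1 # 0-based row index
window = as.difftime(3, units = "hours")

# Mean of the index over all rows whose timestamp lies in (date[i] - 3h, date[i]].
rolling_row_mean = sapply(seq_along(date), function(i) {
  mean(index[date > date[i] - window & date <= date[i]])
})
head(rolling_row_mean)
```

The first two rows average over fewer than three values because earlier timestamps do not exist, which is consistent with min_periods = 1.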

Rolling median

Description

Compute the rolling (= moving) median over the values in this array. A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weight vector.

Usage

Expr_rolling_median(
  window_size,
  weights = NULL,
  min_periods = NULL,
  center = FALSE
)

Arguments

window_size

Integer specifying the length of the window.

weights

An optional slice with the same length as the window that will be multiplied elementwise with the values in the window.

min_periods

The number of values in the window that should be non-null before computing a result. If NULL, it will be set equal to window size.

center

Set the labels at the center of the window

Details

If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), as this method can cache the window size computation.

Value

Expr

Examples

pl$DataFrame(a = c(1, 3, 2, 4, 5, 6))$
  with_columns(roll_median = pl$col("a")$rolling_median(window_size = 2))

Apply a rolling median based on another column.

Description

Apply a rolling median based on another column.

Usage

Expr_rolling_median_by(by, window_size, ..., min_periods = 1, closed = "right")

Arguments

by

This column must be of dtype Date or Datetime.

window_size

The length of the window. Can be a fixed integer size, or a dynamic temporal size indicated by the following string language:

  • 1ns (1 nanosecond)

  • 1us (1 microsecond)

  • 1ms (1 millisecond)

  • 1s (1 second)

  • 1m (1 minute)

  • 1h (1 hour)

  • 1d (1 day)

  • 1w (1 week)

  • 1mo (1 calendar month)

  • 1y (1 calendar year)

  • 1i (1 index count)

If the dynamic string language is used, the by and closed arguments must also be set.

...

Ignored.

min_periods

The number of values in the window that should be non-null before computing a result. If NULL, it will be set equal to window size.

closed

Define which sides of the temporal interval are closed (inclusive). This can be either "left", "right", "both" or "none".

Details

If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), as this method can cache the window size computation.

Value

Expr

Examples

df_temporal = pl$DataFrame(
  date = pl$datetime_range(as.Date("2001-1-1"), as.Date("2001-1-2"), "1h")
)$with_row_index("index")

df_temporal

df_temporal$with_columns(
  rolling_row_median = pl$col("index")$rolling_median_by("date", window_size = "3h")
)

Rolling minimum

Description

Compute the rolling (= moving) min over the values in this array. A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weight vector.

Usage

Expr_rolling_min(
  window_size,
  weights = NULL,
  min_periods = NULL,
  ...,
  center = FALSE
)

Arguments

window_size

Integer specifying the length of the window.

weights

An optional slice with the same length as the window that will be multiplied elementwise with the values in the window.

min_periods

The number of values in the window that should be non-null before computing a result. If NULL, it will be set equal to window size.

...

Ignored.

center

Set the labels at the center of the window

Details

If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), as this method can cache the window size computation.

Value

Expr

Examples

pl$DataFrame(a = c(1, 3, 2, 4, 5, 6))$
  with_columns(roll_min = pl$col("a")$rolling_min(window_size = 2))

Apply a rolling min based on another column.

Description

Apply a rolling min based on another column.

Usage

Expr_rolling_min_by(by, window_size, ..., min_periods = 1, closed = "right")

Arguments

by

This column must be of dtype Date or Datetime.

window_size

The length of the window. Can be a fixed integer size, or a dynamic temporal size indicated by the following string language:

  • 1ns (1 nanosecond)

  • 1us (1 microsecond)

  • 1ms (1 millisecond)

  • 1s (1 second)

  • 1m (1 minute)

  • 1h (1 hour)

  • 1d (1 day)

  • 1w (1 week)

  • 1mo (1 calendar month)

  • 1y (1 calendar year)

  • 1i (1 index count)

If the dynamic string language is used, the by and closed arguments must also be set.

...

Ignored.

min_periods

The number of values in the window that should be non-null before computing a result. If NULL, it will be set equal to window size.

closed

Define which sides of the temporal interval are closed (inclusive). This can be either "left", "right", "both" or "none".

Details

If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), as this method can cache the window size computation.

Value

Expr

Examples

df_temporal = pl$DataFrame(
  date = pl$datetime_range(as.Date("2001-1-1"), as.Date("2001-1-2"), "1h")
)$with_row_index("index")

df_temporal

df_temporal$with_columns(
  rolling_row_min = pl$col("index")$rolling_min_by("date", window_size = "3h")
)

Rolling quantile

Description

Compute the rolling (= moving) quantile over the values in this array. A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weight vector.

Usage

Expr_rolling_quantile(
  quantile,
  interpolation = "nearest",
  window_size,
  weights = NULL,
  min_periods = NULL,
  ...,
  center = FALSE
)

Arguments

quantile

Quantile between 0 and 1.

interpolation

String, one of "nearest", "higher", "lower", "midpoint", "linear".

window_size

Integer specifying the length of the window.

weights

An optional slice with the same length as the window that will be multiplied elementwise with the values in the window.

min_periods

The number of values in the window that should be non-null before computing a result. If NULL, it will be set equal to window size.

...

Ignored.

center

Set the labels at the center of the window

Details

If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), as this method can cache the window size computation.

Value

Expr

Examples

pl$DataFrame(a = c(1, 3, 2, 4, 5, 6))$
  with_columns(roll_quant = pl$col("a")$rolling_quantile(0.3, window_size = 2))

Compute a rolling quantile based on another column

Description

Compute a rolling quantile based on another column

Usage

Expr_rolling_quantile_by(
  by,
  window_size,
  ...,
  quantile,
  interpolation = "nearest",
  min_periods = 1,
  closed = "right"
)

Arguments

by

This column must be of dtype Date or Datetime.

window_size

The length of the window. Can be a fixed integer size, or a dynamic temporal size indicated by the following string language:

  • 1ns (1 nanosecond)

  • 1us (1 microsecond)

  • 1ms (1 millisecond)

  • 1s (1 second)

  • 1m (1 minute)

  • 1h (1 hour)

  • 1d (1 day)

  • 1w (1 week)

  • 1mo (1 calendar month)

  • 1y (1 calendar year)

  • 1i (1 index count)

If the dynamic string language is used, the by and closed arguments must also be set.

...

Ignored.

quantile

Either a numeric value or an Expr whose value must be between 0 and 1.

interpolation

One of "nearest", "higher", "lower", "midpoint", or "linear".

min_periods

The number of values in the window that should be non-null before computing a result. If NULL, it will be set equal to window size.

closed

Define which sides of the temporal interval are closed (inclusive). This can be either "left", "right", "both" or "none".

Details

If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), which can cache the window size computation.

Value

Expr

Examples

df_temporal = pl$DataFrame(
  date = pl$datetime_range(as.Date("2001-1-1"), as.Date("2001-1-2"), "1h")
)$with_row_index("index")

df_temporal

df_temporal$with_columns(
  rolling_row_quantile = pl$col("index")$rolling_quantile_by(
    "date",
    window_size = "2h", quantile = 0.3
  )
)

Rolling skew

Description

Compute the rolling (= moving) skewness over the values in this array. A window of length window_size will traverse the array.

Usage

Expr_rolling_skew(window_size, bias = TRUE)

Arguments

window_size

Integer specifying the length of the window.

bias

If FALSE, the calculations are corrected for statistical bias.

Details

For normally distributed data, the skewness should be about zero. For unimodal continuous distributions, a skewness value greater than zero means that there is more weight in the right tail of the distribution.

Value

Expr

Examples

pl$DataFrame(a = c(1, 3, 2, 4, 5, 6))$
  with_columns(roll_skew = pl$col("a")$rolling_skew(window_size = 2))
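
To make the bias argument and the sign of the result more concrete, the base R sketch below computes the skewness of a single window. It assumes the usual Fisher-Pearson definition (third central moment divided by the second central moment raised to the power 1.5). skewness() is a hypothetical helper, not a polars function, and the library's exact formula may differ.

```r
# Illustrative only: skewness of one window under the Fisher-Pearson definition
# (an assumption; this is not the polars implementation).
skewness <- function(x, bias = TRUE) {
  n <- length(x)
  m2 <- mean((x - mean(x))^2) # second central moment
  m3 <- mean((x - mean(x))^3) # third central moment
  g1 <- m3 / m2^1.5
  # With bias = FALSE, apply the standard sample-size correction (needs n > 2)
  if (bias) g1 else g1 * sqrt(n * (n - 1)) / (n - 2)
}

skewness(c(1, 2, 3)) # symmetric window: zero skewness
skewness(c(1, 1, 4)) # one large value on the right: positive skewness
skewness(c(1, 1, 4), bias = FALSE) # bias-corrected, larger in magnitude
```

Note that under this definition any window with exactly two distinct values has zero skewness, so a window of at least three values is needed to see a non-zero result.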

Rolling standard deviation

Description

Compute the rolling (= moving) standard deviation over the values in this array. A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weight vector.

Usage

Expr_rolling_std(
  window_size,
  weights = NULL,
  min_periods = NULL,
  ...,
  center = FALSE,
  ddof = 1
)

Arguments

window_size

Integer specifying the length of the window.

weights

An optional slice with the same length as the window that will be multiplied elementwise with the values in the window.

min_periods

The number of values in the window that should be non-null before computing a result. If NULL, it will be set equal to window size.

...

Ignored.

center

If TRUE, set the labels at the center of the window.

ddof

An integer representing "Delta Degrees of Freedom": the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1.

Details

If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), which can cache the window size computation.

Value

Expr

Examples

pl$DataFrame(a = c(1, 3, 2, 4, 5, 6))$
  with_columns(roll_std = pl$col("a")$rolling_std(window_size = 2))
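
The effect of ddof can be checked by hand on a single window. The base R sketch below uses a hypothetical helper, std_ddof(), which is not part of polars, and applies the N - ddof divisor described in the arguments.

```r
# Illustrative only: standard deviation of one window with divisor N - ddof
std_ddof <- function(x, ddof = 1) {
  sqrt(sum((x - mean(x))^2) / (length(x) - ddof))
}

window <- c(1, 3) # first complete window of the example above
std_ddof(window, ddof = 1) # sample standard deviation, same as sd(window)
std_ddof(window, ddof = 0) # population standard deviation
```

With ddof = 1 the helper agrees with base R's sd(); with ddof = 0 the divisor is the window length itself.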

Compute a rolling standard deviation based on another column

Description

Compute a rolling standard deviation based on another column

Usage

Expr_rolling_std_by(
  by,
  window_size,
  ...,
  min_periods = 1,
  closed = "right",
  ddof = 1
)

Arguments

by

This column must be of dtype Date or Datetime.

window_size

The length of the window. Can be a fixed integer size, or a dynamic temporal size indicated by the following string language:

  • 1ns (1 nanosecond)

  • 1us (1 microsecond)

  • 1ms (1 millisecond)

  • 1s (1 second)

  • 1m (1 minute)

  • 1h (1 hour)

  • 1d (1 day)

  • 1w (1 week)

  • 1mo (1 calendar month)

  • 1y (1 calendar year)

  • 1i (1 index count)

If the dynamic string language is used, the by and closed arguments must also be set.

...

Ignored.

min_periods

The number of values in the window that should be non-null before computing a result. If NULL, it will be set equal to window size.

closed

Define which sides of the temporal interval are closed (inclusive). This can be either "left", "right", "both" or "none".

ddof

An integer representing "Delta Degrees of Freedom": the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1.

Details

If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), which can cache the window size computation.

Value

Expr

Examples

df_temporal = pl$DataFrame(
  date = pl$datetime_range(as.Date("2001-1-1"), as.Date("2001-1-2"), "1h")
)$with_row_index("index")

df_temporal

# Compute the rolling std with the temporal windows closed on the right (default)
df_temporal$with_columns(
  rolling_row_std = pl$col("index")$rolling_std_by("date", window_size = "2h")
)

# Compute the rolling std with the closure of windows on both sides
df_temporal$with_columns(
  rolling_row_std = pl$col("index")$rolling_std_by("date", window_size = "2h", closed = "both")
)
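
The two calls above differ only in which window endpoints are included. The base R sketch below shows the four closed options on plain numbers standing in for hourly timestamps. It assumes each window ends at the current row's time t and starts at t minus the window size. in_window() is a hypothetical helper, not a polars function.

```r
# Illustrative only: which time values fall inside the window ending at `t`
# (assumes the window spans from t - size to t; not the polars implementation).
in_window <- function(times, t, size, closed = "right") {
  start <- t - size
  lower_ok <- if (closed %in% c("left", "both")) times >= start else times > start
  upper_ok <- if (closed %in% c("right", "both")) times <= t else times < t
  lower_ok & upper_ok
}

hours <- 0:4 # stand-in for hourly timestamps

# Members of a 2-hour window ending at hour 3, for each closure
lapply(
  c(right = "right", left = "left", both = "both", none = "none"),
  function(closure) hours[in_window(hours, t = 3, size = 2, closed = closure)]
)
```

Under this assumption, closed = "both" includes one more row per window than closed = "right", which is why the two rolling standard deviations above can differ.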

Rolling sum

Description

Compute the rolling (= moving) sum over the values in this array. A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weight vector.

Usage

Expr_rolling_sum(
  window_size,
  weights = NULL,
  min_periods = NULL,
  center = FALSE
)

Arguments

window_size

Integer specifying the length of the window.

weights

An optional slice with the same length as the window that will be multiplied elementwise with the values in the window.

min_periods

The number of values in the window that should be non-null before computing a result. If NULL, it will be set equal to window size.

center

If TRUE, set the labels at the center of the window.

Details

If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), which can cache the window size computation.

Value

Expr

Examples

pl$DataFrame(a = c(1, 3, 2, 4, 5, 6))$
  with_columns(roll_sum = pl$col("a")$rolling_sum(window_size = 2))
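
As a rough illustration of how weights enter the calculation, the base R sketch below multiplies each window elementwise by the weights before summing, and leaves positions without a full window as NA (standing in for null). roll_sum() is a hypothetical helper written for this note, not the polars implementation, and it assumes min_periods equals the window size.

```r
# Illustrative only: weighted rolling sum over trailing windows
roll_sum <- function(x, window_size, weights = rep(1, window_size)) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    if (i >= window_size) {
      window <- x[(i - window_size + 1):i]
      # Elementwise product with the weights, then the sum
      out[i] <- sum(window * weights)
    }
  }
  out
}

x <- c(1, 3, 2, 4, 5, 6)
roll_sum(x, window_size = 2)
roll_sum(x, window_size = 2, weights = c(0.5, 1))
```

In the weighted call, the older value in each window contributes half as much as the newer one.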

Apply a rolling sum based on another column.

Description

Apply a rolling sum based on another column.

Usage

Expr_rolling_sum_by(by, window_size, ..., min_periods = 1, closed = "right")

Arguments

by

This column must be of dtype Date or