---
title: "tibblify"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{tibblify}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
```

## Introduction

With `tibblify()` you can rectangle deeply nested lists into a tidy tibble. These
lists might come from an API in the form of JSON or from scraping XML. The reasons
to use `tibblify()` over other tools like `jsonlite::fromJSON()` or `tidyr::hoist()`
are:

* It can guess the output format like `jsonlite::fromJSON()`.
* You can also provide a specification of how to rectangle the input.
* The specification is easy to understand.
* You can bring most inputs into the shape you want in a single step.
* Rectangling is much faster than with `jsonlite::fromJSON()`.


## Example

Let's start with `gh_users`, which is a list containing information about four
GitHub users.

```{r}
library(tibblify)

gh_users_small <- purrr::map(
  gh_users,
  ~ .x[c("followers", "login", "url", "name", "location", "email", "public_gists")]
)

names(gh_users_small[[1]])
```

Quickly rectangling `gh_users_small` is as easy as applying `tibblify()` to it:

```{r}
tibblify(gh_users_small)
```

We can now look at the specification `tibblify()` used for rectangling:

```{r}
guess_tspec(gh_users_small)
```


If we are only interested in some of the fields, we can easily adapt the specification:

```{r}
spec <- tspec_df(
  login_name = tib_chr("login"),
  tib_chr("name"),
  tib_int("public_gists")
)

tibblify(gh_users_small, spec)
```


## Objects

We refer to lists like `gh_users_small` as _collections_; the elements of such
lists are _objects_. Objects and collections are the typical input for
`tibblify()`.

Basically, an _object_ is simply something that can be converted to a one-row tibble.
This boils down to a condition on the names of the object:

* the object must have names (the `names` attribute must not be `NULL`),
* every element must be named (no name can be `NA` or `""`),
* and the names must be unique.

In other words, the names must fulfill `vctrs::vec_as_names(repair = "check_unique")`.
The name-value pairs of an object are the _fields_.

For example, `list(x = 1, y = "a")` is an object with the fields `(x, 1)` and
`(y, "a")`, but `list(1, z = 3)` is not an object because it is not fully named.

A _collection_ is basically just a list of similar objects, so that the fields can
become the columns of a tibble.
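
The three name conditions can also be checked by hand. The following is a minimal
sketch in base R; `is_object_like()` is a hypothetical helper written for this
illustration and is not part of tibblify, which performs its own validation.

```{r}
# Hypothetical helper (not part of tibblify): checks the three name conditions.
is_object_like <- function(x) {
  nms <- names(x)
  # Condition 1: the input is a list and the `names` attribute is not NULL.
  if (!is.list(x) || is.null(nms)) {
    return(FALSE)
  }
  # Condition 2: no name is NA or "". Condition 3: the names are unique.
  !anyNA(nms) && all(nms != "") && anyDuplicated(nms) == 0
}

is_object_like(list(x = 1, y = "a"))
is_object_like(list(1, z = 3))
```

The first call describes a fully named list, the second one a list whose first
element has no name.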


## Specification

Providing an explicit specification has a couple of advantages:

* you can ensure type and shape stability of the resulting tibble in automated scripts.
* you can give the columns different names.
* you can restrict to parsing only the fields you need.
* you can specify what happens if a value is missing.


As seen before, the specification for a collection is created with `tspec_df()`. The
columns of the output tibble are described with the `tib_*()` functions. They
describe the path to the field to extract and the output type of the field. There
are the following five types of functions:

* `tib_scalar(ptype)`: a length one vector with type `ptype`
* `tib_vector(ptype)`: a vector of arbitrary length with type `ptype`
* `tib_variant()`: a vector of arbitrary length and type; you should rarely need this
* `tib_row(...)`: an object with the fields `...`
* `tib_df(...)`: a collection where the objects have the fields `...`
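
As a rough illustration of the two types that do not get their own section below,
the following sketch uses `tib_variant()` and `tib_df()` on made-up data. The
names `orders`, `orders_spec` and `orders_df` and all field names are
assumptions chosen for this example.

```{r}
library(tibblify)

# Made-up data: each order has a nested collection `items` and a field
# `note` whose type differs between objects.
orders <- list(
  list(
    id = 1,
    note = "gift",
    items = list(list(sku = "a", qty = 2), list(sku = "b", qty = 1))
  ),
  list(
    id = 2,
    note = 404,
    items = list(list(sku = "c", qty = 5))
  )
)

orders_spec <- tspec_df(
  tib_int("id"),
  # `note` has no common type, so it is kept as a list column.
  tib_variant("note"),
  # `items` is a collection, so each row gets its own tibble.
  tib_df("items", tib_chr("sku"), tib_int("qty"))
)

orders_df <- tibblify(orders, orders_spec)
orders_df

# The nested tibble of the first order.
orders_df$items[[1]]
```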

For convenience there are shortcuts for `tib_scalar()` and `tib_vector()` for
the most common prototypes:

* `logical()`: `tib_lgl()` and `tib_lgl_vec()`
* `integer()`: `tib_int()` and `tib_int_vec()`
* `double()`: `tib_dbl()` and `tib_dbl_vec()`
* `character()`: `tib_chr()` and `tib_chr_vec()`
* `Date`: `tib_date()` and `tib_date_vec()`
* `Date` encoded as character: `tib_chr_date()` and `tib_chr_date_vec()`
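
To make the relation between a shortcut and its general form concrete, the
sketch below extracts the same field once with `tib_int()` and once with
`tib_scalar()` and an `integer()` prototype. The data and the names `ids`,
`via_shortcut` and `via_scalar` are made up for this example.

```{r}
library(tibblify)

ids <- list(list(id = 1L), list(id = 2L))

via_shortcut <- tibblify(ids, tspec_df(tib_int("id")))
via_scalar <- tibblify(ids, tspec_df(tib_scalar("id", ptype = integer())))

# Compare the extracted columns.
identical(via_shortcut$id, via_scalar$id)
```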


### Scalar Elements

Scalar elements are the most common case and result in a normal vector column:

```{r}
tibblify(
  list(
    list(id = 1, name = "Peter"),
    list(id = 2, name = "Lilly")
  ),
  tspec_df(
    tib_int("id"),
    tib_chr("name")
  )
)
```

With `tib_scalar()` you can also provide your own prototype. Let's say you have
a list with durations

```{r}
x <- list(
  list(id = 1, duration = vctrs::new_duration(100)),
  list(id = 2, duration = vctrs::new_duration(200))
)
x
```

and then pass a matching prototype to `tib_scalar()`:

```{r}
tibblify(
  x,
  tspec_df(
    tib_int("id"),
    tib_scalar("duration", ptype = vctrs::new_duration())
  )
)
```


### Vector Elements

If an element does not always have size one, then it is a vector element. If it
still always has the same type `ptype`, then it produces a list column whose
elements all have type `ptype`:

```{r}
x <- list(
  list(id = 1, children = c("Peter", "Lilly")),
  list(id = 2, children = "James"),
  list(id = 3, children = c("Emma", "Noah", "Charlotte"))
)

tibblify(
  x,
  tspec_df(
    tib_int("id"),
    tib_chr_vec("children")
  )
)
```

You can use [`tidyr::unnest()`](https://tidyr.tidyverse.org/reference/nest.html) or [`tidyr::unnest_longer()`](https://tidyr.tidyverse.org/reference/hoist.html) to flatten these columns to regular columns.
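
A minimal sketch of that flattening step, assuming the tidyr package is
installed (the chunk is skipped otherwise). The names `families` and
`families_df` are made up for this example; the data mirrors the example above.

```{r, eval = requireNamespace("tidyr", quietly = TRUE)}
library(tibblify)

families <- list(
  list(id = 1, children = c("Peter", "Lilly")),
  list(id = 2, children = "James"),
  list(id = 3, children = c("Emma", "Noah", "Charlotte"))
)

families_df <- tibblify(
  families,
  tspec_df(
    tib_int("id"),
    tib_chr_vec("children")
  )
)

# One row per child; `id` is repeated for each child of the same object.
families_long <- tidyr::unnest_longer(families_df, children)
families_long
```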


### Object Elements

Fields can also contain nested objects. For example, in `gh_repos_small`

```{r}
gh_repos_small <- purrr::map(gh_repos, ~ .x[c("id", "name", "owner")])
gh_repos_small <- purrr::map(
  gh_repos_small,
  function(repo) {
    repo$owner <- repo$owner[c("login", "id", "url")]
    repo
  }
)

gh_repos_small[[1]]
```

the field `owner` is an object itself. The specification to extract it uses `tib_row()`:

```{r}
spec <- guess_tspec(gh_repos_small)
spec
```

and results in a tibble column:

```{r}
tibblify(gh_repos_small, spec)
```

If you don't want the tibble column, you can unpack it with `tidyr::unpack()`.
Alternatively, if you only want to extract some of the fields in `owner`, you
can use a nested path:

```{r}
spec2 <- tspec_df(
  id = tib_int("id"),
  name = tib_chr("name"),
  owner_id = tib_int(c("owner", "id")),
  owner_login = tib_chr(c("owner", "login"))
)
spec2

tibblify(gh_repos_small, spec2)
```
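
For comparison, the `tidyr::unpack()` route mentioned above might look like the
following sketch. It assumes the tidyr package is installed (the chunk is
skipped otherwise) and uses a small made-up list, `repos`, instead of
`gh_repos_small` so that it runs on its own. Because both the repository and its
owner have an `id` field, `names_sep` is used to prefix the unpacked columns.

```{r, eval = requireNamespace("tidyr", quietly = TRUE)}
library(tibblify)

# Made-up data with the same shape as `gh_repos_small`.
repos <- list(
  list(id = 1, name = "repo-a", owner = list(login = "user1", id = 10)),
  list(id = 2, name = "repo-b", owner = list(login = "user2", id = 20))
)

repos_spec <- tspec_df(
  tib_int("id"),
  tib_chr("name"),
  tib_row("owner", tib_chr("login"), tib_int("id"))
)

repos_df <- tibblify(repos, repos_spec)

# Turn the tibble column `owner` into the columns `owner_login` and `owner_id`.
repos_flat <- tidyr::unpack(repos_df, owner, names_sep = "_")
repos_flat
```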


## Required and Optional Fields

Objects usually have some fields that always exist and some that are optional.
By default, `tib_*()` demands that a field exists:

```{r error=TRUE}
x <- list(
  list(x = 1, y = "a"),
  list(x = 2)
)

spec <- tspec_df(
  x = tib_int("x"),
  y = tib_chr("y")
)

tibblify(x, spec)
```

You can mark a field as optional with the argument `required = FALSE`:

```{r}
spec <- tspec_df(
  x = tib_int("x"),
  y = tib_chr("y", required = FALSE)
)

tibblify(x, spec)
```

You can specify the value to use for a missing field with the `fill` argument:

```{r}
spec <- tspec_df(
  x = tib_int("x"),
  y = tib_chr("y", required = FALSE, fill = "missing")
)

tibblify(x, spec)
```


## Converting a Single Object

To rectangle a single object you have two options: `tspec_object()`, which produces
a list, or `tspec_row()`, which produces a tibble with one row.

While tibbles are great for collections, it often makes more sense to convert a
single object to a list.

For example, a typical API response might look something like this:

```{r}
api_output <- list(
  status = "success",
  requested_at = "2021-10-26 09:17:12",
  data = list(
    list(x = 1),
    list(x = 2)
  )
)
```


To convert to a one-row tibble

```{r}
row_spec <- tspec_row(
  status = tib_chr("status"),
  data = tib_df(
    "data",
    x = tib_int("x")
  )
)

api_output_df <- tibblify(api_output, row_spec)
api_output_df
```

it is necessary to wrap `data` in a list. To access `data`, one then has to use
`api_output_df$data[[1]]`, which is inconvenient. Converting to a list with
`tspec_object()` avoids this:

```{r}
object_spec <- tspec_object(
  status = tib_chr("status"),
  data = tib_df(
    "data",
    x = tib_int("x")
  )
)

api_output_list <- tibblify(api_output, object_spec)
api_output_list
```

Now accessing `data` does not require an extra subsetting step:

```{r}
api_output_list$data
```