
This vignette demonstrates how to construct a semantically annotated recordset_df from ordinary filesystem observations. As a minimal example, we create five files representing two Records. Record a consists of a textual description and an image of a museum object. Record b consists of a textual description, a digital surrogate, and an OCR transcription of an archival document.

library(fscontext)
tmp_dir <- file.path(tempdir(), "recordset_df")
if (dir.exists(tmp_dir)) unlink(tmp_dir, recursive = TRUE)
dir.create(tmp_dir)

writeLines(
  c("<html>", "<body>", "Description of the A object.", "</body>", "</html>"),
  file.path(tmp_dir, "a.html")
)
Sys.sleep(1)

writeBin(charToRaw("JPEG"), file.path(tmp_dir, "a.jpg"))
Sys.sleep(1)

writeLines(
  c("<html>", "<body>", "Description of the B document.", "</body>", "</html>"),
  file.path(tmp_dir, "b.html")
)
Sys.sleep(1)

writeLines(
  c("%PDF-1.4", "Digital surrogate of B."),
  file.path(tmp_dir, "b.pdf")
)
Sys.sleep(1)

writeLines(
  "Plain text transcription of B.",
  file.path(tmp_dir, "b.txt")
)

We observe the temporary directory using snapshot_storage(). The resulting snapshot records file-level observations such as names, timestamps, checksums and other filesystem metadata without making any assumptions about the semantic relationships between the files.

rs001_snapshot_file <- snapshot_storage(
  path = tmp_dir,
  root = tmp_dir
)
#> Starting scan_storage() on: C:/Users/DANIEL~1/AppData/Local/Temp/RtmpuC0UnR/recordset_df
#> Scanning 5 files.
#> Signatures computed. Detecting repositories and Git status...
#> Files scanned: 5
#> Files in Git repos: 0
#> Files tracked by Git: 0
#> Skipped approximately 0 inaccessible files
#> scan_storage completed in 0.08 seconds
#> Saved: C:\Users\DANIEL~1\AppData\Local\Temp\RtmpuC0UnR/recordset_df/scan_local-storage_c_recordset_df_20260629-143017_6427d2.rds
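
To make the idea of a file-level observation concrete, the following base R sketch collects names, sizes, modification times and checksums for a small directory. It is only an illustration: it does not reproduce how snapshot_storage() works internally, and the directory name, the two files and the column layout are assumptions made for this example.

```r
# Base R sketch of file-level observations; NOT the snapshot_storage() implementation.
demo_dir <- file.path(tempdir(), "observation_sketch") # hypothetical directory
if (dir.exists(demo_dir)) unlink(demo_dir, recursive = TRUE)
dir.create(demo_dir)

writeLines("Description of the A object.", file.path(demo_dir, "a.html"))
writeLines("Plain text transcription of B.", file.path(demo_dir, "b.txt"))

paths <- list.files(demo_dir, full.names = TRUE)
info <- file.info(paths)

# One row per observed file; nothing here says how the files relate to each other.
observations <- data.frame(
  filename = basename(paths),
  stem = tools::file_path_sans_ext(basename(paths)),
  size = info$size,
  mtime = info$mtime,
  md5 = unname(tools::md5sum(paths)),
  stringsAsFactors = FALSE
)
observations
```

The table contains only what the filesystem reports. Any statement about which files form a Record has to be added later as an explicit assertion.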

Next we add simple curatorial metadata. In this example we assign a human-readable description to each observed file and keep the stem column, which indicates which files belong to the same Record. Any further metadata or data about the Records can be added in the same way.

rs001_snapshot <- readRDS(rs001_snapshot_file)
rs001_snapshot$description <- c(
  "Description of Object A", "Image of Object A",
  "Description of Record B", "Surrogate of Record B", "OCR Text of Record B"
)
rs001_df <- rs001_snapshot[
  ,
  c("stem", "filename", "description", "quick_sig", "ctime")
]
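
The positional assignment above works only when the snapshot rows appear in the same order as the description vector. A more defensive variant, sketched here with a hypothetical stand-in data frame rather than a real snapshot, keys the descriptions by filename so that row order no longer matters.

```r
# Hypothetical stand-in for the snapshot; only the filename column matters here.
snapshot_sketch <- data.frame(
  filename = c("b.txt", "a.jpg", "b.html", "a.html", "b.pdf"), # deliberately unsorted
  stringsAsFactors = FALSE
)

# Curatorial descriptions keyed by filename rather than by row position.
descriptions <- c(
  "a.html" = "Description of Object A",
  "a.jpg"  = "Image of Object A",
  "b.html" = "Description of Record B",
  "b.pdf"  = "Surrogate of Record B",
  "b.txt"  = "OCR Text of Record B"
)

snapshot_sketch$description <- unname(descriptions[snapshot_sketch$filename])

# A file without a curated description would appear as NA; fail early if so.
stopifnot(!anyNA(snapshot_sketch$description))
snapshot_sketch
```

The same lookup can be applied to the filename column of a real snapshot, assuming every observed file has an entry in the named vector.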

recordset_df

The recordset_df class extends dataset::dataset_df with lightweight semantics for describing Record Sets, Records and Record Parts. Rather than implementing the complete RiC ontology, it provides a small number of conventions that support reproducible workflows while remaining compatible with ordinary tidy data.

The recordset_df class uses dataset_df internally for metadata, provenance and serialisation; dataset_df is itself an extension of the tibble::tibble() tbl_df data frame. Users who require richer metadata or publication-oriented functionality can use the methods provided by the dataset package directly.

rs001 <- recordset_df(
  x = rs001_df,
  creator = utils::person("Jane", "Doe", role = "aut"),
  title = "Demonstrator Record Set",
  record_set_identifier = "http://example.com/archive/sets/rs001",
  description = "A demonstration of a record set",
  record_identifier = "stem",
  record_part_identifier = "filename"
)
#> Warning: Record identifiers are not unique.

The constructor warns that the Record identifiers are not unique. This is expected because each Record is represented by multiple observed files. In this example, each file is one Record Part: Record a has two Record Parts and Record b has three, so the Record identifier necessarily occurs more than once.

print(rs001)
#> Doe (:tba): Demonstrator Record Set [dataset]
#>   rowid stem  filename description             quick_sig ctime               
#>   <chr> <chr> <chr>    <chr>                   <chr>     <dttm>             
#> 1 obs1  a     a.html   Description of Object A 7ebf6252  2026-06-29 14:30:13
#> 2 obs2  a     a.jpg    Image of Object A       c6849a17  2026-06-29 14:30:14
#> 3 obs3  b     b.html   Description of Record B d391c0c0  2026-06-29 14:30:15
#> 4 obs4  b     b.pdf    Surrogate of Record B   ca0753e1  2026-06-29 14:30:16
#> 5 obs5  b     b.txt    OCR Text of Record B    e1c59197  2026-06-29 14:30:17
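
The printed columns make the source of the warning easy to verify. The following base R sketch copies the stem and filename values from the output above into plain character vectors (so it does not depend on the recordset_df object) and checks where identifiers repeat.

```r
# Values copied from the printed record set above.
stem <- c("a", "a", "b", "b", "b")
filename <- c("a.html", "a.jpg", "b.html", "b.pdf", "b.txt")

# anyDuplicated() returns 0 when all values are unique.
record_ids_repeat <- anyDuplicated(stem) > 0     # TRUE
part_ids_repeat <- anyDuplicated(filename) > 0   # FALSE

# Number of Record Parts observed for each Record: a = 2, b = 3.
parts_per_record <- table(stem)
parts_per_record
```

Repeated Record identifiers with unique Record Part identifiers are what this one-row-per-file layout is expected to produce.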

The stem column is declared to contain identifiers of RiC Records. The values are annotated as rico:Identifier objects and labelled “Record Identifier”, allowing downstream software to distinguish Record identifiers from other identifiers without requiring a complete RiC knowledge graph. (See: rico:Record)

rs001$stem
#> x: Record Identifier
#> Defined as rico:Identifier 
#> [1] "a" "a" "b" "b" "b"

The filename column is declared to contain identifiers of RiC Record Parts. Record a consists of two Record Parts—a textual description and an image of the object—while Record b consists of three Record Parts: a textual description, a digital surrogate and an OCR transcription. Each file therefore identifies an individual Record Part within its parent Record. (See: rico:RecordPart)

rs001$filename
#> x: Record Part Identifier
#> Defined as rico:Identifier 
#> [1] "a.html" "a.jpg"  "b.html" "b.pdf"  "b.txt"
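
The parent-child relationship between the two identifier columns can be sketched in base R. The example assumes that the stem is simply the filename without its extension, which matches the values shown above but is not confirmed here as the rule fscontext applies.

```r
# Record Part identifiers as shown in the output above.
filename <- c("a.html", "a.jpg", "b.html", "b.pdf", "b.txt")

# Assumption: the Record identifier is the filename without its extension.
stem <- tools::file_path_sans_ext(filename)

# Group Record Parts under their parent Record.
parts_by_record <- split(filename, stem)
parts_by_record
```

The result is a named list with one element per Record, each holding the identifiers of its Record Parts.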

This example illustrates the intended role of recordset_df: observational evidence is acquired first, and lightweight semantic assertions are added afterwards. The resulting object remains an ordinary data.frame while carrying sufficient metadata to support reproducible archival, curatorial and semantic enrichment workflows.