This vignette demonstrates how to construct a semantically annotated
recordset_df from ordinary filesystem observations. As a
minimal example, we create five files representing two Records. Record
a consists of a textual description and an image of a
museum object. Record b consists of a textual description,
a digital surrogate, and an OCR transcription of an archival
document.
library(fscontext)
tmp_dir <- file.path(tempdir(), "recordset_df")
if (dir.exists(tmp_dir)) unlink(tmp_dir, recursive = TRUE)
dir.create(tmp_dir)
writeLines(
c("<html>", "<body>", "Description of the A object.", "</body>", "</html>"),
file.path(tmp_dir, "a.html")
)
Sys.sleep(1)
writeBin(charToRaw("JPEG"), file.path(tmp_dir, "a.jpg"))
Sys.sleep(1)
writeLines(
c("<html>", "<body>", "Description of the B document.", "</body>", "</html>"),
file.path(tmp_dir, "b.html")
)
Sys.sleep(1)
writeLines(
c("%PDF-1.4", "Digital surrogate of B."),
file.path(tmp_dir, "b.pdf")
)
Sys.sleep(1)
writeLines(
"Plain text transcription of B.",
file.path(tmp_dir, "b.txt")
)
We observe the temporary directory using
snapshot_storage(). The resulting snapshot records
file-level observations such as names, timestamps, checksums and other
filesystem metadata without making any assumptions about the semantic
relationships between the files.
rs001_snapshot_file <- snapshot_storage(
path = tmp_dir,
root = tmp_dir
)
#> Starting scan_storage() on: C:/Users/DANIEL~1/AppData/Local/Temp/RtmpuC0UnR/recordset_df
#> Scanning 5 files.
#> Signatures computed. Detecting repositories and Git status...
#> Files scanned: 5
#> Files in Git repos: 0
#> Files tracked by Git: 0
#> Skipped approximately 0 inaccessible files
#> scan_storage completed in 0.08 seconds
#> Saved: C:\Users\DANIEL~1\AppData\Local\Temp\RtmpuC0UnR/recordset_df/scan_local-storage_c_recordset_df_20260629-143017_6427d2.rds
Next, we add simple curatorial metadata. In this example, we assign a human-readable description to each observed file; the stem column of the snapshot already indicates which files belong to the same Record. Further descriptive columns about the Records can be added in the same way.
rs001_snapshot <- readRDS(rs001_snapshot_file)
rs001_snapshot$description <- c(
"Description of Object A", "Image of Object A",
"Description of Record B", "Surrogate of Record B", "OCR Text of Record B"
)
rs001_df <- rs001_snapshot[
,
c("stem", "filename", "description", "quick_sig", "ctime")
]
recordset_df
The recordset_df class extends dataset::dataset_df with
lightweight semantics for describing Record Sets, Records and Record
Parts. Rather than implementing the complete RiC ontology, it provides a
small number of conventions that support reproducible workflows while
remaining compatible with ordinary tidy data.
Internally, recordset_df uses dataset_df, which is itself an
extension of the tibble::tibble() (tbl_df) data frame, for
metadata, provenance and serialisation. Users who
require richer metadata or publication-oriented functionality can use
the methods provided by the dataset
package directly.
rs001 <- recordset_df(
x = rs001_df,
creator = utils::person("Jane", "Doe", role = "aut"),
title = "Demonstrator Record Set",
record_set_identifier = "http://example.com/archive/sets/rs001",
description = "A demonstration of a record set",
record_identifier = "stem",
record_part_identifier = "filename"
)
#> Warning: Record identifiers are not unique.
The constructor warns that the Record identifiers are not unique.
This is expected because each Record is represented by multiple observed
files. In this example, Record a has two Record Parts
(individual files) and Record b has three Record Parts
(files), so the Record identifier necessarily occurs more than once.
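The repetition behind the warning can be checked with base R alone. The following is a minimal sketch, not part of the package API: the `parts` data frame is a hypothetical stand-in that copies only the stem and filename values of this example, and it assumes nothing about how recordset_df performs its own uniqueness check.

```r
# Hypothetical stand-in for the stem and filename columns of the example.
parts <- data.frame(
  stem = c("a", "a", "b", "b", "b"),
  filename = c("a.html", "a.jpg", "b.html", "b.pdf", "b.txt"),
  stringsAsFactors = FALSE
)

# Number of Record Parts observed for each Record identifier.
parts_per_record <- table(parts$stem)

# TRUE when at least one Record identifier occurs more than once.
has_repeated_records <- anyDuplicated(parts$stem) > 0

# TRUE when every Record Part identifier occurs exactly once.
parts_are_unique <- anyDuplicated(parts$filename) == 0
```

A repeated Record identifier together with unique Record Part identifiers is the pattern one would expect whenever a Record is spread over several files.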
print(rs001)
#> Doe (:tba): Demonstrator Record Set [dataset]
#> rowid stem filename description quick_sig ctime
#> <chr> <chr> <chr> <chr> <chr> <dttm>
#> 1 obs1 a a.html Description of Object A 7ebf6252 2026-06-29 14:30:13
#> 2 obs2 a a.jpg Image of Object A c6849a17 2026-06-29 14:30:14
#> 3 obs3 b b.html Description of Record B d391c0c0 2026-06-29 14:30:15
#> 4 obs4 b b.pdf Surrogate of Record B ca0753e1 2026-06-29 14:30:16
#> 5 obs5 b b.txt OCR Text of Record B e1c59197 2026-06-29 14:30:17
The stem column is declared to contain identifiers of
RiC Records. The values are annotated as rico:Identifier
objects and labelled “Record Identifier”, allowing downstream software
to distinguish Record identifiers from other identifiers without
requiring a complete RiC knowledge graph. (See: rico:Record)
rs001$stem
#> x: Record Identifier
#> Defined as rico:Identifier
#> [1] "a" "a" "b" "b" "b"
The filename column is declared to contain identifiers
of RiC Record Parts. Record a consists of two Record
Parts—a textual description and an image of the object—while Record
b consists of three Record Parts: a textual description, a
digital surrogate and an OCR transcription. Each file therefore
identifies an individual Record Part within its parent Record. (See: rico:RecordPart)
rs001$filename
#> x: Record Part Identifier
#> Defined as rico:Identifier
#> [1] "a.html" "a.jpg" "b.html" "b.pdf" "b.txt"
This example illustrates the intended role of
recordset_df: observational evidence is acquired first, and
lightweight semantic assertions are added afterwards. The resulting
object remains an ordinary data.frame while carrying
sufficient metadata to support reproducible archival, curatorial and
semantic enrichment workflows.
