
Aggregates filesystem observations by filename and lightweight content signature (quick_sig).

Usage

summarise_duplicates(df)

summarize_duplicates(df)

Arguments

df

A snapshot data.frame conforming to the canonical snapshot schema created by scan_storage() or read_snapshot().

The data frame must contain the columns:

  • filename

  • quick_sig (may contain NA)
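A caller may want to confirm that both columns are present before summarising. The helper below is a minimal sketch of such a check; `check_snapshot_columns()` is a hypothetical name and is not part of the package.

```r
# Hypothetical helper (not part of the package): stop early when a required
# column is absent from the snapshot.
check_snapshot_columns <- function(df, required = c("filename", "quick_sig")) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop(
      "snapshot is missing required column(s): ",
      paste(missing_cols, collapse = ", ")
    )
  }
  invisible(TRUE)
}

# A toy snapshot with both columns passes; quick_sig may be NA.
ok <- check_snapshot_columns(
  data.frame(filename = "a.txt", quick_sig = NA_character_)
)
```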

Value

A data.frame with one row per filename.

The returned variables include:

filename

File basename used as grouping key.

total_copies

Total number of observations sharing this filename.

identical_copies

Size of the largest identical-signature group.

versioned_copies

Number of observations outside the largest identical-signature group.

n_versions

Number of distinct observed signatures.
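The column definitions above can be illustrated on a toy snapshot. The code below is a base R sketch written for this page, not the package's implementation; `toy` and `sketch_summary()` are hypothetical names, and the sketch assumes that NA counts as one distinct signature.

```r
# Hypothetical toy snapshot; only the two required columns are included.
toy <- data.frame(
  filename  = c("a.txt", "a.txt", "a.txt", "b.txt", "b.txt"),
  quick_sig = c("s1", "s1", "s2", "s9", "s9"),
  stringsAsFactors = FALSE
)

# Illustrative re-implementation of the documented columns (not package code).
sketch_summary <- function(df) {
  groups <- split(df$quick_sig, df$filename)
  rows <- lapply(names(groups), function(name) {
    # Count observations per signature, keeping NA as its own group.
    counts <- table(groups[[name]], useNA = "ifany")
    data.frame(
      filename         = name,
      total_copies     = sum(counts),
      identical_copies = max(counts),
      versioned_copies = sum(counts) - max(counts),
      n_versions       = length(counts),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

toy_summary <- sketch_summary(toy)
```

In this toy data, a.txt has three observations with two signatures, so two copies are identical and one is versioned. b.txt has two identical observations and no versioned copies.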

Details

The summary helps to identify:

  • repeated identical observations;

  • potentially synchronised copies;

  • diverging versions of similarly named resources;

  • distributed working duplicates.

The function operates on observational filesystem evidence only.

It does not:

  • infer authoritative file identity;

  • establish Record Resource equivalence;

  • reconstruct provenance lineage;

  • determine curatorial relationships.

In RiC-aligned operational terminology:

  • rows in the snapshot represent filesystem observations;

  • repeated identical quick_sig values provide operational evidence that multiple observations may correspond to the same underlying digital resource;

  • differing signatures associated with the same filename may indicate divergent versions, forks, or independently evolving resources.

The function therefore supports:

  • longitudinal reconstruction;

  • distributed workflow analysis;

  • duplicate detection;

  • exploratory Record Set construction;

  • provenance-aware analytical workflows.

Duplicate observations are not inherently anomalous.

In distributed development workflows the same file may legitimately appear:

  • across multiple machines;

  • across synchronised project folders;

  • in backup or staging locations;

  • in derived analytical Record Sets.

The function therefore reports observational duplication rather than asserting erroneous copying.

The function treats:

  • filename as a weak identity signal;

  • quick_sig as a lightweight content equivalence signal.

Missing signatures (NA) are treated as a valid observational group.

This means:

  • multiple NA signatures are considered identical;

  • a mix of NA and non-NA signatures counts as versioning.
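The NA rule can be mimicked in base R by tabulating signatures with NA kept as its own level. This is only a sketch of the described behaviour using `table(useNA = "ifany")`; whether the package counts groups this way internally is an assumption.

```r
# Two observations of one filename, both without a signature.
sigs_all_na <- c(NA_character_, NA_character_)

# Three observations of one filename: one signature plus two missing.
sigs_mixed <- c("s1", NA_character_, NA_character_)

# useNA = "ifany" keeps NA as a countable group instead of dropping it.
counts_all_na <- table(sigs_all_na, useNA = "ifany")
counts_mixed  <- table(sigs_mixed, useNA = "ifany")

length(counts_all_na)  # 1: the NA observations form a single group
length(counts_mixed)   # 2: NA and "s1" are separate groups
```

Under this reading, the all-NA file would be reported with two identical copies and one version, while the mixed file would be reported with two versions.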

The function does not resolve identity across time or storage contexts.

Examples

data("fscontextdemo_snapshot_01")

combined_snapshot <- rbind(
  fscontextdemo_snapshot_01,
  fscontextdemo_snapshot_01
)

summarise_duplicates(combined_snapshot)
#>                        filename total_copies identical_copies versioned_copies
#> 1                 .Rbuildignore            2                2                0
#> 2                    .gitignore            6                2                4
#> 3                      404.html            2                2                0
#> 4                        404.md            2                2                0
#> 5                   DESCRIPTION            2                2                0
#> 6                       LICENSE            2                2                0
#> 7             LICENSE-text.html            2                2                0
#> 8               LICENSE-text.md            2                2                0
#> 9                  LICENSE.html            2                2                0
#> 10                   LICENSE.md            4                2                2
#> 11                    NAMESPACE            2                2                0
#> 12                   README.Rmd            2                2                0
#> 13                    README.md            2                2                0
#> 14                 _pkgdown.yml            2                2                0
#> 15                      all.css            2                2                0
#> 16                  all.min.css            2                2                0
#> 17                 authors.html            2                2                0
#> 18                   authors.md            2                2                0
#> 19   autocomplete.jquery.min.js            2                2                0
#> 20         bootstrap-toc.min.js            2                2                0
#> 21      bootstrap.bundle.min.js            2                2                0
#> 22  bootstrap.bundle.min.js.map            2                2                0
#> 23            bootstrap.min.css            2                2                0
#> 24             clipboard.min.js            2                2                0
#> 25          country_barplot.jpg            2                2                0
#> 26          country_barplot.png            4                2                2
#> 27          country_barplot.svg            2                2                0
#> 28 create_fsdemo_country_data.R            2                2                0
#> 29                data-deps.txt            2                2                0
#> 30   data-fsdemo_country_data.R            2                2                0
#> 31                     demo.Rmd            2                2                0
#> 32            fa-brands-400.ttf            2                2                0
#> 33          fa-brands-400.woff2            2                2                0
#> 34           fa-regular-400.ttf            2                2                0
#> 35         fa-regular-400.woff2            2                2                0
#> 36             fa-solid-900.ttf            2                2                0
#> 37           fa-solid-900.woff2            2                2                0
#> 38       fa-v4compatibility.ttf            2                2                0
#> 39     fa-v4compatibility.woff2            2                2                0
#> 40          fscontextdemo.Rproj            2                2                0
#> 41       fsdemo_country_data.Rd            2                2                0
#> 42      fsdemo_country_data.csv            2                2                0
#> 43     fsdemo_country_data.html            2                2                0
#> 44       fsdemo_country_data.md            2                2                0
#> 45      fsdemo_country_data.rda            2                2                0
#> 46                  fuse.min.js            2                2                0
#> 47              headroom.min.js            2                2                0
#> 48                hello_world.R            2                2                0
#> 49               hello_world.Rd            2                2                0
#> 50             hello_world.html            2                2                0
#> 51               hello_world.md            2                2                0
#> 52                   index.html            4                2                2
#> 53                     index.md            4                2                2
#> 54       jQuery.headroom.min.js            2                2                0
#> 55              jquery-3.6.0.js            2                2                0
#> 56          jquery-3.6.0.min.js            2                2                0
#> 57         jquery-3.6.0.min.map            2                2                0
#> 58                katex-auto.js            2                2                0
#> 59               lightswitch.js            2                2                0
#> 60                     link.svg            2                2                0
#> 61                     llms.txt            2                2                0
#> 62                  mark.min.js            2                2                0
#> 63     package_initialisation.R            2                2                0
#> 64                   pkgdown.js            2                2                0
#> 65                 pkgdown.yaml            2                2                0
#> 66                  pkgdown.yml            2                2                0
#> 67                  search.json            2                2                0
#> 68                  sitemap.xml            2                2                0
#> 69           test-hello_world.R            2                2                0
#> 70                   testthat.R            2                2                0
#> 71                 v4-shims.css            2                2                0
#> 72             v4-shims.min.css            2                2                0
#>    n_versions
#> 1           1
#> 2           3
#> 3           1
#> 4           1
#> 5           1
#> 6           1
#> 7           1
#> 8           1
#> 9           1
#> 10          2
#> 11          1
#> 12          1
#> 13          1
#> 14          1
#> 15          1
#> 16          1
#> 17          1
#> 18          1
#> 19          1
#> 20          1
#> 21          1
#> 22          1
#> 23          1
#> 24          1
#> 25          1
#> 26          2
#> 27          1
#> 28          1
#> 29          1
#> 30          1
#> 31          1
#> 32          1
#> 33          1
#> 34          1
#> 35          1
#> 36          1
#> 37          1
#> 38          1
#> 39          1
#> 40          1
#> 41          1
#> 42          1
#> 43          1
#> 44          1
#> 45          1
#> 46          1
#> 47          1
#> 48          1
#> 49          1
#> 50          1
#> 51          1
#> 52          2
#> 53          2
#> 54          1
#> 55          1
#> 56          1
#> 57          1
#> 58          1
#> 59          1
#> 60          1
#> 61          1
#> 62          1
#> 63          1
#> 64          1
#> 65          1
#> 66          1
#> 67          1
#> 68          1
#> 69          1
#> 70          1
#> 71          1
#> 72          1
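Because the example binds the same snapshot to itself, every count in the output above is doubled. A common next step is to keep only filenames observed with more than one signature. The sketch below does this on three rows copied from that output; `res` and `diverging` are illustrative names, and in practice `res` would be the value returned by summarise_duplicates().

```r
# Three rows copied from the example output; the remaining rows are omitted.
res <- data.frame(
  filename         = c(".gitignore", "DESCRIPTION", "LICENSE.md"),
  total_copies     = c(6L, 2L, 4L),
  identical_copies = c(2L, 2L, 2L),
  versioned_copies = c(4L, 0L, 2L),
  n_versions       = c(3L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Keep filenames with more than one observed signature,
# ordered by the number of versioned copies (largest first).
diverging <- res[res$n_versions > 1, ]
diverging <- diverging[order(-diverging$versioned_copies), ]
diverging$filename
```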