
Checks a data frame for duplicate records based on latitude, longitude, sub-catchment ID, species name, and year. The year must be supplied as a single column.

Usage

check_duplicates(
  data,
  lat_col,
  lon_col,
  subcatchment_col,
  species_col,
  year_col,
  delete_duplicates = TRUE,
  verbose = FALSE
)

Arguments

data

A data frame containing columns for latitude, longitude, sub-catchment ID, species name, and year.

lat_col

The name of the column representing latitude.

lon_col

The name of the column representing longitude.

subcatchment_col

The name of the column representing sub-catchment ID.

species_col

The name of the column representing species name.

year_col

The name of the column representing the year (mandatory).

delete_duplicates

Logical. If TRUE (default), only one record is kept for each group of duplicates. If FALSE, a new column is added to flag duplicates.

verbose

Logical. If TRUE, details about duplicate records will be printed. Default is FALSE.

Value

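Returns a data frame with duplicates either removed or flagged in a duplicate_flag column. The number of rows with changes is also reported; in the example in the Examples section it is printed to the console rather than stored in the returned data frame.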

Details

Duplicates are defined as multiple entries for the same species recorded at the same location (same coordinates or sub-catchment ID) in the same year.
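
The definition above can be illustrated with a minimal base R sketch. This is not the function's actual implementation; it assumes that a duplicate means a match on all five key columns at once, and the data and column names are hypothetical.

```r
# Hypothetical example data; column names are illustrative only.
records <- data.frame(
  latitude = c(34.5, 34.5, 35.1),
  longitude = c(-118.1, -118.1, -118.5),
  subcatchment_id = c(101, 101, 102),
  species = c("Species A", "Species A", "Species B"),
  year = c(2021, 2021, 2021)
)

# Assumption: a record is a duplicate when all of these columns match.
key_cols <- c("latitude", "longitude", "subcatchment_id", "species", "year")
keys <- records[, key_cols]

# duplicated() marks only the later copies of a repeated row, so scan in
# both directions to flag every member of a duplicate group.
records$duplicate_flag <- duplicated(keys) | duplicated(keys, fromLast = TRUE)

print(records)
```

Here the first two rows share every key value and are both flagged, while the third row is unique and is not.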

Examples

data <- data.frame(
  latitude = c(34.5, 34.5, 35.1, 35.1),
  longitude = c(-118.1, -118.1, -118.5, -118.5),
  subcatchment_id = c(101, 101, 102, 102),
  year = c(2021, 2021, 2021, 2021),
  species = c("Species A", "Species A", "Species B", "Species B")
)
result <- check_duplicates(data,
                           "latitude",
                           "longitude",
                           "subcatchment_id",
                           "species",
                           year_col = "year",
                           delete_duplicates = FALSE,
                           verbose = TRUE)
#> Number of duplicate records: 4
#> Duplicate records:
#>   latitude longitude subcatchment_id year   species full_date duplicate_flag
#> 1     34.5    -118.1             101 2021 Species A      2021           TRUE
#> 2     34.5    -118.1             101 2021 Species A      2021           TRUE
#> 3     35.1    -118.5             102 2021 Species B      2021           TRUE
#> 4     35.1    -118.5             102 2021 Species B      2021           TRUE
#> Duplicates flagged in the 'duplicate_flag' column.
#> Number of rows with changes: 4
print(result)
#>   latitude longitude subcatchment_id year   species duplicate_flag
#> 1     34.5    -118.1             101 2021 Species A           TRUE
#> 2     34.5    -118.1             101 2021 Species A           TRUE
#> 3     35.1    -118.5             102 2021 Species B           TRUE
#> 4     35.1    -118.5             102 2021 Species B           TRUE
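
For comparison, removing duplicates instead of flagging them can be sketched in base R on the same example data. This is only an approximation of what delete_duplicates = TRUE describes, under the assumption that duplicates match on all five columns and that the first record of each group is the one kept; check_duplicates() itself may choose differently.

```r
# Same values as the example above, under a different (hypothetical) name.
example_data <- data.frame(
  latitude = c(34.5, 34.5, 35.1, 35.1),
  longitude = c(-118.1, -118.1, -118.5, -118.5),
  subcatchment_id = c(101, 101, 102, 102),
  year = c(2021, 2021, 2021, 2021),
  species = c("Species A", "Species A", "Species B", "Species B")
)

# Assumption: all five columns together identify a duplicate group.
key_cols <- c("latitude", "longitude", "subcatchment_id", "species", "year")

# Keep the first record of each group and drop the later copies.
deduplicated <- example_data[!duplicated(example_data[, key_cols]), ]

print(deduplicated)
```

Under these assumptions one record remains for each of the two duplicate groups.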