
Check for Duplicate Records in a Data Frame
This function checks for duplicate records based on latitude, longitude, sub-catchment ID, and species name. The date must be provided as a single column representing the year.
Usage
check_duplicates(
data,
lat_col,
lon_col,
subcatchment_col,
species_col,
year_col,
delete_duplicates = TRUE,
verbose = FALSE
)
Arguments
- data
A data frame containing columns for latitude, longitude, sub-catchment ID, and species name. Additionally, it must contain a column for the year.
- lat_col
The name of the column representing latitude.
- lon_col
The name of the column representing longitude.
- subcatchment_col
The name of the column representing sub-catchment ID.
- species_col
The name of the column representing species name.
- year_col
The name of the column representing the year (mandatory).
- delete_duplicates
Logical. If TRUE (default), only one record is kept for each group of duplicates. If FALSE, a new column is added to flag duplicates.
- verbose
Logical. If TRUE, details about duplicate records will be printed. Default is FALSE.
Value
Returns a data frame with duplicates either removed or flagged, along with a summary of the number of rows with changes.
Details
Duplicates are defined as multiple entries for the same species recorded at the same location (same coordinates or sub-catchment ID) in the same year.
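The definition above can be approximated in base R. The following is a minimal sketch, not the function's actual implementation: the helper `flag_repeats()` and the data frame `records` are hypothetical names introduced for illustration, and the sketch assumes that "same coordinates or sub-catchment ID" means a record counts as a duplicate if either key combination repeats.

```r
# Illustrative data (hypothetical): rows 1 and 2 repeat, rows 3 and 4 do not.
records <- data.frame(
  latitude = c(34.5, 34.5, 35.1, 36.0),
  longitude = c(-118.1, -118.1, -118.5, -119.0),
  subcatchment_id = c(101, 101, 102, 103),
  year = c(2021, 2021, 2021, 2021),
  species = c("Species A", "Species A", "Species B", "Species B")
)

# Hypothetical helper: TRUE for every row whose combination of `cols`
# occurs more than once (first and later occurrences alike).
flag_repeats <- function(df, cols) {
  keys <- df[, cols, drop = FALSE]
  duplicated(keys) | duplicated(keys, fromLast = TRUE)
}

# Assumption: a duplicate shares species and year, plus either the
# coordinates or the sub-catchment ID.
same_coords <- flag_repeats(records, c("species", "year", "latitude", "longitude"))
same_subcatchment <- flag_repeats(records, c("species", "year", "subcatchment_id"))

records$duplicate_flag <- same_coords | same_subcatchment
```

Combining `duplicated()` with its `fromLast = TRUE` counterpart marks every member of a duplicate group, which matches the example output where all repeated rows carry the flag rather than only the later ones.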
Examples
data <- data.frame(
  latitude = c(34.5, 34.5, 35.1, 35.1),
  longitude = c(-118.1, -118.1, -118.5, -118.5),
  subcatchment_id = c(101, 101, 102, 102),
  year = c(2021, 2021, 2021, 2021),
  species = c("Species A", "Species A", "Species B", "Species B")
)
result <- check_duplicates(data,
                           "latitude",
                           "longitude",
                           "subcatchment_id",
                           "species",
                           year_col = "year",
                           delete_duplicates = FALSE,
                           verbose = TRUE)
#> Number of duplicate records: 4
#> Duplicate records:
#>   latitude longitude subcatchment_id year   species full_date duplicate_flag
#> 1     34.5    -118.1             101 2021 Species A      2021           TRUE
#> 2     34.5    -118.1             101 2021 Species A      2021           TRUE
#> 3     35.1    -118.5             102 2021 Species B      2021           TRUE
#> 4     35.1    -118.5             102 2021 Species B      2021           TRUE
#> Duplicates flagged in the 'duplicate_flag' column.
#> Number of rows with changes: 4
print(result)
#>   latitude longitude subcatchment_id year   species duplicate_flag
#> 1     34.5    -118.1             101 2021 Species A           TRUE
#> 2     34.5    -118.1             101 2021 Species A           TRUE
#> 3     35.1    -118.5             102 2021 Species B           TRUE
#> 4     35.1    -118.5             102 2021 Species B           TRUE