I had a table something like this:
patient_id <- c(rep(100,5), rep(101,6))
drug <- c(rep("A",3), rep("B",2), rep("A",3), rep("B",3))
start_date = c("2018-01-08", "2018-01-09", "2018-01-18", "2018-01-09", "2018-01-24", "2020-04-01", "2020-04-07", "2020-04-20", "2020-05-05", "2020-05-09", "2020-05-12")
end_date = c("2018-01-08", "2018-01-15", "2018-01-24", "2018-01-31", "2018-01-31", "2020-04-07", "2020-04-23", "2020-04-30", "2020-05-08", "2020-05-12", "2020-05-15")
df <- data.frame(patient_id = patient_id,
drug = drug,
start_date = start_date,
end_date = end_date)
df$start_date <- as.Date(df$start_date)
df$end_date <- as.Date(df$end_date)
df
patient_id drug start_date end_date
1 100 A 2018-01-08 2018-01-08
2 100 A 2018-01-09 2018-01-15
3 100 A 2018-01-18 2018-01-24
4 100 B 2018-01-09 2018-01-31
5 100 B 2018-01-24 2018-01-31
6 101 A 2020-04-01 2020-04-07
7 101 A 2020-04-07 2020-04-23
8 101 A 2020-04-20 2020-04-30
9 101 B 2020-05-05 2020-05-08
10 101 B 2020-05-09 2020-05-12
11 101 B 2020-05-12 2020-05-15
Here I want to merge overlapping or consecutive observations into a single observation like this:
patient_id drug start_date end_date
1 100 A 2018-01-08 2018-01-15
2 100 A 2018-01-18 2018-01-24
3 100 B 2018-01-09 2018-01-31
4 101 A 2020-04-01 2020-04-30
5 101 B 2020-05-05 2020-05-15
I figured out how to do that by reading this post on Stack Overflow.
library(dplyr)
df %>%
arrange(patient_id, drug, start_date, end_date) %>%
group_by(patient_id, drug) %>%
mutate(idx = c(0, cumsum(as.numeric(lead(start_date)) > cummax(as.numeric(end_date + 1)))[-n()])) %>%
group_by(patient_id, drug, idx) %>%
summarise(start_date = min(start_date), end_date = max(end_date)) %>%
select(!idx)
Because I couldn’t understand how the solution works, I explain how it works in this post.
The idea is first to group observations by patient_id
and drug
, compare the start_date
of the next observation (lead(start_date)
) with the current end_date
until that moment, and create an index that will separate the observations into groups. Note that here I allowed consecutive time periods by end_date + 1
. If you want to allow up to three-day gap between time periods, you can do that by adding three end_date + 1
.
When R looks at date
as an integer, its origin is January 1, 1970.
as.numeric(as.Date("1970-01-01"))
[1] 0
as.numeric(as.Date("1970-01-02"))
[1] 1
as.numeric(as.Date("1971-01-01"))
[1] 365
cumsum
returns the cumulative sum of the observations, and cummax
returns the cumulative maximum values.
df %>%
arrange(patient_id, drug, start_date, end_date) %>%
group_by(patient_id, drug) %>%
mutate(cumsum = cumsum(as.numeric(start_date)),
cummax = cummax(as.numeric(end_date)))
# A tibble: 11 x 6
# Groups: patient_id, drug [4]
patient_id drug start_date end_date cumsum cummax
<dbl> <chr> <date> <date> <dbl> <dbl>
1 100 A 2018-01-08 2018-01-08 17539 17539
2 100 A 2018-01-09 2018-01-15 35079 17546
3 100 A 2018-01-18 2018-01-24 52628 17555
4 100 B 2018-01-09 2018-01-31 17540 17562
5 100 B 2018-01-24 2018-01-31 35095 17562
6 101 A 2020-04-01 2020-04-07 18353 18359
7 101 A 2020-04-07 2020-04-23 36712 18375
8 101 A 2020-04-20 2020-04-30 55084 18382
9 101 B 2020-05-05 2020-05-08 18387 18390
10 101 B 2020-05-09 2020-05-12 36778 18394
11 101 B 2020-05-12 2020-05-15 55172 18397
In each group, compare observations to check if these observations belong to the same subgroup. If they belong to the same subgroup, give them the same index number in the idx
column.
df %>%
arrange(patient_id, drug, start_date, end_date) %>%
group_by(patient_id, drug) %>%
mutate(idx = cumsum(as.numeric(lead(start_date)) > cummax(as.numeric(end_date))))
# A tibble: 11 x 5
# Groups: patient_id, drug [4]
patient_id drug start_date end_date idx
<dbl> <chr> <date> <date> <int>
1 100 A 2018-01-08 2018-01-08 1
2 100 A 2018-01-09 2018-01-15 2
3 100 A 2018-01-18 2018-01-24 NA
4 100 B 2018-01-09 2018-01-31 0
5 100 B 2018-01-24 2018-01-31 NA
6 101 A 2020-04-01 2020-04-07 0
7 101 A 2020-04-07 2020-04-23 0
8 101 A 2020-04-20 2020-04-30 NA
9 101 B 2020-05-05 2020-05-08 1
10 101 B 2020-05-09 2020-05-12 1
11 101 B 2020-05-12 2020-05-15 NA
For each group, add zero at the beginning and remove the last item by [-n()]
(since the last observation does not have the next observation). n()
counts the number of observations in each group.
Here, you want to compare the start_date
of the next observation and end_date
of the current observation, and then return TRUE
if the next start_date
is later than the current end_date
, otherwise return FALSE
. And then, create the index number for each observation for classifying them into subgroups.
cumsum(c(0, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE))
[1] 0 0 0 1 2 2 3
df %>%
arrange(patient_id, drug, start_date, end_date) %>%
group_by(patient_id, drug) %>%
mutate(idx = c(0, cumsum(as.numeric(lead(start_date)) > cummax(as.numeric(end_date)))[-n()]))
# A tibble: 11 x 5
# Groups: patient_id, drug [4]
patient_id drug start_date end_date idx
<dbl> <chr> <date> <date> <dbl>
1 100 A 2018-01-08 2018-01-08 0
2 100 A 2018-01-09 2018-01-15 1
3 100 A 2018-01-18 2018-01-24 2
4 100 B 2018-01-09 2018-01-31 0
5 100 B 2018-01-24 2018-01-31 0
6 101 A 2020-04-01 2020-04-07 0
7 101 A 2020-04-07 2020-04-23 0
8 101 A 2020-04-20 2020-04-30 0
9 101 B 2020-05-05 2020-05-08 0
10 101 B 2020-05-09 2020-05-12 1
11 101 B 2020-05-12 2020-05-15 1
If you just want to merge overlapping time periods:
df %>%
arrange(patient_id, drug, start_date, end_date) %>%
group_by(patient_id, drug) %>%
mutate(idx = c(0, cumsum(as.numeric(lead(start_date)) > cummax(as.numeric(end_date)))[-n()])) %>%
group_by(patient_id, drug, idx) %>%
summarise(start_date = min(start_date), end_date = max(end_date)) %>%
select(!idx)
# A tibble: 7 x 4
# Groups: patient_id, drug [4]
patient_id drug start_date end_date
<dbl> <chr> <date> <date>
1 100 A 2018-01-08 2018-01-08
2 100 A 2018-01-09 2018-01-15
3 100 A 2018-01-18 2018-01-24
4 100 B 2018-01-09 2018-01-31
5 101 A 2020-04-01 2020-04-30
6 101 B 2020-05-05 2020-05-08
7 101 B 2020-05-09 2020-05-15
Comments