Merging consecutive or overlapping time periods in R

I had a table something like this:

patient_id <- c(rep(100,5), rep(101,6))
drug <- c(rep("A",3), rep("B",2), rep("A",3), rep("B",3))
start_date = c("2018-01-08", "2018-01-09", "2018-01-18", "2018-01-09", "2018-01-24", "2020-04-01", "2020-04-07", "2020-04-20", "2020-05-05", "2020-05-09", "2020-05-12")
end_date = c("2018-01-08", "2018-01-15", "2018-01-24", "2018-01-31", "2018-01-31", "2020-04-07", "2020-04-23", "2020-04-30", "2020-05-08", "2020-05-12", "2020-05-15")

df <- data.frame(patient_id = patient_id, 
                 drug = drug, 
                 start_date = start_date, 
                 end_date = end_date)

df$start_date <- as.Date(df$start_date)
df$end_date <- as.Date(df$end_date)

df

   patient_id drug start_date   end_date
1         100    A 2018-01-08 2018-01-08
2         100    A 2018-01-09 2018-01-15
3         100    A 2018-01-18 2018-01-24
4         100    B 2018-01-09 2018-01-31
5         100    B 2018-01-24 2018-01-31
6         101    A 2020-04-01 2020-04-07
7         101    A 2020-04-07 2020-04-23
8         101    A 2020-04-20 2020-04-30
9         101    B 2020-05-05 2020-05-08
10        101    B 2020-05-09 2020-05-12
11        101    B 2020-05-12 2020-05-15

Here I want to merge overlapping or consecutive observations into a single observation like this:

  patient_id drug  start_date end_date   
1        100 A     2018-01-08 2018-01-15
2        100 A     2018-01-18 2018-01-24
3        100 B     2018-01-09 2018-01-31
4        101 A     2020-04-01 2020-04-30
5        101 B     2020-05-05 2020-05-15

I figured out how to do that by reading this post on Stack Overflow.

library(dplyr)

df %>% 
  arrange(patient_id, drug, start_date, end_date) %>% 
  group_by(patient_id, drug) %>% 
  mutate(idx = c(0, cumsum(as.numeric(lead(start_date)) > cummax(as.numeric(end_date + 1)))[-n()])) %>% 
  group_by(patient_id, drug, idx) %>% 
  summarise(start_date = min(start_date), end_date = max(end_date)) %>% 
  select(!idx)

Because I couldn’t understand how the solution works, I explain how it works in this post.

The idea is first to group observations by patient_id and drug, compare the start_date of the next observation (lead(start_date)) with the current end_date until that moment, and create an index that will separate the observations into groups. Note that here I allowed consecutive time periods by end_date + 1. If you want to allow up to three-day gap between time periods, you can do that by adding three end_date + 1.

When R looks at date as an integer, its origin is January 1, 1970.

as.numeric(as.Date("1970-01-01"))
[1] 0

as.numeric(as.Date("1970-01-02"))
[1] 1

as.numeric(as.Date("1971-01-01"))
[1] 365

cumsum returns the cumulative sum of the observations, and cummax returns the cumulative maximum values.

df %>% 
  arrange(patient_id, drug, start_date, end_date) %>% 
  group_by(patient_id, drug) %>% 
  mutate(cumsum = cumsum(as.numeric(start_date)), 
         cummax = cummax(as.numeric(end_date)))

# A tibble: 11 x 6
# Groups:   patient_id, drug [4]
   patient_id drug  start_date end_date   cumsum cummax
        <dbl> <chr> <date>     <date>      <dbl>  <dbl>
 1        100 A     2018-01-08 2018-01-08  17539  17539
 2        100 A     2018-01-09 2018-01-15  35079  17546
 3        100 A     2018-01-18 2018-01-24  52628  17555
 4        100 B     2018-01-09 2018-01-31  17540  17562
 5        100 B     2018-01-24 2018-01-31  35095  17562
 6        101 A     2020-04-01 2020-04-07  18353  18359
 7        101 A     2020-04-07 2020-04-23  36712  18375
 8        101 A     2020-04-20 2020-04-30  55084  18382
 9        101 B     2020-05-05 2020-05-08  18387  18390
10        101 B     2020-05-09 2020-05-12  36778  18394
11        101 B     2020-05-12 2020-05-15  55172  18397

In each group, compare observations to check if these observations belong to the same subgroup. If they belong to the same subgroup, give them the same index number in the idx column.

df %>% 
  arrange(patient_id, drug, start_date, end_date) %>% 
  group_by(patient_id, drug) %>% 
  mutate(idx = cumsum(as.numeric(lead(start_date)) > cummax(as.numeric(end_date))))

# A tibble: 11 x 5
# Groups:   patient_id, drug [4]
   patient_id drug  start_date end_date     idx
        <dbl> <chr> <date>     <date>     <int>
 1        100 A     2018-01-08 2018-01-08     1
 2        100 A     2018-01-09 2018-01-15     2
 3        100 A     2018-01-18 2018-01-24    NA
 4        100 B     2018-01-09 2018-01-31     0
 5        100 B     2018-01-24 2018-01-31    NA
 6        101 A     2020-04-01 2020-04-07     0
 7        101 A     2020-04-07 2020-04-23     0
 8        101 A     2020-04-20 2020-04-30    NA
 9        101 B     2020-05-05 2020-05-08     1
10        101 B     2020-05-09 2020-05-12     1
11        101 B     2020-05-12 2020-05-15    NA

For each group, add zero at the beginning and remove the last item by [-n()] (since the last observation does not have the next observation). n() counts the number of observations in each group.

Here, you want to compare the start_date of the next observation and end_date of the current observation, and then return TRUE if the next start_date is later than the current end_date, otherwise return FALSE. And then, create the index number for each observation for classifying them into subgroups.

cumsum(c(0, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE))
[1] 0 0 0 1 2 2 3

df %>% 
  arrange(patient_id, drug, start_date, end_date) %>% 
  group_by(patient_id, drug) %>% 
  mutate(idx = c(0, cumsum(as.numeric(lead(start_date)) > cummax(as.numeric(end_date)))[-n()]))

# A tibble: 11 x 5
# Groups:   patient_id, drug [4]
   patient_id drug  start_date end_date     idx
        <dbl> <chr> <date>     <date>     <dbl>
 1        100 A     2018-01-08 2018-01-08     0
 2        100 A     2018-01-09 2018-01-15     1
 3        100 A     2018-01-18 2018-01-24     2
 4        100 B     2018-01-09 2018-01-31     0
 5        100 B     2018-01-24 2018-01-31     0
 6        101 A     2020-04-01 2020-04-07     0
 7        101 A     2020-04-07 2020-04-23     0
 8        101 A     2020-04-20 2020-04-30     0
 9        101 B     2020-05-05 2020-05-08     0
10        101 B     2020-05-09 2020-05-12     1
11        101 B     2020-05-12 2020-05-15     1

If you just want to merge overlapping time periods:

df %>% 
  arrange(patient_id, drug, start_date, end_date) %>% 
  group_by(patient_id, drug) %>% 
  mutate(idx = c(0, cumsum(as.numeric(lead(start_date)) > cummax(as.numeric(end_date)))[-n()])) %>% 
  group_by(patient_id, drug, idx) %>% 
  summarise(start_date = min(start_date), end_date = max(end_date)) %>% 
  select(!idx)

# A tibble: 7 x 4
# Groups:   patient_id, drug [4]
  patient_id drug  start_date end_date  
       <dbl> <chr> <date>     <date>    
1        100 A     2018-01-08 2018-01-08
2        100 A     2018-01-09 2018-01-15
3        100 A     2018-01-18 2018-01-24
4        100 B     2018-01-09 2018-01-31
5        101 A     2020-04-01 2020-04-30
6        101 B     2020-05-05 2020-05-08
7        101 B     2020-05-09 2020-05-15

Comments

Copied title and URL