x trials lagged start streak_id 1 0 NA TRUE 1 2 1 0 TRUE 2 3 1 1 FALSE 2 4 1 1 FALSE 2 5 0 1 TRUE 3 6 0 0 FALSE 3 7 1 0 TRUE 4 8 0 1 TRUE 5 9 1 0 TRUE 6 10 1 1 FALSE 6 11 1 1 FALSE 6 12 0 1 TRUE 7 13 0 0 FALSE 7 14 0 0 FALSE 7 15 0 0 FALSE 7 16 0 0 FALSE 7 17 1 0 TRUE 8 From there, we just group by streak_id, get the row number for each row in each group, and then ungroup to get our final result. One convenient thing in this case is that R is one-indexed, so we don't have to add 1 to the streak counter like in Python. > x % group_by(streak_id) %>% mutate(streak = row_number()) %>% ungroup() > x # A tibble: 17 x 5 trials lagged start streak_id streak 1 0 NA TRUE 1 1 2 1 0 TRUE 2 1 3 1 1 FALSE 2 2 4 1 1 FALSE 2 3 5 0 1 TRUE 3 1 6 0 0 FALSE 3 2 7 1 0 TRUE 4 1 8 0 1 TRUE 5 1 9 1 0 TRUE 6 1 10 1 1 FALSE 6 2 11 1 1 FALSE 6 3 12 0 1 TRUE 7 1 13 0 0 FALSE 7 2 14 0 0 FALSE 7 3 15 0 0 FALSE 7 4 16 0 0 FALSE 7 5 17 1 0 TRUE 8 1 Bringing this all together into one function: get_streaks"> Detecting Streaks in R | R-bloggers - 188金宝搏app

Detecting Streaks in R

June 5, 2020
By

[This article was first published onR on Data & The World, and kindly contributed to188bet app]. (You can report issue about the content on this pagehere)
Want to share your content on R-bloggers?188bet app if you have a blog, orhereif you don't.

Inspired bythis post, which tries to calculate streaks in Python’spandaslibrary, I thought I’d give it a try in R, since it’s all just dataframe operations in the Python post. I won’t repeat his analysis, but I will replicate the streak determination and some of the plots. The data he uses ishere.

Determining Streaks

As outlined in the above post, we first need a little dummy data to play with. For reproducability’s sake, I’m just using a fixed vector.

>library(tidyverse)>x<-data.frame(trials=c(0,1,1,1,0,0,1,0,1,1,1,0,0,0,0,0,1))

The start of a streak is indicated when the two consecutive values are different. We have to handle this a little differently than in Python, though. Thelag()function fromdplyrgenerates an NA as the first value in the lagged vector, and comparisons involving NA will return NA:

>x<-x%>%mutate(lagged=lag(trials))%>%#note: that's dplyr::lag, not stats::lagmutate(start=(trials!=lagged))>x试验滞后开始10NANA210TRUE311FALSE411FALSE501TRUE600FALSE710TRUE801TRUE910TRUE1011FALSE1111FALSE1201TRUE1300FALSE1400FALSE1500FALSE1600FALSE1710TRUE

因为我们知道,总是会第一个条目the start of a streak, we can fix this by just setting the first element toTRUE:

>x [1,"start"]<-TRUE

From there, we can get a little clever. Like in the Python post, R will happily convert booleans to numerics if prompted, so we can come up with an identification of when a streak starts by taking a cumulative sum of thestartcolumn:

>x<-x%>%mutate(streak_id=cumsum(start))>x试验滞后开始streak_id10NATRUE1210TRUE2311FALSE2411FALSE2501TRUE3600FALSE3710TRUE4801TRUE5910TRUE61011FALSE61111FALSE61201TRUE71300FALSE71400FALSE71500FALSE71600FALSE71710TRUE8

From there, we just group bystreak_id, get the row number for each row in each group, and then ungroup to get our final result. One convenient thing in this case is that R is one-indexed, so we don’t have to add 1 to the streak counter like in Python.

>x<-x%>%group_by(streak_id)%>%mutate(streak=row_number())%>%ungroup()>x# A tibble: 17 x 5trials lagged start streak_id streak<dbl><dbl><lgl><int><int>10NATRUE11210TRUE21311FALSE22411FALSE23501TRUE31600FALSE32710TRUE41801TRUE51910TRUE611011FALSE621111FALSE631201TRUE711300FALSE721400FALSE731500FALSE741600FALSE751710TRUE81

Bringing this all together into one function:

get_streaks<-function(vec){ x<-data.frame(trials=vec) x<-x%>%mutate(lagged=lag(trials))%>%#note: that's dplyr::lag, not stats::lagmutate(start=(trials!=lagged)) x[1,"start"]<-TRUEx<-x%>%mutate(streak_id=cumsum(start)) x<-x%>%group_by(streak_id)%>%mutate(streak=row_number())%>%ungroup()return(x) }

Plotting Streaks

Replicating the initial plot is pretty quick:

>shots<-read_csv("playoff_shots.csv")>durant_ft<-shots%>%filter(player_name=="Kevin Durant"&shot_type=="FT")>durant_ft<-get_streaks(durant_ft$result)>ggplot(durant_ft,aes(x=1:nrow(durant_ft), y=streak))+geom_bar(stat="identity")

First plot

Recreating one of the later ones requires additional work, butggplot2has all of the necessary functionality on its own, so we don’t need to bring in anything to extend it, it’s just lengthy. We also make the slight tweak to thestreakvariable so that the miss streaks go down under the x=0 axis.

>durant_ft2<-durant_ft%>%mutate(streak=streak*ifelse(trials=="make",1,-1))>caption<-paste(c("Kevin Durant","FT"), collapse="\n")>ggplot(durant_ft2,aes(x=1:nrow(durant_ft2), y=streak))+>geom_bar(aes(fill=trials), stat="identity")+>theme_void()+>geom_hline(yintercept=0)+>geom_vline(xintercept=0)+>scale_fill_manual(values=c("make"="darkgreen","miss"="red"), guide=FALSE)+>annotate(geom="text", label=caption, x=nrow(durant_ft2), y=max(durant_ft2$streak),>hjust="right", vjust="top")

Second plot

Full code for this post is availablehere.

Toleave a commentfor the author, please follow the link and comment on their blog:R on Data & The World.

R-bloggers.comoffersdaily e-mail updates金宝搏网址 Rnews and tutorials about金宝搏网址 and many other topics.Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers?188bet app if you have a blog, orhereif you don't.



If you got this far, why notsubscribe for updatesfrom the site? Choose your flavor:e-mail,twitter,1188bet app, orfacebook...

Comments are closed.

Search R-bloggers

Sponsors

从来没有错过一个更新!
Subscribe to R-bloggersto receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)