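A while ago I was injured and stuck in bed. It was a rare stretch of free time, and with the excuse that lying down is no position for studying, I threw myself into entertainment. I got hooked on Bilibili (B 站) almost beyond recovery; looking back, I seem to have spent a great deal of time on the site, so I decided to take stock.
Since I want to summarize and analyze my activity on Bilibili, getting the data is the most important step. Bilibili does not seem to offer a public API for such queries, but fortunately a helpful community member has shared one: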
SocialSisterYi/bilibili-API-collect
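SocialSisterYi/bilibili-API-collect (hereafter "the project") collects and organizes Bilibili APIs from many sources, including the Web, mobile and TV clients, into a fairly comprehensive unofficial API reference.
Building on the project, this post uses R to analyze and summarize my own Bilibili watch history.
1 Setting up login credentials
Accessing the watch history obviously requires login credentials. Following the project's notes on API authentication and on basic login information, the first step is to set the cookies. I assumed a simple httr::GET plus httr::set_cookies would do the job in no time, but setting the cookies alone ended up taking a long while.
According to the API authentication notes, the cookies needed to access Bilibili are DedeUserID, DedeUserID__ckMd5, SESSDATA and bili_jct.
These are easy to find: open Chrome, press F12 for the developer tools, and read them off the Application tab.
However, the SESSDATA and bili_jct values shown there are already percent-encoded, and httr::set_cookies escapes its inputs again by default when it builds the cookie string, so the request fails... I have submitted a PR to httr that tries to address this; whether or when it will be merged, I cannot say.
Since the escaping is forced, the workaround is to hand httr::set_cookies values that have already been unescaped, using curl::curl_unescape. Under the hood, httr::set_cookies performs its conversion with a vectorized call to curl::curl_escape. The code is as follows: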
library("httr")
cookies <-
  httr::set_cookies(
    DedeUserID = rstudioapi::askForPassword("DedeUserID"),
    DedeUserID__ckMd5 = rstudioapi::askForPassword("DedeUserID__ckMd5"),
    SESSDATA = curl::curl_unescape(rstudioapi::askForPassword("SESSDATA")),
    bili_jct = curl::curl_unescape(rstudioapi::askForPassword("bili_jct"))
  )
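To make the double-escaping problem concrete, here is a minimal sketch that assumes only the curl package. The value in `raw` is a made-up string that merely looks like a percent-encoded cookie; it is not a real SESSDATA.

```r
library(curl)

# Hypothetical cookie value as copied from the browser: already percent-encoded.
raw <- "abc%2C123%2Adef"

# Escaping it a second time also encodes the "%" signs, which corrupts the value.
escaped_twice <- curl_escape(raw)

# Unescaping first restores the literal characters ...
decoded <- curl_unescape(raw)

# ... so that a later escape reproduces the value the browser showed.
round_trip <- curl_escape(decoded)

print(escaped_twice)
print(decoded)
print(identical(round_trip, raw))
```

This is why the unescaped values are passed to httr::set_cookies above: the escaping it applies then only restores the original encoding.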
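In the steps that follow, it is enough to attach cookies to each request.
2 Fetching the watch history
The first task is to query the watch history. The project's history section documents two APIs, a new one and an old one.
The new API can return several kinds of history, including videos, live streams and articles, but I only watch videos on Bilibili, so the old API is sufficient. The old API can also return more history records, which makes it a good fit for this case.
To fetch as many records as possible, the pn parameter is also used to control the offset into the history: each increment of pn moves the request 300 records further into the past. By trial, I found that in this case requests succeed up to pn=4 at most, so we issue 4 requests and merge the results. The records themselves sit in the $data element of the returned object.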
library("jsonlite")
library("pillar")
library("purrr")
library("dplyr")
library("tibble")
pn_ls <-
  c(1:4)
history_resp_ls <-
  map(pn_ls,
      function(pn){
        # Use HTTPS so the login cookies are not sent in plain text
        history_resp <-
          httr::GET(url =
                      "https://api.bilibili.com/x/v2/history",
                    config = cookies,
                    query = list(pn = pn))
        history_content <-
          httr::content(history_resp, type = "text")
        # The response of GET is a json
        history_from_json <-
          jsonlite::fromJSON(history_content)
        # The history records are in `data`
        history_from_json$data
      }
  )
history_tb <-
  reduce(history_resp_ls, bind_rows) %>%
  as_tibble()
# glimpse(history_tb)
head(history_tb)
## # A tibble: 6 × 37
## aid videos tid tname copyright pic title pubdate ctime desc state
## <int> <int> <int> <chr> <int> <chr> <chr> <int> <int> <chr> <int>
## 1 632936267 1 212 美食侦… 1 http… 还是… 1.63e9 1.63e9 "-" 0
## 2 721202142 1 228 人文历… 1 http… 中国… 1.63e9 1.63e9 "移… 0
## 3 378655043 1 176 汽车生… 1 http… 什么… 1.63e9 1.63e9 "诈… 0
## 4 676243719 1 76 美食制… 1 http… 这是… 1.63e9 1.63e9 "日… 0
## 5 758813440 1 138 搞笑 1 http… 太顶… 1.62e9 1.62e9 "吴… 0
## 6 847480646 1 28 原创音… 1 http… “梗… 1.63e9 1.63e9 "引… 0
## # … with 26 more variables: duration <int>, rights <df[,12]>, owner <df[,3]>,
## # stat <df[,11]>, dynamic <chr>, cid <int>, dimension <df[,3]>,
## # short_link_v2 <chr>, up_from_v2 <int>, favorite <lgl>, type <int>,
## # sub_type <int>, device <int>, page <df[,8]>, count <int>, progress <int>,
## # view_at <int>, kid <int>, business <chr>, redirect_link <chr>, bvid <chr>,
## # mission_id <int>, season_id <int>, redirect_url <chr>, bangumi <df[,7]>,
## # cheese <df[,5]>
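The request code above reads `$data` directly. As a hedged sketch, one could check the response before using it. This assumes the JSON envelope carries a numeric `code` field where 0 means success and a `message` field, as the community documentation describes. The helper name `extract_history` and both JSON strings are invented for illustration; nothing here contacts the real API.

```r
library(jsonlite)

# Hypothetical helper: parse a response body and return its `data` element,
# stopping with an informative error when `code` is missing or non-zero.
extract_history <- function(json_text) {
  parsed <- fromJSON(json_text)
  if (is.null(parsed$code) || parsed$code != 0) {
    stop("API returned code ", parsed$code, ": ", parsed$message)
  }
  parsed$data
}

# Made-up response bodies standing in for httr::content(resp, type = "text").
ok_json  <- '{"code":0,"message":"0","ttl":1,"data":[{"aid":1,"progress":-1},{"aid":2,"progress":30}]}'
bad_json <- '{"code":-101,"message":"not logged in","ttl":1}'

records <- extract_history(ok_json)
failed  <- tryCatch(extract_history(bad_json),
                    error = function(e) conditionMessage(e))

print(records)
print(failed)
```

A check like this turns a cookie problem into an explicit error rather than a silently empty result.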
summary(history_tb)
## aid videos tid tname
## Min. : 2599625 Min. : 1.0 Min. : 17.0 Length:1200
## 1st Qu.:336101652 1st Qu.: 1.0 1st Qu.: 31.0 Class :character
## Median :587790294 Median : 1.0 Median :138.0 Mode :character
## Mean :560113126 Mean : 1.2 Mean :124.2
## 3rd Qu.:763318868 3rd Qu.: 1.0 3rd Qu.:212.0
## Max. :976237705 Max. :49.0 Max. :239.0
##
## copyright pic title pubdate
## Min. :1.000 Length:1200 Length:1200 Min. :1.437e+09
## 1st Qu.:1.000 Class :character Class :character 1st Qu.:1.614e+09
## Median :1.000 Mode :character Mode :character Median :1.631e+09
## Mean :1.142 Mean :1.619e+09
## 3rd Qu.:1.000 3rd Qu.:1.632e+09
## Max. :2.000 Max. :1.635e+09
##
## ctime desc state duration
## Min. :1.497e+09 Length:1200 Min. :-100.000 Min. : 9.0
## 1st Qu.:1.614e+09 Class :character 1st Qu.: 0.000 1st Qu.: 120.0
## Median :1.631e+09 Mode :character Median : 0.000 Median : 364.0
## Mean :1.620e+09 Mean : -0.755 Mean : 737.6
## 3rd Qu.:1.632e+09 3rd Qu.: 0.000 3rd Qu.: 713.0
## Max. :1.635e+09 Max. : 0.000 Max. :132074.0
##
## rights.bp rights.elec rights.download rights.movie rights.pay rights.hd5 rights.no_reprint rights.autoplay rights.ugc_pay rights.is_cooperation rights.ugc_pay_preview rights.no_background
## Min. :0 Min. :0 Min. :0 Min. :0 Min. :0.0000 Min. :0.0000 Min. :0.0000000 Min. :0.0000 Min. :0 Min. :0.0000 Min. :0 Min. :0
## 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000000 1st Qu.:1.0000 1st Qu.:0 1st Qu.:0.0000 1st Qu.:0 1st Qu.:0
## Median :0 Median :0 Median :0 Median :0 Median :0.0000 Median :0.0000 Median :1.0000000 Median :1.0000 Median :0 Median :0.0000 Median :0 Median :0
## Mean :0 Mean :0 Mean :0 Mean :0 Mean :0.0025 Mean :0.4375 Mean :0.8316667 Mean :0.9925 Mean :0 Mean :0.0375 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000000 3rd Qu.:1.0000 3rd Qu.:0 3rd Qu.:0.0000 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0 Max. :0 Max. :1.0000 Max. :1.0000 Max. :1.0000000 Max. :1.0000 Max. :0 Max. :1.0000 Max. :0 Max. :0
##
## owner.mid owner.name owner.face
## Min. : 28457 Length:1200 Length:1200
## 1st Qu.: 23947287 Class :character Class :character
## Median : 337521240 Mode :character Mode :character
## Mean : 467453220 NA NA
## 3rd Qu.: 544336675 NA NA
## Max. :2105467274 NA NA
##
## stat.aid stat.view stat.danmaku stat.reply stat.favorite stat.coin stat.share stat.now_rank stat.his_rank stat.like stat.dislike
## Min. : 2599625 Min. : 378 Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. :0 Min. : 0.0000 Min. : 0.0 Min. :0
## 1st Qu.:336101652 1st Qu.: 142812 1st Qu.: 292.00 1st Qu.: 236.00 1st Qu.: 626.2 1st Qu.: 473.8 1st Qu.: 97.75 1st Qu.:0 1st Qu.: 0.0000 1st Qu.: 5843.8 1st Qu.:0
## Median :587790294 Median : 496958 Median : 1145.00 Median : 691.00 Median : 2543.0 Median : 2616.0 Median : 598.00 Median :0 Median : 0.0000 Median : 19623.0 Median :0
## Mean :560113126 Mean : 1237879 Mean : 4531.27 Mean : 1697.03 Mean : 12988.9 Mean : 22294.0 Mean : 4892.27 Mean :0 Mean : 6.7825 Mean : 68178.2 Mean :0
## 3rd Qu.:763318868 3rd Qu.: 1451116 3rd Qu.: 4072.50 3rd Qu.: 1827.50 3rd Qu.: 8584.0 3rd Qu.: 13353.0 3rd Qu.: 2610.75 3rd Qu.:0 3rd Qu.: 0.0000 3rd Qu.: 71615.8 3rd Qu.:0
## Max. :976237705 Max. :26865323 Max. :175789.00 Max. :37253.00 Max. :730496.0 Max. :774376.0 Max. :279676.00 Max. :0 Max. :819.0000 Max. :1276872.0 Max. :0
##
## dynamic cid
## Length:1200 Min. : 4062651
## Class :character 1st Qu.:300695667
## Mode :character Median :401334872
## Mean :348783941
## 3rd Qu.:412164764
## Max. :428187561
##
## dimension.width dimension.height dimension.rotate
## Min. : 318.000 Min. : 240.000 Min. :0.0000000
## 1st Qu.:1280.000 1st Qu.:1080.000 1st Qu.:0.0000000
## Median :1920.000 Median :1080.000 Median :0.0000000
## Mean :1839.597 Mean :1231.498 Mean :0.0016667
## 3rd Qu.:1920.000 3rd Qu.:1080.000 3rd Qu.:0.0000000
## Max. :4096.000 Max. :4320.000 Max. :1.0000000
##
## short_link_v2 up_from_v2 favorite type
## Length:1200 Min. : 1.00 Mode :logical Min. : 3.000
## Class :character 1st Qu.: 8.00 FALSE:1167 1st Qu.: 3.000
## Mode :character Median : 9.00 TRUE :33 Median : 3.000
## Mean :15.96 Mean : 3.012
## 3rd Qu.:20.00 3rd Qu.: 3.000
## Max. :36.00 Max. :10.000
## NA's :1000
## sub_type device
## Min. :0.00000 Min. :1.000
## 1st Qu.:0.00000 1st Qu.:1.000
## Median :0.00000 Median :1.000
## Mean :0.01833 Mean :2.118
## 3rd Qu.:0.00000 3rd Qu.:4.000
## Max. :7.00000 Max. :4.000
##
## page.cid page.page page.from page.part page.duration page.vid page.weblink page.dimension.width dimension.height dimension.rotate
## Min. : 4062651 Min. : 1.000000 Length:1200 Length:1200 Min. : 7.00 Length:1200 Length:1200 Min. : 318.000 Min. : 240.000 Min. :0.000000
## 1st Qu.:300266664 1st Qu.: 1.000000 Class :character Class :character 1st Qu.: 119.00 Class :character Class :character 1st Qu.:1280.000 1st Qu.:1080.000 1st Qu.:0.000000
## Median :401174042 Median : 1.000000 Mode :character Mode :character Median : 340.00 Mode :character Mode :character Median :1920.000 Median :1080.000 Median :0.000000
## Mean :348207322 Mean : 1.026072 NA NA Mean : 487.39 NA NA Mean :1838.506 Mean :1229.653 Mean :0.001682
## 3rd Qu.:412109363 3rd Qu.: 1.000000 NA NA 3rd Qu.: 694.00 NA NA 3rd Qu.:1920.000 3rd Qu.:1080.000 3rd Qu.:0.000000
## Max. :428187561 Max. :17.000000 NA NA Max. :10943.00 NA NA Max. :4096.000 Max. :4320.000 Max. :1.000000
## NA's :11 NA's :11 NA NA NA's :11 NA NA NA's :11 NA's :11 NA's :11
## count progress view_at kid
## Min. : 1.000 Min. : -1.0 Min. :1.630e+09 Min. : 27040
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.:1.632e+09 1st Qu.:335887840
## Median : 1.000 Median : 1.0 Median :1.632e+09 Median :587030088
## Mean : 1.201 Mean : 122.3 Mean :1.633e+09 Mean :556435075
## 3rd Qu.: 1.000 3rd Qu.: 101.2 3rd Qu.:1.633e+09 3rd Qu.:763218471
## Max. :49.000 Max. :2334.0 Max. :1.635e+09 Max. :976237705
## NA's :8
## business redirect_link bvid mission_id
## Length:1200 Length:1200 Length:1200 Min. : 10923
## Class :character Class :character Class :character 1st Qu.: 28604
## Mode :character Mode :character Mode :character Median : 84241
## Mean : 82051
## 3rd Qu.:122069
## Max. :208463
## NA's :542
## season_id redirect_url
## Min. : 107 Length:1200
## 1st Qu.: 562 Class :character
## Median : 3491 Mode :character
## Mean :10602
## 3rd Qu.:24327
## Max. :32364
## NA's :1083
## bangumi.ep_id bangumi.title bangumi.long_title bangumi.episode_status bangumi.follow bangumi.cover bangumi.season.season_id season.title season.season_status season.is_finish season.total_count season.newest_ep_id season.newest_ep_index season.season_type
## Min. :330566.0 Length:1200 Length:1200 Min. : 2.0 Min. :0 Length:1200 Min. :27040 Length:1200 Min. : 2.0000 Min. :0.0000 Min. :-1.0000 Min. :330566.0 Length:1200 Min. :2.0000
## 1st Qu.:339391.0 Class :character Class :character 1st Qu.: 2.0 1st Qu.:0 Class :character 1st Qu.:31395 Class :character 1st Qu.: 2.0000 1st Qu.:0.5000 1st Qu.:-1.0000 1st Qu.:359631.0 Class :character 1st Qu.:2.0000
## Median :384953.0 Mode :character Mode :character Median : 2.0 Median :0 Mode :character Median :34041 Mode :character Median : 8.0000 Median :1.0000 Median : 1.0000 Median :416137.0 Mode :character Median :3.0000
## Mean :378549.3 NA NA Mean : 6.0 Mean :0 NA Mean :34280 NA Mean : 7.5714 Mean :0.7143 Mean : 0.7143 Mean :391498.3 NA Mean :3.1429
## 3rd Qu.:416009.0 NA NA 3rd Qu.:10.5 3rd Qu.:0 NA 3rd Qu.:38450 NA 3rd Qu.:13.0000 3rd Qu.:1.0000 3rd Qu.: 1.0000 3rd Qu.:424143.0 NA 3rd Qu.:3.0000
## Max. :423526.0 NA NA Max. :13.0 Max. :0 NA Max. :39189 NA Max. :13.0000 Max. :1.0000 Max. : 5.0000 Max. :426237.0 NA Max. :7.0000
## NA's :1193 NA NA NA's :1193 NA's :1193 NA NA's :1193 NA NA's :1193 NA's :1193 NA's :1193 NA's :1193 NA NA's :1193
## cheese.season_id cheese.number cheese.long_title cheese.cover cheese.update_info
## Min. :359 Length:1200 Length:1200 Length:1200 Length:1200
## 1st Qu.:359 Class :character Class :character Class :character Class :character
## Median :359 Mode :character Mode :character Mode :character Mode :character
## Mean :359 NA NA NA NA
## 3rd Qu.:359 NA NA NA NA
## Max. :359 NA NA NA NA
## NA's :1199 NA NA NA NA
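The upper bound pn=4 was found by trial. A possible alternative, sketched below, is to keep requesting pages until one comes back empty. This assumes that a page past the end of the history returns zero rows, which has not been verified against the real API. `fake_fetch_page` is a hypothetical stand-in for a single HTTP request, and the small page size is chosen only to keep the demo short.

```r
# Hypothetical stand-in for one request: returns the records of page `pn`,
# or a zero-row data frame once the page lies past the last record.
fake_fetch_page <- function(pn, page_size = 3L, n_total = 8L) {
  start <- (pn - 1L) * page_size + 1L
  if (start > n_total) {
    return(data.frame(aid = integer(0)))
  }
  end <- min(pn * page_size, n_total)
  data.frame(aid = seq.int(start, end))
}

# Request pages 1, 2, 3, ... until an empty page (or the safety cap) is reached.
fetch_all_pages <- function(fetch_page, max_pages = 50L) {
  pages <- list()
  for (pn in seq_len(max_pages)) {
    page <- fetch_page(pn)
    if (is.null(page) || nrow(page) == 0L) break
    pages[[length(pages) + 1L]] <- page
  }
  do.call(rbind, pages)
}

all_records <- fetch_all_pages(fake_fetch_page)
print(nrow(all_records))
```

With real responses, which contain nested data-frame columns, dplyr::bind_rows as used in the article is likely a safer way to combine pages than rbind.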
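The meaning of each column is documented in the project's section on fetching the full video history (old API). But first we need to find out: how far back do the watch records go?
3 Data tidying
Based on the summary() output above and the project's notes on the old history API, duration is the video length and progress is the playback position, with a value of -1 for videos that were watched to the end. To make watch time easy to compute, we combine duration and progress into a new column, play_time.
view_at, which records when a video was watched, holds values such as 1634781264. From experience these are Unix timestamps, so we convert them to date-time values with as.POSIXct and then bin the watch time by hour of day and by day of week. pubdate is handled the same way.
tid, which identifies the category, is numeric, but given what it means it should be used together with tname: forcats::fct_reorder orders the levels of tname by tid.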
library("lubridate")
library("forcats")
histroy_tidy_tb <-
  history_tb %>%
  mutate(
    tname = fct_reorder(tname, tid),
    play_time =
      if_else(progress<0, duration, progress),
    pubdate =
      as.POSIXct(pubdate, origin = "1970-01-01"),
    view_at =
      as.POSIXct(view_at, origin = "1970-01-01"),
    date = date(view_at),
    time = round(local_time(view_at, units = "hours")),
    dow = wday(view_at, week_start = 1)
  )
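The two rules used above, play_time and the timestamp conversion, can be checked on a tiny made-up data frame using base R only. The first view_at value is the example from the text; the other values are invented. UTC is set explicitly here so the result does not depend on the session, whereas the article's code relies on the session's local time zone.

```r
# Made-up records: one finished video (progress = -1) and two partial ones.
toy <- data.frame(
  duration = c(60L, 120L, 300L),
  progress = c(-1L, 30L, 300L),
  view_at  = c(1634781264, 1634784864, 1634867664)
)

# A finished video counts for its full duration, otherwise use the position reached.
toy$play_time <- ifelse(toy$progress < 0L, toy$duration, toy$progress)

# Unix timestamps are seconds since 1970-01-01 00:00:00 UTC.
viewed <- as.POSIXct(toy$view_at, origin = "1970-01-01", tz = "UTC")

toy$hour <- as.integer(format(viewed, "%H"))  # hour of day, 0-23
toy$dow  <- as.integer(format(viewed, "%u"))  # ISO weekday, 1 = Monday

print(format(viewed[1], "%Y-%m-%d %H:%M:%S"))
print(toy)
```

In UTC the first timestamp reads 2021-10-21 01:54:24, which is 09:54:24 in UTC+8 and so agrees with the latest view_at reported further down in the article.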
4 Data visualization
4.1 How long did I actually watch?
Let's start with the first question: how much Bilibili video did I watch over this period?
total_sec <- histroy_tidy_tb %>%
  summarise(
    min = min(view_at),
    max = max(view_at),
    total = sum(play_time))
total_sec
## # A tibble: 1 × 3
## min max total
## <dttm> <dttm> <int>
## 1 2021-09-01 19:43:49 2021-10-21 09:54:24 331559
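From 2021-09-01 19:43:49 to 2021-10-21 09:54:24 I watched 331559 seconds of video in total, which is about 92.1 hours! This is firmly in "my mum would smack me for this" territory.
4.2 When do I watch Bilibili?
Next question: when do I watch Bilibili videos? We visualize watch time by date, by time of day, and by day of the week.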
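The conversion can be reproduced from the numbers in the summary above. The sketch below also derives a rough average per calendar day over the observed span. That last figure is an extra illustration, not a result stated in the article, and the time zone is fixed to UTC only so that the difference between the two timestamps is well defined.

```r
# Values copied from the summary table above.
total_sec  <- 331559
first_view <- as.POSIXct("2021-09-01 19:43:49", tz = "UTC")
last_view  <- as.POSIXct("2021-10-21 09:54:24", tz = "UTC")

total_hours <- total_sec / 3600
span_days   <- as.numeric(difftime(last_view, first_view, units = "days"))

print(round(total_hours, 1))              # total hours watched
print(round(span_days, 1))                # length of the observed window in days
print(round(total_hours / span_days, 2))  # rough average hours per day
```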
library("ggplot2")
library("cowplot")
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:lubridate':
##
## stamp
view_date_p <-
  histroy_tidy_tb %>%
  group_by(date) %>%
  summarise(duration_sum = sum(play_time, na.rm = TRUE)/3600) %>%
  ggplot(aes(x = date, y = duration_sum)) +
  geom_line() +
  scale_x_date(
    "",
    date_breaks = "7 day") +
  ylab("Total Play Duration\n(Hour)")
# Note: mean(play_time) averages over history records, not over days
view_time_p <-
  histroy_tidy_tb %>%
  group_by(time) %>%
  summarise(duration_mean = mean(play_time, na.rm = TRUE)/3600) %>%
  ggplot(aes(x = time, y = duration_mean)) +
  geom_col() +
  scale_x_continuous(
    "Time of Day",
    limits = c(0, 24)) +
  ylab("Mean Play Duration\nper Record (Hour)")
view_dow_p <-
  histroy_tidy_tb %>%
  group_by(dow) %>%
  summarise(duration_mean = mean(play_time, na.rm = TRUE)/3600) %>%
  ggplot(aes(x = dow, y = duration_mean)) +
  geom_col() +
  scale_x_continuous("Day of Week") +
  ylab("Mean Play Duration\nper Record (Hour)")
view_bottom_grid_p <-
  plot_grid(view_time_p,
            view_dow_p,
            labels = c("B", "C"))
view_title_p <-
  ggdraw() +
  draw_label(
    "Play Duration (Hour)",
    fontface = "bold"
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )
view_grid_p <-
  plot_grid(
    view_date_p,
    view_bottom_grid_p,
    view_title_p,
    labels = c("A", "", ""),
    rel_heights = c(1, 1, .1),
    nrow = 3)
view_grid_p
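In panels B and C, mean(play_time) within a group is an average over individual history records. If an average per calendar day is wanted instead, one possible approach is to sum play time within each hour and divide by the number of days observed. The base R sketch below uses a made-up data frame and is not the computation behind the figure above.

```r
# Made-up records: two days, with views at 09:00 and 22:00 (play_time in seconds).
toy <- data.frame(
  date      = as.Date(c("2021-10-01", "2021-10-01", "2021-10-02",
                        "2021-10-02", "2021-10-02")),
  hour      = c(9, 9, 9, 22, 22),
  play_time = c(600, 1200, 1800, 300, 900)
)

# Average per record: what mean(play_time) gives after grouping by hour.
per_record <- aggregate(play_time ~ hour, data = toy, FUN = mean)

# Average per day: total seconds in each hour slot, spread over the days observed.
n_days  <- length(unique(toy$date))
per_day <- aggregate(play_time ~ hour, data = toy, FUN = sum)
per_day$hours_per_day <- per_day$play_time / 3600 / n_days

print(per_record)
print(per_day)
```

The two measures answer different questions: the first describes how long a typical viewing lasts at that hour, the second how much of an average day that hour takes up.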
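The plots show that from September 23 to October 2 I watched more Bilibili videos than usual. Mondays and Tuesdays also seem to involve longer viewing. The most striking part, though: what kind of night owl am I, still watching in the small hours???? Look after yourself, young (or rather middle-aged) man.
4.3 What kinds of videos did I watch?
Having spent that long watching, what exactly did I watch?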
library(forcats)
library(showtext)
## Loading required package: sysfonts
## Loading required package: showtextdb
showtext_auto()
histroy_tidy_tb %>%
  select(
    tname,
    play_time
  ) %>%
  group_by(tname) %>%
  summarise(duration_sum = sum(play_time, na.rm = TRUE)/3600) %>%
  mutate(tname = fct_reorder(
    tname, duration_sum
  )) %>%
  ggplot(aes(x = tname,
             y = duration_sum)) +
  geom_col() +
  coord_flip() +
  # labs() names the aesthetics, so x is still the category after coord_flip()
  labs(x = "Subcategory",
       y = "Play duration (hours)") +
  theme(text = element_text(family = "source-han-sans-cn"))
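Next, let's see whether the type of video differs by time of day. We group by time and tname and look at the play duration of each video type at each hour.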
showtext_auto()
histroy_tidy_tb %>%
  group_by(time, tname) %>%
  summarise(duration_mean_by_type = mean(play_time, na.rm = TRUE)/3600) %>%
  select(
    tname, time, duration_mean_by_type
  ) %>%
  ggplot(aes(x = time,
             y = duration_mean_by_type,
             fill = tname
  )) +
  geom_col()
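The categories are so numerous, however, that no pattern stands out. To make the chart readable, we merge the rarely watched types: any category whose total play time is below 1% of the overall total is lumped into "other".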
library("colorspace")
history_type_aggregate_tb <-
  histroy_tidy_tb %>%
  select(tname,
         play_time) %>%
  group_by(tname) %>%
  summarise(duration_sum = sum(play_time, na.rm = TRUE) / 3600) %>%
  mutate(percentage = duration_sum / sum(duration_sum),
         tname = as.character(tname)) %>%
  mutate(type = if_else(percentage >= .01, tname, 'other')) %>%
  group_by(type) %>%
  summarise(duration_sum = sum(duration_sum)) %>%
  arrange(desc(duration_sum))
DT::datatable(history_type_aggregate_tb)
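The lumping rule can also be tried on a small made-up data frame. The sketch below uses forcats::fct_lump_prop with play time as weights. It is offered as a possible alternative, not as the method used in this article, and its handling of categories that sit exactly on the threshold may differ from the `percentage >= .01` test above.

```r
library(forcats)

# Made-up play times in seconds: "c" and "d" each hold 0.5% of the total.
toy <- data.frame(
  tname     = c("a", "a", "b", "c", "d"),
  play_time = c(5000, 4000, 900, 50, 50)
)

# Share of total play time per category, for reference.
share <- tapply(toy$play_time, toy$tname, sum) / sum(toy$play_time)

# Lump every category whose weighted share does not exceed 1% into "other".
lumped <- fct_lump_prop(factor(toy$tname), prop = 0.01,
                        w = toy$play_time, other_level = "other")

totals <- tapply(toy$play_time, lumped, sum)

print(share)
print(totals)
```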
showtext_auto()
histroy_tidy_tb %>%
  mutate(
    tname = as.character(tname),
    type = if_else(tname %in% history_type_aggregate_tb$type,
                   tname,
                   "other"
    )) %>%
  group_by(time, type) %>%
  summarise(duration_mean_by_type = mean(play_time, na.rm = TRUE)/3600) %>%
  select(
    type, time, duration_mean_by_type
  ) %>%
  ggplot(aes(x = time,
             y = duration_mean_by_type,
             fill = type,
             label = type
  )) +
  geom_col() +
  labs(x = "Time of day",
       y = "Play duration\n(hours)",
       fill = "Video category") +
  theme_classic() +
  scale_fill_discrete_sequential("Batlow")
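It looks like I am still the kid who loves watching other people play games: whenever I am on Bilibili, some of the time goes to single-player game videos (单机游戏). In the small hours and around noon I lean towards film and TV commentary (影视杂谈). By mid-afternoon and evening it is food videos (美食)... clearly a kid who weighs a few hundred jin (shrug).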
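This post is already long enough, so further analysis will follow in later posts; for now, let's save the data for future use. Parquet, a widely supported columnar file format that is also common in the Hadoop ecosystem (see apache/arrow), would be a natural choice for the tibble object. The code below stores the tidy tibble as an .Rdata file in a bucket on a local MinIO object store using s3save().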
library(minio.s3)
# bucket <-
# minio.s3::put_bucket("bili-history")
s3save(histroy_tidy_tb, object = "histroy_tidy_tb.Rdata", bucket = 'bili-history')
get_bucket('bili-history')
## Bucket: bili-history
##
## $Contents
## Key: histroy_tidy_tb.Rdata
## LastModified: 2021-10-21T03:27:53.416Z
## ETag: "7a07c6f484da97bf7cac3fa97e898991"
## Size (B): 347773
## Owner: minio
## Storage class: STANDARD
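For the Parquet format mentioned above, a minimal local round trip might look like the following, assuming the arrow package is installed. The data frame is a flat, made-up example; the nested data-frame columns of the real history table may need to be flattened or dropped first, which is not shown here.

```r
library(arrow)

# Flat, made-up stand-in for the tidy history table.
toy <- data.frame(
  tname     = c("a", "b"),
  play_time = c(60L, 30L),
  stringsAsFactors = FALSE
)

# Write to a temporary Parquet file and read it back.
path <- tempfile(fileext = ".parquet")
write_parquet(toy, path)
back <- as.data.frame(read_parquet(path))

print(back)
```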