
Bibliometrics Series X3: Extracting Publication Month from PubMed Bibliographic Records

 松哥精鼎统计 2020-10-23
Introduction


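In the previous installment we covered downloading bibliographic records from the PubMed database and importing them. Unlike records from the WoS database, the imported data has no PD column holding the publication month, so we have to add one ourselves. This installment shows how to extract the publication month from PubMed records obtained in two different ways.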






1. Month information from records exported from the PubMed website


pacman::p_load(bibliometrix, pubmedR, stringr)  # load packages; stringr provides str_extract(), used later
file <- 'E:/精鼎统计/pubmed-bibliometr-set.txt'
M <- convert2df(file, dbsource = "pubmed", format = "pubmed")
names(M)
#  [1] "AU"      "AF"      "DE"      "AID"     "OT"      "PHST"    "DT"      "AB"
#  [9] "C1"      "OI"      "CI"      "CIN"     "CN"      "COIS"    "CON"     "CRDT"
# [17] "DCOM"    "DEP"     "PY"      "EDAT"    "EFR"     "EIN"     "FIR"     "FPS"
# [25] "GN"      "GR"      "IS"      "IR"      "SN"      "JID"     "SO"      "LA"
# [33] "LID"     "LR"      "MHDA"    "MID"     "OAB"     "OABL"    "OID"     "ORI"
# [41] "OTO"     "OWN"     "PP"      "PL"      "PMC"     "PMCR"    "PMID"    "PS"
# [49] "PST"     "RF"      "RIN"     "RN"      "RPI"     "SB"      "SI"      "SO2"
# [57] "STAT"    "J9"      "TI"      "TT"      "UIN"     "UOF"     "VL"      "DI"
# [65] "DB"      "ID"      "RP"      "TC"      "CR"      "AU_UN"   "AU1_UN"  "AU_UN_NR"
# [73] "SR_FULL" "SR"
The column names include no month column "PD", but the "SO2" column does contain month information. Next, we print the first 6 rows of M$SO2.
head(M$SO2)
# [1] "EUR J PHYS REHABIL MED. 2018 OCT;54(5):792-796. DOI: 10.23736/S1973-9087.18.05462-X. EPUB 2018 AUG 29."
# [2] "IR J MED SCI. 2019 AUG;188(3):939-951. DOI: 10.1007/S11845-018-1936-5. EPUB 2018 DEC 3."
# [3] "FEMS MICROBIOL LETT. 2018 APR 1;365(8). DOI: 10.1093/FEMSLE/FNY059."
# [4] "SHENG WU GONG CHENG XUE BAO. 2020 FEB 25;36(2):241-249. DOI: 10.13345/J.CJB.190223."
# [5] "NURS OUTLOOK. 2019 NOV-DEC;67(6):680-695. DOI: 10.1016/J.OUTLOOK.2019.04.009. EPUB 2019 MAY 2."
# [6] "NEUROL INDIA. 2018 JAN-FEB;66(1):96-104. DOI: 10.4103/0028-3886.222880."
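Each record carries one or two dates, both of which include the month, although a few documents have no date at all. The first date is the publication date of the issue (Published time) and the second is the online publication date (Online time). We care mainly about the publication date, so next we extract the first one.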
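M$Time <- str_extract(M$SO2, '\\d{4} [A-Z]{3}')
# Extracts the first date; if there is no first date, the second is picked up instead;
# if neither is present the result is NA; a value like "2018 JAN-FEB" yields "2018 JAN".
M$PD <- str_extract(M$Time, '[A-Z]+')  # extract the month from the date string
M$PD[1:10]
# [1] "OCT" "AUG" "APR" "FEB" "NOV" "JAN" "JUL" "JUL" "JUL" "OCT"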
Mon2Num <- function(x) match(tolower(x), tolower(month.abb))  # convert English month abbreviations to numbers
M$PD_number <- Mon2Num(M$PD)
M$PD_number[1:10]
# [1] 10  8  4  2 11  1  7  7  7 10
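To make the edge cases easier to see, here is a small self-contained sketch of the same two-step extraction in base R, without stringr. It runs on three SO2 strings copied from the output above plus one made-up string with no date; `extract_first` is a hypothetical helper written for this illustration, not a bibliometrix function.

```r
# Three SO2 strings from the head(M$SO2) output, plus one hypothetical
# record with no date at all to show the NA case.
so2 <- c(
  "EUR J PHYS REHABIL MED. 2018 OCT;54(5):792-796. DOI: 10.23736/S1973-9087.18.05462-X. EPUB 2018 AUG 29.",
  "NURS OUTLOOK. 2019 NOV-DEC;67(6):680-695. DOI: 10.1016/J.OUTLOOK.2019.04.009. EPUB 2019 MAY 2.",
  "FEMS MICROBIOL LETT. 2018 APR 1;365(8). DOI: 10.1093/FEMSLE/FNY059.",
  "HYPOTHETICAL JOURNAL WITH NO DATE."
)

# Hypothetical helper: return the first regex match per element, NA if none.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- !is.na(m) & m != -1
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)
  out
}

time_str  <- extract_first(so2, "\\d{4} [A-Z]{3}")  # year plus 3-letter month
month_abb <- extract_first(time_str, "[A-Z]+")      # month abbreviation only
month_num <- match(tolower(month_abb), tolower(month.abb))

print(data.frame(time_str, month_abb, month_num))
```

A month range such as "NOV-DEC" is reduced to its first month, and a record without any date stays NA through both steps rather than raising an error.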



2. Month information from PubMed records downloaded through the API


query <- 'bibliometrics [MeSH] AND english[LA] AND 2009/01:2019/12[DP]'  # search query
pmQueryTotalCount(query)  # check that the query is valid
# $total_count
# [1] 7048
#
# $query_translation
# [1] "\"bibliometrics\"[MeSH Terms] AND english[LA] AND 2009/01[PDAT] : 2019/12[PDAT]"
#
# $web_history
# Web history object (QueryKey = 1, WebEnv = MCID_5f4ef38...)
Bp <- pmApiRequest(query, limit = 100)  # download the records in XML format
Sys.setlocale('LC_ALL', 'C')  # switch to the standard C locale
M1 <- convert2df(Bp, dbsource = "pubmed", format = "api")  # convert the XML data to a data frame
names(M1)
#  [1] "AU"        "AF"        "TI"        "SO"        "SO_CO"     "LA"        "DT"        "DE"        "ID"        "MESH"
# [11] "AB"        "C1"        "CR"        "TC"        "SN"        "J9"        "JI"        "PY"        "PY_IS"     "VL"
# [21] "DI"        "PG"        "GRANT_ID"  "GRANT_ORG" "UT"        "PMID"      "DB"        "AU_UN"     "AU_CO"     "AU1_CO"
# [31] "SR_FULL"   "SR"
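This data frame has no month column, and none of its columns contains month information, so the month cannot be extracted from it. Inspecting the downloaded object Bp shows that the month is present there; it is simply not carried over during format conversion. We therefore modified the original conversion function so that the month is extracted as well.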
source('E:/精鼎统计/bibliometrics analysis base.R')  # load the custom functions from the video course
M1 <- pmApi2df_new(Bp)
names(M1)
#  [1] "AU"        "AF"        "TI"        "SO"        "SO_CO"     "LA"        "DT"        "DE"        "ID"        "MESH"
# [11] "AB"        "C1"        "CR"        "TC"        "SN"        "J9"        "JI"        "PY"        "PY_IS"     "PD"
# [21] "VL"        "DI"        "PG"        "GRANT_ID"  "GRANT_ORG" "UT"        "PMID"      "DB"        "AU_UN"     "AU_CO"
# [31] "AU1_CO"
M1$PD[1:10]
# [1] 4 2 3 2 2 1 1 1 1
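The modified function pmApi2df_new comes from the course materials and is not reproduced here. As a rough sketch of the underlying idea only, the example below pulls the month out of a nested list shaped like a PubMed XML record (MedlineCitation > Article > Journal > JournalIssue > PubDate). The sample records are hand-built, `get_month` and `month_key` are hypothetical names, and the assumption that each element of the downloaded data flattens to these dotted names should be checked against your own Bp object.

```r
# Hypothetical helper that builds a stand-in record around a given PubDate list.
make_record <- function(pub_date) {
  list(MedlineCitation = list(Article = list(Journal = list(
    JournalIssue = list(PubDate = pub_date)))))
}

# Hand-built sample records (not real downloads): month as an abbreviation,
# month as digits, and a free-text MedlineDate with no Month element.
records <- list(
  make_record(list(Year = "2018", Month = "Oct")),
  make_record(list(Year = "2019", Month = "04")),
  make_record(list(MedlineDate = "2019 Nov-Dec"))
)

# Name of the month field after flattening the nested list with unlist().
month_key <- "MedlineCitation.Article.Journal.JournalIssue.PubDate.Month"

# Hypothetical helper: return the month as an integer, or NA if absent.
get_month <- function(record) {
  flat  <- unlist(record)
  value <- unname(flat[month_key])
  if (is.na(value)) return(NA_integer_)                     # no Month element
  if (grepl("^[0-9]+$", value)) return(as.integer(value))   # numeric month
  match(tolower(value), tolower(month.abb))                 # "Oct" -> 10
}

pd <- vapply(records, get_month, integer(1))
print(pd)
```

Records that only carry a free-text MedlineDate are left as NA in this sketch; they could be handled with the same regular-expression approach used in section 1.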




3. Summary


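This installment showed how to extract the publication month from the two kinds of bibliographic output available from PubMed. In later installments I will explore methods for keyword analysis.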



