Exploring Factors Associated with Diabetes in BRFSS Data: A Visual Analysis in R

拓端数据, published 2022-05-25 in Shanghai

Original link: http:///?p=9227

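Dataset: Behavioral Risk Factor Surveillance System (BRFSS) data

Summary: The dataset comes from roughly 400,000 health-related questionnaire responses collected across the United States. BRFSS began in the 1980s and has been used in the US to monitor prevalent diseases through questionnaires. The study is retrospective rather than a designed experiment, so correlations can be inferred from it but causal relationships cannot.

The dataset contains both continuous and categorical features.

Goal: explore the correlations among sex, weight, and age

Part 0: Setup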

library(ggplot2)
library(dplyr)
library(Rgraphviz)
library(knitr)
library(grid)
library(gridExtra)

load("brfss2013.RData")

# Group by a discrete feature and count the rows in each group
feature_vcounts <- function(df, f) {
  df %>%
    group_by_at(f) %>%
    count()
}

# Smallest n for which n*p >= 10 and n*(1-p) >= 10
# (the success-failure condition for a proportion p)
bin_min_sample <- function(p) {
  n <- 10
  max(n / p, n / (1 - p))
}

# Proportion of rows in df whose feature f equals target
binom_prob_df <- function(df, f, target) {
  new_df <- feature_vcounts(df, f)
  new_df$n[new_df[[f]] == target] / sum(new_df$n)
}

# Keep only the rows where feature f equals group
subgroup_df <- function(df, f, group) {
  filter(df, .data[[f]] == group)
}

# Proportion of entries in vector v equal to target
binom_prob_vec <- function(v, target) {
  sum(v == target) / length(v)
}

# Draw s values from v with replacement
binom_sample <- function(s, v) {
  sample(v, size = s, replace = TRUE)
}

# Build the sampling distribution of the proportion:
# repeatedly draw 100 values of feature f and record the share equal to target
binom_sample_dist <- function(df, f, target) {
  n_iter <- 10001  # number of resamples, as in the original loop
  sample_dist <- numeric(n_iter)
  for (i in seq_len(n_iter)) {
    sample_dist[i] <- binom_prob_vec(binom_sample(100, df[[f]]), target)
  }
  sample_dist
}

# Convert a decimal to a percent string
to_percent <- function(pvalue) {
  paste(round(pvalue * 100, digits = 2), "%", sep = "")
}
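
To illustrate how helpers of this kind fit together, here is a minimal base-R sketch on simulated data, not the BRFSS file. It assumes a made-up Yes/No vector `toy` with a true "Yes" rate of 0.13 (an arbitrary value), and the names `prop_of`, `min_n`, and `sim_props` are hypothetical stand-ins for `binom_prob_vec`, `bin_min_sample`, and `binom_sample_dist`. It checks the success-failure sample size and compares the spread of the simulated proportions with the textbook standard error sqrt(p(1-p)/n).

```r
set.seed(42)  # make the simulation reproducible

# Hypothetical toy population: about 13% "Yes" (value chosen for illustration)
toy <- sample(c("Yes", "No"), size = 5000, replace = TRUE, prob = c(0.13, 0.87))

# Proportion of entries equal to `target`
prop_of <- function(v, target) sum(v == target) / length(v)

# Smallest n with n*p >= 10 and n*(1-p) >= 10 (success-failure condition)
min_n <- function(p) max(10 / p, 10 / (1 - p))

# Draw `reps` resamples of size n and record the target proportion in each
sim_props <- function(v, target, n, reps) {
  vapply(seq_len(reps),
         function(i) prop_of(sample(v, size = n, replace = TRUE), target),
         numeric(1))
}

p_hat  <- prop_of(toy, "Yes")
n_need <- ceiling(min_n(p_hat))
props  <- sim_props(toy, "Yes", n = 100, reps = 2000)

cat("observed proportion:", round(p_hat, 3), "\n")
cat("minimum n for the normal approximation:", n_need, "\n")
cat("simulated SE:", round(sd(props), 4),
    " theoretical SE:", round(sqrt(p_hat * (1 - p_hat) / 100), 4), "\n")
```

The source draws 100 values per resample; whether that is enough for the normal approximation depends on how far the proportion is from 0 or 1, which is what the minimum-n check is for.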

Part 1: Data

Import and filter the data to include only the key features related to diabetes, sex, weight, and age.

# Record the dimensions of the original data:
orig_dim <- dim(brfss2013)

# Select only relevant features:
weight_diabetes <- brfss2013 %>%
  select(sex, X_ageg5yr, weight2, diabete3)

# ------------------Cleaning data------------------
# 1. Weight strings -> numeric (non-numeric entries become NA)
weight_diabetes$weight2 <- as.numeric(as.character(weight_diabetes$weight2))
new_dim <- dim(weight_diabetes)

# 2. Remove rows with missing values and weights over 400
weight_diabetes <- na.omit(weight_diabetes)
weight_diabetes <- filter(weight_diabetes, weight2 <= 400)

# 3. Keep only "Yes"/"No" diabetes responses
target <- c("Yes", "No")
weight_diabetes <- filter(weight_diabetes, diabete3 %in% target)

# 4. Add index and reorder
# (columns 4, 3, 1, 2 are diabete3, weight2, sex, X_ageg5yr; the index column is not kept)
weight_diabetes$index <- seq.int(nrow(weight_diabetes))
weight_diabetes <- weight_diabetes[c(4, 3, 1, 2)]
clean_dim <- dim(weight_diabetes)

# Show data:
kable(head(weight_diabetes, n = 5), caption = "Diabetes Data Set", padding = 0, format = "markdown", align = "l")

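The cleaned data are simple and contain only the features this project needs. Because the data must stay anonymous, age ranges serve as a safe substitute for exact ages. The age range is treated as a categorical variable in this dataset.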
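
Step 1 of the cleaning code passes `weight2` through `as.character()` before `as.numeric()`. Assuming `weight2` is stored as a factor (the conversion in the source suggests so, but this is an assumption), the sketch below uses a made-up factor `w` to show why the intermediate step matters: calling `as.numeric()` directly on a factor returns the internal level codes, not the printed values.

```r
# Hypothetical weights stored as a factor, including a non-numeric entry
w <- factor(c("150", "95", "210", "unknown", "150"))

codes  <- as.numeric(w)                                  # internal level codes
values <- suppressWarnings(as.numeric(as.character(w)))  # printed values; "unknown" becomes NA

print(codes)
print(values)

# Same follow-up as the cleaning code: drop NA, keep weights up to 400
kept <- values[!is.na(values) & values <= 400]
print(kept)
```

The non-numeric entry turns into `NA` during the conversion, which is why the cleaning code can remove such responses with `na.omit()` in the next step.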

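Part 2: Research questions

Research question 1:

Is there a correlation among sex, weight, and age? (Variables: sex, weight2, X_ageg5yr)

Because sex is a key variable in biometric data, it is important to explore whether it is correlated with other variables. Here we examine whether sex is correlated with weight.

Research question 2:

Is sex or age associated with diabetes, and if so, how? (Variables: sex, X_ageg5yr, weight2, diabete3)

The goal of this exploratory project is to examine whether weight, sex, or age is associated with diabetes. Understanding any such association may help inform patients of their likelihood of having diabetes based on their sex and weight.

Research question 3:

Is there a relationship among age, weight, and diabetes? (Variables: sex, X_ageg5yr, weight2, diabete3)

To explore possible associations with diabetes further, we will also examine the relationships among all four variables.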

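Part 3: Exploratory data analysis

Research question 1:

Is there a correlation among sex, weight, and age? (Variables: sex, weight2, X_ageg5yr)

It is important to check the distribution of the data first. _Sex_ is a binary categorical variable, so we will visualize its distribution with a bar chart.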

# Center plot titles
centered <- theme(plot.title = element_text(hjust = 0.5))

# Histogram of raw weight
hist_weight <- ggplot(data = weight_diabetes, aes(weight2)) +
  geom_histogram(bins = 30, fill = 'salmon', color = 'white') + ggtitle("Histogram [Weight]") + centered

# Histogram of log-transformed weight
weight_diabetes$log_weight <- log(weight_diabetes$weight2)
hist_log_weight <- ggplot(data = weight_diabetes, aes(log_weight)) +
  geom_histogram(bins = 30, fill = 'mediumturquoise', color = 'white') + ggtitle("Histogram [Log_Weight]") + centered

# Show both histograms side by side
grid.arrange(hist_weight, hist_log_weight, ncol = 2)
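
A log transform is a common way to reduce right skew in a variable such as weight, which is presumably why the two histograms are drawn side by side. The sketch below uses simulated log-normal values (not BRFSS data; `meanlog` and `sdlog` are arbitrary assumptions) and a hand-written helper, `skew`, to compare skewness on the raw and log scales.

```r
set.seed(1)  # make the simulation reproducible

# Simulated right-skewed "weights"; the parameters are illustrative assumptions
sim_weight <- rlnorm(10000, meanlog = log(170), sdlog = 0.25)

# Moment-based skewness: third central moment divided by
# the second central moment raised to the power 1.5
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / (mean(d^2)^1.5)
}

cat("skewness, raw scale:", round(skew(sim_weight), 3), "\n")
cat("skewness, log scale:", round(skew(log(sim_weight)), 3), "\n")
```

A skewness near zero on the log scale indicates a roughly symmetric distribution; how close real survey weights come to that would need to be checked on the actual data.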