R语言的xlsx包的read.xlsx及read.xlsx2函数操作Excel的xlsx文件

枫林秋2016 2019-09-01

展开全文

这篇文章是对R包xlsx的0.5.7版本的read.xlsx与read.xlsx2函数的帮助文档的翻译。很多地方仍用英文是因为我看不懂。我应该会长期维护这些博客，如有译文错误请评论指正。

简介

读取一个工作表的内容到一个R的data.frame对象。

用法

read.xlsx(file, sheetIndex, sheetName=NULL,
    rowIndex=NULL,
    startRow=NULL, endRow=NULL, colIndex=NULL,
    as.data.frame=TRUE, header=TRUE, colClasses=NA,
    keepFormulas=FALSE, encoding="unknown", ...)

read.xlsx2(file, sheetIndex, sheetName=NULL,
    startRow=1, colIndex=NULL, endRow=NULL,
    as.data.frame=TRUE, header=TRUE, colClasses="character", ...)

参数

file：Excel文件的路径
sheetIndex ：一个表示工作薄中的表的索引的数值
sheetName：表名
rowIndex：一个数值向量，表示想提取的行。如为空，且未指定startRow和endRow，则提取所有行。
colIndex ：一个数值向量，表示想提取的列。如为空，则提取所有的列。
as.data.frame：布尔值，是否强制转换为data.frame。如FALSE，则用列表表示，每个元素为一列。
header：布尔值，是否将第一行识别为标题
colClasses：For read.xlsx a character vector that represent the class of each column. Recycled as necessary, or if the character vector is named, unspecified values are taken to be NA. For read.xlsx2 see readColumns.
keepFormulas：布尔值，是否以文本格式保留Excel公式
encoding：encoding to be assumed for input strings. See read.table.
startRow：数值，读取的起点行。对于read.xlsx，仅当参数rowIndex为NULL时有效。
endRow ：数值，读取的终点行。如设为NULL，则读取所有行。对于read.xlsx，仅当参数rowIndex为NULL时有效。
…
other arguments to data.frame, for example stringsAsFactors

细节

函数read.xlsx提供了一个读取Excel数据的高级接口。它调用了多个低级函数。Its goal is to provide the conveniency of read.table by borrowing from its signature.

The function pulls the value of each non empty cell in the worksheet into a vector of type list by preserving the data type. If as.data.frame=TRUE, this vector of lists is then formatted into a rectangular shape. Special care is needed for worksheets with ragged data.

An attempt is made to guess the class type of the variable corresponding to each column in the worksheet from the type of the first non empty cell in that column. If you need to impose a specific class type on a variable, use the colClasses argument. It is recommended to specify the column classes and not rely on R to guess them, unless in very simple cases.

Excel内部将日期与时间保存为数值类型，并不保留时区与夏令时数据。在读取一个日期时间类型的数据时，it is converted to POSIXct class with a GMT timezone. Occasional rounding errors may appear and the R and Excel string representation my differ by one second. For read.xlsx2 bring in a datetime column as a numeric one and then convert to class POSIXct or Date. Also rounding the POSIXct column in R usually does the trick too.

read.xlsx2函数将更多的工作放在了Java中，所以它有更快的运行速度 (an order of magnitude faster on sheets with 100,000 cells or more)。read.xlsx2的结果通常与read.xlsx不同，因为read.xlsx2内部用readColumns实现，readColumns is tailored for tabular data.

返回值

一个data.frame对象或一个list对象，这取决于as.data.frame的值。如发现整列的NA，可能colClasses参数设置错误。

如果表为空，则返回NULL。如果表不存在，报错。

作者

Adrian Dragulescu

另见

write.xlsx可以写Excel文档。另见readColumns，可以仅读取几个列。

示例

## Not run: 

file <- system.file("tests", "test_import.xlsx", package = "xlsx")
res <- read.xlsx(file, 1)  # read first sheet
head(res)
#          NA. Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
# 1    Alabama       3615   3624        2.1    69.05   15.1    41.3    20  50708
# 2     Alaska        365   6315        1.5    69.31   11.3    66.7   152 566432
# 3    Arizona       2212   4530        1.8    70.55    7.8    58.1    15 113417
# 4   Arkansas       2110   3378        1.9    70.66   10.1    39.9    65  51945
# 5 California      21198   5114        1.1    71.71   10.3    62.6    20 156361
# 6   Colorado       2541   4884        0.7    72.06    6.8    63.9   166 103766
# >


# To convert an Excel datetime colum to POSIXct, do something like:
#   as.POSIXct((x-25569)*86400, tz="GMT", origin="1970-01-01")
# For Dates, use a conversion like:
#   as.Date(x-25569, origin="1970-01-01") 

res2 <- read.xlsx2(file, 1)  


## End(Not run)