[Original] January 26 Paper Recommendation (with Download Link)

Academic Headlines (学术头条) 2020-11-27

Paper Title

OpenTag: Open Attribute Value Extraction from Product Profiles

Authors

Guineng Zheng (University of Utah)
Subhabrata Mukherjee (Amazon.com)
Xin Luna Dong (Amazon.com)
Feifei Li (University of Utah)

Why We Recommend It

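"OpenTag: Open Attribute Value Extraction from Product Profiles" was written by an Amazon intern. It tackles a long-standing problem: extracting the attribute values that describe a product from its product page. The difference is that the values extracted here may never have appeared (or been defined) before. The authors use a bidirectional LSTM to learn features, add a CRF to improve extraction accuracy, add an attention mechanism to improve interpretability, and finally add an active learning method to reduce the annotation workload. The figure below depicts the overall model framework. Overall, the architecture combines several existing techniques well.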

The experimental results are also good.

Abstract

Extraction of missing attribute values is to find values describing an attribute of interest from a free text input. Most past related work on extraction of missing attribute values work with a closed world assumption with the possible set of values known beforehand, or use dictionaries of values and hand-crafted features. How can we discover new attribute values that we have never seen before? Can we do this with limited human annotation or supervision? We study this problem in the context of product catalogs that often have missing values for many attributes of interest.

In this work, we leverage product profile information such as titles and descriptions to discover missing values of product attributes. We develop a novel deep tagging model OpenTag for this extraction problem with the following contributions: (1) we formalize the problem as a sequence tagging task, and propose a joint model exploiting recurrent neural networks (specifically, bidirectional LSTM) to capture context and semantics, and Conditional Random Fields (CRF) to enforce tagging consistency; (2) we develop a novel attention mechanism to provide interpretable explanation for our model’s decisions; (3) we propose a novel sampling strategy exploring active learning to reduce the burden of human annotation. OpenTag does not use any dictionary or hand-crafted features as in prior works. Extensive experiments in real-life datasets in different domains show that OpenTag with our active learning strategy discovers new attribute values from as few as 150 annotated samples (reduction in 3.3x amount of annotation effort) with a high F-score of 83%, outperforming state-of-the-art models.
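
To make the "sequence tagging" formulation in the abstract more concrete, here is a minimal sketch of how a per-token tag sequence could be turned back into attribute values. It assumes a simple B/I/O tag scheme (begin, inside, outside), which the text above does not specify, and the function name `extract_values`, the example title, and its tags are all hypothetical. In the paper's setting the tags would come from the BiLSTM-CRF model; here they are written by hand.

```python
from typing import List


def extract_values(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect attribute values from a tagged token sequence.

    Assumed tag scheme (not taken from the source text):
    "B" starts a value, "I" continues it, "O" is outside any value.
    """
    values: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new value starts; close any value still open.
            if current:
                values.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the currently open value.
            current.append(token)
        else:
            # "O" (or a stray "I" with no open value) ends the current value.
            if current:
                values.append(" ".join(current))
            current = []
    if current:
        values.append(" ".join(current))
    return values


# Hypothetical product title, hand-tagged for a "flavor" attribute.
tokens = "duck fillet and venison dog treats".split()
tags = ["B", "I", "O", "B", "O", "O"]
print(extract_values(tokens, tags))  # ['duck fillet', 'venison']
```

Because the values are read off token spans rather than looked up in a fixed vocabulary, a tagger of this kind can in principle return a value it has never seen before, which is the "open" setting the abstract describes.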