Automatically Generating Wikipedia Articles
A Structure-Aware Approach
Aim
- a method to learn topic-specific extractors for content sections, trained jointly for the entire template.
- use a global integer linear programming formulation
- create a comprehensive textual overview of a subject composed of info drawn from the Internet.
Training Corpus
- n documents d1, d2, ..., dn
- for each di, a set of delineated sections si1, si2, ..., sim and their corresponding headings hi1, hi2, ..., him
Process
- Preprocessing
- Template induction
- cluster all section headings hi1, ..., him of every di using a repeated bisectioning algorithm (Zhao et al., 2005)
- the similarity function is cosine similarity over TF-IDF term vectors
- eliminate any clusters with low internal similarity, since they do not yield unified topics
- determine the average number of sections k over all documents, then select the k largest section clusters as topics t1...tk
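The heading-similarity function can be sketched in plain Python (the toy headings are invented for illustration; the repeated bisectioning clustering itself is not reproduced here, only the TF-IDF cosine measure it relies on):

```python
import math
from collections import Counter

def tfidf_vectors(headings):
    """Build sparse TF-IDF vectors (dicts) for a list of heading strings."""
    docs = [h.lower().split() for h in headings]
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(x * v.get(t, 0.0) for t, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy headings (invented): near-duplicate headings score higher than unrelated ones.
headings = ["Early life", "Early life and career", "Discography"]
vecs = tfidf_vectors(headings)
```

Headings that share weighted terms (the first two) end up in the same cluster; unrelated headings (the third) score near zero.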
- order these topics t1...tk using a majority ordering algorithm (Cohen et al., 1998)
- the resulting order is consistent with the maximal number of pairwise ordering relationships observed in the data set
- each topic tj is identified by the most frequent heading found within the cluster.
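A greedy sketch of majority ordering over the observed section sequences (a simplification for illustration; the algorithm of Cohen et al. handles ties and cycles more carefully):

```python
from collections import defaultdict

def majority_order(section_sequences):
    """Greedy majority ordering: repeatedly emit the topic whose net
    'comes-before' count against the remaining topics is largest."""
    before = defaultdict(int)  # before[(a, b)] = times topic a precedes topic b
    topics = set()
    for seq in section_sequences:
        topics.update(seq)
        for i, a in enumerate(seq):
            for b in seq[i + 1:]:
                before[(a, b)] += 1
    order, remaining = [], set(topics)
    while remaining:
        # sorted() only to break score ties deterministically
        best = max(sorted(remaining), key=lambda t: sum(
            before[(t, u)] - before[(u, t)] for u in remaining if u != t))
        order.append(best)
        remaining.remove(best)
    return order

# Toy section sequences observed in three training documents.
ordered = majority_order([["intro", "body", "end"], ["intro", "end"], ["body", "end"]])
```

On these toy sequences the pairwise precedence counts recover the order intro, body, end.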
- Search: to retrieve relevant excerpts, use query reformulation
- search using Yahoo!, retrieve the first ten result pages for each topic, and extract all candidate texts (6 excerpts per page)
- for each topic tj of each document we wish to create, label the excerpts ej1...ejr (the total number of excerpts found on the Internet may differ per topic)
- Learning Content Selection
- Model
- Input
- The title of the desired doc
- t1...tk,topics from the content template
- ej1...ejr,candidate excerpts for each topic tj
- Define
- \phi(e_{jl}), feature vector for the lth candidate excerpt for topic tj
- w1...wk, parameter vectors, one for each topic t1...tk
- Process
- Ranking
- score(e_{jl}) = \phi(e_{jl}) \times \vec{w}_j
- the position l of excerpt ejl within the topic-specific candidate vector is the excerpt's rank.
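The ranking step reduces to one dot product per candidate; a minimal sketch using sparse dict feature vectors (the feature names here are illustrative, not the paper's features):

```python
def score(phi, w):
    """Score of an excerpt: dot product of its sparse feature vector phi
    with the topic's weight vector w."""
    return sum(v * w.get(f, 0.0) for f, v in phi.items())

def rank_excerpts(excerpts, w):
    """Sort one topic's candidate excerpts by descending score; an excerpt's
    rank l is its 1-based position in the returned list."""
    return sorted(excerpts, key=lambda phi: score(phi, w), reverse=True)

# Hypothetical features: one weight vector per topic, one sparse vector per excerpt.
w_topic = {"contains_title": 1.5, "all_caps": -0.7}
candidates = [{"all_caps": 1.0}, {"contains_title": 1.0}]
ranked = rank_excerpts(candidates, w_topic)
```

The rank produced here is exactly the position l fed into the global objective below.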
- Optimizing the Global Objective
- Objective: \min \sum_{j=1}^{k} \sum_{l} l \cdot x_{jl}, where x_{jl} \in \{0, 1\} indicates whether excerpt e_{jl} is selected (minimize the summed ranks of the selected excerpts)
- Exclusivity Constraints: \sum_{l} x_{jl} = 1 for every topic tj (exactly one excerpt per topic)
- Redundancy Constraints: (x_{jl} + x_{j'l'}) \cdot sim(e_{jl}, e_{j'l'}) \le 1 (two excerpts with high cosine similarity cannot both be selected)
- solved with an off-the-shelf ILP solver
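On toy data the ILP can be replaced by exhaustive search, which makes the objective and both constraint families concrete (the `sim` function and the 0.5 threshold here are assumptions for illustration):

```python
from itertools import product

def select_excerpts(ranks, sim, threshold=0.5):
    """Exhaustive stand-in for the ILP: choose exactly one excerpt per topic
    (exclusivity), minimizing the sum of chosen ranks (objective), while
    rejecting any assignment with a cross-topic pair whose similarity
    exceeds `threshold` (redundancy).
    ranks[j][l] is the rank of excerpt l for topic j;
    sim(j, l, j2, l2) is the similarity between two excerpts."""
    best, best_cost = None, float("inf")
    for choice in product(*(range(len(r)) for r in ranks)):
        redundant = any(
            sim(j, choice[j], j2, choice[j2]) > threshold
            for j in range(len(choice)) for j2 in range(j + 1, len(choice)))
        if redundant:
            continue
        cost = sum(ranks[j][l] for j, l in enumerate(choice))
        if cost < best_cost:
            best, best_cost = choice, cost
    return best

# Two topics, two candidates each; the two top-ranked excerpts are near-duplicates,
# so the redundancy constraint forces a lower-ranked excerpt into one slot.
sim = lambda j, l, j2, l2: 0.9 if (l == 0 and l2 == 0) else 0.0
selected = select_excerpts([[1, 2], [1, 2]], sim)
```

A real instance would use an ILP solver rather than enumeration, but the feasible set and cost are the same.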
- Training
- for each section, add the gold excerpt sij to the corresponding candidates eij1...eijr for di and tj
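Training starts from zero weight vectors (see the summary below). A simplified perceptron-style ranking update, which is an assumption about the exact rule (in the paper the update interacts with the full ILP prediction), could look like:

```python
def dot(phi, w):
    """Dot product of a sparse feature vector with the weight vector."""
    return sum(v * w.get(f, 0.0) for f, v in phi.items())

def perceptron_update(w, candidates, gold_idx, lr=1.0):
    """One mistake-driven update for a single topic (simplified sketch):
    if the current top-scoring candidate is not the gold excerpt, move w
    toward the gold feature vector and away from the mistaken prediction."""
    pred_idx = max(range(len(candidates)), key=lambda i: dot(candidates[i], w))
    if pred_idx != gold_idx:
        for f, v in candidates[gold_idx].items():
            w[f] = w.get(f, 0.0) + lr * v
        for f, v in candidates[pred_idx].items():
            w[f] = w.get(f, 0.0) - lr * v
    return w

# Starting from the zero vector, one update separates gold from non-gold features.
w = {}
perceptron_update(w, [{"a": 1.0}, {"b": 1.0}], gold_idx=1)
```

After the update the gold excerpt outscores the previously preferred one, which is the behavior the training loop repeats over all documents and topics.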

- Application
- Given the title of a requested doc, we select several excerpts from the candidate vectors returned by the search procedure.
Summary of the process
- Preprocessing
- Determine the topics that will appear in the new document
- Cluster the headings using Zhao's method
- Eliminate clusters with low internal similarity, so that the training set contains no headings expressing near-identical meanings
- Does this mean that the most frequent heading replaces the other headings?
- Define k as the average number of headings per document, then take the k clusters with the most members
- Order the clusters using Cohen's method; these k clusters form the k topics
- Use the most frequent heading within each cluster to represent the topic
- This heading then stands for one of the k sections of the article; how well does this actually work in practice?
- Search
- Search with Yahoo!; from each of the first 10 result pages, take 6 excerpts
- Number the excerpts
- Learning
- Initialize the vector w to the zero vector
- For every topic of every document, score its candidates with the Rank function
- The weight vector of the Rank function is w; w gives the weight of each excerpt
- Solve a global integer linear program for a 0-1 matrix x that makes the Objective as small as possible
- What does this formula mean? Aren't the excerpts simply numbered in order?

- Compute

- Why is the 0-1 vector x compared against the text? Does this mean the corpus documents are split into paragraphs (excerpts)? Or into words? And why is x not compared with the feature vector of sij?
- Application
- Use the final weight vector w to form a new document