WikiWrite: Generating Wikipedia Articles Automatically
Authors
- Siddhartha Banerjee
- Prasenjit Mitra
Keywords
- WikiWrite: a system capable of generating new Wikipedia content. It learns content templates from similar articles without relying on Wikipedia categories.
- document embedding: text embedding (a vector representation of a text)
- paraphrasing: restating a text in different words
- the system used the original word without any paraphrasing.
- Informativeness: amount of information conveyed
Abstract
- Obtains feature representations of entities on Wikipedia.
- Uses existing work on document embeddings to obtain vector representations of words and paragraphs. (Attention: preprocessor here!)
- Obtains the vector representations of the red-linked entities using the paragraph vector model (Le and Mikolov, 2014), i.e. the PV-DM model.
- Using the word and paragraph representations, identifies articles that are very similar to the new entity on Wikipedia.
- Articles to be identified: existing Wikipedia articles that are semantically close to the red-linked entity (using cosine similarity).
- Trains machine learning classifiers on content from the similar articles to assign web-retrieved content about the new entity to the relevant sections of the new entity's Wikipedia article.
- Proposes a novel abstractive summarization technique that uses a two-step ILP model to synthesize the content assigned to each section and rewrite it into a well-formed, informative summary.
- The paper jointly optimizes sentence order together with the informativeness and linguistic quality of the summary (of the content assigned to the sections in the article).
- Computes the coherence score between any two sentences using transition probabilities of word pairs (nouns and verbs) between the sentences.
- The transition probabilities are learned from pairs of adjacent sentences that exist in the similar articles.
- propose an optimization model to find a suitable set of lexical and phrasal transformations for paraphrasing the generated summaries.
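
The coherence score noted above (transition probabilities of noun and verb pairs between adjacent sentences) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the function names `learn_transitions` and `coherence`, the add-one smoothing, the averaging over word pairs, the toy data, and the assumption that each sentence has already been reduced to a list of its nouns and verbs are all hypothetical choices.

```python
from collections import Counter
from itertools import product


def learn_transitions(documents):
    """Count word-pair transitions between adjacent sentences.

    `documents` is a list of documents; each document is a list of
    sentences; each sentence is a list of content words (assumed to be
    nouns and verbs already extracted by some POS tagger).
    """
    pair_counts = Counter()   # how often w1 (sentence i) precedes w2 (sentence i+1)
    first_counts = Counter()  # how often w1 appears as the first half of any pair
    for sentences in documents:
        for prev, nxt in zip(sentences, sentences[1:]):
            for w1, w2 in product(set(prev), set(nxt)):
                pair_counts[(w1, w2)] += 1
                first_counts[w1] += 1
    return pair_counts, first_counts


def coherence(sent_a, sent_b, pair_counts, first_counts, vocab_size):
    """Average smoothed transition probability P(w2 | w1) over all word pairs."""
    pairs = list(product(set(sent_a), set(sent_b)))
    if not pairs:
        return 0.0
    total = 0.0
    for w1, w2 in pairs:
        # Add-one smoothing (an assumption) keeps unseen pairs above zero.
        total += (pair_counts[(w1, w2)] + 1) / (first_counts[w1] + vocab_size)
    return total / len(pairs)


# Toy "similar articles": two documents of two sentences each.
docs = [
    [["team", "win", "match"], ["coach", "praise", "team"]],
    [["team", "win", "title"], ["coach", "celebrate"]],
]
pair_counts, first_counts = learn_transitions(docs)
vocab = {w for d in docs for s in d for w in s}

forward = coherence(["team", "win"], ["coach"], pair_counts, first_counts, len(vocab))
backward = coherence(["coach"], ["team", "win"], pair_counts, first_counts, len(vocab))
print(forward > backward)  # the order seen in the training pairs scores higher
```

A score of this kind is asymmetric, which is what lets it inform sentence ordering: placing the "team win" sentence before the "coach" sentence scores higher than the reverse because that order was observed in the training pairs.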
1 Introduction
Assumption
- Wikipedia categories are known.
- Articles often belong to multiple categories.
- To avoid copyright violations, text from the internet cannot be used directly.
- An abstractive summarization system that uses sentence fusion.
- Sentences selected (and paraphrased) from multiple documents must be ordered such that the resulting article is coherent.
- The paper jointly optimizes sentence order together with the informativeness and linguistic quality of the summary (of the content assigned to the sections in the article; see 2.1).
Data Source
- Obtain vector representations of the red-linked entities using a paragraph vector model (Le and Mikolov, 2014).
- paragraph vector model: computes continuous distributed vector representations of varying-length texts
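
Once every existing article and the red-linked entity have a vector, finding similar articles reduces to ranking by cosine similarity. The sketch below assumes the vectors were already produced by some paragraph vector implementation; the three-dimensional toy vectors, the article titles, and the helper names `cosine` and `most_similar` are invented for illustration and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def most_similar(query_vec, article_vecs, k=2):
    """Return the k article titles whose vectors are closest to query_vec."""
    ranked = sorted(
        article_vecs,
        key=lambda title: cosine(query_vec, article_vecs[title]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical paragraph vectors for three existing articles.
article_vecs = {
    "Article A": [0.9, 0.1, 0.0],
    "Article B": [0.8, 0.3, 0.1],
    "Article C": [0.0, 0.2, 0.9],
}
red_link_vec = [1.0, 0.2, 0.0]  # hypothetical vector for the new entity

print(most_similar(red_link_vec, article_vecs, k=2))
```

Real paragraph vectors typically have a few hundred dimensions, but the ranking logic stays the same.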
How to evaluate the efficiency
- compare the accuracies with other comparable systems.
- Reconstruct existing articles on Wikipedia and compare the original with the automatically generated version.
- Create 50 new articles on Wikipedia.
Question
Abstract
- 1.1 What is document embedding?