
WikiWrite: Generating Wikipedia Articles Automatically

Authors

  1. Siddhartha Banerjee
  2. Prasenjit Mitra

Keywords

  1. WikiWrite: a system capable of generating new Wikipedia content. It learns content templates from similar articles without relying on Wikipedia categories.
  2. document embedding: text embedding (a vector representation of a text)
  3. paraphrasing: restating text in different words
    1. the system used the original word without any paraphrasing.
  4. informativeness: the amount of information conveyed

Abstract

  1. Obtain feature representations of entities on Wikipedia.

    1. Adapt an existing work on document embeddings to obtain vector representations of words and paragraphs. (Note: a preprocessing step is needed here.)

      • Obtain vector representations of the red-linked entities using the paragraph vector model (Le and Mikolov, 2014), specifically the PV-DM variant.
    2. Using the word and paragraph representations, identify articles that are very similar to the new entity on Wikipedia.

      • Articles to be identified: existing Wikipedia articles that are semantically close to the red-linked entity, as measured by cosine similarity.
    3. Train machine learning classifiers on content from the similar articles to assign web-retrieved content about the new entity to the relevant sections of the new entity's Wikipedia article.
    4. Propose a novel abstractive summarization technique that uses a two-step ILP (integer linear programming) model to synthesize the content assigned to each section and rewrite it into a well-formed, informative summary.
    5. The paper jointly optimizes sentence order together with the informativeness and linguistic quality of the summary (of the content assigned to the sections in the article).
    6. Compute the coherence score between any two sentences using transition probabilities of word pairs (nouns and verbs) between the sentences.
      1. The transition probabilities are learned from pairs of adjacent sentences in the similar articles.
    7. Propose an optimization model to find a suitable set of lexical and phrasal transformations for paraphrasing the generated summaries.
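
The coherence score in item 6 can be illustrated with a small sketch. It assumes each sentence has already been reduced to its nouns and verbs, and it averages the transition probabilities over all word pairs; that aggregate, the function names, and the toy data are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter
from itertools import product


def learn_transitions(adjacent_pairs):
    """Estimate P(next_word | prev_word) from pairs of adjacent sentences.

    Each sentence is a list of content words (nouns and verbs);
    part-of-speech filtering is assumed to have happened upstream.
    """
    pair_counts = Counter()
    prev_counts = Counter()
    for prev_sent, next_sent in adjacent_pairs:
        for w1, w2 in product(set(prev_sent), set(next_sent)):
            pair_counts[(w1, w2)] += 1
            prev_counts[w1] += 1
    return {pair: n / prev_counts[pair[0]] for pair, n in pair_counts.items()}


def coherence(prev_sent, next_sent, probs):
    """Average transition probability over all word pairs (assumed aggregate)."""
    pairs = list(product(set(prev_sent), set(next_sent)))
    if not pairs:
        return 0.0
    return sum(probs.get(p, 0.0) for p in pairs) / len(pairs)


# Toy adjacent-sentence pairs standing in for text from similar articles.
training = [
    (["band", "form"], ["album", "release"]),
    (["band", "tour"], ["album", "record"]),
    (["album", "release"], ["single", "chart"]),
]
probs = learn_transitions(training)
good = coherence(["band", "form"], ["album", "release"], probs)
bad = coherence(["single", "chart"], ["band", "form"], probs)
print(good, bad)
```

A sentence pair whose word transitions were seen in the similar articles scores higher than a pair whose transitions were never observed.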

1 Introduction

Assumptions

  1. Wikipedia categories are known.
  2. Articles often belong to multiple categories.
  3. To avoid copyright violations, text from the internet cannot be used directly.
  4. An abstractive summarization system uses sentence fusion.
  5. Sentences selected (and paraphrased) from multiple documents must be ordered so that the resulting article is coherent.
  6. The paper jointly optimizes sentence order together with the informativeness and linguistic quality of the summary (of the content assigned to the sections in the article, 2.1).
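
To make the ordering requirement concrete, here is a minimal sketch that picks the sentence order with the highest total coherence between neighbours. It covers only the ordering part: the paper optimizes order jointly with informativeness and linguistic quality in an ILP, whereas this sketch swaps in exhaustive search over permutations. The score matrix and the sentences are made up for illustration.

```python
from itertools import permutations


def best_order(sentences, score):
    """Return the ordering that maximizes summed coherence of adjacent sentences.

    score[i][j] is the coherence of placing sentence j right after sentence i.
    Exhaustive search is only practical for a handful of sentences.
    """
    def total(order):
        return sum(score[a][b] for a, b in zip(order, order[1:]))

    best = max(permutations(range(len(sentences))), key=total)
    return [sentences[i] for i in best]


sentences = [
    "The album charted in 2011.",
    "The band formed in 2005.",
    "It released a debut album in 2010.",
]
# Hypothetical pairwise coherence scores (row: previous, column: next).
score = [
    [0.0, 0.1, 0.2],
    [0.1, 0.0, 0.8],
    [0.7, 0.3, 0.0],
]
ordered = best_order(sentences, score)
print(ordered)
```

The number of permutations grows factorially, which is one reason an ILP formulation is the more practical choice at realistic sizes.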

Data Source

  1. Obtain vector representations of the red-linked entities using a paragraph vector model (Le and Mikolov, 2014).
    • paragraph vector model: computes continuous distributed vector representations of varying-length texts
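
Once every text has a vector, finding the existing articles closest to a red-linked entity reduces to ranking by cosine similarity. The sketch below assumes the vectors are already computed; the titles and 3-dimensional vectors are made up, and `most_similar` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(target, article_vectors, k=2):
    """Return the k article titles whose vectors are closest to the target."""
    ranked = sorted(
        article_vectors,
        key=lambda title: cosine(target, article_vectors[title]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical titles with made-up vectors; real paragraph vectors
# would have far more dimensions.
article_vectors = {
    "Existing band A": [0.9, 0.1, 0.0],
    "Existing band B": [0.8, 0.3, 0.1],
    "Existing river C": [0.0, 0.2, 0.9],
}
red_link_vector = [1.0, 0.2, 0.0]
top = most_similar(red_link_vector, article_vectors)
print(top)
```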

How to evaluate effectiveness

  1. Compare the accuracies with those of other comparable systems.
  2. Reconstruct existing Wikipedia articles and compare the original with the automatically generated version.
  3. Create 50 new articles on Wikipedia.
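
Comparing an original article with its generated counterpart needs some overlap measure. These notes do not record which metric is used, so the sketch below assumes plain unigram recall (in the spirit of ROUGE-1) with whitespace tokenization; the example sentences are made up.

```python
from collections import Counter


def unigram_recall(reference, generated):
    """Fraction of reference tokens that also appear in the generated text."""
    ref = Counter(reference.lower().split())
    gen = Counter(generated.lower().split())
    # Clip each token's credit at the number of times it appears in the output.
    overlap = sum(min(n, gen[w]) for w, n in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0


original = "the band released its first album in 2010"
generated = "the band released a debut album in 2010"
recall = unigram_recall(original, generated)
print(recall)
```

Here six of the eight reference tokens are recovered, giving a recall of 0.75.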

Question

Abstract

  1. 1.1 What is a document embedding?