Academic year: 2022
An Effective Approach for Abstractive Text Summarization using Semantic Graph Model

R. Senthamizh Selvan1*, Dr. K. Arutchelvan2

1Research Scholar, 2Assistant Professor,

1,2Department of Computer and Information Science, Annamalai University, Chidambaram, Tamil Nadu, India

*[email protected]

ABSTRACT

Over the last two decades, remarkable advances have been made in the field of hidden Knowledge Extraction (KE) from huge amounts of data. Natural Language Processing (NLP) is a predominant research area practiced in textual analytics. Text summarization is an NLP process that extracts the significant sentences from multiple original documents without affecting their original meaning. This research work provides abstractive text summarization for multiple documents using a semantic graph-based approach. Creating an abstract from multiple documents is a difficult task: the abstract, built from important phrases drawn from large data, must not give up the authenticity of the source. To keep the sentences coherent, this research work proposes a semantic graph in which the relationships between sentences are weighted using a graph ranking methodology. A ranking algorithm based on Pearson's correlation coefficient is proposed to find the important sentences. ROUGE scores are evaluated for the proposed work and compared with the existing TextRank algorithm.

Keywords

Text summarization, Multi-document abstractive summarization, Semantic graph model, Sentence embedding, Graph-based ranking algorithm

Introduction

In this digital era, it is easy to share and extract information on the World Wide Web. Through sophisticated web applications, people and organizations share their thoughts and views about a topic or event via forums, blogs, news sites, etc. This produces voluminous information about the same topic, and it is difficult to read multiple source documents and generate a summary from them. The evolution of Artificial Intelligence (AI) in text analytics makes it easier to obtain a summary from a huge number of documents, and NLP is the prominent research field that makes such text analysis tractable. Text summarization is the process that extracts a summary from a large amount of text without affecting its original meaning. Multi-Document Summarization (MDS) is an essential methodology that creates a succinct outline while keeping the pertinent content of the source documents [1, 2]. Two methodologies are used for MDS: Extractive Summarization (ES) and Abstractive Summarization (ABS). ES extracts salient sentences from the text documents and joins them to make a summary without changing the source text. ABS, in contrast, usually uses semantic techniques and language generation procedures [3, 4] to make a succinct summary that is closer to the way people write. The first attempt at automatic summarization was made in the late 1950s [5].

Several techniques have been used to generate summaries from textual documents. The methodology in [5] uses term frequencies to evaluate sentence significance: sentences are retained for the summary if they contain high-frequency terms. Most researchers have concentrated on Multi-Document Extractive Summarization (MDES), which builds a summary by choosing remarkable sentences from the documents [6]. Statistical techniques are often used to discover keywords and phrases [7], and graph-based methods help to distinguish the main sentences in a document [7]. Machine learning procedures are used to remove redundancy among relevant sentences using a training corpus [8, 9]. In the research papers [10-13], different graph-based methodologies have been investigated for MDES. These techniques use the PageRank algorithm and its variants to compute the overall significance of sentences. Only a few studies have considered Multi-Document Abstractive Summarization (MDAS) in the academic world.

Few research works have been carried out using graph-based methodologies for abstraction; this research work proposes ABS using a graph-based method. A publication database contains internet pages and author information, and graph-based ranking algorithms are used to identify the important pages in such a database. These algorithms are evaluated based on the connections between hyperlinks. In Wikipedia, for example, we can build relationships among multiple web pages based on citations. Another possible way to find the relationships among multiple web pages is to build a Semantic Graph (SG). The SG finds connections between entities (subject, verb and object) among the sentences in several source documents.

A summary is generated from the graph by applying a graph ranking technique. This motivates building a semantic graph model with a novel ranking technique. The novel method includes sentence embedding and weighted nodes, i.e. the degree of the nodes (sentences). This work proposes new pseudo-code for the sentence embedding and graph-based ranking methods.

The rest of the paper is organized as follows. The literature review section discusses the background studies that have been carried out in text summarization. The overall architecture and the proposed semantic graph model are discussed in the proposed methodology section. The results and discussion section presents the details of the experimental evaluation, and the final section concludes this research work.

Literature Review

The researchers [14-16] have defined text summarization as "a text that is produced from one or more texts, which conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that". This definition highlights three important aspects of generating summaries:

• It should be short.

• It may be generated from a single document or multiple documents.

• It should preserve important information.

In the research paper [17], the authors strived to make ABS using two techniques: a syntactic and a semantic methodology. All the linguistics-based methodologies [17-19] proposed for ABS use a syntactic parser to represent and examine the content. A notable drawback of these studies is the absence of a semantic representation of the source document text. It is critical to represent the content semantically, as significant semantic content analysis is needed in abstractive summarization.

Distinct semantic methodologies have been researched for MDAS and are presented as follows. GISTEXTER is an MDS framework introduced by [20], which uses a template-based technique to create abstractive summaries from various news documents. A principal drawback of this technique was that the extraction rules and semantic patterns were manually created, which requires considerable effort and time. A fuzzy ontology approach [22] is presented for summarization of Chinese news, which models uncertain information to describe the domain knowledge in a better way. In this methodology, the Chinese dictionary and domain ontology must be defined explicitly by a human expert, which is a tedious task.

A graph-based method introduced by [23] produced ABS from a semantic model representing a multimodal document. The knowledge represented by ontology concepts is used to build the semantic model. The shortcoming of this system is that it relies on a human expert to build the domain ontology, and it does not transfer to other domains. The procedure introduced by [21] delivers well-written and brief abstractive summaries from groups of news stories on a similar subject. Its limitation was that the generation patterns and Information Extraction (IE) rules were manually composed, which again requires effort and time. Most recently, various graph-based models have gained attention and have been successfully attempted for MDS. These models use the PageRank algorithm [25] and its variants to assign ranks to sentences or passages.

In the research paper [26], the authors introduced a graph-based connectivity model, which assumes that nodes connected to several other nodes are most likely to carry significant information. Lex-PageRank [10] is a methodology that uses the concept of eigenvector centrality for determining sentence importance: it builds a sentence connectivity matrix and uses an algorithm similar to PageRank to determine the important sentences. Another algorithm similar to PageRank is introduced by [12], which determines the salience of sentences for document summarization. The article [27] presented a graph-based methodology which joins text content with surface features, and explores the features of sub-topics in multiple documents to incorporate them into the graph-based ranking algorithm.

An affinity graph-based methodology for MDAS [13] uses an algorithm similar to PageRank, and computes sentence scores in the affinity graph based on information richness. A further graph-based methodology [24] has also been attempted for ABS, which builds a semantic graph for a document from a human-constructed ontology. That methodology relies heavily on human experts, is restricted to a particular domain, and did not report any summarization results. The research papers [28-33] presented frameworks for abstractive summarization in which the document text is represented by a set of graphs, which are transformed into a summary graph.

Methodology

As discussed in the studies above, previous researchers have used semantic graphs to build the relationships among sentences and generate an abstract from large data, and these relationships were ranked using various ranking algorithms. The proposed work introduces a new ranking algorithm that uses the degree of the vertices to rank the connected sentences. The high-level architecture of the proposed abstractive summarization is depicted in Figure 1. The architecture contains three phases:

Phase – 1: Text Preprocessing

Phase – 2: Building Semantic Graph

Phase – 3: Summary Generation

A. Text Preprocessing

In natural language processing, preprocessing is the initial step used to clean and transform the data for the analytical process. The preprocessing phase includes data cleaning, lemmatization, stop-word removal and sentence tokenizing.

Text cleaning:

In social media, much of the textual data contains URL links, emojis, emoticons, etc. These characters are not helpful for the analytical process. The proposed approach is limited to the English language, so non-English characters are also removed from the documents. Characters such as '+', '-', '_', '#', etc. are removed in this process, while some characters are replaced with text: for example, '$' is replaced with 'Dollar' in order to keep the sentence coherent, since the dollar sign denotes money. To achieve this, a new dictionary has been built for some special characters along with their expansions. The cleaned data is then passed to the next step.
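The cleaning step can be sketched in Python as follows. The SYMBOL_MAP table is a hypothetical stand-in for the special-character dictionary described above, which the paper does not list in full.

```python
import re

# Hypothetical symbol-expansion table of the kind described above;
# the paper's actual dictionary is not given.
SYMBOL_MAP = {"$": "Dollar", "%": "percent"}

def clean_text(text):
    """Drop URLs, expand mapped symbols, strip non-English characters,
    and collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)           # remove URL links
    for sym, word in SYMBOL_MAP.items():
        text = text.replace(sym, f" {word} ")           # e.g. '$' -> 'Dollar'
    text = re.sub(r"[^A-Za-z0-9.,!?'\s]", " ", text)    # keep English text only
    return re.sub(r"\s+", " ", text).strip()
```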

Lemmatization:

Lemmatization is an important process in NLP. Unlike stemming, lemmatization identifies the root word. Consider, for example, the following sentences:

Sentence 1: “Please handle the patient carefully.”

Sentence 2: “The doctor is caring about his patient.”

Both sentences indicate care towards the patient. For a human this is understandable, but it is difficult for a machine; lemmatization plays a major role in finding the root word. After applying lemmatization, the sentences are converted as follows:

Sentence 1: "Please handle the patient care."

Sentence 2: "The doctor is care about his patient."

The words "carefully" and "caring" indicate the same meaning, care. This process supports finding redundant sentences and further makes building the word vectors easy.
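As a minimal sketch of this step, a dictionary-based lemmatizer can reproduce the example above. The LEMMAS table is illustrative only; a real system would use a full lemmatizer from an NLP toolkit.

```python
# Illustrative lemma table only; a production system would use a
# toolkit lemmatizer rather than a hand-written dictionary.
LEMMAS = {"caring": "care", "carefully": "care", "handled": "handle"}

def lemmatize(sentence):
    """Replace each token with its root form when the table knows it."""
    return " ".join(LEMMAS.get(tok.lower(), tok) for tok in sentence.split())
```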

Stop words removal:

In NLP, stop-word removal plays a further role in reducing sparsity. Stop words such as "the", "is", "are", etc. occur frequently in sentences but do not carry any meaningful information. These words are removed using a stop-word dictionary.
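The removal itself is a simple filter against the dictionary. The STOP_WORDS set below is a small illustrative sample; in practice a full stop-word list from an NLP toolkit would be used.

```python
# Small illustrative stop-word set; a full dictionary would be used in practice.
STOP_WORDS = {"the", "is", "are", "a", "an", "about", "his"}

def remove_stop_words(sentence):
    """Drop tokens that carry no meaningful information."""
    return " ".join(t for t in sentence.split() if t.lower() not in STOP_WORDS)
```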

Sentence Tokenizing:

After the above processes, the cleaned text in the document is tokenized into sentences. One of the novel parts of the proposed approach is sentence embedding: most previous work has been carried out using word embedding, whereas sentence embedding is used here to reduce the computational complexity and sparsity of the proposed algorithm. The tokenizing process splits the document into sentences and creates a unique identifier for each sentence.
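A minimal sketch of this tokenizing step, splitting on sentence-ending punctuation and assigning each sentence a unique id:

```python
import re

def tokenize_sentences(document):
    """Split cleaned text into sentences and give each a unique id."""
    parts = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    return {sid: sentence for sid, sentence in enumerate(parts)}
```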

Figure 1. Research Process of Semantic Graph Based Abstractive Summarization

B. Building Semantic Graph

Each document is written on, or in relation to, a particular topic, and its length depends on the content of the topic that has been taken. Usually, the sentences in a document relate to each other; these relationships ensure the coherence and flow of the document and create interest for readers. Writing a summary of a document is easy for a human, who knows the meaning and importance of the words, but it is difficult for a machine to generate the summary. In order to make an effective summary from multiple documents, it is important to establish the relationships among the sentences. This paper proposes a novel semantic graph method to generate the summary from multiple documents.

To build the semantic graph, it is necessary to find the similarities between the sentences. Multiple documents about the same topic contain redundant information, and to eliminate the redundant sentences the sentence similarity must be calculated. In this work, cosine similarity is used for this purpose.

This research work uses sentence embedding instead of word embedding. For manipulating a huge number of documents, sentence vectorization is better than word vectorization: sentence embedding is more helpful for finding the similarity between sentences and also helps to identify the important phrases. Pseudo-code 1 represents the process involved in sentence embedding; the algorithm is taken from [34].

Consider D as the set of documents d_k, where k = 1, 2, 3, …, n. Each document contains several sentences, denoted s_j, where j = 1, 2, 3, …, n.

Pseudo-code 1: Embedding Sentences

Input: Word embeddings {v_w} and word probabilities {p(w)}

Output: Sentence embeddings {v_s}

1: for s in S:

2:     v_s = (1/|s|) Σ_{w ∈ s} ( a / (a + p(w)) ) v_w, where a is a smoothing parameter

3: Create matrix A whose columns are the v_s, and let u be its first singular vector

4: for s in S:

5:     v_s = v_s − u uᵀ v_s
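Pseudo-code 1 can be sketched in Python with NumPy, following [34]. The word vectors and unigram probabilities are assumed inputs here; in practice they would come from a pretrained embedding model and corpus counts.

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_prob, a=1e-3):
    """Sentence embeddings per pseudo-code 1: a smoothed weighted average of
    word vectors, then removal of the first singular vector's projection."""
    dim = len(next(iter(word_vecs.values())))
    rows = []
    for sent in sentences:
        v = np.zeros(dim)
        known = [w for w in sent if w in word_vecs]
        for w in known:
            v += (a / (a + word_prob.get(w, 0.0))) * word_vecs[w]
        rows.append(v / max(len(known), 1))
    X = np.vstack(rows)                                   # one row per sentence
    u = np.linalg.svd(X.T, full_matrices=False)[0][:, 0]  # first singular vector
    return X - X @ np.outer(u, u)                         # remove common component
```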

After tokenizing and vectorizing the sentences, the cosine similarity between sentences is computed with formula (1). The value ranges from -1 to 1: values near -1 indicate dissimilarity, while values near 1 indicate similarity. For the proposed work, a threshold value greater than 0.75 is treated as near-perfect similarity. Each sentence is compared with all other sentences, and a similarity matrix is created from the resulting values. The semantic graph is then built using the similarities of the sentences. Highly identical sentences, according to their cosine values, are removed from the documents; this reduces the redundant information across the multiple documents.

sim(S_i, S_j) = (v_i · v_j) / (‖v_i‖ ‖v_j‖) ………. (1)

where v_i and v_j are the embedding vectors of sentences S_i and S_j.
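Formula (1) and the 0.75 redundancy threshold can be sketched as follows; the embedding matrix E is assumed to come from the sentence-embedding step.

```python
import numpy as np

def cosine_similarity_matrix(E):
    """Pairwise cosine similarity (formula 1) of the row vectors in E."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    unit = E / np.clip(norms, 1e-12, None)   # guard against zero vectors
    return unit @ unit.T

def drop_redundant(sentences, sim, threshold=0.75):
    """Keep the first of any pair whose similarity exceeds the threshold."""
    kept = []
    for i in range(len(sentences)):
        if all(sim[i, j] <= threshold for j in kept):
            kept.append(i)
    return [sentences[i] for i in kept]
```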

The semantic graph is determined by the relationships among the sentences. Entities such as pronouns, locations, dates, etc. in the different sentences are linked with each other, and the graph between the sentences is built from these entities. Consider each sentence as a node (vertex), denoted N; the relationship between nodes is represented as an edge E, and w denotes the weight of a node. In the graph G = (N, E, w), the in and out edges play a major role in keeping the sentences coherent. A sentence contains several entities, and since this work builds the graph from the entity relationships among the sentences, each sentence may be connected several times, in and out, due to coreference. The degree of a vertex is calculated from the nodes that connect to it, as defined in formula (2).

w(n_i) = deg_in(n_i) + deg_out(n_i) ……….. (2)

Pseudo-code 2: Proposed Semantic Graph Model

Input: Multi-document

Output: Abstractive Summary

1: doc = {}

2: sen = []

3: for d in D:

4: for s in d:

5: embed the sentences using algorithm 1

6: sen.append(embedded_sentence)

7: doc.update(sen)

8: Create matrix of sentence similarity using formula (1)

9: Build the semantic graph with entities relationship

10: Find the weight of sentences using formula (2)

11: for s in d:

12: if w(s) == 0 and s is not relevant to the topic:

13: eliminate s from d

14: if w(s) > 1:

15: Rank the nodes in descending order with weight of sentences

16: Generate Summary

C. Summary Generation

The process of the proposed work is given in pseudo-code 2. In steps 1-7, the documents are tokenized and given unique document ids for future reference, and the sentences in each document are referred to by sentence ids. The sentences are embedded using pseudo-code 1. Step 8 creates the sentence similarity matrix using formula (1). The keyword or topic plays the major role in eliminating irrelevant or unimportant sentences. The cosine similarity value ranges from -1 to 1: values near -1 indicate dissimilar sentences and values near 1 indicate similar sentences, which helps to remove redundancy. A dissimilar sentence may still carry a core part of the content, so to confirm the importance of a sentence, its similarity to the topic or keyword is calculated in steps 11 to 13. The sentence ranking is done in steps 14-15, and the abstractive summary is generated based on this importance.
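Steps 9-15 of pseudo-code 2 can be sketched as follows. The `entities_of` callable stands in for an entity-extraction step (NER and coreference resolution) whose implementation the paper leaves unspecified.

```python
from collections import defaultdict

def build_entity_graph(sentences, entities_of):
    """Link two sentences when they share an entity; a node's weight is its
    degree, i.e. the number of edges touching it (formula 2)."""
    weight = defaultdict(int)
    edges = []
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if entities_of(sentences[i]) & entities_of(sentences[j]):
                edges.append((i, j))
                weight[i] += 1
                weight[j] += 1
    return edges, weight

def rank_sentences(sentences, weight):
    """Drop zero-weight sentences and rank the rest by descending weight
    (steps 11-15 of pseudo-code 2)."""
    order = sorted((i for i in range(len(sentences)) if weight[i] > 0),
                   key=lambda i: weight[i], reverse=True)
    return [sentences[i] for i in order]
```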

Results and Discussions

To evaluate the performance of the proposed semantic graph algorithm, this work used the CNN dataset [ref]. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is used to evaluate the summarization technique.

ROUGE-N = Σ_{gram_N ∈ r} Count_match(gram_N, c) / Σ_{gram_N ∈ r} Count(gram_N) …………..(3)

where N denotes the n-gram length, c is the candidate summary and r is the reference summary. This research work reports the following metrics for the evaluation of the proposed semantic graph model:


• ROUGE-1,

• ROUGE-2, and

• ROUGE-L
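A minimal ROUGE-N recall computation following formula (3) might look like this; real evaluations use an established ROUGE package rather than this sketch.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall (formula 3): matched n-grams divided by the
    number of n-grams in the reference."""
    def ngrams(text, n):
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    matched = sum(min(cnt, cand[g]) for g, cnt in ref.items())
    total = sum(ref.values())
    return matched / total if total else 0.0
```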

Table 1 presents the results of the proposed work. The proposed semantic graph-based method is compared with the baseline TextRank method and outperforms it on all three metrics. Figure 2 depicts the corresponding bar chart.

Table 1. Results for the CNN Dataset

Techniques         ROUGE-1   ROUGE-2   ROUGE-L
Proposed Method      47.68     25.40     42.89
TextRank             41.85     22.45     39.47

Figure 2. Comparison of Techniques

Conclusion

The importance of the semantic graph in abstractive text summarization for multiple documents has been discussed in this research paper. To address this challenge, the work proposed a novel semantic graph-based method for generating a summary from multiple documents. Sentence embedding is a new feature that has been utilized for finding sentence similarity; unlike word embedding, it reduces the computational time and sparsity. Cosine similarity performs well in identifying similar sentences and helps to eliminate redundant ones. To maintain sentence coherence, the degree of the vertices (sentences) is evaluated: the in and out edges between the sentences are identified and support the flow of the sentences. Sentence embedding also helps to identify common phrases and generate new sentences. The proposed method was evaluated on the CNN dataset and compared with the TextRank algorithm, and its results are better than those of TextRank.

References

[1] Fattah, M.A., Ren, F. (2009): GA, MR, FFNN, PNN and GMM based models for automatic text summarization. Computer Speech and Language, 23(1), pp. 126-144.

[2] Barzilay, R., McKeown, K.R. (2005): Sentence fusion for multidocument news summarization. Computational Linguistics, 31(3), pp. 297-328.

[3] Das, D., Martins, A.F. (2007): A survey on automatic text summarization. Literature Survey for the Language and Statistics II course at CMU, 4, pp. 192-195.

[4] Ye, S., Chua, T.-S., Kan, M.-Y., Qiu, L. (2007): Document concept lattice for text understanding and summarization. Information Processing and Management, 43(6), pp. 1643-1662.

[5] Luhn, H.P. (1958): The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2), pp. 159-165.

[6] Kupiec, J., Pedersen, J., Chen, F. (1995): A trainable document summarizer. In: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, USA, 9-13 July 1995, pp. 68-73. ACM.

[7] Knight, K., Marcu, D. (1999): Statistics-based summarization - step one: sentence compression. In: Proceedings of the National Conference on Artificial Intelligence 2000, pp. 703-710. AAAI Press, Menlo Park.

[8] Larsen, B. (1999): A trainable summarizer with knowledge acquired from robust NLP techniques. Advances in Automatic Text Summarization, 71.

[9] Fattah, M.A. (2014): A hybrid machine learning model for multi-document summarization. Applied Intelligence, 40(4), pp. 592-600.

[10] Erkan, G., Radev, D.R. (2004): LexPageRank: prestige in multi-document text summarization. In: EMNLP 2004, pp. 365-371.

[11] Erkan, G., Radev, D.R. (2004): LexRank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research (JAIR), 22(1), pp. 457-479.

[12] Mihalcea, R., Tarau, P. (2005): A language independent algorithm for single and multiple document summarization. https://www.aclweb.org/anthology/I05-2004

[13] Wan, X., Yang, J. (2006): Improved affinity graph based multi-document summarization. In: Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, New York City, USA, June 2006, pp. 181-184. ACL.

[14] Radev, D., Hovy, E., McKeown, K. (2002): Introduction to the special issue on summarization. Computational Linguistics, 28(4), pp. 399-408.

[15] Svore, K., Vanderwende, L., Burges, C. (2007): Enhancing single-document summarization by combining RankNet and third-party sources. In: Proceedings of EMNLP-CoNLL, pp. 448-457.

[16] Evans, D., McKeown, K., Klavans, J. (2005): Similarity-based multilingual multi-document summarization. Technical Report CUCS-014-05, Department of Computer Science, Columbia University.

[17] Barzilay, R., McKeown, K.R., Elhadad, M. (1999): Information fusion in the context of multi-document summarization. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, Maryland, 20-26 June 1999, pp. 550-557. ACL.

[18] Tanaka, H., Kinoshita, A., Kobayakawa, T., Kumano, T., Kato, N. (2009): Syntax-driven sentence revision for broadcast news summarization. In: Proceedings of the 2009 Workshop on Language Generation and Summarisation, Suntec, Singapore, 6 August 2009, pp. 39-47. ACL.

[19] Genest, P.-E., Lapalme, G. (2011): Framework for abstractive summarization using text-to-text generation. In: Proceedings of the Workshop on Monolingual Text-to-Text Generation, Oregon, USA, 24 June 2011, pp. 64-73. ACL.

[20] Harabagiu, S.M., Lacatusu, F. (2002): Generating single and multi-document summaries with GISTEXTER. In: Document Understanding Conferences, Pennsylvania, USA, 11-12 July 2002, pp. 40-45. NIST.

[21] Genest, P.-E., Lapalme, G. (2012): Fully abstractive approach to guided summarization. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, Jeju Island, Korea, 8-14 July 2012, pp. 354-358. ACL.

[22] Lee, C.-S., Jian, Z.-W., Huang, L.-K. (2005): A fuzzy ontology and its application to news summarization. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 35(5), pp. 859-880.

[23] Greenbacker, C.F. (2011): Towards a framework for abstractive summarization of multimodal documents. In: ACL HLT 2011, p. 75.

[24] Moawad, I.F., Aref, M. (2012): Semantic graph reduction approach for abstractive text summarization. In: 7th International Conference on Computer Engineering and Systems (ICCES), 2012, pp. 132-138. IEEE.

[25] Page, L., Brin, S., Motwani, R., Winograd, T. (1999): The PageRank citation ranking: bringing order to the web.

[26] Mani, I., Bloedorn, E. (1999): Summarizing similarities and differences among related documents. Information Retrieval, 1(1-2), pp. 35-67.

[27] Zhang, J., Sun, L., Zhou, Q. (2005): A cue-based hub-authority approach for multi-document text summarization. In: Proceedings of the 2005 IEEE International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE '05), pp. 642-645. IEEE.

[28] Wei, F., Li, W., Lu, Q., He, Y. (2010): A document-sensitive graph model for multi-document summarization. Knowledge and Information Systems, 22(2), pp. 245-259.

[29] Ge, S.S., Zhang, Z., He, H. (2011): Weighted graph model based sentence clustering and ranking for document summarization. In: 4th International Conference on Interaction Sciences (ICIS), 2011, pp. 90-95. IEEE.

[30] Nguyen-Hoang, T.-A., Nguyen, K., Tran, Q.-V. (2012): TSGVi: a graph-based summarization system for Vietnamese documents. Journal of Ambient Intelligence and Humanized Computing, 3(4), pp. 305-313.

[31] Cheung, J.C.K., Penn, G. (2013): Towards robust abstractive multi-document summarization: a caseframe analysis of centrality and domain. In: ACL (1), pp. 1233-1242.

[32] Glavaš, G., Šnajder, J. (2014): Event graphs for information retrieval and multi-document summarization. Expert Systems with Applications, 41(15), pp. 6904-6916.

[33] Liu, F., Flanigan, J., Thomson, S., Sadeh, N., Smith, N.A. (2015): Toward abstractive summarization using semantic representations. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, 1-5 June 2015, pp. 1077-1086. ACL.

[34] Arora, S., Liang, Y., Ma, T. (2017): A simple but tough-to-beat baseline for sentence embeddings. In: Proceedings of ICLR 2017.
