Following a discussion around summarization in Part 1, Part 2 discusses the method of pure abstraction in light of recent advances in deep learning and AI.
In the first part of this paper, we delved into medical writing: its business use case, the current landscape, and the challenges medical writers face while authoring a medical document. The use of AI (Artificial Intelligence) and automation was espoused to streamline the process and reduce the repetitive and cumbersome nature of this job. The entire knowledge base can be summarized into a single document using next-generation technologies of NLP (Natural Language Processing) and NLG (Natural Language Generation), thus saving significant time and effort.
As discussed in the previous paper, summarization is a method where information present in multiple documents is consolidated and presented in a single document. Both summarization techniques, namely extractive and abstractive, have their pros and cons, which restrict us from applying them directly to solve the problem.
However, with recent advances in deep learning and artificial intelligence, we can certainly mould these techniques to our benefit. In this part, we discuss the method of pure abstraction, which can be instrumental in generating new sentences.
Medical Writing Automation (MWA) is a framework that leverages the techniques and algorithms of Natural Language Processing (NLP) and Natural Language Generation (NLG) to generate articles. The AI-generated report includes sections such as introduction, methodology, discussion, study objectives and inclusion-exclusion criteria.
MWA provides a pure abstraction method. This method is entirely dependent on AI and NLG, and uses pre-trained, state-of-the-art models to generate new statements. Keywords, that is, terms that can be used to create a storyline of the article, are extracted from each individual article, resulting in a pooled set of keywords. All keywords are then compared, and the ones with the highest priority and frequency of occurrence are selected.
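The keyword pooling and selection step can be sketched as follows. This is a minimal illustration, not the MWA implementation: the stop-word list, the function names and the use of "number of articles containing the keyword" as a stand-in for priority are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real system would use
# a fuller, domain-aware list.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are",
              "for", "with", "on", "was", "were", "that", "by"}


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stop-word tokens of one article."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(top_n)]


def select_keywords(articles, per_article=5, final_n=5):
    """Pool per-article keywords and keep those found in the most articles.

    Assumption: the number of articles in which a keyword appears is used
    as a simple proxy for its priority.
    """
    pooled = Counter()
    for text in articles:
        # set() ensures each article votes at most once per keyword.
        pooled.update(set(extract_keywords(text, per_article)))
    return [word for word, _ in pooled.most_common(final_n)]
```

Keywords that tie on article count are returned in no guaranteed order in this sketch; a production system would need an explicit tie-breaking rule, such as total term frequency.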
For this task, an AI model that has been pre-trained on large amounts of medical data and performs text-to-text generation is used. The keywords are fed into the AI model in sequence: the model first constructs a sentence from the first keyword; this sentence, along with the second keyword, is then passed back to the model, which generates a new sentence. This procedure is repeated for each keyword, ensuring that each newly created sentence retains the context of the previous sentences. For this method of generation to perform well, the model requires a large amount of well-structured training data.
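The feedback loop described above can be sketched as follows. The model itself is left as a pluggable callable, because the article does not name a specific model; `toy_generate` is a hypothetical placeholder, and the way the previous sentence and the next keyword are joined into a prompt is an assumption.

```python
from typing import Callable, List


def generate_chain(keywords: List[str],
                   generate: Callable[[str], str]) -> List[str]:
    """Generate one sentence per keyword, feeding each sentence back in.

    `generate` stands in for a pre-trained text-to-text model: it takes a
    prompt string and returns a generated sentence.
    """
    sentences = []
    previous = ""
    for keyword in keywords:
        # Prompt = previously generated sentence + the next keyword.
        prompt = f"{previous} {keyword}".strip()
        previous = generate(prompt)
        sentences.append(previous)
    return sentences


def toy_generate(prompt: str) -> str:
    """Hypothetical placeholder; a real model call would go here."""
    return f"[sentence conditioned on: {prompt}]"
```

Calling `generate_chain(["diabetes", "insulin"], toy_generate)` shows how the second prompt contains the first generated sentence, which is what lets each new sentence inherit the context of the previous one.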
This MWA system, comprising summarization and abstraction techniques, can be used as an API or integrated with other services or applications as and when required.
As discussed, summarization maintains the proper context of the generated text and is grammatically correct, whereas abstraction generates new sentences from scratch. However, the quality and relevance of the sentences generated by abstraction depend on the training of the model.
Abstractive summarization techniques have a small edge over extractive summarization; however, neither is up to the mark in generating accurate articles. The primary reason is that current technology restricts how well deep learning models can be trained on specific contexts and then generate sentences for them. With advancements in technologies, including concepts such as Transfer Learning and Ensemble Learning, we can build a generic model that understands the context, thus enabling generation of summaries mapped to that context. Leveraging the pure abstraction technique alone will not suffice for medical writing, as the domain is practically limitless, and training an AI model on each and every drug, disease and scenario is next to impossible, owing to the limitations of current technology. A new hybrid approach needs to be developed that combines the advantages of both extractive and abstractive summarization techniques to generate the summary text.
The process of pure abstraction generates new sentences. The AI model used to construct the sentences must be trained on a large amount of data for the sentences to be grammatically correct, and the generated sentences will have the right context only if the model is trained on medical data. In pure abstraction, only the keywords are extracted from the articles, and sentences are generated based on the data on which the AI model has been trained. Figure 3 illustrates this: keywords are extracted from four articles, identified by their PMCIDs, and then fed into the AI model to generate the text.
The quality of the text generated by extractive summarization mimics that of sentences penned by humans, because the method compares sentences written by humans, ranks them and returns the top-rated sentences as the summary. In abstractive generation, however, sentence quality requires a lot of improvement, which can be achieved either by increasing the volume of training material or by narrowing the system down to generating text on certain topics only. The latter makes training easier, as the model has to learn only a limited number of topics.
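The rank-and-return behaviour of extractive summarization can be illustrated with a small sketch. The scoring rule used here (average word frequency per sentence) is an assumption chosen for brevity; the article does not specify which ranking method MWA uses.

```python
import re
from collections import Counter


def extractive_summary(text, top_k=2):
    """Return the top_k highest-scoring sentences, in their original order.

    Assumed scoring rule: a sentence's score is the average frequency,
    across the whole text, of the words it contains.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)
                 if s.strip()]
    word_freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(word_freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:top_k]
    # Sentences are copied verbatim from the source, never rewritten.
    return [s for s in sentences if s in top]
```

Because every returned sentence is copied verbatim from the source, the output reads like human writing, and for the same reason it carries the plagiarism concern raised in this article.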
The summarization techniques cannot replace human medical writers entirely. They can only show the writers where to look and offer a certain degree of assistance by providing a summary of the published literature pertaining to the searched topic, while considering the timeline of the search. Choosing abstractive over extractive generation is usually a trade-off between the plagiarism score and the quality of the generated text.
Although extractive summarization produces good results, it cannot be used directly as the final write-up because of the level of plagiarism. It can therefore only assist the medical writer by providing a base and guidance as to where and how to begin writing. Users can run the system numerous times on different search terms to obtain extractive summaries, which give the writer an overview of what has been written and documented about those terms over time. This allows the writer to save a significant amount of time.
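To make the plagiarism concern concrete, the sketch below computes a rough overlap score: the fraction of word trigrams in a generated text that also occur in the source. This is only an illustrative proxy under stated assumptions; the article does not say how the plagiarism score is computed, and commercial plagiarism checkers use different and more elaborate methods.

```python
def ngrams(text, n=3):
    """Return the set of word n-grams in a text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_score(candidate, source, n=3):
    """Fraction of the candidate's n-grams that also appear in the source."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & ngrams(source, n)) / len(cand)
```

Under this measure, a sentence lifted verbatim from the source scores 1.0, while a sentence that shares no three-word run with the source scores 0.0.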
Text generation by pure abstraction produces unique sentences, but if the model is not trained on the context of the text being generated, the sentences may not always make sense or have a proper context. This depends entirely on the sort of data used to train the AI model. If the AI model is trained on common English literature, it will be capable of generating grammatically accurate sentences, but will lack the ability to construct medically meaningful (domain-specific) sentences.
Likewise, if the model is trained on data related to a specific medical topic such as diabetes, it will correctly frame sentences from keywords related to diabetes. However, if the keywords relate to other conditions, the model will not be able to produce the correct output, and the generated sentences will be biased towards diabetes. This is the most difficult aspect of our strategy. To overcome this barrier and create an AI model capable of generating statements on any medical condition, a large amount of diversified training data is required, along with constant updating of the training data to ensure that no topic is missed.
A text mathematics approach can be evaluated. This is a hybrid approach that combines the summarization and abstractive methods of generation. In text mathematics, the text is analysed using NLP techniques and high-priority keywords are extracted from it. These keywords are used to prepare a storyline for the article to be generated. Using n-gram techniques, an AI model can be trained to obtain the correct sequence of the extracted keywords. The keywords carry the important information to be conveyed, and once they are in the correct sequence, the only remaining task is to complete the sentences by adding fillers and grammar.
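One simple way to realise the n-gram sequencing idea is sketched below. It counts how often one keyword follows another in a reference corpus (a bigram model restricted to the keywords) and then picks the ordering with the highest total count. The function name, the restriction to bigrams and the exhaustive search are assumptions made for illustration; the article does not describe the actual model.

```python
import re
from collections import Counter
from itertools import permutations


def order_keywords(keywords, corpus):
    """Order keywords by how often they follow one another in the corpus.

    Each corpus sentence is reduced to the keywords it contains, and
    bigrams are counted over that reduced sequence. Every ordering of the
    keywords is then scored by summing the counts of its adjacent pairs.
    """
    vocab = set(keywords)
    counts = Counter()
    for sentence in corpus:
        seq = [t for t in re.findall(r"[a-z]+", sentence.lower())
               if t in vocab]
        counts.update(zip(seq, seq[1:]))

    def score(order):
        return sum(counts[pair] for pair in zip(order, order[1:]))

    # Exhaustive search is factorial in the number of keywords, so this
    # is only practical for small keyword sets.
    return list(max(permutations(keywords), key=score))
```

For longer keyword lists, a greedy or beam search over the same bigram counts would be a more practical choice than trying every permutation.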
The storyline prepared in this way will be good enough to convey the idea or information correctly. The keywords, arranged in the proper sequence, will then be fed into the AI model trained on medical data to generate sentences. This approach is expected to be better than feeding in the keywords one by one without any sequence, as it will help generate sentences that are more contextually related to each other.
In summary, the text mathematics approach uses NLP techniques to extract keywords from text and arrange them in sequence, and then uses NLG to generate new sentences. This strategy, once implemented, is expected to produce far better outcomes than the abstractive or summarization methods alone.
As a result, with today's technology, it is best to take a hybrid strategy and generate sections of articles using both extractive and abstractive summarization strategies. Although the generated papers will require some human review, this technology will help medical writers be more productive and increase their efficiency manyfold. Writers can use the generated text to obtain a summary of the research work that has been done in the searched context, giving them a starting point from which they can build their new document in a seamless manner.
Saurabh Das, Head, Research and Innovation; Niketan Panchal, Researcher; Ashutosh Pachisia, Data Scientist; Rohit Kadam, Researcher; Prashant Chaturvedi, Data Scientist; Dr. Ashish Indani, Former Head, Research and Innovation; all with TCS ADD™ Platforms