<h1>Beyond SQuAD: How to Apply a Transformer QA Model to Your Data</h1>
<p>Published 2020-07-22</p>
<p>Implementing an IR QA system in the real world is a nuanced affair. As we got deeper into <a href="http://qa.fastforwardlabs.com">our QA journey</a>, we began to wonder: how well would a Reader trained on SQuAD2.0 perform on a real-world corpus? And what if that corpus were highly specialized - perhaps a collection of legal contracts, financial reports, or technical manuals? In this post, we describe our experiments designed to highlight how to adapt Transformer models to specialized domains, and provide guidelines for practical applications.</p>
<p><img src="/images/post5/morning-brew-D-3g8pkHqCc-unsplash.jpg" alt="" /><br />
Photo by <a href="https://unsplash.com/@morningbrew?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Morning Brew</a> on <a href="https://unsplash.com">Unsplash</a></p>
<h1 id="specialized-domains">Specialized Domains</h1>
<p>Training a Transformer language model on the SQuAD dataset provides the model with the ability to supply short answers to general-knowledge, factoid-style questions. When considering more diverse applications, these models will perform well on similar question/answer types over text that is comparable in vocabulary, grammar, and style to Wikipedia (or general text media such as that found on the web) – essentially, text that is comparable to what the model was originally trained on. This encompasses a great many use cases.</p>
<p>For example, a QA system applied to your company’s general employee policies will likely be successful, as the text is typically not highly specialized or overly technical, and questions posed would most likely be fact-seeking in nature. (“When are the black out dates for company stock sales?” or “What building is Human Resources located in?”) In fact, this type of QA system could be viewed as a more sophisticated and intuitive internal FAQ portal.</p>
<p>Change either of these specs – question/answer type or text domain – and the accuracy of a SQuAD-trained QA model becomes less assured. It’s also important to note that these two characteristics are not independent. Question type is intricately linked to answer type, and both can be heavily influenced by the style of text from which answers are to be extracted. For example, a corpus of recipes and cookbooks would likely be heavy on questions such as “How do I boil an egg?” or “When should I add flour?” – questions that typically require longer answers to explain a process.</p>
<h1 id="assessing-a-general-qa-model-on-your-domain">Assessing a General QA model on Your Domain</h1>
<p>Whether you know you have a specialized QA task or not, one sure-fire way to determine if your SQuAD-trained QA model is performing adequately is to validate it. In this blog series, we’ve demonstrated quantitative performance evaluation by measuring exact match (EM) and F1 scores on annotated QA examples. We recommend generating at least a few dozen to a couple hundred examples to sufficiently cover the gamut of question and answer types for a given corpus. Your model’s performance on this set can serve as a guide as to whether your model is performing well enough as-is or if it perhaps requires additional training. (Note: performance level should be set keeping in mind both the business need and the relative quality on the SQuAD dev set. For example, if your SQuAD-trained QA model is achieving an F1 score of 85 on the SQuAD dev set, it’s unrealistic to expect it to perform at 90+ on your specific QA task.)</p>
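<p>To make these two metrics concrete, here is a minimal sketch of SQuAD-style exact match and F1 scoring. It mirrors the normalization steps of the official SQuAD evaluation script as we understand them (lowercasing, stripping punctuation and English articles, collapsing whitespace). The function names and the input shapes expected by <code>evaluate</code> are our own illustrative choices, not part of any library.</p>

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    # If either side is empty (a "no answer" case), score 1 only if both are empty.
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions, references):
    """Average EM and F1 (as percentages) over an annotated set.

    predictions: {question_id: predicted answer string}
    references:  {question_id: [acceptable answer strings]}
    """
    if not references:
        raise ValueError("references must contain at least one example")
    em_total, f1_total = 0.0, 0.0
    for qid, truths in references.items():
        pred = predictions.get(qid, "")
        # Score against whichever acceptable answer matches best.
        em_total += max(exact_match(pred, t) for t in truths)
        f1_total += max(f1_score(pred, t) for t in truths)
    n = len(references)
    return {"exact_match": 100 * em_total / n, "f1": 100 * f1_total / n}
```

<p>Taking the maximum over the acceptable answers means that a question with several valid annotations is scored against whichever annotation the prediction matches best.</p>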
<p>Developing QA annotations can be a time-consuming endeavor. It turns out, though, that this investment can yield more than just a path to model validation. As we’ll see, we can significantly improve underperforming QA models by further fine-tuning them on a set of specialized QA examples.</p>
<p>Aiding in this endeavor are new tools that make QA annotation swift and standardized, like deepset’s <a href="https://github.com/deepset-ai/haystack/">Haystack Annotation</a>. <a href="https://deepset.ai/">deepset</a> is an NLP startup that maintains an open source <a href="https://github.com/deepset-ai/haystack/">library</a> for question answering at scale. Their annotation application allows the user to upload their documents, annotate questions and answers, and export those annotations in the SQuAD format – ready for training or evaluation.</p>
<p><img src="/images/post5/haystack_annotation_tool.png" alt="" title="Screenshot of deepset's Haystack Annotation interface from the haystack repo" /></p>
<p>Once you have a dataset tailored to your use case, you can assess your model and determine whether additional intervention is warranted. Below, we’ll explain how we used an open-source domain-specific dataset to perform a series of experiments, in order to determine successful strategies and best practices for applying general QA models to specialized domains.</p>
<h1 id="experimenting-with-qa-domain-adaptation">Experimenting with QA Domain Adaptation</h1>
<p>You’ve trained your model on SQuAD and it can handle general factoid-style question answering tasks, but how well will it perform on a more specialized task that might be rife with jargon and technical content, or require long-form answers? Those are the questions we sought to answer. We note that, since these experiments were performed on only one highly specialized dataset, the results we demonstrate are not guaranteed in your use case. Instead, we seek to provide general guidelines for improving your model’s performance.</p>
<h2 id="domain-specific-qa-datasets">Domain-Specific QA Datasets</h2>
<p>Research on general question answering has received much attention over the past few years, spurring the creation of several large, open-domain datasets such as <a href="https://rajpurkar.github.io/SQuAD-explorer/">SQuAD</a>, <a href="https://www.microsoft.com/en-us/research/project/newsqa-dataset/">NewsQA</a>, <a href="https://ai.google.com/research/NaturalQuestions">Natural Questions</a>, and more. QA for specialized domains has received far less attention - and thus, specialized datasets remain scarce, with the most notable open-source examples residing in the medical domain. These datasets typically contain a couple thousand examples. For our experiments, we combined two such datasets, which we briefly describe below.</p>
<p><strong>BioASQ</strong></p>
<p><a href="http://bioasq.org/">BioASQ</a> is a large-scale biomedical semantic indexing and question answering challenge organizer. Their dataset contains question and answer pairs that are created by domain experts, which are then manually linked to related science (<a href="https://pubmed.ncbi.nlm.nih.gov/">PubMed</a>) articles. We used 1504 QA examples that were converted into a SQuAD-like format by <a href="https://arxiv.org/abs/1910.09753">these authors</a>. Their modified BioASQ dataset can be found <a href="https://github.com/mrqa/MRQA-Shared-Task-2019">here</a>. (Note: <a href="http://participants-area.bioasq.org/">registration</a> is required to use BioASQ data.)</p>
<p><strong>COVID-QA</strong></p>
<p>This QA dataset, led by researchers at <a href="https://deepset.ai/">deepset</a>, is based on the <a href="https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge">COVID-19 Open Research Dataset</a>. It contains 2,019 question/answer pairs, annotated by volunteer biomedical experts. You can find the dataset <a href="https://github.com/deepset-ai/COVID-QA">here</a> and learn more about it in <a href="https://openreview.net/pdf?id=JENSKEEzsoU">their paper</a>.</p>
<h3 id="dataset-characteristics">Dataset Characteristics</h3>
<p>How does this dataset compare to SQuAD? In this section, we highlight some of the key characteristics of our hybrid medical dataset.</p>
<p><strong>Question Type</strong></p>
<p>Here is a sample of questions from this medical dataset:</p>
<ul>
<li>Which gene is responsible for the development of Sotos syndrome?</li>
<li>How many disulfide bridges has the protein hepcidin got?</li>
<li>Which is the cellular localization of the protein Opa1?</li>
<li>Which drug should be used as an antidote in benzodiazepine overdose?</li>
<li>What is the main cause of HIV-1 infection in children?</li>
<li>What is DC-GENR and where is it expressed?</li>
<li>What is another name for IFITM5?</li>
<li>What is the size of bovine coronavirus?</li>
</ul>
<p>We see there are a lot of technical medical terms (“hepcidin,” “IFITM5”), as well as some more recognizable words (that likely have different implications or interpretations in a medical context - e.g., “localization,” “expressed”). However, the questions are overall generally factoids, similar to the SQuAD dataset. Below are the most common question types in the combined dataset.</p>
<p><img src="/images/post5/most_common_question_types.png" alt="" /></p>
<p><strong>Context Length</strong></p>
<p>While both datasets rely on scientific medical research articles for context, structure varies between them. The BioASQ contexts are subsections or paragraphs of research articles, while the COVID-QA contexts include the full research article. When combined, they yield a dataset with some very disparate context lengths.</p>
<p><img src="/images/post5/bioasq_covidqa_tokens_per_context.png" alt="" /></p>
<p>The BioASQ contexts contain an average of about 200 tokens, while the COVID-QA contexts contain roughly 20 times that – an average of nearly 4000 tokens per context! This context length diversity is highly unlike SQuAD, and might be more indicative of a real-world dataset (since there is no reason to suspect uniform document length in any given corpus).</p>
<p><strong>Answer Length</strong></p>
<p>While the question types are similar to SQuAD, there are some stark differences in answer lengths. 97.6% of the answers in the BioASQ set consist of five or fewer tokens; this is very similar to SQuAD answer lengths. However, only 35% of answers in the COVID-QA set have fewer than five tokens, with the average at 14 tokens. Another full third of the answers are even longer than that - with the longest clocking in at 144 tokens! That’s basically a paragraph, and quite different from answers seen in the SQuAD dataset.</p>
<p>The combined medical datasets yield a total of 3523 QA examples. We pulled out 215 as a holdout (dev set), leaving us 3308 for training.</p>
<h2 id="standard-transfer-learning-to-a-specialized-domain">Standard Transfer Learning to a Specialized Domain</h2>
<p><img src="/images/post5/ff14-57.png" alt="" title="Stages of Transfer Learning: (top) A Transformer model first learns language modeling through semi-supervised training on massive corpora of unstructured text, such as Wikipedia and the web. (middle) That same model learns a specific task, such as question answering, by supervised training (fine-tuning) on the SQuAD dataset. (bottom) Additional fine-tuning on a set of specialized QA examples allows the same model to perform better question answering in a specific domain. At each stage, transfer learning ensures that fewer examples are necessary to improve on the next task, since the model can bootstrap from previously learned statistical relationships." /></p>
<p>If fine-tuning a pre-trained language model on the SQuAD dataset allowed the model to learn the task of question answering, then applying transfer learning a second time (fine-tuning on a specialized dataset) should provide the model some knowledge of the specialized domain. While this standard application of transfer learning is not the <em>only</em> viable method for teaching a general model specialized QA, it’s arguably the most intuitive (and simplest) to execute. However, we needed to take care during execution. We only had ~3300 examples for training, which is a far cry from the ~100k in the SQuAD dataset.</p>
<p>In a thorough analysis, we would perform a hyperparameter search over epochs, batch size, learning rate, etc., to determine the best set of hyperparameter values for our task - while being mindful of overfitting (which is easy to do with small training sets). However, even with a chosen, fixed set of hyperparameter values, <a href="https://arxiv.org/abs/2002.06305">research has shown</a> that training results can vary substantially due to different random seeds. Evaluating a model through cross-validation allows us to assess the size of this effect. Unfortunately, both cross-validation and hyperparameter search (another cross-validation) are costly and compute-intensive endeavors.</p>
<p>In the practical world, most ML and NLP practitioners use hyperparameters that have (hopefully) been vetted by academics. For example, most people (including us) fine-tune on the SQuAD dataset, using the hyperparameters published by the original BERT authors. For this example, we used the hyperparameters <a href="https://openreview.net/forum?id=JENSKEEzsoU">published</a> by the authors of the COVID-QA dataset. (While we combined their dataset with BioASQ, we felt these hyperparameters were nonetheless a good place to start.)</p>
<table>
<thead>
<tr>
<th>Parameter</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>language model</td>
<td><a href="https://huggingface.co/distilbert-base-uncased">distilbert-base-uncased</a></td>
</tr>
<tr>
<td>general QA model</td>
<td><a href="https://huggingface.co/twmkn9/distilbert-base-uncased-squad2">twmkn9/distilbert-base-uncased-squad2</a></td>
</tr>
<tr>
<td>batch size</td>
<td>80</td>
</tr>
<tr>
<td>epochs</td>
<td>2</td>
</tr>
<tr>
<td>learning rate</td>
<td>3e-5</td>
</tr>
<tr>
<td>max seq len</td>
<td>384</td>
</tr>
<tr>
<td>doc stride</td>
<td>192</td>
</tr>
<tr>
<td>cross val folds</td>
<td>5</td>
</tr>
</tbody>
</table>
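<p>The <code>max seq len</code> and <code>doc stride</code> values matter a great deal for long contexts like those in COVID-QA, because a context that does not fit in one input must be split into overlapping windows. The sketch below illustrates the idea under simplifying assumptions: it treats <code>doc stride</code> as the step between window starts (some libraries instead define stride as the overlap between windows), and it ignores the question and special tokens that share the 384-token budget in a real tokenizer. The function name is hypothetical.</p>

```python
def sliding_windows(tokens, max_context_len=384, doc_stride=192):
    """Split a token list into overlapping (start, end) windows.

    Assumptions (ours, for illustration): doc_stride is the step between
    window starts, and the whole budget is available to the context.
    """
    if max_context_len <= 0 or doc_stride <= 0:
        raise ValueError("max_context_len and doc_stride must be positive")
    windows = []
    start = 0
    while True:
        end = min(start + max_context_len, len(tokens))
        windows.append((start, end))
        if end == len(tokens):
            break  # the last window reaches the end of the context
        start += doc_stride
    return windows


# A short, BioASQ-sized context fits in a single window, while a
# COVID-QA-sized context is split into many overlapping ones.
short_context = ["tok"] * 200
long_context = ["tok"] * 4000
print(len(sliding_windows(short_context)), len(sliding_windows(long_context)))
```

<p>With a step of 192 and a window of 384, consecutive windows overlap by half, so an answer that straddles one window boundary is typically fully contained in a neighboring window.</p>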
<blockquote>
<p>Note: We continued to use DistilBERT, because it’s lightweight and quick to train. However, it’s also known that DistilBERT doesn’t perform as well as BERT or RoBERTa for QA. Fortunately, in this case, we cared more about relative performance gains than absolute performance.</p>
</blockquote>
<p>We started by exploring three things:</p>
<ul>
<li>whether fine-tuning on a small specialized dataset improves performance on specialized question answering;</li>
<li>if so, which strategy provides the best performance gain;</li>
<li>and what relative improvement we should expect.</li>
</ul>
<p>We used DistilBERT trained on SQuAD as our General QA Model. We evaluated this model on our medical holdout set to gain a performance baseline (blue bar). Next, we trained a Specialized QA Model by fine-tuning the General Model on the medical train set, using the hyperparameters above. The performance of the Specialized Model on the medical holdout set is shown in the figure below by the orange bars. Transfer learning through additional fine-tuning on the medical dataset resulted in nearly ten-point increases in both EM and F1 scores - a considerable improvement!</p>
<p><img src="/images/post5/fine_tuning_distilbert.png" alt="" /></p>
<p>Was it really necessary to start with a General QA Model that had already been fine-tuned on SQuAD? Perhaps we could have simply started with a pre-trained language model – a model that had not yet been trained at any explicit task – and fine-tuned directly on the medical dataset, essentially teaching the model both the task of question answering and the specifics of the specialized dataset at the same time.</p>
<p>The green bars in the graphic above show the results of training the <code class="highlighter-rouge">language model</code> (listed in the table above) directly on our medical dataset. We’ll call this the Med-Only Model. As expected, it performed worse than either the General Model or our Specialized Model, but not by much! The blue and green bars differ by only a couple of points - which is surprising, since the General Model is trained on 100k general examples and the Med-Only Model is trained on only 3300 specialized examples. This demonstrates that it’s not only a numbers game; it’s just as important to have data that reflects your specific domain.</p>
<p>But how many specialized examples are enough? Training the General Model on an additional 3300 specialized question/answer pairs achieved about a ten point increase in F1. Because generating QA annotations is costly, could we have done it with fewer examples? We explored this by training the General Model on increasing subsets of the medical train set, from 500 to 3000 examples. With only 500 examples we saw a four-point relative F1 increase. F1 increased rapidly with increasing training examples until we hit a training size of about 2000 examples, after which we saw diminishing returns on further performance gains.</p>
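<p>One way to set up a training-size experiment like this is sketched below. It is an illustration rather than our exact procedure: it assumes nested subsets (each larger subset contains the smaller ones) and a step of 500 examples between sizes, neither of which is specified above, and the function name is hypothetical. Each subset would then be passed to whatever fine-tuning routine you use.</p>

```python
import random


def nested_subsets(examples, sizes, seed=42):
    """Return {size: subset}, where each larger subset contains the smaller ones.

    Shuffling once and slicing prefixes keeps the subsets nested, so any
    change in score between sizes comes from the added examples only.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return {n: shuffled[:n] for n in sorted(sizes) if n <= len(shuffled)}


# Stand-in for the 3308 medical training examples (integer IDs only).
train_ids = list(range(3308))
subsets = nested_subsets(train_ids, sizes=range(500, 3001, 500))
for size, subset in subsets.items():
    # In a real experiment, fine-tune the General Model on `subset`
    # and evaluate on the fixed holdout set here.
    print(size, len(subset))
```

<p>Keeping the holdout set fixed across all sizes is what makes the resulting scores comparable.</p>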
<p><img src="/images/post5/fine_tuning_vs_train_size.png" alt="" /></p>
<p>This highlights a common tradeoff between model improvement and development cost. Of course, with infinite resources, we can train better models. However, resources are almost always limited, so it’s encouraging to see that even a small investment in QA annotation can lead to substantial model improvements.</p>
<p>How robust are these results? As a final check, we performed a five-fold cross-validation training, wherein we kept all hyperparameters fixed but allowed the random seed and training order to vary. Below we see that the results were fairly robust, with a spread of about three to four points in either F1 or EM, which is far smaller than the ten point increase we saw when going from our General Model to the Specialized Model. This indicates that the performance gain is a real signal. (This figure was inspired by a similar one in <a href="https://openreview.net/pdf?id=JENSKEEzsoU">this paper</a>.)</p>
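<p>For readers who want to run a similar robustness check, here is a minimal sketch of a five-fold split with a different seed per fold. How the folds were built in our experiment is not detailed above, so treat the splitting scheme, the per-fold seed convention, and the function names as assumptions made for illustration.</p>

```python
import random
import statistics


def k_fold_splits(n_examples, k=5, seed=0):
    """Yield (fold_seed, train_indices, held_out_indices) for each of k folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        # A distinct seed per fold lets initialization and batch order vary.
        yield seed + i, train, held_out


def summarize(scores):
    """Mean and max-min spread of per-fold scores."""
    return {"mean": statistics.mean(scores), "spread": max(scores) - min(scores)}


for fold_seed, train, held_out in k_fold_splits(3308, k=5):
    # In a real run: fine-tune with `fold_seed` on `train`, score on `held_out`.
    print(fold_seed, len(train), len(held_out))
```

<p>Comparing the spread returned by <code>summarize</code> with the size of the improvement you care about is a quick way to judge whether that improvement is larger than run-to-run noise.</p>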
<p><img src="/images/post5/cross_validation_test.png" alt="" title="F1 and exact match scores for each fold of a five-fold CV" /></p>
<p>With that said, we again stress that the performance we’ve demonstrated here is not guaranteed in every QA application to a specialized domain. However, our experiments echo the findings of other studies in the literature, which is heartening.</p>
<blockquote>
<p>To learn more, check out the following papers:</p>
<ul>
<li><a href="https://arxiv.org/pdf/1911.02655">Towards Domain Adaptation From Limited Data For Question Answering Using Deep Neural Networks</a></li>
<li><a href="https://dl.acm.org/doi/pdf/10.1145/3309706">Putting Question-Answering Systems into Practice: Transfer Learning for Efficient Domain Customization</a></li>
<li><a href="https://arxiv.org/abs/1910.09753">MRQA 2019 Shared Task: Evaluating Generalization in Reading Comprehension</a></li>
</ul>
</blockquote>
<p>As a result of our experiments, we believe that the following are a solid set of guidelines for practical QA applications in specialized domains.</p>
<h1 id="practical-guidelines-for-domain-specific-qa">Practical Guidelines for Domain-Specific QA</h1>
<ol>
<li>General QA Models will provide solid performance in most cases, especially for QA tasks that require answering factoid questions over text that is qualitatively similar to Wikipedia or general text content on the web.</li>
<li>Applying a General QA Model in a specialized domain may benefit substantially from applying transfer learning to that domain.</li>
<li>Utilizing standard transfer learning techniques allows practitioners to leverage currently existing QA infrastructure and libraries (such as Hugging Face <a href="https://github.com/huggingface/transformers">Transformers</a> or deepset’s <a href="https://github.com/deepset-ai/haystack">haystack</a>).</li>
<li>Generating annotations for specialized QA tasks can thus be a worthwhile investment, made easier with emerging annotation applications.</li>
<li>A substantial performance increase can be seen with only a few hundred specialized QA examples, and even greater gains achieved with a couple thousand.</li>
<li>Absolute performance will depend on several factors, including the chosen model, the new domain, the type of question, etc.</li>
</ol>
<h1 id="final-thoughts">Final Thoughts</h1>
<p>Question answering is an NLP capability that is still emerging. It currently works best on general-knowledge, factoid-style, SQuAD-like questions that require short answers. This type of QA lends itself well to use cases such as “enhanced search” – allowing users to more easily and intuitively identify not just documents or websites of interest, but explicit passages and sentences, using natural language. There is no question that this style of QA is closest to maturity.</p>
<p>However, research continues to accelerate, as new models and datasets emerge that push the boundaries of SQuAD-like QA. Here are two areas we’re watching closely:</p>
<ul>
<li>QA models that combine search over large corpora with answer extraction, because as we saw in <a href="http://qa.fastforwardlabs.com">this blog series</a>, your Reader is limited by the success of your Retriever (more on that in this <a href="https://qa.fastforwardlabs.com/elasticsearch/qa%20system%20design/passage%20ranking/masked%20language%20model/word%20embeddings/2020/07/22/Improving_the_Retriever_on_Natural_Questions.html">blog post</a>)</li>
<li>QA models that can infer an answer based on several pieces of supporting evidence from multiple documents. This is a task that, in essence, marries QA with Text Summarization.</li>
</ul>
<p>In the meantime, there is still much to be done with standard QA, and we’d love to hear about your use cases! This will be the final blog post for this particular series, and we hope you’ve enjoyed the ride. We learned a lot, and have been thrilled to share our exploration.</p>
<h1>How to Maximize Retriever Performance on a More Natural Dataset</h1>
<p>Published 2020-07-22</p>
<div class="container" id="notebook-container">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>If you’ve been following along on our question answering journey thus far, you now understand the basic building blocks that form the pipeline of a modern Information Retrieval-based (IR) Question Answering system, and how that system can be evaluated on the SQuAD2.0 dataset. But, as it turns out, implementing question answering for real-world use cases is a bit more nuanced than evaluating system performance against a toy dataset. In this post, we’ll explore several challenges faced by the Retriever when applying IR-QA to a more realistic dataset, as well as a few practical approaches for overcoming them.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Prerequisites">Prerequisites<a class="anchor-link" href="#Prerequisites"> </a></h3><ul>
<li>a basic understanding of IR-QA systems (see our <a href="https://qa.fastforwardlabs.com/">previous posts</a>)</li>
<li>a basic understanding of modern NLP techniques</li>
</ul>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Shortcomings-of-SQuAD2.0">Shortcomings of SQuAD2.0<a class="anchor-link" href="#Shortcomings-of-SQuAD2.0"> </a></h1>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>While SQuAD has been a popular benchmark for the task of machine comprehension, there are several perceived flaws in how the dataset was constructed that render it an unfair comparison to how humans naturally seek answers to questions. Specifically, SQuAD was created through artificial crowdsourcing where annotators were presented with a Wikipedia paragraph and asked to write questions that can be answered from it. By first reading a body of text and then generating questions, the annotators had already leaked information into the questions they crafted.</p>
<p>The methodology used here is not ideal, because (a) many questions lack context in the absence of the provided paragraph, and (b) there is a high lexical overlap between passages and questions - which artificially inflates the efficacy of exact match search tools (like Elasticsearch). Consider the following example:</p>
<blockquote><p><strong><em>Question:</em></strong> Other than the Automobile Club of Southern California, what other AAA Auto Club chose to simplify the divide?<br />
<strong><em>Answer:</em></strong> California State Automobile Association</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This is a fundamentally different scenario from that in the real world; human curiosity often leads us to blindly seek answers from an unknown domain. The SQuAD dataset consists of questions constructed from a known domain; essentially the process is (albeit imperfectly) rigged.</p>
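<p>The lexical overlap described above can be made concrete with a rough measure: the fraction of a question's content words that also appear in the passage. The sketch below is a simplified illustration, not the metric used by any benchmark; the stopword list is a small hand-picked assumption and the function names are hypothetical.</p>

```python
import re

# A tiny, hand-picked stopword list (an assumption for illustration only).
STOPWORDS = {
    "a", "an", "the", "of", "to", "in", "on", "is", "are", "was", "did",
    "does", "do", "what", "who", "which", "when", "where", "how", "other", "than",
}


def content_tokens(text):
    """Lowercased alphanumeric tokens with stopwords removed, as a set."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def lexical_overlap(question, passage):
    """Fraction of the question's content tokens that also occur in the passage."""
    question_tokens = content_tokens(question)
    if not question_tokens:
        return 0.0
    return len(question_tokens & content_tokens(passage)) / len(question_tokens)
```

<p>Questions written while looking at the passage tend to score high on a measure like this, which is exactly the situation in which a term-matching Retriever looks better than it would on questions asked without seeing the text.</p>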
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="A-More-Realistic-Alternative:-Natural-Questions">A More Realistic Alternative: Natural Questions<a class="anchor-link" href="#A-More-Realistic-Alternative:-Natural-Questions"> </a></h1>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>In response to the criticism of SQuAD’s shortcomings, and to spur the progress of open-domain QA systems, Google released the <a href="https://ai.google.com/research/NaturalQuestions">Natural Questions (NQ) dataset</a> in 2019. NQ consists of real, anonymized questions issued to the Google search engine, and provides an entire Wikipedia article as context that may or may not contain the answer to a given question. The inclusion of open-ended, human-written questions and the need to reason over full pages of content make building a QA system over the NQ dataset a much more realistic and challenging task than datasets before it. Here are a few example questions from NQ:</p>
<ul>
<li>where does the energy in a nuclear explosion come from?</li>
<li>how many episodes in season 2 breaking bad?</li>
<li>meaning of cats in the cradle song?</li>
</ul>
<p>Notice how some of the NQ questions are underspecified, raw, and syntactically erroneous. Do these look oddly similar to how you interact with search engines every day? Let’s explore a couple of techniques that might help overcome the challenges presented by this dataset and improve the Elasticsearch Retriever we built for our <a href="https://qa.fastforwardlabs.com/elasticsearch/mean%20average%20precision/recall%20for%20irqa/qa%20system%20design/2020/06/30/Evaluating_the_Retriever_&_End_to_End_System.html">previous blog post</a>.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Query-Expansion-Techniques">Query Expansion Techniques<a class="anchor-link" href="#Query-Expansion-Techniques"> </a></h1>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Entity-Enrichment">Entity Enrichment<a class="anchor-link" href="#Entity-Enrichment"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As we learned previously, the inverted index data structure underlying Elasticsearch doesn’t preserve word order in a query by default. Consider the following question:</p>
<blockquote><p><strong><em>Question:</em></strong> "Who is the bad guy in The Hunger Games?"</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>In this example, we as humans can intuit that the combination of words “The Hunger Games” has a very specific meaning in contrast to the three tokens ("the", "hunger", "games") independently. We want to enable Elasticsearch to identify content specific to “The Hunger Games” trilogy, and not become entangled with general content about hunger and games. To accomplish this, we can apply named entity recognition (NER) - an information extraction technique that automatically identifies named entities (such as people, organizations, and locations) in free text.</p>
<p>Implementing entity enrichment requires extended use of Elasticsearch’s <a href="https://elasticsearch-dsl.readthedocs.io/en/latest/">rich query language</a> and a pre-trained NER model (we chose one readily available through the <a href="https://spacy.io/">spaCy</a> NLP library). The process is as follows:</p>
<ol>
<li>Apply NER to process a question and extract out any named entities.</li>
<li>Create a phrase subquery for each entity to preserve the order of tokens for that phrase in match criteria.</li>
<li>Create a standard match subquery for the original question itself.</li>
<li>Combine all subqueries into a boolean compound query that scores candidate documents according to overlap criteria from both question and phrase queries.</li>
</ol>
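<p>As a rough illustration of steps 2 through 4, the sketch below assembles a raw Elasticsearch query body as a plain Python dictionary. It is not the QueryExpander implementation: the entity list is passed in directly (in practice it would come from an NER model, for example the entities spaCy exposes on a processed document), and the field name <code>document_text</code> and the function name are assumptions made for this example.</p>

```python
def entity_enriched_query(question, entities, field="document_text"):
    """Build a bool query: one match clause plus one match_phrase per entity.

    `field` is a hypothetical index field name; `entities` is a list of
    entity strings assumed to have been extracted from the question by NER.
    """
    # Step 3: a standard match subquery for the original question.
    should_clauses = [{"match": {field: question}}]
    # Step 2: a phrase subquery per entity, which preserves token order.
    for entity in entities:
        should_clauses.append({"match_phrase": {field: entity}})
    # Step 4: combine everything so documents score on both kinds of overlap.
    return {"query": {"bool": {"should": should_clauses}}}


query = entity_enriched_query(
    "Who is the bad guy in The Hunger Games?",
    entities=["The Hunger Games"],
)
print(query)
```

<p>Placing every subquery under <code>should</code> means a document does not have to contain the exact phrase to be returned, but documents that do contain it receive a higher relevance score.</p>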
<p>To simplify query expansion testing, we built a QueryExpander class (hidden below) that automates several query expansion methods. Let's take a look at how our query is transformed through entity enrichment:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># collapse-hide</span>
<span class="c1"># install dependencies</span>
<span class="o">!</span>pip install elasticsearch_dsl
<span class="o">!</span>pip install <span class="nv">transformers</span><span class="o">==</span><span class="m">2</span>.11.0
<span class="c1"># import packages</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="kn">import</span> <span class="nn">spacy</span>
<span class="kn">from</span> <span class="nn">elasticsearch_dsl</span> <span class="kn">import</span> <span class="n">Q</span><span class="p">,</span> <span class="n">Search</span>
<span class="kn">import</span> <span class="nn">gensim.downloader</span> <span class="k">as</span> <span class="nn">api</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">pipeline</span><span class="p">,</span> <span class="n">AutoTokenizer</span>
<span class="kn">import</span> <span class="nn">warnings</span>
<span class="n">warnings</span><span class="o">.</span><span class="n">simplefilter</span><span class="p">(</span><span class="n">action</span><span class="o">=</span><span class="s1">'ignore'</span><span class="p">,</span> <span class="n">category</span><span class="o">=</span><span class="ne">FutureWarning</span><span class="p">)</span>
<span class="c1"># initialize models</span>
<span class="n">nlp</span> <span class="o">=</span> <span class="n">spacy</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">"en_core_web_sm"</span><span class="p">)</span>
<span class="n">word_vectors</span> <span class="o">=</span> <span class="n">api</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">"glove-wiki-gigaword-50"</span><span class="p">)</span>
<span class="n">unmasker</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">(</span><span class="s1">'fill-mask'</span><span class="p">,</span> <span class="n">model</span><span class="o">=</span><span class="s2">"bert-base-uncased"</span><span class="p">,</span> <span class="n">tokenizer</span><span class="o">=</span><span class="s2">"bert-base-uncased"</span><span class="p">)</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s2">"bert-base-uncased"</span><span class="p">,</span> <span class="n">use_fast</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">QueryExpander</span><span class="p">:</span>
<span class="sd">'''</span>
<span class="sd"> Query expansion utility that augments ElasticSearch queries with optional techniques</span>
<span class="sd"> including Named Entity Recognition and Synonym Expansion</span>
<span class="sd"> </span>
<span class="sd"> Args:</span>
<span class="sd"> question_text</span>
<span class="sd"> entity_args (dict) - Ex. {'spacy_model': nlp}</span>
<span class="sd"> synonym_args (dict) - Ex. {'gensim_model': word_vectors, 'n_syns': 3} OR</span>
<span class="sd"> {'MLM': unmasker, 'tokenizer': tokenizer, 'n_syns': 3, 'threshold':0.3}</span>
<span class="sd"> </span>
<span class="sd"> '''</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">question_text</span><span class="p">,</span> <span class="n">entity_args</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">synonym_args</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">question_text</span> <span class="o">=</span> <span class="n">question_text</span>
<span class="bp">self</span><span class="o">.</span><span class="n">entity_args</span> <span class="o">=</span> <span class="n">entity_args</span>
<span class="bp">self</span><span class="o">.</span><span class="n">synonym_args</span> <span class="o">=</span> <span class="n">synonym_args</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">synonym_args</span> <span class="ow">and</span> <span class="ow">not</span> <span class="bp">self</span><span class="o">.</span><span class="n">entity_args</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">Exception</span><span class="p">(</span><span class="s1">'Cannot do synonym expansion without NER! Expanding synonyms</span><span class="se">\</span>
<span class="s1"> on named entities reduces recall.'</span><span class="p">)</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">synonym_args</span> <span class="ow">or</span> <span class="bp">self</span><span class="o">.</span><span class="n">entity_args</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">nlp</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">entity_args</span><span class="p">[</span><span class="s1">'spacy_model'</span><span class="p">]</span>
<span class="bp">self</span><span class="o">.</span><span class="n">doc</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">nlp</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">question_text</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">build_query</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">build_query</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="c1"># build entity subquery</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">entity_args</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">extract_entities</span><span class="p">()</span>
<span class="c1"># identify terms to expand</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">synonym_args</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">identify_terms_to_expand</span><span class="p">()</span>
<span class="c1"># build question subquery</span>
<span class="bp">self</span><span class="o">.</span><span class="n">construct_question_query</span><span class="p">()</span>
<span class="c1"># combine subqueries</span>
<span class="n">sub_queries</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">sub_queries</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">question_sub_query</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="s1">'entity_sub_queries'</span><span class="p">):</span>
<span class="n">sub_queries</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">entity_sub_queries</span><span class="p">)</span>
<span class="n">query</span> <span class="o">=</span> <span class="n">Q</span><span class="p">(</span><span class="s1">'bool'</span><span class="p">,</span> <span class="n">should</span><span class="o">=</span><span class="p">[</span><span class="o">*</span><span class="n">sub_queries</span><span class="p">])</span>
<span class="bp">self</span><span class="o">.</span><span class="n">query</span> <span class="o">=</span> <span class="n">query</span>
<span class="k">def</span> <span class="nf">extract_entities</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Extracts named entities using spaCy and constructs phrase match subqueries</span>
<span class="sd"> for each entity. Saves both entities and subqueries as attributes.</span>
<span class="sd"> </span>
<span class="sd"> '''</span>
<span class="n">entity_list</span> <span class="o">=</span> <span class="p">[</span><span class="n">entity</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span> <span class="k">for</span> <span class="n">entity</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">doc</span><span class="o">.</span><span class="n">ents</span><span class="p">]</span>
<span class="n">entity_sub_queries</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">ent</span> <span class="ow">in</span> <span class="n">entity_list</span><span class="p">:</span>
<span class="n">eq</span> <span class="o">=</span> <span class="n">Q</span><span class="p">(</span><span class="s1">'multi_match'</span><span class="p">,</span>
<span class="n">query</span><span class="o">=</span><span class="n">ent</span><span class="p">,</span>
<span class="nb">type</span><span class="o">=</span><span class="s1">'phrase'</span><span class="p">,</span>
<span class="n">fields</span><span class="o">=</span><span class="p">[</span><span class="s1">'title'</span><span class="p">,</span> <span class="s1">'text'</span><span class="p">])</span>
<span class="n">entity_sub_queries</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">eq</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">entities</span> <span class="o">=</span> <span class="n">entity_list</span>
<span class="bp">self</span><span class="o">.</span><span class="n">entity_sub_queries</span> <span class="o">=</span> <span class="n">entity_sub_queries</span>
<span class="k">def</span> <span class="nf">identify_terms_to_expand</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Identify terms in the question that are eligible for expansion</span>
<span class="sd"> per a set of defined rules</span>
<span class="sd"> </span>
<span class="sd"> '''</span>
<span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="s1">'entities'</span><span class="p">):</span>
<span class="c1"># get unique list of entity tokens</span>
<span class="n">entity_terms</span> <span class="o">=</span> <span class="p">[</span><span class="n">ent</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">' '</span><span class="p">)</span> <span class="k">for</span> <span class="n">ent</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">entities</span><span class="p">]</span>
<span class="n">entity_terms</span> <span class="o">=</span> <span class="p">[</span><span class="n">ent</span> <span class="k">for</span> <span class="n">sublist</span> <span class="ow">in</span> <span class="n">entity_terms</span> <span class="k">for</span> <span class="n">ent</span> <span class="ow">in</span> <span class="n">sublist</span><span class="p">]</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">entity_terms</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># terms to expand are not part of entity, a stopword, numeric, etc.</span>
<span class="n">entity_pos</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"NOUN"</span><span class="p">,</span><span class="s2">"VERB"</span><span class="p">,</span><span class="s2">"ADJ"</span><span class="p">,</span><span class="s2">"ADV"</span><span class="p">]</span>
<span class="n">terms_to_expand</span> <span class="o">=</span> <span class="p">[</span><span class="n">idx_term</span> <span class="k">for</span> <span class="n">idx_term</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">doc</span><span class="p">)</span> <span class="k">if</span> \
<span class="p">(</span><span class="n">idx_term</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">lower_</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">entity_terms</span><span class="p">)</span> <span class="ow">and</span> <span class="p">(</span><span class="ow">not</span> <span class="n">idx_term</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">is_stop</span><span class="p">)</span>\
<span class="ow">and</span> <span class="p">(</span><span class="ow">not</span> <span class="n">idx_term</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">is_digit</span><span class="p">)</span> <span class="ow">and</span> <span class="p">(</span><span class="ow">not</span> <span class="n">idx_term</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">is_punct</span><span class="p">)</span> <span class="ow">and</span>
<span class="p">(</span><span class="ow">not</span> <span class="nb">len</span><span class="p">(</span><span class="n">idx_term</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">lower_</span><span class="p">)</span><span class="o">==</span><span class="mi">1</span> <span class="ow">and</span> <span class="n">idx_term</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">is_alpha</span><span class="p">)</span> <span class="ow">and</span>
<span class="p">(</span><span class="n">idx_term</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">pos_</span> <span class="ow">in</span> <span class="n">entity_pos</span><span class="p">)]</span>
<span class="bp">self</span><span class="o">.</span><span class="n">terms_to_expand</span> <span class="o">=</span> <span class="n">terms_to_expand</span>
<span class="k">def</span> <span class="nf">construct_question_query</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Builds a multi-match query from the raw question text extended with synonyms </span>
<span class="sd"> for any eligible terms</span>
<span class="sd"> '''</span>
<span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="s1">'terms_to_expand'</span><span class="p">):</span>
<span class="n">syns</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">term</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">terms_to_expand</span><span class="p">:</span>
<span class="k">if</span> <span class="s1">'gensim_model'</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">synonym_args</span><span class="o">.</span><span class="n">keys</span><span class="p">():</span>
<span class="n">syns</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">gather_synonyms_static</span><span class="p">(</span><span class="n">term</span><span class="p">))</span>
<span class="k">elif</span> <span class="s1">'MLM'</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">synonym_args</span><span class="o">.</span><span class="n">keys</span><span class="p">():</span>
<span class="n">syns</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">gather_synonyms_contextual</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">term</span><span class="p">))</span>
<span class="n">syns</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">syns</span><span class="p">))</span>
<span class="n">syns</span> <span class="o">=</span> <span class="p">[</span><span class="n">syn</span> <span class="k">for</span> <span class="n">syn</span> <span class="ow">in</span> <span class="n">syns</span> <span class="k">if</span> <span class="p">(</span><span class="n">syn</span><span class="o">.</span><span class="n">isalpha</span><span class="p">()</span> <span class="ow">and</span> <span class="bp">self</span><span class="o">.</span><span class="n">nlp</span><span class="p">(</span><span class="n">syn</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">pos_</span> <span class="o">!=</span> <span class="s1">'PROPN'</span><span class="p">)]</span>
<span class="n">question</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">question_text</span> <span class="o">+</span> <span class="s1">' '</span> <span class="o">+</span> <span class="s1">' '</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">syns</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">expanded_question</span> <span class="o">=</span> <span class="n">question</span>
<span class="bp">self</span><span class="o">.</span><span class="n">all_syns</span> <span class="o">=</span> <span class="n">syns</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">question</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">question_text</span>
<span class="n">qq</span> <span class="o">=</span> <span class="n">Q</span><span class="p">(</span><span class="s1">'multi_match'</span><span class="p">,</span>
<span class="n">query</span><span class="o">=</span><span class="n">question</span><span class="p">,</span>
<span class="nb">type</span><span class="o">=</span><span class="s1">'most_fields'</span><span class="p">,</span>
<span class="n">fields</span><span class="o">=</span><span class="p">[</span><span class="s1">'title'</span><span class="p">,</span> <span class="s1">'text'</span><span class="p">])</span>
<span class="bp">self</span><span class="o">.</span><span class="n">question_sub_query</span> <span class="o">=</span> <span class="n">qq</span>
<span class="k">def</span> <span class="nf">gather_synonyms_contextual</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">token_index</span><span class="p">,</span> <span class="n">token</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Takes in a token, and returns specified number of synonyms as defined by</span>
<span class="sd"> predictions from a masked language model</span>
<span class="sd"> </span>
<span class="sd"> '''</span>
<span class="n">tokens</span> <span class="o">=</span> <span class="p">[</span><span class="n">token</span><span class="o">.</span><span class="n">text</span> <span class="k">for</span> <span class="n">token</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">doc</span><span class="p">]</span>
<span class="n">tokens</span><span class="p">[</span><span class="n">token_index</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">synonym_args</span><span class="p">[</span><span class="s1">'tokenizer'</span><span class="p">]</span><span class="o">.</span><span class="n">mask_token</span>
<span class="n">terms</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">predict_mask</span><span class="p">(</span><span class="n">text</span> <span class="o">=</span> <span class="s1">' '</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">tokens</span><span class="p">),</span>
<span class="n">unmasker</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">synonym_args</span><span class="p">[</span><span class="s1">'MLM'</span><span class="p">],</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">synonym_args</span><span class="p">[</span><span class="s1">'tokenizer'</span><span class="p">],</span>
<span class="n">threshold</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">synonym_args</span><span class="p">[</span><span class="s1">'threshold'</span><span class="p">],</span>
<span class="n">top_n</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">synonym_args</span><span class="p">[</span><span class="s1">'n_syns'</span><span class="p">])</span>
<span class="k">return</span> <span class="n">terms</span>
<span class="nd">@staticmethod</span>
<span class="k">def</span> <span class="nf">predict_mask</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">unmasker</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">threshold</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">top_n</span><span class="o">=</span><span class="mi">2</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Given a sentence with a [MASK] token in it, this function will return the most </span>
<span class="sd"> contextually similar terms to fill in the [MASK]</span>
<span class="sd"> </span>
<span class="sd"> '''</span>
<span class="n">preds</span> <span class="o">=</span> <span class="n">unmasker</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
<span class="n">tokens</span> <span class="o">=</span> <span class="p">[</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">convert_ids_to_tokens</span><span class="p">(</span><span class="n">pred</span><span class="p">[</span><span class="s1">'token'</span><span class="p">])</span> <span class="k">for</span> <span class="n">pred</span> <span class="ow">in</span> <span class="n">preds</span> <span class="k">if</span> <span class="n">pred</span><span class="p">[</span><span class="s1">'score'</span><span class="p">]</span> <span class="o">></span> <span class="n">threshold</span><span class="p">]</span>
<span class="k">return</span> <span class="n">tokens</span><span class="p">[:</span><span class="n">top_n</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">gather_synonyms_static</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">token</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Takes in a token and returns a specified number of synonyms defined by</span>
<span class="sd"> cosine similarity of word vectors. Uses lemmatization to ensure none of the</span>
<span class="sd"> returned synonyms share the same lemma (ex. photo and photos can't both be returned)</span>
<span class="sd"> </span>
<span class="sd"> '''</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">syns</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">synonym_args</span><span class="p">[</span><span class="s1">'gensim_model'</span><span class="p">]</span><span class="o">.</span><span class="n">similar_by_word</span><span class="p">(</span><span class="n">token</span><span class="o">.</span><span class="n">lower_</span><span class="p">)</span>
<span class="n">lemmas</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">final_terms</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">syns</span><span class="p">:</span>
<span class="n">term</span> <span class="o">=</span> <span class="n">item</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">lemma</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">nlp</span><span class="p">(</span><span class="n">term</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">lemma_</span>
<span class="k">if</span> <span class="n">lemma</span> <span class="ow">in</span> <span class="n">lemmas</span><span class="p">:</span>
<span class="k">continue</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">lemmas</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">lemma</span><span class="p">)</span>
<span class="n">final_terms</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">term</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">final_terms</span><span class="p">)</span> <span class="o">==</span> <span class="bp">self</span><span class="o">.</span><span class="n">synonym_args</span><span class="p">[</span><span class="s1">'n_syns'</span><span class="p">]:</span>
<span class="k">break</span>
<span class="k">except</span><span class="p">:</span>
<span class="n">final_terms</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">return</span> <span class="n">final_terms</span>
<span class="k">def</span> <span class="nf">explain_expansion</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">entities</span><span class="o">=</span><span class="kc">True</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Print out an explanation for the query expansion methodology</span>
<span class="sd"> </span>
<span class="sd"> '''</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Question:'</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">question_text</span><span class="p">,</span> <span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="k">if</span> <span class="n">entities</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Found Entities:'</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">entities</span><span class="p">,</span> <span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="s1">'terms_to_expand'</span><span class="p">):</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Synonym Expansions:'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">term</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">terms_to_expand</span><span class="p">:</span>
<span class="k">if</span> <span class="s1">'gensim_model'</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">synonym_args</span><span class="o">.</span><span class="n">keys</span><span class="p">():</span>
<span class="nb">print</span><span class="p">(</span><span class="n">term</span><span class="p">,</span> <span class="s1">'-->'</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">gather_synonyms_static</span><span class="p">(</span><span class="n">term</span><span class="p">))</span>
<span class="k">elif</span> <span class="s1">'MLM'</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">synonym_args</span><span class="o">.</span><span class="n">keys</span><span class="p">():</span>
<span class="nb">print</span><span class="p">(</span><span class="n">term</span><span class="p">,</span> <span class="s1">'-->'</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">gather_synonyms_contextual</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">term</span><span class="p">))</span>
<span class="k">else</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Question text has no terms to expand.'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Expanded Question:'</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">expanded_question</span><span class="p">,</span> <span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Elasticsearch Query:</span><span class="se">\n</span><span class="s1">'</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">query</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">question</span> <span class="o">=</span> <span class="s2">"Who is the bad guy in The Hunger Games?"</span>
<span class="n">entity_args</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'spacy_model'</span><span class="p">:</span> <span class="n">nlp</span><span class="p">}</span>
<span class="n">qe_ner</span> <span class="o">=</span> <span class="n">QueryExpander</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">entity_args</span><span class="p">)</span>
<span class="n">qe_ner</span><span class="o">.</span><span class="n">explain_expansion</span><span class="p">(</span><span class="n">entities</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Question: Who is the bad guy in The Hunger Games?
Found Entities: ['the hunger games']
Elasticsearch Query:
Bool(should=[MultiMatch(fields=['title', 'text'], query='Who is the bad guy in The Hunger Games?', type='most_fields'), MultiMatch(fields=['title', 'text'], query='the hunger games', type='phrase')])
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>From the example above, we see that our NER model successfully identified "The Hunger Games" as a named entity, and the final boolean query served to Elasticsearch consists of two parts: one multi-match query for the raw question text, and one phrase-match query for the extracted entity.</p>
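<p>For readers who want to see the shape of that compound query without the hidden class, here is a minimal sketch that builds the equivalent request body as a plain Python dictionary. The function name <code>build_entity_enriched_query</code> is hypothetical, and the entity list is assumed to have been extracted already by an NER model; the <code>title</code> and <code>text</code> field names follow the query shown above.</p>

```python
import json


def build_entity_enriched_query(question, entities, fields=("title", "text")):
    """Return an Elasticsearch bool query body as a plain dict.

    `entities` is assumed to be a list of entity strings that an NER model
    has already extracted from the question (hypothetical input).
    """
    field_list = list(fields)

    # Standard match subquery over the raw question text
    sub_queries = [{
        "multi_match": {
            "query": question,
            "type": "most_fields",
            "fields": field_list,
        }
    }]

    # One phrase subquery per entity, so token order is preserved in matching
    for entity in entities:
        sub_queries.append({
            "multi_match": {
                "query": entity,
                "type": "phrase",
                "fields": field_list,
            }
        })

    # Each matching "should" clause contributes to a document's score
    return {"query": {"bool": {"should": sub_queries}}}


body = build_entity_enriched_query(
    "Who is the bad guy in The Hunger Games?", ["the hunger games"]
)
print(json.dumps(body, indent=2))
```

<p>Because every subquery sits under <code>should</code>, a document that matches both the question terms and the entity phrase accumulates score from both clauses, which is the overlap behavior described in the steps above.</p>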
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Synonym-Expansion">Synonym Expansion<a class="anchor-link" href="#Synonym-Expansion"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The exact-match nature of Elasticsearch is powerful and effective, but it isn't perfect. Word matching is limited in its ability to account for semantically related concepts, and given the vague nature of NQ questions, this limitation will degrade the performance of our Retriever. For example, let's further consider the question from above and a supporting context passage:</p>
<blockquote><p><strong><em>Question:</em></strong> "Who is the bad guy in The Hunger Games?"<br />
<strong><em>Context:</em></strong> "President Coriolanus Snow is the main antagonistic villain in The Hunger Games trilogy, and though seemingly laid-back, his demeanor hides a sadistic mind."</p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>An exact match-based Retriever like Elasticsearch would struggle to fetch this supporting context passage because it lacks the ability to relate the concepts of "bad guy" and "villain." More <a href="https://arxiv.org/pdf/2004.04906.pdf">sophisticated document retrieval systems</a> (that take advantage of learned, dense representations of text) would perform better in this situation; however, these systems are non-trivial to implement and impractical to maintain. Rather, we can try to help generalize the query through synonym expansion at search time - that is, we can identify ambiguous terms in the input question and augment the query with additional synonyms for those terms.</p>
<p>But how do we determine what a synonym is? And how do we know which terms in the question should be expanded?</p>
<p>We experimented with two methods that follow the same process, but differ slightly in <em>how</em> synonyms are designated. The process entails:</p>
<ol>
<li><strong>Identifying a set of candidate tokens.</strong> Ideally, we want to expand tokens such that additional terms serve to increase recall, while adding minimal noise and without altering the semantics of the original query. We look at every term in the question and choose to only expand nouns, verbs, adjectives, and adverbs that are not part of a named entity.</li>
<li><strong>Expanding each candidate token.</strong> Related terms can be derived in numerous ways, from traditional lexicon lookups (e.g., WordNet) to similarity measures between learned vector space representations (e.g., Word2Vec). We’ve tested the use of static word embedding similarity and masked language model predictions as proxies for generating synonymous terms (more on these methods in a bit).</li>
<li><strong>Post-processing synonyms and crafting a new query.</strong> We then filter the expanded vocabulary to remove any duplicative, non-alphanumeric, and proper noun tokens. The final set of expanded terms is used to create a new Elasticsearch query by appending the novel words to the original question text.</li>
</ol>
<p>Let’s take a deep dive into these two expansion methods to see how they compare.</p>
</div>
</div>
</div>
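<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The third step of the process above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name <code>craft_expanded_question</code> is hypothetical, the expansion terms are supplied by the caller (from whichever method produced them), and the proper noun filter is omitted because it would require a part-of-speech tagger.</p>

```python
import re


def craft_expanded_question(question, expansions):
    """Append novel, alphanumeric expansion terms to the original question.

    `expansions` maps each candidate token to its proposed related terms.
    Terms are dropped if they already appear in the question, repeat an
    earlier expansion term, or contain non-alphanumeric characters.
    """
    seen = set(re.findall(r"\w+", question.lower()))
    novel_terms = []
    for terms in expansions.values():
        for term in terms:
            term = term.lower()
            if not term.isalnum() or term in seen:
                continue
            seen.add(term)
            novel_terms.append(term)
    return " ".join([question] + novel_terms)


expanded = craft_expanded_question(
    "how many rose species are found?",
    {"found": ["grown", "found", "well-known"]},
)
print(expanded)
```

<p>In this toy call, "found" is discarded because it duplicates a word already in the question, and "well-known" is discarded because it is not purely alphanumeric, so only "grown" is appended.</p>
</div>
</div>
</div>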
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Static-Embedding-Similarity">Static Embedding Similarity<a class="anchor-link" href="#Static-Embedding-Similarity"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Word embeddings are real-valued, vector representations of text that capture general contextual similarities between words in a given vocabulary. They are built on the idea that similar words tend to occur together frequently and thus are learned in an unsupervised manner from vast amounts of unstructured text. Their numerical form allows for mathematical operations - a common application being vector similarity.</p>
<p>Since the vectors have been trained to represent the natural co-occurrence of words, we extrapolate that terms corresponding to vectors with high cosine similarity are contextually synonymous. These word relationships are learned with regard to the data they are trained on, so it is critical that the training corpus for a set of embeddings is representative of the downstream task to which the embedding vectors will be applied. For that reason, we made use of 100-dimensional <a href="https://nlp.stanford.edu/pubs/glove.pdf">GloVe</a> word vectors trained on Wikipedia, available through the <a href="https://radimrehurek.com/gensim/">Gensim library</a>.</p>
<p>Let's take a look at an example question and how it is expanded, using static embeddings:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">question</span> <span class="o">=</span> <span class="s2">"what is Thomas Middleditch's popular tv show?"</span>
<span class="n">entity_args</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'spacy_model'</span><span class="p">:</span> <span class="n">nlp</span><span class="p">}</span>
<span class="n">synonym_args</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'gensim_model'</span><span class="p">:</span> <span class="n">word_vectors</span><span class="p">,</span>
<span class="s1">'n_syns'</span><span class="p">:</span> <span class="mi">2</span><span class="p">}</span>
<span class="n">qe_static</span> <span class="o">=</span> <span class="n">QueryExpander</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">entity_args</span><span class="p">,</span> <span class="n">synonym_args</span><span class="p">)</span>
<span class="n">qe_static</span><span class="o">.</span><span class="n">explain_expansion</span><span class="p">(</span><span class="n">entities</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Question: what is Thomas Middleditch's popular tv show?
Found Entities: ["thomas middleditch's"]
Synonym Expansions:
popular --> ['famous', 'most']
tv --> ['television', 'broadcast']
Expanded Question: what is Thomas Middleditch's popular tv show? broadcast famous television most
Elasticsearch Query:
Bool(should=[MultiMatch(fields=['title', 'text'], query="what is Thomas Middleditch's popular tv show? broadcast famous television most", type='most_fields'), MultiMatch(fields=['title', 'text'], query="thomas middleditch's", type='phrase')])
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Here we notice that "popular" and "tv" were the only tokens from the question deemed as candidates for expansion because all others are either stopwords or proper nouns composing a named entity. The expansion terms generated from our embedding similarity technique appear to be synonyms to the candidate terms, and by expanding the question with these related terms, we help generalize the query to capture a wider range of potentially relevant content.</p>
<p>However, while this example demonstrates a (mostly) successful use of embedding similarity, this is not always the case. Despite the semantic value baked into word embeddings, there are several limitations to their use as a proxy for synonymous meaning. Let's look at another example:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">question</span> <span class="o">=</span> <span class="s2">"how many rose species are found in the Montreal Botanical Garden?"</span>
<span class="n">qe_static</span> <span class="o">=</span> <span class="n">QueryExpander</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">entity_args</span><span class="p">,</span> <span class="n">synonym_args</span><span class="p">)</span>
<span class="n">qe_static</span><span class="o">.</span><span class="n">explain_expansion</span><span class="p">(</span><span class="n">entities</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Question: how many rose species are found in the Montreal Botanical Garden?
Found Entities: ['montreal botanical garden']
Synonym Expansions:
rose --> ['fell', 'climbed']
species --> ['genus', 'subspecies']
found --> ['discovered', 'identified']
Expanded Question: how many rose species are found in the Montreal Botanical Garden? subspecies identified climbed discovered fell genus
Elasticsearch Query:
Bool(should=[MultiMatch(fields=['title', 'text'], query='how many rose species are found in the Montreal Botanical Garden? subspecies identified climbed discovered fell genus', type='most_fields'), MultiMatch(fields=['title', 'text'], query='montreal botanical garden', type='phrase')])
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The suggested expansions for the term "rose" make it obvious that our approach lacks the ability to disambiguate between different senses of the same word. As humans, we naturally infer that the term "rose" implies a flower rather than the past-tense action of ascending from a lower position to a higher one. This illustrates a main limitation of word embeddings - multiple meanings of the same word are conflated into a single, static representation. In linguistic terms, polysemy and homonymy are not handled effectively.</p>
<p>Not only have we altered the semantics of the original query by introducing the new terms for the <em>wrong</em> word sense, but our approach has also failed to accurately capture synonymous terms for that mistaken word sense. We see that "rose" expanded to "fell" and "climbed"...the first of which is actually an antonym. What is going on here? As mentioned earlier, word embeddings are trained at the task of modeling co-occurrence probabilities. So while terms that occur in similar contexts <em>may</em> sometimes be synonymous, that certainly is not always the case.</p>
</div>
</div>
</div>
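<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To make the single-vector limitation concrete, here is a minimal sketch of a nearest-neighbor lookup by cosine similarity. The three-dimensional vectors are made up for illustration and are not real GloVe values; they simply mimic a situation in which the one vector stored for "rose" sits closer to verbs of motion than to flowers. Gensim's <code>KeyedVectors.most_similar</code> performs the same kind of lookup over real embeddings.</p>

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, n_syns=2):
    """Return the n_syns words whose vectors are closest to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:n_syns]]


# Made-up toy vectors (NOT real GloVe values): each word has exactly one
# vector, regardless of how many senses the word has.
toy_vectors = {
    "rose": [0.9, 0.1, 0.2],
    "fell": [0.8, 0.2, 0.1],
    "climbed": [0.85, 0.05, 0.3],
    "flower": [0.1, 0.9, 0.2],
    "tulip": [0.05, 0.95, 0.1],
}
neighbors = most_similar("rose", toy_vectors)
print(neighbors)
```

<p>Nothing in the lookup sees the surrounding question, so the neighbors of "rose" are the same whether the question is about stock prices or botanical gardens.</p>
</div>
</div>
</div>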
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Contextual-Query-Expansion">Contextual Query Expansion<a class="anchor-link" href="#Contextual-Query-Expansion"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The prior approach lacks something that allows us as humans to disambiguate the meaning of the word "rose" - <em>context</em>. The ability to dynamically recognize alternate meaning by paying attention to the context surrounding a term allows us to distinguish homonyms. For a computer to do the same, we can make use of a masked language model (MLM) to leverage the bi-directional context surrounding a given word and infer its intended sense.
<div class="flash">
<svg class="octicon octicon-info octicon octicon-info octicon octicon-info octicon octicon-info octicon octicon-info" viewBox="0 0 14 16" version="1.1" width="14" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 01-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 01-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg>
<strong>Note: </strong>For more information on how a masked language model works, we recommend the original <a href="https://arxiv.org/pdf/1810.04805.pdf">BERT paper</a> and this <a href="https://demo.allennlp.org/masked-lm">interactive tool from AllenNLP</a>.
</div></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>In practice, this involves constructing intermediate versions of the original question, in which each identified candidate token is masked, and an MLM predicts the top N tokens that are contextually most likely to complete the sentence. As with static embeddings, contextual query expansion relies on the assumption that the MLM has been trained (or fine-tuned) on the target document corpus (so it holds relevant, implicit information that can be exploited to identify suitable expansion terms).</p>
<p>In our case, we made use of a <em>BERT-base-uncased</em> model that has also been pre-trained on Wikipedia, conveniently available through <a href="https://huggingface.co/">HuggingFace's</a> "fill-mask" pipeline API. Let's see how this approach performs on our ambiguous example from above:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">question</span> <span class="o">=</span> <span class="s2">"how many rose species are found in the Montreal Botanical Garden?"</span>
<span class="n">entity_args</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'spacy_model'</span><span class="p">:</span> <span class="n">nlp</span><span class="p">}</span>
<span class="n">synonym_args</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'MLM'</span><span class="p">:</span> <span class="n">unmasker</span><span class="p">,</span>
<span class="s1">'tokenizer'</span><span class="p">:</span> <span class="n">tokenizer</span><span class="p">,</span>
<span class="s1">'n_syns'</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="s1">'threshold'</span><span class="p">:</span><span class="mi">0</span><span class="p">}</span>
<span class="n">qe_contextual</span> <span class="o">=</span> <span class="n">QueryExpander</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">entity_args</span><span class="p">,</span> <span class="n">synonym_args</span><span class="p">)</span>
<span class="n">qe_contextual</span><span class="o">.</span><span class="n">explain_expansion</span><span class="p">(</span><span class="n">entities</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Question: how many rose species are found in the Montreal Botanical Garden?
Found Entities: ['montreal botanical garden']
Synonym Expansions:
rose --> ['plant', 'new']
species --> ['varieties', 'gardens']
found --> ['grown', 'found']
Expanded Question: how many rose species are found in the Montreal Botanical Garden? varieties grown found new gardens plant
Elasticsearch Query:
Bool(should=[MultiMatch(fields=['title', 'text'], query='how many rose species are found in the Montreal Botanical Garden? varieties grown found new gardens plant', type='most_fields'), MultiMatch(fields=['title', 'text'], query='montreal botanical garden', type='phrase')])
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Great! Because we are now using a masked language model that considers the full sentence when predicting each missing candidate term, we are able to capture the correct meaning of the term "rose," and also produce generally more reliable "synonyms" for the expanded terms.</p>
</div>
</div>
</div>
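<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The masking step described above can be sketched without loading a model. The helper names below are hypothetical and not taken from the original notebook. The sketch assumes that predictions arrive as a list of dictionaries with <code>token_str</code> and <code>score</code> keys, which is the shape we understand the HuggingFace "fill-mask" pipeline to return; the prediction values shown are made up, and the way the threshold is applied is our assumption.</p>

```python
import re


def build_masked_questions(question, candidates, mask_token="[MASK]"):
    """Return one masked copy of the question per candidate token.

    Only the first whole-word occurrence of each candidate is masked.
    """
    masked = {}
    for candidate in candidates:
        pattern = r"\b" + re.escape(candidate) + r"\b"
        masked[candidate] = re.sub(pattern, lambda _: mask_token, question, count=1)
    return masked


def select_expansions(predictions, n_syns=2, threshold=0.0):
    """Keep the top n_syns predicted tokens whose score meets the threshold."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    kept = [p["token_str"] for p in ranked if p["score"] >= threshold]
    return kept[:n_syns]


question = "how many rose species are found in the Montreal Botanical Garden?"
masked = build_masked_questions(question, ["rose", "species", "found"])
print(masked["rose"])

# Made-up predictions standing in for real model output.
fake_predictions = [
    {"token_str": "plant", "score": 0.40},
    {"token_str": "new", "score": 0.05},
    {"token_str": "tree", "score": 0.20},
]
print(select_expansions(fake_predictions, n_syns=2, threshold=0.1))
```

<p>Each masked question would be passed to the MLM separately, so a question with three candidate tokens requires three forward passes at search time.</p>
</div>
</div>
</div>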
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Does-Query-Expansion-Improve-Retrieval-on-Natural-Questions?">Does Query Expansion Improve Retrieval on Natural Questions?<a class="anchor-link" href="#Does-Query-Expansion-Improve-Retrieval-on-Natural-Questions?"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Due to the topical breadth of Natural Questions, a vast knowledge base over which the Retriever can search is necessary in order to fairly evaluate an end-to-end QA system. For that reason, we <a href="https://github.com/attardi/wikiextractor">cleaned</a> and loaded a full Wikipedia dump into an Elasticsearch index for testing. To represent a practical application of QA, full articles were indexed rather than pre-parsed paragraphs (as we saw in the <a href="https://qa.fastforwardlabs.com/elasticsearch/mean%20average%20precision/recall%20for%20irqa/qa%20system%20design/2020/06/30/Evaluating_the_Retriever_&_End_to_End_System.html">previous post</a>). The <a href="https://ai.google.com/research/NaturalQuestions/download">NQ development set</a> was processed to drop all long and yes/no answer questions, yielding 7651 short and null answer examples. Additionally, answers with more than five tokens were discarded as answers with many tokens often resemble extractive snippets rather than canonical answers.</p>
<p>With our knowledge corpus and evaluation data ready to go, we evaluated our Retriever performance (with the same methodology used in the last post) over an increasing number of documents, while toggling entity and synonym expansion methods at search time.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><img src="/images/copied_from_nb/../images/post_arr/expansion_results.png" alt="" /></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Intuitively, returning more articles increases Retriever recall because there is more content that may contain a question’s answer. However, that performance gain begins to plateau as additional content becomes less relevant to the query: recall levels off at ~70% when retrieving 14 full Wikipedia articles. Despite the differences in experimental setup, when we compare that to the ~83% recall on SQuAD in the <a href="https://qa.fastforwardlabs.com/elasticsearch/mean%20average%20precision/recall%20for%20irqa/qa%20system%20design/2020/06/30/Evaluating_the_Retriever_&_End_to_End_System.html">last post</a> (when retrieving only three <em>paragraphs</em> of content), it becomes evident just how much more challenging the NQ dataset actually is. We also observe that entity expansion provides a slight improvement to recall, as it increases the specificity of the query to help target multi-word phrases in articles.</p>
<p>Finally, we observe a loss in performance when expanding candidate question terms with synonyms from either proposed method. While the intuition behind synonym expansion makes sense in theory, we ultimately find it very difficult to implement a "one-size-fits-all" approach for determining relevant synonyms. Even after adding a probability threshold for MLM expansion term predictions to further minimize the chance of introducing spurious terms, we are unable to consistently restrict semantic-altering words.</p>
<p>Qualitative analysis reveals that MLM expansion does work in many cases, but also over-generalizes in others. In the world of information retrieval, relevancy tuning is a deeply complex subject and requires customization for each use case. Therefore, it is generally not recommended to apply blanket synonym expansion techniques. Rather, expanding similar terms from a curated ontology or acronym lookup specific to your domain at search time may prove beneficial.</p>
</div>
</div>
</div>
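<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As a rough illustration of the recall metric discussed above, the sketch below counts a question as a hit when its answer string appears in any of the top <em>k</em> retrieved documents. This is a simplification under stated assumptions: the function name and data layout are hypothetical, matching is a case-insensitive substring check, only answerable examples are included, and the documents are toy strings rather than real search results.</p>

```python
def retriever_recall(examples, top_k):
    """Fraction of questions whose answer appears in the top_k retrieved docs."""
    if not examples:
        return 0.0
    hits = 0
    for example in examples:
        answer = example["answer"].lower()
        docs = example["retrieved_docs"][:top_k]
        if any(answer in doc.lower() for doc in docs):
            hits += 1
    return hits / len(examples)


# Toy data: retrieved_docs are ordered by search score, best first.
toy_examples = [
    {
        "answer": "President Snow",
        "retrieved_docs": [
            "President Snow is the main antagonist of the trilogy.",
            "The films were released between 2012 and 2015.",
        ],
    },
    {
        "answer": "Silicon Valley",
        "retrieved_docs": [
            "Thomas Middleditch is a Canadian actor.",
            "He starred in the HBO series Silicon Valley.",
        ],
    },
]

for k in (1, 2):
    print(k, retriever_recall(toy_examples, top_k=k))
```

<p>On this toy data, widening the search from one document to two lifts recall, which mirrors the trend in the figure: more retrieved content gives the answer more chances to appear.</p>
</div>
</div>
</div>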
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Passage-Ranking">Passage Ranking<a class="anchor-link" href="#Passage-Ranking"> </a></h1>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>We know from our last post that in an IR QA system, the Reader is bounded by the performance of the Retriever, and fetching more documents increases system recall. However, simply increasing the number of documents passed to the Reader also increases the amount of irrelevant information to be processed, and degrades the overall QA system performance - in regard to both speed and accuracy. But what if we could have the best of both worlds?</p>
<p>Passage ranking is a technique that involves selecting a subset of re-ranked paragraphs from a collection of retrieved documents to retain the answer recall from those documents, while filtering out noisy content. By implementing a quick and efficient passage ranking technique, our QA pipeline considers more documents' worth of information, but distills content down to only the relevant pieces for the Reader to process.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><img src="/images/copied_from_nb/../images/post_arr/passage_ranking_flow.png" alt="" /></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The concept of passage ranking is inspired by a field of information retrieval called <em>Learning to Rank</em>, which frames relevance ranking as a supervised machine learning problem. While many <a href="https://www.aclweb.org/anthology/D18-1053.pdf">supervised ranking approaches</a> have proven successful, they require learning custom similarity metrics and introduce additional complexity into a QA system - making them impractical for many use cases. In contrast, the passage ranking implementation that we consider here is a simple, unsupervised approach that demonstrates a viable way to improve IR for general question answering applications. Our passage ranking process consists of the following steps at search time:</p>
<ol>
<li>The query question and a set of N candidate documents from Elasticsearch are fed as input.</li>
<li>All documents are split into paragraphs.</li>
<li>The list of paragraphs and the input question are converted into a sparse document-term matrix (DTM) using TF-IDF vectorization. We preserve n-grams during vectorization, so the final DTM includes single terms, bi-grams, and tri-grams.</li>
<li>Cosine similarity is calculated between the question vector and each paragraph vector.</li>
<li>Paragraphs are sorted based on similarity score and the top M passages are passed on to the Reader for answer extraction.</li>
</ol>
</div>
</div>
</div>
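<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The five steps above can be sketched with the standard library alone. This is a simplified stand-in rather than the implementation used for the experiments: in practice a library vectorizer (for example scikit-learn's <code>TfidfVectorizer</code> with <code>ngram_range=(1, 3)</code>) would likely do the heavy lifting. The function names, the blank-line paragraph splitting, and the smoothed IDF formula are all assumptions made for illustration.</p>

```python
import math
import re
from collections import Counter


def ngram_terms(text, max_n=3):
    """Lowercased word tokens plus their bi-grams and tri-grams."""
    tokens = re.findall(r"\w+", text.lower())
    terms = []
    for n in range(1, max_n + 1):
        terms.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return terms


def tfidf_vectors(texts):
    """Sparse TF-IDF vectors (term -> weight dicts) using a smoothed IDF."""
    counts = [Counter(ngram_terms(text)) for text in texts]
    doc_freq = Counter(term for count in counts for term in count)
    n_texts = len(texts)
    idf = {t: math.log((1 + n_texts) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in count.items()} for count in counts]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_passages(question, documents, top_m=20):
    """Split documents into paragraphs and return the top_m most similar ones."""
    paragraphs = [p.strip() for doc in documents for p in doc.split("\n\n") if p.strip()]
    vectors = tfidf_vectors(paragraphs + [question])
    question_vec = vectors[-1]
    scored = [(cosine(question_vec, vec), p) for vec, p in zip(vectors, paragraphs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:top_m]]


docs = [
    "President Snow is the villain in The Hunger Games.\n\nThe films were shot in North Carolina.",
    "Montreal has a large botanical garden.\n\nRoses bloom in summer.",
]
top = rank_passages("Who is the villain in The Hunger Games?", docs, top_m=1)
print(top)
```

<p>Fitting the vectorizer on the retrieved paragraphs at search time keeps the approach unsupervised: the term weights reflect only the small candidate pool for the current question, and no ranking model has to be trained.</p>
</div>
</div>
</div>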
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Does-Passage-Ranking-Improve-Retrieval-on-Natural-Questions?">Does Passage Ranking Improve Retrieval on Natural Questions?<a class="anchor-link" href="#Does-Passage-Ranking-Improve-Retrieval-on-Natural-Questions?"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To demonstrate the effect of passage ranking, we evaluated system recall across the NQ development dataset while retrieving one article’s worth of content as determined from two different methods:</p>
<ul>
<li>The top one document scored and returned directly from Elasticsearch</li>
<li>The top 20 ranked paragraphs pooled from N candidate documents</li>
</ul>
<p>(A fair comparison between these two methods requires consideration of an equal amount of retrieved content. We found Wikipedia articles corresponding to our NQ development set contained 20 paragraphs on average.)</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><img src="/images/copied_from_nb/../images/post_arr/passage_ranking_results.png" alt="" /></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>In the top half of the figure above, we see that re-ranking all paragraphs from five input documents and selecting only the top 20 (represented by the green dot) allows us to achieve a system recall of 53% in comparison to the 44% recall attained by Elasticsearch returning the top document alone (without passage ranking). These results support the notion that introducing an intermediate passage ranking step into a QA pipeline allows us to improve recall by almost ten percentage points for a fixed amount of context.</p>
<p>We notice that with passage ranking, recall increases as the number of candidate documents increases - up until five documents, at which point it slowly declines. This demonstrates that after five documents' worth of content, our sparse vector ranking algorithm loses signal as additional unrelated noise is introduced to the ranking corpus.</p>
<p>We also measure the latency of the passage ranking process and notice that the peak recall (at five documents' worth of content) comes at a cost of four times the processing time (retrieval + passage ranking vs. just retrieving one document from Elasticsearch). While this appears considerable, it's important to frame this cost in the setting of the full QA pipeline. Passage ranking enables our already slow Reader to process only one fifth the amount of content while placing roughly 20% more answers (53% vs. 44% recall) in the context window, for the price of ~0.1 seconds.</p>
</div>
</div>
</div>
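<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The figures quoted above mix absolute and relative comparisons, so a short worked calculation may help. The recall values and the average of 20 paragraphs per article come from the text; the variable names are ours.</p>

```python
recall_top_document = 0.44   # top Elasticsearch document, no passage ranking
recall_reranked = 0.53       # top 20 paragraphs pooled from five documents

# Absolute gain is measured in percentage points; relative gain in percent.
absolute_gain = recall_reranked - recall_top_document
relative_gain = absolute_gain / recall_top_document

# Five candidate articles at ~20 paragraphs each, of which the Reader sees 20.
paragraphs_considered = 5 * 20
reader_fraction = 20 / paragraphs_considered

print(f"absolute gain: {absolute_gain * 100:.0f} percentage points")
print(f"relative gain: {relative_gain:.0%}")
print(f"fraction of considered content read: {reader_fraction:.0%}")
```

<p>The nine-point absolute improvement and the roughly 20% relative improvement describe the same result, and the Reader processes one fifth of the content that was considered during ranking.</p>
</div>
</div>
</div>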
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Final-Thoughts">Final Thoughts<a class="anchor-link" href="#Final-Thoughts"> </a></h1>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>In this post, we identified a few challenges in applying IR QA systems to a more realistic QA task, and we looked at a variety of techniques to help improve information retrieval. In the end, we found that entity expansion and passage ranking prove successful in returning more answer-containing context from which the Reader can extract answers. Additionally, we learned that while contextual synonym expansion may help Elasticsearch in some instances, it cannot be used as a blanket approach for relevancy tuning. In our <a href="https://qa.fastforwardlabs.com/domain%20adaptation/transfer%20learning/specialized%20datasets/qa/medical%20qa/2020/07/22/QA-for-Specialized-Data.html">final blog post</a>, we'll explore how transfer learning can help boost Reader performance on a domain-specific dataset!</p>
</div>
</div>
</div>
</div>
<script type="application/vnd.jupyter.widget-state+json">
{"0f9413092ffe496d8f31c374732e870d": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null}}, "139cd5c5c43f436495dfddbdfc7d35c3": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": ""}}, "15f44b8cf87d459bb6a40b5e6b9db470": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, 
"grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null}}, "26b9259fecf4445b8fdcb753ed6d09ef": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "bar_color": null, "description_width": "initial"}}, "26ef4a7e1f76465ead7e51c1d9866c6f": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, 
"overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null}}, "28a305d13e584615b9ba3cd9a37bdd56": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "ProgressView", "bar_style": "success", "description": "100%", "description_tooltip": null, "layout": "IPY_MODEL_a8cebe4ccf004d84bdc37ce44d1192e8", "max": 20239, "min": 0, "orientation": "horizontal", "style": "IPY_MODEL_a44c2693aee14b89a4c6512c85142f51", "value": 20239}}, "3c0ffe3eeca741899ddbe1306e60ce39": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_26ef4a7e1f76465ead7e51c1d9866c6f", "placeholder": "\u200b", "style": "IPY_MODEL_7a1bbfb20bd84d4b8995584a37dabae1", "value": " 11873/11873 [6:21:18<00:00, 1.93s/it]"}}, "3d3ce5e897b84a218237582372bca2bb": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": ""}}, "40b3ea36c5ca407597e8ce6c738c9786": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_model_module": "@jupyter-widgets/controls", "_model_module_version": 
"1.5.0", "_model_name": "ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "bar_color": null, "description_width": "initial"}}, "62fb51c849b04bc78c629dd42f811deb": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_0f9413092ffe496d8f31c374732e870d", "placeholder": "\u200b", "style": "IPY_MODEL_3d3ce5e897b84a218237582372bca2bb", "value": " 11873/11873 [7:01:01<00:00, 2.13s/it]"}}, "66e0efa9685248629e009c1e255bff6e": {"model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": {"_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HBoxView", "box_style": "", "children": ["IPY_MODEL_f6c75626f2d54bfbaa7bad761af5b9c2", "IPY_MODEL_62fb51c849b04bc78c629dd42f811deb"], "layout": "IPY_MODEL_f0858954bbfd447eb95a04a3caabf8a4"}}, "705fa1214c044cbcae6f5d109b22802d": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "ProgressView", "bar_style": "success", "description": "100%", "description_tooltip": null, "layout": "IPY_MODEL_f04b80c35efb44b5bec078f4d7a57f3b", "max": 11873, "min": 0, "orientation": "horizontal", "style": 
"IPY_MODEL_e30ff795eb1c4016be3967f190764399", "value": 11873}}, "7a1bbfb20bd84d4b8995584a37dabae1": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": ""}}, "9061d4497e4443029cbb21b77281cf31": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null}}, "95923a713e324d18bc9fc82a466703c4": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": 
"IPY_MODEL_ef2c1a3506c549b7878e3ea4a5cc565f", "placeholder": "\u200b", "style": "IPY_MODEL_139cd5c5c43f436495dfddbdfc7d35c3", "value": " 20239/20239 [02:38<00:00, 127.56it/s]"}}, "a299f8fa902348b7807b4d97ebc6027d": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_dca5b0a8c1014a66bc8c5ba988878a85", "placeholder": "\u200b", "style": "IPY_MODEL_b04d963baf9a40789305d1372ffe1aab", "value": " 20239/20239 [04:14<00:00, 79.57it/s]"}}, "a44c2693aee14b89a4c6512c85142f51": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "bar_color": null, "description_width": "initial"}}, "a493a4c07a2743899571330cf6476b74": {"model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": {"_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HBoxView", "box_style": "", "children": ["IPY_MODEL_28a305d13e584615b9ba3cd9a37bdd56", "IPY_MODEL_a299f8fa902348b7807b4d97ebc6027d"], "layout": "IPY_MODEL_e94a6f98578743c096c9b045d9dfdf81"}}, "a8cebe4ccf004d84bdc37ce44d1192e8": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": 
null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null}}, "b04d963baf9a40789305d1372ffe1aab": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": ""}}, "be3cf3e6c6ba40d087f8dd27dd4f5e64": {"model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": {"_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HBoxView", "box_style": "", "children": ["IPY_MODEL_705fa1214c044cbcae6f5d109b22802d", "IPY_MODEL_3c0ffe3eeca741899ddbe1306e60ce39"], "layout": "IPY_MODEL_d193d10e1ad94119a849d123c093f3cc"}}, "d193d10e1ad94119a849d123c093f3cc": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_model_module": "@jupyter-widgets/base", 
"_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null}}, "dca5b0a8c1014a66bc8c5ba988878a85": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, 
"top": null, "visibility": null, "width": null}}, "e30ff795eb1c4016be3967f190764399": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "bar_color": null, "description_width": "initial"}}, "e91c7ba9a6314f0d905d08d0f822cbc1": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null}}, "e94a6f98578743c096c9b045d9dfdf81": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, 
"bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null}}, "ef2c1a3506c549b7878e3ea4a5cc565f": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null}}, "f04b80c35efb44b5bec078f4d7a57f3b": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": 
"LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null}}, "f0858954bbfd447eb95a04a3caabf8a4": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null}}, 
"f4d90214c67749168472a2ba3cb0d72a": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "ProgressView", "bar_style": "success", "description": "100%", "description_tooltip": null, "layout": "IPY_MODEL_9061d4497e4443029cbb21b77281cf31", "max": 20239, "min": 0, "orientation": "horizontal", "style": "IPY_MODEL_40b3ea36c5ca407597e8ce6c738c9786", "value": 20239}}, "f6c75626f2d54bfbaa7bad761af5b9c2": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "ProgressView", "bar_style": "success", "description": "100%", "description_tooltip": null, "layout": "IPY_MODEL_e91c7ba9a6314f0d905d08d0f822cbc1", "max": 11873, "min": 0, "orientation": "horizontal", "style": "IPY_MODEL_26b9259fecf4445b8fdcb753ed6d09ef", "value": 11873}}, "f8d18feb30b24de5bbbb869cc9a31ae8": {"model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": {"_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HBoxView", "box_style": "", "children": ["IPY_MODEL_f4d90214c67749168472a2ba3cb0d72a", "IPY_MODEL_95923a713e324d18bc9fc82a466703c4"], "layout": "IPY_MODEL_15f44b8cf87d459bb6a40b5e6b9db470"}}}
</script>Evaluating QA: the Retriever & the Full QA System2020-06-30T00:00:00-05:002020-06-30T00:00:00-05:00https://qa.fastforwardlabs.com/elasticsearch/mean%20average%20precision/recall%20for%20irqa/qa%20system%20design/2020/06/30/Evaluating_the_Retriever_&_End_to_End_System<!--
#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: _notebooks/2020-06-30-Evaluating_the_Retriever_&_End_to_End_System.ipynb
-->
<div class="container" id="notebook-container">
<div class="cell border-box-sizing code_cell rendered">
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>In our last post, <a href="https://qa.fastforwardlabs.com/no%20answer/null%20threshold/bert/distilbert/exact%20match/f1/robust%20predictions/2020/06/09/Evaluating_BERT_on_SQuAD.html">Evaluating QA: Metrics, Predictions, and the Null Response</a>, we took a deep dive into how to assess the quality of a BERT-like Reader for Question Answering (QA) using the Hugging Face framework. In this post, we'll focus on the other component of a modern Information Retrieval-based (IR) QA system: the Retriever. Specifically, we'll introduce Elasticsearch as a powerful and efficient IR tool that can be used to scour through large corpora and retrieve relevant documents. We'll explain how to implement and evaluate a Retriever in the context of Question Answering and demonstrate its impact on an IR QA system.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Prerequisites">Prerequisites<a class="anchor-link" href="#Prerequisites"> </a></h3><ul>
<li>a basic understanding of Information Retrieval & Search</li>
<li>a basic understanding of IR based QA systems (see our <a href="https://qa.fastforwardlabs.com/">previous posts</a>)</li>
<li>a basic understanding of Transformers and PyTorch</li>
<li>a basic understanding of the SQuAD2.0 dataset</li>
</ul>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Retrieving-the-right-document-is-important">Retrieving the right document is important<a class="anchor-link" href="#Retrieving-the-right-document-is-important"> </a></h1>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As we've discussed throughout this series, many modern QA systems take a two-stage approach to answering questions. In the first stage, a document retriever selects <em>N</em> potentially relevant documents from a given corpus. In the second stage, a machine comprehension model processes each of the <em>N</em> documents to determine an answer to the input question.</p>
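<p>To make the two stages concrete, here is a minimal sketch of a retrieve-then-read loop. The names <code>answer_question</code>, <code>overlap_retrieve</code>, and <code>toy_read</code> are hypothetical stand-ins invented for illustration; they are not the Elasticsearch Retriever or the BERT Reader used in this series, only placeholders with the same shape.</p>

```python
def answer_question(question, corpus, retrieve, read, n_docs=3):
    """Stage 1: select n_docs candidate passages.
    Stage 2: read each candidate and keep the highest-scoring answer."""
    candidates = retrieve(question, corpus, n_docs)
    best_answer, best_score = "", float("-inf")
    for passage in candidates:
        answer, score = read(question, passage)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer


def overlap_retrieve(question, corpus, n_docs):
    """Toy retriever (assumption): rank passages by word overlap with the question."""
    q_terms = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda p: len(q_terms & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:n_docs]


def toy_read(question, passage):
    """Toy reader (assumption): 'answers' with the passage's final token,
    scored by word overlap. A real Reader would return a predicted span."""
    q_terms = set(question.lower().split())
    tokens = passage.split()
    score = float(len(q_terms & {t.lower() for t in tokens}))
    return tokens[-1], score


corpus = [
    "the sky appears blue because of Rayleigh scattering",
    "a pentagon has five sides",
]
print(answer_question("why is the sky blue", corpus, overlap_retrieve, toy_read, n_docs=1))
```

<p>Because the Reader only ever sees what <code>retrieve</code> returns, swapping in a better retriever changes the answer without touching the Reader at all.</p>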
<p>Because of recent advances in NLP and deep learning (i.e., flashy Transformers), the machine comprehension component has typically been the main focus of evaluation and performance enhancement. Retrievers have received limited attention in the context of QA, despite their obvious importance: stage two of an IR QA system is bounded by the performance of stage one. Let's get more specific.</p>
<p>We <a href="https://qa.fastforwardlabs.com/no%20answer/null%20threshold/bert/distilbert/exact%20match/f1/robust%20predictions/2020/06/09/Evaluating_BERT_on_SQuAD.html">recently explained methods</a> that - given a question and context passage - enable BERT-like models to produce robust answers by selectively processing predictions and by refraining from answering certain questions at all. While the ability to properly comprehend a passage and produce a correct answer is a critical feature of any QA tool, the success of the overall system is highly dependent on first providing a correct passage to read through. Without being fed a context passage that actually contains the ground-truth answer, the overall system's performance is limited to how well it can predict no-answer questions.</p>
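<p>One rough way to quantify this ceiling is to count how many questions a perfect Reader could still get right given what the Retriever returned. The sketch below assumes a hypothetical record layout (<code>answer</code> and <code>retrieved</code> keys) and uses simple substring matching as a proxy for "the passage contains the answer"; both are illustrative assumptions rather than the evaluation used in this post.</p>

```python
def retrieval_upper_bound(examples):
    """Fraction of examples a perfect Reader could answer correctly.

    Each example is a dict with:
      'answer'    - ground-truth answer text, or "" for a no-answer question
      'retrieved' - list of passages returned by the Retriever
    """
    reachable = 0
    for ex in examples:
        if ex["answer"] == "":
            # Abstaining is always possible, whatever was retrieved.
            reachable += 1
        elif any(ex["answer"].lower() in p.lower() for p in ex["retrieved"]):
            # The answer text appears in at least one retrieved passage.
            reachable += 1
    return reachable / len(examples)


examples = [
    {"answer": "1961", "retrieved": ["Barack Obama was born on August 4, 1961."]},
    {"answer": "1961", "retrieved": ["Barack Obama Sr. was born on 18 June 1936."]},
    {"answer": "", "retrieved": ["The Pentagon is a building in Virginia."]},
]
print(retrieval_upper_bound(examples))
```

<p>In this toy set, the second example is lost before the Reader ever runs, so no amount of Reader tuning can recover it.</p>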
<p>To demonstrate, we'll revisit an example from our <a href="https://qa.fastforwardlabs.com/pytorch/hugging%20face/wikipedia/bert/transformers/2020/05/19/Getting_Started_with_QA.html">second blog post</a>, in which we asked three questions of a Wikipedia search engine-based QA system:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<pre><code>**Example 1: Incorrect**
Question: When was Barack Obama born?
Top wiki result: <WikipediaPage 'Barack Obama Sr.'>
Answer: 18 June 1936 / February 2 , 1961 /
**Example 2: Correct**
Question: Why is the sky blue?
Top wiki result: <WikipediaPage 'Diffuse sky radiation'>
Answer: Rayleigh scattering /
**Example 3: Correct**
Question: How many sides does a pentagon have?
Top wiki result: <WikipediaPage 'The Pentagon'>
Answer: five /</code></pre>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>In Example 1, the Reader had no chance of producing the correct answer because that answer was simply absent from the context served up by the Retriever. Namely, the Retriever erroneously provided a page about Barack Obama Sr. instead of his son, the former US President. In this case, the only way the Reader could have been correct is if the right response had been to not answer at all.</p>
<p>On the flip side, in Example 3, the Retriever did not identify the globally "correct" document - it returned an article about "The Pentagon" instead of a page about geometry - but nonetheless, it provided enough context for the Reader to succeed.</p>
<p>These quick examples illustrate why an effective Retriever is crucial for an end-to-end QA system. Now let's take a deeper look at a classic tool used for information retrieval - Elasticsearch.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Elasticsearch-as-an-IR-Tool">Elasticsearch as an IR Tool<a class="anchor-link" href="#Elasticsearch-as-an-IR-Tool"> </a></h1>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><img src="https://github.com/fastforwardlabs/ff14_blog/blob/master/_notebooks/my_icons/elasticsearch-logo.png?raw=1" alt="" /></p>
<p>Modern QA systems employ a variety of techniques for the task of information retrieval, ranging from traditional sparse vector word matching (e.g., Elasticsearch) to <a href="https://arxiv.org/pdf/2004.04906.pdf">novel approaches</a> using dense representations of encoded passages combined with <a href="https://github.com/facebookresearch/faiss">efficient search capabilities</a>. Despite the flurry of contemporary research efforts in this area, the traditional sparse vector approach performs very well overall, and has only recently been overtaken by embedding-based systems for QA retrieval tasks. For that reason, we'll explore Elasticsearch as an easy-to-use framework for document retrieval. So, what exactly is Elasticsearch?</p>
<p>Elasticsearch is a powerful open-source search and analytics engine built on the <a href="https://lucene.apache.org/">Apache Lucene</a> library that is capable of handling all types of data - including textual, numerical, geospatial, structured, and unstructured data. It is built to scale with a robust set of features, rich ecosystem, and diverse list of client libraries, making it easy to integrate and use. In the context of information retrieval for automated question answering, we are keenly interested in the features surrounding full-text search.</p>
<p>Elasticsearch provides a convenient way to index documents so they can quickly be queried for nearest neighbor search using a similarity metric based on TF-IDF. Specifically, it uses <a href="https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/">BM25</a> term weighting to represent question and context passages as high-dimensional, sparse vectors that are efficiently searched in an inverted index. For more information on how an inverted index works under the hood, we recommend this quick and concise <a href="https://codingexplained.com/coding/elasticsearch/understanding-the-inverted-index-in-elasticsearch">blog post</a>.</p>
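<p>To give a feel for what happens under the hood, the following is a small, pure-Python sketch of an inverted index scored with the textbook BM25 formula. It is an illustration only, not Elasticsearch's or Lucene's actual implementation: tokenization is a plain whitespace split, the function names are hypothetical, and <code>k1=1.2</code> and <code>b=0.75</code> are used because they are the commonly cited defaults.</p>

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: term frequency}; also record document lengths."""
    index = defaultdict(dict)
    lengths = []
    for doc_id, text in enumerate(docs):
        tokens = text.lower().split()
        lengths.append(len(tokens))
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, lengths


def bm25_search(query, index, lengths, k1=1.2, b=0.75):
    """Score only the documents that share at least one term with the query."""
    n_docs = len(lengths)
    avg_len = sum(lengths) / n_docs
    scores = defaultdict(float)
    for term in set(query.lower().split()):
        postings = index.get(term, {})
        if not postings:
            continue
        # Rare terms get a larger inverse document frequency weight.
        idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


docs = [
    "the pentagon is a building",
    "a pentagon has five sides",
    "the sky is blue",
]
index, lengths = build_index(docs)
results = bm25_search("pentagon sides", index, lengths)
print(results)
```

<p>The inverted index is what makes the search efficient: only documents listed under a query term are ever scored, so the third document is never touched for this query.</p>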
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Using-Elasticsearch-with-SQuAD2.0">Using Elasticsearch with SQuAD2.0<a class="anchor-link" href="#Using-Elasticsearch-with-SQuAD2.0"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>With this basic understanding of how Elasticsearch works, let's dive in and build our own Document Retrieval system by indexing a set of Wikipedia article paragraphs that support questions and answers from the SQuAD2.0 dataset. Before we get started, we'll need to download and prepare data from SQuAD2.0.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><strong>Download and Prepare SQuAD2.0</strong></p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># collapse-hide</span>
<span class="c1"># Download the SQuAD2.0 train & dev sets</span>
<span class="o">!</span>wget -P data/squad/ https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
<span class="o">!</span>wget -P data/squad/ https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
<span class="kn">import</span> <span class="nn">json</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>A common practice in IR for QA is to segment large articles into smaller passages before indexing, for two main reasons:</p>
<ol>
<li>Transformer-based Readers are slow; providing an entire Wikipedia article to BERT for processing can take 5 to 30 seconds, even with a decent GPU!</li>
<li>Smaller passages reduce noise; by identifying a more concise context passage for BERT to read through, we reduce the chance of BERT getting lost.</li>
</ol>
<p>Of course, the chunking method proposed here doesn't come without a cost. Larger documents contain more information on which to retrieve. By reducing passage size, we are potentially trading off system recall for speed - although, as we will discuss later in this post, there are techniques to alleviate this.</p>
<p>With our chunking approach, each article paragraph will be prepended with the article title, and collectively serve as the corpus of documents over which our Elasticsearch Retriever will search. In practice, open-domain QA systems sit atop massive collections of documents (think: all of Wikipedia) to provide a breadth of information from which to answer general-knowledge questions. For the purposes of demonstrating Elasticsearch functionality, we will limit our corpus to only the Wikipedia articles supporting SQuAD2.0 questions.</p>
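<p>SQuAD2.0 already ships its articles split into paragraphs, but the same idea carries over to a corpus of raw articles. As a minimal sketch, assuming paragraphs are separated by blank lines (an assumption about the input, not something SQuAD requires), a hypothetical helper <code>chunk_article</code> could split an article and prepend its title to each passage:</p>

```python
def chunk_article(title, text):
    """Split an article on blank lines and prepend the title to each paragraph.

    Returns a dict keyed by "<title>_<paragraph index>".
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return {f"{title}_{i}": f"{title} {p}" for i, p in enumerate(paragraphs)}


article_text = (
    "A pentagon has five sides.\n\n"
    "The sum of its interior angles is 540 degrees."
)
passages = chunk_article("Pentagon", article_text)
for key, passage in passages.items():
    print(key, "->", passage)
```

<p>Prepending the title gives every passage at least one mention of the article's subject, which helps a term-matching Retriever when a paragraph refers to that subject only by pronoun.</p>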
<p>The following <code>parse_qa_records</code> function will extract question/answer examples, as well as paragraph content, from the SQuAD2.0 dataset.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># collapse-hide</span>
<span class="k">def</span> <span class="nf">parse_qa_records</span><span class="p">(</span><span class="n">data</span><span class="p">):</span>
    <span class="sd">'''</span>
<span class="sd">    Loop through SQuAD2.0 dataset and parse out question/answer examples and unique article paragraphs</span>
<span class="sd">    </span>
<span class="sd">    Returns:</span>
<span class="sd">        qa_records (list) - Question/answer examples as list of dictionaries</span>
<span class="sd">        wiki_articles (list) - Unique Wikipedia titles and article paragraphs recreated from SQuAD data</span>
<span class="sd">    </span>
<span class="sd">    '''</span>
    <span class="n">num_with_ans</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">num_without_ans</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">qa_records</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">wiki_articles</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="n">article</span> <span class="ow">in</span> <span class="n">data</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">paragraph</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">article</span><span class="p">[</span><span class="s1">'paragraphs'</span><span class="p">]):</span>
            <span class="n">wiki_articles</span><span class="p">[</span><span class="n">article</span><span class="p">[</span><span class="s1">'title'</span><span class="p">]</span><span class="o">+</span><span class="sa">f</span><span class="s1">'_</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s1">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">article</span><span class="p">[</span><span class="s1">'title'</span><span class="p">]</span> <span class="o">+</span> <span class="s1">' '</span> <span class="o">+</span> <span class="n">paragraph</span><span class="p">[</span><span class="s1">'context'</span><span class="p">]</span>
            <span class="k">for</span> <span class="n">questions</span> <span class="ow">in</span> <span class="n">paragraph</span><span class="p">[</span><span class="s1">'qas'</span><span class="p">]:</span>
                <span class="n">qa_record</span> <span class="o">=</span> <span class="p">{}</span>
                <span class="n">qa_record</span><span class="p">[</span><span class="s1">'example_id'</span><span class="p">]</span> <span class="o">=</span> <span class="n">questions</span><span class="p">[</span><span class="s1">'id'</span><span class="p">]</span>
                <span class="n">qa_record</span><span class="p">[</span><span class="s1">'document_title'</span><span class="p">]</span> <span class="o">=</span> <span class="n">article</span><span class="p">[</span><span class="s1">'title'</span><span class="p">]</span>
                <span class="n">qa_record</span><span class="p">[</span><span class="s1">'question_text'</span><span class="p">]</span> <span class="o">=</span> <span class="n">questions</span><span class="p">[</span><span class="s1">'question'</span><span class="p">]</span>
                <span class="k">try</span><span class="p">:</span>
                    <span class="n">qa_record</span><span class="p">[</span><span class="s1">'short_answer'</span><span class="p">]</span> <span class="o">=</span> <span class="n">questions</span><span class="p">[</span><span class="s1">'answers'</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s1">'text'</span><span class="p">]</span>
                    <span class="n">num_with_ans</span> <span class="o">+=</span> <span class="mi">1</span>
                <span class="k">except</span> <span class="ne">IndexError</span><span class="p">:</span>  <span class="c1"># no-answer questions have an empty 'answers' list</span>
                    <span class="n">qa_record</span><span class="p">[</span><span class="s1">'short_answer'</span><span class="p">]</span> <span class="o">=</span> <span class="s2">""</span>
                    <span class="n">num_without_ans</span> <span class="o">+=</span> <span class="mi">1</span>
                <span class="n">qa_records</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">qa_record</span><span class="p">)</span>
<span class="n">wiki_articles</span> <span class="o">=</span> <span class="p">[{</span><span class="s1">'document_title'</span><span class="p">:</span><span class="n">title</span><span class="p">,</span> <span class="s1">'document_text'</span><span class="p">:</span> <span class="n">text</span><span class="p">}</span>\
<span class="k">for</span> <span class="n">title</span><span class="p">,</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">wiki_articles</span><span class="o">.</span><span class="n">items</span><span class="p">()]</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'Data contains </span><span class="si">{</span><span class="n">num_with_ans</span><span class="si">}</span><span class="s1"> question/answer pairs with a short answer, and </span><span class="si">{</span><span class="n">num_without_ans</span><span class="si">}</span><span class="s1"> without.'</span><span class="o">+</span>
<span class="sa">f</span><span class="s1">'</span><span class="se">\n</span><span class="s1">There are </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">wiki_articles</span><span class="p">)</span><span class="si">}</span><span class="s1"> unique wikipedia article paragraphs.'</span><span class="p">)</span>
<span class="k">return</span> <span class="n">qa_records</span><span class="p">,</span> <span class="n">wiki_articles</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># load and parse data</span>
<span class="n">train_file</span> <span class="o">=</span> <span class="s2">"data/squad/train-v2.0.json"</span>
<span class="n">dev_file</span> <span class="o">=</span> <span class="s2">"data/squad/dev-v2.0.json"</span>
<span class="n">train</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="n">train_file</span><span class="p">,</span> <span class="s1">'rb'</span><span class="p">))</span>
<span class="n">dev</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="n">dev_file</span><span class="p">,</span> <span class="s1">'rb'</span><span class="p">))</span>
<span class="n">qa_records</span><span class="p">,</span> <span class="n">wiki_articles</span> <span class="o">=</span> <span class="n">parse_qa_records</span><span class="p">(</span><span class="n">train</span><span class="p">[</span><span class="s1">'data'</span><span class="p">])</span>
<span class="n">qa_records_dev</span><span class="p">,</span> <span class="n">wiki_articles_dev</span> <span class="o">=</span> <span class="n">parse_qa_records</span><span class="p">(</span><span class="n">dev</span><span class="p">[</span><span class="s1">'data'</span><span class="p">])</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Data contains 86821 question/answer pairs with a short answer, and 43498 without.
There are 19035 unique wikipedia article paragraphs.
Data contains 5928 question/answer pairs with a short answer, and 5945 without.
There are 1204 unique wikipedia article paragraphs.
</pre>
</div>
</div>
</div>
</div>
</div>
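<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To make the nested layout that the parser walks through a little more concrete, here is a minimal sketch of a SQuAD2.0-style record. The article title, context, questions, and ids are made up for illustration; only the key names follow the SQuAD2.0 JSON format. Note that unanswerable questions carry an empty <code>answers</code> list, which is why a missing short answer shows up as an empty string.</p>
</div>
</div>
</div>

```python
# Toy data mirroring the SQuAD2.0 JSON layout (all contents are hypothetical)
context = "Elasticsearch is a search engine built on Apache Lucene."
toy_squad = {
    "version": "v2.0",
    "data": [
        {
            "title": "Example_Article",
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is Elasticsearch built on?",
                            "is_impossible": False,
                            "answers": [
                                {
                                    "text": "Apache Lucene",
                                    # character offset of the answer within the context
                                    "answer_start": context.index("Apache Lucene"),
                                }
                            ],
                        },
                        {
                            "id": "q2",
                            "question": "Who founded Wikipedia?",
                            "is_impossible": True,
                            "answers": [],  # unanswerable from this context
                        },
                    ],
                }
            ],
        }
    ],
}

# Walk articles, then paragraphs, then questions, keeping the first answer if one exists
short_answers = {}
for article in toy_squad["data"]:
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            answers = qa["answers"]
            short_answers[qa["id"]] = answers[0]["text"] if answers else ""

print(short_answers)
```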
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># parsed record example</span>
<span class="n">qa_records</span><span class="p">[</span><span class="mi">10</span><span class="p">]</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>{'example_id': '56d43c5f2ccc5a1400d830ab',
'document_title': 'Beyoncé',
'question_text': 'What was the first album Beyoncé released as a solo artist?',
'short_answer': 'Dangerously in Love'}</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># parsed wiki paragraph example</span>
<span class="nb">print</span><span class="p">(</span><span class="n">wiki_articles</span><span class="p">[</span><span class="mi">10</span><span class="p">])</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>{'document_title': 'Beyoncé_10', 'document_text': 'Beyoncé Beyoncé\'s first solo recording was a feature on Jay Z\'s "\'03 Bonnie & Clyde" that was released in October 2002, peaking at number four on the U.S. Billboard Hot 100 chart. Her first solo album Dangerously in Love was released on June 24, 2003, after Michelle Williams and Kelly Rowland had released their solo efforts. The album sold 317,000 copies in its first week, debuted atop the Billboard 200, and has since sold 11 million copies worldwide. The album\'s lead single, "Crazy in Love", featuring Jay Z, became Beyoncé\'s first number-one single as a solo artist in the US. The single "Baby Boy" also reached number one, and singles, "Me, Myself and I" and "Naughty Girl", both reached the top-five. The album earned Beyoncé a then record-tying five awards at the 46th Annual Grammy Awards; Best Contemporary R&B Album, Best Female R&B Vocal Performance for "Dangerously in Love 2", Best R&B Song and Best Rap/Sung Collaboration for "Crazy in Love", and Best R&B Performance by a Duo or Group with Vocals for "The Closer I Get to You" with Luther Vandross.'}
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><strong>Download Elasticsearch</strong></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>With our data ready to go, let's download, install, and configure Elasticsearch. We recommend opening this post as a Colab notebook and executing the following code snippet to set up Elasticsearch. Alternatively, you can install and launch Elasticsearch on your local machine by following the instructions <a href="https://www.elastic.co/downloads/elasticsearch">here</a>.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># collapse-hide</span>
<span class="c1"># if using Colab - download the Elasticsearch release tarball and start it as a background process</span>
<span class="o">!</span> wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
<span class="o">!</span> tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
<span class="o">!</span> chown -R daemon:daemon elasticsearch-7.6.2
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">from</span> <span class="nn">subprocess</span> <span class="kn">import</span> <span class="n">Popen</span><span class="p">,</span> <span class="n">PIPE</span><span class="p">,</span> <span class="n">STDOUT</span>
<span class="n">es_server</span> <span class="o">=</span> <span class="n">Popen</span><span class="p">([</span><span class="s1">'elasticsearch-7.6.2/bin/elasticsearch'</span><span class="p">],</span>
<span class="n">stdout</span><span class="o">=</span><span class="n">PIPE</span><span class="p">,</span> <span class="n">stderr</span><span class="o">=</span><span class="n">STDOUT</span><span class="p">,</span>
<span class="n">preexec_fn</span><span class="o">=</span><span class="k">lambda</span><span class="p">:</span> <span class="n">os</span><span class="o">.</span><span class="n">setuid</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># as daemon</span>
<span class="p">)</span>
<span class="c1"># wait until ES has started</span>
<span class="o">!</span> sleep <span class="m">30</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><strong>Load Data into Elasticsearch</strong></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>We'll use the <a href="https://elasticsearch-py.readthedocs.io/en/master/">official low-level Python client library</a> for interacting with Elasticsearch.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># collapse-hide</span>
<span class="o">!</span>pip install elasticsearch
<span class="o">!</span>pip install tqdm
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>By default, Elasticsearch is launched locally on port 9200. We first need to instantiate an Elasticsearch client object and connect to the service.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">elasticsearch</span> <span class="kn">import</span> <span class="n">Elasticsearch</span>
<span class="n">config</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'host'</span><span class="p">:</span><span class="s1">'localhost'</span><span class="p">,</span> <span class="s1">'port'</span><span class="p">:</span><span class="mi">9200</span><span class="p">}</span>
<span class="n">es</span> <span class="o">=</span> <span class="n">Elasticsearch</span><span class="p">([</span><span class="n">config</span><span class="p">])</span>
<span class="c1"># test connection</span>
<span class="n">es</span><span class="o">.</span><span class="n">ping</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>True</pre>
</div>
</div>
</div>
</div>
</div>
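<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Startup time varies from machine to machine, so a fixed wait may occasionally be too short. One alternative is to poll the service until it responds. The helper below is a hypothetical sketch (<code>wait_until_ready</code> is our own name, not part of the Elasticsearch client); it accepts any zero-argument callable, so a bound method such as <code>es.ping</code> could be passed in.</p>
</div>
</div>
</div>

```python
import time

def wait_until_ready(ping_fn, max_attempts=30, interval=1.0):
    """Call ping_fn until it returns a truthy value or the attempts run out.

    Returns True if the service responded, False otherwise.
    """
    for _ in range(max_attempts):
        try:
            if ping_fn():
                return True
        except Exception:
            # treat connection errors the same as a failed ping and retry
            pass
        time.sleep(interval)
    return False

# Stand-in for a real ping call: fails twice, then succeeds
responses = iter([False, False, True])
print(wait_until_ready(lambda: next(responses), max_attempts=5, interval=0))
```

<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>With a live client, the call would look something like <code>wait_until_ready(es.ping)</code>.</p>
</div>
</div>
</div>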
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Before we go further, let's introduce a few concepts that are specific to Elasticsearch and the process of indexing data. An <em>index</em> is a collection of documents that have common characteristics (similar to a database schema in an RDBMS). <em>Documents</em> are JSON objects having their own set of key-value pairs consisting of various data types (similar to rows/fields in RDBMS).</p>
<p>When we add a document into an index, the document's text fields undergo analysis prior to being indexed. This means that when executing a search query against an index, we are actually searching against the post-processed representation that is stored in the inverted index, not the raw input document itself.</p>
<p><img src="https://github.com/fastforwardlabs/ff14_blog/blob/master/_notebooks/my_icons/elastic_index_process.png?raw=1" alt="Elasticsearch Index Process" />
<a href="https://codingexplained.com/coding/elasticsearch/understanding-analysis-in-elasticsearch-analyzers#:~:text=A%20Closer%20Look%20at%20Analyzers,documents%20when%20they%20are%20indexed.&text=An%20analyzer%20consists%20of%20three,them%20changing%20the%20input%20stream.">Image Credit</a></p>
<p>The analysis process is a customizable pipeline carried out by an <em>Analyzer</em>. Elasticsearch analyzer pipelines are composed of three sequential steps: <em>character filters</em>, a <em>tokenizer</em>, and <em>token filters.</em> Each of these components modifies the input stream of text according to some configurable settings.</p>
<ul>
<li><strong>Character filters</strong> have the ability to add, remove, or replace characters. A common application is to strip <code>html</code> markup from the raw input. </li>
<li>The character-filtered text is passed to a <strong>tokenizer</strong> which breaks up the input string into individual tokens. The default (<code>standard</code>) tokenizer splits text on word boundaries, which in practice means whitespace and most punctuation (commas, periods, semicolons, and so on).</li>
<li>The token stream is passed to a <strong>token filter</strong> which adds, removes, or modifies tokens. Typical token filters include converting all text to <code>lowercase</code>, and removing <code>stop</code> words. </li>
</ul>
<p>Elasticsearch comes with several built-in Analyzers that satisfy common use cases and defaults to the <code>Standard Analyzer</code>. The Standard Analyzer doesn't contain any character filters, uses a <code>standard</code> tokenizer, and applies a <code>lowercase</code> token filter. Let's take a look at an example sentence as it's passed through this pipeline:</p>
<blockquote><p>"I'm in the mood for drinking semi-dry red wine!"</p>
</blockquote>
<p><img src="https://github.com/fastforwardlabs/ff14_blog/blob/master/_notebooks/my_icons/elasticsearch_standard_analyzer.png?raw=1" alt="Elasticsearch Analyzer Pipeline" /><a href="https://codingexplained.com/coding/elasticsearch/understanding-analysis-in-elasticsearch-analyzers#:~:text=A%20Closer%20Look%20at%20Analyzers,documents%20when%20they%20are%20indexed.&text=An%20analyzer%20consists%20of%20three,them%20changing%20the%20input%20stream.">Image Credit</a></p>
<p>Crafting analyzers to your use case requires domain knowledge of the problem and dataset at hand, and doing so properly is key to optimizing relevance scoring for your search application. We found <a href="https://medium.com/elasticsearch/contents-cebdc419c8c9">this blog series</a> very useful in explaining the importance of analysis in Elasticsearch.</p>
</div>
</div>
</div>
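<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The three-step pipeline can be imitated in a few lines of plain Python. This is only a rough approximation of the Standard Analyzer, assuming ASCII text and a simple regular expression in place of Elasticsearch's Unicode-aware tokenizer; the function name <code>standard_like_analyzer</code> is our own.</p>
</div>
</div>
</div>

```python
import re

def standard_like_analyzer(text):
    """Rough, ASCII-only imitation of the Standard Analyzer pipeline."""
    # 1. character filters: the Standard Analyzer applies none
    filtered = text
    # 2. tokenizer: runs of letters/digits, keeping apostrophes that sit inside a word
    tokens = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", filtered)
    # 3. token filter: lowercase every token
    return [token.lower() for token in tokens]

print(standard_like_analyzer("I'm in the mood for drinking semi-dry red wine!"))
```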
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><strong><em>Create an Index</em></strong></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Let's create a new index and add our Wikipedia articles to it. To do so, we provide a name and optionally some index configurations. Here we are specifying a set of <code>mappings</code> that indicate our anticipated index schema, data types, and how the text fields should be processed. If no <code>body</code> is passed, Elasticsearch will automatically infer fields and data types from incoming documents, as well as apply the <code>Standard Analyzer</code> to any text fields.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">index_config</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"settings"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"analysis"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"analyzer"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"standard_analyzer"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"type"</span><span class="p">:</span> <span class="s2">"standard"</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">},</span>
<span class="s2">"mappings"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"dynamic"</span><span class="p">:</span> <span class="s2">"strict"</span><span class="p">,</span>
<span class="s2">"properties"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"document_title"</span><span class="p">:</span> <span class="p">{</span><span class="s2">"type"</span><span class="p">:</span> <span class="s2">"text"</span><span class="p">,</span> <span class="s2">"analyzer"</span><span class="p">:</span> <span class="s2">"standard_analyzer"</span><span class="p">},</span>
<span class="s2">"document_text"</span><span class="p">:</span> <span class="p">{</span><span class="s2">"type"</span><span class="p">:</span> <span class="s2">"text"</span><span class="p">,</span> <span class="s2">"analyzer"</span><span class="p">:</span> <span class="s2">"standard_analyzer"</span><span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">index_name</span> <span class="o">=</span> <span class="s1">'squad-standard-index'</span>
<span class="n">es</span><span class="o">.</span><span class="n">indices</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="n">index_name</span><span class="p">,</span> <span class="n">body</span><span class="o">=</span><span class="n">index_config</span><span class="p">,</span> <span class="n">ignore</span><span class="o">=</span><span class="mi">400</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>{'acknowledged': True,
'index': 'squad-standard-index',
'shards_acknowledged': True}</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><strong><em>Populate the Index</em></strong></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>We can then loop through our list of Wikipedia titles and articles and add them to our newly created Elasticsearch index.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># collapse-hide</span>
<span class="kn">from</span> <span class="nn">tqdm.notebook</span> <span class="kn">import</span> <span class="n">tqdm</span>
<span class="k">def</span> <span class="nf">populate_index</span><span class="p">(</span><span class="n">es_obj</span><span class="p">,</span> <span class="n">index_name</span><span class="p">,</span> <span class="n">evidence_corpus</span><span class="p">):</span>
    <span class="sd">'''</span>
<span class="sd">    Loads records into an existing Elasticsearch index</span>
<span class="sd">    Args:</span>
<span class="sd">        es_obj (elasticsearch.client.Elasticsearch) - Elasticsearch client object</span>
<span class="sd">        index_name (str) - Name of index</span>
<span class="sd">        evidence_corpus (list) - List of dicts containing data records</span>
<span class="sd">    '''</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">rec</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">tqdm</span><span class="p">(</span><span class="n">evidence_corpus</span><span class="p">)):</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">index_status</span> <span class="o">=</span> <span class="n">es_obj</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="n">index_name</span><span class="p">,</span> <span class="nb">id</span><span class="o">=</span><span class="n">i</span><span class="p">,</span> <span class="n">body</span><span class="o">=</span><span class="n">rec</span><span class="p">)</span>
        <span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">err</span><span class="p">:</span>  <span class="c1"># report the failure and keep loading the remaining documents</span>
            <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'Unable to load document </span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s1">: </span><span class="si">{</span><span class="n">err</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
    <span class="n">n_records</span> <span class="o">=</span> <span class="n">es_obj</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="n">index_name</span><span class="p">)[</span><span class="s1">'count'</span><span class="p">]</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'Successfully loaded </span><span class="si">{</span><span class="n">n_records</span><span class="si">}</span><span class="s1"> into </span><span class="si">{</span><span class="n">index_name</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
    <span class="k">return</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">all_wiki_articles</span> <span class="o">=</span> <span class="n">wiki_articles</span> <span class="o">+</span> <span class="n">wiki_articles_dev</span>
<span class="n">populate_index</span><span class="p">(</span><span class="n">es_obj</span><span class="o">=</span><span class="n">es</span><span class="p">,</span> <span class="n">index_name</span><span class="o">=</span><span class="s1">'squad-standard-index'</span><span class="p">,</span> <span class="n">evidence_corpus</span><span class="o">=</span><span class="n">all_wiki_articles</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>
Successfully loaded 20239 into squad-standard-index
</pre>
</div>
</div>
</div>
</div>
</div>
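<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Indexing one document per request is easy to follow, but it means one round trip to Elasticsearch for every paragraph. For larger corpora, the Python client's <code>helpers.bulk</code> function can send many documents per request. The sketch below is an alternative we did not use above; <code>build_bulk_actions</code> and <code>bulk_populate_index</code> are hypothetical names, and the document ids are simply list positions, as in the loop-based approach.</p>
</div>
</div>
</div>

```python
def build_bulk_actions(index_name, evidence_corpus):
    """Yield one bulk action per record, using the list position as the document id."""
    for i, rec in enumerate(evidence_corpus):
        yield {"_index": index_name, "_id": i, "_source": rec}

def bulk_populate_index(es_obj, index_name, evidence_corpus):
    """Load records with the bulk helper; returns (number indexed, list of errors)."""
    # imported here so the action builder above can be tried without the client installed
    from elasticsearch import helpers
    actions = build_bulk_actions(index_name, evidence_corpus)
    return helpers.bulk(es_obj, actions, raise_on_error=False)

# Inspect the actions generated for a tiny, made-up corpus
toy_corpus = [
    {"document_title": "Example_0", "document_text": "First paragraph."},
    {"document_title": "Example_1", "document_text": "Second paragraph."},
]
for action in build_bulk_actions("toy-index", toy_corpus):
    print(action)
```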
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><strong><em>Search the Index</em></strong></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Wahoo! We now have some documents loaded into an index. Elasticsearch provides a rich <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html">query language</a> that supports a diverse range of query types. For this example, we'll use the standard query for performing full text search called a <code>match</code> query. By default, Elasticsearch sorts and returns a JSON response of search results based on a computed <a href="https://qbox.io/blog/practical-guide-elasticsearch-scoring-relevancy#:~:text=Together%2C%20these%20combine%20into%20a,number%20known%20as%20the%20_score.">relevance score</a>, which indicates how well a given document matches the query. In addition, the search response also includes the amount of time the query took to run.</p>
<p>Let's look at a simple <code>match</code> query used to search the <code>document_text</code> field in our newly created index.
<div class="flash flash-warn">
<svg class="octicon octicon-zap" viewBox="0 0 10 16" version="1.1" width="10" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M10 7H6l3-7-9 9h4l-3 7 9-9z"></path></svg>
<strong>Important: </strong>As previously mentioned, all documents in the index have gone through an analysis process prior to indexing; this is called <em>index time analysis.</em> To maintain consistency in matching text queries against the post-processed index tokens, the same Analyzer used on a given field at index time is automatically applied to the query text at search time. <em>Search time analysis</em> is applied depending on which query type is used; <code>match</code> queries apply search time analysis by default.
</div></p>
</div>
</div>
</div>
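<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To see why applying the same analysis on both sides matters, consider a toy inverted index in plain Python. This is a simplified illustration, not how Elasticsearch is implemented: the <code>analyze</code> function is a rough ASCII-only stand-in for the Standard Analyzer, the two documents are made up, and there is no relevance scoring.</p>
</div>
</div>
</div>

```python
import re
from collections import defaultdict

def analyze(text):
    """Rough stand-in for the Standard Analyzer: tokenize, then lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text)]

docs = {
    0: "Semi-dry RED wine pairs well with pasta.",
    1: "The Republic of China was founded in 1912.",
}

# Index time analysis: only the analyzed tokens are stored in the inverted index
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for token in analyze(text):
        inverted_index[token].add(doc_id)

def match(query_text):
    """Search time analysis: run the query through the same analyzer, then look up tokens."""
    hits = set()
    for token in analyze(query_text):
        hits.update(inverted_index.get(token, set()))
    return sorted(hits)

print(match("red WINE!"))       # casing and punctuation differ from the stored document
print("RED" in inverted_index)  # the raw, un-analyzed form was never indexed
```

<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Because the query is lowercased and stripped of punctuation in the same way as the documents, it still finds the first document even though the raw strings differ.</p>
</div>
</div>
</div>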
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># collapse-hide</span>
<span class="k">def</span> <span class="nf">search_es</span><span class="p">(</span><span class="n">es_obj</span><span class="p">,</span> <span class="n">index_name</span><span class="p">,</span> <span class="n">question_text</span><span class="p">,</span> <span class="n">n_results</span><span class="p">):</span>
    <span class="sd">'''</span>
<span class="sd">    Execute an Elasticsearch query on a specified index</span>
<span class="sd">    </span>
<span class="sd">    Args:</span>
<span class="sd">        es_obj (elasticsearch.client.Elasticsearch) - Elasticsearch client object</span>
<span class="sd">        index_name (str) - Name of index to query</span>
<span class="sd">        question_text (str) - Question text to match against the document_text field</span>
<span class="sd">        n_results (int) - Number of results to return</span>
<span class="sd">    </span>
<span class="sd">    Returns</span>
<span class="sd">        res - Elasticsearch response object</span>
<span class="sd">    </span>
<span class="sd">    '''</span>
    <span class="c1"># construct query</span>
    <span class="n">query</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s1">'query'</span><span class="p">:</span> <span class="p">{</span>
                <span class="s1">'match'</span><span class="p">:</span> <span class="p">{</span>
                    <span class="s1">'document_text'</span><span class="p">:</span> <span class="n">question_text</span>
                <span class="p">}</span>
            <span class="p">}</span>
    <span class="p">}</span>
    <span class="n">res</span> <span class="o">=</span> <span class="n">es_obj</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="n">index_name</span><span class="p">,</span> <span class="n">body</span><span class="o">=</span><span class="n">query</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">n_results</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">res</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">question_text</span> <span class="o">=</span> <span class="s1">'Who was the first president of the Republic of China?'</span>
<span class="c1"># execute query</span>
<span class="n">res</span> <span class="o">=</span> <span class="n">search_es</span><span class="p">(</span><span class="n">es_obj</span><span class="o">=</span><span class="n">es</span><span class="p">,</span> <span class="n">index_name</span><span class="o">=</span><span class="s1">'squad-standard-index'</span><span class="p">,</span> <span class="n">question_text</span><span class="o">=</span><span class="n">question_text</span><span class="p">,</span> <span class="n">n_results</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'Question: </span><span class="si">{</span><span class="n">question_text</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'Query Duration: </span><span class="si">{</span><span class="n">res</span><span class="p">[</span><span class="s2">"took"</span><span class="p">]</span><span class="si">}</span><span class="s1"> milliseconds'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Title, Relevance Score:'</span><span class="p">)</span>
<span class="p">[(</span><span class="n">hit</span><span class="p">[</span><span class="s1">'_source'</span><span class="p">][</span><span class="s1">'document_title'</span><span class="p">],</span> <span class="n">hit</span><span class="p">[</span><span class="s1">'_score'</span><span class="p">])</span> <span class="k">for</span> <span class="n">hit</span> <span class="ow">in</span> <span class="n">res</span><span class="p">[</span><span class="s1">'hits'</span><span class="p">][</span><span class="s1">'hits'</span><span class="p">]]</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Question: Who was the first president of the Republic of China?
Query Duration: 74 milliseconds
Title, Relevance Score:
</pre>
</div>
</div>
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>[('Modern_history_54', 23.131157),
('Nanjing_18', 17.076923),
('Republic_of_the_Congo_10', 16.840765),
('Prime_minister_16', 16.137493),
('Korean_War_29', 15.801523),
('Korean_War_43', 15.586578),
('Qing_dynasty_52', 15.291815),
('Chinese_characters_55', 14.773873),
('Korean_War_23', 14.736045),
('2008_Sichuan_earthquake_48', 14.417962)]</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Evaluating-Retriever-Performance">Evaluating Retriever Performance<a class="anchor-link" href="#Evaluating-Retriever-Performance"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Ok, we now have a basic understanding of how to use Elasticsearch as an IR tool to return some results for a given question, but how do we know if it's working? How do we evaluate what a good IR tool looks like?</p>
<p>We'll need two things to evaluate our Retriever: some labeled examples (i.e., SQuAD2.0 question/answer pairs) and some performance metrics. In the conventional world of information retrieval, there are <a href="https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)">many metrics</a> used to quantify the relevance of query results, largely centered around the concepts of precision and recall. For IR in the context of QA, these ideas are adapted into two commonly used evaluation metrics: <em>recall</em> and <em>mean average precision (mAP)</em>. Additionally, we consider the amount of time required to execute a query, since the main point of having a two-stage QA system is to efficiently narrow the large search space for our Reader.</p>
<p><strong>Recall</strong></p>
<p>Traditionally, <em>recall</em> in IR indicates the fraction of all relevant documents that are retrieved. In this case, we are less concerned with finding <em>all</em> of the passages containing the answer and more concerned with the binary presence of a passage containing the correct answer being returned. In that light, a Retriever's recall is defined across a set of questions as <em>the percentage of questions for which the answer segment appears in one of the top N pages returned by the search method.</em></p>
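<p>To make this definition concrete, here is a minimal sketch of the calculation. It assumes each question has already been reduced to a list of 0/1 flags, one per retrieved passage, marking whether that passage contains the answer; the function name and the example flags are hypothetical and purely illustrative.</p>

```python
def recall_at_n(binary_results_per_question):
    """Fraction of questions with at least one answer-containing passage in the top N."""
    if not binary_results_per_question:
        return 0.0
    hits = sum(1 for flags in binary_results_per_question if any(flags))
    return hits / len(binary_results_per_question)


# Hypothetical relevance flags for four questions, N=3 passages each
example_flags = [
    [1, 0, 0],  # answer found in the top-ranked passage
    [0, 0, 1],  # answer found only in the third passage
    [0, 0, 0],  # answer not retrieved at all
    [1, 1, 0],  # answer found in two passages (still counts once)
]

recall = recall_at_n(example_flags)
print(recall)  # 0.75
```

<p>Note that the fourth question counts the same as the first: under this definition, finding the answer in two passages earns no extra credit.</p>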
<p><strong>Mean Average Precision</strong></p>
<p>While the <em>recall</em> metric focuses on the minimum viable result set to enable a Reader for success, we do still care about the composition of that result set. We want a metric that rewards a Retriever for: a) returning a lot of answer-containing documents in the result set (i.e., the traditional meaning of precision), and b) returning those answer-containing documents higher up in the result set than non-answer-containing documents (i.e., ranking them correctly). This is precisely (🙃) what <em>mean average precision</em> (mAP) does for us.</p>
<p>To explain mAP further, let's first break down the concept of average precision for information retrieval. If our Retriever is asked to return <em>N</em> documents and <em>m</em> of those documents contain the true answer, then average precision (AP) is defined as:</p>
<p><img src="https://github.com/fastforwardlabs/ff14_blog/blob/master/_notebooks/my_icons/map_equation.png?raw=1" alt="" /></p>
<p>where <em>rel(k)</em> is just a binary indication of whether the kth passage contains the correct answer or not. Using a concrete example, consider retrieving <em>N</em>=3 documents, of which only one contains the correct answer. Here are three scenarios for how this could happen:</p>
<p><img src="https://github.com/fastforwardlabs/ff14_blog/blob/master/_notebooks/my_icons/map_example.png?raw=1" alt="" /></p>
<p>Scenario A is rewarded with the highest score because it was able to correctly rank the ground truth document relative to the others returned. Since average precision is calculated on a per-query basis, the mean average precision is simply <em>the average AP across all queries</em>.</p>
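<p>As a quick numeric check, the sketch below computes AP for three <em>N</em>=3 rankings. It assumes the three scenarios place the single answer-containing document at rank 1, 2, and 3 respectively; the helper name and the scenario encoding are ours, chosen for illustration.</p>

```python
def average_precision_sketch(rel):
    """AP for a ranked list of 0/1 relevance flags: mean of precision@k over relevant ranks."""
    m = sum(rel)
    if m == 0:
        return 0.0
    hits = 0
    total = 0.0
    for k, flag in enumerate(rel, start=1):
        if flag:
            hits += 1
            total += hits / k  # precision at rank k
    return total / m


# Assumed encoding of the three scenarios: one relevant document at rank 1, 2, or 3
scenarios = {"A": [1, 0, 0], "B": [0, 1, 0], "C": [0, 0, 1]}
ap_scores = {name: average_precision_sketch(rel) for name, rel in scenarios.items()}

# mAP treats each scenario as a separate query and averages the AP values
mean_ap = sum(ap_scores.values()) / len(ap_scores)

for name, score in ap_scores.items():
    print(name, round(score, 3))
print("mAP", round(mean_ap, 3))
```

<p>Under that assumption, the AP values come out to 1.0, 0.5, and roughly 0.333, so the further down the ranking the answer sits, the lower the score.</p>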
<p>Now, using our Wikipedia passage index, let's define a function called <code>evaluate_retriever</code> to loop through all question/answer examples from the SQuAD2.0 train set and see how well our Elasticsearch Retriever performs in terms of recall, mAP, and average query duration when retrieving <em>N=3</em> passages.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># collapse-hide</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="k">def</span> <span class="nf">average_precision</span><span class="p">(</span><span class="n">binary_results</span><span class="p">):</span>
<span class="sd">''' Calculates the average precision for a list of binary indicators '''</span>
<span class="n">m</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">precs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">val</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">binary_results</span><span class="p">):</span>
<span class="k">if</span> <span class="n">val</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
<span class="n">m</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="n">precs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="n">binary_results</span><span class="p">[:</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">])</span><span class="o">/</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">))</span>
<span class="n">ap</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="o">/</span><span class="n">m</span><span class="p">)</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">precs</span><span class="p">)</span> <span class="k">if</span> <span class="n">m</span> <span class="k">else</span> <span class="mi">0</span>
<span class="k">return</span> <span class="n">ap</span>
<span class="k">def</span> <span class="nf">evaluate_retriever</span><span class="p">(</span><span class="n">es_obj</span><span class="p">,</span> <span class="n">index_name</span><span class="p">,</span> <span class="n">qa_records</span><span class="p">,</span> <span class="n">n_results</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> This function loops through a set of question/answer examples from SQuAD2.0 and </span>
<span class="sd"> evaluates Elasticsearch as an information retrieval tool in terms of recall, mAP, and query duration.</span>
<span class="sd"> </span>
<span class="sd"> Args:</span>
<span class="sd"> es_obj (elasticsearch.client.Elasticsearch) - Elasticsearch client object</span>
<span class="sd"> index_name (str) - name of index to query</span>
<span class="sd"> qa_records (list) - list of qa_records from preprocessing steps</span>
<span class="sd"> n_results (int) - the number of results ElasticSearch should return for a given query</span>
<span class="sd"> </span>
<span class="sd"> Returns:</span>
<span class="sd"> test_results_df (pd.DataFrame) - a dataframe recording search results info for every example in qa_records</span>
<span class="sd"> </span>
<span class="sd"> '''</span>
<span class="n">results</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">qa</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">tqdm</span><span class="p">(</span><span class="n">qa_records</span><span class="p">)):</span>
<span class="n">ex_id</span> <span class="o">=</span> <span class="n">qa</span><span class="p">[</span><span class="s1">'example_id'</span><span class="p">]</span>
<span class="n">question</span> <span class="o">=</span> <span class="n">qa</span><span class="p">[</span><span class="s1">'question_text'</span><span class="p">]</span>
<span class="n">answer</span> <span class="o">=</span> <span class="n">qa</span><span class="p">[</span><span class="s1">'short_answer'</span><span class="p">]</span>
<span class="c1"># execute query</span>
<span class="n">res</span> <span class="o">=</span> <span class="n">search_es</span><span class="p">(</span><span class="n">es_obj</span><span class="o">=</span><span class="n">es_obj</span><span class="p">,</span> <span class="n">index_name</span><span class="o">=</span><span class="n">index_name</span><span class="p">,</span> <span class="n">question_text</span><span class="o">=</span><span class="n">question</span><span class="p">,</span> <span class="n">n_results</span><span class="o">=</span><span class="n">n_results</span><span class="p">)</span>
<span class="c1"># calculate performance metrics from query response info</span>
<span class="n">duration</span> <span class="o">=</span> <span class="n">res</span><span class="p">[</span><span class="s1">'took'</span><span class="p">]</span>
<span class="n">binary_results</span> <span class="o">=</span> <span class="p">[</span><span class="nb">int</span><span class="p">(</span><span class="n">answer</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span> <span class="ow">in</span> <span class="n">doc</span><span class="p">[</span><span class="s1">'_source'</span><span class="p">][</span><span class="s1">'document_text'</span><span class="p">]</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span> <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">res</span><span class="p">[</span><span class="s1">'hits'</span><span class="p">][</span><span class="s1">'hits'</span><span class="p">]]</span>
<span class="n">ans_in_res</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="nb">any</span><span class="p">(</span><span class="n">binary_results</span><span class="p">))</span>
<span class="n">ap</span> <span class="o">=</span> <span class="n">average_precision</span><span class="p">(</span><span class="n">binary_results</span><span class="p">)</span>
<span class="n">rec</span> <span class="o">=</span> <span class="p">(</span><span class="n">ex_id</span><span class="p">,</span> <span class="n">question</span><span class="p">,</span> <span class="n">answer</span><span class="p">,</span> <span class="n">duration</span><span class="p">,</span> <span class="n">ans_in_res</span><span class="p">,</span> <span class="n">ap</span><span class="p">)</span>
<span class="n">results</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">rec</span><span class="p">)</span>
<span class="c1"># format results dataframe</span>
<span class="n">cols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'example_id'</span><span class="p">,</span> <span class="s1">'question'</span><span class="p">,</span> <span class="s1">'answer'</span><span class="p">,</span> <span class="s1">'query_duration'</span><span class="p">,</span> <span class="s1">'answer_present'</span><span class="p">,</span> <span class="s1">'average_precision'</span><span class="p">]</span>
<span class="n">results_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">results</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">cols</span><span class="p">)</span>
<span class="c1"># format results dict</span>
<span class="n">metrics</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'Recall'</span><span class="p">:</span> <span class="n">results_df</span><span class="o">.</span><span class="n">answer_present</span><span class="o">.</span><span class="n">value_counts</span><span class="p">(</span><span class="n">normalize</span><span class="o">=</span><span class="kc">True</span><span class="p">)[</span><span class="mi">1</span><span class="p">],</span>
<span class="s1">'Mean Average Precision'</span><span class="p">:</span> <span class="n">results_df</span><span class="o">.</span><span class="n">average_precision</span><span class="o">.</span><span class="n">mean</span><span class="p">(),</span>
<span class="s1">'Average Query Duration'</span><span class="p">:</span><span class="n">results_df</span><span class="o">.</span><span class="n">query_duration</span><span class="o">.</span><span class="n">mean</span><span class="p">()}</span>
<span class="k">return</span> <span class="n">results_df</span><span class="p">,</span> <span class="n">metrics</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># combine train/dev examples and filter out SQuAD records that</span>
<span class="c1"># do not have a short answer for the given question</span>
<span class="n">all_qa_records</span> <span class="o">=</span> <span class="n">qa_records</span><span class="o">+</span><span class="n">qa_records_dev</span>
<span class="n">qa_records_answerable</span> <span class="o">=</span> <span class="p">[</span><span class="n">record</span> <span class="k">for</span> <span class="n">record</span> <span class="ow">in</span> <span class="n">all_qa_records</span> <span class="k">if</span> <span class="n">record</span><span class="p">[</span><span class="s1">'short_answer'</span><span class="p">]</span> <span class="o">!=</span> <span class="s1">''</span><span class="p">]</span>
<span class="c1"># run evaluation</span>
<span class="n">results_df</span><span class="p">,</span> <span class="n">metrics</span> <span class="o">=</span> <span class="n">evaluate_retriever</span><span class="p">(</span><span class="n">es_obj</span><span class="o">=</span><span class="n">es</span><span class="p">,</span> <span class="n">index_name</span><span class="o">=</span><span class="s1">'squad-standard-index'</span><span class="p">,</span> <span class="n">qa_records</span><span class="o">=</span><span class="n">qa_records_answerable</span><span class="p">,</span> <span class="n">n_results</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">metrics</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>{'Recall': 0.8226180336176131,
'Mean Average Precision': 0.7524133234140888,
'Average Query Duration': 3.0550841518506937}</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Improving-Search-Results-with-a-Custom-Analyzer">Improving Search Results with a Custom Analyzer<a class="anchor-link" href="#Improving-Search-Results-with-a-Custom-Analyzer"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Identifying a correct passage in the Top 3 results for 82% of the SQuAD questions in ~3 milliseconds per question is not too bad! But that means that we've effectively limited our overall QA system to an 82% upper bound on performance. How can we improve upon this?</p>
<p>One simple and obvious way to increase recall would be to just retrieve more passages. The following figure shows the effects of varying corpus size and result size on Elasticsearch retriever recall. As expected, we see that the number of passages retrieved (i.e., <em>Top N</em>) has a dramatic impact on recall: a ~10-15 point jump from 1 to 3 passages returned, and a ~5 point jump for each of the other tiers. We also see a gradual decrease in recall as corpus size increases, which isn't surprising.</p>
<p><img src="https://github.com/fastforwardlabs/ff14_blog/blob/master/_notebooks/my_icons/recall_v_corpussize.png?raw=1" alt="Recall vs. Corpus Size" title="Experimental results demonstrating the impact of increasing corpus size and number of results retrieved on Elasticsearch recall." /></p>
<p>While increasing the number of passages retrieved is effective, it also has implications on overall system performance as the (already slow) Reader now has to reason over more text. Instead, we can lean on best practices in the well-explored domain of information retrieval.</p>
<p>Optimizing full text search is a battle between precision (returning as few irrelevant documents as possible) and recall (returning as many relevant documents as possible). Matching only exact words in the question results in high precision; however, it misses out on many passages that could be relevant. We can cast a wider net by searching for terms that are not <em>exactly</em> the same as those in the question, but are related in some way. Here, Elasticsearch Analyzers can help. Earlier in this post, we described how Analyzers provide a flexible and extensible method to tailor search for a given dataset. Two techniques that a custom Analyzer can apply to help here are <em>stop words</em> and <em>stemming.</em></p>
<ul>
<li><p><strong>Stop words:</strong> Stop words are the most frequently occurring words in the English language (for example: “and,” “the,” “to,” etc.) and add minimal semantic value to a piece of text. It is common practice to remove them in order to decrease the size of the index and increase the relevance of search results.</p>
</li>
<li><p><strong>Stemming:</strong> The English language is inflected; words can alter their written form to express different meanings. For example, “sing,” “sings,” “sang,” and “singing” are written with slight differences, but all really mean the same thing (albeit with varying tenses). Stemming algorithms exploit the fact that search intent is <em>usually</em> word-form agnostic, and attempt to reduce inflected words to their root form, which improves retrievability. We'll implement the <a href="https://snowballstem.org/">Snowball</a> stemming algorithm as a token filter in our custom Analyzer.</p>
</li>
</ul>
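<p>To see why these two steps help matching, here is a toy, pure-Python stand-in for an analysis chain (lowercase, stop word removal, suffix stripping). It is only a sketch: the stop word list and the crude suffix rules are made up for illustration and are far simpler than Elasticsearch's <code>stop</code> filter or the Snowball stemmer.</p>

```python
import re

# Hypothetical, deliberately tiny stop word list and suffix rules (not Snowball)
TOY_STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "was", "who"}
TOY_SUFFIXES = ("ing", "ed", "s")


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def toy_analyze(text):
    """Lowercase and tokenize, drop stop words, then stem each remaining token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(token) for token in tokens if token not in TOY_STOP_WORDS]


question_terms = toy_analyze("Who sings the anthem?")
passage_terms = toy_analyze("The choir was singing anthems in the hall.")
shared_terms = set(question_terms) & set(passage_terms)

print(question_terms)
print(passage_terms)
print(sorted(shared_terms))
```

<p>Without the stemming step, “sings” and “singing” (or “anthem” and “anthems”) would not match at all; after analysis, the question and the passage share two terms.</p>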
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># create new index</span>
<span class="n">index_config</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"settings"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"analysis"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"analyzer"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"stop_stem_analyzer"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"type"</span><span class="p">:</span> <span class="s2">"custom"</span><span class="p">,</span>
<span class="s2">"tokenizer"</span><span class="p">:</span> <span class="s2">"standard"</span><span class="p">,</span>
<span class="s2">"filter"</span><span class="p">:[</span>
<span class="s2">"lowercase"</span><span class="p">,</span>
<span class="s2">"stop"</span><span class="p">,</span>
<span class="s2">"snowball"</span>
<span class="p">]</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">},</span>
<span class="s2">"mappings"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"dynamic"</span><span class="p">:</span> <span class="s2">"strict"</span><span class="p">,</span>
<span class="s2">"properties"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"document_title"</span><span class="p">:</span> <span class="p">{</span><span class="s2">"type"</span><span class="p">:</span> <span class="s2">"text"</span><span class="p">,</span> <span class="s2">"analyzer"</span><span class="p">:</span> <span class="s2">"stop_stem_analyzer"</span><span class="p">},</span>
<span class="s2">"document_text"</span><span class="p">:</span> <span class="p">{</span><span class="s2">"type"</span><span class="p">:</span> <span class="s2">"text"</span><span class="p">,</span> <span class="s2">"analyzer"</span><span class="p">:</span> <span class="s2">"stop_stem_analyzer"</span><span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">es</span><span class="o">.</span><span class="n">indices</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="s1">'squad-stop-stem-index'</span><span class="p">,</span> <span class="n">body</span><span class="o">=</span><span class="n">index_config</span><span class="p">,</span> <span class="n">ignore</span><span class="o">=</span><span class="mi">400</span><span class="p">)</span>
<span class="c1"># populate the index</span>
<span class="n">populate_index</span><span class="p">(</span><span class="n">es_obj</span><span class="o">=</span><span class="n">es</span><span class="p">,</span> <span class="n">index_name</span><span class="o">=</span><span class="s1">'squad-stop-stem-index'</span><span class="p">,</span> <span class="n">evidence_corpus</span><span class="o">=</span><span class="n">all_wiki_articles</span><span class="p">)</span>
<span class="c1"># evaluate retriever performance</span>
<span class="n">stop_stem_results_df</span><span class="p">,</span> <span class="n">stop_stem_metrics</span> <span class="o">=</span> <span class="n">evaluate_retriever</span><span class="p">(</span><span class="n">es_obj</span><span class="o">=</span><span class="n">es</span><span class="p">,</span> <span class="n">index_name</span><span class="o">=</span><span class="s1">'squad-stop-stem-index'</span><span class="p">,</span>\
<span class="n">qa_records</span><span class="o">=</span><span class="n">qa_records_answerable</span><span class="p">,</span> <span class="n">n_results</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">stop_stem_metrics</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>{'Recall': 0.8501115914996388,
'Mean Average Precision': 0.7800892731997112,
'Average Query Duration': 0.7684287701215108}</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Awesome! We've increased recall and mAP by about 3 points <em>and</em> reduced our average query duration by nearly a factor of 4, through simple preprocessing steps that just scratch the surface of tailored analysis in Elasticsearch.</p>
<p>There is no "one-size-fits-all" recipe for optimizing search relevance, and every implementation will be different. In addition to custom analysis, there are many other methods for increasing search recall - for example, query expansion, which introduces additional tokens/phrases into a query at search time. We'll save that topic for another post. Instead, let’s take a look at how the Retriever's performance affects a QA system.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="The-Full-IR-QA-System">The Full IR QA System<a class="anchor-link" href="#The-Full-IR-QA-System"> </a></h1>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>We used the questions from the train set to evaluate the stand-alone retriever, in order to provide as large a collection as possible. However, BERT has been trained on those questions and would return inflated performance values if we used them for full-system evaluation. So let’s resort to our trusty SQuAD2.0 dev set.
<div class="flash">
<svg class="octicon octicon-info octicon octicon-info octicon octicon-info octicon octicon-info" viewBox="0 0 14 16" version="1.1" width="14" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 01-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 01-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg>
<strong>Note: </strong>This section focuses on a discussion. The code to reproduce our results can be found at the end.
</div></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Connecting-the-retriever-to-the-reader">Connecting the retriever to the reader<a class="anchor-link" href="#Connecting-the-retriever-to-the-reader"> </a></h2><p>In our last post, we evaluated a BERT-like model on the SQuAD2.0 dev set by providing the model with a paragraph that perfectly aligned with the question. This time, the retriever will serve up a collection of relevant documents. We created a reader class that leverages the Hugging Face (HF) question-answering <code>pipeline</code> to do the brunt of the work for us (loading models and tokenizers, converting text to features, chunking, prediction, etc.), but how should it process multiple documents from the retriever? And how should it determine which document contains the best answer?</p>
<p>This turns out to be one of the thornier subtleties in building a full QA system. There are several ways to approach this problem. Here are two that we tried:</p>
<ol>
<li>Pass each document to the reader individually, then aggregate the resulting scores.</li>
<li>Concatenate all documents into one long passage and pass to the reader simultaneously.</li>
</ol>
<p>Both methods have pros and cons. Let's take a look at them.</p>
<p><strong>Pass each document individually</strong></p>
<p>In Option 1, the reader returns answers and scores for each document. A series of heuristics must be developed to determine which answer is the best, and when the null answer should be returned. For this post, we chose a simple but reasonable heuristic: "Only return null if the highest scoring answer in each document is null; otherwise return the highest scoring non-null answer."</p>
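<p>One possible reading of this heuristic is sketched below. It assumes the reader's output has been reduced to one top prediction per document, represented as an <code>(answer_text, score)</code> pair with an empty string standing for the null answer; the function name, that representation, and the example scores are all hypothetical.</p>

```python
def pick_answer(per_document_predictions):
    """Return the null answer ("") only if every document's top prediction is null.

    Otherwise return the highest scoring non-null answer.
    """
    non_null = [(text, score) for text, score in per_document_predictions if text != ""]
    if not non_null:
        return ""
    best_text, _ = max(non_null, key=lambda pair: pair[1])
    return best_text


# Hypothetical top predictions from three retrieved documents
predictions = [("", 0.90), ("Sun Yat-sen", 0.60), ("Nanjing", 0.40)]
chosen = pick_answer(predictions)
print(chosen)

# Every document votes null, so the system abstains
print(repr(pick_answer([("", 0.80), ("", 0.70)])))
```

<p>Notice that the first example returns a non-null answer even though the null prediction has the highest score overall, which is exactly the behavior the heuristic prescribes.</p>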
<p>Unfortunately, a direct comparison of answer scores between documents is not technically valid. The reason lies in the type of score returned by the HF pipeline: a softmax probability over all the tokens <em>in that document</em>. This means that the only meaningful comparisons are between answers <em>from the same document</em>, whose probabilities sum to 1. An answer with a score of 0.78 from one document is not guaranteed to be better than an answer with a score of 0.70 from another document!</p>
<p>Finally, this option is slower (in our current implementation) because each article is passed individually, leading to multiple BERT calls.</p>
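<p>The comparability problem can be illustrated with a small sketch. As a simplification, it applies a softmax to a handful of made-up raw candidate scores per document rather than to real token-level model outputs; the numbers and variable names are hypothetical.</p>

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical raw scores for the candidate answers in two documents
doc_a_logits = [2.0, 0.5, 0.1]  # one modest candidate with weak competitors
doc_b_logits = [5.0, 4.8, 4.5]  # several strong candidates competing

probs_a = softmax(doc_a_logits)
probs_b = softmax(doc_b_logits)

print(round(probs_a[0], 2), round(probs_b[0], 2))
```

<p>Document B's best candidate has the larger raw score, yet its probability is lower because it is normalized against stronger competitors within its own document. Each probability only says how an answer compares to the alternatives in the same context.</p>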
<p><strong>Pass all documents together as one long context</strong></p>
<p>Option 2 circumvents many of these challenges but leads to other problems. The pros are:</p>
<ol>
<li>all candidate answers are scored on the same probability scale,</li>
<li>handling the null answer is more straightforward (we did that <a href="https://qa.fastforwardlabs.com/no%20answer/null%20threshold/bert/distilbert/exact%20match/f1/robust%20predictions/2020/06/09/Evaluating_BERT_on_SQuAD.html">last time</a>), and</li>
<li>we can take advantage of faster compute, since HF will chunk long documents for us and pass them through BERT in a batch.</li>
</ol>
<p>On the other hand, when concatenating multiple passages, there's a good chance that BERT will see a mixed context: the end of one paragraph grafted onto the beginning of another, for example. This could make it more difficult for the model to correctly identify an answer in a potentially confusing context. Another drawback is that it's more difficult to trace which of the input documents the answer ultimately came from.</p>
<p>Our reader class has two methods: <code>predict</code> and <code>predict_combine</code>, corresponding to Option 1 and Option 2, respectively. We tested each of them over 1000 examples from the SQuAD2.0 dev set while increasing the number of retrieved documents.</p>
<p><img src="/images/copied_from_nb/my_icons/qa_eval_combined_vs_individual.png" alt="" /></p>
<p>There are two take-aways here. First, we see that the concatenation method (blue bars) outperforms passing documents individually and applying heuristics to the outputs (orange bars). While more sophisticated heuristics can be developed, for short documents (paragraphs in this case), we find that the concatenation method is the most straightforward approach.</p>
<p>The second thing to notice is that, as the number of retrieved documents increases, the overall performance decreases for both methods. What's going on? When we evaluated the retriever, we found that increasing the number of retrieved documents <em>increased</em> the likelihood that the correct answer was contained in at least one of them. So why does reader performance degrade?</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Evaluating-the-system">Evaluating the system<a class="anchor-link" href="#Evaluating-the-system"> </a></h2><p>Standard evaluation on the SQuAD2.0 dev set considers only the best overall answer for each question, but we can create another evaluation metric that mirrors our IR recall metric from earlier. Specifically, we compute the <em>percent of examples in which the correct answer is found in <strong>at least one</strong> of the top k documents provided by the retriever</em>.</p>
<p><img src="/images/copied_from_nb/my_icons/qa_eval_top1_vs_any_topk.png" alt="" /></p>
<p>The blue bars are the same as the blue bars in the previous figure, but this time the orange bars represent our new recall-esque metric. What a difference! This demonstrates that when the model is provided with more documents, the correct answer truly is present more often. However, trying to predict which one of those answers is the right one is challenging: this task is not achieved by a simple heuristic, and becomes harder with more documents.</p>
<p>It may seem counterintuitive, but this behavior does make sense. Let’s imagine a simple system that performs ranked document retrieval and random answer selection. Ranked document retrieval, in this case, means that the correct answer is most likely to be found in the top-most ranked document, with some decreasing probability of being contained in the second- or third-ranked document, and so on. As we retrieve more and more documents, the probability <em>increases</em> that the correct answer is contained in the resulting set. However, as the number of documents increases, so too does the number of possible answers from which to choose - one from each document. Random answer selection over an increasing number of answer choices results in a <em>decrease</em> in performance. Obviously, BERT is not random, but it’s also not <em>perfect,</em> so the trend persists.</p>
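<p>The thought experiment above can be worked through numerically. The sketch below assumes that the correct answer sits in at most one document and that the chance of it being at a given rank decays geometrically; both are simplifying assumptions chosen for illustration, and the function names are hypothetical.</p>

```python
from typing import List


def rank_probabilities(n_docs: int, decay: float = 0.5) -> List[float]:
    """P(correct answer is in the document at rank i), assumed to decay
    geometrically with rank."""
    return [(1 - decay) * decay ** i for i in range(n_docs)]


def recall_at_k(probs: List[float], k: int) -> float:
    """Probability that the correct answer is somewhere in the top k documents."""
    return sum(probs[:k])


def random_selection_accuracy(probs: List[float], k: int) -> float:
    """Each document contributes one candidate answer and one of the k
    candidates is picked uniformly at random."""
    return recall_at_k(probs, k) / k


probs = rank_probabilities(5)
for k in range(1, 6):
    print(k, round(recall_at_k(probs, k), 3), round(random_selection_accuracy(probs, k), 3))
```

<p>Under these assumptions, recall climbs toward 1 as k grows while the accuracy of random selection falls, which is the same qualitative pattern as in the figure.</p>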
<p>Does this mean we shouldn't use QA systems like this? Of course not! There are several factors to consider:</p>
<ol>
<li><p><strong>Use Case:</strong> If your QA system seeks to provide enhanced search capabilities, then it might not be necessary to predict a single answer with high confidence. It might be sufficient to provide answers from several documents for the user to peruse. On the other hand, if your use case seeks to augment a chatbot, then predicting a high confidence answer might be more important for user experience.</p>
</li>
<li><p><strong>Better heuristics:</strong> While our simple heuristic didn't perform as well as concatenating all the input documents into one long context, there is research into developing heuristics that work better. In particular, <a href="https://arxiv.org/abs/1902.01718">one promising approach</a> develops a combined answer score that considers both the retriever's document ranking and the reader's answer score.</p>
</li>
<li><p><strong>Document length:</strong> Our concatenation method works reasonably well compared to other methods, but the documents are short. If the document length becomes considerably longer, this method's performance can degrade significantly.</p>
</li>
</ol>
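<p>One simple way to build such a combined score is a weighted average of the retriever score and the reader score. The sketch below illustrates the idea only: the weight <code>mu</code>, the min-max rescaling, and the function names are assumptions, so check the linked paper for its exact formulation.</p>

```python
from typing import List, Tuple


def min_max(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1] so retriever and reader scores are comparable
    (an assumption; the two raw scores usually live on different scales)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def rerank(candidates: List[Tuple[str, float, float]], mu: float = 0.5) -> List[Tuple[str, float]]:
    """Rank (answer, retriever_score, reader_score) triples by
    (1 - mu) * retriever + mu * reader, best first."""
    retriever = min_max([c[1] for c in candidates])
    reader = min_max([c[2] for c in candidates])
    combined = [
        (c[0], (1 - mu) * r + mu * s)
        for c, r, s in zip(candidates, retriever, reader)
    ]
    return sorted(combined, key=lambda pair: pair[1], reverse=True)


# Made-up scores: one answer per retrieved document.
candidates = [("A", 12.0, 2.0), ("B", 8.0, 6.0), ("C", 4.0, 4.0)]
print(rerank(candidates, mu=0.5))
```

<p>Setting <code>mu</code> to 0 trusts only the retriever's ranking and setting it to 1 trusts only the reader; in practice the weight would be tuned on held-out data.</p>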
<p><strong>Impact of a retriever on a QA system</strong></p>
<p>Considering all that we've learned so far, what is the overall impact of the retriever on a full QA system? Using our concatenation method and returning only the best answer for all questions in the SQuAD2.0 dev set, we can compare results with <a href="https://qa.fastforwardlabs.com/no%20answer/null%20threshold/bert/distilbert/exact%20match/f1/robust%20predictions/2020/06/09/Evaluating_BERT_on_SQuAD.html">our previous blog post</a> in which we evaluated only the reader.</p>
<p><img src="/images/copied_from_nb/my_icons/qa_system_vs_reader_only.png" alt="" /></p>
<p>As expected, adding a retriever to supply documents to the reader reduces the system's ability to identify the correct answer. This motivates approaches for enhancing the retriever, in order to supply the reader with the best documents possible.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Final-Thoughts">Final Thoughts<a class="anchor-link" href="#Final-Thoughts"> </a></h1><p>We did it! We built a full QA system with off-the-shelf parts using ElasticSearch and HuggingFace Transformers.</p>
<p>We made a series of design choices in building our full QA system, including the choice to index over Wikipedia paragraphs rather than full articles. This allowed us to more easily replicate SQuAD evaluation methods, but it isn't practical. In the real world, a QA system will need to work with existing indexes, which are typically built over full documents (not paragraphs). In addition to architectural constraints, indexing over full documents provides the retriever with the best chance of returning a relevant document.</p>
<p>However, passing multiple long documents to a Transformer model is a recipe for boredom -- it will take forever and it likely won't be highly informative. Transformers work best with smaller passages. Thus, extracting a few highly relevant paragraphs from the most relevant document is a better recipe for a practical implementation. This is exactly the approach we'll take next time when we (hopefully) address the biggest question of all:</p>
<blockquote><p>How do I apply a QA system to <strong>my</strong> data?</p>
</blockquote>
<p>Stay tuned!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Post-Script:-The-Code">Post Script: The Code<a class="anchor-link" href="#Post-Script:-The-Code"> </a></h1><p>If you open this notebook in Colab, you'll find several cells below that step through the experiments we ran for the final section.</p>
</div>
</div>
</div>
</div>
Evaluating QA: Metrics, Predictions, and the Null Response2020-06-09T00:00:00-05:002020-06-09T00:00:00-05:00https://qa.fastforwardlabs.com/no%20answer/null%20threshold/bert/distilbert/exact%20match/f1/robust%20predictions/2020/06/09/Evaluating_BERT_on_SQuAD<!--
-->
<div class="container" id="notebook-container">
<div class="cell border-box-sizing code_cell rendered">
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><img src="/images/copied_from_nb/my_icons/tomas-sobek-nVqNmnAWz3A-unsplash.jpg" alt="" title="Sometimes BERT needs to zip it." /></p>
<p>In our last post, <a href="https://qa.fastforwardlabs.com/pytorch/hugging%20face/wikipedia/bert/transformers/2020/05/19/Getting_Started_with_QA.html">Building a QA System with BERT on Wikipedia</a>, we used the HuggingFace framework to train BERT on the SQuAD2.0 dataset and built a simple QA system on top of the Wikipedia search engine. This time, we'll look at how to assess the quality of a BERT-like model for Question Answering. We'll cover what metrics are used to quantify quality, how to evaluate a model using the Hugging Face framework, and the importance of the "null response" (questions that don't have answers) for both improved performance and more realistic QA output. By the end of this post, we'll have implemented a more robust answering method for our QA system.
<div class="flash">
<svg class="octicon octicon-info" viewBox="0 0 14 16" version="1.1" width="14" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 01-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 01-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg>
<strong>Note: </strong>Throughout this post we’ll be using a distilBERT model fine-tuned on SQuAD2.0 by a member of the NLP community; this model can be found <a href="https://huggingface.co/twmkn9/distilbert-base-uncased-squad2">here</a> in the HF repository. Additionally, much of the code in this post is inspired by the HF <code>squad_metrics.py</code> <a href="https://github.com/huggingface/transformers/blob/5856999a9f2926923f037ecd8d27b8058bcf9dae/src/transformers/data/metrics/squad_metrics.py">script</a>.
</div></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Prerequisites">Prerequisites<a class="anchor-link" href="#Prerequisites"> </a></h3><ul>
<li>a basic understanding of Transformers and PyTorch</li>
<li>a basic understanding of Transformer outputs (logits) and softmax</li>
<li>a Transformer fine-tuned on SQuAD2.0</li>
<li>the SQuAD2.0 dev set</li>
</ul>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Answering-questions-is-complicated">Answering questions is complicated<a class="anchor-link" href="#Answering-questions-is-complicated"> </a></h1><p>Quantifying the success of question answering is a tricky task. When you or I ask a question, the correct answer could take multiple forms. For example, in our previous post, BERT answered the question, "Why is the sky blue?" with "Rayleigh scattering," but another answer would be:</p>
<blockquote><p>The Earth's atmosphere scatters short-wavelength light more efficiently than that of longer wavelengths. Because its wavelengths are shorter, blue light is more strongly scattered than the longer-wavelength lights, red or green. Hence the result that when looking at the sky away from the direct incident sunlight, the human eye perceives the sky to be blue.</p>
</blockquote>
<p>Both of these answers can be found in the Wikipedia article <a href="https://en.wikipedia.org/wiki/Diffuse_sky_radiation">Diffuse Sky Radiation</a> and both are correct. However, we've also had a model answer the same question with "because its wavelengths are shorter," which is close - but not really a correct answer; the sky itself doesn't have a wavelength. This answer is missing too much context to be useful.
What if we'd asked a question that couldn't be answered by the Diffuse Sky Radiation page? For example: "Could the sky ever be green?" If you read that Wiki article you'll see there probably isn't a sure-fire answer to this question. What should the model do in this case?</p>
<p>How should we judge a model’s success when there are multiple correct answers, even more incorrect answers, and potentially no answer available to it at all? To properly assess quality, we need a labeled set of questions and answers. Let's turn back to the SQuAD dataset.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="The-SQuAD2.0-dev-set">The SQuAD2.0 dev set<a class="anchor-link" href="#The-SQuAD2.0-dev-set"> </a></h1><p>The <a href="https://rajpurkar.github.io/SQuAD-explorer/">SQuAD dataset</a> comes in two flavors: SQuAD1.1 and SQuAD2.0. The latter contains the same questions and answers as the former, but also includes additional questions that cannot be answered by the accompanying passage. This is intended to create a more realistic question answering task. Identifying unanswerable questions is much more challenging for Transformer models, which is why we focused on the SQuAD2.0 dataset rather than SQuAD1.1.</p>
<p>SQuAD2.0 consists of over 150k questions, of which more than 35% are unanswerable in relation to their associated passage. <a href="https://qa.fastforwardlabs.com/pytorch/hugging%20face/wikipedia/bert/transformers/2020/05/19/Getting_Started_with_QA.html">For our last post</a>, we fine-tuned on the train set (130k examples); now we'll focus on the dev set, which contains nearly 12k examples. Only about half of these examples are answerable questions. In the following section, we'll look at a couple of these examples to get a feel for them.</p>
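<p>To make the answerable/unanswerable split concrete, here is a minimal sketch of how one might count the two kinds of questions in a SQuAD2.0-style file. It assumes the nesting used by the SQuAD2.0 JSON (<code>data</code>, <code>paragraphs</code>, <code>qas</code>, and an <code>is_impossible</code> flag per question); the <code>toy_squad</code> dictionary and the <code>count_answerable</code> helper are made up for illustration and are not part of the original notebook.</p>

```python
# Toy stand-in for dev-v2.0.json: same nesting as a SQuAD2.0 file,
# but with made-up content (an assumption for illustration only).
context = "Free oxygen also occurs in solution in the world's water bodies."

toy_squad = {
    "version": "v2.0",
    "data": [
        {
            "title": "Oxygen",
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Where on Earth is free oxygen found?",
                            "is_impossible": False,
                            "answers": [
                                {"text": "water", "answer_start": context.index("water")}
                            ],
                        },
                        {
                            "id": "q2",
                            "question": "Where on Earth is free nitrogen found?",
                            "is_impossible": True,
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ],
}


def count_answerable(squad):
    """Return (answerable, unanswerable) question counts for a SQuAD2.0-style dict."""
    answerable = 0
    unanswerable = 0
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa.get("is_impossible", False):
                    unanswerable += 1
                else:
                    answerable += 1
    return answerable, unanswerable


print(count_answerable(toy_squad))
```

<p>For the real dev set, the same function could be applied to the result of <code>json.load</code> on the downloaded file.</p>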
<p>(Use the hidden cells below to get set up, if needed.)</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># collapse-hide</span>
<span class="c1"># use this cell to install packages if needed</span>
<span class="o">!</span>pip install torch torchvision -f https://download.pytorch.org/whl/torch_stable.html
<span class="o">!</span>pip install transformers
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># collapse-hide</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">collections</span>
<span class="kn">from</span> <span class="nn">pprint</span> <span class="kn">import</span> <span class="n">pprint</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span><span class="p">,</span> <span class="n">AutoModelForQuestionAnswering</span>
<span class="c1"># This is the directory in which we'll store all evaluation output</span>
<span class="n">model_dir</span> <span class="o">=</span> <span class="s2">"models/distilbert/twmkn9_distilbert-base-uncased-squad2/"</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># collapse-hide</span>
<span class="c1"># Download the SQuAD2.0 dev set</span>
<span class="o">!</span>wget -P data/squad/ https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Load-the-dev-set-using-HF-data-processors">Load the dev set using HF data processors<a class="anchor-link" href="#Load-the-dev-set-using-HF-data-processors"> </a></h3><p>Hugging Face provides the <a href="https://huggingface.co/transformers/main_classes/processors.html">Processors</a> library for facilitating basic processing tasks with some canonical NLP datasets. The processors can be used for loading datasets and converting their examples to features for direct use in the model. We'll be using the <a href="https://huggingface.co/transformers/main_classes/processors.html#squad">SQuAD processors</a>.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">transformers.data.processors.squad</span> <span class="kn">import</span> <span class="n">SquadV2Processor</span>
<span class="c1"># this processor loads the SQuAD2.0 dev set examples</span>
<span class="n">processor</span> <span class="o">=</span> <span class="n">SquadV2Processor</span><span class="p">()</span>
<span class="n">examples</span> <span class="o">=</span> <span class="n">processor</span><span class="o">.</span><span class="n">get_dev_examples</span><span class="p">(</span><span class="s2">"./data/squad/"</span><span class="p">,</span> <span class="n">filename</span><span class="o">=</span><span class="s2">"dev-v2.0.json"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">examples</span><span class="p">))</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stderr output_text">
<pre>100%|██████████| 35/35 [00:05&lt;00:00, 6.71it/s]</pre>
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>11873
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>While <code>examples</code> is a list, most of the code that follows looks up examples by their unique question identifier (<code>qas_id</code>), one for each question in the dev set, so we first build a few lookup maps.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># generate some maps to help us identify examples of interest</span>
<span class="n">qid_to_example_index</span> <span class="o">=</span> <span class="p">{</span><span class="n">example</span><span class="o">.</span><span class="n">qas_id</span><span class="p">:</span> <span class="n">i</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">example</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">examples</span><span class="p">)}</span>
<span class="n">qid_to_has_answer</span> <span class="o">=</span> <span class="p">{</span><span class="n">example</span><span class="o">.</span><span class="n">qas_id</span><span class="p">:</span> <span class="nb">bool</span><span class="p">(</span><span class="n">example</span><span class="o">.</span><span class="n">answers</span><span class="p">)</span> <span class="k">for</span> <span class="n">example</span> <span class="ow">in</span> <span class="n">examples</span><span class="p">}</span>
<span class="n">answer_qids</span> <span class="o">=</span> <span class="p">[</span><span class="n">qas_id</span> <span class="k">for</span> <span class="n">qas_id</span><span class="p">,</span> <span class="n">has_answer</span> <span class="ow">in</span> <span class="n">qid_to_has_answer</span><span class="o">.</span><span class="n">items</span><span class="p">()</span> <span class="k">if</span> <span class="n">has_answer</span><span class="p">]</span>
<span class="n">no_answer_qids</span> <span class="o">=</span> <span class="p">[</span><span class="n">qas_id</span> <span class="k">for</span> <span class="n">qas_id</span><span class="p">,</span> <span class="n">has_answer</span> <span class="ow">in</span> <span class="n">qid_to_has_answer</span><span class="o">.</span><span class="n">items</span><span class="p">()</span> <span class="k">if</span> <span class="ow">not</span> <span class="n">has_answer</span><span class="p">]</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">display_example</span><span class="p">(</span><span class="n">qid</span><span class="p">):</span>
    <span class="kn">from</span> <span class="nn">pprint</span> <span class="kn">import</span> <span class="n">pprint</span>
    <span class="n">idx</span> <span class="o">=</span> <span class="n">qid_to_example_index</span><span class="p">[</span><span class="n">qid</span><span class="p">]</span>
    <span class="n">q</span> <span class="o">=</span> <span class="n">examples</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span><span class="o">.</span><span class="n">question_text</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">examples</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span><span class="o">.</span><span class="n">context_text</span>
    <span class="n">a</span> <span class="o">=</span> <span class="p">[</span><span class="n">answer</span><span class="p">[</span><span class="s1">'text'</span><span class="p">]</span> <span class="k">for</span> <span class="n">answer</span> <span class="ow">in</span> <span class="n">examples</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span><span class="o">.</span><span class="n">answers</span><span class="p">]</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'Example </span><span class="si">{</span><span class="n">idx</span><span class="si">}</span><span class="s1"> of </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">examples</span><span class="p">)</span><span class="si">}</span><span class="se">\n</span><span class="s1">---------------------'</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Q: </span><span class="si">{</span><span class="n">q</span><span class="si">}</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="s2">"Context:"</span><span class="p">)</span>
    <span class="n">pprint</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="se">\n</span><span class="s2">True Answers:</span><span class="se">\n</span><span class="si">{</span><span class="n">a</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="A-positive-example">A positive example<a class="anchor-link" href="#A-positive-example"> </a></h4><p>Approximately 50% of the examples in the dev set are questions that have answers contained within their corresponding passage. In these cases, up to five possible correct answers are provided (questions and answers were generated and identified by crowd-sourced workers). Answers must be direct excerpts from the passage, but we can see there are several ways to arrive at a "correct" answer.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">display_example</span><span class="p">(</span><span class="n">answer_qids</span><span class="p">[</span><span class="mi">1300</span><span class="p">])</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Example 2548 of 11873
---------------------
Q: Where on Earth is free oxygen found?
Context:
("Free oxygen also occurs in solution in the world's water bodies. The "
'increased solubility of O\n'
'2 at lower temperatures (see Physical properties) has important implications '
'for ocean life, as polar oceans support a much higher density of life due to '
'their higher oxygen content. Water polluted with plant nutrients such as '
'nitrates or phosphates may stimulate growth of algae by a process called '
'eutrophication and the decay of these organisms and other biomaterials may '
'reduce amounts of O\n'
'2 in eutrophic water bodies. Scientists assess this aspect of water quality '
"by measuring the water's biochemical oxygen demand, or the amount of O\n"
'2 needed to restore it to a normal concentration.')
True Answers:
['water', "in solution in the world's water bodies", "the world's water bodies"]
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="A-negative-example">A negative example<a class="anchor-link" href="#A-negative-example"> </a></h4><p>The other half of the questions in the dev set do not have an answer in the corresponding passage. These questions were generated by crowd-sourced workers to be related and relevant to the passage, but unanswerable by that passage. There are thus no True Answers associated with these questions, as we see in the example below.</p>
<p>Note: In this case, the question is a trick -- it rearranges the numbers from the passage ("3–2.7 billion years ago" becomes "3.7-2 billion years ago"), so its premise no longer holds true. Will the model pick up on that?</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">display_example</span><span class="p">(</span><span class="n">no_answer_qids</span><span class="p">[</span><span class="mi">1254</span><span class="p">])</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Example 2564 of 11873
---------------------
Q: What happened 3.7-2 billion years ago?
Context:
("Free oxygen gas was almost nonexistent in Earth's atmosphere before "
'photosynthetic archaea and bacteria evolved, probably about 3.5 billion '
'years ago. Free oxygen first appeared in significant quantities during the '
'Paleoproterozoic eon (between 3.0 and 2.3 billion years ago). For the first '
'billion years, any free oxygen produced by these organisms combined with '
'dissolved iron in the oceans to form banded iron formations. When such '
'oxygen sinks became saturated, free oxygen began to outgas from the oceans '
'3–2.7 billion years ago, reaching 10% of its present level around 1.7 '
'billion years ago.')
True Answers:
[]
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Metrics-for-QA">Metrics for QA<a class="anchor-link" href="#Metrics-for-QA"> </a></h1><p>There are two dominant metrics used by many question answering datasets, including SQuAD: exact match (EM) and F1 score. These scores are computed on individual question+answer pairs. When multiple correct answers are possible for a given question, the maximum score over all possible correct answers is computed. Overall EM and F1 scores are computed for a model by averaging over the individual example scores.</p>
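<p>As a rough illustration of the aggregation just described, the sketch below takes the maximum score over all gold answers for each question and then averages over questions. The names <code>best_score</code>, <code>average_score</code>, and <code>toy_metric</code> are hypothetical helpers introduced only for this example; <code>toy_metric</code> is a deliberately simple stand-in (case-insensitive string equality), not the EM or F1 used for SQuAD.</p>

```python
def toy_metric(prediction, truth):
    """Stand-in metric for illustration: 1 if the strings agree ignoring case, else 0."""
    return int(prediction.strip().lower() == truth.strip().lower())


def best_score(prediction, gold_answers, metric):
    """Score a prediction against every gold answer and keep the maximum."""
    return max(metric(prediction, gold) for gold in gold_answers)


def average_score(predictions, references, metric):
    """Average the per-question best scores over all questions in `references`."""
    scores = [
        best_score(predictions[qid], references[qid], metric) for qid in references
    ]
    return sum(scores) / len(scores)


# Made-up predictions and gold answers keyed by hypothetical question ids.
references = {
    "q1": ["water", "the world's water bodies"],
    "q2": [""],  # an unanswerable question: the only correct answer is empty
}
predictions = {
    "q1": "The world's water bodies",
    "q2": "3.5 billion years ago",
}

print(average_score(predictions, references, toy_metric))  # 0.5
```

<p>Here the first prediction matches one of its two gold answers (score 1) and the second predicts text for an unanswerable question (score 0), so the overall score is 0.5.</p>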
<h3 id="Exact-Match">Exact Match<a class="anchor-link" href="#Exact-Match"> </a></h3><p>This metric is as simple as it sounds. For each question+answer pair, if the <em>characters</em> of the model's prediction exactly match the characters of (one of) the True Answer(s), EM = 1, otherwise EM = 0. This is a strict all-or-nothing metric; being off by a single character results in a score of 0. When assessing against a negative example, if the model predicts any text at all, it automatically receives a 0 for that example.</p>
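<p>A minimal sketch of exact match follows. It assumes the light normalization commonly applied before comparison (lowercasing, stripping punctuation and the articles a/an/the, collapsing whitespace), which mirrors the helper functions defined later in this post; the function names here are illustrative only.</p>

```python
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(truth))


print(exact_match("The world's water bodies", "the world's water bodies"))  # 1
print(exact_match("water bodies", "the world's water bodies"))              # 0
print(exact_match("water", ""))  # 0: any text predicted for a negative example
print(exact_match("", ""))       # 1: correctly predicting no answer
```

<p>Note that a partially correct prediction such as "water bodies" still scores 0, which is what makes EM such a strict metric.</p>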
<h3 id="F1">F1<a class="anchor-link" href="#F1"> </a></h3><p>F1 score is a common metric for classification problems, and widely used in QA. It is appropriate when we care equally about precision and recall. In this case, it's computed over the individual <em>words</em> in the prediction against those in the True Answer. The number of shared words between the prediction and the truth is the basis of the F1 score: precision is the ratio of the number of shared words to the total number of words in the <em>prediction</em>, and recall is the ratio of the number of shared words to the total number of words in the <em>ground truth</em>.</p>
<p><img src="/images/copied_from_nb/my_icons/f1score.png" alt="" title="Thanks Wikipedia" /></p>
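<p>To make the arithmetic concrete, here is a small worked sketch of word-level F1. It assumes the simplest possible tokenization (lowercasing and splitting on whitespace) and counts shared words as a multiset; <code>token_f1</code> is a hypothetical helper for illustration, not the function used later in this post.</p>

```python
from collections import Counter


def token_f1(prediction, truth):
    """Word-level F1 between a prediction and a single gold answer."""
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()

    # If either side is empty (no answer), F1 is 1 only when both are empty.
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)

    # Count shared words, respecting how often each word occurs on both sides.
    pred_counts = Counter(pred_tokens)
    truth_counts = Counter(truth_tokens)
    num_shared = sum(min(count, truth_counts[tok]) for tok, count in pred_counts.items())
    if num_shared == 0:
        return 0.0

    precision = num_shared / len(pred_tokens)   # shared words / words in the prediction
    recall = num_shared / len(truth_tokens)     # shared words / words in the ground truth
    return 2 * precision * recall / (precision + recall)


# Prediction has 2 words, truth has 4 words, and 2 words are shared:
# precision = 2/2 = 1.0, recall = 2/4 = 0.5, F1 = 2 * 1.0 * 0.5 / 1.5 = 0.667
print(round(token_f1("world's water", "the world's water bodies"), 3))  # 0.667
```

<p>Unlike exact match, this partially correct prediction earns partial credit.</p>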
<p>Let's see how these metrics work in practice. We'll load up a fine-tuned model (<a href="https://huggingface.co/twmkn9/distilbert-base-uncased-squad2">this one</a>, to be precise) and its tokenizer, and compare our predictions against the True Answers.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Load-a-Transformer-model-fine-tuned-on-SQuAD-2.0">Load a Transformer model fine-tuned on SQuAD 2.0<a class="anchor-link" href="#Load-a-Transformer-model-fine-tuned-on-SQuAD-2.0"> </a></h3>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s2">"twmkn9/distilbert-base-uncased-squad2"</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelForQuestionAnswering</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s2">"twmkn9/distilbert-base-uncased-squad2"</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The following <code>get_prediction</code> function is essentially identical to the one we used last time in our simple QA system.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">get_prediction</span><span class="p">(</span><span class="n">qid</span><span class="p">):</span>
    <span class="c1"># given a question id (qas_id or qid), load the example, get the model outputs and generate an answer</span>
    <span class="n">question</span> <span class="o">=</span> <span class="n">examples</span><span class="p">[</span><span class="n">qid_to_example_index</span><span class="p">[</span><span class="n">qid</span><span class="p">]]</span><span class="o">.</span><span class="n">question_text</span>
    <span class="n">context</span> <span class="o">=</span> <span class="n">examples</span><span class="p">[</span><span class="n">qid_to_example_index</span><span class="p">[</span><span class="n">qid</span><span class="p">]]</span><span class="o">.</span><span class="n">context_text</span>
    <span class="n">inputs</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">encode_plus</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">context</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s1">'pt'</span><span class="p">)</span>
    <span class="n">outputs</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">inputs</span><span class="p">)</span>
    <span class="n">answer_start</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">outputs</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="c1"># get the most likely beginning of answer with the argmax of the score</span>
    <span class="n">answer_end</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">outputs</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="o">+</span> <span class="mi">1</span>
    <span class="n">answer</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">convert_tokens_to_string</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">convert_ids_to_tokens</span><span class="p">(</span><span class="n">inputs</span><span class="p">[</span><span class="s1">'input_ids'</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="n">answer_start</span><span class="p">:</span><span class="n">answer_end</span><span class="p">]))</span>
    <span class="k">return</span> <span class="n">answer</span>
</pre></div>
</div>
</div>
</div>
</div>
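<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To see what the two <code>argmax</code> calls are doing, here is a toy sketch in plain Python with no model involved. The tokens and scores are made up for illustration, and <code>pick_span</code> is a hypothetical helper; it only mimics the span-selection step of <code>get_prediction</code>, in which the start and end positions are chosen independently.</p>

```python
def argmax(scores):
    """Index of the largest value in a list of scores."""
    return max(range(len(scores)), key=scores.__getitem__)


def pick_span(tokens, start_scores, end_scores):
    """Select tokens from the highest-scoring start to the highest-scoring end (inclusive)."""
    start = argmax(start_scores)
    end = argmax(end_scores) + 1  # +1 because Python slices exclude the end index
    return tokens[start:end]


# Made-up tokenization of a question followed by a short context.
tokens = ["[CLS]", "where", "is", "oxygen", "found", "?", "[SEP]",
          "in", "the", "world", "'", "s", "water", "bodies", "[SEP]"]

# Made-up scores, one per token: the start peaks at "the", the end at "bodies".
start_scores = [0.2, 0.1, 0.0, 0.3, 0.1, 0.0, 0.1, 1.5, 4.2, 0.9, 0.1, 0.0, 2.0, 0.4, 0.1]
end_scores = [0.1, 0.0, 0.1, 0.2, 0.3, 0.0, 0.2, 0.1, 0.3, 0.8, 0.1, 0.2, 1.9, 4.5, 0.3]

print(pick_span(tokens, start_scores, end_scores))
# ['the', 'world', "'", 's', 'water', 'bodies']

# Because the two positions are chosen independently, the best end can fall
# before the best start, in which case the slice is empty.
print(pick_span(tokens, end_scores, start_scores))
# []
```

<p>The second call swaps the two score lists to show that nothing in this simple approach forces the end of the span to come after its start.</p>
</div>
</div>
</div>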
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Below are some functions we'll need to compute our quality metrics.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># these functions are heavily influenced by the HF squad_metrics.py script</span>
<span class="k">def</span> <span class="nf">normalize_text</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
    <span class="sd">"""Removing articles and punctuation, and standardizing whitespace are all typical text processing steps."""</span>
    <span class="kn">import</span> <span class="nn">string</span><span class="o">,</span> <span class="nn">re</span>
    <span class="k">def</span> <span class="nf">remove_articles</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
        <span class="n">regex</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="sa">r</span><span class="s2">"\b(a|an|the)\b"</span><span class="p">,</span> <span class="n">re</span><span class="o">.</span><span class="n">UNICODE</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="n">regex</span><span class="p">,</span> <span class="s2">" "</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">white_space_fix</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
        <span class="k">return</span> <span class="s2">" "</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">text</span><span class="o">.</span><span class="n">split</span><span class="p">())</span>
    <span class="k">def</span> <span class="nf">remove_punc</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
        <span class="n">exclude</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">string</span><span class="o">.</span><span class="n">punctuation</span><span class="p">)</span>
        <span class="k">return</span> <span class="s2">""</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">ch</span> <span class="k">for</span> <span class="n">ch</span> <span class="ow">in</span> <span class="n">text</span> <span class="k">if</span> <span class="n">ch</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">exclude</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">lower</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">text</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">white_space_fix</span><span class="p">(</span><span class="n">remove_articles</span><span class="p">(</span><span class="n">remove_punc</span><span class="p">(</span><span class="n">lower</span><span class="p">(</span><span class="n">s</span><span class="p">))))</span>
<span class="k">def</span> <span class="nf">compute_exact_match</span><span class="p">(</span><span class="n">prediction</span><span class="p">,</span> <span class="n">truth</span><span class="p">):</span>
    <span class="k">return</span> <span class="nb">int</span><span class="p">(</span><span class="n">normalize_text</span><span class="p">(</span><span class="n">prediction</span><span class="p">)</span> <span class="o">==</span> <span class="n">normalize_text</span><span class="p">(</span><span class="n">truth</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">compute_f1</span><span class="p">(</span><span class="n">prediction</span><span class="p">,</span> <span class="n">truth</span><span class="p">):</span>
    <span class="n">pred_tokens</span> <span class="o">=</span> <span class="n">normalize_text</span><span class="p">(</span><span class="n">prediction</span><span class="p">)</span><span class="o">.</span><span class="n">split</span><span class="p">()</span>
    <span class="n">truth_tokens</span> <span class="o">=</span> <span class="n">normalize_text</span><span class="p">(</span><span class="n">truth</span><span class="p">)</span><span class="o">.</span><span class="n">split</span><span class="p">()</span>
    <span class="c1"># if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">pred_tokens</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span> <span class="ow">or</span> <span class="nb">len</span><span class="p">(</span><span class="n">truth_tokens</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="nb">int</span><span class="p">(</span><span class="n">pred_tokens</span> <span class="o">==</span> <span class="n">truth_tokens</span><span class="p">)</span>
    <span class="c1"># count shared tokens as a multiset so that repeated words are not collapsed</span>
    <span class="n">common_tokens</span> <span class="o">=</span> <span class="n">collections</span><span class="o">.</span><span class="n">Counter</span><span class="p">(</span><span class="n">pred_tokens</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">collections</span><span class="o">.</span><span class="n">Counter</span><span class="p">(</span><span class="n">truth_tokens</span><span class="p">)</span>
    <span class="n">num_same</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">common_tokens</span><span class="o">.</span><span class="n">values</span><span class="p">())</span>
    <span class="c1"># if there are no common tokens then f1 = 0</span>
    <span class="k">if</span> <span class="n">num_same</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="mi">0</span>
    <span class="n">prec</span> <span class="o">=</span> <span class="n">num_same</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">pred_tokens</span><span class="p">)</span>
    <span class="n">rec</span> <span class="o">=</span> <span class="n">num_same</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">truth_tokens</span><span class="p">)</span>
    <span class="k">return</span> <span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">prec</span> <span class="o">*</span> <span class="n">rec</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">prec</span> <span class="o">+</span> <span class="n">rec</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_gold_answers</span><span class="p">(</span><span class="n">example</span><span class="p">):</span>
    <span class="sd">"""helper function that retrieves all possible true answers from a squad2.0 example"""</span>
    <span class="n">gold_answers</span> <span class="o">=</span> <span class="p">[</span><span class="n">answer</span><span class="p">[</span><span class="s2">"text"</span><span class="p">]</span> <span class="k">for</span> <span class="n">answer</span> <span class="ow">in</span> <span class="n">example</span><span class="o">.</span><span class="n">answers</span> <span class="k">if</span> <span class="n">answer</span><span class="p">[</span><span class="s2">"text"</span><span class="p">]]</span>
    <span class="c1"># if gold_answers doesn't exist it's because this is a negative example - </span>
    <span class="c1"># the only correct answer is an empty string</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">gold_answers</span><span class="p">:</span>
        <span class="n">gold_answers</span> <span class="o">=</span> <span class="p">[</span><span class="s2">""</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">gold_answers</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>In the following cell, we start by computing EM and F1 for our first example - the one that has several True Answers associated with it.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">prediction</span> <span class="o">=</span> <span class="n">get_prediction</span><span class="p">(</span><span class="n">answer_qids</span><span class="p">[</span><span class="mi">1300</span><span class="p">])</span>
<span class="n">example</span> <span class="o">=</span> <span class="n">examples</span><span class="p">[</span><span class="n">qid_to_example_index</span><span class="p">[</span><span class="n">answer_qids</span><span class="p">[</span><span class="mi">1300</span><span class="p">]]]</span>
<span class="n">gold_answers</span> <span class="o">=</span> <span class="n">get_gold_answers</span><span class="p">(</span><span class="n">example</span><span class="p">)</span>
<span class="n">em_score</span> <span class="o">=</span> <span class="nb">max</span><span class="p">((</span><span class="n">compute_exact_match</span><span class="p">(</span><span class="n">prediction</span><span class="p">,</span> <span class="n">answer</span><span class="p">))</span> <span class="k">for</span> <span class="n">answer</span> <span class="ow">in</span> <span class="n">gold_answers</span><span class="p">)</span>
<span class="n">f1_score</span> <span class="o">=</span> <span class="nb">max</span><span class="p">((</span><span class="n">compute_f1</span><span class="p">(</span><span class="n">prediction</span><span class="p">,</span> <span class="n">answer</span><span class="p">))</span> <span class="k">for</span> <span class="n">answer</span> <span class="ow">in</span> <span class="n">gold_answers</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Question: </span><span class="si">{</span><span class="n">example</span><span class="o">.</span><span class="n">question_text</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Prediction: </span><span class="si">{</span><span class="n">prediction</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"True Answers: </span><span class="si">{</span><span class="n">gold_answers</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"EM: </span><span class="si">{</span><span class="n">em_score</span><span class="si">}</span><span class="s2"> </span><span class="se">\t</span><span class="s2"> F1: </span><span class="si">{</span><span class="n">f1_score</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Question: Where on Earth is free oxygen found?
Prediction: water bodies
True Answers: ['water', "in solution in the world's water bodies", "the world's water bodies"]
EM: 0 F1: 0.8
</pre>
</div>
</div>
</div>
</div>
</div>
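<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To make the 0.8 concrete, here is a minimal, self-contained sketch that recomputes F1 for this prediction against each True Answer. The <code>normalize</code> and <code>token_f1</code> helpers are our own stand-ins for illustration; they assume SQuAD-style normalization (lowercasing, stripping punctuation and the articles a/an/the), so treat them as an approximation of the helpers defined earlier rather than a copy.</p>
</div>
</div>
</div>

```python
import re
import string

# Assumption: SQuAD-style normalization (lowercase, no punctuation, no articles).
_PUNCT = set(string.punctuation)


def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, truth):
    """Token-overlap F1 using a set intersection, as in compute_f1 above."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    # if either side is empty, F1 is 1 only when both are empty
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    common = set(pred_tokens) & set(truth_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


prediction = "water bodies"
gold_answers = ["water", "in solution in the world's water bodies", "the world's water bodies"]

# score the prediction against every gold answer, then keep the best score
per_answer_f1 = {answer: token_f1(prediction, answer) for answer in gold_answers}
best_f1 = max(per_answer_f1.values())

for answer, score in per_answer_f1.items():
    print(f"{score:.3f}  {answer!r}")
print(f"best F1: {best_f1:.1f}")
```

<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The best match is "the world's water bodies": both predicted tokens appear in it (precision 1.0), but only two of its three normalized tokens are predicted (recall 2/3), which works out to an F1 of 0.8.</p>
</div>
</div>
</div>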
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>We see that our prediction is actually quite close to some of the True Answers, resulting in a respectable F1 score. However, it does not exactly match any of them, so our EM score is 0.</p>
<p>Let's try with our negative example now.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">prediction</span> <span class="o">=</span> <span class="n">get_prediction</span><span class="p">(</span><span class="n">no_answer_qids</span><span class="p">[</span><span class="mi">1254</span><span class="p">])</span>
<span class="n">example</span> <span class="o">=</span> <span class="n">examples</span><span class="p">[</span><span class="n">qid_to_example_index</span><span class="p">[</span><span class="n">no_answer_qids</span><span class="p">[</span><span class="mi">1254</span><span class="p">]]]</span>
<span class="n">gold_answers</span> <span class="o">=</span> <span class="n">get_gold_answers</span><span class="p">(</span><span class="n">example</span><span class="p">)</span>
<span class="n">em_score</span> <span class="o">=</span> <span class="nb">max</span><span class="p">((</span><span class="n">compute_exact_match</span><span class="p">(</span><span class="n">prediction</span><span class="p">,</span> <span class="n">answer</span><span class="p">))</span> <span class="k">for</span> <span class="n">answer</span> <span class="ow">in</span> <span class="n">gold_answers</span><span class="p">)</span>
<span class="n">f1_score</span> <span class="o">=</span> <span class="nb">max</span><span class="p">((</span><span class="n">compute_f1</span><span class="p">(</span><span class="n">prediction</span><span class="p">,</span> <span class="n">answer</span><span class="p">))</span> <span class="k">for</span> <span class="n">answer</span> <span class="ow">in</span> <span class="n">gold_answers</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Question: </span><span class="si">{</span><span class="n">example</span><span class="o">.</span><span class="n">question_text</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Prediction: </span><span class="si">{</span><span class="n">prediction</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"True Answers: </span><span class="si">{</span><span class="n">gold_answers</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"EM: </span><span class="si">{</span><span class="n">em_score</span><span class="si">}</span><span class="s2"> </span><span class="se">\t</span><span class="s2"> F1: </span><span class="si">{</span><span class="n">f1_score</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Question: What happened 3.7-2 billion years ago?
Prediction: [CLS] what happened 3 . 7 - 2 billion years ago ? [SEP] free oxygen gas was almost nonexistent in earth ' s atmosphere before photosynthetic archaea and bacteria evolved , probably about 3 . 5 billion years ago . free oxygen first appeared in significant quantities during the paleoproterozoic eon ( between 3 . 0 and 2 . 3 billion years ago ) . for the first billion years , any free oxygen produced by these organisms combined with dissolved iron in the oceans to form banded iron formations . when such oxygen sinks became saturated , free oxygen began to outgas from the oceans
True Answers: ['']
EM: 0 F1: 0
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Wow. Both our metrics are zero, because this model does not correctly assess that this question is unanswerable! Even worse, it seems to have failed catastrophically, including the entire question as part of the answer. In a later section, we'll dig into exactly why this happens, but for now, it's important to note that we got this answer because we simply extracted the start and end tokens associated with the maximum scores (we took an <code>argmax</code> of the model output in <code>get_prediction</code>), and this led to some unintended consequences.</p>
<p>Now that we’ve seen the basics of computing QA metrics on a couple of examples, we need to assess the model on the entire dev set. Luckily, there's a script for that.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Evaluating-a-model-on-the-SQuAD2.0-dev-set-with-HF">Evaluating a model on the SQuAD2.0 dev set with HF<a class="anchor-link" href="#Evaluating-a-model-on-the-SQuAD2.0-dev-set-with-HF"> </a></h1><p>The same <code>run_squad.py</code> script we used to fine-tune a Transformer for question answering can also be used to evaluate the model. (You can grab the script <a href="https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py">here</a> or run the hidden cell below.)</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># collapse-hide</span>
<span class="c1"># Grab the run_squad.py script</span>
<span class="o">!</span>curl -L -O https://raw.githubusercontent.com/huggingface/transformers/master/examples/question-answering/run_squad.py
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Below are the arguments needed to properly evaluate a fine-tuned model for question answering on the SQuAD dev set. Because we're using SQuAD2.0, it is <strong>crucial</strong> to include the <code>--version_2_with_negative</code> flag!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>python run_squad.py <span class="err">\</span>
<span class="o">--</span><span class="n">model_type</span> <span class="n">distilbert</span> \
<span class="o">--</span><span class="n">model_name_or_path</span> <span class="n">twmkn9</span><span class="o">/</span><span class="n">distilbert</span><span class="o">-</span><span class="n">base</span><span class="o">-</span><span class="n">uncased</span><span class="o">-</span><span class="n">squad2</span> \
<span class="o">--</span><span class="n">output_dir</span> <span class="n">models</span><span class="o">/</span><span class="n">distilbert</span><span class="o">/</span><span class="n">twmkn9_distilbert</span><span class="o">-</span><span class="n">base</span><span class="o">-</span><span class="n">uncased</span><span class="o">-</span><span class="n">squad2</span> \
<span class="o">--</span><span class="n">data_dir</span> <span class="n">data</span><span class="o">/</span><span class="n">squad</span> \
<span class="o">--</span><span class="n">predict_file</span> <span class="n">dev</span><span class="o">-</span><span class="n">v2</span><span class="o">.</span><span class="mf">0.</span><span class="n">json</span> \
<span class="o">--</span><span class="n">do_eval</span> \
<span class="o">--</span><span class="n">version_2_with_negative</span> \
<span class="o">--</span><span class="n">do_lower_case</span> \
<span class="o">--</span><span class="n">per_gpu_eval_batch_size</span> <span class="mi">12</span> \
<span class="o">--</span><span class="n">max_seq_length</span> <span class="mi">384</span> \
<span class="o">--</span><span class="n">doc_stride</span> <span class="mi">128</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Refer to <a href="https://qa.fastforwardlabs.com/pytorch/hugging%20face/wikipedia/bert/transformers/2020/05/19/Getting_Started_with_QA.html">our last post</a> for more details on what these arguments mean and what this script does. For our immediate purposes, running the cell above will produce the following output in the <code>--output_dir</code> directory:</p>
<ul>
<li><code>predictions_.json</code></li>
<li><code>nbest_predictions_.json</code></li>
<li><code>null_odds_.json</code></li>
</ul>
<p>(We'll go over what these are later on.) Additionally, an overall <code>Results</code> dict will be displayed to the screen. If you run the above cell, the last line of output should display something like the following:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">Results</span> <span class="o">=</span> <span class="p">{</span>
<span class="c1"># a) scores averaged over all examples in the dev set</span>
<span class="s1">'exact'</span><span class="p">:</span> <span class="mf">66.25958056093658</span><span class="p">,</span>
<span class="s1">'f1'</span><span class="p">:</span> <span class="mf">69.66994428499025</span><span class="p">,</span>
<span class="s1">'total'</span><span class="p">:</span> <span class="mi">11873</span><span class="p">,</span> <span class="c1"># number of examples in the dev set</span>
<span class="c1"># b) scores averaged over only positive examples (have answers)</span>
<span class="s1">'HasAns_exact'</span><span class="p">:</span> <span class="mf">68.91025641025641</span><span class="p">,</span>
<span class="s1">'HasAns_f1'</span><span class="p">:</span> <span class="mf">75.74076391627662</span><span class="p">,</span>
<span class="s1">'HasAns_total'</span><span class="p">:</span> <span class="mi">5928</span><span class="p">,</span> <span class="c1"># number of positive examples</span>
<span class="c1"># c) scores averaged over only negative examples (no answers)</span>
<span class="s1">'NoAns_exact'</span><span class="p">:</span> <span class="mf">63.61648444070648</span><span class="p">,</span>
<span class="s1">'NoAns_f1'</span><span class="p">:</span> <span class="mf">63.61648444070648</span><span class="p">,</span>
<span class="s1">'NoAns_total'</span><span class="p">:</span> <span class="mi">5945</span><span class="p">,</span> <span class="c1"># number of negative examples</span>
<span class="c1"># d) given probabilities of no-answer for each example, what would the best scores and thresholds be?</span>
<span class="s1">'best_exact'</span><span class="p">:</span> <span class="mf">66.25958056093658</span><span class="p">,</span>
<span class="s1">'best_exact_thresh'</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
<span class="s1">'best_f1'</span><span class="p">:</span> <span class="mf">69.66994428499046</span><span class="p">,</span>
<span class="s1">'best_f1_thresh'</span><span class="p">:</span> <span class="mf">0.0</span>
<span class="p">}</span>
</pre></div>
</div>
</div>
</div>
</div>
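<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As a quick sanity check on the numbers above, the overall scores should be the example-weighted average of the <code>HasAns</code> and <code>NoAns</code> scores. The sketch below copies the relevant values from the <code>Results</code> dict and recombines them; the <code>weighted_average</code> helper is our own name, introduced only for illustration.</p>
</div>
</div>
</div>

```python
# Values copied from the Results dict above.
results = {
    "exact": 66.25958056093658,
    "f1": 69.66994428499025,
    "total": 11873,
    "HasAns_exact": 68.91025641025641,
    "HasAns_f1": 75.74076391627662,
    "HasAns_total": 5928,
    "NoAns_exact": 63.61648444070648,
    "NoAns_f1": 63.61648444070648,
    "NoAns_total": 5945,
}


def weighted_average(metric, results):
    """Recombine the HasAns and NoAns scores for `metric` ('exact' or 'f1')."""
    has_ans = results[f"HasAns_{metric}"] * results["HasAns_total"]
    no_ans = results[f"NoAns_{metric}"] * results["NoAns_total"]
    return (has_ans + no_ans) / results["total"]


overall_exact = weighted_average("exact", results)
overall_f1 = weighted_average("f1", results)

print(f"exact: {overall_exact:.4f} (reported {results['exact']:.4f})")
print(f"f1:    {overall_f1:.4f} (reported {results['f1']:.4f})")
```

<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Notice also that <code>NoAns_exact</code> and <code>NoAns_f1</code> are identical. For a negative example the only correct answer is the empty string, and <code>compute_f1</code> returns 1 or 0 whenever either side is empty, so F1 collapses to exact match.</p>
</div>
</div>
</div>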
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The first three blocks of the <code>Results</code> output are pretty straightforward. EM and F1 scores are reported over a) the full dev set, b) the set of positive examples, and c) the set of negative examples. This can provide some insight into whether a model is performing adequately on both answer and no-answer questions. (This particular model does worse on no-answer questions: roughly 64 EM, compared with 69 EM on questions that have answers.)</p>
<p>However, what's going on with the last block? This portion of the output is not useful unless we supply the evaluation method with additional information. For that, we'll need to dig deeper into the evaluation process - because it turns out that we need to compute more than just a prediction for an answer; we must also compute a prediction for NO answer and we must score both predictions!</p>
<p>The following section will dive into the technical details of computing robust predictions on SQuAD2.0 examples, including how to score an answer and the null answer, as well as how to determine which one should be the "correct" prediction for a given example. Feel free to skip to the <a href="#Using-the-null-threshold">next section</a> for the punchline. (For those of you considering building your own QA system, we found this information to be invaluable for understanding the inner workings of prediction and assessment.)</p>
<h3 id="[Optional]-Computing-predictions">[Optional] Computing predictions<a class="anchor-link" href="#[Optional]-Computing-predictions"> </a></h3><p><div class="flash">
<svg class="octicon octicon-info octicon octicon-info" viewBox="0 0 14 16" version="1.1" width="14" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 01-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 01-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg>
<strong>Note: </strong>The code in the following section is an under-the-hood dive into the HF <code>compute_predictions_logits</code> <a href="https://github.com/huggingface/transformers/blob/5856999a9f2926923f037ecd8d27b8058bcf9dae/src/transformers/data/metrics/squad_metrics.py#L371-L573">method</a> in their <code>squad_metrics.py</code> script.
</div></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>When the tokenized question+context is passed to the model, the output consists of two sets of logits: one for the start of the answer span, the other for the end of the answer span. These logits are unnormalized scores: the larger a token's logit, the more likely the model considers that token to be the start or end of the answer. Every token passed to the model is assigned a logit, including special tokens (e.g., [CLS], [SEP]), and tokens corresponding to the question itself.</p>
<p>Let's walk through the process using our last example (Q: What happened 3.7-2 billion years ago?).</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">inputs</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">encode_plus</span><span class="p">(</span><span class="n">example</span><span class="o">.</span><span class="n">question_text</span><span class="p">,</span> <span class="n">example</span><span class="o">.</span><span class="n">context_text</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s1">'pt'</span><span class="p">)</span>
<span class="n">start_logits</span><span class="p">,</span> <span class="n">end_logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">inputs</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># look at how large the logit is in the [CLS] position (index 0)! </span>
<span class="c1"># strong possibility that this question has no answer... but our prediction returned an answer anyway!</span>
<span class="n">start_logits</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>tensor([[ 6.4914, -9.1416, -8.4068, -7.5684, -9.9081, -9.4256, -10.1625,
-9.2579, -10.0554, -9.9653, -9.2002, -8.8657, -9.1162, 0.6481,
-2.5947, -4.5072, -8.1189, -6.5871, -5.8973, -10.8619, -11.0953,
-10.2294, -9.3660, -7.6017, -10.8009, -10.8197, -6.1258, -8.3507,
-4.2463, -10.0987, -10.2659, -8.8490, -6.7346, -8.6513, -9.7573,
-5.7496, -5.5851, -8.9483, -7.0652, -6.1369, -5.7810, -9.4366,
-8.7670, -9.6743, -9.7446, -7.7905, -7.4541, -1.5963, -3.8540,
-7.3450, -8.1854, -9.5566, -8.3416, -8.9553, -8.3144, -6.4132,
-4.2285, -9.4427, -9.5111, -9.2931, -8.9154, -9.3930, -8.2111,
-8.9774, -9.0274, -7.2652, -7.4511, -9.8597, -9.5869, -9.9735,
-7.0526, -9.7560, -8.7788, -9.5117, -9.6391, -8.6487, -9.5994,
-7.8213, -5.1754, -4.3561, -4.3913, -7.8499, -7.7522, -8.9651,
-3.5229, -0.8312, -2.7668, -7.9180, -10.0320, -8.7797, -4.5965,
-5.9465, -9.9442, -3.2135, -5.0734, -8.3462, -7.5366, -3.7073,
-7.0968, -4.3325, -1.3691, -4.1477, -5.3794, -7.6138, 1.3183,
-3.4190, 3.1457, -3.0152, -0.4102, -2.4606, -3.5971, 6.4519,
-0.5654, 0.9829, -1.6682, 3.3549, -4.7847, -2.8024, -3.3160,
-0.5868, -0.9617, -8.1925, -4.3299, -7.3923, -5.0875, -5.3880,
-5.3676, -3.0878, -4.3427, 4.3975, 1.8860, -5.4661, -9.1565,
-3.6369, -3.5462, -4.1448, -2.0250, -2.4492, -8.7015, -7.3292,
-7.7616, -7.0786, -4.6668, -4.4089, -9.1182]],
grad_fn=<SqueezeBackward1>)</pre>
</div>
</div>
</div>
</div>
</div>
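<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>A toy sketch may help connect this to the runaway prediction we saw earlier. The tokens and logit values below are made up for illustration (they are not the model's output), and we assume the prediction is built by slicing from the <code>argmax</code> start index to the <code>argmax</code> end index. When the largest start logit sits on [CLS] but the largest end logit sits on a context token, the slice swallows the whole question.</p>
</div>
</div>
</div>

```python
# Hypothetical toy input laid out as: [CLS] question [SEP] context [SEP]
tokens = ["[CLS]", "what", "happened", "?", "[SEP]",
          "free", "oxygen", "began", "to", "outgas", "[SEP]"]

# Made-up logits: the start scores peak at [CLS], the end scores peak in the context.
start_logits = [6.5, -9.1, -8.4, -7.6, -9.9, 0.6, 6.4, -2.6, -4.5, -8.1, -9.0]
end_logits = [6.1, -9.0, -8.8, -7.9, -9.5, -3.0, 1.2, 0.4, -1.1, 6.3, -9.2]

# Independent argmax over each list, with no consistency checks.
start = max(range(len(start_logits)), key=start_logits.__getitem__)
end = max(range(len(end_logits)), key=end_logits.__getitem__)

naive_answer = " ".join(tokens[start:end + 1])
print(f"start={start}, end={end}")
print(naive_answer)
```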
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>In our simple QA system, we predicted the best answer by selecting the start and end tokens with the largest logits, but that's not very robust. In fact, the original <a href="https://arxiv.org/abs/1810.04805">BERT paper</a> suggested considering any sensible start+end combination as a possible answer to the question. These combinations would then be scored, and the one with the highest score would be considered the best answer. A possible (candidate) answer is scored as the sum of its start and end logits.
<div class="flash">
<svg class="octicon octicon-info octicon octicon-info octicon octicon-info" viewBox="0 0 14 16" version="1.1" width="14" height="16" aria-hidden="true"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 01-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 01-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg>
<strong>Note: </strong>This reflects how a basic span extraction classifier works. The raw hidden layer from the model is passed through a <code>Linear</code> layer, which produces a start logit and an end logit for every token, and each set of logits is fed to a <code>CrossEntropyLoss</code>. In span extraction, there are two things to classify: the beginning of the span and the end of the span. The span loss is computed as the sum of the <code>CrossEntropyLoss</code> for the start and end positions. The probability of an answer span is the probability of a given start token S and an end token E: P(S and E) = P(S)P(E), because the start and end tokens are treated as being independent. Thus ranking candidate spans by the sum of their start and end logits is equivalent to ranking them by the product of their softmax probabilities (the log of that product is the logit sum minus a constant that is the same for every span).
</div></p>
</div>
</div>
</div>
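<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The relationship between logit sums and softmax probabilities is easy to check numerically. The sketch below uses small, made-up logit lists (not model output) and a hand-rolled <code>log_softmax</code> helper of our own; it ranks every start/end pair both ways and compares the orderings. It deliberately skips the validity checks on spans, since the point here is only the scoring.</p>
</div>
</div>
</div>

```python
import itertools
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


# Made-up logits, chosen so that no two start+end sums tie.
start_logits = [2.0, -1.0, 0.5, 3.0]
end_logits = [0.1, 1.4, -2.0, 2.5]

start_log_probs = log_softmax(start_logits)
end_log_probs = log_softmax(end_logits)

# Every (start, end) index pair, scored two ways.
pairs = list(itertools.product(range(len(start_logits)), range(len(end_logits))))
by_logit_sum = sorted(
    pairs, key=lambda p: start_logits[p[0]] + end_logits[p[1]], reverse=True
)
by_joint_prob = sorted(
    pairs, key=lambda p: math.exp(start_log_probs[p[0]] + end_log_probs[p[1]]), reverse=True
)

print("best pair by logit sum:  ", by_logit_sum[0])
print("best pair by joint prob: ", by_joint_prob[0])
print("identical rankings:", by_logit_sum == by_joint_prob)
```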
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To mimic this behavior, we'll start by taking the <em>n</em> largest <code>start_logits</code> and the <em>n</em> largest <code>end_logits</code> as candidates. Any sensible combination of these start + end tokens is considered a candidate answer; however, several consistency checks must first be performed. For example, an answer wherein the end token falls before the start token should be excluded, because that just doesn't make sense. Candidate answers wherein the start or end tokens are associated with question tokens are also excluded, because the answer to the question should obviously not be in the question itself! It is important to note that the [CLS] token and its corresponding logits are not removed, because this token indicates the null answer.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">to_list</span><span class="p">(</span><span class="n">tensor</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">tensor</span><span class="o">.</span><span class="n">detach</span><span class="p">()</span><span class="o">.</span><span class="n">cpu</span><span class="p">()</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span>
<span class="c1"># convert our start and end logit tensors to lists</span>
<span class="n">start_logits</span> <span class="o">=</span> <span class="n">to_list</span><span class="p">(</span><span class="n">start_logits</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">end_logits</span> <span class="o">=</span> <span class="n">to_list</span><span class="p">(</span><span class="n">end_logits</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="c1"># sort our start and end logits from largest to smallest, keeping track of the index</span>
<span class="n">start_idx_and_logit</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">enumerate</span><span class="p">(</span><span class="n">start_logits</span><span class="p">),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">end_idx_and_logit</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">enumerate</span><span class="p">(</span><span class="n">end_logits</span><span class="p">),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="c1"># select the top n (in this case, 5)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">start_idx_and_logit</span><span class="p">[:</span><span class="mi">5</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="n">end_idx_and_logit</span><span class="p">[:</span><span class="mi">5</span><span class="p">])</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>[(0, 6.491387367248535), (111, 6.451895713806152), (129, 4.397505760192871), (115, 3.354909658432007), (106, 3.1457457542419434)]
[(119, 6.33292293548584), (0, 6.084450721740723), (135, 4.417276382446289), (116, 4.3764214515686035), (112, 4.125303268432617)]
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The null answer token (index 0) is in the top five of both the start and end logit lists.</p>
<p>In order to eventually predict a text answer (or empty string), we need to keep track of the indexes which will be used to pull the corresponding token ids later on. We'll also need to identify which indexes correspond to the question tokens, so we can ensure we don't allow a nonsensical prediction.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">start_indexes</span> <span class="o">=</span> <span class="p">[</span><span class="n">idx</span> <span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">logit</span> <span class="ow">in</span> <span class="n">start_idx_and_logit</span><span class="p">[:</span><span class="mi">5</span><span class="p">]]</span>
<span class="n">end_indexes</span> <span class="o">=</span> <span class="p">[</span><span class="n">idx</span> <span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">logit</span> <span class="ow">in</span> <span class="n">end_idx_and_logit</span><span class="p">[:</span><span class="mi">5</span><span class="p">]]</span>
<span class="c1"># convert the token ids from a tensor to a list</span>
<span class="n">tokens</span> <span class="o">=</span> <span class="n">to_list</span><span class="p">(</span><span class="n">inputs</span><span class="p">[</span><span class="s1">'input_ids'</span><span class="p">])[</span><span class="mi">0</span><span class="p">]</span>
<span class="c1"># question tokens are defined as those between the CLS token (101, at position 0) and first SEP (102) token </span>
<span class="n">question_indexes</span> <span class="o">=</span> <span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">token</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">tokens</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="n">tokens</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="mi">102</span><span class="p">)])]</span>
<span class="n">question_indexes</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Next, we'll generate a list of candidate predictions by looping through all combinations of the start and end token indexes, excluding nonsensical combinations. We'll save these to a list for the next step.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">collections</span>
<span class="c1"># keep track of all preliminary predictions</span>
<span class="n">PrelimPrediction</span> <span class="o">=</span> <span class="n">collections</span><span class="o">.</span><span class="n">namedtuple</span><span class="p">(</span>
    <span class="s2">"PrelimPrediction"</span><span class="p">,</span> <span class="p">[</span><span class="s2">"start_index"</span><span class="p">,</span> <span class="s2">"end_index"</span><span class="p">,</span> <span class="s2">"start_logit"</span><span class="p">,</span> <span class="s2">"end_logit"</span><span class="p">]</span>
<span class="p">)</span>
<span class="n">prelim_preds</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">start_index</span> <span class="ow">in</span> <span class="n">start_indexes</span><span class="p">:</span>
    <span class="k">for</span> <span class="n">end_index</span> <span class="ow">in</span> <span class="n">end_indexes</span><span class="p">:</span>
        <span class="c1"># throw out invalid predictions</span>
        <span class="k">if</span> <span class="n">start_index</span> <span class="ow">in</span> <span class="n">question_indexes</span><span class="p">:</span>
            <span class="k">continue</span>
        <span class="k">if</span> <span class="n">end_index</span> <span class="ow">in</span> <span class="n">question_indexes</span><span class="p">:</span>
            <span class="k">continue</span>
        <span class="k">if</span> <span class="n">end_index</span> <span class="o"><</span> <span class="n">start_index</span><span class="p">:</span>
            <span class="k">continue</span>
        <span class="n">prelim_preds</span><span class="o">.</span><span class="n">append</span><span class="p">(</span>
            <span class="n">PrelimPrediction</span><span class="p">(</span>
                <span class="n">start_index</span> <span class="o">=</span> <span class="n">start_index</span><span class="p">,</span>
                <span class="n">end_index</span> <span class="o">=</span> <span class="n">end_index</span><span class="p">,</span>
                <span class="n">start_logit</span> <span class="o">=</span> <span class="n">start_logits</span><span class="p">[</span><span class="n">start_index</span><span class="p">],</span>
                <span class="n">end_logit</span> <span class="o">=</span> <span class="n">end_logits</span><span class="p">[</span><span class="n">end_index</span><span class="p">]</span>
            <span class="p">)</span>
        <span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>With a list of sensible candidate predictions, it's time to score them.</p>
<p>For a candidate answer, score = <code>start_logit</code> + <code>end_logit</code>. Below, we sort our candidate predictions by their score.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># sort preliminary predictions by their score</span>
<span class="n">prelim_preds</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">prelim_preds</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">start_logit</span> <span class="o">+</span> <span class="n">x</span><span class="o">.</span><span class="n">end_logit</span><span class="p">),</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">pprint</span><span class="p">(</span><span class="n">prelim_preds</span><span class="p">[:</span><span class="mi">5</span><span class="p">])</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>[PrelimPrediction(start_index=0, end_index=119, start_logit=6.491387367248535, end_logit=6.33292293548584),
PrelimPrediction(start_index=111, end_index=119, start_logit=6.451895713806152, end_logit=6.33292293548584),
PrelimPrediction(start_index=0, end_index=0, start_logit=6.491387367248535, end_logit=6.084450721740723),
PrelimPrediction(start_index=0, end_index=135, start_logit=6.491387367248535, end_logit=4.417276382446289),
PrelimPrediction(start_index=111, end_index=135, start_logit=6.451895713806152, end_logit=4.417276382446289)]
</pre>
</div>
</div>
</div>
</div>
</div>
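<p>If you don't have the model loaded, the candidate generation and scoring steps above can be tried on toy logits. The sketch below is illustrative only: <code>toy_candidates</code>, <code>top_k</code>, and the logit values are made up for this example and are not part of <code>run_squad.py</code>.</p>

```python
import collections

PrelimPrediction = collections.namedtuple(
    "PrelimPrediction", ["start_index", "end_index", "start_logit", "end_logit"]
)

def toy_candidates(start_logits, end_logits, question_indexes, top_k=3):
    """Pair the top_k start and end positions, drop invalid spans, sort by score."""
    def top(logits):
        # indexes of the top_k largest logits
        return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]

    candidates = []
    for s in top(start_logits):
        for e in top(end_logits):
            # same three checks as above: no question tokens, end not before start
            if s in question_indexes or e in question_indexes or e < s:
                continue
            candidates.append(PrelimPrediction(s, e, start_logits[s], end_logits[e]))
    # score = start_logit + end_logit, highest first
    return sorted(candidates, key=lambda p: p.start_logit + p.end_logit, reverse=True)

# hypothetical layout: 0 is [CLS], 1-2 are question tokens, 3 is [SEP], 4-6 are context
start_logits = [2.0, 5.0, 0.1, 0.0, 4.0, 0.5, 0.2]
end_logits = [1.5, 0.3, 0.2, 0.0, 0.4, 3.0, 4.5]
candidates = toy_candidates(start_logits, end_logits, question_indexes=[1, 2])
for candidate in candidates:
    print(candidate)
```

<p>Even though the largest start logit sits on a question token (position 1), no surviving candidate starts there. Candidates that start at position 0 are still in the list at this point, just as in the real output above.</p>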
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Next we need to convert our preliminary predictions into actual text (or the empty string, if null). We'll keep track of text predictions we've seen, because different token combinations can result in the same text prediction and we only want to keep the one with the highest score (we're looping in descending score order). Finally, we'll trim this list down to the best 5 predictions.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># keep track of all best predictions</span>
<span class="n">BestPrediction</span> <span class="o">=</span> <span class="n">collections</span><span class="o">.</span><span class="n">namedtuple</span><span class="p">(</span> <span class="c1"># pylint: disable=invalid-name</span>
    <span class="s2">"BestPrediction"</span><span class="p">,</span> <span class="p">[</span><span class="s2">"text"</span><span class="p">,</span> <span class="s2">"start_logit"</span><span class="p">,</span> <span class="s2">"end_logit"</span><span class="p">]</span>
<span class="p">)</span>
<span class="n">nbest</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">seen_predictions</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">pred</span> <span class="ow">in</span> <span class="n">prelim_preds</span><span class="p">:</span>
    <span class="c1"># for now we only care about the top 5 best predictions</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">nbest</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">5</span><span class="p">:</span>
        <span class="k">break</span>
    <span class="c1"># predictions arrive in descending score order; skip null candidates here</span>
    <span class="k">if</span> <span class="n">pred</span><span class="o">.</span><span class="n">start_index</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span> <span class="c1"># non-null answers have start_index > 0</span>
        <span class="n">text</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">convert_tokens_to_string</span><span class="p">(</span>
            <span class="n">tokenizer</span><span class="o">.</span><span class="n">convert_ids_to_tokens</span><span class="p">(</span>
                <span class="n">tokens</span><span class="p">[</span><span class="n">pred</span><span class="o">.</span><span class="n">start_index</span><span class="p">:</span><span class="n">pred</span><span class="o">.</span><span class="n">end_index</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span>
            <span class="p">)</span>
        <span class="p">)</span>
        <span class="c1"># clean whitespace</span>
        <span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
        <span class="n">text</span> <span class="o">=</span> <span class="s2">" "</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">text</span><span class="o">.</span><span class="n">split</span><span class="p">())</span>
        <span class="k">if</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">seen_predictions</span><span class="p">:</span>
            <span class="k">continue</span>
        <span class="c1"># flag this text as being seen -- if we see it again, don't add it to the nbest list</span>
        <span class="n">seen_predictions</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
        <span class="c1"># add this text prediction to a pruned list of the top 5 best predictions</span>
        <span class="n">nbest</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">BestPrediction</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">text</span><span class="p">,</span> <span class="n">start_logit</span><span class="o">=</span><span class="n">pred</span><span class="o">.</span><span class="n">start_logit</span><span class="p">,</span> <span class="n">end_logit</span><span class="o">=</span><span class="n">pred</span><span class="o">.</span><span class="n">end_logit</span><span class="p">))</span>
<span class="c1"># and don't forget -- include the null answer!</span>
<span class="n">nbest</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">BestPrediction</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span> <span class="n">start_logit</span><span class="o">=</span><span class="n">start_logits</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">end_logit</span><span class="o">=</span><span class="n">end_logits</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The null answer is scored as the sum of the start_logit and end_logit associated with the [CLS] token.</p>
<p>At this point, we have a neat list of the top 5 best predictions for this question. The number of best predictions for each example is adjustable with the <code>--n_best_size</code> argument of the <code>run_squad.py</code> script. The <code>nbest</code> predictions for <em>every question</em> in the dev set are saved to disk under <code>nbest_predictions_.json</code> in <code>--output_dir</code>. (This is a great resource for digging into how a model is behaving.) Let's take a look at our <code>nbest</code> predictions.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">pprint</span><span class="p">(</span><span class="n">nbest</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>[BestPrediction(text='free oxygen began to outgas from the oceans', start_logit=6.451895713806152, end_logit=6.33292293548584),
BestPrediction(text='free oxygen began to outgas from the oceans 3 – 2 . 7 billion years ago , reaching 10 % of its present level', start_logit=6.451895713806152, end_logit=4.417276382446289),
BestPrediction(text='free oxygen began to outgas', start_logit=6.451895713806152, end_logit=4.3764214515686035),
BestPrediction(text='free oxygen', start_logit=6.451895713806152, end_logit=4.125303268432617),
BestPrediction(text='outgas from the oceans', start_logit=3.354909658432007, end_logit=6.33292293548584),
BestPrediction(text='', start_logit=6.491387367248535, end_logit=6.084450721740723)]
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Our top prediction so far is "free oxygen began to outgas from the oceans," which is already far better than what we originally predicted. This is because we have successfully excluded nonsensical predictions that would incorporate question tokens as part of the answer. However, we know it's still incorrect. Let's keep going.</p>
<p>The last step is to compute the null score -- more specifically, the difference between the null score and the best non-null score as shown below.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># compute the null score as the sum of the [CLS] token logits</span>
<span class="n">score_null</span> <span class="o">=</span> <span class="n">start_logits</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">end_logits</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="c1"># compute the difference between the null score and the best non-null score</span>
<span class="n">score_diff</span> <span class="o">=</span> <span class="n">score_null</span> <span class="o">-</span> <span class="n">nbest</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">start_logit</span> <span class="o">-</span> <span class="n">nbest</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">end_logit</span>
<span class="n">score_diff</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>-0.20898056030273438</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This <code>score_diff</code> is computed for every example in the dev set, and these scores are saved to disk in <code>null_odds_.json</code>. Let's pull up the score stored for the example we're using and see how we did!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">filename</span> <span class="o">=</span> <span class="n">model_dir</span> <span class="o">+</span> <span class="s1">'null_odds_.json'</span>
<span class="n">null_odds</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s1">'rb'</span><span class="p">))</span>
<span class="n">null_odds</span><span class="p">[</span><span class="n">example</span><span class="o">.</span><span class="n">qas_id</span><span class="p">]</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>-0.2090005874633789</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>We basically nailed it! (The full HF version contains a few more checks and some additional subtleties that could account for the slight differences in our <code>score_diff</code>.)</p>
</div>
</div>
</div>
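<p>As a quick sanity check, the same arithmetic can be redone by hand from the logits printed in the <code>nbest</code> list above. This is only a sketch: the four variable names are ours, and the values are copied from the printed output rather than read from the model.</p>

```python
# logits copied from the nbest output printed earlier in this post
null_start_logit, null_end_logit = 6.491387367248535, 6.084450721740723
best_start_logit, best_end_logit = 6.451895713806152, 6.33292293548584

# null score: sum of the two [CLS] logits
score_null = null_start_logit + null_end_logit
# best non-null score: sum of the top text prediction's logits
score_best = best_start_logit + best_end_logit
# negative means the best text span outscored the null answer
score_diff = score_null - score_best

print(score_null, score_best, score_diff)
```

<p>The null score comes out at roughly 12.58 and the best text score at roughly 12.78, so the difference is slightly negative: the text answer narrowly beats the null answer for this example.</p>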
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Using-the-null-threshold">Using the null threshold<a class="anchor-link" href="#Using-the-null-threshold"> </a></h3><p>In the previous section we covered:</p>
<ul>
<li>how to generate more robust predictions (e.g., by excluding predictions that include question tokens in the answer),</li>
<li>how to score a prediction as the sum of its start and end logits,</li>
<li>and how to compute the score difference between the null prediction and the best text prediction.</li>
</ul>
<p>The <code>run_squad.py</code> script performs all of these tasks for us and saves the score differences for every example in <code>null_odds_.json</code>. With that, we can now start to make sense of the fourth block of the results output!</p>
<p>According to the original <a href="https://arxiv.org/abs/1810.04805">BERT paper</a>,</p>
<blockquote><p>We predict a non-null answer when ŝ<sub>i,j</sub> &gt; s<sub>null</sub> + τ, where the threshold τ is selected on the dev set to maximize F1.</p>
</blockquote>
<p>In other words, the authors are saying that one should predict a null answer for a given example if that example's score difference is above a certain threshold. What should that threshold be? How should we compute it? They give us a recipe: select the threshold that maximizes F1. Rather than rerunning <code>run_squad.py</code>, we can import the aptly named function that computes SQuAD evaluation: <code>squad_evaluate</code>. (You can take a look at the code for yourself <a href="https://github.com/huggingface/transformers/blob/5856999a9f2926923f037ecd8d27b8058bcf9dae/src/transformers/data/metrics/squad_metrics.py#L211-L239">here</a>.)
To use <code>squad_evaluate</code> we'll need:</p>
<ul>
<li>the original examples (because that's where the True Answers are stored),</li>
<li><code>predictions_.json</code>,</li>
<li><code>null_odds_.json</code>,</li>
<li>and a null threshold.</li>
</ul>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># load the predictions we generated earlier</span>
<span class="n">filename</span> <span class="o">=</span> <span class="n">model_dir</span> <span class="o">+</span> <span class="s1">'predictions_.json'</span>
<span class="n">preds</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s1">'rb'</span><span class="p">))</span>
<span class="c1"># load the null score differences we generated earlier</span>
<span class="n">filename</span> <span class="o">=</span> <span class="n">model_dir</span> <span class="o">+</span> <span class="s1">'null_odds_.json'</span>
<span class="n">null_odds</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s1">'rb'</span><span class="p">))</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Let's re-evaluate our model on SQuAD2.0 using the <code>squad_evaluate</code> method. This method uses the score differences for each example in the dev set to determine two thresholds: one that maximizes the EM score and one that maximizes the F1 score. It then reports the best possible EM and F1 scores associated with those null thresholds.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">transformers.data.metrics.squad_metrics</span> <span class="kn">import</span> <span class="n">squad_evaluate</span>
<span class="c1"># the default threshold is set to 1.0 -- we'll leave it there for now</span>
<span class="n">results_default_thresh</span> <span class="o">=</span> <span class="n">squad_evaluate</span><span class="p">(</span><span class="n">examples</span><span class="p">,</span>
<span class="n">preds</span><span class="p">,</span>
<span class="n">no_answer_probs</span><span class="o">=</span><span class="n">null_odds</span><span class="p">,</span>
<span class="n">no_answer_probability_threshold</span><span class="o">=</span><span class="mf">1.0</span><span class="p">)</span>
<span class="n">pprint</span><span class="p">(</span><span class="n">results_default_thresh</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>OrderedDict([('exact', 66.25958056093658),
('f1', 69.66994428499025),
('total', 11873),
('HasAns_exact', 68.91025641025641),
('HasAns_f1', 75.74076391627662),
('HasAns_total', 5928),
('NoAns_exact', 63.61648444070648),
('NoAns_f1', 63.61648444070648),
('NoAns_total', 5945),
('best_exact', 68.36519834919565),
('best_exact_thresh', -4.189256191253662),
('best_f1', 71.1144383018176),
('best_f1_thresh', -3.767639636993408)])
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The first three blocks have the same values as in our initial evaluation because they are based on the default threshold (currently 1.0). The values in the fourth block, however, now take the <code>null_odds</code> information into account. When a given example's <code>score_diff</code> is greater than the threshold, the prediction is flipped to a null answer, which affects the overall EM and F1 scores.</p>
<p>Let's use the <code>best_f1_thresh</code> and run the evaluation once more to see a breakdown of our model's performance on <code>HasAns</code> and <code>NoAns</code> examples:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">best_f1_thresh</span> <span class="o">=</span> <span class="o">-</span><span class="mf">3.7676548957824707</span>
<span class="n">results_f1_thresh</span> <span class="o">=</span> <span class="n">squad_evaluate</span><span class="p">(</span><span class="n">examples</span><span class="p">,</span>
<span class="n">preds</span><span class="p">,</span>
<span class="n">no_answer_probs</span><span class="o">=</span><span class="n">null_odds</span><span class="p">,</span>
<span class="n">no_answer_probability_threshold</span><span class="o">=</span><span class="n">best_f1_thresh</span><span class="p">)</span>
<span class="n">pprint</span><span class="p">(</span><span class="n">results_f1_thresh</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>OrderedDict([('exact', 68.31466352227744),
('f1', 71.11106931335648),
('total', 11873),
('HasAns_exact', 61.53846153846154),
('HasAns_f1', 67.13929250294865),
('HasAns_total', 5928),
('NoAns_exact', 75.07148864592094),
('NoAns_f1', 75.07148864592094),
('NoAns_total', 5945),
('best_exact', 68.36519834919565),
('best_exact_thresh', -4.189256191253662),
('best_f1', 71.1144383018176),
('best_f1_thresh', -3.767639636993408)])
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>When we used the default threshold of 1.0, our <code>NoAns_f1</code> score was a mere 63.6, but with the <code>best_f1_thresh</code> it rises to 75.1, a jump of more than 11 points! The downside is that we lose some ground on <code>HasAns</code> examples, where F1 drops from 75.7 to 67.1. Overall, however, we see a net gain: EM goes from 66.3 to 68.3 and F1 from 69.7 to 71.1. This demonstrates that computing null scores and properly using a null threshold improves QA performance on the SQuAD2.0 dev set with almost no additional work.</p>
</div>
</div>
</div>
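<p>To build intuition for how a threshold that "maximizes F1" can be found, here is a brute-force sketch on made-up data. It is a simplification under our own assumptions, not the <code>squad_evaluate</code> implementation: <code>f1_at_threshold</code>, <code>best_threshold</code>, and the five toy examples are hypothetical, and each example carries a precomputed F1 for its best text prediction.</p>

```python
def f1_at_threshold(examples, threshold):
    """Mean F1 when every example with score_diff > threshold is answered with null."""
    total = 0.0
    for ex in examples:
        if ex["score_diff"] > threshold:
            # predicted null: full credit only if the question truly has no answer
            total += 0.0 if ex["has_answer"] else 1.0
        else:
            # predicted text: credit is the span's F1 (zero for unanswerable questions)
            total += ex["text_f1"] if ex["has_answer"] else 0.0
    return total / len(examples)

def best_threshold(examples):
    # candidate thresholds: every observed score_diff, plus one value below all of
    # them so that "predict null everywhere" is also considered
    diffs = sorted(ex["score_diff"] for ex in examples)
    candidates = [diffs[0] - 1.0] + diffs
    return max(candidates, key=lambda t: f1_at_threshold(examples, t))

# toy dev set (all values invented for illustration)
toy_examples = [
    {"score_diff": -6.0, "has_answer": True, "text_f1": 1.0},
    {"score_diff": -4.5, "has_answer": True, "text_f1": 0.8},
    {"score_diff": -2.0, "has_answer": False, "text_f1": 0.0},
    {"score_diff": -0.5, "has_answer": False, "text_f1": 0.0},
    {"score_diff": 2.0, "has_answer": False, "text_f1": 0.0},
]

tau = best_threshold(toy_examples)
print("best threshold:", tau)
print("F1 at best threshold:", f1_at_threshold(toy_examples, tau))
print("F1 at default threshold 1.0:", f1_at_threshold(toy_examples, 1.0))
```

<p>On this toy set the selected threshold is negative, and it correctly flips the two unanswerable questions that the default threshold of 1.0 would have answered with text. That is the same effect we saw on the real dev set.</p>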
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Putting-it-all-together">Putting it all together<a class="anchor-link" href="#Putting-it-all-together"> </a></h1><p>Below we present a new method that selects more robust predictions, computes scores for the best text predictions (as well as for the null prediction), and uses these scores along with a null threshold to determine whether the question should be answered. As a bonus, this method also computes and returns the probability of the answer, which is often easier to interpret than a logit score. Prediction probabilities depend on <code>nbest</code>, since they are computed with a softmax over the scores of the <code>nbest</code> most likely predictions.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">get_robust_prediction</span><span class="p">(</span><span class="n">example</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">nbest</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">null_threshold</span><span class="o">=</span><span class="mf">1.0</span><span class="p">):</span>
    <span class="n">inputs</span> <span class="o">=</span> <span class="n">get_qa_inputs</span><span class="p">(</span><span class="n">example</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">)</span>
    <span class="n">start_logits</span><span class="p">,</span> <span class="n">end_logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">inputs</span><span class="p">)</span>
    <span class="c1"># get sensible preliminary predictions, sorted by score</span>
    <span class="n">prelim_preds</span> <span class="o">=</span> <span class="n">preliminary_predictions</span><span class="p">(</span><span class="n">start_logits</span><span class="p">,</span>
                                           <span class="n">end_logits</span><span class="p">,</span>
                                           <span class="n">inputs</span><span class="p">[</span><span class="s1">'input_ids'</span><span class="p">],</span>
                                           <span class="n">nbest</span><span class="p">)</span>
    <span class="c1"># narrow that down to the top nbest predictions</span>
    <span class="n">nbest_preds</span> <span class="o">=</span> <span class="n">best_predictions</span><span class="p">(</span><span class="n">prelim_preds</span><span class="p">,</span> <span class="n">nbest</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">)</span>
    <span class="c1"># compute the probability of each prediction - nice but not necessary</span>
    <span class="n">probabilities</span> <span class="o">=</span> <span class="n">prediction_probabilities</span><span class="p">(</span><span class="n">nbest_preds</span><span class="p">)</span>
    <span class="c1"># compute score difference</span>
    <span class="n">score_difference</span> <span class="o">=</span> <span class="n">compute_score_difference</span><span class="p">(</span><span class="n">nbest_preds</span><span class="p">)</span>
    <span class="c1"># if score difference > threshold, return the null answer</span>
    <span class="k">if</span> <span class="n">score_difference</span> <span class="o">></span> <span class="n">null_threshold</span><span class="p">:</span>
        <span class="k">return</span> <span class="s2">""</span><span class="p">,</span> <span class="n">probabilities</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">nbest_preds</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">text</span><span class="p">,</span> <span class="n">probabilities</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</pre></div>
</div>
</div>
</div>
</div>
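<p>To see why the returned probabilities depend on <code>nbest</code>, here is a small sketch that applies a softmax to the scores of the six <code>nbest</code> entries printed earlier, and then to a shorter list. It uses a pure-Python <code>softmax</code> written for this illustration (the helper in this post uses NumPy), and the variable names are our own.</p>

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# start_logit + end_logit for the six nbest entries printed earlier (null answer last)
scores = [
    6.451895713806152 + 6.33292293548584,
    6.451895713806152 + 4.417276382446289,
    6.451895713806152 + 4.3764214515686035,
    6.451895713806152 + 4.125303268432617,
    3.354909658432007 + 6.33292293548584,
    6.491387367248535 + 6.084450721740723,
]

# probabilities over all six candidates
probs_all = softmax(scores)
# probabilities if we had kept only the top two text answers plus the null answer
probs_short = softmax([scores[0], scores[1], scores[5]])

print("top answer, 6 candidates:", probs_all[0])
print("top answer, 3 candidates:", probs_short[0])
```

<p>The top answer's logit score is identical in both cases, but its probability is higher when fewer candidates share the softmax. Treat these probabilities as relative to the candidate list rather than as calibrated confidences.</p>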
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># collapse-hide</span>
<span class="c1"># ----------------- Helper functions for get_robust_prediction ----------------- #</span>
<span class="k">def</span> <span class="nf">get_qa_inputs</span><span class="p">(</span><span class="n">example</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">):</span>
<span class="c1"># load the example, convert to inputs, get model outputs</span>
<span class="n">question</span> <span class="o">=</span> <span class="n">example</span><span class="o">.</span><span class="n">question_text</span>
<span class="n">context</span> <span class="o">=</span> <span class="n">example</span><span class="o">.</span><span class="n">context_text</span>
<span class="k">return</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">encode_plus</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">context</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s1">'pt'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_clean_text</span><span class="p">(</span><span class="n">tokens</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">):</span>
    <span class="n">text</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">convert_tokens_to_string</span><span class="p">(</span>
        <span class="n">tokenizer</span><span class="o">.</span><span class="n">convert_ids_to_tokens</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span>
    <span class="p">)</span>
    <span class="c1"># Clean whitespace</span>
    <span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
    <span class="n">text</span> <span class="o">=</span> <span class="s2">" "</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">text</span><span class="o">.</span><span class="n">split</span><span class="p">())</span>
    <span class="k">return</span> <span class="n">text</span>
<span class="k">def</span> <span class="nf">prediction_probabilities</span><span class="p">(</span><span class="n">predictions</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">softmax</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
        <span class="sd">"""Compute softmax values for each set of scores in x."""</span>
        <span class="n">e_x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">max</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">e_x</span> <span class="o">/</span> <span class="n">e_x</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
    <span class="n">all_scores</span> <span class="o">=</span> <span class="p">[</span><span class="n">pred</span><span class="o">.</span><span class="n">start_logit</span><span class="o">+</span><span class="n">pred</span><span class="o">.</span><span class="n">end_logit</span> <span class="k">for</span> <span class="n">pred</span> <span class="ow">in</span> <span class="n">predictions</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">softmax</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">all_scores</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">preliminary_predictions</span><span class="p">(</span><span class="n">start_logits</span><span class="p">,</span> <span class="n">end_logits</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">,</span> <span class="n">nbest</span><span class="p">):</span>
    <span class="c1"># convert tensors to lists</span>
    <span class="n">start_logits</span> <span class="o">=</span> <span class="n">to_list</span><span class="p">(</span><span class="n">start_logits</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">end_logits</span> <span class="o">=</span> <span class="n">to_list</span><span class="p">(</span><span class="n">end_logits</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">tokens</span> <span class="o">=</span> <span class="n">to_list</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
    <span class="c1"># sort our start and end logits from largest to smallest, keeping track of the index</span>
    <span class="n">start_idx_and_logit</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">enumerate</span><span class="p">(</span><span class="n">start_logits</span><span class="p">),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
    <span class="n">end_idx_and_logit</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">enumerate</span><span class="p">(</span><span class="n">end_logits</span><span class="p">),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
    <span class="n">start_indexes</span> <span class="o">=</span> <span class="p">[</span><span class="n">idx</span> <span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">logit</span> <span class="ow">in</span> <span class="n">start_idx_and_logit</span><span class="p">[:</span><span class="n">nbest</span><span class="p">]]</span>
    <span class="n">end_indexes</span> <span class="o">=</span> <span class="p">[</span><span class="n">idx</span> <span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">logit</span> <span class="ow">in</span> <span class="n">end_idx_and_logit</span><span class="p">[:</span><span class="n">nbest</span><span class="p">]]</span>
    <span class="c1"># question tokens are between the CLS token (101, at position 0) and first SEP (102) token </span>
    <span class="n">question_indexes</span> <span class="o">=</span> <span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">token</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">tokens</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="n">tokens</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="mi">102</span><span class="p">)])]</span>
    <span class="c1"># keep track of all preliminary predictions</span>
    <span class="n">PrelimPrediction</span> <span class="o">=</span> <span class="n">collections</span><span class="o">.</span><span class="n">namedtuple</span><span class="p">(</span> <span class="c1"># pylint: disable=invalid-name</span>
        <span class="s2">"PrelimPrediction"</span><span class="p">,</span> <span class="p">[</span><span class="s2">"start_index"</span><span class="p">,</span> <span class="s2">"end_index"</span><span class="p">,</span> <span class="s2">"start_logit"</span><span class="p">,</span> <span class="s2">"end_logit"</span><span class="p">]</span>
    <span class="p">)</span>
    <span class="n">prelim_preds</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">start_index</span> <span class="ow">in</span> <span class="n">start_indexes</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">end_index</span> <span class="ow">in</span> <span class="n">end_indexes</span><span class="p">:</span>
            <span class="c1"># throw out invalid predictions</span>
            <span class="k">if</span> <span class="n">start_index</span> <span class="ow">in</span> <span class="n">question_indexes</span><span class="p">:</span>
                <span class="k">continue</span>
            <span class="k">if</span> <span class="n">end_index</span> <span class="ow">in</span> <span class="n">question_indexes</span><span class="p">:</span>
                <span class="k">continue</span>
            <span class="k">if</span> <span class="n">end_index</span> <span class="o"><</span> <span class="n">start_index</span><span class="p">:</span>
                <span class="k">continue</span>
            <span class="n">prelim_preds</span><span class="o">.</span><span class="n">append</span><span class="p">(</span>
                <span class="n">PrelimPrediction</span><span class="p">(</span>
                    <span class="n">start_index</span> <span class="o">=</span> <span class="n">start_index</span><span class="p">,</span>
                    <span class="n">end_index</span> <span class="o">=</span> <span class="n">end_index</span><span class="p">,</span>
                    <span class="n">start_logit</span> <span class="o">=</span> <span class="n">start_logits</span><span class="p">[</span><span class="n">start_index</span><span class="p">],</span>
                    <span class="n">end_logit</span> <span class="o">=</span> <span class="n">end_logits</span><span class="p">[</span><span class="n">end_index</span><span class="p">]</span>
                <span class="p">)</span>
            <span class="p">)</span>
    <span class="c1"># sort prelim_preds in descending score order</span>
    <span class="n">prelim_preds</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">prelim_preds</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">start_logit</span> <span class="o">+</span> <span class="n">x</span><span class="o">.</span><span class="n">end_logit</span><span class="p">),</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">prelim_preds</span>
<span class="k">def</span> <span class="nf">best_predictions</span><span class="p">(</span><span class="n">prelim_preds</span><span class="p">,</span> <span class="n">nbest</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">):</span>
    <span class="c1"># NOTE: tokens, start_logits, and end_logits are not parameters of this function;</span>
    <span class="c1"># as written, they must already be defined in the enclosing (notebook) scope</span>
    <span class="c1"># keep track of all best predictions</span>
    <span class="c1"># This will be the pool from which answer probabilities are computed </span>
    <span class="n">BestPrediction</span> <span class="o">=</span> <span class="n">collections</span><span class="o">.</span><span class="n">namedtuple</span><span class="p">(</span>
        <span class="s2">"BestPrediction"</span><span class="p">,</span> <span class="p">[</span><span class="s2">"text"</span><span class="p">,</span> <span class="s2">"start_logit"</span><span class="p">,</span> <span class="s2">"end_logit"</span><span class="p">]</span>
    <span class="p">)</span>
    <span class="n">nbest_predictions</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">seen_predictions</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">pred</span> <span class="ow">in</span> <span class="n">prelim_preds</span><span class="p">:</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">nbest_predictions</span><span class="p">)</span> <span class="o">>=</span> <span class="n">nbest</span><span class="p">:</span>
            <span class="k">break</span>
        <span class="k">if</span> <span class="n">pred</span><span class="o">.</span><span class="n">start_index</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span> <span class="c1"># non-null answers have start_index > 0</span>
            <span class="n">toks</span> <span class="o">=</span> <span class="n">tokens</span><span class="p">[</span><span class="n">pred</span><span class="o">.</span><span class="n">start_index</span> <span class="p">:</span> <span class="n">pred</span><span class="o">.</span><span class="n">end_index</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span>
            <span class="n">text</span> <span class="o">=</span> <span class="n">get_clean_text</span><span class="p">(</span><span class="n">toks</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">)</span>
            <span class="c1"># if this text has been seen already - skip it</span>
            <span class="k">if</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">seen_predictions</span><span class="p">:</span>
                <span class="k">continue</span>
            <span class="c1"># flag text as being seen</span>
            <span class="n">seen_predictions</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
            <span class="c1"># add this text to a pruned list of the top nbest predictions</span>
            <span class="n">nbest_predictions</span><span class="o">.</span><span class="n">append</span><span class="p">(</span>
                <span class="n">BestPrediction</span><span class="p">(</span>
                    <span class="n">text</span><span class="o">=</span><span class="n">text</span><span class="p">,</span>
                    <span class="n">start_logit</span><span class="o">=</span><span class="n">pred</span><span class="o">.</span><span class="n">start_logit</span><span class="p">,</span>
                    <span class="n">end_logit</span><span class="o">=</span><span class="n">pred</span><span class="o">.</span><span class="n">end_logit</span>
                <span class="p">)</span>
            <span class="p">)</span>
    <span class="c1"># Add the null prediction</span>
    <span class="n">nbest_predictions</span><span class="o">.</span><span class="n">append</span><span class="p">(</span>
        <span class="n">BestPrediction</span><span class="p">(</span>
            <span class="n">text</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span>
            <span class="n">start_logit</span><span class="o">=</span><span class="n">start_logits</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span>
            <span class="n">end_logit</span><span class="o">=</span><span class="n">end_logits</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
        <span class="p">)</span>
    <span class="p">)</span>
    <span class="k">return</span> <span class="n">nbest_predictions</span>
<span class="k">def</span> <span class="nf">compute_score_difference</span><span class="p">(</span><span class="n">predictions</span><span class="p">):</span>
    <span class="sd">""" Assumes that the null answer is always the last prediction """</span>
    <span class="n">score_null</span> <span class="o">=</span> <span class="n">predictions</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">start_logit</span> <span class="o">+</span> <span class="n">predictions</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">end_logit</span>
    <span class="n">score_non_null</span> <span class="o">=</span> <span class="n">predictions</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">start_logit</span> <span class="o">+</span> <span class="n">predictions</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">end_logit</span>
    <span class="k">return</span> <span class="n">score_null</span> <span class="o">-</span> <span class="n">score_non_null</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
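<p>To see how the helpers above fit together, here is a minimal, self-contained sketch of the final decision step. It assumes the null answer is returned whenever the score difference (null score minus best non-null score) exceeds the threshold, which is how Hugging Face's SQuAD post-processing applies its null threshold. The name <code>choose_answer</code> and the logits are made up for illustration; they are not taken from the model or from <code>get_robust_prediction</code> itself.</p>

```python
import math
from collections import namedtuple

# Same fields as the BestPrediction namedtuple used by the helpers
Prediction = namedtuple("Prediction", ["text", "start_logit", "end_logit"])


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def choose_answer(predictions, null_threshold):
    """Hypothetical decision step.

    Assumes predictions are sorted best-first, with the null answer
    (text == "") in the last position.
    """
    scores = [p.start_logit + p.end_logit for p in predictions]
    probs = softmax(scores)
    score_diff = scores[-1] - scores[0]  # null score minus best non-null score
    if score_diff > null_threshold:
        return predictions[-1].text, probs[-1]
    return predictions[0].text, probs[0]


# Toy logits, invented for this example
toy = [
    Prediction("free oxygen began to outgas from the oceans", 4.0, 3.5),
    Prediction("oxygen began to outgas", 2.0, 1.5),
    Prediction("", 3.0, 3.2),
]

print(choose_answer(toy, null_threshold=-2.0))  # permissive threshold: null answer wins
print(choose_answer(toy, null_threshold=1.0))   # strict threshold: best span wins
```

<p>With these toy numbers the score difference is negative (the best span outscores the null answer), so only a sufficiently negative threshold lets the null answer through. That is why the choice of threshold matters so much for no-answer questions.</p>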
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Will we now get the right answer (an empty string) for that tricky no-answer example we were working with?</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="n">example</span><span class="o">.</span><span class="n">question_text</span><span class="p">)</span>
<span class="n">get_robust_prediction</span><span class="p">(</span><span class="n">example</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">nbest</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">null_threshold</span><span class="o">=</span><span class="n">best_f1_thresh</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>What happened 3.7-2 billion years ago?
</pre>
</div>
</div>
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>('', 0.34412444013709165)</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Woohoo!! We got the right answer this time!!</p>
<p>Even if we didn't have the best threshold in place, our additional checks still allow us to output more sensible-looking answers by rejecting predictions that include question tokens.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="n">example</span><span class="o">.</span><span class="n">question_text</span><span class="p">)</span>
<span class="n">get_robust_prediction</span><span class="p">(</span><span class="n">example</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">nbest</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">null_threshold</span><span class="o">=</span><span class="mf">1.0</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>What happened 3.7-2 billion years ago?
</pre>
</div>
</div>
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>('free oxygen began to outgas from the oceans', 0.42410620054269993)</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>And if it hadn't been a trick question, this would be the correct answer! (Seems like distilBERT could use some improvement in number understanding.)</p>
<h1 id="Final-Thoughts">Final Thoughts<a class="anchor-link" href="#Final-Thoughts"> </a></h1><p>Using a robust prediction method like the above will do more than allow a model to perform better on a curated dev set, though this is an important first step. It will also provide the model with a slightly better ability to refrain from answering questions that simply don't have an answer in the associated passage. This is a crucial feature for QA models, because it's not enough to get an answer if that answer doesn't make sense. We want our models to tell us something useful -- and sometimes that means telling us nothing at all.</p>
</div>
</div>
</div>
</div>Building a QA System with BERT on Wikipedia2020-05-19T00:00:00-05:002020-05-19T00:00:00-05:00https://qa.fastforwardlabs.com/pytorch/hugging%20face/wikipedia/bert/transformers/2020/05/19/Getting_Started_with_QA
<div class="container" id="notebook-container">
<div class="cell border-box-sizing code_cell rendered">
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><img src="/images/copied_from_nb/my_icons/markus-spiske-C0koz3G1I4I-unsplash.jpg" alt="" title="Image by Markus Spiske at Unsplash.com" /></p>
<h1 id="So-you've-decided-to-build-a-QA-system">So you've decided to build a QA system<a class="anchor-link" href="#So-you've-decided-to-build-a-QA-system"> </a></h1><p>You want to start with something simple and general, so you plan to make it open domain, using Wikipedia as a corpus for answering questions. You want to use the best NLP that your compute resources allow (you're lucky enough to have access to a GPU), so you're going to focus on the big, flashy Transformer models that are all the rage these days.</p>
<p>Sounds like you're building an IR-based QA system. In our previous post (<a href="https://qa.fastforwardlabs.com/methods/background/2020/04/28/Intro-to-QA.html">Intro to Automated Question Answering</a>), we covered the general design of these systems, which typically require two main components: the document <em>retriever</em> (a search engine) that selects the n most relevant documents from a large collection, and a document <em>reader</em> that processes these candidate documents in search of an explicit answer span.</p>
<p><img src="/images/copied_from_nb/my_icons/QAworkflow.png" alt="" title="IR-based automated question answering workflow" /></p>
<p>Now we're going to build it!</p>
<p>This post is chock full of code that walks through our approach. We'll also highlight and clarify some powerful resources (including off-the-shelf models and libraries) that you can use to quickly get going on a QA system of your own. We'll cover all the necessary steps including:</p>
<ul>
<li>installing libraries and setting up an environment,</li>
<li>training a Transformer style model on the SQuAD dataset,</li>
<li>understanding Hugging Face's run_squad.py training script and output,</li>
<li>and passing a full Wikipedia article as context for a question.</li>
</ul>
<p>By the end of this post we'll have a working IR-based QA system, with BERT as the document reader and Wikipedia's search engine as the document retriever - a fun toy model that hints at potential real-world use cases.</p>
<p>This article was originally developed in a Jupyter Notebook and, thanks to <a href="https://fastpages.fast.ai/">fastpages</a>, converted to a blog post. For an interactive environment, click the "Open in Colab" button above. Note that, due to Colab's system constraints, some of the cells in this notebook might not be fully executable. We'll highlight when this is the case, but don't worry -- you'll still be able to play around with all the fun stuff.</p>
<p>Let's get started!</p>
<h1 id="Setting-up-your-virtual-environment">Setting up your virtual environment<a class="anchor-link" href="#Setting-up-your-virtual-environment"> </a></h1><p>Working in a virtual environment is always best practice, and we're using <code>venv</code> on our workhorse machine. For this project, we'll be using PyTorch, which handles the heavy lifting of deep differentiable learning. If you have a GPU, you'll want a PyTorch build that includes CUDA support, though most cells in this notebook will work fine without one. Check out <a href="https://pytorch.org/">PyTorch's quick install guide</a> to determine the best build for your GPU and OS. We'll also be using the <a href="https://huggingface.co/transformers/index.html">Transformers</a> library, which provides easy-to-use implementations of all the popular Transformer architectures, like BERT. Finally, we'll need the <a href="https://pypi.org/project/wikipedia/">wikipedia</a> library for easy access and parsing of Wikipedia pages.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>You can recreate our env (with CUDA 9.2 support -- but use the appropriate version for your machine) with the following commands in your command line:</p>
<div class="highlight"><pre><span></span>$ python3 -m venv myenv
$ <span class="nb">source</span> myenv/bin/activate
$ pip install <span class="nv">torch</span><span class="o">==</span><span class="m">1</span>.5.0+cu92 <span class="nv">torchvision</span><span class="o">==</span><span class="m">0</span>.6.0+cu92 -f https://download.pytorch.org/whl/torch_stable.html
$ pip install <span class="nv">transformers</span><span class="o">==</span><span class="m">2</span>.5.1
$ pip install <span class="nv">wikipedia</span><span class="o">==</span><span class="m">1</span>.4.0
</pre></div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Note: Our GPU machine sports an older version of CUDA (9.2 -- we're getting around to updating that), so we need to use an older version of PyTorch for the necessary CUDA support. The training script we'll be using logs through a <code>SummaryWriter</code>. More recent versions of PyTorch ship one in <code>torch.utils.tensorboard</code>; older versions do not. If you have to work with an older version of PyTorch, you might need to install <code>tensorboardX</code> (see the hidden code cell below).</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># collapse-hide </span>
<span class="c1"># line 69 of `run_squad.py` script shows why you might need to install </span>
<span class="c1"># tensorboardX if you have an older version of torch</span>
<span class="k">try</span><span class="p">:</span>
    <span class="kn">from</span> <span class="nn">torch.utils.tensorboard</span> <span class="kn">import</span> <span class="n">SummaryWriter</span>
<span class="k">except</span> <span class="ne">ImportError</span><span class="p">:</span>
    <span class="kn">from</span> <span class="nn">tensorboardX</span> <span class="kn">import</span> <span class="n">SummaryWriter</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Alternatively, if you're working in Colab, you can run the cell below.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>pip install torch torchvision -f https://download.pytorch.org/whl/torch_stable.html
<span class="o">!</span>pip install <span class="nv">transformers</span><span class="o">==</span><span class="m">2</span>.5.1
<span class="o">!</span>pip install <span class="nv">wikipedia</span><span class="o">==</span><span class="m">1</span>.4.0
</pre></div>
</div>
</div>
</div>
</div>
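<p>Once the installs finish, it can be worth confirming that the environment matches the versions pinned above. The sketch below uses only the standard library (<code>importlib.metadata</code>, Python 3.8+). The names <code>PINNED</code> and <code>check_environment</code> are our own for illustration, and the expected versions are simply copied from the <code>pip install</code> commands in this post, so adjust them if you installed a different build.</p>

```python
from importlib import metadata

# Versions copied from the pip commands above; change these to match your setup
PINNED = {
    "torch": "1.5.0+cu92",
    "transformers": "2.5.1",
    "wikipedia": "1.4.0",
}


def check_environment(pinned):
    """Return {package: (installed_version_or_None, matches_pinned_version)}."""
    report = {}
    for package, wanted in pinned.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[package] = (installed, installed == wanted)
    return report


for package, (installed, matches) in check_environment(PINNED).items():
    status = "ok" if matches else "differs from the pinned version"
    print(f"{package}: installed={installed} ({status})")
```

<p>A mismatch isn't necessarily a problem (Colab, for instance, ships its own PyTorch build), but it's a useful first thing to check if a later cell misbehaves.</p>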
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Hugging-Face-Transformers">Hugging Face Transformers<a class="anchor-link" href="#Hugging-Face-Transformers"> </a></h1><p>The <a href="https://huggingface.co/transformers/#">Hugging Face Transformers</a> package provides state-of-the-art general-purpose architectures for natural language understanding and natural language generation. Hugging Face hosts dozens of pre-trained models, covering over 100 languages, that you can use right out of the box. All of these models come with deep interoperability between PyTorch and TensorFlow 2.0, which means you can move a model from TF 2.0 to PyTorch and back again with just a line or two of code!</p>
<p>If you're new to Hugging Face, we strongly recommend working through the HF <a href="https://huggingface.co/transformers/quickstart.html">Quickstart guide</a> as well as their excellent <a href="https://huggingface.co/transformers/notebooks.html">Transformer Notebooks</a> (we did!), as we won't cover that material in this notebook. We'll be using <code>AutoClasses</code>, which serve as a wrapper around pretty much any of the base Transformer classes.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Fine-tuning-a-Transformer-model-for-Question-Answering">Fine-tuning a Transformer model for Question Answering<a class="anchor-link" href="#Fine-tuning-a-Transformer-model-for-Question-Answering"> </a></h2><p>To train a Transformer for QA with Hugging Face, we'll need</p>
<ol>
<li>a specific model architecture,</li>
<li>a QA dataset, and</li>
<li>the training script.</li>
</ol>
<p>With these three things in hand we'll then walk through the fine-tuning process.</p>
<h3 id="1.-Pick-a-Model">1. Pick a Model<a class="anchor-link" href="#1.-Pick-a-Model"> </a></h3><p>Not every Transformer architecture lends itself naturally to the task of question answering. For example, GPT does not do QA; similarly BERT does not do machine translation. HF identifies the following model types for the QA task:</p>
<ul>
<li>BERT</li>
<li>distilBERT </li>
<li>ALBERT</li>
<li>RoBERTa</li>
<li>XLNet</li>
<li>XLM</li>
<li>FlauBERT</li>
</ul>
<p>We'll stick with the now-classic BERT model in this notebook, but feel free to try out some others (we will - and we'll let you know when we do). Next up: a training set.</p>
<h3 id="2.-QA-dataset:-SQuAD">2. QA dataset: SQuAD<a class="anchor-link" href="#2.-QA-dataset:-SQuAD"> </a></h3><p>One of the most canonical datasets for QA is the Stanford Question Answering Dataset, or SQuAD, which comes in two flavors: SQuAD 1.1 and SQuAD 2.0. These reading comprehension datasets consist of questions posed on a set of Wikipedia articles, where the answer to every question is a segment (or span) of the corresponding passage. In SQuAD 1.1, all questions have an answer in the corresponding passage. SQuAD 2.0 steps up the difficulty by including questions that cannot be answered by the provided passage.</p>
<p>The following code will download the specified version of SQuAD.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># set path with magic</span>
<span class="o">%</span><span class="k">env</span> DATA_DIR=./data/squad
<span class="c1"># download the data</span>
<span class="k">def</span> <span class="nf">download_squad</span><span class="p">(</span><span class="n">version</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">version</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
        <span class="o">!</span>wget -P <span class="nv">$DATA_DIR</span> https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
        <span class="o">!</span>wget -P <span class="nv">$DATA_DIR</span> https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
    <span class="k">else</span><span class="p">:</span>
        <span class="o">!</span>wget -P <span class="nv">$DATA_DIR</span> https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
        <span class="o">!</span>wget -P <span class="nv">$DATA_DIR</span> https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
<span class="n">download_squad</span><span class="p">(</span><span class="n">version</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>env: DATA_DIR=./data/squad
--2020-05-11 21:36:52-- https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.109.153, 185.199.108.153, 185.199.111.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘./data/squad/train-v2.0.json’
train-v2.0.json 100%[===================>] 40.17M 14.6MB/s in 2.8s
2020-05-11 21:36:55 (14.6 MB/s) - ‘./data/squad/train-v2.0.json’ saved [42123633/42123633]
--2020-05-11 21:36:56-- https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.110.153, 185.199.111.153, 185.199.108.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.110.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘./data/squad/dev-v2.0.json’
dev-v2.0.json 100%[===================>] 4.17M 6.68MB/s in 0.6s
2020-05-11 21:36:56 (6.68 MB/s) - ‘./data/squad/dev-v2.0.json’ saved [4370528/4370528]
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="3.-Fine-tuning-script">3. Fine-tuning script<a class="anchor-link" href="#3.-Fine-tuning-script"> </a></h3><p>We've chosen a model and we've got some data. Time to train!</p>
<p>All the standard models that HF supports have been pre-trained, which means they've all been fed massive unlabeled text corpora in order to learn basic language modeling. To perform well at a specific task (like question answering), they must be trained further -- fine-tuned -- on a dataset built for that task.</p>
<p>HF helpfully provides a script that fine-tunes a Transformer model on one of the SQuAD datasets, called <code>run_squad.py</code>. You can grab the script <a href="https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py">here</a> or run the cell below.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># download the run_squad.py training script</span>
<span class="o">!</span>curl -L -O https://raw.githubusercontent.com/huggingface/transformers/b90745c5901809faef3136ed09a689e7d733526c/examples/run_squad.py
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This script takes care of all the hard work that goes into fine-tuning a model and, as such, it's pretty complicated. It accepts no fewer than 45 arguments, providing an impressive amount of flexibility and utility for those who do a lot of training. We'll leave the details of this script for another day, and focus instead on the basic command to fine-tune BERT on SQuAD 1.1 or 2.0.</p>
<p>Below are the most important arguments for the <code>run_squad.py</code> fine-tuning script.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># fine-tuning your own model for QA using HF's `run_squad.py`</span>
<span class="c1"># turn flags on and off according to the model you're training</span>
<span class="n">cmd</span> <span class="o">=</span> <span class="p">[</span>
<span class="s1">'python'</span><span class="p">,</span>
<span class="c1"># '-m torch.distributed.launch --nproc_per_node 2', # use this to perform distributed training over multiple GPUs</span>
<span class="s1">'run_squad.py'</span><span class="p">,</span>
<span class="s1">'--model_type'</span><span class="p">,</span> <span class="s1">'bert'</span><span class="p">,</span> <span class="c1"># model type (one of the list under "Pick a Model" above)</span>
<span class="s1">'--model_name_or_path'</span><span class="p">,</span> <span class="s1">'bert-base-uncased'</span><span class="p">,</span> <span class="c1"># specific model name of the given model type (shown, a list is here: https://huggingface.co/transformers/pretrained_models.html) </span>
<span class="c1"># on first execution this initiates a download of pre-trained model weights;</span>
<span class="c1"># can also be a local path to a directory with model weights</span>
<span class="s1">'--output_dir'</span><span class="p">,</span> <span class="s1">'./models/bert/bbu_squad2'</span><span class="p">,</span> <span class="c1"># directory for model checkpoints and predictions</span>
<span class="c1"># '--overwrite_output_dir', # use when adding output to a directory that is non-empty --</span>
<span class="c1"># for instance, when training crashes midway through and you need to restart it</span>
<span class="s1">'--do_train'</span><span class="p">,</span> <span class="c1"># execute the training method </span>
<span class="s1">'--train_file'</span><span class="p">,</span> <span class="s1">'$DATA_DIR/train-v2.0.json'</span><span class="p">,</span> <span class="c1"># provide the training data</span>
<span class="s1">'--version_2_with_negative'</span><span class="p">,</span> <span class="c1"># ** MUST use this flag if training on SQuAD 2.0! DO NOT use if training on SQuAD 1.1</span>
<span class="s1">'--do_lower_case'</span><span class="p">,</span> <span class="c1"># ** set this flag if using an uncased model; don't use for Cased Models</span>
<span class="s1">'--do_eval'</span><span class="p">,</span> <span class="c1"># execute the evaluation method on the dev set -- note: </span>
<span class="c1"># if coupled with --do_train, evaluation runs after fine-tuning </span>
<span class="s1">'--predict_file'</span><span class="p">,</span> <span class="s1">'$DATA_DIR/dev-v2.0.json'</span><span class="p">,</span> <span class="c1"># provide evaluation data (dev set)</span>
<span class="s1">'--eval_all_checkpoints'</span><span class="p">,</span> <span class="c1"># evaluate the model on the dev set at each checkpoint</span>
<span class="s1">'--per_gpu_eval_batch_size'</span><span class="p">,</span> <span class="s1">'12'</span><span class="p">,</span> <span class="c1"># evaluation batch size for each gpu</span>
<span class="s1">'--per_gpu_train_batch_size'</span><span class="p">,</span> <span class="s1">'12'</span><span class="p">,</span> <span class="c1"># training batch size for each gpu</span>
<span class="s1">'--save_steps'</span><span class="p">,</span> <span class="s1">'5000'</span><span class="p">,</span> <span class="c1"># how often checkpoints (complete model snapshot) are saved </span>
<span class="s1">'--threads'</span><span class="p">,</span> <span class="s1">'8'</span><span class="p">,</span> <span class="c1"># num of CPU threads to use for converting SQuAD examples to model features</span>
<span class="c1"># --- Model and Feature Hyperparameters --- </span>
<span class="s1">'--num_train_epochs'</span><span class="p">,</span> <span class="s1">'3'</span><span class="p">,</span> <span class="c1"># number of training epochs - usually 2-3 for SQuAD </span>
<span class="s1">'--learning_rate'</span><span class="p">,</span> <span class="s1">'3e-5'</span><span class="p">,</span> <span class="c1"># learning rate for the default optimizer (Adam in this case)</span>
<span class="s1">'--max_seq_length'</span><span class="p">,</span> <span class="s1">'384'</span><span class="p">,</span> <span class="c1"># maximum length allowed for the full input sequence </span>
<span class="s1">'--doc_stride'</span><span class="p">,</span> <span class="s1">'128'</span> <span class="c1"># used for long documents that must be chunked into multiple features -- </span>
<span class="c1"># this "sliding window" controls the amount of stride between chunks</span>
<span class="p">]</span>
</pre></div>
</div>
</div>
</div>
</div>
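<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The <code>cmd</code> list above is only a description of the command until something executes it. One way to run it from Python is <code>subprocess.run</code>. Note that strings such as <code>'$DATA_DIR/train-v2.0.json'</code> are passed literally when no shell is involved, so the sketch below expands environment variables first. The helper names <code>expand_args</code> and <code>run_finetuning</code> are our own illustrations and are not part of <code>run_squad.py</code>.</p>

```python
import os
import subprocess


def expand_args(args):
    """Expand $VARS in each argument; unset variables are left untouched."""
    return [os.path.expandvars(a) for a in args]


def run_finetuning(args):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(expand_args(args), check=True)


# shortened stand-in for the full argument list (illustration only)
os.environ.setdefault("DATA_DIR", "./data/squad")
example_cmd = ["python", "run_squad.py", "--train_file", "$DATA_DIR/train-v2.0.json"]
print(expand_args(example_cmd))

# run_finetuning(cmd)  # uncomment to launch training with the full list
```

<p>Passing a list rather than a single string avoids shell quoting problems, at the cost of having to expand variables by hand.</p>
</div>
</div>
</div>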
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Here's what to expect when executing <code>run_squad.py</code> for the first time:</p>
<ol>
<li>Pre-trained model weights for the specified model (e.g., <code>bert-base-uncased</code>) are downloaded.</li>
<li>SQuAD training examples are converted into features (takes 15-30 minutes depending on dataset size and number of threads).</li>
<li>Training features are saved to a cache file (so that you don't have to do this again <em>for this model type</em>).</li>
<li>If <code>--do_train</code>, training commences for as many epochs as you specify, saving the model weights every <code>--save_steps</code> steps until training finishes. These checkpoints are saved in <code>[--output_dir]/checkpoint-[step number]</code> subdirectories.</li>
<li>The final model weights and peripheral files are saved to <code>--output_dir</code>.</li>
<li>If <code>--do_eval</code>, SQuAD dev examples are converted into features.</li>
<li>Dev features are also saved to a cache file.</li>
<li>Evaluation commences and outputs a dizzying assortment of performance scores.</li>
</ol>
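<p>To get a feel for step 4 above, the number of training steps and checkpoints can be estimated ahead of time. This is a rough sketch that assumes a single GPU and no gradient accumulation. The function name <code>checkpoint_steps</code> and the feature count used in the example are hypothetical placeholders; the real feature count depends on the dataset, <code>--max_seq_length</code>, and <code>--doc_stride</code>.</p>

```python
import math


def checkpoint_steps(num_features, batch_size, num_epochs, save_steps):
    """Return (total optimizer steps, steps at which a checkpoint is written)."""
    steps_per_epoch = math.ceil(num_features / batch_size)
    total_steps = steps_per_epoch * num_epochs
    # a checkpoint is written whenever the global step is a multiple of save_steps
    return total_steps, list(range(save_steps, total_steps + 1, save_steps))


# 132_000 is a placeholder feature count, not a measured value
total, checkpoints = checkpoint_steps(
    num_features=132_000, batch_size=12, num_epochs=3, save_steps=5000
)
print(f"{total} steps, {len(checkpoints)} checkpoints")
```

<p>Each checkpoint is a complete copy of the model weights, so a small <code>--save_steps</code> value can consume disk space quickly.</p>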
<h3 id="Time-to-train!">Time to train!<a class="anchor-link" href="#Time-to-train!"> </a></h3><p>But first, a note on compute requirements. We don't recommend fine-tuning a Transformer model unless you're rocking at least one GPU and a considerable amount of RAM. For context, our GPU is several years old (GeForce GTX TITAN X), and while it's not nearly as fast as the Tesla V100 (the current Cadillac of GPUs), it gets the job done. On this card, fine-tuning <code>bert-base-uncased</code> takes about 1.75 hours <em>per epoch</em>. Additionally, our workhorse machine has 32GB of CPU RAM and 12GB of GPU memory, which is sufficient for data processing and training most models on either of the SQuAD datasets.</p>
<p>The following cells demonstrate two ways to fine-tune: on the command line and in a Colab notebook.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="Training-on-the-command-line">Training on the command line<a class="anchor-link" href="#Training-on-the-command-line"> </a></h4><p>We saved the following as a shell script (<code>run_squad.sh</code>) and ran it on the command line (<code>$ source run_squad.sh</code>) of our workhorse GPU machine. Shell scripts help prevent mistakes and mis-keys when typing arguments on the command line, especially for complex scripts like this. They also allow you to keep track of which arguments were used last (though, as we'll see below, the <code>run_squad.py</code> script has a solution for that). We actually kept two shell scripts -- one explicitly for training and another for evaluation.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<div class="highlight"><pre><span></span><span class="ch">#!/bin/sh</span>
<span class="nb">export</span> <span class="nv">DATA_DIR</span><span class="o">=</span>./data/squad
<span class="nb">export</span> <span class="nv">MODEL_DIR</span><span class="o">=</span>./models
python run_squad.py <span class="se">\</span>
--model_type bert <span class="se">\</span>
--model_name_or_path bert-base-uncased <span class="se">\</span>
--output_dir <span class="nv">$MODEL_DIR</span>/bert/ <span class="se">\</span>
--data_dir <span class="nv">$DATA_DIR</span> <span class="se">\</span>
--overwrite_output_dir <span class="se">\</span>
--overwrite_cache <span class="se">\</span>
--do_train <span class="se">\</span>
--train_file train-v2.0.json <span class="se">\</span>
--version_2_with_negative <span class="se">\</span>
--do_lower_case <span class="se">\</span>
--do_eval <span class="se">\</span>
--predict_file dev-v2.0.json <span class="se">\</span>
--per_gpu_train_batch_size <span class="m">2</span> <span class="se">\</span>
--learning_rate 3e-5 <span class="se">\</span>
--num_train_epochs <span class="m">2</span>.0 <span class="se">\</span>
--max_seq_length <span class="m">384</span> <span class="se">\</span>
--doc_stride <span class="m">128</span> <span class="se">\</span>
--threads <span class="m">10</span> <span class="se">\</span>
--save_steps <span class="m">5000</span>
</pre></div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h4 id="Training-in-Colab">Training in Colab<a class="anchor-link" href="#Training-in-Colab"> </a></h4><p>Alternatively, you can execute training in a notebook cell, as shown below. Note that standard Colab environments only provide 12GB of RAM. Converting the SQuAD dataset to features is memory intensive and may cause the basic Colab environment to fail silently. If you have a Colab instance with additional memory capacity (16GB+), this cell should execute fully.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>python run_squad.py <span class="err">\</span>
<span class="o">--</span><span class="n">model_type</span> <span class="n">bert</span> \
<span class="o">--</span><span class="n">model_name_or_path</span> <span class="n">bert</span><span class="o">-</span><span class="n">base</span><span class="o">-</span><span class="n">uncased</span> \
<span class="o">--</span><span class="n">output_dir</span> <span class="n">models</span><span class="o">/</span><span class="n">bert</span><span class="o">/</span> \
<span class="o">--</span><span class="n">data_dir</span> <span class="n">data</span><span class="o">/</span><span class="n">squad</span> \
<span class="o">--</span><span class="n">overwrite_output_dir</span> \
<span class="o">--</span><span class="n">overwrite_cache</span> \
<span class="o">--</span><span class="n">do_train</span> \
<span class="o">--</span><span class="n">train_file</span> <span class="n">train</span><span class="o">-</span><span class="n">v2</span><span class="o">.</span><span class="mf">0.</span><span class="n">json</span> \
<span class="o">--</span><span class="n">version_2_with_negative</span> \
<span class="o">--</span><span class="n">do_lower_case</span> \
<span class="o">--</span><span class="n">do_eval</span> \
<span class="o">--</span><span class="n">predict_file</span> <span class="n">dev</span><span class="o">-</span><span class="n">v2</span><span class="o">.</span><span class="mf">0.</span><span class="n">json</span> \
<span class="o">--</span><span class="n">per_gpu_train_batch_size</span> <span class="mi">2</span> \
<span class="o">--</span><span class="n">learning_rate</span> <span class="mf">3e-5</span> \
<span class="o">--</span><span class="n">num_train_epochs</span> <span class="mf">2.0</span> \
<span class="o">--</span><span class="n">max_seq_length</span> <span class="mi">384</span> \
<span class="o">--</span><span class="n">doc_stride</span> <span class="mi">128</span> \
<span class="o">--</span><span class="n">threads</span> <span class="mi">10</span> \
<span class="o">--</span><span class="n">save_steps</span> <span class="mi">5000</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Training-Output">Training Output<a class="anchor-link" href="#Training-Output"> </a></h3><p>Successful completion of <code>run_squad.py</code> yields a slew of output, which can be found in the directory specified by <code>--output_dir</code> above. There you'll find...</p>
<p>Files for the model's tokenizer:</p>
<ul>
<li><code>tokenizer_config.json</code></li>
<li><code>vocab.txt</code></li>
<li><code>special_tokens_map.json</code></li>
</ul>
<p>Files for the model itself:</p>
<ul>
<li><code>pytorch_model.bin</code>: these are the actual model weights (this file can be several GB for some models)</li>
<li><code>config.json</code>: details of the model architecture</li>
</ul>
<p>A binary representation of the command-line arguments used to train this model (so you'll never forget which arguments you used!):</p>
<ul>
<li><code>training_args.bin</code></li>
</ul>
<p>And if you included <code>--do_eval</code>, you'll also see these files:</p>
<ul>
<li><code>predictions_.json</code>: the official best answer for each example</li>
<li><code>nbest_predictions_.json</code>: the top n best answers for each example</li>
</ul>
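<p>If a training run was interrupted, some of the files listed above may be missing. The following is a minimal sketch for checking an output directory using only the standard library. The function name <code>missing_output_files</code> is our own, and the file list simply mirrors the one above for a BERT model; other model types may use different tokenizer files.</p>

```python
from pathlib import Path

# file names taken from the lists above (BERT-style tokenizer files)
EXPECTED_FILES = [
    "tokenizer_config.json",
    "vocab.txt",
    "special_tokens_map.json",
    "pytorch_model.bin",
    "config.json",
    "training_args.bin",
]
EVAL_FILES = ["predictions_.json", "nbest_predictions_.json"]


def missing_output_files(output_dir, did_eval=False):
    """Return the expected file names that are absent from output_dir."""
    names = EXPECTED_FILES + (EVAL_FILES if did_eval else [])
    return [n for n in names if not (Path(output_dir) / n).is_file()]


# an empty list means every expected file is present
print(missing_output_files("./models/bert/bbu_squad2", did_eval=True))
```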
<p>Passing the path to this directory to the <code>from_pretrained</code> method of <code>AutoModel</code> or <code>AutoModelForQuestionAnswering</code> will load your fine-tuned model for use.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span><span class="p">,</span> <span class="n">AutoModelForQuestionAnswering</span>
<span class="c1"># Load the fine-tuned model</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s2">"./models/bert/bbu_squad2"</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelForQuestionAnswering</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s2">"./models/bert/bbu_squad2"</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Using-a-pre-fine-tuned-model-from-the-Hugging-Face-repository">Using a pre-fine-tuned model from the Hugging Face repository<a class="anchor-link" href="#Using-a-pre-fine-tuned-model-from-the-Hugging-Face-repository"> </a></h2><p>If you don't have access to GPUs or don't have the time to fiddle and train models, you're in luck! Hugging Face is more than a collection of slick Transformer classes -- it also hosts <a href="https://huggingface.co/models">a repository</a> for pre-trained and fine-tuned models contributed by the wider community of NLP practitioners. Searching for "squad" brings up at least 55 models.</p>
<p><img src="/images/copied_from_nb/my_icons/HF_repo.png" alt="" /></p>
<p>Each of these links provides explicit code for using the model, and, in some cases, information on how it was trained and what results were achieved. Let's load one of these pre-fine-tuned models.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span><span class="p">,</span> <span class="n">AutoModelForQuestionAnswering</span>
<span class="c1"># executing these commands for the first time initiates a download of the </span>
<span class="c1"># model weights to ~/.cache/torch/transformers/</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s2">"deepset/bert-base-cased-squad2"</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelForQuestionAnswering</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s2">"deepset/bert-base-cased-squad2"</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Let's-try-our-model!">Let's try our model!<a class="anchor-link" href="#Let's-try-our-model!"> </a></h2><p>Whether you fine-tuned your own or used a pre-fine-tuned model, it's time to play with it! There are three steps to QA:</p>
<ol>
<li>tokenize the input</li>
<li>obtain model scores</li>
<li>get the answer span</li>
</ol>
<p>These steps are discussed in detail in the HF <a href="https://huggingface.co/transformers/notebooks.html">Transformer Notebooks</a>.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">question</span> <span class="o">=</span> <span class="s2">"Who ruled Macedonia"</span>
<span class="n">context</span> <span class="o">=</span> <span class="s2">"""Macedonia was an ancient kingdom on the periphery of Archaic and Classical Greece, </span>
<span class="s2">and later the dominant state of Hellenistic Greece. The kingdom was founded and initially ruled </span>
<span class="s2">by the Argead dynasty, followed by the Antipatrid and Antigonid dynasties. Home to the ancient </span>
<span class="s2">Macedonians, it originated on the northeastern part of the Greek peninsula. Before the 4th </span>
<span class="s2">century BC, it was a small kingdom outside of the area dominated by the city-states of Athens, </span>
<span class="s2">Sparta and Thebes, and briefly subordinate to Achaemenid Persia."""</span>
<span class="c1"># 1. TOKENIZE THE INPUT</span>
<span class="c1"># note: if you don't include return_tensors='pt' you'll get a list of lists which is easier for </span>
<span class="c1"># exploration but you cannot feed that into a model. </span>
<span class="n">inputs</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">encode_plus</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">context</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s2">"pt"</span><span class="p">)</span>
<span class="c1"># 2. OBTAIN MODEL SCORES</span>
<span class="c1"># the AutoModelForQuestionAnswering class includes a span predictor on top of the model. </span>
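<span class="c1"># the model returns answer start and end scores for each token in the input;</span>
<span class="c1"># slicing works whether the model returns a plain tuple or an output object (newer transformers releases)</span>
<span class="n">answer_start_scores</span><span class="p">,</span> <span class="n">answer_end_scores</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">inputs</span><span class="p">)[:</span><span class="mi">2</span><span class="p">]</span>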
<span class="n">answer_start</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">answer_start_scores</span><span class="p">)</span> <span class="c1"># get the most likely beginning of answer with the argmax of the score</span>
<span class="n">answer_end</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">answer_end_scores</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span> <span class="c1"># get the most likely end of answer with the argmax of the score</span>
<span class="c1"># 3. GET THE ANSWER SPAN</span>
<span class="c1"># once we have the most likely start and end tokens, we grab all the tokens between them</span>
<span class="c1"># and convert tokens back to words!</span>
<span class="n">tokenizer</span><span class="o">.</span><span class="n">convert_tokens_to_string</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">convert_ids_to_tokens</span><span class="p">(</span><span class="n">inputs</span><span class="p">[</span><span class="s2">"input_ids"</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="n">answer_start</span><span class="p">:</span><span class="n">answer_end</span><span class="p">]))</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>'the Argead dynasty'</pre>
</div>
</div>
</div>
</div>
</div>
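<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Taking the argmax of the start and end scores independently works for the example above, but nothing guarantees that the predicted end falls after the predicted start. A common alternative is to search for the highest-scoring <em>valid</em> span. The sketch below does this over plain Python lists. The function name <code>best_span</code> and the <code>max_answer_len</code> default are our own assumptions, and this is a simplification, not the post-processing that <code>run_squad.py</code> performs.</p>

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) with the highest start+end score where
    start <= end and the span covers at most max_answer_len tokens.
    The returned end index is inclusive."""
    best = None
    for s, s_score in enumerate(start_scores):
        last = min(len(end_scores), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_scores[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# toy scores where the independent argmaxes disagree (start=3, end=0)
toy_start = [0.1, 2.0, 0.3, 5.0]
toy_end = [4.0, 0.2, 1.5, 0.1]
print(best_span(toy_start, toy_end))
```

<p>With the model above, the inputs would be something like <code>answer_start_scores[0].tolist()</code> and <code>answer_end_scores[0].tolist()</code>. Because the returned end index is inclusive, add 1 before slicing the token ids.</p>
</div>
</div>
</div>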
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="QA-on-Wikipedia-pages">QA on Wikipedia pages<a class="anchor-link" href="#QA-on-Wikipedia-pages"> </a></h1><p>We tried our model on a question paired with a short passage, but what if we want to retrieve an answer from a longer document? A typical Wikipedia page is much longer than the example above, and we need to do a bit of massaging before we can use our model on longer contexts.</p>
<p>Let's start by pulling up a Wikipedia page.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">wikipedia</span> <span class="k">as</span> <span class="nn">wiki</span>
<span class="kn">import</span> <span class="nn">pprint</span> <span class="k">as</span> <span class="nn">pp</span>
<span class="n">question</span> <span class="o">=</span> <span class="s1">'What is the wingspan of an albatross?'</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">wiki</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">question</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Wikipedia search results for our question:</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="n">pp</span><span class="o">.</span><span class="n">pprint</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
<span class="n">page</span> <span class="o">=</span> <span class="n">wiki</span><span class="o">.</span><span class="n">page</span><span class="p">(</span><span class="n">results</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">page</span><span class="o">.</span><span class="n">content</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="se">\n</span><span class="s2">The </span><span class="si">{</span><span class="n">results</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s2"> Wikipedia article contains </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">)</span><span class="si">}</span><span class="s2"> characters."</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Wikipedia search results for our question:
['Albatross',
'List of largest birds',
'Black-browed albatross',
'Argentavis',
'Pterosaur',
'Mollymawk',
'List of birds by flight speed',
'Largest body part',
'Pelican',
'Aspect ratio (aeronautics)']
The Albatross Wikipedia article contains 38200 characters.
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">inputs</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">encode_plus</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s1">'pt'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"This translates into </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">inputs</span><span class="p">[</span><span class="s1">'input_ids'</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span><span class="si">}</span><span class="s2"> tokens."</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stderr output_text">
<pre>Token indices sequence length is longer than the specified maximum sequence length for this model (10 > 512). Running this sequence through the model will result in indexing errors
</pre>
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>This translates into 8824 tokens.
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The tokenizer takes the input as text and returns tokens. In general, tokenizers convert words or pieces of words into a model-ingestible format. The specific tokens and format depend on the type of model. For example, BERT tokenizes words differently from RoBERTa, so be sure to always use the tokenizer associated with your model.</p>
<p>In this case, the tokenizer converts our input text into 8824 tokens, but this far exceeds the maximum number of tokens that can be fed to the model at one time. Most BERT-esque models can only accept 512 tokens at once, thus the (somewhat confusing) warning above (how is 10 > 512?). This means we'll have to split our input into chunks and each chunk must not exceed 512 tokens in total.</p>
<p>When working with Question Answering, it's crucial that each chunk follows this format:</p>
<p>[CLS] question tokens [SEP] context tokens [SEP]</p>
<p>This means that, for each segment of a Wikipedia article, we must prepend the original question, followed by the next "chunk" of article tokens.</p>
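<p>As a framework-free illustration of this format, the sketch below builds chunks from plain lists of token ids. It also accepts an optional <code>stride</code> so that consecutive chunks overlap, similar in spirit to the <code>--doc_stride</code> sliding window used during training. Everything here is an assumption made for illustration: the function name <code>chunk_with_question</code>, the meaning of <code>stride</code> (the step between chunk starts), and the default ids 101 and 102, which are the <code>[CLS]</code> and <code>[SEP]</code> ids in the standard BERT vocabularies. In practice, read these ids from your tokenizer.</p>

```python
def chunk_with_question(question_ids, context_ids, max_len=512, stride=None,
                        cls_id=101, sep_id=102):
    """Build [CLS] question [SEP] chunk [SEP] sequences no longer than max_len."""
    # room left for context once [CLS], the question, and two [SEP]s are counted
    window = max_len - len(question_ids) - 3
    if window <= 0:
        raise ValueError("question is too long for the chosen max_len")
    step = window if stride is None else stride  # step == window: no overlap
    if not 0 < step <= window:
        raise ValueError("stride must be between 1 and the window size")
    chunks = []
    start = 0
    while True:
        piece = context_ids[start:start + window]
        chunks.append([cls_id] + question_ids + [sep_id] + piece + [sep_id])
        if start + window >= len(context_ids):
            break  # this chunk reached the end of the context
        start += step
    return chunks


# toy ids: a 2-token question and a 10-token context, with a tiny max_len
toy_chunks = chunk_with_question([1, 2], list(range(10, 20)), max_len=9)
print([len(c) for c in toy_chunks])
```

<p>Overlapping chunks cost extra forward passes, but they reduce the chance that an answer is cut in half at a chunk boundary.</p>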
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># time to chunk!</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">OrderedDict</span>
<span class="c1"># identify question tokens (token_type_ids = 0)</span>
<span class="n">qmask</span> <span class="o">=</span> <span class="n">inputs</span><span class="p">[</span><span class="s1">'token_type_ids'</span><span class="p">]</span><span class="o">.</span><span class="n">lt</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">qt</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">masked_select</span><span class="p">(</span><span class="n">inputs</span><span class="p">[</span><span class="s1">'input_ids'</span><span class="p">],</span> <span class="n">qmask</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"The question consists of </span><span class="si">{</span><span class="n">qt</span><span class="o">.</span><span class="n">size</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s2"> tokens."</span><span class="p">)</span>
<span class="n">chunk_size</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">max_position_embeddings</span> <span class="o">-</span> <span class="n">qt</span><span class="o">.</span><span class="n">size</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="mi">1</span> <span class="c1"># the "-1" accounts for</span>
<span class="c1"># having to add a [SEP] token to the end of each chunk</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Each chunk will contain </span><span class="si">{</span><span class="n">chunk_size</span> <span class="o">-</span> <span class="mi">2</span><span class="si">}</span><span class="s2"> tokens of the Wikipedia article."</span><span class="p">)</span>
<span class="c1"># create a dict of dicts; each sub-dict mimics the structure of pre-chunked model input</span>
<span class="n">chunked_input</span> <span class="o">=</span> <span class="n">OrderedDict</span><span class="p">()</span>
<span class="k">for</span> <span class="n">k</span><span class="p">,</span><span class="n">v</span> <span class="ow">in</span> <span class="n">inputs</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
    <span class="n">q</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">masked_select</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">qmask</span><span class="p">)</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">masked_select</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="o">~</span><span class="n">qmask</span><span class="p">)</span>
    <span class="n">chunks</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">chunk_size</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">chunks</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">i</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">chunked_input</span><span class="p">:</span>
            <span class="n">chunked_input</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
        <span class="n">thing</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">cat</span><span class="p">((</span><span class="n">q</span><span class="p">,</span> <span class="n">chunk</span><span class="p">))</span>
        <span class="k">if</span> <span class="n">i</span> <span class="o">!=</span> <span class="nb">len</span><span class="p">(</span><span class="n">chunks</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">k</span> <span class="o">==</span> <span class="s1">'input_ids'</span><span class="p">:</span>
                <span class="c1"># append the tokenizer's [SEP] id (102 for BERT) rather than hard-coding it</span>
                <span class="n">thing</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">cat</span><span class="p">((</span><span class="n">thing</span><span class="p">,</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">sep_token_id</span><span class="p">])))</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="c1"># token_type_ids and attention_mask both take the value 1 for the added [SEP]</span>
                <span class="n">thing</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">cat</span><span class="p">((</span><span class="n">thing</span><span class="p">,</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">1</span><span class="p">])))</span>
        <span class="n">chunked_input</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="n">k</span><span class="p">]</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="n">thing</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>The question consists of 12 tokens.
Each chunk will contain 497 tokens of the Wikipedia article.
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">chunked_input</span><span class="o">.</span><span class="n">keys</span><span class="p">())):</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Number of tokens in chunk </span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s2">: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">chunked_input</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="s1">'input_ids'</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()[</span><span class="mi">0</span><span class="p">])</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Number of tokens in chunk 0: 512
Number of tokens in chunk 1: 512
Number of tokens in chunk 2: 512
Number of tokens in chunk 3: 512
Number of tokens in chunk 4: 512
Number of tokens in chunk 5: 512
Number of tokens in chunk 6: 512
Number of tokens in chunk 7: 512
Number of tokens in chunk 8: 512
Number of tokens in chunk 9: 512
Number of tokens in chunk 10: 512
Number of tokens in chunk 11: 512
Number of tokens in chunk 12: 512
Number of tokens in chunk 13: 512
Number of tokens in chunk 14: 512
Number of tokens in chunk 15: 512
Number of tokens in chunk 16: 512
Number of tokens in chunk 17: 341
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Each of these chunks (except for the last one) has the following structure:</p>
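<p>[CLS], 10 question tokens, [SEP], 499 tokens of the Wikipedia article, [SEP] token = 512 tokens</p>
<p>Note that the 12 tokens counted for the question already include the leading [CLS] and the first [SEP], because both carry a <code>token_type_ids</code> value of 0. With <code>chunk_size</code> = 512 - 12 - 1 = 499, each full chunk therefore holds 499 article tokens, two more than the 497 reported by the print statement.</p>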
<p>Each of these chunks can now be fed to the model without causing indexing errors. We'll get an "answer" for each chunk; however, not all answers are useful, since not every segment of a Wikipedia article is informative for our question. The model will return the [CLS] token when it determines that the context does not contain an answer to the question.</p>
</div>
</div>
</div>
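<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To make the chunk arithmetic easier to follow, here is a minimal sketch of the same idea using plain Python lists instead of tensors. The function name <code>chunk_inputs</code> and the toy token ids are hypothetical and chosen only for illustration. The sketch assumes the context is passed in without its closing [SEP], so it appends one to every chunk, including the last.</p>

```python
def chunk_inputs(question_ids, context_ids, max_len, sep_id):
    """Split context_ids so that question + piece + [SEP] fits in max_len.

    question_ids is assumed to already contain [CLS] and the first [SEP].
    context_ids is assumed NOT to contain the closing [SEP].
    """
    # reserve one slot for the [SEP] that closes each chunk
    chunk_size = max_len - len(question_ids) - 1
    if chunk_size <= 0:
        raise ValueError("question is too long for the model's maximum length")

    chunks = []
    for start in range(0, len(context_ids), chunk_size):
        piece = context_ids[start:start + chunk_size]
        chunks.append(question_ids + piece + [sep_id])
    return chunks


# toy example: 101 and 102 stand in for [CLS] and [SEP]
question = [101, 7, 8, 9, 102]
context = list(range(1000, 1020))  # 20 made-up context token ids
chunks = chunk_inputs(question, context, max_len=12, sep_id=102)
print([len(c) for c in chunks])  # prints [12, 12, 12, 8]
```

<p>Every chunk except possibly the last is exactly <code>max_len</code> tokens long, which mirrors the 512-token chunks printed above.</p>
</div>
</div>
</div>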
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">convert_ids_to_string</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">convert_tokens_to_string</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">convert_ids_to_tokens</span><span class="p">(</span><span class="n">input_ids</span><span class="p">))</span>

<span class="n">answer</span> <span class="o">=</span> <span class="s1">''</span>
<span class="c1"># now we iterate over our chunks, looking for the best answer from each chunk</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">chunked_input</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
    <span class="c1"># index the output so this works whether the model returns a tuple or an output object</span>
    <span class="n">outputs</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">chunk</span><span class="p">)</span>
    <span class="n">answer_start_scores</span><span class="p">,</span> <span class="n">answer_end_scores</span> <span class="o">=</span> <span class="n">outputs</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">outputs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">answer_start</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">answer_start_scores</span><span class="p">)</span>
    <span class="n">answer_end</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">answer_end_scores</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
    <span class="n">ans</span> <span class="o">=</span> <span class="n">convert_ids_to_string</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">,</span> <span class="n">chunk</span><span class="p">[</span><span class="s1">'input_ids'</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="n">answer_start</span><span class="p">:</span><span class="n">answer_end</span><span class="p">])</span>
    <span class="c1"># if the ans == [CLS] then the model did not find a real answer in this chunk</span>
    <span class="k">if</span> <span class="n">ans</span> <span class="o">!=</span> <span class="s1">'[CLS]'</span><span class="p">:</span>
        <span class="n">answer</span> <span class="o">+=</span> <span class="n">ans</span> <span class="o">+</span> <span class="s2">" / "</span>

<span class="nb">print</span><span class="p">(</span><span class="n">answer</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>3 . 7 m /
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Putting-it-all-together">Putting it all together<a class="anchor-link" href="#Putting-it-all-together"> </a></h1><p>Let's recap. We've essentially built a simple IR-based QA system! We're using <code>wikipedia</code>'s search engine to return a list of candidate documents that we then feed into our document reader (in this case, BERT fine-tuned on SQuAD 2.0). Let's make our code easier to read and more self-contained by packaging the document reader into a class.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">OrderedDict</span>

<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span><span class="p">,</span> <span class="n">AutoModelForQuestionAnswering</span>

<span class="k">class</span> <span class="nc">DocumentReader</span><span class="p">:</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">pretrained_model_name_or_path</span><span class="o">=</span><span class="s1">'bert-large-uncased'</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">READER_PATH</span> <span class="o">=</span> <span class="n">pretrained_model_name_or_path</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">READER_PATH</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelForQuestionAnswering</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">READER_PATH</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">max_len</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">max_position_embeddings</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">chunked</span> <span class="o">=</span> <span class="kc">False</span>

    <span class="k">def</span> <span class="nf">tokenize</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">question</span><span class="p">,</span> <span class="n">text</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">inputs</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">encode_plus</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="n">add_special_tokens</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s2">"pt"</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">input_ids</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">inputs</span><span class="p">[</span><span class="s2">"input_ids"</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span>

        <span class="c1"># reset the flag so a short document that follows a long one is not treated as chunked</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">chunked</span> <span class="o">=</span> <span class="kc">False</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">input_ids</span><span class="p">)</span> <span class="o">></span> <span class="bp">self</span><span class="o">.</span><span class="n">max_len</span><span class="p">:</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">inputs</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">chunkify</span><span class="p">()</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">chunked</span> <span class="o">=</span> <span class="kc">True</span>

    <span class="k">def</span> <span class="nf">chunkify</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="sd">""" </span>
<span class="sd">        Break up a long article into chunks that fit within the max token</span>
<span class="sd">        requirement for that Transformer model. </span>
<span class="sd">        Calls to BERT / RoBERTa / ALBERT require the following format:</span>
<span class="sd">        [CLS] question tokens [SEP] context tokens [SEP].</span>
<span class="sd">        """</span>
        <span class="c1"># create question mask based on token_type_ids</span>
        <span class="c1"># value is 0 for question tokens, 1 for context tokens</span>
        <span class="n">qmask</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">inputs</span><span class="p">[</span><span class="s1">'token_type_ids'</span><span class="p">]</span><span class="o">.</span><span class="n">lt</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">qt</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">masked_select</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">inputs</span><span class="p">[</span><span class="s1">'input_ids'</span><span class="p">],</span> <span class="n">qmask</span><span class="p">)</span>
        <span class="n">chunk_size</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">max_len</span> <span class="o">-</span> <span class="n">qt</span><span class="o">.</span><span class="n">size</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="mi">1</span> <span class="c1"># the "-1" accounts for</span>
        <span class="c1"># having to add an ending [SEP] token to the end</span>

        <span class="c1"># create a dict of dicts; each sub-dict mimics the structure of pre-chunked model input</span>
        <span class="n">chunked_input</span> <span class="o">=</span> <span class="n">OrderedDict</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">k</span><span class="p">,</span><span class="n">v</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">inputs</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
            <span class="n">q</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">masked_select</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">qmask</span><span class="p">)</span>
            <span class="n">c</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">masked_select</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="o">~</span><span class="n">qmask</span><span class="p">)</span>
            <span class="n">chunks</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">chunk_size</span><span class="p">)</span>
            <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">chunks</span><span class="p">):</span>
                <span class="k">if</span> <span class="n">i</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">chunked_input</span><span class="p">:</span>
                    <span class="n">chunked_input</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
                <span class="n">thing</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">cat</span><span class="p">((</span><span class="n">q</span><span class="p">,</span> <span class="n">chunk</span><span class="p">))</span>
                <span class="k">if</span> <span class="n">i</span> <span class="o">!=</span> <span class="nb">len</span><span class="p">(</span><span class="n">chunks</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">:</span>
                    <span class="k">if</span> <span class="n">k</span> <span class="o">==</span> <span class="s1">'input_ids'</span><span class="p">:</span>
                        <span class="c1"># append the tokenizer's [SEP] id (102 for BERT) rather than hard-coding it</span>
                        <span class="n">thing</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">cat</span><span class="p">((</span><span class="n">thing</span><span class="p">,</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="bp">self</span><span class="o">.</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">sep_token_id</span><span class="p">])))</span>
                    <span class="k">else</span><span class="p">:</span>
                        <span class="n">thing</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">cat</span><span class="p">((</span><span class="n">thing</span><span class="p">,</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">1</span><span class="p">])))</span>
                <span class="n">chunked_input</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="n">k</span><span class="p">]</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="n">thing</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">chunked_input</span>

    <span class="k">def</span> <span class="nf">get_answer</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">chunked</span><span class="p">:</span>
            <span class="n">answer</span> <span class="o">=</span> <span class="s1">''</span>
            <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">inputs</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
                <span class="c1"># index the output so this works whether the model returns a tuple or an output object</span>
                <span class="n">outputs</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">chunk</span><span class="p">)</span>
                <span class="n">answer_start_scores</span><span class="p">,</span> <span class="n">answer_end_scores</span> <span class="o">=</span> <span class="n">outputs</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">outputs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
                <span class="n">answer_start</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">answer_start_scores</span><span class="p">)</span>
                <span class="n">answer_end</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">answer_end_scores</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
                <span class="n">ans</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">convert_ids_to_string</span><span class="p">(</span><span class="n">chunk</span><span class="p">[</span><span class="s1">'input_ids'</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="n">answer_start</span><span class="p">:</span><span class="n">answer_end</span><span class="p">])</span>
                <span class="k">if</span> <span class="n">ans</span> <span class="o">!=</span> <span class="s1">'[CLS]'</span><span class="p">:</span>
                    <span class="n">answer</span> <span class="o">+=</span> <span class="n">ans</span> <span class="o">+</span> <span class="s2">" / "</span>
            <span class="k">return</span> <span class="n">answer</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">outputs</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="bp">self</span><span class="o">.</span><span class="n">inputs</span><span class="p">)</span>
            <span class="n">answer_start_scores</span><span class="p">,</span> <span class="n">answer_end_scores</span> <span class="o">=</span> <span class="n">outputs</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">outputs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
            <span class="n">answer_start</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">answer_start_scores</span><span class="p">)</span> <span class="c1"># get the most likely beginning of answer with the argmax of the score</span>
            <span class="n">answer_end</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">answer_end_scores</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span> <span class="c1"># get the most likely end of answer with the argmax of the score</span>
            <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">convert_ids_to_string</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">inputs</span><span class="p">[</span><span class="s1">'input_ids'</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="n">answer_start</span><span class="p">:</span><span class="n">answer_end</span><span class="p">])</span>

    <span class="k">def</span> <span class="nf">convert_ids_to_string</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">convert_tokens_to_string</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">convert_ids_to_tokens</span><span class="p">(</span><span class="n">input_ids</span><span class="p">))</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Below is our clean, fully working QA system! Feel free to add your own questions.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># collapse-hide </span>
<span class="c1"># to make the following output more readable I'll turn off the token sequence length warning</span>
<span class="kn">import</span> <span class="nn">logging</span>
<span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">(</span><span class="s2">"transformers.tokenization_utils"</span><span class="p">)</span><span class="o">.</span><span class="n">setLevel</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">ERROR</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">questions</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s1">'When was Barack Obama born?'</span><span class="p">,</span>
    <span class="s1">'Why is the sky blue?'</span><span class="p">,</span>
    <span class="s1">'How many sides does a pentagon have?'</span>
<span class="p">]</span>

<span class="n">reader</span> <span class="o">=</span> <span class="n">DocumentReader</span><span class="p">(</span><span class="s2">"deepset/bert-base-cased-squad2"</span><span class="p">)</span>

<span class="c1"># if you trained your own model using the training cell earlier, you can access it with this:</span>
<span class="c1">#reader = DocumentReader("./models/bert/bbu_squad2")</span>

<span class="k">for</span> <span class="n">question</span> <span class="ow">in</span> <span class="n">questions</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Question: </span><span class="si">{</span><span class="n">question</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
    <span class="n">results</span> <span class="o">=</span> <span class="n">wiki</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">question</span><span class="p">)</span>

    <span class="n">page</span> <span class="o">=</span> <span class="n">wiki</span><span class="o">.</span><span class="n">page</span><span class="p">(</span><span class="n">results</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Top wiki result: </span><span class="si">{</span><span class="n">page</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>

    <span class="n">text</span> <span class="o">=</span> <span class="n">page</span><span class="o">.</span><span class="n">content</span>

    <span class="n">reader</span><span class="o">.</span><span class="n">tokenize</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Answer: </span><span class="si">{</span><span class="n">reader</span><span class="o">.</span><span class="n">get_answer</span><span class="p">()</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Question: When was Barack Obama born?
Top wiki result: &lt;WikipediaPage 'Barack Obama Sr.'&gt;
Answer: 18 June 1936 / February 2 , 1961 /
Question: Why is the sky blue?
Top wiki result: &lt;WikipediaPage 'Diffuse sky radiation'&gt;
Answer: Rayleigh scattering /
Question: How many sides does a pentagon have?
Top wiki result: &lt;WikipediaPage 'The Pentagon'&gt;
Answer: five /
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>It got 2 out of 3 questions right!</p>
<p>Notice that, at least for the questions we've chosen, the QA system's mistakes stem from Wikipedia's default search engine, not from BERT! The search pulls up the wrong page for two of our questions: a page about Barack Obama Sr. instead of the former US President, and an article about the US Department of Defense building "The Pentagon" instead of a page about geometry. In the latter case, we ended up with the correct answer by coincidence! This illustrates that any successful IR-based QA system requires a search engine (document retriever) as good as the document reader.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Wrapping-Up">Wrapping Up<a class="anchor-link" href="#Wrapping-Up"> </a></h1><p>There we have it! A working QA system on Wikipedia articles. This is great, but it's admittedly not very sophisticated. Furthermore, we've left a lot of questions unanswered:</p>
<ol>
<li>Why fine-tune on the SQuAD dataset and not something else? What other options are there? </li>
<li>How good is BERT at answering questions? And how do we define "good"?</li>
<li>Why BERT and not another Transformer model? </li>
<li>Currently, our QA system can return an answer for each chunk of a Wiki article, but not all of those answers are correct. How can we improve our <code>get_answer</code> method?</li>
<li>Additionally, we're chunking a Wiki article in such a way that a chunk could end in the middle of a sentence. Can we improve our <code>chunkify</code> method? </li>
</ol>
<p>Over the course of this project, we'll tackle these questions and more. By the end of this series we hope to demonstrate a snazzier QA model that incorporates everything we learn along the way. Stay tuned!</p>
</div>
</div>
</div>
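<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As a taste of what an answer to question 4 might look like, here is a rough sketch of one possible heuristic: score every candidate span in each chunk by the sum of its start and end scores, skip chunks whose best span is the [CLS] position, and keep the highest-scoring span overall. The helper names <code>best_span</code> and <code>pick_answer</code>, the <code>max_answer_len</code> parameter, and the toy scores are all hypothetical. Raw scores from different chunks are not calibrated against each other, so treat this as an illustration rather than a recommended method.</p>

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (score, start, end) of the best span with start <= end (end inclusive)."""
    best = (float("-inf"), 0, 0)
    for s, s_score in enumerate(start_scores):
        # only consider spans that end at or after the start, up to max_answer_len tokens
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best[0]:
                best = (score, s, e)
    return best


def pick_answer(chunk_scores):
    """chunk_scores: list of (start_scores, end_scores) pairs, one per chunk.

    Returns (chunk_index, start, end), or None if every chunk prefers
    the [CLS] position (index 0), which signals "no answer".
    """
    winner = None
    for idx, (starts, ends) in enumerate(chunk_scores):
        score, s, e = best_span(starts, ends)
        if s == 0 and e == 0:
            continue  # this chunk's best guess is "no answer"
        if winner is None or score > winner[0]:
            winner = (score, idx, s, e)
    return None if winner is None else winner[1:]


# toy scores for three chunks of four tokens each (position 0 is [CLS])
chunk_scores = [
    ([5.0, 0.1, 0.2, 0.0], [5.0, 0.0, 0.1, 0.3]),  # prefers [CLS]: no answer here
    ([0.0, 2.0, 0.5, 0.1], [0.0, 0.3, 3.0, 0.2]),  # best span is tokens 1..2
    ([0.0, 1.0, 0.2, 0.1], [0.0, 1.5, 0.1, 0.0]),  # weaker span at token 1
]
print(pick_answer(chunk_scores))  # prints (1, 1, 2)
```

<p>With real model output, the two score lists for a chunk would come from the start and end scores returned for that chunk, converted to plain lists.</p>
</div>
</div>
</div>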
</div>Intro to Automated Question Answering2020-04-28T00:00:00-05:002020-04-28T00:00:00-05:00https://qa.fastforwardlabs.com/methods/background/2020/04/28/Intro-to-QA<p>Welcome to the first edition of the Cloudera Fast Forward blog on Natural Language Processing for Question Answering! Throughout this series, we’ll build a Question Answering (QA) system with off-the-shelf algorithms and libraries and blog about our process and what we find along the way. We hope to wind up with a beginning-to-end documentary that provides:</p>
<ul>
<li>insight into QA as a tool,</li>
<li>useful context to make decisions for those who might build their own QA system,</li>
<li>tips and tricks we pick up as we go, and</li>
<li>sample code and commentary.</li>
</ul>
<p>We’re trying a new thing here. In the past, we’ve documented our work in <a href="https://www.cloudera.com/products/fast-forward-labs-research/fast-forward-labs-research-reports.html">discrete reports</a> at the end of our research process. We hope this new format suits the above goals and makes the topic more accessible, while ultimately being useful.</p>
<p>To kick off the series, this introductory post will discuss what QA is and isn’t, where this technology is being employed, and what techniques are used to accomplish this natural language task.</p>
<h2 id="question-answering-in-a-nutshell">Question Answering in a Nutshell</h2>
<p>Question Answering is a human-machine interaction to extract information from data using natural language queries. Machines do not inherently understand human languages any more than the average human understands machine language. A well-developed QA system bridges the gap between the two, allowing humans to extract knowledge from data in a way that is natural to us, i.e., asking questions.</p>
<p>QA systems accept questions in the form of natural language (typically text based, although you are probably also familiar with systems that accept speech input, such as Amazon’s Alexa or Apple’s Siri), and output a concise answer. Google’s search engine product adds a form of question answering in addition to its traditional search results, as illustrated here:</p>
<p><img src="/images/post1/LincolnCrop.png" alt="" /></p>
<p>Google took our question and returned a set of 1.3 million documents (not shown) relevant to the search terms, i.e., documents about Abraham Lincoln. Google also used what it knows about the <em>contents</em> of some of those documents to provide a “<a href="https://support.google.com/websearch/answer/9351707?p=featured_snippets">snippet</a>” that answered our question in one word, presented above a link to the most pertinent website and keyword-highlighted text.</p>
<p>This goes beyond the standard capabilities of a search engine, which typically returns only a list of relevant documents or websites. Google <a href="https://www.blog.google/products/search/search-language-understanding-bert/">recently explained</a> how they are using state-of-the-art NLP to enhance some of their search results. We’ll revisit this example in a later section and discuss how this technology works in practice and how we can (and will!) build our own QA system.</p>
<h2 id="why-question-answering">Why Question Answering?</h2>
<p>Sophisticated Google searches with precise answers are fun, but how useful are QA systems in general? It turns out that this technology is maturing rapidly. Gartner recently identified <a href="https://www.gartner.com/smarterwithgartner/gartner-top-10-data-analytics-trends/">natural language processing and conversational
analytics</a> as one of the top trends poised to make a substantial impact in the next three to five years. These technologies will provide increased data access, ease of use, and wider adoption of analytics platforms - especially to mainstream users. QA systems specifically will be a core part of the NLP suite, and are already seeing adoption in several areas.</p>
<p>Business Intelligence (BI) platforms are beginning to use Machine Learning (ML) to assist their users in exploring and analyzing their data through ML-augmented data preparation and insight generation. One of the key ways that ML is augmenting BI platforms is through the incorporation of natural language query functionality, which allows users to more easily query systems, and retrieve and visualize insights in a natural and user-friendly way, reducing the need for deep expertise in query languages, such as SQL.</p>
<p>Another area where QA systems will shine is in corporate and general use chatbots. Chatbots have been around for several years, but they mostly rely on hand-tailored responses. QA systems can augment this existing technology, providing a deeper understanding to improve user experience. For example, a QA system with knowledge of a company’s FAQs can streamline customer experience, while QA systems built atop internal company documentation could provide employees easier access to logs, reports, financial statements, or design docs.</p>
<p>The success of these systems will vary based on the use case, implementation, and richness of data. The field of QA is just starting to become commercially viable and it’s picking up speed. We think it’s a field worth exploring in order to understand what uses it might (and might not) have. So how does this technology work?</p>
<h2 id="designing-a-question-answerer">Designing a Question Answerer</h2>
<p>As explained above, question answering systems process natural language queries and output concise answers. This general capability can be implemented in dozens of ways. How a QA system is designed depends, in large part, on three key elements: the knowledge provided to the system, the types of questions it can answer, and the structure of the data supporting the system.</p>
<h3 id="domain">Domain</h3>
<p>QA systems operate within a <em>domain</em>, constrained by the data that is provided to them. The domain represents the embodiment of all the knowledge the system can know. There are two domain paradigms: open and closed. Closed domain systems are narrow in scope and focus on a specific topic or regime. Open domain systems are broad, answering general knowledge questions.</p>
<p><a href="https://web.stanford.edu/class/linguist289/p219-green.pdf">The BASEBALL system</a> is an early example of a closed domain QA system. Built in the 1960s, it was limited to answering questions surrounding one year’s worth of baseball facts and statistics. Not only was this domain constrained to the topic of baseball, it was also constrained in the timeframe of data at its proverbial fingertips. Contemporary examples of closed domain QA systems are those found in some BI applications. Generally, their domain is scoped to whatever data the user supplies, so they can only answer questions on the specific datasets to which they have access.</p>
<p>By contrast, open domain QA systems rely on knowledge supplied from vast resources - such as Wikipedia or the World Wide Web - to answer general knowledge questions. These systems can even answer general trivia. One example of such a system is IBM’s Watson, <a href="https://www.techrepublic.com/article/ibm-watson-the-inside-story-of-how-the-jeopardy-winning-supercomputer-was-born-and-what-it-wants-to-do-next/">which won on Jeopardy!</a> in 2011 (perhaps Watson was more of an Answer Questioner? We like jokes). Google’s QA capability as demonstrated above would also be considered open domain.</p>
<h3 id="question-type">Question Type</h3>
<p>Once you’ve decided the scope of knowledge your QA system will cover, you must also determine what types of questions it can answer. The vast majority of QA systems answer factual questions: those that start with who, what, where, when, and how many. These types of questions tend to be straightforward enough for a machine to comprehend, and their answers can be looked up directly in structured databases or ontologies, or extracted directly from unstructured text.</p>
<p>However, research is emerging that would allow QA systems to answer hypothetical questions, cause-effect questions, confirmation (yes/no) questions, and inferential questions (questions whose answers can be inferred from one or more pieces of evidence). Much of this research is still in its infancy, as the requisite natural language understanding is (for now) beyond the capabilities of most of today’s algorithms.</p>
<h3 id="implementation">Implementation</h3>
<p>There’s more than one way to cuddle a cat, as the saying goes. Question answering seeks to extract information from data and, generally speaking, data come in two broad formats: structured and unstructured. QA algorithms have been developed to harness the information from either paradigm: knowledge-based systems for structured data and information retrieval-based systems for unstructured (text) data. Some QA systems exploit a hybrid design that harvests information from both data types; IBM’s Watson is a famous example. In this section, we’ll highlight some of the most widely used techniques in each data regime - concentrating more on those for unstructured data, since this will be the focus of our applied research. Because we’ll be discussing explicit methods and techniques, the following sections are more technical. And we’ll note that, while we provide an overview here, an even more comprehensive discussion can be found in the <a href="https://web.stanford.edu/~jurafsky/slp3/25.pdf">Question Answering chapter</a> of Jurafsky and Martin’s <a href="https://web.stanford.edu/~jurafsky/slp3/">Speech and Language Processing</a> (a highly accessible textbook).</p>
<h4 id="knowledge-based-systems">Knowledge-Based Systems</h4>
<p>A large quantity of data is encapsulated in structured formats, e.g., relational databases. The goal of knowledge-based QA systems is to map questions to these structured entities through semantic parsing algorithms. Semantic parsing techniques convert text strings to symbolic logic or query languages, e.g., SQL.</p>
<p><img src="/images/post1/kb_examples.png" alt="" title="Source: This and other images in the knowledge-based systems section are from the Question Answering chapter in Jurafsky and Martin’s Speech and Language Processing third edition draft." /></p>
<p>Semantic parsing algorithms are highly tailored to their specific domain and database, and utilize templates as well as supervised learning approaches. Templates are handwritten rules, useful for frequently observed logical relationships. For example, an employee database might have a <strong>start-date</strong> template consisting of handwritten rules that search for <em>when</em> and <em>hired</em> since “when was <em>Employee Name</em> hired” would likely be a common query.</p>
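<p>To make the template idea concrete, here is a minimal sketch of what a <strong>start-date</strong> template could look like. The regular expression, the <code>employees</code> table, and its column names are all hypothetical, chosen only for illustration; a real system would maintain many such handwritten rules.</p>

```python
import re

# Hypothetical "start-date" template: a handwritten rule that fires on
# questions of the form "when was NAME hired" and captures the name.
START_DATE_TEMPLATE = re.compile(r"^when\s+was\s+(.+?)\s+hired\??$", re.IGNORECASE)


def question_to_sql(question):
    """Map a question to a (SQL string, parameters) pair, or None if no template matches."""
    match = START_DATE_TEMPLATE.match(question.strip())
    if match is None:
        return None
    # Table and column names are illustrative assumptions, not a real schema.
    sql = "SELECT start_date FROM employees WHERE name = ?"
    return sql, (match.group(1),)


print(question_to_sql("When was Ada Lovelace hired?"))
print(question_to_sql("Where is Human Resources located?"))
```

<p>The second question matches no template, so the function returns <code>None</code>. This is the main limitation of handwritten rules: each new phrasing needs its own template.</p>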
<p>Supervised methods generalize this approach and are used when there exists a dataset of question-logical form pairs, such as in the figure above. These algorithms process the question, creating a parse tree that then maps the relevant parts of speech (nouns, verbs, and modifiers) to the appropriate logical form. Many algorithms begin with simple relationship mapping: matching segments from the question parse tree to a logical relation, as in the two examples below.</p>
<p><img src="/images/post1/jm_entity_relation.png" alt="" /></p>
<p>The algorithm then bootstraps from simple relationship logic to incorporate more specific information from the parse tree, mapping it to more sophisticated logical queries like this <strong>birth-year</strong> example below.</p>
<p><img src="/images/post1/jm_logical_relation.png" alt="" /></p>
<p>These systems can be made more robust by providing lexicons that capture the semantics and variations of natural language. For instance, in our employee database example, a question might contain the word “employed” rather than “hired,” but the intention is the same.</p>
<h4 id="information-retrieval-based-systems-retrievers-and-readers">Information Retrieval-Based Systems: Retrievers and Readers</h4>
<p><img src="/images/post1/reading_retriever.jpg" alt="" title="Get it? Retriever? Reader?" /></p>
<p>Information retrieval-based question answering (IR QA) systems find and extract a text segment from a large collection of documents. The collection can be as vast as the entire web (open domain) or as specific as a company’s Confluence documents (closed domain). Contemporary IR QA systems first identify the most relevant documents in the collection, and then extract the answer from the contents of those documents. To illustrate this approach, let’s revisit our Google example from the introduction, only this time we’ll include some of the search results!</p>
<p><img src="/images/post1/LincolnWithSearch7Entries.png" alt="" title="Did Abe have big
ears?" /></p>
<p>We already talked about how the snippet box acts like a QA system. The search results below the snippet illustrate some of the reasons why an IR QA system can be more useful than a search engine alone. The relevant links vary from what is essentially advertising (study.com) to making fun of Lincoln’s ears (Reddit at its finest) to a discussion of color blindness (answers.com without the answer we want) to an article about all presidents’ eye colors (getting warmer, Chicago Tribune) to the very last link (answers.yahoo.com, which is on-topic - and narrowly scoped to Lincoln - but gives an ambiguous answer). Without the snippet box at the top, a user would have to skim each of these links to locate their answer - with varying degrees of success.</p>
<p>IR QA systems are not just search engines, which take general natural language terms and provide a list of <em>relevant documents</em>. IR QA systems perform an additional layer of processing on the <em>most</em> relevant documents to deliver a pointed answer, based on the <em>contents</em> of those documents (like the snippet box). While we won’t hazard a guess at exactly how Google extracted “gray” from these search results, we can examine how an IR QA system could exhibit similar functionality in a real world (e.g., non-Google) implementation.</p>
<p>Below we illustrate the workflow of a generic IR-based QA system. These systems generally have two main components: the document retriever and the document reader.</p>
<p><img src="/images/post1/QAworkflow.png" alt="" title="Generic IR QA
system" /></p>
<p>The document retriever functions as the search engine, ranking and retrieving relevant documents to which it has access. It supplies a set of candidate documents that <em>could</em> answer the question (often with mixed results, per the Google search shown above). The document reader consists of reading comprehension algorithms built with core NLP techniques. This component processes the candidate documents and extracts from one of them an explicit span of text that best satisfies the query. Let’s dive deeper into each of these components.</p>
<h5 id="document-retriever">Document Retriever</h5>
<p>The document retriever has two core jobs: process the question for use in an IR engine, and use this IR query to retrieve the most appropriate documents and passages. Query processing can be as simple as passing the entire question, unprocessed, to the search engine. However, if the question is long or complicated, it often pays to process the query through various techniques - such as stop word removal, removing wh-words, converting to n-grams, or extracting named entities as keywords.</p>
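<p>As a rough illustration of the query processing step, the sketch below strips wh-words and stop words from a question and builds n-grams from what remains. The word lists are deliberately tiny and are our own assumption; a production system would use a fuller stop word list and might add named entity extraction.</p>

```python
import re

# Small illustrative word lists (assumptions); real systems use fuller lists.
WH_WORDS = {"who", "what", "when", "where", "why", "which", "how"}
STOP_WORDS = {"a", "an", "the", "is", "was", "were", "are", "of", "in", "on", "did", "do", "does"}


def process_query(question, n=2):
    """Return (keywords, n-grams) after removing wh-words and stop words."""
    tokens = re.findall(r"[a-z0-9']+", question.lower())
    keywords = [t for t in tokens if t not in WH_WORDS and t not in STOP_WORDS]
    ngrams = [" ".join(keywords[i:i + n]) for i in range(len(keywords) - n + 1)]
    return keywords, ngrams


keywords, bigrams = process_query("What color were Abraham Lincoln's eyes?")
print(keywords)
print(bigrams)
```

<p>Either the keyword list or the n-grams could then be handed to the search engine in place of the raw question.</p>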
<p>Some systems also extract contextual information from the query, e.g., the <em>focus</em> of the question and the expected <em>answer type</em>, which can then be used in the Document Reader during the answer extraction phase. The focus of a question is the string within the query that the user is looking to fill. The answer type is categorical, e.g., person, location, time, etc. In our earlier example, “when was <em>Employee Name</em> hired?”, the focus would be “when” and the answer type might be a numeric <em>date-time</em>.</p>
<p>The IR query is then passed to an IR algorithm. These algorithms search over all documents, often using standard tf-idf cosine matching to rank documents by relevance. The simplest implementations pass the top <em>n</em> most relevant documents to the document reader for answer extraction, but this, too, can be made more sophisticated by breaking documents into their respective passages or paragraphs and filtering them (based on named entity matching or answer type, for example) to narrow down the number of passages sent to the document reader.</p>
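<p>The tf-idf cosine ranking described above can be sketched in a few lines of plain Python. This is a toy version under our own assumptions (a simple regex tokenizer and a smoothed idf formula); production search engines rely on inverted indexes and more refined scoring.</p>

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse tf-idf vector (a dict) per document, plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    # Smoothed idf, so terms that appear in every document keep a small positive weight.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]
    return vectors, idf


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(question, docs, top_n=2):
    """Return indices of the top_n documents ranked by cosine similarity to the question."""
    vectors, idf = tfidf_vectors(docs)
    # Question terms never seen in the collection get zero weight.
    q_vec = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(question)).items()}
    ranked = sorted(range(len(docs)), key=lambda i: cosine(q_vec, vectors[i]), reverse=True)
    return ranked[:top_n]


docs = [
    "Abraham Lincoln had gray eyes and was the sixteenth president.",
    "The Pentagon is the headquarters of the Department of Defense.",
    "A pentagon is a polygon with five sides.",
]
print(retrieve("What color were Lincoln's eyes?", docs, top_n=1))
print(retrieve("How many sides does a pentagon have?", docs, top_n=1))
```

<p>Note that the second query only lands on the geometry document because it shares more terms with it; a shorter query such as "pentagon" alone would leave the retriever little to distinguish the two documents.</p>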
<h5 id="document-reader">Document Reader</h5>
<p>Once we have a selection of relevant documents or passages, it’s time to extract the answer. The sole purpose of the document reader is to apply reading comprehension algorithms to text segments for answer extraction. Modern reading comprehension algorithms come in two broad flavors: feature-based and neural-based.</p>
<p>Feature-based answer extraction can include rule-based templates, regex pattern matching, or a suite of NLP models (such as parts-of-speech tagging and named entity recognition) designed to identify features that will allow a supervised learning algorithm to determine whether a span of text contains the answer. One useful feature is the answer type identified by the document retriever during query processing. Other features could include the number of matched keywords in the question, the distance between the candidate answer and the query keywords, and the location of punctuation around the candidate answer. This type of QA works best when the answers are short and when the domain is narrow.</p>
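<p>Two of the features mentioned above, the number of matched keywords and the distance between a candidate answer and the query keywords, might be computed along these lines. The function and feature names are hypothetical, and distance is measured in tokens from the candidate's first token; the resulting dictionary would be one input row for a supervised classifier.</p>

```python
import re


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def candidate_features(question_keywords, passage, candidate):
    """Compute two hand-built features for one candidate answer in a passage."""
    tokens = tokenize(passage)
    cand_tokens = tokenize(candidate)
    # Locate the first occurrence of the candidate's tokens in the passage.
    start = next(
        (i for i in range(len(tokens) - len(cand_tokens) + 1)
         if tokens[i:i + len(cand_tokens)] == cand_tokens),
        None,
    )
    if start is None:
        return None
    keyword_positions = [i for i, t in enumerate(tokens) if t in question_keywords]
    # Number of distinct question keywords that appear in the passage.
    matched = len({tokens[i] for i in keyword_positions})
    # Token distance from the candidate's first token to the nearest keyword.
    nearest = min((abs(i - start) for i in keyword_positions), default=None)
    return {"matched_keywords": matched, "distance_to_nearest_keyword": nearest}


passage = "Abraham Lincoln had gray eyes and dark hair."
keywords = {"lincoln", "eyes", "color"}
print(candidate_features(keywords, passage, "gray"))
print(candidate_features(keywords, passage, "dark"))
```

<p>Here "gray" sits one token away from the keyword "eyes" while "dark" sits two tokens away, which is the kind of signal a classifier could learn to prefer.</p>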
<p>Neural-based reading comprehension approaches capitalize on the idea that the question and the answer are semantically similar. Rather than relying on keywords, these methods use extensive datasets that allow the model to learn semantic embeddings for the question and the passage. Similarity functions on these embeddings provide answer extraction.</p>
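<p>One common form of this idea, used by BERT-style readers, scores every passage token as a possible start and a possible end of the answer by taking dot products with learned vectors, and then picks the best-scoring span. The sketch below shows only that final scoring step; the two-dimensional "embeddings" and the start and end vectors are made-up numbers standing in for what a trained model would produce.</p>

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def best_span(token_vectors, start_vector, end_vector, max_len=3):
    """Pick the (start, end) token pair with the highest start * end probability."""
    n = len(token_vectors)
    p_start = softmax([dot(t, start_vector) for t in token_vectors])
    p_end = softmax([dot(t, end_vector) for t in token_vectors])
    # Only consider spans that end at or after their start and span at most max_len tokens.
    spans = [
        (p_start[i] * p_end[j], (i, j))
        for i in range(n)
        for j in range(i, min(i + max_len, n))
    ]
    return max(spans)[1]


tokens = ["lincoln", "had", "gray", "eyes"]
# Made-up 2-d vectors (assumption); a trained model would compute these
# from the question and the passage together.
token_vectors = [[0.1, 0.0], [0.0, 0.1], [2.0, 1.5], [0.2, 0.3]]
start_vector = [1.0, 0.0]
end_vector = [0.0, 1.0]

start, end = best_span(token_vectors, start_vector, end_vector)
print(tokens[start:end + 1])
```

<p>Everything interesting in a real reader happens upstream, in how the model learns representations that make the right tokens score highly; the span selection itself is this simple.</p>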
<p>Neural network models that perform well in this arena are Seq2Seq models and Transformers. (For a detailed dive into these architectures, interested readers should check out these excellent posts for <a href="http://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/">Seq2Seq</a> and <a href="http://jalammar.github.io/illustrated-transformer/">Transformers</a>.) The Transformer architecture in particular is currently revolutionizing the entire field of NLP. Models built on this architecture include BERT (and its myriad off-shoots: RoBERTa, ALBERT, distilBERT, etc.), XLNet, GPT, T5, and more. These models - coupled with advances in compute power and transfer learning from massive unsupervised training sets - have started to outperform humans on some key NLP benchmarks, including question answering.</p>
<p>In this paradigm, one does not need to identify the answer type, the parts of speech, or the proper nouns. One need only feed the question and the passage into the model and wait for the answer. While this is an exciting development, it does have its drawbacks. When the model doesn’t work, it’s not always straightforward to identify the problem - and scaling these models is still a challenging prospect. These models generally perform better (according to your quantitative metric of choice) as their parameter count grows, but the cost of inference grows with it - as does the difficulty of deploying them in settings like federated learning or on mobile devices.</p>
<h2 id="building-a-question-answerer">Building a Question-Answerer</h2>
<p>At the beginning of this article, we said we were going to build a QA system. Now that we’ve covered some background, we can describe our approach. Over the course of the next two months, two of Cloudera Fast Forward’s Research Engineers, Melanie Beck and Ryan Micallef, will build a QA system following the information retrieval-based method, by creating a document retriever and document reader. We’ll focus our efforts on exploring and experimenting with various Transformer architectures (like BERT) for the document reader, as well as off-the-shelf search engine algorithms for the retriever. Neither of us has built a system like this before, so it’ll be a learning experience for everyone. And that’s precisely why we wanted to invite you along for the journey! We’ll share what we learn each step of the way by posting and discussing example code, in addition to articles covering topics like:</p>
<ul>
<li>existing QA training sets for Transformers and what you’ll need to develop your own</li>
<li>how to evaluate the quality of a QA system - both the reader and retriever</li>
<li>building a search engine over a large set of documents</li>
<li>and more!</li>
</ul>
<p>Because we’ll be writing about our work as we go, we might end up in some dead ends or run into some nasty bugs; such is the nature of research! When these things happen, we’ll share our thoughts on what worked, what didn’t, and why - but it’s important to note upfront that while we do have a solid goal in mind, the end product may turn out to be quite different than what we currently envision. Stay tuned; in our next post we’ll start digging into the nuts and bolts!</p>