COVID-19 Virus

This notebook contains a library called the CORD Research Engine, which uses search and NLP technology to simplify searching the CORD Research Paper dataset for information that would help solve the current pandemic"

Getting Started

Data files

The CORD-19 dataset on Kaggle contains over 59,000 research papers. This notebook references about 2,000 of those papers, located in the directory data. Inside the data directory, the CORD Research data is located in the CORD-19-research-challenge subdirectory. For the full dataset please see the competition page on Kaggle.

Navigating the data

The original notebook was a Python kernel on Kaggle. On kaggle the data directory is /kaggle/input and when you launch a new notebook you are given a block of code to view the data in the directory.

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

The Data Directory

Data directory

Instead of the code above we will use pathlib for navigating the filesystem because it is a little bit easier to use than os.path. From pathlib we will use the Path and PurePath classes. The Path class provides almost all the functionality we need except that it cannot be passed directly into _pandas.readcsv as can be done with the PurePath.

Now we can create a Path object at the location of the CORD research data and view the contents of that directory.

from pathlib import Path, PurePath
data_path = Path('data/CORD-19-research-challenge')
list(data_path.glob('*'))

[WindowsPath('data/CORD-19-research-challenge/biorxiv_medrxiv'),
 WindowsPath('data/CORD-19-research-challenge/comm_use_subset'),
 WindowsPath('data/CORD-19-research-challenge/cord_19_embeddings.csv'),
 WindowsPath('data/CORD-19-research-challenge/custom_license'),
 WindowsPath('data/CORD-19-research-challenge/metadata.csv'),
 WindowsPath('data/CORD-19-research-challenge/noncomm_use_subset')]

The CORD-19 data directory contains four sub-directories, the JSON contents of the research papers, and a metadata.csv file with the metadata for all the research papers. For the actual details of the CORD Research papers dataset, see the description on the Kaggle site.

Research Engine Design

A tale of two indexes

We will build two indexes into the data: one for search, and one for similarity. An index is just a data structure that provides a way to sort some data. In our case, for search, we want to provide a search term to the index and have it return a list of the research papers, sorted by most relevant to the search term.

With the similarity index, the goal is to provide a research paper as input and have the index return a list of papers sorted by the most similar to that paper.

In the real world, and for applications that you may write, you may encounter those two use cases, so we will show the best techniques to fulfill the requirements. Search and Similarity are related but not identical areas, and both the indexes we will build are better at one than the other.

Object-Oriented Design

An object-oriented design is rarely used with notebooks, and for good reason. A notebook is a hybrid of a Python script and a layout template, which both operate in a top to bottom way. Objects are usually meant to be defined in a certain location or context and interacted with multiple times during the course of a program. This is exactly what we want to do here; we intend to create a Research Engine and use it to find research papers during the course of a single session.

There are two main objects in our design: ResearchPapers and Paper. The ResearchPapers class will load and maintain the list of research papers, while Paper will allow us to work with a single research paper.

class ResearchPapers:

    def __init__(self, metadata, data_dir='data'):
        pass
    
class Paper:
    
    def __init__(self):
        pass

Loading Research Papers

We start with the metadata.csv since this contains the master list of all the research papers and important information about each paper. Some Kaggle kernels started by loading the research papers from the JSON files, but the metadata is considerably smaller, and starting with the metadata might allow us to only load a research paper from a JSON file if required, thereby saving significantly on computational resources.

Load Metadata

The load_metadata function is straightforward - it uses pandas.read_csv() to load the metadata.csv file. Along the way it changes the data types for the Microsoft Academic Paper ID and _pubmedid columns. It also renames a couple of columns, which is optional, and only to make browsing the data a little easier.

@staticmethod
    def load_metadata(data_path=None):
        if not data_path:
            data_path = find_data_dir()

            print('Loading metadata from', data_path)
            metadata_path = PurePath(data_path) / 'metadata.csv'
            dtypes = {'Microsoft Academic Paper ID': 'str', 'pubmed_id': str}
            renames = {'source_x': 'source', 'has_full_text': 'has_text'}
            # Load the metadata using read.csv
            metadata = pd.read_csv(metadata_path, dtype=dtypes, 
                                   low_memory=False,
                                   parse_dates=['publish_time']).rename(columns=renames)
            metadata = clean_metadata(metadata)
            return metadata

Within load_metadata we call clean_metadata. This function is a data cleaning pipeline that fixes some of the issues with the Kaggle data by using a pipeline of data cleaning functions. Notice that each function is specfic to a task, and named specifically for that task, which makes it simple to remove or add cleaning functions. This is important since a new version of the CORD research data is released every week, with new data fixes, so if you can quickly modify your data pipeline, it gives you an advantage.

Clean Metadata

def clean_metadata(metadata):
    print('Cleaning metadata')
    return metadata.pipe(start) \
        .pipe(clean_title) \
        .pipe(clean_abstract) \
        .pipe(rename_publish_time) \
        .pipe(add_date_diff) \
        .pipe(drop_missing) \
        .pipe(fill_nulls) \
        .pipe(apply_tags)

Each function in the clean_metadata pipeline accepts a dataframe and returns the dataframe after modification. The functions are connected by the pandas dataframe pipe function, which is designed for this use case of connecting functions sequentially. At the start of the pipeline is a special function called start which is defined as:

def start(data):
    return data.copy()

This returns a copy of the data at the start of a data pipeline. This is very important; it makes sure that the initial data is unchanged and enables you to rerun your notebook from any point without worrying that your data has changed in a way that breaks it. This is a pattern I learned from the presentation Untitled12.ipynb by Vincent D Warmerdam at PyData Eindhoven 2019

Each subsequent function in the pipeline accepts a dataframe and returns a dataframe. Here is clean_title:

def clean_title(data):
    # Set junk titles to ''
    title_relevant = data.title.fillna('').str.match(_relevant_re_, case=False)
    title_short = data.title.fillna('').apply(len) < 30
    title_junk = title_short & ~title_relevant
    data.loc[title_junk, 'title'] = ''
    return data

Load function

After loading the metadata, we have the load function. This is the static function on the ResearchPapers class that actually creates the ResearchPapers instance using ResearchPapers.load().

@classmethod
    def load(cls, index=None):
        data_path = find_data_dir()
        metadata = cls.load_metadata(data_path)
        return cls(metadata, data_path, index=index)

Preprocessing

While the clean_metadata function does some text preprocessing, it does that in relation to the metadata.csv file that is released by Kaggle to remove any data issues it might have. We still need to do text preprocessing to prepare the text of the research papers for storing in the search and similarity indexes. Text preprocessing is a necessary part of NLP projects, and, despite the fact that there are a lot of utilities available in NLP libraries to preprocess test, it is still part of the art of being an NLP practictioner. For example, the regex below was hand-crafted with input from Google searches, and a lot of trial and error. I found, for example, that the text prepeocessing that came with gensim was hand-rolled. You may or may not need to follow a similar process for your own text preprocessing, depending on your judgement.

TOKEN_PATTERN = re.compile('^(20|19)\d{2}|(?=[A-Z])[\w\-\d]+$', re.IGNORECASE)

def replace_punctuation(text):
    t = re.sub('\(|\)|:|,|;|\.|’|”|“|\?|%|>|<|≥|≤|~|`', '', text)
    t = re.sub('/', ' ', t)
    t = t.replace("'", '')
    return t

def clean(text):
    t = text.lower()
    t = replace_punctuation(t)
    return t

def tokenize(text):
    words = nltk.word_tokenize(text)
    return [word for word in words
            if len(word) > 1
            and not word in SIMPLE_STOPWORDS
            and TOKEN_PATTERN.match(word)
            ]

def preprocess(text):
    t = clean(text)
    tokens = tokenize(t)
    return tokens

The preprocess function

Regardless, every NLP project requires a preprocess or equivalent function. Our preprocess function converts the text to lowercase, removes punctuation, and converts the text to tokens. What is very important is that the identical preprocess function be used when preparing the text in batch mode as when using it in query mode. So the preprocess function will be used again on each search query to match accurately against what is stored in the index.

Utility Functions

Parallel Processing

As of the May 2nd data release the CORD Research Paper dataset was over 8GB in size, with more than 59,000 metadata records and well over 60,000 research papers on disk. In order to process the data in a reasonable amount of time, we added a utility function that will run a given function on a list of data using as many cores that are available.

def parallel(func, arr: Collection, max_workers: int = None, leave=False):
    "Call `func` on every element of `arr` in parallel using `max_workers`."
    max_workers = ifnone(max_workers, multiprocessing.cpu_count())
    progress_bar = tqdm(arr)
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        futures_to_index = {ex.submit(func, o): i for i, o in enumerate(arr)}
        results = []
        for f in as_completed(futures_to_index):
            results.append((futures_to_index[f], f.result()))
            progress_bar.update()
        for n in range(progress_bar.n, progress_bar.total):
            time.sleep(0.1)
            progress_bar.update()
        results.sort(key=lambda x: x[0])
    return [result for i, result in results]

For example, if we have a list of JSON files and we want to convert that to a ist of tokenized text we can do:

def get_tokens(cord_path):
    cord_uid, path = cord_path
    if isinstance(path, Path):
        tokens = preprocess(load_text(path))
        return cord_uid, tokens
    return cord_uid, np.nan

cord_tokens = parallel(get_tokens, cord_paths)

Reading and tokenizing each of the over 60,000 JSON files is a very expensive operation, so we want to distribute this load across our CPU cores.

Describe the Metadata

The pandas function describe can be used to gather information about series and dataframes.

This is what describe shows when you use it on the metadata without doing any special type of conversions.

import pandas as pd
metadata = pd.read_csv(PurePath(data_path) / 'metadata.csv')
metadata.describe()

It is a bit limiting, so we will write a function that provides more information than what is available using describe.

from cord.core import describe_dataframe

describe_dataframe(metadata);

ResearchPapers init

Now we are at ResearchPapers.__init__(). The ResearchPapers instance is constructed with a metadata dataframe, which is the dataframe we parsed in load_metadata. A ResearchPapers instance can also be constructed with a subset of the metadata, meaning that we can create an instance containing only COVID-19 related papers, or papers published by Elsevier or any other subset of the metadata that we are interested in. We will show how to use this functionality later.

We can construct the BM25 index from the papers referenced by the metadata. Each metadata row has an abstract column, which contains the abstract of the research paper, which we can preprocess into the index tokens needed by the index. We can, alternatively, create the index tokens from the full text content of the paper, which we load from disk. The full text content gives the potential for a more accurate search, with the tradeoff that it takes a much longer time to build the index. To enable this tradeoff, the load function accepts a parameter index, which deteremines which index strategy to use. ResearchPapers.load(index="text") loaded and indexed from the JSON file contents, while ResearchPapers.load(index"abstract") used the metadata abstract. On a Kaggle instance it took approximately 100 seconds to index from the abstracts versus over 2,000 seconds to index from the JSON texts, though it was about three times faster for each operation on my local laptop.

To save on the time to load from the JSON texts the CORD library, JSON index tokens were processed offline and saved to parquet files. These parquet files were then loaded whenever the option index="text" was selected.

def __init__(self, metadata, data_dir='data', index='abstract', view='html'):
        self.data_path = Path(data_dir)
        self.num_results = 10
        self.view = view
        self.metadata = metadata
        if 'index_tokens' not in metadata:
            print('\nIndexing research papers')
            if any([index == t for t in ['text', 'texts', 'content', 'contents']]):
                tick = time.time()
                _set_index_from_text(self.metadata, data_dir)
                print("Finished indexing in", int(time.time() - tick), 'seconds')
            else:
                print('Creating the BM25 index from the abstracts of the papers')
                print('Use index="text" if you want to index the texts of the paper instead')
                tick = time.time()
                self.metadata['index_tokens'] = metadata.abstract.apply(preprocess)
                tock = time.time()
                print('Finished Indexing in', round(tock - tick, 0), 'seconds')

        # Create BM25 search index
        self.bm25 = _get_bm25Okapi(self.metadata.index_tokens)

Python Objects wrapping dataframes

A pattern that is repeated throughout the project is to have a dataframe as a member of a python object. The python object controls access to the dataframe and provides useful functionality, while the dataframe acts as a local database. There are many use cases that fit this pattern where you load read-only data into a dataframe and wrap python code around it. In our case the ResearchPapers class acts like a miniature application that accesses a local in-memory database. We also do this with the SearchResults object, which wraps a dataframe of the search results.

The most important line above self.metadata = metadata sets the dataframe as an instance member. The primary use is to load all the research papers from the metadata.csv, in which case we will have all 50,000+ research papers. Note that we can pass a metadata dataframe of any number of records, meaning that we can make subsets of ResearchPapers. This enables the ability to select only a subset of research papers, e.g. source=="Elsevier", then make a copy of the ResearchPapers with the smaller number of records.

def query(self, query):
        data = self.metadata.query(query)
        return self._make_copy(data)

    def _make_copy(self, new_data):
        return ResearchPapers(metadata=new_data.copy(),
                              data_dir=self.data_path,
                              view=self.view)

We will see how this pattern is used in more detail later in the notebook.

Indexing Research Papers

Now that we have completed our basic code, we can see the load in action. We will start with loading from the abstracts.

from cord import ResearchPapers

Indexing from Abstracts

papers = ResearchPapers.load()

Loading metadata from data\CORD-19-research-challenge
Cleaning metadata
Applying tags to metadata

Indexing research papers
Creating the BM25 index from the abstracts of the papers
Use index="text" if you want to index the texts of the paper instead
Finished Indexing in 7.0 seconds

Indexing from Texts

papers = ResearchPapers.load(index='text')

Loading metadata from data\CORD-19-research-challenge
Cleaning metadata
Applying tags to metadata

Indexing research papers
Creating the BM25 index from the text contents of the papers
There are 2006 papers that will be indexed using the abstract instead of the contents
Finished indexing in 148 seconds

Viewing the Research Papers

papers

The Search Index

The search index is built on a BM25 Okapi libray called _rankbm25.

An implementation of BM25 is also available in gensim, but I found _rankbm25 was simple to implement.

What is BM25 Okapi?

BM25 stands for Best Match the 25th version and is a text search algorithm first developed in 1994. Okapi stands for the Okapi information retrieval system, implemented at London's City University in the 1980s and 1990s on which BM25 was used.

It is one of the best search algorithms available, and Lucene and its derivatives, Solr and ElasticSearch switched to a BM25 variant around 2015.

Creating the BM25 Index

The heart of this notebook, the one forked over 400 times on Kaggle, is this line of code.

BM25Okapi(index_tokens.tolist())

This creates a new BM25Okapi index on the index_tokens. The index accepts a list of tokens, with each item in the list representing a single tokenized research paper. This is as simple as you can get for a bit of technology that is also used in ElasticSearch software.

The actual implementation in the library is just slightly more complicated, accounting for the edge case of creating a ResearchPapers instance with no papers inside. (This can happen, as we see below, if our API is so flexible that it allows a user to do this).

from rank_bm25 import BM25Okapi

def _get_bm25Okapi(index_tokens):
    has_tokens = index_tokens.apply(len).sum() > 0
    if not has_tokens:
        index_tokens.loc[0] = ['no', 'tokens']
    return BM25Okapi(index_tokens.tolist())

Searching Papers

The search function is defined below. We preprocess the search string and then get the doc_scores (the search relevance score) from the BM25 index for all documents in the index. Then we get the dataframe locations of the most relevant papers, do some additional filtering, then return the top n results.

def search(self, search_string,
               num_results=10,
               covid_related=False,
               start_date=None,
               end_date=None,
               view='html'):
        n_results = num_results or self.num_results

        # Preprocess the search string
        search_terms = preprocess(search_string)

        # Get the doc scores from the BM25 index
        doc_scores = self.bm25.get_scores(search_terms)

        # Get the index from the doc scores sorted by most relevant
        ind = np.argsort(doc_scores)[::-1]

        # Sort the metadata using the sorted index above
        results = self.metadata.iloc[ind].copy()

        # Round the doc scores, in case we use the score in the display 
        results['Score'] = doc_scores[ind].round(1)

        # Filter covid related
        if covid_related:
            results = results[results.covid_related]

        # Filter by dates - start date
        if start_date:
            results = results[results.published >= start_date]
        # end data
        if end_date:
            results = results[results.published < end_date]

        # Show only up to n_results
        results = results.head(num_results)

        # Create the final results
        results = results.drop_duplicates(subset=['title'])

        # Return Search Results
        return SearchResults(results, self.data_path, view=view)

Here is the code run outside of the function, with print statements

from cord.text import preprocess
import numpy as np

search_query = 'Mother to child transmission'
search_tokens = preprocess(search_query)
print('search_tokens', search_tokens)

doc_scores = papers.bm25.get_scores(search_tokens)
print('doc_scores', doc_scores)

ind = np.argsort(doc_scores)[::-1]
print('ind', ind)

search_tokens ['mother', 'child', 'transmission']
doc_scores [0. 0. 0. ... 0. 0. 0.]
ind [1451 1543   82 ... 1320 1322    0]

The Similarity Index

Included with the CORD-19 Research Dataset is a CSV file containing the document embeddings vectors for the papers in the dataset. Each vector is a 768-dimension vector representing what a neural network has learned about that research paper, and, effectively, it is a signature or a fingerprint of that research paper. This vector can be used to build our similarity index and allow us to find similar papers for any given research paper.

Search2d

The embeddings CSV file is large—over 700MB in the real dataset. For this notebook we have created a smaller CSV file with only the embeddings for the research papers included with the notebook. If you load and look at the embeddings, you will see 769 columns, one for each of the 768-dimension vector, plus the cord_uid. (The cord_uid is the unique identifier for a research paper in the dataset.)

embeddings_path = data_path / 'cord_19_embeddings.csv'
embeddings = pd.read_csv(embeddings_path)
display(embeddings.shape)

## Look at the first 2 rows and 5 columns
embeddings.iloc[:2, :5]

(1824, 769)

Let us load the data again, but this time add the column names

VECTOR_COLS = [str(i) for i in range(768)]
COLUMNS = ['cord_uid'] + VECTOR_COLS
embeddings = pd.read_csv(embeddings_path, names=COLUMNS).set_index('cord_uid')
print('Loaded embeddings of shape', embeddings.shape)

## Look at the first 2 rows and 5 columns
embeddings.iloc[:2, :5]

Loaded embeddings of shape (1825, 768)

Building the Annoy Index

Annoy is a library for building an index of vectors so that the nearest neighbors to that vector can be easily found. This library is used at Spotify to find music recommendations, so it is one of the best technologies we can possibly use to find similar papers.

To build an annoy index, you first create it with the vector size that you want to store in the index, then add each vector to the index, then build.

from annoy import AnnoyIndex
import random

VECTOR_SIZE = 40
annoy_index = AnnoyIndex(VECTOR_SIZE, 'angular')  # Length of item vector that will be indexed
for i, vector in enumerate(vectors):
    annoy_index.add_item(i, vector)

annoy_index.build(10) # 10 trees
annoy_index.save('test.ann')

Reduce Vector Dimensions

For the index we wanted to create, we thought 768 dimensions was too much, especially since it meant that the resulting index would be very large on disk, and also too large to fit into a Git repository. Therefore we used PCA to reduce the dimensions to 192, and those 192 dimensional vectors are stored in the Annoy Index.

from sklearn.decomposition import PCA

def downsample(docvectors, dimensions=2):
    print(f'Downsampling to {dimensions}D embeddings')
    pca = PCA(n_components=dimensions, svd_solver='full')
    docvectors_downsampled = pca.fit_transform(docvectors)
    return np.squeeze(docvectors_downsampled), pca

Getting Similar Vectors

Once we have the annoy index, we use it to find similar papers using the similar papers function. Annoy has two functions for finding similar papers: get_nns_by_item and get_nns_by_vector. In our function we use get_nns_by_item.

def similar_papers(paper_id, num_items=config.num_similar_items):
    from .vectors import SPECTOR_SIMILARITY_INDEX
    index = paper_id if isinstance(paper_id, int) else get_index(paper_id)
    if not index:
        return []
    similar_indexes = SPECTOR_SIMILARITY_INDEX.get_nns_by_item(index, num_items, search_k=config.search_k)
    similar_cord_uids = document_vectors.iloc[similar_indexes].index.values.tolist()
    return [id for id in similar_cord_uids if not id == paper_id]

Displaying objects in a notebook

In the search function we returned a SearchResults object that contained a dataframe with the search results. This is a common pattern used in the notebook—often a dataframe is wrapped inside of an object.

The reason for this pattern is to control how the dataframe is displayed in the notebook. For the search result, we want to present a different formatting from what is available for dataframes, and we may also want to control which columns are displayed, or other aspects of the search results.

Outputting HTML

Objects can control how they are displayed as HTML in a Jupyter notebook by implementing the _repr_html_() function. Some of the objects that we create in this project are meant to wrap a dataframe and control how they are displayed. SearchResults is one such object—it wraps the results dataframe and controls how it looks using _repr_html_(). SearchResults can actually be viewed as a dataframe

papers.search('Mother to child', covid_related=True, num_results=4, view='df')

or an HTML table:

papers.search('Mother to child', covid_related=True, num_results=2)

Integrating your object with IPython

Using the Dataframe style function

Another great way to customize the look of a dataframe inside a notebook is to use the dataframe.style function. This pandas functionality is pretty powerful and can allow you to completely change how a dataframe is displayed in a notebook.

In the CORD library we wanted to style the results differently from regular dataframes, to improve on the look and feel, and to have the library stand out from the competition. The code below shows how the display function used the style function to change the styling of the results.

def display(self, *paper_ids):

        # Look up the papers using the paperids and create a dataframe
        _recs = []
        for id in paper_ids:
            paper = self[id]
            _recs.append({'published': paper.metadata.published,
                          'title': paper.title,
                          'summary': paper.summary,
                          'when': paper.metadata.when,
                          'cord_uid': paper.cord_uid})
        df = pd.DataFrame(_recs).sort_values(['published'], ascending=False).drop(columns=['published'])

        # Apply a style to a column
        def highlight_cols(s):
            return 'font-size: 1.1em; color: #008B8B; font-weight: bold'

        # Apply the style above to the title column of the dataframe and hide the index
        return df.style.applymap(highlight_cols, subset=pd.IndexSlice[:, ['title']]).hide_index()

This is how the results are dispayed in the notebook:

papers.display('v3lbrzh8')

For more information on styling dataframes see Pandas Styling.

Creating Search Widgets

To create interactivity within the notebook, and to allow the user to perform interactive searches, we use ipywidgets.

Search Date Slider

This widget displays a slider that allows a user to select relevant dates for the Research paper search:

def SearchDatesSlider():
    options = [(' 1951 ', '1951-01-01'), (' SARS 2003 ', '2002-11-01'),
               (' H1N1 2009 ', '2009-04-01'), (' COVID 19 ', '2019-11-30'),
               (' 2020 ', '2020-12-31')]
    return widgets.SelectionRangeSlider(
        options=options,
        description='Dates',
        disabled=False,
        value=('2002-11-01', '2020-12-31'),
        layout={'width': '480px'}
    )

SearchBar

For the search bar we add a Text, Button, Checkbox and use HBox and VBox for layout.

def searchbar(self, initial_search_terms='', num_results=10, view=None):
        text_input = widgets.Text(layout=widgets.Layout(width='400px'), value=initial_search_terms)

        search_button = widgets.Button(description='Search', button_style='primary',
                                       layout=widgets.Layout(width='100px'))
        search_box = widgets.HBox(children=[text_input, search_button])

        # A COVID-related checkbox
        covid_related_CheckBox = widgets.Checkbox(description='Covid-19 related', value=False, disable=False)
        checkboxes = widgets.HBox(children=[covid_related_CheckBox])

        # A date slider to limit research papers to a date range
        search_dates_slider = SearchDatesSlider()

        search_widget = widgets.VBox([search_box, search_dates_slider, checkboxes])

        output = widgets.Output()

        def do_search():
            search_terms = text_input.value.strip()
            if search_terms and len(search_terms) >= 4:
                start_date, end_date = search_dates_slider.value
                self._search_papers(output=output, SearchTerms=search_terms, num_results=num_results, view=view,
                                    start_date=start_date, end_date=end_date,
                                    covid_related=covid_related_CheckBox.value)

        def button_search_handler(btn):
            with output:
                clear_output()
            do_search()

        def text_search_handler(change):
            if len(change['new'].split(' ')) != len(change['old'].split(' ')):
                do_search()

        def date_handler(change):
            do_search()

        def checkbox_handler(change):
            do_search()

        search_button.on_click(button_search_handler)
        text_input.observe(text_search_handler, names='value')
        search_dates_slider.observe(date_handler, names='value')
        covid_related_CheckBox.observe(checkbox_handler, names='value')

        display(search_widget)
        display(output)

        # Show the initial terms
        if initial_search_terms:
            do_search()

Now we can see the search bar in action. It updates as you change the selections in the form

papers.searchbar('Cruise ship', num_results=4)

Risk factors

Subsetting Research Papers

There are many ways to select subsets of research papers including:

Papers since SARS research_papers.since_sars()
Papers since SARS-COV-2 research_papers.since_sarscov2()
Papers before SARS research_papers.before_sars()
Papers before SARS-COV-2 research_papers.before_sarscov2()
Papers before a date research_papers.before('1989-09-12')
Papers after a date research_papers.after('1989-09-12')
Papers that contains a string
Papers that match a string (using regex)

Papers published since SARS-COV-2

papers.since_sarscov2()

Papers with "bats" in the title

papers.contains('bats', column='title')

Papers with Brian Mcloskey as an author

papers.match('.*McCloskey, B', column='authors')

Getting a single research paper

The ResearchPapers class implements __get_item__() so a single ResearchPaper can be accessed using the numeric index of the paper as in:

papers[0]

It's preferable to access a paper using the cord_uid since it is a stable identifier. papers["5o38ihe0"] referes to the same research paper

papers['5o38ihe0']

Implementing getitem

The implementation of __getitem__() is simple:

def __getitem__(self, item):
        if isinstance(item, int):
            paper = self.metadata.iloc[item]
        else:
            paper = self.metadata[self.metadata.cord_uid == item]

The main view of a research paper is the overview, as shown above. This shows a formatted collection of the paper's important fields. There are other views or attributes of research papers, including:

Overview: A nicely formatted view of the paper's important fields
Abstract: The paper's abstract
Summary: A summary of the paper's abstract using the TextRank algorithm
Text: The text in the paper
HTML: The contents of the paper as somewhat nicely formatted HTML
Text Summary: The text of the paper, summarized using the TextRank algorithm

This is implemented in the following function:

paper = research_papers['asf5c7xu']

def view_paper(ViewPaperAs):
    if ViewPaperAs == 'Overview':
        display(paper)
    elif ViewPaperAs == 'Abstract':
        display(paper.abstract)
    elif ViewPaperAs == 'Summary of Abstract':
        display(paper.summary)
    elif ViewPaperAs == 'HTML':
        display(paper.html)
    elif ViewPaperAs == 'Text':
        display(paper.text)
    elif ViewPaperAs == 'Summary of Text':
        display(paper.text_summary)

interact(view_paper,
         ViewPaperAs=['Overview', # Show an overview of the paper's important fields and statistics
                      'Abstract', # Show the paper's abstract
                      'Summary of Abstract', # Show a summary of the paper's abstract
                      'HTML', # Show the paper's contents as (slightly) formatted HTML
                      'Text', # Show the paper's contents
                      'Summary of Text' # Show a summary of the paper's content
                     ]
        );

Conclusion

In this notebook we showed you how to use search and NLP techniques to build a simple search engine and UI over a set of documents. Hopefully, it can help you create your own solution if you are interested in contributing to the same CORD research paper dataset. However, the code and techniques learned here can be applied to other use cases, so feel free to adapt it to your own purposes.

Search2d

	pubmed_id	Microsoft Academic Paper ID
count	1.404000e+03	3.600000e+01
mean	2.214187e+07	2.706621e+09
std	7.717728e+06	4.306890e+08
min	2.700200e+04	1.559092e+09
25%	1.726375e+07	2.315523e+09
50%	2.375861e+07	3.003704e+09
75%	2.839937e+07	3.004938e+09
max	3.230300e+07	3.006646e+09

	title	abstract	journal	authors	published	when
0	A model of tripeptidyl-peptidase I (CLN2), a u...	: Tripeptidyl-peptidase I, also known as CLN2,...	BMC Struct Biol	Wlodawer, Alexander; Durell, Stewart R; Li, Mi...	2003-11-11	16 years ago
1	SARS and hospital priority setting: a qualitat...	: Priority setting is one of the most difficul...	BMC Health Serv Res	Bell, Jennifer AH; Hyland, Sylvia; DePellegrin...	2004-12-19	15 years ago
2	Trade and Health: Is the Health Community Read...	There are greater tensions than ever before be...	PLoS Med	Lee, Kelley; Koivusalo, Meri	2005-01-25	15 years ago
3	Reference gene selection for quantitative real...	Ten potential reference genes were compared fo...	Virol J	Radonić, Aleksandar; Thulke, Stefanie; Bae, Hi...	2005-02-10	15 years ago
4	Australia's international health relations in ...	A survey for the year 2003 of significant deve...	Aust New Zealand Health Policy	Barraclough, Simon	2005-02-21	15 years ago
...	...	...	...	...	...	...
2001	Chloroquine : pas d’efficacité sur le virus Ebola	Chloroquine : pas d’efficacité sur le virus Ebola	Revue Francophone des Laboratoires		2015-11-30	4 years ago
2002	Multiplex PCR tests sentinel the appearance of...	Background Since the turn of the century seve...	Journal of Clinical Virology	Mahony, James B.; Hatchette, Todd; Ojkic, Davo...	2009-07-31	11 years ago
2003	Identification of a New Antizyme mRNA +1 Frame...	The expression of eukaryotic antizyme genes r...	Journal of Molecular Biology	Ivanov, Ivaylo P; Anderson, Christine B; Geste...	2004-06-04	16 years ago
2004	Antiviral responses against chicken respirator...	Some of the respiratory viral infections in c...	Cytokine	Barjesteh, Neda; O'Dowd, Kelsey; Vahedi, Seyed...	2020-03-31	1 month ago
2005	Biochemical evidence for the presence of mixed...	Coronavirus envelope (E) protein is a small i...	FEBS Letters	Yuan, Q.; Liao, Y.; Torres, J.; Tam, J.P.; Liu...	2006-05-29	14 years ago

	5o38ihe0	-5.601012229919434	-4.197016716003418	2.3068416118621826	5.485584259033203
0	xvi5miqw	-1.932757	-4.252481	-4.315052	4.177907
1	zl5lgcog	2.447766	-2.379109	-0.537503	4.555745

	0	1	2	3	4
cord_uid
5o38ihe0	-5.601012	-4.197017	2.306842	5.485584	3.822709
xvi5miqw	-1.932757	-4.252481	-4.315052	4.177907	-3.749875

	title	summary	when
1451	Investigation on demands for antenatal care se...	Objective To identify problems and demands for...	4 months ago
664	Lessons learned from Korea: COVID-19 pandemic	Lessons learned from Korea: COVID-19 pandemic	1 month ago
642	Editorial. Impact of COVID-19 on neurosurgery ...	Editorial. Impact of COVID-19 on neurosurgery ...	3 weeks ago
643	Editorial. Innovations in neurosurgical educat...	Editorial. Innovations in neurosurgical educat...	3 weeks ago

	title	abstract	journal	authors	published	when
449	Focusing on Families and Visitors Reduces Heal...	Healthcare-associated respiratory viral infect...	Pediatr Qual Saf	Linam, W. Matthew; Marrero, Elizabeth M.; Hone...	2019-12-16	4 months ago
452	Willingness to Self-Isolate When Facing a Pand...	Infected people are isolated to minimize the s...	Int J Environ Res Public Health	Zhang, Xiaojun; Wang, Fanfan; Zhu, Changwen; W...	2019-12-27	4 months ago
453	Causes of fever in Gabonese children: a cross-...	The causes of infections in pediatric populati...	Sci Rep	Fernandes, José Francisco; Held, Jana; Dorn, M...	2020-02-07	3 months ago
454	Loop-Mediated Isothermal Amplification (LAMP) ...	The recent outbreak of Zika virus (ZIKV) in th...	Viruses	da Silva, Severino Jefferson Ribeiro; Pardee, ...	2019-12-23	4 months ago
455	Innate Immunity and Pathogenesis of Biliary At...	Biliary atresia (BA) is a devastating fibro-in...	Front Immunol	Ortiz-Perez, Ana; Donnelly, Bryan; Temple, Hal...	2020-02-25	2 months ago
...	...	...	...	...	...	...
1978	MERS-CoV as an emerging respiratory illness: A...	Introduction Middle East Respiratory Coronavi...	Travel Medicine and Infectious Disease	Baharoon, Salim; Memish, Ziad A.	2019-12-31	4 months ago
1980	Mass gathering events and reducing further glo...	Mass gathering events and reducing further glo...	The Lancet	McCloskey, Brian; Zumla, Alimuddin; Ippolito, ...	2020-04-10	4 weeks ago
1992	New insights on the antiviral effects of chlor...	ABSTRACT Recently, a novel coronavirus (2019-n...	International Journal of Antimicrobial Agents	Devaux, Christian A.; Rolain, Jean-Marc; Colso...	2020-03-12	2 months ago
1993	A British Society of Thoracic Imaging statemen...	A British Society of Thoracic Imaging statemen...	Clinical Radiology	Nair, A.; Rodrigues, J.C.L.; Hare, S.; Edey, A...	2020-05-31	in 3 weeks
2004	Antiviral responses against chicken respirator...	Some of the respiratory viral infections in c...	Cytokine	Barjesteh, Neda; O'Dowd, Kelsey; Vahedi, Seyed...	2020-03-31	1 month ago

	title	abstract	journal	authors	published	when
292	Prevalence, diversity, and host associations o...	Bartonella infections were investigated in sev...	PLoS Negl Trop Dis	Urushadze, Lela; Bai, Ying; Osikowicz, Lynn; M...	2017-04-11	3 years ago
475	Detection of adenovirus, papillomavirus and pa...	Bats play a significant role in maintaining th...	Arch Virol	Finoketti, Fernando; dos Santos, Raíssa Nunes;...	2019-02-10	1 year ago
1859	Identification and complete genome analysis of...	Among 489 bats of 11 species in China, three ...	Virology	Lau, Susanna K.P.; Woo, Patrick C.Y.; Wong, Be...	2010-08-15	10 years ago

	title	abstract	journal	authors	published	when
1552	The rise of Zika infection and microcephaly: w...	Objectives To consider why Zika was declared ...	Public Health	McCloskey, B.; Endericks, T.	2017-09-30	3 years ago
1980	Mass gathering events and reducing further glo...	Mass gathering events and reducing further glo...	The Lancet	McCloskey, Brian; Zumla, Alimuddin; Ippolito, ...	2020-04-10	4 weeks ago

Getting Started

Data files

Navigating the data

The Data Directory

Research Engine Design

A tale of two indexes

Object-Oriented Design

Loading Research Papers

Load Metadata

Clean Metadata

Load function

Preprocessing

The preprocess function

Utility Functions

Parallel Processing

Describe the Metadata

ResearchPapers init

Python Objects wrapping dataframes

Indexing Research Papers

Indexing from Abstracts

Indexing from Texts

Viewing the Research Papers

CORD 19 Research Papers

The Search Index

What is BM25 Okapi?

Creating the BM25 Index

Searching Papers

The Similarity Index

Building the Annoy Index

Reduce Vector Dimensions

Getting Similar Vectors

Displaying objects in a notebook

Outputting HTML

Investigation on demands for antenatal care services among 2 002 pregnant women during the epidemic of coronavirus disease 2019 in Shanghai

Lessons learned from Korea: COVID-19 pandemic

Using the Dataframe style function

Creating Search Widgets

Search Date Slider

SearchBar

Subsetting Research Papers

Papers published since SARS-COV-2

CORD 19 Research Papers

Papers with "bats" in the title

CORD 19 Research Papers

Papers with Brian Mcloskey as an author

CORD 19 Research Papers

Getting a single research paper

A model of tripeptidyl-peptidase I (CLN2), a ubiquitous and highly conserved member of the sedolisin family of serine-carboxyl peptidases

Summary

A model of tripeptidyl-peptidase I (CLN2), a ubiquitous and highly conserved member of the sedolisin family of serine-carboxyl peptidases

Summary

Implementing getitem

Conclusion

References