Sanjeev Singh Kenwar
9 min read · Apr 12, 2019


Image source: https://www.iiht.com/

NLP (Natural Language Processing) using Doc2Vec

Note: All codes are here on my github

Preface:

If you work for a decent-sized corporation, you probably have a corporate website with Google-like search functionality. How do its search capabilities compare to Google's? Google utilizes sophisticated algorithms with massive computing power, which corporations, for good reasons, are often reluctant to invest in. Hence, in many cases corporate search functionality is sub-par.

However, machine learning (ML) NLP algorithms are sophisticated enough (for a modest cost and computing power) to categorize intranet websites, documents, contents or even individual sentences into pre-defined categories (supervised learning) or self-defined clusters (unsupervised learning), or to rank them in order of similarity, as long as the texts are available in electronic format. Such a body of texts is called a corpus in NLP parlance.

In this article, we will start by exploring some use cases of ML NLP, followed by our own use case. Then we will discuss some ML NLP options before diving into the code.

This is NOT an article on chatbots. Here, we will be doing some word crunching to find similar content. You can use similar concepts in your workplace, e.g. identifying duplicate documents, extracting and comparing information from emails or shared directories, or scraping through logs or internal company websites. Also bear in mind that ML NLP coding is just one aspect of delivering a product. The journey starts with identifying relevant use cases, convincing others of the business value, obtaining resources (including proper compute power), doing some serious data engineering and managing the project through to fruition. Good luck!

All the code is available on github here. If you are using it, please make sure to give credit where it is due :-)

I am using Windows 10, 64-bit, on an HP Z820 with 32 logical cores (Intel Xeon 2 GHz) and 128 GB RAM. This is not what you would normally have, so consider using AWS. The steps remain the same.

Practical Use Cases :

Several books have been written on the potential use cases of ML NLP. I will just try to give you an idea in layman's terms. Please remember this article is only about NLP and not about machine learning in general; I'll have separate articles for those.

Legal Field

One of the most important things lawyers do is gather case-specific relevant information. The information comes from a variety of sources, for example, similar cases (or judgements) in the past, applicable (and in some cases obscure) laws, information available on the internet, etc.

ML NLP algorithms can help by presenting lawyers the most relevant information ranked by similarity. It's like having your personal Google!

Medical Field

The use case is similar to the above: a doctor might be interested in finding more information relevant to the case at hand, potentially providing better care to her patients.

Financial Services Field

Financial Services is a catch-all phrase for any company whose primary business is dealing with money. This includes banks, insurance companies, stock brokerages, etc.

Using ML NLP, one can do sentiment analysis, which can be a factor in making decisions. For example, decisions on the amount (and rate) of a loan a bank would extend to a counterparty, or whether to invest in a stock. A sentiment score can be generated from each relevant source and aggregated to get a composite score for a given counterparty, without sifting through the contents. We will touch upon this during our demonstration.

Other areas include M&A due diligence, similar to what was described for the legal field (i.e. think Google Search): gathering relevant information pertaining to a pending deal, which in turn can better inform the risks and rewards associated with the deal.

Human Resources Field

ML NLP can be used to match relevant resumes out of thousands of resumes and also rank them in order of similarity.

Our use case

Objective :

Here we will be building an application using Doc2vec, trained on 1.6 million Amazon book reviews, which for a given review returns: a) similar reviews; b) the predicted user rating. c) Finally, we will see how the results stack up against the much simpler Tf-idf algorithm. Cool, no…

PS: If you have a smaller data size then Doc2vec may be overkill; maybe start with Tf-idf and see if it works well.

About the data:

The data is from Amazon and consists of user comments on various books and ratings on a scale of 1–5, going back several years. The dataset consists of over 8 million reviews.

Datasource:

(Thanks, Mr. Julian McAuley.) Here

Also a side note: Mr. McAuley has co-written excellent papers which we will cover in future articles on deep learning: "R. He, J. McAuley. Modeling the visual evolution of fashion trends with one-class collaborative filtering. WWW, 2016." and "J. McAuley, C. Targett, J. Shi, A. van den Hengel. Image-based recommendations on styles and substitutes. SIGIR, 2015."

Data available to us:

Additional columns added :

Sentiment Categories: ratings of 1 or 2 mean negative, 3 means neutral and 4 or 5 mean positive

Index: the position of a comment in the original dataset

Tokens: a composite column of index and rating. This is also used to tag the documents (for use in Doc2vec). We could have used other columns as well.
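As a rough sketch of how such columns can be derived in pandas (a toy DataFrame; the column names here are my assumptions, not the dataset's actual schema):

```python
import pandas as pd

# Toy stand-in for the review data; column names are assumptions.
df = pd.DataFrame({
    "review_text": ["Loved it!", "Not great.", "It was okay."],
    "rating": [5, 2, 3],
})

def to_sentiment(rating):
    """Map a 1-5 star rating to a sentiment category."""
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"

df["sentiment"] = df["rating"].apply(to_sentiment)
df["index_col"] = df.index  # position of the comment in the original dataset
# Composite tag: "<index>_<rating>", later used to tag documents for Doc2vec.
df["tokens"] = df["index_col"].astype(str) + "_" + df["rating"].astype(str)
```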

Data Analysis

Looking at the raw data, I say to myself: thank goodness there are more good people than critical people. As you can see from the histogram, the data is heavily skewed toward rating 5. This will pose a challenge for our ML NLP model, as it will be biased towards rating 5.

To solve this issue, we will drop several records to make the ratings equally distributed. Thus, our initial data size will be ~1.6 million, with 324K records for each rating category.
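One way to do this downsampling in pandas (a minimal sketch with a toy imbalanced DataFrame; column names are assumptions):

```python
import pandas as pd

# Toy imbalanced ratings; the real dataset skews heavily toward 5.
df = pd.DataFrame({"rating": [5] * 10 + [4] * 6 + [3] * 4 + [2] * 4 + [1] * 4,
                   "review_text": ["..."] * 28})

# Downsample every rating class to the size of the rarest one.
n_min = df["rating"].value_counts().min()
balanced = df.groupby("rating").sample(n=n_min, random_state=42)
```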

Choice of algorithm

There are multiple Python libraries available to us, to name a few: Scikit-learn, NLTK, Gensim, spaCy, etc. At a minimum, one should be using the NLTK library, as it has some nice text processing functions.

These libraries employ different methodologies and one has to experiment a bit to choose the best fit for your use case. In most cases, bag of words (BOW) combined with a weighting methodology like tf-idf can work well. It uses simple techniques like the frequency of occurrence of words (word counting). It is likely able to determine sentiment reasonably well and may also be able to identify similarity among documents. However, it does not capture the emotions/tones, sarcasm, irony, context and other nuances in a language.
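To make the BOW/tf-idf idea concrete, here is a minimal scikit-learn sketch (not the code from the repo) that vectorizes three toy reviews and compares them with cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "a gripping thriller with a great plot",
    "the plot was gripping and the writing great",
    "slow, boring and hard to finish",
]

# Turn each document into a tf-idf weighted bag-of-words vector.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Pairwise cosine similarity: higher means more shared (weighted) words.
sims = cosine_similarity(tfidf)
```

The first two reviews share words like "gripping" and "plot", so they score as similar; the third shares nothing and scores near zero, even though a human would read it as related book feedback. That is exactly the nuance gap described above.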

Another option is using deep learning libraries like Word2vec or Doc2vec. Here, the words need to be converted into numbers. To understand how this happens, think of each word being represented as a point in an N-dimensional space. Of course, we humans cannot visualize more than 3 dimensions. But to elaborate further, the word 'King' may be represented as a point in 2-dimensional space having coordinates x and y. The same point in 3 dimensions will also have a z coordinate, i.e. the vector of x, y and z. Similarly, in N-dimensional space it will be represented by a vector of N dimensions, with each dimension providing some additional information about the word.

Since the words are now just points in N dimensions (think GPS coordinates), one can do some mathematical operations on them. For example, we can measure the distance (using Euclidean or cosine methods) between two points, and this can give us some idea of the similarity of two words: the shorter the distance, the more similar the words. It can further help capture relative similarities; for example, king and man should be analogous to queen and woman. To arrive at the coordinates (aka vectors of numbers), the algorithm makes use of a neural network with one hidden layer and multiple forward and backward propagation passes. You can learn more about deep learning and Word2vec/Doc2vec here. There are further flavors (skip-gram, bag of words, etc.) and hyper-parameters for each of these libraries that one should be intimately familiar with, some of which we will touch upon in our demonstration. We will be using Doc2vec.
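A toy illustration of the distance idea, with hand-made 3-dimensional vectors purely for intuition (real embeddings are learned, typically in 50–300 dimensions):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two vectors: 1.0 = same direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hand-made 3-dimensional "embeddings", not learned from data.
king  = np.array([0.9, 0.8, 0.1])
queen = np.array([0.9, 0.1, 0.8])
man   = np.array([0.5, 0.9, 0.0])
woman = np.array([0.5, 0.1, 0.9])

# The classic analogy: king - man + woman should land closest to queen.
analogy = king - man + woman
```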

Word2vec and Doc2vec are essentially the same, except that Doc2vec takes an additional input called a tag, which can be a source document ID (a document can be a book, a paragraph, a sentence, etc.) or a label. Unlike Doc2vec, Word2vec does not discriminate among documents.

So let’s demonstrate through some coding now. You may be able to use the same code at your workplace, with some tweaking.

Coding time

Some Notes:

  1. The best way to go about this is downloading my code from github (here) and running it on a high-powered machine.
  2. Also, if you are doing multiprocessing in a Jupyter notebook, put all functions in a separate file and import them; otherwise, it will throw an error. See prep2.py on github, which contains all my functions.
  3. It took almost 3 hours to run the code. However, the results were impressive.
  4. Since it's a long-running process, turn logging on. At least you will be aware of what is going on. Love Python for this!
  5. Make sure you have a C compiler before installing Gensim, to use the optimized Doc2vec routines.

Coding steps : Codes are here

Step 1 : Data Import : See Data_Import.ipynb file on github

Step 2: Pre-processing and Tokenization of data

See: Pre-process and Train.ipynb on github

This step entails removing stop words, followed by lemmatization and finally tokenization. Click on the links to find out more.

Two decisions I had to make were :

  1. Lemmatization vs. stemming: I chose lemmatization, as it preserves the actual word, and speed of operation was not an issue for me.
  2. Whether to remove punctuation like "? !" etc.: I chose not to remove punctuation, as it provides valuable context to the words. Consider these two sentences: "Are you sleeping?" and "Are you sleeping!". The first may just be a naive question, while the second may express surprise. And that's where Word2vec and Doc2vec shine.

Step 3: Building and Training Doc2vec models

See: Pre-process and Train.ipynb on github

At this step, I had to make multiple decisions. Knowing that I have good computing power, my decisions were as follows

  1. Doc2vec comes in two flavors: Distributed Memory (DM) and Distributed Bag of Words (DBOW). This link provides a good explanation of these. I decided to check out both versions independently and found DM to be much more accurate, though some recommend combining the two for better accuracy. Maybe next time!
  2. Recollect N dimensions from above. Here, one can choose the number of dimensions used to represent the words. 50–300 dimensions are recommended; I chose 300.
  3. Window size: to get the context of a word, the algorithm looks both to the right and to the left. Window size indicates the number of words it should look at on each side. I chose 10.
  4. Epochs: the number of forward and backward passes (deep learning basics). I chose 20. Be aware of the overfitting problem if you choose a higher number.
  5. Tagging: Doc2vec takes an additional input called a tag, which is generally a document ID (in our case, a review). I chose a composite key, made by concatenating the document index and rating.

Step 4 : Testing it out

See: Conclusion.ipynb on github; also see the next section for the conclusion.

Conclusion:

This is the end of our journey on this topic.

  1. The results of Doc2vec are summarized in the chart below. The tokens represent the documents/reviews most similar to our test review, ranked by similarity score; tokens are review/document IDs. By default, Doc2vec shows the 10 most similar documents.

2. As a by-product, you can see the top 3 similar documents have user ratings between 3 and 4, which is in line with the rating (4) of our test user review.

3. Finally, we used the Tf-idf methodology to get a sense of its take on the similarity of the 10 documents ranked by Doc2vec to our test review. Though there are differences, the top 3 documents remain the same in both cases.

Some additional references for the hungry ones:

https://arxiv.org/pdf/1405.4053.pdf

https://praveenbezawada.com/2018/01/25/document-similarity-using-gensim-dec2vec/

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb

https://arxiv.org/pdf/1607.05368.pdf
