Next word predictor

2025-01-19

Introduction

Next word predictor is a Shiny web app that predicts the next word using an n-gram model.

Its main features are:

N-gram model with N = {1, 2, 3, 4}
Trained on a 1% sample of the original ‘final/en_US’ dataset
Starts with 4-grams and backs off to lower-order n-grams
Produces up to four top suggestions
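
As a rough illustration of what an n-gram table contains, the following base R sketch counts n-grams in a toy corpus and sorts them by frequency. The helper name count_ngrams, the column names ngrams and freq, and the toy sentences are assumptions made for this example. The app's actual training pipeline is not shown here and may differ.

```r
# Hypothetical helper: count the n-grams of order n in a character vector of
# lower-case sentences and return them sorted by decreasing frequency.
count_ngrams <- function(sentences, n) {
  grams <- unlist(lapply(strsplit(sentences, " +"), function(words) {
    # Sentences shorter than n words contribute no n-grams.
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(s) paste(words[s:(s + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngrams = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy corpus standing in for the sampled training text (an assumption).
toy_corpus <- c("how are you", "how are they", "how are you today")
bigrams <- count_ngrams(toy_corpus, 2)
print(bigrams)
```

Sorting by frequency matters because the predictor takes the first matching rows of a table as its top suggestions.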

DISCLAIMER: Please do not rely on this app to write your next NYT best seller.

Main UI

Here is a screenshot of the UI on shinyapps.io. The app is available at nmammedov.shinyapps.io/nextword


How to use?

To make the demonstration easy, I included a select input. Choose one of the options and see how the app produces predictions.
Alternatively, you can type your own phrase. Only lower case is supported for demonstration purposes.
Keep choosing one of the predictions; if you are lucky, you may end up with a meaningful sentence.

Statistical model

Here is the basic function that shows the algorithm. The n-gram models are stored in their respective ‘csv’ files and loaded when the app launches. The available n-gram models are passed to the prediction function as a list, ordered so that element i holds the i-gram table, and the input text is passed as a string. The function first looks for matches in the highest-order n-gram table that the length of the input allows. If it finds none, it backs off to the next lower-order table. If nothing is found at any order, it suggests 4 words sampled at random from the first 100 entries (the high-frequency words) of the unigram model.
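
The loading step might look like the sketch below. The file names (ngram_1.csv, ngram_2.csv), the freq column, and the toy rows are assumptions made for illustration, since the app's real file layout is not shown here. The only point is that the list is built in order, so that ngram_list[[i]] is the i-gram table.

```r
# Write two toy tables to a temporary directory so the sketch is self-contained.
# File names and contents are hypothetical, not the app's real data.
model_dir <- tempfile("ngram_models")
dir.create(model_dir)
write.csv(data.frame(ngrams = c("the", "you", "and"), freq = c(9, 7, 5)),
          file.path(model_dir, "ngram_1.csv"), row.names = FALSE)
write.csv(data.frame(ngrams = c("are you", "are they"), freq = c(2, 1)),
          file.path(model_dir, "ngram_2.csv"), row.names = FALSE)

# Load the tables in order so that ngram_list[[i]] is the i-gram table.
orders <- 1:2
ngram_list <- lapply(orders, function(i) {
  read.csv(file.path(model_dir, sprintf("ngram_%d.csv", i)),
           stringsAsFactors = FALSE)
})
str(ngram_list)
```

Reading with stringsAsFactors = FALSE keeps the ngrams column as plain character strings, which is what string matching on the table expects.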

The Prediction Function

library(stringr)

# ngram_list[[i]] is the i-gram table: a data frame whose `ngrams` column holds
# space-separated n-grams, assumed to be sorted by decreasing frequency.
predict_word_live <- function(ngram_list, previous_string) {
  highest_order_ngram <- length(ngram_list)
  previous_words <- str_split(str_squish(previous_string), " ")[[1]]
  str_len <- length(previous_words)
  # An i-gram lookup uses the last i - 1 typed words as context.
  begin_index <- min(str_len + 1, highest_order_ngram)
  for (i in begin_index:2) {
    curr_text <- paste(previous_words[(str_len - i + 2):str_len], collapse = " ")
    ngrams <- as.character(ngram_list[[i]]$ngrams)
    # Literal prefix match; the trailing space restricts it to whole words.
    candidates <- head(ngrams[startsWith(ngrams, paste0(curr_text, " "))], 4)
    if (length(candidates) > 0) {
      # The prediction is the i-th (last) word of each matching i-gram.
      # Indexing with i, not begin_index, keeps backed-off results non-empty.
      return(vapply(str_split(candidates, " "), function(x) x[i], character(1)))
    }
  }
  # No match at any order: sample 4 of the 100 most frequent unigrams.
  sample(head(as.character(ngram_list[[1]]$ngrams), 100), 4)
}

Conclusion

Using this app, you can get a feel for the basics of how n-gram-based next-word prediction works. Thank you!