2025-01-19

Introduction

Next Word Predictor is a Shiny web app that predicts the next word using an n-gram model.

Its main features are:

  • N-gram model with N={1, 2, 3, 4}
  • Trained on 1% of the original ‘final/en_US’ dataset
  • Starts with 4-grams and backs off to lower-order n-grams
  • Produces the top four suggestions
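As a rough illustration of what the n-gram tables could look like, here is a minimal base-R sketch that counts n-grams of orders 1 to 4 in a toy corpus. The helper `build_ngrams`, the `freq` column, and the toy sentences are assumptions made for this example; the app's actual training code is not shown in these slides.

```r
# Hypothetical helper: count all n-grams of order n in a character vector of lines.
build_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(tolower(lines), "\\s+"), function(words) {
    words <- words[words != ""]
    if (length(words) < n) return(character(0))
    # Slide a window of n words across the line
    vapply(seq_len(length(words) - n + 1),
           function(k) paste(words[k:(k + n - 1)], collapse = " "),
           character(1))
  }))
  # Most frequent n-grams first, so the top rows are the best candidates
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngrams = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy corpus standing in for the 1% sample of the real dataset
toy_corpus <- c("i am happy", "i am here", "i am happy today")

# Element n of the list holds the n-gram table of order n
ngram_list <- lapply(1:4, function(n) build_ngrams(toy_corpus, n))
head(ngram_list[[2]])
```

Sorting by decreasing frequency matters here because a prediction step that takes the first few matching rows then returns the most frequent continuations.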

DISCLAIMER: Please do not rely on this app to write your next NYT best seller.

Main UI

How to use?

  • To make the demonstration easy, I included a select input. Choose one of the options and see how the app produces predictions.
  • Alternatively, you can type your own phrase. Only lower case is supported for demonstration purposes.
  • Keep choosing one of the predictions; if you are lucky, you may end up with a meaningful sentence.

Statistical model

The function below shows the basic algorithm. The n-gram models are stored in separate CSV files, one per order, and loaded when the app launches. They are passed to the prediction function as a list, and the input text is passed as a string. The function first searches the highest-order n-gram model that the input allows; if it finds no match, it backs off to the next lower order. If nothing matches at any order, it suggests four words sampled at random from the 100 most frequent unigrams.
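The loading step might look roughly like the sketch below. The file names (`ngram_1.csv` to `ngram_4.csv`), the temporary directory, and the toy contents are all hypothetical; only the `ngrams` column name and the idea of one list element per order come from the prediction function.

```r
# Write toy models to a temporary directory so the sketch is runnable.
# File names and contents are hypothetical.
model_dir <- tempfile("ngram_models")
dir.create(model_dir)
toy_models <- list(
  c("the", "of", "and"),
  c("of the", "in the"),
  c("one of the", "a lot of"),
  c("the end of the", "at the same time")
)
for (n in seq_along(toy_models)) {
  write.csv(data.frame(ngrams = toy_models[[n]]),
            file.path(model_dir, paste0("ngram_", n, ".csv")),
            row.names = FALSE)
}

# Load each file once (e.g. at app launch); element n holds the n-grams.
ngram_list <- lapply(1:4, function(n) {
  read.csv(file.path(model_dir, paste0("ngram_", n, ".csv")),
           stringsAsFactors = FALSE)
})
```

Reading the files once at launch, rather than on every keystroke, keeps each prediction down to a lookup in tables that are already in memory.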

The Prediction Function

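library(stringr)  # provides str_split() and str_squish()

# ngram_list: list of data frames; element i holds the i-grams in an `ngrams`
#   column (first column), assumed to be sorted by decreasing frequency.
# previous_string: the lower-case text typed so far.
predict_word_live <- function(ngram_list, previous_string) {
  highest_order_ngram <- length(ngram_list)
  # Drop leading, trailing and repeated spaces so no empty "words" appear
  previous_words <- str_split(str_squish(previous_string), " ")[[1]]
  str_len <- length(previous_words)
  # Start with the highest order that the input and the models allow
  begin_index <- min(str_len + 1, highest_order_ngram)
  for (i in begin_index:2) {
    # The last (i - 1) typed words form the prefix of an i-gram
    curr_text <- paste0(previous_words[(str_len - i + 2):str_len], collapse = " ")
    # Trailing space so that the prefix matches whole words only
    search_string <- paste0("^", curr_text, " ")
    candidates <- ngram_list[[i]][grep(search_string, ngram_list[[i]]$ngrams), 1]
    length_candidates <- length(candidates)
    if (length_candidates > 0) {
      # Keep at most four candidates
      next_word <- str_split(candidates[1:min(4, length_candidates)], " ")
      # The predicted word is the i-th (last) word of a matching i-gram
      result <- lapply(next_word, function(x) { x[i] })
      result <- replace(result, is.na(result), "")
      return(result)
    }
  }
  # Nothing matched: suggest four random words from the top 100 unigrams
  return(sample(ngram_list[[1]]$ngrams[1:100], 4))
}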
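To make the backoff order easier to follow, the sketch below lists which prefix would be searched at each n-gram order for a given input. The helper `backoff_prefixes` and its `max_order` parameter are hypothetical and written in base R for illustration; it mirrors the index arithmetic of the prediction function but does not look anything up, and it assumes a non-empty input.

```r
# Hypothetical helper: list the prefix searched at each n-gram order,
# from the highest usable order down to bigrams.
backoff_prefixes <- function(previous_string, max_order = 4) {
  words <- strsplit(trimws(previous_string), "\\s+")[[1]]
  n_words <- length(words)
  # Cannot use an order higher than (number of typed words + 1)
  start <- min(n_words + 1, max_order)
  orders <- start:2
  prefixes <- vapply(orders, function(i) {
    # An i-gram is matched on its first (i - 1) words
    paste(words[(n_words - i + 2):n_words], collapse = " ")
  }, character(1))
  setNames(prefixes, paste0(orders, "-gram"))
}

backoff_prefixes("thank you very much for")
# 4-gram: "very much for", 3-gram: "much for", 2-gram: "for"

backoff_prefixes("good morning")
# 3-gram: "good morning", 2-gram: "morning"
```

With five typed words, only the last three are used for the 4-gram search; shorter inputs simply start at a lower order.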

Conclusion

Using this app, you can see the basics of how n-gram-based next-word prediction works. Thank you!