Remember (hopefully) better with the help of Liechtenstein Distance
Jessy Chen / 2019-05-17 /
Remembering (hopefully) better with the help of Levenshtein Distance
When we try to memorize the spelling of a word, we might be confused by other similarly-spelled words that we know, and this is a challenge for high school students facing the College Entrance Exam. For this week, I tried to create a list of similarly-spelled words by using Levenshtein Distance.
Levenshtein Distance is also called Edit Distance, and can help us understand how similar two sequences, such as strings, are by means of addition, deletion, and substituion. However, it is based on the spelling of the sequences only, so words with distinctively different pronunciations are likely to be considered similar using Levenshtein Distance. (There are other ways to define similarity as well, and this article includes a detailed explanation: https://stackabuse.com/levenshtein-distance-and-text-similarity-in-python/?fbclid=IwAR2O1Oo0Glu13FpmkGJR6Fbg7Q46iQh8WWw-dZuqSV15n2qi7ohIs2uaClI)
In addition to helping us remember words better, Levenshtein Distance is also used for people with dyslexia. A research group even created a game to help children with dyslexia with word memorization. (https://core.ac.uk/download/pdf/82355820.pdf?fbclid=IwAR0ea-1zQx2dDl7a9zphtVWJb9bf2DWyU8AU9J6OfNRL8ebQ3Llqldwdrd0)
If we randomly select a word in a dictionary, look for the word most similar to it, and see what comes next, we will end up with a long, orderly list of words. There is a Levenshtein dictionary online that visualizes this idea: http://blueribbonfarmsaz.com/dictionary/?fbclid=IwAR0pvuIslOuqReYQlsbn9TRcM-P5XAahz2Du1rnwxPGu1EoDG8BwLD8sHDY.
The word list comes from http://www.taiwantestcentral.com/Default.aspx
find_sim <- function(i){
#print(i)
sim <- levenshteinDist(i, df$English)
s_min <- min(sim[sim!=min(sim)])
i_min <- which(sim == s_min)
return(paste(unique(unlist(df$English[i_min])), collapse=","))}
find_sim_lst <- lapply(df$English, find_sim)
df <- df %>% mutate(sim=find_sim_lst)
saveRDS(df, "voc_sim.rds")