working on the word list #37

Open
amueller opened this issue May 8, 2015 · 8 comments

amueller commented May 8, 2015

How about removing words that have Levenshtein distance < 2 from each other:

import numpy as np
import pandas as pd
from Levenshtein import distance

# wordnet-list: one word per line, no header
words = pd.read_csv("wordnet-list", header=None)[0]

dedup = []
for word in words:
    distances = [distance(word, candidate) for candidate in dedup]
    if not distances or np.min(distances) > 1:
        dedup.append(word)
len(dedup)

24911


betatim commented May 8, 2015

Good idea.

Currently the best balance between obscure words and the shortness of the string you have to remember seems to be to use four words for every location. With that, the word list only has to contain 4096 words. You can even get away with three if the place is nearby (Battery Park: lawful-lazily-josef-tended, Brooklyn Bridge: lawful-sheila-novel-dodge). The main problem is finding that many simple English words.
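
Quick sanity check on those numbers (back-of-the-envelope arithmetic only, not code from the repo): with a 4096-word list each word carries 12 bits, so four words cover 48 bits worth of locations and three words cover 36 bits.

import math

WORDLIST_SIZE = 4096
print(math.log(WORDLIST_SIZE, 2))   # 12.0 bits per word
print(WORDLIST_SIZE ** 4)           # 281474976710656 four-word codes (2**48)
print(WORDLIST_SIZE ** 3)           # 68719476736 three-word codes (2**36)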

Running this on words/google-ngram-list-4096 I end up with 2743 deduped words:

import numpy as np
from Levenshtein import distance

words = [l.strip() for l in open("google-ngram-list-4096")]
dedup = []
for word in words:
    distances = [distance(word, candidate) for candidate in dedup]
    if not distances or np.min(distances) > 1:
        dedup.append(word)

Surprised it removes so many.


amueller commented May 8, 2015

How is the google-ngram-list generated? Maybe tuning the corpus from which we take frequencies could help? What would also be fun (maybe slightly not in the original spirit): if we had a ranked list saying how good / memorable each word is, could we find "better" words for more highly populated areas?


amueller commented May 8, 2015

And lastly, using specific patterns of verbs, nouns, adjectives and adverbs will also have a big impact on how memorable a phrase is, imho.


amueller commented May 8, 2015


betatim commented May 8, 2015

To create the google-ngram list, follow the instructions in the second part of words/README (and you potentially need access to my brain to remember the steps that are missing). Suboptimal, hence #38.

More popular words for more populated areas is a good idea. It would require some changes in the algorithm that converts geohashes to their word-based representation.
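
Roughly, something like this hypothetical sketch (not the actual encoder in this repo): each word encodes a 12-bit chunk of the location hash, and a popular-words-for-popular-areas variant could pick which sub-list each chunk draws from based on how densely populated the cell is.

# Hypothetical sketch only -- not the actual encoder used here.
def hash_to_words(location_hash, wordlist, n_words=4):
    assert len(wordlist) >= 4096
    words = []
    for _ in range(n_words):
        words.append(wordlist[location_hash & 0xFFF])  # low 12 bits pick a word
        location_hash >>= 12
    return "-".join(reversed(words))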

Building sentence-like four-word combinations would be nice; without changing the algorithm you'd need 4096 unique words for each type of word. We failed to find enough words last time we tried. NLTK's part-of-speech tagging didn't seem to help much with automating the grouping of the ngram corpus into verbs, nouns, adverbs, etc.
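
For reference, the kind of grouping we tried with NLTK looks roughly like this sketch (it tags isolated words with essentially no sentence context, which is probably why it works so poorly):

import collections
import nltk  # may need nltk.download('averaged_perceptron_tagger') first

words = [l.strip() for l in open("google-ngram-list-4096")]
by_pos = collections.defaultdict(list)
for word, tag in nltk.pos_tag(words):
    by_pos[tag[:2]].append(word)        # collapse NN/NNS, VB/VBD, ... into NN, VB, ...

for tag in ("NN", "VB", "JJ", "RB"):    # nouns, verbs, adjectives, adverbs
    print(tag, len(by_pos[tag]))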

What other large English-language word corpora are there?

Definitely worth trying out popular words for popular areas and sentence-like structures.


amueller commented May 8, 2015

I think the Google n-grams are based on Project Gutenberg. I'm not sure how good a representation of the English language that is, or whether frequency is really a good measure of "good". One could try running n-grams on Wikipedia, or on Amazon reviews ;).
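
Building a frequency-ranked list from any plain-text dump is cheap; something along these lines (the corpus file name is just a placeholder):

import collections
import re

# "corpus.txt" stands in for whatever dump you use (Wikipedia, reviews, ...)
counts = collections.Counter()
with open("corpus.txt") as f:
    for line in f:
        counts.update(re.findall(r"[a-z]+", line.lower()))

top_words = [word for word, _ in counts.most_common(4096)]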

Maybe in the end hand-editing 4096 words would be easiest... still a hassle.


betatim commented May 15, 2015

Yet another source of words could be the 5LNC (five-letter name code) identifiers used to name aircraft waypoints. It seems they aren't required to be real words, but they have to be "pronounceable" even by non-English speakers.

The best list of codes in use that I could find means extracting a PDF from https://icard.icao.int/ICARD_5LNC/5LNCMainFrameset/5LNCApplicationFrame/DownloadPage.do?NVCMD=ShowDownloadPage, which isn't ideal.
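
Once that PDF is dumped to plain text (e.g. with pdftotext), pulling out candidate codes could be as simple as this sketch (the file name is assumed):

import re

# "5lnc.txt" is assumed to be the pdftotext output of the ICAO download
with open("5lnc.txt") as f:
    text = f.read()

codes = sorted(set(re.findall(r"\b[A-Z]{5}\b", text)))
print(len(codes))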


betatim commented May 15, 2015
