working on the word list #37

Open
amueller opened this issue May 8, 2015 · 8 comments

amueller commented May 8, 2015

How about removing words that have Levenshtein distance < 2 from each other:

import numpy as np
import pandas as pd
from Levenshtein import distance

# wordnet-list: one word per line, no header
words = pd.read_csv("wordnet-list", header=None)[0]

dedup = []
for word in words:
    distances = [distance(word, candidate) for candidate in dedup]
    if not distances or np.min(distances) > 1:
        dedup.append(word)
len(dedup)

24911


betatim commented May 8, 2015

Good idea.

Currently the best balance between obscure words and the shortness of the string you have to remember seems to be to use four words for every location. With that, the word list only has to contain 4096 words. You can even get away with three if the place is nearby (Battery Park: lawful-lazily-josef-tended, Brooklyn Bridge: lawful-sheila-novel-dodge). The main problem is finding that many simple English words.
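
Quick sanity check on those numbers (back-of-the-envelope arithmetic only, not code from the repo): with a 4096-word list each word carries 12 bits, so four words cover 48 bits worth of locations and three words cover 36 bits.

import math

WORDLIST_SIZE = 4096
print(math.log(WORDLIST_SIZE, 2))   # 12.0 bits per word
print(WORDLIST_SIZE ** 4)           # 281474976710656 four-word codes (2**48)
print(WORDLIST_SIZE ** 3)           # 68719476736 three-word codes (2**36)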

Running this on words/google-ngram-list-4096 I end up with 2743 deduped words:

import numpy as np
from Levenshtein import distance

words = [l.strip() for l in open("google-ngram-list-4096")]
dedup = []
for word in words:
    distances = [distance(word, candidate) for candidate in dedup]
    if not distances or np.min(distances) > 1:
        dedup.append(word)

Surprised it removes so many.


amueller commented May 8, 2015

How is the google-ngram-list generated? Maybe tuning the corpus from which we take frequencies could help? What would also be fun (maybe slightly not in the original spirit): if we had a ranked list saying how good / memorable each word is, could we find "better" words for more highly populated areas?


amueller commented May 8, 2015

And lastly, using specific patterns of verbs, nouns, adjectives and adverbs will also have a big impact on how memorable a phrase is, imho.


amueller commented May 8, 2015


betatim commented May 8, 2015

To create the google-ngram list, follow the instructions in the second part of words/README (and you potentially need access to my brain to remember the steps that are missing). Suboptimal, hence #38.

More popular words for more populated areas is a good idea. It would require some changes in the algorithm that converts geohashes to their word-based representation.
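
Roughly, something like this hypothetical sketch (not the actual encoder in this repo): each word encodes a 12-bit chunk of the location hash, and a popular-words-for-popular-areas variant could pick which sub-list each chunk draws from based on how densely populated the cell is.

# Hypothetical sketch only -- not the actual encoder used here.
def hash_to_words(location_hash, wordlist, n_words=4):
    assert len(wordlist) >= 4096
    words = []
    for _ in range(n_words):
        words.append(wordlist[location_hash & 0xFFF])  # low 12 bits pick a word
        location_hash >>= 12
    return "-".join(reversed(words))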

Building sentence-like four-word combinations would be nice; without changing the algorithm you'd need 4096 unique words for each type of word. We failed to find enough words last time we tried. NLTK's part-of-speech tagging didn't seem to help much with automating the grouping of the ngram corpus into verbs, nouns, adverbs, etc.
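
For reference, the kind of grouping we tried with NLTK looks roughly like this sketch (it tags isolated words with essentially no sentence context, which is probably why it works so poorly):

import collections
import nltk  # may need nltk.download('averaged_perceptron_tagger') first

words = [l.strip() for l in open("google-ngram-list-4096")]
by_pos = collections.defaultdict(list)
for word, tag in nltk.pos_tag(words):
    by_pos[tag[:2]].append(word)        # collapse NN/NNS, VB/VBD, ... into NN, VB, ...

for tag in ("NN", "VB", "JJ", "RB"):    # nouns, verbs, adjectives, adverbs
    print(tag, len(by_pos[tag]))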

What other large English-language word corpora are there?

Definitely worth trying out popular words for popular areas and sentence-like structures.


amueller commented May 8, 2015

I think the Google n-grams are based on Project Gutenberg. I'm not sure how good a representation of the English language that is, or whether frequency is really a good measure of "good". One could try running n-grams on Wikipedia, or on Amazon reviews ;).
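
Building a frequency-ranked list from any plain-text dump is cheap; something along these lines (the corpus file name is just a placeholder):

import collections
import re

# "corpus.txt" stands in for whatever dump you use (Wikipedia, reviews, ...)
counts = collections.Counter()
with open("corpus.txt") as f:
    for line in f:
        counts.update(re.findall(r"[a-z]+", line.lower()))

top_words = [word for word, _ in counts.most_common(4096)]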

Maybe in the end hand-editing 4096 words would be easiest... still a hassle.


betatim commented May 15, 2015

Yet another source of words could be the 5LNC (five-letter name code) identifiers used to name aircraft waypoints. It seems they aren't required to be real words, but they have to be "pronounceable" even by non-English speakers.

The best list of codes in use that I could find means extracting a PDF from https://icard.icao.int/ICARD_5LNC/5LNCMainFrameset/5LNCApplicationFrame/DownloadPage.do?NVCMD=ShowDownloadPage, which isn't ideal.
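
Once that PDF is dumped to plain text (e.g. with pdftotext), pulling out candidate codes could be as simple as this sketch (the file name is assumed):

import re

# "5lnc.txt" is assumed to be the pdftotext output of the ICAO download
with open("5lnc.txt") as f:
    text = f.read()

codes = sorted(set(re.findall(r"\b[A-Z]{5}\b", text)))
print(len(codes))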


betatim commented May 15, 2015
