Ubiquitous Vocabulary

Team
Initiator : Fabien

Goal

A personal repository of words, generated by your daily use of electronic devices and used to improve your subsequent input.

Abstraction of the problem

gather words
1. input based on discussions
2. written text in articles
3. reverse input based on readings: texts in which no new word was detected, so that all their words are submitted as known
  1. as it was done at school: read this text, then "Any new word?"
  2. bad for stats though, only useful to build dictionaries
make stats
use them for input

Optional steps

correct words
add new words (synonyms)

How do you cope with this situation?

In situ example

... storytelling showing what the user wants to do, what the problem is and how it will be solved (without details)

Advantages

gain time
- complete words
use as a learning tool
- correct mistake (live or offline)
- propose synonyms
- analysis
  - more complex grammar corrections
  - ontology generation
  - statistics of corrections with links to solutions

Disadvantages

relying on the computer
- delegating the grammar work
laziness on vocabulary expansion
orthographical difficulties ("à la T9")

15:10 <+Utopiah> bushblows: do you use world completion based on dictionnaries based on logs?
15:11 <@bushblows> Utopiah: nope, my grammar is bad enough as is, that would kill what little bit remains intact.
15:12 <+Utopiah> you could correct the logs first then...
15:13 <@bushblows> lol
15:14 <+Utopiah> (you could also have context specialized dictionnaries online and connect to them when you type anything in Gvim/Irssi whatever from any device... ;)
15:14 <+Utopiah> (eventually you could think faster... who knows, have to test it)
15:19 <@bushblows> hah, I only wish a setup like that would constructivley benefit my knowledge.
15:20 <+Utopiah> bushblows: it could also randomly propose new words or even complete with a synonym
15:21 <@bushblows> Utopiah: I wouldnt.
15:21 <@bushblows> all I would remember is the first few letters of a word.
15:22 <@bushblows> given that a setup like this was used with everything whenever I used a computer for an extended amount of time.
15:23 <+Utopiah> hmmm you think it would hinder your ability to fully remember words and then being less efficient when you read/listen to words?
15:24 <@bushblows> not my ability to remember words, but it would cause me to start losing the spelling of a significant amount of my vocabulary
15:24 <+Utopiah> right, probably correct but now which one is more important, ortograph/spelling or speed of thoughts/expression?
15:25 <@bushblows> why not both?
15:25 <+Utopiah> sure, how? :)
15:26 <@bushblows> lol, I dont have an idea for an implimentation that would solve the issue, I just want to inspire the mind to think that way, inturn makes you more prone to building the idea to solve the issue.
15:26 <@bushblows> ya know?
15:27 <+Utopiah> thank you :)

Proposed solutions

use logs as a corpus on which I'd periodically build statistics, and then create per-context completion dictionaries
- ```
cat web/thelab/stigmergylive/logs/ChannelLogger/freenode/#*/* | grep "<Utopiah" > ~/ircutopiah
cat ircutopiah | sed -e "s/[^a-zA-Z]/ /g" | sed -e "s/ /\n/g" | sed -e "s/^.\{1,3\}$//" |sort | uniq > mydict
```
  - instead of just using "uniq", count word usage and remove rarely used words
    - ```
    cat ircutopiah | sed -e "s/[^a-zA-Z]/ /g" | sed -e "s/ /\n/g" | sed -e "s/^.\{1,3\}$//" | sort | uniq -c | awk '$1 > 3 && NF == 2 { print $2 }' > mydict_popular
```
- check our experimental data
- potentially increase usability by augmenting the "angle" (ability to differentiate between 2 items) amongst words
- more content can be extracted from a list of webpages with wget
  - appending &action=source for pmWiki URLs
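The wget idea can be sketched as follows. This is only a sketch: the helper names `pmwiki_source_url` and `fetch_sources` and the file names `urls.txt` and `webcorpus` are hypothetical, and it assumes every URL in the list points to a pmWiki page.

```shell
#!/bin/sh
# Hypothetical helper: build the URL that asks pmWiki for the page markup.
pmwiki_source_url() {
    case "$1" in
        *\?*) printf '%s&action=source\n' "$1" ;;   # URL already has a query string
        *)    printf '%s?action=source\n' "$1" ;;   # no query string yet
    esac
}

# Hypothetical helper: fetch every URL listed in file $1 (one per line)
# and print the combined text on standard output.
fetch_sources() {
    while IFS= read -r url; do
        [ -n "$url" ] || continue                   # skip empty lines
        wget -q -O - "$(pmwiki_source_url "$url")"
    done < "$1"
}

# Usage (assumed file names): fetch_sources urls.txt >> webcorpus
```

The resulting text can then go through the same sed/sort/uniq pipeline as the IRC logs.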
extract context from the appearance of rare words, i.e. words that are only used in a specific context
- Example: http://blinkenbot.blinkenshell.org/index_new.html (the word "vouch" is used often there)
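One possible way to spot such context-specific words is to compare the word list of one context with a word list built from everything else. The sketch below uses strict absence from the general list as a stand-in for "rare"; the helper name `context_words` and both input files are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the words found in the context word list $1
# but absent from the general word list $2 (both: one word per line).
context_words() {
    ctx=$(mktemp) && gen=$(mktemp) || return 1
    sort -u "$1" > "$ctx"
    sort -u "$2" > "$gen"
    comm -23 "$ctx" "$gen"    # -23 keeps only the lines unique to the first file
    rm -f "$ctx" "$gen"
}

# Usage (assumed file names): context_words dict_blinkenshell mydict
```

A softer rule, such as comparing relative frequencies instead of presence, would probably be needed on real logs.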
Irssi manual, section 11: Logging
- /help log
Irssi tips: "Where are my completions/replaces gone?"
- completions = {};
scripts.irssi.org (until the 26th of September)
- dictcomplete.pl: Dictionary complete 1.31, caching dictionary-based tab completion, by Juerd (first version: Timo Sirainen), Public Domain
- irccomplete.pl: IRC Completion 0.1, adds words from IRC to your tab-completion list and fixes typos, by Erkki Seppälä, Public Domain
- wordcompletition.pl: IRC Completion with mysql-database 0.1, adds words from IRC to your tab-completion list, by Jesper Lindh, Public Domain

Discussion log on dictionary building using irssi input and potential problems

15:08 <+groton___> anyone know how to configure IRRSI such that it creates a single huge log file instead of one for each day?
15:09 <@bushblows> /set autolog_path ~/irclogs/irssi.log
15:20 <+groton___> hi bushblows :) i set it to autolog_path = "~/irclogs/%Y/$tag/$0.log" :)
15:20 <+groton___> so i have at least a subdivision by channel :)
15:21 <@bushblows> groton___: cool, I do similar to that, seperation by network, then channel, then date.
15:21 <+Utopiah> bushblows: and after periods of time (based on algo like mnemonsyn OSS is using) question you to check if you remember, etc...
15:21 <@bushblows> autolog_path = ~/irclogs/$S/$0/$0.%m-%d-%y.log

Discussion on using dictionaries for Vim completion

Are the dictionaries for C-n built on the fly for each buffer? Is there a way to provide one's own dictionaries? Has anybody written a script for that, possibly to share a remote SQL DB of completion words (one that could also be shared with irssi and other tools)?
:help 'cpt
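As far as the Vim documentation for 'complete' goes, a custom word list can be supplied through the 'dictionary' option, and the `k` flag of 'complete' makes C-n scan it. A minimal sketch that prints the corresponding vimrc lines; the helper name `vim_dict_settings` and the paths are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the vimrc lines that add word list $1
# as a completion source for C-n / C-p.
vim_dict_settings() {
    printf 'set dictionary+=%s\n' "$1"   # register the word list
    printf 'set complete+=k\n'           # k: also scan the 'dictionary' files
}

# Usage (assumed paths): vim_dict_settings ~/mydict >> ~/.vimrc
```

This only covers a local file; sharing the words through a remote database would still need a script that exports them to such a file.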

Using Firefox completion

So far using Vimperator+Gvim as default editor looks like the best option

The temporary file created by C-i could possibly be used to build the dictionary too

Inspirations

Chinese/Japanese word completion (ex. “Microsoft pinyin IME”)
- phonetically-type and tab-cycle amongst the different possibilities (possibilities ordered by the frequency of usage of each word)
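The same ordering principle can be tried on the word counts gathered from the logs. The sketch below lists the candidates for a typed prefix, most used first; the helper name `complete_prefix` and the file `mycounts` are hypothetical, and the input is assumed to be in the "count word" format printed by `uniq -c`.

```shell
#!/bin/sh
# Hypothetical helper: list the words in frequency file $2 ("count word" per
# line, as printed by uniq -c) that start with prefix $1, most used first.
complete_prefix() {
    awk -v p="$1" 'NF == 2 && index($2, p) == 1 { print $1, $2 }' "$2" |
        sort -k1,1nr -k2,2 |             # highest count first, ties alphabetical
        awk '{ print $2 }'
}

# Usage (assumed file name): complete_prefix wi mycounts
```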

Experimental Data

Cleaning by removing low-freq words

cat ircutopiah | sed -e "s/[^a-zA-Z]/ /g" | sed -e "s/ /\n/g" | sed -e "s/^.\{1,3\}$//" | sort | uniq -c | awk '$1 > 3 && NF == 2' | wc -l

shows the number of words left after removing the words that appear only 1, 2, 3, ... times
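To compare several cut-off values in one run, the count can be computed for every threshold at once. A sketch under assumptions: the helper name `dict_size_by_threshold` is hypothetical, and it uses `tr` and `awk` in place of the sed steps, with the same rule of dropping words of 1 to 3 letters.

```shell
#!/bin/sh
# Hypothetical helper: for each threshold from 1 to $2, print the threshold and
# the number of distinct words in corpus $1 used more than that many times.
dict_size_by_threshold() {
    tr -c 'a-zA-Z' '\n' < "$1" |          # one word per line, letters only
        awk 'length($0) > 3' |            # drop empty lines and words of 1 to 3 letters
        sort | uniq -c |
        awk -v max="$2" '
            { count[$1]++ }               # count[n] = number of words used n times
            END {
                for (n in count) total += count[n]
                for (t = 1; t <= max; t++) {
                    total -= count[t]     # drop the words used exactly t times
                    print t, total
                }
            }'
}

# Usage: dict_size_by_threshold ircutopiah 5
```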

10 most commonly used words

cat ircutopiah | sed -e "s/[^a-zA-Z]/ /g" | sed -e "s/ /\n/g" | sed -e "s/^.\{1,3\}$//" |sort | uniq -c | sort -n | tail -15

     68 could
     68 like
     72 would
     75 well
     76 think
     85 just
    108 have
    118 with
    120 wiki
    244 that

The words UtopiahGML (1385), Utopiah (793), http (313) and seedea (76) were removed from this list because they are technical tokens rather than proper words

To do

run tests to measure the resulting efficiency
rules to remove small words (suggested by Adama's boyfriend)
possibly generate automatic corrections from statistics on the corrections applied manually with the Firefox spellchecker
- especially if patterns can be detected ("nn" -> "n", etc.)
Integrate a tool like ClearForest Gnosis in order to remove nouns
Explore WORDCOUNT (Tracking the Way We Use Language)
check if http://www.norvig.com/npdict.txt follows the same principle
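The pattern detection mentioned in the list ("nn" -> "n") could start from something like the sketch below. It assumes a hypothetical file of "typed corrected" pairs, one pair per line; how to export such pairs from the Firefox spellchecker is not covered here. The helper name `doubled_letter_fixes` is hypothetical too.

```shell
#!/bin/sh
# Hypothetical helper: read "typed corrected" pairs from file $1 and count the
# corrections that consist of collapsing one doubled letter (e.g. nn -> n).
doubled_letter_fixes() {
    awk 'NF == 2 {
        for (i = 1; i < length($1); i++) {
            c = substr($1, i, 1)
            # Try removing character i when it is followed by the same letter.
            if (c == substr($1, i + 1, 1) &&
                (substr($1, 1, i - 1) substr($1, i + 1)) == $2) {
                fixes[c c " -> " c]++
                break
            }
        }
    }
    END { for (p in fixes) print fixes[p], p }' "$1" | sort -k1,1nr -k2
}

# Usage (assumed file name): doubled_letter_fixes corrections.txt
```

Only this one pattern family is detected; missing doubled letters or swapped letters would need their own rules.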

Page last modified on May 10, 2011, at 09:27 PM