Tuesday, January 3, 2012

[python] Installing NLTK, a natural language processing library

Notes from my first try at NLTK.

Installing NLTK


Install NLTK, the natural language processing package for Python:
$ wget http://nltk.googlecode.com/files/nltk-2.0.1rc1.tar.gz
$ tar -xvzf nltk-2.0.1rc1.tar.gz
$ cd nltk-2.0.1rc1
$ sudo python setup.py install
$ python

>>> import nltk
>>> nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
    d) Download      l) List      c) Config      h) Help      q) Quit
---------------------------------------------------------------------------
Downloader> d
Download which package (l=list; x=cancel)?
  Identifier> book
    Downloading collection 'book'
       | 
       | Downloading package 'brown' to /home/karino/nltk_data...
       |   Unzipping corpora/brown.zip.
       | Downloading package 'chat80' to /home/karino/nltk_data...
       |   Unzipping corpora/chat80.zip.
       | Downloading package 'cmudict' to /home/karino/nltk_data...
       |   Unzipping corpora/cmudict.zip.
       | Downloading package 'conll2000' to /home/karino/nltk_data...
       |   Unzipping corpora/conll2000.zip.
       | Downloading package 'conll2002' to /home/karino/nltk_data...
       |   Unzipping corpora/conll2002.zip.
       | Downloading package 'dependency_treebank' to
       |     /home/karino/nltk_data...
       |   Unzipping corpora/dependency_treebank.zip.
       | Downloading package 'genesis' to /home/karino/nltk_data...
       |   Unzipping corpora/genesis.zip.
       | Downloading package 'gutenberg' to /home/karino/nltk_data...
       |   Unzipping corpora/gutenberg.zip.
       | Downloading package 'ieer' to /home/karino/nltk_data...
       |   Unzipping corpora/ieer.zip.
       | Downloading package 'inaugural' to /home/karino/nltk_data...
       |   Unzipping corpora/inaugural.zip.
       | Downloading package 'movie_reviews' to
       |     /home/karino/nltk_data...
       |   Unzipping corpora/movie_reviews.zip.
       | Error with downloaded zip file
---------------------------------------------------------------------------
    d) Download      l) List      c) Config      h) Help      q) Quit
---------------------------------------------------------------------------
Downloader> q
True
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/site-packages/nltk/book.py", line 33, in <module>
    text5 = Text(nps_chat.words(), name="Chat Corpus")
  File "/usr/lib/python2.6/site-packages/nltk/corpus/util.py", line 68, in __getattr__
    self.__load()
  File "/usr/lib/python2.6/site-packages/nltk/corpus/util.py", line 56, in __load
    except LookupError: raise e
LookupError: 
**********************************************************************
  Resource 'corpora/nps_chat' not found.  Please use the NLTK
  Downloader to obtain the resource: >>> nltk.download().
  Searched in:
    - '/home/karino/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


Selecting "book" in nltk.download() is supposed to install everything you are likely to need:
[-] book................ Everything used in the NLTK Book

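>>> from nltk.book import * should load the downloaded data, but it fails because corpora/nps_chat cannot be found.
The "book" download above stopped at movie_reviews with "Error with downloaded zip file", so nps_chat was probably never fetched.
I ran nltk.download() again and downloaded all-corpora instead.
630 MB in total.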
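
To see what is actually on disk before re-downloading, you can check the directories listed in the LookupError by hand. The following is only a minimal sketch using the standard library: find_resource is a hypothetical helper of my own, not part of NLTK, and the directory list is simply copied from the error message above. Checking for a .zip next to the unpacked directory is also my assumption. NLTK itself offers nltk.data.find() for the same job.

```python
import os

# Search path copied from the LookupError message above.
SEARCH_PATHS = [
    os.path.expanduser("~/nltk_data"),
    "/usr/share/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/lib/nltk_data",
    "/usr/local/lib/nltk_data",
]


def find_resource(resource, search_paths=SEARCH_PATHS):
    """Return the first existing path for `resource`, or None if it is missing.

    Hypothetical helper (not an NLTK function). `resource` is a name such as
    'corpora/nps_chat', as printed in the LookupError.
    """
    for base in search_paths:
        candidate = os.path.join(base, *resource.split("/"))
        # Assumption: a package may be present unpacked or still as a .zip file.
        for path in (candidate, candidate + ".zip"):
            if os.path.exists(path):
                return path
    return None


if __name__ == "__main__":
    for name in ["corpora/brown", "corpora/nps_chat"]:
        print("%s -> %s" % (name, find_resource(name)))
```

If a resource comes back as None, it has to be downloaded again with nltk.download().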

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


Basic usage

>>> text1.concordance("monstrous")
>>> text1.similar("monstrous")
>>> text2.similar("monstrous")
>>> text4.common_contexts(["monstrous","very"])
>>> text3.generate()
>>> len(text3)
>>> from __future__ import division
>>> text4.count('a') / len(text4)
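These search a text for a word (concordance), list words that appear in the same contexts as a word (similar),
show the contexts shared by two words (common_contexts),
and give the number of words in a text or the relative frequency of a word. Apparently there is a lot more you can do.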
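
To make these operations a bit more concrete, here is a rough pure-Python sketch of what a concordance search, a shared-context lookup, and a relative-frequency count do. It runs on a toy token list, not on the NLTK corpora, and the functions concordance, contexts, and relative_frequency are hypothetical stand-ins I wrote for illustration. They are not NLTK's implementation and their output format differs from NLTK's.

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens on each side."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(" ".join(left + [tok] + right))
    return hits


def contexts(tokens, word):
    """Set of (previous, next) token pairs around each occurrence of `word`."""
    return set(
        (tokens[i - 1], tokens[i + 1])
        for i in range(1, len(tokens) - 1)
        if tokens[i] == word
    )


def relative_frequency(tokens, word):
    """Fraction of tokens equal to `word`, using true division."""
    return tokens.count(word) / float(len(tokens))


# Toy data, made up for this example.
tokens = "a very large whale and a monstrous large wave".split()

print(concordance(tokens, "monstrous", width=1))
# Contexts shared by two words: the intersection of their context sets.
print(contexts(tokens, "very") & contexts(tokens, "monstrous"))
print(relative_frequency(tokens, "a"))
```

The float() call plays the same role as from __future__ import division in the session above: without one of them, Python 2 would truncate the ratio to 0.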
