Changes to R_文書ターム行列作成 (creating a document-term matrix in R)

[[Rの備忘録]]

Create a term-document matrix from the results of analyzing Japanese text.

The lsa package cannot parse Japanese text correctly, so we write our own.

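For reference, textmatrix in the lsa package looks like the code below. Leaving the options aside, it retrieves the list of documents in the specified directory with the dir() function and applies the textvector function to each document with lapply(). The textvector function, in short, builds a word frequency table for a single document.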

 textmatrix <- function( mydir, stemming=FALSE, language="german",
     minWordLength=2, minDocFreq=1, stopwords=NULL, 
    vocabulary=NULL ) {
    
    dummy = lapply( dir(mydir, full.names=TRUE), textvector, 
       stemming, language, minWordLength, minDocFreq, stopwords, 
       vocabulary)
    if (!is.null(vocabulary)) {
        dtm = t(xtabs(Freq ~ ., data = do.call("rbind", dummy)))
        result = matrix(0, nrow=length(vocabulary), ncol=ncol(dtm))
        rownames(result) = vocabulary
        result[rownames(dtm),] = dtm[rownames(dtm),]
        colnames(result) = colnames(dtm)
        dtm = result
        gc()
    } else {
        dtm = t(xtabs(Freq ~ ., data = do.call("rbind", dummy)))
    }
    
    environment(dtm) = new.env()
    class(dtm) = "textmatrix"
    
    return ( dtm )
    
 }
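
The same two steps (one frequency table per document, then a cross-tabulation over all documents) can be sketched for Japanese text that has already been split into terms. This is only a minimal sketch, not the lsa implementation: the helper names term_freq and term_doc_matrix, the column names docs, terms and Freq, and the sample tokens are all assumptions made for illustration, and the tokens are assumed to come from a separate morphological analysis step that is not shown here.

```r
# Hypothetical helper: word frequency table for ONE tokenized document.
# `tokens` is a character vector of terms, `doc_id` is the document label.
term_freq <- function(tokens, doc_id) {
  counts <- table(tokens)
  data.frame(
    docs  = rep(doc_id, length(counts)),
    terms = names(counts),
    Freq  = as.vector(counts),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: combine the per-document tables into a
# terms x documents table, in the same spirit as the xtabs() call above.
term_doc_matrix <- function(token_list) {
  freq_list <- Map(term_freq, token_list, names(token_list))
  freq_all  <- do.call("rbind", freq_list)
  # xtabs() sums Freq for every (term, document) pair;
  # pairs that never occur are filled with 0.
  xtabs(Freq ~ terms + docs, data = freq_all)
}

# Made-up sample data: each element is one document, already tokenized.
docs <- list(
  doc1 = c("猫", "が", "好き", "猫"),
  doc2 = c("犬", "が", "好き")
)

tdm <- term_doc_matrix(docs)
print(tdm)
```

The result can be indexed like a matrix, for example tdm["猫", "doc1"] gives the count of 猫 in doc1. Because the input is a named list of token vectors rather than a directory of files, the tokenization step stays separate from the counting step.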
