個人メモを整理中です.不適切な記述が多々あるかと思います.お気づきの際は,ishida-m.at.ias.tokushima-u.ac.jp. までご連絡ください.
Crawley, Using R for Introductory Statistics,p.307
anova は二つの入れ子になったモデルを比較する.
RSS 残差平方和は,モデルとデータの差をはかった尺度である. いま,p 個の変数モデルの RSS(p) が k 個の変数モデルの RSS(k) よりわずかに少ないとする.ただし p > k とする.
k に加えられた新しい変数がさほど重要でないのならば,RSS(k) - RSS(p)は小さいはずである.逆に重要ならば,その差は大きいはずである.そこで
( RSS(k) - RSS(k) ) / RSS(p) F = {( RSS(k) - RSS(k) ) / (p - k )} / { RSS(p) / (n - (p + 1))} = {( RSS(k) - RSS(k) ) / (p - k )} / sigma^2
は自由度 (p - k ), (n - (p + 1) の F 分布に従う.
.Renviron をホームフォルダに置き,例えば
R_LIBS=etc/R
.Rprofile では
grDevices::ps.options(family= "Japan1")
などを書く.
# life<- source( "chap4lifeexp.dat")$value # attach(life) # # 3 因子モデル life.fa3<- factanal(life, factors = 3, scores = "regression") life.fa3 # install.packages("rgl") library(rgl) rgl.clear() #rgl.clear(type="lights") #rgl.clear(type="bbox") rgl.open() rgl.bg(color=c("white","black"))
rgl.spheres( life.fa3$scores[,"Factor1"], life.fa3$scores[,"Factor2"], life.fa3$scores[,"Factor3"], radius=0.05, color = 1: length(life.fa3$scores[,"Factor1"])) rgl.bbox(color="#112233", emission="#90ee90",specular="#556677", shinines=8,alpha=0.8) rgl.texts( life.fa3$scores[,"Factor1"], life.fa3$scores[,"Factor2"], life.fa3$scores[,"Factor3"], abbreviate( names(life.fa3$scores[,"Factor1"])), color = 1: length( life.fa3$scores[,"Factor1"])) # , adj = "left" ) rgl.postscript("filename.eps", fmt="eps" ) # eps 形式で保存 rgl.close() rgl.quit()
> x<- c(1,2,3) > x [1] 1 2 3 > names(x)<- c("A","B","C") > x A B C 1 2 3
> mode(iris[5]) [1] "list" > mode(iris[,5]) [1] "numeric" > is.factor(iris[,5]) [1] TRUE > is.vector(iris[,5]) [1] FALSE # > is.data.frame(iris[5]) [1] TRUE > is.list(iris[5]) [1] TRUE >
この情報はRjoWiki より入手
lc<-Sys.getlocale("LC_CTYPE") Sys.setlocale("LC_CTYPE","C") df<-read.fwf("hoge.dat",width=c(バイト,バイト,バイト)) Sys.setlocale("LC_CTYPE",lc)
http://www.bio.ic.ac.uk/research/mjcraw/therbook/
[ishida@amd64 crawley]$ pwd research/statistics/book/crawley [ishida@amd64 crawley]$ ls therbook.zip
この url から直接データを読むには
gain <- read.table( "http://www.bio.ic.ac.uk/research/mjcraw/therbook/data/Gain.txt", header = T)
name.data<- c("Michiko", "Taro", "Masako", "Jiro","Aiko","Santa") math.data<- c(50, 60, 70, 80, 90, 100) name.math<- data.frame(students = name.data, math = math.data) (gen.data<- rep(c("female", "mdd", "male"), 2)) name.math$gen<- gen.data is.character(name.math$gen) # [1] TRUE levels(name.math$gen) #NULL
# 次の二つの条件指定はうまくいかない
name.math[name.math$gen == c("female", "male"),] # students math gen #1 Michiko 50 female #6 Santa 100 male # name.math[name.math$math == c(50, 60),] # students math gen #1 Michiko 50 female #2 Taro 60 mdd name.math[name.math$math == c(60,70),] #[1] students math gen #<0 rows> (or 0-length row.names)
これは以下の例からも明らかなように
name.math$math == c(50, 60) # [1] TRUE TRUE FALSE FALSE FALSE FALSE name.math$math == c(60, 70) # [1] FALSE FALSE FALSE FALSE FALSE FALSE
c("female", "male") や c(50,60) を 自動的に行数にあわせて リカーシブに繰り返し
name.math[name.math$gen == c("female", "male", "female", "male", "female", "male") name.math$math == c(50, 60, 50, 60, 50, 70)
として全行を一対一で条件判定しているため
RSiteSearch("X11 Ubuntu")
http://permalink.gmane.org/gmane.comp.lang.r.general/23317
Dear Prof Ripley Thanks for your reply and clarification. However: 1. Regarding model.tables() returning "Design is unbalanced". Setting contrasts to Helmert does indeed make the design balanced, but model.tables() still returns "Design is unbalanced": > options()$contrasts unordered ordered "contr.treatment" "contr.poly" > aov(S~rep+trt1*trt2*trt3, data=dummy.data) Call: ... Residual standard error: 14.59899 Estimated effects may be unbalanced > options(contrasts=c("contr.helmert", "contr.treatment")) > aov(S~rep+trt1*trt2*trt3, data=dummy.data) Call: ... Residual standard error: 14.59899 Estimated effects are balanced > model.tables(aov(S~rep+trt1*trt2*trt3, data=dummy.data), se=T) Design is unbalanced - use se.contrasts for se's Tables of effects ... However, this is a relatively minor issue, and covered in ?model.tables which clearly states that "The implementation is incomplete, and only the simpler cases have been tested thoroughly." 2. You point out that "In either case you can predict something you want to estimate and use predict(, se=TRUE)." Doesn't this give the standard error of the predicted value, rather than the mean for, say, trt1 level 0? For example: > predict(temp.lm, newdata=data.frame(rep='1', trt1='0', trt2='1', trt3='0'), se=T) $fit [1] 32 $se.fit [1] 10.53591 $df [1] 23 $residual.scale [1] 14.59899 Whereas from the analysis of variance table we can get the standard error of the mean for trt1 as being sqrt(anova(temp.lm)[9,3]/12) = 4.214365. It is the equivalent of this latter value that I'm after in the glm() case. >>> Prof Brian Ripley<ripley<at> 03/08/04 18:10:56 >>> On Tue, 3 Aug 2004, Peter Alspach wrote: [Lines wrapped for legibility.] > I'm having a little difficulty getting the correct standard errors from > a glm.object (R 1.9.0 under Windows XP 5.1). predict() will gives > standard errors of the predicted values, but I am wanting the standard > errors of the mean. > > To clarify: > > Assume I have a 4x3x2 factorial with 2 complete replications (i.e. 48 > observations, I've appended a dummy set of data at the end of this > message). Call the treatments trt1 (4 levels), trt2 (3 levels) and trt3 > (2 levels) and the replications rep - all are factors. The observed > data is S. Then: > > temp.aov<- aov(S~rep+trt1*trt2*trt3, data=dummy.data) > model.tables(temp.aov, type='mean', se=T) > > Returns the means, but states "Design is unbalanced - use se.contrasts > for se's" which is a little surprising since the design is balanced. If you used the default treatment contrasts, it is not. Try Helmert contrasts with aov(). > Nevertheless, se.contrast gives what I'd expect: > > se.contrast(temp.aov, list(trt1==0, trt1==1), data=dummy.data) > [1] 5.960012 > > i.e. standard error of mean is 5.960012/sqrt(2) = 4.214, which is the > sqrt(anova(temp.aov)[9,3]/12) as expected. Similarly for interactions, > e.g.: > > se.contrast(temp.aov, list(trt1==0 & trt2==0, trt1==1 & trt2==1), data=dummy.data)/sqrt(2) > [1] 7.299494 > > How do I get the equivalent of these standard errors if I have used > lm(), and by extension glm()? I think I should be able to get these > using predict(..., type='terms', se=T) or coef(summary()) but can't > quite see how. In either case you can predict something you want to estimate and use predict(, se=TRUE).
sc<- kkmeans(as.matrix(iris[,-5]), center = 3) slotNames(sc) sc@centers sc@.Data table(iris$Species, sc@.Data)
kkmeans には predict 関数は用意されていない
# string kernel の作成 library(kernlab) # sample, Lodhi p.423 test.str<- list("science is organized knowledge", "wisdom is organized life") str.kern.list<- NULL for(i in 1:6){ str.dot<- stringdot(length = i, lambda = 0.5, type = "sequence", normalized = TRUE) str.kern.list[i]<- list(kernelMatrix(str.dot, test.str)) } str.kern.list
## tm パッケージ
library(tm) data("acq") summary(acq) show(acq) inspect(acq[1]) # acq<- tmMap(acq, asPlain, convertReut21578XMLPlain) # summary(acq) # inspect(acq[1])
# 空白の処理
acq<- tmMap(acq, stripWhitespace) inspect(acq[1])
# stopWords 処理
acq<- tmMap(acq, removeWords, stopwords("english")) inspect(acq[1])
# stemming 処理
acq<- tmMap(acq, stemDoc) inspect(acq[1])
# 小文字への変換
acq<- tmMap(acq, tmTolower) inspect(acq[1])
# タグ情報に関するクエリーの発行
query<- "identifier == '10'" tmFilter(acq, query)
# フルサーチ
kekka<- tmFilter(acq, FUN = searchFullText, "comput", doclevel = TRUE) inspect(kekka)
## ksvm tmp<- sample(1:150, 100) iris.x<- iris[tmp,] iris.y<- iris[-tmp,]
results<- ksvm(Species ~., data = iris.x) slotNames(results) slot(results, "SVindex") results2<- predict(results, new = iris.y)
table(iris.y$Species, results2)
td<- tempfile() dir.create(td) write(c("Human machine interface for ABC computer applications"), file=paste(td, "D1", sep="/") ) write(c("A survey of user opinion of computer system response time"), file=paste(td, "D2", sep="/") ) write(c("The EPS user interface management system"), file=paste(td, "D3", sep="/") ) write(c("System and human system engineering testing of EPS"), file=paste(td, "D4", sep="/") ) write(c("Relation of user perceived response time" , "to error measurement"), file=paste(td, "D5", sep="/") ) write(c("The intersection graph of paths in trees"), file=paste(td, "D6", sep="/") ) write(c("Graph minors IV: Widths of trees and well-quasi-ordering"), file=paste(td, "D7", sep="/") ) write(c("The generation of random, binary,", "ordered trees"), file=paste(td, "D8", sep="/") ) write(c("Graph minors: A survey"), file=paste(td, "D9", sep="/") ) td.doc <- TextDocCol(DirSource(td), # tm パッケージ readerControl = list(reader = readPlain, language = "en_US", load = TRUE)) summary(td.doc) inspect(td.doc)
str.dot<- stringdot(length = 4, lambda = 0.5, type = "sequence", normalized = TRUE) # kernlab パッケージ
test.kern<- kernelMatrix(str.dot, td.doc) td.doc.specc<- specc(td.doc, centers = 2, kernel = "stringdot")
lc <- Sys.getlocale("LC_CTYPE")# utf-8 以外の環境の場合 Sys.setlocale("LC_CTYPE","C")
library(lsa)
一時ファイルで実験する場合
td<- tempfile() dir.create(td) write(c("Human", "machine", "interface", "for", "ABC", "computer", "applications"), file = paste(td, "D1", sep="/") ) write(c("A", "survey", "of", "user", "opinion", "of", "computer", "system", "response", "time"), file = paste(td, "D2", sep="/") ) write(c("The", "EPS", "user", "interface", "management", "system"), file=paste(td, "D3", sep="/") ) write(c("System", "and", "human", "system", "engineering", "testing", "of", "EPS"), file = paste(td, "D4", sep="/") ) write(c("Relation", "of", "user", "perceived", "response", "time" ,"to", "error", "measurement"), file = paste(td, "D5", sep="/") ) write(c("The", "intersection", "graph", "of", "paths", "in" ,"trees"), file = paste(td, "D6", sep="/") ) write(c("Graph", "minors", "IV:", "Widths", "of", "trees", "and", "well-quasi-ordering"), file = paste(td, "D7", sep="/") ) write( c("The", "generation", "of", "random,", "binary,", "ordered", "trees"), file = paste(td, "D8", sep="/") ) write(c("Graph", "minors:", "A", "survey"), file = paste(td, "D9", sep="/") ) ####################
単純に文章ターム行列を作ってみる
myMatrix<- textmatrix(td)
stopword をロード
data(stopwords_en)
stopword と stemming を指定しての文書・ターム行列作成
myMatrix<- textmatrix(td, stopwords = stopwords_en, stemming = TRUE)
必要なら重みを付け
# myMatrix = lw_logtf(myMatrix) * gw_idf(myMatrix)
生の検索語の設定
myQuery<- query("user interface", rownames(myMatrix), stemming = TRUE ) myMat.Que<- cbind(myMatrix, myQuery) as.matrix(round(cosine(myMat.Que), dig = 2)[,10])
単純な特異値分解
# myLSAraw<- lsa(myMatrix, dims = dimcalc_raw()) # 復元 # round(myLSAraw$tk %*% diag(myLSAraw$sk) %*% t(myLSAraw$dk), digit = 2)
LSA を実行してみる dimcalc_share(0.4) は許容する特異値の数を指定
myLSAspace<- lsa(myMatrix, dims = dimcalc_share(0.4))
myLSAspace # もとの文書行列では 0 の要素にも索引重みが計算されている round(myLSAspace$tk, digits= 2)
もとの文書ベクトルを 3 次元で近似する
new3Doc<- t(myLSAspace$tk) %*% myMatrix # plot(new3Doc[1,], new3Doc[2,])
library(rgl)
rgl.open() rgl.bg(color=c("white", "black"))
rgl.spheres(new3Doc[1,], new3Doc[2,], new3Doc[3,], radius = 0.01,color = 1: ncol(new3Doc))
rgl.bbox(color= "#112233", emission = "#90ee90",specular = "#556677", shinines = 8, alpha = 0.8)
rgl.texts(new3Doc[1,], new3Doc[2,], new3Doc[3,], rownames(myLSAspace$dk), color = 1:ncol(new3Doc), cex = 1.2) # , adj = "left" )
# rgl.viewpoint # rgl.snapshot(file = "sla.png", fmt = "png")
rgl.postscript("sla.eps", fmt="eps" )
for (i in seq(2,20,2)) { rgl.viewpoint(i,20) filename<- paste("lsa-",formatC(i, digits=2, flag="0"),".eps",sep="") rgl.postscript(filename, fmt="eps" ) }
rgl.close()
3 次元に圧縮した文書行列による検索 この結果を使って検索
# query("user interface", rownames(myLSAspace$tk), stemming = TRUE )
## myQuery2<- query("user interface", rownames(myLSAspace$tk), stemming = TRUE ) ## myMat.Que2<- cbind(myLSAspace$tk, myQuery2) ## cosine(myMat.Que2 ) # USER INTERFACE 列との相関の程度で ## nrow( myQuery2 ) ## ncol( myQuery2 ) myQuery3<- query("user interface", rownames(myLSAspace$tk), stemming = TRUE ) new3Query<- t(myLSAspace$tk) %*% myQuery3 myMat.Que3<- cbind(new3Doc, new3Query) as.matrix(round(cosine(myMat.Que3), dig = 2)[,10])
unlink(td, recursive=TRUE) Sys.setlocale("LC_CTYPE",lc)
結局,http://cran.md.tsukuba.ac.jp/doc/manuals/R-exts.pdf を熟読するに限るのだが,
このサイト は分かり易い.以下も引用させていただきました.
/* g++ -O2 `mecab-config --cflags` myfunc.c -o myfunc `mecab-config --libs` -I/usr/local/lib64/R/include */ #include<R.h> #include<Rinternals.h> SEXP myfunc(SEXP param, SEXP vecparam, SEXP aa) { SEXP ans; double a = REAL(param)[0]; int len1 = length(param); int len2 = length(vecparam); int p1 = INTEGER(vecparam)[0]; int p2 = INTEGER(vecparam)[1]; char* str = CHAR(STRING_ELT(aa,0)); Rprintf("%s\n",str); Rprintf("length of 1: %d\n",len1); Rprintf("length of 2: %d\n",len2); Rprintf("input param: %lf, %d, %d\n",a,p1,p2); PROTECT(ans = allocVector(INTSXP, p1*p2)); for (int i = 0; i< p1*p2; i++) INTEGER(ans)[i] = i; UNPROTECT(1); return(ans); }
[ishida@amd64 myRcode]$ R CMD SHLIB myfunc.c [ishida@amd64 myRcode]$ R
c プログラムテスト
> dyn.load("myfunc.so") # > ret = .Call("myfunc",1.15,as.integer(c(2,3)), "hogeほげ")
# C プログラムとして
ishida/research/statistics/myRcode/mecab.c ファイルを作成
#include<R.h> #include<Rdefines.h> #include<Rinternals.h> #include<mecab.h> #include<stdio.h> #define CHECK(eval) if (! eval) { \ fprintf (stderr, "Exception:%s\n", mecab_strerror (mecab)); \ mecab_destroy(mecab); \ return -1; } SEXP mecab(SEXP aa){ SEXP parsed; const char* input = CHAR(STRING_ELT(aa,0)); mecab_t *mecab; mecab_node_t *node; const char *result; int i; mecab = mecab_new2 (input); CHECK(mecab); result = mecab_sparse_tostr(mecab, input); CHECK(result); Rprintf ("INPUT: %s\n", input); Rprintf ("RESULT:\n%s", result); PROTECT(parsed = allocVector(STRSXP,1)); SET_STRING_ELT(parsed, 0, mkChar(result)); //PROTECT(parsed = mkString(result)); UNPROTECT(1); mecab_destroy(mecab); return(parsed); }
コンパイルは
% R CMD SHLIB chartest.c -L/usr/local/lib/ -lmecab -I/usr/local/include
# R 側で
dyn.load("research/statistics/myRcode/mecab.so") kekka<- .Call("mecab","すもももももももものうち") kekka2<- NULL kekka2<- unlist(strsplit(kekka, "\n")) reg<- NULL kekka3<- NULL for(i in 1 :length(kekka2)){ reg<- regexpr("^(\\w+)\t(\\w+)", kekka2[i]) kekka3<- c(kekka3, substring(kekka2[i], reg[1], attributes(reg)[[1]])) } kekka3
Crawley The R Book p.293
# string kernel の作成
library(kernlab) # sample, Lodhi p.423 test.str<- list("science is organized knowledge","wisdom is organized life") str.kern.list<- NULL for(i in 1:6){ str.dot<- stringdot(length = i, lambda = 0.5, type = "sequence", normalized = TRUE) str.kern.list[i]<- list(kernelMatrix(str.dot, test.str)) }
## 日本語の方は 6 バイト扱いで計算している
test.str.jp<- list("これと","これは") str.kern.list.jp<- NULL for(i in 1:6){ str.dot<- stringdot(length = i, lambda = 0.5, type = "sequence", normalized = TRUE) str.kern.list.jp[i]<- list(kernelMatrix(str.dot, test.str.jp)) } test.str<- list("car","cat") str.kern.list <- NULL for(i in 1:6){ str.dot<- stringdot(length = i, lambda = 0.5, type = "sequence", normalized = TRUE) str.kern.list[i]<- list(kernelMatrix(str.dot, test.str)) }
Crawley The R Book p.293 p. 146
plot(0:10, 0:10, xlab = "", ylab = "", xaxt = "n", yaxt = "n")
John Fox p.127 -- 153
Crawley The R Book p.365
mosaicplot(Titanic[c("1st","2nd","3rd"),,"Adult",], main = "Survival on the Titanic", shade = T)
John Verzani p.336 の構造モデルから判断すると,すべてベース (Intercept) との切片の差ということになる.
二元配置の分散分析
frogs3<- read.csv( "http://150.59.18.68/frogs3.csv", header = FALSE) frogs3 # header = FALSE で,列名はファイルに未設定と指示
なお列名が未定義の場合,自動的に V1, V2, V3 などの名前が付加される 二つの要因がある場合,それらをチルダ記号の右側に + 記号で指定する
frogs3.aov<- aov(V1 ~ V2 + V3, data = frogs3) summary(frogs3.aov) summary.lm(frogs3.aov)
Intercept は V2 = 12H かつ V3 = 100ug の場合.繰り返し数 3 この標準偏差は sqrt(7.51/6).これは V2 V3 の自由度の積か 2行目の V224H は sqrt(2 *7.51/9).9 は V2 の繰り返し数か
Intercept は V2 が 12H で V3 が 100 ug の場合 2行目 V224H は V2 が 24H の場合の Intercept(V2=12Hかつ V3=100ugの場合) との差
同じく,p.332 によれば共分散分析では,連続量はスロープを表す.
regrowth<- read.table( "http://www.bio.ic.ac.uk/research/mjcraw/ therbook/data/ipomopsis.txt", header = T) ancova1<- lm(Fruit ~ Grazing * Root) summary(ancova1) anova(ancova1)
Crawley The R Book [#ha05dc7e]
p. 492 - 498
Faraway (2006) よりデータを借用
babyfood<- read.table(file = "http://150.59.18.68/babyfood.txt") babyfood
# データから要因別に罹患比率を求めて分割表にする.xtabs() 関数を利用
xtabs(disease/(disease+nondisease) ~ sex + food, babyfood)
# ロジスティック回帰分析を実行する
# 目的変数を 2 項分布とした一般化線形モデル glm() による
model1<- glm(cbind(disease, nondisease) ~ sex + food, family = binomial, data = babyfood)
# glm は一般化線形モデルを実行する関数.family は分布を指定する
summary(model1) # 要約を見る drop1(model1, test = "Chi") # 各項は有意か exp(-.669) # 母乳の効果を確認する model.matrix(model1) # Intercept は Boy で Bottle # sexGirl は Girl の場合の Intercept との差 # foodBreast は Intercept (Boy Bottle) の場合に比べての差 # foodSuppl は Intercept (Boy Bottle) の場合に比べての差