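Factors in R: more than an annoyance?
One of the basic data types in R is the factor. In my experience factors are basically a pain and I never use them. I always convert to characters. I feel oddly like I'm missing something.
Are there some important examples of functions that use factors as grouping variables where the factor data type becomes necessary? Are there specific circumstances when I should be using factors?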
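You should use factors. Yes, they can be a pain, but my theory is that 90% of why they're a pain is because in read.table and read.csv the argument stringsAsFactors = TRUE by default, and most users miss this subtlety. (R 4.0.0 later changed this default to FALSE.) I say they are useful because model fitting packages like lme4 use factors and ordered factors to differentially fit models and determine the type of contrasts to use. Graphing packages also use them to group by. ggplot and most model fitting functions coerce character vectors to factors, so the result is the same. However, you end up with warnings in your code: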
lm(Petal.Length ~ -1 + Species, data=iris)
# Call:
# lm(formula = Petal.Length ~ -1 + Species, data = iris)
# Coefficients:
# Speciessetosa Speciesversicolor Speciesvirginica
# 1.462 4.260 5.552
iris.alt <- iris
iris.alt$Species <- as.character(iris.alt$Species)
lm(Petal.Length ~ -1 + Species, data=iris.alt)
# Call:
# lm(formula = Petal.Length ~ -1 + Species, data = iris.alt)
# Coefficients:
# Speciessetosa Speciesversicolor Speciesvirginica
# 1.462 4.260 5.552
Warning message:
In model.matrix.default(mt, mf, contrasts) : variable Species converted to a factor
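To make the stringsAsFactors point concrete, here is a minimal sketch that sets the argument explicitly rather than relying on whatever the default is in your R version. The inline CSV text and its column names are made up for illustration.

```r
# Hypothetical inline CSV; read.csv() accepts it through the text argument
csv_text <- "id,fruit\n1,apple\n2,grape\n3,apple"

# Same input, read once as character and once as factor
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$fruit)   # "character"
class(as_fct$fruit)   # "factor"
levels(as_fct$fruit)  # "apple" "grape"
```

Stating the argument in every call keeps a script behaving the same way on both sides of the default change.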
One tricky thing is the whole drop=TRUE bit. In vectors this works well to remove levels of factors that aren't in the data. For example:
s <- iris$Species
s[s == 'setosa', drop=TRUE]
# [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa
s[s == 'setosa', drop=FALSE]
# [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica
However, with data.frames, the behavior of [.data.frame() is different: see ?"[.data.frame". Using drop=TRUE on data.frames does not work as you'd imagine:
x <- subset(iris, Species == 'setosa', drop=TRUE) # subsetting with [ behaves the same way
x$Species
# [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica
Luckily you can drop unused factor levels easily with droplevels(), either for an individual factor or for every factor in a data.frame (since R 2.12):
x <- subset(iris, Species == 'setosa')
levels(x$Species)
# [1] "setosa" "versicolor" "virginica"
x <- droplevels(x)
levels(x$Species)
# [1] "setosa"
This is how to keep levels you've filtered out from showing up in ggplot legends.
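A quick way to see why the leftover levels matter is to tabulate the subset before and after dropping them. This is only a sketch on the same iris data; the variable names are mine.

```r
x <- subset(iris, Species == "setosa")

# Unused levels still show up as zero counts
before <- table(x$Species)
before

# After droplevels() only the level that is actually present remains
after <- table(droplevels(x)$Species)
after

# Re-calling factor() on a single column also discards unused levels
nlevels(factor(x$Species))  # 1
```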
Internally, factors are integers with an attribute level character vector (see attributes(iris$Species) and class(attributes(iris$Species)$levels)), which is clean. If you had to change a level name (and you were using character strings), this would be a much less efficient operation. And I change level names a lot, especially for ggplot legends. If you fake factors with character vectors, there's the risk that you'll change just one element, and accidentally create a separate new level.
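The following sketch illustrates both halves of that paragraph on a small made-up vector: renaming a level touches only the levels attribute, while editing one element of a character vector silently creates a new distinct value.

```r
f <- factor(c("lo", "hi", "lo", "mid"))  # made-up example data

typeof(f)    # "integer": the data are integer codes
unclass(f)   # codes 2 1 2 3 plus a "levels" attribute
levels(f)    # "hi" "lo" "mid"

# Rename one level; every element using that level follows
levels(f)[levels(f) == "lo"] <- "low"
as.character(f)  # "low" "hi" "low" "mid"

# The character-vector equivalent: changing a single element
ch <- c("lo", "hi", "lo", "mid")
ch[1] <- "low"
unique(ch)   # "low" "hi" "lo" "mid": an accidental fourth category
```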
Ordered factors are awesome: if I happen to love oranges and hate apples but don't mind grapes, I don't need to manage some weird index to say so:
d <- data.frame(x = rnorm(20), f = sample(c("apples", "oranges", "grapes"), 20, replace = TRUE, prob = c(0.5, 0.25, 0.25)))
d$f <- ordered(d$f, c("apples", "grapes", "oranges"))
d[d$f >= "grapes", ]
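Because the example above uses random data, here is a small deterministic sketch of the same idea. The input vector is made up; the level order is the one used above.

```r
fruit <- c("apples", "oranges", "grapes", "apples")  # made-up data
f <- factor(fruit, levels = c("apples", "grapes", "oranges"), ordered = TRUE)

# Comparisons follow the declared level order
f >= "grapes"         # FALSE TRUE TRUE FALSE
as.character(max(f))  # "oranges"

# An unordered factor has no such ordering: the comparison yields NA (with a warning)
plain <- factor(fruit)
suppressWarnings(plain >= "grapes")  # NA NA NA NA
```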
A factor is most analogous to an enumerated type in other languages. Its appropriate use is for a variable which can only take on one of a prescribed set of values. In these cases, not every possible allowed value may be present in any particular set of data and the "empty" levels accurately reflect that.
Consider some examples. For some data which was collected all across the United States, the state should be recorded as a factor. In this case, the fact that no cases were collected from a particular state is relevant. There could have been data from that state, but there happened (for whatever reason, which may be a reason of interest) to not be. If hometown was collected, it would not be a factor. There is not a pre-stated set of possible hometowns. If data were collected from three towns rather than nationally, the town would be a factor: there are three choices that were given at the outset and if no relevant cases/data were found in one of those three towns, that is relevant.
Other aspects of factors, such as providing a way to give an arbitrary sort order to a set of strings, are useful secondary characteristics of factors, but are not the reason for their existence.
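The enumerated-type view can be sketched with the three-town example. The town names below are hypothetical; the point is that the set of allowed values is declared up front through the levels argument.

```r
towns <- c("Ames", "Boone", "Nevada")  # hypothetical prescribed set

# No observations happened to come from Boone
obs <- factor(c("Ames", "Ames", "Nevada"), levels = towns)

table(obs)                # Boone is reported with a count of 0
table(as.character(obs))  # as plain strings, Boone disappears entirely

# A value outside the prescribed set is not accepted; it becomes NA
outside <- factor("Des Moines", levels = towns)
is.na(outside)            # TRUE
```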
Factors are fantastic when one is doing statistical analysis and actually exploring the data. However, prior to that, when one is reading, cleaning, troubleshooting, merging and generally manipulating the data, factors are a total pain. Over the past few years a lot of functions have improved to handle factors better. For instance, rbind plays nicely with them. I still find it a total nuisance to have leftover empty levels after a subset function.
# Drop unused levels from every factor column of a data frame using gdata
# ("dataframe" is a placeholder for your own data frame)
library(gdata)
drop.levels(dataframe)
I know that it is straightforward to recode levels of a factor and to rejig the labels and there are also wonderful ways to reorder the levels. My brain just cannot remember them and I have to relearn it every time I use it. Recoding should just be a lot easier than it is.
R's string functions are quite easy and logical to use. So when manipulating I generally prefer characters over factors.
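For reference, the recoding and reordering operations mentioned above can be written in base R roughly as follows. This is a sketch on a made-up factor, not the only way to do it.

```r
f <- factor(c("a", "b", "c", "a"))  # made-up example data

# Recode and merge levels: a named list maps new names to old levels
merged <- f
levels(merged) <- list(first = "a", other = c("b", "c"))
as.character(merged)                          # "first" "other" "other" "first"

# Move one level to the front (for example, as the reference level in a model)
levels(relevel(merged, ref = "other"))        # "other" "first"

# Impose an arbitrary level order without changing the data
levels(factor(f, levels = c("c", "b", "a")))  # "c" "b" "a"
```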
What a snarky title!
I believe many estimation functions allow you to use factors to easily define dummy variables... but I don't use them for that.
I use them when I have very large character vectors with few unique observations. This can cut down on memory consumption, especially if the strings in the character vector are longer-ish.
PS - I'm joking about the title. I saw your tweet. ;-)
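The memory point above can be checked on your own data with object.size(). The sketch below uses made-up labels; the exact byte counts depend on the platform and R version, so none are shown.

```r
set.seed(42)
labels <- c("a fairly long category label, number one",
            "a fairly long category label, number two",
            "a fairly long category label, number three")

# Large vector with only three unique values
ch  <- sample(labels, 1e5, replace = TRUE)
fct <- factor(ch)

object.size(ch)    # size of the character vector
object.size(fct)   # size of the factor (integer codes plus three levels)

# The conversion is lossless
identical(as.character(fct), ch)  # TRUE
```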
Factors are an excellent "unique-cases" badging engine. I've recreated this badly many times, and despite a couple of wrinkles occasionally, they are extremely powerful.
library(dplyr)
d <- tibble(x = sample(letters[1:10], 20, replace = TRUE))
## normalize this table into an indexed value across two tables
id <- tibble(x_u = sort(unique(d$x))) %>% mutate(x_i = row_number())
di <- tibble(x_i = as.integer(factor(d$x)))
## reconstruct d$x when needed
d2 <- inner_join(di, id) %>% transmute(x = x_u)
identical(d, d2)
## [1] TRUE
If there's a better way to do this task I'd love to see it; I don't see this capability of factor discussed.
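Stripped of the dplyr tables, the same indexing idea looks like this in base R: the integer codes play the role of the key column and the levels play the role of the lookup table. The input vector is made up.

```r
x <- c("b", "a", "c", "a", "b")  # made-up data

f      <- factor(x)
codes  <- as.integer(f)  # 2 1 3 1 2: index of each value in the lookup
lookup <- levels(f)      # "a" "b" "c": sorted unique values

# Reconstruct the original vector from codes and lookup
identical(lookup[codes], x)  # TRUE
```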
tapply (and aggregate) rely on factors. The information-to-effort ratio of these functions is very high.
For instance, in a single line of code (the call to tapply below) you can get mean price of diamonds by Cut and Color:
> data(diamonds, package="ggplot2")
> # dm is not defined in the original; assumed here to be diamonds with selected, capitalized columns
> dm = with(diamonds, data.frame(Carat=carat, Cut=cut, Clarity=clarity, Price=price, Color=color))
> head(dm, 3)
  Carat     Cut Clarity Price Color
1  0.23   Ideal     SI2   326     E
2  0.21 Premium     SI1   326     E
3  0.23    Good     VS1   327     E
> tx = with(dm, tapply(X=Price, INDEX=list(Cut=Cut, Color=Color), FUN=mean))
> a = sort(1:dim(tx)[2], decreasing=TRUE) # reverse columns for readability
> tx[,a]
           Color
Cut            J    I    H    G    F    E    D
  Fair      4976 4685 5136 4239 3827 3682 4291
  Good      4574 5079 4276 4123 3496 3424 3405
  Very Good 5104 5256 4535 3873 3779 3215 3470
  Premium   6295 5946 5217 4501 4325 3539 3631
  Ideal     4918 4452 3889 3721 3375 2598 2629
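If ggplot2 is not installed, the same pattern can be tried on the built-in iris data. This sketch computes one mean per level of a factor with tapply and then with aggregate; the variable names are mine.

```r
# Mean petal length per species: one value per factor level
means <- tapply(iris$Petal.Length, iris$Species, mean)
means  # setosa 1.462, versicolor 4.260, virginica 5.552

# aggregate() returns the same information as a data frame
agg <- aggregate(Petal.Length ~ Species, data = iris, FUN = mean)
agg
```

These are the same numbers as the coefficients of the lm(Petal.Length ~ -1 + Species) fit shown earlier, since that model estimates one mean per level.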
Source: https://stackoverflow.com/questions/3445316/factors-in-r-more-than-an-annoyance