10.3 Functions
This section provides a list of some of the functions
that are useful for working with data in R. The
descriptions of these functions are very brief and only some
of the arguments to each function are mentioned. For a complete
description of the function and its arguments, the relevant
function help page should be consulted (see Section 10.4).
This section describes some functions that are useful for
querying and controlling the R software environment
during an interactive session.
- ls()
-
List the symbols that have had values assigned to them during
the current session.
- rm(..., list)
-
Delete one or more symbols (the value that was assigned
to the symbol is no longer accessible).
The symbols to delete are specified by name or as
a list of names.
To delete all symbols in the current session, use
rm(list=ls()) (carefully).
- options(...)
-
Set a global option for the R session by specifying a
new value with an appropriate argument name in the form
optionName=optionValue or
query the current setting for an option by specifying
"optionName".
Typing options() with no arguments returns a list
of all current option settings.
- q()
-
Exit the current R session.
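As a minimal sketch of the session functions above (the symbols counts and label are made up for this illustration):

```r
# Hypothetical symbols, created only for this illustration
counts <- c(3, 5, 8)
label <- "candy"

"counts" %in% ls()        # TRUE: the symbol has a value in this session
rm(counts)                # delete one symbol by name
"counts" %in% ls()        # FALSE

# Set an option; options() returns the previous setting invisibly
old <- options(digits=4)
options("digits")         # query: a list with one component, digits = 4
print(pi)                 # 3.142
options(old)              # restore the previous setting
```

Saving the value returned by options() makes it easy to undo a temporary change.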
10.3.2 Generating vectors
- c(...)
-
Create a new vector by concatenating or combining the
values (or vectors of values) given as arguments.
All values must be
of the same type (or they will be coerced to the same type).
This function can also be used to concatenate lists.
- seq(from, to, by, length.out, along.with)
-
Generate a sequence of numbers from the value from
to (not greater than) the value
to in steps of by, or
for a total of length.out values, or
so that the sequence has the same length as along.with.
The function seq_len(n) is faster for producing the sequence
from 1 to n
and seq_along(x) is faster for producing the sequence from
1 to the number of values in x.
These may be useful for producing very long sequences.
The colon operator, :,
provides a short-hand syntax for
sequences of integer values in steps of 1. The expression
from:to is equivalent to
seq(from, to).
- rep(x, times, each, length.out)
-
Repeat all values in a vector times times, or
each value in the vector each times, or all
values in the vector until the total number of values
is length.out.
- append(x, values, after)
-
Insert the values into the vector x at the position
specified by after.
- unlist(x)
-
Convert a list structure into a vector by concatenating all
components of the list. This is especially useful when a function call
returns a list where each component is a vector.
- rev(x)
-
Reverse the elements of a vector.
- unique(x)
-
Remove any duplicated values from x.
- sum(..., na.rm=FALSE)
-
Sum the value of all arguments. If NA values
are included, the result is NA (unless na.rm=TRUE).
- mean(x)
-
Calculate the arithmetic mean of the values in x.
- min(...), max(...), range(...)
-
Calculate the minimum, maximum, or range of all values in all arguments.
The functions which.min() and which.max()
return the index of the minimum or maximum value within a vector.
- diff(x)
-
Calculate the difference between successive values of x.
The result contains one fewer values than there are in x.
- cumsum(x), cumprod(x)
-
The cumulative sum or cumulative product of the values in x.
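The following sketch exercises several of the vector-generating and numeric functions described so far. The input values are arbitrary and chosen only for illustration.

```r
# Generating vectors
s1 <- seq(1, 10, by=3)             # 1 4 7 10
s2 <- seq(0, 1, length.out=5)      # 0.00 0.25 0.50 0.75 1.00
r1 <- rep(1:2, times=3)            # 1 2 1 2 1 2
r2 <- rep(1:2, each=3)             # 1 1 1 2 2 2
a  <- append(1:5, 99, after=2)     # 1 2 99 3 4 5
u  <- unlist(list(a=1:2, b=3))     # 1 2 3, with names a1, a2, b
v  <- unique(c(2, 2, 7, 7, 1))     # 2 7 1
seq_along(c("x", "y", "z"))        # 1 2 3

# Numeric summaries; x contains one missing value
x <- c(4, NA, 1, 9)
sum(x)                             # NA
sum(x, na.rm=TRUE)                 # 14
range(x, na.rm=TRUE)               # 1 9
which.max(x)                       # 4 (missing values are skipped)
diff(c(1, 4, 9))                   # 3 5
cumsum(1:4)                        # 1 3 6 10
cumprod(1:4)                       # 1 2 6 24
```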
10.3.4 Comparisons
- identical(x, y)
-
Tests whether x and y
are equivalent down to the details of their representation
in computer memory.
- all.equal(target, current, tolerance)
-
Tests whether target and current
differ by only a tiny amount, where “tiny”
is defined by tolerance. This is useful
for testing whether numeric values are equal.
- match(x, table)
-
Determine the location of each element of x in
the set of values in table. The result
is a numeric index the same length as x.
The %in% operator is similar (x %in% table),
but returns a logical
vector the same length as x
reporting whether each element of x
was found in table.
The pmatch() function performs partial matching
(whereas match() is exact).
- is.null(x), is.na(x), is.infinite(x), is.nan(x)
-
These functions should be used to test for the special values
NULL, NA, Inf, and NaN.
- all(...), any(...)
-
Test whether all or any values in one or more logical vectors are
TRUE. The result is a single logical value.
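A short sketch of why all.equal() is preferred over exact comparison for numeric values, together with match() and %in%. The values are arbitrary; isTRUE() is used here because all.equal() returns a description of the difference, rather than FALSE, when the values are not close.

```r
x <- 0.1 + 0.2
x == 0.3                               # FALSE: the stored values differ slightly
identical(x, 0.3)                      # FALSE
isTRUE(all.equal(x, 0.3))              # TRUE: equal within the default tolerance

pos <- match(c("b", "z"), c("a", "b", "c"))     # 2 NA
found <- c("b", "z") %in% c("a", "b", "c")      # TRUE FALSE

miss <- is.na(c(1, NA, 3))             # FALSE TRUE FALSE
any(miss)                              # TRUE
all(miss)                              # FALSE
```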
-
-
Convert the data structure x to a vector of the appropriate type.
- as.Date(x, format, origin)
-
Convert character values or numeric values to Date values.
Character values are converted automatically if they are in ISO
8601 format; otherwise, it may be necessary to describe the
date format via the format argument. The help page
for the strftime() function describes the syntax for
specifying date formats.
When converting
numeric values,
a reference date must
be provided, via the origin argument.
The Sys.Date() function returns today's date as a date value.
The months() function resolves date values just to month names.
There are also functions for weekdays() and quarters().
- round(x, digits), floor(x), ceiling(x)
-
Round a numeric vector, x,
to digits decimal places or to an
integer value.
floor() returns the largest integer not greater than x and
ceiling() returns the smallest integer not less than x.
- signif(x, digits)
-
Round a numeric vector, x, to digits significant digits.
- attributes(x), attr(x, which)
-
Extract a list of all attributes, or just
the attributes named in the character vector which, from the
data structure x.
- names(x), rownames(x), colnames(x), dimnames(x)
-
Extract the names attribute from a vector or list, or the row names
or column names from a two-dimensional data structure, or the
list of names for all dimensions of an array.
- summary(object)
-
Produces a summary of object. The information
displayed will depend on the class of object.
- length(x)
-
The number of elements in a vector, or the number
of components in a list. Also works for data frames
and matrices, though the result may be less intuitive; it gives
the number of columns for a data frame and the total number
of values in a matrix.
- dim(x), nrow(x), ncol(x)
-
The dimensions of a matrix, array, or data frame.
nrow() and ncol() are specifically for two-dimensional
data structures, but dim() will also work for
higher-dimensional structures.
- head(x, n=6), tail(x, n=6)
-
Return just the first or last n elements of a data structure;
the first elements of a vector, the first few rows of a data frame, and
so on.
- class(x)
-
Return the class of the data structure x.
- str(object)
-
Display a summarized, low-level view of a data structure.
Typically, the output is less pretty and more detailed than
the output from summary().
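The conversion and inspection functions above can be sketched as follows. The data frame scores and the matrix m are invented for this illustration.

```r
# Dates: ISO 8601 text, text with an explicit format, and a day count
d <- as.Date("2010-01-31")
as.Date("31/01/2010", format="%d/%m/%Y") == d     # TRUE
as.Date(14640, origin="1970-01-01") == d          # TRUE

# Rounding
round(2.567, 1)        # 2.6
signif(1234.567, 2)    # 1200
floor(-1.5)            # -2
ceiling(-1.5)          # -1

# Exploring data structures (hypothetical data)
scores <- data.frame(id=1:3, score=c(2.5, 3.1, 4.8))
m <- matrix(1:6, nrow=2)
length(scores)         # 2: the number of columns
length(m)              # 6: the total number of values
dim(scores)            # 3 2
names(scores)          # "id" "score"
class(scores)          # "data.frame"
head(scores, 2)        # the first two rows
str(scores)            # compact, low-level description
```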
Subsetting is generally performed via the square bracket operator,
[
(e.g., candyCounts[1:4]). In general, the result
is of the same class as the original data structure that is being
subsetted. The subset may be a numeric vector, a character vector (names),
or a logical vector (the same length as the original data structure).
When subsetting data structures
with more than one dimension--e.g., data frames,
matrices,
or arrays--the subset may be several vectors, separated by commas
(e.g., candy[1:4, 4]).
The double square bracket operator,
[[,
selects only one component of a
data structure.
This is typically used to extract a component from a list.
- subset(x, subset, select)
-
Extract the rows of the data frame x that satisfy the
condition in subset and the columns that are named
in select.
An important special case of subsetting for statistical data sets
is the issue of removing missing values from a data set.
The function na.omit() can be used to remove all rows
containing missing values from a data frame.
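A minimal sketch of these subsetting forms, using a small made-up data frame called sweets (not one of the data sets used elsewhere in the text):

```r
# Hypothetical data, with one missing count
sweets <- data.frame(shape=c("round", "oval", "long", "round"),
                     count=c(2, 0, NA, 3),
                     stringsAsFactors=FALSE)

sweets$count[1:2]                          # numeric index: 2 0
sweets[sweets$shape == "round", ]          # logical index: rows 1 and 4
sweets[1:2, "shape"]                       # rows and column: "round" "oval"
sweets[["count"]]                          # one component, as a plain vector

big <- subset(sweets, count > 1, select=shape)   # rows 1 and 4, shape only
complete <- na.omit(sweets)                      # drops row 3
```

Note that subset() silently drops rows where the condition is NA, whereas a logical index containing NA produces a row of NA values.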
10.3.8 Data import/export
R provides general functions for working with the file system.
- getwd(), setwd(dir)
-
Get the current working directory
or set it to dir. This is where R
will look for files (or start looking for files).
- list.files(path, pattern)
-
List the names of files in the directory given by path,
filtering results with the specified pattern
(a regular expression).
For Linux users who are used to using filename globs with the
ls shell command, this use of regular expressions for
filename patterns can cause
confusion. Such users may find the glob2rx() function helpful.
The complete names of the files, including the path, can be obtained
by specifying full.names=TRUE. Given a full filename,
consisting of a path and a filename,
basename() strips off the path to leave just the filename,
and dirname()
strips off the filename to leave just the path.
- file.path(...)
-
Given the names of nested directories, combine them
using an appropriate separator to form a path.
- file.choose()
-
Interactively select a file (on Windows, using a dialog box
interface).
- file.copy(), file.remove(), file.rename(), dir.create()
-
These functions perform the standard file manager tasks of
copying, deleting, and renaming files and creating new directories.
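The file system functions can be tried safely in a scratch directory. This sketch assumes a temporary directory created with tempfile(), and also uses file.create() and unlink(), which are not described above, to set up and clean up; all file names are invented.

```r
demo <- tempfile("demo_files")          # hypothetical scratch directory
dir.create(demo)
file.create(file.path(demo, c("a.csv", "b.csv", "notes.txt")))

csvs <- list.files(demo, pattern="\\.csv$")           # "a.csv" "b.csv"
same <- list.files(demo, pattern=glob2rx("*.csv"))    # glob converted to a regex
txt <- list.files(demo, pattern="\\.txt$", full.names=TRUE)
basename(txt)                                         # "notes.txt"
dirname(txt)                                          # the directory part

file.copy(txt, file.path(demo, "copy.txt"))
file.rename(txt, file.path(demo, "readme.txt"))
file.remove(file.path(demo, "copy.txt"))
remaining <- list.files(demo)           # "a.csv" "b.csv" "readme.txt"
unlink(demo, recursive=TRUE)            # clean up the scratch directory
```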
There are a number of functions for
reading data from external text files into R.
- readLines(con)
-
Read the text file specified by the filename or path given by con.
The file specification can also be a URL.
The result is a character vector with one element for each line in the file.
- read.table(file, header=FALSE, skip=0, sep="")
-
Read the text file specified by the character value in file,
treating each line of
text as a case in a data set that contains values for each variable
in the data set, with values separated by the character value in sep.
Ignore the first skip lines in the file.
If header is TRUE, treat the first line of the file
as variable names.
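The default behavior is to treat columns that contain only
numbers as numeric and to treat everything else as a factor
(from R 4.0.0 onward, everything else becomes a character vector,
because the default for stringsAsFactors changed to FALSE).
The arguments as.is and stringsAsFactors can be used
to choose between character variables and factors. The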
colClasses argument provides further control over the type
of each column.
This function can be slow on large files because of the work
it does to determine the type of values in each column.
The result of this function is a data frame.
- read.fwf(file, widths)
-
Read a text file in fixed-width format. The name of the file is
specified by file and widths is a numeric vector
specifying the width of each column of values.
The result is a data frame.
- read.csv(file)
-
A front end for read.table() with default argument settings
designed for reading a text file in CSV format.
The result is a data frame.
- read.delim(file)
-
A front end for read.table() with default argument settings
designed for reading a tab-delimited text file.
The result is a data frame.
- scan(file, what)
-
Read data from a text file and produce a vector of values.
The type of the value provided for the argument what
determines how the values in the text file are interpreted.
If this argument is a list, then the result is a list of
vectors, each of a type corresponding to the relevant
component of what.
This function is more flexible and faster than read.table() and its kin,
but the result may be less convenient to work with.
In most cases,
these functions that read a data set from
a text file produce a data frame as the result.
The functions automatically determine
the data type for each column of the data frame, treating anything
that is not a number as a factor (or, since R 4.0.0, as a character
vector), but arguments are provided
to explicitly specify the data types for columns.
Where names of columns are provided in the text file, these
functions may modify the names so that they do not
cause syntax problems, but again arguments are provided to
stop these modifications from happening.
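As a sketch of the points above, the following writes a tiny CSV file to a temporary location and reads it back twice: once with the defaults and once with the column types and names controlled explicitly. The file contents are invented; check.names and colClasses are arguments of read.table() that read.csv() passes along.

```r
csv <- tempfile(fileext=".csv")
writeLines(c("id,first name,score",
             "1,Ann,2.5",
             "2,Bob,3.1"), csv)

readLines(csv)[1]              # the raw header line

# Defaults: types are guessed and names are made syntactically valid
d1 <- read.csv(csv)
names(d1)                      # "id" "first.name" "score"

# Explicit types, and column names left exactly as they appear in the file
d2 <- read.csv(csv, check.names=FALSE,
               colClasses=c("integer", "character", "numeric"))
names(d2)                      # "id" "first name" "score"
types <- sapply(d2, class)     # integer, character, numeric
```

Specifying colClasses also avoids the type-guessing work that can make read.table() slow on large files.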
The XML package provides functions for reading and manipulating
XML documents.
The package foreign contains various functions for
reading data from external files in the various binary formats
of popular statistical programs.
Other popular scientific binary formats can also be read using
an appropriate package, e.g., ncdf for the netCDF format.
Most of the functions for reading files have a corresponding
function to write the relevant format.
- writeLines(text, con)
-
Write a character vector to a text file. Each element of the character
vector is written as a separate line in the file.
- write.table(x, file, sep=" ")
-
Write a data frame to a text file using a delimited format.
The sep argument allows control over the delimiter.
The function write.csv() provides useful defaults for
producing files in CSV format.
- sink(file)
-
Redirect R output to a text file. Instead of
displaying output on the screen, output is saved into a file.
The redirection is terminated by calling sink()
with no arguments.
The function capture.output() provides a convenient
way to redirect output for a single R expression.
Most of these functions read or write an entire file worth
of data in one go. For large data sets, it is also possible
to read or write data in smaller pieces. The functions
file() and close() allow a file to be held open
while reading or writing. Functions that read from files
typically have an argument that specifies a number of lines or
bytes of information to read, and functions that write to files
typically provide an append argument to ensure
that previous content is not overwritten.
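Reading a file in pieces can be sketched as follows, assuming a small temporary file created just for the example. The n argument of readLines() limits how many lines are read from the open connection on each call, and the append argument of cat() adds to the end of an existing file.

```r
path <- tempfile(fileext=".txt")
writeLines(paste("line", 1:5), path)

con <- file(path, open="r")        # hold the file open
first <- readLines(con, n=2)       # "line 1" "line 2"
second <- readLines(con, n=2)      # "line 3" "line 4": reading resumes
close(con)

# Append rather than overwrite
cat("line 6\n", file=path, append=TRUE)
total <- length(readLines(path))   # 6
```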
One important case not mentioned so far
is the export and import of data in an R-specific format,
which is useful for sharing data between colleagues who all use
R.
- save(..., file)
-
Save the symbols
named in ... (and their values), in an R-specific format,
to the specified file.
- load(file)
-
Load R symbols (and their values) from the specified file
(that has been created by a previous call to save()).
- dump(list, file)
-
Write out a text representation of the R data structures named in the
character vector list. The data structures can be recreated in
R by calling source() on the file.
- source(file)
-
Parse and evaluate the R code in file.
This can be used to read data from a file created by dump()
or much more generally to run any R code that has been stored
in a file.
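A minimal round trip through both formats, using a made-up symbol named marks and temporary files:

```r
marks <- c(a=1, b=2)               # hypothetical data

# Binary R-specific format
rfile <- tempfile(fileext=".RData")
save(marks, file=rfile)
rm(marks)
load(rfile)                        # the symbol marks exists again
from_save <- marks

# Text representation that can be re-run as R code
tfile <- tempfile(fileext=".R")
dump("marks", file=tfile)
rm(marks)
source(tfile)                      # evaluates the code written by dump()
from_dump <- marks
```

The file produced by dump() is ordinary R code, so it can be inspected or edited in a text editor, unlike the file produced by save().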
- transform(data, ...)
-
Redefine existing columns within a data frame and append new columns
to a data frame.
Each argument in ... is of the form
columnName=columnValue.
- ifelse(test, yes, no)
-
The test argument is a logical vector.
This function
creates a new vector consisting of the values in the vector yes
when the corresponding element of test is TRUE and the values
in no when test is FALSE.
The switch() function is similar, but allows for
more than two values in test.
- cut(x, breaks)
-
Transform the continuous vector x into a factor.
The breaks argument can be an integer that
specifies how many different levels
to break x into, or it can be a vector of interval
boundaries that are used to cut x into different levels.
An additional labels argument
allows labels to be specified for the levels of the new factor.
- sort(x, decreasing=FALSE)
-
Sort the elements of a vector. Character values are sorted
alphabetically (which may depend on the locale or language
setting).
- order(..., decreasing=FALSE)
-
Determine an ordering based on the elements of one or more vectors.
In the simple case of a single vector, sort(x) is
equivalent to x[order(x)]. The advantage of this function
is that it can be used to reorder more than just a single vector,
plus it can produce an ordering from more than one vector;
it can break ties in one variable using the values from another
variable.
- table(...)
-
Generate a table of counts for one or more factors.
The result is a "table" data structure, with as many
dimensions as there are arguments.
The margin.table() function reduces a table to marginal totals,
prop.table() converts table counts to proportions of marginal
totals, and addmargins() adds margin totals to an existing
table.
- xtabs(formula, data)
-
Similar to table() except factors to cross-tabulate are
expressed in a formula. Symbols in the formula will be searched
for in the data frame given by the data argument.
- ftable(...)
-
Similar to table() except that the result is always
a two-dimensional "ftable" data structure, no matter
how many factors are used. This makes for a more
readable display.
- aggregate(x, by, FUN)
-
Call the function FUN for each subset of x
defined by the grouping factors in the list by.
It is possible to apply the function to multiple variables
(x can be a data frame) and it is possible
to group by multiple factors (the list
by can have more than one component).
The result is
a data frame. The names used in the by list are used
for the relevant columns in the result.
If x is a data frame, then the names of the variables
in the data frame are used for the relevant columns in the result.
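The transformation, sorting, tabulating, and aggregating functions above can be sketched on one small, invented data frame:

```r
# Hypothetical data
d <- data.frame(group=c("a", "b", "a", "b", "a"),
                height=c(150, 172, 181, 165, 158))

# New columns computed from existing ones
d <- transform(d,
               height_m=height / 100,
               tall=ifelse(height > 160, "yes", "no"))

# Continuous values cut into intervals
d$band <- cut(d$height, breaks=c(140, 160, 180, 200))
levels(d$band)                     # "(140,160]" "(160,180]" "(180,200]"

# Order by group, then by decreasing height within group
idx <- order(d$group, -d$height)   # 3 5 1 2 4
d[idx, ]

# Counts and group means
counts <- table(d$group, d$tall)
means <- aggregate(d["height"], by=list(group=d$group), FUN=mean)
means                              # a: 163, b: 168.5
```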
10.3.13 The “apply” functions
- apply(X, MARGIN, FUN, ...)
-
Call a function on each row or each column of a data frame
or matrix.
The function FUN is called for each row of the matrix
X, if MARGIN=1; if
MARGIN=2, the function is called for each
column of X. All other arguments are passed as
arguments to FUN.
The data structure that is returned depends on the value
returned by FUN. In the simplest case, where
FUN returns a single value, the result is a
vector with one value per row (or column) of the original
matrix X.
- sweep(x, MARGIN, STATS, FUN="-")
-
If MARGIN=1,
for row i of x, subtract element i of STATS.
For example, subtract row averages from all rows.
More generally, call the function FUN with row i of x
as the first argument and element i of STATS as the second
argument.
If MARGIN=2, call FUN for each column of x
rather than for each row.
- tapply(X, INDEX, FUN, ...)
-
Call a function once for each subset of the vector X, where the
subsets correspond to unique values of the factor INDEX.
The INDEX argument can be a list of factors, in which case
the subsets are unique combinations of the levels of these factors.
The result depends on how many factors are given in INDEX.
For the simple case where there is only one factor and FUN
returns a single value, the result
is a vector.
- lapply(X, FUN, ...)
-
Call the function FUN once for each component of the
list X. The result is a list. Additional arguments
are passed on to each call to FUN.
- sapply(X, FUN, ...)
-
Similar to lapply(), but will simplify the result to
a vector if possible (e.g., if all components of
X are vectors and FUN returns a single value).
- mapply(FUN, ..., MoreArgs)
-
A “multivariate” apply.
Similar to lapply(), but will call the function FUN
on the first element of each of the supplied arguments, then
on the second element of each argument, and so on.
MoreArgs is a list of arguments to pass to each call
to FUN.
- rapply(object, f)
-
A “recursive” apply.
Calls the function f on each component of the list
object, but if a component is itself a list, then f
is called on each component of that list as well, and so on.
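A short sketch comparing the members of the apply family on small, arbitrary inputs:

```r
m <- matrix(1:6, nrow=2)           # row 1: 1 3 5; row 2: 2 4 6

row_sums <- apply(m, 1, sum)       # 9 12
col_max <- apply(m, 2, max)        # 2 4 6

# Subtract each row's mean from that row
centered <- sweep(m, 1, apply(m, 1, mean))    # both rows become -2 0 2

# One call per group
grp <- tapply(c(10, 20, 30, 40), c("a", "b", "a", "b"), mean)   # a=20, b=30

parts <- list(a=1:3, b=4:6)
as_list <- lapply(parts, sum)      # a list with components 6 and 15
as_vec <- sapply(parts, sum)       # a named vector: a=6, b=15

# Step through two vectors in parallel
pairs <- mapply(function(x, y) x + y, 1:3, 4:6)   # 5 7 9
```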
- rbind(...)
-
Create a new data frame by combining
two or more data frames that have the same columns.
The result is the union of the rows of the original data frames.
This function also works for matrices.
- cbind(...)
-
Create a new data frame by combining
two or more data frames that have the same number of rows.
The result is the union of the columns of the original data frames.
This function also works for matrices.
- merge(x, y)
-
Create a new data frame by combining two data frames
in a database join
operation. The two data frames will
usually have different columns, though they will typically
share at least one column, which is used to match the rows.
Additional arguments allow the matching column to be
specified explicitly.
The default join is an inner join on columns that x
and y have in common.
Additional arguments (all, all.x, and all.y) allow for the
equivalent of left, right, and full outer joins.
- split(x, f)
-
Split a vector or data frame, x, into a list of
smaller vectors or data frames. The factor f
is used to determine which elements of the original vector
or which rows of the original data frame end up in each subset.
- unsplit(value, f)
-
Combine a list of vectors into a single vector.
The factor f determines the order in
which the elements of the vectors are combined.
This function can also be used to combine a list of data frames
into a single data frame (as long as the data frames have the
same number of columns); in this case, f determines
the order in which the rows of the data frames are combined.
- stack(x)
-
Stack the existing columns of data frame x together into a single
column and add a new column that identifies
which original column each value came from.
- aperm(a, perm)
-
Reorder the dimensions of an array. The perm argument
specifies the order of the dimensions.
The special case of transposing a matrix is provided by the
t() function.
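The combining, joining, and splitting functions above can be sketched with two small, invented data frames that share an id column:

```r
# Hypothetical data
left <- data.frame(id=1:3, name=c("Ann", "Bob", "Cy"))
right <- data.frame(id=c(2, 3, 4), score=c(3.1, 4.8, 1.2))

inner <- merge(left, right)                  # ids 2 and 3 only
left_join <- merge(left, right, all.x=TRUE)  # ids 1 to 3; score is NA for id 1
full <- merge(left, right, all=TRUE)         # ids 1 to 4

more_rows <- rbind(left, data.frame(id=4, name="Di"))
more_cols <- cbind(left, flag=c(TRUE, FALSE, TRUE))

# Split a vector into groups, then put it back together
f <- c("x", "y", "x")
parts <- split(right$score, f)               # x: 3.1 1.2; y: 4.8
back <- unsplit(parts, f)                    # 3.1 4.8 1.2

stacked <- stack(data.frame(T1=1:2, T2=3:4)) # columns values and ind
dim(t(matrix(1:6, nrow=2)))                  # 3 2
```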
Functions from the reshape package:
- melt(data, measure.var, id.var)
-
Convert the data, typically a data frame, into “long” form,
where there is a row for every measurement or “dependent” value.
The measure.var argument gives the names or numeric indices
of the variables that contain measurements. All other variables
are treated as labels characterizing the measurements (typically factors).
Alternatively, the id.var argument specifies the label
variables and all others are treated as measurements.
In the resulting data frame,
there is a new, single column of measurements with the name value
and an additional variable of identifying labels, named variable.
- cast(data, formula)
-
Given data in a long form, i.e., produced by melt(),
restructure the data according to the given formula.
In the new arrangement,
variables mentioned on the left-hand side of the formula vary across
rows and variables mentioned on the right-hand side vary across
columns.
In a simple repeated-measures scenario consisting of measurements
at two time points, the data may consist of a variable of subject IDs
plus two variables containing measurements at the two time points.
> library(reshape)
> wide <- data.frame(ID=1:3,
                     T1=rnorm(3),
                     T2=sample(100:200, 3))
> wide
  ID         T1  T2
1  1 -0.5148497 145
2  2 -2.3372321 143
3  3  2.0103460 158
If we melt the data, we produce a data frame with a column named ID,
a column named variable with values T1 or T2, and
a column named value, containing all of the measurements.
> long <- melt(wide,
               id.var=c("ID"),
               measure.var=c("T1", "T2"))
> long
  ID variable       value
1  1       T1  -0.5148497
2  2       T1  -2.3372321
3  3       T1   2.0103460
4  1       T2 145.0000000
5  2       T2 143.0000000
6  3       T2 158.0000000
This form can be recast back to the original wide form as follows.
> cast(long, ID ~ variable)
  ID         T1  T2
1  1 -0.5148497 145
2  2 -2.3372321 143
3  3  2.0103460 158
The function recast() combines a melt and cast in a single
operation.
10.3.17 Text processing
- nchar(x)
-
Count the number of characters in
each element of the character vector x.
The result is a numeric vector the same length as x.
- grep(pattern, x)
-
Search for the regular expression pattern in the character
vector x.
The result is a numeric vector identifying which elements of
x matched the pattern.
If there are no matches, the result has length zero.
The function agrep() allows for approximate matching.
- regexpr(pattern, text)
-
Similar to grep() except that the result is
a numeric vector containing the
character location of the match within each element of text
(-1 if there is no match). The result also has an attribute,
match.length, containing the length of the match.
- gregexpr(pattern, text)
-
Similar to regexpr(), except that the result is
the locations (and lengths)
of all matches within each piece of text. The result is
a list.
- gsub(pattern, replacement, x)
-
Search for the regular expression pattern in the character
vector x and replace all matches with the character value in
replacement. The result is a vector containing the modified text.
The g stands for “global” so all matches are replaced;
there is a sub()
function that just replaces the first match.
The functions toupper() and tolower()
convert character values to all uppercase or all lowercase.
- substr(x, start, stop)
-
For each character value in x, return a subset of the text
consisting of the
characters at positions start through stop, inclusive.
The first character is at position 1.
The function substring() works very similarly, with the
extra convenience that the end character defaults to the end
of the text.
More specialized text subsetting is provided by the
strtrim() function, which removes characters from the end
of text to enforce a maximum length, and abbreviate(), which reduces text
to a given length by removing characters in such a way that
each piece of text remains unique.
- strsplit(x, split)
-
For each character value in x, break the text into separate
character values,
using split as the delimiter. The result is a list,
with one character vector
component for each element of the original vector x.
- paste(..., sep, collapse)
-
Combine text, placing the character value sep in between.
The result is a character vector the same length as the longest
of the arguments, so shorter arguments are recycled.
If the collapse argument is not NULL, the result vector
is then collapsed to a single character value,
with the text collapse placed
in between each element of the result.
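A sketch of the text-processing functions above, applied to a small, arbitrary character vector:

```r
x <- c("apple pie", "banana", "cherry tart")

nchar(x)                           # 9 6 11
grep("an", x)                      # 2: only "banana" matches
loc <- regexpr("a", x)             # first match in each element: 1 2 9
attr(loc, "match.length")          # 1 1 1

sub("a", "A", "banana")            # "bAnana": first match only
gsub("a", "A", "banana")           # "bAnAnA": all matches
toupper("banana")                  # "BANANA"

substr("banana", 2, 4)             # "ana"
strtrim("banana", 3)               # "ban"

words <- strsplit(x, " ")          # a list with one component per element of x
ids <- paste("id", 1:3, sep="-")   # "id-1" "id-2" "id-3"
joined <- paste(c("a", "b"), collapse="+")   # "a+b"
```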
- print(x)
-
This function generates most of the output that we see
on the screen. The important thing to note is that this
function is generic, so the output will depend on the class of
x. For different classes there are also different arguments
to control the way the output is formatted. For example,
there is a digits argument that controls the number
of significant digits that are printed for numeric values,
and there is a quote argument that controls
whether double-quotes are printed around character values.
- format(x)
-
The usefulness of this function is to produce a character
representation of a data structure where all values have a similar format;
for example, all numeric values are formatted to
have the same number of characters in total.
- sprintf(fmt, ...)
-
Generates a character vector using the template given by
fmt. This is a character value with special codes embedded. Each
special code provides a placeholder for values supplied as
further arguments to the sprintf() function
and the special code controls the formatting
of the values. The help page for this function contains a
description of the special codes and their meanings.
The usefulness of this function is to obtain fine
control over formatting that is not available in print()
or format().
- strwrap(x, width)
-
Break long pieces of text into lines, so that each line
is shorter than the given width. The result is a character
vector with one element per line.
- cat(..., sep=" ", fill=FALSE)
-
Displays the values in ... on screen. This function
converts its arguments to character vectors if necessary and
performs no additional formatting of the text it is given.
The fill argument can be used to control when to start
a new line of output, and the sep
argument specifies text to place between arguments.
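The differences between these display functions can be sketched as follows. The values are arbitrary.

```r
print(pi, digits=3)                # [1] 3.14
print("text", quote=FALSE)         # [1] text

# A common width for all values
padded <- format(c(1, 10, 100))    # "  1" " 10" "100"

# Fine control via a template; arguments are processed element by element
fixed <- sprintf("%5.2f", pi)      # " 3.14"
msgs <- sprintf("%s has %d items", c("A", "B"), 2:3)
msgs                               # "A has 2 items" "B has 3 items"

# One element per wrapped line, each shorter than width
lines <- strwrap("The quick brown fox jumps over the lazy dog", width=20)

# No quotes, no index prefix, no automatic newline
cat("x =", 1:3, "\n")
cat(lines, sep="\n")
```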
The functions from the previous section are useful to
display intermediate results from within a loop or
function.
- debug(fun)
-
Following a call to this function, the function fun will
be run one expression at a time, rather than all at once. After
each expression, the values of symbols used within the function can
be explored by typing the symbol name. Typing 'n' (or just pressing
Enter) runs the
next expression; 'c' runs all remaining expressions;
and 'Q' quits from the function.
Paul Murrell
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 New Zealand License.