R How to Save Dataset You Uploaded

[This article was first published on Rcrastinate, and kindly contributed to R-bloggers].



What I will show you

In this post, I want to show you a few ways to save your datasets in R. Maybe this seems like a dumb question to you. But after giving quite a few R courses, mainly (but not only) for R beginners, I came to acknowledge that the answer to this question is not obvious and that the different possibilities can be confusing. In this post, I want to give an overview of the different alternatives and also state my opinion on which way is best in which situation.

Why would you want to know that?

Well, there are quite a few tutorials out there on how to read data into R. RStudio even has a special button for this in the 'Environment' tab: it's labelled 'Import Dataset'. But there is no button for saving data, and there are also fewer tutorials on it. That's strange, isn't it? After you import your data, you might do some (sometimes lengthy) manipulation, aggregation, selection and other stuff. If all that stuff takes several minutes (or even longer), you might not want to do it every time you work with the data. So, you might want to save your dataset at a stage that's pre-analysis but post-processing (where 'processing' might include cleaning, manipulating, computing new variables, merging, selecting, aggregating and lots of other stuff).

What are we going to do?

I will show you the following ways of saving or exporting your data from R:

  • Saving it as an R object with the functions save() and saveRDS()
  • Saving it as a CSV file with write.table() or fwrite()
  • Exporting it to an Excel file with WriteXLS()

For me, these options cover at least 90% of the stuff I have to do at work. So I hope that it'll work for you, too.

Preparation: Load some data

I will use a fairly (but not very) large dataset from the car package. The dataset is called MplsStops and holds information about stops made by the Minneapolis Police Department in 2017. Of course, you can access this dataset by installing and loading the car package and typing MplsStops. However, I want to simulate a more typical workflow here, namely loading a dataset from your disk (I will load it over the WWW). The dataset is also available from GitHub:

library(knitr)       # provides kable()
library(kableExtra)  # provides scroll_box()

data <- read.table("https://vincentarelbundock.github.io/Rdatasets/csv/carData/MplsStops.csv",
                   sep = ",", header = TRUE,
                   row.names = 1)
scroll_box(kable(head(data), row.names = FALSE),
           width = "100%", height = "300px")
| idNum | date | problem | MDC | citationIssued | personSearch | vehicleSearch | preRace | race | gender | lat | long | policePrecinct | neighborhood |
| 17-000003 | 2017-01-01 00:00:42 | suspicious | MDC | NA | NO | NO | Unknown | Unknown | Unknown | 44.96662 | -93.24646 | 1 | Cedar Riverside |
| 17-000007 | 2017-01-01 00:03:07 | suspicious | MDC | NA | NO | NO | Unknown | Unknown | Male | 44.98045 | -93.27134 | 1 | Downtown West |
| 17-000073 | 2017-01-01 00:23:15 | traffic | MDC | NA | NO | NO | Unknown | White | Female | 44.94835 | -93.27538 | 5 | Whittier |
| 17-000092 | 2017-01-01 00:33:48 | suspicious | MDC | NA | NO | NO | Unknown | East African | Male | 44.94836 | -93.28135 | 5 | Whittier |
| 17-000098 | 2017-01-01 00:37:58 | traffic | MDC | NA | NO | NO | Unknown | White | Female | 44.97908 | -93.26208 | 1 | Downtown West |
| 17-000111 | 2017-01-01 00:46:48 | traffic | MDC | NA | NO | NO | Unknown | East African | Male | 44.98054 | -93.26363 | 1 | Downtown West |

We now have a dataset with over 50,000 rows (you can scroll through the first 6 of them in the box above) and 14 variables in our global environment (the 'workspace'). But for the sake of simulating a real workflow, I will do some very light data manipulation. Here, I'm assigning a new column data$gender.not.known which is TRUE whenever data$gender is "Unknown" or NA.

data$gender.not.known <- is.na(data$gender) | data$gender == "Unknown"
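To see why the is.na() check is part of that line, here is a minimal sketch on a made-up vector (the values in gender below are invented for illustration and are not taken from MplsStops). A comparison with NA yields NA, and combining it with is.na() via | turns that entry into TRUE.

```r
# Toy vector standing in for data$gender (values are invented)
gender <- c("Male", "Unknown", NA, "Female")

# The comparison alone leaves an NA in the result
only_compare <- gender == "Unknown"

# Combining it with is.na() gives a logical vector without NA
not_known <- is.na(gender) | gender == "Unknown"

print(only_compare)
print(not_known)
```

Without the is.na() part, the new column would contain NA for every stop with a missing gender, which is easy to overlook when you later filter on it.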

As I wrote above: saving the current state of your dataset in R makes sense when all the preparations take a lot of time. If they don't, you can simply run your pre-processing code every time you get back to analyzing the dataset. In the scope of this post, let's suppose that the calculation above took veeeery long and you absolutely don't want to run it every time.

Option 1: Save as an R object

Whenever I'm the only one working on a project, or everybody else is also using R, I like to save my datasets as R objects. Basically, it's just saving a variable/object (or several of them) in a file on your disk. There are two ways of doing this:

  1. Use the function save() to create an .Rdata file. In these files, you can store several variables.
  2. Use the function saveRDS() to create an .Rds file. You can only store one variable in it.

Option 1.1: save()

You can save your data simply by doing the following:

save(data, file = "data.Rdata")

By default, the parameter compress of the save() function is turned on. That means that the resulting file will use less space on your disk. However, if it is a really huge dataset, it could take longer to load it later because R first has to decompress the file again. So, if you want to save space, leave it as it is. If you want to save time, add the parameter compress = FALSE.

If you want to load such an .Rdata file into your environment, simply do
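The space side of this trade-off can be checked directly. The following is a minimal sketch, assuming a small made-up data frame called toy and temporary files (both are hypothetical stand-ins, not part of the workflow above). It saves the same object with and without compression and compares the file sizes.

```r
# Made-up, repetitive data frame so that compression has a visible effect
toy <- data.frame(id = 1:50000, group = rep(c("a", "b"), 25000))

file_small <- tempfile(fileext = ".Rdata")
file_large <- tempfile(fileext = ".Rdata")

save(toy, file = file_small)                    # compression is on by default
save(toy, file = file_large, compress = FALSE)  # skip compression

# Compare the sizes on disk (in bytes)
size_small <- file.size(file_small)
size_large <- file.size(file_large)
print(c(compressed = size_small, uncompressed = size_large))
```

How much space you gain, and how much loading time you lose, depends on the data, so it is worth trying both settings on your own dataset before deciding.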

load(file = "data.Rdata")

Then, the object is available in your workspace with its old name. Here, the new variable will also have the name data. With save() you can also save several objects in one file. Let's duplicate data to simulate this.

data2 <- data
save(list = c("data", "data2"), file = "data.Rdata")

Now, if you do load("data.Rdata"), you will have two more objects in your workspace, namely data and data2.

Option 1.2: saveRDS()

This is the second option for saving R objects. saveRDS() can only be used to save one object in one file. The "loading function" for saveRDS() is readRDS(). Let's try it out.

saveRDS(data, file = "data.Rds")
data.copy <- readRDS(file = "data.Rds")

Now, you have another R object in your workspace which is an exact copy of data. The compress parameter is also available for saveRDS().

Note that you cannot "mix" the saving and loading functions: save() goes together with load(), saveRDS() goes together with readRDS().
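Both points can be checked with a short sketch. It assumes a made-up data frame toy and temporary files (hypothetical names chosen for illustration). First it confirms that the object read back with readRDS() is identical to the original, then it tries to read a file written by save() with readRDS() and catches the resulting error.

```r
# Made-up example object
toy <- data.frame(id = 1:3, label = c("a", "b", "c"))

rds_file   <- tempfile(fileext = ".Rds")
rdata_file <- tempfile(fileext = ".Rdata")

# Matching pair: saveRDS() with readRDS()
saveRDS(toy, file = rds_file)
toy_copy <- readRDS(rds_file)
print(identical(toy, toy_copy))

# Mismatched pair: a file written by save() cannot be read by readRDS()
save(toy, file = rdata_file)
mixed <- tryCatch(readRDS(rdata_file), error = function(e) "failed")
print(mixed)
```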

The difference between save() and saveRDS()

So, you might ask "why should I use saveRDS() instead of save()"? Actually, I like saveRDS() better, for one specific reason that you might not have noticed in the calls above. When we use load(), we do not assign the result of the loading process to a variable because the original names of the objects are used. But this also means that you have to "remember" the names of the previously used objects when using load().

When we use readRDS(), we have to assign the result of the reading process to a variable. This might mean more typing, but it also has the advantage that you can choose a new name for the variable to integrate it into the rest of the new script more smoothly. Also, it is more similar to the behavior of all the other "reading functions" like read.table(): for these, you also have to assign the result to a variable. The only advantage of save() really is that you can save several objects into one file, but in the end it might be better to have one file for one object. This might be more clearly organized.
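If you are stuck with an .Rdata file and do not remember what it contains, one possible workaround is sketched here. It assumes two made-up objects, toy_a and toy_b, and a temporary file (all hypothetical). load() invisibly returns the names of the objects it restored, and its envir argument lets you load them into a separate environment instead of your workspace.

```r
# Two made-up objects saved into one file
toy_a <- data.frame(x = 1:3)
toy_b <- data.frame(y = c("u", "v"))
rdata_file <- tempfile(fileext = ".Rdata")
save(toy_a, toy_b, file = rdata_file)

# Load into a fresh environment so nothing in the workspace is overwritten
loaded <- new.env()
object_names <- load(rdata_file, envir = loaded)
print(object_names)

# Pick one object and give it a name of your choice
my_data <- get("toy_a", envir = loaded)
```

This gets you close to the readRDS() behavior, at the cost of a few extra lines.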

Option 2: Save as a CSV file

Whenever you are not sure who will work with the data later on and whether these people are all using R, you might want to export your dataset as a CSV file. Also, it's human readable. Also, if you provide a dataset on some website (e.g. in the Dataverse) for other researchers, it is kind to provide a CSV file because everyone can open it with their preferred statistical software package.

Option 2.1: write.table()

You can think of write.table() as the "reverse" of read.table(). Even the parameters are quite similar. Let's try it:

write.table(data, file = "data.csv",
            sep = "\t", row.names = F)

We just saved the data.frame stored in data as a CSV file with tabs as field separators. We also suppressed the row names. I don't know why, but by default, write.table() stores the row names in the file, which I find a little strange.

Oh, and you can also use write.table() to append the contents of your data.frame at the end of the file: just set the parameter append to TRUE. This is great whenever you want to "fill" a file in multiple steps (e.g., in a for loop). Remember to suppress the column names if you're appending content to files because you don't want them to be repeated throughout the file: just set col.names = F.
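Filling a file step by step could look like the following minimal sketch. The data frame toy, its split into three chunks and the temporary output file are made up for illustration. The header is written only with the first chunk, and every later chunk is appended without column names.

```r
# Made-up data, split into three chunks of two rows each
toy <- data.frame(id = 1:6, value = c(2.5, 3.1, 4.0, 1.2, 5.5, 0.7))
chunks <- split(toy, rep(1:3, each = 2))

out_file <- tempfile(fileext = ".csv")

for (i in seq_along(chunks)) {
  first <- i == 1
  write.table(chunks[[i]], file = out_file, sep = "\t",
              row.names = FALSE,
              col.names = first,   # write the header only once
              append = !first)     # create the file first, append afterwards
}

# Read the file back to check that all rows arrived under a single header
result <- read.table(out_file, sep = "\t", header = TRUE)
print(result)
```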

Option 2.2: fwrite()

Is your dataset really huge, like several gigabytes of data? Then try giving fwrite() from the data.table package a spin! It uses multiple CPU cores for writing data. Just like fread() from the same package, it is much, much faster for larger files. Another advantage: the row.names parameter is FALSE by default. The most widely used parameters have the same names as in write.table(). Neat!

library(data.table)

t0 <- Sys.time()
for (i in 1:10) {
  write.table(data,
              file = "writetable.data.csv",
              sep = "\t", row.names = F)
}
difftime(Sys.time(), t0)
## Time difference of 2.742357 secs

t1 <- Sys.time()
for (i in 1:10) {
  fwrite(data,
         file = "fwrite-data.csv",
         sep = "\t")
}
difftime(Sys.time(), t1)
## Time difference of 0.191942 secs

See that? Even with just 10 replications of writing a rather small dataset to disk, fwrite() has a huge timing advantage (it's more than 10 times faster!). For a very large dataset, this might come in really handy.

Option 3: Save as an Excel file

You might come into a situation where you want to export your dataset to an Excel file. Maybe some colleagues only work with Excel (because you still have not managed to convince them to switch to R) or you want to use Excel for annotating your dataset with a spreadsheet editor. In this case, you can use the WriteXLS() function from the WriteXLS package. I tried a few packages for writing Excel files and I find this one the most convenient to use.

Let's try it.

library(WriteXLS)
WriteXLS(data, ExcelFileName = "data.xlsx",
         SheetNames = "my data",
         AdjWidth = T,
         BoldHeaderRow = T)

This is what the resulting Excel file looks like on my machine.

You can save several data.frames in one Excel file by including the names of the objects at the first position. Here, you could replace data with c("data", "data2"). With the parameter SheetNames you can set the names of the data sheets (visible at the bottom of Excel, not included in the screenshot). If you want to write several data.frames into several sheets of the Excel file, you can put several names in a vector here that have to correspond with the names of the objects at the first position.

AdjWidth is a nice parameter because it tries to adjust the width of the columns in Excel in a way that every entry fits in the cells. BoldHeaderRow is self-explanatory, I guess. You can see the effect in the screenshot. Oh, and by the way, you can set the entries for NA values with the na parameter. It's "" by default.

Summary

  • If you know that the dataset is going to be used in R and R only, use saveRDS(). save() is OK, too. But you cannot assign the result of load()ing your data back into R to a variable name of your choice. You can save uncompressed files by setting compress = F. Reading those files back in can be faster, but they use more space on your disk.
  • If you want to distribute your dataset to a lot of people and you don't know which statistical software package they use, you can save CSV files. I recommend using fwrite() from the data.table package because it is much faster than write.table().
  • If you really, really want (or need) an Excel file, I recommend using WriteXLS() from the WriteXLS package.

If you think that I should also cover other formats for saving a dataset on the disk, please let me know in the comments and I will try to cover them as well.


Source: https://www.r-bloggers.com/2019/05/how-to-save-and-load-datasets-in-r-an-overview/
