My name is Jennifer Peterson, and I'm really excited to be here to help host today's session. A reminder that today's session will be recorded, and the resources to access that recording will be made available to you in an email later today. But as with all of our WebJunction webinars, we've also posted those to the WebJunction catalog along with library-specific courses made available free to all who work and volunteer in libraries. And thanks to OCLC, the Bill & Melinda Gates Foundation, and state library agencies across the country for making that opportunity free to you all. A reminder if you're not yet subscribed to Crossroads: our newsletter goes out twice a month, it's an excellent way to stay up to date on new resources, projects, and webinars at WebJunction, and you can subscribe via our home page. Today's session has a learner guide that can be accessed as a resource to extend your learning. It could also be a resource to extend the conversation with your team, so know that this is something you can customize and use to explore additional resources that come up in today's conversation. Today's session is part two of a three-part series on the Measures that Matter initiative, a fieldwide discussion of the current state of public library data. In collaboration with COSLA and IMLS, we're excited to be able to help host today's session. Part three is scheduled and registration is available, and I'll push that link out into chat as well if you're not yet registered for that. I'm going to go ahead and get our recording started. I'm really excited to introduce our presenters for today. Linda Hofschire, who is our host for today, comes to us from the Library Research Service at the Colorado State Library. Rebecca Teasdale is the senior evaluation and research associate at the Garibay Group, and John Bertot is the associate provost for faculty affairs and professor at the College of Information Studies at the University of Maryland. We're so excited to have all three of you here, and I'm going to let Linda kick us off. Thank you, Linda. >> Thanks, Jennifer. Welcome, everyone. We are excited to have you here today. So we want to start by asking you all a question. So I am going to turn it back over to Jennifer for a minute to explain how this is going to work so that we can get your answers. >> Excellent. Yes. For those of you who were able to join us last time, we are going to be giving you access to the coveted annotation tools here in WebEx. So you should be able to see a marker at the top left corner of your view. If you click on that, you'll see the window open for the menu options; go halfway down that menu and find the box. Click on the little arrow to the right of the square, and you can go down to the check mark. Excellent, people are already practicing. Feel free to use this slide to find your check mark. Excellent. >> Thanks, Jennifer. Now that we've practiced, we'll do the real thing. I'm going to have you pause on your check marking for a second, and I'm going to flip us over to the next slide. Now what we want you to do is give us a quick rating regarding your knowledge about data. So you can see on one side of the spectrum, maybe you're just getting started learning about data; on the other side you might be a full-fledged data geek, in which case you are close to my heart. Great. So it looks like we've got a range of experiences represented today. Thanks for participating in that. Lots of good colors. All right.
So, I want to start by providing a little bit of context regarding today's webinar. It is actually one activity that is part of a bigger project called Measures that Matter. This is a one-year project that's being done in conjunction with IMLS and the Chief Officers of State Library Agencies, or COSLA. The goal of this project is to take stock of the current public library data landscape, and then from there to look for opportunities to kind of streamline and better coordinate future data collection efforts, and also to move the field toward more meaningful measures. So we're offering this webinar, and as Jennifer mentioned, this is part of a series of webinars, because we really do want to have a very broad fieldwide discussion about this topic. And we're very, very interested in getting your feedback. So I encourage you to reach out to us, whether that's in chat right now -- I'll give contact information at the end -- but as we go through this webinar series, we are very interested in getting your feedback about the project. If you want to learn more about the project, I would recommend checking out the first webinar in the series, which took place last month and goes into more detail about the project. Today we're in the second webinar, and our goal today is to learn about three different data concepts. We're going to be focusing on sampling, data types, and data management. Why are we focusing on these? It's because each of these concepts impacts what we know about public library data and shapes the public library data landscape. And I want to be clear, too, that we're not trying to become experts in these concepts. We've only got an hour, this is just going to be an overview, and as we already saw from that continuum, there's a range of experiences that we have coming in. Really what we want to do during the next hour is get to a point where we recognize how these different concepts impact the national public library data landscape, and start to develop more of a common language for talking about it, which might be helpful in conversations with our staff, or our boards, or other stakeholders. As Jennifer mentioned, our final webinar in this series is coming up July 26th, and there we're going to turn our focus to meaningful measures. We want to think about what measures will truly tell the most impactful story of today's libraries. And to do that, we are going to hear from speakers from both within and outside of the library world, so I definitely encourage you to join us for that in July. So before we dive into today's topic, I just want to give a quick definition of the public library data landscape. I've thrown out that term a couple times now, but I want to make sure we're on the same page about what it is. The Measures that Matter project is focusing on national public library data collection efforts, and you can see on this slide what those are. There are five current efforts as well as three from the recent past, but we're looking at all of these efforts to inform those goals of getting to a place where we can streamline and better coordinate data collection and get to more meaningful measures. If you aren't familiar with any of these data collection efforts, just to give you a quick overview, some of the topics that a lot of these surveys collect data about include such things as expenditures, library infrastructure, collections, and technology, and a couple focus on outcomes.
And then in terms of who responds to these surveys, you can see from the slide that the majority of the national-level data collection efforts are responded to by library staff reporting about their library. A couple of national efforts have library users as respondents, and then one of the national efforts is even broader than that: the library typology survey, which was part of the Pew research done on libraries and went out to the general public. So that was just a quick overview of what we mean when we talk about the public library data landscape. Now I want to get into our first topic, which is sampling. I'm going to turn things over to Rebecca. She is going to introduce some sampling concepts, and then tie those to these national public library data landscape efforts. So Rebecca, I'll turn it over to you. >> Hi, everyone. As Linda said, we're going to be talking about some research concepts that can help us make sense of this public library data landscape, and give us some common language for talking about these public library data collection projects so all of you can give input to the Measures that Matter team. So to get started, we're going to talk about sampling methods. In order to interpret the results of any of these surveys that we have out there, we need to consider who the survey was given to. And there are two basic approaches that we can use to administer surveys: we can conduct a census, or we can conduct a sample survey. In the first approach, we administer our survey as a census. That means we ask every single person to take the survey. This is what the government does every 10 years, right, when it does the census: the government is seeking to gather data from every single household in the United States. And this is also actually what WebJunction is going to do at the end of the webinar; everybody who participates is going to be asked some questions so that the organizations can learn what the entire population of attendees thought about this webinar. Now, the other approach we can take is sampling. And this is what happens with political polling. The polling companies don't call every single person in the U.S. and ask who they're going to vote for, right? Instead, they draw a sample of the population of the U.S., and then they only ask those people their questions. And they use statistical principles to generalize from the fraction of people who took the survey to the entire population of the U.S. So in order to interpret the results of a survey, we need to know whether it was administered as a census, or whether it was administered to a sample. And we can think about these concepts in relation to public libraries. Often in our own libraries, we conduct a census if we do an annual survey of cardholders. If we administer that survey to each and every person who has a library card, then that would be an example of a census. And we also find an example of a census at the national level. Linda mentioned the Public Library Survey. This is the survey that IMLS owns and administers every year, and they work with the state libraries to try to get 100% of the public libraries in the country to respond to the survey. So the idea here is that with a census, we won't have to make any estimates about the data at a national level or make any generalizations, because we can collect data from each and every library in the country.
But as you can imagine, it's kind of difficult to conduct a census of public libraries; it takes a lot of time and money to get each and every library to respond. So all of the other surveys in the public library data landscape use a sample rather than a census approach. When a sample is drawn, we need to know what sampling method was used. And there are two basic approaches here as well: we can draw a probability sample, or we can draw a nonprobability sample. So we're going to start with probability sampling and talk a little bit about what that means. Probability sampling uses statistics and probability to select the subset of people or libraries who are going to get our survey. And we've probably heard of some of the specific methods that we can use to draw a probability sample. You can see some examples there on the screen. These include methods like simple random sampling, stratified sampling, or cluster sampling. Each of these approaches requires some specific expertise, so it's probably not something that most librarians are going to do themselves. But the reason that probability samples are used is because they can result in a representative sample that allows us to generalize to the larger population, or to make estimates that apply to libraries or households at a national level. So we hear people talk a lot of times about a sample being representative, and I want to talk a little bit about what that means. When we say that a sample is representative, what we're saying is that the sample accurately represents the full population from which it's drawn. So if it's a sample of public libraries, we need to know whether that sample is representative of all public libraries in the U.S., or not. But there's actually no hard and fast guideline about what is or isn't a representative sample. Instead, it's up to the researchers to spell out why they feel their sample is representative, and then it's up to each of us to decide whether or not we agree with them. And probability sampling is one of the primary strategies that researchers use in their efforts to draw a sample that's going to be representative. We can see an example of probability sampling in our own libraries. When a community is considering placing a funding measure on the ballot, often the Friends of the Library or the library foundation will conduct a public opinion survey to get an estimate of the chances that the funding measure is going to pass. The researcher typically uses one of the techniques on the screen to draw a probability sample, because they want to draw conclusions about the opinion of the entire community, and hopefully about their voting behavior, even though they're only going to collect data from a small fraction of the community. And they want that sample to be representative because they want to be able to feel confident in the conclusions they draw about the entire population based on the data from that sample. In the library data landscape, we have an example of probability sampling with the digital inclusion survey. If you read the report that was issued by the research team, you'll see that they drew a stratified sample of public libraries in the U.S., and then they administered the survey to that group. That enabled the researchers to calculate national estimates. They were able to estimate the average broadband connection speeds for libraries nationwide, for example, and they were able to estimate the percentage of libraries that offer digital literacy training.
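To make those two approaches concrete, here is a minimal sketch in Python of drawing a simple random sample versus a stratified sample from a list of libraries. The population, the locale codes, and the sample size of 500 are all made up for illustration; real national surveys use far more careful designs.

```python
import random

random.seed(0)

# Hypothetical population: 9,000 libraries, each tagged with a locale code.
population = [
    {"name": f"Library {i}",
     "locale": random.choice(["city", "suburb", "town", "rural"])}
    for i in range(9000)
]

# Simple random sample: every library has an equal chance of selection.
simple_sample = random.sample(population, k=500)

# Stratified sample: group the population into strata (here, by locale),
# then draw from each stratum in proportion to its size, so the sample
# mirrors the population's mix of city, suburban, town, and rural libraries.
strata = {}
for lib in population:
    strata.setdefault(lib["locale"], []).append(lib)

stratified_sample = []
for locale, libs in strata.items():
    k = round(500 * len(libs) / len(population))  # proportional allocation
    stratified_sample.extend(random.sample(libs, k))

print(len(simple_sample), len(stratified_sample))
```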
We have another example of probability sampling with the library typology survey, which was the survey that Pew ran. That was a household survey, so rather than collecting data from libraries, the researchers administered the survey to the general public. What Pew did was use a combination of probability sampling methods to draw a sample of households from across the U.S., and then they asked those households to respond to a telephone survey. And the sampling methods they used can give us confidence that the findings are representative of the U.S. population in general, rather than just the specific people who took the survey. So that allowed them to draw conclusions about how many people nationwide have library cards, or how often people nationwide visit their libraries. In contrast to probability sampling, we have our other basic approach, which is nonprobability sampling. And there are a bunch of different ways we can do this as well. You can see some examples on the screen. A lot of nonprobability sampling methods fall under the first example there, purposive sampling, which means the sample is chosen for a specific reason or a specific purpose. Another common type of nonprobability sampling is convenience sampling. In this method, we select our sample because it's the people or the libraries that we have access to. Basically we're going to take what we can get. This type of sampling is easier than other sampling methods, and sometimes it's really the only method that's feasible, because we don't have the resources that would be necessary to do probability sampling or use other sampling methods. But we have to be really careful here. With a nonprobability sample, we're not going to be able to generalize, because our sample is not going to be representative. Our results are going to apply to the people or to the libraries who took our survey, but we can't extend them to anyone else or any other libraries. So we're not going to be able to draw any conclusions about anyone or any libraries who didn't take our survey. We can think about a scenario that's kind of common in our libraries. We can imagine that we want to know whether our community wants the library to be open on Sunday evenings during the school year. So we post a survey on our library website, and let's say 100 people respond, and we find out that 95% of them want the library to be open on Sunday evenings. So what does that tell us? Well, these are the people who took our survey. Here on the screen, those are the respondents. And we know that 95% of these people want the library to be open on Sunday evenings. But we actually don't know anything about the rest of the people who visited the library website. It's very possible that the people who chose to take our survey have different opinions from the people who didn't take it. In fact, sometimes people choose to take a survey like ours because they have strong feelings about the topic. And we definitely don't know anything about the rest of the people in our community. I think we can say for sure that the people who visit our website are different from the people who don't; we probably have an idea about what some of those differences are, but there are likely to be other differences that we don't know about. So we need to be very careful here not to draw conclusions about the entire community based on our survey data.
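Here is a small simulation of that Sunday-hours scenario, with entirely made-up numbers, showing how a voluntary web survey can mislead when people with strong feelings are more likely to respond:

```python
import random

random.seed(1)

# Made-up community of 10,000 website visitors; 30% actually want
# Sunday evening hours.
community = [random.random() < 0.30 for _ in range(10_000)]

# Voluntary response: supporters feel strongly, so they are far more
# likely to bother taking the website survey than everyone else.
respondents = [wants for wants in community
               if random.random() < (0.08 if wants else 0.005)]

true_pct = 100 * sum(community) / len(community)
survey_pct = 100 * sum(respondents) / len(respondents)
print(f"True support: {true_pct:.0f}%   Survey says: {survey_pct:.0f}%")
# The survey badly overstates support because the sample selected itself;
# collecting more voluntary responses does not fix this kind of bias.
```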
We can't make that kind of jump from the respondents who took the survey on our website to community members in general. And the reason we can't do that is because we used a convenience sample. In this example, some people would call this sample a voluntary sample, or a voluntary survey, because the data we collected is simply from the people who volunteered to provide it. Within the library data landscape, we can think about the Public Library Data Service, which is a survey run by PLA, in this category. This survey provides a lot of great data from about 1,800 libraries each year, which is about 20% of U.S. public libraries. And it can be extremely helpful for taking the pulse of those particular libraries. It's also great for benchmarking: we can look up specific libraries that are similar to ours, find out all kinds of data about those libraries, and then make comparisons between those benchmarking libraries and our own. But we just need to keep in mind here that this is a convenience sample, or a voluntary sample, so the libraries who respond to the survey are those that do so voluntarily. This is not a census like the Public Library Survey that IMLS administers. There's some evidence here that the sample may be skewed toward larger urban libraries, so if I'm looking for data about certain large urban libraries, I'm in luck; I can find that data in the PLDS. But I'm not going to be able to draw conclusions about public libraries at a national level or generalize these survey results to public libraries overall. And Project Outcome is similar as well. Certain libraries have elected to participate and use the Project Outcome surveys with their communities, so at a national level, the sample of libraries would be a convenience sample. As a result, the data will yield insight about the specific libraries that participate, but we're not going to be able to make inferences about any other public libraries or draw conclusions about public libraries in general. If you remember, Project Outcome surveys are used to gather data from library patrons, or members of the public, so we have a second level of sampling taking place here. At the local level, each participating library has to decide who they're going to administer the survey to. So libraries themselves may decide to use probability sampling or nonprobability sampling when they administer those surveys to their patrons, or they might decide to administer them as a census. As a result, the ability to generalize or not, or the need to generalize, is going to vary from library to library at the local level. So that's been a quick whirlwind overview of sampling methods. The good news is, we don't need to understand these different methods in depth. As long as we have a sense of the general concepts, we can make sure that we're interpreting survey results appropriately, and we can use a common language around these issues to provide input for the Measures that Matter project as it moves forward this year. So Linda, I'm going to turn things back over to you at this point so we can give everyone a chance to weigh in on this topic before we move on. >> Thanks, Rebecca. All right. So we are going to ask another question, and it looks like everybody has mastered this technique for answering, so I will just move right on to the question. That is, does your library conduct any surveys? And if so, what are the sampling methods you use? Let us know with your check marks. Great. So it looks like we have kind of a range of approaches.
Great. Thanks for responding to that. So in addition to sampling, another concept that impacts what we know about public libraries is data types. Simply, what types of data are the various national-level data collection efforts collecting? Rebecca is going to join us again, and she will introduce us to three different types of data, and then talk about how those tie in to the national public library data landscape. Rebecca, I will turn it back over to you. >> Thank you. I would be really interested, maybe at the end -- I saw a lot of responses that folks were using probability sampling in their libraries. I would love to hear about some of those examples in the chat, even though I won't be able to pay attention to it while I'm talking. I'm curious about that. Okay, so data types. In the national public library surveys that we're talking about today, there are three types of data being collected, and those are inputs, outputs, and outcomes. This may be very familiar to many people, and to others it may be new, so we thought we would provide an overview to give us some common language again and help us get on the same page. So inputs describe library resources. They answer a question about what the library uses. These resources include the budgets that we have to operate, our staff and volunteer time, the buildings we operate, our collections -- all those kinds of data that describe our starting point, or the raw materials that we have to serve our community with, we can think of as inputs. Outputs describe library activities: how busy we are, and also how many people we serve. So like inputs, outputs describe the what of our libraries. They answer the question, what did we do? And they also answer the question, how many people did we serve? Outputs would include the programs we offer and the attendance at those programs, as well as things like reference statistics, gate count, and use of our website. All of these are data that describe what we do and how busy we are at our libraries. Outcomes, on the other hand, describe the consequences of library activities: the change that takes place in people's lives. So inputs and outputs describe the what of our library, and outcomes describe the so what. They answer the question, what difference did it make? And they describe what good we do, or the benefits that accrue to individuals or to our communities. Outcomes answer the question about what change took place. This could include changes in knowledge or understanding; it could also include changes that take place in attitude or outlook. And outcomes can be about skills that people have, as well as ways that people behave. All of these are focused on members of our community, our library patrons, and the changes that take place in their lives. And the final outcome category there is a change in condition in people's lives. Examples here would be a child learning to read, that would be an outcome, or an adult finding a job, for example. So we can see that each of these kinds of data is important, because each type plays a role in filling out a complete picture about public libraries. We have the inputs, which are the resources that we need to make things happen; then we have the outputs, which are the programs and services and the level of community participation; and the outcomes, which are the effects of our programs and services, or the change that takes place.
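As a quick reference for those three categories, here is one way to sketch them in code. The specific measures listed are illustrative examples only, not the official item lists from any of these surveys.

```python
# Illustrative only: a rough mapping of common library measures
# to the three data types discussed above.
DATA_TYPES = {
    "input":   ["operating budget", "staff FTEs", "collection size", "open hours"],
    "output":  ["program attendance", "circulation", "gate count", "website visits"],
    "outcome": ["new knowledge gained", "increased confidence", "job obtained"],
}

def classify(measure: str) -> str:
    """Return the data type for a known measure, or 'unknown'."""
    for data_type, measures in DATA_TYPES.items():
        if measure in measures:
            return data_type
    return "unknown"

print(classify("circulation"))    # output  (the "what did we do")
print(classify("job obtained"))   # outcome (the "so what")
```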
Here's an example that will illustrate the difference among these data types. We can imagine that I have some friends coming over, and I want to make iced tea for them. So I'm going to gather up some water and some tea bags and ice, I'm going to make my iced tea, and I'll have it waiting for my friends when they arrive. We can imagine it's a hot day and everybody has cycled over to my house, so now we're sitting outside and I'm able to serve them these glasses of iced tea. So what are the data types here? Well, first we have our inputs. These are the resources I used to make the tea: the tea bags and the water, and the ice, and the pitcher, and all that stuff. Then we have the outputs. That's the pitcher of iced tea I made, and the number of people who drank my tea. I might say that my output here was serving five glasses of tea to my friends. But the tricky part is thinking about what the outcome is going to be. In this example, the outcome is my friends' thirst being quenched. This is what changed in those individuals as a consequence of my iced tea. There was a change in their condition: they were hot and thirsty from their bike ride, and I provided the iced tea that made a difference in their level of thirst. They might think it's good tea or it's bad tea, or maybe I give them their tea in dirty glasses and they don't like that. Those are examples of their satisfaction, and that's somewhat different from the outcome I'm looking at here, which is whether or not their thirst was quenched. So we can look at some of the surveys that Linda described to find examples of these different data types in the data landscape. The examples of inputs I grabbed are from the Public Library Survey, the IMLS survey. There are some questions there about operating revenue, and expenditures, and open hours. The EDGE survey, which is run by the Urban Libraries Council, asks libraries questions about the number of computers they have and the speed of their internet connection. These are all inputs. The examples I grabbed of outputs, some of them are coming from the PLDS that we talked about earlier, so we have library visits, circulation, and program attendance. These are outputs focused on the level of participation in library services. The digital inclusion survey asks libraries questions about the digital content they provided, and about digital literacy programs they provided. So these would be outputs that are focused on the programs and services that libraries provided. And then, for examples of outcomes, we find Project Outcome has questions about whether community members have learned something new, and about their level of confidence related to what they learned. So the first outcome there addresses a change in knowledge, and the second outcome looks at a change in attitude, or specifically confidence. The Impact Survey from the University of Washington includes some questions for community members about whether they learned about specific topics, such as finding information about government, or diet and nutrition; the learning that results would be a knowledge outcome. And it also asks community members whether they took specific actions, such as applying for a job using library computers; these would be examples of behavior outcomes. The final piece to consider here is to think about the source of information for each of these data types.
So input data is going to be collected from libraries, because library staff members are the people who have information about library resources. Output data can come from libraries, because, again, those are the people who know about our programs and services and about levels of participation. And community members can also provide some output data; for example, they can tell us about their level of participation in library services and programs. Outcome data are going to be collected from library customers, because we're looking for the changes in people's lives, and only members of the public know whether and to what extent their lives have changed. So each of these sources has a role to play within each different type of data, and in order to have a complete picture, we need to be gathering data both from libraries and from members of the public, including library patrons. So with that, I'm going to turn things back over to you, Linda. >> Thanks, Rebecca, for that overview. So you can see, in thinking about these different data types and then relating that to the current national public library data landscape, what we know about libraries depends on what types of data are being collected. And because a lot of the efforts focus more on inputs and outputs, we know a lot about those right now, and we're just starting to learn a little bit more about outcomes. So we're going to get into our final topic for today, which is data management. Here we're going to think about how the practices of collecting data, storing data, and then providing access to data impact what we know about public libraries. And now I'm going to turn things over to John to talk about that. >> Hi, hopefully everyone can hear me. Thanks for that, Linda, and thanks, Rebecca, for your introduction. So what I'd like to do, also being mindful of time as we move through this, is provide a bit of an overview regarding data management issues, and basically I'd like to start off with a series of questions for you to consider as you're thinking about preparing your data, or thinking about making it available for reuse in some fashion, or other organizational management kinds of issues. And I want to reiterate what I think Linda said at the beginning of this webinar: this is a really high-level overview, and for each of the things I might discuss, there are a number of different processes under which you may be operating and things you have to consider in terms of managing your data sets as digital assets. So with that, some questions to consider are: why are you collecting the data? For what purposes do you intend to collect the data? Is it for organizational learning purposes, in which case the focus might be more internal, or with the intent to add to our knowledge base about, for example, libraries, in which case it's more of an external focus? Those two different kinds of answers, internal or external, might dictate a number of things you might do in order to prepare your data and make it more useful to others outside your particular division, or library, or context. Another thing to consider is whether or not you have any data compliance requirements.
So depending on the nature of the data you're collecting, who might have funded the study, who you're collecting the data for, and the types of data you're collecting -- at the individual level versus more aggregated -- you may have a variety of different privacy, confidentiality, and records retention requirements, requirements to make the data open access, and you may even need to get permission to collect the data. For example, in a university environment, we have what's known as an IRB, an institutional review board. We have to in essence present our method and plan for data collection and retention to an external review panel for them to say, yes, you can go ahead and conduct your study. I raise that not knowing all of your contexts, but it may be that your city, county, state, whatever your jurisdiction might be, may have some requirements, particularly if it involves collecting data from children and various other things. Do you intend to share the data? Are you thinking about going beyond one and done -- I just collected it for this specific purpose, and now I'm done -- or do you want to encourage its reuse? Are you thinking about somehow sharing the data beyond just your particular area of data collection? Is the data you're collecting one-time, or is it ongoing? (I should have advanced my slide.) The question there is, are you thinking of doing this more than once? Is it perhaps with the same group of individuals, or some kind of cohort over time, in which case you might be engaging in a longitudinal study? Or is it perhaps more of a time series study, where you're doing snapshots over a period of time, trying to map change over a particular period for a service or something else like that? That's important to know, because it will shape how you set up your study and manage your data if you're going to collect over time. Another thing to consider is, do you intend to, or are you required to, ensure long-term preservation of the data? Here we're thinking about retention issues. What is your policy or expectation for retaining the data? For example, on a campus such as ours, which may not apply to a public library, if you receive funding from the National Science Foundation or another entity, you're required to retain those data for a certain number of years, and make those data available throughout that time frame as well. So those are some key questions to really think about. They set the stage for the data that you may wish to collect. But what I want to point to -- and this comes from the UK Data Archive, which is actually a fairly robust model for looking at a data life cycle -- is that what you need to be thinking about throughout all of this, when it comes down to data management kinds of issues, is that entire life cycle: from creating data, to processing it, to analyzing it, to preserving it, to providing access to the data, and then ultimately reusing the data, whether it's you in fact reusing the data, or whether you intend and want to encourage others to reuse the data that you might have collected. So these are some selected issues regarding that data life cycle that I think would be useful to mention, and for you to start thinking about in terms of the implications of the choices that you make.
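One way to make those questions actionable is to record the answers before any collection begins. A minimal sketch, assuming nothing more than the questions listed above; the field names and values are placeholders, not a formal standard.

```python
# A hypothetical, minimal data-management-plan record that captures
# the life-cycle questions above before data collection starts.
data_management_plan = {
    "purpose": "external",            # internal learning vs. adding to the field's knowledge base
    "compliance": ["privacy/confidentiality", "records retention",
                   "IRB or jurisdiction review"],
    "share_data": True,               # one-and-done, or encourage reuse?
    "cadence": "time series",         # one-time, time series, or longitudinal
    "retention_years": 7,             # e.g., a funder's retention requirement
    "lifecycle": ["create", "process", "analyze",
                  "preserve", "provide access", "reuse"],
}
```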
So in Rebecca's presentation, she talked about sampling and the types of data, and I would broadly group those under research design kinds of headings. The research design you have actually has a fair number of implications for how you would consider collecting your data, but then also what you can say with your data, and how you or others would make use of that data elsewhere, right? When you think about the data you might collect, or any of the data we're talking about, all of the design choices you make have implications throughout the entire data life cycle, and you have to be cognizant of those to the extent that you can -- knowing that sometimes, like in our case with the digital inclusion studies or the public library internet studies, some people would look at those as a longitudinal study, but the reality is, over 20 years we never knew if we were going to have more funding. So at best I would call those time series, and we didn't design them to be longitudinal. They were more ad hoc, and you sort of worked with them and created them over time. In retrospect it would have been nice to lay that out so that you could have had perhaps a different approach to collecting and releasing the data. Of course with data management issues, you have to think about what kind of format you're going to store the data in, because there are implications depending on whether you use a more generic kind of file format or something that's more proprietary; how you're going to store the data; and how you're going to make access to the data available. For example, are you going to release the entire data set, or are you going to create APIs and allow people to hook in and grab certain data in real time for other applications? So those different kinds of things are certainly ones to consider. Another thing to consider, though, is the extent to which you can share the data. Depending on how you collect the data and who you're collecting it from, there is this concept of informed consent, right? If you're collecting data from people, then you need to inform them of what you're going to do with the data, how it's going to be released, whether or not it's going to be kept anonymous, and various other things like that. In our case, having collected data from libraries, we always had a statement about what we were going to do with the data, and we made it available for libraries to read before they might consider participating in the survey. And of course you have to consider how you might capture the data in a way that facilitates reuse. So in terms of processing the data, how you enter and key in the data can really affect the ability to reuse the data set. It's important to consider whether you're creating numeric values, text values, open text, various kinds of things like that, because how you enter the data will determine, on the back end, the types of use that you or others can in fact make of the data itself. You also need to make sure that you check and scrub the data to the extent you can. So one of the things that we used to do in our survey is that we would take a sample of the data set that we had collected and do validity checks.
We would go back and look at the data, we would compare it with the data that had been entered in previous years if we had data from those libraries from previous years, and we would also look at the data from certain libraries in the context of similar libraries and look for outliers and various other things like that. The intent, of course, is that you're trying to make the data set as clean as possible, both for your own use as well as for the use of others. Also, you need to consider anonymity as an issue. If you're going to have data that is publicly released, and there are confidentiality and anonymity concerns, then one thing to think about is, can you aggregate up the data in some way so that you're not divulging the individual from whom you might have collected the data? You have to think about at what point the data can be released such that they're still useful, but you haven't breached confidentiality in some way for those data. Another thing to make sure of is that you describe the data -- and I'll talk about this on the next slide too -- but it's important to have a bit of a code book or data map so people know what they're looking at. When I look at a data set, it's got a variable name and data in a cell -- what am I looking at? You need those definitions to go along with it, and then an understanding of what you're looking at in terms of the value that's been inserted in the cell of a particular data set. And of course, managing and storing the data for long-term access and preservation -- so, preserving the data. I realize I'm speaking a little quickly, I'm sorry, but I'm trying to be mindful of time. In terms of preserving the data, one thing we need to think about -- and this really matters when you're trying to move data forward, so maybe not today, maybe not tomorrow, but 10 years from now it will be an issue -- is that the format in which you choose to make your data permanent, to fix the data, really does matter. Something like CSV is a generic file format that will likely be able to be moved forward and absorbed into other data analysis tools for a long period of time. So in general, generic kinds of formats like CSV are best for long-term use and reuse. Oftentimes people release data in more proprietary formats, like Excel, SPSS, or SAS. One thing we need to be mindful of with those is that those kinds of products, if you will, get updated, right -- software releases and various other things -- and they're not always backwards compatible. So sometimes if you have a file format that is five or six versions old, several years old, it may not be able to be read by the latest version without some kind of conversion process. So picking a data format really is going to be absolutely critical for long-term preservation and reuse of the data that you might be collecting. You also need to think about a suitable storage medium, whether it's cloud-based, disk, whatever it might be, in which to fix the data. Much of what we do now is cloud-based storage, but if you're trying to make your data as widely available as possible, the most public and most easily accessible storage medium is best. On the other hand, if you're trying to lock it down, if you will, then maybe you might have it in a secure room, and people with the right kinds of authority and access to that room are the only ones who can get in and get access to the data.
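Tying back to the validity checks described a moment ago, here is a minimal sketch of a year-over-year outlier check. The file layout, field names, and the 50% change threshold are assumptions for illustration, not the actual procedure used in the digital inclusion survey.

```python
import csv

def flag_outliers(current_path, previous_path, field="visits", tolerance=0.5):
    """Flag libraries whose value changed by more than `tolerance`
    (here 50%) since the prior year -- candidates for manual review."""
    def load(path):
        with open(path, newline="") as f:
            return {row["library_id"]: float(row[field])
                    for row in csv.DictReader(f)}

    current, previous = load(current_path), load(previous_path)
    flags = []
    for lib_id, value in current.items():
        prior = previous.get(lib_id)
        if prior and abs(value - prior) / prior > tolerance:
            flags.append((lib_id, prior, value))
    return flags

# Hypothetical usage:
# for lib_id, prior, value in flag_outliers("2017.csv", "2016.csv"):
#     print(f"Check {lib_id}: {prior:.0f} -> {value:.0f}")
```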
So choosing the right storage medium for your intent really does become critical. And make sure you have backups, and appropriate security protocols for the data that you've fixed, whether it's on a drive or some other medium. Data documentation is absolutely critical. We released our digital inclusion survey data, and I have to say that the hardest part of releasing that data was the code book. It's amazing, and my hat's off to folks like IMLS when they release their public library data, the PLS data, because their code book is really well defined, it's very easy to follow, and I can't tell you, as someone who reuses that data, how important that data documentation really is. It is your map. It's a way for you to understand what it is that you're looking at; it's also a way for you to understand what might be appropriate in terms of analysis and other uses of the data, and for pulling the data into other analytic tools. So that data documentation piece is absolutely critical. And it is arduous, I will say, creating good data documentation, code books, and descriptions of the data, but incredibly vital. And finally, archiving the data to ensure long-term preservation is an important aspect of this as well. So, things to think about in terms of providing access to data -- and I think this is one of the issues of increasing importance to all of us, particularly those of us who believe in open data and open access kinds of models. You have to think about your data distribution approach: are you going to create, in essence, a library data version of data.gov and put all your data up, as many data sets as possible, for widespread access? What are your data sharing approaches, or designs intended around that? How are you thinking about data access control? Is it fully open, is it somewhat closed, is it fully closed? What are the rules that govern that data access? Are there any copyrights that you're putting on the data, or licenses to reuse it? What model are you using in terms of making your data available, if it's not federally funded or funded in a way that requires fully open data? And of course you have to promote the data availability. People need to know that your data sets are available, so how do you get the word out and encourage reuse of the data that you've collected? On the reuse side, I think there are a number of things to think about. Do you intend to conduct follow-up research? If it's ongoing or longitudinal and there's change over time, how are you going to make the data series available to facilitate analysis and reuse? Are you intending to do periodic updates and release those as well? Are you trying to match data? If you are trying to encourage the combining of data sets, you need to think about that in terms of the data you release, so that people have the ability to combine those data sets -- maybe with a common key, right? For example, I always encourage people who collect data from public libraries to use the library ID number that's in the PLS data set, because if we have that unique identifier, we can then combine data sets that were collected in different ways from different entities.
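That common-key point is easy to see in code. A minimal sketch using pandas, assuming both hypothetical files carry the PLS unique library identifier (the FSCS ID); the file names and column name are made up for illustration.

```python
import pandas as pd

# Hypothetical files: both carry the PLS unique library identifier
# (the FSCS ID) under an assumed column name.
pls = pd.read_csv("pls_system_level.csv")        # e.g., an IMLS PLS extract
broadband = pd.read_csv("broadband_survey.csv")  # e.g., a separately collected survey

# Because both data sets share the unique key, they can be combined
# even though they were collected in different ways by different entities.
combined = pls.merge(broadband, on="fscs_id", how="inner")
print(combined.head())
```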
And you have to think about program evaluation and building a culture of assessment overall. I think that ultimately the Measures that Matter project, Project Outcome, EDGE, and various other initiatives are leading toward trying to encourage and build a culture of assessment and program evaluation in libraries. Here's an example -- and I admit this is from our work -- of data reuse. When we finished up our digital inclusion survey, we created, in combination with Community Attributes in Seattle, an interactive mapping tool. The reason I put this up there is that this is a way to encourage reuse and collection of your data. It combines the data from our digital inclusion survey with data from the American Community Survey, so census-based data, as well as the ability to actually enter and update your library's data. So if you look at the upper left, that's what you see when you first go to the interactive mapping tool. If you go down below that -- actually, let me go to the right. The library there, the Indian River County library in Florida, is a random library I picked, and what pops up is a little bit about the library's services from our digital inclusion survey, and you can actually adjust the radius. You can make it a mile, three miles, five, or 10 miles, or you can make a custom one. And what it does is it also shows you what that community looks like along demographics, economics, education, health, and various things like that. If you go to the bottom left, the reason why the census area looks a bit different is that's actually the census area. It's not a perfect circle; it's based on how the Census identifies the geographic area. And within that geographic area, what you see there in blue is economic data that was pulled from the American Community Survey. So it's letting you know a little bit about the employment situation around the Indian River library itself; there it says a 14.8% unemployment rate. The last piece, the reason I wanted to show you this, is that we actually created a way to collect and update your data through the mapping tool itself. If you were to go there and there's no data for your library, or you wanted to update the data -- like your speed of connectivity had changed, or the number of computers you had changed -- you would be able to do that on the fly, and it would actually update the data on the map. So that's a way to think about how you might set up your data for reuse, and create additional tools to get the data used. And this also obviously uses the IMLS PLS data as the basis for it; that's where we pulled the library names and various things like that. So, winding down a little bit, in terms of national data availability, I created this to show where initiatives are on an open versus closed sort of map, if you will -- it's not perfect. But on the open data side, you have the PLS. It does come out roughly two years after collection, and there are a variety of reasons for that, but two predominant ones. One is that there are something like 13 or so different fiscal years across the states, and since this is fiscal-year driven, there's a period issue: you have to wait until the last libraries, the last states, are able to report their data. The second reason, in part, is the data validity checks that are done. So those two things in combination, and there are other reasons, create a situation where the data that you collect won't be released for roughly 16 months to two years.
And so that's why you see that lag, in part. The digital inclusion survey, which has been discontinued -- we made our data available. Pew also makes its data available; you can go to the Pew website and download their data as well. On the closed data end, you have EDGE and Impact, which require subscriptions in order to participate, and they don't release data in a public way; they report back to the libraries that have subscribed. PLDS does release data via subscription: you pay a nominal fee to get access to it, but once you have access, you can get the data set. And Project Outcome -- this is still something under consideration, so we'll see whether there's going to be any national release of the data, and we'll see where that comes down. Very quickly, just to tie back to Rebecca's points before closing this down and turning it back to Linda: I laid out some of the national surveys and the data sources -- PLS, digital inclusion, PLDS, EDGE, Impact, and Pew -- and the reason I did this was in part because I wanted you to realize that when we talk about collecting data from libraries, for example, you have to know which level of aggregation you're looking at in terms of the data or the sampling strategy you might use. So, for example, the PLS has some branch-level data, but it's mostly demographics, right? It's the geo codes, so latitude, longitude, your metropolitan status, square footage; it's not anything about services or resources. That gets collected at the system level. Now, the reason I made that distinction is that roughly 84 or 85% of libraries are basically one-building libraries, but for those that have branches, where there are multiple buildings, multiple outlets, that does matter. So when you think about -- I'll pick on my home state of Maryland -- a system with 20-some branches, I'm not seeing FTEs by branch, I'm seeing FTEs for the entire system. Same thing for circulation counts and various other things like that. So it does make a difference. The PLS is also a census; they collect data from all libraries. The digital inclusion survey, we actually focused on branch-level data, so we collected data on broadband, public access computers, and the resources and services that libraries provided, but we did it at the branch level, and that's how we could create something like the interactive mapping tool, because if I want to know specifically what this branch does and what the community within a mile around it looks like, I need branch-level data. And our study was sample based. PLDS is at the system level, and it's voluntary. EDGE, they do collect some data at the branch level, but when they do the benchmarking, it's at a system level, so it's kind of an interesting scenario; it's also voluntary. Impact is at the system level -- they collect data at the branch but release it at the system level -- and it's only for users, and it's also voluntary. And Pew, of course, is at the individual user level. And with that, I know I ran over a little, I'm sorry. I'll stop now and turn it back over to you, Linda. >> No, thanks, John, it's a lot of information to share. And as we said, this was kind of an overview of these data topics today. The slides are available on the website; I encourage you to go back and check the slides out in more detail, especially slides like this one that have a lot of information about the different data collection efforts.
Also, as Jennifer mentioned at the beginning, check out that learner guide. It can be a useful tool for processing this information and making it practical for your needs. What's coming up? Real quick, I just wanted to make you aware of a couple of things regarding Measures that Matter. First, if you are going to be at ALA Annual, please join us Saturday, June 24th; we will have a session about Measures that Matter. And that third webinar in this series, on moving toward more meaningful measures, is where we're going to get into a discussion about impact. That will take place on July 26th, in the same time slot. And finally, as I said, we are very interested in hearing from you and getting feedback about this project. I encourage you to reach out to us using the contact information listed on this slide. You can also follow the Measures that Matter project on Twitter, and sign up for an email distribution list if you want to keep updated about the project. And so I want to thank Rebecca and John so much for shedding light on these data concepts, and thanks to you all for joining us today. Have a great afternoon.