Bibliomining Applications in Digital Reference   
The combination of data warehousing, data mining, and bibliometrics in libraries is known as bibliomining. This article provides an overview of the bibliomining process as applied to digital reference.

 

Bibliomining Applications in Digital Reference: Using Data Warehousing and Data Mining to Improve Management and Decision-Making

Scott Nicholson, M.L.I.S., Ph.D., srnichol@syr.edu
Assistant Professor, Syracuse University School of Information Studies
Research Scientist, Information Institute of Syracuse

The digital reference process creates data-based artifacts of human intermediation that were not previously available in face-to-face reference. These artifacts give those answering questions, decision-makers, and researchers opportunities to advance our knowledge of the reference process. These artifacts, however, come with a responsibility to the patrons involved, as the personally identifiable information must be protected. To strike a balance between maintaining a thorough history of a reference transaction and protecting the personally identifiable information about a patron, data warehousing concepts can be employed. The combination of data warehousing, data mining, and bibliometrics in libraries is known as bibliomining (Nicholson, 2003). The purpose of this work is to provide an overview of the bibliomining process as applied to digital reference.

Data Warehousing

The first stage in bibliomining is identifying and collecting the appropriate data. The traditional route for analysis of library data is to develop a question and then gather the data needed to answer that question. This process has two problems: first, as the data gathering does not start until the question is developed, it is often months before analysis can begin; second, after the sampling period is over, the data are no longer collected. If the library wishes to track how the service has changed, the collection process must begin again.

Data warehousing concepts can be employed to improve this process. First, data sources are identified that may be useful in understanding the reference service. These sources might include a user database, questions and answers, expert information, or data about the transaction such as turnaround time. Ideally, the data from a single transaction will be matched across these different data sources, cleaned, and then put into a data warehouse. This step removes the analysis work from the operational computer systems and creates an external data space that researchers can explore without interfering with the daily operation of the service.

The matching and cleaning process will take some time to develop, as operational data are very dirty. In addition, at this point the demographic information about the user can be extracted from a user database and appended to the transaction. The remaining personal information about the user should be discarded to protect their privacy. If appropriate, external data sources can also be used to provide extra information; for example, if the answer contains a URL, additional information about that site could be extracted and appended to the data warehouse. This external data warehouse would then be automatically updated on a regular basis.
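The matching, cleaning, and privacy steps described above can be sketched in Python. This is only a minimal illustration under stated assumptions: the field names (user_id, age_group, affiliation, turnaround_hours) and the sample records are hypothetical and are not taken from any particular digital reference system.

```python
from datetime import datetime

# Hypothetical operational data; every field name here is an assumption.
users = {
    "u1": {"name": "Pat Doe", "email": "pat@example.org",
           "age_group": "18-24", "affiliation": "student"},
}
transactions = [
    {"id": 1, "user_id": "u1",
     "question": "  Where can I find   census data? ",
     "asked": "2004-03-01T09:00:00", "answered": "2004-03-01T15:30:00"},
]

# Only broad categorical variables are copied into the warehouse.
SAFE_FIELDS = ("age_group", "affiliation")


def build_warehouse_row(txn, users):
    """Match one transaction to its user, clean it, and drop personal data."""
    user = users.get(txn["user_id"], {})
    asked = datetime.fromisoformat(txn["asked"])
    answered = datetime.fromisoformat(txn["answered"])
    row = {
        "transaction_id": txn["id"],
        # Cleaning step: collapse stray whitespace in the question text.
        "question": " ".join(txn["question"].split()),
        "turnaround_hours": (answered - asked).total_seconds() / 3600,
    }
    for field in SAFE_FIELDS:
        row[field] = user.get(field, "unknown")
    # The name, e-mail address, and user_id are never copied into the row.
    return row


warehouse = [build_warehouse_row(t, users) for t in transactions]
```

In this sketch the user identifier is used only to perform the match and is then left behind, so the warehouse row carries the demographic categories but nothing that points back to the individual.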

The main advantage of a data warehouse for management is that the data are captured and cleaned regularly and systematically. Managers and researchers have the data at hand to answer many questions as they arise. In addition, once a research question has been explored and the collection and cleaning algorithms have been developed, it is easy to rerun the study on a regular basis to track how the service changes over time. Finally, as the warehouse grows, it can be an excellent time-saving resource for those answering reference questions, as the expert can refer to previously answered questions.

One of the concerns about digital reference transactions is that of user privacy. There are two methods of protection for the user that can be implemented. The first is to extract only broad categorical variables from the user database for the data warehouse, ensuring that no combination of these variables will lead to the identification of a user. The second and much more challenging method of protection is to remove personal information from the text of the question. This is similar to the problem of deidentification of medical records (WEDI, 2003), where personal information is removed while the useful information from the records is maintained. This is currently a manual process, although the automation of deidentification is an active area of natural language processing research.
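Both methods of protection can be illustrated with a short Python sketch. The first function counts how many transactions share each combination of categorical values, a simple group-size check in the spirit of k-anonymity (a technique this article does not itself name). The second applies two crude regular expressions to the question text. The patterns, placeholder labels, and field names are assumptions for illustration; pattern matching of this kind catches only the most obvious identifiers and is not a substitute for real deidentification.

```python
import re
from collections import Counter


def smallest_group(rows, fields):
    """Return the size of the rarest combination of the given fields.

    A result of 1 means at least one transaction is unique on these
    variables and could point to a single user.
    """
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return min(counts.values(), default=0)


# Deliberately simple, hypothetical patterns: e-mail addresses and
# phone numbers written as 315-555-0100, 315.555.0100, or 315 555 0100.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text):
    """Replace obvious e-mail addresses and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


sample_rows = [
    {"age_group": "18-24", "affiliation": "student"},
    {"age_group": "18-24", "affiliation": "student"},
    {"age_group": "65+", "affiliation": "faculty"},
]
rarest = smallest_group(sample_rows, ["age_group", "affiliation"])
cleaned = redact("Reach me at pat@example.org or 315-555-0100.")
```

A service might broaden its categories (for example, merging age groups) until the smallest group reaches a size it considers safe. Names, addresses, and contextual clues in the question text would still need human review.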

Each service must determine how much of a threat this in-text information poses, based upon who will have access to the data warehouse. Because no human reviews the transactions before they enter the data warehouse, it is not recommended that patrons be able to access this resource directly. Instead, the data warehouse is intended for administrative and expert use. One way of adding value to the user experience is to make it easy for those using the data warehouse to identify and clean transactions that would be appropriate for a separate archive searchable by end-users.

Analysis and Reporting

After the data warehouse is in place, the stage is set for analysis and reporting. These are two separate but related tasks with different purposes. The focus of analysis is to explore the data warehouse to discover interesting patterns in the data-based artifacts of use. Reporting, on the other hand, is focused on gaining an understanding of the operational performance of the service over time through regular reports. Each of these tasks can serve as inspiration for the other – interesting facts found during reporting can inspire an in-depth analysis, and the techniques used to discover useful patterns found during analysis can be standardized into regular reports.

Analysis is usually performed by exporting the data into a statistical tool like Excel or SAS or a data mining tool like Polyanalyst or Clementine. There are two main tasks done with these tools – one can either describe the present or predict the unknown. Descriptive tools can be used to create aggregates that allow the conceptualization of large amounts of data, create clusters of transactions based upon similarities, or create rules through the discovery of patterns. Predictive tools allow one to predict unknown values or category assignments based upon other known values or to predict the future based upon the past and the present.
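The difference between describing the present and predicting the unknown can be shown with a deliberately small Python sketch. The subject and turnaround_hours fields and the sample values are hypothetical, and using a group average as a prediction is only the simplest possible stand-in for the clustering, rule discovery, and predictive modeling that dedicated data mining tools provide.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical warehouse rows; field names and values are assumptions.
rows = [
    {"subject": "science", "turnaround_hours": 4.0},
    {"subject": "science", "turnaround_hours": 6.0},
    {"subject": "history", "turnaround_hours": 10.0},
    {"subject": "history", "turnaround_hours": 14.0},
]


def mean_by(rows, key, value):
    """Descriptive: average of `value` for each distinct `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: mean(v) for k, v in groups.items()}


def predict_turnaround(subject, averages, fallback):
    """Predictive (naive): guess a new question's turnaround from the
    average for its subject, or the overall average if the subject is new."""
    return averages.get(subject, fallback)


averages = mean_by(rows, "subject", "turnaround_hours")
overall = mean(r["turnaround_hours"] for r in rows)
estimate = predict_turnaround("science", averages, overall)
```

The descriptive step summarizes what has already happened, while the predictive step reuses that summary to estimate a value that is not yet known.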

Reporting may be done from within the reference software or with the aid of an external tool such as Excel or Access. Many digital reference software packages have built-in basic reporting features. These reports are limited in their scope and choice of fields, and are not able to draw upon the wide array of data available through a data warehouse. Assuming that the reference software allows the data to be exported in an appropriate format, Excel’s Pivot Tables feature allows the decision-maker to create reports that aggregate multiple fields. Having a data warehouse makes these explorations more readily available, and also sets the stage for the application of more advanced tools from the corporate management information field.
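The kind of report a pivot table produces, a count of transactions broken down by two fields at once, can be sketched in a few lines of Python. The month and subject fields and the sample rows are hypothetical; the function simply cross-tabulates whatever two fields it is given.

```python
from collections import Counter

# Hypothetical exported transactions; field names are assumptions.
rows = [
    {"month": "2004-01", "subject": "science"},
    {"month": "2004-01", "subject": "science"},
    {"month": "2004-01", "subject": "history"},
    {"month": "2004-02", "subject": "history"},
]


def pivot_counts(rows, row_field, col_field):
    """Count rows for every combination of two fields, filling gaps with 0."""
    counts = Counter((r[row_field], r[col_field]) for r in rows)
    row_keys = sorted({k[0] for k in counts})
    col_keys = sorted({k[1] for k in counts})
    return {
        rk: {ck: counts.get((rk, ck), 0) for ck in col_keys}
        for rk in row_keys
    }


report = pivot_counts(rows, "month", "subject")

# Print a simple text table, one line per month.
for month, by_subject in report.items():
    cells = ", ".join(f"{subject}: {n}" for subject, n in by_subject.items())
    print(f"{month}  {cells}")
```

Because the function takes the field names as arguments, the same report definition can be rerun on each new export, which is what makes regular reporting inexpensive once the data are in a consistent form.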

Looking into the Future

The future of data warehousing, analysis and reporting in digital reference is intriguing. Rather than force users to export the data to another tool, some digital reference software tools are building in more advanced data warehousing and analysis capacities. QABuilder 3.0, created through the Information Institute of Syracuse (IIS), automatically creates and updates a data warehouse that can be viewed at any point through Microsoft Access. In addition, this tool contains a data exploring module, which allows administrators to view different combinations of fields from the data warehouse, save report configurations that are valuable, and schedule these reports to run regularly. More information about QABuilder can be found at http://vrd.org/qabuilder.shtml.

The long-term future of data warehousing in digital reference is the creation of a multi-service data warehouse. This is made easier through the interim step of single-service data warehousing, as the mechanisms will be in place to take data from the operational systems and move them elsewhere. The privacy and technical challenges of this goal are significant, and are part of the research agenda underway for the DREW (Digital Reference Electronic Warehouse) project out of IIS. This data warehouse would be a multi-disciplinary knowledge base, and would allow for a greatly improved understanding of the digital reference process. It could serve as a powerful resource for those answering questions and for the automatic identification of quality information resources, allowing us to create a valuable archive of human intermediation.

References

Nicholson, S. (2003). The Bibliomining Process: Data Warehousing and Data Mining for Library Decision-Making. Information Technology and Libraries, 22(4), 146-151.

WEDI (Workgroup for Electronic Data Interchange). (2003). De-Identification and Limited Data Set White Paper. Retrieved June 7, 2004 from http://www.hipaadvisory.com/action/WEDIpapers/Deid.pdf

