Background
The RISE database uses data from the log files of the EZProxy system that the Open University uses to allow off-campus users to connect to electronic resources. Searches carried out by users through the RISE search interface and RISE Google Gadget are also tracked within the RISE database.
As part of the project there is a commitment to investigate the potential of making the activity data available openly and an aspiration to release that data under an open licence. Similar data has already been released by the OpenURL project at EDINA.
PRIVACY
Use of Personal Data within the RISE project
Within the RISE database personal data is stored and processed in the form of the Open University Computer User account name (OUCU). The OUCU is generally a 5 or 6 character alphanumeric construction (e.g. ab1234) that is used as the login for access to OU systems. This OUCU is stored within the EZProxy logfiles that are ingested into the RISE database and is also tracked by the RISE interface to allow searches to be related to users.
This OUCU is used within the RISE system for two purposes:
- To be able to make a connection between a search and a module of study associated with the searcher, to allow recommendations based on module; and,
- To be able to remove all searches for a particular user from the recommendations database at their request.
Module data is added by processing a file of data from internal systems: the OUCU in the RISE database is matched with the OUCU stored by internal systems, and the module(s) being studied are added into the RISE database. As each new OUCU is added to the database a numerical userID is assigned. This is a simple incremental integer.
The RISE database stores details of which electronic resources are accessed by the user and the search terms used to retrieve each resource (for searches carried out through the RISE interfaces). See the diagram on the Technical Resources page for details.
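As an illustration only, the userID assignment described above could be sketched as follows. The class name, the in-memory dictionary, the lower-casing of account names and the choice to start counting at 1 are all assumptions made for the example; the source only states that the userID is a simple incremental integer assigned as each new OUCU is added.

```python
class UserRegistry:
    """Hypothetical helper: hands out an incremental integer userID per OUCU."""

    def __init__(self):
        self._ids = {}        # OUCU -> userID (in memory; RISE itself uses a database)
        self._next_id = 1     # assumption: numbering starts at 1

    def user_id_for(self, oucu: str) -> int:
        # Assumption: account names are compared case-insensitively.
        key = oucu.strip().lower()
        if key not in self._ids:
            # First time this OUCU is seen: assign the next integer.
            self._ids[key] = self._next_id
            self._next_id += 1
        return self._ids[key]


registry = UserRegistry()
first = registry.user_id_for("ab1234")    # new OUCU, gets a new userID
second = registry.user_id_for("cd5678")   # another new OUCU, gets the next userID
again = registry.user_id_for("ab1234")    # known OUCU, same userID as before
```

Because the userID carries no information about the account name, it can remain in a dataset after the OUCU itself has been removed.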
Privacy approach
The RISE project has developed a separate Privacy policy to cover use of activity data as it was felt that the standard OU Privacy policy was not sufficiently explicit regarding the use of data for this purpose. The newly developed privacy policy is available at http://library.open.ac.uk/rise/?page=privacy
One of the challenges with using EZProxy data is that the EZProxy logfiles contain records from links in several different systems, because we link as many systems as possible through EZProxy. The privacy policy has therefore also been linked from the Library SFX and EBSCO Discovery Search interfaces.
As well as explaining how their data will be used, the policy provides a mechanism for users to ask for their data to be removed from the system and for their data not to be recorded by the system. This opt-out approach has been cleared by the Open University Data Protection team.
The EZProxy logfiles that are used within the system provide a particular challenge to an opt-in approach. Access to this system is simply through a URL that contains libezproxy.open.ac.uk within the URL string (e.g. http://portal.acm.org.libezproxy.open.ac.uk/dl.cfm). This URL then redirects the user through the EZProxy system. These links can exist in many different systems.
Data on accesses to electronic resources must still be kept within logfiles so that the library can comply with licensing restrictions for the electronic resources and track any abuse of license conditions. An opt-out could therefore only be applied to the usage data element of the personal data.
Users do not log in to the EZProxy system directly but are faced with a standard Open University login screen to authenticate if they are not already recorded as being logged in.
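To make the URL pattern concrete, the following is a minimal sketch of how a link could be recognised as passing through EZProxy, based only on the host suffix shown in the example URL above. The function names are hypothetical and this is not taken from the RISE code.

```python
from urllib.parse import urlsplit

# Host suffix taken from the example URL in the text.
PROXY_SUFFIX = ".libezproxy.open.ac.uk"


def is_proxied(url: str) -> bool:
    """Return True if the URL's host ends with the EZProxy suffix."""
    host = urlsplit(url).hostname or ""
    return host.endswith(PROXY_SUFFIX)


def target_host(url: str):
    """Return the host of the underlying resource, or None if not proxied."""
    host = urlsplit(url).hostname or ""
    if host.endswith(PROXY_SUFFIX):
        return host[: -len(PROXY_SUFFIX)]
    return None


example = "http://portal.acm.org.libezproxy.open.ac.uk/dl.cfm"
proxied = is_proxied(example)       # True for the example URL
resource = target_host(example)     # "portal.acm.org"
```

Since any system can embed a link of this shape, the check only works on the URL itself and says nothing about which system the link came from.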
Future privacy changes
An opt-in approach may be required to comply with the new EU directive on ‘cookies’. Conceivably this may be achievable by redirecting all EZProxy links through an additional authentication process and asking users to agree to the storing of their usage data. This acceptance could be stored server-side, although this introduces a further single point of failure that could block access to electronic resources. Alternatively, a cookie approach could be taken, along with asking the user to accept the cookie.
PROJECT CODE LICENSING
By the end of the project the RISE code, covering the data ingestion processes and recommendation code, will be made available via Google Code at http://code.google.com/p/rise-project/. After consideration of suitable open source licenses it has been decided to use the standard license for Google Code, GNU GPL v3 (http://www.gnu.org/licenses/gpl.html). This license has previously been used successfully to release project code created by the OU for JISC projects.
DATA RELEASE
Open release of data
The project aspiration has been to openly release the data collected by the project to allow other services to be constructed based upon this (and other) datasets. Data to be released would be anonymised to ensure that it is impossible to identify individual search activities.
Anonymisation process
Prior to the open release of data it is proposed that the data would be transformed as follows:
- The OUCU would be removed from the dataset, leaving the userID.
- Module codes would be mapped to a more generic subject description name.
- The .libezproxy.open.ac.uk element would be removed from the URL.
- ‘Singleton’ records would be removed. A threshold (suggested by Huddersfield as being set at 35 students) for the number of users on a course would be applied. This process is designed to ensure that users cannot be identified individually.
- If necessary, RISE would consider removing all records added to the database prior to the date the Privacy Policy and opt-out feature were enabled (20/05/2011).
This process has been approved by the Open University Data Protection team.
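The proposed transformation steps can be sketched in code as follows. This is an illustration under stated assumptions, not the RISE implementation: the record field names (`oucu`, `user_id`, `module`, `url`, `added`), the module-to-subject mapping and the "Unknown" fallback are hypothetical, and the threshold is interpreted here as dropping records for any module with fewer than 35 distinct users.

```python
from collections import defaultdict
from datetime import date
from urllib.parse import urlsplit, urlunsplit

PROXY_SUFFIX = ".libezproxy.open.ac.uk"
MIN_USERS_PER_MODULE = 35          # threshold suggested by Huddersfield
POLICY_DATE = date(2011, 5, 20)    # date the Privacy Policy and opt-out were enabled


def strip_proxy(url):
    """Remove the EZProxy suffix from the host (any port or userinfo is dropped)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not host.endswith(PROXY_SUFFIX):
        return url
    return urlunsplit(parts._replace(netloc=host[: -len(PROXY_SUFFIX)]))


def anonymise(records, subject_for_module, drop_before_policy=False):
    """Apply the proposed steps to a list of dict records (hypothetical field names)."""
    # Count distinct users per module so that small cohorts can be excluded.
    users_per_module = defaultdict(set)
    for r in records:
        users_per_module[r["module"]].add(r["user_id"])

    released = []
    for r in records:
        if len(users_per_module[r["module"]]) < MIN_USERS_PER_MODULE:
            continue  # below the threshold: users might be identifiable
        if drop_before_policy and r["added"] < POLICY_DATE:
            continue  # optional step: drop records that pre-date the policy
        released.append({
            "user_id": r["user_id"],  # the OUCU is deliberately not copied
            # Assumption: unmapped modules fall back to "Unknown".
            "subject": subject_for_module.get(r["module"], "Unknown"),
            "url": strip_proxy(r["url"]),
        })
    return released
```

Counting users per module before any date filtering is a design choice in this sketch; counting afterwards would apply the threshold to the released subset instead and could exclude more records.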
Open Data licensing
Discussions with the Open University Rights team have identified that we are able to release data from EZProxy, from search terms used within RISE, and covering the general subjects covered by OU courses. An appropriate license for this content would be CCZero. This owes much to the previous work of the Lucero project in paving the way for the open release of data.
What data could be included?
What became apparent during the project was that most of the EZProxy request URLs linked through to EBSCO (the reason being that we link our EBSCO Discovery Solution through EZProxy) and that there was very little bibliographic data within the logfiles. We discovered that we could use the EBSCO accession number to retrieve bibliographic data but that we weren’t licensed to store that data in the RISE database, let alone release it openly. We found an alternative source of article level metadata (from Crossref) that we could store locally, but again licensing terms prohibit its inclusion within an open data set.
We consulted JISC Legal, who advised that if restrictions are placed on database vendors, these are usually passed on to subsequent users. Restrictions may not necessarily be just in relation to copyright. If the database vendor is using third party material (i.e. obtained from elsewhere) there will very likely be a purchasing agreement or contract, as well as a licensing agreement (where the copyright position is stated), between the parties stating what the vendor may do with the data. The vendor would then need to impose the same conditions on the customer, so as not to breach their agreements with the party from whom they obtained the material. So it could be a breach of contract terms as well as a breach of copyright, depending on the agreements.
Rights experts differ on whether article level metadata can be used and released. Commercial providers assert in their terms and conditions that you cannot reuse it or share it, and libraries are in a position where they have signed license agreements that contain those clauses. This is an area where agreement about the exact legal position with regard to article level metadata should be established. Not having openly available and reusable article level metadata would be a distinct barrier to establishing useful and usable datasets of article level activity data.
Advice from JISC Legal on the copyright issues around metadata directed us to a quote from their paper Licensing Open Data:
“Where there has been substantial investment in the selection and or presentation of the content of datasets they may attract copyright as well as database right if it was created after 27 March 1996 and if there has been evidence of creative effort in selecting or arranging the data. A database might have copyright protection in its structure if, by reason of the selection or arrangement of its contents, the database constitutes the author’s own intellectual creation. Copyright protection of individual data, including records and metadata that have been “expressively” written or enriched may also subsist in the structure of the database if that structure has been the subject of creativity.”
So in terms of what we could release openly we are left with a dataset that contains URLs that link to EBSCO, search terms entered through RISE and course subjects.
Type | Data |
Basic data: institution, year and dates | institution name; academicYear; extracted date; source |
Resource data | Resource URL |
User context data | anonymised UserID; timestamp; Student’s subject; Student’s level, e.g. [F, UG1, UG2, UG3, UG4, M, PhD1, PhD2, PhD3+]; Staff |
Retrieved from | SearchTerm |
The dataset includes relationships between resource records, but there is no easy way to relate a resource to a DOI or article title. That leaves the dataset potentially of use to other EBSCO Discovery Solution customers but to no one else. So at this stage we have reluctantly decided that we won’t be able to release the data before the RISE project ends. Further work would be needed to review other data sources, such as the Mendeley or OpenURL router data, to see if they could provide some relevant article level metadata.
What format could be used?
We had a lot of discussion early in the project about the format in which the data could be released. Ideally we wanted to release it in three forms: as a pre-populated MySQL database to act as a base-level database for the open release of the RISE recommendations system code; as an XML file (described originally here); and as a .csv file matching the format used for the release of the OpenURL data. In an ideal world we would match the article level data to the OpenURL format and create an OpenURL for the link, but that again relies on a source of open article level metadata.
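As a rough illustration of the .csv option, the sketch below writes records using column names derived from the data table earlier in this section. The column names and their order are assumptions made for the example; they are not the layout used by the OpenURL data release, which would need to be checked before matching it.

```python
import csv
import io

# Hypothetical column names, derived from the data table in this section.
FIELDS = [
    "institution", "academicYear", "extractedDate", "source",
    "resourceURL", "userID", "timestamp", "subject", "level", "searchTerm",
]


def to_csv(rows):
    """Serialise dict records to CSV text, keeping only the listed columns."""
    buf = io.StringIO()
    # extrasaction="ignore" drops any key not in FIELDS (e.g. a stray OUCU);
    # missing keys are written as empty strings (the DictWriter default).
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


sample = to_csv([
    {"institution": "The Open University", "userID": 1,
     "searchTerm": "climate, change", "oucu": "ab1234"},
])
```

Restricting the writer to an explicit column list means a field that should not be released cannot reach the output simply by being present in a record.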
Summary
- We have established a suitable privacy regime for activity data
- We have established agreement that the data we collect can be openly released, provided we take suitable data privacy and anonymisation steps
- We have established that we can use CCZero to license this data
- We have a limitation in not having article level metadata that can be included within the open dataset
- Further work needs to be done to find open article level metadata that could be used
- We are left with a sense of some frustration that we’ve come quite a long way only to fall at the final hurdle in terms of open data release.