This page collects diagrams, explanations and other information about the technical approaches used in the RISE Project.
Parsing Utility – FlowChart
The flowchart (right) describes the process involved in parsing the Open University’s EZProxy Log Files. This process is executed for each relevant resource request in a given (plain-text or .gz) log file, and results in a series of database entries pertaining to each request.
Further flowcharts describing how recommendations are generated through the MyRecommendations page and the Google Gadget will be added to this page in due course.
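As a rough illustration of the parsing step described above, the sketch below reads a plain-text or .gz log file line by line and yields one record per relevant request. It is not the RISE parsing utility itself: the log line layout, the `is_relevant` filter and all function names are assumptions made for the example (EZProxy's log format is configurable, so a real deployment may differ), and writing the records to the database is left out.

```python
import gzip
import re
from typing import Dict, Iterable, Iterator, TextIO

# Assumed line layout (hypothetical; EZProxy's LogFormat is configurable):
# host - user [timestamp] "METHOD url PROTOCOL" status bytes
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def open_log(path: str) -> TextIO:
    """Open a log file as text, whether it is plain text or gzip-compressed."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")


def is_relevant(record: Dict[str, str]) -> bool:
    """Hypothetical filter: successful GET requests made by a known user."""
    return (
        record["method"] == "GET"
        and record["status"] == "200"
        and record["user"] != "-"
    )


def parse_lines(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per relevant request; skip lines that do not match."""
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue
        record = match.groupdict()
        if is_relevant(record):
            yield record


def parse_log(path: str) -> Iterator[Dict[str, str]]:
    """Parse a whole log file; each yielded record would become database rows."""
    with open_log(path) as handle:
        yield from parse_lines(handle)
```

Each record yielded by `parse_log` would then be passed to whatever code creates the database entries for that request.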
Database Schema
The basic Entity Relationship Diagram (ERD) (above) shows the final database schema which was used by the RISE Project, as of 28/07/2011.
When the data contained in the database is released publicly, some of the tables and columns used internally by the OU will be removed, both to make implementation as easy as possible for external organisations and to ensure complete anonymity for OU students and staff. The primary anonymisation technique will be to replace every entry in the OUCU field with an integer ID. This lets us maintain all associations between users and courses while releasing an anonymous data-set simply by removing the OUCU field. The same approach will also be applied to courses.
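The identifier replacement described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the row layout, the `oucu` and `user_id` column names and the `pseudonymise` function are hypothetical, and the course codes in the usage note are made up.

```python
from typing import Any, Dict, Iterable, List


def pseudonymise(
    rows: Iterable[Dict[str, Any]],
    field: str = "oucu",       # assumed name of the identifying column
    id_field: str = "user_id"  # assumed name of the replacement column
) -> List[Dict[str, Any]]:
    """Swap an identifying field for an integer ID, then drop the field.

    The same identifier always maps to the same integer, so associations
    between users and courses survive in the released rows.
    """
    ids: Dict[Any, int] = {}
    released = []
    for row in rows:
        key = row[field]
        if key not in ids:
            # IDs are assigned in order of first appearance, starting at 1.
            ids[key] = len(ids) + 1
        clean = {k: v for k, v in row.items() if k != field}
        clean[id_field] = ids[key]
        released.append(clean)
    return released
```

For example, `pseudonymise([{"oucu": "ab123", "course": "COURSE_A"}, {"oucu": "ab123", "course": "COURSE_B"}])` gives two rows that share one `user_id` and contain no `oucu` value. The mapping held in `ids` would need to be kept private or discarded, and calling the function again with `field="course"` would apply the same idea to courses.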
A couple of questions around anonymisation:
Is there any risk of ‘singleton’ records in this set threatening the anonymity of the users?
What do you mean when you say ‘the same approach will also be applied to courses’ – do you need to obscure the course (module?) for some reason? If so, why?
And a question/suggestion about the information relating to the resource – have you looked at the minimum information needed to usefully match resources when there is no DOI? I’d suggest that you might want to capture the start page (and possibly end page) if available as well as the ISSN/Vol/Issue (+ title) that you already have listed in the ERD.
Hi Owen,
Richard will respond on your first point. Regarding resource matching where no DOI is present, the utility looks for a match on any of the DOI, ISSN or EBSCO Accession Number, and primarily uses the ISSN/Volume/Issue + Title key (as this is the most consistently available identifier across our data sources). So far we haven’t encountered any issues with the matching, nor signs of duplication.
The start/end page unfortunately doesn’t seem to be present within the log data or data from the APIs.
I’ll be adding some examples of the recommendation results to this page over the coming days.
Thanks Paul. The OpenURL or Code4Lib email lists (http://listserv.oclc.org/scripts/wa.exe?A0=OPENURL and https://listserv.nd.edu/cgi-bin/wa?A0=CODE4LIB) might have some feedback on which fields turn out to be unique identifiers for these items – and also on what information is best for making links to online resources – although clearly the latter isn’t the primary issue for you, and you are obviously limited by the available information.
Thanks Owen for the comments.
I’ll let Paul answer the second part; he can explain in a bit more detail what we are working to do.
As regards anonymisation, we’ll put a more detailed post together at some stage. But in outline it isn’t just a case of anonymising the record and then being able to release data. We’d expect to remove examples where there are small amounts of data that might make it possible to identify individuals, and we are discussing what the threshold might be (we had a discussion at the Programme meeting about what that golden number might be). We are also looking at the MOSAIC levels and will need to decide how the data is packaged before it is released.
With courses we’ve been discussing a few options about what we might do. From the point of view of external interest in the data we are thinking that it would be more useful to relate the search data to subject descriptions rather than just using OU Course codes. So that would mean that we might replace the course codes with subjects.
Thanks Richard – will look forward to the future posts 🙂