One of the aspirations of the RISE project is to be able to release the data in our recommendations database openly, so we’ve been thinking recently about how we might go about that. A critical step will be to anonymise the data robustly before we make any of it openly available. We will post about those steps at a later date.
Once we have a suitably anonymised dataset our current thinking is to make it available in two ways:
- as an XML file; and,
- as a prepopulated MySQL database.
The idea is that people who are already working with activity data are most likely to find an XML file useful. For people who haven’t been using activity data and want to start with the code that we are going to release for RISE, a base level of data in a prepopulated database may be a useful starting point. We’d be interested in thoughts from people working with this type of data about which formats and structures would be most useful.
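As a rough illustration of what a "base level of data" in a prepopulated database could look like, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The table names, column names and sample rows are all hypothetical and are not the RISE database structure.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only,
# not the actual RISE database structure.
SCHEMA = """
CREATE TABLE resources (
    resource_id INTEGER PRIMARY KEY,
    title       TEXT NOT NULL
);
CREATE TABLE use_events (
    event_id    INTEGER PRIMARY KEY,
    user_id     TEXT NOT NULL,      -- anonymised identifier
    resource_id INTEGER NOT NULL REFERENCES resources(resource_id),
    search_term TEXT
);
"""


def build_base_database():
    """Create an in-memory database holding a tiny, made-up base level of data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO resources (resource_id, title) VALUES (?, ?)",
        [(1, "Resource A"), (2, "Resource B")],
    )
    conn.executemany(
        "INSERT INTO use_events (user_id, resource_id, search_term) VALUES (?, ?, ?)",
        [
            ("u1", 1, "artificial intelligence"),
            ("u2", 1, "artificial intelligence"),
            ("u2", 2, "neural networks"),
        ],
    )
    conn.commit()
    return conn


def resources_for_search(conn, term):
    """Return (title, use count) pairs for a search term, most used first."""
    rows = conn.execute(
        """
        SELECT r.title, COUNT(*) AS uses
        FROM use_events AS e
        JOIN resources AS r ON r.resource_id = e.resource_id
        WHERE e.search_term = ?
        GROUP BY r.resource_id
        ORDER BY uses DESC, r.title
        """,
        (term,),
    ).fetchall()
    return [(title, uses) for title, uses in rows]
```

Calling `resources_for_search(build_base_database(), "artificial intelligence")` lists the resources that users reached through that search term, which is the kind of query someone new to activity data could try straight away.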
XML format
For the XML format we’ve taken as a starting point the work done by Mark van Harmelen for the MOSAIC project, and we were fortunately able to talk to him about the format when he visited to do the Synthesis project ‘Recipes’ work. We’ve kept as close to that original format as possible, but there are some entirely new elements, such as search terms, that we need to include. The output in this format assumes that re-users of the data will be able to make their own subject, relationship and search recommendations by using the user/resource/search term relationships within the XML data.
Proposed RISE record XML format
Start
<useRecordCollection>
<useRecord>
Basic data: Institution, year and dates
<from>
<institution>Open University</institution>
<academicYear>2010/2011</academicYear>
<extractedOn>
<year>2011</year>
<month>4</month>
<day>19</day>
</extractedOn>
<source>OURISE</source>
</from>
Resource data
<resource>
<media>Article</media>
<globalID type="DOI">10.1007/s00521-009-0254-2</globalID>
or
<globalID type="ISSN">09410643</globalID>
or
<globalID type="EDSN">12345678 [Ebsco Accession number]</globalID>
<author>Cyr, Andre</author>
<title>AI-SIMCOG: a simulator for spiking neurons and multiple animats’ behaviours</title>
<resourceURL>http://www.???.??/etc</resourceURL>
<journalTitle>Nature</journalTitle>
<published>
<year>2009</year>
</published>
<journalData>
<volume>12</volume> <number>3</number> <month>6</month>
</journalData>
</resource>
User context data
<context>
<user>anonymised UserID</user>
<sequenceNumber>1 [Note: sequence number already stored within database]</sequenceNumber>
<useDate></useDate>
For students: [propose to map to a subject]
<courseCode type="subject">Engineering</courseCode>
<progression>UG2 [F, UG1, UG2, UG3, UG4, M, PhD1, PhD2, PhD3+ (F is for foundation year)]</progression>
For staff
<progression>Staff</progression>
</context>
Retrieved from
<retrievedFrom>
<searchTerm>artificial intelligence</searchTerm>
</retrievedFrom>
End record, more records
</useRecord>
<!-- more useRecords here if need be -->
</useRecordCollection>
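To illustrate how a re-user might derive their own recommendations from the user/resource/search term relationships in this format, here is a minimal Python sketch. It assumes a well-formed file that uses the element names proposed above, with the explanatory notes removed. The sample identifiers, the function names and the simple co-occurrence counting are illustrative assumptions, not the method RISE itself uses.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def make_record(user, resource_id, term):
    """Build one made-up useRecord holding only the elements this sketch reads."""
    return (
        "<useRecord>"
        f'<resource><globalID type="DOI">{resource_id}</globalID></resource>'
        f"<context><user>{user}</user></context>"
        f"<retrievedFrom><searchTerm>{term}</searchTerm></retrievedFrom>"
        "</useRecord>"
    )


# Invented sample data: these identifiers are placeholders, not real DOIs.
SAMPLE = "<useRecordCollection>" + "".join([
    make_record("u1", "10.0000/example-a", "artificial intelligence"),
    make_record("u1", "10.0000/example-b", "neural networks"),
    make_record("u2", "10.0000/example-a", "artificial intelligence"),
    make_record("u2", "10.0000/example-c", "artificial intelligence"),
    make_record("u3", "10.0000/example-b", "neural networks"),
]) + "</useRecordCollection>"


def read_uses(xml_text):
    """Yield (user, resource_id, search_term) for each useRecord."""
    root = ET.fromstring(xml_text)
    for record in root.iter("useRecord"):
        user = (record.findtext("context/user") or "").strip()
        # Only the first globalID in a record is used here.
        resource_id = (record.findtext("resource/globalID") or "").strip()
        term = (record.findtext("retrievedFrom/searchTerm") or "").strip()
        if user and resource_id:
            yield user, resource_id, term


def related_resources(uses):
    """For each resource, count the other resources used by the same users."""
    by_user = defaultdict(set)
    for user, resource_id, _term in uses:
        by_user[user].add(resource_id)
    related = defaultdict(lambda: defaultdict(int))
    for resources in by_user.values():
        for a in resources:
            for b in resources:
                if a != b:
                    related[a][b] += 1
    return {a: dict(counts) for a, counts in related.items()}


def resources_by_search_term(uses):
    """Group resource identifiers by the search term that led to them."""
    by_term = defaultdict(set)
    for _user, resource_id, term in uses:
        if term:
            by_term[term].add(resource_id)
    return {term: sorted(ids) for term, ids in by_term.items()}


uses = list(read_uses(SAMPLE))
print(related_resources(uses))
print(resources_by_search_term(uses))
```

The first mapping gives a "people who used this also used" relationship, and the second gives search-based suggestions. A subject-based variant could group on `courseCode` in the same way.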
We are interested in any feedback or comments on whether this format makes sense or would be useful or whether there are changes you think we should make. You can either leave a comment on the blog or email us at Rise-project
Has any of the data been opened up yet for others to play with?
Hi Tony
We’re working towards getting some data available for the start of July. We still need to finalise the format to use, hence the blog post above, which proposes an XML format based on the MOSAIC spec. So we’re interested in any thoughts you’ve got about the most useful format.
We’re now going to look at the format used for EDINA’s OpenURL data to see how much overlap there is and whether it makes sense to use that format. We are also going to look at incorporating that data into RISE to see if it improves the recommendations.
We also need to do some final work to decide exactly what data we release.