RERO ILS offers a tool for integrating new data into an existing catalogue. Because several institutions may share data (bibliographic records, for example), the integration of a new library must not create duplicates.
Data conversion
In collaboration with the system administration, the librarians provide an extraction of the data to be imported and a conversion guide (mapping) to the JSON format used by RERO ILS. If the original data is in MARC, a standard conversion already exists and can simply be adapted to each structure.
The data is then imported and converted in the migration tool using a specific script. Librarians can check that the data conversion is correct in the Administration > Migrations > Data conversion menu. This menu shows the source data and the result of the conversion to JSON.
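To illustrate what a conversion guide expresses, here is a minimal sketch of a field mapping applied to one source record. The source column names, the target keys and the `convert` function are all hypothetical; they are not the actual RERO ILS JSON schema or the migration tool's script.

```python
# Hypothetical mapping from source column names to target JSON keys.
# The target keys are illustrative only, not the real RERO ILS schema.
MAPPING = {
    "TITLE": "title",
    "AUTHOR": "responsibility_statement",
    "YEAR": "publication_year",
}


def convert(source_record, mapping=MAPPING):
    """Return (converted, unmapped): the converted record and the
    source fields for which the mapping gives no target."""
    converted = {}
    unmapped = []
    for key, value in source_record.items():
        if key not in mapping:
            unmapped.append(key)  # report it so the guide can be completed
        elif value not in ("", None):
            converted[mapping[key]] = value  # skip empty source values
    return converted, unmapped
```

Listing the unmapped fields alongside the result mirrors the purpose of the Data conversion menu: comparing source and output makes gaps in the mapping visible before the import.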
Deduplication
A mechanism compares incoming data with existing data and assigns different statuses depending on the certainty of duplication.
- Match: An existing resource that exceeds the defined similarity threshold was found. The incoming resource will be automatically attached to the existing resource during migration.
- No match: No candidate was found. The incoming resource will be automatically imported during migration.
- Multiple match: The mechanism has found several existing resources that exceed the defined similarity threshold. This may be an existing duplicate or two different resources that are very similar. A manual check is required to define which existing resource the incoming resource should be attached to.
- Check: The mechanism has found one or more candidate resources that could be duplicates, but their score is below the defined similarity threshold. A manual check is required to confirm that a candidate is indeed a duplicate.
Recipe for automatic document deduplication
The system includes a mechanism that determines whether an incoming document already exists in the ILS. This mechanism is implemented in the application's source code and can be adapted by the system administration to different data structures and different resources (users, etc.), depending on the migration to be carried out.
Here is a summary of the standard recipe for finding matches/candidates for an incoming document:
Step 1: Strict search for exact matches
- The system searches the database for documents that have an identifier in common with the incoming document (ignoring identifiers of type bf:Local). Harvested, draft and masked documents are ignored.
- Documents where the following fields are identical to the incoming document are kept as candidates:
  - Main document type
  - Dates in the field Provision activity
  - First Responsibility statement
  - First Edition statement
  - The agent statement (publisher) in the first Provision activity
- If exactly one document matches, the status is match.
- If several documents match, the status is multiple match.
- If there are no matching documents, go to step 2.
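The logic of step 1 can be sketched as follows. This is a simplified illustration, not the code used by RERO ILS: the `Doc` class, its field names and the in-memory `catalogue` list are assumptions made for the example, whereas the real system queries its database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Simplified stand-in for a catalogue document (hypothetical fields)."""
    pid: str
    identifiers: frozenset  # (type, value) pairs
    doc_type: str = ""
    dates: tuple = ()
    responsibility: str = ""
    edition: str = ""
    publisher: str = ""
    excluded: bool = False  # harvested, draft or masked


def strict_key(doc):
    """Fields that must be identical for a strict match."""
    return (doc.doc_type, doc.dates, doc.responsibility,
            doc.edition, doc.publisher)


def shared_identifiers(incoming, existing):
    """Identifiers in common, ignoring those of type bf:Local."""
    usable = {i for i in incoming.identifiers if i[0] != "bf:Local"}
    return usable & existing.identifiers


def strict_search(incoming, catalogue):
    """Return (status, hits); status is None when step 2 is needed."""
    hits = [
        doc for doc in catalogue
        if not doc.excluded
        and shared_identifiers(incoming, doc)
        and strict_key(doc) == strict_key(incoming)
    ]
    if len(hits) == 1:
        return "match", hits
    if len(hits) > 1:
        return "multiple match", hits
    return None, hits
```

A `None` status signals that no strict match was found and that the broad search of step 2 should run.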
Step 2: Broad search for matches/candidates
- The system uses the main title, the Provision activity and the first Responsibility statement of the incoming document to construct a relatively broad query and obtain an initial list of possible candidates.
- For each of these candidates, the system assigns a similarity score from 0 to 1 based on several document fields with different weightings (title, publication date, identifiers, etc.).
- Candidates with an average score below the lower threshold (0.6) are ignored.
- Candidates with an average score above the upper threshold (0.8) are considered duplicates.
- If no candidate scores above 0.6, the status is no match.
- If exactly one candidate scores above 0.8, the status is match, unless the publication date does not match (forced arbitration).
- If several candidates score above 0.8, the status is multiple match.
- If all remaining candidates score between 0.6 and 0.8, the status is check.
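The scoring and threshold rules of step 2 can be sketched as below. The function names, the field weights passed by the caller and the handling of scores exactly equal to a threshold are assumptions. In particular, the sketch assumes that "forced arbitration" means the document falls back to the check status; the recipe above does not state this explicitly.

```python
def weighted_score(field_scores, weights):
    """Weighted average of per-field similarities, each in [0, 1].

    field_scores and weights are dicts keyed by field name; the
    weights are supplied by the caller (illustrative values only).
    """
    total = sum(weights[name] for name in field_scores)
    weighted = sum(score * weights[name]
                   for name, score in field_scores.items())
    return weighted / total


def classify(candidates, low=0.6, high=0.8):
    """Return the status for a list of (score, date_matches) tuples."""
    kept = [c for c in candidates if c[0] >= low]   # below `low`: ignored
    strong = [c for c in kept if c[0] > high]       # above `high`: duplicates
    if not kept:
        return "no match"
    if len(strong) > 1:
        return "multiple match"
    if len(strong) == 1:
        # Assumption: a publication date mismatch forces a manual check.
        return "match" if strong[0][1] else "check"
    return "check"
```

For example, `classify([(0.9, True), (0.7, True)])` returns `"match"`, because only one candidate exceeds the upper threshold and its publication date agrees.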
Deduplication workflow
Documents whose candidates could not be certified by the system can be checked manually by librarians in the deduplication module (menu Admin > Migrations > Deduplication).
This menu shows the list of incoming documents with their status and candidates in the ILS database. For each incoming document, the librarian can browse the candidates found by the system, add one manually, compare the main metadata, open the detailed records and validate or reject a candidate.
- PID of the current candidate. Can be edited to manually add a corresponding record if it is not in the proposed list.
- Current candidate indicator. The arrows are used to browse the list of candidates sorted by score.
- Validate: certifies the current candidate as a match (resource to attach to). / Reject: rejects all candidates and certifies that this incoming document does not have a match in the database.
- Batches: Allows you to divide documents into batches so that several librarians can work on deduplication without the risk of checking the same documents as their colleagues.
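The idea behind batches can be illustrated with a short sketch that splits a list of incoming document PIDs into disjoint groups. How RERO ILS actually builds and assigns batches is not described here, so the function name and the fixed batch size are assumptions.

```python
def split_into_batches(pids, batch_size):
    """Split a list of PIDs into consecutive, non-overlapping batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [pids[i:i + batch_size] for i in range(0, len(pids), batch_size)]
```

Because the batches do not overlap, each librarian who takes one batch works on documents that nobody else is checking.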