Extraction and formalization of knowledge from text

Leader: Claire Nédellec

The objective of the Bibliome group is to develop new methods and technologies for the extraction and formalization of fine-grained information and knowledge from textual documents such as scientific papers, patents, and free-text fields of databases. The methods are mainly based on Natural Language Processing and Machine Learning algorithms.

Applications to Life Science and Agriculture require new integrative approaches that interlink textual data with other experimental data, so that both can be exploited together in analysis tools and bioinformatics platforms. They also require user-friendly interfaces for training the text-mining tools and for visualising and curating their results.

Text mining in focused domains from small corpora relies on external resources such as nomenclatures, vocabularies and ontologies. The Bibliome group also develops methods for vocabulary and ontology design. The use of such formal resources contributes to linking text with other data.

The Bibliome group has organized shared tasks on bacteria biotopes and on gene regulation in microorganisms and in plants since 2005 (e.g. LLL, BioNLP-ST).


On-going projects

Infrastructure H2020 Text-mining OpenMinTeD (2015-2018)

D-ONT, Optimized exploitation of phenotypic databases - Ontologies for information sharing, ACI Phase (2016-2018)

IMSV, Institut de modélisation des systèmes vivants, Lidex of Université Paris-Saclay (2014-2016)

SeeDev, Regulations in the development of the Arabidopsis thaliana seed (Challenge Lidex CDS) (2015)

Recent projects

OntoBiotope, INRA Metaprogramme MEM (Metagenomics of microbial ecosystems) (2012-2013).

Triphase: Semantic information system for publications in animal physiology and agricultural systems. PHASE department (2013-2014).

Quaero, Automatic multimedia content processing, Oséo (2008-2013).

FSOV SAM Blé, Selection of wheat by genetic markers, Fonds de soutien à l'obtention végétale (2010-2013).


Workgroup Labex DigiCosme D2K (from Data to Knowledge)

INRA CATI ICAT (Knowledge Engineering and Text Analysis)

BioNLP-Shared Task (2011, 2013, 2016): annotated corpora and on-line evaluation services

LLL, Learning Language in Logic (2005)

Members of the Bibliome group





 Claire Nédellec, Principal Investigator, head of Bibliome group.






 Robert Bossy, Permanent position, Research Engineer, coordinator of Alvis Suite.




 Louise Deléger, Permanent position, Researcher.


 Philippe Bessières, Permanent position, Research Director.



 Dialekti Valsamou, PhD candidate, IDEX IDI, co-supervision with Pierre Zweigenbaum (LIMSI).




Estelle Chaix, post-doc, OpenMinTeD project.




 Arnaud Ferré, PhD candidate, IDEX IDI, co-supervision with Pierre Zweigenbaum (LIMSI).





 Mouhamadou Ba, post-doc, OpenMinTeD project.





Bibliome software

  • BioYaTeA is an extension of the YaTeA term extractor that deals with prepositional attachments and adjectival participles. It extracts terms from documents in French and in English. Its distribution includes post-filtering of irrelevant terms. It is publicly available as a CPAN module. Part of this work has been funded by the European project Alvis and the French project Quaero. See Golik et al., CICLing 2013 for more details.

  • AlvisIR (Alvis Information Retrieval) is an on-line generic semantic search engine. Only a few hours are needed to create a new instance for a given document collection and ontology. A user query expressed with ontology concepts retrieves all documents that contain those concepts, whether they appear as more specific concepts or as synonyms. The AlvisIR semantic search engine also handles relational queries. See, for example, the search on biotopes of microorganisms. Part of this work has been funded by the European project Alvis and the French project Quaero.
  • Alvis NLP/ML is a pipeline for the semantic annotation of text documents. It integrates Natural Language Processing (NLP) tools for sentence and word segmentation, named-entity recognition, term analysis, semantic typing and relation extraction. These tools rely on resources such as terminologies or ontologies for adaptation to the application domain. Alvis NLP/ML contains several tools for the (semi-)automatic acquisition of these resources, using Machine Learning (ML) techniques. New components can be easily integrated into the pipeline. Part of this work has been funded by the European project Alvis and the French project Quaero. See Nédellec et al., in Handbook on Ontologies, 2009, for an overview.
  • AlvisAE (Alvis Annotation Editor) is an on-line annotation editor for the collaborative editing and visualisation of annotations of entities, relations and groups. It includes a workflow for annotation campaign management. The annotations of the text entities are defined in an ontology that can be revised in parallel. AlvisAE also includes a tool for the detection and resolution of annotation conflicts. Part of this work has been funded by the European project Alvis and the French project Quaero. See Bossy et al., LAW VI 2012 for more details.
  • TyDI (Terminology Design Interface) is a collaborative tool for the manual validation and structuring of terms, either originating from terminologies or extracted from a training corpus of textual documents. It is used on the output of term extractor programs (like BioYaTeA), which identify candidate terms (e.g. compound nouns). With TyDI, a user can validate candidate terms and specify synonymy and hypernymy relations. These annotations can then be exported in several formats and used in other natural language processing tools. Part of this work has been funded by the French project Quaero. See Golik et al., EKAW 2010 for more details.
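
The concept-based retrieval described for AlvisIR above (a query concept matches documents that mention the concept itself, a more specific concept, or a synonym) can be illustrated with a toy sketch. This is not AlvisIR code or its API: the mini-ontology, the synonym table, the documents and the function names `expand` and `search` are all hypothetical and serve only to show the principle.

```python
import re

# Hypothetical mini-ontology: concept -> directly more specific concepts.
SUBCONCEPTS = {
    "habitat": ["soil", "food"],
    "food": ["dairy product"],
    "dairy product": ["cheese"],
}

# Hypothetical synonyms attached to concepts.
SYNONYMS = {
    "soil": ["earth"],
    "dairy product": ["milk product"],
}

# Hypothetical document collection.
DOCS = {
    "d1": "Lactococcus lactis was isolated from cheese.",
    "d2": "Bacillus subtilis is commonly found in soil.",
    "d3": "The genome was sequenced in 2010.",
}


def expand(concept):
    """Return the concept, all its more specific concepts, and their synonyms."""
    terms = set()
    visited = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current in visited:
            continue
        visited.add(current)
        terms.add(current)
        terms.update(SYNONYMS.get(current, []))
        stack.extend(SUBCONCEPTS.get(current, []))
    return terms


def search(concept, documents):
    """Return ids of documents mentioning any term of the expanded concept."""
    patterns = [
        re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in expand(concept)
    ]
    return [
        doc_id
        for doc_id, text in documents.items()
        if any(p.search(text) for p in patterns)
    ]


print(search("food", DOCS))     # only d1 matches, via "cheese"
print(search("habitat", DOCS))  # d1 and d2 match
```

A real engine would match against linguistically analysed text and a full ontology rather than raw strings; the sketch only shows why a query on a general concept can return documents that never mention it literally.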


On-line services

Semantic search engines based on AlvisIR technology

  • AnimalIR indexes Animal Journal articles with the ATOL ontology.
  • SamBlé indexes a large set of full papers on genetic markers of bread wheat (FSOV SamBlé project).
  • The Biotope relational search engine indexes all PubMed references on habitats of microorganisms (1.16 million references) with Alvis Suite technology and the OntoBiotope ontology. Funded by the Quaero project and the MEM metaprogramme.
  • TriPhasIR indexes the publications of the PHASE scientific department (2010-2014) with the TriPhase termino-ontology.

Other on-line services

  • Cocitations is an on-line interface that indexes sentences from PubMed references on the model bacterium Bacillus subtilis that mention at least two gene or protein names. The user can query it with one or two gene or protein names and display the matching sentences with the names underlined. Synonyms and renamings are handled. The user can also search for genetic information through the IGo portal.
  • OntoBiotope Database is an on-line service for navigating the OntoBiotope database of microorganisms and habitats described in PubMed references. The result of the user query is displayed as a treemap.
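
The selection principle described for Cocitations above (keep sentences that mention at least two gene or protein names, with synonyms resolved) can be sketched as follows. This is a simplified illustration, not the service's implementation: the lexicon, the example sentences and the function names `genes_in` and `cocitations` are assumptions made for the example.

```python
import re

# Illustrative lexicon: lowercase surface form -> canonical gene name.
# The synonym entry is an assumption made for this example.
LEXICON = {
    "spo0a": "spo0A",
    "abrb": "abrB",
    "sigf": "sigF",
    "spoiiac": "sigF",  # treated here as a synonym of sigF
}

TOKEN = re.compile(r"[A-Za-z0-9]+")

# Invented example sentences.
SENTENCES = [
    "Spo0A represses abrB transcription.",
    "The spoIIAC gene is expressed during sporulation.",
    "Expression of sigF depends on Spo0A.",
]


def genes_in(sentence):
    """Return the set of canonical gene names mentioned in the sentence."""
    tokens = (t.lower() for t in TOKEN.findall(sentence))
    return {LEXICON[t] for t in tokens if t in LEXICON}


def cocitations(sentences, gene_a, gene_b=None):
    """Return sentences that mention at least two distinct genes,
    including gene_a (and gene_b when given). Query names are
    normalised through the lexicon so synonyms can be used."""
    queried = [g for g in (gene_a, gene_b) if g is not None]
    wanted = {LEXICON.get(g.lower(), g) for g in queried}
    return [
        s for s in sentences
        if len(genes_in(s)) >= 2 and wanted <= genes_in(s)
    ]


print(cocitations(SENTENCES, "spo0A"))
print(cocitations(SENTENCES, "spoIIAC", "spo0A"))
```

The second sentence is never returned because it mentions a single gene, and the query on "spoIIAC" finds the sentence that uses the name "sigF" because both forms map to the same canonical entry.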

Shared tasks, Ontologies and corpora



