A large part of contemporary big data consists of free text, which is why natural language processing (NLP) is increasingly applied in its analysis. Domains rich in free text include, for example, medicine, where clinical work produces large numbers of descriptions of patients' treatment and condition.
NLP technologies are widely used around the world for analysing free-text data. Reuse of existing solutions, however, is often hampered: the developed solutions are mostly language specific; regional lexical resources (dictionaries, thesauri) that could support the analysis are missing; or the available tools do not scale to the data volumes required for effective big data analysis.
To address these shortcomings, we created the TEXTA Toolkit, which makes it possible to extract from a text corpus the specific terminology of the represented field, to build concept-based terminological resources on top of that terminology, to identify text fragments referring to those concepts in documents, and to visualise the results across the datasets of a data system. The TEXTA Toolkit is not restricted to any particular field, so it can be used to process datasets in various languages. The software also scales well with data volume: it is capable of analysing millions of text documents in real time.
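The first capability mentioned above, extracting field-specific terminology from a corpus, can be illustrated with a minimal sketch. The code below is not the TEXTA Toolkit implementation; it is a hypothetical example that ranks words by a smoothed log-ratio of their frequency in a domain corpus versus a general background corpus, one common way to surface domain-typical terms.

```python
# Illustrative sketch only: a simple frequency-ratio approach to
# domain terminology extraction (not the TEXTA Toolkit's actual method).
from collections import Counter
import math

def domain_terms(domain_docs, background_docs, top_n=5):
    """Rank words characteristic of the domain corpus relative to a
    general background corpus, using a smoothed log frequency ratio."""
    def word_counts(docs):
        counts = Counter()
        for doc in docs:
            counts.update(doc.lower().split())
        return counts

    dom = word_counts(domain_docs)
    bg = word_counts(background_docs)
    dom_total = sum(dom.values())
    bg_total = sum(bg.values())

    scores = {}
    for word, freq in dom.items():
        p_dom = freq / dom_total
        # Add-one smoothing so unseen background words do not divide by zero.
        p_bg = (bg.get(word, 0) + 1) / (bg_total + len(dom))
        scores[word] = math.log(p_dom / p_bg)

    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [word for word, _ in ranked[:top_n]]

# Toy usage: clinical-sounding domain texts vs. everyday background texts.
clinical = ["patient diagnosis and treatment", "treatment of the patient"]
general = ["the weather is nice", "nice day for a walk"]
print(domain_terms(clinical, general, top_n=3))
```

In a real setting the corpora would be tokenised properly, multi-word terms would be considered, and a statistically grounded measure (e.g. log-likelihood ratio) would replace this toy score; the sketch only conveys the idea of contrasting a domain corpus against a background corpus.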