Experts from the Faculty of Informatics have found a way to determine the probable author of anonymous texts on the Internet from features such as sentence length or typical errors. They developed the tool in a project for the Ministry of the Interior. The Ministry recognized their success by awarding them a prize for exceptional results in security research.
The programmers are developing software for analysing web text in a project titled Natural Language Analysis in the Internet Environment. They focus primarily on posts written in Czech, but the software also recognizes English. The new program is designed to help the police detect extremism on the Internet, but it is already clear that it will be applicable far more broadly.
More than ten members of the Natural Language Processing Centre at the Faculty of Informatics of Masaryk University are involved in the development. The first item on their list of successes is a tool for automatically recognizing a writer's style. “Examples of such style properties include sentence or word length, frequency of parts of speech, sentence structure complexity, typical errors and the like,” says Karel Pala, the head of the centre. The method makes it possible to determine whether two documents were written by the same author, or to select the most probable author of an anonymous document from a list of known authors.
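To make the idea concrete, here is a minimal Python sketch of picking the most probable author from a list of known authors. It is not the Brno team's tool: the function names `style_features` and `most_probable_author` are hypothetical, and it assumes only three crude style properties (average sentence length, average word length, commas per word) compared by plain Euclidean distance. A real system would add properties such as parts of speech and typical errors, and would scale the features so that no single one dominates.

```python
import math
import re


def style_features(text):
    """Return a tiny style vector for a text:
    (average sentence length in words, average word length in characters,
    commas per word). Illustrative features only."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    if not sentences or not words:
        return (0.0, 0.0, 0.0)
    avg_sentence_len = len(words) / len(sentences)
    avg_word_len = sum(len(w) for w in words) / len(words)
    commas_per_word = text.count(",") / len(words)
    return (avg_sentence_len, avg_word_len, commas_per_word)


def most_probable_author(anonymous_text, known_texts):
    """Return the author from known_texts (a dict: author -> sample text)
    whose style vector lies closest to that of the anonymous text."""
    target = style_features(anonymous_text)
    return min(
        known_texts,
        key=lambda author: math.dist(target, style_features(known_texts[author])),
    )
```

Calling `most_probable_author(anonymous_text, {"A": sample_a, "B": sample_b})` returns the name whose sample is stylistically nearest. The same distance could also be compared against a threshold to judge whether two documents share an author.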
The scientists received the prize of the Ministry of the Interior precisely for these successes. “The tool we have developed currently provides the best results for Czech, even in comparison with other Slavic languages,” explains another member of the research team, Aleš Horák.
However, the prize does not mark the end of the work for the computer scientists in Brno. “We plan further development, which will allow determining the author's probable level of education or gender, or whether the text is a translation from another language,” Pala and Horák add. The Ministry of the Interior is now testing the program, with good results.