A content filtering database is a hierarchically structured list (tree) of categories with an arbitrary number of nested levels. The categories are defined using probabilistic and mathematical methods, and the database contains the words and expressions that enable the topic and confidentiality level of a document to be determined.
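To make the tree structure concrete, the following is a minimal sketch of how such a category hierarchy might be represented. The class name `Category`, the integer confidentiality scale, the term weights, and the sample categories are all illustrative assumptions and are not taken from any particular product.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Category:
    """One node in a hypothetical category tree (names and fields are assumptions)."""
    name: str
    confidentiality: int = 0  # assumed scale: 0 = public, higher = more sensitive
    terms: Dict[str, float] = field(default_factory=dict)  # term -> illustrative weight
    children: List["Category"] = field(default_factory=list)

    def add(self, child: "Category") -> "Category":
        """Attach a subcategory and return it, so nesting can go arbitrarily deep."""
        self.children.append(child)
        return child

    def walk(self, path=()):
        """Yield (path, category) for this node and every descendant, depth first."""
        here = path + (self.name,)
        yield here, self
        for child in self.children:
            yield from child.walk(here)


# Made-up sample data: three nested levels under a single root.
root = Category("Company")
finance = root.add(Category("Finance", confidentiality=2, terms={"invoice": 0.4}))
finance.add(Category("Payroll", confidentiality=3, terms={"salary": 0.9, "bonus": 0.6}))
root.add(Category("Marketing", confidentiality=1, terms={"press release": 0.3}))

for path, cat in root.walk():
    print(" / ".join(path), cat.confidentiality)
```

Each node carries its own terms and confidentiality level, so a match on a deeply nested category can be reported together with the full path of its parent categories.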
In this technique, the topic of a text is determined automatically using a previously created content filtering database (CFD). A CFD not only describes the categories of information that circulate within a company; it also takes into account various attributes that determine the confidentiality of that information, including the specific nature of the company's business and its security requirements. Through linguistic analysis, a text is automatically assigned to the appropriate categories based on its topic and content. Because the analyzed information may contain terms (words and phrases) from different categories, it can be assigned to one or several CFD categories.
It is important to create a database that ensures reliable results when filtering information by category. The main technique in CFD-assisted linguistic analysis is to search the fragment of information being analyzed for words and phrases that describe confidential data and are organized by category.
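The term search described above can be sketched as follows. This is a simplified illustration under stated assumptions: the `CFD` dictionary, its category names, its confidentiality levels, and the helper functions `classify` and `confidentiality` are hypothetical, and plain whole-word matching stands in for real linguistic analysis, which would normally also handle word forms such as plurals and inflections.

```python
import re

# Hypothetical CFD excerpt: category path -> (confidentiality level, terms).
CFD = {
    "Finance/Payroll": (3, ["salary", "bonus payment"]),
    "Finance/Invoices": (2, ["invoice", "purchase order"]),
    "Marketing": (1, ["press release"]),
}


def classify(text):
    """Return {category: [matched terms]} for every category with at least one hit."""
    hits = {}
    for category, (_level, terms) in CFD.items():
        found = [
            term for term in terms
            # Case-insensitive whole-word (or whole-phrase) match.
            if re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE)
        ]
        if found:
            hits[category] = found
    return hits


def confidentiality(hits):
    """Take the highest level among the matched categories (0 if nothing matched)."""
    return max((CFD[category][0] for category in hits), default=0)


sample = "Please attach the invoice and confirm the salary figures before the press release."
matched = classify(sample)
print(matched)
print(confidentiality(matched))
```

Because the sample text contains terms from three categories, it is assigned to all three, which illustrates how a single fragment can belong to several CFD categories at once. Taking the maximum level is only one possible way to combine them.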