Creation of a CFD

To create a CFD, it is necessary to first establish its structure – a classification or tree of content categories. The tree is a hierarchical list of categories and sub-categories.
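As an illustration only, a category tree of this kind could be modelled as follows. The `Category` class and the category names are hypothetical and do not reflect the product's internal format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Category:
    """One node of the category tree (hypothetical structure)."""
    name: str
    children: list[Category] = field(default_factory=list)

    def add(self, name: str) -> Category:
        """Create a sub-category under this node and return it."""
        child = Category(name)
        self.children.append(child)
        return child

    def paths(self, prefix: str = "") -> list[str]:
        """Return the full path of this node and of every descendant."""
        here = f"{prefix}/{self.name}" if prefix else self.name
        result = [here]
        for child in self.children:
            result.extend(child.paths(here))
        return result


# Example tree; the category names are made up for illustration.
root = Category("CFD")
finance = root.add("Finance")
finance.add("Accounting")
finance.add("Tenders")
root.add("HR")

for path in root.paths():
    print(path)
```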

Characteristic and frequent terms

The terms in a CFD are divided into frequent and characteristic terms.

Characteristic terms are terms that, if they occur even once in an analyzed fragment of data, provide 100% evidence that it belongs in a particular category.

A frequent term is a term that, if it is found in an analyzed fragment of information, raises the probability that the fragment belongs to a particular category without proving it on its own.

Each category is made up of a list of terms: keywords, word combinations and phrases whose appearance in an analyzed fragment indicates that the fragment belongs to that content category. Each term or phrase is assigned a weighting that determines its importance in classifying information into that category. Whether a text is relevant to a content category is decided by comparing the overall weighting of the terms found in the text with the relevance threshold for that category.
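The decision rule described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the product's implementation: the `CategoryTerms` class, the weights, the threshold and the term lists are invented, each term is counted once however often it occurs, matching is a plain case-insensitive substring search, and a score equal to the threshold is treated as relevant.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CategoryTerms:
    """Terms of one content category (hypothetical structure)."""
    name: str
    threshold: float                                          # relevance threshold
    frequent: dict[str, float] = field(default_factory=dict)  # term -> weighting
    characteristic: set[str] = field(default_factory=set)


def is_relevant(text: str, category: CategoryTerms) -> bool:
    """Decide whether a text fragment belongs to the category."""
    lowered = text.lower()
    # A single characteristic term is treated as conclusive evidence.
    if any(term in lowered for term in category.characteristic):
        return True
    # Otherwise add up the weightings of the frequent terms that were found
    # (assumption: each term counts once) and compare with the threshold.
    score = sum(w for term, w in category.frequent.items() if term in lowered)
    return score >= category.threshold


# Illustrative data only.
finance = CategoryTerms(
    name="Finance",
    threshold=1.0,
    frequent={"invoice": 0.4, "balance sheet": 0.7, "budget": 0.3},
    characteristic={"project aurora"},
)

print(is_relevant("Please find the invoice and the balance sheet attached.", finance))
print(is_relevant("The budget meeting has moved to Friday.", finance))
print(is_relevant("Status update on Project Aurora", finance))
```

In this sketch the first fragment passes on combined weighting, the second stays below the threshold, and the third passes on a characteristic term alone.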

To ensure quality categorization, the CFD must be kept up to date – categories that change over time need to be edited, terms and phrases added or deleted, their weighting changed, and so on.

InfoWatch Traffic Monitor Enterprise includes a standard content filtering database that contains general categories and terms encountered in most industries. This CFD guarantees the detection of information on topics such as 'Accounting,' 'Bookkeeping,' 'Finance,' 'Tenders,' 'HR,' etc. To ensure effective linguistic analysis, the company should develop the standard CFD further to take account of terms specific to its industry and to the organization itself.

Based on its long experience of working with companies in various vertical markets, InfoWatch has developed several content filtering databases that are optimized to meet the requirements of specific market sectors. InfoWatch currently offers CFDs designed for the following vertical markets:

  • banking (financial)
  • insurance
  • oil and gas
  • telecoms
  • software development
  • government (identification of violations of Russian law)

A CFD that is optimized for a particular market sector contains the most commonly used categories within that industry. Running InfoWatch Traffic Monitor with a pre-defined CFD optimized for a specific vertical market enables companies to quickly begin using the product and guarantees a high level of accuracy in detecting confidential information.

Turnkey CFD is a service in which InfoWatch customizes an industry-specific CFD to take account of the particular features of an individual company's business. In a content filtering database that has been adapted to meet the needs of a specific vertical market, around 80% of the categories will be relevant to all companies within the sector. The remaining 20% of categories are specific to individual companies.

Customizing an industry-specific CFD to include categories that reflect the business of an individual company provides better characterization and more accurate detection of confidential information within the company's data streams, at 85% and higher. Customization of a CFD to include specific categories can be carried out manually or with a special software product, InfoWatch Autolinguist.

InfoWatch Autolinguist and the Creation of a Proprietary CFD

InfoWatch Autolinguist is the part of the InfoWatch Traffic Monitor solution that automates the creation of a proprietary CFD or the customization of an industry CFD, and helps to keep the CFD up to date.

To create a CFD with the assistance of InfoWatch Autolinguist, it is essential to prepare a representative sample of company documents and sort them into separate files depending on their subject matter, for example, financial documents, non-disclosure agreements, etc. Once the documents have been processed by InfoWatch Autolinguist, these files will form the basis for the classification structure.

InfoWatch Autolinguist analyzes the documents that have been uploaded and automatically extracts the terms that will form the basis for assigning analyzed data into a particular category.
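How Autolinguist extracts terms is not described here. Purely to illustrate the idea of deriving category terms from sorted sample documents, the following Python sketch uses a naive stand-in technique: it keeps words that occur in the samples of one category and in no other, ranked by the number of documents that contain them. The function names, the four-letter minimum word length and the sample texts are all assumptions.

```python
from __future__ import annotations

import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphabetic words of four letters or more."""
    return re.findall(r"[a-z]{4,}", text.lower())


def extract_terms(samples: dict[str, list[str]], top_n: int = 3) -> dict[str, list[str]]:
    """Pick, per category, the words found only in that category's samples."""
    # Document frequency: in how many sample documents each word appears.
    counts = {
        cat: Counter(word for doc in docs for word in set(tokenize(doc)))
        for cat, docs in samples.items()
    }
    result = {}
    for cat, counter in counts.items():
        # Words seen in any other category are not distinctive for this one.
        others = set().union(*(c.keys() for o, c in counts.items() if o != cat))
        candidates = [(word, n) for word, n in counter.items() if word not in others]
        # Most widespread words first; ties are broken alphabetically.
        candidates.sort(key=lambda item: (-item[1], item[0]))
        result[cat] = [word for word, _ in candidates[:top_n]]
    return result


# Made-up sample documents, already sorted by subject matter.
samples = {
    "finance": [
        "Quarterly invoice for the audit",
        "Invoice and budget for the quarter",
    ],
    "legal": [
        "Non-disclosure agreement for the contractor",
        "Agreement on liability for the partner",
    ],
}

terms = extract_terms(samples)
print(terms["finance"])
print(terms["legal"])
```

A real extractor would also need to handle multi-word phrases, word forms and weightings, which this sketch leaves out.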

The final stage in the creation of a content filtering database is to add characteristic terms, which cannot be automatically extracted, but at the same time provide a clear indication that the document is confidential. These terms may include, for example, the name of a sensitive project.

Unlike solutions based solely on digital imprints (fingerprints, shingles), this technology protects not only data that changes rarely, such as the company's charter or its shareholder register, but also information that changes regularly, or newly created data, which may relate to various agreements with partners and contractors, descriptions of new technologies, plans to introduce new products to market, customer records including terms of business and personal data, and many other documents.

Solutions are based on the following products:

A software solution (DLP system) designed to monitor information flows and protect confidential information from leaks and unauthorized distribution.