ClearCode Ltd.
- Address 190 "Tsar Simeon Veliki" blvd., fl. 3, office 8, Stara Zagora, Bulgaria
- Phone +359 2 444 7557
- E-mail contacts@clearcode.bg
Data miner
Overview
Data miner is a software system that facilitates the extraction, structural interpretation and storage of data from a wide variety of information sources. The data is processed and presented to the client in a well-formed, easily consumable form. The system is designed so that its workflow can be controlled at every step. Its operating parameters can be redefined and customized in several ways: by targeting different sources to collect data from, by defining the semantic structure of the collected data (so that it is interpreted correctly), and by specifying the rules and analyses that dictate how the data is processed and output to the client. Data miner belongs to the Information Retrieval (IR) class of information systems.
Purpose
Search engines and web crawlers generally collect and process data without recognizing or interpreting its semantic structure. In contrast, Data miner treats data as structured information, which lends itself to all kinds of automatic, programmatic processing without the need for human intervention.
Applicability
Data miner is applicable in all fields and areas where there is a need for:
- perusing data aggregated from sources that are diverse in both appearance and nature;
- structuring and unifying a broad body of data that is semantically similar but presented in various incompatible forms;
- quick decision-making based on complete, comprehensive and timely information.
The kinds of data and the variety of sources that can be used are practically unlimited. Data miner can extract structured data from blogs, forums, social networks, news feeds, classifieds and yellow-pages directories, media galleries (video, audio, images), etc. It does so through the sites' web-based interfaces, which are designed for human use, as well as through RSS feeds and SOAP and REST APIs.
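As a rough illustration of what treating a source as structured information can mean, the Python sketch below turns a small RSS 2.0 feed into a list of uniform records. It is not taken from Data miner: the sample feed, the `Record` type and the `parse_rss` function are hypothetical names chosen for this example, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical sample feed, inlined so the sketch runs without network access.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


@dataclass(frozen=True)
class Record:
    """One feed entry with a fixed, known set of fields (assumed schema)."""
    title: str
    link: str
    published: Optional[str]  # None when the feed omits <pubDate>


def parse_rss(xml_text: str) -> List[Record]:
    """Map every <item> element of an RSS 2.0 document to a Record."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        records.append(
            Record(
                title=item.findtext("title", default="").strip(),
                link=item.findtext("link", default="").strip(),
                published=item.findtext("pubDate"),
            )
        )
    return records


records = parse_rss(SAMPLE_RSS)
```

Once every source is reduced to the same record shape, later steps such as filtering, comparison and export no longer need to know where a record came from.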
Workflow
- Definition of the information sources and the execution type of the processes – parallel or sequential
- Flexible definition of the hierarchical structure of the information – using concrete data definition rules, which allow a given atomic data segment to be collected from multiple data sources
- Management of identities used to aggregate the source data
- Scheduling of data extraction processes, either one-time or repeated at given time intervals
- Identification of new or updated data in order to facilitate deduplication of the collected data
- Run-time adaptation – pre-processing and post-processing of the extracted data before and after storage
- Delivery of the extracted and processed information in a user-defined format (XML, CSV, Excel, etc.), or direct import of the results into the client’s database
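Two of the steps above, identifying new or updated data and delivering the result as CSV, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Data miner's actual implementation: the content-hash approach, the function names `fingerprint`, `classify` and `to_csv`, and the `id` key field are all hypothetical.

```python
import csv
import hashlib
import io
import json


def fingerprint(record):
    """Return a stable hash of a record's content, used to detect changes."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def classify(records, seen, key="id"):
    """Split records into new and updated ones, skipping unchanged duplicates.

    `seen` maps a record key to the fingerprint stored on an earlier run
    and is updated in place (an assumed, in-memory stand-in for storage).
    """
    new, updated = [], []
    for record in records:
        digest = fingerprint(record)
        previous = seen.get(record[key])
        if previous is None:
            new.append(record)
        elif previous != digest:
            updated.append(record)
        seen[record[key]] = digest
    return new, updated


def to_csv(records, fieldnames):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# First run: the same listing is collected twice, so one copy is dropped.
seen = {}
first_run = [
    {"id": "1", "title": "Flat for rent", "price": "500"},
    {"id": "1", "title": "Flat for rent", "price": "500"},
]
new, updated = classify(first_run, seen)

# Second run: the listing's price changed, so it is reported as updated.
second_run = [{"id": "1", "title": "Flat for rent", "price": "450"}]
new_2, updated_2 = classify(second_run, seen)

report = to_csv(new + updated_2, ["id", "title", "price"])
```

Hashing the whole record keeps the comparison independent of which field changed; a real deployment would persist `seen` between runs rather than keep it in memory.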