The WebLab is a platform for building intelligence systems (business intelligence, strategic intelligence, etc.) and any other application that needs to process multimedia data (text, image, audio and video) from open sources. In other words, the WebLab is a software infrastructure dedicated to OSINT (Open Source INTelligence) and the processing of unstructured documents.
The aim of OSINT is to turn unstructured data from open sources into structured, actionable knowledge that enhances decision-making, as illustrated hereafter. Typically, an application built on the WebLab platform gathers information from the Web and applies advanced information processing and information extraction techniques to make sense of the large number of documents collected. The final objectives then depend on the application domain: competitive or technological intelligence, e-reputation, etc.
Although these objectives are simple to state, achieving them raises two major problems:
- Facing complex and unstructured information: Open sources such as the Web are exceptional because information is easily accessible and, in most cases, rich in content. However, making sense of the very large number of documents is a very complex task. The Web is large and dynamic, and information is presented in different languages and in various multimedia formats. All these characteristics tend to make OSINT a complex task.
- Getting lost among the numerous dedicated tools: A large number of software solutions exist to address the first problem, so the difficulty is (1) to understand and select the best tool for each function in a complete OSINT process and (2) to make the selected tools work together in a coherent and easy-to-use application. Moreover, given the dynamic and fast-paced nature of open-source information, the final system should be flexible enough to adapt its behaviour easily to new formats, sources or processing approaches.
The WebLab platform tries to address these two major problems in order to build coherent and flexible systems for OSINT.
Capabilities of the WebLab platform
In a few words, the WebLab platform provides an infrastructure for building complete processing chains that allow:
- Collection of data from open sources;
- Processing to extract information;
- Retrieval and navigation;
- Analysis and exploitation.
Thus, WebLab enables fully integrated OSINT systems with:
- Coherent architecture;
- Flexible processing workflow;
- Unified presentation.
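The chain listed above (collection, processing, retrieval, exploitation) can be pictured with a minimal sketch. This is not the WebLab API: `Document`, `collect`, `normalise`, `count_words`, `run_chain` and `search` are hypothetical names chosen for illustration, and the sources are hard-coded strings rather than pages fetched from the Web.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """A collected resource plus the annotations added by processing steps."""
    uri: str
    text: str
    annotations: Dict[str, object] = field(default_factory=dict)


# A processing step takes a document and returns it enriched (hypothetical type).
Step = Callable[[Document], Document]


def collect(sources: Dict[str, str]) -> List[Document]:
    """Collection: stand-in for a crawler; the content is already in memory."""
    return [Document(uri=uri, text=text) for uri, text in sources.items()]


def normalise(doc: Document) -> Document:
    """Processing step: collapse runs of whitespace into single spaces."""
    doc.text = " ".join(doc.text.split())
    return doc


def count_words(doc: Document) -> Document:
    """Processing step: annotate the document with its word count."""
    doc.annotations["word_count"] = len(doc.text.split())
    return doc


def run_chain(docs: List[Document], steps: List[Step]) -> List[Document]:
    """Apply every step, in order, to every document."""
    processed = []
    for doc in docs:
        for step in steps:
            doc = step(doc)
        processed.append(doc)
    return processed


def search(docs: List[Document], keyword: str) -> List[Document]:
    """Retrieval: naive case-insensitive keyword match."""
    return [d for d in docs if keyword.lower() in d.text.lower()]


# Example run with two in-memory "pages" (illustrative data only).
sources = {
    "http://example.org/a": "Open   source intelligence\nrelies on public data.",
    "http://example.org/b": "An unrelated   page.",
}
docs = run_chain(collect(sources), [normalise, count_words])
hits = search(docs, "intelligence")
```

Because the steps are passed as a list, changing the workflow amounts to changing that list, which is the kind of flexibility the platform aims for.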
A layered infrastructure
Going into the details requires a basic understanding of the WebLab layers illustrated in the figure. The WebLab platform is structured in three major layers:
- WebLab Core: An open source technical baseline (free to use in any commercial application) acting as a runtime environment for unstructured information processing services. It has been developed by the Advanced Information Processing Department of Airbus Defence and Space (formerly EADS Defence and Security, formerly Cassidian) and its partners in several projects.
- WebLab Services: A set of multimedia processing services and GUI components, either open source or commercial, developed on the WebLab model and using standardised interfaces (either service interfaces or portlet models) to perform specific functions.
- WebLab Applications: A set of business-specific applications, either open source or commercial, built on top of the WebLab using a selected set of services.
Thus the WebLab Core acts as the infrastructure that hosts and makes efficient use of advanced components (i.e. WebLab Services) in order to build a complete system (i.e. a WebLab Application). Moreover, as part of the WebLab project, some components are provided as open source and a demonstration application is available.
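The relationship between the three layers can be illustrated with a small sketch. It assumes nothing about the real WebLab interfaces: `Service`, `Core`, `Application` and the two toy services are hypothetical names, and the language guesser is a deliberately naive stand-in for a real component.

```python
from typing import Dict, List, Protocol


class Service(Protocol):
    """Standardised interface that every service exposes (hypothetical)."""

    def process(self, resource: dict) -> dict: ...


class Core:
    """Core layer: hosts services and lets applications look them up by name."""

    def __init__(self) -> None:
        self._services: Dict[str, Service] = {}

    def register(self, name: str, service: Service) -> None:
        self._services[name] = service

    def get(self, name: str) -> Service:
        return self._services[name]


class LanguageGuesser:
    """Toy service: tags a resource 'fr' if it contains common French words."""

    def process(self, resource: dict) -> dict:
        words = set(resource["text"].lower().split())
        resource["language"] = "fr" if words & {"le", "la", "et"} else "en"
        return resource


class WordCounter:
    """Toy service: annotates a resource with its number of words."""

    def process(self, resource: dict) -> dict:
        resource["words"] = len(resource["text"].split())
        return resource


class Application:
    """Application layer: an ordered selection of services hosted by the core."""

    def __init__(self, core: Core, service_names: List[str]) -> None:
        self._core = core
        self._service_names = service_names

    def run(self, resource: dict) -> dict:
        for name in self._service_names:
            resource = self._core.get(name).process(resource)
        return resource


# Example wiring (illustrative only).
core = Core()
core.register("language", LanguageGuesser())
core.register("word_count", WordCounter())
app = Application(core, ["language", "word_count"])
result = app.run({"text": "le chat et la souris"})
```

Since every service honours the same `process` interface, an application can swap one component for another, open source or commercial, without changing the core.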
Non-exhaustive list of services and functions already deployed in the WebLab:
- Data acquisition (Web, databases, folders, TV, radio...)
- Normalisation of content (text, image, videos...)
- Language identification (> 30 languages)
- Speech-to-text transcription
- Annotation and source assessment
- Named entities extraction in texts
- Object and concept detection in images and videos
- Semantic analysis
- Relation extraction
- Thematic categorisation and clustering
- Automatic summarisation
- Full-text search (keywords, annotations, Boolean, etc.)
- Semantic search
- Information mapping