The Semantic Information Technology (SIT)

Georgi Yordanov Dimitrov, NC-BIS, UE-Varna, Bulgaria (go_di_2000@yahoo.com)


How does SIT meet Europe's objectives?

SIT is a necessity if the EU is to realise its objective of a knowledge society, to ensure European leadership in high technologies, and to build a knowledge-based economy. SIT provides the semantic, intelligent information environment (ambient intelligence) of Europe's near future.

Without SIT, Europe can count only on the technologies of Artificial Intelligence (AI) and the Semantic Web (SW). First, these are not enough to realise the objectives mentioned above: separate themes and objectives cannot reach the specific priorities of IST unless they are parts or manifestations of one all-inclusive global semantic system. Second, the leading nation in AI and SW technologies is the USA.

SIT is the best base for the objective 'Multimodal interfaces', because it is the best base for anything called 'multimodal': it is the best base for organizing the diversity of information, from the beginning it integrates (potentially) all human languages, it organizes information in contexts, and from the beginning it covers all information inputs - speech, vision ... The most important thing in communication between human and machine is understanding, and understanding rests on knowledge. From the beginning SIT builds on a model of the human - this is the way to the best knowing and understanding. ••• SIT is the best base for the objective 'Semantic-based knowledge systems', because it is much more semantic than the technologies of the Semantic Web, because the Global Semantic Information System relates to the Semantic Web as a whole to a part, because it ensures much more automation and self-organization than the formalistic technologies of the SW and AI, because it is the best approach to mastering complexity, because it is the most complete and integral approach to information, because it is the perfect model for knowledge representation, and the best base for computer reasoning. ••• Being the best instrument for organizing information and information processes, SIT is the best base for the objective 'Networked businesses and governments'. Networked business and e-government platforms must be based on the principles of SIT and must be part of the Global Semantic Information System, because it is the best environment for e-collaboration. ••• SIT is the best base for the objective 'Open development Platforms for software and services', because from the beginning it provides for organizing both the source code and the executable code of every computer program in the form of interacting semantic units.
••• If just one of SIT's main principles for building intelligent systems is omitted, then the objective 'Cognitive Systems' will not be realised in full. SIT ensures the highest degree of understanding and the best structure for organizing an intelligent memory. ••• If the objectives 'Technology-enhanced learning and access to cultural heritage', 'eHealth', 'Embedded systems', 'GRID-based Systems for solving complex problems', 'Improving Risk management' and the others are not based on SIT's principles, then they must be based on outdated technologies such as procedurally programmed software products, abstract SW ontologies or formalistic AI knowledge bases, which is not the best decision. ••• SIT is the perfect technology for designing and building complex systems (FET's initiative 'Complex systems research'), because it does exactly this: it 'creates scale-free computational structures composed of self-assembling building blocks that are capable to develop'. ••• SIT is the perfect technology for FET's initiative 'The Disappearing Computer', because it is 'based on radically new architecture comprising an unbounded set of "building blocks" '. SIT develops 'open architectures' that will become 'universally applicable' - I mean physically instantiated semantic units, which 'allow their arbitrary combination to produce an unbounded range of configurations giving rise to functionalities that can be neither pre-programmed nor foreseeable'. ••• SIT is the perfect technology for FET's initiative 'Global computing', because it designs 'systems composed of extremely large numbers of autonomous, mobile and interacting computational entities' (semantic units). ••• SIT is the perfect technology for FET's initiative 'Life-like perception and cognition systems', because it builds models that are as similar to the human and to the world as possible, i.e. 'human-like' and 'world-like' entities, that is, 'life-like'.


The main ideas and projects for bringing more intelligence into computer informatics are the Semantic Web, CYC and the Semantic Information Technology (SIT):


The Semantic Web

The Semantic Web (SW) is a proposal by Tim Berners-Lee, the creator of the World Wide Web. It aims to improve the Web further by making it possible to automate a larger share of the information processes on the WWW than is possible today with existing HTML pages.

The main applications of SW are: • in resource discovery to provide better search engine capabilities • in cataloguing for describing the content and content relationships available at a particular Web site, page, or digital library • by intelligent software agents to facilitate knowledge sharing and exchange • in content rating • in describing collections of pages that represent a single logical "document" • for describing intellectual property rights of Web pages • for expressing the privacy preferences of a user as well as the privacy policies of a Web site.

The Semantic Web builds on the following languages with formal syntax: XML; RDF, built on XML; DAML+OIL and OWL, built on RDF.

These languages are used to write out relations between different resources on the Web in a very simple manner, as triples: first a resource or object, then an attribute, characteristic or relation of that resource, and then another resource. A resource can be a page, a part of a page, a whole site, or an object outside the Internet, such as a book or a person. Every resource is identified by its URI (its Internet address). Resources are organized in class hierarchies.
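The triple structure just described can be sketched in a few lines of Python. This is only an illustration using plain tuples; the example.org URIs and the predicate names are hypothetical, and no RDF library or real vocabulary is used.

```python
# A triple is (subject, predicate, object); every resource is named by a URI.
# All URIs below are invented examples, not terms of a real vocabulary.
BASE = "http://example.org/"

triples = [
    (BASE + "book/1", BASE + "terms/author", BASE + "person/42"),
    (BASE + "book/1", BASE + "terms/type", BASE + "class/Book"),
    (BASE + "class/Book", BASE + "terms/subClassOf", BASE + "class/Document"),
]


def objects_of(subject, predicate):
    """Return every object related to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


print(objects_of(BASE + "book/1", BASE + "terms/author"))
```

The third triple shows how a class hierarchy is itself written as a triple, with nothing but identifiers on both sides of the relation.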

The Semantic Web is a good idea, but it has one big defect - it is not semantic. To understand why, one must realize that the formal identifiers in XML, RDF, DAML+OIL and OWL do not differ from the names of variables in programming languages, from the names of relational table columns, or from the names of computer files or Internet pages. If C++ is semantic because its variables can have identifiers, then the Semantic Web is semantic too. If C++ is semantic because its data types are organized in class hierarchies, then the Semantic Web is semantic too. If the Semantic Web is semantic because a named relation relates one resource to another, then every table in which data in the first column are related, by the name of a relation in the second column, to data in the third column is semantic too. Of course, there is no problem in calling the Semantic Web semantic metaphorically. To be truly semantic, a system must do much more. The semantics of a word is the thing denoted by it. The sign and the denoted thing must be parts of an intelligent system: at the least, the sign must itself be a first-class resource that can participate in relations with other resources; at the most, the system must be able to understand anything at all, and to cope with the sense of information the same way a human does. To achieve this, a system 1) must be close enough to human intelligence, that is, its functioning as a whole must ensure understanding, and 2) must be fully integrated with at least one human language. Only a system similar to the human mind and human language is semantic. The Semantic Web, meanwhile, has nothing to do with any human language. Nor does it have any ability to understand; instead it counts on additional software to process and to 'understand' the texts written in XML, RDF, DAML+OIL and OWL.

Because the requirements for building the Semantic Web are entirely syntactic and formal, and in no degree semantic (which is very strange for a semantic Web), there is no doubt what will happen when the Semantic Web starts to become a reality - we will see the same chaos, disorder and formlessness that reign in the ordinary World Wide Web.

Because the requirements for naming in the Semantic Web are entirely syntactic and formal, there is no doubt that the mass of ugly identifiers and names, uncomfortable to read and impossible to pronounce, will grow by many millions.

The Semantic Web in no way changes the document-based, paper-like page nature of the WWW; it just adds a so-called semantic layer. In short, the Semantic Web is a technology for creating machine-readable catalogues (metadata) and adding them to pages and sites, which is useful, but too small a step.


CYC

CYC is a long-term, large-scale proprietary project of Cycorp. It has over 1,000,000 rules and facts. The main objective of CYC is to build a computer system that has common sense. CYC uses predicates for knowledge representation. One could reckon CYC the most ambitious initiative in the scientific field of Artificial Intelligence.

Although not much information about CYC is available, what is accessible is enough to see that the theory behind CYC is not as strong as the theory behind SIT.

The main purpose of CYC is to build common sense. Whatever common sense is, taken as an amount of information it is just a small subset of all the information that is or could be represented in computers. For me it is a very big defect that, after 17 years of development, CYC has only about 1,000,000 rules and facts. How can billions of people, tens of millions of enterprises, billions of goods, trillions of documents and books, millions of geographical objects, plants and animals, and trillions of events, microevents and contexts be represented by just 1,000,000 rules and facts? There is no 'common sense' that could exist separately from this vast number of objects and events. It could exist only in the areas of abstract thinking.

CYC has no bearing on the existing computer information technologies and evidently leaves them all as they are - obsolete, separated into many applications, attending to the letter and not to the spirit of information.

The way CYC processes natural-language texts - through a separate software module - shows a not very deep understanding of the principles of intelligence. True intelligence, even in the form of common sense, must be able to process natural-language texts by itself, not count on an intelligence in outside software, which cannot be true intelligence if it resides outside a united, self-forming and self-developing whole. Whereas SIT takes all human languages into consideration, CYC seems to recognize only English. It is also evident that CYC does not treat words and word groups as first-class objects, whereas in SIT they are semantic units like any others.

Similarly, CYC has a separate software module for reasoning. If a 'common sense' counts on a separate program to reason in its place, what kind of common sense is it? In any case, the way CYC represents its knowledge - the way of formal logic, with predicates - is unfit for representing algorithms and procedures.

The creators of CYC evidently have not fully taken in the fact that the Internet has existed for a long time, and that almost everybody in the developed countries uses it. Their way of thinking is the old centralized approach from the era of big mainframes. Whereas SIT plans for the distribution even of a single semantic unit, CYC can imagine only the distribution of whole copies of the knowledge base.

CYC treats wholeness inadequately, because it does not build models of things (it has no analogue of SIT's semantic units); it just creates long lists of predicates.

In CYC generalizations and abstractions are used much less than in SIT. They differ little from a tool for classifying terms (as in formal logic), whereas in SIT generalizations and abstractions are very important for the overall functioning of the system.

In general, the manner of representing knowledge in CYC - predicates - is far more limited, far less universal and far less flexible than SIT's method of representing knowledge (interrelated and interacting semantic units, organized in contexts). Wholes are so important for intelligence that it would not be much of an exaggeration to speak of information personalities, not only of semantic units or models.

In principle, AI concentrates on reasoning. So does CYC. But reasoning is not the main aspect of intelligence.


Why have AI technologies not achieved serious success, and why do today's computer information systems understand nothing of the sense of the information they 'process'?

The thinking of Modern times, which lasted more than 300 years - from the beginning of the 17th century until the second half of the 20th century - can be characterized as: rational thinking; thinking in gross; theoretic, analysing, common, formal, logical, mathematical, abstract, hierarchical thinking; thinking outside concrete time and place; systematic, unambiguous, strict and precise thinking; absolutely reliable and determined thinking; thinking based on a specific scientific method; value-neutral thinking; thinking that divides knowledge into subjects; reductive thinking. In short, thinking that favours one particular idea and neglects all others; context-free thinking. Galileo, Descartes and Newton started this sort of thinking. People who think this way created the computer information technologies of the second half of the 20th century, and especially 99% of the technologies of Artificial Intelligence. This kind of thinking is the main obstacle to the further development of computer information technologies.

In the second half of the 20th century a new kind of thinking began to form. The world, at least the developed one, already lives in a postmodern epoch. The thinking of the new epoch is much more complex: it takes into account the specific factors, and all factors; it is context-aware; it is integrating, ecological and concrete; it accepts indeterminacy and inexactness; it is much more pragmatic and practical; it pays much more attention to the small and the partial; it is discursive, not formal-logical; it does not detach values.

The new thinking underlies the Semantic Information Technology, and the Semantic Information Technology is the information technology of the postmodern epoch of the 21st century.


The Semantic Information Technology

The purpose of the Semantic Information Technology (SIT) is: to provide the transition from formalistic, syntax-based computer informatics to informatics that works with the sense of information, i.e. semantic informatics. Because if it does not process sense, IT processes not information but just data. ••• to change text (including web pages) from a passive and detached sequence of characters into a system of active concepts and objects (semantic units) included in a global semantic system, and to wash away the spirit of paper that still reigns in IT. ••• to transform the content of database tables from strings and numbers into identifiers of semantic units. Because the symbol of today's IT is still an impersonal column of numbers. ••• to transform the source code and every executable copy of software products from a system of formal, isolated instructions and groups of instructions into a system of semantic units included in the Global Semantic Information System (GSIS). Because programming languages (C, Java, etc.) are languages only conditionally. Compare them with true languages like Bulgarian or English, and you see that C and Java are extremely primitive sets of instructions, data types and functions, with context-free grammars. ••• to add a semantic layer to every picture, sound recording and video recording and to include it in the GSIS. ••• to create a semantic operating system, i.e. a system that includes semantic units such as computer, memory, hard disk, file, etc.

In summary, the purpose of SIT is to create a model of reality (which is definitely human reality) that is as close to the truth as possible, a model that includes the physical (body, objects, energies), psychological (sensations, emotions, needs, values) and mental (images, concepts, language) aspects of reality. Whereas CYC builds a model of one generic and abstract human, SIT builds models of thousands of generic and billions of individual people.

SIT is not based on an abstract and formal mathematical theory. It is based on the principles of intelligence. It is a radical break with sequential execution, loops, branches, variables, data types, functions, and partially with files (in a semantic operating system the break is complete).

SIT stores and processes information in a way many times closer to reality, truer to life, than the old technologies do. SIT is closer to life because the human world is built not of data, but of a huge number and diversity of structured units, which connect and interact in thousands of ways.

The main technical aspect of the new technology is that things are represented not as conventional files or records, but as simple and universal elements that bear sense, or intellectual units (semantic units), which on the one hand are separate, but on the other hand can join together and make new elements of sense. So these elements, like atoms, form whole information universes. The analogues of these elements are the words of a human language (more precisely, the structures in our brains that the words denote). In language, joining verbs, nouns, adjectives and other bearers of sense creates new bearers of sense - sentences and thoughts. In a semantic system every semantic element acquires its sense from its relations and interactions with other semantic units.

More technically speaking, a semantic unit is an index (a list) of all contexts and situations in which it participates; of all actions which it performs; of all its parts, attributes and substances; of all opinions, statements and attitudes about it; of all its states, forms and roles; of all words, names and expressions denoting it. An individual semantic unit can have 2 attributes (relations) or 20,000, without complicating the structure and without seriously slowing down search and processing.
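One way to picture such an index is sketched below in Python. It is a minimal illustration, not the actual SIT data structure: the class name, the relation kinds and the identifiers such as "unit:table" are all assumptions made for the example.

```python
from collections import defaultdict


class SemanticUnit:
    """Hypothetical sketch: a unit is an index of its relations, grouped by kind."""

    def __init__(self, identifier):
        self.identifier = identifier
        # kind of relation (e.g. "part", "context", "name") -> identifiers of other units
        self.relations = defaultdict(list)

    def relate(self, kind, other_identifier):
        """Record one more relation; the structure is the same for 2 or 20,000."""
        self.relations[kind].append(other_identifier)

    def count(self):
        """Total number of relations held by this unit."""
        return sum(len(targets) for targets in self.relations.values())


table = SemanticUnit("unit:table")
table.relate("part", "unit:leg")
table.relate("context", "unit:kitchen")
table.relate("name", "word:en:table")
table.relate("name", "word:bg:masa")
print(table.count())  # 4
```

Because relations are grouped by kind in a dictionary, adding more of them changes only the length of the lists, not the shape of the unit.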

SIT needs only a small amount of traditionally coded software to run, because all of the system's functionality is coded as interacting semantic units (which, for speed, could be translated into machine code in the background). Also, future information systems will be developed not by writing new software versions, but by enriching existing systems with new semantic elements and relations.

All information entities are subjected to semantic transformation - even "simple" letters and numbers cease to be computer memory cells and turn into semantic units, structured like all other semantic units and having many relations to other semantic units. In SIT the meaning is entirely contained in the system of semantic objects and their relations. The meaning is not dispersed across program code, files and elsewhere (as it is in conventional information technologies).

Today every programmer or database architect invents his or her own 'words' (identifiers) and 'concepts' (data structures) for every new project. This situation does not facilitate the exchange of experience or the accumulation of knowledge. Knowledge can be accumulated only if there is a stable, long-lasting and widely accepted concept (a semantic unit) around which to accumulate it. SIT overcomes the present technological lack of communicativeness, because by definition it includes potentially all words and word groups from potentially all human languages, together with the semantic units they denote.

The great importance of semantic units in SIT reflects the great importance of wholes in the human world.

Instead of rules and predicates, SIT uses abstract and generic semantic units together with individual semantic units in one system.

What is very important in the semantic technology is that semantic elements are the result of long-term accumulation. Hence, instead of creating new and separate semantic elements, we make relations to already existing, common, universal semantic elements. An example: every company has employees. If for every newly hired employee you enter his information from paper documents or from incompatible file formats, you spend considerable extra time and work. Furthermore, the amount of personnel information to enter and process increases continuously. If you get the global semantic unit for that employee from the Internet and use it, you avoid entering information that already exists, and, more importantly, you have access to a potentially unlimited source of information about that person. Last but not least, if all people participate in this semantic agreement, then society has an information system much more perfect than today's. Hence, informatics grows around semantic objects as one system. Today, by contrast, informatics grows around software. Present software and information systems are in principle local and short-term phenomena. Hence, the semantic information system is global. Hence, it is one, common and universal. Hence, this is a revolution, but not because I'm a megalomaniac or because I have a whim. It is so simply because progress is unavoidable. The semantic information system is the absolute apogee of all information systems. The semantic information system corresponds to the advancing global economy. Furthermore, the very important precondition, the Internet, is here.

This is one of the main aspects of SIT - the reuse of large amounts of information: already prepared elements that have global, eternal, unique identifiers - people, goods, companies, populated places, countries, real estate, laws, streets, languages, government services, professions, product designs, wars, purchases, weddings, travels, and many, many others.

At heart, SIT supports identity. In the whole semantic system there is only one semantic element representing a specific number (for example the number 7). Wherever this number is used, we have a reference to the single semantic element for the number, instead of a new and separate semantic element. If instead of one we had two, it could happen that the first has the name 'sedem' (in Bulgarian), while the second has the name 'seven' (in English). Then, instead of knowing the name of the number '7' in two languages, the system would sometimes know it as 'sedem' and sometimes as 'seven'. To borrow a term loosely from psychology, this state is a kind of split personality.
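The identity principle can be illustrated with a small registry, sketched here in Python. The dictionary layout, the function name `get_unit` and the identifier "number:7" are assumptions for the example, not part of SIT itself.

```python
# identifier -> the one and only unit for it (here a plain dict of names by language)
units = {}


def get_unit(identifier):
    """Return the single unit for `identifier`, creating it only on first use."""
    return units.setdefault(identifier, {"names": {}})


seven = get_unit("number:7")
seven["names"]["bg"] = "sedem"

# A second request elsewhere in the system yields the same object, not a copy.
same = get_unit("number:7")
same["names"]["en"] = "seven"

print(same is seven)                  # True
print(sorted(seven["names"].items()))  # [('bg', 'sedem'), ('en', 'seven')]
```

Since both names end up on one unit, the system knows the number in both languages at once, which is the situation the paragraph above asks for.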

Contexts have great importance in SIT. 'Semantic' means context-aware. A context is any whole, scene or situation based on physical, temporal or mental relation and closeness, or constructed by logical relations. A context is any semantic unit, or any set of semantic units, that is not an indivisible atom, that has parts, members, contents or relations, and to which the preposition "in" can be applied. SIT allows as many definitions of each thing and phenomenon as there are contexts. There can be one definition of a company in the context of a person, a second in the context of a state, a third in the context of a culture, a fourth in the context of a speech, etc. (By 'definition' I mean one or another set of relations of the semantic unit.)
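A definition that depends on context can be sketched as a lookup keyed by both the unit and the context. The Python below is only an illustration; the relation names and identifiers are invented for the example.

```python
# (unit, context) -> set of (relation, other unit); every entry is an invented example.
definitions = {
    ("unit:company", "context:person"): {("employs", "unit:person")},
    ("unit:company", "context:state"): {("pays_tax_to", "unit:state")},
}


def define(unit, context):
    """Return the set of relations that defines `unit` inside `context`."""
    return definitions.get((unit, context), set())


print(define("unit:company", "context:person"))
print(define("unit:company", "context:culture"))  # empty: nothing recorded for this context yet
```

A new context adds a new entry without touching the existing ones, so the number of definitions can grow with the number of contexts.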

SIT includes all human languages. Every word is a full-valued semantic unit and can participate in millions of connections to other semantic units. SIT has all the flexibility, richness and elegance of human languages. The contrast with the sketchiness and narrowness of formal-procedural computer languages is remarkable. By combining semantic units, you can express many more real-world objects with a few units, and can say something new using only known units. By combining semantic units, there are many ways to express one and the same thought. Like human languages, SIT uses generalizations and organizes concepts hierarchically in one system. Like human languages, SIT directly denotes things that exist in real life - cities, people, subjective states and experiences. There is nothing like this in computer pseudo-languages. Like human languages, SIT can work with different parts and aspects of things (i.e. abstraction, which allows experience to be transferred from a known element to an unknown one, provided that the new element has enough characteristics in common with the old; this removes the need to describe every new element in detail). Like the Bulgarian language, SIT uses gradations, for example: 'never' - 'rarely' - 'normal' - 'often' - 'always'. Like the Bulgarian language, SIT uses relativity: (a) toward the norm (with sign '+' or '-'), for example 'many', 'little'; (b) toward a point, for example 'after two hours'; (c) toward a quantity of the same kind, for example 'more than', 'quicker than'; (d) toward a part, for example 'front part', 'back part'; (e) toward space and time, for example 'before', 'above', 'after'. Like the English language, SIT uses metaphors. SIT uses multitudes, for example 'flock', 'Varna citizens'. SIT uses different degrees of definiteness, approximation and inexactness, for example 'here', 'there', 'about 500'. SIT references something or someone by a particular characteristic or aspect, for example 'you, fat man', the same way the Romanian language does. SIT uses relations of analogy and similarity, for example 'he is like a dog'. SIT concentrates a huge amount of meaning in one semantic unit - as in a human language, if you call someone 'good' you have said the most important thing, and details are unnecessary; the semantic unit 'freedom' contains more value than the semantic unit 'money', no matter how mercantile our days are.

An integral part of SIT is its ability to translate from one language to another. It is not a separate module linked by a few functions as it is in the conventional software. This ability results from the organization of semantic units in SIT. All semantic units are denoted by natural language words or combination of words from many languages, so the translation is a flow of activity from incoming words to semantic units and then to words (and grammar rules) in the other language.
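The flow from incoming word to semantic unit and on to a word in the other language can be sketched as two table lookups. The Python below is a deliberately reduced illustration: it handles single words only and ignores grammar rules, and the identifiers and the tiny vocabulary are assumptions for the example.

```python
# unit identifier -> {language code: word}; a tiny invented vocabulary
names = {
    "number:7": {"bg": "sedem", "en": "seven"},
    "unit:table": {"bg": "masa", "en": "table"},
}

# Reverse index built from the same data: (language, word) -> unit identifier
lookup = {
    (language, word): identifier
    for identifier, by_language in names.items()
    for language, word in by_language.items()
}


def translate(word, source, target):
    """Go from a word to its unit, then from the unit to the target-language word."""
    identifier = lookup.get((source, word))
    if identifier is None:
        return None  # the word denotes no known unit
    return names[identifier].get(target)


print(translate("sedem", "bg", "en"))  # seven
print(translate("table", "en", "bg"))  # masa
```

No translation table between the two languages exists anywhere in the sketch; each language is connected only to the units, which is the point made in the paragraph above.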

An important feature of SIT is partial recognition, i.e. the recognition of incomplete and erroneous information. This is especially important in text and image recognition, where errors are unavoidable. To recognize a word with a wrong letter, a missing letter or swapped letters, the search is performed not by words, but by letters. A word is recognized if it has the largest share of recognized letters among all words. The recognition of all other elements - words, word groups, sentences, photographs, audio and video recordings - is organised similarly.
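Recognition by letters can be sketched as follows in Python. The text does not say how the share of recognized letters is computed, so the scoring used here (letters shared as a multiset, divided by the longer of the two lengths) is an assumption, as are the function names and the small vocabulary.

```python
from collections import Counter


def letter_score(candidate, observed):
    """Share of letters the two words have in common, ignoring letter order."""
    shared = sum((Counter(candidate) & Counter(observed)).values())
    return shared / max(len(candidate), len(observed))


def recognize(observed, vocabulary):
    """Return the vocabulary word with the largest share of recognized letters."""
    return max(vocabulary, key=lambda word: letter_score(word, observed))


vocabulary = ["semantic", "syntax", "context", "language"]

print(recognize("semnatic", vocabulary))  # swapped letters -> semantic
print(recognize("contxt", vocabulary))    # missing letter  -> context
print(recognize("languagf", vocabulary))  # wrong letter    -> language
```

Because letter order is ignored, swapped letters cost nothing, while a missing or wrong letter lowers the score only slightly. A real system would also need a threshold below which nothing is recognized; this sketch always returns the best candidate.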

A system based on formal logic, like CYC, cannot give a good description of the statement "the table is flat", whereas SIT can, because it describes things through the states of a human who is watching, touching, moving around, thinking and speaking about the flatness of the table. That is, flatness is a state of the visual, tactile, motor, emotional, associative and verbal parts of the human brain, appropriately modelled as a system of semantic units.

SIT is based on the natural, historical state of affairs, not on abstract and formal ideas and theories extracted from context, experience and development. Recognizing what is normal and what is not is very important in SIT. SIT brings things back to their context and returns the individuality that conventional IT took away from them.

While in traditional information technologies numbers are just numbers - 2, 56, 3922, etc. - in the semantic technology there are not bare numbers, but mostly weights, lengths, counts, sums and other quantitative semantic elements. The only semantic units related to data of the primitive type 'string' are those that have a literal representation in real life too - words, names, codes, numbers, digits, signs, etc.
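The difference between a bare number and a quantity can be sketched with a small Python class. The class and field names are assumptions for the example; the sketch only shows that a value which knows what it measures can refuse meaningless arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """Hypothetical quantitative unit: a value plus what it measures."""
    value: float
    kind: str  # e.g. "weight", "length", "count"
    unit: str  # e.g. "kg", "m", "pieces"

    def __add__(self, other):
        # Adding a weight to a length (or kg to m) has no meaning, so refuse it.
        if (self.kind, self.unit) != (other.kind, other.unit):
            raise TypeError(
                f"cannot add {self.kind} in {self.unit} to {other.kind} in {other.unit}"
            )
        return Quantity(self.value + other.value, self.kind, self.unit)


total = Quantity(2, "weight", "kg") + Quantity(56, "weight", "kg")
print(total)  # Quantity(value=58, kind='weight', unit='kg')
```

With bare numbers, 2 + 56 is always 58; here the sum exists only when both operands are quantities of the same kind and unit.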

An important aspect of the new technology is its dependency on size and diversity. To work well and show its advantages, a semantic system needs to reach a certain critical mass. The bigger and more diverse it is, the better it works (other factors aside).

Why does business need SIT? - Today merging two enterprises is a usual task, while merging their information systems is an unusually difficult one, even if their database management systems come from the same manufacturer (say Microsoft's SQL Server). Tables with identical functionality have different names, columns with identical functionality have different names, the types of columns with identical functionality are different or at least the field lengths differ, or the number of columns in a table is different. If by chance the names, number, lengths etc. coincide, then the data will certainly not match - in one table my name is 'G. Dimitrov', in another 'Georgi Dimitrov', in a third 'G Dimitrov'. In all three cases the names are totally different according to today's multi-billion software products. These can hardly be called high technologies.
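The name problem is easy to reproduce. The Python sketch below compares the three spellings from the text as plain strings, and then shows the alternative of storing a reference to one shared unit; the identifier "person:0001" and the row layout are invented for the illustration.

```python
# The same person as stored by three hypothetical tables.
names_as_strings = ["G. Dimitrov", "Georgi Dimitrov", "G Dimitrov"]
print(len(set(names_as_strings)))  # 3: string comparison sees three different people

# Hypothetical alternative: each table stores the identifier of one shared unit,
# and the spellings become names attached to that unit.
PERSON = "person:0001"  # invented identifier, for illustration only
names_of_unit = {PERSON: set(names_as_strings)}
rows = [
    {"table": "A", "person": PERSON},
    {"table": "B", "person": PERSON},
    {"table": "C", "person": PERSON},
]
print(len({row["person"] for row in rows}))  # 1: all rows refer to the same person
```

Merging the three tables then reduces to comparing identifiers, and the differing spellings are kept as information about the person rather than as an obstacle.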

The main purpose of science is to create a proven, widely accepted and common system of knowledge. SIT offers scientists the perfect language in which to conduct and write their work - a language much better than mathematics, because everything written immediately becomes part of a working model, the Global Semantic Information System.

Main sources of information: • more than 400,000 categories from the Open Directory Project; • more than 109,000 synsets from Princeton's WordNet 1.7; • the Upper Cyc Ontology from Cycorp; • more than 15,000 categories from Universal Standards Products and Services Categories (UNSPSC); • more than 350 MB of texts from Encyclopaedia Britannica; • the World Factbook from the CIA; • many tens of thousands of biographies from different sources; • catalogues of commodities; • big Bulgarian-English, English-Bulgarian and other bilingual dictionaries; • big dictionaries of English, Bulgarian and other languages; • information about tens of thousands of Bulgarian and other countries' enterprises; • many ontologies from DAML; • Standard Industrial Classification (SIC) Codes; • the North American Industry Classification System; • classifications of diseases, medicines, sciences, plants and animals, chemical compounds and others; • thousands of human names, human languages, units of measurement, cities, streets, motion pictures, universities and colleges, professions, software products, programming languages and many others; • texts of hundreds of books; • texts of Bulgarian legislation; • a vast number of other dictionaries, encyclopaedias, reference books and web pages.

Resources needed: people who know different human languages well - in the first stage the official European languages plus Bulgarian; specialists experienced in natural language processing, knowledge engineering, data representation and processing, and computer programming; high-speed Internet access; a high-traffic web server; computer hardware and software; information from sources of great diversity.

Goals, in concrete: To ensure SIT's market domination by building and disseminating a huge base of semantic information (the first build of the Global Semantic Information System) comprising at least 50 million semantic units from every area of knowledge; belonging to one or more of at least 350,000 generic notions; having at least 500 million relations; every semantic unit having relations to words, phrases and names in all the official European languages; every global semantic unit having a global eternal semantic identifier. In particular, words and idioms of the official EU languages - not fewer than 1,200,000 basic words and idioms. At least 5,000 different types of people, such as: Bulgarian, Englishman, teacher, citizen, Varna citizen, sprinter, etc. The semantic information will be disseminated by issuing "encyclopaedias" on compact disks and by on-line publishing (incl. the catalogue of semantic units - the "Semantic Yahoo"). To achieve natural language understanding, translation and interaction. To build an operating system written as interrelated and interacting semantic units (using the source code of Linux). To build the semantic server and client as semantic units.

Using evaluation criteria: The project has a large scientific and technological impact. This research opens new prospects for IST. It will have a large economic impact and contribute to solving societal problems. The potential long-term benefits are sufficiently large to justify the level of risk of the project. The objectives are challenging and clearly defined. They represent clear progress well beyond the current state of the art. The research is highly innovative. The proposed S&T approach is well thought out. It enables the project to achieve its objectives.
